Code to crawl Wikipedia using BFS and DFS
This project consists of two Python files. SimpleCrawler.py implements a crawler using breadth-first search (BFS) and depth-first search (DFS); FocusedCrawler.py implements a focused crawler. SimpleCrawler.py writes four text files: two listing the unique URLs fetched with BFS and DFS, and two listing the duplicate URLs encountered with BFS and DFS. FocusedCrawler.py writes one text file of unique URLs and one of duplicate URLs.
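The core idea shared by both traversals can be sketched as follows. This is a minimal illustration, not the project's actual code: it walks a toy in-memory link graph (hypothetical page names) instead of fetching live Wikipedia pages, which the real crawler would do with urllib and Beautiful Soup. A deque holds the frontier, a set tracks already-seen URLs for the unique list, and re-encountered links go to the duplicates list.

```python
from collections import deque

def crawl(start, get_links, limit=20, dfs=False):
    """Traverse a link graph from `start`, returning (unique, duplicates).

    `get_links(url)` returns the outgoing links of a page; in a real
    crawler it would download the page and extract anchor hrefs.
    Popping the frontier from the left gives BFS; from the right, DFS.
    """
    frontier = deque([start])
    seen = {start}
    unique, duplicates = [], []
    while frontier and len(unique) < limit:
        url = frontier.pop() if dfs else frontier.popleft()
        unique.append(url)
        for link in get_links(url):
            if link in seen:
                duplicates.append(link)  # already fetched or queued
            else:
                seen.add(link)
                frontier.append(link)
    return unique, duplicates

# Toy link graph standing in for live pages (hypothetical names).
graph = {"A": ["B", "C"], "B": ["A", "D"], "C": ["D"], "D": []}
bfs_unique, bfs_dups = crawl("A", lambda u: graph.get(u, []))
dfs_unique, dfs_dups = crawl("A", lambda u: graph.get(u, []), dfs=True)
```

Writing `unique` and `duplicates` to text files, one URL per line, yields the four output files described above.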
Dependencies: urllib, NLTK, Beautiful Soup (bs4), httplib2, time, collections
This project requires Python 2.7. A few packages must be installed before the code can run; use the package manager pip to install them as follows:
pip install beautifulsoup4
pip install nltk
pip install httplib2

The urllib, time, and collections modules are part of Python's standard library and do not need to be installed with pip.

Run the crawlers with:

python SimpleCrawler.py
python FocusedCrawler.py