Code to crawl Wikipedia using BFS and DFS
This project consists of two Python files. SimpleCrawler.py implements a crawler using breadth-first search (BFS) and depth-first search (DFS); FocusedCrawler.py implements a focused crawler. SimpleCrawler.py writes four text files: two listing the unique URLs fetched with BFS and DFS, and two listing the duplicate URLs encountered with BFS and DFS. FocusedCrawler.py writes one text file of unique URLs and one of duplicate URLs.
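The core idea shared by both traversals can be sketched as follows. This is a minimal illustration, not the project's actual code: it walks a toy in-memory link graph (hypothetical page names) instead of fetching live Wikipedia pages, which the real crawler would do with urllib and Beautiful Soup. A deque holds the frontier, a set tracks already-seen URLs for the unique list, and re-encountered links go to the duplicates list.

```python
from collections import deque

def crawl(start, get_links, limit=20, dfs=False):
    """Traverse a link graph from `start`, returning (unique, duplicates).

    `get_links(url)` returns the outgoing links of a page; in a real
    crawler it would download the page and extract anchor hrefs.
    Popping the frontier from the left gives BFS; from the right, DFS.
    """
    frontier = deque([start])
    seen = {start}
    unique, duplicates = [], []
    while frontier and len(unique) < limit:
        url = frontier.pop() if dfs else frontier.popleft()
        unique.append(url)
        for link in get_links(url):
            if link in seen:
                duplicates.append(link)  # already fetched or queued
            else:
                seen.add(link)
                frontier.append(link)
    return unique, duplicates

# Toy link graph standing in for live pages (hypothetical names).
graph = {"A": ["B", "C"], "B": ["A", "D"], "C": ["D"], "D": []}
bfs_unique, bfs_dups = crawl("A", lambda u: graph.get(u, []))
dfs_unique, dfs_dups = crawl("A", lambda u: graph.get(u, []), dfs=True)
```

Writing `unique` and `duplicates` to text files, one URL per line, yields the four output files described above.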
Dependencies: urllib, NLTK, Beautiful Soup (bs4), httplib2, time, collections
This project requires Python 2.7. A few packages must be installed before the code can run; use the package manager pip to install them as follows:
pip install beautifulsoup4
pip install nltk
pip install httplib2

The urllib, time, and collections modules are part of Python's standard library and do not need to be installed with pip.

Run the crawlers with:

python SimpleCrawler.py
python FocusedCrawler.py