GitHub - agstudy/crawler: Blog crawler for the blogforever project.

BlogForever crawler

Install for python 2.6:

pip install scrapy==0.18.4
pip install lxml httplib2 feedparser selenium python-Levenshtein
Install PhantomJS (download from http://phantomjs.org/download.html) so that the binary is at /opt/phantomjs/bin/phantomjs

Run:

scrapy crawl newcrawl -a startat=http://www.quantumdiaries.org/
scrapy crawl updatecrawl -a startat=http://www.quantumdiaries.org/ -a since=1388593000
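
The `since` value above looks like a Unix timestamp in seconds (the TODO below mentions "unix format" for the last crawl date). Assuming that reading is right, a minimal sketch for deriving it from a calendar date follows; the helper name `to_unix_utc` is hypothetical and not part of the crawler.

```python
import calendar
from datetime import datetime


def to_unix_utc(moment):
    """Convert a naive datetime, interpreted as UTC, to whole Unix seconds."""
    # timegm treats the time tuple as UTC, unlike time.mktime (local time).
    return calendar.timegm(moment.utctimetuple())


# 2014-01-01 16:16:40 UTC corresponds to the value used in the example above.
since = to_unix_utc(datetime(2014, 1, 1, 16, 16, 40))
print("scrapy crawl updatecrawl -a startat=http://www.quantumdiaries.org/ "
      "-a since=%d" % since)
```

The sketch uses `calendar.timegm` so the result does not depend on the local timezone of the machine running it.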

Test:

pip install pytest pytest-incremental
py.test

Source tree docstrings:

bibcrawl
├── model
│   ├── commentitem.py: Blog comment Item
│   ├── objectitem.py: Super class of comment and post item
│   └── postitem.py: Blog post Item
├── pipelines
│   ├── backendpropagate.py: Saves the item in the back-end
│   ├── downloadfeeds.py: Downloads comments web feed
│   ├── downloadimages.py: Download images
│   ├── extractcomments.py: Extracts all comments from html using the comment feed
│   ├── files.py: Files pipeline back-ported to python 2.6
│   ├── processhtml.py: Process html to extract article, title and author
│   └── renderjavascript.py: Renders the original page with PhantomJS and takes a screenshot
├── spiders
│   ├── newcrawl.py: Entirely crawls a new blog
│   ├── rsscrawl.py: Super class of new and update crawl
│   └── updatecrawl.py: Partially crawls a blog for new content of the web feed
├── utils
│   ├── contentextractor.py: Extracts the content of blog posts using a RSS feed
│   ├── ohpython.py: Essential functions that should have been part of python core
│   ├── parsing.py: Parsing functions
│   ├── priorityheuristic.py: Priority heuristic for page download, favors page with links to posts
│   ├── stringsimilarity.py: Dice's coefficient similarity function
│   └── webdriverpool.py: Pool of PhantomJS processes to parallelize page rendering
├── blogmonitor.py: Queries the database and starts new and update crawls when needed
└── settings.py: Scrapy settings
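
For readers unfamiliar with the Dice's coefficient mentioned for stringsimilarity.py, here is a minimal standalone sketch of the usual character-bigram formulation, 2 * |X ∩ Y| / (|X| + |Y|). It is an illustration only, not the code in the repository: the function names are hypothetical, and it uses `collections.Counter`, which requires a Python newer than 2.6.

```python
from collections import Counter


def bigrams(text):
    """Return the multiset of adjacent character pairs in text."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def dice_coefficient(a, b):
    """Similarity in [0, 1]: twice the shared bigrams over all bigrams."""
    if a == b:
        return 1.0
    x, y = bigrams(a), bigrams(b)
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        # Both strings are shorter than two characters and differ.
        return 0.0
    shared = sum((x & y).values())  # multiset intersection
    return 2.0 * shared / total


# "night" and "nacht" share only the bigram "ht": 2 * 1 / (4 + 4)
print(dice_coefficient("night", "nacht"))
```

A score like this can help decide whether two text fragments, for example a feed entry and a block extracted from the page, are close enough to be treated as the same content.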

TODO:

Add to the DB, per blog

link to web-feed
latest etag of this feed
date of last crawl (unix format)

Blog monitor algorithm:

If isFresh, start an updatecrawl with the last crawl date.
Otherwise, nothing needs to be done for this blog.
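
The decision above could be sketched as follows. This is a guess at the intent rather than the project's implementation: it assumes "isFresh" means the feed's current etag differs from the one stored in the DB, and the record fields `url`, `etag` and `last_crawl` as well as the function name are hypothetical. Only the `scrapy crawl updatecrawl` arguments are taken from the Run section.

```python
def update_command(blog, current_etag):
    """Return the updatecrawl command for a blog, or None if nothing changed.

    blog is a dict with the per-blog fields listed in the TODO
    (hypothetical keys): 'url', 'etag' and 'last_crawl' (Unix seconds).
    """
    if current_etag == blog["etag"]:
        # Assumption: an unchanged etag means the feed has no new content.
        return None
    return [
        "scrapy", "crawl", "updatecrawl",
        "-a", "startat=" + blog["url"],
        "-a", "since=%d" % blog["last_crawl"],
    ]


blog = {
    "url": "http://www.quantumdiaries.org/",
    "etag": "abc",
    "last_crawl": 1388593000,
}
print(update_command(blog, "abc"))
print(update_command(blog, "def"))
```

Returning the argument list instead of launching the process keeps the decision easy to test; a monitor could pass the list to `subprocess.call` when it is not None.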

