Install for Python 2.6:
pip install scrapy==0.18.4
pip install lxml httplib2 feedparser selenium python-Levenshtein
Download PhantomJS from http://phantomjs.org/download.html and install it to /opt/phantomjs/bin/phantomjs
Run:
scrapy crawl newcrawl -a startat=http://www.quantumdiaries.org/
scrapy crawl updatecrawl -a startat=http://www.quantumdiaries.org/ -a since=1388593000
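The since value above looks like a Unix timestamp (seconds since the epoch, UTC). A minimal sketch, assuming that interpretation, of computing such a value from a calendar date and assembling the same command line. The helper names to_unix and updatecrawl_args are hypothetical and not part of bibcrawl.

```python
import calendar
import datetime


def to_unix(dt):
    """Convert a naive datetime, interpreted as UTC, to whole seconds since the epoch."""
    return calendar.timegm(dt.timetuple())


def updatecrawl_args(start_url, since):
    """Build the argument list for the updatecrawl command (hypothetical helper)."""
    return ["scrapy", "crawl", "updatecrawl",
            "-a", "startat={0}".format(start_url),
            "-a", "since={0}".format(since)]


# 2014-01-01 16:16:40 UTC corresponds to the timestamp used in the example.
since = to_unix(datetime.datetime(2014, 1, 1, 16, 16, 40))
print(" ".join(updatecrawl_args("http://www.quantumdiaries.org/", since)))
```

The returned list can be passed as-is to subprocess.call if the crawl should be started from a script.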
Test:
pip install pytest pytest-incremental
py.test
bibcrawl
├── model
│   ├── commentitem.py: Blog comment Item
│   ├── objectitem.py: Super class of comment and post item
│   └── postitem.py: Blog post Item
├── pipelines
│   ├── backendpropagate.py: Saves the item in the back-end
│   ├── downloadfeeds.py: Downloads comments web feed
│   ├── downloadimages.py: Downloads images
│   ├── extractcomments.py: Extracts all comments from HTML using the comment feed
│   ├── files.py: Files pipeline back-ported to Python 2.6
│   ├── processhtml.py: Processes HTML to extract article, title and author
│   └── renderjavascript.py: Renders the original page with PhantomJS and takes a screenshot
├── spiders
│   ├── newcrawl.py: Entirely crawls a new blog
│   ├── rsscrawl.py: Super class of new and update crawl
│   └── updatecrawl.py: Partially crawls a blog for new content of the web feed
├── utils
│   ├── contentextractor.py: Extracts the content of blog posts using an RSS feed
│   ├── ohpython.py: Essential functions that should have been part of Python core
│   ├── parsing.py: Parsing functions
│   ├── priorityheuristic.py: Priority heuristic for page download, favors pages with links to posts
│   ├── stringsimilarity.py: Dice's coefficient similarity function
│   └── webdriverpool.py: Pool of PhantomJS processes to parallelize page rendering
├── blogmonitor.py: Queries the database and starts new and update crawls when needed
└── settings.py: Scrapy settings
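For reference, Dice's coefficient (the measure named for stringsimilarity.py) between two sets X and Y is 2 * |X ∩ Y| / (|X| + |Y|). The sketch below applies it to sets of character bigrams, which is one common way to compare strings. It is an illustration under that assumption, not the code in stringsimilarity.py, and the function names are hypothetical.

```python
def bigrams(text):
    """Return the list of adjacent character pairs in text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def dice_coefficient(a, b):
    """Dice's coefficient over the sets of character bigrams of a and b."""
    x, y = set(bigrams(a)), set(bigrams(b))
    if not x and not y:
        # Both strings are shorter than two characters: fall back to equality.
        return 1.0 if a == b else 0.0
    return 2.0 * len(x & y) / (len(x) + len(y))


# "night" and "nacht" share one bigram ("ht") out of four each.
print(dice_coefficient("night", "nacht"))
```

The result ranges from 0.0 (no shared bigrams) to 1.0 (identical bigram sets).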
Add to the DB, per blog:
- link to web-feed
- latest etag of this feed
- date of last crawl (unix format)
Blog monitor algorithm:
if isFresh, start an updatecrawl with the last crawl date;
otherwise nothing needs to be done for this blog.
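The monitor loop can be sketched roughly as follows. This is a minimal illustration, not the code in blogmonitor.py: the Blog record, its url field and plan_crawls are hypothetical, and the isFresh test is passed in as a callable because its exact definition is not given here. The sketch only builds the commands and does not run them.

```python
from collections import namedtuple

# Hypothetical per-blog record. feed_url, etag and last_crawl mirror the DB
# fields listed above; url is added for illustration only.
Blog = namedtuple("Blog", ["url", "feed_url", "etag", "last_crawl"])


def plan_crawls(blogs, is_fresh):
    """Return one updatecrawl command per blog for which is_fresh(blog) is true."""
    commands = []
    for blog in blogs:
        if is_fresh(blog):
            commands.append([
                "scrapy", "crawl", "updatecrawl",
                "-a", "startat={0}".format(blog.url),
                "-a", "since={0}".format(blog.last_crawl),
            ])
        # Otherwise nothing needs to be done for this blog.
    return commands


blogs = [
    Blog("http://www.quantumdiaries.org/", "http://www.quantumdiaries.org/feed/",
         "etag-1", 1388593000),
    Blog("http://blog.example.org/", "http://blog.example.org/feed/",
         "etag-2", 1388000000),
]
# Stand-in predicate for the example: only the first blog is treated as fresh.
for command in plan_crawls(blogs, lambda blog: blog.etag == "etag-1"):
    print(" ".join(command))
```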