scrapy
Best used to obtain one "stream" of data at a time, without trying to obtain data from different pages
scrapy runspider spider.py -o file.json
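As a minimal sketch of what such a spider.py might contain (the spider name, start URL, and selector below are illustrative assumptions, not part of these notes):

```python
# spider.py - minimal sketch; spider name, start URL and selector are assumptions
import scrapy


class TitleSpider(scrapy.Spider):
    name = "title"
    start_urls = ["http://quotes.toscrape.com"]

    def parse(self, response):
        # One "stream" of data: a single item scraped from the start page
        yield {"title": response.css("title::text").extract_first()}
```

Running `scrapy runspider spider.py -o file.json` against a file like this writes the yielded items to file.json.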

Display HTML source of the scraped page
print(response.text)

Get {URL}
fetch('url')

Select a CSS selector
# Returns a `SelectorList`
response.css('p')
# Retrieve full HTML elements
response.css('p').extract()

Retrieve only the text within the element
response.css('p::text').extract()
response.css('p::text').extract_first()
response.css('p::text').extract()[0]

Get the href attribute value for an anchor tag
response.css('a').attrib['href']

Launch Scrapy shell and scrape $URL
scrapy shell $URL

Make a default spider named {quotes} that will be restricted to {domain}
scrapy genspider quotes domain

Run a spider
scrapy runspider scrapy1.py

Run a spider, saving scraped data to a JSON file
scrapy runspider spider.py -o items.json

parse is the method which contains most of the logic of the spider, especially after the yield keyword. For multiple items, a structural basis for iteration must be found, and data is yielded for each iteration.
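For example, a parse method where each div.quote element is the structural basis for iteration and one item is yielded per element (the selectors and field names are assumptions for illustration):

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com"]

    def parse(self, response):
        # Each div.quote is one iteration; one item is yielded per quote
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").extract_first(),
                "author": quote.css("small.author::text").extract_first(),
            }
```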
Extract URL from link using standard CSS selection techniques
Add the domain name to a relative link
response.urljoin()

Recursively call the parse method again on the next page
yield scrapy.Request(url=next_page_url, callback=self.parse)
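Putting the three steps above together (extract the link, join it with the domain, and request the next page with parse as the callback), a sketch of a paginated parse method; the li.next a selector is an assumption:

```python
    # Inside the QuotesSpider class from the sketch above
    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {"text": quote.css("span.text::text").extract_first()}

        # Extract the relative URL of the "next" link (selector is an assumption)
        next_page_url = response.css("li.next a::attr(href)").extract_first()
        if next_page_url is not None:
            # Add the domain name to the relative link, then call parse again
            yield scrapy.Request(url=response.urljoin(next_page_url),
                                 callback=self.parse)
```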

parse_details would be a spider method, sibling to the main parse method. If a detail page has more information than the main page, then the yield keyword should be in parse_details.
yield scrapy.Request(url={url}, callback=self.parse_details)
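A sketch of that pattern, assuming hypothetical a.detail links and detail-page selectors: the main parse method only follows links, and the data is yielded from parse_details.

```python
    # Inside a scrapy.Spider subclass; selectors and field names are assumptions
    def parse(self, response):
        # The main page only provides links to the detail pages
        for href in response.css("a.detail::attr(href)").extract():
            yield scrapy.Request(url=response.urljoin(href),
                                 callback=self.parse_details)

    def parse_details(self, response):
        # The detail page carries more information, so the item is yielded here
        yield {
            "title": response.css("h1::text").extract_first(),
            "description": response.css("p.description::text").extract_first(),
        }
```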

- argparse ?
- array ?
- asyncio ?
- bisect ?
- csv ?
- ctypes ?
- curses ?
- datetime ?
- functools ?
- getpass ?
- glob ?
- heapq ?
- http ?
- json ?
- logging ?
- optparse ?
- os ?
- pathlib ?
- platform ?
- pythonnet ?
- random ?
- socket ?
- subprocess ?
- sqlite3 ?
- sys ?
- termcolor ?
- threading ?
- trace ?
- typing ?
- unittest ?
- urllib ?
- venv ?
- weakref ?
- winrm ?