scrapy
Best used to obtain one "stream" of data at a time, without trying to obtain data from different pages
scrapy runspider spider.py -o file.json
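As a minimal sketch of what such a spider.py might contain (the spider name, start URL, and selector below are illustrative assumptions, not part of these notes):

```python
# spider.py - minimal sketch; spider name, start URL and selector are assumptions
import scrapy


class TitleSpider(scrapy.Spider):
    name = "title"
    start_urls = ["http://quotes.toscrape.com"]

    def parse(self, response):
        # One "stream" of data: a single item scraped from the start page
        yield {"title": response.css("title::text").extract_first()}
```

Running `scrapy runspider spider.py -o file.json` against a file like this writes the yielded items to file.json.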

Display HTML source of the scraped page
print(response.text)

Get {URL}
fetch('url')

Select a CSS selector
# Returns a `SelectorList`
response.css('p')
# Retrieve full HTML elements
response.css('p').extract()

Retrieve only the text within the element
response.css('p::text').extract()
response.css('p::text').extract_first()
response.css('p::text').extract()[0]

Get the href attribute value for an anchor tag
response.css('a').attrib['href']

Launch Scrapy shell and scrape $URL
scrapy shell $URL

Make a default spider named {quotes} that will be restricted to {domain}
scrapy genspider quotes domain

Run a spider
scrapy runspider scrapy1.py

Run a spider, saving scraped data to a JSON file
scrapy runspider spider.py -o items.json

parse is the method which contains most of the logic of the spider, especially after the yield keyword. For multiple items, a structural basis for iteration must be found, and data is yielded for each iteration.
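For example, a parse method where each div.quote element is the structural basis for iteration and one item is yielded per element (the selectors and field names are assumptions for illustration):

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com"]

    def parse(self, response):
        # Each div.quote is one iteration; one item is yielded per quote
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").extract_first(),
                "author": quote.css("small.author::text").extract_first(),
            }
```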
Extract URL from link using standard CSS selection techniques
Add the domain name to a relative link
response.urljoin()

Recursively call the parse method again on the next page
yield scrapy.Request(url=next_page_url, callback=self.parse)
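Putting the three steps above together (extract the link, join it with the domain, and request the next page with parse as the callback), a sketch of a paginated parse method; the li.next a selector is an assumption:

```python
    # Inside the QuotesSpider class from the sketch above
    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {"text": quote.css("span.text::text").extract_first()}

        # Extract the relative URL of the "next" link (selector is an assumption)
        next_page_url = response.css("li.next a::attr(href)").extract_first()
        if next_page_url is not None:
            # Add the domain name to the relative link, then call parse again
            yield scrapy.Request(url=response.urljoin(next_page_url),
                                 callback=self.parse)
```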

parse_details would be a spider method, sibling to the main parse method. If a detail page has more information than the main page, then the yield keyword should be in parse_details.
yield scrapy.Request(url={url}, callback=self.parse_details)
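A sketch of that pattern, assuming hypothetical a.detail links and detail-page selectors: the main parse method only follows links, and the data is yielded from parse_details.

```python
    # Inside a scrapy.Spider subclass; selectors and field names are assumptions
    def parse(self, response):
        # The main page only provides links to the detail pages
        for href in response.css("a.detail::attr(href)").extract():
            yield scrapy.Request(url=response.urljoin(href),
                                 callback=self.parse_details)

    def parse_details(self, response):
        # The detail page carries more information, so the item is yielded here
        yield {
            "title": response.css("h1::text").extract_first(),
            "description": response.css("p.description::text").extract_first(),
        }
```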

- argparse ?
- array ?
- asyncio ?
- bisect ?
- csv ?
- ctypes ?
- curses ?
- datetime ?
- functools ?
- getpass ?
- glob ?
- heapq ?
- http ?
- json ?
- logging ?
- optparse ?
- os ?
- pathlib ?
- platform ?
- pythonnet ?
- random ?
- socket ?
- subprocess ?
- sqlite3 ?
- sys ?
- termcolor ?
- threading ?
- trace ?
- typing ?
- unittest ?
- urllib ?
- venv ?
- weakref ?
- winrm ?