Add examples for new quotes.toscrape.com endpoints (fixes #15) #16

rjanain · 2025-12-17T11:43:40Z

This PR adds spiders for the 6 quotes.toscrape.com endpoints that quotesbot does not yet cover, each demonstrating a different modern web scraping technique. Currently, quotesbot covers only 2 of the sandbox's 8 endpoints; this update brings it to all 8.

Closes #15

The quotes.toscrape.com sandbox has evolved significantly with new endpoints designed to teach modern web scraping challenges. While quotesbot has remained an excellent starting point with CSS and XPath examples, it doesn't demonstrate techniques for JavaScript rendering, APIs, authentication, and other scenarios that students encounter in real-world scraping projects.

Changes

1. New Spiders for Modern Endpoints

JavaScript & API Handling:

toscrape-js.py → /js/ endpoint
- Extracts data from JavaScript-rendered content
- Parses JSON embedded in <script> tags as var data = [...]
- Handles pagination with JS-generated content
- Demonstrates reverse-engineering client-side rendering
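
As a rough illustration of the embedded-JSON step, here is a minimal sketch that uses only the standard library so it runs without Scrapy installed. The helper name `extract_embedded_quotes` and the sample markup are hypothetical and not taken from the spider in this PR; inside a spider, the same function could be fed `response.text`.

```python
import json
import re

# Matches the assignment prefix; the JSON array is expected to start right after it.
DATA_PREFIX = re.compile(r"var\s+data\s*=\s*")


def extract_embedded_quotes(html: str) -> list:
    """Return the list assigned to `var data` in an inline script, or [] if absent."""
    match = DATA_PREFIX.search(html)
    if match is None:
        return []
    # raw_decode parses one JSON value starting at the given index and ignores
    # whatever JavaScript follows it (the trailing ";" and the rendering loop).
    data, _end = json.JSONDecoder().raw_decode(html, match.end())
    return data


# Made-up page fragment shaped like the description above (an assumption).
SAMPLE_HTML = """
<script>
    var data = [
        {"text": "A quote.", "author": {"name": "Some Author"}, "tags": ["a", "b"]},
        {"text": "Another quote.", "author": {"name": "Other Author"}, "tags": []}
    ];
    for (var i in data) { /* builds the DOM in the browser */ }
</script>
"""

quotes = extract_embedded_quotes(SAMPLE_HTML)
```

Decoding from the start of the array, rather than matching the closing bracket with a regex, avoids tripping over `]` characters inside nested tag lists.
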
toscrape-scroll.py → /api/quotes?page=N endpoint
- Consumes REST API directly instead of HTML scraping
- Parses JSON responses with proper pagination handling
- Shows when to use API endpoints over HTML parsing
- Demonstrates API pagination with has_next and page fields
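
The `has_next` / `page` pagination described for toscrape-scroll.py could be sketched roughly as below. Only those two field names come from this PR; the `quotes` key, the shape of each quote object, and the helper name `parse_api_page` are assumptions made for illustration. The sketch is kept to the standard library; in a spider, the returned URL would be handed to a new request.

```python
import json

# URL pattern taken from the endpoint named above.
API_URL = "https://quotes.toscrape.com/api/quotes?page={page}"


def parse_api_page(body: str) -> tuple:
    """Return (items, next_url) for one JSON page; next_url is None on the last page."""
    payload = json.loads(body)
    # Assumed quote shape: {"text": ..., "author": {"name": ...}, "tags": [...]}
    items = [
        {
            "text": quote["text"],
            "author": quote["author"]["name"],
            "tags": quote["tags"],
        }
        for quote in payload.get("quotes", [])
    ]
    next_url = None
    if payload.get("has_next"):
        # Ask for the page after the one the API says it just served.
        next_url = API_URL.format(page=payload["page"] + 1)
    return items, next_url


# Made-up response body for illustration.
SAMPLE_BODY = json.dumps({
    "has_next": True,
    "page": 1,
    "quotes": [
        {"text": "A quote.", "author": {"name": "Some Author"}, "tags": ["x"]},
    ],
})

items, next_url = parse_api_page(SAMPLE_BODY)
```

Deriving the next page number from the response's own `page` field, rather than from a local counter, keeps the spider in step with what the server actually returned.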

Authentication & Forms:

toscrape-login.py → /login endpoint
- Demonstrates complete authentication flow
- Uses FormRequest.from_response() for automatic CSRF handling
- Validates successful login before scraping
- Maintains session across paginated requests
toscrape-viewstate.py → /search.aspx endpoint
- Handles ASP.NET ViewState forms
- Extracts and submits __VIEWSTATE hidden fields
- Demonstrates stateful form submissions
- Teaches techniques for legacy enterprise applications
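
Both spiders in this group depend on sending back hidden form fields (a CSRF token for the login form, `__VIEWSTATE` for the ASP.NET form). `FormRequest.from_response()` pre-populates those fields from the form in the response; the sketch below only illustrates the underlying idea using the standard library. The class and function names, the sample markup, and the field values are all hypothetical, and unlike `from_response()` this version collects hidden inputs from the whole page rather than from one selected form.

```python
from html.parser import HTMLParser


class HiddenFieldCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.fields[attributes["name"]] = attributes.get("value") or ""


def build_form_data(html: str, user_fields: dict) -> dict:
    """Merge the page's hidden fields with the values the spider fills in."""
    collector = HiddenFieldCollector()
    collector.feed(html)
    # User-supplied values win if a name appears in both places.
    return {**collector.fields, **user_fields}


# Made-up markup for illustration; token and state values are placeholders.
LOGIN_HTML = (
    '<form action="/login" method="post">'
    '<input type="hidden" name="csrf_token" value="abc123"/>'
    '<input type="text" name="username"/>'
    '<input type="password" name="password"/>'
    "</form>"
)
VIEWSTATE_HTML = (
    '<form method="post">'
    '<input type="hidden" name="__VIEWSTATE" value="ZmFrZQ==">'
    '<select name="author"></select>'
    "</form>"
)

login_data = build_form_data(LOGIN_HTML, {"username": "student", "password": "secret"})
search_data = build_form_data(VIEWSTATE_HTML, {"author": "Some Author"})
```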

Complex Layouts:

toscrape-table.py → /tableful/ endpoint
- Parses quotes in HTML table structure
- Demonstrates table scraping patterns
- Shows iteration through rows and cells
toscrape-random.py → /random endpoint
- Scrapes endpoints with dynamic/random content
- Simple example for non-deterministic sources
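
The row-and-cell iteration mentioned for toscrape-table.py can be sketched as follows. This is a standard-library illustration only: the sample table is made up and may not match the real /tableful/ markup, and `TableRowCollector` and `table_rows` are hypothetical names. In a Scrapy spider the same walk would normally be done with selectors over `tr` and `td` elements.

```python
from html.parser import HTMLParser


class TableRowCollector(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text chunks of the cell being read, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            # Join the chunks and collapse runs of whitespace into single spaces.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_rows(html: str) -> list:
    """Return a list of rows, each a list of cell texts."""
    collector = TableRowCollector()
    collector.feed(html)
    return collector.rows


# Made-up table for illustration.
SAMPLE_TABLE = """
<table>
  <tr><td>"A quote." Author: Some Author</td></tr>
  <tr><td>Tags: <a href="/tag/a">a</a> <a href="/tag/b">b</a></td></tr>
</table>
"""

rows = table_rows(SAMPLE_TABLE)
```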

2. Updated Existing Spiders

toscrape-css.py and toscrape-xpath.py:
- Now use QuotesbotItem instead of plain dicts for consistency
- Replaced the older .extract_first() and .extract() calls with .get() and .getall()
- Spider behavior and scraped output are otherwise unchanged

3. Enhanced Data Model

items.py: Added explicit field definitions for text, author, and tags
- Makes the item structure clear and educational
- Previously the class body was just pass with a commented-out placeholder

4. Comprehensive Documentation

README.md:
- Added "About This Project" section explaining endpoint coverage
- Maps each spider to its specific endpoint URL
- Organizes spiders by technique category (HTML, JS/API, Auth, Layouts)
- Provides learning path progression from basic to advanced
- Includes direct endpoint URLs for hands-on exploration
- Added installation instructions and example commands

Add 6 new spiders to demonstrate various modern web scraping scenarios:
- toscrape-js: Extract data from JavaScript-rendered content by parsing embedded JSON data in script tags
- toscrape-scroll: Handle infinite scroll pages using the API endpoint with JSON responses
- toscrape-login: Demonstrate form-based authentication with CSRF token handling using FormRequest.from_response
- toscrape-table: Scrape data from table layouts by selecting table rows and cells
- toscrape-viewstate: Handle ASP.NET ViewState forms commonly found in legacy enterprise applications
- toscrape-random: Scrape single random quote endpoint

These spiders provide practical examples for students learning to handle the challenges of modern websites beyond basic HTML parsing.

Enhance the README to provide comprehensive documentation for all spiders:
- Add "What's New" section highlighting the modern scraping techniques
- Organize spiders into Basic and Advanced categories for better learning
- Add detailed descriptions of each spider's purpose and technique
- Include Installation section with setup instructions
- Add Example Commands section with practical usage examples
- Add Tips for Students section with learning recommendations
- Provide suggested learning path from basic to advanced techniques

This makes the project more accessible to students learning modern web scraping and clearly communicates the value of the new additions.

Update documentation to emphasize that:
- quotes.toscrape.com provides 8 different endpoints for learning
- Each endpoint teaches specific modern scraping techniques
- QuotesBot now provides complete coverage of all endpoints
- Spiders are mapped directly to their target endpoints

Changes:
- Reframe "About This Project" to focus on endpoint coverage
- Map each spider to its specific endpoint URL
- Organize by technique category (HTML, JS/API, Auth, Layouts)
- Add direct endpoint URLs in "Tips for Learning" section
- Update learning path to progress through endpoints logically

This positions QuotesBot as the complete companion for the quotes.toscrape.com learning sandbox.

rjanain and others added 7 commits December 17, 2025 16:46

Define QuotesbotItem fields for quote scraping

cad2832

Add text, author, and tags fields to QuotesbotItem to properly structure the scraped quote data. This makes the item definition more explicit and useful for all spiders in the project.

Refactor spiders to use QuotesbotItem for yielding quotes and standar…

c5628ef

…dize string formatting

Correct typo in Basic Spiders section

044096b

Fix formatting issue in README for Basic Spiders section.

Add Scrapy installation instructions to README

d540815

Added installation instructions for Scrapy in quotesbot.
