@rjanain rjanain commented Dec 17, 2025

This PR adds spiders for the 6 quotes.toscrape.com endpoints that quotesbot does not yet cover. quotesbot currently handles only 2 of the sandbox's 8 endpoints; with this update it covers all 8, along with the scraping technique each endpoint is designed to teach.

Closes #15

The quotes.toscrape.com sandbox has evolved significantly with new endpoints designed to teach modern web scraping challenges. While quotesbot has remained an excellent starting point with CSS and XPath examples, it doesn't demonstrate techniques for JavaScript rendering, APIs, authentication, and other scenarios that students encounter in real-world scraping projects.

Changes

1. New Spiders for Modern Endpoints

JavaScript & API Handling:

  • toscrape-js.py: /js/ endpoint

    • Extracts data from JavaScript-rendered content
    • Parses JSON embedded in <script> tags as var data = [...]
    • Handles pagination with JS-generated content
    • Demonstrates reverse-engineering client-side rendering
  • toscrape-scroll.py: /api/quotes?page=N endpoint

    • Consumes REST API directly instead of HTML scraping
    • Parses JSON responses with proper pagination handling
    • Shows when to use API endpoints over HTML parsing
    • Demonstrates API pagination with has_next and page fields
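
The two techniques above can be sketched with the standard library alone. This is a minimal illustration, not the code from this PR: the function names are hypothetical, and the JSON shapes (a `quotes` list whose entries carry `text`, `tags`, and a nested `author.name`) are assumptions about the sandbox responses. Only `var data = [...]`, `has_next`, and `page` come from the description above.

```python
import json
import re
from typing import Any, Dict, List, Optional, Tuple


def extract_embedded_quotes(html: str) -> List[Any]:
    """Return the array assigned to `var data` in an inline script, or []."""
    # Locate the assignment; match.end() lands on the first character of the
    # value because the trailing \s* consumes any whitespace after "=".
    match = re.search(r"var\s+data\s*=\s*", html)
    if match is None:
        return []
    # raw_decode parses exactly one JSON value starting at the given index and
    # ignores whatever JavaScript follows it. Assumes the literal is valid JSON.
    data, _end = json.JSONDecoder().raw_decode(html, match.end())
    return data


def parse_api_page(body: str) -> Tuple[List[Dict[str, Any]], Optional[int]]:
    """Return (quotes, next_page) from one JSON API response body."""
    payload = json.loads(body)
    quotes = [
        # Assumed response shape: author is an object with a "name" key.
        {"text": q["text"], "author": q["author"]["name"], "tags": q["tags"]}
        for q in payload.get("quotes", [])
    ]
    # Follow pagination only while the API reports another page.
    next_page = payload["page"] + 1 if payload.get("has_next") else None
    return quotes, next_page
```

In a spider callback these helpers would receive `response.text`; when `next_page` is not `None`, the callback would request `/api/quotes?page=<next_page>` and repeat.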

Authentication & Forms:

  • toscrape-login.py: /login endpoint

    • Demonstrates complete authentication flow
    • Uses FormRequest.from_response() for automatic CSRF handling
    • Validates successful login before scraping
    • Maintains session across paginated requests
  • toscrape-viewstate.py: /search.aspx endpoint

    • Handles ASP.NET ViewState forms
    • Extracts and submits __VIEWSTATE hidden fields
    • Demonstrates stateful form submissions
    • Teaches techniques for legacy enterprise applications
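
Both spiders above rely on the same idea: read the hidden fields the server put in the form (a CSRF token or `__VIEWSTATE`) and send them back together with the user's own values. `FormRequest.from_response()` does this pre-population for you; the standard-library sketch below only illustrates the principle and is not the code from this PR. The class and function names are hypothetical, and it handles `<input>` elements only (no `<select>`, no submit-button logic).

```python
from html.parser import HTMLParser
from typing import Dict


class FormFieldCollector(HTMLParser):
    """Collect name/value pairs of <input> elements found inside <form> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.fields: Dict[str, str] = {}
        self._in_form = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "form":
            self._in_form = True
        elif tag == "input" and self._in_form and attr.get("name"):
            # Hidden fields such as csrf_token or __VIEWSTATE are captured
            # here with the value the server rendered into the page.
            self.fields[attr["name"]] = attr.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self._in_form = False


def build_form_payload(html: str, overrides: Dict[str, str]) -> Dict[str, str]:
    """Merge the form's pre-filled fields with caller-supplied values."""
    collector = FormFieldCollector()
    collector.feed(html)
    payload = dict(collector.fields)
    payload.update(overrides)  # caller values win over pre-filled ones
    return payload
```

For example, calling `build_form_payload(login_page_html, {"username": "user", "password": "pass"})` would return those credentials plus whatever hidden token the login form contained, ready to be posted back.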

Complex Layouts:

  • toscrape-table.py: /tableful/ endpoint

    • Parses quotes in HTML table structure
    • Demonstrates table scraping patterns
    • Shows iteration through rows and cells
  • toscrape-random.py: /random endpoint

    • Scrapes endpoints with dynamic/random content
    • Simple example for non-deterministic sources
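
The row-and-cell iteration mentioned for the table spider can be sketched as follows. A Scrapy spider would more typically loop over selectors such as `response.css("tr")`; this standard-library version only illustrates the pattern. The names are hypothetical, and the actual /tableful/ markup is not shown in this PR, so the sketch assumes a plain, non-nested table.

```python
from html.parser import HTMLParser
from typing import List, Optional


class TableRowCollector(HTMLParser):
    """Collect the text of every cell, grouped by table row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._row: Optional[List[str]] = None   # cells of the row being read
        self._cell: Optional[List[str]] = None  # text pieces of the open cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            # Join the pieces and collapse runs of whitespace into one space.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left open by malformed markup


def table_rows(html: str) -> List[List[str]]:
    """Return the table as a list of rows, each a list of cell strings."""
    collector = TableRowCollector()
    collector.feed(html)
    return collector.rows
```

Each returned row can then be mapped to item fields by position, which is the step that depends on the real column layout.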

2. Updated Existing Spiders

  • toscrape-css.py and toscrape-xpath.py:
    • Now use QuotesbotItem instead of plain dicts for consistency
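    • Replaced the legacy .extract_first() and .extract() calls with .get() and .getall(), the names used in current Scrapy documentation (the older methods still work, so this is a style update)
    • Extracted fields and values are unchanged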

3. Enhanced Data Model

  • items.py: Added explicit field definitions for text, author, and tags
    • Makes the item structure clear and educational
    • Previously the class body was just pass with a commented-out placeholder

4. Comprehensive Documentation

  • README.md:
    • Added "About This Project" section explaining endpoint coverage
    • Maps each spider to its specific endpoint URL
    • Organizes spiders by technique category (HTML, JS/API, Auth, Layouts)
    • Provides learning path progression from basic to advanced
    • Includes direct endpoint URLs for hands-on exploration
    • Added installation instructions and example commands

rjanain and others added 7 commits December 17, 2025 16:46

Add text, author, and tags fields to QuotesbotItem to properly structure
the scraped quote data. This makes the item definition more explicit and
useful for all spiders in the project.

Add 6 new spiders to demonstrate various modern web scraping scenarios:

- toscrape-js: Extract data from JavaScript-rendered content by parsing
  embedded JSON data in script tags
- toscrape-scroll: Handle infinite scroll pages using the API endpoint
  with JSON responses
- toscrape-login: Demonstrate form-based authentication with CSRF token
  handling using FormRequest.from_response
- toscrape-table: Scrape data from table layouts by selecting table rows
  and cells
- toscrape-viewstate: Handle ASP.NET ViewState forms commonly found in
  legacy enterprise applications
- toscrape-random: Scrape single random quote endpoint

These spiders provide practical examples for students learning to handle
the challenges of modern websites beyond basic HTML parsing.

Enhance the README to provide comprehensive documentation for all spiders:

- Add "What's New" section highlighting the modern scraping techniques
- Organize spiders into Basic and Advanced categories for better learning
- Add detailed descriptions of each spider's purpose and technique
- Include Installation section with setup instructions
- Add Example Commands section with practical usage examples
- Add Tips for Students section with learning recommendations
- Provide suggested learning path from basic to advanced techniques

This makes the project more accessible to students learning modern web
scraping and clearly communicates the value of the new additions.

Update documentation to emphasize that:
- quotes.toscrape.com provides 8 different endpoints for learning
- Each endpoint teaches specific modern scraping techniques
- QuotesBot now provides complete coverage of all endpoints
- Spiders are mapped directly to their target endpoints

Changes:
- Reframe "About This Project" to focus on endpoint coverage
- Map each spider to its specific endpoint URL
- Organize by technique category (HTML, JS/API, Auth, Layouts)
- Add direct endpoint URLs in "Tips for Learning" section
- Update learning path to progress through endpoints logically

This positions QuotesBot as the complete companion for the
quotes.toscrape.com learning sandbox.

Fix formatting issue in README for Basic Spiders section.

Added installation instructions for Scrapy in quotesbot.