An AI-powered web scraper with RAG (Retrieval-Augmented Generation) capabilities for semantic search over scraped content.
- Direct Scraping: Scrape specific websites by URL with configurable depth and page limits
- Smart Scraping: AI-powered scraping based on natural language queries (requires search API integration)
- Real-time Monitoring: WebSocket-based job monitoring with live progress updates
- RAG Integration: Automatic content chunking and embedding using sentence transformers
- Semantic Search: Natural language search over all scraped content using Pinecone vector database
- Modern UI: Beautiful React frontend with Tailwind CSS and shadcn/ui components
- Job Management: Track all scraping jobs with detailed statistics and scraped URL lists
- FastAPI: Modern Python web framework
- SQLAlchemy: Async database ORM
- Pinecone: Vector database for embeddings
- Groq: LLM inference API for intelligent search and analysis
- Sentence Transformers: Text embedding models
- BeautifulSoup4: HTML parsing
- Playwright: Browser automation for JavaScript-heavy sites
- HTTPX: Async HTTP client
- React 18: Modern UI library
- Vite: Fast build tool
- TanStack Query: Data fetching and caching
- React Router: Client-side routing
- Tailwind CSS: Utility-first CSS framework
- Lucide Icons: Beautiful icon library
- Zustand: State management
- Python 3.9+
- Node.js 18+
- Groq API key (get from console.groq.com)
- Pinecone API key (get from pinecone.io)
git clone <repository-url>
cd Universal_Scraper

# Create virtual environment
python -m venv venv
# Activate virtual environment
# On macOS/Linux:
source venv/bin/activate
# On Windows:
venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Install Playwright browsers (optional, for JavaScript-heavy sites)
playwright install

cd frontend
npm install
cd ..

Create a .env file in the root directory:
# API Keys (REQUIRED)
GROQ_API_KEY=your_groq_api_key_here
PINECONE_API_KEY=your_pinecone_api_key_here
# Pinecone Configuration
PINECONE_ENVIRONMENT=us-west1-gcp
PINECONE_INDEX_NAME=universal-scraper
# Database
DATABASE_URL=sqlite+aiosqlite:///./scraper.db
# Storage
DATA_STORAGE_PATH=./data
# Scraper Configuration
MAX_CONCURRENT_SCRAPES=5
MAX_DEPTH=3
REQUEST_TIMEOUT=30
RATE_LIMIT_DELAY=1.0
USER_AGENT=UniversalScraper/1.0
# RAG Configuration
EMBEDDING_MODEL=all-MiniLM-L6-v2
CHUNK_SIZE=500
CHUNK_OVERLAP=50
TOP_K_RESULTS=10
# Server Configuration
HOST=0.0.0.0
PORT=8000
RELOAD=true

Before running the application, create a Pinecone index:
- Log in to Pinecone Console
- Create a new index with the following settings:
  - Name: universal-scraper (or match your PINECONE_INDEX_NAME)
  - Dimensions: 384 (for the all-MiniLM-L6-v2 model)
  - Metric: cosine
  - Region: choose your preferred region
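If you prefer to set the index up from a script, here is a minimal sketch using the Pinecone Python client (v3+ serverless API); the cloud and region values are placeholders, so match them to your plan and .env:

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="your_pinecone_api_key_here")

# Create the index the scraper expects (384 dimensions for all-MiniLM-L6-v2)
if "universal-scraper" not in pc.list_indexes().names():
    pc.create_index(
        name="universal-scraper",
        dimension=384,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),  # placeholder region
    )

Older pod-based clients (the ones that use PINECONE_ENVIRONMENT) have a slightly different create_index signature, so adjust accordingly.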
# Quick start (macOS/Linux):
chmod +x start.sh
./start.sh

# Quick start (Windows):
start.bat

# Or run the servers manually:
# Activate virtual environment
source venv/bin/activate  # or venv\Scripts\activate on Windows

# Run backend server
python -m uvicorn backend.main:app --reload --host 0.0.0.0 --port 8000

# Run frontend dev server (in a second terminal)
cd frontend
npm run dev

The application will be available at:
- Frontend: http://localhost:5173
- Backend API: http://localhost:8000
- API Documentation: http://localhost:8000/docs
- Navigate to Direct Scrape in the sidebar
- Enter a website URL (e.g., https://example.com)
- Configure settings:
  - Max Depth: How many levels of links to follow (1-10)
  - Max Pages: Maximum number of pages to scrape
- Click Start Scraping
- You'll be redirected to the Job Monitor to track progress
- First, scrape some websites using Direct Scrape
- Navigate to Search in the sidebar
- Enter a natural language query
- View semantically relevant results from your scraped content
- Click on sources to visit the original pages
- Navigate to Job Monitor in the sidebar
- View all scraping jobs and their status
- Click on a job to see detailed information
- Monitor real-time progress with WebSocket updates (see the sketch after this list)
- View all scraped URLs and their status
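The same updates can be consumed from a script. A minimal sketch using the websockets library, assuming the backend broadcasts JSON messages (check the actual payload shape in the backend code):

import asyncio
import json
import websockets

async def watch_job(job_id: str):
    # Connect to the job's WebSocket endpoint and print each progress update
    uri = f"ws://localhost:8000/jobs/ws/{job_id}"
    async with websockets.connect(uri) as ws:
        async for message in ws:
            print(json.loads(message))

asyncio.run(watch_job("your-job-id"))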
- Navigate to Smart Scrape in the sidebar
- Enter a natural language query describing what you want to find
- The AI will generate optimal search queries
- Relevant websites will be automatically discovered and scraped
- POST /scrape/direct - Start a direct scraping job
- POST /scrape/smart - Start a smart scraping job
- GET /jobs/ - List all jobs
- GET /jobs/{job_id} - Get job details
- GET /jobs/{job_id}/urls - Get scraped URLs for a job
- GET /jobs/stats/overview - Get system statistics
- WS /jobs/ws/{job_id} - WebSocket for real-time job updates
- POST /search/ - Perform semantic search
- GET /search/history - Get recent search queries
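A quick way to exercise the API from Python with HTTPX; the request body fields below are assumptions, so check the interactive docs at http://localhost:8000/docs for the actual schemas:

import httpx

with httpx.Client(base_url="http://localhost:8000") as client:
    # Start a direct scraping job (field names are assumptions)
    job = client.post("/scrape/direct", json={
        "url": "https://example.com",
        "max_depth": 2,
        "max_pages": 20,
    }).json()
    print(job)

    # Semantic search over scraped content (field name is an assumption)
    results = client.post("/search/", json={"query": "pricing information"}).json()
    print(results)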
Universal_Scraper/
├── backend/
│ ├── ai/ # AI agents (Groq, search, site selection)
│ ├── api/ # FastAPI routes and schemas
│ ├── rag/ # RAG pipeline (chunking, embedding, retrieval)
│ ├── scraper/ # Web scraping logic
│ ├── storage/ # Database and file storage
│ ├── utils/ # Utilities (logging, validation)
│ ├── config.py # Configuration management
│ └── main.py # FastAPI application entry point
├── frontend/
│ ├── src/
│ │ ├── components/ # React components
│ │ ├── lib/ # API client and utilities
│ │ ├── App.jsx # Main app component
│ │ └── main.jsx # Entry point
│ ├── index.html
│ └── package.json
├── requirements.txt # Python dependencies
├── .env # Environment variables (create this)
└── README.md # This file
- API Layer: FastAPI receives scraping requests
- Background Tasks: Jobs run asynchronously
- Crawler: Fetches and parses web pages
- Storage: Saves raw content locally and metadata to SQLite
- RAG Pipeline (see the sketch after this list):
  - Chunks text content
  - Generates embeddings using sentence transformers
  - Stores vectors in Pinecone
- WebSocket: Broadcasts real-time progress updates
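A conceptual sketch of that ingestion step (chunk, embed, upsert), not the project's actual module:

from sentence_transformers import SentenceTransformer
from pinecone import Pinecone

model = SentenceTransformer("all-MiniLM-L6-v2")        # 384-dimensional embeddings
index = Pinecone(api_key="your_pinecone_api_key_here").Index("universal-scraper")

def ingest(chunks: list[str], page_url: str):
    # One embedding vector per text chunk
    vectors = model.encode(chunks)
    # Upsert (id, vector, metadata) tuples so search results can link back to the page
    index.upsert(vectors=[
        (f"{page_url}#{i}", vec.tolist(), {"url": page_url, "text": chunk})
        for i, (vec, chunk) in enumerate(zip(vectors, chunks))
    ])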
- User Interface: React components with Tailwind CSS
- API Client: Axios for HTTP requests
- State Management: TanStack Query for server state, Zustand for local state
- Real-time Updates: WebSocket connection for job monitoring
- Routing: React Router for navigation
- MAX_CONCURRENT_SCRAPES: Number of parallel scraping jobs
- MAX_DEPTH: Default maximum crawl depth
- REQUEST_TIMEOUT: HTTP request timeout in seconds
- RATE_LIMIT_DELAY: Delay between requests in seconds
- EMBEDDING_MODEL: Sentence transformer model name
- CHUNK_SIZE: Text chunk size for embedding (see the sketch below)
- CHUNK_OVERLAP: Overlap between chunks
- TOP_K_RESULTS: Default number of search results
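As a rough illustration of how CHUNK_SIZE and CHUNK_OVERLAP interact (the real chunker may split on words or sentences rather than raw characters):

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    # Each chunk starts (chunk_size - overlap) characters after the previous one,
    # so consecutive chunks share `overlap` characters of context.
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]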
Error: "Pinecone index not found"
- Make sure you created the Pinecone index with the correct name and dimensions (384)
Error: "Database connection failed"
- Ensure the DATA_STORAGE_PATH directory exists, or that the app has write permissions to create it
Error: "Groq API key invalid"
- Verify your API key in the .env file
- Check your API key at console.groq.com
Error: "Network Error" or "Failed to fetch"
- Ensure the backend server is running on port 8000
- Check CORS settings in backend/main.py
WebSocket not connecting
- Verify the WebSocket URL in JobMonitor.jsx
- Check the browser console for connection errors
Pages not being scraped
- Check if the website blocks scrapers (User-Agent, robots.txt)
- Try increasing REQUEST_TIMEOUT
- For JavaScript-heavy sites, Playwright integration may be needed
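For such sites, a minimal Playwright sketch looks like the following (run playwright install first; this is not necessarily how the project wires it in):

import asyncio
from playwright.async_api import async_playwright

async def fetch_rendered(url: str) -> str:
    # Launch a headless browser, wait for the page to settle, return rendered HTML
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto(url, wait_until="networkidle")
        html = await page.content()
        await browser.close()
        return html

print(asyncio.run(fetch_rendered("https://example.com")))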
# Backend tests
pytest
# Frontend tests (if configured)
cd frontend
npm test

# Backend
black backend/

# Frontend
cd frontend
npm run lint

- Adjust Concurrency: Increase MAX_CONCURRENT_SCRAPES for faster scraping
- Rate Limiting: Adjust RATE_LIMIT_DELAY to be respectful to target servers
- Chunk Size: Smaller chunks = more granular search but more vectors
- Top K Results: Balance between relevance and response time
- Never commit .env files to version control
- Use strong API keys and rotate them regularly
- Be respectful of robots.txt and rate limits
- Consider legal implications of web scraping in your jurisdiction
- Search API integration for Smart Scraping
- PDF and document scraping support
- Export functionality (JSON, CSV, Markdown)
- Scheduled scraping jobs
- User authentication and multi-tenancy
- Advanced filtering and search options
- Mobile-responsive improvements
- Docker containerization
- Cloud deployment guides (AWS, GCP, Azure)
This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Please feel free to submit a Pull Request.
For issues, questions, or feature requests, please open an issue on GitHub.
Built with ❤️ using FastAPI, React, and AI