A completely free, local-only Retrieval-Augmented Generation (RAG) chatbot that uses arXiv papers as its knowledge base. Built with open-source components and designed to run entirely on your local machine with zero costs.
- Zero Cost: Uses only free, open-source models and libraries
- Local Only: Runs entirely on your machine, no external APIs
- arXiv Integration: Automatically downloads and indexes recent research papers
- Fast Retrieval: FAISS-based similarity search for efficient document retrieval
- Similarity Scoring: Retrieved passages are ranked and returned with similarity scores
- Optional LLM Generation: AI-powered answer synthesis with local models (TinyLlama, Phi-3, Mistral)
- REST API: FastAPI-based service with comprehensive endpoints
- Docker Support: Easy deployment with Docker and docker-compose
- Auto-Refresh: Nightly cron job to update the knowledge base
- Evaluation Tools: Built-in metrics for hit@k, latency, and retrieval quality
```
┌──────────────────┐     ┌──────────────────┐     ┌──────────────────┐
│   arXiv Papers   │────▶│  Preprocessing   │────▶│   FAISS Index    │
│   (Downloader)   │     │  (Chunking +     │     │   (Embeddings)   │
│                  │     │   Embeddings)    │     │                  │
└──────────────────┘     └──────────────────┘     └──────────────────┘
                                                           │
┌──────────────────┐     ┌──────────────────┐             │
│    User Query    │────▶│   RAG Pipeline   │◀─────────────┘
│                  │     │   (Retrieval +   │
│                  │     │    Generation)   │
└──────────────────┘     └──────────────────┘
                                  │
                         ┌──────────────────┐
                         │     FastAPI      │
                         │    (REST API)    │
                         └──────────────────┘
```
- Python 3.9+
- 8-16 GB RAM (recommended for smooth operation)
- macOS / Linux / Windows (tested on a MacBook Air M2)
- Docker (optional, for containerized deployment)
- Clone the repository:

  ```bash
  git clone https://github.com/YOUR_USERNAME/QueryGenie.git
  cd QueryGenie
  ```

- Create a virtual environment:

  ```bash
  python3 -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Download and process papers:

  ```bash
  python src/arxiv_downloader.py
  python src/preprocessing.py
  ```

- Start the API server:

  ```bash
  # Using the new backend structure (recommended)
  python backend/main.py

  # Or using the legacy API (alternative)
  python src/api.py
  ```

Or deploy with Docker instead:

- Build and run with Docker Compose:

  ```bash
  docker-compose -f docker-compose.prod.yml up -d
  ```

- Initialize the system (first time only):

  ```bash
  # Download papers and create index
  docker-compose -f docker-compose.prod.yml exec backend python src/arxiv_downloader.py
  docker-compose -f docker-compose.prod.yml exec backend python src/preprocessing.py
  ```
```bash
python src/arxiv_downloader.py
```

This downloads recent papers from arXiv (AI, ML, NLP, CV categories).

```bash
python src/preprocessing.py
```

This processes papers, creates embeddings, and builds the FAISS index.
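For orientation, here is a minimal sketch of what the chunk → embed → index step looks like with sentence-transformers and FAISS. The function and variable names are illustrative, not the project's actual `DocumentProcessor` API:

```python
# Illustrative sketch of the chunk -> embed -> index flow (not the project's exact code).
import faiss
from sentence_transformers import SentenceTransformer

def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into overlapping word chunks (simplified stand-in for the real chunker)."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
papers = ["Attention is all you need ...", "BERT: Pre-training of deep bidirectional ..."]

chunks = [c for paper in papers for c in chunk_text(paper)]
embeddings = model.encode(chunks, normalize_embeddings=True)  # shape: (num_chunks, 384)

index = faiss.IndexFlatIP(int(embeddings.shape[1]))  # inner product == cosine on normalized vectors
index.add(embeddings)
faiss.write_index(index, "data/faiss_index.faiss")
```

Normalizing the embeddings and using an inner-product index makes the returned scores behave like cosine similarity.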
```bash
# Using the new backend structure (recommended)
python backend/main.py

# Or using the legacy API
python src/api.py
```

The API will be available at http://localhost:8000.
- API Documentation: http://localhost:8000/docs
- Health Check: http://localhost:8000/api/v1/health
```bash
curl -X POST "http://localhost:8000/api/v1/ask" \
  -H "Content-Type: application/json" \
  -d '{"question": "What are the latest advances in transformer architectures?"}'
```

- `POST /api/v1/ask` - Ask a question to the RAG system
- `GET /api/v1/health` - Health check and system status
- `GET /api/v1/metrics` - Performance metrics and statistics
- `POST /api/v1/refresh` - Trigger an index refresh

Note: The API uses versioned endpoints under /api/v1/. For interactive API documentation, visit http://localhost:8000/docs.
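If you prefer calling the endpoint from Python, a small sketch using requests follows; the response schema is not documented here, so the payload is simply printed rather than assuming specific fields:

```python
# Query the running QueryGenie API from Python.
import requests

resp = requests.post(
    "http://localhost:8000/api/v1/ask",
    json={"question": "What are the latest advances in transformer architectures?", "k": 5},
    timeout=60,
)
resp.raise_for_status()
print(resp.json())  # inspect the payload; typically an answer plus the retrieved sources
```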
```bash
curl -X POST "http://localhost:8000/api/v1/ask" \
  -H "Content-Type: application/json" \
  -d '{
    "question": "How do attention mechanisms work in transformers?",
    "k": 5,
    "max_context_length": 5000,
    "max_answer_length": 300
  }'
```

```bash
curl http://localhost:8000/api/v1/health
```

```bash
curl http://localhost:8000/api/v1/metrics
```

The system uses these free, open-source models:
- Embeddings: `sentence-transformers/all-MiniLM-L6-v2` (384 dimensions, ~90MB)
- Retrieval: FAISS-based similarity search with sentence-transformers (sketched below)
- Generation (Optional): Local LLM via llama.cpp (TinyLlama, Phi-3, or Mistral)
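Query-time retrieval boils down to embedding the question with the same model and searching the FAISS index. A rough sketch under the same assumptions as the indexing example above (not the project's actual `faiss_manager` API):

```python
# Illustrative query-time retrieval; names are placeholders, not the real faiss_manager API.
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
index = faiss.read_index("data/faiss_index.faiss")

query = "How do attention mechanisms work in transformers?"
query_vec = model.encode([query], normalize_embeddings=True)

scores, ids = index.search(query_vec, 5)   # top-5 most similar chunks
print(list(zip(ids[0].tolist(), scores[0].tolist())))
```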
Retrieval-only mode (default): fast and lightweight - returns formatted context from retrieved papers.
```bash
# Using new backend structure (recommended)
python backend/main.py

# Or using legacy API
python src/api.py
```

LLM generation mode: AI-powered answer synthesis using local models.
```bash
# Install LLM dependencies first
pip install llama-cpp-python huggingface-hub

# Enable LLM generation (new backend)
USE_LLM=true python backend/main.py

# Use a specific model
USE_LLM=true LLM_MODEL="TinyLlama/TinyLlama-1.1B-Chat-v1.0" python backend/main.py

# Or using the legacy API
USE_LLM=true LLM_MODEL="TinyLlama/TinyLlama-1.1B-Chat-v1.0" python src/api.py
```

Supported models:

- `TinyLlama/TinyLlama-1.1B-Chat-v1.0` (fastest, ~600MB)
- `microsoft/phi-2` (better quality, ~2.3GB)
- `mistralai/Mistral-7B-Instruct-v0.2` (best quality, ~4GB)
See SETUP_LLM.md for detailed LLM setup instructions.
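For reference, answer generation with llama-cpp-python roughly follows the pattern below. This is only a sketch: the GGUF file name, prompt template, and sampling parameters are assumptions, and the project's `src/llm_generator.py` may do this differently.

```python
# Rough sketch of local answer synthesis with llama-cpp-python (not the project's exact code).
from llama_cpp import Llama

# Assumed local GGUF file under models/ (downloaded separately).
llm = Llama(model_path="models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf", n_ctx=2048)

context = "...retrieved paper chunks go here..."
question = "How do attention mechanisms work in transformers?"
prompt = (
    "Answer the question using only the context.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)

out = llm(prompt, max_tokens=300, temperature=0.2, stop=["\n\n"])
print(out["choices"][0]["text"].strip())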
You can modify the models in the source code:
```python
# In src/preprocessing.py
processor = DocumentProcessor(
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",  # Change this
    chunk_size=512,
    chunk_overlap=50
)
```
```python
# In src/rag_pipeline.py
rag_pipeline = RAGPipeline(
    faiss_manager,
    use_llm=True,  # Enable LLM generation
    model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # Change this
)
```

Run the evaluation suite with:

```bash
python src/evaluation.py
```

Edit `test_queries.json` to add your own test questions (an example format is shown after the metrics below). The evaluation reports:
- Latency: Average response time (retrieval + generation)
- Hit@k: Percentage of queries with relevant results in top-k
- Retrieval Quality: Average similarity scores and diversity
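The schema of `test_queries.json` is not documented in this README, so the snippet below is only an assumed shape: a list of questions, each optionally tagged with keywords that relevant results should contain (useful when judging hit@k).

```json
[
  {
    "question": "What are the latest advances in transformer architectures?",
    "expected_keywords": ["transformer", "attention"]
  },
  {
    "question": "How does retrieval-augmented generation reduce hallucinations?",
    "expected_keywords": ["retrieval", "grounding"]
  }
]
```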
```bash
chmod +x scripts/setup_cron.sh
./scripts/setup_cron.sh
```

This sets up a cron job to refresh the index every night at 2 AM.
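Under the hood this amounts to installing a crontab entry along the following lines (the exact paths and logging are assumptions about what `setup_cron.sh` writes):

```
# Assumed crontab entry: run the refresh script every night at 02:00
0 2 * * * cd /path/to/QueryGenie && ./venv/bin/python scripts/refresh_index.py >> logs/refresh.log 2>&1
```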
To refresh the index manually:

```bash
python scripts/refresh_index.py
```

Project structure:

```
QueryGenie/
├── backend/                  # FastAPI backend application
│   ├── api/
│   │   └── v1/
│   │       └── routes.py     # API v1 endpoints
│   └── main.py               # Backend entry point
├── frontend/                 # React + TypeScript frontend
│   ├── src/
│   │   ├── components/       # React components
│   │   ├── services/         # API services
│   │   └── App.tsx           # Main app component
│   └── Dockerfile            # Frontend container
├── src/                      # Core RAG logic (shared)
│   ├── __init__.py
│   ├── api.py                # Legacy API (deprecated)
│   ├── arxiv_downloader.py   # Paper downloader
│   ├── preprocessing.py      # Document processing
│   ├── faiss_manager.py      # FAISS index management
│   ├── rag_pipeline.py       # RAG pipeline
│   └── llm_generator.py      # LLM generation
├── scripts/
│   ├── refresh_index.py      # Index refresh script
│   └── setup_cron.sh         # Cron job setup
├── data/                     # Data directory (FAISS index, papers)
├── models/                   # LLM model files (if using LLM)
├── requirements.txt          # Python dependencies
├── Dockerfile.backend        # Backend Docker configuration
├── docker-compose.prod.yml   # Production Docker Compose setup
├── test_queries.json         # Test queries
└── README.md                 # This file
```
```bash
# Start all services (backend + frontend)
docker-compose -f docker-compose.prod.yml up -d

# View logs
docker-compose -f docker-compose.prod.yml logs -f

# Stop services
docker-compose -f docker-compose.prod.yml down
```

The services will be available at:
- Frontend: http://localhost:3000
- Backend API: http://localhost:8000
- API Docs: http://localhost:8000/docs
```bash
# Build backend image
docker build -f Dockerfile.backend -t querygenie-backend .

# Run backend container
docker run -p 8000:8000 \
  -v $(pwd)/data:/app/data \
  -v $(pwd)/models:/app/models \
  -e USE_LLM=true \
  querygenie-backend
```
"FAISS index not found"
- Run the preprocessing pipeline first
- Check if
data/faiss_index.faissexists
-
"Out of memory"
- Reduce
chunk_sizein preprocessing - Use a smaller embedding model
- Close other applications
- Reduce
-
"Model download failed"
- Check internet connection
- Clear Hugging Face cache:
rm -rf ~/.cache/huggingface
-
"Slow performance"
- Use GPU if available (set
device="cuda") - Reduce
max_context_length - Use fewer retrieved sources
- Use GPU if available (set
For performance tuning:

- GPU Acceleration: Set `device="cuda"` in RAGPipeline
- Memory Usage: Adjust `chunk_size` and `batch_size`
- Index Size: Limit the number of papers downloaded
On MacBook Air M2 (8GB RAM):
- Index Creation: ~5-10 minutes for 200 papers
- Query Response: ~2-5 seconds per query
- Memory Usage: ~2-4GB during operation
- Index Size: ~100-500MB depending on corpus size
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
This project is open source and available under the MIT License.
- Hugging Face for providing free, open-source models
- Facebook AI for FAISS similarity search
- arXiv for providing open access to research papers
- FastAPI for the excellent web framework
For issues and questions:
- Check the troubleshooting section
- Review the logs in the `logs/` directory
- Open an issue on GitHub
- Check the API documentation at http://localhost:8000/docs (interactive Swagger UI)
QueryGenie - Bringing the power of RAG to your local machine, completely free!