A production-grade Retrieval-Augmented Generation (RAG) pipeline designed for scalability and modularity.
The system supports parallel ingestion, distributed document processing, and query-time retrieval with context-aware LLM generation, using Supabase, Redis, and Qdrant Cloud.
The system follows a distributed worker-based architecture:
- FastAPI Server — handles user requests for ingestion and querying.
- Redis Queue — decouples ingestion requests from heavy processing.
- Ingestion Workers — fetch content, chunk, embed, and store it in Qdrant.
- Supabase (PostgreSQL) — manages document metadata and ingestion job tracking.
- Qdrant Cloud — vector store for similarity search.
- Google Gemini API — provides LLM responses using retrieved context.
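The first three components meet at one boundary: the API validates a request, assigns a job ID, and pushes the job onto Redis for a worker to pick up. A minimal sketch of that handoff (the queue name, job shape, and module layout here are illustrative, not the repo's actual code):

```python
# Sketch: the API-to-queue handoff. "ingestion_jobs" and the job fields are assumptions.
import json
import uuid

import redis
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
queue = redis.Redis.from_url("redis://localhost:6379")

class IngestRequest(BaseModel):
    url: str
    source: str

@app.post("/ingest-url")
def ingest_url(req: IngestRequest) -> dict:
    job_id = str(uuid.uuid4())
    # LPUSH returns immediately; the heavy fetching/embedding happens in a worker.
    queue.lpush("ingestion_jobs", json.dumps({"job_id": job_id, "url": req.url, "source": req.source}))
    return {"job_id": job_id, "status": "queued"}
```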
| Component | Technology | Reasoning |
|---|---|---|
| Backend API | FastAPI | Lightweight, async, and production-friendly. |
| Task Queue | Redis | Simple, fast message broker for decoupled ingestion. |
| Database | Supabase (PostgreSQL) | Manages metadata and ingestion states; integrates easily. |
| Vector Store | Qdrant Cloud | Cloud-hosted, scalable, and efficient for similarity search. |
| Embeddings | all-MiniLM-L6-v2 (384-dim) | Compact yet high-quality embeddings for semantic retrieval. |
| LLM | Google Gemini API | High-quality contextual reasoning and summarization. |
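For reference, producing a 384-dimensional vector with this embedding model is a short sentence-transformers call:

```python
# Sketch: embedding a chunk with all-MiniLM-L6-v2 (384 dimensions).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
vector = model.encode("LangChain is a framework for building LLM applications.")
print(vector.shape)  # (384,)
```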
Supabase tracks each document with the following schema:

| Column | Type | Description |
|---|---|---|
| doc_id | uuid | Primary key |
| url | text | Source document URL |
| source | text | Origin identifier (e.g., "website", "pdf") |
| status | text | Processing status |
| created_at | timestamp | Record creation time |
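The table name is not shown in this README; assuming it is `documents`, registering a new document with the supabase-py client looks roughly like this:

```python
# Sketch: inserting a metadata row via supabase-py. The table name is an assumption.
import os

from supabase import create_client

supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_KEY"])

supabase.table("documents").insert({
    "url": "https://python.langchain.com/docs/get_started/introduction/",
    "source": "website",
    "status": "queued",
}).execute()
```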
Each chunk stored in Qdrant carries the following payload:

| Field | Type | Description |
|---|---|---|
| id | UUID | Unique chunk ID |
| vector | float[384] | Embedding vector |
| text_snippet | TEXT | Extracted document chunk |
| url | TEXT | Source document URL |
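Writing a chunk with this payload is a single upsert in qdrant-client; a sketch, where the collection name `rag_chunks` is an assumption:

```python
# Sketch: upserting one embedded chunk. "rag_chunks" is an assumed collection name.
import os
import uuid

from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

client = QdrantClient(url=os.environ["QDRANT_URL"], api_key=os.environ["QDRANT_API_KEY"])

embedding = [0.0] * 384  # placeholder; the pipeline gets this from all-MiniLM-L6-v2
client.upsert(
    collection_name="rag_chunks",
    points=[PointStruct(
        id=str(uuid.uuid4()),
        vector=embedding,
        payload={"text_snippet": "Extracted document chunk...", "url": "https://example.com/doc"},
    )],
)
```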
Endpoint:

```http
POST /ingest-url
```

Body:

```json
{
  "url": "https://python.langchain.com/docs/get_started/introduction/",
  "source": "LangChain Docs"
}
```

Response:

```json
{
  "job_id": "c89a7a7e-9129-4a3a-a2f9-b5af0b25a643",
  "status": "queued"
}
```

Curl Example:

```bash
curl -X POST http://localhost:8000/ingest-url \
  -H "Content-Type: application/json" \
  -d '{"url": "https://python.langchain.com/docs/get_started/introduction/", "source": "LangChain Docs"}'
```

Endpoint:

```http
POST /query
```

Body:

```json
{
  "query": "What is LangChain used for?"
}
```

Response:

```json
{
  "response": "LangChain is used to build applications powered by large language models.",
  "sources": ["https://python.langchain.com/docs/get_started/introduction/"]
}
```

Curl Example:

```bash
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"query": "What is LangChain used for?"}'
```
To run the system locally, clone the repository, create a virtual environment, and install dependencies:

```bash
git clone https://github.com/<your-username>/rag-engine.git
cd rag-engine
python -m venv venv
source venv/bin/activate   # or venv\Scripts\activate on Windows
pip install -r requirements.txt
```

Create a `.env` file based on `.env.example`:
```env
SUPABASE_URL=<your_supabase_url>
SUPABASE_KEY=<your_supabase_key>
QDRANT_URL=<your_qdrant_cloud_url>
QDRANT_API_KEY=<your_qdrant_api_key>
REDIS_URL=redis://localhost:6379
GOOGLE_API_KEY=<your_gemini_api_key>
```
Run the API server:

```bash
uvicorn app.main:app --reload
```

Then start an ingestion worker:

```bash
python -m app.workers.ingestion_worker
```

(Optional) Run multiple workers to test scalability.
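The worker's core loop is not shown in this README; assuming jobs arrive as JSON on a Redis list, it can be as small as the following sketch (the `process` helper is a placeholder for the fetch → chunk → embed → upsert pipeline):

```python
# Sketch: a minimal worker loop. Queue name, job shape, and process() are assumptions.
import json

import redis

queue = redis.Redis.from_url("redis://localhost:6379")

def process(job: dict) -> None:
    # Placeholder: fetch job["url"], chunk the text, embed each chunk,
    # upsert into Qdrant, then update the document's status in Supabase.
    print(f"processing {job['job_id']}: {job['url']}")

while True:
    # BRPOP blocks until a job is available and hands each job to exactly
    # one worker, which is what lets workers scale horizontally.
    _, raw = queue.brpop("ingestion_jobs")
    process(json.loads(raw))
```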
To simulate concurrent ingestion:
```bash
python scripts/stress_test_ingest.py
```

This will trigger 10 simultaneous ingestion jobs against `/ingest-url`.
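The script itself is not reproduced here; a thread-pool version that fires 10 concurrent requests could look like this (target URLs are placeholders):

```python
# Sketch: 10 concurrent POSTs to /ingest-url. URLs and worker count are illustrative.
from concurrent.futures import ThreadPoolExecutor

import requests

URLS = [f"https://example.com/doc/{i}" for i in range(10)]  # placeholder targets

def ingest(url: str) -> dict:
    resp = requests.post(
        "http://localhost:8000/ingest-url",
        json={"url": url, "source": "stress-test"},
        timeout=30,
    )
    return resp.json()

with ThreadPoolExecutor(max_workers=10) as pool:
    for result in pool.map(ingest, URLS):
        print(result)
```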
- Redis ensures distributed, idempotent job processing (one claiming pattern is sketched after this list).
- Workers can scale horizontally (multiple processes or containers).
- Qdrant Cloud handles concurrent read/write loads.
- FastAPI is fully async, capable of handling concurrent queries.
- Supabase maintains ingestion/job consistency.
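On the idempotency point: one common pattern (not necessarily what this repo does) is to have each worker atomically claim a job ID in Redis before processing, so a re-delivered or duplicated job is skipped:

```python
# Sketch: idempotent job claiming with SET NX. The key scheme and TTL are assumptions.
import redis

r = redis.Redis.from_url("redis://localhost:6379")

def claim_job(job_id: str) -> bool:
    # SET ... NX succeeds for exactly one caller per key, so only one worker
    # processes a given job_id; the TTL lets abandoned claims expire.
    return bool(r.set(f"job:{job_id}:claimed", 1, nx=True, ex=3600))
```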
See the pipeline in action!
https://drive.google.com/file/d/1ydQwsPfh6i1cXSDJ_OeS684h2bDJd8v3/view?usp=sharing

