A Retrieval-Augmented Generation (RAG) system that scrapes, indexes, and queries programming documentation (e.g. the Matplotlib or NumPy docs). It lets users ask precise, technical questions about the scraped documentation and receive grounded, cited answers with minimal hallucination.
- 🧩 Overview
- ⚙️ System Architecture
- 🧠 Tech stack
- 🚀 Setup Guide
- 🔄 System Flow
- 🕷️ Crawler
- 🧮 Backend API
- 💫 Embedding & LLM Configuration
- 💬 Frontend Overview
- 📊 RAGAS Evaluation Summary
- 🧾 Credits
code-compass connects multiple components into a cohesive pipeline:
- Crawler – Scrapes documentation websites, converts HTML → Markdown, cleans artifacts, and chunks content for embedding.
- Vector Database – Stores semantic embeddings using Postgres + pgvector.
- Backend (FastAPI) – Handles user prompts, retrieves relevant chunks via similarity search, and queries an LLM to produce grounded answers.
- Frontend (React) – Provides a conversational interface where users can chat with the assistant and view citation links.
- Deployment – Fully containerized using Docker Compose.
| Component | Tech | Description |
|---|---|---|
| Backend | Python, FastAPI, SQLAlchemy, Alembic | REST API & data layer |
| Database | PostgreSQL + pgvector | Stores embeddings for semantic retrieval |
| Crawler | Python + Scrapy + html2text | Fetches & preprocesses docs |
| Embedding Model | BAAI/bge-base-en-v1.5 | Creates document and query embeddings |
| Reranking Model | BAAI/bge-reranker-base | Refines retrieved documents to select the most relevant chunks |
| LLM Options | Gemini 2.5 Flash Lite / Other models | Answers questions using retrieved context |
| Frontend | React + Vite + TypeScript | Chat interface |
| Deployment | Docker Compose | Unified environment setup |
- Clone the repository: `git clone https://github.com/airelcamilo/code-compass.git`
- Create a `.env` file for the backend, crawler, and frontend based on the provided `.env.example` file.
- Build and start the stack: `docker compose up --build`
- Apply database migrations: `alembic upgrade head`
- Modify the `urls.txt` file with the documentation URLs to crawl.
- Run the crawler: `scrapy crawl doc_spider -a urls_file="./urls.txt" --loglevel=INFO`
- Frontend: http://localhost:3001
- Backend: http://localhost:8000
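Once the containers are up, an optional way to confirm both services respond is a quick check from Python. This is not part of the project itself, and it assumes FastAPI's default interactive docs route (`/docs`) has not been disabled:

```python
import requests

# Backend: FastAPI serves interactive docs at /docs unless that route was disabled.
backend = requests.get("http://localhost:8000/docs", timeout=10)
print("backend:", backend.status_code)   # expect 200

# Frontend: the Vite/React app started by docker compose.
frontend = requests.get("http://localhost:3001", timeout=10)
print("frontend:", frontend.status_code)  # expect 200
```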
- Crawl phase
  - The Scrapy crawler fetches documentation pages from the URLs specified by the user.
  - Each page's HTML is cleaned and converted to Markdown using `html2text`, ensuring readability and consistent formatting.
  - During processing, each subsection (the parent) is split into child chunks for better retrieval performance.
  - The resulting parent–child structure ensures that both prose and code snippets are preserved with context, improving retrieval accuracy.
  - Each chunk is embedded into a high-dimensional vector using the `BAAI/bge-base-en-v1.5` model.
  - The resulting text chunks, metadata, and embeddings are stored in a PostgreSQL database with the `pgvector` extension, enabling fast vector similarity search.
- Question phase
  - The user sends a prompt to the `/ask` endpoint.
  - The query is refined using the LLM.
  - The backend embeds the refined query using `BAAI/bge-base-en-v1.5`.
  - It performs a similarity search against stored document chunks in `pgvector`.
  - Retrieved parent chunks are reranked with the `BAAI/bge-reranker-base` model to select the most relevant ones.
  - Previous conversation history (if any) is summarized to provide additional context.
  - A structured, context-aware prompt is then built for the LLM, combining:
    - A style hint based on the query
    - The conversation summary
    - The top-ranked document content
    - The user's question and refined query
  - The LLM generates a fact-grounded answer with citations.
  - The API responds with the answer, citations, and an associated session identifier.
- Conversation persistence
  - The backend includes a unique session identifier in the response header: `X-Session-Id`.
  - The frontend captures and stores this value in `localStorage`.
  - All subsequent `/ask` requests reuse this `X-Session-Id`, allowing the backend to maintain conversation context across multiple questions.
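As a minimal client-side sketch of this flow (using Python's `requests` instead of the React frontend), the session header can be captured from the first response and replayed on follow-up questions. The endpoint and header names follow the API description below; the response body shape is an assumption:

```python
import requests

BASE_URL = "http://localhost:8000"  # backend from the setup guide

# First question: no X-Session-Id is sent, so the backend generates one
# and returns it in the response headers.
first = requests.post(
    f"{BASE_URL}/ask",
    json={"prompt": "How do I plot multiple y axes in Matplotlib?"},
)
session_id = first.headers.get("X-Session-Id")

# Follow-up question: reuse the same session id so the backend can
# summarize the earlier exchange as conversation context.
follow_up = requests.post(
    f"{BASE_URL}/ask",
    json={"prompt": "And how do I label each of those axes?"},
    headers={"X-Session-Id": session_id},
)
print(follow_up.json())  # answer + citations (exact response schema may differ)
```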
Command: `scrapy crawl doc_spider -a urls_file="./urls.txt" --loglevel=INFO`

- Fetch documentation pages from the user-provided URLs using Scrapy.
- Clean HTML content and convert relative links to absolute URLs for consistent referencing.
- Extract structured `DocItem` objects with the following schema: `{ title, section, subsection, content, url, metadata }`
- Each page's HTML is cleaned and converted to Markdown using `html2text`, ensuring readability and consistent formatting.
- During processing, each subsection (parent) is split into child chunks for improved retrieval performance:
  - Code blocks are treated as standalone chunks to preserve formatting, syntax, and context.
  - Text blocks are recursively split into smaller, semantically coherent chunks using `RecursiveCharacterTextSplitter` (chunk size = 800, overlap = 100).
- Each chunk is embedded into a high-dimensional vector using the `BAAI/bge-base-en-v1.5` model to capture its meaning.
- Store all child chunks, parent chunks, metadata, and embeddings in a PostgreSQL database with the `pgvector` extension (a short sketch of the splitting and embedding steps follows this list).
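A rough sketch of the parent–child splitting and embedding described above, assuming LangChain's splitter package and the `sentence-transformers` loader for the BGE model; the actual crawler may organize this differently (different helper names, code-block handling, and database writes):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer

splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)
embedder = SentenceTransformer("BAAI/bge-base-en-v1.5")

def split_subsection(parent_markdown: str) -> list[str]:
    """Split one subsection (the parent) into child text chunks.

    In the real crawler, code blocks are pulled out as standalone chunks
    first; only the prose path is sketched here.
    """
    return splitter.split_text(parent_markdown)

# Hypothetical parent content produced by html2text.
parent = "## Plotting\n\nMatplotlib figures contain one or more Axes objects..."
children = split_subsection(parent)

# One embedding per child chunk; these vectors are what pgvector stores.
embeddings = embedder.encode(children, normalize_embeddings=True)
print(len(children), embeddings.shape)
```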
`X-Session-Id`: Session identifier for conversation continuity. If missing, the backend generates one and returns it in the response headers.
POST /ask
Ask a question based on indexed documentation.
Request
{
"prompt": "How do I plot multiple y axes in Matplotlib?",
"max_token": 8192
}

Behavior:
- Refine the query using the LLM (configurable via `.env`), either Gemini 2.5 Flash Lite or other models.
- Embed the query using `BAAI/bge-base-en-v1.5` to capture its semantic meaning.
- Retrieve the top-k similar document chunks from the `pgvector` database via cosine similarity.
- Rerank the retrieved chunks using the `BAAI/bge-reranker-base` model to identify the most relevant top-n results (see the retrieval sketch after this list).
- Load and summarize the previous conversation (if any) using the stored session context to maintain continuity.
- Construct a structured prompt for the LLM containing:
  - System instructions (e.g., ensure factual accuracy, use citations).
  - A style hint based on the query.
  - The summarized conversation context.
  - The top-ranked retrieved document chunks.
  - The user's current question and refined query.
- Call the selected LLM to generate a contextually grounded answer.
- Return the LLM-generated answer along with structured citations referencing the original source URLs.
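The retrieval and reranking steps could look roughly like the sketch below. The DSN, table, and column names (`chunks`, `embedding`, `content`) are assumptions for illustration; pgvector's `<=>` operator computes cosine distance, and the reranker is loaded as a cross-encoder:

```python
from sentence_transformers import CrossEncoder, SentenceTransformer
from sqlalchemy import create_engine, text

embedder = SentenceTransformer("BAAI/bge-base-en-v1.5")
reranker = CrossEncoder("BAAI/bge-reranker-base")
engine = create_engine("postgresql+psycopg2://user:pass@localhost:5432/codecompass")  # placeholder DSN

def retrieve_and_rerank(query: str, top_k: int = 20, top_n: int = 5) -> list[str]:
    # Embed the (refined) query with the same model used for the documents.
    query_vec = embedder.encode(query, normalize_embeddings=True).tolist()

    # Top-k nearest chunks by cosine distance (pgvector's <=> operator).
    with engine.connect() as conn:
        rows = conn.execute(
            text("SELECT content FROM chunks "
                 "ORDER BY embedding <=> CAST(:q AS vector) LIMIT :k"),
            {"q": str(query_vec), "k": top_k},
        ).fetchall()
    candidates = [row[0] for row in rows]

    # Cross-encoder rerank: score each (query, chunk) pair, keep the best top_n.
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]
```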
GET /conversations
Returns the list of all previous conversation exchanges associated with the current `X-Session-Id`.
Used by the frontend to restore chat history and maintain context across sessions.
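For example, restoring history from a script might look like this (again assuming a plain `requests` client and the header name above; the response shape is not documented here):

```python
import requests

history = requests.get(
    "http://localhost:8000/conversations",
    headers={"X-Session-Id": "previously-stored-session-id"},  # value saved from an earlier /ask response
)
print(history.json())  # prior exchanges for this session
```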
| Setting | Description |
|---|---|
| Embedding model | BAAI/bge-base-en-v1.5 |
| Vector DB | Postgres + pgvector, similarity = cosine distance. |
| LLM models | Configurable via .env: Gemini 2.5 Flash Lite or other models. |
| Token limits | Applied to the prompts sent for query refinement and answer generation. |
Example .env snippet:
MODEL_PROVIDER=gemini
LLAMA_MODEL=models/microsoft_Phi-4-mini-instruct-Q4_K_M.gguf
GEMINI_MODEL=gemini-2.5-flash-lite
GEMINI_API_KEY=

- Built using React + Vite + TypeScript.
- Provides an interactive chat interface.
- Displays assistant answers with clickable citation links.
- Automatically manages `X-Session-Id` for persistent conversation.
Evaluation was performed using the RAGAS framework to measure the quality of retrieval and generation.
| Metric | Mean | Median | Description |
|---|---|---|---|
| Faithfulness | 0.780 | 0.724 | Indicates how accurately the generated answers reflect the retrieved documents. A higher score means fewer hallucinations and better factual grounding. |
| Context Precision | 0.616 | 0.583 | Measures the proportion of retrieved documents that are relevant to the question. High precision implies cleaner, more focused retrieval. |
| Context Recall | 0.641 | 0.666 | Reflects how much of the relevant context was successfully retrieved. High recall ensures completeness of documents. |
| Answer Relevancy | 0.625 | 0.833 | Evaluates how well the final answer addresses the user's query directly and coherently. |
Number of evaluated queries: 10
- Faithfulness (`0.780` mean) is solid, showing that the LLM produces grounded answers with minimal hallucination.
- Context Precision (`0.616` mean) and Context Recall (`0.641` mean) are closely aligned, indicating a balanced retriever that fetches enough relevant documents without excessive noise.
- The high median for Answer Relevancy (`0.833`) compared to its mean (`0.625`) suggests inconsistent query difficulty: some answers are precise, while others were limited by sparse context or less informative chunks.
- Given that the evaluation was conducted on limited hardware using a compact embedding model (`BAAI/bge-base-en-v1.5`), the overall performance is decent for a lightweight RAG system.
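The scores above come from the project's own evaluation run. A minimal sketch of how such a RAGAS run can be reproduced is shown below; the sample record is hypothetical, and the exact dataset column names may vary between RAGAS versions:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, context_recall, faithfulness

# One hypothetical evaluation record; the project's run used 10 queries.
records = {
    "question": ["How do I plot multiple y axes in Matplotlib?"],
    "answer": ["Use ax.twinx() to add a second y-axis that shares the x-axis..."],
    "contexts": [["matplotlib.axes.Axes.twinx creates a twin Axes sharing the x-axis..."]],
    "ground_truth": ["Call ax.twinx() to create a second y-axis sharing the same x-axis."],
}

# evaluate() also needs an LLM/embedding backend configured (e.g. via API key env vars).
result = evaluate(
    Dataset.from_dict(records),
    metrics=[faithfulness, context_precision, context_recall, answer_relevancy],
)
print(result)  # per-metric scores
```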
Airel Camilo Khairan © 2025
