A lightweight AI assistant service with a Gradio UI, LangGraph workflow orchestration, and RAG retrieval (Chroma via LlamaIndex).
Local development runs with Ollama, while production can use the OpenAI API.
- ✅ Current: Vector RAG (LlamaIndex Retriever → Chroma Vector DB)
- 🔜 Optional: KG-augmented RAG / GraphRAG (Graph Retriever → Neo4j Knowledge Graph)
User (Browser)
│
▼
┌───────────────────────────────┐
│ Gradio UI │
│ - Question input │
│ - Answer + Citations output │
└───────────────┬───────────────┘
│
▼
┌───────────────────────────────┐
│ LangGraph │
│ Workflow Orchestration │
│ - State management │
│ - (optional) retry/branching │
└───────────────┬───────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ Retrieval Layer │
│ │
│ (Current) Vector RAG │
│ LlamaIndex Retriever ─────▶ Chroma (Vector DB) │
│ │
│ (Future, Optional) KG-augmented RAG │
│ Graph Retriever ─────────▶ Neo4j (Knowledge Graph) │
└───────────────┬─────────────────────────────────────┘
│
▼
┌───────────────────────────────┐
│ LLM Layer │
│ Prompting + Invocation │
│ │
│ - Local Dev: Ollama │
│ - Production: OpenAI API │
└───────────────┬───────────────┘
│
▼
Answer + Citations
This project does not fine-tune the model.
Your documents are indexed into a vector database and later retrieved and injected into the prompt at question time.
- ✅ “Remembers” via retrieval (Chroma)
- ❌ Does not change model weights (no training / fine-tuning)
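A minimal sketch of what "injected into the prompt at question time" means; the template wording is an assumption, not the repo's actual prompt (that lives in `graph.py`):

```python
# Hypothetical prompt template; the real wording is defined in graph.py / node_generate.
PROMPT_TEMPLATE = """Answer the question using only the context below.

Context:
{context}

Question: {question}

End your answer with a single "Citations:" block listing the sources you used.
"""

def build_prompt(question: str, context: str) -> str:
    # Retrieved chunks are pasted into the prompt; model weights are never updated.
    return PROMPT_TEMPLATE.format(context=context, question=question)
```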
Typical files in this repo:
.
├─ app.py # Gradio UI entrypoint
├─ graph.py # LangGraph workflow (decide → retrieve → generate)
├─ rag_vector.py # Vector RAG retrieval (Chroma via LlamaIndex)
├─ rag_graph.py # (Optional) GraphRAG retrieval (Neo4j)
├─ llm_factory.py # LLM provider factory (Ollama vs OpenAI)
├─ ingest.py # One-time (or on-change) docs → Chroma indexing
└─ data/docs/ # Your knowledge base source documents (md, txt, etc.)
Indexes your knowledge base into Chroma.
- Reads files under `data/docs/`
- Chunks documents and computes embeddings
- Upserts embeddings into Chroma (persistent storage)
Run this before `app.py` (and again whenever docs change).
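A minimal ingestion sketch using LlamaIndex with a persistent Chroma collection. Import paths follow recent `llama-index` releases; the directory, collection name, and embedding model mirror the example `.env` below and are otherwise assumptions:

```python
import chromadb
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.vector_stores.chroma import ChromaVectorStore

# Persistent Chroma store on disk (matches CHROMA_DIR / COLLECTION_NAME).
client = chromadb.PersistentClient(path="./storage/chroma")
collection = client.get_or_create_collection("mechatbot")
vector_store = ChromaVectorStore(chroma_collection=collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Read the knowledge base, chunk it, embed each chunk, and upsert into Chroma.
documents = SimpleDirectoryReader("data/docs").load_data()
embed_model = OllamaEmbedding(model_name="nomic-embed-text")
VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, embed_model=embed_model
)
```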
Query-time retrieval against Chroma:
- Connects to Chroma (`CHROMA_DIR`, `COLLECTION_NAME`)
- Retrieves top-k relevant chunks using the LlamaIndex retriever
- Returns: `context` (joined passages) and `sources` (source path + score per passage)
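A sketch of the query-time side, assuming the same Chroma collection and embedding model as the ingestion step; the `retrieve` helper name is illustrative:

```python
import chromadb
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.vector_stores.chroma import ChromaVectorStore

client = chromadb.PersistentClient(path="./storage/chroma")   # CHROMA_DIR
collection = client.get_or_create_collection("mechatbot")     # COLLECTION_NAME
vector_store = ChromaVectorStore(chroma_collection=collection)
index = VectorStoreIndex.from_vector_store(
    vector_store, embed_model=OllamaEmbedding(model_name="nomic-embed-text")
)
retriever = index.as_retriever(similarity_top_k=4)

def retrieve(question: str):
    """Return (context, sources) for the top-k chunks most similar to the question."""
    nodes = retriever.retrieve(question)
    context = "\n\n".join(n.node.get_content() for n in nodes)
    sources = [(n.node.metadata.get("file_path", "unknown"), n.score) for n in nodes]
    return context, sources
```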
GraphRAG retrieval for entity/relationship-centric queries:
- Uses Neo4j as a Knowledge Graph store
- Produces additional `context` + `sources` to merge with vector results
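Since this part is optional and not yet wired in, the following is only a rough sketch of entity-centric retrieval with the official `neo4j` driver; the connection details, graph schema, and Cypher query are all assumptions:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def graph_retrieve(entity: str):
    """Return (context, sources) built from relationships around one entity."""
    # Assumed schema: nodes carry a `name` property; any relationship type counts.
    cypher = (
        "MATCH (e {name: $name})-[r]-(n) "
        "RETURN e.name AS subject, type(r) AS rel, n.name AS object LIMIT 25"
    )
    with driver.session() as session:
        rows = session.run(cypher, name=entity).data()
    context = "\n".join(f"{r['subject']} {r['rel']} {r['object']}" for r in rows)
    sources = [("neo4j", None) for _ in rows]
    return context, sources
```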
LangGraph orchestration:
- `node_decide`: switches GraphRAG on/off via `USE_GRAPH_RAG`
- `node_vector_retrieve`: runs Vector RAG
- `node_graph_retrieve`: runs GraphRAG (optional)
- `node_generate`: prompts the LLM using retrieved context and forces exactly one `Citations:` block in the final output
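A hedged sketch of the workflow wiring. The state fields, routing logic, and the imports from the repo modules (`retrieve`, `graph_retrieve`, `make_llm`) are assumptions; the real `graph.py` may differ:

```python
import os
from typing import TypedDict

from langgraph.graph import StateGraph, START, END

from rag_vector import retrieve          # assumed helper names from this repo
from rag_graph import graph_retrieve
from llm_factory import make_llm

llm = make_llm()

class RagState(TypedDict, total=False):
    question: str
    use_graph: bool
    context: str
    sources: list
    answer: str

def node_decide(state: RagState) -> RagState:
    # Flip GraphRAG on/off from the environment (USE_GRAPH_RAG).
    return {"use_graph": os.getenv("USE_GRAPH_RAG", "false").lower() == "true"}

def node_vector_retrieve(state: RagState) -> RagState:
    context, sources = retrieve(state["question"])
    return {"context": context, "sources": sources}

def node_graph_retrieve(state: RagState) -> RagState:
    extra, extra_sources = graph_retrieve(state["question"])
    return {"context": state["context"] + "\n" + extra,
            "sources": state["sources"] + extra_sources}

def node_generate(state: RagState) -> RagState:
    prompt = (
        "Answer using only this context:\n\n"
        f"{state['context']}\n\nQuestion: {state['question']}\n\n"
        "Finish with exactly one Citations: block."
    )
    return {"answer": llm.invoke(prompt).content}

builder = StateGraph(RagState)
builder.add_node("decide", node_decide)
builder.add_node("vector_retrieve", node_vector_retrieve)
builder.add_node("graph_retrieve", node_graph_retrieve)
builder.add_node("generate", node_generate)
builder.add_edge(START, "decide")
builder.add_edge("decide", "vector_retrieve")
builder.add_conditional_edges(
    "vector_retrieve",
    lambda s: "graph_retrieve" if s.get("use_graph") else "generate",
)
builder.add_edge("graph_retrieve", "generate")
builder.add_edge("generate", END)
app_graph = builder.compile()
```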
Provider switching:
- Local dev: `ChatOllama(...)` (requires Ollama running)
- Production: OpenAI chat model (requires `OPENAI_API_KEY`)
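A possible factory, assuming the LangChain chat wrappers (`langchain_ollama` / `langchain_openai`); the OpenAI model name is a placeholder:

```python
import os

def make_llm():
    """Return a chat model for the configured provider (LLM_PROVIDER)."""
    if os.getenv("LLM_PROVIDER", "ollama").lower() == "openai":
        from langchain_openai import ChatOpenAI
        # Reads OPENAI_API_KEY from the environment; the model name is an assumption.
        return ChatOpenAI(model="gpt-4o-mini", temperature=0)
    from langchain_ollama import ChatOllama
    return ChatOllama(
        model=os.getenv("OLLAMA_MODEL", "llama3.1:8b"),
        base_url=os.getenv("OLLAMA_BASE_URL", "http://localhost:11434"),
        temperature=0,
    )
```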
Gradio UI:
- Chat input
- Chat output with citations
- CSS tuned to avoid “double scroll” (one scroll area inside the chat only)
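A minimal UI sketch with `gr.ChatInterface`; the CSS tweak is omitted, and `app_graph` refers to the compiled workflow from the LangGraph sketch above (an assumed import, not necessarily how `app.py` is structured):

```python
import gradio as gr

from graph import app_graph  # assumed: the compiled LangGraph workflow

def chat_fn(message, history):
    # Run the workflow and return the answer, which already ends with a Citations: block.
    result = app_graph.invoke({"question": message})
    return result["answer"]

demo = gr.ChatInterface(fn=chat_fn, title="RAG Assistant")

if __name__ == "__main__":
    demo.launch()
```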
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

ollama serve
# Pull your chat model (example)
ollama pull llama3.1:8b
# Pull embedding model used by ingest (example)
ollama pull nomic-embed-text

python ingest.py

python app.py

Minimal local dev:
LLM_PROVIDER=ollama
USE_GRAPH_RAG=false
CHROMA_DIR=./storage/chroma
COLLECTION_NAME=mechatbot
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=llama3.1:8b
OLLAMA_EMBED_MODEL=nomic-embed-text
MAX_CITATIONS=2

Production (OpenAI):
LLM_PROVIDER=openai
OPENAI_API_KEY=YOUR_KEY
⚠️ Never commit `.env` to Git.
Ollama is not running:
ollama serve

Pull the model name you configured:
ollama pull llama3.1:8b

- Lower `top_k` in `graph.py` / `rag_vector.py`
- Use `MAX_CITATIONS=2` (or 1)
- Add a similarity threshold in `rag_vector.py` (optional; see the sketch after this list)
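One way to add the optional threshold in `rag_vector.py`: drop retrieved nodes whose similarity score falls below a cutoff. The value 0.35 is only an example and should be tuned against your own data:

```python
MIN_SIMILARITY = 0.35  # example cutoff; tune for your documents and embedding model

def filter_by_score(nodes, min_score=MIN_SIMILARITY):
    """Keep only retrieved nodes with a similarity score at or above the cutoff."""
    return [n for n in nodes if n.score is not None and n.score >= min_score]

# Usage inside retrieve(): nodes = filter_by_score(retriever.retrieve(question))
```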
- RAG indexing stores your docs locally (Chroma).
- If you use OpenAI in production, retrieved text may be sent to the API at inference time.
- With Ollama (local), inference stays on your machine.
Add your preferred license here.
To reset the local Chroma index:

rm -rf ./storage/chroma