A full-stack LLM chat application with Retrieval-Augmented Generation (RAG), built with FastAPI, Streamlit, and Ollama. Supports multiple interaction modes including direct chat, document-grounded Q&A with source citations, structured tool output, and automated RAG evaluation.
(Demo video: `2026-02-09.14-59-21.mp4`)
- Chat Mode -- Direct conversation with a local LLM
- RAG Mode -- Document-grounded answers with retrieved source citations (filename, page number, relevance score)
- Tool Mode -- LLM outputs structured robot-action JSON, validated by the backend (see the sketch after this list)
- Evaluation Mode -- Automated RAG accuracy scoring against a predefined question bank
- Streaming -- Real-time token-by-token responses via Server-Sent Events (SSE)
- Session Persistence -- SQLite-backed conversation history with multi-session support and automatic reload
- Single Config -- All settings (model, server URLs, RAG parameters) in one `config.yaml`
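For illustration, Tool Mode responses can be validated against a Pydantic schema before being returned to the client. The field names below (`action`, `target`, `parameters`) are assumptions for this sketch; the actual request/response models live in `backend/models.py`.

```python
import json
from pydantic import BaseModel, ValidationError

class RobotAction(BaseModel):
    """Hypothetical robot-action schema; the real models live in backend/models.py."""
    action: str
    target: str
    parameters: dict = {}

def validate_tool_output(raw_llm_output: str) -> RobotAction | None:
    """Parse the LLM's JSON string and validate it against the schema."""
    try:
        payload = json.loads(raw_llm_output)
        return RobotAction(**payload)
    except (json.JSONDecodeError, ValidationError):
        return None  # Malformed output: the caller can re-prompt or return an error

print(validate_tool_output('{"action": "move", "target": "shelf_2"}'))
```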
```
User --> Streamlit (8501) --> FastAPI (8000) --> Ollama (11434)
                                  |
                                  +--> SQLite (sessions.db)
                                  +--> FAISS (index/)
                                  +--> RAG Pipeline
```
Frontend: Streamlit application with modular components for each interaction mode. Handles streaming display, mode switching, session management, and document upload.
Backend: FastAPI server exposing routers for chat, RAG, tool, and eval endpoints. A services layer handles LLM communication (Ollama), session persistence (SQLite), and the full RAG pipeline.
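A minimal sketch of how a streaming endpoint along these lines can relay Ollama's newline-delimited JSON stream to the browser as SSE. This is illustrative only: the request fields and SSE event format are assumptions, and the real implementation lives in `backend/routers/` and `backend/services/llm.py`.

```python
import json
import httpx
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()
OLLAMA_URL = "http://localhost:11434/api/generate"  # from config.yaml's llm_server_url

@app.post("/chat/stream")
async def chat_stream(body: dict):
    """Relay Ollama's newline-delimited JSON stream to the client as SSE."""
    async def event_stream():
        async with httpx.AsyncClient(timeout=None) as client:
            async with client.stream(
                "POST", OLLAMA_URL,
                # "message" is an assumed field name for this sketch
                json={"model": "llama3.2:latest", "prompt": body["message"], "stream": True},
            ) as response:
                async for line in response.aiter_lines():
                    if not line:
                        continue
                    chunk = json.loads(line)          # Ollama sends one JSON object per line
                    token = chunk.get("response", "")
                    yield f"data: {json.dumps({'token': token})}\n\n"
                    if chunk.get("done"):
                        break
        yield "data: [DONE]\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")
```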
RAG pipeline: PDF extraction (PyPDF2) → sentence-aware chunking with overlap → embeddings (SentenceTransformers all-MiniLM-L6-v2) → FAISS vector similarity search → top-k retrieval with metadata.
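The retrieval core of that pipeline, sketched with the libraries named above. Chunking is simplified to fixed-size character windows and the sample file path is made up; the real, sentence-aware implementation is in `backend/services/rag/`.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Fixed-size character chunking with overlap (the real chunker is sentence-aware)."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

# Build the index: embed chunks and add them to a FAISS index
chunks = chunk(open("data/example.txt").read())          # hypothetical sample document
vectors = embedder.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(vectors.shape[1])              # inner product ≈ cosine on normalized vectors
index.add(np.asarray(vectors, dtype="float32"))

# Retrieval: embed the query and take the top-k most similar chunks
query_vec = embedder.encode(["How do I reset the device?"], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query_vec, dtype="float32"), 3)
for score, i in zip(scores[0], ids[0]):
    print(f"{score:.3f}  {chunks[i][:80]}...")
```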
Storage:
- Sessions: SQLite database (`sessions.db`) storing timestamped, mode-tagged messages with session IDs
- Vector Index: FAISS index files in `index/`

Project structure:

```
.
├── run.sh # One-command startup script
├── config.yaml # Central configuration
├── requirements.txt # Python dependencies
├── backend/
│ ├── main.py # FastAPI app entry point
│ ├── config.py # Loads config.yaml into Settings
│ ├── models.py # Pydantic request/response models
│ ├── db/
│ │ └── database.py # SQLite session storage
│ ├── routers/
│ │ ├── chat.py # /chat endpoints
│ │ ├── rag.py # /rag endpoints
│ │ ├── tool.py # /tool endpoints
│ │ └── eval.py # /eval endpoints
│ ├── services/
│ │ ├── llm.py # Ollama client + streaming
│ │ ├── session.py # Session management
│ │ ├── eval.py # RAG evaluation logic
│ │ └── rag/
│ │ ├── __init__.py # Retriever (orchestrates RAG pipeline)
│ │ ├── document_processor.py # PDF text extraction
│ │ ├── chunker.py # Sentence-aware chunking
│ │ ├── embedder.py # SentenceTransformers embeddings
│ │ └── vector_store.py # FAISS index + retrieval
│ └── eval_questions.yaml # Evaluation question bank
├── frontend/
│ ├── app.py # Streamlit entry point
│ ├── config.py # Frontend config loader
│ └── components/
│ ├── chat.py # Chat mode UI
│ ├── rag_chat.py # RAG mode UI
│ ├── tool_chat.py # Tool mode UI
│ ├── eval.py # Evaluation mode UI
│ └── sidebar.py # Sidebar (session, mode, docs)
├── data/ # PDF documents for RAG
├── index/ # Generated FAISS index files
└── requirements.md # Original assignment specification
```
- Python 3.11+
- Ollama -- local LLM inference server
If Ollama is not already installed:

macOS / Linux:

```bash
curl -fsSL https://ollama.com/install.sh | sh
```

macOS (Homebrew):

```bash
brew install ollama
```

Windows / Other: download from ollama.com/download.

Verify the installation:

```bash
ollama --version
```

```bash
# Clone the repo
git clone <repo-url>
cd LLM_WebUI_RAG_Grade_2

# Install Python dependencies
pip install -r requirements.txt

# Start everything (Ollama, backend, frontend) with one command
chmod +x run.sh
./run.sh
```

The script will:
- Start Ollama if it isn't already running
- Pull the `llama3.2:latest` model (one-time download)
- Launch the FastAPI backend on port 8000
- Launch the Streamlit frontend on port 8501
Open http://localhost:8501 in your browser.
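Optionally, confirm the backend is up and the RAG index status is readable before using the UI. A small sketch assuming both endpoints return JSON (they are listed in the API table below):

```python
import requests

# Quick sanity check that the backend and RAG pipeline answered correctly
for endpoint in ("/health", "/rag/status"):
    resp = requests.get(f"http://localhost:8000{endpoint}", timeout=5)
    print(endpoint, resp.status_code, resp.json())
```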
If you prefer to start each component separately:
```bash
# Terminal 1 -- Start Ollama
ollama serve

# Terminal 2 -- Pull the model (one-time)
ollama pull llama3.2:latest

# Terminal 3 -- Start the backend
uvicorn backend.main:app --host 0.0.0.0 --port 8000

# Terminal 4 -- Start the frontend
streamlit run frontend/app.py --server.port 8501
```

All settings live in `config.yaml`:
| Setting | Default | Description |
|---|---|---|
| `model` | `llama3.2:latest` | Ollama model to use |
| `llm_server_url` | `http://localhost:11434` | Ollama server URL |
| `backend_host` | `0.0.0.0` | Backend bind address |
| `backend_port` | `8000` | Backend port |
| `backend_url` | `http://localhost:8000` | URL the frontend uses to reach the backend |
| `db_path` | `sessions.db` | SQLite database path |
| `session_memory_turns` | `10` | Number of conversation turns to reload per session |
| `embedding_model` | `all-MiniLM-L6-v2` | SentenceTransformers model for embeddings |
| `vector_store_path` | `./index/faiss.index` | FAISS index file path |
| `chunk_size` | `500` | Character-based chunk size for document splitting |
| `chunk_overlap` | `50` | Overlap between chunks |
| `top_k` | `3` | Number of retrieved chunks per query |
| `max_context_tokens` | `4096` | Maximum context window for RAG prompts |
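`backend/config.py` loads these values into a `Settings` object. A minimal sketch of how such a loader might look (the exact class layout is an assumption; only a subset of keys is shown):

```python
from dataclasses import dataclass
import yaml  # PyYAML

@dataclass
class Settings:
    """Mirrors a subset of the keys in config.yaml; defaults from the table above."""
    model: str = "llama3.2:latest"
    llm_server_url: str = "http://localhost:11434"
    backend_port: int = 8000
    db_path: str = "sessions.db"
    embedding_model: str = "all-MiniLM-L6-v2"
    chunk_size: int = 500
    chunk_overlap: int = 50
    top_k: int = 3

def load_settings(path: str = "config.yaml") -> Settings:
    with open(path) as f:
        raw = yaml.safe_load(f) or {}
    # Keep only keys the dataclass knows about, so extra YAML keys don't break loading
    known = {k: v for k, v in raw.items() if k in Settings.__dataclass_fields__}
    return Settings(**known)

settings = load_settings()
```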
| Method | Endpoint | Description |
|---|---|---|
| GET | `/health` | Health check |
| POST | `/stream?mode=chat\|rag\|tool` | Unified streaming endpoint (SSE) |
| POST | `/chat` | Blocking chat response |
| POST | `/chat/stream` | Streaming chat (SSE) |
| GET | `/chat/history` | Retrieve session conversation history |
| GET | `/chat/sessions` | List recent sessions |
| POST | `/rag` | Blocking RAG response |
| POST | `/rag/stream` | Streaming RAG query (SSE) |
| POST | `/rag/upload` | Upload PDF/TXT documents and ingest |
| POST | `/rag/ingest` | Build index from `./data/` directory |
| GET | `/rag/status` | Check RAG readiness and indexed docs |
| DELETE | `/rag/documents/{filename}` | Remove a document and rebuild index |
| POST | `/tool` | Blocking tool-mode response |
| POST | `/tool/stream` | Streaming tool query (SSE) |
| GET | `/eval` | Run automated RAG evaluation |
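As a quick smoke test, the streaming endpoints can be driven from any SSE-capable client. The sketch below uses `requests`; the payload fields and the `data:`-prefixed event format are assumptions, so check `backend/models.py` for the actual schema.

```python
import json
import requests

# Assumed request body; the real schema is defined in backend/models.py
payload = {"message": "What does the manual say about calibration?", "session_id": "demo"}

with requests.post(
    "http://localhost:8000/stream?mode=rag", json=payload, stream=True, timeout=120
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data:"):
            continue                      # skip blank SSE separators / keep-alives
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        # Each event is assumed to carry a JSON object with the next token
        print(json.loads(data).get("token", ""), end="", flush=True)
```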
A full-stack LLM + RAG application with:
- FastAPI backend serving both blocking and streaming (SSE) endpoints for chat and RAG queries, with SQLite session persistence.
- Streamlit frontend with clearly labeled modes (Chat, RAG, Tool, Eval), real-time token-by-token streaming, document upload/management, source citation display, session switching, conversation history filtering by mode, and per-response performance metrics (TTFT, total time, tokens/sec).
- RAG pipeline using PyPDF2 for PDF extraction, sentence-boundary-aware chunking with overlap, SentenceTransformers (`all-MiniLM-L6-v2`) for embeddings, and FAISS for vector similarity search. Retrieved sources are displayed with filename, page number, relevance score, and text preview.
- Session management with SQLite storage of all messages (timestamped, mode-tagged), session listing, and automatic reload of the most recent N turns when switching sessions (see the sketch below).
Architecture follows a clean separation: `backend/routers/` for API endpoints, `backend/services/` for business logic (LLM, session, RAG), and `frontend/components/` for UI modules.
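Session reload boils down to one query over the message table. A sketch assuming a hypothetical `messages(session_id, role, content, mode, timestamp)` schema; the real schema is defined in `backend/db/database.py`.

```python
import sqlite3

def load_recent_turns(db_path: str, session_id: str, turns: int = 10) -> list[tuple[str, str]]:
    """Return the last N turns (user + assistant pairs) for a session, oldest first."""
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            """
            SELECT role, content FROM messages
            WHERE session_id = ?
            ORDER BY timestamp DESC
            LIMIT ?
            """,
            (session_id, turns * 2),   # one turn = one user + one assistant message
        ).fetchall()
    finally:
        con.close()
    return list(reversed(rows))        # chronological order for prompt construction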
- Streaming across the full stack -- Getting SSE to work reliably from Ollama through FastAPI through Streamlit required careful handling of chunked responses, connection timeouts, and partial JSON parsing.
- Session state in Streamlit -- Streamlit reruns the entire script on every interaction, so managing mode switches, session switches, and history filtering without losing state or triggering unnecessary reruns took iteration (see the sketch after this list).
- RAG context construction -- Balancing chunk size, overlap, and the number of retrieved sources to give the LLM enough context without exceeding token limits or diluting relevance.
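As an example of the state handling involved in the second point, widget values keyed into `st.session_state` survive Streamlit's reruns, so history only needs to be re-fetched when the selected session actually changes. Illustrative only; the real logic is in `frontend/components/`.

```python
import streamlit as st

# Widget values keyed into st.session_state persist across Streamlit's script reruns
mode = st.sidebar.radio("Mode", ["Chat", "RAG", "Tool", "Eval"], key="mode")
session_id = st.sidebar.text_input("Session ID", value="default", key="session_id")

# Re-fetch history only when the selected session actually changes, not on every rerun
if st.session_state.get("loaded_session") != session_id:
    st.session_state["history"] = []   # the real app would call GET /chat/history here
    st.session_state["loaded_session"] = session_id

# Show only messages recorded under the currently selected mode
for msg in st.session_state["history"]:
    if msg.get("mode") == mode:
        st.chat_message(msg["role"]).write(msg["content"])
```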
- Better chunking -- Use token-based chunking instead of character-based for more precise control (see the sketch below).
- Connection pooling -- Replace per-request SQLite connections with a connection pool for better concurrent performance.
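A rough sketch of what token-based chunking could look like, purely hypothetical and not part of the current codebase; it assumes the Hugging Face tokenizer behind `all-MiniLM-L6-v2`:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def chunk_by_tokens(text: str, max_tokens: int = 128, overlap: int = 16) -> list[str]:
    """Split text on token boundaries so each chunk fits the embedding model's window."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(ids), step):
        window = ids[start:start + max_tokens]
        chunks.append(tokenizer.decode(window))
        if start + max_tokens >= len(ids):
            break
    return chunks
```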