LLM RAG WebUI

A full-stack LLM chat application with Retrieval-Augmented Generation (RAG), built with FastAPI, Streamlit, and Ollama. Supports multiple interaction modes including direct chat, document-grounded Q&A with source citations, structured tool output, and automated RAG evaluation.

DEMO VIDEO:

2026-02-09.14-59-21.mp4

Features

  • Chat Mode -- Direct conversation with a local LLM
  • RAG Mode -- Document-grounded answers with retrieved source citations (filename, page number, relevance score)
  • Tool Mode -- LLM outputs structured robot-action JSON, validated by the backend
  • Evaluation Mode -- Automated RAG accuracy scoring against a predefined question bank
  • Streaming -- Real-time token-by-token responses via Server-Sent Events (SSE)
  • Session Persistence -- SQLite-backed conversation history with multi-session support and automatic reload
  • Single Config -- All settings (model, server URLs, RAG parameters) in one config.yaml

Architecture

User --> Streamlit (8501) --> FastAPI (8000) --> Ollama (11434)
                                  |
                                  +--> SQLite (sessions.db)
                                  +--> FAISS  (index/)
                                  +--> RAG Pipeline

Frontend (frontend/)

Streamlit application with modular components for each interaction mode. Handles streaming display, mode switching, session management, and document upload.
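
As a rough illustration of the streaming display, a component might consume the backend's SSE stream along these lines (a minimal sketch, not the repo's exact code; the request body fields and SSE payload format are assumptions):

# Minimal sketch of token-by-token display in a Streamlit component.
# The request body fields and the "data: <token>" payload format are assumptions.
import requests
import streamlit as st

def stream_chat(prompt: str, session_id: str, backend_url: str = "http://localhost:8000") -> str:
    placeholder = st.empty()                      # updated in place as tokens arrive
    answer = ""
    with requests.post(
        f"{backend_url}/chat/stream",
        json={"message": prompt, "session_id": session_id},   # hypothetical body
        stream=True,
        timeout=120,
    ) as resp:
        for line in resp.iter_lines(decode_unicode=True):
            if not line or not line.startswith("data:"):
                continue                          # skip blank SSE separators
            answer += line.removeprefix("data: ") # drop the SSE field prefix
            placeholder.markdown(answer)          # re-render the partial answer
    return answer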

Backend (backend/)

FastAPI server exposing routers for chat, RAG, tool, and eval endpoints. Services layer handles LLM communication (Ollama), session persistence (SQLite), and the full RAG pipeline.
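
For orientation, an SSE-streaming router in this layout could look roughly like the sketch below (names such as ChatRequest and generate_tokens are placeholders, not the repo's actual code):

# Sketch of an SSE-streaming FastAPI endpoint; illustrative only.
from fastapi import APIRouter
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

router = APIRouter()

class ChatRequest(BaseModel):
    message: str
    session_id: str | None = None

async def generate_tokens(prompt: str):
    # Placeholder for the Ollama client call; yields tokens one at a time.
    for token in ("Hello", ",", " ", "world"):
        yield token

@router.post("/chat/stream")
async def chat_stream(req: ChatRequest):
    async def event_source():
        async for token in generate_tokens(req.message):
            yield f"data: {token}\n\n"            # one SSE frame per token
        yield "data: [DONE]\n\n"                  # hypothetical end-of-stream marker
    return StreamingResponse(event_source(), media_type="text/event-stream")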

RAG Pipeline (backend/services/rag/)

PDF extraction (PyPDF2) → sentence-aware chunking with overlap → embeddings (SentenceTransformers all-MiniLM-L6-v2) → FAISS vector similarity search → top-k retrieval with metadata.
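
A condensed sketch of that pipeline (character-based chunking here for brevity; the repo's chunker is sentence-aware and keeps page metadata, and data/example.pdf is a placeholder filename):

# Condensed RAG pipeline sketch: PDF -> chunks -> embeddings -> FAISS -> top-k.
import faiss
from PyPDF2 import PdfReader
from sentence_transformers import SentenceTransformer

def extract_text(pdf_path: str) -> str:
    return "\n".join(page.extract_text() or "" for page in PdfReader(pdf_path).pages)

def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    # Simple character-based chunking with overlap (the repo splits on sentences).
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

model = SentenceTransformer("all-MiniLM-L6-v2")
chunks = chunk(extract_text("data/example.pdf"))
embeddings = model.encode(chunks, normalize_embeddings=True).astype("float32")

index = faiss.IndexFlatIP(embeddings.shape[1])    # inner product on normalized vectors
index.add(embeddings)

query = model.encode(["What does the document cover?"], normalize_embeddings=True).astype("float32")
scores, ids = index.search(query, 3)              # top_k = 3, as in config.yaml
for score, i in zip(scores[0], ids[0]):
    print(round(float(score), 3), chunks[i][:80])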

Persistence

  • Sessions: SQLite database (sessions.db) storing timestamped, mode-tagged messages with session IDs; a possible table layout is sketched below
  • Vector Index: FAISS index files in index/
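
The session table definition isn't shown in the repository docs; a minimal layout consistent with the description above might be (an assumption, not the actual schema):

# Hypothetical sessions.db schema; the repo's actual table and column names may differ.
import sqlite3

conn = sqlite3.connect("sessions.db")
conn.execute("""
CREATE TABLE IF NOT EXISTS messages (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    session_id TEXT NOT NULL,                      -- groups messages into a conversation
    mode       TEXT NOT NULL,                      -- chat | rag | tool | eval
    role       TEXT NOT NULL,                      -- user | assistant
    content    TEXT NOT NULL,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
)
""")
conn.commit()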

Project Structure

.
├── run.sh                     # One-command startup script
├── config.yaml                # Central configuration
├── requirements.txt           # Python dependencies
├── backend/
│   ├── main.py                # FastAPI app entry point
│   ├── config.py              # Loads config.yaml into Settings
│   ├── models.py              # Pydantic request/response models
│   ├── db/
│   │   └── database.py        # SQLite session storage
│   ├── routers/
│   │   ├── chat.py            # /chat endpoints
│   │   ├── rag.py             # /rag endpoints
│   │   ├── tool.py            # /tool endpoints
│   │   └── eval.py            # /eval endpoints
│   ├── services/
│   │   ├── llm.py             # Ollama client + streaming
│   │   ├── session.py         # Session management
│   │   ├── eval.py            # RAG evaluation logic
│   │   └── rag/
│   │       ├── __init__.py              # Retriever (orchestrates RAG pipeline)
│   │       ├── document_processor.py    # PDF text extraction
│   │       ├── chunker.py              # Sentence-aware chunking
│   │       ├── embedder.py             # SentenceTransformers embeddings
│   │       └── vector_store.py         # FAISS index + retrieval
│   └── eval_questions.yaml    # Evaluation question bank
├── frontend/
│   ├── app.py                 # Streamlit entry point
│   ├── config.py              # Frontend config loader
│   └── components/
│       ├── chat.py            # Chat mode UI
│       ├── rag_chat.py        # RAG mode UI
│       ├── tool_chat.py       # Tool mode UI
│       ├── eval.py            # Evaluation mode UI
│       └── sidebar.py         # Sidebar (session, mode, docs)
├── data/                      # PDF documents for RAG
├── index/                     # Generated FAISS index files
└── requirements.md            # Original assignment specification

Prerequisites

  • Python 3.11+
  • Ollama -- local LLM inference server

Installing Ollama

If Ollama is not already installed:

macOS / Linux:

curl -fsSL https://ollama.com/install.sh | sh

macOS (Homebrew):

brew install ollama

Windows / Other:

Download from ollama.com/download.

Verify the installation:

ollama --version

Quick Start

# Clone the repo
git clone <repo-url>
cd Fullstack-RAG

# Install Python dependencies
pip install -r requirements.txt

# Start everything (Ollama, backend, frontend) with one command
chmod +x run.sh
./run.sh

The script will:

  1. Start Ollama if it isn't already running
  2. Pull the llama3.2:latest model (one-time download)
  3. Launch the FastAPI backend on port 8000
  4. Launch the Streamlit frontend on port 8501

Open http://localhost:8501 in your browser.


Manual Start

If you prefer to start each component separately:

# Terminal 1 -- Start Ollama
ollama serve

# Terminal 2 -- Pull the model (one-time)
ollama pull llama3.2:latest

# Terminal 3 -- Start the backend
uvicorn backend.main:app --host 0.0.0.0 --port 8000

# Terminal 4 -- Start the frontend
streamlit run frontend/app.py --server.port 8501

Configuration

All settings live in config.yaml:

Setting               Default                 Description
model                 llama3.2:latest         Ollama model to use
llm_server_url        http://localhost:11434  Ollama server URL
backend_host          0.0.0.0                 Backend bind address
backend_port          8000                    Backend port
backend_url           http://localhost:8000   URL the frontend uses to reach the backend
db_path               sessions.db             SQLite database path
session_memory_turns  10                      Number of conversation turns to reload per session
embedding_model       all-MiniLM-L6-v2        SentenceTransformers model for embeddings
vector_store_path     ./index/faiss.index     FAISS index file path
chunk_size            500                     Character-based chunk size for document splitting
chunk_overlap         50                      Overlap between chunks
top_k                 3                       Number of retrieved chunks per query
max_context_tokens    4096                    Maximum context window for RAG prompts
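
backend/config.py loads these values into a Settings object; a minimal loader along those lines might look like this (a sketch using the defaults from the table, not the repo's actual implementation):

# Sketch of loading config.yaml into a Settings object; illustrative only.
from dataclasses import dataclass
import yaml

@dataclass
class Settings:
    model: str = "llama3.2:latest"
    llm_server_url: str = "http://localhost:11434"
    backend_host: str = "0.0.0.0"
    backend_port: int = 8000
    backend_url: str = "http://localhost:8000"
    db_path: str = "sessions.db"
    session_memory_turns: int = 10
    embedding_model: str = "all-MiniLM-L6-v2"
    vector_store_path: str = "./index/faiss.index"
    chunk_size: int = 500
    chunk_overlap: int = 50
    top_k: int = 3
    max_context_tokens: int = 4096

def load_settings(path: str = "config.yaml") -> Settings:
    with open(path) as f:
        data = yaml.safe_load(f) or {}
    known = {k: v for k, v in data.items() if k in Settings.__dataclass_fields__}
    return Settings(**known)

settings = load_settings()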

API Endpoints

Method   Endpoint                          Description
GET      /health                           Health check
POST     /stream?mode=chat|rag|tool        Unified streaming endpoint (SSE)
POST     /chat                             Blocking chat response
POST     /chat/stream                      Streaming chat (SSE)
GET      /chat/history                     Retrieve session conversation history
GET      /chat/sessions                    List recent sessions
POST     /rag                              Blocking RAG response
POST     /rag/stream                       Streaming RAG query (SSE)
POST     /rag/upload                       Upload PDF/TXT documents and ingest
POST     /rag/ingest                       Build index from ./data/ directory
GET      /rag/status                       Check RAG readiness and indexed docs
DELETE   /rag/documents/{filename}         Remove a document and rebuild index
POST     /tool                             Blocking tool-mode response
POST     /tool/stream                      Streaming tool query (SSE)
GET      /eval                             Run automated RAG evaluation
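
As a quick smoke test against a running backend (the JSON body for the streaming call is an assumption; see backend/models.py for the real request schema):

# Hit the health, status, and unified streaming endpoints with requests.
import requests

BASE = "http://localhost:8000"

print(requests.get(f"{BASE}/health").json())       # health check
print(requests.get(f"{BASE}/rag/status").json())   # RAG readiness + indexed docs

with requests.post(f"{BASE}/stream", params={"mode": "chat"},
                   json={"message": "Hello"},      # hypothetical body
                   stream=True, timeout=120) as resp:
    for line in resp.iter_lines(decode_unicode=True):
        if line and line.startswith("data:"):
            print(line.removeprefix("data: "), end="", flush=True)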

Write-Up

What I Built

A full-stack LLM + RAG application with:

  • FastAPI backend serving both blocking and streaming (SSE) endpoints for chat and RAG queries, with SQLite session persistence.
  • Streamlit frontend with clearly labeled modes (Chat, RAG, Tool, Eval), real-time token-by-token streaming, document upload/management, source citation display, session switching, conversation history filtering by mode, and per-response performance metrics (TTFT, total time, tokens/sec).
  • RAG pipeline using PyPDF2 for PDF extraction, sentence-boundary-aware chunking with overlap, SentenceTransformers (all-MiniLM-L6-v2) for embeddings, and FAISS for vector similarity search. Retrieved sources are displayed with filename, page number, relevance score, and text preview.
  • Session management with SQLite storage of all messages (timestamped, mode-tagged), session listing, and automatic reload of the most recent N turns when switching sessions.

Architecture follows a clean separation: backend/routers/ for API endpoints, backend/services/ for business logic (LLM, session, RAG), and frontend/components/ for UI modules.

Challenges

  • Streaming across the full stack -- Getting SSE to work reliably from Ollama through FastAPI to Streamlit required careful handling of chunked responses, connection timeouts, and partial JSON parsing.
  • Session state in Streamlit -- Streamlit reruns the entire script on every interaction, so managing mode switches, session switches, and history filtering without losing state or triggering unnecessary reruns took iteration.
  • RAG context construction -- Balancing chunk size, overlap, and the number of retrieved sources to give the LLM enough context without exceeding token limits or diluting relevance (one simple budgeting approach is sketched below).
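
One simple way to enforce that budget, sketched with a rough characters-per-token estimate rather than the repo's actual logic:

# Fit retrieved chunks into a token budget before building the RAG prompt.
# Uses an approximate 4-characters-per-token heuristic; illustrative only.
def build_context(chunks: list[str], max_context_tokens: int = 4096) -> str:
    budget = max_context_tokens * 4                # approximate character budget
    picked, used = [], 0
    for chunk in chunks:                           # chunks assumed sorted by relevance
        if used + len(chunk) > budget:
            break
        picked.append(chunk)
        used += len(chunk)
    return "\n\n".join(picked)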

Future Improvements

  • Better chunking -- Use token-based chunking instead of character-based for more precise control.
  • Connection pooling -- Replace per-request SQLite connections with a connection pool for better concurrent performance.
