A full-stack LLM chat application with Retrieval-Augmented Generation (RAG), built with FastAPI, Streamlit, and Ollama. Supports multiple interaction modes including direct chat, document-grounded Q&A with source citations, structured tool output, and automated RAG evaluation.
(Demo video: `2026-02-09.14-59-21.mp4`)
- Chat Mode -- Direct conversation with a local LLM
- RAG Mode -- Document-grounded answers with retrieved source citations (filename, page number, relevance score)
- Tool Mode -- LLM outputs structured robot-action JSON, validated by the backend (see the sketch after this list)
- Evaluation Mode -- Automated RAG accuracy scoring against a predefined question bank
- Streaming -- Real-time token-by-token responses via Server-Sent Events (SSE)
- Session Persistence -- SQLite-backed conversation history with multi-session support and automatic reload
- Single Config -- All settings (model, server URLs, RAG parameters) in one `config.yaml`
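For illustration, Tool Mode responses can be validated against a Pydantic schema before being returned to the client. The field names below (`action`, `target`, `parameters`) are assumptions for this sketch; the actual request/response models live in `backend/models.py`.

```python
import json
from pydantic import BaseModel, ValidationError

class RobotAction(BaseModel):
    """Hypothetical robot-action schema; the real models live in backend/models.py."""
    action: str
    target: str
    parameters: dict = {}

def validate_tool_output(raw_llm_output: str) -> RobotAction | None:
    """Parse the LLM's JSON string and validate it against the schema."""
    try:
        payload = json.loads(raw_llm_output)
        return RobotAction(**payload)
    except (json.JSONDecodeError, ValidationError):
        return None  # Malformed output: the caller can re-prompt or return an error

print(validate_tool_output('{"action": "move", "target": "shelf_2"}'))
```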
```
User --> Streamlit (8501) --> FastAPI (8000) --> Ollama (11434)
                                  |
                                  +--> SQLite (sessions.db)
                                  +--> FAISS (index/)
                                  +--> RAG Pipeline
```
Frontend: Streamlit application with modular components for each interaction mode. Handles streaming display, mode switching, session management, and document upload.
Backend: FastAPI server exposing routers for chat, RAG, tool, and eval endpoints. A services layer handles LLM communication (Ollama), session persistence (SQLite), and the full RAG pipeline.
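A minimal sketch of how a streaming endpoint along these lines can relay Ollama's newline-delimited JSON stream to the browser as SSE. This is illustrative only: the request fields and SSE event format are assumptions, and the real implementation lives in `backend/routers/` and `backend/services/llm.py`.

```python
import json
import httpx
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()
OLLAMA_URL = "http://localhost:11434/api/generate"  # from config.yaml's llm_server_url

@app.post("/chat/stream")
async def chat_stream(body: dict):
    """Relay Ollama's newline-delimited JSON stream to the client as SSE."""
    async def event_stream():
        async with httpx.AsyncClient(timeout=None) as client:
            async with client.stream(
                "POST", OLLAMA_URL,
                # "message" is an assumed field name for this sketch
                json={"model": "llama3.2:latest", "prompt": body["message"], "stream": True},
            ) as response:
                async for line in response.aiter_lines():
                    if not line:
                        continue
                    chunk = json.loads(line)          # Ollama sends one JSON object per line
                    token = chunk.get("response", "")
                    yield f"data: {json.dumps({'token': token})}\n\n"
                    if chunk.get("done"):
                        break
        yield "data: [DONE]\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")
```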
RAG pipeline: PDF extraction (PyPDF2) → sentence-aware chunking with overlap → embeddings (SentenceTransformers all-MiniLM-L6-v2) → FAISS vector similarity search → top-k retrieval with metadata.
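The retrieval core of that pipeline, sketched with the libraries named above. Chunking is simplified to fixed-size character windows and the sample file path is made up; the real, sentence-aware implementation is in `backend/services/rag/`.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Fixed-size character chunking with overlap (the real chunker is sentence-aware)."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

# Build the index: embed chunks and add them to a FAISS index
chunks = chunk(open("data/example.txt").read())          # hypothetical sample document
vectors = embedder.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(vectors.shape[1])              # inner product ≈ cosine on normalized vectors
index.add(np.asarray(vectors, dtype="float32"))

# Retrieval: embed the query and take the top-k most similar chunks
query_vec = embedder.encode(["How do I reset the device?"], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query_vec, dtype="float32"), 3)
for score, i in zip(scores[0], ids[0]):
    print(f"{score:.3f}  {chunks[i][:80]}...")
```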
Storage:
- Sessions: SQLite database (`sessions.db`) storing timestamped, mode-tagged messages with session IDs
- Vector Index: FAISS index files in `index/`

Project structure:

```
.
├── run.sh # One-command startup script
├── config.yaml # Central configuration
├── requirements.txt # Python dependencies
├── backend/
│ ├── main.py # FastAPI app entry point
│ ├── config.py # Loads config.yaml into Settings
│ ├── models.py # Pydantic request/response models
│ ├── db/
│ │ └── database.py # SQLite session storage
│ ├── routers/
│ │ ├── chat.py # /chat endpoints
│ │ ├── rag.py # /rag endpoints
│ │ ├── tool.py # /tool endpoints
│ │ └── eval.py # /eval endpoints
│ ├── services/
│ │ ├── llm.py # Ollama client + streaming
│ │ ├── session.py # Session management
│ │ ├── eval.py # RAG evaluation logic
│ │ └── rag/
│ │ ├── __init__.py # Retriever (orchestrates RAG pipeline)
│ │ ├── document_processor.py # PDF text extraction
│ │ ├── chunker.py # Sentence-aware chunking
│ │ ├── embedder.py # SentenceTransformers embeddings
│ │ └── vector_store.py # FAISS index + retrieval
│ └── eval_questions.yaml # Evaluation question bank
├── frontend/
│ ├── app.py # Streamlit entry point
│ ├── config.py # Frontend config loader
│ └── components/
│ ├── chat.py # Chat mode UI
│ ├── rag_chat.py # RAG mode UI
│ ├── tool_chat.py # Tool mode UI
│ ├── eval.py # Evaluation mode UI
│ └── sidebar.py # Sidebar (session, mode, docs)
├── data/ # PDF documents for RAG
├── index/ # Generated FAISS index files
└── requirements.md # Original assignment specification
```
- Python 3.11+
- Ollama -- local LLM inference server
If Ollama is not already installed:

macOS / Linux:

```bash
curl -fsSL https://ollama.com/install.sh | sh
```

macOS (Homebrew):

```bash
brew install ollama
```

Windows / Other: download from ollama.com/download.

Verify the installation:

```bash
ollama --version
```

```bash
# Clone the repo
git clone <repo-url>
cd LLM_WebUI_RAG_Grade_2

# Install Python dependencies
pip install -r requirements.txt

# Start everything (Ollama, backend, frontend) with one command
chmod +x run.sh
./run.sh
```

The script will:
- Start Ollama if it isn't already running
- Pull the `llama3.2:latest` model (one-time download)
- Launch the FastAPI backend on port 8000
- Launch the Streamlit frontend on port 8501
Open http://localhost:8501 in your browser.
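Optionally, confirm the backend is up and the RAG index status is readable before using the UI. A small sketch assuming both endpoints return JSON (they are listed in the API table below):

```python
import requests

# Quick sanity check that the backend and RAG pipeline answered correctly
for endpoint in ("/health", "/rag/status"):
    resp = requests.get(f"http://localhost:8000{endpoint}", timeout=5)
    print(endpoint, resp.status_code, resp.json())
```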
If you prefer to start each component separately:
```bash
# Terminal 1 -- Start Ollama
ollama serve

# Terminal 2 -- Pull the model (one-time)
ollama pull llama3.2:latest

# Terminal 3 -- Start the backend
uvicorn backend.main:app --host 0.0.0.0 --port 8000

# Terminal 4 -- Start the frontend
streamlit run frontend/app.py --server.port 8501
```

All settings live in `config.yaml`:
| Setting | Default | Description |
|---|---|---|
| `model` | `llama3.2:latest` | Ollama model to use |
| `llm_server_url` | `http://localhost:11434` | Ollama server URL |
| `backend_host` | `0.0.0.0` | Backend bind address |
| `backend_port` | `8000` | Backend port |
| `backend_url` | `http://localhost:8000` | URL the frontend uses to reach the backend |
| `db_path` | `sessions.db` | SQLite database path |
| `session_memory_turns` | `10` | Number of conversation turns to reload per session |
| `embedding_model` | `all-MiniLM-L6-v2` | SentenceTransformers model for embeddings |
| `vector_store_path` | `./index/faiss.index` | FAISS index file path |
| `chunk_size` | `500` | Character-based chunk size for document splitting |
| `chunk_overlap` | `50` | Overlap between chunks |
| `top_k` | `3` | Number of retrieved chunks per query |
| `max_context_tokens` | `4096` | Maximum context window for RAG prompts |
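`backend/config.py` loads these values into a `Settings` object. A minimal sketch of how such a loader might look (the exact class layout is an assumption; only a subset of keys is shown):

```python
from dataclasses import dataclass
import yaml  # PyYAML

@dataclass
class Settings:
    """Mirrors a subset of the keys in config.yaml; defaults from the table above."""
    model: str = "llama3.2:latest"
    llm_server_url: str = "http://localhost:11434"
    backend_port: int = 8000
    db_path: str = "sessions.db"
    embedding_model: str = "all-MiniLM-L6-v2"
    chunk_size: int = 500
    chunk_overlap: int = 50
    top_k: int = 3

def load_settings(path: str = "config.yaml") -> Settings:
    with open(path) as f:
        raw = yaml.safe_load(f) or {}
    # Keep only keys the dataclass knows about, so extra YAML keys don't break loading
    known = {k: v for k, v in raw.items() if k in Settings.__dataclass_fields__}
    return Settings(**known)

settings = load_settings()
```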
| Method | Endpoint | Description |
|---|---|---|
| GET | `/health` | Health check |
| POST | `/stream?mode=chat\|rag\|tool` | Unified streaming endpoint (SSE) |
| POST | `/chat` | Blocking chat response |
| POST | `/chat/stream` | Streaming chat (SSE) |
| GET | `/chat/history` | Retrieve session conversation history |
| GET | `/chat/sessions` | List recent sessions |
| POST | `/rag` | Blocking RAG response |
| POST | `/rag/stream` | Streaming RAG query (SSE) |
| POST | `/rag/upload` | Upload PDF/TXT documents and ingest |
| POST | `/rag/ingest` | Build index from `./data/` directory |
| GET | `/rag/status` | Check RAG readiness and indexed docs |
| DELETE | `/rag/documents/{filename}` | Remove a document and rebuild index |
| POST | `/tool` | Blocking tool-mode response |
| POST | `/tool/stream` | Streaming tool query (SSE) |
| GET | `/eval` | Run automated RAG evaluation |
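As a quick smoke test, the streaming endpoints can be driven from any SSE-capable client. The sketch below uses `requests`; the payload fields and the `data:`-prefixed event format are assumptions, so check `backend/models.py` for the actual schema.

```python
import json
import requests

# Assumed request body; the real schema is defined in backend/models.py
payload = {"message": "What does the manual say about calibration?", "session_id": "demo"}

with requests.post(
    "http://localhost:8000/stream?mode=rag", json=payload, stream=True, timeout=120
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data:"):
            continue                      # skip blank SSE separators / keep-alives
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        # Each event is assumed to carry a JSON object with the next token
        print(json.loads(data).get("token", ""), end="", flush=True)
```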
A full-stack LLM + RAG application with:
- FastAPI backend serving both blocking and streaming (SSE) endpoints for chat and RAG queries, with SQLite session persistence.
- Streamlit frontend with clearly labeled modes (Chat, RAG, Tool, Eval), real-time token-by-token streaming, document upload/management, source citation display, session switching, conversation history filtering by mode, and per-response performance metrics (TTFT, total time, tokens/sec).
- RAG pipeline using PyPDF2 for PDF extraction, sentence-boundary-aware chunking with overlap, SentenceTransformers (`all-MiniLM-L6-v2`) for embeddings, and FAISS for vector similarity search. Retrieved sources are displayed with filename, page number, relevance score, and text preview.
- Session management with SQLite storage of all messages (timestamped, mode-tagged), session listing, and automatic reload of the most recent N turns when switching sessions (see the sketch below).
Architecture follows a clean separation: `backend/routers/` for API endpoints, `backend/services/` for business logic (LLM, session, RAG), and `frontend/components/` for UI modules.
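Session reload boils down to one query over the message table. A sketch assuming a hypothetical `messages(session_id, role, content, mode, timestamp)` schema; the real schema is defined in `backend/db/database.py`.

```python
import sqlite3

def load_recent_turns(db_path: str, session_id: str, turns: int = 10) -> list[tuple[str, str]]:
    """Return the last N turns (user + assistant pairs) for a session, oldest first."""
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            """
            SELECT role, content FROM messages
            WHERE session_id = ?
            ORDER BY timestamp DESC
            LIMIT ?
            """,
            (session_id, turns * 2),   # one turn = one user + one assistant message
        ).fetchall()
    finally:
        con.close()
    return list(reversed(rows))        # chronological order for prompt construction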
- Streaming across the full stack -- Getting SSE to work reliably from Ollama through FastAPI through Streamlit required careful handling of chunked responses, connection timeouts, and partial JSON parsing.
- Session state in Streamlit -- Streamlit reruns the entire script on every interaction, so managing mode switches, session switches, and history filtering without losing state or triggering unnecessary reruns took iteration (see the sketch after this list).
- RAG context construction -- Balancing chunk size, overlap, and the number of retrieved sources to give the LLM enough context without exceeding token limits or diluting relevance.
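As an example of the state handling involved in the second point, widget values keyed into `st.session_state` survive Streamlit's reruns, so history only needs to be re-fetched when the selected session actually changes. Illustrative only; the real logic is in `frontend/components/`.

```python
import streamlit as st

# Widget values keyed into st.session_state persist across Streamlit's script reruns
mode = st.sidebar.radio("Mode", ["Chat", "RAG", "Tool", "Eval"], key="mode")
session_id = st.sidebar.text_input("Session ID", value="default", key="session_id")

# Re-fetch history only when the selected session actually changes, not on every rerun
if st.session_state.get("loaded_session") != session_id:
    st.session_state["history"] = []   # the real app would call GET /chat/history here
    st.session_state["loaded_session"] = session_id

# Show only messages recorded under the currently selected mode
for msg in st.session_state["history"]:
    if msg.get("mode") == mode:
        st.chat_message(msg["role"]).write(msg["content"])
```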
- Better chunking -- Use token-based chunking instead of character-based for more precise control (see the sketch below).
- Connection pooling -- Replace per-request SQLite connections with a connection pool for better concurrent performance.
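A rough sketch of what token-based chunking could look like, purely hypothetical and not part of the current codebase; it assumes the Hugging Face tokenizer behind `all-MiniLM-L6-v2`:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def chunk_by_tokens(text: str, max_tokens: int = 128, overlap: int = 16) -> list[str]:
    """Split text on token boundaries so each chunk fits the embedding model's window."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(ids), step):
        window = ids[start:start + max_tokens]
        chunks.append(tokenizer.decode(window))
        if start + max_tokens >= len(ids):
            break
    return chunks
```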