This project implements a context-aware conversational chatbot designed to handle multi-turn conversations reliably.
The system maintains short-term conversational memory, supports automatic fallback between language models, and uses in-memory caching to reduce latency and API usage. If the primary cloud-based model fails, the chatbot seamlessly switches to a locally hosted model without interrupting the user experience.
The objective of this project is to demonstrate practical system design, reliability, and engineering trade-offs in conversational AI — not to train or fine-tune language models.
- Session-based chat history is maintained
- Previous user and assistant messages are included in future prompts
- Enables coherent follow-up questions and context-aware responses
- History is kept in memory (not persisted) for simplicity and speed, as sketched below
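A minimal sketch of what such a session store could look like, assuming a plain Python dict keyed by session id (names like `SESSION_HISTORY` and `add_message` are illustrative, not the project's actual identifiers):

```python
from collections import defaultdict

# Illustrative in-memory session store: session_id -> ordered list of messages.
# Everything lives in process memory, so history disappears on restart.
SESSION_HISTORY: dict[str, list[dict]] = defaultdict(list)

def add_message(session_id: str, role: str, content: str) -> None:
    """Append one user or assistant turn to the session's history."""
    SESSION_HISTORY[session_id].append({"role": role, "content": content})

def get_history(session_id: str, max_turns: int = 10) -> list[dict]:
    """Return the most recent turns for prompt construction."""
    return SESSION_HISTORY[session_id][-max_turns:]
```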
- Primary model: Groq-hosted LLM (cloud-based, low latency)
- Fallback model: Ollama (locally hosted)
- Automatic fallback occurs if the primary model:
  - Fails
  - Times out
  - Encounters API, quota, or network errors
- Ensures uninterrupted responses even when external services fail
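A hedged sketch of the fallback behavior described above, assuming an OpenAI-compatible Groq chat-completions endpoint and an Ollama server on its default local port; the model names and timeout values are illustrative, not the project's actual configuration:

```python
import os
import httpx

GROQ_URL = "https://api.groq.com/openai/v1/chat/completions"  # Groq's OpenAI-compatible API
OLLAMA_URL = "http://localhost:11434/api/chat"                 # Ollama's default local endpoint

async def generate_reply(messages: list[dict]) -> str:
    """Try the cloud model first; fall back to the local model on any failure."""
    try:
        async with httpx.AsyncClient(timeout=15) as client:
            resp = await client.post(
                GROQ_URL,
                headers={"Authorization": f"Bearer {os.getenv('GROQ_API_KEY')}"},
                json={"model": "llama-3.1-8b-instant", "messages": messages},
            )
            resp.raise_for_status()  # quota, auth, and server errors raise here
            return resp.json()["choices"][0]["message"]["content"]
    except (httpx.HTTPError, KeyError):
        # Covers timeouts, network failures, HTTP errors, and malformed responses.
        async with httpx.AsyncClient(timeout=60) as client:
            resp = await client.post(
                OLLAMA_URL,
                json={"model": "llama3", "messages": messages, "stream": False},
            )
            resp.raise_for_status()
            return resp.json()["message"]["content"]
```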
- Relevant past messages are retrieved from the session history
- Retrieved messages are appended directly to the prompt
- No embeddings or vector databases are used
- Prompt-based retrieval is sufficient for short conversational context
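In code, this kind of retrieval is just a slice of recent history prepended to the new message; a minimal sketch, with the system prompt and turn limit as illustrative assumptions:

```python
# "Retrieval" here is simply the last few turns of the session,
# placed ahead of the new user message before calling the model.
SYSTEM_PROMPT = {"role": "system", "content": "You are a helpful assistant."}

def build_messages(history: list[dict], user_message: str, max_turns: int = 10) -> list[dict]:
    """Assemble the prompt: system prompt + recent history + current message."""
    recent = history[-max_turns:]  # no embeddings or vector search involved
    return [SYSTEM_PROMPT, *recent, {"role": "user", "content": user_message}]

# A follow-up question stays answerable because the earlier turn is included.
history = [
    {"role": "user", "content": "My name is Arun"},
    {"role": "assistant", "content": "Nice to meet you, Arun!"},
]
print(build_messages(history, "What is my name?"))
```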
- Frequently repeated user queries are cached in memory
- Cached responses are returned instantly
- Reduces latency, API calls, and external model usage costs
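A minimal sketch of such a cache, assuming queries are normalized (lower-cased, whitespace-collapsed) before lookup; the function and variable names are illustrative:

```python
# Illustrative response cache: normalized query text -> cached reply.
# No TTL or eviction, matching the limitations noted later in this document.
RESPONSE_CACHE: dict[str, str] = {}

def normalize(query: str) -> str:
    """Fold case and whitespace so trivially repeated queries hit the cache."""
    return " ".join(query.lower().split())

def cached_reply(query: str) -> str | None:
    return RESPONSE_CACHE.get(normalize(query))

def store_reply(query: str, reply: str) -> None:
    RESPONSE_CACHE[normalize(query)] = reply
```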
- Lightweight frontend using HTML, CSS, and JavaScript
- User messages are right-aligned
- Assistant messages are left-aligned
- Clean, chat-style UI similar to common messaging applications
```
User (Browser)
   → FastAPI Backend
        ├── Cache (instant return if hit)
        ├── Session Memory
        └── Primary LLM (Groq)
              └── on failure → Fallback LLM (Ollama)
```
- Backend: FastAPI (Python)
- Frontend: HTML, CSS, JavaScript
- Primary LLM: Groq API
- Fallback LLM: Ollama (local inference)
- HTTP Client: httpx
- Environment Management: python-dotenv
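Configuration is read from the environment via python-dotenv; a brief sketch, where the variable names are assumptions rather than the project's actual keys:

```python
import os
from dotenv import load_dotenv

load_dotenv()  # loads key=value pairs from a local .env file into the environment

GROQ_API_KEY = os.getenv("GROQ_API_KEY")                        # assumed variable name
OLLAMA_URL = os.getenv("OLLAMA_URL", "http://localhost:11434")  # assumed, with a default
```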
- User sends a message through the browser interface
- The backend receives the message along with a session identifier
- Conversation history for that session is retrieved
- Cache is checked for an existing response
- If no cache hit:
  - The primary LLM (Groq) is called
  - On failure, the system automatically falls back to Ollama
- The response is stored in session memory and cache
- The response is returned to the frontend
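Put together, the request path above could look roughly like the following FastAPI endpoint; the route, request model, and helper names are illustrative assumptions, and the LLM call is stubbed out where the fallback sketch shown earlier would plug in:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# In-memory stores (see the earlier sketches); both reset when the server restarts.
SESSION_HISTORY: dict[str, list[dict]] = {}
RESPONSE_CACHE: dict[str, str] = {}

class ChatRequest(BaseModel):
    session_id: str
    message: str

async def generate_reply(messages: list[dict]) -> str:
    """Stand-in for the Groq call with Ollama fallback sketched earlier."""
    return "stub reply"

@app.post("/chat")
async def chat(req: ChatRequest) -> dict:
    history = SESSION_HISTORY.setdefault(req.session_id, [])

    # 1. Cache check: identical queries are answered without any model call.
    key = " ".join(req.message.lower().split())
    if key in RESPONSE_CACHE:
        return {"reply": RESPONSE_CACHE[key], "cached": True}

    # 2. Build the prompt from session history plus the new message.
    messages = [*history, {"role": "user", "content": req.message}]

    # 3. Call the primary model; generate_reply falls back to Ollama on failure.
    reply = await generate_reply(messages)

    # 4. Store the turn in session memory and cache the answer.
    history.append({"role": "user", "content": req.message})
    history.append({"role": "assistant", "content": reply})
    RESPONSE_CACHE[key] = reply

    # 5. Return the reply to the frontend.
    return {"reply": reply, "cached": False}
```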
This project intentionally avoids over-engineering and sticks strictly to the stated objectives.
- Context-aware multi-turn conversation
- Automatic model fallback for reliability
- Practical system design with clear separation of components
- Cost-aware API usage through caching
- Robust behavior under failure conditions
❌ A production-ready chatbot
- No authentication, rate limiting, monitoring, logging, or deployment hardening
❌ A custom-trained or fine-tuned LLM
- Uses existing pre-trained models only
❌ A full Retrieval-Augmented Generation (RAG) system
- No vector databases
- No embeddings
- No semantic document search
- Memory is session-based and non-persistent
- No database-backed long-term memory
- No cache TTL or eviction policies
- Cache is fully in-memory
- Ollama responses may be of lower quality than those from the cloud model
- Local inference requires sufficient system resources
These limitations are intentional and aligned with the project scope.
- Activate the virtual environment: `venv\Scripts\activate`
- Start the FastAPI server: `python -m uvicorn app.main:app --reload`
- Open the application in a browser: `http://127.0.0.1:8000`
- Hi
- My name is Arun
- What is my name?
- I have my Calculus exam tomorrow
- What subject is my exam for?
- What is 25 × 16?
- Ask the same question twice to observe caching
- Disconnect internet access to test Ollama fallback
Click here to view the demo video
This project prioritizes clarity, reliability, and correctness over unnecessary complexity.
All architectural and design choices were made deliberately to match the stated objectives and constraints, making the project suitable for academic evaluation and technical review.