Agentic RAG is a production-grade, multi-modal reasoning engine designed specifically to run on constrained consumer hardware (RTX 3050 / 6GB VRAM). Unlike traditional "Silent Failure" RAG systems, it uses a self-corrective StateGraph architecture to achieve 100% recall on technical domain data.
In V2, the system transcends text, integrating CLIP-based visual perception to ingest, retrieve, and reason over diagrams and images within technical documentation.
The system treats visual assets as first-class citizens:
- Extraction: Uses `PyMuPDF` to extract images from PDFs and anchors them to their surrounding textual context.
- Embeddings: Employs CLIP (ViT-B-32) for a joint text-image semantic space.
- Vision-Aware Agent: Uses a Vision-Language Model (Gemini Flash) via an `examine_image` tool for nuanced analysis of retrieved diagrams.
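The point of a joint text-image space is that a text query can rank images directly, with no OCR or captioning step in between. A minimal sketch of that cross-modal lookup, using toy 4-dimensional vectors as stand-ins for real CLIP (ViT-B-32) embeddings (which are 512-dimensional) and hypothetical file names:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy stand-ins for CLIP outputs; real vectors come from the same encoder pair.
text_query = [0.9, 0.1, 0.0, 0.2]   # embedding of "pump wiring diagram"
image_embeds = {
    "fig_3_wiring.png":  [0.8, 0.2, 0.1, 0.3],
    "fig_7_housing.png": [0.1, 0.9, 0.4, 0.0],
}

# Because text and images share one space, a text query ranks images directly.
best = max(image_embeds, key=lambda k: cosine(text_query, image_embeds[k]))
```

In the real pipeline the extracted page images and their anchored text snippets are embedded once at ingestion time and stored in ChromaDB with metadata pointing back to the source page.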
We solve the "Semantic Smear" problem of dense vector search:
- Recall: Merges BM25 (Keyword) and ChromaDB (Vector) in a hybrid pipeline.
- Re-Ranking: Employs a local TinyBERT Cross-Encoder (`ms-marco-MiniLM-L-6-v2`) on GPU to surgically identify the most relevant context.
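The exact merge strategy isn't shown here; a common and dependency-free way to combine a keyword ranking with a vector ranking is Reciprocal Rank Fusion (RRF), sketched below with hypothetical document IDs. The fused list is what would then be passed to the cross-encoder for final re-ranking.

```python
def rrf_merge(bm25_ranked, vector_ranked, k=60):
    """Reciprocal Rank Fusion: score(doc) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in (bm25_ranked, vector_ranked):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# BM25 favours exact keywords; the vector index favours paraphrases.
bm25   = ["doc_torque_spec", "doc_warranty", "doc_intro"]
vector = ["doc_install", "doc_torque_spec", "doc_intro"]

fused = rrf_merge(bm25, vector)
# doc_torque_spec ranks high in both lists, so it tops the fused ranking.
```

RRF needs only rank positions, not raw scores, which sidesteps the problem that BM25 and cosine scores live on incomparable scales.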
Solves "Context Fragmentation", where the model receives small fragments but lacks the full technical narrative:
- Parent-Child Indexing: Children (~400 chars) are used for high-precision search; upon a hit, the full Parent context (~2000 chars) is retrieved from a JSON store.
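The expansion step above is simple: every child chunk carries its parent's ID, and a hit on the child swaps in the parent text before generation. A minimal sketch, with a hypothetical in-memory stand-in for the JSON parent store:

```python
# Hypothetical stand-in for the on-disk JSON parent store.
parent_store = {
    "parent_17": "Full ~2000-char section on torque specifications ...",
    "parent_18": "Full ~2000-char section on lubrication intervals ...",
}

# Each ~400-char child chunk carries its parent's ID in metadata.
child_hits = [
    {"chunk": "Tighten M8 bolts to 24 Nm ...", "parent_id": "parent_17"},
    {"chunk": "... torque wrench calibration ...", "parent_id": "parent_17"},
]

def expand_to_parents(hits, store):
    """Swap precise child hits for their full parent context, deduplicated."""
    seen, parents = set(), []
    for hit in hits:
        pid = hit["parent_id"]
        if pid not in seen:
            seen.add(pid)
            parents.append(store[pid])
    return parents

context = expand_to_parents(child_hits, parent_store)
```

Searching over small children keeps retrieval precise, while answering over parents keeps the narrative intact; deduplication prevents the same parent from being injected twice when several of its children match.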
Built on LangGraph, the agent iterates until it finds the "Gold" answer:
- Query Rewriter: Transforms vague user input into precise technical queries.
- ReAct Loop: Explicit Reasoning-Action-Observation loop that manages tool failure and hallucination.
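The control flow of that loop can be sketched in plain Python; this is an illustrative skeleton, not the project's actual LangGraph StateGraph, and the `rewrite_query`, `retrieve`, and `grade` stubs are hypothetical stand-ins for LLM and retriever calls:

```python
MAX_ITERATIONS = 3

def rewrite_query(query, attempt):
    """Hypothetical stand-in for the LLM query rewriter."""
    return f"{query} (refined, attempt {attempt})"

def retrieve(query):
    """Hypothetical retriever: succeeds only once the query is refined."""
    return ["relevant chunk"] if "refined" in query else []

def grade(docs):
    """Self-corrective check: did retrieval produce usable context?"""
    return bool(docs)

def react_loop(query):
    """Reason -> Act (retrieve) -> Observe (grade) -> rewrite and retry."""
    for attempt in range(1, MAX_ITERATIONS + 1):
        docs = retrieve(query)
        if grade(docs):
            return {"answer_context": docs, "iterations": attempt}
        query = rewrite_query(query, attempt)
    return {"answer_context": [], "iterations": MAX_ITERATIONS}

result = react_loop("fix pump")
```

The iteration cap is what prevents the "Silent Failure" mode: rather than answering from bad context or looping forever, the agent either converges on graded context or exits with an explicit empty result.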
| Feature | v1 (Baseline) | v2 (Multi-Modal) |
|---|---|---|
| Main LLM | Phi-3-mini-4k-instruct | Phi-3-mini-4k-instruct |
| Fallback LLM | Gemini 1.5 Pro (REST) | Gemini 2.0 Flash (REST) |
| Embedding Model | all-MiniLM-L6-v2 | CLIP (ViT-B-32) |
| Embedding Method | Text-Only Semantic | Joint Text-Image Semantic |
| Vector Space | ChromaDB (Local) | ChromaDB (Local + Metadata) |
| Context Window | 4,096 Tokens | 128K - 1M (Cloud Fallback) |
| Tokens / Sec | ~15 TPS | ~15 TPS (Local) |
| Re-Ranker | None | TinyBERT (Cross-Encoder) |
| VRAM Usage | ~4.2 GB | ~5.5 GB (on RTX 3050) |
Validation performed on 50+ fictional technical documentation pairs using the Gold-Standard dataset generator.
| Metric | Local Phi-3 (Zero-Shot) | Agentic RAG (V2) |
|---|---|---|
| Accuracy (Text) | 0% (Hallucination) | 98.2% |
| Accuracy (Vision) | N/A | 62.5% |
| Recall | 12% | 100% |
| Mean Latency | ~2.1s | ~5.8s |
agentic-rag/
├── src/agentic_rag/
│ ├── agent.py # LangGraph StateGraph & ReAct Loop
│ ├── ingestor.py # Parent-Child Ingestion & Image Extraction
│ ├── retriever.py # Hybrid (BM25 + Chroma) + Cross-Encoder
│ ├── embedding.py # CLIP Multi-modal Embedding Logic
│ ├── llm.py # Local Llama-cpp + Gemini REST Fallback
│   └── tools.py       # Discovery & Vision Analysis Tools
├── scripts/
│ ├── demo_v2.py # End-to-end Multi-modal Demo
│ ├── evaluate_v2.py # Text-based RAG Benchmarks
│ └── evaluate_vision.py # Vision-based RAG Benchmarks
└── data/ # Vector Store & Vision Cache
- Python 3.10+
- NVIDIA GPU (6GB+ VRAM recommended for local Re-ranking)
- Gemini API Key (for vision reasoning/fallback)
```bash
python scripts/test_v2_ingestion.py
python scripts/demo_v2.py
python scripts/evaluate_vision.py
```

Built with ❤️ by STiFLeR7
Enterprise Intelligence without Cloud Reliance.