Skip to content

On-premise agentic RAG solution with hybrid retrieval and real-time hallucination detection. 85% hit rate, 93% hallucination mitigation precision

Notifications You must be signed in to change notification settings

hakeematyab/Queryable-Shared-Reference-Repository

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

99 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Logo

Queryable Shared Reference Repository

Privacy-focused, on-premises RAG system for scientific literature

Demo Video GitHub

Built for Vitek Lab, Northeastern University

The Problem

Research groups manage thousands of scientific papers but can't use cloud LLMs due to privacy concerns with sensitive research data. Existing reference managers lack intelligent querying, and LLMs hallucinate—fabricating citations and facts that undermine research integrity.

The Solution

A fully on-premises agentic RAG system that enables natural language queries across scientific literature with built-in hallucination detection and mitigation—no external API calls, complete data privacy.

Key Results

Objective Target Achieved Status
Retrieval (Hit Rate@5) ≥75% 85.1%
Retrieval (MRR@5) ≥65% 86.4%
Generation Faithfulness ≥85% 88.6%
Answer Relevancy ≥80% 80.04%
Hallucination Detection (F1) ≥80% 85.3%
Hallucination Mitigation (Precision) ≥85% 93%
Latency (Simple Query) <10s ~4.6s
Latency (Complex Query) <60s ~12-15s
GPU Memory ≤25GB ~18GB
External APIs None Fully Private

Research Insights

Hallucination Mitigation Strategies

Evaluated four prompting approaches on answerable, unanswerable, and borderline queries:

Strategy Best For Precision Recall
Baseline Low 100%
Explicit IDK Clear questions ~93% ~50%
Confidence Threshold High-stakes 100% ~20%
Confidence Rubric Ambiguous queries ~87%* ~40%

*Only ~6% precision drop on borderline queries vs ~29% for Explicit IDK

Recommendation: Use Explicit IDK for standard queries; switch to Confidence Rubric for ambiguous questions.

Context Length & "Lost in the Middle"

Discovered that model conservatism increases with context length—not hallucination rate. Key finding: answers in the middle of long contexts are hardest to retrieve.

Practical guidance:

  • Limit conversations to ~10% of context window, OR
  • Implement aggressive context summarization
  • Front-load critical information in prompts

Tech Stack

Component Selection Rationale
Embedding Gemma (8K context) Best Hit Rate/MRR with hybrid chunking
Reranker GTE Reranker Best MRR + large context window for scalability
Retrieval BM25 + Semantic + Reranker Robust real-world performance
Generation Qwen3 8B Highest faithfulness + relevancy balance
Hallucination Detection Bespoke RoBERTa Best F1 per billion parameters
Document Processing Docling Layout-aware extraction with structure preservation

Infrastructure: Runs on Mac Studio M2 Ultra (~18GB VRAM utilized)

Quick Start

# Clone and setup
git clone https://github.com/hakeematyab/Queryable-Shared-Reference-Repository.git
cd Queryable-Shared-Reference-Repository/app
chmod +x setup.sh startup.sh shutdown.sh

# Start all services
./startup.sh

# Access at http://localhost:3000

Requirements: Python 3.11+, Node.js 18+, Ollama, ~25GB VRAM

Features

  • Natural Language Queries — Ask questions across your paper collection
  • Three-Tiered Trust Badges — Visual grounding indicators (✓ Green, ⚠ Amber, ✕ Red)
  • Citation Tracking — Source attribution for all responses
  • Paper Ingestion — Upload PDFs directly through the interface
  • Chat History — Persistent conversations per user
  • Fully Private — No external API calls, runs entirely on-premises

Documentation

About

On-premise agentic RAG solution with hybrid retrieval and real-time hallucination detection. 85% hit rate, 93% hallucination mitigation precision

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 3

  •  
  •  
  •