High-confidence document Q&A system using LLM swarm consensus with hallucination detection and correction.
Quick Links: 🚀 Quick Start | Design Doc | All Docs
Ready to process your PDF?
- Interactive TUI: Simply run `frfr` to launch the Terminal User Interface for visual session management
- CLI Mode: Use `frfr process` for one-command extraction and querying, or see QUICKSTART.md for details
Run `frfr` without any arguments to launch the interactive TUI:

```bash
frfr
```

The TUI provides:
- Session Browser: View and navigate all sessions with document counts and fact statistics
- Session Detail View: Explore documents within a session
- Facts Browser: Filter and search through extracted facts with real-time search
- Query Interface: Ask natural language questions about your facts
- Keyboard Navigation: Full keyboard shortcuts (`q` to quit, `?` for help, `ESC` to go back)
All CLI commands are still available:
```bash
frfr <command>   # Use specific CLI command
frfr --cli       # Show CLI help
frfr tui         # Explicitly launch TUI
```

Frfr extracts structured, validated facts from complex documents (SOC2 reports, penetration test reports, design specs) with high precision.
Current Implementation (V5 - Production Ready):
- ✅ PDF text extraction with OCR fallback
- ✅ LLM-based fact extraction with enhanced metadata (8 fields)
- ✅ Maximum depth extraction mode
- ✅ Multiple evidence quotes support (V5)
- ✅ Real-time validation against source text
- ✅ Parallel processing with resume capability
- ✅ Post-processing pipeline (QV tagging, filtering)
- ✅ Document-aware sessions with intelligent naming
- ✅ Multi-document support with automatic session renaming
Planned Features (Future Phases):
- 🔮 Multiple LLM instances with swarm consensus
- 🔮 Semantic comparison and clustering
- 🔮 Contradiction detection and resolution
- 🔮 Judge model synthesis
- 🔮 Interactive Q&A over extracted facts
```
┌─────────────────────────────────────────┐
│              User Interface             │
│  ┌────────────────┐  ┌───────────────┐  │
│  │ TUI (Textual)  │  │  CLI (Rich)   │  │
│  │ - Session      │  │ - One-command │  │
│  │   Browser      │  │   workflow    │  │
│  │ - Facts View   │  │ - Scriptable  │  │
│  │ - Query UI     │  │               │  │
│  └────────────────┘  └───────────────┘  │
└────────┬────────────────────────────────┘
         │
         ▼
┌────────────────────────────────────────┐
│      Session Management (Local)        │
│  - Session tracking & resume           │
│  - Progress persistence                │
│  - Artifact storage                    │
└────────┬───────────────────────────────┘
         │
         ▼
┌────────────────────────────────────────┐
│     Document Processing (Active)       │
│  - PDF OCR (ImageMagick + Tesseract)   │
│  - PyPDF2 for text-based PDFs          │
│  - Smart chunking (overlap + resume)   │
└────────┬───────────────────────────────┘
         │
         ▼
┌────────────────────────────────────────┐
│   Enhanced Fact Extraction (V5) ✅     │
│  - Claude Sonnet via CLI               │
│  - Maximum depth extraction            │
│  - Multiple evidence quotes (V5)       │
│  - 8 metadata fields (specificity,     │
│    entities, QV, process details)      │
│  - Parallel processing (5-11 workers)  │
└────────┬───────────────────────────────┘
         │
         ▼
┌────────────────────────────────────────┐
│    Real-Time Validation (Active) ✅    │
│  - Quote verification against source   │
│  - Line number validation              │
│  - Fuzzy matching (70% threshold)      │
│  - Fact recovery for medium confidence │
└────────┬───────────────────────────────┘
         │
         ▼
┌────────────────────────────────────────┐
│    Post-Processing Pipeline (V5) ✅    │
│  - Retroactive QV tagging              │
│  - Quality scoring                     │
│  - Aggressive filtering (35% QV)       │
│  - Consolidated JSON output            │
└────────────────────────────────────────┘
```
Future enhancements will add:
- Swarm Consensus: Multiple LLM instances with voting
- Semantic Clustering: Group similar facts, detect outliers
- Contradiction Resolution: Judge model for conflicting facts
- Enhanced Interactive Q&A: Advanced querying capabilities over extracted facts
```
frfr/
├── frfr/
│   ├── __init__.py
│   ├── cli.py                       # ✅ CLI interface (7 commands)
│   ├── config.py                    # ✅ Configuration management
│   ├── session.py                   # ✅ Document-aware sessions w/ LLM naming
│   ├── tui/                         # ✅ Terminal User Interface
│   │   ├── __init__.py
│   │   ├── app.py                   # ✅ Main TUI application
│   │   ├── state.py                 # ✅ Application state management
│   │   ├── screens/                 # ✅ TUI screens
│   │   │   ├── home.py              # ✅ Session browser
│   │   │   ├── session_detail.py    # ✅ Session detail view
│   │   │   ├── facts_browser.py     # ✅ Facts filtering & search
│   │   │   └── query.py             # ✅ Query interface
│   │   └── widgets/                 # ✅ Custom widgets
│   ├── documents/
│   │   ├── __init__.py
│   │   └── pdf_extractor.py         # ✅ PDF OCR + PyPDF2 extraction
│   ├── extraction/
│   │   ├── __init__.py
│   │   ├── fact_extractor.py        # ✅ LLM-based extraction (V5)
│   │   ├── schemas.py               # ✅ Enhanced fact schemas (V5)
│   │   ├── claude_client.py         # ✅ Claude CLI wrapper
│   │   ├── extraction_patterns.py   # ✅ V3 regex patterns
│   │   └── v4_enhancements.py       # ✅ V4 filtering logic
│   ├── validation/
│   │   ├── __init__.py
│   │   ├── fact_validator.py        # ✅ Real-time validation (V5)
│   │   └── quote_corrector.py       # ✅ LLM-based quote correction
│   │
│   ├── consensus/                   # 🔮 PLANNED (Phase 2)
│   │   └── __init__.py              # (empty - future swarm consensus)
│   ├── judge/                       # 🔮 PLANNED (Phase 2)
│   │   └── __init__.py              # (empty - future judge model)
│   ├── workflows/                   # 🔮 PLANNED (Phase 2)
│   │   └── __init__.py              # (empty - future orchestration)
│   └── reporting/                   # 🔮 PLANNED (Phase 2)
│       └── __init__.py              # (empty - future reporting)
│
├── scripts/
│   └── ...                          # Helper scripts
├── tests/
│   └── ...                          # Test files
├── pyproject.toml
├── requirements.txt
└── README.md
```
Legend:
- ✅ = Implemented and production-ready
- 🔮 = Planned for future phases
- Python 3.10+
- Claude CLI (authenticated with `claude login`)
- ImageMagick
- Tesseract OCR
```bash
# Clone repository
git clone <repo-url>
cd frfr

# Install system dependencies (macOS)
brew install imagemagick tesseract

# Install Python dependencies
pip install -r requirements.txt

# Install package in development mode
pip install -e .

# Authenticate with Claude CLI
claude login
```

The primary entrypoint for document processing is the PDF extraction API. It provides a clean Python interface for converting PDFs to text.
Using the CLI:
```bash
# Extract a PDF
frfr extract your-file.pdf output/extracted_text.txt

# View the output
cat output/extracted_text.txt | head -50
```

Using the Python API:

```python
from frfr.documents import extract_pdf_to_text, get_pdf_info

# Get PDF metadata
info = get_pdf_info('documents/your-file.pdf')
print(f"Pages: {info['pages']}")
print(f"Encrypted: {info['is_encrypted']}")

# Extract full PDF to text file
result = extract_pdf_to_text(
    pdf_path='documents/your-file.pdf',
    output_path='output/extracted_text.txt'
)
print(f"Method: {result['method']}")             # 'pypdf2' (fast, clean)
print(f"Pages: {result['pages']}")               # 155
print(f"Characters: {result['total_chars']:,}")  # 476,143
```

The system automatically chooses the best method:
- PyPDF2 (default): For text-based PDFs
  - Fast, clean extraction
  - Handles encrypted PDFs (with pycryptodome)
  - Preserves formatting
  - Zero OCR artifacts
- OCR (fallback): For scanned/image PDFs
  - Tesseract with LSTM neural network
  - 400 DPI quality
  - Smart artifact cleaning
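The selection amounts to a try-then-fallback: run PyPDF2 first, and fall back to OCR when the extracted text layer looks suspiciously thin. A minimal sketch of that heuristic (the `min_chars_per_page` cutoff here is an illustrative assumption, not frfr's actual threshold):

```python
def choose_extraction_method(pypdf2_text: str, pages: int, min_chars_per_page: int = 50) -> str:
    """Pick 'pypdf2' when the text layer looks real, else fall back to 'ocr'.

    A scanned/image PDF typically yields an empty or near-empty text layer,
    so a low character-per-page count is treated as a signal to run Tesseract.
    """
    if len(pypdf2_text) >= min_chars_per_page * pages:
        return "pypdf2"
    return "ocr"
```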
`extract_pdf_to_text(pdf_path, output_path)`

Extract text from an entire PDF and save it to a file.

Returns:

```python
{
    "method": "pypdf2",
    "pages": 155,
    "total_chars": 476143,
    "output_file": "/path/to/output.txt"
}
```

`extract_pdf_page_to_text(pdf_path, page_num)`

Extract text from a single page (0-indexed).

Returns: `tuple[str, str]` - `(text, method)`

`get_pdf_info(pdf_path)`

Get PDF metadata.

Returns:

```python
{
    "pages": 155,
    "is_encrypted": True,
    "file_size": 1858673
}
```

Frfr uses intelligent session management to organize your document processing:
```
project/
├── inputs/                  # Symlinks to original PDFs
│   ├── doc1.pdf -> /original/path/doc1.pdf
│   └── doc2.pdf -> /another/path/doc2.pdf
├── outputs/                 # All transformations
│   ├── doc1_text.txt
│   ├── doc1_facts.json
│   ├── doc2_text.txt
│   └── doc2_facts.json
└── .frfr_sessions/          # Session working data
    └── sess_vendor_security_assessment_20251105_164525/
        ├── metadata.json    # Document registry & history
        ├── summaries/       # LLM-generated summaries
        ├── facts/           # Per-chunk extracted facts
        └── chunks/          # Original chunk text
```
Sessions are automatically named using Claude LLM based on your documents:
```bash
# Single document
frfr process documents/soc2_audit_report.pdf
# Creates: sess_soc2_audit_report_20251105_164525

# Multiple documents
frfr process documents/vendor_security.pdf documents/compliance_docs.pdf documents/risk_assessment.pdf
# Creates: sess_vendor_security_compliance_20251105_164531
# (Claude generates a succinct title from document names)
```

As you add documents, the session name updates to stay topical:
```bash
# Start with first document
frfr process documents/vendor_questionnaire.pdf
# Session: sess_vendor_questionnaire_20251105_173454

# Add second document - session automatically renamed!
frfr process documents/vendor_questionnaire.pdf documents/compliance_report.pdf
# Session: sess_security_compliance_documentation_20251105_173454
# ℹ Session name updated to reflect documents

# Add third document - renamed again!
frfr process documents/vendor_questionnaire.pdf documents/compliance_report.pdf documents/risk_assessment.pdf
# Session: sess_security_compliance_assessment_20251105_173454
# ℹ Session name updated to reflect documents
```

All renames are tracked in session metadata with complete history.
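The naming scheme itself reduces to slug + timestamp. A sketch (in the real pipeline the title comes from Claude; here it is passed in, and reusing the original timestamp on rename keeps the session's identity stable):

```python
import datetime
import re

def make_session_id(title: str, created: datetime.datetime) -> str:
    """Build a session ID like sess_<slug>_<YYYYMMDD_HHMMSS> from an LLM-generated title."""
    slug = re.sub(r"[^a-z0-9]+", "_", title.lower()).strip("_")
    return f"sess_{slug}_{created:%Y%m%d_%H%M%S}"

print(make_session_id("Vendor Security Assessment",
                      datetime.datetime(2025, 11, 5, 16, 45, 25)))
# sess_vendor_security_assessment_20251105_164525
```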
Process multiple PDFs in a single session for cross-document analysis:
```bash
# Process multiple documents together
frfr process documents/doc1.pdf documents/doc2.pdf documents/doc3.pdf

# Or build up a session over time
frfr process documents/doc1.pdf --session-id my_session
frfr process documents/doc2.pdf --session-id my_session  # Adds to existing session
frfr process documents/doc3.pdf --session-id my_session  # Session name updates
```

Each document is tracked with:

- Original PDF path (absolute)
- Symlink in `inputs/`
- Text file in `outputs/`
- Facts file in `outputs/`
- Processing status (pending/processing/completed/failed)
The `process` supercommand runs the complete pipeline from PDF to interactive querying in one command:
```bash
# Process a single PDF from start to finish
frfr process documents/soc2_report.pdf

# Process multiple PDFs in one session
frfr process documents/doc1.pdf documents/doc2.pdf documents/doc3.pdf

# With custom settings
frfr process documents/report.pdf \
    --max-workers 11 \
    --multipass \
    --show-facts

# Process without entering interactive mode
frfr process documents/report.pdf --no-interactive

# Use a specific session ID
frfr process documents/report.pdf --session-id my_custom_session
```

This command:
- ✅ Extracts PDF to text
- ✅ Extracts facts using LLM
- ✅ Validates facts against source
- ✅ Launches interactive query mode
For more control over individual steps:
```bash
# 1. Extract PDF to text
frfr extract documents/soc2_report.pdf output/soc2_text.txt

# 2. Extract facts with V5 features
frfr extract-facts output/soc2_text.txt \
    --document-name my_soc2 \
    --max-workers 11
# Output:
# ✅ Session: sess_abc123
# ✅ Processing chunks... [170/170] (28 minutes)
# ✅ Extracted 2,487 facts
# ✅ Consolidated: output/my_soc2_facts.json

# 3. Validate facts against source
frfr validate-facts output/my_soc2_facts.json output/soc2_text.txt
# Output:
# ✅ Total: 2,487 facts
# ✅ Valid: 2,487 (100%)
# ✅ Validation rate: 100%

# 4. Check session progress (for resume)
frfr session-info sess_abc123

# 5. Resume if interrupted
frfr extract-facts output/soc2_text.txt \
    --document-name my_soc2 \
    --session-id sess_abc123 \
    --start-chunk 85
```

```bash
# Planned future capability:
frfr query sess_abc123 --interactive

> does the system implement 2-factor authentication?
[Querying 2,487 extracted facts...]
[Finding relevant facts with semantic search...]

Answer: Yes, 2FA implemented with SMS and TOTP.
Supporting Facts: 3 facts found (lines 1245, 1389, 2103)
Confidence: High (multiple sources)

> exit
```

```bash
# Basic usage - single document
frfr process documents/report.pdf

# Multiple documents in one session
frfr process documents/doc1.pdf documents/doc2.pdf documents/doc3.pdf

# With custom settings (flag reference)
frfr process documents/report.pdf \
    --session-id my_session   # Use specific session ID (optional)
    --max-workers 11          # Parallel Claude processes (default: 5)
    --chunk-size 500          # Lines per chunk (default: 500)
    --overlap 100             # Overlap between chunks (default: 100)
    --multipass               # Enable multi-pass extraction
    --skip-validation         # Skip validation step
    --no-interactive          # Don't enter interactive mode
    --show-facts              # Show cited facts in interactive mode
```

```bash
# Extract facts with parallel processing
frfr extract-facts text.txt \
    --document-name doc_name \
    --max-workers 11          # Parallel Claude processes (default: 5)
    --chunk-size 500          # Lines per chunk (default: 500)
    --overlap 100             # Overlap between chunks (default: 100)

# Enable multi-pass extraction (CUECs, tests, quantitative, technical)
frfr extract-facts text.txt \
    --document-name doc_name \
    --multipass

# Resume interrupted extraction
frfr extract-facts text.txt \
    --document-name doc_name \
    --session-id sess_abc123 \
    --start-chunk 85

# Validate with custom output
frfr validate-facts facts.json text.txt \
    --show-invalid-only \
    --output validation_report.json

# Interactive querying
frfr interactive facts.json --show-facts
```

Extracted facts follow this structure:
```json
{
    "claim": "System implements 2FA via SMS and TOTP",
    "source_doc": "soc2_report.pdf",
    "source_location": "Page 42, Section 4.2.1",
    "evidence_quote": "Multi-factor authentication is enforced for all user accounts, supporting both SMS-based codes and TOTP authenticator applications.",
    "confidence": 0.92
}
```

Generated reports include:
- Executive Summary: Direct answer to question
- Confidence Score: Overall confidence (0-100%)
- Supporting Facts: All consensus facts with citations
- Methodology: Swarm size, consensus reached, outliers discarded
- Appendix:
- Corrected hallucinations (facts that didn't reach consensus)
- Resolved contradictions (conflicting facts and judge's resolution)
- Low-confidence facts (flagged but not included)
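Facts files in the schema shown earlier can be post-filtered in a few lines. A sketch (it assumes a flat JSON array of fact objects; the real consolidated output also carries session metadata):

```python
import json

def filter_high_confidence(facts: list[dict], min_confidence: float = 0.9) -> list[dict]:
    """Keep only facts at or above min_confidence."""
    return [fact for fact in facts if fact.get("confidence", 0.0) >= min_confidence]

def load_high_confidence_facts(path: str, min_confidence: float = 0.9) -> list[dict]:
    """Load a facts JSON file and drop low-confidence entries."""
    with open(path) as f:
        return filter_high_confidence(json.load(f), min_confidence)
```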
- PDFs converted via PyPDF2 (fast, clean) or OCR fallback (Tesseract)
- Encrypted PDFs handled automatically (pycryptodome)
- Documents chunked with sliding window (configurable size + overlap)
- Smart resume capability for interrupted extractions
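The sliding-window behavior can be sketched in a few lines (defaults mirror the `--chunk-size` and `--overlap` CLI flags; the real chunker also tracks line offsets for validation):

```python
def chunk_lines(lines: list[str], chunk_size: int = 500, overlap: int = 100) -> list[tuple[int, list[str]]]:
    """Split lines into overlapping chunks; returns (start_line, chunk) pairs.

    Each chunk shares `overlap` lines with its predecessor, so a fact that
    straddles a boundary appears intact in at least one chunk.
    """
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(lines), step):
        chunks.append((start, lines[start:start + chunk_size]))
        if start + chunk_size >= len(lines):
            break
    return chunks
```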
- Claude Sonnet via CLI (headless mode)
- Maximum depth extraction (5-10 facts per paragraph)
- Enhanced schema with 8 metadata fields: `fact_type`, `control_family`, `specificity_score`, `entities`, `quantitative_values`, `process_details`, `section_context`, `related_control_ids`
- V5 Feature: Multiple evidence quotes per fact
- Parallel processing (5-11 workers)
- Section-aware prompting (Control Testing, System Description, CUEC)
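Put together, a V5 fact record looks roughly like this (types are assumptions inferred from the field names; the authoritative definitions live in `extraction/schemas.py`):

```python
from dataclasses import dataclass, field

@dataclass
class ExtractedFactV5:
    """Sketch of a V5 fact with the 8 enhanced metadata fields."""
    claim: str
    evidence_quotes: list[str] = field(default_factory=list)  # V5: multiple quotes per fact
    fact_type: str = ""
    control_family: str = ""
    specificity_score: float = 0.0
    entities: list[str] = field(default_factory=list)
    quantitative_values: list[str] = field(default_factory=list)
    process_details: str = ""
    section_context: str = ""
    related_control_ids: list[str] = field(default_factory=list)
```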
- Every fact validated against source text immediately
- Line-number-based quote verification
- Fuzzy matching (70% threshold) for OCR artifacts
- Fact recovery for medium-confidence matches (40-79%)
- 100% validation rate achieved in production
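One way to implement the fuzzy quote check is with stdlib `difflib` (a sketch; frfr's actual matcher in `fact_validator.py` may normalize text differently):

```python
import difflib

def quote_match_ratio(quote: str, source_line: str) -> float:
    """Similarity in [0, 1]; tolerant of OCR artifacts such as swapped characters."""
    return difflib.SequenceMatcher(None, quote.lower(), source_line.lower()).ratio()

def is_valid_quote(quote: str, source_line: str, threshold: float = 0.70) -> bool:
    """Accept a quote when it matches the source at or above the 70% threshold."""
    return quote_match_ratio(quote, source_line) >= threshold
```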
- Retroactive QV tagging (scans claims for missed quantitative values)
- Quality scoring (specificity + entities + process details)
- Aggressive filtering to achieve target QV coverage (35%)
- Consolidated JSON output with session metadata
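Retroactive QV tagging can be approximated with a regex scan over claim text (the pattern below is an illustrative subset, not the production rule set):

```python
import re

# Numbers optionally followed by a unit-like token: "90 days", "99.9%", "155 pages"
QV_PATTERN = re.compile(
    r"\b\d+(?:[.,]\d+)?\s*(?:%|percent|days?|hours?|minutes?|pages?|users?|controls?)?",
    re.IGNORECASE,
)

def tag_quantitative_value(claim: str) -> bool:
    """Return True when a claim carries a quantitative value worth tagging."""
    return QV_PATTERN.search(claim) is not None
```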
- Planned: Swarm extraction with multiple instances
- Planned: Semantic clustering and consensus voting
- Planned: Contradiction detection and judge resolution
- Planned: Interactive Q&A over extracted facts
Current Phase: ✅ Phase 1 Complete - Production Ready
- ✅ PDF text extraction (PyPDF2 + OCR fallback)
- ✅ Enhanced fact extraction with 8 metadata fields
- ✅ Maximum depth extraction mode
- ✅ Multiple evidence quotes support (V5)
- ✅ Real-time validation (100% rate achieved)
- ✅ Parallel processing (5-11 workers)
- ✅ Document-aware sessions with LLM naming
- ✅ Automatic session renaming as documents are added
- ✅ Multi-document support with cross-document queries
- ✅ Post-processing pipeline (QV tagging, filtering)
- ✅ Comprehensive CLI (7 commands)
- 1,011 validated facts from 155-page SOC2 report
- 35.0% quantitative value coverage (target achieved)
- 0.878 average specificity (high quality)
- 28 minutes extraction time (170 chunks, 11 workers)
- 100% validation rate (all facts verified against source)
- 🔮 Multi-instance swarm extraction with consensus voting
- 🔮 Semantic clustering and outlier detection
- 🔮 Contradiction detection and judge resolution
- 🔮 Enhanced interactive Q&A over extracted facts
- 🔮 Web UI wrapper around CLI
This project is open source. Contributions welcome for:
- Additional document format support
- Improved consensus algorithms
- Better chunking strategies
- UI/UX enhancements
TBD
- Security Audits: "Does this pentest report identify any critical vulnerabilities?"
- Compliance: "Does this SOC2 report implement the controls in this reference spec?"
- Design Review: "Does this architecture doc address the scaling requirements from this spec?"
- Governance: "What data retention policies are described in this document?"
The system is designed for high-stakes questions where accuracy matters more than speed.