
Frfr

High-confidence document Q&A system using LLM swarm consensus with hallucination detection and correction.

Quick Links: 🚀 Quick Start | Design Doc | All Docs

Ready to process your PDF?

  • Interactive TUI: Simply run frfr to launch the Terminal User Interface for visual session management
  • CLI Mode: Use frfr process for one-command extraction and querying, or see QUICKSTART.md for details

Interface Options

Terminal User Interface (TUI) - Default

Run frfr without any arguments to launch the interactive TUI:

frfr

The TUI provides:

  • Session Browser: View and navigate all sessions with document counts and fact statistics
  • Session Detail View: Explore documents within a session
  • Facts Browser: Filter and search through extracted facts with real-time search
  • Query Interface: Ask natural language questions about your facts
  • Keyboard Navigation: Full keyboard shortcuts (q to quit, ? for help, ESC to go back)

Command Line Interface (CLI)

All CLI commands are still available:

frfr <command>           # Use specific CLI command
frfr --cli               # Show CLI help
frfr tui                 # Explicitly launch TUI

Overview

Frfr extracts structured, validated facts from complex documents (SOC2 reports, penetration test reports, design specs) with high precision.

Current Implementation (V5 - Production Ready):

  1. ✅ PDF text extraction with OCR fallback
  2. ✅ LLM-based fact extraction with enhanced metadata (8 fields)
  3. ✅ Maximum depth extraction mode
  4. ✅ Multiple evidence quotes support (V5)
  5. ✅ Real-time validation against source text
  6. ✅ Parallel processing with resume capability
  7. ✅ Post-processing pipeline (QV tagging, filtering)
  8. ✅ Document-aware sessions with intelligent naming
  9. ✅ Multi-document support with automatic session renaming

Planned Features (Future Phases):

  • 🔮 Multiple LLM instances with swarm consensus
  • 🔮 Semantic comparison and clustering
  • 🔮 Contradiction detection and resolution
  • 🔮 Judge model synthesis
  • 🔮 Interactive Q&A over extracted facts

Architecture

Current Implementation (Phase 1: Extraction & Validation) ✅

┌─────────────────────────────────────────┐
│          User Interface                 │
│  ┌────────────────┐  ┌───────────────┐ │
│  │  TUI (Textual) │  │ CLI (Rich)    │ │
│  │  - Session     │  │ - One-command │ │
│  │    Browser     │  │   workflow    │ │
│  │  - Facts View  │  │ - Scriptable  │ │
│  │  - Query UI    │  │               │ │
│  └────────────────┘  └───────────────┘ │
└────────┬────────────────────────────────┘
         │
         ▼
┌────────────────────────────────────────┐
│      Session Management (Local)        │
│  - Session tracking & resume           │
│  - Progress persistence                │
│  - Artifact storage                    │
└────────┬───────────────────────────────┘
         │
         ▼
┌────────────────────────────────────────┐
│       Document Processing (Active)     │
│  - PDF OCR (ImageMagick + Tesseract)   │
│  - PyPDF2 for text-based PDFs          │
│  - Smart chunking (overlap + resume)   │
└────────┬───────────────────────────────┘
         │
         ▼
┌────────────────────────────────────────┐
│   Enhanced Fact Extraction (V5) ✅     │
│  - Claude Sonnet via CLI               │
│  - Maximum depth extraction            │
│  - Multiple evidence quotes (V5)       │
│  - 8 metadata fields (specificity,     │
│    entities, QV, process details)      │
│  - Parallel processing (5-11 workers)  │
└────────┬───────────────────────────────┘
         │
         ▼
┌────────────────────────────────────────┐
│    Real-Time Validation (Active) ✅    │
│  - Quote verification against source   │
│  - Line number validation              │
│  - Fuzzy matching (70% threshold)      │
│  - Fact recovery for medium confidence │
└────────┬───────────────────────────────┘
         │
         ▼
┌────────────────────────────────────────┐
│   Post-Processing Pipeline (V5) ✅     │
│  - Retroactive QV tagging              │
│  - Quality scoring                     │
│  - Aggressive filtering (35% QV)       │
│  - Consolidated JSON output            │
└────────────────────────────────────────┘

Planned Architecture (Phase 2: Consensus & Q&A) 🔮

Future enhancements will add:

  • Swarm Consensus: Multiple LLM instances with voting
  • Semantic Clustering: Group similar facts, detect outliers
  • Contradiction Resolution: Judge model for conflicting facts
  • Enhanced Interactive Q&A: Advanced querying capabilities over extracted facts

Module Structure

frfr/
├── frfr/
│   ├── __init__.py
│   ├── cli.py                      # ✅ CLI interface (7 commands)
│   ├── config.py                   # ✅ Configuration management
│   ├── session.py                  # ✅ Document-aware sessions w/ LLM naming
│   ├── tui/                        # ✅ Terminal User Interface
│   │   ├── __init__.py
│   │   ├── app.py                 # ✅ Main TUI application
│   │   ├── state.py               # ✅ Application state management
│   │   ├── screens/               # ✅ TUI screens
│   │   │   ├── home.py           # ✅ Session browser
│   │   │   ├── session_detail.py # ✅ Session detail view
│   │   │   ├── facts_browser.py  # ✅ Facts filtering & search
│   │   │   └── query.py          # ✅ Query interface
│   │   └── widgets/               # ✅ Custom widgets
│   ├── documents/
│   │   ├── __init__.py
│   │   └── pdf_extractor.py       # ✅ PDF OCR + PyPDF2 extraction
│   ├── extraction/
│   │   ├── __init__.py
│   │   ├── fact_extractor.py       # ✅ LLM-based extraction (V5)
│   │   ├── schemas.py              # ✅ Enhanced fact schemas (V5)
│   │   ├── claude_client.py        # ✅ Claude CLI wrapper
│   │   ├── extraction_patterns.py  # ✅ V3 regex patterns
│   │   └── v4_enhancements.py      # ✅ V4 filtering logic
│   ├── validation/
│   │   ├── __init__.py
│   │   ├── fact_validator.py       # ✅ Real-time validation (V5)
│   │   └── quote_corrector.py      # ✅ LLM-based quote correction
│   │
│   ├── consensus/                  # 🔮 PLANNED (Phase 2)
│   │   └── __init__.py             # (empty - future swarm consensus)
│   ├── judge/                      # 🔮 PLANNED (Phase 2)
│   │   └── __init__.py             # (empty - future judge model)
│   ├── workflows/                  # 🔮 PLANNED (Phase 2)
│   │   └── __init__.py             # (empty - future orchestration)
│   └── reporting/                  # 🔮 PLANNED (Phase 2)
│       └── __init__.py             # (empty - future reporting)
│
├── scripts/
│   └── ...                         # Helper scripts
├── tests/
│   └── ...                         # Test files
├── pyproject.toml
├── requirements.txt
└── README.md

Legend:

  • ✅ = Implemented and production-ready
  • 🔮 = Planned for future phases

Prerequisites

  • Python 3.10+
  • Claude CLI (authenticated with claude login)
  • ImageMagick
  • Tesseract OCR

Installation

# Clone repository
git clone <repo-url>
cd frfr

# Install system dependencies (macOS)
brew install imagemagick tesseract

# Install Python dependencies
pip install -r requirements.txt

# Install package in development mode
pip install -e .

# Authenticate with Claude CLI
claude login

PDF Text Extraction API

The primary entrypoint for document processing is the PDF extraction API. It provides a clean Python interface for converting PDFs to text.

Quick Start: Extract a PDF

Using the CLI:

# Extract a PDF
frfr extract your-file.pdf output/extracted_text.txt

# View the output
cat output/extracted_text.txt | head -50

Python API

from frfr.documents import extract_pdf_to_text, get_pdf_info

# Get PDF metadata
info = get_pdf_info('documents/your-file.pdf')
print(f"Pages: {info['pages']}")
print(f"Encrypted: {info['is_encrypted']}")

# Extract full PDF to text file
result = extract_pdf_to_text(
    pdf_path='documents/your-file.pdf',
    output_path='output/extracted_text.txt'
)

print(f"Method: {result['method']}")      # 'pypdf2' (fast, clean)
print(f"Pages: {result['pages']}")         # 155
print(f"Characters: {result['total_chars']:,}")  # 476,143

Extraction Strategy

The system automatically chooses the best method:

  1. PyPDF2 (default): For text-based PDFs

    • Fast, clean extraction
    • Handles encrypted PDFs (with pycryptodome)
    • Preserves formatting
    • Zero OCR artifacts
  2. OCR (fallback): For scanned/image PDFs

    • Tesseract with LSTM neural network
    • 400 DPI quality
    • Smart artifact cleaning
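
The PyPDF2-vs-OCR fallback decision can be sketched as a small heuristic. The function name, threshold, and per-page character test below are illustrative assumptions, not the actual implementation:

```python
def needs_ocr(page_texts: list[str], min_chars_per_page: int = 50) -> bool:
    """Heuristic: if PyPDF2 yields too little text per page, the PDF is
    likely scanned and should go through the OCR path instead."""
    if not page_texts:
        return True
    avg_chars = sum(len(t.strip()) for t in page_texts) / len(page_texts)
    return avg_chars < min_chars_per_page

# A text-based PDF produces dense pages; a scanned one yields near-empty text.
print(needs_ocr(["A" * 400, "B" * 350]))  # False -> keep PyPDF2 output
print(needs_ocr(["", "  ", "x"]))         # True  -> fall back to Tesseract
```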

API Reference

extract_pdf_to_text(pdf_path, output_path)

Extract text from entire PDF and save to file.

Returns:

{
    "method": "pypdf2",
    "pages": 155,
    "total_chars": 476143,
    "output_file": "/path/to/output.txt"
}

extract_pdf_page_to_text(pdf_path, page_num)

Extract text from a single page (0-indexed).

Returns: tuple[str, str] - (text, method)

get_pdf_info(pdf_path)

Get PDF metadata.

Returns:

{
    "pages": 155,
    "is_encrypted": True,
    "file_size": 1858673
}

Document-Aware Sessions

Frfr uses intelligent session management to organize your document processing:

Session Structure

project/
├── inputs/                    # Symlinks to original PDFs
│   ├── doc1.pdf -> /original/path/doc1.pdf
│   └── doc2.pdf -> /another/path/doc2.pdf
├── outputs/                   # All transformations
│   ├── doc1_text.txt
│   ├── doc1_facts.json
│   ├── doc2_text.txt
│   └── doc2_facts.json
└── .frfr_sessions/           # Session working data
    └── sess_vendor_security_assessment_20251105_164525/
        ├── metadata.json     # Document registry & history
        ├── summaries/        # LLM-generated summaries
        ├── facts/            # Per-chunk extracted facts
        └── chunks/           # Original chunk text
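
Given this layout, session discovery amounts to globbing for metadata.json files under .frfr_sessions/. A minimal sketch, assuming metadata.json carries a documents list (that key name is an assumption about the schema):

```python
import json
from pathlib import Path

def list_sessions(project_dir: str) -> list[dict]:
    """Enumerate sessions by scanning .frfr_sessions/ and reading each
    metadata.json. The 'documents' key is an assumed schema detail."""
    sessions = []
    for meta_path in Path(project_dir).glob(".frfr_sessions/sess_*/metadata.json"):
        meta = json.loads(meta_path.read_text())
        sessions.append({
            "session_id": meta_path.parent.name,
            "documents": len(meta.get("documents", [])),
        })
    return sessions
```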

Intelligent Session Naming

Sessions are automatically named using Claude LLM based on your documents:

# Single document
frfr process documents/soc2_audit_report.pdf
# Creates: sess_soc2_audit_report_20251105_164525

# Multiple documents
frfr process documents/vendor_security.pdf documents/compliance_docs.pdf documents/risk_assessment.pdf
# Creates: sess_vendor_security_compliance_20251105_164531
# (Claude generates a succinct title from document names)

Automatic Session Renaming

As you add documents, the session name updates to stay topical:

# Start with first document
frfr process documents/vendor_questionnaire.pdf
# Session: sess_vendor_questionnaire_20251105_173454

# Add second document - session automatically renamed!
frfr process documents/vendor_questionnaire.pdf documents/compliance_report.pdf
# Session: sess_security_compliance_documentation_20251105_173454
# ℹ  Session name updated to reflect documents

# Add third document - renamed again!
frfr process documents/vendor_questionnaire.pdf documents/compliance_report.pdf documents/risk_assessment.pdf
# Session: sess_security_compliance_assessment_20251105_173454
# ℹ  Session name updated to reflect documents

All renames are tracked in session metadata with complete history.

Multi-Document Sessions

Process multiple PDFs in a single session for cross-document analysis:

# Process multiple documents together
frfr process documents/doc1.pdf documents/doc2.pdf documents/doc3.pdf

# Or build up a session over time
frfr process documents/doc1.pdf --session-id my_session
frfr process documents/doc2.pdf --session-id my_session  # Adds to existing session
frfr process documents/doc3.pdf --session-id my_session  # Session name updates

Each document is tracked with:

  • Original PDF path (absolute)
  • Symlink in inputs/
  • Text file in outputs/
  • Facts file in outputs/
  • Processing status (pending/processing/completed/failed)
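
The per-document record can be pictured as a small dataclass. The class and field names below are illustrative, mapped from the tracked attributes above rather than taken from the real schema:

```python
from dataclasses import dataclass
from enum import Enum

class DocStatus(Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"

@dataclass
class DocumentRecord:
    """Illustrative shape of a per-document session entry."""
    original_path: str   # absolute path to the source PDF
    input_symlink: str   # symlink under inputs/
    text_file: str       # extracted text under outputs/
    facts_file: str      # extracted facts under outputs/
    status: DocStatus = DocStatus.PENDING
```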

Usage

Quick Start: Process Command (One-Shot)

The process supercommand runs the complete pipeline from PDF to interactive querying in one command:

# Process a single PDF from start to finish
frfr process documents/soc2_report.pdf

# Process multiple PDFs in one session
frfr process documents/doc1.pdf documents/doc2.pdf documents/doc3.pdf

# With custom settings
frfr process documents/report.pdf \
  --max-workers 11 \
  --multipass \
  --show-facts

# Process without entering interactive mode
frfr process documents/report.pdf --no-interactive

# Use a specific session ID
frfr process documents/report.pdf --session-id my_custom_session

This command:

  1. ✅ Extracts PDF to text
  2. ✅ Extracts facts using LLM
  3. ✅ Validates facts against source
  4. ✅ Launches interactive query mode

Step-by-Step Workflow (Advanced)

For more control over individual steps:

# 1. Extract PDF to text
frfr extract documents/soc2_report.pdf output/soc2_text.txt

# 2. Extract facts with V5 features
frfr extract-facts output/soc2_text.txt \
  --document-name my_soc2 \
  --max-workers 11

# Output:
# ✅ Session: sess_abc123
# ✅ Processing chunks... [170/170] (28 minutes)
# ✅ Extracted 2,487 facts
# ✅ Consolidated: output/my_soc2_facts.json

# 3. Validate facts against source
frfr validate-facts output/my_soc2_facts.json output/soc2_text.txt

# Output:
# ✅ Total: 2,487 facts
# ✅ Valid: 2,487 (100%)
# ✅ Validation rate: 100%

# 4. Check session progress (for resume)
frfr session-info sess_abc123

# 5. Resume if interrupted
frfr extract-facts output/soc2_text.txt \
  --document-name my_soc2 \
  --session-id sess_abc123 \
  --start-chunk 85

Interactive Q&A (Future Phase 2) 🔮

# Planned future capability:
frfr query sess_abc123 --interactive

> does the system implement 2-factor authentication?

[Querying 2,487 extracted facts...]
[Finding relevant facts with semantic search...]

Answer: Yes, 2FA implemented with SMS and TOTP.
Supporting Facts: 3 facts found (lines 1245, 1389, 2103)
Confidence: High (multiple sources)

> exit

CLI Options

Process Command (Supercommand)

# Basic usage - single document
frfr process documents/report.pdf

# Multiple documents in one session
frfr process documents/doc1.pdf documents/doc2.pdf documents/doc3.pdf

# With custom settings
frfr process documents/report.pdf \
  --session-id my_session \
  --max-workers 11 \
  --chunk-size 500 \
  --overlap 100 \
  --multipass

# Flag reference:
#   --session-id ID      Use a specific session ID (optional)
#   --max-workers N      Parallel Claude processes (default: 5)
#   --chunk-size N       Lines per chunk (default: 500)
#   --overlap N          Overlap between chunks (default: 100)
#   --multipass          Enable multi-pass extraction
#   --skip-validation    Skip the validation step
#   --no-interactive     Don't enter interactive mode
#   --show-facts         Show cited facts in interactive mode

Individual Commands (Advanced)

# Extract facts with parallel processing
frfr extract-facts text.txt \
  --document-name doc_name \
  --max-workers 11 \
  --chunk-size 500 \
  --overlap 100

# --max-workers N  Parallel Claude processes (default: 5)
# --chunk-size N   Lines per chunk (default: 500)
# --overlap N      Overlap between chunks (default: 100)

# Enable multi-pass extraction (CUECs, tests, quantitative, technical)
frfr extract-facts text.txt \
  --document-name doc_name \
  --multipass

# Resume interrupted extraction
frfr extract-facts text.txt \
  --document-name doc_name \
  --session-id sess_abc123 \
  --start-chunk 85

# Validate with custom output
frfr validate-facts facts.json text.txt \
  --show-invalid-only \
  --output validation_report.json

# Interactive querying
frfr interactive facts.json --show-facts

Fact Schema

Extracted facts follow this structure:

{
  "claim": "System implements 2FA via SMS and TOTP",
  "source_doc": "soc2_report.pdf",
  "source_location": "Page 42, Section 4.2.1",
  "evidence_quote": "Multi-factor authentication is enforced for all user accounts, supporting both SMS-based codes and TOTP authenticator applications.",
  "confidence": 0.92
}
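
Because the consolidated output is a flat list of such objects, downstream filtering is straightforward. A minimal sketch (the helper name and the assumption that facts load as a plain list are illustrative):

```python
import json

def high_confidence_facts(facts: list[dict], threshold: float = 0.8) -> list[dict]:
    """Keep only facts at or above the confidence threshold,
    sorted most-confident first."""
    kept = [f for f in facts if f.get("confidence", 0.0) >= threshold]
    return sorted(kept, key=lambda f: f["confidence"], reverse=True)

facts = [
    {"claim": "2FA via SMS and TOTP", "confidence": 0.92},
    {"claim": "Backups are weekly", "confidence": 0.55},
]
print([f["claim"] for f in high_confidence_facts(facts)])  # ['2FA via SMS and TOTP']
```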

Report Format (Planned, Phase 2) 🔮

Generated reports will include:

  1. Executive Summary: Direct answer to question
  2. Confidence Score: Overall confidence (0-100%)
  3. Supporting Facts: All consensus facts with citations
  4. Methodology: Swarm size, consensus reached, outliers discarded
  5. Appendix:
    • Corrected hallucinations (facts that didn't reach consensus)
    • Resolved contradictions (conflicting facts and judge's resolution)
    • Low-confidence facts (flagged but not included)

How It Works (Current Implementation)

1. Document Processing ✅

  • PDFs converted via PyPDF2 (fast, clean) or OCR fallback (Tesseract)
  • Encrypted PDFs handled automatically (pycryptodome)
  • Documents chunked with sliding window (configurable size + overlap)
  • Smart resume capability for interrupted extractions
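
The sliding-window chunking above can be sketched as follows, using the documented defaults of 500 lines per chunk and 100 lines of overlap (the function itself is illustrative, not the production code):

```python
def chunk_lines(lines: list[str], chunk_size: int = 500, overlap: int = 100) -> list[list[str]]:
    """Split lines into overlapping windows; each chunk shares `overlap`
    trailing lines with the next so facts spanning a boundary survive."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(lines), step):
        chunks.append(lines[start:start + chunk_size])
        if start + chunk_size >= len(lines):
            break
    return chunks

lines = [f"line {i}" for i in range(1200)]
parts = chunk_lines(lines)
print(len(parts))  # 3 chunks: lines 0-499, 400-899, 800-1199
```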

2. Enhanced Fact Extraction (V5) ✅

  • Claude Sonnet via CLI (headless mode)
  • Maximum depth extraction (5-10 facts per paragraph)
  • Enhanced schema with 8 metadata fields:
    • fact_type, control_family, specificity_score
    • entities, quantitative_values, process_details
    • section_context, related_control_ids
  • V5 Feature: Multiple evidence quotes per fact
  • Parallel processing (5-11 workers)
  • Section-aware prompting (Control Testing, System Description, CUEC)

3. Real-Time Validation ✅

  • Every fact validated against source text immediately
  • Line-number-based quote verification
  • Fuzzy matching (70% threshold) for OCR artifacts
  • Fact recovery for medium-confidence matches (40-79%)
  • 100% validation rate achieved in production
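
The fuzzy-matching step can be approximated with difflib from the standard library. The bands below mirror the documented thresholds, but the exact scoring function is an assumption:

```python
from difflib import SequenceMatcher

def classify_quote(quote: str, source_line: str) -> str:
    """Score a claimed evidence quote against the source line it cites.
    Approximate bands: >= 70% similarity passes, 40-70% is routed to
    fact recovery, below 40% is rejected."""
    score = SequenceMatcher(None, quote.lower(), source_line.lower()).ratio()
    if score >= 0.70:
        return "valid"
    if score >= 0.40:
        return "recoverable"
    return "invalid"

# OCR noise ("0" for "o", a stray pipe) still clears the fuzzy threshold:
print(classify_quote("MFA is enforced for all accounts",
                     "MFA is enf0rced for a|l accounts"))  # valid
```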

4. Post-Processing Pipeline ✅

  • Retroactive QV tagging (scans claims for missed quantitative values)
  • Quality scoring (specificity + entities + process details)
  • Aggressive filtering to achieve target QV coverage (35%)
  • Consolidated JSON output with session metadata
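
Retroactive QV tagging amounts to scanning claim text for numeric values. A minimal sketch; the regex is illustrative, not the production pattern:

```python
import re

# Matches numbers with optional %, units, or time periods, e.g. "99.9%",
# "30 days", "400 DPI". Illustrative only.
QV_PATTERN = re.compile(
    r"\b\d+(?:\.\d+)?\s*(?:%|percent|days?|hours?|minutes?|DPI)?",
    re.IGNORECASE,
)

def has_quantitative_value(claim: str) -> bool:
    """Flag claims that contain a numeric/quantitative value."""
    return bool(QV_PATTERN.search(claim))

print(has_quantitative_value("Backups retained for 30 days"))  # True
print(has_quantitative_value("Uptime SLA of 99.9%"))           # True
print(has_quantitative_value("Access reviews are performed"))  # False
```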

5. Future: Consensus & Q&A 🔮

  • Planned: Swarm extraction with multiple instances
  • Planned: Semantic clustering and consensus voting
  • Planned: Contradiction detection and judge resolution
  • Planned: Interactive Q&A over extracted facts

Development Status

Current Phase: ✅ Phase 1 Complete - Production Ready

Implemented Features (V5)

  • ✅ PDF text extraction (PyPDF2 + OCR fallback)
  • ✅ Enhanced fact extraction with 8 metadata fields
  • ✅ Maximum depth extraction mode
  • ✅ Multiple evidence quotes support (V5)
  • ✅ Real-time validation (100% rate achieved)
  • ✅ Parallel processing (5-11 workers)
  • ✅ Document-aware sessions with LLM naming
  • ✅ Automatic session renaming as documents are added
  • ✅ Multi-document support with cross-document queries
  • ✅ Post-processing pipeline (QV tagging, filtering)
  • ✅ Comprehensive CLI (7 commands)

Production Metrics (V5)

  • 1,011 validated facts from 155-page SOC2 report
  • 35.0% quantitative value coverage (target achieved)
  • 0.878 average specificity (high quality)
  • 28 minutes extraction time (170 chunks, 11 workers)
  • 100% validation rate (all facts verified against source)

Planned Features (Phase 2)

  • 🔮 Multi-instance swarm extraction with consensus voting
  • 🔮 Semantic clustering and outlier detection
  • 🔮 Contradiction detection and judge resolution
  • 🔮 Enhanced interactive Q&A over extracted facts
  • 🔮 Web UI wrapper around CLI

Contributing

This project is open source. Contributions welcome for:

  • Additional document format support
  • Improved consensus algorithms
  • Better chunking strategies
  • UI/UX enhancements

License

TBD

Use Cases

  • Security Audits: "Does this pentest report identify any critical vulnerabilities?"
  • Compliance: "Does the system described in this SOC2 report implement the controls in this reference spec?"
  • Design Review: "Does this architecture doc address the scaling requirements from this spec?"
  • Governance: "What data retention policies are described in this document?"

The system is designed for high-stakes questions where accuracy matters more than speed.
