AI-powered document classification that herds files into organized folders
Drover uses LLMs to analyze documents and suggest consistent, policy-compliant filesystem paths and filenames. Named after herding dogs that drove livestock, Drover herds your scattered files into an organized folder structure.
- Multi-Provider AI — Works with Ollama (local), OpenAI, Anthropic, and OpenRouter
- Intelligent Classification — Categorizes documents by domain, category, and document type
- Smart Sampling — Adaptive page sampling for efficient processing of large documents
- Taxonomy System — Extensible controlled vocabularies with strict or fallback modes
- NARA-Compliant Naming — Generates standardized filenames: {doctype}-{vendor}-{subject}-{date}.pdf
- macOS Tagging — Apply classification as native filesystem tags
- Batch Processing — Classify multiple documents with JSONL output
- Evaluation Framework — Measure accuracy against ground truth datasets
- Python 3.13.x
- Ollama (for local inference) or API keys for cloud providers
# Clone and install
git clone https://github.com/ckrough/drover.git
cd drover
pip install -e .
# Download required NLTK data (one-time)
python -c "import nltk; nltk.download('averaged_perceptron_tagger_eng'); nltk.download('punkt_tab')"

# Using local Ollama (default)
drover classify document.pdf
# Using OpenAI
export OPENAI_API_KEY="sk-..."
drover classify document.pdf --ai-provider openai --ai-model gpt-4o

Analyze documents and output suggested file paths:
drover classify invoice.pdf
drover classify *.pdf --batch # Multiple files, JSONL output
drover classify doc.pdf --metrics # Include AI metrics
drover classify doc.pdf --log-level verbose # Detailed logging

Classify and apply native filesystem tags:
drover tag document.pdf --dry-run # Preview tags
drover tag document.pdf --tag-fields domain,vendor
drover tag --tag-mode replace document.pdf # Replace existing tags

Measure classification accuracy against ground truth:
drover evaluate eval/ground_truth.jsonl
drover evaluate eval/ground_truth.jsonl --output-format json

{
"original": "scan001.pdf",
"suggested_path": "financial/banking/statement/statement-chase-checking-20240115.pdf",
"domain": "financial",
"category": "banking",
"doctype": "statement",
"vendor": "chase",
"date": "20240115",
"subject": "checking"
}

| Variable | Description | Default |
|---|---|---|
| DROVER_AI_PROVIDER | AI provider (ollama, openai, anthropic, openrouter) | ollama |
| DROVER_AI_MODEL | Model name | llama3.2:latest |
| DROVER_TAXONOMY | Classification taxonomy | household |
| DROVER_NAMING_STYLE | Filename policy | nara |
| DROVER_SAMPLE_STRATEGY | Page sampling (full, first_n, bookends, adaptive) | adaptive |
| DROVER_LOG_LEVEL | Logging verbosity (quiet, verbose, debug) | quiet |
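As a rough illustration of how the documented defaults interact with the environment, the sketch below resolves each variable to its environment value when set and to the documented default otherwise. The variable names and defaults come from the table; the `resolve_settings` helper is hypothetical and is not Drover's actual configuration code.

```python
import os

# Defaults copied from the documented environment variables.
DEFAULTS = {
    "DROVER_AI_PROVIDER": "ollama",
    "DROVER_AI_MODEL": "llama3.2:latest",
    "DROVER_TAXONOMY": "household",
    "DROVER_NAMING_STYLE": "nara",
    "DROVER_SAMPLE_STRATEGY": "adaptive",
    "DROVER_LOG_LEVEL": "quiet",
}


def resolve_settings(environ=None):
    """Hypothetical helper: environment value if present, else the default."""
    env = os.environ if environ is None else environ
    return {name: env.get(name, default) for name, default in DEFAULTS.items()}


if __name__ == "__main__":
    for name, value in resolve_settings().items():
        print(f"{name}={value}")
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to exercise without touching the real environment.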
Drover searches for configuration in order: --config PATH → drover.yaml → ~/.config/drover/config.yaml
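The search order can be pictured with a small sketch that returns the first candidate file that exists. The `find_config` function and its parameters are illustrative assumptions, not Drover's API. In particular, falling through to the next location when an explicit `--config` path is missing is a guess; the real tool may report an error instead.

```python
from pathlib import Path


def find_config(explicit=None, cwd=None, home=None):
    """Hypothetical lookup: --config PATH, then drover.yaml, then the user config."""
    cwd = Path(cwd) if cwd is not None else Path.cwd()
    home = Path(home) if home is not None else Path.home()

    candidates = []
    if explicit is not None:
        candidates.append(Path(explicit))  # value given to --config
    candidates.append(cwd / "drover.yaml")  # project-local file
    candidates.append(home / ".config" / "drover" / "config.yaml")  # per-user file

    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None  # nothing found; built-in defaults would presumably apply


if __name__ == "__main__":
    print(find_config())
```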
# drover.yaml
ai:
provider: openai
model: gpt-4o
temperature: 0.0
taxonomy: household
taxonomy_mode: fallback
naming_style: nara
concurrency: 4

| Provider | API Key Variable | Example Model |
|---|---|---|
| Ollama | — (local) | llama3.2:latest |
| OpenAI | OPENAI_API_KEY | gpt-4o |
| Anthropic | ANTHROPIC_API_KEY | claude-sonnet-4-20250514 |
| OpenRouter | OPENROUTER_API_KEY | anthropic/claude-sonnet-4 |
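A pre-flight check along these lines can catch a missing key before any document is sent to a cloud provider. The provider names and key variables are taken from the table; the `missing_api_key` helper is a hypothetical sketch and says nothing about how Drover itself validates credentials.

```python
import os

# Provider -> required API key variable (None means no key is needed).
API_KEY_VARS = {
    "ollama": None,  # local inference
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
}


def missing_api_key(provider, environ=None):
    """Return the name of the unset key variable, or None if nothing is missing."""
    env = os.environ if environ is None else environ
    if provider not in API_KEY_VARS:
        raise ValueError(f"unknown provider: {provider}")
    variable = API_KEY_VARS[provider]
    if variable is None or env.get(variable):
        return None
    return variable


if __name__ == "__main__":
    for name in API_KEY_VARS:
        missing = missing_api_key(name)
        print(name, "ok" if missing is None else f"set {missing} first")
```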
| Category | Extensions |
|---|---|
| PDF | .pdf |
| Images | .png, .jpg, .jpeg, .gif, .bmp, .tiff, .tif |
| Office | .docx, .doc, .xlsx, .xls, .pptx, .ppt |
| Text | .txt, .md, .html, .htm, .csv, .tsv |
| Other | .eml, .epub, .odt, .rtf |
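To filter a directory before a batch run, the supported extensions can be turned into a lookup. This is a minimal sketch: the category labels follow the table (with `PDF` used for the `.pdf` row), and the `category_for` helper is an assumption for illustration, not part of Drover.

```python
from pathlib import Path

# Extension sets copied from the supported-formats table.
SUPPORTED = {
    "PDF": {".pdf"},
    "Images": {".png", ".jpg", ".jpeg", ".gif", ".bmp", ".tiff", ".tif"},
    "Office": {".docx", ".doc", ".xlsx", ".xls", ".pptx", ".ppt"},
    "Text": {".txt", ".md", ".html", ".htm", ".csv", ".tsv"},
    "Other": {".eml", ".epub", ".odt", ".rtf"},
}


def category_for(filename):
    """Return the format category for a file name, or None if unsupported."""
    suffix = Path(filename).suffix.lower()  # case-insensitive match
    for category, extensions in SUPPORTED.items():
        if suffix in extensions:
            return category
    return None


if __name__ == "__main__":
    for name in ["scan001.PDF", "notes.md", "archive.zip"]:
        print(name, "->", category_for(name))
```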
Drover follows a pipeline architecture with extensible plugin systems:
[Document] → [Loader] → [Classifier] → [PathBuilder] → [Output]
                ↓            ↓               ↓
            [Sampling]  [Taxonomy]    [NamingPolicy]
Tech Stack:
- CLI: Click
- LLM: LangChain with structured output
- Config: Pydantic
- Logging: structlog
# Install with dev dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Lint and format
ruff check src/ --fix && ruff format src/
# Security scan
bandit -r src/ -c pyproject.toml

- Contributing Guide — Development setup, architecture, and extension guides
- ADR-001: Chain-of-Thought Prompting — 7-step reasoning for accurate classification
- ADR-002: Privacy-First Design — Local-first, zero telemetry approach
MIT
