Drover

AI-powered document classification that herds files into organized folders

Features • Quick Start • Usage • Configuration • Documentation

Drover uses LLMs to analyze documents and suggest consistent, policy-compliant filesystem paths and filenames. Named after herding dogs that drove livestock, Drover herds your scattered files into an organized folder structure.

Features

Multi-Provider AI — Works with Ollama (local), OpenAI, Anthropic, and OpenRouter
Intelligent Classification — Categorizes documents by domain, category, and document type
Smart Sampling — Adaptive page sampling for efficient processing of large documents
Taxonomy System — Extensible controlled vocabularies with strict or fallback modes
NARA-Compliant Naming — Generates standardized filenames: {doctype}-{vendor}-{subject}-{date}.pdf
macOS Tagging — Apply classification as native filesystem tags
Batch Processing — Classify multiple documents with JSONL output
Evaluation Framework — Measure accuracy against ground truth datasets

Quick Start

Prerequisites

Python 3.13.x
Ollama (for local inference) or API keys for cloud providers

Installation

# Clone and install
git clone https://github.com/ckrough/drover.git
cd drover
pip install -e .

# Download required NLTK data (one-time)
python -c "import nltk; nltk.download('averaged_perceptron_tagger_eng'); nltk.download('punkt_tab')"

Classify Your First Document

# Using local Ollama (default)
drover classify document.pdf

# Using OpenAI
export OPENAI_API_KEY="sk-..."
drover classify document.pdf --ai-provider openai --ai-model gpt-4o

Usage

Classify Command

Analyze documents and output suggested file paths:

drover classify invoice.pdf
drover classify *.pdf --batch                    # Multiple files, JSONL output
drover classify doc.pdf --metrics                # Include AI metrics
drover classify doc.pdf --log-level verbose      # Detailed logging
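
Because batch mode emits JSONL, downstream scripts can handle one record per line. The following is a minimal Python sketch, not part of Drover: the helper name `parse_batch_output` is hypothetical, and it assumes each line is a JSON object shaped like the example in the Output Format section.

```python
import json
from typing import Iterable, Iterator


def parse_batch_output(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:  # tolerate blank lines between records
            yield json.loads(line)


# Hand-written sample records for illustration (not real Drover output).
sample = [
    '{"original": "scan001.pdf", "domain": "financial"}',
    "",
    '{"original": "scan002.pdf", "domain": "financial"}',
]
records = list(parse_batch_output(sample))
```

In practice the lines would come from a file or a pipe rather than a list, assuming the JSONL is written to standard output.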

Tag Command (macOS)

Classify and apply native filesystem tags:

drover tag document.pdf --dry-run                # Preview tags
drover tag document.pdf --tag-fields domain,vendor
drover tag --tag-mode replace document.pdf       # Replace existing tags

Evaluate Command

Measure classification accuracy against ground truth:

drover evaluate eval/ground_truth.jsonl
drover evaluate eval/ground_truth.jsonl --output-format json
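
To illustrate what measuring accuracy against ground truth can look like, here is a sketch of per-field exact-match accuracy. It is not Drover's evaluation code: the function name `field_accuracy`, the choice of fields, and the record layout are all assumptions made for the example.

```python
def field_accuracy(predictions, truths, fields=("domain", "category", "doctype")):
    """Return the fraction of exact matches per field across paired records."""
    if len(predictions) != len(truths):
        raise ValueError("predictions and truths must be the same length")
    if not truths:
        return {field: 0.0 for field in fields}
    return {
        field: sum(p.get(field) == t.get(field) for p, t in zip(predictions, truths))
        / len(truths)
        for field in fields
    }


# Hand-written records for illustration only.
truths = [
    {"domain": "financial", "category": "banking", "doctype": "statement"},
    {"domain": "financial", "category": "banking", "doctype": "receipt"},
]
predictions = [
    {"domain": "financial", "category": "banking", "doctype": "statement"},
    {"domain": "financial", "category": "tax", "doctype": "receipt"},
]
scores = field_accuracy(predictions, truths)
```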

Output Format

{
  "original": "scan001.pdf",
  "suggested_path": "financial/banking/statement/statement-chase-checking-20240115.pdf",
  "domain": "financial",
  "category": "banking",
  "doctype": "statement",
  "vendor": "chase",
  "date": "20240115",
  "subject": "checking"
}
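
In the example above, `suggested_path` appears to be composed as `domain/category/doctype/` followed by the `{doctype}-{vendor}-{subject}-{date}` filename. The sketch below reproduces that composition from the record's fields. The function name `build_suggested_path` is hypothetical, and Drover's real path builder may normalize or validate values in ways not shown here.

```python
from pathlib import PurePosixPath


def build_suggested_path(record: dict, extension: str = ".pdf") -> str:
    """Compose folder path and filename from classification fields."""
    # Filename pattern from the Features list: {doctype}-{vendor}-{subject}-{date}
    filename = "-".join(
        record[key] for key in ("doctype", "vendor", "subject", "date")
    ) + extension
    # Folder layout inferred from the example output: domain/category/doctype
    folder = PurePosixPath(record["domain"], record["category"], record["doctype"])
    return str(folder / filename)


record = {
    "domain": "financial",
    "category": "banking",
    "doctype": "statement",
    "vendor": "chase",
    "date": "20240115",
    "subject": "checking",
}
path = build_suggested_path(record)
```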

Configuration

Environment Variables

| Variable | Description | Default |
| --- | --- | --- |
| `DROVER_AI_PROVIDER` | AI provider (ollama, openai, anthropic, openrouter) | `ollama` |
| `DROVER_AI_MODEL` | Model name | `llama3.2:latest` |
| `DROVER_TAXONOMY` | Classification taxonomy | `household` |
| `DROVER_NAMING_STYLE` | Filename policy | `nara` |
| `DROVER_SAMPLE_STRATEGY` | Page sampling (full, first_n, bookends, adaptive) | `adaptive` |
| `DROVER_LOG_LEVEL` | Logging verbosity (quiet, verbose, debug) | `quiet` |
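
The defaults in the table can be read as a simple fallback rule: use the environment value when it is set, otherwise the default. The sketch below shows that rule in plain Python. The name `resolve_settings` is hypothetical, and Drover's actual settings loading (the tech stack lists Pydantic for config) may behave differently.

```python
import os

# Variable names and defaults copied from the table above.
DEFAULTS = {
    "DROVER_AI_PROVIDER": "ollama",
    "DROVER_AI_MODEL": "llama3.2:latest",
    "DROVER_TAXONOMY": "household",
    "DROVER_NAMING_STYLE": "nara",
    "DROVER_SAMPLE_STRATEGY": "adaptive",
    "DROVER_LOG_LEVEL": "quiet",
}


def resolve_settings(environ=None) -> dict:
    """Return each setting from the environment, falling back to its default."""
    environ = os.environ if environ is None else environ
    return {name: environ.get(name, default) for name, default in DEFAULTS.items()}


# Passing a dict instead of os.environ keeps the example deterministic.
settings = resolve_settings({"DROVER_AI_PROVIDER": "openai"})
```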

Config File

Drover searches for configuration in order: --config PATH → drover.yaml → ~/.config/drover/config.yaml

# drover.yaml
ai:
  provider: openai
  model: gpt-4o
  temperature: 0.0

taxonomy: household
taxonomy_mode: fallback
naming_style: nara
concurrency: 4
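
The config search order described above can be sketched as a first-match lookup. This is an illustration under stated assumptions, not Drover's loader: the function name `find_config` is hypothetical, `drover.yaml` is assumed to be looked up in the current working directory, and an explicit `--config` path is assumed to win without further checks.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_config(explicit=None, cwd=None, home=None) -> Optional[Path]:
    """Return the first config location that applies, or None."""
    if explicit is not None:
        # Assumption: an explicit --config path takes priority unconditionally.
        return Path(explicit)
    cwd = Path.cwd() if cwd is None else Path(cwd)
    home = Path.home() if home is None else Path(home)
    candidates = (
        cwd / "drover.yaml",
        home / ".config" / "drover" / "config.yaml",
    )
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None


# Demonstration in a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "project"
    home = Path(tmp) / "home"
    user_config = home / ".config" / "drover" / "config.yaml"
    project.mkdir()
    user_config.parent.mkdir(parents=True)

    nothing_found = find_config(cwd=project, home=home)
    user_config.write_text("taxonomy: household\n")
    found_user = find_config(cwd=project, home=home) == user_config
    (project / "drover.yaml").write_text("taxonomy: household\n")
    found_project = find_config(cwd=project, home=home) == project / "drover.yaml"
```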

AI Providers

| Provider | API Key Variable | Example Model |
| --- | --- | --- |
| Ollama | — (local) | `llama3.2:latest` |
| OpenAI | `OPENAI_API_KEY` | `gpt-4o` |
| Anthropic | `ANTHROPIC_API_KEY` | `claude-sonnet-4-20250514` |
| OpenRouter | `OPENROUTER_API_KEY` | `anthropic/claude-sonnet-4` |
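
A script that wraps Drover might check for the right API key before starting a long batch. The sketch below encodes the table above as a lookup. The function name `missing_api_key` is hypothetical, and whether Drover performs a similar check itself is not stated here.

```python
import os

# Provider-to-variable mapping taken from the table above.
API_KEY_VARS = {
    "ollama": None,  # local inference, no key required
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
}


def missing_api_key(provider, environ=None):
    """Return the name of the unset key variable, or None if nothing is missing."""
    environ = os.environ if environ is None else environ
    if provider not in API_KEY_VARS:
        raise ValueError(f"unknown provider: {provider}")
    variable = API_KEY_VARS[provider]
    if variable is None or environ.get(variable):
        return None
    return variable
```

For example, `missing_api_key("openai", {})` returns `"OPENAI_API_KEY"`, while `missing_api_key("ollama", {})` returns `None`.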

Supported File Formats

| Category | Extensions |
| --- | --- |
| PDF | `.pdf` |
| Images | `.png`, `.jpg`, `.jpeg`, `.gif`, `.bmp`, `.tiff`, `.tif` |
| Office | `.docx`, `.doc`, `.xlsx`, `.xls`, `.pptx`, `.ppt` |
| Text | `.txt`, `.md`, `.html`, `.htm`, `.csv`, `.tsv` |
| Other | `.eml`, `.epub`, `.odt`, `.rtf` |
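
When preparing a batch, it can help to filter out unsupported files first. The sketch below maps a filename to the categories in the table above. The function name `format_category` is hypothetical, and the case-insensitive matching is an assumption rather than documented Drover behavior.

```python
from pathlib import Path

# Extensions copied from the table above.
FORMATS = {
    "PDF": {".pdf"},
    "Images": {".png", ".jpg", ".jpeg", ".gif", ".bmp", ".tiff", ".tif"},
    "Office": {".docx", ".doc", ".xlsx", ".xls", ".pptx", ".ppt"},
    "Text": {".txt", ".md", ".html", ".htm", ".csv", ".tsv"},
    "Other": {".eml", ".epub", ".odt", ".rtf"},
}


def format_category(filename):
    """Return the table category for a filename, or None if unsupported."""
    suffix = Path(filename).suffix.lower()  # assumption: match case-insensitively
    for category, extensions in FORMATS.items():
        if suffix in extensions:
            return category
    return None
```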

Architecture

Drover follows a pipeline architecture with extensible plugin systems:

[Document] → [Loader] → [Classifier] → [PathBuilder] → [Output]
                ↓             ↓              ↓
           [Sampling]   [Taxonomy]    [NamingPolicy]
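
To make the diagram concrete, the sketch below wires three stages together as plain callables. It is only an illustration of the pipeline shape: the `Pipeline` class, its field names, and the stub stages are hypothetical and do not reflect Drover's actual classes or plugin interfaces.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Pipeline:
    load: Callable[[str], str]         # Loader (with sampling): path -> text
    classify: Callable[[str], dict]    # Classifier (with taxonomy): text -> fields
    build_path: Callable[[dict], str]  # PathBuilder (with naming policy): fields -> path

    def run(self, source: str) -> dict:
        """Pass a document through each stage and collect the output record."""
        text = self.load(source)
        fields = self.classify(text)
        return {
            "original": source,
            "suggested_path": self.build_path(fields),
            **fields,
        }


# Stub stages standing in for real components.
pipeline = Pipeline(
    load=lambda path: "Chase checking statement",
    classify=lambda text: {
        "domain": "financial",
        "category": "banking",
        "doctype": "statement",
    },
    build_path=lambda f: f"{f['domain']}/{f['category']}/{f['doctype']}",
)
result = pipeline.run("scan001.pdf")
```

Keeping each stage behind a small callable boundary is one way to let sampling, taxonomy, and naming policies be swapped independently.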

Tech Stack:

CLI: Click
LLM: LangChain with structured output
Config: Pydantic
Logging: structlog

Development

# Install with dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Lint and format
ruff check src/ --fix && ruff format src/

# Security scan
bandit -r src/ -c pyproject.toml

Documentation

Contributing Guide — Development setup, architecture, and extension guides
ADR-001: Chain-of-Thought Prompting — 7-step reasoning for accurate classification
ADR-002: Privacy-First Design — Local-first, zero telemetry approach

License

MIT
