Task-first LLM evaluation framework for domain experts to benchmark models on real use cases—not generic metrics.
Existing LLM evaluation tools (DeepEval, Promptfoo, EleutherAI) focus on generic benchmarks and are built for AI engineers. When a domain expert needs to know "which LLM best extracts action items from meeting notes" or "the most cost-effective model for bug triage," generic BLEU/ROUGE scores don't help.
LLM TaskBench shifts from metric-first to task-first evaluation:
- Define use cases in Markdown - Human-readable USE-CASE.md with goals and edge cases
- Auto-generate prompts - Framework analyzes ground truth to create optimal prompts
- LLM-as-judge scoring - Claude/GPT-4 evaluates outputs against your criteria
- Cost-aware recommendations - Not just "best" but "best for your budget"
Based on testing 42+ production LLMs:
| Conventional Wisdom | Reality |
|---|---|
| Bigger models = better | 405B didn't beat 72B on our tasks |
| "Reasoning-optimized" = better reasoning | Sometimes performed worse |
| Higher price = higher quality | Zero correlation found |
What actually matters: Task-specific evaluation reveals which models excel at your use case.
- Install deps

  ```bash
  pip install -e .
  ```

- Set your key

  ```bash
  export OPENROUTER_API_KEY=sk-or-...
  ```

- List available use cases

  ```bash
  taskbench list-usecases
  ```

- Run evaluation on a use case

  ```bash
  taskbench run sample-usecases/00-lecture-concept-extraction \
    --models anthropic/claude-sonnet-4,openai/gpt-4o
  ```

- Generate prompts for a use case (without running)

  ```bash
  taskbench generate-prompts sample-usecases/00-lecture-concept-extraction
  ```

Use cases are organized in folders with:

- `USE-CASE.md` - Human-friendly description with goal, evaluation notes, edge cases
- `data/` - Input data files
- `ground-truth/` - Expected outputs for comparison
```
sample-usecases/
├── 00-lecture-concept-extraction/
│   ├── USE-CASE.md
│   ├── data/
│   │   ├── lecture-01-python-basics.txt
│   │   ├── lecture-02-ml-fundamentals.txt
│   │   └── lecture-03-system-design.txt
│   └── ground-truth/
│       ├── lecture-01-concepts.csv
│       ├── lecture-02-concepts.csv
│       └── lecture-03-concepts.csv
├── 01-meeting-action-items/
│   └── ...
└── 02-bug-report-triage/
    └── ...
```
The framework automatically:
- Parses USE-CASE.md for goal, evaluation notes, edge cases
- Matches data files to ground truth by naming patterns (see the sketch after this list)
- Uses LLM to analyze and generate task prompts and judge rubrics
- Saves generated prompts to `generated-prompts.json` for reuse
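As an illustration of the naming-pattern step, a matcher along these lines could pair inputs with their expected outputs by a shared filename prefix. This is a sketch of the heuristic, not the framework's actual matching code; the function name is hypothetical.

```python
from pathlib import Path

def match_data_to_ground_truth(usecase_dir: str) -> dict[Path, Path]:
    """Pair each data file with a ground-truth file sharing its numeric prefix,
    e.g. data/lecture-01-python-basics.txt -> ground-truth/lecture-01-concepts.csv."""
    data_files = sorted(Path(usecase_dir, "data").iterdir())
    truth_files = sorted(Path(usecase_dir, "ground-truth").iterdir())
    pairs = {}
    for data_file in data_files:
        # Use the first two dash-separated tokens ("lecture-01") as the shared key.
        key = "-".join(data_file.stem.split("-")[:2])
        for truth_file in truth_files:
            if truth_file.stem.startswith(key):
                pairs[data_file] = truth_file
                break
    return pairs
```

With the sample folder above, this pairs `lecture-01-python-basics.txt` with `lecture-01-concepts.csv`, and so on for each lecture.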
```bash
taskbench run <usecase_folder> [options]
```

Options:

- `--models`/`-m` - Comma-separated model IDs
- `--data`/`-d` - Specific data file to use (if multiple)
- `--output`/`-o` - Output file path
- `--regenerate-prompts` - Force prompt regeneration
- `--skip-judge` - Skip judge evaluation

Other commands:

- `taskbench list-usecases [folder]` - List available use cases
- `taskbench generate-prompts <usecase_folder> [--force]` - Generate prompts without running an evaluation
- `taskbench evaluate` - Run with a YAML task definition
- `taskbench recommend` - Load saved results and recommend a model
- `taskbench models` - List priced models
- `taskbench validate` - Validate a task YAML
- `taskbench sample` - Run the bundled sample task
```bash
cp .env.example .env
# add your OPENROUTER_API_KEY to .env

# CLI mode
docker compose -f docker-compose.cli.yml build
docker compose -f docker-compose.cli.yml run --rm taskbench-cli list-usecases

# UI mode
docker compose -f docker-compose.ui.yml up --build
# API at http://localhost:8000, UI at http://localhost:5173
```

The web UI provides:
- Browse and select use cases from `sample-usecases/`
- View use case details, data files, and ground truth
- Generate and preview prompts
- Select models to evaluate
- Run evaluations and view results
- Compare model performance with cost tracking
When you run a use case, the framework:
- Parses USE-CASE.md - Extracts goal, evaluation notes, edge cases
- Analyzes Data/Ground-Truth - Matches input files to expected outputs
- Generates Prompts via LLM - Creates:
- Task prompt for model execution
- Judge prompt for output evaluation
- Rubric with compliance checks and scoring weights
- Saves to Folder - Prompts saved in `generated-prompts.json`
Example generated rubric:
```json
{
  "critical_requirements": [
    {"name": "duration_bounds", "description": "Segments 2-7 minutes", "penalty": 8}
  ],
  "compliance_checks": [
    {"check": "timestamp_format", "severity": "HIGH", "penalty": 5}
  ],
  "weights": {"accuracy": 40, "format": 20, "compliance": 40}
}
```

Environment variables:

- `OPENROUTER_API_KEY` (required)
- `TASKBENCH_MAX_CONCURRENCY` (default 5)
- `TASKBENCH_PROMPT_GEN_MODEL` (default `anthropic/claude-sonnet-4.5`)
- `TASKBENCH_MAX_TOKENS` (default 4000)
- `TASKBENCH_TEMPERATURE` (default 0.7)
- `TASKBENCH_USE_GENERATION_LOOKUP` (true/false, default true)
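A minimal sketch of reading this configuration with the documented defaults (illustrative only; the framework's actual config handling may differ):

```python
import os

# Documented defaults; OPENROUTER_API_KEY has no default and must be set.
OPENROUTER_API_KEY = os.environ["OPENROUTER_API_KEY"]
MAX_CONCURRENCY = int(os.getenv("TASKBENCH_MAX_CONCURRENCY", "5"))
PROMPT_GEN_MODEL = os.getenv("TASKBENCH_PROMPT_GEN_MODEL", "anthropic/claude-sonnet-4.5")
MAX_TOKENS = int(os.getenv("TASKBENCH_MAX_TOKENS", "4000"))
TEMPERATURE = float(os.getenv("TASKBENCH_TEMPERATURE", "0.7"))
USE_GENERATION_LOOKUP = os.getenv("TASKBENCH_USE_GENERATION_LOOKUP", "true").lower() == "true"
```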
Cost tracking:

- Inline usage requested on every call
- Billed cost fetched from `/generation?id=...` when available (sketched below)
- Results store token counts, generation IDs, and per-model totals
- Judge evaluation costs tracked separately
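For intuition, the billed-cost lookup might look like the sketch below. The `/generation` endpoint comes from the note above; the `total_cost` field name and the error handling are assumptions to verify against the OpenRouter API docs.

```python
import os
import requests

def fetch_billed_cost(generation_id: str) -> float | None:
    """Look up the billed cost for a completed generation via OpenRouter.
    Returns None if the lookup fails or the cost is not yet available."""
    resp = requests.get(
        "https://openrouter.ai/api/v1/generation",
        params={"id": generation_id},
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        timeout=30,
    )
    if resp.status_code != 200:
        return None
    data = resp.json().get("data", {})
    return data.get("total_cost")  # assumed field name for the billed USD cost
```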
Results are automatically saved to `results/{usecase-name}/`:
```
results/
├── 00-lecture-concept-extraction/
│   └── 2025-12-26_233901_lecture-01-python-basics.json
├── 01-meeting-action-items/
│   └── 2025-12-26_234802_meeting-01-standup.json
└── ...
```
| # | Use Case | Claude Sonnet 4 | GPT-4o-mini | Key Finding |
|---|---|---|---|---|
| 00 | Lecture Concepts | 93/100 | 35/100 | GPT ignores duration constraints |
| 01 | Meeting Actions | 82/100 | 66/100 | GPT misses implicit tasks |
| 02 | Bug Triage | 86/100 | 75/100 | Both usable |
| 03 | Regex Generation | 97/100 | 0/100 | GPT fails entirely |
| 04 | Data Cleaning | 88/100 | 76/100 | Both usable |
See detailed results in each use case's `taskbench-results.md`.
```
┌─────────────────────────────────────────────────────────────┐
│              CLI: taskbench run/list-usecases                │
└─────────────────────────────────────────────────────────────┘
                              │
┌─────────────────────────────┴───────────────────────────────┐
│              Folder-Based Use Case Processing                │
│    UseCaseParser → DataAnalyzer → PromptGenerator (LLM)      │
└─────────────────────────────────────────────────────────────┘
                              │
┌─────────────────────────────┴───────────────────────────────┐
│                       Core Evaluation                        │
│   Executor (parallel) → Judge (LLM scoring) → CostTracker    │
└─────────────────────────────────────────────────────────────┘
                              │
┌─────────────────────────────┴───────────────────────────────┐
│                       OpenRouter API                         │
│        100+ LLM models (Claude, GPT-4, Gemini, etc.)         │
└─────────────────────────────────────────────────────────────┘
```
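Conceptually, the Judge stage sends a model's output and the generated rubric to a judging model over OpenRouter's OpenAI-compatible chat completions endpoint and parses a score back. The sketch below is illustrative: the prompt wording, the expected JSON reply shape, and the function name are assumptions, not the framework's actual judge implementation.

```python
import json
import os
import requests

def judge_output(model_output: str, rubric: dict,
                 judge_model: str = "anthropic/claude-sonnet-4.5") -> dict:
    """Ask a judge model to score one output against a rubric (illustrative sketch)."""
    judge_prompt = (
        "Score the following output against the rubric. "
        'Reply with JSON only: {"score": 0-100, "violations": [...]}.\n\n'
        f"Rubric:\n{json.dumps(rubric, indent=2)}\n\nOutput:\n{model_output}"
    )
    resp = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={"model": judge_model,
              "messages": [{"role": "user", "content": judge_prompt}]},
        timeout=120,
    )
    resp.raise_for_status()
    # Assumes the judge replies with bare JSON as instructed above.
    return json.loads(resp.json()["choices"][0]["message"]["content"])
```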
| Component | Purpose | Location |
|---|---|---|
| Use Case Parser | Parse USE-CASE.md folders | src/taskbench/usecase_parser.py |
| Prompt Generator | LLM-driven prompt creation | src/taskbench/prompt_generator.py |
| Executor | Parallel model execution | src/taskbench/evaluation/executor.py |
| Judge | LLM-as-judge scoring | src/taskbench/evaluation/judge.py |
| Cost Tracker | Token/cost tracking | src/taskbench/evaluation/cost.py |
| CLI | Command interface | src/taskbench/cli/main.py |
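For the Executor, a bounded-concurrency pattern such as the following asyncio sketch captures the idea of running several models in parallel while respecting `TASKBENCH_MAX_CONCURRENCY`. It is illustrative only; `call_model` stands in for whatever async function performs the actual OpenRouter request.

```python
import asyncio

async def run_models(prompt: str, models: list[str], call_model,
                     max_concurrency: int = 5) -> dict[str, str]:
    """Run the same task prompt against several models with bounded concurrency.
    `call_model(model, prompt)` is any async callable returning the model's output."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(model: str) -> tuple[str, str]:
        async with semaphore:
            return model, await call_model(model, prompt)

    results = await asyncio.gather(*(run_one(m) for m in models))
    return dict(results)
```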
- USAGE.md - Full user guide with examples
- docs/ARCHITECTURE.md - Technical architecture
- docs/API.md - API reference
- sample-usecases/ - Example use cases with results
MIT
Sri Bolisetty (@KnightSri)