A command-line tool to benchmark local (Ollama) and remote (Google Gemini) Large Language Models (LLMs). It evaluates models against a configurable set of tasks, monitors system resources, generates a detailed HTML report with performance visualizations, and supports exporting results.
- Multi-Provider Support: Benchmark models served locally via Ollama (host configurable) and remotely via the Google Gemini API.
- Flexible Task Definition: Define benchmark tasks in a simple JSON format (`benchmark_tasks.json`), organized by category. Includes tasks with varied prompts (e.g., different instructions, personas). See the illustrative sketch after this list.
- Configuration File: Manage default settings (models, paths, API keys/URLs, weights, features, timeouts) via a `config.yaml` file. CLI arguments and environment variables override file settings.
- Diverse Evaluation Methods:
  - Keyword matching (strict 'all' or flexible 'any')
  - Weighted keyword scoring for nuanced evaluation
  - JSON and YAML structure validation and comparison
  - Regex-based information extraction with optional validation rules
  - Python code execution and testing against defined test cases
  - Classification with confidence score checking
  - Semantic similarity comparison (optional, requires `sentence-transformers`)
- Resource Monitoring (Optional):
  - Track CPU RAM usage delta for Ollama models (requires `psutil`).
  - Track NVIDIA GPU memory usage delta (GPU 0) for Ollama models (requires `pynvml`).
- Performance Metrics: Measure API response time and tokens/second (Ollama only).
- Scoring: Calculates overall accuracy, average scores for partial credit tasks, an "Ollama Performance Score", and a category-weighted "Overall Score".
- Reporting:
  - Generates a comprehensive HTML report with summary tables, performance plots (rankings for scores, accuracy, tokens/sec, resource usage, comparison by prompt stage), and detailed per-task results.
  - Optional export of summary results to CSV (`--export-summary-csv`).
  - Optional export of detailed task results to JSON (`--export-details-json`).
- Caching: Caches results to speed up subsequent runs (configurable TTL).
- Utilities: Includes a `--check-dependencies` flag to verify installation and basic functionality of optional libraries.
- Configurable: Control models, tasks, retries, paths, optional features, scoring weights, API endpoints, and more via `config.yaml`, environment variables, and command-line arguments.
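
The exact task schema lives in the shipped `benchmark_tasks.json`; the snippet below is only a rough sketch of how a category containing one keyword-matched task might be laid out, and every field name in it is an illustrative assumption rather than the tool's documented schema.

```json
{
  "nlp": [
    {
      "name": "Sentiment - Complex Complaint",
      "prompt": "Classify the sentiment of the following customer message as positive, negative, or neutral: ...",
      "evaluation": {
        "method": "keyword_match",
        "match": "any",
        "keywords": ["negative"]
      }
    }
  ]
}
```

Refer to the default tasks file for the actual keys and for examples of the weighted-keyword, regex, and code-execution evaluation methods listed above.
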
- Clone the repository:

  ```bash
  git clone https://github.com/colonelpanik/llm-bench.git
  cd llm-bench
  ```
- Recommended: Create and activate a virtual environment:

  ```bash
  python -m venv .venv
  # On Linux/macOS:
  source .venv/bin/activate
  # On Windows:
  # .\.venv\Scripts\activate
  ```
- Install Core Dependencies: The core requirements are `requests` and `PyYAML`.

  ```bash
  pip install requests PyYAML
  ```
- Install Optional Dependencies (As Needed): Install libraries for features you intend to use. See `requirements.txt` for details.
  - RAM Monitoring (`--ram-monitor enable`): `pip install psutil`
  - GPU Monitoring (`--gpu-monitor enable`): `pip install pynvml` (Requires NVIDIA drivers/CUDA toolkit correctly installed)
  - Report Plots (`--visualizations enable`): `pip install matplotlib`
  - Semantic Evaluation (`--semantic-eval enable`): `pip install sentence-transformers` (Downloads model files on first use)

You can check the status of optional dependencies using:

```bash
python -m benchmark_cli --check-dependencies
```

Settings are determined in the following order (later steps override earlier ones):
- Base Defaults: Hardcoded minimal defaults in `config.py`.
- `config.yaml`: Settings loaded from the YAML configuration file (default: `config.yaml`, path configurable via `--config-file`). This is the primary place to set your defaults; a sketch follows below.
- Environment Variables: `GEMINI_API_KEY` overrides `api.gemini_api_key` from `config.yaml`. `OLLAMA_HOST` overrides `api.ollama_host_url` from `config.yaml` (e.g., `OLLAMA_HOST=http://some-other-ip:11434`).
- Command-Line Arguments: Any arguments provided on the command line override all previous settings (e.g., `--test-model`, `--tasks-file`, `--gemini-key`, `--ram-monitor disable`).
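
Only the `api.ollama_host_url` and `api.gemini_api_key` keys are documented above; the other key names in this minimal sketch are assumptions used purely to illustrate where the defaults described in this README would live. Check the shipped `config.yaml` for the real structure and comments.

```yaml
# Minimal sketch, not the shipped file. Only the api.* key names below are
# documented in this README; every other key name is an assumption.
api:
  ollama_host_url: http://localhost:11434   # overridden by OLLAMA_HOST or CLI flags
  gemini_api_key: ""                        # overridden by GEMINI_API_KEY or --gemini-key

# Assumed key names for the defaults described above:
default_models:
  - llama3:8b
  - gemini-1.5-flash-latest
features:
  ram_monitor: true
  visualizations: true
```

Per the precedence order above, running `OLLAMA_HOST=http://some-other-ip:11434 python -m benchmark_cli --test-model mistral:7b` would still win over both the `api.ollama_host_url` value and the default model list in this file.
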
The main entry point is `benchmark_cli.py`.

Show Help:

```bash
python -m benchmark_cli --help
```

Basic Run (uses defaults from `config.yaml` or base):

```bash
python -m benchmark_cli
```

Run specific models, overriding config defaults, and clear the cache:

```bash
python -m benchmark_cli --test-model llama3:8b --test-model gemini-1.5-flash-latest --clear-cache -v
```

Run only the 'nlp' task category and open the report:

```bash
python -m benchmark_cli --task-set nlp --open-report
```

Run only specific tasks by name:

```bash
python -m benchmark_cli --task-name "Sentiment - Complex Complaint" --task-name "Code - Python Factorial"
```

Set the Gemini API key via the CLI (highest precedence):

```bash
python -m benchmark_cli --gemini-key "YOUR_API_KEY" --test-model gemini-1.5-pro-latest
```

(Alternatively, set the `GEMINI_API_KEY` environment variable or define the key in `config.yaml`.)

Use a custom configuration file:

```bash
python -m benchmark_cli --config-file my_settings.yaml
```

Run and export summary results to CSV:

```bash
python -m benchmark_cli --export-summary-csv
```

Check if optional dependencies are installed and working:

```bash
python -m benchmark_cli --check-dependencies
```

- `config.yaml` (Default): Define default models, API endpoints (`ollama_host_url`, `gemini_api_key`), paths, weights, timeouts, feature toggles, etc. See the default file for structure and comments.
- `benchmark_tasks.json` (Default): Define your benchmark tasks here. Path configurable in `config.yaml` or via `--tasks-file`.
- `report_template.html` (Default): Customize the HTML report template. Path configurable in `config.yaml` or via `--template-file`.
- HTML Report (`benchmark_report/report.html`): Detailed report. Path configurable.
- Plots (`benchmark_report/images/*.png`): PNG images embedded in the report. Path configurable.
- Cache Files (`benchmark_cache/cache_*.json`): Stored results. Path configurable.
- Export Files (Optional):
  - `benchmark_report/summary_*.csv`: Summary CSV file if `--export-summary-csv` is used.
  - `benchmark_report/details_*.json`: Detailed JSON file if `--export-details-json` is used.
- Console Output: Progress, summaries, warnings, errors. Use `-v` for more detail.

This project uses GitHub Actions for automated testing. Unit tests are run automatically on pushes and pull requests to the main branch against multiple Python versions.
The tests mock external services (Ollama, Gemini) and do not require live instances or API keys to run in the CI environment.
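
For orientation, a workflow of this kind typically looks something like the sketch below. This is an assumption-laden illustration, not the repository's actual workflow file; the real job names, Python versions, and test command may differ.

```yaml
# Hypothetical sketch only -- the repository's actual workflow under
# .github/workflows/ may use different names, versions, and commands.
name: tests
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
jobs:
  unit-tests:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.10", "3.11", "3.12"]   # "multiple Python versions" (assumed set)
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
      - run: pip install requests PyYAML
      - run: python -m unittest discover           # assumed test runner; Ollama/Gemini are mocked
```
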
A Dockerfile is provided for running the benchmark tool in a containerized environment with all dependencies included.
1. Build the Docker Image:

From the project root directory:

```bash
docker build -t llm-bench .
```

2. Run the Benchmark:
Running the benchmark requires connecting the container to your running Ollama instance. The method depends on your operating system.
- On Linux: Use `--network=host` to share the host's network stack. Ollama running on `localhost:11434` on the host will be accessible via the same address inside the container. Mount volumes for configuration, tasks, reports, and cache.

  ```bash
  docker run --rm -it --network=host \
    -v ./config.yaml:/app/config.yaml \
    -v ./benchmark_tasks.json:/app/benchmark_tasks.json \
    -v ./benchmark_report:/app/benchmark_report \
    -v ./benchmark_cache:/app/benchmark_cache \
    llm-bench \
    --test-model llama3:8b --open-report
  ```

  (Note: `--open-report` might not work reliably from within Docker unless you have a browser configured.)
- On macOS or Windows (Docker Desktop): `--network=host` is not typically supported. Instead, Docker provides a special DNS name, `host.docker.internal`, which resolves to the host machine. You need to tell `llm-bench` to use this address for Ollama.
  - Option A (Recommended): Using an Environment Variable: Set the `OLLAMA_HOST` environment variable when running the container.

    ```bash
    docker run --rm -it \
      -v ./config.yaml:/app/config.yaml \
      -v ./benchmark_tasks.json:/app/benchmark_tasks.json \
      -v ./benchmark_report:/app/benchmark_report \
      -v ./benchmark_cache:/app/benchmark_cache \
      -e OLLAMA_HOST="http://host.docker.internal:11434" \
      llm-bench \
      --test-model llama3:8b
    ```
  - Option B: Modifying `config.yaml`: Add or modify the `api.ollama_host_url` setting in your `config.yaml` (which you mount into the container) to point to `http://host.docker.internal:11434`.

    Example `config.yaml` snippet:

    ```yaml
    api:
      # ... other keys ...
      ollama_host_url: http://host.docker.internal:11434
    ```

    Then run without the `-e OLLAMA_HOST` flag:

    ```bash
    docker run --rm -it \
      -v ./config.yaml:/app/config.yaml \
      # ... other volumes ...
      llm-bench \
      --test-model llama3:8b
    ```
- Running Specific Commands: You can pass any `llm-bench` command-line arguments after the image name:

  ```bash
  # Check dependencies inside the container
  docker run --rm -it llm-bench --check-dependencies

  # Run specific models with verbose output
  docker run --rm -it --network=host -v $(pwd):/app llm-bench --test-model mistral:7b -v
  ```

  (Adjust `--network=host` or add `-e OLLAMA_HOST` based on your OS for commands requiring Ollama.)

MIT License.
Contributions welcome! Please open an issue or submit a pull request.