A command-line tool to benchmark local (Ollama) and remote (Google Gemini) Large Language Models (LLMs). It evaluates models against a configurable set of tasks, monitors system resources, generates a detailed HTML report with performance visualizations, and supports exporting results.
- Multi-Provider Support: Benchmark models served locally via Ollama (host configurable) and remotely via the Google Gemini API.
- Flexible Task Definition: Define benchmark tasks in a simple JSON format (`benchmark_tasks.json`), organized by category. Includes tasks with varied prompts (e.g., different instructions, personas). See the illustrative sketch after this list.
- Configuration File: Manage default settings (models, paths, API keys/URLs, weights, features, timeouts) via a `config.yaml` file. CLI arguments and environment variables override file settings.
- Diverse Evaluation Methods:
  - Keyword matching (strict 'all' or flexible 'any')
  - Weighted keyword scoring for nuanced evaluation
  - JSON and YAML structure validation and comparison
  - Regex-based information extraction with optional validation rules
  - Python code execution and testing against defined test cases
  - Classification with confidence score checking
  - Semantic similarity comparison (optional, requires `sentence-transformers`)
- Resource Monitoring (Optional):
  - Track CPU RAM usage delta for Ollama models (requires `psutil`).
  - Track NVIDIA GPU memory usage delta (GPU 0) for Ollama models (requires `pynvml`).
- Performance Metrics: Measure API response time and tokens/second (Ollama only).
- Scoring: Calculates overall accuracy, average scores for partial credit tasks, an "Ollama Performance Score", and a category-weighted "Overall Score".
- Reporting:
  - Generates a comprehensive HTML report with summary tables, performance plots (rankings for scores, accuracy, tokens/sec, resource usage, comparison by prompt stage), and detailed per-task results.
  - Optional export of summary results to CSV (`--export-summary-csv`).
  - Optional export of detailed task results to JSON (`--export-details-json`).
- Caching: Caches results to speed up subsequent runs (configurable TTL).
- Utilities: Includes a `--check-dependencies` flag to verify installation and basic functionality of optional libraries.
- Configurable: Control models, tasks, retries, paths, optional features, scoring weights, API endpoints, and more via `config.yaml`, environment variables, and command-line arguments.
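
The exact task schema lives in the shipped `benchmark_tasks.json`; the snippet below is only a rough sketch of how a category containing one keyword-matched task might be laid out, and every field name in it is an illustrative assumption rather than the tool's documented schema.

```json
{
  "nlp": [
    {
      "name": "Sentiment - Complex Complaint",
      "prompt": "Classify the sentiment of the following customer message as positive, negative, or neutral: ...",
      "evaluation": {
        "method": "keyword_match",
        "match": "any",
        "keywords": ["negative"]
      }
    }
  ]
}
```

Refer to the default tasks file for the actual keys and for examples of the weighted-keyword, regex, and code-execution evaluation methods listed above.
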
- Clone the repository:

  ```bash
  git clone https://github.com/colonelpanik/llm-bench.git
  cd llm-bench
  ```
- Recommended: Create and activate a virtual environment:

  ```bash
  python -m venv .venv
  # On Linux/macOS:
  source .venv/bin/activate
  # On Windows:
  # .\.venv\Scripts\activate
  ```
- Install Core Dependencies: The core requirements are `requests` and `PyYAML`.

  ```bash
  pip install requests PyYAML
  ```
- Install Optional Dependencies (As Needed): Install libraries for features you intend to use. See `requirements.txt` for details.
  - RAM Monitoring (`--ram-monitor enable`): `pip install psutil`
  - GPU Monitoring (`--gpu-monitor enable`): `pip install pynvml` (Requires NVIDIA drivers/CUDA toolkit correctly installed)
  - Report Plots (`--visualizations enable`): `pip install matplotlib`
  - Semantic Evaluation (`--semantic-eval enable`): `pip install sentence-transformers` (Downloads model files on first use)

You can check the status of optional dependencies using:

```bash
python -m benchmark_cli --check-dependencies
```

Settings are determined in the following order (later steps override earlier ones):
- Base Defaults: Hardcoded minimal defaults in `config.py`.
- `config.yaml`: Settings loaded from the YAML configuration file (default: `config.yaml`, path configurable via `--config-file`). This is the primary place to set your defaults; a sketch follows below.
- Environment Variables: `GEMINI_API_KEY` overrides `api.gemini_api_key` from `config.yaml`. `OLLAMA_HOST` overrides `api.ollama_host_url` from `config.yaml` (e.g., `OLLAMA_HOST=http://some-other-ip:11434`).
- Command-Line Arguments: Any arguments provided on the command line override all previous settings (e.g., `--test-model`, `--tasks-file`, `--gemini-key`, `--ram-monitor disable`).
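
Only the `api.ollama_host_url` and `api.gemini_api_key` keys are documented above; the other key names in this minimal sketch are assumptions used purely to illustrate where the defaults described in this README would live. Check the shipped `config.yaml` for the real structure and comments.

```yaml
# Minimal sketch, not the shipped file. Only the api.* key names below are
# documented in this README; every other key name is an assumption.
api:
  ollama_host_url: http://localhost:11434   # overridden by OLLAMA_HOST or CLI flags
  gemini_api_key: ""                        # overridden by GEMINI_API_KEY or --gemini-key

# Assumed key names for the defaults described above:
default_models:
  - llama3:8b
  - gemini-1.5-flash-latest
features:
  ram_monitor: true
  visualizations: true
```

Per the precedence order above, running `OLLAMA_HOST=http://some-other-ip:11434 python -m benchmark_cli --test-model mistral:7b` would still win over both the `api.ollama_host_url` value and the default model list in this file.
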
The main entry point is `benchmark_cli.py`.

Show Help:

```bash
python -m benchmark_cli --help
```

Basic Run (uses defaults from `config.yaml` or base):

```bash
python -m benchmark_cli
```

Run specific models, overriding config defaults, and clear the cache:

```bash
python -m benchmark_cli --test-model llama3:8b --test-model gemini-1.5-flash-latest --clear-cache -v
```

Run only the 'nlp' task category and open the report:

```bash
python -m benchmark_cli --task-set nlp --open-report
```

Run only specific tasks by name:

```bash
python -m benchmark_cli --task-name "Sentiment - Complex Complaint" --task-name "Code - Python Factorial"
```

Set the Gemini API key via the CLI (highest precedence):

```bash
python -m benchmark_cli --gemini-key "YOUR_API_KEY" --test-model gemini-1.5-pro-latest
```

(Alternatively, set the `GEMINI_API_KEY` environment variable or define the key in `config.yaml`.)

Use a custom configuration file:

```bash
python -m benchmark_cli --config-file my_settings.yaml
```

Run and export summary results to CSV:

```bash
python -m benchmark_cli --export-summary-csv
```

Check if optional dependencies are installed and working:

```bash
python -m benchmark_cli --check-dependencies
```

- `config.yaml` (Default): Define default models, API endpoints (`ollama_host_url`, `gemini_api_key`), paths, weights, timeouts, feature toggles, etc. See the default file for structure and comments.
- `benchmark_tasks.json` (Default): Define your benchmark tasks here. Path configurable in `config.yaml` or via `--tasks-file`.
- `report_template.html` (Default): Customize the HTML report template. Path configurable in `config.yaml` or via `--template-file`.
- HTML Report (`benchmark_report/report.html`): Detailed report. Path configurable.
- Plots (`benchmark_report/images/*.png`): PNG images embedded in the report. Path configurable.
- Cache Files (`benchmark_cache/cache_*.json`): Stored results. Path configurable.
- Export Files (Optional):
  - `benchmark_report/summary_*.csv`: Summary CSV file if `--export-summary-csv` is used.
  - `benchmark_report/details_*.json`: Detailed JSON file if `--export-details-json` is used.
- Console Output: Progress, summaries, warnings, errors. Use `-v` for more detail.

This project uses GitHub Actions for automated testing. Unit tests are run automatically on pushes and pull requests to the main branch against multiple Python versions.
The tests mock external services (Ollama, Gemini) and do not require live instances or API keys to run in the CI environment.
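
For orientation, a workflow of this kind typically looks something like the sketch below. This is an assumption-laden illustration, not the repository's actual workflow file; the real job names, Python versions, and test command may differ.

```yaml
# Hypothetical sketch only -- the repository's actual workflow under
# .github/workflows/ may use different names, versions, and commands.
name: tests
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
jobs:
  unit-tests:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.10", "3.11", "3.12"]   # "multiple Python versions" (assumed set)
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
      - run: pip install requests PyYAML
      - run: python -m unittest discover           # assumed test runner; Ollama/Gemini are mocked
```
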
A Dockerfile is provided for running the benchmark tool in a containerized environment with all dependencies included.
1. Build the Docker Image:

From the project root directory:

```bash
docker build -t llm-bench .
```

2. Run the Benchmark:
Running the benchmark requires connecting the container to your running Ollama instance. The method depends on your operating system.
- On Linux: Use `--network=host` to share the host's network stack. Ollama running on `localhost:11434` on the host will be accessible via the same address inside the container. Mount volumes for configuration, tasks, reports, and cache.

  ```bash
  docker run --rm -it --network=host \
    -v ./config.yaml:/app/config.yaml \
    -v ./benchmark_tasks.json:/app/benchmark_tasks.json \
    -v ./benchmark_report:/app/benchmark_report \
    -v ./benchmark_cache:/app/benchmark_cache \
    llm-bench \
    --test-model llama3:8b --open-report
  ```

  (Note: `--open-report` might not work reliably from within Docker unless you have a browser configured.)
- On macOS or Windows (Docker Desktop): `--network=host` is not typically supported. Instead, Docker provides a special DNS name, `host.docker.internal`, which resolves to the host machine. You need to tell `llm-bench` to use this address for Ollama.
  - Option A (Recommended): Using an Environment Variable: Set the `OLLAMA_HOST` environment variable when running the container.

    ```bash
    docker run --rm -it \
      -v ./config.yaml:/app/config.yaml \
      -v ./benchmark_tasks.json:/app/benchmark_tasks.json \
      -v ./benchmark_report:/app/benchmark_report \
      -v ./benchmark_cache:/app/benchmark_cache \
      -e OLLAMA_HOST="http://host.docker.internal:11434" \
      llm-bench \
      --test-model llama3:8b
    ```
  - Option B: Modifying `config.yaml`: Add or modify the `api.ollama_host_url` setting in your `config.yaml` (which you mount into the container) to point to `http://host.docker.internal:11434`.

    Example `config.yaml` snippet:

    ```yaml
    api:
      # ... other keys ...
      ollama_host_url: http://host.docker.internal:11434
    ```

    Then run without the `-e OLLAMA_HOST` flag:

    ```bash
    docker run --rm -it \
      -v ./config.yaml:/app/config.yaml \
      # ... other volumes ...
      llm-bench \
      --test-model llama3:8b
    ```
- Running Specific Commands: You can pass any `llm-bench` command-line arguments after the image name:

  ```bash
  # Check dependencies inside the container
  docker run --rm -it llm-bench --check-dependencies

  # Run specific models with verbose output
  docker run --rm -it --network=host -v $(pwd):/app llm-bench --test-model mistral:7b -v
  ```

  (Adjust `--network=host` or add `-e OLLAMA_HOST` based on your OS for commands requiring Ollama.)

MIT License.
Contributions welcome! Please open an issue or submit a pull request.