agent-benchmark is a comprehensive testing framework for AI agents built on the Model Context Protocol (MCP). It enables systematic testing, validation, and benchmarking of AI agents across different LLM providers with robust assertion capabilities.
- Overview
- Key Features
- Installation
- Command Line Reference
- Configuration
- Test/Suite Definition
- Assertions
- Template System
- Data Extraction
- Reports
- Usage Examples
- Best Practices
- Troubleshooting
- CI/CD Integration
- Architecture Notes
- Contributing
agent-benchmark provides a declarative YAML-based approach to testing AI agents that interact with MCP servers. It supports multiple LLM providers, various MCP server types, and comprehensive assertion mechanisms to validate agent behavior.
Test agents across different LLM providers in parallel:
- Google AI (Gemini models)
- Vertex AI (Google Cloud Gemini)
- Anthropic (Claude models)
- OpenAI (GPT models)
- Azure OpenAI
- Groq
Connect to MCP servers via:
- stdio: Run MCP servers as local processes
- SSE: Connect to remote MCP servers via Server-Sent Events
Organize tests into sessions with shared context and message history, simulating real conversational flows.
Run multiple test files with centralized configuration, shared variables, and unified success criteria.
Validate agent behavior with 20+ assertion types covering:
- Tool usage patterns
- Output validation
- Performance metrics
- Boolean combinators (anyOf, allOf, not) for complex logic
Dynamic test generation with Handlebars-style templates supporting:
- Random data generation
- Timestamp manipulation
- Faker integration
- String manipulation
Extract data from tool results using JSONPath to pass between tests in a session.
Generate reports in multiple formats:
- Console output with color-coded results
- HTML reports with performance comparison
- JSON export
- Markdown documentation
Install the latest version with a single command:
Linux/macOS:
curl -fsSL https://raw.githubusercontent.com/mykhaliev/agent-benchmark/master/install.sh | bash
Windows (PowerShell):
irm https://raw.githubusercontent.com/mykhaliev/agent-benchmark/master/install.ps1 | iex
Minimal Install (60-70% smaller download)
For slower connections or to save bandwidth, use the UPX-compressed version:
curl -fsSL https://raw.githubusercontent.com/mykhaliev/agent-benchmark/master/install-min.sh | bash
Note: The minimal version may trigger antivirus warnings on some systems, as UPX compression is sometimes flagged by security software.
Manual Installation from Pre-built Binaries
Download the appropriate file for your system from the releases page:
Regular versions (recommended):
- Linux (AMD64): agent-benchmark_vX.X.X_linux_amd64.tar.gz
- Linux (ARM64): agent-benchmark_vX.X.X_linux_arm64.tar.gz
- macOS (Intel): agent-benchmark_vX.X.X_darwin_amd64.tar.gz
- macOS (Apple Silicon): agent-benchmark_vX.X.X_darwin_arm64.tar.gz
- Windows (AMD64): agent-benchmark_vX.X.X_windows_amd64.zip
- Windows (ARM64): agent-benchmark_vX.X.X_windows_arm64.zip
UPX compressed (smaller size, not available for Windows ARM64):
- Linux (AMD64): agent-benchmark_vX.X.X_linux_amd64_upx.tar.gz
- Linux (ARM64): agent-benchmark_vX.X.X_linux_arm64_upx.tar.gz
- macOS (Intel): agent-benchmark_vX.X.X_darwin_amd64_upx.tar.gz
- macOS (Apple Silicon): agent-benchmark_vX.X.X_darwin_arm64_upx.tar.gz
- Windows (AMD64): agent-benchmark_vX.X.X_windows_amd64_upx.zip
Extract and move to your PATH:
# Linux/macOS
tar -xzf agent-benchmark_*.tar.gz
sudo mv agent-benchmark /usr/local/bin/
# Windows
# Extract the ZIP file and add the binary to your PATH
Build from Source
Requirements: Go 1.25 or higher
Linux/macOS:
# Clone the repository
git clone https://github.com/mykhaliev/agent-benchmark
cd agent-benchmark
# Build the binary
go build -o agent-benchmark
# (Optional) Move to your PATH
sudo mv agent-benchmark /usr/local/bin/
Windows (PowerShell):
# Clone the repository
git clone https://github.com/mykhaliev/agent-benchmark
cd agent-benchmark
# Build the binary
.\build.ps1
# or
go build -o agent-benchmark.exe
After installation, verify it works:
agent-benchmark -v
Run your first benchmark:
agent-benchmark -f tests.yaml -o report -verbose
agent-benchmark [options]
Required (one of):
-f <file> Path to test configuration file (YAML)
-s <file> Path to suite configuration file (YAML)
-generate-report <file> Generate HTML report from existing JSON results file
Optional:
-o <file> Output report filename without extension (default: report)
-l <file> Log file path (default: stdout)
-reportType <types> Report format(s): html, json, md (default: html)
Multiple formats supported as comma-separated values
Examples: -reportType html
-reportType html,json
-reportType html,json,md
-verbose Enable verbose logging
-v                       Show version and exit
Examples:
# Run single test file with verbose output
./agent-benchmark -f tests.yaml -verbose
# Run test suite with JSON report
./agent-benchmark -s suite.yaml -o results -reportType json
# Run with custom log file
./agent-benchmark -f tests.yaml -l test-run.log
# Generate Markdown report
./agent-benchmark -f tests.yaml -o report -reportType md
# Generate HTML report from existing JSON results (fast iteration)
./agent-benchmark -generate-report results.json -o new-report
# Generate both JSON and HTML reports (for later regeneration)
./agent-benchmark -f tests.yaml -o results -reportType json,html
Configuration files use YAML format with six main sections:
providers: # LLM provider configurations
servers: # MCP server definitions
agents: # Agent configurations
sessions: # Test sessions
settings: # Global test settings
variables: # Reusable variables
The framework supports running multiple test files through a suite configuration:
name: "Complete Test Suite"
test_files:
- tests/basic-operations.yaml
- tests/advanced-features.yaml
- tests/edge-cases.yaml
providers:
- name: gemini
type: GOOGLE
token: "{{GOOGLE_API_KEY}}"
model: gemini-2.0-flash
servers:
- name: filesystem
type: stdio
command: npx @modelcontextprotocol/server-filesystem /tmp
agents:
- name: test-agent
provider: gemini
servers:
- name: filesystem
settings:
verbose: true
max_iterations: 10
tool_timeout: 30s
test_delay: 2s
variables:
base_path: "/tmp/tests"
timestamp: "{{now format='unix'}}"
criteria:
success_rate: "0.8" # 80% of tests must passSuite Configuration Benefits:
- Centralized provider and server definitions
- Shared variables across all test files
- Unified success criteria
- Single command execution for multiple test files
Provider Types:
- GOOGLE - Google AI (Gemini)
- VERTEX - Vertex AI (Google Cloud Gemini)
- ANTHROPIC - Anthropic (Claude)
- OPENAI - OpenAI (GPT)
- AZURE - Azure OpenAI
- GROQ - Groq
Define LLM providers for your agents:
providers:
- name: gemini-flash
type: GOOGLE
token: {{GOOGLE_API_KEY}}
model: gemini-2.0-flash
- name: claude-sonnet
type: ANTHROPIC
token: {{ANTHROPIC_API_KEY}}
model: claude-sonnet-4-20250514
- name: gpt-4
type: OPENAI
token: {{OPENAI_API_KEY}}
model: gpt-4o-mini
baseUrl: https://api.openai.com/v1 # Optional
- name: azure-gpt
type: AZURE
token: {{AZURE_API_KEY}}
model: gpt-4
baseUrl: https://your-resource.openai.azure.com
version: 2024-02-15-preview
- name: azure-entra
type: AZURE
auth_type: entra_id # Use Microsoft Entra ID authentication (passwordless)
model: gpt-4
baseUrl: https://your-resource.openai.azure.com
version: 2024-02-15-preview
- name: vertex-ai
type: VERTEX
project_id: "your-gcp-project-id"
location: "us-central1"
credentials_path: "/path/to/service-account.json"
model: gemini-2.0-flash
- name: groq-oss
type: GROQ
token: {{GROQ_API_KEY}}
model: openai/gpt-oss-120b
baseUrl: https://api.groq.com/openai/v1 # Optional
The AZURE provider supports two authentication methods:
API Key Authentication (default):
providers:
- name: azure-apikey
type: AZURE
auth_type: api_key # Optional, this is the default
token: {{AZURE_OPENAI_API_KEY}}
model: gpt-4
baseUrl: https://your-resource.openai.azure.com
version: 2024-02-15-preview
Microsoft Entra ID Authentication (passwordless):
providers:
- name: azure-entra
type: AZURE
auth_type: entra_id # Uses DefaultAzureCredential
model: gpt-4
baseUrl: https://your-resource.openai.azure.com
version: 2024-02-15-preview
# No token required - uses Azure credentials from environment
Entra ID authentication uses Azure's DefaultAzureCredential, which automatically tries multiple authentication methods in order:
- Environment variables: AZURE_CLIENT_ID, AZURE_TENANT_ID, AZURE_CLIENT_SECRET
- Workload Identity (for Kubernetes)
- Managed Identity (when running in Azure)
- Azure CLI (az login)
- Azure Developer CLI (azd auth login)
- Azure PowerShell (Connect-AzAccount)
Required RBAC Role:
Your identity must have the "Cognitive Services OpenAI User" role (or higher) assigned on the Azure OpenAI resource. Without this role, you will receive a 401 Unauthorized error.
To assign the role using Azure CLI:
# Get your Azure OpenAI resource ID
az cognitiveservices account show \
--name <your-openai-resource-name> \
--resource-group <your-resource-group> \
--query id -o tsv
# Assign the required role
az role assignment create \
--assignee <your-email-or-principal-id> \
--role "Cognitive Services OpenAI User" \
--scope <resource-id-from-above>
Note: Role assignments can take up to 5-10 minutes to propagate.
For more information, see Azure Identity authentication and Azure OpenAI RBAC roles.
Providers can be configured with rate limits to proactively throttle requests and avoid exceeding API quotas:
providers:
- name: azure-gpt
type: AZURE
token: {{AZURE_API_KEY}}
model: gpt-4
baseUrl: https://your-resource.openai.azure.com
version: 2024-02-15-preview
rate_limits:
tpm: 30000 # Tokens per minute limit (proactive throttling)
rpm: 60 # Requests per minute limit (proactive throttling)
Rate Limit Configuration Options:
| Option | Description | Default |
|---|---|---|
| tpm | Maximum tokens per minute | No limit |
| rpm | Maximum requests per minute | No limit |
Behavior:
- When rate limits are configured, the framework uses a token bucket algorithm to proactively limit request rates
- This prevents hitting provider rate limits by throttling requests before they're sent
- Rate limiting is applied per-provider, allowing different limits for different API endpoints
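To make the proactive throttling concrete, the sketch below shows a self-contained token bucket of the kind described above: the budget refills continuously and each request drains it once for rpm and once for the estimated token count for tpm. This is only an illustration under those assumptions, not the framework's actual implementation, and the type and helper names are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// bucket is a minimal token bucket: "capacity" units are refilled per minute,
// and take() blocks until the requested amount is available.
type bucket struct {
	mu       sync.Mutex
	capacity float64   // budget per minute (e.g. tpm or rpm)
	tokens   float64   // currently available budget
	last     time.Time // last refill time
}

func newBucket(perMinute float64) *bucket {
	return &bucket{capacity: perMinute, tokens: perMinute, last: time.Now()}
}

// take blocks until n units are available, so the per-minute budget is never exceeded.
func (b *bucket) take(n float64) {
	for {
		b.mu.Lock()
		now := time.Now()
		// Refill proportionally to elapsed time, capped at capacity.
		b.tokens += b.capacity * now.Sub(b.last).Minutes()
		if b.tokens > b.capacity {
			b.tokens = b.capacity
		}
		b.last = now
		if b.tokens >= n {
			b.tokens -= n
			b.mu.Unlock()
			return
		}
		// Not enough budget yet: wait roughly until the deficit has refilled.
		wait := time.Duration((n - b.tokens) / b.capacity * float64(time.Minute))
		b.mu.Unlock()
		time.Sleep(wait)
	}
}

func main() {
	requests := newBucket(60)  // rpm: 60
	tokens := newBucket(30000) // tpm: 30000

	for i := 0; i < 3; i++ {
		requests.take(1)  // one request
		tokens.take(1200) // estimated prompt + completion tokens for this call
		fmt.Println("request", i, "sent at", time.Now().Format(time.RFC3339))
	}
}
```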
By default, 429 (Too Many Requests) errors are treated as regular errors and cause the test to fail immediately. If you want to retry on 429 errors, you can configure this separately:
providers:
- name: azure-gpt
type: AZURE
token: {{AZURE_API_KEY}}
model: gpt-4
baseUrl: https://your-resource.openai.azure.com
version: 2024-02-15-preview
rate_limits:
tpm: 30000 # Proactive rate limiting
rpm: 60
retry:
retry_on_429: true # Enable retry on 429 errors (default: false)
max_retries: 3 # Max retry attempts (default: 3 when enabled)
Retry Configuration Options:
| Option | Description | Default |
|---|---|---|
| retry_on_429 | Enable automatic retry when receiving 429 errors | false |
| max_retries | Number of retry attempts for 429 errors | 3 (when enabled) |
Behavior:
- By default, 429 errors fail immediately (no retry)
- When retry_on_429: true is set, the framework will retry with exponential backoff
- The framework extracts the wait duration from:
  - the HTTP Retry-After header (preferred), parsed as seconds or an HTTP-date
  - the error message text (fallback), e.g. "retry after 30 seconds"
- A 1-second buffer is added to ensure the rate limit window has passed
Note: Rate limiting (proactive throttling) and 429 retry handling (reactive recovery) are separate concepts; you can use either or both depending on your needs.
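For the reactive side, the following sketch illustrates how a 429 wait time could be derived in the order described above: the Retry-After header first (seconds or HTTP-date), then a "retry after N seconds" hint in the error text, plus the 1-second buffer. The function name is hypothetical; this is not the framework's actual code.

```go
package main

import (
	"fmt"
	"net/http"
	"regexp"
	"strconv"
	"time"
)

// retryAfter decides how long to wait before retrying a 429 response,
// mirroring the order described above.
func retryAfter(resp *http.Response, errText string) time.Duration {
	const buffer = time.Second

	if resp != nil {
		if h := resp.Header.Get("Retry-After"); h != "" {
			if secs, err := strconv.Atoi(h); err == nil {
				return time.Duration(secs)*time.Second + buffer
			}
			if t, err := http.ParseTime(h); err == nil { // HTTP-date form
				return time.Until(t) + buffer
			}
		}
	}
	// Fallback: scan the error message, e.g. "retry after 30 seconds".
	re := regexp.MustCompile(`retry after (\d+) seconds?`)
	if m := re.FindStringSubmatch(errText); len(m) == 2 {
		if secs, err := strconv.Atoi(m[1]); err == nil {
			return time.Duration(secs)*time.Second + buffer
		}
	}
	// No hint found: fall back to a default backoff.
	return 5*time.Second + buffer
}

func main() {
	resp := &http.Response{StatusCode: 429, Header: http.Header{"Retry-After": []string{"30"}}}
	fmt.Println(retryAfter(resp, ""))                           // 31s
	fmt.Println(retryAfter(nil, "429: retry after 12 seconds")) // 13s
}
```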
Configure MCP servers that agents will interact with:
servers:
- name: filesystem-server
type: stdio
command: npx @modelcontextprotocol/server-filesystem /tmp
servers:
- name: remote-api
type: sse
url: https://api.example.com/mcp/events
headers:
- "Authorization: Bearer {{API_TOKEN}}"
- "X-Custom-Header: value"Server Types:
- stdio - Standard Input/Output communication
- sse - Server-Sent Events over HTTP
Control server initialization and process delays:
servers:
- name: slow-server
type: stdio
command: python server.py
server_delay: 45s # Wait up to 45s for initialization
process_delay: 1s # Wait 1s after process starts
Delay Parameters:
- server_delay - Maximum time to wait for server initialization (default: 30s)
- process_delay - Delay after starting the process before initialization (default: 300ms)
servers:
- name: authenticated-api
type: sse
url: https://api.example.com/mcp/sse
headers:
- "Authorization: Bearer {{API_TOKEN}}"
- "X-API-Version: 2024-01"
- "X-Client-ID: agent-benchmark"Define agents that combine providers with MCP servers:
agents:
- name: research-agent
provider: gemini-flash
system_prompt: |
You are an autonomous research agent.
Execute tasks directly without asking for clarification.
Use available tools to complete the requested tasks.
servers:
- name: filesystem-server
allowedTools: # Optional: restrict tool access
- read_file
- list_directory
- name: remote-api
- name: coding-agent
provider: claude-sonnet
servers:
- name: filesystem-server # No tool restrictions
Agent Configuration:
- name - Unique agent identifier
- provider - Reference to provider name
- system_prompt - Optional system prompt prepended to all conversations (supports templates)
- servers - List of MCP servers
- allowedTools - Optional tool whitelist per server
System Prompt Templates:
The system_prompt field supports template variables for dynamic context:
- {{AGENT_NAME}} - Current agent name
- {{SESSION_NAME}} - Current session name
- {{PROVIDER_NAME}} - Provider name being used
Example:
agents:
- name: test-agent
provider: gemini-flash
system_prompt: |
You are {{AGENT_NAME}} using {{PROVIDER_NAME}}.
Currently running session: {{SESSION_NAME}}.
Execute all tasks autonomously.
Organize tests into sessions with shared conversational context:
sessions:
- name: File Operations
tests:
- name: Create a file
prompt: "Create a file called {{filename}} with content: Hello World"
assertions:
- type: tool_called
tool: write_file
- name: Read the file
prompt: "Read the file {{filename}}"
assertions:
- type: tool_called
tool: read_file
- type: output_contains
value: "Hello World"Session Features:
- Tests within a session share message history
- Variables persist across tests in a session
- Simulates multi-turn conversations
Global configuration for test execution:
settings:
verbose: true # Enable detailed logging
max_iterations: 10 # Maximum agent reasoning loops
timeout: 30s # Tool execution timeout (legacy, use tool_timeout)
tool_timeout: 30s # Tool execution timeout
test_delay: 2s # Delay between tests
Define reusable variables with template support:
variables:
filename: "test-{{randomValue type='ALPHANUMERIC' length=8}}.txt"
timestamp: "{{now format='unix'}}"
user_id: "{{randomInt lower=1000 upper=9999}}"
email: "{{faker 'Internet.email'}}"Variables can:
- Use template helpers
- Reference environment variables
Delay individual test execution:
tests:
- name: Rate-limited API call
prompt: "Make API request"
start_delay: 5s # Wait 5 seconds before starting
assertions:
- type: tool_called
tool: api_request
Pause between all tests:
settings:
test_delay: 2s # 2 second pause after each test
Use Cases:
- Respect API rate limits
- Allow system state to settle
- Prevent resource exhaustion
Define minimum success rate for test suites:
criteria:
success_rate: 0.75 # 75% pass rate required
Exit Code Behavior:
| Scenario | Exit Code |
|---|---|
| All tests pass / Success rate met | 0 |
| Some tests fail / Success rate not met | 1 |
Reference environment variables in configuration:
providers:
- name: claude
type: ANTHROPIC
token: "{{ANTHROPIC_API_KEY}}"
model: claude-sonnet-4-20250514
servers:
- name: api-server
type: sse
url: "{{API_BASE_URL}}"
headers:
- "Authorization: Bearer {{API_TOKEN}}"
variables:
workspace: "{{WORKSPACE_PATH}}"Convention:
- Use {{VAR_NAME}} syntax
- Set before running tests
- Common for tokens, URLs, paths
export ANTHROPIC_API_KEY="sk-ant-..."
export API_BASE_URL="https://api.example.com"
export WORKSPACE_PATH="/tmp/workspace"
./agent-benchmark -f tests.yaml
agent-benchmark provides 20+ assertion types to validate agent behavior:
Verify agent only uses available tools:
assertions:
- type: no_hallucinated_tools
Verify a specific tool was invoked:
assertions:
- type: tool_called
tool: create_file
Ensure a tool was NOT invoked:
assertions:
- type: tool_not_called
tool: delete_database
Validate the exact number of tool calls. The tool name is optional; if it is omitted, the total number of tool calls is verified:
assertions:
- type: tool_call_count
tool: search_api
count: 3
Verify tools were called in a specific sequence:
assertions:
- type: tool_call_order
sequence:
- validate_input
- process_data
- save_results
Check tool parameters match exactly:
assertions:
- type: tool_param_equals
tool: create_user
params:
name: "John Doe"
age: 30
email: "john@example.com"
settings.theme: "dark" # Nested parameter with dot notation
Nested Parameter Validation:
Use dot notation for nested parameters:
assertions:
- type: tool_param_equals
tool: create_resource
params:
name: "test-resource"
config.timeout: "30"
config.retry.max_attempts: "3"
config.retry.backoff: "exponential"
metadata.tags.environment: "production"
Dot Notation Rules:
- Navigate nested maps with dots
- Validate deeply nested values
- Compare exact matches at any depth
Validate parameters with regex patterns:
assertions:
- type: tool_param_matches_regex
tool: send_email
params:
recipient: "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"
Validate tool results using JSONPath:
assertions:
- type: tool_result_matches_json
tool: get_user
path: "$.data.user.name"
value: "John Doe"Check if output contains specific text:
assertions:
- type: output_contains
value: "Operation completed successfully"Ensure output doesn't contain specific text:
assertions:
- type: output_not_contains
value: "error"Validate output with regex pattern:
assertions:
- type: output_regex
pattern: "^User ID: [0-9]{4,}$"Limit approximate token usage:
assertions:
- type: max_tokens
value: 1000
Token Estimation:
Token usage for OpenAI, Google, and Anthropic models is taken from GenerationInfo. For other models, the formula is:
tokens = output_length / 4
This approximation:
- Provides rough token counts
- Useful for max_tokens assertions
- Not exact (varies by tokenizer)
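A minimal illustration of the fallback estimate (the helper name is hypothetical, not part of the tool):

```go
package main

import (
	"fmt"
	"strings"
)

// estimateTokens is the fallback approximation described above:
// roughly one token per four characters of output text.
func estimateTokens(output string) int {
	return len(output) / 4
}

func main() {
	// A 400-character response counts as roughly 100 tokens
	// when checked against a max_tokens assertion.
	output := strings.Repeat("tool", 100) // 400 characters
	fmt.Println(estimateTokens(output))   // 100
}
```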
Ensure execution completes within time limit:
assertions:
- type: max_latency_ms
value: 5000 # 5 seconds
Verify execution completed without errors:
assertions:
- type: no_error_messages
Boolean combinators allow you to create complex assertion logic using JSON Schema-style operators. These are useful when LLMs may achieve the same outcome through different approaches.
Pass if ANY child assertion passes (OR logic):
assertions:
# Pass if the LLM used keyboard_control OR ui_automation
- anyOf:
- type: tool_called
tool: keyboard_control
- type: tool_called
tool: ui_automation
Pass if ALL child assertions pass (AND logic):
assertions:
# Pass if both conditions are met
- allOf:
- type: tool_called
tool: create_file
- type: output_contains
value: "File created successfully"Pass if the child assertion FAILS (negation):
assertions:
# Pass if output does NOT contain "error" (equivalent to output_not_contains)
- not:
type: output_contains
value: "error"Combinators can be nested for complex logic:
assertions:
# Pass if: (keyboard OR ui_automation) AND no errors
- allOf:
- anyOf:
- type: tool_called
tool: keyboard_control
- type: tool_called
tool: ui_automation
- type: no_error_messages
# Pass if NOT (error in output AND failed tool)
- not:
allOf:
- type: output_contains
value: "error"
- type: tool_not_called
tool: success_handler
Use Cases:
- Testing LLMs that may use different tools to achieve the same goal
- Validating that at least one of several acceptable outcomes occurred
- Creating exclusion rules (must NOT match a pattern)
- Complex conditional validation logic
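Conceptually, combinators evaluate recursively over their children: anyOf succeeds when any child succeeds, allOf when all do, and not inverts its child. The sketch below illustrates that recursion only; the types and field names are hypothetical, not the framework's internals.

```go
package main

import "fmt"

// Assertion is a minimal stand-in for an assertion node: either a leaf
// check or a combinator over child assertions.
type Assertion struct {
	Check func() bool  // leaf assertion (nil for combinators)
	AnyOf []*Assertion // OR
	AllOf []*Assertion // AND
	Not   *Assertion   // negation
}

// Eval mirrors the combinator semantics described above.
func Eval(a *Assertion) bool {
	switch {
	case a.Not != nil:
		return !Eval(a.Not)
	case len(a.AnyOf) > 0:
		for _, c := range a.AnyOf {
			if Eval(c) {
				return true
			}
		}
		return false
	case len(a.AllOf) > 0:
		for _, c := range a.AllOf {
			if !Eval(c) {
				return false
			}
		}
		return true
	default:
		return a.Check()
	}
}

func main() {
	keyboardCalled, uiCalled, hasErrors := false, true, false
	// (keyboard OR ui_automation) AND no errors
	root := &Assertion{AllOf: []*Assertion{
		{AnyOf: []*Assertion{
			{Check: func() bool { return keyboardCalled }},
			{Check: func() bool { return uiCalled }},
		}},
		{Not: &Assertion{Check: func() bool { return hasErrors }}},
	}}
	fmt.Println(Eval(root)) // true
}
```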
agent-benchmark includes a powerful template engine based on Handlebars with custom helpers:
Generate random strings:
# Alphanumeric (default)
{{randomValue length=10}}
# Output: aB3xY9kL2m
# Alphabetic only
{{randomValue type='ALPHABETIC' length=8}}
# Output: AbCdEfGh
# Numeric only
{{randomValue type='NUMERIC' length=6}}
# Output: 123456
# Hexadecimal
{{randomValue type='HEXADECIMAL' length=8}}
# Output: 1a2b3c4d
# Alphanumeric with symbols
{{randomValue type='ALPHANUMERIC_AND_SYMBOLS' length=12}}
# Output: aB3@xY9!kL2#
# UUID
{{randomValue type='UUID'}}
# Output: 550e8400-e29b-41d4-a716-446655440000
# Uppercase
{{randomValue type='ALPHABETIC' length=8 uppercase=true}}
# Output: ABCDEFGH
Types:
- ALPHANUMERIC (default) - Letters and numbers
- ALPHABETIC - Letters only
- NUMERIC - Numbers only
- HEXADECIMAL - Hex characters (0-9, a-f)
- ALPHANUMERIC_AND_SYMBOLS - Letters, numbers, and symbols
- UUID - UUID v4
Generate random integers:
# Random int between 0 and 100 (default)
{{randomInt}}
# Custom range
{{randomInt lower=1000 upper=9999}}
# Output: 5847
# Negative range
{{randomInt lower=-100 upper=100}}
Generate random decimal numbers:
# Random decimal between 0.00 and 100.00 (default)
{{randomDecimal}}
# Custom range
{{randomDecimal lower=10.5 upper=99.9}}
# Output: 45.73
Generate timestamps with formatting and offsets:
# Current ISO8601 timestamp (default)
{{now}}
# Output: 2024-01-15T14:30:00Z
# Unix epoch (milliseconds)
{{now format='epoch'}}
# Output: 1705329000000
# Unix timestamp (seconds)
{{now format='unix'}}
# Output: 1705329000
# Custom format (Java SimpleDateFormat style)
{{now format='yyyy-MM-dd HH:mm:ss'}}
# Output: 2024-01-15 14:30:00
# With timezone
{{now timezone='America/New_York'}}
# With offset
{{now offset='3 days'}}
{{now offset='-24 hours'}}
{{now offset='1 years'}}
# Combined
{{now format='yyyy-MM-dd' offset='7 days' timezone='UTC'}}
Offset Units:
seconds/second, minutes/minute, hours/hour, days/day, weeks/week, months/month, years/year
Generate realistic fake data:
# Names
{{faker 'Name.first_name'}} # John
{{faker 'Name.last_name'}} # Smith
{{faker 'Name.full_name'}} # John Smith
{{faker 'Name.prefix'}} # Mr.
{{faker 'Name.suffix'}} # Jr.
# Addresses
{{faker 'Address.street'}} # 123 Main St
{{faker 'Address.city'}} # New York
{{faker 'Address.state'}} # California
{{faker 'Address.state_abbrev'}} # CA
{{faker 'Address.country'}} # United States
{{faker 'Address.postcode'}} # 12345
# Phone
{{faker 'Phone.number'}} # 555-1234
{{faker 'Phone.number_formatted'}} # (555) 123-4567
# Internet
{{faker 'Internet.email'}} # john@example.com
{{faker 'Internet.username'}} # john_doe_123
{{faker 'Internet.url'}} # https://example.com
{{faker 'Internet.ipv4'}} # 192.168.1.1
{{faker 'Internet.ipv6'}} # 2001:0db8:85a3::8a2e:0370:7334
{{faker 'Internet.mac'}} # 00:1B:44:11:3A:B7
# Company
{{faker 'Company.name'}} # Tech Corp
{{faker 'Company.suffix'}} # Inc.
{{faker 'Company.profession'}} # Software Engineer
# Lorem
{{faker 'Lorem.word'}} # ipsum
{{faker 'Lorem.sentence'}} # Lorem ipsum dolor sit amet
{{faker 'Lorem.paragraph'}} # Full paragraph text
# Finance
{{faker 'Finance.credit_card'}} # 4532-1234-5678-9010
{{faker 'Finance.currency'}} # USD
# Misc
{{faker 'Misc.uuid'}} # 550e8400-e29b-41d4-a716-446655440000
{{faker 'Misc.boolean'}} # true/false
{{faker 'Misc.date'}} # 2024-01-15
{{faker 'Misc.time'}} # 14:30:00
{{faker 'Misc.timestamp'}} # 1705329000
{{faker 'Misc.digit'}} # 7
Remove substrings:
{{cut "Hello World" "World"}}
# Output: Hello
{{cut filename ".txt"}}
Replace substrings:
{{replace "Hello World" "World" "Universe"}}
# Output: Hello Universe
{{replace email "@example.com" "@test.com"}}
Extract substrings:
{{substring "Hello World" start=0 end=5}}
# Output: Hello
{{substring text start=6}}
# Output: Rest of string from position 6
Extract data from tool results to use in subsequent tests:
sessions:
- name: User Workflow
tests:
- name: Create user
prompt: "Create a new user"
extractors:
- type: jsonpath
tool: create_user
path: "$.data.user.id"
variable_name: user_id
assertions:
- type: tool_called
tool: create_user
- name: Get user details
prompt: "Get details for user {{user_id}}"
assertions:
- type: tool_called
tool: get_user
- type: tool_param_equals
tool: get_user
params:
id: "{{user_id}}"Extractor Configuration:
- type - Extraction method (currently: jsonpath)
- tool - Tool name to extract from
- path - JSONPath expression
- variable_name - Variable name for template context
Use Cases:
- Extract IDs from creation operations
- Pass data between sequential tests
- Validate consistency across operations
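To show what a JSONPath such as $.data.user.id resolves to, here is a tiny stand-in evaluator that only handles simple dotted paths against a decoded JSON object. The real framework uses a full JSONPath implementation; this is just an illustration and the function name is hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// lookup walks a decoded JSON object along a simple dotted path
// like "$.data.user.id" and returns the value it lands on.
func lookup(doc map[string]any, path string) (any, bool) {
	parts := strings.Split(strings.TrimPrefix(path, "$."), ".")
	var cur any = doc
	for _, p := range parts {
		m, ok := cur.(map[string]any)
		if !ok {
			return nil, false
		}
		cur, ok = m[p]
		if !ok {
			return nil, false
		}
	}
	return cur, true
}

func main() {
	raw := `{"data": {"user": {"id": "u-42", "name": "John Doe"}}}`
	var doc map[string]any
	if err := json.Unmarshal([]byte(raw), &doc); err != nil {
		panic(err)
	}
	// The extracted value would be stored as the template variable user_id.
	id, _ := lookup(doc, "$.data.user.id")
	fmt.Println(id) // u-42
}
```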
agent-benchmark generates comprehensive reports in multiple formats. You can specify the output filename with -o (extension added automatically) and generate multiple formats simultaneously using -reportType with comma-separated values.
📊 View Sample Reports - See example HTML reports covering all test configuration permutations (single/multi agent, single/multi test, sessions, suites).
📖 Report Documentation - Detailed documentation on report hierarchy, sections, and adaptive display.
- Console - Real-time colored output during execution (default, always shown)
- HTML - Rich visual dashboard with charts and metrics
- JSON - Structured data for programmatic analysis
- Markdown - Documentation-friendly format
# Console output only (default)
agent-benchmark -f test.yaml
# Generate HTML report
agent-benchmark -f test.yaml -o my-report -reportType html
# Generate multiple formats
agent-benchmark -f test.yaml -o my-report -reportType html,json,md
# All formats
agent-benchmark -f test.yaml -o my-report -reportType html,json,md
Real-time colored output displayed during test execution with three main sections:
Server Comparison Summary
- Test-by-test comparison across agents
- Pass/fail status with checkmarks
- Duration per agent
- Provider information
- Summary statistics (e.g., "2/2 servers passed")
Detailed Test Results
- Individual test results per agent
- All assertion results with pass/fail indicators
- Detailed metrics for each assertion (expected vs actual values)
- Token usage and latency information
- Error details (if any)
Execution Summary
- Total tests, passed, and failed counts
- Pass rate percentage
- Total tool calls
- Total errors
- Total and average duration
- Total tokens used
Example:
═══════════════════════════════════════════════════════════════
SERVER COMPARISON SUMMARY
═══════════════════════════════════════════════════════════════
📋 Test: Create file [100% passed]
Summary: 2/2 servers passed
┌─────────────────────────────────────────────────────────────┐
│ Server/Agent │ Status │ Duration │
├─────────────────────────────────────────────────────────────┤
│ gemini-agent │ ✓ PASS │ 2.34s │
│ └─ [GOOGLE] │ │ │
│ claude-agent │ ✗ FAIL │ 3.12s │
│ └─ [ANTHROPIC] │ │ │
└─────────────────────────────────────────────────────────────┘
═══════════════════════════════════════════════════════════════
DETAILED TEST RESULTS
═══════════════════════════════════════════════════════════════
📋 Test: Create file
✓ gemini-agent [GOOGLE] (2.34s)
✓ tool_called: Tool 'write_file' was called
✓ tool_param_equals: Tool called with correct parameters
✓ max_latency_ms: Latency: 2340ms (max: 5000ms)
• actual: 2340
• max: 5000
✗ claude-agent [ANTHROPIC] (3.12s)
✓ tool_called: Tool 'write_file' was called
✗ tool_param_equals: Tool called with incorrect parameters
• expected: {"path": "test.txt", "content": "Hello"}
• actual: {"path": "test.txt"}
✓ max_latency_ms: Latency: 3120ms (max: 5000ms)
• actual: 3120
• max: 5000
═══════════════════════════════════════════════════════════════
Total: 2 | Passed: 1 | Failed: 1
═══════════════════════════════════════════════════════════════
================================================================================
[Summary] Test Execution Summary
================================================================================
Total Tests: 2
Passed: 1 (50.0%)
Failed: 1 (50.0%)
Total Tool Calls: 2
Total Errors: 1
Total Duration: 5460ms (avg: 2730ms per test)
Total Tokens: 350
================================================================================
Rich visual report featuring:
Summary Dashboard
- Total/Passed/Failed test counts
- Overall success rate with color-coded statistics
Agent Performance Comparison
- Statistics by agent with visual metrics
- Success rates with percentage indicators
- Average duration and latency
- Token usage (total and average per test)
- Pass/fail counts per agent
Server Comparison Summary
- Side-by-side test results across agents
- Per-test success rates
- Execution duration comparison
- Failed server details with error messages
Detailed Test Results
- Full execution details per agent
- Individual assertion results with pass/fail status
- Performance metrics (duration, tokens, latency)
- Tool call information and parameters
The HTML report is built from modular, reusable template components. Each report type composes these building blocks differently based on context (single agent vs multi-agent, single file vs suite, etc.).
graph TD
subgraph "Main Layout"
A[report.html] --> B[summary-cards]
A --> C[comparison-matrix]
A --> D[agent-leaderboard]
A --> E[file-summary]
A --> F[session-summary]
A --> G[test-results]
A --> H[fullscreen-overlay]
A --> I[scripts]
end
subgraph "Test Results Container"
G --> J[test-group]
end
subgraph "View Selection"
J -->|"1 agent"| K[single-agent-detail]
J -->|"2+ agents"| L[multi-agent-comparison]
end
subgraph "Single Agent Components"
K --> K1[agent-assertions]
K --> K2[agent-errors]
K --> K3[agent-sequence-diagram]
K --> K4[agent-tool-calls]
K --> K5[agent-messages]
K --> K6[agent-final-output]
end
subgraph "Multi-Agent Components"
L --> L1[comparison-table]
L --> L2[tool-comparison]
L --> L3[errors-comparison]
L --> L4[sequence-comparison]
L --> L5[outputs-comparison]
end
Reports are designed hierarchically, with each level building upon the previous:
| Level | Report Type | Description | Key Components |
|---|---|---|---|
| 1 | Single Agent, Single Test | Simplest case - one agent, one test | Summary cards, single-agent-detail |
| 2 | Single Agent, Multiple Tests | Multiple independent tests, same agent | + test-overview table |
| 3 | Multiple Agents | Compare agents on same tests | + comparison-matrix, agent-leaderboard |
| 4 | Multiple Sessions | Tests grouped by session with shared context | + session-summary |
| 5 | Full Suite | Multi-agent, multi-session, multi-file | All components combined |
Generate sample reports for each level:
go run test/generate_reports.go
This creates hierarchical sample reports in generated_reports/:
- 01_single_agent_single_test.html - Level 1: One agent, one test
- 02_single_agent_multi_test.html - Level 2: One agent, multiple tests
- 03_multi_agent_single_test.html - Level 3: Multiple agents, one test (leaderboard)
- 04_multi_agent_multi_test.html - Level 4: Multiple agents, multiple tests (matrix)
- 05_single_agent_multi_session.html - Level 5: One agent, multiple sessions
- 06_multi_agent_multi_session.html - Level 6: Multiple agents, multiple sessions
- 07_single_agent_multi_file.html - Level 7: One agent, multiple files
- 08_multi_agent_multi_file.html - Level 8: Full suite (multiple agents, sessions, files)
- 09_failed_with_errors.html - Error display example
Single Agent Report - One agent running tests:
graph LR
subgraph "Single Agent Report"
A[summary-cards] --> B[test-results]
B --> C[test-group]
C --> D[single-agent-detail]
D --> D1[assertions]
D --> D2[errors]
D --> D3[sequence-diagram]
D --> D4[tool-calls]
D --> D5[messages]
D --> D6[final-output]
end
Multi-Agent Report - Multiple agents compared on same tests:
graph LR
subgraph "Multi-Agent Report"
A[summary-cards] --> B[comparison-matrix]
B --> C[agent-leaderboard]
C --> D[test-results]
D --> E[test-group]
E --> F[multi-agent-comparison]
F --> F1[comparison-table]
F --> F2[tool-comparison]
F --> F3[errors-comparison]
F --> F4[sequence-comparison]
F --> F5[outputs-comparison]
end
Multi-Session Report - Tests organized by conversation sessions:
graph LR
subgraph "Multi-Session Report"
A[summary-cards] --> B[session-summary]
B --> C[test-results]
C --> D[test-group]
D --> E[single-agent-detail]
end
Full Suite Report - Multiple test files with optional multi-agent:
graph LR
subgraph "Full Suite Report"
A[summary-cards] --> B[comparison-matrix]
B --> C[agent-leaderboard]
C --> D[file-summary]
D --> E[test-results]
E --> F[test-group]
F -->|"1 agent"| G[single-agent-detail]
F -->|"2+ agents"| H[multi-agent-comparison]
end
| Component | Purpose | Used In |
|---|---|---|
| summary-cards | Top-level stats (total/passed/failed/tokens/duration) | All reports |
| comparison-matrix | Test × Agent pass/fail matrix | Multi-agent |
| agent-leaderboard | Ranked agent performance table | Multi-agent |
| file-summary | Test file grouping with stats | Suite runs |
| session-summary | Session grouping with flow diagrams | Multi-session |
| test-results | Container for all test groups | All reports |
| test-group | Single test, decides single vs multi view | All reports |
| single-agent-detail | Detailed expandable view for one agent | Single-agent |
| multi-agent-comparison | Side-by-side comparison table | Multi-agent |
| agent-assertions | Assertion results list | Single-agent |
| agent-errors | Error messages display | Single-agent |
| agent-sequence-diagram | Mermaid execution flow diagram | Single-agent |
| agent-tool-calls | Tool calls timeline with params/results | Single-agent |
| agent-messages | Conversation history | Single-agent |
| agent-final-output | Final agent response | Single-agent |
| tool-comparison | Tool calls side-by-side | Multi-agent |
| errors-comparison | Errors side-by-side | Multi-agent |
| sequence-comparison | Diagrams side-by-side (click to fullscreen) | Multi-agent |
| outputs-comparison | Final outputs side-by-side | Multi-agent |
| fullscreen-overlay | Modal overlay for enlarged diagrams | All reports |
| scripts | Mermaid init, expand/collapse, fullscreen JS | All reports |
Structured test results for programmatic analysis and CI/CD integration:
{
"agent_benchmark_version": "1.0.0",
"generated_at": "2024-01-15T14:30:00Z",
"summary": {
"total": 10,
"passed": 8,
"failed": 2
},
"comparison_summary": {
"Test Name": {
"testName": "Create file",
"serverResults": {
"gemini-agent": {
"agentName": "gemini-agent",
"provider": "GOOGLE",
"passed": true,
"duration": 2340,
"errors": []
},
"claude-agent": {
"agentName": "claude-agent",
"provider": "ANTHROPIC",
"passed": false,
"duration": 3120,
"errors": ["Tool parameter mismatch"]
}
},
"totalRuns": 2,
"passedRuns": 1,
"failedRuns": 1
}
},
"detailed_results": [
{
"execution": {
"testName": "Create file",
"agentName": "gemini-agent",
"providerType": "GOOGLE",
"startTime": "2024-01-15T14:30:00Z",
"endTime": "2024-01-15T14:30:02Z",
"tokensUsed": 150,
"latencyMs": 2340,
"errors": []
},
"assertions": [
{
"type": "tool_called",
"passed": true,
"message": "Tool 'write_file' was called"
},
{
"type": "tool_param_equals",
"passed": true,
"message": "Tool 'write_file' called with correct parameters"
}
],
"passed": true
}
]
}
Key Fields
- summary - Overall test statistics
- comparison_summary - Cross-agent comparison data
- detailed_results - Full execution details with assertions
- agent_benchmark_version - Version of the tool used
- generated_at - Report generation timestamp
Documentation-friendly format ideal for README files, wikis, and technical documentation.
Key Features
- Clean, readable format for documentation
- Summary tables with comparison data
- Detailed assertion results per agent
- Easy to include in GitHub README or wiki pages
- Portable across documentation platforms
- Quick visual identification of pass/fail status
providers:
- name: gemini
type: GOOGLE
token: ${GOOGLE_API_KEY}
model: gemini-2.0-flash
servers:
- name: fs
type: stdio
command: npx @modelcontextprotocol/server-filesystem /tmp
agents:
- name: file-agent
provider: gemini
servers:
- name: fs
settings:
verbose: true
max_iterations: 5
variables:
filename: "test-{{randomValue length=8}}.txt"
content: "{{faker 'Lorem.paragraph'}}"
sessions:
- name: File Tests
tests:
- name: Create file
prompt: "Create a file {{filename}} with content: {{content}}"
assertions:
- type: tool_called
tool: write_file
- type: file_created
path: "/tmp/{{filename}}"
- name: Read file
prompt: "Read {{filename}}"
assertions:
- type: tool_called
tool: read_file
- type: output_contains
value: "{{content}}"Run:
./agent-benchmark -f file-tests.yaml -o results -verbose
providers:
- name: claude
type: ANTHROPIC
token: ${ANTHROPIC_API_KEY}
model: claude-sonnet-4-20250514
servers:
- name: api-server
type: sse
url: https://api.example.com/mcp/events
headers:
- "Authorization: Bearer ${API_TOKEN}"
agents:
- name: api-agent
provider: claude
servers:
- name: api-server
settings:
tool_timeout: 10s
max_iterations: 8
variables:
user_id: "{{randomInt lower=1000 upper=9999}}"
email: "{{faker 'Internet.email'}}"
timestamp: "{{now format='unix'}}"
sessions:
- name: User Management
tests:
- name: Create user
prompt: |
Create a new user with:
- ID: {{user_id}}
- Email: {{email}}
- Created: {{timestamp}}
assertions:
- type: tool_called
tool: create_user
- type: tool_param_equals
tool: create_user
params:
id: "{{user_id}}"
email: "{{email}}"
- type: output_json_valid
- type: max_latency_ms
value: 5000
- name: Fetch user
prompt: "Get user {{user_id}}"
assertions:
- type: tool_called
tool: get_user
- type: output_matches_json
path: "$.data.email"
value: "{{email}}"test:
stage: test
script:
- ./agent-benchmark -s suite.yaml -o results.json -reportType json
artifacts:
when: always
paths:
- results.json
reports:
junit: results.json
variables:
GOOGLE_API_KEY: ${GOOGLE_API_KEY}
ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY}
Within a session, tests share conversation history:
Session Start
├─ Test 1: "Create file"
│ └─ Messages: [user, assistant, tool_response]
├─ Test 2: "Read file" # Has Test 1 history
│ └─ Messages: [prev..., user, assistant, tool_response]
└─ Test 3: "Delete file" # Has Test 1 & 2 history
└─ Messages: [prev..., user, assistant, tool_response]
1. User sends prompt
2. Agent calls LLM with tools
3. LLM responds with:
a) Final answer → Done
b) Tool calls → Execute tools → Back to step 2
4. Repeat until:
- Final answer received
- Max iterations reached
- Context cancelled
- Error occurred
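A rough sketch of this loop, with hypothetical helpers standing in for the real LLM and MCP plumbing (not the project's actual API):

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// llmReply is a simplified LLM response: either a final answer or tool calls.
type llmReply struct {
	Final     string
	ToolCalls []string
}

// Hypothetical helpers standing in for the real LLM/MCP plumbing.
func callLLM(ctx context.Context, history []string) (llmReply, error) {
	// A real implementation would send the history plus tool schemas to the provider.
	return llmReply{Final: "done"}, nil
}

func executeTool(ctx context.Context, name string) (string, error) {
	return "tool result for " + name, nil
}

// runAgent mirrors the loop described above: call the LLM, execute any
// requested tools, feed results back, and stop on a final answer,
// max iterations, context cancellation, or an error.
func runAgent(ctx context.Context, prompt string, maxIterations int) (string, error) {
	history := []string{"user: " + prompt}
	for i := 0; i < maxIterations; i++ {
		if err := ctx.Err(); err != nil {
			return "", err // context cancelled
		}
		reply, err := callLLM(ctx, history)
		if err != nil {
			return "", err
		}
		if len(reply.ToolCalls) == 0 {
			return reply.Final, nil // final answer received
		}
		for _, tool := range reply.ToolCalls {
			result, err := executeTool(ctx, tool)
			if err != nil {
				return "", err
			}
			history = append(history, "tool "+tool+": "+result)
		}
	}
	return "", errors.New("max iterations reached")
}

func main() {
	out, err := runAgent(context.Background(), "Create a file", 10)
	fmt.Println(out, err)
}
```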
The agent can detect when an LLM asks for clarification instead of taking action. This is a common issue where LLMs respond with questions like:
- "Would you like me to..."
- "Do you want me to..."
- "Should I proceed..."
- "Please confirm..."
This feature is disabled by default and can be enabled per agent using the clarification_detection configuration.
Configuration:
agents:
- name: autonomous-agent
provider: my-provider
clarification_detection:
enabled: true # Enable detection (default: false)
level: warning # Log level: "info", "warning", or "error" (default: "warning")
use_builtin_patterns: true # Use builtin detection patterns (default: true)
custom_patterns: # Additional regex patterns (optional)
- "(?i)¿te gustaría" # Spanish clarification
- "(?i)möchten sie" # German clarification
servers:
- name: my-server
Configuration Options:
| Option | Type | Default | Description |
|---|---|---|---|
| enabled | bool | false | Enable clarification detection |
| level | string | "warning" | Log level: info, warning, or error |
| use_builtin_patterns | bool | true | Use the 16 builtin English detection patterns |
| custom_patterns | list | [] | Additional regex patterns for detection |
Log Levels:
| Level | Behavior |
|---|---|
| info | Logs detection at INFO level, does NOT record as error |
| warning | Logs at WARN level and records in test errors (default) |
| error | Logs at ERROR level and records in test errors |
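Under the hood, detection presumably amounts to matching the agent's final output against these regular expressions. The sketch below illustrates that idea only; the pattern list shown is illustrative rather than the actual 16 builtin patterns, and the function name is hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// A few illustrative patterns in the same style as the builtin English ones;
// the framework's real list of 16 patterns is not reproduced here.
var builtinPatterns = []string{
	`(?i)would you like me to`,
	`(?i)do you want me to`,
	`(?i)should i proceed`,
	`(?i)please confirm`,
}

// detectsClarification reports whether the output looks like a request for
// clarification, checking builtin patterns (if enabled) plus custom ones.
func detectsClarification(output string, useBuiltin bool, custom []string) bool {
	patterns := append([]string{}, custom...)
	if useBuiltin {
		patterns = append(patterns, builtinPatterns...)
	}
	for _, p := range patterns {
		if regexp.MustCompile(p).MatchString(output) {
			return true
		}
	}
	return false
}

func main() {
	out := "Would you like me to create the file now?"
	fmt.Println(detectsClarification(out, true, []string{`(?i)¿te gustaría`})) // true
}
```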
Custom Patterns:
Custom patterns are additive by default - they are checked in addition to the builtin patterns. To use only custom patterns, set use_builtin_patterns: false:
agents:
- name: multilingual-agent
provider: my-provider
clarification_detection:
enabled: true
use_builtin_patterns: false # Disable builtin English patterns
custom_patterns:
- "(?i)¿desea que" # Spanish: "Do you want me to"
- "(?i)möchten sie dass" # German: "Would you like me to"
- "(?i)voulez-vous que" # French: "Do you want me to"Tip: Combine with system_prompt to instruct the LLM to act autonomously:
agents:
- name: autonomous-agent
provider: my-provider
system_prompt: |
Execute tasks directly without asking for confirmation.
Do not ask "Would you like me to..." or "Should I proceed...".
clarification_detection:
enabled: true
level: error # Treat clarification requests as errors
servers:
- name: my-server
Apache 2.0 License - See LICENSE file for details
Issues: https://github.com/mykhaliev/agent-benchmark/issues
Contributing:
- Fork the repository
- Create feature branch
- Submit pull request