Percy is a project that builds an automated system for evaluating the performance of large language models (LLMs) from multiple providers (OpenAI, Anthropic, Ollama, OpenRouter) on the official U.S. amateur radio license exams (Technician, General, and Extra). The goal is to determine which models can pass each exam level with a measurable degree of certainty.
The project follows these steps:
- **Extract the Question Pool**: Implement a system to extract and structure the NCVEC question pools for all license classes.
- **Randomized Test Generation**: Randomly generate tests according to NCVEC exam rules:
  - Technician: 35 questions
  - General: 35 questions
  - Extra: 50 questions

  Each test ensures balanced question selection by category.
- **Question-Answering System**: Build a system that presents the questions (including those with diagrams) to different LLMs and retrieves their answers.
- **Result Analysis**: Determine whether the LLM passed or failed the test based on the official pass/fail criteria (74% required to pass) for each license class.
```
/percy
│
├── /question_pools      # Contains the processed question pools (JSON format)
│   ├── technician.json
│   ├── general.json
│   └── extra.json
│
├── /scripts             # Scripts for processing question pools, generating tests, evaluating models, and analyzing results
│   ├── extract_pool.py
│   ├── generate_test.py
│   ├── evaluate_test.py
│   └── analyze_results.py
│
├── /tests               # Generated tests for different license classes
│   ├── test_001.json
│   ├── test_002.json
│   └── ...
│
├── /outputs             # Results and logs of model evaluations
│   ├── gpt4-mini_test_001_results.json
│   ├── gpt4o_test_002_results.json
│   └── ...
│
├── /schema              # JSON schemas for data validation
│   ├── question-pool.schema.json
│   ├── test-result.schema.json
│   ├── llmstats-schema.json
│   └── board-schema.json   # Schema for aggregated leaderboard data
│
└── README.md            # Project documentation
```
The question pool JSON files follow a strict schema defined in `schema/question-pool.schema.json`. This schema enforces:
- Valid license class values ("technician", "general", "extra")
- Version format (YYYY-YYYY)
- Question ID patterns (e.g., T1A01 for Technician)
- Group ID patterns (e.g., T1A)
- Required fields and data types
- Answer option constraints (A, B, C, D)
- Exactly 4 answer choices per question
The schema ensures consistency and validates the structure of all question pool JSON files.
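For example, a pool file can be checked against the schema with the `jsonschema` package. This is a minimal sketch, not one of the project's scripts; the file paths simply follow the repository layout above.

```python
import json
from jsonschema import validate, ValidationError  # pip install jsonschema

# Paths follow the repository layout shown above
with open("schema/question-pool.schema.json") as f:
    schema = json.load(f)
with open("question_pools/technician.json") as f:
    pool = json.load(f)

try:
    validate(instance=pool, schema=schema)  # raises on any violation
    print(f"{pool['license_class']} pool ({pool['version']}): "
          f"{len(pool['questions'])} questions, schema OK")
except ValidationError as err:
    print(f"Schema violation at {list(err.path)}: {err.message}")
```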
Example question pool JSON that follows the schema:

```json
{
  "license_class": "technician",
  "version": "2022-2026",
  "group_titles": {
    "T1A": "Purpose and permissible use of the Amateur Radio Service",
    "T1B": "Authorized frequencies; Frequency allocations",
    "T1C": "Operator licensing"
  },
  "questions": [
    {
      "id": "T1A01",
      "group": "T1A",
      "question": "Which of the following is a purpose of the Amateur Radio Service as stated in the FCC rules and regulations?",
      "answers": [
        {
          "option": "A",
          "text": "Providing personal radio communications for as many citizens as possible"
        },
        {
          "option": "B",
          "text": "Providing communications during international emergencies only"
        },
        {
          "option": "C",
          "text": "Advancing skills in the technical and communication phases of the radio art"
        },
        {
          "option": "D",
          "text": "All of these choices are correct"
        }
      ],
      "correct_answer": "C"
    }
  ]
}
```

The test result JSON files follow a strict schema defined in `schema/test-result.schema.json`. This schema enforces:
- Required fields like provider, model name, test ID, etc.
- Timestamp format and validation
- Question result structure including token usage
- Token usage tracking at both question and test level
- Support for RAG context tracking
- Support for image handling in questions
Example test result that follows the schema:

```json
{
  "provider": "openai",
  "test_id": "tech_001",
  "model_name": "gpt-4",
  "timestamp": "2024-01-20T15:30:45Z",
  "questions": [
    {
      "question_id": "T1A01",
      "model_answer": "C",
      "correct_answer": "C",
      "is_correct": true,
      "has_image": false,
      "token_usage": {
        "input_tokens": 245,
        "output_tokens": 89,
        "total_tokens": 334,
        "input_token_details": {
          "audio": 0,
          "cache_read": 0
        },
        "output_token_details": {
          "audio": 0,
          "reasoning": 89
        }
      }
    }
  ],
  "total_questions": 35,
  "correct_answers": 30,
  "score_percentage": 85.71,
  "duration_seconds": 87.5,
  "used_cot": true,
  "used_rag": false,
  "temperature": 0.0,
  "token_usage": {
    "prompt_tokens": 8575,
    "completion_tokens": 3115,
    "total_tokens": 11690
  },
  "pool_name": "technician-2022-2026",
  "pool_id": "T"
}
```
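The per-question `token_usage` entries roll up to the test-level totals, which makes it easy to spot-check a result file. A minimal sketch (not a project script; the result filename is illustrative):

```python
import json

# Illustrative result file, matching the naming convention of the /outputs folder
with open("outputs/gpt4-mini_test_001_results.json") as f:
    result = json.load(f)

correct = sum(q["is_correct"] for q in result["questions"])
question_tokens = sum(q["token_usage"]["total_tokens"] for q in result["questions"])

print(f"{correct}/{result['total_questions']} correct "
      f"({result['score_percentage']}% reported)")
print(f"Per-question tokens: {question_tokens}, "
      f"test-level total: {result['token_usage']['total_tokens']}")
```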
- **Clone the Repository**

  ```bash
  git clone https://github.com/testaco/percy.git
  cd percy
  ```

- **Create and Activate Virtual Environment**

  ```bash
  python -m venv venv
  source venv/bin/activate   # On Linux/Mac
  # or
  .\venv\Scripts\activate    # On Windows
  ```

- **Install Dependencies**

  ```bash
  pip install -r requirements.txt
  ```

- **Set Up API Keys for LLM Providers**

  Depending on which LLM providers you intend to use, set the necessary API keys as environment variables:

  - OpenAI:

    ```bash
    export OPENAI_API_KEY='your-openai-api-key'
    ```

  - Anthropic:

    ```bash
    export ANTHROPIC_API_KEY='your-anthropic-api-key'
    ```

  - OpenRouter:

    ```bash
    export OPENROUTER_API_KEY='your-openrouter-api-key'
    ```

  - Ollama: Ensure Ollama is installed and running locally. See Ollama's documentation for setup instructions.
The `extract_pool.py` script converts a DOCX question pool file into a structured JSON format. Both input and output paths are required.

```bash
python scripts/extract_pool.py --input <docx_file> --output <json_file>
```

Example:

```bash
python scripts/extract_pool.py --input question_pools/technician-2022-2026.docx --output question_pools/technician-2022-2026.json
```

Additional options:
- `--test`: Run a test to verify the format of the first question
- `--debug`: Enable debug output
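Once a pool has been extracted, its structure can be inspected with a few lines of Python. A minimal sketch (not part of the project's scripts):

```python
import json
from collections import Counter

with open("question_pools/technician-2022-2026.json") as f:
    pool = json.load(f)

# Tally how many questions each group (T1A, T1B, ...) contributes
per_group = Counter(q["group"] for q in pool["questions"])
print(f"{len(per_group)} groups, {len(pool['questions'])} questions total")
for group, count in sorted(per_group.items()):
    print(f"  {group}: {count:2d}  {pool['group_titles'].get(group, '')}")
```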
This script randomly selects one question from each group in the question pool and saves the test in the `/tests` folder.

```bash
python scripts/generate_test.py --pool-file <json_file>
```

Example:

```bash
python scripts/generate_test.py --pool-file question_pools/technician-2022-2026.json
```

Additional options:
- `--output-dir`: Directory to save the generated test (default: `tests`)
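The selection rule itself is straightforward: draw one question at random from every group, which yields the NCVEC question counts (35 for Technician and General, 50 for Extra). A minimal sketch of the idea under the question pool schema above (not the actual `generate_test.py` implementation):

```python
import json
import random
from collections import defaultdict

with open("question_pools/technician-2022-2026.json") as f:
    pool = json.load(f)

# Bucket questions by group (e.g., T1A, T1B, ...)
by_group = defaultdict(list)
for q in pool["questions"]:
    by_group[q["group"]].append(q)

# One random question per group gives a balanced, rule-compliant test
test_questions = [random.choice(qs) for _, qs in sorted(by_group.items())]
print(f"Generated a {len(test_questions)}-question test")  # 35 for Technician
```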
This script runs the specified model from the chosen provider on a generated test and saves the results in the `/data/evaluations` folder.

```bash
python scripts/evaluate_test.py --test-file <test_file> --model <model_name>
```

Parameters:
- `--test-file`: Path to the test JSON file.
- `--model`: Name of the LLM model to use.
- `--temperature`: (Optional) Temperature setting for the LLM (default: `0.0`).
- `--cot`: (Optional) Enable Chain of Thought reasoning mode.

Example with Chain of Thought:

```bash
python scripts/evaluate_test.py --test-file tests/technician_test.json --model openai/gpt-4 --cot
```

Examples:

- Using OpenAI GPT-4:

  ```bash
  python scripts/evaluate_test.py --test-file tests/technician_test.json --model openai/gpt-4
  ```

- Using Anthropic Claude 3 Sonnet:

  ```bash
  python scripts/evaluate_test.py --test-file tests/general_test.json --model anthropic/claude-3-sonnet
  ```

- Using OpenRouter ChatGPT-4o (latest):

  ```bash
  python scripts/evaluate_test.py --test-file tests/extra_test.json --model openrouter/chatgpt-4o-latest
  ```
Note:
- Ensure you have the necessary API keys and environment variables set for each provider. Follow the litellm documentation for provider-specific details.
  - For OpenAI, set `OPENAI_API_KEY`.
  - For Anthropic, set `ANTHROPIC_API_KEY`.
  - For OpenRouter, set `OPENROUTER_API_KEY`.
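Since provider calls are routed through litellm, presenting a single question to a model looks roughly like the sketch below. This is an illustration only, not the project's actual prompt or parsing logic; the prompt wording and the answer-letter extraction are assumptions.

```python
import litellm  # pip install litellm; reads OPENAI_API_KEY etc. from the environment


def ask_question(question: dict, model: str = "openai/gpt-4", temperature: float = 0.0) -> str:
    """Present one question-pool entry to a model and return its answer letter."""
    # Illustrative prompt format; the real evaluate_test.py prompt may differ
    choices = "\n".join(f"{a['option']}. {a['text']}" for a in question["answers"])
    prompt = (
        f"{question['question']}\n\n{choices}\n\n"
        "Answer with the single letter (A, B, C, or D) of the correct choice."
    )
    response = litellm.completion(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    text = response.choices[0].message.content.strip()
    # Naive extraction: take the first A-D letter that appears in the reply
    return next((ch for ch in text if ch in "ABCD"), "?")
```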
The `generate_batches.py` script automatically creates batch evaluation configurations based on model metadata from the LLMStats dataset:

```bash
python scripts/generate_batches.py --output-dir configs
```

This script:
- Loads model metadata from `data/llmstats.json`
- Categorizes models into groups:
  - Proprietary: Models from major commercial providers (OpenAI, Anthropic, etc.)
  - Small: Open models with <8B parameters
  - Medium: Open models with 8-80B parameters
  - Large: Open models with >80B parameters
- Creates optimized configurations for each category:
  - Proprietary: Base configuration (no CoT/RAG)
  - Small: Full features (CoT+RAG with varied temperatures)
  - Medium: RAG-only with varied temperatures
  - Large: Base configuration only
- Generates YAML configuration files for each category

The generated files can be used directly with the batch evaluation script.
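The size bucketing can be expressed as a simple function over parameter counts. A sketch, assuming each model entry in `data/llmstats.json` exposes a parameter count and an open-weights flag (the field names here are illustrative, not the actual schema):

```python
def categorize(model: dict) -> str:
    """Bucket a model per the rules above; field names are assumptions."""
    if not model.get("open_weights", False):       # hypothetical flag
        return "proprietary"
    params_b = model.get("param_count", 0) / 1e9   # hypothetical field, raw parameter count
    if params_b < 8:
        return "small"
    if params_b <= 80:
        return "medium"
    return "large"
```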
The batch evaluation system allows running multiple tests with different configurations using a YAML configuration file:

```bash
python scripts/batch_evaluate.py --config batch_config.yaml
```

Configuration File Format:

The configuration file should be in YAML format with the following structure:

```yaml
# Example batch_config.yaml
batch_name: "initial_evaluation_run"  # Unique identifier for this batch
parameters:
  model_providers:  # Valid model-provider pairs
    - "openai/gpt-4"
    - "openai/gpt-3.5-turbo"
    - "anthropic/claude-3-sonnet"
    - "anthropic/claude-2"
    - "ollama/llama2"
  temperature:
    - 0.0
    - 0.7
  use_cot:
    - true
    - false
  use_rag:
    - true
    - false
  test_patterns:
    - "tests/technician_*.json"
    - "tests/general_*.json"
```

The script will:
- Use the specified batch name for organizing results
- Generate combinations using only valid model-provider pairs
- Run each combination on all matching test files
- Save progress continuously to allow resuming interrupted runs
- Store results in `outputs/batch_runs/`
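Conceptually, the batch runner expands the `parameters` block into a grid of runs. A minimal sketch of that expansion (not the actual `batch_evaluate.py` logic):

```python
import glob
import itertools

import yaml  # pip install pyyaml

with open("batch_config.yaml") as f:
    config = yaml.safe_load(f)

p = config["parameters"]
test_files = sorted(f for pattern in p["test_patterns"] for f in glob.glob(pattern))

# Cartesian product of all parameter values and matching test files
runs = list(itertools.product(
    p["model_providers"], p["temperature"], p["use_cot"], p["use_rag"], test_files
))
print(f"Batch '{config['batch_name']}' expands to {len(runs)} evaluation runs")
for model, temp, cot, rag, test_file in runs[:3]:
    print(f"  {model} temp={temp} cot={cot} rag={rag} on {test_file}")
```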
This script processes the output from the model evaluation and determines whether the model passed or failed the exam based on the specific rules for that license class (74% required to pass).

Run the script with the result file:

```bash
python scripts/analyze_results.py --result-file outputs/gpt4-mini_test_001_results.json
```

The script will output:
- License class of the test
- Score achieved
- Required passing score (74%)
- Whether the model passed or failed
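The pass/fail check itself reduces to comparing the score against the 74% threshold. A sketch using the test result schema above (not the actual `analyze_results.py`):

```python
import json

PASSING_PERCENTAGE = 74.0  # official pass mark used throughout the project

with open("outputs/gpt4-mini_test_001_results.json") as f:
    result = json.load(f)

score = result["score_percentage"]
verdict = "PASSED" if score >= PASSING_PERCENTAGE else "FAILED"
margin = score - PASSING_PERCENTAGE
print(f"{result['model_name']}: {score:.2f}% (need {PASSING_PERCENTAGE}%) -> {verdict} "
      f"(margin {margin:+.2f} points)")
```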
The `extract_llmstats.py` script aggregates metadata about LLM models and their providers from the LLMStats submodule into a single JSON dataset. This data includes model capabilities, context sizes, benchmark scores, and provider-specific information like pricing and throughput.

```bash
# Update submodule and extract stats
python scripts/extract_llmstats.py -y

# Skip submodule update and extract stats
python scripts/extract_llmstats.py -n

# Prompt for submodule update
python scripts/extract_llmstats.py
```

The script:
- Optionally updates the LLMStats submodule
- Extracts model metadata from `LLMStats/models/<creator>/<model_id>/model.json` files
- Combines it with provider data from `LLMStats/providers/<provider>/provider.json` files
- Saves the aggregated dataset to `data/llmstats.json`

The output follows the schema defined in `schema/llmstats-schema.json` and includes:
- Model capabilities (context sizes, multimodal support, etc.)
- Benchmark scores across various datasets
- Provider-specific information (pricing, throughput, latency)
- Links to documentation, papers, and model weights
This metadata is used to:
- Calculate costs for test evaluations
- Track model capabilities and limitations
- Compare performance across benchmarks
- Reference model documentation and resources
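The aggregation step amounts to walking the submodule's directory layout and merging each model file with the provider entries. A rough sketch of the idea (not the actual script; the combined output structure shown here is an assumption, the real shape is defined by `schema/llmstats-schema.json`):

```python
import json
from pathlib import Path

# Collect every model.json, keeping the creator directory name alongside it
models = []
for model_file in Path("LLMStats/models").glob("*/*/model.json"):
    with open(model_file) as f:
        model = json.load(f)
    model["creator"] = model_file.parent.parent.name
    models.append(model)

# Collect provider.json files keyed by provider directory name
providers = {}
for provider_file in Path("LLMStats/providers").glob("*/provider.json"):
    with open(provider_file) as f:
        providers[provider_file.parent.name] = json.load(f)

# Illustrative combined layout written to the dataset location used by the project
Path("data").mkdir(exist_ok=True)
with open("data/llmstats.json", "w") as f:
    json.dump({"models": models, "providers": providers}, f, indent=2)
```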
The `generate_handbook.py` script creates an educational handbook by generating comprehensive content for specified sub-element groups using a Large Language Model (LLM).

```bash
python scripts/generate_handbook.py --patterns <patterns> --provider <provider_name> --model-name <model_name> --temperature <temperature>
```

Parameters:
- `--patterns`: List of glob patterns to filter groups (e.g., `"*1*"`, `"G2A"`). This determines which question groups will be used to generate content.
- `--provider`: LLM provider to use (`openai` or `anthropic`). Default is `openai`.
- `--model-name`: Name of the LLM model to use. Defaults to `gpt-4` for OpenAI or the appropriate model for Anthropic.
- `--temperature`: Sampling temperature for the LLM (default: `0.7`).

Examples:

- Using OpenAI GPT-4 to generate content for all sub-element groups containing '1':

  ```bash
  python scripts/generate_handbook.py --patterns "*1*" --provider openai --model-name "gpt-4" --temperature 0.7
  ```

- Using Anthropic's Claude-v1 to generate content for group G2A:

  ```bash
  python scripts/generate_handbook.py --patterns "G2A" --provider anthropic --model-name "claude-v1" --temperature 0.7
  ```

Output:

The generated content is saved as markdown files in the `handbook` directory, one file per sub-element group (e.g., `handbook/T1A.md`).

Notes:
- Ensure you have the necessary API keys and environment variables set for the chosen LLM provider.
  - For OpenAI, set `OPENAI_API_KEY`.
  - For Anthropic, set `ANTHROPIC_API_KEY`.
- The generated markdown files can be converted to other formats (e.g., PDF, EPUB) using Pandoc or similar tools.
The `handbook_indexer.py` script creates a FAISS vector store index of the handbook content for use in RAG (Retrieval Augmented Generation) pipelines.

```bash
# Build the index (creates cache/handbook_index_*.npz)
python scripts/handbook_indexer.py

# Force rebuild of the index
python scripts/handbook_indexer.py --force

# Test search functionality
python scripts/handbook_indexer.py --query "What is impedance matching?"
```

Features:
- Uses FAISS for efficient similarity search
- Caches computed embeddings to avoid redundant processing
- Splits content into chunks for fine-grained retrieval
- Supports semantic search queries
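The core of such an indexer is: chunk the markdown, embed each chunk, and add the vectors to a FAISS index. A minimal sketch, assuming a sentence-transformers embedding model and naive fixed-size chunking (the project's actual chunking, embedding model, and caching choices may differ):

```python
from pathlib import Path

import faiss        # pip install faiss-cpu
import numpy as np
from sentence_transformers import SentenceTransformer  # embedding model is an assumption

CHUNK_SIZE = 1000  # characters; illustrative value

# Naive fixed-size chunking of every handbook markdown file
chunks = []
for md_file in sorted(Path("handbook").glob("*.md")):
    text = md_file.read_text()
    chunks += [text[i:i + CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = np.asarray(model.encode(chunks), dtype="float32")

index = faiss.IndexFlatL2(embeddings.shape[1])  # exact L2 similarity search
index.add(embeddings)

# Semantic search: embed the query and retrieve the closest chunks
query = np.asarray(model.encode(["What is impedance matching?"]), dtype="float32")
distances, ids = index.search(query, 3)
for rank, chunk_id in enumerate(ids[0], start=1):
    print(f"{rank}. {chunks[chunk_id][:80]}...")
```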
The `publish_handbook.py` script converts the markdown handbook content into a formatted PDF document, organizing content by chapters.

```bash
python scripts/publish_handbook.py
```

Requirements:
- Pandoc must be installed on your system
- PDFtk must be installed on your system

Output:
- Individual chapter PDFs in `handbook/chapters/`
- Complete handbook at `handbook/handbook.pdf`

The script:
- Generates individual chapter PDFs from markdown files
- Applies consistent formatting using pandoc
- Combines all chapters into a single handbook.pdf
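The conversion pipeline relies on the two external tools listed above: pandoc renders each markdown file to PDF, and PDFtk concatenates the chapters. A sketch of that flow (paths follow the output locations above; the pandoc invocation is kept minimal and may differ from the script's actual formatting options):

```python
import subprocess
from pathlib import Path

chapters_dir = Path("handbook/chapters")
chapters_dir.mkdir(parents=True, exist_ok=True)

# Render each markdown group file to its own chapter PDF with pandoc
chapter_pdfs = []
for md_file in sorted(Path("handbook").glob("*.md")):
    pdf_path = chapters_dir / f"{md_file.stem}.pdf"
    subprocess.run(["pandoc", str(md_file), "-o", str(pdf_path)], check=True)
    chapter_pdfs.append(str(pdf_path))

# Concatenate the chapters into a single handbook with PDFtk
subprocess.run(["pdftk", *chapter_pdfs, "cat", "output", "handbook/handbook.pdf"], check=True)
```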
The `generate_board.py` script aggregates test results and LLM metadata into a comprehensive board dataset for the leaderboard application. This data includes test performance, costs, and model capabilities.

```bash
python scripts/generate_board.py
```

The script:
- Processes all JSON files in the `outputs` directory
- Combines test results with LLM metadata from `data/llmstats.json`
- Calculates additional metrics like:
  - Pass/fail status (74% required)
  - Margin to passing score
  - Token usage and costs
  - Performance metrics (tokens/second)
- Saves the aggregated board data to `data/board.json`
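The per-test metrics are simple derivations from the result fields plus per-token prices from the LLM metadata. A sketch of those calculations, assuming prices are supplied per token (the pricing inputs are assumptions, not the actual llmstats field names):

```python
def board_metrics(result: dict, input_price: float, output_price: float) -> dict:
    """Derive leaderboard metrics from one test result; prices are per token (assumed)."""
    usage = result["token_usage"]
    return {
        "passed": result["score_percentage"] >= 74.0,
        "margin": result["score_percentage"] - 74.0,  # points above/below the pass mark
        "tokens_per_second": usage["total_tokens"] / result["duration_seconds"],
        "cost_usd": usage["prompt_tokens"] * input_price
                    + usage["completion_tokens"] * output_price,
    }
```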
The output follows the schema defined in `schema/board-schema.json` and includes:
- Test results with detailed metrics
- Model parameters and capabilities
- Performance statistics
- Cost analysis per test
- License class information
This board data powers the leaderboard web application's:
- Overall rankings
- Performance comparisons
- Cost analysis
- Detailed test result views
Feel free to open issues and submit pull requests as the project evolves. The README will be updated over time as we build out the features.