This repository contains tools for analyzing the stability of leaderboards on ForecastBench. The analysis calculates difficulty-adjusted Brier scores and computes a set of stability metrics to assess how consistent model rankings are over time.
forecastbench-stability-analysis/
├── src/
│   ├── stability_analysis.py        # Core analysis functions and classes
│   ├── leaderboard_viewer.html      # Interactive results viewer
│   └── trendline_graph.html         # Interactive trend visualization
├── data/
│   ├── raw/
│   │   ├── forecast_sets/           # Forecast JSON files organized by date
│   │   ├── question_sets/           # Question metadata JSON files
│   │   └── model_release_dates.csv
│   ├── processed/                   # Intermediate processed data
│   └── results/                     # Analysis outputs and visualizations
├── run_analysis.py                  # Main analysis script
└── notebooks/                       # Development notebooks
- Python 3.8+
- Required packages:
pip install pandas numpy matplotlib pyfixest
- Clone the repository:
git clone <repository-url>
cd forecastbench-stability-analysis
- Ensure your data structure matches the expected format:
- Forecast JSON files in data/raw/forecast_sets/YYYY-MM-DD/
- Question JSON files in data/raw/question_sets/
- Model release dates in data/raw/model_release_dates.csv
Run the complete analysis pipeline:
python run_analysis.py
This will:
- Parse forecast and question JSON files
- Process and merge the data
- Calculate difficulty-adjusted Brier scores (see the sketch below)
- Generate multiple leaderboard configurations
- Perform stability and sample size analyses
- Create interactive HTML viewers for results exploration and trend visualization
Expected runtime: 1-2 minutes on a standard laptop
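The Brier score of a single forecast is the squared difference between the forecast probability and the resolved outcome (lower is better). The exact difficulty adjustment is implemented in compute_diff_adj_scores; the sketch below only illustrates the idea using raw Brier scores and a simple per-question demeaning, and is not the repository's actual method:

```python
import pandas as pd

# Illustration only: raw Brier scores plus a simple per-question demeaning.
# The repository's actual adjustment lives in compute_diff_adj_scores.
df = pd.DataFrame({
    "model":       ["A", "A", "B", "B"],
    "id":          ["q1", "q2", "q1", "q2"],   # question ids
    "forecast":    [0.75, 0.40, 0.60, 0.10],
    "resolved_to": [1, 0, 1, 0],
})

df["brier"] = (df["forecast"] - df["resolved_to"]) ** 2                          # lower is better
df["diff_adj_brier"] = df["brier"] - df.groupby("id")["brier"].transform("mean")

print(df.groupby("model")["diff_adj_brier"].mean())                              # per-model adjusted score
```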
Key parameters can be modified in run_analysis.py:
# Analysis parameters
REFERENCE_DATE = "2025-09-10" # Reference date for calculations
IMPUTATION_THRESHOLD = 0.05 # Max fraction of imputed forecasts
MAX_MODEL_DAYS_RELEASED = 365 # Model age limit for estimation
STABILITY_THRESHOLDS = [100, 180] # Days active for stability analysis
# Stability metrics to calculate
STABILITY_METRICS = [
"correlation", # Pearson correlation
"rank_correlation", # Spearman rank correlation
"median_displacement", # Median rank displacement
"top25_retention", # Top-25% retention rate
]
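For reference, here is a minimal sketch of how these four metrics can be computed from two per-model score series (complete-period vs. partial-period mean scores); the helper below is illustrative, not the implementation used by perform_stability_analysis:

```python
import pandas as pd

def stability_metrics(complete: pd.Series, partial: pd.Series) -> dict:
    """Compare per-model scores from complete vs. partial forecasting records.

    Both series are indexed by model name; lower scores are better. This is an
    illustrative helper, not the repository's implementation.
    """
    common = complete.index.intersection(partial.index)
    complete, partial = complete[common], partial[common]
    ranks_complete = complete.rank()              # rank 1 = best (lowest score)
    ranks_partial = partial.rank()
    k = max(1, len(common) // 4)                  # size of the top 25%
    top_complete = set(ranks_complete.nsmallest(k).index)
    top_partial = set(ranks_partial.nsmallest(k).index)
    return {
        "correlation": complete.corr(partial),                          # Pearson
        "rank_correlation": complete.corr(partial, method="spearman"),  # Spearman
        "median_displacement": (ranks_complete - ranks_partial).abs().median(),
        "top25_retention": len(top_complete & top_partial) / k,
    }
```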
You can also use the analysis functions directly:
from src.stability_analysis import (
parse_forecast_data,
process_parsed_data,
compute_diff_adj_scores,
create_leaderboard,
perform_stability_analysis
)
# Parse forecast data
df_forecasts = parse_forecast_data("./data/raw/forecast_sets/")
# Calculate difficulty-adjusted scores
df_with_scores = compute_diff_adj_scores(
df=df_forecasts,
max_model_days_released=365,
drop_baseline_models=["Always 0.5", "Random Uniform"]
)
# Create leaderboard
df_leaderboard = create_leaderboard(
df_with_scores,
min_days_active_market=50,
min_days_active_dataset=50
)
# Perform stability analysis
perform_stability_analysis(
df_with_scores=df_with_scores,
model_days_active_threshold=100,
results_folder="./results",
metrics=["correlation", "rank_correlation"]
)
After running the analysis, you'll find:
data/results/
├── leaderboard_*.csv # Various leaderboard configurations
├── trendline_graph_*.csv # Data for trend visualizations
├── stability_*_threshold_*.csv # Stability analysis results
├── sample_size_analysis_*.csv # Sample size analysis results
├── stability_*_*.png # Stability metric visualizations
└── sample_size_analysis_*.png # Sample size plots
- Leaderboards: leaderboard_baseline.csv, leaderboard_proposal.csv, etc.
- Trend Data: trendline_graph_*.csv - Performance trends over time by model category
- Stability Metrics: Stability metrics at various thresholds
- Sample Size Analysis: Days needed for models to reach question thresholds
The analysis generates two interactive HTML visualizations:
Leaderboard Viewer (src/leaderboard_viewer.html):
- File Upload: Drag & drop CSV files for instant visualization
- Sorting: Click column headers to sort by any metric
- Filtering: Search models by name
- Smart Formatting: Automatic numeric formatting and "n.e.d." (not enough data) indicators
- Responsive Design: Works on desktop and mobile devices
Trendline Graph (src/trendline_graph.html):
- SOTA Tracking: Automatically identifies State-of-the-Art models at release
- Interactive Filtering: Toggle between overall/market/dataset performance
- Zoom & Pan: Shift+drag to zoom into time periods, Escape to reset
- Benchmark Lines: Toggle reference lines for human forecaster performance
- Trend Analysis: Linear regression lines for SOTA model progression
- Rich Tooltips: Detailed model information on hover
Both viewers work offline and require no additional setup beyond opening the HTML files in a web browser.
Forecast JSON files (in data/raw/forecast_sets/YYYY-MM-DD/) follow this structure:
{
"organization": "ModelProvider",
"model": "Model Name",
"forecasts": [
{
"id": "question_id",
"forecast_due_date": "YYYY-MM-DD",
"forecast": 0.75,
"resolved_to": 1,
"resolved": true,
"resolution_date": "YYYY-MM-DD",
"source": "metaculus",
"imputed": false
}
]
}

Question JSON files (in data/raw/question_sets/) follow this structure:
{
"forecast_due_date": "YYYY-MM-DD",
"question_set": "llm",
"questions": [
{
"id": "question_id",
"source": "metaculus",
"freeze_datetime_value": 0.65,
"url": "https://..."
}
]
}

The model release dates file (data/raw/model_release_dates.csv) has this format:
model,release_date
GPT-4o,2024-05-13
Claude-3.5-Sonnet,2024-06-20
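As an illustration of how these three inputs fit together (parse_forecast_data and process_parsed_data are the real entry points; the file path, join key, and filtering below are assumptions):

```python
import json
import pandas as pd

# Hypothetical path: one forecast file of the shape shown above
with open("data/raw/forecast_sets/2025-01-01/example.json") as f:
    payload = json.load(f)

# Flatten the nested "forecasts" list, carrying the model identifiers along
forecasts = pd.json_normalize(payload, record_path="forecasts",
                              meta=["organization", "model"])
forecasts = forecasts[forecasts["resolved"] & ~forecasts["imputed"]]   # keep scored forecasts

# Attach release dates and compute model age at the forecast due date
releases = pd.read_csv("data/raw/model_release_dates.csv", parse_dates=["release_date"])
merged = forecasts.merge(releases, on="model", how="left")
merged["forecast_due_date"] = pd.to_datetime(merged["forecast_due_date"])
merged["model_days_released"] = (merged["forecast_due_date"] - merged["release_date"]).dt.days
merged = merged[merged["model_days_released"] <= 365]                  # mirrors MAX_MODEL_DAYS_RELEASED
```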
For each stability threshold (e.g., 100 days), the analysis:
- Calculates baseline scores using complete forecasting records
- For each day X from 0 to threshold:
- Filters to forecasts made within first X days of model activity
- Calculates incomplete-period scores
- Computes stability metrics comparing incomplete vs. complete rankings
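A rough sketch of that loop, reusing a metric helper like the stability_metrics example above (df_scored, model_days_active, and diff_adj_brier are illustrative names, not the pipeline's internals):

```python
import pandas as pd

# Assumes a scored DataFrame `df_scored` with one row per (model, forecast),
# plus a stability_metrics(complete, partial) helper like the earlier sketch.
baseline = df_scored.groupby("model")["diff_adj_brier"].mean()      # complete-record scores

rows = []
for day in range(0, 101):                                           # threshold = 100 days
    window = df_scored[df_scored["model_days_active"] <= day]       # first `day` days of activity
    if window.empty:
        continue
    partial = window.groupby("model")["diff_adj_brier"].mean()
    rows.append({"day": day, **stability_metrics(baseline, partial)})

stability_df = pd.DataFrame(rows)                                    # one row of metrics per day
```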
The sample size analysis determines the number of days needed for 80% of models to reach various resolved-question thresholds, helping inform data requirements for reliable model evaluation.
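One plausible reading of this calculation, as a sketch (the resolved-question counting and column names are assumptions rather than the pipeline's internals):

```python
import numpy as np

def day_for_coverage(df, n_questions, coverage=0.8):
    """First activity day by which `coverage` of models have >= n_questions resolved questions."""
    resolved = df[df["resolved"]].sort_values("model_days_active")
    cum_count = resolved.groupby("model").cumcount() + 1             # running count per model
    first_day = (resolved[cum_count >= n_questions]
                 .groupby("model")["model_days_active"].min())       # day each model hits the threshold
    needed = int(np.ceil(coverage * df["model"].nunique()))
    if len(first_day) < needed:
        return None                                                  # too few models ever get there
    return int(first_day.sort_values().iloc[needed - 1])

# df_scored as in the previous sketch; question thresholds here are hypothetical
for n in (50, 100, 200):
    print(n, day_for_coverage(df_scored, n))
```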