ForecastBench Stability Analysis

This repository contains tools for analyzing the stability of leaderboards on ForecastBench. The analysis calculates difficulty-adjusted Brier scores and computes several stability metrics to assess how consistent model rankings are over time.

Project Structure

forecastbench-stability-analysis/
├── src/
│   ├── stability_analysis.py    # Core analysis functions and classes
│   ├── leaderboard_viewer.html  # Interactive results viewer
│   └── trendline_graph.html     # Interactive trend visualization
├── data/
│   ├── raw/
│   │   ├── forecast_sets/       # Forecast JSON files organized by date
│   │   ├── question_sets/       # Question metadata JSON files  
│   │   └── model_release_dates.csv
│   ├── processed/               # Intermediate processed data
│   └── results/                 # Analysis outputs and visualizations
├── run_analysis.py             # Main analysis script
└── notebooks/                  # Development notebooks

Installation

Prerequisites

  • Python 3.8+
  • Required packages:
pip install pandas numpy matplotlib pyfixest

Setup

  1. Clone the repository:
git clone <repository-url>
cd forecastbench-stability-analysis
  2. Ensure your data structure matches the expected format (a quick sanity check is sketched after this list):
    • Forecast JSON files in data/raw/forecast_sets/YYYY-MM-DD/
    • Question JSON files in data/raw/question_sets/
    • Model release dates in data/raw/model_release_dates.csv
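
The following is a minimal sketch, using only the paths listed above, to confirm the layout before running the pipeline:

from pathlib import Path

# Check that the expected raw-data locations exist before running the pipeline.
required = [
    Path("data/raw/forecast_sets"),
    Path("data/raw/question_sets"),
    Path("data/raw/model_release_dates.csv"),
]
for path in required:
    status = "ok" if path.exists() else "MISSING"
    print(f"{status:7s} {path}")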

Usage

Quick Start

Run the complete analysis pipeline:

python run_analysis.py

This will:

  1. Parse forecast and question JSON files
  2. Process and merge the data
  3. Calculate difficulty-adjusted Brier scores (see the sketch below)
  4. Generate multiple leaderboard configurations
  5. Perform stability and sample size analyses
  6. Create interactive HTML viewers for results exploration and trend visualization

Expected runtime: 1-2 minutes on a standard laptop
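
Step 3, the difficulty adjustment, can be illustrated by demeaning raw Brier scores within each question. This is only a sketch of the idea; the actual estimator in src/stability_analysis.py may differ (the pyfixest dependency suggests a fixed-effects regression, to which per-question demeaning is equivalent in the one-way case):

import pandas as pd

# df has one row per (model, question) with columns from the forecast JSON.
# Raw Brier score: squared distance between the forecast and the resolution.
df["brier"] = (df["forecast"] - df["resolved_to"]) ** 2

# Subtracting each question's mean Brier score removes question-level
# difficulty, which is equivalent to a question fixed effect.
df["diff_adj_brier"] = df["brier"] - df.groupby("id")["brier"].transform("mean")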

Configuration

Key parameters can be modified in run_analysis.py:

# Analysis parameters
REFERENCE_DATE = "2025-09-10"           # Reference date for calculations
IMPUTATION_THRESHOLD = 0.05             # Max fraction of imputed forecasts
MAX_MODEL_DAYS_RELEASED = 365           # Model age limit for estimation
STABILITY_THRESHOLDS = [100, 180]       # Days active for stability analysis

# Stability metrics to calculate
STABILITY_METRICS = [
    "correlation",                      # Pearson correlation
    "rank_correlation",                 # Spearman rank correlation  
    "median_displacement",              # Median rank displacement
    "top25_retention",                  # Top-25% retention rate
]

Programmatic Usage

You can also use the analysis functions directly:

from src.stability_analysis import (
    parse_forecast_data, 
    process_parsed_data,
    compute_diff_adj_scores,
    create_leaderboard,
    perform_stability_analysis
)

# Parse forecast data
df_forecasts = parse_forecast_data("./data/raw/forecast_sets/")

# Calculate difficulty-adjusted scores
df_with_scores = compute_diff_adj_scores(
    df=df_forecasts,
    max_model_days_released=365,
    drop_baseline_models=["Always 0.5", "Random Uniform"]
)

# Create leaderboard
df_leaderboard = create_leaderboard(
    df_with_scores,
    min_days_active_market=50,
    min_days_active_dataset=50
)

# Perform stability analysis
perform_stability_analysis(
    df_with_scores=df_with_scores,
    model_days_active_threshold=100,
    results_folder="./results",
    metrics=["correlation", "rank_correlation"]
)

Output Files

Results Directory Structure

After running the analysis, you'll find:

data/results/
├── leaderboard_*.csv                    # Various leaderboard configurations
├── trendline_graph_*.csv                # Data for trend visualizations
├── stability_*_threshold_*.csv          # Stability analysis results
├── sample_size_analysis_*.csv           # Sample size analysis results  
├── stability_*_*.png                    # Stability metric visualizations
└── sample_size_analysis_*.png           # Sample size plots

Key Output Files

  • Leaderboards: leaderboard_baseline.csv, leaderboard_proposal.csv, etc.
  • Trend Data: trendline_graph_*.csv - Performance trends over time by model category
  • Stability Metrics: stability_*_threshold_*.csv - Correlation, rank correlation, median displacement, and top-25% retention at each days-active threshold
  • Sample Size Analysis: sample_size_analysis_*.csv - Days needed for models to reach resolved-question thresholds

Interactive Visualizations

The analysis generates two interactive HTML visualizations:

Leaderboard Viewer (leaderboard_viewer.html)

  • File Upload: Drag & drop CSV files for instant visualization
  • Sorting: Click column headers to sort by any metric
  • Filtering: Search models by name
  • Smart Formatting: Automatic numeric formatting and "n.e.d." (not enough data) indicators
  • Responsive Design: Works on desktop and mobile devices

Trend Graph (trendline_graph.html)

  • SOTA Tracking: Automatically identifies State-of-the-Art models at release
  • Interactive Filtering: Toggle between overall/market/dataset performance
  • Zoom & Pan: Shift+drag to zoom into time periods, Escape to reset
  • Benchmark Lines: Toggle reference lines for human forecaster performance
  • Trend Analysis: Linear regression lines for SOTA model progression
  • Rich Tooltips: Detailed model information on hover

Both viewers work offline and require no additional setup beyond opening the HTML files in a web browser.

Data Format Requirements

Forecast JSON Structure

{
  "organization": "ModelProvider",
  "model": "Model Name",
  "forecasts": [
    {
      "id": "question_id",
      "forecast_due_date": "YYYY-MM-DD",
      "forecast": 0.75,
      "resolved_to": 1,
      "resolved": true,
      "resolution_date": "YYYY-MM-DD",
      "source": "metaculus",
      "imputed": false
    }
  ]
}
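
A minimal loader for files of this shape, flattening them into one row per forecast, might look like the following (illustrative only; the repository's parse_forecast_data is the supported entry point):

import json
from pathlib import Path

import pandas as pd

def load_forecast_sets(root="data/raw/forecast_sets"):
    """Flatten dated forecast JSON files into one row per forecast."""
    rows = []
    # Files are organized as data/raw/forecast_sets/YYYY-MM-DD/*.json
    for path in sorted(Path(root).glob("*/*.json")):
        payload = json.loads(path.read_text())
        for fc in payload["forecasts"]:
            rows.append({"organization": payload["organization"],
                         "model": payload["model"], **fc})
    return pd.DataFrame(rows)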

Question JSON Structure

{
  "forecast_due_date": "YYYY-MM-DD",
  "question_set": "llm",
  "questions": [
    {
      "id": "question_id", 
      "source": "metaculus",
      "freeze_datetime_value": 0.65,
      "url": "https://..."
    }
  ]
}

Model Release Dates CSV

model,release_date
GPT-4o,2024-05-13
Claude-3.5-Sonnet,2024-06-20
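
Reading this file and deriving each model's age at the reference date, as used by the MAX_MODEL_DAYS_RELEASED filter, can be sketched as:

import pandas as pd

releases = pd.read_csv("data/raw/model_release_dates.csv", parse_dates=["release_date"])
reference_date = pd.Timestamp("2025-09-10")  # REFERENCE_DATE from run_analysis.py

# Days since release at the reference date, used to limit model age.
releases["days_released"] = (reference_date - releases["release_date"]).dt.days
eligible = releases[releases["days_released"] <= 365]  # MAX_MODEL_DAYS_RELEASED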

Methodology

Stability Assessment

For each stability threshold (e.g., 100 days), the analysis:

  1. Calculates baseline scores using complete forecasting records
  2. For each day X from 0 to threshold:
    • Filters to forecasts made within first X days of model activity
    • Calculates incomplete-period scores
    • Computes stability metrics comparing incomplete vs. complete rankings
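
A condensed sketch of this loop, assuming hypothetical per-forecast days_active and diff_adj_brier columns (the repository's perform_stability_analysis is the supported interface):

import pandas as pd

def stability_curve(df, threshold=100):
    """Compare rankings from truncated records against the full-record baseline."""
    baseline = df.groupby("model")["diff_adj_brier"].mean()  # complete records
    results = []
    for day in range(threshold + 1):
        # Scores using only forecasts from each model's first `day` days.
        partial = (df[df["days_active"] <= day]
                   .groupby("model")["diff_adj_brier"].mean())
        common = baseline.loc[partial.index]
        if len(common) < 2:
            continue  # not enough models yet for a correlation
        results.append({
            "day": day,
            "correlation": partial.corr(common),                       # Pearson
            "rank_correlation": partial.corr(common, method="spearman"),
        })
    return pd.DataFrame(results)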

Sample Size Analysis

Determines the number of days needed for 80% of models to reach various resolved question thresholds, helping inform data requirements for reliable model evaluation.
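
A sketch of this calculation, assuming hypothetical per-model columns days_active and resolved_questions (a cumulative resolved count):

import numpy as np
import pandas as pd

def days_to_threshold(df, n_questions, quantile=0.8):
    """Days of activity until `quantile` of models have >= n_questions resolved."""
    # First day on which each model's cumulative resolved count reaches the threshold.
    first_day = (df[df["resolved_questions"] >= n_questions]
                 .groupby("model")["days_active"].min())
    # Models that never reach the threshold count as infinite.
    first_day = first_day.reindex(df["model"].unique(), fill_value=np.inf)
    return first_day.quantile(quantile)

For example, days_to_threshold(df, 100) returns the number of days of activity by which 80% of models have at least 100 resolved questions.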
