End-to-end evaluation pipeline for benchmarking MLLMs on 2D-to-3D origami spatial mapping and reasoning tasks (FrontierIR, NeusymBridge, & LMReasoning Bridge @ AAAI 2026, Springer Nature 2026).
Paper: arXiv:2512.22207 | Dataset: Hugging Face
GamiBench includes 186 valid and 186 impossible crease-pattern examples. Each crease pattern uses mountain/valley fold assignments and is paired with corresponding 3D folded outcomes across 6 viewpoints (top, bottom, front, back, right, left).
The evaluation framework is config-driven and reproducible: it supports single-model and multi-model runs, seeded deterministic generation of the three task types (standard, alternative-view, and impossible), checkpoint/resume, and structured result logging for end-to-end benchmarking.
```
.
├── configs/                 # Configuration files (YAML/JSON)
│   ├── base.yaml            # Base configuration template
│   ├── experiments/         # Experiment-specific configs
│   ├── models/              # Model configurations
│   └── datasets/            # Dataset configurations
├── data/                    # Dataset folders (creases + fold viewpoints)
│   └── GamiBench/
├── models/                  # Model definitions, wrappers
│   ├── base.py              # BaseModel interface
│   └── model_factory.py     # Model factory
├── evaluators/              # Evaluation logic
│   └── base.py              # BaseEvaluator interface
├── baselines/               # Baseline implementations
├── experiments/             # Experiment scripts
├── utils/                   # Shared utilities
│   ├── config_loader.py
│   ├── logger.py
│   ├── seeding.py
│   ├── data_loader.py
│   └── result_saver.py
├── outputs/                 # Results, logs, checkpoints
│   ├── results/
│   ├── logs/
│   └── checkpoints/
├── scripts/                 # One-off scripts, analysis
├── pipeline.py              # Main pipeline orchestration
└── run.py                   # Main entry point
```
```bash
# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

```bash
# Run with a config file
python run.py configs/experiments/example.yaml
```
```bash
# With overrides
python run.py configs/experiments/example.yaml \
    --override model.name=gpt-4 \
    --override evaluator.batch_size=10
```

```bash
python run.py \
    --benchmark ... \
    --model gpt-4 \
    --model-config configs/models/openai.yaml
```

The notebook evaluation flow is available as reproducible script runners:
```bash
# Single model (standard + alternative-view + impossible tasks)
python run.py configs/experiments/gamibench_single.yaml

# Multi-model suite (closed + open model groups, deterministic task plan)
python scripts/run_gamibench_suite.py --config configs/experiments/gamibench_suite.yaml --group all

# Closed-only or open-only
python scripts/run_gamibench_suite.py --group closed
python scripts/run_gamibench_suite.py --group open

# Run only selected models by id
python scripts/run_gamibench_suite.py --models openai_gpt4o_mini claude_4_5_sonnet

# Resume unfinished model checkpoints
python scripts/run_gamibench_suite.py --resume
```

Set provider keys before running:
```bash
export OPENAI_API_KEY=...
export ANTHROPIC_API_KEY=...
export GEMINI_API_KEY=...
export XAI_API_KEY=...
export OPENROUTER_API_KEY=...
```

You can publish and sync the dataset with the built-in scripts.
```bash
# 1) Authenticate once
hf auth login

# 2) Publish dataset to a HF dataset repo
python scripts/publish_hf_dataset.py \
    --repo-id YOUR_USERNAME/GamiBench \
    --private

# Optional dry-run preview
python scripts/publish_hf_dataset.py \
    --repo-id YOUR_USERNAME/GamiBench \
    --dry-run
```

This uploads:

- `data/GamiBench` (dataset files)
- `configs/experiments/gamibench_single.yaml` and `gamibench_suite.yaml`
- `hf/README_dataset.md` as the dataset card (`README.md` in the HF repo)
To download/sync the dataset locally:
```bash
python scripts/download_hf_dataset.py \
    --repo-id YOUR_USERNAME/GamiBench \
    --local-dir data
```

Configuration files use YAML format and support:

- Hierarchical configs (base + experiment-specific)
- Environment variables (`${VAR_NAME}`)
- Command-line overrides
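The `${VAR_NAME}` interpolation and dotted `key=value` overrides described above could be implemented roughly as follows. This is a sketch, not the actual `utils/config_loader.py`; names and behavior (e.g. override values staying strings, unset variables left as-is) are assumptions:

```python
import os
import re

_VAR = re.compile(r"\$\{(\w+)\}")

def interpolate_env(value):
    """Recursively replace ${VAR_NAME} placeholders with environment values.

    Placeholders whose variable is unset are left untouched.
    """
    if isinstance(value, dict):
        return {k: interpolate_env(v) for k, v in value.items()}
    if isinstance(value, list):
        return [interpolate_env(v) for v in value]
    if isinstance(value, str):
        return _VAR.sub(lambda m: os.environ.get(m.group(1), m.group(0)), value)
    return value

def apply_override(config, override):
    """Apply a dotted override such as 'model.name=gpt-4' in place.

    Intermediate dicts are created as needed; the value is kept as a string
    (type coercion is omitted in this sketch).
    """
    key, _, raw = override.partition("=")
    parts = key.split(".")
    node = config
    for part in parts[:-1]:
        node = node.setdefault(part, {})
    node[parts[-1]] = raw
    return config
```

For example, `apply_override(cfg, "model.name=gpt-4")` mirrors the `--override model.name=gpt-4` flag shown earlier.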
An example experiment config:

```yaml
experiment_name: "my_experiment"
seed: 42

model:
  type: "openai"
  name: "gpt-4"
  api_key: "${OPENAI_API_KEY}"
  temperature: 0.7

dataset:
  path: "data/my_dataset.json"
  format: "json"

evaluator:
  type: "my_benchmark"
  batch_size: 10

output_dir: "outputs/results"
```

- Create a model class inheriting from `BaseModel`:
```python
# models/my_model.py
from .base import BaseModel

class MyModel(BaseModel):
    def generate(self, prompt, **kwargs):
        # Implementation
        pass

    def score(self, prompt, completion, **kwargs):
        # Implementation
        pass
```

- Register it in the factory:
```python
# models/__init__.py
from .model_factory import ModelFactory
from .my_model import MyModel

ModelFactory.register("my_model", MyModel)
```

- Create an evaluator class inheriting from `BaseEvaluator`:
```python
# evaluators/my_benchmark.py
from .base import BaseEvaluator

class MyBenchmarkEvaluator(BaseEvaluator):
    def evaluate(self):
        results = self.evaluate_batch(self.data)
        metrics = self.compute_metrics(results)
        return {
            'results': results,
            'metrics': metrics,
        }

    def evaluate_single(self, example):
        # Implementation
        pass
```

Results are saved in `outputs/results/` with:

- `results.json`: Full evaluation results
- `metrics.json`: Aggregated metrics
- `config.yaml`: Frozen configuration
- `metadata.json`: Experiment metadata
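A saver producing that layout could look like the sketch below. Function and parameter names are illustrative (the repo's actual implementation lives in `utils/result_saver.py`); the frozen config is serialized as JSON here for a dependency-free example, which is also valid YAML:

```python
import json
from pathlib import Path

def save_results(output_dir, results, metrics, config, metadata):
    """Write the four standard artifacts into output_dir."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "results.json").write_text(json.dumps(results, indent=2))
    (out / "metrics.json").write_text(json.dumps(metrics, indent=2))
    # Freeze the fully resolved configuration alongside the results so a
    # run can be reproduced later. JSON is a subset of YAML, so this file
    # still parses as YAML.
    (out / "config.yaml").write_text(json.dumps(config, indent=2))
    (out / "metadata.json").write_text(json.dumps(metadata, indent=2))
```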
- See `configs/base.yaml` for configuration options
- See `models/base.py` for the model interface
- See `evaluators/base.py` for the evaluator interface
- Follow the modular structure
- Add docstrings to all functions
- Write tests for new components
- Update documentation
If you use GamiBench in your work, please cite:
```bibtex
@misc{spencer2025gamibenchevaluatingspatialreasoning,
      title={GamiBench: Evaluating Spatial Reasoning and 2D-to-3D Planning Capabilities of MLLMs with Origami Folding Tasks},
      author={Ryan Spencer and Roey Yaari and Ritvik Vemavarapu and Joyce Yang and Steven Ngo and Utkarsh Sharma},
      year={2025},
      eprint={2512.22207},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2512.22207},
}
```