- 2025.10 - Project launch with core architecture and initial evaluation tasks
- Continuous Updates - New research tasks and agent implementations welcome
InnovatorBench is an AI agent benchmarking framework designed to evaluate intelligent agents' ability to conduct innovative LLM research. The project provides a standardized, scalable environment where AI agents can carry out open-ended exploration in simulated research scenarios, including:
- 🔬 Innovative Research Capability Assessment - Testing whether AI agents can independently conduct valuable LLM research
- 🎯 Standardized Benchmarking - Providing unified task descriptions, evaluation metrics, and execution environments
- 🛠️ Modular Architecture - Supporting multiple agent types, tool integrations, and environment configurations
- 🧠 Agent Support - ReAct Agent with context management (a minimal loop is sketched below)
- 🏗️ Research Environment Simulation - ResearchGym provides a standardized research workspace
- 🔧 Rich Tool Integration - Search, browsing, code execution, file operations, file parsing, and more
- 💾 State Management - Checkpoint system supports recovery from task interruption
- 📊 Evaluation Framework - Standardized metrics for assessing agent research outcomes; custom (DIY) metrics are supported
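Conceptually, the agent loop combines ReAct-style reasoning with periodic checkpointing so that a long run can be interrupted and resumed. The sketch below is only an illustration of that idea, not the implementation in `agents/`; the function, file, and call signatures (`react_loop`, `CHECKPOINT`, the `llm`/`tools` callables) are hypothetical.

```python
import json
from pathlib import Path

CHECKPOINT = Path("checkpoint.json")  # hypothetical checkpoint file

def react_loop(llm, tools, task_prompt, max_steps=20):
    """Minimal ReAct-style loop: think -> act -> observe, with checkpointing between steps."""
    # Resume from a previous run if a checkpoint exists (illustrative only).
    history = json.loads(CHECKPOINT.read_text()) if CHECKPOINT.exists() else [task_prompt]

    for _ in range(max_steps):
        step = llm(history)                        # model proposes the next thought + tool call
        if step.get("finish"):
            return step["answer"]
        observation = tools[step["tool"]](**step["args"])  # e.g. search, code execution, file ops
        history.append({"action": step, "observation": observation})
        CHECKPOINT.write_text(json.dumps(history))         # persist state so the task can resume
    return None
```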
- Clone the project
git clone https://github.com/GAIR-NLP/InnovatorBench.git
cd InnovatorBench
- Install the Conda environment for agents and evaluations
# Create the Conda environment
conda create -n ai_innovator python=3.10
conda activate ai_innovator
pip install -r requirements.txt
# Editable install with no version control (alpaca_eval==0.6.2)
pip install -e alpaca_eval-0.6.2
Important: This environment also contains the packages used for evaluation. If you want to use your own agent, you should still set up this conda environment to run the InnovatorBench evaluation environment.
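As an optional sanity check, you can confirm from inside the `ai_innovator` environment that the editable install is importable (the module name `alpaca_eval` is assumed from the folder name above):

```python
# Quick, optional check: run inside the ai_innovator conda environment.
try:
    import alpaca_eval  # installed above in editable mode
    print("alpaca_eval loaded from:", alpaca_eval.__file__)
except ImportError as err:
    print("alpaca_eval is not importable; re-run the pip install steps above:", err)
```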
- Prepare the InnovatorBench dataset
  - Download the InnovatorBench dataset from Hugging Face
  - Unzip the dataset and save it to a path that can be accessed by all computers in the cluster (suggested: `./datasets`)
  - Download datasets & checkpoints following the README.md inside the dataset
  - Copy & paste the `corpus` folder into `task_10` and `task_16` (see the sketch after this list)
  - Move `evaluations` into the InnovatorBench folder (i.e. `./evaluations`)
  - Update `workspace_dataset_path` in `research_gym/configs/tasks/task_i.yaml` to the real dataset path, e.g. `./datasets`
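If you prefer to script the copy step, a sketch like the one below works. The names `./datasets`, `corpus`, `task_10`, and `task_16` come from the steps above, but the exact destination layout inside each task folder is an assumption; check the dataset's README.md before relying on it.

```python
import shutil
from pathlib import Path

DATASET_ROOT = Path("./datasets")            # where the dataset was unzipped (see above)
CORPUS = DATASET_ROOT / "corpus"             # assumed location of the corpus folder

# Copy the corpus folder into the tasks that need it (task_10 and task_16).
for task in ("task_10", "task_16"):
    target = DATASET_ROOT / task / "corpus"  # assumed destination; follow the dataset README
    shutil.copytree(CORPUS, target, dirs_exist_ok=True)
    print("copied corpus ->", target)
```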
- Prepare ResearchGym
  - Set your API key in `alpaca_eval-0.6.2/client_configs/openai_configs.yaml` (reference: alpaca_eval)
  - Prepare Docker, web browsing, etc. in the backend
  - Deploy the Docker containers and the web server, then update the corresponding keys (`computer_ip`, `web_server_host`) in `research_gym/configs/tasks/task_i.yaml`. Tasks running at the same time need distinct computer IPs, but they can share the same web server. (You may need to deploy several containers at the same time.)
  - If your computers need a proxy to reach each other, set `cmd_proxy_url` in `research_gym/configs/tasks/task_i.yaml`; otherwise set it to null
  - Update the other keys in `research_gym/configs/tasks/task_i.yaml` (a scripted example follows this list)
  - Set your API key in `evaluations/base/data_classes.py`
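If you have many `task_i.yaml` files to edit, the keys named above (`workspace_dataset_path`, `computer_ip`, `web_server_host`, `cmd_proxy_url`) can be patched with a short script. This sketch assumes those keys sit at the top level of each YAML file and uses example values that you must replace; it is not part of the repository.

```python
import glob
import yaml  # PyYAML; install it into ai_innovator if it is not already present

# Example values only -- replace with your own cluster details.
UPDATES = {
    "workspace_dataset_path": "./datasets",
    "computer_ip": "10.0.0.11",      # each concurrently running task needs its own IP(s)
    "web_server_host": "10.0.0.2",   # concurrent tasks may share one web server
    "cmd_proxy_url": None,           # set a proxy URL here if your machines need one
}

for path in sorted(glob.glob("research_gym/configs/tasks/task_*.yaml")):
    with open(path) as f:
        cfg = yaml.safe_load(f) or {}
    cfg.update(UPDATES)              # note: writes the same values into every task config
    with open(path, "w") as f:
        yaml.safe_dump(cfg, f, sort_keys=False)
    print("updated", path)
```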
- Configure `agents/config/agent_config.yaml` or `agents/config/agent_config_browse.yaml`
- Change the config in `research_gym/configs/tasks/task_i.yaml` to your own paths/keys
- If you want to create your own task, copy `research_gym/configs/tasks/base.yaml` and change the parameters inside it
Run a single task
conda activate ai_innovator
cd path_to_InnovatorBench
python main.py -t path_to_env_config -a path_to_agent_config
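For example, with task 1 and the default ReAct agent configuration the invocation might look like `python main.py -t research_gym/configs/tasks/task_1.yaml -a agents/config/agent_config.yaml`; both paths are illustrative, so point them at the config files you prepared above.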
InnovatorBench/
├── agents/                     # Intelligent Agent System
│   ├── agents/                 # Core agent implementations (ReAct, etc.)
│   ├── config/                 # Agent configuration files
│   ├── context/                # Context management system
│   └── utils/                  # Agent utility functions
├── research_gym/               # Research Environment Simulator (ResearchGym)
│   ├── action/                 # Action execution system
│   ├── observation/            # Observation processing system
│   ├── backend/                # Backend used by ResearchGym
│   ├── applications/           # Application tool integration
│   └── configs/                # Environment configuration
├── evaluations/                # Evaluation Task Collection
│   ├── base/                   # Evaluation base classes
│   ├── task_1/                 # Task 1: Dataset Construction & Analysis
│   ├── task_2/                 # Task 2: Model Training & Optimization
│   └── ...
├── llm/                        # Large Language Model Integration
│   ├── openai_client.py        # OpenAI API client
│   ├── anthropic_client.py     # Anthropic API client
│   └── base_client.py          # Base client interface
├── main.py                     # Main program entry
├── requirements.txt            # Project dependencies
└── ...
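The `llm/` package holds the model backends: `base_client.py` defines the shared interface that `openai_client.py` and `anthropic_client.py` implement. The sketch below only illustrates how a new backend could slot into that layout; the class and method names (`BaseClient`, `chat`) are assumptions, so consult `llm/base_client.py` for the real interface before implementing one.

```python
# Hypothetical illustration of a new backend under llm/ -- names are not the project's real API.
from abc import ABC, abstractmethod


class BaseClient(ABC):
    """Assumed shape of the shared client interface (see llm/base_client.py for the real one)."""

    @abstractmethod
    def chat(self, messages: list[dict]) -> str:
        """Send a list of {'role': ..., 'content': ...} messages and return the reply text."""


class MyProviderClient(BaseClient):
    def __init__(self, api_key: str, model: str = "my-model"):
        self.api_key = api_key
        self.model = model

    def chat(self, messages: list[dict]) -> str:
        # Call your provider's API here; the stub reply keeps the sketch runnable.
        return f"[{self.model}] stub reply to {len(messages)} message(s)"
```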
| Benchmark | Task Resource | Max Eval times | Multi-GPU / Multi-Node | Save and Restore | Creativity | Time Horizon |
|---|---|---|---|---|---|---|
| SWE-bench | GitHub Issues | 1 | ❌ | ❌ | ❌ | 30m–2h |
| ScienceAgentBench | Scientific Papers | 1 | ❌ | ❌ | ❌ | 10m |
| RExBench | NeurIPS, ACL*, etc. Papers | 3 | ❌ | ❌ | ❌ | 6h–12h |
| RE-Bench | Designed Manually | 1 | ❌ | ❌ | ❌ | 12m–2h |
| EXP-Bench | NeurIPS, ICLR Papers | 1 | ❌ | ❌ | ❌ | 35m |
| PaperBench | ICML 2024 Papers | 1 | ❌ | ❌ | ❌ | 1h–3h |
| ML-Bench | Kaggle Competitions | 1 | ❌ | ❌ | ❌ | Unknown |
| MLE-bench | Kaggle ML Tasks | ∞ | ❌ | ❌ | ❌ | 10m |
| InnovatorBench | NeurIPS, ICLR, etc. Papers | 4 | ✅ | ✅ | ✅ | 2h–36h |
Leaderboard under construction - Agent implementations welcome!
| Research Domain | Claude Sonnet 4 (Final) | Claude Sonnet 4 (Best) | GPT-5 (Final) | GPT-5 (Best) | GLM-4.5 (Final) | GLM-4.5 (Best) | Kimi-K2 (Final) | Kimi-K2 (Best) |
|---|---|---|---|---|---|---|---|---|
| Data Construction | 25.47 | 26.88 | 8.41 | 8.41 | 15.29 | 22.65 | 14.01 | 14.08 |
| Data Filtering | 30.89 | 31.47 | 8.97 | 9.48 | 5.16 | 5.36 | 7.39 | 7.97 |
| Data Augmentation | 22.73 | 22.73 | 0.00 | 0.00 | 25.49 | 25.49 | 2.47 | 2.47 |
| Loss Design | 12.98 | 12.98 | 0.04 | 2.74 | 7.63 | 7.63 | 0.00 | 0.00 |
| Reward Design | 11.56 | 11.56 | 0.00 | 0.00 | 0.00 | 0.00 | 3.23 | 3.23 |
| Scaffold Construction | 36.63 | 37.74 | 60.07 | 60.07 | 3.33 | 3.33 | 3.33 | 3.33 |
| Weighted Average | 24.01 | 24.54 | 12.04 | 12.52 | 11.85 | 13.35 | 5.35 | 5.45 |
We welcome contributions of all kinds!
- Create new tasks in the `evaluations/` directory (a sketch of such a task module follows this list)
- Implement evaluation logic and test cases
- Submit a Pull Request
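As a rough idea of what a new task might contain, the sketch below pairs a hypothetical `evaluations/task_42/` module with a simple scoring routine. The class name, file layout, and metric format are assumptions; mirror an existing `evaluations/task_*/` implementation and the base classes in `evaluations/base/` rather than this snippet.

```python
# evaluations/task_42/evaluate.py -- hypothetical example, not the project's real base API.
import json
from pathlib import Path


class Task42Evaluator:
    """Scores the agent's output files for an imaginary task 42."""

    def __init__(self, workspace: str):
        self.workspace = Path(workspace)

    def evaluate(self) -> dict:
        # Assumed output location; real tasks define their own expected artifacts.
        preds = json.loads((self.workspace / "output" / "predictions.json").read_text())
        accuracy = sum(int(p.get("correct", False)) for p in preds) / max(len(preds), 1)
        return {"score": round(100.0 * accuracy, 2)}
```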
- Extend agent types or optimize existing implementations
- Add new tool integrations
- Improve context management
- Submit bug reports in GitHub Issues
- Share ideas and suggestions in Discussions
This project is licensed under the Apache-2.0 License.
We thank the open source projects and communities that made this work possible.
If you use InnovatorBench in your research, please cite:
@misc{wu2025innovatorbenchevaluatingagentsability,
title={InnovatorBench: Evaluating Agents' Ability to Conduct Innovative LLM Research},
author={Yunze Wu and Dayuan Fu and Weiye Si and Zhen Huang and Mohan Jiang and Keyu Li and Shijie Xia and Jie Sun and Tianze Xu and Xiangkun Hu and Pengrui Lu and Xiaojie Cai and Lyumanshan Ye and Wenhong Zhu and Yang Xiao and Pengfei Liu},
year={2025},
eprint={2510.27598},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2510.27598},
}
⭐ If you find this project helpful, please give us a Star!
