
InnovatorBench: Evaluating Agents' Ability to Conduct Innovative LLM Research

🌟 Recent News

  • 2025.10 - Project launch with core architecture and initial evaluation tasks
  • Continuous Updates - New research tasks and agent implementations welcome

πŸ“– Project Overview

InnovatorBench is a groundbreaking AI agent benchmarking framework specifically designed to evaluate intelligent agents' ability to conduct innovative LLM research tasks. The project provides a standardized, scalable environment where AI agents can engage in innovative exploration within simulated research scenarios, including:

πŸ”¬ Innovative Research Capability Assessment - Testing whether AI agents can independently conduct valuable LLM research

🎯 Standardized Benchmarking - Providing unified task description, evaluation metrics, and execution environments

πŸ› οΈ Modular Architecture - Supporting multiple agent types, tool integrations, and environment configurations

Core Features

  • 🧠 Agent Support - ReAct Agent with context management
  • πŸ‹οΈ Research Environment Simulation - ResearchGym provides standardized research workspace
  • πŸ”§ Rich Tool Integration - Search, browsing, code execution, file operations, file parsing and more
  • πŸ’Ύ State Management - Checkpoint system supports task interruption recovery
  • πŸ“ˆ Evaluation Framework - Standardized metrics for assessing agent research outcomes, with support for custom (DIY) evaluations

πŸš€ Quick Start

Installation

  1. Clone the project
git clone https://github.com/GAIR-NLP/InnovatorBench.git
cd InnovatorBench
  2. Set up the conda environment for agents and evaluations
# Create and activate the conda environment
conda create -n ai_innovator python=3.10
conda activate ai_innovator

pip install -r requirements.txt

# Editable install with no version control (alpaca_eval==0.6.2)
pip install -e alpaca_eval-0.6.2

Important: This environment also contains the packages needed for evaluation. Even if you use your own agent, you still need this conda environment to start the InnovatorBench evaluation environment.

  3. Prepare the InnovatorBench dataset (see the shell sketch after this list)
  • Download the InnovatorBench dataset from Hugging Face
  • Unzip the dataset and save it to a path that every machine in the cluster can access (suggested: ./datasets)
  • Download additional datasets & checkpoints as described in the dataset's README.md
  • Copy & paste the corpus folder into task_10 and task_16
  • Move evaluations into the InnovatorBench folder (i.e. ./evaluations)
  • Update workspace_dataset_path in research_gym/configs/tasks/task_i.yaml to the real dataset path, e.g. ./datasets
  4. Prepare ResearchGym
  • Set the API key in alpaca_eval-0.6.2/client_configs/openai_configs.yaml (reference: alpaca_eval)
  • Prepare Docker, web browsing, etc. in backend
  • Deploy the Docker containers and the web server, then update the related keys (computer_ip, web_server_host) in research_gym/configs/tasks/task_i.yaml. Tasks running at the same time need distinct computer IPs, but they can share the same web server. (You may need to deploy several containers at once.)
  • If your machines need a proxy to reach each other, set cmd_proxy_url in research_gym/configs/tasks/task_i.yaml; otherwise set it to null
  • Update the other keys in research_gym/configs/tasks/task_i.yaml
  • Set the API key in evaluations/base/data_classes.py
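A minimal command-line sketch of the dataset preparation step above. The Hugging Face repo ID, archive name, and source paths are illustrative assumptions; follow the dataset's own README.md for the exact names.

# Download the dataset from Hugging Face (the repo ID below is an assumption; check the project page)
huggingface-cli download GAIR/InnovatorBench --repo-type dataset --local-dir ./datasets

# Or unzip a manually downloaded archive into a path every machine in the cluster can reach
# (the archive name is illustrative)
unzip InnovatorBench-dataset.zip -d ./datasets

# Move the evaluation assets into the InnovatorBench folder (the source path is an assumption)
mv ./datasets/evaluations ./evaluations

# List the task configs whose workspace_dataset_path still needs to point at the real dataset path
grep -rn "workspace_dataset_path" research_gym/configs/tasks/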

Configuration

  1. Configure agents/config/agent_config.yaml or agents/config/agent_config_browse.yaml
  2. Change the config in research_gym/configs/tasks/task_i.yaml to your own paths/keys.
  3. If you want to create your own task, copy research_gym/configs/tasks/base.yaml and change the parameters inside it (see the sketch below).
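A minimal sketch of step 3 above; the file name task_custom.yaml is only an illustrative placeholder.

# Start a new task config from the provided template (task_custom.yaml is an illustrative name)
cp research_gym/configs/tasks/base.yaml research_gym/configs/tasks/task_custom.yaml

# Then edit the copy and adjust its parameters, e.g. workspace_dataset_path, computer_ip,
# web_server_host and cmd_proxy_url, to match your own cluster and dataset paths.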

Run

Run a single task

conda activate ai_innovator
cd path_to_InnovatorBench
python main.py -t path_to_env_config -a path_to_agent_config
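For example, to run task 1 with the default ReAct agent configuration (the exact config file names below are assumptions based on the paths mentioned above):

python main.py -t research_gym/configs/tasks/task_1.yaml -a agents/config/agent_config.yaml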

πŸ—οΈ Project Architecture

InnovatorBench/
β”œβ”€β”€ agents/                 # Intelligent Agent System
β”‚   β”œβ”€β”€ agents/             # Core agent implementations (ReAct, etc.)
β”‚   β”œβ”€β”€ config/             # Agent configuration files
β”‚   β”œβ”€β”€ context/            # Context management system
β”‚   └── utils/              # Agent utility functions
β”œβ”€β”€ research_gym/           # Research Environment Simulator (ResearchGym)
β”‚   β”œβ”€β”€ action/             # Action execution system
β”‚   β”œβ”€β”€ observation/        # Observation processing system
β”‚   β”œβ”€β”€ backend/            # The backend used in the ResearchGym
β”‚   β”œβ”€β”€ applications/       # Application tool integration
β”‚   └── configs/            # Environment configuration
β”œβ”€β”€ evaluations/            # Evaluation Task Collection
β”‚   β”œβ”€β”€ base/               # Evaluation base classes
β”‚   β”œβ”€β”€ task_1/             # Task 1: Dataset Construction & Analysis
β”‚   β”œβ”€β”€ task_2/             # Task 2: Model Training & Optimization
β”‚   └──  ...
β”œβ”€β”€ llm/                    # Large Language Model Integration
β”‚   β”œβ”€β”€ openai_client.py    # OpenAI API client
β”‚   β”œβ”€β”€ anthropic_client.py # Anthropic API client
β”‚   └── base_client.py      # Base client interface
β”œβ”€β”€ main.py                 # Main program entry
β”œβ”€β”€ requirements.txt        # Project dependencies
└── ...

πŸ“Š Benchmark Statistics


| Benchmark | Task Resource | Max Eval Times | Multi-GPU / Multi-Node | Save and Restore | Creativity | Time Horizon |
| --- | --- | --- | --- | --- | --- | --- |
| SWE-bench | GitHub Issues | 1 | ❌ | ❌ | ❌ | 30m–2h |
| ScienceAgentBench | Scientific Papers | 1 | ❌ | ❌ | βœ… | 10m |
| RExBench | NeurIPS, ACL*, etc. Papers | 3 | ❌ | ❌ | ❌ | 6h–12h |
| RE-Bench | Manually Designed | 1 | ❌ | ❌ | ❌ | 12m–2h |
| EXP-Bench | NeurIPS, ICLR Papers | 1 | ❌ | ❌ | ❌ | 35m |
| PaperBench | ICML 2024 Papers | 1 | ❌ | ❌ | ❌ | 1h–3h |
| ML-Bench | Kaggle Competitions | 1 | ❌ | ❌ | ❌ | Unknown |
| MLE-bench | Kaggle ML Tasks | ∞ | ❌ | ❌ | βœ… | 10m |
| InnovatorBench | NeurIPS, ICLR, etc. Papers | 4 | βœ… | βœ… | βœ… | 2h–36h |

πŸ† Leaderboard

Leaderboard under construction - Agent implementations welcome!

| Research Domain | Claude Sonnet 4 (Final) | Claude Sonnet 4 (Best) | GPT-5 (Final) | GPT-5 (Best) | GLM-4.5 (Final) | GLM-4.5 (Best) | Kimi-K2 (Final) | Kimi-K2 (Best) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Data Construction | 25.47 | 26.88 | 8.41 | 8.41 | 15.29 | 22.65 | 14.01 | 14.08 |
| Data Filtering | 30.89 | 31.47 | 8.97 | 9.48 | 5.16 | 5.36 | 7.39 | 7.97 |
| Data Augmentation | 22.73 | 22.73 | 0.00 | 0.00 | 25.49 | 25.49 | 2.47 | 2.47 |
| Loss Design | 12.98 | 12.98 | 0.04 | 2.74 | 7.63 | 7.63 | 0.00 | 0.00 |
| Reward Design | 11.56 | 11.56 | 0.00 | 0.00 | 0.00 | 0.00 | 3.23 | 3.23 |
| Scaffold Construction | 36.63 | 37.74 | 60.07 | 60.07 | 3.33 | 3.33 | 3.33 | 3.33 |
| Weighted Average | 24.01 | 24.54 | 12.04 | 12.52 | 11.85 | 13.35 | 5.35 | 5.45 |

🀝 Contributing

We welcome contributions of all kinds!

Submitting New Tasks

  1. Create new tasks in the evaluations/ directory (a sketch follows this list)
  2. Implement evaluation logic and test cases
  3. Submit Pull Request
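A minimal sketch of scaffolding a new task, assuming new tasks mirror the layout of existing ones; task_21 is only a placeholder name.

# Start a new evaluation task from an existing one (task_21 is a placeholder name)
cp -r evaluations/task_1 evaluations/task_21

# Give the new task its own environment config based on the template
cp research_gym/configs/tasks/base.yaml research_gym/configs/tasks/task_21.yaml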

Improving Agent Implementations

  1. Extend agent types or optimize existing implementations
  2. Add new tool integrations
  3. Improve context management

Reporting Issues or Suggestions

  • Submit bug reports in GitHub Issues
  • Share ideas and suggestions in Discussions

πŸ“„ License

This project is licensed under the Apache-2.0 License.

πŸ™ Acknowledgments

We thank the following open source projects and communities:

πŸ“ Citation

If you use InnovatorBench in your research, please cite:

@misc{wu2025innovatorbenchevaluatingagentsability,
      title={InnovatorBench: Evaluating Agents' Ability to Conduct Innovative LLM Research}, 
      author={Yunze Wu and Dayuan Fu and Weiye Si and Zhen Huang and Mohan Jiang and Keyu Li and Shijie Xia and Jie Sun and Tianze Xu and Xiangkun Hu and Pengrui Lu and Xiaojie Cai and Lyumanshan Ye and Wenhong Zhu and Yang Xiao and Pengfei Liu},
      year={2025},
      eprint={2510.27598},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2510.27598}, 
}

⭐ If you find this project helpful, please give us a Star!
