
InnovatorBench: Evaluating Agents' Ability to Conduct Innovative LLM Research

🌟 Recent News

  • 2025.10 - Project launch with core architecture and initial evaluation tasks
  • Continuous Updates - New research tasks and agent implementations welcome

πŸ“– Project Overview

InnovatorBench is a groundbreaking AI agent benchmarking framework specifically designed to evaluate intelligent agents' ability to conduct innovative LLM research tasks. The project provides a standardized, scalable environment where AI agents can engage in innovative exploration within simulated research scenarios, including:

πŸ”¬ Innovative Research Capability Assessment - Testing whether AI agents can independently conduct valuable LLM research

🎯 Standardized Benchmarking - Providing unified task description, evaluation metrics, and execution environments

πŸ› οΈ Modular Architecture - Supporting multiple agent types, tool integrations, and environment configurations

Core Features

  • 🧠 Agent Support - ReAct Agent with context management
  • πŸ‹οΈ Research Environment Simulation - ResearchGym provides standardized research workspace
  • πŸ”§ Rich Tool Integration - Search, browsing, code execution, file operations, file parsing and more
  • πŸ’Ύ State Management - Checkpoint system supports task interruption recovery
  • πŸ“ˆ Evaluation Framework - Standardized metrics for assessing agent research outcomes, with support for custom (DIY) evaluations

πŸš€ Quick Start

Installation

  1. Clone the project
git clone https://github.com/GAIR-NLP/InnovatorBench.git
cd InnovatorBench
  2. Set up the conda environment for agents and evaluations
# Create and activate the conda environment
conda create -n ai_innovator python=3.10
conda activate ai_innovator

pip install -r requirements.txt

# Editable install with no version control (alpaca_eval==0.6.2)
pip install -e alpaca_eval-0.6.2

Important: This environment also contains the packages needed for evaluation. Even if you use your own agent, you still need this conda environment to start the InnovatorBench evaluation environment.

  3. Prepare the InnovatorBench dataset (see the shell sketch after this list)
  • Download the InnovatorBench dataset from Hugging Face
  • Unzip the dataset and save it to a path that every machine in the cluster can access (suggested: ./datasets)
  • Download additional datasets & checkpoints as described in the dataset's README.md
  • Copy & paste the corpus folder into task_10 and task_16
  • Move evaluations into the InnovatorBench folder (i.e. ./evaluations)
  • Update workspace_dataset_path in research_gym/configs/tasks/task_i.yaml to the real dataset path, e.g. ./datasets
  4. Prepare ResearchGym
  • Set the API key in alpaca_eval-0.6.2/client_configs/openai_configs.yaml (reference: alpaca_eval)
  • Prepare Docker, web browsing, etc. in backend
  • Deploy the Docker containers and the web server, then update the related keys (computer_ip, web_server_host) in research_gym/configs/tasks/task_i.yaml. Tasks running at the same time need distinct computer IPs, but they can share the same web server. (You may need to deploy several containers at once.)
  • If your machines need a proxy to reach each other, set cmd_proxy_url in research_gym/configs/tasks/task_i.yaml; otherwise set it to null
  • Update the other keys in research_gym/configs/tasks/task_i.yaml
  • Set the API key in evaluations/base/data_classes.py
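A minimal command-line sketch of the dataset preparation step above. The Hugging Face repo ID, archive name, and source paths are illustrative assumptions; follow the dataset's own README.md for the exact names.

# Download the dataset from Hugging Face (the repo ID below is an assumption; check the project page)
huggingface-cli download GAIR/InnovatorBench --repo-type dataset --local-dir ./datasets

# Or unzip a manually downloaded archive into a path every machine in the cluster can reach
# (the archive name is illustrative)
unzip InnovatorBench-dataset.zip -d ./datasets

# Move the evaluation assets into the InnovatorBench folder (the source path is an assumption)
mv ./datasets/evaluations ./evaluations

# List the task configs whose workspace_dataset_path still needs to point at the real dataset path
grep -rn "workspace_dataset_path" research_gym/configs/tasks/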

Configuration

  1. Configure agents/config/agent_config.yaml or agents/config/agent_config_browse.yaml
  2. Change the config in research_gym/configs/tasks/task_i.yaml to your own paths/keys.
  3. If you want to create your own task, copy research_gym/configs/tasks/base.yaml and change the parameters inside it (see the sketch below).
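A minimal sketch of step 3 above; the file name task_custom.yaml is only an illustrative placeholder.

# Start a new task config from the provided template (task_custom.yaml is an illustrative name)
cp research_gym/configs/tasks/base.yaml research_gym/configs/tasks/task_custom.yaml

# Then edit the copy and adjust its parameters, e.g. workspace_dataset_path, computer_ip,
# web_server_host and cmd_proxy_url, to match your own cluster and dataset paths.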

Run

Run a single task

conda activate ai_innovator
cd path_to_InnovatorBench
python main.py -t path_to_env_config -a path_to_agent_config
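For example, to run task 1 with the default ReAct agent configuration (the exact config file names below are assumptions based on the paths mentioned above):

python main.py -t research_gym/configs/tasks/task_1.yaml -a agents/config/agent_config.yaml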

πŸ—οΈ Project Architecture

InnovatorBench/
β”œβ”€β”€ agents/                 # Intelligent Agent System
β”‚   β”œβ”€β”€ agents/             # Core agent implementations (ReAct, etc.)
β”‚   β”œβ”€β”€ config/             # Agent configuration files
β”‚   β”œβ”€β”€ context/            # Context management system
β”‚   └── utils/              # Agent utility functions
β”œβ”€β”€ research_gym/           # Research Environment Simulator (ResearchGym)
β”‚   β”œβ”€β”€ action/             # Action execution system
β”‚   β”œβ”€β”€ observation/        # Observation processing system
β”‚   β”œβ”€β”€ backend/            # The backend used in the ResearchGym
β”‚   β”œβ”€β”€ applications/       # Application tool integration
β”‚   └── configs/            # Environment configuration
β”œβ”€β”€ evaluations/            # Evaluation Task Collection
β”‚   β”œβ”€β”€ base/               # Evaluation base classes
β”‚   β”œβ”€β”€ task_1/             # Task 1: Dataset Construction & Analysis
β”‚   β”œβ”€β”€ task_2/             # Task 2: Model Training & Optimization
β”‚   └──  ...
β”œβ”€β”€ llm/                    # Large Language Model Integration
β”‚   β”œβ”€β”€ openai_client.py    # OpenAI API client
β”‚   β”œβ”€β”€ anthropic_client.py # Anthropic API client
β”‚   └── base_client.py      # Base client interface
β”œβ”€β”€ main.py                 # Main program entry
β”œβ”€β”€ requirements.txt        # Project dependencies
└── ...

πŸ“Š Benchmark Statistics


| Benchmark | Task Resource | Max Eval Times | Multi-GPU / Multi-Node | Save and Restore | Creativity | Time Horizon |
| --- | --- | --- | --- | --- | --- | --- |
| SWE-bench | GitHub Issues | 1 | ❌ | ❌ | ❌ | 30m–2h |
| ScienceAgentBench | Scientific Papers | 1 | ❌ | ❌ | βœ… | 10m |
| RExBench | NeurIPS, ACL*, etc. Papers | 3 | ❌ | ❌ | ❌ | 6h–12h |
| RE-Bench | Manually Designed | 1 | ❌ | ❌ | ❌ | 12m–2h |
| EXP-Bench | NeurIPS, ICLR Papers | 1 | ❌ | ❌ | ❌ | 35m |
| PaperBench | ICML 2024 Papers | 1 | ❌ | ❌ | ❌ | 1h–3h |
| ML-Bench | Kaggle Competitions | 1 | ❌ | ❌ | ❌ | Unknown |
| MLE-bench | Kaggle ML Tasks | ∞ | ❌ | ❌ | βœ… | 10m |
| InnovatorBench | NeurIPS, ICLR, etc. Papers | 4 | βœ… | βœ… | βœ… | 2h–36h |

πŸ† Leaderboard

Leaderboard under construction - Agent implementations welcome!

| Research Domain | Claude Sonnet 4 (Final) | Claude Sonnet 4 (Best) | GPT-5 (Final) | GPT-5 (Best) | GLM-4.5 (Final) | GLM-4.5 (Best) | Kimi-K2 (Final) | Kimi-K2 (Best) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Data Construction | 25.47 | 26.88 | 8.41 | 8.41 | 15.29 | 22.65 | 14.01 | 14.08 |
| Data Filtering | 30.89 | 31.47 | 8.97 | 9.48 | 5.16 | 5.36 | 7.39 | 7.97 |
| Data Augmentation | 22.73 | 22.73 | 0.00 | 0.00 | 25.49 | 25.49 | 2.47 | 2.47 |
| Loss Design | 12.98 | 12.98 | 0.04 | 2.74 | 7.63 | 7.63 | 0.00 | 0.00 |
| Reward Design | 11.56 | 11.56 | 0.00 | 0.00 | 0.00 | 0.00 | 3.23 | 3.23 |
| Scaffold Construction | 36.63 | 37.74 | 60.07 | 60.07 | 3.33 | 3.33 | 3.33 | 3.33 |
| Weighted Average | 24.01 | 24.54 | 12.04 | 12.52 | 11.85 | 13.35 | 5.35 | 5.45 |

🀝 Contributing

We welcome contributions of all kinds!

Submitting New Tasks

  1. Create new tasks in the evaluations/ directory (a sketch follows this list)
  2. Implement evaluation logic and test cases
  3. Submit Pull Request
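A minimal sketch of scaffolding a new task, assuming new tasks mirror the layout of existing ones; task_21 is only a placeholder name.

# Start a new evaluation task from an existing one (task_21 is a placeholder name)
cp -r evaluations/task_1 evaluations/task_21

# Give the new task its own environment config based on the template
cp research_gym/configs/tasks/base.yaml research_gym/configs/tasks/task_21.yaml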

Improving Agent Implementations

  1. Extend agent types or optimize existing implementations
  2. Add new tool integrations
  3. Improve context management

Reporting Issues or Suggestions

  • Submit bug reports in GitHub Issues
  • Share ideas and suggestions in Discussions

πŸ“„ License

This project is licensed under the Apache-2.0 License.

πŸ™ Acknowledgments

We thank the following open source projects and communities:

πŸ“ Citation

If you use InnovatorBench in your research, please cite:

@misc{wu2025innovatorbenchevaluatingagentsability,
      title={InnovatorBench: Evaluating Agents' Ability to Conduct Innovative LLM Research}, 
      author={Yunze Wu and Dayuan Fu and Weiye Si and Zhen Huang and Mohan Jiang and Keyu Li and Shijie Xia and Jie Sun and Tianze Xu and Xiangkun Hu and Pengrui Lu and Xiaojie Cai and Lyumanshan Ye and Wenhong Zhu and Yang Xiao and Pengfei Liu},
      year={2025},
      eprint={2510.27598},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2510.27598}, 
}

⭐ If you find this project helpful, please give us a Star!
