Support tool-based benchmark evaluation #5

@RuishanFang

Description:

Our evaluation framework should support a wider range of benchmarks, especially those that require tool-based evaluation (e.g., SWE-bench, WebArena).

Proposed Benchmarks to Support:

  • GAIA
  • SWE-bench
  • WebArena
  • HotpotQA

Considerations:

  • Agent performance can vary considerably across benchmarks, so developers should select benchmarks that match the characteristics of their agent system.
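
To make the proposal concrete, below is a minimal sketch of what a shared adapter interface for these benchmarks could look like. This is only an illustration: the class and method names (`BenchmarkAdapter`, `Task`, `load_tasks`, `evaluate`, `required_tools`, `HotpotQAAdapter`) are hypothetical and do not refer to existing framework code.

```python
# Hypothetical sketch of a common benchmark adapter layer; names are illustrative only.
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Iterable


@dataclass
class Task:
    """A single benchmark instance handed to the agent under test."""
    task_id: str
    prompt: str
    metadata: dict[str, Any]


class BenchmarkAdapter(ABC):
    """Interface each benchmark (GAIA, SWE-bench, WebArena, HotpotQA)
    could implement so the framework can run them uniformly."""

    # Tools the benchmark needs (e.g. a code sandbox for SWE-bench,
    # a browser for WebArena); empty for purely text-based benchmarks.
    required_tools: tuple[str, ...] = ()

    @abstractmethod
    def load_tasks(self) -> Iterable[Task]:
        """Yield the benchmark's tasks."""

    @abstractmethod
    def evaluate(self, task: Task, agent_output: str) -> float:
        """Score one agent output, e.g. by invoking the benchmark's own
        grading harness (test suite, environment check, exact match)."""


class HotpotQAAdapter(BenchmarkAdapter):
    """Example adapter: HotpotQA needs no tools and can grade by exact match."""

    required_tools = ()

    def load_tasks(self) -> Iterable[Task]:
        # Placeholder data; a real adapter would read the HotpotQA dataset.
        yield Task("hotpotqa-0", "Which magazine was started first?",
                   {"answer": "Arthur's Magazine"})

    def evaluate(self, task: Task, agent_output: str) -> float:
        return float(agent_output.strip().lower() == task.metadata["answer"].lower())
```

Declaring `required_tools` on each adapter would let the runner skip (or warn about) tool-based benchmarks such as SWE-bench or WebArena when the agent under test cannot provide the needed tools, which ties back to the consideration above.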
