Support tool-based benchmark evaluation #5

@RuishanFang

Description:

Our evaluation framework should support a wider range of benchmarks, especially those that require tool-based evaluation (e.g., SWE-bench, WebArena).

Proposed Benchmarks to Support:

  • GAIA
  • SWE-bench
  • WebArena
  • HotpotQA

Considerations:

  • Agent performance can vary considerably across benchmarks, so developers should select benchmarks that match the characteristics of their agent system.
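
To make the proposal concrete, below is a minimal sketch of what a shared adapter interface for these benchmarks could look like. This is only an illustration: the class and method names (`BenchmarkAdapter`, `Task`, `load_tasks`, `evaluate`, `required_tools`, `HotpotQAAdapter`) are hypothetical and do not refer to existing framework code.

```python
# Hypothetical sketch of a common benchmark adapter layer; names are illustrative only.
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Iterable


@dataclass
class Task:
    """A single benchmark instance handed to the agent under test."""
    task_id: str
    prompt: str
    metadata: dict[str, Any]


class BenchmarkAdapter(ABC):
    """Interface each benchmark (GAIA, SWE-bench, WebArena, HotpotQA)
    could implement so the framework can run them uniformly."""

    # Tools the benchmark needs (e.g. a code sandbox for SWE-bench,
    # a browser for WebArena); empty for purely text-based benchmarks.
    required_tools: tuple[str, ...] = ()

    @abstractmethod
    def load_tasks(self) -> Iterable[Task]:
        """Yield the benchmark's tasks."""

    @abstractmethod
    def evaluate(self, task: Task, agent_output: str) -> float:
        """Score one agent output, e.g. by invoking the benchmark's own
        grading harness (test suite, environment check, exact match)."""


class HotpotQAAdapter(BenchmarkAdapter):
    """Example adapter: HotpotQA needs no tools and can grade by exact match."""

    required_tools = ()

    def load_tasks(self) -> Iterable[Task]:
        # Placeholder data; a real adapter would read the HotpotQA dataset.
        yield Task("hotpotqa-0", "Which magazine was started first?",
                   {"answer": "Arthur's Magazine"})

    def evaluate(self, task: Task, agent_output: str) -> float:
        return float(agent_output.strip().lower() == task.metadata["answer"].lower())
```

Declaring `required_tools` on each adapter would let the runner skip (or warn about) tool-based benchmarks such as SWE-bench or WebArena when the agent under test cannot provide the needed tools, which ties back to the consideration above.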
