Open
Description:
Our framework should support evaluation on a wider range of benchmarks, especially those that require tool-based evaluation (e.g., SWE-bench, WebArena).
Proposed Benchmarks to Support:
- GAIA
- SWE-bench
- WebArena
- HotPotQA
Considerations:
- The performance of different agents can vary across benchmarks. Developers should select appropriate benchmarks based on the characteristics of the agent system.
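A minimal sketch of how benchmark selection could be expressed in code, assuming a simple registry keyed by benchmark name. `BenchmarkSpec`, `REGISTRY`, `requires_tools`, and `select_benchmarks` are hypothetical names used for illustration, not part of the framework. The `requires_tools` values for GAIA and HotPotQA are placeholders, since this issue only names SWE-bench and WebArena as needing tool-based evaluation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkSpec:
    """Hypothetical description of one supported benchmark."""
    name: str
    requires_tools: bool  # True if evaluation needs tool execution


# Hypothetical registry of the proposed benchmarks.
REGISTRY = {
    spec.name: spec
    for spec in (
        BenchmarkSpec("GAIA", requires_tools=False),      # assumption, not stated in the issue
        BenchmarkSpec("SWE-bench", requires_tools=True),  # named in the issue as tool-based
        BenchmarkSpec("WebArena", requires_tools=True),   # named in the issue as tool-based
        BenchmarkSpec("HotPotQA", requires_tools=False),  # assumption, not stated in the issue
    )
}


def select_benchmarks(agent_has_tools: bool) -> list[str]:
    """Return the benchmark names an agent could be evaluated on.

    Agents without tool support are only matched with benchmarks
    that do not require tool-based evaluation.
    """
    return [
        name
        for name, spec in REGISTRY.items()
        if agent_has_tools or not spec.requires_tools
    ]
```

For example, `select_benchmarks(agent_has_tools=False)` would skip SWE-bench and WebArena under these assumptions. A real implementation would likely need richer metadata than a single flag, such as the execution environment each benchmark expects.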