Hello,
Recently I ran OpenHands (with gpt-5-20250807), and when using the official grading script, the score is lower than the one on the SWT-Bench leaderboard (79.8%).
Specifically, after excluding 14 cases that failed due to an evaluation error (logic-star-ai/swt-bench#40), OpenHands + GPT-5 achieves a resolved rate of 69.9% (293/419). The PR here (#340) does not seem to address this problem, since I graded the predictions using the official SWT-Bench script.
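For reference, here is a minimal sketch of how I computed the rate above. This is not the official SWT-Bench tooling; the file name and report fields (`results`, `resolved`, `error_instances`) are assumptions for illustration only.

```python
import json

# Hypothetical report: a JSON file mapping instance IDs to a "resolved" flag,
# plus a list of instances that failed during evaluation (all names assumed).
with open("swt_bench_report.json") as f:  # assumed file name
    report = json.load(f)

error_ids = set(report.get("error_instances", []))  # assumed field

# Exclude instances that errored out during evaluation, then count resolutions.
graded = [i for i in report["results"] if i not in error_ids]
resolved = [i for i in graded if report["results"][i].get("resolved")]

print(f"Resolved rate: {len(resolved)}/{len(graded)} "
      f"= {len(resolved) / len(graded):.1%}")
# With the numbers in this issue: 293/419 ≈ 69.9%
```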
Could you please share any comments or thoughts on this?
Also, I was wondering why you chose to use your own grading script (eval_infer.py) instead of the official evaluation script.