Hello,
Recently I ran OpenHands (with gpt-5-20250807), and when using the official grading script, the score is lower than the one on the SWT-Bench leaderboard (79.8%).
Specifically, after excluding 14 cases that failed due to an evaluation error (logic-star-ai/swt-bench#40), OpenHands + GPT-5 achieves a resolved rate of 69.9% (293/419). The PR here (#340) does not seem to address this problem, since I graded the predictions using the official SWT-Bench script.
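For reference, here is a minimal sketch of how I computed the rate above. This is not the official SWT-Bench tooling; the file name and report fields (`results`, `resolved`, `error_instances`) are assumptions for illustration only.

```python
import json

# Hypothetical report: a JSON file mapping instance IDs to a "resolved" flag,
# plus a list of instances that failed during evaluation (all names assumed).
with open("swt_bench_report.json") as f:  # assumed file name
    report = json.load(f)

error_ids = set(report.get("error_instances", []))  # assumed field

# Exclude instances that errored out during evaluation, then count resolutions.
graded = [i for i in report["results"] if i not in error_ids]
resolved = [i for i in graded if report["results"][i].get("resolved")]

print(f"Resolved rate: {len(resolved)}/{len(graded)} "
      f"= {len(resolved) / len(graded):.1%}")
# With the numbers in this issue: 293/419 ≈ 69.9%
```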
Could you please share any comments or thoughts on this?
Also, I was wondering why you chose to use your own grading script (eval_infer.py) instead of the official evaluation script.