Increase accuracy of SWT bench results #340

juanmichelini · 2026-01-18T15:07:03Z

We were getting low SWT-bench results due to harness errors. Shoutout to @simonrosenberg; we tested many things to get to this PR.

The SWT-bench dataset is based on the SWE-bench dataset but contains fewer instances (433 vs. 500). The v0 implementation of SWT-bench used the SWE-bench dataset; when migrating, we decided to switch to the SWT-bench dataset.
It seems that their harness has a bug when running with the SWT-bench dataset, but works with the SWE-bench dataset.
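
A quick way to see how a predictions file relates to either dataset is to partition the predicted instance IDs by membership. The following is a minimal sketch, not code from this PR: the helper name `partition_predictions` and the instance IDs are hypothetical, and it assumes both datasets identify instances by a plain string ID.

```python
from typing import Iterable


def partition_predictions(
    prediction_ids: Iterable[str], dataset_ids: Iterable[str]
) -> tuple[list[str], list[str]]:
    """Split prediction IDs into those present in the dataset and those missing."""
    known = set(dataset_ids)
    covered: list[str] = []
    missing: list[str] = []
    for instance_id in prediction_ids:
        # Preserve the original prediction order in both output lists.
        (covered if instance_id in known else missing).append(instance_id)
    return covered, missing


# Hypothetical IDs, for illustration only.
larger_dataset = ["proj__a-1", "proj__a-2", "proj__b-7"]
smaller_dataset = ["proj__a-1", "proj__b-7"]
predictions = ["proj__a-1", "proj__a-2"]

covered_large, missing_large = partition_predictions(predictions, larger_dataset)
covered_small, missing_small = partition_predictions(predictions, smaller_dataset)
```

With the sample data, every prediction is covered by the larger dataset, while `proj__a-2` is reported as missing from the smaller one.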

This fix improves the results of the experiment with timestamp 26-01-16-19-09 from

Total instances: 10
Submitted instances: 10
Resolved instances: 0
Unresolved instances: 9
Empty patch instances: 0
Error instances: 1
Eval limit: 10
Success rate: 0/10 (0.0%)

to

Total instances: 10
Instances completed: 10
Mean coverage: 0.8166666666666667
Mean coverage delta: 0.8166666666666667
Instances resolved: 9
Instances unresolved: 1
Instances with errors: 0
Instances still running: 0
Still existing images: 10
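
The two reports use different key names ("Resolved instances" vs. "Instances resolved"), so comparing them takes a little normalization. Below is a minimal sketch, assuming each report is plain `key: value` text as shown above; the function names `parse_report` and `resolved_rate` are hypothetical and not part of the harness.

```python
def parse_report(text: str) -> dict[str, float]:
    """Parse 'key: value' lines, keeping only values that are plain numbers."""
    metrics: dict[str, float] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'key: value' line
        try:
            metrics[key.strip()] = float(value.strip())
        except ValueError:
            continue  # e.g. 'Success rate: 0/10 (0.0%)' is not a plain number
    return metrics


def resolved_rate(metrics: dict[str, float]) -> float:
    """Fraction of instances resolved, accepting either report's key name."""
    resolved = metrics.get("Instances resolved", metrics.get("Resolved instances"))
    if resolved is None:
        raise KeyError("report has no resolved-instances field")
    return resolved / metrics["Total instances"]


# Excerpts of the two reports quoted in this description.
before = parse_report("Total instances: 10\nResolved instances: 0\nSuccess rate: 0/10 (0.0%)")
after = parse_report("Total instances: 10\nInstances resolved: 9\nInstances unresolved: 1")

rate_before = resolved_rate(before)
rate_after = resolved_rate(after)
```

On these excerpts the resolved rate goes from 0 of 10 to 9 of 10.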

- Updated eval_infer.py to improve evaluation accuracy
- Modified patch_utils.py to enhance patch handling
- Applied code formatting

Co-authored-by: openhands <openhands@all-hands.dev>

- Revert from keep_only_test_files_in_patch back to remove_files_from_patch
- Keep new evaluation arguments (--patch_types vanilla, --build_mode api)
- Remove keep_only_test_files_in_patch function from patch_utils.py

Co-authored-by: openhands <openhands@all-hands.dev>
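
The body of `remove_files_from_patch` is not shown in this PR. As a rough illustration of the idea only, here is a hypothetical sketch of dropping selected files from a git-style unified diff. The name `remove_files_from_patch_sketch`, its signature, and the sample patch are assumptions, and it assumes each file section starts with a `diff --git a/<path> b/<path>` header.

```python
import re
from typing import Iterable

# Matches the header that opens each file section in a git-style diff.
_DIFF_HEADER = re.compile(r"^diff --git a/(.+?) b/(.+)$")


def remove_files_from_patch_sketch(patch: str, files_to_remove: Iterable[str]) -> str:
    """Return the patch without the sections that touch the given paths."""
    blocked = set(files_to_remove)
    kept: list[str] = []
    keep_current = True  # lines before the first header are preserved
    for line in patch.splitlines(keepends=True):
        match = _DIFF_HEADER.match(line.rstrip("\n"))
        if match:
            # A new file section begins; decide whether to keep it.
            keep_current = match.group(2) not in blocked
        if keep_current:
            kept.append(line)
    return "".join(kept)


# Hypothetical two-file patch, for illustration only.
sample_patch = (
    "diff --git a/setup.cfg b/setup.cfg\n"
    "--- a/setup.cfg\n"
    "+++ b/setup.cfg\n"
    "@@ -1 +1 @@\n"
    "-old\n"
    "+new\n"
    "diff --git a/tests/test_core.py b/tests/test_core.py\n"
    "--- a/tests/test_core.py\n"
    "+++ b/tests/test_core.py\n"
    "@@ -1 +1 @@\n"
    "-assert False\n"
    "+assert True\n"
)

filtered = remove_files_from_patch_sketch(sample_patch, ["setup.cfg"])
```

Here `filtered` contains only the `tests/test_core.py` section. An allow-list variant would invert the membership check, at the cost of silently dropping any file the list does not anticipate.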

neubig · 2026-01-18T15:12:00Z

benchmarks/swtbench/eval_infer.py

 def run_swtbench_evaluation(
    predictions_file: str,
-    dataset: str = "eth-sri/SWT-bench_Verified_bm25_27k_zsp",
+    dataset: str = "princeton-nlp/SWE-bench_Verified",


Is this correct?

juanmichelini and others added 4 commits January 18, 2026 11:14

Fix SWT low accuracy evaluation results

0c02231

- Updated eval_infer.py to improve evaluation accuracy
- Modified patch_utils.py to enhance patch handling
- Applied code formatting

Co-authored-by: openhands <openhands@all-hands.dev>

Restore uv.lock to match main

74c0aac

Co-authored-by: openhands <openhands@all-hands.dev>

Update dataset to princeton-nlp/SWE-bench_Verified

d4dd305

Co-authored-by: openhands <openhands@all-hands.dev>

juanmichelini requested a review from simonrosenberg January 18, 2026 15:09

neubig reviewed Jan 18, 2026

simonrosenberg approved these changes Jan 18, 2026

juanmichelini and others added 2 commits January 18, 2026 12:33

Fix eval_infer.py to properly handle evaluation results

36f7a0e

Co-authored-by: openhands <openhands@all-hands.dev>

Add comment explaining why SWE-bench dataset is used instead of SWT-bench

48bfd6b

Co-authored-by: openhands <openhands@all-hands.dev>

juanmichelini merged commit 378a91f into main Jan 18, 2026
2 checks passed

enochii mentioned this pull request Jan 19, 2026

SWTBench Score #342

Open
