@juanmichelini (Collaborator) commented Jan 18, 2026

We were getting low SWT-bench results due to harness errors. Shoutout to @simonrosenberg; we tested many things to get to this PR.

The SWT-bench dataset is based on the SWE-bench dataset but contains fewer instances (433 vs. 500). The v0 implementation of SWT-bench used the SWE-bench dataset; when migrating, we decided to switch to the SWT-bench dataset.
It seems that the SWT-bench harness contains a bug when running with the SWT-bench dataset, but works with the SWE-bench dataset.

This fix improves the results of the experiment with timestamp 26-01-16-19-09 (0 of 10 instances resolved before, 9 of 10 after) from

  • Total instances: 10
  • Submitted instances: 10
  • Resolved instances: 0
  • Unresolved instances: 9
  • Empty patch instances: 0
  • Error instances: 1
  • Eval limit: 10
  • Success rate: 0/10 (0.0%)

to

  • Total instances: 10
  • Instances completed: 10
  • Mean coverage: 0.8166666666666667
  • Mean coverage delta: 0.8166666666666667
  • Instances resolved: 9
  • Instances unresolved: 1
  • Instances with errors: 0
  • Instances still running: 0
  • Still existing images: 10

juanmichelini and others added 4 commits January 18, 2026 11:14
- Updated eval_infer.py to improve evaluation accuracy
- Modified patch_utils.py to enhance patch handling
- Applied code formatting

Co-authored-by: openhands <openhands@all-hands.dev>
- Revert from keep_only_test_files_in_patch back to remove_files_from_patch
- Keep new evaluation arguments (--patch_types vanilla, --build_mode api)
- Remove keep_only_test_files_in_patch function from patch_utils.py

Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
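
The commit notes above mention reverting to `remove_files_from_patch`, i.e. stripping selected files out of a model patch before evaluation. The snippet below is only a minimal sketch of that general idea, not the code in `patch_utils.py`: the function name `drop_files_from_patch`, its signature, and the example file names are hypothetical, and it assumes a git-style unified diff whose paths contain no spaces.

```python
import re

# Hypothetical helper for illustration only; NOT the repository's
# remove_files_from_patch. Assumes git-style headers and space-free paths.
_DIFF_HEADER = re.compile(r"^diff --git a/(?P<old>\S+) b/(?P<new>\S+)$")


def drop_files_from_patch(patch: str, files_to_drop: set[str]) -> str:
    """Return `patch` without the per-file sections that touch `files_to_drop`."""
    kept: list[str] = []
    keep_current = True  # any preamble before the first header is kept
    for line in patch.splitlines(keepends=True):
        match = _DIFF_HEADER.match(line.rstrip("\r\n"))
        if match:
            # A new file section starts here; decide whether to keep it.
            touched = {match.group("old"), match.group("new")}
            keep_current = not (touched & files_to_drop)
        if keep_current:
            kept.append(line)
    return "".join(kept)


# Made-up example patch touching one build file and one test file.
example_patch = (
    "diff --git a/setup.py b/setup.py\n"
    "--- a/setup.py\n"
    "+++ b/setup.py\n"
    "@@ -1 +1 @@\n"
    "-version = '1.0'\n"
    "+version = '1.1'\n"
    "diff --git a/tests/test_core.py b/tests/test_core.py\n"
    "--- a/tests/test_core.py\n"
    "+++ b/tests/test_core.py\n"
    "@@ -1,0 +2,2 @@\n"
    "+def test_ok():\n"
    "+    assert True\n"
)

filtered = drop_files_from_patch(example_patch, {"setup.py"})
```

In this sketch a deny-list leaves every file not explicitly named in the patch, whereas an allow-list approach (keeping only test files) would drop anything it does not recognize as a test.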
 def run_swtbench_evaluation(
     predictions_file: str,
-    dataset: str = "eth-sri/SWT-bench_Verified_bm25_27k_zsp",
+    dataset: str = "princeton-nlp/SWE-bench_Verified",
Contributor

Is this correct?

juanmichelini and others added 2 commits January 18, 2026 12:33
Co-authored-by: openhands <openhands@all-hands.dev>
…ench

Co-authored-by: openhands <openhands@all-hands.dev>
@juanmichelini juanmichelini merged commit 378a91f into main Jan 18, 2026
2 checks passed
@enochii enochii mentioned this pull request Jan 19, 2026