@juanmichelini
Collaborator

Summary

This PR changes the commit0 evaluation metric from counting resolved instances (all tests pass) to counting total passing tests.

Changes

  • Change total_instances from 16 to 3628 (total number of tests across all instances)
  • Update success rate calculation from resolved_instances / completed_instances to total_passed_tests / total_instances
  • Update logging to reflect new metric semantics
  • Maintain backward compatibility in report structure
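The difference between the two calculations can be sketched as follows. This is a minimal illustration, not the code in this PR: the function names and the `num_tests` field are hypothetical, while `num_passed` and the total of 3628 tests come from the description above.

```python
from typing import Dict, List

# Total number of tests across all commit0 instances, as stated in this PR.
TOTAL_TESTS = 3628


def resolved_rate(instances: List[Dict[str, int]]) -> float:
    """Previous metric: share of completed instances where every test passed."""
    if not instances:
        return 0.0
    # "num_tests" is a hypothetical field name used only for this sketch.
    resolved = sum(1 for i in instances if i["num_passed"] == i["num_tests"])
    return resolved / len(instances)


def passed_test_rate(
    instances: List[Dict[str, int]], total_tests: int = TOTAL_TESTS
) -> float:
    """New metric: passing tests summed over instances, over a fixed total."""
    return sum(i["num_passed"] for i in instances) / total_tests


# Hypothetical data: one fully passing instance, one partially passing.
example = [
    {"num_passed": 10, "num_tests": 10},
    {"num_passed": 5, "num_tests": 20},
]
old_value = resolved_rate(example)
new_value = passed_test_rate(example, total_tests=30)
```

Note that the new denominator is fixed, so instances that never complete contribute zero passing tests rather than being left out of the calculation.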

Why This Change

The previous metric counted how many instances (repositories) had all tests passing, so an instance with even one failing test contributed nothing. The new metric counts the total number of individual tests that passed across all instances, so partial progress on a repository counts toward the score.

Testing

  • Created sample test data with 4 instances totaling 671 tests with 521 passing
  • Verified calculation: 521 / 3628 ≈ 14.4% success rate (denominator is the full benchmark total, not the 671 tests in the sample)
  • Report structure maintained for backward compatibility
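The arithmetic above can be reproduced with a quick check. The variable names are illustrative; the three numbers come from the bullets above. The second figure is included only to show how much the choice of denominator matters for a partial sample.

```python
passed = 521            # passing tests in the sample data
sample_total = 671      # tests contained in the 4 sample instances
benchmark_total = 3628  # tests across all commit0 instances

# Rate as reported in this PR: fixed benchmark-wide denominator.
rate_vs_benchmark = f"{passed / benchmark_total:.1%}"

# For comparison only: the rate relative to the sample's own tests.
rate_vs_sample = f"{passed / sample_total:.1%}"

print(rate_vs_benchmark)  # 14.4%
print(rate_vs_sample)     # 77.6%
```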

Next Steps

This PR is marked as draft until we verify the results look correct after re-running all evaluations with the new metric.

Related PR in evaluation repo (to be created): Change BENCHMARK_INSTANCE_COUNTS['commit0'] from 16 to 3628

juanmichelini and others added 2 commits January 18, 2026 17:16
- Change total_instances from 16 to 3628 (total tests across all instances)
- Update success rate calculation from resolved_instances/completed_instances to total_passed_tests/total_instances
- Update logging to reflect new metric semantics
- Maintain backward compatibility in report structure

Co-authored-by: openhands <openhands@all-hands.dev>

- Keep total_instances=16 (number of repositories)
- Add new field sum_num_passed to report (sum of all num_passed)
- Update logging to clarify repository vs test metrics
- Accuracy will be calculated as: sum_num_passed / 3628 total tests

Co-authored-by: openhands <openhands@all-hands.dev>
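A rough sketch of the report shape described in the second commit, assuming per-instance records with `resolved` and `num_passed` fields. The `build_report` function and the `resolved` field are hypothetical; `total_instances`, `completed_instances`, `resolved_instances`, and `sum_num_passed` are the names used in this PR.

```python
from typing import Any, Dict, List


def build_report(
    instances: List[Dict[str, Any]], total_instances: int = 16
) -> Dict[str, Any]:
    """Keep the repository-level fields and add the test-level sum."""
    return {
        # Repository-level fields, unchanged for backward compatibility.
        "total_instances": total_instances,
        "completed_instances": len(instances),
        "resolved_instances": sum(1 for i in instances if i["resolved"]),
        # New field: sum of num_passed over all completed instances.
        "sum_num_passed": sum(i["num_passed"] for i in instances),
    }


# Hypothetical per-instance results.
report = build_report(
    [
        {"resolved": True, "num_passed": 40},
        {"resolved": False, "num_passed": 12},
    ]
)

# Accuracy is then derived downstream from the fixed test total.
accuracy = report["sum_num_passed"] / 3628
```

Keeping `total_instances` at 16 means existing consumers of the report still see repository counts, while the test-level accuracy is computed separately from `sum_num_passed`.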
@openhands-ai

openhands-ai bot commented Jan 18, 2026

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Pre-commit checks

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #341 at branch `commit0-test-count-metric`

Feel free to include any additional details that might help me get this PR into a better state.
