Fix OpenHands SDK Modal startup and full-eval reliability #37

Draft
AlienKevin wants to merge 1 commit into cooperbench:main from AlienKevin:kevin/eval


AlienKevin commented Feb 27, 2026

Summary

  • preserve OpenHands -oh image entrypoint in the SDK adapter so agent-server starts correctly
  • avoid repo import shadowing in OpenHands sandboxes by using / as the working directory
  • make Modal eval backend use explicit keepalive startup (entrypoint([]) + sleep infinity) for stable sb.exec
  • update vendored OpenHands SDK/tools code for Python 3.11 compatibility
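
To illustrate the import-shadowing point above, here is a minimal sketch of why the working directory matters. It assumes the shadowing comes from the current directory being placed first on sys.path, which is how `python -c` and `python -m` behave by default; the PR does not spell out the mechanism. The module name `demo_pkg`, the helper names, and the temporary directory layout are all hypothetical, and a neutral empty directory stands in for `/` so the example stays portable.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def which_copy(cwd: str, site_dir: str) -> str:
    """Report which copy of the hypothetical `demo_pkg` a fresh interpreter imports."""
    env = dict(os.environ, PYTHONPATH=site_dir)
    # Make sure the interpreter keeps its default behaviour of putting cwd first.
    env.pop("PYTHONSAFEPATH", None)
    result = subprocess.run(
        [sys.executable, "-c", "import demo_pkg; print(demo_pkg.ORIGIN)"],
        cwd=cwd,
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def demo() -> tuple[str, str]:
    """Import `demo_pkg` once from inside the repo checkout and once from a neutral dir."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        site, repo, neutral = root / "site", root / "repo", root / "neutral"
        for directory in (site, repo, neutral):
            directory.mkdir()
        # "Installed" copy, reachable through PYTHONPATH.
        (site / "demo_pkg.py").write_text('ORIGIN = "installed"\n')
        # Same-named module sitting in the repository checkout.
        (repo / "demo_pkg.py").write_text('ORIGIN = "repo checkout"\n')
        return which_copy(str(repo), str(site)), which_copy(str(neutral), str(site))


if __name__ == "__main__":
    from_repo, from_neutral = demo()
    print("cwd = repo    ->", from_repo)
    print("cwd = neutral ->", from_neutral)
```

Running from inside the checkout picks up the local copy, while running from a directory that contains no same-named module falls through to the installed one. Using `/` as the working directory is one simple way to get the second behaviour.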

Why

  • fixes repeated Modal "Sandbox ... not found" failures during run/eval
  • enables end-to-end reproduction of gemini-3-flash + openhands_sdk on full CooperBench
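
For readers unfamiliar with the keepalive pattern named in the Summary (entrypoint([]) + sleep infinity), the following is a rough sketch of how it might look with the Modal Python client. It is not the code from this PR: the function names, the app name, and the timeout are assumptions made for illustration, and the actual change lives in src/cooperbench/eval/backends/modal.py. The idea is that clearing the image entrypoint and running a process that never exits keeps the sandbox alive until commands are executed in it.

```python
KEEPALIVE_CMD = ("sleep", "infinity")


def create_keepalive_sandbox(image_ref: str, app_name: str = "cooperbench-eval-demo"):
    """Start a sandbox whose main process only sleeps, so it stays up for exec calls.

    `image_ref` and `app_name` are hypothetical parameters for this sketch.
    """
    import modal  # imported lazily: requires the `modal` package and credentials

    app = modal.App.lookup(app_name, create_if_missing=True)
    # Clear the image's own ENTRYPOINT so it cannot wrap or replace the keepalive command.
    image = modal.Image.from_registry(image_ref).entrypoint([])
    # The timeout value here is an arbitrary choice for the sketch.
    return modal.Sandbox.create(*KEEPALIVE_CMD, image=image, app=app, timeout=3600)


def run_in_sandbox(sb, *cmd: str) -> tuple[int, str]:
    """Run one command in a live sandbox and return (exit code, stdout)."""
    proc = sb.exec(*cmd)
    output = proc.stdout.read()
    returncode = proc.wait()
    return returncode, output
```

A caller would create the sandbox once, issue several `run_in_sandbox(sb, ...)` calls against it, and call `sb.terminate()` when finished.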

Scope

  • src/cooperbench/agents/openhands_agent_sdk/adapter.py
  • src/cooperbench/eval/backends/modal.py
  • vendored OpenHands SDK/tools compatibility fixes under src/cooperbench/agents/openhands_agent_sdk/

Validation

Reproduced Gemini 3 Flash + OpenHands performance on CooperBench:

{
  "pass_rate": 0.2846153846153846,
  "total_runs": 652,
  "passed": 185,
  "failed": 465,
  "errors": 2,
  "skipped": 0
}
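
As a sanity check on these numbers, the reported pass_rate is consistent with passed / (passed + failed), that is, with errored runs excluded from the denominator (185 / 652 would be roughly 0.2837 instead). This is an inference from the figures rather than something the PR states, and the `pass_rate` helper below is hypothetical.

```python
def pass_rate(passed: int, failed: int) -> float:
    """Fraction of completed (non-error, non-skipped) runs that passed."""
    completed = passed + failed
    return passed / completed if completed else 0.0


# Counts copied from the summary above.
summary = {"total_runs": 652, "passed": 185, "failed": 465, "errors": 2, "skipped": 0}

rate = pass_rate(summary["passed"], summary["failed"])
accounted = (
    summary["passed"] + summary["failed"] + summary["errors"] + summary["skipped"]
)

if __name__ == "__main__":
    print(f"pass_rate over completed runs: {rate:.4f}")
    print(f"runs accounted for: {accounted} of {summary['total_runs']}")
```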
