Open
Description
The evaluation runner has two retry layers, which confuses users who expect a single flag to control retries:
- Parent attempt loop: max_attempts (default=3) re-runs unresolved instances across passes.
- Child worker retry loop: max_retries controls in-process retries for a single instance within one attempt.
- Setting max_retries=0 still results in up to 3 runs per instance, because max_attempts defaults to 3.
- Users conflate the two parameters and are surprised by multiple attempts when max_retries=0.
Current behavior (benchmarks/utils/evaluation.py)
- max_attempts drives the outer loop (for attempt in range(1, max_attempts + 1)), re-processing instances that did not resolve.
- max_retries is used inside _process_one_mp (worker) to decide how many times to retry the same instance in that attempt. With max_retries=0 the worker tries once, but the parent may still revisit the instance on subsequent attempts.
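The interaction described above can be reproduced with a small standalone model. This is a sketch, not the code in benchmarks/utils/evaluation.py: `run_evaluation` and `run_once` are hypothetical names, and it assumes the worker retries on any unsuccessful run, which the real `_process_one_mp` may not do.

```python
from typing import Callable, Dict, List


def run_evaluation(
    instances: List[str],
    run_once: Callable[[str], bool],
    max_attempts: int = 3,
    max_retries: int = 0,
) -> Dict[str, int]:
    """Toy model of the two retry layers; returns how often each instance ran.

    run_once is a hypothetical stand-in for the real per-instance work and
    returns True when the instance resolves.
    """
    runs = {inst: 0 for inst in instances}
    pending = list(instances)

    # Outer (parent) layer: revisit unresolved instances on each attempt.
    for attempt in range(1, max_attempts + 1):
        still_unresolved = []
        for inst in pending:
            resolved = False
            # Inner (worker) layer: one initial try plus max_retries retries.
            for _ in range(max_retries + 1):
                runs[inst] += 1
                if run_once(inst):
                    resolved = True
                    break
            if not resolved:
                still_unresolved.append(inst)
        pending = still_unresolved
        if not pending:
            break
    return runs


if __name__ == "__main__":
    # An instance that never resolves still runs 3 times with max_retries=0.
    print(run_evaluation(["x"], lambda inst: False, max_retries=0))  # {'x': 3}
```

Under these assumptions the worst case is max_attempts * (max_retries + 1) runs per instance, so the defaults give 3 runs even when max_retries=0.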
Requested changes
- Simplify/clarify retry semantics so users can reliably control retries with a single concept, or at least document them clearly in workflow inputs.
- Consider reducing to one retry parameter (e.g., total tries per instance) or mapping inputs to explicit behavior (e.g., max_attempts=1 by default when max_retries=0, or expose a “total_runs” knob).
- Update workflow input documentation to make the two-layer retry explicit if they remain separate.
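One of the options above, defaulting max_attempts to 1 when max_retries=0, could look roughly like the following. This is a sketch of the proposal only: `resolve_retry_settings` and `worst_case_runs` are hypothetical helpers that do not exist in the repository, and falling back to 3 attempts in the other case simply mirrors the current default.

```python
from typing import Optional, Tuple


def resolve_retry_settings(
    max_retries: int = 0,
    max_attempts: Optional[int] = None,
) -> Tuple[int, int]:
    """Return (max_attempts, max_retries) with an explicit, documented default.

    Hypothetical mapping: if the caller does not set max_attempts, use 1 when
    max_retries == 0 so that "no retries" means exactly one run per instance.
    """
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    if max_attempts is None:
        # Assumed fallback of 3 matches the current default described above.
        max_attempts = 1 if max_retries == 0 else 3
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    return max_attempts, max_retries


def worst_case_runs(max_attempts: int, max_retries: int) -> int:
    """Upper bound on runs per instance when every run fails."""
    return max_attempts * (max_retries + 1)


if __name__ == "__main__":
    attempts, retries = resolve_retry_settings(max_retries=0)
    print(worst_case_runs(attempts, retries))  # 1
```

An explicit max_attempts still takes precedence, so existing workflows that set both inputs would keep their current behavior.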
Why this matters
- With current defaults, “max_retries=0” still yields up to 3 runs per instance, which is unexpected and complicates both debugging and runtime investigations.
References
- Parent loop in benchmarks/utils/evaluation.py (max_attempts)
- Worker loop in benchmarks/utils/evaluation.py::_process_one_mp (max_retries)