clarify retry semantics (max_attempts vs max_retries) in evaluation

  - The evaluation runner has two retry layers, which confuses users who expect a single flag to control retries:
      1. Parent attempt loop: max_attempts (default=3) re-runs unresolved instances across passes.
      2. Child worker retry loop: max_retries controls in-process retries for a single instance within one attempt.
  - Setting max_retries=0 still results in up to 3 runs per instance because max_attempts defaults to 3.
  - Users conflate the two parameters and are surprised when an instance runs more than once with max_retries=0.

  Current behavior (benchmarks/utils/evaluation.py)

  - max_attempts drives the outer loop (for attempt in range(1, max_attempts + 1)), re-processing instances that did not resolve.
  - max_retries is used inside _process_one_mp (worker) to decide how many times to retry the same instance in that attempt. With max_retries=0 the worker tries once, but the parent may still revisit the instance on subsequent attempts.
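  To make the interaction concrete, here is a minimal, self-contained sketch of the two layers. It is not the code in benchmarks/utils/evaluation.py: the names process_one, evaluate, and always_fails are hypothetical, and it assumes the worker makes one initial try plus up to max_retries retries (consistent with "with max_retries=0 the worker tries once" above).

```python
from collections import Counter
from typing import Callable, Iterable, List


def process_one(instance_id: str, run_instance: Callable[[str], bool], max_retries: int) -> bool:
    """Inner (worker) layer: one initial try plus up to max_retries retries."""
    for _ in range(max_retries + 1):
        if run_instance(instance_id):
            return True
    return False


def evaluate(
    instances: Iterable[str],
    run_instance: Callable[[str], bool],
    *,
    max_retries: int,
    max_attempts: int = 3,
) -> List[str]:
    """Outer (parent) layer: re-process unresolved instances on each pass.

    Returns the instances that are still unresolved after the last pass.
    """
    pending = list(instances)
    for attempt in range(1, max_attempts + 1):
        pending = [i for i in pending if not process_one(i, run_instance, max_retries)]
        if not pending:
            break
    return pending


# Count how often an instance that never resolves is actually run.
runs: Counter = Counter()


def always_fails(instance_id: str) -> bool:
    runs[instance_id] += 1
    return False


unresolved = evaluate(["a"], always_fails, max_retries=0)
print(runs["a"])  # 3: one try per attempt, three attempts by default
```

  Under these assumptions the worst case is max_attempts * (max_retries + 1) runs per instance, so max_retries=0 alone does not limit an instance to a single run.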

  Requested changes

  - Simplify/clarify retry semantics so users can reliably control retries with a single concept, or at least document them clearly in workflow inputs.
  - Consider reducing to one retry parameter (e.g., total tries per instance) or mapping inputs to explicit behavior (e.g., max_attempts=1 by default when max_retries=0, or expose a “total_runs” knob).
  - Update workflow input documentation to make the two-layer retry explicit if they remain separate.
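  One possible shape for the "mapping inputs to explicit behavior" option is sketched below. Everything here is hypothetical (RetryPlan, resolve_retry_plan, worst_case_runs are illustrative names, not existing code), and it assumes the worker performs max_retries + 1 tries per attempt. The only default taken from the current behavior is max_attempts=3.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RetryPlan:
    """Resolved retry settings for both layers (hypothetical helper)."""

    max_attempts: int  # outer parent passes
    max_retries: int   # inner worker retries per pass

    @property
    def worst_case_runs(self) -> int:
        # Each attempt makes one initial try plus max_retries retries.
        return self.max_attempts * (self.max_retries + 1)


def resolve_retry_plan(max_retries: int, max_attempts: Optional[int] = None) -> RetryPlan:
    """Map user inputs to explicit behavior.

    If max_attempts is not given, max_retries=0 implies a single attempt,
    so "no retries" really means one run per instance. Otherwise the
    current default of 3 attempts is kept.
    """
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    if max_attempts is None:
        max_attempts = 1 if max_retries == 0 else 3
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    return RetryPlan(max_attempts=max_attempts, max_retries=max_retries)


print(resolve_retry_plan(max_retries=0).worst_case_runs)                  # 1
print(resolve_retry_plan(max_retries=0, max_attempts=3).worst_case_runs)  # 3
```

  Logging worst_case_runs at startup would also cover the documentation-only option, since users could see the effective total regardless of which inputs they set.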

  Why this matters

  - With current defaults, “max_retries=0” still yields up to 3 runs per instance, which is unexpected and makes debugging and runtime investigations harder.

  References

  - Parent loop in benchmarks/utils/evaluation.py (max_attempts)
  - Worker loop in benchmarks/utils/evaluation.py::_process_one_mp (max_retries)

clarify retry semantics (max_attempts vs max_retries) in evaluation #295
