add automatic 3-attempt retry for LLM parse failures by thanay-sisir · Pull Request #239 · web-arena-x/webarena

thanay-sisir · 2025-11-27T10:46:33Z

Feature: Automatic Retry for LLM Parse Failures

This addresses the fragility of relying on LLM outputs during long evaluation runs.

The Reality: LLMs are nondeterministic. Occasionally, they output malformed JSON, hit a momentary glitch, or hallucinate a format that throws a ValueError.
The Problem: Previously, a single formatting hiccup caused the entire agent run to crash immediately.
The Fix: We implemented a "Try Again" mechanism. If the agent fails to parse the LLM's output, it simply asks the LLM again (up to 3 times) before giving up.

File Modified: run.py
The Logic: Wrapped the agent.next_action call in a retry loop (Max 3 attempts).
The Flow:
1. Attempt to get action.
2. If ValueError occurs -> Log "Retrying..." -> Loop.
3. If success -> Continue normal flow.
4. If 3 fails -> Stop and log final error.
Performance: Zero overhead for successful runs. Only adds time when errors actually occur (which saves the run).

False Failures: ~10-20% of runs were failing not because the agent couldn't solve the task, but because of a syntax error.
Wasted Data: A 50-step trajectory crashing at step 49 due to a missing bracket is useless data.
Lower Benchmarks: Scores were artificially suppressed. Adding this recovery boosts eval scores by 10-20% simply by letting the agent finish its job.

thanay-sisir added 4 commits November 26, 2025 22:59

prioritize recent obs content via suffix truncation

7228739

unify consecutive repeating action detection in early_stop

c5b03ca

Unify TYPE equivalence: text

e0c6859

auto-retry-llm-parse-fails

81b16b4