-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Enhance Ralph loop with retries and fix Claude Code hanging #45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Greptile OverviewGreptile SummaryThis PR addresses the Claude Code hanging issue by implementing a stream-based monitoring approach with automatic hang detection and retry logic for transient errors. Key Changes:
Issues Found:
Confidence Score: 4/5
Important Files Changed
Sequence DiagramsequenceDiagram
participant Ralph as ralph.sh
participant RetryFn as run_with_retry
participant StreamFn as run_claude_with_stream
participant Claude as claude process
participant Monitor as Output Monitor
participant Killer as Timeout Killer
Ralph->>RetryFn: Execute iteration
RetryFn->>RetryFn: Set attempt = 1
alt Tool is Claude
RetryFn->>StreamFn: Call run_claude_with_stream
StreamFn->>Claude: Spawn with stream-json output
StreamFn->>Monitor: Start monitoring loop
loop While process alive
Monitor->>Monitor: Check kill -0
Monitor->>StreamFn: Grep for type:result
alt Result detected
Monitor->>Monitor: Set result_received=true
Monitor->>Killer: Spawn timeout killer
Monitor-->>StreamFn: Break loop
end
end
alt Claude exits cleanly
Claude-->>StreamFn: Exit normally
StreamFn->>Killer: Kill timeout process
else Claude hangs
Killer->>Claude: Kill after timeout
end
StreamFn->>StreamFn: Extract result from stream
StreamFn-->>RetryFn: Return output
else Tool is Amp
RetryFn->>RetryFn: Run amp directly
end
RetryFn->>RetryFn: Check is_retryable_error
alt Retryable error detected
RetryFn->>RetryFn: Increment attempt and backoff
RetryFn->>RetryFn: Retry up to MAX_RETRIES
else Success or non-retryable
RetryFn-->>Ralph: Return output
end
Ralph->>Ralph: Check for COMPLETE signal
alt Complete
Ralph->>Ralph: Exit success
else More work
Ralph->>Ralph: Next iteration
end
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
3 files reviewed, 3 comments
|
|
||
| # Temp file for stream output (cleaned up on exit) | ||
| STREAM_OUTPUT=$(mktemp) | ||
| trap "rm -f $STREAM_OUTPUT" EXIT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrap $STREAM_OUTPUT in quotes to handle paths with spaces safely
| trap "rm -f $STREAM_OUTPUT" EXIT | |
| trap "rm -f \"$STREAM_OUTPUT\"" EXIT |
Prompt To Fix With AI
This is a comment left during a code review.
Path: ralph.sh
Line: 105:105
Comment:
Wrap `$STREAM_OUTPUT` in quotes to handle paths with spaces safely
```suggestion
trap "rm -f \"$STREAM_OUTPUT\"" EXIT
```
How can I resolve this? If you propose a fix, please make it concise.|
|
||
| # Monitor the output file for the result message | ||
| # The key insight: "type":"result" is emitted BEFORE the hang occurs | ||
| while kill -0 $claude_pid 2>/dev/null; do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Race condition: process could exit between kill -0 check and grep, causing the result check to be skipped. The loop will exit if the process dies naturally before detecting the result, potentially missing successful completions.
Prompt To Fix With AI
This is a comment left during a code review.
Path: ralph.sh
Line: 136:136
Comment:
Race condition: process could exit between `kill -0` check and grep, causing the result check to be skipped. The loop will exit if the process dies naturally before detecting the result, potentially missing successful completions.
How can I resolve this? If you propose a fix, please make it concise.| local attempt=1 | ||
| local delay=$INITIAL_RETRY_DELAY | ||
| local output="" | ||
| local exit_code=0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Variable exit_code is set but never checked - the retry logic only uses pattern matching via is_retryable_error, ignoring non-zero exit codes from successful runs with warnings
Prompt To Fix With AI
This is a comment left during a code review.
Path: ralph.sh
Line: 184:184
Comment:
Variable `exit_code` is set but never checked - the retry logic only uses pattern matching via `is_retryable_error`, ignoring non-zero exit codes from successful runs with warnings
How can I resolve this? If you propose a fix, please make it concise.Fixes Claude Code hanging via: - Retry logic with exponential backoff (default: 3 retries) - Stream-json output monitoring for hang detection - Automatic process termination after result received New options: - --retries N: Max retry attempts (default: 3) - --hang-timeout N: Seconds to wait after result (default: 5) Detects transient errors: - Rate limits, timeouts, connection resets - 502/503/504 errors - "No messages returned" errors NRtW: Retry/backoff is reliability 101. Proven to work for 26+ iterations. Streaming prevents timeout hangs that waste time. Source: snarktank/ralph#45 Author: AliOsm Changes: +144/-10 lines
|
@greptile can you do a final review to confirm this is ready to merge? |
|
Based on my review of this pull request, I can provide a final assessment for merge readiness: This PR addresses legitimate issues but has several technical problems that should be resolved before merging: ✅ Positives:
|
|
I've also ran into the error I hotfixed it with timeout some time ago and haven't revisited it + I am not using Claude for my loops at the moment, see: https://github.com/NNTin/gSnake/blob/main/scripts/ralph/ralph.sh a retry and a timeout isn't an ideal solution. Neither should be officially used. This is just a hotfix. It could be claude suddenly expecting an input or a fundamental problem in claude. But then why isn't the claude command exiting properly and the loop continued? Why is claude hanging the loop. This root cause should be analysed and fixed there. |
|
@NNTin the solution in this PR is using streaming. |
|
Yea, but it is also just a hotfix and the root cause of it is still not fully understood. is_retryable_error() {
local output="$1"
# Add patterns for known transient errors
if echo "$output" | grep -qE "No messages returned|ECONNRESET|ETIMEDOUT|rate limit|503|502|504|overloaded"; then
return 0
fi
return 1
}When Claude is rate limited it already enters the next ralph loop. The error with claude code hanging prevents ralph from entering the next ralph loop or from continuing. This should be investigated. Making a retry resets the context window completely. If the error is properly catched, e.g. claude is working for an input which can be solved with "continue", then the solution would be much prettier. Else this is just a retry hotfix overcomplicating the idea. If I use claude with Ralph again I'll investigate the error. Currently my Claude subscription is used for other tasks. |
I faced the error described in #32 multiple times. So, I combined 2 solutions which are:
This fix worked for me, and allowed Ralph to continue working with Claude Code for 26 iterations until now.