@AliOsm AliOsm commented Jan 24, 2026

I hit the error described in #32 multiple times, so I combined two solutions:

  • Retries (All tools)
  • Streaming (Claude Code only)

This fix worked for me and has let Ralph keep running with Claude Code for 26 iterations so far.

greptile-apps bot commented Jan 24, 2026

Greptile Overview

Greptile Summary

This PR addresses the Claude Code hanging issue by implementing a stream-based monitoring approach with automatic hang detection and retry logic for transient errors.

Key Changes:

  • Added --retries and --hang-timeout CLI options for configurable resilience
  • Implemented run_claude_with_stream() function that uses --output-format stream-json to detect when Claude completes work (via "type":"result" message) and terminates hung processes after a timeout
  • Added run_with_retry() function with exponential backoff for transient network/API errors
  • Both Amp and Claude Code now benefit from retry logic for common errors (ECONNRESET, ETIMEDOUT, rate limits, 5xx errors)
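
The retry-with-backoff behaviour summarised above could look roughly like the sketch below. It reuses names that appear in this thread (run_with_retry, is_retryable_error, MAX_RETRIES, INITIAL_RETRY_DELAY), but the function bodies and default values are assumptions for illustration, not the PR's actual code.

```shell
# Sketch only: bodies and defaults are assumed, not copied from the PR.
MAX_RETRIES=${MAX_RETRIES:-3}
INITIAL_RETRY_DELAY=${INITIAL_RETRY_DELAY:-5}

is_retryable_error() {
  # Transient-error patterns quoted later in this thread.
  printf '%s\n' "$1" |
    grep -qE "No messages returned|ECONNRESET|ETIMEDOUT|rate limit|503|502|504|overloaded"
}

run_with_retry() {
  local attempt=1
  local delay=$INITIAL_RETRY_DELAY
  local output=""
  while :; do
    # Run the wrapped command, capturing stdout and stderr together.
    output=$("$@" 2>&1) || true
    # Stop on clean output, or once the retry budget is spent.
    if ! is_retryable_error "$output" || [ "$attempt" -gt "$MAX_RETRIES" ]; then
      printf '%s\n' "$output"
      return 0
    fi
    echo "Transient error on attempt $attempt; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))  # double the wait before the next attempt
  done
}
```

With MAX_RETRIES=3 this allows one initial attempt plus up to three retries, with the wait doubling each time.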

Issues Found:

  • Race condition in process monitoring loop (ralph.sh:136) where Claude could exit between the kill -0 check and result detection
  • Unused exit_code variable suggests the retry logic doesn't differentiate between failures and successes with warnings

Confidence Score: 4/5

  • This PR is safe to merge with minor issues that don't affect core functionality
  • The implementation successfully addresses the reported hanging issue with a practical solution. The race condition identified is a minor timing issue that's unlikely to cause problems in practice since the result message typically arrives well before process exit. The unused exit_code variable is a code cleanliness issue but doesn't impact functionality.
  • No files require special attention - the logic is sound and addresses the core issue effectively

Important Files Changed

| Filename | Overview |
|---|---|
| ralph.sh | Adds retry logic and stream-based hang detection for Claude Code to prevent hanging issues; includes new CLI options for configurable retries and hang timeout |

Sequence Diagram

sequenceDiagram
    participant Ralph as ralph.sh
    participant RetryFn as run_with_retry
    participant StreamFn as run_claude_with_stream
    participant Claude as claude process
    participant Monitor as Output Monitor
    participant Killer as Timeout Killer

    Ralph->>RetryFn: Execute iteration
    RetryFn->>RetryFn: Set attempt = 1
    
    alt Tool is Claude
        RetryFn->>StreamFn: Call run_claude_with_stream
        StreamFn->>Claude: Spawn with stream-json output
        StreamFn->>Monitor: Start monitoring loop
        
        loop While process alive
            Monitor->>Monitor: Check kill -0
            Monitor->>StreamFn: Grep for type:result
            
            alt Result detected
                Monitor->>Monitor: Set result_received=true
                Monitor->>Killer: Spawn timeout killer
                Monitor-->>StreamFn: Break loop
            end
        end
        
        alt Claude exits cleanly
            Claude-->>StreamFn: Exit normally
            StreamFn->>Killer: Kill timeout process
        else Claude hangs
            Killer->>Claude: Kill after timeout
        end
        
        StreamFn->>StreamFn: Extract result from stream
        StreamFn-->>RetryFn: Return output
    else Tool is Amp
        RetryFn->>RetryFn: Run amp directly
    end
    
    RetryFn->>RetryFn: Check is_retryable_error
    
    alt Retryable error detected
        RetryFn->>RetryFn: Increment attempt and backoff
        RetryFn->>RetryFn: Retry up to MAX_RETRIES
    else Success or non-retryable
        RetryFn-->>Ralph: Return output
    end
    
    Ralph->>Ralph: Check for COMPLETE signal
    
    alt Complete
        Ralph->>Ralph: Exit success
    else More work
        Ralph->>Ralph: Next iteration
    end
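
The "Timeout Killer" step in the diagram can be sketched as a background watchdog. The helper name kill_after and the HANG_TIMEOUT variable in the usage comments are hypothetical; only claude_pid and killer_pid appear in this thread, and how the PR actually wires them together is not shown here.

```shell
# Sketch only: kill_after is a hypothetical helper, not a function from the PR.
# It starts a background watchdog that terminates PID $1 after $2 seconds and
# stores the watchdog's PID in killer_pid so the caller can cancel it.
kill_after() {
  local pid="$1"
  local grace="$2"
  ( sleep "$grace"; kill "$pid" 2>/dev/null ) &
  killer_pid=$!
}

# Assumed usage once the result message has been seen:
#   kill_after "$claude_pid" "$HANG_TIMEOUT"
#   wait "$claude_pid"                  # returns on clean exit or after the kill
#   kill "$killer_pid" 2>/dev/null      # cancel the watchdog if it is still waiting
```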

@greptile-apps greptile-apps bot left a comment

3 files reviewed, 3 comments

# Temp file for stream output (cleaned up on exit)
STREAM_OUTPUT=$(mktemp)
trap "rm -f $STREAM_OUTPUT" EXIT

Wrap $STREAM_OUTPUT in quotes to handle paths with spaces safely

Suggested change:
- trap "rm -f $STREAM_OUTPUT" EXIT
+ trap "rm -f \"$STREAM_OUTPUT\"" EXIT
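
An alternative to escaping the inner quotes, offered here as a sketch rather than as the reviewer's or author's chosen fix, is to single-quote the trap body. The variable is then expanded when the trap fires instead of when it is registered, and the inner double quotes keep a path with spaces as one argument.

```shell
# Create the temp file, then register cleanup on exit.
STREAM_OUTPUT=$(mktemp)
# Single quotes defer expansion of $STREAM_OUTPUT until the trap runs.
trap 'rm -f "$STREAM_OUTPUT"' EXIT
```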

# Monitor the output file for the result message
# The key insight: "type":"result" is emitted BEFORE the hang occurs
while kill -0 $claude_pid 2>/dev/null; do

Race condition: process could exit between kill -0 check and grep, causing the result check to be skipped. The loop will exit if the process dies naturally before detecting the result, potentially missing successful completions.
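
One way to close this window, sketched under assumptions, is to check the stream file once more after the loop ends, so a result written just before a natural exit is still seen. The helper name wait_for_result and the one-second poll interval are hypothetical; the grep pattern follows the "type":"result" marker mentioned in the summary.

```shell
# Sketch only: wait_for_result is a hypothetical helper.
# Returns 0 if a "type":"result" line appears in the stream file, whether it
# is seen while the process is running or only after it has exited.
wait_for_result() {
  local pid="$1"
  local stream_file="$2"
  while kill -0 "$pid" 2>/dev/null; do
    if grep -q '"type":"result"' "$stream_file" 2>/dev/null; then
      return 0
    fi
    sleep 1
  done
  # Final check: the process may have exited between kill -0 and grep.
  grep -q '"type":"result"' "$stream_file" 2>/dev/null
}
```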

local attempt=1
local delay=$INITIAL_RETRY_DELAY
local output=""
local exit_code=0

Variable exit_code is set but never checked - the retry logic only uses pattern matching via is_retryable_error, ignoring non-zero exit codes from successful runs with warnings
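
If the exit code were to be used, one possible policy (an assumption, not something the PR or the reviewer prescribes) is to retry only when the command failed and its output matches a transient pattern. The helper name should_retry is hypothetical; the pattern list is the one quoted in this thread.

```shell
# Sketch only: should_retry is a hypothetical helper.
# $1 = exit status of the tool, $2 = its captured output.
should_retry() {
  local exit_code="$1"
  local output="$2"
  # A zero exit status counts as success even if the output mentions "503".
  [ "$exit_code" -ne 0 ] || return 1
  printf '%s\n' "$output" |
    grep -qE "No messages returned|ECONNRESET|ETIMEDOUT|rate limit|503|502|504|overloaded"
}

# Assumed way to capture both values without tripping `set -e`:
#   if output=$("$@" 2>&1); then exit_code=0; else exit_code=$?; fi
```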

elasticdotventures pushed a commit to elasticdotventures/_b00t_ that referenced this pull request Jan 31, 2026
Fixes Claude Code hanging via:
- Retry logic with exponential backoff (default: 3 retries)
- Stream-json output monitoring for hang detection
- Automatic process termination after result received

New options:
- --retries N: Max retry attempts (default: 3)
- --hang-timeout N: Seconds to wait after result (default: 5)

Detects transient errors:
- Rate limits, timeouts, connection resets
- 502/503/504 errors
- "No messages returned" errors

NRtW: Retry/backoff is reliability 101. Proven to work for 26+ iterations.
Streaming prevents timeout hangs that waste time.

Source: snarktank/ralph#45
Author: AliOsm
Changes: +144/-10 lines
@snarktank
Owner

@greptile can you do a final review to confirm this is ready to merge?

greptile-apps bot commented Feb 1, 2026

Based on my review of this pull request, I can provide a final assessment for merge readiness:

This PR addresses legitimate issues but has several technical problems that should be resolved before merging:

Positives:

  • Directly addresses the documented issue #32 ("Loop stops"), where Claude Code hangs
  • Adds useful retry mechanism for transient network errors
  • The streaming approach is a clever solution to the hanging problem
  • Good exponential backoff implementation
  • Comprehensive error pattern matching for retryable errors

⚠️ Issues that need fixing:

  1. Path safety bug - Line 105: $STREAM_OUTPUT should be quoted in the trap to handle paths with spaces
  2. Race condition - Line 136: Process could exit naturally between the kill -0 check and grep, potentially missing successful completions
  3. Unused exit code - Line 184: exit_code variable is captured but never used in retry logic

🔧 Minor concerns:

  • The stream-json parsing with jq could be more robust
  • No validation that jq is installed (dependency not documented)
  • The $killer_pid cleanup could be more defensive
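
The missing jq check could be handled with a small guard at startup. This is a sketch: require_cmd is a hypothetical helper name, and where the script would call it is an assumption.

```shell
# Sketch only: require_cmd is a hypothetical helper.
# Reports a missing dependency on stderr and returns non-zero.
require_cmd() {
  command -v "$1" >/dev/null 2>&1 || {
    echo "Error: '$1' is required but not installed." >&2
    return 1
  }
}

# Assumed usage near the top of the script:
#   require_cmd jq || exit 1
```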

Recommendation:

Not ready to merge yet - Please address the three technical issues mentioned above first. The core approach is sound and will solve the hanging problem, but these bugs could cause reliability issues in production.

The fixes should be straightforward - mainly adding quotes, reordering the race condition check, and deciding whether to use the exit code in retry logic.

NNTin commented Feb 11, 2026

I've also run into this error:

═══════════════════════════════════════════════════════
  Ralph Iteration 3 of 10
═══════════════════════════════════════════════════════
This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). The promise rejected with the reason:
Error: No messages returned
    at tA0 (/$bunfs/root/claude:5332:78)
    at processTicksAndRejections (native:7:39)

I hotfixed it with timeout some time ago and haven't revisited it. I'm also not using Claude for my loops at the moment; see: https://github.com/NNTin/gSnake/blob/main/scripts/ralph/ralph.sh

A retry and a timeout aren't an ideal solution. Neither should be adopted officially; this is just a hotfix.

It could be Claude suddenly expecting input, or a fundamental problem in Claude. But then why doesn't the claude command exit properly so the loop can continue? Why is Claude hanging the loop? The root cause should be analysed and fixed there.

Author

AliOsm commented Feb 11, 2026

@NNTin the solution in this PR uses streaming.

NNTin commented Feb 11, 2026

Yeah, but it is also just a hotfix, and the root cause is still not fully understood.
The relevant part of the code in your PR is:

is_retryable_error() {
  local output="$1"
  # Add patterns for known transient errors
  if echo "$output" | grep -qE "No messages returned|ECONNRESET|ETIMEDOUT|rate limit|503|502|504|overloaded"; then
    return 0
  fi
  return 1
}

When Claude is rate limited, it already enters the next Ralph loop. The error with Claude Code hanging prevents Ralph from entering the next loop or from continuing at all. This should be investigated. Retrying resets the context window completely. If the error were properly caught, e.g. Claude is waiting for an input that could be answered with "continue", the solution would be much cleaner.

Otherwise this is just a retry hotfix that overcomplicates the idea.

If I use claude with Ralph again I'll investigate the error. Currently my Claude subscription is used for other tasks.
