fix(daemon): throttle OpenCode capture retries for dead server ports #404

Open
proboscis wants to merge 1 commit into main from issue/orch-427/run-20260209-225545

@proboscis
Owner

Summary

  • Implement per-run exponential backoff (10s → 60s cap) for message capture failures against dead OpenCode server endpoints
  • Apply stricter backoff (2x multiplier) for ECONNREFUSED errors specifically
  • Rate-limit and deduplicate log messages to avoid spam
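
The log deduplication rule in the last bullet can be sketched as follows. This is a minimal illustration, not the code in this PR: `captureState`, `recordFailure`, and the two sample errors are hypothetical names, and the WARN/DEBUG labels only stand in for whatever logger the daemon uses. A failure is logged prominently when it is the first one or when its message differs from the previous one; repeats are demoted.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// captureState is a hypothetical stand-in for the per-run backoff fields.
type captureState struct {
	FailCount   int
	NextRetryAt time.Time
	LastError   string
}

// Sample errors used by the demo below (hypothetical).
var (
	errRefused = errors.New("dial tcp 127.0.0.1:4096: connect: connection refused")
	errTimeout = errors.New("context deadline exceeded")
)

// recordFailure updates the state and reports whether the failure deserves a
// prominent log line: the first failure, or a message that differs from the
// previously recorded one.
func recordFailure(s *captureState, err error, now time.Time, delay time.Duration) bool {
	msg := err.Error()
	prominent := s.FailCount == 0 || msg != s.LastError
	s.FailCount++
	s.NextRetryAt = now.Add(delay)
	s.LastError = msg
	return prominent
}

func main() {
	s := &captureState{}
	now := time.Now()
	for _, err := range []error{errRefused, errRefused, errTimeout} {
		if recordFailure(s, err, now, 10*time.Second) {
			fmt.Println("WARN  capture failed:", err)
		} else {
			fmt.Println("DEBUG capture still failing:", err)
		}
	}
}
```

Comparing error text rather than error identity keeps the check cheap and works for errors that are rebuilt on every attempt, at the cost of treating messages that embed changing details as new errors.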

Motivation

When daemon monitoring attempted message capture against older runs whose OpenCode servers had stopped, the connection refused errors from the stale server ports created high-volume log noise and unnecessary retry load.

Changes

internal/daemon/daemon.go

  • Extended RunState struct with capture backoff state fields:
    • CaptureFailCount - tracks consecutive failures
    • LastCaptureFailAt - when last failure occurred
    • NextCaptureRetryAt - backoff deadline
    • LastCaptureError - for log deduplication
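
As a rough sketch of how these fields fit together: the field names below come from the list above, but their types, the function signatures, and the omission of every other RunState field are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// RunState shows only the capture-backoff fields named in this PR.
// Field types are assumed; the real struct has other fields as well.
type RunState struct {
	CaptureFailCount   int       // consecutive capture failures
	LastCaptureFailAt  time.Time // when the last failure occurred
	NextCaptureRetryAt time.Time // no capture attempts before this deadline
	LastCaptureError   string    // last error text, used for log deduplication
}

// shouldSkipCapture reports whether now is still inside the backoff window.
// A zero deadline means no backoff is active.
func shouldSkipCapture(s *RunState, now time.Time) bool {
	return !s.NextCaptureRetryAt.IsZero() && now.Before(s.NextCaptureRetryAt)
}

// resetCaptureBackoff clears all backoff state after a successful capture.
func resetCaptureBackoff(s *RunState) {
	s.CaptureFailCount = 0
	s.LastCaptureFailAt = time.Time{}
	s.NextCaptureRetryAt = time.Time{}
	s.LastCaptureError = ""
}

func main() {
	now := time.Now()
	s := &RunState{
		CaptureFailCount:   1,
		LastCaptureFailAt:  now,
		NextCaptureRetryAt: now.Add(10 * time.Second),
	}
	fmt.Println(shouldSkipCapture(s, now))                     // true: deadline is in the future
	fmt.Println(shouldSkipCapture(s, now.Add(11*time.Second))) // false: deadline has passed
	resetCaptureBackoff(s)
	fmt.Println(shouldSkipCapture(s, now)) // false: no backoff state
}
```

Passing `now` in explicitly, rather than calling `time.Now()` inside the helper, is a choice made here so the deadline check can be tested without sleeping.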

internal/daemon/monitor.go

  • Added backoff constants: initial=10s, max=60s, factor=2x
  • shouldSkipCapture() - skips capture during backoff period
  • handleCaptureError() - sets backoff and logs first/new errors only
  • resetCaptureBackoff() - clears state on successful capture
  • calculateCaptureBackoff() - exponential backoff with ECONNREFUSED boost
  • isConnectionRefused() - detects connection refused via net.OpError or string
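
One way the backoff calculation and the connection-refused check could look is sketched below. The constants match the values listed above; the function signatures, the constant names, and the reading of the ECONNREFUSED boost as doubling the computed delay before the cap is applied are assumptions, not the code in this PR.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"strings"
	"syscall"
	"time"
)

const (
	captureBackoffInitial = 10 * time.Second
	captureBackoffMax     = 60 * time.Second
	captureBackoffFactor  = 2
)

// exampleRefused is a hand-built error used only by the demo in main.
var exampleRefused error = &net.OpError{Op: "dial", Net: "tcp", Err: syscall.ECONNREFUSED}

// isConnectionRefused checks for ECONNREFUSED inside a *net.OpError first and
// falls back to matching the error text.
func isConnectionRefused(err error) bool {
	if err == nil {
		return false
	}
	var opErr *net.OpError
	if errors.As(err, &opErr) && errors.Is(opErr.Err, syscall.ECONNREFUSED) {
		return true
	}
	return strings.Contains(strings.ToLower(err.Error()), "connection refused")
}

// calculateCaptureBackoff returns the delay before the next attempt.
// failCount is the number of consecutive failures, counting the current one.
func calculateCaptureBackoff(failCount int, err error) time.Duration {
	backoff := captureBackoffInitial
	for i := 1; i < failCount && backoff < captureBackoffMax; i++ {
		backoff *= captureBackoffFactor
	}
	if isConnectionRefused(err) {
		backoff *= 2 // assumed form of the ECONNREFUSED boost
	}
	if backoff > captureBackoffMax {
		backoff = captureBackoffMax
	}
	return backoff
}

func main() {
	for n := 1; n <= 4; n++ {
		fmt.Println(n, calculateCaptureBackoff(n, nil), calculateCaptureBackoff(n, exampleRefused))
	}
}
```

The loop stops multiplying once the cap is reached, so a long failure streak cannot overflow the duration.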

internal/daemon/monitor_test.go

  • Tests for shouldSkipCapture, calculateCaptureBackoff, handleCaptureError, resetCaptureBackoff, isConnectionRefused

Acceptance Criteria Evidence

| Criterion | Evidence |
| --- | --- |
| Connection-refused spam reduced to bounded, periodic logs | First error logged, then debug-level only for repeats. Backoff caps at 60s. |
| Capture loop avoids tight retries | shouldSkipCapture() enforces NextCaptureRetryAt deadline. |
| Normal live OpenCode capture still works | resetCaptureBackoff() clears state on success; no changes to successful path. |
| Tests cover retry/backoff behavior | 5 new test functions with subtests covering all backoff scenarios. |

=== RUN   TestShouldSkipCapture
=== RUN   TestShouldSkipCapture/no_backoff_state_allows_capture
=== RUN   TestShouldSkipCapture/future_NextCaptureRetryAt_skips_capture
=== RUN   TestShouldSkipCapture/past_NextCaptureRetryAt_allows_capture
--- PASS: TestShouldSkipCapture (0.00s)
=== RUN   TestCalculateCaptureBackoff
=== RUN   TestCalculateCaptureBackoff/first_failure_uses_initial_backoff
=== RUN   TestCalculateCaptureBackoff/exponential_backoff_increases
=== RUN   TestCalculateCaptureBackoff/backoff_caps_at_max
--- PASS: TestCalculateCaptureBackoff (0.00s)
=== RUN   TestHandleCaptureError
=== RUN   TestHandleCaptureError/first_error_sets_backoff_state
=== RUN   TestHandleCaptureError/consecutive_errors_increase_fail_count
--- PASS: TestHandleCaptureError (0.00s)
=== RUN   TestResetCaptureBackoff
--- PASS: TestResetCaptureBackoff (0.00s)
=== RUN   TestIsConnectionRefused
--- PASS: TestIsConnectionRefused (0.00s)

Fixes: orch-427

Implement per-run backoff for message capture failures, especially
connection refused errors from stale OpenCode server endpoints.

Key changes:
- Add exponential backoff (10s -> 60s cap) for capture failures
- Apply stricter backoff (2x multiplier) for ECONNREFUSED specifically
- Track capture failure state per-run (count, timestamps, last error)
- Rate-limit/deduplicate logs (first error + different errors only)
- Reset backoff on successful capture

This reduces connection refused spam to bounded, periodic logs and
avoids tight retry loops against known-dead endpoints.

Fixes: orch-427