From 60770ffd3456e3adb6294dd0291dbf83d6919701 Mon Sep 17 00:00:00 2001
From: konard
Date: Fri, 30 Jan 2026 18:58:56 +0100
Subject: [PATCH 1/5] Initial commit with task details

Adding CLAUDE.md with task information for AI processing.
This file will be removed when the task is complete.

Issue: https://github.com/link-assistant/agent/issues/146
---
 CLAUDE.md | 5 +++++
 1 file changed, 5 insertions(+)
 create mode 100644 CLAUDE.md

diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 0000000..70ea0a6
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,5 @@
+Issue to solve: https://github.com/link-assistant/agent/issues/146
+Your prepared branch: issue-146-635c24f49ba7
+Your prepared working directory: /tmp/gh-issue-solver-1769795934959
+
+Proceed.

From c51aa8a65922b1ce6a36f443df4f9988449614f5 Mon Sep 17 00:00:00 2001
From: konard
Date: Sat, 31 Jan 2026 00:15:54 +0100
Subject: [PATCH 2/5] docs: add case study for issue #146 - agent CLI stuck with no timeout

Deep analysis of the 2h10m hang incident including:
- Complete timeline reconstruction from logs
- Root cause: missing streamText timeout configuration
- Contributing factors: no session inactivity watchdog, unref'd intervals
- Proposed solutions using AI SDK's built-in timeout options

Refs #146

Co-Authored-By: Claude Opus 4.5
---
 docs/case-studies/issue-146/README.md | 185 ++++++++++++++++++++++++++
 1 file changed, 185 insertions(+)
 create mode 100644 docs/case-studies/issue-146/README.md

diff --git a/docs/case-studies/issue-146/README.md b/docs/case-studies/issue-146/README.md
new file mode 100644
index 0000000..f415fc4
--- /dev/null
+++ b/docs/case-studies/issue-146/README.md
@@ -0,0 +1,185 @@
+# Case Study: Issue #146 - Agent CLI is stuck, there was no timeout
+
+## Summary
+
+The Agent CLI process became stuck for **2 hours and 10 minutes** with no error message, no timeout, and no recovery mechanism. The process had to be manually terminated with CTRL+C. This case study reconstructs the timeline, identifies root causes, and proposes solutions.
+
+## Issue Reference
+
+- **Issue:** https://github.com/link-assistant/agent/issues/146
+- **Reported by:** @konard
+- **Date:** 2026-01-30
+- **Component:** Agent CLI (`@link-assistant/agent` v0.8.11)
+- **Runtime:** Bun
+- **AI SDK:** `ai` v6.0.0-beta.99 (Vercel AI SDK)
+
+## Incident Timeline
+
+### Session 1 (Working normally)
+
+| Time (UTC) | Event |
+|---|---|
+| 15:01:05 | solve.mjs starts (v1.9.0), tool: `agent`, model: `opencode/big-pickle` |
+| 15:01:35 | Agent CLI launched. Session `ses_3f0937c16ffe7n0KHrkhDHOw7o` begins |
+| 15:02:36 | First step_start. Agent creates todo list (8 items) |
+| 15:06:35 | **ERROR: "The operation timed out."** (Timeout #1 - recovered after ~1m19s) |
+| 15:07:54 | Agent recovers and continues working |
+| 15:21:50 | Session 1 ends normally with `reason: stop` |
+
+### Session 2 (Gets stuck)
+
+| Time (UTC) | Event |
+|---|---|
+| 15:22:00 | Session 2 begins (`ses_3f080aea6ffeMEjbgsZFV7KCgy`) - auto-restart for uncommitted changes |
+| 15:22:11 | First step_start |
+| 15:27:08 | **ERROR: "The operation timed out."** (Timeout #2 - recovered after ~15s) |
+| 15:37:01.645 | **LAST LOG LINE WITH ACTIVITY** - step_finish with `reason: "tool-calls"` |
+| _(silence)_ | **2 hours, 10 minutes, 4 seconds of no activity** |
+| 17:47:05.206 | "Keeping directory" message appears |
+| 17:47:05.260 | **CTRL+C** - Manual forced termination |
+
+### Key Observations
+
+1. The last `step_finish` had `reason: "tool-calls"`, indicating the model requested more tool calls
+2. The next `step_start` **never materialized** - the process was stuck between steps
+3. Two previous timeouts ("The operation timed out.") were recoverable, but the final hang had **no timeout error at all**
+4. Token usage at last step: input=118, output=89, cache_read=103,658, reasoning=1
+5. No error was logged before the 2h10m gap
+
+## Root Cause Analysis
+
+### Primary Root Cause: No `streamText` chunk/step timeout
+
+The `streamText()` call in `js/src/session/prompt.ts:614` does **not** configure any timeout parameter. The Vercel AI SDK v6 supports a `timeout` option with three sub-properties:
+
+- `totalMs` - Total timeout for the entire call
+- `stepMs` - Timeout for each individual LLM step
+- `chunkMs` - Timeout between stream chunks (detects stalled streams)
+
+None of these are configured. When the upstream API connection stalls (TCP connection stays open but no data flows), the `streamText` call waits indefinitely.
+
+**Evidence:** In `js/src/session/prompt.ts:614-714`, the `streamText()` call includes `abortSignal`, `maxRetries: 0`, `stopWhen: stepCountIs(1)`, etc. but NO `timeout` configuration.
+
+### Contributing Factor: No session-level inactivity timeout
+
+The `SessionPrompt.loop()` function (`js/src/session/prompt.ts:233-728`) runs in a `while(true)` loop. There is no watchdog timer that detects if a step takes too long. If `streamText` hangs, the entire loop hangs.
+
+### Contributing Factor: `messagePromise` in continuous mode has no timeout
+
+In `js/src/cli/continuous-mode.js:264-274` and `487-497`, the `messagePromise` waits for a `session.idle` event with no timeout. If the session never reaches idle state (because `streamText` is stuck), this promise waits forever.
+
+### Contributing Factor: `setInterval` polling not `.unref()`-ed
+
+In `js/src/cli/continuous-mode.js:380` and `597`, `setInterval()` calls that poll `stdinReader.isRunning()` are not `.unref()`-ed. While this doesn't cause the hang directly, it prevents the Node.js event loop from exiting naturally even if all other work is done. Other parts of the codebase (e.g., `provider/models.ts:97`, `project/state.ts:54`) correctly use `.unref()`.
+
+### Related: Previous timeout retry work (PR #143)
+
+PR #143 ("Add automatic retry for timeout errors with configurable intervals") added retry handling for `TimeoutError` (from `AbortSignal.timeout()`). However, this only handles the case where a timeout IS detected. The core issue is that no timeout is SET on the `streamText` call, so the `DOMException` with `name === 'TimeoutError'` is never generated in the first place for stalled-stream scenarios.
+
+## Sequence of Events (Reconstructed)
+
+```
+┌─────────────────┐
+│ prompt.ts loop  │  while(true) loop in SessionPrompt.loop()
+│ step N          │
+└────────┬────────┘
+         │
+         ▼
+┌─────────────────┐
+│ streamText()    │  No timeout configured
+│ (AI SDK v6)     │
+└────────┬────────┘
+         │
+         ▼
+┌─────────────────┐
+│ HTTP/2 stream   │  Connection to LLM API (e.g., Anthropic)
+│ to provider     │
+└────────┬────────┘
+         │
+         ▼  ← Stream stalls here (TCP alive, no data)
+         │
+    ┌────┴────────────────────────────┐
+    │  No chunkMs timeout configured  │
+    │  No stepMs timeout configured   │
+    │  No totalMs timeout configured  │
+    │  AbortSignal not timed          │
+    │                                 │
+    │  Process waits indefinitely...  │
+    │  (2 hours 10 minutes)           │
+    └─────────────────────────────────┘
+         │
+         ▼
+    CTRL+C (manual intervention)
+```
+
+## Proposed Solutions
+
+### Solution 1: Add `timeout` to `streamText()` call (Primary fix)
+
+Add `chunkMs` and `stepMs` timeouts to the `streamText()` call in `js/src/session/prompt.ts`.
+
+```typescript
+const result = await processor.process(() =>
+  streamText({
+    // ... existing options ...
+    timeout: {
+      chunkMs: 120_000, // 2 minutes between chunks (detect stalled streams)
+      stepMs: 600_000, // 10 minutes per step
+    },
+  })
+);
+```
+
+**Rationale:**
+- `chunkMs: 120_000` (2 minutes) matches the existing MCP `BUILTIN_DEFAULT_TOOL_CALL_TIMEOUT`
+- `stepMs: 600_000` (10 minutes) matches the existing MCP `BUILTIN_MAX_TOOL_CALL_TIMEOUT`
+- These values are generous enough to allow large model responses while catching stalls
+
+**References:**
+- [AI SDK `streamText` timeout docs](https://ai-sdk.dev/docs/reference/ai-sdk-core/stream-text)
+- [AI SDK Issue #5438: Promises hang on streamText()](https://github.com/vercel/ai/issues/5438)
+
+### Solution 2: Add `.unref()` to `setInterval` in continuous mode
+
+```javascript
+const checkRunning = setInterval(() => {
+  // ...
+}, 100);
+checkRunning.unref(); // Allow process to exit naturally
+```
+
+### Solution 3: Add configurable timeout via CLI/environment
+
+Allow users to configure the stream timeout via:
+- Environment variable: `AGENT_STREAM_CHUNK_TIMEOUT_MS` (default: 120000)
+- Environment variable: `AGENT_STREAM_STEP_TIMEOUT_MS` (default: 600000)
+
+## Existing Libraries & Components
+
+| Library/Component | Relevance |
+|---|---|
+| [Vercel AI SDK `timeout` option](https://ai-sdk.dev/docs/reference/ai-sdk-core/stream-text) | Built-in solution - `chunkMs`, `stepMs`, `totalMs` |
+| [AbortSignal.timeout()](https://developer.mozilla.org/en-US/docs/Web/API/AbortSignal/timeout_static) | Standard Web API for timeout-based abort signals |
+| Agent's own `withTimeout()` utility (`js/src/util/timeout.ts`) | Already used for MCP tool timeouts |
+| Agent's `SessionRetry` (`js/src/session/retry.ts`) | Already handles API and socket error retries |
+
+## Impact Assessment
+
+- **Severity:** High - Process hangs indefinitely, wastes compute and API resources
+- **Frequency:** Intermittent - Depends on network conditions and provider reliability
+- **Affected users:** All users running the agent CLI in continuous or direct mode
+- **Workaround:** Manual CTRL+C termination and restart
+
+## Files Referenced
+
+- `js/src/session/prompt.ts:614-714` - `streamText()` call (missing timeout)
+- `js/src/session/processor.ts:41-395` - Stream processing loop
+- `js/src/cli/continuous-mode.js:264-274, 380, 487-497, 597` - Message promise and interval polling
+- `js/src/util/timeout.ts` - Existing timeout utility
+- `js/src/session/retry.ts` - Existing retry logic
+- `js/src/mcp/index.ts:17-26` - MCP timeout defaults (reference values)
+
+## Logs
+
+- `logs/solve-full.log` - Complete session log showing the 2h10m hang
+- `logs/solution-draft-failed.log` - Failed solution draft attempt (separate issue: invalid Unicode surrogate in API request)

From 6a9cfd28a5087c2ca074af98904cac87e59717da Mon Sep 17 00:00:00 2001
From: konard
Date: Sat, 31 Jan 2026 00:19:33 +0100
Subject: [PATCH 3/5] fix: add stream timeout to prevent agent CLI from hanging indefinitely

Root cause: The streamText() call had no timeout configuration, so when
the upstream LLM API connection stalled (TCP alive but no data), the
process waited indefinitely with no recovery mechanism.

Changes:
- Add chunkMs (2min) and stepMs (10min) timeouts to streamText() call
  using the AI SDK's built-in timeout option
- Add configurable env vars: AGENT_STREAM_CHUNK_TIMEOUT_MS (default 120s)
  and AGENT_STREAM_STEP_TIMEOUT_MS (default 600s)
- Add .unref() to setInterval in continuous mode to allow natural exit
- Add tests for stream timeout configuration

Fixes #146

Co-Authored-By: Claude Opus 4.5
---
 js/src/cli/continuous-mode.js   |   4 ++
 js/src/flag/flag.ts             |  19 +++++
 js/src/session/prompt.ts        |   4 ++
 js/tests/stream-timeout.test.js | 120 ++++++++++++++++++++++++++++++++
 4 files changed, 147 insertions(+)
 create mode 100644 js/tests/stream-timeout.test.js

diff --git a/js/src/cli/continuous-mode.js b/js/src/cli/continuous-mode.js
index 03e8d81..9e9db3a 100644
--- a/js/src/cli/continuous-mode.js
+++ b/js/src/cli/continuous-mode.js
@@ -391,6 +391,8 @@ export async function runContinuousServerMode(
         waitForPending();
       }
     }, 100);
+    // Allow process to exit naturally when no other work remains
+    checkRunning.unref();
 
     // Also handle SIGINT
     process.on('SIGINT', () => {
@@ -608,6 +610,8 @@ export async function runContinuousDirectMode(
         waitForPending();
       }
     }, 100);
+    // Allow process to exit naturally when no other work remains
+    checkRunning.unref();
 
     // Also handle SIGINT
     process.on('SIGINT', () => {
diff --git a/js/src/flag/flag.ts b/js/src/flag/flag.ts
index 773c867..a91edbb 100644
--- a/js/src/flag/flag.ts
+++ b/js/src/flag/flag.ts
@@ -63,6 +63,25 @@ export namespace Flag {
     'OPENCODE_DRY_RUN'
   );
 
+  // Stream timeout configuration
+  // chunkMs: timeout between stream chunks - detects stalled streams (default: 2 minutes)
+  // stepMs: timeout for each individual LLM step (default: 10 minutes)
+  export function STREAM_CHUNK_TIMEOUT_MS(): number {
+    const val = getEnv(
+      'LINK_ASSISTANT_AGENT_STREAM_CHUNK_TIMEOUT_MS',
+      'AGENT_STREAM_CHUNK_TIMEOUT_MS'
+    );
+    return val ? parseInt(val, 10) : 120_000;
+  }
+
+  export function STREAM_STEP_TIMEOUT_MS(): number {
+    const val = getEnv(
+      'LINK_ASSISTANT_AGENT_STREAM_STEP_TIMEOUT_MS',
+      'AGENT_STREAM_STEP_TIMEOUT_MS'
+    );
+    return val ? parseInt(val, 10) : 600_000;
+  }
+
   // Compact JSON mode - output JSON on single lines (NDJSON format)
   // Enabled by AGENT_CLI_COMPACT env var or --compact-json flag
   // Uses getter to check env var at runtime for tests
diff --git a/js/src/session/prompt.ts b/js/src/session/prompt.ts
index 38f759a..34e6d15 100644
--- a/js/src/session/prompt.ts
+++ b/js/src/session/prompt.ts
@@ -613,6 +613,10 @@ export namespace SessionPrompt {
 
       const result = await processor.process(() =>
         streamText({
+          timeout: {
+            chunkMs: Flag.STREAM_CHUNK_TIMEOUT_MS(),
+            stepMs: Flag.STREAM_STEP_TIMEOUT_MS(),
+          },
           onError(error) {
             log.error(() => ({ message: 'stream error', error }));
           },
diff --git a/js/tests/stream-timeout.test.js b/js/tests/stream-timeout.test.js
new file mode 100644
index 0000000..9c2ad9c
--- /dev/null
+++ b/js/tests/stream-timeout.test.js
@@ -0,0 +1,120 @@
+import { test, expect, describe } from 'bun:test';
+import { Flag } from '../src/flag/flag.ts';
+
+describe('Stream timeout configuration', () => {
+  describe('STREAM_CHUNK_TIMEOUT_MS', () => {
+    test('returns default value of 120000 (2 minutes)', () => {
+      // Save and clear env
+      const saved = process.env.AGENT_STREAM_CHUNK_TIMEOUT_MS;
+      const savedNew = process.env.LINK_ASSISTANT_AGENT_STREAM_CHUNK_TIMEOUT_MS;
+      delete process.env.AGENT_STREAM_CHUNK_TIMEOUT_MS;
+      delete process.env.LINK_ASSISTANT_AGENT_STREAM_CHUNK_TIMEOUT_MS;
+
+      expect(Flag.STREAM_CHUNK_TIMEOUT_MS()).toBe(120_000);
+
+      // Restore
+      if (saved !== undefined) {
+        process.env.AGENT_STREAM_CHUNK_TIMEOUT_MS = saved;
+      }
+      if (savedNew !== undefined) {
+        process.env.LINK_ASSISTANT_AGENT_STREAM_CHUNK_TIMEOUT_MS = savedNew;
+      }
+    });
+
+    test('reads from AGENT_STREAM_CHUNK_TIMEOUT_MS env var', () => {
+      const saved = process.env.AGENT_STREAM_CHUNK_TIMEOUT_MS;
+      const savedNew = process.env.LINK_ASSISTANT_AGENT_STREAM_CHUNK_TIMEOUT_MS;
+      delete process.env.LINK_ASSISTANT_AGENT_STREAM_CHUNK_TIMEOUT_MS;
+      process.env.AGENT_STREAM_CHUNK_TIMEOUT_MS = '60000';
+
+      expect(Flag.STREAM_CHUNK_TIMEOUT_MS()).toBe(60_000);
+
+      // Restore
+      if (saved !== undefined) {
+        process.env.AGENT_STREAM_CHUNK_TIMEOUT_MS = saved;
+      } else {
+        delete process.env.AGENT_STREAM_CHUNK_TIMEOUT_MS;
+      }
+      if (savedNew !== undefined) {
+        process.env.LINK_ASSISTANT_AGENT_STREAM_CHUNK_TIMEOUT_MS = savedNew;
+      }
+    });
+
+    test('LINK_ASSISTANT_AGENT prefix takes priority', () => {
+      const saved = process.env.AGENT_STREAM_CHUNK_TIMEOUT_MS;
+      const savedNew = process.env.LINK_ASSISTANT_AGENT_STREAM_CHUNK_TIMEOUT_MS;
+      process.env.AGENT_STREAM_CHUNK_TIMEOUT_MS = '60000';
+      process.env.LINK_ASSISTANT_AGENT_STREAM_CHUNK_TIMEOUT_MS = '30000';
+
+      expect(Flag.STREAM_CHUNK_TIMEOUT_MS()).toBe(30_000);
+
+      // Restore
+      if (saved !== undefined) {
+        process.env.AGENT_STREAM_CHUNK_TIMEOUT_MS = saved;
+      } else {
+        delete process.env.AGENT_STREAM_CHUNK_TIMEOUT_MS;
+      }
+      if (savedNew !== undefined) {
+        process.env.LINK_ASSISTANT_AGENT_STREAM_CHUNK_TIMEOUT_MS = savedNew;
+      } else {
+        delete process.env.LINK_ASSISTANT_AGENT_STREAM_CHUNK_TIMEOUT_MS;
+      }
+    });
+  });
+
+  describe('STREAM_STEP_TIMEOUT_MS', () => {
+    test('returns default value of 600000 (10 minutes)', () => {
+      const saved = process.env.AGENT_STREAM_STEP_TIMEOUT_MS;
+      const savedNew = process.env.LINK_ASSISTANT_AGENT_STREAM_STEP_TIMEOUT_MS;
+      delete process.env.AGENT_STREAM_STEP_TIMEOUT_MS;
+      delete process.env.LINK_ASSISTANT_AGENT_STREAM_STEP_TIMEOUT_MS;
+
+      expect(Flag.STREAM_STEP_TIMEOUT_MS()).toBe(600_000);
+
+      if (saved !== undefined) {
+        process.env.AGENT_STREAM_STEP_TIMEOUT_MS = saved;
+      }
+      if (savedNew !== undefined) {
+        process.env.LINK_ASSISTANT_AGENT_STREAM_STEP_TIMEOUT_MS = savedNew;
+      }
+    });
+
+    test('reads from AGENT_STREAM_STEP_TIMEOUT_MS env var', () => {
+      const saved = process.env.AGENT_STREAM_STEP_TIMEOUT_MS;
+      const savedNew = process.env.LINK_ASSISTANT_AGENT_STREAM_STEP_TIMEOUT_MS;
+      delete process.env.LINK_ASSISTANT_AGENT_STREAM_STEP_TIMEOUT_MS;
+      process.env.AGENT_STREAM_STEP_TIMEOUT_MS = '300000';
+
+      expect(Flag.STREAM_STEP_TIMEOUT_MS()).toBe(300_000);
+
+      if (saved !== undefined) {
+        process.env.AGENT_STREAM_STEP_TIMEOUT_MS = saved;
+      } else {
+        delete process.env.AGENT_STREAM_STEP_TIMEOUT_MS;
+      }
+      if (savedNew !== undefined) {
+        process.env.LINK_ASSISTANT_AGENT_STREAM_STEP_TIMEOUT_MS = savedNew;
+      }
+    });
+
+    test('LINK_ASSISTANT_AGENT prefix takes priority', () => {
+      const saved = process.env.AGENT_STREAM_STEP_TIMEOUT_MS;
+      const savedNew = process.env.LINK_ASSISTANT_AGENT_STREAM_STEP_TIMEOUT_MS;
+      process.env.AGENT_STREAM_STEP_TIMEOUT_MS = '300000';
+      process.env.LINK_ASSISTANT_AGENT_STREAM_STEP_TIMEOUT_MS = '120000';
+
+      expect(Flag.STREAM_STEP_TIMEOUT_MS()).toBe(120_000);
+
+      if (saved !== undefined) {
+        process.env.AGENT_STREAM_STEP_TIMEOUT_MS = saved;
+      } else {
+        delete process.env.AGENT_STREAM_STEP_TIMEOUT_MS;
+      }
+      if (savedNew !== undefined) {
+        process.env.LINK_ASSISTANT_AGENT_STREAM_STEP_TIMEOUT_MS = savedNew;
+      } else {
+        delete process.env.LINK_ASSISTANT_AGENT_STREAM_STEP_TIMEOUT_MS;
+      }
+    });
+  });
+});

From 10b288827c02f054b3382935a98b6be93c19519d Mon Sep 17 00:00:00 2001
From: konard
Date: Sat, 31 Jan 2026 00:19:56 +0100
Subject: [PATCH 4/5] chore: add changeset for stream timeout fix

Co-Authored-By: Claude Opus 4.5
---
 js/.changeset/stream-timeout.md | 5 +++++
 1 file changed, 5 insertions(+)
 create mode 100644 js/.changeset/stream-timeout.md

diff --git a/js/.changeset/stream-timeout.md b/js/.changeset/stream-timeout.md
new file mode 100644
index 0000000..fe9a735
--- /dev/null
+++ b/js/.changeset/stream-timeout.md
@@ -0,0 +1,5 @@
+---
+'@link-assistant/agent': patch
+---
+
+Add stream timeout to prevent agent CLI from hanging indefinitely when LLM API connections stall. Configurable via AGENT_STREAM_CHUNK_TIMEOUT_MS (default: 2min) and AGENT_STREAM_STEP_TIMEOUT_MS (default: 10min) environment variables.

From 32655375a9ebab9c9358937a30f4d7045212274e Mon Sep 17 00:00:00 2001
From: konard
Date: Sat, 31 Jan 2026 00:21:22 +0100
Subject: [PATCH 5/5] Revert "Initial commit with task details"

This reverts commit 60770ffd3456e3adb6294dd0291dbf83d6919701.
---
 CLAUDE.md | 5 -----
 1 file changed, 5 deletions(-)
 delete mode 100644 CLAUDE.md

diff --git a/CLAUDE.md b/CLAUDE.md
deleted file mode 100644
index 70ea0a6..0000000
--- a/CLAUDE.md
+++ /dev/null
@@ -1,5 +0,0 @@
-Issue to solve: https://github.com/link-assistant/agent/issues/146
-Your prepared branch: issue-146-635c24f49ba7
-Your prepared working directory: /tmp/gh-issue-solver-1769795934959
-
-Proceed.
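
Note on the `STREAM_CHUNK_TIMEOUT_MS()` / `STREAM_STEP_TIMEOUT_MS()` helpers added to `js/src/flag/flag.ts` in patch 3/5: they return `parseInt(val, 10)` without validating the result. `parseInt('abc', 10)` is `NaN` and `parseInt('2min', 10)` is `2`, so a mistyped environment variable would reach the `timeout` option unchanged, and the tests in this series only cover well-formed integers. The following is a minimal sketch of a stricter parser, not part of the series. The helper name `parseTimeoutMs` and the `env` record are hypothetical, and treating zero as misconfiguration is an assumption rather than documented behavior of the agent or the AI SDK.

```typescript
// Hypothetical helper (not part of this patch series): parse a millisecond
// timeout from an environment-variable string, falling back to the default
// whenever the value is missing or unusable.
function parseTimeoutMs(raw: string | undefined, fallback: number): number {
  if (raw === undefined) return fallback;
  const trimmed = raw.trim();
  // Accept plain base-10 digits only, so "2min", "1e3" and "-5" are rejected
  // instead of being partially parsed by parseInt.
  if (!/^\d+$/.test(trimmed)) return fallback;
  const ms = Number.parseInt(trimmed, 10);
  // Assumption: zero and values beyond the safe-integer range are treated
  // as misconfiguration and replaced by the default.
  return Number.isSafeInteger(ms) && ms > 0 ? ms : fallback;
}

// Stand-in for process.env so the sketch runs without Node/Bun type definitions.
const env: Record<string, string | undefined> = {
  AGENT_STREAM_CHUNK_TIMEOUT_MS: "2min", // mistyped on purpose
  AGENT_STREAM_STEP_TIMEOUT_MS: "300000",
};

const chunkMs = parseTimeoutMs(env["AGENT_STREAM_CHUNK_TIMEOUT_MS"], 120_000);
const stepMs = parseTimeoutMs(env["AGENT_STREAM_STEP_TIMEOUT_MS"], 600_000);
console.log({ chunkMs, stepMs });
```

With the values above, the mistyped chunk timeout falls back to the 2-minute default while the well-formed step timeout is honored. Whether to fall back silently or log a warning is a separate design choice that this sketch leaves open.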