Skip to content

Shell Timeout Mitigation Strategies - Architectural Patterns for Command Resilience #201

@joelteply

Description

@joelteply

Problem Statement

The current 30-second shell timeout is causing failures for legitimate long-running commands (e.g., npm install, large file operations). We need architectural patterns to handle varying command durations without arbitrary timeout increases.

Proposed Mitigation Strategies

1. Exponential Backoff with Jitter

  • Start with short timeout, increase exponentially on retry
  • Add random jitter to prevent thundering herd
  • Example: 5s → 10s → 20s → 40s with ±20% jitter

2. Command Decomposition

  • Break long operations into smaller checkpointed steps
  • Example: npm install → check cache → download packages → link dependencies
  • Each step has appropriate timeout for its complexity

3. Timeout Tiering by Command Type

  • Fast commands (ls, pwd): 5s
  • Medium commands (git status, file reads): 15s
  • Heavy commands (npm install, builds): 60s+
  • Commands declare their tier via metadata

4. Circuit Breaker Pattern

  • Track command failure rates
  • Open circuit after N consecutive timeouts
  • Prevent cascading failures
  • Half-open state for gradual recovery

Concrete Example: npm install

Current behavior: Hits 30s timeout on large dependency trees

Proposed solution:

{
  command: 'npm install',
  timeoutTier: 'heavy',
  baseTimeout: 60000,
  retryStrategy: 'exponential',
  decompose: [
    { step: 'cache-check', timeout: 5000 },
    { step: 'download', timeout: 45000 },
    { step: 'link', timeout: 10000 }
  ]
}

Required Metrics for Configuration

To properly configure timeouts, we need:

  1. Command duration histograms - distribution of actual execution times
  2. 95th/99th percentile durations - understand outliers vs typical cases
  3. Timeout frequency by command type - which commands fail most often
  4. Success rates across timeout thresholds - find optimal values

Implementation Phases

Phase 1: Collect metrics on current command durations
Phase 2: Implement timeout tiering for known command types
Phase 3: Add exponential backoff for retryable commands
Phase 4: Implement circuit breaker for system-wide resilience

Related Discussions

This issue emerged from chat discussion about shell command reliability and the need for more sophisticated timeout handling than a single global value.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions