-
Notifications
You must be signed in to change notification settings - Fork 0
Open
Labels
enhancementNew feature or requestNew feature or request
Description
Problem Statement
The current 30-second shell timeout is causing failures for legitimate long-running commands (e.g., npm install, large file operations). We need architectural patterns to handle varying command durations without arbitrary timeout increases.
Proposed Mitigation Strategies
1. Exponential Backoff with Jitter
- Start with short timeout, increase exponentially on retry
- Add random jitter to prevent thundering herd
- Example: 5s → 10s → 20s → 40s with ±20% jitter
2. Command Decomposition
- Break long operations into smaller checkpointed steps
- Example:
npm install→ check cache → download packages → link dependencies - Each step has appropriate timeout for its complexity
3. Timeout Tiering by Command Type
- Fast commands (ls, pwd): 5s
- Medium commands (git status, file reads): 15s
- Heavy commands (npm install, builds): 60s+
- Commands declare their tier via metadata
4. Circuit Breaker Pattern
- Track command failure rates
- Open circuit after N consecutive timeouts
- Prevent cascading failures
- Half-open state for gradual recovery
Concrete Example: npm install
Current behavior: Hits 30s timeout on large dependency trees
Proposed solution:
{
command: 'npm install',
timeoutTier: 'heavy',
baseTimeout: 60000,
retryStrategy: 'exponential',
decompose: [
{ step: 'cache-check', timeout: 5000 },
{ step: 'download', timeout: 45000 },
{ step: 'link', timeout: 10000 }
]
}Required Metrics for Configuration
To properly configure timeouts, we need:
- Command duration histograms - distribution of actual execution times
- 95th/99th percentile durations - understand outliers vs typical cases
- Timeout frequency by command type - which commands fail most often
- Success rates across timeout thresholds - find optimal values
Implementation Phases
Phase 1: Collect metrics on current command durations
Phase 2: Implement timeout tiering for known command types
Phase 3: Add exponential backoff for retryable commands
Phase 4: Implement circuit breaker for system-wide resilience
Related Discussions
This issue emerged from chat discussion about shell command reliability and the need for more sophisticated timeout handling than a single global value.
Metadata
Metadata
Assignees
Labels
enhancementNew feature or requestNew feature or request