feat(daemon): add abrupt-exit diagnostics and stale-socket startup recovery #405

Open
proboscis wants to merge 1 commit into main from issue/orch-428/run-20260209-225618

@proboscis

Summary

  • Add shutdown reason tracking to daemon metadata for incident triage
  • Implement startup recovery for stale PID, socket, and lock files
  • Detect and clearly log abrupt exits vs graceful shutdowns

Changes

Shutdown Reason Tracking

  • Added ShutdownReason and ShutdownAt fields to DaemonMetadata
  • Constants for shutdown reasons: graceful, restart, stopped
  • Shutdown reason recorded before daemon exit (signal-based or API/command)
  • Empty shutdown reason indicates abrupt exit (crash, OOM, etc.)
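
A minimal sketch of how the shutdown-reason fields might look, for readers who have not opened the diff. Only the names DaemonMetadata, ShutdownReason, ShutdownAt and the three reason values come from this PR; the PID and StartedAt fields, the constant names, and the RecordShutdown and ExitedAbruptly helpers are assumptions made for illustration.

```go
package main

import "time"

// ShutdownReason records why a daemon exited.
// The constant names below are hypothetical; the PR only names the values.
type ShutdownReason string

const (
	ShutdownReasonGraceful ShutdownReason = "graceful"
	ShutdownReasonRestart  ShutdownReason = "restart"
	ShutdownReasonStopped  ShutdownReason = "stopped"
)

// DaemonMetadata is a reduced, assumed shape of the real struct.
// PID and StartedAt are included because the log samples print them.
type DaemonMetadata struct {
	PID            int
	StartedAt      time.Time
	ShutdownReason ShutdownReason
	ShutdownAt     time.Time
}

// RecordShutdown stamps the reason and time; it is meant to be called
// before the process exits (from a signal handler or a stop command).
func (m *DaemonMetadata) RecordShutdown(reason ShutdownReason, at time.Time) {
	m.ShutdownReason = reason
	m.ShutdownAt = at
}

// ExitedAbruptly reports whether no shutdown reason was ever recorded,
// which this PR treats as a crash, OOM kill, or similar abrupt exit.
func (m DaemonMetadata) ExitedAbruptly() bool {
	return m.ShutdownReason == ""
}
```

The design relies on the zero value: a daemon that never reaches its shutdown path leaves the field empty, so no extra "crashed" marker has to be written.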

Startup Recovery

  • CheckAndRecoverStaleArtifacts() function detects and cleans up:
    • Stale PID file (process not running)
    • Stale socket file (no daemon responding)
    • Stale lock file (lock available)
  • Safe recovery: returns error if active daemon detected (protects healthy daemons)
  • Analyzes previous daemon's shutdown reason for incident triage

Lifecycle Logging

Lifecycle events now use a LIFECYCLE: prefix and startup recovery events a STARTUP RECOVERY: prefix, so both are easy to grep:

LIFECYCLE: global daemon started (pid=12345, binary=/usr/bin/orch)
LIFECYCLE: received signal SIGTERM, initiating graceful shutdown
LIFECYCLE: graceful shutdown completed
STARTUP RECOVERY: found stale PID file (pid=12344, process not running) - removing
STARTUP RECOVERY: previous daemon (pid=12344, started=...) exited ABRUPTLY without graceful shutdown
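
For triage tooling, the log lines above could be classified with a small helper like the one below. It is a sketch that matches only the message text quoted in this description; the exitKind type and the classifyExit function are hypothetical and not part of the PR.

```go
package main

import "strings"

// exitKind is a hypothetical label for how the previous daemon ended.
type exitKind string

const (
	exitGraceful exitKind = "graceful"
	exitAbrupt   exitKind = "abrupt"
	exitUnknown  exitKind = "unknown"
)

// classifyExit scans log lines in order and returns the kind indicated
// by the last matching line, or exitUnknown if no line matches.
func classifyExit(lines []string) exitKind {
	kind := exitUnknown
	for _, line := range lines {
		switch {
		case strings.Contains(line, "LIFECYCLE: graceful shutdown completed"):
			kind = exitGraceful
		case strings.Contains(line, "STARTUP RECOVERY:") &&
			strings.Contains(line, "exited ABRUPTLY"):
			kind = exitAbrupt
		}
	}
	return kind
}
```

Because the last match wins, feeding the function a log that spans several daemon runs reports the most recent outcome.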

Acceptance Criteria Verification

  • Startup logs clearly indicate when stale runtime state was detected and repaired

    • STARTUP RECOVERY: prefixed log messages show exactly what was found and cleaned up
  • Recovery does not kill/override a healthy active daemon

    • CheckAndRecoverStaleArtifacts returns error if PID file refers to running process
    • Socket connectivity test before cleanup
  • Incident triage can distinguish graceful stop from abrupt disappearance using logs

    • Graceful: LIFECYCLE: graceful shutdown completed
    • Abrupt: STARTUP RECOVERY: ... exited ABRUPTLY without graceful shutdown
  • Added test coverage for stale artifact recovery path

    • 7 new tests covering all scenarios:
      • TestShutdownReasonTracking
      • TestCheckAndRecoverStaleArtifacts_NoArtifacts
      • TestCheckAndRecoverStaleArtifacts_StalePID
      • TestCheckAndRecoverStaleArtifacts_StaleSocket
      • TestCheckAndRecoverStaleArtifacts_AbruptExit
      • TestCheckAndRecoverStaleArtifacts_ActiveDaemon
      • TestCheckAndRecoverStaleArtifacts_GracefulShutdownLogged

Test Results

=== RUN   TestShutdownReasonTracking
--- PASS: TestShutdownReasonTracking (0.00s)
=== RUN   TestCheckAndRecoverStaleArtifacts_NoArtifacts
--- PASS: TestCheckAndRecoverStaleArtifacts_NoArtifacts (0.00s)
=== RUN   TestCheckAndRecoverStaleArtifacts_StalePID
--- PASS: TestCheckAndRecoverStaleArtifacts_StalePID (0.00s)
=== RUN   TestCheckAndRecoverStaleArtifacts_StaleSocket
--- PASS: TestCheckAndRecoverStaleArtifacts_StaleSocket (0.00s)
=== RUN   TestCheckAndRecoverStaleArtifacts_AbruptExit
--- PASS: TestCheckAndRecoverStaleArtifacts_AbruptExit (0.00s)
=== RUN   TestCheckAndRecoverStaleArtifacts_ActiveDaemon
--- PASS: TestCheckAndRecoverStaleArtifacts_ActiveDaemon (0.00s)
=== RUN   TestCheckAndRecoverStaleArtifacts_GracefulShutdownLogged
--- PASS: TestCheckAndRecoverStaleArtifacts_GracefulShutdownLogged (0.00s)
PASS

Fixes: orch-428

…covery

- Add ShutdownReason tracking to DaemonMetadata (graceful/restart/stopped)
- Record shutdown reason before daemon exits (signal-based or API/command)
- Add startup stale artifact detection for PID, socket, and lock files
- Detect and log abrupt exits (no shutdown reason recorded)
- Protect against killing/overriding healthy active daemon
- Clear stale artifacts safely during startup recovery
- Add LIFECYCLE log prefix for incident triage visibility
- Add comprehensive test coverage for all recovery scenarios

Fixes: orch-428