Skip to content

feat(runner): add kill command to terminate running sandboxes#3153

Merged
seven332 merged 1 commit intomainfrom
feat/runner-kill
Feb 19, 2026
Merged

feat(runner): add kill command to terminate running sandboxes#3153
seven332 merged 1 commit intomainfrom
feat/runner-kill

Conversation

@seven332
Copy link
Contributor

Summary

  • Add runner kill <run-id> [--force] command to terminate running Firecracker sandboxes
  • Discovers processes via /proc scan (no --config required, like doctor)
  • Supports full UUID or unique prefix matching for run_id
  • Conditional cleanup based on orphan status:
    • Parent runner alive: just SIGKILL the process group, runner handles all cleanup automatically
    • Orphan (parent dead): SIGKILL + remove workspace dir + socket dir
  • Interactive confirmation prompt (skip with --force)

Closes #3152

Test plan

  • cargo clippy -p runner passes
  • cargo test -p runner — 147 tests pass (9 new)
  • Deploy to dev-2 and test with live sandboxes

🤖 Generated with Claude Code

@seven332
Copy link
Contributor Author

Code Review: PR #3153

Summary

This PR adds a runner kill <run-id> [--force] command to terminate running Firecracker sandboxes. The design follows a clear phased approach: discover processes via /proc scan, resolve the target by full UUID or prefix match, confirm interactively (or skip with --force), kill the process, and conditionally clean up based on orphan status.

4 commits reviewed across 3 files (1 new, 2 modified).

Key Findings

Critical Issues (P0)

None.

High Priority (P1)

None.

Medium Priority (P2)

  1. kill_process() discards the kill error reason (crates/runner/src/cmd/kill.rs lines 131-138):
    The nix::sys::signal::kill().is_ok() pattern loses the errno. Operators cannot distinguish "EPERM: permission denied" (suggesting they need sudo) from "ESRCH: no such process" (process already exited). Consider using a match to log the error via tracing::warn before returning false:
    fn kill_process(pid: u32) -> bool {
        let Ok(pid_i32) = i32::try_from(pid) else {
            return false;
        };
        match nix::sys::signal::kill(
            nix::unistd::Pid::from_raw(pid_i32),
            nix::sys::signal::Signal::SIGKILL,
        ) {
            Ok(()) => true,
            Err(e) => {
                tracing::warn!(pid, %e, "failed to send SIGKILL");
                false
            }
        }
    }

Low Priority Observations

  • cleanup_orphan silently succeeds on missing dirs: The remove_dir_all().is_ok() pattern means a failed cleanup shows [FAIL] in output but no error detail is logged. Acceptable for a CLI tool but could be improved with tracing::debug for diagnostics.

Bad Smell Analysis

All 17 bad smell categories checked -- zero violations found:

Category Count
Mock violations 0
Test coverage issues 0
Defensive programming 0
Type safety issues 0
Hardcoded values 0
Lint suppressions 0
Bad test patterns 0

Positive Highlights

  1. Excellent code reuse -- shares process::discover_all() and process::is_orphan() with the doctor command
  2. Smart conditional cleanup -- when the parent runner is alive, it handles cleanup via crash detection; manual cleanup only for orphans
  3. Robust orphan detection -- ppid chain walk handles the runner -> sudo -> ip netns exec -> sudo -u -> firecracker spawn chain
  4. Correctness fix in last commit -- switching from killpg to kill because the scanned PID may not be a PGID
  5. Well-structured commit history -- each commit is focused and well-documented
  6. Test quality -- 7 pure function tests (resolve_target + kill_process), no mocks, no I/O

Recommendations

  • P2: Log error reason from nix::sys::signal::kill in kill_process for operator debuggability
  • Optional: Log error details in cleanup_orphan for diagnostics
  • Deploy to dev-2 and test with live sandboxes (as noted in PR test plan)

Verdict

LGTM -- Clean, well-designed PR with no critical or high priority issues. The medium priority suggestion about logging kill errors is a nice-to-have for operator experience but not a blocker. Ready to merge once live testing is done.


Full review details: codereviews/20260219/

@seven332
Copy link
Contributor Author

Code Review: PR #3153

Reviewer: Claude Code (automated)

Summary

This PR adds a runner kill <run-id> command that terminates running Firecracker sandboxes by scanning /proc for processes, resolving the target by full UUID or prefix match, and performing conditional cleanup based on whether the process is an orphan. The implementation is clean, well-structured, and follows existing patterns from the doctor command.

Key Findings

No Critical Issues (P0)

No blocking issues found.

Observations and Suggestions (P1/P2)

  1. Comment vs. code mismatch in Phase 4 (kill.rs:55)
    The comment says // Phase 4: Kill process group but the function was renamed from kill_process_group to kill_process in commit 6fef031 because it uses nix::sys::signal::kill (single process) not killpg (process group). The comment should be updated to match:

    // Phase 4: Kill process

    Severity: Low (documentation accuracy)

  2. Orphan cleanup does not cover netns resources
    The module doc says the runner handles cleanup of "proxy, netns, workspace, status" when the parent is alive. For orphans, cleanup_orphan only removes the workspace dir and socket dir. If the parent runner is truly dead, the network namespace allocated for that sandbox would also be leaked. The doctor command already detects orphan namespaces (detect_orphan_namespaces), so this is a known gap — but worth noting for a future iteration of the kill command. (The gc command also doesn't clean netns, so this is consistent.)
    Severity: Low (known gap, documented by design — parent cleanup handles it, and doctor can detect the leak)

  3. cleanup_orphan silently succeeds when directories don't exist
    remove_dir_all(...).await.is_ok() will report false (FAIL) for both "directory doesn't exist" and "permission denied". This means the output might show [FAIL] Workspace: ... even when the workspace was never created or was already cleaned up. Consider distinguishing NotFound from actual errors:

    let success = match tokio::fs::remove_dir_all(&workspace).await {
        Ok(()) => true,
        Err(e) if e.kind() == std::io::ErrorKind::NotFound => true,
        Err(_) => false,
    };

    Severity: Low (UX polish)

  4. confirm() uses .unwrap_or(false) on JoinHandle
    The spawn_blocking result is unwrapped with .unwrap_or(false). If the blocking task panics, this silently treats it as "user said no". This is fine for a CLI confirmation prompt, but worth noting the behavior is intentional.
    Severity: Informational only

Positive Observations

  • Good code evolution across commits: The progression from killpg to kill, from inline is_reparented to shared process::is_orphan, and from hardcoded /run/vm0/sock to RuntimePaths::new().sock_dir() shows thoughtful iteration.
  • Proper reuse of existing infrastructure: Uses process::discover_all() and process::is_orphan() shared with doctor, avoiding duplication.
  • Correct orphan detection: Walking the ppid chain (not just immediate parent) accounts for the runner -> sudo -> ip netns exec -> sudo -u -> firecracker spawn chain.
  • Return ExitCode::FAILURE when kill fails: The second commit properly fixed the original always-success return.
  • Good test coverage: 6 tests covering resolve_target (full match, prefix, ambiguous, no match, empty input) and kill_process (nonexistent PID). Pure function testing — no mocking needed.
  • Interactive confirmation with --force bypass is a good CLI UX pattern.

Verdict

LGTM -- The code is well-structured, follows existing patterns, and the iteration across commits addressed the important issues. The remaining suggestions (comment typo, cleanup UX) are all low-severity polish items that could be addressed in follow-up.

@seven332
Copy link
Contributor Author

Code Review: PR #3153 — feat(runner): add kill command to terminate running sandboxes

Summary

Adds runner kill <run-id> [--force] — a CLI command that discovers running Firecracker sandboxes via /proc scanning, resolves the target by full UUID or unique prefix, and terminates it. The implementation uses conditional cleanup: if the parent runner daemon is alive, only the process is killed (the runner handles cleanup via its crash detection path); for orphan processes, the kill command also removes workspace and socket directories.

The initial commit had several issues (hardcoded paths, naive orphan detection, killpg on non-PGID, silent failures), but all were addressed in follow-up commits within the PR. The final state of the code is clean and well-designed.

Verdict: LGTM

No critical or high-priority issues remain. All identified problems were resolved within the PR itself. The code is well-structured, appropriately simple, and follows project conventions.


Positive Highlights

  • Excellent reuse of existing infrastructureprocess::discover_all() and process::is_orphan() are shared with the doctor command, no duplication.
  • Smart conditional cleanup — Only performs manual cleanup for orphans; lets the runner daemon handle cleanup for managed processes. This avoids double-cleanup or resource leaks.
  • Clean phase-based architecture — discover -> resolve -> confirm -> kill -> cleanup. Each phase is a separate function with clear responsibility.
  • Good fail-fast design — Empty run_id guard, non-zero exit code on failure, proper error propagation via RunnerResult.
  • Proper observabilitytracing::warn! with structured fields (pid, errno) on kill failure; tracing::info! on completion.
  • Strong YAGNI adherence — No unnecessary abstractions, config files, or plugin systems. Does exactly what's needed.

Issues Resolved Within PR

These were found in the initial commit and fixed in subsequent commits — listed for completeness:

Issue Severity Resolved In
Hardcoded socket path /run/vm0/sock P1 fd56e0c (uses RuntimePaths)
Naive ppid-only orphan detection P1 fd56e0c (uses process::is_orphan chain walk)
Silent kill failure (no error logging) P1 f7e3f1d (logs nix errno)
killpg on non-PGID P1 6fef031 (switched to kill)
Always-success exit code on kill failure P1 fd56e0c
Empty run_id matches all sandboxes P1 21ee759

Remaining Observations (Non-blocking)

  1. P2 — Commit hygiene: The PR contains 6 commits where the initial commit has several issues fixed by the following 5. Consider squashing before merge for a cleaner history. At minimum, squash the fix-up commits into the feature commit.

  2. P2 — cleanup_orphan lacks test coverage: The function involves filesystem operations (remove_dir_all), so this is understandable. Could be covered with a temp-dir test in a follow-up.

  3. P2 — confirm() silently swallows spawn_blocking panics via .unwrap_or(false). This is arguably correct behavior (fail safe = don't kill), but a tracing::warn! in the Err case would improve debuggability.


Test Coverage

  • 6 tests covering resolve_target (4 cases: full UUID, prefix match, ambiguous, no match), kill_process (nonexistent PID), and empty input guard.
  • Tests exercise pure functions with real data, no mocking needed.
  • Missing coverage: cleanup_orphan (filesystem), run_kill integration (requires /proc) — acceptable for follow-up.

Bad Smell Analysis

Category Count
Mock violations 0
Defensive programming 0
Type safety issues 0
Hardcoded values (remaining) 0
Lint suppressions 0
Fallback patterns 0
Timer/delay issues 0

Full review details: codereviews/20260219/

@seven332
Copy link
Contributor Author

Code Review: PR #3153 — feat(runner): add kill command to terminate running sandboxes

Summary

This PR adds a runner kill <run-id> [--force] command that terminates running Firecracker sandboxes. It reuses the existing /proc scanning infrastructure from the process module, supports UUID prefix matching, and conditionally performs cleanup based on whether the sandbox is orphaned or still managed by a live runner daemon. Overall structure is clean and well-organized.

Key Findings

P0 — Critical: kill sends signal to single PID, not process group

File: crates/runner/src/cmd/kill.rs line 132-146

The kill_process() function uses nix::sys::signal::kill() (single PID), but the issue requirement (#3152) and the module doc comment (line 4) both explicitly call for killpg(SIGKILL) on the process group.

Firecracker is spawned with .process_group(0) (see crates/sandbox-fc/src/sandbox.rs:196), so its PID equals its PGID. The existing sandbox-fc crate uses killpg for this exact reason — to kill the entire process tree including ip netns exec, sudo, and firecracker itself (see crates/sandbox-fc/src/process.rs:8).

Using kill (single PID) instead of killpg means:

  • Intermediate processes in the spawn chain (sudo, ip netns exec) may survive
  • The cleanup won't be equivalent to what the runner's own Drop impl does
  • Zombie child processes may linger

Fix: Replace nix::sys::signal::kill() with nix::sys::signal::killpg():

fn kill_process(pid: u32) -> bool {
    let Ok(pid_i32) = i32::try_from(pid) else {
        return false;
    };
    match nix::sys::signal::killpg(
        nix::unistd::Pid::from_raw(pid_i32),
        nix::sys::signal::Signal::SIGKILL,
    ) {
        Ok(()) => true,
        Err(e) => {
            tracing::warn!(pid, error = %e, "failed to kill process group");
            false
        }
    }
}

Also consider renaming the function to kill_process_group for clarity, consistent with sandbox-fc/src/process.rs.

P1 — Orphan cleanup is incomplete compared to issue spec

File: crates/runner/src/cmd/kill.rs line 152-169

The cleanup_orphan function removes:

  1. Workspace dir: {base_dir}/workspaces/{run_id}/
  2. Socket dir: /run/vm0/sock/{run_id}/

But the issue design also mentions overlay directories. The FactoryPaths in sandbox-fc/src/paths.rs shows overlays are stored at {base_dir}/overlays/. If overlay cleanup is intentionally deferred, the PR description should note this. However, the workspace directory may contain the overlay (as overlay.ext4 per SandboxPaths), in which case removing the workspace dir handles it. Worth a clarifying comment either way.

P1 — Module doc is misleading

File: crates/runner/src/cmd/kill.rs lines 1-8

The doc says "killing the Firecracker process group" but the code kills a single process. If the P0 fix is applied, the doc is accurate. If not, the doc is misleading.

Observations (Non-blocking)

Process race between discovery and kill

There is a TOCTOU window between discover_all() and the actual kill_process() call — the firecracker process could exit between discovery and kill. The code handles this gracefully (ESRCH from kill returns false, user sees "process may have already exited"). This is acceptable for a manual CLI command.

map_or allocates a String unnecessarily

Lines 77 and 191 use target.ppid.map_or("unknown".into(), |p| p.to_string()) which always allocates. A minor nit — could use map_or_else to avoid the allocation on the Some path, but this is a CLI command that runs once, so performance is irrelevant.

Tests are appropriate

The 6 tests cover the pure resolve_target logic well (full UUID, prefix, ambiguous, no match, empty input) plus a smoke test for kill_process with a nonexistent PID. These are proper unit tests for pure functions, which aligns with the testing guidelines. No /proc mocking needed since the pure parsers are already well-tested in process.rs.

Verdict

Changes Requested — The P0 issue (kill vs killpg) is a functional bug that could leave zombie processes. The fix is a one-line change (kill -> killpg). After that fix, this looks good to merge.


Reviewed by Claude Code

@seven332
Copy link
Contributor Author

Code Review: PR #3153

feat(runner): add kill command to terminate running sandboxes

Summary

This is a well-designed, clean addition of a runner kill <run-id> [--force] CLI command. The implementation follows project conventions closely, reuses existing infrastructure (process::discover_all(), process::is_orphan()), and applies YAGNI throughout. The squashed commit resolves all issues from the earlier iteration. 4 files changed, +342 lines.

Key Findings

Critical Issues (P0)

None.

High Priority (P1)

  1. Silent error swallowing in remove_dir_if_exists (kill.rs:176-180) — Non-NotFound errors are caught and return false without logging the actual error. Operators will see [FAIL] but cannot diagnose the cause.

    Suggested fix:

    Err(e) => {
        tracing::warn!(path = %path.display(), error = %e, "failed to remove directory");
        false
    }
  2. JoinError handling in confirm() (kill.rs:209-212) — spawn_blocking panic is caught with tracing::warn, but for a destructive operation, tracing::error is more appropriate since a panic here is unexpected.

Medium Priority (P2)

  1. killpg fallback to single-PID kill (kill.rs:135-152) — When killpg fails, falls back to killing only the single PID. This fallback pattern may leave intermediate processes (sudo, ip netns exec) alive. Worth evaluating whether failing fast with a clear error message is safer.

  2. Fragile test assumption (kill.rs:280-283) — Test assumes PID 999999 does not exist. Acceptable but a brief comment documenting this assumption would improve clarity.

Bad Smell Analysis

Category Count
Mock violations 0
Test coverage issues 0
Defensive programming 2 (P1-1, P1-2)
Fallback patterns 1 (P2-1)
Type safety issues 0
Lint suppressions 0
Hardcoded URLs 0

Positive Highlights

  • Excellent code reuse with the existing process module
  • Clean phase-based architecture: discover -> resolve -> confirm -> kill -> cleanup
  • Smart conditional cleanup: skips cleanup when parent runner is alive (it handles cleanup via crash detection)
  • Thorough resolve_target tests covering exact match, prefix match, ambiguous prefix, no match, and empty input
  • Proper /proc/{pid}/stat parsing handling parenthesized comm fields
  • No lint suppressions, no type safety issues, no defensive programming in the hot path
  • Consistent with crate conventions (error types, CLI structure, module organization)

Recommendations

  1. Must fix: Log the actual error in remove_dir_if_exists so cleanup failures are diagnosable
  2. Must fix: Elevate JoinError logging to tracing::error in confirm()
  3. Consider: Re-evaluate killpg -> single-kill fallback — failing fast may be safer than partial cleanup
  4. Consider: Add a unit test for read_pgid parsing with synthetic /proc/{pid}/stat content

Verdict

Changes Requested — Two P1 issues (silent error swallowing, inadequate panic logging) should be addressed before merge. Both are small fixes. The overall design and implementation quality is high.


Full review details: codereviews/20260219/

@seven332
Copy link
Contributor Author

Code Review: PR #3153

Summary

This PR adds a runner kill <run-id> [--force] CLI command to terminate running Firecracker sandboxes. The implementation is clean, well-structured, and follows all project conventions. It reuses existing infrastructure (process::discover_all(), process::is_orphan()) and introduces a smart conditional cleanup model that avoids double-cleanup conflicts with the parent runner.

Files changed: 4 files, +345 lines
Tests: 7 new tests (6 sync + 1 async)

Key Findings

Critical Issues (P0)

None.

High Priority (P1)

None. Both P1 issues from the previous review iteration (951c707) have been fixed:

  • remove_dir_if_exists now logs errors via tracing::warn! (was silently swallowing)
  • confirm() now uses tracing::error! for JoinError (was tracing::warn!)

Medium Priority (P2) — Non-blocking

ID Location Description
P2-1 kill.rs:150-163 killpg falls back to single-PID kill — could leave intermediate processes (sudo, ip netns exec) alive. Acceptable design choice but could be documented more explicitly in the fallback warning message.
P2-2 kill.rs:310-312 Test assumes PID 999999 doesn't exist. Fragile but acceptable for a /proc-dependent test.
P2-3 process.rs:129-142 No unit test for read_pgid parsing — the rsplit_once(')') technique is correct but edge cases (comm with parens) could benefit from a synthetic test. Per YAGNI, acceptable to defer.

Bad Smell Analysis

Category Count
Mock violations 0
Test coverage issues 0
Defensive programming 0
Type safety issues 0
Fallback patterns 1 (P2-1, acceptable)
Lint suppressions 0
Hardcoded URLs 0
Timer/delay issues 0

Positive Aspects

  • Excellent YAGNI compliance — does exactly what's needed, nothing more
  • Strong code reuse with the existing process module
  • Clean five-phase architecture: discover, resolve, confirm, kill, cleanup
  • Smart conditional cleanup based on orphan status avoids race conditions
  • Thorough resolve_target test coverage including all edge cases (exact match, prefix, ambiguous, no match, empty input)
  • Zero lint suppressions, zero type safety issues
  • Consistent with existing crate conventions throughout
  • PGID-based process group kill correctly handles the complex spawn chain (sudo -> ip netns exec -> firecracker)

Verdict

LGTM — Ready to merge. No critical or high-priority issues remain. The three medium-priority items are optional improvements that can be addressed in follow-up work. The live sandbox testing on dev-2 (per the PR test plan) should be completed before merge.


Full review details: codereviews/20260219/

Add `runner kill <run-id> [--force]` that discovers running Firecracker
sandboxes via /proc scanning, resolves the target by full UUID or unique
prefix, and terminates it with SIGKILL via killpg on the process group.

Cleanup is conditional: if the parent runner daemon is alive, it handles
cleanup via its crash detection path. For orphan processes (parent runner
dead), the kill command also removes workspace and socket directories.

Key design decisions:
- Reads PGID from /proc/{pid}/stat and uses killpg to kill the entire
  process group (sudo, ip netns exec, firecracker) in one signal
- Orphan detection walks the ppid chain (up to 16 levels) to account
  for the runner -> sudo -> ip netns exec -> sudo -u -> firecracker
  spawn chain
- Reuses process::discover_all() and process::is_orphan() shared with
  the doctor command

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@seven332
Copy link
Contributor Author

Code Review: PR #3153

Summary

Clean, well-structured implementation of a runner kill <run-id> [--force] command. The code follows all project conventions, has no bad smell violations, and demonstrates good architectural decisions. No critical or high-priority issues found.

Files changed: 4 (371 insertions)

  • crates/runner/src/cmd/kill.rs — New file: kill command implementation (310 lines)
  • crates/runner/src/cmd/mod.rs — Module declaration + re-export
  • crates/runner/src/main.rs — CLI enum variant + dispatch
  • crates/runner/src/process.rs — PGID parsing (shared utility)

Key Findings

Critical Issues (P0)

None.

High Priority (P1)

None.

Medium Priority

None actionable. An initial concern about overlay directory cleanup during orphan cleanup was investigated and found to be a non-issue: SandboxPaths::overlay() returns workspace.join("overlay.ext4"), so the overlay file lives inside the workspace directory and is cleaned up when the workspace is removed.

Notable Design Decisions

  1. Conditional cleanup strategy (kill.rs, lines 66-79): When the parent runner is alive, only SIGKILL is sent — the runner's own crash detection (monitor_process -> crash_notify) handles all cleanup. Manual cleanup only happens for orphan processes. This avoids race conditions and duplicate cleanup.

  2. PGID-based process group kill (kill.rs, lines 137-161): Reads the actual PGID from /proc/{pid}/stat rather than assuming PID == PGID. Uses killpg to terminate the entire spawn chain (sudo -> ip netns exec -> sudo -u -> firecracker) in one signal.

  3. Robust /proc/pid/stat parsing (process.rs, lines 129-141): Uses rsplit_once(')') to handle comm fields containing spaces and parentheses — a well-known edge case in proc stat parsing.

  4. Code reuse with the doctor command: Shares process::discover_all() and process::is_orphan().

Bad Smell Analysis

All clear across 17 categories:

Category Status
Mock violations 0
Test coverage issues 0
Defensive programming 0
Type safety issues 0
Lint/type suppressions 0
Hardcoded URLs 0
Fallback patterns 0
Error handling issues 0
Timer/delay issues 0

Test Coverage

11 new tests added:

  • kill.rs (6 tests): resolve_target (full UUID, prefix, ambiguous, no match, empty input) + kill_process_group (nonexistent PID)
  • process.rs (5 tests): parse_pgid_from_stat (simple case, spaces in comm, parens in comm, empty, truncated)
  • All pure function tests with no mocking — consistent with project testing guidelines

Remaining Action Item

  • Deploy to dev-2 and test with live sandboxes (as noted in PR test plan)

Verdict

LGTM — This is a clean, focused implementation with good design decisions, proper error handling, and solid test coverage. Ready to merge after live testing on dev-2.


Full review details: codereviews/20260219/
Review by Claude Code (code-quality)

@seven332 seven332 added this pull request to the merge queue Feb 19, 2026
Merged via the queue into main with commit 26d4e7d Feb 19, 2026
29 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(runner): implement kill command to terminate running sandboxes

1 participant

Comments