
[plan][feature] Add Slurm (srun+podman) support to sandbox #492

@Entropy-xcy

Description

Problem Statement

Currently, the sandbox only supports local container runtimes (podman/docker), running containers directly on the machine where the command is invoked. In HPC environments managed by the Slurm workload manager, users instead need to run containers on compute nodes via srun.

Current Limitations:

  • new command only runs containers locally with podman/docker
  • attach command only uses podman exec for local containers
  • rm command only calls podman rm for local containers
  • reset command cancels/stops all sandbox containers

Requirements:

  1. Support running containers on Slurm compute nodes via srun and the podman-srun wrapper
  2. Add -s/--slurm flag to new subcommand to enable Slurm mode
  3. Use sattach to attach to tmux sessions in Slurm mode
  4. Use scancel to cancel Slurm jobs when removing sandboxes
  5. reset should NOT cancel all Slurm jobs (only clean up local resources)

Proposed Solution

Files to Modify

  • sandbox/run.py - Main sandbox manager

Implementation Details

1. New Subcommand with Slurm Support

Add -s/--slurm argument to new subcommand:

python sandbox/run.py new -n my-sandbox -s  # Run on Slurm
python sandbox/run.py new -n my-sandbox      # Run locally (default)

Changes to cmd_new() (see the sketch after this list):

  • Parse --slurm flag from args
  • When --slurm is enabled:
    • Use srun --no-container-remap with the podman-srun wrapper script
    • The podman-srun wrapper handles per-node storage paths automatically
    • Submit job to Slurm and capture job ID for tracking
    • Store job ID in database instead of container ID
  • When --slurm is disabled (default):
    • Keep existing local podman/docker behavior
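
A minimal sketch of the Slurm branch described above, assuming a hypothetical helper submit_slurm_sandbox() called from cmd_new(). The job name, the "sleep infinity" keep-alive command, and the squeue-based job ID lookup are all assumptions; any site-specific srun flags (such as the --no-container-remap mentioned above) would be added to the srun command line as needed.

import subprocess
import time

def submit_slurm_sandbox(name: str, image: str) -> str:
    """Launch a sandbox container on a compute node via srun + podman-srun
    and return the Slurm job ID (hypothetical helper; names are assumptions)."""
    job_name = f"sandbox-{name}"
    # srun stays attached to the job, so launch it in the background and keep
    # the container alive with a long-running command.
    subprocess.Popen(
        ["srun", "--job-name", job_name,
         "podman-srun", "run", "--rm", "--name", name, image, "sleep", "infinity"],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    # Recover the job ID once the job shows up in the queue.
    for _ in range(30):
        out = subprocess.run(
            ["squeue", "--name", job_name, "--noheader", "--format", "%i"],
            capture_output=True, text=True,
        ).stdout.strip()
        if out:
            return out.splitlines()[0]
        time.sleep(1)
    raise RuntimeError(f"Slurm job for sandbox {name!r} never appeared in squeue")

In cmd_new(), the --slurm branch would then store the returned job ID in slurm_job_id instead of a container ID; the local (default) branch is unchanged.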

2. Database Schema Update

Add slurm_job_id column to track Slurm jobs:

ALTER TABLE sandboxes ADD COLUMN slurm_job_id TEXT;
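
Because existing databases will not have this column, the migration can be applied idempotently at startup. A minimal sketch using sqlite3, assuming the table is named sandboxes and the database path is passed in:

import sqlite3

def ensure_slurm_column(db_path: str) -> None:
    """Add slurm_job_id to the sandboxes table if it is not there yet."""
    with sqlite3.connect(db_path) as conn:
        cols = {row[1] for row in conn.execute("PRAGMA table_info(sandboxes)")}
        if "slurm_job_id" not in cols:
            conn.execute("ALTER TABLE sandboxes ADD COLUMN slurm_job_id TEXT")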

3. Attach Command with Slurm Support

Changes to cmd_attach() (see the sketch after this list):

  • Check if sandbox has slurm_job_id
  • If Slurm mode:
    • Use sattach to connect to the tmux session
    • Format: sattach -m <job_id>.<step_id>:<session_name>
  • If local mode:
    • Keep existing podman exec behavior
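
A sketch of the branch in cmd_attach(), building the sattach target in the format given above. The record fields, the step ID of 0, the tmux session name, and the exact local podman exec command are assumptions and should mirror whatever the current implementation does; the sattach invocation (including the -m flag) is taken verbatim from this plan and may need adjusting for the site's Slurm version.

import os

def attach_sandbox(record: dict) -> None:
    """Attach to a sandbox: sattach for Slurm jobs, podman exec locally."""
    if record.get("slurm_job_id"):
        # Target format from this plan: <job_id>.<step_id>:<session_name>
        target = f"{record['slurm_job_id']}.0:{record['session_name']}"
        os.execvp("sattach", ["sattach", "-m", target])
    else:
        # Existing local behavior (exact command shown here is an assumption).
        os.execvp("podman", ["podman", "exec", "-it",
                             record["container_id"], "tmux", "attach-session"])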

4. Remove Command with Slurm Support

Changes to cmd_rm() (see the sketch after this list):

  • Check if sandbox has slurm_job_id
  • If Slurm mode:
    • Call scancel <job_id> to cancel the Slurm job
    • Clean up local worktree only (no container to remove)
  • If local mode:
    • Keep existing podman stop && podman rm behavior
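
A sketch of the branch in cmd_rm(). The field names are assumptions, and worktree cleanup is shown as a plain directory removal although the existing implementation may use git worktree remove instead.

import shutil
import subprocess

def remove_sandbox(record: dict) -> None:
    """Remove a sandbox: scancel the Slurm job or stop the local container,
    then clean up the worktree."""
    if record.get("slurm_job_id"):
        # Cancelling the job also terminates the container on the compute node.
        subprocess.run(["scancel", record["slurm_job_id"]], check=False)
    else:
        subprocess.run(["podman", "stop", record["container_id"]], check=False)
        subprocess.run(["podman", "rm", record["container_id"]], check=False)
    # Worktree cleanup is common to both modes.
    shutil.rmtree(record["worktree_path"], ignore_errors=True)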

5. Reset Command - No Slurm Cancellation

Changes to cmd_reset() (see the sketch after this list):

  • Only clean up local resources:
    • Remove work directories
    • Remove database file
    • Remove cached images
  • DO NOT cancel any Slurm jobs (user should manage Slurm jobs separately)
  • Print warning that Slurm jobs are not affected
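
A sketch of cmd_reset() after this change; the directory layout and database filename are assumptions. Note that no Slurm command is invoked at all.

import shutil
from pathlib import Path

def reset_sandboxes(base_dir: Path) -> None:
    """Wipe local sandbox state only; Slurm jobs are deliberately untouched."""
    for sub in ("work", "images"):
        shutil.rmtree(base_dir / sub, ignore_errors=True)
    (base_dir / "sandbox.db").unlink(missing_ok=True)
    print("note: Slurm jobs were not cancelled; manage them with squeue/scancel")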

6. List Command Enhancement

Changes to cmd_ls() (see the sketch after this list):

  • Show job ID for Slurm-mode sandboxes
  • Show "slurm" status indicator

Reference Implementation

The podman-srun wrapper script (./podman-srun) demonstrates how to run podman in Slurm environments:

  • Creates node-specific storage directories (~/.cache/podman/<hostname>/)
  • Configures storage.conf for overlay driver with fuse-overlayfs
  • Configures containers.conf with cgroupfs manager
  • Runs podman commands via srun on compute nodes

Example usage of podman-srun:

srun podman-srun run hello-world  # Run container on Slurm node

Test Strategy

Test Cases

  1. Slurm Mode - New Sandbox

    • Input: python run.py new -n test-sb -s
    • Verify: Slurm job is submitted, job ID is stored in database
    • Verify: Worktree is created
  2. Slurm Mode - Attach

    • Input: python run.py attach -n test-sb
    • Verify: sattach connects to tmux session
  3. Slurm Mode - Remove

    • Input: python run.py rm -n test-sb
    • Verify: Slurm job is cancelled via scancel
    • Verify: Worktree is removed
  4. Local Mode - Unchanged Behavior

    • Input: python run.py new -n local-sb
    • Verify: Container runs locally (existing behavior)
  5. Reset - Slurm Jobs Preserved (see the pytest sketch after this list)

    • Input: python run.py reset
    • Verify: Worktree and images removed
    • Verify: Slurm jobs are NOT cancelled
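
Test case 5 is the easiest to automate without a cluster, since it only has to prove a negative. A hedged pytest sketch, assuming the reset helper sketched earlier and a hypothetical import path:

import subprocess

from sandbox.run import reset_sandboxes  # hypothetical import path

def test_reset_does_not_cancel_slurm_jobs(monkeypatch, tmp_path):
    """reset must clean local state without ever invoking scancel."""
    calls = []

    def fake_run(cmd, *args, **kwargs):
        calls.append(cmd)
        return subprocess.CompletedProcess(cmd, 0, stdout="", stderr="")

    monkeypatch.setattr(subprocess, "run", fake_run)

    reset_sandboxes(base_dir=tmp_path)

    assert not any(cmd and cmd[0] == "scancel" for cmd in calls)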

Prerequisites for Testing

  • Access to Slurm cluster
  • srun, sattach, scancel commands available
  • podman-srun script available on PATH or referenced by an explicit path

Labels

agentize:plan (Plan created by /ultra-planner command), enhancement (New feature or request)