Problem Statement
Currently, the sandbox only supports local container runtimes (podman/docker) for running containers directly on the local machine. For HPC environments using Slurm workload manager, users need to run containers on compute nodes via srun instead of locally.
Current Limitations:
- `new` command only runs containers locally with podman/docker
- `attach` command only uses `podman exec` for local containers
- `rm` command only calls `podman rm` for local containers
- `reset` command cancels/stops all sandbox containers
Requirements:
- Support running containers on Slurm compute nodes via `srun podman-srun`
- Add a `-s/--slurm` flag to the `new` subcommand to enable Slurm mode
- Use `sattach` to attach to tmux sessions in Slurm mode
- Use `scancel` to cancel Slurm jobs when removing sandboxes
- `reset` should NOT cancel all Slurm jobs (only clean up local resources)
Proposed Solution
Files to Modify
- `sandbox/run.py` - Main sandbox manager
Implementation Details
1. New Subcommand with Slurm Support
Add a `-s/--slurm` argument to the `new` subcommand:

python sandbox/run.py new -n my-sandbox -s   # Run on Slurm
python sandbox/run.py new -n my-sandbox      # Run locally (default)

Changes to cmd_new():
- Parse the `--slurm` flag from args
- When `--slurm` is enabled (see the sketch after this list):
  - Use `srun --no-container-remap` with the `podman-srun` wrapper script
  - The podman-srun wrapper handles per-node storage paths automatically
  - Submit the job to Slurm and capture the job ID for tracking
  - Store the job ID in the database instead of a container ID
- When `--slurm` is disabled (default):
  - Keep the existing local podman/docker behavior
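A minimal sketch of the Slurm branch, assuming run.py shells out with subprocess; the job-name convention, the job-ID lookup via squeue, and the idea that the image's entrypoint starts the sandbox tmux session are assumptions for illustration, not existing run.py behavior:

```python
import subprocess
import time

def launch_slurm_sandbox(name: str, image: str) -> str:
    """Start the sandbox container on a compute node and return its Slurm job ID."""
    job_name = f"sandbox-{name}"  # assumed naming convention
    # Run the container through the podman-srun wrapper; srun blocks for the
    # lifetime of the job, so it is left running in the background.
    proc = subprocess.Popen(
        ["srun", "--no-container-remap", f"--job-name={job_name}",
         "podman-srun", "run", "--name", name, image],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    # Look up the job ID by job name; the job can take a moment to appear in squeue.
    for _ in range(30):
        out = subprocess.run(
            ["squeue", "-h", f"--name={job_name}", "-o", "%i"],
            capture_output=True, text=True,
        ).stdout.strip()
        if out:
            return out.splitlines()[0]
        time.sleep(1)
    proc.kill()
    raise RuntimeError(f"could not determine Slurm job ID for {job_name}")
```

cmd_new() would then write the returned job ID into the new slurm_job_id column instead of a container ID.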
2. Database Schema Update
Add a slurm_job_id column to track Slurm jobs (a migration sketch follows the statement below):

ALTER TABLE sandboxes ADD COLUMN slurm_job_id TEXT;
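Existing databases will not have this column, so an idempotent migration can be applied at startup; the sketch below assumes the database is SQLite (the file-based database that the reset command removes):

```python
import sqlite3

def ensure_slurm_column(db_path: str) -> None:
    """Add the slurm_job_id column if it is missing (safe to call repeatedly)."""
    conn = sqlite3.connect(db_path)
    try:
        cols = [row[1] for row in conn.execute("PRAGMA table_info(sandboxes)")]
        if "slurm_job_id" not in cols:
            conn.execute("ALTER TABLE sandboxes ADD COLUMN slurm_job_id TEXT")
            conn.commit()
    finally:
        conn.close()
```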
3. Attach Command with Slurm Support
Changes to cmd_attach():
- Check if the sandbox has a `slurm_job_id`
- If Slurm mode (see the sketch after this list):
  - Use `sattach` to connect to the tmux session
  - Format: `sattach -m <job_id>.<step_id>:<session_name>`
- If local mode:
  - Keep the existing `podman exec` behavior
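A sketch of the attach branching, using the sattach invocation format given above; the step ID 0, the tmux session name, and the local exec command are assumptions standing in for the existing behavior:

```python
import subprocess

def cmd_attach(sandbox: dict) -> None:
    """Attach to a sandbox: sattach for Slurm jobs, podman exec locally."""
    if sandbox.get("slurm_job_id"):
        # Slurm mode: invocation format as specified in this issue.
        target = f"{sandbox['slurm_job_id']}.0:{sandbox['name']}"
        subprocess.run(["sattach", "-m", target], check=True)
    else:
        # Local mode: placeholder mirroring the current podman exec behavior.
        subprocess.run(
            ["podman", "exec", "-it", sandbox["name"], "tmux", "attach"],
            check=True,
        )
```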
4. Remove Command with Slurm Support
Changes to cmd_rm():
- Check if the sandbox has a `slurm_job_id`
- If Slurm mode (see the sketch after this list):
  - Call `scancel <job_id>` to cancel the Slurm job
  - Clean up the local worktree only (no container to remove)
- If local mode:
  - Keep the existing `podman stop && podman rm` behavior
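A sketch of the rm branching, again assuming subprocess is used to shell out; the worktree and database cleanup are left as comments since that code already exists:

```python
import subprocess

def cmd_rm(sandbox: dict) -> None:
    """Tear down one sandbox, branching on how it was launched."""
    if sandbox.get("slurm_job_id"):
        # Slurm mode: cancel the job; the container on the compute node ends with it.
        subprocess.run(["scancel", sandbox["slurm_job_id"]], check=False)
    else:
        # Local mode: unchanged container teardown.
        subprocess.run(["podman", "stop", sandbox["name"]], check=False)
        subprocess.run(["podman", "rm", sandbox["name"]], check=False)
    # The existing worktree cleanup and database-row removal would follow here,
    # identical for both modes.
```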
5. Reset Command - No Slurm Cancellation
Changes to cmd_reset():
- Only clean up local resources:
  - Remove work directories
  - Remove the database file
  - Remove cached images
- DO NOT cancel any Slurm jobs (users should manage Slurm jobs separately)
- Print a warning that Slurm jobs are not affected (see the sketch after this list)
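A sketch of the reset behavior, under the assumption that the local state lives in paths supplied by the caller (the actual directory layout of run.py is not specified in this issue):

```python
import shutil
from pathlib import Path

def cmd_reset(work_dir: Path, db_path: Path, cache_dir: Path) -> None:
    """Reset local sandbox state only; Slurm jobs are intentionally left running."""
    for path in (work_dir, cache_dir):
        shutil.rmtree(path, ignore_errors=True)   # work directories and cached images
    db_path.unlink(missing_ok=True)               # database file
    print("Note: Slurm-mode sandboxes were not cancelled; "
          "use scancel to stop them if needed.")
```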
6. List Command Enhancement
Changes to cmd_ls():
- Show the Slurm job ID for Slurm-mode sandboxes
- Show a "slurm" status indicator (see the sketch after this list)
Reference Implementation
The podman-srun wrapper script (`./podman-srun`) demonstrates how to run podman in Slurm environments:
- Creates node-specific storage directories (`~/.cache/podman/<hostname>/`)
- Configures storage.conf for the overlay driver with fuse-overlayfs
- Configures containers.conf with the cgroupfs manager
- Runs podman commands via `srun` on compute nodes
Example usage of podman-srun:

srun podman-srun run hello-world   # Run container on Slurm node

Test Strategy
Test Cases
- Slurm Mode - New Sandbox
  - Input: `python run.py new -n test-sb -s`
  - Verify: Slurm job is submitted and the job ID is stored in the database
  - Verify: Worktree is created
- Slurm Mode - Attach
  - Input: `python run.py attach -n test-sb`
  - Verify: `sattach` connects to the tmux session
- Slurm Mode - Remove
  - Input: `python run.py rm -n test-sb`
  - Verify: Slurm job is cancelled via `scancel`
  - Verify: Worktree is removed
- Local Mode - Unchanged Behavior
  - Input: `python run.py new -n local-sb`
  - Verify: Container runs locally (existing behavior)
- Reset - Slurm Jobs Preserved (see the example test sketch after this list)
  - Input: `python run.py reset`
  - Verify: Worktree and images are removed
  - Verify: Slurm jobs are NOT cancelled
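As one example of how the "Reset - Slurm Jobs Preserved" case could be automated, the pytest sketch below records subprocess calls and asserts that scancel is never invoked; the `run` import and the cmd_reset signature are hypothetical and mirror the reset sketch above:

```python
# test_reset_preserves_slurm_jobs.py -- illustrative only; assumes sandbox/run.py
# is importable as `run` and shells out through subprocess.run.
import subprocess

import run  # hypothetical import of sandbox/run.py

def test_reset_does_not_call_scancel(tmp_path, monkeypatch):
    calls = []

    def fake_run(cmd, *args, **kwargs):
        calls.append(cmd)
        return subprocess.CompletedProcess(cmd, 0)

    monkeypatch.setattr(subprocess, "run", fake_run)

    # Point reset at a throwaway state directory (exact config hook is assumed).
    run.cmd_reset(work_dir=tmp_path / "work",
                  db_path=tmp_path / "sandboxes.db",
                  cache_dir=tmp_path / "images")

    assert not any(cmd and cmd[0] == "scancel" for cmd in calls)
```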
Prerequisites for Testing
- Access to a Slurm cluster
- `srun`, `sattach`, and `scancel` commands available
- `podman-srun` script in PATH or referenced correctly