Labels: enhancement (New feature or request), pipeline (Pipeline, scripts/run.sh)
Problem
Currently there's no way to resume an interrupted training run:
- Without `--overwrite`: existing output files trigger the skip logic, so the `train` step is skipped entirely
- With `--overwrite`: the existing checkpoint directory is deleted, losing all training progress
Motivation
Long training runs (hours) can be interrupted by:
- GPU crashes / OOM
- System reboots
- User mistakes
Losing progress is frustrating and wastes compute.
Proposed Solution
Add a `--resume` flag (or similar) that:
- Does not skip the `train` step when output exists
- Does not delete existing checkpoints
- Passes `--trainer.load-dir` pointing to the existing run's checkpoint directory
Alternatively, detect if a partial training run exists (checkpoints present but training incomplete) and automatically resume.
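A minimal sketch of how the flag could thread through `run.sh`, assuming the script collects trainer arguments in an array; `RESUME`, `RUN_DIR`, and `EXTRA_TRAIN_ARGS` are illustrative names, not the real script's:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of --resume handling for scripts/run.sh.
# RESUME, RUN_DIR, and EXTRA_TRAIN_ARGS are illustrative names.
set -eo pipefail

RESUME=0
for arg in "$@"; do
  case "$arg" in
    --resume) RESUME=1 ;;
  esac
done

RUN_DIR="${RUN_DIR:-outputs/run}"   # assumed output layout

EXTRA_TRAIN_ARGS=()
if [ "$RESUME" -eq 1 ] && [ -d "$RUN_DIR/checkpoints" ]; then
  # Neither skip the train step nor delete checkpoints; resume instead.
  EXTRA_TRAIN_ARGS+=(--trainer.load-dir "$RUN_DIR/checkpoints")
fi

echo "resume=$RESUME extra args: ${EXTRA_TRAIN_ARGS[*]}"
```

The skip check would additionally need to treat "checkpoints present and `--resume` set" as "run the train step" rather than "output exists, skip".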
Expected Behavior
Users should be able to continue training from a checkpoint without losing progress. The underlying trainer (`sdf-train`) already supports this via:
- `--trainer.load-dir PATH`: directory containing checkpoints
- `--trainer.load-step INT`: specific step to resume from
Alternatives Considered
Current workaround: Users must manually invoke sdf-train with --trainer.load-dir to resume. This works but requires knowing the internal command structure.
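For reference, the workaround amounts to something like the following; the checkpoint path and step number are illustrative, only the two flag names come from the trainer's documented interface:

```shell
# Manual workaround: invoke the trainer directly with the documented flags.
# The checkpoint path and step number below are illustrative.
cmd=(sdf-train
  --trainer.load-dir outputs/my-run/checkpoints
  --trainer.load-step 5000)
echo "${cmd[@]}"   # printed rather than executed; sdf-train may not be on PATH
```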
Tasks
- Add `--resume` flag to `run.sh`
- Modify skip logic to allow resume when flag is set
- Pass `--trainer.load-dir` to `sdf-train` when resuming
- Optionally auto-detect latest checkpoint step
- Update README with resume documentation
- Add tests for resume behavior
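The auto-detect task could look like this sketch, which assumes checkpoint files are named `step-<N>.*`; the real checkpoint layout of `sdf-train` may differ:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: pick the latest step to pass as --trainer.load-step.
# Assumes checkpoint files are named step-<N>.*; the real layout may differ.
set -eo pipefail

latest_step() {
  local ckpt_dir="$1"
  ls "$ckpt_dir" 2>/dev/null \
    | sed -n 's/^step-\([0-9][0-9]*\).*/\1/p' \
    | sort -n \
    | tail -1
}

# Demo against a temporary directory standing in for the checkpoint dir.
demo=$(mktemp -d)
touch "$demo/step-500.ckpt" "$demo/step-1000.ckpt" "$demo/step-2500.ckpt"
echo "latest step: $(latest_step "$demo")"   # 2500 for this demo
rm -rf "$demo"
```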