feat(pipeline): Support resuming training runs #10

@hummat

Description

Problem

Currently there's no way to resume an interrupted training run:

  • Without --overwrite: Existing output files trigger the skip logic, so the train step is skipped entirely
  • With --overwrite: The existing checkpoint directory is deleted, losing all training progress

Motivation

Long training runs (hours) can be interrupted by:

  • GPU crashes / OOM
  • System reboots
  • User mistakes

Losing progress is frustrating and wastes compute.

Proposed Solution

Add a --resume flag (or similar) that:

  1. Does not skip the train step when output exists
  2. Does not delete existing checkpoints
  3. Passes --trainer.load-dir pointing to the existing run's checkpoint directory (see the sketch below)
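
A rough sketch of how this could look in run.sh. Everything beyond the --resume idea, the existing --overwrite flag, and --trainer.load-dir (variable names, checkpoint layout, the skip/overwrite structure) is an assumption about the script's internals:

```bash
#!/usr/bin/env bash
# Sketch only: RESUME, OVERWRITE, RUN_DIR, and TRAIN_ARGS are assumed names for
# run.sh internals; only --resume, --overwrite, and --trainer.load-dir come from this issue.

RESUME=0
OVERWRITE=0
TRAIN_ARGS=()
RUN_DIR="${RUN_DIR:-outputs/my-run}"   # hypothetical run directory

for arg in "$@"; do
  case "$arg" in
    --resume)    RESUME=1 ;;
    --overwrite) OVERWRITE=1 ;;
  esac
done

CKPT_DIR="$RUN_DIR/checkpoints"        # assumed checkpoint location inside the run directory

if [[ -d "$CKPT_DIR" && "$RESUME" -eq 1 ]]; then
  # Neither skip nor delete: continue training from the existing checkpoints.
  TRAIN_ARGS+=(--trainer.load-dir "$CKPT_DIR")
elif [[ -d "$CKPT_DIR" && "$OVERWRITE" -eq 1 ]]; then
  rm -rf "$CKPT_DIR"                   # current --overwrite behavior: start from scratch
elif [[ -d "$CKPT_DIR" ]]; then
  echo "Output exists, skipping train step (pass --resume or --overwrite)"
  exit 0
fi

sdf-train "${TRAIN_ARGS[@]}"
```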

Alternatively, detect whether a partial training run exists (checkpoints present but training incomplete) and resume automatically; a sketch of such detection follows.
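
One possible way to auto-detect the latest checkpoint step, assuming checkpoints are named step-<N>.ckpt (the actual naming scheme written by sdf-train would need to be checked):

```bash
# Sketch of auto-detecting the latest checkpoint step. The filename pattern
# step-<N>.ckpt is an assumption; adjust it to whatever sdf-train actually writes.
latest_step() {
  local ckpt_dir="$1"
  ls "$ckpt_dir"/step-*.ckpt 2>/dev/null \
    | sed -E 's/.*step-([0-9]+)\.ckpt/\1/' \
    | sort -n \
    | tail -n 1
}

STEP="$(latest_step "$CKPT_DIR")"
if [[ -n "$STEP" ]]; then
  TRAIN_ARGS+=(--trainer.load-dir "$CKPT_DIR" --trainer.load-step "$STEP")
fi
```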

Expected Behavior

Users should be able to continue training from a checkpoint without losing progress. The underlying trainer (sdf-train) already supports this via:

  • --trainer.load-dir PATH — directory containing checkpoints
  • --trainer.load-step INT — specific step to resume from
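
For illustration, resuming manually might look like this; apart from the two --trainer.* flags listed above, the directory and step values are made up:

```bash
# Hypothetical paths/values; only --trainer.load-dir and --trainer.load-step
# are documented flags of sdf-train.
sdf-train \
  --trainer.load-dir outputs/my-run/checkpoints \
  --trainer.load-step 15000
```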

Alternatives Considered

Current workaround: Users must manually invoke sdf-train with --trainer.load-dir to resume. This works but requires knowing the internal command structure.

Tasks

  • Add --resume flag to run.sh
  • Modify skip logic to allow resume when flag is set
  • Pass --trainer.load-dir to sdf-train when resuming
  • Optionally auto-detect latest checkpoint step
  • Update README with resume documentation
  • Add tests for resume behavior
