Feature Request: User-Configurable Replication-Level Checkpoints #81

@mronkko

Description


This is related to #79. SimDesign currently supports resuming incomplete simulations at the design level but not at the replication level. It would be useful to add incremental checkpointing within design conditions, so that simulations can resume from partially completed replications rather than only from completed conditions. This would provide robust crash recovery for long-running simulations with many replications per condition.

I used Claude Code to suggest how this could be implemented.

Motivation

Current Implementation

runSimulation() currently saves temporary files only after each complete design condition finishes all its replications. This means:

  • ❌ If a condition with 10,000 replications crashes at replication 6,900, all 6,900 completed replications are lost
  • ❌ Resume only works at condition-level granularity (e.g., resume at condition 5, not at replication 6,901 within condition 5)
  • ❌ For HPC environments with time limits, conditions requiring >4 hours cannot leverage the resume feature

Relevant code:

# runSimulation.R lines 1626-1736
for(i in start:end){  # Loop over design CONDITIONS
    tmp <- Analysis(...)  # Runs ALL replications via lapply/pblapply
    Result_list[[i]] <- ...  # Store aggregated results
    if(save || save_results)
        saveRDS(Result_list, ...)  # Save AFTER condition completes
}

The Analysis() function uses lapply/pblapply, which blocks until all replications complete, preventing intermediate saves.
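To make intermediate saves possible, the single blocking call can be split into a loop over chunks covering the same total work. A minimal standalone sketch (the function name `run_chunked` and the per-replication worker `run_one` are hypothetical stand-ins, not SimDesign's actual API):

```r
# Hypothetical sketch: same replications, but split into chunks so a
# checkpoint could be written between chunks. run_one stands in for
# the per-replication worker.
run_chunked <- function(replications, chunk_size, run_one) {
  results <- vector("list", replications)
  for (start in seq(1, replications, by = chunk_size)) {
    idx <- start:min(start + chunk_size - 1, replications)
    results[idx] <- lapply(idx, run_one)
    # a checkpoint file could be saved here, e.g. saveRDS(results, ...)
  }
  results
}

out <- run_chunked(10, 3, function(i) i^2)
length(out)  # 10
out[[4]]     # 16
```

The chunked loop produces results identical to a single `lapply()` call; the only difference is the natural save point between iterations.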

Why resuming incomplete designs would be useful

In my current simulation:

  • Setup: N=25,600 sample size, 10,000 replications, takes ~6 hours per condition
  • Problem: SLURM cluster has 4-hour time limit
  • Result: Job killed at 69% completion (6,900 reps), all work lost
  • Consequence: Cannot use runSimulation()'s resume feature; must switch to workarounds

This affects anyone running simulations where:

  • Individual conditions take hours to complete
  • Jobs run on HPC clusters with wall-time limits
  • The cluster scheduler allocates resources more readily to shorter jobs that complete part of a design and resume than to longer jobs that attempt the full design in one pass
  • Specific replication failures need to be debugged in isolation

Proposed Solution

Add user-configurable checkpointing similar to Stata's simulate command, which allows users to specify checkpoint frequency.

Design Goals

  1. Backward compatible: Opt-in via new parameter, existing code works unchanged
  2. User configurable: Let users balance checkpoint frequency vs I/O overhead
  3. Parallel-friendly: Should work with both serial and parallel execution modes
  4. Clean recovery: Automatic resume from last checkpoint without manual intervention

Recommended Implementation: Chunked Checkpoints

Similar to Stata's approach, split replications into user-defined chunks and save after each chunk completes.

New Parameter

runSimulation(...,
              checkpoint_replications = NULL)  # NULL (default) = no checkpointing
                                               # Integer = checkpoint every N reps

Example Usage

# Checkpoint every 1,000 replications
runSimulation(
  design = Design,
  replications = 10000,
  generate = Generate,
  analyse = Analyse,
  summarise = Summarise,
  checkpoint_replications = 1000,  # Save after every 1,000 reps
  save = TRUE
)

Behavior:

  • Runs replications 1-1,000 → saves checkpoint
  • Runs replications 1,001-2,000 → saves checkpoint
  • ...continues...
  • If interrupted at replication 7,342:
    • Checkpoint contains replications 1-7,000
    • On resume, starts at replication 7,001
    • Only loses last 342 reps, not all 7,342
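The resume point in this example follows directly from the chunk arithmetic. A small illustrative helper (hypothetical, not part of the proposed API):

```r
# Hypothetical helper: given the replication at which a run was
# interrupted and the checkpoint interval, compute the first
# replication to run on resume. Everything up to the last full
# chunk boundary is already on disk.
resume_at <- function(interrupted_at, checkpoint_every) {
  completed <- (interrupted_at %/% checkpoint_every) * checkpoint_every
  completed + 1L
}

resume_at(7342, 1000)  # 7001: replications 1-7000 are in the checkpoint
```

Only the `interrupted_at %% checkpoint_every` replications past the last boundary (here, 342) need to be re-run.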

Implementation Approach

File: R/analysis.R

if(!is.null(checkpoint_replications) && replications > checkpoint_replications) {
    # Chunked execution with checkpointing
    num_chunks <- ceiling(replications / checkpoint_replications)
    chunk_starts <- seq(1, replications, by=checkpoint_replications)
    chunk_ends <- pmin(chunk_starts + checkpoint_replications - 1, replications)

    # Check for existing checkpoint
    checkpoint_file <- file.path(save_results_out_rootdir,
                                paste0('SIMDESIGN-CHECKPOINT_',
                                       condition$ID, '.rds'))

    if(file.exists(checkpoint_file)) {
        checkpoint <- readRDS(checkpoint_file)
        start_chunk <- checkpoint$completed_chunks + 1
        all_results <- checkpoint$results
        length(all_results) <- replications  # re-grow list to full length
        # Restore the RNG state saved with the checkpoint so resumed
        # replications continue the same random stream
        if(!is.null(checkpoint$random_seed_state))
            assign('.Random.seed', checkpoint$random_seed_state,
                   envir = .GlobalEnv)
    } else {
        start_chunk <- 1
        all_results <- vector('list', replications)
    }

    # Process remaining chunks; guard against start_chunk > num_chunks,
    # which would make the a:b sequence count backwards in R
    if(start_chunk <= num_chunks) for(chunk_idx in start_chunk:num_chunks) {
        chunk_range <- chunk_starts[chunk_idx]:chunk_ends[chunk_idx]

        # Run chunk (works with serial, parallel, or future)
        chunk_results <- if(useFuture) {
            future.apply::future_lapply(chunk_range, used_mainsim, ...)
        } else if(!is.null(cl)) {
            parallel::parLapply(cl, chunk_range, used_mainsim, ...)
        } else {
            lapply(chunk_range, used_mainsim, ...)
        }

        # Store chunk results
        all_results[chunk_range] <- chunk_results

        # Save checkpoint
        checkpoint <- list(
            condition_ID = condition$ID,
            completed_chunks = chunk_idx,
            completed_replications = chunk_ends[chunk_idx],
            results = all_results[1:chunk_ends[chunk_idx]],
            random_seed_state = .GlobalEnv$.Random.seed,
            timestamp = Sys.time()
        )
        saveRDS(checkpoint, checkpoint_file)
    }

    results <- all_results

    # Remove checkpoint once the condition has fully completed
    if(file.exists(checkpoint_file))
        file.remove(checkpoint_file)

} else {
    # Standard execution (existing code unchanged)
    results <- lapply(1L:replications, used_mainsim, ...)
}

Benefits

  • Minimal overhead: checkpoints only every N replications, not after every single one
  • Parallel compatible: works with all execution modes (serial, parallel, future)
  • User control: users choose checkpoint frequency based on their needs
  • Backward compatible: existing simulations work unchanged (defaults to NULL)
  • Clean UX: automatic resume, no manual file management

Comparison to Stata's simulate

Stata's simulate command offers:

simulate ..., reps(10000) saving(results, every(1000))

The every() suboption of saving() specifies the checkpoint frequency. This proposal mirrors that design:

  • Stata: every(1000) saves every 1,000 replications
  • SimDesign: checkpoint_replications = 1000 saves every 1,000 replications

Alternative Considered

Per-replication checkpointing (save after every single replication):

  • ❌ High I/O overhead (10,000 disk writes for 10,000 reps)
  • ❌ Creates thousands of files to manage
  • ❌ Diminishing returns for most use cases

Relationship to Issue #79

This proposal complements #79 by addressing a related but distinct problem:

  Issue   Problem                                 Solution
  #79     max_time saves empty/invalid results    Fix result preservation on timeout
  This    No way to resume within a condition     Add checkpoint mechanism

Both are needed for robust HPC simulation workflows:

  1. #79 ("max_time Exceeded in runArraySimulation Saves Unusable Results") ensures partial results are preserved when a timeout occurs
  2. This proposal enables resuming from those partial results
