Description
This is related to #79. SimDesign currently supports resuming incomplete simulations on the design level but not on the replication level. Adding support for incremental checkpointing within design conditions to enable resumption from partially-completed replications, not just completed conditions, would be useful. This would provide robust crash recovery for long-running simulations with many replications per condition.
I used Claude Code to suggest how this could be implemented.
Motivation
Current Implementation
runSimulation() currently saves temporary files only after each complete design condition finishes all its replications. This means:
- ❌ If a condition with 10,000 replications crashes at replication 6,900, all 6,900 completed replications are lost
- ❌ Resume only works at condition-level granularity (e.g., resume at condition 5, not at replication 6,901 within condition 5)
- ❌ For HPC environments with time limits, conditions requiring >4 hours cannot leverage the resume feature
Relevant code:
```r
# runSimulation.R lines 1626-1736
for(i in start:end){               # Loop over design CONDITIONS
    tmp <- Analysis(...)           # Runs ALL replications via lapply/pblapply
    Result_list[[i]] <- ...        # Store aggregated results
    if(save || save_results)
        saveRDS(Result_list, ...)  # Save AFTER condition completes
}
```

The `Analysis()` function uses `lapply`/`pblapply`, which blocks until all replications complete, preventing intermediate saves.
Why resuming incomplete designs would be useful
In my current simulation:
- Setup: N=25,600 sample size, 10,000 replications, takes ~6 hours per condition
- Problem: SLURM cluster has 4-hour time limit
- Result: Job killed at 69% completion (6,900 reps), all work lost
- Consequence: Cannot use `runSimulation()`'s resume feature; must switch to workarounds
This affects anyone running simulations where:
- Individual conditions take hours to complete
- Working on HPC clusters with time limits
- HPC cluster schedulers often allocate resources more readily to shorter jobs, so short jobs that complete part of a design and then resume work better than one long job that attempts the full design
- Debugging specific replication failures
Proposed Solution
Add user-configurable checkpointing similar to Stata's simulate command, which allows users to specify checkpoint frequency.
Design Goals
- Backward compatible: Opt-in via new parameter, existing code works unchanged
- User configurable: Let users balance checkpoint frequency vs I/O overhead
- Parallel-friendly: Should work with both serial and parallel execution modes
- Clean recovery: Automatic resume from last checkpoint without manual intervention
Recommended Implementation: Chunked Checkpoints
Similar to Stata's approach, split replications into user-defined chunks and save after each chunk completes.
New Parameter
```r
runSimulation(...,
    checkpoint_replications = NULL)  # NULL (default) = no checkpointing
                                     # Integer = checkpoint every N reps
```

Example Usage
```r
# Checkpoint every 1,000 replications
runSimulation(
    design = Design,
    replications = 10000,
    generate = Generate,
    analyse = Analyse,
    summarise = Summarise,
    checkpoint_replications = 1000,  # Save after every 1,000 reps
    save = TRUE
)
```

Behavior:
- Runs replications 1-1,000 → saves checkpoint
- Runs replications 1,001-2,000 → saves checkpoint
- ...continues...
- If interrupted at replication 7,342:
- Checkpoint contains replications 1-7,000
- On resume, starts at replication 7,001
- Only loses last 342 reps, not all 7,342
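The resume arithmetic above can be sketched directly. This is a hypothetical helper for illustration only, not part of SimDesign: with a checkpoint every `every` replications, an interruption at replication `r` loses only `r %% every` replications.

```r
# Illustrative helper (not SimDesign code): where does a resume start,
# and how many replications are lost, given checkpoint frequency `every`?
resume_point <- function(interrupted_at, every) {
  last_checkpoint <- (interrupted_at %/% every) * every
  list(resume_at = last_checkpoint + 1L,
       lost      = interrupted_at - last_checkpoint)
}

resume_point(7342, 1000)  # resume_at = 7001, lost = 342
```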
Implementation Approach
File: R/analysis.R
```r
if(!is.null(checkpoint_replications) && replications > checkpoint_replications) {
    # Chunked execution with checkpointing
    num_chunks <- ceiling(replications / checkpoint_replications)
    chunk_starts <- seq(1, replications, by = checkpoint_replications)
    chunk_ends <- pmin(chunk_starts + checkpoint_replications - 1, replications)

    # Check for existing checkpoint
    checkpoint_file <- file.path(save_results_out_rootdir,
                                 paste0('SIMDESIGN-CHECKPOINT_',
                                        condition$ID, '.rds'))
    if(file.exists(checkpoint_file)) {
        checkpoint <- readRDS(checkpoint_file)
        start_chunk <- checkpoint$completed_chunks + 1
        all_results <- checkpoint$results
    } else {
        start_chunk <- 1
        all_results <- vector('list', replications)
    }

    # Process remaining chunks (guard against an already-complete checkpoint)
    remaining <- if(start_chunk <= num_chunks) start_chunk:num_chunks else integer(0)
    for(chunk_idx in remaining) {
        chunk_range <- chunk_starts[chunk_idx]:chunk_ends[chunk_idx]

        # Run chunk (works with serial, parallel, or future)
        chunk_results <- if(useFuture) {
            future.apply::future_lapply(chunk_range, used_mainsim, ...)
        } else if(!is.null(cl)) {
            parallel::parLapply(cl, chunk_range, used_mainsim, ...)
        } else {
            lapply(chunk_range, used_mainsim, ...)
        }

        # Store chunk results
        all_results[chunk_range] <- chunk_results

        # Save checkpoint
        checkpoint <- list(
            condition_ID = condition$ID,
            completed_chunks = chunk_idx,
            completed_replications = chunk_ends[chunk_idx],
            results = all_results[1:chunk_ends[chunk_idx]],
            random_seed_state = .GlobalEnv$.Random.seed,
            timestamp = Sys.time()
        )
        saveRDS(checkpoint, checkpoint_file)
    }
    results <- all_results

    # Cleanup checkpoint after completion
    if(file.exists(checkpoint_file)) file.remove(checkpoint_file)
} else {
    # Standard execution (existing code unchanged)
    results <- lapply(1L:replications, used_mainsim, ...)
}
```

Benefits
✅ Minimal overhead: Checkpoint only every N replications, not every single one
✅ Parallel compatible: Works with all execution modes (serial, parallel, future)
✅ User control: Users choose checkpoint frequency based on their needs
✅ Backward compatible: Existing simulations work unchanged (defaults to NULL)
✅ Clean UX: Automatic resume, no manual file management
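To illustrate the mechanism end-to-end, here is a self-contained toy sketch in plain R (not SimDesign internals; the `run_chunked()` helper and its arguments are invented for this example). It mimics the proposed behavior: run replications in chunks, persist a checkpoint file after each chunk, resume from that file if one exists, and remove it on completion.

```r
# Toy sketch of chunked checkpoint/resume (illustration only).
run_chunked <- function(replications, every, checkpoint_file, sim_fun) {
  # Resume from an existing checkpoint, if any
  if (file.exists(checkpoint_file)) {
    chk <- readRDS(checkpoint_file)
    done <- chk$completed_replications
    results <- chk$results
  } else {
    done <- 0L
    results <- vector("list", replications)
  }
  while (done < replications) {
    idx <- (done + 1L):min(done + every, replications)  # next chunk
    results[idx] <- lapply(idx, sim_fun)                # run the chunk
    done <- max(idx)
    saveRDS(list(completed_replications = done,         # persist progress
                 results = results), checkpoint_file)
  }
  file.remove(checkpoint_file)  # clean up after successful completion
  results
}

chk <- tempfile(fileext = ".rds")
res <- run_chunked(25, every = 10, chk, sim_fun = function(i) i^2)
length(res)       # 25
res[[25]]         # 625
file.exists(chk)  # FALSE (checkpoint removed on completion)
```

If the process dies mid-run, the checkpoint file survives, and calling `run_chunked()` again with the same arguments restarts at the first replication after the last saved chunk.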
Comparison to Stata's simulate
Stata's simulate command offers:
```
simulate ..., reps(10000) saving(results.dta) every(1000)
```

The `every()` option specifies checkpoint frequency. This proposal mirrors that design:
- Stata: `every(1000)` saves every 1,000 replications
- SimDesign: `checkpoint_replications = 1000` saves every 1,000 replications
Alternative Considered
Per-replication checkpointing (save after every single replication):
- ❌ High I/O overhead (10,000 disk writes for 10,000 reps)
- ❌ Creates thousands of files to manage
- ❌ Diminishing returns for most use cases
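The I/O difference is easy to quantify: the number of checkpoint writes per condition under the chunked scheme is `ceiling(replications / checkpoint_replications)`, versus one write per replication otherwise.

```r
# Disk writes per condition for 10,000 replications (illustrative arithmetic)
replications <- 10000
every <- 1000
writes_chunked <- ceiling(replications / every)  # 10 writes
writes_per_rep <- replications                   # 10,000 writes
```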
Relationship to Issue #79
This proposal complements #79 by addressing a related but distinct problem:
| Issue | Problem | Solution |
|---|---|---|
| #79 | `max_time` saves empty/invalid results | Fix result preservation on timeout |
| This | No way to resume within-condition | Add checkpoint mechanism |
Both are needed for robust HPC simulation workflows:
- `max_time` Exceeded in `runArraySimulation` Saves Unusable Results (#79) ensures partial results are preserved when a timeout occurs
- This proposal enables resuming from those partial results