Feature Request: User-Configurable Replication-Level Checkpoints #81

@mronkko

Description


This is related to #79. SimDesign currently supports resuming incomplete simulations at the design level but not at the replication level. It would be useful to add incremental checkpointing within design conditions, so that simulations can resume from partially completed replications rather than only from completed conditions. This would provide robust crash recovery for long-running simulations with many replications per condition.

I used Claude Code to suggest how this could be implemented.

Motivation

Current Implementation

runSimulation() currently saves temporary files only after each complete design condition finishes all its replications. This means:

  • ❌ If a condition with 10,000 replications crashes at replication 6,900, all 6,900 completed replications are lost
  • ❌ Resume only works at condition-level granularity (e.g., resume at condition 5, not at replication 6,901 within condition 5)
  • ❌ For HPC environments with time limits, conditions requiring >4 hours cannot leverage the resume feature

Relevant code:

# runSimulation.R lines 1626-1736
for(i in start:end){  # Loop over design CONDITIONS
    tmp <- Analysis(...)  # Runs ALL replications via lapply/pblapply
    Result_list[[i]] <- ...  # Store aggregated results
    if(save || save_results)
        saveRDS(Result_list, ...)  # Save AFTER condition completes
}

The Analysis() function uses lapply/pblapply, which blocks until all replications complete, preventing intermediate saves.
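To make intermediate saves possible, the single blocking call can be split into a loop over chunks covering the same total work. A minimal standalone sketch (the function name `run_chunked` and the per-replication worker `run_one` are hypothetical stand-ins, not SimDesign's actual API):

```r
# Hypothetical sketch: same replications, but split into chunks so a
# checkpoint could be written between chunks. run_one stands in for
# the per-replication worker.
run_chunked <- function(replications, chunk_size, run_one) {
  results <- vector("list", replications)
  for (start in seq(1, replications, by = chunk_size)) {
    idx <- start:min(start + chunk_size - 1, replications)
    results[idx] <- lapply(idx, run_one)
    # a checkpoint file could be saved here, e.g. saveRDS(results, ...)
  }
  results
}

out <- run_chunked(10, 3, function(i) i^2)
length(out)  # 10
out[[4]]     # 16
```

The chunked loop produces results identical to a single `lapply()` call; the only difference is the natural save point between iterations.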

Why resuming incomplete designs would be useful

In my current simulation:

  • Setup: N=25,600 sample size, 10,000 replications, takes ~6 hours per condition
  • Problem: SLURM cluster has 4-hour time limit
  • Result: Job killed at 69% completion (6,900 reps), all work lost
  • Consequence: Cannot use runSimulation()'s resume feature; must switch to workarounds

This affects anyone running simulations where:

  • Individual conditions take hours to complete
  • Jobs run on HPC clusters with wall-time limits
  • The cluster scheduler allocates resources more readily to shorter jobs that complete part of a design and resume than to longer jobs that attempt the full design in one pass
  • Specific replication failures need to be debugged in isolation

Proposed Solution

Add user-configurable checkpointing similar to Stata's simulate command, which allows users to specify checkpoint frequency.

Design Goals

  1. Backward compatible: Opt-in via new parameter, existing code works unchanged
  2. User configurable: Let users balance checkpoint frequency vs I/O overhead
  3. Parallel-friendly: Should work with both serial and parallel execution modes
  4. Clean recovery: Automatic resume from last checkpoint without manual intervention

Recommended Implementation: Chunked Checkpoints

Similar to Stata's approach, split replications into user-defined chunks and save after each chunk completes.

New Parameter

runSimulation(...,
              checkpoint_replications = NULL)  # NULL (default) = no checkpointing
                                               # Integer = checkpoint every N reps

Example Usage

# Checkpoint every 1,000 replications
runSimulation(
  design = Design,
  replications = 10000,
  generate = Generate,
  analyse = Analyse,
  summarise = Summarise,
  checkpoint_replications = 1000,  # Save after every 1,000 reps
  save = TRUE
)

Behavior:

  • Runs replications 1-1,000 → saves checkpoint
  • Runs replications 1,001-2,000 → saves checkpoint
  • ...continues...
  • If interrupted at replication 7,342:
    • Checkpoint contains replications 1-7,000
    • On resume, starts at replication 7,001
    • Only loses last 342 reps, not all 7,342
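The resume point in this example follows directly from the chunk arithmetic. A small illustrative helper (hypothetical, not part of the proposed API):

```r
# Hypothetical helper: given the replication at which a run was
# interrupted and the checkpoint interval, compute the first
# replication to run on resume. Everything up to the last full
# chunk boundary is already on disk.
resume_at <- function(interrupted_at, checkpoint_every) {
  completed <- (interrupted_at %/% checkpoint_every) * checkpoint_every
  completed + 1L
}

resume_at(7342, 1000)  # 7001: replications 1-7000 are in the checkpoint
```

Only the `interrupted_at %% checkpoint_every` replications past the last boundary (here, 342) need to be re-run.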

Implementation Approach

File: R/analysis.R

if(!is.null(checkpoint_replications) && replications > checkpoint_replications) {
    # Chunked execution with checkpointing
    num_chunks <- ceiling(replications / checkpoint_replications)
    chunk_starts <- seq(1, replications, by=checkpoint_replications)
    chunk_ends <- pmin(chunk_starts + checkpoint_replications - 1, replications)

    # Check for existing checkpoint
    checkpoint_file <- file.path(save_results_out_rootdir,
                                paste0('SIMDESIGN-CHECKPOINT_',
                                       condition$ID, '.rds'))

    if(file.exists(checkpoint_file)) {
        checkpoint <- readRDS(checkpoint_file)
        start_chunk <- checkpoint$completed_chunks + 1
        all_results <- checkpoint$results
        length(all_results) <- replications  # re-grow list to full length
        # Restore the RNG state saved with the checkpoint so resumed
        # replications continue the same random stream
        if(!is.null(checkpoint$random_seed_state))
            assign('.Random.seed', checkpoint$random_seed_state,
                   envir = .GlobalEnv)
    } else {
        start_chunk <- 1
        all_results <- vector('list', replications)
    }

    # Process remaining chunks; guard against start_chunk > num_chunks,
    # which would make the a:b sequence count backwards in R
    if(start_chunk <= num_chunks) for(chunk_idx in start_chunk:num_chunks) {
        chunk_range <- chunk_starts[chunk_idx]:chunk_ends[chunk_idx]

        # Run chunk (works with serial, parallel, or future)
        chunk_results <- if(useFuture) {
            future.apply::future_lapply(chunk_range, used_mainsim, ...)
        } else if(!is.null(cl)) {
            parallel::parLapply(cl, chunk_range, used_mainsim, ...)
        } else {
            lapply(chunk_range, used_mainsim, ...)
        }

        # Store chunk results
        all_results[chunk_range] <- chunk_results

        # Save checkpoint
        checkpoint <- list(
            condition_ID = condition$ID,
            completed_chunks = chunk_idx,
            completed_replications = chunk_ends[chunk_idx],
            results = all_results[1:chunk_ends[chunk_idx]],
            random_seed_state = .GlobalEnv$.Random.seed,
            timestamp = Sys.time()
        )
        saveRDS(checkpoint, checkpoint_file)
    }

    results <- all_results

    # Remove checkpoint once the condition has fully completed
    if(file.exists(checkpoint_file))
        file.remove(checkpoint_file)

} else {
    # Standard execution (existing code unchanged)
    results <- lapply(1L:replications, used_mainsim, ...)
}

Benefits

  • Minimal overhead: checkpoints only every N replications, not after every single one
  • Parallel compatible: works with all execution modes (serial, parallel, future)
  • User control: users choose checkpoint frequency based on their needs
  • Backward compatible: existing simulations work unchanged (defaults to NULL)
  • Clean UX: automatic resume, no manual file management

Comparison to Stata's simulate

Stata's simulate command offers:

simulate ..., reps(10000) saving(results, every(1000))

The every() suboption of saving() specifies the checkpoint frequency. This proposal mirrors that design:

  • Stata: every(1000) saves every 1,000 replications
  • SimDesign: checkpoint_replications = 1000 saves every 1,000 replications

Alternative Considered

Per-replication checkpointing (save after every single replication):

  • ❌ High I/O overhead (10,000 disk writes for 10,000 reps)
  • ❌ Creates thousands of files to manage
  • ❌ Diminishing returns for most use cases

Relationship to Issue #79

This proposal complements #79 by addressing a related but distinct problem:

  Issue   Problem                                 Solution
  #79     max_time saves empty/invalid results    Fix result preservation on timeout
  This    No way to resume within a condition     Add checkpoint mechanism

Both are needed for robust HPC simulation workflows:

  1. #79 ("max_time Exceeded in runArraySimulation Saves Unusable Results") ensures partial results are preserved when a timeout occurs
  2. This proposal enables resuming from those partial results
