@sebhoerl sebhoerl commented Jan 12, 2026

This PR is a work in progress that adds a way to restart simulations automatically and intuitively. We need this feature because our cluster may stop a job when another high-priority job arrives. However, we can have a stopped job restarted automatically. The idea is therefore an automatic process that detects that the simulation has already run and then continues from the last possible iteration.

The current implementation does the following:

  • It takes the desired output directory from controller.outputDirectory
  • Instead of writing directly into that directory, the config is automatically rewritten to write into {output}/restart-X. The first restart directory has index 0, the next one index 1, and so on.
  • When a simulation is started and some {output}/restart-Y is found, it will (1) traverse all existing restart directories in order, and (2) traverse all iteration directories inside these restart directories
  • We define a list of files that must be present in an iteration for it to be tagged as valid
  • If we find a valid previous iteration, we rewrite the configuration of the current process so that the output files of the last valid iteration are used as input
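
The directory scan described above can be sketched roughly as follows. This is not the code of this PR, only a minimal Python illustration of the idea. The `ITERS/it.N` layout, the `N.` file name prefix, the `REQUIRED_FILES` list, and all function names are assumptions made for the example.

```python
import re
from pathlib import Path
from typing import List, Optional, Tuple

RESTART_PATTERN = re.compile(r"restart-(\d+)")
ITERATION_PATTERN = re.compile(r"it\.(\d+)")

# Hypothetical list of files that must exist for an iteration to count as valid
REQUIRED_FILES = ("plans.xml.gz",)


def restart_directories(output: Path) -> List[Tuple[int, Path]]:
    """Return (index, path) for each existing restart-X directory, ordered by index."""
    found = []
    if output.is_dir():
        for child in output.iterdir():
            match = RESTART_PATTERN.fullmatch(child.name)
            if match and child.is_dir():
                found.append((int(match.group(1)), child))
    return sorted(found, key=lambda item: item[0])


def next_restart_directory(output: Path) -> Path:
    """Directory the current run should write to: restart-0 first, then restart-1, ..."""
    existing = restart_directories(output)
    index = existing[-1][0] + 1 if existing else 0
    return output / f"restart-{index}"


def is_valid(iteration_dir: Path, iteration: int) -> bool:
    """An iteration is valid if every required file is present (assumed naming: N.<file>)."""
    return all(
        (iteration_dir / f"{iteration}.{name}").is_file() for name in REQUIRED_FILES
    )


def last_valid_iteration(output: Path) -> Optional[Tuple[int, Path]]:
    """Scan all restart directories in order and return the highest valid iteration."""
    best = None
    for _, restart_dir in restart_directories(output):
        iterations_root = restart_dir / "ITERS"  # assumed layout
        if not iterations_root.is_dir():
            continue
        for child in iterations_root.iterdir():
            match = ITERATION_PATTERN.fullmatch(child.name)
            if not (match and child.is_dir()):
                continue
            iteration = int(match.group(1))
            # ">=" lets a later restart directory win if it repeats an iteration number
            if is_valid(child, iteration) and (best is None or iteration >= best[0]):
                best = (iteration, child)
    return best
```

In this sketch, a run would call `last_valid_iteration(output)` at startup, point its input files at the returned directory if one is found, and write its own results to `next_restart_directory(output)`. An iteration that was interrupted halfway is skipped because its required files are missing.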

This is a proof of concept and still needs to be tested properly. For now, the following is technically included:

  • Restarting the plans
  • Restarting the termination criteria
  • Restarting VDF

Some additional changes:

  • Fixes restarting of the termination criterion (it was not implemented properly before)
