Implementing a VVUQ campaign into the FabNESO plugin #43
Conversation
matt-graham
left a comment
Thanks @dleggat for working on this.
The FabNESO/run_vvuq_instance.sh script seems necessary because of the way that the VVUQ campaign attempts to run the NESO solver after encoding. There are two options for running the simulation within the VVUQ code, ExecuteLocal or ExecutePython. The former executes a bash command, whilst the latter executes a Python function.
I had initially assumed that the latter would be the most useful, since we already have Python functions for running NESO set up, but VVUQ gives no way of determining which run you're doing, or from what location in this case; the only piece of information from the campaign that you can access is the current parameter values.
On the other hand, ExecuteLocal runs a command in the current VVUQ run directory, so by calling a Python single_run_vvuq task from the bash script using this, we can do the location manipulation that we need for the decoder to extract the information that we need.
Thanks for the explanation. What you've proposed makes sense, I think, if we are going the route of using the EasyVVUQ Campaign class to orchestrate dispatching the runs. However, given the constraints this seems to impose, I'm not sure that is the best solution here. In particular, it looks like at the moment the neso tasks are dispatched sequentially and synchronously by default, even when running on a batch scheduler like on ARCHER2? Given there are no dependencies between the different runs, ideally we'd like to be able to schedule these all to run in parallel (unlike the VBMC case, where we needed to wait for each run to complete to decide on the parameters to use for the next run), as this would be a significant time saver when running on a system like ARCHER2. The nested calls to a FabSim task via the command-line interface within another FabSim task also create quite verbose output, as the FabSim output from both the outer and inner level tasks is woven together.
An alternative to using the whole EasyVVUQ campaign infrastructure is to just use the components we need to perform a polynomial chaos expansion (PCE) as we are doing here - specifically the PCESampler and PCEAnalysis classes. The former allows constructing sets of parameters to evaluate the model at, and the latter can be used to process the model outputs to fit the coefficients of the PCE. If we use the parameter dictionaries output by the PCESampler to create a set of conditions files in a 'sweep' directory, we can then use the FabSim run_ensemble function to dispatch jobs for each of these parameter sets. This means we can reuse the existing infrastructure we have for overriding parameters in the conditions files without needing to replicate it using an EasyVVUQ encoder, and then have a separate task which deals with performing the analysis step using the PCEAnalysis class once the ensemble of runs has completed and the results have been fetched back.
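To make the suggested flow concrete, here is a minimal, self-contained sketch of the "write one conditions file per parameter set into a 'sweep' directory" step. The function name `write_sweep_dir`, the conditions-file format, and the hard-coded parameter sets are all illustrative assumptions, not FabNESO or EasyVVUQ API; the parameter dictionaries stand in for what `PCESampler` would yield.

```python
# Hypothetical sketch: given parameter dicts such as those yielded by
# EasyVVUQ's PCESampler, write one conditions file per parameter set into a
# 'sweep' directory, so FabSim's run_ensemble could dispatch a job for each.
from pathlib import Path
from tempfile import TemporaryDirectory


def write_sweep_dir(
    sweep_root: Path, parameter_sets: list[dict[str, float]]
) -> list[Path]:
    """Write one conditions file per parameter set (file format is illustrative)."""
    run_dirs = []
    for i, params in enumerate(parameter_sets):
        run_dir = sweep_root / f"run_{i}"
        run_dir.mkdir(parents=True)
        lines = [
            f'<P PARAM="{name}" VALUE="{value}" />' for name, value in params.items()
        ]
        (run_dir / "conditions.xml").write_text("\n".join(lines))
        run_dirs.append(run_dir)
    return run_dirs


# Example parameter sets, shaped like sampler output for two varied parameters
parameter_sets = [
    {"particle_charge": 10.0, "particle_mass": 0.5},
    {"particle_charge": 20.0, "particle_mass": 1.5},
]
with TemporaryDirectory() as tmp:
    dirs = write_sweep_dir(Path(tmp) / "sweep", parameter_sets)
    print(len(dirs))  # 2
```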
I've had a go at implementing this alternative approach on a branch mmg/pce-task. This works both locally and on ARCHER2, and on the latter allows jobs to be submitted asynchronously and in parallel. It's a shame given all your efforts on this, but as it seems like it would be quite a lot of work to get the approach here working, I'd lean towards closing this PR and opening a new PR with this alternative approach instead. Apologies about this, I should have looked into this more previously and put more guidance in the issue.
```python
def _parse_vvuq_parameters(
    vvuq_param_file: Path,
) -> tuple[dict[Any, dict[str, float]], dict[str, Any], list[str]]:
    # read the inpurt file
```
```diff
-    # read the inpurt file
+    # read the input file
```
```python
    with Path.open(vvuq_param_file) as f:
        data = f.read()
        d = literal_eval(data)
```
Looking at the example vvuq_parameters.txt file, it looks like this could be assumed to be in JSON format? If so, it would be better to use the load function from the json module in the standard library (and also probably change the file extension of the example parameters file to vvuq_parameters.json).
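A quick sketch of the suggested change, assuming the file is valid JSON; the file contents here are a hypothetical example (the real vvuq_parameters.txt is not shown in this thread), with each parameter mapped to `[default]` or `[default, lower_bound, upper_bound]` as the parsing code in this PR expects.

```python
# Sketch: json.load is stricter and more standard than ast.literal_eval when
# the parameters file is valid JSON.
import json
from pathlib import Path
from tempfile import NamedTemporaryFile

# Hypothetical parameters file content (names are illustrative)
content = '{"particle_charge": [10.0, 5.0, 15.0], "num_steps": [100]}'

with NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    f.write(content)
    path = Path(f.name)

with path.open() as f:
    d = json.load(f)

print(d["particle_charge"])  # [10.0, 5.0, 15.0]
path.unlink()  # clean up the temporary file
```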
```python
        vvuq_params[parameter] = {
            "type": "float",
            "default": param_values[0],
        }
        vary_length = 3
        if len(param_values) == vary_length:
            # Make the chaospy distribution. Possibly to be able to edit this later?
            parameter_variations[parameter] = cp.Uniform(
                param_values[1], param_values[2]
            )
```
```diff
-        vvuq_params[parameter] = {
-            "type": "float",
-            "default": param_values[0],
-        }
-        vary_length = 3
-        if len(param_values) == vary_length:
-            # Make the chaospy distribution. Possibly to be able to edit this later?
-            parameter_variations[parameter] = cp.Uniform(
-                param_values[1], param_values[2]
-            )
+        default_value, lower_bound, upper_bound = param_values
+        vvuq_params[parameter] = {
+            "type": "float",
+            "default": default_value,
+        }
+        parameter_variations[parameter] = cp.Uniform(lower_bound, upper_bound)
```
Rather than indexing into param_values and explicitly checking its length (and doing nothing if it isn't the correct length), I would say unpacking it would be better, as this (a) makes the code self-documenting about what the different indices into param_values are expected to correspond to and (b) will raise an error if param_values is of an unexpected length.
```python
    # If we're running remotely, tell the submission to wait until done
    if "archer2" in fab.env.remote:
        fab.update_environment(
            {
                "job_dispatch": "cd /work/$project/$project/$username ; sbatch --wait",
            }
        )
```
Do we need this in this case? Ideally we'd be able to send off batches of runs asynchronously here, as there are no sequential dependencies between jobs.
```python
    # Copy the mesh and conditions files here for FabSIM to read it properly
    shutil.copyfile(mesh_file, config_path / Path(mesh_file).name)
    shutil.copyfile("conditions.xml", config_path / "conditions.xml")
```
This seems to assume the conditions.xml file is in the local working directory? Is it always the case that the EasyVVUQ encoder will output it there? And will overwrite an existing file with that name? And will the file be removed after the job has finished? If not it would probably be better to have this written to a temporary file / directory which is automatically cleaned up.
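A minimal sketch of the temporary-directory option suggested above, using only the standard library; writing the conditions file into the temporary directory here is a stand-in for the encoder output, and `config_path` is just an example destination.

```python
# Sketch: write the generated conditions.xml into a temporary directory that
# is cleaned up automatically, instead of the current working directory.
import shutil
import tempfile
from pathlib import Path

config_path = Path(tempfile.mkdtemp())  # stand-in for the real config_path

with tempfile.TemporaryDirectory() as tmp_dir:
    conditions_file = Path(tmp_dir) / "conditions.xml"
    conditions_file.write_text("<CONDITIONS />")  # stand-in for encoder output
    # copy into the FabSim config directory as the existing code does
    shutil.copyfile(conditions_file, config_path / "conditions.xml")

# tmp_dir (and the generated conditions_file) no longer exist here
print(conditions_file.exists())  # False
```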
```python
    run_vvuq_sequential: Set to False to parallelise. Default is True because this
        can cause issues if unchecked.
```
Are the issues with run_vvuq_sequential=False that, if running on localhost, we will end up with too many jobs trying to run in parallel? If so, this presumably would not be the case when running on a system with a scheduler like ARCHER2, where we would probably not want to submit jobs sequentially, so this needs some clarification.
```python
    campaign = uq.Campaign(name=output_dir, params=params, actions=actions)

    # Make and attach a sampler to the campaign
    my_sampler = uq.sampling.PCESampler(vary)
```
Ideally we would probably want to be able to pass in polynomial_order as an argument to the task, as this controls the fidelity of the expansion and the computational cost (the number of parameter sets the model is evaluated for will be (polynomial_order + 1)**n_parameter_varied, where n_parameter_varied is the number of parameters varied).
```python
    )

    # Make the VVUQ campaign
    campaign = uq.Campaign(name=output_dir, params=params, actions=actions)
```
If output_dir does not exist we get a FileNotFoundError at this point. Ideally we'd either automatically deal with creating the directory if necessary or document that this needs to be done before running the task.
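The "create it if necessary" option is a one-liner with the standard library; a sketch (output_dir here is just an example path in a temporary location):

```python
# Sketch: create the campaign working directory up front to avoid the
# FileNotFoundError; exist_ok=True makes the call safe to repeat.
import tempfile
from pathlib import Path

output_dir = Path(tempfile.mkdtemp()) / "vvuq_campaign"
output_dir.mkdir(parents=True, exist_ok=True)
output_dir.mkdir(parents=True, exist_ok=True)  # idempotent: no error if it exists
print(output_dir.is_dir())  # True
```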
```python
    campaign.set_sampler(my_sampler)

    # Run VVUQ
    campaign.execute(sequential=run_vvuq_sequential).collate()
```
```diff
-    campaign.execute(sequential=run_vvuq_sequential).collate()
+    campaign.execute(sequential=bool(run_vvuq_sequential)).collate()
```
I think this might be necessary to ensure a string input, parsed from command-line options passed to the task, is converted to a bool.
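One caveat worth flagging: plain `bool()` on a non-empty string is always `True`, including for the string `"False"`, so command-line strings may need explicit parsing. A sketch with a hypothetical `parse_bool` helper (not FabSim/FabNESO API):

```python
# Sketch: any non-empty string is truthy, so bool("False") is True.
def parse_bool(value) -> bool:
    """Illustrative string-aware boolean parser for command-line options."""
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() in {"true", "1", "yes"}


print(bool("False"))        # True  -- the pitfall
print(parse_bool("False"))  # False
print(parse_bool("True"))   # True
```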
```python
    # Run an instance of NESO
    neso(
        temporary_config_dir,
```
Passing temporary_config_dir as the config argument to neso works when running on localhost, but as it's an absolute path on the local filesystem, it doesn't work when running on a remote system such as ARCHER2: we get permission denied errors when FabSim tries to create the config directory on the remote filesystem using a file path from the local filesystem.
Resolves #5
Implementing a VVUQ campaign into the FabNESO plugin. The necessary inputs from the user are a conditions template and, in a config directory, a txt file containing the parameter substitutions and the outputs to read.
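For reference, a hypothetical example of what such a parameters file might look like, inferred from the parsing code in this PR (parameter names are illustrative; each entry maps a parameter to `[default]` or `[default, lower_bound, upper_bound]`, with a uniform distribution constructed over the bounds when three values are given; the outputs-to-read portion of the file is not shown in this excerpt):

```json
{
    "particle_charge": [10.0, 5.0, 15.0],
    "particle_mass": [0.5]
}
```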