Conversation

@dleggat (Contributor) commented Feb 13, 2024

Resolves #5

This implements a VVUQ campaign in the FabNESO plugin. The necessary inputs from the user are a conditions template and a txt file specifying the parameter substitutions and the outputs to read, both placed in a config directory.
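For illustration, judging from the parsing code discussed in the review below, a parameter file of this kind might look something like the following (parameter names and values here are hypothetical; each entry lists a default, optionally followed by lower and upper bounds to vary between):

```json
{
  "particle_charge": [1.0],
  "particle_number_density": [1.0e5, 8.0e4, 1.2e5]
}
```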

The FabNESO/run_vvuq_instance.sh script seems necessary because of the way that the VVUQ campaign attempts to run the NESO solver after encoding. There are two options for running the simulation within the VVUQ code, ExecuteLocal or ExecutePython. The former executes a bash command, whilst the latter executes a Python function.

I had initially assumed that the latter would be the most useful, since we already have Python functions for running NESO set up, but VVUQ gives no way of determining what run you're doing, or from what location in this case - the only piece of information from the campaign that you can access is the current parameter values.

On the other hand, ExecuteLocal runs a command in the current VVUQ run directory, so by calling a Python single_run_vvuq task from the bash script using this, we can do the location manipulation that we need for the decoder to extract the information that we need.
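To illustrate the distinction with minimal stand-ins (these are not the real EasyVVUQ classes): an ExecutePython-style action only receives the sampled parameter dictionary, while an ExecuteLocal-style action runs a shell command with the per-run directory as its working directory, which is what lets a wrapper script recover its location:

```python
import subprocess


def execute_python_style(func, params):
    # The callable sees only the sampled parameter values -- no run
    # directory, no run index.
    return func(params)


def execute_local_style(command, run_dir):
    # The command runs with the VVUQ run directory as its working
    # directory, so e.g. `pwd` (or a wrapper script) can locate the
    # encoded inputs and outputs for this specific run.
    result = subprocess.run(
        command, shell=True, cwd=run_dir, capture_output=True, text=True
    )
    return result.stdout.strip()
```

With this, `execute_local_style("pwd", some_run_dir)` returns the run directory, while the Python-style action has no way to obtain it.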

@dleggat dleggat requested a review from matt-graham February 13, 2024 11:52
@matt-graham (Collaborator) left a comment

Thanks @dleggat for working on this.

> The FabNESO/run_vvuq_instance.sh script seems necessary because of the way that the VVUQ campaign attempts to run the NESO solver after encoding. There are two options for running the simulation within the VVUQ code, ExecuteLocal or ExecutePython. The former executes a bash command, whilst the latter executes a Python function.
>
> I had initially assumed that the latter would be the most useful, since we already have Python functions for running NESO set up, but VVUQ gives no way of determining what run you're doing, or from what location in this case - the only piece of information from the campaign that you can access is the current parameter values.
>
> On the other hand, ExecuteLocal runs a command in the current VVUQ run directory, so by calling a Python single_run_vvuq task from the bash script using this, we can do the location manipulation that we need for the decoder to extract the information that we need.

Thanks for the explanation. What you've proposed makes sense, I think, if we are going the route of using the EasyVVUQ Campaign class to orchestrate dispatching the runs. However, given the constraints this seems to impose, I'm not sure that is the best solution here. In particular, it looks like at the moment the neso tasks are dispatched sequentially and synchronously by default, even when running on a batch scheduler like ARCHER2? Given there are no dependencies between the different runs, ideally we'd like to be able to schedule them all to run in parallel (unlike the VBMC case, where we needed to wait for each run to complete before deciding on the parameters to use for the next run), as this would be a significant time saver on a system like ARCHER2. The nested calls to a FabSim task via the command-line interface within another FabSim task also produce quite verbose output, as the FabSim output from the outer and inner tasks is woven together.

An alternative to using the whole EasyVVUQ campaign infrastructure is to use just the components we need to perform a polynomial chaos expansion (PCE), as we are doing here - specifically the PCESampler and PCEAnalysis classes. The former allows constructing the sets of parameters at which to evaluate the model, and the latter can be used to process the model outputs to fit the coefficients of the PCE. If we use the parameter dictionaries output by the PCESampler to create a set of conditions files in a 'sweep' directory, we can then use the FabSim run_ensemble function to dispatch jobs for each of these parameter sets. This means we can reuse the existing infrastructure for overriding parameters in the conditions files without needing to replicate it using an EasyVVUQ encoder, and then have a separate task which performs the analysis step using the PCEAnalysis class once the ensemble of runs has completed and the results have been fetched back.
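As a rough sketch of what the sweep-directory construction could look like (the function and file names are hypothetical; in the real workflow the parameter dictionaries would come from the PCESampler rather than a precomputed grid):

```python
import itertools
import json
from pathlib import Path


def make_sweep_dir(sweep_root, param_nodes):
    """Write one parameter file per point of the tensor grid.

    param_nodes maps each varied parameter name to its list of sample
    values. FabSim's run_ensemble can then dispatch one independent job
    per sub-directory, so the runs can be scheduled in parallel.
    """
    names = list(param_nodes)
    run_dirs = []
    grid = itertools.product(*(param_nodes[name] for name in names))
    for i, values in enumerate(grid):
        run_dir = Path(sweep_root) / f"run_{i}"
        run_dir.mkdir(parents=True)
        params = dict(zip(names, values))
        (run_dir / "parameters.json").write_text(json.dumps(params))
        run_dirs.append(run_dir)
    return run_dirs
```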

I've had a go at implementing this alternative approach on a branch mmg/pce-task. It works both locally and on ARCHER2, and on the latter allows jobs to be submitted asynchronously and in parallel. It's a shame given all your efforts on this, but as it seems like it would be quite a lot of work to get the approach here working, I'd lean towards closing this PR and opening a new PR with the alternative approach instead. Apologies about this; I should have looked into this more beforehand and put more guidance in the issue.

def _parse_vvuq_parameters(
    vvuq_param_file: Path,
) -> tuple[dict[Any, dict[str, float]], dict[str, Any], list[str]]:
    # read the inpurt file
Suggested change
-    # read the inpurt file
+    # read the input file

Comment on lines +591 to +593
with Path.open(vvuq_param_file) as f:
    data = f.read()
d = literal_eval(data)

Looking at the example vvuq_parameters.txt file, it looks like this could be assumed to be in JSON format? If so, it would be better to use the load function from the json module in the standard library (and also probably change the file extension of the example parameters file to vvuq_parameters.json).
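A minimal sketch of that suggestion (assuming the file content is valid JSON; the function name here is illustrative):

```python
import json
from pathlib import Path


def read_vvuq_parameters(vvuq_param_file):
    # json.load replaces the read + literal_eval pair and fails with a
    # clear JSONDecodeError on malformed input.
    with Path(vvuq_param_file).open() as f:
        return json.load(f)
```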

Comment on lines +597 to +606
vvuq_params[parameter] = {
    "type": "float",
    "default": param_values[0],
}
vary_length = 3
if len(param_values) == vary_length:
    # Make the chaospy distribution. Possibly to be able to edit this later?
    parameter_variations[parameter] = cp.Uniform(
        param_values[1], param_values[2]
    )

Suggested change
-vvuq_params[parameter] = {
-    "type": "float",
-    "default": param_values[0],
-}
-vary_length = 3
-if len(param_values) == vary_length:
-    # Make the chaospy distribution. Possibly to be able to edit this later?
-    parameter_variations[parameter] = cp.Uniform(
-        param_values[1], param_values[2]
-    )
+default_value, lower_bound, upper_bound = param_values
+vvuq_params[parameter] = {
+    "type": "float",
+    "default": default_value,
+}
+parameter_variations[parameter] = cp.Uniform(lower_bound, upper_bound)

Rather than indexing into param_values and explicitly checking its length (and silently doing nothing if it isn't the correct length), I would say unpacking it would be better, as this (a) makes the code self-documenting about what the different indices into param_values are expected to correspond to, and (b) will raise an error if param_values is of an unexpected length.

Comment on lines +648 to +654
# If we're running remotely, tell the submission to wait until done
if "archer2" in fab.env.remote:
    fab.update_environment(
        {
            "job_dispatch": "cd /work/$project/$project/$username ; sbatch --wait",
        }
    )
Do we need this in this case? Ideally we'd be able to send off batches of runs asynchronously here, as there are no sequential dependencies between jobs.


# Copy the mesh and conditions files here for FabSIM to read it properly
shutil.copyfile(mesh_file, config_path / Path(mesh_file).name)
shutil.copyfile("conditions.xml", config_path / "conditions.xml")

This seems to assume the conditions.xml file is in the local working directory? Is it always the case that the EasyVVUQ encoder will output it there? Will it overwrite an existing file with that name? And will the file be removed after the job has finished? If not, it would probably be better to have this written to a temporary file/directory which is automatically cleaned up.
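One possible shape for that, as a sketch (the staged_config helper and file names are illustrative, not part of the PR):

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def staged_config(mesh_file, conditions_file):
    # Copy the mesh and conditions files into a throwaway directory so
    # nothing in a persistent config directory is overwritten, and the
    # copies are removed automatically when the block exits.
    with tempfile.TemporaryDirectory() as tmp:
        config_path = Path(tmp)
        shutil.copyfile(mesh_file, config_path / Path(mesh_file).name)
        shutil.copyfile(conditions_file, config_path / "conditions.xml")
        yield config_path
```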

Comment on lines +730 to +731
run_vvuq_sequential: Set to False to parallelise. Default is True because this
can cause issues if unchecked.

Are the issues with run_vvuq_sequential=False that, when running on localhost, we would end up with too many jobs trying to run in parallel? If so, this presumably would not be the case when running on a system with a scheduler like ARCHER2, where we would probably not want to submit jobs sequentially, so this needs some clarification.

campaign = uq.Campaign(name=output_dir, params=params, actions=actions)

# Make and attach a sampler to the campaign
my_sampler = uq.sampling.PCESampler(vary)

Ideally we would probably want to be able to pass in polynomial_order as an argument to the task, as this controls the fidelity of the expansion and the computational cost (the number of parameter sets the model is evaluated at will be (polynomial_order + 1)**n_parameter_varied, where n_parameter_varied is the number of parameters varied).
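The scaling in question, written out as a small helper (name hypothetical):

```python
def n_pce_evaluations(polynomial_order, n_parameters_varied):
    # Size of the full tensor grid of sample points: (polynomial_order
    # + 1) quadrature nodes per varied parameter, combined across all
    # varied parameters.
    return (polynomial_order + 1) ** n_parameters_varied
```

For example, raising the order from 3 to 4 with three varied parameters grows the ensemble from 64 to 125 runs.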

)

# Make the VVUQ campaign
campaign = uq.Campaign(name=output_dir, params=params, actions=actions)

If output_dir does not exist, we get a FileNotFoundError at this point. Ideally we'd either automatically create the directory if necessary or document that this needs to be done before running the task.
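The automatic option is essentially a one-liner before constructing the campaign (sketch; helper name hypothetical):

```python
from pathlib import Path


def ensure_output_dir(output_dir):
    # Create the campaign work directory (and any missing parents) up
    # front so uq.Campaign does not raise FileNotFoundError; exist_ok
    # makes this safe to call repeatedly.
    path = Path(output_dir)
    path.mkdir(parents=True, exist_ok=True)
    return path
```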

campaign.set_sampler(my_sampler)

# Run VVUQ
campaign.execute(sequential=run_vvuq_sequential).collate()

Suggested change
-campaign.execute(sequential=run_vvuq_sequential).collate()
+campaign.execute(sequential=bool(run_vvuq_sequential)).collate()

I think this might be necessary to ensure a string input, parsed from command-line options passed to the task, is converted to a bool.
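One caveat worth noting: bool() of any non-empty string is True, so bool("False") == True and a plain cast will not round-trip command-line strings. An explicit parse (helper name hypothetical) is safer:

```python
def parse_bool(value):
    # Accept actual bools plus the usual string spellings; anything
    # else is rejected rather than silently coerced.
    if isinstance(value, bool):
        return value
    text = str(value).strip().lower()
    if text in ("true", "1", "yes"):
        return True
    if text in ("false", "0", "no"):
        return False
    msg = f"cannot interpret {value!r} as a bool"
    raise ValueError(msg)
```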


# Run an instance of NESO
neso(
    temporary_config_dir,

Passing temporary_config_dir as the config argument to neso works when running on localhost, but as it's an absolute path on the local filesystem, it doesn't work when running on a remote system such as ARCHER2: we get permission denied errors when FabSim tries to create the config directory on the remote filesystem using a file path from the local filesystem.


Successfully merging this pull request may close these issues.

Explore running EasyVVUQ campaigns with FabNESO
