Skip to content

Refactor configuration handling on system initialization#80

Open
grassesi wants to merge 87 commits intodevelopfrom
grasse-62-refactor_configuration
Open

Refactor configuration handling on system initialization#80
grassesi wants to merge 87 commits intodevelopfrom
grasse-62-refactor_configuration

Conversation

@grassesi
Copy link
Collaborator

@grassesi grassesi commented Jan 6, 2025

This merge request addresses most of the pain points with the configuration raised in issue #62.

A new configuration backend (utils.config) has been introduced consisting of a hierarchy of dataclasses. The purpose of this backend is to document all available configuration options, allow comfortable access (especially to nested parameter lists such as "fields"), and provide serialization/deserialization to/from the already used model_id<wandb_id>.json files to track configuration.

The new configuration backend is designed to to replace utils.utils.Config. To maintain compatibility during this migration utils.config_adapter.Config is designed as a drop-in replacement supporting all the same semantics as utils.utils.Config (except assigning unknown configuration options), while already taking advantage of the new backend. This enables a bunch of tools for exploring and refactoring the code base that rely on static code analysis.

Information about important file paths such as input data, pretrained models and experiment output was very hardwired throughout the system. It is now more flexible to handle: config.config now includes the dataclasses and PlatformConfig and UserConfig. PlatformConfig includes the path to the input data as well as a location for pretrained models. These platform unique parameters have to be provided by a simple json file for each platform. UserConfig specifies all output pathes. Under the assumption that atmorep will be run on a HPC platform using slurm, the output directories will be subdirectories of $SLURM_SUBMIT_DIR.

Initialization was streamlined in core.evaluator, core.train, core.train_multi:

  • code relating to handling of renamed and new options was moved to the configuration backend.
  • Duplication of initializing code in core.evaluator, core.train, core.train_multi was consolidated into the core.train.initialize_atmorep function. Initialization includes:
    • initializing the UserConfig from $SLURM_SUBMIT_DIR
    • calls to utils.utils.init_torch, utils.utils.setup_ddp
    • initializing an 'empty' utils.config_adapter.Config object
    • returning devices, par_rank, par_size, config
  • Duplicated code train_continue (identical in train/train_mutli) was removed from from train_multi.
  • switching between continuing training or fresh training is now a simple boolean toggle

For a summary of the changes:

  • 939f7b6 - 91aeb8d deal with implementing the new backend for configuration including (de)serialization and tests.
  • 67e9951 - 9c362cc includes the improvements to the handling of the global paths defined in config.config from the branch grasse-62-refactor_path_config
  • 9b5bcd4 - 7dff83d implement the compatibility layer by introducing the utils.config_adapter.Config class, as well as including small fixes for (missing options etc.) for the configuration backend.
  • e66f3f8 - 795d99e streamline and simplify redundant code in core.evaluator, core.train, core.train_multi related to the handling of renamed options (now handled by the configuration backend) and initialization (now handled centrally by core.train.initialize_atmorep)

grassesi and others added 30 commits November 20, 2024 11:41
This change aligns the interface for initialization
better with utils.utils.Config
config_facade.ConfigFacade was not a "Facade" that
facilitates access to some subcomponents, but
rather functions as a "Adapter" that allows the
non-refactored code to interact with the refactored
configuration.

Accordingly it was renamed to config_adapter.Config
and all references to a "Facade" in the documentation/
test function names where updated as well.
@grassesi grassesi requested a review from clessig January 7, 2025 08:14
@grassesi grassesi marked this pull request as ready for review January 13, 2025 14:49
@@ -0,0 +1,4 @@
{
"input_data": "/p/scratch/atmo-rep/data/era5_1deg/months/",
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The paths here should only be base paths, e.g. /p/scratch/atmo-rep/. The rest should be consistent across infras.

def train_continue( wandb_id, epoch, Trainer, epoch_continue = -1) :

def initialize_atmorep(with_ddp):
atmorep_project_dir = Path(os.environ["SLURM_SUBMIT_DIR"])
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ddp init should be in a separate file from the path setup.


dates = args['dates']
evaluator = Evaluator.load( cf, model_id, model_epoch, devices)
print("after loading:", evaluator.cf.user_config)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's probably a debug comment?

except :
if __name__ == "__main__":
train_fresh = False
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not pretty how it was before, admittedly, but have an additional variable doesn't help either.


if 0 == cf.par_rank :
directory = Path( config.path_results, 'id{}'.format( cf.wandb_id))
directory = self.user_config.results / f"id{cf.wandb_id}"
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consistent usage of quotation marks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants

Comments