Refactor configuration handling on system initialization by grassesi · Pull Request #80 · clessig/atmorep

grassesi · 2025-01-06T09:43:08Z

This merge request addresses most of the pain points with the configuration raised in issue #62.

A new configuration backend (utils.config) has been introduced consisting of a hierarchy of dataclasses. The purpose of this backend is to document all available configuration options, allow comfortable access (especially to nested parameter lists such as "fields"), and provide serialization/deserialization to/from the already used model_id<wandb_id>.json files to track configuration.
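As a rough illustration of the idea (not the actual contents of utils.config), a hierarchy of dataclasses with a JSON round trip could look like the sketch below. RootConfig and TrainingConfig are names taken from the commit messages; FieldConfig, all attribute names, and the default values are assumptions made up for this example.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class FieldConfig:
    # Hypothetical per-field options; the real backend defines its own set.
    name: str
    levels: list[int] = field(default_factory=list)


@dataclass
class TrainingConfig:
    # Hypothetical training options with illustrative defaults.
    batch_size: int = 8
    learning_rate: float = 1e-4


@dataclass
class RootConfig:
    wandb_id: str
    fields: list[FieldConfig] = field(default_factory=list)
    training: TrainingConfig = field(default_factory=TrainingConfig)

    def to_json(self) -> str:
        # asdict() recurses into nested dataclasses and lists of dataclasses.
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_json(cls, text: str) -> "RootConfig":
        # Rebuild the nested dataclasses explicitly from plain dicts.
        raw = json.loads(text)
        return cls(
            wandb_id=raw["wandb_id"],
            fields=[FieldConfig(**f) for f in raw["fields"]],
            training=TrainingConfig(**raw["training"]),
        )


config = RootConfig(
    wandb_id="abc123",
    fields=[FieldConfig("temperature", [850, 500])],
)
restored = RootConfig.from_json(config.to_json())
```

Because every option is a declared dataclass attribute, editors and static analysis tools can list and check them, and nested lists such as the fields are reachable as `restored.fields[0].levels` instead of by positional indexing.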

The new configuration backend is designed to replace utils.utils.Config. To maintain compatibility during this migration, utils.config_adapter.Config is designed as a drop-in replacement that supports the same semantics as utils.utils.Config (except assigning unknown configuration options) while already taking advantage of the new backend. This enables tools for exploring and refactoring the code base that rely on static code analysis.
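The adapter idea can be sketched as follows. This is a simplified illustration, not the code in utils.config_adapter: it assumes that the old Config exposed options as flat attributes, and the dataclasses and option names are placeholders.

```python
from dataclasses import dataclass, field, fields, is_dataclass


@dataclass
class TrainingConfig:
    batch_size: int = 8  # hypothetical option


@dataclass
class RootConfig:
    wandb_id: str = ""
    training: TrainingConfig = field(default_factory=TrainingConfig)


class Config:
    """Flat attribute access on top of a nested dataclass (illustrative only)."""

    def __init__(self, backend: RootConfig):
        # Bypass our own __setattr__ while storing the backend.
        object.__setattr__(self, "_backend", backend)

    def _owner(self, name: str):
        # Search the root and its nested dataclasses for the option.
        candidates = [self._backend]
        while candidates:
            node = candidates.pop()
            names = {f.name for f in fields(node)}
            if name in names and not is_dataclass(getattr(node, name)):
                return node
            candidates.extend(
                getattr(node, n) for n in names if is_dataclass(getattr(node, n))
            )
        raise AttributeError(f"unknown configuration option: {name}")

    def __getattr__(self, name: str):
        # Only called when normal lookup fails; private names are never options.
        if name.startswith("_"):
            raise AttributeError(name)
        return getattr(self._owner(name), name)

    def __setattr__(self, name: str, value) -> None:
        # Assigning an option that the backend does not declare raises.
        setattr(self._owner(name), name, value)


cf = Config(RootConfig(wandb_id="abc123"))
cf.batch_size = 16  # forwarded to cf._backend.training.batch_size
```

In this sketch, rejecting unknown options is the one deliberate difference from a plain attribute bag: a typo such as `cf.batchsize = 16` fails immediately instead of silently creating a new option.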

Important file paths (input data, pretrained models, experiment output) were hardwired throughout the system. They are now handled more flexibly: config.config now includes the dataclasses PlatformConfig and UserConfig. PlatformConfig holds the path to the input data as well as a location for pretrained models. These platform-specific parameters have to be provided by a simple JSON file for each platform. UserConfig specifies all output paths. Under the assumption that atmorep will be run on an HPC platform using Slurm, the output directories will be subdirectories of $SLURM_SUBMIT_DIR.
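A minimal sketch of this split, under stated assumptions: the `input_data` key and the `results` attribute appear in this pull request, while the `pretrained_models` key, the `models` attribute, the subdirectory names, and both constructor names are hypothetical.

```python
import json
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class PlatformConfig:
    """Read-only locations that differ between platforms."""

    input_data: Path
    pretrained_models: Path  # hypothetical key name

    @classmethod
    def from_json(cls, path: Path) -> "PlatformConfig":
        # One small JSON file per platform provides these values.
        raw = json.loads(Path(path).read_text())
        return cls(Path(raw["input_data"]), Path(raw["pretrained_models"]))


@dataclass(frozen=True)
class UserConfig:
    """Output locations of a single user or job."""

    results: Path
    models: Path  # hypothetical attribute

    @classmethod
    def from_slurm_env(cls) -> "UserConfig":
        # Slurm sets SLURM_SUBMIT_DIR to the directory the job was submitted from.
        base = Path(os.environ["SLURM_SUBMIT_DIR"])
        return cls(results=base / "results", models=base / "models")
```

Outside of a Slurm job, `from_slurm_env` raises a `KeyError` in this sketch, which makes the stated assumption about the execution environment explicit.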

Initialization was streamlined in core.evaluator, core.train, core.train_multi:

- Code relating to the handling of renamed and new options was moved to the configuration backend.
- Duplicated initialization code in core.evaluator, core.train, core.train_multi was consolidated into the core.train.initialize_atmorep function. Initialization includes:
  - initializing the UserConfig from $SLURM_SUBMIT_DIR
  - calls to utils.utils.init_torch, utils.utils.setup_ddp
  - initializing an 'empty' utils.config_adapter.Config object
  - returning devices, par_rank, par_size, config
- The duplicated train_continue code (identical in train/train_multi) was removed from train_multi.
- Switching between continuing training and fresh training is now a simple boolean toggle.
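To illustrate the first point in the list above, handling renamed and new options in one place could look roughly like this. The function and all option names are hypothetical; the actual backend may implement it differently.

```python
def upgrade_options(raw: dict, renamed: dict, defaults: dict) -> dict:
    """Return a copy of raw with old option names mapped to new ones
    and missing (newly introduced) options filled with defaults."""
    # Map every known old key to its new name; leave other keys untouched.
    upgraded = {renamed.get(key, key): value for key, value in raw.items()}
    # Options added after the file was written get their default value.
    for key, value in defaults.items():
        upgraded.setdefault(key, value)
    return upgraded


# Hypothetical example: an old JSON file that predates two changes.
old_file = {"old_lr": 0.1, "batch_size": 8}
current = upgrade_options(
    old_file,
    renamed={"old_lr": "learning_rate"},
    defaults={"n_size": 3},
)
```

Applying this step once during deserialization means that core.evaluator, core.train and core.train_multi no longer need their own checks for outdated option names.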

For a summary of the changes:

- 939f7b6 - 91aeb8d implement the new configuration backend, including (de)serialization and tests.
- 67e9951 - 9c362cc include the improvements to the handling of the global paths defined in config.config from the branch grasse-62-refactor_path_config.
- 9b5bcd4 - 7dff83d implement the compatibility layer by introducing the utils.config_adapter.Config class, and include small fixes (missing options etc.) for the configuration backend.
- e66f3f8 - 795d99e streamline and simplify redundant code in core.evaluator, core.train, core.train_multi related to the handling of renamed options (now handled by the configuration backend) and initialization (now handled centrally by core.train.initialize_atmorep).

This change aligns the interface for initialization better with utils.utils.Config

config_facade.ConfigFacade was not a "Facade" that facilitates access to some subcomponents, but rather functions as an "Adapter" that allows the non-refactored code to interact with the refactored configuration. Accordingly, it was renamed to config_adapter.Config, and all references to a "Facade" in the documentation and test function names were updated as well.

clessig · 2025-01-27T16:20:06Z

atmorep/config/jsc.json

@@ -0,0 +1,4 @@
+{
+    "input_data": "/p/scratch/atmo-rep/data/era5_1deg/months/",


The paths here should only be base paths, e.g. /p/scratch/atmo-rep/. The rest should be consistent across infras.
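One way to read this suggestion, sketched under assumptions: the platform file stores only the base path, and the layout below it is defined once in code. The constant and function names are hypothetical; the subdirectory is taken from the path in the diff above.

```python
from pathlib import Path

# Layout below the base path, assumed to be identical on every platform.
INPUT_DATA_SUBDIR = Path("data/era5_1deg/months")


def input_data_dir(base_path) -> Path:
    """Derive the input data directory from a platform-specific base path."""
    return Path(base_path) / INPUT_DATA_SUBDIR


jsc_input = input_data_dir("/p/scratch/atmo-rep/")
```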

clessig · 2025-01-27T16:23:46Z

atmorep/core/train.py

-def train_continue( wandb_id, epoch, Trainer, epoch_continue = -1) :

+def initialize_atmorep(with_ddp):
+  atmorep_project_dir = Path(os.environ["SLURM_SUBMIT_DIR"])


The ddp init should be in a separate file from the path setup.

clessig · 2025-01-27T16:24:08Z

atmorep/core/evaluator.py


    dates = args['dates']
    evaluator = Evaluator.load( cf, model_id, model_epoch, devices)
+    print("after loading:", evaluator.cf.user_config)


That's probably a debug comment?

clessig · 2025-01-27T16:25:31Z

atmorep/core/train.py

-  except :
-    
+if __name__ == "__main__":
+  train_fresh = False


Admittedly, how it was before is not pretty, but having an additional variable doesn't help either.

clessig · 2025-01-27T16:26:20Z

atmorep/core/trainer.py


    if 0 == cf.par_rank :
-      directory = Path( config.path_results, 'id{}'.format( cf.wandb_id))
+      directory = self.user_config.results / f"id{cf.wandb_id}"


Consistent usage of quotation marks

grassesi and others added 30 commits November 20, 2024 11:41

wip field config and temporary reference model file

939f7b6

Implement config objects for the fields and predicted fields

a9bec72

use namedtuple replacing tuples of time,lat,lon

a83c535

Implement config object for the model

83559b3

write serialization/deserialization

de205f0

add missing docstring

d6c62a0

Implement config objects for run and training

774f829

Add learning rate, losses and bert config to TrainingConfig

8b4f03f

Fix spelling errors and add serialization and deserialisation for RunConfig

f668b84

remove prediction config

785a968

move grad_checkpointing to RunConfig

61e53a6

add (de)serialization to RunConfig

7366df5

add (de)serialization for TrainingConfig and RootConfig class

42f4064

move sample multiformer config into tests

3bf0c98

split (de)serialisation into 2 step process for better testability

429f1b0

add unit tests for new config

cb0c993

fix incorrect key with_layer_norm => with_layernorm

da70241

add missing serialization of geo_range_sampling

db4e806

use list instead of Iterable for typing

b1a3163

add configuration argument optimizer_zero

f162142

add n_size configuration argument

7170062

add configuration of cross attetion for multiformers

afeeb51

add configuration to specify singleformer to build multiformer

9853d0a

specify normalization strategy for each field

38fb41d

update tests to exclude deprecated parameters

91aeb8d

refactor: split configuration of pathes into UserConfig and HPC_Platform classes

67e9951

dont run commands on installing atmorep, fix pytorch version

8dc3a26

add UserConfig instance to Config

20f8972

use UserConfig in trainer

d516930

use UserConfig for datta writing

83c8652

grassesi added 23 commits December 17, 2024 11:36

test adding options to empty Config

9f07920

Implement empty initialization of ConfigFacade.

0b100c5

This change aligns the interface for initialization better with utils.utils.Config

add: pytest option

6a9d202

fix: add missing argument for compatibility

5adaf5b

cosmetic fix

ac82472

cosmitc fix

5b6aff2

use introspection to create empty AtmorepConfig instance

7dff83d

implement handling of renamed or new options in AtmorepConfig

e66f3f8

improve switching between train/train_continue

d59b48d

bundle calls to init_torch, setup_ddp, Config in one place

c1ae994

use central initialization in evaluator/train/train_multi

695a404

replace old Config usng config_adapter.Config

9b72a14

remove manual handling of renamed / added options

0685b21

fix indentation

014431c

handle UserConfig also on initialization

3434434

fix: mark static method parse_args as such

a168b71

remove unused option month

6314026

comments

8ea3aab

test with mocked UserConfig

ed5f98e

remove redundant code due to usage of initialize_atmorep

c11116a

fix: missing argument

795d99e

Merge branch 'develop' into grasse-62-refactor_configuration

79f3e92

grassesi requested a review from clessig January 7, 2025 08:14

grassesi added 2 commits January 7, 2025 16:31

fix: forgotten dict.items(), load_json method

5861257

fix: missing adapter method get_self_dict

140e65e

grassesi marked this pull request as ready for review January 13, 2025 14:49

grassesi added 2 commits January 14, 2025 11:08

fix: raise error if config file is not found

476dd53

change: move calculation of model run directory to UserConfig

465a7ba

clessig reviewed Jan 27, 2025

grassesi wants to merge 87 commits into develop from grasse-62-refactor_configuration
