Allow downscaling to load merged datasets by AnnaKwa · Pull Request #846 · ai2cm/ace

AnnaKwa · 2026-02-17T23:37:53Z

These changes allow the downscaling data loader configs to load merged datasets.

Tests added


jpdunc23 · 2026-02-18T19:19:47Z

fme/downscaling/data/config.py

Since you already handle concatenation in build_from_config_sequence, you could maybe use MergeNoConcatDatasetConfig instead of MergeDatasetConfig. We use MergeNoConcatDatasetConfig for training in fme.coupled since we also handle concatenation there.

AnnaKwa · 2026-02-18T22:47:00Z

fme/downscaling/inference/output.py

    def _single_xarray_config(
-        coarse: list[XarrayDataConfig]
-        | Sequence[XarrayDataConfig | XarrayEnsembleDataConfig],
+        coarse: Sequence[


Currently inference contains a check that the coarse data config is a single XarrayDataConfig. This can be updated later in a subsequent PR to allow a merged config as well.


AnnaKwa · 2026-02-18T22:50:27Z

fme/downscaling/data/config.py

-                data_path = self.fine[0].data_path
-                file_pattern = self.fine[0].file_pattern
-                raw_paths = get_raw_paths(data_path, file_pattern)
+                first_config = self._first_data_config(self.fine[0])


This block will be removed in a subsequent PR where the static inputs are loaded using the paths provided in a training config field.

AnnaKwa · 2026-02-18T22:51:07Z

fme/downscaling/data/config.py

        assert all(
-            dataset_fine_subset.variable_metadata[key]
-            == dataset_coarse_subset.variable_metadata[key]
+            dataset_fine_subset.variable_metadata[key].units


For surface pressure, the fine and coarse datasets have different long_name.


jpdunc23

Mostly looks good to me! I have various suggestions that I think would help simplify the changes a bit, but feel free to treat as optional.

jpdunc23 · 2026-02-19T00:37:31Z

fme/downscaling/train.py

+        if isinstance(self.train_data.fine[0], MergeNoConcatDatasetConfig):
+            first_fine_config = self.train_data.fine[0].merge[0]
+        else:
+            first_fine_config = self.train_data.fine[0]


Nit: You could reuse _first_data_config() here, though since this is being removed in the future feel free to ignore.
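A minimal sketch of the reuse suggested here. The dataclasses are simplified stand-ins, not the real fme classes, and the helper body is an assumption inferred from the branch shown in the hunk above.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class XarrayDataConfig:
    """Stand-in for the real config; only the fields used here."""

    data_path: str
    file_pattern: str = "*.nc"


@dataclass
class MergeNoConcatDatasetConfig:
    """Stand-in: holds several configs whose variables are merged."""

    merge: list[XarrayDataConfig] = field(default_factory=list)


def first_data_config(
    config: XarrayDataConfig | MergeNoConcatDatasetConfig,
) -> XarrayDataConfig:
    """Return the config itself, or the first member of a merged config."""
    if isinstance(config, MergeNoConcatDatasetConfig):
        return config.merge[0]
    return config


plain = XarrayDataConfig(data_path="/data/fine")
merged = MergeNoConcatDatasetConfig(
    merge=[XarrayDataConfig("/data/a"), XarrayDataConfig("/data/b")]
)
```

With a helper like this, the `isinstance` branch in `train.py` would collapse to a single call on `self.train_data.fine[0]`.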

jpdunc23 · 2026-02-19T00:55:48Z

fme/downscaling/data/config.py

+            if isinstance(config, XarrayDataConfig):
+                if config.engine == "zarr":
+                    context = "forkserver"
+                    break
+            elif getattr(config, "zarr_engine_used", False):
                context = "forkserver"
+                break


XarrayDataConfig sets its zarr_engine_used attr in its __post_init__ so I think the following works:

Suggested change

-            if isinstance(config, XarrayDataConfig):
-                if config.engine == "zarr":
-                    context = "forkserver"
-                    break
-            elif getattr(config, "zarr_engine_used", False):
-                context = "forkserver"
-                break
+            if getattr(config, "zarr_engine_used", False):
+                context = "forkserver"
+                break

zarr_engine_used should probably be turned into an @property of XarrayDataConfig etc. to make it a bit more visible, but that's an existing issue.
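One way the `@property` idea could look, sketched with stand-in dataclasses. The real classes set `zarr_engine_used` in `__post_init__`; the property bodies and the `pick_mp_context` helper below are assumptions for illustration only.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class XarrayDataConfig:
    """Stand-in config; `engine` defaults to a non-zarr backend."""

    data_path: str
    engine: str = "netcdf4"

    @property
    def zarr_engine_used(self) -> bool:
        return self.engine == "zarr"


@dataclass
class MergeNoConcatDatasetConfig:
    """Stand-in merged config: zarr is in use if any member uses it."""

    merge: list[XarrayDataConfig] = field(default_factory=list)

    @property
    def zarr_engine_used(self) -> bool:
        return any(c.zarr_engine_used for c in self.merge)


def pick_mp_context(configs) -> str | None:
    """Hypothetical helper: choose 'forkserver' as soon as zarr is seen."""
    for config in configs:
        if config.zarr_engine_used:
            return "forkserver"
    return None
```

Because both stand-ins expose the same property, the caller needs neither `isinstance` nor `getattr`.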

jpdunc23 · 2026-02-19T03:05:42Z

fme/downscaling/data/config.py

+    datasets: list[DatasetABC] = []
+    properties: DatasetProperties | None = None
+    if xarray_configs:
+        ds, prop = get_dataset(


It's a bit awkward that get_dataset returns XarrayConcat which then later on gets nested in another XarrayConcat. Maybe instead of using get_dataset and get_merged_datasets you can reuse the existing build methods on XarrayDataConfig and MergeNoConcatDatasetConfig? Then you should be able to avoid having to filter expanded by config type, since you can just loop over it and call config.build(names, n_timesteps) without worrying what type of config it is.

TBH, get_dataset and get_merged_datasets should probably be made private, though no need to worry about that here.
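The loop described above might be sketched as follows. The classes are stand-ins, and the return value of `build` is a placeholder (`FakeDataset`), since the real return type is not shown in this thread; only the call shape `config.build(names, n_timesteps)` comes from the comment.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FakeDataset:
    """Placeholder for whatever the real build methods return."""

    source: str
    names: list[str]
    n_timesteps: int


@dataclass
class XarrayDataConfig:
    data_path: str

    def build(self, names, n_timesteps):
        return FakeDataset(self.data_path, list(names), n_timesteps)


@dataclass
class MergeNoConcatDatasetConfig:
    merge: list[XarrayDataConfig] = field(default_factory=list)

    def build(self, names, n_timesteps):
        # Stand-in for merging variables from several sources.
        source = "+".join(c.data_path for c in self.merge)
        return FakeDataset(source, list(names), n_timesteps)


def build_all(configs, names, n_timesteps):
    # No filtering by config type: each config knows how to build itself.
    return [config.build(names, n_timesteps) for config in configs]
```

The resulting flat list can then be concatenated once, which avoids nesting one concatenated dataset inside another.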

jpdunc23 · 2026-02-19T03:13:48Z

fme/downscaling/data/config.py

+            if getattr(config, "engine", None) == "zarr" or getattr(
+                config, "zarr_engine_used", False
+            ):
                mp_context = "forkserver"
+                break


Here again I think you can check zarr_engine_used for both XarrayDataConfig and MergeNoConcatDatasetConfig. I don't think you need the safety of getattr unless you're worried about self.fine having types other than what we expect.

jpdunc23 · 2026-02-19T03:18:21Z

fme/downscaling/data/config.py

+                break
+        if mp_context is None:
+            for coarse_config in self.coarse:
+                if isinstance(coarse_config, XarrayEnsembleDataConfig):


Maybe add an @property for zarr_engine_used to XarrayEnsembleDataConfig? That should simplify the logic here a lot.
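A rough sketch of that property on the ensemble config. The field names `data_config` and `n_ensemble_members` are hypothetical, as is the `needs_forkserver` helper; the real `XarrayEnsembleDataConfig` may be structured differently.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class XarrayDataConfig:
    data_path: str
    engine: str = "netcdf4"

    @property
    def zarr_engine_used(self) -> bool:
        return self.engine == "zarr"


@dataclass
class XarrayEnsembleDataConfig:
    # Hypothetical fields: a wrapped base config plus a member count.
    data_config: XarrayDataConfig
    n_ensemble_members: int = 1

    @property
    def zarr_engine_used(self) -> bool:
        # Delegate to the wrapped config so callers need no isinstance check.
        return self.data_config.zarr_engine_used


def needs_forkserver(fine, coarse) -> bool:
    """Hypothetical helper: one pass over fine and coarse configs."""
    return any(c.zarr_engine_used for c in [*fine, *coarse])
```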

jpdunc23 · 2026-02-19T03:20:18Z

fme/downscaling/data/config.py

+    def coarse_full_config(
+        self,
+    ) -> Sequence[XarrayDataConfig | MergeNoConcatDatasetConfig]:


This method seems more or less identical to DataLoaderConfig.full_config. Maybe create a shared helper?

jpdunc23 · 2026-02-19T03:25:28Z

fme/downscaling/data/config.py

        assert all(
-            dataset_fine_subset.variable_metadata[key]
-            == dataset_coarse_subset.variable_metadata[key]
+            dataset_fine_subset.variable_metadata[key].units


How important is this check? Seems like units could also easily become mismatched in "harmless" ways, e.g. m^2 vs m**2.
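If the check is kept, one option is to normalize the unit strings before comparing them. The sketch below is illustrative only and not part of this PR: it handles just the `**` vs `^` spelling and stray whitespace, and both function names are hypothetical. A proper units library would be more robust.

```python
def normalize_units(units: str) -> str:
    """Map '**' to '^' and collapse runs of whitespace.

    Case is left untouched, since 'Pa' and 'pa' or 'M' and 'm' can mean
    different things.
    """
    return " ".join(units.replace("**", "^").split())


def units_match(fine_units: str, coarse_units: str) -> bool:
    """Compare two unit strings after normalization."""
    return normalize_units(fine_units) == normalize_units(coarse_units)
```

Genuinely different units, such as `Pa` and `hPa`, still fail the comparison.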

jpdunc23 · 2026-02-19T03:31:00Z

fme/downscaling/data/utils.py

+def replace_config_subset(
+    config: XarrayDataConfig | MergeNoConcatDatasetConfig, subset: TimeSlice
+) -> XarrayDataConfig | MergeNoConcatDatasetConfig:
+    if isinstance(config, XarrayDataConfig):
+        return dataclasses.replace(config, subset=subset)
+    elif isinstance(config, MergeNoConcatDatasetConfig):
+        merge_configs = [
+            dataclasses.replace(_config, subset=subset) for _config in config.merge
+        ]
+        return MergeNoConcatDatasetConfig(merge=merge_configs)
+    else:
+        raise ValueError(f"Invalid config type: {type(config)}")


XarrayDataConfig and MergeNoConcatDatasetConfig actually both already have methods update_subset, so you could maybe reuse that here and not have to worry about the isinstance checks.
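A sketch of how `replace_config_subset` could lean on `update_subset`. It assumes that `update_subset(subset)` mutates the config in place, which is not confirmed in this thread, so the helper copies the config first to keep the original non-mutating behavior. All classes are stand-ins.

```python
from __future__ import annotations

import copy
from dataclasses import dataclass, field


@dataclass
class TimeSlice:
    start_time: str | None = None
    stop_time: str | None = None


@dataclass
class XarrayDataConfig:
    data_path: str
    subset: TimeSlice = field(default_factory=TimeSlice)

    def update_subset(self, subset: TimeSlice) -> None:
        # Assumed in-place behavior.
        self.subset = subset


@dataclass
class MergeNoConcatDatasetConfig:
    merge: list[XarrayDataConfig] = field(default_factory=list)

    def update_subset(self, subset: TimeSlice) -> None:
        for config in self.merge:
            config.update_subset(subset)


def replace_config_subset(config, subset: TimeSlice):
    """Return a copy of `config` with `subset` applied; no isinstance checks."""
    new_config = copy.deepcopy(config)
    new_config.update_subset(subset)
    return new_config
```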

jpdunc23 · 2026-02-19T03:37:37Z

fme/downscaling/inference/output.py

    def _single_xarray_config(
-        coarse: list[XarrayDataConfig]
-        | Sequence[XarrayDataConfig | XarrayEnsembleDataConfig],
+        coarse: Sequence[


This is maybe getting off-track for this PR, but why does _single_xarray_config take a list as input when it only wants a single config? Maybe adding an InferenceDataLoaderConfig would help clean this up somewhat.

AnnaKwa added 6 commits February 17, 2026 14:56

allow mergeddataconfig

488c59d

relax req for metadata to match (e.g. pressfc long name does not match)

5a60a51

add unit test

81fd338

very fast only

8845c7d

move helper method out as standalone function and add merged datatset…

6f104fc

… to dataconfig

add check upfront that timesteps in paired datasets are aligned

dc7bf97

jpdunc23 reviewed Feb 18, 2026


AnnaKwa added 2 commits February 18, 2026 13:47

use MergeNoConcatDatasetConfig instead

cc3e142

adjust downstream references to allow for merged config

721ec6a

AnnaKwa commented Feb 18, 2026


jpdunc23 approved these changes Feb 19, 2026
