Skip to content

Allow downscaling to load merged datasets#846

Open
AnnaKwa wants to merge 8 commits intomainfrom
feature/downsc-merged-datasets
Open

Allow downscaling to load merged datasets#846
AnnaKwa wants to merge 8 commits intomainfrom
feature/downsc-merged-datasets

Conversation

@AnnaKwa
Copy link
Contributor

@AnnaKwa AnnaKwa commented Feb 17, 2026

These changes allow the downscaling data loader configs to load merged datasets.

  • Tests added

Comment on lines 421 to 422
Copy link
Member

@jpdunc23 jpdunc23 Feb 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you already handle concatenation in build_from_config_sequence, you could maybe use MergeNoConcatDatasetConfig instead of MergeDatasetConfig. We use MergeNoConcatDatasetConfig for training in fme.coupled since we also handle concatenation there.

def _single_xarray_config(
coarse: list[XarrayDataConfig]
| Sequence[XarrayDataConfig | XarrayEnsembleDataConfig],
coarse: Sequence[
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently inference contains a check that the coarse data config is a single XarrayDataConfig. This can be updated later in a subsequent PR to allow a merged config as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is maybe getting off-track for this PR, but why does _single_xarray_config take a list as input when it only wants a single config? Maybe adding an InferenceDataLoaderConfig would help clean this up somewhat.

data_path = self.fine[0].data_path
file_pattern = self.fine[0].file_pattern
raw_paths = get_raw_paths(data_path, file_pattern)
first_config = self._first_data_config(self.fine[0])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block will be removed in a subsequent PR where the static inputs are loaded using the paths provided in a training config field.

assert all(
dataset_fine_subset.variable_metadata[key]
== dataset_coarse_subset.variable_metadata[key]
dataset_fine_subset.variable_metadata[key].units
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For surface pressure, the fine and coarse datasets have different long_name.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How important is this check? Seems like units could also easily become mismatched in "harmless" ways, e.g. m^2 vs m**2.

Copy link
Member

@jpdunc23 jpdunc23 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good to me! I have various suggestions that I think would help simplify the changes a bit, but feel free to treat as optional.

Comment on lines +455 to +458
if isinstance(self.train_data.fine[0], MergeNoConcatDatasetConfig):
first_fine_config = self.train_data.fine[0].merge[0]
else:
first_fine_config = self.train_data.fine[0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: You could reuse _first_data_config() here, though since this is being removed in the future feel free to ignore.

Comment on lines +205 to +211
if isinstance(config, XarrayDataConfig):
if config.engine == "zarr":
context = "forkserver"
break
elif getattr(config, "zarr_engine_used", False):
context = "forkserver"
break
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

XarrayDataConfig sets its zarr_engine_used attr in its __post_init__ so I think the following works:

Suggested change
if isinstance(config, XarrayDataConfig):
if config.engine == "zarr":
context = "forkserver"
break
elif getattr(config, "zarr_engine_used", False):
context = "forkserver"
break
if getattr(config, "zarr_engine_used", False):
context = "forkserver"
break

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

zarr_engine_used should probably be turned into an @property of XarrayDataConfig etc. to make it a bit more visible, but that's an existing issue.

datasets: list[DatasetABC] = []
properties: DatasetProperties | None = None
if xarray_configs:
ds, prop = get_dataset(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit awkward that get_dataset returns XarrayConcat which then later on gets nested in another XarrayConcat. Maybe instead of using get_dataset and get_merged_datasets you can reuse the existing build methods on XarrayDataConfig and MergeNoConcatDatasetConfig? Then you should be able to avoid having to filter expanded by config type, since you can just loop over it and call config.build(names, n_timesteps) without worrying what type of config it is.

TBH, get_dataset and get_merged_datasets should probably be made private, though no need to worry about that here.

Comment on lines +438 to +442
if getattr(config, "engine", None) == "zarr" or getattr(
config, "zarr_engine_used", False
):
mp_context = "forkserver"
break
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here again I think you can check zarr_engine_used for both XarrayDataConfig and MergeNoConcatDatasetConfig. I don't think you need the safety of getattr unless you're worried about self.fine having types other than what we expect.

break
if mp_context is None:
for coarse_config in self.coarse:
if isinstance(coarse_config, XarrayEnsembleDataConfig):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add an @property for zarr_engine_used to XarrayEnsembleDataConfig? That should simplify the logic here a lot.

Comment on lines +457 to +459
def coarse_full_config(
self,
) -> Sequence[XarrayDataConfig | MergeNoConcatDatasetConfig]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method seems more or less identical to DataLoaderConfig.full_config. Maybe create a shared helper?

assert all(
dataset_fine_subset.variable_metadata[key]
== dataset_coarse_subset.variable_metadata[key]
dataset_fine_subset.variable_metadata[key].units
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How important is this check? Seems like units could also easily become mismatched in "harmless" ways, e.g. m^2 vs m**2.

Comment on lines +16 to +27
def replace_config_subset(
config: XarrayDataConfig | MergeNoConcatDatasetConfig, subset: TimeSlice
) -> XarrayDataConfig | MergeNoConcatDatasetConfig:
if isinstance(config, XarrayDataConfig):
return dataclasses.replace(config, subset=subset)
elif isinstance(config, MergeNoConcatDatasetConfig):
merge_configs = [
dataclasses.replace(_config, subset=subset) for _config in config.merge
]
return MergeNoConcatDatasetConfig(merge=merge_configs)
else:
raise ValueError(f"Invalid config type: {type(config)}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

XarrayDataConfig and MergeNoConcatDatasetConfig actually both already have methods update_subset, so you could maybe reuse that here and not have to worry about the isinstance checks.

def _single_xarray_config(
coarse: list[XarrayDataConfig]
| Sequence[XarrayDataConfig | XarrayEnsembleDataConfig],
coarse: Sequence[
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is maybe getting off-track for this PR, but why does _single_xarray_config take a list as input when it only wants a single config? Maybe adding an InferenceDataLoaderConfig would help clean this up somewhat.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants

Comments