Conversation
fme/downscaling/data/config.py
Outdated
Since you already handle concatenation in build_from_config_sequence, you could maybe use MergeNoConcatDatasetConfig instead of MergeDatasetConfig. We use MergeNoConcatDatasetConfig for training in fme.coupled since we also handle concatenation there.
```diff
 def _single_xarray_config(
-    coarse: list[XarrayDataConfig]
-    | Sequence[XarrayDataConfig | XarrayEnsembleDataConfig],
+    coarse: Sequence[
```
Currently inference contains a check that the coarse data config is a single XarrayDataConfig. This can be updated later in a subsequent PR to allow a merged config as well.
This is maybe getting off-track for this PR, but why does _single_xarray_config take a list as input when it only wants a single config? Maybe adding an InferenceDataLoaderConfig would help clean this up somewhat.
```python
data_path = self.fine[0].data_path
file_pattern = self.fine[0].file_pattern
raw_paths = get_raw_paths(data_path, file_pattern)
first_config = self._first_data_config(self.fine[0])
```
This block will be removed in a subsequent PR where the static inputs are loaded using the paths provided in a training config field.
```diff
 assert all(
-    dataset_fine_subset.variable_metadata[key]
-    == dataset_coarse_subset.variable_metadata[key]
+    dataset_fine_subset.variable_metadata[key].units
```
For surface pressure, the fine and coarse datasets have different long_name.
How important is this check? Seems like units could also easily become mismatched in "harmless" ways, e.g. m^2 vs m**2.
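One lightweight middle ground, purely illustrative and not existing fme behavior: normalize notational variants before comparing, so strings like "m^2" and "m**2" don't trip the assert. The normalization rule here is an assumption.

```python
# Illustrative sketch only: a lenient units comparison that treats
# notational variants such as "m^2" vs "m**2" as equal. The set of
# normalizations is an assumption, not something fme does today.
def units_match(a: str, b: str) -> bool:
    def norm(u: str) -> str:
        # Canonicalize exponent notation and drop whitespace.
        return u.replace("**", "^").replace(" ", "")

    return norm(a) == norm(b)
```

This still catches genuine mismatches (e.g. "m^2" vs "m^3") while tolerating formatting-only differences.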
jpdunc23 left a comment
Mostly looks good to me! I have various suggestions that I think would help simplify the changes a bit, but feel free to treat as optional.
```python
if isinstance(self.train_data.fine[0], MergeNoConcatDatasetConfig):
    first_fine_config = self.train_data.fine[0].merge[0]
else:
    first_fine_config = self.train_data.fine[0]
```
Nit: You could reuse _first_data_config() here, though since this is being removed in the future feel free to ignore.
```python
if isinstance(config, XarrayDataConfig):
    if config.engine == "zarr":
        context = "forkserver"
        break
elif getattr(config, "zarr_engine_used", False):
    context = "forkserver"
    break
```
XarrayDataConfig sets its zarr_engine_used attr in its __post_init__ so I think the following works:
```diff
-if isinstance(config, XarrayDataConfig):
-    if config.engine == "zarr":
-        context = "forkserver"
-        break
-elif getattr(config, "zarr_engine_used", False):
-    context = "forkserver"
-    break
+if getattr(config, "zarr_engine_used", False):
+    context = "forkserver"
+    break
```
zarr_engine_used should probably be turned into an @property of XarrayDataConfig etc. to make it a bit more visible, but that's an existing issue.
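A minimal sketch of what that property could look like. The field names here are assumptions for illustration, not fme's actual definitions:

```python
import dataclasses


# Sketch of the suggestion: derive zarr_engine_used from the engine field
# via a property instead of setting an attribute in __post_init__.
# Field names and defaults are assumptions, not fme's real ones.
@dataclasses.dataclass
class XarrayDataConfig:
    data_path: str
    engine: str = "netcdf4"

    @property
    def zarr_engine_used(self) -> bool:
        # One visible place where "which engines count as zarr" is decided;
        # callers just check config.zarr_engine_used.
        return self.engine == "zarr"
```

A property also can't silently go stale if `engine` is mutated after construction, which an attribute set in `__post_init__` can.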
```python
datasets: list[DatasetABC] = []
properties: DatasetProperties | None = None
if xarray_configs:
    ds, prop = get_dataset(
```
It's a bit awkward that get_dataset returns XarrayConcat which then later on gets nested in another XarrayConcat. Maybe instead of using get_dataset and get_merged_datasets you can reuse the existing build methods on XarrayDataConfig and MergeNoConcatDatasetConfig? Then you should be able to avoid having to filter expanded by config type, since you can just loop over it and call config.build(names, n_timesteps) without worrying what type of config it is.
TBH, get_dataset and get_merged_datasets should probably be made private, though no need to worry about that here.
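A sketch of the loop this suggestion implies, assuming every config type exposes a `build(names, n_timesteps)` method as described (the signature is an assumption):

```python
from typing import Protocol, Sequence


# Sketch: if each config knows how to build itself, the loader can iterate
# without isinstance checks or filtering by config type. The build()
# signature is assumed from the discussion.
class BuildableConfig(Protocol):
    def build(self, names: Sequence[str], n_timesteps: int) -> object: ...


def build_datasets(
    configs: Sequence[BuildableConfig],
    names: Sequence[str],
    n_timesteps: int,
) -> list[object]:
    # Each config handles its own construction, including any merging a
    # MergeNoConcatDatasetConfig would do internally.
    return [config.build(names, n_timesteps) for config in configs]
```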
```python
if getattr(config, "engine", None) == "zarr" or getattr(
    config, "zarr_engine_used", False
):
    mp_context = "forkserver"
    break
```
Here again I think you can check zarr_engine_used for both XarrayDataConfig and MergeNoConcatDatasetConfig. I don't think you need the safety of getattr unless you're worried about self.fine having types other than what we expect.
```python
        break
if mp_context is None:
    for coarse_config in self.coarse:
        if isinstance(coarse_config, XarrayEnsembleDataConfig):
```
Maybe add an @property for zarr_engine_used to XarrayEnsembleDataConfig? That should simplify the logic here a lot.
```python
def coarse_full_config(
    self,
) -> Sequence[XarrayDataConfig | MergeNoConcatDatasetConfig]:
```
This method seems more or less identical to DataLoaderConfig.full_config. Maybe create a shared helper?
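One possible shape for such a helper, entirely hypothetical: a free function both `DataLoaderConfig.full_config` and `coarse_full_config` delegate to, with an `expand()` hook standing in for whatever expansion the real methods perform.

```python
from typing import Any, Sequence


# Hypothetical shared helper: both full_config methods would call this,
# assuming they do the same "expand anything ensemble-like into plain
# configs" pass. The expand() hook name is an assumption for illustration.
def expand_full_config(configs: Sequence[Any]) -> list[Any]:
    expanded: list[Any] = []
    for config in configs:
        expand = getattr(config, "expand", None)
        # Ensemble-like configs return their expanded member configs;
        # plain configs are passed through unchanged.
        expanded.extend(expand() if expand is not None else [config])
    return expanded
```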
```python
def replace_config_subset(
    config: XarrayDataConfig | MergeNoConcatDatasetConfig, subset: TimeSlice
) -> XarrayDataConfig | MergeNoConcatDatasetConfig:
    if isinstance(config, XarrayDataConfig):
        return dataclasses.replace(config, subset=subset)
    elif isinstance(config, MergeNoConcatDatasetConfig):
        merge_configs = [
            dataclasses.replace(_config, subset=subset) for _config in config.merge
        ]
        return MergeNoConcatDatasetConfig(merge=merge_configs)
    else:
        raise ValueError(f"Invalid config type: {type(config)}")
```
XarrayDataConfig and MergeNoConcatDatasetConfig actually both already have methods update_subset, so you could maybe reuse that here and not have to worry about the isinstance checks.
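If both config types implement `update_subset` as noted, the helper collapses to a single call. The exact signature is assumed here:

```python
from typing import Protocol


# Sketch of the suggestion: since both XarrayDataConfig and
# MergeNoConcatDatasetConfig already implement update_subset, the
# isinstance dispatch disappears. The signature is an assumption.
class HasUpdateSubset(Protocol):
    def update_subset(self, subset: object) -> "HasUpdateSubset": ...


def replace_config_subset(
    config: HasUpdateSubset, subset: object
) -> HasUpdateSubset:
    # Each config type handles its own members (e.g. a merge config can
    # apply the subset to every merged config internally).
    return config.update_subset(subset)
```

This also means a future third config type only needs to implement `update_subset`, rather than requiring a new branch here.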
These changes allow the downscaling data loader configs to load merged datasets.