Conversation

@matschreiner (Contributor) commented Feb 14, 2025

Following up on issues #68 and #69, raised by @observingClouds, I have modified the kwarg_selection code and cleaned it up a bit further.

Issue #68

  1. There was a bug where a dataclasswizard.ParseError was raised when datetime objects were passed for instantiation of the config file.

1.1) I added two danra yaml configuration files, one using strings and one using datetime objects in the data fields, and wrote tests that instantiate an mdp.Config from both. However, this did not reproduce the error.
1.2) I added a test that tries to instantiate a Range config object from the datetimes, which also doesn't throw an error.

1.3) Even though I didn't see the error, I added datetime objects and timedeltas as accepted types for the Range class. I hope we don't see this bug again; if anyone can help me write a test that showcases the problem, I'd be happy.

  2. There was a rather strict assertion that the exact points chosen in time should be in the dataset. We discussed in the issue whether this check should really be enforced. The check is not strictly necessary, and it requires extra knowledge about the data, making the code harder to use in general. This check was only applied to time, treating the time dimension differently from the other dimensions.

2.1) I have removed the assertion and instead raise a warning when the selection.start and selection.end points fall outside the range of the chosen coordinate. This makes preparing datasets in the wild easier. This warning is raised equally on all dimensions.

2.2) I have removed a test that raised an error when the exact datapoint was not in the dataset. This test also checked whether the time deltas in the sliced dataset were the same as the ones specified in the slice. I have written new, simpler tests that cover this.

Issue #69

  3. This issue had to do with the fact that the time-step was ignored when slicing the data. This appears to be a pre-existing issue rather than one introduced by the slicing functionality.

3.1) I have added functionality to slice the data using time deltas - this is not supported natively by xarray. Given the chosen stepsize and the frequency of datapoints in the dataset, we have to calculate a striding value that simply takes every n-th datapoint in the dataset. So if we use Range(start, end, 'PT6H') and we have a datapoint every 'PT3H' in the dataset, we have to calculate a stride of 2, which is then passed to the slicing (see the sketch after 3.2).

3.2) I have added tests to check that the slicing works as expected and that an error is raised if the stepsize is not a multiple of the frequency in the dataset.
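A minimal sketch of the striding logic described in 3.1 (a hypothetical helper; the names and the use of pandas for duration parsing are assumptions, not the actual mdp implementation):

```python
import pandas as pd

def select_time_with_stride(ds, start, end, step):
    # Infer the (assumed constant) spacing of datapoints along time.
    freq = pd.Timedelta(ds.time.values[1] - ds.time.values[0])
    step = pd.Timedelta(step)  # pandas parses ISO 8601 durations such as "PT6H"
    if step % freq != pd.Timedelta(0):
        raise ValueError(f"step {step} is not a multiple of the dataset frequency {freq}")
    stride = step // freq  # e.g. PT6H with a datapoint every PT3H -> stride of 2
    return ds.sel(time=slice(start, end)).isel(time=slice(None, None, stride))
```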

Cleanup

I renamed the _normalize_... functions in the selection module to reflect that they are about creating time objects, and I made them public as I needed them in tests.
I removed conditional logic for slicing using lists, since lists are explicitly not allowed by the typing of the input arguments. Also, I'm not sure when we would use it.

Type of change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📖 Documentation (Addition or improvements to documentation)

Checklist before requesting a review

  • My branch is up-to-date with the target branch - if not update your fork with the changes from the target branch (use pull with --rebase option if possible).
  • I have performed a self-review of my code
  • For any new/modified functions/classes I have added docstrings that clearly describe their purpose, expected inputs and returned values
  • I have placed in-line comments to clarify the intent of any hard-to-understand passages of my code
  • I have updated the documentation to cover introduced code changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have given the PR a name that clearly describes the change, written in imperative form (context).
  • I have requested a reviewer and an assignee (assignee is responsible for merging)

Checklist for reviewers

Each PR comes with its own improvements and flaws. The reviewer should check the following:

  • the code is readable
  • the code is well tested
  • the code is documented (including return types and parameters)
  • the code is easy to maintain

Author checklist after completed review

  • I have added a line to the CHANGELOG describing this change, in a section
    reflecting type of change (add section where missing):
    • added: when you have added new functionality
    • changed: when default behaviour of the code has been changed
    • fixes: when your contribution fixes a bug

Checklist for assignee

  • PR is up to date with the base branch
  • the tests pass
  • author has added an entry to the changelog (and designated the change as added, changed or fixed)
  • Once the PR is ready to be merged, squash commits and merge the PR.

@leifdenby (Member) left a comment

Looking good! Thanks for working on this :)

start: Union[str, int, float]
end: Union[str, int, float]
step: Optional[Union[str, int, float]] = None
start: Union[str, int, float, datetime]
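Taken together, the widened types would look roughly like this (a sketch only; the timedelta on step comes from the PR description rather than this diff):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Union

@dataclass
class Range:
    start: Union[str, int, float, datetime]
    end: Union[str, int, float, datetime]
    step: Optional[Union[str, int, float, timedelta]] = None
```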
Member

ah, I'm an idiot. I am not sure we need to support str here since I added that for iso8601-formatted strings. I thought that yaml doesn't natively support datetime serialisation, but I misremembered, that is json! So, if people write iso8601-formatted strings then I think yaml should always turn them into datetime.datetime objects if we define that as the type here. That also means we don't have to handle turning strings into datetime/timedelta objects in the code, nice!

Member

which I think should mean we should remove str here. Do you agree @matschreiner and @observingClouds ?

Contributor

I imagine some folks might still write times as strings, but not allowing it in the first place would probably be more robust, so I'd be happy to drop str support

@leifdenby (Member) commented Feb 17, 2025

Hmm, I'm confused. Our current config schema supports the following:

schema_version: v0.6.0
dataset_version: v0.1.0

output:
  variables:
    static: [grid_index, static_feature]
    state: [time, grid_index, state_feature]
    forcing: [time, grid_index, forcing_feature]
  coord_ranges:
    time:
      start: 1990-09-03T00:00
      end: 1990-09-09T00:00
      step: PT3H
...

What I am suggesting is that if we replace the str type with datetime in the config dataclasses, then the dataclass serialisation (from dataclass-wizard) will handle turning the start and end fields into datetime objects, rather than us having to do it. In terms of what is in the config yaml-files, they would remain unchanged though? So people can keep defining the start/end times as they already are. What was previously interpreted as a string will now simply be interpreted as a datetime serialised as an iso8601 string, no?
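For illustration, a minimal sketch of what this would mean, using a hypothetical TimeRange dataclass (the actual config dataclasses live in mdp):

```python
from dataclasses import dataclass
from datetime import datetime
from dataclass_wizard import YAMLWizard

@dataclass
class TimeRange(YAMLWizard):
    start: datetime
    end: datetime

# dataclass-wizard (via the YAML loader) hands us datetime objects directly
r = TimeRange.from_yaml("start: 1990-09-03T00:00:00\nend: 1990-09-09T00:00:00")
print(repr(r.start))  # datetime.datetime(1990, 9, 3, 0, 0)
```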

@observingClouds (Contributor) commented Feb 17, 2025

I was just experimenting together with @matschreiner and 1990-09-03T00:00 is not a valid ISO format and is not converted to datetime and remains a string, however 1990-09-03T00:00:00 is.
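This is reproducible with PyYAML directly (assuming PyYAML is the YAML backend here); its implicit timestamp resolver appears to require the seconds:

```python
import yaml

yaml.safe_load("t: 1990-09-03T00:00")     # {'t': '1990-09-03T00:00'} -- stays a string
yaml.safe_load("t: 1990-09-03T00:00:00")  # {'t': datetime.datetime(1990, 9, 3, 0, 0)}
```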

There seems to be an issue with the roundtrip though, as mdp.Config.to_yaml()/to_dict()/to_json() all serialize datetimes back to strings... this might be an upstream issue in dataclass-wizard

Member

Hmm that is strange. Omitting the seconds should be valid, https://en.wikipedia.org/wiki/ISO_8601#Times

Either the seconds, or the minutes and seconds, may be omitted from the basic or extended time formats for greater brevity but decreased precision; the resulting reduced precision time formats are:[26]
T[hh][mm] in basic format or T[hh]:[mm] in extended format, when seconds are omitted.
T[hh], when both seconds and minutes are omitted.

Maybe this is an upstream bug, or YAML requires the seconds be included...

Member

In dataclass-wizard docs, https://dataclass-wizard.readthedocs.io/en/latest/overview.html#supported-types

For date, time, and datetime types, string values are de-serialized using the builtin fromisoformat() method; for datetime and time types, a suffix of “Z” appearing in the string is first replaced with “+00:00”, which represents UTC time. JSON values for datetime and date annotated types appearing as numbers will be de-serialized using the builtin fromtimestamp() method.

All these types are serialized back to JSON using the builtin isoformat() method. For datetime and time types, there is one noteworthy addition: the suffix “+00:00” is replaced with “Z”, which is a common abbreviation for UTC time.

this seems ok, no? It does mean the round trip wouldn't result in exactly the same yaml because the seconds will be added by the isoformat() call https://docs.python.org/3/library/datetime.html#datetime.datetime.isoformat
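A concrete example of that roundtrip:

```python
from datetime import datetime

datetime.fromisoformat("1990-09-03T00:00").isoformat()
# -> '1990-09-03T00:00:00' -- the seconds are added on the way back out
```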

"time_stepsize",
[testdata.DT_ANALYSIS, testdata.DT_ANALYSIS * 2, testdata.DT_ANALYSIS / 2],
)
def test_time_selection(source_data_contains_time_range, time_stepsize):
Member

how come you are getting rid of this test? Don't you like it 😆

Comment on lines +55 to +56
assert sel_start != sel_end, "Start and end cannot be the same"

Contributor

ds.sel({'time': slice(dt.datetime(2023,1,1), dt.datetime(2023,1,1))}) is a valid expression and returns data from dt.datetime(2023,1,1)
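A self-contained demonstration of that point, on a toy dataset:

```python
import datetime as dt
import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range("2023-01-01", periods=3, freq="D")
ds = xr.Dataset({"x": ("time", np.arange(3))}, coords={"time": times})
# a slice with identical start and end returns the single matching timestep
ds.sel({"time": slice(dt.datetime(2023, 1, 1), dt.datetime(2023, 1, 1))})
```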

Comment on lines +50 to +53
if coord == "time":
    sel_start = to_timestamp(selection.start)
    sel_end = to_timestamp(selection.end)
    sel_step = get_time_step(sel_step, ds)
Contributor

I agree with @leifdenby that it would be good to only accept datetimes and so this rather boilerplate code is not necessary

Comment on lines 81 to +85

if not all(all_steps[0] == all_steps):
    raise ValueError(
        f"Step size for coordinate {coord} is not constant: {all_steps}"
    )
if sel_step != first_step:
    raise ValueError(
        f"Step size for coordinate {coord} is not the same as requested: {first_step} != {sel_step}"
def check_selection(ds, coord, sel_start, sel_end):
    if ds[coord].values.min() < sel_start or ds[coord].values.max() > sel_end:
        warnings.warn(
            f"\nChosen slice exceeds the range of {coord} in the dataset.\n Dataset span: [ {ds[coord].values.min()} : {ds[coord].values.max()} ]\n Chosen slice: [ {sel_start} : {sel_end} ]\n"
Contributor

This check assumes that the coordinates are sorted in ascending order; however, this is not always the case.

Example:

import xarray as xr
import numpy as np

# Create geographic coordinates
latitude = np.array([10, 5, 0, -5, -10])  # Latitude in descending order
longitude = np.array([30, 35, 40])  # Longitude in ascending order

# Create some sample data
data = np.random.rand(len(latitude), len(longitude))  # 2D data with latitudes and longitudes

# Create an xarray Dataset with geographic coordinates
ds = xr.Dataset(
    {
        'temperature': (('latitude', 'longitude'), data),  # 2D data with latitude and longitude dimensions
        'precipitation': (('latitude', 'longitude'), data * 0.1),  # Example precipitation data
    },
    coords={
        'latitude': latitude,
        'longitude': longitude,
    }
)

ds.sel({'latitude': slice(8,-5)}) # this slice is within bounds but would raise the above warning

It should also be sel_start <= ds[coord].values.min().

Nevertheless, I would not do this check at all. The user should know what a valid range is.
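For completeness, if the check were kept, an order-agnostic variant could look like this (a sketch only, given the suggestion above to drop it entirely):

```python
import warnings

def check_selection(ds, coord, sel_start, sel_end):
    # Normalise the slice bounds so descending coordinates (like the
    # latitude example above) are handled the same as ascending ones.
    lo, hi = sorted([sel_start, sel_end])
    cmin, cmax = ds[coord].values.min(), ds[coord].values.max()
    if lo < cmin or hi > cmax:
        warnings.warn(
            f"Chosen slice [{lo} : {hi}] exceeds the range of "
            f"{coord} in the dataset [{cmin} : {cmax}]"
        )
```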

matschreiner and others added 3 commits February 16, 2025 11:44
Co-authored-by: Hauke Schulz <43613877+observingClouds@users.noreply.github.com>
@leifdenby leifdenby added the bug Something isn't working label Sep 22, 2025