Conversation

@matschreiner (Contributor) commented Feb 14, 2025

Following up on issues #68 and #69, raised by @observingClouds, I have modified the kwarg_selection code and cleaned it up a bit further.

Issue #68

  1. There was a bug where a dataclasswizard.ParseError was raised when datetime objects were passed for instantiation of the config file.

1.1) I added two danra yaml configuration files, one using strings and one using datetime objects in the data fields, and wrote tests that instantiate an mdp.Config from both. However, this did not reproduce the error.
1.2) I added a test that tries to instantiate a Range config object from the datetimes, which also doesn't throw an error.

1.3) Even though I didn't see the error, I added datetime objects and timedeltas as accepted types for the Range class. I hope we don't see this bug again; if anyone can help me write a test that showcases the problem, I'd be happy.

  2. There was a rather strict assertion that the exact points chosen in time should be in the dataset. We discussed in the issue whether this check should really be enforced. The check is not strictly necessary, and it requires extra knowledge about the data, making the code harder to use in general. This check was only applied to time, treating the time dimension differently from the other dimensions.

2.1) I have removed the assertion and instead raise a warning when the selection.start and selection.end points fall outside the range of the chosen coordinate. This makes preparing datasets in the wild easier. This warning is raised equally on all dimensions.

2.2) I have removed a test that raised an error when the exact datapoint was not in the dataset. This test also checked whether the time deltas in the sliced dataset were the same as the ones specified in the slice. I have written new, simpler tests that cover this.

Issue #69

  3. This issue had to do with the fact that the time-step was ignored when slicing the data. This appears to be a pre-existing issue rather than one introduced by the slicing functionality.

3.1) I have added functionality to slice the data using time deltas - this is not supported natively by xarray. Given the chosen stepsize and the frequency of datapoints in the dataset, we have to calculate a striding value that simply takes every n-th datapoint in the dataset. So if we use Range(start, end, 'PT6H') and we have a datapoint every 'PT3H' in the dataset, we have to calculate a stride of 2, which is then passed to the slicing (see the sketch after 3.2).

3.2) I have added tests to check that the slicing works as expected and that an error is raised if the stepsize is not a multiple of the frequency in the dataset.
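A minimal sketch of the striding logic described in 3.1 (a hypothetical helper; the names and the use of pandas for duration parsing are assumptions, not the actual mdp implementation):

```python
import pandas as pd

def select_time_with_stride(ds, start, end, step):
    # Infer the (assumed constant) spacing of datapoints along time.
    freq = pd.Timedelta(ds.time.values[1] - ds.time.values[0])
    step = pd.Timedelta(step)  # pandas parses ISO 8601 durations such as "PT6H"
    if step % freq != pd.Timedelta(0):
        raise ValueError(f"step {step} is not a multiple of the dataset frequency {freq}")
    stride = step // freq  # e.g. PT6H with a datapoint every PT3H -> stride of 2
    return ds.sel(time=slice(start, end)).isel(time=slice(None, None, stride))
```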

Cleanup

I renamed the _normalize_... functions in the selection module to reflect that they are about creating time objects, and I made them public as I needed them in tests.
I removed conditional logic for slicing using lists, since lists are explicitly not allowed by the typing of the input arguments. Also, I'm not sure when we would use it.

Type of change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📖 Documentation (Addition or improvements to documentation)

Checklist before requesting a review

  • My branch is up-to-date with the target branch - if not update your fork with the changes from the target branch (use pull with --rebase option if possible).
  • I have performed a self-review of my code
  • For any new/modified functions/classes I have added docstrings that clearly describe their purpose, expected inputs and returned values
  • I have placed in-line comments to clarify the intent of any hard-to-understand passages of my code
  • I have updated the documentation to cover introduced code changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have given the PR a name that clearly describes the change, written in imperative form (context).
  • I have requested a reviewer and an assignee (assignee is responsible for merging)

Checklist for reviewers

Each PR comes with its own improvements and flaws. The reviewer should check the following:

  • the code is readable
  • the code is well tested
  • the code is documented (including return types and parameters)
  • the code is easy to maintain

Author checklist after completed review

  • I have added a line to the CHANGELOG describing this change, in a section
    reflecting type of change (add section where missing):
    • added: when you have added new functionality
    • changed: when default behaviour of the code has been changed
    • fixes: when your contribution fixes a bug

Checklist for assignee

  • PR is up to date with the base branch
  • the tests pass
  • author has added an entry to the changelog (and designated the change as added, changed or fixed)
  • Once the PR is ready to be merged, squash commits and merge the PR.

@leifdenby (Member) left a comment

Looking good! Thanks for working on this :)

start: Union[str, int, float]
end: Union[str, int, float]
step: Optional[Union[str, int, float]] = None
start: Union[str, int, float, datetime]
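Taken together, the widened types would look roughly like this (a sketch only; the timedelta on step comes from the PR description rather than this diff):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Union

@dataclass
class Range:
    start: Union[str, int, float, datetime]
    end: Union[str, int, float, datetime]
    step: Optional[Union[str, int, float, timedelta]] = None
```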
Member

ah, I'm an idiot. I am not sure we need to support str here since I added that for iso8601-formatted strings. I thought that yaml doesn't natively support datetime serialisation, but I misremembered, that is json! So, if people write iso8601-formatted strings then I think yaml should always turn them into datetime.datetime objects if we define that as the type here. That also means we don't have to handle turning strings into datetime/timedelta objects in the code, nice!

Member

which I think should mean we should remove str here. Do you agree @matschreiner and @observingClouds ?

Contributor

I imagine some folks might still write times as strings, but not allowing it in the first place would probably be more robust, so I'd be happy to drop str support

@leifdenby (Member) commented Feb 17, 2025

Hmm, I'm confused. Our current config schema supports the following:

schema_version: v0.6.0
dataset_version: v0.1.0

output:
  variables:
    static: [grid_index, static_feature]
    state: [time, grid_index, state_feature]
    forcing: [time, grid_index, forcing_feature]
  coord_ranges:
    time:
      start: 1990-09-03T00:00
      end: 1990-09-09T00:00
      step: PT3H
...

What I am suggesting is that if we replace the str type with datetime in the config dataclasses, then the dataclass serialisation (from dataclass-wizard) will handle turning the start and end fields into datetime objects, rather than us having to do it. In terms of what is in the config yaml-files, they would remain unchanged though? So people can keep defining the start/end times as they already are. What was previously interpreted as a string will now simply be interpreted as a datetime serialised as an iso8601 string, no?
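For illustration, a minimal sketch of what this would mean, using a hypothetical TimeRange dataclass (the actual config dataclasses live in mdp):

```python
from dataclasses import dataclass
from datetime import datetime
from dataclass_wizard import YAMLWizard

@dataclass
class TimeRange(YAMLWizard):
    start: datetime
    end: datetime

# dataclass-wizard (via the YAML loader) hands us datetime objects directly
r = TimeRange.from_yaml("start: 1990-09-03T00:00:00\nend: 1990-09-09T00:00:00")
print(repr(r.start))  # datetime.datetime(1990, 9, 3, 0, 0)
```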

@observingClouds (Contributor) commented Feb 17, 2025

I was just experimenting together with @matschreiner and 1990-09-03T00:00 is not a valid ISO format and is not converted to datetime and remains a string, however 1990-09-03T00:00:00 is.
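This is reproducible with PyYAML directly (assuming PyYAML is the YAML backend here); its implicit timestamp resolver appears to require the seconds:

```python
import yaml

yaml.safe_load("t: 1990-09-03T00:00")     # {'t': '1990-09-03T00:00'} -- stays a string
yaml.safe_load("t: 1990-09-03T00:00:00")  # {'t': datetime.datetime(1990, 9, 3, 0, 0)}
```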

There seems to be an issue with the roundtrip though, as mdp.Config.to_yaml()/to_dict()/to_json() all serialize datetimes back to strings... this might be an upstream issue in dataclass-wizard

Member

Hmm that is strange. Omitting the seconds should be valid, https://en.wikipedia.org/wiki/ISO_8601#Times

Either the seconds, or the minutes and seconds, may be omitted from the basic or extended time formats for greater brevity but decreased precision; the resulting reduced precision time formats are:[26]
T[hh][mm] in basic format or T[hh]:[mm] in extended format, when seconds are omitted.
T[hh], when both seconds and minutes are omitted.

Maybe this is an upstream bug, or YAML requires the seconds be included...

Member

In dataclass-wizard docs, https://dataclass-wizard.readthedocs.io/en/latest/overview.html#supported-types

For date, time, and datetime types, string values are de-serialized using the builtin fromisoformat() method; for datetime and time types, a suffix of “Z” appearing in the string is first replaced with “+00:00”, which represents UTC time. JSON values for datetime and date annotated types appearing as numbers will be de-serialized using the builtin fromtimestamp() method.

All these types are serialized back to JSON using the builtin isoformat() method. For datetime and time types, there is one noteworthy addition: the suffix “+00:00” is replaced with “Z”, which is a common abbreviation for UTC time.

this seems ok, no? It does mean the round trip wouldn't result in exactly the same yaml because the seconds will be added by the isoformat() call https://docs.python.org/3/library/datetime.html#datetime.datetime.isoformat
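A concrete example of that roundtrip:

```python
from datetime import datetime

datetime.fromisoformat("1990-09-03T00:00").isoformat()
# -> '1990-09-03T00:00:00' -- the seconds are added on the way back out
```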

"time_stepsize",
[testdata.DT_ANALYSIS, testdata.DT_ANALYSIS * 2, testdata.DT_ANALYSIS / 2],
)
def test_time_selection(source_data_contains_time_range, time_stepsize):
Member

how come you are getting rid of this test? Don't you like it 😆

Comment on lines +55 to +56
assert sel_start != sel_end, "Start and end cannot be the same"

Contributor

ds.sel({'time': slice(dt.datetime(2023,1,1), dt.datetime(2023,1,1))}) is a valid expression and returns data from dt.datetime(2023,1,1)
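A self-contained demonstration of that point, on a toy dataset:

```python
import datetime as dt
import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range("2023-01-01", periods=3, freq="D")
ds = xr.Dataset({"x": ("time", np.arange(3))}, coords={"time": times})
# a slice with identical start and end returns the single matching timestep
ds.sel({"time": slice(dt.datetime(2023, 1, 1), dt.datetime(2023, 1, 1))})
```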

Comment on lines +50 to +53
if coord == "time":
    sel_start = to_timestamp(selection.start)
    sel_end = to_timestamp(selection.end)
    sel_step = get_time_step(sel_step, ds)
Contributor

I agree with @leifdenby that it would be good to only accept datetimes and so this rather boilerplate code is not necessary

Comment on lines 81 to +85

if not all(all_steps[0] == all_steps):
    raise ValueError(
        f"Step size for coordinate {coord} is not constant: {all_steps}"
    )
if sel_step != first_step:
    raise ValueError(
        f"Step size for coordinate {coord} is not the same as requested: {first_step} != {sel_step}"
def check_selection(ds, coord, sel_start, sel_end):
    if ds[coord].values.min() < sel_start or ds[coord].values.max() > sel_end:
        warnings.warn(
            f"\nChosen slice exceeds the range of {coord} in the dataset.\n Dataset span: [ {ds[coord].values.min()} : {ds[coord].values.max()} ]\n Chosen slice: [ {sel_start} : {sel_end} ]\n"
Contributor

This check assumes that the coordinates are sorted in ascending order; however, this is not always the case.

Example:

import xarray as xr
import numpy as np

# Create geographic coordinates
latitude = np.array([10, 5, 0, -5, -10])  # Latitude in descending order
longitude = np.array([30, 35, 40])  # Longitude in ascending order

# Create some sample data
data = np.random.rand(len(latitude), len(longitude))  # 2D data with latitudes and longitudes

# Create an xarray Dataset with geographic coordinates
ds = xr.Dataset(
    {
        'temperature': (('latitude', 'longitude'), data),  # 2D data with latitude and longitude dimensions
        'precipitation': (('latitude', 'longitude'), data * 0.1),  # Example precipitation data
    },
    coords={
        'latitude': latitude,
        'longitude': longitude,
    }
)

ds.sel({'latitude': slice(8,-5)}) # this slice is within bounds but would raise the above warning

It should also be sel_start <= ds[coord].values.min().

Nevertheless, I would not do this check at all. The user should know what a valid range is.
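For completeness, if the check were kept, an order-agnostic variant could look like this (a sketch only, given the suggestion above to drop it entirely):

```python
import warnings

def check_selection(ds, coord, sel_start, sel_end):
    # Normalise the slice bounds so descending coordinates (like the
    # latitude example above) are handled the same as ascending ones.
    lo, hi = sorted([sel_start, sel_end])
    cmin, cmax = ds[coord].values.min(), ds[coord].values.max()
    if lo < cmin or hi > cmax:
        warnings.warn(
            f"Chosen slice [{lo} : {hi}] exceeds the range of "
            f"{coord} in the dataset [{cmin} : {cmax}]"
        )
```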

matschreiner and others added 3 commits February 16, 2025 11:44
Co-authored-by: Hauke Schulz <43613877+observingClouds@users.noreply.github.com>
@leifdenby leifdenby added the bug Something isn't working label Sep 22, 2025