Fix selection #70
Conversation
leifdenby
left a comment
Looking good! Thanks for working on this :)
```diff
-start: Union[str, int, float]
+start: Union[str, int, float, datetime]
 end: Union[str, int, float]
 step: Optional[Union[str, int, float]] = None
```
Ah, I'm an idiot. I'm not sure we need to support `str` here, since I added that for ISO 8601 formatted strings. I thought that YAML doesn't natively support datetime serialisation, but I misremembered: that is JSON! So if people write ISO 8601 formatted strings, YAML should always turn those into `datetime.datetime` objects if we declare that as the type here. That also means we don't have to handle turning strings into datetime/timedelta objects in the code, nice!
which I think should mean we should remove str here. Do you agree @matschreiner and @observingClouds ?
I imagine some folks might still write times as strings, but not allowing it in the first place would probably be more robust, so I'd be happy to drop str support
Hmm, I'm confused. Our current config schema supports the following:
```yaml
schema_version: v0.6.0
dataset_version: v0.1.0

output:
  variables:
    static: [grid_index, static_feature]
    state: [time, grid_index, state_feature]
    forcing: [time, grid_index, forcing_feature]
  coord_ranges:
    time:
      start: 1990-09-03T00:00
      end: 1990-09-09T00:00
      step: PT3H
```
...What I am suggesting is that if we replace the `str` type with `datetime` in the config dataclasses, then the dataclass deserialisation (from dataclass-wizard) will handle turning the `start` and `end` fields into `datetime` objects, rather than us having to do it. In terms of what is in the config yaml-files, they would remain unchanged though? So people can keep defining the start/end times as they already do. What was previously interpreted as a string will now simply be interpreted as a datetime serialised as an ISO 8601 string, no?
I was just experimenting together with @matschreiner: `1990-09-03T00:00` is not treated as a valid ISO format and is not converted to datetime, it remains a string, whereas `1990-09-03T00:00:00` is converted.

There seems to be an issue with the roundtrip though, as `mdp.Config.to_yaml()`/`to_dict()`/`to_json()` all serialize datetimes back to strings... this might be an upstream issue in dataclass-wizard.
Hmm, that is strange. Omitting the seconds should be valid: https://en.wikipedia.org/wiki/ISO_8601#Times

> Either the seconds, or the minutes and seconds, may be omitted from the basic or extended time formats for greater brevity but decreased precision; the resulting reduced precision time formats are:
> - T[hh][mm] in basic format or T[hh]:[mm] in extended format, when seconds are omitted.
> - T[hh], when both seconds and minutes are omitted.

Maybe this is an upstream bug, or YAML requires that the seconds be included...
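A quick way to check where the string survives is to load the two forms directly with PyYAML (a minimal sketch; as far as I can tell, PyYAML's YAML 1.1 timestamp resolver requires the seconds field whenever a time-of-day is given, so the HH:MM form falls through to a plain string):

```python
import datetime

import yaml  # PyYAML

# Without seconds the value does not match YAML 1.1's timestamp pattern,
# so it is loaded as a plain string
without_seconds = yaml.safe_load("start: 1990-09-03T00:00")["start"]

# With seconds it matches the pattern and is loaded as datetime.datetime
with_seconds = yaml.safe_load("start: 1990-09-03T00:00:00")["start"]

print(type(without_seconds).__name__)  # str
print(type(with_seconds).__name__)     # datetime
```

So the behaviour observed above would come from the YAML loader, before dataclass-wizard ever sees the value.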
In the dataclass-wizard docs, https://dataclass-wizard.readthedocs.io/en/latest/overview.html#supported-types:

> For date, time, and datetime types, string values are de-serialized using the builtin fromisoformat() method; for datetime and time types, a suffix of “Z” appearing in the string is first replaced with “+00:00”, which represents UTC time. JSON values for datetime and date annotated types appearing as numbers will be de-serialized using the builtin fromtimestamp() method.
>
> All these types are serialized back to JSON using the builtin isoformat() method. For datetime and time types, there is one noteworthy addition: the suffix “+00:00” is replaced with “Z”, which is a common abbreviation for UTC time.

This seems OK, no? It does mean the round trip wouldn't result in exactly the same yaml, because the seconds will be added by the `isoformat()` call: https://docs.python.org/3/library/datetime.html#datetime.datetime.isoformat
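The stdlib pair illustrates the asymmetry on its own (minimal sketch, no third-party libraries): the HH:MM form parses fine with `fromisoformat()`, but the round trip re-emits the seconds.

```python
from datetime import datetime

# The minutes-only ISO 8601 form is accepted by fromisoformat()
t = datetime.fromisoformat("1990-09-03T00:00")

# ...but isoformat() always writes the seconds back out
print(t.isoformat())  # 1990-09-03T00:00:00
```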
| "time_stepsize", | ||
| [testdata.DT_ANALYSIS, testdata.DT_ANALYSIS * 2, testdata.DT_ANALYSIS / 2], | ||
| ) | ||
| def test_time_selection(source_data_contains_time_range, time_stepsize): |
how come you are getting rid of this test? Don't you like it 😆
```python
assert sel_start != sel_end, "Start and end cannot be the same"
```
`ds.sel({'time': slice(dt.datetime(2023,1,1), dt.datetime(2023,1,1))})` is a valid expression and returns data for `dt.datetime(2023,1,1)`.
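To illustrate with a toy dataset (a minimal sketch; xarray's label-based slicing is inclusive at both ends, so a slice with `start == end` selects exactly that label rather than being empty):

```python
import datetime as dt

import numpy as np
import xarray as xr

times = np.array(["2023-01-01", "2023-01-02"], dtype="datetime64[ns]")
ds = xr.Dataset({"v": ("time", [1.0, 2.0])}, coords={"time": times})

# start == end is a valid slice and returns the single matching timestep
sel = ds.sel(time=slice(dt.datetime(2023, 1, 1), dt.datetime(2023, 1, 1)))
print(sel.sizes["time"])  # 1
```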
| if coord == "time": | ||
| sel_start = to_timestamp(selection.start) | ||
| sel_end = to_timestamp(selection.end) | ||
| sel_step = get_time_step(sel_step, ds) |
I agree with @leifdenby that it would be good to only accept datetimes and so this rather boilerplate code is not necessary
```python
if not all(all_steps[0] == all_steps):
    raise ValueError(
        f"Step size for coordinate {coord} is not constant: {all_steps}"
    )
if sel_step != first_step:
    raise ValueError(
        f"Step size for coordinate {coord} is not the same as requested: {first_step} != {sel_step}"
    )
```

```python
def check_selection(ds, coord, sel_start, sel_end):
    if ds[coord].values.min() < sel_start or ds[coord].values.max() > sel_end:
        warnings.warn(
            f"\nChosen slice exceeds the range of {coord} in the dataset.\n"
            f" Dataset span: [ {ds[coord].values.min()} : {ds[coord].values.max()} ]\n"
            f" Chosen slice: [ {sel_start} : {sel_end} ]\n"
        )
```
This check assumes that the coordinates are sorted in ascending order; however, this is not always the case.

Example:

```python
import numpy as np
import xarray as xr

# Create geographic coordinates
latitude = np.array([10, 5, 0, -5, -10])  # latitude in descending order
longitude = np.array([30, 35, 40])  # longitude in ascending order

# Create some sample data
data = np.random.rand(len(latitude), len(longitude))

# Create an xarray Dataset with geographic coordinates
ds = xr.Dataset(
    {
        "temperature": (("latitude", "longitude"), data),
        "precipitation": (("latitude", "longitude"), data * 0.1),
    },
    coords={
        "latitude": latitude,
        "longitude": longitude,
    },
)

ds.sel({"latitude": slice(8, -5)})  # within bounds, but would raise the warning above
```

The condition should also be `sel_start <= ds[coord].values.min()`.

Nevertheless, I would not do this check at all. The user should know what a valid range is.
Co-authored-by: Hauke Schulz <43613877+observingClouds@users.noreply.github.com>
Following up on issues #68 and #69, raised by @observingClouds, I have made modifications to the kwarg_selection code as well as cleaned up the code a bit further.
Issue #68
1.1) I added two danra yaml configuration files, one using strings and one using datetime objects in the date fields, and wrote tests that instantiate an mdp.Config from both. However, this did not reproduce the error.
1.2) I added a test that tries to instantiate a Range config object from the datetimes which also doesn't throw an error.
1.3) Even though I didn't see the error, I added datetime objects and timedeltas as accepted types for the Range class. I hope we don't see this bug again. If anyone can help me write a test to showcase the problem, I'd be happy.
2.1) I have removed the assertion and instead raise a warning when the `selection.start` and `selection.end` points are less than or greater than the range of the chosen coordinate, respectively. This makes preparing datasets in the wild easier. This warning is raised equally on all dimensions.

2.2) I have removed a test that raised an error when the exact datapoint was not in the dataset. This test also checked whether the time deltas in the sliced dataset were the same as the ones specified in the slice. I have written new, simpler tests that cover this.
Issue #69
3.1) I have added functionality to slice the data using time deltas; this is not supported natively by xarray. Given the chosen stepsize and the frequency of datapoints in the dataset, we have to calculate a striding value that simply takes every n-th datapoint in the dataset. So if we use `Range(start, end, 'PT6H')` and we have a datapoint every `PT3H` in the dataset, we have to calculate a striding of 2, which is then used when slicing.

3.2) I have added tests to check that the slicing works as expected and that an error is raised if the stepsize is not a multiple of the frequency in the dataset.
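The striding computation described in 3.1 can be sketched with stdlib timedeltas (`compute_stride` is a hypothetical helper name for illustration, not the actual function added in this PR):

```python
from datetime import timedelta


def compute_stride(requested_step: timedelta, dataset_step: timedelta) -> int:
    """Return how many native datapoints to advance per requested step."""
    ratio = requested_step / dataset_step  # timedelta division yields a float
    if ratio < 1 or ratio != int(ratio):
        raise ValueError(
            f"Requested step {requested_step} is not a positive multiple "
            f"of the dataset step {dataset_step}"
        )
    return int(ratio)


# PT6H requested on a PT3H dataset -> take every 2nd datapoint
print(compute_stride(timedelta(hours=6), timedelta(hours=3)))  # 2
```

The resulting integer can then be used as the `step` of an ordinary positional slice, e.g. `ds.isel(time=slice(None, None, stride))`.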
Cleanup
I renamed the `_normalize_...` functions in the selection module to reflect that they are about creating time objects, and I made the functions public as I needed them in tests.

I removed the conditional logic for slicing using lists, since lists are explicitly not allowed in the typing of the input arguments. Also, I'm not sure when we would use it.