## Context

For #421 (and related PRs) we need to be able to distinguish between complete datasets and filtered datasets. For example, we should be able to raise an exception when a user tries to insert synthetically generated carries into a dataset that contains only shots. I was trying to figure out the cleanest way to support this. Here is a proposal.

## Changes

### 1. Dynamic Class Generation for Filtered Datasets

I implemented a `FilteredDataset` mixin that uses dynamic class generation. When `Dataset.filter()` is called, it now returns a dynamically generated subclass (e.g., `FilteredEventDataset`). This approach preserves the original type hierarchy while allowing us to tag the instance as filtered.

Example:

```python
from kloppy import statsbomb
from kloppy.domain import EventDataset, FilteredDataset

dataset = statsbomb.load_open_data()
filtered = dataset.filter("pass")

print(type(filtered))
# <class 'FilteredEventDataset'>

print(isinstance(filtered, EventDataset))
# True (it still works as an EventDataset)

print(isinstance(filtered, FilteredDataset))
# True (it is now identified as filtered)
```
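Under the hood, a subclass like `FilteredEventDataset` could be generated with Python's built-in `type()`. A minimal sketch of the idea, assuming a `FilteredDataset` marker mixin; the cache and helper names below are hypothetical, not kloppy's actual internals:

```python
# Minimal sketch of dynamic class generation for filtered datasets.
# `_filtered_class_cache` and `_make_filtered_class` are hypothetical
# names used for illustration only.

class FilteredDataset:
    """Marker mixin that tags a dataset instance as filtered."""

_filtered_class_cache = {}

def _make_filtered_class(dataset_cls):
    # Cache the generated class so repeated filter() calls return
    # instances of the same subclass.
    if dataset_cls not in _filtered_class_cache:
        _filtered_class_cache[dataset_cls] = type(
            f"Filtered{dataset_cls.__name__}",  # e.g. FilteredEventDataset
            (FilteredDataset, dataset_cls),     # mixin first in the MRO
            {},
        )
    return _filtered_class_cache[dataset_cls]
```

`Dataset.filter()` would then build its result via `_make_filtered_class(type(self))`, so `isinstance` checks succeed against both the original dataset class and `FilteredDataset`.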

### 2. Standardized Filtering Behavior & Event References

While thinking about this, I also noticed that we have two ways of creating a filtered dataset, and they return different results.

If you filter while loading the data, the `.prev()`/`.next()` refs are set taking into account only the specified event types. In the example below, `.prev()` returns the previous shot.

```python
from kloppy import statsbomb

dataset = statsbomb.load_open_data(event_types=["shot"])
dataset.records[1].prev()
# StatsBombShotEvent(qualifiers=[BodyPartQualifier(value=<BodyPart.RIGHT_FOOT: 'RIGHT_FOOT'>)])
```

If you filter after loading the data, the refs are set taking into account all events:

```python
dataset = statsbomb.load_open_data().filter("shot")
dataset.records[1].prev()
# GenericEvent(...)
```

I have refactored the `EventDataDeserializer` to standardize on the second approach. Accessing `.prev()` or `.next()` on a filtered dataset should reflect the actual event sequence of the match (e.g., the pass that led to the shot), rather than the previous record in the filtered list.

```python
dataset = statsbomb.load_open_data(event_types=["shot"])
dataset.records[1].prev()
# GenericEvent(...)
```

In terms of performance, this change makes no noticeable difference.
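The key idea is to link records across the full event sequence before applying any filter. A minimal sketch, using illustrative `prev_record`/`next_record` attribute names rather than kloppy's actual internals:

```python
def link_and_filter(events, keep):
    # Link every event to its true neighbours in the complete match
    # sequence first...
    for before, after in zip(events, events[1:]):
        before.next_record = after
        after.prev_record = before
    # ...then reduce to the requested event types. The surviving
    # events' refs still point into the full sequence, so .prev() and
    # .next() reflect the actual match context.
    return [event for event in events if keep(event)]
```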

### 3. Enhanced Metadata Merging

As a side effect of the previous changes, I also improved how `additional_metadata` is handled during deserialization. Currently, only the StatsBomb deserializer uses it, but we should move towards supporting it for all deserializers.
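A minimal sketch of the intended merging behaviour, assuming a `Metadata` object with typed fields plus a free-form `attributes` dict; the helper name is illustrative:

```python
import warnings

def merge_additional_metadata(metadata, additional_metadata):
    # Sketch of the merging logic, not kloppy's exact implementation.
    for key, value in additional_metadata.items():
        if hasattr(metadata, key):
            # Known fields update the Metadata object directly.
            setattr(metadata, key, value)
        else:
            # Unknown keys are preserved in metadata.attributes with a
            # warning instead of raising a TypeError.
            warnings.warn(
                f"Unsupported metadata key {key!r}; "
                "storing it in metadata.attributes"
            )
            metadata.attributes[key] = value
```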

## Summary of changes

- Implemented dynamic class generation for filtered datasets. `Dataset.filter()` now returns a dynamically generated subclass (e.g., `FilteredEventDataset`) that inherits from both `FilteredDataset` (mixin) and the original dataset class. This avoids strict inheritance coupling while preserving type checks.
- Refactored `EventDataDeserializer` to use the Template Method pattern (see the sketch after this list):
  - `deserialize()` is now concrete and handles standard filtering and metadata merging.
  - `_deserialize()` is the new abstract hook for provider-specific parsing logic.
- Added robust metadata merging in `deserialize()`:
  - Valid fields in `additional_metadata` update the `Metadata` object directly.
  - Unsupported keys are automatically merged into `metadata.attributes` with a warning, preventing data loss or `TypeError`s.
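A sketch of the Template Method structure, reusing the hypothetical `merge_additional_metadata()` helper from section 3 (`_apply_event_filter` is likewise an illustrative name, not kloppy's actual API):

```python
from abc import ABC, abstractmethod

class EventDataDeserializer(ABC):
    def deserialize(self, inputs, additional_metadata=None):
        # Concrete template method: shared filtering and metadata
        # merging live here for every provider.
        dataset = self._deserialize(inputs)
        dataset = self._apply_event_filter(dataset)
        if additional_metadata:
            merge_additional_metadata(dataset.metadata, additional_metadata)
        return dataset

    @abstractmethod
    def _deserialize(self, inputs):
        """Provider-specific parsing hook, implemented per provider."""

    def _apply_event_filter(self, dataset):
        # Placeholder for the standardized filtering from section 2.
        return dataset
```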