## Context

For #421 (and related PRs) we need to be able to distinguish between complete datasets and filtered datasets. For example, we should be able to raise an exception when a user tries to insert synthetically generated carries into a dataset that contains only shots. I was trying to figure out the cleanest way to support this. Here is a proposal.

## Changes

### 1. Dynamic Class Generation for Filtered Datasets

I implemented a `FilteredDataset` mixin that uses dynamic class generation. When `Dataset.filter()` is called, it now returns a dynamically generated subclass (e.g., `FilteredEventDataset`). This approach preserves the original type hierarchy while allowing us to tag the instance as filtered.

Example:

```python
from kloppy import statsbomb
from kloppy.domain import EventDataset, FilteredDataset

dataset = statsbomb.load_open_data()
filtered = dataset.filter("pass")

print(type(filtered))
# <class 'FilteredEventDataset'>

print(isinstance(filtered, EventDataset))
# True (it still works as an EventDataset)

print(isinstance(filtered, FilteredDataset))
# True (it is now identified as filtered)
```
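Under the hood, a subclass like `FilteredEventDataset` could be generated with Python's built-in `type()`. A minimal sketch of the idea, assuming a `FilteredDataset` marker mixin; the cache and helper names below are hypothetical, not kloppy's actual internals:

```python
# Minimal sketch of dynamic class generation for filtered datasets.
# `_filtered_class_cache` and `_make_filtered_class` are hypothetical
# names used for illustration only.

class FilteredDataset:
    """Marker mixin that tags a dataset instance as filtered."""

_filtered_class_cache = {}

def _make_filtered_class(dataset_cls):
    # Cache the generated class so repeated filter() calls return
    # instances of the same subclass.
    if dataset_cls not in _filtered_class_cache:
        _filtered_class_cache[dataset_cls] = type(
            f"Filtered{dataset_cls.__name__}",  # e.g. FilteredEventDataset
            (FilteredDataset, dataset_cls),     # mixin first in the MRO
            {},
        )
    return _filtered_class_cache[dataset_cls]
```

`Dataset.filter()` would then build its result via `_make_filtered_class(type(self))`, so `isinstance` checks succeed against both the original dataset class and `FilteredDataset`.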

### 2. Standardized Filtering Behavior & Event References

While thinking about this, I also noticed that we have two ways of creating a filtered dataset, and they return different results.

If you filter while loading the data, the `.prev()`/`.next()` refs are set taking into account only the specified event types. In the example below, `.prev()` returns the previous shot.

```python
from kloppy import statsbomb

dataset = statsbomb.load_open_data(event_types=["shot"])
dataset.records[1].prev()
# StatsBombShotEvent(qualifiers=[BodyPartQualifier(value=<BodyPart.RIGHT_FOOT: 'RIGHT_FOOT'>)])
```

If you filter after loading the data, the refs are set taking into account all events:

```python
dataset = statsbomb.load_open_data().filter("shot")
dataset.records[1].prev()
# GenericEvent(...)
```

I have refactored the `EventDataDeserializer` to standardize on the second approach. Accessing `.prev()` or `.next()` on a filtered dataset should reflect the actual event sequence of the match (e.g., the pass that led to the shot), rather than the previous record in the filtered list.

```python
dataset = statsbomb.load_open_data(event_types=["shot"])
dataset.records[1].prev()
# GenericEvent(...)
```

In terms of performance, this change makes no noticeable difference.
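The key idea is to link records across the full event sequence before applying any filter. A minimal sketch, using illustrative `prev_record`/`next_record` attribute names rather than kloppy's actual internals:

```python
def link_and_filter(events, keep):
    # Link every event to its true neighbours in the complete match
    # sequence first...
    for before, after in zip(events, events[1:]):
        before.next_record = after
        after.prev_record = before
    # ...then reduce to the requested event types. The surviving
    # events' refs still point into the full sequence, so .prev() and
    # .next() reflect the actual match context.
    return [event for event in events if keep(event)]
```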

### 3. Enhanced Metadata Merging

As a side effect of the previous changes, I also improved how `additional_metadata` is handled during deserialization. Currently, only the StatsBomb deserializer uses it, but we should move towards supporting it for all deserializers.
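A minimal sketch of the intended merging behaviour, assuming a `Metadata` object with typed fields plus a free-form `attributes` dict; the helper name is illustrative:

```python
import warnings

def merge_additional_metadata(metadata, additional_metadata):
    # Sketch of the merging logic, not kloppy's exact implementation.
    for key, value in additional_metadata.items():
        if hasattr(metadata, key):
            # Known fields update the Metadata object directly.
            setattr(metadata, key, value)
        else:
            # Unknown keys are preserved in metadata.attributes with a
            # warning instead of raising a TypeError.
            warnings.warn(
                f"Unsupported metadata key {key!r}; "
                "storing it in metadata.attributes"
            )
            metadata.attributes[key] = value
```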

## Summary of changes

- Implemented dynamic class generation for filtered datasets. `Dataset.filter()` now returns a dynamically generated subclass (e.g., `FilteredEventDataset`) that inherits from both `FilteredDataset` (mixin) and the original dataset class. This avoids strict inheritance coupling while preserving type checks.
- Refactored `EventDataDeserializer` to use the Template Method pattern (see the sketch after this list):
  - `deserialize()` is now concrete and handles standard filtering and metadata merging.
  - `_deserialize()` is the new abstract hook for provider-specific parsing logic.
- Added robust metadata merging in `deserialize()`:
  - Valid fields in `additional_metadata` update the `Metadata` object directly.
  - Unsupported keys are automatically merged into `metadata.attributes` with a warning, preventing data loss or `TypeError`s.
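A sketch of the Template Method structure, reusing the hypothetical `merge_additional_metadata()` helper from section 3 (`_apply_event_filter` is likewise an illustrative name, not kloppy's actual API):

```python
from abc import ABC, abstractmethod

class EventDataDeserializer(ABC):
    def deserialize(self, inputs, additional_metadata=None):
        # Concrete template method: shared filtering and metadata
        # merging live here for every provider.
        dataset = self._deserialize(inputs)
        dataset = self._apply_event_filter(dataset)
        if additional_metadata:
            merge_additional_metadata(dataset.metadata, additional_metadata)
        return dataset

    @abstractmethod
    def _deserialize(self, inputs):
        """Provider-specific parsing hook, implemented per provider."""

    def _apply_event_filter(self, dataset):
        # Placeholder for the standardized filtering from section 2.
        return dataset
```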