Refactor dataset filtering and event deserialization #521
Context
For #421 (and related PRs) we need to be able to distinguish between complete datasets and filtered datasets. For example, we should be able to raise an exception when a user tries to insert synthetically generated carries into a dataset that contains only shots. I was trying to figure out the cleanest way to support this. Here is a proposal.
Changes
1. Dynamic Class Generation for Filtered Datasets
I implemented a `FilteredDataset` mixin that uses dynamic class generation. When `Dataset.filter()` is called, it now returns a dynamically generated subclass (e.g., `FilteredEventDataset`). This approach preserves the original type hierarchy while allowing us to tag the instance as filtered.
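A minimal sketch of the idea (hypothetical code, not the actual kloppy implementation; `Dataset` and `EventDataset` here are simplified stand-ins):

```python
from dataclasses import dataclass
from typing import Callable, List


class FilteredDataset:
    """Marker mixin: tags an instance as the result of Dataset.filter()."""


@dataclass
class Dataset:
    records: List[object]

    def filter(self, predicate: Callable[[object], bool]) -> "Dataset":
        base = type(self)
        if issubclass(base, FilteredDataset):
            # Already filtered: reuse the class instead of wrapping it twice.
            filtered_cls = base
        else:
            # Build e.g. FilteredEventDataset on the fly; keeping the concrete
            # class in the bases preserves the original type hierarchy.
            filtered_cls = type(
                f"Filtered{base.__name__}", (base, FilteredDataset), {}
            )
        return filtered_cls([r for r in self.records if predicate(r)])


@dataclass
class EventDataset(Dataset):
    pass


events = EventDataset(records=["pass", "shot", "carry"])
shots = events.filter(lambda e: e == "shot")

assert isinstance(shots, EventDataset)     # original hierarchy preserved
assert isinstance(shots, FilteredDataset)  # tagged as filtered
print(type(shots).__name__)                # FilteredEventDataset
```

In the real implementation the generated class would probably be cached per dataset type; that is omitted here for brevity.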
2. Standardized Filtering Behavior & Event References
While thinking about this, I also noticed that we have two ways of creating a filtered dataset, and they return different results:
- If you filter while loading the data, the refs are set taking only the specified event types into account, so the previous event ref of a shot is the previous shot.
- If you filter after loading the data, the refs are set taking all events into account.
I have refactored the `EventDataDeserializer` to standardize on the second approach: accessing `.prev()` or `.next()` on a filtered dataset should reflect the actual event sequence of the match (e.g., the pass that led to the shot), rather than the previous record in the filtered list. In terms of performance, the difference is negligible.
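For illustration, here is roughly how the two paths compare from a user's perspective. This is a sketch: the loader and parameter names follow the current kloppy API as I understand it, and the loaded match is just the open-data default.

```python
from kloppy import statsbomb

# Option A: filter while loading. Previously the refs only considered the
# requested event types, so a shot's "previous" event was the previous shot.
shots_a = statsbomb.load_open_data(event_types=["shot"])

# Option B: filter after loading. The refs consider all events of the match.
full = statsbomb.load_open_data()
shots_b = full.filter("shot")

# After this refactor both paths follow option B: .prev()/.next() walk the
# real match sequence, so the event before a shot can be e.g. the pass that
# led to it, not the previous shot in the filtered list.
first_shot = shots_b.events[0]
print(first_shot.prev())
```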
3. Enhanced Metadata Merging
As a side effect of the previous changes, I also improved how `additional_metadata` is handled during deserialization. Currently only the StatsBomb deserializer uses it, but we should move towards supporting it in all deserializers.
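A hypothetical sketch of how the merge could look once it is shared by all deserializers (the `Metadata` class and attribute names are made up for illustration, not kloppy's real ones):

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class Metadata:
    provider: str
    attributes: Dict[str, Any] = field(default_factory=dict)


def build_metadata(
    provider: str,
    additional_metadata: Optional[Dict[str, Any]] = None,
) -> Metadata:
    """Shared helper every deserializer can call, so additional_metadata
    stops being a StatsBomb-only feature."""
    metadata = Metadata(provider=provider)
    if additional_metadata:
        # User-supplied values are merged on top of whatever the
        # deserializer extracted from the raw files.
        metadata.attributes.update(additional_metadata)
    return metadata


meta = build_metadata("statsbomb", additional_metadata={"source": "open-data"})
print(meta.attributes)  # {'source': 'open-data'}
```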