Description:
This PR updates the ParquetDataSource class in gemma/gm/data/_parquet.py to correctly handle a list of file paths. Previously, passing a list of paths would cause initialization or reading to fail or behave unexpectedly, because the code did not iterate over the paths. The new implementation iterates over the provided paths, reads each table individually, and concatenates them using pyarrow.concat_tables.
Changes:
- Modified ParquetDataSource.table to normalize self.path to a list.
- Iterate through each path in the list, open the file, and read the parquet table.
- Concatenate all read tables into a single PyArrow table.
Verification:
Verified manually using a reproduction script (mocked) demonstrating that read_table is now called for each path in the list and concat_tables is utilized.
#524: PR raised for the issue. Let me know if any iterations are needed. Love contributing to Google DeepMind.