@shreed27 shreed27 commented Feb 1, 2026

Description: This PR updates the ParquetDataSource class in gemma/gm/data/_parquet.py to correctly handle a list of file paths. Previously, passing a list of paths caused initialization or reading to fail or behave unexpectedly, because the code did not iterate over the paths. The new implementation iterates over the provided paths, reads each table individually, and concatenates them with pyarrow.concat_tables.

Changes:

  • ParquetDataSource.table now normalizes self.path to a list.
  • It iterates over each path in the list, opens the file, and reads the Parquet table.
  • It concatenates all of the tables it read into a single PyArrow table.

Verification:
Verified manually with a mocked reproduction script, which shows that read_table is now called for each path in the list and that the resulting tables are combined with concat_tables.
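
A mocked check of this kind might look like the sketch below. It is not the reproduction script used for the PR: load_table is a hypothetical stand-in for the list-handling logic, and the reader and concatenation functions are injected as mocks so that no Parquet files are needed.

```python
from unittest import mock


def load_table(path, read_table, concat_tables):
    """Hypothetical stand-in for the list-handling logic under test."""
    paths = [path] if isinstance(path, str) else list(path)
    return concat_tables([read_table(p) for p in paths])


def check_reads_every_path():
    # Each mocked read returns a label so the calls can be traced.
    read_table = mock.Mock(side_effect=lambda p: f"table:{p}")
    concat_tables = mock.Mock(return_value="combined")

    result = load_table(["a.parquet", "b.parquet"], read_table, concat_tables)

    # read_table is called once per path, in order.
    assert read_table.call_args_list == [
        mock.call("a.parquet"),
        mock.call("b.parquet"),
    ]
    # concat_tables receives every table in a single call.
    concat_tables.assert_called_once_with(["table:a.parquet", "table:b.parquet"])
    return result
```

Calling check_reads_every_path() raises an AssertionError if any path is skipped or if the tables are not passed to the concatenation step together.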

- Implemented  for  model loading to prevent redundant IO and parsing when creating multiple instances.
- Added auto-download capability to : now automatically downloads/copies remote files (e.g., gs://) to the local cache if missing.
- Refactored  to separate model loading logic into standalone cached functions.
- Updated  to verify download behavior.