Issue: Fix/parquet multipath support  #533

@shreed27

Description:

This PR updates the ParquetDataSource class in gemma/gm/data/_parquet.py to correctly handle a list of file paths. Previously, passing a list of paths caused initialization or reading to fail or behave unexpectedly, because the code did not iterate over the paths. The new implementation iterates over the provided paths, reads each table individually, and concatenates them with pyarrow.concat_tables.

Changes:

  • Modified ParquetDataSource.table to normalize self.path to a list.
  • Iterate through each path in the list, open the file, and read the parquet table.
  • Concatenate all read tables into a single PyArrow table.

Verification:

Verified manually using a mocked reproduction script, which shows that read_table is now called once for each path in the list and that concat_tables is used to merge the results.
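
A mocked check along these lines might look like the sketch below. It is not the script used for this PR: `read_all` is a hypothetical stand-in for the table-loading logic, the reader and concatenation functions are injected so that no real parquet files or pyarrow install are needed, and file opening is omitted for brevity.

```python
from unittest import mock


def read_all(paths, read_table, concat_tables):
    """Hypothetical stand-in for the multi-path loading logic."""
    # Normalize a single path to a list, then read and merge.
    paths = [paths] if isinstance(paths, str) else list(paths)
    tables = [read_table(p) for p in paths]
    return concat_tables(tables)


# Mocks standing in for pyarrow.parquet.read_table and pyarrow.concat_tables.
read_table = mock.Mock(side_effect=lambda p: f"table:{p}")
concat_tables = mock.Mock(return_value="merged")

result = read_all(["a.parquet", "b.parquet"], read_table, concat_tables)

# read_table is called once per path, in order.
read_table.assert_has_calls([mock.call("a.parquet"), mock.call("b.parquet")])
assert read_table.call_count == 2

# concat_tables receives every table that was read.
concat_tables.assert_called_once_with(["table:a.parquet", "table:b.parquet"])
assert result == "merged"
```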

#524 PR raised for the issue; let me know if any iterations are needed. Love contributing to Google-Deepmind.
