
Conversation

@tswast tswast commented Jan 21, 2026

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> 🦕

@product-auto-label product-auto-label bot added size: m Pull request size is medium. api: bigquery Issues related to the googleapis/python-bigquery-pandas API. samples Issues that are directly related to samples. labels Jan 21, 2026
@gemini-code-assist

Summary of Changes

Hello @tswast, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the pandas_gbq.sample functionality by integrating support for BigLake tables. It introduces a dedicated module for BigLake-specific operations, refactors the existing sampling mechanisms to accommodate different table types, and updates dependencies to support the new BigLake interactions. The overall aim is to broaden the utility of the sampling function to a wider range of BigQuery-related data sources.

Highlights

  • BigLake Table Support: Introduced initial support for BigLake tables within the pandas_gbq.sample function, allowing it to differentiate and handle BigLake table identifiers.
  • New BigLake Utility Module: Added a new module pandas_gbq/core/biglake.py which includes utilities for interacting with BigLake tables, such as a BigLakeTableId dataclass and a get_table_metadata function to fetch table metadata using the Iceberg REST Catalog API.
  • Refactored Sampling Logic: The core sampling logic in pandas_gbq/core/sample.py has been refactored into two new helper functions: _sample_bq_table for standard BigQuery tables and _sample_biglake_table for BigLake tables. The main sample function now dispatches to these helpers based on the table_id format.
  • Dependency Update: Added requests as a new dependency to setup.py and pinned its version in testing/constraints-3.9.txt, which is required for making HTTP requests to the BigLake Iceberg REST Catalog.
  • Internal API Change: The _sample_with_tablesample and _sample_with_limit internal functions now accept a table_id string directly instead of a google.cloud.bigquery.Table object, making them more flexible.
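Taken together, the highlights describe a dispatch on the table_id format. A minimal sketch of that flow, with stubbed-out helpers — the four-part heuristic and the helper signatures here are assumptions drawn from this summary, not the actual implementation:

```python
# Sketch of the dispatch described above. The helper bodies are stubs;
# in the PR they run real queries against BigQuery / BigLake.

def _sample_bq_table(*, table_id: str, **kwargs) -> str:
    # Placeholder for the standard BigQuery sampling path.
    return "bq"

def _sample_biglake_table(*, table_id: str, **kwargs) -> str:
    # Placeholder for the BigLake sampling path.
    return "biglake"

def sample(table_id: str, **kwargs) -> str:
    # BigLake IDs take the form project.catalog.namespace.table (four
    # dot-separated parts); standard BigQuery tables use
    # project.dataset.table (three parts).
    if len(table_id.split(".")) == 4:
        return _sample_biglake_table(table_id=table_id, **kwargs)
    return _sample_bq_table(table_id=table_id, **kwargs)
```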



@gemini-code-assist gemini-code-assist bot left a comment
Code Review

This pull request adds the initial structure to support sampling from BigLake tables in pandas_gbq.sample. It refactors the existing sampling logic into separate functions for BigQuery and BigLake tables and introduces a new biglake.py module to interact with the BigLake REST API.

My review has identified a few critical issues. The core logic for sampling BigLake tables is not yet implemented. The new biglake.py module contains a critical bug in URL construction and lacks proper error handling for API requests. I've also noted some areas for improvement regarding the robustness of table ID parsing and documentation. Please address these points to complete the feature.

Comment on lines +52 to +61
    return session.get(
        f"{_ICEBERG_REST_CATALOG_URI}.{path}",
        headers={
            "x-goog-user-project": billing_project_id,
            "Content-Type": "application/json; charset=utf-8",
            # TODO(tswast): parameter for this option (or get from catalog metadata?)
            # /iceberg/{$api_version}/restcatalog/extensions/{name=projects/*/catalogs/*}
            "X-Iceberg-Access-Delegation": "vended-credentials",
        },
    ).json()

critical

The URL for the BigLake REST API is constructed incorrectly. There's an extra . between the base URI and the path, which will lead to a 404 Not Found error. The path already starts with a /. Additionally, the response from session.get is not checked for errors before attempting to parse it as JSON. This can lead to unhelpful JSONDecodeError exceptions on HTTP failures. You should call response.raise_for_status() to handle non-2xx responses gracefully.

    response = session.get(
        f"{_ICEBERG_REST_CATALOG_URI}{path}",
        headers={
            "x-goog-user-project": billing_project_id,
            "Content-Type": "application/json; charset=utf-8",
            # TODO(tswast): parameter for this option (or get from catalog metadata?)
            # /iceberg/{$api_version}/restcatalog/extensions/{name=projects/*/catalogs/*}
            "X-Iceberg-Access-Delegation": "vended-credentials",
        },
    )
    response.raise_for_status()
    return response.json()

Comment on lines +207 to +216
def _sample_biglake_table(
    *,
    table_id: str,
    credentials: google.oauth2.credentials.Credentials,
    bqclient: google.cloud.bigquery.Client,
    target_bytes: int,
    progress_bar_type: str | None,
    use_bqstorage_api: bool,
) -> Optional[pandas.DataFrame]:
    pass

critical

The function _sample_biglake_table is a stub with a pass statement. The PR title suggests support for BigLake tables is being added, but the core implementation is missing. This function needs to be implemented to fulfill the feature's goal. Additionally, corresponding unit tests should be added to verify its functionality.

"""
# https://iceberg.apache.org/spec/#metrics
# total-files-size
project, catalog, namespace, table = table_id.split(".")

high

Unpacking table_id.split('.') directly into four variables is unsafe. If table_id doesn't contain exactly three dots, this will raise a ValueError. It's better to validate the number of parts before unpacking to provide a more informative error message and prevent unexpected crashes.

    parts = table_id.split(".")
    if len(parts) != 4:
        raise ValueError(
            "Invalid BigLake table ID. Expected format: "
            "project.catalog.namespace.table"
        )
    project, catalog, namespace, table = parts

Comment on lines +32 to +33
"""
Docstring for get_table_metadata

medium

The docstring for get_table_metadata is a placeholder. Please add a proper docstring explaining the function's purpose, its parameters (table_id, credentials, billing_project_id), and what it returns. This is important for maintainability and usability.
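For illustration, a docstring along these lines would cover the points above; the parameter descriptions are inferred from the surrounding snippets and should be checked against the actual behavior:

```python
def get_table_metadata(table_id, credentials, billing_project_id):
    """Fetch Iceberg table metadata from the BigLake Iceberg REST Catalog.

    Args:
        table_id (str):
            Fully-qualified BigLake table ID in the form
            ``project.catalog.namespace.table``.
        credentials (google.oauth2.credentials.Credentials):
            Credentials used to authorize the request.
        billing_project_id (str):
            Project billed for the API call, sent in the
            ``x-goog-user-project`` header.

    Returns:
        dict: Parsed JSON metadata for the table, including snapshot
        summary metrics such as ``total-files-size``.
    """
```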

    # BigLake tables can't be read directly by the BQ Storage Read API, so make
    # sure we run a query first.
    parts = table_id.split(".")
    if len(parts) == 4:

medium

Relying on len(table_id.split('.')) == 4 to identify a BigLake table is fragile. BigQuery identifiers can contain dots if they are quoted (e.g., my-project.dataset.with.dot.table). This could lead to misidentifying a standard BigQuery table as a BigLake table. Consider a more robust detection mechanism. For example, you could attempt to parse the ID as a BigLake ID and handle failure, or introduce an explicit parameter to specify the table type.
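One way such a guard might look — a parser that only treats an ID as BigLake when the simple four-part split is unambiguous, and returns None otherwise so the caller can fall back to the standard BigQuery path. The helper name and the backtick check are assumptions, not part of this PR:

```python
from typing import Optional, Tuple

def try_parse_biglake_id(table_id: str) -> Optional[Tuple[str, str, str, str]]:
    """Return (project, catalog, namespace, table) if ``table_id`` looks
    like an unquoted BigLake identifier, otherwise None.
    """
    if "`" in table_id:
        # Quoted identifiers may contain literal dots; don't guess.
        return None
    parts = table_id.split(".")
    if len(parts) == 4 and all(parts):
        return (parts[0], parts[1], parts[2], parts[3])
    return None
```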
