
Support CREATE EXTERNAL TABLE backed by a Catalog with DataFusion #2021

@CTTY

Background

DataFusion already supports CREATE EXTERNAL TABLE ... STORED AS ICEBERG. Today, iceberg-rust integrates via IcebergTableProviderFactory, but the factory primarily supports registering a static table (e.g., created from a metadata JSON path). That works for:

-- Static table (existing, backward compatible)
CREATE EXTERNAL TABLE my_table
STORED AS ICEBERG
LOCATION '/path/to/metadata.json';

However, we also want CREATE EXTERNAL TABLE to create a normal IcebergTableProvider backed by a Catalog, so users can define the catalog via SQL OPTIONS (and then resolve tables by identifier through that catalog).

Dumping my thoughts here; feedback is welcome!

Option A: Build Catalog inside the ProviderFactory using OPTIONS

IcebergTableProviderFactory parses OPTIONS and uses a CatalogBuilder to construct the Catalog internally, then creates a normal IcebergTableProvider.

CREATE EXTERNAL TABLE my_table
STORED AS ICEBERG
LOCATION 'ignored_or_optional' -- ignored if a catalog is configured
OPTIONS (
  'datafusion.iceberg.catalog.type' = 'rest', -- if the catalog type is not configured, fall back to creating a static table
  'datafusion.iceberg.catalog.uri' = 'http://localhost:8181',
  'datafusion.iceberg.catalog.warehouse' = 's3://bucket/warehouse'
);
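
To make the fallback rule in Option A concrete, here is a minimal sketch of how the factory might split OPTIONS into catalog properties. It uses only the standard library; `TableSource` and `resolve_source` are hypothetical names for illustration, and the prefix is taken from the example keys above. It does not call the real CatalogBuilder API.

```rust
use std::collections::HashMap;

/// Prefix taken from the OPTIONS keys in the SQL example above.
const CATALOG_PREFIX: &str = "datafusion.iceberg.catalog.";

/// Hypothetical result of inspecting the OPTIONS clause.
#[derive(Debug, PartialEq)]
pub enum TableSource {
    /// No catalog type configured: load a static table from LOCATION.
    Static { metadata_location: String },
    /// Catalog type configured: LOCATION is ignored.
    Catalog {
        catalog_type: String,
        props: HashMap<String, String>,
    },
}

/// Decide between the static-table path and the catalog-backed path.
pub fn resolve_source(location: &str, options: &HashMap<String, String>) -> TableSource {
    // Keep only the catalog options and strip the common prefix,
    // so 'datafusion.iceberg.catalog.uri' becomes 'uri'.
    let mut props: HashMap<String, String> = options
        .iter()
        .filter_map(|(key, value)| {
            key.strip_prefix(CATALOG_PREFIX)
                .map(|short| (short.to_string(), value.clone()))
        })
        .collect();

    // The 'type' key selects the catalog implementation; its absence
    // means we fall back to the existing static-table behavior.
    match props.remove("type") {
        Some(catalog_type) => TableSource::Catalog { catalog_type, props },
        None => TableSource::Static {
            metadata_location: location.to_string(),
        },
    }
}
```

In a real implementation, the remaining `props` would presumably be handed to whichever CatalogBuilder matches `catalog_type`.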

Option B: Allow injecting a pre-built Catalog into the factory

Essentially, we would have:

pub struct IcebergTableProviderFactory {
    // When `None`, fall back to the static-table behavior.
    catalog: Option<Arc<dyn Catalog>>,
}
...
IcebergTableProviderFactory::new_with_catalog(Arc<dyn Catalog>)
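
A slightly fuller sketch of that shape, for discussion only. The `Catalog` trait here is a simplified stand-in for the real iceberg-rust trait (which is async and much larger), and `NamedCatalog` and `describe_create` are hypothetical helpers that only show which path the factory would take.

```rust
use std::sync::Arc;

/// Simplified stand-in for iceberg-rust's `Catalog` trait (assumption).
pub trait Catalog: Send + Sync {
    fn name(&self) -> &str;
}

/// Hypothetical catalog used only to exercise the sketch.
pub struct NamedCatalog(pub String);

impl Catalog for NamedCatalog {
    fn name(&self) -> &str {
        &self.0
    }
}

pub struct IcebergTableProviderFactory {
    // When `None`, fall back to the static-table behavior.
    catalog: Option<Arc<dyn Catalog>>,
}

impl IcebergTableProviderFactory {
    /// Static-table factory (the existing, backward-compatible mode).
    pub fn new() -> Self {
        Self { catalog: None }
    }

    /// Catalog-backed factory, as proposed in Option B.
    pub fn new_with_catalog(catalog: Arc<dyn Catalog>) -> Self {
        Self {
            catalog: Some(catalog),
        }
    }

    /// Hypothetical helper: report which path CREATE EXTERNAL TABLE would take.
    pub fn describe_create(&self, table_name: &str, location: &str) -> String {
        match &self.catalog {
            Some(catalog) => format!("load `{table_name}` via catalog `{}`", catalog.name()),
            None => format!("load static table from `{location}`"),
        }
    }
}
```

The point of the `Option` is that one factory type covers both modes, so existing users of the static path would not need to change anything.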

I prefer this option because it is much more straightforward. One drawback I can think of is that users cannot easily use multiple catalogs at the same time, since each factory holds a single catalog. A workaround is to register one factory per catalog under a different name:

// One factory per catalog, each registered under its own STORED AS name.
state.table_factories_mut().insert(
    "ICEBERG_REST_A".to_string(),
    Arc::new(IcebergTableProviderFactory::new_with_catalog(rest_catalog_a)),
);

state.table_factories_mut().insert(
    "ICEBERG_REST_B".to_string(),
    Arc::new(IcebergTableProviderFactory::new_with_catalog(rest_catalog_b)),
);

Then, when creating the table in SQL, the STORED AS name selects the catalog:

CREATE EXTERNAL TABLE my_table
STORED AS ICEBERG_REST_A
...
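
The lookup this workaround relies on can be sketched with a plain map. This is a toy model only: `FactoryRegistry` stands in for the session state's table factory map, and `CatalogBackedFactory`, `register`, and `resolve` are hypothetical names, not DataFusion or iceberg-rust APIs.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical stand-in for a catalog-backed factory; it only
/// records which catalog it was built with.
pub struct CatalogBackedFactory {
    pub catalog_name: String,
}

/// Stand-in for the session state's table factory map.
pub type FactoryRegistry = HashMap<String, Arc<CatalogBackedFactory>>;

/// Register a factory under the name used in `STORED AS <name>`.
pub fn register(registry: &mut FactoryRegistry, stored_as: &str, catalog_name: &str) {
    registry.insert(
        stored_as.to_string(),
        Arc::new(CatalogBackedFactory {
            catalog_name: catalog_name.to_string(),
        }),
    );
}

/// Resolve the catalog that a `STORED AS <name>` clause would select.
pub fn resolve<'a>(registry: &'a FactoryRegistry, stored_as: &str) -> Option<&'a str> {
    registry
        .get(stored_as)
        .map(|factory| factory.catalog_name.as_str())
}
```

The cost of this approach is that the catalog choice leaks into the STORED AS clause, so each additional catalog needs its own registered name.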

Willingness to contribute

I can contribute to this feature independently
