Member

@EpsilonPrime EpsilonPrime commented Dec 13, 2025

Add GenerateRel, a new relational operator that applies table functions to produce zero or more output rows per input row. This enables SQL LATERAL VIEW, EXPLODE, and UNNEST operations for flattening nested data structures.

Changes

  • Adds TableFunction message based on original plans for such a message
  • Adds GenerateRel message which is the first use of table functions

Key Design Decisions

  1. Dedicated TableFunction message: Provides type safety and clear semantics for multi-row producing functions, following the pattern of ScalarFunction, WindowFunction, and AggregateFunction.

  2. preserve_on_empty flag: Controls NULL handling when generators produce zero rows. Named to reflect behavior (row preservation) rather than SQL join terminology ("outer"). Equivalent to Spark's LATERAL VIEW OUTER or standard SQL's LEFT JOIN LATERAL.

  3. No automatic ordinal column: Unlike ExpandRel which appends an i32 ordinal, GenerateRel relies on table functions to include position explicitly (e.g., posexplode includes pos field in output) when needed.

  4. TableFunction not in Expression: Table functions produce multiple rows, incompatible with Expression semantics (single value). Only used in relational operators like GenerateRel.

Comparison to ExpandRel

  • ExpandRel: Fixed cardinality (plan time), for grouping sets/cube
  • GenerateRel: Variable cardinality (runtime), for lateral view/unnest
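The semantics described above can be modeled in a few lines of Python. This is only an illustrative sketch, not Substrait or engine code: `explode`, `generate`, and the tuple-based row representation are hypothetical names chosen for the example. It shows the output layout (input fields followed by generated fields) and the effect of `preserve_on_empty`.

```python
from typing import Callable, Iterable, Iterator, Optional, Sequence, Tuple

Row = Tuple[object, ...]


def explode(values: Optional[Sequence[object]]) -> Iterator[Row]:
    """Hypothetical table function: one single-column row per array element."""
    for value in values or ():
        yield (value,)


def generate(
    rows: Iterable[Row],
    generator: Callable[[Row], Iterable[Row]],
    generated_width: int,
    preserve_on_empty: bool = False,
) -> Iterator[Row]:
    """Model of GenerateRel: apply the generator to every input row.

    Each output row is the input row followed by the generated fields.
    When the generator yields nothing and preserve_on_empty is set, the
    input row is kept once with nulls in the generated columns.
    """
    for row in rows:
        produced = False
        for generated in generator(row):
            produced = True
            yield row + generated
        if not produced and preserve_on_empty:
            yield row + (None,) * generated_width


rows = [(1, ["a", "b"]), (2, []), (3, ["c"])]
inner = list(generate(rows, lambda r: explode(r[1]), 1))
outer = list(generate(rows, lambda r: explode(r[1]), 1, preserve_on_empty=True))
```

With the default setting, the row whose array is empty disappears from `inner`; with `preserve_on_empty=True` it survives in `outer` with a null generated column. The number of output rows depends on the data, which is the runtime cardinality contrasted with ExpandRel above.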

Related Work

Will be used to unfork the changes made in Apache Gluten PR #574.

Member

@benbellick benbellick left a comment

Very excited for this one, as it would be useful for modeling Unnest at my job.

Couple things that I think are high priority:

  1. Could we introduce the schema changes for table functions in simple_extension_schema.yaml?
  2. Could we introduce some proper examples into something like functions_explode.yaml to show these definitions? This also helps ensure that downstream libraries are properly handling these things.
  3. For my particular use case, I would love to be able to represent queries like SELECT UNNEST(ARRAY[1,2]), UNNEST(ARRAY[4,5,6]) which has a zipping behavior rather than cartesian product. What would be the best way to represent these? (One way I can think of is as variadic unnest call which has zipping behavior.)

All in all this will be great and I'm happy to help implement them in the downstream libraries.
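To make the difference in point 3 concrete, here is a small Python sketch of the two possible behaviors for the two arrays in that query. Padding the shorter array with nulls in the zipped case is an assumption made for illustration; the thread does not specify how a variadic unnest would handle unequal lengths.

```python
from itertools import product, zip_longest

a = [1, 2]
b = [4, 5, 6]

# Cartesian product: what two independent, chained generators would give.
cartesian = list(product(a, b))

# Zipping: elements are paired by position; the shorter array is padded
# with None (an assumption for this sketch).
zipped = list(zip_longest(a, b, fillvalue=None))
```

The cartesian form produces six rows, while the zipped form produces three, so the two cannot be expressed by the same plan shape without some way to say which one is meant.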

// standard SQL via LEFT JOIN LATERAL.
//
// Optional; defaults to false.
bool preserve_on_empty = 4;

Member

This is a similar idea to this PR here from @yongchul: #890

I wonder if there is some way to unify these ideas?

Member Author

An alternative would be to make this part of TableFunction but if there are other table expressions they might need this functionality.

Comment on lines 592 to 593
// For compatibility and optimization hints.
substrait.extensions.AdvancedExtension advanced_extension = 10;

Member

Suggested change
- // For compatibility and optimization hints.
- substrait.extensions.AdvancedExtension advanced_extension = 10;
+ substrait.extensions.AdvancedExtension advanced_extension = 10;

None of the other usages of advanced_extension has a comment, so I think let's just drop it to be consistent.

Also, I understand wanting to set the field number to 10 for consistency, but there are actually usages of advanced_extension which are 10, 9, 5, and 4 in this file. Any reason not to just make it 5 so that it is the next field? TBH though this is not particularly important to me.

Member Author

Having them all the same makes it easier for some proto uses (but I can't name any at the moment). As part of Substrait 1.0 I'd want to see these moved up to 100 or higher. Lower field numbers compress better so it'd be nice to reserve some space (not that we really need 100 reserved field numbers).
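The point about lower field numbers can be checked against the protobuf wire format, where each field is prefixed by a tag equal to `(field_number << 3) | wire_type`, encoded as a varint. The helper names below are hypothetical and the sketch only computes the tag size; it does not use any protobuf library.

```python
def varint_size(value: int) -> int:
    """Number of bytes needed to encode a non-negative integer as a varint."""
    size = 1
    while value >= 0x80:
        value >>= 7
        size += 1
    return size


def tag_size(field_number: int, wire_type: int = 2) -> int:
    """Size in bytes of a field tag; wire type 2 is length-delimited,
    which is what embedded messages such as advanced_extension use."""
    return varint_size((field_number << 3) | wire_type)


sizes = {n: tag_size(n) for n in (5, 10, 15, 16, 100)}
```

Field numbers 1 through 15 fit in a one-byte tag and 16 through 2047 need two bytes, so moving `advanced_extension` from 10 to 100 would cost one extra byte per occurrence.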

//
// Required.
Type output_type = 4;
}

Member

FYI this PR has a lot of overlap with this one I wrote before: #876

Though what you have here is more general than what I did, and so I think your approach is better. I just wanted to share in case any of the discussion there is useful. One thing in particular that was discussed there: have you at all thought about variadic handling for table functions?

Member Author

The output needs to be determinable from the input. So for more complicated structures we likely need to upgrade the grammar. For now we can achieve most of what we need with function overloading.


### Output Type

The `output_type` field must be a `Struct` type where each field represents a column in the generated output rows. This explicitly defines the schema of rows produced by the table function.

Member

Can the struct be nullable? Can it be a named struct?

Member Author

Nullable yes, named struct no. Named structs are a different kind of object (i.e. not a type). If you want name hints you can use that feature to provide them. Substrait explicitly is ignoring names.

Member Author

Actually, not nullable structs since you have to return some number of rows.

Member

@westonpace westonpace left a comment

I have a general question. There are (at least) two ways in which table functions are utilized. I'm not entirely sure if these are two different concepts or the same concept.

Explode / lateral join / cross apply

I believe GenerateRel captures these cases. The input to the table function is the output of another relation. The arguments will (typically) include field references (or scalar functions applied to field references).

From

The other way in which table functions are applied is by using them directly in the FROM clause, for example SELECT * FROM myfunc(7) (Postgres supports this). In this case the arguments are (typically? always?) all literals.

Is the TableFunction message here intended to eventually be usable in both contexts? This PR itself only adds support for the first (which is fine). Still, it seems like the message itself wouldn't have to change. I think the only change I can imagine is adding a new source rel that has no input rels and specifies the table function invocation.
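The two contexts in this question can be sketched with one table function used both ways. Everything here is hypothetical and for illustration only: `myfunc` is a stand-in that yields `n` rows, `lateral_apply` models the per-row case, and `table_function_source` models a source relation with no input.

```python
from typing import Callable, Iterable, Iterator, Sequence, Tuple

Row = Tuple[object, ...]


def myfunc(n: int) -> Iterator[Row]:
    """Hypothetical table function: yields the rows (0,), (1,), ..., (n-1,)."""
    for i in range(n):
        yield (i,)


def lateral_apply(
    rows: Iterable[Row],
    function: Callable[..., Iterable[Row]],
    argument_exprs: Sequence[Callable[[Row], object]],
) -> Iterator[Row]:
    """Per-row use: arguments are expressions evaluated against each input row."""
    for row in rows:
        args = [expr(row) for expr in argument_exprs]
        for generated in function(*args):
            yield row + generated


def table_function_source(
    function: Callable[..., Iterable[Row]], *literal_args: object
) -> Iterator[Row]:
    """FROM-clause use: no input relation, arguments are literals."""
    yield from function(*literal_args)


from_clause = list(table_function_source(myfunc, 3))
per_row = list(lateral_apply([(2,), (0,), (1,)], myfunc, [lambda r: r[0]]))
# The source case behaves like the per-row case over a single empty row.
via_empty_row = list(lateral_apply([()], myfunc, [lambda r: 3]))
```

In this model the function invocation itself is identical in both uses; only the surrounding relation differs, which matches the observation that the message would not need to change.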

Add GenerateRel, a new relational operator that applies table functions to
produce zero or more output rows per input row. This enables SQL LATERAL VIEW,
EXPLODE, and UNNEST operations for flattening nested data structures.

- Add TableFunction message (lines 1222-1262)
  - New expression type for functions that produce multiple rows
  - Fields: function_reference, arguments, options, output_type
  - Output type must be a Struct defining generated columns
  - Used exclusively in relational contexts (not in Expression.rex_type)

- Add GenerateRel message (lines 559-639)
  - Applies table function to each input row
  - Fields: common, input, generator, preserve_on_empty, advanced_extension
  - Comprehensive inline documentation with examples
  - Property maintenance: distribution and orderedness not maintained
  - Output schema: [input_fields..., generated_fields...]
  - Comparison to ExpandRel included in comments

- Add GenerateRel to Rel union (line 677)
  - Field number 23 in Rel.rel_type oneof

- site/docs/relations/logical_relations.md
  - Complete Generate Operation section with signature table
  - Common table functions reference (explode, posexplode, unnest)
  - Empty collection handling (preserve_on_empty flag)
  - Detailed examples: array explode, posexplode, map explode
  - Comparison table with ExpandRel

- site/docs/expressions/table_functions.md
  - Comprehensive rewrite of table functions documentation
  - TableFunction message field descriptions
  - Differences from scalar/window/aggregate functions
  - Usage examples with GenerateRel
  - Extension mechanism explanation

1. **Dedicated TableFunction message**: Provides type safety and clear
   semantics for multi-row producing functions, following the pattern of
   ScalarFunction, WindowFunction, and AggregateFunction.

2. **preserve_on_empty flag**: Controls NULL handling when generators produce
   zero rows. Named to reflect behavior (row preservation) rather than SQL
   join terminology ("outer"). Equivalent to Spark's LATERAL VIEW OUTER or
   standard SQL's LEFT JOIN LATERAL.

3. **No automatic ordinal column**: Unlike ExpandRel which appends an i32
   ordinal, GenerateRel relies on table functions to include position
   explicitly (e.g., posexplode includes pos field in output).

4. **TableFunction not in Expression**: Table functions produce multiple rows,
   incompatible with Expression semantics (single value). Only used in
   relational operators like GenerateRel.

- ExpandRel: Fixed cardinality (plan time), for grouping sets/cube
- GenerateRel: Variable cardinality (runtime), for lateral view/unnest

Based on Apache Gluten PR substrait-io#574 but with significant improvements:
- Well-documented fields (Gluten had minimal documentation)
- Dedicated TableFunction type (vs generic Expression)
- preserve_on_empty flag for outer semantics
- Comprehensive examples and property maintenance rules

Breaking changes: none. This is a purely additive change using a new field number in Rel.

- Protobuf compilation verified with protoc
- Follows all Substrait conventions (common field 1, input field 2,
  advanced_extension field 10)
- No trailing whitespace, proper line endings

@EpsilonPrime EpsilonPrime force-pushed the generaterel branch 2 times, most recently from 773d5a7 to b08bb10 on December 22, 2025 05:38

@EpsilonPrime
Member Author

> I have a general question. There are (at least) two ways in which table functions are utilized. I'm not entirely sure if these are two different concepts or the same concept.
>
> ...
>
> Is the TableFunction message here intended to eventually be usable in both contexts? This PR itself only adds support for the first (which is fine). Still, it seems like the message itself wouldn't have to change. I think the only change I can imagine is adding a new source rel that has no input rels and specifies the table function invocation.

I believe it would work as long as you had a relation that worked just fine without requiring an input (which is currently the case here). I've expanded the implementation to create a TableExpression which gives us even more future capability but we will still have the same requirement -- we need to have a relation that allows us to provide no input.

json_decode is one such function. Getting a little meta, I could also see a function (say substrait_execute) that takes a substrait plan as an argument using this. :)
