feat: add AggregateRel compatibility for grouping set #890

yongchul · 2025-11-17T20:26:08Z

Motivation

AggregateRel is a very complex operation and implementations may behave differently in corner cases. We have identified one such corner case in at least two popular database systems: SQL Server and Oracle.

SELECT COUNT(*) FROM T

SELECT COUNT(*) FROM T GROUP BY GROUPING SETS (())

If T is empty, both queries should yield the same result: one row containing 0. However, these two systems yield no rows for the second query when T is empty.
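A minimal sketch of the baseline behaviour, assuming SQLite through Python's `sqlite3` module and an illustrative one-column schema for T. SQLite does not support GROUPING SETS, so this only shows the first query plus an ordinary grouped aggregate for contrast; it does not reproduce the SQL Server or Oracle corner case.

```python
import sqlite3

# In-memory database with an empty table T (the column is illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (id INTEGER)")

# Global aggregate without GROUP BY: one row even though T is empty.
global_rows = conn.execute("SELECT COUNT(*) FROM T").fetchall()

# Aggregate grouped by a real column: no groups exist, so no rows come back.
grouped_rows = conn.execute("SELECT COUNT(*) FROM T GROUP BY id").fetchall()

print(global_rows)   # [(0,)]
print(grouped_rows)  # []
conn.close()
```

The question raised here is which of these two results an empty grouping set should follow when the input is empty.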

Proposal

Since there can be other unidentified behavioral differences, we propose to capture such behaviors in a Compatibility message in AggregateRel rather than adding a new field to the AggregateRel. The default behavior does not change.

The scheme can be extended to other operators, and the dialect can declare whether a given behavior is supported.

We may consider promoting this compatibility to a plan-level message, but it would be a potpourri of all rels in Substrait (i.e., a submessage allocated per rel to prevent different behaviors all over the place).

jacques-n · 2025-11-19T18:07:10Z

Makes sense to me. It would be good to add to the spec docs too and also clarify a couple examples of systems that support each behavior.

vbarua · 2025-11-20T00:09:34Z

I think it makes sense to add this kind of toggle to AggregateRel to capture this behaviour, but I'm not sure I fully understand the scheme you're proposing.

we propose to capture such behaviors in a Compatibility message in AggregateRel rather than adding a new field to the AggregateRel.

Technically you've added a field

Compatibility compatibility = 6;

to the AggregateRel (which is reasonable)

We may consider promoting this compatibility to a plan-level message

I don't think this would ever make sense because, as you point out, it would be a potpourri of toggles and most toggles would only apply to specific rels.

Like Jacques said, it would be helpful to have this behaviour documented and explained in https://substrait.io/relations/logical_relations/#aggregate-operation

vbarua · 2025-11-20T00:20:40Z

proto/substrait/algebra.proto

+    // when specified with non-empty groupings field even when groupings includes
+    // empty grouping sets.
+    bool groupings_yield_no_rows_on_empty_input = 1;
+  }


I'm of two minds on declaring settings like this. On one hand, having a single message with a bunch of boolean behavioral toggles makes it easy to add new toggles, because we can just add a new field. We would need to make sure that the default unset value matches the default behaviour when we do this. Generally though, I'm wary of boolean toggles because IMO they can be hard to understand, and are limited to switching between 2 different behaviors.

I personally lean towards the enum style of setting toggles because we can indicate the expected behavior with the name, we can declare more than 2 types of behaviors, we can add behaviors easily if we discover more weird system behaviour and we can explicitly declare the unset values as unspecified.

enum EmptyInputMode {
  OUTPUT_MODE_UNSPECIFIED = 0;
  OUTPUT_MODE_YIELD_EMPTY_ROW = 1;
  OUTPUT_MODE_YIELD_NO_ROW = 2;
}

When we add these kinds of compatibility toggles, we should also document them in the website. It would be good to include the systems this is useful for, as well as example queries to trigger the behaviour in the docs for context as well.

Yes, I plan to add the documentation -- the weeks have been crazy, can't find time... perhaps later this week or next week, I'll update the PR with the documentation.

@vbarua I agree with the enum if there are more than two options, but I don't see this particular one having a third option. The message does not have to be a collection of booleans. If a field has more than two options in the future, oh well, that field should be an enum.

If you do want to see enum, please let me know!

Even with 2 options, I still have a strong preference for the enum version as it also makes it possible to check if the compatibility option has been set explicitly or not. I also find it's easier to document behaviour via the enum name. yield_no_rows_on_empty_input=true makes it clear that I shouldn't output a row, but yield_no_rows_on_empty_input=false isn't as explicit. How many rows should I output? What should be in the row, if anything. I would probably need to check the documentation.

If I see values like YIELD_EMPTY_ROW and YIELD_NO_ROW on the other hand, it's very clear what I should be doing and I won't need to go look at the documentation.
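The enum argument can be sketched as follows. This is only an illustration in Python, not the proto definition from the PR: the class name, the member names and the choice of YIELD_EMPTY_ROW as the default for an unset value are assumptions made for the example.

```python
from enum import Enum


class EmptyInputMode(Enum):
    # Hypothetical names modelled on the values suggested in this thread.
    UNSPECIFIED = 0
    YIELD_EMPTY_ROW = 1
    YIELD_NO_ROW = 2


def was_set_explicitly(mode: EmptyInputMode) -> bool:
    """An enum with an UNSPECIFIED member can tell 'unset' apart from a real choice."""
    return mode is not EmptyInputMode.UNSPECIFIED


def resolve(
    mode: EmptyInputMode,
    default: EmptyInputMode = EmptyInputMode.YIELD_EMPTY_ROW,  # assumed default
) -> EmptyInputMode:
    """Map an unset toggle to the documented default and keep explicit choices."""
    return default if mode is EmptyInputMode.UNSPECIFIED else mode


print(resolve(EmptyInputMode.UNSPECIFIED).name)             # YIELD_EMPTY_ROW
print(resolve(EmptyInputMode.YIELD_NO_ROW).name)            # YIELD_NO_ROW
print(was_set_explicitly(EmptyInputMode.UNSPECIFIED))       # False
```

A plain boolean has only two states, so it has no room for the "not set" case that `was_set_explicitly` relies on.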

@vbarua In the proto or doc?

Also, if this is a theme, we are effectively banning the usage of boolean in Substrait, which is fine as this also happens in normal programming languages -- passing 0 or a boolean as a function argument vs. an enum. If we agree on this, we should go over the spec and replace all booleans with enums, in 1.0 perhaps.

@vbarua changed to enum. PTAL!

we are effectively banning the usage of boolean in Substrait,

I would say heavily discouraging 😅

Aside from nullable, I don't think there are that many. But yes, something to keep in mind for 1.0.

In the proto or doc?

I was thinking in the doc. This is surprisingly weird behaviour. I was testing with something like

SELECT COUNT(*), COUNT(id), SUM(id), STRING_AGG(s, ',') FROM test;

on db-fiddle which outputs

(0, 0, null, null)

Trino does

trino> WITH test(i, s) AS (VALUES (1, 'a'), (2, 'b') LIMIT 0)
    -> SELECT COUNT(*), COUNT(i), SUM(i), LISTAGG(s) WITHIN GROUP (ORDER BY s) FROM test;
 _col0 | _col1 | _col2 | _col3
-------+-------+-------+-------
     0 |     0 | NULL  | NULL
(1 row)

So it looks like the behaviour on empty inputs is that the count functions return 0, and all other functions return null.

COUNT is special. All other aggregates are supposed to yield NULL over empty input.
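The same check can be repeated locally. The sketch below assumes SQLite through Python's `sqlite3` module and substitutes SQLite's `GROUP_CONCAT` for `STRING_AGG`; the table layout mirrors the `test` table used above and is otherwise illustrative.

```python
import sqlite3

# Empty table with the two columns used in the queries above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (id INTEGER, s TEXT)")

# Global aggregates over empty input: counts are 0, everything else is NULL.
empty_input_rows = conn.execute(
    "SELECT COUNT(*), COUNT(id), SUM(id), GROUP_CONCAT(s, ',') FROM test"
).fetchall()

print(empty_input_rows)  # [(0, 0, None, None)]
conn.close()
```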

yongchul · 2025-11-20T04:14:19Z

We may consider promoting this compatibility to a plan-level message
I don't think this would ever make sense because, as you point out, it would be a potpourri of toggles and most toggles would only apply to specific rels.

On the flip side, if you are talking to one system, the behavior is likely fixed -- i.e., it's not applied to a single operator (which is very flexible) but to all of the operators because they share an implementation -- and is just applied as a blanket. Then why not just spell out the behaviors once and assume them for the rest of the plan? :smile: It is just a thought, and it is probably better to start with these per-rel toggles and promote them to global if we see things really cropping up in the wild.

benbellick

One alternative idea. What if we instead model this behavior via a table function which takes in input-relations? We could take a hardline position and say that the AggregateRel in this case always returns a single row with result 0. Then we could model this behavior something like the following:

  TableFunctionRel {
    name: "make_empty_if_single_all_zero_row"
    input: AggregateRel {
      input: T (empty table)
      groupings: [()]
    }
  }

This would keep AggregateRel semantics clean and keep us oriented towards composable primitives, while making the non-standard SQL Server/Oracle behavior explicit in the plan structure rather than hidden behind a flag.

I know table functions don't exist yet (this would require introducing TableFunctionRel as a new relation type in Substrait), but this could be one way to mitigate the problem of a growing grab-bag of config settings.

yongchul · 2025-11-23T23:40:37Z

One alternative idea. What if we instead model this behavior via a table function which takes in input-relations? We could take a hardline position and say that the AggregateRel in this case always returns a single row with result 0. Then we could model this behavior something like the following:

It is an interesting approach but I don't like it because it is way more work for both producers and consumers to support these things correctly. Note that you are not just wrapping the subtree rooted at the aggregate rel but TableFunction(AggregateRel) - Input. Also, this approach just happens to work in this case because it relates to the input, but it won't work if the compatibility issue fundamentally affects the core behavior of the aggregate rel itself (let's see how many of those cases we have down the road).

If you want to go with a truly composable scenario, we would probably be better off with standard wrappers or decorators to correct the behaviors (around superficial input and output), but how many of them will there be? I don't know... How many such behavioral differences are there to capture as flags? I don't know either, but at least flags are more practical in terms of implementation unless we introduce yet another "standard" wrapper framework.

We could go with this as an extension without introducing more behavior flags. Interested systems could just use these extensions (discovering them somehow) without bothering Substrait core. I discussed this in a past meeting before sending out this PR. 😄

If the community does not agree to have this, I'm fine with going with the extension model. In the end, for this one, it is just a flag. We may discuss some standard decorators for rels...

benbellick · 2025-11-24T19:02:46Z

@yongchul thanks for the response! I see the point that you are making, especially about the case of affecting the semantics of an operation in a more profound way.

I'm coming at this with a bias toward writing Substrait plans that are generically executable rather than conforming to a specific execution model, but I recognize that might not always be practical.

Given that, the compatibility flag approach does seem like the easiest path forward here.

If we do take this approach, is it possible then to include clear documentation of which systems exhibit which behaviors (as @jacques-n mentioned), along with explicit guidance on when to use compatibility flags vs extensions? (in the site documentation)

vbarua · 2025-12-02T02:24:05Z

I missed your update between all of the other updates going on 😅

I think I understand what you're trying to accomplish with the per relation Compatibility message. It effectively becomes a bag in which we can capture all the toggles we might need for engine compatibility, and it avoids polluting the relation itself with optional toggle fields.

On the flip side, if you are talking to one system, the behavior is likely fixed

I can imagine having plan-level overrides for this in the future, but I think keeping them per relation makes the most sense for now. I'm broadly in favour of a per relation Compatibility message using enum toggles for all the settings. For each setting, we should document a clear default for when it is unset.

Table Function Discussion

I'm also in favour of re-using existing primitives and trying for composability when possible. I could see something like @benbellick's proposed make_empty_if_single_all_zero_row working for this, but like YongChul I'm not sure if this would be applicable for every type of compatibility issue, and I could see it getting a little onerous if we needed to set lots of different toggles for a single relation. Think

toggle_fn3(
    toggle_fn2(
      toggle_fn1(
        AggregateRel 
)))

In cases where the compatibility toggles can be expressed as table functions, I can imagine mentioning it as such in the documentation. Then a consumer can handle the toggle by taking the aggregate and wrapping the function around it, and it's effectively sugar on the relation.

When we go to release 1.0, if we realize that all the toggles can be handled like that...

To Ben's point, I do think that guidance around these kinds of toggles, and when to add them, would be useful. While they let us capture behavioral deviations in the ecosystem, they also fragment it. Though, if we can identify the various different behaviour of core relations, we can also start thinking about compatibility shims for them all.

Cursed Golf Question

Does something like this

SELECT SUM(<col>) FROM T

also return 0 for those engines if T is empty?

Then the make_empty_if_single_all_zero_row trick wouldn't work because we couldn't distinguish between an aggregate of an empty table returning 0, and an aggregate where all the values were 0.
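The ambiguity in this question can be sketched in a few lines of Python. Everything here is hypothetical: `sum_zero_on_empty` models an imagined engine whose SUM over no rows is 0 (the premise of the question, not a claim about any real system), and the wrapper is a guess at what make_empty_if_single_all_zero_row would do.

```python
from typing import List, Tuple

Row = Tuple[int, ...]


def sum_zero_on_empty(values: List[int]) -> List[Row]:
    """Hypothetical engine: a global SUM that returns 0 for empty input."""
    return [(sum(values),)]


def make_empty_if_single_all_zero_row(rows: List[Row]) -> List[Row]:
    """Hypothetical wrapper: drop the output if it is a single all-zero row."""
    if len(rows) == 1 and all(value == 0 for value in rows[0]):
        return []
    return rows


# Empty input: the wrapper removes the row, as intended.
from_empty = make_empty_if_single_all_zero_row(sum_zero_on_empty([]))

# Non-empty input whose values are all 0: the wrapper removes the row too.
from_zeros = make_empty_if_single_all_zero_row(sum_zero_on_empty([0, 0, 0]))

print(from_empty)  # []
print(from_zeros)  # []
```

Because the wrapper only sees the aggregate's output, both cases look identical to it.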

yongchul · 2025-12-02T07:19:30Z

Thanks @vbarua! Comments are inlined... (I personally dislike the GitHub comment system, especially the lack of threading when a comment is on the PR itself)

I missed your update between all of the other updates going on 😅

I think I understand what you're trying to accomplish with the per relation Compatibility message. It effectively becomes a bag in which we can capture all the toggles we might need for engine compatibility, and it avoids polluting the relation itself with optional toggle fields.

On the flip side, if you are talking to one system, the behavior is likely fixed

I can imagine having plan-level overrides for this in the future, but I think keeping them per relation makes the most sense for now. I'm broadly in favour of a per relation Compatibility message using enum toggles for all the settings. For each setting, we should document a clear default for when it is unset.

We are in agreement, at least on having a per-relation compatibility message when it is needed, for now. I will add documentation for the toggles.

Table Function Discussion

I'm also in favour of re-using existing primitives and trying for composability when possible. I could see something like @benbellick's proposed make_empty_if_single_all_zero_row working for this, but like YongChul I'm not sure if this would be applicable for every type of compatibility issue, and I could see it getting a little onerous if we needed to set lots of different toggles for a single relation.

I'm not against composability, but what you need for this (and probably other such modifiers) is to separate the core operator from its input. That is, it should be

ToggleFunc[AggregateRel] - Input

but not

ToggleFunc - AggregateRel - Input

Do you see that ToggleFunc effectively wraps the AggregateRel operator and inspects both its input and output? Otherwise, ToggleFunc should take Func<Enumerable<Row> -> Enumerable<Row>> and Func<() -> Enumerable<Row>> with all schema derivations. I'm not sure whether we want such sophisticated type inference. Also, we would need to define a type hierarchy (or traits) of relops to incorporate the existing relops and capture this as a higher-order function, IMO.

Or, at least, we would want to define a special modifier table function that takes a relop (without its input) AND relops (the inputs) to clearly capture the above scenario. This is just to work around the need for an elaborate higher-order function type system. I can see such a special modifier table function being properly composable, but not just any table function.

For simpler composability, the Assert operator is a very good example.

AssertOneRow(Rel)

which raises a runtime error when it sees more than 1 row (useful to guard against decorrelated subqueries). A more generic version would be

Assert(rel, expression, "message")

which raises a runtime error when the expression is false for any of the rows from the rel. Note that unlike make_empty_if_single_all_zero_row, assert does not look into the guts of the rels. It just sees the output and modifies the behavior.
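A rough Python sketch of the two assert operators, treating a rel as an iterable of rows. The function names, the row representation and the use of `RuntimeError` are assumptions made for illustration; no such relation exists in Substrait yet.

```python
from typing import Callable, Iterable, Iterator, Tuple

Row = Tuple[object, ...]


def assert_one_row(rel: Iterable[Row]) -> Iterator[Row]:
    """Pass rows through, raising as soon as a second row shows up."""
    seen = 0
    for row in rel:
        seen += 1
        if seen > 1:
            raise RuntimeError("expected at most one row")
        yield row


def assert_rel(
    rel: Iterable[Row], predicate: Callable[[Row], bool], message: str
) -> Iterator[Row]:
    """Pass rows through, raising if the predicate is false for any row."""
    for row in rel:
        if not predicate(row):
            raise RuntimeError(message)
        yield row


single = list(assert_one_row([(42,)]))
positive = list(assert_rel([(1,), (2,)], lambda row: row[0] > 0, "non-positive value"))
print(single)    # [(42,)]
print(positive)  # [(1,), (2,)]
```

Both functions only look at the rows coming out of the wrapped rel, which is the property that makes them compose without knowing anything about the operator underneath.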

BTW, I may propose the assert rel soon... 😄

Think
toggle_fn3(
    toggle_fn2(
      toggle_fn1(
        AggregateRel 
)))
In cases where the compatibility toggles can be expressed as table functions, I can imagine mentioning it as such in the documentation. Then a consumer can handle the toggle by taking the aggregate and wrapping the function around it, and it's effectively sugar on the relation.

Good case, but I'm not sure whether we need this... yet. At least, we can capture it in terms of the modifier table functions I described above. Perhaps we can even flatten the hierarchy in those special modifier table functions.

To Ben's point, I do think that guidance around these kinds of toggles, and when to add them, would be useful. While they let us capture behavioral deviations in the ecosystem, they also fragment it. Though, if we can identify the various different behaviour of core relations, we can also start thinking about compatibility shims for them all.

I agree. I hope things don't get too crazy, and I'm reasonably optimistic that we won't grow beyond 10. Also, we should capture this in the dialect so that when systems talk to each other they know whether it is supported or not.

Cursed Golf Question

Does something like this
SELECT SUM(<col>) FROM T
also return 0 for those engines if T is empty?

For this, SUM is NULL when T is empty, that's SQL.

Then the make_empty_if_single_all_zero_row trick wouldn't work because we couldn't distinguish between an aggregate of an empty table returning 0, and an aggregate where all the values were 0.

This compatibility mode is only intended to apply when an empty grouping set is defined.

SELECT SUM(<col>) FROM T GROUP BY ()

is equivalent to your example, but in Substrait it can be represented as an empty grouping set. If the flag or compat mode is set, then this will yield 0 rows. Otherwise, it will return 1 row with NULL.
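The two outcomes can be modelled in a few lines of Python. This is a toy model, not consumer code: the enum member names are hypothetical (only the enum name EmptyGroupingSetOnEmptyInput comes from the PR), NULL is represented as `None`, and the input is simplified to a list of integers.

```python
from enum import Enum
from typing import List, Optional, Tuple


class EmptyGroupingSetOnEmptyInput(Enum):
    UNSPECIFIED = 0    # treated like YIELDS_ROW here (assumed default)
    YIELDS_ROW = 1     # hypothetical member name
    YIELDS_NO_ROW = 2  # hypothetical member name


def sum_over_empty_grouping_set(
    values: List[int], mode: EmptyGroupingSetOnEmptyInput
) -> List[Tuple[Optional[int]]]:
    """SUM over a single empty grouping set, i.e. over the whole input."""
    if not values and mode is EmptyGroupingSetOnEmptyInput.YIELDS_NO_ROW:
        return []  # compat mode: empty input produces no rows at all
    total = sum(values) if values else None  # SUM over no rows is NULL
    return [(total,)]


default_rows = sum_over_empty_grouping_set([], EmptyGroupingSetOnEmptyInput.UNSPECIFIED)
compat_rows = sum_over_empty_grouping_set([], EmptyGroupingSetOnEmptyInput.YIELDS_NO_ROW)
print(default_rows)  # [(None,)]
print(compat_rows)   # []
```

With non-empty input both modes return the same single row, so the toggle only matters in the empty-input case.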

vbarua · 2025-12-02T23:27:44Z

For future reference, I found an online SQL Server widget https://onecompiler.com/sqlserver/446em8jwe that helped me verify that

SELECT SUM(<col>) FROM T GROUP BY GROUPING SETS (())

returns 0 rows when T is empty as you said, which also makes me more inclined to avoid the wrapper approach.

jacques-n · 2025-12-03T17:58:32Z

The table function composable pattern seems too reductionist to me. As a bit of a history note, a couple of other people were exploring ideas around the same time I started working on Substrait. They were modeling scalar and set operations as simply functions (just different kinds of inputs/outputs). I went with a more structured approach of having those be two different concepts because the cognitive overhead was lower when you're actually working with plans. I think the same is true here. Over-decomposition makes it harder for everyone to work on things.

For the specifics of boolean versus enum... I'm generally in favor of what @vbarua said in situations where there are likely more than two imaginable variants of a behavior. In this particular circumstance, I'm not sure that exists but I'm not sure it doesn't. I'm fine with enum OR bool.

vbarua · 2025-12-10T20:24:01Z

site/docs/relations/logical_relations.md

+
+NOTE: The compatibility is meant to address gaps in the core implementation of aggregation such as grouping sets. For custom aggregations, consider using aggregate extension functions. If you want to introduce a new compatibility mode, reach out Substrait PMC to discuss.
+
+#### Empty Grouping Set on Empty Input


We should document what the encoding of an empty grouping looks like in the protobuf to make it clear for users what the condition they need to detect is.

vbarua · 2025-12-10T20:24:28Z

site/docs/relations/logical_relations.md

+
+| Mode                                             | Behavior                      | Example Systems |
+| -------------------------------------------------|-------------------------------|-----------------|
+| EMPTY_GROUPING_SET_ON_EMPTY_INPUT_YIELDS_ROWS    | A row for empty grouping set  | PostgreSQL      |


We should document what the row should contain as well.

vbarua · 2025-12-10T20:29:42Z

proto/substrait/algebra.proto

+    // Defines the behavior of AggregateRel when there is an empty grouping set in the `groupings`
+    // and the input is empty. An empty grouping set is an aggregation over the entire input and some
+    // systems implement different behaviors when the input is empty.
+    enum EmptyGroupingSetOnEmptyInput {


My natural inclination is to give an enum a name that captures:

The visible behaviour it modifies

The condition to trigger it

but here that would give us something like RowOutputOnEmptyGroupingSetOnEmptyInput which I'm not 100% sure is worth the verbosity.

vbarua · 2025-12-10T20:31:26Z

proto/substrait/algebra.proto

+      // If there is an empty grouping set in the `groupings`, the AggregateRel yields a single row
+      // for the empty grouping set on empty input (i.e., explicit grouping over the entire input).
+      // For example, AggregateRel[(), COUNT] yields one record of value 0 when the input is empty.
+      EMPTY_GROUPING_SET_ON_EMPTY_INPUT_YIELDS_ROWS = 1;


minor

EMPTY_GROUPING_SET_ON_EMPTY_INPUT_YIELDS_ROWS to EMPTY_GROUPING_SET_ON_EMPTY_INPUT_YIELDS_ROW

vbarua · 2025-12-10T20:33:00Z

proto/substrait/algebra.proto

+      EMPTY_GROUPING_SET_ON_EMPTY_INPUT_UNSPECIFIED = 0;
+      // If there is an empty grouping set in the `groupings`, the AggregateRel yields a single row
+      // for the empty grouping set on empty input (i.e., explicit grouping over the entire input).
+      // For example, AggregateRel[(), COUNT] yields one record of value 0 when the input is empty.


I don't think we can use AggregateRel[(), COUNT] as our example, because we don't have a text format defined for something like this.

EpsilonPrime · 2025-12-22T06:28:45Z

proto/substrait/algebra.proto

  }
+
+  // Various modes of operations of AggregateRel to capture different behaviors across systems.
+  message Compatibility {


I'm a bit leery of having many submessages with similar names as that will complicate parsing. But having one compatibility message with many unused parts is similarly unsatisfying. The approach used for function options is probably the most appropriate.

yongchul added 2 commits October 13, 2025 16:27

feat: aggregate rel compatibility options

e0104c5

feat: add AggregateRel compatibility

4a2a735

yongchul requested review from EpsilonPrime, cpcloud, jacques-n, vbarua and westonpace as code owners November 17, 2025 20:26

Merge branch 'main' into grouping_set_behavior

2949a95

vbarua reviewed Nov 20, 2025

benbellick reviewed Nov 21, 2025

Adding compatibility documentation. Simplify field name

f8f32d7

yongchul force-pushed the grouping_set_behavior branch from 0ca21b9 to f8f32d7 Compare December 9, 2025 23:17

Change bool to enum. Updated documentation accordingly.

028b6d3

vbarua reviewed Dec 10, 2025

benbellick mentioned this pull request Dec 13, 2025

feat: introduce GenerateRel for lateral view and unnest operations #917

Open

EpsilonPrime reviewed Dec 22, 2025
