-
Notifications
You must be signed in to change notification settings - Fork 189
feat: add AggregateRel compatibility for grouping set #890
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Makes sense to me. It would be good to add to the spec docs too and also clarify a couple examples of systems that support each behavior. |
|
I think it makes sense to add this kind of toggle to AggregateRel to capture this behaviour, but I'm not sure I fully understand the scheme you're proposing.
Technically you've added a field Compatibility compatibility = 6;to the AggregateRel (which is reasonable)
I don't think this would ever make sense, because as you point out it would be a potpurri of toggles and most toggles would only apply to specific rels. Like Jacques said, it would be helpful to have this behaviour documented and explained in https://substrait.io/relations/logical_relations/#aggregate-operation |
| // when specified with non-empty groupings field even when groupings includes | ||
| // empty grouping sets. | ||
| bool groupings_yield_no_rows_on_empty_input = 1; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm off two minds on declaring settings like this. On one hand, having single message with a bunch of boolean behavioral toggles makes it easy to add new toggles, because we can just add a new field. We would need to make sure that the default unset value matched the default behaviour when we do this. Generally though, I'm wary of boolean toggles because IMO they can be hard to understand, and are limited to switching between 2 different behaviors.
I personally lean towards the enum style of setting toggles because we can indicate the expected behavior with the name, we can declare more than 2 types of behaviors, we can add behaviors easily if we discover more weird system behaviour and we can explicitly declare the unset values as unspecified.
message EmptyInputMode {
OUTPUT_MODE_UNSPECIFIED = 0;
OUTPUT_MODE_YIELD_EMPTY_ROW = 1;
OUTPUT_MODE_YIELD_NO_ROW = 1;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we add these kinds of compatibility toggles, we should also document them in the website. It would be good to include the systems this is useful for, as well as example queries to trigger the behaviour in the docs for context as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I plan to add the documentation -- the weeks have been crazy, can't find time... perhaps later this week or next week, I'll update the PR with the documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vbarua I agree with the enum if there are more than two options but I don't see this particular one having the third options. The message does not have to be a collection of booleans. If a field has more than two options in the future, oh well, that field should be enum.
If you do want to see enum, please let me know!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even with 2 options, I still have a strong preference for the enum version as it also makes it possible to check if the compatibility option has been set explicitly or not. I also find it's easier to document behaviour via the enum name. yield_no_rows_on_empty_input=true makes it clear that I shouldn't output a row, but yield_no_rows_on_empty_input=false isn't as explicit. How many rows should I output? What should be in the row, if anything. I would probably need to check the documentation.
If I see values like YIELD_EMPTY_ROW and YIELD_NO_ROW on the other hand, its very clear what I should be doing and I won't need to go look at the documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vbarua In the proto or doc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, if this is a theme, we are effectively banning the usage of boolean in the Substrait, which is fine as this also happens in normal programming language -- passing 0 or boolean as function argument vs. enum. If we agree on, we should go over the spec and clean up all boolean and replace with enums in 1.0 perhaps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vbarua changed to enum. PTAL!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we are effectively banning the usage of boolean in the Substrait,
I would say heavily discouraging 😅
Aside from nullable, I don't think there's that many. But yes something to keep in mind for 1.0.
In the proto or doc?
I was thinking in the doc. This is surprisingly weird behaviour. I was testing with something like
SELECT COUNT(*), COUNT(id), SUM(id), STRING_AGG(s, ',')
FROM test;on db-fiddle which outputs
(0, 0, null, null)
Trino does
trino> WITH test(i, s) AS (VALUES (1, 'a'), (2, 'b') LIMIT 0)
-> SELECT COUNT(*), COUNT(i), SUM(i), LISTAGG(s) WITHIN GROUP (ORDER BY s) FROM test;
_col0 | _col1 | _col2 | _col3
-------+-------+-------+-------
0 | 0 | NULL | NULL
(1 row)
So it looks like the behaviour on empty inputs is that the count functions return 0, and all other functions return null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
COUNT is special. All other aggregates are supposed to yield NULL over empty input.
On flip side, if you are talking to one system, the behavior is likely fixed -- i.e., it's not applied to a single (which is very flexible) but all of the operators because they share implementation -- and just applied as blanket. Then, why not just spell out the behaviors once and assume that to the rest of the plan? :smile It is just a thought and probably better to start with this per rel toggles and promote to global if we see things really cropping up in the wild. |
benbellick
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One alternative idea. What if we instead model this behavior via a table function which takes in input-relations? We could take a hardline position and say that the AggregateRel in this case always returns a single row with result 0. Then we could model this behavior something like the following:
TableFunctionRel {
name: "make_empty_if_single_all_zero_row"
input: AggregateRel {
input: T (empty table)
groupings: [()]
}
}
This would keep AggregateRel semantics clean and keep us oriented towards composable primitives, while making the non-standard SQL Server/Oracle behavior explicit in the plan structure rather than hidden behind a flag.
I know table functions don't exist yet (this would require introducing TableFunctionRel as a new relation type in Substrait), but this could be one way to mitigate the problem of a growing grab-bag of config settings.
It is an interesting approach but I don't like it because it is way more work for both producers and consumers to support these things correctly. Note that you are not just wrapping the subtree rooted at aggregate rel but If you want to go with truly composable scenario, we probably better to have a standard wrappers or decorators to correct the behaviors (around superficial input and output) but how many of them will be there? I don't know... How many such behavioral differences to capture as flags? I don't know either but at least it is more practical in terms of implementation unless we introduce yet another "standard" wrapper framework. We could go with this as an extension without introducing another behavior flags. Interested systems could just use these extensions (somehow discover), not bothering Substrait core. I have discussed this in the past meeting before sending out this PR. 😄 If the community is not agreeing to have this, I'm fine with going with the extension model. In the end, for this one, it is just a flag. We may discuss some standard decorator of rels... |
|
@yongchul thanks for the response! I see the point that you are making, especially about the case of affecting the semantics of an operation in a more profound way. I'm coming at this with a bias toward writing Substrait plans that are generically executable rather than conforming to a specific execution model, but I recognize that might not always be practical. Given that, the compatibility flag approach does seem like the easiest path forward here. If we do take this approach, is it possible then to include clear documentation of which systems exhibit which behaviors (as @jacques-n mentioned), along with explicit guidance on when to use compatibility flags vs extensions? (in the site documentation) |
|
I missed your update between all of the other updates going on 😅 I think I understand what you're trying to accomplish with the per relation
I can imagine having plan-level overrides for this in the future, but I think keeping them per relation makes the most sense for now. I'm broadly in favour of a per relation Compatibility message using enum toggles for all the settings. For each setting, we should document a clear default for when it is unset. Table Function DiscussionI'm also in favour of re-using existing primitives and trying for composability when possible. I could see something like @benbellick's proposed In cases where the compatibility toggles can expressed as table functions, I can imagine mentioning it as such in the documentation. Then a consumer can handle the toggle by taking the aggregate and wrapping the function around it, and it's effectively sugar on the relation. When we go to release 1.0, if we realized that all the toggles can be handle like that. To Ben's point, I do think that guidance around these kinds of toggles, and when to add them, would be useful. While they let us capture behavioral deviations in the ecosystem, they also fragment it. Though, if we can identify the various different behaviour of core relations, we can also start thinking about compatibility shims for them all. Cursed Golf QuestionDoes something like this also return 0 for those engines if T is empty? Then the |
|
Thanks @vbarua! Comments are inlined... (I personally dislike github comment system especially not having thread if the comment is at PR)
We are in agreement at least compatibility message per relation when it needed for now. I will add documentations for the toggles.
I'm not against the composability but what you will need for this (and probably other such modifiers) need to separate the core operator and its input. That said, it shoold be but not You see ToggleFunc effectively wrapping the AggregateRel operator and inspect both input and output? Or, the ToggleFunc should take Or, at least, we would want to define a special modifier table function that takes a relop (no input), AND relops (inputs) to clearly capture above scenario. This is just to workaround the elaborated high order function type system. I can see that special modifier table function is properly composable but not just any table function. For more simple composability, Assert operator is a very good example. which raise runtime error when it sees more than 1 row (useful to guard from decorelated subqueries). More generic version would be where raise runtime error when expression is false if any of the rows from the rel. Note that unlike the make empty if single all zero row, assert does not look into the gut of the rels. It just sees the output and modifies the behavior. BTW, I may propose the assert rel soon... 😄
Good case but I'm not sure whether we need this... yet. At least, we can capture it in terms of the modifier table functions I put ahead. Perhaps, we can even flatten the hierarchy in that special modifier table functions.
I agree. I hope things not going too crazy and I'm reasonably optimistic that we don't grow beyond 10. Also, we should capture this in the dialect so that when systems talk they should know whether it is supported or not.
For this, SUM is NULL when T is empty, that's SQL.
The intent for this compatibility is only applicable when an empty grouping set is defined. SELECT SUM(<col>) FROM T GROUP BY ()is equivalent to your example but in Substrait, it can be represented as empty grouping set. If the flag or compat mode is true, then this will yield 0 row. Otherwise, it will return 1 row with NULL. |
|
For future reference I found an online SQL SERVER widget https://onecompiler.com/sqlserver/446em8jwe that helped me verify that SELECT SUM(<col>) FROM T GROUP BY GROUPING SETS (())returns 0 rows when T is empty as you said, which also make me more inclined to avoid wrapper approach. |
|
The table function composable pattern seems too reductionist to me. As a bit of a history note, a couple of other people were exploring ideas around the same time I started working on Substrait. They were modeling scalar and set operations as simply functions (just different kinds of inputs/outputs). I went with a more structured approach of having those be two different concepts because the cognitive overhead was lower when you're actually working with plans. I think the same is true here. Over-decomposition makes it harder for everyone to work on things. For the specific of boolean versus enum... I'm generally a pattern of what @vbarua said in situations where there are likely more than two imaginable variants of a behavior. In this particular circumstance, I'm not sure that exists but I'm not sure it doesn't. I'm find with enum OR bool. |
0ca21b9 to
f8f32d7
Compare
|
|
||
| NOTE: The compatibility is meant to address gaps in the core implementation of aggregation such as grouping sets. For custom aggregations, consider using aggregate extension functions. If you want to introduce a new compatibility mode, reach out Substrait PMC to discuss. | ||
|
|
||
| #### Empty Grouping Set on Empty Input |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should document what the encoding of an empty grouping looks like in the protobuf to make it clear for users what the condition they need to detect is.
|
|
||
| | Mode | Behavior | Example Systems | | ||
| | -------------------------------------------------|-------------------------------|-----------------| | ||
| | EMPTY_GROUPING_SET_ON_EMPTY_INPUT_YIELDS_ROWS | A row for empty grouping set | PostgreSQL | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should document what the row should contain as well.
| // Defines the behavior of AggregateRel when there is an empty grouping set in the `groupings` | ||
| // and the input is empty. An empty grouping set is an aggregation over the entire input and some | ||
| // systems implement different behaviors when the input is empty. | ||
| enum EmptyGroupingSetOnEmptyInput { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My natural inclination is to give an enum a name that captures:
- The visible behaviour it modifies
- The condition to trigger it
but here that would give us something like RowOutputOnEmptyGroupingSetOnEmptyInput which I'm not 100% sure is worth the verbosity.
| // If there is an empty grouping set in the `groupings`, the AggregateRel yields a single row | ||
| // for the empty grouping set on empty input (i.e., explicit grouping over the entire input). | ||
| // For example, AggregateRel[(), COUNT] yields one record of value 0 when the input is empty. | ||
| EMPTY_GROUPING_SET_ON_EMPTY_INPUT_YIELDS_ROWS = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor
EMPTY_GROUPING_SET_ON_EMPTY_INPUT_YIELDS_ROWS
to
EMPTY_GROUPING_SET_ON_EMPTY_INPUT_YIELDS_ROW
| EMPTY_GROUPING_SET_ON_EMPTY_INPUT_UNSPECIFIED = 0; | ||
| // If there is an empty grouping set in the `groupings`, the AggregateRel yields a single row | ||
| // for the empty grouping set on empty input (i.e., explicit grouping over the entire input). | ||
| // For example, AggregateRel[(), COUNT] yields one record of value 0 when the input is empty. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we can use AggregateRel[(), COUNT] as our example, because we don't have a text format defined for something like this.
| } | ||
|
|
||
| // Various modes of operations of AggregateRel to capture different behaviors across systems. | ||
| message Compatibility { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit leery of having many submessages with similar names as that will complicate parsing. But having one compatibility message with many unused parts is similarly unsatisfying. The options behavior of functions is probably the most appropriate.
Motivation
AggreateRelis a very complex operation and implementations may behave differently in corner cases. We have identified one such corner case in at least two popular database systems: SQL server and Oracle.If
Tis empty, both queries should yield the same result: 1 row, 0. However, the two systems yield no rows on an empty tableT.Proposal
Since there can be other unidentified behavioral differences, we propose to capture these in a
Compatbilitymessage inAggregateReland capture such behaviors in the message rather than adding a new field to theAggregateRel. The default behavior does not change.The scheme can be extended to other operators and can be included in the
dialect, whether certain behavior is supported or not.We may consider promote this
compatibilityto plan-level message but it will be a potpourri of all rels in Substrait (i.e., submessage allocated per rel to prevent different behaviors all over the places).