-
Notifications
You must be signed in to change notification settings - Fork 189
feat: add anchor for non-ordinal plan reference #726
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This is technically true, but on the other hand we are already doing the same thing with function extensions when the number of functions used in a typical plan most likely will be a lot bigger than the number of subtrees. I don't think it makes sense to keep the existing behavior only due to performance considerations.
I think leaving both options especially w/o oneof will lead to a lot more confusion and possible divergence in implementations. My choice would be to deprecate
Pretty sure putting an existing field in a oneof is a backwards-compatible change. |
|
I should have time to finish this up in the next few days |
proto/substrait/algebra.proto
Outdated
| // This rel is used to create references, | ||
| // in case we refer to a RelRoot field names will be ignored | ||
| message ReferenceRel { | ||
| // points to a PlanRel in Plan.relations (ordinal reference) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's put these in on a one of (subtree_reference and subtree_ordinal) and deprecate subtree ordinal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I named the oneof ref_type since that seemed descriptive enough without being too verbose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jacques-n is there a reason to both deprecate the existing field AND put both of the deprecaetd and new field in a oneof. It feels noisy to create a oneof if the eventual plan is to remove one of the fields, and hence the need for the oneof.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Talking through this in the Substrait sync with @jacques-n, we agreed that keeping this out of a oneof make sense.
For migration purposes, its actually useful to allow both to be set for a while. Effectively, we can add support for this in producers and consumers in a decoupled fashion, and once everything supports both, remove the old field.
As part of this, we would need to document that the new relation anchor should be preferred when both are set.
Created a oneof field, "ref_type," (should expand to `ref_type_case` for generated code) that captures both types of references. The old type, subtree_ordinal, has also been deprecated. This addresses feedback in the PR tagged below. PR: substrait-io#726
Created a oneof field, "ref_type," (should expand to `ref_type_case` for generated code) that captures both types of references. The old type, subtree_ordinal, has also been deprecated. This addresses feedback in the PR tagged below. PR: substrait-io#726
1cd1c49 to
bf76bb9
Compare
|
rebased on main |
|
hm, according to a stackoverflow post, moving an optional field into a |
|
By making the anchor on the plan object does that mean you can no longer use a ReferenceRel to point to a subtree of the existing plan? For instance, TPC-H 15 processes the same subtree twice and often ends up with different results (when using float). By using a reference you could process the subtree once and point to it avoiding differences. |
I think you may be misunderstanding where the anchor is? I don't think of the PlanRel as the Plan object
The change is essentially to change a relative ID (ordinal) to an absolute ID (anchor). I think in either case you need to know what you're pointing to and you can point to a subtree of the existing plan. |
vbarua
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me see if I understand the intent of this change.
Function / Type References
For things like functions, we define Plan-level anchors
substrait/proto/substrait/plan.proto
Lines 34 to 35 in d430e52
| // a list of extensions this plan may depend on | |
| repeated substrait.extensions.SimpleExtensionDeclaration extensions = 2; |
substrait/proto/substrait/extensions/extensions.proto
Lines 57 to 59 in d430e52
| message ExtensionFunction { | |
| // references the extension_uri_anchor defined for a specific extension URI. | |
| uint32 extension_uri_reference = 1; |
which are then used within the plan
substrait/proto/substrait/algebra.proto
Lines 1178 to 1183 in d430e52
| // A scalar function call. | |
| message ScalarFunction { | |
| // Points to a function_anchor defined in this plan, which must refer | |
| // to a scalar function in the associated YAML file. Required; avoid | |
| // using anchor/reference zero. | |
| uint32 function_reference = 1; |
This mechanism is also used for types.
Rel References
For RelReference we don't currently use an anchor, but instead an ordinal reference
substrait/proto/substrait/algebra.proto
Lines 1756 to 1760 in d430e52
| // This rel is used to create references, | |
| // in case we refer to a RelRoot field names will be ignored | |
| message ReferenceRel { | |
| int32 subtree_ordinal = 1; | |
| } |
into an list of relations in the Plan
substrait/proto/substrait/plan.proto
Lines 37 to 38 in d430e52
| // one or more relation trees that are associated with this plan. | |
| repeated PlanRel relations = 3; |
Your Proposal
Bring references to relations into line with functions and types by attaching an anchor to the PlanRel message.
Notes
Bringing this inline with how we do references generally make sense to me. That being said:
The new anchor attribute is an improvement for cases where multiple plans are merged and at least one already contains a ReferenceRel.
How does this make merging plans easier. I assume that as with ordinal-based references you would still have to rewrite anchors when they collide between plans?
It is expected that only one of subtree_ordinal or subtree_reference will be used, however I don't see a good reason to enforce the use of only one
If we're going to do this, I think it would be better to have only 1 way to do it and we should plan to get rid of the ordinal-based resolution.
vbarua
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I brought this PR up in the Substrait sync this week. I've updated some comments with what we talked about.
proto/substrait/algebra.proto
Outdated
| // This rel is used to create references, | ||
| // in case we refer to a RelRoot field names will be ignored | ||
| message ReferenceRel { | ||
| // points to a PlanRel in Plan.relations (ordinal reference) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Talking through this in the Substrait sync with @jacques-n, we agreed that keeping this out of a oneof make sense.
For migration purposes, its actually useful to allow both to be set for a while. Effectively, we can add support for this in producers and consumers in a decoupled fashion, and once everything supports both, remove the old field.
As part of this, we would need to document that the new relation anchor should be preferred when both are set.
|
@vbarua I'm not opposed to |
|
I think more context about the interaction with I commented on #794 (moved my comment from here). |
Created a oneof field, "ref_type," (should expand to `ref_type_case` for generated code) that captures both types of references. The old type, subtree_ordinal, has also been deprecated. This addresses feedback in the PR tagged below. PR: substrait-io#726
|
I accommodated feedback from @vbarua and rebased on main |
|
@tokoko @vbarua as long as |
vbarua
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change looks reasonable to me as is.
We still need reviews from other PMCs for this: @jacques-n @westonpace @EpsilonPrime @cpcloud
When we merge this, we should make sure the commit message is accurate. This is no longer a breaking change.
This adds an "anchor" to `PlanRel` that can be referenced from a `ReferenceRel`. This anchor and reference relationship provides a non-ordinal mechanism for identifying and accessing a "subtree" or sub-graph. PlanRel. The referenced attribute, `reference_anchor`, is named to specify it is an anchor--a fixed value that can be referenced. The default value (0) means it is unset and implies it must be actively set by a producer. ReferenceRel. The existing field, `subtree_ordinal`, is deprecated. The reference attribute, `planrel_reference`, is named to specify it is a reference to a `PlanRel` message. The default value (0) means it is unset and implies it must be actively set by a producer. Note: The default value of `subtree_ordinal` (0) can look valid even when unset. If `planrel_reference` is unset: (1) during the deprecation period, `subtree_ordinal` can be used as a fallback; (2) in the future, the `ReferenceRel` message should be considered invalid. Issue: substrait-io#725 PR: 726
815fda4 to
9c27ba0
Compare
benbellick
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also update the docs in logical_relations.md? There is still only the old field mentioned:
### Reference Properties
| Property | Description | Required |
|-----------------------------|--------------------------------------------------------------------------------| --------------------------- |
| Referred Rel | A zero-indexed positional reference to a `Rel` defined within the same `Plan`. | Required |
| int32 subtree_ordinal = 1 [deprecated = true]; | ||
|
|
||
| // points to a PlanRel in Plan.relations by use of an anchor (non-ordinal reference), | ||
| // PlanRel.reference_anchor. A default value (0) is considered unset. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curious, is it required to have a default value of 0 be considered unset? It seems a bit strange considering that all of the other places where we have reference / anchors, we allow 0 (e.g. here).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
defining 0 as unset prevents reference rel from pointing to arbitrary plans accidentally. although we haven't agreed on cross plan semantics, this helps leave that possibility open (ergonomically, at least)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with Ben. During transition phase, both of these values need to be set and pointing to identical plans for the plan to be considered valid so there's not much chance of accidentally pointing to an incorrect plan. Afterwards (when subtree_ordinal will be removed) there's essentially no difference between this and other anchors, so I don't see a reason for disallowing 0 in this instance alone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay, but I think that it makes sense that if you intend for it to be used, then you set it to a value. The only "extra work" you need to do is set the value for the first PlanRel and first ReferenceRel, but you need to set it for every subsequent one regardless.
I can yield on this, but I am using it for distributed query planning, so I would just be needing to extend these semantics or add another field later if so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, a PlanRel anchor is semantically independent from a ReferenceRel. It can be referenced by many separate ones. But, in an aggregation, the anchor can only be referenced by fields in the same operator, those anchors are not semantically independent from their references.
This is why the default value is fine there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then should it be optional? otherwise there is no way to know if it was set unless you go through the work of checking they're all unique. does the anchor matter at all if there are no reference rels? or can you only have multiple planrels if you have at least 1 reference rel?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, during the deprecation period, if 0 is valid, then do you always have to check the second/last planrel to see if the anchor should be used or if the subtree ordinal should be used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should do the same as in uri/urn migration, consumer should not have to check which one to use, both should yield the exact same references.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, for my own clarification, if 0 is a valid value and during the deprecation period they can/should be set to the same values (meaning the anchor only has a valid value within the plan itself and not across plans), then when are anchor semantics beneficial over ordinal semantics?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point, we'll start to see the value when we drop the deprecated field (you're free to violate the rule I suggested in your project if you can get away with it of course.
I also think if we can get this merged, we can speedrun (relatively speaking) the rest. From what I have seen, not many consumers/producers support ReferenceRel in the first place anyway.
This adds an "anchor" to
PlanRelthat can be referenced from aReferenceRel. This anchor and reference relationship provides a non-ordinal method for identifying and accessing a "subtree" or sub-graph. To maintain consistency withtype_anchorandfunction_anchor, the newsubtree_anchorattribute isuint32.This PR leaves in the original
subtree_ordinalattribute since it seems a (mildly) more performant method for referencing a subtree, but also since it is still relevant in the typical case. The new anchor attribute is an improvement for cases where multiple plans are merged and at least one already contains aReferenceRel.It is expected that only one of
subtree_ordinalorsubtree_referencewill be used, however I don't see a good reason to enforce the use of only one, so I did not group the attributes in aoneofconstraint.If either of the above points ((1) keeping
subtree_ordinaland (2) not enforcing aoneofconstraint) are changed, then this PR would likely become a breaking change.Also, this PR addresses #725 , but uses the name
subtree_referenceandsubtree_anchor(instead ofplan_anchor) to be consistent with the pre-existingsubtree_ordinalattribute. I don't know that relations in a substrait plan are strictly trees, so I think this could be a good time to change "subtree" to "planrel" or "subgraph" or something similar.BREAKING CHANGE: The field
subtree_ordinalis made optional with explicit presence due to being moved into a newoneoffield, "ref_type".