Skip to content

Conversation

@drin
Copy link
Member

@drin drin commented Oct 25, 2024

This adds an "anchor" to PlanRel that can be referenced from a ReferenceRel. This anchor and reference relationship provides a non-ordinal method for identifying and accessing a "subtree" or sub-graph. To maintain consistency with type_anchor and function_anchor, the new subtree_anchor attribute is uint32.

This PR leaves in the original subtree_ordinal attribute since it seems a (mildly) more performant method for referencing a subtree, but also since it is still relevant in the typical case. The new anchor attribute is an improvement for cases where multiple plans are merged and at least one already contains a ReferenceRel.

It is expected that only one of subtree_ordinal or subtree_reference will be used, however I don't see a good reason to enforce the use of only one, so I did not group the attributes in a oneof constraint.

If either of the above points ((1) keeping subtree_ordinal and (2) not enforcing a oneof constraint) are changed, then this PR would likely become a breaking change.

Also, this PR addresses #725 , but uses the name subtree_reference and subtree_anchor (instead of plan_anchor) to be consistent with the pre-existing subtree_ordinal attribute. I don't know that relations in a substrait plan are strictly trees, so I think this could be a good time to change "subtree" to "planrel" or "subgraph" or something similar.

BREAKING CHANGE: The field subtree_ordinal is made optional with explicit presence due to being moved into a new oneof field, "ref_type".

@CLAassistant
Copy link

CLAassistant commented Oct 25, 2024

CLA assistant check
All committers have signed the CLA.

@tokoko
Copy link
Contributor

tokoko commented Oct 28, 2024

This PR leaves in the original subtree_ordinal attribute since it seems a (mildly) more performant method for referencing a subtree

This is technically true, but on the other hand we are already doing the same thing with function extensions when the number of functions used in a typical plan most likely will be a lot bigger than the number of subtrees. I don't think it makes sense to keep the existing behavior only due to performance considerations.

It is expected that only one of subtree_ordinal or subtree_reference will be used, however I don't see a good reason to enforce the use of only one, so I did not group the attributes in a oneof constraint.

I think leaving both options especially w/o oneof will lead to a lot more confusion and possible divergence in implementations. My choice would be to deprecate subtree_ordinal and eventually get rid of it, but oneof is also a viable option.

If either of the above points ((1) keeping subtree_ordinal and (2) not enforcing a oneof constraint) are changed, then this PR would likely become a breaking change.

Pretty sure putting an existing field in a oneof is a backwards-compatible change.

@drin
Copy link
Member Author

drin commented Nov 20, 2024

I should have time to finish this up in the next few days

// This rel is used to create references,
// in case we refer to a RelRoot field names will be ignored
message ReferenceRel {
// points to a PlanRel in Plan.relations (ordinal reference)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's put these in on a one of (subtree_reference and subtree_ordinal) and deprecate subtree ordinal.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I named the oneof ref_type since that seemed descriptive enough without being too verbose.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jacques-n is there a reason to both deprecate the existing field AND put both of the deprecaetd and new field in a oneof. It feels noisy to create a oneof if the eventual plan is to remove one of the fields, and hence the need for the oneof.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Talking through this in the Substrait sync with @jacques-n, we agreed that keeping this out of a oneof make sense.

For migration purposes, its actually useful to allow both to be set for a while. Effectively, we can add support for this in producers and consumers in a decoupled fashion, and once everything supports both, remove the old field.

As part of this, we would need to document that the new relation anchor should be preferred when both are set.

drin added a commit to drin/substrait that referenced this pull request Feb 26, 2025
Created a oneof field, "ref_type," (should expand to `ref_type_case` for
generated code) that captures both types of references. The old type,
subtree_ordinal, has also been deprecated.

This addresses feedback in the PR tagged below.

PR: substrait-io#726
drin added a commit to drin/substrait that referenced this pull request Feb 26, 2025
Created a oneof field, "ref_type," (should expand to `ref_type_case` for
generated code) that captures both types of references. The old type,
subtree_ordinal, has also been deprecated.

This addresses feedback in the PR tagged below.

PR: substrait-io#726
@drin drin force-pushed the feat-refrel-anchor branch from 1cd1c49 to bf76bb9 Compare February 26, 2025 17:28
@drin
Copy link
Member Author

drin commented Feb 26, 2025

rebased on main

@drin
Copy link
Member Author

drin commented Feb 26, 2025

hm, according to a stackoverflow post, moving an optional field into a oneof is not breaking. This PR essentially makes subtree_ordinal an optional field instead of an implicit field by moving it into a oneof, but that seems backwards-compatible. However, the check for breaking changes still flags it so I decided to add a BREAKING CHANGES: footer to the PR description.

@drin drin marked this pull request as ready for review February 26, 2025 18:02
@EpsilonPrime
Copy link
Member

By making the anchor on the plan object does that mean you can no longer use a ReferenceRel to point to a subtree of the existing plan? For instance, TPC-H 15 processes the same subtree twice and often ends up with different results (when using float). By using a reference you could process the subtree once and point to it avoiding differences.

@drin
Copy link
Member Author

drin commented Jun 9, 2025

By making the anchor on the plan object...

I think you may be misunderstanding where the anchor is? I don't think of the PlanRel as the Plan object

...does that mean you can no longer use a ReferenceRel to point to a subtree of the existing plan?

The change is essentially to change a relative ID (ordinal) to an absolute ID (anchor). I think in either case you need to know what you're pointing to and you can point to a subtree of the existing plan.

Copy link
Member

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me see if I understand the intent of this change.

Function / Type References

For things like functions, we define Plan-level anchors

// a list of extensions this plan may depend on
repeated substrait.extensions.SimpleExtensionDeclaration extensions = 2;

message ExtensionFunction {
// references the extension_uri_anchor defined for a specific extension URI.
uint32 extension_uri_reference = 1;

which are then used within the plan

// A scalar function call.
message ScalarFunction {
// Points to a function_anchor defined in this plan, which must refer
// to a scalar function in the associated YAML file. Required; avoid
// using anchor/reference zero.
uint32 function_reference = 1;

This mechanism is also used for types.

Rel References

For RelReference we don't currently use an anchor, but instead an ordinal reference

// This rel is used to create references,
// in case we refer to a RelRoot field names will be ignored
message ReferenceRel {
int32 subtree_ordinal = 1;
}

into an list of relations in the Plan
// one or more relation trees that are associated with this plan.
repeated PlanRel relations = 3;

Your Proposal

Bring references to relations into line with functions and types by attaching an anchor to the PlanRel message.

Notes

Bringing this inline with how we do references generally make sense to me. That being said:

The new anchor attribute is an improvement for cases where multiple plans are merged and at least one already contains a ReferenceRel.

How does this make merging plans easier. I assume that as with ordinal-based references you would still have to rewrite anchors when they collide between plans?

It is expected that only one of subtree_ordinal or subtree_reference will be used, however I don't see a good reason to enforce the use of only one

If we're going to do this, I think it would be better to have only 1 way to do it and we should plan to get rid of the ordinal-based resolution.

@vbarua vbarua requested a review from jacques-n June 16, 2025 20:10
Copy link
Member

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I brought this PR up in the Substrait sync this week. I've updated some comments with what we talked about.

// This rel is used to create references,
// in case we refer to a RelRoot field names will be ignored
message ReferenceRel {
// points to a PlanRel in Plan.relations (ordinal reference)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Talking through this in the Substrait sync with @jacques-n, we agreed that keeping this out of a oneof make sense.

For migration purposes, its actually useful to allow both to be set for a while. Effectively, we can add support for this in producers and consumers in a decoupled fashion, and once everything supports both, remove the old field.

As part of this, we would need to document that the new relation anchor should be preferred when both are set.

@tokoko
Copy link
Contributor

tokoko commented Jun 18, 2025

@vbarua I'm not opposed to planrel_reference, but one slight caveat regarding naming: this field will most likely act as a reference to the list of relations in ExtendedExpression message as well (see #794). In my PR I decided to introduce new ExtendedExpressionRel message to hold Rels. If we go that route, calling the field planrel_reference doesn't make as much sense anymore. Alternatively we can reuse PlanRel message for ExtendedExpression relations, which is doable but slightly awkward, because relations in PlanRel for extended expressions will only ever be of type Rel, never RelRoot.

@drin
Copy link
Member Author

drin commented Jun 18, 2025

I think more context about the interaction with ExtendedExpressionRel would be helpful.

I commented on #794 (moved my comment from here).

@vbarua
Copy link
Member

vbarua commented Jun 18, 2025

@tokoko left a comment on your proposal in #794 with regards to including relations in ExtendedExpressions.

drin added a commit to drin/substrait that referenced this pull request Jun 25, 2025
Created a oneof field, "ref_type," (should expand to `ref_type_case` for
generated code) that captures both types of references. The old type,
subtree_ordinal, has also been deprecated.

This addresses feedback in the PR tagged below.

PR: substrait-io#726
@drin drin force-pushed the feat-refrel-anchor branch from 5a5941a to 815fda4 Compare June 25, 2025 18:37
@drin
Copy link
Member Author

drin commented Jun 25, 2025

I accommodated feedback from @vbarua and rebased on main

@drin
Copy link
Member Author

drin commented Jun 25, 2025

@tokoko @vbarua as long as planrel_reference is a new field in ReferenceRel anyways, what if we actually nest it in a oneof, that way we can be explicit about other things the reference may point to (e.g. extendedexpr_reference) in a way that is a bit future proof (although a bit more verbose). Or, we can change it to be anchor_reference and have an enum that specifies what the expected type of message being pointed to is?

@vbarua vbarua self-requested a review July 11, 2025 00:19
Copy link
Member

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change looks reasonable to me as is.

We still need reviews from other PMCs for this: @jacques-n @westonpace @EpsilonPrime @cpcloud

When we merge this, we should make sure the commit message is accurate. This is no longer a breaking change.

This adds an "anchor" to `PlanRel` that can be referenced from a
`ReferenceRel`. This anchor and reference relationship provides a
non-ordinal mechanism for identifying and accessing a "subtree" or
sub-graph.

PlanRel. The referenced attribute, `reference_anchor`, is named to
specify it is an anchor--a fixed value that can be referenced. The
default value (0) means it is unset and implies it must be actively set
by a producer.

ReferenceRel. The existing field, `subtree_ordinal`, is deprecated. The
reference attribute, `planrel_reference`, is named to specify it is a
reference to a `PlanRel` message. The default value (0) means it is
unset and implies it must be actively set by a producer.

Note: The default value of `subtree_ordinal` (0) can look valid even
when unset. If `planrel_reference` is unset: (1) during the deprecation
period, `subtree_ordinal` can be used as a fallback; (2) in the
future, the `ReferenceRel` message should be considered invalid.

Issue: substrait-io#725
PR: 726
@drin drin force-pushed the feat-refrel-anchor branch from 815fda4 to 9c27ba0 Compare December 10, 2025 10:24
@drin drin requested a review from yongchul as a code owner December 10, 2025 10:24
Copy link
Member

@benbellick benbellick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also update the docs in logical_relations.md? There is still only the old field mentioned:

### Reference Properties

| Property                    | Description                                                                    | Required                    |
|-----------------------------|--------------------------------------------------------------------------------| --------------------------- |
| Referred Rel                | A zero-indexed positional reference to a `Rel` defined within the same `Plan`. | Required                    |

int32 subtree_ordinal = 1 [deprecated = true];

// points to a PlanRel in Plan.relations by use of an anchor (non-ordinal reference),
// PlanRel.reference_anchor. A default value (0) is considered unset.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, is it required to have a default value of 0 be considered unset? It seems a bit strange considering that all of the other places where we have reference / anchors, we allow 0 (e.g. here).

Copy link
Member Author

@drin drin Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

defining 0 as unset prevents reference rel from pointing to arbitrary plans accidentally. although we haven't agreed on cross plan semantics, this helps leave that possibility open (ergonomically, at least)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Ben. During transition phase, both of these values need to be set and pointing to identical plans for the plan to be considered valid so there's not much chance of accidentally pointing to an incorrect plan. Afterwards (when subtree_ordinal will be removed) there's essentially no difference between this and other anchors, so I don't see a reason for disallowing 0 in this instance alone.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, but I think that it makes sense that if you intend for it to be used, then you set it to a value. The only "extra work" you need to do is set the value for the first PlanRel and first ReferenceRel, but you need to set it for every subsequent one regardless.

I can yield on this, but I am using it for distributed query planning, so I would just be needing to extend these semantics or add another field later if so.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, a PlanRel anchor is semantically independent from a ReferenceRel. It can be referenced by many separate ones. But, in an aggregation, the anchor can only be referenced by fields in the same operator, those anchors are not semantically independent from their references.

This is why the default value is fine there.

Copy link
Member Author

@drin drin Dec 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then should it be optional? otherwise there is no way to know if it was set unless you go through the work of checking they're all unique. does the anchor matter at all if there are no reference rels? or can you only have multiple planrels if you have at least 1 reference rel?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, during the deprecation period, if 0 is valid, then do you always have to check the second/last planrel to see if the anchor should be used or if the subtree ordinal should be used?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should do the same as in uri/urn migration, consumer should not have to check which one to use, both should yield the exact same references.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, for my own clarification, if 0 is a valid value and during the deprecation period they can/should be set to the same values (meaning the anchor only has a valid value within the plan itself and not across plans), then when are anchor semantics beneficial over ordinal semantics?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, we'll start to see the value when we drop the deprecated field (you're free to violate the rule I suggested in your project if you can get away with it of course.

I also think if we can get this merged, we can speedrun (relatively speaking) the rest. From what I have seen, not many consumers/producers support ReferenceRel in the first place anyway.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants