CNDB-16363: Improve matched rows estimation accuracy for memory indexes #2188
@pkolaczk can you add to the PR description which issue is going to be fixed by this PR?
It would be great to update the PR description with the motivation for the work from the issue, and to note that it's blocking other work.
test this please
scottfines
left a comment
I'm new and just trying to learn, but for what it's worth, the logic looks sound to me.
adelapena
left a comment
The changes look good to me. I have just left a few nits that can be addressed before merging. I think the CNDB PR will need an update. Maybe IndexQueryMetricsTest.testIndexQueryMetrics will need some adapting in that PR too.
❌ Build ds-cassandra-pr-gate/PR-2188 rejected by Butler: 3 regressions found. Found 3 new test failures and 3 known test failures.
When a memory index contains very few rows and is split into many shards, we can expect a lot of variance in the number of rows between the shards. Hence, if we took only one shard to estimate the number of matched rows and extrapolated that to all shards to compute the estimated matching rows for the whole index, we would risk a huge estimation error.

This commit changes the algorithm to take as many shards as needed to collect enough returned or indexed rows. For very tiny datasets it is likely to use all shards for estimation. For big datasets, one shard will likely be enough, speeding up estimation. This change also allows us to remove one estimation method: we no longer need to manually choose between the estimation from the first shard and the estimation from all shards.

Additionally, the accuracy of estimating NOT_EQ rows has been improved by letting the planner know that the union generated by NOT_EQ is disjoint, so the result set cardinality is the sum of the cardinalities of the subplans.

The commit also contains a fix for a bug that caused some non-hybrid queries to be counted as hybrid by the query metrics. Unused keyRange parameters have been removed from the methods for estimating row counts in the index classes.
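The disjoint-union idea can be illustrated with a small sketch. This is not the actual planner code; `unionCardinality` and its parameters are hypothetical names. A NOT_EQ predicate splits into two non-overlapping ranges, so the subplan cardinalities simply add up, whereas subplans that may overlap need a weaker combination rule (here, an independence assumption):

```java
// Hypothetical sketch of union cardinality estimation, not the actual
// Cassandra query planner API.
public class UnionCardinality {
    /**
     * Estimates the number of rows matched by a union of subplans.
     * If the subplans are known to be disjoint (e.g. the two ranges
     * produced by NOT_EQ), the union's cardinality is the exact sum
     * of the subplan cardinalities, capped at the table size.
     * Otherwise we assume independence: P(union) = 1 - prod(1 - p_i).
     */
    public static double unionCardinality(double[] subplanRows, boolean disjoint, double totalRows) {
        if (disjoint) {
            double sum = 0;
            for (double r : subplanRows)
                sum += r;
            return Math.min(sum, totalRows);
        }
        double pNone = 1.0;
        for (double r : subplanRows)
            pNone *= 1.0 - r / totalRows;
        return totalRows * (1.0 - pNone);
    }

    public static void main(String[] args) {
        // Disjoint subplans: cardinality is just the sum.
        System.out.println(unionCardinality(new double[]{10, 20}, true, 1000));   // prints 30.0
        // Possibly overlapping subplans: the estimate is below the sum.
        System.out.println(unionCardinality(new double[]{500, 500}, false, 1000)); // prints 750.0
    }
}
```

Treating the NOT_EQ union as disjoint avoids the pessimistic overlap discount, which is what improved the estimates here.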
What is the issue
When a memory index contains very few rows and is split into
many shards, we can expect a lot of variance in the number of rows
between the shards. Hence, if we took only one
shard to estimate the number of matched rows and
extrapolated that to all shards to compute the estimated matching rows
for the whole index, we would risk a huge estimation error.
This turned out to be a problem when testing the query planner metrics
in #2130, when the planner estimated 0 rows and didn't bump up
the estimated row count metric.
What does this PR fix and why was it fixed
This commit changes the algorithm to take as many shards as needed
to collect enough returned or indexed rows. For very
tiny datasets it is likely to use all shards for estimation.
For big datasets, one shard will likely be enough, speeding up
estimation.
This change also allows us to remove one estimation method:
we no longer need to manually choose between the estimation
from the first shard and the estimation from all shards.
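The sampling loop described above can be sketched roughly as follows. The names (`Shard`, `estimateMatches`, `minSampledRows`) are illustrative assumptions, not the actual `TrieMemtableIndex` API: shards are scanned until enough matched or indexed rows accumulate, and the per-shard match count is then extrapolated to all shards.

```java
import java.util.List;

// Hypothetical sketch of adaptive shard sampling for row-count
// estimation; not the actual TrieMemtableIndex implementation.
public class ShardSampler {
    public static final class Shard {
        final long indexedRows;  // total rows indexed in this shard
        final long matchedRows;  // rows in this shard matching the predicate
        public Shard(long indexedRows, long matchedRows) {
            this.indexedRows = indexedRows;
            this.matchedRows = matchedRows;
        }
    }

    /**
     * Scans shards in order, accumulating counts until enough matched
     * or indexed rows have been seen, then extrapolates the average
     * per-shard match count to all shards. Tiny indexes end up scanning
     * every shard (an exact count); large ones usually stop after one.
     */
    public static long estimateMatches(List<Shard> shards, long minSampledRows) {
        if (shards.isEmpty())
            return 0;
        long sampledIndexed = 0, sampledMatched = 0;
        int sampledShards = 0;
        for (Shard s : shards) {
            sampledIndexed += s.indexedRows;
            sampledMatched += s.matchedRows;
            sampledShards++;
            // Stop once the sample is large enough to be stable.
            if (sampledMatched >= minSampledRows || sampledIndexed >= minSampledRows)
                break;
        }
        return Math.round((double) sampledMatched / sampledShards * shards.size());
    }
}
```

With uniform shards the early stop and the full scan give the same answer; the benefit shows up when shards are skewed, where averaging over several sampled shards damps the variance that a single-shard estimate would carry.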