-
Notifications
You must be signed in to change notification settings - Fork 20
CNDB-16145: Split some NVQ tests into multiple subclasses #2195
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Checklist before you submit for review
|
09646dc to
a032f79
Compare
|
Just a note: this references tickets https://github.com/riptano/cndb/issues/16075 and https://github.com/riptano/cndb/issues/16316 |
…id ant timeouts, and add explicit junit timeouts.
…o avoid ant timeouts, and adding explicit junit timeouts to provide better failure information.
7b448c9 to
24cc289
Compare
ekaterinadimitrova2
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the PR. I left a few comments regarding license header, code style, doc.
I have a few questions:
- Are the refactored tests failing in CI and having their own tickets? For particular tests/short-term solution to the CI problem - we need their own ticket and we should not use #16145 for that. Let's leave that one alone for now, please.
- There are more tests that are still extending
VerctorTester.Versioned- you did not find failures in CI for them? They are shorter/simpler and do not suffer from the same issues?
I am not completely convinced we should have two different approaches for the same type of testing/parameterization. Maybe we need at least a documentation for Versioned where we state the problem, and we outline when it is not a good idea to use it for that type of parameterization? @michaeljmarshall, can you take a look into this PR and share opinion too, please?
test/unit/org/apache/cassandra/index/sai/cql/NVQDisabledVectorCompaction100dTest.java
Outdated
Show resolved
Hide resolved
| @RunWith(Parameterized.class) | ||
| abstract public class VectorCompactionTest extends VectorTester | ||
| { | ||
| @Rule public final Timeout timeout = new Timeout(240, TimeUnit.SECONDS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is your idea that it is not the parameterization the problem, but there are particular test methods that may be hanging and you want to catch those? How was the 240 seconds chosen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Parameterization is definitely the problem. This is just an ad-hoc solution: break the tests apart so that each class does less (and is thus less likely to hit ant timeouts), and then stick a junit timeout on it so that when they do fail, they fail with useful information. It doesn't really solve anything, unfortunately.
I picked the 240 more-or-less experimentally--it seemed the length of time that most of the tests would pass consistently in CI. In a perfect world, I think we would want it smaller--I believe https://github.com/riptano/cndb/issues/16075 mentions wanting a max 2 minute timeout for each test. Unfortunately, 2 minutes means a lot of the tests will consistently time out in the current CI environment.
| public class VectorSiftSmallTest extends VectorTester | ||
| { | ||
| private static final String DATASET = "siftsmall"; // change to "sift" for larger dataset. requires manual download | ||
| @Rule public Timeout timeout = new Timeout(180, TimeUnit.SECONDS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as before - Is your idea that it is not the parameterization the problem, but there are particular test methods that may be hanging and you want to catch those? How was the 240 seconds chosen?
test/unit/org/apache/cassandra/index/sai/cql/NVQDisabledVectorSiftSmallTest.java
Outdated
Show resolved
Hide resolved
test/unit/org/apache/cassandra/index/sai/cql/NVQEnabledVectorCompaction100dTest.java
Outdated
Show resolved
Hide resolved
test/unit/org/apache/cassandra/index/sai/cql/NVQEnabledVectorSiftSmallTest.java
Outdated
Show resolved
Hide resolved
|
@ekaterinadimitrova2 I fixed the dumb mistakes (the code style and licensing), and cleaned up the messaging. I think the take on this PR is that it doesn't really resolve the solution (at least not permanently), it's mostly aimed at improving the near-term situation with these tests. I didn't want to make a 50-test refactor that would be impossible to review, so I stuck with these two, but if the approach is acceptable then I will be happy to make follow-on PRs to apply the same logic to other classes under the Versioned structure. |
|
❌ Build ds-cassandra-pr-gate/PR-2195 rejected by Butler2 regressions found Found 2 new test failures
Found 2 known test failures |
|
One of the tests failed because of the new timeout added: I am not convinced we should have those per-method timeouts. It seems like they are experimental (we do not have data that proves the timeout should be that long) and our CI environment is kind of unpredictable how much time tests can take. Might be more noise keeping those.
|



What is the issue
Some unit tests are getting timed out by Ant, implying that they are being slow for some reason. When Ant times out, you lose all the context that Junit creates, so we can't really see how far the test gets or anything like that, to see what the real problem is. To (hopefully) identify the actual problem, this adds shorter, explicit Junit timeouts that will at least let us see the stdout and stuff of the test to figure out what's happening when it fails.
This is ultimately aimed at helping with https://github.com/riptano/cndb/issues/16075 and https://github.com/riptano/cndb/issues/16316
What does this PR fix and why was it fixed
I manually split a few tests up into subclasses based around whether NVQ is enabled. Additionally, I added a JUnit timeout rule so that tests which take too long are timed out by junit rather than by ant. Doing this allows us to at least get some information about which tests take too long and what they were doing.
This is somewhat naive, because it does nothing to address the immediate causes of the problem: adding more versions will ultimately cause these tests to time out again in the future. However, in the short term this will help stabilize slower test and CI environments.