CNDB-16145: Split some NVQ tests into multiple subclasses #2195

scottfines · 2026-01-13T19:47:05Z

What is the issue

Some unit tests are being timed out by Ant, which implies they are running slowly for some reason. When Ant times a test out, all the context that JUnit creates is lost, so we cannot see how far the test got or what the real problem is. To (hopefully) identify the actual problem, this adds shorter, explicit JUnit timeouts that at least let us see the test's stdout and other output, so we can work out what is happening when it fails.

This is ultimately aimed at helping with https://github.com/riptano/cndb/issues/16075 and https://github.com/riptano/cndb/issues/16316

What does this PR fix and why was it fixed

I manually split a few tests into subclasses based on whether NVQ is enabled. Additionally, I added a JUnit timeout rule so that tests which take too long are timed out by JUnit rather than by Ant. This at least gives us some information about which tests take too long and what they were doing.

This is somewhat naive, because it does nothing to address the immediate causes of the problem: adding more versions will ultimately cause these tests to time out again in the future. However, in the short term this will help stabilize slower test and CI environments.
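
The split described above can be sketched roughly as follows. This is a plain-Java illustration without JUnit, not the code from this PR: `CompactionScenario`, `NvqEnabledScenario`, `NvqDisabledScenario`, `nvqEnabled()`, and the version list are all hypothetical names chosen for the example. The shared test body lives in an abstract base class and each concrete subclass pins one NVQ setting, so each class the runner schedules covers only part of the original matrix.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a shared test base class: the test body is
// written once and the varying parameter is supplied by subclasses.
abstract class CompactionScenario
{
    // Each concrete subclass pins this to a single value.
    abstract boolean nvqEnabled();

    // Hypothetical version list standing in for the versions the tests cover.
    List<String> versions()
    {
        return List.of("v1", "v2", "v3");
    }

    // Runs the shared body once per version, for this subclass's NVQ setting only.
    List<String> run()
    {
        List<String> executed = new ArrayList<>();
        for (String version : versions())
            executed.add(version + "/nvq=" + nvqEnabled());
        return executed;
    }
}

// Covers only the NVQ-enabled half of the matrix.
final class NvqEnabledScenario extends CompactionScenario
{
    @Override
    boolean nvqEnabled() { return true; }
}

// Covers only the NVQ-disabled half of the matrix.
final class NvqDisabledScenario extends CompactionScenario
{
    @Override
    boolean nvqEnabled() { return false; }
}
```

Each subclass still loops over every version, which is why adding more versions would eventually push the per-class runtime back up.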

github-actions · 2026-01-13T19:47:24Z

scottfines · 2026-01-21T19:21:16Z

Just a note: this references tickets https://github.com/riptano/cndb/issues/16075 and https://github.com/riptano/cndb/issues/16316

ekaterinadimitrova2

Thank you for the PR. I left a few comments regarding the license header, code style, and documentation.

I have a few questions:

Are the refactored tests failing in CI, and do they have their own tickets? For particular tests, or for a short-term solution to the CI problem, we need a separate ticket and should not use #16145. Let's leave that one alone for now, please.
There are more tests that still extend VectorTester.Versioned. Did you not find CI failures for them? Are they shorter or simpler, so that they do not suffer from the same issues?

I am not completely convinced we should have two different approaches for the same type of testing/parameterization. Maybe we need at least documentation for Versioned that states the problem and outlines when it is not a good idea to use it for this type of parameterization. @michaeljmarshall, can you take a look at this PR and share your opinion too, please?

test/unit/org/apache/cassandra/index/sai/cql/NVQDisabledVectorCompaction100dTest.java

ekaterinadimitrova2 · 2026-01-22T23:01:35Z

test/unit/org/apache/cassandra/index/sai/cql/VectorCompactionTest.java

-@RunWith(Parameterized.class)
 abstract public class VectorCompactionTest extends VectorTester
 {
+    @Rule public final Timeout timeout = new Timeout(240, TimeUnit.SECONDS);


Is your idea that the problem is not the parameterization, but that particular test methods may be hanging and you want to catch those? How was the 240 seconds chosen?

scottfines · 2026-01-23

Parameterization is definitely the problem. This is just an ad-hoc solution: break the tests apart so that each class does less (and is thus less likely to hit Ant timeouts), and then put a JUnit timeout on each one so that when the tests do fail, they fail with useful information. It doesn't really solve anything, unfortunately.

I picked 240 more or less experimentally: it seemed to be the length of time within which most of the tests would pass consistently in CI. In a perfect world, I think we would want it smaller; I believe https://github.com/riptano/cndb/issues/16075 mentions wanting a maximum 2-minute timeout for each test. Unfortunately, 2 minutes means a lot of the tests will consistently time out in the current CI environment.
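
To illustrate why an in-process timeout gives more useful failures than an external kill, here is a minimal plain-Java sketch. It is not JUnit's implementation and not code from this PR; `InProcessTimeout` and `runWithTimeout` are hypothetical names. The body runs on a worker thread, the caller waits for a bounded time, and on expiry the caller is still alive to report which test was running and how long it was given.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: enforces a per-test time limit from inside the JVM.
final class InProcessTimeout
{
    // Runs body on a worker thread and waits at most the given time.
    // Returns a one-line report instead of throwing, to keep the sketch simple.
    static String runWithTimeout(String name, Runnable body, long timeout, TimeUnit unit)
    {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<?> result = executor.submit(body);
        try
        {
            result.get(timeout, unit);
            return name + ": passed";
        }
        catch (TimeoutException e)
        {
            // The calling thread survives, so the test name and limit can be reported.
            return name + ": timed out after " + timeout + " " + unit;
        }
        catch (ExecutionException e)
        {
            return name + ": failed with " + e.getCause();
        }
        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt();
            return name + ": interrupted";
        }
        finally
        {
            // Interrupts the worker if it is still running.
            executor.shutdownNow();
        }
    }
}
```

With an external timeout, the whole process is stopped and none of these reports are produced, which is the loss of context described in the PR description.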

ekaterinadimitrova2 · 2026-01-22T23:01:48Z

test/unit/org/apache/cassandra/index/sai/cql/VectorSiftSmallTest.java

+public class VectorSiftSmallTest extends VectorTester
 {
    private static final String DATASET = "siftsmall"; // change to "sift" for larger dataset. requires manual download
+    @Rule public Timeout timeout = new Timeout(180, TimeUnit.SECONDS);


Same as before: is your idea that the problem is not the parameterization, but that particular test methods may be hanging and you want to catch those? How was the 240 seconds chosen?

test/unit/org/apache/cassandra/index/sai/cql/VectorTester.java

test/unit/org/apache/cassandra/index/sai/cql/NVQDisabledVectorSiftSmallTest.java

test/unit/org/apache/cassandra/index/sai/cql/NVQEnabledVectorCompaction100dTest.java

test/unit/org/apache/cassandra/index/sai/cql/NVQEnabledVectorSiftSmallTest.java

scottfines · 2026-01-23T15:15:34Z

@ekaterinadimitrova2 I fixed the dumb mistakes (the code style and licensing), and cleaned up the messaging.

I think the takeaway on this PR is that it doesn't really resolve the problem (at least not permanently); it's mostly aimed at improving the near-term situation with these tests. I didn't want to make a 50-test refactor that would be impossible to review, so I stuck with these two, but if the approach is acceptable I will be happy to make follow-on PRs that apply the same logic to other classes under the Versioned structure.

sonarqubecloud · 2026-01-23T17:14:55Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

cassci-bot · 2026-01-23T17:24:51Z

❌ Build ds-cassandra-pr-gate/PR-2195 rejected by Butler

2 regressions found
See build details here

Found 2 new test failures

Test	Explanation	Runs	Upstream
o.a.c.dht.SplitterTest.randomSplitTestVNodesRandomPartitioner (compression)	REGRESSION	🔵🔴	0 / 8
o.a.c.index.sai.cql.NVQDisabledVectorCompaction100dTest.testZeroOrOneToManyCompaction[dc] (compression)	NEW	🔴🔵	0 / 8

Found 2 known test failures

ekaterinadimitrova2 · 2026-01-23T17:48:53Z

One of the tests failed because of the new timeout added:
https://jenkins-stargazer.aws.dsinternal.org/job/ds-cassandra-pr-gate/job/PR-2195/15/testReport/org.apache.cassandra.index.sai.cql/NVQDisabledVectorCompaction100dTest/tests_stage_1___jvm_unit_tests___jvm_unit_compression_tests___testZeroOrOneToManyCompaction_dc__compression_jdk11/

I am not convinced we should have those per-method timeouts. They seem experimental (we do not have data that proves the timeout should be that long), and our CI environment is fairly unpredictable in how much time tests can take. Keeping them might just add more noise.

I picked the 240 more-or-less experimentally--it seemed the length of time that most of the tests would pass consistently in CI.

scottfines marked this pull request as ready for review January 13, 2026 19:53

scottfines force-pushed the cndb-16145 branch 8 times, most recently from 09646dc to a032f79 Compare January 21, 2026 14:23

scottfines changed the title ~~[WIP] CNDB-16145: Exploring tests which CI is timing out~~ CNDB-16145: Split some NVQ tests into multiple subclasses Jan 22, 2026

scottfines self-assigned this Jan 22, 2026

scottfines requested a review from ekaterinadimitrova2 January 22, 2026 15:35

scottfines added 2 commits January 22, 2026 11:00

CNDB-16145: Split VectorSiftSmallTest into multiple components to avoid ant timeouts, and add explicit junit timeouts.

5bcce1f

CNDB-16145: Split VectorCompaction100dTest into multiple subclasses to avoid ant timeouts, and adding explicit junit timeouts to provide better failure information.

24cc289

scottfines force-pushed the cndb-16145 branch from 7b448c9 to 24cc289 Compare January 22, 2026 17:02

ekaterinadimitrova2 reviewed Jan 22, 2026

Cleaning up documentation and licensing, and renaming methods to be more expressive

a77c1d5
