Skip to content

Conversation

@scottfines
Copy link

@scottfines scottfines commented Jan 13, 2026

What is the issue

Some unit tests are getting timed out by Ant, implying that they are being slow for some reason. When Ant times out, you lose all the context that Junit creates, so we can't really see how far the test gets or anything like that, to see what the real problem is. To (hopefully) identify the actual problem, this adds shorter, explicit Junit timeouts that will at least let us see the stdout and stuff of the test to figure out what's happening when it fails.

This is ultimately aimed at helping with https://github.com/riptano/cndb/issues/16075 and https://github.com/riptano/cndb/issues/16316

What does this PR fix and why was it fixed

I manually split a few tests up into subclasses based around whether NVQ is enabled. Additionally, I added a JUnit timeout rule so that tests which take too long are timed out by junit rather than by ant. Doing this allows us to at least get some information about which tests take too long and what they were doing.

This is somewhat naive, because it does nothing to address the immediate causes of the problem: adding more versions will ultimately cause these tests to time out again in the future. However, in the short term this will help stabilize slower test and CI environments.

@github-actions
Copy link

github-actions bot commented Jan 13, 2026

Checklist before you submit for review

  • This PR adheres to the Definition of Done
  • Make sure there is a PR in the CNDB project updating the Converged Cassandra version
  • Use NoSpamLogger for log lines that may appear frequently in the logs
  • Verify test results on Butler
  • Test coverage for new/modified code is > 80%
  • Proper code formatting
  • Proper title for each commit staring with the project-issue number, like CNDB-1234
  • Each commit has a meaningful description
  • Each commit is not very long and contains related changes
  • Renames, moves and reformatting are in distinct commits
  • All new files should contain the DataStax copyright header instead of the Apache License one

@scottfines scottfines marked this pull request as ready for review January 13, 2026 19:53
@scottfines scottfines force-pushed the cndb-16145 branch 8 times, most recently from 09646dc to a032f79 Compare January 21, 2026 14:23
@scottfines
Copy link
Author

@scottfines scottfines changed the title [WIP] CNDB-16145: Exploring tests which CI is timing out CNDB-16145: Split some NVQ tests into multiple subclasses Jan 22, 2026
@scottfines scottfines self-assigned this Jan 22, 2026
…id ant timeouts, and add explicit junit timeouts.
…o avoid ant timeouts, and adding explicit junit timeouts to provide better failure information.
Copy link

@ekaterinadimitrova2 ekaterinadimitrova2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR. I left a few comments regarding license header, code style, doc.

I have a few questions:

  1. Are the refactored tests failing in CI and having their own tickets? For particular tests/short-term solution to the CI problem - we need their own ticket and we should not use #16145 for that. Let's leave that one alone for now, please.
  2. There are more tests that are still extending VerctorTester.Versioned - you did not find failures in CI for them? They are shorter/simpler and do not suffer from the same issues?

I am not completely convinced we should have two different approaches for the same type of testing/parameterization. Maybe we need at least a documentation for Versioned where we state the problem, and we outline when it is not a good idea to use it for that type of parameterization? @michaeljmarshall, can you take a look into this PR and share opinion too, please?

@RunWith(Parameterized.class)
abstract public class VectorCompactionTest extends VectorTester
{
@Rule public final Timeout timeout = new Timeout(240, TimeUnit.SECONDS);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is your idea that it is not the parameterization the problem, but there are particular test methods that may be hanging and you want to catch those? How was the 240 seconds chosen?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parameterization is definitely the problem. This is just an ad-hoc solution: break the tests apart so that each class does less (and is thus less likely to hit ant timeouts), and then stick a junit timeout on it so that when they do fail, they fail with useful information. It doesn't really solve anything, unfortunately.

I picked the 240 more-or-less experimentally--it seemed the length of time that most of the tests would pass consistently in CI. In a perfect world, I think we would want it smaller--I believe https://github.com/riptano/cndb/issues/16075 mentions wanting a max 2 minute timeout for each test. Unfortunately, 2 minutes means a lot of the tests will consistently time out in the current CI environment.

public class VectorSiftSmallTest extends VectorTester
{
private static final String DATASET = "siftsmall"; // change to "sift" for larger dataset. requires manual download
@Rule public Timeout timeout = new Timeout(180, TimeUnit.SECONDS);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as before - Is your idea that it is not the parameterization the problem, but there are particular test methods that may be hanging and you want to catch those? How was the 240 seconds chosen?

@scottfines
Copy link
Author

@ekaterinadimitrova2 I fixed the dumb mistakes (the code style and licensing), and cleaned up the messaging.

I think the take on this PR is that it doesn't really resolve the solution (at least not permanently), it's mostly aimed at improving the near-term situation with these tests. I didn't want to make a 50-test refactor that would be impossible to review, so I stuck with these two, but if the approach is acceptable then I will be happy to make follow-on PRs to apply the same logic to other classes under the Versioned structure.

@sonarqubecloud
Copy link

@cassci-bot
Copy link

❌ Build ds-cassandra-pr-gate/PR-2195 rejected by Butler


2 regressions found
See build details here


Found 2 new test failures

Test Explanation Runs Upstream
o.a.c.dht.SplitterTest.randomSplitTestVNodesRandomPartitioner (compression) REGRESSION 🔵🔴 0 / 8
o.a.c.index.sai.cql.NVQDisabledVectorCompaction100dTest.testZeroOrOneToManyCompaction[dc] (compression) NEW 🔴🔵 0 / 8

Found 2 known test failures

@ekaterinadimitrova2
Copy link

ekaterinadimitrova2 commented Jan 23, 2026

One of the tests failed because of the new timeout added:
https://jenkins-stargazer.aws.dsinternal.org/job/ds-cassandra-pr-gate/job/PR-2195/15/testReport/org.apache.cassandra.index.sai.cql/NVQDisabledVectorCompaction100dTest/tests_stage_1___jvm_unit_tests___jvm_unit_compression_tests___testZeroOrOneToManyCompaction_dc__compression_jdk11/

I am not convinced we should have those per-method timeouts. It seems like they are experimental (we do not have data that proves the timeout should be that long) and our CI environment is kind of unpredictable how much time tests can take. Might be more noise keeping those.

I picked the 240 more-or-less experimentally--it seemed the length of time that most of the tests would pass consistently in CI.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants