
@ekaterinadimitrova2 commented Jan 15, 2026

What is the issue

...
When SAI catches a CompactionInterruptedException, we only log that the index build was stopped. We don't log the most important information the exception carries, e.g. the reason it was thrown.

What does this PR fix and why was it fixed

...
Log the reason the SAI index build was stopped.
While testing this, I found a gap in our test coverage. The gap turned out to be a broken test that claimed to test something it was not actually testing. I fixed it as part of this PR and added an additional test.
So this PR covers both #15805 and #16415. I suggest we close #16415 in favor of #15805.
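A minimal sketch of the kind of change described above, not the actual patch. It assumes a hypothetical stand-in exception `BuildInterruptedException` in place of Cassandra's real `CompactionInterruptedException`, hypothetical helpers `stoppedMessage` and `runBuild`, and `java.util.logging` instead of the project's logger. The point is only that the exception's message, which carries the interruption reason, ends up in the log line instead of being dropped.

```java
import java.util.logging.Logger;

// Hypothetical stand-in for CompactionInterruptedException; its message is
// assumed to carry the reason the build was stopped.
class BuildInterruptedException extends RuntimeException {
    BuildInterruptedException(String reason) {
        super(reason);
    }
}

public class IndexBuildLogging {
    private static final Logger LOGGER = Logger.getLogger(IndexBuildLogging.class.getName());

    // Builds the log message; the "Reason" part is what was previously missing.
    static String stoppedMessage(String indexName, BuildInterruptedException e) {
        return String.format("Index build for %s was stopped. Reason: %s", indexName, e.getMessage());
    }

    // Hypothetical build routine: runs the build and, if it is interrupted,
    // logs the reason and reports that the build did not complete.
    static boolean runBuild(String indexName, Runnable build) {
        try {
            build.run();
            return true;
        } catch (BuildInterruptedException e) {
            LOGGER.info(stoppedMessage(indexName, e));
            return false;
        }
    }

    public static void main(String[] args) {
        boolean completed = runBuild("ks.tbl_idx", () -> {
            throw new BuildInterruptedException("truncate requested");
        });
        System.out.println("completed=" + completed);
    }
}
```

In the real code the reason may be exposed differently (e.g. through a dedicated field rather than `getMessage()`), so treat the message-based lookup here as an assumption.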

github-actions bot commented Jan 15, 2026

Checklist before you submit for review

  • This PR adheres to the Definition of Done
  • Make sure there is a PR in the CNDB project updating the Converged Cassandra version
  • Use NoSpamLogger for log lines that may appear frequently in the logs
  • Verify test results on Butler
  • Test coverage for new/modified code is > 80%
  • Proper code formatting
  • Proper title for each commit starting with the project-issue number, like CNDB-1234
  • Each commit has a meaningful description
  • Each commit is not very long and contains related changes
  • Renames, moves and reformatting are in distinct commits
  • All new files should contain the DataStax copyright header instead of the Apache License one

@ekaterinadimitrova2 changed the title from "DO NOT COMMIT" to "CNDB-15805: Log the reason SAI index build was stopped" Jan 20, 2026
@ekaterinadimitrova2 (author)

Code coverage reports 25% and shows the changed log lines as uncovered. This is inaccurate: I verified with breakpoints that both the changed and the new tests in NativeIndexDDLTest hit them. In the Jenkins test report, I do not see that test class being run at all: https://jenkins-stargazer.aws.dsinternal.org/job/ds-cassandra-pr-gate/job/PR-2196/2/testReport/org.apache.cassandra.index.sai.cql/

In CloudBees, some of the unit test jobs report an issue collecting all logs. It doesn't say exactly what went wrong, but I suspect this may have been the problem.

I found in the logs:

[junit-timeout] Test org.apache.cassandra.index.sai.cql.NativeIndexDDLTest FAILED

--

BUILD FAILED
/home/cassandra/workspace/build.xml:1740: The following error occurred while executing this line:
/home/cassandra/workspace/build.xml:2079: The following error occurred while executing this line:
/home/cassandra/workspace/build.xml:1438: Some test(s) failed.

Total time: 3 minutes 12 seconds
Cleaning container cassandra-builderba3d8750-684e-4748-ac50-bf037011bbaa
Creating archive from patterns ./jvm-unit-compression-066.out: cd '.' && (find . -type f -path './jvm-unit-compression-066.out') | tar -J -chf '/instance-storage/workspace/ds-cassandra-pr-gate_PR-2196/jvm-unit-compression-066-EC2-ec2-ds-stargazer-automation-ubuntu-m6id.xlarge-i-07065a7f098d45b73--1.out.tar.xz' --files-from -
ERROR: Failed to collect logs and output for partition jvm-unit-compression-066
Also:   hudson.model.Computer$TerminationRequest: Termination requested at Fri Jan 23 04:26:32 UTC 2026 by Thread[#211802,Ping thread for channel hudson.remoting.Channel@243e0f13:EC2 (ec2-ds-stargazer-automation) - ubuntu-m6id.xlarge (i-07065a7f098d45b73),5,main] [id=211802]
    at hudson.model.Computer.recordTermination(Computer.java:237)
    at hudson.model.Computer.disconnect(Computer.java:492)
    at hudson.slaves.SlaveComputer.disconnect(SlaveComputer.java:816)
    at hudson.slaves.ChannelPinger$1.onDead(ChannelPinger.java:192)
    at hudson.remoting.PingThread.ping(PingThread.java:134)
    at hudson.remoting.PingThread.run(PingThread.java:87)

It seems this was the first test class to run in that split, and the remaining test classes did not run at all. I also noticed a note about the Jenkins agent being offline, so I suspect the run was interrupted.

I will rebase and re-trigger CI.

@sonarqubecloud

@cassci-bot

❌ Build ds-cassandra-pr-gate/PR-2196 rejected by Butler


3 regressions found
See build details here


Found 3 new test failures

Test Explanation Runs Upstream
o.a.c.index.sai.cql.NativeIndexDDLTest.concurrentTruncateWithIndexBuilding (compression) REGRESSION 🔴 0 / 8
o.a.c.index.sai.cql.VectorCompaction100dTest.testOneToManyCompactionTooManyHoles[dc false] REGRESSION 🔴 0 / 8
o.a.c.index.sai.cql.VectorSiftSmallTest.testMultiSegmentBuild[ca false] REGRESSION 🔴 0 / 8

Found 3 known test failures

@ekaterinadimitrova2 (author)

concurrentTruncateWithIndexBuilding failed in the last run, but I think it is just flaky. My suggestion: either we remove the check for that particular log line as overkill, or we extend the waitForAssert timeout to something very high.
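For context on the second option: a waitForAssert-style helper typically re-runs an assertion until it passes or a deadline expires, so raising the timeout only lengthens the worst case when the log line is slow to appear; a passing run is not slowed down. A minimal sketch, assuming a hypothetical `Polling.waitForAssert(Runnable, long, long)` helper; the test utility actually used in this repository may have a different name and signature.

```java
import java.util.concurrent.TimeUnit;

public final class Polling {
    private Polling() {}

    // Hypothetical helper: re-runs the assertion every pollMillis until it
    // passes or timeoutMillis elapses. On timeout, the last AssertionError is
    // rethrown so the original failure message is preserved.
    public static void waitForAssert(Runnable assertion, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true) {
            try {
                assertion.run();
                return;
            } catch (AssertionError e) {
                if (System.nanoTime() - deadline >= 0) {
                    throw e;
                }
                try {
                    Thread.sleep(pollMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt(); // restore the interrupt flag
                    throw new AssertionError("interrupted while waiting", ie);
                }
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Simulates a log line that only shows up after roughly 200 ms.
        waitForAssert(() -> {
            if (System.currentTimeMillis() - start < 200) {
                throw new AssertionError("log line not found yet");
            }
        }, 5_000, 50);
        System.out.println("assertion passed");
    }
}
```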
