
@scottfines

What is the issue

Network timeouts during Repair operations are detected in two different ways: a configured count-down latch that fails if the requests take too long, and a separate timeout returned by the underlying messaging system. Unfortunately, when the underlying messaging system timed out, the failure was treated as a general network error instead of a timeout. As a result, on the rare occasions when the messaging system times out before the count-down latch does, some tests in the CI build fail with an incorrect error message.
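The two detection paths can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code changed in this PR: `RepairTimeoutSketch`, `FailureReason`, `Outcome`, `classify`, and `awaitResponses` are hypothetical names invented for the example, and only `CountDownLatch` is a real JDK class. The point it tries to show is that a timeout reported by the messaging layer and a timeout of the latch both end up as the same outcome.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: none of these names come from the actual code base.
final class RepairTimeoutSketch {
    /** Assumed failure reasons that a messaging layer might report. */
    enum FailureReason { TIMEOUT, UNKNOWN }

    /** Assumed outcomes surfaced to the caller of a repair request. */
    enum Outcome { SUCCESS, TIMEOUT, NETWORK_ERROR }

    /** Maps a messaging-level failure to an outcome; a timeout stays a timeout. */
    static Outcome classify(FailureReason reason) {
        return reason == FailureReason.TIMEOUT ? Outcome.TIMEOUT : Outcome.NETWORK_ERROR;
    }

    /**
     * Waits for responses. If the messaging layer already reported a failure
     * (non-null messagingFailure), that failure is classified first; otherwise
     * the latch is awaited and an expired wait is reported as a timeout.
     */
    static Outcome awaitResponses(CountDownLatch latch, long timeoutMillis, FailureReason messagingFailure) {
        if (messagingFailure != null) {
            return classify(messagingFailure);
        }
        try {
            // await returns false when the wait time elapses before the count reaches zero
            return latch.await(timeoutMillis, TimeUnit.MILLISECONDS) ? Outcome.SUCCESS : Outcome.TIMEOUT;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // restore the interrupt flag
            throw new IllegalStateException("Interrupted while waiting for responses", e);
        }
    }
}
```

With this shape, a caller sees `Outcome.TIMEOUT` whether the latch expires or the messaging layer reports a timeout first, so the resulting error message does not depend on which of the two fires.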

What does this PR fix and why was it fixed

This resolves https://github.com/riptano/cndb/issues/14687.

The main motivation is fixing flaky unit tests, but end users will also see a more consistent error message when a network timeout is detected at a lower level than the latch timeout.

…age as a latch timeout. This creates a more consistent behavior in the event of different network timeouts, and has a side benefit of fixing three or four different flaky tests
@github-actions

github-actions bot commented Jan 13, 2026

Checklist before you submit for review

  • This PR adheres to the Definition of Done
  • Make sure there is a PR in the CNDB project updating the Converged Cassandra version
  • Use NoSpamLogger for log lines that may appear frequently in the logs
  • Verify test results on Butler
  • Test coverage for new/modified code is > 80%
  • Proper code formatting
  • Proper title for each commit starting with the project-issue number, like CNDB-1234
  • Each commit has a meaningful description
  • Each commit is not very long and contains related changes
  • Renames, moves and reformatting are in distinct commits
  • All new files should contain the DataStax copyright header instead of the Apache License one

@driftx driftx self-requested a review January 15, 2026 14:26
@driftx

driftx commented Jan 15, 2026

Restarted the CI job, hopefully that works: https://jenkins-stargazer.aws.dsinternal.org/job/ds-cassandra-pr-gate/job/PR-2194/

@driftx driftx left a comment

Everything looks good. Butler didn't fire from the CI restart, but there weren't many failures, and they were all timeouts that don't reproduce, so they aren't related to this.

@scottfines scottfines merged commit a87ea67 into main Jan 16, 2026
486 of 501 checks passed
@scottfines scottfines deleted the c14687 branch January 16, 2026 14:52
3 participants