CORENET-6605: avoid flapping Degraded on transient failures #2862
base: master
Conversation
Walkthrough: Adds a 2-minute persistence debounce for setting Degraded conditions, using a clock abstraction and per-level first-seen timestamps; reorganizes MachineConfigPool status checks into a two-pass flow that separates degraded detection from progressing/processing; updates tests to use a fake clock and validate time-based behavior.
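As a rough illustration of the debounce idea described above (this is not the PR's actual statusmanager code; the type, field, and function names below are hypothetical), the pattern looks roughly like this:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/utils/clock"
)

// degradedPersistenceWindow mirrors the 2-minute window described above.
const degradedPersistenceWindow = 2 * time.Minute

// degradedDebouncer records when a failure was first seen per status level and
// only reports Degraded once the failure has persisted for the full window.
type degradedDebouncer struct {
	clock     clock.Clock          // injectable so tests can use a fake clock
	firstSeen map[string]time.Time // per-level first-seen timestamps
}

// shouldSetDegraded returns true only if the failure at this level has been
// present for at least the persistence window.
func (d *degradedDebouncer) shouldSetDegraded(level string) bool {
	now := d.clock.Now()
	first, ok := d.firstSeen[level]
	if !ok {
		d.firstSeen[level] = now // first failure: start the timer, stay non-Degraded
		return false
	}
	return now.Sub(first) >= degradedPersistenceWindow
}

// clear forgets the failure so a later one restarts the timer.
func (d *degradedDebouncer) clear(level string) {
	delete(d.firstSeen, level)
}

func main() {
	d := &degradedDebouncer{clock: clock.RealClock{}, firstSeen: map[string]time.Time{}}
	fmt.Println(d.shouldSetDegraded("OperatorConfig")) // false: failure just started
}
```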
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

📜 Recent review details
Review profile: CHILL
⛔ Files ignored due to path filters (3)
📒 Files selected for processing (3)
🚧 Files skipped from review as they are similar to previous changes (1)
🧰 Additional context used: 📓 Path-based instructions (1), ⚙️ CodeRabbit configuration file
🧬 Code graph analysis (1): pkg/controller/statusmanager/machineconfig_status.go (2)
🔇 Additional comments (5)

Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.
🔧 golangci-lint (2.5.0): Error: can't load config: unsupported version of the configuration: "". See https://golangci-lint.run/docs/product/migration-guide for migration instructions.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: jluhrsen
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
pkg/apply/apply.go (1)
130-135: Consider exponential backoff for faster recovery. With Factor: 1.0, all retries wait a constant 5 seconds. For transient issues that resolve quickly (e.g., brief network blips), exponential backoff starting shorter would recover faster:

```diff
 var backoff = wait.Backoff{
 	Steps:    6,
-	Duration: 5 * time.Second,
-	Factor:   1.0,
+	Duration: 1 * time.Second,
+	Factor:   2.0,
 	Jitter:   0.1,
 }
```

This retries at roughly 1s, 2s, 4s, 8s, 16s, and 32s intervals, giving faster initial recovery while still reaching a similar total wait time.
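If it helps to sanity-check the proposed schedule, here is a small standalone snippet (illustrative only, not part of the PR) that steps through the suggested wait.Backoff and prints each delay:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	// The backoff proposed above: exponential growth from 1s with 10% jitter.
	b := wait.Backoff{
		Steps:    6,
		Duration: 1 * time.Second,
		Factor:   2.0,
		Jitter:   0.1,
	}
	total := time.Duration(0)
	for i := 1; i <= 6; i++ {
		// Step returns the next (jittered) delay and advances the backoff state.
		d := b.Step()
		total += d
		fmt.Printf("retry %d after ~%v\n", i, d.Round(100*time.Millisecond))
	}
	// Roughly 1s + 2s + 4s + 8s + 16s + 32s ≈ 63s total (plus up to 10% jitter),
	// compared with 6 × 5s = 30s for the current constant 5-second backoff.
	fmt.Printf("total wait ≈ %v\n", total.Round(time.Second))
}
```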
📜 Review details
Review profile: CHILL
📒 Files selected for processing (1): pkg/apply/apply.go (3 hunks)
🧰 Additional context used
📓 Path-based instructions (1): `**`
⚙️ CodeRabbit configuration file:
- Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Files: pkg/apply/apply.go
@jluhrsen: This pull request references CORENET-6605 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
dd0058d to df74afe (compare)
/payload-aggregate periodic-ci-openshift-release-master-ci-4.21-upgrade-from-stable-4.20-e2e-aws-ovn-upgrade 10

@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/fd326400-e057-11f0-99fb-9f2dbe12e85d-0

/payload-aggregate periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial-2of2 10

@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/ad155810-e0f2-11f0-96b9-cee61ebafabc-0

/retest

@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/c8beb9e0-e119-11f0-9d9d-69e629931e4c-0

/retest

@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/d52d40b0-e2eb-11f0-9214-d727f6f6cf81-0

/payload-aggregate periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial-2of2 10

@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/e3ed6bd0-eb64-11f0-9969-dc64703d602b-0

/payload-aggregate periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial-2of2 10

@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/ad12e9b0-ecbc-11f0-8708-c2c4dd3a9afd-0

/test okd-scos-images
this group of 10 did not see the issue (as expected) |
It seems like this is the real problem here; the docs say
(emphasis added) We already queue a retry if (This problem is probably endemic to CNO's controllers...)
danwinship left a comment
(if we are going to merge the PR like this...)
pkg/apply/apply.go (Outdated)

```go
	Duration: 5 * time.Second,
	Factor:   1.0,
	Jitter:   0.1,
}
```
Why these numbers, particularly given that the test code waits 1 minute?
pkg/apply/apply.go (Outdated)

```go
err = retry.OnError(backoff, func(err error) bool {
	// Don't retry on context cancellation (graceful shutdown)
	return !errors.Is(err, context.Canceled) && !errors.Is(err, context.DeadlineExceeded)
}, func() error {
```
code would probably be clearer if you split out the test into its own function, eg:

```go
err = retry.OnError(backoff, isAPIServerRestartingError, func() error {
	...
})
```
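A possible shape for that extracted predicate (the name isAPIServerRestartingError comes from the suggestion above; this is only a sketch, not code from the PR):

```go
package apply

import (
	"context"
	"errors"
)

// isAPIServerRestartingError reports whether an apply failure looks transient
// (e.g. the API server briefly unavailable during a restart) and is therefore
// worth retrying. Context cancellation and deadline expiry indicate a
// deliberate shutdown, so they are not retried.
func isAPIServerRestartingError(err error) bool {
	return !errors.Is(err, context.Canceled) && !errors.Is(err, context.DeadlineExceeded)
}
```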
ff7b923 to 1b94f0f (compare)
@jluhrsen: This pull request references CORENET-6605 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/payload-aggregate periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial-2of2 10

@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/e5ca30f0-f0dc-11f0-8bd4-ffddbc3ce3e2-0

/test e2e-aws-ovn-upgrade
1b94f0f to 4436d4c (compare)
CNO is going Degraded on the first connection issue with the API
server, but that can happen briefly on a new rollout or during
transient network issues. This is seen periodically in test cases
doing a new rollout like this one [0], which even does retries [1]
to work around the issue.
Instead of setting Degraded immediately on first failure, track
when failures start and only set Degraded after they persist for
2+ minutes. ETCD has a similar pattern and uses 2 minutes as its
default. [2]
Also fixes a race condition in SetFromMachineConfigPool where
setNotDegraded() was being called for each non-degraded role in
a loop, which could clear failure tracking before checking all
roles. Restructured to use a two-pass approach: first check all
roles for degradation, then check progressing status.
[0] https://github.com/openshift/origin/blob/3854d32174b5e9ddaded1dfcc8a865bb28ca04ad/test/extended/networking/services.go#L26
[1] https://github.com/openshift/origin/blob/3854d32174b5e9ddaded1dfcc8a865bb28ca04ad/test/extended/networking/services.go#L57-L63
[2] https://github.com/openshift/cluster-etcd-operator/blob/d93728daa2fb69410025029740b0f8826479c4c3/pkg/operator/starter.go#L390

Signed-off-by: Jamo Luhrsen <jluhrsen@gmail.com>
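To make the two-pass restructuring described in the commit message concrete, here is a simplified, hypothetical sketch (the pool type and the setDegraded/setNotDegraded/setProgressing calls are placeholders, not the operator's real API):

```go
package main

import "fmt"

// pool is a stand-in for a MachineConfigPool with just the fields we need.
type pool struct {
	name     string
	degraded bool
	updating bool
}

// syncPools shows the two-pass flow: decide Degraded across all pools first,
// and only then clear failure tracking and report progressing state. Clearing
// per pool inside the first loop could wipe the failure timer for a role that
// a later iteration would have found degraded.
func syncPools(pools []pool) {
	// Pass 1: if any pool is degraded, report it and stop.
	for _, p := range pools {
		if p.degraded {
			fmt.Printf("setDegraded(%s)\n", p.name)
			return
		}
	}

	// Pass 2: nothing is degraded; clear tracking once, then report progressing.
	fmt.Println("setNotDegraded()")
	for _, p := range pools {
		if p.updating {
			fmt.Printf("setProgressing(%s)\n", p.name)
		}
	}
}

func main() {
	syncPools([]pool{
		{name: "master"},
		{name: "worker", updating: true},
	})
}
```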