SREP-2079: Making sure that alerts are processed in an atomic way #175
base: master
Conversation
@Nikokolas3270: This pull request references SREP-2079, which is a valid Jira issue. Warning: the referenced Jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.22.0" version, but no target version was set.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: Nikokolas3270
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Codecov Report
❌ Patch coverage is …
Additional details and impacted files:
@@ Coverage Diff @@
## master #175 +/- ##
==========================================
+ Coverage 53.95% 55.67% +1.71%
==========================================
Files 23 23
Lines 1820 1895 +75
==========================================
+ Hits 982 1055 +73
- Misses 780 785 +5
+ Partials 58 55 -3
The RHOBS or classic webhook may process an alert twice, either sequentially or in parallel, due to Prometheus or Alertmanager redundancy. This change makes sure that the custom resources used to track notifications are test-and-set (TAS) atomically, which avoids sending notifications for duplicate alerts.
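As a rough illustration of the test-and-set idea (not the code in this PR), the sketch below performs the read, check, and write of a notification CR inside one optimistic-concurrency retry loop, so a concurrent update from a second webhook replica causes a conflict and a re-check instead of a duplicate notification. The group/version/kind, condition name, and function are assumptions invented for this example; only the client-go and controller-runtime calls are real APIs.

```go
package example

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/util/retry"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// markAlertFiring records an "AlertFiring" condition with a test-and-set:
// if a concurrent writer updates the CR between the Get and the status
// update, the update fails with a Conflict and the whole closure is retried,
// so only one caller ever transitions the condition.
// The GVK and condition name below are assumptions for illustration.
func markAlertFiring(ctx context.Context, c client.Client, key client.ObjectKey) (alreadyFiring bool, err error) {
	gvk := schema.GroupVersionKind{
		Group:   "ocmagent.managed.openshift.io", // assumed group, for illustration only
		Version: "v1alpha1",
		Kind:    "ManagedNotification",
	}
	err = retry.RetryOnConflict(retry.DefaultRetry, func() error {
		obj := &unstructured.Unstructured{}
		obj.SetGroupVersionKind(gvk)
		if err := c.Get(ctx, key, obj); err != nil {
			return err
		}
		conditions, _, _ := unstructured.NestedSlice(obj.Object, "status", "conditions")
		for _, raw := range conditions {
			if cond, ok := raw.(map[string]interface{}); ok &&
				cond["type"] == "AlertFiring" && cond["status"] == "True" {
				alreadyFiring = true // "test": another replica already recorded this alert
				return nil
			}
		}
		conditions = append(conditions, map[string]interface{}{
			"type":               "AlertFiring",
			"status":             "True",
			"lastTransitionTime": time.Now().UTC().Format(time.RFC3339),
		})
		if err := unstructured.SetNestedSlice(obj.Object, conditions, "status", "conditions"); err != nil {
			return err
		}
		// "set": fails with a Conflict if the resourceVersion changed since the
		// Get above, which restarts this closure.
		return c.Status().Update(ctx, obj)
	})
	return alreadyFiring, err
}
```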
Force-pushed from 180b239 to f12430d (Compare)
@Nikokolas3270: all tests passed! Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
return err
for _, limitedSupportReason := range limitedSupportReasons {
	// If the reason matches the fleet notification LS reason, remove it
	// TODO(ngrauss): The limitedSupportReason.ID() should be stored in the ManagedFleetNotificationRecord record item object
there is a todo here, is this something outside of the scope of this PR? :)
Yes, I will open a new ticket for that.
Essentially the custom resource definitions should be changed to make sure that the LS which is removed is really the one signalled by the code.
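To make that follow-up concrete, one hypothetical shape for the change is sketched below: the record item would carry the limited support reason ID returned by OCM, and removal would match on that stored ID rather than on the reason details. The field, type, and function names here are invented for illustration and are not part of this PR.

```go
package example

// NotificationRecordItem stands in for the operator's record item type.
type NotificationRecordItem struct {
	FiringNotificationSentCount   int
	ResolvedNotificationSentCount int

	// LimitedSupportReasonID would hold the ID returned by OCM when the
	// limited support reason was posted; empty when none is active.
	LimitedSupportReasonID string `json:"limitedSupportReasonID,omitempty"`
}

// shouldRemoveReason would replace the "reason matches the fleet notification
// LS reason" check: only the reason whose ID was recorded gets removed.
func shouldRemoveReason(item NotificationRecordItem, reasonID string) bool {
	return item.LimitedSupportReasonID != "" && item.LimitedSupportReasonID == reasonID
}
```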
	return c.inPlaceStatusUpdate()
}

// TODO(ngrauss): to be removed
there is a todo here, is this something outside of the scope of this PR? :)
Yes, once the CRD model is changed we won't need to restore the status the way it was anymore.
func (c *fleetNotificationContext) inPlaceStatusUpdate() error {
	// c.notificationRecordItem is a pointer but it is not part of the managedFleetNotificationRecord object
	// Below code makes sure to update the oav1alpha1.NotificationRecordItem inside the managedFleetNotificationRecord object with the latest values.
	// TODO(ngrauss): refactor GetNotificationRecordItem method to return a reference to the object inside the managedFleetNotificationRecord
todo? ;)
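The comments in the diff above describe a copy-back step: the item the context holds is a detached copy, so before updating the CR status the matching element inside the record has to be located and overwritten. Below is a minimal sketch of that pattern; Item, Record, and the hosted-cluster-ID lookup key are simplified stand-ins for the operator's actual types, not the PR's code.

```go
package example

// Item stands in for oav1alpha1.NotificationRecordItem.
type Item struct {
	HostedClusterID               string
	FiringNotificationSentCount   int
	ResolvedNotificationSentCount int
}

// Record stands in for ManagedFleetNotificationRecord.
type Record struct {
	Items []Item
}

// syncItem overwrites the element inside rec that corresponds to the detached
// copy, so that a subsequent status update on rec persists the latest counters.
// It returns false if no matching element exists.
func syncItem(rec *Record, detached Item) bool {
	for i := range rec.Items {
		if rec.Items[i].HostedClusterID == detached.HostedClusterID {
			rec.Items[i] = detached
			return true
		}
	}
	return false
}
```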
notificationRecordItem.FiringNotificationSentCount > notificationRecordItem.ResolvedNotificationSentCount
// Counters are identical when no limited support is active
// Sent counter is higher than resolved counter by 1 when limited support is active
// TODO(ngrauss): record the limited support reason ID in the NotificationRecordItem object to be able to
todo? ;)
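A small illustration of the counter invariant stated in the diff comments above (the helper is a sketch, not the PR's code): with no limited support active the two counters are equal; while one is active, the firing counter leads by exactly one because the matching "resolved" notification has not been sent yet.

```go
package example

// limitedSupportActive reports whether a limited support reason is currently
// active, based only on the two sent counters.
func limitedSupportActive(firingSent, resolvedSent int) bool {
	return firingSent > resolvedSent
}

// Examples:
//   limitedSupportActive(3, 3) == false // no LS active, counters identical
//   limitedSupportActive(4, 3) == true  // LS posted, resolved notification not yet sent
```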
if resolvedCondition != nil {
	lastWebhookCallTime := resolvedCondition.LastTransitionTime

	if nowTime.Before(lastWebhookCallTime.Add(3 * time.Minute)) {
question: is the 3 minutes here intentional? The comment above says ServiceLogSent may be updated within up to 2 minutes
learning question for me, where does the 2 minutes max allowed time come from? :D
if c.retriever.fleetNotification.ResendWait > 0 {
	dontResendDuration = time.Duration(c.retriever.fleetNotification.ResendWait) * time.Hour
} else {
	dontResendDuration = time.Duration(3) * time.Minute
thought: if this is the default resendWait time, is 3 minutes a bit low?
Not sure I understand correctly: if an alert fires every 4 minutes, will this resend every time?
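To make the timing question concrete, here is the guard from the diff replayed for that scenario, under the assumption that the check really is `nowTime.Before(lastSent.Add(window))` with the 3-minute default: an alert re-firing 4 minutes after the last notification falls outside the window, so under that guard it would be sent again. This is just the arithmetic, not the operator's code.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	window := 3 * time.Minute                    // default when ResendWait is unset, per the diff
	lastSent := time.Now().Add(-4 * time.Minute) // alert re-fires 4 minutes after the last send

	suppressed := time.Now().Before(lastSent.Add(window))
	fmt.Println("notification suppressed:", suppressed) // prints false -> it would resend
}
```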
What type of PR is this?
bug
What this PR does / why we need it?
Customers are currently spammed and receive the same notification several times
Which Jira/Github issue(s) this PR fixes?
Fixes SREP-2079
Special notes for your reviewer:
For RHOBS webhook:
- lastTransitionTime is now only incremented when sending a notification

For classic webhook:
- AlertFiring and AlertResolved conditions are test-and-set in an atomic way (a minimal sketch of the timestamp rules follows after the pre-checks below).
- AlertFiring timestamp is only updated when the condition status changes.
- AlertResolved timestamp changes any time the webhook is called.
- ServiceLogSent condition is not processed in an atomic way, as this condition is not used to determine whether the alert was already firing or not.
- The ServiceLogSent condition is processed later, asynchronously.

Pre-checks (if applicable):
- Ran the make generate command locally to validate code changes -> There is no make generate command.
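The sketch below illustrates the timestamp rules listed under "For classic webhook", using metav1.Condition as a stand-in for the operator's own condition type: AlertFiring only bumps lastTransitionTime when its status actually flips, while AlertResolved would be bumped on every webhook call. The helper and the placeholder reason are illustrative; this is not the code added by the PR.

```go
package main

import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setCondition updates (or appends) the condition of the given type. When
// alwaysBumpTime is false, the transition time only changes on a status flip.
func setCondition(conds []metav1.Condition, condType string, status metav1.ConditionStatus, alwaysBumpTime bool) []metav1.Condition {
	now := metav1.NewTime(time.Now())
	for i := range conds {
		if conds[i].Type != condType {
			continue
		}
		if conds[i].Status != status || alwaysBumpTime {
			conds[i].LastTransitionTime = now
		}
		conds[i].Status = status
		return conds
	}
	return append(conds, metav1.Condition{
		Type:               condType,
		Status:             status,
		Reason:             "WebhookReceived", // placeholder reason
		LastTransitionTime: now,
	})
}

func main() {
	var conds []metav1.Condition
	conds = setCondition(conds, "AlertFiring", metav1.ConditionTrue, false)
	first := conds[0].LastTransitionTime

	// A duplicate firing webhook call does not move the AlertFiring timestamp,
	// so the duplicate can be detected and no second notification is sent.
	conds = setCondition(conds, "AlertFiring", metav1.ConditionTrue, false)
	fmt.Println("AlertFiring timestamp unchanged:", conds[0].LastTransitionTime.Equal(&first))
}
```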