
Conversation

Contributor

@Nikokolas3270 Nikokolas3270 commented Jan 2, 2026

The RHOBS or classic webhook may process an alert twice, sequentially or in parallel, due to Prometheus or Alertmanager redundancy. This change makes sure that the custom resources used to track notifications are tested and set (TAS) in an atomic way. This avoids sending notifications for duplicate alerts.

What type of PR is this?

bug

What this PR does / why we need it?

Customers are currently being spammed: they receive the same notification several times.

Which Jira/Github issue(s) this PR fixes?

Fixes SREP-2079

Special notes for your reviewer:

For RHOBS webhook:

  • For the sake of atomicity, counters are now incremented before sending the service log or the limited support notification.
  • lastTransitionTime is now only updated when a notification is sent
  • The whole status is tested and set in an atomic way (a rough sketch of the test-and-set pattern is shown right after this list)
  • Counters are decremented if, for some reason, the notification ultimately cannot be sent or is discarded
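
A rough, hypothetical sketch of the test-and-set pattern referenced above: a read-check-update cycle wrapped in an optimistic-concurrency retry, so that a concurrent webhook replica loses the race and then sees the already-incremented counter. The helper and callback names below are illustrative, not the actual types or functions in pkg/handlers.

package handlers

import (
	"context"
	"fmt"

	"k8s.io/client-go/util/retry"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// testAndSet reads the notification-tracking CR, checks whether the
// notification was already counted, and increments/updates it in the same
// Status().Update call. If a concurrent replica updated the CR first, the
// update fails with a conflict, the closure re-runs, and the test then sees
// the already-incremented counter.
func testAndSet(ctx context.Context, c client.Client, key client.ObjectKey, obj client.Object,
	alreadySent func(client.Object) bool, markSent func(client.Object)) (bool, error) {
	won := false
	err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
		if err := c.Get(ctx, key, obj); err != nil {
			return err
		}
		if alreadySent(obj) {
			won = false // duplicate alert: the test failed, nothing to set
			return nil
		}
		markSent(obj) // e.g. increment the firing counter, bump lastTransitionTime
		if err := c.Status().Update(ctx, obj); err != nil {
			return err // a conflict error here re-runs the whole closure
		}
		won = true
		return nil
	})
	if err != nil {
		return false, fmt.Errorf("atomic test-and-set failed: %w", err)
	}
	return won, nil
}

The key point is that the test and the set are applied against the same resourceVersion: whichever replica wins the update sends the notification, and the others treat the alert as a duplicate.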

For classic webhook:

  • For the sake of atomicity, the AlertFiring and AlertResolved conditions are tested and set in an atomic way (see the condition-handling sketch after this list).
  • The AlertFiring timestamp is only updated when the condition status changes
  • The AlertResolved timestamp changes any time the webhook is called
  • The ServiceLogSent condition is not processed in an atomic way, as this condition is not used to determine whether the alert was already firing or not.
  • Unlike the RHOBS webhook, there is no need to restore conditions to their previous state if a service log cannot be sent: counting the number of SLs sent and recording the time at which the SL was sent are handled by the ServiceLogSent condition, which is processed later, asynchronously.
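
To illustrate the timestamp behaviour called out above, here is a rough sketch using the standard metav1.Condition helpers. The condition type names and helper functions are illustrative; the real code works with the ocm-agent-operator condition types, and the atomic update itself would follow the same test-and-set pattern as the RHOBS sketch above.

package handlers

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setConditionIfTransitioned flips condType to the given status and returns
// true only when the status actually changed; LastTransitionTime is bumped
// only on such a transition (the AlertFiring behaviour described above).
func setConditionIfTransitioned(conds *[]metav1.Condition, condType string, status metav1.ConditionStatus, now metav1.Time) bool {
	if existing := meta.FindStatusCondition(*conds, condType); existing != nil && existing.Status == status {
		return false // no transition: keep the previous LastTransitionTime
	}
	meta.SetStatusCondition(conds, metav1.Condition{
		Type:               condType,
		Status:             status,
		Reason:             "WebhookProcessed",
		LastTransitionTime: now,
	})
	return true
}

// touchCondition refreshes LastTransitionTime unconditionally (the
// AlertResolved behaviour: its timestamp changes on every webhook call).
func touchCondition(conds *[]metav1.Condition, condType string, status metav1.ConditionStatus, now metav1.Time) {
	meta.RemoveStatusCondition(conds, condType)
	meta.SetStatusCondition(conds, metav1.Condition{
		Type:               condType,
		Status:             status,
		Reason:             "WebhookProcessed",
		LastTransitionTime: now,
	})
}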

Pre-checks (if applicable):

  • Tested latest changes against a cluster
  • Ran make generate command locally to validate code changes -> There is no make generate command.
  • Included documentation changes with PR -> Not needed, no API change


openshift-ci-robot commented Jan 2, 2026

@Nikokolas3270: This pull request references SREP-2079 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.22.0" version, but no target version was set.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jan 2, 2026
@openshift-ci openshift-ci bot requested review from Tafhim and ravitri January 2, 2026 18:49
Contributor

openshift-ci bot commented Jan 2, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Nikokolas3270

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 2, 2026

codecov-commenter commented Jan 2, 2026

Codecov Report

❌ Patch coverage is 89.59538% with 36 lines in your changes missing coverage. Please review.
✅ Project coverage is 55.67%. Comparing base (f601961) to head (f12430d).

Files with missing lines Patch % Lines
pkg/handlers/webhookreceiver.go 86.25% 15 Missing and 7 partials ⚠️
pkg/handlers/webhookrhobsreceiver.go 92.47% 10 Missing and 4 partials ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #175      +/-   ##
==========================================
+ Coverage   53.95%   55.67%   +1.71%     
==========================================
  Files          23       23              
  Lines        1820     1895      +75     
==========================================
+ Hits          982     1055      +73     
- Misses        780      785       +5     
+ Partials       58       55       -3     
Files with missing lines Coverage Δ
pkg/handlers/webhookrhobsreceiver.go 90.90% <92.47%> (+5.34%) ⬆️
pkg/handlers/webhookreceiver.go 81.28% <86.25%> (+1.88%) ⬆️

... and 2 files with indirect coverage changes


Contributor

openshift-ci bot commented Jan 2, 2026

@Nikokolas3270: all tests passed!

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

return err
for _, limitedSupportReason := range limitedSupportReasons {
// If the reason matches the fleet notification LS reason, remove it
// TODO(ngrauss): The limitedSupportReason.ID() should be stored in the ManagedFleetNotificationRecord record item object


there is a TODO here; is this something outside the scope of this PR? :)

Contributor Author


Yes, I will open a new ticket for that.
Essentially, the custom resource definitions should be changed to make sure that the LS which is removed is really the one signalled by the code.

return c.inPlaceStatusUpdate()
}

// TODO(ngrauss): to be removed


there is a TODO here; is this something outside the scope of this PR? :)

Contributor Author


Yes, once the CRD model is changed we won't need to restore the status to the way it was anymore.

func (c *fleetNotificationContext) inPlaceStatusUpdate() error {
// c.notificationRecordItem is a pointer but it is not part of the managedFleetNotificationRecord object
// Below code makes sure to update the oav1alpha1.NotificationRecordItem inside the managedFleetNotificationRecord object with the latest values.
// TODO(ngrauss): refactor GetNotificationRecordItem method to return a reference to the object inside the managedFleetNotificationRecord


todo? ;)

notificationRecordItem.FiringNotificationSentCount > notificationRecordItem.ResolvedNotificationSentCount
// Counters are identical when no limited support is active
// Sent counter is higher than resolved counter by 1 when limited support is active
// TODO(ngrauss): record the limited support reason ID in the NotificationRecordItem object to be able to


todo? ;)

if resolvedCondition != nil {
lastWebhookCallTime := resolvedCondition.LastTransitionTime

if nowTime.Before(lastWebhookCallTime.Add(3 * time.Minute)) {


question: is the 3 minutes here intentional? The comment above says ServiceLogSent may be updated within up to 2 minutes

learning question for me: where does the 2-minute maximum allowed time come from? :D

if c.retriever.fleetNotification.ResendWait > 0 {
dontResendDuration = time.Duration(c.retriever.fleetNotification.ResendWait) * time.Hour
} else {
dontResendDuration = time.Duration(3) * time.Minute


thought: if this is the default resendWait time, is 3 minutes a bit low?
Not sure IIUC: if an alert fires every 4 minutes, will this resend every time?
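
A small, hypothetical sketch of the resend-window check being discussed: the ResendWait field mirrors the quoted snippet, everything else is illustrative, and whether a notification is actually resent also depends on the other checks in the handler.

package handlers

import "time"

// resendWindowElapsed reports whether the "don't resend" window has passed
// since the last notification was sent. resendWaitHours plays the role of
// fleetNotification.ResendWait in the quoted snippet; when it is zero the
// snippet falls back to a 3-minute window.
func resendWindowElapsed(lastSent, now time.Time, resendWaitHours int) bool {
	dontResendDuration := 3 * time.Minute // fallback when ResendWait is not set
	if resendWaitHours > 0 {
		dontResendDuration = time.Duration(resendWaitHours) * time.Hour
	}
	return now.After(lastSent.Add(dontResendDuration))
}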
