feat: added breakfix response time metrics #714
nitz2407 wants to merge 14 commits into NVIDIA:main from
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings.
📝 Walkthrough
This PR introduces lifecycle duration metrics for health events: node quarantine duration (time from health event generation to node cordon), pod eviction duration (event receipt to pod eviction completion), and CR generation duration (event receipt to maintenance CR creation). Timestamp fields are added to track lifecycle milestones, database schemas are updated, and reconcilers are enhanced to record durations at key completion points.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@fault-quarantine/pkg/reconciler/reconciler.go`:
- Around line 798-800: The call to event.HealthEvent.GeneratedTimestamp.AsTime()
can panic when GeneratedTimestamp is nil; update the caller (before passing into
updateQuarantineMetrics) to guard against nil GeneratedTimestamp: check
event.HealthEvent.GeneratedTimestamp != nil and only call AsTime() when non-nil,
otherwise use a safe zero-value (or appropriate fallback) time.Time value and
pass that to updateQuarantineMetrics so reconciliation won't crash on
legacy/malformed events.
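The guard described above can be sketched like this; `pbTimestamp` is a dependency-free stand-in for the protobuf timestamp type, and the helper name is illustrative:

```go
package main

import (
	"fmt"
	"time"
)

// pbTimestamp is a minimal stand-in for *timestamppb.Timestamp so this sketch
// stays dependency-free; only the nil-pointer hazard matters here.
type pbTimestamp struct{ sec int64 }

func (t *pbTimestamp) AsTime() time.Time { return time.Unix(t.sec, 0).UTC() }

// safeGeneratedTime applies the suggested guard: check for nil before calling
// AsTime, and fall back to the zero time.Time instead of panicking on
// legacy or malformed events.
func safeGeneratedTime(ts *pbTimestamp) time.Time {
	if ts == nil {
		return time.Time{}
	}
	return ts.AsTime()
}

func main() {
	fmt.Println(safeGeneratedTime(nil).IsZero())                  // true: nil guarded
	fmt.Println(safeGeneratedTime(&pbTimestamp{sec: 100}).Unix()) // 100
}
```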
In `@fault-remediation/pkg/reconciler/reconciler.go`:
- Line 114: The current assignment to event.Event["_received_at"] can panic if
event.Event is nil; in the reconciler.go code ensure you guard and initialize
the map before writing: check if event.Event == nil and if so set event.Event =
make(map[string]interface{}) (or the appropriate map type used by Event) and
then assign event.Event["_received_at"] = start.Unix(); reference the
event.Event map and the start.Unix() assignment so the fix is applied at the
same location.
- Around line 243-245: The metrics code observes CR generation duration even
when healthEventWithStatus.ReceivedAt is the zero value, which yields a huge
duration; update the else branch containing crGenerationDuration.Observe(...) to
first check healthEventWithStatus.ReceivedAt.IsZero() and skip calling
crGenerationDuration.Observe(...) when true (i.e., only call Observe with
time.Since(healthEventWithStatus.ReceivedAt).Seconds() if ReceivedAt.IsZero() is
false) so metrics are not polluted by uninitialized timestamps.
In `@node-drainer/pkg/reconciler/reconciler.go`:
- Around line 519-527: The type assertion for the _received_at field
(receivedAtRaw in the event handling block) assumes int64 but json.Unmarshal
turns numbers into float64, so update the logic in the reconciler.go block that
computes evictionDuration (using receivedAtRaw, receivedAtUnix, time.Unix,
metrics.PodEvictionDuration and nodeName) to accept both float64 and int64 (and
optionally numeric strings) -- convert the value to an integer Unix seconds (or
to float seconds) before creating time.Unix and computing time.Since, log a
warning only if the type is neither supported nor convertible, and then observe
the evictionDuration with metrics.PodEvictionDuration.
🧹 Nitpick comments (2)
fault-quarantine/pkg/metrics/metrics.go (1)
192-194: Unused parameter `nodeName` in `RecordNodeCordonDuration`.
The `nodeName` parameter is accepted but never used, since the `NodeCordonDuration` histogram has no labels. Either remove the unused parameter or consider adding a `node` label to the histogram if per-node granularity is desired for this metric.
Option 1: Remove unused parameter

```diff
-func RecordNodeCordonDuration(nodeName string, generatedTimestamp time.Time) {
+func RecordNodeCordonDuration(generatedTimestamp time.Time) {
 	NodeCordonDuration.Observe(time.Since(generatedTimestamp).Seconds())
 }
```

fault-remediation/pkg/reconciler/reconciler.go (1)
538-543: Make `_received_at` parsing tolerant of numeric types.
Strictly expecting `int64` can drop the value when the map originated from JSON/BSON conversions. A simple type switch avoids a zero ReceivedAt.
♻️ Suggested refactor

```diff
-	if receivedAtRaw, ok := eventWithToken.Event["_received_at"]; ok {
-		if receivedAtUnix, ok := receivedAtRaw.(int64); ok {
-			result.ReceivedAt = time.Unix(receivedAtUnix, 0)
-		}
-	}
+	if receivedAtRaw, ok := eventWithToken.Event["_received_at"]; ok {
+		switch v := receivedAtRaw.(type) {
+		case int64:
+			result.ReceivedAt = time.Unix(v, 0)
+		case int32:
+			result.ReceivedAt = time.Unix(int64(v), 0)
+		case float64:
+			result.ReceivedAt = time.Unix(int64(v), 0)
+		}
+	}
```
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@fault-remediation/pkg/reconciler/reconciler.go`:
- Around line 522-527: The extraction of _received_at in reconciler.go currently
only handles int64 and will silently skip valid timestamps decoded as float64,
int32, json.Number or strings; update the logic in the block that reads
eventWithToken.Event["_received_at"] (used by Reconcile() and setting
result.ReceivedAt) to perform a type switch (int64, int32, int, float64,
json.Number, string) and convert each into an int64 unix seconds (parsing
json.Number or string as needed) before calling time.Unix(...,0) so
result.ReceivedAt is correctly set for those common BSON/JSON numeric encodings.
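A sketch of the suggested type switch; the function name and map shape are assumptions for illustration:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"time"
)

// parseReceivedAt converts the various encodings a decoded map may carry for
// "_received_at" (int64, int32, int, float64, json.Number, string) into a
// time.Time. It returns the zero time and false when the value is absent or
// unusable, so callers can skip metric observation instead of panicking.
func parseReceivedAt(event map[string]interface{}) (time.Time, bool) {
	raw, ok := event["_received_at"]
	if !ok {
		return time.Time{}, false
	}
	var sec int64
	switch v := raw.(type) {
	case int64:
		sec = v
	case int32:
		sec = int64(v)
	case int:
		sec = int64(v)
	case float64:
		sec = int64(v)
	case json.Number:
		n, err := v.Int64()
		if err != nil {
			return time.Time{}, false
		}
		sec = n
	case string:
		n, err := strconv.ParseInt(v, 10, 64)
		if err != nil {
			return time.Time{}, false
		}
		sec = n
	default:
		return time.Time{}, false
	}
	return time.Unix(sec, 0), true
}

func main() {
	// JSON unmarshals numbers into float64, the case the review flags.
	t, ok := parseReceivedAt(map[string]interface{}{"_received_at": float64(1700000000)})
	fmt.Println(ok, t.Unix())
}
```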
♻️ Duplicate comments (2)
fault-remediation/pkg/reconciler/reconciler.go (2)
106-106: Guard against nil `event.Event` before assignment.
Line 106 can panic if `event.Event` is nil.
🛠 Suggested fix

```diff
-	event.Event["_received_at"] = start.Unix()
+	if event.Event == nil {
+		event.Event = map[string]any{}
+	}
+	event.Event["_received_at"] = start.Unix()
```
229-231: Skip CR generation metric when `ReceivedAt` is zero.
`time.Since(time.Time{})` produces huge durations and pollutes metrics.
🛠 Suggested fix

```diff
-	metrics.CRGenerationDuration.Observe(time.Since(healthEventWithStatus.ReceivedAt).Seconds())
+	if !healthEventWithStatus.ReceivedAt.IsZero() {
+		metrics.CRGenerationDuration.Observe(time.Since(healthEventWithStatus.ReceivedAt).Seconds())
+	} else {
+		slog.Warn("ReceivedAt is zero; skipping CR generation duration metric", "node", nodeName)
+	}
```
c255847 to d598364
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@fault-remediation/pkg/events/health_event.go`:
- Around line 17-27: The ReceivedAt field in the HealthEventDoc struct is only
tagged with json:"-" but still will be serialized to MongoDB via BSON; update
the struct so that the ReceivedAt field also has bson:"-" to prevent it from
persisting. Locate HealthEventDoc and modify the ReceivedAt field's tags (the
field name ReceivedAt in file health_event.go) to include bson:"-" alongside
json:"-" so MongoDB drivers will ignore it.
♻️ Duplicate comments (1)
node-drainer/pkg/reconciler/reconciler.go (1)
519-527: Handle float64 `_received_at` to avoid dropped metrics.
If the event is read from JSONB (e.g., Postgres), numbers typically unmarshal as `float64`, so the `int64` assertion can fail and skip observation.
🛠 Suggested fix

```diff
-	if receivedAtRaw, ok := event["_received_at"]; ok {
-		if receivedAtUnix, ok := receivedAtRaw.(int64); ok {
-			receivedAt := time.Unix(receivedAtUnix, 0)
-			evictionDuration := time.Since(receivedAt).Seconds()
-			metrics.PodEvictionDuration.Observe(evictionDuration)
-		} else {
-			slog.Warn("Invalid type for _received_at timestamp", "node", nodeName)
-		}
-	}
+	if receivedAtRaw, ok := event["_received_at"]; ok {
+		var receivedAtUnix int64
+		switch v := receivedAtRaw.(type) {
+		case int64:
+			receivedAtUnix = v
+		case float64:
+			receivedAtUnix = int64(v)
+		default:
+			slog.Warn("Invalid type for _received_at timestamp", "node", nodeName, "type", fmt.Sprintf("%T", receivedAtRaw))
+		}
+		if receivedAtUnix > 0 {
+			receivedAt := time.Unix(receivedAtUnix, 0)
+			evictionDuration := time.Since(receivedAt).Seconds()
+			metrics.PodEvictionDuration.Observe(evictionDuration)
+		}
+	}
```
🧹 Nitpick comments (1)
fault-quarantine/pkg/metrics/metrics.go (1)
192-195: Consider removing the unused nodeName parameter.
It's not used by the histogram and may confuse callers unless you plan to add labels.
♻️ Optional cleanup

```diff
-func RecordNodeCordonDuration(nodeName string, generatedTimestamp time.Time) {
+func RecordNodeCordonDuration(generatedTimestamp time.Time) {
 	NodeCordonDuration.Observe(time.Since(generatedTimestamp).Seconds())
 }
```

Update call sites accordingly (e.g., in fault-quarantine/pkg/reconciler/reconciler.go).
d598364 to b57001e
Merging this branch will increase overall coverage
Coverage by file: Changed files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.
Changed unit test files
b57001e to f1d00fa
Merging this branch will increase overall coverage
Coverage by file: Changed files (no unit tests); changed unit test files
/ok to test e5bfda5
1d702c4 to 008aee9
Merging this branch will increase overall coverage
Coverage by file: Changed unit test files
008aee9 to 9e45fef
Merging this branch will increase overall coverage
Coverage by file: Changed unit test files
# Conflicts:
#	fault-remediation/pkg/reconciler/reconciler_test.go
#	node-drainer/pkg/reconciler/reconciler.go
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
store-client/pkg/datastore/providers/postgresql/health_events.go (1)
226-264: ⚠️ Potential issue | 🟠 Major: Timestamp fields not synced to JSONB document; consumers see null values.
The code writes `quarantine_finish_timestamp` and `drain_finish_timestamp` to table columns but does not propagate them into the `document` JSONB via `jsonb_set`. Since all read paths (FindHealthEventsByNode, FindHealthEventsByQuery, GetHealthEventByID) unmarshal exclusively from the JSONB `document` field and the struct fields are pointer types (`*time.Time`), missing keys unmarshal to `nil`. This affects both the non-nil branch (lines 226-264) and the else branch (lines 268-291).
The issue is most severe in `UpdateHealthEventStatusByNode` (lines 337-348), which updates only the raw columns without any JSONB synchronization.
Proposed fix for the non-nil branch

```diff
 document = jsonb_set(
 	jsonb_set(
 		jsonb_set(
 			jsonb_set(
-				document,
+				jsonb_set(
+					jsonb_set(
+						document,
+						'{healtheventstatus,quarantinefinishtimestamp}',
+						to_jsonb($2::timestamp)
+					),
+					'{healtheventstatus,drainfinishtimestamp}',
+					to_jsonb($5::timestamp)
+				),
 				'{healtheventstatus,nodequarantined}',
 				to_jsonb($1::text)
 			),
```

Apply similar changes to the else branch and `UpdateHealthEventStatusByNode`.
🤖 Fix all issues with AI agents
In `@data-models/pkg/model/health_event_extentions.go`:
- Line 47: The struct field QuarantineFinishTimestamp currently causes a linter
line-length failure; shorten the alignment whitespace before the type so the
declaration for QuarantineFinishTimestamp *time.Time
`bson:"quarantinefinishtimestamp,omitempty"
json:"quarantinefinishtimestamp,omitempty"` is <=120 chars, or if alignment
cannot be reduced without harming readability, add a nolint directive (matching
the style used for LastRemediationTimestamp) to the field tag to suppress the
linter error; locate the field by name QuarantineFinishTimestamp in
health_event_extentions.go and apply one of these fixes.
In `@store-client/pkg/client/convenience.go`:
- Around line 39-47: UpdateHealthEventNodeQuarantineStatus currently always sets
"healtheventstatus.quarantinefinishtimestamp", which lets UnQuarantined calls
overwrite the original finish time; change the function
(UpdateHealthEventNodeQuarantineStatus) to only include the
"healtheventstatus.quarantinefinishtimestamp" field in the fields map when the
new status indicates quarantine completion (e.g., status == "Quarantined"),
otherwise omit that key so un-quarantine or non-completing statuses don't
overwrite the existing timestamp.
In `@store-client/pkg/datastore/providers/postgresql/database_client.go`:
- Around line 314-315: The WHERE clause is hardcoded to "id = $N" which is
inconsistent with UpdateDocumentStatus's conditional use of "data->>'_id'" for
non-health_events tables; modify the code that builds whereClause (currently
using update.ToSQL() result and whereClause variable) to follow the same logic
as UpdateDocumentStatus: if c.tableName == "health_events" use "id = $%d"
otherwise use "data->>'_id' = $%d", and ensure the parameter index uses
len(args)+1 and the passed argument is the document id value; update any callers
accordingly so the predicate matches the table type.
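The table-dependent predicate described above can be sketched as a small helper; the function name is hypothetical, and paramIdx corresponds to len(args)+1 at the call site:

```go
package main

import "fmt"

// whereClauseFor mirrors the conditional predicate used by
// UpdateDocumentStatus as described in the review: health_events rows are
// addressed by the id column, while other tables keep the document id inside
// the JSONB payload. paramIdx is the 1-based placeholder position.
func whereClauseFor(tableName string, paramIdx int) string {
	if tableName == "health_events" {
		return fmt.Sprintf("id = $%d", paramIdx)
	}
	return fmt.Sprintf("data->>'_id' = $%d", paramIdx)
}

func main() {
	fmt.Println(whereClauseFor("health_events", 3))
	fmt.Println(whereClauseFor("maintenance_events", 3))
}
```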
In `@store-client/pkg/datastore/providers/postgresql/datastore.go`:
- Around line 402-406: The warning is misleading because ADD COLUMN IF NOT
EXISTS suppresses "already exists" errors; if db.ExecContext(ctx,
timestampColumn) returns an error it's a real failure. Replace the slog.Warn
call in the timestampColumns loop with a proper failure handling: either
return/propagate the error from the enclosing function (like the schemas path)
or at minimum log it as an error (use slog.Error) and include the error object
plus the failing SQL (timestampColumn) for debugging. Update the handler around
db.ExecContext(ctx, timestampColumn) accordingly (references: timestampColumns,
db.ExecContext, slog.Warn).
🧹 Nitpick comments (2)
store-client/pkg/client/interfaces.go (1)
27-27: Add a godoc comment for the new interface method. All other methods in `DatabaseClient` have doc comments. As per coding guidelines, exported Go functions require comments.
Suggested fix

```diff
 UpdateDocumentStatus(ctx context.Context, documentID string, statusPath string, status interface{}) error
+// UpdateDocumentStatusFields updates multiple status fields in a document in one operation.
+// Keys in fields are dot-notation paths (e.g. "healtheventstatus.nodequarantined").
 UpdateDocumentStatusFields(ctx context.Context, documentID string, fields map[string]interface{}) error
```

As per coding guidelines: "Function comments required for all exported Go functions".
store-client/pkg/datastore/providers/postgresql/health_events.go (1)
337-358: `UpdateHealthEventStatusByNode` also lacks JSONB sync for all fields (pre-existing + new). This function updates only the table columns and does not touch the `document` JSONB at all. While this is a pre-existing gap, the two new timestamp columns (`quarantine_finish_timestamp`, `drain_finish_timestamp`) widen it. If any consumer reads events updated via this path and relies on the JSONB document (which all read paths do), they will see stale data. Consider adding `jsonb_set` calls here consistent with `UpdateHealthEventStatus`, or document why this function intentionally skips JSONB sync.
0f69523 to b639b31
b639b31 to 707df37
Merging this branch changes the coverage (2 decrease, 1 increase)
Coverage by file: Changed files (no unit tests); changed unit test files
f6fb17f to 6a6ef91
Merging this branch changes the coverage (1 decrease, 1 increase)
Coverage by file: Changed files (no unit tests); changed unit test files
6a6ef91 to cddfaf7
Merging this branch changes the coverage (2 decrease, 1 increase)
Coverage by file: Changed files (no unit tests); changed unit test files
cddfaf7 to 2ebb98b
Merging this branch changes the coverage (1 decrease, 3 increase)
Coverage by file: Changed files (no unit tests); changed unit test files
2ebb98b to b7e4a8c
Merging this branch changes the coverage (1 decrease, 2 increase)
Coverage by file: Changed files (no unit tests); changed unit test files
85e65b1 to b167c15
lalitadithya left a comment:
Can you please add how we tested that the metrics match manual observations in the following cases:
- pod restarts
- long drains
- cancelled breakfix
- immediate mode eviction failed
- immediate mode eviction success
- delete after timeout
I don't understand what is being tested here. In the code we call time.Since(timestamp).Seconds(), and in the test we also call time.Since(timestamp).Seconds(), so when would this fail?
This is to test CalculateDurationSeconds utility function.
But the code we are using to test it is the same as the code we are using in the function, so I'm a bit lost on how this helps.
This is just a unit test to cover this specific block.
Why did we change this? The previous values seem to be correct here.
This is just for consistency, since we are not using a manual bucket list anywhere else.
But consistency won't help here, right? How can retry attempts be less than 1? Am I missing something here?
Prometheus provides a generic default bucket list that works for both fractional and whole-number use cases. No worries, let me revert this since it is adding confusion.
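For context, client_golang's prometheus.DefBuckets spans 5ms to 10s. A dependency-free sketch (the bucket values are copied from client_golang; bucketFor is illustrative) of where whole-number retry counts land:

```go
package main

import "fmt"

// defBuckets reproduces prometheus.DefBuckets so the sketch stays
// dependency-free.
var defBuckets = []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}

// bucketFor returns the index of the first bucket whose upper bound is >= v,
// or len(buckets) for the implicit +Inf bucket.
func bucketFor(v float64, buckets []float64) int {
	for i, ub := range buckets {
		if v <= ub {
			return i
		}
	}
	return len(buckets)
}

func main() {
	// Retry attempts are whole numbers >= 1, so only the last few default
	// buckets (1, 2.5, 5, 10, +Inf) ever receive observations; the sub-second
	// buckets stay empty, which is the reviewer's point.
	fmt.Println(bucketFor(1, defBuckets), bucketFor(3, defBuckets), bucketFor(12, defBuckets))
}
```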
Shouldn't this be exported after all the processing is completed, in this case after healtheventstatus.drainfinishtimestamp has been set? Otherwise it is possible that the status is never updated in the database and the next operation doesn't start.
The intent is to emit the metric immediately after pods get evicted successfully, to measure the correct performance. If we emit it after updating node labels and the DB, then some delay might get added. Please share your thoughts; maybe I am overthinking.
I think the intent is to emit the metric after the module has completed processing, so that we know how long the module took. For example, even if the pod is evicted but ND doesn't complete the rest of the activities in MongoDB, then FR can't start. If we stop measuring after the pods are evicted, then we will have gaps in our observability.
But what if the node drained successfully and the DB update failed? Are we OK to lose this data?
Pod restarts: Verification is done by taking the delta of the log generation timestamp and the quarantinefinishtimestamp stored in the DB.
Long drains:
Cancelled breakfix:
Immediate mode eviction failed/success and delete after timeout:
Can we test pod restarts while the drainer is processing events?
Can we run some mock workloads to test? Long drains are going to be 90% of the cases; we need to make sure that longer drains show accurate numbers.
Can we run some mock/simulated user workloads to test?
Can we run some mock/simulated user workloads to test?
c64c8d4 to 8bbe4d1
# Conflicts:
#	fault-quarantine/pkg/reconciler/reconciler_e2e_test.go
#	fault-remediation/pkg/remediation/remediation.go
2626494 to 340e538
Merging this branch changes the coverage (2 decrease, 1 increase)
Coverage by file: Changed files (no unit tests); changed unit test files
Added the test results under the Testing section.
Summary
Added metrics to answer the queries below:
Testing
Here is the test environment used to test the different scenarios:
Test scenarios
1. Pod restart
Run user workload on the node
Inject XID 95
Check the Fault Quarantine logs to verify the node quarantine duration
Result:
Node quarantine duration: 2.153339486 sec
Node drainer eviction duration: 861.239015229 sec
Fault remediation cr generation duration: 0.143644378 sec
2. Already quarantined node
Result:
Node quarantine duration: 1.7857139229999999 sec
Node drainer eviction duration: 0.022892349 sec
Fault remediation cr generation duration: 0.152273362 sec
3. Delete after timeout
Result:
Node quarantine duration: 0.782192628 sec
Node drainer eviction duration: 870.80452913 sec
Fault remediation cr generation duration: 0.117803522 sec
4. Cancel breakfix
FQ logs
ND logs
Result:
Node quarantine duration: 0.810579399 sec
Node drainer eviction duration: Since the event was cancelled, no metric was emitted
Fault remediation cr generation duration: Since the event was cancelled, no metric was emitted
5. Force pod eviction mode
Result:
Node quarantine duration: 3.483826154 sec
Node drainer eviction duration: 70.551046991 sec
Fault remediation cr generation duration: 0.130705849 sec
6. Long drains
Result:
Node quarantine duration: 2.317964513 sec
Node drainer eviction duration: 1831.467014087 sec
Fault remediation cr generation duration: 0.136405048 sec
For all the above scenarios, it was verified on the Grafana dashboard that the emitted durations fall into the correct buckets.