feat: added breakfix response time metrics #714
nitz2407 wants to merge 14 commits into NVIDIA:main from
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings.
📝 Walkthrough
This PR introduces lifecycle duration metrics for health events: node quarantine duration (time from health event generation to node cordon), pod eviction duration (event receipt to pod eviction completion), and CR generation duration (event receipt to maintenance CR creation). Timestamp fields are added to track lifecycle milestones, database schemas are updated, and reconcilers are enhanced to record durations at key completion points.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@fault-quarantine/pkg/reconciler/reconciler.go`:
- Around line 798-800: The call to event.HealthEvent.GeneratedTimestamp.AsTime()
can panic when GeneratedTimestamp is nil; update the caller (before passing into
updateQuarantineMetrics) to guard against nil GeneratedTimestamp: check
event.HealthEvent.GeneratedTimestamp != nil and only call AsTime() when non-nil,
otherwise use a safe zero-value (or appropriate fallback) time.Time value and
pass that to updateQuarantineMetrics so reconciliation won't crash on
legacy/malformed events.
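The guard described above can be sketched like this; `pbTimestamp` is a dependency-free stand-in for the protobuf timestamp type, and the helper name is illustrative:

```go
package main

import (
	"fmt"
	"time"
)

// pbTimestamp is a minimal stand-in for *timestamppb.Timestamp so this sketch
// stays dependency-free; only the nil-pointer hazard matters here.
type pbTimestamp struct{ sec int64 }

func (t *pbTimestamp) AsTime() time.Time { return time.Unix(t.sec, 0).UTC() }

// safeGeneratedTime applies the suggested guard: check for nil before calling
// AsTime, and fall back to the zero time.Time instead of panicking on
// legacy or malformed events.
func safeGeneratedTime(ts *pbTimestamp) time.Time {
	if ts == nil {
		return time.Time{}
	}
	return ts.AsTime()
}

func main() {
	fmt.Println(safeGeneratedTime(nil).IsZero())                  // true: nil guarded
	fmt.Println(safeGeneratedTime(&pbTimestamp{sec: 100}).Unix()) // 100
}
```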
In `@fault-remediation/pkg/reconciler/reconciler.go`:
- Line 114: The current assignment to event.Event["_received_at"] can panic if
event.Event is nil; in the reconciler.go code ensure you guard and initialize
the map before writing: check if event.Event == nil and if so set event.Event =
make(map[string]interface{}) (or the appropriate map type used by Event) and
then assign event.Event["_received_at"] = start.Unix(); reference the
event.Event map and the start.Unix() assignment so the fix is applied at the
same location.
- Around line 243-245: The metrics code observes CR generation duration even
when healthEventWithStatus.ReceivedAt is the zero value, which yields a huge
duration; update the else branch containing crGenerationDuration.Observe(...) to
first check healthEventWithStatus.ReceivedAt.IsZero() and skip calling
crGenerationDuration.Observe(...) when true (i.e., only call Observe with
time.Since(healthEventWithStatus.ReceivedAt).Seconds() if ReceivedAt.IsZero() is
false) so metrics are not polluted by uninitialized timestamps.
In `@node-drainer/pkg/reconciler/reconciler.go`:
- Around line 519-527: The type assertion for the _received_at field
(receivedAtRaw in the event handling block) assumes int64 but json.Unmarshal
turns numbers into float64, so update the logic in the reconciler.go block that
computes evictionDuration (using receivedAtRaw, receivedAtUnix, time.Unix,
metrics.PodEvictionDuration and nodeName) to accept both float64 and int64 (and
optionally numeric strings) -- convert the value to an integer Unix seconds (or
to float seconds) before creating time.Unix and computing time.Since, log a
warning only if the type is neither supported nor convertible, and then observe
the evictionDuration with metrics.PodEvictionDuration.
🧹 Nitpick comments (2)
fault-quarantine/pkg/metrics/metrics.go (1)
192-194: Unused parameter `nodeName` in `RecordNodeCordonDuration`.
The `nodeName` parameter is accepted but never used, since the `NodeCordonDuration` histogram has no labels. Either remove the unused parameter or consider adding a `node` label to the histogram if per-node granularity is desired for this metric.
Option 1: Remove unused parameter

```diff
-func RecordNodeCordonDuration(nodeName string, generatedTimestamp time.Time) {
+func RecordNodeCordonDuration(generatedTimestamp time.Time) {
 	NodeCordonDuration.Observe(time.Since(generatedTimestamp).Seconds())
 }
```

fault-remediation/pkg/reconciler/reconciler.go (1)
538-543: Make `_received_at` parsing tolerant of numeric types.
Strictly expecting `int64` can drop the value when the map originated from JSON/BSON conversions. A simple type switch avoids a zero ReceivedAt.
♻️ Suggested refactor

```diff
-	if receivedAtRaw, ok := eventWithToken.Event["_received_at"]; ok {
-		if receivedAtUnix, ok := receivedAtRaw.(int64); ok {
-			result.ReceivedAt = time.Unix(receivedAtUnix, 0)
-		}
-	}
+	if receivedAtRaw, ok := eventWithToken.Event["_received_at"]; ok {
+		switch v := receivedAtRaw.(type) {
+		case int64:
+			result.ReceivedAt = time.Unix(v, 0)
+		case int32:
+			result.ReceivedAt = time.Unix(int64(v), 0)
+		case float64:
+			result.ReceivedAt = time.Unix(int64(v), 0)
+		}
+	}
```
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@fault-remediation/pkg/reconciler/reconciler.go`:
- Around line 522-527: The extraction of _received_at in reconciler.go currently
only handles int64 and will silently skip valid timestamps decoded as float64,
int32, json.Number or strings; update the logic in the block that reads
eventWithToken.Event["_received_at"] (used by Reconcile() and setting
result.ReceivedAt) to perform a type switch (int64, int32, int, float64,
json.Number, string) and convert each into an int64 unix seconds (parsing
json.Number or string as needed) before calling time.Unix(...,0) so
result.ReceivedAt is correctly set for those common BSON/JSON numeric encodings.
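A sketch of the suggested type switch; the function name and map shape are assumptions for illustration:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"time"
)

// parseReceivedAt converts the various encodings a decoded map may carry for
// "_received_at" (int64, int32, int, float64, json.Number, string) into a
// time.Time. It returns the zero time and false when the value is absent or
// unusable, so callers can skip metric observation instead of panicking.
func parseReceivedAt(event map[string]interface{}) (time.Time, bool) {
	raw, ok := event["_received_at"]
	if !ok {
		return time.Time{}, false
	}
	var sec int64
	switch v := raw.(type) {
	case int64:
		sec = v
	case int32:
		sec = int64(v)
	case int:
		sec = int64(v)
	case float64:
		sec = int64(v)
	case json.Number:
		n, err := v.Int64()
		if err != nil {
			return time.Time{}, false
		}
		sec = n
	case string:
		n, err := strconv.ParseInt(v, 10, 64)
		if err != nil {
			return time.Time{}, false
		}
		sec = n
	default:
		return time.Time{}, false
	}
	return time.Unix(sec, 0), true
}

func main() {
	// JSON unmarshals numbers into float64, the case the review flags.
	t, ok := parseReceivedAt(map[string]interface{}{"_received_at": float64(1700000000)})
	fmt.Println(ok, t.Unix())
}
```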
♻️ Duplicate comments (2)
fault-remediation/pkg/reconciler/reconciler.go (2)
106-106: Guard against nil `event.Event` before assignment.
Line 106 can panic if `event.Event` is nil.
🛠 Suggested fix

```diff
-	event.Event["_received_at"] = start.Unix()
+	if event.Event == nil {
+		event.Event = map[string]any{}
+	}
+	event.Event["_received_at"] = start.Unix()
```
229-231: Skip CR generation metric when `ReceivedAt` is zero.
`time.Since(time.Time{})` produces huge durations and pollutes metrics.
🛠 Suggested fix

```diff
-	metrics.CRGenerationDuration.Observe(time.Since(healthEventWithStatus.ReceivedAt).Seconds())
+	if !healthEventWithStatus.ReceivedAt.IsZero() {
+		metrics.CRGenerationDuration.Observe(time.Since(healthEventWithStatus.ReceivedAt).Seconds())
+	} else {
+		slog.Warn("ReceivedAt is zero; skipping CR generation duration metric", "node", nodeName)
+	}
```
c255847 to d598364
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@fault-remediation/pkg/events/health_event.go`:
- Around line 17-27: The ReceivedAt field in the HealthEventDoc struct is only
tagged with json:"-" but still will be serialized to MongoDB via BSON; update
the struct so that the ReceivedAt field also has bson:"-" to prevent it from
persisting. Locate HealthEventDoc and modify the ReceivedAt field's tags (the
field name ReceivedAt in file health_event.go) to include bson:"-" alongside
json:"-" so MongoDB drivers will ignore it.
♻️ Duplicate comments (1)
node-drainer/pkg/reconciler/reconciler.go (1)
519-527: Handle float64 `_received_at` to avoid dropped metrics.
If the event is read from JSONB (e.g., Postgres), numbers typically unmarshal as `float64`, so the `int64` assertion can fail and skip observation.
🛠 Suggested fix

```diff
-	if receivedAtRaw, ok := event["_received_at"]; ok {
-		if receivedAtUnix, ok := receivedAtRaw.(int64); ok {
-			receivedAt := time.Unix(receivedAtUnix, 0)
-			evictionDuration := time.Since(receivedAt).Seconds()
-			metrics.PodEvictionDuration.Observe(evictionDuration)
-		} else {
-			slog.Warn("Invalid type for _received_at timestamp", "node", nodeName)
-		}
-	}
+	if receivedAtRaw, ok := event["_received_at"]; ok {
+		var receivedAtUnix int64
+		switch v := receivedAtRaw.(type) {
+		case int64:
+			receivedAtUnix = v
+		case float64:
+			receivedAtUnix = int64(v)
+		default:
+			slog.Warn("Invalid type for _received_at timestamp", "node", nodeName, "type", fmt.Sprintf("%T", receivedAtRaw))
+		}
+		if receivedAtUnix > 0 {
+			receivedAt := time.Unix(receivedAtUnix, 0)
+			evictionDuration := time.Since(receivedAt).Seconds()
+			metrics.PodEvictionDuration.Observe(evictionDuration)
+		}
+	}
```
🧹 Nitpick comments (1)
fault-quarantine/pkg/metrics/metrics.go (1)
192-195: Consider removing the unused nodeName parameter.
It's not used by the histogram and may confuse callers unless you plan to add labels.
♻️ Optional cleanup

```diff
-func RecordNodeCordonDuration(nodeName string, generatedTimestamp time.Time) {
+func RecordNodeCordonDuration(generatedTimestamp time.Time) {
 	NodeCordonDuration.Observe(time.Since(generatedTimestamp).Seconds())
 }
```

Update call sites accordingly (e.g., in fault-quarantine/pkg/reconciler/reconciler.go).
d598364 to b57001e
Merging this branch will increase overall coverage
Coverage by file: Changed files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.
Changed unit test files
b57001e to f1d00fa
Merging this branch will increase overall coverage
Coverage by file: Changed files (no unit tests); changed unit test files
/ok to test e5bfda5
1d702c4 to 008aee9
Merging this branch will increase overall coverage
Coverage by file: Changed unit test files
008aee9 to 9e45fef
Merging this branch will increase overall coverage
Coverage by file: Changed unit test files
# Conflicts:
#	fault-remediation/pkg/reconciler/reconciler_test.go
#	node-drainer/pkg/reconciler/reconciler.go
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
store-client/pkg/datastore/providers/postgresql/health_events.go (1)
226-264: ⚠️ Potential issue | 🟠 Major: Timestamp fields not synced to JSONB document; consumers see null values.
The code writes `quarantine_finish_timestamp` and `drain_finish_timestamp` to table columns but does not propagate them into the `document` JSONB via `jsonb_set`. Since all read paths (FindHealthEventsByNode, FindHealthEventsByQuery, GetHealthEventByID) unmarshal exclusively from the JSONB `document` field and the struct fields are pointer types (`*time.Time`), missing keys unmarshal to `nil`. This affects both the non-nil branch (lines 226-264) and the else branch (lines 268-291).
The issue is most severe in `UpdateHealthEventStatusByNode` (lines 337-348), which updates only the raw columns without any JSONB synchronization.
Proposed fix for the non-nil branch

```diff
 document = jsonb_set(
 	jsonb_set(
 		jsonb_set(
 			jsonb_set(
-				document,
+				jsonb_set(
+					jsonb_set(
+						document,
+						'{healtheventstatus,quarantinefinishtimestamp}',
+						to_jsonb($2::timestamp)
+					),
+					'{healtheventstatus,drainfinishtimestamp}',
+					to_jsonb($5::timestamp)
+				),
 				'{healtheventstatus,nodequarantined}',
 				to_jsonb($1::text)
 			),
```

Apply similar changes to the else branch and `UpdateHealthEventStatusByNode`.
🤖 Fix all issues with AI agents
In `@data-models/pkg/model/health_event_extentions.go`:
- Line 47: The struct field QuarantineFinishTimestamp currently causes a linter
line-length failure; shorten the alignment whitespace before the type so the
declaration for QuarantineFinishTimestamp *time.Time
`bson:"quarantinefinishtimestamp,omitempty"
json:"quarantinefinishtimestamp,omitempty"` is <=120 chars, or if alignment
cannot be reduced without harming readability, add a nolint directive (matching
the style used for LastRemediationTimestamp) to the field tag to suppress the
linter error; locate the field by name QuarantineFinishTimestamp in
health_event_extentions.go and apply one of these fixes.
In `@store-client/pkg/client/convenience.go`:
- Around line 39-47: UpdateHealthEventNodeQuarantineStatus currently always sets
"healtheventstatus.quarantinefinishtimestamp", which lets UnQuarantined calls
overwrite the original finish time; change the function
(UpdateHealthEventNodeQuarantineStatus) to only include the
"healtheventstatus.quarantinefinishtimestamp" field in the fields map when the
new status indicates quarantine completion (e.g., status == "Quarantined"),
otherwise omit that key so un-quarantine or non-completing statuses don't
overwrite the existing timestamp.
In `@store-client/pkg/datastore/providers/postgresql/database_client.go`:
- Around line 314-315: The WHERE clause is hardcoded to "id = $N" which is
inconsistent with UpdateDocumentStatus's conditional use of "data->>'_id'" for
non-health_events tables; modify the code that builds whereClause (currently
using update.ToSQL() result and whereClause variable) to follow the same logic
as UpdateDocumentStatus: if c.tableName == "health_events" use "id = $%d"
otherwise use "data->>'_id' = $%d", and ensure the parameter index uses
len(args)+1 and the passed argument is the document id value; update any callers
accordingly so the predicate matches the table type.
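The table-dependent predicate described above can be sketched as a small helper; the function name is hypothetical, and paramIdx corresponds to len(args)+1 at the call site:

```go
package main

import "fmt"

// whereClauseFor mirrors the conditional predicate used by
// UpdateDocumentStatus as described in the review: health_events rows are
// addressed by the id column, while other tables keep the document id inside
// the JSONB payload. paramIdx is the 1-based placeholder position.
func whereClauseFor(tableName string, paramIdx int) string {
	if tableName == "health_events" {
		return fmt.Sprintf("id = $%d", paramIdx)
	}
	return fmt.Sprintf("data->>'_id' = $%d", paramIdx)
}

func main() {
	fmt.Println(whereClauseFor("health_events", 3))
	fmt.Println(whereClauseFor("maintenance_events", 3))
}
```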
In `@store-client/pkg/datastore/providers/postgresql/datastore.go`:
- Around line 402-406: The warning is misleading because ADD COLUMN IF NOT
EXISTS suppresses "already exists" errors; if db.ExecContext(ctx,
timestampColumn) returns an error it's a real failure. Replace the slog.Warn
call in the timestampColumns loop with a proper failure handling: either
return/propagate the error from the enclosing function (like the schemas path)
or at minimum log it as an error (use slog.Error) and include the error object
plus the failing SQL (timestampColumn) for debugging. Update the handler around
db.ExecContext(ctx, timestampColumn) accordingly (references: timestampColumns,
db.ExecContext, slog.Warn).
🧹 Nitpick comments (2)
store-client/pkg/client/interfaces.go (1)
27-27: Add a godoc comment for the new interface method. All other methods in `DatabaseClient` have doc comments. As per coding guidelines, exported Go functions require comments.
Suggested fix

```diff
 UpdateDocumentStatus(ctx context.Context, documentID string, statusPath string, status interface{}) error
+// UpdateDocumentStatusFields updates multiple status fields in a document in one operation.
+// Keys in fields are dot-notation paths (e.g. "healtheventstatus.nodequarantined").
 UpdateDocumentStatusFields(ctx context.Context, documentID string, fields map[string]interface{}) error
```

As per coding guidelines: "Function comments required for all exported Go functions".
store-client/pkg/datastore/providers/postgresql/health_events.go (1)
337-358: `UpdateHealthEventStatusByNode` also lacks JSONB sync for all fields (pre-existing + new). This function updates only the table columns and does not touch the `document` JSONB at all. While this is a pre-existing gap, the two new timestamp columns (`quarantine_finish_timestamp`, `drain_finish_timestamp`) widen it. If any consumer reads events updated via this path and relies on the JSONB document (which all read paths do), they will see stale data. Consider adding `jsonb_set` calls here consistent with `UpdateHealthEventStatus`, or document why this function intentionally skips JSONB sync.
0f69523 to b639b31
b639b31 to 707df37
Merging this branch changes the coverage (2 decrease, 1 increase)
Coverage by file: Changed files (no unit tests); changed unit test files
f6fb17f to 6a6ef91
Merging this branch changes the coverage (1 decrease, 1 increase)
Coverage by file: Changed files (no unit tests); changed unit test files
6a6ef91 to cddfaf7
Merging this branch changes the coverage (2 decrease, 1 increase)
Coverage by file: Changed files (no unit tests); changed unit test files
cddfaf7 to 2ebb98b
Merging this branch changes the coverage (1 decrease, 3 increase)
Coverage by file: Changed files (no unit tests); changed unit test files
2ebb98b to b7e4a8c
Merging this branch changes the coverage (1 decrease, 2 increase)
Coverage by file: Changed files (no unit tests); changed unit test files
85e65b1 to b167c15
lalitadithya left a comment:
Can you please add how we tested that the metrics match manual observations in the following cases:
- pod restarts
- long drains
- cancelled breakfix
- immediate mode eviction failed
- immediate mode eviction success
- delete after timeout
I don't understand what is being tested here. In the code we call time.Since(timestamp).Seconds(), and in the test we also call time.Since(timestamp).Seconds(), so when would this fail?
This is to test CalculateDurationSeconds utility function.
But the code we are using to test it is the same as the code we are using in the function, so I'm a bit lost on how this helps.
This is just a unit test to cover this specific block.
Why did we change this? The previous values seem to be correct here.
This is just for consistency, since we are not using a manual bucket list anywhere else.
But consistency won't help here, right? How can retry attempts be less than 1? Am I missing something here?
Prometheus provides a generic default bucket list that works for both fractional and whole-number use cases. No worries, let me revert this since it is adding confusion.
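For context, client_golang's prometheus.DefBuckets spans 5ms to 10s. A dependency-free sketch (the bucket values are copied from client_golang; bucketFor is illustrative) of where whole-number retry counts land:

```go
package main

import "fmt"

// defBuckets reproduces prometheus.DefBuckets so the sketch stays
// dependency-free.
var defBuckets = []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}

// bucketFor returns the index of the first bucket whose upper bound is >= v,
// or len(buckets) for the implicit +Inf bucket.
func bucketFor(v float64, buckets []float64) int {
	for i, ub := range buckets {
		if v <= ub {
			return i
		}
	}
	return len(buckets)
}

func main() {
	// Retry attempts are whole numbers >= 1, so only the last few default
	// buckets (1, 2.5, 5, 10, +Inf) ever receive observations; the sub-second
	// buckets stay empty, which is the reviewer's point.
	fmt.Println(bucketFor(1, defBuckets), bucketFor(3, defBuckets), bucketFor(12, defBuckets))
}
```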
Shouldn't this be exported after all the processing is completed, in this case after healtheventstatus.drainfinishtimestamp has been set? Otherwise it is possible that the status is never updated in the database and the next operation doesn't start.
The intent is to emit the metric immediately after pods get evicted successfully, to measure the correct performance. If we emit it after updating node labels and the DB, then some delay might get added. Please share your thoughts; maybe I am overthinking.
I think the intent is to emit the metric after the module has completed processing, so that we know how long the module took. For example, even if the pod is evicted but ND doesn't complete the rest of the activities in MongoDB, then FR can't start. If we stop measuring after the pods are evicted, then we will have gaps in our observability.
But what if the node drained successfully and the DB update failed? Are we OK to lose this data?
Pod restarts: Verification is done by taking the delta of the log generation timestamp and the quarantinefinishtimestamp stored in the DB.
Long drains:
Cancelled breakfix:
Immediate mode eviction failed/success and delete after timeout:
Can we test pod restarts while the drainer is processing events?
Can we run some mock workloads to test? Long drains are going to be 90% of the cases; we need to make sure that longer drains show accurate numbers.
Can we run some mock/simulated user workloads to test?
Can we run some mock/simulated user workloads to test?
c64c8d4 to 8bbe4d1
# Conflicts:
#	fault-quarantine/pkg/reconciler/reconciler_e2e_test.go
#	fault-remediation/pkg/remediation/remediation.go
2626494 to 340e538
Merging this branch changes the coverage (2 decrease, 1 increase)
Coverage by file: Changed files (no unit tests); changed unit test files
Added the test results under the Testing section.
Summary
Added metrics to answer the queries below:
Testing
Here is the test environment used to test the different scenarios:
Test scenarios
1. Pod restart
Run user workload on the node
Inject XID 95
Check the Fault Quarantine logs to verify the node quarantine duration
Result:
Node quarantine duration: 2.153339486 sec
Node drainer eviction duration: 861.239015229 sec
Fault remediation cr generation duration: 0.143644378 sec
2. Already quarantined node
Result:
Node quarantine duration: 1.7857139229999999 sec
Node drainer eviction duration: 0.022892349 sec
Fault remediation cr generation duration: 0.152273362 sec
3. Delete after timeout
Result:
Node quarantine duration: 0.782192628 sec
Node drainer eviction duration: 870.80452913 sec
Fault remediation cr generation duration: 0.117803522 sec
4. Cancel breakfix
FQ logs
ND logs
Result:
Node quarantine duration: 0.810579399 sec
Node drainer eviction duration: Since the event was cancelled, no metric was emitted
Fault remediation cr generation duration: Since the event was cancelled, no metric was emitted
5. Force pod eviction mode
Result:
Node quarantine duration: 3.483826154 sec
Node drainer eviction duration: 70.551046991 sec
Fault remediation cr generation duration: 0.130705849 sec
6. Long drains
Result:
Node quarantine duration: 2.317964513 sec
Node drainer eviction duration: 1831.467014087 sec
Fault remediation cr generation duration: 0.136405048 sec
For all the above scenarios, it was verified on the Grafana dashboard that the emitted durations fall into the correct buckets.