feat: added breakfix response time metrics #714

Open
nitz2407 wants to merge 14 commits into NVIDIA:main from nitz2407:nitijain/HIPPO-2400

Conversation

nitz2407 commented Jan 20, 2026

Summary

Added metrics to answer the queries below (a sketch of the corresponding mean queries follows this list):

  • What is the mean time to remediate? (Emit metrics from the Janitor module)
  • What is the mean time to quarantine? (Emit metrics from the fault quarantine module)
  • What is the mean amount of time spent waiting for user workloads to complete? (Emit metrics from the node drainer module)
  • What is the mean time for CR creation? (Emit metrics from the fault remediation module)
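
A quick sketch of how these means could be read back in Grafana: each of the new metrics is a Prometheus histogram, so the mean over a window can be derived from its _sum and _count series. The queries below are illustrative only, using the metric names that appear in the Testing section; the Janitor remediation metric is omitted because its name is not shown in this description, and the 1h window is just an example.
sum(rate(fault_quarantine_node_quarantine_duration_seconds_sum[1h])) / sum(rate(fault_quarantine_node_quarantine_duration_seconds_count[1h]))
sum(rate(node_drainer_pod_eviction_duration_seconds_sum[1h])) / sum(rate(node_drainer_pod_eviction_duration_seconds_count[1h]))
sum(rate(fault_remediation_cr_generate_duration_seconds_sum[1h])) / sum(rate(fault_remediation_cr_generate_duration_seconds_count[1h]))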

Testing

Here is the test environment used to test the different scenarios:

  • User workload pod run on the test node
apiVersion: v1
kind: Pod
metadata:
  name: test-pod-until-timeout
  namespace: test-namespace
spec:
  nodeName: aks-gpu-12493808-vmss00000l
  containers:
  - name: main
    image: busybox:1.36
    command:
    - /bin/sh
    - -c
    - "trap '' SIGTERM; echo 'Running until force-delete or node-drainer timeout'; sleep 900"
  restartPolicy: Never


  • Node drainer config
evictionTimeoutInSeconds = "60"
[[userNamespaces]]
    name = "*"
    mode = "AllowCompletion"


  • Grafana queries to verify the results
histogram_quantile(0.99, sum by(le) (rate(node_drainer_pod_eviction_duration_seconds_bucket[1m])))
histogram_quantile(0.99, sum by(le) (rate(fault_remediation_cr_generate_duration_seconds_bucket[1m])))
histogram_quantile(0.99, sum by(le) (rate(fault_quarantine_node_quarantine_duration_seconds_bucket[1m])))
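
As an extra sanity check (not part of the test run above; the 30m window is illustrative), the histograms' raw sample counts can confirm that each module is actually emitting observations:
sum(increase(fault_quarantine_node_quarantine_duration_seconds_count[30m]))
sum(increase(node_drainer_pod_eviction_duration_seconds_count[30m]))
sum(increase(fault_remediation_cr_generate_duration_seconds_count[30m]))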

Test scenarios

1. Pod restart

  • Run user workload on the node

  • Inject XID 95

  • Check the Fault quarantine logs to verify the node quarantine duration

{"time":"2026-02-18T09:00:15.752561052Z","level":"INFO","msg":"Setting annotations on node","module":"fault-quarantine","version":"dev","node":"aks-gpu-12493808-vmss00000l","annotations":{"quarantineHealthEvent":"[{\"version\":1,\"agent\":\"gpu-health-monitor\",\"componentClass\":\"GPU\",\"checkName\":\"GpuMemWatch\",\"isFatal\":true,\"message\":\"GPU had an uncontained error (XID 95) Drain the GPU and reset it or reboot the node.\",\"recommendedAction\":15,\"errorCode\":[\"DCGM_FR_UNCONTAINED_ERROR\"],\"entitiesImpacted\":[{\"entityType\":\"GPU\",\"entityValue\":\"1\"},{\"entityType\":\"PCI\",\"entityValue\":\"0002:00:00.0\"},{\"entityType\":\"GPU_UUID\",\"entityValue\":\"GPU-3d7408b2-d525-643f-5bfc-45a761045e14\"}],\"metadata\":{\"node.kubernetes.io/instance-type\":\"Standard_ND96amsr_A100_v4\",\"nvidia.com/cuda.driver-version.full\":\"570.148.08\",\"nvidia.com/cuda.driver-version.major\":\"570\",\"nvidia.com/cuda.driver-version.minor\":\"148\",\"nvidia.com/cuda.driver-version.revision\":\"08\",\"nvidia.com/cuda.runtime-version.full\":\"12.8\",\"nvidia.com/cuda.runtime-version.major\":\"12\",\"nvidia.com/cuda.runtime-version.minor\":\"8\",\"nvidia.com/gpu.product\":\"NVIDIA-A100-SXM4-80GB\",\"providerID\":\"azure:///subscriptions/397f2a8c-2c98-4127-8aed-16ebd6ca47bd/resourceGroups/mc_rg-nvs-dev1_nvs-dgxc-k8s-azr-scus-dev1_southcentralus/providers/Microsoft.Compute/virtualMachineScaleSets/aks-gpu-12493808-vmss/virtualMachines/0\",\"topology.kubernetes.io/region\":\"southcentralus\",\"topology.kubernetes.io/zone\":\"0\"},\"generatedTimestamp\":{\"seconds\":1771405213,\"nanos\":683000000},\"nodeName\":\"aks-gpu-12493808-vmss00000l\",\"processingStrategy\":1,\"id\":\"69957f9f4cf999a4e5fe6915\"}]","quarantineHealthEventIsCordoned":"True"}}
{"time":"2026-02-18T09:00:15.752585855Z","level":"INFO","msg":"Adding labels on node","module":"fault-quarantine","version":"dev","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T09:00:15.836308683Z","level":"INFO","msg":"Document updated with status","module":"fault-quarantine","version":"dev","id":"69957f9f4cf999a4e5fe6915","status":"Quarantined"}
{"time":"2026-02-18T09:00:15.836342186Z","level":"INFO","msg":"Node quarantine duration","module":"fault-quarantine","version":"dev","duration":2.153339486}
{"time":"2026-02-18T09:00:15.836360688Z","level":"INFO","msg":"Attempting to store resume token","module":"fault-quarantine","version":"dev","client":"fault-quarantine"}

  • Check the Node drainer logs to verify whether the node drain has started
{"time":"2026-02-18T09:00:46.1287337Z","level":"INFO","msg":"Checking pod completion status for AllowCompletion namespaces on node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T09:00:46.128769203Z","level":"INFO","msg":"Evaluated action for node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","action":"CheckCompletion"}
{"time":"2026-02-18T09:00:46.143497034Z","level":"INFO","msg":"Labeling node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","from":"draining","to":"draining"}
{"time":"2026-02-18T09:00:46.143523336Z","level":"INFO","msg":"No update needed for node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","label":"dgxc.nvidia.com/nvsentinel-state","value":"draining"}
{"time":"2026-02-18T09:00:46.143542938Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"prometheus-prometheus-node-exporter-p52lq","namespace":"prometheus","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T09:00:46.143550939Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"debug-ds-kxn8j","namespace":"default","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T09:00:46.166531615Z","level":"INFO","msg":"Pods still running on node, requeueing for later check","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","remainingPods":["test-namespace/test-pod-until-timeout"]}
{"time":"2026-02-18T09:00:46.166559618Z","level":"WARN","msg":"Error processing event for node (will retry)","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","attempt":3,"error":"waiting for pods to complete: 1 pods remaining"}
{"time":"2026-02-18T09:01:26.167823286Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"debug-ds-kxn8j","namespace":"default","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T09:01:26.167860789Z","level":"INFO","msg":"Checking pod completion status for AllowCompletion namespaces on node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T09:01:26.16786809Z","level":"INFO","msg":"Evaluated action for node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","action":"CheckCompletion"}
{"time":"2026-02-18T09:01:26.182014066Z","level":"INFO","msg":"Labeling node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","from":"draining","to":"draining"}
{"time":"2026-02-18T09:01:26.182044169Z","level":"INFO","msg":"No update needed for node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","label":"dgxc.nvidia.com/nvsentinel-state","value":"draining"}
{"time":"2026-02-18T09:01:26.18205997Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"debug-ds-kxn8j","namespace":"default","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T09:01:26.182082172Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"prometheus-prometheus-node-exporter-p52lq","namespace":"prometheus","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T09:01:26.203294585Z","level":"INFO","msg":"Pods still running on node, requeueing for later check","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","remainingPods":["test-namespace/test-pod-until-timeout"]}
{"time":"2026-02-18T09:01:26.203333489Z","level":"WARN","msg":"Error processing event for node (will retry)","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","attempt":4,"error":"waiting for pods to complete: 1 pods remaining"}

  • Restart Node drainer pod
nitijain@nitijain-mlt scripts % k delete pod -n nvsentinel node-drainer-57d96b4749-kcs6n
pod "node-drainer-57d96b4749-kcs6n" deleted
nitijain@nitijain-mlt scripts % k get pod -n nvsentinel | grep drain
node-drainer-57d96b4749-n4jzt         1/1     Running            0               14s

  • Check the Node drainer logs to verify that it resumes draining the pod after the restart, and note the time taken to drain it
nitijain@nitijain-mlt scripts % k logs -f -n nvsentinel node-drainer-57d96b4749-n4jzt
2026/02/18 09:02:05 INFO Registering PostgreSQL datastore provider
2026/02/18 09:02:05 INFO Registered datastore provider provider=postgresql
2026/02/18 09:02:05 INFO Registering MongoDB datastore provider
2026/02/18 09:02:05 INFO Registered datastore provider provider=mongodb
2026/02/18 09:02:05 INFO Registering MongoDB builder factory
{"time":"2026-02-18T09:02:05.692406866Z","level":"INFO","msg":"Starting node-drainer","module":"node-drainer","version":"dev","version":"dev","commit":"none","date":"unknown"}
{"time":"2026-02-18T09:02:05.692724194Z","level":"INFO","msg":"Using new certificate path","module":"node-drainer","version":"dev","resolved_path":"/etc/ssl/client-certs"}
{"time":"2026-02-18T09:02:05.692751497Z","level":"INFO","msg":"Database client cert","module":"node-drainer","version":"dev","path":"/etc/ssl/client-certs"}
{"time":"2026-02-18T09:02:05.692756097Z","level":"INFO","msg":"Starting node drainer initialization","module":"node-drainer","version":"dev"}
{"time":"2026-02-18T09:02:05.692762798Z","level":"INFO","msg":"Loading datastore config","module":"node-drainer","version":"dev","provider":"mongodb"}
{"time":"2026-02-18T09:02:05.692975317Z","level":"INFO","msg":"Running with partial drain disabled","module":"node-drainer","version":"dev"}
W0218 09:02:05.693027       1 client_config.go:682] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
{"time":"2026-02-18T09:02:05.694055813Z","level":"INFO","msg":"Successfully initialized kubernetes client","module":"node-drainer","version":"dev"}
{"time":"2026-02-18T09:02:05.694264131Z","level":"INFO","msg":"Created datastore factory","module":"node-drainer","version":"dev","providers":2}
{"time":"2026-02-18T09:02:05.694282433Z","level":"INFO","msg":"Creating datastore","module":"node-drainer","version":"dev","provider":"mongodb"}
{"time":"2026-02-18T09:02:05.694303035Z","level":"INFO","msg":"NewMongoDBDataStore TLS config check","module":"node-drainer","version":"dev","hasTLSConfig":true,"certPath":"/etc/ssl/client-certs/tls.crt","caPath":"/etc/ssl/client-certs/ca.crt"}
{"time":"2026-02-18T09:02:05.694311935Z","level":"INFO","msg":"Extracted cert directory from TLSConfig","module":"node-drainer","version":"dev","certDir":"/etc/ssl/client-certs"}
{"time":"2026-02-18T09:02:05.694335837Z","level":"INFO","msg":"Trying to read CA cert","module":"node-drainer","version":"dev","path":"/etc/ssl/client-certs/ca.crt"}
{"time":"2026-02-18T09:02:05.694370641Z","level":"INFO","msg":"Successfully read CA cert","module":"node-drainer","version":"dev"}
{"time":"2026-02-18T09:02:05.696031688Z","level":"INFO","msg":"Trying to ping database to confirm connectivity","module":"node-drainer","version":"dev","database":"HealthEventsDatabase"}
{"time":"2026-02-18T09:02:05.75257921Z","level":"INFO","msg":"Successfully pinged database to confirm connectivity","module":"node-drainer","version":"dev","database":"HealthEventsDatabase"}
{"time":"2026-02-18T09:02:05.752889738Z","level":"INFO","msg":"Confirmed that the collection exists in the database","module":"node-drainer","version":"dev","collection":"HealthEvents","database":"HealthEventsDatabase"}
{"time":"2026-02-18T09:02:05.75290994Z","level":"INFO","msg":"Trying to read CA cert","module":"node-drainer","version":"dev","path":"/etc/ssl/client-certs/ca.crt"}
{"time":"2026-02-18T09:02:05.752970045Z","level":"INFO","msg":"Successfully read CA cert","module":"node-drainer","version":"dev"}
{"time":"2026-02-18T09:02:05.753298074Z","level":"INFO","msg":"Trying to ping database to confirm connectivity","module":"node-drainer","version":"dev","database":"HealthEventsDatabase"}
{"time":"2026-02-18T09:02:05.79783523Z","level":"INFO","msg":"Successfully pinged database to confirm connectivity","module":"node-drainer","version":"dev","database":"HealthEventsDatabase"}
{"time":"2026-02-18T09:02:05.798143257Z","level":"INFO","msg":"Confirmed that the collection exists in the database","module":"node-drainer","version":"dev","collection":"HealthEvents","database":"HealthEventsDatabase"}
{"time":"2026-02-18T09:02:05.798160159Z","level":"INFO","msg":"Successfully created adapted MongoDB store","module":"node-drainer","version":"dev"}
{"time":"2026-02-18T09:02:05.798209463Z","level":"INFO","msg":"Trying to read CA cert","module":"node-drainer","version":"dev","path":"/etc/ssl/client-certs/ca.crt"}
{"time":"2026-02-18T09:02:05.798258267Z","level":"INFO","msg":"Successfully read CA cert","module":"node-drainer","version":"dev"}
{"time":"2026-02-18T09:02:05.798559594Z","level":"INFO","msg":"Trying to ping database to confirm connectivity","module":"node-drainer","version":"dev","database":"HealthEventsDatabase"}
{"time":"2026-02-18T09:02:05.850554112Z","level":"INFO","msg":"Successfully pinged database to confirm connectivity","module":"node-drainer","version":"dev","database":"HealthEventsDatabase"}
{"time":"2026-02-18T09:02:05.85087214Z","level":"INFO","msg":"Confirmed that the collection exists in the database","module":"node-drainer","version":"dev","collection":"HealthEvents","database":"HealthEventsDatabase"}
{"time":"2026-02-18T09:02:05.850885841Z","level":"INFO","msg":"Trying to ping database to confirm connectivity","module":"node-drainer","version":"dev","database":"HealthEventsDatabase"}
{"time":"2026-02-18T09:02:05.851068158Z","level":"INFO","msg":"Successfully pinged database to confirm connectivity","module":"node-drainer","version":"dev","database":"HealthEventsDatabase"}
{"time":"2026-02-18T09:02:05.851279876Z","level":"INFO","msg":"Confirmed that the collection exists in the database","module":"node-drainer","version":"dev","collection":"ResumeTokens","database":"HealthEventsDatabase"}
{"time":"2026-02-18T09:02:05.851796722Z","level":"INFO","msg":"ResumeToken found","module":"node-drainer","version":"dev","token":"zQAAAAJfZGF0YQC9AAAAODI2OTk1N0Y5RjAwMDAwMDAyMkIwNDJDMDEwMDI5NkU1QTEwMDRFNTk4NDM0QzQ2MzM0RkM1ODFENTZEMDExRkYxMTczRjQ2M0M2RjcwNjU3MjYxNzQ2OTZGNkU1NDc5NzA2NTAwM0M3NTcwNjQ2MTc0NjUwMDQ2NjQ2RjYzNzU2RDY1NkU3NDRCNjU3OTAwNDY2NDVGNjk2NDAwNjQ2OTk1N0Y5RjRDRjk5OUE0RTVGRTY5MTUwMDAwMDQAAA=="}
{"time":"2026-02-18T09:02:05.866243305Z","level":"INFO","msg":"Initialization completed successfully","module":"node-drainer","version":"dev"}
{"time":"2026-02-18T09:02:05.866277808Z","level":"INFO","msg":"Starting Kubernetes informers","module":"node-drainer","version":"dev"}
{"time":"2026-02-18T09:02:06.166778297Z","level":"INFO","msg":"Kubernetes informers started and synced","module":"node-drainer","version":"dev"}
{"time":"2026-02-18T09:02:06.166828001Z","level":"INFO","msg":"Starting queue worker","module":"node-drainer","version":"dev"}
{"time":"2026-02-18T09:02:06.166839002Z","level":"INFO","msg":"Starting workqueue processor","module":"node-drainer","version":"dev"}
{"time":"2026-02-18T09:02:06.166852104Z","level":"INFO","msg":"Handling cold start","module":"node-drainer","version":"dev"}
{"time":"2026-02-18T09:02:06.166860104Z","level":"INFO","msg":"Querying for events requiring processing","module":"node-drainer","version":"dev"}
{"time":"2026-02-18T09:02:06.649999314Z","level":"INFO","msg":"Found events to re-process","module":"node-drainer","version":"dev","count":1}
{"time":"2026-02-18T09:02:06.650594767Z","level":"INFO","msg":"Re-queued event from cold start","module":"node-drainer","version":"dev","nodeName":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T09:02:06.65062627Z","level":"INFO","msg":"Cold start processing completed","module":"node-drainer","version":"dev"}
{"time":"2026-02-18T09:02:06.65063527Z","level":"INFO","msg":"Starting database event watcher","module":"node-drainer","version":"dev"}
{"time":"2026-02-18T09:02:06.650648872Z","level":"INFO","msg":"All components started successfully","module":"node-drainer","version":"dev"}
{"time":"2026-02-18T09:02:06.650812786Z","level":"INFO","msg":"server initialized","module":"node-drainer","version":"dev","port":2112,"read_timeout":10000000000,"write_timeout":10000000000}
{"time":"2026-02-18T09:02:06.650863891Z","level":"INFO","msg":"Event watcher started, consuming events","module":"node-drainer","version":"dev"}
{"time":"2026-02-18T09:02:06.650881192Z","level":"INFO","msg":"Starting metrics server","module":"node-drainer","version":"dev"}
{"time":"2026-02-18T09:02:06.651237324Z","level":"INFO","msg":"starting server","module":"node-drainer","version":"dev","addr":":2112"}
{"time":"2026-02-18T09:02:06.651377636Z","level":"INFO","msg":"Checking pod completion status for AllowCompletion namespaces on node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T09:02:06.651410039Z","level":"INFO","msg":"Evaluated action for node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","action":"CheckCompletion"}
{"time":"2026-02-18T09:02:06.663321497Z","level":"INFO","msg":"Labeling node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","from":"draining","to":"draining"}
{"time":"2026-02-18T09:02:06.6633515Z","level":"INFO","msg":"No update needed for node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","label":"dgxc.nvidia.com/nvsentinel-state","value":"draining"}
{"time":"2026-02-18T09:02:06.663371302Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"debug-ds-kxn8j","namespace":"default","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T09:02:06.663377802Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"prometheus-prometheus-node-exporter-p52lq","namespace":"prometheus","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T09:02:06.690662725Z","level":"INFO","msg":"Pods still running on node, requeueing for later check","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","remainingPods":["test-namespace/test-pod-until-timeout"]}
{"time":"2026-02-18T09:02:06.69071403Z","level":"WARN","msg":"Error processing event for node (will retry)","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","attempt":1,"error":"waiting for pods to complete: 1 pods remaining"}
{"time":"2026-02-18T09:02:16.691354774Z","level":"INFO","msg":"Checking pod completion status for AllowCompletion namespaces on node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T09:02:16.691387177Z","level":"INFO","msg":"Evaluated action for node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","action":"CheckCompletion"}
{"time":"2026-02-18T09:02:16.706017424Z","level":"INFO","msg":"Labeling node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","from":"draining","to":"draining"}
{"time":"2026-02-18T09:02:16.706044626Z","level":"INFO","msg":"No update needed for node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","label":"dgxc.nvidia.com/nvsentinel-state","value":"draining"}
{"time":"2026-02-18T09:02:16.706063228Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"debug-ds-kxn8j","namespace":"default","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T09:02:16.706069728Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"prometheus-prometheus-node-exporter-p52lq","namespace":"prometheus","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T09:02:16.722418022Z","level":"INFO","msg":"Pods still running on node, requeueing for later check","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","remainingPods":["test-namespace/test-pod-until-timeout"]}
{"time":"2026-02-18T09:02:16.722448624Z","level":"WARN","msg":"Error processing event for node (will retry)","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","attempt":2,"error":"waiting for pods to complete: 1 pods remaining"}
{"time":"2026-02-18T09:02:36.7233877Z","level":"INFO","msg":"Checking pod completion status for AllowCompletion namespaces on node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T09:02:36.723422703Z","level":"INFO","msg":"Evaluated action for node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","action":"CheckCompletion"}
{"time":"2026-02-18T09:02:36.737882996Z","level":"INFO","msg":"Labeling node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","from":"draining","to":"draining"}
{"time":"2026-02-18T09:02:36.737913899Z","level":"INFO","msg":"No update needed for node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","label":"dgxc.nvidia.com/nvsentinel-state","value":"draining"}
{"time":"2026-02-18T09:02:36.737935001Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"debug-ds-kxn8j","namespace":"default","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T09:02:36.737942902Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"prometheus-prometheus-node-exporter-p52lq","namespace":"prometheus","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T09:02:36.751592322Z","level":"INFO","msg":"Pods still running on node, requeueing for later check","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","remainingPods":["test-namespace/test-pod-until-timeout"]}
{"time":"2026-02-18T09:02:36.751626125Z","level":"WARN","msg":"Error processing event for node (will retry)","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","attempt":3,"error":"waiting for pods to complete: 1 pods remaining"}
{"time":"2026-02-18T09:03:16.752380622Z","level":"INFO","msg":"Checking pod completion status for AllowCompletion namespaces on node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T09:03:16.752411825Z","level":"INFO","msg":"Evaluated action for node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","action":"CheckCompletion"}
{"time":"2026-02-18T09:03:16.766346981Z","level":"INFO","msg":"Labeling node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","from":"draining","to":"draining"}
{"time":"2026-02-18T09:03:16.766372584Z","level":"INFO","msg":"No update needed for node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","label":"dgxc.nvidia.com/nvsentinel-state","value":"draining"}
{"time":"2026-02-18T09:03:16.766390885Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"debug-ds-kxn8j","namespace":"default","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T09:03:16.766398086Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"prometheus-prometheus-node-exporter-p52lq","namespace":"prometheus","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T09:03:16.783643041Z","level":"INFO","msg":"Pods still running on node, requeueing for later check","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","remainingPods":["test-namespace/test-pod-until-timeout"]}
{"time":"2026-02-18T09:03:16.783673744Z","level":"WARN","msg":"Error processing event for node (will retry)","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","attempt":4,"error":"waiting for pods to complete: 1 pods remaining"}
{"time":"2026-02-18T09:04:36.784553175Z","level":"INFO","msg":"Checking pod completion status for AllowCompletion namespaces on node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T09:04:36.784581577Z","level":"INFO","msg":"Evaluated action for node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","action":"CheckCompletion"}
{"time":"2026-02-18T09:04:36.798327081Z","level":"INFO","msg":"Labeling node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","from":"draining","to":"draining"}
{"time":"2026-02-18T09:04:36.798356783Z","level":"INFO","msg":"No update needed for node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","label":"dgxc.nvidia.com/nvsentinel-state","value":"draining"}
{"time":"2026-02-18T09:04:36.798378185Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"debug-ds-kxn8j","namespace":"default","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T09:04:36.798384686Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"prometheus-prometheus-node-exporter-p52lq","namespace":"prometheus","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T09:04:36.816884605Z","level":"INFO","msg":"Pods still running on node, requeueing for later check","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","remainingPods":["test-namespace/test-pod-until-timeout"]}
{"time":"2026-02-18T09:04:36.816932009Z","level":"WARN","msg":"Error processing event for node (will retry)","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","attempt":5,"error":"waiting for pods to complete: 1 pods remaining"}
{"time":"2026-02-18T09:06:36.817969644Z","level":"INFO","msg":"Checking pod completion status for AllowCompletion namespaces on node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T09:06:36.818004647Z","level":"INFO","msg":"Evaluated action for node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","action":"CheckCompletion"}
{"time":"2026-02-18T09:06:36.830792789Z","level":"INFO","msg":"Labeling node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","from":"draining","to":"draining"}
{"time":"2026-02-18T09:06:36.830827392Z","level":"INFO","msg":"No update needed for node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","label":"dgxc.nvidia.com/nvsentinel-state","value":"draining"}
{"time":"2026-02-18T09:06:36.830850995Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"debug-ds-kxn8j","namespace":"default","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T09:06:36.830858595Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"prometheus-prometheus-node-exporter-p52lq","namespace":"prometheus","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T09:06:36.855068458Z","level":"INFO","msg":"Pods still running on node, requeueing for later check","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","remainingPods":["test-namespace/test-pod-until-timeout"]}
{"time":"2026-02-18T09:06:36.855109262Z","level":"WARN","msg":"Error processing event for node (will retry)","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","attempt":6,"error":"waiting for pods to complete: 1 pods remaining"}
{"time":"2026-02-18T09:08:36.856384539Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"debug-ds-kxn8j","namespace":"default","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T09:08:36.856413942Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"prometheus-prometheus-node-exporter-p52lq","namespace":"prometheus","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T09:08:36.856422042Z","level":"INFO","msg":"Checking pod completion status for AllowCompletion namespaces on node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T09:08:36.856427743Z","level":"INFO","msg":"Evaluated action for node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","action":"CheckCompletion"}
{"time":"2026-02-18T09:08:36.871009547Z","level":"INFO","msg":"Labeling node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","from":"draining","to":"draining"}
{"time":"2026-02-18T09:08:36.871041149Z","level":"INFO","msg":"No update needed for node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","label":"dgxc.nvidia.com/nvsentinel-state","value":"draining"}
{"time":"2026-02-18T09:08:36.871059251Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"debug-ds-kxn8j","namespace":"default","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T09:08:36.871066652Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"prometheus-prometheus-node-exporter-p52lq","namespace":"prometheus","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T09:08:36.890268568Z","level":"INFO","msg":"Pods still running on node, requeueing for later check","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","remainingPods":["test-namespace/test-pod-until-timeout"]}
{"time":"2026-02-18T09:08:36.890302171Z","level":"WARN","msg":"Error processing event for node (will retry)","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","attempt":7,"error":"waiting for pods to complete: 1 pods remaining"}
{"time":"2026-02-18T09:10:36.891282297Z","level":"INFO","msg":"Checking pod completion status for AllowCompletion namespaces on node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T09:10:36.8913186Z","level":"INFO","msg":"Evaluated action for node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","action":"CheckCompletion"}
{"time":"2026-02-18T09:10:36.906787307Z","level":"INFO","msg":"Labeling node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","from":"draining","to":"draining"}
{"time":"2026-02-18T09:10:36.906814409Z","level":"INFO","msg":"No update needed for node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","label":"dgxc.nvidia.com/nvsentinel-state","value":"draining"}
{"time":"2026-02-18T09:10:36.906835711Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"debug-ds-kxn8j","namespace":"default","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T09:10:36.906843112Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"prometheus-prometheus-node-exporter-p52lq","namespace":"prometheus","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T09:10:36.923007182Z","level":"INFO","msg":"Pods still running on node, requeueing for later check","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","remainingPods":["test-namespace/test-pod-until-timeout"]}
{"time":"2026-02-18T09:10:36.923048386Z","level":"WARN","msg":"Error processing event for node (will retry)","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","attempt":8,"error":"waiting for pods to complete: 1 pods remaining"}
{"time":"2026-02-18T09:12:36.923921272Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"prometheus-prometheus-node-exporter-p52lq","namespace":"prometheus","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T09:12:36.923957575Z","level":"INFO","msg":"Checking pod completion status for AllowCompletion namespaces on node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T09:12:36.923964876Z","level":"INFO","msg":"Evaluated action for node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","action":"CheckCompletion"}
{"time":"2026-02-18T09:12:36.936691536Z","level":"INFO","msg":"Labeling node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","from":"draining","to":"draining"}
{"time":"2026-02-18T09:12:36.93673444Z","level":"INFO","msg":"No update needed for node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","label":"dgxc.nvidia.com/nvsentinel-state","value":"draining"}
{"time":"2026-02-18T09:12:36.936753142Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"prometheus-prometheus-node-exporter-p52lq","namespace":"prometheus","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T09:12:36.936761843Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"debug-ds-kxn8j","namespace":"default","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T09:12:36.951577293Z","level":"INFO","msg":"Pods still running on node, requeueing for later check","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","remainingPods":["test-namespace/test-pod-until-timeout"]}
{"time":"2026-02-18T09:12:36.951611797Z","level":"WARN","msg":"Error processing event for node (will retry)","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","attempt":9,"error":"waiting for pods to complete: 1 pods remaining"}
{"time":"2026-02-18T09:14:36.952723834Z","level":"INFO","msg":"Ignoring completed pod %s in namespace %s on node %s (status: %s) during eviction check","module":"node-drainer","version":"dev","test-pod-until-timeout":"test-namespace","aks-gpu-12493808-vmss00000l":"Succeeded"}
{"time":"2026-02-18T09:14:36.952812642Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"debug-ds-kxn8j","namespace":"default","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T09:14:36.952822143Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"prometheus-prometheus-node-exporter-p52lq","namespace":"prometheus","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T09:14:36.952827243Z","level":"INFO","msg":"All pods evicted successfully on node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T09:14:36.952832743Z","level":"INFO","msg":"Evaluated action for node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","action":"UpdateStatus"}
{"time":"2026-02-18T09:14:36.968191887Z","level":"INFO","msg":"Labeling node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","from":"draining","to":"drain-succeeded"}
{"time":"2026-02-18T09:14:37.060907395Z","level":"INFO","msg":"Label updated successfully for node","module":"node-drainer","version":"dev","label":"dgxc.nvidia.com/nvsentinel-state","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T09:14:37.06702173Z","level":"INFO","msg":"Node drainer evictionDuration is","module":"node-drainer","version":"dev","evictionDuration":861.239015229}
{"time":"2026-02-18T09:14:37.067052132Z","level":"INFO","msg":"Health event status has been updated","module":"node-drainer","version":"dev","documentID":"69957f9f4cf999a4e5fe6915","evictionStatus":"Succeeded"}

  • Check the Fault remediation logs to verify the time taken to create the maintenance CR
{"time":"2026-02-18T09:14:37.068779484Z","level":"INFO","msg":"Reconciling Event","module":"fault-remediation","version":"dev"}
{"time":"2026-02-18T09:14:37.086218209Z","level":"INFO","msg":"Labeling node","module":"fault-remediation","version":"dev","node":"aks-gpu-12493808-vmss00000l","from":"drain-succeeded","to":"remediating"}
{"time":"2026-02-18T09:14:37.191635528Z","level":"INFO","msg":"Label updated successfully for node","module":"fault-remediation","version":"dev","label":"dgxc.nvidia.com/nvsentinel-state","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T09:14:37.191752338Z","level":"INFO","msg":"Creating maintenance CR","module":"fault-remediation","version":"dev","node":"aks-gpu-12493808-vmss00000l","template":"RESTART_VM","nodeUID":"fa82be96-fa4a-4380-ae79-ac80b894e8c8"}
{"time":"2026-02-18T09:14:37.191910352Z","level":"INFO","msg":"Added owner reference to CR for automatic garbage collection","module":"fault-remediation","version":"dev","node":"aks-gpu-12493808-vmss00000l","nodeUID":"fa82be96-fa4a-4380-ae79-ac80b894e8c8","crName":"maintenance-aks-gpu-12493808-vmss00000l-69957f9f4cf999a4e5fe6915"}
{"time":"2026-02-18T09:14:37.203650479Z","level":"INFO","msg":"Fault remediation CR generation duration","module":"fault-remediation","version":"dev","duration":0.143644378,"node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T09:14:37.203678181Z","level":"INFO","msg":"Created Maintenance CR successfully","module":"fault-remediation","version":"dev","crName":"maintenance-aks-gpu-12493808-vmss00000l-69957f9f4cf999a4e5fe6915","node":"aks-gpu-12493808-vmss00000l","template":"RESTART_VM"}

Result:

  • Node quarantine duration: 2.153339486 sec

  • Node drainer eviction duration: 861.239015229 sec

  • Fault remediation CR generation duration: 0.143644378 sec

2. Already quarantined node

  • Inject XID 95
  • As the node is already cordoned, the Fault quarantine duration should be lower because the event is processed quickly
{"time":"2026-02-18T06:34:29.021206229Z","level":"INFO","msg":"Setting annotations on node","module":"fault-quarantine","version":"dev","node":"aks-gpu-12493808-vmss00000l","annotations":{"quarantineHealthEvent":"[{\"version\":1,\"agent\":\"gpu-health-monitor\",\"componentClass\":\"GPU\",\"checkName\":\"GpuMemWatch\",\"isFatal\":true,\"message\":\"GPU had an uncontained error (XID 95) Drain the GPU and reset it or reboot the node.\",\"recommendedAction\":15,\"errorCode\":[\"DCGM_FR_UNCONTAINED_ERROR\"],\"entitiesImpacted\":[{\"entityType\":\"GPU\",\"entityValue\":\"1\"},{\"entityType\":\"PCI\",\"entityValue\":\"0002:00:00.0\"},{\"entityType\":\"GPU_UUID\",\"entityValue\":\"GPU-3d7408b2-d525-643f-5bfc-45a761045e14\"}],\"metadata\":{\"node.kubernetes.io/instance-type\":\"Standard_ND96amsr_A100_v4\",\"nvidia.com/cuda.driver-version.full\":\"570.148.08\",\"nvidia.com/cuda.driver-version.major\":\"570\",\"nvidia.com/cuda.driver-version.minor\":\"148\",\"nvidia.com/cuda.driver-version.revision\":\"08\",\"nvidia.com/cuda.runtime-version.full\":\"12.8\",\"nvidia.com/cuda.runtime-version.major\":\"12\",\"nvidia.com/cuda.runtime-version.minor\":\"8\",\"nvidia.com/gpu.product\":\"NVIDIA-A100-SXM4-80GB\",\"providerID\":\"azure:///subscriptions/397f2a8c-2c98-4127-8aed-16ebd6ca47bd/resourceGroups/mc_rg-nvs-dev1_nvs-dgxc-k8s-azr-scus-dev1_southcentralus/providers/Microsoft.Compute/virtualMachineScaleSets/aks-gpu-12493808-vmss/virtualMachines/0\",\"topology.kubernetes.io/region\":\"southcentralus\",\"topology.kubernetes.io/zone\":\"0\"},\"generatedTimestamp\":{\"seconds\":1771396468,\"nanos\":339000000},\"nodeName\":\"aks-gpu-12493808-vmss00000l\",\"processingStrategy\":1,\"id\":\"69955d741c32273a78fe6914\"}]","quarantineHealthEventIsCordoned":"True"}}
{"time":"2026-02-18T06:34:29.021231931Z","level":"INFO","msg":"Adding labels on node","module":"fault-quarantine","version":"dev","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T06:34:29.121158825Z","level":"INFO","msg":"Document updated with status","module":"fault-quarantine","version":"dev","id":"69955d741c32273a78fe6914","status":"Quarantined"}
{"time":"2026-02-18T06:34:29.121195728Z","level":"INFO","msg":"Node quarantine duration","module":"fault-quarantine","version":"dev","duration":0.782192628}
{"time":"2026-02-18T06:34:29.12121533Z","level":"INFO","msg":"Attempting to store resume token","module":"fault-quarantine","version":"dev","client":"fault-quarantine"}
{"time":"2026-02-18T07:26:20.87553057Z","level":"INFO","msg":"Evaluating NodeRuleEvaluator for node","module":"fault-quarantine","version":"dev","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T07:26:20.876457653Z","level":"INFO","msg":"Evaluating NodeRuleEvaluator for node","module":"fault-quarantine","version":"dev","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T07:26:20.88168692Z","level":"INFO","msg":"Document updated with status","module":"fault-quarantine","version":"dev","id":"6995699c4cf999a4e5fe6911","status":"AlreadyQuarantined"}
{"time":"2026-02-18T07:26:20.881716223Z","level":"INFO","msg":"Node quarantine duration","module":"fault-quarantine","version":"dev","duration":1.7857139229999999}

  • The Node drainer should also finish quickly since the pods were already drained
{"time":"2026-02-18T07:26:20.890972551Z","level":"INFO","msg":"Attempting to store resume token","module":"node-drainer","version":"dev","client":"node-drainer"}
{"time":"2026-02-18T07:26:20.891360686Z","level":"INFO","msg":"HealthEvents which are part of quarantineHealthEvent annotation","module":"node-drainer","version":"dev","eventCount":3}
{"time":"2026-02-18T07:26:20.892398578Z","level":"INFO","msg":"Full drain previously completed for node as part of old event, skipping drain","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","id":"69955d741c32273a78fe6914"}
{"time":"2026-02-18T07:26:20.892423281Z","level":"INFO","msg":"Evaluated action for node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","action":"MarkAlreadyDrained"}
{"time":"2026-02-18T07:26:20.899897549Z","level":"INFO","msg":"Node drainer evictionDuration is","module":"node-drainer","version":"dev","evictionDuration":0.022892349}
{"time":"2026-02-18T07:26:20.899923051Z","level":"INFO","msg":"Health event status has been updated","module":"node-drainer","version":"dev","documentID":"6995699c4cf999a4e5fe6911","evictionStatus":"AlreadyDrained"}


  • Verify whether the CR is generated by Fault remediation
{"time":"2026-02-18T07:26:21.02043353Z","level":"INFO","msg":"Reconciling Event","module":"fault-remediation","version":"dev"}
{"time":"2026-02-18T07:26:21.034314371Z","level":"INFO","msg":"Labeling node","module":"fault-remediation","version":"dev","node":"aks-gpu-12493808-vmss00000l","from":"remediating","to":"remediating"}
{"time":"2026-02-18T07:26:21.034347374Z","level":"INFO","msg":"No update needed for node","module":"fault-remediation","version":"dev","node":"aks-gpu-12493808-vmss00000l","label":"dgxc.nvidia.com/nvsentinel-state","value":"remediating"}
{"time":"2026-02-18T07:26:21.034441082Z","level":"INFO","msg":"Creating maintenance CR","module":"fault-remediation","version":"dev","node":"aks-gpu-12493808-vmss00000l","template":"RESTART_VM","nodeUID":"fa82be96-fa4a-4380-ae79-ac80b894e8c8"}
{"time":"2026-02-18T07:26:21.034605497Z","level":"INFO","msg":"Added owner reference to CR for automatic garbage collection","module":"fault-remediation","version":"dev","node":"aks-gpu-12493808-vmss00000l","nodeUID":"fa82be96-fa4a-4380-ae79-ac80b894e8c8","crName":"maintenance-aks-gpu-12493808-vmss00000l-6995699c4cf999a4e5fe6911"}
{"time":"2026-02-18T07:26:21.044279362Z","level":"INFO","msg":"Fault remediation CR generation duration","module":"fault-remediation","version":"dev","duration":0.152273362,"node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T07:26:21.044306965Z","level":"INFO","msg":"Created Maintenance CR successfully","module":"fault-remediation","version":"dev","crName":"maintenance-aks-gpu-12493808-vmss00000l-6995699c4cf999a4e5fe6911","node":"aks-gpu-12493808-vmss00000l","template":"RESTART_VM"}

Result:

  • Node quarantine duration: 1.7857139229999999 sec

  • Node drainer eviction duration: 0.022892349 sec

  • Fault remediation CR generation duration: 0.152273362 sec

3. Delete after timeout

  • Run the user job on the node
  • Inject XID 95
  • Check the Fault quarantine logs to verify that the node was cordoned and how long it took
{"time":"2026-02-18T06:34:28.890474493Z","level":"INFO","msg":"Handling event for ruleset","module":"fault-quarantine","version":"dev","event":{"CreatedAt":"2026-02-18T06:34:28.339Z","HealthEvent":{"version":1,"agent":"gpu-health-monitor","componentClass":"GPU","checkName":"GpuMemWatch","isFatal":true,"message":"GPU had an uncontained error (XID 95) Drain the GPU and reset it or reboot the node.","recommendedAction":15,"errorCode":["DCGM_FR_UNCONTAINED_ERROR"],"entitiesImpacted":[{"entityType":"GPU","entityValue":"1"},{"entityType":"PCI","entityValue":"0002:00:00.0"},{"entityType":"GPU_UUID","entityValue":"GPU-3d7408b2-d525-643f-5bfc-45a761045e14"}],"metadata":{"node.kubernetes.io/instance-type":"Standard_ND96amsr_A100_v4","nvidia.com/cuda.driver-version.full":"570.148.08","nvidia.com/cuda.driver-version.major":"570","nvidia.com/cuda.driver-version.minor":"148","nvidia.com/cuda.driver-version.revision":"08","nvidia.com/cuda.runtime-version.full":"12.8","nvidia.com/cuda.runtime-version.major":"12","nvidia.com/cuda.runtime-version.minor":"8","nvidia.com/gpu.product":"NVIDIA-A100-SXM4-80GB","providerID":"azure:///subscriptions/397f2a8c-2c98-4127-8aed-16ebd6ca47bd/resourceGroups/mc_rg-nvs-dev1_nvs-dgxc-k8s-azr-scus-dev1_southcentralus/providers/Microsoft.Compute/virtualMachineScaleSets/aks-gpu-12493808-vmss/virtualMachines/0","topology.kubernetes.io/region":"southcentralus","topology.kubernetes.io/zone":"0"},"generatedTimestamp":{"seconds":1771396468,"nanos":339000000},"nodeName":"aks-gpu-12493808-vmss00000l","processingStrategy":1,"id":"69955d741c32273a78fe6914"},"HealthEventStatus":{"userpodsevictionstatus":{"Status":"","Message":""}}},"ruleset":"GPU fatal error ruleset"}
{"time":"2026-02-18T06:34:28.890618506Z","level":"INFO","msg":"Evaluating NodeRuleEvaluator for node","module":"fault-quarantine","version":"dev","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T06:34:28.89156019Z","level":"INFO","msg":"Removing manual uncordon annotation from node before applying new quarantine","module":"fault-quarantine","version":"dev","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T06:34:29.021177926Z","level":"INFO","msg":"Cordoning node","module":"fault-quarantine","version":"dev","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T06:34:29.021206229Z","level":"INFO","msg":"Setting annotations on node","module":"fault-quarantine","version":"dev","node":"aks-gpu-12493808-vmss00000l","annotations":{"quarantineHealthEvent":"[{\"version\":1,\"agent\":\"gpu-health-monitor\",\"componentClass\":\"GPU\",\"checkName\":\"GpuMemWatch\",\"isFatal\":true,\"message\":\"GPU had an uncontained error (XID 95) Drain the GPU and reset it or reboot the node.\",\"recommendedAction\":15,\"errorCode\":[\"DCGM_FR_UNCONTAINED_ERROR\"],\"entitiesImpacted\":[{\"entityType\":\"GPU\",\"entityValue\":\"1\"},{\"entityType\":\"PCI\",\"entityValue\":\"0002:00:00.0\"},{\"entityType\":\"GPU_UUID\",\"entityValue\":\"GPU-3d7408b2-d525-643f-5bfc-45a761045e14\"}],\"metadata\":{\"node.kubernetes.io/instance-type\":\"Standard_ND96amsr_A100_v4\",\"nvidia.com/cuda.driver-version.full\":\"570.148.08\",\"nvidia.com/cuda.driver-version.major\":\"570\",\"nvidia.com/cuda.driver-version.minor\":\"148\",\"nvidia.com/cuda.driver-version.revision\":\"08\",\"nvidia.com/cuda.runtime-version.full\":\"12.8\",\"nvidia.com/cuda.runtime-version.major\":\"12\",\"nvidia.com/cuda.runtime-version.minor\":\"8\",\"nvidia.com/gpu.product\":\"NVIDIA-A100-SXM4-80GB\",\"providerID\":\"azure:///subscriptions/397f2a8c-2c98-4127-8aed-16ebd6ca47bd/resourceGroups/mc_rg-nvs-dev1_nvs-dgxc-k8s-azr-scus-dev1_southcentralus/providers/Microsoft.Compute/virtualMachineScaleSets/aks-gpu-12493808-vmss/virtualMachines/0\",\"topology.kubernetes.io/region\":\"southcentralus\",\"topology.kubernetes.io/zone\":\"0\"},\"generatedTimestamp\":{\"seconds\":1771396468,\"nanos\":339000000},\"nodeName\":\"aks-gpu-12493808-vmss00000l\",\"processingStrategy\":1,\"id\":\"69955d741c32273a78fe6914\"}]","quarantineHealthEventIsCordoned":"True"}}
{"time":"2026-02-18T06:34:29.021231931Z","level":"INFO","msg":"Adding labels on node","module":"fault-quarantine","version":"dev","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T06:34:29.121158825Z","level":"INFO","msg":"Document updated with status","module":"fault-quarantine","version":"dev","id":"69955d741c32273a78fe6914","status":"Quarantined"}
{"time":"2026-02-18T06:34:29.121195728Z","level":"INFO","msg":"Node quarantine duration","module":"fault-quarantine","version":"dev","duration":0.782192628}
{"time":"2026-02-18T06:34:29.12121533Z","level":"INFO","msg":"Attempting to store resume token","module":"fault-quarantine","version":"dev","client":"fault-quarantine"}

  • Check the Node drainer logs to verify that the node gets drained after the timeout
{"time":"2026-02-18T06:46:59.820471017Z","level":"INFO","msg":"Pods still running on node, requeueing for later check","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","remainingPods":["test-namespace/test-pod-until-timeout"]}
{"time":"2026-02-18T06:46:59.820507821Z","level":"WARN","msg":"Error processing event for node (will retry)","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","attempt":10,"error":"waiting for pods to complete: 1 pods remaining"}
{"time":"2026-02-18T06:48:59.822208178Z","level":"INFO","msg":"Ignoring completed pod %s in namespace %s on node %s (status: %s) during eviction check","module":"node-drainer","version":"dev","test-pod-until-timeout":"test-namespace","aks-gpu-12493808-vmss00000l":"Succeeded"}
{"time":"2026-02-18T06:48:59.822294186Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"prometheus-prometheus-node-exporter-p52lq","namespace":"prometheus","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T06:48:59.822303387Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"debug-ds-kxn8j","namespace":"default","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T06:48:59.822307887Z","level":"INFO","msg":"All pods evicted successfully on node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T06:48:59.822313488Z","level":"INFO","msg":"Evaluated action for node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","action":"UpdateStatus"}
{"time":"2026-02-18T06:48:59.836890383Z","level":"INFO","msg":"Labeling node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","from":"draining","to":"drain-succeeded"}
{"time":"2026-02-18T06:48:59.913799321Z","level":"INFO","msg":"Label updated successfully for node","module":"node-drainer","version":"dev","label":"dgxc.nvidia.com/nvsentinel-state","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T06:48:59.919534531Z","level":"INFO","msg":"Node drainer evictionDuration is","module":"node-drainer","version":"dev","evictionDuration":870.80452913}
{"time":"2026-02-18T06:48:59.919559133Z","level":"INFO","msg":"Health event status has been updated","module":"node-drainer","version":"dev","documentID":"69955d741c32273a78fe6914","evictionStatus":"Succeeded"}

  • Check the Fault remediation logs to verify that the remediation CR gets generated
{"time":"2026-02-18T06:48:59.921232081Z","level":"INFO","msg":"Reconciling Event","module":"fault-remediation","version":"dev"}
{"time":"2026-02-18T06:48:59.938151686Z","level":"INFO","msg":"Labeling node","module":"fault-remediation","version":"dev","node":"aks-gpu-12493808-vmss00000l","from":"drain-succeeded","to":"remediating"}
{"time":"2026-02-18T06:49:00.015312745Z","level":"INFO","msg":"Label updated successfully for node","module":"fault-remediation","version":"dev","label":"dgxc.nvidia.com/nvsentinel-state","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T06:49:00.015436756Z","level":"INFO","msg":"Creating maintenance CR","module":"fault-remediation","version":"dev","node":"aks-gpu-12493808-vmss00000l","template":"RESTART_VM","nodeUID":"fa82be96-fa4a-4380-ae79-ac80b894e8c8"}
{"time":"2026-02-18T06:49:00.015599171Z","level":"INFO","msg":"Added owner reference to CR for automatic garbage collection","module":"fault-remediation","version":"dev","node":"aks-gpu-12493808-vmss00000l","nodeUID":"fa82be96-fa4a-4380-ae79-ac80b894e8c8","crName":"maintenance-aks-gpu-12493808-vmss00000l-69955d741c32273a78fe6914"}
{"time":"2026-02-18T06:49:00.030816024Z","level":"INFO","msg":"Fault remediation CR generation duration","module":"fault-remediation","version":"dev","duration":0.117803522,"node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T06:49:00.030868328Z","level":"INFO","msg":"Created Maintenance CR successfully","module":"fault-remediation","version":"dev","crName":"maintenance-aks-gpu-12493808-vmss00000l-69955d741c32273a78fe6914","node":"aks-gpu-12493808-vmss00000l","template":"RESTART_VM"}

Result:

  • Node quarantine duration: 0.782192628 sec

  • Node drainer eviction duration: 870.80452913 sec

  • Fault remediation CR generation duration: 0.117803522 sec

4. Cancel breakfix

  • Run the user job on a GPU node
  • Inject XID 95
  • Verify that the node gets cordoned
{"time":"2026-02-18T08:11:02.865202189Z","level":"INFO","msg":"Setting annotations on node","module":"fault-quarantine","version":"dev","node":"aks-gpu-12493808-vmss00000l","annotations":{"quarantineHealthEvent":"[{\"version\":1,\"agent\":\"gpu-health-monitor\",\"componentClass\":\"GPU\",\"checkName\":\"GpuMemWatch\",\"isFatal\":true,\"message\":\"GPU had an uncontained error (XID 95) Drain the GPU and reset it or reboot the node.\",\"recommendedAction\":15,\"errorCode\":[\"DCGM_FR_UNCONTAINED_ERROR\"],\"entitiesImpacted\":[{\"entityType\":\"GPU\",\"entityValue\":\"1\"},{\"entityType\":\"PCI\",\"entityValue\":\"0002:00:00.0\"},{\"entityType\":\"GPU_UUID\",\"entityValue\":\"GPU-3d7408b2-d525-643f-5bfc-45a761045e14\"}],\"metadata\":{\"node.kubernetes.io/instance-type\":\"Standard_ND96amsr_A100_v4\",\"nvidia.com/cuda.driver-version.full\":\"570.148.08\",\"nvidia.com/cuda.driver-version.major\":\"570\",\"nvidia.com/cuda.driver-version.minor\":\"148\",\"nvidia.com/cuda.driver-version.revision\":\"08\",\"nvidia.com/cuda.runtime-version.full\":\"12.8\",\"nvidia.com/cuda.runtime-version.major\":\"12\",\"nvidia.com/cuda.runtime-version.minor\":\"8\",\"nvidia.com/gpu.product\":\"NVIDIA-A100-SXM4-80GB\",\"providerID\":\"azure:///subscriptions/397f2a8c-2c98-4127-8aed-16ebd6ca47bd/resourceGroups/mc_rg-nvs-dev1_nvs-dgxc-k8s-azr-scus-dev1_southcentralus/providers/Microsoft.Compute/virtualMachineScaleSets/aks-gpu-12493808-vmss/virtualMachines/0\",\"topology.kubernetes.io/region\":\"southcentralus\",\"topology.kubernetes.io/zone\":\"0\"},\"generatedTimestamp\":{\"seconds\":1771402262,\"nanos\":152000000},\"nodeName\":\"aks-gpu-12493808-vmss00000l\",\"processingStrategy\":1,\"id\":\"699574164cf999a4e5fe6913\"}]","quarantineHealthEventIsCordoned":"True"}}
{"time":"2026-02-18T08:11:02.865227292Z","level":"INFO","msg":"Adding labels on node","module":"fault-quarantine","version":"dev","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T08:11:02.962546096Z","level":"INFO","msg":"Document updated with status","module":"fault-quarantine","version":"dev","id":"699574164cf999a4e5fe6913","status":"Quarantined"}
{"time":"2026-02-18T08:11:02.962581999Z","level":"INFO","msg":"Node quarantine duration","module":"fault-quarantine","version":"dev","duration":0.810579399}

  • Check node drainer logs to verify that draining is in progress

{"time":"2026-02-18T08:11:33.346501259Z","level":"INFO","msg":"Pods still running on node, requeueing for later check","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","remainingPods":["test-namespace/test-pod-until-timeout"]}
{"time":"2026-02-18T08:11:33.346542763Z","level":"WARN","msg":"Error processing event for node (will retry)","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","attempt":3,"error":"waiting for pods to complete: 1 pods remaining"}
{"time":"2026-02-18T08:12:13.34802909Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"debug-ds-kxn8j","namespace":"default","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T08:12:13.348072594Z","level":"INFO","msg":"Checking pod completion status for AllowCompletion namespaces on node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T08:12:13.348082895Z","level":"INFO","msg":"Evaluated action for node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","action":"CheckCompletion"}
{"time":"2026-02-18T08:12:13.361578905Z","level":"INFO","msg":"Labeling node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","from":"draining","to":"draining"}
{"time":"2026-02-18T08:12:13.361613909Z","level":"INFO","msg":"No update needed for node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","label":"dgxc.nvidia.com/nvsentinel-state","value":"draining"}
{"time":"2026-02-18T08:12:13.36163291Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"debug-ds-kxn8j","namespace":"default","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T08:12:13.361642611Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"prometheus-prometheus-node-exporter-p52lq","namespace":"prometheus","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T08:12:13.380082165Z","level":"INFO","msg":"Pods still running on node, requeueing for later check","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","remainingPods":["test-namespace/test-pod-until-timeout"]}
{"time":"2026-02-18T08:12:13.380128069Z","level":"WARN","msg":"Error processing event for node (will retry)","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","attempt":4,"error":"waiting for pods to complete: 1 pods remaining"}
{"time":"2026-02-18T08:13:33.380991357Z","level":"INFO","msg":"Checking pod completion status for AllowCompletion namespaces on node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T08:13:33.38102646Z","level":"INFO","msg":"Evaluated action for node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","action":"CheckCompletion"}
{"time":"2026-02-18T08:13:33.401294059Z","level":"INFO","msg":"Labeling node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","from":"draining","to":"draining"}
{"time":"2026-02-18T08:13:33.401333163Z","level":"INFO","msg":"No update needed for node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","label":"dgxc.nvidia.com/nvsentinel-state","value":"draining"}
{"time":"2026-02-18T08:13:33.401363265Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"prometheus-prometheus-node-exporter-p52lq","namespace":"prometheus","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T08:13:33.401373466Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"debug-ds-kxn8j","namespace":"default","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T08:13:33.428140743Z","level":"INFO","msg":"Pods still running on node, requeueing for later check","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","remainingPods":["test-namespace/test-pod-until-timeout"]}
{"time":"2026-02-18T08:13:33.428177646Z","level":"WARN","msg":"Error processing event for node (will retry)","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","attempt":5,"error":"waiting for pods to complete: 1 pods remaining"}

  • Manually uncordon the node (e.g. kubectl uncordon <node-name>) and check whether the remediation gets cancelled
    Fault quarantine (FQ) logs
{"time":"2026-02-18T08:11:02.962598601Z","level":"INFO","msg":"Attempting to store resume token","module":"fault-quarantine","version":"dev","client":"fault-quarantine"}
{"time":"2026-02-18T08:14:03.213163644Z","level":"INFO","msg":"Detected manual uncordon of FQ-quarantined node","module":"fault-quarantine","version":"dev","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T08:14:03.49709277Z","level":"INFO","msg":"Updated quarantining events to cancelled status","module":"fault-quarantine","version":"dev","node":"aks-gpu-12493808-vmss00000l","firstEventId":"699574164cf999a4e5fe6913","documentsUpdated":1}
{"time":"2026-02-18T08:14:03.497133874Z","level":"INFO","msg":"Set currentQuarantinedNodes to 0 for manually uncordoned node","module":"fault-quarantine","version":"dev","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T08:14:03.497142174Z","level":"INFO","msg":"Successfully completed manual uncordon handling","module":"fault-quarantine","version":"dev","node":"aks-gpu-12493808-vmss00000l"}

Node drainer (ND) logs

{"time":"2026-02-18T08:14:03.498845122Z","level":"INFO","msg":"Detected Cancelled event, marking event as cancelled","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","eventID":"ObjectID(\"699574164cf999a4e5fe6913\")"}
{"time":"2026-02-18T08:14:03.498877925Z","level":"INFO","msg":"Marked specific event as cancelled","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","eventID":"ObjectID(\"699574164cf999a4e5fe6913\")"}
{"time":"2026-02-18T08:14:03.498887226Z","level":"INFO","msg":"Attempting to store resume token","module":"node-drainer","version":"dev","client":"node-drainer"}
{"time":"2026-02-18T08:15:33.42882663Z","level":"INFO","msg":"Event was cancelled, performing cleanup","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","eventID":"ObjectID(\"699574164cf999a4e5fe6913\")"}
{"time":"2026-02-18T08:15:33.434928776Z","level":"INFO","msg":"Health event status has been updated","module":"node-drainer","version":"dev","documentID":"699574164cf999a4e5fe6913","evictionStatus":"Cancelled"}
{"time":"2026-02-18T08:15:33.44915785Z","level":"INFO","msg":"Label already absent","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","label":"dgxc.nvidia.com/nvsentinel-state"}
{"time":"2026-02-18T08:15:33.449193753Z","level":"INFO","msg":"Successfully cleaned up cancelled event","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","eventID":"ObjectID(\"699574164cf999a4e5fe6913\")"}


Result:

  • Node quarantine duration: 0.810579399 sec

  • Node drainer eviction duration: no metric emitted because the event was cancelled

  • Fault remediation CR generation duration: no metric emitted because the event was cancelled (a query sketch to confirm this follows below)
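
To confirm that nothing was recorded for the cancelled event, the histogram sample counts can be checked around the cancellation window. A hedged sketch, reusing the metric names from the Grafana queries above (the window length is arbitrary); both expressions should stay flat or return no data:

increase(node_drainer_pod_eviction_duration_seconds_count[15m])
increase(fault_remediation_cr_generate_duration_seconds_count[15m])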

5. Force pod eviction mode

  • Run a user job on a GPU node
  • Inject XID 95, this time with drainOverrides.force set to true, as shown below
var nowMs = Date.now();
db.HealthEvents.insertOne({
  createdAt: new Date(nowMs),
  healthevent: {
    agent: "gpu-health-monitor",
    checkname: "GpuMemWatch",
    componentclass: "GPU",
    entitiesimpacted: [
      { entitytype: "GPU", entityvalue: "1" },
      { entitytype: "PCI", entityvalue: "0002:00:00.0" },
      { entitytype: "GPU_UUID", entityvalue: "GPU-3d7408b2-d525-643f-5bfc-45a761045e14" }
    ],
    errorcode: [ "DCGM_FR_UNCONTAINED_ERROR" ],
    generatedtimestamp: {
      seconds: Math.floor(nowMs / 1000),
      nanos: (nowMs % 1000) * 1000000
    },
    isfatal: true,
    ishealthy: false,
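    // drainoverrides.force=true tells the node drainer to evict user pods immediately,
    // bypassing the AllowCompletion mode configured for the namespace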
    drainoverrides: {
     force: true,
     skip: false
    },
    message: "GPU had an uncontained error (XID 95) Drain the GPU and reset it or reboot the node.",
    metadata: {
      "node.kubernetes.io/instance-type": "Standard_ND96amsr_A100_v4",
      "nvidia.com/cuda.driver-version.full": "570.148.08",
      "nvidia.com/cuda.driver-version.major": "570",
      "nvidia.com/cuda.driver-version.minor": "148",
      "nvidia.com/cuda.driver-version.revision": "08",
      "nvidia.com/cuda.runtime-version.full": "12.8",
      "nvidia.com/cuda.runtime-version.major": "12",
      "nvidia.com/cuda.runtime-version.minor": "8",
      "nvidia.com/gpu.product": "NVIDIA-A100-SXM4-80GB",
      "providerID": "azure:///subscriptions/397f2a8c-2c98-4127-8aed-16ebd6ca47bd/resourceGroups/mc_rg-nvs-dev1_nvs-dgxc-k8s-azr-scus-dev1_southcentralus/providers/Microsoft.Compute/virtualMachineScaleSets/aks-gpu-12493808-vmss/virtualMachines/0",
      "topology.kubernetes.io/region": "southcentralus",
      "topology.kubernetes.io/zone": "0"
    },
    nodename: "aks-gpu-12493808-vmss00000l",
    processingstrategy: 1,
    quarantineoverrides: null,
    recommendedaction: 15,
    version: 1
  },
  healtheventstatus: {
    faultremediated: null,
    nodequarantined: null,
    userpodsevictionstatus: { status: "" }
  }
});

  • Verify that the node gets cordoned, and how long it takes
{"time":"2026-02-18T08:38:13.908356611Z","level":"INFO","msg":"Setting annotations on node","module":"fault-quarantine","version":"dev","node":"aks-gpu-12493808-vmss00000l","annotations":{"quarantineHealthEvent":"[{\"version\":1,\"agent\":\"gpu-health-monitor\",\"componentClass\":\"GPU\",\"checkName\":\"GpuMemWatch\",\"isFatal\":true,\"message\":\"GPU had an uncontained error (XID 95) Drain the GPU and reset it or reboot the node.\",\"recommendedAction\":15,\"errorCode\":[\"DCGM_FR_UNCONTAINED_ERROR\"],\"entitiesImpacted\":[{\"entityType\":\"GPU\",\"entityValue\":\"1\"},{\"entityType\":\"PCI\",\"entityValue\":\"0002:00:00.0\"},{\"entityType\":\"GPU_UUID\",\"entityValue\":\"GPU-3d7408b2-d525-643f-5bfc-45a761045e14\"}],\"metadata\":{\"node.kubernetes.io/instance-type\":\"Standard_ND96amsr_A100_v4\",\"nvidia.com/cuda.driver-version.full\":\"570.148.08\",\"nvidia.com/cuda.driver-version.major\":\"570\",\"nvidia.com/cuda.driver-version.minor\":\"148\",\"nvidia.com/cuda.driver-version.revision\":\"08\",\"nvidia.com/cuda.runtime-version.full\":\"12.8\",\"nvidia.com/cuda.runtime-version.major\":\"12\",\"nvidia.com/cuda.runtime-version.minor\":\"8\",\"nvidia.com/gpu.product\":\"NVIDIA-A100-SXM4-80GB\",\"providerID\":\"azure:///subscriptions/397f2a8c-2c98-4127-8aed-16ebd6ca47bd/resourceGroups/mc_rg-nvs-dev1_nvs-dgxc-k8s-azr-scus-dev1_southcentralus/providers/Microsoft.Compute/virtualMachineScaleSets/aks-gpu-12493808-vmss/virtualMachines/0\",\"topology.kubernetes.io/region\":\"southcentralus\",\"topology.kubernetes.io/zone\":\"0\"},\"generatedTimestamp\":{\"seconds\":1771403890,\"nanos\":505000000},\"nodeName\":\"aks-gpu-12493808-vmss00000l\",\"drainOverrides\":{\"force\":true},\"processingStrategy\":1,\"id\":\"69957a754cf999a4e5fe6914\"}]","quarantineHealthEventIsCordoned":"True"}}
{"time":"2026-02-18T08:38:13.908377013Z","level":"INFO","msg":"Adding labels on node","module":"fault-quarantine","version":"dev","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T08:38:13.988792351Z","level":"INFO","msg":"Document updated with status","module":"fault-quarantine","version":"dev","id":"69957a754cf999a4e5fe6914","status":"Quarantined"}
{"time":"2026-02-18T08:38:13.988829054Z","level":"INFO","msg":"Node quarantine duration","module":"fault-quarantine","version":"dev","duration":3.483826154}
{"time":"2026-02-18T08:38:13.988846356Z","level":"INFO","msg":"Attempting to store resume token","module":"fault-quarantine","version":"dev","client":"fault-quarantine"}

  • Although the drain config is set to AllowCompletion, force drain is set to true, so the pod should be evicted immediately
{"time":"2026-02-18T08:38:14.0012167Z","level":"INFO","msg":"Set initial eviction status to InProgress","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T08:38:14.001273705Z","level":"INFO","msg":"Attempting to store resume token","module":"node-drainer","version":"dev","client":"node-drainer"}
{"time":"2026-02-18T08:38:14.001556532Z","level":"INFO","msg":"DrainOverrides.Force is true, forcing immediate eviction for all namespaces on node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T08:38:14.001834357Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"prometheus-prometheus-node-exporter-p52lq","namespace":"prometheus","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T08:38:14.001851759Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"debug-ds-kxn8j","namespace":"default","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T08:38:14.00186156Z","level":"INFO","msg":"Pods still present on node, will retry","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","pods":["test-namespace/test-pod-until-timeout"]}
{"time":"2026-02-18T08:38:14.001902264Z","level":"INFO","msg":"Performing immediate eviction for node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T08:38:14.001906964Z","level":"INFO","msg":"Evaluated action for node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","action":"EvictImmediate"}
{"time":"2026-02-18T08:38:14.019435785Z","level":"INFO","msg":"Labeling node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","from":"quarantined","to":"draining"}
{"time":"2026-02-18T08:38:14.123114175Z","level":"INFO","msg":"Labeling node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","from":"quarantined","to":"draining"}
{"time":"2026-02-18T08:38:14.249809494Z","level":"INFO","msg":"Label updated successfully for node","module":"node-drainer","version":"dev","label":"dgxc.nvidia.com/nvsentinel-state","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T08:38:14.285260274Z","level":"INFO","msg":"Pod eviction initiated for namespace on node","module":"node-drainer","version":"dev","pod":"test-pod-until-timeout","namespace":"test-namespace","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T08:38:14.285297377Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"prometheus-prometheus-node-exporter-p52lq","namespace":"prometheus","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T08:38:14.285304678Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"debug-ds-kxn8j","namespace":"default","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T08:38:14.285316179Z","level":"WARN","msg":"Error processing event for node (will retry)","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","attempt":1,"error":"immediate eviction completed, requeuing for status verification"}
{"time":"2026-02-18T08:38:24.286440577Z","level":"INFO","msg":"DrainOverrides.Force is true, forcing immediate eviction for all namespaces on node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T08:38:24.286746605Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"debug-ds-kxn8j","namespace":"default","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T08:38:24.286775608Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"prometheus-prometheus-node-exporter-p52lq","namespace":"prometheus","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T08:38:24.286783409Z","level":"INFO","msg":"Pods still present on node, will retry","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","pods":["test-namespace/test-pod-until-timeout"]}
{"time":"2026-02-18T08:38:24.28679331Z","level":"INFO","msg":"Performing immediate eviction for node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T08:38:24.28680011Z","level":"INFO","msg":"Evaluated action for node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","action":"EvictImmediate"}
{"time":"2026-02-18T08:38:24.3007473Z","level":"INFO","msg":"Labeling node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","from":"remediation-failed","to":"draining"}
{"time":"2026-02-18T08:38:24.300787104Z","level":"WARN","msg":"Invalid state transition","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","from":"remediation-failed","to":"draining","error":"unexpected state transition: remediation-failed -> draining (expected one of: [])"}
{"time":"2026-02-18T08:38:24.372625346Z","level":"INFO","msg":"Label updated successfully for node","module":"node-drainer","version":"dev","label":"dgxc.nvidia.com/nvsentinel-state","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T08:38:24.37266395Z","level":"ERROR","msg":"Failed to update node label to draining","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","error":"unexpected state transition: remediation-failed -> draining (expected one of: [])"}
{"time":"2026-02-18T08:38:24.372695853Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"debug-ds-kxn8j","namespace":"default","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T08:38:24.391796819Z","level":"INFO","msg":"Pod eviction initiated for namespace on node","module":"node-drainer","version":"dev","pod":"test-pod-until-timeout","namespace":"test-namespace","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T08:38:24.391836723Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"prometheus-prometheus-node-exporter-p52lq","namespace":"prometheus","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T08:38:24.391848724Z","level":"WARN","msg":"Error processing event for node (will retry)","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","attempt":2,"error":"immediate eviction completed, requeuing for status verification"}
{"time":"2026-02-18T08:38:44.392413386Z","level":"INFO","msg":"DrainOverrides.Force is true, forcing immediate eviction for all namespaces on node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T08:38:44.392699712Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"prometheus-prometheus-node-exporter-p52lq","namespace":"prometheus","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T08:38:44.392715714Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"debug-ds-kxn8j","namespace":"default","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T08:38:44.392724114Z","level":"INFO","msg":"Pods still present on node, will retry","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","pods":["test-namespace/test-pod-until-timeout"]}
{"time":"2026-02-18T08:38:44.392733815Z","level":"INFO","msg":"Performing immediate eviction for node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T08:38:44.392740016Z","level":"INFO","msg":"Evaluated action for node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","action":"EvictImmediate"}
{"time":"2026-02-18T08:38:44.406787892Z","level":"INFO","msg":"Labeling node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","from":"draining","to":"draining"}
{"time":"2026-02-18T08:38:44.406813394Z","level":"INFO","msg":"No update needed for node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","label":"dgxc.nvidia.com/nvsentinel-state","value":"draining"}
{"time":"2026-02-18T08:38:44.426993827Z","level":"INFO","msg":"Pod eviction initiated for namespace on node","module":"node-drainer","version":"dev","pod":"test-pod-until-timeout","namespace":"test-namespace","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T08:38:44.42702643Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"prometheus-prometheus-node-exporter-p52lq","namespace":"prometheus","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T08:38:44.427034331Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"debug-ds-kxn8j","namespace":"default","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T08:38:44.427044831Z","level":"WARN","msg":"Error processing event for node (will retry)","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","attempt":3,"error":"immediate eviction completed, requeuing for status verification"}
{"time":"2026-02-18T08:39:24.427529294Z","level":"INFO","msg":"DrainOverrides.Force is true, forcing immediate eviction for all namespaces on node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T08:39:24.427870325Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"debug-ds-kxn8j","namespace":"default","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T08:39:24.427887927Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"prometheus-prometheus-node-exporter-p52lq","namespace":"prometheus","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T08:39:24.427894427Z","level":"INFO","msg":"All pods evicted in namespace from node","module":"node-drainer","version":"dev","namespaces":["default","prometheus"],"node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T08:39:24.427953833Z","level":"INFO","msg":"All pods evicted successfully on node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T08:39:24.427960133Z","level":"INFO","msg":"Evaluated action for node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","action":"UpdateStatus"}
{"time":"2026-02-18T08:39:24.440841406Z","level":"INFO","msg":"Labeling node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","from":"draining","to":"drain-succeeded"}
{"time":"2026-02-18T08:39:24.527559501Z","level":"INFO","msg":"Label updated successfully for node","module":"node-drainer","version":"dev","label":"dgxc.nvidia.com/nvsentinel-state","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T08:39:24.534053392Z","level":"INFO","msg":"Node drainer evictionDuration is","module":"node-drainer","version":"dev","evictionDuration":70.551046991}
{"time":"2026-02-18T08:39:24.534077294Z","level":"INFO","msg":"Health event status has been updated","module":"node-drainer","version":"dev","documentID":"69957a754cf999a4e5fe6914","evictionStatus":"Succeeded"}

  • Verify that the remediation CR gets generated
{"time":"2026-02-18T08:39:24.536090877Z","level":"INFO","msg":"Reconciling Event","module":"fault-remediation","version":"dev"}
{"time":"2026-02-18T08:39:24.547581924Z","level":"INFO","msg":"Labeling node","module":"fault-remediation","version":"dev","node":"aks-gpu-12493808-vmss00000l","from":"drain-succeeded","to":"remediating"}
{"time":"2026-02-18T08:39:24.641829304Z","level":"INFO","msg":"Label updated successfully for node","module":"fault-remediation","version":"dev","label":"dgxc.nvidia.com/nvsentinel-state","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T08:39:24.641939414Z","level":"INFO","msg":"Creating maintenance CR","module":"fault-remediation","version":"dev","node":"aks-gpu-12493808-vmss00000l","template":"RESTART_VM","nodeUID":"fa82be96-fa4a-4380-ae79-ac80b894e8c8"}
{"time":"2026-02-18T08:39:24.642096428Z","level":"INFO","msg":"Added owner reference to CR for automatic garbage collection","module":"fault-remediation","version":"dev","node":"aks-gpu-12493808-vmss00000l","nodeUID":"fa82be96-fa4a-4380-ae79-ac80b894e8c8","crName":"maintenance-aks-gpu-12493808-vmss00000l-69957a754cf999a4e5fe6914"}
{"time":"2026-02-18T08:39:24.65771505Z","level":"INFO","msg":"Fault remediation CR generation duration","module":"fault-remediation","version":"dev","duration":0.130705849,"node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T08:39:24.657748553Z","level":"INFO","msg":"Created Maintenance CR successfully","module":"fault-remediation","version":"dev","crName":"maintenance-aks-gpu-12493808-vmss00000l-69957a754cf999a4e5fe6914","node":"aks-gpu-12493808-vmss00000l","template":"RESTART_VM"}

Result:

  • Node quarantine duration: 3.483826154 sec

  • Node drainer eviction duration: 70.551046991 sec

  • Fault remediation CR generation duration: 0.130705849 sec

6. Long drains

  • Run a user job on a GPU node
  • Inject XID 95
  • Verify that the node gets cordoned, and how long it takes
{"time":"2026-02-18T15:48:10.266654645Z","level":"INFO","msg":"Cordoning node","module":"fault-quarantine","version":"dev","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T15:48:10.26671065Z","level":"INFO","msg":"Setting annotations on node","module":"fault-quarantine","version":"dev","node":"aks-gpu-12493808-vmss00000l","annotations":{"quarantineHealthEvent":"[{\"version\":1,\"agent\":\"gpu-health-monitor\",\"componentClass\":\"GPU\",\"checkName\":\"GpuMemWatch\",\"isFatal\":true,\"message\":\"GPU had an uncontained error (XID 95) Drain the GPU and reset it or reboot the node.\",\"recommendedAction\":15,\"errorCode\":[\"DCGM_FR_UNCONTAINED_ERROR\"],\"entitiesImpacted\":[{\"entityType\":\"GPU\",\"entityValue\":\"1\"},{\"entityType\":\"PCI\",\"entityValue\":\"0002:00:00.0\"},{\"entityType\":\"GPU_UUID\",\"entityValue\":\"GPU-3d7408b2-d525-643f-5bfc-45a761045e14\"}],\"metadata\":{\"node.kubernetes.io/instance-type\":\"Standard_ND96amsr_A100_v4\",\"nvidia.com/cuda.driver-version.full\":\"570.148.08\",\"nvidia.com/cuda.driver-version.major\":\"570\",\"nvidia.com/cuda.driver-version.minor\":\"148\",\"nvidia.com/cuda.driver-version.revision\":\"08\",\"nvidia.com/cuda.runtime-version.full\":\"12.8\",\"nvidia.com/cuda.runtime-version.major\":\"12\",\"nvidia.com/cuda.runtime-version.minor\":\"8\",\"nvidia.com/gpu.product\":\"NVIDIA-A100-SXM4-80GB\",\"providerID\":\"azure:///subscriptions/397f2a8c-2c98-4127-8aed-16ebd6ca47bd/resourceGroups/mc_rg-nvs-dev1_nvs-dgxc-k8s-azr-scus-dev1_southcentralus/providers/Microsoft.Compute/virtualMachineScaleSets/aks-gpu-12493808-vmss/virtualMachines/0\",\"topology.kubernetes.io/region\":\"southcentralus\",\"topology.kubernetes.io/zone\":\"0\"},\"generatedTimestamp\":{\"seconds\":1771429688,\"nanos\":53000000},\"nodeName\":\"aks-gpu-12493808-vmss00000l\",\"processingStrategy\":1,\"id\":\"6995df3ab5db4c69fefe6911\"}]","quarantineHealthEventIsCordoned":"True"}}
{"time":"2026-02-18T15:48:10.266738053Z","level":"INFO","msg":"Adding labels on node","module":"fault-quarantine","version":"dev","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T15:48:10.370927709Z","level":"INFO","msg":"Document updated with status","module":"fault-quarantine","version":"dev","id":"6995df3ab5db4c69fefe6911","status":"Quarantined"}
{"time":"2026-02-18T15:48:10.370967113Z","level":"INFO","msg":"Node quarantine duration","module":"fault-quarantine","version":"dev","duration":2.317964513}
{"time":"2026-02-18T15:48:10.370987315Z","level":"INFO","msg":"Attempting to store resume token","module":"fault-quarantine","version":"dev","client":"fault-quarantine"}


  • Verify pod eviction time
{"time":"2026-02-18T16:16:41.590969475Z","level":"INFO","msg":"Pods still present on node, will retry","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","pods":["test-namespace/test-pod-until-timeout"]}
{"time":"2026-02-18T16:16:41.591014879Z","level":"INFO","msg":"Performing immediate eviction for node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T16:16:41.591019679Z","level":"INFO","msg":"Evaluated action for node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","action":"EvictImmediate"}
{"time":"2026-02-18T16:16:41.626309499Z","level":"INFO","msg":"Labeling node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","from":"draining","to":"draining"}
{"time":"2026-02-18T16:16:41.626339602Z","level":"INFO","msg":"No update needed for node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","label":"dgxc.nvidia.com/nvsentinel-state","value":"draining"}
{"time":"2026-02-18T16:16:41.626354803Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"debug-ds-kxn8j","namespace":"default","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T16:16:41.651662412Z","level":"INFO","msg":"Pod eviction initiated for namespace on node","module":"node-drainer","version":"dev","pod":"test-pod-until-timeout","namespace":"test-namespace","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T16:16:41.651720317Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"prometheus-prometheus-node-exporter-p52lq","namespace":"prometheus","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T16:16:41.651739819Z","level":"WARN","msg":"Error processing event for node (will retry)","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","attempt":18,"error":"immediate eviction completed, requeuing for status verification"}
{"time":"2026-02-18T16:18:41.652877081Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"prometheus-prometheus-node-exporter-p52lq","namespace":"prometheus","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T16:18:41.652913785Z","level":"INFO","msg":"Ignoring DaemonSet pod in namespace on node during eviction check","module":"node-drainer","version":"dev","pod":"debug-ds-kxn8j","namespace":"default","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T16:18:41.652921585Z","level":"INFO","msg":"All pods evicted in namespace from node","module":"node-drainer","version":"dev","namespaces":["prometheus","default"],"node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T16:18:41.65296839Z","level":"INFO","msg":"All pods evicted successfully on node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T16:18:41.65297439Z","level":"INFO","msg":"Evaluated action for node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","action":"UpdateStatus"}
{"time":"2026-02-18T16:18:41.689784767Z","level":"INFO","msg":"Labeling node","module":"node-drainer","version":"dev","node":"aks-gpu-12493808-vmss00000l","from":"draining","to":"drain-succeeded"}
{"time":"2026-02-18T16:18:41.817383671Z","level":"INFO","msg":"Label updated successfully for node","module":"node-drainer","version":"dev","label":"dgxc.nvidia.com/nvsentinel-state","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T16:18:41.823019888Z","level":"INFO","msg":"Node drainer evictionDuration is","module":"node-drainer","version":"dev","evictionDuration":1831.467014087}
{"time":"2026-02-18T16:18:41.82304559Z","level":"INFO","msg":"Health event status has been updated","module":"node-drainer","version":"dev","documentID":"6995df3ab5db4c69fefe6911","evictionStatus":"Succeeded"}

  • Verify the remediation CR generation time
{"time":"2026-02-18T16:18:41.82522299Z","level":"INFO","msg":"Reconciling Event","module":"fault-remediation","version":"dev"}
{"time":"2026-02-18T16:18:41.841718703Z","level":"INFO","msg":"Labeling node","module":"fault-remediation","version":"dev","node":"aks-gpu-12493808-vmss00000l","from":"drain-succeeded","to":"remediating"}
{"time":"2026-02-18T16:18:41.937340274Z","level":"INFO","msg":"Label updated successfully for node","module":"fault-remediation","version":"dev","label":"dgxc.nvidia.com/nvsentinel-state","node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T16:18:41.937439183Z","level":"INFO","msg":"Creating maintenance CR","module":"fault-remediation","version":"dev","node":"aks-gpu-12493808-vmss00000l","template":"RESTART_VM","nodeUID":"fa82be96-fa4a-4380-ae79-ac80b894e8c8"}
{"time":"2026-02-18T16:18:41.937626001Z","level":"INFO","msg":"Added owner reference to CR for automatic garbage collection","module":"fault-remediation","version":"dev","node":"aks-gpu-12493808-vmss00000l","nodeUID":"fa82be96-fa4a-4380-ae79-ac80b894e8c8","crName":"maintenance-aks-gpu-12493808-vmss00000l-6995df3ab5db4c69fefe6911"}
{"time":"2026-02-18T16:18:41.953412249Z","level":"INFO","msg":"Fault remediation CR generation duration","module":"fault-remediation","version":"dev","duration":0.136405048,"node":"aks-gpu-12493808-vmss00000l"}
{"time":"2026-02-18T16:18:41.953441951Z","level":"INFO","msg":"Created Maintenance CR successfully","module":"fault-remediation","version":"dev","crName":"maintenance-aks-gpu-12493808-vmss00000l-6995df3ab5db4c69fefe6911","node":"aks-gpu-12493808-vmss00000l","template":"RESTART_VM"}

Result:

  • Node quarantine duration: 2.317964513 sec

  • Node drainer eviction duration: 1831.467014087 sec

  • Fault remediation CR generation duration: 0.136405048 sec

For all of the above scenarios, the emitted durations were verified on the Grafana dashboard to fall into the correct histogram buckets.
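
The same histograms can also answer the mean-time questions from the summary directly, by dividing the rate of each histogram's _sum series by the rate of its _count series. A hedged example using the metric names from the queries above (the window length is arbitrary):

sum(rate(fault_quarantine_node_quarantine_duration_seconds_sum[5m])) / sum(rate(fault_quarantine_node_quarantine_duration_seconds_count[5m]))
sum(rate(node_drainer_pod_eviction_duration_seconds_sum[5m])) / sum(rate(node_drainer_pod_eviction_duration_seconds_count[5m]))
sum(rate(fault_remediation_cr_generate_duration_seconds_sum[5m])) / sum(rate(fault_remediation_cr_generate_duration_seconds_count[5m]))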

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📚 Documentation
  • 🔧 Refactoring
  • 🔨 Build/CI

Component(s) Affected

  • Core Services
  • Documentation/CI
  • Fault Management
  • Health Monitors
  • Janitor
  • Other: ____________

Testing

  • Tests pass locally
  • Manual testing completed
  • No breaking changes (or documented)

Checklist

  • Self-review completed
  • Documentation updated (if needed)
  • Ready for review

Summary by CodeRabbit

Release Notes

  • New Features

    • Added metrics to track node quarantine duration from health event generation to completion.
    • Added metrics to track pod eviction duration from event receipt to successful completion.
    • Added metrics to track maintenance resource creation duration from event receipt to completion.
  • Documentation

    • Updated metrics documentation with new metrics and clarified MTTR metric description.
  • Tests

    • Added comprehensive test coverage for new metrics tracking capabilities.

@coderabbitai
Contributor

coderabbitai bot commented Jan 20, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

This PR introduces lifecycle duration metrics for health events: node quarantine duration (time from health event generation to node cordon), pod eviction duration (event receipt to pod eviction completion), and CR generation duration (event receipt to maintenance CR creation). Timestamp fields are added to track lifecycle milestones, database schemas updated, and reconcilers enhanced to record durations at key completion points.

Changes

Cohort / File(s) Summary
Metrics Documentation
docs/METRICS.md
Updated metric descriptions and added three new duration metrics with bucket definitions and explanations for node quarantine, pod eviction, and CR generation timings.
Metric Declarations
fault-quarantine/pkg/metrics/metrics.go, node-drainer/pkg/metrics/metrics.go, fault-remediation/pkg/metrics/metrics.go
Added three new Prometheus histogram metrics: NodeQuarantineDuration (1s–1h buckets), PodEvictionDuration (1min–90day buckets), and CRGenerationDuration (1s–1h buckets).
Duration Utility
commons/pkg/metricsutil/duration.go, commons/pkg/metricsutil/duration_test.go
New utility function CalculateDurationSeconds to safely compute elapsed seconds from a timestamp, with zero-value handling and comprehensive test coverage (a minimal sketch of such a helper follows this changes list).
Data Model Extensions
data-models/pkg/model/health_event_extentions.go, fault-remediation/pkg/events/health_event.go, store-client/pkg/datastore/types.go
Added ReceivedAt and QuarantineFinishTimestamp, DrainFinishTimestamp timestamp fields to track event lifecycle milestones across quarantine, drain, and remediation phases.
Metric Recording – Quarantine
fault-quarantine/pkg/reconciler/reconciler.go
Updated updateQuarantineMetrics to accept generatedTimestamp, calculate cordon duration, and record NodeQuarantineDuration metric when cordoning succeeds.
Metric Recording – Remediation
fault-remediation/pkg/reconciler/reconciler.go, fault-remediation/pkg/remediation/remediation.go
Added ReceivedAt tracking on event receipt and CR generation duration calculation and recording when DrainFinishTimestamp is available.
Metric Recording – Node Drainer
node-drainer/pkg/reconciler/reconciler.go
Added PodEvictionDuration observation when eviction succeeds, calculated from quarantine finish timestamp; refactored status update to use batch field updates.
Janitor MTTR Updates
janitor/pkg/metrics/metrics.go, janitor/pkg/controller/rebootnode_controller.go, janitor/pkg/controller/terminatenode_controller.go
Updated metric description and changed MTTR measurement from StartTime to CreationTimestamp across reboot and terminate controllers.
Database Schema & Client Interfaces
store-client/pkg/client/interfaces.go
Added UpdateDocumentStatusFields method to DatabaseClient interface to support batch status field updates.
MongoDB Client Implementation
store-client/pkg/client/mongodb_client.go
Implemented UpdateDocumentStatusFields to perform bulk status updates with support for nested timestamp fields.
PostgreSQL Client Implementation
store-client/pkg/client/postgresql_client.go, store-client/pkg/datastore/providers/postgresql/database_client.go
Implemented UpdateDocumentStatusFields with deterministic field iteration, nested jsonb_set construction, and denormalized column synchronization for health_events.
PostgreSQL Schema & SQL Handling
store-client/pkg/datastore/providers/postgresql/datastore.go, store-client/pkg/datastore/providers/postgresql/health_events.go
Added quarantine_finish_timestamp and drain_finish_timestamp columns to health_events; updated insert and update paths to handle new timestamp fields and parameter reordering.
PostgreSQL Pipeline
store-client/pkg/client/mongodb_pipeline_builder.go
Refactored updateDescription field checks to use generic $expr-based comparisons instead of four separate explicit checks.
Convenience Store Methods
store-client/pkg/client/convenience.go
Refactored UpdateHealthEventNodeQuarantineStatus to update both status and quarantineFinishTimestamp in a single call via UpdateDocumentStatusFields.
Test Coverage
fault-quarantine/pkg/reconciler/reconciler_e2e_test.go, fault-remediation/pkg/reconciler/reconciler_e2e_test.go, fault-remediation/pkg/reconciler/reconciler_test.go, node-drainer/pkg/reconciler/reconciler_integration_test.go
Added end-to-end and integration tests for NodeQuarantineDuration, CRGenerationDuration, PodEvictionDuration metrics, plus unit test for extractReceivedAtTimestamp helper; includes histogram count helper function.
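
For context on the Duration Utility entry above: a minimal sketch of what a zero-value-safe helper like CalculateDurationSeconds could look like, assuming a (time.Time) -> float64 signature (the actual implementation in commons/pkg/metricsutil/duration.go may differ):

package metricsutil

import "time"

// CalculateDurationSeconds returns the elapsed seconds since ts.
// A zero-value timestamp yields 0 so that uninitialized lifecycle fields
// do not produce huge, misleading histogram observations.
func CalculateDurationSeconds(ts time.Time) float64 {
	if ts.IsZero() {
		return 0
	}
	return time.Since(ts).Seconds()
}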

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 Timestamps hop through the system flow,
Recording when each event does go—
From quarantine to drain to repair,
Metrics now dance everywhere!
Duration captured, MTTR clear,
Data insights appear! 🌟

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 42.86%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Title check — ✅ Passed: the PR title 'feat: added breakfix response time metrics' accurately describes the main changes, which add multiple response time metrics across several modules.
  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@fault-quarantine/pkg/reconciler/reconciler.go`:
- Around line 798-800: The call to event.HealthEvent.GeneratedTimestamp.AsTime()
can panic when GeneratedTimestamp is nil; update the caller (before passing into
updateQuarantineMetrics) to guard against nil GeneratedTimestamp: check
event.HealthEvent.GeneratedTimestamp != nil and only call AsTime() when non-nil,
otherwise use a safe zero-value (or appropriate fallback) time.Time value and
pass that to updateQuarantineMetrics so reconciliation won't crash on
legacy/malformed events.

In `@fault-remediation/pkg/reconciler/reconciler.go`:
- Line 114: The current assignment to event.Event["_received_at"] can panic if
event.Event is nil; in the reconciler.go code ensure you guard and initialize
the map before writing: check if event.Event == nil and if so set event.Event =
make(map[string]interface{}) (or the appropriate map type used by Event) and
then assign event.Event["_received_at"] = start.Unix(); reference the
event.Event map and the start.Unix() assignment so the fix is applied at the
same location.
- Around line 243-245: The metrics code observes CR generation duration even
when healthEventWithStatus.ReceivedAt is the zero value, which yields a huge
duration; update the else branch containing crGenerationDuration.Observe(...) to
first check healthEventWithStatus.ReceivedAt.IsZero() and skip calling
crGenerationDuration.Observe(...) when true (i.e., only call Observe with
time.Since(healthEventWithStatus.ReceivedAt).Seconds() if ReceivedAt.IsZero() is
false) so metrics are not polluted by uninitialized timestamps.

In `@node-drainer/pkg/reconciler/reconciler.go`:
- Around line 519-527: The type assertion for the _received_at field
(receivedAtRaw in the event handling block) assumes int64 but json.Unmarshal
turns numbers into float64, so update the logic in the reconciler.go block that
computes evictionDuration (using receivedAtRaw, receivedAtUnix, time.Unix,
metrics.PodEvictionDuration and nodeName) to accept both float64 and int64 (and
optionally numeric strings) -- convert the value to an integer Unix seconds (or
to float seconds) before creating time.Unix and computing time.Since, log a
warning only if the type is neither supported nor convertible, and then observe
the evictionDuration with metrics.PodEvictionDuration.
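
For the first item above (the nil GeneratedTimestamp guard in fault-quarantine), a minimal sketch of the suggested pattern; the helper name below is hypothetical, and only the nil check itself reflects what the comment asks for:

// assumes: import "time" and "google.golang.org/protobuf/types/known/timestamppb"
//
// safeGeneratedAt converts a possibly-nil protobuf timestamp into a time.Time,
// returning the zero value when nil so the caller (e.g. before invoking
// updateQuarantineMetrics) cannot panic on AsTime() for legacy/malformed events.
func safeGeneratedAt(ts *timestamppb.Timestamp) time.Time {
	if ts == nil {
		return time.Time{}
	}
	return ts.AsTime()
}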
🧹 Nitpick comments (2)
fault-quarantine/pkg/metrics/metrics.go (1)

192-194: Unused parameter nodeName in RecordNodeCordonDuration.

The nodeName parameter is accepted but never used since the NodeCordonDuration histogram has no labels. Either remove the unused parameter or consider adding a node label to the histogram if per-node granularity is desired for this metric.

Option 1: Remove unused parameter
-func RecordNodeCordonDuration(nodeName string, generatedTimestamp time.Time) {
+func RecordNodeCordonDuration(generatedTimestamp time.Time) {
 	NodeCordonDuration.Observe(time.Since(generatedTimestamp).Seconds())
 }
fault-remediation/pkg/reconciler/reconciler.go (1)

538-543: Make _received_at parsing tolerant of numeric types
Strictly expecting int64 can drop the value when the map originated from JSON/BSON conversions. A simple type switch avoids a zero ReceivedAt.

♻️ Suggested refactor
-	if receivedAtRaw, ok := eventWithToken.Event["_received_at"]; ok {
-		if receivedAtUnix, ok := receivedAtRaw.(int64); ok {
-			result.ReceivedAt = time.Unix(receivedAtUnix, 0)
-		}
-	}
+	if receivedAtRaw, ok := eventWithToken.Event["_received_at"]; ok {
+		switch v := receivedAtRaw.(type) {
+		case int64:
+			result.ReceivedAt = time.Unix(v, 0)
+		case int32:
+			result.ReceivedAt = time.Unix(int64(v), 0)
+		case float64:
+			result.ReceivedAt = time.Unix(int64(v), 0)
+		}
+	}

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@fault-remediation/pkg/reconciler/reconciler.go`:
- Around line 522-527: The extraction of _received_at in reconciler.go currently
only handles int64 and will silently skip valid timestamps decoded as float64,
int32, json.Number or strings; update the logic in the block that reads
eventWithToken.Event["_received_at"] (used by Reconcile() and setting
result.ReceivedAt) to perform a type switch (int64, int32, int, float64,
json.Number, string) and convert each into an int64 unix seconds (parsing
json.Number or string as needed) before calling time.Unix(...,0) so
result.ReceivedAt is correctly set for those common BSON/JSON numeric encodings.
♻️ Duplicate comments (2)
fault-remediation/pkg/reconciler/reconciler.go (2)

106-106: Guard against nil event.Event before assignment.
Line 106 can panic if event.Event is nil.

🛠 Suggested fix
-	event.Event["_received_at"] = start.Unix()
+	if event.Event == nil {
+		event.Event = map[string]any{}
+	}
+	event.Event["_received_at"] = start.Unix()

229-231: Skip CR generation metric when ReceivedAt is zero.
time.Since(time.Time{}) produces huge durations and pollutes metrics.

🛠 Suggested fix
-		metrics.CRGenerationDuration.Observe(time.Since(healthEventWithStatus.ReceivedAt).Seconds())
+		if !healthEventWithStatus.ReceivedAt.IsZero() {
+			metrics.CRGenerationDuration.Observe(time.Since(healthEventWithStatus.ReceivedAt).Seconds())
+		} else {
+			slog.Warn("ReceivedAt is zero; skipping CR generation duration metric", "node", nodeName)
+		}

@nitz2407 nitz2407 force-pushed the nitijain/HIPPO-2400 branch 2 times, most recently from c255847 to d598364 on January 20, 2026 at 12:01
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@fault-remediation/pkg/events/health_event.go`:
- Around line 17-27: The ReceivedAt field in the HealthEventDoc struct is only
tagged with json:"-" but still will be serialized to MongoDB via BSON; update
the struct so that the ReceivedAt field also has bson:"-" to prevent it from
persisting. Locate HealthEventDoc and modify the ReceivedAt field's tags (the
field name ReceivedAt in file health_event.go) to include bson:"-" alongside
json:"-" so MongoDB drivers will ignore it.
♻️ Duplicate comments (1)
node-drainer/pkg/reconciler/reconciler.go (1)

519-527: Handle float64 _received_at to avoid dropped metrics.
If the event is read from JSONB (e.g., Postgres), numbers typically unmarshal as float64, so the int64 assertion can fail and skip observation.

🛠 Suggested fix
-	if receivedAtRaw, ok := event["_received_at"]; ok {
-		if receivedAtUnix, ok := receivedAtRaw.(int64); ok {
-			receivedAt := time.Unix(receivedAtUnix, 0)
-			evictionDuration := time.Since(receivedAt).Seconds()
-			metrics.PodEvictionDuration.Observe(evictionDuration)
-		} else {
-			slog.Warn("Invalid type for _received_at timestamp", "node", nodeName)
-		}
-	}
+	if receivedAtRaw, ok := event["_received_at"]; ok {
+		var receivedAtUnix int64
+		switch v := receivedAtRaw.(type) {
+		case int64:
+			receivedAtUnix = v
+		case float64:
+			receivedAtUnix = int64(v)
+		default:
+			slog.Warn("Invalid type for _received_at timestamp", "node", nodeName, "type", fmt.Sprintf("%T", receivedAtRaw))
+		}
+		if receivedAtUnix > 0 {
+			receivedAt := time.Unix(receivedAtUnix, 0)
+			evictionDuration := time.Since(receivedAt).Seconds()
+			metrics.PodEvictionDuration.Observe(evictionDuration)
+		}
+	}
🧹 Nitpick comments (1)
fault-quarantine/pkg/metrics/metrics.go (1)

192-195: Consider removing the unused nodeName parameter.
It’s not used by the histogram and may confuse callers unless you plan to add labels.

♻️ Optional cleanup
-func RecordNodeCordonDuration(nodeName string, generatedTimestamp time.Time) {
+func RecordNodeCordonDuration(generatedTimestamp time.Time) {
 	NodeCordonDuration.Observe(time.Since(generatedTimestamp).Seconds())
}

Update call sites accordingly (e.g., in fault-quarantine/pkg/reconciler/reconciler.go).

@nitz2407 nitz2407 force-pushed the nitijain/HIPPO-2400 branch from d598364 to b57001e on January 20, 2026 at 12:13
@github-actions

Merging this branch will increase overall coverage

Impacted Packages Coverage Δ 🤖
github.com/nvidia/nvsentinel/fault-remediation 0.00% (ø)
github.com/nvidia/nvsentinel/fault-remediation/pkg/annotation 32.12% (ø)
github.com/nvidia/nvsentinel/fault-remediation/pkg/crstatus 30.58% (ø)
github.com/nvidia/nvsentinel/fault-remediation/pkg/events 0.00% (ø)
github.com/nvidia/nvsentinel/fault-remediation/pkg/initializer 0.00% (ø)
github.com/nvidia/nvsentinel/fault-remediation/pkg/metrics 0.00% (ø)
github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler 21.58% (+0.22%) 👍
github.com/nvidia/nvsentinel/fault-remediation/pkg/remediation 28.16% (ø)
github.com/nvidia/nvsentinel/health-events-analyzer 0.00% (ø)
github.com/nvidia/nvsentinel/health-events-analyzer/pkg/config 0.00% (ø)
github.com/nvidia/nvsentinel/health-events-analyzer/pkg/publisher 22.58% (ø)
github.com/nvidia/nvsentinel/health-events-analyzer/pkg/reconciler 25.48% (ø)
github.com/nvidia/nvsentinel/store-client/pkg/client 6.11% (+0.01%) 👍
github.com/nvidia/nvsentinel/store-client/pkg/datastore/providers/postgresql 5.19% (ø)
github.com/nvidia/nvsentinel/tests 0.00% (ø)
github.com/nvidia/nvsentinel/tests/helpers 0.00% (ø)

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/nvidia/nvsentinel/fault-remediation/main.go 0.00% (ø) 433 0 433
github.com/nvidia/nvsentinel/fault-remediation/pkg/annotation/annotation.go 32.12% (ø) 358 115 243
github.com/nvidia/nvsentinel/fault-remediation/pkg/annotation/annotation_interface.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/fault-remediation/pkg/crstatus/checker.go 30.58% (ø) 206 63 143
github.com/nvidia/nvsentinel/fault-remediation/pkg/crstatus/crstatus_interface.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/fault-remediation/pkg/events/health_event.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/fault-remediation/pkg/initializer/init.go 0.00% (ø) 248 0 248
github.com/nvidia/nvsentinel/fault-remediation/pkg/metrics/metrics.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/annotation.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/reconciler.go 21.58% (+0.22%) 1325 (+38) 286 (+11) 1039 (+27) 👍
github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/remediation.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/fault-remediation/pkg/remediation/fault_remediation_client_interface.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/fault-remediation/pkg/remediation/remediation.go 28.16% (ø) 1289 363 926
github.com/nvidia/nvsentinel/health-events-analyzer/main.go 0.00% (ø) 278 0 278
github.com/nvidia/nvsentinel/health-events-analyzer/pkg/config/rules.go 0.00% (ø) 14 0 14
github.com/nvidia/nvsentinel/health-events-analyzer/pkg/publisher/publisher.go 22.58% (ø) 186 42 144
github.com/nvidia/nvsentinel/health-events-analyzer/pkg/reconciler/metrics.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/health-events-analyzer/pkg/reconciler/reconciler.go 25.48% (ø) 526 134 392
github.com/nvidia/nvsentinel/store-client/pkg/client/mongodb_client.go 3.47% (ø) 2966 103 2863
github.com/nvidia/nvsentinel/store-client/pkg/client/mongodb_pipeline_builder.go 20.00% (ø) 60 12 48
github.com/nvidia/nvsentinel/store-client/pkg/client/pipeline_builder.go 17.02% (ø) 47 8 39
github.com/nvidia/nvsentinel/store-client/pkg/client/postgresql_pipeline_builder.go 20.00% (ø) 60 12 48
github.com/nvidia/nvsentinel/tests/helpers/health_events_analyzer.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/tests/helpers/kube.go 0.00% (ø) 0 0 0

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/nvidia/nvsentinel/fault-remediation/pkg/annotation/annotation_test.go
  • github.com/nvidia/nvsentinel/fault-remediation/pkg/crstatus/crstatus_test.go
  • github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/annotation_test.go
  • github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/reconciler_test.go
  • github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/remediation_test.go
  • github.com/nvidia/nvsentinel/fault-remediation/pkg/remediation/remediation_test.go
  • github.com/nvidia/nvsentinel/health-events-analyzer/pkg/reconciler/reconciler_test.go
  • github.com/nvidia/nvsentinel/store-client/pkg/client/pipeline_builder_test.go
  • github.com/nvidia/nvsentinel/store-client/pkg/datastore/providers/postgresql/sql_filter_builder_test.go
  • github.com/nvidia/nvsentinel/tests/health_events_analyzer_test.go

@nitz2407 nitz2407 force-pushed the nitijain/HIPPO-2400 branch from b57001e to f1d00fa on January 21, 2026 at 11:38
@github-actions

Merging this branch will increase overall coverage

Impacted Packages Coverage Δ 🤖
github.com/nvidia/nvsentinel/commons/pkg/metricsutil 18.18% (+18.18%) 🎉
github.com/nvidia/nvsentinel/fault-quarantine/pkg/metrics 47.37% (ø)
github.com/nvidia/nvsentinel/fault-quarantine/pkg/reconciler 24.01% (+0.09%) 👍
github.com/nvidia/nvsentinel/fault-remediation 0.00% (ø)
github.com/nvidia/nvsentinel/fault-remediation/pkg/annotation 32.12% (ø)
github.com/nvidia/nvsentinel/fault-remediation/pkg/crstatus 30.58% (ø)
github.com/nvidia/nvsentinel/fault-remediation/pkg/events 0.00% (ø)
github.com/nvidia/nvsentinel/fault-remediation/pkg/initializer 0.00% (ø)
github.com/nvidia/nvsentinel/fault-remediation/pkg/metrics 0.00% (ø)
github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler 21.77% (+0.40%) 👍
github.com/nvidia/nvsentinel/fault-remediation/pkg/remediation 28.16% (ø)
github.com/nvidia/nvsentinel/health-events-analyzer 0.00% (ø)
github.com/nvidia/nvsentinel/health-events-analyzer/pkg/config 0.00% (ø)
github.com/nvidia/nvsentinel/health-events-analyzer/pkg/publisher 22.58% (ø)
github.com/nvidia/nvsentinel/health-events-analyzer/pkg/reconciler 25.48% (ø)
github.com/nvidia/nvsentinel/node-drainer/pkg/reconciler 50.78% (+0.14%) 👍
github.com/nvidia/nvsentinel/store-client/pkg/client 6.10% (ø)
github.com/nvidia/nvsentinel/store-client/pkg/datastore/providers/postgresql 5.19% (ø)
github.com/nvidia/nvsentinel/tests 0.00% (ø)
github.com/nvidia/nvsentinel/tests/helpers 0.00% (ø)

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/nvidia/nvsentinel/commons/pkg/metricsutil/duration.go 18.18% (+18.18%) 33 (+33) 6 (+6) 27 (+27) 🎉
github.com/nvidia/nvsentinel/fault-quarantine/pkg/metrics/metrics.go 47.37% (ø) 19 9 10
github.com/nvidia/nvsentinel/fault-quarantine/pkg/reconciler/reconciler.go 24.01% (+0.09%) 2620 (+15) 629 (+6) 1991 (+9) 👍
github.com/nvidia/nvsentinel/fault-remediation/main.go 0.00% (ø) 433 0 433
github.com/nvidia/nvsentinel/fault-remediation/pkg/annotation/annotation.go 32.12% (ø) 358 115 243
github.com/nvidia/nvsentinel/fault-remediation/pkg/annotation/annotation_interface.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/fault-remediation/pkg/crstatus/checker.go 30.58% (ø) 206 63 143
github.com/nvidia/nvsentinel/fault-remediation/pkg/crstatus/crstatus_interface.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/fault-remediation/pkg/events/health_event.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/fault-remediation/pkg/initializer/init.go 0.00% (ø) 248 0 248
github.com/nvidia/nvsentinel/fault-remediation/pkg/metrics/metrics.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/annotation.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/reconciler.go 21.77% (+0.40%) 1332 (+45) 290 (+15) 1042 (+30) 👍
github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/remediation.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/fault-remediation/pkg/remediation/fault_remediation_client_interface.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/fault-remediation/pkg/remediation/remediation.go 28.16% (ø) 1289 363 926
github.com/nvidia/nvsentinel/health-events-analyzer/main.go 0.00% (ø) 278 0 278
github.com/nvidia/nvsentinel/health-events-analyzer/pkg/config/rules.go 0.00% (ø) 14 0 14
github.com/nvidia/nvsentinel/health-events-analyzer/pkg/publisher/publisher.go 22.58% (ø) 186 42 144
github.com/nvidia/nvsentinel/health-events-analyzer/pkg/reconciler/metrics.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/health-events-analyzer/pkg/reconciler/reconciler.go 25.48% (ø) 526 134 392
github.com/nvidia/nvsentinel/node-drainer/pkg/reconciler/reconciler.go 50.78% (+0.14%) 644 (+14) 327 (+8) 317 (+6) 👍
github.com/nvidia/nvsentinel/store-client/pkg/client/mongodb_client.go 3.47% (ø) 2966 103 2863
github.com/nvidia/nvsentinel/store-client/pkg/client/mongodb_pipeline_builder.go 20.00% (ø) 60 12 48
github.com/nvidia/nvsentinel/store-client/pkg/client/pipeline_builder.go 17.02% (ø) 47 8 39
github.com/nvidia/nvsentinel/store-client/pkg/client/postgresql_pipeline_builder.go 20.00% (ø) 60 12 48
github.com/nvidia/nvsentinel/tests/helpers/health_events_analyzer.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/tests/helpers/kube.go 0.00% (ø) 0 0 0

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/nvidia/nvsentinel/commons/pkg/metricsutil/duration_test.go
  • github.com/nvidia/nvsentinel/fault-quarantine/pkg/reconciler/reconciler_e2e_test.go
  • github.com/nvidia/nvsentinel/fault-remediation/pkg/annotation/annotation_test.go
  • github.com/nvidia/nvsentinel/fault-remediation/pkg/crstatus/crstatus_test.go
  • github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/annotation_test.go
  • github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/reconciler_test.go
  • github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/remediation_test.go
  • github.com/nvidia/nvsentinel/fault-remediation/pkg/remediation/remediation_test.go
  • github.com/nvidia/nvsentinel/health-events-analyzer/pkg/reconciler/reconciler_test.go
  • github.com/nvidia/nvsentinel/node-drainer/pkg/reconciler/reconciler_integration_test.go
  • github.com/nvidia/nvsentinel/node-drainer/pkg/reconciler/reconciler_test.go
  • github.com/nvidia/nvsentinel/store-client/pkg/client/pipeline_builder_test.go
  • github.com/nvidia/nvsentinel/store-client/pkg/datastore/providers/postgresql/sql_filter_builder_test.go
  • github.com/nvidia/nvsentinel/tests/health_events_analyzer_test.go

@nitz2407
Contributor Author

/ok to test e5bfda5

@copy-pr-bot

copy-pr-bot bot commented Jan 23, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@nitz2407 nitz2407 force-pushed the nitijain/HIPPO-2400 branch from 1d702c4 to 008aee9 on January 23, 2026 07:30
@github-actions

Merging this branch will increase overall coverage

Impacted Packages Coverage Δ 🤖
github.com/nvidia/nvsentinel/commons/pkg/metricsutil 18.18% (+18.18%) 🎉

Coverage by file

Changed unit test files

  • github.com/nvidia/nvsentinel/commons/pkg/metricsutil/duration_test.go

@nitz2407 nitz2407 force-pushed the nitijain/HIPPO-2400 branch from 008aee9 to 9e45fef on January 23, 2026 08:00
@github-actions

Merging this branch will increase overall coverage

Impacted Packages Coverage Δ 🤖
github.com/nvidia/nvsentinel/commons/pkg/metricsutil 18.18% (+18.18%) 🎉

Coverage by file

Changed unit test files

  • github.com/nvidia/nvsentinel/commons/pkg/metricsutil/duration_test.go

# Conflicts:
#	fault-remediation/pkg/reconciler/reconciler_test.go
#	node-drainer/pkg/reconciler/reconciler.go
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
store-client/pkg/datastore/providers/postgresql/health_events.go (1)

226-264: ⚠️ Potential issue | 🟠 Major

Timestamp fields not synced to JSONB document — consumers see null values.

The code writes quarantine_finish_timestamp and drain_finish_timestamp to table columns but does not propagate them into the document JSONB via jsonb_set. Since all read paths (FindHealthEventsByNode, FindHealthEventsByQuery, GetHealthEventByID) unmarshal exclusively from the JSONB document field and the struct fields are pointer types (*time.Time), missing keys unmarshal to nil. This affects both the non-nil branch (lines 226-264) and the else branch (lines 268-291).

The issue is most severe in UpdateHealthEventStatusByNode (lines 337-348), which updates only the raw columns without any JSONB synchronization.

Proposed fix for the non-nil branch
 			    document = jsonb_set(
 			        jsonb_set(
 			            jsonb_set(
 			                jsonb_set(
-			                    document,
+			                    jsonb_set(
+			                        jsonb_set(
+			                            document,
+			                            '{healtheventstatus,quarantinefinishtimestamp}',
+			                            to_jsonb($2::timestamp)
+			                        ),
+			                        '{healtheventstatus,drainfinishtimestamp}',
+			                        to_jsonb($5::timestamp)
+			                    ),
 			                    '{healtheventstatus,nodequarantined}',
 			                    to_jsonb($1::text)
 			                ),

Apply similar changes to the else branch and UpdateHealthEventStatusByNode.
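For illustration, a minimal sketch in Go (using database/sql) of an update that writes the new timestamp columns and mirrors the same values into the JSONB document in one statement, along the lines proposed above. The table, column, and JSONB path names are taken from the review comment; the PR's actual query may differ.

package postgresql

import (
	"context"
	"database/sql"
	"time"
)

// syncFinishTimestamps updates the raw timestamp columns and mirrors the same
// values into the JSONB document, so read paths that unmarshal only the
// document field see non-nil timestamps. Table, column, and path names follow
// the review comment above and are assumptions, not the PR's exact schema.
func syncFinishTimestamps(ctx context.Context, db *sql.DB, id string, quarantineFinish, drainFinish time.Time) error {
	const query = `
		UPDATE health_events
		SET quarantine_finish_timestamp = $2,
		    drain_finish_timestamp      = $3,
		    document = jsonb_set(
		        jsonb_set(
		            document,
		            '{healtheventstatus,quarantinefinishtimestamp}',
		            to_jsonb($2::timestamptz)
		        ),
		        '{healtheventstatus,drainfinishtimestamp}',
		        to_jsonb($3::timestamptz)
		    )
		WHERE id = $1`

	_, err := db.ExecContext(ctx, query, id, quarantineFinish, drainFinish)

	return err
}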

🤖 Fix all issues with AI agents
In `@data-models/pkg/model/health_event_extentions.go`:
- Line 47: The struct field QuarantineFinishTimestamp currently causes a linter
line-length failure; shorten the alignment whitespace before the type so the
declaration for QuarantineFinishTimestamp *time.Time
`bson:"quarantinefinishtimestamp,omitempty"
json:"quarantinefinishtimestamp,omitempty"` is <=120 chars, or if alignment
cannot be reduced without harming readability, add a nolint directive (matching
the style used for LastRemediationTimestamp) to the field tag to suppress the
linter error; locate the field by name QuarantineFinishTimestamp in
health_event_extentions.go and apply one of these fixes.

In `@store-client/pkg/client/convenience.go`:
- Around line 39-47: UpdateHealthEventNodeQuarantineStatus currently always sets
"healtheventstatus.quarantinefinishtimestamp", which lets UnQuarantined calls
overwrite the original finish time; change the function
(UpdateHealthEventNodeQuarantineStatus) to only include the
"healtheventstatus.quarantinefinishtimestamp" field in the fields map when the
new status indicates quarantine completion (e.g., status == "Quarantined"),
otherwise omit that key so un-quarantine or non-completing statuses don't
overwrite the existing timestamp.
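A sketch of the guard described above, assuming the fields-map helper UpdateDocumentStatusFields from interfaces.go; the function name and the "Quarantined" status string follow the comment, everything else is illustrative rather than the PR's exact code.

package client

import (
	"context"
	"time"
)

// DatabaseClient is the minimal slice of the store-client interface this
// sketch needs (see the UpdateDocumentStatusFields method discussed below).
type DatabaseClient interface {
	UpdateDocumentStatusFields(ctx context.Context, documentID string, fields map[string]interface{}) error
}

// UpdateHealthEventNodeQuarantineStatus records the quarantine status and only
// stamps quarantinefinishtimestamp when the status marks quarantine as
// completed, so a later UnQuarantined update cannot overwrite the original
// finish time.
func UpdateHealthEventNodeQuarantineStatus(ctx context.Context, db DatabaseClient, documentID, status string) error {
	fields := map[string]interface{}{
		"healtheventstatus.nodequarantined": status,
	}
	if status == "Quarantined" {
		fields["healtheventstatus.quarantinefinishtimestamp"] = time.Now().UTC()
	}

	return db.UpdateDocumentStatusFields(ctx, documentID, fields)
}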

In `@store-client/pkg/datastore/providers/postgresql/database_client.go`:
- Around line 314-315: The WHERE clause is hardcoded to "id = $N" which is
inconsistent with UpdateDocumentStatus's conditional use of "data->>'_id'" for
non-health_events tables; modify the code that builds whereClause (currently
using update.ToSQL() result and whereClause variable) to follow the same logic
as UpdateDocumentStatus: if c.tableName == "health_events" use "id = $%d"
otherwise use "data->>'_id' = $%d", and ensure the parameter index uses
len(args)+1 and the passed argument is the document id value; update any callers
accordingly so the predicate matches the table type.
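A sketch of the table-aware predicate the comment asks for, mirroring the UpdateDocumentStatus logic it references; the receiver type and field names here are placeholders for the real database client.

package postgresql

import "fmt"

// databaseClient stands in for the PR's real client type; only tableName
// matters for this sketch.
type databaseClient struct {
	tableName string
}

// whereClauseFor returns the predicate plus the argument list extended with
// the document id, matching UpdateDocumentStatus: health_events keys rows by
// its dedicated id column, every other table keys documents by data->>'_id'.
func (c *databaseClient) whereClauseFor(args []interface{}, documentID string) (string, []interface{}) {
	idx := len(args) + 1
	if c.tableName == "health_events" {
		return fmt.Sprintf("id = $%d", idx), append(args, documentID)
	}

	return fmt.Sprintf("data->>'_id' = $%d", idx), append(args, documentID)
}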

In `@store-client/pkg/datastore/providers/postgresql/datastore.go`:
- Around line 402-406: The warning is misleading because ADD COLUMN IF NOT
EXISTS suppresses "already exists" errors; if db.ExecContext(ctx,
timestampColumn) returns an error it's a real failure. Replace the slog.Warn
call in the timestampColumns loop with a proper failure handling: either
return/propagate the error from the enclosing function (like the schemas path)
or at minimum log it as an error (use slog.Error) and include the error object
plus the failing SQL (timestampColumn) for debugging. Update the handler around
db.ExecContext(ctx, timestampColumn) accordingly (references: timestampColumns,
db.ExecContext, slog.Warn).
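A sketch of the failure handling suggested above, assuming the ALTER TABLE statements live in a timestampColumns slice as described; the SQL strings here are placeholders, not the PR's real migrations.

package postgresql

import (
	"context"
	"database/sql"
	"fmt"
)

// addTimestampColumns applies the ADD COLUMN IF NOT EXISTS statements and
// propagates failures instead of merely warning.
func addTimestampColumns(ctx context.Context, db *sql.DB) error {
	timestampColumns := []string{
		"ALTER TABLE health_events ADD COLUMN IF NOT EXISTS quarantine_finish_timestamp timestamptz",
		"ALTER TABLE health_events ADD COLUMN IF NOT EXISTS drain_finish_timestamp timestamptz",
	}

	for _, stmt := range timestampColumns {
		if _, err := db.ExecContext(ctx, stmt); err != nil {
			// With IF NOT EXISTS, an "already exists" condition never reaches
			// here, so any error is a genuine failure worth surfacing.
			return fmt.Errorf("adding timestamp column failed (sql: %q): %w", stmt, err)
		}
	}

	return nil
}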
🧹 Nitpick comments (2)
store-client/pkg/client/interfaces.go (1)

27-27: Add a godoc comment for the new interface method.

All other methods in DatabaseClient have doc comments. As per coding guidelines, exported Go functions require comments.

Suggested fix
 	UpdateDocumentStatus(ctx context.Context, documentID string, statusPath string, status interface{}) error
+	// UpdateDocumentStatusFields updates multiple status fields in a document in one operation.
+	// Keys in fields are dot-notation paths (e.g. "healtheventstatus.nodequarantined").
 	UpdateDocumentStatusFields(ctx context.Context, documentID string, fields map[string]interface{}) error

As per coding guidelines: "Function comments required for all exported Go functions".

store-client/pkg/datastore/providers/postgresql/health_events.go (1)

337-358: UpdateHealthEventStatusByNode also lacks JSONB sync for all fields (pre-existing + new).

This function updates only the table columns and does not touch the document JSONB at all. While this is a pre-existing gap, the two new timestamp columns (quarantine_finish_timestamp, drain_finish_timestamp) widen it. If any consumer reads events updated via this path and relies on the JSONB document (which all read paths do), they will see stale data.

Consider adding JSONB jsonb_set calls here consistent with UpdateHealthEventStatus, or document why this function intentionally skips JSONB sync.

@nitz2407 nitz2407 force-pushed the nitijain/HIPPO-2400 branch 3 times, most recently from 0f69523 to b639b31 on February 10, 2026 10:07
@nitz2407 nitz2407 force-pushed the nitijain/HIPPO-2400 branch from b639b31 to 707df37 on February 10, 2026 10:14
@github-actions

Merging this branch changes the coverage (2 decrease, 1 increase)

Impacted Packages Coverage Δ 🤖
github.com/nvidia/nvsentinel/data-models/pkg/model 0.00% (ø)
github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler 20.89% (+0.06%) 👍
github.com/nvidia/nvsentinel/health-events-analyzer/pkg/reconciler 25.48% (ø)
github.com/nvidia/nvsentinel/health-monitors/kubernetes-object-monitor/pkg/config 0.00% (ø)
github.com/nvidia/nvsentinel/health-monitors/kubernetes-object-monitor/pkg/controller 0.00% (ø)
github.com/nvidia/nvsentinel/health-monitors/kubernetes-object-monitor/pkg/publisher 0.00% (ø)
github.com/nvidia/nvsentinel/platform-connectors/pkg/connectors/store 63.93% (ø)
github.com/nvidia/nvsentinel/preflight-checks/nccl-loopback 0.00% (ø)
github.com/nvidia/nvsentinel/preflight-checks/nccl-loopback/pkg/benchmark 0.00% (ø)
github.com/nvidia/nvsentinel/preflight-checks/nccl-loopback/pkg/config 0.00% (ø)
github.com/nvidia/nvsentinel/preflight-checks/nccl-loopback/pkg/health 0.00% (ø)
github.com/nvidia/nvsentinel/preflight-checks/ping 0.00% (ø)
github.com/nvidia/nvsentinel/preflight/pkg/config 0.00% (ø)
github.com/nvidia/nvsentinel/preflight/pkg/webhook 0.00% (ø)
github.com/nvidia/nvsentinel/store-client/pkg/client 6.00% (-0.10%) 👎
github.com/nvidia/nvsentinel/store-client/pkg/datastore/providers/mongodb 6.83% (ø)
github.com/nvidia/nvsentinel/store-client/pkg/datastore/providers/postgresql 5.12% (-0.07%) 👎
github.com/nvidia/nvsentinel/tests 0.00% (ø)
github.com/nvidia/nvsentinel/tests/helpers 0.00% (ø)

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/nvidia/nvsentinel/data-models/pkg/model/health_event_extentions.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/health-monitors/kubernetes-object-monitor/pkg/config/types.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/health-monitors/kubernetes-object-monitor/pkg/controller/reconciler.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/preflight-checks/nccl-loopback/main.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/preflight-checks/nccl-loopback/pkg/benchmark/benchmark.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/preflight-checks/nccl-loopback/pkg/config/config.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/preflight-checks/nccl-loopback/pkg/health/reporter.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/preflight-checks/ping/main.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/preflight/pkg/config/config.go 0.00% (ø) 13 0 13
github.com/nvidia/nvsentinel/preflight/pkg/webhook/injector.go 0.00% (ø) 87 0 87
github.com/nvidia/nvsentinel/store-client/pkg/client/convenience.go 4.03% (-0.15%) 596 (+21) 24 572 (+21) 👎
github.com/nvidia/nvsentinel/store-client/pkg/client/mongodb_client.go 3.43% (-0.04%) 3115 (+149) 107 (+4) 3008 (+145) 👎
github.com/nvidia/nvsentinel/store-client/pkg/client/mongodb_pipeline_builder.go 19.61% (-0.39%) 102 (+42) 20 (+8) 82 (+34) 👎
github.com/nvidia/nvsentinel/store-client/pkg/client/postgresql_client.go 8.41% (-0.25%) 9466 (+300) 796 (+2) 8670 (+298) 👎
github.com/nvidia/nvsentinel/store-client/pkg/datastore/providers/postgresql/database_client.go 0.00% (ø) 4176 (+217) 0 4176 (+217)
github.com/nvidia/nvsentinel/store-client/pkg/datastore/providers/postgresql/datastore.go 5.22% (-0.18%) 1149 (+39) 60 1089 (+39) 👎
github.com/nvidia/nvsentinel/tests/helpers/fault_quarantine.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/tests/helpers/kube.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/tests/helpers/kubernetes_object_monitor.go 0.00% (ø) 0 0 0

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/reconciler_test.go
  • github.com/nvidia/nvsentinel/health-events-analyzer/pkg/reconciler/reconciler_test.go
  • github.com/nvidia/nvsentinel/health-monitors/kubernetes-object-monitor/pkg/controller/reconciler_test.go
  • github.com/nvidia/nvsentinel/platform-connectors/pkg/connectors/store/store_connector_test.go
  • github.com/nvidia/nvsentinel/preflight-checks/nccl-loopback/pkg/benchmark/benchmark_test.go
  • github.com/nvidia/nvsentinel/store-client/pkg/client/mongodb_pipeline_test.go
  • github.com/nvidia/nvsentinel/store-client/pkg/datastore/providers/mongodb/health_store_test.go
  • github.com/nvidia/nvsentinel/tests/csp_health_monitor_test.go
  • github.com/nvidia/nvsentinel/tests/fault_quarantine_test.go
  • github.com/nvidia/nvsentinel/tests/gpu_health_monitor_test.go
  • github.com/nvidia/nvsentinel/tests/kubernetes_object_monitor_test.go
  • github.com/nvidia/nvsentinel/tests/syslog_health_monitor_test.go

@nitz2407 nitz2407 force-pushed the nitijain/HIPPO-2400 branch 2 times, most recently from f6fb17f to 6a6ef91 on February 10, 2026 17:56
@github-actions

Merging this branch changes the coverage (1 decrease, 1 increase)

Impacted Packages Coverage Δ 🤖
github.com/nvidia/nvsentinel/fault-quarantine/pkg/reconciler 24.03% (+0.11%) 👍
github.com/nvidia/nvsentinel/janitor 0.00% (ø)
github.com/nvidia/nvsentinel/janitor/api/v1alpha1 24.78% (ø)
github.com/nvidia/nvsentinel/janitor/pkg/config 24.91% (ø)
github.com/nvidia/nvsentinel/janitor/pkg/controller 17.71% (ø)
github.com/nvidia/nvsentinel/janitor/pkg/distributedlock 34.64% (-0.74%) 👎
github.com/nvidia/nvsentinel/janitor/pkg/gpuservices 27.50% (ø)
github.com/nvidia/nvsentinel/janitor/pkg/metrics 40.00% (ø)
github.com/nvidia/nvsentinel/janitor/pkg/webhook/v1alpha1 18.72% (ø)

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/nvidia/nvsentinel/fault-quarantine/pkg/reconciler/reconciler.go 24.03% (+0.11%) 2626 (+21) 631 (+8) 1995 (+13) 👍
github.com/nvidia/nvsentinel/janitor/api/v1alpha1/gpureset_types.go 71.43% (ø) 7 5 2
github.com/nvidia/nvsentinel/janitor/api/v1alpha1/zz_generated.deepcopy.go 21.09% (ø) 1233 260 973
github.com/nvidia/nvsentinel/janitor/main.go 0.00% (ø) 1127 0 1127
github.com/nvidia/nvsentinel/janitor/pkg/config/config.go 23.33% (ø) 210 49 161
github.com/nvidia/nvsentinel/janitor/pkg/config/default.go 25.82% (ø) 364 94 270
github.com/nvidia/nvsentinel/janitor/pkg/controller/gpureset_controller.go 18.53% (ø) 3595 666 2929
github.com/nvidia/nvsentinel/janitor/pkg/controller/rebootnode_controller.go 17.33% (ø) 854 148 706
github.com/nvidia/nvsentinel/janitor/pkg/controller/terminatenode_controller.go 18.36% (ø) 828 152 676
github.com/nvidia/nvsentinel/janitor/pkg/controller/utils.go 0.00% (ø) 165 0 165
github.com/nvidia/nvsentinel/janitor/pkg/gpuservices/manager.go 27.50% (ø) 120 33 87
github.com/nvidia/nvsentinel/janitor/pkg/metrics/metrics.go 40.00% (ø) 50 20 30
github.com/nvidia/nvsentinel/janitor/pkg/webhook/v1alpha1/janitor_webhook.go 18.72% (ø) 1293 242 1051

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/nvidia/nvsentinel/janitor/api/v1alpha1/gpureset_types_test.go
  • github.com/nvidia/nvsentinel/janitor/api/v1alpha1/suite_test.go
  • github.com/nvidia/nvsentinel/janitor/pkg/config/config_test.go
  • github.com/nvidia/nvsentinel/janitor/pkg/controller/gpureset_controller_test.go
  • github.com/nvidia/nvsentinel/janitor/pkg/controller/rebootnode_controller_test.go
  • github.com/nvidia/nvsentinel/janitor/pkg/controller/terminatenode_controller_test.go
  • github.com/nvidia/nvsentinel/janitor/pkg/distributedlock/nodelock_test.go
  • github.com/nvidia/nvsentinel/janitor/pkg/gpuservices/manager_test.go
  • github.com/nvidia/nvsentinel/janitor/pkg/webhook/v1alpha1/janitor_webhook_test.go

@nitz2407 nitz2407 force-pushed the nitijain/HIPPO-2400 branch from 6a6ef91 to cddfaf7 on February 11, 2026 06:52
@github-actions

Merging this branch changes the coverage (2 decrease, 1 increase)

Impacted Packages Coverage Δ 🤖
github.com/nvidia/nvsentinel/fault-quarantine/pkg/reconciler 24.03% (+0.11%) 👍
github.com/nvidia/nvsentinel/janitor 0.00% (ø)
github.com/nvidia/nvsentinel/janitor/api/v1alpha1 24.78% (ø)
github.com/nvidia/nvsentinel/janitor/pkg/config 24.91% (ø)
github.com/nvidia/nvsentinel/janitor/pkg/controller 17.71% (ø)
github.com/nvidia/nvsentinel/janitor/pkg/distributedlock 34.64% (-0.74%) 👎
github.com/nvidia/nvsentinel/janitor/pkg/gpuservices 27.50% (ø)
github.com/nvidia/nvsentinel/janitor/pkg/metrics 40.00% (ø)
github.com/nvidia/nvsentinel/janitor/pkg/webhook/v1alpha1 18.72% (ø)
github.com/nvidia/nvsentinel/store-client/pkg/client 5.98% (-0.12%) 👎

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/nvidia/nvsentinel/fault-quarantine/pkg/reconciler/reconciler.go 24.03% (+0.11%) 2626 (+21) 631 (+8) 1995 (+13) 👍
github.com/nvidia/nvsentinel/janitor/api/v1alpha1/gpureset_types.go 71.43% (ø) 7 5 2
github.com/nvidia/nvsentinel/janitor/api/v1alpha1/zz_generated.deepcopy.go 21.09% (ø) 1233 260 973
github.com/nvidia/nvsentinel/janitor/main.go 0.00% (ø) 1127 0 1127
github.com/nvidia/nvsentinel/janitor/pkg/config/config.go 23.33% (ø) 210 49 161
github.com/nvidia/nvsentinel/janitor/pkg/config/default.go 25.82% (ø) 364 94 270
github.com/nvidia/nvsentinel/janitor/pkg/controller/gpureset_controller.go 18.53% (ø) 3595 666 2929
github.com/nvidia/nvsentinel/janitor/pkg/controller/rebootnode_controller.go 17.33% (ø) 854 148 706
github.com/nvidia/nvsentinel/janitor/pkg/controller/terminatenode_controller.go 18.36% (ø) 828 152 676
github.com/nvidia/nvsentinel/janitor/pkg/controller/utils.go 0.00% (ø) 165 0 165
github.com/nvidia/nvsentinel/janitor/pkg/gpuservices/manager.go 27.50% (ø) 120 33 87
github.com/nvidia/nvsentinel/janitor/pkg/metrics/metrics.go 40.00% (ø) 50 20 30
github.com/nvidia/nvsentinel/janitor/pkg/webhook/v1alpha1/janitor_webhook.go 18.72% (ø) 1293 242 1051
github.com/nvidia/nvsentinel/store-client/pkg/client/convenience.go 4.03% (-0.15%) 596 (+21) 24 572 (+21) 👎

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/nvidia/nvsentinel/janitor/api/v1alpha1/gpureset_types_test.go
  • github.com/nvidia/nvsentinel/janitor/api/v1alpha1/suite_test.go
  • github.com/nvidia/nvsentinel/janitor/pkg/config/config_test.go
  • github.com/nvidia/nvsentinel/janitor/pkg/controller/gpureset_controller_test.go
  • github.com/nvidia/nvsentinel/janitor/pkg/controller/rebootnode_controller_test.go
  • github.com/nvidia/nvsentinel/janitor/pkg/controller/terminatenode_controller_test.go
  • github.com/nvidia/nvsentinel/janitor/pkg/distributedlock/nodelock_test.go
  • github.com/nvidia/nvsentinel/janitor/pkg/gpuservices/manager_test.go
  • github.com/nvidia/nvsentinel/janitor/pkg/webhook/v1alpha1/janitor_webhook_test.go

@nitz2407 nitz2407 force-pushed the nitijain/HIPPO-2400 branch from cddfaf7 to 2ebb98b on February 11, 2026 07:14
@github-actions

Merging this branch changes the coverage (1 decrease, 3 increase)

Impacted Packages Coverage Δ 🤖
github.com/nvidia/nvsentinel/fault-quarantine/pkg/reconciler 24.03% (+0.11%) 👍
github.com/nvidia/nvsentinel/janitor 0.00% (ø)
github.com/nvidia/nvsentinel/janitor/api/v1alpha1 24.78% (ø)
github.com/nvidia/nvsentinel/janitor/pkg/config 24.91% (ø)
github.com/nvidia/nvsentinel/janitor/pkg/controller 17.74% (+0.03%) 👍
github.com/nvidia/nvsentinel/janitor/pkg/distributedlock 35.38% (ø)
github.com/nvidia/nvsentinel/janitor/pkg/gpuservices 27.50% (ø)
github.com/nvidia/nvsentinel/janitor/pkg/metrics 40.00% (ø)
github.com/nvidia/nvsentinel/janitor/pkg/webhook/v1alpha1 18.72% (ø)
github.com/nvidia/nvsentinel/node-drainer/pkg/reconciler 38.58% (+0.58%) 👍
github.com/nvidia/nvsentinel/store-client/pkg/client 5.98% (-0.12%) 👎

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/nvidia/nvsentinel/fault-quarantine/pkg/reconciler/reconciler.go 24.03% (+0.11%) 2626 (+21) 631 (+8) 1995 (+13) 👍
github.com/nvidia/nvsentinel/janitor/api/v1alpha1/gpureset_types.go 71.43% (ø) 7 5 2
github.com/nvidia/nvsentinel/janitor/api/v1alpha1/zz_generated.deepcopy.go 21.09% (ø) 1233 260 973
github.com/nvidia/nvsentinel/janitor/main.go 0.00% (ø) 1127 0 1127
github.com/nvidia/nvsentinel/janitor/pkg/config/config.go 23.33% (ø) 210 49 161
github.com/nvidia/nvsentinel/janitor/pkg/config/default.go 25.82% (ø) 364 94 270
github.com/nvidia/nvsentinel/janitor/pkg/controller/gpureset_controller.go 18.58% (+0.06%) 3595 668 (+2) 2927 (-2) 👍
github.com/nvidia/nvsentinel/janitor/pkg/controller/rebootnode_controller.go 17.33% (ø) 854 148 706
github.com/nvidia/nvsentinel/janitor/pkg/controller/terminatenode_controller.go 18.36% (ø) 828 152 676
github.com/nvidia/nvsentinel/janitor/pkg/controller/utils.go 0.00% (ø) 165 0 165
github.com/nvidia/nvsentinel/janitor/pkg/gpuservices/manager.go 27.50% (ø) 120 33 87
github.com/nvidia/nvsentinel/janitor/pkg/metrics/metrics.go 40.00% (ø) 50 20 30
github.com/nvidia/nvsentinel/janitor/pkg/webhook/v1alpha1/janitor_webhook.go 18.72% (ø) 1293 242 1051
github.com/nvidia/nvsentinel/node-drainer/pkg/reconciler/reconciler.go 38.58% (+0.58%) 1011 (+32) 390 (+18) 621 (+14) 👍
github.com/nvidia/nvsentinel/store-client/pkg/client/convenience.go 4.03% (-0.15%) 596 (+21) 24 572 (+21) 👎

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/nvidia/nvsentinel/janitor/api/v1alpha1/gpureset_types_test.go
  • github.com/nvidia/nvsentinel/janitor/api/v1alpha1/suite_test.go
  • github.com/nvidia/nvsentinel/janitor/pkg/config/config_test.go
  • github.com/nvidia/nvsentinel/janitor/pkg/controller/gpureset_controller_test.go
  • github.com/nvidia/nvsentinel/janitor/pkg/controller/rebootnode_controller_test.go
  • github.com/nvidia/nvsentinel/janitor/pkg/controller/terminatenode_controller_test.go
  • github.com/nvidia/nvsentinel/janitor/pkg/distributedlock/nodelock_test.go
  • github.com/nvidia/nvsentinel/janitor/pkg/gpuservices/manager_test.go
  • github.com/nvidia/nvsentinel/janitor/pkg/webhook/v1alpha1/janitor_webhook_test.go

@nitz2407 nitz2407 force-pushed the nitijain/HIPPO-2400 branch from 2ebb98b to b7e4a8c on February 11, 2026 10:52
@github-actions

Merging this branch changes the coverage (1 decrease, 2 increase)

Impacted Packages Coverage Δ 🤖
github.com/nvidia/nvsentinel/fault-quarantine/pkg/metrics 47.37% (ø)
github.com/nvidia/nvsentinel/fault-quarantine/pkg/reconciler 24.03% (+0.11%) 👍
github.com/nvidia/nvsentinel/fault-remediation/pkg/metrics 0.00% (ø)
github.com/nvidia/nvsentinel/janitor 0.00% (ø)
github.com/nvidia/nvsentinel/janitor/api/v1alpha1 24.78% (ø)
github.com/nvidia/nvsentinel/janitor/pkg/config 24.91% (ø)
github.com/nvidia/nvsentinel/janitor/pkg/controller 17.71% (ø)
github.com/nvidia/nvsentinel/janitor/pkg/distributedlock 35.38% (ø)
github.com/nvidia/nvsentinel/janitor/pkg/gpuservices 27.50% (ø)
github.com/nvidia/nvsentinel/janitor/pkg/metrics 40.00% (ø)
github.com/nvidia/nvsentinel/janitor/pkg/webhook/v1alpha1 18.72% (ø)
github.com/nvidia/nvsentinel/node-drainer/pkg/metrics 0.00% (ø)
github.com/nvidia/nvsentinel/node-drainer/pkg/reconciler 38.58% (+0.58%) 👍
github.com/nvidia/nvsentinel/store-client/pkg/client 5.98% (-0.12%) 👎

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/nvidia/nvsentinel/fault-quarantine/pkg/metrics/metrics.go 47.37% (ø) 19 9 10
github.com/nvidia/nvsentinel/fault-quarantine/pkg/reconciler/reconciler.go 24.03% (+0.11%) 2626 (+21) 631 (+8) 1995 (+13) 👍
github.com/nvidia/nvsentinel/fault-remediation/pkg/metrics/metrics.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/janitor/api/v1alpha1/gpureset_types.go 71.43% (ø) 7 5 2
github.com/nvidia/nvsentinel/janitor/api/v1alpha1/zz_generated.deepcopy.go 21.09% (ø) 1233 260 973
github.com/nvidia/nvsentinel/janitor/main.go 0.00% (ø) 1127 0 1127
github.com/nvidia/nvsentinel/janitor/pkg/config/config.go 23.33% (ø) 210 49 161
github.com/nvidia/nvsentinel/janitor/pkg/config/default.go 25.82% (ø) 364 94 270
github.com/nvidia/nvsentinel/janitor/pkg/controller/gpureset_controller.go 18.53% (ø) 3595 666 2929
github.com/nvidia/nvsentinel/janitor/pkg/controller/rebootnode_controller.go 17.33% (ø) 854 148 706
github.com/nvidia/nvsentinel/janitor/pkg/controller/terminatenode_controller.go 18.36% (ø) 828 152 676
github.com/nvidia/nvsentinel/janitor/pkg/controller/utils.go 0.00% (ø) 165 0 165
github.com/nvidia/nvsentinel/janitor/pkg/gpuservices/manager.go 27.50% (ø) 120 33 87
github.com/nvidia/nvsentinel/janitor/pkg/metrics/metrics.go 40.00% (ø) 50 20 30
github.com/nvidia/nvsentinel/janitor/pkg/webhook/v1alpha1/janitor_webhook.go 18.72% (ø) 1293 242 1051
github.com/nvidia/nvsentinel/node-drainer/pkg/metrics/metrics.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/node-drainer/pkg/reconciler/reconciler.go 38.58% (+0.58%) 1011 (+32) 390 (+18) 621 (+14) 👍
github.com/nvidia/nvsentinel/store-client/pkg/client/convenience.go 4.03% (-0.15%) 596 (+21) 24 572 (+21) 👎

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/nvidia/nvsentinel/janitor/api/v1alpha1/gpureset_types_test.go
  • github.com/nvidia/nvsentinel/janitor/api/v1alpha1/suite_test.go
  • github.com/nvidia/nvsentinel/janitor/pkg/config/config_test.go
  • github.com/nvidia/nvsentinel/janitor/pkg/controller/gpureset_controller_test.go
  • github.com/nvidia/nvsentinel/janitor/pkg/controller/rebootnode_controller_test.go
  • github.com/nvidia/nvsentinel/janitor/pkg/controller/terminatenode_controller_test.go
  • github.com/nvidia/nvsentinel/janitor/pkg/distributedlock/nodelock_test.go
  • github.com/nvidia/nvsentinel/janitor/pkg/gpuservices/manager_test.go
  • github.com/nvidia/nvsentinel/janitor/pkg/webhook/v1alpha1/janitor_webhook_test.go

@nitz2407 nitz2407 force-pushed the nitijain/HIPPO-2400 branch from 85e65b1 to b167c15 on February 12, 2026 06:18
@lalitadithya lalitadithya left a comment

Can you please add how we tested that the metrics match manual observations in the following cases:

  • pod restarts
  • long drains
  • cancelled breakfix
  • immediate mode eviction failed
  • immediate mode eviction success
  • delete after timeout

Collaborator

I don't understand what is being tested here. In the code we call time.Since(timestamp).Seconds(), and in the test we also call time.Since(timestamp).Seconds(), so when would this fail?

Contributor Author

This is to test the CalculateDurationSeconds utility function.

Collaborator

But the code we are using to test it is the same as the code we are using in the function, so I'm a bit lost on how this helps.

Contributor Author

This is just a unit test covering that specific block of code.

Contributor Author

Removed it.
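For context on this thread, a minimal sketch of how a duration helper can be written so that its test does not simply re-run the same time.Since call as the implementation: pass the reference time in explicitly. The name CalculateDurationSeconds comes from the discussion above; the signature and the fixed values in the test are assumptions.

package metricsutil

import (
	"testing"
	"time"
)

// CalculateDurationSeconds returns the elapsed time between start and now in
// seconds. Taking "now" as a parameter keeps the function deterministic, so a
// test can assert against a fixed expected value instead of mirroring
// time.Since.
func CalculateDurationSeconds(start, now time.Time) float64 {
	return now.Sub(start).Seconds()
}

func TestCalculateDurationSeconds(t *testing.T) {
	start := time.Date(2026, 2, 10, 16, 46, 33, 0, time.UTC)
	now := start.Add(284 * time.Second)

	if got := CalculateDurationSeconds(start, now); got != 284 {
		t.Fatalf("expected 284 seconds, got %v", got)
	}
}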

Collaborator

Why did we change this? The previous values seem to be correct here.

Contributor Author

This is just for consistency, since we are not using a manual bucket list anywhere else.

Collaborator

But consistency won't help here, right? How can retry attempts be less than 1? Am I missing something here?

Contributor Author

Prometheus provides a generic default bucket list that works for both decimal and whole-number use cases. No worries, let me revert this since it is adding confusion.

Contributor Author

Done
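For reference, a small sketch of the two bucket choices discussed in this thread, using the Prometheus Go client; the metric names are placeholders, not the ones added by the PR.

package metrics

import "github.com/prometheus/client_golang/prometheus"

// prometheus.DefBuckets is .005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10.
// It targets request latencies, so most of its buckets sit below 1, which is
// why it reads oddly for an attempt count that can never be below 1.
var retryAttemptsDefaultBuckets = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "example_retry_attempts_default_buckets",
	Help:    "Retry attempts, bucketed with prometheus.DefBuckets.",
	Buckets: prometheus.DefBuckets,
})

// An explicit whole-number list matches the domain of the observed value,
// which is the option this thread settled on reverting to.
var retryAttemptsExplicitBuckets = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "example_retry_attempts_explicit_buckets",
	Help:    "Retry attempts, bucketed with an explicit whole-number list.",
	Buckets: []float64{1, 2, 3, 5, 10, 20},
})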

Collaborator

Shouldn't this be exported after all the processing is completed, in this case after healtheventstatus.drainfinishtimestamp has been set? Otherwise it is possible that the status is never updated in the database and the next operation doesn't start.

Contributor Author

The intent is to emit the metric immediately after pods get evicted successfully, to measure the correct performance. If we emit it after updating node labels and the db, some delay might get added. Please share your thoughts; I might be overthinking this.

Collaborator

I think the intent is to emit the metric after the module has completed processing, so that we know how long the module took. For example, even if the pod is evicted, if ND doesn't complete the rest of the activities in mongodb then FR can't start. If we stop measuring once the pods are evicted, we will have gaps in our observability.

Contributor Author

What if the node drained successfully and the db update failed, are we OK to lose this data?

Contributor Author

Done.
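A sketch of the ordering agreed on in this thread: observe the eviction-duration histogram only after the module has finished its bookkeeping, including the database status update, so the metric reflects the full time until the next module can start. The function, helper, and metric names are illustrative, not the PR's.

package reconciler

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var podEvictionDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name: "example_pod_eviction_duration_seconds",
	Help: "Time from quarantine finish until node-drainer completes processing.",
})

// drainNode evicts pods, persists the drain finish status, and only then
// records the duration, so a gap between eviction and the DB update is not
// hidden from the metric. evictPods and updateDrainFinishStatus stand in for
// the PR's real helpers.
func drainNode(ctx context.Context, quarantineFinish time.Time,
	evictPods func(context.Context) error,
	updateDrainFinishStatus func(context.Context, time.Time) error) error {
	if err := evictPods(ctx); err != nil {
		return fmt.Errorf("evicting pods: %w", err)
	}

	if err := updateDrainFinishStatus(ctx, time.Now().UTC()); err != nil {
		// If the DB update fails, the reconciler retries and the metric is
		// recorded on the successful pass rather than being lost silently.
		return fmt.Errorf("updating drain finish status: %w", err)
	}

	// Observed only after the status is persisted, so the histogram covers the
	// full time until fault remediation can pick up the event.
	podEvictionDuration.Observe(time.Since(quarantineFinish).Seconds())

	return nil
}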

@nitz2407
Contributor Author

nitz2407 commented Feb 12, 2026

> Can you please add how we tested that the metrics match manual observations in the following cases:
>
> * pod restarts
> * long drains
> * cancelled breakfix
> * immediate mode eviction failed
> * immediate mode eviction success
> * delete after timeout

Pod restarts:
While injecting the fatal event, the node drainer pod was down due to an image pull error. After the event was injected successfully, I brought the node drainer pod back up and verified the log: {"time":"2026-02-10T16:51:17.446109565Z","level":"INFO","msg":"Node drainer evictionDuration is","module":"node-drainer","version":"dev","evictionDuration":284.411107465}

Verification is done by taking the delta of the log generation timestamp minus the quarantinefinishtimestamp stored in the db. If that delta matches evictionDuration, the perf measurement is correct (see the sketch after this comment).

Long drains:
Testing was done on dev clusters where user workloads are not present, so I didn't test this. Also, the length of the drain shouldn't matter here; the changes are agnostic to it, so I didn't get how this test is relevant.

Cancelled breakfix:
By the time I wanted to cancel an event, the pipeline had already processed it, since there are no user workloads on the dev clusters.

Immediate mode eviction failed/success and delete after timeout:
Since user workloads are not present on the dev clusters, it's difficult to test these scenarios.
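A small sketch of the cross-check described in this comment: take the quarantinefinishtimestamp stored in the db, the timestamp of the node-drainer log line, and compare their delta with the logged evictionDuration. The log timestamp and duration below come from the log quoted above; the quarantine finish value is an illustrative stand-in, not the value from the actual test run.

package main

import (
	"fmt"
	"math"
	"time"
)

func main() {
	// Timestamp and duration taken from the node-drainer log line quoted above.
	logTime, err := time.Parse(time.RFC3339Nano, "2026-02-10T16:51:17.446109565Z")
	if err != nil {
		panic(err)
	}
	loggedEvictionDuration := 284.411107465 // seconds

	// Illustrative stand-in for the quarantinefinishtimestamp read from the db.
	quarantineFinish, err := time.Parse(time.RFC3339Nano, "2026-02-10T16:46:33.035002458Z")
	if err != nil {
		panic(err)
	}

	delta := logTime.Sub(quarantineFinish).Seconds()
	fmt.Printf("delta=%.6fs logged=%.6fs diff=%.6fs\n",
		delta, loggedEvictionDuration, math.Abs(delta-loggedEvictionDuration))
}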

@lalitadithya
Collaborator

> While injecting the fatal event, the node drainer pod was down due to an image pull error. After the event was injected successfully, I brought the node drainer pod back up and verified the log

Can we test pod restarts while the drainer is processing events?

> Testing was done on dev clusters where user workloads are not present, so I didn't test this. Also, the length of the drain shouldn't matter here; the changes are agnostic to it, so I didn't get how this test is relevant.

Can we run some mock workloads to test? Long drains are going to be 90% of the cases; we need to make sure that longer drains show accurate numbers.

> By the time I wanted to cancel an event, the pipeline had already processed it, since there are no user workloads on the dev clusters.

Can we run some mock/simulated user workloads to test?

> Since user workloads are not present on the dev clusters, it's difficult to test these scenarios.

Can we run some mock/simulated user workloads to test?

@nitz2407 nitz2407 force-pushed the nitijain/HIPPO-2400 branch from c64c8d4 to 8bbe4d1 on February 18, 2026 09:52
# Conflicts:
#	fault-quarantine/pkg/reconciler/reconciler_e2e_test.go
#	fault-remediation/pkg/remediation/remediation.go
@nitz2407 nitz2407 force-pushed the nitijain/HIPPO-2400 branch from 2626494 to 340e538 on February 18, 2026 11:39
@github-actions

Merging this branch changes the coverage (2 decrease, 1 increase)

Impacted Packages Coverage Δ 🤖
github.com/nvidia/nvsentinel/fault-quarantine 0.00% (ø)
github.com/nvidia/nvsentinel/fault-quarantine/pkg/breaker 30.06% (ø)
github.com/nvidia/nvsentinel/fault-quarantine/pkg/common 0.00% (ø)
github.com/nvidia/nvsentinel/fault-quarantine/pkg/eventwatcher 2.53% (+2.53%) 👍
github.com/nvidia/nvsentinel/fault-quarantine/pkg/informer 30.27% (-0.06%) 👎
github.com/nvidia/nvsentinel/fault-quarantine/pkg/metrics 47.37% (ø)
github.com/nvidia/nvsentinel/fault-quarantine/pkg/reconciler 20.64% (ø)
github.com/nvidia/nvsentinel/fault-remediation/pkg/config 29.79% (ø)
github.com/nvidia/nvsentinel/fault-remediation/pkg/initializer 0.00% (ø)
github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler 20.88% (ø)
github.com/nvidia/nvsentinel/fault-remediation/pkg/remediation 27.85% (-0.25%) 👎
github.com/nvidia/nvsentinel/janitor/pkg/controller 17.84% (ø)
github.com/nvidia/nvsentinel/tests 0.00% (ø)
github.com/nvidia/nvsentinel/tests/helpers 0.00% (ø)

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/nvidia/nvsentinel/fault-quarantine/main.go 0.00% (ø) 246 0 246
github.com/nvidia/nvsentinel/fault-quarantine/pkg/breaker/breaker.go 30.06% (ø) 835 251 584
github.com/nvidia/nvsentinel/fault-quarantine/pkg/common/common.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/fault-quarantine/pkg/eventwatcher/event_watcher.go 2.53% (+2.53%) 712 (+62) 18 (+18) 694 (+44) 👍
github.com/nvidia/nvsentinel/fault-quarantine/pkg/informer/k8s_client.go 31.58% (ø) 1067 337 730
github.com/nvidia/nvsentinel/fault-quarantine/pkg/informer/node_informer.go 28.34% (-0.14%) 727 206 (-1) 521 (+1) 👎
github.com/nvidia/nvsentinel/fault-quarantine/pkg/metrics/metrics.go 47.37% (ø) 19 9 10
github.com/nvidia/nvsentinel/fault-quarantine/pkg/reconciler/reconciler.go 20.64% (ø) 2820 582 2238
github.com/nvidia/nvsentinel/fault-remediation/pkg/config/config.go 29.79% (ø) 339 101 238
github.com/nvidia/nvsentinel/fault-remediation/pkg/initializer/init.go 0.00% (ø) 342 0 342
github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/reconciler.go 20.88% (ø) 1312 274 1038
github.com/nvidia/nvsentinel/fault-remediation/pkg/remediation/remediation.go 27.85% (-0.25%) 1379 (+23) 384 (+3) 995 (+20) 👎
github.com/nvidia/nvsentinel/janitor/pkg/controller/gpureset_controller.go 18.55% (ø) 3601 668 2933
github.com/nvidia/nvsentinel/janitor/pkg/controller/utils.go 0.00% (ø) 127 0 127
github.com/nvidia/nvsentinel/tests/helpers/fault_remediation.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/tests/helpers/health_events_analyzer.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/tests/helpers/kube.go 0.00% (ø) 0 0 0

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/nvidia/nvsentinel/fault-quarantine/pkg/reconciler/reconciler_e2e_test.go
  • github.com/nvidia/nvsentinel/fault-remediation/pkg/config/config_test.go
  • github.com/nvidia/nvsentinel/fault-remediation/pkg/remediation/remediation_test.go
  • github.com/nvidia/nvsentinel/janitor/pkg/controller/gpureset_controller_test.go
  • github.com/nvidia/nvsentinel/tests/csp_health_monitor_test.go
  • github.com/nvidia/nvsentinel/tests/fault_management_test.go
  • github.com/nvidia/nvsentinel/tests/fault_remediation_test.go
  • github.com/nvidia/nvsentinel/tests/gpu_health_monitor_test.go
  • github.com/nvidia/nvsentinel/tests/gpu_reset_test.go
  • github.com/nvidia/nvsentinel/tests/janitor_test.go
  • github.com/nvidia/nvsentinel/tests/log_collector_test.go
  • github.com/nvidia/nvsentinel/tests/node_drainer_test.go
  • github.com/nvidia/nvsentinel/tests/scale_test.go
  • github.com/nvidia/nvsentinel/tests/smoke_test.go

@nitz2407
Contributor Author

Added the test results under the Testing section.
