feat: NIC health monitor design by KaivalyaMDabhadkar · Pull Request #752 · NVIDIA/NVSentinel

KaivalyaMDabhadkar · 2026-01-27T14:53:37Z

Summary

Preview the design

Type of Change

Component(s) Affected

Testing

Tests pass locally
Manual testing completed
No breaking changes (or documented)

Checklist

Self-review completed
Documentation updated (if needed)
Ready for review

Summary by CodeRabbit

Documentation
- Added a comprehensive NIC Health Monitor documentation suite: overview, three-layer architecture, link state detection, link counter degradation monitoring, syslog/driver error detection and correlation, severity model, configuration examples, operational guidance, event schemas, and integration notes for GPU cluster monitoring.

_{✏️ Tip: You can customize this high-level summary in your review settings.}

coderabbitai · 2026-01-27T14:53:57Z

📝 Walkthrough

Walkthrough

Adds four comprehensive NIC Health Monitor design documents describing a three-layer detection architecture (link state, link counters, syslog correlation), per-node daemons, a platform connector, centralized Health Events Analyzer with MongoDB, data flows, event schemas, and operational/configuration guidance for GPU clusters.

Changes

Cohort / File(s)	Summary
Overview & Top-level `docs/nic-health-monitor/Overview.md`	New high-level design document describing goals, architecture, severity model, data flows, and integration points (Platform Connector, Health Events Analyzer, MongoDB).
Link State Detection `docs/nic-health-monitor/link-state-detection.md`	New detailed spec for per-node link state polling (1s), device discovery, flap/stabilization rules, PCI checks, SR-IOV handling, event schema for fatal/non-fatal states, and configuration mappings for GPU ports.
Link Counter Detection `docs/nic-health-monitor/link-counter-detection.md`	New detailed spec for counter-based degradation monitoring (5s polling), counter sets (standard/extended), delta & rate calculations, threshold rules (delta/velocity/time windows), reset/wrap handling, config schema, data models, and event construction.
Syslog Detection & Correlation `docs/nic-health-monitor/syslog-detection-correlation.md`	New spec for kernel/syslog pattern detection, NIC driver error handlers, pattern sets and severities, integration with syslog-health-monitor, and correlation rules for Health Events Analyzer (diagnostic context vs. port-state driven decisions).

Sequence Diagram(s)

sequenceDiagram
    participant NodeDaemon as Node Daemon\n(Link State / Counters / Syslog)
    participant PlatformConnector as Platform Connector
    participant MongoDB as MongoDB (History)
    participant Analyzer as Health Events Analyzer
    participant Operator as Operator / Alerting

    NodeDaemon->>PlatformConnector: send raw events (state,counters,syslog)
    PlatformConnector->>MongoDB: store raw events / snapshots
    PlatformConnector->>Analyzer: forward events / metadata
    Analyzer->>MongoDB: query historical context / correlate
    Analyzer->>Operator: emit health events (fatal/non-fatal)
    Operator-->>Analyzer: acknowledge / trigger remediation

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A hop, a sniff, a tiny ear that twitches,
I guard the cables, counters, and log glitches.
Three layered checks — state, counters, and logs,
I chase the grey failures through fibers and frogs,
The rabbit nods: clusters hum, safe from hitches.

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the main change: adding comprehensive NIC health monitor design documentation with multiple focus areas (link state, counter, syslog detection).
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 5

🤖 Fix all issues with AI agents

In `@docs/nic-health-monitor.md`:
- Around line 1567-1576: Section 7.4's "Monitored Kernel Patterns" table is
missing two patterns that appear in Appendix A; either add
`mlx5_core.*ENABLE_HCA.*timeout` and `NETDEV WATCHDOG.*transmit queue.*timed
out` to the Section 7.4 table (alongside existing entries
`pci_power_insufficient`, `port_module_high_temp`, `cmd_exec_timeout`,
`unrecoverable_err`) with matching RecommendedAction and Meaning as in Appendix
A, or remove those two rows from Appendix A so both sections match; update the
entries where the table is defined in Section 7.4 (Monitored Kernel Patterns)
and ensure Appendix A's "Fatal Kernel Log Patterns" table reflects the same set
to keep the spec and quick reference consistent.
- Around line 675-677: The document contains conflicting guidance about the
link_downed counter; pick one consistent approach and update all references
accordingly: either (A) exclude link_downed from health monitoring — remove
link_downed from the "fatal counter" examples, the State Monitor data flow
lists, and the "Consolidated Deterministic Failure Thresholds" table (e.g.,
remove any Delta > 0 entries referring to link_downed), or (B) include it as
fatal — keep it in the examples/tables and augment the State Monitor and
threshold docs (e.g., the "Consolidated Deterministic Failure Thresholds" table
and State Monitor flow) with explicit detection logic for administrative events
(such as correlating link_downed deltas with kernel/driver reload log patterns,
process restarts, or sticky-window suppression logic) and update the note
currently at "Note on `link_downed`" to describe the distinguishing method;
ensure every occurrence of link_downed (examples, tables, and the State Monitor)
is updated to reflect the chosen approach.
- Around line 949-957: In the "5.6 RoCE Handling" section update the fatal
counters list to exclude or mark req_transport_retries_exceeded as Native IB
only: locate the section header "5.6 RoCE Handling" and the bullet list
containing symbol_error, link_downed, local_link_integrity_errors,
excessive_buffer_overrun_errors, req_transport_retries_exceeded and remove
req_transport_retries_exceeded from that RoCE-specific list (or append "(Native
IB only)" to that item) so the documentation matches the later statement that
req_transport_retries_exceeded is only available on Native InfiniBand.
- Line 537: The code references a misspelled protobuf enum symbol
pb.RecommenedAction; update that reference to pb.RecommendedAction wherever it's
used (specifically in the mapping for RecommendedAction) so the correct enum
name is used and the code compiles; also verify any related imports/aliases
still resolve to the protobuf package after the rename.
- Around line 133-146: The two out-of-order subsection headings should be
renumbered to sit under the "2.4 Hardware Failures Detectable by Local
Monitoring" section: change "#### 2.6.1 The `mlx5_core cmd_exec timeout`" to
"#### 2.4.1 The `mlx5_core cmd_exec timeout`" and change "#### 2.6.2 Transport
Retry Count Exceeded (Error 12)" to "#### 2.4.2 Transport Retry Count Exceeded
(Error 12)"; update the visible subsection numbers only (leave the heading text
and content unchanged) so the document hierarchy is correct.

🧹 Nitpick comments (2)

docs/nic-health-monitor.md (2)
575-597: Consider setting specific RecommendedAction in helper constructors.

The NewFatalEvent helper defaults to RecommendedAction_NONE, but Section 1.4 (lines 52-57) defines specific actions for different fatal conditions (e.g., RecommendedAction_REPLACE_VM for state=DOWN, RecommendedAction_RESTART_BM for cmd_exec timeout).

Callers must manually override this field after construction, which could lead to inconsistency. Consider either:

Adding a recommendedAction parameter to the constructor, or

Providing specialized constructors (e.g., NewFatalReplaceVMEvent, NewFatalRestartBMEvent)

This would enforce consistent action mapping at the type level.
♻️ Alternative: Add parameter to constructor
-func NewFatalEvent(deviceName, message string, linkLayer string) *NicHealthEvent {
+func NewFatalEvent(deviceName, message string, linkLayer string, recommendedAction RecommendedAction) *NicHealthEvent {
     checkName := "EthernetErrorCheck"
     if linkLayer == "InfiniBand" {
         checkName = "InfiniBandErrorCheck"
     }
 
     return &NicHealthEvent{
         Version:            1,
         Agent:              AGENT,
         CheckName:          checkName,
         ComponentClass:     "NIC",
         GeneratedTimestamp: time.Now(),
         Message:            message,
         IsFatal:            true,
         IsHealthy:          false,
         NodeName:           os.Getenv("NODE_NAME"),
-        RecommendedAction:  RecommendedAction_NONE,
+        RecommendedAction:  recommendedAction,
         EntitiesImpacted:   []Entity{{EntityType: "NIC", EntityValue: deviceName}},
         LinkLayer:          linkLayer,
     }
 }
82-84: Add language specifiers to fenced code blocks for better rendering.

Multiple fenced code blocks throughout the document lack language specifiers, which affects syntax highlighting and rendering. Static analysis identified blocks at lines 82, 110, 149, 210, 268, 289, 367, 383, 418, 439, 684, 1015, 1042, 1049, 1079, 1107, 1132, 1473, 1490.

Examples:

Line 82: Physical degradation flow diagram should use text or leave unmarked for plain text
Line 110: Queue Pair state machine should use text
Line 684: Counter reset problem should use text
Line 1015: GID table example should use text or bash
Line 1042: ASCII diagram should use text
Example fixes
-```
+```text
 Physical Impairment --> Eye Diagram Closes --> Symbol Errors --> FEC Corrections --> CRC Failures --> Packet Loss
- +text
RESET --> INIT --> RTR --> RTS
</details>

Based on learnings: As this is a design document, syntax highlighting is helpful for readability but not critical for correctness.

</blockquote></details>

</blockquote></details>

docs/nic-health-monitor.md

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In `@docs/nic-health-monitor.md`:
- Line 100: The docs reference an undefined counter `uncorrectable_fec_errors`;
add a Fatal Counter Reference entry defining this counter: provide its sysfs
path (e.g., the NIC sysfs file used to expose FEC uncorrectable counts), a
recommended fatal threshold for PAM4 links (e.g., sustained extreme rates like
>1e6/sec or a per-interval threshold), and the recommended action (alert, run
diagnostics, or replace module). Update the Fatal Counter Reference table (and
cross-reference Sections 5.1 / 5.1.1 / 5.8 / Appendix A) to include
`uncorrectable_fec_errors` so the monitoring spec contains the counter name,
sysfs path, threshold, and remediation guidance.

🧹 Nitpick comments (1)

docs/nic-health-monitor.md (1)

82-85: Consider adding language identifiers to fenced code blocks for better rendering.

Multiple fenced code blocks throughout the document (lines 82, 110, 149, 210, 268, 289, 367, 383, 417, 438, 683, 1014, 1041, 1048, 1078, 1106, 1131, 1477, 1494) lack language identifiers. While these are primarily ASCII diagrams and pseudocode, adding identifiers like ```text improves markdown rendering and accessibility.

docs/nic-health-monitor.md

lalitadithya · 2026-01-27T20:50:14Z

docs/nic-health-monitor.md

+Deduplication:
+└── Same message within 60s window --> Suppressed (prevents alert storms)


I don't think we need this, in our architecture we don't alert on these things, so it is perfectly valid to capture every error message, we do that for XIDs and we tend to get bursts of those.

docs/nic-health-monitor.md

lalitadithya · 2026-01-27T20:58:48Z

docs/nic-health-monitor.md

+[StateMonitoring]
+# Expected link speed in Gb/s. Links negotiating below this are FATAL.
+# Common values: 100, 200, 400 (NDR)
+TargetLinkSpeedGbps = 400


This seems to mean we will need per cluster configuration, are there ways to detect the expected speed by looking at the hardware perhaps?

lalitadithya · 2026-01-27T21:00:01Z

docs/nic-health-monitor.md

+}
+```
+
+### 5.4 Device Discovery and Parsing


Won't this also pick up NICs that are not configured by network operator? I thought the last time we did this, we ran into a lot of false positives because of this.

docs/nic-health-monitor.md

lalitadithya · 2026-01-27T21:11:37Z

This design is very detailed which is great! But I think it is trying to solve many things at once and I think we should try and reduce scope and also break it down into multiple smaller designs so that we can agree to smaller sections but while having the larger picture in mind.

For scope, I suggest we only focus on Mellanox IB and RoCE, no EFA, TCPXO, eth (yet) they can be follow ups that can be done once we have made progress on the implementation of this.

With respect to the break down, I think we should break it down into the following parts:

Link state detection, i.e., simple UP/DOWN detection of both states and NIC falling off
Link counter detection, i.e., parse and detect errors using NIC counters available on the hosts
Syslog detection, i.e., detect errors based on message in the syslog
Repeat failure detection, i.e., link flaps and other advanced correlation techniques

What do you think?

coderabbitai

Actionable comments posted: 2

🤖 Fix all issues with AI agents

In `@docs/nic-health-monitor/link-counter-detection.md`:
- Line 469: The "Fatal counters" code comment incorrectly lists symbol_error as
fatal; update that comment so symbol_error is not listed as a fatal counter and
instead aligns with the counter specification tables (which classify
symbol_error as non-fatal, rate-based >10/sec). Specifically edit the "Fatal
counters" list entry that currently includes `symbol_error` to remove it
(leaving `link_downed`, `local_link_integrity_errors`,
`excessive_buffer_overrun_errors`, `req_transport_retries_exceeded`) so the
comment and the tables (referencing `symbol_error`) are consistent.

In `@docs/nic-health-monitor/syslog-detection-correlation.md`:
- Around line 94-104: The diagram header "PATTERN MATCHING FOR (all emitted as
WARNING)" incorrectly implies all NIC driver patterns are Warning; update the
diagram (or its caption) so patterns listed under the "PATTERN MATCHING FOR"
block correctly reflect the authoritative classifications from the pattern table
(e.g., mark "mlx5_core cmd_exec timeout", "mlx5_core health poll failed",
"mlx5_core unrecoverable", "PCIe Bus Error.*Fatal", and "NETDEV
WATCHDOG.*mlx5_core" as Fatal rather than Warning), or alternatively add a clear
note next to that diagram pointing readers to Section 5.1 / the pattern table
for precise severity mappings so the diagram is not misleading.

🧹 Nitpick comments (2)

docs/nic-health-monitor/link-state-detection.md (2)
374-399: Clarify fallback behavior when GPU product name is unavailable.

The speed degradation detection algorithm (line 377) mentions getting GPU product name "via NVML or node labels" but doesn't specify what happens if neither source is available (e.g., CPU-only nodes, bare metal without NVML).

Line 382 mentions "If no match found, skip speed validation" which is the right approach, but consider making this fallback behavior more explicit in the algorithm steps.
Suggested clarification

Add after step 1:
1. **Get GPU product name** via NVML or node labels (e.g., `nvidia.com/gpu.product`)
   - If neither source is available, skip speed validation (return early)
2. **Normalize** the product name to lowercase
802-860: Configuration format differs from other documents.

This document uses INI format for configuration (lines 802-860), while link-counter-detection.md uses YAML (lines 669-843). If these are configurations for the same component, consider consistency. If they're for different components, this is acceptable—just ensure it's clear which configuration applies where.

docs/nic-health-monitor/link-counter-detection.md

coderabbitai · 2026-01-30T16:03:12Z

docs/nic-health-monitor/syslog-detection-correlation.md

+│  │  PATTERN MATCHING FOR (all emitted as WARNING):                      │  │
+│  │  ├── mlx5_core cmd_exec timeout       → Warning (investigate)        │  │
+│  │  ├── mlx5_core health poll failed     → Warning (investigate)        │  │
+│  │  ├── mlx5_core unrecoverable          → Warning (investigate)        │  │
+│  │  ├── PCIe Bus Error.*Fatal            → Warning (investigate)        │  │
+│  │  ├── NETDEV WATCHDOG.*mlx5_core       → Warning (investigate)        │  │
+│  │  ├── High Temperature                 → Warning (monitor)            │  │
+│  │  ├── Detected insufficient power      → Warning (informational)      │  │
+│  │  ├── module absent                    → Warning (informational)      │  │
+│  │  └── ACCESS_REG.*failed               → Warning (noise, can filter)  │  │
+│  │                                                                      │  │


⚠️ Potential issue | 🟡 Minor

Diagram oversimplifies severity classification.

The flow diagram labels all NIC driver patterns as "Warning (investigate)", but the authoritative pattern table (lines 420-431) correctly classifies several as Fatal (cmd_exec timeout, health poll failed, unrecoverable, PCIe Fatal, NETDEV WATCHDOG).

Consider updating the diagram to distinguish between Fatal and Warning patterns, or add a note that the diagram is simplified and readers should refer to Section 5.1 for accurate severity classifications.

🤖 Prompt for AI Agents

In `@docs/nic-health-monitor/syslog-detection-correlation.md` around lines 94 - 104, The diagram header "PATTERN MATCHING FOR (all emitted as WARNING)" incorrectly implies all NIC driver patterns are Warning; update the diagram (or its caption) so patterns listed under the "PATTERN MATCHING FOR" block correctly reflect the authoritative classifications from the pattern table (e.g., mark "mlx5_core cmd_exec timeout", "mlx5_core health poll failed", "mlx5_core unrecoverable", "PCIe Bus Error.*Fatal", and "NETDEV WATCHDOG.*mlx5_core" as Fatal rather than Warning), or alternatively add a clear note next to that diagram pointing readers to Section 5.1 / the pattern table for precise severity mappings so the diagram is not misleading.

coderabbitai

Actionable comments posted: 3

🤖 Fix all issues with AI agents

In `@docs/nic-health-monitor/link-state-detection.md`:
- Around line 7-20: Update the TOC entry for "Appendix A" so its fragment
matches the actual heading slug "Appendix A: Quick Reference - Fatal Condition
Classification" by replacing the current fragment
"#appendix-a-quick-reference---fatal-state-conditions" with the slug generated
from the real heading (e.g.,
"#appendix-a-quick-reference---fatal-condition-classification") so the link in
the list item for "Appendix A" points to the exact heading text; update any
other occurrences of the old fragment (noted also around lines 831-844) to the
same corrected fragment.
- Around line 552-556: The PF (Physical Function) table cell contains a sentence
fragment "Should ALWAYS be ACTIVE." — update the PF row in the table so the
description is a full sentence (e.g., "The Physical Function should always be
active.") and ensure capitalization/punctuation matches the document style;
locate the PF/Physical Function entry in the table and replace the fragment with
the completed sentence.

In `@docs/nic-health-monitor/Overview.md`:
- Around line 21-88: The fenced ASCII diagram blocks lack a language tag which
triggers MD040; update each triple-backtick fence that wraps the ASCII diagrams
(the large NVSentinel NIC MONITORING ARCHITECTURE diagram and the other similar
diagram blocks) to use a language identifier `text` (i.e. replace ``` with
```text) so markdownlint passes, and apply the same change to the other diagram
fences referenced in the comment.

coderabbitai · 2026-01-30T16:15:50Z

docs/nic-health-monitor/link-state-detection.md

+1. [Overview](#1-overview)
+2. [Architecture](#2-architecture)
+3. [State Monitoring Specification](#3-state-monitoring-specification)
+4. [Link Speed Degradation Detection](#4-link-speed-degradation-detection)
+5. [Device Discovery and Parsing](#5-device-discovery-and-parsing)
+6. [State Change and Flap Detection](#6-state-change-and-flap-detection)
+7. [PCI Configuration Space Health Check](#7-pci-configuration-space-health-check)
+8. [SR-IOV Virtual Function Handling](#8-sr-iov-virtual-function-handling)
+9. [RoCE State Monitoring](#9-roce-state-monitoring)
+10. [Supported Hardware](#10-supported-hardware)
+11. [Data Structures](#11-data-structures)
+12. [Configuration](#12-configuration)
+13. [Event Management](#13-event-management)
+- [Appendix A: Quick Reference - Fatal State Conditions](#appendix-a-quick-reference---fatal-state-conditions)


⚠️ Potential issue | 🟡 Minor

Fix TOC fragment for Appendix A to match the actual heading slug.

The TOC link for Appendix A doesn’t match the heading text, which triggers MD051. Update the fragment to align with the heading “Appendix A: Quick Reference - Fatal Condition Classification.”

🔧 Suggested fix

-- [Appendix A: Quick Reference - Fatal State Conditions](`#appendix-a-quick-reference---fatal-state-conditions`) +- [Appendix A: Quick Reference - Fatal Condition Classification](`#appendix-a-quick-reference---fatal-condition-classification`)

Also applies to: 831-844

🤖 Prompt for AI Agents

In `@docs/nic-health-monitor/link-state-detection.md` around lines 7 - 20, Update the TOC entry for "Appendix A" so its fragment matches the actual heading slug "Appendix A: Quick Reference - Fatal Condition Classification" by replacing the current fragment "#appendix-a-quick-reference---fatal-state-conditions" with the slug generated from the real heading (e.g., "#appendix-a-quick-reference---fatal-condition-classification") so the link in the list item for "Appendix A" points to the exact heading text; update any other occurrences of the old fragment (noted also around lines 831-844) to the same corrected fragment.

coderabbitai · 2026-01-30T16:15:50Z

docs/nic-health-monitor/link-state-detection.md

+| Term   | Full Name         | Description                                                     |
+|--------|-------------------|-----------------------------------------------------------------|
+| **PF** | Physical Function | The "real" NIC. Host OS controls it. Should ALWAYS be ACTIVE.   |
+| **VF** | Virtual Function  | A "virtual clone" of the PF. Created for VMs/containers to use. |
+


⚠️ Potential issue | 🟡 Minor

Complete the sentence in the PF definition row.

“Should ALWAYS be ACTIVE.” reads as a fragment. Consider making it a full sentence.

✍️ Suggested edit

-| **PF** | Physical Function | The "real" NIC. Host OS controls it. Should ALWAYS be ACTIVE. | +| **PF** | Physical Function | The "real" NIC. Host OS controls it and it should ALWAYS be ACTIVE. |

🧰 Tools

🪛 LanguageTool

[style] ~554-~554: To form a complete sentence, be sure to include a subject.
Context: ... | The "real" NIC. Host OS controls it. Should ALWAYS be ACTIVE. | | VF | Virtua...

(MISSING_IT_THERE)

🤖 Prompt for AI Agents

In `@docs/nic-health-monitor/link-state-detection.md` around lines 552 - 556, The PF (Physical Function) table cell contains a sentence fragment "Should ALWAYS be ACTIVE." — update the PF row in the table so the description is a full sentence (e.g., "The Physical Function should always be active.") and ensure capitalization/punctuation matches the document style; locate the PF/Physical Function entry in the table and replace the fragment with the completed sentence.

coderabbitai · 2026-01-30T16:15:50Z

docs/nic-health-monitor/Overview.md

+```
+┌────────────────────────────────────────────────────────────────────────────────┐
+│                    NVSentinel NIC MONITORING ARCHITECTURE                      │
+├────────────────────────────────────────────────────────────────────────────────┤
+│                                                                                │
+│  ┌──────────────────────────────────────────────────────────────────────────┐  │
+│  │                 PER-NODE DAEMONSETS (Raw Event Reporters)                │  │
+│  ├──────────────────────────────────────────────────────────────────────────┤  │
+│  │                                                                          │  │
+│  │  ┌─────────────────────────────┐  ┌─────────────────────────────────┐   │  │
+│  │  │    NIC HEALTH MONITOR       │  │    SYSLOG HEALTH MONITOR        │   │  │
+│  │  │    ══════════════════       │  │    ════════════════════         │   │  │
+│  │  │                             │  │                                 │   │  │
+│  │  │  DATA SOURCES:              │  │  DATA SOURCE:                   │   │  │
+│  │  │  • /sys/class/infiniband/   │  │  • journald / /var/log/journal  │   │  │
+│  │  │  • /sys/class/net/          │  │                                 │   │  │
+│  │  │                             │  │  CHECK:                         │   │  │
+│  │  │  CHECKS:                    │  │  • SysLogsNICDriverError        │   │  │
+│  │  │  • InfiniBandStateCheck     │  │    (mlx5_core cmd_exec timeout, │   │  │
+│  │  │  • InfiniBandDegradationChk │  │     health poll failed,         │   │  │
+│  │  │  • EthernetStateCheck       │  │     unrecoverable, PCIe fatal)  │   │  │
+│  │  │  • EthernetDegradationCheck │  │                                 │   │  │
+│  │  │                             │  │  BEHAVIOR:                      │   │  │
+│  │  │  BEHAVIOR:                  │  │  • Reports RAW events as-is     │   │  │
+│  │  │  • Reports RAW events as-is │  │  • NO aggregation               │   │  │
+│  │  │  • NO aggregation           │  │  • NO local persistence         │   │  │
+│  │  │  • NO local persistence     │  │                                 │   │  │
+│  │  └─────────────┬───────────────┘  └───────────────┬─────────────────┘   │  │
+│  │                │                                  │                     │  │
+│  └────────────────┼──────────────────────────────────┼─────────────────────┘  │
+│                   │                                  │                        │
+│                   └────────────────┬─────────────────┘                        │
+│                                    │                                          │
+│                                    ▼                                          │
+│  ┌──────────────────────────────────────┐                                     │
+│  │       PLATFORM CONNECTOR             │                                     │
+│  │       ══════════════════             │                                     │
+│  │  • Receives raw events               │                                     │
+│  │  • Persists to MongoDB               │                                     │
+│  │  • Triggers downstream               │                                     │
+│  └──────────────────┬───────────────────┘                                     │
+│                     │                                                         │
+│                     ▼                                                         │
+│  ┌──────────────────────────────────────────────────────────────────────────┐ │
+│  │             CENTRALIZED DEPLOYMENT (Correlation & Intelligence)          │ │
+│  ├──────────────────────────────────────────────────────────────────────────┤ │
+│  │                                                                          │ │
+│  │  ┌────────────────────────────────────────────────────────────────────┐  │ │
+│  │  │                    HEALTH EVENTS ANALYZER                          │  │ │
+│  │  │                    ══════════════════════                          │  │ │
+│  │  │                                                                    │  │ │
+│  │  │  OPTIONAL CORRELATION (MongoDB Aggregation Pipelines):             │  │ │
+│  │  │                                                                    │  │ │
+│  │  │  • Correlate kernel log warnings with port state events            │  │ │
+│  │  │  • Provide diagnostic context for operator investigation           │  │ │
+│  │  │                                                                    │  │ │
+│  │  │  NOTE: Deterministic kernel logs (cmd_exec timeout, etc.) = Fatal  │  │ │
+│  │  │        Diagnostic logs = Warning                                   │  │ │
+│  │  │                                                                    │  │ │
+│  │  │  OUTPUT:                                                           │  │ │
+│  │  │  • Deterministic failure detection and remediation triggering      │  │ │
+│  │  │  • Diagnostic correlation for operator investigation               │  │ │
+│  │  └────────────────────────────────────────────────────────────────────┘  │ │
+│  │                                                                          │ │
+│  └──────────────────────────────────────────────────────────────────────────┘ │
+│                                                                                │
+└────────────────────────────────────────────────────────────────────────────────┘
+```


⚠️ Potential issue | 🟡 Minor

Add a language tag to fenced blocks to satisfy markdownlint.

These ASCII diagrams are in fenced blocks without a language identifier, triggering MD040. Consider using text for clarity.

🔧 Suggested fix

-``` +```text ... -``` +```

Also applies to: 123-154, 158-174

🧰 Tools

🪛 markdownlint-cli2 (0.20.0)

[warning] 57-57: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents

In `@docs/nic-health-monitor/Overview.md` around lines 21 - 88, The fenced ASCII diagram blocks lack a language tag which triggers MD040; update each triple-backtick fence that wraps the ASCII diagrams (the large NVSentinel NIC MONITORING ARCHITECTURE diagram and the other similar diagram blocks) to use a language identifier `text` (i.e. replace ``` with ```text) so markdownlint passes, and apply the same change to the other diagram fences referenced in the comment.

lalitadithya · 2026-02-02T06:27:52Z

docs/nic-health-monitor/Overview.md

+| **Log Watcher** | `mlx5_core.*cmd_exec timeout`, `health poll failed`, `unrecoverable`, `PCIe Fatal` | **Fatal**   | Deterministic hardware/driver failure |
+| **Log Watcher** | `insufficient power`, `module absent`, `ACCESS_REG failed`                         | **Warning** | Diagnostic context for correlation    |
+
+> **Design Note**: Deterministically fatal events in logs trigger `REPLACE_VM`. Diagnostic logs remain as `Warning` and do not directly trigger remediation.


and do not directly trigger remediation.

meaning they will be published as non-fatal events? If yes, can we mention it that way so that it is clear?

lalitadithya · 2026-02-02T06:31:16Z

docs/nic-health-monitor/link-state-detection.md

+│  │  │  BEHAVIOR:                         │                                  │  │
+│  │  │  • Reports RAW state events        │                                  │  │
+│  │  │  • NO aggregation                  │                                  │  │
+│  │  │  • NO local persistence            │                                  │  │


I'm not sure if we have to mention this, I understand that the previous iteration has persistence, but now this we have removed it, to someone who is reading this the first time it looks odd as why we would call this out.

lalitadithya · 2026-02-02T06:31:51Z

docs/nic-health-monitor/link-state-detection.md

+    IBStateArmed   = "3: ARMED"     // Armed but not active (check SM)
+    IBStateActive  = "4: ACTIVE"    // Normal operational state
+
+    // Port physical states


Can we add a link to where we got these states from?

lalitadithya · 2026-02-02T06:32:37Z

docs/nic-health-monitor/link-state-detection.md

+   - If `state = INIT` → Warning, message: "Port stuck in INIT - check Subnet Manager"
+   - If `state = ARMED` → Warning, message: "Port ARMED but not ACTIVE - check fabric manager"


Why are these warning, it looks like someone has to take some action on these right?

lalitadithya · 2026-02-02T06:33:02Z

docs/nic-health-monitor/link-state-detection.md

+3. **Read physical state** from `/sys/class/infiniband/<dev>/ports/<port>/phys_state`
+4. **Evaluate physical state:**
+   - If `phys_state = Disabled` → Mark as **FATAL**, message: "Physical state DISABLED"
+   - If `phys_state = Polling` → Warning, message: "Link training in progress"


Why is this warning? In this state jobs can't run right?

lalitadithya · 2026-02-02T06:50:38Z

docs/nic-health-monitor/syslog-detection-correlation.md

+| `mlx5_core unrecoverable`      | **Fatal**   | Hardware admission of failure | `REPLACE_VM`       |
+| `PCIe Bus Error.*Fatal`        | **Fatal**   | PCIe link broken              | `REPLACE_VM`       |
+| `NETDEV WATCHDOG`              | **Fatal**   | Data path stalled             | `REPLACE_VM`       |
+| `Detected insufficient power`  | **Warning** | Power negotiation status      | `NONE`             |


In general, we make use of fatal v/s non-fatal as the convention, but in these designs, we are using warning which is a bit different. Can we use the same terminology so that it is less confusing?

lalitadithya · 2026-02-02T06:55:44Z

docs/nic-health-monitor/link-counter-detection.md

+- Often indicates "Silent Drop" or **Black Hole** in the fabric
+- Local symptom of a **remote** problem
+
+**What This Monitor CAN Detect**: The `local_ack_timeout_err` and `req_transport_retries_exceeded` (native IB) hardware counters track these retry events at the NIC level. Rising counter values indicate transport-layer problems even if we can't see the application error.


Can this also be triggered by an application? Say one of the ranks of a multi-node training crashes or hangs, would it cause this ack timeout to be incremented on the other nodes?

lalitadithya · 2026-02-02T06:57:17Z

docs/nic-health-monitor/link-counter-detection.md

+
+**Key Non-Fatal Counters** (monitor for performance degradation):
+
+| Category       | Counters                | IsFatal | Alert Threshold | Justification                                               |


Can we link to documentation that recommend the alert thresholds?

lalitadithya · 2026-02-02T07:09:24Z

docs/nic-health-monitor/link-counter-detection.md

+|-----------------------------------|----------|-------------------------|---------|----------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------|
+| `link_downed`                     | Standard | **Delta > 0 (Runtime)** | **YES** | Logical path destruction; QP disconnect. Standard HPC apps don't support transparent QP rerouting. | [HPE ClusterStor](https://support.hpe.com/hpesc/public/docDisplay?docId=sd00001143en_us) |
+| `excessive_buffer_overrun_errors` | Standard | **> 0 (Any)**           | **YES** | Lossless guarantee violation; packet dropped immediately. HCA ingress buffer overflow.             | [IBM Redbooks](https://www.redbooks.ibm.com/redbooks/pdfs/sg247767.pdf)                  |
+| `rnr_nak_retry_err`               | Extended | **> 0 (Any)**           | **YES** | Receiver Not Ready NAK retry exhausted; connection severed. Terminal state of error handling.      | [Datadog IB](https://docs.datadoghq.com/integrations/infiniband/)                        |


The link doesn't seem to talk about the threshold or how it will cause a fatal error, can we review the linked docs and make sure that we link to the exact section where we have documentation where it talks about the impact and fatal classification?

lalitadithya · 2026-02-02T07:14:23Z

docs/nic-health-monitor/link-counter-detection.md

+      isFatal: false
+      thresholdType: velocity
+      threshold: 10.0
+      velocityUnit: per_second


Suggested change

velocityUnit: per_second

velocityUnit: second

deesharma24 · 2026-02-02T07:24:57Z

docs/nic-health-monitor/Overview.md

+│  LAYER 1: LINK STATE DETECTION                                             │
+│  ═══════════════════════════════                                           │
+│  • Polling interval: 1 second                                              │
+│  • Data source: /sys/class/infiniband/<dev>/ports/<port>/state, phys_state │


Last time when we had nic-health-monitor deployed in our prod clusters. one of the issues which we faced was whether certain interfaces(be it infiniband/ethernet) are in active use or not. if they are not in active use, then whether they stay up/down doesn't matter and I vaguely remember due to this all the nodes in some cluster(probably osmo) should nic down even though that nic interface was not getting used. how is it going to help this time? will it still be configuration driven where we need to mention the interfaces we want to monitor based cluster/csp?

XRFXLP · 2026-02-13T05:30:52Z

docs/nic-health-monitor/link-state-detection.md

+| H100        | 8              | 400 Gb/s (NDR) | [DGX H100 User Guide](https://docs.nvidia.com/dgx/dgxh100-user-guide/introduction-to-dgxh100.html) |
+| H200        | 8              | 400 Gb/s (NDR) | NVIDIA Documentation                                                                               |
+| B200        | 8              | 400 Gb/s       | [DGX B200 User Guide](https://docs.nvidia.com/dgx/dgxb200-user-guide/introduction-to-dgxb200.html) |
+| GB200       | 8              | 400 Gb/s       | [GB200 NVL2](https://www.nvidia.com/en-us/data-center/gb200-nvl2/)                                 |


GB200 system uses fifth generation nvlink whereas H100 using fourth, wouldn't there would be ~1.5x speedup?

XRFXLP · 2026-02-13T05:33:00Z

docs/nic-health-monitor/link-state-detection.md

+
+**GPU-to-InfiniBand Mapping:**
+
+| GPU Product | Expected Ports | Expected Rate  | Reference                                                                                          |


Are these theoretical expected rate if everything E2E as per data-sheet? What if rates are like ~195 gbps for A100?

XRFXLP · 2026-02-13T05:54:01Z

docs/nic-health-monitor/link-state-detection.md

+│  ═══════════════════════════════════════════════════════════════════════════    │
+│                                                                                  │
+│  IMPACT ON NCCL AllReduce (Ring Topology):                                       │
+│                                                                                  │


This diagram gives impression that communication is inside node between different GPUs i.e. link speed being nvlink speed (which is out of scope of this MR). Can we make it slightly more clear to have inter-node seperaiton?

feat: NIC health monitor design

33f610a

coderabbitai bot reviewed Jan 27, 2026

View reviewed changes

fix: fix typos

1fb8b35

KaivalyaMDabhadkar self-assigned this Jan 27, 2026

KaivalyaMDabhadkar linked an issue Jan 27, 2026 that may be closed by this pull request

[Feature]: Infiniband health monitor #642

Open

1 task

chore: format tables

4235116

lalitadithya reviewed Jan 27, 2026

View reviewed changes

coderabbitai bot reviewed Jan 27, 2026

View reviewed changes

docs/nic-health-monitor.md Outdated Show resolved Hide resolved

lalitadithya reviewed Jan 27, 2026

View reviewed changes

KaivalyaMDabhadkar added 2 commits January 30, 2026 21:29

fix: addressed reviews and split docs

1955947

Merge remote-tracking branch 'upstream/main' into 642-nic-monitor-design

7d0aa7e

coderabbitai bot reviewed Jan 30, 2026

View reviewed changes

fix: fix section

09650a6

coderabbitai bot reviewed Jan 30, 2026

View reviewed changes

lalitadithya reviewed Feb 2, 2026

View reviewed changes

deesharma24 reviewed Feb 2, 2026

View reviewed changes

lalitadithya mentioned this pull request Feb 2, 2026

[Feature]: Infiniband health monitor #642

Open

1 task

XRFXLP reviewed Feb 13, 2026

View reviewed changes

		Deduplication:
		└── Same message within 60s window --> Suppressed (prevents alert storms)

		- If `state = INIT` → Warning, message: "Port stuck in INIT - check Subnet Manager"
		- If `state = ARMED` → Warning, message: "Port ARMED but not ACTIVE - check fabric manager"


		Key Non-Fatal Counters (monitor for performance degradation):

		\| Category \| Counters \| IsFatal \| Alert Threshold \| Justification \|


		GPU-to-InfiniBand Mapping:

		\| GPU Product \| Expected Ports \| Expected Rate \| Reference \|

Conversation

KaivalyaMDabhadkar commented Jan 27, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Type of Change

Component(s) Affected

Testing

Checklist

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Jan 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

lalitadithya commented Jan 27, 2026

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot Jan 30, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jan 30, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jan 30, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jan 30, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

KaivalyaMDabhadkar commented Jan 27, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Jan 27, 2026 •

edited

Loading

XRFXLP Feb 13, 2026 •

edited

Loading