Skip to content

feat: NIC health monitor design#752

Open
KaivalyaMDabhadkar wants to merge 6 commits intoNVIDIA:mainfrom
KaivalyaMDabhadkar:642-nic-monitor-design
Open

feat: NIC health monitor design#752
KaivalyaMDabhadkar wants to merge 6 commits intoNVIDIA:mainfrom
KaivalyaMDabhadkar:642-nic-monitor-design

Conversation

@KaivalyaMDabhadkar
Copy link
Contributor

@KaivalyaMDabhadkar KaivalyaMDabhadkar commented Jan 27, 2026

Summary

Preview the design

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📚 Documentation
  • 🔧 Refactoring
  • 🔨 Build/CI

Component(s) Affected

  • Core Services
  • Documentation/CI
  • Fault Management
  • Health Monitors
  • Janitor
  • Other: ____________

Testing

  • Tests pass locally
  • Manual testing completed
  • No breaking changes (or documented)

Checklist

  • Self-review completed
  • Documentation updated (if needed)
  • Ready for review

Summary by CodeRabbit

  • Documentation
    • Added a comprehensive NIC Health Monitor documentation suite: overview, three-layer architecture, link state detection, link counter degradation monitoring, syslog/driver error detection and correlation, severity model, configuration examples, operational guidance, event schemas, and integration notes for GPU cluster monitoring.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jan 27, 2026

📝 Walkthrough

Walkthrough

Adds four comprehensive NIC Health Monitor design documents describing a three-layer detection architecture (link state, link counters, syslog correlation), per-node daemons, a platform connector, centralized Health Events Analyzer with MongoDB, data flows, event schemas, and operational/configuration guidance for GPU clusters.

Changes

Cohort / File(s) Summary
Overview & Top-level
docs/nic-health-monitor/Overview.md
New high-level design document describing goals, architecture, severity model, data flows, and integration points (Platform Connector, Health Events Analyzer, MongoDB).
Link State Detection
docs/nic-health-monitor/link-state-detection.md
New detailed spec for per-node link state polling (1s), device discovery, flap/stabilization rules, PCI checks, SR-IOV handling, event schema for fatal/non-fatal states, and configuration mappings for GPU ports.
Link Counter Detection
docs/nic-health-monitor/link-counter-detection.md
New detailed spec for counter-based degradation monitoring (5s polling), counter sets (standard/extended), delta & rate calculations, threshold rules (delta/velocity/time windows), reset/wrap handling, config schema, data models, and event construction.
Syslog Detection & Correlation
docs/nic-health-monitor/syslog-detection-correlation.md
New spec for kernel/syslog pattern detection, NIC driver error handlers, pattern sets and severities, integration with syslog-health-monitor, and correlation rules for Health Events Analyzer (diagnostic context vs. port-state driven decisions).

Sequence Diagram(s)

sequenceDiagram
    participant NodeDaemon as Node Daemon\n(Link State / Counters / Syslog)
    participant PlatformConnector as Platform Connector
    participant MongoDB as MongoDB (History)
    participant Analyzer as Health Events Analyzer
    participant Operator as Operator / Alerting

    NodeDaemon->>PlatformConnector: send raw events (state,counters,syslog)
    PlatformConnector->>MongoDB: store raw events / snapshots
    PlatformConnector->>Analyzer: forward events / metadata
    Analyzer->>MongoDB: query historical context / correlate
    Analyzer->>Operator: emit health events (fatal/non-fatal)
    Operator-->>Analyzer: acknowledge / trigger remediation
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A hop, a sniff, a tiny ear that twitches,
I guard the cables, counters, and log glitches.
Three layered checks — state, counters, and logs,
I chase the grey failures through fibers and frogs,
The rabbit nods: clusters hum, safe from hitches.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: adding comprehensive NIC health monitor design documentation with multiple focus areas (link state, counter, syslog detection).
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Fix all issues with AI agents
In `@docs/nic-health-monitor.md`:
- Around line 1567-1576: Section 7.4's "Monitored Kernel Patterns" table is
missing two patterns that appear in Appendix A; either add
`mlx5_core.*ENABLE_HCA.*timeout` and `NETDEV WATCHDOG.*transmit queue.*timed
out` to the Section 7.4 table (alongside existing entries
`pci_power_insufficient`, `port_module_high_temp`, `cmd_exec_timeout`,
`unrecoverable_err`) with matching RecommendedAction and Meaning as in Appendix
A, or remove those two rows from Appendix A so both sections match; update the
entries where the table is defined in Section 7.4 (Monitored Kernel Patterns)
and ensure Appendix A's "Fatal Kernel Log Patterns" table reflects the same set
to keep the spec and quick reference consistent.
- Around line 675-677: The document contains conflicting guidance about the
link_downed counter; pick one consistent approach and update all references
accordingly: either (A) exclude link_downed from health monitoring — remove
link_downed from the "fatal counter" examples, the State Monitor data flow
lists, and the "Consolidated Deterministic Failure Thresholds" table (e.g.,
remove any Delta > 0 entries referring to link_downed), or (B) include it as
fatal — keep it in the examples/tables and augment the State Monitor and
threshold docs (e.g., the "Consolidated Deterministic Failure Thresholds" table
and State Monitor flow) with explicit detection logic for administrative events
(such as correlating link_downed deltas with kernel/driver reload log patterns,
process restarts, or sticky-window suppression logic) and update the note
currently at "Note on `link_downed`" to describe the distinguishing method;
ensure every occurrence of link_downed (examples, tables, and the State Monitor)
is updated to reflect the chosen approach.
- Around line 949-957: In the "5.6 RoCE Handling" section update the fatal
counters list to exclude or mark req_transport_retries_exceeded as Native IB
only: locate the section header "5.6 RoCE Handling" and the bullet list
containing symbol_error, link_downed, local_link_integrity_errors,
excessive_buffer_overrun_errors, req_transport_retries_exceeded and remove
req_transport_retries_exceeded from that RoCE-specific list (or append "(Native
IB only)" to that item) so the documentation matches the later statement that
req_transport_retries_exceeded is only available on Native InfiniBand.
- Line 537: The code references a misspelled protobuf enum symbol
pb.RecommenedAction; update that reference to pb.RecommendedAction wherever it's
used (specifically in the mapping for RecommendedAction) so the correct enum
name is used and the code compiles; also verify any related imports/aliases
still resolve to the protobuf package after the rename.
- Around line 133-146: The two out-of-order subsection headings should be
renumbered to sit under the "2.4 Hardware Failures Detectable by Local
Monitoring" section: change "#### 2.6.1 The `mlx5_core cmd_exec timeout`" to
"#### 2.4.1 The `mlx5_core cmd_exec timeout`" and change "#### 2.6.2 Transport
Retry Count Exceeded (Error 12)" to "#### 2.4.2 Transport Retry Count Exceeded
(Error 12)"; update the visible subsection numbers only (leave the heading text
and content unchanged) so the document hierarchy is correct.
🧹 Nitpick comments (2)
docs/nic-health-monitor.md (2)

575-597: Consider setting specific RecommendedAction in helper constructors.

The NewFatalEvent helper defaults to RecommendedAction_NONE, but Section 1.4 (lines 52-57) defines specific actions for different fatal conditions (e.g., RecommendedAction_REPLACE_VM for state=DOWN, RecommendedAction_RESTART_BM for cmd_exec timeout).

Callers must manually override this field after construction, which could lead to inconsistency. Consider either:

  1. Adding a recommendedAction parameter to the constructor, or
  2. Providing specialized constructors (e.g., NewFatalReplaceVMEvent, NewFatalRestartBMEvent)

This would enforce consistent action mapping at the type level.

♻️ Alternative: Add parameter to constructor
-func NewFatalEvent(deviceName, message string, linkLayer string) *NicHealthEvent {
+func NewFatalEvent(deviceName, message string, linkLayer string, recommendedAction RecommendedAction) *NicHealthEvent {
     checkName := "EthernetErrorCheck"
     if linkLayer == "InfiniBand" {
         checkName = "InfiniBandErrorCheck"
     }
 
     return &NicHealthEvent{
         Version:            1,
         Agent:              AGENT,
         CheckName:          checkName,
         ComponentClass:     "NIC",
         GeneratedTimestamp: time.Now(),
         Message:            message,
         IsFatal:            true,
         IsHealthy:          false,
         NodeName:           os.Getenv("NODE_NAME"),
-        RecommendedAction:  RecommendedAction_NONE,
+        RecommendedAction:  recommendedAction,
         EntitiesImpacted:   []Entity{{EntityType: "NIC", EntityValue: deviceName}},
         LinkLayer:          linkLayer,
     }
 }

82-84: Add language specifiers to fenced code blocks for better rendering.

Multiple fenced code blocks throughout the document lack language specifiers, which affects syntax highlighting and rendering. Static analysis identified blocks at lines 82, 110, 149, 210, 268, 289, 367, 383, 418, 439, 684, 1015, 1042, 1049, 1079, 1107, 1132, 1473, 1490.

Examples:

Line 82: Physical degradation flow diagram should use text or leave unmarked for plain text
Line 110: Queue Pair state machine should use text
Line 684: Counter reset problem should use text
Line 1015: GID table example should use text or bash
Line 1042: ASCII diagram should use text

Example fixes
-```
+```text
 Physical Impairment --> Eye Diagram Closes --> Symbol Errors --> FEC Corrections --> CRC Failures --> Packet Loss

- +text
RESET --> INIT --> RTR --> RTS

</details>

Based on learnings: As this is a design document, syntax highlighting is helpful for readability but not critical for correctness.

</blockquote></details>

</blockquote></details>

<!-- This is an auto-generated comment by CodeRabbit for review status -->

@KaivalyaMDabhadkar KaivalyaMDabhadkar self-assigned this Jan 27, 2026
@KaivalyaMDabhadkar KaivalyaMDabhadkar linked an issue Jan 27, 2026 that may be closed by this pull request
1 task
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@docs/nic-health-monitor.md`:
- Line 100: The docs reference an undefined counter `uncorrectable_fec_errors`;
add a Fatal Counter Reference entry defining this counter: provide its sysfs
path (e.g., the NIC sysfs file used to expose FEC uncorrectable counts), a
recommended fatal threshold for PAM4 links (e.g., sustained extreme rates like
>1e6/sec or a per-interval threshold), and the recommended action (alert, run
diagnostics, or replace module). Update the Fatal Counter Reference table (and
cross-reference Sections 5.1 / 5.1.1 / 5.8 / Appendix A) to include
`uncorrectable_fec_errors` so the monitoring spec contains the counter name,
sysfs path, threshold, and remediation guidance.
🧹 Nitpick comments (1)
docs/nic-health-monitor.md (1)

82-85: Consider adding language identifiers to fenced code blocks for better rendering.

Multiple fenced code blocks throughout the document (lines 82, 110, 149, 210, 268, 289, 367, 383, 417, 438, 683, 1014, 1041, 1048, 1078, 1106, 1131, 1477, 1494) lack language identifiers. While these are primarily ASCII diagrams and pseudocode, adding identifiers like ```text improves markdown rendering and accessibility.

Comment on lines 430 to 431
Deduplication:
└── Same message within 60s window --> Suppressed (prevents alert storms)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need this, in our architecture we don't alert on these things, so it is perfectly valid to capture every error message, we do that for XIDs and we tend to get bursts of those.

[StateMonitoring]
# Expected link speed in Gb/s. Links negotiating below this are FATAL.
# Common values: 100, 200, 400 (NDR)
TargetLinkSpeedGbps = 400
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to mean we will need per cluster configuration, are there ways to detect the expected speed by looking at the hardware perhaps?

}
```

### 5.4 Device Discovery and Parsing
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't this also pick up NICs that are not configured by network operator? I thought the last time we did this, we ran into a lot of false positives because of this.

@lalitadithya
Copy link
Collaborator

This design is very detailed which is great! But I think it is trying to solve many things at once and I think we should try and reduce scope and also break it down into multiple smaller designs so that we can agree to smaller sections but while having the larger picture in mind.

For scope, I suggest we only focus on Mellanox IB and RoCE, no EFA, TCPXO, eth (yet) they can be follow ups that can be done once we have made progress on the implementation of this.

With respect to the break down, I think we should break it down into the following parts:

  1. Link state detection, i.e., simple UP/DOWN detection of both states and NIC falling off
  2. Link counter detection, i.e., parse and detect errors using NIC counters available on the hosts
  3. Syslog detection, i.e., detect errors based on message in the syslog
  4. Repeat failure detection, i.e., link flaps and other advanced correlation techniques

What do you think?

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@docs/nic-health-monitor/link-counter-detection.md`:
- Line 469: The "Fatal counters" code comment incorrectly lists symbol_error as
fatal; update that comment so symbol_error is not listed as a fatal counter and
instead aligns with the counter specification tables (which classify
symbol_error as non-fatal, rate-based >10/sec). Specifically edit the "Fatal
counters" list entry that currently includes `symbol_error` to remove it
(leaving `link_downed`, `local_link_integrity_errors`,
`excessive_buffer_overrun_errors`, `req_transport_retries_exceeded`) so the
comment and the tables (referencing `symbol_error`) are consistent.

In `@docs/nic-health-monitor/syslog-detection-correlation.md`:
- Around line 94-104: The diagram header "PATTERN MATCHING FOR (all emitted as
WARNING)" incorrectly implies all NIC driver patterns are Warning; update the
diagram (or its caption) so patterns listed under the "PATTERN MATCHING FOR"
block correctly reflect the authoritative classifications from the pattern table
(e.g., mark "mlx5_core cmd_exec timeout", "mlx5_core health poll failed",
"mlx5_core unrecoverable", "PCIe Bus Error.*Fatal", and "NETDEV
WATCHDOG.*mlx5_core" as Fatal rather than Warning), or alternatively add a clear
note next to that diagram pointing readers to Section 5.1 / the pattern table
for precise severity mappings so the diagram is not misleading.
🧹 Nitpick comments (2)
docs/nic-health-monitor/link-state-detection.md (2)

374-399: Clarify fallback behavior when GPU product name is unavailable.

The speed degradation detection algorithm (line 377) mentions getting GPU product name "via NVML or node labels" but doesn't specify what happens if neither source is available (e.g., CPU-only nodes, bare metal without NVML).

Line 382 mentions "If no match found, skip speed validation" which is the right approach, but consider making this fallback behavior more explicit in the algorithm steps.

Suggested clarification

Add after step 1:

1. **Get GPU product name** via NVML or node labels (e.g., `nvidia.com/gpu.product`)
   - If neither source is available, skip speed validation (return early)
2. **Normalize** the product name to lowercase

802-860: Configuration format differs from other documents.

This document uses INI format for configuration (lines 802-860), while link-counter-detection.md uses YAML (lines 669-843). If these are configurations for the same component, consider consistency. If they're for different components, this is acceptable—just ensure it's clear which configuration applies where.

Comment on lines +94 to +104
│ │ PATTERN MATCHING FOR (all emitted as WARNING): │ │
│ │ ├── mlx5_core cmd_exec timeout → Warning (investigate) │ │
│ │ ├── mlx5_core health poll failed → Warning (investigate) │ │
│ │ ├── mlx5_core unrecoverable → Warning (investigate) │ │
│ │ ├── PCIe Bus Error.*Fatal → Warning (investigate) │ │
│ │ ├── NETDEV WATCHDOG.*mlx5_core → Warning (investigate) │ │
│ │ ├── High Temperature → Warning (monitor) │ │
│ │ ├── Detected insufficient power → Warning (informational) │ │
│ │ ├── module absent → Warning (informational) │ │
│ │ └── ACCESS_REG.*failed → Warning (noise, can filter) │ │
│ │ │ │
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Diagram oversimplifies severity classification.

The flow diagram labels all NIC driver patterns as "Warning (investigate)", but the authoritative pattern table (lines 420-431) correctly classifies several as Fatal (cmd_exec timeout, health poll failed, unrecoverable, PCIe Fatal, NETDEV WATCHDOG).

Consider updating the diagram to distinguish between Fatal and Warning patterns, or add a note that the diagram is simplified and readers should refer to Section 5.1 for accurate severity classifications.

🤖 Prompt for AI Agents
In `@docs/nic-health-monitor/syslog-detection-correlation.md` around lines 94 -
104, The diagram header "PATTERN MATCHING FOR (all emitted as WARNING)"
incorrectly implies all NIC driver patterns are Warning; update the diagram (or
its caption) so patterns listed under the "PATTERN MATCHING FOR" block correctly
reflect the authoritative classifications from the pattern table (e.g., mark
"mlx5_core cmd_exec timeout", "mlx5_core health poll failed", "mlx5_core
unrecoverable", "PCIe Bus Error.*Fatal", and "NETDEV WATCHDOG.*mlx5_core" as
Fatal rather than Warning), or alternatively add a clear note next to that
diagram pointing readers to Section 5.1 / the pattern table for precise severity
mappings so the diagram is not misleading.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@docs/nic-health-monitor/link-state-detection.md`:
- Around line 7-20: Update the TOC entry for "Appendix A" so its fragment
matches the actual heading slug "Appendix A: Quick Reference - Fatal Condition
Classification" by replacing the current fragment
"#appendix-a-quick-reference---fatal-state-conditions" with the slug generated
from the real heading (e.g.,
"#appendix-a-quick-reference---fatal-condition-classification") so the link in
the list item for "Appendix A" points to the exact heading text; update any
other occurrences of the old fragment (noted also around lines 831-844) to the
same corrected fragment.
- Around line 552-556: The PF (Physical Function) table cell contains a sentence
fragment "Should ALWAYS be ACTIVE." — update the PF row in the table so the
description is a full sentence (e.g., "The Physical Function should always be
active.") and ensure capitalization/punctuation matches the document style;
locate the PF/Physical Function entry in the table and replace the fragment with
the completed sentence.

In `@docs/nic-health-monitor/Overview.md`:
- Around line 21-88: The fenced ASCII diagram blocks lack a language tag which
triggers MD040; update each triple-backtick fence that wraps the ASCII diagrams
(the large NVSentinel NIC MONITORING ARCHITECTURE diagram and the other similar
diagram blocks) to use a language identifier `text` (i.e. replace ``` with
```text) so markdownlint passes, and apply the same change to the other diagram
fences referenced in the comment.

Comment on lines +7 to +20
1. [Overview](#1-overview)
2. [Architecture](#2-architecture)
3. [State Monitoring Specification](#3-state-monitoring-specification)
4. [Link Speed Degradation Detection](#4-link-speed-degradation-detection)
5. [Device Discovery and Parsing](#5-device-discovery-and-parsing)
6. [State Change and Flap Detection](#6-state-change-and-flap-detection)
7. [PCI Configuration Space Health Check](#7-pci-configuration-space-health-check)
8. [SR-IOV Virtual Function Handling](#8-sr-iov-virtual-function-handling)
9. [RoCE State Monitoring](#9-roce-state-monitoring)
10. [Supported Hardware](#10-supported-hardware)
11. [Data Structures](#11-data-structures)
12. [Configuration](#12-configuration)
13. [Event Management](#13-event-management)
- [Appendix A: Quick Reference - Fatal State Conditions](#appendix-a-quick-reference---fatal-state-conditions)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Fix TOC fragment for Appendix A to match the actual heading slug.

The TOC link for Appendix A doesn’t match the heading text, which triggers MD051. Update the fragment to align with the heading “Appendix A: Quick Reference - Fatal Condition Classification.”

🔧 Suggested fix
-- [Appendix A: Quick Reference - Fatal State Conditions](`#appendix-a-quick-reference---fatal-state-conditions`)
+- [Appendix A: Quick Reference - Fatal Condition Classification](`#appendix-a-quick-reference---fatal-condition-classification`)

Also applies to: 831-844

🤖 Prompt for AI Agents
In `@docs/nic-health-monitor/link-state-detection.md` around lines 7 - 20, Update
the TOC entry for "Appendix A" so its fragment matches the actual heading slug
"Appendix A: Quick Reference - Fatal Condition Classification" by replacing the
current fragment "#appendix-a-quick-reference---fatal-state-conditions" with the
slug generated from the real heading (e.g.,
"#appendix-a-quick-reference---fatal-condition-classification") so the link in
the list item for "Appendix A" points to the exact heading text; update any
other occurrences of the old fragment (noted also around lines 831-844) to the
same corrected fragment.

Comment on lines +552 to +556
| Term | Full Name | Description |
|--------|-------------------|-----------------------------------------------------------------|
| **PF** | Physical Function | The "real" NIC. Host OS controls it. Should ALWAYS be ACTIVE. |
| **VF** | Virtual Function | A "virtual clone" of the PF. Created for VMs/containers to use. |

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Complete the sentence in the PF definition row.

“Should ALWAYS be ACTIVE.” reads as a fragment. Consider making it a full sentence.

✍️ Suggested edit
-| **PF** | Physical Function | The "real" NIC. Host OS controls it. Should ALWAYS be ACTIVE.   |
+| **PF** | Physical Function | The "real" NIC. Host OS controls it and it should ALWAYS be ACTIVE.   |
🧰 Tools
🪛 LanguageTool

[style] ~554-~554: To form a complete sentence, be sure to include a subject.
Context: ... | The "real" NIC. Host OS controls it. Should ALWAYS be ACTIVE. | | VF | Virtua...

(MISSING_IT_THERE)

🤖 Prompt for AI Agents
In `@docs/nic-health-monitor/link-state-detection.md` around lines 552 - 556, The
PF (Physical Function) table cell contains a sentence fragment "Should ALWAYS be
ACTIVE." — update the PF row in the table so the description is a full sentence
(e.g., "The Physical Function should always be active.") and ensure
capitalization/punctuation matches the document style; locate the PF/Physical
Function entry in the table and replace the fragment with the completed
sentence.

Comment on lines +21 to +88
```
┌────────────────────────────────────────────────────────────────────────────────┐
│ NVSentinel NIC MONITORING ARCHITECTURE │
├────────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────────────────────────────────────────────────────────────┐ │
│ │ PER-NODE DAEMONSETS (Raw Event Reporters) │ │
│ ├──────────────────────────────────────────────────────────────────────────┤ │
│ │ │ │
│ │ ┌─────────────────────────────┐ ┌─────────────────────────────────┐ │ │
│ │ │ NIC HEALTH MONITOR │ │ SYSLOG HEALTH MONITOR │ │ │
│ │ │ ══════════════════ │ │ ════════════════════ │ │ │
│ │ │ │ │ │ │ │
│ │ │ DATA SOURCES: │ │ DATA SOURCE: │ │ │
│ │ │ • /sys/class/infiniband/ │ │ • journald / /var/log/journal │ │ │
│ │ │ • /sys/class/net/ │ │ │ │ │
│ │ │ │ │ CHECK: │ │ │
│ │ │ CHECKS: │ │ • SysLogsNICDriverError │ │ │
│ │ │ • InfiniBandStateCheck │ │ (mlx5_core cmd_exec timeout, │ │ │
│ │ │ • InfiniBandDegradationChk │ │ health poll failed, │ │ │
│ │ │ • EthernetStateCheck │ │ unrecoverable, PCIe fatal) │ │ │
│ │ │ • EthernetDegradationCheck │ │ │ │ │
│ │ │ │ │ BEHAVIOR: │ │ │
│ │ │ BEHAVIOR: │ │ • Reports RAW events as-is │ │ │
│ │ │ • Reports RAW events as-is │ │ • NO aggregation │ │ │
│ │ │ • NO aggregation │ │ • NO local persistence │ │ │
│ │ │ • NO local persistence │ │ │ │ │
│ │ └─────────────┬───────────────┘ └───────────────┬─────────────────┘ │ │
│ │ │ │ │ │
│ └────────────────┼──────────────────────────────────┼─────────────────────┘ │
│ │ │ │
│ └────────────────┬─────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────┐ │
│ │ PLATFORM CONNECTOR │ │
│ │ ══════════════════ │ │
│ │ • Receives raw events │ │
│ │ • Persists to MongoDB │ │
│ │ • Triggers downstream │ │
│ └──────────────────┬───────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────────────┐ │
│ │ CENTRALIZED DEPLOYMENT (Correlation & Intelligence) │ │
│ ├──────────────────────────────────────────────────────────────────────────┤ │
│ │ │ │
│ │ ┌────────────────────────────────────────────────────────────────────┐ │ │
│ │ │ HEALTH EVENTS ANALYZER │ │ │
│ │ │ ══════════════════════ │ │ │
│ │ │ │ │ │
│ │ │ OPTIONAL CORRELATION (MongoDB Aggregation Pipelines): │ │ │
│ │ │ │ │ │
│ │ │ • Correlate kernel log warnings with port state events │ │ │
│ │ │ • Provide diagnostic context for operator investigation │ │ │
│ │ │ │ │ │
│ │ │ NOTE: Deterministic kernel logs (cmd_exec timeout, etc.) = Fatal │ │ │
│ │ │ Diagnostic logs = Warning │ │ │
│ │ │ │ │ │
│ │ │ OUTPUT: │ │ │
│ │ │ • Deterministic failure detection and remediation triggering │ │ │
│ │ │ • Diagnostic correlation for operator investigation │ │ │
│ │ └────────────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
│ │
└────────────────────────────────────────────────────────────────────────────────┘
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Add a language tag to fenced blocks to satisfy markdownlint.

These ASCII diagrams are in fenced blocks without a language identifier, triggering MD040. Consider using text for clarity.

🔧 Suggested fix
-```
+```text
...
-```
+```

Also applies to: 123-154, 158-174

🧰 Tools
🪛 markdownlint-cli2 (0.20.0)

[warning] 57-57: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
In `@docs/nic-health-monitor/Overview.md` around lines 21 - 88, The fenced ASCII
diagram blocks lack a language tag which triggers MD040; update each
triple-backtick fence that wraps the ASCII diagrams (the large NVSentinel NIC
MONITORING ARCHITECTURE diagram and the other similar diagram blocks) to use a
language identifier `text` (i.e. replace ``` with ```text) so markdownlint
passes, and apply the same change to the other diagram fences referenced in the
comment.

| **Log Watcher** | `mlx5_core.*cmd_exec timeout`, `health poll failed`, `unrecoverable`, `PCIe Fatal` | **Fatal** | Deterministic hardware/driver failure |
| **Log Watcher** | `insufficient power`, `module absent`, `ACCESS_REG failed` | **Warning** | Diagnostic context for correlation |

> **Design Note**: Deterministically fatal events in logs trigger `REPLACE_VM`. Diagnostic logs remain as `Warning` and do not directly trigger remediation.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and do not directly trigger remediation.

meaning they will be published as non-fatal events? If yes, can we mention it that way so that it is clear?

│ │ │ BEHAVIOR: │ │ │
│ │ │ • Reports RAW state events │ │ │
│ │ │ • NO aggregation │ │ │
│ │ │ • NO local persistence │ │ │
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if we have to mention this, I understand that the previous iteration has persistence, but now this we have removed it, to someone who is reading this the first time it looks odd as why we would call this out.

IBStateArmed = "3: ARMED" // Armed but not active (check SM)
IBStateActive = "4: ACTIVE" // Normal operational state

// Port physical states
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a link to where we got these states from?

Comment on lines +279 to +280
- If `state = INIT` → Warning, message: "Port stuck in INIT - check Subnet Manager"
- If `state = ARMED` → Warning, message: "Port ARMED but not ACTIVE - check fabric manager"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are these warning, it looks like someone has to take some action on these right?

3. **Read physical state** from `/sys/class/infiniband/<dev>/ports/<port>/phys_state`
4. **Evaluate physical state:**
- If `phys_state = Disabled` → Mark as **FATAL**, message: "Physical state DISABLED"
- If `phys_state = Polling` → Warning, message: "Link training in progress"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this warning? In this state jobs can't run right?

| `mlx5_core unrecoverable` | **Fatal** | Hardware admission of failure | `REPLACE_VM` |
| `PCIe Bus Error.*Fatal` | **Fatal** | PCIe link broken | `REPLACE_VM` |
| `NETDEV WATCHDOG` | **Fatal** | Data path stalled | `REPLACE_VM` |
| `Detected insufficient power` | **Warning** | Power negotiation status | `NONE` |
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, we make use of fatal v/s non-fatal as the convention, but in these designs, we are using warning which is a bit different. Can we use the same terminology so that it is less confusing?

- Often indicates "Silent Drop" or **Black Hole** in the fabric
- Local symptom of a **remote** problem

**What This Monitor CAN Detect**: The `local_ack_timeout_err` and `req_transport_retries_exceeded` (native IB) hardware counters track these retry events at the NIC level. Rising counter values indicate transport-layer problems even if we can't see the application error.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this also be triggered by an application? Say one of the ranks of a multi-node training crashes or hangs, would it cause this ack timeout to be incremented on the other nodes?


**Key Non-Fatal Counters** (monitor for performance degradation):

| Category | Counters | IsFatal | Alert Threshold | Justification |
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we link to documentation that recommend the alert thresholds?

|-----------------------------------|----------|-------------------------|---------|----------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------|
| `link_downed` | Standard | **Delta > 0 (Runtime)** | **YES** | Logical path destruction; QP disconnect. Standard HPC apps don't support transparent QP rerouting. | [HPE ClusterStor](https://support.hpe.com/hpesc/public/docDisplay?docId=sd00001143en_us) |
| `excessive_buffer_overrun_errors` | Standard | **> 0 (Any)** | **YES** | Lossless guarantee violation; packet dropped immediately. HCA ingress buffer overflow. | [IBM Redbooks](https://www.redbooks.ibm.com/redbooks/pdfs/sg247767.pdf) |
| `rnr_nak_retry_err` | Extended | **> 0 (Any)** | **YES** | Receiver Not Ready NAK retry exhausted; connection severed. Terminal state of error handling. | [Datadog IB](https://docs.datadoghq.com/integrations/infiniband/) |
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The link doesn't seem to talk about the threshold or how it will cause a fatal error, can we review the linked docs and make sure that we link to the exact section where we have documentation where it talks about the impact and fatal classification?

isFatal: false
thresholdType: velocity
threshold: 10.0
velocityUnit: per_second
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
velocityUnit: per_second
velocityUnit: second

│ LAYER 1: LINK STATE DETECTION │
│ ═══════════════════════════════ │
│ • Polling interval: 1 second │
│ • Data source: /sys/class/infiniband/<dev>/ports/<port>/state, phys_state │
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last time when we had nic-health-monitor deployed in our prod clusters. one of the issues which we faced was whether certain interfaces(be it infiniband/ethernet) are in active use or not. if they are not in active use, then whether they stay up/down doesn't matter and I vaguely remember due to this all the nodes in some cluster(probably osmo) should nic down even though that nic interface was not getting used. how is it going to help this time? will it still be configuration driven where we need to mention the interfaces we want to monitor based cluster/csp?

| H100 | 8 | 400 Gb/s (NDR) | [DGX H100 User Guide](https://docs.nvidia.com/dgx/dgxh100-user-guide/introduction-to-dgxh100.html) |
| H200 | 8 | 400 Gb/s (NDR) | NVIDIA Documentation |
| B200 | 8 | 400 Gb/s | [DGX B200 User Guide](https://docs.nvidia.com/dgx/dgxb200-user-guide/introduction-to-dgxb200.html) |
| GB200 | 8 | 400 Gb/s | [GB200 NVL2](https://www.nvidia.com/en-us/data-center/gb200-nvl2/) |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GB200 system uses fifth generation nvlink whereas H100 using fourth, wouldn't there would be ~1.5x speedup?


**GPU-to-InfiniBand Mapping:**

| GPU Product | Expected Ports | Expected Rate | Reference |
Copy link
Member

@XRFXLP XRFXLP Feb 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these theoretical expected rate if everything E2E as per data-sheet? What if rates are like ~195 gbps for A100?

│ ═══════════════════════════════════════════════════════════════════════════ │
│ │
│ IMPACT ON NCCL AllReduce (Ring Topology): │
│ │
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This diagram gives impression that communication is inside node between different GPUs i.e. link speed being nvlink speed (which is out of scope of this MR). Can we make it slightly more clear to have inter-node seperaiton?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: Infiniband health monitor

4 participants