feat: NIC health monitor design#752
Conversation
📝 WalkthroughWalkthroughAdds four comprehensive NIC Health Monitor design documents describing a three-layer detection architecture (link state, link counters, syslog correlation), per-node daemons, a platform connector, centralized Health Events Analyzer with MongoDB, data flows, event schemas, and operational/configuration guidance for GPU clusters. Changes
Sequence Diagram(s)sequenceDiagram
participant NodeDaemon as Node Daemon\n(Link State / Counters / Syslog)
participant PlatformConnector as Platform Connector
participant MongoDB as MongoDB (History)
participant Analyzer as Health Events Analyzer
participant Operator as Operator / Alerting
NodeDaemon->>PlatformConnector: send raw events (state,counters,syslog)
PlatformConnector->>MongoDB: store raw events / snapshots
PlatformConnector->>Analyzer: forward events / metadata
Analyzer->>MongoDB: query historical context / correlate
Analyzer->>Operator: emit health events (fatal/non-fatal)
Operator-->>Analyzer: acknowledge / trigger remediation
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Fix all issues with AI agents
In `@docs/nic-health-monitor.md`:
- Around line 1567-1576: Section 7.4's "Monitored Kernel Patterns" table is
missing two patterns that appear in Appendix A; either add
`mlx5_core.*ENABLE_HCA.*timeout` and `NETDEV WATCHDOG.*transmit queue.*timed
out` to the Section 7.4 table (alongside existing entries
`pci_power_insufficient`, `port_module_high_temp`, `cmd_exec_timeout`,
`unrecoverable_err`) with matching RecommendedAction and Meaning as in Appendix
A, or remove those two rows from Appendix A so both sections match; update the
entries where the table is defined in Section 7.4 (Monitored Kernel Patterns)
and ensure Appendix A's "Fatal Kernel Log Patterns" table reflects the same set
to keep the spec and quick reference consistent.
- Around line 675-677: The document contains conflicting guidance about the
link_downed counter; pick one consistent approach and update all references
accordingly: either (A) exclude link_downed from health monitoring — remove
link_downed from the "fatal counter" examples, the State Monitor data flow
lists, and the "Consolidated Deterministic Failure Thresholds" table (e.g.,
remove any Delta > 0 entries referring to link_downed), or (B) include it as
fatal — keep it in the examples/tables and augment the State Monitor and
threshold docs (e.g., the "Consolidated Deterministic Failure Thresholds" table
and State Monitor flow) with explicit detection logic for administrative events
(such as correlating link_downed deltas with kernel/driver reload log patterns,
process restarts, or sticky-window suppression logic) and update the note
currently at "Note on `link_downed`" to describe the distinguishing method;
ensure every occurrence of link_downed (examples, tables, and the State Monitor)
is updated to reflect the chosen approach.
- Around line 949-957: In the "5.6 RoCE Handling" section update the fatal
counters list to exclude or mark req_transport_retries_exceeded as Native IB
only: locate the section header "5.6 RoCE Handling" and the bullet list
containing symbol_error, link_downed, local_link_integrity_errors,
excessive_buffer_overrun_errors, req_transport_retries_exceeded and remove
req_transport_retries_exceeded from that RoCE-specific list (or append "(Native
IB only)" to that item) so the documentation matches the later statement that
req_transport_retries_exceeded is only available on Native InfiniBand.
- Line 537: The code references a misspelled protobuf enum symbol
pb.RecommenedAction; update that reference to pb.RecommendedAction wherever it's
used (specifically in the mapping for RecommendedAction) so the correct enum
name is used and the code compiles; also verify any related imports/aliases
still resolve to the protobuf package after the rename.
- Around line 133-146: The two out-of-order subsection headings should be
renumbered to sit under the "2.4 Hardware Failures Detectable by Local
Monitoring" section: change "#### 2.6.1 The `mlx5_core cmd_exec timeout`" to
"#### 2.4.1 The `mlx5_core cmd_exec timeout`" and change "#### 2.6.2 Transport
Retry Count Exceeded (Error 12)" to "#### 2.4.2 Transport Retry Count Exceeded
(Error 12)"; update the visible subsection numbers only (leave the heading text
and content unchanged) so the document hierarchy is correct.
🧹 Nitpick comments (2)
docs/nic-health-monitor.md (2)
575-597: Consider setting specific RecommendedAction in helper constructors.The
NewFatalEventhelper defaults toRecommendedAction_NONE, but Section 1.4 (lines 52-57) defines specific actions for different fatal conditions (e.g.,RecommendedAction_REPLACE_VMfor state=DOWN,RecommendedAction_RESTART_BMfor cmd_exec timeout).Callers must manually override this field after construction, which could lead to inconsistency. Consider either:
- Adding a
recommendedActionparameter to the constructor, or- Providing specialized constructors (e.g.,
NewFatalReplaceVMEvent,NewFatalRestartBMEvent)This would enforce consistent action mapping at the type level.
♻️ Alternative: Add parameter to constructor
-func NewFatalEvent(deviceName, message string, linkLayer string) *NicHealthEvent { +func NewFatalEvent(deviceName, message string, linkLayer string, recommendedAction RecommendedAction) *NicHealthEvent { checkName := "EthernetErrorCheck" if linkLayer == "InfiniBand" { checkName = "InfiniBandErrorCheck" } return &NicHealthEvent{ Version: 1, Agent: AGENT, CheckName: checkName, ComponentClass: "NIC", GeneratedTimestamp: time.Now(), Message: message, IsFatal: true, IsHealthy: false, NodeName: os.Getenv("NODE_NAME"), - RecommendedAction: RecommendedAction_NONE, + RecommendedAction: recommendedAction, EntitiesImpacted: []Entity{{EntityType: "NIC", EntityValue: deviceName}}, LinkLayer: linkLayer, } }
82-84: Add language specifiers to fenced code blocks for better rendering.Multiple fenced code blocks throughout the document lack language specifiers, which affects syntax highlighting and rendering. Static analysis identified blocks at lines 82, 110, 149, 210, 268, 289, 367, 383, 418, 439, 684, 1015, 1042, 1049, 1079, 1107, 1132, 1473, 1490.
Examples:
Line 82: Physical degradation flow diagram should use
textor leave unmarked for plain text
Line 110: Queue Pair state machine should usetext
Line 684: Counter reset problem should usetext
Line 1015: GID table example should usetextorbash
Line 1042: ASCII diagram should usetextExample fixes
-``` +```text Physical Impairment --> Eye Diagram Closes --> Symbol Errors --> FEC Corrections --> CRC Failures --> Packet Loss-
+text
RESET --> INIT --> RTR --> RTS</details> Based on learnings: As this is a design document, syntax highlighting is helpful for readability but not critical for correctness. </blockquote></details> </blockquote></details> <!-- This is an auto-generated comment by CodeRabbit for review status -->
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@docs/nic-health-monitor.md`:
- Line 100: The docs reference an undefined counter `uncorrectable_fec_errors`;
add a Fatal Counter Reference entry defining this counter: provide its sysfs
path (e.g., the NIC sysfs file used to expose FEC uncorrectable counts), a
recommended fatal threshold for PAM4 links (e.g., sustained extreme rates like
>1e6/sec or a per-interval threshold), and the recommended action (alert, run
diagnostics, or replace module). Update the Fatal Counter Reference table (and
cross-reference Sections 5.1 / 5.1.1 / 5.8 / Appendix A) to include
`uncorrectable_fec_errors` so the monitoring spec contains the counter name,
sysfs path, threshold, and remediation guidance.
🧹 Nitpick comments (1)
docs/nic-health-monitor.md (1)
82-85: Consider adding language identifiers to fenced code blocks for better rendering.Multiple fenced code blocks throughout the document (lines 82, 110, 149, 210, 268, 289, 367, 383, 417, 438, 683, 1014, 1041, 1048, 1078, 1106, 1131, 1477, 1494) lack language identifiers. While these are primarily ASCII diagrams and pseudocode, adding identifiers like
```textimproves markdown rendering and accessibility.
docs/nic-health-monitor.md
Outdated
| Deduplication: | ||
| └── Same message within 60s window --> Suppressed (prevents alert storms) |
There was a problem hiding this comment.
I don't think we need this, in our architecture we don't alert on these things, so it is perfectly valid to capture every error message, we do that for XIDs and we tend to get bursts of those.
docs/nic-health-monitor.md
Outdated
| [StateMonitoring] | ||
| # Expected link speed in Gb/s. Links negotiating below this are FATAL. | ||
| # Common values: 100, 200, 400 (NDR) | ||
| TargetLinkSpeedGbps = 400 |
There was a problem hiding this comment.
This seems to mean we will need per cluster configuration, are there ways to detect the expected speed by looking at the hardware perhaps?
docs/nic-health-monitor.md
Outdated
| } | ||
| ``` | ||
|
|
||
| ### 5.4 Device Discovery and Parsing |
There was a problem hiding this comment.
Won't this also pick up NICs that are not configured by network operator? I thought the last time we did this, we ran into a lot of false positives because of this.
|
This design is very detailed which is great! But I think it is trying to solve many things at once and I think we should try and reduce scope and also break it down into multiple smaller designs so that we can agree to smaller sections but while having the larger picture in mind. For scope, I suggest we only focus on Mellanox IB and RoCE, no EFA, TCPXO, eth (yet) they can be follow ups that can be done once we have made progress on the implementation of this. With respect to the break down, I think we should break it down into the following parts:
What do you think? |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@docs/nic-health-monitor/link-counter-detection.md`:
- Line 469: The "Fatal counters" code comment incorrectly lists symbol_error as
fatal; update that comment so symbol_error is not listed as a fatal counter and
instead aligns with the counter specification tables (which classify
symbol_error as non-fatal, rate-based >10/sec). Specifically edit the "Fatal
counters" list entry that currently includes `symbol_error` to remove it
(leaving `link_downed`, `local_link_integrity_errors`,
`excessive_buffer_overrun_errors`, `req_transport_retries_exceeded`) so the
comment and the tables (referencing `symbol_error`) are consistent.
In `@docs/nic-health-monitor/syslog-detection-correlation.md`:
- Around line 94-104: The diagram header "PATTERN MATCHING FOR (all emitted as
WARNING)" incorrectly implies all NIC driver patterns are Warning; update the
diagram (or its caption) so patterns listed under the "PATTERN MATCHING FOR"
block correctly reflect the authoritative classifications from the pattern table
(e.g., mark "mlx5_core cmd_exec timeout", "mlx5_core health poll failed",
"mlx5_core unrecoverable", "PCIe Bus Error.*Fatal", and "NETDEV
WATCHDOG.*mlx5_core" as Fatal rather than Warning), or alternatively add a clear
note next to that diagram pointing readers to Section 5.1 / the pattern table
for precise severity mappings so the diagram is not misleading.
🧹 Nitpick comments (2)
docs/nic-health-monitor/link-state-detection.md (2)
374-399: Clarify fallback behavior when GPU product name is unavailable.The speed degradation detection algorithm (line 377) mentions getting GPU product name "via NVML or node labels" but doesn't specify what happens if neither source is available (e.g., CPU-only nodes, bare metal without NVML).
Line 382 mentions "If no match found, skip speed validation" which is the right approach, but consider making this fallback behavior more explicit in the algorithm steps.
Suggested clarification
Add after step 1:
1. **Get GPU product name** via NVML or node labels (e.g., `nvidia.com/gpu.product`) - If neither source is available, skip speed validation (return early) 2. **Normalize** the product name to lowercase
802-860: Configuration format differs from other documents.This document uses INI format for configuration (lines 802-860), while
link-counter-detection.mduses YAML (lines 669-843). If these are configurations for the same component, consider consistency. If they're for different components, this is acceptable—just ensure it's clear which configuration applies where.
| │ │ PATTERN MATCHING FOR (all emitted as WARNING): │ │ | ||
| │ │ ├── mlx5_core cmd_exec timeout → Warning (investigate) │ │ | ||
| │ │ ├── mlx5_core health poll failed → Warning (investigate) │ │ | ||
| │ │ ├── mlx5_core unrecoverable → Warning (investigate) │ │ | ||
| │ │ ├── PCIe Bus Error.*Fatal → Warning (investigate) │ │ | ||
| │ │ ├── NETDEV WATCHDOG.*mlx5_core → Warning (investigate) │ │ | ||
| │ │ ├── High Temperature → Warning (monitor) │ │ | ||
| │ │ ├── Detected insufficient power → Warning (informational) │ │ | ||
| │ │ ├── module absent → Warning (informational) │ │ | ||
| │ │ └── ACCESS_REG.*failed → Warning (noise, can filter) │ │ | ||
| │ │ │ │ |
There was a problem hiding this comment.
Diagram oversimplifies severity classification.
The flow diagram labels all NIC driver patterns as "Warning (investigate)", but the authoritative pattern table (lines 420-431) correctly classifies several as Fatal (cmd_exec timeout, health poll failed, unrecoverable, PCIe Fatal, NETDEV WATCHDOG).
Consider updating the diagram to distinguish between Fatal and Warning patterns, or add a note that the diagram is simplified and readers should refer to Section 5.1 for accurate severity classifications.
🤖 Prompt for AI Agents
In `@docs/nic-health-monitor/syslog-detection-correlation.md` around lines 94 -
104, The diagram header "PATTERN MATCHING FOR (all emitted as WARNING)"
incorrectly implies all NIC driver patterns are Warning; update the diagram (or
its caption) so patterns listed under the "PATTERN MATCHING FOR" block correctly
reflect the authoritative classifications from the pattern table (e.g., mark
"mlx5_core cmd_exec timeout", "mlx5_core health poll failed", "mlx5_core
unrecoverable", "PCIe Bus Error.*Fatal", and "NETDEV WATCHDOG.*mlx5_core" as
Fatal rather than Warning), or alternatively add a clear note next to that
diagram pointing readers to Section 5.1 / the pattern table for precise severity
mappings so the diagram is not misleading.
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@docs/nic-health-monitor/link-state-detection.md`:
- Around line 7-20: Update the TOC entry for "Appendix A" so its fragment
matches the actual heading slug "Appendix A: Quick Reference - Fatal Condition
Classification" by replacing the current fragment
"#appendix-a-quick-reference---fatal-state-conditions" with the slug generated
from the real heading (e.g.,
"#appendix-a-quick-reference---fatal-condition-classification") so the link in
the list item for "Appendix A" points to the exact heading text; update any
other occurrences of the old fragment (noted also around lines 831-844) to the
same corrected fragment.
- Around line 552-556: The PF (Physical Function) table cell contains a sentence
fragment "Should ALWAYS be ACTIVE." — update the PF row in the table so the
description is a full sentence (e.g., "The Physical Function should always be
active.") and ensure capitalization/punctuation matches the document style;
locate the PF/Physical Function entry in the table and replace the fragment with
the completed sentence.
In `@docs/nic-health-monitor/Overview.md`:
- Around line 21-88: The fenced ASCII diagram blocks lack a language tag which
triggers MD040; update each triple-backtick fence that wraps the ASCII diagrams
(the large NVSentinel NIC MONITORING ARCHITECTURE diagram and the other similar
diagram blocks) to use a language identifier `text` (i.e. replace ``` with
```text) so markdownlint passes, and apply the same change to the other diagram
fences referenced in the comment.
| 1. [Overview](#1-overview) | ||
| 2. [Architecture](#2-architecture) | ||
| 3. [State Monitoring Specification](#3-state-monitoring-specification) | ||
| 4. [Link Speed Degradation Detection](#4-link-speed-degradation-detection) | ||
| 5. [Device Discovery and Parsing](#5-device-discovery-and-parsing) | ||
| 6. [State Change and Flap Detection](#6-state-change-and-flap-detection) | ||
| 7. [PCI Configuration Space Health Check](#7-pci-configuration-space-health-check) | ||
| 8. [SR-IOV Virtual Function Handling](#8-sr-iov-virtual-function-handling) | ||
| 9. [RoCE State Monitoring](#9-roce-state-monitoring) | ||
| 10. [Supported Hardware](#10-supported-hardware) | ||
| 11. [Data Structures](#11-data-structures) | ||
| 12. [Configuration](#12-configuration) | ||
| 13. [Event Management](#13-event-management) | ||
| - [Appendix A: Quick Reference - Fatal State Conditions](#appendix-a-quick-reference---fatal-state-conditions) |
There was a problem hiding this comment.
Fix TOC fragment for Appendix A to match the actual heading slug.
The TOC link for Appendix A doesn’t match the heading text, which triggers MD051. Update the fragment to align with the heading “Appendix A: Quick Reference - Fatal Condition Classification.”
🔧 Suggested fix
-- [Appendix A: Quick Reference - Fatal State Conditions](`#appendix-a-quick-reference---fatal-state-conditions`)
+- [Appendix A: Quick Reference - Fatal Condition Classification](`#appendix-a-quick-reference---fatal-condition-classification`)Also applies to: 831-844
🤖 Prompt for AI Agents
In `@docs/nic-health-monitor/link-state-detection.md` around lines 7 - 20, Update
the TOC entry for "Appendix A" so its fragment matches the actual heading slug
"Appendix A: Quick Reference - Fatal Condition Classification" by replacing the
current fragment "#appendix-a-quick-reference---fatal-state-conditions" with the
slug generated from the real heading (e.g.,
"#appendix-a-quick-reference---fatal-condition-classification") so the link in
the list item for "Appendix A" points to the exact heading text; update any
other occurrences of the old fragment (noted also around lines 831-844) to the
same corrected fragment.
| | Term | Full Name | Description | | ||
| |--------|-------------------|-----------------------------------------------------------------| | ||
| | **PF** | Physical Function | The "real" NIC. Host OS controls it. Should ALWAYS be ACTIVE. | | ||
| | **VF** | Virtual Function | A "virtual clone" of the PF. Created for VMs/containers to use. | | ||
|
|
There was a problem hiding this comment.
Complete the sentence in the PF definition row.
“Should ALWAYS be ACTIVE.” reads as a fragment. Consider making it a full sentence.
✍️ Suggested edit
-| **PF** | Physical Function | The "real" NIC. Host OS controls it. Should ALWAYS be ACTIVE. |
+| **PF** | Physical Function | The "real" NIC. Host OS controls it and it should ALWAYS be ACTIVE. |🧰 Tools
🪛 LanguageTool
[style] ~554-~554: To form a complete sentence, be sure to include a subject.
Context: ... | The "real" NIC. Host OS controls it. Should ALWAYS be ACTIVE. | | VF | Virtua...
(MISSING_IT_THERE)
🤖 Prompt for AI Agents
In `@docs/nic-health-monitor/link-state-detection.md` around lines 552 - 556, The
PF (Physical Function) table cell contains a sentence fragment "Should ALWAYS be
ACTIVE." — update the PF row in the table so the description is a full sentence
(e.g., "The Physical Function should always be active.") and ensure
capitalization/punctuation matches the document style; locate the PF/Physical
Function entry in the table and replace the fragment with the completed
sentence.
| ``` | ||
| ┌────────────────────────────────────────────────────────────────────────────────┐ | ||
| │ NVSentinel NIC MONITORING ARCHITECTURE │ | ||
| ├────────────────────────────────────────────────────────────────────────────────┤ | ||
| │ │ | ||
| │ ┌──────────────────────────────────────────────────────────────────────────┐ │ | ||
| │ │ PER-NODE DAEMONSETS (Raw Event Reporters) │ │ | ||
| │ ├──────────────────────────────────────────────────────────────────────────┤ │ | ||
| │ │ │ │ | ||
| │ │ ┌─────────────────────────────┐ ┌─────────────────────────────────┐ │ │ | ||
| │ │ │ NIC HEALTH MONITOR │ │ SYSLOG HEALTH MONITOR │ │ │ | ||
| │ │ │ ══════════════════ │ │ ════════════════════ │ │ │ | ||
| │ │ │ │ │ │ │ │ | ||
| │ │ │ DATA SOURCES: │ │ DATA SOURCE: │ │ │ | ||
| │ │ │ • /sys/class/infiniband/ │ │ • journald / /var/log/journal │ │ │ | ||
| │ │ │ • /sys/class/net/ │ │ │ │ │ | ||
| │ │ │ │ │ CHECK: │ │ │ | ||
| │ │ │ CHECKS: │ │ • SysLogsNICDriverError │ │ │ | ||
| │ │ │ • InfiniBandStateCheck │ │ (mlx5_core cmd_exec timeout, │ │ │ | ||
| │ │ │ • InfiniBandDegradationChk │ │ health poll failed, │ │ │ | ||
| │ │ │ • EthernetStateCheck │ │ unrecoverable, PCIe fatal) │ │ │ | ||
| │ │ │ • EthernetDegradationCheck │ │ │ │ │ | ||
| │ │ │ │ │ BEHAVIOR: │ │ │ | ||
| │ │ │ BEHAVIOR: │ │ • Reports RAW events as-is │ │ │ | ||
| │ │ │ • Reports RAW events as-is │ │ • NO aggregation │ │ │ | ||
| │ │ │ • NO aggregation │ │ • NO local persistence │ │ │ | ||
| │ │ │ • NO local persistence │ │ │ │ │ | ||
| │ │ └─────────────┬───────────────┘ └───────────────┬─────────────────┘ │ │ | ||
| │ │ │ │ │ │ | ||
| │ └────────────────┼──────────────────────────────────┼─────────────────────┘ │ | ||
| │ │ │ │ | ||
| │ └────────────────┬─────────────────┘ │ | ||
| │ │ │ | ||
| │ ▼ │ | ||
| │ ┌──────────────────────────────────────┐ │ | ||
| │ │ PLATFORM CONNECTOR │ │ | ||
| │ │ ══════════════════ │ │ | ||
| │ │ • Receives raw events │ │ | ||
| │ │ • Persists to MongoDB │ │ | ||
| │ │ • Triggers downstream │ │ | ||
| │ └──────────────────┬───────────────────┘ │ | ||
| │ │ │ | ||
| │ ▼ │ | ||
| │ ┌──────────────────────────────────────────────────────────────────────────┐ │ | ||
| │ │ CENTRALIZED DEPLOYMENT (Correlation & Intelligence) │ │ | ||
| │ ├──────────────────────────────────────────────────────────────────────────┤ │ | ||
| │ │ │ │ | ||
| │ │ ┌────────────────────────────────────────────────────────────────────┐ │ │ | ||
| │ │ │ HEALTH EVENTS ANALYZER │ │ │ | ||
| │ │ │ ══════════════════════ │ │ │ | ||
| │ │ │ │ │ │ | ||
| │ │ │ OPTIONAL CORRELATION (MongoDB Aggregation Pipelines): │ │ │ | ||
| │ │ │ │ │ │ | ||
| │ │ │ • Correlate kernel log warnings with port state events │ │ │ | ||
| │ │ │ • Provide diagnostic context for operator investigation │ │ │ | ||
| │ │ │ │ │ │ | ||
| │ │ │ NOTE: Deterministic kernel logs (cmd_exec timeout, etc.) = Fatal │ │ │ | ||
| │ │ │ Diagnostic logs = Warning │ │ │ | ||
| │ │ │ │ │ │ | ||
| │ │ │ OUTPUT: │ │ │ | ||
| │ │ │ • Deterministic failure detection and remediation triggering │ │ │ | ||
| │ │ │ • Diagnostic correlation for operator investigation │ │ │ | ||
| │ │ └────────────────────────────────────────────────────────────────────┘ │ │ | ||
| │ │ │ │ | ||
| │ └──────────────────────────────────────────────────────────────────────────┘ │ | ||
| │ │ | ||
| └────────────────────────────────────────────────────────────────────────────────┘ | ||
| ``` |
There was a problem hiding this comment.
Add a language tag to fenced blocks to satisfy markdownlint.
These ASCII diagrams are in fenced blocks without a language identifier, triggering MD040. Consider using text for clarity.
🔧 Suggested fix
-```
+```text
...
-```
+```Also applies to: 123-154, 158-174
🧰 Tools
🪛 markdownlint-cli2 (0.20.0)
[warning] 57-57: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
In `@docs/nic-health-monitor/Overview.md` around lines 21 - 88, The fenced ASCII
diagram blocks lack a language tag which triggers MD040; update each
triple-backtick fence that wraps the ASCII diagrams (the large NVSentinel NIC
MONITORING ARCHITECTURE diagram and the other similar diagram blocks) to use a
language identifier `text` (i.e. replace ``` with ```text) so markdownlint
passes, and apply the same change to the other diagram fences referenced in the
comment.
| | **Log Watcher** | `mlx5_core.*cmd_exec timeout`, `health poll failed`, `unrecoverable`, `PCIe Fatal` | **Fatal** | Deterministic hardware/driver failure | | ||
| | **Log Watcher** | `insufficient power`, `module absent`, `ACCESS_REG failed` | **Warning** | Diagnostic context for correlation | | ||
|
|
||
| > **Design Note**: Deterministically fatal events in logs trigger `REPLACE_VM`. Diagnostic logs remain as `Warning` and do not directly trigger remediation. |
There was a problem hiding this comment.
and do not directly trigger remediation.
meaning they will be published as non-fatal events? If yes, can we mention it that way so that it is clear?
| │ │ │ BEHAVIOR: │ │ │ | ||
| │ │ │ • Reports RAW state events │ │ │ | ||
| │ │ │ • NO aggregation │ │ │ | ||
| │ │ │ • NO local persistence │ │ │ |
There was a problem hiding this comment.
I'm not sure if we have to mention this, I understand that the previous iteration has persistence, but now this we have removed it, to someone who is reading this the first time it looks odd as why we would call this out.
| IBStateArmed = "3: ARMED" // Armed but not active (check SM) | ||
| IBStateActive = "4: ACTIVE" // Normal operational state | ||
|
|
||
| // Port physical states |
There was a problem hiding this comment.
Can we add a link to where we got these states from?
| - If `state = INIT` → Warning, message: "Port stuck in INIT - check Subnet Manager" | ||
| - If `state = ARMED` → Warning, message: "Port ARMED but not ACTIVE - check fabric manager" |
There was a problem hiding this comment.
Why are these warning, it looks like someone has to take some action on these right?
| 3. **Read physical state** from `/sys/class/infiniband/<dev>/ports/<port>/phys_state` | ||
| 4. **Evaluate physical state:** | ||
| - If `phys_state = Disabled` → Mark as **FATAL**, message: "Physical state DISABLED" | ||
| - If `phys_state = Polling` → Warning, message: "Link training in progress" |
There was a problem hiding this comment.
Why is this warning? In this state jobs can't run right?
| | `mlx5_core unrecoverable` | **Fatal** | Hardware admission of failure | `REPLACE_VM` | | ||
| | `PCIe Bus Error.*Fatal` | **Fatal** | PCIe link broken | `REPLACE_VM` | | ||
| | `NETDEV WATCHDOG` | **Fatal** | Data path stalled | `REPLACE_VM` | | ||
| | `Detected insufficient power` | **Warning** | Power negotiation status | `NONE` | |
There was a problem hiding this comment.
In general, we make use of fatal v/s non-fatal as the convention, but in these designs, we are using warning which is a bit different. Can we use the same terminology so that it is less confusing?
| - Often indicates "Silent Drop" or **Black Hole** in the fabric | ||
| - Local symptom of a **remote** problem | ||
|
|
||
| **What This Monitor CAN Detect**: The `local_ack_timeout_err` and `req_transport_retries_exceeded` (native IB) hardware counters track these retry events at the NIC level. Rising counter values indicate transport-layer problems even if we can't see the application error. |
There was a problem hiding this comment.
Can this also be triggered by an application? Say one of the ranks of a multi-node training crashes or hangs, would it cause this ack timeout to be incremented on the other nodes?
|
|
||
| **Key Non-Fatal Counters** (monitor for performance degradation): | ||
|
|
||
| | Category | Counters | IsFatal | Alert Threshold | Justification | |
There was a problem hiding this comment.
Can we link to documentation that recommend the alert thresholds?
| |-----------------------------------|----------|-------------------------|---------|----------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------| | ||
| | `link_downed` | Standard | **Delta > 0 (Runtime)** | **YES** | Logical path destruction; QP disconnect. Standard HPC apps don't support transparent QP rerouting. | [HPE ClusterStor](https://support.hpe.com/hpesc/public/docDisplay?docId=sd00001143en_us) | | ||
| | `excessive_buffer_overrun_errors` | Standard | **> 0 (Any)** | **YES** | Lossless guarantee violation; packet dropped immediately. HCA ingress buffer overflow. | [IBM Redbooks](https://www.redbooks.ibm.com/redbooks/pdfs/sg247767.pdf) | | ||
| | `rnr_nak_retry_err` | Extended | **> 0 (Any)** | **YES** | Receiver Not Ready NAK retry exhausted; connection severed. Terminal state of error handling. | [Datadog IB](https://docs.datadoghq.com/integrations/infiniband/) | |
There was a problem hiding this comment.
The link doesn't seem to talk about the threshold or how it will cause a fatal error, can we review the linked docs and make sure that we link to the exact section where we have documentation where it talks about the impact and fatal classification?
| isFatal: false | ||
| thresholdType: velocity | ||
| threshold: 10.0 | ||
| velocityUnit: per_second |
There was a problem hiding this comment.
| velocityUnit: per_second | |
| velocityUnit: second |
| │ LAYER 1: LINK STATE DETECTION │ | ||
| │ ═══════════════════════════════ │ | ||
| │ • Polling interval: 1 second │ | ||
| │ • Data source: /sys/class/infiniband/<dev>/ports/<port>/state, phys_state │ |
There was a problem hiding this comment.
Last time when we had nic-health-monitor deployed in our prod clusters. one of the issues which we faced was whether certain interfaces(be it infiniband/ethernet) are in active use or not. if they are not in active use, then whether they stay up/down doesn't matter and I vaguely remember due to this all the nodes in some cluster(probably osmo) should nic down even though that nic interface was not getting used. how is it going to help this time? will it still be configuration driven where we need to mention the interfaces we want to monitor based cluster/csp?
| | H100 | 8 | 400 Gb/s (NDR) | [DGX H100 User Guide](https://docs.nvidia.com/dgx/dgxh100-user-guide/introduction-to-dgxh100.html) | | ||
| | H200 | 8 | 400 Gb/s (NDR) | NVIDIA Documentation | | ||
| | B200 | 8 | 400 Gb/s | [DGX B200 User Guide](https://docs.nvidia.com/dgx/dgxb200-user-guide/introduction-to-dgxb200.html) | | ||
| | GB200 | 8 | 400 Gb/s | [GB200 NVL2](https://www.nvidia.com/en-us/data-center/gb200-nvl2/) | |
There was a problem hiding this comment.
GB200 system uses fifth generation nvlink whereas H100 using fourth, wouldn't there would be ~1.5x speedup?
|
|
||
| **GPU-to-InfiniBand Mapping:** | ||
|
|
||
| | GPU Product | Expected Ports | Expected Rate | Reference | |
There was a problem hiding this comment.
Are these theoretical expected rate if everything E2E as per data-sheet? What if rates are like ~195 gbps for A100?
| │ ═══════════════════════════════════════════════════════════════════════════ │ | ||
| │ │ | ||
| │ IMPACT ON NCCL AllReduce (Ring Topology): │ | ||
| │ │ |
There was a problem hiding this comment.
This diagram gives impression that communication is inside node between different GPUs i.e. link speed being nvlink speed (which is out of scope of this MR). Can we make it slightly more clear to have inter-node seperaiton?
Summary
Preview the design
Type of Change
Component(s) Affected
Testing
Checklist
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.