feat: publish GpuTotalCountMismatch health event in case an nvidia GPU is not getting detected by the driver #833
deesharma24 wants to merge 1 commit into NVIDIA:main from
Conversation
📝 Walkthrough

Adds an optional GPU count check: enumerates NVIDIA GPUs from sysfs at startup, compares sysfs GPU PCI addresses to DCGM-reported GPUs each health-check cycle, calls a callback with results, and emits deduplicated health events for mismatches. Toggle available via CLI and Helm values.
Sequence Diagram

```mermaid
sequenceDiagram
actor User
participant CLI as CLI Handler
participant Watcher as DCGMWatcher
participant Sysfs as Sysfs Scanner
participant DCGM as DCGM API
participant Connector as PlatformConnector
participant gRPC as Platform gRPC
User->>CLI: start (--gpu-count-check-enabled=true)
CLI->>Watcher: initialize(gpu_count_check_enabled=true)
Watcher->>Sysfs: get_nvidia_gpu_pci_addresses()
Sysfs-->>Watcher: sysfs_gpu_pci_addresses
loop Health check cycle
Watcher->>DCGM: query GPU IDs
DCGM-->>Watcher: dcgm_gpu_ids
Watcher->>Watcher: compare sysfs vs dcgm
Watcher->>Connector: gpu_count_check_completed(sysfs_addrs, dcgm_ids)
alt Mismatch
Connector->>Connector: compute missing PCI addresses
Connector->>gRPC: send_health_event(GpuTotalCountMismatch, impactedEntities, isFatal=true)
gRPC-->>Connector: ack
Connector->>Connector: update cache (dedupe)
else Match / resolved
Connector->>gRPC: send_health_event(GpuTotalCountMismatch cleared / healthy)
gRPC-->>Connector: ack
Connector->>Connector: update cache (dedupe)
end
end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ❌ 1 failed (1 warning), ✅ 2 passed
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-4.x.yaml (1)
90-99: ⚠️ Potential issue | 🔴 Critical

Add explicit `/sys` hostPath mount for GPU PCIe enumeration.

The GPU health monitor reads `/sys/bus/pci/devices/` at startup to enumerate NVIDIA GPUs (vendor `0x10de`, class `0x0302`). The daemonset currently lacks an explicit volume mount for `/sys`, which means the enumeration silently fails and returns zero GPUs, causing persistent mismatch errors.

Add this to the daemonset's `volumeMounts` (around line 98):

```yaml
- name: sys
  mountPath: /sys
  readOnly: true
```

And add the corresponding volume definition (after the `gpu-metadata` volume):

```yaml
- name: sys
  hostPath:
    path: /sys
    type: Directory
```

The metadata-collector daemonset in the same chart already mounts `/sys` in this exact way, establishing precedent for this cluster setup.
🤖 Fix all issues with AI agents
In
`@health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_gpu_count_mismatch.py`:
- Around line 80-84: The helper metadata_file() writes sample JSON to a
NamedTemporaryFile but returns its path while the file is still open, risking
unreadable/partial content; modify metadata_file (the NamedTemporaryFile
creation) to ensure the file is flushed and closed before returning the
path—e.g., open/write/close inside a context manager or explicitly call
f.flush() and f.close() after json.dump—so
MetadataReader/PlatformConnectorEventProcessor reads a fully written file.
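A minimal sketch of the suggested pattern, assuming the helper is named `metadata_file()`; the payload shown here is illustrative, not the fixture's real contents:

```python
import json
import tempfile


def metadata_file() -> str:
    """Write sample GPU metadata to a temp file and return its path.

    delete=False keeps the file on disk after close so the reader can
    open it by path; the test is responsible for cleaning it up.
    """
    sample = {"gpus": [{"id": 0, "pci_address": "0000:17:00.0"}]}  # illustrative payload
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        json.dump(sample, f)
        # The context manager flushes and closes the file before we return,
        # so the returned path always points at fully written JSON.
    return f.name
```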
🧹 Nitpick comments (6)
health-monitors/gpu-health-monitor/gpu_health_monitor/dcgm_watcher/types.py (1)
53-60: Missing return type annotation.

The concrete implementation in `platform_connector.py` annotates `-> None`; the abstract declaration here should match. That said, the existing methods (`health_event_occurred`, `dcgm_connectivity_failed`) also omit return types, so this is a pre-existing pattern.

Suggested fix

```diff
 @abc.abstractmethod
-def gpu_count_check_completed(self, sysfs_gpu_pci_addresses: list[str], dcgm_gpu_ids: list[int]):
+def gpu_count_check_completed(self, sysfs_gpu_pci_addresses: list[str], dcgm_gpu_ids: list[int]) -> None:
```

As per coding guidelines, "Type hints required for all functions in Python code".
docs/gpu-health-monitor.md (1)
103-103: Nit: clarify the configuration parameter name.

The doc says `gpu_count_check_enabled` (Python/underscore style), but users configure this via Helm (`gpuCountCheck.enabled`) or CLI (`--gpu-count-check-enabled`). Consider referencing the Helm value name since this is deployment-facing documentation.

health-monitors/gpu-health-monitor/gpu_health_monitor/dcgm_watcher/dcgm.py (1)

276-282: Consider guarding against empty sysfs results being cached permanently.

If `get_nvidia_gpu_pci_addresses()` returns an empty list due to a transient sysfs read error at startup, the cached count will be `0` for the lifetime of the process. This would suppress mismatch detection entirely (both counts would appear as 0 vs N), or, if DCGM also sees GPUs, it would report a false mismatch that can never self-resolve since the cache is static.

A simple guard — logging a more prominent warning or even retrying the sysfs scan once — could improve resilience:

💡 Suggested improvement

```diff
 sysfs_gpu_pci_addresses = []
 if self._gpu_count_check_enabled:
     sysfs_gpu_pci_addresses = get_nvidia_gpu_pci_addresses()
-    log.info(f"Sysfs GPU count (PCIe hardware): {len(sysfs_gpu_pci_addresses)}")
+    if not sysfs_gpu_pci_addresses:
+        log.warning("Sysfs returned 0 GPUs — GPU count check will be ineffective this cycle")
+    else:
+        log.info(f"Sysfs GPU count (PCIe hardware): {len(sysfs_gpu_pci_addresses)}")
 else:
     log.info("GPU count check is disabled")
```
32-32: Hardcoded `/tmp` path for the UDS socket.

Static analysis (S108) flags this as an insecure temp path. For tests, consider using `tempfile.mkdtemp()` to create an isolated directory and placing the socket there. This avoids collisions if tests run in parallel and addresses the security hint.
91-93: Servicer overwrites `health_events` on each RPC call — only the last batch is retained.

This is fine for the current tests since each assertion follows a single `gpu_count_check_completed` call. However, if future tests need to verify cumulative events or multiple batches, this will silently lose earlier events. Consider appending instead:

```python
self.health_events.extend(request.events)
```

This would also require resetting `self.health_events = []` between test phases (already done in some tests).

health-monitors/gpu-health-monitor/gpu_health_monitor/pcie/gpu_counter.py (1)
18-19: Documentation for `GPU_CLASS_PREFIX` is adequate but could be clearer.

The constant `GPU_CLASS_PREFIX = "0x0302"` correctly targets NVIDIA data center GPUs (A100, H100, etc.). The inline comment and docstring identify these target SKUs. However, the documentation doesn't explain why data center GPUs use `0x0302` while consumer/display GPUs use `0x0300`. Consider expanding the comment to clarify this distinction to prevent future maintainers from wondering why consumer GPUs aren't detected.
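One way the expanded comment could read. The class-code breakdown is from the PCI specification (base class 0x03 is "display controller"; subclass 0x00 is a VGA-compatible controller, subclass 0x02 is a headless 3D controller), while the exact constant names, including the vendor constant, are assumptions based on the review's description:

```python
# PCI class codes are "<base class><subclass><prog-if>". Base class 0x03 is
# "display controller"; subclass 0x00 is a VGA-compatible controller, which is
# how consumer GPUs with display outputs enumerate, while subclass 0x02 is a
# headless "3D controller", which is how data center SKUs such as A100/H100
# appear on the PCIe bus. Matching on the 0x0302 prefix therefore selects
# data center GPUs and intentionally excludes consumer/display cards (0x0300).
GPU_CLASS_PREFIX = "0x0302"
NVIDIA_VENDOR_ID = "0x10de"  # NVIDIA's PCI vendor ID
```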
...u-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_gpu_count_mismatch.py
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@docs/gpu-health-monitor.md`:
- Around line 96-97: The docs and code disagree about mismatch direction: the
implementation triggers on len(sysfs_gpu_pci_addresses) != len(dcgm_gpu_ids)
while the doc text says sysfs count > DCGM count; decide which behavior is
correct and make them consistent: either update the documentation sentence to
say "sysfs count ≠ DCGM count" and leave the implementation (referencing
len(sysfs_gpu_pci_addresses) and len(dcgm_gpu_ids) and the GpuTotalCountMismatch
health event) as-is, or tighten the implementation to only treat sysfs > DCGM as
a mismatch by changing the conditional that currently uses
len(sysfs_gpu_pci_addresses) != len(dcgm_gpu_ids) to a strict comparison
(len(sysfs_gpu_pci_addresses) > len(dcgm_gpu_ids)) so it matches the documented
behavior and still populates the GpuTotalCountMismatch event with the missing
PCI addresses.
In
`@health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py`:
- Around line 414-476: The GPU count mismatch event built in
_send_gpu_count_mismatch_event currently sets
recommendedAction=platformconnector_pb2.COMPONENT_RESET while entitiesImpacted
only contains PCI entries (entitiesImpacted) or may be empty
(missing_from_driver), which prevents component-level reset; change the logic to
detect when no GPU_UUID entities are present or when entitiesImpacted is empty
and set recommendedAction=platformconnector_pb2.RESTART_VM instead (or otherwise
include GPU_UUID if available via _metadata_reader), and update the message to
handle the empty-case (e.g., omit "Missing from driver: []" or say "none
detected") so downstream consumers (node-drainer/fault-remediation) receive a
compatible action; make these changes inside _send_gpu_count_mismatch_event and
adjust errorCode/metadata as needed if you change action semantics.
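A rough sketch of the fallback the prompt asks for. The action names mirror the `platformconnector_pb2` constants mentioned above, but the entity shape, helper name, and plain-string return values are illustrative only:

```python
def choose_recommended_action(entities_impacted: list[dict]) -> str:
    """Pick a remediation action for a GpuTotalCountMismatch event.

    COMPONENT_RESET only makes sense when a concrete GPU entity (GPU_UUID)
    can be targeted; with only PCI-address entities, or no entities at all,
    fall back to RESTART_VM so downstream consumers get an actionable step.
    """
    has_gpu_uuid = any(e.get("type") == "GPU_UUID" for e in entities_impacted)
    if entities_impacted and has_gpu_uuid:
        return "COMPONENT_RESET"
    return "RESTART_VM"


# Illustrative usage: only PCI-address entities, so fall back to RESTART_VM.
print(choose_recommended_action([{"type": "PCI_ADDRESS", "value": "0000:17:00.0"}]))
```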
🧹 Nitpick comments (4)
health-monitors/gpu-health-monitor/gpu_health_monitor/dcgm_watcher/dcgm.py (1)
276-282: Consider handling an empty sysfs result at startup more explicitly.

If `get_nvidia_gpu_pci_addresses()` returns `[]` (e.g., sysfs read failure, container without `/sys` mounted), the empty list is cached for the watcher's lifetime. Every subsequent cycle will report a mismatch (0 sysfs GPUs vs N DCGM GPUs), potentially causing a persistent false-positive `GpuTotalCountMismatch` event.

Consider logging a more prominent warning or disabling the check when the sysfs scan returns zero GPUs at startup, since a node with DCGM-visible GPUs but zero sysfs GPUs almost certainly indicates a scan failure rather than a real mismatch.

💡 Suggested approach

```diff
 if self._gpu_count_check_enabled:
     sysfs_gpu_pci_addresses = get_nvidia_gpu_pci_addresses()
     log.info(f"Sysfs GPU count (PCIe hardware): {len(sysfs_gpu_pci_addresses)}")
+    if not sysfs_gpu_pci_addresses:
+        log.warning(
+            "Sysfs returned 0 GPUs — possible scan failure or missing /sys mount. "
+            "Disabling GPU count check to avoid false positives."
+        )
+        self._gpu_count_check_enabled = False
 else:
     log.info("GPU count check is disabled")
```

health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_gpu_count_mismatch.py (2)
32-32: Use a temporary path for the UDS socket instead of a hardcoded `/tmp` path.

Ruff S108 flags this as a probable insecure usage of a temporary file path. In shared CI environments, `/tmp/nvsentinel.sock` could collide with concurrent test runs or be subject to symlink attacks. Consider using `tempfile.mkdtemp()` to create an isolated directory for the socket.

💡 Suggested approach

```diff
-socket_path = "/tmp/nvsentinel.sock"
+import tempfile as _tmpmod
+_test_sock_dir = _tmpmod.mkdtemp()
+socket_path = os.path.join(_test_sock_dir, "nvsentinel.sock")
```

And clean up the directory in `tearDown`.
87-93: Servicer replaces `health_events` on each gRPC call rather than accumulating.

`self.health_events = list(request.events)` overwrites the list each time `HealthEventOccurredV1` is called. This works for current tests because each processor call maps to at most one gRPC request, but it's fragile — if the implementation ever batches or sends multiple gRPC calls, events from earlier calls would be silently lost. Consider using `extend` instead:

```diff
- self.health_events = list(request.events)
+ self.health_events.extend(request.events)
```

This would also require clearing `self.health_events` in the test methods that currently assign `[]` to it (which they already do), and would make the test assertions more robust.

health-monitors/gpu-health-monitor/gpu_health_monitor/pcie/gpu_counter.py (1)
39-41: Use `log.exception` to preserve the traceback.

Inside an `except` block, `log.exception` automatically includes the traceback, which is more useful for diagnosing sysfs access failures.

Proposed fix

```diff
 except OSError as e:
-    log.error(f"Failed to read PCI devices directory {devices_path}: {e}")
+    log.exception(f"Failed to read PCI devices directory {devices_path}: {e}")
     return []
```
a38c780 to 979664b
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@health-monitors/gpu-health-monitor/gpu_health_monitor/dcgm_watcher/dcgm.py`:
- Around line 276-290: The sysfs scan failure handling logs that the GPU count
check is being disabled but never actually flips the flag, so update the failure
branch after get_nvidia_gpu_pci_addresses() returns empty to set
self._gpu_count_check_enabled = False (and keep
metrics.gpu_count_check_sysfs_scan_failure.set(1) and the warning log), ensuring
subsequent cycles skip the gpu-count comparison in gpu_count_check_completed and
avoid emitting spurious GpuTotalCountMismatch events.
In
`@health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py`:
- Around line 428-436: The calculation of missing_from_driver can be skewed if
self._metadata_reader.get_pci_address(gpu_id) returns None for some DCGM GPUs;
update the loop that builds dcgm_pci_addresses (iterating dcgm_gpu_ids and
calling get_pci_address) to skip None results when adding to dcgm_pci_addresses
and emit a warning log (including the gpu_id and context) whenever
get_pci_address yields None so operators can differentiate
metadata-not-yet-populated cases from real driver-missing GPUs; keep the final
set logic (sysfs_pci_set - dcgm_pci_addresses) unchanged but ensure
dcgm_pci_addresses only contains valid PCI strings.
- Around line 414-476: The implementation in _send_gpu_count_mismatch_event sets
recommendedAction=platformconnector_pb2.RESTART_VM but
test_gpu_count_mismatch.py asserts COMPONENT_RESET, causing a mismatch; update
the test to assert platformconnector_pb2.RESTART_VM (to match
_send_gpu_count_mismatch_event and the existing test_platform_connector
expectation) so all tests agree on the recommendedAction constant used by the
HealthEvent.
In
`@health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_gpu_count_mismatch.py`:
- Line 147: The test asserts the wrong recommendedAction constant; update the
assertion in test_gpu_count_mismatch (the line checking event.recommendedAction)
to expect platformconnector_pb2.RESTART_VM to match the implementation in
_send_gpu_count_mismatch_event in platform_connector.py (which sets
recommendedAction=platformconnector_pb2.RESTART_VM).
🧹 Nitpick comments (6)
health-monitors/gpu-health-monitor/gpu_health_monitor/dcgm_watcher/types.py (1)
53-60: Add return type hint `-> None`.

The coding guidelines require type hints for all functions. The concrete implementation in `platform_connector.py` already has `-> None`; the abstract definition should match.

Note: the existing methods in this class (`health_event_occurred`, `dcgm_connectivity_failed`) also omit return types — consider adding them in a follow-up.

Proposed fix

```diff
 @abc.abstractmethod
-def gpu_count_check_completed(self, sysfs_gpu_pci_addresses: list[str], dcgm_gpu_ids: list[int]):
+def gpu_count_check_completed(self, sysfs_gpu_pci_addresses: list[str], dcgm_gpu_ids: list[int]) -> None:
```

As per coding guidelines, "Type hints required for all functions in Python code".
health-monitors/gpu-health-monitor/gpu_health_monitor/pcie/gpu_counter.py (1)
37-41: Redundant exception object in `logging.exception` call.

`log.exception()` already captures and logs the full exception traceback. Including `{e}` in the format string duplicates the exception message in the log output.

Proposed fix

```diff
 except OSError as e:
-    log.exception(f"Failed to read PCI devices directory {devices_path}: {e}")
+    log.exception(f"Failed to read PCI devices directory {devices_path}")
     return []
```

health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_dcgm_watcher/test_dcgm.py (1)
37-38: Consider capturing the callback invocation for assertion in `test_start`.

The stub silently drops the call. Since `test_start` (Line 336) exercises the full `watcher.start()` loop (which now fires `gpu_count_check_completed` each cycle), you could track invocation here to verify the callback is actually dispatched — similar to how `health_details` is captured on Line 32.

This is optional since the platform connector tests cover the real logic.
health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py (2)
367-401: `log.error` should be `log.exception` to preserve the traceback.

Ruff TRY400 flags line 401. Using `log.exception` automatically includes the traceback, which is essential for debugging send failures in production.

Proposed fix

```diff
 except Exception as e:
-    log.error(f"Exception while sending GPU count match resolved event: {e}")
+    log.exception(f"Exception while sending GPU count match resolved event: {e}")
```

471-476: Same `log.error` → `log.exception` improvement applies here (Ruff TRY400).

Proposed fix

```diff
 except Exception as e:
-    log.error(f"Exception while sending GPU count mismatch event: {e}")
+    log.exception(f"Exception while sending GPU count mismatch event: {e}")
```

health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.py (1)
885-888: Patching module-level constants is fragile — consider `unittest.mock.patch`.

Directly mutating `platform_connector.MAX_RETRIES` and `platform_connector.INITIAL_DELAY` risks leaking state if `finally` doesn't execute (e.g., test runner killed). Using `unittest.mock.patch` as a context manager or decorator handles cleanup automatically and is more idiomatic.

That said, the `finally` block at lines 1041-1042 does restore the originals, so this is a low-risk nit.

Also applies to: 1041-1042
health-monitors/gpu-health-monitor/gpu_health_monitor/dcgm_watcher/dcgm.py
health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py
health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py
...u-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_gpu_count_mismatch.py
Outdated
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In
`@health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py`:
- Around line 403-412: The callback gpu_count_check_completed can emit
false-positive mismatches when sysfs enumeration fails and returns an empty
list; add a guard so it does not report a mismatch if sysfs_gpu_pci_addresses is
empty (or if sysfs has been disabled). Specifically, inside
gpu_count_check_completed, check if not sysfs_gpu_pci_addresses (or if
self._gpu_count_check_enabled is False) before comparing lengths and skip
calling _send_gpu_count_mismatch_event in that case (instead either no-op or
call _clear_gpu_count_mismatch as appropriate); alternatively, when
get_nvidia_gpu_pci_addresses() detects sysfs failure, set
self._gpu_count_check_enabled = False so subsequent gpu_count_check_completed
invocations are ignored.
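A minimal sketch of the requested guard, with simplified names; the real callback lives on the platform connector and sends protobuf health events rather than returning strings:

```python
import logging

log = logging.getLogger(__name__)


def gpu_count_check_completed(sysfs_gpu_pci_addresses: list[str],
                              dcgm_gpu_ids: list[int]) -> str:
    """Simplified stand-in for the connector callback."""
    # Guard: an empty sysfs list almost always means the scan failed
    # (e.g. /sys not mounted), not that the node truly has zero GPUs,
    # so skip the comparison rather than emit a false mismatch.
    if not sysfs_gpu_pci_addresses:
        log.warning("Skipping GPU count check: sysfs enumeration returned no GPUs")
        return "skipped"
    if len(sysfs_gpu_pci_addresses) != len(dcgm_gpu_ids):
        return "mismatch"
    return "healthy"
```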
🧹 Nitpick comments (3)
distros/kubernetes/nvsentinel/charts/gpu-health-monitor/values.yaml (1)
55-59: Add a comment noting the default and behavior when disabled.

The inline comments describe the check well, but per the coding guidelines for `values.yaml`, consider noting that this is enabled by default and what disabling it does (i.e., skips the PCIe-vs-DCGM count comparison and suppresses mismatch health events).

Suggested improvement

```diff
 # GPU count check configuration
 # Compares PCIe bus GPU count (sysfs) vs driver-visible GPU count (DCGM)
 # and sends a health event on mismatch
+# Enabled by default. Set to false to disable the check and suppress
+# GpuTotalCountMismatch health events.
 gpuCountCheck:
   enabled: true
```

As per coding guidelines, "Document all values in Helm chart `values.yaml` with inline comments" and "Note truthy value requirements in Helm chart documentation where applicable".

health-monitors/gpu-health-monitor/gpu_health_monitor/pcie/gpu_counter.py (1)
39-41: Redundant exception object in `log.exception` call.

`log.exception()` already attaches the active exception's traceback. Including `{e}` in the f-string duplicates the information.

Proposed fix

```diff
 except OSError as e:
-    log.exception(f"Failed to read PCI devices directory {devices_path}: {e}")
+    log.exception(f"Failed to read PCI devices directory {devices_path}")
     return []
```

health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py (1)
400-401: Use `log.exception` instead of `log.error` to preserve tracebacks.

Both exception handlers use `log.error(f"...")`, which loses the stack trace. `log.exception` automatically includes the traceback for better debuggability. This aligns with the static analysis hint (TRY400) and the pattern used in `clear_dcgm_connectivity_failure` at line 129.

Proposed fix

```diff
 except Exception as e:
-    log.error(f"Exception while sending GPU count match resolved event: {e}")
+    log.exception(f"Exception while sending GPU count match resolved event: {e}")
```

```diff
 except Exception as e:
-    log.error(f"Exception while sending GPU count mismatch event: {e}")
+    log.exception(f"Exception while sending GPU count mismatch event: {e}")
```

Also applies to: 475-476
health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py
detecting less than total gpu
Summary
bug => #684
Add GPU PCIe count mismatch detection to gpu-health-monitor that compares the number of GPUs visible on the PCIe bus (via sysfs) against the number reported by DCGM (driver), firing a GpuTotalCountMismatch health event when they diverge.
## Changes
1. PCIe GPU enumeration via sysfs -- Added a new pcie/gpu_counter.py module that scans /sys/bus/pci/devices/ to discover NVIDIA GPUs on the PCIe bus by matching vendor ID (0x10de) and GPU class prefix (0x0302). This provides hardware ground-truth GPU count independent of the driver stack. Addresses are cached once at startup since PCIe topology is static per boot.
2. GPU count mismatch health event -- Implemented gpu_count_check_completed callback in PlatformConnectorEventProcessor that compares sysfs GPU count against DCGM-reported GPU count every health check cycle. On mismatch, a fatal GpuTotalCountMismatch health event is sent with COMPONENT_RESET recommended action, identifying the specific missing PCI addresses. On resolution, a healthy event clears the previous mismatch. A healthy baseline event is sent on the first cycle to align with existing health watch patterns.
3. DCGMWatcher integration -- Extended DCGMWatcher to call gpu_count_check_completed on all registered callbacks after each successful DCGM health check, passing both sysfs PCI addresses and DCGM GPU IDs. Added a gpu_count_check_enabled flag to gate the entire feature -- when disabled, no sysfs scan occurs and no callbacks are fired. A combined sketch of items 1-3 follows this list.
4. Helm chart configuration -- Added gpuCountCheck.enabled toggle to values.yaml (default true) and a --gpu-count-check-enabled CLI argument to both DCGM 3.x and 4.x DaemonSet templates, following the same configuration flow as dcgm.dcgmK8sServiceEnabled. This allows operators to enable or disable the feature at deployment time without code changes.
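Taken together, items 1-3 boil down to a sysfs scan, a count comparison, and a deduplicated event. The sketch below is an illustration under stated assumptions, not the monitor's actual code: the class name, the metadata lookup callable, and the event dict are made up, and only the constants (vendor `0x10de`, class prefix `0x0302`) and the overall flow come from the description above.

```python
import logging
from pathlib import Path
from typing import Callable, Optional

log = logging.getLogger(__name__)

NVIDIA_VENDOR_ID = "0x10de"
GPU_CLASS_PREFIX = "0x0302"  # PCI class 0x0302xx = 3D controller (data center GPUs)


def get_nvidia_gpu_pci_addresses(devices_path: str = "/sys/bus/pci/devices") -> list[str]:
    """Enumerate NVIDIA GPU PCI addresses from sysfs (hardware ground truth)."""
    root = Path(devices_path)
    if not root.is_dir():
        log.warning("PCI devices path %s not found; is /sys mounted?", devices_path)
        return []
    addresses = []
    for device in sorted(root.iterdir()):
        try:
            vendor = (device / "vendor").read_text().strip()
            pci_class = (device / "class").read_text().strip()
        except OSError:
            continue  # attribute unreadable or device vanished mid-scan; skip it
        if vendor == NVIDIA_VENDOR_ID and pci_class.startswith(GPU_CLASS_PREFIX):
            addresses.append(device.name)  # e.g. "0000:17:00.0"
    return addresses


class GpuCountCheck:
    """Compares a cached sysfs GPU list against DCGM's view each cycle."""

    def __init__(self, enabled: bool,
                 get_pci_address: Callable[[int], Optional[str]]) -> None:
        self._enabled = enabled
        self._get_pci_address = get_pci_address  # stand-in for a metadata lookup
        self._mismatch_reported = False  # dedup: only report state transitions
        # PCIe topology is static per boot, so scan sysfs once at startup.
        self._sysfs_addresses = get_nvidia_gpu_pci_addresses() if enabled else []

    def completed(self, dcgm_gpu_ids: list[int]) -> Optional[dict]:
        """Return an illustrative event dict on a state change, else None."""
        if not self._enabled or not self._sysfs_addresses:
            return None  # feature off, or sysfs scan failed: avoid false positives
        dcgm_addresses = set()
        for gpu_id in dcgm_gpu_ids:
            addr = self._get_pci_address(gpu_id)
            if addr is None:
                log.warning("No PCI address known yet for DCGM GPU %d", gpu_id)
                continue
            dcgm_addresses.add(addr)
        mismatch = len(self._sysfs_addresses) != len(dcgm_gpu_ids)
        if mismatch and not self._mismatch_reported:
            self._mismatch_reported = True
            missing = sorted(set(self._sysfs_addresses) - dcgm_addresses)
            return {"code": "GpuTotalCountMismatch", "fatal": True, "missing": missing}
        if not mismatch and self._mismatch_reported:
            self._mismatch_reported = False
            return {"code": "GpuTotalCountMismatch", "fatal": False, "missing": []}
        return None  # no state change; nothing to send (dedup)
```

The real connector additionally sends a healthy baseline event on the first cycle and emits protobuf health events over gRPC rather than returning dicts.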
Testing:-
Type of Change
Component(s) Affected
Testing
Checklist
Summary by CodeRabbit
New Features
Documentation
Tests