
feat: publish GpuTotalCountMismatch health event when an NVIDIA GPU is not detected by the driver #833

Open
deesharma24 wants to merge 1 commit into NVIDIA:main from deesharma24:gpudrop

Conversation


@deesharma24 deesharma24 commented Feb 11, 2026

Summary

Bug: #684
Adds GPU PCIe count mismatch detection to gpu-health-monitor: the monitor compares the number of GPUs visible on the PCIe bus (via sysfs) against the number reported by DCGM (the driver) and fires a GpuTotalCountMismatch health event when they diverge.

Changes

1. PCIe GPU enumeration via sysfs -- Added a new pcie/gpu_counter.py module that scans /sys/bus/pci/devices/ to discover NVIDIA GPUs on the PCIe bus by matching the NVIDIA vendor ID (0x10de) and GPU class prefix (0x0302). This provides a hardware ground-truth GPU count independent of the driver stack. Addresses are cached once at startup since PCIe topology is static per boot. (A minimal sketch of this scan follows this list.)

2. GPU count mismatch health event -- Implemented gpu_count_check_completed callback in PlatformConnectorEventProcessor that compares sysfs GPU count against DCGM-reported GPU count every health check cycle. On mismatch, a fatal GpuTotalCountMismatch health event is sent with COMPONENT_RESET recommended action, identifying the specific missing PCI addresses. On resolution, a healthy event clears the previous mismatch. A healthy baseline event is sent on the first cycle to align with existing health watch patterns.

3. DCGMWatcher integration -- Extended DCGMWatcher to call gpu_count_check_completed on all registered callbacks after each successful DCGM health check, passing both sysfs PCI addresses and DCGM GPU IDs. Added a gpu_count_check_enabled flag to gate the entire feature -- when disabled, no sysfs scan occurs and no callbacks are fired.
4. Helm chart configuration -- Added a gpuCountCheck.enabled toggle to values.yaml (default true) and a --gpu-count-check-enabled CLI argument to both the DCGM 3.x and 4.x DaemonSet templates, following the same configuration flow as dcgm.dcgmK8sServiceEnabled. This allows operators to enable or disable the feature at deployment time without code changes.
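
For reference, here is a minimal sketch of the sysfs scan described in item 1. The function name get_nvidia_gpu_pci_addresses, its sysfs_base="/sys" parameter, and the vendor/class constants come from this PR; the body below is illustrative only and is not the actual pcie/gpu_counter.py implementation.

import os

NVIDIA_VENDOR_ID = "0x10de"   # PCI vendor ID for NVIDIA
GPU_CLASS_PREFIX = "0x0302"   # PCI class prefix used by NVIDIA data-center GPUs


def get_nvidia_gpu_pci_addresses(sysfs_base: str = "/sys") -> list[str]:
    """Return PCI addresses of NVIDIA GPUs visible on the PCIe bus."""
    devices_path = os.path.join(sysfs_base, "bus", "pci", "devices")
    addresses: list[str] = []
    try:
        entries = os.listdir(devices_path)
    except OSError:
        # A missing /sys mount or read failure yields an empty result.
        return []
    for pci_addr in entries:
        dev_dir = os.path.join(devices_path, pci_addr)
        try:
            with open(os.path.join(dev_dir, "vendor")) as f:
                vendor = f.read().strip()
            with open(os.path.join(dev_dir, "class")) as f:
                dev_class = f.read().strip()
        except OSError:
            continue
        if vendor == NVIDIA_VENDOR_ID and dev_class.startswith(GPU_CLASS_PREFIX):
            addresses.append(pci_addr)
    return addresses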

Testing:

  1. Verified the normal flow when all GPUs are detected and the counts match.
  2. Verified the mismatch path by injecting more PCIe addresses than expected and confirming that the health monitor picks up the mismatch and publishes the event. (A sketch of this injection approach follows this list.)
  3. Could not test against real hardware, because that requires a failed system where a GPU actually falls off the PCIe bus.
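
The injection test in item 2 can be approximated like this. This is a hypothetical sketch: the fixture names (processor, fake_grpc_server) and the assertion are illustrative and do not come from the actual test_gpu_count_mismatch.py in this PR; only the callback signature gpu_count_check_completed(sysfs_addrs, dcgm_ids) and the GpuTotalCountMismatch event name are taken from the PR itself.

def test_extra_pcie_address_triggers_mismatch(processor, fake_grpc_server):
    # Pretend three GPUs sit on the PCIe bus while DCGM only reports two.
    sysfs_addrs = ["0000:17:00.0", "0000:65:00.0", "0000:b3:00.0"]
    dcgm_ids = [0, 1]
    processor.gpu_count_check_completed(sysfs_addrs, dcgm_ids)
    # The connector should have published a fatal GpuTotalCountMismatch event
    # naming the PCI address that is missing from the driver's view.
    assert len(fake_grpc_server.health_events) == 1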

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📚 Documentation
  • 🔧 Refactoring
  • 🔨 Build/CI

Component(s) Affected

  • Core Services
  • Documentation/CI
  • Fault Management
  • Health Monitors
  • Janitor
  • Other: ____________

Testing

  • Tests pass locally
  • Manual testing completed
  • No breaking changes (or documented)

Checklist

  • Self-review completed
  • Documentation updated (if needed)
  • Ready for review

Summary by CodeRabbit

  • New Features

    • Optional GPU count mismatch detection (hardware PCIe vs driver-visible) added; enabled by default and toggleable via CLI and Helm values.
    • Runtime PCIe GPU enumeration and per-cycle comparison added; emits stateful, deduplicated health events on mismatch.
  • Documentation

    • GPU health monitor docs expanded with mismatch detection rationale, workflow, coverage and configuration.
  • Tests

    • Added unit and integration tests covering PCIe discovery, mismatch detection, event lifecycle and deduplication.


coderabbitai bot commented Feb 11, 2026

📝 Walkthrough

Walkthrough

Adds an optional GPU count check: enumerates NVIDIA GPUs from sysfs at startup, compares sysfs GPU PCI addresses to DCGM-reported GPUs each health-check cycle, calls a callback with results, and emits deduplicated health events for mismatches. Toggle available via CLI and Helm values.

Changes

Cohort / File(s) Summary
Helm & Deployment
distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-3.x.yaml, distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-4.x.yaml, distros/kubernetes/nvsentinel/charts/gpu-health-monitor/values.yaml
Adds --gpu-count-check-enabled container arg and gpuCountCheck.enabled value (default true) to control the GPU count check via Helm.
CLI
health-monitors/gpu-health-monitor/gpu_health_monitor/cli.py
Adds --gpu-count-check-enabled CLI flag and forwards it into DCGMWatcher initialization.
DCGM watcher, metrics, types
health-monitors/gpu-health-monitor/gpu_health_monitor/dcgm_watcher/dcgm.py, .../dcgm_watcher/metrics.py, .../dcgm_watcher/types.py
DCGMWatcher gains gpu_count_check_enabled; caches sysfs PCI addresses at startup when enabled, compares counts each cycle, logs mismatches, exposes gpu_count_check_sysfs_scan_failure metric, and invokes new gpu_count_check_completed(sysfs_addrs, dcgm_ids) callback on the CallbackInterface.
PCIe enumeration library
health-monitors/gpu-health-monitor/gpu_health_monitor/pcie/gpu_counter.py, health-monitors/gpu-health-monitor/gpu_health_monitor/pcie/__init__.py
New module get_nvidia_gpu_pci_addresses(sysfs_base="/sys") to enumerate NVIDIA GPU PCI addresses with helper and exports it via package __all__.
Platform connector logic
health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py
Adds gpu_count_check_completed, _send_gpu_count_mismatch_event, and _clear_gpu_count_mismatch; implements stateful deduplication, builds impacted PCI entities, includes chassis_serial metadata, and retries health-event sends for GpuTotalCountMismatch.
Documentation
docs/gpu-health-monitor.md
Documents GPU Count Mismatch Detection: rationale, coverage, workflow, and config note that the check is enabled by default.
Tests — pcie & gpu_counter
health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_pcie/test_gpu_counter.py, .../tests/test_pcie/__init__.py
New unit tests for sysfs GPU enumeration and added license header in tests init.
Tests — platform connector
health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_gpu_count_mismatch.py, .../tests/test_platform_connector/test_platform_connector.py
Extensive tests covering mismatch detection, deduplication, resolution lifecycle, multi-missing GPUs, and adjustments to test timing; validates emitted health-event fields and impacted entities.
Tests — dcgm watcher stub
health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_dcgm_watcher/test_dcgm.py
Adds a noop gpu_count_check_completed method to the fake event processor to satisfy the extended CallbackInterface.

Sequence Diagram

sequenceDiagram
    actor User
    participant CLI as CLI Handler
    participant Watcher as DCGMWatcher
    participant Sysfs as Sysfs Scanner
    participant DCGM as DCGM API
    participant Connector as PlatformConnector
    participant gRPC as Platform gRPC

    User->>CLI: start (--gpu-count-check-enabled=true)
    CLI->>Watcher: initialize(gpu_count_check_enabled=true)
    Watcher->>Sysfs: get_nvidia_gpu_pci_addresses()
    Sysfs-->>Watcher: sysfs_gpu_pci_addresses

    loop Health check cycle
        Watcher->>DCGM: query GPU IDs
        DCGM-->>Watcher: dcgm_gpu_ids
        Watcher->>Watcher: compare sysfs vs dcgm
        Watcher->>Connector: gpu_count_check_completed(sysfs_addrs, dcgm_ids)
        alt Mismatch
            Connector->>Connector: compute missing PCI addresses
            Connector->>gRPC: send_health_event(GpuTotalCountMismatch, impactedEntities, isFatal=true)
            gRPC-->>Connector: ack
            Connector->>Connector: update cache (dedupe)
        else Match / resolved
            Connector->>gRPC: send_health_event(GpuTotalCountMismatch cleared / healthy)
            gRPC-->>Connector: ack
            Connector->>Connector: update cache (dedupe)
        end
    end
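
To make the per-cycle comparison in the diagram concrete, here is a simplified, self-contained sketch of the dedup logic. The callback name and the GpuTotalCountMismatch event come from the change summary above; the real callback receives DCGM GPU IDs and maps them to PCI addresses via the metadata reader, whereas this sketch takes PCI addresses directly and replaces the gRPC health-event sends with log calls.

import logging

log = logging.getLogger(__name__)


class GpuCountCheck:
    def __init__(self) -> None:
        self._mismatch_active = False  # dedup: only act on state transitions

    def gpu_count_check_completed(
        self, sysfs_gpu_pci_addresses: list[str], dcgm_gpu_pci_addresses: list[str]
    ) -> None:
        if not sysfs_gpu_pci_addresses:
            # Guard against a failed sysfs scan (see the review comments below).
            return
        missing = sorted(set(sysfs_gpu_pci_addresses) - set(dcgm_gpu_pci_addresses))
        if missing and not self._mismatch_active:
            # The real connector sends a fatal GpuTotalCountMismatch health event here.
            log.error("GpuTotalCountMismatch: GPUs missing from driver: %s", missing)
            self._mismatch_active = True
        elif not missing and self._mismatch_active:
            # The real connector sends a healthy event to clear the prior mismatch.
            log.info("GPU count mismatch resolved")
            self._mismatch_active = False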

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped the PCIe lanes at dawn,
I counted lights where GPUs are drawn.
When sysfs and DCGM disagreed,
I thumped an event — the logs took heed.
Now mismatch checks keep watch till yawn.

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 72.50%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name | Status | Explanation
Title check | ✅ Passed | The PR title 'feat: publish GpuTotalCountMismatch healthevent in case driver nvidia-gpu is not getting detected by driver' clearly identifies the main feature: publishing a GPU count mismatch health event when NVIDIA GPUs are not properly detected by the driver.
Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-4.x.yaml (1)

90-99: ⚠️ Potential issue | 🔴 Critical

Add explicit /sys hostPath mount for GPU PCIe enumeration.

The GPU health monitor reads /sys/bus/pci/devices/ at startup to enumerate NVIDIA GPUs (vendor 0x10de, class 0x0302). The daemonset currently lacks an explicit volume mount for /sys, which means the enumeration silently fails and returns zero GPUs, causing persistent mismatch errors.

Add this to the daemonset's volumeMounts (around line 98):

- name: sys
  mountPath: /sys
  readOnly: true

And add the corresponding volume definition (after gpu-metadata volume):

- name: sys
  hostPath:
    path: /sys
    type: Directory

The metadata-collector daemonset in the same chart already mounts /sys in this exact way, establishing precedent for this cluster setup.

🤖 Fix all issues with AI agents
In `@health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_gpu_count_mismatch.py`:
- Around line 80-84: The helper metadata_file() writes sample JSON to a
NamedTemporaryFile but returns its path while the file is still open, risking
unreadable/partial content; modify metadata_file (the NamedTemporaryFile
creation) to ensure the file is flushed and closed before returning the
path—e.g., open/write/close inside a context manager or explicitly call
f.flush() and f.close() after json.dump—so
MetadataReader/PlatformConnectorEventProcessor reads a fully written file.
🧹 Nitpick comments (6)
health-monitors/gpu-health-monitor/gpu_health_monitor/dcgm_watcher/types.py (1)

53-60: Missing return type annotation.

The concrete implementation in platform_connector.py annotates -> None; the abstract declaration here should match. That said, the existing methods (health_event_occurred, dcgm_connectivity_failed) also omit return types, so this is a pre-existing pattern.

Suggested fix
     `@abc.abstractmethod`
-    def gpu_count_check_completed(self, sysfs_gpu_pci_addresses: list[str], dcgm_gpu_ids: list[int]):
+    def gpu_count_check_completed(self, sysfs_gpu_pci_addresses: list[str], dcgm_gpu_ids: list[int]) -> None:

As per coding guidelines, "Type hints required for all functions in Python code".

docs/gpu-health-monitor.md (1)

103-103: Nit: clarify the configuration parameter name.

The doc says gpu_count_check_enabled (Python/underscore style), but users configure this via Helm (gpuCountCheck.enabled) or CLI (--gpu-count-check-enabled). Consider referencing the Helm value name since this is deployment-facing documentation.

health-monitors/gpu-health-monitor/gpu_health_monitor/dcgm_watcher/dcgm.py (1)

276-282: Consider guarding against empty sysfs results being cached permanently.

If get_nvidia_gpu_pci_addresses() returns an empty list due to a transient sysfs read error at startup, the cached count will be 0 for the lifetime of the process. This would suppress mismatch detection entirely (both counts would appear as 0 vs N, or if DCGM also sees GPUs, it would report a false mismatch that can never self-resolve since the cache is static).

A simple guard — logging a more prominent warning or even retrying the sysfs scan once — could improve resilience:

💡 Suggested improvement
         sysfs_gpu_pci_addresses = []
         if self._gpu_count_check_enabled:
             sysfs_gpu_pci_addresses = get_nvidia_gpu_pci_addresses()
-            log.info(f"Sysfs GPU count (PCIe hardware): {len(sysfs_gpu_pci_addresses)}")
+            if not sysfs_gpu_pci_addresses:
+                log.warning("Sysfs returned 0 GPUs — GPU count check will be ineffective this cycle")
+            else:
+                log.info(f"Sysfs GPU count (PCIe hardware): {len(sysfs_gpu_pci_addresses)}")
         else:
             log.info("GPU count check is disabled")
health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_gpu_count_mismatch.py (2)

32-32: Hardcoded /tmp path for the UDS socket.

Static analysis (S108) flags this as an insecure temp path. For tests, consider using tempfile.mkdtemp() to create an isolated directory and placing the socket there. This avoids collisions if tests run in parallel and addresses the security hint.


91-93: Servicer overwrites health_events on each RPC call — only the last batch is retained.

This is fine for the current tests since each assertion follows a single gpu_count_check_completed call. However, if future tests need to verify cumulative events or multiple batches, this will silently lose earlier events. Consider appending instead:

self.health_events.extend(request.events)

This would also require resetting self.health_events = [] between test phases (already done in some tests).

health-monitors/gpu-health-monitor/gpu_health_monitor/pcie/gpu_counter.py (1)

18-19: Documentation for GPU_CLASS_PREFIX is adequate but could be clearer.

The constant GPU_CLASS_PREFIX = "0x0302" correctly targets NVIDIA data center GPUs (A100, H100, etc.). The inline comment and docstring identify these target SKUs. However, the documentation doesn't explain why data center GPUs use 0x0302 while consumer/display GPUs use 0x0300. Consider expanding the comment to clarify this distinction to prevent future maintainers from wondering why consumer GPUs aren't detected.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@docs/gpu-health-monitor.md`:
- Around line 96-97: The docs and code disagree about mismatch direction: the
implementation triggers on len(sysfs_gpu_pci_addresses) != len(dcgm_gpu_ids)
while the doc text says sysfs count > DCGM count; decide which behavior is
correct and make them consistent: either update the documentation sentence to
say "sysfs count ≠ DCGM count" and leave the implementation (referencing
len(sysfs_gpu_pci_addresses) and len(dcgm_gpu_ids) and the GpuTotalCountMismatch
health event) as-is, or tighten the implementation to only treat sysfs > DCGM as
a mismatch by changing the conditional that currently uses
len(sysfs_gpu_pci_addresses) != len(dcgm_gpu_ids) to a strict comparison
(len(sysfs_gpu_pci_addresses) > len(dcgm_gpu_ids)) so it matches the documented
behavior and still populates the GpuTotalCountMismatch event with the missing
PCI addresses.

In `@health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py`:
- Around line 414-476: The GPU count mismatch event built in
_send_gpu_count_mismatch_event currently sets
recommendedAction=platformconnector_pb2.COMPONENT_RESET while entitiesImpacted
only contains PCI entries (entitiesImpacted) or may be empty
(missing_from_driver), which prevents component-level reset; change the logic to
detect when no GPU_UUID entities are present or when entitiesImpacted is empty
and set recommendedAction=platformconnector_pb2.RESTART_VM instead (or otherwise
include GPU_UUID if available via _metadata_reader), and update the message to
handle the empty-case (e.g., omit "Missing from driver: []" or say "none
detected") so downstream consumers (node-drainer/fault-remediation) receive a
compatible action; make these changes inside _send_gpu_count_mismatch_event and
adjust errorCode/metadata as needed if you change action semantics.
🧹 Nitpick comments (4)
health-monitors/gpu-health-monitor/gpu_health_monitor/dcgm_watcher/dcgm.py (1)

276-282: Consider handling an empty sysfs result at startup more explicitly.

If get_nvidia_gpu_pci_addresses() returns [] (e.g., sysfs read failure, container without /sys mounted), the empty list is cached for the watcher's lifetime. Every subsequent cycle will report a mismatch (0 sysfs GPUs vs N DCGM GPUs), potentially causing a persistent false-positive GpuTotalCountMismatch event.

Consider logging a more prominent warning or disabling the check when the sysfs scan returns zero GPUs at startup, since a node with DCGM-visible GPUs but zero sysfs GPUs almost certainly indicates a scan failure rather than a real mismatch.

💡 Suggested approach
         if self._gpu_count_check_enabled:
             sysfs_gpu_pci_addresses = get_nvidia_gpu_pci_addresses()
             log.info(f"Sysfs GPU count (PCIe hardware): {len(sysfs_gpu_pci_addresses)}")
+            if not sysfs_gpu_pci_addresses:
+                log.warning(
+                    "Sysfs returned 0 GPUs — possible scan failure or missing /sys mount. "
+                    "Disabling GPU count check to avoid false positives."
+                )
+                self._gpu_count_check_enabled = False
         else:
             log.info("GPU count check is disabled")
health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_gpu_count_mismatch.py (2)

32-32: Use a temporary path for the UDS socket instead of a hardcoded /tmp path.

Ruff S108 flags this as a probable insecure usage of a temporary file path. In shared CI environments, /tmp/nvsentinel.sock could collide with concurrent test runs or be subject to symlink attacks. Consider using tempfile.mkdtemp() to create an isolated directory for the socket.

💡 Suggested approach
-socket_path = "/tmp/nvsentinel.sock"
+import tempfile as _tmpmod
+_test_sock_dir = _tmpmod.mkdtemp()
+socket_path = os.path.join(_test_sock_dir, "nvsentinel.sock")

And clean up the directory in tearDown.


87-93: Servicer replaces health_events on each gRPC call rather than accumulating.

self.health_events = list(request.events) overwrites the list each time HealthEventOccurredV1 is called. This works for current tests because each processor call maps to at most one gRPC request, but it's fragile — if the implementation ever batches or sends multiple gRPC calls, events from earlier calls would be silently lost.

Consider using extend instead:

-    self.health_events = list(request.events)
+    self.health_events.extend(request.events)

This would also require clearing self.health_events in the test methods that currently assign [] to it (which they already do), and would make the test assertions more robust.

health-monitors/gpu-health-monitor/gpu_health_monitor/pcie/gpu_counter.py (1)

39-41: Use log.exception to preserve the traceback.

Inside an except block, log.exception automatically includes the traceback, which is more useful for diagnosing sysfs access failures.

Proposed fix
     except OSError as e:
-        log.error(f"Failed to read PCI devices directory {devices_path}: {e}")
+        log.exception(f"Failed to read PCI devices directory {devices_path}: {e}")
         return []

@deesharma24 deesharma24 force-pushed the gpudrop branch 2 times, most recently from a38c780 to 979664b on February 11, 2026 at 14:50

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@health-monitors/gpu-health-monitor/gpu_health_monitor/dcgm_watcher/dcgm.py`:
- Around line 276-290: The sysfs scan failure handling logs that the GPU count
check is being disabled but never actually flips the flag, so update the failure
branch after get_nvidia_gpu_pci_addresses() returns empty to set
self._gpu_count_check_enabled = False (and keep
metrics.gpu_count_check_sysfs_scan_failure.set(1) and the warning log), ensuring
subsequent cycles skip the gpu-count comparison in gpu_count_check_completed and
avoid emitting spurious GpuTotalCountMismatch events.

In `@health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py`:
- Around line 428-436: The calculation of missing_from_driver can be skewed if
self._metadata_reader.get_pci_address(gpu_id) returns None for some DCGM GPUs;
update the loop that builds dcgm_pci_addresses (iterating dcgm_gpu_ids and
calling get_pci_address) to skip None results when adding to dcgm_pci_addresses
and emit a warning log (including the gpu_id and context) whenever
get_pci_address yields None so operators can differentiate
metadata-not-yet-populated cases from real driver-missing GPUs; keep the final
set logic (sysfs_pci_set - dcgm_pci_addresses) unchanged but ensure
dcgm_pci_addresses only contains valid PCI strings.
- Around line 414-476: The implementation in _send_gpu_count_mismatch_event sets
recommendedAction=platformconnector_pb2.RESTART_VM but
test_gpu_count_mismatch.py asserts COMPONENT_RESET, causing a mismatch; update
the test to assert platformconnector_pb2.RESTART_VM (to match
_send_gpu_count_mismatch_event and the existing test_platform_connector
expectation) so all tests agree on the recommendedAction constant used by the
HealthEvent.

In `@health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_gpu_count_mismatch.py`:
- Line 147: The test asserts the wrong recommendedAction constant; update the
assertion in test_gpu_count_mismatch (the line checking event.recommendedAction)
to expect platformconnector_pb2.RESTART_VM to match the implementation in
_send_gpu_count_mismatch_event in platform_connector.py (which sets
recommendedAction=platformconnector_pb2.RESTART_VM).
🧹 Nitpick comments (6)
health-monitors/gpu-health-monitor/gpu_health_monitor/dcgm_watcher/types.py (1)

53-60: Add return type hint -> None.

The coding guidelines require type hints for all functions. The concrete implementation in platform_connector.py already has -> None; the abstract definition should match.

Note: the existing methods in this class (health_event_occurred, dcgm_connectivity_failed) also omit return types — consider adding them in a follow-up.

Proposed fix
     `@abc.abstractmethod`
-    def gpu_count_check_completed(self, sysfs_gpu_pci_addresses: list[str], dcgm_gpu_ids: list[int]):
+    def gpu_count_check_completed(self, sysfs_gpu_pci_addresses: list[str], dcgm_gpu_ids: list[int]) -> None:

As per coding guidelines, "Type hints required for all functions in Python code".

health-monitors/gpu-health-monitor/gpu_health_monitor/pcie/gpu_counter.py (1)

37-41: Redundant exception object in logging.exception call.

log.exception() already captures and logs the full exception traceback. Including {e} in the format string duplicates the exception message in the log output.

Proposed fix
     except OSError as e:
-        log.exception(f"Failed to read PCI devices directory {devices_path}: {e}")
+        log.exception(f"Failed to read PCI devices directory {devices_path}")
         return []
health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_dcgm_watcher/test_dcgm.py (1)

37-38: Consider capturing the callback invocation for assertion in test_start.

The stub silently drops the call. Since test_start (Line 336) exercises the full watcher.start() loop (which now fires gpu_count_check_completed each cycle), you could track invocation here to verify the callback is actually dispatched — similar to how health_details is captured on Line 32.

This is optional since the platform connector tests cover the real logic.

health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py (2)

367-401: log.error should be log.exception to preserve the traceback.

Ruff TRY400 flags line 401. Using log.exception automatically includes the traceback, which is essential for debugging send failures in production.

Proposed fix
         except Exception as e:
-            log.error(f"Exception while sending GPU count match resolved event: {e}")
+            log.exception(f"Exception while sending GPU count match resolved event: {e}")

471-476: The same log.error → log.exception improvement applies here (Ruff TRY400).

Proposed fix
         except Exception as e:
-            log.error(f"Exception while sending GPU count mismatch event: {e}")
+            log.exception(f"Exception while sending GPU count mismatch event: {e}")
health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.py (1)

885-888: Patching module-level constants is fragile — consider unittest.mock.patch.

Directly mutating platform_connector.MAX_RETRIES and platform_connector.INITIAL_DELAY risks leaking state if finally doesn't execute (e.g., test runner killed). Using unittest.mock.patch as a context manager or decorator handles cleanup automatically and is more idiomatic.

That said, the finally block at lines 1041-1042 does restore the originals, so this is a low-risk nit.

Also applies to: 1041-1042


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py`:
- Around line 403-412: The callback gpu_count_check_completed can emit
false-positive mismatches when sysfs enumeration fails and returns an empty
list; add a guard so it does not report a mismatch if sysfs_gpu_pci_addresses is
empty (or if sysfs has been disabled). Specifically, inside
gpu_count_check_completed, check if not sysfs_gpu_pci_addresses (or if
self._gpu_count_check_enabled is False) before comparing lengths and skip
calling _send_gpu_count_mismatch_event in that case (instead either no-op or
call _clear_gpu_count_mismatch as appropriate); alternatively, when
get_nvidia_gpu_pci_addresses() detects sysfs failure, set
self._gpu_count_check_enabled = False so subsequent gpu_count_check_completed
invocations are ignored.
🧹 Nitpick comments (3)
distros/kubernetes/nvsentinel/charts/gpu-health-monitor/values.yaml (1)

55-59: Add a comment noting the default and behavior when disabled.

The inline comments describe the check well, but per the coding guidelines for values.yaml, consider noting that this is enabled by default and what disabling it does (i.e., skips the PCIe-vs-DCGM count comparison and suppresses mismatch health events).

Suggested improvement
 # GPU count check configuration
 # Compares PCIe bus GPU count (sysfs) vs driver-visible GPU count (DCGM)
 # and sends a health event on mismatch
+# Enabled by default. Set to false to disable the check and suppress
+# GpuTotalCountMismatch health events.
 gpuCountCheck:
   enabled: true

As per coding guidelines, "Document all values in Helm chart values.yaml with inline comments" and "Note truthy value requirements in Helm chart documentation where applicable".

health-monitors/gpu-health-monitor/gpu_health_monitor/pcie/gpu_counter.py (1)

39-41: Redundant exception object in log.exception call.

log.exception() already attaches the active exception's traceback. Including {e} in the f-string duplicates the information.

Proposed fix
     except OSError as e:
-        log.exception(f"Failed to read PCI devices directory {devices_path}: {e}")
+        log.exception(f"Failed to read PCI devices directory {devices_path}")
         return []
health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py (1)

400-401: Use log.exception instead of log.error to preserve tracebacks.

Both exception handlers use log.error(f"...") which loses the stack trace. log.exception automatically includes the traceback for better debuggability. This aligns with the static analysis hint (TRY400) and the pattern used in clear_dcgm_connectivity_failure at line 129.

Proposed fix
         except Exception as e:
-            log.error(f"Exception while sending GPU count match resolved event: {e}")
+            log.exception(f"Exception while sending GPU count match resolved event: {e}")
         except Exception as e:
-            log.error(f"Exception while sending GPU count mismatch event: {e}")
+            log.exception(f"Exception while sending GPU count mismatch event: {e}")

Also applies to: 475-476
