feat: cloud-native GPU health event management #794

Closed

ArangoGutierrez wants to merge 23 commits into main from feat/cloud-native-storage

Conversation

ArangoGutierrez (Contributor) commented on Feb 4, 2026

Summary

This PR introduces a Kubernetes-native approach to GPU health monitoring and remediation, designed to replace the MongoDB-centric architecture.

Key Changes

  • HealthEvent CRD (nvsentinel.nvidia.com/v1alpha1): Cluster-wide custom resource for tracking GPU health events (a rough sketch of the types appears after this list)
  • Controller Chain: Automated remediation workflow with four specialized controllers:
    • QuarantineController: Cordons nodes on fatal GPU errors
    • DrainController: Gracefully evicts pods from quarantined nodes
    • RemediationController: Executes remediation strategies (reboot, GPU reset, manual)
    • TTLController: Automatic cleanup of resolved events based on configurable policies
  • HealthProvider Service: gRPC-based GPU health monitoring using NVML
  • CRDPublisher: Bridge between device-api-server and Kubernetes API
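
For orientation, here is a minimal sketch of what the HealthEvent Go types could look like. The field names, phase strings, and remediation strategies shown are assumptions drawn from the bullets above, not the actual schema added in this PR:

```go
// Illustrative sketch only: these field and phase names are assumptions and
// not the actual nvsentinel.nvidia.com/v1alpha1 schema introduced in this PR.
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// HealthEventSpec describes a single GPU health event observed on a node.
type HealthEventSpec struct {
	NodeName            string `json:"nodeName"`
	GPUUUID             string `json:"gpuUUID"`
	XID                 int32  `json:"xid,omitempty"`
	Fatal               bool   `json:"fatal"`
	RemediationStrategy string `json:"remediationStrategy,omitempty"` // reboot | gpu-reset | manual
}

// HealthEventStatus tracks the phase driven by the controller chain.
type HealthEventStatus struct {
	Phase      string      `json:"phase,omitempty"` // New, Quarantined, Drained, Remediated, Resolved
	LastUpdate metav1.Time `json:"lastUpdate,omitempty"`
}

// HealthEvent is a cluster-scoped record of a GPU health incident.
type HealthEvent struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   HealthEventSpec   `json:"spec,omitempty"`
	Status HealthEventStatus `json:"status,omitempty"`
}
```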

Architecture

┌─────────────────┐    gRPC    ┌──────────────────┐
│ HealthProvider  │ ──────────▶│ Device-API-Server│
│   (DaemonSet)   │            │                  │
└─────────────────┘            └────────┬─────────┘
        │                               │
        │ NVML                          │ CRDPublisher
        ▼                               ▼
    ┌───────┐                   ┌───────────────┐
    │  GPU  │                   │  HealthEvent  │
    └───────┘                   │     (CRD)     │
                                └───────┬───────┘
                                        │
              ┌─────────────────────────┼─────────────────────────┐
              │                         │                         │
              ▼                         ▼                         ▼
    ┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐
    │ QuarantineCtrl   │───▶│   DrainCtrl      │───▶│ RemediationCtrl  │
    │  (cordon node)   │    │  (evict pods)    │    │ (reboot/reset)   │
    └──────────────────┘    └──────────────────┘    └──────────────────┘

Phase Progression

New → Quarantined → Drained → Remediated → Resolved

Each phase transition is handled by a dedicated controller, ensuring separation of concerns and reliable state management.
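
To make the hand-off concrete, below is a minimal controller-runtime sketch of what the QuarantineController's reconcile loop could look like; the type names, phase strings, and import path are assumptions rather than the code in this PR:

```go
// Illustrative sketch of the per-phase controller pattern. The real controller,
// phase constants, and API import path in this PR may differ.
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	v1alpha1 "github.com/nvidia/nvsentinel/api/nvsentinel/v1alpha1" // assumed import path
)

// QuarantineReconciler cordons the affected node for fatal HealthEvents that
// are still in phase "New", then hands the event off to the DrainController.
type QuarantineReconciler struct {
	client.Client
}

func (r *QuarantineReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var ev v1alpha1.HealthEvent
	if err := r.Get(ctx, req.NamespacedName, &ev); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Each controller acts only on "its" phase; everything else is a no-op.
	if ev.Status.Phase != "New" || !ev.Spec.Fatal {
		return ctrl.Result{}, nil
	}

	// Cordon the node so no new pods are scheduled onto the faulty GPU.
	var node corev1.Node
	if err := r.Get(ctx, types.NamespacedName{Name: ev.Spec.NodeName}, &node); err != nil {
		return ctrl.Result{}, err
	}
	node.Spec.Unschedulable = true
	if err := r.Update(ctx, &node); err != nil {
		return ctrl.Result{}, err
	}

	// Advance the phase so the DrainController picks the event up next.
	ev.Status.Phase = "Quarantined"
	return ctrl.Result{}, r.Status().Update(ctx, &ev)
}
```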

Test Plan

  • Unit tests for all controllers (quarantine, drain, remediation, TTL)
  • Unit tests for CRDPublisher and HealthProvider
  • Integration test harness (cmd/controller-test)
  • Manual integration test on AWS EKS cluster with GPUs
  • Full phase progression verified end to end: New → Quarantined → Drained → Remediated (completed in 2 seconds)

Related

ArangoGutierrez and others added 23 commits January 13, 2026 18:17
Remove all NVSentinel modules except api/ to prepare for donation
as a standalone Device API repository (device.nvidia.com).

Changes:
- Remove all modules: commons, data-models, health-monitors, etc.
- Simplify README.md, DEVELOPMENT.md, SECURITY.md for API focus
- Simplify Makefile to delegate to api/
- Trim .gitignore and .versions.yaml to essentials
- Trim .github/ to basic scaffolding (lint-test workflow, templates)
- Fix api/Makefile (remove undefined THIRD_PARTY_DIR reference)

Preserves:
- api/ folder with GPU proto definitions and generated Go code
- LICENSE, CODE_OF_CONDUCT.md
- Basic GitHub scaffolding (DCO, issue templates, PR template)

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Signed-off-by: Dan Huenecke <dhuenecke@nvidia.com>
Signed-off-by: Dan Huenecke <dhuenecke@nvidia.com>
Add ProviderService with RPCs for GPU lifecycle management:
- RegisterGpu / UnregisterGpu
- UpdateGpuStatus / UpdateGpuCondition

Also adds resource_version field to Gpu message for optimistic
concurrency control.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
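
As a rough illustration of the optimistic-concurrency idea behind the resource_version field (the UpdateStatusWithVersion name is borrowed from a later commit in this PR; the implementation below is only a sketch):

```go
// Illustrative sketch of an optimistic-concurrency check keyed on
// resource_version; the real cache in this PR may differ.
package cache

import (
	"fmt"
	"strconv"
	"sync"
)

type gpuEntry struct {
	status          string
	resourceVersion uint64
}

// GpuCache is a minimal stand-in for the server's in-memory GPU cache.
type GpuCache struct {
	mu   sync.Mutex
	gpus map[string]*gpuEntry
}

// UpdateStatusWithVersion applies a status update only if the caller's
// resource version matches the cached one, mirroring Kubernetes semantics:
// a stale writer gets a conflict instead of silently clobbering newer state.
func (c *GpuCache) UpdateStatusWithVersion(uuid, status, resourceVersion string) (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	e, ok := c.gpus[uuid]
	if !ok {
		return "", fmt.Errorf("gpu %q not found", uuid)
	}
	rv, err := strconv.ParseUint(resourceVersion, 10, 64)
	if err != nil || rv != e.resourceVersion {
		return "", fmt.Errorf("conflict: stale resource version for gpu %q", uuid)
	}

	e.status = status
	e.resourceVersion++ // bump on every successful mutation
	return strconv.FormatUint(e.resourceVersion, 10), nil
}
```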
Implements the Device API Server - a gRPC server providing unified
GPU device information via GpuService (consumer) and ProviderService
(provider) APIs.

Components:
- In-memory cache with RWMutex for thread-safe GPU state
- Watch broadcaster for real-time state change notifications
- NVML fallback provider for GPU enumeration and XID health monitoring
- Prometheus metrics (cache, watch, NVML, gRPC stats)
- Helm chart with ServiceMonitor and alerting rules

Includes comprehensive unit tests (35 tests, race-clean) and
documentation (API reference, operations guide, design docs).

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
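
The watch broadcaster mentioned in the commit above can be pictured roughly as the following sketch, in which a subscriber whose buffer is full is evicted rather than allowed to block the publisher; the names and buffer size are assumptions:

```go
// Illustrative watch broadcaster: every published state change is fanned out
// to all subscribers over channels. Not the repo's actual implementation.
package watch

import "sync"

// Event is a minimal stand-in for a GPU state-change notification.
type Event struct {
	UUID  string
	State string
}

// Broadcaster fans out events to all current subscribers.
type Broadcaster struct {
	mu   sync.Mutex
	subs map[chan Event]struct{}
}

func NewBroadcaster() *Broadcaster {
	return &Broadcaster{subs: make(map[chan Event]struct{})}
}

// Subscribe registers a buffered channel that will receive future events.
// The returned cancel func removes the subscription.
func (b *Broadcaster) Subscribe() (<-chan Event, func()) {
	ch := make(chan Event, 16)
	b.mu.Lock()
	b.subs[ch] = struct{}{}
	b.mu.Unlock()

	cancel := func() {
		b.mu.Lock()
		defer b.mu.Unlock()
		if _, ok := b.subs[ch]; ok {
			delete(b.subs, ch)
			close(ch)
		}
	}
	return ch, cancel
}

// Publish delivers ev to every subscriber. A subscriber whose buffer is full
// is evicted instead of being allowed to block the publisher.
func (b *Broadcaster) Publish(ev Event) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for ch := range b.subs {
		select {
		case ch <- ev:
		default:
			delete(b.subs, ch)
			close(ch)
		}
	}
}
```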
The gRPC API is unauthenticated, which poses a security risk when
bound to all network interfaces. This change:

- Changes default GRPCAddress from ":50051" to "127.0.0.1:50051"
- Updates flag documentation with security warning
- Expands network security documentation per Copilot review
- Updates design and operations docs to reflect new default

Users who need network access can explicitly set --grpc-address=:50051,
with full awareness of the security implications.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
The Heartbeat RPC is intentionally defined but unimplemented,
reserved for future provider liveness detection. This commit:

- Improves the error message to explain the reservation purpose
- Updates code comments to clarify this is intentional
- Aligns implementation comments with proto documentation

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Add ResourceVersion() method to GpuCache and use it to populate
ListMeta.ResourceVersion in ListGpus responses. This enables:

- Optimistic concurrency patterns in consumers
- Cache invalidation based on version changes
- Consistency tracking across list operations

The resource version is a monotonically increasing counter that
increments on every cache mutation.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
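
A compact sketch of the counter described above (illustrative only; the real GpuCache and its mutation methods differ):

```go
// Illustrative sketch: a cache-wide counter bumped on every mutation and
// surfaced as ListMeta.ResourceVersion on list responses. Names are assumptions.
package cache

import (
	"strconv"
	"sync"
)

// Gpu is a stand-in for the generated API type.
type Gpu struct{ UUID string }

type GpuCache struct {
	mu      sync.RWMutex
	version uint64
	gpus    map[string]*Gpu
}

// Delete removes a GPU and, like every mutation, bumps the cache version.
func (c *GpuCache) Delete(uuid string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.gpus, uuid)
	c.version++
}

// ResourceVersion returns the current cache-wide version as a string, the
// same form Kubernetes exposes in ListMeta.ResourceVersion, so ListGpus
// responses can report it to consumers.
func (c *GpuCache) ResourceVersion() string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return strconv.FormatUint(c.version, 10)
}
```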
Add module structure documentation to DEVELOPMENT.md to clarify
the relationship between the different Go modules in this monorepo:

- Device API Server uses github.com/nvidia/device-api
- API definitions use github.com/nvidia/device-api/api
- client-go and code-generator use github.com/nvidia/nvsentinel

This documents the intentional divergence where device-api is a
standalone component that may be published as its own repository.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
The PR introduced inconsistent module paths where the api module
was changed to github.com/nvidia/device-api/api while client-go
still uses github.com/nvidia/nvsentinel/api. This broke CI.

This commit aligns all modules to use github.com/nvidia/nvsentinel:
- api/go.mod: github.com/nvidia/nvsentinel/api
- go.mod: github.com/nvidia/nvsentinel
- All imports updated to use nvsentinel path
- Proto go_package options updated
- Generated files updated

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Addresses review feedback from pteranodan on PR #720 regarding API
design and NVML provider coupling.

BREAKING CHANGES:
- Removed ProviderService in favor of unified GpuService
- RegisterGpu() → CreateGpu()
- UnregisterGpu() → DeleteGpu()
- UpdateGpuCondition() → removed (use UpdateGpuStatus with full status)

API Changes:
- Consolidated GpuService with both read and write operations
- Added standard K8s CRUD methods: CreateGpu, UpdateGpu, UpdateGpuStatus, DeleteGpu
- Removed custom provider methods (RegisterGpu, UnregisterGpu, UpdateGpuCondition)
- Deleted api/proto/device/v1alpha1/provider.proto
- Updated gpu.proto with unified service definition

Service Layer:
- Created pkg/deviceapiserver/service/gpu_service.go (unified service)
- Deleted consumer.go and provider.go (replaced by gpu_service.go)
- Updated integration tests to use new API
- Improved cache interface with Create, Update, UpdateStatusWithVersion, Delete

NVML Provider Decoupling:
- NVML provider now uses GpuServiceClient instead of direct cache access
- Server creates loopback gRPC connection for embedded NVML provider
- Added standalone cmd/nvml-provider/ binary for sidecar deployment
- Provider now "dogfoods" its own API via gRPC

Documentation:
- Updated README.md with new unified API architecture
- Rewrote docs/api/device-api-server.md with single service model
- Updated docs/operations/device-api-server.md
- Simplified docs/design/device-api-server.md
- Removed obsolete task lists and design documents

Deployment:
- Reorganized charts: charts/device-api-server/ → deployments/helm/nvsentinel/
- Added deployments/container/Dockerfile
- Added deployments/static/ for static manifests
- Added demos/nvml-sidecar-demo.sh

Other:
- Updated .gitignore for planning documents
- Enhanced Makefile with new targets
- Added metrics for unified service

This refactoring improves:
1. API consistency with Kubernetes conventions
2. Tooling compatibility (code generators, controller-runtime)
3. Decoupling of NVML provider from server internals
4. Flexibility to run NVML as embedded or sidecar
5. Overall code maintainability and clarity

Signed-off-by: Eduardo Arango <eduardoa@nvidia.com>
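
The loopback "dogfooding" pattern described above might look roughly like the following; the generated client and request type names are assumptions, and only the CreateGpu call itself comes from this PR's API surface:

```go
// Illustrative sketch: the embedded NVML provider talking to its own server
// over a loopback gRPC connection instead of touching the cache directly.
// The generated client/package names here are assumptions.
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	devicev1 "github.com/nvidia/nvsentinel/api/device/v1alpha1" // assumed generated package
)

func main() {
	// The server binds to 127.0.0.1:50051 by default, so the embedded provider
	// dials the same loopback address an external sidecar would use.
	conn, err := grpc.NewClient("127.0.0.1:50051",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("dial device-api-server: %v", err)
	}
	defer conn.Close()

	client := devicev1.NewGpuServiceClient(conn)

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Publish a GPU through the same public API any other provider would use.
	if _, err := client.CreateGpu(ctx, &devicev1.CreateGpuRequest{ /* populated from NVML enumeration */ }); err != nil {
		log.Fatalf("create gpu: %v", err)
	}
}
```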
Update Go version to 1.25 across all module files and the Dockerfile
to ensure consistent builds. The previous version (1.25.5) was a
non-existent patch version, and the Dockerfile was using 1.24, for which
no Docker Hub images are available.

Changes:
- go.mod: 1.25.5 -> 1.25
- api/go.mod: 1.25.5 -> 1.25
- Dockerfile: 1.24 -> 1.25

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
The flag parsing was broken because flag.Parse() was called in main()
after parseFlags() had already returned the config struct. This meant
command-line arguments like --server-address were never applied to
the config.

Move flag.Parse() inside parseFlags() after defining all flags but
before returning, so the parsed values are captured in the returned
config struct.

Also reorder klog.InitFlags() to be called before parseFlags() so
klog flags are also registered before parsing.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
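
In code, the fix described above amounts to something like this sketch (the flag set, default value, and surrounding main are illustrative, not the actual binary):

```go
// Illustrative sketch of the fix: flag.Parse() now runs inside parseFlags(),
// after every flag (including klog's) has been registered.
package main

import (
	"flag"

	"k8s.io/klog/v2"
)

type config struct {
	serverAddress string
}

func parseFlags() config {
	var cfg config
	flag.StringVar(&cfg.serverAddress, "server-address", "localhost:50051",
		"address of the device-api-server")

	// Parse after all flags are defined so the returned struct carries the
	// values actually passed on the command line.
	flag.Parse()
	return cfg
}

func main() {
	// Register klog flags before parsing so they are picked up too.
	klog.InitFlags(nil)
	cfg := parseFlags()
	klog.Infof("using server address %s", cfg.serverAddress)
}
```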
The daemonset template was passing --provider-address to the
device-api-server container when nvmlProvider.enabled was true,
but this flag doesn't exist in the device-api-server binary.

The sidecar architecture doesn't require the server to know about
providers - providers connect TO the server via gRPC. Remove the
invalid flag to fix container crashes.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Multiple fixes for the sidecar test values:

- serverAddress: Change from localhost:9001 to localhost:50051 to
  match the actual gRPC port the device-api-server listens on

- nodeSelector: Explicitly null out nvidia.com/gpu.present to prevent
  Helm from merging it with the node-type=gpu selector. The test
  cluster only has node-type=gpu labels.

- securityContext: Run as root (uid 0) to allow the server to create
  Unix sockets in the hostPath /var/run/device-api/ directory. The
  non-root user (65534) cannot write to this host directory.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Major improvements to the NVML sidecar demo script:

Cross-platform builds:
- Use docker buildx for linux/amd64 builds (required when building
  on ARM Macs for x86 clusters)
- Use --push flag to push directly after build (required for
  cross-platform builds which can't be stored locally)
- Check for buildx availability at startup

Idempotency:
- Check if images already exist on ttl.sh before building
- Prompt user whether to rebuild existing images
- Use helm upgrade if release exists, helm install if not
- Use rollout restart instead of relying on --wait timeout
- Add 2-minute rollout timeout with status check on failure

Bug fixes:
- Fix namespace creation piped command (color codes corrupted YAML)
- Fix pod label selector (app.kubernetes.io/name=nvsentinel)

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
The in-pod metrics check was failing because the container name is 'nvsentinel' (from Chart.Name), not 'device-api-server'. Rather than add complexity to detect the container name, remove this redundant check - metrics are already validated in Step 7 (show_metrics).

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
…-api-server'

The Helm chart names the main container using {{ .Chart.Name }}, which resolves to 'nvsentinel'. Fixed all kubectl logs/exec commands to use the correct container name.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
The committed protobuf files were generated with protoc v33.4, but CI was installing v33.0. This version mismatch caused verify-codegen to fail because regenerating with a different protoc version produces different file headers.

Comprehensive design document capturing:
- In-memory cache architecture (replacing Kine/SQLite)
- Memory safety bounds sized for Vera Rubin Ultra NVL576 (1024 GPUs)
- Readiness gate blocking consumers until providers register
- Watch broadcaster with slow subscriber eviction
- What to take from each PR
- Implementation phases
- Authorship preservation requirements

This document serves as the spec for cherry-picking #720 onto merged #718.

Co-authored-by: Carlos Eduardo Arango Gutierrez <carangog@redhat.com>
Co-authored-by: Dan Stone <danstone@nvidia.com>
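
The readiness gate called out above could be sketched as follows (illustrative only; the actual design may gate consumer requests differently):

```go
// Illustrative readiness gate: consumer-facing RPCs block (or fail fast) until
// at least one provider has registered. Not the repo's actual implementation.
package server

import (
	"context"
	"sync"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

type ReadinessGate struct {
	once  sync.Once
	ready chan struct{}
}

func NewReadinessGate() *ReadinessGate {
	return &ReadinessGate{ready: make(chan struct{})}
}

// MarkReady is called when the first provider registers; closing the channel
// releases every waiter at once.
func (g *ReadinessGate) MarkReady() {
	g.once.Do(func() { close(g.ready) })
}

// Wait blocks a consumer request until a provider has registered or the
// request context is cancelled.
func (g *ReadinessGate) Wait(ctx context.Context) error {
	select {
	case <-g.ready:
		return nil
	case <-ctx.Done():
		return status.Error(codes.Unavailable, "no GPU provider registered yet")
	}
}
```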
Prepare for isolated worktree development workflow.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
This commit introduces a Kubernetes-native approach to GPU health monitoring
and remediation, replacing the previous MongoDB-centric architecture.

New components:
- HealthEvent CRD (nvsentinel.nvidia.com/v1alpha1) for cluster-wide state
- QuarantineController: Cordons nodes on fatal GPU errors
- DrainController: Gracefully evicts pods from quarantined nodes
- RemediationController: Executes remediation strategies (reboot, gpu-reset)
- TTLController: Automatic cleanup of resolved events based on policy
- HealthProvider service: gRPC-based GPU health monitoring using NVML
- CRDPublisher: Publishes GPU health events from device-api-server

Key features:
- Automatic phase progression: New → Quarantined → Drained → Remediated
- Configurable TTL policies via ConfigMap
- Prometheus metrics for all controllers
- Comprehensive unit tests
- Integration test harness (cmd/controller-test)

Helm chart updates:
- CRD installation under deployments/helm/nvsentinel/crds/
- TTL policy ConfigMap template
- Health-provider DaemonSet chart

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Copilot AI (Contributor) left a comment

Pull request overview

This PR introduces a Kubernetes-native GPU health monitoring and remediation system that replaces a MongoDB-centric architecture with CRD-based state management. The implementation centers around a HealthEvent CRD with four specialized controllers (Quarantine, Drain, Remediation, TTL) that orchestrate automated fault handling workflows. Additionally, the PR includes a complete gRPC-based client generation infrastructure for the NVIDIA Device API, enabling Kubernetes-style interactions with node-local GPU resources.

Changes:

  • Introduced HealthEvent CRD with automated controller chain for GPU fault management
  • Implemented gRPC-based Device API client infrastructure with full Kubernetes-style code generation
  • Added comprehensive test coverage including fake clients, informers, and integration tests

Reviewed changes

Copilot reviewed 131 out of 1150 changed files in this pull request and generated no comments.

Summary per file:

  • code-generator/cmd/client-gen/**: Custom client generator supporting gRPC transport for Device API
  • client-go/**: Generated and manual gRPC client implementation with Kubernetes-native patterns
  • api/device/v1alpha1/**: Device API type definitions with protobuf conversion logic
  • api/nvsentinel/v1alpha1/**: HealthEvent CRD schema definitions
  • cmd/health-provider/**: GPU health monitoring service with NVML integration
  • .versions.yaml: Centralized tool version management
  • .github/workflows/ci.yml: CI workflow for build validation

ArangoGutierrez (Contributor, Author) commented:

Closing in favor of #795, which is properly rebased onto main and includes all fixes from code review.
