feat: cloud-native GPU health event management #794

Closed

ArangoGutierrez wants to merge 23 commits into main from feat/cloud-native-storage

Conversation

ArangoGutierrez (Contributor) commented on Feb 4, 2026

Summary

This PR introduces a Kubernetes-native approach to GPU health monitoring and remediation, designed to replace the MongoDB-centric architecture.

Key Changes

  • HealthEvent CRD (nvsentinel.nvidia.com/v1alpha1): Cluster-wide custom resource for tracking GPU health events (a rough sketch of the types appears after this list)
  • Controller Chain: Automated remediation workflow with four specialized controllers:
    • QuarantineController: Cordons nodes on fatal GPU errors
    • DrainController: Gracefully evicts pods from quarantined nodes
    • RemediationController: Executes remediation strategies (reboot, GPU reset, manual)
    • TTLController: Automatic cleanup of resolved events based on configurable policies
  • HealthProvider Service: gRPC-based GPU health monitoring using NVML
  • CRDPublisher: Bridge between device-api-server and Kubernetes API
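
For orientation, here is a minimal sketch of what the HealthEvent Go types could look like. The field names, phase strings, and remediation strategies shown are assumptions drawn from the bullets above, not the actual schema added in this PR:

```go
// Illustrative sketch only: these field and phase names are assumptions and
// not the actual nvsentinel.nvidia.com/v1alpha1 schema introduced in this PR.
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// HealthEventSpec describes a single GPU health event observed on a node.
type HealthEventSpec struct {
	NodeName            string `json:"nodeName"`
	GPUUUID             string `json:"gpuUUID"`
	XID                 int32  `json:"xid,omitempty"`
	Fatal               bool   `json:"fatal"`
	RemediationStrategy string `json:"remediationStrategy,omitempty"` // reboot | gpu-reset | manual
}

// HealthEventStatus tracks the phase driven by the controller chain.
type HealthEventStatus struct {
	Phase      string      `json:"phase,omitempty"` // New, Quarantined, Drained, Remediated, Resolved
	LastUpdate metav1.Time `json:"lastUpdate,omitempty"`
}

// HealthEvent is a cluster-scoped record of a GPU health incident.
type HealthEvent struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   HealthEventSpec   `json:"spec,omitempty"`
	Status HealthEventStatus `json:"status,omitempty"`
}
```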

Architecture

┌─────────────────┐    gRPC    ┌──────────────────┐
│ HealthProvider  │ ──────────▶│ Device-API-Server│
│   (DaemonSet)   │            │                  │
└─────────────────┘            └────────┬─────────┘
        │                               │
        │ NVML                          │ CRDPublisher
        ▼                               ▼
    ┌───────┐                   ┌───────────────┐
    │  GPU  │                   │  HealthEvent  │
    └───────┘                   │     (CRD)     │
                                └───────┬───────┘
                                        │
              ┌─────────────────────────┼─────────────────────────┐
              │                         │                         │
              ▼                         ▼                         ▼
    ┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐
    │ QuarantineCtrl   │───▶│   DrainCtrl      │───▶│ RemediationCtrl  │
    │  (cordon node)   │    │  (evict pods)    │    │ (reboot/reset)   │
    └──────────────────┘    └──────────────────┘    └──────────────────┘

Phase Progression

New → Quarantined → Drained → Remediated → Resolved

Each phase transition is handled by a dedicated controller, ensuring separation of concerns and reliable state management.
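
To make the hand-off concrete, below is a minimal controller-runtime sketch of what the QuarantineController's reconcile loop could look like; the type names, phase strings, and import path are assumptions rather than the code in this PR:

```go
// Illustrative sketch of the per-phase controller pattern. The real controller,
// phase constants, and API import path in this PR may differ.
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	v1alpha1 "github.com/nvidia/nvsentinel/api/nvsentinel/v1alpha1" // assumed import path
)

// QuarantineReconciler cordons the affected node for fatal HealthEvents that
// are still in phase "New", then hands the event off to the DrainController.
type QuarantineReconciler struct {
	client.Client
}

func (r *QuarantineReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var ev v1alpha1.HealthEvent
	if err := r.Get(ctx, req.NamespacedName, &ev); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Each controller acts only on "its" phase; everything else is a no-op.
	if ev.Status.Phase != "New" || !ev.Spec.Fatal {
		return ctrl.Result{}, nil
	}

	// Cordon the node so no new pods are scheduled onto the faulty GPU.
	var node corev1.Node
	if err := r.Get(ctx, types.NamespacedName{Name: ev.Spec.NodeName}, &node); err != nil {
		return ctrl.Result{}, err
	}
	node.Spec.Unschedulable = true
	if err := r.Update(ctx, &node); err != nil {
		return ctrl.Result{}, err
	}

	// Advance the phase so the DrainController picks the event up next.
	ev.Status.Phase = "Quarantined"
	return ctrl.Result{}, r.Status().Update(ctx, &ev)
}
```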

Test Plan

  • Unit tests for all controllers (quarantine, drain, remediation, TTL)
  • Unit tests for CRDPublisher and HealthProvider
  • Integration test harness (cmd/controller-test)
  • Manual integration test on AWS EKS cluster with GPUs
  • Full phase progression verified end to end: New → Quarantined → Drained → Remediated (completed in 2 seconds)

Related

ArangoGutierrez and others added 23 commits January 13, 2026 18:17
Remove all NVSentinel modules except api/ to prepare for donation
as a standalone Device API repository (device.nvidia.com).

Changes:
- Remove all modules: commons, data-models, health-monitors, etc.
- Simplify README.md, DEVELOPMENT.md, SECURITY.md for API focus
- Simplify Makefile to delegate to api/
- Trim .gitignore and .versions.yaml to essentials
- Trim .github/ to basic scaffolding (lint-test workflow, templates)
- Fix api/Makefile (remove undefined THIRD_PARTY_DIR reference)

Preserves:
- api/ folder with GPU proto definitions and generated Go code
- LICENSE, CODE_OF_CONDUCT.md
- Basic GitHub scaffolding (DCO, issue templates, PR template)

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Signed-off-by: Dan Huenecke <dhuenecke@nvidia.com>
Signed-off-by: Dan Huenecke <dhuenecke@nvidia.com>
Add ProviderService with RPCs for GPU lifecycle management:
- RegisterGpu / UnregisterGpu
- UpdateGpuStatus / UpdateGpuCondition

Also adds resource_version field to Gpu message for optimistic
concurrency control.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
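
As a rough illustration of the optimistic-concurrency idea behind the resource_version field (the UpdateStatusWithVersion name is borrowed from a later commit in this PR; the implementation below is only a sketch):

```go
// Illustrative sketch of an optimistic-concurrency check keyed on
// resource_version; the real cache in this PR may differ.
package cache

import (
	"fmt"
	"strconv"
	"sync"
)

type gpuEntry struct {
	status          string
	resourceVersion uint64
}

// GpuCache is a minimal stand-in for the server's in-memory GPU cache.
type GpuCache struct {
	mu   sync.Mutex
	gpus map[string]*gpuEntry
}

// UpdateStatusWithVersion applies a status update only if the caller's
// resource version matches the cached one, mirroring Kubernetes semantics:
// a stale writer gets a conflict instead of silently clobbering newer state.
func (c *GpuCache) UpdateStatusWithVersion(uuid, status, resourceVersion string) (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	e, ok := c.gpus[uuid]
	if !ok {
		return "", fmt.Errorf("gpu %q not found", uuid)
	}
	rv, err := strconv.ParseUint(resourceVersion, 10, 64)
	if err != nil || rv != e.resourceVersion {
		return "", fmt.Errorf("conflict: stale resource version for gpu %q", uuid)
	}

	e.status = status
	e.resourceVersion++ // bump on every successful mutation
	return strconv.FormatUint(e.resourceVersion, 10), nil
}
```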
Implements the Device API Server - a gRPC server providing unified
GPU device information via GpuService (consumer) and ProviderService
(provider) APIs.

Components:
- In-memory cache with RWMutex for thread-safe GPU state
- Watch broadcaster for real-time state change notifications
- NVML fallback provider for GPU enumeration and XID health monitoring
- Prometheus metrics (cache, watch, NVML, gRPC stats)
- Helm chart with ServiceMonitor and alerting rules

Includes comprehensive unit tests (35 tests, race-clean) and
documentation (API reference, operations guide, design docs).

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
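
The watch broadcaster mentioned in the commit above can be pictured roughly as the following sketch, in which a subscriber whose buffer is full is evicted rather than allowed to block the publisher; the names and buffer size are assumptions:

```go
// Illustrative watch broadcaster: every published state change is fanned out
// to all subscribers over channels. Not the repo's actual implementation.
package watch

import "sync"

// Event is a minimal stand-in for a GPU state-change notification.
type Event struct {
	UUID  string
	State string
}

// Broadcaster fans out events to all current subscribers.
type Broadcaster struct {
	mu   sync.Mutex
	subs map[chan Event]struct{}
}

func NewBroadcaster() *Broadcaster {
	return &Broadcaster{subs: make(map[chan Event]struct{})}
}

// Subscribe registers a buffered channel that will receive future events.
// The returned cancel func removes the subscription.
func (b *Broadcaster) Subscribe() (<-chan Event, func()) {
	ch := make(chan Event, 16)
	b.mu.Lock()
	b.subs[ch] = struct{}{}
	b.mu.Unlock()

	cancel := func() {
		b.mu.Lock()
		defer b.mu.Unlock()
		if _, ok := b.subs[ch]; ok {
			delete(b.subs, ch)
			close(ch)
		}
	}
	return ch, cancel
}

// Publish delivers ev to every subscriber. A subscriber whose buffer is full
// is evicted instead of being allowed to block the publisher.
func (b *Broadcaster) Publish(ev Event) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for ch := range b.subs {
		select {
		case ch <- ev:
		default:
			delete(b.subs, ch)
			close(ch)
		}
	}
}
```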
The gRPC API is unauthenticated, which poses a security risk when
bound to all network interfaces. This change:

- Changes default GRPCAddress from ":50051" to "127.0.0.1:50051"
- Updates flag documentation with security warning
- Expands network security documentation per Copilot review
- Updates design and operations docs to reflect new default

Users who need network access can explicitly set --grpc-address=:50051,
with full awareness of the security implications.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
The Heartbeat RPC is intentionally defined but unimplemented,
reserved for future provider liveness detection. This commit:

- Improves the error message to explain the reservation purpose
- Updates code comments to clarify this is intentional
- Aligns implementation comments with proto documentation

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Add ResourceVersion() method to GpuCache and use it to populate
ListMeta.ResourceVersion in ListGpus responses. This enables:

- Optimistic concurrency patterns in consumers
- Cache invalidation based on version changes
- Consistency tracking across list operations

The resource version is a monotonically increasing counter that
increments on every cache mutation.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
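
A compact sketch of the counter described above (illustrative only; the real GpuCache and its mutation methods differ):

```go
// Illustrative sketch: a cache-wide counter bumped on every mutation and
// surfaced as ListMeta.ResourceVersion on list responses. Names are assumptions.
package cache

import (
	"strconv"
	"sync"
)

// Gpu is a stand-in for the generated API type.
type Gpu struct{ UUID string }

type GpuCache struct {
	mu      sync.RWMutex
	version uint64
	gpus    map[string]*Gpu
}

// Delete removes a GPU and, like every mutation, bumps the cache version.
func (c *GpuCache) Delete(uuid string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.gpus, uuid)
	c.version++
}

// ResourceVersion returns the current cache-wide version as a string, the
// same form Kubernetes exposes in ListMeta.ResourceVersion, so ListGpus
// responses can report it to consumers.
func (c *GpuCache) ResourceVersion() string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return strconv.FormatUint(c.version, 10)
}
```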
Add module structure documentation to DEVELOPMENT.md to clarify
the relationship between the different Go modules in this monorepo:

- Device API Server uses github.com/nvidia/device-api
- API definitions use github.com/nvidia/device-api/api
- client-go and code-generator use github.com/nvidia/nvsentinel

This documents the intentional divergence where device-api is a
standalone component that may be published as its own repository.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
The PR introduced inconsistent module paths where the api module
was changed to github.com/nvidia/device-api/api while client-go
still uses github.com/nvidia/nvsentinel/api. This broke CI.

This commit aligns all modules to use github.com/nvidia/nvsentinel:
- api/go.mod: github.com/nvidia/nvsentinel/api
- go.mod: github.com/nvidia/nvsentinel
- All imports updated to use nvsentinel path
- Proto go_package options updated
- Generated files updated

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Addresses review feedback from pteranodan on PR #720 regarding API
design and NVML provider coupling.

BREAKING CHANGES:
- Removed ProviderService in favor of unified GpuService
- RegisterGpu() → CreateGpu()
- UnregisterGpu() → DeleteGpu()
- UpdateGpuCondition() → removed (use UpdateGpuStatus with full status)

API Changes:
- Consolidated GpuService with both read and write operations
- Added standard K8s CRUD methods: CreateGpu, UpdateGpu, UpdateGpuStatus, DeleteGpu
- Removed custom provider methods (RegisterGpu, UnregisterGpu, UpdateGpuCondition)
- Deleted api/proto/device/v1alpha1/provider.proto
- Updated gpu.proto with unified service definition

Service Layer:
- Created pkg/deviceapiserver/service/gpu_service.go (unified service)
- Deleted consumer.go and provider.go (replaced by gpu_service.go)
- Updated integration tests to use new API
- Improved cache interface with Create, Update, UpdateStatusWithVersion, Delete

NVML Provider Decoupling:
- NVML provider now uses GpuServiceClient instead of direct cache access
- Server creates loopback gRPC connection for embedded NVML provider
- Added standalone cmd/nvml-provider/ binary for sidecar deployment
- Provider now "dogfoods" its own API via gRPC

Documentation:
- Updated README.md with new unified API architecture
- Rewrote docs/api/device-api-server.md with single service model
- Updated docs/operations/device-api-server.md
- Simplified docs/design/device-api-server.md
- Removed obsolete task lists and design documents

Deployment:
- Reorganized charts: charts/device-api-server/ → deployments/helm/nvsentinel/
- Added deployments/container/Dockerfile
- Added deployments/static/ for static manifests
- Added demos/nvml-sidecar-demo.sh

Other:
- Updated .gitignore for planning documents
- Enhanced Makefile with new targets
- Added metrics for unified service

This refactoring improves:
1. API consistency with Kubernetes conventions
2. Tooling compatibility (code generators, controller-runtime)
3. Decoupling of NVML provider from server internals
4. Flexibility to run NVML as embedded or sidecar
5. Overall code maintainability and clarity

Signed-off-by: Eduardo Arango <eduardoa@nvidia.com>
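
The loopback "dogfooding" pattern described above might look roughly like the following; the generated client and request type names are assumptions, and only the CreateGpu call itself comes from this PR's API surface:

```go
// Illustrative sketch: the embedded NVML provider talking to its own server
// over a loopback gRPC connection instead of touching the cache directly.
// The generated client/package names here are assumptions.
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	devicev1 "github.com/nvidia/nvsentinel/api/device/v1alpha1" // assumed generated package
)

func main() {
	// The server binds to 127.0.0.1:50051 by default, so the embedded provider
	// dials the same loopback address an external sidecar would use.
	conn, err := grpc.NewClient("127.0.0.1:50051",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("dial device-api-server: %v", err)
	}
	defer conn.Close()

	client := devicev1.NewGpuServiceClient(conn)

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Publish a GPU through the same public API any other provider would use.
	if _, err := client.CreateGpu(ctx, &devicev1.CreateGpuRequest{ /* populated from NVML enumeration */ }); err != nil {
		log.Fatalf("create gpu: %v", err)
	}
}
```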
Update Go version to 1.25 across all module files and the Dockerfile
to ensure consistent builds. The previous version (1.25.5) was a
non-existent patch version, and the Dockerfile was using 1.24, for which
no Docker Hub images are available.

Changes:
- go.mod: 1.25.5 -> 1.25
- api/go.mod: 1.25.5 -> 1.25
- Dockerfile: 1.24 -> 1.25

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
The flag parsing was broken because flag.Parse() was called in main()
after parseFlags() had already returned the config struct. This meant
command-line arguments like --server-address were never applied to
the config.

Move flag.Parse() inside parseFlags() after defining all flags but
before returning, so the parsed values are captured in the returned
config struct.

Also reorder klog.InitFlags() to be called before parseFlags() so
klog flags are also registered before parsing.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
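
In code, the fix described above amounts to something like this sketch (the flag set, default value, and surrounding main are illustrative, not the actual binary):

```go
// Illustrative sketch of the fix: flag.Parse() now runs inside parseFlags(),
// after every flag (including klog's) has been registered.
package main

import (
	"flag"

	"k8s.io/klog/v2"
)

type config struct {
	serverAddress string
}

func parseFlags() config {
	var cfg config
	flag.StringVar(&cfg.serverAddress, "server-address", "localhost:50051",
		"address of the device-api-server")

	// Parse after all flags are defined so the returned struct carries the
	// values actually passed on the command line.
	flag.Parse()
	return cfg
}

func main() {
	// Register klog flags before parsing so they are picked up too.
	klog.InitFlags(nil)
	cfg := parseFlags()
	klog.Infof("using server address %s", cfg.serverAddress)
}
```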
The daemonset template was passing --provider-address to the
device-api-server container when nvmlProvider.enabled was true,
but this flag doesn't exist in the device-api-server binary.

The sidecar architecture doesn't require the server to know about
providers - providers connect TO the server via gRPC. Remove the
invalid flag to fix container crashes.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Multiple fixes for the sidecar test values:

- serverAddress: Change from localhost:9001 to localhost:50051 to
  match the actual gRPC port the device-api-server listens on

- nodeSelector: Explicitly null out nvidia.com/gpu.present to prevent
  Helm from merging it with the node-type=gpu selector. The test
  cluster only has node-type=gpu labels.

- securityContext: Run as root (uid 0) to allow the server to create
  Unix sockets in the hostPath /var/run/device-api/ directory. The
  non-root user (65534) cannot write to this host directory.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Major improvements to the NVML sidecar demo script:

Cross-platform builds:
- Use docker buildx for linux/amd64 builds (required when building
  on ARM Macs for x86 clusters)
- Use --push flag to push directly after build (required for
  cross-platform builds which can't be stored locally)
- Check for buildx availability at startup

Idempotency:
- Check if images already exist on ttl.sh before building
- Prompt user whether to rebuild existing images
- Use helm upgrade if release exists, helm install if not
- Use rollout restart instead of relying on --wait timeout
- Add 2-minute rollout timeout with status check on failure

Bug fixes:
- Fix namespace creation piped command (color codes corrupted YAML)
- Fix pod label selector (app.kubernetes.io/name=nvsentinel)

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
The in-pod metrics check was failing because the container name is 'nvsentinel' (from Chart.Name), not 'device-api-server'. Rather than add complexity to detect the container name, remove this redundant check - metrics are already validated in Step 7 (show_metrics).

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
…-api-server'

The Helm chart names the main container using {{ .Chart.Name }}, which resolves to 'nvsentinel'. Fixed all kubectl logs/exec commands to use the correct container name.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
The committed protobuf files were generated with protoc v33.4, but CI was installing v33.0. This version mismatch caused verify-codegen to fail because regenerating with a different protoc version produces different file headers.

Comprehensive design document capturing:
- In-memory cache architecture (replacing Kine/SQLite)
- Memory safety bounds sized for Vera Rubin Ultra NVL576 (1024 GPUs)
- Readiness gate blocking consumers until providers register
- Watch broadcaster with slow subscriber eviction
- What to take from each PR
- Implementation phases
- Authorship preservation requirements

This document serves as the spec for cherry-picking #720 onto merged #718.

Co-authored-by: Carlos Eduardo Arango Gutierrez <carangog@redhat.com>
Co-authored-by: Dan Stone <danstone@nvidia.com>
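
The readiness gate called out above could be sketched as follows (illustrative only; the actual design may gate consumer requests differently):

```go
// Illustrative readiness gate: consumer-facing RPCs block (or fail fast) until
// at least one provider has registered. Not the repo's actual implementation.
package server

import (
	"context"
	"sync"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

type ReadinessGate struct {
	once  sync.Once
	ready chan struct{}
}

func NewReadinessGate() *ReadinessGate {
	return &ReadinessGate{ready: make(chan struct{})}
}

// MarkReady is called when the first provider registers; closing the channel
// releases every waiter at once.
func (g *ReadinessGate) MarkReady() {
	g.once.Do(func() { close(g.ready) })
}

// Wait blocks a consumer request until a provider has registered or the
// request context is cancelled.
func (g *ReadinessGate) Wait(ctx context.Context) error {
	select {
	case <-g.ready:
		return nil
	case <-ctx.Done():
		return status.Error(codes.Unavailable, "no GPU provider registered yet")
	}
}
```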
Prepare for isolated worktree development workflow.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
This commit introduces a Kubernetes-native approach to GPU health monitoring
and remediation, replacing the previous MongoDB-centric architecture.

New components:
- HealthEvent CRD (nvsentinel.nvidia.com/v1alpha1) for cluster-wide state
- QuarantineController: Cordons nodes on fatal GPU errors
- DrainController: Gracefully evicts pods from quarantined nodes
- RemediationController: Executes remediation strategies (reboot, gpu-reset)
- TTLController: Automatic cleanup of resolved events based on policy
- HealthProvider service: gRPC-based GPU health monitoring using NVML
- CRDPublisher: Publishes GPU health events from device-api-server

Key features:
- Automatic phase progression: New → Quarantined → Drained → Remediated
- Configurable TTL policies via ConfigMap
- Prometheus metrics for all controllers
- Comprehensive unit tests
- Integration test harness (cmd/controller-test)

Helm chart updates:
- CRD installation under deployments/helm/nvsentinel/crds/
- TTL policy ConfigMap template
- Health-provider DaemonSet chart

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Copilot AI (Contributor) left a comment

Pull request overview

This PR introduces a Kubernetes-native GPU health monitoring and remediation system that replaces a MongoDB-centric architecture with CRD-based state management. The implementation centers around a HealthEvent CRD with four specialized controllers (Quarantine, Drain, Remediation, TTL) that orchestrate automated fault handling workflows. Additionally, the PR includes a complete gRPC-based client generation infrastructure for the NVIDIA Device API, enabling Kubernetes-style interactions with node-local GPU resources.

Changes:

  • Introduced HealthEvent CRD with automated controller chain for GPU fault management
  • Implemented gRPC-based Device API client infrastructure with full Kubernetes-style code generation
  • Added comprehensive test coverage including fake clients, informers, and integration tests

Reviewed changes

Copilot reviewed 131 out of 1150 changed files in this pull request and generated no comments.

Summary per file:

  • code-generator/cmd/client-gen/**: Custom client generator supporting gRPC transport for Device API
  • client-go/**: Generated and manual gRPC client implementation with Kubernetes-native patterns
  • api/device/v1alpha1/**: Device API type definitions with protobuf conversion logic
  • api/nvsentinel/v1alpha1/**: HealthEvent CRD schema definitions
  • cmd/health-provider/**: GPU health monitoring service with NVML integration
  • .versions.yaml: Centralized tool version management
  • .github/workflows/ci.yml: CI workflow for build validation

ArangoGutierrez (Contributor, Author) commented:

Closing in favor of #795, which is properly rebased onto main and includes all fixes from code review.
