
feat: add device-api-server with NVML fallback provider #720

Open

ArangoGutierrez wants to merge 1 commit into device-api from device-api-server

Conversation

ArangoGutierrez commented Jan 21, 2026

Summary

Implements the Device API Server — a node-local gRPC cache server for GPU device state, deployed as a Kubernetes DaemonSet. It acts as an intermediary between providers (e.g., NVSentinel health monitors that update GPU state) and consumers (e.g., Device Plugins, DRA Drivers that read state for scheduling decisions).

Architecture

┌──────────────────────────────────────────────────────────────────────┐
│                          Kubernetes Node                             │
│                                                                      │
│  ┌──────────────────┐                  ┌──────────────────────────┐  │
│  │   NVSentinel      │                  │  Device Plugin / DRA     │  │
│  │  Health Monitor   │                  │       Driver             │  │
│  │   [Provider]      │                  │     [Consumer]           │  │
│  └────────┬─────────┘                  └────────────┬─────────────┘  │
│           │ CreateGpu()                              │ GetGpu()       │
│           │ UpdateGpuStatus()                        │ ListGpus()     │
│           │ (gRPC)                                   │ WatchGpus()    │
│           ▼                                          ▼                │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │                  Device API Server (DaemonSet)                  │  │
│  │                                                                 │  │
│  │  ┌─────────────────────────────────────────────────────────┐   │  │
│  │  │                   GpuService (Unified)                   │   │  │
│  │  │  Write: CreateGpu, UpdateGpu, UpdateGpuStatus, DeleteGpu │   │  │
│  │  │  Read:  GetGpu, ListGpus, WatchGpus                      │   │  │
│  │  └──────────────────────┬──────────────────────────────────┘   │  │
│  │                         ▼                                       │  │
│  │  ┌─────────────────────────────────────────────────────────┐   │  │
│  │  │               Cache (sync.RWMutex)                       │   │  │
│  │  │  • Writer-preference: blocks ALL new reads during write  │   │  │
│  │  │  • proto.Clone on every read/write (clone-safety)        │   │  │
│  │  │  • Optimistic concurrency via resource_version           │   │  │
│  │  │  • Watch broadcaster with ListAndSubscribe atomicity     │   │  │
│  │  └─────────────────────────────────────────────────────────┘   │  │
│  │                                                                 │  │
│  │  ┌───────────┐  ┌───────────┐  ┌─────────────────────────┐    │  │
│  │  │ Health    │  │ Metrics   │  │ Unix Socket             │    │  │
│  │  │ :8081     │  │ :9090     │  │ /var/run/device-api/    │    │  │
│  │  │ /healthz  │  │ /metrics  │  │ device.sock             │    │  │
│  │  └───────────┘  └───────────┘  └─────────────────────────┘    │  │
│  └────────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────────┘

Why read-blocking semantics?

The sync.RWMutex with writer-preference ensures consumers never read stale "healthy" state when a provider is updating to "unhealthy". Once a write is pending, new readers block until the write completes. This is critical for correct scheduling decisions.
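
For illustration, a minimal sketch of how those read/write paths could look, assuming a cache keyed by GPU UUID. The type and field names (gpuCache, store) are placeholders rather than the PR's actual implementation; the generated Gpu type is assumed to live at the import path used elsewhere in this PR.

```go
package cache

import (
	"sync"

	"google.golang.org/protobuf/proto"

	v1alpha1 "github.com/nvidia/nvsentinel/internal/generated/device/v1alpha1"
)

type gpuCache struct {
	mu    sync.RWMutex
	store map[string]*v1alpha1.Gpu // keyed by GPU UUID
}

// Get returns a clone of the cached object so callers can never mutate
// cached state directly (the clone-safety property mentioned above).
func (c *gpuCache) Get(uuid string) (*v1alpha1.Gpu, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	gpu, ok := c.store[uuid]
	if !ok {
		return nil, false
	}
	return proto.Clone(gpu).(*v1alpha1.Gpu), true
}

// UpdateStatus stores a clone under an exclusive lock. Once Lock() is
// pending, new RLock() calls block until the write completes, which is the
// writer-preference behavior the design relies on.
func (c *gpuCache) UpdateStatus(uuid string, gpu *v1alpha1.Gpu) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.store[uuid] = proto.Clone(gpu).(*v1alpha1.Gpu)
}
```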

NVML Provider (Sidecar)

The NVML provider runs as a sidecar container that:

  1. Enumerates GPUs via NVML and registers them via CreateGpu
  2. Monitors XID events for hardware health (ECC errors, GPU fallen off bus, etc.)
  3. Updates GPU conditions via UpdateGpuStatus when health changes
  4. Classifies XIDs by severity: ignored (application errors), warning, critical (hardware failures)

Decoupled as a sidecar for independent updates and glibc isolation (NVML requires RTLD_DEEPBIND).
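
As a rough sketch of the severity classification step (item 4 above), using the XID groupings listed later in this PR's XidDescriptions table; the exact bucket assignments in the provider may differ.

```go
// xidSeverity buckets XID events into the three classes described above.
type xidSeverity int

const (
	xidIgnored  xidSeverity = iota // application-level errors
	xidWarning                     // notable but not hardware-fatal
	xidCritical                    // hardware failures
)

// Illustrative groupings; the authoritative sets live in the provider code.
var (
	ignoredXids  = map[uint64]bool{13: true, 31: true, 43: true, 45: true, 68: true, 109: true}
	criticalXids = map[uint64]bool{48: true, 63: true, 64: true, 74: true, 79: true, 94: true, 95: true}
)

func classifyXid(xid uint64) xidSeverity {
	switch {
	case ignoredXids[xid]:
		return xidIgnored
	case criticalXids[xid]:
		return xidCritical
	default:
		return xidWarning
	}
}
```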

Key Design Decisions

| Decision | Rationale |
|----------|-----------|
| Unified GpuService (not split provider/consumer) | Simpler API surface, standard CRUD, matches K8s patterns |
| proto.Clone on all cache paths | Prevents external mutation of cached state; verified by clone-safety tests |
| Optimistic concurrency (resource_version) | Prevents lost updates from concurrent providers without pessimistic locking |
| MaxConditionsPerGpu = 32 | Bounds memory growth from unbounded condition lists |
| Build-tagged NVML (//go:build nvml) | Pure Go server binary; NVML only compiled when needed for sidecar |
| gRPC resource limits (4MB msg, 100 streams) | Prevents resource exhaustion from errant local processes |
| Generic error messages to clients | Internal error details logged server-side, not leaked to gRPC clients |
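
To make the optimistic-concurrency row concrete, here is a hedged sketch of the version check a write RPC could perform. The string-typed, numeric resource_version matches the later commits in this PR, but the function name and the exact gRPC status code used on conflict are assumptions.

```go
import (
	"strconv"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// checkResourceVersion rejects malformed versions with InvalidArgument and
// mismatched versions with a conflict-style error, so a provider must re-read
// the GPU and retry instead of silently overwriting a concurrent update.
func checkResourceVersion(requested, current string) error {
	if requested == "" {
		return nil // no precondition supplied by the caller
	}
	if _, err := strconv.ParseInt(requested, 10, 64); err != nil {
		return status.Errorf(codes.InvalidArgument, "resource_version %q is not numeric", requested)
	}
	if requested != current {
		return status.Errorf(codes.FailedPrecondition, "resource_version conflict: expected %s, have %s", requested, current)
	}
	return nil
}
```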

Key Components

  • GpuService — Unified gRPC service with standard CRUD + Watch
  • GpuCache — In-memory store with sync.RWMutex, clone-safety, and monotonic resource versioning
  • Watch Broadcaster — Atomic ListAndSubscribe to prevent duplicate events; slow subscriber isolation
  • NVML Provider — GPU enumeration and XID-based health monitoring (build-tagged sidecar)
  • Prometheus Metrics — Cache stats, NVML status, gRPC latency, watch event drops
  • Helm Chart — DaemonSet with ServiceMonitor, PrometheusRule, sidecar support
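
For consumers, usage looks roughly like the sketch below: dial the node-local Unix socket and stream watch events. The generated client package path appears elsewhere in this PR, but the request message name is an assumption based on the RPC names.

```go
package main

import (
	"context"
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	v1alpha1 "github.com/nvidia/nvsentinel/internal/generated/device/v1alpha1"
)

func main() {
	// The DaemonSet exposes a node-local Unix socket; no TCP exposure needed.
	conn, err := grpc.NewClient("unix:///var/run/device-api/device.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()

	client := v1alpha1.NewGpuServiceClient(conn)

	// WatchGpus is a server-streaming RPC; each Recv returns one state change.
	stream, err := client.WatchGpus(context.Background(), &v1alpha1.WatchGpusRequest{})
	if err != nil {
		log.Fatalf("watch: %v", err)
	}
	for {
		ev, err := stream.Recv()
		if err != nil {
			log.Fatalf("recv: %v", err)
		}
		log.Printf("gpu event: %v", ev)
	}
}
```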

Type of Change

  • New feature (new message, field, or service method)
  • Documentation
  • Build/Tooling

Test Results

  • 86 unit tests passing with -race flag (cache: 34, service: 21, nvml: 22, metrics: 7, config: 2)
  • go vet clean (including -tags nvml)
  • All cache mutations use proto.Clone (verified via clone-safety tests)
  • Concurrent stress tests: 10 readers + 1 writer, 5 concurrent writers
  • Non-numeric resource_version rejected with InvalidArgument
  • Health classification: False (unhealthy) overrides Unknown regardless of condition ordering
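
The last bullet describes an ordering-independent aggregation rule. A simplified sketch of what that rule means follows; the condition representation here is illustrative, not the actual proto types.

```go
// aggregateHealth returns "Unhealthy" if any condition is False, "Unknown" if
// none are False but at least one is Unknown, and "Healthy" otherwise. Because
// False short-circuits, the result does not depend on condition ordering.
type condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
}

func aggregateHealth(conds []condition) string {
	result := "Healthy"
	for _, c := range conds {
		switch c.Status {
		case "False":
			return "Unhealthy" // unhealthy always wins
		case "Unknown":
			result = "Unknown"
		}
	}
	return result
}
```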

Checklist

  • Proto files compile successfully (make protos-generate)
  • Generated code is up to date and committed
  • Self-review completed
  • Documentation updated
  • Signed-off commits (DCO)
  • gRPC server has message size and stream limits
  • Internal errors do not leak details to clients
  • XID severity maps are immutable (unexported)
  • Port collision validated at startup
  • Helm template handles both :port and host:port formats

Copilot AI left a comment

Pull request overview

This pull request implements a comprehensive Device API Server with NVML fallback provider support for GPU device information and health monitoring. The server acts as a node-local gRPC cache intermediary between providers (health monitors) and consumers (device plugins, DRA drivers).

Changes:

  • Implements complete Device API Server with gRPC services (GpuService for consumers, ProviderService for providers)
  • Adds built-in NVML fallback provider for direct GPU enumeration and XID-based health monitoring
  • Implements thread-safe in-memory cache with RWMutex for read-blocking semantics during writes
  • Adds watch broadcaster for real-time GPU state change notifications
  • Includes comprehensive Prometheus metrics for observability
  • Provides Helm charts for Kubernetes deployment with ServiceMonitor and alerting

Reviewed changes

Copilot reviewed 51 out of 54 changed files in this pull request and generated 1 comment.

| File | Description |
|------|-------------|
| pkg/version/version.go | Version information package with build-time ldflags support |
| pkg/deviceapiserver/config.go | Server configuration with flags, environment variables, and validation |
| pkg/deviceapiserver/server.go | Main server implementation with lifecycle management and graceful shutdown |
| pkg/deviceapiserver/cache/*.go | Thread-safe GPU cache with broadcaster for watch events |
| pkg/deviceapiserver/service/*.go | gRPC service implementations (consumer and provider APIs) |
| pkg/deviceapiserver/nvml/*.go | NVML provider for GPU enumeration and XID health monitoring |
| pkg/deviceapiserver/metrics/metrics.go | Prometheus metrics definitions and recording |
| cmd/device-api-server/main.go | Main entry point with flag parsing and signal handling |
| docs/*.md | Comprehensive API, operations, and design documentation |
| charts/device-api-server/* | Helm chart templates for Kubernetes deployment |
| go.mod/go.sum | Dependencies including NVML, gRPC, and Prometheus libraries |

ArangoGutierrez force-pushed the device-api-server branch 2 times, most recently from 10ae27f to bd1b3ad on January 22, 2026 at 13:54
import "device/v1alpha1/gpu.proto";

// ==========================================
// Provider Service Definition
Contributor

I’m concerned about adding a new service with custom methods. Should we extend the existing one with standard methods instead?

Usually, "registering" is just a Create call, "unregistering" is a Delete call, and condition updates are either an UpdateStatus or Patch call. I think we should stick to these well understood conventions.

Using custom methods also risks breaking compatibility with standard tooling (e.g., code generators and controller-runtime), making things harder for us both now and later.

Contributor

The tight coupling between the API server and NVML feels off.

While we definitely need a default health check for GPUs, embedding it directly into the API server makes it hardware-aware rather than a ‘dumb’ transport pipe.

What do you think about moving the default health checks out of the API server entirely? The API server should be agnostic to how the state is produced. The default health checks should interact with the API server just like any other would: via the standard client interface.

ArangoGutierrez added a commit that referenced this pull request Jan 29, 2026
Addresses review feedback from pteranodan on PR #720 regarding API
design and NVML provider coupling.

BREAKING CHANGES:
- Removed ProviderService in favor of unified GpuService
- RegisterGpu() → CreateGpu()
- UnregisterGpu() → DeleteGpu()
- UpdateGpuCondition() → removed (use UpdateGpuStatus with full status)

API Changes:
- Consolidated GpuService with both read and write operations
- Added standard K8s CRUD methods: CreateGpu, UpdateGpu, UpdateGpuStatus, DeleteGpu
- Removed custom provider methods (RegisterGpu, UnregisterGpu, UpdateGpuCondition)
- Deleted api/proto/device/v1alpha1/provider.proto
- Updated gpu.proto with unified service definition

Service Layer:
- Created pkg/deviceapiserver/service/gpu_service.go (unified service)
- Deleted consumer.go and provider.go (replaced by gpu_service.go)
- Updated integration tests to use new API
- Improved cache interface with Create, Update, UpdateStatusWithVersion, Delete

NVML Provider Decoupling:
- NVML provider now uses GpuServiceClient instead of direct cache access
- Server creates loopback gRPC connection for embedded NVML provider
- Added standalone cmd/nvml-provider/ binary for sidecar deployment
- Provider now "dogfoods" its own API via gRPC

Documentation:
- Updated README.md with new unified API architecture
- Rewrote docs/api/device-api-server.md with single service model
- Updated docs/operations/device-api-server.md
- Simplified docs/design/device-api-server.md
- Removed obsolete task lists and design documents

Deployment:
- Reorganized charts: charts/device-api-server/ → deployments/helm/nvsentinel/
- Added deployments/container/Dockerfile
- Added deployments/static/ for static manifests
- Added demos/nvml-sidecar-demo.sh

Other:
- Updated .gitignore for planning documents
- Enhanced Makefile with new targets
- Added metrics for unified service

This refactoring improves:
1. API consistency with Kubernetes conventions
2. Tooling compatibility (code generators, controller-runtime)
3. Decoupling of NVML provider from server internals
4. Flexibility to run NVML as embedded or sidecar
5. Overall code maintainability and clarity

Signed-off-by: Eduardo Arango <eduardoa@nvidia.com>
//
// This increments on every modification to the GPU and can be used by
// consumers to detect changes or implement optimistic concurrency.
int64 resource_version = 4;
Contributor

What's the difference between this and the resource version inside of the metadata field?

if id := os.Getenv("PROVIDER_ID"); id != "" {
cfg.ProviderID = id
}
if root := os.Getenv("NVML_DRIVER_ROOT"); root != "" {
Contributor

Is this the standard or conventional environment variable used to hold the driver root location?

},
}

_, err := p.gpuClient.CreateGpu(ctx, req)
Contributor

FYI: I’ve updated client-gen to support Create methods (#718). Once that's merged, you can use versioned clients here instead of proto stubs:

gpu, err := clientset.DeviceV1alpha1().GPUs().Create(ctx, &devicev1alpha1.GPU{}, metav1.CreateOptions{}) 

},
}

if _, err := p.gpuClient.UpdateGpuStatus(ctx, req); err != nil {
Contributor

FYI: I’ve updated client-gen to support Update methods (#718). Once that's merged, you can use versioned clients here instead of proto stubs:

gpu, err := clientset.DeviceV1alpha1().GPUs().Update(ctx, &devicev1alpha1.GPU{}, metav1.UpdateOptions{})

ArangoGutierrez added a commit that referenced this pull request Feb 4, 2026
Comprehensive design document capturing:
- In-memory cache architecture (replacing Kine/SQLite)
- Memory safety bounds sized for Vera Rubin Ultra NVL576 (1024 GPUs)
- Readiness gate blocking consumers until providers register
- Watch broadcaster with slow subscriber eviction
- What to take from each PR
- Implementation phases
- Authorship preservation requirements

This document serves as the spec for cherry-picking #720 onto merged #718.

Co-authored-by: Carlos Eduardo Arango Gutierrez <carangog@redhat.com>
Co-authored-by: Dan Stone <danstone@nvidia.com>
Copilot AI left a comment

Pull request overview

Copilot reviewed 50 out of 52 changed files in this pull request and generated 2 comments.

go.mod Outdated
@@ -0,0 +1,32 @@
module github.com/nvidia/nvsentinel

go 1.25
Copilot AI Feb 5, 2026

The Go version is specified as "1.25" here, but in api/go.mod (before the change) it was "1.25.5". While Go considers these equivalent, it's better to be consistent across all go.mod files in the repository. Consider using the same patch version everywhere for consistency.

- The TCP listener binds to `127.0.0.1:50051` (localhost) by default. This prevents network exposure.
- Prefer using a Unix domain socket for all local clients and providers; avoid exposing the TCP gRPC listener to the broader cluster network.
- In multi-tenant or partially untrusted clusters, strongly recommend **disabling the TCP listener** (set `--grpc-address=""`) or using a restricted Unix socket combined with Kubernetes `NetworkPolicy` to limit access to the Device API Server pod.
- If TCP gRPC must be exposed beyond the node (e.g., `--grpc-address=:50051`), terminate it behind TLS/mTLS or an equivalent authenticated tunnel (for example, a sidecar proxy or ingress with client auth) and enforce a least-privilege `NetworkPolicy`.
Copilot AI Feb 5, 2026

The "Network Security" section documents the gRPC API as unauthenticated and recommends keeping it node-local, but the Helm chart currently defaults server.grpcAddress to :50051 (see deployments/helm/nvsentinel/values.yaml), which binds the unauthenticated gRPC listener on all pod interfaces and makes it reachable by any pod in the cluster over plaintext. In a multi-tenant or partially untrusted cluster, an attacker can connect to this port and invoke write RPCs (CreateGpu, UpdateGpuStatus, DeleteGpu) to tamper with GPU health/scheduling state, despite the documentation implying a safer localhost-only default. To close this gap, align the chart defaults with the documented security posture (e.g., default to 127.0.0.1:50051 or Unix socket only) and require explicit opt-in plus TLS/mTLS and restrictive NetworkPolicy when exposing the TCP listener beyond localhost.

Suggested change
- If TCP gRPC must be exposed beyond the node (e.g., `--grpc-address=:50051`), terminate it behind TLS/mTLS or an equivalent authenticated tunnel (for example, a sidecar proxy or ingress with client auth) and enforce a least-privilege `NetworkPolicy`.
- If TCP gRPC must be exposed beyond the node, **do not** publish a plaintext, all-interfaces listener (for example, `--grpc-address=:50051`) directly to the cluster network. Instead, keep the server bound to localhost (for example, `127.0.0.1:50051`) or a Unix domain socket and expose it only behind TLS/mTLS or an equivalent authenticated tunnel (such as a sidecar proxy or ingress with client auth), combined with a least-privilege `NetworkPolicy`.

ArangoGutierrez added a commit that referenced this pull request Feb 5, 2026
Aligns Helm chart default with config.go secure default.
The gRPC API is unauthenticated, so binding to all interfaces
by default exposes the server to unauthorized cluster access.

Addresses Copilot security review feedback on PR #720.
ArangoGutierrez added a commit that referenced this pull request Feb 5, 2026
The resource_version was duplicated at both Gpu.resource_version (field 4,
int64) and Gpu.metadata.resource_version (in ObjectMeta, string). This is
redundant and violates K8s API conventions where resource_version should
only be in metadata.

Changes:
- Remove Gpu.resource_version field from proto (reserved field 4 for
  backward compatibility)
- Update cache to use metadata.resource_version (string format)
- Update service to read from metadata.resource_version
- Remove ResourceVersion from goverter ignore list (no longer needed)
- Regenerate protobuf code

The UpdateGpuStatusRequest.resource_version field is intentionally kept
as it serves a different purpose: passing the expected version for
optimistic concurrency checks in requests.

Addresses pteranodan's review feedback on PR #720.
ArangoGutierrez added a commit that referenced this pull request Feb 5, 2026
The API server should be a 'dumb' transport pipe, agnostic to how
GPU state is produced. NVML integration should be external only,
interacting via the standard gRPC client interface.

Changes:
- Remove nvmlProvider field and related methods from server.go
- Remove NVML config fields (NVMLEnabled, NVMLDriverRoot, etc.) from config.go
- Remove NVML flags from main.go (--enable-nvml-provider, --nvml-driver-root, etc.)
- Remove embedded nvml section from Helm values.yaml
- Remove nvml args and env vars from daemonset.yaml template
- Update NOTES.txt to reference nvmlProvider sidecar instead

The nvmlProvider sidecar (values.yaml) remains as the recommended
approach for NVML-based GPU enumeration and health monitoring.

Addresses pteranodan's architecture review feedback on PR #720.
Copilot AI left a comment

Pull request overview

Copilot reviewed 66 out of 70 changed files in this pull request and generated no new comments.

Copilot AI left a comment

Pull request overview

Copilot reviewed 66 out of 70 changed files in this pull request and generated 12 comments.

Comment on lines +57 to +86
// XidDescriptions provides human-readable descriptions for common XIDs.
var XidDescriptions = map[uint64]string{
// Application errors (typically ignored)
13: "Graphics Engine Exception",
31: "GPU memory page fault",
43: "GPU stopped processing",
45: "Preemptive cleanup",
68: "Video processor exception",
109: "Context Switch Timeout",

// Memory errors
48: "Double Bit ECC Error",
63: "Row remapping failure",
64: "Uncontained ECC error",
74: "NVLink error",
79: "GPU has fallen off the bus",
94: "Contained ECC error",
95: "Uncontained ECC error",

// Other notable XIDs
8: "GPU not accessible",
32: "Invalid or corrupted push buffer stream",
38: "Driver firmware error",
56: "Display engine error",
57: "Error programming video memory interface",
62: "Internal micro-controller halt (non-fatal)",
69: "Graphics engine accessor error",
119: "GSP error",
120: "GSP firmware error",
}
Copilot AI Feb 7, 2026

XidDescriptions is exported as a mutable map, which allows other packages to modify it at runtime (and contradicts the stated goal of keeping XID maps immutable). Since it’s only used internally, consider making it unexported (xidDescriptions) and/or exposing a function that returns a copy to prevent accidental mutation.
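
One way to address this, sketched under the assumption that only descriptions need to be read externally; the accessor names below are suggestions, not the PR's code.

```go
// xidDescriptions is unexported so other packages cannot mutate the table.
var xidDescriptions = map[uint64]string{
	79: "GPU has fallen off the bus",
	// ... remaining entries as above ...
}

// XidDescription returns the human-readable description for an XID, if known.
func XidDescription(xid uint64) (string, bool) {
	desc, ok := xidDescriptions[xid]
	return desc, ok
}

// XidDescriptions returns a copy of the table for callers that need all of it.
func XidDescriptions() map[uint64]string {
	out := make(map[uint64]string, len(xidDescriptions))
	for k, v := range xidDescriptions {
		out[k] = v
	}
	return out
}
```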

Comment on lines 143 to 147
# gRPC API port
EXPOSE 8080
# Metrics port
EXPOSE 8081

Copilot AI Feb 7, 2026

The device-api-server container image exposes ports 8080/8081, but the server’s admin health gRPC and metrics HTTP endpoints are configured via flags (defaulting to :50051 for health gRPC and :9090 for metrics in pkg/controlplane/apiserver/options). Also, gRPC device service is on a unix socket, not a TCP port. Please align EXPOSE/comments with the actual runtime ports (or drop EXPOSE if they’re not meaningful for UDS-based deployment).

Makefile Outdated
Comment on lines 178 to 187
helm lint deployments/helm/nvsentinel

.PHONY: helm-template
helm-template: ## Render Helm chart templates
helm template nvsentinel deployments/helm/nvsentinel

.PHONY: helm-package
helm-package: ## Package Helm chart
@mkdir -p dist/
helm package deployments/helm/nvsentinel -d dist/
Copilot AI Feb 7, 2026

Helm targets reference deployments/helm/nvsentinel, but this PR adds the chart under deployments/helm/device-api-server. As written, make helm-lint, make helm-template, and make helm-package will fail because the referenced path doesn’t exist. Please update these targets to point at the correct chart directory (and ensure the chart name matches).

Suggested change
helm lint deployments/helm/nvsentinel
.PHONY: helm-template
helm-template: ## Render Helm chart templates
helm template nvsentinel deployments/helm/nvsentinel
.PHONY: helm-package
helm-package: ## Package Helm chart
@mkdir -p dist/
helm package deployments/helm/nvsentinel -d dist/
helm lint deployments/helm/device-api-server
.PHONY: helm-template
helm-template: ## Render Helm chart templates
helm template device-api-server deployments/helm/device-api-server
.PHONY: helm-package
helm-package: ## Package Helm chart
@mkdir -p dist/
helm package deployments/helm/device-api-server -d dist/

Comment on lines +73 to +90
### Key Parameters

| Parameter | Description | Default |
|-----------|-------------|---------|
| `image.repository` | Image repository | `ghcr.io/nvidia/device-api-server` |
| `image.tag` | Image tag | Chart appVersion |
| `server.grpcAddress` | gRPC server address | `:50051` |
| `server.unixSocket` | Unix socket path | `/var/run/device-api/device.sock` |
| `server.healthPort` | Health endpoint port | `8081` |
| `server.metricsPort` | Metrics endpoint port | `9090` |
| `nvmlProvider.enabled` | Enable NVML provider sidecar | `false` |
| `nvmlProvider.driverRoot` | NVIDIA driver library root | `/run/nvidia/driver` |
| `nvmlProvider.healthCheckEnabled` | Enable XID event monitoring | `true` |
| `runtimeClassName` | Pod RuntimeClass | `""` |
| `nodeSelector` | Node selector | `nvidia.com/gpu.present: "true"` |
| `metrics.serviceMonitor.enabled` | Create ServiceMonitor | `false` |
| `metrics.prometheusRule.enabled` | Create PrometheusRule | `false` |

Copilot AI Feb 7, 2026

The chart README lists configuration keys that don’t exist in values.yaml (e.g., server.grpcAddress) and implies a TCP listener by default, while the chart template configures --bind-address as a unix socket. Please update the parameter table/examples to reflect the actual values.yaml schema and the current server flags to avoid broken installs.

Comment on lines 148 to 182
preparedStorage, err := storage.PrepareRun(ctx)
if err != nil {
	logger.Error(err, "Failed to prepare storage backend")
	os.Exit(1)
}

// Run the storage backend in the background.
storageErrCh := make(chan error, 1)
go func() {
	storageErrCh <- preparedStorage.Run(ctx)
}()

// Create, prepare, and run the device API server.
server, err := completedAPIServerConfig.New(storage)
if err != nil {
	logger.Error(err, "Failed to create device API server")
	os.Exit(1)
}

prepared, err := server.PrepareRun(ctx)
if err != nil {
	logger.Error(err, "Failed to prepare device API server")
	os.Exit(1)
}

if err := prepared.Run(ctx); err != nil {
	logger.Error(err, "Device API server error")
	os.Exit(1)
}

// Check if the storage backend exited with an error.
if err := <-storageErrCh; err != nil {
	logger.Error(err, "Storage backend error")
	os.Exit(1)
}
Copilot AI Feb 7, 2026

The storage backend is run in a goroutine and only awaited after the gRPC server exits. If the storage backend fails while the server is running, the server will keep serving with no coordination/shutdown (this regresses from the previous errgroup-based lifecycle). Consider running storage and server under an errgroup (or select on storageErrCh) so a storage failure triggers cancellation and a clean shutdown of the server.
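
A sketch of the suggested errgroup-based lifecycle, reusing the preparedStorage and prepared values from the snippet above (errgroup is golang.org/x/sync/errgroup); this is illustrative, not the PR's code.

```go
// Run storage and server under one errgroup: if either returns an error, the
// shared context is cancelled and the other component shuts down too.
g, ctx := errgroup.WithContext(ctx)

g.Go(func() error {
	return preparedStorage.Run(ctx)
})

g.Go(func() error {
	return prepared.Run(ctx)
})

if err := g.Wait(); err != nil {
	logger.Error(err, "Component exited with error")
	os.Exit(1)
}
```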

Comment on lines 83 to 95
### Command-Line Flags

| Flag | Default | Description |
|------|---------|-------------|
| `--grpc-address` | `127.0.0.1:50051` | TCP address for gRPC server (localhost only by default) |
| `--unix-socket` | `/var/run/device-api/device.sock` | Unix socket path |
| `--health-port` | `8081` | HTTP port for health endpoints |
| `--metrics-port` | `9090` | HTTP port for Prometheus metrics |
| `--shutdown-timeout` | `30` | Graceful shutdown timeout (seconds) |
| `--shutdown-delay` | `5` | Pre-shutdown delay (seconds) |
| `--log-format` | `json` | Log format: `text` or `json` |
| `-v` | `0` | Log verbosity level |

Copilot AI Feb 7, 2026

The CLI flag documentation here appears out of sync with the actual server flags wired up in pkg/controlplane/apiserver/options (e.g., it documents --grpc-address/--unix-socket/--health-port/--metrics-port, but the code defines --bind-address, --health-probe-bind-address, and --metrics-bind-address). Please update this doc to match the real flags so operators can deploy/configure the server correctly.

Comment on lines 110 to 116
args:
- --grpc-address=:50051
- --health-port=8081
- --metrics-port=9090
- --provider-address=localhost:9001
- --unix-socket=/var/run/device-api/device.sock
- --v=0
Copilot AI Feb 7, 2026

This static DaemonSet manifest uses flags like --grpc-address/--health-port/--metrics-port/--unix-socket/--provider-address that do not match the current device-api-server binary flags (which use --bind-address, --health-probe-bind-address, --metrics-bind-address, etc.). As written, the container will fail to start due to unknown flags. Please update the args to the new flag names/semantics or clearly mark this manifest as deprecated/out-of-date.

string resource_version = 3;

// opts contains the options for the update.
UpdateOptions opts = 4;
Copilot AI Feb 7, 2026

UpdateGpuStatusRequest lacks a namespace field while other RPCs are namespace-aware, forcing the server implementation to hardcode "default" for status updates (see TODO in gpu_service.go). This makes it impossible to update status for GPUs created outside default and is an API inconsistency. Consider adding string namespace (and possibly object meta) to UpdateGpuStatusRequest now before the API stabilizes, or explicitly document/enforce that GPUs are cluster-singleton in the API.

Suggested change
UpdateOptions opts = 4;
UpdateOptions opts = 4;
// namespace specifies the namespace scoping for the GPU whose status is being updated.
// An empty namespace is equivalent to the "default" namespace, but "default" is the
// canonical representation.
//
// Optional. Defaults to unset.
string namespace = 5;

Comment on lines 44 to 114
"github.com/nvidia/nvsentinel/pkg/controlplane/apiserver"
"github.com/nvidia/nvsentinel/pkg/controlplane/apiserver/options"
"github.com/nvidia/nvsentinel/pkg/storage/storagebackend"
"github.com/nvidia/nvsentinel/pkg/util/version"

// Import service providers so their init() functions register them.
_ "github.com/nvidia/nvsentinel/pkg/services/device/v1alpha1"
)

const (
// ComponentName is the name of this component for logging.
ComponentName = "device-api-server"
)

func main() {
opts := options.NewOptions()

fss := cliflag.NamedFlagSets{}
opts.AddFlags(&fss)

// Add a version flag to the global flag set.
showVersion := pflag.Bool("version", false, "Show version and exit")

// Merge all named flag sets into the global pflag command line.
for _, fs := range fss.FlagSets {
pflag.CommandLine.AddFlagSet(fs)
}

pflag.Parse()

// Handle version flag before any other initialization.
if *showVersion {
v := version.Get()
enc := json.NewEncoder(os.Stdout)
enc.SetIndent("", " ")
if err := enc.Encode(v); err != nil {
fmt.Fprintf(os.Stderr, "Failed to encode version: %v\n", err)
os.Exit(1)
}
os.Exit(0)
}

// Set up signal handling for graceful shutdown.
ctx, cancel := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
defer cancel()

// Complete fills in defaults and resolves environment overrides.
completedOpts, err := opts.Complete(ctx)
if err != nil {
fmt.Fprintf(os.Stderr, "Failed to complete options: %v\n", err)
os.Exit(1)
}

// Validate rejects invalid flag combinations.
if errs := completedOpts.Validate(); len(errs) > 0 {
for _, e := range errs {
fmt.Fprintf(os.Stderr, "Invalid configuration: %v\n", e)
}
os.Exit(1)
}

// Create root logger with component name.
logger := klog.Background().WithName(ComponentName)
ctx = klog.NewContext(ctx, logger)

versionInfo := version.Get()
logger.Info("Starting server",
"version", versionInfo.GitVersion,
"commit", versionInfo.GitCommit,
"buildDate", versionInfo.BuildDate,
)
Copilot AI Feb 7, 2026

cmd/device-api-server is importing github.com/nvidia/nvsentinel/pkg/util/version, but the build/linker flags in the updated Makefile are wired to github.com/nvidia/nvsentinel/pkg/version. As a result the reported version will stay at util/version defaults and won’t reflect ldflags. Please switch this file to use pkg/version (and update the logged field names accordingly), or revert Makefile/Dockerfile ldflags to target pkg/util/version consistently across the repo.

AdminServer *grpc.Server
AdminCleanup func()
Metrics *metrics.ServerMetrics
MetricsRegistry *prometheus.Registry
Contributor

I forgot to remove this, good catch!

healthpb.RegisterHealthServer(s.AdminServer, s.HealthServer)
// Also register on DeviceServer so sidecar providers connecting via
// unix socket can perform health checks without a separate connection.
healthpb.RegisterHealthServer(s.DeviceServer, s.HealthServer)
Contributor

Good catch again!

}

if s.MetricsAddress != "" {
// TODO: put in wg??
Contributor

Meant to get back to this, thank you!

// See the License for the specific language governing permissions and
// limitations under the License.

// Code generated by service-gen. DO NOT EDIT.
Contributor

Note: all the code under pkg/service/device/<version>/ will eventually be generated.

func (p *gpuServiceProvider) Install(svr *grpc.Server, storageConfig storagebackend.Config) (api.Service, error) {
// Install creates the in-memory storage backend and registers the GPU service
// on the provided gRPC server.
func (p *gpuServiceProvider) Install(svr *grpc.Server, _ storagebackend.Config) (api.Service, error) {
Contributor

This change hardcodes the storage backend to in-memory and completely ignores the passed storagebackend.Config.

This prevents a service from using durable storage (like etcd or an etcdshim) which was previously supported via factory.Create. This change effectively locks us into a single-node, non-durable architecture and removes the ability to configure the storage layer via flags.

Since the goal is to add an in-memory option, can we switch on the config instead? For example:

// Pseudo-code
if storageConfig.Type == "memory" {
    // In-memory path
    s, destroy, err = memory.CreateStorage(...)
} else {
    // Use the standard K8s factory
    s, destroy, err = factory.Create(configForResource, ...)
}


resourcePrefix := path.Join("/", p.groupVersion.Group, "gpus")
codecs := serializer.NewCodecFactory(scheme)
info, _ := runtime.SerializerInfoForMediaType(codecs.SupportedMediaTypes(), runtime.ContentTypeJSON)
Contributor

Minor: I think we need an error check on runtime.SerializerInfoForMediaType to prevent panics if the serializer isn't found.
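
A minimal sketch of that check; SerializerInfoForMediaType reports success via its second return value, so the failure can be surfaced instead of silently using a zero-value SerializerInfo.

```go
info, ok := runtime.SerializerInfoForMediaType(codecs.SupportedMediaTypes(), runtime.ContentTypeJSON)
if !ok {
	return nil, fmt.Errorf("no serializer registered for media type %q", runtime.ContentTypeJSON)
}
// ... continue using info as before ...
```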


key := s.storageKey(req.GetNamespace(), "")
w, err := s.storage.Watch(ctx, key, storage.ListOptions{
ResourceVersion: req.GetOpts().GetResourceVersion(),
Contributor

Critical Issue

We cannot ignore the ResourceVersion (RV) from the request. By omitting req.GetOpts().GetResourceVersion() from storage.ListOptions, we break the fundamental List/Watch consistency contract.

Clients rely on passing a RV to ensure they don't miss events that occur in the split second between a List request finishing and a Watch request starting. Without it, clients can permanently miss events.

TestIntegration_WatchResumption demonstrates this: a client lists, a new item is created, and the client then starts a watch using the List RV, yet it never receives an event for the item created between the List and the Watch.
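
From the client's side, the contract looks like the sketch below: list, remember the returned resourceVersion, then start the watch from exactly that version so nothing created in between is missed. Request and field names are assumptions based on the RPC names in this PR; handle is a hypothetical callback.

```go
// List first and record the resource version the list was served at.
listResp, err := client.ListGpus(ctx, &v1alpha1.ListGpusRequest{})
if err != nil {
	return err
}
rv := listResp.GetResourceVersion()

// Start the watch from that exact version; the server must replay anything
// that happened after rv, closing the List/Watch gap.
stream, err := client.WatchGpus(ctx, &v1alpha1.WatchGpusRequest{
	Opts: &v1alpha1.ListOptions{ResourceVersion: rv},
})
if err != nil {
	return err
}
for {
	ev, err := stream.Recv()
	if err != nil {
		return err
	}
	handle(ev)
}
```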

}

// UpdateGpuStatusRequest specifies the GPU status to update.
message UpdateGpuStatusRequest {
Contributor

This request structure deviates from the established design pattern for Updates in Kubernetes and in this file (see UpdateGpuRequest). Typically, update operations (including status updates) accept the whole object and use the metadata (e.g., RV) within it for addressing concurrency control.

Additionally, by splitting fields out manually we risk accidentally omitting necessary metadata (e.g., RV, namespace, generation, etc.).

Suggestion:

message UpdateGpuRequest {
  // gpu is the GPU resource whose status to update.
  Gpu gpu = 1;

  // opts contains the options for the update.
  UpdateOptions opts = 2;
}


// Package version provides version information for the Device API Server.
// These values are set at build time via ldflags.
package version
Contributor

This introduces a duplicate version package. We already have an implementation at pkg/util/version. We should combine these to prevent confusion about which to use.

I like the pkg/version import path you chose; it is cleaner and more discoverable. I suggest we move the logic from pkg/util/version into this new package and delete the existing directory.

The existing package includes a HTTP handler (for /version) and a UserAgent helper that are actively used. We should keep those.

We can remove RegisterComponent, it's not being used. Forgot to remove it earlier.

nvmllib Library

// gRPC client to communicate with Device API Server
client v1alpha1.GpuServiceClient
Contributor

We should use the generated Kubernetes-style client SDK (pkg/client-go) instead of raw gRPC service client.

Example:

// Get the latest object to ensure we have the current ResourceVersion.
gpu, err := p.client.DeviceV1alpha1().GPUs().Get(ctx, gpuName, metav1.GetOptions{})
if err != nil {
    return err
}

// Modify status
gpu.Status = newStatus

// Update
_, err = p.client.DeviceV1alpha1().GPUs().UpdateStatus(ctx, gpu, metav1.UpdateOptions{})

"github.com/NVIDIA/go-nvml/pkg/nvml"
"k8s.io/klog/v2"

v1alpha1 "github.com/nvidia/nvsentinel/internal/generated/device/v1alpha1"
Contributor

Minor: Providers must never import internal packages.


func NewOptions() *Options {
return &Options{
InMemory: true,
Contributor

As noted in other comments, the current implementation has some gaps regarding ResourceVersion semantics, which will break controllers and informers that rely on those guarantees.

I suggest we make Kine (with SQLite) the default in-memory path. Because Kine is the default backend for K3s (a CNCF certified distro), we know it passes the upstream Kubernetes storage conformance tests. This gives us verified correctness out of the box.

We can still include this custom implementation, but perhaps behind a flag like --experimental-in-memory-storage until we can verify it against the full storage compliance suite.

ArangoGutierrez added a commit that referenced this pull request Feb 13, 2026
Adds forward-looking guidance to the existing TODO(#720) comment,
clarifying that the proto message must be extended with a namespace
field when multi-namespace support is added.

Addresses audit finding M3: UpdateGpuStatus hardcoded to default namespace.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Copilot AI left a comment

Pull request overview

Copilot reviewed 87 out of 91 changed files in this pull request and generated 1 comment.

Comment on lines +267 to +275
// 2. Check for pending page retirements (ECC errors)
if pending, ret := device.GetRetiredPagesPendingStatus(); ret == nvml.SUCCESS {
if pending == nvml.FEATURE_ENABLED {
p.logger.V(1).Info("GPU has pending page retirements", "uuid", uuid)
return false, nil
}
}

// Device is accessible and no pending issues
Copilot AI Feb 13, 2026

The method GetRetiredPagesPendingStatus() is called on the device but is not defined in the Device interface (pkg/providers/nvml/interface.go:36-42). This will cause a compilation error when building with -tags=nvml. Add this method to the Device interface and implement it in both the wrapper and mock implementations.

Suggested change
// 2. Check for pending page retirements (ECC errors)
if pending, ret := device.GetRetiredPagesPendingStatus(); ret == nvml.SUCCESS {
if pending == nvml.FEATURE_ENABLED {
p.logger.V(1).Info("GPU has pending page retirements", "uuid", uuid)
return false, nil
}
}
// Device is accessible and no pending issues
// Device is accessible according to basic checks
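
A sketch of the interface change the comment above asks for, assuming the Device interface wraps go-nvml; the wrapper and mock type names and the mock's backing field are hypothetical.

```go
// pkg/providers/nvml/interface.go
type Device interface {
	// ...existing methods...
	GetRetiredPagesPendingStatus() (nvml.EnableState, nvml.Return)
}

// Wrapper: delegate to the underlying go-nvml device.
func (d nvmlDeviceWrapper) GetRetiredPagesPendingStatus() (nvml.EnableState, nvml.Return) {
	return d.Device.GetRetiredPagesPendingStatus()
}

// Mock: return a configurable value for tests.
func (m *mockDevice) GetRetiredPagesPendingStatus() (nvml.EnableState, nvml.Return) {
	return m.retiredPagesPending, nvml.SUCCESS
}
```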

Copilot AI left a comment

Pull request overview

Copilot reviewed 110 out of 114 changed files in this pull request and generated 13 comments.

Comment on lines +90 to +94
if labels != nil {
version := prometheus.NewGauge(prometheus.GaugeOpts{
Name: "device_apiserver_build_info",
Help: "Build information about the device-apiserver binary.",
ConstLabels: m.buildInfoLabels,
ConstLabels: labels,
Copilot AI Feb 13, 2026

The build-info metric still uses the name/help text "device_apiserver_build_info" / "device-apiserver" even though the binary/component is now "device-api-server". At minimum, consider updating the Help string and the service label value to match the new component name; if the metric name is kept for compatibility, documenting that would help avoid confusion.

Comment on lines +49 to +53
func (p *gpuServiceProvider) Install(svr *grpc.Server, cfg storagebackend.Config) (api.Service, error) {
// Currently only in-memory storage is supported. The cfg parameter is
// accepted for future extensibility but not used for backend selection.
_ = cfg

Copilot AI Feb 13, 2026

Install() now unconditionally uses the in-memory storage backend and explicitly ignores the provided storagebackend.Config. This means configuring the server for a persistent backend won’t actually persist GPU resources, which is a surprising behavior change. Consider either wiring cfg into backend selection (restore the factory.Create path when persistence is desired) or returning an explicit error when a non-in-memory backend is configured but unsupported.

Comment on lines +186 to +190
// Cap conditions to prevent unbounded growth
const maxConditions = 100
if len(gpu.Status.Conditions) > maxConditions {
gpu.Status.Conditions = gpu.Status.Conditions[len(gpu.Status.Conditions)-maxConditions:]
}
Copilot AI Feb 13, 2026

The PR description states MaxConditionsPerGpu = 32, but this provider-side cap uses maxConditions = 100. Please reconcile the intended limit (and ideally share a single constant) so behavior matches the documented design decision.

Comment on lines +190 to +195
select {
case <-w.done:
w.logger.V(4).Info("Watcher already done, dropping error event")
case w.result <- watch.Event{Type: watch.Error, Object: statusErr}:
case <-time.After(5 * time.Second):
w.logger.V(2).Info("Error event send timed out, dropping")
Copilot AI Feb 13, 2026

sendError also uses time.After in a select; repeated errors can accumulate timers similarly. Consider using a reusable timer (or a non-blocking send with explicit drop policy) to avoid timer churn/leaks.

Comment on lines +77 to +81
// Validate target scheme
if !strings.HasPrefix(c.Target, "unix://") && !strings.HasPrefix(c.Target, "unix:") &&
!strings.HasPrefix(c.Target, "dns:") && !strings.HasPrefix(c.Target, "passthrough:") {
return fmt.Errorf("gRPC target %q must use unix://, dns:, or passthrough: scheme", c.Target)
}
Copilot AI Feb 13, 2026

Validate() accepts both "unix://" and "unix:" targets, but the error message only mentions "unix://" (and omits "unix:"). Please update the message to reflect the actual accepted schemes (unix://, unix:, dns:, passthrough:) to avoid confusing users.

Comment on lines +38 to +41
summary: "Device API Server is down on {{ "{{ $labels.instance }}" }}"
description: "Device API Server has been unreachable for more than 5 minutes."
runbook_url: "https://github.com/nvidia/device-api/blob/main/docs/operations/device-api-server.md#alert-deviceapiserverdown"

Copilot AI Feb 13, 2026

The PrometheusRule runbook_url links point to https://github.com/nvidia/device-api/... which doesn’t match this repository (nvidia/nvsentinel). If these docs live in-repo, update the URLs to the correct location to avoid broken runbook links.

Comment on lines +21 to +29
# -- Image configuration
image:
# -- Image repository
repository: ghcr.io/nvidia/device-api-server
# -- Image pull policy
pullPolicy: IfNotPresent
# -- Image tag (defaults to Chart appVersion)
tag: ""

Copilot AI Feb 13, 2026

Image references are inconsistent across deployment artifacts: the Helm chart defaults to ghcr.io/nvidia/device-api-server, while the static manifest uses ghcr.io/nvidia/nvsentinel/device-api-server. This can cause users to deploy the wrong images; consider standardizing on a single registry/repository naming scheme (or documenting why they differ).

Comment on lines +145 to +147
case <-time.After(30 * time.Second):
w.logger.Error(nil, "Event send timed out; consumer not reading, stopping watcher")
return
Copilot AI Feb 13, 2026

Using time.After inside the per-event select allocates a new timer each loop iteration; if the send succeeds quickly, the timer still lives until it fires, which can cause unbounded timer buildup under load. Prefer a reusable time.Timer (stop+drain) or a non-allocating timeout strategy (e.g., context with deadline) for the send timeout.
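
A sketch of the reusable-timer variant; w.result, w.done, and w.logger are from the code above, while sendTimeout and the incoming channel are illustrative.

```go
timer := time.NewTimer(sendTimeout)
defer timer.Stop()

for ev := range incoming {
	// Stop and drain the previous cycle's timer before reusing it; the
	// non-blocking drain covers the case where it has already fired.
	if !timer.Stop() {
		select {
		case <-timer.C:
		default:
		}
	}
	timer.Reset(sendTimeout)

	select {
	case w.result <- ev:
	case <-w.done:
		return
	case <-timer.C:
		w.logger.Error(nil, "Event send timed out; consumer not reading, stopping watcher")
		return
	}
}
```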

Comment on lines +83 to 105
```bash
go get github.com/nvidia/device-api/api@latest
```

```go
import (
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"github.com/nvidia/nvsentinel/pkg/client-go/clientset/versioned"
"github.com/nvidia/nvsentinel/pkg/grpc/client"
v1alpha1 "github.com/nvidia/device-api/api/gen/go/device/v1alpha1"
)
```

### Example: List GPUs

```go
package main

import (
"context"
"log"

v1alpha1 "github.com/nvidia/device-api/api/gen/go/device/v1alpha1"
"google.golang.org/grpc"
"google.golang.org/grpc/credentials/insecure"
)
Copilot AI Feb 13, 2026

README examples reference the module path "github.com/nvidia/device-api/...", but this repo’s Go module is "github.com/nvidia/nvsentinel" (see go.mod). These import paths / go get instructions won’t work as written; please update them to the correct module paths (or clarify if the repo/module is being renamed).

Comment on lines 116 to 171
- name: device-api-server
  image: ghcr.io/nvidia/nvsentinel/device-api-server:v0.0.0 # Replace with specific version for production
  imagePullPolicy: IfNotPresent
  args:
    - --bind-address=unix:///var/run/device-api/device.sock
    - --health-probe-bind-address=:8081
    - --metrics-bind-address=:9090
    - --shutdown-grace-period=25s
    - -v=0
  ports:
    - name: health
      containerPort: 8081
      protocol: TCP
    - name: metrics
      containerPort: 9090
      protocol: TCP
  livenessProbe:
    httpGet:
      path: /healthz
      port: health
    initialDelaySeconds: 5
    periodSeconds: 10
    timeoutSeconds: 5
    failureThreshold: 3
  readinessProbe:
    httpGet:
      path: /readyz
      port: health
    initialDelaySeconds: 5
    periodSeconds: 10
    timeoutSeconds: 5
    failureThreshold: 3
  resources:
    requests:
      cpu: 50m
      memory: 64Mi
    limits:
      cpu: 200m
      memory: 256Mi
  securityContext:
    runAsNonRoot: true
    runAsUser: 65534
    runAsGroup: 65534
    readOnlyRootFilesystem: true
    allowPrivilegeEscalation: false
    capabilities:
      drop:
        - ALL
  volumeMounts:
    - name: device-api-socket
      mountPath: /var/run/device-api

# NVML Provider Sidecar - CGO binary, requires RuntimeClass nvidia
- name: nvml-provider
  image: ghcr.io/nvidia/nvsentinel/nvml-provider:v0.0.0 # Replace with specific version for production
  imagePullPolicy: IfNotPresent
Copilot AI Feb 13, 2026

This manifest references images under ghcr.io/nvidia/nvsentinel/... while the Helm chart defaults to ghcr.io/nvidia/device-api-server / device-api-server-sidecar. Please align these image repositories/tags (or explicitly document the intended difference) to avoid deployment confusion.

…ices

Implement a Kubernetes-style device API server with the following components:

Core server (cmd/device-api-server):
- gRPC-based controlplane API server using apiserver-runtime patterns
- GPU service with full CRUD + Watch + UpdateStatus via storage.Interface
- Server lifecycle management with graceful shutdown
- Health and metrics endpoints with gRPC reflection

Storage (pkg/storage):
- In-memory storage.Interface implementation with watch support
- Configurable watch channel buffer sizes and event drop metrics
- Factory pattern for storage backend selection

NVML provider (cmd/nvml-provider):
- Sidecar that enumerates GPUs via NVML and registers them with the API
- XID error event monitoring with health condition updates
- Reconciliation loop with configurable intervals
- Environment-based driver root configuration (NVIDIA_DRIVER_ROOT)

Client SDK (pkg/client-go):
- Typed gRPC client with Get/List/Watch/Create/Update/UpdateStatus/Delete
- Fake client for testing with k8s.io/client-go/testing integration
- Informers and listers following Kubernetes client-go conventions
- Clientset pattern for versioned API access

Code generator (code-generator):
- Fork of k8s.io/code-generator/cmd/client-gen for gRPC backends
- Generates typed clients, fake clients, and expansion interfaces
- Full UpdateStatus template with proper gRPC implementation
- Integrated into hack/update-codegen.sh pipeline

Proto API (api/proto/device/v1alpha1):
- GPU resource with spec (UUID) and status (conditions, recommendedAction)
- Standard CRUD + Watch + UpdateStatus RPCs
- K8s-style request/response patterns with options

Security hardening:
- gRPC message size and stream limits
- Server error detail scrubbing for client responses
- Unix socket path validation and restrictive permissions
- Localhost-only enforcement for insecure credentials

Deployment (deployments/helm):
- Helm chart with configurable storage, gRPC, health, and metrics
- Static manifests with versioned image references
- Dockerfile with pinned base images

Testing:
- Unit tests for storage, services, providers, and utilities
- Integration tests for client-go with full gRPC stack
- Shared testutil with bufconn gRPC test helpers
- Fake client examples for consumer testing patterns

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>