feat: add device-api-server with NVML fallback provider #720

ArangoGutierrez wants to merge 1 commit into device-api from …
Conversation
Pull request overview
This pull request implements a comprehensive Device API Server with NVML fallback provider support for GPU device information and health monitoring. The server acts as a node-local gRPC cache intermediary between providers (health monitors) and consumers (device plugins, DRA drivers).
Changes:
- Implements complete Device API Server with gRPC services (GpuService for consumers, ProviderService for providers)
- Adds built-in NVML fallback provider for direct GPU enumeration and XID-based health monitoring
- Implements thread-safe in-memory cache with RWMutex for read-blocking semantics during writes
- Adds watch broadcaster for real-time GPU state change notifications
- Includes comprehensive Prometheus metrics for observability
- Provides Helm charts for Kubernetes deployment with ServiceMonitor and alerting
Reviewed changes
Copilot reviewed 51 out of 54 changed files in this pull request and generated 1 comment.
Summary per file:
| File | Description |
|---|---|
| pkg/version/version.go | Version information package with build-time ldflags support |
| pkg/deviceapiserver/config.go | Server configuration with flags, environment variables, and validation |
| pkg/deviceapiserver/server.go | Main server implementation with lifecycle management and graceful shutdown |
| pkg/deviceapiserver/cache/*.go | Thread-safe GPU cache with broadcaster for watch events |
| pkg/deviceapiserver/service/*.go | gRPC service implementations (consumer and provider APIs) |
| pkg/deviceapiserver/nvml/*.go | NVML provider for GPU enumeration and XID health monitoring |
| pkg/deviceapiserver/metrics/metrics.go | Prometheus metrics definitions and recording |
| cmd/device-api-server/main.go | Main entry point with flag parsing and signal handling |
| docs/*.md | Comprehensive API, operations, and design documentation |
| charts/device-api-server/* | Helm chart templates for Kubernetes deployment |
| go.mod/go.sum | Dependencies including NVML, gRPC, and Prometheus libraries |
Force-pushed from 10ae27f to bd1b3ad.
| import "device/v1alpha1/gpu.proto"; | ||
|
|
||
| // ========================================== | ||
| // Provider Service Definition |
There was a problem hiding this comment.
I’m concerned about adding a new service with custom methods. Should we extend the existing one with standard methods instead?
Usually, "registering" is just a Create call, "unregistering" is a Delete call, and condition updates are either an UpdateStatus or Patch call. I think we should stick to these well understood conventions.
Using custom methods also risks compatibility with standard tooling (e.g., code generators and controller-runtime), making things harder for us both now and later.
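For illustration, a minimal Go sketch of the kind of unified, CRUD-style surface this comment argues for. The type and method names below are assumptions for the sketch, not the API generated in this PR:

```go
// Hypothetical sketch of a unified, Kubernetes-conventional GPU service.
// Gpu, GpuStatus, and GpuList are illustrative stand-ins for the real types.
package sketch

import "context"

type GpuStatus struct {
	Healthy bool
}

type Gpu struct {
	Name   string
	UUID   string
	Status GpuStatus
}

type GpuList struct {
	Items []Gpu
}

// GpuService replaces provider-specific verbs with standard CRUD:
// RegisterGpu -> CreateGpu, UnregisterGpu -> DeleteGpu,
// UpdateGpuCondition -> UpdateGpuStatus.
type GpuService interface {
	GetGpu(ctx context.Context, name string) (*Gpu, error)
	ListGpus(ctx context.Context) (*GpuList, error)
	CreateGpu(ctx context.Context, gpu *Gpu) (*Gpu, error)
	UpdateGpu(ctx context.Context, gpu *Gpu) (*Gpu, error)
	UpdateGpuStatus(ctx context.Context, gpu *Gpu) (*Gpu, error)
	DeleteGpu(ctx context.Context, name string) error
}
```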
pkg/deviceapiserver/server.go (outdated)
The tight coupling between the API server and NVML feels off.
While we definitely need a default health check for GPUs, embedding it directly into the API server makes it hardware-aware rather than a ‘dumb’ transport pipe.
What do you think about moving the default health checks out of the API server entirely? The API server should be agnostic to how the state is produced. The default health checks should interact with the API server just like any other would: via the standard client interface.
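To make the suggestion concrete, here is a minimal Go sketch of a health provider that only talks to the API server through a client interface, never through the server's internal cache. `GpuClient` and `Gpu` are illustrative stand-ins, not the generated client from this PR:

```go
// A provider that pushes GPU state via the public API, exactly as any
// external provider would.
package sketch

import (
	"context"
	"time"
)

type Gpu struct {
	Name    string
	Healthy bool
}

// GpuClient is a stand-in for the generated gRPC client interface.
type GpuClient interface {
	CreateGpu(ctx context.Context, gpu *Gpu) (*Gpu, error)
	UpdateGpuStatus(ctx context.Context, gpu *Gpu) (*Gpu, error)
}

// RunHealthProvider registers the enumerated GPUs, then periodically pushes
// status updates through the same client interface.
func RunHealthProvider(ctx context.Context, c GpuClient, enumerate func() []Gpu) error {
	for _, g := range enumerate() {
		gpu := g
		if _, err := c.CreateGpu(ctx, &gpu); err != nil {
			return err
		}
	}

	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			for _, g := range enumerate() {
				gpu := g
				if _, err := c.UpdateGpuStatus(ctx, &gpu); err != nil {
					return err
				}
			}
		}
	}
}
```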
Addresses review feedback from pteranodan on PR #720 regarding API design and NVML provider coupling.

BREAKING CHANGES:
- Removed ProviderService in favor of unified GpuService
- RegisterGpu() → CreateGpu()
- UnregisterGpu() → DeleteGpu()
- UpdateGpuCondition() → removed (use UpdateGpuStatus with full status)

API Changes:
- Consolidated GpuService with both read and write operations
- Added standard K8s CRUD methods: CreateGpu, UpdateGpu, UpdateGpuStatus, DeleteGpu
- Removed custom provider methods (RegisterGpu, UnregisterGpu, UpdateGpuCondition)
- Deleted api/proto/device/v1alpha1/provider.proto
- Updated gpu.proto with unified service definition

Service Layer:
- Created pkg/deviceapiserver/service/gpu_service.go (unified service)
- Deleted consumer.go and provider.go (replaced by gpu_service.go)
- Updated integration tests to use new API
- Improved cache interface with Create, Update, UpdateStatusWithVersion, Delete

NVML Provider Decoupling:
- NVML provider now uses GpuServiceClient instead of direct cache access
- Server creates loopback gRPC connection for embedded NVML provider
- Added standalone cmd/nvml-provider/ binary for sidecar deployment
- Provider now "dogfoods" its own API via gRPC

Documentation:
- Updated README.md with new unified API architecture
- Rewrote docs/api/device-api-server.md with single service model
- Updated docs/operations/device-api-server.md
- Simplified docs/design/device-api-server.md
- Removed obsolete task lists and design documents

Deployment:
- Reorganized charts: charts/device-api-server/ → deployments/helm/nvsentinel/
- Added deployments/container/Dockerfile
- Added deployments/static/ for static manifests
- Added demos/nvml-sidecar-demo.sh

Other:
- Updated .gitignore for planning documents
- Enhanced Makefile with new targets
- Added metrics for unified service

This refactoring improves:
1. API consistency with Kubernetes conventions
2. Tooling compatibility (code generators, controller-runtime)
3. Decoupling of NVML provider from server internals
4. Flexibility to run NVML as embedded or sidecar
5. Overall code maintainability and clarity

Signed-off-by: Eduardo Arango <eduardoa@nvidia.com>
api/proto/device/v1alpha1/gpu.proto (outdated)

```proto
//
// This increments on every modification to the GPU and can be used by
// consumers to detect changes or implement optimistic concurrency.
int64 resource_version = 4;
```
What's the difference between this and the resource version inside of the metadata field?
cmd/nvml-provider/main.go (outdated)

```go
if id := os.Getenv("PROVIDER_ID"); id != "" {
	cfg.ProviderID = id
}
if root := os.Getenv("NVML_DRIVER_ROOT"); root != "" {
```
Is this the standard or conventional environment variable used to hold the driver root location?
cmd/nvml-provider/main.go (outdated)

```go
	},
}

_, err := p.gpuClient.CreateGpu(ctx, req)
```
FYI: I’ve updated client-gen to support Create methods (#718). Once that's merged, you can use versioned clients here instead of proto stubs:
```go
gpu, err := clientset.DeviceV1alpha1().GPUs().Create(ctx, &devicev1alpha1.GPU{}, metav1.CreateOptions{})
```
cmd/nvml-provider/main.go (outdated)

```go
	},
}

if _, err := p.gpuClient.UpdateGpuStatus(ctx, req); err != nil {
```
FYI: I’ve updated client-gen to support Update methods (#718). Once that's merged, you can use versioned clients here instead of proto stubs:
```go
gpu, err := clientset.DeviceV1alpha1().GPUs().Update(ctx, &devicev1alpha1.GPU{}, metav1.UpdateOptions{})
```
Comprehensive design document capturing:
- In-memory cache architecture (replacing Kine/SQLite)
- Memory safety bounds sized for Vera Rubin Ultra NVL576 (1024 GPUs)
- Readiness gate blocking consumers until providers register
- Watch broadcaster with slow subscriber eviction
- What to take from each PR
- Implementation phases
- Authorship preservation requirements

This document serves as the spec for cherry-picking #720 onto merged #718.

Co-authored-by: Carlos Eduardo Arango Gutierrez <carangog@redhat.com>
Co-authored-by: Dan Stone <danstone@nvidia.com>
go.mod (outdated)

```
@@ -0,0 +1,32 @@
module github.com/nvidia/nvsentinel

go 1.25
```
The Go version is specified as "1.25" here, but in api/go.mod (before the change) it was "1.25.5". While Go considers these equivalent, it's better to be consistent across all go.mod files in the repository. Consider using the same patch version everywhere for consistency.
docs/operations/device-api-server.md (outdated)

```markdown
- The TCP listener binds to `127.0.0.1:50051` (localhost) by default. This prevents network exposure.
- Prefer using a Unix domain socket for all local clients and providers; avoid exposing the TCP gRPC listener to the broader cluster network.
- In multi-tenant or partially untrusted clusters, strongly recommend **disabling the TCP listener** (set `--grpc-address=""`) or using a restricted Unix socket combined with Kubernetes `NetworkPolicy` to limit access to the Device API Server pod.
- If TCP gRPC must be exposed beyond the node (e.g., `--grpc-address=:50051`), terminate it behind TLS/mTLS or an equivalent authenticated tunnel (for example, a sidecar proxy or ingress with client auth) and enforce a least-privilege `NetworkPolicy`.
```
The "Network Security" section documents the gRPC API as unauthenticated and recommends keeping it node-local, but the Helm chart currently defaults server.grpcAddress to :50051 (see deployments/helm/nvsentinel/values.yaml), which binds the unauthenticated gRPC listener on all pod interfaces and makes it reachable by any pod in the cluster over plaintext. In a multi-tenant or partially untrusted cluster, an attacker can connect to this port and invoke write RPCs (CreateGpu, UpdateGpuStatus, DeleteGpu) to tamper with GPU health/scheduling state, despite the documentation implying a safer localhost-only default. To close this gap, align the chart defaults with the documented security posture (e.g., default to 127.0.0.1:50051 or Unix socket only) and require explicit opt-in plus TLS/mTLS and restrictive NetworkPolicy when exposing the TCP listener beyond localhost.
Suggested change:

```markdown
- If TCP gRPC must be exposed beyond the node, **do not** publish a plaintext, all-interfaces listener (for example, `--grpc-address=:50051`) directly to the cluster network. Instead, keep the server bound to localhost (for example, `127.0.0.1:50051`) or a Unix domain socket and expose it only behind TLS/mTLS or an equivalent authenticated tunnel (such as a sidecar proxy or ingress with client auth), combined with a least-privilege `NetworkPolicy`.
```
Aligns Helm chart default with config.go secure default. The gRPC API is unauthenticated, so binding to all interfaces by default exposes the server to unauthorized cluster access. Addresses Copilot security review feedback on PR #720.
The resource_version was duplicated at both Gpu.resource_version (field 4, int64) and Gpu.metadata.resource_version (in ObjectMeta, string). This is redundant and violates K8s API conventions where resource_version should only be in metadata.

Changes:
- Remove Gpu.resource_version field from proto (reserved field 4 for backward compatibility)
- Update cache to use metadata.resource_version (string format)
- Update service to read from metadata.resource_version
- Remove ResourceVersion from goverter ignore list (no longer needed)
- Regenerate protobuf code

The UpdateGpuStatusRequest.resource_version field is intentionally kept as it serves a different purpose: passing the expected version for optimistic concurrency checks in requests.

Addresses pteranodan's review feedback on PR #720.
The API server should be a 'dumb' transport pipe, agnostic to how GPU state is produced. NVML integration should be external only, interacting via the standard gRPC client interface.

Changes:
- Remove nvmlProvider field and related methods from server.go
- Remove NVML config fields (NVMLEnabled, NVMLDriverRoot, etc.) from config.go
- Remove NVML flags from main.go (--enable-nvml-provider, --nvml-driver-root, etc.)
- Remove embedded nvml section from Helm values.yaml
- Remove nvml args and env vars from daemonset.yaml template
- Update NOTES.txt to reference nvmlProvider sidecar instead

The nvmlProvider sidecar (values.yaml) remains as the recommended approach for NVML-based GPU enumeration and health monitoring.

Addresses pteranodan's architecture review feedback on PR #720.
```go
// XidDescriptions provides human-readable descriptions for common XIDs.
var XidDescriptions = map[uint64]string{
	// Application errors (typically ignored)
	13:  "Graphics Engine Exception",
	31:  "GPU memory page fault",
	43:  "GPU stopped processing",
	45:  "Preemptive cleanup",
	68:  "Video processor exception",
	109: "Context Switch Timeout",

	// Memory errors
	48: "Double Bit ECC Error",
	63: "Row remapping failure",
	64: "Uncontained ECC error",
	74: "NVLink error",
	79: "GPU has fallen off the bus",
	94: "Contained ECC error",
	95: "Uncontained ECC error",

	// Other notable XIDs
	8:   "GPU not accessible",
	32:  "Invalid or corrupted push buffer stream",
	38:  "Driver firmware error",
	56:  "Display engine error",
	57:  "Error programming video memory interface",
	62:  "Internal micro-controller halt (non-fatal)",
	69:  "Graphics engine accessor error",
	119: "GSP error",
	120: "GSP firmware error",
}
```
XidDescriptions is exported as a mutable map, which allows other packages to modify it at runtime (and contradicts the stated goal of keeping XID maps immutable). Since it’s only used internally, consider making it unexported (xidDescriptions) and/or exposing a function that returns a copy to prevent accidental mutation.
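One possible shape for that change, sketched below: an unexported map plus accessor functions that return copies. This is illustrative only; the PR may resolve it differently.

```go
package nvml

// xidDescriptions stays unexported so other packages cannot mutate it.
// Entries are elided here; the full table lives in the original snippet above.
var xidDescriptions = map[uint64]string{
	13: "Graphics Engine Exception",
	48: "Double Bit ECC Error",
	79: "GPU has fallen off the bus",
	// ... remaining entries elided ...
}

// XidDescription returns the human-readable description for a single XID.
func XidDescription(xid uint64) (string, bool) {
	desc, ok := xidDescriptions[xid]
	return desc, ok
}

// XidDescriptions returns a copy of the table; mutating the copy cannot
// affect package state.
func XidDescriptions() map[uint64]string {
	out := make(map[uint64]string, len(xidDescriptions))
	for k, v := range xidDescriptions {
		out[k] = v
	}
	return out
}
```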
deployments/container/Dockerfile (outdated)

```dockerfile
# gRPC API port
EXPOSE 8080
# Metrics port
EXPOSE 8081
```
The device-api-server container image exposes ports 8080/8081, but the server’s admin health gRPC and metrics HTTP endpoints are configured via flags (defaulting to :50051 for health gRPC and :9090 for metrics in pkg/controlplane/apiserver/options). Also, gRPC device service is on a unix socket, not a TCP port. Please align EXPOSE/comments with the actual runtime ports (or drop EXPOSE if they’re not meaningful for UDS-based deployment).
Makefile (outdated)

```makefile
	helm lint deployments/helm/nvsentinel

.PHONY: helm-template
helm-template: ## Render Helm chart templates
	helm template nvsentinel deployments/helm/nvsentinel

.PHONY: helm-package
helm-package: ## Package Helm chart
	@mkdir -p dist/
	helm package deployments/helm/nvsentinel -d dist/
```
Helm targets reference deployments/helm/nvsentinel, but this PR adds the chart under deployments/helm/device-api-server. As written, make helm-lint, make helm-template, and make helm-package will fail because the referenced path doesn’t exist. Please update these targets to point at the correct chart directory (and ensure the chart name matches).
Suggested change:

```makefile
	helm lint deployments/helm/device-api-server

.PHONY: helm-template
helm-template: ## Render Helm chart templates
	helm template device-api-server deployments/helm/device-api-server

.PHONY: helm-package
helm-package: ## Package Helm chart
	@mkdir -p dist/
	helm package deployments/helm/device-api-server -d dist/
```
```markdown
### Key Parameters

| Parameter | Description | Default |
|-----------|-------------|---------|
| `image.repository` | Image repository | `ghcr.io/nvidia/device-api-server` |
| `image.tag` | Image tag | Chart appVersion |
| `server.grpcAddress` | gRPC server address | `:50051` |
| `server.unixSocket` | Unix socket path | `/var/run/device-api/device.sock` |
| `server.healthPort` | Health endpoint port | `8081` |
| `server.metricsPort` | Metrics endpoint port | `9090` |
| `nvmlProvider.enabled` | Enable NVML provider sidecar | `false` |
| `nvmlProvider.driverRoot` | NVIDIA driver library root | `/run/nvidia/driver` |
| `nvmlProvider.healthCheckEnabled` | Enable XID event monitoring | `true` |
| `runtimeClassName` | Pod RuntimeClass | `""` |
| `nodeSelector` | Node selector | `nvidia.com/gpu.present: "true"` |
| `metrics.serviceMonitor.enabled` | Create ServiceMonitor | `false` |
| `metrics.prometheusRule.enabled` | Create PrometheusRule | `false` |
```
The chart README lists configuration keys that don’t exist in values.yaml (e.g., server.grpcAddress) and implies a TCP listener by default, while the chart template configures --bind-address as a unix socket. Please update the parameter table/examples to reflect the actual values.yaml schema and the current server flags to avoid broken installs.
```go
preparedStorage, err := storage.PrepareRun(ctx)
if err != nil {
	logger.Error(err, "Failed to prepare storage backend")
	os.Exit(1)
}

// Run the storage backend in the background.
storageErrCh := make(chan error, 1)
go func() {
	storageErrCh <- preparedStorage.Run(ctx)
}()

// Create, prepare, and run the device API server.
server, err := completedAPIServerConfig.New(storage)
if err != nil {
	logger.Error(err, "Failed to create device API server")
	os.Exit(1)
}

prepared, err := server.PrepareRun(ctx)
if err != nil {
	logger.Error(err, "Failed to prepare device API server")
	os.Exit(1)
}

if err := prepared.Run(ctx); err != nil {
	logger.Error(err, "Device API server error")
	os.Exit(1)
}

// Check if the storage backend exited with an error.
if err := <-storageErrCh; err != nil {
	logger.Error(err, "Storage backend error")
	os.Exit(1)
}
```
The storage backend is run in a goroutine and only awaited after the gRPC server exits. If the storage backend fails while the server is running, the server will keep serving with no coordination/shutdown (this regresses from the previous errgroup-based lifecycle). Consider running storage and server under an errgroup (or select on storageErrCh) so a storage failure triggers cancellation and a clean shutdown of the server.
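A minimal sketch of the errgroup-based lifecycle this comment suggests; the `runner` interface is a stand-in for the prepared storage backend and prepared server, both of which expose a blocking `Run(ctx)` in the snippet above:

```go
package lifecycle

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// runner abstracts any component with a blocking Run method.
type runner interface {
	Run(ctx context.Context) error
}

// RunAll runs every component under one errgroup: if any Run returns an
// error, the shared context is cancelled so the others shut down cleanly,
// and the first error is returned for logging and exit handling.
func RunAll(ctx context.Context, components ...runner) error {
	g, gctx := errgroup.WithContext(ctx)
	for _, c := range components {
		c := c
		g.Go(func() error { return c.Run(gctx) })
	}
	return g.Wait()
}
```

Called as `RunAll(ctx, preparedStorage, preparedServer)`, a storage failure now cancels the server instead of leaving it serving indefinitely.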
```markdown
### Command-Line Flags

| Flag | Default | Description |
|------|---------|-------------|
| `--grpc-address` | `127.0.0.1:50051` | TCP address for gRPC server (localhost only by default) |
| `--unix-socket` | `/var/run/device-api/device.sock` | Unix socket path |
| `--health-port` | `8081` | HTTP port for health endpoints |
| `--metrics-port` | `9090` | HTTP port for Prometheus metrics |
| `--shutdown-timeout` | `30` | Graceful shutdown timeout (seconds) |
| `--shutdown-delay` | `5` | Pre-shutdown delay (seconds) |
| `--log-format` | `json` | Log format: `text` or `json` |
| `-v` | `0` | Log verbosity level |
```
The CLI flag documentation here appears out of sync with the actual server flags wired up in pkg/controlplane/apiserver/options (e.g., it documents --grpc-address/--unix-socket/--health-port/--metrics-port, but the code defines --bind-address, --health-probe-bind-address, and --metrics-bind-address). Please update this doc to match the real flags so operators can deploy/configure the server correctly.
```yaml
args:
  - --grpc-address=:50051
  - --health-port=8081
  - --metrics-port=9090
  - --provider-address=localhost:9001
  - --unix-socket=/var/run/device-api/device.sock
  - --v=0
```
This static DaemonSet manifest uses flags like --grpc-address/--health-port/--metrics-port/--unix-socket/--provider-address that do not match the current device-api-server binary flags (which use --bind-address, --health-probe-bind-address, --metrics-bind-address, etc.). As written, the container will fail to start due to unknown flags. Please update the args to the new flag names/semantics or clearly mark this manifest as deprecated/out-of-date.
api/proto/device/v1alpha1/gpu.proto (outdated)

```proto
string resource_version = 3;

// opts contains the options for the update.
UpdateOptions opts = 4;
```
UpdateGpuStatusRequest lacks a namespace field while other RPCs are namespace-aware, forcing the server implementation to hardcode "default" for status updates (see TODO in gpu_service.go). This makes it impossible to update status for GPUs created outside default and is an API inconsistency. Consider adding string namespace (and possibly object meta) to UpdateGpuStatusRequest now before the API stabilizes, or explicitly document/enforce that GPUs are cluster-singleton in the API.
Suggested change:

```proto
UpdateOptions opts = 4;

// namespace specifies the namespace scoping for the GPU whose status is being updated.
// An empty namespace is equivalent to the "default" namespace, but "default" is the
// canonical representation.
//
// Optional. Defaults to unset.
string namespace = 5;
```
| "github.com/nvidia/nvsentinel/pkg/controlplane/apiserver" | ||
| "github.com/nvidia/nvsentinel/pkg/controlplane/apiserver/options" | ||
| "github.com/nvidia/nvsentinel/pkg/storage/storagebackend" | ||
| "github.com/nvidia/nvsentinel/pkg/util/version" | ||
|
|
||
| // Import service providers so their init() functions register them. | ||
| _ "github.com/nvidia/nvsentinel/pkg/services/device/v1alpha1" | ||
| ) | ||
|
|
||
| const ( | ||
| // ComponentName is the name of this component for logging. | ||
| ComponentName = "device-api-server" | ||
| ) | ||
|
|
||
| func main() { | ||
| opts := options.NewOptions() | ||
|
|
||
| fss := cliflag.NamedFlagSets{} | ||
| opts.AddFlags(&fss) | ||
|
|
||
| // Add a version flag to the global flag set. | ||
| showVersion := pflag.Bool("version", false, "Show version and exit") | ||
|
|
||
| // Merge all named flag sets into the global pflag command line. | ||
| for _, fs := range fss.FlagSets { | ||
| pflag.CommandLine.AddFlagSet(fs) | ||
| } | ||
|
|
||
| pflag.Parse() | ||
|
|
||
| // Handle version flag before any other initialization. | ||
| if *showVersion { | ||
| v := version.Get() | ||
| enc := json.NewEncoder(os.Stdout) | ||
| enc.SetIndent("", " ") | ||
| if err := enc.Encode(v); err != nil { | ||
| fmt.Fprintf(os.Stderr, "Failed to encode version: %v\n", err) | ||
| os.Exit(1) | ||
| } | ||
| os.Exit(0) | ||
| } | ||
|
|
||
| // Set up signal handling for graceful shutdown. | ||
| ctx, cancel := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM) | ||
| defer cancel() | ||
|
|
||
| // Complete fills in defaults and resolves environment overrides. | ||
| completedOpts, err := opts.Complete(ctx) | ||
| if err != nil { | ||
| fmt.Fprintf(os.Stderr, "Failed to complete options: %v\n", err) | ||
| os.Exit(1) | ||
| } | ||
|
|
||
| // Validate rejects invalid flag combinations. | ||
| if errs := completedOpts.Validate(); len(errs) > 0 { | ||
| for _, e := range errs { | ||
| fmt.Fprintf(os.Stderr, "Invalid configuration: %v\n", e) | ||
| } | ||
| os.Exit(1) | ||
| } | ||
|
|
||
| // Create root logger with component name. | ||
| logger := klog.Background().WithName(ComponentName) | ||
| ctx = klog.NewContext(ctx, logger) | ||
|
|
||
| versionInfo := version.Get() | ||
| logger.Info("Starting server", | ||
| "version", versionInfo.GitVersion, | ||
| "commit", versionInfo.GitCommit, | ||
| "buildDate", versionInfo.BuildDate, | ||
| ) |
There was a problem hiding this comment.
cmd/device-api-server is importing github.com/nvidia/nvsentinel/pkg/util/version, but the build/linker flags in the updated Makefile are wired to github.com/nvidia/nvsentinel/pkg/version. As a result the reported version will stay at util/version defaults and won’t reflect ldflags. Please switch this file to use pkg/version (and update the logged field names accordingly), or revert Makefile/Dockerfile ldflags to target pkg/util/version consistently across the repo.
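For reference, a sketch of how build-time version stamping works: the `-X` package path in ldflags must match the package the binary actually imports. Variable and field names here are illustrative assumptions, not the repo's exact pkg/version contents:

```go
// Package version is a sketch of ldflags-stamped build metadata.
package version

// These defaults are overridden at build time, e.g. (assumed Makefile wiring):
//   go build -ldflags "\
//     -X github.com/nvidia/nvsentinel/pkg/version.gitVersion=$(VERSION) \
//     -X github.com/nvidia/nvsentinel/pkg/version.gitCommit=$(COMMIT) \
//     -X github.com/nvidia/nvsentinel/pkg/version.buildDate=$(DATE)" ./cmd/device-api-server
// If the binary imports pkg/util/version instead, these -X paths never match
// and the defaults below are what gets reported.
var (
	gitVersion = "v0.0.0-unset"
	gitCommit  = "unknown"
	buildDate  = "unknown"
)

// Info carries the stamped values.
type Info struct {
	GitVersion string
	GitCommit  string
	BuildDate  string
}

// Get returns the build-time version information.
func Get() Info {
	return Info{GitVersion: gitVersion, GitCommit: gitCommit, BuildDate: buildDate}
}
```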
```go
AdminServer     *grpc.Server
AdminCleanup    func()
Metrics         *metrics.ServerMetrics
MetricsRegistry *prometheus.Registry
```
I forgot to remove this, good catch!
```go
	healthpb.RegisterHealthServer(s.AdminServer, s.HealthServer)
	// Also register on DeviceServer so sidecar providers connecting via
	// unix socket can perform health checks without a separate connection.
	healthpb.RegisterHealthServer(s.DeviceServer, s.HealthServer)
}

if s.MetricsAddress != "" {
	// TODO: put in wg??
```
Meant to get back to this, thank you!
```go
// See the License for the specific language governing permissions and
// limitations under the License.

// Code generated by service-gen. DO NOT EDIT.
```
Note: all the code under pkg/service/device/<version>/ will eventually be generated.
```diff
-func (p *gpuServiceProvider) Install(svr *grpc.Server, storageConfig storagebackend.Config) (api.Service, error) {
+// Install creates the in-memory storage backend and registers the GPU service
+// on the provided gRPC server.
+func (p *gpuServiceProvider) Install(svr *grpc.Server, _ storagebackend.Config) (api.Service, error) {
```
This change hardcodes the storage backend to in-memory and completely ignores the passed storagebackend.Config.
This prevents a service from using durable storage (like etcd or an etcdshim) which was previously supported via factory.Create. This change effectively locks us into a single-node, non-durable architecture and removes the ability to configure the storage layer via flags.
Since the goal is to add an in-memory option, can we switch on the config instead? For example:
```go
// Pseudo-code
if storageConfig.Type == "memory" {
	// In-memory path
	s, destroy, err = memory.CreateStorage(...)
} else {
	// Use the standard K8s factory
	s, destroy, err = factory.Create(configForResource, ...)
}
```
```go
	resourcePrefix := path.Join("/", p.groupVersion.Group, "gpus")
	codecs := serializer.NewCodecFactory(scheme)
	info, _ := runtime.SerializerInfoForMediaType(codecs.SupportedMediaTypes(), runtime.ContentTypeJSON)
```
Minor: I think we need an error check on runtime.SerializerInfoForMediaType to prevent panics if the serializer isn't found.
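A minimal sketch of that guard; note that `SerializerInfoForMediaType` returns a boolean rather than an error, and ignoring it risks using a zero-value `SerializerInfo` later. The helper name is illustrative:

```go
package sketch

import (
	"fmt"

	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/serializer"
)

// jsonSerializer returns the JSON serializer registered on the scheme,
// failing loudly instead of silently continuing with a zero value.
func jsonSerializer(scheme *runtime.Scheme) (runtime.SerializerInfo, error) {
	codecs := serializer.NewCodecFactory(scheme)
	info, ok := runtime.SerializerInfoForMediaType(codecs.SupportedMediaTypes(), runtime.ContentTypeJSON)
	if !ok {
		return runtime.SerializerInfo{}, fmt.Errorf("no serializer registered for media type %q", runtime.ContentTypeJSON)
	}
	return info, nil
}
```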
```go
	key := s.storageKey(req.GetNamespace(), "")
	w, err := s.storage.Watch(ctx, key, storage.ListOptions{
		ResourceVersion: req.GetOpts().GetResourceVersion(),
```
Critical Issue
We cannot ignore the ResourceVersion (RV) from the request. By omitting req.GetOpts().GetResourceVersion() from storage.ListOptions, we break the fundamental List/Watch consistency contract.
Clients rely on passing a RV to ensure they don't miss events that occur in the split second between a List request finishing and a Watch request starting. Without it, clients can permanently miss events.
TestIntegration_WatchResumption demonstrates this: a client Lists, a new item is created, and then the client Watches using the List RV; the client never receives an event for the item created between the List and the Watch.
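For clarity, a small Go sketch of the List/Watch consistency contract described here; the client interface is an illustrative stand-in, not the generated API:

```go
package sketch

import "context"

type Event struct {
	Type   string
	Object any
}

// GpuLister is a stand-in for a client that supports List and Watch.
type GpuLister interface {
	// List returns the current items plus the resource version of the snapshot.
	List(ctx context.Context) (items []any, resourceVersion string, err error)
	// Watch streams events that happened strictly after resourceVersion.
	Watch(ctx context.Context, resourceVersion string) (<-chan Event, error)
}

// listAndWatch never drops the window between List and Watch, because Watch
// resumes from the List's resource version rather than from "now".
func listAndWatch(ctx context.Context, c GpuLister, handle func(Event)) error {
	items, rv, err := c.List(ctx)
	if err != nil {
		return err
	}
	for _, it := range items {
		handle(Event{Type: "ADDED", Object: it})
	}

	events, err := c.Watch(ctx, rv)
	if err != nil {
		return err
	}
	for ev := range events {
		handle(ev)
	}
	return nil
}
```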
```proto
}

// UpdateGpuStatusRequest specifies the GPU status to update.
message UpdateGpuStatusRequest {
```
This request structure deviates from the established design pattern for Updates in Kubernetes and in this file (see UpdateGpuRequest). Typically, update operations (including status updates) accept the whole object and use the metadata (e.g., RV) within it for addressing concurrency control.
Additionally, by splitting fields out manually we risk accidentally omitting necessary metadata (e.g., RV, namespace, generation, etc.).
Suggestion:
```proto
message UpdateGpuRequest {
  // gpu is the GPU resource whose status to update.
  Gpu gpu = 1;

  // opts contains the options for the update.
  UpdateOptions opts = 2;
}
```
```go
// Package version provides version information for the Device API Server.
// These values are set at build time via ldflags.
package version
```
This introduces a duplicate version package. We already have an implementation at pkg/util/version. We should combine these to prevent confusion about which to use.
I like the pkg/version import path you chose; it is cleaner and more discoverable. I suggest we move the logic from pkg/util/version into this new package and delete the existing directory.
The existing package includes an HTTP handler (for /version) and a UserAgent helper that are actively used. We should keep those.
We can remove RegisterComponent, it's not being used. Forgot to remove it earlier.
pkg/providers/nvml/provider.go (outdated)

```go
	nvmllib Library

	// gRPC client to communicate with Device API Server
	client v1alpha1.GpuServiceClient
```
We should use the generated Kubernetes-style client SDK (pkg/client-go) instead of raw gRPC service client.
Example:
```go
// Get the latest object to ensure we have the current ResourceVersion.
gpu, err := p.client.DeviceV1alpha1().GPUs().Get(ctx, gpuName, metav1.GetOptions{})
if err != nil {
	return err
}
// Modify status
gpu.Status = newStatus
// Update
_, err = p.client.DeviceV1alpha1().GPUs().UpdateStatus(ctx, gpu, metav1.UpdateOptions{})
```
pkg/providers/nvml/provider.go (outdated)

```go
	"github.com/NVIDIA/go-nvml/pkg/nvml"
	"k8s.io/klog/v2"

	v1alpha1 "github.com/nvidia/nvsentinel/internal/generated/device/v1alpha1"
```
Minor: Providers must never import internal packages.
```go
func NewOptions() *Options {
	return &Options{
		InMemory: true,
```
As noted in other comments, the current implementation has some gaps regarding ResourceVersion semantics, which will break controllers and informers that rely on those guarantees.
I suggest we make Kine (with SQLite) the default in-memory path. Because Kine is the default backend for K3s (a CNCF certified distro), we know it passes the upstream Kubernetes storage conformance tests. This gives us verified correctness out of the box.
We can still include this custom implementation, but perhaps behind a flag like --experimental-in-memory-storage until we can verify it against the full storage compliance suite.
Adds forward-looking guidance to the existing TODO(#720) comment, clarifying that the proto message must be extended with a namespace field when multi-namespace support is added. Addresses audit finding M3: UpdateGpuStatus hardcoded to default namespace.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
```go
	// 2. Check for pending page retirements (ECC errors)
	if pending, ret := device.GetRetiredPagesPendingStatus(); ret == nvml.SUCCESS {
		if pending == nvml.FEATURE_ENABLED {
			p.logger.V(1).Info("GPU has pending page retirements", "uuid", uuid)
			return false, nil
		}
	}

	// Device is accessible and no pending issues
```
The method GetRetiredPagesPendingStatus() is called on the device but is not defined in the Device interface (pkg/providers/nvml/interface.go:36-42). This will cause a compilation error when building with -tags=nvml. Add this method to the Device interface and implement it in both the wrapper and mock implementations.
Suggested change (removes the pending-retirement check and adjusts the comment):

```go
	// Device is accessible according to basic checks
```
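Relatedly, a minimal sketch of how the local Device interface and its mock could be extended as the comment above asks, assuming go-nvml exposes GetRetiredPagesPendingStatus on its Device type (the interface shape here is illustrative):

```go
package sketch

import "github.com/NVIDIA/go-nvml/pkg/nvml"

// Device is the local abstraction; it must cover every NVML call the health
// checker makes, including the pending-retirement query.
type Device interface {
	GetUUID() (string, nvml.Return)
	GetRetiredPagesPendingStatus() (nvml.EnableState, nvml.Return)
}

// nvmlDevice wraps the real handle.
type nvmlDevice struct{ dev nvml.Device }

func (d nvmlDevice) GetUUID() (string, nvml.Return) { return d.dev.GetUUID() }
func (d nvmlDevice) GetRetiredPagesPendingStatus() (nvml.EnableState, nvml.Return) {
	return d.dev.GetRetiredPagesPendingStatus()
}

// mockDevice is a test double with canned responses.
type mockDevice struct {
	uuid    string
	pending nvml.EnableState
}

func (m mockDevice) GetUUID() (string, nvml.Return) { return m.uuid, nvml.SUCCESS }
func (m mockDevice) GetRetiredPagesPendingStatus() (nvml.EnableState, nvml.Return) {
	return m.pending, nvml.SUCCESS
}
```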
Force-pushed from 046249c to 1c407f5.
Force-pushed from 1c407f5 to 5e6d23d.
```diff
 if labels != nil {
 	version := prometheus.NewGauge(prometheus.GaugeOpts{
 		Name:        "device_apiserver_build_info",
 		Help:        "Build information about the device-apiserver binary.",
-		ConstLabels: m.buildInfoLabels,
+		ConstLabels: labels,
```
The build-info metric still uses the name/help text "device_apiserver_build_info" / "device-apiserver" even though the binary/component is now "device-api-server". At minimum, consider updating the Help string and the service label value to match the new component name; if the metric name is kept for compatibility, documenting that would help avoid confusion.
```go
func (p *gpuServiceProvider) Install(svr *grpc.Server, cfg storagebackend.Config) (api.Service, error) {
	// Currently only in-memory storage is supported. The cfg parameter is
	// accepted for future extensibility but not used for backend selection.
	_ = cfg
```
Install() now unconditionally uses the in-memory storage backend and explicitly ignores the provided storagebackend.Config. This means configuring the server for a persistent backend won’t actually persist GPU resources, which is a surprising behavior change. Consider either wiring cfg into backend selection (restore the factory.Create path when persistence is desired) or returning an explicit error when a non-in-memory backend is configured but unsupported.
```go
	// Cap conditions to prevent unbounded growth
	const maxConditions = 100
	if len(gpu.Status.Conditions) > maxConditions {
		gpu.Status.Conditions = gpu.Status.Conditions[len(gpu.Status.Conditions)-maxConditions:]
	}
```
The PR description states MaxConditionsPerGpu = 32, but this provider-side cap uses maxConditions = 100. Please reconcile the intended limit (and ideally share a single constant) so behavior matches the documented design decision.
```go
	select {
	case <-w.done:
		w.logger.V(4).Info("Watcher already done, dropping error event")
	case w.result <- watch.Event{Type: watch.Error, Object: statusErr}:
	case <-time.After(5 * time.Second):
		w.logger.V(2).Info("Error event send timed out, dropping")
```
sendError also uses time.After in a select; repeated errors can accumulate timers similarly. Consider using a reusable timer (or a non-blocking send with explicit drop policy) to avoid timer churn/leaks.
```go
	// Validate target scheme
	if !strings.HasPrefix(c.Target, "unix://") && !strings.HasPrefix(c.Target, "unix:") &&
		!strings.HasPrefix(c.Target, "dns:") && !strings.HasPrefix(c.Target, "passthrough:") {
		return fmt.Errorf("gRPC target %q must use unix://, dns:, or passthrough: scheme", c.Target)
	}
```
Validate() accepts both "unix://" and "unix:" targets, but the error message only mentions "unix://" (and omits "unix:"). Please update the message to reflect the actual accepted schemes (unix://, unix:, dns:, passthrough:) to avoid confusing users.
| summary: "Device API Server is down on {{ "{{ $labels.instance }}" }}" | ||
| description: "Device API Server has been unreachable for more than 5 minutes." | ||
| runbook_url: "https://github.com/nvidia/device-api/blob/main/docs/operations/device-api-server.md#alert-deviceapiserverdown" | ||
|
|
There was a problem hiding this comment.
The PrometheusRule runbook_url links point to https://github.com/nvidia/device-api/... which doesn’t match this repository (nvidia/nvsentinel). If these docs live in-repo, update the URLs to the correct location to avoid broken runbook links.
```yaml
# -- Image configuration
image:
  # -- Image repository
  repository: ghcr.io/nvidia/device-api-server
  # -- Image pull policy
  pullPolicy: IfNotPresent
  # -- Image tag (defaults to Chart appVersion)
  tag: ""
```
Image references are inconsistent across deployment artifacts: the Helm chart defaults to ghcr.io/nvidia/device-api-server, while the static manifest uses ghcr.io/nvidia/nvsentinel/device-api-server. This can cause users to deploy the wrong images; consider standardizing on a single registry/repository naming scheme (or documenting why they differ).
```go
	case <-time.After(30 * time.Second):
		w.logger.Error(nil, "Event send timed out; consumer not reading, stopping watcher")
		return
```
Using time.After inside the per-event select allocates a new timer each loop iteration; if the send succeeds quickly, the timer still lives until it fires, which can cause unbounded timer buildup under load. Prefer a reusable time.Timer (stop+drain) or a non-allocating timeout strategy (e.g., context with deadline) for the send timeout.
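A small Go sketch of the reusable-timer pattern suggested in these two comments: one `time.Timer` per watcher, stopped, drained, and reset before each send, instead of a fresh `time.After` timer per event. Names are illustrative:

```go
package sketch

import "time"

type event struct{}

// sendWithTimeout reuses the caller-owned timer as the send deadline.
// It returns false if the consumer did not accept the event in time.
func sendWithTimeout(ch chan<- event, ev event, timer *time.Timer, timeout time.Duration) bool {
	// Stop and drain before reuse so a stale expiry cannot fire early.
	if !timer.Stop() {
		select {
		case <-timer.C:
		default:
		}
	}
	timer.Reset(timeout)

	select {
	case ch <- ev:
		return true
	case <-timer.C:
		return false
	}
}

// Usage: create the timer once per watcher, e.g.
//   timer := time.NewTimer(timeout)
//   defer timer.Stop()
//   for ev := range source {
//       if !sendWithTimeout(out, ev, timer, timeout) { return }
//   }
```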
```bash
go get github.com/nvidia/device-api/api@latest
```

```go
import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"github.com/nvidia/nvsentinel/pkg/client-go/clientset/versioned"
	"github.com/nvidia/nvsentinel/pkg/grpc/client"
	v1alpha1 "github.com/nvidia/device-api/api/gen/go/device/v1alpha1"
)
```

### Example: List GPUs

```go
package main

import (
	"context"
	"log"

	v1alpha1 "github.com/nvidia/device-api/api/gen/go/device/v1alpha1"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)
```
README examples reference the module path "github.com/nvidia/device-api/...", but this repo’s Go module is "github.com/nvidia/nvsentinel" (see go.mod). These import paths / go get instructions won’t work as written; please update them to the correct module paths (or clarify if the repo/module is being renamed).
```yaml
- name: device-api-server
  image: ghcr.io/nvidia/nvsentinel/device-api-server:v0.0.0 # Replace with specific version for production
  imagePullPolicy: IfNotPresent
  args:
    - --bind-address=unix:///var/run/device-api/device.sock
    - --health-probe-bind-address=:8081
    - --metrics-bind-address=:9090
    - --shutdown-grace-period=25s
    - -v=0
  ports:
    - name: health
      containerPort: 8081
      protocol: TCP
    - name: metrics
      containerPort: 9090
      protocol: TCP
  livenessProbe:
    httpGet:
      path: /healthz
      port: health
    initialDelaySeconds: 5
    periodSeconds: 10
    timeoutSeconds: 5
    failureThreshold: 3
  readinessProbe:
    httpGet:
      path: /readyz
      port: health
    initialDelaySeconds: 5
    periodSeconds: 10
    timeoutSeconds: 5
    failureThreshold: 3
  resources:
    requests:
      cpu: 50m
      memory: 64Mi
    limits:
      cpu: 200m
      memory: 256Mi
  securityContext:
    runAsNonRoot: true
    runAsUser: 65534
    runAsGroup: 65534
    readOnlyRootFilesystem: true
    allowPrivilegeEscalation: false
    capabilities:
      drop:
        - ALL
  volumeMounts:
    - name: device-api-socket
      mountPath: /var/run/device-api

# NVML Provider Sidecar - CGO binary, requires RuntimeClass nvidia
- name: nvml-provider
  image: ghcr.io/nvidia/nvsentinel/nvml-provider:v0.0.0 # Replace with specific version for production
  imagePullPolicy: IfNotPresent
```
This manifest references images under ghcr.io/nvidia/nvsentinel/... while the Helm chart defaults to ghcr.io/nvidia/device-api-server / device-api-server-sidecar. Please align these image repositories/tags (or explicitly document the intended difference) to avoid deployment confusion.
…ices

Implement a Kubernetes-style device API server with the following components:

Core server (cmd/device-api-server):
- gRPC-based controlplane API server using apiserver-runtime patterns
- GPU service with full CRUD + Watch + UpdateStatus via storage.Interface
- Server lifecycle management with graceful shutdown
- Health and metrics endpoints with gRPC reflection

Storage (pkg/storage):
- In-memory storage.Interface implementation with watch support
- Configurable watch channel buffer sizes and event drop metrics
- Factory pattern for storage backend selection

NVML provider (cmd/nvml-provider):
- Sidecar that enumerates GPUs via NVML and registers them with the API
- XID error event monitoring with health condition updates
- Reconciliation loop with configurable intervals
- Environment-based driver root configuration (NVIDIA_DRIVER_ROOT)

Client SDK (pkg/client-go):
- Typed gRPC client with Get/List/Watch/Create/Update/UpdateStatus/Delete
- Fake client for testing with k8s.io/client-go/testing integration
- Informers and listers following Kubernetes client-go conventions
- Clientset pattern for versioned API access

Code generator (code-generator):
- Fork of k8s.io/code-generator/cmd/client-gen for gRPC backends
- Generates typed clients, fake clients, and expansion interfaces
- Full UpdateStatus template with proper gRPC implementation
- Integrated into hack/update-codegen.sh pipeline

Proto API (api/proto/device/v1alpha1):
- GPU resource with spec (UUID) and status (conditions, recommendedAction)
- Standard CRUD + Watch + UpdateStatus RPCs
- K8s-style request/response patterns with options

Security hardening:
- gRPC message size and stream limits
- Server error detail scrubbing for client responses
- Unix socket path validation and restrictive permissions
- Localhost-only enforcement for insecure credentials

Deployment (deployments/helm):
- Helm chart with configurable storage, gRPC, health, and metrics
- Static manifests with versioned image references
- Dockerfile with pinned base images

Testing:
- Unit tests for storage, services, providers, and utilities
- Integration tests for client-go with full gRPC stack
- Shared testutil with bufconn gRPC test helpers
- Fake client examples for consumer testing patterns

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Force-pushed from 5e6d23d to 4b4a6da.
Summary

Implements the Device API Server, a node-local gRPC cache server for GPU device state, deployed as a Kubernetes DaemonSet. It acts as an intermediary between providers (e.g., NVSentinel health monitors that update GPU state) and consumers (e.g., Device Plugins, DRA Drivers that read state for scheduling decisions).

Architecture

Why read-blocking semantics?

The `sync.RWMutex` with writer preference ensures consumers never read stale "healthy" state while a provider is updating it to "unhealthy". Once a write is pending, new readers block until the write completes. This is critical for correct scheduling decisions.
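A minimal sketch of such a read-blocking, clone-safe cache; `GpuState` and the method names are illustrative stand-ins, not the PR's actual cache package:

```go
package sketch

import "sync"

// GpuState is an illustrative stand-in for the proto-backed GPU record.
type GpuState struct {
	UUID            string
	Healthy         bool
	ResourceVersion int64
}

// Cache serializes writers ahead of readers: a pending Set blocks new Gets
// until the update lands, so consumers never act on stale health state.
type Cache struct {
	mu      sync.RWMutex
	version int64
	gpus    map[string]GpuState
}

func NewCache() *Cache {
	return &Cache{gpus: make(map[string]GpuState)}
}

// Set installs the new state under the write lock and bumps the monotonic
// resource version.
func (c *Cache) Set(state GpuState) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.version++
	state.ResourceVersion = c.version
	c.gpus[state.UUID] = state
}

// Get blocks while a write is in flight and returns a copy of the entry,
// so callers can never mutate shared cache state.
func (c *Cache) Get(uuid string) (GpuState, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	s, ok := c.gpus[uuid]
	return s, ok
}
```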
NVML Provider (Sidecar)

The NVML provider runs as a sidecar container that:
- registers each GPU via `CreateGpu`
- calls `UpdateGpuStatus` when health changes

It is decoupled as a sidecar for independent updates and glibc isolation (NVML requires `RTLD_DEEPBIND`).

Key Design Decisions
- … (`//go:build nvml`)

Key Components
- Thread-safe cache built on `sync.RWMutex`, with clone-safety and monotonic resource versioning
- `ListAndSubscribe` to prevent duplicate events; slow subscriber isolation (sketched at the end of this description)

Type of Change

Test Results
- Tests pass with the `-race` flag (cache: 34, service: 21, nvml: 22, metrics: 7, config: 2)
- `go vet` clean (including `-tags nvml`)
- `proto.Clone` (verified via clone-safety tests)

Checklist
- … (`make protos-generate`)
- … `:port` and `host:port` formats
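As referenced in Key Components above, a sketch of the `ListAndSubscribe` idea: snapshot and subscription are taken under the same lock, so a subscriber sees every update exactly once, either in the returned list or on its channel. Types and buffer size are illustrative, not the PR's actual broadcaster:

```go
package sketch

import "sync"

type Update struct {
	UUID    string
	Healthy bool
}

type Broadcaster struct {
	mu      sync.Mutex
	current map[string]Update
	subs    []chan Update
}

func NewBroadcaster() *Broadcaster {
	return &Broadcaster{current: make(map[string]Update)}
}

// ListAndSubscribe atomically returns the current state and a channel that
// carries only updates published after that snapshot, preventing both missed
// and duplicated events.
func (b *Broadcaster) ListAndSubscribe() ([]Update, <-chan Update) {
	b.mu.Lock()
	defer b.mu.Unlock()
	snapshot := make([]Update, 0, len(b.current))
	for _, u := range b.current {
		snapshot = append(snapshot, u)
	}
	ch := make(chan Update, 64)
	b.subs = append(b.subs, ch)
	return snapshot, ch
}

// Publish records the update and fans it out; subscribers that cannot keep
// up with the buffer are evicted rather than blocking everyone else.
func (b *Broadcaster) Publish(u Update) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.current[u.UUID] = u
	kept := b.subs[:0]
	for _, ch := range b.subs {
		select {
		case ch <- u:
			kept = append(kept, ch)
		default:
			close(ch) // evict slow subscriber
		}
	}
	b.subs = kept
}
```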