feat: add device-api-server with NVML fallback provider #720

ArangoGutierrez wants to merge 1 commit into device-api from …
Conversation
Pull request overview
This pull request implements a comprehensive Device API Server with NVML fallback provider support for GPU device information and health monitoring. The server acts as a node-local gRPC cache intermediary between providers (health monitors) and consumers (device plugins, DRA drivers).
Changes:
- Implements complete Device API Server with gRPC services (GpuService for consumers, ProviderService for providers)
- Adds built-in NVML fallback provider for direct GPU enumeration and XID-based health monitoring
- Implements thread-safe in-memory cache with RWMutex for read-blocking semantics during writes
- Adds watch broadcaster for real-time GPU state change notifications
- Includes comprehensive Prometheus metrics for observability
- Provides Helm charts for Kubernetes deployment with ServiceMonitor and alerting
Reviewed changes
Copilot reviewed 51 out of 54 changed files in this pull request and generated 1 comment.
Summary per file:
| File | Description |
|---|---|
| pkg/version/version.go | Version information package with build-time ldflags support |
| pkg/deviceapiserver/config.go | Server configuration with flags, environment variables, and validation |
| pkg/deviceapiserver/server.go | Main server implementation with lifecycle management and graceful shutdown |
| pkg/deviceapiserver/cache/*.go | Thread-safe GPU cache with broadcaster for watch events |
| pkg/deviceapiserver/service/*.go | gRPC service implementations (consumer and provider APIs) |
| pkg/deviceapiserver/nvml/*.go | NVML provider for GPU enumeration and XID health monitoring |
| pkg/deviceapiserver/metrics/metrics.go | Prometheus metrics definitions and recording |
| cmd/device-api-server/main.go | Main entry point with flag parsing and signal handling |
| docs/*.md | Comprehensive API, operations, and design documentation |
| charts/device-api-server/* | Helm chart templates for Kubernetes deployment |
| go.mod/go.sum | Dependencies including NVML, gRPC, and Prometheus libraries |
Force-pushed from 10ae27f to bd1b3ad.
| import "device/v1alpha1/gpu.proto"; | ||
|
|
||
| // ========================================== | ||
| // Provider Service Definition |
There was a problem hiding this comment.
I’m concerned about adding a new service with custom methods. Should we extend the existing one with standard methods instead?
Usually, "registering" is just a Create call, "unregistering" is a Delete call, and condition updates are either an UpdateStatus or Patch call. I think we should stick to these well understood conventions.
Using custom methods also risks compatibility with standard tooling (e.g., code generators and controller-runtime), making things harder for us both now and later.
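For illustration, a minimal Go sketch of the kind of unified, CRUD-style surface this comment argues for. The type and method names below are assumptions for the sketch, not the API generated in this PR:

```go
// Hypothetical sketch of a unified, Kubernetes-conventional GPU service.
// Gpu, GpuStatus, and GpuList are illustrative stand-ins for the real types.
package sketch

import "context"

type GpuStatus struct {
	Healthy bool
}

type Gpu struct {
	Name   string
	UUID   string
	Status GpuStatus
}

type GpuList struct {
	Items []Gpu
}

// GpuService replaces provider-specific verbs with standard CRUD:
// RegisterGpu -> CreateGpu, UnregisterGpu -> DeleteGpu,
// UpdateGpuCondition -> UpdateGpuStatus.
type GpuService interface {
	GetGpu(ctx context.Context, name string) (*Gpu, error)
	ListGpus(ctx context.Context) (*GpuList, error)
	CreateGpu(ctx context.Context, gpu *Gpu) (*Gpu, error)
	UpdateGpu(ctx context.Context, gpu *Gpu) (*Gpu, error)
	UpdateGpuStatus(ctx context.Context, gpu *Gpu) (*Gpu, error)
	DeleteGpu(ctx context.Context, name string) error
}
```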
pkg/deviceapiserver/server.go (outdated)
The tight coupling between the API server and NVML feels off.
While we definitely need a default health check for GPUs, embedding it directly into the API server makes it hardware-aware rather than a ‘dumb’ transport pipe.
What do you think about moving the default health checks out of the API server entirely? The API server should be agnostic to how the state is produced. The default health checks should interact with the API server just like any other would: via the standard client interface.
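To make the suggestion concrete, here is a minimal Go sketch of a health provider that only talks to the API server through a client interface, never through the server's internal cache. `GpuClient` and `Gpu` are illustrative stand-ins, not the generated client from this PR:

```go
// A provider that pushes GPU state via the public API, exactly as any
// external provider would.
package sketch

import (
	"context"
	"time"
)

type Gpu struct {
	Name    string
	Healthy bool
}

// GpuClient is a stand-in for the generated gRPC client interface.
type GpuClient interface {
	CreateGpu(ctx context.Context, gpu *Gpu) (*Gpu, error)
	UpdateGpuStatus(ctx context.Context, gpu *Gpu) (*Gpu, error)
}

// RunHealthProvider registers the enumerated GPUs, then periodically pushes
// status updates through the same client interface.
func RunHealthProvider(ctx context.Context, c GpuClient, enumerate func() []Gpu) error {
	for _, g := range enumerate() {
		gpu := g
		if _, err := c.CreateGpu(ctx, &gpu); err != nil {
			return err
		}
	}

	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			for _, g := range enumerate() {
				gpu := g
				if _, err := c.UpdateGpuStatus(ctx, &gpu); err != nil {
					return err
				}
			}
		}
	}
}
```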
Addresses review feedback from pteranodan on PR #720 regarding API design and NVML provider coupling.

BREAKING CHANGES:
- Removed ProviderService in favor of unified GpuService
- RegisterGpu() → CreateGpu()
- UnregisterGpu() → DeleteGpu()
- UpdateGpuCondition() → removed (use UpdateGpuStatus with full status)

API Changes:
- Consolidated GpuService with both read and write operations
- Added standard K8s CRUD methods: CreateGpu, UpdateGpu, UpdateGpuStatus, DeleteGpu
- Removed custom provider methods (RegisterGpu, UnregisterGpu, UpdateGpuCondition)
- Deleted api/proto/device/v1alpha1/provider.proto
- Updated gpu.proto with unified service definition

Service Layer:
- Created pkg/deviceapiserver/service/gpu_service.go (unified service)
- Deleted consumer.go and provider.go (replaced by gpu_service.go)
- Updated integration tests to use new API
- Improved cache interface with Create, Update, UpdateStatusWithVersion, Delete

NVML Provider Decoupling:
- NVML provider now uses GpuServiceClient instead of direct cache access
- Server creates loopback gRPC connection for embedded NVML provider
- Added standalone cmd/nvml-provider/ binary for sidecar deployment
- Provider now "dogfoods" its own API via gRPC

Documentation:
- Updated README.md with new unified API architecture
- Rewrote docs/api/device-api-server.md with single service model
- Updated docs/operations/device-api-server.md
- Simplified docs/design/device-api-server.md
- Removed obsolete task lists and design documents

Deployment:
- Reorganized charts: charts/device-api-server/ → deployments/helm/nvsentinel/
- Added deployments/container/Dockerfile
- Added deployments/static/ for static manifests
- Added demos/nvml-sidecar-demo.sh

Other:
- Updated .gitignore for planning documents
- Enhanced Makefile with new targets
- Added metrics for unified service

This refactoring improves:
1. API consistency with Kubernetes conventions
2. Tooling compatibility (code generators, controller-runtime)
3. Decoupling of NVML provider from server internals
4. Flexibility to run NVML as embedded or sidecar
5. Overall code maintainability and clarity

Signed-off-by: Eduardo Arango <eduardoa@nvidia.com>
api/proto/device/v1alpha1/gpu.proto (outdated)

```proto
//
// This increments on every modification to the GPU and can be used by
// consumers to detect changes or implement optimistic concurrency.
int64 resource_version = 4;
```
What's the difference between this and the resource version inside of the metadata field?
cmd/nvml-provider/main.go (outdated)

```go
if id := os.Getenv("PROVIDER_ID"); id != "" {
	cfg.ProviderID = id
}
if root := os.Getenv("NVML_DRIVER_ROOT"); root != "" {
```
Is this the standard or conventional environment variable used to hold the driver root location?
cmd/nvml-provider/main.go (outdated)

```go
	},
}

_, err := p.gpuClient.CreateGpu(ctx, req)
```
FYI: I’ve updated client-gen to support Create methods (#718). Once that's merged, you can use versioned clients here instead of proto stubs:
```go
gpu, err := clientset.DeviceV1alpha1().GPUs().Create(ctx, &devicev1alpha1.GPU{}, metav1.CreateOptions{})
```
cmd/nvml-provider/main.go (outdated)

```go
	},
}

if _, err := p.gpuClient.UpdateGpuStatus(ctx, req); err != nil {
```
FYI: I’ve updated client-gen to support Update methods (#718). Once that's merged, you can use versioned clients here instead of proto stubs:
```go
gpu, err := clientset.DeviceV1alpha1().GPUs().Update(ctx, &devicev1alpha1.GPU{}, metav1.UpdateOptions{})
```
Comprehensive design document capturing:
- In-memory cache architecture (replacing Kine/SQLite)
- Memory safety bounds sized for Vera Rubin Ultra NVL576 (1024 GPUs)
- Readiness gate blocking consumers until providers register
- Watch broadcaster with slow subscriber eviction
- What to take from each PR
- Implementation phases
- Authorship preservation requirements

This document serves as the spec for cherry-picking #720 onto merged #718.

Co-authored-by: Carlos Eduardo Arango Gutierrez <carangog@redhat.com>
Co-authored-by: Dan Stone <danstone@nvidia.com>
go.mod (outdated)

```
@@ -0,0 +1,32 @@
module github.com/nvidia/nvsentinel

go 1.25
```
The Go version is specified as "1.25" here, but in api/go.mod (before the change) it was "1.25.5". While Go considers these equivalent, it's better to be consistent across all go.mod files in the repository. Consider using the same patch version everywhere for consistency.
docs/operations/device-api-server.md (outdated)

```markdown
- The TCP listener binds to `127.0.0.1:50051` (localhost) by default. This prevents network exposure.
- Prefer using a Unix domain socket for all local clients and providers; avoid exposing the TCP gRPC listener to the broader cluster network.
- In multi-tenant or partially untrusted clusters, strongly recommend **disabling the TCP listener** (set `--grpc-address=""`) or using a restricted Unix socket combined with Kubernetes `NetworkPolicy` to limit access to the Device API Server pod.
- If TCP gRPC must be exposed beyond the node (e.g., `--grpc-address=:50051`), terminate it behind TLS/mTLS or an equivalent authenticated tunnel (for example, a sidecar proxy or ingress with client auth) and enforce a least-privilege `NetworkPolicy`.
```
The "Network Security" section documents the gRPC API as unauthenticated and recommends keeping it node-local, but the Helm chart currently defaults server.grpcAddress to :50051 (see deployments/helm/nvsentinel/values.yaml), which binds the unauthenticated gRPC listener on all pod interfaces and makes it reachable by any pod in the cluster over plaintext. In a multi-tenant or partially untrusted cluster, an attacker can connect to this port and invoke write RPCs (CreateGpu, UpdateGpuStatus, DeleteGpu) to tamper with GPU health/scheduling state, despite the documentation implying a safer localhost-only default. To close this gap, align the chart defaults with the documented security posture (e.g., default to 127.0.0.1:50051 or Unix socket only) and require explicit opt-in plus TLS/mTLS and restrictive NetworkPolicy when exposing the TCP listener beyond localhost.
Suggested change:

```markdown
- If TCP gRPC must be exposed beyond the node, **do not** publish a plaintext, all-interfaces listener (for example, `--grpc-address=:50051`) directly to the cluster network. Instead, keep the server bound to localhost (for example, `127.0.0.1:50051`) or a Unix domain socket and expose it only behind TLS/mTLS or an equivalent authenticated tunnel (such as a sidecar proxy or ingress with client auth), combined with a least-privilege `NetworkPolicy`.
```
Aligns Helm chart default with config.go secure default. The gRPC API is unauthenticated, so binding to all interfaces by default exposes the server to unauthorized cluster access. Addresses Copilot security review feedback on PR #720.
The resource_version was duplicated at both Gpu.resource_version (field 4, int64) and Gpu.metadata.resource_version (in ObjectMeta, string). This is redundant and violates K8s API conventions where resource_version should only be in metadata.

Changes:
- Remove Gpu.resource_version field from proto (reserved field 4 for backward compatibility)
- Update cache to use metadata.resource_version (string format)
- Update service to read from metadata.resource_version
- Remove ResourceVersion from goverter ignore list (no longer needed)
- Regenerate protobuf code

The UpdateGpuStatusRequest.resource_version field is intentionally kept as it serves a different purpose: passing the expected version for optimistic concurrency checks in requests.

Addresses pteranodan's review feedback on PR #720.
The API server should be a 'dumb' transport pipe, agnostic to how GPU state is produced. NVML integration should be external only, interacting via the standard gRPC client interface.

Changes:
- Remove nvmlProvider field and related methods from server.go
- Remove NVML config fields (NVMLEnabled, NVMLDriverRoot, etc.) from config.go
- Remove NVML flags from main.go (--enable-nvml-provider, --nvml-driver-root, etc.)
- Remove embedded nvml section from Helm values.yaml
- Remove nvml args and env vars from daemonset.yaml template
- Update NOTES.txt to reference nvmlProvider sidecar instead

The nvmlProvider sidecar (values.yaml) remains as the recommended approach for NVML-based GPU enumeration and health monitoring.

Addresses pteranodan's architecture review feedback on PR #720.
```go
// XidDescriptions provides human-readable descriptions for common XIDs.
var XidDescriptions = map[uint64]string{
	// Application errors (typically ignored)
	13:  "Graphics Engine Exception",
	31:  "GPU memory page fault",
	43:  "GPU stopped processing",
	45:  "Preemptive cleanup",
	68:  "Video processor exception",
	109: "Context Switch Timeout",

	// Memory errors
	48: "Double Bit ECC Error",
	63: "Row remapping failure",
	64: "Uncontained ECC error",
	74: "NVLink error",
	79: "GPU has fallen off the bus",
	94: "Contained ECC error",
	95: "Uncontained ECC error",

	// Other notable XIDs
	8:   "GPU not accessible",
	32:  "Invalid or corrupted push buffer stream",
	38:  "Driver firmware error",
	56:  "Display engine error",
	57:  "Error programming video memory interface",
	62:  "Internal micro-controller halt (non-fatal)",
	69:  "Graphics engine accessor error",
	119: "GSP error",
	120: "GSP firmware error",
}
```
XidDescriptions is exported as a mutable map, which allows other packages to modify it at runtime (and contradicts the stated goal of keeping XID maps immutable). Since it’s only used internally, consider making it unexported (xidDescriptions) and/or exposing a function that returns a copy to prevent accidental mutation.
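One possible shape for that change, sketched below: an unexported map plus accessor functions that return copies. This is illustrative only; the PR may resolve it differently.

```go
package nvml

// xidDescriptions stays unexported so other packages cannot mutate it.
// Entries are elided here; the full table lives in the original snippet above.
var xidDescriptions = map[uint64]string{
	13: "Graphics Engine Exception",
	48: "Double Bit ECC Error",
	79: "GPU has fallen off the bus",
	// ... remaining entries elided ...
}

// XidDescription returns the human-readable description for a single XID.
func XidDescription(xid uint64) (string, bool) {
	desc, ok := xidDescriptions[xid]
	return desc, ok
}

// XidDescriptions returns a copy of the table; mutating the copy cannot
// affect package state.
func XidDescriptions() map[uint64]string {
	out := make(map[uint64]string, len(xidDescriptions))
	for k, v := range xidDescriptions {
		out[k] = v
	}
	return out
}
```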
deployments/container/Dockerfile (outdated)

```dockerfile
# gRPC API port
EXPOSE 8080
# Metrics port
EXPOSE 8081
```
The device-api-server container image exposes ports 8080/8081, but the server’s admin health gRPC and metrics HTTP endpoints are configured via flags (defaulting to :50051 for health gRPC and :9090 for metrics in pkg/controlplane/apiserver/options). Also, gRPC device service is on a unix socket, not a TCP port. Please align EXPOSE/comments with the actual runtime ports (or drop EXPOSE if they’re not meaningful for UDS-based deployment).
Makefile (outdated)

```makefile
	helm lint deployments/helm/nvsentinel

.PHONY: helm-template
helm-template: ## Render Helm chart templates
	helm template nvsentinel deployments/helm/nvsentinel

.PHONY: helm-package
helm-package: ## Package Helm chart
	@mkdir -p dist/
	helm package deployments/helm/nvsentinel -d dist/
```
Helm targets reference deployments/helm/nvsentinel, but this PR adds the chart under deployments/helm/device-api-server. As written, make helm-lint, make helm-template, and make helm-package will fail because the referenced path doesn’t exist. Please update these targets to point at the correct chart directory (and ensure the chart name matches).
Suggested change:

```makefile
	helm lint deployments/helm/device-api-server

.PHONY: helm-template
helm-template: ## Render Helm chart templates
	helm template device-api-server deployments/helm/device-api-server

.PHONY: helm-package
helm-package: ## Package Helm chart
	@mkdir -p dist/
	helm package deployments/helm/device-api-server -d dist/
```
```markdown
### Key Parameters

| Parameter | Description | Default |
|-----------|-------------|---------|
| `image.repository` | Image repository | `ghcr.io/nvidia/device-api-server` |
| `image.tag` | Image tag | Chart appVersion |
| `server.grpcAddress` | gRPC server address | `:50051` |
| `server.unixSocket` | Unix socket path | `/var/run/device-api/device.sock` |
| `server.healthPort` | Health endpoint port | `8081` |
| `server.metricsPort` | Metrics endpoint port | `9090` |
| `nvmlProvider.enabled` | Enable NVML provider sidecar | `false` |
| `nvmlProvider.driverRoot` | NVIDIA driver library root | `/run/nvidia/driver` |
| `nvmlProvider.healthCheckEnabled` | Enable XID event monitoring | `true` |
| `runtimeClassName` | Pod RuntimeClass | `""` |
| `nodeSelector` | Node selector | `nvidia.com/gpu.present: "true"` |
| `metrics.serviceMonitor.enabled` | Create ServiceMonitor | `false` |
| `metrics.prometheusRule.enabled` | Create PrometheusRule | `false` |
```
The chart README lists configuration keys that don’t exist in values.yaml (e.g., server.grpcAddress) and implies a TCP listener by default, while the chart template configures --bind-address as a unix socket. Please update the parameter table/examples to reflect the actual values.yaml schema and the current server flags to avoid broken installs.
```go
preparedStorage, err := storage.PrepareRun(ctx)
if err != nil {
	logger.Error(err, "Failed to prepare storage backend")
	os.Exit(1)
}

// Run the storage backend in the background.
storageErrCh := make(chan error, 1)
go func() {
	storageErrCh <- preparedStorage.Run(ctx)
}()

// Create, prepare, and run the device API server.
server, err := completedAPIServerConfig.New(storage)
if err != nil {
	logger.Error(err, "Failed to create device API server")
	os.Exit(1)
}

prepared, err := server.PrepareRun(ctx)
if err != nil {
	logger.Error(err, "Failed to prepare device API server")
	os.Exit(1)
}

if err := prepared.Run(ctx); err != nil {
	logger.Error(err, "Device API server error")
	os.Exit(1)
}

// Check if the storage backend exited with an error.
if err := <-storageErrCh; err != nil {
	logger.Error(err, "Storage backend error")
	os.Exit(1)
}
```
The storage backend is run in a goroutine and only awaited after the gRPC server exits. If the storage backend fails while the server is running, the server will keep serving with no coordination/shutdown (this regresses from the previous errgroup-based lifecycle). Consider running storage and server under an errgroup (or select on storageErrCh) so a storage failure triggers cancellation and a clean shutdown of the server.
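A minimal sketch of the errgroup-based lifecycle this comment suggests; the `runner` interface is a stand-in for the prepared storage backend and prepared server, both of which expose a blocking `Run(ctx)` in the snippet above:

```go
package lifecycle

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// runner abstracts any component with a blocking Run method.
type runner interface {
	Run(ctx context.Context) error
}

// RunAll runs every component under one errgroup: if any Run returns an
// error, the shared context is cancelled so the others shut down cleanly,
// and the first error is returned for logging and exit handling.
func RunAll(ctx context.Context, components ...runner) error {
	g, gctx := errgroup.WithContext(ctx)
	for _, c := range components {
		c := c
		g.Go(func() error { return c.Run(gctx) })
	}
	return g.Wait()
}
```

Called as `RunAll(ctx, preparedStorage, preparedServer)`, a storage failure now cancels the server instead of leaving it serving indefinitely.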
```markdown
### Command-Line Flags

| Flag | Default | Description |
|------|---------|-------------|
| `--grpc-address` | `127.0.0.1:50051` | TCP address for gRPC server (localhost only by default) |
| `--unix-socket` | `/var/run/device-api/device.sock` | Unix socket path |
| `--health-port` | `8081` | HTTP port for health endpoints |
| `--metrics-port` | `9090` | HTTP port for Prometheus metrics |
| `--shutdown-timeout` | `30` | Graceful shutdown timeout (seconds) |
| `--shutdown-delay` | `5` | Pre-shutdown delay (seconds) |
| `--log-format` | `json` | Log format: `text` or `json` |
| `-v` | `0` | Log verbosity level |
```
The CLI flag documentation here appears out of sync with the actual server flags wired up in pkg/controlplane/apiserver/options (e.g., it documents --grpc-address/--unix-socket/--health-port/--metrics-port, but the code defines --bind-address, --health-probe-bind-address, and --metrics-bind-address). Please update this doc to match the real flags so operators can deploy/configure the server correctly.
```yaml
args:
  - --grpc-address=:50051
  - --health-port=8081
  - --metrics-port=9090
  - --provider-address=localhost:9001
  - --unix-socket=/var/run/device-api/device.sock
  - --v=0
```
This static DaemonSet manifest uses flags like --grpc-address/--health-port/--metrics-port/--unix-socket/--provider-address that do not match the current device-api-server binary flags (which use --bind-address, --health-probe-bind-address, --metrics-bind-address, etc.). As written, the container will fail to start due to unknown flags. Please update the args to the new flag names/semantics or clearly mark this manifest as deprecated/out-of-date.
api/proto/device/v1alpha1/gpu.proto (outdated)

```proto
string resource_version = 3;

// opts contains the options for the update.
UpdateOptions opts = 4;
```
UpdateGpuStatusRequest lacks a namespace field while other RPCs are namespace-aware, forcing the server implementation to hardcode "default" for status updates (see TODO in gpu_service.go). This makes it impossible to update status for GPUs created outside default and is an API inconsistency. Consider adding string namespace (and possibly object meta) to UpdateGpuStatusRequest now before the API stabilizes, or explicitly document/enforce that GPUs are cluster-singleton in the API.
Suggested change:

```proto
UpdateOptions opts = 4;

// namespace specifies the namespace scoping for the GPU whose status is being updated.
// An empty namespace is equivalent to the "default" namespace, but "default" is the
// canonical representation.
//
// Optional. Defaults to unset.
string namespace = 5;
```
| "github.com/nvidia/nvsentinel/pkg/controlplane/apiserver" | ||
| "github.com/nvidia/nvsentinel/pkg/controlplane/apiserver/options" | ||
| "github.com/nvidia/nvsentinel/pkg/storage/storagebackend" | ||
| "github.com/nvidia/nvsentinel/pkg/util/version" | ||
|
|
||
| // Import service providers so their init() functions register them. | ||
| _ "github.com/nvidia/nvsentinel/pkg/services/device/v1alpha1" | ||
| ) | ||
|
|
||
| const ( | ||
| // ComponentName is the name of this component for logging. | ||
| ComponentName = "device-api-server" | ||
| ) | ||
|
|
||
| func main() { | ||
| opts := options.NewOptions() | ||
|
|
||
| fss := cliflag.NamedFlagSets{} | ||
| opts.AddFlags(&fss) | ||
|
|
||
| // Add a version flag to the global flag set. | ||
| showVersion := pflag.Bool("version", false, "Show version and exit") | ||
|
|
||
| // Merge all named flag sets into the global pflag command line. | ||
| for _, fs := range fss.FlagSets { | ||
| pflag.CommandLine.AddFlagSet(fs) | ||
| } | ||
|
|
||
| pflag.Parse() | ||
|
|
||
| // Handle version flag before any other initialization. | ||
| if *showVersion { | ||
| v := version.Get() | ||
| enc := json.NewEncoder(os.Stdout) | ||
| enc.SetIndent("", " ") | ||
| if err := enc.Encode(v); err != nil { | ||
| fmt.Fprintf(os.Stderr, "Failed to encode version: %v\n", err) | ||
| os.Exit(1) | ||
| } | ||
| os.Exit(0) | ||
| } | ||
|
|
||
| // Set up signal handling for graceful shutdown. | ||
| ctx, cancel := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM) | ||
| defer cancel() | ||
|
|
||
| // Complete fills in defaults and resolves environment overrides. | ||
| completedOpts, err := opts.Complete(ctx) | ||
| if err != nil { | ||
| fmt.Fprintf(os.Stderr, "Failed to complete options: %v\n", err) | ||
| os.Exit(1) | ||
| } | ||
|
|
||
| // Validate rejects invalid flag combinations. | ||
| if errs := completedOpts.Validate(); len(errs) > 0 { | ||
| for _, e := range errs { | ||
| fmt.Fprintf(os.Stderr, "Invalid configuration: %v\n", e) | ||
| } | ||
| os.Exit(1) | ||
| } | ||
|
|
||
| // Create root logger with component name. | ||
| logger := klog.Background().WithName(ComponentName) | ||
| ctx = klog.NewContext(ctx, logger) | ||
|
|
||
| versionInfo := version.Get() | ||
| logger.Info("Starting server", | ||
| "version", versionInfo.GitVersion, | ||
| "commit", versionInfo.GitCommit, | ||
| "buildDate", versionInfo.BuildDate, | ||
| ) |
There was a problem hiding this comment.
cmd/device-api-server is importing github.com/nvidia/nvsentinel/pkg/util/version, but the build/linker flags in the updated Makefile are wired to github.com/nvidia/nvsentinel/pkg/version. As a result the reported version will stay at util/version defaults and won’t reflect ldflags. Please switch this file to use pkg/version (and update the logged field names accordingly), or revert Makefile/Dockerfile ldflags to target pkg/util/version consistently across the repo.
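For reference, a sketch of how build-time version stamping works: the `-X` package path in ldflags must match the package the binary actually imports. Variable and field names here are illustrative assumptions, not the repo's exact pkg/version contents:

```go
// Package version is a sketch of ldflags-stamped build metadata.
package version

// These defaults are overridden at build time, e.g. (assumed Makefile wiring):
//   go build -ldflags "\
//     -X github.com/nvidia/nvsentinel/pkg/version.gitVersion=$(VERSION) \
//     -X github.com/nvidia/nvsentinel/pkg/version.gitCommit=$(COMMIT) \
//     -X github.com/nvidia/nvsentinel/pkg/version.buildDate=$(DATE)" ./cmd/device-api-server
// If the binary imports pkg/util/version instead, these -X paths never match
// and the defaults below are what gets reported.
var (
	gitVersion = "v0.0.0-unset"
	gitCommit  = "unknown"
	buildDate  = "unknown"
)

// Info carries the stamped values.
type Info struct {
	GitVersion string
	GitCommit  string
	BuildDate  string
}

// Get returns the build-time version information.
func Get() Info {
	return Info{GitVersion: gitVersion, GitCommit: gitCommit, BuildDate: buildDate}
}
```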
```go
AdminServer     *grpc.Server
AdminCleanup    func()
Metrics         *metrics.ServerMetrics
MetricsRegistry *prometheus.Registry
```
I forgot to remove this, good catch!
```go
	healthpb.RegisterHealthServer(s.AdminServer, s.HealthServer)
	// Also register on DeviceServer so sidecar providers connecting via
	// unix socket can perform health checks without a separate connection.
	healthpb.RegisterHealthServer(s.DeviceServer, s.HealthServer)
}

if s.MetricsAddress != "" {
	// TODO: put in wg??
```
Meant to get back to this, thank you!
```go
// See the License for the specific language governing permissions and
// limitations under the License.

// Code generated by service-gen. DO NOT EDIT.
```
Note: all the code under pkg/service/device/<version>/ will eventually be generated.
```diff
-func (p *gpuServiceProvider) Install(svr *grpc.Server, storageConfig storagebackend.Config) (api.Service, error) {
+// Install creates the in-memory storage backend and registers the GPU service
+// on the provided gRPC server.
+func (p *gpuServiceProvider) Install(svr *grpc.Server, _ storagebackend.Config) (api.Service, error) {
```
This change hardcodes the storage backend to in-memory and completely ignores the passed storagebackend.Config.
This prevents a service from using durable storage (like etcd or an etcdshim) which was previously supported via factory.Create. This change effectively locks us into a single-node, non-durable architecture and removes the ability to configure the storage layer via flags.
Since the goal is to add an in-memory option, can we switch on the config instead? For example:
```go
// Pseudo-code
if storageConfig.Type == "memory" {
	// In-memory path
	s, destroy, err = memory.CreateStorage(...)
} else {
	// Use the standard K8s factory
	s, destroy, err = factory.Create(configForResource, ...)
}
```
```go
	resourcePrefix := path.Join("/", p.groupVersion.Group, "gpus")
	codecs := serializer.NewCodecFactory(scheme)
	info, _ := runtime.SerializerInfoForMediaType(codecs.SupportedMediaTypes(), runtime.ContentTypeJSON)
```
Minor: I think we need an error check on runtime.SerializerInfoForMediaType to prevent panics if the serializer isn't found.
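A minimal sketch of that guard; note that `SerializerInfoForMediaType` returns a boolean rather than an error, and ignoring it risks using a zero-value `SerializerInfo` later. The helper name is illustrative:

```go
package sketch

import (
	"fmt"

	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/serializer"
)

// jsonSerializer returns the JSON serializer registered on the scheme,
// failing loudly instead of silently continuing with a zero value.
func jsonSerializer(scheme *runtime.Scheme) (runtime.SerializerInfo, error) {
	codecs := serializer.NewCodecFactory(scheme)
	info, ok := runtime.SerializerInfoForMediaType(codecs.SupportedMediaTypes(), runtime.ContentTypeJSON)
	if !ok {
		return runtime.SerializerInfo{}, fmt.Errorf("no serializer registered for media type %q", runtime.ContentTypeJSON)
	}
	return info, nil
}
```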
```go
	key := s.storageKey(req.GetNamespace(), "")
	w, err := s.storage.Watch(ctx, key, storage.ListOptions{
		ResourceVersion: req.GetOpts().GetResourceVersion(),
```
Critical Issue
We cannot ignore the ResourceVersion (RV) from the request. By omitting req.GetOpts().GetResourceVersion() from storage.ListOptions, we break the fundamental List/Watch consistency contract.
Clients rely on passing a RV to ensure they don't miss events that occur in the split second between a List request finishing and a Watch request starting. Without it, clients can permanently miss events.
TestIntegration_WatchResumption demonstrates this: a client Lists, a new item is created, and then the client Watches using the List RV; the client never receives an event for the item created between the List and the Watch.
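For clarity, a small Go sketch of the List/Watch consistency contract described here; the client interface is an illustrative stand-in, not the generated API:

```go
package sketch

import "context"

type Event struct {
	Type   string
	Object any
}

// GpuLister is a stand-in for a client that supports List and Watch.
type GpuLister interface {
	// List returns the current items plus the resource version of the snapshot.
	List(ctx context.Context) (items []any, resourceVersion string, err error)
	// Watch streams events that happened strictly after resourceVersion.
	Watch(ctx context.Context, resourceVersion string) (<-chan Event, error)
}

// listAndWatch never drops the window between List and Watch, because Watch
// resumes from the List's resource version rather than from "now".
func listAndWatch(ctx context.Context, c GpuLister, handle func(Event)) error {
	items, rv, err := c.List(ctx)
	if err != nil {
		return err
	}
	for _, it := range items {
		handle(Event{Type: "ADDED", Object: it})
	}

	events, err := c.Watch(ctx, rv)
	if err != nil {
		return err
	}
	for ev := range events {
		handle(ev)
	}
	return nil
}
```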
```proto
}

// UpdateGpuStatusRequest specifies the GPU status to update.
message UpdateGpuStatusRequest {
```
This request structure deviates from the established design pattern for Updates in Kubernetes and in this file (see UpdateGpuRequest). Typically, update operations (including status updates) accept the whole object and use the metadata (e.g., RV) within it for addressing concurrency control.
Additionally, by splitting fields out manually we risk accidentally omitting necessary metadata (e.g., RV, namespace, generation, etc.).
Suggestion:
```proto
message UpdateGpuRequest {
  // gpu is the GPU resource whose status to update.
  Gpu gpu = 1;

  // opts contains the options for the update.
  UpdateOptions opts = 2;
}
```
```go
// Package version provides version information for the Device API Server.
// These values are set at build time via ldflags.
package version
```
This introduces a duplicate version package. We already have an implementation at pkg/util/version. We should combine these to prevent confusion about which to use.
I like the pkg/version import path you chose; it is cleaner and more discoverable. I suggest we move the logic from pkg/util/version into this new package and delete the existing directory.
The existing package includes an HTTP handler (for /version) and a UserAgent helper that are actively used. We should keep those.
We can remove RegisterComponent, it's not being used. Forgot to remove it earlier.
pkg/providers/nvml/provider.go (outdated)

```go
	nvmllib Library

	// gRPC client to communicate with Device API Server
	client v1alpha1.GpuServiceClient
```
We should use the generated Kubernetes-style client SDK (pkg/client-go) instead of raw gRPC service client.
Example:
```go
// Get the latest object to ensure we have the current ResourceVersion.
gpu, err := p.client.DeviceV1alpha1().GPUs().Get(ctx, gpuName, metav1.GetOptions{})
if err != nil {
	return err
}
// Modify status
gpu.Status = newStatus
// Update
_, err = p.client.DeviceV1alpha1().GPUs().UpdateStatus(ctx, gpu, metav1.UpdateOptions{})
```
pkg/providers/nvml/provider.go (outdated)

```go
	"github.com/NVIDIA/go-nvml/pkg/nvml"
	"k8s.io/klog/v2"

	v1alpha1 "github.com/nvidia/nvsentinel/internal/generated/device/v1alpha1"
```
Minor: Providers must never import internal packages.
```go
func NewOptions() *Options {
	return &Options{
		InMemory: true,
```
As noted in other comments, the current implementation has some gaps regarding ResourceVersion semantics, which will break controllers and informers that rely on those guarantees.
I suggest we make Kine (with SQLite) the default in-memory path. Because Kine is the default backend for K3s (a CNCF certified distro), we know it passes the upstream Kubernetes storage conformance tests. This gives us verified correctness out of the box.
We can still include this custom implementation, but perhaps behind a flag like --experimental-in-memory-storage until we can verify it against the full storage compliance suite.
Adds forward-looking guidance to the existing TODO(#720) comment, clarifying that the proto message must be extended with a namespace field when multi-namespace support is added. Addresses audit finding M3: UpdateGpuStatus hardcoded to default namespace.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
```go
	// 2. Check for pending page retirements (ECC errors)
	if pending, ret := device.GetRetiredPagesPendingStatus(); ret == nvml.SUCCESS {
		if pending == nvml.FEATURE_ENABLED {
			p.logger.V(1).Info("GPU has pending page retirements", "uuid", uuid)
			return false, nil
		}
	}

	// Device is accessible and no pending issues
```
The method GetRetiredPagesPendingStatus() is called on the device but is not defined in the Device interface (pkg/providers/nvml/interface.go:36-42). This will cause a compilation error when building with -tags=nvml. Add this method to the Device interface and implement it in both the wrapper and mock implementations.
Suggested change (removes the pending-retirement check and adjusts the comment):

```go
	// Device is accessible according to basic checks
```
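Relatedly, a minimal sketch of how the local Device interface and its mock could be extended as the comment above asks, assuming go-nvml exposes GetRetiredPagesPendingStatus on its Device type (the interface shape here is illustrative):

```go
package sketch

import "github.com/NVIDIA/go-nvml/pkg/nvml"

// Device is the local abstraction; it must cover every NVML call the health
// checker makes, including the pending-retirement query.
type Device interface {
	GetUUID() (string, nvml.Return)
	GetRetiredPagesPendingStatus() (nvml.EnableState, nvml.Return)
}

// nvmlDevice wraps the real handle.
type nvmlDevice struct{ dev nvml.Device }

func (d nvmlDevice) GetUUID() (string, nvml.Return) { return d.dev.GetUUID() }
func (d nvmlDevice) GetRetiredPagesPendingStatus() (nvml.EnableState, nvml.Return) {
	return d.dev.GetRetiredPagesPendingStatus()
}

// mockDevice is a test double with canned responses.
type mockDevice struct {
	uuid    string
	pending nvml.EnableState
}

func (m mockDevice) GetUUID() (string, nvml.Return) { return m.uuid, nvml.SUCCESS }
func (m mockDevice) GetRetiredPagesPendingStatus() (nvml.EnableState, nvml.Return) {
	return m.pending, nvml.SUCCESS
}
```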
Force-pushed from 046249c to 1c407f5.
Force-pushed from 1c407f5 to 5e6d23d.
```diff
 if labels != nil {
 	version := prometheus.NewGauge(prometheus.GaugeOpts{
 		Name:        "device_apiserver_build_info",
 		Help:        "Build information about the device-apiserver binary.",
-		ConstLabels: m.buildInfoLabels,
+		ConstLabels: labels,
```
The build-info metric still uses the name/help text "device_apiserver_build_info" / "device-apiserver" even though the binary/component is now "device-api-server". At minimum, consider updating the Help string and the service label value to match the new component name; if the metric name is kept for compatibility, documenting that would help avoid confusion.
```go
func (p *gpuServiceProvider) Install(svr *grpc.Server, cfg storagebackend.Config) (api.Service, error) {
	// Currently only in-memory storage is supported. The cfg parameter is
	// accepted for future extensibility but not used for backend selection.
	_ = cfg
```
Install() now unconditionally uses the in-memory storage backend and explicitly ignores the provided storagebackend.Config. This means configuring the server for a persistent backend won’t actually persist GPU resources, which is a surprising behavior change. Consider either wiring cfg into backend selection (restore the factory.Create path when persistence is desired) or returning an explicit error when a non-in-memory backend is configured but unsupported.
```go
	// Cap conditions to prevent unbounded growth
	const maxConditions = 100
	if len(gpu.Status.Conditions) > maxConditions {
		gpu.Status.Conditions = gpu.Status.Conditions[len(gpu.Status.Conditions)-maxConditions:]
	}
```
The PR description states MaxConditionsPerGpu = 32, but this provider-side cap uses maxConditions = 100. Please reconcile the intended limit (and ideally share a single constant) so behavior matches the documented design decision.
```go
	select {
	case <-w.done:
		w.logger.V(4).Info("Watcher already done, dropping error event")
	case w.result <- watch.Event{Type: watch.Error, Object: statusErr}:
	case <-time.After(5 * time.Second):
		w.logger.V(2).Info("Error event send timed out, dropping")
```
sendError also uses time.After in a select; repeated errors can accumulate timers similarly. Consider using a reusable timer (or a non-blocking send with explicit drop policy) to avoid timer churn/leaks.
```go
	// Validate target scheme
	if !strings.HasPrefix(c.Target, "unix://") && !strings.HasPrefix(c.Target, "unix:") &&
		!strings.HasPrefix(c.Target, "dns:") && !strings.HasPrefix(c.Target, "passthrough:") {
		return fmt.Errorf("gRPC target %q must use unix://, dns:, or passthrough: scheme", c.Target)
	}
```
Validate() accepts both "unix://" and "unix:" targets, but the error message only mentions "unix://" (and omits "unix:"). Please update the message to reflect the actual accepted schemes (unix://, unix:, dns:, passthrough:) to avoid confusing users.
| summary: "Device API Server is down on {{ "{{ $labels.instance }}" }}" | ||
| description: "Device API Server has been unreachable for more than 5 minutes." | ||
| runbook_url: "https://github.com/nvidia/device-api/blob/main/docs/operations/device-api-server.md#alert-deviceapiserverdown" | ||
|
|
There was a problem hiding this comment.
The PrometheusRule runbook_url links point to https://github.com/nvidia/device-api/... which doesn’t match this repository (nvidia/nvsentinel). If these docs live in-repo, update the URLs to the correct location to avoid broken runbook links.
```yaml
# -- Image configuration
image:
  # -- Image repository
  repository: ghcr.io/nvidia/device-api-server
  # -- Image pull policy
  pullPolicy: IfNotPresent
  # -- Image tag (defaults to Chart appVersion)
  tag: ""
```
Image references are inconsistent across deployment artifacts: the Helm chart defaults to ghcr.io/nvidia/device-api-server, while the static manifest uses ghcr.io/nvidia/nvsentinel/device-api-server. This can cause users to deploy the wrong images; consider standardizing on a single registry/repository naming scheme (or documenting why they differ).
```go
	case <-time.After(30 * time.Second):
		w.logger.Error(nil, "Event send timed out; consumer not reading, stopping watcher")
		return
```
Using time.After inside the per-event select allocates a new timer each loop iteration; if the send succeeds quickly, the timer still lives until it fires, which can cause unbounded timer buildup under load. Prefer a reusable time.Timer (stop+drain) or a non-allocating timeout strategy (e.g., context with deadline) for the send timeout.
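A small Go sketch of the reusable-timer pattern suggested in these two comments: one `time.Timer` per watcher, stopped, drained, and reset before each send, instead of a fresh `time.After` timer per event. Names are illustrative:

```go
package sketch

import "time"

type event struct{}

// sendWithTimeout reuses the caller-owned timer as the send deadline.
// It returns false if the consumer did not accept the event in time.
func sendWithTimeout(ch chan<- event, ev event, timer *time.Timer, timeout time.Duration) bool {
	// Stop and drain before reuse so a stale expiry cannot fire early.
	if !timer.Stop() {
		select {
		case <-timer.C:
		default:
		}
	}
	timer.Reset(timeout)

	select {
	case ch <- ev:
		return true
	case <-timer.C:
		return false
	}
}

// Usage: create the timer once per watcher, e.g.
//   timer := time.NewTimer(timeout)
//   defer timer.Stop()
//   for ev := range source {
//       if !sendWithTimeout(out, ev, timer, timeout) { return }
//   }
```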
```bash
go get github.com/nvidia/device-api/api@latest
```

```go
import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"github.com/nvidia/nvsentinel/pkg/client-go/clientset/versioned"
	"github.com/nvidia/nvsentinel/pkg/grpc/client"
	v1alpha1 "github.com/nvidia/device-api/api/gen/go/device/v1alpha1"
)
```

### Example: List GPUs

```go
package main

import (
	"context"
	"log"

	v1alpha1 "github.com/nvidia/device-api/api/gen/go/device/v1alpha1"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)
```
README examples reference the module path "github.com/nvidia/device-api/...", but this repo’s Go module is "github.com/nvidia/nvsentinel" (see go.mod). These import paths / go get instructions won’t work as written; please update them to the correct module paths (or clarify if the repo/module is being renamed).
```yaml
- name: device-api-server
  image: ghcr.io/nvidia/nvsentinel/device-api-server:v0.0.0 # Replace with specific version for production
  imagePullPolicy: IfNotPresent
  args:
    - --bind-address=unix:///var/run/device-api/device.sock
    - --health-probe-bind-address=:8081
    - --metrics-bind-address=:9090
    - --shutdown-grace-period=25s
    - -v=0
  ports:
    - name: health
      containerPort: 8081
      protocol: TCP
    - name: metrics
      containerPort: 9090
      protocol: TCP
  livenessProbe:
    httpGet:
      path: /healthz
      port: health
    initialDelaySeconds: 5
    periodSeconds: 10
    timeoutSeconds: 5
    failureThreshold: 3
  readinessProbe:
    httpGet:
      path: /readyz
      port: health
    initialDelaySeconds: 5
    periodSeconds: 10
    timeoutSeconds: 5
    failureThreshold: 3
  resources:
    requests:
      cpu: 50m
      memory: 64Mi
    limits:
      cpu: 200m
      memory: 256Mi
  securityContext:
    runAsNonRoot: true
    runAsUser: 65534
    runAsGroup: 65534
    readOnlyRootFilesystem: true
    allowPrivilegeEscalation: false
    capabilities:
      drop:
        - ALL
  volumeMounts:
    - name: device-api-socket
      mountPath: /var/run/device-api

# NVML Provider Sidecar - CGO binary, requires RuntimeClass nvidia
- name: nvml-provider
  image: ghcr.io/nvidia/nvsentinel/nvml-provider:v0.0.0 # Replace with specific version for production
  imagePullPolicy: IfNotPresent
```
This manifest references images under ghcr.io/nvidia/nvsentinel/... while the Helm chart defaults to ghcr.io/nvidia/device-api-server / device-api-server-sidecar. Please align these image repositories/tags (or explicitly document the intended difference) to avoid deployment confusion.
…ices

Implement a Kubernetes-style device API server with the following components:

Core server (cmd/device-api-server):
- gRPC-based controlplane API server using apiserver-runtime patterns
- GPU service with full CRUD + Watch + UpdateStatus via storage.Interface
- Server lifecycle management with graceful shutdown
- Health and metrics endpoints with gRPC reflection

Storage (pkg/storage):
- In-memory storage.Interface implementation with watch support
- Configurable watch channel buffer sizes and event drop metrics
- Factory pattern for storage backend selection

NVML provider (cmd/nvml-provider):
- Sidecar that enumerates GPUs via NVML and registers them with the API
- XID error event monitoring with health condition updates
- Reconciliation loop with configurable intervals
- Environment-based driver root configuration (NVIDIA_DRIVER_ROOT)

Client SDK (pkg/client-go):
- Typed gRPC client with Get/List/Watch/Create/Update/UpdateStatus/Delete
- Fake client for testing with k8s.io/client-go/testing integration
- Informers and listers following Kubernetes client-go conventions
- Clientset pattern for versioned API access

Code generator (code-generator):
- Fork of k8s.io/code-generator/cmd/client-gen for gRPC backends
- Generates typed clients, fake clients, and expansion interfaces
- Full UpdateStatus template with proper gRPC implementation
- Integrated into hack/update-codegen.sh pipeline

Proto API (api/proto/device/v1alpha1):
- GPU resource with spec (UUID) and status (conditions, recommendedAction)
- Standard CRUD + Watch + UpdateStatus RPCs
- K8s-style request/response patterns with options

Security hardening:
- gRPC message size and stream limits
- Server error detail scrubbing for client responses
- Unix socket path validation and restrictive permissions
- Localhost-only enforcement for insecure credentials

Deployment (deployments/helm):
- Helm chart with configurable storage, gRPC, health, and metrics
- Static manifests with versioned image references
- Dockerfile with pinned base images

Testing:
- Unit tests for storage, services, providers, and utilities
- Integration tests for client-go with full gRPC stack
- Shared testutil with bufconn gRPC test helpers
- Fake client examples for consumer testing patterns

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Force-pushed from 5e6d23d to 4b4a6da.
Summary

Implements the Device API Server, a node-local gRPC cache server for GPU device state, deployed as a Kubernetes DaemonSet. It acts as an intermediary between providers (e.g., NVSentinel health monitors that update GPU state) and consumers (e.g., Device Plugins, DRA Drivers that read state for scheduling decisions).

Architecture

Why read-blocking semantics?

The `sync.RWMutex` with writer preference ensures consumers never read stale "healthy" state while a provider is updating it to "unhealthy". Once a write is pending, new readers block until the write completes. This is critical for correct scheduling decisions.
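A minimal sketch of such a read-blocking, clone-safe cache; `GpuState` and the method names are illustrative stand-ins, not the PR's actual cache package:

```go
package sketch

import "sync"

// GpuState is an illustrative stand-in for the proto-backed GPU record.
type GpuState struct {
	UUID            string
	Healthy         bool
	ResourceVersion int64
}

// Cache serializes writers ahead of readers: a pending Set blocks new Gets
// until the update lands, so consumers never act on stale health state.
type Cache struct {
	mu      sync.RWMutex
	version int64
	gpus    map[string]GpuState
}

func NewCache() *Cache {
	return &Cache{gpus: make(map[string]GpuState)}
}

// Set installs the new state under the write lock and bumps the monotonic
// resource version.
func (c *Cache) Set(state GpuState) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.version++
	state.ResourceVersion = c.version
	c.gpus[state.UUID] = state
}

// Get blocks while a write is in flight and returns a copy of the entry,
// so callers can never mutate shared cache state.
func (c *Cache) Get(uuid string) (GpuState, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	s, ok := c.gpus[uuid]
	return s, ok
}
```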
NVML Provider (Sidecar)

The NVML provider runs as a sidecar container that:
- registers each GPU via `CreateGpu`
- calls `UpdateGpuStatus` when health changes

It is decoupled as a sidecar for independent updates and glibc isolation (NVML requires `RTLD_DEEPBIND`).

Key Design Decisions
- … (`//go:build nvml`)

Key Components
- Thread-safe cache built on `sync.RWMutex`, with clone-safety and monotonic resource versioning
- `ListAndSubscribe` to prevent duplicate events; slow subscriber isolation (sketched at the end of this description)

Type of Change

Test Results
- Tests pass with the `-race` flag (cache: 34, service: 21, nvml: 22, metrics: 7, config: 2)
- `go vet` clean (including `-tags nvml`)
- `proto.Clone` (verified via clone-safety tests)

Checklist
- … (`make protos-generate`)
- … `:port` and `host:port` formats
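As referenced in Key Components above, a sketch of the `ListAndSubscribe` idea: snapshot and subscription are taken under the same lock, so a subscriber sees every update exactly once, either in the returned list or on its channel. Types and buffer size are illustrative, not the PR's actual broadcaster:

```go
package sketch

import "sync"

type Update struct {
	UUID    string
	Healthy bool
}

type Broadcaster struct {
	mu      sync.Mutex
	current map[string]Update
	subs    []chan Update
}

func NewBroadcaster() *Broadcaster {
	return &Broadcaster{current: make(map[string]Update)}
}

// ListAndSubscribe atomically returns the current state and a channel that
// carries only updates published after that snapshot, preventing both missed
// and duplicated events.
func (b *Broadcaster) ListAndSubscribe() ([]Update, <-chan Update) {
	b.mu.Lock()
	defer b.mu.Unlock()
	snapshot := make([]Update, 0, len(b.current))
	for _, u := range b.current {
		snapshot = append(snapshot, u)
	}
	ch := make(chan Update, 64)
	b.subs = append(b.subs, ch)
	return snapshot, ch
}

// Publish records the update and fans it out; subscribers that cannot keep
// up with the buffer are evicted rather than blocking everyone else.
func (b *Broadcaster) Publish(u Update) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.current[u.UUID] = u
	kept := b.subs[:0]
	for _, ch := range b.subs {
		select {
		case ch <- u:
			kept = append(kept, ch)
		default:
			close(ch) // evict slow subscriber
		}
	}
	b.subs = kept
}
```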