feat: cloud-native GPU health event management#795
feat: cloud-native GPU health event management#795ArangoGutierrez wants to merge 58 commits intomainfrom
Conversation
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Signed-off-by: Dan Huenecke <dhuenecke@nvidia.com>
Signed-off-by: Dan Huenecke <dhuenecke@nvidia.com>
Add ProviderService with RPCs for GPU lifecycle management: - RegisterGpu / UnregisterGpu - UpdateGpuStatus / UpdateGpuCondition Also adds resource_version field to Gpu message for optimistic concurrency control. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Implements the Device API Server - a gRPC server providing unified GPU device information via GpuService (consumer) and ProviderService (provider) APIs. Components: - In-memory cache with RWMutex for thread-safe GPU state - Watch broadcaster for real-time state change notifications - NVML fallback provider for GPU enumeration and XID health monitoring - Prometheus metrics (cache, watch, NVML, gRPC stats) - Helm chart with ServiceMonitor and alerting rules Includes comprehensive unit tests (35 tests, race-clean) and documentation (API reference, operations guide, design docs). Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
The gRPC API is unauthenticated, which poses a security risk when bound to all network interfaces. This change: - Changes default GRPCAddress from ":50051" to "127.0.0.1:50051" - Updates flag documentation with security warning - Expands network security documentation per Copilot review - Updates design and operations docs to reflect new default Users who need network access can explicitly set --grpc-address=:50051, with full awareness of the security implications. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
The Heartbeat RPC is intentionally defined but unimplemented, reserved for future provider liveness detection. This commit: - Improves the error message to explain the reservation purpose - Updates code comments to clarify this is intentional - Aligns implementation comments with proto documentation Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Add ResourceVersion() method to GpuCache and use it to populate ListMeta.ResourceVersion in ListGpus responses. This enables: - Optimistic concurrency patterns in consumers - Cache invalidation based on version changes - Consistency tracking across list operations The resource version is a monotonically increasing counter that increments on every cache mutation. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Add module structure documentation to DEVELOPMENT.md to clarify the relationship between the different Go modules in this monorepo: - Device API Server uses github.com/nvidia/device-api - API definitions use github.com/nvidia/device-api/api - client-go and code-generator use github.com/nvidia/nvsentinel This documents the intentional divergence where device-api is a standalone component that may be published as its own repository. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
The PR introduced inconsistent module paths where the api module was changed to github.com/nvidia/device-api/api while client-go still uses github.com/nvidia/nvsentinel/api. This broke CI. This commit aligns all modules to use github.com/nvidia/nvsentinel: - api/go.mod: github.com/nvidia/nvsentinel/api - go.mod: github.com/nvidia/nvsentinel - All imports updated to use nvsentinel path - Proto go_package options updated - Generated files updated Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Addresses review feedback from pteranodan on PR #720 regarding API design and NVML provider coupling. BREAKING CHANGES: - Removed ProviderService in favor of unified GpuService - RegisterGpu() → CreateGpu() - UnregisterGpu() → DeleteGpu() - UpdateGpuCondition() → removed (use UpdateGpuStatus with full status) API Changes: - Consolidated GpuService with both read and write operations - Added standard K8s CRUD methods: CreateGpu, UpdateGpu, UpdateGpuStatus, DeleteGpu - Removed custom provider methods (RegisterGpu, UnregisterGpu, UpdateGpuCondition) - Deleted api/proto/device/v1alpha1/provider.proto - Updated gpu.proto with unified service definition Service Layer: - Created pkg/deviceapiserver/service/gpu_service.go (unified service) - Deleted consumer.go and provider.go (replaced by gpu_service.go) - Updated integration tests to use new API - Improved cache interface with Create, Update, UpdateStatusWithVersion, Delete NVML Provider Decoupling: - NVML provider now uses GpuServiceClient instead of direct cache access - Server creates loopback gRPC connection for embedded NVML provider - Added standalone cmd/nvml-provider/ binary for sidecar deployment - Provider now "dogfoods" its own API via gRPC Documentation: - Updated README.md with new unified API architecture - Rewrote docs/api/device-api-server.md with single service model - Updated docs/operations/device-api-server.md - Simplified docs/design/device-api-server.md - Removed obsolete task lists and design documents Deployment: - Reorganized charts: charts/device-api-server/ → deployments/helm/nvsentinel/ - Added deployments/container/Dockerfile - Added deployments/static/ for static manifests - Added demos/nvml-sidecar-demo.sh Other: - Updated .gitignore for planning documents - Enhanced Makefile with new targets - Added metrics for unified service This refactoring improves: 1. API consistency with Kubernetes conventions 2. Tooling compatibility (code generators, controller-runtime) 3. Decoupling of NVML provider from server internals 4. Flexibility to run NVML as embedded or sidecar 5. Overall code maintainability and clarity Signed-off-by: Eduardo Arango <eduardoa@nvidia.com>
Update Go version to 1.25 across all module files and the Dockerfile to ensure consistent builds. The previous version (1.25.5) was a non-existent patch version, and the Dockerfile was using 1.24 which doesn't have Docker Hub images available. Changes: - go.mod: 1.25.5 -> 1.25 - api/go.mod: 1.25.5 -> 1.25 - Dockerfile: 1.24 -> 1.25 Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
The flag parsing was broken because flag.Parse() was called in main() after parseFlags() had already returned the config struct. This meant command-line arguments like --server-address were never applied to the config. Move flag.Parse() inside parseFlags() after defining all flags but before returning, so the parsed values are captured in the returned config struct. Also reorder klog.InitFlags() to be called before parseFlags() so klog flags are also registered before parsing. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
The daemonset template was passing --provider-address to the device-api-server container when nvmlProvider.enabled was true, but this flag doesn't exist in the device-api-server binary. The sidecar architecture doesn't require the server to know about providers - providers connect TO the server via gRPC. Remove the invalid flag to fix container crashes. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Multiple fixes for the sidecar test values: - serverAddress: Change from localhost:9001 to localhost:50051 to match the actual gRPC port the device-api-server listens on - nodeSelector: Explicitly null out nvidia.com/gpu.present to prevent Helm from merging it with the node-type=gpu selector. The test cluster only has node-type=gpu labels. - securityContext: Run as root (uid 0) to allow the server to create Unix sockets in the hostPath /var/run/device-api/ directory. The non-root user (65534) cannot write to this host directory. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Major improvements to the NVML sidecar demo script: Cross-platform builds: - Use docker buildx for linux/amd64 builds (required when building on ARM Macs for x86 clusters) - Use --push flag to push directly after build (required for cross-platform builds which can't be stored locally) - Check for buildx availability at startup Idempotency: - Check if images already exist on ttl.sh before building - Prompt user whether to rebuild existing images - Use helm upgrade if release exists, helm install if not - Use rollout restart instead of relying on --wait timeout - Add 2-minute rollout timeout with status check on failure Bug fixes: - Fix namespace creation piped command (color codes corrupted YAML) - Fix pod label selector (app.kubernetes.io/name=nvsentinel) Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
The in-pod metrics check was failing because the container name is 'nvsentinel' (from Chart.Name), not 'device-api-server'. Rather than add complexity to detect the container name, remove this redundant check - metrics are already validated in Step 7 (show_metrics). Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
…-api-server'
The Helm chart names the main container using {{ .Chart.Name }}, which resolves to 'nvsentinel'. Fixed all kubectl logs/exec commands to use the correct container name.
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
The committed protobuf files were generated with protoc v33.4, but CI was installing v33.0. This version mismatch caused verify-codegen to fail because regenerating with a different protoc version produces different file headers.
Comprehensive design document capturing: - In-memory cache architecture (replacing Kine/SQLite) - Memory safety bounds sized for Vera Rubin Ultra NVL576 (1024 GPUs) - Readiness gate blocking consumers until providers register - Watch broadcaster with slow subscriber eviction - What to take from each PR - Implementation phases - Authorship preservation requirements This document serves as the spec for cherry-picking #720 onto merged #718. Co-authored-by: Carlos Eduardo Arango Gutierrez <carangog@redhat.com> Co-authored-by: Dan Stone <danstone@nvidia.com>
Prepare for isolated worktree development workflow. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
This commit introduces a Kubernetes-native approach to GPU health monitoring and remediation, replacing the previous MongoDB-centric architecture. New components: - HealthEvent CRD (nvsentinel.nvidia.com/v1alpha1) for cluster-wide state - QuarantineController: Cordons nodes on fatal GPU errors - DrainController: Gracefully evicts pods from quarantined nodes - RemediationController: Executes remediation strategies (reboot, gpu-reset) - TTLController: Automatic cleanup of resolved events based on policy - HealthProvider service: gRPC-based GPU health monitoring using NVML - CRDPublisher: Publishes GPU health events from device-api-server Key features: - Automatic phase progression: New → Quarantined → Drained → Remediated - Configurable TTL policies via ConfigMap - Prometheus metrics for all controllers - Comprehensive unit tests - Integration test harness (cmd/controller-test) Helm chart updates: - CRD installation under deployments/helm/nvsentinel/crds/ - TTL policy ConfigMap template - Health-provider DaemonSet chart Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the
✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Provides persistent context for AI assistants working on the cloud-native GPU health event management feature, including architecture diagrams, key file locations, and integration test procedures. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
10c3967 to
1a8dd5d
Compare
Document the architecture decision for migrating E2E tests from MongoDB-based to CRD-based architecture: - Update existing tests (don't rebuild) - Replace MongoDB assertions with CRD assertions - 4-phase implementation plan (3-4 weeks) - Success criteria and test file mapping Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
There was a problem hiding this comment.
Pull request overview
Introduces Kubernetes-native GPU health event management by adding a gRPC-based Device API client SDK, a customized code generator (client-gen fork) to produce gRPC-backed clientsets/informers, and new binaries/utilities to run the health-provider and controller integration tests.
Changes:
- Add
code-generatormodule (modifiedclient-gen) to generate gRPC-backed clientsets, informers, listers, and fakes. - Add
client-gomodule implementing gRPC transport (nvgrpc) plus generated client artifacts and runnable examples/tests. - Add/extend runtime entrypoints for health-provider (NVML + stub builds), device-api-server startup, and a controller integration test runner.
Reviewed changes
Copilot reviewed 116 out of 185 changed files in this pull request and generated 12 comments.
Show a summary per file
| File | Description |
|---|---|
| code-generator/go.mod | New Go module for custom gRPC-aware code generation tooling |
| code-generator/cmd/client-gen/types/types.go | Forked/modified upstream client-gen types to support protobuf package wiring |
| code-generator/cmd/client-gen/types/helpers_test.go | Copied upstream tests for type helpers to keep behavior consistent |
| code-generator/cmd/client-gen/types/helpers.go | Copied upstream helper logic for group/version parsing & sorting |
| code-generator/cmd/client-gen/main.go | New entrypoint for custom client-gen execution |
| code-generator/cmd/client-gen/generators/util/tags_test.go | Copied upstream tests validating client-gen tag parsing |
| code-generator/cmd/client-gen/generators/util/gvpackages.go | Copied upstream utility for splitting path/group-version |
| code-generator/cmd/client-gen/generators/scheme/generator_for_scheme.go | Modified scheme generator to support gRPC-oriented layout |
| code-generator/cmd/client-gen/generators/generator_for_expansion.go | Copied upstream expansion generator behavior |
| code-generator/cmd/client-gen/generators/generator_for_clientset.go | Modified clientset generator templates to use nvgrpc transport |
| code-generator/cmd/client-gen/generators/fake/generator_fake_for_group.go | Modified fake client generation for gRPC-backed typed clients |
| code-generator/cmd/client-gen/generators/fake/generator_fake_for_clientset.go | Modified fake clientset generation to satisfy client-go interfaces |
| code-generator/cmd/client-gen/generators/fake/fake_client_generator.go | Targets wiring for fake typed clients and fake clientset generation |
| code-generator/cmd/client-gen/args/gvtype.go | Args parsing for included type overrides (forked upstream) |
| code-generator/cmd/client-gen/args/gvpackages_test.go | Tests for input group/version package resolution |
| code-generator/cmd/client-gen/args/gvpackages.go | Input group/version package resolution including base path support |
| code-generator/cmd/client-gen/args/args.go | Custom args including --proto-base for protobuf stub imports |
| code-generator/cmd/client-gen/README.md | Documentation for using the custom gRPC client generator |
| code-generator/README.md | Overview of code-generator module purpose and usage |
| code-generator/CONFIGURATION.md | How generator tool versions are sourced/overridden |
| cmd/health-provider/main_stub.go | Stub health-provider binary for builds without NVML |
| cmd/health-provider/main.go | NVML-enabled health-provider service entrypoint |
| cmd/device-api-server/main.go | Device API Server main wiring and config/logging startup |
| cmd/controller-test/main.go | Local integration test runner for HealthEvent controllers |
| client-go/version/version_test.go | Unit test for client-go User-Agent formatting |
| client-go/version/version.go | User-Agent/version injection for gRPC client identification |
| client-go/tests/device/v1alpha1/fake_client_test.go | Tests for generated fake clientset CRUD/list/watch behaviors |
| client-go/nvgrpc/watcher.go | Watch adapter turning gRPC stream events into watch.Interface |
| client-go/nvgrpc/options_test.go | Unit tests for dial option accumulation behavior |
| client-go/nvgrpc/options.go | Dial options (logger, unary/stream interceptors) for gRPC client |
| client-go/nvgrpc/interceptors_test.go | Unit tests for latency interceptors behavior |
| client-go/nvgrpc/interceptors.go | Latency interceptors for unary & stream establishment |
| client-go/nvgrpc/config_test.go | Unit tests for config defaulting/validation behavior |
| client-go/nvgrpc/config.go | Client config defaulting + env override handling for target/user-agent |
| client-go/nvgrpc/client_conn_test.go | Tests ensuring config defaulting doesn’t mutate caller input |
| client-go/nvgrpc/client_conn.go | Central gRPC connection factory with keepalive + interceptor chaining |
| client-go/listers/device/v1alpha1/gpu.go | Generated lister for GPU resources |
| client-go/listers/device/v1alpha1/expansion_generated.go | Generated lister expansion hook |
| client-go/informers/externalversions/internalinterfaces/factory_interfaces.go | Generated informer internal interfaces |
| client-go/informers/externalversions/generic.go | Generated generic informer routing by GVR |
| client-go/informers/externalversions/device/v1alpha1/interface.go | Generated device group informer interfaces |
| client-go/informers/externalversions/device/v1alpha1/gpu.go | Generated GPU informer driven by versioned client |
| client-go/informers/externalversions/device/interface.go | Generated group-level informer interfaces |
| client-go/hack/update-codegen.sh | Script to regenerate client-go artifacts using custom codegen |
| client-go/hack/boilerplate.go.txt | Boilerplate header for generated client-go files |
| client-go/go.mod | New client-go module dependencies (grpc + k8s libs) |
| client-go/examples/watch/main.go | Example: watch stream with custom interceptors and shutdown |
| client-go/examples/watch/README.md | Documentation for watch example |
| client-go/examples/fake-server/README.md | Documentation for fake device-api server example |
| client-go/examples/fake-client/main_test.go | Example test showing informer + fake client integration |
| client-go/examples/fake-client/README.md | Documentation for fake-client example |
| client-go/examples/controller/main.go | Example: controller-runtime informer injection using gRPC informer |
| client-go/examples/controller/README.md | Documentation for controller example |
| client-go/examples/client/main.go | Example: basic list/get client usage |
| client-go/examples/client/README.md | Documentation for basic client example |
| client-go/examples/README.md | Index of examples and recommended local workflow |
| client-go/client/versioned/typed/device/v1alpha1/gpu.go | Generated typed GPU client backed by gRPC/protobuf |
| client-go/client/versioned/typed/device/v1alpha1/generated_expansion.go | Generated typed client expansion hook |
| client-go/client/versioned/typed/device/v1alpha1/fake/fake_gpu.go | Generated fake typed GPU client |
| client-go/client/versioned/typed/device/v1alpha1/fake/fake_device_client.go | Generated fake group client |
| client-go/client/versioned/typed/device/v1alpha1/fake/doc.go | Generated fake typed client package docs |
| client-go/client/versioned/typed/device/v1alpha1/doc.go | Generated typed client package docs |
| client-go/client/versioned/typed/device/v1alpha1/device_client.go | Generated group client constructor wiring |
| client-go/client/versioned/scheme/register.go | Generated scheme registration for client-go |
| client-go/client/versioned/scheme/doc.go | Generated scheme package docs |
| client-go/client/versioned/fake/register.go | Generated scheme registration for fake clientset |
| client-go/client/versioned/fake/doc.go | Generated fake clientset package docs |
| client-go/client/versioned/fake/clientset_generated.go | Generated fake clientset implementation |
| client-go/client/versioned/clientset.go | Generated versioned clientset entrypoint |
| client-go/README.md | Client SDK overview, installation, usage patterns, and examples |
| client-go/Makefile | Client-go local dev workflow (codegen, test, build, lint, clean) |
| client-go/DEVELOPMENT.md | Developer guide for generated vs manual code areas |
| api/nvsentinel/v1alpha1/groupversion_info.go | New API group registration for HealthEvent CRD types |
| api/nvsentinel/v1alpha1/doc.go | Package docs and generator markers for nvsentinel API group |
| api/hack/update-codegen.sh | API module generation script (protos + deepcopy + goverter) |
| api/hack/boilerplate.go.txt | Boilerplate header for generated API files |
| api/go.mod | API module dependency updates (grpc/protobuf + k8s/apimachinery) |
| api/device/v1alpha1/zz_generated.goverter.go | Generated goverter conversions between protobuf and KRM |
| api/device/v1alpha1/zz_generated.deepcopy.go | Generated deepcopy helpers for Kubernetes runtime objects |
| api/device/v1alpha1/register.go | device API group registration (GPU/GPUList) |
| api/device/v1alpha1/gpu_types.go | GPU KRM type definitions + client-gen markers/verbs |
| api/device/v1alpha1/gpu_conversion_test.go | Tests for protobuf ↔ KRM conversion correctness |
| api/device/v1alpha1/gpu_conversion.go | Wrapper conversion helpers selecting generated goverter impl |
| api/device/v1alpha1/doc.go | Package docs and groupName marker for device API group |
| api/device/v1alpha1/converter.go | Goverter mapping configuration and helper conversions |
| api/README.md | API module structure + generation pipeline overview |
| api/Makefile | New API module workflow (code-gen, manifests, test, build, tools) |
| api/DEVELOPMENT.md | API dev workflow and conventions documentation |
| .versions.yaml | Tool/version pin updates for codegen-related tooling |
| .golangci.yml | Lint config updates (parallel runners, govet enabled, exclusions) |
| .gitattributes | Mark generated/copied code paths as linguist-generated |
cmd/health-provider/main.go
Outdated
| klog.InitFlags(nil) | ||
|
|
||
| cfg, nvmlCfg := parseFlags() | ||
| flag.Parse() |
There was a problem hiding this comment.
flag.Parse() is invoked twice (once in main and once inside parseFlags). With Go's default CommandLine FlagSet (ExitOnError), the second parse will terminate the process with a "Parse called twice" error. Fix by parsing flags exactly once: typically, have parseFlags only register/bind flags and return defaults, then call flag.Parse() once in main, and apply env overrides after parsing.
| flag.Parse() |
| // Parse flags first | ||
| flag.Parse() | ||
|
|
There was a problem hiding this comment.
flag.Parse() is invoked twice (once in main and once inside parseFlags). With Go's default CommandLine FlagSet (ExitOnError), the second parse will terminate the process with a "Parse called twice" error. Fix by parsing flags exactly once: typically, have parseFlags only register/bind flags and return defaults, then call flag.Parse() once in main, and apply env overrides after parsing.
cmd/health-provider/main_stub.go
Outdated
| cfg, nvmlCfg := parseFlags() | ||
| flag.Parse() |
There was a problem hiding this comment.
Same double-parse issue as the NVML build: parseFlags calls flag.Parse(), and main calls flag.Parse() again. This will exit at runtime with "Parse called twice". Remove one of the parses (recommended: do not parse inside parseFlags) and keep env overrides applied after the single parse.
| flag.Parse() | ||
|
|
There was a problem hiding this comment.
Same double-parse issue as the NVML build: parseFlags calls flag.Parse(), and main calls flag.Parse() again. This will exit at runtime with "Parse called twice". Remove one of the parses (recommended: do not parse inside parseFlags) and keep env overrides applied after the single parse.
| allGroups := clientgentypes.ToGroupVersionInfo(g.groups, g.groupGoNames) | ||
| m := map[string]interface{}{ | ||
| "allGroups": allGroups, | ||
| "fmtErrorf": c.Universe.Type(types.Name{Package: "fmt", Name: "Errorf"}), |
There was a problem hiding this comment.
fmt.Errorf is a function, but this code looks it up via c.Universe.Type(...). In gengo, functions should be retrieved via c.Universe.Function(...). Leaving this as Type risks returning nil and causing template execution to fail or generate invalid references when emitting $.fmtErrorf|raw$.
| "fmtErrorf": c.Universe.Type(types.Name{Package: "fmt", Name: "Errorf"}), | |
| "fmtErrorf": c.Universe.Function(types.Name{Package: "fmt", Name: "Errorf"}), |
client-go/README.md
Outdated
|
|
||
| metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" | ||
| "k8s.io/apimachinery/pkg/api/meta" | ||
| "github.com/nvidia/nvsentinel/client-go/clientset/versioned" |
There was a problem hiding this comment.
The Quick Start example imports github.com/nvidia/nvsentinel/client-go/clientset/versioned, but this PR introduces the generated client under client-go/client/versioned (e.g., client-go/client/versioned/clientset.go). Update the import path in the README snippet so it compiles for users copying it.
| "github.com/nvidia/nvsentinel/client-go/clientset/versioned" | |
| "github.com/nvidia/nvsentinel/client-go/client/versioned" |
|
|
||
| var newClientsetForConfigAndClientTemplate = ` | ||
| // NewForConfigAndClient creates a new Clientset for the given config and gRPC client connection. | ||
| // The provided gRPC client connection provided takes precedence over the configured transport values. |
There was a problem hiding this comment.
The comment has a duplicated word (provided ... provided). Reword to remove the duplication for readability.
| // The provided gRPC client connection provided takes precedence over the configured transport values. | |
| // The provided gRPC client connection takes precedence over the configured transport values. |
| func NewWatcher( | ||
| source Source, | ||
| cancel context.CancelFunc, | ||
| logger logr.Logger, | ||
| ) watch.Interface { |
There was a problem hiding this comment.
Watcher is core to informer reliability (mapping gRPC stream events/errors into watch.Interface semantics), but there are no unit tests covering event-type mapping (ADDED/MODIFIED/DELETED/ERROR), normal stream shutdown (EOF/Canceled), or the gRPC-code → metav1.Status mapping in sendError. Add tests using a fake Source implementation to validate the emitted watch events and termination behavior.
| // nolint:cyclop // Complexity is necessary to handle various gRPC stream states and event types. | ||
| func (w *Watcher) receive() { |
There was a problem hiding this comment.
Watcher is core to informer reliability (mapping gRPC stream events/errors into watch.Interface semantics), but there are no unit tests covering event-type mapping (ADDED/MODIFIED/DELETED/ERROR), normal stream shutdown (EOF/Canceled), or the gRPC-code → metav1.Status mapping in sendError. Add tests using a fake Source implementation to validate the emitted watch events and termination behavior.
| } | ||
| } | ||
|
|
||
| func (w *Watcher) sendError(err error) { |
There was a problem hiding this comment.
Watcher is core to informer reliability (mapping gRPC stream events/errors into watch.Interface semantics), but there are no unit tests covering event-type mapping (ADDED/MODIFIED/DELETED/ERROR), normal stream shutdown (EOF/Canceled), or the gRPC-code → metav1.Status mapping in sendError. Add tests using a fake Source implementation to validate the emitted watch events and termination behavior.
Audit results for 4 test files (9 test scenarios): - smoke_test.go: Full fatal/unsupported event flows - fault_quarantine_test.go: CEL, pre-cordon, circuit breaker, strategy - node_drainer_test.go: Eviction modes, partial drain - fault_remediation_test.go: RebootNode CR creation Key findings: - All tests use SendHealthEvent() HTTP POST → need CRD creation - Node label state checks → need CRD status.phase checks - MongoDB change streams → need K8s informer/watch - 6 new helper functions required for CRD operations Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Add new helper functions for E2E tests to work with HealthEvent CRDs: Builder pattern: - NewHealthEventCRD() - fluent builder with sensible defaults - With* methods for all spec fields CRD operations: - CreateHealthEventCRD, GetHealthEventCRD, DeleteHealthEventCRD - ListHealthEventCRDs, ListHealthEventCRDsForNode - DeleteAllHealthEventCRDs Phase waiting: - WaitForHealthEventPhase, WaitForHealthEventPhaseNotEqual - WaitForHealthEventCondition, WaitForHealthEventPhaseSequence Assertions: - AssertHealthEventPhase, AssertHealthEventNotExists - AssertHealthEventHasCondition, AssertHealthEventNeverReachesPhase - AssertNodeQuarantinedCondition, AssertPodsDrainedCondition Migration helpers: - SendHealthEventViaCRD - drop-in replacement for SendHealthEvent - SendHealthyEventViaCRD - drop-in replacement for SendHealthyEvent - TriggerFullRemediationFlowViaCRD - CRD version of flow trigger Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
- Register nvsentinelv1alpha1.HealthEvent types in test scheme - Move keyHealthEventName to main_test.go (shared across build tags) - Remove unused v1 import from smoke_test.go Phase 4 validation confirmed: - Controllers process HealthEvent CRDs correctly - Phase transitions work: New → Quarantined → Draining - Node cordoning works correctly Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
The E2E tests workflow was designed for the old microservices architecture with MongoDB/PostgreSQL datastores. The new CRD-based system doesn't use these datastores - HealthEvents are now stored as Kubernetes Custom Resources. E2E testing for the new architecture will be handled differently. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 16
Note
Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.gitignore (1)
41-43:⚠️ Potential issue | 🟡 MinorRemove duplicate comment line.
The JetBrains IDEs section header comment is duplicated on consecutive lines.
🧹 Proposed fix
### JetBrains IDEs (GoLand, PyCharm, IntelliJ) ### -### JetBrains IDEs (GoLand, PyCharm, IntelliJ) ### # User-specific stuff
🤖 Fix all issues with AI agents
In @.gitattributes:
- Around line 1-21: Add the Apache 2.0 license header to the top of
.gitattributes: insert the standard multi-line Apache-2.0 header text (including
copyright line, SPDX identifier or full license notice and year/owner as per
project convention) as the first lines of the file so the file contains the
required license header before any content; ensure formatting matches other
repository files' headers.
In @.github/workflows/ci.yml:
- Around line 40-70: The "go" job references matrix variables (matrix.step_name,
matrix.make_command, matrix.component) but lacks a strategy.matrix; add a
strategy block to the "go" job that defines the matrix entries used (e.g.,
step_name, make_command, component) or restore the original strategy.matrix if
it was removed so those variables are defined for the job; update the job's
top-level stanza "go" to include strategy: matrix: with the appropriate
permutations that feed the referenced matrix.step_name, matrix.make_command and
matrix.component.
In `@api/proto/device/v1alpha1/gpu.proto`:
- Around line 360-375: UpdateGpuStatusRequest currently defines resource_version
as int64 while ObjectMeta.resource_version is a string, causing type mismatch;
change UpdateGpuStatusRequest.resource_version to string to match
ObjectMeta.resource_version (or vice versa if you choose numeric versioning) and
update any callers/serializers that read/write
UpdateGpuStatusRequest.resource_version and related handlers (e.g., the
UpdateGpuStatus RPC handling code and client usage) to use the new string type
so optimistic-concurrency checks compare the same type across messages.
- Around line 210-222: The RPC docs for CreateGpu and the
CreateGpuResponse.created field conflict; decide whether CreateGpu should return
an ALREADY_EXISTS error or use idempotent semantics with
CreateGpuResponse.created (true/false) returning the existing GPU. Then make
them consistent: if you choose idempotent upsert, update the CreateGpu RPC
comment to state it returns success with created=false and the existing GPU when
a name already exists and keep/CreateGpuResponse.created semantics; if you
choose error-on-duplicate, remove or repurpose CreateGpuResponse.created and
change the CreateGpu RPC comment to state it returns ALREADY_EXISTS and ensure
server code for rpc CreateGpu enforces that behavior. Apply the change to the
documented remarks for rpc CreateGpu and the CreateGpuResponse message so both
match the chosen behavior.
- Around line 96-111: The Gpu message defines two different resource_version
fields (ObjectMeta.resource_version as a string and Gpu.resource_version as
int64), causing confusion; either remove the top-level int64 resource_version
from message Gpu or rename it to a clearly distinct semantic name (e.g.,
generation or internal_version) to avoid collision with
ObjectMeta.resource_version; update the proto definition for message Gpu (field
currently declared as "int64 resource_version = 4") accordingly, adjust all code
references/usages to the new name or remove usages if deleted, update any
comments/docs in the proto to reflect the change, and regenerate protobuf stubs
wherever they are consumed.
In `@api/README.md`:
- Around line 1-38: The README "NVIDIA Device API: API Definitions" is missing
the Apache-2.0 license header; add the standard Apache License 2.0 header
(including the SPDX identifier and full copyright notice used across the repo)
to the top of this README so the header appears before the existing title and
content, ensure the year and copyright holder match the repository convention,
and preserve the existing markdown below the header.
In `@client-go/client/versioned/typed/device/v1alpha1/gpu.go`:
- Around line 88-107: In gpus.List, guard against a nil result from
devicev1alpha1.FromProtoList(resp.GetGpuList()) before accessing list.Items:
after calling FromProtoList, check if list == nil and if so return nil,
fmt.Errorf("received nil GPU list from proto for namespace %s",
c.getNamespace()); ensure the generator template that produces this file imports
fmt and applies the nil-check consistently (reference symbols: gpus.List,
devicev1alpha1.FromProtoList, resp.GetGpuList, c.getNamespace, and GPUList).
- Around line 66-86: The Get implementation for gpus calls
devicev1alpha1.FromProto(resp.GetGpu()) and immediately logs
obj.GetResourceVersion(), but FromProto can return nil; add a nil guard after
obj := devicev1alpha1.FromProto(resp.GetGpu()) to check if obj == nil and return
a descriptive error (e.g., wrap fmt.Errorf or errors.New) instead of proceeding
to call obj.GetResourceVersion(); apply this change in the generator template
that produces the gpus.Get function so all generated clients (and the GetGpu
call path to c.client.GetGpu / resp.GetGpu) include the nil-check and error
return.
In `@client-go/informers/externalversions/device/interface.go`:
- Line 17: Add a package-level godoc comment at the top of the device package
file: insert a comment starting with "// Package device ..." immediately above
the "package device" declaration that briefly describes the purpose and
responsibilities of the device package (e.g., it provides informer shared types
for device API group). Ensure the comment follows godoc style and summarizes the
package's intent.
In `@client-go/informers/externalversions/device/v1alpha1/interface.go`:
- Line 17: Add a package-level godoc comment to the v1alpha1 package in
interface.go by inserting a top-of-file comment that begins with "// Package
v1alpha1" and briefly describes the purpose of this package (e.g., that it
provides informer interfaces/types for device v1alpha1). Ensure the comment is
placed immediately above the "package v1alpha1" declaration so standard Go tools
and linters recognize the package documentation.
In `@client-go/nvgrpc/client_conn.go`:
- Around line 26-28: Guard against a nil Config in ClientConnFor: check if the
incoming pointer config is nil before dereferencing (cfg := *config) and return
a clear error (e.g., fmt.Errorf or errors.New) if it is; update ClientConnFor to
perform this nil check at the top so callers that pass nil get a descriptive
error instead of a panic while leaving the rest of the function (using cfg and
opts ...DialOption) unchanged.
In `@client-go/README.md`:
- Around line 40-44: Update the incorrect import in the Quick Start import
block: replace the wrong import string
"github.com/nvidia/nvsentinel/client-go/clientset/versioned" with
"github.com/nvidia/nvsentinel/client-go/client/versioned" so the example imports
the correct client package; ensure the import block containing metav1, api/meta,
and nvgrpc uses the corrected path.
In `@cmd/device-api-server/main.go`:
- Around line 123-127: The current flag.Set("logging_format", "json") is
ineffective for klog/v2; replace it by initializing klog with a structured JSON
handler: when config.LogFormat == "json" call klog.SetSlogLogger(...) using
slog.NewJSONHandler(os.Stdout, nil) so klog emits JSON; place this call after
flag.Parse() and before any logging (i.e., before the first logger.Info() call)
to ensure it takes effect; update the block that currently checks
config.LogFormat and remove the flag.Set usage in favor of klog.SetSlogLogger.
In `@cmd/health-provider/main_stub.go`:
- Around line 35-36: Remove the duplicate flag.Parse() call in main_stub.go:
parseFlags() already calls flag.Parse(), so delete the standalone flag.Parse()
invocation to avoid parsing flags twice; keep parseFlags() as the single place
that calls flag.Parse() and returns cfg and nvmlCfg (refer to the parseFlags
function and the cfg, nvmlCfg variables).
In `@code-generator/cmd/client-gen/generators/generator_for_type.go`:
- Around line 27-35: Wrap the error returned by ParseClientGenTags with context
at each call site by replacing bare returns of err with: return
fmt.Errorf("parse client-gen tags for %s: %w", t.Name.Name, err). Update the
four occurrences where ParseClientGenTags is called (in the generators for types
and groups and their fake counterparts) so the function that calls
ParseClientGenTags returns the wrapped error; ensure you import fmt if not
already present and reference t.Name.Name in the formatting string.
In `@deployments/helm/health-provider/Chart.yaml`:
- Around line 1-24: The Chart.yaml for the chart named "health-provider" is not
located with the other nvsentinel charts; move the entire chart directory
containing Chart.yaml for health-provider into the project's central nvsentinel
charts directory so it sits alongside existing charts like gpu-health-monitor,
syslog-health-monitor, and csp-health-monitor, and update any CI/build
references or packaging manifests that point to the old location (e.g., chart
packaging steps or dependency references) to the new chart directory.
🟡 Minor comments (40)
.github/copilot-instructions.md-14-25 (1)
14-25:⚠️ Potential issue | 🟡 MinorAdd language identifiers to fenced code blocks.
MD040 is triggered because these fences lack a language; add
text/bashfor clarity and lint compliance.🛠️ Proposed fix
-``` +```text api/ ├── proto/ # Protocol Buffer source definitions (edit these) │ └── device/v1alpha1/ │ └── gpu.proto ├── gen/go/ # Generated Go code (DO NOT edit manually) │ └── device/v1alpha1/ │ ├── gpu.pb.go │ └── gpu_grpc.pb.go ├── go.mod └── Makefile -``` +``` @@ -``` +```text feat: add new GPU condition type fix: correct timestamp handling in conditions docs: update API documentation chore: update protoc-gen-go version -``` +``` @@ -``` +```text Signed-off-by: Name <email@example.com> -``` +``` @@ -``` +```text Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. Licensed under the Apache License, Version 2.0 (the "License"); ... -``` +```Also applies to: 60-65, 71-73, 81-86
demos/nvml-sidecar-demo.sh-454-458 (1)
454-458:⚠️ Potential issue | 🟡 MinorDocumentation inconsistency in summary.
Line 455 references
Dockerfile.nvml-provider, but the script actually usesdeployments/container/Dockerfile(defined on line 24). Update the summary to reference the correct file path.Suggested fix
echo "${white}Key files:${reset}" - echo "${cyan} - Dockerfile.nvml-provider # Sidecar container build${reset}" + echo "${cyan} - ${DOCKERFILE} # Multi-target container build${reset}" echo "${cyan} - values-sidecar-test.yaml # Helm values for sidecar mode${reset}" echo "${cyan} - charts/device-api-server/ # Helm chart with sidecar support${reset}"demos/nvml-sidecar-demo.sh-367-369 (1)
367-369:⚠️ Potential issue | 🟡 MinorBug: Duplicate
pausecalls.Lines 367 and 368 both call
pause, requiring the user to press ENTER twice unnecessarily. This appears to be a copy-paste error.Suggested fix
step "Check nvml-provider sidecar logs" run_cmd kubectl --kubeconfig="${KUBECONFIG}" -n "${NAMESPACE}" logs "${POD}" -c nvml-provider --tail=20 || true pause - - pause }code-generator/cmd/client-gen/generators/scheme/generator_for_scheme.go-127-131 (1)
127-131:⚠️ Potential issue | 🟡 MinorSurface unexpected
os.Staterrors.Non-
IsNotExisterrors are silently ignored, which can mask permission/IO problems and skip custom registration unexpectedly.🛠️ Suggested fix
if g.OutputPath != "" { - if _, err := os.Stat(filepath.Join(g.OutputPath, strings.ToLower("register_custom.go"))); err == nil { + customPath := filepath.Join(g.OutputPath, "register_custom.go") + if _, err := os.Stat(customPath); err == nil { m["customRegister"] = true + } else if !os.IsNotExist(err) { + return fmt.Errorf("stat %s: %w", customPath, err) } }code-generator/cmd/client-gen/generators/scheme/generator_for_scheme.go-55-59 (1)
55-59:⚠️ Potential issue | 🟡 MinorAdd GoDoc comments for exported GenScheme methods (and fix Filter comment).
Exported methods require GoDoc comments starting with the method name per Go conventions.
✍️ Suggested fix
+// Namers returns the naming systems used by this generator. func (g *GenScheme) Namers(c *generator.Context) namer.NameSystems { return namer.NameSystems{ "raw": namer.NewRawNamer(g.OutputPkg, g.ImportTracker), } } -// We only want to call GenerateType() once. +// Filter ensures GenerateType runs only once for this generator. func (g *GenScheme) Filter(c *generator.Context, t *types.Type) bool { ret := !g.schemeGenerated g.schemeGenerated = true return ret } +// Imports returns the import lines required for the generated scheme. func (g *GenScheme) Imports(c *generator.Context) (imports []string) { imports = append(imports, g.ImportTracker.ImportLines()...) for _, group := range g.Groups { for _, version := range group.Versions { packagePath := g.InputPackages[clientgentypes.GroupVersion{Group: group.Group, Version: version.Version}] groupAlias := strings.ToLower(g.GroupGoNames[clientgentypes.GroupVersion{Group: group.Group, Version: version.Version}]) if g.CreateRegistry { // import the install package for internal clientsets instead of the type package with register.go if version.Version != "" { packagePath = path.Dir(packagePath) } packagePath = path.Join(packagePath, "install") imports = append(imports, fmt.Sprintf("%s \"%s\"", groupAlias, packagePath)) break } else { imports = append(imports, fmt.Sprintf("%s%s \"%s\"", groupAlias, strings.ToLower(version.Version.NonEmpty()), packagePath)) } } } return } +// GenerateType writes the scheme, codecs, and registration code. func (g *GenScheme) GenerateType(c *generator.Context, t *types.Type, w io.Writer) error { sw := generator.NewSnippetWriter(w, c, "$", "$")code-generator/cmd/client-gen/generators/scheme/generator_for_scheme.go-24-24 (1)
24-24:⚠️ Potential issue | 🟡 MinorAdd a package-level GoDoc comment.
The package declaration lacks the required package comment, which will trip GoDoc/golint checks. Other generator files in the codebase properly document their packages.
✍️ Suggested fix
- package scheme + // Package scheme contains generators for clientset schemes, codecs, and registration. + package schemeAs per coding guidelines: Package-level godoc required for all Go packages.
cmd/health-provider/main.go-51-52 (1)
51-52:⚠️ Potential issue | 🟡 MinorRemove redundant
flag.Parse()call.
flag.Parse()is already called insideparseFlags()at line 112. This second call is redundant and should be removed to avoid confusion.🔧 Proposed fix
cfg, nvmlCfg := parseFlags() - flag.Parse().gitignore-30-31 (1)
30-31:⚠️ Potential issue | 🟡 MinorRemove duplicate
api/bin/*entry.This pattern is already defined at line 17. The duplicate entry should be removed.
🧹 Proposed fix
# Local tool binaries (managed by api/Makefile) -api/bin/*client-go/examples/fake-client/main_test.go-80-86 (1)
80-86:⚠️ Potential issue | 🟡 MinorAdd type assertion safety check.
The type assertion
obj.(*devicev1alpha1.GPU)could panic if the object is of an unexpected type. Use a comma-ok idiom for safety.🛡️ Proposed fix
gpuInformer.AddEventHandler(&cache.ResourceEventHandlerFuncs{ AddFunc: func(obj interface{}) { - gpu := obj.(*devicev1alpha1.GPU) + gpu, ok := obj.(*devicev1alpha1.GPU) + if !ok { + t.Errorf("unexpected object type: %T", obj) + return + } t.Logf("Informer signaled GPU added: %s", gpu.Name) gpuChan <- gpu }, }).github/workflows/ci.yml-17-28 (1)
17-28:⚠️ Potential issue | 🟡 MinorBranch name inconsistency in concurrency configuration.
The workflow triggers on
device-apibranch (lines 20, 23), butcancel-in-progresschecks againstmainbranch (line 28). This means in-progress runs ondevice-apiwill always be cancelled on new pushes, which may not be intended.🔧 Suggested fix
concurrency: group: ${{ github.workflow }}-${{ github.ref }} - cancel-in-progress: ${{ github.ref != 'refs/heads/main' }} + cancel-in-progress: ${{ github.ref != 'refs/heads/device-api' }}Or update to match your actual protected branch.
README.md-259-259 (1)
259-259:⚠️ Potential issue | 🟡 MinorNormalize horizontal rule to
---(no trailing space).Line 259 has a horizontal rule with trailing whitespace, which triggers MD035.
🧹 Suggested fix
---- +---README.md-14-42 (1)
14-42:⚠️ Potential issue | 🟡 MinorAdd language identifier to the ASCII diagram fence.
Line 14 opens a fenced block without a language identifier, which triggers MD040. Using
textkeeps the diagram readable and satisfies linting.📝 Suggested fix
-``` +```text @@ -``` +```code-generator/cmd/client-gen/generators/generator_for_group.go-27-35 (1)
27-35:⚠️ Potential issue | 🟡 MinorWrap ParseClientGenTags errors with context.
Line 107 returns the raw parse error, which obscures which type caused the failure. Wrap with context to improve diagnostics. This pattern is also present at generator_for_type.go:96, fake/generator_fake_for_type.go:83, and fake/generator_fake_for_group.go:90—consider applying the same fix across the codebase for consistency.
🧩 Suggested change
import ( + "fmt" "io" @@ tags, err := util.ParseClientGenTags(append(t.SecondClosestCommentLines, t.CommentLines...)) if err != nil { - return err + return fmt.Errorf("parse client-gen tags for %s: %w", t.Name.Name, err) }code-generator/cmd/client-gen/types/helpers.go-38-38 (1)
38-38:⚠️ Potential issue | 🟡 MinorAdd issue reference to the TODO comment.
The TODO at line 38 should reference a concrete GitHub issue ID for traceability, per coding guidelines requiring TODO comments to reference issues in Go code.- // TODO once we've cleared the last uses as strings, this special case should be removed. + // TODO (#<issue-id>) once we've cleared the last uses as strings, this special case should be removed.cmd/nvml-provider/reconciler.go-118-118 (1)
118-118:⚠️ Potential issue | 🟡 MinorLink this TODO to a concrete issue ID for traceability.
The TODO comment at line 118 should reference a GitHub issue or tracking number (e.g.,
TODO:#123Add provider_id filtering...). This enables developers to track the improvement across the codebase and project management systems.client-go/examples/controller/main.go-72-113 (1)
72-113:⚠️ Potential issue | 🟡 MinorRemove or clarify misleading "hybrid" reconciliation comment.
The fallback code would fail if executed. The manager uses a dummyrest.Config{Host: "http://localhost:0"}with a mapper containing only GPU resources. Any attempt to watch non-GPU resources (Pods, Events, etc.) would fail due to both the non-functional REST endpoint and missing mapper entries. Either clarify that this example is GPU-only reconciliation or wire a real kubeconfig with a discovery-backed mapper if hybrid support is intended.cmd/nvml-provider/reconciler.go-285-301 (1)
285-301:⚠️ Potential issue | 🟡 MinorConfirm merge semantics—status fields are fully replaced, which will clear
recommended_action.The code populates only
Conditionsin the status update. SinceUpdateGpuStatusperforms a complete replacement of the entireGpuStatusobject (see implementation:cached.gpu.Status = status), therecommended_actionfield will be cleared to an empty string if it was previously set. This causes status data loss. Either preserve the existingrecommended_actionby reading the current GPU state before updating, or use field masks to update only theConditionsfield.cmd/nvml-provider/reconciler.go-71-90 (1)
71-90:⚠️ Potential issue | 🟡 MinorAdd provider ownership filtering before unregistering GPUs.
ReconcileStateunregisters any cached GPU missing from NVML. The code currently assumes a single NVML provider ("For now, we assume all GPUs belong to us"), but the architecture supports multiple providers. If ListGpus returns GPUs from other providers, this reconciliation will delete resources they own. The code already has a TODO for this (line 118). Before the system supports multiple providers, add provider_id to the Gpu resource and filter cached GPUs by provider before calling unregisterGPU.client-go/examples/client/README.md-22-27 (1)
22-27:⚠️ Potential issue | 🟡 MinorInconsistent socket path between documentation and expected output.
Line 22 states
sudois required for the default socket path in/var/run/, but the expected output on line 27 shows the target asunix:///tmp/nvidia-device-api.sock. This discrepancy may confuse users about the actual socket location. Consider clarifying that the Fake Server uses/tmp/while production deployments use/var/run/.client-go/DEVELOPMENT.md-41-44 (1)
41-44:⚠️ Potential issue | 🟡 MinorFix markdownlint MD027: remove extra spaces after
>in blockquotes.Suggested fix
-> - **API Definitions**: Modify the Proto or Go types in `../api`. -> - **Client Logic**: Modify the templates in `../code-generator/cmd/client-gen`. -> - **Transport & Connection**: Modify the gRPC logic in `nvgrpc`. +> - **API Definitions**: Modify the Proto or Go types in `../api`. +> - **Client Logic**: Modify the templates in `../code-generator/cmd/client-gen`. +> - **Transport & Connection**: Modify the gRPC logic in `nvgrpc`.client-go/version/version.go-15-25 (1)
15-25:⚠️ Potential issue | 🟡 MinorAdd a package-level doc comment for
package version.Suggested fix
- package version +// Package version provides build-time version metadata and user-agent helpers. +package versionAs per coding guidelines: Package-level godoc required for all Go packages.
client-go/tests/device/v1alpha1/fake_client_test.go-85-99 (1)
85-99:⚠️ Potential issue | 🟡 MinorAdd type assertion safety check to prevent panic.
Line 93 performs an unchecked type assertion. If the watch returns an unexpected type, this will panic rather than fail gracefully.
Proposed fix
- obj := event.Object.(*devicev1alpha1.GPU) + obj, ok := event.Object.(*devicev1alpha1.GPU) + if !ok { + t.Fatalf("Expected *devicev1alpha1.GPU, got %T", event.Object) + } if obj.Name != "gpu-2" {client-go/tests/device/v1alpha1/fake_client_test.go-41-53 (1)
41-53:⚠️ Potential issue | 🟡 MinorUse
t.Fatalfinstead oft.Errorfwhen test cannot continue.On Line 43, if
Getfails, the subsequent assertions ongpuwill panic due to nil access. Uset.Fatalfto halt test execution on critical failures.Proposed fix
gpu, err := client.DeviceV1alpha1().GPUs().Get(ctx, "gpu-1", metav1.GetOptions{}) if err != nil { - t.Errorf("Fake client failed to retrieve GPU: %v", err) + t.Fatalf("Fake client failed to retrieve GPU: %v", err) }client-go/nvgrpc/interceptors_test.go-35-35 (1)
35-35:⚠️ Potential issue | 🟡 MinorUnused
expectedCodefield in test table.The
expectedCodefield is defined but never asserted against in the test. Either remove it or add an assertion to verify the interceptor correctly propagates the gRPC status code.Option 1: Remove unused field
tests := []struct { name string method string log bool invokerErr error - expectedCode codes.Code }{ - {"Returns OK code for successful call", "/svc/success", true, nil, codes.OK}, - {"Returns internal status error on internal error", "/svc/internal_error", true, status.Error(codes.Internal, "fail"), codes.Internal}, - {"Returns canceled status error when canceled", "/svc/cancel", true, status.Error(codes.Canceled, ""), codes.Canceled}, - {"Returns deadline exceeded status error on timeout", "/svc/timeout", true, status.Error(codes.DeadlineExceeded, ""), codes.DeadlineExceeded}, - {"Does not log if level too low", "/svc/skip_log", false, nil, codes.OK}, + {"Returns OK code for successful call", "/svc/success", true, nil}, + {"Returns internal status error on internal error", "/svc/internal_error", true, status.Error(codes.Internal, "fail")}, + {"Returns canceled status error when canceled", "/svc/cancel", true, status.Error(codes.Canceled, "")}, + {"Returns deadline exceeded status error on timeout", "/svc/timeout", true, status.Error(codes.DeadlineExceeded, "")}, + {"Does not log if level too low", "/svc/skip_log", false, nil}, }Option 2: Add assertion for expected code
err := interceptor(context.Background(), tt.method, nil, nil, nil, invoker) if !errors.Is(err, tt.invokerErr) { t.Fatalf("Returned error mismatch. Got %v, want %v", err, tt.invokerErr) } + if got := status.Code(err); got != tt.expectedCode { + t.Errorf("Status code mismatch. Got %v, want %v", got, tt.expectedCode) + }client-go/go.mod-7-7 (1)
7-7:⚠️ Potential issue | 🟡 MinorRemove self-referential replace directive.
The
replace github.com/nvidia/nvsentinel/client-go => .directive on line 7 is non-standard and potentially problematic. This is the only module in the repository using this pattern—all other modules correctly use relative path replacements like=> ../api. Additionally, there is nogo.workfile that would justify this workspace-style setup. The replace directive in a module's owngo.modfile is not honored by external consumers of that module, making it ineffective. Remove this line.client-go/nvgrpc/watcher_test.go-154-177 (1)
154-177:⚠️ Potential issue | 🟡 MinorAvoid fixed sleep to reduce test flakiness.
time.Sleep(10 * time.Millisecond)can be brittle under load; prefer a synchronization signal (e.g., a started channel inFakeSourceorrequire.Eventually) before callingStop().cmd/controller-test/main.go-59-84 (1)
59-84:⚠️ Potential issue | 🟡 MinorMetrics bind address flag is ignored.
The
metricsAddrflag is parsed but never applied;ctrl.Options{}.Metricsuses defaults instead. This makes--metrics-bind-addressineffective.🐛 Proposed fix
@@ "sigs.k8s.io/controller-runtime/pkg/client" "sigs.k8s.io/controller-runtime/pkg/healthz" "sigs.k8s.io/controller-runtime/pkg/log/zap" + metricsserver "sigs.k8s.io/controller-runtime/pkg/metrics/server" @@ mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{ Scheme: scheme, - Metrics: ctrl.Options{}.Metrics, + Metrics: metricsserver.Options{BindAddress: metricsAddr}, HealthProbeBindAddress: probeAddr, LeaderElection: enableLeaderElection, LeaderElectionID: "nvsentinel-controllers.nvidia.com", })code-generator/cmd/client-gen/generators/generator_for_clientset.go-78-80 (1)
78-80:⚠️ Potential issue | 🟡 MinorTie the TODO to a tracking issue.
As per coding guidelines: TODO comments should reference issues in Go code.
code-generator/cmd/client-gen/args/gvtype.go-39-47 (1)
39-47:⚠️ Potential issue | 🟡 MinorAdd GoDoc for exported
NewGVTypesValue.
The exported function lacks a doc comment.✍️ Suggested fix
-func NewGVTypesValue(gvToTypes *map[types.GroupVersion][]string, def []string) *gvTypeValue { +// NewGVTypesValue initializes a gvTypeValue with optional defaults. +func NewGVTypesValue(gvToTypes *map[types.GroupVersion][]string, def []string) *gvTypeValue {As per coding guidelines: Function comments required for all exported Go functions.
code-generator/cmd/client-gen/args/gvtype.go-93-101 (1)
93-101:⚠️ Potential issue | 🟡 MinorAvoid double slashes in
String()when version is empty.
Whengv.Versionis empty, the current formatting yieldsgroup//type. Consider conditionally formatting.💡 Suggested fix
- for _, t := range ts { - strs = append(strs, gv.Group.String()+"/"+gv.Version.String()+"/"+t) - } + for _, t := range ts { + if gv.Version == "" { + strs = append(strs, gv.Group.String()+"/"+t) + } else { + strs = append(strs, gv.Group.String()+"/"+gv.Version.String()+"/"+t) + } + }client-go/client/versioned/clientset.go-52-59 (1)
52-59:⚠️ Potential issue | 🟡 MinorClose the gRPC connection if clientset initialization fails.
In
NewForConfig, ifNewForConfigAndClientreturns an error (e.g., from device client initialization failure), the gRPC connection is leaked. Update both the generated code and the template incode-generator/cmd/client-gen/generators/generator_for_clientset.go.Suggested fix
conn, err := nvgrpc.ClientConnFor(&configShallowCopy) if err != nil { return nil, err } - return NewForConfigAndClient(&configShallowCopy, conn) + cs, err := NewForConfigAndClient(&configShallowCopy, conn) + if err != nil { + _ = conn.Close() + return nil, err + } + return cs, nilclient-go/informers/externalversions/generic.go-49-60 (1)
49-60:⚠️ Potential issue | 🟡 MinorTODO needs an issue reference.
Line 50’s TODO should link to a tracking issue or be removed; since this file is generated, update the generator template to keep it consistent.
As per coding guidelines, TODO comments should reference issues in Go code.code-generator/cmd/client-gen/generators/fake/generator_fake_for_clientset.go-92-119 (1)
92-119:⚠️ Potential issue | 🟡 MinorReplace the TODO with an issue reference (or remove).
Line 93’s TODO should reference a tracking issue.
As per coding guidelines, TODO comments should reference issues in Go code.code-generator/cmd/client-gen/generators/fake/fake_client_generator.go-96-137 (1)
96-137:⚠️ Potential issue | 🟡 MinorAdd godoc and an issue reference for
TargetForClientset.
Same concerns: exported function missing comment and TODO at Line 98 lacks an issue reference.
As per coding guidelines, Function comments required for all exported Go functions; TODO comments should reference issues in Go code.code-generator/cmd/client-gen/generators/fake/fake_client_generator.go-40-94 (1)
40-94:⚠️ Potential issue | 🟡 MinorAdd godoc and an issue reference for
TargetForGroup.
This exported function lacks a leading comment, and the TODO at Line 41 should reference a tracking issue; please update both in the generator.
As per coding guidelines, Function comments required for all exported Go functions; TODO comments should reference issues in Go code.DEVELOPMENT.md-126-128 (1)
126-128:⚠️ Potential issue | 🟡 MinorUpdate "Getting Help" links to point to the correct repository.
Lines 126-127 in DEVELOPMENT.md link to
github.com/NVIDIA/device-api, but this is the NVSentinel repository (module:github.com/nvidia/nvsentinel). The correct links should point togithub.com/NVIDIA/NVSentinelfor reporting issues and starting discussions related to this project's development. See README.md lines 233, 238, 243, 249 for the correct repository links.code-generator/cmd/client-gen/generators/fake/generator_fake_for_group.go-24-24 (1)
24-24:⚠️ Potential issue | 🟡 MinorAdd package-level GoDoc for the
fakepackage.None of the files in this package have a package comment. Add a comment like
// Package fake provides fake clientset implementations.before thepackage fakedeclaration. This can be done in any existing file or in a newdoc.gofile.Per coding guidelines: Package-level godoc required for all Go packages.
api/proto/device/v1alpha1/gpu.proto-33-39 (1)
33-39:⚠️ Potential issue | 🟡 MinorInconsistent indentation in comments.
Lines 36-38 use tab indentation while the surrounding lines use spaces. This inconsistency may cause proto linter warnings.
// resource_version represents the internal version of this object. // // Value must be treated as opaque by clients and passed unmodified back to the server. - // Populated by the system. - // Read-only. - // More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#concurrency-control-and-consistency + // Populated by the system. + // Read-only. + // More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#concurrency-control-and-consistency string resource_version = 2;api/proto/device/v1alpha1/gpu.proto-28-46 (1)
28-46:⚠️ Potential issue | 🟡 MinorTypo in documentation comment.
Line 42: "reprsentation" should be "representation".
// namespace defines the space within which each name must be unique. An empty namespace is - // equivalent to the "default" namespace, but "default" is the canonical reprsentation. + // equivalent to the "default" namespace, but "default" is the canonical representation. // Not all objects are required to be scoped to a namespace - the value of this field forclient-go/informers/externalversions/factory.go-208-208 (1)
208-208:⚠️ Potential issue | 🟡 MinorDocumentation refers to non-existent method.
Line 208 mentions
factory.WaitForStop()but the actual method isShutdown(). This appears to be a generator template issue.// ctx, cancel := context.Background() // defer cancel() // factory := NewSharedInformerFactory(client, resyncPeriod) -// defer factory.WaitForStop() // Returns immediately if nothing was started. +// defer factory.Shutdown() // Returns immediately if nothing was started. // genericInformer := factory.ForResource(resource)
| # ============================================================================== | ||
| # GIT ATTRIBUTES | ||
| # ============================================================================== | ||
| # Use 'linguist-generated=true' to hide generated code from GitHub PR diffs. | ||
| # ============================================================================== | ||
|
|
||
| # Hide Kubernetes generated helpers (DeepCopy, Defaults, Conversions, OpenAPI, etc) | ||
| zz_generated.*.go linguist-generated=true | ||
|
|
||
| # Hide generated Protobuf Go bindings | ||
| *.pb.go linguist-generated=true | ||
|
|
||
| # Hide generated client library | ||
| client-go/client/** linguist-generated=true | ||
| client-go/listers/** linguist-generated=true | ||
| client-go/informers/** linguist-generated=true | ||
|
|
||
| # Hide copied, unmodified upstream code | ||
| code-generator/cmd/client-gen/types/** linguist-generated=true | ||
| code-generator/cmd/client-gen/generators/scheme/** linguist-generated=true | ||
| code-generator/cmd/client-gen/generators/util/** linguist-generated=true |
There was a problem hiding this comment.
Add Apache 2.0 license header to this file.
The new .gitattributes file is missing the required Apache 2.0 license header.
📝 Suggested header addition
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
# ==============================================================================
# GIT ATTRIBUTES
# ==============================================================================Based on learnings: All files must include Apache 2.0 license header.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # ============================================================================== | |
| # GIT ATTRIBUTES | |
| # ============================================================================== | |
| # Use 'linguist-generated=true' to hide generated code from GitHub PR diffs. | |
| # ============================================================================== | |
| # Hide Kubernetes generated helpers (DeepCopy, Defaults, Conversions, OpenAPI, etc) | |
| zz_generated.*.go linguist-generated=true | |
| # Hide generated Protobuf Go bindings | |
| *.pb.go linguist-generated=true | |
| # Hide generated client library | |
| client-go/client/** linguist-generated=true | |
| client-go/listers/** linguist-generated=true | |
| client-go/informers/** linguist-generated=true | |
| # Hide copied, unmodified upstream code | |
| code-generator/cmd/client-gen/types/** linguist-generated=true | |
| code-generator/cmd/client-gen/generators/scheme/** linguist-generated=true | |
| code-generator/cmd/client-gen/generators/util/** linguist-generated=true | |
| # Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. | |
| # | |
| # Licensed under the Apache License, Version 2.0 (the "License"); | |
| # you may not use this file except in compliance with the License. | |
| # You may obtain a copy of the License at | |
| # | |
| # http://www.apache.org/licenses/LICENSE-2.0 | |
| # | |
| # Unless required by applicable law or agreed to in writing, software | |
| # distributed under the License is distributed on an "AS IS" BASIS, | |
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |
| # See the License for the specific language governing permissions and | |
| # limitations under the License. | |
| # ============================================================================== | |
| # GIT ATTRIBUTES | |
| # ============================================================================== | |
| # Use 'linguist-generated=true' to hide generated code from GitHub PR diffs. | |
| # ============================================================================== | |
| # Hide Kubernetes generated helpers (DeepCopy, Defaults, Conversions, OpenAPI, etc) | |
| zz_generated.*.go linguist-generated=true | |
| # Hide generated Protobuf Go bindings | |
| *.pb.go linguist-generated=true | |
| # Hide generated client library | |
| client-go/client/** linguist-generated=true | |
| client-go/listers/** linguist-generated=true | |
| client-go/informers/** linguist-generated=true | |
| # Hide copied, unmodified upstream code | |
| code-generator/cmd/client-gen/types/** linguist-generated=true | |
| code-generator/cmd/client-gen/generators/scheme/** linguist-generated=true | |
| code-generator/cmd/client-gen/generators/util/** linguist-generated=true |
🤖 Prompt for AI Agents
In @.gitattributes around lines 1 - 21, Add the Apache 2.0 license header to the
top of .gitattributes: insert the standard multi-line Apache-2.0 header text
(including copyright line, SPDX identifier or full license notice and year/owner
as per project convention) as the first lines of the file so the file contains
the required license header before any content; ensure formatting matches other
repository files' headers.
.github/workflows/ci.yml
Outdated
| jobs: | ||
| go: | ||
| name: Go | ||
| runs-on: ubuntu-latest | ||
| timeout-minutes: 15 | ||
| steps: | ||
| - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2 | ||
|
|
||
| - name: Setup build environment | ||
| uses: ./.github/actions/setup-ci-env | ||
|
|
||
| - name: ${{ matrix.step_name }} | ||
| run: ${{ matrix.make_command }} | ||
|
|
||
| - name: Load Helm version from .versions.yaml | ||
| if: matrix.component == 'helm-charts' | ||
| id: helm-version | ||
| run: | | ||
| HELM_VERSION=$(yq eval '.testing_tools.helm' .versions.yaml) | ||
| echo "helm_version=${HELM_VERSION}" >> $GITHUB_OUTPUT | ||
|
|
||
| - name: Setup Helm | ||
| if: matrix.component == 'helm-charts' | ||
| uses: azure/setup-helm@1a275c3b69536ee54be43f2070a358922e12c8d4 # v4.3.1 | ||
| with: | ||
| version: ${{ steps.helm-version.outputs.helm_version }} | ||
|
|
||
| - name: Validate Helm Charts | ||
| if: matrix.component == 'helm-charts' | ||
| run: make helm-lint | ||
|
|
There was a problem hiding this comment.
Critical: go job is missing matrix configuration.
The go job references matrix.step_name, matrix.make_command, and matrix.component (lines 51-52, 55, 62, 68) but has no strategy.matrix defined. This will cause the workflow to fail.
🐛 Add missing matrix configuration
go:
name: Go
runs-on: ubuntu-latest
timeout-minutes: 15
+ strategy:
+ matrix:
+ include:
+ - component: api
+ step_name: "Build API"
+ make_command: "make -C api build"
+ - component: helm-charts
+ step_name: "Lint Helm Charts"
+ make_command: "echo 'Helm lint handled separately'"
+ # Add other components as needed
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2Or, if the matrix was accidentally removed, restore the original configuration.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| jobs: | |
| go: | |
| name: Go | |
| runs-on: ubuntu-latest | |
| timeout-minutes: 15 | |
| steps: | |
| - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2 | |
| - name: Setup build environment | |
| uses: ./.github/actions/setup-ci-env | |
| - name: ${{ matrix.step_name }} | |
| run: ${{ matrix.make_command }} | |
| - name: Load Helm version from .versions.yaml | |
| if: matrix.component == 'helm-charts' | |
| id: helm-version | |
| run: | | |
| HELM_VERSION=$(yq eval '.testing_tools.helm' .versions.yaml) | |
| echo "helm_version=${HELM_VERSION}" >> $GITHUB_OUTPUT | |
| - name: Setup Helm | |
| if: matrix.component == 'helm-charts' | |
| uses: azure/setup-helm@1a275c3b69536ee54be43f2070a358922e12c8d4 # v4.3.1 | |
| with: | |
| version: ${{ steps.helm-version.outputs.helm_version }} | |
| - name: Validate Helm Charts | |
| if: matrix.component == 'helm-charts' | |
| run: make helm-lint | |
| jobs: | |
| go: | |
| name: Go | |
| runs-on: ubuntu-latest | |
| timeout-minutes: 15 | |
| strategy: | |
| matrix: | |
| include: | |
| - component: api | |
| step_name: "Build API" | |
| make_command: "make -C api build" | |
| - component: helm-charts | |
| step_name: "Lint Helm Charts" | |
| make_command: "echo 'Helm lint handled separately'" | |
| # Add other components as needed | |
| steps: | |
| - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2 | |
| - name: Setup build environment | |
| uses: ./.github/actions/setup-ci-env | |
| - name: ${{ matrix.step_name }} | |
| run: ${{ matrix.make_command }} | |
| - name: Load Helm version from .versions.yaml | |
| if: matrix.component == 'helm-charts' | |
| id: helm-version | |
| run: | | |
| HELM_VERSION=$(yq eval '.testing_tools.helm' .versions.yaml) | |
| echo "helm_version=${HELM_VERSION}" >> $GITHUB_OUTPUT | |
| - name: Setup Helm | |
| if: matrix.component == 'helm-charts' | |
| uses: azure/setup-helm@1a275c3b69536ee54be43f2070a358922e12c8d4 # v4.3.1 | |
| with: | |
| version: ${{ steps.helm-version.outputs.helm_version }} | |
| - name: Validate Helm Charts | |
| if: matrix.component == 'helm-charts' | |
| run: make helm-lint |
🧰 Tools
🪛 actionlint (1.7.10)
[error] 51-51: property "step_name" is not defined in object type {}
(expression)
[error] 52-52: property "make_command" is not defined in object type {}
(expression)
[error] 55-55: property "component" is not defined in object type {}
(expression)
[error] 57-57: shellcheck reported issue in this script: SC2086:info:2:40: Double quote to prevent globbing and word splitting
(shellcheck)
[error] 62-62: property "component" is not defined in object type {}
(expression)
[error] 68-68: property "component" is not defined in object type {}
(expression)
🤖 Prompt for AI Agents
In @.github/workflows/ci.yml around lines 40 - 70, The "go" job references
matrix variables (matrix.step_name, matrix.make_command, matrix.component) but
lacks a strategy.matrix; add a strategy block to the "go" job that defines the
matrix entries used (e.g., step_name, make_command, component) or restore the
original strategy.matrix if it was removed so those variables are defined for
the job; update the job's top-level stanza "go" to include strategy: matrix:
with the appropriate permutations that feed the referenced matrix.step_name,
matrix.make_command and matrix.component.
| message Gpu { | ||
| // name is the unique logical identifier of the GPU resource. | ||
| // | ||
| // This is typically the lowercased GPU UUID, but may take other forms. | ||
| // | ||
| // Example: "gpu-a1b2c3d4-e5f6-a7b8-c9d0-e1f2a3b4c5d6" | ||
| string name = 1; | ||
| ObjectMeta metadata = 1; | ||
|
|
||
| // spec defines the identity and desired attributes of the GPU resource. | ||
| GpuSpec spec = 2; | ||
|
|
||
| // status contains the most recently observed state of the GPU resource. | ||
| // This data may lag slightly behind the actual on-device state. | ||
| GpuStatus status = 3; | ||
|
|
||
| // resource_version is a monotonically increasing version number. | ||
| // | ||
| // This increments on every modification to the GPU and can be used by | ||
| // consumers to detect changes or implement optimistic concurrency. | ||
| int64 resource_version = 4; | ||
| } |
There was a problem hiding this comment.
Duplicate resource_version fields with different types.
The Gpu message has resource_version in two places:
metadata.resource_version(string) atObjectMetafield 2resource_version(int64) atGpufield 4
This creates potential confusion for API consumers. In Kubernetes conventions, resource_version is typically an opaque string stored only in ObjectMeta. If the intent is to have an internal numeric version for optimistic concurrency, consider:
- Removing the top-level
resource_versionand using onlymetadata.resource_version - If both are needed, rename the top-level field to something like
generationorinternal_versionto distinguish its purpose
🤖 Prompt for AI Agents
In `@api/proto/device/v1alpha1/gpu.proto` around lines 96 - 111, The Gpu message
defines two different resource_version fields (ObjectMeta.resource_version as a
string and Gpu.resource_version as int64), causing confusion; either remove the
top-level int64 resource_version from message Gpu or rename it to a clearly
distinct semantic name (e.g., generation or internal_version) to avoid collision
with ObjectMeta.resource_version; update the proto definition for message Gpu
(field currently declared as "int64 resource_version = 4") accordingly, adjust
all code references/usages to the new name or remove usages if deleted, update
any comments/docs in the proto to reflect the change, and regenerate protobuf
stubs wherever they are consumed.
| // CreateGpu registers a new GPU with the server. | ||
| // | ||
| // The GPU name (metadata.name) must be unique. If a GPU with the | ||
| // same name already exists, returns ALREADY_EXISTS. | ||
| // | ||
| // This operation acquires a write lock, blocking consumer reads | ||
| // until the operation completes. | ||
| // | ||
| // Returns: | ||
| // - The created GPU with server-assigned fields (resource_version) | ||
| // - ALREADY_EXISTS if a GPU with that name exists | ||
| // - INVALID_ARGUMENT if required fields are missing | ||
| rpc CreateGpu(CreateGpuRequest) returns (CreateGpuResponse); |
There was a problem hiding this comment.
Contradictory semantics between RPC doc and response message.
The CreateGpu RPC documentation (line 220-221) states it "returns ALREADY_EXISTS if a GPU with that name exists." However, CreateGpuResponse.created (lines 339-343) indicates that when false, the existing GPU is returned for idempotent patterns.
These are mutually exclusive behaviors:
- Option A: Return
ALREADY_EXISTSerror (no response body) - Option B: Return success with
created=falseand existing GPU data (idempotent upsert)
Clarify which behavior is intended and update the documentation to be consistent.
Also applies to: 339-344
🤖 Prompt for AI Agents
In `@api/proto/device/v1alpha1/gpu.proto` around lines 210 - 222, The RPC docs for
CreateGpu and the CreateGpuResponse.created field conflict; decide whether
CreateGpu should return an ALREADY_EXISTS error or use idempotent semantics with
CreateGpuResponse.created (true/false) returning the existing GPU. Then make
them consistent: if you choose idempotent upsert, update the CreateGpu RPC
comment to state it returns success with created=false and the existing GPU when
a name already exists and keep/CreateGpuResponse.created semantics; if you
choose error-on-duplicate, remove or repurpose CreateGpuResponse.created and
change the CreateGpu RPC comment to state it returns ALREADY_EXISTS and ensure
server code for rpc CreateGpu enforces that behavior. Apply the change to the
documented remarks for rpc CreateGpu and the CreateGpuResponse message so both
match the chosen behavior.
| // UpdateGpuStatusRequest contains the status update for a GPU. | ||
| message UpdateGpuStatusRequest { | ||
| // The name of the GPU to update. | ||
| // Required. | ||
| string name = 1; | ||
|
|
||
| // The new status to set. | ||
| // This completely replaces the existing status. | ||
| // Required. | ||
| GpuStatus status = 2; | ||
|
|
||
| // Expected resource_version for optimistic concurrency. | ||
| // If set and doesn't match current version, returns ABORTED. | ||
| // Optional. | ||
| int64 resource_version = 3; | ||
| } |
There was a problem hiding this comment.
Type inconsistency in resource_version for optimistic concurrency.
UpdateGpuStatusRequest.resource_version is int64 (line 374), but ObjectMeta.resource_version is string (line 39). This inconsistency complicates client implementations that need to track versions across different message types.
Consider aligning the types—either use string consistently (Kubernetes convention) or int64 consistently (if internal numeric versioning is preferred).
🤖 Prompt for AI Agents
In `@api/proto/device/v1alpha1/gpu.proto` around lines 360 - 375,
UpdateGpuStatusRequest currently defines resource_version as int64 while
ObjectMeta.resource_version is a string, causing type mismatch; change
UpdateGpuStatusRequest.resource_version to string to match
ObjectMeta.resource_version (or vice versa if you choose numeric versioning) and
update any callers/serializers that read/write
UpdateGpuStatusRequest.resource_version and related handlers (e.g., the
UpdateGpuStatus RPC handling code and client usage) to use the new string type
so optimistic-concurrency checks compare the same type across messages.
| metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" | ||
| "k8s.io/apimachinery/pkg/api/meta" | ||
| "github.com/nvidia/nvsentinel/client-go/clientset/versioned" | ||
| "github.com/nvidia/nvsentinel/client-go/nvgrpc" | ||
| ) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, let's check the directory structure of client-go
fd -t d "clientset|client" client-go --max-depth 3 | head -20Repository: NVIDIA/NVSentinel
Length of output: 137
🏁 Script executed:
# Also find the README.md file and examine it
fd -t f "README.md" client-go | xargs wc -lRepository: NVIDIA/NVSentinel
Length of output: 356
🏁 Script executed:
# Read the relevant section of README.md around lines 40-44
fd -t f "README.md" client-go | xargs cat -n | sed -n '35,55p'Repository: NVIDIA/NVSentinel
Length of output: 910
🏁 Script executed:
# Check what's actually in client-go/client directory
fd -t f "versioned" client-go/client --max-depth 2 | head -20Repository: NVIDIA/NVSentinel
Length of output: 43
🏁 Script executed:
# Also check if clientset directory exists at all
ls -la client-go/ | grep -E "clientset|client"Repository: NVIDIA/NVSentinel
Length of output: 118
🏁 Script executed:
# Look for any versioned clientset files
find client-go -name "*versioned*" -type f 2>/dev/null | head -20Repository: NVIDIA/NVSentinel
Length of output: 43
🏁 Script executed:
# List contents of client-go/client
find client-go/client -type f -name "*.go" | head -20Repository: NVIDIA/NVSentinel
Length of output: 778
🏁 Script executed:
# Check the structure more carefully
tree client-go -L 3 -I 'vendor'Repository: NVIDIA/NVSentinel
Length of output: 1194
Fix import path in Quick Start example.
The import path on line 42 is incorrect. The codebase structure has client-go/client/versioned/, not client-go/clientset/versioned/. Change the import to:
"github.com/nvidia/nvsentinel/client-go/client/versioned"
This will prevent import errors when developers use the example code.
🤖 Prompt for AI Agents
In `@client-go/README.md` around lines 40 - 44, Update the incorrect import in the
Quick Start import block: replace the wrong import string
"github.com/nvidia/nvsentinel/client-go/clientset/versioned" with
"github.com/nvidia/nvsentinel/client-go/client/versioned" so the example imports
the correct client package; ensure the import block containing metav1, api/meta,
and nvgrpc uses the corrected path.
| // Configure klog for JSON output if requested | ||
| if config.LogFormat == "json" { | ||
| // Set JSON format flags | ||
| flag.Set("logging_format", "json") | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# First, find and examine the main.go file
find . -name "main.go" -path "*/cmd/device-api-server/*" -type fRepository: NVIDIA/NVSentinel
Length of output: 92
🏁 Script executed:
#!/bin/bash
# Read the main.go file to see the actual code
if [ -f "cmd/device-api-server/main.go" ]; then
wc -l cmd/device-api-server/main.go
fiRepository: NVIDIA/NVSentinel
Length of output: 94
🏁 Script executed:
#!/bin/bash
# Get context around lines 123-127
if [ -f "cmd/device-api-server/main.go" ]; then
sed -n '100,160p' cmd/device-api-server/main.go | cat -n
fiRepository: NVIDIA/NVSentinel
Length of output: 2225
🏁 Script executed:
#!/bin/bash
# Check imports to see how klog is being used
if [ -f "cmd/device-api-server/main.go" ]; then
sed -n '1,50p' cmd/device-api-server/main.go | cat -n
fiRepository: NVIDIA/NVSentinel
Length of output: 2137
🌐 Web query:
klog v2 JSON format configuration flag.Set logging_format after flag.Parse
💡 Result:
--logging-format=json is not a klog/v2 flag; it comes from k8s.io/component-base/logs (and its logs/api/v1 config). Kubernetes docs describe it as the flag that switches klog output to JSON. [1]
If you want to force JSON in code (even after flag.Parse()), you must do it before initializing logs and before any logging happens (klog requires flags parsed before logging). [2]
Option A: set the flag value after flag.Parse(), then init logs
import (
"flag"
"k8s.io/component-base/logs"
)
func main() {
logs.AddGoFlags(flag.CommandLine) // registers --logging-format (and other logging flags) [3]
flag.Parse()
_ = flag.Set("logging-format", "json") // must happen before InitLogs / any logging
logs.InitLogs() // reads flags and configures logging [3]
defer logs.FlushLogs()
// ... your code ...
}Option B (preferred): avoid flag.Set, set the config directly
Kubernetes moved applying --logging-format-style options into a single “validate & apply” step that should run right after flag parsing. [4] This is done via the logs/api/v1 LoggingConfiguration (“Options”) API. [4]
Key point: setting logging-format after logs are already initialized (or after you already emitted logs) is too late; early startup logs may still be non-JSON. [1]
Sources:
[1] Kubernetes system logs docs (--logging-format=json, JSON format caveats)
[2] klog/v2 docs: flag.Parse must be called before logging
[3] k8s.io/component-base/logs docs: AddGoFlags, InitLogs (call after parsing)
[4] Kubernetes dev thread: applying --logging-format moved to Options.ValidateAndApply and should run ASAP after parsing
🌐 Web query:
klog/v2 SetLogger JSON output configuration methods
💡 Result:
In k8s.io/klog/v2, JSON output isn’t enabled by a “klog JSON mode” switch; instead you get JSON by redirecting klog to a structured backend logger (logr or slog) that’s configured to emit JSON. The relevant configuration methods are: SetLogger / SetLoggerWithOptions, and SetSlogLogger. [1]
1) Use SetSlogLogger with slog.NewJSONHandler (JSON lines)
import (
"log/slog"
"os"
"k8s.io/klog/v2"
)
klog.SetSlogLogger(slog.New(slog.NewJSONHandler(os.Stdout, nil)))JSONHandler produces line-delimited JSON objects. [2]
2) Use SetLogger (or SetLoggerWithOptions) with a logr backend that outputs JSON
SetLogger redirects all klog output to a logr.Logger implementation. [1]
Typical pattern (zap -> logr via zapr.NewLogger):
klog.SetLogger(zapr.NewLogger(zapLog)) // zapLog must be configured for JSONSetLoggerWithOptions is the flexible variant, with these key options: [1]
ContextualLogger(true|false)FlushLogger(func(){...})WriteKlogBuffer(func([]byte){...})(lets you capture/redirect klog’s unstructured output buffer)
3) Note on klogr: its default JSON is “JSON-in-text”, not structured forwarding
klogr historically serialized key/values as JSON and logged that as a text message via klog; this is not recommended when klog is being forwarded to a structured logger. [3]
Sources: klog API docs (SetLogger, SetLoggerWithOptions, SetSlogLogger, options). [1] slog JSONHandler docs. [2] klogr README note about default JSON serialization behavior. [3]
🏁 Script executed:
#!/bin/bash
# Check for logs.InitLogs or klog.SetSlogLogger/SetLogger calls
rg -n "InitLogs|SetSlogLogger|SetLogger" cmd/device-api-server/main.goRepository: NVIDIA/NVSentinel
Length of output: 43
🏁 Script executed:
#!/bin/bash
# Check for any imports of component-base/logs
rg -n "component-base/logs|logs\." cmd/device-api-server/main.goRepository: NVIDIA/NVSentinel
Length of output: 43
🏁 Script executed:
#!/bin/bash
# Read the full main.go to see complete initialization
cat -n cmd/device-api-server/main.goRepository: NVIDIA/NVSentinel
Length of output: 7180
Replace ineffective klog JSON flag configuration with proper klog.SetSlogLogger() initialization.
The flag.Set("logging_format", "json") call on line 126 is ineffective. This flag does not exist in klog/v2 (it belongs to k8s.io/component-base/logs, which this code does not use). To enable JSON output in klog/v2, use klog.SetSlogLogger() with a JSON handler instead:
if config.LogFormat == "json" {
klog.SetSlogLogger(slog.New(slog.NewJSONHandler(os.Stdout, nil)))
}This must be called before any logging occurs (after flag.Parse() but before line 139 where logger.Info() is first called).
🤖 Prompt for AI Agents
In `@cmd/device-api-server/main.go` around lines 123 - 127, The current
flag.Set("logging_format", "json") is ineffective for klog/v2; replace it by
initializing klog with a structured JSON handler: when config.LogFormat ==
"json" call klog.SetSlogLogger(...) using slog.NewJSONHandler(os.Stdout, nil) so
klog emits JSON; place this call after flag.Parse() and before any logging
(i.e., before the first logger.Info() call) to ensure it takes effect; update
the block that currently checks config.LogFormat and remove the flag.Set usage
in favor of klog.SetSlogLogger.
cmd/health-provider/main_stub.go
Outdated
| cfg, nvmlCfg := parseFlags() | ||
| flag.Parse() |
There was a problem hiding this comment.
Remove duplicate flag.Parse() call.
parseFlags() already calls flag.Parse() at line 88, so calling it again at line 36 will cause issues (flags parsed twice, potentially with empty values on the second parse).
🐛 Proposed fix
func main() {
klog.InitFlags(nil)
cfg, nvmlCfg := parseFlags()
- flag.Parse()
logger := klog.Background()📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| cfg, nvmlCfg := parseFlags() | |
| flag.Parse() | |
| func main() { | |
| klog.InitFlags(nil) | |
| cfg, nvmlCfg := parseFlags() | |
| logger := klog.Background() |
🤖 Prompt for AI Agents
In `@cmd/health-provider/main_stub.go` around lines 35 - 36, Remove the duplicate
flag.Parse() call in main_stub.go: parseFlags() already calls flag.Parse(), so
delete the standalone flag.Parse() invocation to avoid parsing flags twice; keep
parseFlags() as the single place that calls flag.Parse() and returns cfg and
nvmlCfg (refer to the parseFlags function and the cfg, nvmlCfg variables).
| import ( | ||
| "io" | ||
| "path" | ||
| "strings" | ||
|
|
||
| "golang.org/x/text/cases" | ||
| "golang.org/x/text/language" | ||
|
|
||
| "k8s.io/gengo/v2/generator" |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
rg -n --type=go 'ParseClientGenTags' code-generator/cmd/client-genRepository: NVIDIA/NVSentinel
Length of output: 2473
🏁 Script executed:
cat -n code-generator/cmd/client-gen/generators/generator_for_type.go | sed -n '90,105p'Repository: NVIDIA/NVSentinel
Length of output: 772
🏁 Script executed:
rg -B5 -A5 'ParseClientGenTags' code-generator/cmd/client-gen --type=goRepository: NVIDIA/NVSentinel
Length of output: 13014
🏁 Script executed:
sed -n '105,112p' code-generator/cmd/client-gen/generators/generator_for_group.goRepository: NVIDIA/NVSentinel
Length of output: 309
🏁 Script executed:
sed -n '80,88p' code-generator/cmd/client-gen/generators/fake/generator_fake_for_type.goRepository: NVIDIA/NVSentinel
Length of output: 483
🏁 Script executed:
sed -n '87,95p' code-generator/cmd/client-gen/generators/fake/generator_fake_for_group.goRepository: NVIDIA/NVSentinel
Length of output: 316
Wrap ParseClientGenTags errors with context across all four call sites.
The error-wrapping guideline requires adding context to errors returned from ParseClientGenTags. Line 96 in generator_for_type.go returns the raw parse error, making failures harder to diagnose. Three additional identical locations exist in the codebase:
generator_for_type.go:96-99generator_for_group.go:107-110fake/generator_fake_for_type.go:83-86fake/generator_fake_for_group.go:90-93
Apply consistent wrapping across all four sites using fmt.Errorf("parse client-gen tags for %s: %w", t.Name.Name, err) to align with the error-wrapping guideline for Go code.
Example for generator_for_type.go
import (
+ "fmt"
"io"
"path"
"strings"
@@
tags, err := util.ParseClientGenTags(append(t.SecondClosestCommentLines, t.CommentLines...))
if err != nil {
- return err
+ return fmt.Errorf("parse client-gen tags for %s: %w", t.Name.Name, err)
}🤖 Prompt for AI Agents
In `@code-generator/cmd/client-gen/generators/generator_for_type.go` around lines
27 - 35, Wrap the error returned by ParseClientGenTags with context at each call
site by replacing bare returns of err with: return fmt.Errorf("parse client-gen
tags for %s: %w", t.Name.Name, err). Update the four occurrences where
ParseClientGenTags is called (in the generators for types and groups and their
fake counterparts) so the function that calls ParseClientGenTags returns the
wrapped error; ensure you import fmt if not already present and reference
t.Name.Name in the formatting string.
| apiVersion: v2 | ||
| name: health-provider | ||
| description: GPU Health Provider for NVSentinel - monitors GPU health via NVML | ||
|
|
||
| type: application | ||
| version: 0.1.0 | ||
| appVersion: "0.1.0" | ||
|
|
||
| keywords: | ||
| - nvidia | ||
| - gpu | ||
| - health | ||
| - monitoring | ||
| - nvml | ||
|
|
||
| home: https://github.com/nvidia/nvsentinel | ||
| sources: | ||
| - https://github.com/nvidia/nvsentinel | ||
|
|
||
| maintainers: | ||
| - name: NVIDIA | ||
| url: https://github.com/nvidia | ||
|
|
||
| dependencies: [] |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check if there are other Helm charts and their locations
echo "=== Helm charts in distros/kubernetes/nvsentinel/charts/ ==="
fd -t d -d 1 . distros/kubernetes/nvsentinel/charts 2>/dev/null || echo "Path not found"
echo ""
echo "=== Helm charts in deployments/helm/ ==="
fd -t d -d 1 . deployments/helm 2>/dev/null || echo "Path not found"
echo ""
echo "=== All Chart.yaml files ==="
fd Chart.yamlRepository: NVIDIA/NVSentinel
Length of output: 2807
Relocate chart to follow project structure conventions.
This Helm chart should be located at distros/kubernetes/nvsentinel/charts/health-provider/ to align with project organization. Other health-related monitors in the project—including gpu-health-monitor, syslog-health-monitor, and csp-health-monitor—are all located in distros/kubernetes/nvsentinel/charts/. Move this chart to the standard location for consistency.
🤖 Prompt for AI Agents
In `@deployments/helm/health-provider/Chart.yaml` around lines 1 - 24, The
Chart.yaml for the chart named "health-provider" is not located with the other
nvsentinel charts; move the entire chart directory containing Chart.yaml for
health-provider into the project's central nvsentinel charts directory so it
sits alongside existing charts like gpu-health-monitor, syslog-health-monitor,
and csp-health-monitor, and update any CI/build references or packaging
manifests that point to the old location (e.g., chart packaging steps or
dependency references) to the new chart directory.
…o change After removing the top-level resource_version field from the Gpu proto message (Task 5 of PR #795 review fixes), all Go code needed to be updated to use metadata.resource_version instead. Changes: - cache.go: Store resource version in Metadata.ResourceVersion (string) instead of the removed Gpu.ResourceVersion (int64) - gpu_service.go: Update all GetResourceVersion() calls to use GetMetadata().GetResourceVersion(), parse string to int64 for optimistic concurrency checks - integration_test.go: Update test assertions to use the new path This follows K8s conventions where resource_version lives in ObjectMeta. Signed-off-by: Carlos Eduardo Arango Gutierrez <cearango@nvidia.com> Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
The go job referenced matrix.step_name, matrix.make_command, and matrix.component without defining a strategy.matrix block, causing the workflow to fail. Also removed orphaned Helm-related steps that referenced a non-existent helm-charts matrix entry. Signed-off-by: Carlos Eduardo Arango Gutierrez <cearango@nvidia.com> Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
parseFlags() already calls flag.Parse(), so the second call in main() would cause a runtime crash with "flag provided but not defined" or "Parse called twice" depending on Go version. Fixed in both main.go (NVML build) and main_stub.go (stub build). Signed-off-by: Carlos Eduardo Arango Gutierrez <cearango@nvidia.com> Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
The --metrics-bind-address flag was parsed but never applied to the controller manager. Now correctly configures metrics server binding. Signed-off-by: Carlos Eduardo Arango Gutierrez <cearango@nvidia.com> Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Proto changes: - Removed top-level resource_version field from Gpu message (use metadata.resource_version instead, following K8s conventions) - Changed UpdateGpuStatusRequest.resource_version from int64 to string to match ObjectMeta.resource_version type - Updated CreateGpu RPC docs to clarify idempotent semantics (returns existing GPU with created=false instead of ALREADY_EXISTS error) - Fixed typo: reprsentation -> representation - Fixed inconsistent indentation in ObjectMeta comments Go code updates: - Removed ResourceVersion from converter ignore list (field no longer exists) - Updated gpu_service.go to parse string resource_version - Updated integration tests for string resource_version Signed-off-by: Carlos Eduardo Arango Gutierrez <cearango@nvidia.com> Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Updated dependencies required by code-generator validation-gen tool. Signed-off-by: Carlos Eduardo Arango Gutierrez <cearango@nvidia.com> Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
1. Changed c.Universe.Type() to c.Universe.Function() for fmt.Errorf since Errorf is a function, not a type. Using Type() could return nil and cause template execution failures. 2. Added conn.Close() in NewForConfig error path to prevent gRPC connection leak when NewForConfigAndClient fails after ClientConnFor succeeds. Note: Run `make -C client-go code-gen` to regenerate client code with these fixes applied. Signed-off-by: Carlos Eduardo Arango Gutierrez <cearango@nvidia.com> Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
- Fix client-go/README.md: correct import path to client/versioned - Add Apache 2.0 license header to .gitattributes - Remove duplicate api/bin/* entry from .gitignore - Remove duplicate JetBrains IDE comment from .gitignore - Remove duplicate pause call from nvml-sidecar-demo.sh
- Add nil guard for config in ClientConnFor - Fix resource leak in NewForConfig by closing connection on error - Update go.mod dependencies
- Add nolint:lll for kubebuilder validation comments that cannot be split - Add whitespace per wsl_v5 linter requirements - Format files with gofmt/goimports
Critical: clone protos in cache mutations to prevent data races, switch remediation controller from Status().Update to Status().Patch, fix klog JSON logging initialization. Major: add atomic ListWithVersion/ListAndSubscribe to cache, fix watch race in gpu_service, add quarantine toleration to reboot jobs, fix drain boolean default, use sync.Once for publisher stop, fix ctx ordering in NVML provider, add nil guards in client-go. Minor: improve error wrapping in code-generator, tighten socket perms, remove dead code, add license header to api/README.md. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Previously, if the status patch to Draining succeeded but pod eviction failed, the event was permanently stuck because the controller only accepted PhaseQuarantined. Now accepts both PhaseQuarantined and PhaseDraining for resilient retry behavior.
Previously, executeReboot returned nil when the job was still running, causing premature transition to Remediated. Now returns errRebootInProgress which triggers a 30s requeue to check the job status again.
Using r.Update() sent the full object and risked conflicts with concurrent writes. Using client.MergeFrom Patch only sends the diff (the annotation changes), eliminating conflict risk and preventing drift in the subsequent status patch.
Previously, if the target node didn't exist, the event stayed stuck in New phase forever. Now transitions to Cancelled with a NodeNotFound reason, allowing the TTL controller to eventually clean it up.
The static DaemonSet was binding gRPC to :50051 (all interfaces), exposing the unauthenticated API to the cluster network. Changed to 127.0.0.1:50051 to match the Helm chart default.
gRPC defaults to 50051 (not 8080), health to 8081, metrics to 9090. Updated EXPOSE directives and comments to match the code defaults.
Removed 11 metric definitions that were registered but never populated by any controller code. These created false observability — showing permanent zeros that could mislead operators building dashboards. Retained: quarantineActionsTotal, drainActionsTotal, ttlDeletionsTotal, remediationActionsTotal, remediationFailuresTotal (all actively used).
Both UpdateGpu and UpdateGpuStatus silently discarded the error from strconv.ParseInt when parsing resource_version. A non-numeric string like "not-a-number" would parse to 0, causing the cache's optimistic concurrency check (expectedVersion > 0) to be bypassed entirely. Now returns codes.InvalidArgument when resource_version is present but not a valid numeric string. Adds regression tests for both RPCs. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
MarkProviderGPUsUnknown was calling Notify N times (once per GPU) while holding the write lock, starving all readers during provider timeout. Now collects events inside the lock, releases the lock, then broadcasts.
Summary
This PR introduces a Kubernetes-native approach to GPU health monitoring and remediation, building on top of the full NVSentinel codebase.
Key Changes
nvsentinel.nvidia.com/v1alpha1): Cluster-wide custom resource for tracking GPU health eventsQuarantineController: Cordons nodes on fatal GPU errorsDrainController: Gracefully evicts pods from quarantined nodesRemediationController: Executes remediation strategies (reboot, GPU reset, manual)TTLController: Automatic cleanup of resolved events based on configurable policiesArchitecture
Phase Progression
Each phase transition is handled by a dedicated controller, ensuring separation of concerns and reliable state management.
Test Plan
cmd/controller-test)Related
docs/plans/2026-02-04-hybrid-device-apiserver-design.mdSummary by CodeRabbit
New Features
Documentation
Infrastructure