Skip to content

feat: cloud-native GPU health event management#795

Draft
ArangoGutierrez wants to merge 58 commits intomainfrom
feat/cloud-native-healthevents
Draft

feat: cloud-native GPU health event management#795
ArangoGutierrez wants to merge 58 commits intomainfrom
feat/cloud-native-healthevents

Conversation

@ArangoGutierrez
Copy link
Contributor

@ArangoGutierrez ArangoGutierrez commented Feb 4, 2026

Summary

This PR introduces a Kubernetes-native approach to GPU health monitoring and remediation, building on top of the full NVSentinel codebase.

Note: This PR supersedes #794, which was incorrectly based on a stripped-down branch. This version is properly rebased onto main and includes all core NVSentinel functionality.

Key Changes

  • HealthEvent CRD (nvsentinel.nvidia.com/v1alpha1): Cluster-wide custom resource for tracking GPU health events
  • Controller Chain: Automated remediation workflow with four specialized controllers:
    • QuarantineController: Cordons nodes on fatal GPU errors
    • DrainController: Gracefully evicts pods from quarantined nodes
    • RemediationController: Executes remediation strategies (reboot, GPU reset, manual)
    • TTLController: Automatic cleanup of resolved events based on configurable policies
  • HealthProvider Service: gRPC-based GPU health monitoring using NVML
  • CRDPublisher: Bridge between device-api-server and Kubernetes API
  • Device API enhancements: k8s-idiomatic Go SDK, client-go integration

Architecture

┌─────────────────┐    gRPC    ┌──────────────────┐
│ HealthProvider  │ ─────────▶│ Device-API-Server│
│   (DaemonSet)   │            │                  │
└─────────────────┘            └────────┬─────────┘
        │                               │
        │ NVML                          │ CRDPublisher
        ▼                               ▼
    ┌───────┐                   ┌───────────────┐
    │  GPU  │                   │  HealthEvent  │
    └───────┘                   │     (CRD)     │
                                └───────┬───────┘
                                        │
              ┌─────────────────────────┼─────────────────────────┐
              │                         │                         │
              ▼                         ▼                         ▼
    ┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐
    │ QuarantineCtrl   │───▶│   DrainCtrl      │───▶│ RemediationCtrl  │
    │  (cordon node)   │    │  (evict pods)    │    │ (reboot/reset)   │
    └──────────────────┘    └──────────────────┘    └──────────────────┘

Phase Progression

New → Quarantined → Drained → Remediated → Resolved

Each phase transition is handled by a dedicated controller, ensuring separation of concerns and reliable state management.

Test Plan

  • Unit tests for all controllers (quarantine, drain, remediation, TTL)
  • Unit tests for CRDPublisher and HealthProvider
  • Integration test harness (cmd/controller-test)
  • Manual integration test on AWS EKS cluster with GPUs
  • Full phase progression verified: New → Quarantined → Drained → Remediated (2 seconds)

Related

Summary by CodeRabbit

  • New Features

    • Added Kubernetes-style GPU device API with read/write CRUD operations and watch support
    • Introduced Go client library for accessing GPU resources via gRPC over Unix sockets
    • Added Device API Server for managing GPU state and metadata
    • Implemented NVML provider sidecar for GPU health monitoring and status propagation
    • Added Helm charts for simplified deployment
  • Documentation

    • Added comprehensive development guides and API documentation
    • Included example applications demonstrating client usage patterns
    • Added deployment and configuration documentation
  • Infrastructure

    • Updated build system and code generation framework
    • Reorganized CI/CD workflows and testing infrastructure

ArangoGutierrez and others added 22 commits February 4, 2026 16:45
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Signed-off-by: Dan Huenecke <dhuenecke@nvidia.com>
Signed-off-by: Dan Huenecke <dhuenecke@nvidia.com>
Add ProviderService with RPCs for GPU lifecycle management:
- RegisterGpu / UnregisterGpu
- UpdateGpuStatus / UpdateGpuCondition

Also adds resource_version field to Gpu message for optimistic
concurrency control.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Implements the Device API Server - a gRPC server providing unified
GPU device information via GpuService (consumer) and ProviderService
(provider) APIs.

Components:
- In-memory cache with RWMutex for thread-safe GPU state
- Watch broadcaster for real-time state change notifications
- NVML fallback provider for GPU enumeration and XID health monitoring
- Prometheus metrics (cache, watch, NVML, gRPC stats)
- Helm chart with ServiceMonitor and alerting rules

Includes comprehensive unit tests (35 tests, race-clean) and
documentation (API reference, operations guide, design docs).

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
The gRPC API is unauthenticated, which poses a security risk when
bound to all network interfaces. This change:

- Changes default GRPCAddress from ":50051" to "127.0.0.1:50051"
- Updates flag documentation with security warning
- Expands network security documentation per Copilot review
- Updates design and operations docs to reflect new default

Users who need network access can explicitly set --grpc-address=:50051,
with full awareness of the security implications.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
The Heartbeat RPC is intentionally defined but unimplemented,
reserved for future provider liveness detection. This commit:

- Improves the error message to explain the reservation purpose
- Updates code comments to clarify this is intentional
- Aligns implementation comments with proto documentation

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Add ResourceVersion() method to GpuCache and use it to populate
ListMeta.ResourceVersion in ListGpus responses. This enables:

- Optimistic concurrency patterns in consumers
- Cache invalidation based on version changes
- Consistency tracking across list operations

The resource version is a monotonically increasing counter that
increments on every cache mutation.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Add module structure documentation to DEVELOPMENT.md to clarify
the relationship between the different Go modules in this monorepo:

- Device API Server uses github.com/nvidia/device-api
- API definitions use github.com/nvidia/device-api/api
- client-go and code-generator use github.com/nvidia/nvsentinel

This documents the intentional divergence where device-api is a
standalone component that may be published as its own repository.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
The PR introduced inconsistent module paths where the api module
was changed to github.com/nvidia/device-api/api while client-go
still uses github.com/nvidia/nvsentinel/api. This broke CI.

This commit aligns all modules to use github.com/nvidia/nvsentinel:
- api/go.mod: github.com/nvidia/nvsentinel/api
- go.mod: github.com/nvidia/nvsentinel
- All imports updated to use nvsentinel path
- Proto go_package options updated
- Generated files updated

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Addresses review feedback from pteranodan on PR #720 regarding API
design and NVML provider coupling.

BREAKING CHANGES:
- Removed ProviderService in favor of unified GpuService
- RegisterGpu() → CreateGpu()
- UnregisterGpu() → DeleteGpu()
- UpdateGpuCondition() → removed (use UpdateGpuStatus with full status)

API Changes:
- Consolidated GpuService with both read and write operations
- Added standard K8s CRUD methods: CreateGpu, UpdateGpu, UpdateGpuStatus, DeleteGpu
- Removed custom provider methods (RegisterGpu, UnregisterGpu, UpdateGpuCondition)
- Deleted api/proto/device/v1alpha1/provider.proto
- Updated gpu.proto with unified service definition

Service Layer:
- Created pkg/deviceapiserver/service/gpu_service.go (unified service)
- Deleted consumer.go and provider.go (replaced by gpu_service.go)
- Updated integration tests to use new API
- Improved cache interface with Create, Update, UpdateStatusWithVersion, Delete

NVML Provider Decoupling:
- NVML provider now uses GpuServiceClient instead of direct cache access
- Server creates loopback gRPC connection for embedded NVML provider
- Added standalone cmd/nvml-provider/ binary for sidecar deployment
- Provider now "dogfoods" its own API via gRPC

Documentation:
- Updated README.md with new unified API architecture
- Rewrote docs/api/device-api-server.md with single service model
- Updated docs/operations/device-api-server.md
- Simplified docs/design/device-api-server.md
- Removed obsolete task lists and design documents

Deployment:
- Reorganized charts: charts/device-api-server/ → deployments/helm/nvsentinel/
- Added deployments/container/Dockerfile
- Added deployments/static/ for static manifests
- Added demos/nvml-sidecar-demo.sh

Other:
- Updated .gitignore for planning documents
- Enhanced Makefile with new targets
- Added metrics for unified service

This refactoring improves:
1. API consistency with Kubernetes conventions
2. Tooling compatibility (code generators, controller-runtime)
3. Decoupling of NVML provider from server internals
4. Flexibility to run NVML as embedded or sidecar
5. Overall code maintainability and clarity

Signed-off-by: Eduardo Arango <eduardoa@nvidia.com>
Update Go version to 1.25 across all module files and the Dockerfile
to ensure consistent builds. The previous version (1.25.5) was a
non-existent patch version, and the Dockerfile was using 1.24 which
doesn't have Docker Hub images available.

Changes:
- go.mod: 1.25.5 -> 1.25
- api/go.mod: 1.25.5 -> 1.25
- Dockerfile: 1.24 -> 1.25

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
The flag parsing was broken because flag.Parse() was called in main()
after parseFlags() had already returned the config struct. This meant
command-line arguments like --server-address were never applied to
the config.

Move flag.Parse() inside parseFlags() after defining all flags but
before returning, so the parsed values are captured in the returned
config struct.

Also reorder klog.InitFlags() to be called before parseFlags() so
klog flags are also registered before parsing.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
The daemonset template was passing --provider-address to the
device-api-server container when nvmlProvider.enabled was true,
but this flag doesn't exist in the device-api-server binary.

The sidecar architecture doesn't require the server to know about
providers - providers connect TO the server via gRPC. Remove the
invalid flag to fix container crashes.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Multiple fixes for the sidecar test values:

- serverAddress: Change from localhost:9001 to localhost:50051 to
  match the actual gRPC port the device-api-server listens on

- nodeSelector: Explicitly null out nvidia.com/gpu.present to prevent
  Helm from merging it with the node-type=gpu selector. The test
  cluster only has node-type=gpu labels.

- securityContext: Run as root (uid 0) to allow the server to create
  Unix sockets in the hostPath /var/run/device-api/ directory. The
  non-root user (65534) cannot write to this host directory.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Major improvements to the NVML sidecar demo script:

Cross-platform builds:
- Use docker buildx for linux/amd64 builds (required when building
  on ARM Macs for x86 clusters)
- Use --push flag to push directly after build (required for
  cross-platform builds which can't be stored locally)
- Check for buildx availability at startup

Idempotency:
- Check if images already exist on ttl.sh before building
- Prompt user whether to rebuild existing images
- Use helm upgrade if release exists, helm install if not
- Use rollout restart instead of relying on --wait timeout
- Add 2-minute rollout timeout with status check on failure

Bug fixes:
- Fix namespace creation piped command (color codes corrupted YAML)
- Fix pod label selector (app.kubernetes.io/name=nvsentinel)

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
The in-pod metrics check was failing because the container name is 'nvsentinel' (from Chart.Name), not 'device-api-server'. Rather than add complexity to detect the container name, remove this redundant check - metrics are already validated in Step 7 (show_metrics).

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
…-api-server'

The Helm chart names the main container using {{ .Chart.Name }}, which resolves to 'nvsentinel'. Fixed all kubectl logs/exec commands to use the correct container name.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
The committed protobuf files were generated with protoc v33.4, but CI was installing v33.0. This version mismatch caused verify-codegen to fail because regenerating with a different protoc version produces different file headers.
Comprehensive design document capturing:
- In-memory cache architecture (replacing Kine/SQLite)
- Memory safety bounds sized for Vera Rubin Ultra NVL576 (1024 GPUs)
- Readiness gate blocking consumers until providers register
- Watch broadcaster with slow subscriber eviction
- What to take from each PR
- Implementation phases
- Authorship preservation requirements

This document serves as the spec for cherry-picking #720 onto merged #718.

Co-authored-by: Carlos Eduardo Arango Gutierrez <carangog@redhat.com>
Co-authored-by: Dan Stone <danstone@nvidia.com>
Prepare for isolated worktree development workflow.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
This commit introduces a Kubernetes-native approach to GPU health monitoring
and remediation, replacing the previous MongoDB-centric architecture.

New components:
- HealthEvent CRD (nvsentinel.nvidia.com/v1alpha1) for cluster-wide state
- QuarantineController: Cordons nodes on fatal GPU errors
- DrainController: Gracefully evicts pods from quarantined nodes
- RemediationController: Executes remediation strategies (reboot, gpu-reset)
- TTLController: Automatic cleanup of resolved events based on policy
- HealthProvider service: gRPC-based GPU health monitoring using NVML
- CRDPublisher: Publishes GPU health events from device-api-server

Key features:
- Automatic phase progression: New → Quarantined → Drained → Remediated
- Configurable TTL policies via ConfigMap
- Prometheus metrics for all controllers
- Comprehensive unit tests
- Integration test harness (cmd/controller-test)

Helm chart updates:
- CRD installation under deployments/helm/nvsentinel/crds/
- TTL policy ConfigMap template
- Health-provider DaemonSet chart

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 4, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

  • 🔍 Trigger a full review
✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feat/cloud-native-healthevents

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Provides persistent context for AI assistants working on the cloud-native
GPU health event management feature, including architecture diagrams,
key file locations, and integration test procedures.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
@ArangoGutierrez ArangoGutierrez force-pushed the feat/cloud-native-healthevents branch from 10c3967 to 1a8dd5d Compare February 4, 2026 15:49
Document the architecture decision for migrating E2E tests from
MongoDB-based to CRD-based architecture:
- Update existing tests (don't rebuild)
- Replace MongoDB assertions with CRD assertions
- 4-phase implementation plan (3-4 weeks)
- Success criteria and test file mapping

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Introduces Kubernetes-native GPU health event management by adding a gRPC-based Device API client SDK, a customized code generator (client-gen fork) to produce gRPC-backed clientsets/informers, and new binaries/utilities to run the health-provider and controller integration tests.

Changes:

  • Add code-generator module (modified client-gen) to generate gRPC-backed clientsets, informers, listers, and fakes.
  • Add client-go module implementing gRPC transport (nvgrpc) plus generated client artifacts and runnable examples/tests.
  • Add/extend runtime entrypoints for health-provider (NVML + stub builds), device-api-server startup, and a controller integration test runner.

Reviewed changes

Copilot reviewed 116 out of 185 changed files in this pull request and generated 12 comments.

Show a summary per file
File Description
code-generator/go.mod New Go module for custom gRPC-aware code generation tooling
code-generator/cmd/client-gen/types/types.go Forked/modified upstream client-gen types to support protobuf package wiring
code-generator/cmd/client-gen/types/helpers_test.go Copied upstream tests for type helpers to keep behavior consistent
code-generator/cmd/client-gen/types/helpers.go Copied upstream helper logic for group/version parsing & sorting
code-generator/cmd/client-gen/main.go New entrypoint for custom client-gen execution
code-generator/cmd/client-gen/generators/util/tags_test.go Copied upstream tests validating client-gen tag parsing
code-generator/cmd/client-gen/generators/util/gvpackages.go Copied upstream utility for splitting path/group-version
code-generator/cmd/client-gen/generators/scheme/generator_for_scheme.go Modified scheme generator to support gRPC-oriented layout
code-generator/cmd/client-gen/generators/generator_for_expansion.go Copied upstream expansion generator behavior
code-generator/cmd/client-gen/generators/generator_for_clientset.go Modified clientset generator templates to use nvgrpc transport
code-generator/cmd/client-gen/generators/fake/generator_fake_for_group.go Modified fake client generation for gRPC-backed typed clients
code-generator/cmd/client-gen/generators/fake/generator_fake_for_clientset.go Modified fake clientset generation to satisfy client-go interfaces
code-generator/cmd/client-gen/generators/fake/fake_client_generator.go Targets wiring for fake typed clients and fake clientset generation
code-generator/cmd/client-gen/args/gvtype.go Args parsing for included type overrides (forked upstream)
code-generator/cmd/client-gen/args/gvpackages_test.go Tests for input group/version package resolution
code-generator/cmd/client-gen/args/gvpackages.go Input group/version package resolution including base path support
code-generator/cmd/client-gen/args/args.go Custom args including --proto-base for protobuf stub imports
code-generator/cmd/client-gen/README.md Documentation for using the custom gRPC client generator
code-generator/README.md Overview of code-generator module purpose and usage
code-generator/CONFIGURATION.md How generator tool versions are sourced/overridden
cmd/health-provider/main_stub.go Stub health-provider binary for builds without NVML
cmd/health-provider/main.go NVML-enabled health-provider service entrypoint
cmd/device-api-server/main.go Device API Server main wiring and config/logging startup
cmd/controller-test/main.go Local integration test runner for HealthEvent controllers
client-go/version/version_test.go Unit test for client-go User-Agent formatting
client-go/version/version.go User-Agent/version injection for gRPC client identification
client-go/tests/device/v1alpha1/fake_client_test.go Tests for generated fake clientset CRUD/list/watch behaviors
client-go/nvgrpc/watcher.go Watch adapter turning gRPC stream events into watch.Interface
client-go/nvgrpc/options_test.go Unit tests for dial option accumulation behavior
client-go/nvgrpc/options.go Dial options (logger, unary/stream interceptors) for gRPC client
client-go/nvgrpc/interceptors_test.go Unit tests for latency interceptors behavior
client-go/nvgrpc/interceptors.go Latency interceptors for unary & stream establishment
client-go/nvgrpc/config_test.go Unit tests for config defaulting/validation behavior
client-go/nvgrpc/config.go Client config defaulting + env override handling for target/user-agent
client-go/nvgrpc/client_conn_test.go Tests ensuring config defaulting doesn’t mutate caller input
client-go/nvgrpc/client_conn.go Central gRPC connection factory with keepalive + interceptor chaining
client-go/listers/device/v1alpha1/gpu.go Generated lister for GPU resources
client-go/listers/device/v1alpha1/expansion_generated.go Generated lister expansion hook
client-go/informers/externalversions/internalinterfaces/factory_interfaces.go Generated informer internal interfaces
client-go/informers/externalversions/generic.go Generated generic informer routing by GVR
client-go/informers/externalversions/device/v1alpha1/interface.go Generated device group informer interfaces
client-go/informers/externalversions/device/v1alpha1/gpu.go Generated GPU informer driven by versioned client
client-go/informers/externalversions/device/interface.go Generated group-level informer interfaces
client-go/hack/update-codegen.sh Script to regenerate client-go artifacts using custom codegen
client-go/hack/boilerplate.go.txt Boilerplate header for generated client-go files
client-go/go.mod New client-go module dependencies (grpc + k8s libs)
client-go/examples/watch/main.go Example: watch stream with custom interceptors and shutdown
client-go/examples/watch/README.md Documentation for watch example
client-go/examples/fake-server/README.md Documentation for fake device-api server example
client-go/examples/fake-client/main_test.go Example test showing informer + fake client integration
client-go/examples/fake-client/README.md Documentation for fake-client example
client-go/examples/controller/main.go Example: controller-runtime informer injection using gRPC informer
client-go/examples/controller/README.md Documentation for controller example
client-go/examples/client/main.go Example: basic list/get client usage
client-go/examples/client/README.md Documentation for basic client example
client-go/examples/README.md Index of examples and recommended local workflow
client-go/client/versioned/typed/device/v1alpha1/gpu.go Generated typed GPU client backed by gRPC/protobuf
client-go/client/versioned/typed/device/v1alpha1/generated_expansion.go Generated typed client expansion hook
client-go/client/versioned/typed/device/v1alpha1/fake/fake_gpu.go Generated fake typed GPU client
client-go/client/versioned/typed/device/v1alpha1/fake/fake_device_client.go Generated fake group client
client-go/client/versioned/typed/device/v1alpha1/fake/doc.go Generated fake typed client package docs
client-go/client/versioned/typed/device/v1alpha1/doc.go Generated typed client package docs
client-go/client/versioned/typed/device/v1alpha1/device_client.go Generated group client constructor wiring
client-go/client/versioned/scheme/register.go Generated scheme registration for client-go
client-go/client/versioned/scheme/doc.go Generated scheme package docs
client-go/client/versioned/fake/register.go Generated scheme registration for fake clientset
client-go/client/versioned/fake/doc.go Generated fake clientset package docs
client-go/client/versioned/fake/clientset_generated.go Generated fake clientset implementation
client-go/client/versioned/clientset.go Generated versioned clientset entrypoint
client-go/README.md Client SDK overview, installation, usage patterns, and examples
client-go/Makefile Client-go local dev workflow (codegen, test, build, lint, clean)
client-go/DEVELOPMENT.md Developer guide for generated vs manual code areas
api/nvsentinel/v1alpha1/groupversion_info.go New API group registration for HealthEvent CRD types
api/nvsentinel/v1alpha1/doc.go Package docs and generator markers for nvsentinel API group
api/hack/update-codegen.sh API module generation script (protos + deepcopy + goverter)
api/hack/boilerplate.go.txt Boilerplate header for generated API files
api/go.mod API module dependency updates (grpc/protobuf + k8s/apimachinery)
api/device/v1alpha1/zz_generated.goverter.go Generated goverter conversions between protobuf and KRM
api/device/v1alpha1/zz_generated.deepcopy.go Generated deepcopy helpers for Kubernetes runtime objects
api/device/v1alpha1/register.go device API group registration (GPU/GPUList)
api/device/v1alpha1/gpu_types.go GPU KRM type definitions + client-gen markers/verbs
api/device/v1alpha1/gpu_conversion_test.go Tests for protobuf ↔ KRM conversion correctness
api/device/v1alpha1/gpu_conversion.go Wrapper conversion helpers selecting generated goverter impl
api/device/v1alpha1/doc.go Package docs and groupName marker for device API group
api/device/v1alpha1/converter.go Goverter mapping configuration and helper conversions
api/README.md API module structure + generation pipeline overview
api/Makefile New API module workflow (code-gen, manifests, test, build, tools)
api/DEVELOPMENT.md API dev workflow and conventions documentation
.versions.yaml Tool/version pin updates for codegen-related tooling
.golangci.yml Lint config updates (parallel runners, govet enabled, exclusions)
.gitattributes Mark generated/copied code paths as linguist-generated

klog.InitFlags(nil)

cfg, nvmlCfg := parseFlags()
flag.Parse()
Copy link

Copilot AI Feb 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

flag.Parse() is invoked twice (once in main and once inside parseFlags). With Go's default CommandLine FlagSet (ExitOnError), the second parse will terminate the process with a "Parse called twice" error. Fix by parsing flags exactly once: typically, have parseFlags only register/bind flags and return defaults, then call flag.Parse() once in main, and apply env overrides after parsing.

Suggested change
flag.Parse()

Copilot uses AI. Check for mistakes.
Comment on lines +111 to +113
// Parse flags first
flag.Parse()

Copy link

Copilot AI Feb 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

flag.Parse() is invoked twice (once in main and once inside parseFlags). With Go's default CommandLine FlagSet (ExitOnError), the second parse will terminate the process with a "Parse called twice" error. Fix by parsing flags exactly once: typically, have parseFlags only register/bind flags and return defaults, then call flag.Parse() once in main, and apply env overrides after parsing.

Copilot uses AI. Check for mistakes.
Comment on lines 35 to 36
cfg, nvmlCfg := parseFlags()
flag.Parse()
Copy link

Copilot AI Feb 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same double-parse issue as the NVML build: parseFlags calls flag.Parse(), and main calls flag.Parse() again. This will exit at runtime with "Parse called twice". Remove one of the parses (recommended: do not parse inside parseFlags) and keep env overrides applied after the single parse.

Copilot uses AI. Check for mistakes.
Comment on lines +88 to +89
flag.Parse()

Copy link

Copilot AI Feb 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same double-parse issue as the NVML build: parseFlags calls flag.Parse(), and main calls flag.Parse() again. This will exit at runtime with "Parse called twice". Remove one of the parses (recommended: do not parse inside parseFlags) and keep env overrides applied after the single parse.

Copilot uses AI. Check for mistakes.
allGroups := clientgentypes.ToGroupVersionInfo(g.groups, g.groupGoNames)
m := map[string]interface{}{
"allGroups": allGroups,
"fmtErrorf": c.Universe.Type(types.Name{Package: "fmt", Name: "Errorf"}),
Copy link

Copilot AI Feb 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fmt.Errorf is a function, but this code looks it up via c.Universe.Type(...). In gengo, functions should be retrieved via c.Universe.Function(...). Leaving this as Type risks returning nil and causing template execution to fail or generate invalid references when emitting $.fmtErrorf|raw$.

Suggested change
"fmtErrorf": c.Universe.Type(types.Name{Package: "fmt", Name: "Errorf"}),
"fmtErrorf": c.Universe.Function(types.Name{Package: "fmt", Name: "Errorf"}),

Copilot uses AI. Check for mistakes.

metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/api/meta"
"github.com/nvidia/nvsentinel/client-go/clientset/versioned"
Copy link

Copilot AI Feb 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Quick Start example imports github.com/nvidia/nvsentinel/client-go/clientset/versioned, but this PR introduces the generated client under client-go/client/versioned (e.g., client-go/client/versioned/clientset.go). Update the import path in the README snippet so it compiles for users copying it.

Suggested change
"github.com/nvidia/nvsentinel/client-go/clientset/versioned"
"github.com/nvidia/nvsentinel/client-go/client/versioned"

Copilot uses AI. Check for mistakes.

var newClientsetForConfigAndClientTemplate = `
// NewForConfigAndClient creates a new Clientset for the given config and gRPC client connection.
// The provided gRPC client connection provided takes precedence over the configured transport values.
Copy link

Copilot AI Feb 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment has a duplicated word (provided ... provided). Reword to remove the duplication for readability.

Suggested change
// The provided gRPC client connection provided takes precedence over the configured transport values.
// The provided gRPC client connection takes precedence over the configured transport values.

Copilot uses AI. Check for mistakes.
Comment on lines +49 to +53
func NewWatcher(
source Source,
cancel context.CancelFunc,
logger logr.Logger,
) watch.Interface {
Copy link

Copilot AI Feb 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Watcher is core to informer reliability (mapping gRPC stream events/errors into watch.Interface semantics), but there are no unit tests covering event-type mapping (ADDED/MODIFIED/DELETED/ERROR), normal stream shutdown (EOF/Canceled), or the gRPC-code → metav1.Status mapping in sendError. Add tests using a fake Source implementation to validate the emitted watch events and termination behavior.

Copilot uses AI. Check for mistakes.
Comment on lines +88 to +89
// nolint:cyclop // Complexity is necessary to handle various gRPC stream states and event types.
func (w *Watcher) receive() {
Copy link

Copilot AI Feb 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Watcher is core to informer reliability (mapping gRPC stream events/errors into watch.Interface semantics), but there are no unit tests covering event-type mapping (ADDED/MODIFIED/DELETED/ERROR), normal stream shutdown (EOF/Canceled), or the gRPC-code → metav1.Status mapping in sendError. Add tests using a fake Source implementation to validate the emitted watch events and termination behavior.

Copilot uses AI. Check for mistakes.
}
}

func (w *Watcher) sendError(err error) {
Copy link

Copilot AI Feb 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Watcher is core to informer reliability (mapping gRPC stream events/errors into watch.Interface semantics), but there are no unit tests covering event-type mapping (ADDED/MODIFIED/DELETED/ERROR), normal stream shutdown (EOF/Canceled), or the gRPC-code → metav1.Status mapping in sendError. Add tests using a fake Source implementation to validate the emitted watch events and termination behavior.

Copilot uses AI. Check for mistakes.
Audit results for 4 test files (9 test scenarios):
- smoke_test.go: Full fatal/unsupported event flows
- fault_quarantine_test.go: CEL, pre-cordon, circuit breaker, strategy
- node_drainer_test.go: Eviction modes, partial drain
- fault_remediation_test.go: RebootNode CR creation

Key findings:
- All tests use SendHealthEvent() HTTP POST → need CRD creation
- Node label state checks → need CRD status.phase checks
- MongoDB change streams → need K8s informer/watch
- 6 new helper functions required for CRD operations

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Add new helper functions for E2E tests to work with HealthEvent CRDs:

Builder pattern:
- NewHealthEventCRD() - fluent builder with sensible defaults
- With* methods for all spec fields

CRD operations:
- CreateHealthEventCRD, GetHealthEventCRD, DeleteHealthEventCRD
- ListHealthEventCRDs, ListHealthEventCRDsForNode
- DeleteAllHealthEventCRDs

Phase waiting:
- WaitForHealthEventPhase, WaitForHealthEventPhaseNotEqual
- WaitForHealthEventCondition, WaitForHealthEventPhaseSequence

Assertions:
- AssertHealthEventPhase, AssertHealthEventNotExists
- AssertHealthEventHasCondition, AssertHealthEventNeverReachesPhase
- AssertNodeQuarantinedCondition, AssertPodsDrainedCondition

Migration helpers:
- SendHealthEventViaCRD - drop-in replacement for SendHealthEvent
- SendHealthyEventViaCRD - drop-in replacement for SendHealthyEvent
- TriggerFullRemediationFlowViaCRD - CRD version of flow trigger

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
- Register nvsentinelv1alpha1.HealthEvent types in test scheme
- Move keyHealthEventName to main_test.go (shared across build tags)
- Remove unused v1 import from smoke_test.go

Phase 4 validation confirmed:
- Controllers process HealthEvent CRDs correctly
- Phase transitions work: New → Quarantined → Draining
- Node cordoning works correctly

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
The E2E tests workflow was designed for the old microservices architecture
with MongoDB/PostgreSQL datastores. The new CRD-based system doesn't use
these datastores - HealthEvents are now stored as Kubernetes Custom Resources.

E2E testing for the new architecture will be handled differently.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 16

Note

Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.gitignore (1)

41-43: ⚠️ Potential issue | 🟡 Minor

Remove duplicate comment line.

The JetBrains IDEs section header comment is duplicated on consecutive lines.

🧹 Proposed fix
 ### JetBrains IDEs (GoLand, PyCharm, IntelliJ) ###
-### JetBrains IDEs (GoLand, PyCharm, IntelliJ) ###
 # User-specific stuff
🤖 Fix all issues with AI agents
In @.gitattributes:
- Around line 1-21: Add the Apache 2.0 license header to the top of
.gitattributes: insert the standard multi-line Apache-2.0 header text (including
copyright line, SPDX identifier or full license notice and year/owner as per
project convention) as the first lines of the file so the file contains the
required license header before any content; ensure formatting matches other
repository files' headers.

In @.github/workflows/ci.yml:
- Around line 40-70: The "go" job references matrix variables (matrix.step_name,
matrix.make_command, matrix.component) but lacks a strategy.matrix; add a
strategy block to the "go" job that defines the matrix entries used (e.g.,
step_name, make_command, component) or restore the original strategy.matrix if
it was removed so those variables are defined for the job; update the job's
top-level stanza "go" to include strategy: matrix: with the appropriate
permutations that feed the referenced matrix.step_name, matrix.make_command and
matrix.component.

In `@api/proto/device/v1alpha1/gpu.proto`:
- Around line 360-375: UpdateGpuStatusRequest currently defines resource_version
as int64 while ObjectMeta.resource_version is a string, causing type mismatch;
change UpdateGpuStatusRequest.resource_version to string to match
ObjectMeta.resource_version (or vice versa if you choose numeric versioning) and
update any callers/serializers that read/write
UpdateGpuStatusRequest.resource_version and related handlers (e.g., the
UpdateGpuStatus RPC handling code and client usage) to use the new string type
so optimistic-concurrency checks compare the same type across messages.
- Around line 210-222: The RPC docs for CreateGpu and the
CreateGpuResponse.created field conflict; decide whether CreateGpu should return
an ALREADY_EXISTS error or use idempotent semantics with
CreateGpuResponse.created (true/false) returning the existing GPU. Then make
them consistent: if you choose idempotent upsert, update the CreateGpu RPC
comment to state it returns success with created=false and the existing GPU when
a name already exists and keep/CreateGpuResponse.created semantics; if you
choose error-on-duplicate, remove or repurpose CreateGpuResponse.created and
change the CreateGpu RPC comment to state it returns ALREADY_EXISTS and ensure
server code for rpc CreateGpu enforces that behavior. Apply the change to the
documented remarks for rpc CreateGpu and the CreateGpuResponse message so both
match the chosen behavior.
- Around line 96-111: The Gpu message defines two different resource_version
fields (ObjectMeta.resource_version as a string and Gpu.resource_version as
int64), causing confusion; either remove the top-level int64 resource_version
from message Gpu or rename it to a clearly distinct semantic name (e.g.,
generation or internal_version) to avoid collision with
ObjectMeta.resource_version; update the proto definition for message Gpu (field
currently declared as "int64 resource_version = 4") accordingly, adjust all code
references/usages to the new name or remove usages if deleted, update any
comments/docs in the proto to reflect the change, and regenerate protobuf stubs
wherever they are consumed.

In `@api/README.md`:
- Around line 1-38: The README "NVIDIA Device API: API Definitions" is missing
the Apache-2.0 license header; add the standard Apache License 2.0 header
(including the SPDX identifier and full copyright notice used across the repo)
to the top of this README so the header appears before the existing title and
content, ensure the year and copyright holder match the repository convention,
and preserve the existing markdown below the header.

In `@client-go/client/versioned/typed/device/v1alpha1/gpu.go`:
- Around line 88-107: In gpus.List, guard against a nil result from
devicev1alpha1.FromProtoList(resp.GetGpuList()) before accessing list.Items:
after calling FromProtoList, check if list == nil and if so return nil,
fmt.Errorf("received nil GPU list from proto for namespace %s",
c.getNamespace()); ensure the generator template that produces this file imports
fmt and applies the nil-check consistently (reference symbols: gpus.List,
devicev1alpha1.FromProtoList, resp.GetGpuList, c.getNamespace, and GPUList).
- Around line 66-86: The Get implementation for gpus calls
devicev1alpha1.FromProto(resp.GetGpu()) and immediately logs
obj.GetResourceVersion(), but FromProto can return nil; add a nil guard after
obj := devicev1alpha1.FromProto(resp.GetGpu()) to check if obj == nil and return
a descriptive error (e.g., wrap fmt.Errorf or errors.New) instead of proceeding
to call obj.GetResourceVersion(); apply this change in the generator template
that produces the gpus.Get function so all generated clients (and the GetGpu
call path to c.client.GetGpu / resp.GetGpu) include the nil-check and error
return.

In `@client-go/informers/externalversions/device/interface.go`:
- Line 17: Add a package-level godoc comment at the top of the device package
file: insert a comment starting with "// Package device ..." immediately above
the "package device" declaration that briefly describes the purpose and
responsibilities of the device package (e.g., it provides informer shared types
for device API group). Ensure the comment follows godoc style and summarizes the
package's intent.

In `@client-go/informers/externalversions/device/v1alpha1/interface.go`:
- Line 17: Add a package-level godoc comment to the v1alpha1 package in
interface.go by inserting a top-of-file comment that begins with "// Package
v1alpha1" and briefly describes the purpose of this package (e.g., that it
provides informer interfaces/types for device v1alpha1). Ensure the comment is
placed immediately above the "package v1alpha1" declaration so standard Go tools
and linters recognize the package documentation.

In `@client-go/nvgrpc/client_conn.go`:
- Around line 26-28: Guard against a nil Config in ClientConnFor: check if the
incoming pointer config is nil before dereferencing (cfg := *config) and return
a clear error (e.g., fmt.Errorf or errors.New) if it is; update ClientConnFor to
perform this nil check at the top so callers that pass nil get a descriptive
error instead of a panic while leaving the rest of the function (using cfg and
opts ...DialOption) unchanged.

In `@client-go/README.md`:
- Around line 40-44: Update the incorrect import in the Quick Start import
block: replace the wrong import string
"github.com/nvidia/nvsentinel/client-go/clientset/versioned" with
"github.com/nvidia/nvsentinel/client-go/client/versioned" so the example imports
the correct client package; ensure the import block containing metav1, api/meta,
and nvgrpc uses the corrected path.

In `@cmd/device-api-server/main.go`:
- Around line 123-127: The current flag.Set("logging_format", "json") is
ineffective for klog/v2; replace it by initializing klog with a structured JSON
handler: when config.LogFormat == "json" call klog.SetSlogLogger(...) using
slog.NewJSONHandler(os.Stdout, nil) so klog emits JSON; place this call after
flag.Parse() and before any logging (i.e., before the first logger.Info() call)
to ensure it takes effect; update the block that currently checks
config.LogFormat and remove the flag.Set usage in favor of klog.SetSlogLogger.

In `@cmd/health-provider/main_stub.go`:
- Around line 35-36: Remove the duplicate flag.Parse() call in main_stub.go:
parseFlags() already calls flag.Parse(), so delete the standalone flag.Parse()
invocation to avoid parsing flags twice; keep parseFlags() as the single place
that calls flag.Parse() and returns cfg and nvmlCfg (refer to the parseFlags
function and the cfg, nvmlCfg variables).

In `@code-generator/cmd/client-gen/generators/generator_for_type.go`:
- Around line 27-35: Wrap the error returned by ParseClientGenTags with context
at each call site by replacing bare returns of err with: return
fmt.Errorf("parse client-gen tags for %s: %w", t.Name.Name, err). Update the
four occurrences where ParseClientGenTags is called (in the generators for types
and groups and their fake counterparts) so the function that calls
ParseClientGenTags returns the wrapped error; ensure you import fmt if not
already present and reference t.Name.Name in the formatting string.

In `@deployments/helm/health-provider/Chart.yaml`:
- Around line 1-24: The Chart.yaml for the chart named "health-provider" is not
located with the other nvsentinel charts; move the entire chart directory
containing Chart.yaml for health-provider into the project's central nvsentinel
charts directory so it sits alongside existing charts like gpu-health-monitor,
syslog-health-monitor, and csp-health-monitor, and update any CI/build
references or packaging manifests that point to the old location (e.g., chart
packaging steps or dependency references) to the new chart directory.
🟡 Minor comments (40)
.github/copilot-instructions.md-14-25 (1)

14-25: ⚠️ Potential issue | 🟡 Minor

Add language identifiers to fenced code blocks.

MD040 is triggered because these fences lack a language; add text/bash for clarity and lint compliance.

🛠️ Proposed fix
-```
+```text
 api/
 ├── proto/                    # Protocol Buffer source definitions (edit these)
 │   └── device/v1alpha1/
 │       └── gpu.proto
 ├── gen/go/                   # Generated Go code (DO NOT edit manually)
 │   └── device/v1alpha1/
 │       ├── gpu.pb.go
 │       └── gpu_grpc.pb.go
 ├── go.mod
 └── Makefile
-```
+```

@@
-```
+```text
 feat: add new GPU condition type
 fix: correct timestamp handling in conditions
 docs: update API documentation
 chore: update protoc-gen-go version
-```
+```

@@
-```
+```text
 Signed-off-by: Name <email@example.com>
-```
+```

@@
-```
+```text
 Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved.

 Licensed under the Apache License, Version 2.0 (the "License");
 ...
-```
+```

Also applies to: 60-65, 71-73, 81-86

demos/nvml-sidecar-demo.sh-454-458 (1)

454-458: ⚠️ Potential issue | 🟡 Minor

Documentation inconsistency in summary.

Line 455 references Dockerfile.nvml-provider, but the script actually uses deployments/container/Dockerfile (defined on line 24). Update the summary to reference the correct file path.

Suggested fix
     echo "${white}Key files:${reset}"
-    echo "${cyan}  - Dockerfile.nvml-provider  # Sidecar container build${reset}"
+    echo "${cyan}  - ${DOCKERFILE}  # Multi-target container build${reset}"
     echo "${cyan}  - values-sidecar-test.yaml  # Helm values for sidecar mode${reset}"
     echo "${cyan}  - charts/device-api-server/ # Helm chart with sidecar support${reset}"
demos/nvml-sidecar-demo.sh-367-369 (1)

367-369: ⚠️ Potential issue | 🟡 Minor

Bug: Duplicate pause calls.

Lines 367 and 368 both call pause, requiring the user to press ENTER twice unnecessarily. This appears to be a copy-paste error.

Suggested fix
     step "Check nvml-provider sidecar logs"
     run_cmd kubectl --kubeconfig="${KUBECONFIG}" -n "${NAMESPACE}" logs "${POD}" -c nvml-provider --tail=20 || true

     pause
-
-    pause
 }
code-generator/cmd/client-gen/generators/scheme/generator_for_scheme.go-127-131 (1)

127-131: ⚠️ Potential issue | 🟡 Minor

Surface unexpected os.Stat errors.

Non-IsNotExist errors are silently ignored, which can mask permission/IO problems and skip custom registration unexpectedly.

🛠️ Suggested fix
 	if g.OutputPath != "" {
-		if _, err := os.Stat(filepath.Join(g.OutputPath, strings.ToLower("register_custom.go"))); err == nil {
+		customPath := filepath.Join(g.OutputPath, "register_custom.go")
+		if _, err := os.Stat(customPath); err == nil {
 			m["customRegister"] = true
+		} else if !os.IsNotExist(err) {
+			return fmt.Errorf("stat %s: %w", customPath, err)
 		}
 	}
code-generator/cmd/client-gen/generators/scheme/generator_for_scheme.go-55-59 (1)

55-59: ⚠️ Potential issue | 🟡 Minor

Add GoDoc comments for exported GenScheme methods (and fix Filter comment).

Exported methods require GoDoc comments starting with the method name per Go conventions.

✍️ Suggested fix
+// Namers returns the naming systems used by this generator.
 func (g *GenScheme) Namers(c *generator.Context) namer.NameSystems {
 	return namer.NameSystems{
 		"raw": namer.NewRawNamer(g.OutputPkg, g.ImportTracker),
 	}
 }
 
-// We only want to call GenerateType() once.
+// Filter ensures GenerateType runs only once for this generator.
 func (g *GenScheme) Filter(c *generator.Context, t *types.Type) bool {
 	ret := !g.schemeGenerated
 	g.schemeGenerated = true
 	return ret
 }
 
+// Imports returns the import lines required for the generated scheme.
 func (g *GenScheme) Imports(c *generator.Context) (imports []string) {
 	imports = append(imports, g.ImportTracker.ImportLines()...)
 	for _, group := range g.Groups {
 		for _, version := range group.Versions {
 			packagePath := g.InputPackages[clientgentypes.GroupVersion{Group: group.Group, Version: version.Version}]
 			groupAlias := strings.ToLower(g.GroupGoNames[clientgentypes.GroupVersion{Group: group.Group, Version: version.Version}])
 			if g.CreateRegistry {
 				// import the install package for internal clientsets instead of the type package with register.go
 				if version.Version != "" {
 					packagePath = path.Dir(packagePath)
 				}
 				packagePath = path.Join(packagePath, "install")
 
 				imports = append(imports, fmt.Sprintf("%s \"%s\"", groupAlias, packagePath))
 				break
 			} else {
 				imports = append(imports, fmt.Sprintf("%s%s \"%s\"", groupAlias, strings.ToLower(version.Version.NonEmpty()), packagePath))
 			}
 		}
 	}
 	return
 }
 
+// GenerateType writes the scheme, codecs, and registration code.
 func (g *GenScheme) GenerateType(c *generator.Context, t *types.Type, w io.Writer) error {
 	sw := generator.NewSnippetWriter(w, c, "$", "$")
code-generator/cmd/client-gen/generators/scheme/generator_for_scheme.go-24-24 (1)

24-24: ⚠️ Potential issue | 🟡 Minor

Add a package-level GoDoc comment.

The package declaration lacks the required package comment, which will trip GoDoc/golint checks. Other generator files in the codebase properly document their packages.

✍️ Suggested fix
- package scheme
+ // Package scheme contains generators for clientset schemes, codecs, and registration.
+ package scheme

As per coding guidelines: Package-level godoc required for all Go packages.

cmd/health-provider/main.go-51-52 (1)

51-52: ⚠️ Potential issue | 🟡 Minor

Remove redundant flag.Parse() call.

flag.Parse() is already called inside parseFlags() at line 112. This second call is redundant and should be removed to avoid confusion.

🔧 Proposed fix
 	cfg, nvmlCfg := parseFlags()
-	flag.Parse()
.gitignore-30-31 (1)

30-31: ⚠️ Potential issue | 🟡 Minor

Remove duplicate api/bin/* entry.

This pattern is already defined at line 17. The duplicate entry should be removed.

🧹 Proposed fix
 # Local tool binaries (managed by api/Makefile)
-api/bin/*
client-go/examples/fake-client/main_test.go-80-86 (1)

80-86: ⚠️ Potential issue | 🟡 Minor

Add type assertion safety check.

The type assertion obj.(*devicev1alpha1.GPU) could panic if the object is of an unexpected type. Use a comma-ok idiom for safety.

🛡️ Proposed fix
 	gpuInformer.AddEventHandler(&cache.ResourceEventHandlerFuncs{
 		AddFunc: func(obj interface{}) {
-			gpu := obj.(*devicev1alpha1.GPU)
+			gpu, ok := obj.(*devicev1alpha1.GPU)
+			if !ok {
+				t.Errorf("unexpected object type: %T", obj)
+				return
+			}
 			t.Logf("Informer signaled GPU added: %s", gpu.Name)
 			gpuChan <- gpu
 		},
 	})
.github/workflows/ci.yml-17-28 (1)

17-28: ⚠️ Potential issue | 🟡 Minor

Branch name inconsistency in concurrency configuration.

The workflow triggers on device-api branch (lines 20, 23), but cancel-in-progress checks against main branch (line 28). This means in-progress runs on device-api will always be cancelled on new pushes, which may not be intended.

🔧 Suggested fix
 concurrency:
   group: ${{ github.workflow }}-${{ github.ref }}
-  cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
+  cancel-in-progress: ${{ github.ref != 'refs/heads/device-api' }}

Or update to match your actual protected branch.

README.md-259-259 (1)

259-259: ⚠️ Potential issue | 🟡 Minor

Normalize horizontal rule to --- (no trailing space).

Line 259 has a horizontal rule with trailing whitespace, which triggers MD035.

🧹 Suggested fix
---- 
+---
README.md-14-42 (1)

14-42: ⚠️ Potential issue | 🟡 Minor

Add language identifier to the ASCII diagram fence.

Line 14 opens a fenced block without a language identifier, which triggers MD040. Using text keeps the diagram readable and satisfies linting.

📝 Suggested fix
-```
+```text
@@
-```
+```
code-generator/cmd/client-gen/generators/generator_for_group.go-27-35 (1)

27-35: ⚠️ Potential issue | 🟡 Minor

Wrap ParseClientGenTags errors with context.

Line 107 returns the raw parse error, which obscures which type caused the failure. Wrap with context to improve diagnostics. This pattern is also present at generator_for_type.go:96, fake/generator_fake_for_type.go:83, and fake/generator_fake_for_group.go:90—consider applying the same fix across the codebase for consistency.

🧩 Suggested change
 import (
+	"fmt"
 	"io"
@@
 		tags, err := util.ParseClientGenTags(append(t.SecondClosestCommentLines, t.CommentLines...))
 		if err != nil {
-			return err
+			return fmt.Errorf("parse client-gen tags for %s: %w", t.Name.Name, err)
 		}
code-generator/cmd/client-gen/types/helpers.go-38-38 (1)

38-38: ⚠️ Potential issue | 🟡 Minor

Add issue reference to the TODO comment.
The TODO at line 38 should reference a concrete GitHub issue ID for traceability, per coding guidelines requiring TODO comments to reference issues in Go code.

- // TODO once we've cleared the last uses as strings, this special case should be removed.
+ // TODO (#<issue-id>) once we've cleared the last uses as strings, this special case should be removed.
cmd/nvml-provider/reconciler.go-118-118 (1)

118-118: ⚠️ Potential issue | 🟡 Minor

Link this TODO to a concrete issue ID for traceability.

The TODO comment at line 118 should reference a GitHub issue or tracking number (e.g., TODO: #123 Add provider_id filtering...). This enables developers to track the improvement across the codebase and project management systems.

client-go/examples/controller/main.go-72-113 (1)

72-113: ⚠️ Potential issue | 🟡 Minor

Remove or clarify misleading "hybrid" reconciliation comment.
The fallback code would fail if executed. The manager uses a dummy rest.Config{Host: "http://localhost:0"} with a mapper containing only GPU resources. Any attempt to watch non-GPU resources (Pods, Events, etc.) would fail due to both the non-functional REST endpoint and missing mapper entries. Either clarify that this example is GPU-only reconciliation or wire a real kubeconfig with a discovery-backed mapper if hybrid support is intended.

cmd/nvml-provider/reconciler.go-285-301 (1)

285-301: ⚠️ Potential issue | 🟡 Minor

Confirm merge semantics—status fields are fully replaced, which will clear recommended_action.

The code populates only Conditions in the status update. Since UpdateGpuStatus performs a complete replacement of the entire GpuStatus object (see implementation: cached.gpu.Status = status), the recommended_action field will be cleared to an empty string if it was previously set. This causes status data loss. Either preserve the existing recommended_action by reading the current GPU state before updating, or use field masks to update only the Conditions field.

cmd/nvml-provider/reconciler.go-71-90 (1)

71-90: ⚠️ Potential issue | 🟡 Minor

Add provider ownership filtering before unregistering GPUs.

ReconcileState unregisters any cached GPU missing from NVML. The code currently assumes a single NVML provider ("For now, we assume all GPUs belong to us"), but the architecture supports multiple providers. If ListGpus returns GPUs from other providers, this reconciliation will delete resources they own. The code already has a TODO for this (line 118). Before the system supports multiple providers, add provider_id to the Gpu resource and filter cached GPUs by provider before calling unregisterGPU.

client-go/examples/client/README.md-22-27 (1)

22-27: ⚠️ Potential issue | 🟡 Minor

Inconsistent socket path between documentation and expected output.

Line 22 states sudo is required for the default socket path in /var/run/, but the expected output on line 27 shows the target as unix:///tmp/nvidia-device-api.sock. This discrepancy may confuse users about the actual socket location. Consider clarifying that the Fake Server uses /tmp/ while production deployments use /var/run/.

client-go/DEVELOPMENT.md-41-44 (1)

41-44: ⚠️ Potential issue | 🟡 Minor

Fix markdownlint MD027: remove extra spaces after > in blockquotes.

Suggested fix
->  - **API Definitions**: Modify the Proto or Go types in `../api`.
->  - **Client Logic**: Modify the templates in `../code-generator/cmd/client-gen`.
->  - **Transport & Connection**: Modify the gRPC logic in `nvgrpc`.
+> - **API Definitions**: Modify the Proto or Go types in `../api`.
+> - **Client Logic**: Modify the templates in `../code-generator/cmd/client-gen`.
+> - **Transport & Connection**: Modify the gRPC logic in `nvgrpc`.
client-go/version/version.go-15-25 (1)

15-25: ⚠️ Potential issue | 🟡 Minor

Add a package-level doc comment for package version.

Suggested fix
- package version
+// Package version provides build-time version metadata and user-agent helpers.
+package version

As per coding guidelines: Package-level godoc required for all Go packages.

client-go/tests/device/v1alpha1/fake_client_test.go-85-99 (1)

85-99: ⚠️ Potential issue | 🟡 Minor

Add type assertion safety check to prevent panic.

Line 93 performs an unchecked type assertion. If the watch returns an unexpected type, this will panic rather than fail gracefully.

Proposed fix
-		obj := event.Object.(*devicev1alpha1.GPU)
+		obj, ok := event.Object.(*devicev1alpha1.GPU)
+		if !ok {
+			t.Fatalf("Expected *devicev1alpha1.GPU, got %T", event.Object)
+		}
 		if obj.Name != "gpu-2" {
client-go/tests/device/v1alpha1/fake_client_test.go-41-53 (1)

41-53: ⚠️ Potential issue | 🟡 Minor

Use t.Fatalf instead of t.Errorf when test cannot continue.

On Line 43, if Get fails, the subsequent assertions on gpu will panic due to nil access. Use t.Fatalf to halt test execution on critical failures.

Proposed fix
 	gpu, err := client.DeviceV1alpha1().GPUs().Get(ctx, "gpu-1", metav1.GetOptions{})
 	if err != nil {
-		t.Errorf("Fake client failed to retrieve GPU: %v", err)
+		t.Fatalf("Fake client failed to retrieve GPU: %v", err)
 	}
client-go/nvgrpc/interceptors_test.go-35-35 (1)

35-35: ⚠️ Potential issue | 🟡 Minor

Unused expectedCode field in test table.

The expectedCode field is defined but never asserted against in the test. Either remove it or add an assertion to verify the interceptor correctly propagates the gRPC status code.

Option 1: Remove unused field
 	tests := []struct {
 		name         string
 		method       string
 		log          bool
 		invokerErr   error
-		expectedCode codes.Code
 	}{
-		{"Returns OK code for successful call", "/svc/success", true, nil, codes.OK},
-		{"Returns internal status error on internal error", "/svc/internal_error", true, status.Error(codes.Internal, "fail"), codes.Internal},
-		{"Returns canceled status error when canceled", "/svc/cancel", true, status.Error(codes.Canceled, ""), codes.Canceled},
-		{"Returns deadline exceeded status error on timeout", "/svc/timeout", true, status.Error(codes.DeadlineExceeded, ""), codes.DeadlineExceeded},
-		{"Does not log if level too low", "/svc/skip_log", false, nil, codes.OK},
+		{"Returns OK code for successful call", "/svc/success", true, nil},
+		{"Returns internal status error on internal error", "/svc/internal_error", true, status.Error(codes.Internal, "fail")},
+		{"Returns canceled status error when canceled", "/svc/cancel", true, status.Error(codes.Canceled, "")},
+		{"Returns deadline exceeded status error on timeout", "/svc/timeout", true, status.Error(codes.DeadlineExceeded, "")},
+		{"Does not log if level too low", "/svc/skip_log", false, nil},
 	}
Option 2: Add assertion for expected code
 			err := interceptor(context.Background(), tt.method, nil, nil, nil, invoker)
 			if !errors.Is(err, tt.invokerErr) {
 				t.Fatalf("Returned error mismatch. Got %v, want %v", err, tt.invokerErr)
 			}
+			if got := status.Code(err); got != tt.expectedCode {
+				t.Errorf("Status code mismatch. Got %v, want %v", got, tt.expectedCode)
+			}
client-go/go.mod-7-7 (1)

7-7: ⚠️ Potential issue | 🟡 Minor

Remove self-referential replace directive.

The replace github.com/nvidia/nvsentinel/client-go => . directive on line 7 is non-standard and potentially problematic. This is the only module in the repository using this pattern—all other modules correctly use relative path replacements like => ../api. Additionally, there is no go.work file that would justify this workspace-style setup. The replace directive in a module's own go.mod file is not honored by external consumers of that module, making it ineffective. Remove this line.

client-go/nvgrpc/watcher_test.go-154-177 (1)

154-177: ⚠️ Potential issue | 🟡 Minor

Avoid fixed sleep to reduce test flakiness.

time.Sleep(10 * time.Millisecond) can be brittle under load; prefer a synchronization signal (e.g., a started channel in FakeSource or require.Eventually) before calling Stop().

cmd/controller-test/main.go-59-84 (1)

59-84: ⚠️ Potential issue | 🟡 Minor

Metrics bind address flag is ignored.

The metricsAddr flag is parsed but never applied; ctrl.Options{}.Metrics uses defaults instead. This makes --metrics-bind-address ineffective.

🐛 Proposed fix
@@
 	"sigs.k8s.io/controller-runtime/pkg/client"
 	"sigs.k8s.io/controller-runtime/pkg/healthz"
 	"sigs.k8s.io/controller-runtime/pkg/log/zap"
+	metricsserver "sigs.k8s.io/controller-runtime/pkg/metrics/server"
@@
 	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
 		Scheme:                 scheme,
-		Metrics:                ctrl.Options{}.Metrics,
+		Metrics:                metricsserver.Options{BindAddress: metricsAddr},
 		HealthProbeBindAddress: probeAddr,
 		LeaderElection:         enableLeaderElection,
 		LeaderElectionID:       "nvsentinel-controllers.nvidia.com",
 	})
code-generator/cmd/client-gen/generators/generator_for_clientset.go-78-80 (1)

78-80: ⚠️ Potential issue | 🟡 Minor

Tie the TODO to a tracking issue.

As per coding guidelines: TODO comments should reference issues in Go code.

code-generator/cmd/client-gen/args/gvtype.go-39-47 (1)

39-47: ⚠️ Potential issue | 🟡 Minor

Add GoDoc for exported NewGVTypesValue.
The exported function lacks a doc comment.

✍️ Suggested fix
-func NewGVTypesValue(gvToTypes *map[types.GroupVersion][]string, def []string) *gvTypeValue {
+// NewGVTypesValue initializes a gvTypeValue with optional defaults.
+func NewGVTypesValue(gvToTypes *map[types.GroupVersion][]string, def []string) *gvTypeValue {

As per coding guidelines: Function comments required for all exported Go functions.

code-generator/cmd/client-gen/args/gvtype.go-93-101 (1)

93-101: ⚠️ Potential issue | 🟡 Minor

Avoid double slashes in String() when version is empty.
When gv.Version is empty, the current formatting yields group//type. Consider conditionally formatting.

💡 Suggested fix
-		for _, t := range ts {
-			strs = append(strs, gv.Group.String()+"/"+gv.Version.String()+"/"+t)
-		}
+		for _, t := range ts {
+			if gv.Version == "" {
+				strs = append(strs, gv.Group.String()+"/"+t)
+			} else {
+				strs = append(strs, gv.Group.String()+"/"+gv.Version.String()+"/"+t)
+			}
+		}
client-go/client/versioned/clientset.go-52-59 (1)

52-59: ⚠️ Potential issue | 🟡 Minor

Close the gRPC connection if clientset initialization fails.

In NewForConfig, if NewForConfigAndClient returns an error (e.g., from device client initialization failure), the gRPC connection is leaked. Update both the generated code and the template in code-generator/cmd/client-gen/generators/generator_for_clientset.go.

Suggested fix
	conn, err := nvgrpc.ClientConnFor(&configShallowCopy)
	if err != nil {
		return nil, err
	}

-	return NewForConfigAndClient(&configShallowCopy, conn)
+	cs, err := NewForConfigAndClient(&configShallowCopy, conn)
+	if err != nil {
+		_ = conn.Close()
+		return nil, err
+	}
+	return cs, nil
client-go/informers/externalversions/generic.go-49-60 (1)

49-60: ⚠️ Potential issue | 🟡 Minor

TODO needs an issue reference.
Line 50’s TODO should link to a tracking issue or be removed; since this file is generated, update the generator template to keep it consistent.
As per coding guidelines, TODO comments should reference issues in Go code.

code-generator/cmd/client-gen/generators/fake/generator_fake_for_clientset.go-92-119 (1)

92-119: ⚠️ Potential issue | 🟡 Minor

Replace the TODO with an issue reference (or remove).
Line 93’s TODO should reference a tracking issue.
As per coding guidelines, TODO comments should reference issues in Go code.

code-generator/cmd/client-gen/generators/fake/fake_client_generator.go-96-137 (1)

96-137: ⚠️ Potential issue | 🟡 Minor

Add godoc and an issue reference for TargetForClientset.
Same concerns: exported function missing comment and TODO at Line 98 lacks an issue reference.
As per coding guidelines, Function comments required for all exported Go functions; TODO comments should reference issues in Go code.

code-generator/cmd/client-gen/generators/fake/fake_client_generator.go-40-94 (1)

40-94: ⚠️ Potential issue | 🟡 Minor

Add godoc and an issue reference for TargetForGroup.
This exported function lacks a leading comment, and the TODO at Line 41 should reference a tracking issue; please update both in the generator.
As per coding guidelines, Function comments required for all exported Go functions; TODO comments should reference issues in Go code.

DEVELOPMENT.md-126-128 (1)

126-128: ⚠️ Potential issue | 🟡 Minor

Update "Getting Help" links to point to the correct repository.

Lines 126-127 in DEVELOPMENT.md link to github.com/NVIDIA/device-api, but this is the NVSentinel repository (module: github.com/nvidia/nvsentinel). The correct links should point to github.com/NVIDIA/NVSentinel for reporting issues and starting discussions related to this project's development. See README.md lines 233, 238, 243, 249 for the correct repository links.

code-generator/cmd/client-gen/generators/fake/generator_fake_for_group.go-24-24 (1)

24-24: ⚠️ Potential issue | 🟡 Minor

Add package-level GoDoc for the fake package.

None of the files in this package have a package comment. Add a comment like // Package fake provides fake clientset implementations. before the package fake declaration. This can be done in any existing file or in a new doc.go file.

Per coding guidelines: Package-level godoc required for all Go packages.

api/proto/device/v1alpha1/gpu.proto-33-39 (1)

33-39: ⚠️ Potential issue | 🟡 Minor

Inconsistent indentation in comments.

Lines 36-38 use tab indentation while the surrounding lines use spaces. This inconsistency may cause proto linter warnings.

   // resource_version represents the internal version of this object.
   //
   // Value must be treated as opaque by clients and passed unmodified back to the server.
-	// Populated by the system.
-	// Read-only.
-	// More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#concurrency-control-and-consistency
+  // Populated by the system.
+  // Read-only.
+  // More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#concurrency-control-and-consistency
   string resource_version = 2;
api/proto/device/v1alpha1/gpu.proto-28-46 (1)

28-46: ⚠️ Potential issue | 🟡 Minor

Typo in documentation comment.

Line 42: "reprsentation" should be "representation".

   // namespace defines the space within which each name must be unique. An empty namespace is
-  // equivalent to the "default" namespace, but "default" is the canonical reprsentation.
+  // equivalent to the "default" namespace, but "default" is the canonical representation.
   // Not all objects are required to be scoped to a namespace - the value of this field for
client-go/informers/externalversions/factory.go-208-208 (1)

208-208: ⚠️ Potential issue | 🟡 Minor

Documentation refers to non-existent method.

Line 208 mentions factory.WaitForStop() but the actual method is Shutdown(). This appears to be a generator template issue.

 //	ctx, cancel := context.Background()
 //	defer cancel()
 //	factory := NewSharedInformerFactory(client, resyncPeriod)
-//	defer factory.WaitForStop()    // Returns immediately if nothing was started.
+//	defer factory.Shutdown()    // Returns immediately if nothing was started.
 //	genericInformer := factory.ForResource(resource)

Comment on lines +1 to +21
# ==============================================================================
# GIT ATTRIBUTES
# ==============================================================================
# Use 'linguist-generated=true' to hide generated code from GitHub PR diffs.
# ==============================================================================

# Hide Kubernetes generated helpers (DeepCopy, Defaults, Conversions, OpenAPI, etc)
zz_generated.*.go linguist-generated=true

# Hide generated Protobuf Go bindings
*.pb.go linguist-generated=true

# Hide generated client library
client-go/client/** linguist-generated=true
client-go/listers/** linguist-generated=true
client-go/informers/** linguist-generated=true

# Hide copied, unmodified upstream code
code-generator/cmd/client-gen/types/** linguist-generated=true
code-generator/cmd/client-gen/generators/scheme/** linguist-generated=true
code-generator/cmd/client-gen/generators/util/** linguist-generated=true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Add Apache 2.0 license header to this file.
The new .gitattributes file is missing the required Apache 2.0 license header.

📝 Suggested header addition
+# Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 # ==============================================================================
 # GIT ATTRIBUTES
 # ==============================================================================

Based on learnings: All files must include Apache 2.0 license header.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# ==============================================================================
# GIT ATTRIBUTES
# ==============================================================================
# Use 'linguist-generated=true' to hide generated code from GitHub PR diffs.
# ==============================================================================
# Hide Kubernetes generated helpers (DeepCopy, Defaults, Conversions, OpenAPI, etc)
zz_generated.*.go linguist-generated=true
# Hide generated Protobuf Go bindings
*.pb.go linguist-generated=true
# Hide generated client library
client-go/client/** linguist-generated=true
client-go/listers/** linguist-generated=true
client-go/informers/** linguist-generated=true
# Hide copied, unmodified upstream code
code-generator/cmd/client-gen/types/** linguist-generated=true
code-generator/cmd/client-gen/generators/scheme/** linguist-generated=true
code-generator/cmd/client-gen/generators/util/** linguist-generated=true
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
# GIT ATTRIBUTES
# ==============================================================================
# Use 'linguist-generated=true' to hide generated code from GitHub PR diffs.
# ==============================================================================
# Hide Kubernetes generated helpers (DeepCopy, Defaults, Conversions, OpenAPI, etc)
zz_generated.*.go linguist-generated=true
# Hide generated Protobuf Go bindings
*.pb.go linguist-generated=true
# Hide generated client library
client-go/client/** linguist-generated=true
client-go/listers/** linguist-generated=true
client-go/informers/** linguist-generated=true
# Hide copied, unmodified upstream code
code-generator/cmd/client-gen/types/** linguist-generated=true
code-generator/cmd/client-gen/generators/scheme/** linguist-generated=true
code-generator/cmd/client-gen/generators/util/** linguist-generated=true
🤖 Prompt for AI Agents
In @.gitattributes around lines 1 - 21, Add the Apache 2.0 license header to the
top of .gitattributes: insert the standard multi-line Apache-2.0 header text
(including copyright line, SPDX identifier or full license notice and year/owner
as per project convention) as the first lines of the file so the file contains
the required license header before any content; ensure formatting matches other
repository files' headers.

Comment on lines 40 to 70
jobs:
go:
name: Go
runs-on: ubuntu-latest
timeout-minutes: 15
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2

- name: Setup build environment
uses: ./.github/actions/setup-ci-env

- name: ${{ matrix.step_name }}
run: ${{ matrix.make_command }}

- name: Load Helm version from .versions.yaml
if: matrix.component == 'helm-charts'
id: helm-version
run: |
HELM_VERSION=$(yq eval '.testing_tools.helm' .versions.yaml)
echo "helm_version=${HELM_VERSION}" >> $GITHUB_OUTPUT

- name: Setup Helm
if: matrix.component == 'helm-charts'
uses: azure/setup-helm@1a275c3b69536ee54be43f2070a358922e12c8d4 # v4.3.1
with:
version: ${{ steps.helm-version.outputs.helm_version }}

- name: Validate Helm Charts
if: matrix.component == 'helm-charts'
run: make helm-lint

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Critical: go job is missing matrix configuration.

The go job references matrix.step_name, matrix.make_command, and matrix.component (lines 51-52, 55, 62, 68) but has no strategy.matrix defined. This will cause the workflow to fail.

🐛 Add missing matrix configuration
   go:
     name: Go
     runs-on: ubuntu-latest
     timeout-minutes: 15
+    strategy:
+      matrix:
+        include:
+          - component: api
+            step_name: "Build API"
+            make_command: "make -C api build"
+          - component: helm-charts
+            step_name: "Lint Helm Charts"
+            make_command: "echo 'Helm lint handled separately'"
+          # Add other components as needed
     steps:
       - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd  # v6.0.2

Or, if the matrix was accidentally removed, restore the original configuration.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
jobs:
go:
name: Go
runs-on: ubuntu-latest
timeout-minutes: 15
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- name: Setup build environment
uses: ./.github/actions/setup-ci-env
- name: ${{ matrix.step_name }}
run: ${{ matrix.make_command }}
- name: Load Helm version from .versions.yaml
if: matrix.component == 'helm-charts'
id: helm-version
run: |
HELM_VERSION=$(yq eval '.testing_tools.helm' .versions.yaml)
echo "helm_version=${HELM_VERSION}" >> $GITHUB_OUTPUT
- name: Setup Helm
if: matrix.component == 'helm-charts'
uses: azure/setup-helm@1a275c3b69536ee54be43f2070a358922e12c8d4 # v4.3.1
with:
version: ${{ steps.helm-version.outputs.helm_version }}
- name: Validate Helm Charts
if: matrix.component == 'helm-charts'
run: make helm-lint
jobs:
go:
name: Go
runs-on: ubuntu-latest
timeout-minutes: 15
strategy:
matrix:
include:
- component: api
step_name: "Build API"
make_command: "make -C api build"
- component: helm-charts
step_name: "Lint Helm Charts"
make_command: "echo 'Helm lint handled separately'"
# Add other components as needed
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- name: Setup build environment
uses: ./.github/actions/setup-ci-env
- name: ${{ matrix.step_name }}
run: ${{ matrix.make_command }}
- name: Load Helm version from .versions.yaml
if: matrix.component == 'helm-charts'
id: helm-version
run: |
HELM_VERSION=$(yq eval '.testing_tools.helm' .versions.yaml)
echo "helm_version=${HELM_VERSION}" >> $GITHUB_OUTPUT
- name: Setup Helm
if: matrix.component == 'helm-charts'
uses: azure/setup-helm@1a275c3b69536ee54be43f2070a358922e12c8d4 # v4.3.1
with:
version: ${{ steps.helm-version.outputs.helm_version }}
- name: Validate Helm Charts
if: matrix.component == 'helm-charts'
run: make helm-lint
🧰 Tools
🪛 actionlint (1.7.10)

[error] 51-51: property "step_name" is not defined in object type {}

(expression)


[error] 52-52: property "make_command" is not defined in object type {}

(expression)


[error] 55-55: property "component" is not defined in object type {}

(expression)


[error] 57-57: shellcheck reported issue in this script: SC2086:info:2:40: Double quote to prevent globbing and word splitting

(shellcheck)


[error] 62-62: property "component" is not defined in object type {}

(expression)


[error] 68-68: property "component" is not defined in object type {}

(expression)

🤖 Prompt for AI Agents
In @.github/workflows/ci.yml around lines 40 - 70, The "go" job references
matrix variables (matrix.step_name, matrix.make_command, matrix.component) but
lacks a strategy.matrix; add a strategy block to the "go" job that defines the
matrix entries used (e.g., step_name, make_command, component) or restore the
original strategy.matrix if it was removed so those variables are defined for
the job; update the job's top-level stanza "go" to include strategy: matrix:
with the appropriate permutations that feed the referenced matrix.step_name,
matrix.make_command and matrix.component.

Comment on lines 96 to 111
message Gpu {
// name is the unique logical identifier of the GPU resource.
//
// This is typically the lowercased GPU UUID, but may take other forms.
//
// Example: "gpu-a1b2c3d4-e5f6-a7b8-c9d0-e1f2a3b4c5d6"
string name = 1;
ObjectMeta metadata = 1;

// spec defines the identity and desired attributes of the GPU resource.
GpuSpec spec = 2;

// status contains the most recently observed state of the GPU resource.
// This data may lag slightly behind the actual on-device state.
GpuStatus status = 3;

// resource_version is a monotonically increasing version number.
//
// This increments on every modification to the GPU and can be used by
// consumers to detect changes or implement optimistic concurrency.
int64 resource_version = 4;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Duplicate resource_version fields with different types.

The Gpu message has resource_version in two places:

  1. metadata.resource_version (string) at ObjectMeta field 2
  2. resource_version (int64) at Gpu field 4

This creates potential confusion for API consumers. In Kubernetes conventions, resource_version is typically an opaque string stored only in ObjectMeta. If the intent is to have an internal numeric version for optimistic concurrency, consider:

  1. Removing the top-level resource_version and using only metadata.resource_version
  2. If both are needed, rename the top-level field to something like generation or internal_version to distinguish its purpose
🤖 Prompt for AI Agents
In `@api/proto/device/v1alpha1/gpu.proto` around lines 96 - 111, The Gpu message
defines two different resource_version fields (ObjectMeta.resource_version as a
string and Gpu.resource_version as int64), causing confusion; either remove the
top-level int64 resource_version from message Gpu or rename it to a clearly
distinct semantic name (e.g., generation or internal_version) to avoid collision
with ObjectMeta.resource_version; update the proto definition for message Gpu
(field currently declared as "int64 resource_version = 4") accordingly, adjust
all code references/usages to the new name or remove usages if deleted, update
any comments/docs in the proto to reflect the change, and regenerate protobuf
stubs wherever they are consumed.

Comment on lines 210 to 222
// CreateGpu registers a new GPU with the server.
//
// The GPU name (metadata.name) must be unique. If a GPU with the
// same name already exists, returns ALREADY_EXISTS.
//
// This operation acquires a write lock, blocking consumer reads
// until the operation completes.
//
// Returns:
// - The created GPU with server-assigned fields (resource_version)
// - ALREADY_EXISTS if a GPU with that name exists
// - INVALID_ARGUMENT if required fields are missing
rpc CreateGpu(CreateGpuRequest) returns (CreateGpuResponse);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Contradictory semantics between RPC doc and response message.

The CreateGpu RPC documentation (line 220-221) states it "returns ALREADY_EXISTS if a GPU with that name exists." However, CreateGpuResponse.created (lines 339-343) indicates that when false, the existing GPU is returned for idempotent patterns.

These are mutually exclusive behaviors:

  • Option A: Return ALREADY_EXISTS error (no response body)
  • Option B: Return success with created=false and existing GPU data (idempotent upsert)

Clarify which behavior is intended and update the documentation to be consistent.

Also applies to: 339-344

🤖 Prompt for AI Agents
In `@api/proto/device/v1alpha1/gpu.proto` around lines 210 - 222, The RPC docs for
CreateGpu and the CreateGpuResponse.created field conflict; decide whether
CreateGpu should return an ALREADY_EXISTS error or use idempotent semantics with
CreateGpuResponse.created (true/false) returning the existing GPU. Then make
them consistent: if you choose idempotent upsert, update the CreateGpu RPC
comment to state it returns success with created=false and the existing GPU when
a name already exists and keep/CreateGpuResponse.created semantics; if you
choose error-on-duplicate, remove or repurpose CreateGpuResponse.created and
change the CreateGpu RPC comment to state it returns ALREADY_EXISTS and ensure
server code for rpc CreateGpu enforces that behavior. Apply the change to the
documented remarks for rpc CreateGpu and the CreateGpuResponse message so both
match the chosen behavior.

Comment on lines 360 to 375
// UpdateGpuStatusRequest contains the status update for a GPU.
message UpdateGpuStatusRequest {
// The name of the GPU to update.
// Required.
string name = 1;

// The new status to set.
// This completely replaces the existing status.
// Required.
GpuStatus status = 2;

// Expected resource_version for optimistic concurrency.
// If set and doesn't match current version, returns ABORTED.
// Optional.
int64 resource_version = 3;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Type inconsistency in resource_version for optimistic concurrency.

UpdateGpuStatusRequest.resource_version is int64 (line 374), but ObjectMeta.resource_version is string (line 39). This inconsistency complicates client implementations that need to track versions across different message types.

Consider aligning the types—either use string consistently (Kubernetes convention) or int64 consistently (if internal numeric versioning is preferred).

🤖 Prompt for AI Agents
In `@api/proto/device/v1alpha1/gpu.proto` around lines 360 - 375,
UpdateGpuStatusRequest currently defines resource_version as int64 while
ObjectMeta.resource_version is a string, causing type mismatch; change
UpdateGpuStatusRequest.resource_version to string to match
ObjectMeta.resource_version (or vice versa if you choose numeric versioning) and
update any callers/serializers that read/write
UpdateGpuStatusRequest.resource_version and related handlers (e.g., the
UpdateGpuStatus RPC handling code and client usage) to use the new string type
so optimistic-concurrency checks compare the same type across messages.

Comment on lines 40 to 44
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/api/meta"
"github.com/nvidia/nvsentinel/client-go/clientset/versioned"
"github.com/nvidia/nvsentinel/client-go/nvgrpc"
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, let's check the directory structure of client-go
fd -t d "clientset|client" client-go --max-depth 3 | head -20

Repository: NVIDIA/NVSentinel

Length of output: 137


🏁 Script executed:

# Also find the README.md file and examine it
fd -t f "README.md" client-go | xargs wc -l

Repository: NVIDIA/NVSentinel

Length of output: 356


🏁 Script executed:

# Read the relevant section of README.md around lines 40-44
fd -t f "README.md" client-go | xargs cat -n | sed -n '35,55p'

Repository: NVIDIA/NVSentinel

Length of output: 910


🏁 Script executed:

# Check what's actually in client-go/client directory
fd -t f "versioned" client-go/client --max-depth 2 | head -20

Repository: NVIDIA/NVSentinel

Length of output: 43


🏁 Script executed:

# Also check if clientset directory exists at all
ls -la client-go/ | grep -E "clientset|client"

Repository: NVIDIA/NVSentinel

Length of output: 118


🏁 Script executed:

# Look for any versioned clientset files
find client-go -name "*versioned*" -type f 2>/dev/null | head -20

Repository: NVIDIA/NVSentinel

Length of output: 43


🏁 Script executed:

# List contents of client-go/client
find client-go/client -type f -name "*.go" | head -20

Repository: NVIDIA/NVSentinel

Length of output: 778


🏁 Script executed:

# Check the structure more carefully
tree client-go -L 3 -I 'vendor'

Repository: NVIDIA/NVSentinel

Length of output: 1194


Fix import path in Quick Start example.

The import path on line 42 is incorrect. The codebase structure has client-go/client/versioned/, not client-go/clientset/versioned/. Change the import to:

"github.com/nvidia/nvsentinel/client-go/client/versioned"

This will prevent import errors when developers use the example code.

🤖 Prompt for AI Agents
In `@client-go/README.md` around lines 40 - 44, Update the incorrect import in the
Quick Start import block: replace the wrong import string
"github.com/nvidia/nvsentinel/client-go/clientset/versioned" with
"github.com/nvidia/nvsentinel/client-go/client/versioned" so the example imports
the correct client package; ensure the import block containing metav1, api/meta,
and nvgrpc uses the corrected path.

Comment on lines 123 to 127
// Configure klog for JSON output if requested
if config.LogFormat == "json" {
// Set JSON format flags
flag.Set("logging_format", "json")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# First, find and examine the main.go file
find . -name "main.go" -path "*/cmd/device-api-server/*" -type f

Repository: NVIDIA/NVSentinel

Length of output: 92


🏁 Script executed:

#!/bin/bash
# Read the main.go file to see the actual code
if [ -f "cmd/device-api-server/main.go" ]; then
  wc -l cmd/device-api-server/main.go
fi

Repository: NVIDIA/NVSentinel

Length of output: 94


🏁 Script executed:

#!/bin/bash
# Get context around lines 123-127
if [ -f "cmd/device-api-server/main.go" ]; then
  sed -n '100,160p' cmd/device-api-server/main.go | cat -n
fi

Repository: NVIDIA/NVSentinel

Length of output: 2225


🏁 Script executed:

#!/bin/bash
# Check imports to see how klog is being used
if [ -f "cmd/device-api-server/main.go" ]; then
  sed -n '1,50p' cmd/device-api-server/main.go | cat -n
fi

Repository: NVIDIA/NVSentinel

Length of output: 2137


🌐 Web query:

klog v2 JSON format configuration flag.Set logging_format after flag.Parse

💡 Result:

--logging-format=json is not a klog/v2 flag; it comes from k8s.io/component-base/logs (and its logs/api/v1 config). Kubernetes docs describe it as the flag that switches klog output to JSON. [1]

If you want to force JSON in code (even after flag.Parse()), you must do it before initializing logs and before any logging happens (klog requires flags parsed before logging). [2]

Option A: set the flag value after flag.Parse(), then init logs

import (
  "flag"
  "k8s.io/component-base/logs"
)

func main() {
  logs.AddGoFlags(flag.CommandLine) // registers --logging-format (and other logging flags) [3]
  flag.Parse()

  _ = flag.Set("logging-format", "json") // must happen before InitLogs / any logging
  logs.InitLogs()                        // reads flags and configures logging [3]
  defer logs.FlushLogs()

  // ... your code ...
}

Option B (preferred): avoid flag.Set, set the config directly

Kubernetes moved applying --logging-format-style options into a single “validate & apply” step that should run right after flag parsing. [4] This is done via the logs/api/v1 LoggingConfiguration (“Options”) API. [4]

Key point: setting logging-format after logs are already initialized (or after you already emitted logs) is too late; early startup logs may still be non-JSON. [1]

Sources:
[1] Kubernetes system logs docs (--logging-format=json, JSON format caveats)
[2] klog/v2 docs: flag.Parse must be called before logging
[3] k8s.io/component-base/logs docs: AddGoFlags, InitLogs (call after parsing)
[4] Kubernetes dev thread: applying --logging-format moved to Options.ValidateAndApply and should run ASAP after parsing


🌐 Web query:

klog/v2 SetLogger JSON output configuration methods

💡 Result:

In k8s.io/klog/v2, JSON output isn’t enabled by a “klog JSON mode” switch; instead you get JSON by redirecting klog to a structured backend logger (logr or slog) that’s configured to emit JSON. The relevant configuration methods are: SetLogger / SetLoggerWithOptions, and SetSlogLogger. [1]

1) Use SetSlogLogger with slog.NewJSONHandler (JSON lines)

import (
  "log/slog"
  "os"
  "k8s.io/klog/v2"
)

klog.SetSlogLogger(slog.New(slog.NewJSONHandler(os.Stdout, nil)))

JSONHandler produces line-delimited JSON objects. [2]

2) Use SetLogger (or SetLoggerWithOptions) with a logr backend that outputs JSON

SetLogger redirects all klog output to a logr.Logger implementation. [1]

Typical pattern (zap -> logr via zapr.NewLogger):

klog.SetLogger(zapr.NewLogger(zapLog)) // zapLog must be configured for JSON

SetLoggerWithOptions is the flexible variant, with these key options: [1]

  • ContextualLogger(true|false)
  • FlushLogger(func(){...})
  • WriteKlogBuffer(func([]byte){...}) (lets you capture/redirect klog’s unstructured output buffer)

3) Note on klogr: its default JSON is “JSON-in-text”, not structured forwarding

klogr historically serialized key/values as JSON and logged that as a text message via klog; this is not recommended when klog is being forwarded to a structured logger. [3]

Sources: klog API docs (SetLogger, SetLoggerWithOptions, SetSlogLogger, options). [1] slog JSONHandler docs. [2] klogr README note about default JSON serialization behavior. [3]


🏁 Script executed:

#!/bin/bash
# Check for logs.InitLogs or klog.SetSlogLogger/SetLogger calls
rg -n "InitLogs|SetSlogLogger|SetLogger" cmd/device-api-server/main.go

Repository: NVIDIA/NVSentinel

Length of output: 43


🏁 Script executed:

#!/bin/bash
# Check for any imports of component-base/logs
rg -n "component-base/logs|logs\." cmd/device-api-server/main.go

Repository: NVIDIA/NVSentinel

Length of output: 43


🏁 Script executed:

#!/bin/bash
# Read the full main.go to see complete initialization
cat -n cmd/device-api-server/main.go

Repository: NVIDIA/NVSentinel

Length of output: 7180


Replace ineffective klog JSON flag configuration with proper klog.SetSlogLogger() initialization.

The flag.Set("logging_format", "json") call on line 126 is ineffective. This flag does not exist in klog/v2 (it belongs to k8s.io/component-base/logs, which this code does not use). To enable JSON output in klog/v2, use klog.SetSlogLogger() with a JSON handler instead:

if config.LogFormat == "json" {
	klog.SetSlogLogger(slog.New(slog.NewJSONHandler(os.Stdout, nil)))
}

This must be called before any logging occurs (after flag.Parse() but before line 139 where logger.Info() is first called).

🤖 Prompt for AI Agents
In `@cmd/device-api-server/main.go` around lines 123 - 127, The current
flag.Set("logging_format", "json") is ineffective for klog/v2; replace it by
initializing klog with a structured JSON handler: when config.LogFormat ==
"json" call klog.SetSlogLogger(...) using slog.NewJSONHandler(os.Stdout, nil) so
klog emits JSON; place this call after flag.Parse() and before any logging
(i.e., before the first logger.Info() call) to ensure it takes effect; update
the block that currently checks config.LogFormat and remove the flag.Set usage
in favor of klog.SetSlogLogger.

Comment on lines 35 to 36
cfg, nvmlCfg := parseFlags()
flag.Parse()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Remove duplicate flag.Parse() call.

parseFlags() already calls flag.Parse() at line 88, so calling it again at line 36 will cause issues (flags parsed twice, potentially with empty values on the second parse).

🐛 Proposed fix
 func main() {
 	klog.InitFlags(nil)

 	cfg, nvmlCfg := parseFlags()
-	flag.Parse()

 	logger := klog.Background()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
cfg, nvmlCfg := parseFlags()
flag.Parse()
func main() {
klog.InitFlags(nil)
cfg, nvmlCfg := parseFlags()
logger := klog.Background()
🤖 Prompt for AI Agents
In `@cmd/health-provider/main_stub.go` around lines 35 - 36, Remove the duplicate
flag.Parse() call in main_stub.go: parseFlags() already calls flag.Parse(), so
delete the standalone flag.Parse() invocation to avoid parsing flags twice; keep
parseFlags() as the single place that calls flag.Parse() and returns cfg and
nvmlCfg (refer to the parseFlags function and the cfg, nvmlCfg variables).

Comment on lines 27 to 35
import (
"io"
"path"
"strings"

"golang.org/x/text/cases"
"golang.org/x/text/language"

"k8s.io/gengo/v2/generator"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

rg -n --type=go 'ParseClientGenTags' code-generator/cmd/client-gen

Repository: NVIDIA/NVSentinel

Length of output: 2473


🏁 Script executed:

cat -n code-generator/cmd/client-gen/generators/generator_for_type.go | sed -n '90,105p'

Repository: NVIDIA/NVSentinel

Length of output: 772


🏁 Script executed:

rg -B5 -A5 'ParseClientGenTags' code-generator/cmd/client-gen --type=go

Repository: NVIDIA/NVSentinel

Length of output: 13014


🏁 Script executed:

sed -n '105,112p' code-generator/cmd/client-gen/generators/generator_for_group.go

Repository: NVIDIA/NVSentinel

Length of output: 309


🏁 Script executed:

sed -n '80,88p' code-generator/cmd/client-gen/generators/fake/generator_fake_for_type.go

Repository: NVIDIA/NVSentinel

Length of output: 483


🏁 Script executed:

sed -n '87,95p' code-generator/cmd/client-gen/generators/fake/generator_fake_for_group.go

Repository: NVIDIA/NVSentinel

Length of output: 316


Wrap ParseClientGenTags errors with context across all four call sites.

The error-wrapping guideline requires adding context to errors returned from ParseClientGenTags. Line 96 in generator_for_type.go returns the raw parse error, making failures harder to diagnose. Three additional identical locations exist in the codebase:

  • generator_for_type.go:96-99
  • generator_for_group.go:107-110
  • fake/generator_fake_for_type.go:83-86
  • fake/generator_fake_for_group.go:90-93

Apply consistent wrapping across all four sites using fmt.Errorf("parse client-gen tags for %s: %w", t.Name.Name, err) to align with the error-wrapping guideline for Go code.

Example for generator_for_type.go
 import (
+	"fmt"
 	"io"
 	"path"
 	"strings"
@@
 	tags, err := util.ParseClientGenTags(append(t.SecondClosestCommentLines, t.CommentLines...))
 	if err != nil {
-		return err
+		return fmt.Errorf("parse client-gen tags for %s: %w", t.Name.Name, err)
 	}
🤖 Prompt for AI Agents
In `@code-generator/cmd/client-gen/generators/generator_for_type.go` around lines
27 - 35, Wrap the error returned by ParseClientGenTags with context at each call
site by replacing bare returns of err with: return fmt.Errorf("parse client-gen
tags for %s: %w", t.Name.Name, err). Update the four occurrences where
ParseClientGenTags is called (in the generators for types and groups and their
fake counterparts) so the function that calls ParseClientGenTags returns the
wrapped error; ensure you import fmt if not already present and reference
t.Name.Name in the formatting string.

Comment on lines +1 to +24
apiVersion: v2
name: health-provider
description: GPU Health Provider for NVSentinel - monitors GPU health via NVML

type: application
version: 0.1.0
appVersion: "0.1.0"

keywords:
- nvidia
- gpu
- health
- monitoring
- nvml

home: https://github.com/nvidia/nvsentinel
sources:
- https://github.com/nvidia/nvsentinel

maintainers:
- name: NVIDIA
url: https://github.com/nvidia

dependencies: []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if there are other Helm charts and their locations
echo "=== Helm charts in distros/kubernetes/nvsentinel/charts/ ==="
fd -t d -d 1 . distros/kubernetes/nvsentinel/charts 2>/dev/null || echo "Path not found"

echo ""
echo "=== Helm charts in deployments/helm/ ==="
fd -t d -d 1 . deployments/helm 2>/dev/null || echo "Path not found"

echo ""
echo "=== All Chart.yaml files ==="
fd Chart.yaml

Repository: NVIDIA/NVSentinel

Length of output: 2807


Relocate chart to follow project structure conventions.

This Helm chart should be located at distros/kubernetes/nvsentinel/charts/health-provider/ to align with project organization. Other health-related monitors in the project—including gpu-health-monitor, syslog-health-monitor, and csp-health-monitor—are all located in distros/kubernetes/nvsentinel/charts/. Move this chart to the standard location for consistency.

🤖 Prompt for AI Agents
In `@deployments/helm/health-provider/Chart.yaml` around lines 1 - 24, The
Chart.yaml for the chart named "health-provider" is not located with the other
nvsentinel charts; move the entire chart directory containing Chart.yaml for
health-provider into the project's central nvsentinel charts directory so it
sits alongside existing charts like gpu-health-monitor, syslog-health-monitor,
and csp-health-monitor, and update any CI/build references or packaging
manifests that point to the old location (e.g., chart packaging steps or
dependency references) to the new chart directory.

…o change

After removing the top-level resource_version field from the Gpu proto
message (Task 5 of PR #795 review fixes), all Go code needed to be
updated to use metadata.resource_version instead.

Changes:
- cache.go: Store resource version in Metadata.ResourceVersion (string)
  instead of the removed Gpu.ResourceVersion (int64)
- gpu_service.go: Update all GetResourceVersion() calls to use
  GetMetadata().GetResourceVersion(), parse string to int64 for
  optimistic concurrency checks
- integration_test.go: Update test assertions to use the new path

This follows K8s conventions where resource_version lives in ObjectMeta.

Signed-off-by: Carlos Eduardo Arango Gutierrez <cearango@nvidia.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
The go job referenced matrix.step_name, matrix.make_command, and
matrix.component without defining a strategy.matrix block, causing
the workflow to fail.

Also removed orphaned Helm-related steps that referenced a
non-existent helm-charts matrix entry.

Signed-off-by: Carlos Eduardo Arango Gutierrez <cearango@nvidia.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
parseFlags() already calls flag.Parse(), so the second call in main()
would cause a runtime crash with "flag provided but not defined" or
"Parse called twice" depending on Go version.

Fixed in both main.go (NVML build) and main_stub.go (stub build).

Signed-off-by: Carlos Eduardo Arango Gutierrez <cearango@nvidia.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
The --metrics-bind-address flag was parsed but never applied to the
controller manager. Now correctly configures metrics server binding.

Signed-off-by: Carlos Eduardo Arango Gutierrez <cearango@nvidia.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Proto changes:
- Removed top-level resource_version field from Gpu message (use
  metadata.resource_version instead, following K8s conventions)
- Changed UpdateGpuStatusRequest.resource_version from int64 to string
  to match ObjectMeta.resource_version type
- Updated CreateGpu RPC docs to clarify idempotent semantics (returns
  existing GPU with created=false instead of ALREADY_EXISTS error)
- Fixed typo: reprsentation -> representation
- Fixed inconsistent indentation in ObjectMeta comments

Go code updates:
- Removed ResourceVersion from converter ignore list (field no longer exists)
- Updated gpu_service.go to parse string resource_version
- Updated integration tests for string resource_version

Signed-off-by: Carlos Eduardo Arango Gutierrez <cearango@nvidia.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Updated dependencies required by code-generator validation-gen tool.

Signed-off-by: Carlos Eduardo Arango Gutierrez <cearango@nvidia.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
1. Changed c.Universe.Type() to c.Universe.Function() for fmt.Errorf
   since Errorf is a function, not a type. Using Type() could return
   nil and cause template execution failures.

2. Added conn.Close() in NewForConfig error path to prevent gRPC
   connection leak when NewForConfigAndClient fails after
   ClientConnFor succeeds.

Note: Run `make -C client-go code-gen` to regenerate client code
with these fixes applied.

Signed-off-by: Carlos Eduardo Arango Gutierrez <cearango@nvidia.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
- Fix client-go/README.md: correct import path to client/versioned
- Add Apache 2.0 license header to .gitattributes
- Remove duplicate api/bin/* entry from .gitignore
- Remove duplicate JetBrains IDE comment from .gitignore
- Remove duplicate pause call from nvml-sidecar-demo.sh
- Add nil guard for config in ClientConnFor
- Fix resource leak in NewForConfig by closing connection on error
- Update go.mod dependencies
- Add nolint:lll for kubebuilder validation comments that cannot be split
- Add whitespace per wsl_v5 linter requirements
- Format files with gofmt/goimports
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 119 out of 191 changed files in this pull request and generated no new comments.

Critical: clone protos in cache mutations to prevent data races, switch
remediation controller from Status().Update to Status().Patch, fix klog
JSON logging initialization.

Major: add atomic ListWithVersion/ListAndSubscribe to cache, fix watch
race in gpu_service, add quarantine toleration to reboot jobs, fix drain
boolean default, use sync.Once for publisher stop, fix ctx ordering in
NVML provider, add nil guards in client-go.

Minor: improve error wrapping in code-generator, tighten socket perms,
remove dead code, add license header to api/README.md.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Previously, if the status patch to Draining succeeded but pod eviction
failed, the event was permanently stuck because the controller only
accepted PhaseQuarantined. Now accepts both PhaseQuarantined and
PhaseDraining for resilient retry behavior.
Previously, executeReboot returned nil when the job was still running,
causing premature transition to Remediated. Now returns errRebootInProgress
which triggers a 30s requeue to check the job status again.
Using r.Update() sent the full object and risked conflicts with
concurrent writes. Using client.MergeFrom Patch only sends the diff
(the annotation changes), eliminating conflict risk and preventing
drift in the subsequent status patch.
Previously, if the target node didn't exist, the event stayed stuck in
New phase forever. Now transitions to Cancelled with a NodeNotFound
reason, allowing the TTL controller to eventually clean it up.
The static DaemonSet was binding gRPC to :50051 (all interfaces),
exposing the unauthenticated API to the cluster network. Changed to
127.0.0.1:50051 to match the Helm chart default.
gRPC defaults to 50051 (not 8080), health to 8081, metrics to 9090.
Updated EXPOSE directives and comments to match the code defaults.
Removed 11 metric definitions that were registered but never populated
by any controller code. These created false observability — showing
permanent zeros that could mislead operators building dashboards.

Retained: quarantineActionsTotal, drainActionsTotal, ttlDeletionsTotal,
remediationActionsTotal, remediationFailuresTotal (all actively used).
Both UpdateGpu and UpdateGpuStatus silently discarded the error from
strconv.ParseInt when parsing resource_version. A non-numeric string
like "not-a-number" would parse to 0, causing the cache's optimistic
concurrency check (expectedVersion > 0) to be bypassed entirely.

Now returns codes.InvalidArgument when resource_version is present
but not a valid numeric string. Adds regression tests for both RPCs.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
MarkProviderGPUsUnknown was calling Notify N times (once per GPU) while
holding the write lock, starving all readers during provider timeout.
Now collects events inside the lock, releases the lock, then broadcasts.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants

Comments