Skip to content

WIP - k8s API for healtheventwithstatus model#640

Open
yavinash007 wants to merge 2 commits intoNVIDIA:mainfrom
yavinash007:healtheventwithstatus
Open

WIP - k8s API for healtheventwithstatus model#640
yavinash007 wants to merge 2 commits intoNVIDIA:mainfrom
yavinash007:healtheventwithstatus

Conversation

@yavinash007
Copy link
Contributor

@yavinash007 yavinash007 commented Dec 23, 2025

Summary

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📚 Documentation
  • 🔧 Refactoring
  • 🔨 Build/CI

Component(s) Affected

  • Core Services
  • Documentation/CI
  • Fault Management
  • Health Monitors
  • Janitor
  • Other: ____________

Testing

  • Tests pass locally
  • Manual testing completed
  • No breaking changes (or documented)

Checklist

  • Self-review completed
  • Documentation updated (if needed)
  • Ready for review

Summary by CodeRabbit

  • New Features

    • Introduced HealthEvent and HealthStatus Kubernetes CRDs and rich HealthEvent API (spec/status, enums, entities) for Kubernetes-native event storage and remediation.
  • Tests

    • Added comprehensive unit tests ensuring bidirectional conversion and field parity between internal models, CRDs, and protobufs.
  • Chores

    • Added CRD/deepcopy generation support and pinned controller-gen tooling; updated Go module dependencies and Make targets.
  • Documentation

    • Added ADR detailing the Kubernetes-backed datastore design and operational considerations.

✏️ Tip: You can customize this high-level summary in your review settings.

@copy-pr-bot
Copy link

copy-pr-bot bot commented Dec 23, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 23, 2025

📝 Walkthrough

Walkthrough

Adds a new HealthEvent Kubernetes CRD (v1alpha1), API types and generated deepcopies, bidirectional conversion functions and tests between protobuf/model/CRD, CRD manifests, and Makefile/tooling updates to support controller-gen-based generation.

Changes

Cohort / File(s) Summary
Versioning & Makefiles
./.versions.yaml, make/common.mk, janitor/Makefile
Added go_tools.controller_gen v0.20.0, introduced CONTROLLER_GEN_VERSION make variable, and pin janitor to install that controller-gen version.
data-models build
data-models/Makefile
Added API_DIR, CRD_OUTPUT_DIR and a generate target to install/run controller-gen, generate deepcopy and CRDs, and move CRD YAMLs to output.
API group wiring
data-models/api/v1alpha1/groupversion_info.go
Registered v1alpha1 GroupVersion, SchemeBuilder, and AddToScheme for the healthevents API group.
HealthEvent API types
data-models/api/v1alpha1/healthevent_types.go
Added HealthEvent/HealthEventList, spec/status structs, enums (Status, RecommendedAction, ProcessingStrategy), kubebuilder markers, and SchemeBuilder registration.
Generated deepcopies
data-models/api/v1alpha1/zz_generated.deepcopy.go
Added autogenerated DeepCopyInto/DeepCopy/DeepCopyObject implementations for new v1alpha1 types.
Conversion layer
data-models/pkg/conversion/healthevent_coversion.go, data-models/pkg/conversion/healthevent_conversion_test.go
Implemented nil-safe, time-aware bidirectional conversions between protobuf, CRD, and model types plus comprehensive unit tests verifying field parity and round-trips.
Model JSON tagging
data-models/pkg/model/health_event_extentions.go
Added JSON struct tags to OperationStatus fields (kept existing BSON tags).
Go module deps
data-models/go.mod
Updated Go toolchain/dependencies: added k8s.io/apimachinery v0.35.0 and sigs.k8s.io/controller-runtime v0.22.4 and adjusted indirect deps.
CRD manifests
distros/kubernetes/nvsentinel/crds/healthevents.dgxc.nvidia.com_healthevents.yaml, distros/kubernetes/nvsentinel/crds/healthevents.dgxc.nvidia.com_healthstatuses.yaml
Added Namespaced CRDs for HealthEvent and HealthStatus v1alpha1 with OpenAPIv3 schemas, validation, and status subresources.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Producer as Producer/Service
    participant Converter as Conversion Layer
    participant K8sAPI as Kubernetes API Server (CRD)
    participant Storage as API Server Storage / etcd
    Producer->>Converter: Create/emit HealthEvent (proto/model)
    Converter->>K8sAPI: Convert to v1alpha1.HealthEvent (create/update)
    K8sAPI->>Storage: Persist CR (spec immutable, status mutable)
    K8sAPI-->>Converter: CR persisted (status updates readable)
    Note over Converter,K8sAPI: Reconciler/consumers read/write status via CRD
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇
I hopped through structs, YAML, and test,
Wove CRDs into a cozy nest.
Types, deepcopies, conversions align—
I thumped my foot: "Kubernetes, fine!"
A carrot giggle, then a tiny rest. 🥕

🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title 'WIP - k8s API for healtheventwithstatus model' is vague and generic, using the placeholder prefix 'WIP' without clearly describing the actual change or feature being introduced. Replace the vague title with a specific, descriptive summary like 'Add Kubernetes CRD API for HealthEvent model' to clearly communicate the main change without relying on placeholders.
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.


var (
// GroupVersion is group version used to register these objects
GroupVersion = schema.GroupVersion{Group: "data-models.dgxc.nvidia.com", Version: "v1alpha1"}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking maybe we do healthevents.dgxc.nvidia.com but open to ideas here too. Just because health-events is a little more descriptive than data-models

@yavinash007 yavinash007 force-pushed the healtheventwithstatus branch from aa400bd to 14aa429 Compare December 23, 2025 21:43
EventID string `json:"eventID"`

// Node associated with this health event
// +kubebuilder:validation:Required
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still looks like we're redefining the types, we ideally don't want to maintain two copies of the same types and keep them in sync


// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
type HealthStatus struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so my thought was this struct would look like this:

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
type HealthStatus struct {
  metav1.TypeMeta   `json:",inline"`
  metav1.ObjectMeta `json:"metadata,omitempty"`

  Spec   model.HealthEvent   `json:"spec,omitempty"`
  Status model.HealthEventStatus `json:"status,omitempty"`
}

So you would not define your own spec and status objects we would just use the existing ones.

There's some implications for api-versioning going forward if we want to adhere to best practices that we should discuss with the NVIDIA folks if we do it this way but it seems like a clean way to share the object that the other datasources use.

@ivelichkovich
Copy link
Contributor

Discussed offline, it'll likely be cleaner to have separate structs than munging the proto and the k8s type. Then we should make sure to have testing to ensure the structs stay in-sync.

One idea is if we can get rid of the omitempty (which doesn't seem critical) on the proto types json tags then we can write a unit test to initialize the proto, write it to json, convert it to crd type, write that to json and compare. They should match.

@yavinash007 yavinash007 force-pushed the healtheventwithstatus branch from 7f9a0bb to 6260e04 Compare January 11, 2026 23:45
@yavinash007
Copy link
Contributor Author

Discussed offline, it'll likely be cleaner to have separate structs than munging the proto and the k8s type. Then we should make sure to have testing to ensure the structs stay in-sync.

One idea is if we can get rid of the omitempty (which doesn't seem critical) on the proto types json tags then we can write a unit test to initialize the proto, write it to json, convert it to crd type, write that to json and compare. They should match.

pls check now, let me know what you think.

@yavinash007 yavinash007 force-pushed the healtheventwithstatus branch 7 times, most recently from 045caab to 91cc242 Compare January 21, 2026 23:03
return out
}

func assertRecursiveFieldParity(t *testing.T, leftName string, leftType reflect.Type, rightName string, rightType reflect.Type, ignoreTopLevel map[string]struct{}) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just add some comments throughout the function to help explain what's going on, just because reflection/recursion can sometimes be a little tough to follow

@ivelichkovich
Copy link
Contributor

We will eventually want to include some CI check making sure the generated files are actually in-sync with generation but since this is the first PR in a series of PRs on a greenfield feature I think this is a good start and we can follow up with that

@yavinash007 yavinash007 marked this pull request as ready for review January 22, 2026 01:46
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

🤖 Fix all issues with AI agents
In `@data-models/api/v1alpha1/healthevent_types.go`:
- Around line 49-55: The OperationStatus struct marks Status as required via the
kubebuilder tag but its JSON tag includes `omitempty`, allowing it to be dropped
from CR JSON; update the OperationStatus definition by removing `omitempty` from
the Status field's json tag (change `Status Status `json:"status,omitempty"` to
`Status Status `json:"status"`), or alternatively remove the
`+kubebuilder:validation:Required` line if the field should be optional; ensure
the change is applied to the OperationStatus struct and the Status field
declaration.
- Around line 146-161: Add Go doc comments for the exported types HealthEvent
and HealthEventList by inserting descriptive comment lines immediately above
each type declaration: a comment for HealthEvent explaining it represents the
Kubernetes custom resource holding Spec (HealthEventSpec) and Status
(HealthEventStatus) for a health event, and a comment for HealthEventList
explaining it represents a list of HealthEvent resources (the list type used by
the API server). Ensure the comments start with the type names "HealthEvent" and
"HealthEventList" respectively and are placed directly above their type
definitions.
- Around line 96-144: Add a proper Go doc comment for the exported type
HealthEventSpec: place a concise sentence above the type declaration starting
with "HealthEventSpec ..." describing what the struct represents (e.g., a
health/event payload for components/nodes including status, metadata, timestamps
and remediation info). Ensure the comment follows Go doc conventions (starts
with the type name and is a full sentence) and resides immediately above the
HealthEventSpec declaration so documentation tools and linters recognize it.
- Line 1: Add Go doc comments for all exported types to satisfy Go conventions
by documenting Status, RecommendedAction, ProcessingStrategy, HealthEventSpec,
HealthEvent, and HealthEventList; place a short sentence above each type
declaration describing its purpose. Also fix the field tag mismatch in
OperationStatus by removing `omitempty` from the JSON tag on the Status field
(since it is annotated `// +kubebuilder:validation:Required`) so the tags and
validation are consistent; update the comment on OperationStatus.Status if
needed to reflect it is required.
- Around line 7-47: Add Go doc comments for the exported enum types Status,
RecommendedAction, and ProcessingStrategy and for each corresponding const
block: place a one-line comment immediately above the type declarations (e.g.,
"Status ..." describing what the Status values represent) and add a short
comment above each const block that describes the group of constants (e.g.,
"Status values," "RecommendedAction values," "ProcessingStrategy values").
Ensure the comments use the exact exported identifier as the first word (Status,
RecommendedAction, ProcessingStrategy) so they satisfy Go doc conventions and
linters.

In `@data-models/pkg/conversion/healthevent_coversion.go`:
- Around line 219-239: Add missing package-level godoc at the top of the file
and add concise function comments for the exported conversion helpers
ProtoRecommendedActionToCRD, CRDRecommendedActionToProto,
ProtoProcessingStrategyToCRD, and CRDProcessingStrategyToProto; each function
comment should state it converts between the protobuf enum and the CRD enum and
note that unknown inputs map to zero values due to protobuf/name maps but are
constrained by upstream CRD schema and protobuf enum rules. Ensure comments
follow Go godoc style (start with the function/package name) and place the
package comment at the top of the file.

In
`@distros/kubernetes/nvsentinel/crds/healthevents.dgxc.nvidia.com_healthevents.yaml`:
- Around line 137-182: The HealthEventStatus schema currently requires
userPodsEvictionStatus which forces every status update to include it; update
the CRD by removing "userPodsEvictionStatus" from the status.required array (or
change the required list to be empty) so that userPodsEvictionStatus becomes
optional, ensuring HealthEventStatus can be PATCHed/created without that field;
locate the status schema for HealthEventStatus and modify the required property
accordingly.
- Around line 38-136: The CRD is missing the ProcessingStrategy field from the
HealthEventSpec; add a new spec property named processingStrategy (type: string)
with enum values UNSPECIFIED, EXECUTE_REMEDIATION, and STORE_ONLY and an
appropriate description matching HealthEventSpec in
data-models/api/v1alpha1/healthevent_types.go, then regenerate the OpenAPI CRD
so the field is persisted by the Kubernetes API server; locate the spec
properties block in the healthevents CRD (the same section containing
recommendedAction, version, etc.) and add the processingStrategy entry there
before regenerating.
♻️ Duplicate comments (1)
data-models/pkg/conversion/healthevent_conversion_test.go (1)

36-84: Add inline comments to explain the reflection logic.

The recursive field extraction logic using reflection can be difficult to follow. Consider adding brief inline comments explaining:

  • The purpose of unwrapping pointers (lines 40-42, 70-73)
  • Why PkgPath != "" means unexported (line 51)
  • The isScalarStruct exclusion rationale (line 76)
🧹 Nitpick comments (4)
data-models/Makefile (1)

48-66: Optional: split the generate target to satisfy checkmake max body length.
This keeps the lint clean and makes the steps easier to reuse.

♻️ Possible split to keep target bodies short
-.PHONY: generate
-generate: ## Generate CRDs and move them to Helm chart directory
-	`@echo` "Generating CRDs for $(API_DIR)..."
-	@# Install controller-gen if not present
-	`@which` controller-gen > /dev/null || (echo "Installing controller-gen..." && \
-		go install sigs.k8s.io/controller-tools/cmd/controller-gen@$(CONTROLLER_GEN_VERSION))
-	@# Generate deepcopy files for API types
-	go run sigs.k8s.io/controller-tools/cmd/controller-gen@$(CONTROLLER_GEN_VERSION) object paths=./$(API_DIR)
-	@# Generate CRDs directly into API_DIR
-	go run sigs.k8s.io/controller-tools/cmd/controller-gen@$(CONTROLLER_GEN_VERSION) \
-		crd paths=./$(API_DIR) output:crd:dir=./$(API_DIR)
-	@# Move generated CRDs to Helm chart directory
-	`@echo` "Moving generated CRDs to $(CRD_OUTPUT_DIR)..."
-	`@mkdir` -p $(CRD_OUTPUT_DIR)
-	`@mv` ./$(API_DIR)/*.yaml $(CRD_OUTPUT_DIR)/ || true
-	`@echo` "CRDs generated and moved to $(CRD_OUTPUT_DIR)"
-	`@ls` -1 $(CRD_OUTPUT_DIR)/*.yaml || echo "No CRD YAMLs generated"
+.PHONY: generate generate-crds move-crds
+generate: generate-crds move-crds ## Generate CRDs and move them to Helm chart directory
+
+generate-crds:
+	`@echo` "Generating CRDs for $(API_DIR)..."
+	`@which` controller-gen > /dev/null || (echo "Installing controller-gen..." && \
+		go install sigs.k8s.io/controller-tools/cmd/controller-gen@$(CONTROLLER_GEN_VERSION))
+	go run sigs.k8s.io/controller-tools/cmd/controller-gen@$(CONTROLLER_GEN_VERSION) object paths=./$(API_DIR)
+	go run sigs.k8s.io/controller-tools/cmd/controller-gen@$(CONTROLLER_GEN_VERSION) \
+		crd paths=./$(API_DIR) output:crd:dir=./$(API_DIR)
+
+move-crds:
+	`@echo` "Moving generated CRDs to $(CRD_OUTPUT_DIR)..."
+	`@mkdir` -p $(CRD_OUTPUT_DIR)
+	`@mv` ./$(API_DIR)/*.yaml $(CRD_OUTPUT_DIR)/ || true
+	`@ls` -1 $(CRD_OUTPUT_DIR)/*.yaml || echo "No CRD YAMLs generated"
data-models/pkg/conversion/healthevent_conversion_test.go (1)

189-272: Consider table-driven tests for conversion scenarios.

The conversion tests are thorough but could benefit from table-driven structure to test additional scenarios (e.g., nil inputs, partial fields, edge cases). As per coding guidelines, table-driven tests are preferred when testing multiple scenarios.

♻️ Example structure
func TestHealthEventSpecConversion(t *testing.T) {
    tests := []struct {
        name       string
        protoEvent *protos.HealthEvent
        wantErr    bool
    }{
        {
            name:       "full event",
            protoEvent: &protos.HealthEvent{/* ... */},
        },
        {
            name:       "nil input",
            protoEvent: nil,
        },
        {
            name:       "minimal fields",
            protoEvent: &protos.HealthEvent{CheckName: "check1"},
        },
    }
    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            // test logic
        })
    }
}
data-models/pkg/conversion/healthevent_coversion.go (2)

1-1: Filename typo: healthevent_coversion.go should be healthevent_conversion.go.

The filename has a typo - "coversion" is missing an 'n'. Consider renaming for consistency.


1-11: Add package-level godoc.

As per coding guidelines, package-level godoc is required for all Go packages. Add a brief description of the package's purpose.

 package conversion
+
+// Package conversion provides bidirectional conversion functions between
+// internal model types, v1alpha1 CRD representations, and protobuf definitions
+// for HealthEvent resources.
 
 import (

@@ -0,0 +1,165 @@
package v1alpha1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if package comment exists in v1alpha1
rg -n '^\s*//\s*Package v1alpha1\b' -g '*.go' data-models/api/v1alpha1/

Repository: NVIDIA/NVSentinel

Length of output: 201


🏁 Script executed:

# Also check what files exist in the v1alpha1 directory
find data-models/api/v1alpha1/ -type f -name '*.go' | head -20

Repository: NVIDIA/NVSentinel

Length of output: 202


🏁 Script executed:

# Read the full healthevent_types.go file to see exported types and their comments
wc -l data-models/api/v1alpha1/healthevent_types.go

Repository: NVIDIA/NVSentinel

Length of output: 110


🏁 Script executed:

cat -n data-models/api/v1alpha1/healthevent_types.go

Repository: NVIDIA/NVSentinel

Length of output: 6630


Add doc comments for exported types and fix field tag mismatch in OperationStatus.

Export types without doc comments violate Go conventions:

  • Status (line 9)
  • RecommendedAction (line 26)
  • ProcessingStrategy (line 41)
  • HealthEventSpec (line 96)
  • HealthEvent (line 148)
  • HealthEventList (line 157)

Additionally, fix the conflicting field annotation on OperationStatus.Status (line 52): the field is marked // +kubebuilder:validation:Required but also has omitempty in the JSON tag. Remove omitempty since required fields should not be omittable.

🤖 Prompt for AI Agents
In `@data-models/api/v1alpha1/healthevent_types.go` at line 1, Add Go doc comments
for all exported types to satisfy Go conventions by documenting Status,
RecommendedAction, ProcessingStrategy, HealthEventSpec, HealthEvent, and
HealthEventList; place a short sentence above each type declaration describing
its purpose. Also fix the field tag mismatch in OperationStatus by removing
`omitempty` from the JSON tag on the Status field (since it is annotated `//
+kubebuilder:validation:Required`) so the tags and validation are consistent;
update the comment on OperationStatus.Status if needed to reflect it is
required.

Comment on lines +7 to +47
// +kubebuilder:validation:Enum=NotStarted;InProgress;Failed;Succeeded;AlreadyDrained;\
// UnQuarantined;Quarantined;AlreadyQuarantined;Cancelled
type Status string

const (
StatusNotStarted Status = "NotStarted"
StatusInProgress Status = "InProgress"
StatusFailed Status = "Failed"
StatusSucceeded Status = "Succeeded"
StatusAlreadyDrained Status = "AlreadyDrained"

StatusUnQuarantined Status = "UnQuarantined"
StatusQuarantined Status = "Quarantined"
StatusAlreadyQuarantined Status = "AlreadyQuarantined"
StatusCancelled Status = "Cancelled"
)

// +kubebuilder:validation:Enum=NONE;COMPONENT_RESET;CONTACT_SUPPORT;\
// RUN_FIELDDIAG;RESTART_VM;RESTART_BM;REPLACE_VM;RUN_DCGMEUD;UNKNOWN
type RecommendedAction string

const (
RecommendedActionNone RecommendedAction = "NONE"
RecommendedActionComponentReset RecommendedAction = "COMPONENT_RESET"
RecommendedActionContactSupport RecommendedAction = "CONTACT_SUPPORT"
RecommendedActionRunFieldDiag RecommendedAction = "RUN_FIELDDIAG"
RecommendedActionRestartVM RecommendedAction = "RESTART_VM"
RecommendedActionRestartBM RecommendedAction = "RESTART_BM"
RecommendedActionReplaceVM RecommendedAction = "REPLACE_VM"
RecommendedActionRunDCGMEUD RecommendedAction = "RUN_DCGMEUD"
RecommendedActionUnknown RecommendedAction = "UNKNOWN"
)

// +kubebuilder:validation:Enum=UNSPECIFIED;EXECUTE_REMEDIATION;STORE_ONLY
type ProcessingStrategy string

const (
ProcessingStrategyUnspecified ProcessingStrategy = "UNSPECIFIED"
ProcessingStrategyExecuteRemediation ProcessingStrategy = "EXECUTE_REMEDIATION"
ProcessingStrategyStoreOnly ProcessingStrategy = "STORE_ONLY"
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if file exists and read the relevant lines
if [ -f "data-models/api/v1alpha1/healthevent_types.go" ]; then
  echo "=== File exists, showing lines 1-60 ==="
  head -60 "data-models/api/v1alpha1/healthevent_types.go" | cat -n
else
  echo "File not found"
  find . -name "healthevent_types.go" -type f
fi

Repository: NVIDIA/NVSentinel

Length of output: 178


🏁 Script executed:

#!/bin/bash
# Search for any doc comments before the types
echo "=== Checking for doc comments before exported types ==="
rg -B5 '^\s*type\s+(Status|RecommendedAction|ProcessingStrategy)\s' data-models/api/v1alpha1/healthevent_types.go

Repository: NVIDIA/NVSentinel

Length of output: 797


🏁 Script executed:

#!/bin/bash
# Check entire file structure
echo "=== Full file to understand context ==="
wc -l data-models/api/v1alpha1/healthevent_types.go

Repository: NVIDIA/NVSentinel

Length of output: 150


🏁 Script executed:

#!/bin/bash
# Read more of the file to check package-level doc and understand the full context
head -20 data-models/api/v1alpha1/healthevent_types.go | cat -n

Repository: NVIDIA/NVSentinel

Length of output: 827


🏁 Script executed:

#!/bin/bash
# Check if there are any other exported types in the file with doc comments to compare
rg -B1 '^\s*type\s+\w+\s' data-models/api/v1alpha1/healthevent_types.go | head -40

Repository: NVIDIA/NVSentinel

Length of output: 865


🌐 Web query:

Go golint exported type comment requirements documentation

💡 Result:

Short answer — golint expects exported (capitalized) top-level names to have a doc comment immediately preceding the declaration; the comment should be a complete sentence whose first word is the name being documented (e.g. “TypeName …”). If you omit it golint reports “exported X should have comment or be unexported”; if the wording doesn’t start with the name it reports “comment on exported type X should be of the form 'X ...'”. [1][2]

Sources:

  • Effective Go: doc-comment style and “Every exported (capitalized) name … should have a doc comment; the first sentence should start with the name.” [1]
  • Examples / golint messages (and common linter tooling guidance noting the exact message text). [2]

Add Go doc comments for exported enum types and const blocks.

Status, RecommendedAction, and ProcessingStrategy are exported but lack doc comments required by Go conventions; add comments for each type and the associated const block.

Suggested additions
+// Status represents the state of a remediation sub-operation.
 // +kubebuilder:validation:Enum=NotStarted;InProgress;Failed;Succeeded;AlreadyDrained;\
 // UnQuarantined;Quarantined;AlreadyQuarantined;Cancelled
 type Status string
 
+// Status-related constants for HealthEvent operations.
 const (
 	StatusNotStarted     Status = "NotStarted"
 	StatusInProgress     Status = "InProgress"
 	StatusFailed         Status = "Failed"
 	StatusSucceeded      Status = "Succeeded"
 	StatusAlreadyDrained Status = "AlreadyDrained"

+// RecommendedAction indicates the suggested remediation action.
 // +kubebuilder:validation:Enum=NONE;COMPONENT_RESET;CONTACT_SUPPORT;\
 // RUN_FIELDDIAG;RESTART_VM;RESTART_BM;REPLACE_VM;RUN_DCGMEUD;UNKNOWN
 type RecommendedAction string
 
+// RecommendedAction-related constants.
 const (
 	RecommendedActionNone           RecommendedAction = "NONE"
 	RecommendedActionComponentReset RecommendedAction = "COMPONENT_RESET"

+// ProcessingStrategy controls how the system should handle this event.
 // +kubebuilder:validation:Enum=UNSPECIFIED;EXECUTE_REMEDIATION;STORE_ONLY
 type ProcessingStrategy string
 
+// ProcessingStrategy-related constants.
 const (
 	ProcessingStrategyUnspecified        ProcessingStrategy = "UNSPECIFIED"
 	ProcessingStrategyExecuteRemediation ProcessingStrategy = "EXECUTE_REMEDIATION"
 	ProcessingStrategyStoreOnly          ProcessingStrategy = "STORE_ONLY"
 )
🤖 Prompt for AI Agents
In `@data-models/api/v1alpha1/healthevent_types.go` around lines 7 - 47, Add Go
doc comments for the exported enum types Status, RecommendedAction, and
ProcessingStrategy and for each corresponding const block: place a one-line
comment immediately above the type declarations (e.g., "Status ..." describing
what the Status values represent) and add a short comment above each const block
that describes the group of constants (e.g., "Status values," "RecommendedAction
values," "ProcessingStrategy values"). Ensure the comments use the exact
exported identifier as the first word (Status, RecommendedAction,
ProcessingStrategy) so they satisfy Go doc conventions and linters.

Comment on lines +49 to +55
// OperationStatus represents the status of a sub-operation
type OperationStatus struct {
// +kubebuilder:validation:Required
Status Status `json:"status,omitempty"`

Message string `json:"message,omitempty"`
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Required field marked omitempty can drop status in CRD JSON.

Status is marked required but omitempty allows it to be omitted from JSON, which contradicts the CRD schema and can cause validation failures. Remove omitempty (or drop Required if truly optional).

🩹 Proposed fix
-	// +kubebuilder:validation:Required
-	Status Status `json:"status,omitempty"`
+	// +kubebuilder:validation:Required
+	Status Status `json:"status"`
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// OperationStatus represents the status of a sub-operation
type OperationStatus struct {
// +kubebuilder:validation:Required
Status Status `json:"status,omitempty"`
Message string `json:"message,omitempty"`
}
// OperationStatus represents the status of a sub-operation
type OperationStatus struct {
// +kubebuilder:validation:Required
Status Status `json:"status"`
Message string `json:"message,omitempty"`
}
🤖 Prompt for AI Agents
In `@data-models/api/v1alpha1/healthevent_types.go` around lines 49 - 55, The
OperationStatus struct marks Status as required via the kubebuilder tag but its
JSON tag includes `omitempty`, allowing it to be dropped from CR JSON; update
the OperationStatus definition by removing `omitempty` from the Status field's
json tag (change `Status Status `json:"status,omitempty"` to `Status Status
`json:"status"`), or alternatively remove the `+kubebuilder:validation:Required`
line if the field should be optional; ensure the change is applied to the
OperationStatus struct and the Status field declaration.

Comment on lines +96 to +144
type HealthEventSpec struct {
// Version of the event
Version uint32 `json:"version,omitempty"`

// Agent that generated the event
Agent string `json:"agent,omitempty"`

// Component class
ComponentClass string `json:"componentClass,omitempty"`

// Name of the check
CheckName string `json:"checkName,omitempty"`

// Whether the fault is fatal
IsFatal *bool `json:"isFatal,omitempty"`

// Whether the system is healthy
IsHealthy *bool `json:"isHealthy,omitempty"`

// Event message
Message string `json:"message,omitempty"`

// Recommended remediation action
RecommendedAction RecommendedAction `json:"recommendedAction,omitempty"`

// Error codes
ErrorCode []string `json:"errorCode,omitempty"`

// Entities impacted
EntitiesImpacted []*Entity `json:"entitiesImpacted,omitempty"`

// Metadata map
Metadata map[string]string `json:"metadata,omitempty"`

// Timestamp when the event was generated
GeneratedTimestamp *metav1.Time `json:"generatedTimestamp,omitempty"`

// NodeName is the name of the target node
NodeName string `json:"nodeName,omitempty"`

// Behaviour overrides for quarantine
QuarantineOverrides *BehaviourOverrides `json:"quarantineOverrides,omitempty"`

// Behaviour overrides for drain
DrainOverrides *BehaviourOverrides `json:"drainOverrides,omitempty"`

// ProcessingStrategy
ProcessingStrategy ProcessingStrategy `json:"processingStrategy,omitempty"`
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check the exact lines 96-144 to see if HealthEventSpec has a doc comment
sed -n '85,150p' data-models/api/v1alpha1/healthevent_types.go | cat -n

Repository: NVIDIA/NVSentinel

Length of output: 2543


🏁 Script executed:

#!/bin/bash
# Also check other exported types in the same file to see their doc comment patterns
rg -n '^\s*type\s+\w+\s+(struct|interface)' data-models/api/v1alpha1/healthevent_types.go -B2 | head -40

Repository: NVIDIA/NVSentinel

Length of output: 855


Add a Go doc comment for HealthEventSpec.

The exported type HealthEventSpec at line 96 is missing a doc comment. All other exported types in this file (Entity, BehaviourOverrides, HealthEventStatus, OperationStatus) include proper documentation comments.

♻️ Suggested comment
+// HealthEventSpec defines the desired state of HealthEvent.
 type HealthEventSpec struct {
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
type HealthEventSpec struct {
// Version of the event
Version uint32 `json:"version,omitempty"`
// Agent that generated the event
Agent string `json:"agent,omitempty"`
// Component class
ComponentClass string `json:"componentClass,omitempty"`
// Name of the check
CheckName string `json:"checkName,omitempty"`
// Whether the fault is fatal
IsFatal *bool `json:"isFatal,omitempty"`
// Whether the system is healthy
IsHealthy *bool `json:"isHealthy,omitempty"`
// Event message
Message string `json:"message,omitempty"`
// Recommended remediation action
RecommendedAction RecommendedAction `json:"recommendedAction,omitempty"`
// Error codes
ErrorCode []string `json:"errorCode,omitempty"`
// Entities impacted
EntitiesImpacted []*Entity `json:"entitiesImpacted,omitempty"`
// Metadata map
Metadata map[string]string `json:"metadata,omitempty"`
// Timestamp when the event was generated
GeneratedTimestamp *metav1.Time `json:"generatedTimestamp,omitempty"`
// NodeName is the name of the target node
NodeName string `json:"nodeName,omitempty"`
// Behaviour overrides for quarantine
QuarantineOverrides *BehaviourOverrides `json:"quarantineOverrides,omitempty"`
// Behaviour overrides for drain
DrainOverrides *BehaviourOverrides `json:"drainOverrides,omitempty"`
// ProcessingStrategy
ProcessingStrategy ProcessingStrategy `json:"processingStrategy,omitempty"`
}
// HealthEventSpec defines the desired state of HealthEvent.
type HealthEventSpec struct {
// Version of the event
Version uint32 `json:"version,omitempty"`
// Agent that generated the event
Agent string `json:"agent,omitempty"`
// Component class
ComponentClass string `json:"componentClass,omitempty"`
// Name of the check
CheckName string `json:"checkName,omitempty"`
// Whether the fault is fatal
IsFatal *bool `json:"isFatal,omitempty"`
// Whether the system is healthy
IsHealthy *bool `json:"isHealthy,omitempty"`
// Event message
Message string `json:"message,omitempty"`
// Recommended remediation action
RecommendedAction RecommendedAction `json:"recommendedAction,omitempty"`
// Error codes
ErrorCode []string `json:"errorCode,omitempty"`
// Entities impacted
EntitiesImpacted []*Entity `json:"entitiesImpacted,omitempty"`
// Metadata map
Metadata map[string]string `json:"metadata,omitempty"`
// Timestamp when the event was generated
GeneratedTimestamp *metav1.Time `json:"generatedTimestamp,omitempty"`
// NodeName is the name of the target node
NodeName string `json:"nodeName,omitempty"`
// Behaviour overrides for quarantine
QuarantineOverrides *BehaviourOverrides `json:"quarantineOverrides,omitempty"`
// Behaviour overrides for drain
DrainOverrides *BehaviourOverrides `json:"drainOverrides,omitempty"`
// ProcessingStrategy
ProcessingStrategy ProcessingStrategy `json:"processingStrategy,omitempty"`
}
🤖 Prompt for AI Agents
In `@data-models/api/v1alpha1/healthevent_types.go` around lines 96 - 144, Add a
proper Go doc comment for the exported type HealthEventSpec: place a concise
sentence above the type declaration starting with "HealthEventSpec ..."
describing what the struct represents (e.g., a health/event payload for
components/nodes including status, metadata, timestamps and remediation info).
Ensure the comment follows Go doc conventions (starts with the type name and is
a full sentence) and resides immediately above the HealthEventSpec declaration
so documentation tools and linters recognize it.

Comment on lines +146 to +161
// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
type HealthEvent struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`

Spec HealthEventSpec `json:"spec,omitempty"`
Status HealthEventStatus `json:"status,omitempty"`
}

// +kubebuilder:object:root=true
type HealthEventList struct {
metav1.TypeMeta `json:",inline"`
metav1.ListMeta `json:"metadata,omitempty"`
Items []HealthEvent `json:"items"`
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cat -n data-models/api/v1alpha1/healthevent_types.go | sed -n '140,165p'

Repository: NVIDIA/NVSentinel

Length of output: 966


Add Go doc comments for HealthEvent and HealthEventList.

These exported types are missing doc comments, which is required by Go conventions.

Suggested comments
-// +kubebuilder:object:root=true
-// +kubebuilder:subresource:status
-type HealthEvent struct {
+// HealthEvent is the Schema for the healthevents API.
+// +kubebuilder:object:root=true
+// +kubebuilder:subresource:status
+type HealthEvent struct {
   ...
 }

-// +kubebuilder:object:root=true
-type HealthEventList struct {
+// HealthEventList contains a list of HealthEvent.
+// +kubebuilder:object:root=true
+type HealthEventList struct {
   ...
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
type HealthEvent struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
Spec HealthEventSpec `json:"spec,omitempty"`
Status HealthEventStatus `json:"status,omitempty"`
}
// +kubebuilder:object:root=true
type HealthEventList struct {
metav1.TypeMeta `json:",inline"`
metav1.ListMeta `json:"metadata,omitempty"`
Items []HealthEvent `json:"items"`
}
// HealthEvent is the Schema for the healthevents API.
// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
type HealthEvent struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
Spec HealthEventSpec `json:"spec,omitempty"`
Status HealthEventStatus `json:"status,omitempty"`
}
// HealthEventList contains a list of HealthEvent.
// +kubebuilder:object:root=true
type HealthEventList struct {
metav1.TypeMeta `json:",inline"`
metav1.ListMeta `json:"metadata,omitempty"`
Items []HealthEvent `json:"items"`
}
🤖 Prompt for AI Agents
In `@data-models/api/v1alpha1/healthevent_types.go` around lines 146 - 161, Add Go
doc comments for the exported types HealthEvent and HealthEventList by inserting
descriptive comment lines immediately above each type declaration: a comment for
HealthEvent explaining it represents the Kubernetes custom resource holding Spec
(HealthEventSpec) and Status (HealthEventStatus) for a health event, and a
comment for HealthEventList explaining it represents a list of HealthEvent
resources (the list type used by the API server). Ensure the comments start with
the type names "HealthEvent" and "HealthEventList" respectively and are placed
directly above their type definitions.

Comment on lines +219 to +239
// RecommendedAction conversions
func ProtoRecommendedActionToCRD(p protos.RecommendedAction) v1alpha1.RecommendedAction {
return v1alpha1.RecommendedAction(
protos.RecommendedAction_name[int32(p)],
)
}

func CRDRecommendedActionToProto(c v1alpha1.RecommendedAction) protos.RecommendedAction {
return protos.RecommendedAction(
protos.RecommendedAction_value[string(c)],
)
}

// ProcessingStrategy conversions
func ProtoProcessingStrategyToCRD(p protos.ProcessingStrategy) v1alpha1.ProcessingStrategy {
return v1alpha1.ProcessingStrategy(protos.ProcessingStrategy_name[int32(p)])
}

func CRDProcessingStrategyToProto(c v1alpha1.ProcessingStrategy) protos.ProcessingStrategy {
return protos.ProcessingStrategy(protos.ProcessingStrategy_value[string(c)])
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# First, locate the file and examine its actual content
find . -type f -name "*healthevent*" -o -name "*health_event*" | head -20

Repository: NVIDIA/NVSentinel

Length of output: 1307


🏁 Script executed:

#!/bin/bash
# Check if the file exists and read the relevant section
fd -e go -i health | grep -E "conversion|coversion"

Repository: NVIDIA/NVSentinel

Length of output: 170


🏁 Script executed:

#!/bin/bash
# Search for the file with possible variations
find . -path '*data-models*' -type f -name "*.go" | grep -i health

Repository: NVIDIA/NVSentinel

Length of output: 366


🏁 Script executed:

#!/bin/bash
# Read the specific lines from the file
cat -n data-models/pkg/conversion/healthevent_coversion.go | sed -n '219,239p'

Repository: NVIDIA/NVSentinel

Length of output: 971


🏁 Script executed:

#!/bin/bash
# Check the test file to understand expected behavior
cat -n data-models/pkg/conversion/healthevent_conversion_test.go

Repository: NVIDIA/NVSentinel

Length of output: 11840


🏁 Script executed:

#!/bin/bash
# Search for callers of these enum conversion functions
rg -n 'ProtoRecommendedActionToCRD|CRDRecommendedActionToProto|ProtoProcessingStrategyToCRD|CRDProcessingStrategyToProto' --type go -B 2 -A 2

Repository: NVIDIA/NVSentinel

Length of output: 3900


🏁 Script executed:

#!/bin/bash
# Check CRD types to see if there's validation
head -100 data-models/api/v1alpha1/healthevent_types.go

Repository: NVIDIA/NVSentinel

Length of output: 3650


🏁 Script executed:

#!/bin/bash
# Check the proto enum definitions
grep -A 15 "enum RecommendedAction\|enum ProcessingStrategy" data-models/protobufs/health_event.proto

Repository: NVIDIA/NVSentinel

Length of output: 422


🏁 Script executed:

#!/bin/bash
# Check entire conversion file header for package comments
head -50 data-models/pkg/conversion/healthevent_coversion.go

Repository: NVIDIA/NVSentinel

Length of output: 1350


🏁 Script executed:

#!/bin/bash
# Search for any validation of RecommendedAction or ProcessingStrategy before conversion
rg -n 'RecommendedAction|ProcessingStrategy' data-models/api/v1alpha1/healthevent_types.go -A 2 -B 2

Repository: NVIDIA/NVSentinel

Length of output: 1612


🏁 Script executed:

#!/bin/bash
# Check if there are any other conversion functions to understand the pattern
grep -n "^func.*ToCRD\|^func.*ToProto" data-models/pkg/conversion/healthevent_coversion.go | head -30

Repository: NVIDIA/NVSentinel

Length of output: 1167


Add function comments and package documentation to exported conversion functions.

All exported conversion functions are missing required documentation. Add a package-level godoc comment at the top of the file and function comments for ProtoRecommendedActionToCRD, CRDRecommendedActionToProto, ProtoProcessingStrategyToCRD, and CRDProcessingStrategyToProto. This is required by the Go conventions guideline.

Additionally, while these enum conversions rely on protobuf name/value maps that return zero values for unknown inputs, this concern is mitigated by upstream CRD schema validation (which constrains values to the defined enum set) and protobuf enum constraints. However, consider adding a brief function comment explaining the conversion behavior.

🤖 Prompt for AI Agents
In `@data-models/pkg/conversion/healthevent_coversion.go` around lines 219 - 239,
Add missing package-level godoc at the top of the file and add concise function
comments for the exported conversion helpers ProtoRecommendedActionToCRD,
CRDRecommendedActionToProto, ProtoProcessingStrategyToCRD, and
CRDProcessingStrategyToProto; each function comment should state it converts
between the protobuf enum and the CRD enum and note that unknown inputs map to
zero values due to protobuf/name maps but are constrained by upstream CRD schema
and protobuf enum rules. Ensure comments follow Go godoc style (start with the
function/package name) and place the package comment at the top of the file.

Comment on lines +137 to +182
status:
description: HealthEventStatus defines the observed state of HealthEvent
properties:
faultRemediated:
description: Whether the fault has been remediated
type: boolean
lastRemediationTimestamp:
description: Timestamp of the last remediation attempt
format: date-time
type: string
nodeQuarantined:
description: Whether the node has been quarantined
enum:
- NotStarted
- InProgress
- Failed
- Succeeded
- AlreadyDrained
- UnQuarantined
- Quarantined
- AlreadyQuarantined
- Cancelled
type: string
userPodsEvictionStatus:
description: Status of user pod eviction
properties:
message:
type: string
status:
enum:
- NotStarted
- InProgress
- Failed
- Succeeded
- AlreadyDrained
- UnQuarantined
- Quarantined
- AlreadyQuarantined
- Cancelled
type: string
required:
- status
type: object
required:
- userPodsEvictionStatus
type: object
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

userPodsEvictionStatus is marked as required in status.

The status schema requires userPodsEvictionStatus (line 180-181). This means any status update must include this field, which could cause issues during initial resource creation or partial status updates. Consider whether this should be optional.

🤖 Prompt for AI Agents
In
`@distros/kubernetes/nvsentinel/crds/healthevents.dgxc.nvidia.com_healthevents.yaml`
around lines 137 - 182, The HealthEventStatus schema currently requires
userPodsEvictionStatus which forces every status update to include it; update
the CRD by removing "userPodsEvictionStatus" from the status.required array (or
change the required list to be empty) so that userPodsEvictionStatus becomes
optional, ensuring HealthEventStatus can be PATCHed/created without that field;
locate the status schema for HealthEventStatus and modify the required property
accordingly.

@XRFXLP
Copy link
Member

XRFXLP commented Jan 23, 2026

@yavinash007 Can you please add a design doc stating out motivation behind this, problem that this solve and what is the end goal .etc so we've consolidated view of overall arc?

Copy link

@richardcase richardcase left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My biggest piece of feedback is that it would be good to adopt KAL so that it lints the HealthEvent against the Kubernetes API Guidelines.....its easier to enable now than later on.

@# Generate deepcopy files for API types
go run sigs.k8s.io/controller-tools/cmd/controller-gen@$(CONTROLLER_GEN_VERSION) object paths=./$(API_DIR)
@# Generate CRDs directly into API_DIR
go run sigs.k8s.io/controller-tools/cmd/controller-gen@$(CONTROLLER_GEN_VERSION) \

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also be generating the RBAC manifests as well?

StatusAlreadyQuarantined Status = "AlreadyQuarantined"
StatusCancelled Status = "Cancelled"
)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason why the values for the enum types are different between Status and RecommendedAction/ProcessingStrategy?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Guessing its because they match the proto?


// OperationStatus represents the status of a sub-operation
type OperationStatus struct {
// +kubebuilder:validation:Required

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be worth getting the Kubernetes API Linter (KAL) running in the project as early as possible. The reason is that it analyses the custom resources against the Kubernetes API Guidelines and gives a bunch of recommendations.

For example, one of the linters ensures that every field is marked optional or required. It's easier to make the changes now as the API is being defined instead of down the road where it may even require an API version bump.

// +kubebuilder:validation:Required
Status Status `json:"status,omitempty"`

Message string `json:"message,omitempty"`

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its good to mark strings with min/max length but KAL would highlight this.

ErrorCode []string `json:"errorCode,omitempty"`

// Entities impacted
EntitiesImpacted []*Entity `json:"entitiesImpacted,omitempty"`

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For slices, it's good to add the kubebuilder markers to aid SSA. (KAL would highlight this)

FaultRemediated *bool `json:"faultRemediated,omitempty"`

// Timestamp of the last remediation attempt
LastRemediationTimestamp *metav1.Time `json:"lastRemediationTimestamp,omitempty"`

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some projects use somethign like this:

// Timestamp of the last remediation attempt
LastRemediationTimestamp metav1.Time `json:"lastRemediationTimestamp,omitempty,omitzero"`

I believe there are some issues with using the k8s fuzzer with metav1.Time. But this will really only be a problem if have another API version and do roundtrip testing.

}

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be useful to also add:

  • "kubebuilder:resource:path" - so that users can do something like `kubectl get he"
  • Multiple "kubebuilder:printcolumn:name" - again useful to expose info when doing kubectl get


// HealthEvent Status
// HealthEventStatus defines the observed state of HealthEvent
type HealthEventStatus struct {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think you'll be adding Conditions in the future?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we already have the types definitions in the proto, should consider using a code generation plugin such as https://github.com/yandex/protoc-gen-crd to generate the CRDs and possibly the required go types?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, thanks for the suggestion @lalitadithya — I tried it out locally and it works well.

One thing to consider is that HealthEventStatus is currently not part of the proto file. We may want to move it into the proto as well, alongside HealthEvent, to keep the model consistent.

I’ve captured this consideration in the design doc too. Let me know what you think:
#787

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we put the crds in a sub chart of it's own so that we can easily disable it based on a helm value?

Signed-off-by: Avinash Yeddula <avinash.yeddula@omniva.com>
@yavinash007 yavinash007 changed the title k8s API for healtheventwithstatus model WIP - k8s API for healtheventwithstatus model Jan 31, 2026
@yavinash007 yavinash007 force-pushed the healtheventwithstatus branch from 91cc242 to e145c35 Compare January 31, 2026 04:31
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@docs/designs/027-kubernetes-data-store.md`:
- Line 1: The ADR title string "ADR-027: DatastorKubernetes Custom Resource (CR)
for HealthEvent" contains a typo; update the header to read "ADR-027:
DatastoreKubernetes Custom Resource (CR) for HealthEvent" (fix "Datastor" →
"Datastore") wherever that exact title appears in the document so the ADR name
is correct.
🧹 Nitpick comments (3)
data-models/Makefile (1)

40-67: Consider splitting generate into helper targets to keep lint clean.

The body is long (checkmake warns about length), and breaking it into smaller targets improves readability and reuse (e.g., regenerate CRDs without moving files).

🔧 Possible refactor
+.PHONY: generate ensure-controller-gen generate-deepcopy generate-crd move-crd
+
+# generate: Generate deepcopy, CRD types, and other Kubernetes boilerplate
+generate: ensure-controller-gen generate-deepcopy generate-crd move-crd ## Generate CRDs and move them to Helm chart directory
+
+ensure-controller-gen:
+	`@which` controller-gen > /dev/null || (echo "Installing controller-gen..." && \
+		go install sigs.k8s.io/controller-tools/cmd/controller-gen@$(CONTROLLER_GEN_VERSION))
+
+generate-deepcopy:
+	`@echo` "Generating CRDs for $(API_DIR)..."
+	go run sigs.k8s.io/controller-tools/cmd/controller-gen@$(CONTROLLER_GEN_VERSION) object paths=./$(API_DIR)
+
+generate-crd:
+	go run sigs.k8s.io/controller-tools/cmd/controller-gen@$(CONTROLLER_GEN_VERSION) \
+		crd paths=./$(API_DIR) output:crd:dir=./$(API_DIR)
+
+move-crd:
+	`@echo` "Moving generated CRDs to $(CRD_OUTPUT_DIR)..."
+	`@mkdir` -p $(CRD_OUTPUT_DIR)
+	`@mv` ./$(API_DIR)/*.yaml $(CRD_OUTPUT_DIR)/ || true
+	`@echo` "CRDs generated and moved to $(CRD_OUTPUT_DIR)"
+	`@ls` -1 $(CRD_OUTPUT_DIR)/*.yaml || echo "No CRD YAMLs generated"
data-models/pkg/conversion/healthevent_conversion_test.go (2)

222-271: Prefer testify/require / assert for these checks.

This will reduce boilerplate and improve diagnostics for failures. If testify isn’t already in this module, add it once and then replace the t.Fatalf blocks.

🔧 Example conversion
 import (
 	"reflect"
 	"strings"
 	"testing"
 	"time"

 	v1alpha1 "github.com/nvidia/nvsentinel/data-models/api/v1alpha1"
 	"github.com/nvidia/nvsentinel/data-models/pkg/model"
 	"github.com/nvidia/nvsentinel/data-models/pkg/protos"
 	"google.golang.org/protobuf/types/known/timestamppb"
+	"github.com/stretchr/testify/require"
 )
@@
-	if crdSpec.Version != protoEvent.Version ||
-		crdSpec.Agent != protoEvent.Agent ||
-		...
-		crdSpec.ProcessingStrategy != v1alpha1.ProcessingStrategy(protos.ProcessingStrategy_name[int32(protoEvent.ProcessingStrategy)]) {
-		t.Fatalf("proto -> crd conversion failed")
-	}
+	require.Equal(t, protoEvent.Version, crdSpec.Version)
+	require.Equal(t, protoEvent.Agent, crdSpec.Agent)
+	...
+	require.Equal(
+		t,
+		v1alpha1.ProcessingStrategy(protos.ProcessingStrategy_name[int32(protoEvent.ProcessingStrategy)]),
+		crdSpec.ProcessingStrategy,
+	)

As per coding guidelines: Use testify/assert and testify/require for assertions in Go tests.


26-34: Rename tests to include scenario + expected behavior.

Current names are descriptive but don’t follow the agreed pattern. Example: TestHealthEventSpecConversion_RoundTrip_PreservesFields.

As per coding guidelines: Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior.

Also applies to: 189-334

@yavinash007 yavinash007 force-pushed the healtheventwithstatus branch from e145c35 to 61ea5c4 Compare January 31, 2026 04:41
Signed-off-by: Avinash Yeddula <avinash.yeddula@omniva.com>
@yavinash007 yavinash007 force-pushed the healtheventwithstatus branch from 61ea5c4 to e765d44 Compare January 31, 2026 04:42
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@docs/designs/027-kubernetes-data-store.md`:
- Line 89: Populate the "## Alternatives Considered" section by listing each
evaluated option (e.g., "Pure MongoDB with no Kubernetes CRs", "Kubernetes
ConfigMaps/Secrets", "Separate etcd cluster", "External time-series DBs",
"In-memory storage with external backup") and add a concise 1–2 sentence
pros/cons note for each explaining scalability, consistency, operational
complexity, and why it was rejected compared to the chosen Kubernetes CR-backed
approach; ensure you reference the ADR title "027-kubernetes-data-store.md" and
the header "## Alternatives Considered" so reviewers can locate the new content
quickly.
- Around line 19-24: The API Definition is missing the API version; update the
section that lists API Group and Kind to include "API Version: v1alpha1" so the
spec explicitly reads API Group: `healthevents.dgxc.nvidia.com`, API Version:
`v1alpha1`, Kind: `HealthEvent` and clarify that `spec` maps to immutable event
data (HealthEvent) and `status` maps to mutable workflow state
(HealthEventStatus); this ensures the API is fully specified wherever the API
Definition block appears.
🧹 Nitpick comments (5)
docs/designs/027-kubernetes-data-store.md (5)

30-31: Consider documenting conversion edge cases and error handling.

While the conversion APIs and unit tests are mentioned, the ADR could benefit from a brief discussion of:

  • How mismatched or unmappable fields between protobuf, internal models, and CRD structs are handled
  • Error handling strategy when conversion fails
  • Approach to schema evolution and backward compatibility

This would provide clearer guidance for future maintainers.


58-62: Add guidance on CR cleanup configuration.

The CR cleanup section mentions configurable TTL and limits but doesn't provide:

  • Recommended default values or ranges
  • How to determine appropriate limits based on cluster size or event frequency
  • Configuration mechanism (Helm values, CR annotations, etc.)

Adding operational guidance would help users configure these settings appropriately.


69-87: Consider adding scale guidance or performance expectations.

The Consequences section mentions "control plane load" and potential impact on large clusters but doesn't provide:

  • Approximate scale thresholds (e.g., events per second, total CR count)
  • Guidance on when to use CR-backed storage vs. MongoDB
  • Expected performance characteristics

Quantitative guidance would help users make informed deployment decisions.


96-99: Add link to the referenced Datastore Abstraction Design Document.

The References section mentions "NVSentinel Datastore Abstraction Design Document" (also referenced on line 9) but doesn't provide a link. Adding the link would help readers understand the broader context.

If the document isn't yet available, consider noting that with a placeholder like [To be added].


48-57: Consider adding a migration strategy section.

The ADR mentions that the Kubernetes store is optional and the MongoDB store remains unchanged, but doesn't address:

  • Migration path from MongoDB-only to CR-backed storage
  • Whether both stores can run simultaneously during transition
  • Data synchronization strategy if running both
  • Rollback procedure if issues arise

A brief migration strategy would help teams plan adoption.

Comment on lines +19 to +24
### API Definition

- API Group: `healthevents.dgxc.nvidia.com`
- Kind: `HealthEvent`
- `spec`: contains immutable event data (from HealthEvent)
- `status`: tracks mutable workflow state (from HealthEventStatus)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Add API version to the API Definition.

The API Definition section specifies the API Group and Kind but omits the API version. Based on the AI summary and related code changes, this should be v1alpha1. Explicitly documenting the version is essential for a complete API specification.

📝 Suggested addition
 ### API Definition
 
 - API Group: `healthevents.dgxc.nvidia.com`
+- Version: `v1alpha1`
 - Kind: `HealthEvent`
 - `spec`: contains immutable event data (from HealthEvent)
 - `status`: tracks mutable workflow state (from HealthEventStatus)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
### API Definition
- API Group: `healthevents.dgxc.nvidia.com`
- Kind: `HealthEvent`
- `spec`: contains immutable event data (from HealthEvent)
- `status`: tracks mutable workflow state (from HealthEventStatus)
### API Definition
- API Group: `healthevents.dgxc.nvidia.com`
- Version: `v1alpha1`
- Kind: `HealthEvent`
- `spec`: contains immutable event data (from HealthEvent)
- `status`: tracks mutable workflow state (from HealthEventStatus)
🤖 Prompt for AI Agents
In `@docs/designs/027-kubernetes-data-store.md` around lines 19 - 24, The API
Definition is missing the API version; update the section that lists API Group
and Kind to include "API Version: v1alpha1" so the spec explicitly reads API
Group: `healthevents.dgxc.nvidia.com`, API Version: `v1alpha1`, Kind:
`HealthEvent` and clarify that `spec` maps to immutable event data (HealthEvent)
and `status` maps to mutable workflow state (HealthEventStatus); this ensures
the API is fully specified wherever the API Definition block appears.

- Resource cleanup and garbage collection: Automatically remove CRs when events are remediated or when cluster/node limits are reached.
- Phased approach: Core watch-based workflows are enabled first; advanced features can be added later.

## Alternatives Considered
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Complete the "Alternatives Considered" section.

The "Alternatives Considered" section is empty. ADRs should document what alternatives were evaluated and why they were not chosen. This provides important context for future reviewers and helps justify the selected approach.

Consider documenting alternatives such as:

  • Pure MongoDB with no Kubernetes CRs
  • Kubernetes ConfigMaps or Secrets
  • Separate etcd cluster
  • External time-series databases
  • In-memory storage with external backup
🤖 Prompt for AI Agents
In `@docs/designs/027-kubernetes-data-store.md` at line 89, Populate the "##
Alternatives Considered" section by listing each evaluated option (e.g., "Pure
MongoDB with no Kubernetes CRs", "Kubernetes ConfigMaps/Secrets", "Separate etcd
cluster", "External time-series DBs", "In-memory storage with external backup")
and add a concise 1–2 sentence pros/cons note for each explaining scalability,
consistency, operational complexity, and why it was rejected compared to the
chosen Kubernetes CR-backed approach; ensure you reference the ADR title
"027-kubernetes-data-store.md" and the header "## Alternatives Considered" so
reviewers can locate the new content quickly.

@yavinash007
Copy link
Contributor Author

@yavinash007 Can you please add a design doc stating out motivation behind this, problem that this solve and what is the end goal .etc so we've consolidated view of overall arc?

Started a doc, pls take a look. let us know your thoughts
@XRFXLP #787

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants

Comments