WIP - k8s API for healtheventwithstatus model#640
WIP - k8s API for healtheventwithstatus model#640yavinash007 wants to merge 2 commits intoNVIDIA:mainfrom
Conversation
📝 WalkthroughWalkthroughAdds a new HealthEvent Kubernetes CRD (v1alpha1), API types and generated deepcopies, bidirectional conversion functions and tests between protobuf/model/CRD, CRD manifests, and Makefile/tooling updates to support controller-gen-based generation. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant Producer as Producer/Service
participant Converter as Conversion Layer
participant K8sAPI as Kubernetes API Server (CRD)
participant Storage as API Server Storage / etcd
Producer->>Converter: Create/emit HealthEvent (proto/model)
Converter->>K8sAPI: Convert to v1alpha1.HealthEvent (create/update)
K8sAPI->>Storage: Persist CR (spec immutable, status mutable)
K8sAPI-->>Converter: CR persisted (status updates readable)
Note over Converter,K8sAPI: Reconciler/consumers read/write status via CRD
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
|
||
| var ( | ||
| // GroupVersion is group version used to register these objects | ||
| GroupVersion = schema.GroupVersion{Group: "data-models.dgxc.nvidia.com", Version: "v1alpha1"} |
There was a problem hiding this comment.
I was thinking maybe we do healthevents.dgxc.nvidia.com but open to ideas here too. Just because health-events is a little more descriptive than data-models
aa400bd to
14aa429
Compare
| EventID string `json:"eventID"` | ||
|
|
||
| // Node associated with this health event | ||
| // +kubebuilder:validation:Required |
There was a problem hiding this comment.
This still looks like we're redefining the types, we ideally don't want to maintain two copies of the same types and keep them in sync
|
|
||
| // +kubebuilder:object:root=true | ||
| // +kubebuilder:subresource:status | ||
| type HealthStatus struct { |
There was a problem hiding this comment.
so my thought was this struct would look like this:
// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
type HealthStatus struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
Spec model.HealthEvent `json:"spec,omitempty"`
Status model.HealthEventStatus `json:"status,omitempty"`
}
So you would not define your own spec and status objects we would just use the existing ones.
There's some implications for api-versioning going forward if we want to adhere to best practices that we should discuss with the NVIDIA folks if we do it this way but it seems like a clean way to share the object that the other datasources use.
|
Discussed offline, it'll likely be cleaner to have separate structs than munging the proto and the k8s type. Then we should make sure to have testing to ensure the structs stay in-sync. One idea is if we can get rid of the |
7f9a0bb to
6260e04
Compare
pls check now, let me know what you think. |
045caab to
91cc242
Compare
| return out | ||
| } | ||
|
|
||
| func assertRecursiveFieldParity(t *testing.T, leftName string, leftType reflect.Type, rightName string, rightType reflect.Type, ignoreTopLevel map[string]struct{}) { |
There was a problem hiding this comment.
Maybe just add some comments throughout the function to help explain what's going on, just because reflection/recursion can sometimes be a little tough to follow
|
We will eventually want to include some CI check making sure the generated files are actually in-sync with generation but since this is the first PR in a series of PRs on a greenfield feature I think this is a good start and we can follow up with that |
There was a problem hiding this comment.
Actionable comments posted: 8
🤖 Fix all issues with AI agents
In `@data-models/api/v1alpha1/healthevent_types.go`:
- Around line 49-55: The OperationStatus struct marks Status as required via the
kubebuilder tag but its JSON tag includes `omitempty`, allowing it to be dropped
from CR JSON; update the OperationStatus definition by removing `omitempty` from
the Status field's json tag (change `Status Status `json:"status,omitempty"` to
`Status Status `json:"status"`), or alternatively remove the
`+kubebuilder:validation:Required` line if the field should be optional; ensure
the change is applied to the OperationStatus struct and the Status field
declaration.
- Around line 146-161: Add Go doc comments for the exported types HealthEvent
and HealthEventList by inserting descriptive comment lines immediately above
each type declaration: a comment for HealthEvent explaining it represents the
Kubernetes custom resource holding Spec (HealthEventSpec) and Status
(HealthEventStatus) for a health event, and a comment for HealthEventList
explaining it represents a list of HealthEvent resources (the list type used by
the API server). Ensure the comments start with the type names "HealthEvent" and
"HealthEventList" respectively and are placed directly above their type
definitions.
- Around line 96-144: Add a proper Go doc comment for the exported type
HealthEventSpec: place a concise sentence above the type declaration starting
with "HealthEventSpec ..." describing what the struct represents (e.g., a
health/event payload for components/nodes including status, metadata, timestamps
and remediation info). Ensure the comment follows Go doc conventions (starts
with the type name and is a full sentence) and resides immediately above the
HealthEventSpec declaration so documentation tools and linters recognize it.
- Line 1: Add Go doc comments for all exported types to satisfy Go conventions
by documenting Status, RecommendedAction, ProcessingStrategy, HealthEventSpec,
HealthEvent, and HealthEventList; place a short sentence above each type
declaration describing its purpose. Also fix the field tag mismatch in
OperationStatus by removing `omitempty` from the JSON tag on the Status field
(since it is annotated `// +kubebuilder:validation:Required`) so the tags and
validation are consistent; update the comment on OperationStatus.Status if
needed to reflect it is required.
- Around line 7-47: Add Go doc comments for the exported enum types Status,
RecommendedAction, and ProcessingStrategy and for each corresponding const
block: place a one-line comment immediately above the type declarations (e.g.,
"Status ..." describing what the Status values represent) and add a short
comment above each const block that describes the group of constants (e.g.,
"Status values," "RecommendedAction values," "ProcessingStrategy values").
Ensure the comments use the exact exported identifier as the first word (Status,
RecommendedAction, ProcessingStrategy) so they satisfy Go doc conventions and
linters.
In `@data-models/pkg/conversion/healthevent_coversion.go`:
- Around line 219-239: Add missing package-level godoc at the top of the file
and add concise function comments for the exported conversion helpers
ProtoRecommendedActionToCRD, CRDRecommendedActionToProto,
ProtoProcessingStrategyToCRD, and CRDProcessingStrategyToProto; each function
comment should state it converts between the protobuf enum and the CRD enum and
note that unknown inputs map to zero values due to protobuf/name maps but are
constrained by upstream CRD schema and protobuf enum rules. Ensure comments
follow Go godoc style (start with the function/package name) and place the
package comment at the top of the file.
In
`@distros/kubernetes/nvsentinel/crds/healthevents.dgxc.nvidia.com_healthevents.yaml`:
- Around line 137-182: The HealthEventStatus schema currently requires
userPodsEvictionStatus which forces every status update to include it; update
the CRD by removing "userPodsEvictionStatus" from the status.required array (or
change the required list to be empty) so that userPodsEvictionStatus becomes
optional, ensuring HealthEventStatus can be PATCHed/created without that field;
locate the status schema for HealthEventStatus and modify the required property
accordingly.
- Around line 38-136: The CRD is missing the ProcessingStrategy field from the
HealthEventSpec; add a new spec property named processingStrategy (type: string)
with enum values UNSPECIFIED, EXECUTE_REMEDIATION, and STORE_ONLY and an
appropriate description matching HealthEventSpec in
data-models/api/v1alpha1/healthevent_types.go, then regenerate the OpenAPI CRD
so the field is persisted by the Kubernetes API server; locate the spec
properties block in the healthevents CRD (the same section containing
recommendedAction, version, etc.) and add the processingStrategy entry there
before regenerating.
♻️ Duplicate comments (1)
data-models/pkg/conversion/healthevent_conversion_test.go (1)
36-84: Add inline comments to explain the reflection logic.The recursive field extraction logic using reflection can be difficult to follow. Consider adding brief inline comments explaining:
- The purpose of unwrapping pointers (lines 40-42, 70-73)
- Why
PkgPath != ""means unexported (line 51)- The
isScalarStructexclusion rationale (line 76)
🧹 Nitpick comments (4)
data-models/Makefile (1)
48-66: Optional: split thegeneratetarget to satisfy checkmake max body length.
This keeps the lint clean and makes the steps easier to reuse.♻️ Possible split to keep target bodies short
-.PHONY: generate -generate: ## Generate CRDs and move them to Helm chart directory - `@echo` "Generating CRDs for $(API_DIR)..." - @# Install controller-gen if not present - `@which` controller-gen > /dev/null || (echo "Installing controller-gen..." && \ - go install sigs.k8s.io/controller-tools/cmd/controller-gen@$(CONTROLLER_GEN_VERSION)) - @# Generate deepcopy files for API types - go run sigs.k8s.io/controller-tools/cmd/controller-gen@$(CONTROLLER_GEN_VERSION) object paths=./$(API_DIR) - @# Generate CRDs directly into API_DIR - go run sigs.k8s.io/controller-tools/cmd/controller-gen@$(CONTROLLER_GEN_VERSION) \ - crd paths=./$(API_DIR) output:crd:dir=./$(API_DIR) - @# Move generated CRDs to Helm chart directory - `@echo` "Moving generated CRDs to $(CRD_OUTPUT_DIR)..." - `@mkdir` -p $(CRD_OUTPUT_DIR) - `@mv` ./$(API_DIR)/*.yaml $(CRD_OUTPUT_DIR)/ || true - `@echo` "CRDs generated and moved to $(CRD_OUTPUT_DIR)" - `@ls` -1 $(CRD_OUTPUT_DIR)/*.yaml || echo "No CRD YAMLs generated" +.PHONY: generate generate-crds move-crds +generate: generate-crds move-crds ## Generate CRDs and move them to Helm chart directory + +generate-crds: + `@echo` "Generating CRDs for $(API_DIR)..." + `@which` controller-gen > /dev/null || (echo "Installing controller-gen..." && \ + go install sigs.k8s.io/controller-tools/cmd/controller-gen@$(CONTROLLER_GEN_VERSION)) + go run sigs.k8s.io/controller-tools/cmd/controller-gen@$(CONTROLLER_GEN_VERSION) object paths=./$(API_DIR) + go run sigs.k8s.io/controller-tools/cmd/controller-gen@$(CONTROLLER_GEN_VERSION) \ + crd paths=./$(API_DIR) output:crd:dir=./$(API_DIR) + +move-crds: + `@echo` "Moving generated CRDs to $(CRD_OUTPUT_DIR)..." + `@mkdir` -p $(CRD_OUTPUT_DIR) + `@mv` ./$(API_DIR)/*.yaml $(CRD_OUTPUT_DIR)/ || true + `@ls` -1 $(CRD_OUTPUT_DIR)/*.yaml || echo "No CRD YAMLs generated"data-models/pkg/conversion/healthevent_conversion_test.go (1)
189-272: Consider table-driven tests for conversion scenarios.The conversion tests are thorough but could benefit from table-driven structure to test additional scenarios (e.g., nil inputs, partial fields, edge cases). As per coding guidelines, table-driven tests are preferred when testing multiple scenarios.
♻️ Example structure
func TestHealthEventSpecConversion(t *testing.T) { tests := []struct { name string protoEvent *protos.HealthEvent wantErr bool }{ { name: "full event", protoEvent: &protos.HealthEvent{/* ... */}, }, { name: "nil input", protoEvent: nil, }, { name: "minimal fields", protoEvent: &protos.HealthEvent{CheckName: "check1"}, }, } for _, tt := range tests { t.Run(tt.name, func(t *testing.T) { // test logic }) } }data-models/pkg/conversion/healthevent_coversion.go (2)
1-1: Filename typo:healthevent_coversion.goshould behealthevent_conversion.go.The filename has a typo - "coversion" is missing an 'n'. Consider renaming for consistency.
1-11: Add package-level godoc.As per coding guidelines, package-level godoc is required for all Go packages. Add a brief description of the package's purpose.
package conversion + +// Package conversion provides bidirectional conversion functions between +// internal model types, v1alpha1 CRD representations, and protobuf definitions +// for HealthEvent resources. import (
| @@ -0,0 +1,165 @@ | |||
| package v1alpha1 | |||
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check if package comment exists in v1alpha1
rg -n '^\s*//\s*Package v1alpha1\b' -g '*.go' data-models/api/v1alpha1/Repository: NVIDIA/NVSentinel
Length of output: 201
🏁 Script executed:
# Also check what files exist in the v1alpha1 directory
find data-models/api/v1alpha1/ -type f -name '*.go' | head -20Repository: NVIDIA/NVSentinel
Length of output: 202
🏁 Script executed:
# Read the full healthevent_types.go file to see exported types and their comments
wc -l data-models/api/v1alpha1/healthevent_types.goRepository: NVIDIA/NVSentinel
Length of output: 110
🏁 Script executed:
cat -n data-models/api/v1alpha1/healthevent_types.goRepository: NVIDIA/NVSentinel
Length of output: 6630
Add doc comments for exported types and fix field tag mismatch in OperationStatus.
Export types without doc comments violate Go conventions:
Status(line 9)RecommendedAction(line 26)ProcessingStrategy(line 41)HealthEventSpec(line 96)HealthEvent(line 148)HealthEventList(line 157)
Additionally, fix the conflicting field annotation on OperationStatus.Status (line 52): the field is marked // +kubebuilder:validation:Required but also has omitempty in the JSON tag. Remove omitempty since required fields should not be omittable.
🤖 Prompt for AI Agents
In `@data-models/api/v1alpha1/healthevent_types.go` at line 1, Add Go doc comments
for all exported types to satisfy Go conventions by documenting Status,
RecommendedAction, ProcessingStrategy, HealthEventSpec, HealthEvent, and
HealthEventList; place a short sentence above each type declaration describing
its purpose. Also fix the field tag mismatch in OperationStatus by removing
`omitempty` from the JSON tag on the Status field (since it is annotated `//
+kubebuilder:validation:Required`) so the tags and validation are consistent;
update the comment on OperationStatus.Status if needed to reflect it is
required.
| // +kubebuilder:validation:Enum=NotStarted;InProgress;Failed;Succeeded;AlreadyDrained;\ | ||
| // UnQuarantined;Quarantined;AlreadyQuarantined;Cancelled | ||
| type Status string | ||
|
|
||
| const ( | ||
| StatusNotStarted Status = "NotStarted" | ||
| StatusInProgress Status = "InProgress" | ||
| StatusFailed Status = "Failed" | ||
| StatusSucceeded Status = "Succeeded" | ||
| StatusAlreadyDrained Status = "AlreadyDrained" | ||
|
|
||
| StatusUnQuarantined Status = "UnQuarantined" | ||
| StatusQuarantined Status = "Quarantined" | ||
| StatusAlreadyQuarantined Status = "AlreadyQuarantined" | ||
| StatusCancelled Status = "Cancelled" | ||
| ) | ||
|
|
||
| // +kubebuilder:validation:Enum=NONE;COMPONENT_RESET;CONTACT_SUPPORT;\ | ||
| // RUN_FIELDDIAG;RESTART_VM;RESTART_BM;REPLACE_VM;RUN_DCGMEUD;UNKNOWN | ||
| type RecommendedAction string | ||
|
|
||
| const ( | ||
| RecommendedActionNone RecommendedAction = "NONE" | ||
| RecommendedActionComponentReset RecommendedAction = "COMPONENT_RESET" | ||
| RecommendedActionContactSupport RecommendedAction = "CONTACT_SUPPORT" | ||
| RecommendedActionRunFieldDiag RecommendedAction = "RUN_FIELDDIAG" | ||
| RecommendedActionRestartVM RecommendedAction = "RESTART_VM" | ||
| RecommendedActionRestartBM RecommendedAction = "RESTART_BM" | ||
| RecommendedActionReplaceVM RecommendedAction = "REPLACE_VM" | ||
| RecommendedActionRunDCGMEUD RecommendedAction = "RUN_DCGMEUD" | ||
| RecommendedActionUnknown RecommendedAction = "UNKNOWN" | ||
| ) | ||
|
|
||
| // +kubebuilder:validation:Enum=UNSPECIFIED;EXECUTE_REMEDIATION;STORE_ONLY | ||
| type ProcessingStrategy string | ||
|
|
||
| const ( | ||
| ProcessingStrategyUnspecified ProcessingStrategy = "UNSPECIFIED" | ||
| ProcessingStrategyExecuteRemediation ProcessingStrategy = "EXECUTE_REMEDIATION" | ||
| ProcessingStrategyStoreOnly ProcessingStrategy = "STORE_ONLY" | ||
| ) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check if file exists and read the relevant lines
if [ -f "data-models/api/v1alpha1/healthevent_types.go" ]; then
echo "=== File exists, showing lines 1-60 ==="
head -60 "data-models/api/v1alpha1/healthevent_types.go" | cat -n
else
echo "File not found"
find . -name "healthevent_types.go" -type f
fiRepository: NVIDIA/NVSentinel
Length of output: 178
🏁 Script executed:
#!/bin/bash
# Search for any doc comments before the types
echo "=== Checking for doc comments before exported types ==="
rg -B5 '^\s*type\s+(Status|RecommendedAction|ProcessingStrategy)\s' data-models/api/v1alpha1/healthevent_types.goRepository: NVIDIA/NVSentinel
Length of output: 797
🏁 Script executed:
#!/bin/bash
# Check entire file structure
echo "=== Full file to understand context ==="
wc -l data-models/api/v1alpha1/healthevent_types.goRepository: NVIDIA/NVSentinel
Length of output: 150
🏁 Script executed:
#!/bin/bash
# Read more of the file to check package-level doc and understand the full context
head -20 data-models/api/v1alpha1/healthevent_types.go | cat -nRepository: NVIDIA/NVSentinel
Length of output: 827
🏁 Script executed:
#!/bin/bash
# Check if there are any other exported types in the file with doc comments to compare
rg -B1 '^\s*type\s+\w+\s' data-models/api/v1alpha1/healthevent_types.go | head -40Repository: NVIDIA/NVSentinel
Length of output: 865
🌐 Web query:
Go golint exported type comment requirements documentation
💡 Result:
Short answer — golint expects exported (capitalized) top-level names to have a doc comment immediately preceding the declaration; the comment should be a complete sentence whose first word is the name being documented (e.g. “TypeName …”). If you omit it golint reports “exported X should have comment or be unexported”; if the wording doesn’t start with the name it reports “comment on exported type X should be of the form 'X ...'”. [1][2]
Sources:
- Effective Go: doc-comment style and “Every exported (capitalized) name … should have a doc comment; the first sentence should start with the name.” [1]
- Examples / golint messages (and common linter tooling guidance noting the exact message text). [2]
Add Go doc comments for exported enum types and const blocks.
Status, RecommendedAction, and ProcessingStrategy are exported but lack doc comments required by Go conventions; add comments for each type and the associated const block.
Suggested additions
+// Status represents the state of a remediation sub-operation.
// +kubebuilder:validation:Enum=NotStarted;InProgress;Failed;Succeeded;AlreadyDrained;\
// UnQuarantined;Quarantined;AlreadyQuarantined;Cancelled
type Status string
+// Status-related constants for HealthEvent operations.
const (
StatusNotStarted Status = "NotStarted"
StatusInProgress Status = "InProgress"
StatusFailed Status = "Failed"
StatusSucceeded Status = "Succeeded"
StatusAlreadyDrained Status = "AlreadyDrained"
+// RecommendedAction indicates the suggested remediation action.
// +kubebuilder:validation:Enum=NONE;COMPONENT_RESET;CONTACT_SUPPORT;\
// RUN_FIELDDIAG;RESTART_VM;RESTART_BM;REPLACE_VM;RUN_DCGMEUD;UNKNOWN
type RecommendedAction string
+// RecommendedAction-related constants.
const (
RecommendedActionNone RecommendedAction = "NONE"
RecommendedActionComponentReset RecommendedAction = "COMPONENT_RESET"
+// ProcessingStrategy controls how the system should handle this event.
// +kubebuilder:validation:Enum=UNSPECIFIED;EXECUTE_REMEDIATION;STORE_ONLY
type ProcessingStrategy string
+// ProcessingStrategy-related constants.
const (
ProcessingStrategyUnspecified ProcessingStrategy = "UNSPECIFIED"
ProcessingStrategyExecuteRemediation ProcessingStrategy = "EXECUTE_REMEDIATION"
ProcessingStrategyStoreOnly ProcessingStrategy = "STORE_ONLY"
)🤖 Prompt for AI Agents
In `@data-models/api/v1alpha1/healthevent_types.go` around lines 7 - 47, Add Go
doc comments for the exported enum types Status, RecommendedAction, and
ProcessingStrategy and for each corresponding const block: place a one-line
comment immediately above the type declarations (e.g., "Status ..." describing
what the Status values represent) and add a short comment above each const block
that describes the group of constants (e.g., "Status values," "RecommendedAction
values," "ProcessingStrategy values"). Ensure the comments use the exact
exported identifier as the first word (Status, RecommendedAction,
ProcessingStrategy) so they satisfy Go doc conventions and linters.
| // OperationStatus represents the status of a sub-operation | ||
| type OperationStatus struct { | ||
| // +kubebuilder:validation:Required | ||
| Status Status `json:"status,omitempty"` | ||
|
|
||
| Message string `json:"message,omitempty"` | ||
| } |
There was a problem hiding this comment.
Required field marked omitempty can drop status in CRD JSON.
Status is marked required but omitempty allows it to be omitted from JSON, which contradicts the CRD schema and can cause validation failures. Remove omitempty (or drop Required if truly optional).
🩹 Proposed fix
- // +kubebuilder:validation:Required
- Status Status `json:"status,omitempty"`
+ // +kubebuilder:validation:Required
+ Status Status `json:"status"`📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // OperationStatus represents the status of a sub-operation | |
| type OperationStatus struct { | |
| // +kubebuilder:validation:Required | |
| Status Status `json:"status,omitempty"` | |
| Message string `json:"message,omitempty"` | |
| } | |
| // OperationStatus represents the status of a sub-operation | |
| type OperationStatus struct { | |
| // +kubebuilder:validation:Required | |
| Status Status `json:"status"` | |
| Message string `json:"message,omitempty"` | |
| } |
🤖 Prompt for AI Agents
In `@data-models/api/v1alpha1/healthevent_types.go` around lines 49 - 55, The
OperationStatus struct marks Status as required via the kubebuilder tag but its
JSON tag includes `omitempty`, allowing it to be dropped from CR JSON; update
the OperationStatus definition by removing `omitempty` from the Status field's
json tag (change `Status Status `json:"status,omitempty"` to `Status Status
`json:"status"`), or alternatively remove the `+kubebuilder:validation:Required`
line if the field should be optional; ensure the change is applied to the
OperationStatus struct and the Status field declaration.
| type HealthEventSpec struct { | ||
| // Version of the event | ||
| Version uint32 `json:"version,omitempty"` | ||
|
|
||
| // Agent that generated the event | ||
| Agent string `json:"agent,omitempty"` | ||
|
|
||
| // Component class | ||
| ComponentClass string `json:"componentClass,omitempty"` | ||
|
|
||
| // Name of the check | ||
| CheckName string `json:"checkName,omitempty"` | ||
|
|
||
| // Whether the fault is fatal | ||
| IsFatal *bool `json:"isFatal,omitempty"` | ||
|
|
||
| // Whether the system is healthy | ||
| IsHealthy *bool `json:"isHealthy,omitempty"` | ||
|
|
||
| // Event message | ||
| Message string `json:"message,omitempty"` | ||
|
|
||
| // Recommended remediation action | ||
| RecommendedAction RecommendedAction `json:"recommendedAction,omitempty"` | ||
|
|
||
| // Error codes | ||
| ErrorCode []string `json:"errorCode,omitempty"` | ||
|
|
||
| // Entities impacted | ||
| EntitiesImpacted []*Entity `json:"entitiesImpacted,omitempty"` | ||
|
|
||
| // Metadata map | ||
| Metadata map[string]string `json:"metadata,omitempty"` | ||
|
|
||
| // Timestamp when the event was generated | ||
| GeneratedTimestamp *metav1.Time `json:"generatedTimestamp,omitempty"` | ||
|
|
||
| // NodeName is the name of the target node | ||
| NodeName string `json:"nodeName,omitempty"` | ||
|
|
||
| // Behaviour overrides for quarantine | ||
| QuarantineOverrides *BehaviourOverrides `json:"quarantineOverrides,omitempty"` | ||
|
|
||
| // Behaviour overrides for drain | ||
| DrainOverrides *BehaviourOverrides `json:"drainOverrides,omitempty"` | ||
|
|
||
| // ProcessingStrategy | ||
| ProcessingStrategy ProcessingStrategy `json:"processingStrategy,omitempty"` | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check the exact lines 96-144 to see if HealthEventSpec has a doc comment
sed -n '85,150p' data-models/api/v1alpha1/healthevent_types.go | cat -nRepository: NVIDIA/NVSentinel
Length of output: 2543
🏁 Script executed:
#!/bin/bash
# Also check other exported types in the same file to see their doc comment patterns
rg -n '^\s*type\s+\w+\s+(struct|interface)' data-models/api/v1alpha1/healthevent_types.go -B2 | head -40Repository: NVIDIA/NVSentinel
Length of output: 855
Add a Go doc comment for HealthEventSpec.
The exported type HealthEventSpec at line 96 is missing a doc comment. All other exported types in this file (Entity, BehaviourOverrides, HealthEventStatus, OperationStatus) include proper documentation comments.
♻️ Suggested comment
+// HealthEventSpec defines the desired state of HealthEvent.
type HealthEventSpec struct {📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| type HealthEventSpec struct { | |
| // Version of the event | |
| Version uint32 `json:"version,omitempty"` | |
| // Agent that generated the event | |
| Agent string `json:"agent,omitempty"` | |
| // Component class | |
| ComponentClass string `json:"componentClass,omitempty"` | |
| // Name of the check | |
| CheckName string `json:"checkName,omitempty"` | |
| // Whether the fault is fatal | |
| IsFatal *bool `json:"isFatal,omitempty"` | |
| // Whether the system is healthy | |
| IsHealthy *bool `json:"isHealthy,omitempty"` | |
| // Event message | |
| Message string `json:"message,omitempty"` | |
| // Recommended remediation action | |
| RecommendedAction RecommendedAction `json:"recommendedAction,omitempty"` | |
| // Error codes | |
| ErrorCode []string `json:"errorCode,omitempty"` | |
| // Entities impacted | |
| EntitiesImpacted []*Entity `json:"entitiesImpacted,omitempty"` | |
| // Metadata map | |
| Metadata map[string]string `json:"metadata,omitempty"` | |
| // Timestamp when the event was generated | |
| GeneratedTimestamp *metav1.Time `json:"generatedTimestamp,omitempty"` | |
| // NodeName is the name of the target node | |
| NodeName string `json:"nodeName,omitempty"` | |
| // Behaviour overrides for quarantine | |
| QuarantineOverrides *BehaviourOverrides `json:"quarantineOverrides,omitempty"` | |
| // Behaviour overrides for drain | |
| DrainOverrides *BehaviourOverrides `json:"drainOverrides,omitempty"` | |
| // ProcessingStrategy | |
| ProcessingStrategy ProcessingStrategy `json:"processingStrategy,omitempty"` | |
| } | |
| // HealthEventSpec defines the desired state of HealthEvent. | |
| type HealthEventSpec struct { | |
| // Version of the event | |
| Version uint32 `json:"version,omitempty"` | |
| // Agent that generated the event | |
| Agent string `json:"agent,omitempty"` | |
| // Component class | |
| ComponentClass string `json:"componentClass,omitempty"` | |
| // Name of the check | |
| CheckName string `json:"checkName,omitempty"` | |
| // Whether the fault is fatal | |
| IsFatal *bool `json:"isFatal,omitempty"` | |
| // Whether the system is healthy | |
| IsHealthy *bool `json:"isHealthy,omitempty"` | |
| // Event message | |
| Message string `json:"message,omitempty"` | |
| // Recommended remediation action | |
| RecommendedAction RecommendedAction `json:"recommendedAction,omitempty"` | |
| // Error codes | |
| ErrorCode []string `json:"errorCode,omitempty"` | |
| // Entities impacted | |
| EntitiesImpacted []*Entity `json:"entitiesImpacted,omitempty"` | |
| // Metadata map | |
| Metadata map[string]string `json:"metadata,omitempty"` | |
| // Timestamp when the event was generated | |
| GeneratedTimestamp *metav1.Time `json:"generatedTimestamp,omitempty"` | |
| // NodeName is the name of the target node | |
| NodeName string `json:"nodeName,omitempty"` | |
| // Behaviour overrides for quarantine | |
| QuarantineOverrides *BehaviourOverrides `json:"quarantineOverrides,omitempty"` | |
| // Behaviour overrides for drain | |
| DrainOverrides *BehaviourOverrides `json:"drainOverrides,omitempty"` | |
| // ProcessingStrategy | |
| ProcessingStrategy ProcessingStrategy `json:"processingStrategy,omitempty"` | |
| } |
🤖 Prompt for AI Agents
In `@data-models/api/v1alpha1/healthevent_types.go` around lines 96 - 144, Add a
proper Go doc comment for the exported type HealthEventSpec: place a concise
sentence above the type declaration starting with "HealthEventSpec ..."
describing what the struct represents (e.g., a health/event payload for
components/nodes including status, metadata, timestamps and remediation info).
Ensure the comment follows Go doc conventions (starts with the type name and is
a full sentence) and resides immediately above the HealthEventSpec declaration
so documentation tools and linters recognize it.
| // +kubebuilder:object:root=true | ||
| // +kubebuilder:subresource:status | ||
| type HealthEvent struct { | ||
| metav1.TypeMeta `json:",inline"` | ||
| metav1.ObjectMeta `json:"metadata,omitempty"` | ||
|
|
||
| Spec HealthEventSpec `json:"spec,omitempty"` | ||
| Status HealthEventStatus `json:"status,omitempty"` | ||
| } | ||
|
|
||
| // +kubebuilder:object:root=true | ||
| type HealthEventList struct { | ||
| metav1.TypeMeta `json:",inline"` | ||
| metav1.ListMeta `json:"metadata,omitempty"` | ||
| Items []HealthEvent `json:"items"` | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
cat -n data-models/api/v1alpha1/healthevent_types.go | sed -n '140,165p'Repository: NVIDIA/NVSentinel
Length of output: 966
Add Go doc comments for HealthEvent and HealthEventList.
These exported types are missing doc comments, which is required by Go conventions.
Suggested comments
-// +kubebuilder:object:root=true
-// +kubebuilder:subresource:status
-type HealthEvent struct {
+// HealthEvent is the Schema for the healthevents API.
+// +kubebuilder:object:root=true
+// +kubebuilder:subresource:status
+type HealthEvent struct {
...
}
-// +kubebuilder:object:root=true
-type HealthEventList struct {
+// HealthEventList contains a list of HealthEvent.
+// +kubebuilder:object:root=true
+type HealthEventList struct {
...
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // +kubebuilder:object:root=true | |
| // +kubebuilder:subresource:status | |
| type HealthEvent struct { | |
| metav1.TypeMeta `json:",inline"` | |
| metav1.ObjectMeta `json:"metadata,omitempty"` | |
| Spec HealthEventSpec `json:"spec,omitempty"` | |
| Status HealthEventStatus `json:"status,omitempty"` | |
| } | |
| // +kubebuilder:object:root=true | |
| type HealthEventList struct { | |
| metav1.TypeMeta `json:",inline"` | |
| metav1.ListMeta `json:"metadata,omitempty"` | |
| Items []HealthEvent `json:"items"` | |
| } | |
| // HealthEvent is the Schema for the healthevents API. | |
| // +kubebuilder:object:root=true | |
| // +kubebuilder:subresource:status | |
| type HealthEvent struct { | |
| metav1.TypeMeta `json:",inline"` | |
| metav1.ObjectMeta `json:"metadata,omitempty"` | |
| Spec HealthEventSpec `json:"spec,omitempty"` | |
| Status HealthEventStatus `json:"status,omitempty"` | |
| } | |
| // HealthEventList contains a list of HealthEvent. | |
| // +kubebuilder:object:root=true | |
| type HealthEventList struct { | |
| metav1.TypeMeta `json:",inline"` | |
| metav1.ListMeta `json:"metadata,omitempty"` | |
| Items []HealthEvent `json:"items"` | |
| } |
🤖 Prompt for AI Agents
In `@data-models/api/v1alpha1/healthevent_types.go` around lines 146 - 161, Add Go
doc comments for the exported types HealthEvent and HealthEventList by inserting
descriptive comment lines immediately above each type declaration: a comment for
HealthEvent explaining it represents the Kubernetes custom resource holding Spec
(HealthEventSpec) and Status (HealthEventStatus) for a health event, and a
comment for HealthEventList explaining it represents a list of HealthEvent
resources (the list type used by the API server). Ensure the comments start with
the type names "HealthEvent" and "HealthEventList" respectively and are placed
directly above their type definitions.
| // RecommendedAction conversions | ||
| func ProtoRecommendedActionToCRD(p protos.RecommendedAction) v1alpha1.RecommendedAction { | ||
| return v1alpha1.RecommendedAction( | ||
| protos.RecommendedAction_name[int32(p)], | ||
| ) | ||
| } | ||
|
|
||
| func CRDRecommendedActionToProto(c v1alpha1.RecommendedAction) protos.RecommendedAction { | ||
| return protos.RecommendedAction( | ||
| protos.RecommendedAction_value[string(c)], | ||
| ) | ||
| } | ||
|
|
||
| // ProcessingStrategy conversions | ||
| func ProtoProcessingStrategyToCRD(p protos.ProcessingStrategy) v1alpha1.ProcessingStrategy { | ||
| return v1alpha1.ProcessingStrategy(protos.ProcessingStrategy_name[int32(p)]) | ||
| } | ||
|
|
||
| func CRDProcessingStrategyToProto(c v1alpha1.ProcessingStrategy) protos.ProcessingStrategy { | ||
| return protos.ProcessingStrategy(protos.ProcessingStrategy_value[string(c)]) | ||
| } |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# First, locate the file and examine its actual content
find . -type f -name "*healthevent*" -o -name "*health_event*" | head -20Repository: NVIDIA/NVSentinel
Length of output: 1307
🏁 Script executed:
#!/bin/bash
# Check if the file exists and read the relevant section
fd -e go -i health | grep -E "conversion|coversion"Repository: NVIDIA/NVSentinel
Length of output: 170
🏁 Script executed:
#!/bin/bash
# Search for the file with possible variations
find . -path '*data-models*' -type f -name "*.go" | grep -i healthRepository: NVIDIA/NVSentinel
Length of output: 366
🏁 Script executed:
#!/bin/bash
# Read the specific lines from the file
cat -n data-models/pkg/conversion/healthevent_coversion.go | sed -n '219,239p'Repository: NVIDIA/NVSentinel
Length of output: 971
🏁 Script executed:
#!/bin/bash
# Check the test file to understand expected behavior
cat -n data-models/pkg/conversion/healthevent_conversion_test.goRepository: NVIDIA/NVSentinel
Length of output: 11840
🏁 Script executed:
#!/bin/bash
# Search for callers of these enum conversion functions
rg -n 'ProtoRecommendedActionToCRD|CRDRecommendedActionToProto|ProtoProcessingStrategyToCRD|CRDProcessingStrategyToProto' --type go -B 2 -A 2Repository: NVIDIA/NVSentinel
Length of output: 3900
🏁 Script executed:
#!/bin/bash
# Check CRD types to see if there's validation
head -100 data-models/api/v1alpha1/healthevent_types.goRepository: NVIDIA/NVSentinel
Length of output: 3650
🏁 Script executed:
#!/bin/bash
# Check the proto enum definitions
grep -A 15 "enum RecommendedAction\|enum ProcessingStrategy" data-models/protobufs/health_event.protoRepository: NVIDIA/NVSentinel
Length of output: 422
🏁 Script executed:
#!/bin/bash
# Check entire conversion file header for package comments
head -50 data-models/pkg/conversion/healthevent_coversion.goRepository: NVIDIA/NVSentinel
Length of output: 1350
🏁 Script executed:
#!/bin/bash
# Search for any validation of RecommendedAction or ProcessingStrategy before conversion
rg -n 'RecommendedAction|ProcessingStrategy' data-models/api/v1alpha1/healthevent_types.go -A 2 -B 2Repository: NVIDIA/NVSentinel
Length of output: 1612
🏁 Script executed:
#!/bin/bash
# Check if there are any other conversion functions to understand the pattern
grep -n "^func.*ToCRD\|^func.*ToProto" data-models/pkg/conversion/healthevent_coversion.go | head -30Repository: NVIDIA/NVSentinel
Length of output: 1167
Add function comments and package documentation to exported conversion functions.
All exported conversion functions are missing required documentation. Add a package-level godoc comment at the top of the file and function comments for ProtoRecommendedActionToCRD, CRDRecommendedActionToProto, ProtoProcessingStrategyToCRD, and CRDProcessingStrategyToProto. This is required by the Go conventions guideline.
Additionally, while these enum conversions rely on protobuf name/value maps that return zero values for unknown inputs, this concern is mitigated by upstream CRD schema validation (which constrains values to the defined enum set) and protobuf enum constraints. However, consider adding a brief function comment explaining the conversion behavior.
🤖 Prompt for AI Agents
In `@data-models/pkg/conversion/healthevent_coversion.go` around lines 219 - 239,
Add missing package-level godoc at the top of the file and add concise function
comments for the exported conversion helpers ProtoRecommendedActionToCRD,
CRDRecommendedActionToProto, ProtoProcessingStrategyToCRD, and
CRDProcessingStrategyToProto; each function comment should state it converts
between the protobuf enum and the CRD enum and note that unknown inputs map to
zero values due to protobuf/name maps but are constrained by upstream CRD schema
and protobuf enum rules. Ensure comments follow Go godoc style (start with the
function/package name) and place the package comment at the top of the file.
distros/kubernetes/nvsentinel/crds/healthevents.dgxc.nvidia.com_healthevents.yaml
Show resolved
Hide resolved
| status: | ||
| description: HealthEventStatus defines the observed state of HealthEvent | ||
| properties: | ||
| faultRemediated: | ||
| description: Whether the fault has been remediated | ||
| type: boolean | ||
| lastRemediationTimestamp: | ||
| description: Timestamp of the last remediation attempt | ||
| format: date-time | ||
| type: string | ||
| nodeQuarantined: | ||
| description: Whether the node has been quarantined | ||
| enum: | ||
| - NotStarted | ||
| - InProgress | ||
| - Failed | ||
| - Succeeded | ||
| - AlreadyDrained | ||
| - UnQuarantined | ||
| - Quarantined | ||
| - AlreadyQuarantined | ||
| - Cancelled | ||
| type: string | ||
| userPodsEvictionStatus: | ||
| description: Status of user pod eviction | ||
| properties: | ||
| message: | ||
| type: string | ||
| status: | ||
| enum: | ||
| - NotStarted | ||
| - InProgress | ||
| - Failed | ||
| - Succeeded | ||
| - AlreadyDrained | ||
| - UnQuarantined | ||
| - Quarantined | ||
| - AlreadyQuarantined | ||
| - Cancelled | ||
| type: string | ||
| required: | ||
| - status | ||
| type: object | ||
| required: | ||
| - userPodsEvictionStatus | ||
| type: object |
There was a problem hiding this comment.
userPodsEvictionStatus is marked as required in status.
The status schema requires userPodsEvictionStatus (line 180-181). This means any status update must include this field, which could cause issues during initial resource creation or partial status updates. Consider whether this should be optional.
🤖 Prompt for AI Agents
In
`@distros/kubernetes/nvsentinel/crds/healthevents.dgxc.nvidia.com_healthevents.yaml`
around lines 137 - 182, The HealthEventStatus schema currently requires
userPodsEvictionStatus which forces every status update to include it; update
the CRD by removing "userPodsEvictionStatus" from the status.required array (or
change the required list to be empty) so that userPodsEvictionStatus becomes
optional, ensuring HealthEventStatus can be PATCHed/created without that field;
locate the status schema for HealthEventStatus and modify the required property
accordingly.
|
@yavinash007 Can you please add a design doc stating out motivation behind this, problem that this solve and what is the end goal .etc so we've consolidated view of overall arc? |
richardcase
left a comment
There was a problem hiding this comment.
My biggest piece of feedback is that it would be good to adopt KAL so that it lints the HealthEvent against the Kubernetes API Guidelines.....its easier to enable now than later on.
| @# Generate deepcopy files for API types | ||
| go run sigs.k8s.io/controller-tools/cmd/controller-gen@$(CONTROLLER_GEN_VERSION) object paths=./$(API_DIR) | ||
| @# Generate CRDs directly into API_DIR | ||
| go run sigs.k8s.io/controller-tools/cmd/controller-gen@$(CONTROLLER_GEN_VERSION) \ |
There was a problem hiding this comment.
Should we also be generating the RBAC manifests as well?
| StatusAlreadyQuarantined Status = "AlreadyQuarantined" | ||
| StatusCancelled Status = "Cancelled" | ||
| ) | ||
|
|
There was a problem hiding this comment.
Is there any reason why the values for the enum types are different between Status and RecommendedAction/ProcessingStrategy?
There was a problem hiding this comment.
Guessing its because they match the proto?
|
|
||
| // OperationStatus represents the status of a sub-operation | ||
| type OperationStatus struct { | ||
| // +kubebuilder:validation:Required |
There was a problem hiding this comment.
It may be worth getting the Kubernetes API Linter (KAL) running in the project as early as possible. The reason is that it analyses the custom resources against the Kubernetes API Guidelines and gives a bunch of recommendations.
For example, one of the linters ensures that every field is marked optional or required. It's easier to make the changes now as the API is being defined instead of down the road where it may even require an API version bump.
| // +kubebuilder:validation:Required | ||
| Status Status `json:"status,omitempty"` | ||
|
|
||
| Message string `json:"message,omitempty"` |
There was a problem hiding this comment.
Its good to mark strings with min/max length but KAL would highlight this.
| ErrorCode []string `json:"errorCode,omitempty"` | ||
|
|
||
| // Entities impacted | ||
| EntitiesImpacted []*Entity `json:"entitiesImpacted,omitempty"` |
There was a problem hiding this comment.
For slices, it's good to add the kubebuilder markers to aid SSA. (KAL would highlight this)
| FaultRemediated *bool `json:"faultRemediated,omitempty"` | ||
|
|
||
| // Timestamp of the last remediation attempt | ||
| LastRemediationTimestamp *metav1.Time `json:"lastRemediationTimestamp,omitempty"` |
There was a problem hiding this comment.
Some projects use somethign like this:
// Timestamp of the last remediation attempt
LastRemediationTimestamp metav1.Time `json:"lastRemediationTimestamp,omitempty,omitzero"`I believe there are some issues with using the k8s fuzzer with metav1.Time. But this will really only be a problem if have another API version and do roundtrip testing.
| } | ||
|
|
||
| // +kubebuilder:object:root=true | ||
| // +kubebuilder:subresource:status |
There was a problem hiding this comment.
It may be useful to also add:
- "kubebuilder:resource:path" - so that users can do something like `kubectl get he"
- Multiple "kubebuilder:printcolumn:name" - again useful to expose info when doing kubectl get
|
|
||
| // HealthEvent Status | ||
| // HealthEventStatus defines the observed state of HealthEvent | ||
| type HealthEventStatus struct { |
There was a problem hiding this comment.
Do you think you'll be adding Conditions in the future?
There was a problem hiding this comment.
Since we already have the types definitions in the proto, should consider using a code generation plugin such as https://github.com/yandex/protoc-gen-crd to generate the CRDs and possibly the required go types?
There was a problem hiding this comment.
Yeah, thanks for the suggestion @lalitadithya — I tried it out locally and it works well.
One thing to consider is that HealthEventStatus is currently not part of the proto file. We may want to move it into the proto as well, alongside HealthEvent, to keep the model consistent.
I’ve captured this consideration in the design doc too. Let me know what you think:
#787
There was a problem hiding this comment.
Should we put the crds in a sub chart of it's own so that we can easily disable it based on a helm value?
Signed-off-by: Avinash Yeddula <avinash.yeddula@omniva.com>
91cc242 to
e145c35
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@docs/designs/027-kubernetes-data-store.md`:
- Line 1: The ADR title string "ADR-027: DatastorKubernetes Custom Resource (CR)
for HealthEvent" contains a typo; update the header to read "ADR-027:
DatastoreKubernetes Custom Resource (CR) for HealthEvent" (fix "Datastor" →
"Datastore") wherever that exact title appears in the document so the ADR name
is correct.
🧹 Nitpick comments (3)
data-models/Makefile (1)
40-67: Consider splittinggenerateinto helper targets to keep lint clean.The body is long (checkmake warns about length), and breaking it into smaller targets improves readability and reuse (e.g., regenerate CRDs without moving files).
🔧 Possible refactor
+.PHONY: generate ensure-controller-gen generate-deepcopy generate-crd move-crd + +# generate: Generate deepcopy, CRD types, and other Kubernetes boilerplate +generate: ensure-controller-gen generate-deepcopy generate-crd move-crd ## Generate CRDs and move them to Helm chart directory + +ensure-controller-gen: + `@which` controller-gen > /dev/null || (echo "Installing controller-gen..." && \ + go install sigs.k8s.io/controller-tools/cmd/controller-gen@$(CONTROLLER_GEN_VERSION)) + +generate-deepcopy: + `@echo` "Generating CRDs for $(API_DIR)..." + go run sigs.k8s.io/controller-tools/cmd/controller-gen@$(CONTROLLER_GEN_VERSION) object paths=./$(API_DIR) + +generate-crd: + go run sigs.k8s.io/controller-tools/cmd/controller-gen@$(CONTROLLER_GEN_VERSION) \ + crd paths=./$(API_DIR) output:crd:dir=./$(API_DIR) + +move-crd: + `@echo` "Moving generated CRDs to $(CRD_OUTPUT_DIR)..." + `@mkdir` -p $(CRD_OUTPUT_DIR) + `@mv` ./$(API_DIR)/*.yaml $(CRD_OUTPUT_DIR)/ || true + `@echo` "CRDs generated and moved to $(CRD_OUTPUT_DIR)" + `@ls` -1 $(CRD_OUTPUT_DIR)/*.yaml || echo "No CRD YAMLs generated"data-models/pkg/conversion/healthevent_conversion_test.go (2)
222-271: Prefertestify/require/assertfor these checks.This will reduce boilerplate and improve diagnostics for failures. If
testifyisn’t already in this module, add it once and then replace thet.Fatalfblocks.🔧 Example conversion
import ( "reflect" "strings" "testing" "time" v1alpha1 "github.com/nvidia/nvsentinel/data-models/api/v1alpha1" "github.com/nvidia/nvsentinel/data-models/pkg/model" "github.com/nvidia/nvsentinel/data-models/pkg/protos" "google.golang.org/protobuf/types/known/timestamppb" + "github.com/stretchr/testify/require" ) @@ - if crdSpec.Version != protoEvent.Version || - crdSpec.Agent != protoEvent.Agent || - ... - crdSpec.ProcessingStrategy != v1alpha1.ProcessingStrategy(protos.ProcessingStrategy_name[int32(protoEvent.ProcessingStrategy)]) { - t.Fatalf("proto -> crd conversion failed") - } + require.Equal(t, protoEvent.Version, crdSpec.Version) + require.Equal(t, protoEvent.Agent, crdSpec.Agent) + ... + require.Equal( + t, + v1alpha1.ProcessingStrategy(protos.ProcessingStrategy_name[int32(protoEvent.ProcessingStrategy)]), + crdSpec.ProcessingStrategy, + )As per coding guidelines: Use
testify/assertandtestify/requirefor assertions in Go tests.
26-34: Rename tests to include scenario + expected behavior.Current names are descriptive but don’t follow the agreed pattern. Example:
TestHealthEventSpecConversion_RoundTrip_PreservesFields.As per coding guidelines: Name Go tests descriptively using format:
TestFunctionName_Scenario_ExpectedBehavior.Also applies to: 189-334
e145c35 to
61ea5c4
Compare
Signed-off-by: Avinash Yeddula <avinash.yeddula@omniva.com>
61ea5c4 to
e765d44
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@docs/designs/027-kubernetes-data-store.md`:
- Line 89: Populate the "## Alternatives Considered" section by listing each
evaluated option (e.g., "Pure MongoDB with no Kubernetes CRs", "Kubernetes
ConfigMaps/Secrets", "Separate etcd cluster", "External time-series DBs",
"In-memory storage with external backup") and add a concise 1–2 sentence
pros/cons note for each explaining scalability, consistency, operational
complexity, and why it was rejected compared to the chosen Kubernetes CR-backed
approach; ensure you reference the ADR title "027-kubernetes-data-store.md" and
the header "## Alternatives Considered" so reviewers can locate the new content
quickly.
- Around line 19-24: The API Definition is missing the API version; update the
section that lists API Group and Kind to include "API Version: v1alpha1" so the
spec explicitly reads API Group: `healthevents.dgxc.nvidia.com`, API Version:
`v1alpha1`, Kind: `HealthEvent` and clarify that `spec` maps to immutable event
data (HealthEvent) and `status` maps to mutable workflow state
(HealthEventStatus); this ensures the API is fully specified wherever the API
Definition block appears.
🧹 Nitpick comments (5)
docs/designs/027-kubernetes-data-store.md (5)
30-31: Consider documenting conversion edge cases and error handling.While the conversion APIs and unit tests are mentioned, the ADR could benefit from a brief discussion of:
- How mismatched or unmappable fields between protobuf, internal models, and CRD structs are handled
- Error handling strategy when conversion fails
- Approach to schema evolution and backward compatibility
This would provide clearer guidance for future maintainers.
58-62: Add guidance on CR cleanup configuration.The CR cleanup section mentions configurable TTL and limits but doesn't provide:
- Recommended default values or ranges
- How to determine appropriate limits based on cluster size or event frequency
- Configuration mechanism (Helm values, CR annotations, etc.)
Adding operational guidance would help users configure these settings appropriately.
69-87: Consider adding scale guidance or performance expectations.The Consequences section mentions "control plane load" and potential impact on large clusters but doesn't provide:
- Approximate scale thresholds (e.g., events per second, total CR count)
- Guidance on when to use CR-backed storage vs. MongoDB
- Expected performance characteristics
Quantitative guidance would help users make informed deployment decisions.
96-99: Add link to the referenced Datastore Abstraction Design Document.The References section mentions "NVSentinel Datastore Abstraction Design Document" (also referenced on line 9) but doesn't provide a link. Adding the link would help readers understand the broader context.
If the document isn't yet available, consider noting that with a placeholder like
[To be added].
48-57: Consider adding a migration strategy section.The ADR mentions that the Kubernetes store is optional and the MongoDB store remains unchanged, but doesn't address:
- Migration path from MongoDB-only to CR-backed storage
- Whether both stores can run simultaneously during transition
- Data synchronization strategy if running both
- Rollback procedure if issues arise
A brief migration strategy would help teams plan adoption.
| ### API Definition | ||
|
|
||
| - API Group: `healthevents.dgxc.nvidia.com` | ||
| - Kind: `HealthEvent` | ||
| - `spec`: contains immutable event data (from HealthEvent) | ||
| - `status`: tracks mutable workflow state (from HealthEventStatus) |
There was a problem hiding this comment.
Add API version to the API Definition.
The API Definition section specifies the API Group and Kind but omits the API version. Based on the AI summary and related code changes, this should be v1alpha1. Explicitly documenting the version is essential for a complete API specification.
📝 Suggested addition
### API Definition
- API Group: `healthevents.dgxc.nvidia.com`
+- Version: `v1alpha1`
- Kind: `HealthEvent`
- `spec`: contains immutable event data (from HealthEvent)
- `status`: tracks mutable workflow state (from HealthEventStatus)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| ### API Definition | |
| - API Group: `healthevents.dgxc.nvidia.com` | |
| - Kind: `HealthEvent` | |
| - `spec`: contains immutable event data (from HealthEvent) | |
| - `status`: tracks mutable workflow state (from HealthEventStatus) | |
| ### API Definition | |
| - API Group: `healthevents.dgxc.nvidia.com` | |
| - Version: `v1alpha1` | |
| - Kind: `HealthEvent` | |
| - `spec`: contains immutable event data (from HealthEvent) | |
| - `status`: tracks mutable workflow state (from HealthEventStatus) |
🤖 Prompt for AI Agents
In `@docs/designs/027-kubernetes-data-store.md` around lines 19 - 24, The API
Definition is missing the API version; update the section that lists API Group
and Kind to include "API Version: v1alpha1" so the spec explicitly reads API
Group: `healthevents.dgxc.nvidia.com`, API Version: `v1alpha1`, Kind:
`HealthEvent` and clarify that `spec` maps to immutable event data (HealthEvent)
and `status` maps to mutable workflow state (HealthEventStatus); this ensures
the API is fully specified wherever the API Definition block appears.
| - Resource cleanup and garbage collection: Automatically remove CRs when events are remediated or when cluster/node limits are reached. | ||
| - Phased approach: Core watch-based workflows are enabled first; advanced features can be added later. | ||
|
|
||
| ## Alternatives Considered |
There was a problem hiding this comment.
Complete the "Alternatives Considered" section.
The "Alternatives Considered" section is empty. ADRs should document what alternatives were evaluated and why they were not chosen. This provides important context for future reviewers and helps justify the selected approach.
Consider documenting alternatives such as:
- Pure MongoDB with no Kubernetes CRs
- Kubernetes ConfigMaps or Secrets
- Separate etcd cluster
- External time-series databases
- In-memory storage with external backup
🤖 Prompt for AI Agents
In `@docs/designs/027-kubernetes-data-store.md` at line 89, Populate the "##
Alternatives Considered" section by listing each evaluated option (e.g., "Pure
MongoDB with no Kubernetes CRs", "Kubernetes ConfigMaps/Secrets", "Separate etcd
cluster", "External time-series DBs", "In-memory storage with external backup")
and add a concise 1–2 sentence pros/cons note for each explaining scalability,
consistency, operational complexity, and why it was rejected compared to the
chosen Kubernetes CR-backed approach; ensure you reference the ADR title
"027-kubernetes-data-store.md" and the header "## Alternatives Considered" so
reviewers can locate the new content quickly.
Started a doc, pls take a look. let us know your thoughts |
Summary
Type of Change
Component(s) Affected
Testing
Checklist
Summary by CodeRabbit
New Features
Tests
Chores
Documentation
✏️ Tip: You can customize this high-level summary in your review settings.