
Hyperfleet-267 - feat: Implement Minimal Real GCP Validation and Related Helm Chart #10

Merged
xueli181114 merged 14 commits into openshift-hyperfleet:main from 86254860:HYPERFLEET-267
Jan 26, 2026

Conversation


86254860 (Collaborator) commented Jan 16, 2026

Summary

  • Implement Minimal Real GCP Validation
    • GCP Connection: Establish connection with customer's GCP project using Workload Identity Federation
    • API Validation: Implement logic to check if critical APIs are enabled (api-enabled validator)
    • Quota check validator (stub implementation)
    • Extensible Framework: Easy to add new validators without modifying core code
    • Support defining dependency via RunAfter in Metadata
    • Parallel Execution: Performance optimization through concurrent validator execution
    • Unit tests using Ginkgo + Gomega
  • Dockerfile and Makefile
  • README.md
  • Add GCP validation adapter configurations and support real validator in helm chart
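The extensible-framework and RunAfter bullets above boil down to a registry plus per-validator metadata; a rough sketch (type and function names here are illustrative, not the actual validator/pkg/validator API):

```go
package main

import "fmt"

// Metadata carries a validator's name and its dependency list; a
// validator declaring RunAfter runs only after those names complete.
type Metadata struct {
	Name     string
	RunAfter []string // validators that must finish before this one runs
}

type Validator interface {
	Metadata() Metadata
	Validate() error
}

var registry = map[string]Validator{}

// Register adds a validator; new validators self-register in init(),
// so no core executor code changes when one is added.
func Register(v Validator) { registry[v.Metadata().Name] = v }

type quotaCheck struct{}

func (quotaCheck) Metadata() Metadata {
	return Metadata{Name: "quota-check", RunAfter: []string{"api-enabled"}}
}
func (quotaCheck) Validate() error { return nil }

func init() { Register(quotaCheck{}) }

func main() {
	fmt.Println(registry["quota-check"].Metadata().RunAfter) // prints [api-enabled]
}
```

Registering in `init()` matches the "validators register (init)" step in the walkthrough diagram below.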

Testing result

  • make test # All unit tests pass
  • Manual testing results (all validations passed); refer to the comment on ticket [HYPERFLEET-267]:
    • Happy Path - All validation passed
    • Cross-Project Permissions Not Granted (WIF Not Configured)
    • Customer Project Required APIs Disabled

Next Steps

HYPERFLEET-510, HYPERFLEET-511, HYPERFLEET-512

Summary by CodeRabbit

  • New Features

    • Full GCP validation toolchain: CLI/containerized validator, dependency-aware parallel execution, aggregation, and new validators (API checks, quota-check stub).
  • Configuration

    • Validation mode now uses validation.useDummy (real validation by default); unified validation settings and updated default status-reporter image.
  • Operations

    • Adapter orchestration: job + status-reporter sidecar, RBAC provisioning, and summarized status reporting to control plane.
  • Documentation

    • Expanded README, integration test guidance, and usage examples.
  • Tests

    • Extensive unit and integration test suites added.


@openshift-ci openshift-ci bot requested review from aredenba-rh and yasun1 January 16, 2026 07:18

coderabbitai bot commented Jan 16, 2026

Walkthrough

Adds a new Helm AdapterConfig and job templates for a HyperFleet GCP validation adapter, consolidating chart mode to validation.useDummy. Introduces an in-repo gcp-validator (Dockerfile, Makefile, CLI), an env-driven config loader, GCP client factory with retries, a validator framework (registry, resolver, executor, aggregation), two validators (api-enabled and quota-check stub), status-reporter sidecar integration, extensive unit and integration tests, Helm template/value updates, and documentation updates.

Sequence Diagram(s)

sequenceDiagram
    participant Env as Env/Config
    participant Registry as Validator Registry
    participant Resolver as Dependency Resolver
    participant Executor as Executor
    participant Validator as Validator Implementations
    participant GCP as GCP Services

    Env->>Registry: validators register (init)
    Env->>Executor: NewContext(LoadFromEnv)
    Executor->>Registry: GetAll()
    Registry-->>Executor: validators[]
    Executor->>Resolver: NewDependencyResolver(validators)
    Resolver-->>Executor: ExecutionGroup[] (levels)
    loop For each ExecutionGroup (by level)
        par Parallel validators
            Executor->>Validator: Validate(ctx)
            Validator->>GCP: Create client (WIF + retry/backoff)
            GCP-->>Validator: API responses
            Validator-->>Executor: Result
        end
        Executor->>Executor: store Result in Context (mutex)
        alt StopOnFirstFailure triggered
            Executor->>Executor: abort remaining groups
        end
    end
    Executor->>Executor: Aggregate results
    Executor->>Env: Write JSON results file
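The executor loop in the diagram above — parallel validators within an ExecutionGroup, groups in level order, remaining groups aborted after a failure — reduces to something like this sketch (runGroups and its callback signature are invented for illustration):

```go
package main

import (
	"fmt"
	"sync"
)

// runGroups runs each group's validators concurrently, waits for the
// group to finish, and stops before later groups if anything failed —
// a toy version of the dependency-aware parallel execution above.
func runGroups(groups [][]string, run func(name string) bool) []string {
	var mu sync.Mutex
	var failed []string
	for _, group := range groups {
		var wg sync.WaitGroup
		for _, name := range group {
			wg.Add(1)
			go func(n string) {
				defer wg.Done()
				if !run(n) {
					mu.Lock() // shared result storage must be serialized
					failed = append(failed, n)
					mu.Unlock()
				}
			}(name)
		}
		wg.Wait() // a level must finish before its dependents start
		if len(failed) > 0 {
			break // StopOnFirstFailure: abort remaining groups
		}
	}
	return failed
}

func main() {
	groups := [][]string{{"api-enabled"}, {"quota-check"}}
	failed := runGroups(groups, func(name string) bool { return name != "quota-check" })
	fmt.Println(failed) // prints [quota-check]
}
```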
sequenceDiagram
    participant HyperFleet as HyperFleet API
    participant Preconditions as Adapter Preconditions
    participant K8s as Kubernetes API
    participant Job as Validation Job (Pod)
    participant Reporter as Status Reporter

    Preconditions->>HyperFleet: GET /clusterStatus
    HyperFleet-->>Preconditions: cluster info (name, phase, project)
    Preconditions->>HyperFleet: GET /clusterAdapterStatus
    HyperFleet-->>Preconditions: adapter status
    Preconditions->>K8s: Create ServiceAccount/Role/RoleBinding
    K8s-->>Preconditions: ACK
    Preconditions->>K8s: Create Job (gcp-validator + status-reporter)
    K8s-->>Job: Pod scheduled / runs
    Job->>GCP: Perform API validations
    Job->>Reporter: Write results to /results
    Reporter->>HyperFleet: POST clusterStatus payload
    HyperFleet-->>Reporter: ACK

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

Suggested labels

lgtm

Suggested reviewers

  • xueli181114
🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

  • Description Check — Passed. Check skipped: CodeRabbit’s high-level summary is enabled.
  • Title Check — Passed. The title clearly summarizes the main change: implementing a real GCP validation adapter with Helm chart updates, directly reflecting the core objective.
  • Docstring Coverage — Passed. Docstring coverage is 82.14%, which meets the required 80.00% threshold.



coderabbitai bot left a comment


Actionable comments posted: 7

🤖 Fix all issues with AI agents
In `@charts/configs/validation-adapter.yaml`:
- Around line 83-85: The environment variable key for the field named
"managedByResourceName" uses the wrong casing in its source string
("env.MANAGED_By_RESOURCE_NAME"); update the source to use consistent uppercase
"env.MANAGED_BY_RESOURCE_NAME" so the mapping matches the intended environment
variable and avoids casing bugs.

In `@charts/configs/validation-job-adapter-task.yaml`:
- Around line 15-17: The template hardcodes activeDeadlineSeconds (310) which
must be computed from the adapter's maxWaitTimeSeconds plus a buffer; update the
Kubernetes job manifest to accept a Helm/templating value (e.g.,
values.adapter.maxWaitTimeSeconds) and set activeDeadlineSeconds to that value
plus the 10s buffer (e.g., {{ add .Values.adapter.maxWaitTimeSeconds 10 }}), add
a corresponding values.yaml entry (adapter.maxWaitTimeSeconds) and a comment
documenting the coupling, and if desired open a follow-up issue to track making
this configurable everywhere; reference activeDeadlineSeconds and
maxWaitTimeSeconds when making these changes.
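A sketch of the coupling this comment describes, using a hypothetical `adapter.maxWaitTimeSeconds` value (the name comes from the review suggestion, not an existing values.yaml entry):

```yaml
# values.yaml (proposed entry)
adapter:
  maxWaitTimeSeconds: 300

# validation-job-adapter-task.yaml — derive the deadline instead of
# hardcoding 310; keeps activeDeadlineSeconds = maxWaitTimeSeconds + 10s buffer
spec:
  activeDeadlineSeconds: {{ add .Values.adapter.maxWaitTimeSeconds 10 }}
```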

In `@validator/go.mod`:
- Line 8: Update the golang.org/x/oauth2 module to v0.27.0 or later in go.mod
(replace the current golang.org/x/oauth2 v0.15.0 entry), then run the
appropriate Go tooling (go get golang.org/x/oauth2@v0.27.0 or newer and go mod
tidy) to refresh go.sum; run tests (go test ./...) and any CI checks to ensure
no breakages and commit the updated go.mod and go.sum. Ensure the change
addresses the CVE affecting jws.Verify by selecting at least v0.27.0 (or newer)
and rebuild/vendor as required.

In `@validator/pkg/validator/executor_test.go`:
- Around line 18-47: The AfterEach currently unconditionally unsets PROJECT_ID
which can remove a pre-existing environment value; modify the test setup so
BeforeEach captures the prior PROJECT_ID (use os.LookupEnv) into a local
variable (e.g., prevProjectID, prevProjectIDSet) before calling os.Setenv, and
update AfterEach to restore the previous state by calling
os.Setenv(prevProjectID) if it existed or os.Unsetenv("PROJECT_ID") only if it
did not exist originally, updating the BeforeEach/AfterEach blocks in
executor_test.go (referencing the BeforeEach and AfterEach functions and the
PROJECT_ID env usage) accordingly.

In `@validator/pkg/validator/executor.go`:
- Around line 131-138: The code assumes validator.Validate(ctx, e.ctx) always
returns a non-nil *Result which can cause a nil pointer panic when setting
result.Duration/Timestamp/ValidatorName; update executor.go to defensively
handle a nil return from validator.Validate by checking if result == nil after
the call (where result is assigned from validator.Validate), and if nil either
instantiate a default Result (with appropriate fields like Failed=true and an
explanatory Message), or log the error via e.logger.Info/Error (using
validator.Metadata().Name for context) and skip populating Duration/Timestamp to
avoid the panic; ensure the Duration and Timestamp are still set on the non-nil
Result before further use.

In `@validator/pkg/validators/api_enabled.go`:
- Around line 96-143: The code does not handle an empty vctx.Config.RequiredAPIs
leading to misleading success; before the loop over requiredAPIs add an early
check (len(requiredAPIs) == 0) and return a short-circuit validator.Result
(e.g., StatusWarning or StatusSkipped) with a clear Message like "no required
APIs configured" and include contextual Details (e.g., "required_apis": 0,
"project_id": vctx.Config.ProjectID); place this check immediately after
requiredAPIs := vctx.Config.RequiredAPIs so it executes prior to creating
contexts or calling svc.Services.Get, and use the same validator.Result struct
shape used elsewhere in this file.
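A sketch of the suggested short-circuit (checkAPIs and the Result shape are illustrative stand-ins for the real validator code):

```go
package main

import "fmt"

// Result approximates the validator.Result shape the review references.
type Result struct {
	Status  string
	Message string
	Details map[string]interface{}
}

// checkAPIs returns a skipped result for an empty required-API list,
// with contextual details, instead of a misleading success.
func checkAPIs(projectID string, requiredAPIs []string) Result {
	if len(requiredAPIs) == 0 {
		return Result{
			Status:  "Skipped",
			Message: "no required APIs configured",
			Details: map[string]interface{}{
				"required_apis": 0,
				"project_id":    projectID,
			},
		}
	}
	// ... per-API checks against the Service Usage API would run here ...
	return Result{Status: "Success"}
}

func main() {
	fmt.Println(checkAPIs("my-project", nil).Message) // prints no required APIs configured
}
```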

In `@validator/pkg/validators/quota_check.go`:
- Around line 31-86: The QuotaCheckValidator.Validate currently returns
StatusSuccess for an unimplemented stub; change it to return
validator.StatusSkipped (or validator.StatusFailure if a config flag like
vctx.Config.EnableQuotaCheck indicates enforcement) and set Reason to
"QuotaCheckSkipped" (or "QuotaCheckDisabled"/"QuotaCheckFailed" when failing),
include the validator identity (e.g., add "validator": "QuotaCheckValidator") in
Result.Details, and ensure any caller/aggregation treats StatusSkipped as
neutral; update the Validate function and the returned Result fields
accordingly.
🧹 Nitpick comments (26)
validator/go.mod (1)

5-46: Update dependencies to latest stable versions.

Several dependencies lag significantly behind their latest releases: google.golang.org/api is 106 versions behind (v0.154.0 → v0.260.0), and golang.org/x/oauth2 is 9 versions behind (v0.15.0 → v0.34.0). Since this is a new module, starting with the latest stable versions ensures you have recent bug fixes and security patches. Run go get -u to update.

validator/Makefile (1)

7-8: Make the image platform configurable.

Hardcoding linux/amd64 can be inconvenient on arm64 dev machines; a variable keeps current behavior while allowing overrides.

♻️ Proposed tweak
 IMAGE_REGISTRY ?= quay.io/rh-ee-dawang
 IMAGE_TAG ?= latest
+PLATFORM ?= linux/amd64
@@
-	$(CONTAINER_CMD) build --platform linux/amd64 --no-cache --build-arg VERSION=$(VERSION) --build-arg GIT_COMMIT=$(GIT_COMMIT) -t $(IMAGE_REGISTRY)/$(PROJECT_NAME):$(IMAGE_TAG) .
+	$(CONTAINER_CMD) build --platform $(PLATFORM) --no-cache --build-arg VERSION=$(VERSION) --build-arg GIT_COMMIT=$(GIT_COMMIT) -t $(IMAGE_REGISTRY)/$(PROJECT_NAME):$(IMAGE_TAG) .
@@
-	$(CONTAINER_CMD) build --platform linux/amd64 --build-arg BASE_IMAGE=alpine:3.21 --build-arg VERSION=$(VERSION) --build-arg GIT_COMMIT=$(GIT_COMMIT) -t quay.io/$(QUAY_USER)/$(PROJECT_NAME):$(DEV_TAG) .
+	$(CONTAINER_CMD) build --platform $(PLATFORM) --build-arg BASE_IMAGE=alpine:3.21 --build-arg VERSION=$(VERSION) --build-arg GIT_COMMIT=$(GIT_COMMIT) -t quay.io/$(QUAY_USER)/$(PROJECT_NAME):$(DEV_TAG) .

Also applies to: 138-179

validator/README.md (1)

27-44: Add a short auth prerequisite note for local/Docker runs.

The validator uses Workload Identity Federation (mentioned in Features) via google.DefaultClient, which requires Application Default Credentials to be available. The Run Locally and Run in Docker examples lack explicit guidance on this prerequisite, which would cause silent auth failures for first-time users.

✍️ Suggested doc addition
 ### Run Locally
 
 ```bash
 export PROJECT_ID=my-gcp-project
 export RESULTS_PATH=/tmp/results.json
+
+# Ensure Application Default Credentials are available
+# For local development: export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json
 
 ./bin/validator
 cat /tmp/results.json
 ```

charts/templates/configmap-app.yaml (1)

14-22: Consider adding a fallback for unknown deployment modes.

The conditional only handles `"dummy"` and `"real"` modes. If `deploymentMode` is set to an unrecognized value, no adapter task template will be included in the ConfigMap, which could cause silent failures at runtime.

Consider adding an `{{- else }}` block with a fail directive or defaulting to one of the modes:


♻️ Suggested improvement

  {{- if eq .Values.deploymentMode "dummy" }}
  # Adapter task template referenced by the adapter config
  validation-dummy-job-adapter-task.yaml: |
{{ .Files.Get "configs/validation-dummy-job-adapter-task.yaml" | nindent 4 }}
  {{- else if eq .Values.deploymentMode "real" }}
  # Adapter task template referenced by the adapter config
  validation-job-adapter-task.yaml: |
{{ .Files.Get "configs/validation-job-adapter-task.yaml" | nindent 4 }}
+ {{- else }}
+ {{- fail (printf "Invalid deploymentMode '%s'. Must be 'dummy' or 'real'." .Values.deploymentMode) }}
  {{- end }}
charts/templates/deployment.yaml (1)

83-107: LGTM with a note on consistency.

The real deployment mode environment variables are properly defined and align with the validator's expected configuration. The structure mirrors the dummy block appropriately.

As noted in configmap-app.yaml, consider adding a fallback {{- else }} block with a fail directive for unknown deploymentMode values to provide clear error messaging during template rendering.

validator/pkg/config/config_test.go (1)

31-40: Subtle issue in environment restoration logic.

The check if v != "" cannot distinguish between an environment variable that was unset versus one that was explicitly set to an empty string. Both cases result in os.Getenv() returning "", but they have different semantics.

In practice, this is unlikely to cause issues for these specific config variables, but for correctness:

♻️ More precise restoration using LookupEnv
 	BeforeEach(func() {
 		// Save original environment
-		originalEnv = make(map[string]string)
+		originalEnv = make(map[string]*string)
 		envVars := []string{
 			"RESULTS_PATH", "PROJECT_ID", "GCP_REGION",
 			"DISABLED_VALIDATORS", "STOP_ON_FIRST_FAILURE",
 			"REQUIRED_APIS", "LOG_LEVEL",
 			"REQUIRED_VCPUS", "REQUIRED_DISK_GB", "REQUIRED_IP_ADDRESSES",
 			"VPC_NAME", "SUBNET_NAME",
 		}
 		for _, v := range envVars {
-			originalEnv[v] = os.Getenv(v)
+			if val, exists := os.LookupEnv(v); exists {
+				originalEnv[v] = &val
+			}
 			Expect(os.Unsetenv(v)).To(Succeed())
 		}
 	})

 	AfterEach(func() {
 		// Restore original environment
 		for k, v := range originalEnv {
-			if v != "" {
-				Expect(os.Setenv(k, v)).To(Succeed())
-			} else {
-				Expect(os.Unsetenv(k)).To(Succeed())
+			if v != nil {
+				Expect(os.Setenv(k, *v)).To(Succeed())
 			}
 		}
 	})
validator/pkg/gcp/client_test.go (2)

90-108: Potential test flakiness due to timing assumptions.

The assertion on Line 106 expects exactly 2 calls, but the actual count depends on when the context cancellation is observed by the retry loop. If the retry logic checks the context after calling the operation but before incrementing, or if there's any scheduling delay, the count could vary.

Consider using BeNumerically(">=", 2) or BeNumerically("<=", 3) to allow for timing variance, or restructure to make the test deterministic.


110-123: Consider increasing timeout for CI stability.

The 50ms timeout may be too aggressive for CI environments under load. While the test logic is sound, a slightly longer timeout (e.g., 100-200ms) would reduce flakiness risk without significantly impacting test execution time.

validator/pkg/config/config.go (2)

94-112: Silent fallback on parse errors may hide configuration mistakes.

The getEnvBool and getEnvInt helpers silently return the default value when parsing fails. If an operator sets STOP_ON_FIRST_FAILURE=yes (invalid bool) or REQUIRED_VCPUS=abc (invalid int), the config will silently use defaults instead of alerting to the misconfiguration.

Consider logging a warning when parse errors occur, so operators are aware their configuration isn't being applied:

💡 Suggested improvement
+import "log/slog"
+
 func getEnvBool(key string, defaultValue bool) bool {
 	if value := os.Getenv(key); value != "" {
 		b, err := strconv.ParseBool(value)
 		if err == nil {
 			return b
 		}
+		slog.Warn("invalid boolean value for env var, using default",
+			"key", key, "value", value, "default", defaultValue)
 	}
 	return defaultValue
 }

 func getEnvInt(key string, defaultValue int) int {
 	if value := os.Getenv(key); value != "" {
 		i, err := strconv.Atoi(value)
 		if err == nil {
 			return i
 		}
+		slog.Warn("invalid integer value for env var, using default",
+			"key", key, "value", value, "default", defaultValue)
 	}
 	return defaultValue
 }

54-61: Consider filtering empty strings from disabled validators.

If DISABLED_VALIDATORS is set to something like "api-enabled,,quota-check" (with an empty element), the resulting slice will contain an empty string. This could cause IsValidatorEnabled("") to return false for any validator name that somehow matches empty string (unlikely but defensive).

💡 Defensive improvement
 	if disabled := os.Getenv("DISABLED_VALIDATORS"); disabled != "" {
 		cfg.DisabledValidators = strings.Split(disabled, ",")
 		// Trim whitespace
-		for i, v := range cfg.DisabledValidators {
-			cfg.DisabledValidators[i] = strings.TrimSpace(v)
+		var trimmed []string
+		for _, v := range cfg.DisabledValidators {
+			v = strings.TrimSpace(v)
+			if v != "" {
+				trimmed = append(trimmed, v)
+			}
 		}
+		cfg.DisabledValidators = trimmed
 	}
validator/Dockerfile (2)

18-19: Running go mod tidy in the build may mask dependency issues.

If go.mod and go.sum are properly maintained in the repository, go mod tidy should be unnecessary. Running it during the build can mask cases where the committed module files are out of sync with the source code. Consider removing go mod tidy or adding it as a CI check instead.

💡 Suggested change
-# Tidy and verify Go module dependencies
-RUN go mod tidy && go mod verify
+# Verify Go module dependencies
+RUN go mod verify

15-19: Consider optimizing Docker layer caching for dependencies.

Copying the entire source before downloading dependencies means every source change invalidates the dependency cache. For faster rebuilds, copy module files first:

💡 Optimized layer caching
 WORKDIR /build

-# Copy source code
-COPY . .
+# Copy and download dependencies first (cached layer)
+COPY go.mod go.sum ./
+RUN go mod download && go mod verify

-# Tidy and verify Go module dependencies
-RUN go mod tidy && go mod verify
+# Copy source code
+COPY . .

 # Build binary using make to include version, commit, and build date
validator/pkg/validator/resolver_test.go (1)

241-260: Consider whether silent handling of missing dependencies is the desired behavior.

The test documents that missing dependencies are silently ignored, placing the validator at level 0. While this provides graceful degradation, it could mask configuration errors (e.g., typos in RunAfter names). Consider adding a warning log in the resolver when a dependency reference cannot be resolved.
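The level assignment with the suggested warning can be sketched like this (the levels function is an illustration of the resolver's behavior, not its real code; no cycle detection here):

```go
package main

import "fmt"

// levels assigns each validator 1 + the max level of its known
// RunAfter dependencies; unknown dependency names emit a warning
// instead of being silently dropped, so RunAfter typos surface.
func levels(deps map[string][]string) map[string]int {
	out := map[string]int{}
	var level func(name string) int
	level = func(name string) int {
		if l, ok := out[name]; ok {
			return l
		}
		max := 0
		for _, d := range deps[name] {
			if _, known := deps[d]; !known {
				fmt.Printf("warning: %s depends on unknown validator %q\n", name, d)
				continue
			}
			if l := level(d) + 1; l > max {
				max = l
			}
		}
		out[name] = max
		return max
	}
	for name := range deps {
		level(name)
	}
	return out
}

func main() {
	l := levels(map[string][]string{
		"api-enabled": nil,
		"quota-check": {"api-enabled", "missing-dep"}, // typo surfaces as a warning
	})
	fmt.Println(l["api-enabled"], l["quota-check"]) // prints 0 1
}
```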

charts/configs/validation-job-adapter-task.yaml (1)

74-75: JOB_NAME duplicates the job name pattern - consider using a field reference or shared template.

The job name pattern gcp-validator-{{ .clusterId | lower }}-{{ .generationId }} is duplicated on lines 4 and 75. If one is updated without the other, they will drift. Consider defining a helper template or extracting to a variable.

Note: Kubernetes doesn't provide a direct fieldRef for the job name from within a pod, so templating is necessary. You could define the name pattern once in a YAML anchor or template variable:

# At the top of the file or in a helper
{{- $jobName := printf "gcp-validator-%s-%s" (.clusterId | lower) .generationId -}}
validator/pkg/validator/registry.go (2)

23-31: Inconsistent duplicate handling between instance and global Register.

Registry.Register (line 24-31) silently overwrites duplicates, while the global Register (line 57-66) panics. This inconsistency could surprise developers. Consider either:

  1. Making both panic on duplicates
  2. Making the instance method return an error for explicit handling
  3. Documenting the difference clearly
💡 Option: Make instance Register consistent with global
 // Register adds a validator to the registry
-func (r *Registry) Register(v Validator) {
+func (r *Registry) Register(v Validator) error {
 	r.mu.Lock()
 	defer r.mu.Unlock()

 	meta := v.Metadata()
-	// Allow overwriting for testing purposes
+	if _, exists := r.validators[meta.Name]; exists {
+		return fmt.Errorf("validator already registered: %s", meta.Name)
+	}
 	r.validators[meta.Name] = v
+	return nil
 }

Also applies to: 57-66


78-83: Consider adding a build tag or panic guard for production safety.

ClearRegistry is marked for testing, but nothing prevents accidental use in production code. Consider either:

  • Moving it to a _test.go file
  • Adding a build tag constraint
  • Or accepting the risk if validator hot-reloading might be needed
validator/pkg/validator/executor_test.go (1)

123-154: Relax the timing assertion to avoid CI flakiness.

The 100ms threshold can be tight on busy runners. Consider a slightly wider bound or a less time-sensitive parallelism check.

⏱️ Proposed adjustment
- Expect(duration).To(BeNumerically("<", 100*time.Millisecond))
+ Expect(duration).To(BeNumerically("<", 200*time.Millisecond))
validator/cmd/validator/main.go (1)

104-107: Avoid logging the full results payload at info level.

The JSON can be large and may include sensitive details from validators. Prefer logging the path at info and emitting the payload only at debug.

🔧 Proposed change
- logger.Info("Results written successfully",
- 	"path", outputFile,
- 	"content", string(data))
+ logger.Info("Results written successfully", "path", outputFile)
+ logger.Debug("Results content", "content", string(data))
validator/pkg/validator/validator.go (1)

59-107: Handle StatusSkipped explicitly in aggregation.

StatusSkipped is defined but not accounted for, which can yield confusing failures (no failed checks yet overall failure). Treat skipped as neutral and report it in details.

🔧 Proposed aggregation tweak
 	checksRun := len(results)
 	checksPassed := 0
+	checksSkipped := 0
 	var failedChecks []string
 	var failureDescriptions []string
 
 	// Single pass to collect all failure information
 	for _, r := range results {
 		switch r.Status {
 		case StatusSuccess:
 			checksPassed++
 		case StatusFailure:
 			failedChecks = append(failedChecks, r.ValidatorName)
 			failureDescriptions = append(failureDescriptions, fmt.Sprintf("%s (%s)", r.ValidatorName, r.Reason))
+		case StatusSkipped:
+			checksSkipped++
 		}
 	}
 
 	details := map[string]interface{}{
 		"checks_run":    checksRun,
 		"checks_passed": checksPassed,
+		"checks_skipped": checksSkipped,
 		"timestamp":     time.Now().UTC().Format(time.RFC3339),
 		"validators":    results,
 	}
 
-	if checksPassed == checksRun {
+	if len(failedChecks) == 0 {
+		msg := "All GCP validation checks passed successfully"
+		if checksSkipped > 0 {
+			msg = fmt.Sprintf("All GCP validation checks passed successfully (%d skipped)", checksSkipped)
+		}
 		return &AggregatedResult{
 			Status:  StatusSuccess,
 			Reason:  "ValidationPassed",
-			Message: "All GCP validation checks passed successfully",
+			Message: msg,
 			Details: details,
 		}
 	}
validator/pkg/validator/executor.go (2)

93-128: Results slice indexing is safe but mutex scope can be tightened.

The pre-allocated results slice with indexed writes (line 145) is thread-safe because each goroutine writes to a unique index. However, the panic recovery block at lines 124-127 acquires the mutex but doesn't need to protect results[index] since index isolation already handles that. Consider removing the redundant lock around results[index]:

♻️ Suggested refinement
 				// Thread-safe result storage
 				e.mu.Lock()
 				e.ctx.Results[meta.Name] = panicResult
-				results[index] = panicResult
 				e.mu.Unlock()
+				results[index] = panicResult

140-145: Same observation for normal result storage.

Similar to the panic path, results[index] doesn't need mutex protection due to unique index per goroutine:

♻️ Suggested refinement
 			// Thread-safe result storage
 			e.mu.Lock()
 			e.ctx.Results[meta.Name] = result
 			e.mu.Unlock()
 
 			results[index] = result

This is already correctly structured - just confirming the mutex is only for e.ctx.Results map access.

validator/pkg/validators/api_enabled.go (1)

64-94: Consider using the passed-in logger instead of slog.Default().

Lines 65, 72, 76 use slog.Default() and slog.Info/Error/Debug directly. For better testability and consistency, consider accepting a logger from the context or using the one configured at initialization.

charts/configs/validation-adapter.yaml (1)

248-302: Complex CEL expressions may benefit from comments or documentation.

The post-processing expressions (lines 258-298) use nested ternary logic with optional chaining (.? and .orValue()). While functional, these are hard to read and maintain. Consider adding inline comments or extracting to named helper expressions if the adapter framework supports it.

validator/pkg/gcp/client.go (3)

39-86: Comment mentions jitter but implementation lacks it.

Line 46 comment says "exponential backoff with jitter" but the implementation only doubles the backoff without adding randomness. Jitter helps prevent thundering herd problems when multiple clients retry simultaneously.

♻️ Proposed fix to add jitter
+import (
+	"math/rand"
+	// ... other imports
+)
+
 func retryWithBackoff(ctx context.Context, operation func() error) error {
 	var lastErr error
 	backoff := initialBackoff
 
 	for attempt := 0; attempt < maxRetries; attempt++ {
 		if attempt > 0 {
-			// Calculate exponential backoff with jitter
+			// Calculate exponential backoff
 			if backoff < maxBackoff {
 				backoff = backoff * 2
 				if backoff > maxBackoff {
 					backoff = maxBackoff
 				}
 			}
+			// Add jitter: 75-125% of backoff
+			jitter := time.Duration(float64(backoff) * (0.75 + rand.Float64()*0.5))
 			slog.Debug("Retrying GCP API call", "attempt", attempt, "backoff", backoff)
 
 			select {
-			case <-time.After(backoff):
+			case <-time.After(jitter):
 			case <-ctx.Done():

67-77: Consider adding 502 Bad Gateway to retryable errors.

The retryable status codes include 429, 500, and 503, but 502 (Bad Gateway) is also commonly transient in cloud environments. Additionally, using errors.As instead of type assertion is more robust for wrapped errors.

♻️ Proposed improvement
+const (
+	// ... existing constants
+	statusBadGateway     = 502
+)
+
 		// Check if error is retryable
-		if apiErr, ok := lastErr.(*googleapi.Error); ok {
+		var apiErr *googleapi.Error
+		if errors.As(lastErr, &apiErr) {
 			// Retry on rate limit, service unavailable, and internal errors
 			if apiErr.Code == statusRateLimited ||
+			   apiErr.Code == statusBadGateway ||
 			   apiErr.Code == statusServiceUnavail ||
 			   apiErr.Code == statusInternalError {
 				continue
 			}

88-100: projectID field is unused in ClientFactory.

The ClientFactory stores projectID but none of the Create*Service methods use it. If it's intended for future use, consider adding a comment; otherwise, remove it to avoid confusion.

♻️ Options

Either document the intent:

 type ClientFactory struct {
-	projectID string
+	projectID string // Reserved for future per-project client configuration
 	logger    *slog.Logger
 }

Or remove if unused:

 type ClientFactory struct {
-	projectID string
 	logger    *slog.Logger
 }
 
-func NewClientFactory(projectID string, logger *slog.Logger) *ClientFactory {
+func NewClientFactory(logger *slog.Logger) *ClientFactory {
 	return &ClientFactory{
-		projectID: projectID,
 		logger:    logger,
 	}
 }
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2c18b72 and 929ad77.

⛔ Files ignored due to path filters (1)
  • validator/go.sum is excluded by !**/*.sum
📒 Files selected for processing (31)
  • charts/configs/validation-adapter.yaml
  • charts/configs/validation-job-adapter-task.yaml
  • charts/templates/_helpers.tpl
  • charts/templates/configmap-app.yaml
  • charts/templates/deployment.yaml
  • charts/values.yaml
  • validator/Dockerfile
  • validator/Makefile
  • validator/README.md
  • validator/cmd/validator/main.go
  • validator/go.mod
  • validator/pkg/config/config.go
  • validator/pkg/config/config_suite_test.go
  • validator/pkg/config/config_test.go
  • validator/pkg/gcp/client.go
  • validator/pkg/gcp/client_test.go
  • validator/pkg/gcp/gcp_suite_test.go
  • validator/pkg/validator/context.go
  • validator/pkg/validator/executor.go
  • validator/pkg/validator/executor_test.go
  • validator/pkg/validator/registry.go
  • validator/pkg/validator/registry_test.go
  • validator/pkg/validator/resolver.go
  • validator/pkg/validator/resolver_test.go
  • validator/pkg/validator/validator.go
  • validator/pkg/validator/validator_suite_test.go
  • validator/pkg/validators/api_enabled.go
  • validator/pkg/validators/api_enabled_test.go
  • validator/pkg/validators/quota_check.go
  • validator/pkg/validators/quota_check_test.go
  • validator/pkg/validators/validators_suite_test.go
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: 86254860
Repo: openshift-hyperfleet/adapter-validation-gcp PR: 4
File: charts/configs/fake-validation-gcp-job.yaml:25-25
Timestamp: 2025-12-18T11:32:59.422Z
Learning: In the adapter-validation-gcp repository, the gcp-validator service account referenced in fake-validation-gcp-job.yaml is created by the adapter runtime based on the gcpValidationServiceAccount resource definition in adapter-fake-validation-gcp.yaml, not by Helm templates.
📚 Learning: 2025-12-18T11:32:59.422Z
Learnt from: 86254860
Repo: openshift-hyperfleet/adapter-validation-gcp PR: 4
File: charts/configs/fake-validation-gcp-job.yaml:25-25
Timestamp: 2025-12-18T11:32:59.422Z
Learning: In the adapter-validation-gcp repository, the gcp-validator service account referenced in fake-validation-gcp-job.yaml is created by the adapter runtime based on the gcpValidationServiceAccount resource definition in adapter-fake-validation-gcp.yaml, not by Helm templates.

Applied to files:

  • charts/templates/configmap-app.yaml
  • charts/configs/validation-job-adapter-task.yaml
  • charts/templates/_helpers.tpl
  • charts/values.yaml
  • validator/README.md
  • charts/configs/validation-adapter.yaml
📚 Learning: 2025-12-18T11:21:09.498Z
Learnt from: 86254860
Repo: openshift-hyperfleet/adapter-validation-gcp PR: 4
File: charts/configs/broker-pubsub.yaml:13-13
Timestamp: 2025-12-18T11:21:09.498Z
Learning: In the charts/configs/broker-pubsub.yaml file for the adapter-validation-gcp project, the broker configuration settings like ack_deadline_seconds should remain as fixed values to align with other adapter settings at this phase, rather than being made configurable.

Applied to files:

  • charts/templates/_helpers.tpl
  • charts/configs/validation-adapter.yaml
🧬 Code graph analysis (15)
validator/pkg/gcp/client_test.go (1)
validator/pkg/gcp/client.go (3)
  • GetDefaultClientForTesting (220-222)
  • RetryWithBackoffForTesting (225-227)
  • NewClientFactory (95-100)
validator/pkg/config/config_test.go (1)
validator/pkg/config/config.go (2)
  • LoadFromEnv (40-85)
  • Config (11-37)
validator/pkg/validators/quota_check_test.go (4)
validator/pkg/validators/quota_check.go (1)
  • QuotaCheckValidator (12-12)
validator/pkg/validator/context.go (1)
  • Context (14-30)
validator/pkg/config/config.go (2)
  • LoadFromEnv (40-85)
  • Config (11-37)
validator/pkg/validator/validator.go (3)
  • Result (41-49)
  • Status (32-32)
  • StatusSuccess (35-35)
validator/pkg/validators/api_enabled_test.go (3)
validator/pkg/validators/api_enabled.go (1)
  • APIEnabledValidator (45-45)
validator/pkg/validator/context.go (1)
  • Context (14-30)
validator/pkg/config/config.go (2)
  • LoadFromEnv (40-85)
  • Config (11-37)
validator/pkg/validator/context.go (2)
validator/pkg/config/config.go (1)
  • Config (11-37)
validator/pkg/validator/validator.go (1)
  • Result (41-49)
validator/pkg/validator/resolver_test.go (4)
validator/pkg/validator/resolver.go (2)
  • DependencyResolver (15-17)
  • NewDependencyResolver (20-27)
validator/pkg/validator/validator.go (1)
  • Validator (20-29)
validator/pkg/validator/context.go (1)
  • Context (14-30)
validator/pkg/validator/registry_test.go (1)
  • MockValidator (13-20)
validator/pkg/validators/quota_check.go (4)
validator/pkg/validator/registry.go (1)
  • Register (57-66)
validator/pkg/validator/validator.go (4)
  • ValidatorMetadata (12-17)
  • Result (41-49)
  • Status (32-32)
  • StatusSuccess (35-35)
validator/pkg/validator/context.go (1)
  • Context (14-30)
validator/pkg/config/config.go (1)
  • Config (11-37)
validator/pkg/validator/registry_test.go (3)
validator/pkg/validator/context.go (1)
  • Context (14-30)
validator/pkg/validator/validator.go (4)
  • Result (41-49)
  • ValidatorMetadata (12-17)
  • Status (32-32)
  • StatusSuccess (35-35)
validator/pkg/validator/registry.go (5)
  • Registry (11-14)
  • NewRegistry (17-21)
  • Register (57-66)
  • GetAll (69-71)
  • Get (74-76)
validator/pkg/validator/validator.go (1)
validator/pkg/validator/context.go (1)
  • Context (14-30)
validator/pkg/validator/executor.go (5)
validator/pkg/validator/context.go (1)
  • Context (14-30)
validator/pkg/validator/validator.go (5)
  • Result (41-49)
  • Validator (20-29)
  • Status (32-32)
  • StatusFailure (36-36)
  • StatusSkipped (37-37)
validator/pkg/validator/registry.go (1)
  • GetAll (69-71)
validator/pkg/validator/resolver.go (2)
  • NewDependencyResolver (20-27)
  • ExecutionGroup (9-12)
validator/pkg/config/config.go (1)
  • Config (11-37)
validator/pkg/validator/executor_test.go (6)
validator/pkg/validator/context.go (1)
  • Context (14-30)
validator/pkg/validator/executor.go (2)
  • Executor (13-17)
  • NewExecutor (20-25)
validator/pkg/validator/registry.go (2)
  • ClearRegistry (79-83)
  • Register (57-66)
validator/pkg/config/config.go (2)
  • LoadFromEnv (40-85)
  • Config (11-37)
validator/pkg/validator/validator.go (4)
  • Result (41-49)
  • Status (32-32)
  • StatusSuccess (35-35)
  • StatusFailure (36-36)
validator/pkg/validator/registry_test.go (1)
  • MockValidator (13-20)
validator/pkg/validator/resolver.go (1)
validator/pkg/validator/validator.go (1)
  • Validator (20-29)
validator/pkg/validator/registry.go (1)
validator/pkg/validator/validator.go (1)
  • Validator (20-29)
validator/pkg/validators/api_enabled.go (5)
validator/pkg/validator/registry.go (2)
  • Register (57-66)
  • Get (74-76)
validator/pkg/validator/validator.go (5)
  • ValidatorMetadata (12-17)
  • Result (41-49)
  • Status (32-32)
  • StatusFailure (36-36)
  • StatusSuccess (35-35)
validator/pkg/validator/context.go (1)
  • Context (14-30)
validator/pkg/config/config.go (1)
  • Config (11-37)
validator/pkg/gcp/client.go (1)
  • NewClientFactory (95-100)
validator/pkg/gcp/client.go (1)
validator/pkg/validator/context.go (1)
  • Context (14-30)
🪛 checkmake (0.2.2)
validator/Makefile

[warning] 90-90: Target body for "lint" exceeds allowed length of 5 (8).

(maxbodylength)


[warning] 101-101: Target body for "fmt" exceeds allowed length of 5 (6).

(maxbodylength)


[warning] 185-185: Missing required phony target "all"

(minphony)

🪛 OSV Scanner (2.3.1)
validator/go.mod

[HIGH] 1-1: golang.org/x/oauth2 0.15.0: Unexpected memory consumption during token parsing in golang.org/x/oauth2

(GO-2025-3488)


[HIGH] 1-1: golang.org/x/oauth2 0.15.0: golang.org/x/oauth2 Improper Validation of Syntactic Correctness of Input vulnerability

(GHSA-6v2p-p543-phr9)

🔇 Additional comments (50)
validator/go.mod (1)

3-3: No action needed. Go 1.25.0 is a released, stable version (released August 12, 2025) and is appropriate for this project.

validator/pkg/gcp/gcp_suite_test.go (1)

10-12: LGTM: test suite bootstrap is standard.

No issues spotted in the suite entrypoint.

validator/pkg/validators/validators_suite_test.go (1)

10-12: LGTM: test suite bootstrap is standard.

No concerns here.

charts/templates/_helpers.tpl (1)

81-85: Adapter config filename change verified and is correct.

The file validation-adapter.yaml exists in charts/configs/ and is properly referenced in the template helpers. No issues.

validator/pkg/config/config_suite_test.go (1)

1-13: LGTM!

Standard Ginkgo v2 test bootstrap file. The use of an external test package (config_test) is a good practice for testing the public API.

validator/pkg/validator/validator_suite_test.go (1)

1-13: LGTM!

Standard Ginkgo v2 test bootstrap file, consistent with the other test suite bootstraps in this PR.

validator/pkg/config/config_test.go (1)

42-245: Comprehensive test coverage.

The test suite provides excellent coverage of the config loading functionality:

  • Default values and required field validation
  • Custom configuration overrides
  • List parsing with whitespace handling
  • Type conversion error handling (integers, booleans)
  • Validator enablement logic

The tests are well-organized using nested Context blocks for clear separation of concerns.

validator/pkg/validators/quota_check_test.go (1)

1-103: LGTM! Well-structured test suite for the stub validator.

The test coverage is comprehensive for a stub implementation, covering metadata validation, enablement logic, and the expected stub behavior. Environment variable cleanup is properly handled in the nested AfterEach blocks.

validator/pkg/gcp/client_test.go (1)

69-88: Good use of table-driven tests for retry behavior.

The DescribeTable pattern effectively covers the different HTTP error codes and their expected retry behavior. This makes the test intent clear and easy to extend.
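The retry-classification logic those table entries exercise can be sketched as follows. This is a hypothetical stand-in (`isRetryable` is illustrative; the real helper is exposed via `RetryWithBackoffForTesting` in `pkg/gcp/client.go`), shown only to illustrate the table-driven shape of the tests:

```go
package main

import "fmt"

// isRetryable is a hypothetical classifier mirroring the kind of logic the
// DescribeTable entries exercise: 429 and 5xx responses are retried, other
// 4xx responses are not.
func isRetryable(statusCode int) bool {
	if statusCode == 429 { // rate limiting: worth retrying
		return true
	}
	return statusCode >= 500 && statusCode < 600 // transient server errors
}

func main() {
	// Table of (status code, expected retry decision), in the spirit of
	// Ginkgo's DescribeTable entries.
	cases := []struct {
		code  int
		retry bool
	}{
		{429, true},
		{500, true},
		{503, true},
		{403, false},
		{404, false},
	}
	for _, c := range cases {
		fmt.Printf("%d retryable=%v (want %v)\n", c.code, isRetryable(c.code), c.retry)
	}
}
```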

validator/pkg/validators/api_enabled_test.go (3)

34-37: Good defensive cleanup of environment variables.

The AfterEach block properly cleans up both PROJECT_ID and REQUIRED_APIS, ensuring no environment leakage between test runs regardless of which inner context was executed.


83-123: Good configuration testing coverage.

The tests effectively verify the config loading behavior including default APIs, custom API overrides, and whitespace trimming. This ensures the LoadFromEnv function handles edge cases properly.


144-148: Appropriate deferral of integration tests.

The comment correctly identifies that testing Validate() requires either real GCP credentials or mocked clients. This is a reasonable boundary for unit tests.

charts/values.yaml (2)

6-9: Verify production readiness for default "real" mode.

The default deployment mode changed from "dummy" to "real". Ensure that:

  1. The gcp-validator image at the specified registry is available and tested
  2. Downstream deployments are aware of this default change
  3. CI/CD pipelines are updated if they relied on "dummy" mode

90-103: Well-structured real validation configuration.

The configuration block is well-documented with sensible defaults. The requiredApis default matches the defaults in config.go, maintaining consistency between Helm values and Go code.

validator/pkg/config/config.go (1)

39-85: Clean configuration loading implementation.

The LoadFromEnv function is well-structured with clear separation between field initialization, list parsing, and validation. The default API list is sensible and matches the Helm chart defaults.

charts/configs/validation-job-adapter-task.yaml (1)

23-29: LGTM - Pod spec is well-structured.

The pod configuration correctly uses:

  • Service account from the adapter runtime (per learnings, created externally)
  • restartPolicy: Never with backoffLimit: 0 for single-attempt semantics
  • Shared emptyDir volume for inter-container result communication
validator/pkg/validator/registry.go (1)

33-43: LGTM - GetAll correctly returns an unordered snapshot.

The method properly acquires a read lock and returns a fresh slice, avoiding external mutation of the internal map. The non-deterministic iteration order is acceptable since consumers (like DependencyResolver) handle ordering as needed.

validator/pkg/validator/context.go (1)

14-30: Remove suggestion for Results map initialization constructor.

The Results map is properly initialized in all code paths where Context is instantiated (cmd/validator/main.go, executor_test.go, api_enabled_test.go, quota_check_test.go). Every instantiation explicitly calls make(map[string]*validator.Result), preventing any nil map panic. The current pattern is correct and requires no changes.
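The instantiation pattern being described can be sketched with simplified stand-ins for the package types (field names here are illustrative, not the repository's exact definitions):

```go
package main

import "fmt"

// Result and Context are simplified stand-ins for the validator package types.
type Result struct {
	Status string
}

type Context struct {
	Results map[string]*Result
}

func main() {
	// Every instantiation site initializes the map explicitly, as noted in
	// the review; writing to a nil map would panic at runtime.
	ctx := &Context{Results: make(map[string]*Result)}
	ctx.Results["api-enabled"] = &Result{Status: "Success"}
	fmt.Println(len(ctx.Results)) // 1
}
```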

validator/pkg/validator/resolver_test.go (1)

1-8: LGTM - comprehensive test coverage for dependency resolution.

The test suite covers essential scenarios including no dependencies, linear chains, parallel execution grouping, complex DAGs, circular dependency detection, self-references, missing dependencies, and empty input. The BDD structure with Ginkgo is clear and well-organized. MockValidator is properly defined in registry_test.go and accessible within the shared validator_test package.

validator/Dockerfile (1)

4-4: Likely an incorrect or invalid review comment.

validator/pkg/validator/executor_test.go (6)

49-58: Good coverage for the empty-registry error path.


60-102: Solid single-validator coverage (execution + context + timing metadata).


104-120: Disabled validator behavior is clearly validated.


164-217: Dependency ordering test is crisp and valuable.


219-256: Stop-on-first-failure behavior is well exercised.


258-285: Failure propagation checks look good.

validator/pkg/validator/registry_test.go (5)

12-45: MockValidator provides flexible test hooks.


72-89: Register coverage for single and multiple validators looks solid.


107-122: GetAll coverage is straightforward and sufficient.


125-141: Get-by-name tests are clear and complete.


91-103: The test correctly reflects the instance method's intended behavior. The Registry.Register() method explicitly allows overwriting validators with duplicate names, as documented in the code comment: "Allow overwriting for testing purposes". This differs intentionally from the global Register() function, which panics on duplicates to protect against accidental re-registration in initialization code. The test is correct and needs no changes.

validator/cmd/validator/main.go (1)

122-135: Log-level parsing is clean and idiomatic.

validator/pkg/validators/quota_check.go (1)

14-29: Metadata/enablement wiring looks good.

validator/pkg/validator/validator.go (1)

10-57: Core validator types and result models are clean and well-scoped.

validator/pkg/validator/executor.go (2)

12-25: LGTM - Well-structured executor with thread-safe design.

The Executor struct correctly uses a mutex to protect shared state during parallel execution, and the constructor is straightforward.


27-86: LGTM - Execution orchestration is well-designed.

The ExecuteAll method correctly:

  • Filters enabled validators before execution
  • Builds a dependency-resolved execution plan
  • Executes groups sequentially while allowing parallelism within groups
  • Handles stop-on-first-failure semantics

The logging at each stage provides good observability.
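The orchestration described above can be sketched as below. This is a minimal illustration under assumed names (`runGroups` and the `func() bool` validator shape are hypothetical simplifications of the real `ExecuteAll`): groups run sequentially, validators within a group run concurrently, and execution stops after the first failing group when requested.

```go
package main

import (
	"fmt"
	"sync"
)

// runGroups executes groups sequentially; members of a group run in parallel.
func runGroups(groups [][]func() bool, stopOnFailure bool) []bool {
	var results []bool
	for _, group := range groups {
		groupResults := make([]bool, len(group))
		var wg sync.WaitGroup
		for i, v := range group {
			wg.Add(1)
			go func(i int, v func() bool) {
				defer wg.Done()
				groupResults[i] = v() // distinct index per goroutine: no data race
			}(i, v)
		}
		wg.Wait()
		results = append(results, groupResults...)
		if stopOnFailure {
			for _, ok := range groupResults {
				if !ok {
					return results // skip remaining groups
				}
			}
		}
	}
	return results
}

func main() {
	pass := func() bool { return true }
	fail := func() bool { return false }
	fmt.Println(runGroups([][]func() bool{{pass, fail}, {pass}}, true))
}
```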

validator/pkg/validators/api_enabled.go (4)

15-20: LGTM - Timeout constants are well-defined.

The 2-minute overall timeout and 30-second per-request timeout provide reasonable defaults for GCP API calls.


22-42: LGTM - Error extraction logic is robust.

The extractErrorReason function correctly uses errors.As for proper error unwrapping and provides meaningful fallback reasons.


101-143: Good use of per-request context timeout and immediate cancel.

The pattern of creating a per-request timeout context and calling reqCancel() immediately after the API call (line 109) is correct and prevents context leaks.


145-171: LGTM - Result construction is comprehensive.

Success and failure results include all relevant details (enabled/disabled APIs, project ID, actionable hints), which aids debugging and user experience.

validator/pkg/validator/resolver.go (4)

8-27: LGTM - Clean data structures and constructor.

ExecutionGroup and DependencyResolver are well-defined. The constructor properly indexes validators by name.


29-66: LGTM - Execution group resolution is correct.

The method properly:

  1. Detects cycles first
  2. Assigns levels via topological ordering
  3. Groups validators by level with deterministic alphabetical sorting

106-148: Cycle detection algorithm is correct for standard cycles.

The DFS-based cycle detection using visited and recStack (recursion stack) correctly identifies back-edges in the dependency graph. The algorithm properly:

  • Marks nodes as visited permanently after full exploration
  • Uses recStack to track nodes in the current DFS path
  • Skips non-existent dependencies gracefully

68-104: Remove this comment—self-references are properly caught by detectCycles.

The detectCycles function uses a recursion stack to detect all cycles, including self-references. When a validator references itself, detectCycles sets visited[name] and recStack[name] to true before processing dependencies (lines 113–114). During the iteration through RunAfter, the self-reference check at line 130 (else if recStack[dep]) catches it and returns an error immediately. A test case in resolver_test.go ("with self-referencing dependency") explicitly verifies this behavior and confirms the error is returned. Since detectCycles runs first in ResolveExecutionGroups (line 33) before assignLevels (line 38), infinite recursion in assignLevels cannot occur with self-references.
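The visited/recStack mechanism described above can be sketched as a standalone function (graph shape and names are illustrative, not the repository's exact code):

```go
package main

import "fmt"

// detectCycle: visited marks fully explored nodes, recStack marks nodes on
// the current DFS path; a dependency already on recStack is a back-edge,
// i.e. a cycle (including self-references).
func detectCycle(deps map[string][]string) bool {
	visited := map[string]bool{}
	recStack := map[string]bool{}
	var visit func(name string) bool
	visit = func(name string) bool {
		visited[name] = true
		recStack[name] = true
		for _, dep := range deps[name] {
			if _, known := deps[dep]; !known {
				continue // missing dependencies are skipped gracefully
			}
			if recStack[dep] {
				return true // back-edge found
			}
			if !visited[dep] && visit(dep) {
				return true
			}
		}
		recStack[name] = false
		return false
	}
	for name := range deps {
		if !visited[name] && visit(name) {
			return true
		}
	}
	return false
}

func main() {
	acyclic := map[string][]string{"quota-check": {"api-enabled"}, "api-enabled": {}}
	selfRef := map[string][]string{"a": {"a"}}
	fmt.Println(detectCycle(acyclic), detectCycle(selfRef)) // false true
}
```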

charts/configs/validation-adapter.yaml (3)

131-136: Precondition uses values (plural) with equals operator.

Line 136 uses values: "Active" with operator equals. If the adapter framework expects value (singular) for equality checks vs values (list) for in operator, this may cause issues. Verify the expected schema.


183-194: RBAC permissions follow least-privilege principle - LGTM.

The Role grants minimal permissions:

  • batch/jobs: get only
  • batch/jobs/status: get, update, patch (needed for status reporter)
  • pods: get, list (for status checking)

This is appropriate for a status-reporter role.


1-14: LGTM - Well-documented adapter configuration header.

The metadata and labels properly identify the adapter type, component, and provider.

validator/pkg/gcp/client.go (3)

20-30: LGTM - Retry constants are reasonable.

Initial backoff of 100ms, max of 30s, and 5 retries provide good defaults for GCP API resilience.


102-123: LGTM - Compute service creation follows good patterns.

Uses read-only scope appropriately for validation tasks and wraps creation with retry logic.


217-227: Test helpers are appropriately exposed.

Exporting GetDefaultClientForTesting and RetryWithBackoffForTesting allows unit tests to exercise internal logic without modifying production code paths.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
charts/configs/validation-dummy-adapter.yaml (1)

67-71: Copy-paste error in parameter description.

The gcpValidatorServiceAccount parameter has an incorrect description that refers to "Maximum time to wait for validation completion" instead of describing the service account.

📝 Proposed fix
     - name: "gcpValidatorServiceAccount"
       source: "env.GCP_VALIDATOR_SERVICE_ACCOUNT"
       type: "string"
       default: "gcp-validator-job-sa"
-      description: "Maximum time to wait for validation completion"
+      description: "Kubernetes ServiceAccount name for the validator job"
🤖 Fix all issues with AI agents
In `@charts/configs/validation-adapter.yaml`:
- Around line 265-269: The message expression uses awkward grammar; update the
ternary message inside the expression that references resources.gcpValidationJob
so the false branch reads either "Validation job is pending to be applied" or
"Validation job pending application" (e.g., replace "Validation job is pending
to applied" with one of these corrected phrases) to fix the grammar in the
message field.
♻️ Duplicate comments (3)
validator/pkg/validator/executor.go (1)

137-154: LGTM! Nil result handling properly implemented.

The defensive nil check addresses the previously identified concern. The fallback result includes appropriate error details and timing information.

validator/pkg/validators/api_enabled.go (1)

100-147: Empty requiredAPIs returns success without validation.

When vctx.Config.RequiredAPIs is empty, the loop is skipped and the validator returns success with "No required APIs to validate". This could mask configuration issues. Consider returning StatusSkipped instead to clearly indicate no validation occurred.

♻️ Suggested change
 	// Check each required API
 	requiredAPIs := vctx.Config.RequiredAPIs
+	if len(requiredAPIs) == 0 {
+		slog.Warn("No required APIs configured, skipping validation")
+		return &validator.Result{
+			Status:  validator.StatusSkipped,
+			Reason:  "NoAPIsConfigured",
+			Message: "No required APIs configured for validation",
+			Details: map[string]interface{}{
+				"project_id": vctx.Config.ProjectID,
+			},
+		}
+	}
 	enabledAPIs := []string{}
 	disabledAPIs := []string{}
validator/pkg/validators/quota_check.go (1)

78-90: Stub validator should return StatusSkipped, not StatusSuccess.

Returning success for an unimplemented check can falsely signal "all checks passed" in aggregated results. The validator correctly includes stub: true in details, but consumers may only check the status field.

🔧 Proposed fix
 	slog.Warn("Quota check not yet implemented - returning success by default")

 	return &validator.Result{
-		Status:  validator.StatusSuccess,
-		Reason:  "QuotaCheckStub",
-		Message: "Quota check validation not yet implemented (stub returning success)",
+		Status:  validator.StatusSkipped,
+		Reason:  "QuotaCheckNotImplemented",
+		Message: "Quota check not implemented; skipped",
 		Details: map[string]interface{}{
 			"stub":        true,
 			"implemented": false,
 			"project_id":  vctx.Config.ProjectID,
 			"note":        "This validator needs to be implemented to check actual GCP quotas",
 		},
 	}
🧹 Nitpick comments (3)
charts/configs/validation-adapter.yaml (1)

47-57: Default image values differ from values.yaml.

The default images here use quay.io/rh-ee-dawang/... while values.yaml uses registry.ci.openshift.org/ci/.... While environment variables will override these defaults at runtime, consider aligning the defaults for consistency and to reduce confusion during debugging.

📝 Proposed alignment with values.yaml
     - name: "statusReporterImage"
       source: "env.STATUS_REPORTER_IMAGE"
       type: "string"
-      default: "quay.io/rh-ee-dawang/status-reporter:dev-04e8d0a"
+      default: "registry.ci.openshift.org/ci/status-reporter:latest"
       description: "Container image for the status reporter"
     # GCP Validator configuration
     - name: "gcpValidatorImage"
       source: "env.GCP_VALIDATOR_IMAGE"
       type: "string"
-      default: "quay.io/rh-ee-dawang/gcp-validator:latest"
+      default: "registry.ci.openshift.org/ci/gcp-validator:latest"
       description: "GCP validator container image"
validator/cmd/validator/main.go (2)

107-110: Consider truncating or omitting full results content in logs.

Logging the entire JSON results content (line 110) could produce very large log entries in production with many validators. Consider logging only a summary or making this configurable via log level.

♻️ Suggested change
-	// Log the results content for easy access via logs (useful in containerized environments)
-	logger.Info("Results written successfully",
-		"path", outputFile,
-		"content", string(data))
+	// Log results summary; full content available in output file
+	logger.Info("Results written successfully",
+		"path", outputFile,
+		"size_bytes", len(data))
+	logger.Debug("Results content", "content", string(data))

100-105: Ensure output directory exists before writing.

The comment notes that /results should be pre-created via volumeMounts in Kubernetes. However, if running outside Kubernetes or if the mount fails, os.WriteFile will fail silently on missing directory. Consider explicitly creating the directory.

♻️ Suggested change
+	import "path/filepath"
+
 	// Ensure output directory exists
-	// Note: In Kubernetes, the /results directory should be pre-created via volumeMounts
+	if dir := filepath.Dir(outputFile); dir != "" && dir != "." {
+		if err := os.MkdirAll(dir, 0755); err != nil {
+			logger.Error("Failed to create output directory", "error", err, "path", dir)
+			os.Exit(1)
+		}
+	}
 	if err := os.WriteFile(outputFile, data, 0644); err != nil {
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 929ad77 and 2e624a6.

⛔ Files ignored due to path filters (1)
  • validator/go.sum is excluded by !**/*.sum
📒 Files selected for processing (10)
  • charts/configs/validation-adapter.yaml
  • charts/configs/validation-dummy-adapter.yaml
  • charts/values.yaml
  • validator/cmd/validator/main.go
  • validator/go.mod
  • validator/pkg/config/config.go
  • validator/pkg/validator/executor.go
  • validator/pkg/validator/executor_test.go
  • validator/pkg/validators/api_enabled.go
  • validator/pkg/validators/quota_check.go
✅ Files skipped from review due to trivial changes (1)
  • validator/go.mod
🚧 Files skipped from review as they are similar to previous changes (2)
  • validator/pkg/config/config.go
  • validator/pkg/validator/executor_test.go
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: 86254860
Repo: openshift-hyperfleet/adapter-validation-gcp PR: 4
File: charts/configs/fake-validation-gcp-job.yaml:25-25
Timestamp: 2025-12-18T11:32:59.422Z
Learning: In the adapter-validation-gcp repository, the gcp-validator service account referenced in fake-validation-gcp-job.yaml is created by the adapter runtime based on the gcpValidationServiceAccount resource definition in adapter-fake-validation-gcp.yaml, not by Helm templates.
📚 Learning: 2025-12-18T11:32:59.422Z
Learnt from: 86254860
Repo: openshift-hyperfleet/adapter-validation-gcp PR: 4
File: charts/configs/fake-validation-gcp-job.yaml:25-25
Timestamp: 2025-12-18T11:32:59.422Z
Learning: In the adapter-validation-gcp repository, the gcp-validator service account referenced in fake-validation-gcp-job.yaml is created by the adapter runtime based on the gcpValidationServiceAccount resource definition in adapter-fake-validation-gcp.yaml, not by Helm templates.

Applied to files:

  • charts/configs/validation-dummy-adapter.yaml
  • charts/configs/validation-adapter.yaml
  • charts/values.yaml
📚 Learning: 2025-12-18T11:21:09.498Z
Learnt from: 86254860
Repo: openshift-hyperfleet/adapter-validation-gcp PR: 4
File: charts/configs/broker-pubsub.yaml:13-13
Timestamp: 2025-12-18T11:21:09.498Z
Learning: In the charts/configs/broker-pubsub.yaml file for the adapter-validation-gcp project, the broker configuration settings like ack_deadline_seconds should remain as fixed values to align with other adapter settings at this phase, rather than being made configurable.

Applied to files:

  • charts/configs/validation-adapter.yaml
🧬 Code graph analysis (1)
validator/pkg/validators/api_enabled.go (5)
validator/pkg/validator/registry.go (2)
  • Register (57-66)
  • Get (74-76)
validator/pkg/validator/validator.go (5)
  • ValidatorMetadata (12-17)
  • Result (41-49)
  • Status (32-32)
  • StatusFailure (36-36)
  • StatusSuccess (35-35)
validator/pkg/validator/context.go (1)
  • Context (14-30)
validator/pkg/config/config.go (1)
  • Config (11-37)
validator/pkg/gcp/client.go (1)
  • NewClientFactory (95-100)
🔇 Additional comments (20)
charts/configs/validation-dummy-adapter.yaml (1)

72-73: LGTM on the casing fix.

The environment variable source is now correctly using MANAGED_BY_RESOURCE_NAME with consistent uppercase casing.

charts/values.yaml (2)

6-9: Clear deployment mode documentation.

The comments explaining the two deployment modes (dummy vs real) are helpful for operators. The default to "real" is appropriate for production use.


90-103: Real GCP validator configuration looks complete.

The configuration includes all necessary settings for the real validator:

  • Container image reference
  • Disabled validators list (quota-check stub appropriately disabled)
  • Required GCP APIs with sensible defaults
  • Results path and timeout settings

The structure mirrors the dummy config for consistency.

charts/configs/validation-adapter.yaml (6)

1-27: Well-structured AdapterConfig metadata and base configuration.

The metadata correctly identifies this as a validation adapter for GCP with appropriate labels. The HyperFleet API and Kubernetes configuration sections are properly defined.


83-87: Environment variable casing is correct.

The managedByResourceName parameter correctly uses env.MANAGED_BY_RESOURCE_NAME with proper uppercase casing.


97-122: Preconditions properly validate cluster state and GCP project.

The clusterStatus precondition appropriately:

  • Captures the projectId from the cluster spec
  • Validates that clusterPhase is in an acceptable state
  • Ensures projectId exists before proceeding

This is essential for the real GCP validation flow.


140-164: ServiceAccount resource definition is correct.

The ServiceAccount is properly templated with cluster-specific namespace and labels. Based on learnings, the service account is created by the adapter runtime based on this resource definition, not by Helm templates.


183-194: RBAC permissions are appropriately scoped.

The Role grants minimal permissions needed:

  • get on jobs, get/update/patch on jobs/status for the status reporter
  • get/list on pods for monitoring

This follows the principle of least privilege.


248-302: Post-processing payload expressions are well-designed.

The CEL expressions use proper null-safety with orValue() and handle multiple status conditions (Applied, Available, Health). The conditional logic correctly handles various job states.

validator/pkg/validator/executor.go (3)

89-187: Thread-safe result storage is correctly implemented.

The mutex properly guards access to e.ctx.Results map. The results[index] slice assignment without mutex is safe because each goroutine writes to a distinct index, eliminating data races on the slice.


74-82: Potential race: StopOnFirstFailure check reads results that may still be written.

The check for failure status in groupResults occurs after wg.Wait(), so this is safe within a single group. However, consider that if a validator in the current group fails, other validators in the same group continue running to completion before the check triggers. This is likely the intended behavior for parallel execution, but worth documenting.


98-129: Robust panic recovery with comprehensive error details.

The panic handling correctly captures stack trace, stores the result thread-safely, and logs detailed information for debugging. Well implemented.
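That recovery pattern can be sketched as follows (`safeRun` and the string result are illustrative simplifications; the real executor builds a `Result` struct instead):

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// safeRun converts a validator panic into a failure result carrying stack
// information, so one misbehaving validator cannot crash the whole run.
func safeRun(validate func() string) (result string) {
	defer func() {
		if r := recover(); r != nil {
			result = fmt.Sprintf("Failure: panic: %v (stack %d bytes)", r, len(debug.Stack()))
		}
	}()
	return validate()
}

func main() {
	fmt.Println(safeRun(func() string { return "Success" }))
	fmt.Println(safeRun(func() string { panic("boom") }))
}
```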

validator/pkg/validators/api_enabled.go (3)

22-42: Well-structured error reason extraction.

The extractErrorReason helper properly prioritizes GCP-specific error reasons over generic HTTP codes, with appropriate nil handling and fallback.


105-113: Correct per-request timeout pattern.

The immediate reqCancel() call after Do() properly releases resources. This is the idiomatic pattern for per-request timeouts within a loop.


140-146: Code comparison is correct as-is.

The string comparison service.State == "ENABLED" is correct for the google.golang.org/api/serviceusage/v1 REST-based client being used. The GCP ServiceUsage API returns state as a string field with values like "ENABLED", "DISABLED", and "STATE_UNSPECIFIED". Constants are not applicable here—only the newer protobuf-based cloud.google.com/go/serviceusage/apiv1 library provides enum constants. No changes needed.

validator/pkg/validators/quota_check.go (1)

19-27: Metadata correctly identifies this as a stub with appropriate tags.

The description clearly states it's a stub, and the "stub" tag enables filtering. The dependency on "api-enabled" is appropriate for when real quota checks are implemented.

validator/cmd/validator/main.go (4)

1-16: Clean imports with side-effect registration.

The blank import of validator/pkg/validators correctly triggers init() functions to register validators. This is an idiomatic Go pattern for plugin-style registration.
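The pattern can be sketched in a single file (in the real layout, `Register` lives in the registry package, each validator file calls it from `init()`, and `main` pulls them in with `_ "…/pkg/validators"`; names below are illustrative):

```go
package main

import "fmt"

var registry = map[string]func() string{}

// Register panics on duplicates, guarding against accidental re-registration
// in initialization code (the instance-method variant allows overwrites).
func Register(name string, v func() string) {
	if _, dup := registry[name]; dup {
		panic("duplicate validator: " + name)
	}
	registry[name] = v
}

// init runs before main, so the validator is registered as a side effect of
// the package being imported.
func init() {
	Register("api-enabled", func() string { return "Success" })
}

func main() {
	fmt.Println(registry["api-enabled"]())
}
```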


47-57: Good validation of disabled validators with helpful hint.

Warning on unknown validators helps catch configuration typos early. The hint to run without DISABLED_VALIDATORS is actionable.


69-76: Proper signal handling for graceful shutdown.

The buffered channel with capacity 1 correctly handles the race between signal delivery and receive. Canceling the context propagates shutdown through the validator execution.


125-139: Robust log level parsing with sensible default.

The parseLogLevel helper correctly handles common variations ("warn"/"warning") and defaults to Info for unknown values.


Comment on lines +271 to +275
- type: "Available"
status:
expression: |
resources.?gcpValidationJob.?status.?conditions.orValue([]).exists(c, c.type == "Available")
? resources.gcpValidationJob.status.conditions.filter(c, c.type == "Available")[0].status : "False"
Collaborator

Could we have a third value, Unknown, if there is no resources.gcpValidationJob.status yet?

Collaborator Author

According to openshift-hyperfleet/architecture#79, it requires several changes from different CLM components. I'll update this logic in another PR.

return nil, fmt.Errorf("no validators enabled")
}

e.logger.Info("Found enabled validators", "count", len(enabledValidators))
Collaborator

This seems to be more of a debug-level log.
There are more Info logs later that may be too detailed.

My concern: generating logs that are too noisy with values that are always the same. Once it is configured, these dependencies will always be the same for all job executions, so this info may only be needed when we are debugging an incident and trying to find the values at every step.

Collaborator Author

I'll address the logging standard in another PR if required, not in the current one.


if service.State == "ENABLED" {
enabledAPIs = append(enabledAPIs, apiName)
slog.Debug("API is enabled", "api", apiName)
Collaborator

Now that I notice the slog here I wonder if we should pass the trace_id from the adapter to the adapter tasks and keep the traceability. This should be passed by the adapter framework

Collaborator Author

I'll address the logging standard (e.g., trace_id) in another PR if required, not in the current one.

}))
slog.SetDefault(logger)

logger.Info("Starting GCP Validator")
Collaborator

We should also log the ldflags info: version, commit, buildDate.

Collaborator Author

I'll address the logging standard (e.g., ldflags info: version, commit, buildDate) in another PR if required, not in the current one.

@86254860 86254860 changed the title Hyperfleet 267 Implement Minimal Real GCP Validation and Related Helm Chart Hyperfleet-267 - feat: Implement Minimal Real GCP Validation and Related Helm Chart Jan 20, 2026
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@charts/templates/deployment.yaml`:
- Around line 102-103: The template defines LOG_LEVEL twice (adapter uses
.Values.logging.level and validation uses .Values.validation.real.logLevel when
useDummy=false), causing the second to override the first; change the validation
env var name from LOG_LEVEL to VALIDATOR_LOG_LEVEL in the deployment template
and update any consumer of the validation config to read VALIDATOR_LOG_LEVEL
instead of LOG_LEVEL (keep the adapter env var tied to .Values.logging.level so
it remains unchanged). Ensure the values mapping
(.Values.validation.real.logLevel) is preserved and any docs or config
references are updated to reflect VALIDATOR_LOG_LEVEL.
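The rename could be sketched as follows (values paths are taken from the comment above; the surrounding template context and indentation are illustrative, not the actual deployment.yaml):

```yaml
env:
  - name: LOG_LEVEL                  # adapter logging, unchanged
    value: {{ .Values.logging.level | quote }}
  {{- if not .Values.validation.useDummy }}
  - name: VALIDATOR_LOG_LEVEL        # validation job logging, renamed
    value: {{ .Values.validation.real.logLevel | quote }}
  {{- end }}
```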
♻️ Duplicate comments (2)
charts/configs/validation-dummy-adapter.yaml (1)

240-244: Minor grammar issue in message expression.

Same grammar issue as in validation-adapter.yaml: "pending to applied" should be "pending" or "pending to be applied".

📝 Proposed fix
               message:
                 expression: |
                   has(resources.gcpValidationJob)
                     ? "Validation job applied successfully"
-                    : "Validation job is pending to applied"
+                    : "Validation job is pending"
charts/configs/validation-adapter.yaml (1)

262-266: Minor grammar issue in message expression.

"pending to applied" should be "pending" or "pending to be applied".

📝 Proposed fix
               message:
                 expression: |
                   has(resources.gcpValidationJob)
                     ? "Validation job applied successfully"
-                    : "Validation job is pending to applied"
+                    : "Validation job is pending"
🧹 Nitpick comments (1)
Makefile (1)

84-84: Consider adding helm-template-real to the test target.

The helm-test target validates dummy mode template rendering but not real mode. Adding helm-template-real would ensure both configuration paths are tested during make helm-test.

Proposed change
-helm-test: helm-lint helm-template-broker helm-template-rabbitmq helm-template-dummy ## Run all helm chart tests (lint + template rendering)
+helm-test: helm-lint helm-template-broker helm-template-rabbitmq helm-template-dummy helm-template-real ## Run all helm chart tests (lint + template rendering)

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@validator/Makefile`:
- Around line 190-198: The current image-dev target uses the `ifndef QUAY_USER`
guard which misses empty or whitespace-only QUAY_USER (since QUAY_USER is set
with `?=`), so replace that guard with a whitespace-aware check using `ifeq
($(strip $(QUAY_USER)),)` to detect empty values; update the block guarding the
image-dev target (the section referencing QUAY_USER, PROJECT_NAME and DEV_TAG)
to use `ifeq ($(strip $(QUAY_USER)),)` ... `endif` and keep the same error
messages and exit behavior so the makefile aborts when QUAY_USER is empty or
only whitespace.
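The suggested guard can be sketched as follows (the recipe body and error message are illustrative, not the actual Makefile contents):

```makefile
# QUAY_USER is set with ?=, so it can exist but be empty; $(strip ...)
# also catches whitespace-only values, which ifndef would miss.
ifeq ($(strip $(QUAY_USER)),)
image-dev:
	@echo "Error: QUAY_USER is empty or unset" >&2; exit 1
else
image-dev:
	podman build -t quay.io/$(QUAY_USER)/$(PROJECT_NAME):$(DEV_TAG) .
endif
```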

In `@validator/pkg/validator/context.go`:
- Around line 56-65: The GetComputeService lazy-init is not thread-safe:
concurrent calls to Context.GetComputeService may race on c.computeService and
create duplicates; change the implementation to guard initialization with
synchronization (either add a mutex field on Context and wrap the
nil-check/assignment for c.computeService in a critical section, or use a
sync.Once per service if using the first-call context for all callers is
acceptable). Specifically protect access to c.computeService around the call to
c.clientFactory.CreateComputeService so only one goroutine performs creation and
assigns c.computeService; update the Context struct to include the chosen sync
primitive and apply the same pattern to other lazy getters.

In `@validator/pkg/validator/resolver.go`:
- Around line 150-187: The ToMermaid method on DependencyResolver marks a node
as having dependencies whenever meta.RunAfter is non-empty, which hides
validators whose RunAfter entries are missing from r.validators; change the
logic in ToMermaid so that hasDependencies[name] is set only when an actual edge
is emitted (i.e., inside the loop over meta.RunAfter when an existing dependency
is found), ensuring validators with only-missing deps remain in allValidators
and are rendered as standalone nodes; reference the DependencyResolver.ToMermaid
function, r.validators, Metadata()/RunAfter, hasDependencies and allValidators
when making the change.

In `@validator/test/integration/validator_integration_test.go`:
- Around line 271-290: The test currently sets cfg.MaxWaitTimeSeconds but then
enforces timeout via context.WithTimeout (shortCtx), so it doesn't validate
configuration-driven timeouts; either rename the spec to reflect that it tests
context-based timeouts, or change the test to exercise the executor's internal
timeout logic by creating an executor that reads cfg.MaxWaitTimeSeconds (ensure
validator.NewExecutor or Executor.ExecuteAll uses cfg.MaxWaitTimeSeconds to
derive its own context/cancellation) and call ExecuteAll without wrapping the
call in context.WithTimeout, then assert that the returned error contains
"context" or that results are nil when the configured timeout elapses; update
references to MaxWaitTimeSeconds, NewExecutor, and ExecuteAll accordingly.
♻️ Duplicate comments (1)
validator/cmd/validator/main.go (1)

37-42: Consider logging build information (ldflags).

A previous reviewer requested logging of build information (version, commit, buildDate) at startup. This helps with debugging and traceability in production environments.

♻️ Proposed addition for build info logging

Add ldflags variables and log them at startup:

// At package level
var (
    version   = "dev"
    commit    = "unknown"
    buildDate = "unknown"
)

// In main(), after logger setup:
logger.Info("Starting GCP Validator",
    "version", version,
    "commit", commit,
    "build_date", buildDate)

Then update your build command to inject these values:

go build -ldflags "-X main.version=v1.0.0 -X main.commit=$(git rev-parse HEAD) -X main.buildDate=$(date -u +%Y-%m-%dT%H:%M:%SZ)"
🧹 Nitpick comments (5)
validator/pkg/validator/executor_test.go (1)

159-212: Dependency ordering test is correct, but consider adding out-of-order registration test.

The test correctly verifies that validator-a executes before validator-b and validator-c using thread-safe order tracking. The use of ConsistOf appropriately handles the unordered nature of parallel execution within the same level.

However, per a previous reviewer's question: this test registers validators in dependency order (a, then b/c). Consider adding a test case where dependent validators are registered before their dependencies (e.g., register b/c first, then a) to verify the resolver handles registration order independence.

validator/cmd/validator/main.go (1)

103-106: Consider log level for results content.

Logging the full JSON results content at Info level could be verbose in production. Consider using Debug level for the full content, or only logging a summary at Info level.

♻️ Suggested change
-	logger.Info("Results written successfully",
-		"path", outputFile,
-		"content", string(data))
+	logger.Info("Results written successfully", "path", outputFile)
+	logger.Debug("Results content", "content", string(data))
validator/test/integration/validator_integration_test.go (2)

134-136: Stub assumption may cause future test failure.

This test assumes the quota-check validator (currently a stub) always returns StatusSuccess. When the actual quota validation is implemented, this test will need to be updated or will fail unexpectedly.

♻️ Suggested improvement
-			// Currently a stub, so should succeed
-			Expect(result.Status).To(Equal(validator.StatusSuccess))
+			// Note: quota-check is currently a stub that always succeeds.
+			// Update this assertion when HYPERFLEET-510 implements real quota validation.
+			Expect(result.Status).To(BeElementOf(
+				validator.StatusSuccess,
+				validator.StatusSkipped,
+			), "Stub validator should succeed or skip")

240-244: Prefer Skip() over silent return for conditional test execution.

Silently returning on error can mask real issues (authentication failures, network problems). Using Skip() makes test results clearer and distinguishes between "not applicable" and "passed".

♻️ Proposed fix
 			if err != nil {
 				logger.Warn("Failed to get Compute API status", "error", err.Error())
-				// Don't fail test - API might not be enabled
-				return
+				Skip(fmt.Sprintf("Skipping Compute API test: %v", err))
 			}
validator/pkg/validators/api_enabled.go (1)

163-168: Empty requiredAPIs handling returns success instead of skipped.

When no required APIs are configured, the validator returns StatusSuccess with message "No required APIs to validate". A previous review suggested returning StatusSkipped to better indicate that no actual validation occurred. Consider whether success is the appropriate status here.

♻️ Alternative using StatusSkipped
 	// Build success message based on whether APIs were checked
-	message := fmt.Sprintf("All %d required APIs are enabled", len(enabledAPIs))
-	if len(enabledAPIs) == 0 {
-		message = "No required APIs to validate"
+	if len(enabledAPIs) == 0 {
+		return &validator.Result{
+			Status:  validator.StatusSkipped,
+			Reason:  "NoAPIsConfigured",
+			Message: "No required APIs configured for validation",
+			Details: map[string]interface{}{
+				"project_id": vctx.Config.ProjectID,
+			},
+		}
 	}
+	message := fmt.Sprintf("All %d required APIs are enabled", len(enabledAPIs))
 	slog.Info(message)

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@validator/cmd/validator/main.go`:
- Around line 3-11: The code writes results to ResultsPath via os.WriteFile but
doesn’t ensure the target directory exists; before any os.WriteFile call (where
ResultsPath or outputFile is used) call os.MkdirAll(filepath.Dir(outputFile),
0o755) to create the parent directory (importing "path/filepath"), and
handle/return any error from MkdirAll so writes don’t fail on custom nested
paths.

In `@validator/pkg/validator/context_test.go`:
- Around line 185-216: The test and implementation allow unprotected lazy
initialization races: update the Context implementation to protect each
lazily-initialized service (e.g., computeService, iamService,
serviceUsageService, monitoringService) using sync.Once or a dedicated mutex per
service to avoid the check-then-set data race in GetComputeService,
GetIAMService, GetServiceUsageService and GetMonitoringService; then extend the
test to spawn many goroutines calling the same getter concurrently (for example,
50 goroutines all calling GetComputeService) and assert there are no
panics/races to ensure the synchronization is effective.

In `@validator/pkg/validator/validator.go`:
- Around line 60-107: In Aggregate, handle StatusSkipped so skipped validators
don't cause a failure: in the loop over results, add a case for StatusSkipped
that increments a skippedCount (do not append to failedChecks or
failureDescriptions), and include skippedCount in details (e.g.,
details["skipped_checks"]). Compute effectiveChecks := checksRun - skippedCount
(or use checksRun and skippedCount in the final message) and determine success
by checking len(failedChecks) == 0 (no failures) rather than checksPassed ==
checksRun; update the success branch and the failure message to use
effectiveChecks/skip-aware counts so skipped validators are excluded from
failure logic and reporting.
♻️ Duplicate comments (3)
validator/pkg/validator/resolver.go (1)

150-187: Validators with only missing dependencies disappear from diagram.

This issue was flagged in a previous review: when a validator has RunAfter entries but none exist in the resolver's validator set, hasDependencies[name] is set to true but no edges are emitted, causing the validator to be excluded from standalone nodes.

🐛 Previously suggested fix
-		if len(meta.RunAfter) > 0 {
-			hasDependencies[name] = true
-			for _, dep := range meta.RunAfter {
-				// Only show edge if dependency exists in our validator set
-				if _, exists := r.validators[dep]; exists {
-					result += fmt.Sprintf("    %s --> %s\n", name, dep)
-				}
-			}
-		}
+		if len(meta.RunAfter) > 0 {
+			for _, dep := range meta.RunAfter {
+				// Only show edge if dependency exists in our validator set
+				if _, exists := r.validators[dep]; exists {
+					hasDependencies[name] = true
+					result += fmt.Sprintf("    %s --> %s\n", name, dep)
+				}
+			}
+		}
validator/pkg/validator/context.go (1)

54-117: Lazy client initialization is not thread-safe.

Parallel validators can call getters concurrently, racing on nil-check/assign and creating duplicate clients or data races.

🛠️ Suggested fix (sync.Once per service)
 import (
     "context"
     "fmt"
     "log/slog"
+    "sync"
@@
 type Context struct {
@@
     // Client factory for creating GCP service clients
     clientFactory *gcp.ClientFactory

+    // Sync primitives for thread-safe lazy init
+    computeOnce          sync.Once
+    iamOnce              sync.Once
+    cloudResourceMgrOnce sync.Once
+    serviceUsageOnce     sync.Once
+    monitoringOnce       sync.Once
+
+    computeErr          error
+    iamErr              error
+    cloudResourceMgrErr error
+    serviceUsageErr     error
+    monitoringErr       error
@@
 func (c *Context) GetComputeService(ctx context.Context) (*compute.Service, error) {
-    if c.computeService == nil {
-        svc, err := c.clientFactory.CreateComputeService(ctx)
-        if err != nil {
-            return nil, fmt.Errorf("failed to create compute service: %w", err)
-        }
-        c.computeService = svc
-    }
-    return c.computeService, nil
+    c.computeOnce.Do(func() {
+        c.computeService, c.computeErr = c.clientFactory.CreateComputeService(ctx)
+        if c.computeErr != nil {
+            c.computeErr = fmt.Errorf("failed to create compute service: %w", c.computeErr)
+        }
+    })
+    return c.computeService, c.computeErr
 }
validator/test/integration/validator_integration_test.go (1)

271-285: Timeout test doesn’t validate config-driven timeout.

The timeout is enforced by context.WithTimeout, so cfg.MaxWaitTimeSeconds isn’t actually exercised. Consider renaming the test or wiring the executor to derive its timeout from config.

🛠️ Suggested rename (if keeping context-driven timeout)
-        It("should respect global timeout from configuration", func() {
+        It("should handle context cancellation gracefully", func() {
🧹 Nitpick comments (5)
validator/pkg/gcp/client_test.go (2)

69-88: Unused table test parameter.

The shouldRetry parameter (line 71) is declared but never used in the test body. Consider removing it or using it in an assertion for clearer intent.

♻️ Suggested fix
         DescribeTable("should retry based on error code",
-            func(errorCode int, shouldRetry bool, expectedAttempts int) {
+            func(errorCode int, expectedAttempts int) {
                 callCount := 0
                 operation := func() error {
                     callCount++
                     return &googleapi.Error{Code: errorCode}
                 }

                 err := gcp.RetryWithBackoffForTesting(ctx, operation)
                 Expect(err).To(HaveOccurred(), "Should return error")
                 Expect(callCount).To(Equal(expectedAttempts))
             },
-            Entry("429 Rate Limit - should retry", 429, true, 5),
-            Entry("503 Service Unavailable - should retry", 503, true, 5),
-            Entry("500 Internal Error - should retry", 500, true, 5),
-            Entry("404 Not Found - should not retry", 404, false, 1),
-            Entry("403 Forbidden - should not retry", 403, false, 1),
+            Entry("429 Rate Limit - should retry", 429, 5),
+            Entry("503 Service Unavailable - should retry", 503, 5),
+            Entry("500 Internal Error - should retry", 500, 5),
+            Entry("404 Not Found - should not retry", 404, 1),
+            Entry("403 Forbidden - should not retry", 403, 1),
         )

110-123: Potential flakiness with tight timeout.

The 50ms timeout is quite short and could cause flaky tests in resource-constrained CI environments. Consider increasing to 100-200ms for more reliable execution while still testing the timeout behavior effectively.

validator/pkg/config/config.go (1)

99-119: Consider logging invalid environment variable values.

When getEnvBool or getEnvInt encounter invalid values, they silently fall back to defaults. While this behavior is documented, logging a warning could help operators debug misconfiguration issues.

validator/pkg/validator/registry.go (1)

55-66: Inconsistent locking pattern between instance and package-level Register.

The package-level Register directly accesses globalRegistry.mu and globalRegistry.validators instead of calling globalRegistry.Register(). While functionally correct, this duplicates the locking logic and creates a semantic difference (panic on duplicate vs. overwrite).

Consider documenting this intentional behavioral difference more explicitly, or unifying by having the instance method also check for duplicates with a flag/option if needed for production use.

♻️ Alternative: delegate to instance method with duplicate check
 func Register(v Validator) {
     meta := v.Metadata()
     globalRegistry.mu.Lock()
     defer globalRegistry.mu.Unlock()
 
     if _, exists := globalRegistry.validators[meta.Name]; exists {
         panic(fmt.Sprintf("validator already registered: %s", meta.Name))
     }
     globalRegistry.validators[meta.Name] = v
 }

The current implementation is fine; just ensure the behavioral difference (panic vs. overwrite) is intentional and documented.

validator/pkg/validator/resolver.go (1)

68-104: Potential stack overflow on deeply nested dependencies.

The recursive calcLevel function could cause a stack overflow with very deep dependency chains. While unlikely in practice for validators, consider adding a depth limit or converting to iterative approach for robustness.
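An iterative alternative can be sketched with a Kahn-style pass (illustrative only; assignLevelsIter and the deps map shape are assumptions, not the PR's resolver code, and cycles are assumed to have been rejected already):

```go
package main

import "fmt"

// assignLevelsIter computes execution levels without recursion. deps maps
// each validator to the validators it must run after (its RunAfter list).
func assignLevelsIter(deps map[string][]string) map[string]int {
	indeg := map[string]int{}
	dependents := map[string][]string{} // dep -> validators waiting on it
	for name, after := range deps {
		if _, ok := indeg[name]; !ok {
			indeg[name] = 0
		}
		for _, d := range after {
			if _, ok := deps[d]; !ok {
				continue // ignore deps outside the validator set
			}
			indeg[name]++
			dependents[d] = append(dependents[d], name)
		}
	}
	level := map[string]int{}
	queue := []string{}
	for name, n := range indeg {
		if n == 0 {
			level[name] = 0
			queue = append(queue, name)
		}
	}
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		for _, next := range dependents[cur] {
			if level[cur]+1 > level[next] {
				level[next] = level[cur] + 1 // deepest dependency wins
			}
			indeg[next]--
			if indeg[next] == 0 {
				queue = append(queue, next)
			}
		}
	}
	return level
}

func main() {
	levels := assignLevelsIter(map[string][]string{
		"api-enabled": {},
		"quota-check": {"api-enabled"},
	})
	fmt.Println(levels["api-enabled"], levels["quota-check"]) // 0 1
}
```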

@xueli181114
Collaborator

/lgtm

@xueli181114
Collaborator

Merging this PR with comments still open. This repo will be archived soon.

@openshift-ci openshift-ci bot removed the lgtm label Jan 23, 2026
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@validator/test/integration/validator_integration_test.go`:
- Around line 240-250: The test currently does an early "return" inside the
failure branch (the logger.Warn call) which causes the test to silently pass;
change that to call Ginkgo's Skip() (e.g., replace the logger.Warn + return with
logger.Warn(...) followed by Skip(fmt.Sprintf("Compute API not enabled or error:
%v", err))) so the test is marked as skipped and the error is included in the
message; ensure Skip is available from Ginkgo and keep the subsequent
Expect(service).NotTo(BeNil()) / logger.Info("Compute API status", ...) path
unchanged for the success case.
♻️ Duplicate comments (1)
validator/pkg/validator/context.go (1)

68-80: Error is not persisted across calls—subsequent callers silently receive (nil, nil) on initialization failure.

The err variable is local to each function call. When sync.Once.Do() runs on the first call and fails, err is set. On subsequent calls, a new err variable is created (initialized to nil), but Do() doesn't re-execute the closure. This means:

  • First call fails → returns (nil, error)
  • Subsequent calls → returns (nil, nil)

This same pattern affects all five service getters.

🔒️ Proposed fix: store error in struct

Add error fields to the struct:

 type Context struct {
     // ...existing fields...
     
+    // Error fields for lazy initialization (persisted for subsequent calls)
+    computeErr          error
+    iamErr              error
+    cloudResourceMgrErr error
+    serviceUsageErr     error
+    monitoringErr       error
+
     // GCP Clients (lazily initialized, shared across validators)
     computeService          *compute.Service
     // ...
 }

Update getters to use struct error field:

 func (c *Context) GetComputeService(ctx context.Context) (*compute.Service, error) {
-    var err error
     c.computeOnce.Do(func() {
-        c.computeService, err = c.clientFactory.CreateComputeService(ctx)
-        if err != nil {
-            err = fmt.Errorf("failed to create compute service: %w", err)
+        c.computeService, c.computeErr = c.clientFactory.CreateComputeService(ctx)
+        if c.computeErr != nil {
+            c.computeErr = fmt.Errorf("failed to create compute service: %w", c.computeErr)
         }
     })
-    if err != nil {
-        return nil, err
-    }
-    return c.computeService, nil
+    return c.computeService, c.computeErr
 }

Apply the same pattern to all other getters.

🧹 Nitpick comments (2)
validator/pkg/validator/context_test.go (1)

190-213: Thread-safety test validates concurrency but not error persistence.

The test verifies that concurrent access doesn't cause panics or races, which is good. However, given the error persistence bug in context.go, consider adding a test that verifies: when initialization fails, all subsequent calls (from different goroutines) also return the same error rather than (nil, nil).

💡 Suggested additional test case
It("should return consistent error across concurrent calls when initialization fails", func() {
    ctx := context.Background()
    var wg sync.WaitGroup
    const numGoroutines = 10
    
    errors := make([]error, numGoroutines)
    services := make([]*serviceusage.Service, numGoroutines)
    
    for i := 0; i < numGoroutines; i++ {
        wg.Add(1)
        go func(idx int) {
            defer GinkgoRecover()
            defer wg.Done()
            services[idx], errors[idx] = vctx.GetServiceUsageService(ctx)
        }(i)
    }
    wg.Wait()
    
    // All calls should return consistent results
    // Either all succeed with non-nil service, or all fail with non-nil error
    firstErr := errors[0]
    firstSvc := services[0]
    for i := 1; i < numGoroutines; i++ {
        if firstErr != nil {
            Expect(errors[i]).To(HaveOccurred(), "All calls should return error if first failed")
        } else {
            Expect(services[i]).To(Equal(firstSvc), "All calls should return same service instance")
        }
    }
})
validator/pkg/validator/resolver.go (1)

68-104: assignLevels assumes no cycles exist—document this precondition.

The recursive calcLevel function will infinite-loop if a cycle exists because there's no visited tracking to prevent re-entry during the same DFS path. This is safe since ResolveExecutionGroups always calls detectCycles first, but the implicit dependency should be documented.

📝 Suggested documentation improvement
 // assignLevels performs topological sort and assigns execution levels
+// Precondition: detectCycles() must be called first to ensure no circular dependencies exist.
+// Without this precondition, this function may infinite-loop on cyclic graphs.
 func (r *DependencyResolver) assignLevels() map[string]int {

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@validator/test/integration/validator_integration_test.go`:
- Around line 169-180: The test currently treats any non-StatusSuccess result as
a failure which conflates StatusSkipped with StatusFailure; update the loop over
results in validator_integration_test.go to increment successCount when r.Status
== validator.StatusSuccess, failureCount only when r.Status ==
validator.StatusFailure, and optionally track skippedCount when r.Status ==
validator.StatusSkipped, then change the assertions to compare checksPassed to
successCount and checksRun - checksPassed - skippedCount (or explicitly
checksRun - checksPassed if you track skippedCount==0) to failureCount so
skipped results are not counted as failures; reference the results slice,
StatusSuccess, StatusFailure, StatusSkipped, and the Aggregate behavior in
validator.Aggregate to locate the relevant code.
♻️ Duplicate comments (2)
validator/pkg/validators/api_enabled.go (1)

94-97: Empty requiredAPIs list still returns success rather than skip.

The previous review flagged that an empty RequiredAPIs list results in a misleading success with "No required APIs to validate". While you marked it as fixed, the early-return check doesn't appear to be present. The current flow proceeds through the loop (which does nothing), then returns success at lines 158-173.

Consider adding an early return with StatusSkipped or a warning log when no APIs are configured, as suggested previously.

🐛 Proposed fix
 	// Check each required API
 	requiredAPIs := vctx.Config.RequiredAPIs
+	if len(requiredAPIs) == 0 {
+		slog.Warn("No required APIs configured, skipping validation")
+		return &validator.Result{
+			Status:  validator.StatusSuccess, // or StatusSkipped if available
+			Reason:  "NoAPIsConfigured",
+			Message: "No required APIs configured for validation",
+			Details: map[string]interface{}{
+				"project_id": vctx.Config.ProjectID,
+			},
+		}
+	}
 	enabledAPIs := []string{}
 	disabledAPIs := []string{}


validator/pkg/validator/validator.go (1)

57-88: StatusSkipped results incorrectly contribute to failure determination.

The aggregation logic increments checksPassed only for StatusSuccess, but the success condition checksPassed == checksRun fails when any validator is skipped, even if there are no actual failures. This causes skipped validators to be incorrectly counted as failures.

🛠️ Suggested fix
 func Aggregate(results []*Result) *AggregatedResult {
     checksRun := len(results)
     checksPassed := 0
+    checksSkipped := 0
     var failedChecks []string
     var failureDescriptions []string

     // Single pass to collect all failure information
     for _, r := range results {
         switch r.Status {
         case StatusSuccess:
             checksPassed++
         case StatusFailure:
             failedChecks = append(failedChecks, r.ValidatorName)
             failureDescriptions = append(failureDescriptions, fmt.Sprintf("%s (%s)", r.ValidatorName, r.Reason))
+        case StatusSkipped:
+            checksSkipped++
         }
     }

     details := map[string]interface{}{
         "checks_run":    checksRun,
         "checks_passed": checksPassed,
+        "checks_skipped": checksSkipped,
         "timestamp":     time.Now().UTC().Format(time.RFC3339),
         "validators":    results,
     }

-    if checksPassed == checksRun {
+    if len(failedChecks) == 0 {
         return &AggregatedResult{
             Status:  StatusSuccess,
             Reason:  "ValidationPassed",
             Message: "All GCP validation checks passed successfully",
             Details: details,
         }
     }
🧹 Nitpick comments (3)
validator/pkg/validators/api_enabled.go (1)

99-141: Consider deferring reqCancel for safety.

Calling reqCancel() after svc.Services.Get(...).Do() works, but if a panic occurs between context creation and the cancel call, the context would leak. Using defer is idiomatic and safer.

♻️ Suggested improvement
 	for _, apiName := range requiredAPIs {
 		// Add per-request timeout
 		reqCtx, reqCancel := context.WithTimeout(ctx, apiRequestTimeout)
+		defer reqCancel() // Ensure cleanup even on panic

 		serviceName := fmt.Sprintf("projects/%s/services/%s", vctx.Config.ProjectID, apiName)

 		slog.Debug("Checking API", "api", apiName)
 		service, err := svc.Services.Get(serviceName).Context(reqCtx).Do()
-		reqCancel() // Clean up context
validator/pkg/validator/executor.go (1)

70-91: Consider whether context cancellation should stop execution between groups.

The executor passes ctx to validators and waits for group completion, but doesn't check ctx.Err() between groups. If the context is cancelled mid-execution, the current group will complete (validators check ctx internally), but subsequent groups will still start.

♻️ Optional improvement
     // 4. Execute validators group by group
     allResults := []*Result{}
     for _, group := range groups {
+        // Check if context was cancelled before starting next group
+        if ctx.Err() != nil {
+            e.logger.Warn("Context cancelled, stopping execution", "error", ctx.Err())
+            return allResults, ctx.Err()
+        }
+
         e.logger.Info("Executing level",
             "level", group.Level,
             "validators", len(group.Validators))
validator/test/integration/validator_integration_test.go (1)

115-140: Consider making the quota-check test more flexible.

The test asserts StatusSuccess (line 138), which assumes the stub always succeeds. If the stub implementation changes or returns a different status in certain environments, this test will fail unexpectedly.

♻️ Suggested improvement
-               // Currently a stub, so should succeed
-               Expect(result.Status).To(Equal(validator.StatusSuccess))
+               // Verify result has a valid status (stub currently succeeds)
+               Expect(result.Status).To(BeElementOf(
+                   validator.StatusSuccess,
+                   validator.StatusSkipped,
+               ), "Stub should succeed or skip")

@xueli181114
Collaborator

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Jan 23, 2026
@openshift-ci

openshift-ci bot commented Jan 26, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: xueli181114

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: Needs approval from an approver in each of these files.

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@xueli181114 xueli181114 merged commit f8c3833 into openshift-hyperfleet:main Jan 26, 2026
2 checks passed