
fix: Improve delete-eks-cluster.sh to perform comprehensive cleanup #878

Open
cbumb wants to merge 9 commits into NVIDIA:main from cbumb:UAT_test_cleanup_fix

Conversation


@cbumb cbumb commented Feb 17, 2026

Summary

  • Rewrites delete-eks-cluster.sh to perform comprehensive cleanup in the correct dependency order: OIDC providers, GPU subnets, then CloudFormation stacks (nodegroups -> addons -> cluster).
  • Adds proper wait/timeout logic for stack deletion and handles orphaned resources.
  • Passes the CLUSTER_NAME env var to the delete step in the integration workflow.

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📚 Documentation
  • 🔧 Refactoring
  • 🔨 Build/CI

Component(s) Affected

  • Core Services
  • Documentation/CI
  • Fault Management
  • Health Monitors
  • Janitor
  • Other: ____________

Testing

  • Tests pass locally
  • Manual testing completed
  • No breaking changes (or documented)

Checklist

  • Self-review completed
  • Documentation updated (if needed)
  • Ready for review

Summary by CodeRabbit

  • Chores
    • Added CLUSTER_NAME environment variable to the AWS teardown step to allow explicit cluster targeting during destruction.
    • Reworked cluster deletion into a resilient, multi-phase teardown with polling, timeouts and clearer logging for more reliable operation.
    • Expanded automated cleanup to include identity providers, GPU subnet cleanup, network interface detachment, and orphaned resource removal to reduce leftover infrastructure.


coderabbitai bot commented Feb 17, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds CLUSTER_NAME to the AWS integration workflow teardown and refactors the EKS deletion script into a resilient, multi-phase orchestrator that derives cluster names, cleans OIDC providers (including orphans), removes GPU subnets/ENIs, and deletes CloudFormation stacks with polling and timeouts.

Changes

Cohort / File(s) Summary
Workflow Configuration
.github/workflows/integration-aws.yml
Added CLUSTER_NAME environment variable to the "Destroy Cluster" (Teardown) step so teardown commands can reference an explicit cluster name.
EKS Cluster Deletion Script
tests/uat/aws/delete-eks-cluster.sh
Replaced one-shot deletion with a main-driven orchestration: cluster-name derivation (get_cluster_names_from_stacks), staged delete_cluster_stacks (nodegroups → GPU subnet/ENI cleanup → addons/IAM → cluster), OIDC deletion and orphan handling, ENI detach polling (wait_for_eni_detach), and stack deletion polling (delete_stacks_and_wait, wait_for_stack_deletion). Introduced STACK_DELETE_TIMEOUT.

Sequence Diagram

sequenceDiagram
    participant Script as Deletion Script
    participant CFN as CloudFormation
    participant EKS as EKS/IAM
    participant AWS as AWS Services

    Script->>CFN: Describe stacks / list eksctl-style stacks
    CFN-->>Script: Stack list
    Script->>Script: Extract unique cluster names

    rect rgba(100,150,200,0.5)
    Note over Script,EKS: OIDC provider cleanup
    Script->>EKS: Describe/delete OIDC provider
    EKS-->>Script: Status / errors
    Script->>EKS: List orphaned OIDC providers
    EKS-->>Script: Orphan list
    end

    rect rgba(100,200,150,0.5)
    Note over Script,CFN: Multi-phase stack deletion
    Script->>CFN: Phase 1 - delete nodegroup stacks
    CFN-->>Script: Deletion initiated
    Script->>AWS: Phase 2 - delete GPU subnets (disassoc routes, detach ENIs)
    AWS-->>Script: Ops progress
    Script->>CFN: Phase 3 - delete addon/IAM stacks
    CFN-->>Script: Deletion initiated
    Script->>CFN: Phase 4 - delete cluster stack
    CFN-->>Script: Deletion initiated
    end

    rect rgba(200,150,100,0.5)
    Note over Script,CFN: Polling & completion
    loop Until deletion complete or timeout
        Script->>CFN: Poll stack status
        CFN-->>Script: Status update
    end
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped through stacks at morning light,

I nudged old OIDC ghosts to flight,
I coaxed ENIs from subnet beds,
Watched nodegroups fall like sleepy heads,
Now fields are clear — the cluster's night.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Description Check: ✅ Passed. Check skipped - CodeRabbit's high-level summary is enabled.
Title check: ✅ Passed. The title accurately summarizes the main change: improving delete-eks-cluster.sh to perform comprehensive cleanup with proper dependency ordering and wait logic.
Docstring Coverage: ✅ Passed. Docstring coverage is 88.89%, which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (4)
tests/uat/aws/delete-eks-cluster.sh (4)

339-351: Hardcoded sleep 5 may not be enough for ENI detach to complete.

After force-detaching an ENI (line 341-344), the 5-second sleep on line 345 is a best-effort wait before deletion. If detachment takes longer (e.g., under high API latency), the delete-network-interface call on line 349 may fail. The || log "WARNING: ..." fallback prevents script failure, but the ENI would be left behind.

A small poll on the ENI attachment status (e.g., checking for "detached" or "available" status) with a timeout would be more robust. That said, for a CI teardown script where re-runs are possible, this is acceptable as-is.
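A minimal polling sketch along these lines (all names and the 60s default are illustrative; eni_status is a self-contained stub standing in for `aws ec2 describe-network-interfaces --network-interface-ids "$eni_id" --query 'NetworkInterfaces[0].Status' --output text`):

```shell
eni_status() {
    # Stub so the sketch is runnable; the real helper queries AWS as above.
    echo "available"
}

wait_for_eni_available() {
    eni_id="$1"; timeout="${2:-60}"; interval=5; elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        status=$(eni_status "$eni_id") || break   # on API error, fall through
        case "$status" in
            available|None|"") return 0 ;;        # detached (or already gone)
        esac
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    echo "WARNING: ENI $eni_id still attached after ${timeout}s; attempting delete anyway" >&2
    return 0   # a teardown script should proceed regardless
}
```

Returning 0 even on timeout keeps the existing best-effort behavior: the subsequent delete-network-interface call is still attempted, just with better odds of succeeding.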

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/uat/aws/delete-eks-cluster.sh` around lines 339 - 351, Replace the
fixed sleep after force-detaching the ENI with a short polling loop that checks
the ENI attachment/status via aws ec2 describe-network-interfaces (use the same
$eni_id and $AWS_REGION variables), waiting until the interface shows no
attachment or status is "available" or until a configurable timeout (e.g., 60s)
is reached; if the timeout expires log a warning and continue to the delete step
so delete-network-interface is attempted but more likely to succeed if
detachment completed.

412-414: On DELETE_FAILED, log the reason from stack events to aid debugging.

When a stack enters DELETE_FAILED, CloudFormation records the reason in stack events. Currently the script only logs a generic warning, making it hard to diagnose CI teardown failures. Fetching the latest failure reason would save debugging time.

🔍 Proposed enhancement
             DELETE_FAILED)
-                log "WARNING: Stack $stack_name deletion failed"
+                log "WARNING: Stack $stack_name deletion failed. Recent events:"
+                aws cloudformation describe-stack-events \
+                    --region "$AWS_REGION" \
+                    --stack-name "$stack_name" \
+                    --query 'StackEvents[?ResourceStatus==`DELETE_FAILED`].[LogicalResourceId,ResourceStatusReason]' \
+                    --output table 2>/dev/null || true
                 return 1
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/uat/aws/delete-eks-cluster.sh` around lines 412 - 414, On the
DELETE_FAILED branch, replace the generic warning with a log that includes the
CloudFormation failure reason: call AWS CLI to fetch recent stack events for the
current stack_name (e.g., use aws cloudformation describe-stack-events
--stack-name "$stack_name" and inspect the top event's ResourceStatusReason or a
matching DELETE_FAILED event) and include that reason in the process log (log
"WARNING: Stack $stack_name deletion failed: <reason>"). Ensure this runs inside
the DELETE_FAILED case so the reason is captured and printed for debugging.

151-152: Unused parameter cluster_name.

The cluster_name parameter on line 152 is accepted but never referenced in the function body. This could confuse maintainers into thinking the function scopes its cleanup to the given cluster. Consider removing it or adding a comment explaining why it's passed (e.g., for future use or logging).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/uat/aws/delete-eks-cluster.sh` around lines 151 - 152, The function
delete_orphaned_oidc_providers declares an unused parameter cluster_name; either
remove the parameter from the function signature and all callers, or use
cluster_name (e.g., add an explanatory comment or a log line inside
delete_orphaned_oidc_providers referencing the cluster) so its presence is
justified. Update any invocations of delete_orphaned_oidc_providers to match the
new signature if you remove the parameter, or add a simple log like "Cleaning
OIDC providers for ${cluster_name}" if you keep it.

82-101: Suffix stripping assumes cluster names don't contain -nodegroup- or -addon-.

The sequential sed replacements on lines 93-95 would incorrectly truncate a cluster name that happens to contain -nodegroup- or -addon- as a substring. Given the current naming convention (nvs-d<run_id>) this is safe, but worth a comment for future maintainers.
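A runnable sketch of the end-anchored stripping this comment suggests (stack names are hypothetical; note that `[^-]*` itself assumes the trailing nodegroup/addon component contains no hyphens):

```shell
# Strip only a trailing "-nodegroup-<suffix>" / "-addon-<suffix>" / "-cluster"
# component, so embedded substrings earlier in the name survive.
strip_cluster_name() {
    echo "$1" | sed -E 's/^eksctl-//; s/-nodegroup-[^-]*$//; s/-addon-[^-]*$//; s/-cluster$//'
}

strip_cluster_name "eksctl-nvs-d12345-nodegroup-gpu1"   # nvs-d12345
strip_cluster_name "eksctl-nvs-d12345-cluster"          # nvs-d12345
```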

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/uat/aws/delete-eks-cluster.sh` around lines 82 - 101, In
get_cluster_names_from_stacks, the sed replacements can accidentally truncate
valid cluster names that include the substrings "-nodegroup-" or "-addon-";
update the two sed calls to only strip those patterns when they appear as
trailing components by using stricter end-anchored patterns (e.g. change
"s/-nodegroup-.*$//" to "s/-nodegroup-[^-]*$//" and "s/-addon-.*$//" to
"s/-addon-[^-]*$//"), and add a short comment above the block documenting the
assumption about cluster name formats for future maintainers.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/uat/aws/delete-eks-cluster.sh`:
- Around line 151-168: In delete_orphaned_oidc_providers, don't swallow failures
of aws eks list-clusters; remove the "|| true" fallback for active_clusters,
check its exit status and if listing clusters fails log an error (with context)
and return a non-zero status (or exit) instead of proceeding—this ensures
active_clusters is valid before the loop that uses it and prevents deleting OIDC
providers when list-clusters errored.


@github-actions

Merging this branch will decrease overall coverage

Impacted Packages Coverage Δ 🤖
github.com/nvidia/nvsentinel/fault-quarantine 0.00% (ø)
github.com/nvidia/nvsentinel/fault-quarantine/pkg/common 0.00% (ø)
github.com/nvidia/nvsentinel/fault-quarantine/pkg/eventwatcher 0.00% (ø)
github.com/nvidia/nvsentinel/fault-quarantine/pkg/informer 30.27% (-0.06%) 👎
github.com/nvidia/nvsentinel/fault-quarantine/pkg/metrics 47.37% (ø)
github.com/nvidia/nvsentinel/fault-quarantine/pkg/reconciler 20.64% (ø)

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/nvidia/nvsentinel/fault-quarantine/main.go 0.00% (ø) 246 0 246
github.com/nvidia/nvsentinel/fault-quarantine/pkg/common/common.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/fault-quarantine/pkg/eventwatcher/event_watcher.go 0.00% (ø) 650 0 650
github.com/nvidia/nvsentinel/fault-quarantine/pkg/informer/k8s_client.go 31.58% (ø) 1067 337 730
github.com/nvidia/nvsentinel/fault-quarantine/pkg/informer/node_informer.go 28.34% (-0.14%) 727 206 (-1) 521 (+1) 👎
github.com/nvidia/nvsentinel/fault-quarantine/pkg/metrics/metrics.go 47.37% (ø) 19 9 10
github.com/nvidia/nvsentinel/fault-quarantine/pkg/reconciler/reconciler.go 20.64% (ø) 2820 582 2238

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/nvidia/nvsentinel/fault-quarantine/pkg/reconciler/reconciler_e2e_test.go

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (3)
tests/uat/aws/delete-eks-cluster.sh (3)

82-101: Minor: get_cluster_names_from_stacks produces a leading space in its output.

cluster_names is built as " name1 name2" (leading space from line 96). After tr | sort -u | tr, the empty element becomes a leading space in the result. This is functionally harmless — word splitting in the caller's for loop (line 61) skips it — but the log on line 58 will show a leading space.

Trim leading/trailing whitespace
     # Return unique names
-    echo "$cluster_names" | tr ' ' '\n' | sort -u | tr '\n' ' '
+    echo "$cluster_names" | tr ' ' '\n' | sort -u | grep -v '^$' | tr '\n' ' ' | sed 's/^ *//;s/ *$//'

Or initialize cluster_names without the leading space:

-        cluster_names="${cluster_names} ${name}"
+        cluster_names="${cluster_names:+$cluster_names }${name}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/uat/aws/delete-eks-cluster.sh` around lines 82 - 101,
get_cluster_names_from_stacks currently builds cluster_names with a leading
space; change the append logic for cluster_names (in function
get_cluster_names_from_stacks where cluster_names and name are used) so the
first item is added without a leading separator and subsequent items are
prefixed with a single space (e.g. use a conditional append or the
${cluster_names:+...} pattern) and keep the final echo/tr|sort -u|tr pipeline
intact so the function returns the same unique list but without leading/trailing
whitespace.

373-398: wait_for_eni_detach returns 0 even on timeout — confirm this is intentional.

When the || break fires (API error, line 385) or the timeout expires (line 395), the function always returns 0 (the exit status of the final if statement or log). The caller on line 352 proceeds to delete-network-interface regardless, so this works, but an explicit return 0 at the bottom would make the intent clearer.

Make the implicit return explicit
     if [ "$elapsed" -ge "$timeout" ]; then
         log "WARNING: Timed out waiting for ENI $eni_id to detach after ${timeout}s; proceeding with delete attempt"
     fi
+    return 0
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/uat/aws/delete-eks-cluster.sh` around lines 373 - 398, The helper
function wait_for_eni_detach currently falls through and returns the last
command's exit status implicitly; make the intent explicit by adding an explicit
return value: ensure that on successful detection of an available/empty ENI the
function returns 0, and when the loop times out (after logging the warning) also
explicitly return 0 so callers know we intentionally proceed with deletion;
modify the function wait_for_eni_detach to include explicit return 0 in both the
success path and after the timeout warning (and keep the existing
break/early-exit behavior unchanged).

116-120: delete_orphaned_oidc_providers failure is silently swallowed by the caller.

Because delete_orphaned_oidc_providers (line 118) is invoked inside a || { … } block, set -e is inhibited. If the function returns 1 (e.g., list-clusters fails on line 172), execution continues to return 0 on line 119, and the main loop moves on without any indication of the failure propagating upward.

The safety behavior is preserved (no accidental deletions), and the ERROR log inside the function provides visibility. For a cleanup script this is probably acceptable, but it's worth being aware that OIDC orphan cleanup could silently be skipped for a transient API failure.

Option: propagate the error to make it visible in CI
         --output text 2>/dev/null) || {
         log "WARNING: Could not describe cluster $cluster_name (may already be deleted). Searching for orphaned OIDC providers..."
-        delete_orphaned_oidc_providers "$cluster_name"
-        return 0
+        delete_orphaned_oidc_providers "$cluster_name" || log "WARNING: Orphaned OIDC cleanup failed for $cluster_name (see error above)"
+        return 0  # continue cleanup even if orphaned OIDC cleanup fails
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/uat/aws/delete-eks-cluster.sh` around lines 116 - 120, The current
error path swallows failures from delete_orphaned_oidc_providers when the aws
describe call fails because the function is invoked inside a `|| { ...; return
0; }` block; update the logic so that after calling
delete_orphaned_oidc_providers "$cluster_name" you capture its exit status and
if it is non-zero propagate that failure (exit or return non-zero) instead of
unconditionally returning 0, ensuring set -e behavior is preserved and failures
in delete_orphaned_oidc_providers surface in CI; reference the call to
delete_orphaned_oidc_providers and the variable cluster_name when making the
change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@tests/uat/aws/delete-eks-cluster.sh`:
- Around line 165-175: The change correctly replaces the previous `|| true` with
a fail-fast block that logs an ERROR and returns 1 when `aws eks list-clusters`
fails; keep this fail-fast logic around the `active_clusters=$(aws eks
list-clusters ...) || { log "ERROR: ..."; return 1; }` block and ensure any
caller of this script/function checks and propagates the non-zero return (or
exits) so orphaned OIDC cleanup is never run with an empty `active_clusters` or
unset `AWS_REGION`; reference `active_clusters`, `AWS_REGION`, and the `log`
call to locate the code.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (2)
tests/uat/aws/delete-eks-cluster.sh (2)

418-421: Phase-level stack deletion failures are silently swallowed, resulting in exit 0 even when resources leak.

wait_for_stack_deletion returns 1 on DELETE_FAILED, but || true eats it. If Phase 1 (nodegroup) stacks fail to delete, Phases 2–4 still run and their own stack deletions will likely fail too (nodegroup ENIs still hold VPC resources). The job exits green while stacks and subnets remain.

Consider tracking a failure flag and returning a non-zero exit from main when any phase reports a deletion failure:

♻️ Proposed refactor
-    # Wait for all to complete
-    for stack in $stacks; do
-        wait_for_stack_deletion "$stack" || true
-    done
+    # Wait for all to complete; accumulate failures
+    local failed=0
+    for stack in $stacks; do
+        wait_for_stack_deletion "$stack" || { log "WARNING: Stack $stack did not delete cleanly"; failed=1; }
+    done
+    return "$failed"

Then in delete_cluster_stacks, propagate the flag from each phase and return it to main so the script can exit non-zero.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/uat/aws/delete-eks-cluster.sh` around lines 418 - 421, The loop that
calls wait_for_stack_deletion currently swallows failures with "|| true",
causing the script to exit 0 even when deletions fail; change the logic to track
failures instead: remove "|| true", capture the return status of
wait_for_stack_deletion inside delete_cluster_stacks (e.g., set a local failure
flag if any call returns non-zero), propagate that flag out of
delete_cluster_stacks (return non-zero when any phase failed), have main check
the return value and exit with a non-zero code when failures occurred so
phase-level DELETE_FAILED results cause the job to fail and stop subsequent
phases when appropriate.

190-211: O(N×M) describe-cluster calls in the orphan detection loop.

For every OIDC provider that matches the regional EKS URL, the inner loop calls describe-cluster once per active cluster. Pre-computing OIDC IDs for all active clusters before the outer loop reduces this to O(N) calls.

♻️ Proposed refactor
+    # Pre-compute OIDC IDs for all active clusters (O(N) calls)
+    local active_oidc_ids=""
+    for cluster in $active_clusters; do
+        local cluster_oidc_url
+        cluster_oidc_url=$(aws eks describe-cluster \
+            --name "$cluster" \
+            --region "$AWS_REGION" \
+            --query 'cluster.identity.oidc.issuer' \
+            --output text 2>/dev/null) || continue
+        active_oidc_ids="${active_oidc_ids} ${cluster_oidc_url##*/}"
+    done
+
     for arn in $provider_arns; do
         # EKS OIDC providers have URLs like: oidc.eks.{region}.amazonaws.com/id/{ID}
         local provider_url
         provider_url=$(aws iam get-open-id-connect-provider \
             --open-id-connect-provider-arn "$arn" \
             --query 'Url' \
             --output text 2>/dev/null) || continue

         if echo "$provider_url" | grep -q "oidc.eks.${AWS_REGION}.amazonaws.com"; then
             local oidc_id
             oidc_id="${provider_url##*/}"

             local is_orphaned=true
-            for cluster in $active_clusters; do
-                local cluster_oidc
-                cluster_oidc=$(aws eks describe-cluster \
-                    --name "$cluster" \
-                    --region "$AWS_REGION" \
-                    --query 'cluster.identity.oidc.issuer' \
-                    --output text 2>/dev/null) || continue
-
-                if echo "$cluster_oidc" | grep -qF "$oidc_id"; then
+            for active_id in $active_oidc_ids; do
+                if [[ "$active_id" == "$oidc_id" ]]; then
                     is_orphaned=false
                     break
                 fi
             done
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/uat/aws/delete-eks-cluster.sh` around lines 190 - 211, Precompute each
active cluster's OIDC issuer once before iterating OIDC providers: iterate over
active_clusters, call aws eks describe-cluster to collect and store cluster
identity OIDC IDs (e.g., in an array or associative map keyed by cluster name or
just a newline-separated string variable), then replace the inner per-provider
describe-cluster loop that sets cluster_oidc and checks grep -qF "$oidc_id" with
a membership check against the precomputed collection (use the same
oidc_id/arn/provider_url variable names and preserve the is_orphaned boolean
logic and log/delete calls in the outer loop).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/uat/aws/delete-eks-cluster.sh`:
- Around line 222-229: The CloudFormation stack filter uses
starts_with(StackName, 'eksctl-${cluster_name}') which matches sibling clusters;
update the AWS CLI --query expression used when setting all_stacks to require
the trailing hyphen by changing the starts_with pattern to
'eksctl-${cluster_name}-' so only stacks belonging to this cluster (e.g.,
eksctl-<cluster_name>-*) are matched; the change should be applied where
all_stacks is computed (the aws cloudformation list-stacks --query expression)
to mirror the earlier fix.
- Around line 40-43: The CloudFormation query is using starts_with(StackName,
'eksctl-${CLUSTER_PREFIX}') which also matches names like eksctl-nvs2; update
the query to include the trailing '-' so it only matches cluster stacks (e.g.,
starts_with(StackName, 'eksctl-${CLUSTER_PREFIX}-')). Locate the snippet that
sets the stacks variable (the aws cloudformation list-stacks call) and change
the starts_with second argument to include the hyphen suffix; verify this change
also fixes downstream parsing in get_cluster_names_from_stacks which expects the
'-' separator.
- Around line 242-257: The case ordering misclassifies cluster stacks; in the
for loop over "for stack in $all_stacks" adjust the case statement so the
"*-cluster)" pattern is tested before the "*-nodegroup-*)" and "*-addon-*)"
patterns so stacks ending with "-cluster" are assigned to cluster_stack (instead
of being caught by nodegroup_stacks/addon_stacks); update the case order and
keep assignments to nodegroup_stacks, addon_stacks and cluster_stack unchanged.
- Around line 133-135: The current use of "|| true" after the aws iam
list-open-id-connect-providers call hides failures and yields an empty
provider_arn that later triggers the misleading "No OIDC provider found matching
issuer: …" log; remove the "|| true", capture the command exit status (e.g., run
the aws command in a subshell and check $? or use set -e with a trap), and if
the AWS call fails print a clear warning/error and return early instead of
proceeding (reference the provider_arn assignment and the subsequent "No OIDC
provider found matching issuer" log to update control flow).


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
tests/uat/aws/delete-eks-cluster.sh (2)

96-107: Misleading comment: % (shortest suffix) does NOT protect cluster names containing -nodegroup- or -addon-.

The comment on lines 98–99 claims single % prevents truncation of cluster names that contain -nodegroup- or -addon-. This is incorrect. For example, a cluster named nvs-nodegroup-prod (stack eksctl-nvs-nodegroup-prod-cluster) would be processed as:

nvs-nodegroup-prod-cluster
  →  nvs-nodegroup-prod   (after  %-cluster)
  →  nvs                  (after  %-nodegroup-*, shortest suffix = "-nodegroup-prod") ← WRONG

Using %% (longest suffix) would not help either — the ambiguity is fundamental. In practice this is not an issue because NVSentinel cluster names (prefixed nvs) don't embed these keywords, but the comment should be corrected to avoid confusion.

📝 Suggested comment correction
-    # Stack names follow: eksctl-{CLUSTER}-cluster, eksctl-{CLUSTER}-nodegroup-{NG},
-    # eksctl-{CLUSTER}-addon-{ADDON}. We strip the eksctl- prefix and known
-    # suffixes. Using shortest-suffix parameter expansion (single %) so that
-    # cluster names containing "-nodegroup-" or "-addon-" are not truncated.
+    # Stack names follow: eksctl-{CLUSTER}-cluster, eksctl-{CLUSTER}-nodegroup-{NG},
+    # eksctl-{CLUSTER}-addon-{ADDON}. We strip the eksctl- prefix and the first
+    # matching known suffix. Note: cluster names that themselves contain
+    # "-nodegroup-" or "-addon-" will be mis-parsed; this is not expected for
+    # clusters created under the CLUSTER_PREFIX convention.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/uat/aws/delete-eks-cluster.sh` around lines 96 - 107, The comment above
the for-loop that trims stack names (using parameter expansions on variable
name, e.g., name="${stack#eksctl-}" and name="${name%-nodegroup-*}") is
incorrect about single % protecting cluster names containing "-nodegroup-" or
"-addon-"; update the comment to state that %-suffix removes the shortest
matching suffix and therefore will truncate any cluster name that contains those
substrings (this behavior is inherent to shell pattern removal and cannot be
fixed by using % vs %%), and note the real reason this is safe in practice
(NVSentinel cluster names do not contain "-nodegroup-" or "-addon-") so readers
aren’t misled.
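The mis-parse is easy to reproduce with plain parameter expansion; the cluster name below is invented specifically to contain the keyword:

```shell
# A cluster literally named "nvs-nodegroup-prod" yields this eksctl stack name:
stack="eksctl-nvs-nodegroup-prod-cluster"

name="${stack#eksctl-}"          # -> nvs-nodegroup-prod-cluster
name="${name%-cluster}"          # -> nvs-nodegroup-prod
shortest="${name%-nodegroup-*}"  # %  removes the shortest matching suffix
longest="${name%%-nodegroup-*}"  # %% removes the longest matching suffix

# Only one suffix matches "-nodegroup-*", so both expansions truncate the name:
echo "shortest: $shortest"       # prints "shortest: nvs"
echo "longest:  $longest"        # prints "longest:  nvs"
```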

322-404: delete_gpu_subnets always returns 0, making the callers' error-tracking dead code.

Every exit path returns 0 (explicit return 0 on lines 335/340, and the loop body uses || log "WARNING: …" instead of tracking failures). As a result:

  • Line 292: if ! delete_gpu_subnets "$cluster_name" can never be true — GPU subnet failures are never reflected in delete_cluster_stacks's failed counter.
  • Line 254: the return value is also silently ignored for the no-stacks path.

Consider tracking subnet-level failures and returning non-zero:

♻️ Proposed refactor (representative — apply to each subnet operation)
 delete_gpu_subnets() {
     local cluster_name="$1"
+    local failed=0

     ...

         # Delete the subnet
         aws ec2 delete-subnet \
             --region "$AWS_REGION" \
-            --subnet-id "$subnet_id" || log "WARNING: Failed to delete subnet $subnet_id"
+            --subnet-id "$subnet_id" || { log "WARNING: Failed to delete subnet $subnet_id"; failed=1; }
     done

     log "GPU subnet cleanup complete for cluster $cluster_name"
+    return "$failed"
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/uat/aws/delete-eks-cluster.sh` around lines 322 - 404, The
delete_gpu_subnets function currently swallows all errors and always returns
success; update delete_gpu_subnets to track per-subnet operation failures and
return non-zero if any AWS call fails so callers like delete_cluster_stacks can
detect failures. Concretely, add a local exit_code=0 at the top of
delete_gpu_subnets, change each "|| log ..." usage to set exit_code=1 when an
AWS command fails (e.g., disassociate-route-table, detach-network-interface,
delete-network-interface, delete-subnet, and wait_for_eni_detach failures), and
return $exit_code at the end; preserve existing log messages and still continue
attempting cleanup for other subnets but ensure any transient or fatal error
flips the exit code. Ensure wait_for_eni_detach remains invoked and its failure
is counted toward exit_code.
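The continue-but-remember pattern the prompt describes can be sketched standalone; `delete_subnet_stub` and the subnet IDs are invented stand-ins for the real `aws ec2 delete-subnet` call:

```shell
# Sketch: attempt cleanup for every subnet, remember failures, report at the
# end. "subnet-bad" is an invented ID that simulates a failing deletion.
delete_subnet_stub() {
    [ "$1" != "subnet-bad" ]
}

delete_gpu_subnets() {
    local exit_code=0
    local subnet_id
    for subnet_id in "$@"; do
        if delete_subnet_stub "$subnet_id"; then
            echo "deleted $subnet_id"
        else
            echo "WARNING: Failed to delete subnet $subnet_id"
            exit_code=1        # remember the failure but keep cleaning up
        fi
    done
    return "$exit_code"
}

if ! delete_gpu_subnets subnet-ok subnet-bad subnet-ok2; then
    echo "cleanup finished with failures"   # now visible to the caller
fi
```

The key design choice is that a failure sets the flag instead of aborting the loop, so one stuck subnet does not leave the remaining subnets orphaned, yet the function's exit status still reflects the failure.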
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/uat/aws/delete-eks-cluster.sh`:
- Around line 192-203: The loop that populates active_oidc_ids for each cluster
(iterating active_clusters and running aws eks describe-cluster to set
cluster_oidc) currently swallows failures with "|| continue", which can omit an
active cluster's OIDC and cause its provider to be deleted; replace this
behaviour so that any failure to run aws eks describe-cluster for a cluster (or
to obtain cluster_oidc) causes the script to abort the orphan-detection pass:
remove the "|| continue", check the command's exit status or whether
cluster_oidc is non-empty/valid (and not "None"), and on error log a clear
message (including $cluster and $AWS_REGION) and exit with non-zero instead of
continuing; keep appending to active_oidc_ids only when cluster_oidc is valid.



@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
tests/uat/aws/delete-eks-cluster.sh (2)

157-162: OIDC provider delete failure is silently swallowed.

On line 159, || log "WARNING: ..." prevents set -e from aborting but also causes delete_oidc_provider to return 0. The caller at line 68 checks for failure (|| { ... failed=1 }), so a failed deletion won't be reflected in the final exit code.

If best-effort semantics are intentional here, this is fine — but it should be explicit.

Proposed fix to propagate the failure
     log "Deleting OIDC provider: $provider_arn"
-    aws iam delete-open-id-connect-provider \
-        --open-id-connect-provider-arn "$provider_arn" || log "WARNING: Failed to delete OIDC provider $provider_arn"
+    aws iam delete-open-id-connect-provider \
+        --open-id-connect-provider-arn "$provider_arn" || {
+        log "WARNING: Failed to delete OIDC provider $provider_arn"
+        return 1
+    }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/uat/aws/delete-eks-cluster.sh` around lines 157 - 162, The OIDC
deletion currently swallows aws iam errors via "|| log ..." which makes
delete_oidc_provider appear successful even on failure; update
delete_oidc_provider so the aws iam delete-open-id-connect-provider call
explicitly detects failure (e.g., capture its exit status) and returns a
non-zero exit code when it fails (and still logs the warning with provider_arn),
or if you intend best-effort semantics make that explicit by returning success
but documenting it; ensure the caller's expected behavior (the caller that does
"|| { ... failed=1 }") receives a non-zero return when deletion fails.
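The propagation fix amounts to the standard log-and-return guard; a minimal sketch with the `aws iam` call replaced by a hypothetical stub:

```shell
# Under "set -e", `cmd || log ...` keeps the script alive but discards cmd's
# failure status. Grouping the fallback and returning non-zero preserves it.
set -euo pipefail

# Hypothetical stand-in for: aws iam delete-open-id-connect-provider ...
delete_provider_stub() {
    [ "$1" != "arn:broken" ]
}

delete_oidc_provider() {
    local provider_arn="$1"
    delete_provider_stub "$provider_arn" || {
        echo "WARNING: Failed to delete OIDC provider $provider_arn"
        return 1               # the caller's `|| { ...; failed=1; }` now fires
    }
    echo "Deleted $provider_arn"
}

failed=0
delete_oidc_provider "arn:ok"     || failed=1
delete_oidc_provider "arn:broken" || failed=1
echo "failed=$failed"          # prints "failed=1"
```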

430-459: wait_for_eni_detach always returns 0, making its caller's error check dead code.

The function returns 0 on both success (line 447) and timeout (line 458). This means the check on lines 402-404 (if ! wait_for_eni_detach …; then exit_code=1; fi) can never trigger.

If the intention is best-effort (proceed to delete regardless), the function's behavior is fine, but consider either returning 1 on timeout so the caller's check is meaningful, or removing the dead if ! check at line 402.

Option A: return 1 on timeout so the caller check is effective
     if [ "$elapsed" -ge "$timeout" ]; then
         log "WARNING: Timed out waiting for ENI $eni_id to detach after ${timeout}s; proceeding with delete attempt"
     fi
 
-    return 0
+    return 1
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/uat/aws/delete-eks-cluster.sh` around lines 430 - 459, The helper
wait_for_eni_detach always returns 0 (success) even after timing out, so the
caller's error check "if ! wait_for_eni_detach …; then exit_code=1; fi" can
never trigger; fix by changing wait_for_eni_detach (the function) to return a
non-zero exit code on timeout (e.g., return 1 when elapsed >= timeout) so
callers see failure, or alternatively remove the dead caller check if you intend
best-effort behavior—prefer the first option to preserve the existing caller
logic.
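A generic poll-until-or-timeout helper that propagates timeout as a non-zero return can be sketched as follows; `wait_for_condition` and the stub condition are illustrative, not the script's actual helpers:

```shell
# Poll a condition until it holds or a timeout elapses. Returns 1 on timeout
# so a caller's "if ! wait_for_condition ...; then ...; fi" check is meaningful.
wait_for_condition() {
    local check_cmd="$1" timeout="${2:-30}" interval="${3:-5}"
    local elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        if "$check_cmd"; then
            return 0                       # condition met
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    echo "WARNING: timed out after ${timeout}s"
    return 1                               # let the caller decide what to do
}

# Illustration: a condition that becomes true on the third poll.
attempts=0
detached_after_three() {
    attempts=$((attempts + 1))
    [ "$attempts" -ge 3 ]
}

if wait_for_condition detached_after_three 10 1; then
    echo "detached"
fi
```

Returning 1 on timeout still allows best-effort callers to proceed (they simply ignore the status), while callers that do check the status get a meaningful signal.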
