fix:Improve delete-eks-cluster.sh to perform comprehensive cleanup#878
fix:Improve delete-eks-cluster.sh to perform comprehensive cleanup#878cbumb wants to merge 9 commits intoNVIDIA:mainfrom
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds Changes
Sequence DiagramsequenceDiagram
participant Script as Deletion Script
participant CFN as CloudFormation
participant EKS as EKS/IAM
participant AWS as AWS Services
Script->>CFN: Describe stacks / list eksctl-style stacks
CFN-->>Script: Stack list
Script->>Script: Extract unique cluster names
rect rgba(100,150,200,0.5)
Note over Script,EKS: OIDC provider cleanup
Script->>EKS: Describe/delete OIDC provider
EKS-->>Script: Status / errors
Script->>EKS: List orphaned OIDC providers
EKS-->>Script: Orphan list
end
rect rgba(100,200,150,0.5)
Note over Script,CFN: Multi-phase stack deletion
Script->>CFN: Phase 1 - delete nodegroup stacks
CFN-->>Script: Deletion initiated
Script->>AWS: Phase 2 - delete GPU subnets (disassoc routes, detach ENIs)
AWS-->>Script: Ops progress
Script->>CFN: Phase 3 - delete addon/IAM stacks
CFN-->>Script: Deletion initiated
Script->>CFN: Phase 4 - delete cluster stack
CFN-->>Script: Deletion initiated
end
rect rgba(200,150,100,0.5)
Note over Script,CFN: Polling & completion
loop Until deletion complete or timeout
Script->>CFN: Poll stack status
CFN-->>Script: Status update
end
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (4)
tests/uat/aws/delete-eks-cluster.sh (4)
339-351: Hardcodedsleep 5may not be enough for ENI detach to complete.After force-detaching an ENI (line 341-344), the 5-second sleep on line 345 is a best-effort wait before deletion. If detachment takes longer (e.g., under high API latency), the
delete-network-interfacecall on line 349 may fail. The|| log "WARNING: ..."fallback prevents script failure, but the ENI would be left behind.A small poll on the ENI attachment status (e.g., checking for
"detached"or"available"status) with a timeout would be more robust. That said, for a CI teardown script where re-runs are possible, this is acceptable as-is.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/uat/aws/delete-eks-cluster.sh` around lines 339 - 351, Replace the fixed sleep after force-detaching the ENI with a short polling loop that checks the ENI attachment/status via aws ec2 describe-network-interfaces (use the same $eni_id and $AWS_REGION variables), waiting until the interface shows no attachment or status is "available" or until a configurable timeout (e.g., 60s) is reached; if the timeout expires log a warning and continue to the delete step so delete-network-interface is attempted but more likely to succeed if detachment completed.
412-414: OnDELETE_FAILED, log the reason from stack events to aid debugging.When a stack enters
DELETE_FAILED, CloudFormation records the reason in stack events. Currently the script only logs a generic warning, making it hard to diagnose CI teardown failures. Fetching the latest failure reason would save debugging time.🔍 Proposed enhancement
DELETE_FAILED) - log "WARNING: Stack $stack_name deletion failed" + log "WARNING: Stack $stack_name deletion failed. Recent events:" + aws cloudformation describe-stack-events \ + --region "$AWS_REGION" \ + --stack-name "$stack_name" \ + --query 'StackEvents[?ResourceStatus==`DELETE_FAILED`].[LogicalResourceId,ResourceStatusReason]' \ + --output table 2>/dev/null || true return 1🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/uat/aws/delete-eks-cluster.sh` around lines 412 - 414, On the DELETE_FAILED branch, replace the generic warning with a log that includes the CloudFormation failure reason: call AWS CLI to fetch recent stack events for the current stack_name (e.g., use aws cloudformation describe-stack-events --stack-name "$stack_name" and inspect the top event's ResourceStatusReason or a matching DELETE_FAILED event) and include that reason in the process log (log "WARNING: Stack $stack_name deletion failed: <reason>"). Ensure this runs inside the DELETE_FAILED case so the reason is captured and printed for debugging.
151-152: Unused parametercluster_name.The
cluster_nameparameter on line 152 is accepted but never referenced in the function body. This could confuse maintainers into thinking the function scopes its cleanup to the given cluster. Consider removing it or adding a comment explaining why it's passed (e.g., for future use or logging).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/uat/aws/delete-eks-cluster.sh` around lines 151 - 152, The function delete_orphaned_oidc_providers declares an unused parameter cluster_name; either remove the parameter from the function signature and all callers, or use cluster_name (e.g., add an explanatory comment or a log line inside delete_orphaned_oidc_providers referencing the cluster) so its presence is justified. Update any invocations of delete_orphaned_oidc_providers to match the new signature if you remove the parameter, or add a simple log like "Cleaning OIDC providers for ${cluster_name}" if you keep it.
82-101: Suffix stripping assumes cluster names don't contain-nodegroup-or-addon-.The sequential
sedreplacements on lines 93-95 would incorrectly truncate a cluster name that happens to contain-nodegroup-or-addon-as a substring. Given the current naming convention (nvs-d<run_id>) this is safe, but worth a comment for future maintainers.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/uat/aws/delete-eks-cluster.sh` around lines 82 - 101, In get_cluster_names_from_stacks, the sed replacements can accidentally truncate valid cluster names that include the substrings "-nodegroup-" or "-addon-"; update the two sed calls to only strip those patterns when they appear as trailing components by using stricter end-anchored patterns (e.g. change "s/-nodegroup-.*$//" to "s/-nodegroup-[^-]*$//" and "s/-addon-.*$//" to "s/-addon-[^-]*$//"), and add a short comment above the block documenting the assumption about cluster name formats for future maintainers.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/uat/aws/delete-eks-cluster.sh`:
- Around line 151-168: In delete_orphaned_oidc_providers, don't swallow failures
of aws eks list-clusters; remove the "|| true" fallback for active_clusters,
check its exit status and if listing clusters fails log an error (with context)
and return a non-zero status (or exit) instead of proceeding—this ensures
active_clusters is valid before the loop that uses it and prevents deleting OIDC
providers when list-clusters errored.
---
Nitpick comments:
In `@tests/uat/aws/delete-eks-cluster.sh`:
- Around line 339-351: Replace the fixed sleep after force-detaching the ENI
with a short polling loop that checks the ENI attachment/status via aws ec2
describe-network-interfaces (use the same $eni_id and $AWS_REGION variables),
waiting until the interface shows no attachment or status is "available" or
until a configurable timeout (e.g., 60s) is reached; if the timeout expires log
a warning and continue to the delete step so delete-network-interface is
attempted but more likely to succeed if detachment completed.
- Around line 412-414: On the DELETE_FAILED branch, replace the generic warning
with a log that includes the CloudFormation failure reason: call AWS CLI to
fetch recent stack events for the current stack_name (e.g., use aws
cloudformation describe-stack-events --stack-name "$stack_name" and inspect the
top event's ResourceStatusReason or a matching DELETE_FAILED event) and include
that reason in the process log (log "WARNING: Stack $stack_name deletion failed:
<reason>"). Ensure this runs inside the DELETE_FAILED case so the reason is
captured and printed for debugging.
- Around line 151-152: The function delete_orphaned_oidc_providers declares an
unused parameter cluster_name; either remove the parameter from the function
signature and all callers, or use cluster_name (e.g., add an explanatory comment
or a log line inside delete_orphaned_oidc_providers referencing the cluster) so
its presence is justified. Update any invocations of
delete_orphaned_oidc_providers to match the new signature if you remove the
parameter, or add a simple log like "Cleaning OIDC providers for
${cluster_name}" if you keep it.
- Around line 82-101: In get_cluster_names_from_stacks, the sed replacements can
accidentally truncate valid cluster names that include the substrings
"-nodegroup-" or "-addon-"; update the two sed calls to only strip those
patterns when they appear as trailing components by using stricter end-anchored
patterns (e.g. change "s/-nodegroup-.*$//" to "s/-nodegroup-[^-]*$//" and
"s/-addon-.*$//" to "s/-addon-[^-]*$//"), and add a short comment above the
block documenting the assumption about cluster name formats for future
maintainers.
Merging this branch will decrease overall coverage
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. Changed unit test files
|
There was a problem hiding this comment.
🧹 Nitpick comments (3)
tests/uat/aws/delete-eks-cluster.sh (3)
82-101: Minor:get_cluster_names_from_stacksproduces a leading space in its output.
cluster_namesis built as" name1 name2"(leading space from line 96). Aftertr | sort -u | tr, the empty element becomes a leading space in the result. This is functionally harmless — word splitting in the caller'sforloop (line 61) skips it — but the log on line 58 will show a leading space.Trim leading/trailing whitespace
# Return unique names - echo "$cluster_names" | tr ' ' '\n' | sort -u | tr '\n' ' ' + echo "$cluster_names" | tr ' ' '\n' | sort -u | grep -v '^$' | tr '\n' ' ' | sed 's/^ *//;s/ *$//'Or initialize
cluster_nameswithout the leading space:- cluster_names="${cluster_names} ${name}" + cluster_names="${cluster_names:+$cluster_names }${name}"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/uat/aws/delete-eks-cluster.sh` around lines 82 - 101, get_cluster_names_from_stacks currently builds cluster_names with a leading space; change the append logic for cluster_names (in function get_cluster_names_from_stacks where cluster_names and name are used) so the first item is added without a leading separator and subsequent items are prefixed with a single space (e.g. use a conditional append or the ${cluster_names:+...} pattern) and keep the final echo/tr|sort -u|tr pipeline intact so the function returns the same unique list but without leading/trailing whitespace.
373-398:wait_for_eni_detachreturns 0 even on timeout — confirm this is intentional.When the
|| breakfires (API error, line 385) or the timeout expires (line 395), the function always returns 0 (the exit status of the finalifstatement orlog). The caller on line 352 proceeds todelete-network-interfaceregardless, so this works, but an explicitreturn 0at the bottom would make the intent clearer.Make the implicit return explicit
if [ "$elapsed" -ge "$timeout" ]; then log "WARNING: Timed out waiting for ENI $eni_id to detach after ${timeout}s; proceeding with delete attempt" fi + return 0 }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/uat/aws/delete-eks-cluster.sh` around lines 373 - 398, The helper function wait_for_eni_detach currently falls through and returns the last command's exit status implicitly; make the intent explicit by adding an explicit return value: ensure that on successful detection of an available/empty ENI the function returns 0, and when the loop times out (after logging the warning) also explicitly return 0 so callers know we intentionally proceed with deletion; modify the function wait_for_eni_detach to include explicit return 0 in both the success path and after the timeout warning (and keep the existing break/early-exit behavior unchanged).
116-120:delete_orphaned_oidc_providersfailure is silently swallowed by the caller.Because
delete_orphaned_oidc_providers(line 118) is invoked inside a|| { … }block,set -eis inhibited. If the function returns 1 (e.g.,list-clustersfails on line 172), execution continues toreturn 0on line 119, and the main loop moves on without any indication of the failure propagating upward.The safety behavior is preserved (no accidental deletions), and the
ERRORlog inside the function provides visibility. For a cleanup script this is probably acceptable, but it's worth being aware that OIDC orphan cleanup could silently be skipped for a transient API failure.Option: propagate the error to make it visible in CI
--output text 2>/dev/null) || { log "WARNING: Could not describe cluster $cluster_name (may already be deleted). Searching for orphaned OIDC providers..." - delete_orphaned_oidc_providers "$cluster_name" - return 0 + delete_orphaned_oidc_providers "$cluster_name" || log "WARNING: Orphaned OIDC cleanup failed for $cluster_name (see error above)" + return 0 # continue cleanup even if orphaned OIDC cleanup fails }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/uat/aws/delete-eks-cluster.sh` around lines 116 - 120, The current error path swallows failures from delete_orphaned_oidc_providers when the aws describe call fails because the function is invoked inside a `|| { ...; return 0; }` block; update the logic so that after calling delete_orphaned_oidc_providers "$cluster_name" you capture its exit status and if it is non-zero propagate that failure (exit or return non-zero) instead of unconditionally returning 0, ensuring set -e behavior is preserved and failures in delete_orphaned_oidc_providers surface in CI; reference the call to delete_orphaned_oidc_providers and the variable cluster_name when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@tests/uat/aws/delete-eks-cluster.sh`:
- Around line 165-175: The change correctly replaces the previous `|| true` with
a fail-fast block that logs an ERROR and returns 1 when `aws eks list-clusters`
fails; keep this fail-fast logic around the `active_clusters=$(aws eks
list-clusters ...) || { log "ERROR: ..."; return 1; }` block and ensure any
caller of this script/function checks and propagates the non-zero return (or
exits) so orphaned OIDC cleanup is never run with an empty `active_clusters` or
unset `AWS_REGION`; reference `active_clusters`, `AWS_REGION`, and the `log`
call to locate the code.
---
Nitpick comments:
In `@tests/uat/aws/delete-eks-cluster.sh`:
- Around line 82-101: get_cluster_names_from_stacks currently builds
cluster_names with a leading space; change the append logic for cluster_names
(in function get_cluster_names_from_stacks where cluster_names and name are
used) so the first item is added without a leading separator and subsequent
items are prefixed with a single space (e.g. use a conditional append or the
${cluster_names:+...} pattern) and keep the final echo/tr|sort -u|tr pipeline
intact so the function returns the same unique list but without leading/trailing
whitespace.
- Around line 373-398: The helper function wait_for_eni_detach currently falls
through and returns the last command's exit status implicitly; make the intent
explicit by adding an explicit return value: ensure that on successful detection
of an available/empty ENI the function returns 0, and when the loop times out
(after logging the warning) also explicitly return 0 so callers know we
intentionally proceed with deletion; modify the function wait_for_eni_detach to
include explicit return 0 in both the success path and after the timeout warning
(and keep the existing break/early-exit behavior unchanged).
- Around line 116-120: The current error path swallows failures from
delete_orphaned_oidc_providers when the aws describe call fails because the
function is invoked inside a `|| { ...; return 0; }` block; update the logic so
that after calling delete_orphaned_oidc_providers "$cluster_name" you capture
its exit status and if it is non-zero propagate that failure (exit or return
non-zero) instead of unconditionally returning 0, ensuring set -e behavior is
preserved and failures in delete_orphaned_oidc_providers surface in CI;
reference the call to delete_orphaned_oidc_providers and the variable
cluster_name when making the change.
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (2)
tests/uat/aws/delete-eks-cluster.sh (2)
418-421: Phase-level stack deletion failures are silently swallowed, resulting in exit 0 even when resources leak.
wait_for_stack_deletionreturns 1 onDELETE_FAILED, but|| trueeats it. If Phase 1 (nodegroup) stacks fail to delete, Phases 2–4 still run and their own stack deletions will likely fail too (nodegroup ENIs still hold VPC resources). The job exits green while stacks and subnets remain.Consider tracking a failure flag and returning a non-zero exit from
mainwhen any phase reports a deletion failure:♻️ Proposed refactor
- # Wait for all to complete - for stack in $stacks; do - wait_for_stack_deletion "$stack" || true - done + # Wait for all to complete; accumulate failures + local failed=0 + for stack in $stacks; do + wait_for_stack_deletion "$stack" || { log "WARNING: Stack $stack did not delete cleanly"; failed=1; } + done + return "$failed"Then in
delete_cluster_stacks, propagate the flag from each phase and return it tomainso the script can exit non-zero.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/uat/aws/delete-eks-cluster.sh` around lines 418 - 421, The loop that calls wait_for_stack_deletion currently swallows failures with "|| true", causing the script to exit 0 even when deletions fail; change the logic to track failures instead: remove "|| true", capture the return status of wait_for_stack_deletion inside delete_cluster_stacks (e.g., set a local failure flag if any call returns non-zero), propagate that flag out of delete_cluster_stacks (return non-zero when any phase failed), have main check the return value and exit with a non-zero code when failures occurred so phase-level DELETE_FAILED results cause the job to fail and stop subsequent phases when appropriate.
190-211: O(N×M)describe-clustercalls in the orphan detection loop.For every OIDC provider that matches the regional EKS URL, the inner loop calls
describe-clusteronce per active cluster. Pre-computing OIDC IDs for all active clusters before the outer loop reduces this to O(N) calls.♻️ Proposed refactor
+ # Pre-compute OIDC IDs for all active clusters (O(N) calls) + local active_oidc_ids="" + for cluster in $active_clusters; do + local cluster_oidc_url + cluster_oidc_url=$(aws eks describe-cluster \ + --name "$cluster" \ + --region "$AWS_REGION" \ + --query 'cluster.identity.oidc.issuer' \ + --output text 2>/dev/null) || continue + active_oidc_ids="${active_oidc_ids} ${cluster_oidc_url##*/}" + done + for arn in $provider_arns; do # EKS OIDC providers have URLs like: oidc.eks.{region}.amazonaws.com/id/{ID} local provider_url provider_url=$(aws iam get-open-id-connect-provider \ --open-id-connect-provider-arn "$arn" \ --query 'Url' \ --output text 2>/dev/null) || continue if echo "$provider_url" | grep -q "oidc.eks.${AWS_REGION}.amazonaws.com"; then local oidc_id oidc_id="${provider_url##*/}" local is_orphaned=true - for cluster in $active_clusters; do - local cluster_oidc - cluster_oidc=$(aws eks describe-cluster \ - --name "$cluster" \ - --region "$AWS_REGION" \ - --query 'cluster.identity.oidc.issuer' \ - --output text 2>/dev/null) || continue - - if echo "$cluster_oidc" | grep -qF "$oidc_id"; then + for active_id in $active_oidc_ids; do + if [[ "$active_id" == "$oidc_id" ]]; then is_orphaned=false break fi done🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/uat/aws/delete-eks-cluster.sh` around lines 190 - 211, Precompute each active cluster's OIDC issuer once before iterating OIDC providers: iterate over active_clusters, call aws eks describe-cluster to collect and store cluster identity OIDC IDs (e.g., in an array or associative map keyed by cluster name or just a newline-separated string variable), then replace the inner per-provider describe-cluster loop that sets cluster_oidc and checks grep -qF "$oidc_id" with a membership check against the precomputed collection (use the same oidc_id/arn/provider_url variable names and preserve the is_orphaned boolean logic and log/delete calls in the outer loop).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/uat/aws/delete-eks-cluster.sh`:
- Around line 222-229: The CloudFormation stack filter uses
starts_with(StackName, 'eksctl-${cluster_name}') which matches sibling clusters;
update the AWS CLI --query expression used when setting all_stacks to require
the trailing hyphen by changing the starts_with pattern to
'eksctl-${cluster_name}-' so only stacks belonging to this cluster (e.g.,
eksctl-<cluster_name>-*) are matched; the change should be applied where
all_stacks is computed (the aws cloudformation list-stacks --query expression)
to mirror the earlier fix.
- Around line 40-43: The CloudFormation query is using starts_with(StackName,
'eksctl-${CLUSTER_PREFIX}') which also matches names like eksctl-nvs2; update
the query to include the trailing '-' so it only matches cluster stacks (e.g.,
starts_with(StackName, 'eksctl-${CLUSTER_PREFIX}-')). Locate the snippet that
sets the stacks variable (the aws cloudformation list-stacks call) and change
the starts_with second argument to include the hyphen suffix; verify this change
also fixes downstream parsing in get_cluster_names_from_stacks which expects the
'-' separator.
- Around line 242-257: The case ordering misclassifies cluster stacks; in the
for loop over "for stack in $all_stacks" adjust the case statement so the
"*-cluster)" pattern is tested before the "*-nodegroup-*)" and "*-addon-*)"
patterns so stacks ending with "-cluster" are assigned to cluster_stack (instead
of being caught by nodegroup_stacks/addon_stacks); update the case order and
keep assignments to nodegroup_stacks, addon_stacks and cluster_stack unchanged.
- Around line 133-135: The current use of "|| true" after the aws iam
list-open-id-connect-providers call hides failures and yields an empty
provider_arn that later triggers the misleading "No OIDC provider found matching
issuer: …" log; remove the "|| true", capture the command exit status (e.g., run
the aws command in a subshell and check $? or use set -e with a trap), and if
the AWS call fails print a clear warning/error and return early instead of
proceeding (reference the provider_arn assignment and the subsequent "No OIDC
provider found matching issuer" log to update control flow).
---
Nitpick comments:
In `@tests/uat/aws/delete-eks-cluster.sh`:
- Around line 418-421: The loop that calls wait_for_stack_deletion currently
swallows failures with "|| true", causing the script to exit 0 even when
deletions fail; change the logic to track failures instead: remove "|| true",
capture the return status of wait_for_stack_deletion inside
delete_cluster_stacks (e.g., set a local failure flag if any call returns
non-zero), propagate that flag out of delete_cluster_stacks (return non-zero
when any phase failed), have main check the return value and exit with a
non-zero code when failures occurred so phase-level DELETE_FAILED results cause
the job to fail and stop subsequent phases when appropriate.
- Around line 190-211: Precompute each active cluster's OIDC issuer once before
iterating OIDC providers: iterate over active_clusters, call aws eks
describe-cluster to collect and store cluster identity OIDC IDs (e.g., in an
array or associative map keyed by cluster name or just a newline-separated
string variable), then replace the inner per-provider describe-cluster loop that
sets cluster_oidc and checks grep -qF "$oidc_id" with a membership check against
the precomputed collection (use the same oidc_id/arn/provider_url variable names
and preserve the is_orphaned boolean logic and log/delete calls in the outer
loop).
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
tests/uat/aws/delete-eks-cluster.sh (2)
96-107: Misleading comment:%(shortest suffix) does NOT protect cluster names containing-nodegroup-or-addon-.The comment on lines 98–99 claims single
%prevents truncation of cluster names that contain-nodegroup-or-addon-. This is incorrect. For example, a cluster namednvs-nodegroup-prod(stackeksctl-nvs-nodegroup-prod-cluster) would be processed as:nvs-nodegroup-prod-cluster → nvs-nodegroup-prod (after %-cluster) → nvs (after %-nodegroup-*, shortest suffix = "-nodegroup-prod") ← WRONGUsing
%%(longest suffix) would not help either — the ambiguity is fundamental. In practice this is not an issue because NVSentinel cluster names (prefixednvs) don't embed these keywords, but the comment should be corrected to avoid confusion.📝 Suggested comment correction
- # Stack names follow: eksctl-{CLUSTER}-cluster, eksctl-{CLUSTER}-nodegroup-{NG}, - # eksctl-{CLUSTER}-addon-{ADDON}. We strip the eksctl- prefix and known - # suffixes. Using shortest-suffix parameter expansion (single %) so that - # cluster names containing "-nodegroup-" or "-addon-" are not truncated. + # Stack names follow: eksctl-{CLUSTER}-cluster, eksctl-{CLUSTER}-nodegroup-{NG}, + # eksctl-{CLUSTER}-addon-{ADDON}. We strip the eksctl- prefix and the first + # matching known suffix. Note: cluster names that themselves contain + # "-nodegroup-" or "-addon-" will be mis-parsed; this is not expected for + # clusters created under the CLUSTER_PREFIX convention.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/uat/aws/delete-eks-cluster.sh` around lines 96 - 107, The comment above the for-loop that trims stack names (using parameter expansions on variable name, e.g., name="${stack#eksctl-}" and name="${name%-nodegroup-*}") is incorrect about single % protecting cluster names containing "-nodegroup-" or "-addon-"; update the comment to state that %-suffix removes the shortest matching suffix and therefore will truncate any cluster name that contains those substrings (this behavior is inherent to shell pattern removal and cannot be fixed by using % vs %%), and note the real reason this is safe in practice (NVSentinel cluster names do not contain "-nodegroup-" or "-addon-") so readers aren’t misled.
322-404:delete_gpu_subnetsalways returns 0, making the callers' error-tracking dead code.Every exit path returns 0 (explicit
return 0on lines 335/340, and the loop body uses|| log "WARNING: …"instead of tracking failures). As a result:
- Line 292:
if ! delete_gpu_subnets "$cluster_name"can never be true — GPU subnet failures are never reflected indelete_cluster_stacks'sfailedcounter.- Line 254: the return value is also silently ignored for the no-stacks path.
Consider tracking subnet-level failures and returning non-zero:
♻️ Proposed refactor (representative — apply to each subnet operation)
delete_gpu_subnets() { local cluster_name="$1" + local failed=0 ... # Delete the subnet aws ec2 delete-subnet \ --region "$AWS_REGION" \ - --subnet-id "$subnet_id" || log "WARNING: Failed to delete subnet $subnet_id" + --subnet-id "$subnet_id" || { log "WARNING: Failed to delete subnet $subnet_id"; failed=1; } done log "GPU subnet cleanup complete for cluster $cluster_name" + return "$failed" }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/uat/aws/delete-eks-cluster.sh` around lines 322 - 404, The delete_gpu_subnets function currently swallows all errors and always returns success; update delete_gpu_subnets to track per-subnet operation failures and return non-zero if any AWS call fails so callers like delete_cluster_stacks can detect failures. Concretely, add a local exit_code=0 at the top of delete_gpu_subnets, change each "|| log ..." usage to set exit_code=1 when an AWS command fails (e.g., disassociate-route-table, detach-network-interface, delete-network-interface, delete-subnet, and wait_for_eni_detach failures), and return $exit_code at the end; preserve existing log messages and still continue attempting cleanup for other subnets but ensure any transient or fatal error flips the exit code. Ensure wait_for_eni_detach remains invoked and its failure is counted toward exit_code.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/uat/aws/delete-eks-cluster.sh`:
- Around line 192-203: The loop that populates active_oidc_ids for each cluster
(iterating active_clusters and running aws eks describe-cluster to set
cluster_oidc) currently swallows failures with "|| continue", which can omit an
active cluster's OIDC and cause its provider to be deleted; replace this
behaviour so that any failure to run aws eks describe-cluster for a cluster (or
to obtain cluster_oidc) causes the script to abort the orphan-detection pass:
remove the "|| continue", check the command's exit status or whether
cluster_oidc is non-empty/valid (and not "None"), and on error log a clear
message (including $cluster and $AWS_REGION) and exit with non-zero instead of
continuing; keep appending to active_oidc_ids only when cluster_oidc is valid.
---
Nitpick comments:
In `@tests/uat/aws/delete-eks-cluster.sh`:
- Around line 96-107: The comment above the for-loop that trims stack names
(using parameter expansions on variable name, e.g., name="${stack#eksctl-}" and
name="${name%-nodegroup-*}") is incorrect about single % protecting cluster
names containing "-nodegroup-" or "-addon-"; update the comment to state that
%-suffix removes the shortest matching suffix and therefore will truncate any
cluster name that contains those substrings (this behavior is inherent to shell
pattern removal and cannot be fixed by using % vs %%), and note the real reason
this is safe in practice (NVSentinel cluster names do not contain "-nodegroup-"
or "-addon-") so readers aren’t misled.
- Around line 322-404: The delete_gpu_subnets function currently swallows all
errors and always returns success; update delete_gpu_subnets to track per-subnet
operation failures and return non-zero if any AWS call fails so callers like
delete_cluster_stacks can detect failures. Concretely, add a local exit_code=0
at the top of delete_gpu_subnets, change each "|| log ..." usage to set
exit_code=1 when an AWS command fails (e.g., disassociate-route-table,
detach-network-interface, delete-network-interface, delete-subnet, and
wait_for_eni_detach failures), and return $exit_code at the end; preserve
existing log messages and still continue attempting cleanup for other subnets
but ensure any transient or fatal error flips the exit code. Ensure
wait_for_eni_detach remains invoked and its failure is counted toward exit_code.
There was a problem hiding this comment.
🧹 Nitpick comments (2)
tests/uat/aws/delete-eks-cluster.sh (2)
157-162: OIDC provider delete failure is silently swallowed.On line 159,
|| log "WARNING: ..."preventsset -efrom aborting but also causesdelete_oidc_providerto return 0. The caller at line 68 checks for failure (|| { ... failed=1 }), so a failed deletion won't be reflected in the final exit code.If best-effort semantics are intentional here, this is fine — but it should be explicit.
Proposed fix to propagate the failure
log "Deleting OIDC provider: $provider_arn" - aws iam delete-open-id-connect-provider \ - --open-id-connect-provider-arn "$provider_arn" || log "WARNING: Failed to delete OIDC provider $provider_arn" + aws iam delete-open-id-connect-provider \ + --open-id-connect-provider-arn "$provider_arn" || { + log "WARNING: Failed to delete OIDC provider $provider_arn" + return 1 + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/uat/aws/delete-eks-cluster.sh` around lines 157 - 162, The OIDC deletion currently swallows aws iam errors via "|| log ..." which makes delete_oidc_provider appear successful even on failure; update delete_oidc_provider so the aws iam delete-open-id-connect-provider call explicitly detects failure (e.g., capture its exit status) and returns a non-zero exit code when it fails (and still logs the warning with provider_arn), or if you intend best-effort semantics make that explicit by returning success but documenting it; ensure the caller's expected behavior (the caller that does "|| { ... failed=1 }") receives a non-zero return when deletion fails.
430-459:wait_for_eni_detachalways returns 0, making its caller's error check dead code.The function returns 0 on both success (line 447) and timeout (line 458). This means the check on lines 402-404 (
if ! wait_for_eni_detach …; then exit_code=1; fi) can never trigger.If the intention is best-effort (proceed to delete regardless), the function's behavior is fine, but consider either returning 1 on timeout so the caller's check is meaningful, or removing the dead
if !check at line 402.Option A: return 1 on timeout so the caller check is effective
if [ "$elapsed" -ge "$timeout" ]; then log "WARNING: Timed out waiting for ENI $eni_id to detach after ${timeout}s; proceeding with delete attempt" fi - return 0 + return 1 }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/uat/aws/delete-eks-cluster.sh` around lines 430 - 459, The helper wait_for_eni_detach always returns 0 (success) even after timing out, so the caller's error check "if ! wait_for_eni_detach …; then exit_code=1; fi" can never trigger; fix by changing wait_for_eni_detach (the function) to return a non-zero exit code on timeout (e.g., return 1 when elapsed >= timeout) so callers see failure, or alternatively remove the dead caller check if you intend best-effort behavior—prefer the first option to preserve the existing caller logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@tests/uat/aws/delete-eks-cluster.sh`:
- Around line 157-162: The OIDC deletion currently swallows aws iam errors via
"|| log ..." which makes delete_oidc_provider appear successful even on failure;
update delete_oidc_provider so the aws iam delete-open-id-connect-provider call
explicitly detects failure (e.g., capture its exit status) and returns a
non-zero exit code when it fails (and still logs the warning with provider_arn),
or if you intend best-effort semantics make that explicit by returning success
but documenting it; ensure the caller's expected behavior (the caller that does
"|| { ... failed=1 }") receives a non-zero return when deletion fails.
- Around line 430-459: The helper wait_for_eni_detach always returns 0 (success)
even after timing out, so the caller's error check "if ! wait_for_eni_detach …;
then exit_code=1; fi" can never trigger; fix by changing wait_for_eni_detach
(the function) to return a non-zero exit code on timeout (e.g., return 1 when
elapsed >= timeout) so callers see failure, or alternatively remove the dead
caller check if you intend best-effort behavior—prefer the first option to
preserve the existing caller logic.
Summary
Type of Change
Component(s) Affected
Testing
Checklist
Summary by CodeRabbit