
fix: wait for GPUReset CRD rather than check syslog in UAT #879

Merged
lalitadithya merged 2 commits into NVIDIA:main from natherz97:uat-fix
Feb 18, 2026

Conversation


@natherz97 natherz97 commented Feb 18, 2026

Summary

The UAT test for GPU reset is currently failing because the "/var/log/syslog" file does not exist on GCP nodes. Since we do not have a container with journalctl available to read syslogs directly from "/var/log/journal", the test cannot detect the GPU reset from logs; instead, it will wait for a matching GPUReset CRD to be created. This check is required to confirm that the injected XID error resulted in a GPU reset rather than a reboot. In summary, the test will wait for a GPUReset CRD which:

  • matches the expected node with the unhealthy GPU
  • matches the GPU UUID which had the injected XID syslog error
  • was created after the node was cordoned (to prevent stale GPUReset CRDs from previous test executions from being consumed)
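The three criteria above can be sketched as a small bash loop. This is not the actual wait_for_gpu_reset implementation: the CRD field names (spec.nodeName, spec.gpuUUIDs), the CRD name, and the inline JSON are fabricated stand-ins for a real `kubectl get gpuresets -o json | jq -c '.items[]'` listing, and a real version would poll with a timeout instead of scanning a fixed string.

```shell
#!/usr/bin/env bash
# Sketch only: field names and JSON below are illustrative, not the real schema.
set -euo pipefail

node="10.0.6.34"
uuid="GPU-8598879c-4839-1709-231e-36a2b2844bca"
cordon_ts=1771300000   # epoch seconds when the node was cordoned

# In the real test this would come from:
#   kubectl get gpuresets -o json | jq -c '.items[]'
gpu_reset_list='{"metadata":{"name":"maintenance-10.0.6.34-697b"},"spec":{"nodeName":"10.0.6.34","gpuUUIDs":["GPU-8598879c-4839-1709-231e-36a2b2844bca"]},"status":{"startTime":"2026-02-17T16:04:00Z"}}'

matching_crd=""
while IFS= read -r gpu_reset; do
    [ -z "$gpu_reset" ] && continue
    # `// empty` avoids jq printing the literal string "null" for a missing field
    start_time=$(echo "$gpu_reset" | jq -r '.status.startTime // empty')
    [ -z "$start_time" ] && continue
    start_ts=$(date -u -d "$start_time" +%s)   # GNU date
    current_node=$(echo "$gpu_reset" | jq -r '.spec.nodeName')
    # Criteria: created after the cordon, on the expected node, for the expected GPU
    if [ "$start_ts" -gt "$cordon_ts" ] && [ "$current_node" == "$node" ]; then
        for current_uuid in $(echo "$gpu_reset" | jq -r '.spec.gpuUUIDs[]'); do
            if [ "$current_uuid" == "$uuid" ]; then
                matching_crd=$(echo "$gpu_reset" | jq -r '.metadata.name')
                break 2   # first match wins; stop scanning both loops
            fi
        done
    fi
done <<< "$gpu_reset_list"

echo "matching_crd=$matching_crd"
```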

Testing

I ran the wait_for_gpu_reset function locally with these arguments:

gpu_node="10.0.6.34"
uuid="GPU-8598879c-4839-1709-231e-36a2b2844bca"
wait_for_gpu_reset "$gpu_node" "$uuid"

Running the function against a cluster with a matching GPUReset CRD:

% ./wait_for_gpu_reset.sh
[2026-02-17 16:04:06] Waiting for GPU reset for GPU-8598879c-4839-1709-231e-36a2b2844bca on 10.0.6.34 (matching GPUReset CRD)...
[2026-02-17 16:04:07] GPUReset maintenance-10.0.6.34-697bc65faecdaeca50880d66 matches GPU-8598879c-4839-1709-231e-36a2b2844bca and 10.0.6.34
...
./wait_for_gpu_reset.sh: line 56: wait_for_node_unquarantine: command not found

I confirmed that the test will fail if a GPUReset CRD does not exist with the expected node name or GPU UUID. The test will also fail if a GPUReset CRD has the required node and GPU but a creation timestamp that is too old.

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📚 Documentation
  • 🔧 Refactoring
  • 🔨 Build/CI

Component(s) Affected

  • Core Services
  • Documentation/CI
  • Fault Management
  • Health Monitors
  • Janitor
  • Other: UAT

Testing

  • Tests pass locally
  • Manual testing completed
  • No breaking changes (or documented)

Checklist

  • Self-review completed
  • Documentation updated (if needed)
  • Ready for review

Summary by CodeRabbit

  • Tests
    • Switched automated GPU reset verification from syslog parsing to CRD-driven detection (matching by node and GPU UUID). Tests now provide a timestamp baseline through the wait flow, include explicit timeout/elapsed handling, skip incomplete CRD records early, and record clearer match logs to improve reliability and observability of GPU reset validation.


coderabbitai bot commented Feb 18, 2026

📝 Walkthrough
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately summarizes the main change (replacing syslog-based GPU reset detection with CRD-based detection in the UAT tests), which directly matches the primary objectives and changes in the PR.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/uat/tests.sh (1)

418-421: ⚠️ Potential issue | 🟡 Minor

wait_for_node_unquarantine is called twice: once inside wait_for_gpu_reset (line 264) and again here (line 421).

wait_for_gpu_reset already calls wait_for_node_unquarantine "$node" at its end, so the explicit call at line 421 is redundant. The second invocation will return immediately (the node is already uncordoned), but it's confusing and inconsistent with how test_xid_monitoring_syslog (Test 2) delegates unquarantine entirely to wait_for_boot_id_change.

Either remove the call at line 421, or remove it from inside wait_for_gpu_reset and leave it to the caller — but pick one.

Option: remove the redundant call at the call site
     wait_for_gpu_reset "$gpu_node" "$uuid"

-    wait_for_node_unquarantine "$gpu_node"
-
     log "Test 3 PASSED ✓"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/uat/tests.sh` around lines 418 - 421, The call site contains a
redundant invocation of wait_for_node_unquarantine because wait_for_gpu_reset
already calls wait_for_node_unquarantine at its end; remove the explicit
wait_for_node_unquarantine "$gpu_node" line from the caller (the block that logs
"Waiting for node to GPU reset and recover...") so unquarantine is only handled
inside wait_for_gpu_reset, and update any nearby comment to reflect that
wait_for_gpu_reset performs the unquarantine step (alternatively, if you prefer
caller-controlled unquarantine, remove the internal call inside
wait_for_gpu_reset—choose one approach and make the code/comments consistent).
🧹 Nitpick comments (2)
tests/uat/tests.sh (2)

229-230: local var=$(cmd) masks non-zero exit codes under set -euo pipefail.

local always returns 0, so if kubectl get gpuresets or jq fails, the error is silently swallowed and gpu_reset_list is set to an empty (or partial) value. The same pitfall applies to the local declarations on lines 233-235 and 240.

A common pattern to preserve set -e semantics:

local gpu_reset_list
gpu_reset_list=$(kubectl get gpuresets -o json | jq -c '.items[]')

This is a well-known bash gotcha — not urgent since the retry loop provides a natural safety net, but worth addressing for debuggability.
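The gotcha reproduces without any cluster access. A minimal standalone sketch (no real kubectl or jq calls involved):

```shell
#!/usr/bin/env bash
# Repro of the `local var=$(cmd)` gotcha under set -euo pipefail.

# Case 1: `local` itself returns 0, so set -e never sees the
# substitution's failure and execution silently continues.
bash -c 'set -euo pipefail
         f() { local out=$(false); echo "still running, out=<$out>"; }
         f'
echo "case 1 exit: $?"   # prints 0: the failure of `false` was masked

# Case 2: declare first, assign second; the assignment's status is
# the substitution's status, so set -e aborts as intended.
bash -c 'set -euo pipefail
         f() { local out; out=$(false); echo "never printed"; }
         f' || echo "case 2 aborted as expected"
```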

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/uat/tests.sh` around lines 229 - 230, The issue is that using `local
var=$(cmd)` masks non-zero exit codes; change each occurrence to declare the
local first and then assign the command substitution so failures propagate under
set -euo pipefail — e.g., replace `local gpu_reset_list=$(kubectl get gpuresets
-o json | jq -c '.items[]')` with `local gpu_reset_list` followed by
`gpu_reset_list=$(kubectl get gpuresets -o json | jq -c '.items[]')`; do the
same for the other local declarations referenced (the subsequent local
assignments around the same block) such that `local` and the command
substitution are separate for variables like gpu_reset_list and the other
variables created on lines near it.

243-249: Missing break after finding a matching UUID — the outer for gpu_reset loop also keeps iterating.

Once matching_crd is set on line 245, neither the inner for current_uuid loop nor the outer for gpu_reset loop breaks. This means:

  1. Remaining UUIDs in the same CRD are compared unnecessarily.
  2. All remaining GPUReset CRDs are still processed, potentially overwriting matching_crd with a later (equally valid) match.
Proposed fix — break out of both loops on first match
                 for current_uuid in $uuids; do
                     if [ "$current_uuid" == "$uuid" ]; then
                         matching_crd=$(echo "$gpu_reset" | jq -r '.metadata.name')
                         log "GPUReset $matching_crd matches $uuid and $node"
+                        break
                     fi
                 done
+
+                if [ -n "$matching_crd" ]; then
+                    break
+                fi
             fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/uat/tests.sh` around lines 243 - 249, The inner loop comparing
current_uuid to uuid sets matching_crd but doesn’t exit either loop, allowing
further comparisons and overwrites; after setting matching_crd and logging
(inside the inner for current_uuid loop where gpu_reset, uuids, uuid, node are
used) immediately break out of both loops — e.g., use a single break 2 to stop
the inner and outer for gpu_reset loop (or set a flag and break twice) so the
first matching CRD is preserved and no further UUIDs or GPUReset CRDs are
processed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/uat/tests.sh`:
- Around line 237-240: The script reads .status.startTime into start_time using
jq -r but jq prints "null" for missing fields, so the existing if [ -z
"$start_time" ] check misses that and date -d "null" runs; update the check or
jq invocation: either have the jq call return an empty string for null (e.g.,
adjust the jq expression to convert null to ""), or add a guard like if [ -z
"$start_time" ] || [ "$start_time" = "null" ]; then continue; before calling
date -d; refer to the start_time variable and start_ts assignment in tests.sh
and ensure date -d is only called when start_time is a real timestamp to avoid
the masked error and subsequent integer expression failures.
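The null-vs-empty distinction the comment describes is easy to demonstrate standalone (the CRD JSON and name below are fabricated for illustration):

```shell
#!/usr/bin/env bash
# Demonstrates jq printing the string "null" for a missing field,
# and the `// empty` alternative that a -z test can actually catch.
set -euo pipefail

crd_without_start='{"metadata":{"name":"maintenance-x"},"status":{}}'

# jq -r emits the literal string "null" for a missing field...
raw=$(echo "$crd_without_start" | jq -r '.status.startTime')
echo "raw=<$raw>"    # prints: raw=<null>

# ...so `[ -z "$start_time" ]` would not trip. `// empty` yields an
# empty string instead, making the guard work as intended.
start_time=$(echo "$crd_without_start" | jq -r '.status.startTime // empty')
if [ -z "$start_time" ]; then
    echo "skipping CRD with no startTime"
fi
```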

---

Outside diff comments:
In `@tests/uat/tests.sh`:
- Around line 418-421: The call site contains a redundant invocation of
wait_for_node_unquarantine because wait_for_gpu_reset already calls
wait_for_node_unquarantine at its end; remove the explicit
wait_for_node_unquarantine "$gpu_node" line from the caller (the block that logs
"Waiting for node to GPU reset and recover...") so unquarantine is only handled
inside wait_for_gpu_reset, and update any nearby comment to reflect that
wait_for_gpu_reset performs the unquarantine step (alternatively, if you prefer
caller-controlled unquarantine, remove the internal call inside
wait_for_gpu_reset—choose one approach and make the code/comments consistent).

---

Nitpick comments:
In `@tests/uat/tests.sh`:
- Around line 229-230: The issue is that using `local var=$(cmd)` masks non-zero
exit codes; change each occurrence to declare the local first and then assign
the command substitution so failures propagate under set -euo pipefail — e.g.,
replace `local gpu_reset_list=$(kubectl get gpuresets -o json | jq -c
'.items[]')` with `local gpu_reset_list` followed by `gpu_reset_list=$(kubectl
get gpuresets -o json | jq -c '.items[]')`; do the same for the other local
declarations referenced (the subsequent local assignments around the same block)
such that `local` and the command substitution are separate for variables like
gpu_reset_list and the other variables created on lines near it.
- Around line 243-249: The inner loop comparing current_uuid to uuid sets
matching_crd but doesn’t exit either loop, allowing further comparisons and
overwrites; after setting matching_crd and logging (inside the inner for
current_uuid loop where gpu_reset, uuids, uuid, node are used) immediately break
out of both loops — e.g., use a single break 2 to stop the inner and outer for
gpu_reset loop (or set a flag and break twice) so the first matching CRD is
preserved and no further UUIDs or GPUReset CRDs are processed.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (3)
tests/uat/tests.sh (3)

264-264: wait_for_node_unquarantine is called twice for the GPU-reset test path.

wait_for_gpu_reset already calls wait_for_node_unquarantine at line 264. Then test_xid_monitoring_syslog_gpu_reset calls it again at line 421. The second call is redundant — the node is already uncordoned at that point.

Proposed fix — remove the duplicate call in the test function
     wait_for_gpu_reset "$gpu_node" "$uuid"

-    wait_for_node_unquarantine "$gpu_node"
-
     log "Test 3 PASSED ✓"

Also applies to: 419-421

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/uat/tests.sh` at line 264, The test calls wait_for_node_unquarantine
twice for the GPU-reset path; wait_for_gpu_reset already invokes
wait_for_node_unquarantine, so remove the redundant call from the test function
test_xid_monitoring_syslog_gpu_reset (delete the wait_for_node_unquarantine
"$node" invocation inside that test) to avoid calling the same unquarantine
helper twice.

243-249: Missing break after matching UUID causes unnecessary iteration and possible duplicate log messages.

Once matching_crd is set on line 245, neither the inner for loop nor the outer for loop breaks. The code continues scanning all remaining UUIDs and all remaining CRDs. Add break statements to exit both loops early.

Proposed fix
             if [ "$start_ts" -gt "$current_ts" ] && [ "$current_node" == "$node" ]; then
                 for current_uuid in $uuids; do
                     if [ "$current_uuid" == "$uuid" ]; then
                         matching_crd=$(echo "$gpu_reset" | jq -r '.metadata.name')
                         log "GPUReset $matching_crd matches $uuid and $node"
+                        break
                     fi
                 done
             fi
+
+            if [ -n "$matching_crd" ]; then
+                break
+            fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/uat/tests.sh` around lines 243 - 249, After finding a matching UUID and
setting matching_crd in the inner loop (inside the for current_uuid in $uuids
loop where matching_crd is assigned and log "GPUReset $matching_crd matches
$uuid and $node" is emitted), exit both loops immediately to avoid extra
iterations and duplicate logs; add a break that exits both the inner and outer
for loops (e.g., use break 2 in this bash script) immediately after the log
statement.

228-250: Consider using while read -r instead of IFS-based for loop for robustness.

Setting local IFS=$'\n' inside the while loop and iterating with for gpu_reset in $gpu_reset_list works, but is fragile — local scopes to the function (not the block), so IFS stays modified for subsequent operations. A while read -r pipeline avoids this entirely and is more idiomatic for processing line-oriented output.

Sketch
-        local gpu_reset_list=$(kubectl get gpuresets -o json | jq -c '.items[]')
-        local IFS=$'\n'
-
-        for gpu_reset in $gpu_reset_list; do
+        while IFS= read -r gpu_reset; do
+            [ -z "$gpu_reset" ] && continue
             ...
-        done
+        done < <(kubectl get gpuresets -o json | jq -c '.items[]')
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/uat/tests.sh` around lines 228 - 250, The current loop builds
gpu_reset_list and iterates with a for + IFS which leaks IFS scope and is
fragile; replace the for gpu_reset in $gpu_reset_list construct with a while
read -r loop that reads each jq -c '.items[]' line (preserving each gpu_reset
JSON) — e.g., pipe the kubectl | jq output into while IFS= read -r gpu_reset; do
... done — then inside use the existing jq calls (start_time, current_node,
uuids, matching_crd) and the same comparisons to set matching_crd and log; this
makes iteration robust and avoids modifying IFS globally.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/uat/tests.sh`:
- Around line 237-240: The problem is that using "local start_ts=$(date -d
"$start_time" +%s)" masks a non-zero exit from date and leaves start_ts empty;
split declaration and assignment so failures are not masked: declare local
start_ts (or local start_ts="") first, then assign start_ts=$(date -d
"$start_time" +%s) and immediately check the command exit (e.g. "start_ts=$(date
-d "$start_time" +%s) || continue") or validate with a numeric check (e.g. "[[
$start_ts =~ ^[0-9]+$ ]] || continue") before using "[ "$start_ts" -gt
"$current_ts" ]"; reference variables start_time, start_ts and the comparison
against current_ts when applying the fix.

---

Nitpick comments:
In `@tests/uat/tests.sh`:
- Line 264: The test calls wait_for_node_unquarantine twice for the GPU-reset
path; wait_for_gpu_reset already invokes wait_for_node_unquarantine, so remove
the redundant call from the test function test_xid_monitoring_syslog_gpu_reset
(delete the wait_for_node_unquarantine "$node" invocation inside that test) to
avoid calling the same unquarantine helper twice.
- Around line 243-249: After finding a matching UUID and setting matching_crd in
the inner loop (inside the for current_uuid in $uuids loop where matching_crd is
assigned and log "GPUReset $matching_crd matches $uuid and $node" is emitted),
exit both loops immediately to avoid extra iterations and duplicate logs; add a
break that exits both the inner and outer for loops (e.g., use break 2 in this
bash script) immediately after the log statement.
- Around line 228-250: The current loop builds gpu_reset_list and iterates with
a for + IFS which leaks IFS scope and is fragile; replace the for gpu_reset in
$gpu_reset_list construct with a while read -r loop that reads each jq -c
'.items[]' line (preserving each gpu_reset JSON) — e.g., pipe the kubectl | jq
output into while IFS= read -r gpu_reset; do ... done — then inside use the
existing jq calls (start_time, current_node, uuids, matching_crd) and the same
comparisons to set matching_crd and log; this makes iteration robust and avoids
modifying IFS globally.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
tests/uat/tests.sh (2)

243-250: Add break statements to short-circuit after the first UUID match.

After matching_crd is set, neither the inner for current_uuid loop nor the outer for gpu_reset loop exits early. The outer while loop only breaks at line 252 after all items have been scanned. If multiple CRDs satisfy the criteria, matching_crd ends up holding the last match rather than the first.

♻️ Proposed refactor
             for current_uuid in $uuids; do
                 if [ "$current_uuid" == "$uuid" ]; then
                     matching_crd=$(echo "$gpu_reset" | jq -r '.metadata.name')
                     log "GPUReset $matching_crd matches $uuid and $node"
+                    break
                 fi
             done
+            [ -n "$matching_crd" ] && break
         fi
     done
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/uat/tests.sh` around lines 243 - 250, The inner UUID loop and the outer
gpu_reset loop should short-circuit after the first match to avoid overwriting
matching_crd; when you set matching_crd inside the if block (where you currently
call log "GPUReset $matching_crd matches $uuid and $node"), add a break that
exits both loops (use break 2 in this shell context) so you stop iterating
current_uuid and gpu_reset immediately once a match is found.

229-232: local IFS=$'\n' is re-declared on every iteration of the while loop.

In bash, local is function-scoped, not loop-iteration-scoped. Placing it inside the loop re-evaluates the local declaration on each pass, which is redundant. Move both the IFS override and gpu_reset_list fetch to where they actually belong — IFS alongside the other local declarations at the function top, and gpu_reset_list inside the loop (since it re-fetches on each poll, which is intentional).

♻️ Proposed refactor
 wait_for_gpu_reset() {
     local node=$1
     local uuid=$2
     local current_ts=$3
     local timeout=${UAT_RESET_TIMEOUT:-600}
     local elapsed=0
     local matching_crd
+    local IFS=$'\n'

     log "Waiting for GPU reset for $uuid on $node (matching GPUReset CRD)..."

     while [[ $elapsed -lt $timeout ]]; do
-        local gpu_reset_list=$(kubectl get gpuresets -o json | jq -c '.items[]')
-        local IFS=$'\n'
+        local gpu_reset_list
+        gpu_reset_list=$(kubectl get gpuresets -o json | jq -c '.items[]') || true
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/uat/tests.sh` around lines 229 - 232, The IFS override is declared with
`local IFS=$'\n'` inside the loop, which is redundant because `local` is
function-scoped; move the `local IFS=$'\n'` declaration up with the other
`local` declarations at the top of the enclosing function (so IFS is set once),
and keep the `gpu_reset_list=$(kubectl get gpuresets -o json | jq -c
'.items[]')` assignment inside the polling loop but remove the `local` keyword
there (so it reassigns the existing variable each iteration rather than
redeclaring it). Reference the variables/assignments `IFS`, `gpu_reset_list`,
and the loop that does `for gpu_reset in $gpu_reset_list; do`.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/uat/tests.sh`:
- Around line 421-423: The test calls wait_for_node_unquarantine twice: once
inside wait_for_gpu_reset and again immediately after the wait_for_gpu_reset
"$gpu_node" "$uuid" "$current_ts" invocation; remove the explicit post-call to
wait_for_node_unquarantine to avoid redundancy and match the pattern used in
test_gpu_monitoring_dcgm where wait_for_boot_id_change (which internally calls
wait_for_node_unquarantine) is not followed by a second call.

---

Duplicate comments:
In `@tests/uat/tests.sh`:
- Line 240: The assignment using `local start_ts=$(date -d "$start_time" +%s)`
masks `date`'s exit status; split declaration and assignment so failures
propagate: declare `local start_ts` first, then run `start_ts=$(date -d
"$start_time" +%s)` and immediately check its exit code (or test that `start_ts`
is non-empty/numeric) before using it in the `[ "$start_ts" -gt "$current_ts" ]`
comparison; update the block around `start_ts`, `start_time`, and the comparison
to handle and log a `date -d` failure (or exit) if the command fails.

---

Nitpick comments:
In `@tests/uat/tests.sh`:
- Around line 243-250: The inner UUID loop and the outer gpu_reset loop should
short-circuit after the first match to avoid overwriting matching_crd; when you
set matching_crd inside the if block (where you currently call log "GPUReset
$matching_crd matches $uuid and $node"), add a break that exits both loops (use
break 2 in this shell context) so you stop iterating current_uuid and gpu_reset
immediately once a match is found.
- Around line 229-232: The IFS override is declared with `local IFS=$'\n'`
inside the loop, which is redundant because `local` is function-scoped; move the
`local IFS=$'\n'` declaration up with the other `local` declarations at the top
of the enclosing function (so IFS is set once), and keep the
`gpu_reset_list=$(kubectl get gpuresets -o json | jq -c '.items[]')` assignment
inside the polling loop but remove the `local` keyword there (so it reassigns
the existing variable each iteration rather than redeclaring it). Reference the
variables/assignments `IFS`, `gpu_reset_list`, and the loop that does `for
gpu_reset in $gpu_reset_list; do`.

Signed-off-by: Nathan Herz <nherz@nvidia.com>

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
tests/uat/tests.sh (2)

228-230: local IFS=$'\n' is re-declared on every loop iteration — move it outside.

local IFS=$'\n' inside the while loop is redundant after the first iteration; the assignment is re-executed on every poll cycle. Moving it above the while loop is cleaner and makes the intent (split gpu_reset_list by newlines) obvious.

♻️ Suggested move
+    local IFS=$'\n'
     while [[ $elapsed -lt $timeout ]]; do
         local gpu_reset_list=$(kubectl get gpuresets -o json | jq -c '.items[]')
-        local IFS=$'\n'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/uat/tests.sh` around lines 228 - 230, Move the IFS setting out of the
while loop so it's not re-declared every iteration: set local IFS=$'\n' once
before the while [[ $elapsed -lt $timeout ]]; do loop that reads local
gpu_reset_list=$(kubectl get gpuresets -o json | jq -c '.items[]'), ensuring
gpu_reset_list is still split on newlines as intended; update any surrounding
comments if needed to reflect that IFS applies for the loop body rather than
being reset each poll.

243-250: Add break after match to prevent "last-match-wins" and unnecessary iterations.

After matching_crd is set (line 245), neither the UUID for loop nor the enclosing gpu_reset for loop breaks. This has two effects:

  1. All remaining UUIDs in the same CRD are checked unnecessarily.
  2. All remaining GPUReset CRDs are still processed, so matching_crd can be overwritten by a later CRD that also satisfies the conditions — giving "last-match-wins" semantics rather than "first-match-wins".
♻️ Proposed fix
             for current_uuid in $uuids; do
                 if [ "$current_uuid" == "$uuid" ]; then
                     matching_crd=$(echo "$gpu_reset" | jq -r '.metadata.name')
                     log "GPUReset $matching_crd matches $uuid and $node"
+                    break
                 fi
             done
         fi
+        if [ -n "$matching_crd" ]; then
+            break
+        fi
     done
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/uat/tests.sh` around lines 243 - 250, The loop that sets matching_crd
can continue iterating and be overwritten; after you set matching_crd inside the
inner UUID loop (the block that logs "GPUReset $matching_crd matches $uuid and
$node"), add a break to exit the UUID loop and then add a break (or otherwise
stop) the enclosing gpu_reset loop so that once a matching CRD is found you stop
further UUID checks and stop iterating remaining gpu_reset items, ensuring
first-match-wins behavior for matching_crd.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@tests/uat/tests.sh`:
- Line 240: The issue is that `local start_ts=$(date -d "$start_time" +%s)`
masks a failing `date` exit code; change it to first run `date -d "$start_time"
+%s` into a non-local temp (or run it and capture its exit), test the command's
exit status, and only then assign/declare the local `start_ts` (or exit with a
clear error) so malformed `start_time` won't produce an empty/invalid value;
look for the `start_ts`/`start_time` usage in tests.sh and update the logic to
validate the `date` call before using `start_ts` in the comparison with
`current_ts`.

---

Nitpick comments:
In `@tests/uat/tests.sh`:
- Around line 228-230: Move the IFS setting out of the while loop so it's not
re-declared every iteration: set local IFS=$'\n' once before the while [[
$elapsed -lt $timeout ]]; do loop that reads local gpu_reset_list=$(kubectl get
gpuresets -o json | jq -c '.items[]'), ensuring gpu_reset_list is still split on
newlines as intended; update any surrounding comments if needed to reflect that
IFS applies for the loop body rather than being reset each poll.
- Around line 243-250: The loop that sets matching_crd can continue iterating
and be overwritten; after you set matching_crd inside the inner UUID loop (the
block that logs "GPUReset $matching_crd matches $uuid and $node"), add a break
to exit the UUID loop and then add a break (or otherwise stop) the enclosing
gpu_reset loop so that once a matching CRD is found you stop further UUID checks
and stop iterating remaining gpu_reset items, ensuring first-match-wins behavior
for matching_crd.

@lalitadithya lalitadithya enabled auto-merge (squash) February 18, 2026 19:34
@lalitadithya lalitadithya merged commit b14649e into NVIDIA:main Feb 18, 2026
41 checks passed