fix: wait for GPUReset CRD rather than check syslog in UAT#879
fix: wait for GPUReset CRD rather than check syslog in UAT#879lalitadithya merged 2 commits intoNVIDIA:mainfrom
Conversation
📝 Walkthrough🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/uat/tests.sh (1)
418-421:⚠️ Potential issue | 🟡 Minor
wait_for_node_unquarantineis called twice: once insidewait_for_gpu_reset(line 264) and again here (line 421).
wait_for_gpu_resetalready callswait_for_node_unquarantine "$node"at its end, so the explicit call at line 421 is redundant. The second invocation will return immediately (the node is already uncordoned), but it's confusing and inconsistent with howtest_xid_monitoring_syslog(Test 2) delegates unquarantine entirely towait_for_boot_id_change.Either remove the call at line 421, or remove it from inside
wait_for_gpu_resetand leave it to the caller — but pick one.Option: remove the redundant call at the call site
wait_for_gpu_reset "$gpu_node" "$uuid" - wait_for_node_unquarantine "$gpu_node" - log "Test 3 PASSED ✓"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/uat/tests.sh` around lines 418 - 421, The call site contains a redundant invocation of wait_for_node_unquarantine because wait_for_gpu_reset already calls wait_for_node_unquarantine at its end; remove the explicit wait_for_node_unquarantine "$gpu_node" line from the caller (the block that logs "Waiting for node to GPU reset and recover...") so unquarantine is only handled inside wait_for_gpu_reset, and update any nearby comment to reflect that wait_for_gpu_reset performs the unquarantine step (alternatively, if you prefer caller-controlled unquarantine, remove the internal call inside wait_for_gpu_reset—choose one approach and make the code/comments consistent).
🧹 Nitpick comments (2)
tests/uat/tests.sh (2)
229-230:local var=$(cmd)masks non-zero exit codes underset -euo pipefail.
localalways returns 0, so ifkubectl get gpuresetsorjqfails, the error is silently swallowed andgpu_reset_listis set to an empty (or partial) value. The same pitfall applies to thelocaldeclarations on lines 233-235 and 240.A common pattern to preserve
set -esemantics:local gpu_reset_list gpu_reset_list=$(kubectl get gpuresets -o json | jq -c '.items[]')This is a well-known bash gotcha — not urgent since the retry loop provides a natural safety net, but worth addressing for debuggability.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/uat/tests.sh` around lines 229 - 230, The issue is that using `local var=$(cmd)` masks non-zero exit codes; change each occurrence to declare the local first and then assign the command substitution so failures propagate under set -euo pipefail — e.g., replace `local gpu_reset_list=$(kubectl get gpuresets -o json | jq -c '.items[]')` with `local gpu_reset_list` followed by `gpu_reset_list=$(kubectl get gpuresets -o json | jq -c '.items[]')`; do the same for the other local declarations referenced (the subsequent local assignments around the same block) such that `local` and the command substitution are separate for variables like gpu_reset_list and the other variables created on lines near it.
243-249: Missingbreakafter finding a matching UUID — the outerfor gpu_resetloop also keeps iterating.Once
matching_crdis set on line 245, neither the innerfor current_uuidloop nor the outerfor gpu_resetloop breaks. This means:
- Remaining UUIDs in the same CRD are compared unnecessarily.
- All remaining GPUReset CRDs are still processed, potentially overwriting
matching_crdwith a later (equally valid) match.Proposed fix — break out of both loops on first match
for current_uuid in $uuids; do if [ "$current_uuid" == "$uuid" ]; then matching_crd=$(echo "$gpu_reset" | jq -r '.metadata.name') log "GPUReset $matching_crd matches $uuid and $node" + break fi done + + if [ -n "$matching_crd" ]; then + break + fi fi🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/uat/tests.sh` around lines 243 - 249, The inner loop comparing current_uuid to uuid sets matching_crd but doesn’t exit either loop, allowing further comparisons and overwrites; after setting matching_crd and logging (inside the inner for current_uuid loop where gpu_reset, uuids, uuid, node are used) immediately break out of both loops — e.g., use a single break 2 to stop the inner and outer for gpu_reset loop (or set a flag and break twice) so the first matching CRD is preserved and no further UUIDs or GPUReset CRDs are processed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/uat/tests.sh`:
- Around line 237-240: The script reads .status.startTime into start_time using
jq -r but jq prints "null" for missing fields, so the existing if [ -z
"$start_time" ] check misses that and date -d "null" runs; update the check or
jq invocation: either have the jq call return an empty string for null (e.g.,
adjust the jq expression to convert null to ""), or add a guard like if [ -z
"$start_time" ] || [ "$start_time" = "null" ]; then continue; before calling
date -d; refer to the start_time variable and start_ts assignment in tests.sh
and ensure date -d is only called when start_time is a real timestamp to avoid
the masked error and subsequent integer expression failures.
---
Outside diff comments:
In `@tests/uat/tests.sh`:
- Around line 418-421: The call site contains a redundant invocation of
wait_for_node_unquarantine because wait_for_gpu_reset already calls
wait_for_node_unquarantine at its end; remove the explicit
wait_for_node_unquarantine "$gpu_node" line from the caller (the block that logs
"Waiting for node to GPU reset and recover...") so unquarantine is only handled
inside wait_for_gpu_reset, and update any nearby comment to reflect that
wait_for_gpu_reset performs the unquarantine step (alternatively, if you prefer
caller-controlled unquarantine, remove the internal call inside
wait_for_gpu_reset—choose one approach and make the code/comments consistent).
---
Nitpick comments:
In `@tests/uat/tests.sh`:
- Around line 229-230: The issue is that using `local var=$(cmd)` masks non-zero
exit codes; change each occurrence to declare the local first and then assign
the command substitution so failures propagate under set -euo pipefail — e.g.,
replace `local gpu_reset_list=$(kubectl get gpuresets -o json | jq -c
'.items[]')` with `local gpu_reset_list` followed by `gpu_reset_list=$(kubectl
get gpuresets -o json | jq -c '.items[]')`; do the same for the other local
declarations referenced (the subsequent local assignments around the same block)
such that `local` and the command substitution are separate for variables like
gpu_reset_list and the other variables created on lines near it.
- Around line 243-249: The inner loop comparing current_uuid to uuid sets
matching_crd but doesn’t exit either loop, allowing further comparisons and
overwrites; after setting matching_crd and logging (inside the inner for
current_uuid loop where gpu_reset, uuids, uuid, node are used) immediately break
out of both loops — e.g., use a single break 2 to stop the inner and outer for
gpu_reset loop (or set a flag and break twice) so the first matching CRD is
preserved and no further UUIDs or GPUReset CRDs are processed.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (3)
tests/uat/tests.sh (3)
264-264:wait_for_node_unquarantineis called twice for the GPU-reset test path.
wait_for_gpu_resetalready callswait_for_node_unquarantineat line 264. Thentest_xid_monitoring_syslog_gpu_resetcalls it again at line 421. The second call is redundant — the node is already uncordoned at that point.Proposed fix — remove the duplicate call in the test function
wait_for_gpu_reset "$gpu_node" "$uuid" - wait_for_node_unquarantine "$gpu_node" - log "Test 3 PASSED ✓"Also applies to: 419-421
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/uat/tests.sh` at line 264, The test calls wait_for_node_unquarantine twice for the GPU-reset path; wait_for_gpu_reset already invokes wait_for_node_unquarantine, so remove the redundant call from the test function test_xid_monitoring_syslog_gpu_reset (delete the wait_for_node_unquarantine "$node" invocation inside that test) to avoid calling the same unquarantine helper twice.
243-249: Missingbreakafter matching UUID causes unnecessary iteration and possible duplicate log messages.Once
matching_crdis set on line 245, neither the innerforloop nor the outerforloop breaks. The code continues scanning all remaining UUIDs and all remaining CRDs. Addbreakstatements to exit both loops early.Proposed fix
if [ "$start_ts" -gt "$current_ts" ] && [ "$current_node" == "$node" ]; then for current_uuid in $uuids; do if [ "$current_uuid" == "$uuid" ]; then matching_crd=$(echo "$gpu_reset" | jq -r '.metadata.name') log "GPUReset $matching_crd matches $uuid and $node" + break fi done fi + + if [ -n "$matching_crd" ]; then + break + fi🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/uat/tests.sh` around lines 243 - 249, After finding a matching UUID and setting matching_crd in the inner loop (inside the for current_uuid in $uuids loop where matching_crd is assigned and log "GPUReset $matching_crd matches $uuid and $node" is emitted), exit both loops immediately to avoid extra iterations and duplicate logs; add a break that exits both the inner and outer for loops (e.g., use break 2 in this bash script) immediately after the log statement.
228-250: Consider usingwhile read -rinstead of IFS-basedforloop for robustness.Setting
local IFS=$'\n'inside thewhileloop and iterating withfor gpu_reset in $gpu_reset_listworks, but is fragile —localscopes to the function (not the block), so IFS stays modified for subsequent operations. Awhile read -rpipeline avoids this entirely and is more idiomatic for processing line-oriented output.Sketch
- local gpu_reset_list=$(kubectl get gpuresets -o json | jq -c '.items[]') - local IFS=$'\n' - - for gpu_reset in $gpu_reset_list; do + while IFS= read -r gpu_reset; do + [ -z "$gpu_reset" ] && continue ... - done + done < <(kubectl get gpuresets -o json | jq -c '.items[]')🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/uat/tests.sh` around lines 228 - 250, The current loop builds gpu_reset_list and iterates with a for + IFS which leaks IFS scope and is fragile; replace the for gpu_reset in $gpu_reset_list construct with a while read -r loop that reads each jq -c '.items[]' line (preserving each gpu_reset JSON) — e.g., pipe the kubectl | jq output into while IFS= read -r gpu_reset; do ... done — then inside use the existing jq calls (start_time, current_node, uuids, matching_crd) and the same comparisons to set matching_crd and log; this makes iteration robust and avoids modifying IFS globally.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/uat/tests.sh`:
- Around line 237-240: The problem is that using "local start_ts=$(date -d
"$start_time" +%s)" masks a non-zero exit from date and leaves start_ts empty;
split declaration and assignment so failures are not masked: declare local
start_ts (or local start_ts="") first, then assign start_ts=$(date -d
"$start_time" +%s) and immediately check the command exit (e.g. "start_ts=$(date
-d "$start_time" +%s) || continue") or validate with a numeric check (e.g. "[[
$start_ts =~ ^[0-9]+$ ]] || continue") before using "[ "$start_ts" -gt
"$current_ts" ]"; reference variables start_time, start_ts and the comparison
against current_ts when applying the fix.
---
Nitpick comments:
In `@tests/uat/tests.sh`:
- Line 264: The test calls wait_for_node_unquarantine twice for the GPU-reset
path; wait_for_gpu_reset already invokes wait_for_node_unquarantine, so remove
the redundant call from the test function test_xid_monitoring_syslog_gpu_reset
(delete the wait_for_node_unquarantine "$node" invocation inside that test) to
avoid calling the same unquarantine helper twice.
- Around line 243-249: After finding a matching UUID and setting matching_crd in
the inner loop (inside the for current_uuid in $uuids loop where matching_crd is
assigned and log "GPUReset $matching_crd matches $uuid and $node" is emitted),
exit both loops immediately to avoid extra iterations and duplicate logs; add a
break that exits both the inner and outer for loops (e.g., use break 2 in this
bash script) immediately after the log statement.
- Around line 228-250: The current loop builds gpu_reset_list and iterates with
a for + IFS which leaks IFS scope and is fragile; replace the for gpu_reset in
$gpu_reset_list construct with a while read -r loop that reads each jq -c
'.items[]' line (preserving each gpu_reset JSON) — e.g., pipe the kubectl | jq
output into while IFS= read -r gpu_reset; do ... done — then inside use the
existing jq calls (start_time, current_node, uuids, matching_crd) and the same
comparisons to set matching_crd and log; this makes iteration robust and avoids
modifying IFS globally.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
tests/uat/tests.sh (2)
243-250: Addbreakstatements to short-circuit after the first UUID match.After
matching_crdis set, neither the innerfor current_uuidloop nor the outerfor gpu_resetloop exits early. The outerwhileloop only breaks at line 252 after all items have been scanned. If multiple CRDs satisfy the criteria,matching_crdends up holding the last match rather than the first.♻️ Proposed refactor
for current_uuid in $uuids; do if [ "$current_uuid" == "$uuid" ]; then matching_crd=$(echo "$gpu_reset" | jq -r '.metadata.name') log "GPUReset $matching_crd matches $uuid and $node" + break fi done + [ -n "$matching_crd" ] && break fi done🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/uat/tests.sh` around lines 243 - 250, The inner UUID loop and the outer gpu_reset loop should short-circuit after the first match to avoid overwriting matching_crd; when you set matching_crd inside the if block (where you currently call log "GPUReset $matching_crd matches $uuid and $node"), add a break that exits both loops (use break 2 in this shell context) so you stop iterating current_uuid and gpu_reset immediately once a match is found.
229-232:local IFS=$'\n'is re-declared on every iteration of the while loop.In bash,
localis function-scoped, not loop-iteration-scoped. Placing it inside the loop re-evaluates thelocaldeclaration on each pass, which is redundant. Move both theIFSoverride andgpu_reset_listfetch to where they actually belong —IFSalongside the otherlocaldeclarations at the function top, andgpu_reset_listinside the loop (since it re-fetches on each poll, which is intentional).♻️ Proposed refactor
wait_for_gpu_reset() { local node=$1 local uuid=$2 local current_ts=$3 local timeout=${UAT_RESET_TIMEOUT:-600} local elapsed=0 local matching_crd + local IFS=$'\n' log "Waiting for GPU reset for $uuid on $node (matching GPUReset CRD)..." while [[ $elapsed -lt $timeout ]]; do - local gpu_reset_list=$(kubectl get gpuresets -o json | jq -c '.items[]') - local IFS=$'\n' + local gpu_reset_list + gpu_reset_list=$(kubectl get gpuresets -o json | jq -c '.items[]') || true🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/uat/tests.sh` around lines 229 - 232, The IFS override is declared with `local IFS=$'\n'` inside the loop, which is redundant because `local` is function-scoped; move the `local IFS=$'\n'` declaration up with the other `local` declarations at the top of the enclosing function (so IFS is set once), and keep the `gpu_reset_list=$(kubectl get gpuresets -o json | jq -c '.items[]')` assignment inside the polling loop but remove the `local` keyword there (so it reassigns the existing variable each iteration rather than redeclaring it). Reference the variables/assignments `IFS`, `gpu_reset_list`, and the loop that does `for gpu_reset in $gpu_reset_list; do`.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/uat/tests.sh`:
- Around line 421-423: The test calls wait_for_node_unquarantine twice: once
inside wait_for_gpu_reset and again immediately after the wait_for_gpu_reset
"$gpu_node" "$uuid" "$current_ts" invocation; remove the explicit post-call to
wait_for_node_unquarantine to avoid redundancy and match the pattern used in
test_gpu_monitoring_dcgm where wait_for_boot_id_change (which internally calls
wait_for_node_unquarantine) is not followed by a second call.
---
Duplicate comments:
In `@tests/uat/tests.sh`:
- Line 240: The assignment using `local start_ts=$(date -d "$start_time" +%s)`
masks `date`'s exit status; split declaration and assignment so failures
propagate: declare `local start_ts` first, then run `start_ts=$(date -d
"$start_time" +%s)` and immediately check its exit code (or test that `start_ts`
is non-empty/numeric) before using it in the `[ "$start_ts" -gt "$current_ts" ]`
comparison; update the block around `start_ts`, `start_time`, and the comparison
to handle and log a `date -d` failure (or exit) if the command fails.
---
Nitpick comments:
In `@tests/uat/tests.sh`:
- Around line 243-250: The inner UUID loop and the outer gpu_reset loop should
short-circuit after the first match to avoid overwriting matching_crd; when you
set matching_crd inside the if block (where you currently call log "GPUReset
$matching_crd matches $uuid and $node"), add a break that exits both loops (use
break 2 in this shell context) so you stop iterating current_uuid and gpu_reset
immediately once a match is found.
- Around line 229-232: The IFS override is declared with `local IFS=$'\n'`
inside the loop, which is redundant because `local` is function-scoped; move the
`local IFS=$'\n'` declaration up with the other `local` declarations at the top
of the enclosing function (so IFS is set once), and keep the
`gpu_reset_list=$(kubectl get gpuresets -o json | jq -c '.items[]')` assignment
inside the polling loop but remove the `local` keyword there (so it reassigns
the existing variable each iteration rather than redeclaring it). Reference the
variables/assignments `IFS`, `gpu_reset_list`, and the loop that does `for
gpu_reset in $gpu_reset_list; do`.
Signed-off-by: Nathan Herz <nherz@nvidia.com>
There was a problem hiding this comment.
🧹 Nitpick comments (2)
tests/uat/tests.sh (2)
228-230:local IFS=$'\n'is re-declared on every loop iteration — move it outside.
local IFS=$'\n'inside thewhileloop is redundant after the first iteration; the assignment is re-executed on every poll cycle. Moving it above thewhileloop is cleaner and makes the intent (splitgpu_reset_listby newlines) obvious.♻️ Suggested move
+ local IFS=$'\n' while [[ $elapsed -lt $timeout ]]; do local gpu_reset_list=$(kubectl get gpuresets -o json | jq -c '.items[]') - local IFS=$'\n'🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/uat/tests.sh` around lines 228 - 230, Move the IFS setting out of the while loop so it's not re-declared every iteration: set local IFS=$'\n' once before the while [[ $elapsed -lt $timeout ]]; do loop that reads local gpu_reset_list=$(kubectl get gpuresets -o json | jq -c '.items[]'), ensuring gpu_reset_list is still split on newlines as intended; update any surrounding comments if needed to reflect that IFS applies for the loop body rather than being reset each poll.
243-250: Addbreakafter match to prevent "last-match-wins" and unnecessary iterations.After
matching_crdis set (line 245), neither the UUIDforloop nor the enclosinggpu_resetforloop breaks. This has two effects:
- All remaining UUIDs in the same CRD are checked unnecessarily.
- All remaining GPUReset CRDs are still processed, so
matching_crdcan be overwritten by a later CRD that also satisfies the conditions — giving "last-match-wins" semantics rather than "first-match-wins".♻️ Proposed fix
for current_uuid in $uuids; do if [ "$current_uuid" == "$uuid" ]; then matching_crd=$(echo "$gpu_reset" | jq -r '.metadata.name') log "GPUReset $matching_crd matches $uuid and $node" + break fi done fi + if [ -n "$matching_crd" ]; then + break + fi done🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/uat/tests.sh` around lines 243 - 250, The loop that sets matching_crd can continue iterating and be overwritten; after you set matching_crd inside the inner UUID loop (the block that logs "GPUReset $matching_crd matches $uuid and $node"), add a break to exit the UUID loop and then add a break (or otherwise stop) the enclosing gpu_reset loop so that once a matching CRD is found you stop further UUID checks and stop iterating remaining gpu_reset items, ensuring first-match-wins behavior for matching_crd.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@tests/uat/tests.sh`:
- Line 240: The issue is that `local start_ts=$(date -d "$start_time" +%s)`
masks a failing `date` exit code; change it to first run `date -d "$start_time"
+%s` into a non-local temp (or run it and capture its exit), test the command's
exit status, and only then assign/declare the local `start_ts` (or exit with a
clear error) so malformed `start_time` won't produce an empty/invalid value;
look for the `start_ts`/`start_time` usage in tests.sh and update the logic to
validate the `date` call before using `start_ts` in the comparison with
`current_ts`.
---
Nitpick comments:
In `@tests/uat/tests.sh`:
- Around line 228-230: Move the IFS setting out of the while loop so it's not
re-declared every iteration: set local IFS=$'\n' once before the while [[
$elapsed -lt $timeout ]]; do loop that reads local gpu_reset_list=$(kubectl get
gpuresets -o json | jq -c '.items[]'), ensuring gpu_reset_list is still split on
newlines as intended; update any surrounding comments if needed to reflect that
IFS applies for the loop body rather than being reset each poll.
- Around line 243-250: The loop that sets matching_crd can continue iterating
and be overwritten; after you set matching_crd inside the inner UUID loop (the
block that logs "GPUReset $matching_crd matches $uuid and $node"), add a break
to exit the UUID loop and then add a break (or otherwise stop) the enclosing
gpu_reset loop so that once a matching CRD is found you stop further UUID checks
and stop iterating remaining gpu_reset items, ensuring first-match-wins behavior
for matching_crd.
Summary
The UAT test for GPU reset is currently failing because the "/var/log/syslog" directory does not exist on GCP nodes. Since we do not have a container with journalctl available to read syslogs directly from "/var/log/journal" to detect the GPU reset in the test, we will wait for a matching GPU reset CRD to be created. This check is required to confirm that the given injected XID error resulted in a GPU reset rather than a reboot. In summary, will wait for a GPUReset CRD which:
Testing
I locally ran the wait_for_gpu_reset function with these arguments:
Running the function against a cluster with a matching GPUReset CRD:
I confirmed that the test will fail if a GPUReset CRD does not exist with the expected node name or GPU UUID. Additionally, the test will fail if the there is a GPUReset CRD with the required node and GPU but was created with too old of a timestamp.
Type of Change
Component(s) Affected
Testing
Checklist
Summary by CodeRabbit