Add NCCL all reduce test in preflight framework #837

XRFXLP wants to merge 18 commits into NVIDIA:main from
Conversation
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
📝 Walkthrough

Adds an NCCL all-reduce preflight check: a new Python package and Docker image, Helm chart/fabric overlays and topology support, Go webhook/controller changes to mount topology and extra volumes, protobuf/gRPC health event plumbing, tests, CI matrix updates, and supporting topology data files.
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant Entrypoint as Entrypoint (pod)
participant GangWaiter as GangWaiter
participant CM as Gang ConfigMap
participant Torchrun as torchrun
participant NCCL as NCCL processes
participant Health as HealthReporter
participant PC as Platform Connector
Entrypoint->>GangWaiter: wait(pod_name, timeout)
GangWaiter->>CM: poll/read gang config
CM-->>GangWaiter: GangConfig (when formed)
GangWaiter-->>Entrypoint: GangConfig
Entrypoint->>Torchrun: exec torchrun with args
Torchrun->>NCCL: init process group / run benchmark
alt Success
NCCL->>Health: send_success(...)
Health->>PC: HealthEventOccurredV1 (grpc over unix socket, with retries)
PC-->>Health: ack
else Failure
NCCL->>Health: send_failure(...)
Health->>PC: HealthEventOccurredV1 (fatal)
PC-->>Health: ack
end
NCCL->>NCCL: cleanup
```

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~55 minutes
🚥 Pre-merge checks (✅ 2 | ❌ 1)

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 16
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
preflight/pkg/webhook/injector.go (1)
483-497: ⚠️ Potential issue | 🟠 Major

Update design documentation to reflect the env var rename to `POD_NAME` and `POD_IP`.

The env var rename from `MY_POD_NAME`/`MY_POD_IP` to `POD_NAME`/`POD_IP` is a breaking change for consumers. The preflight-checks code (nccl-allreduce) already expects `POD_NAME`, but the design document `docs/designs/026-preflight-checks.md` still references `MY_POD_NAME` and must be updated to reflect this change. Additionally, verify that the MongoDB Helm chart templates (which reference `MY_POD_NAME`) are not injected via this webhook or have been updated accordingly.
🤖 Fix all issues with AI agents
In `@distros/kubernetes/nvsentinel/charts/preflight/templates/configmap.yaml`:
- Around line 43-46: The rendered YAML has an extra blank line because the
template uses "{{ toYaml .Values.gangCoordination.extraHostPathMounts }}"
without the left-trim marker; update the template around the extraHostPathMounts
block to use the left-trim form (i.e., change the toYaml invocation for
.Values.gangCoordination.extraHostPathMounts to include "{{-") so it matches the
other toYaml calls and removes the unwanted blank line after the
"extraHostPathMounts:" key.
In `@distros/kubernetes/nvsentinel/charts/preflight/values.yaml`:
- Around line 186-194: The gangDiscovery section is currently populated with
Volcano-specific defaults; change it to be empty by default to match the
documented behavior: replace or clear the gangDiscovery map so it’s an empty map
(e.g., gangDiscovery: {}) and move the Volcano configuration (podGroupGVR, name,
annotationKeys, labelKeys, minCountExpr) into commented example lines so users
can opt-in to Volcano by uncommenting; ensure references to podGroupGVR,
minCountExpr and name remain in the commented example for clarity.
- Around line 129-132: The init container image for the NCCL all-reduce
preflight job is set to a personal registry image (image:
xrfxlp/preflight-nccl-allreduce:11131602) and should be updated to the project’s
official registry; locate the preflight entry with name:
preflight-nccl-allreduce and replace the image value with the canonical image
used by the chart (i.e., the same registry/repository pattern used by the main
preflight image reference above) or a values.yaml variable that points to the
maintained registry so the init container uses the official image.
- Around line 210-219: The values file currently enables an AWS-specific
hostPath under extraHostPathMounts (the entry with name: host-opt-amazon and
hostPath:/opt/amazon, mountPath:/opt/amazon) despite the comment saying "Keep
empty by default"; change this to leave extraHostPathMounts empty by default and
move the AWS example into a commented-out block (or fully comment out the
host-opt-amazon entry) so the default is CSP-agnostic while preserving the
example for users.
- Around line 146-170: The default values file is CSP-specific because the AWS
EFA env vars are uncommented; comment out the AWS EFA entries so the chart
remains provider-agnostic: change the FI_EFA_USE_DEVICE_RDMA and FI_PROVIDER env
entries and the EFA-specific LD_LIBRARY_PATH value to commented lines (same
style as the Azure InfiniBand block) so NCCL/GLOO/TP provider settings are only
enabled when intentionally uncommented.
- Around line 17-20: Replace the personal Docker Hub image and tag in the Helm
values (image.repository and image.tag) so they point to the official registry
and a semantic version: set image.repository to
ghcr.io/nvidia/nvsentinel/preflight (or the sanctioned canonical registry) and
change image.tag from "10581602" to a proper semantic version (e.g., "vX.Y.Z" or
"X.Y.Z") before merging; ensure image.pullPolicy remains appropriate
(IfNotPresent or Always) per release policy.
- Around line 98-128: The initContainers entries for preflight checks were fully
commented out; restore (uncomment) the preflight container definitions for
preflight-dcgm-diag and preflight-nccl-loopback in values.yaml so the preflight
checks are preserved, keeping the existing fields (name, image, volumeMounts,
env/Test env vars like BW_THRESHOLD_GBPS and TEST_SIZE_MB and any resources)
intact, ensure indentation and YAML validity after uncommenting, and verify the
nvsentinel-socket volumeMount and any SKIP_BANDWIDTH_CHECK logic remain correct
so the checks run as intended.
In `@preflight-checks/nccl-allreduce/Dockerfile`:
- Around line 75-81: The current RUN sequence installs the wheel with --no-deps
and separately forces protobuf/grpcio, which prevents pip from resolving a
consistent dependency set and ignores the wheel's pyproject constraints;
instead, run a single pip install that installs the wheel with its dependencies
(remove --no-deps on the /tmp/nvsentinel_nccl_allreduce*.whl install) and ensure
pip applies your constraints by passing --constraint /tmp/constraints.txt in
that same invocation (or use --no-build-isolation if you must respect pyproject
resolution), and remove the separate forced protobuf install so pip can resolve
a compatible protobuf/grpcio when installing the wheel.
In `@preflight-checks/nccl-allreduce/nccl_allreduce/__main__.py`:
- Line 210: The _handle_success function signature is missing a type hint for
the result parameter; update the signature def _handle_success(cfg: Config,
result) -> int to def _handle_success(cfg: Config, result: BenchmarkResult) ->
int, and ensure BenchmarkResult is imported or referenced (e.g., from
nccl_allreduce.types or the module where BenchmarkResult is declared) so the
annotation resolves.
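
A minimal sketch of the annotated signature; the import paths for `BenchmarkResult` and `Config` are assumptions about the module layout, not confirmed against the repo:

```python
from nccl_allreduce.benchmark import BenchmarkResult  # assumed location of BenchmarkResult
from nccl_allreduce.config import Config  # assumed location of Config


def _handle_success(cfg: Config, result: BenchmarkResult) -> int:
    """Only the annotation changes; the real body is elided here."""
    ...
```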
In `@preflight-checks/nccl-allreduce/nccl_allreduce/config.py`:
- Around line 72-80: The env var names used when reading configuration are wrong
and don't match the webhook injector; update the calls that set gang_config_dir
and gang_timeout_seconds to read the webhook's names instead: replace reading
"GANG_CONFIG_DIR" with "GANG_CONFIG_PATH" for the gang_config_dir assignment and
replace reading "GANG_TIMEOUT_SECONDS" with "GANG_TIMEOUT" for the
gang_timeout_seconds assignment (these calls use _parse_float/_parse_int
respectively), so the runtime picks up values injected by injector.go rather
than silently falling back to defaults.
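
A self-contained sketch of the corrected reads; the `_parse_int` helper below is an illustrative stand-in for the one in `config.py`, and the default values are made up for the example:

```python
import os


def _parse_int(raw: str | None, default: int) -> int:
    """Stand-in for the module's helper; the real one lives in config.py."""
    if raw is None:
        return default
    try:
        return int(raw)
    except ValueError:
        return default


# Read the names the webhook injector actually sets, not the old
# GANG_CONFIG_DIR / GANG_TIMEOUT_SECONDS names it never populates.
gang_config_dir = os.environ.get("GANG_CONFIG_PATH", "/etc/gang")      # default path is illustrative
gang_timeout_seconds = _parse_int(os.environ.get("GANG_TIMEOUT"), 600)  # default is illustrative
```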
In `@preflight-checks/nccl-allreduce/nccl_allreduce/gang.py`:
- Around line 211-228: In _read_peers, parsing parts[2] with int(...) can raise
ValueError for malformed ranks; wrap the rank parse in a try/except in the loop
that builds peers (the block that sets rank = int(parts[2].strip()) if
len(parts) >= 3 else -1), catch ValueError, log a warning including the
offending line and the invalid rank string (use log.warning with extra={"line":
line, "bad_rank": parts[2].strip()}), and fall back to rank = -1 (or skip the
peer if you prefer consistent behavior) before appending PeerInfo so the error
is handled and the log clearly identifies the problematic peer line.
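
A sketch of the guarded parse, assuming a comma-separated `name,ip,rank` peer-line layout and a simplified `PeerInfo` stand-in (the real class and field separator live in `gang.py`):

```python
import logging
from typing import NamedTuple

log = logging.getLogger(__name__)


class PeerInfo(NamedTuple):
    """Stand-in for the real PeerInfo defined in gang.py."""

    name: str
    ip: str
    rank: int


def parse_peer_line(line: str) -> PeerInfo:
    parts = [p.strip() for p in line.split(",")]
    name = parts[0] if len(parts) >= 1 else ""
    ip = parts[1] if len(parts) >= 2 else ""
    rank = -1
    if len(parts) >= 3:
        try:
            rank = int(parts[2])
        except ValueError:
            # Keep the peer but fall back to rank -1; the extra fields make the
            # offending line easy to find in logs.
            log.warning(
                "Malformed peer rank, falling back to -1",
                extra={"line": line, "bad_rank": parts[2]},
            )
    return PeerInfo(name=name, ip=ip, rank=rank)
```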
In
`@preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2_grpc.py`:
- Around line 9-26: The generated stub (health_event_pb2_grpc.py) defines
GRPC_GENERATED_VERSION = '1.75.1' and will raise at import if installed grpcio
is older; update pyproject.toml so the grpcio dependency is bumped to a
constraint that guarantees >=1.75.1 (e.g., change grpcio = "^1.74.0" to grpcio =
"^1.75.1") to match GRPC_GENERATED_VERSION and the existing grpcio-tools =
"1.75.1", then regenerate/update the lockfile and reinstall dependencies so the
environment uses a compatible grpcio.
- Around line 63-66: The generated gRPC file declares GRPC_GENERATED_VERSION =
'1.75.1' and uses grpc.method_handlers_generic_handler, so update the project
dependency to require grpcio >= 1.75.1: edit pyproject.toml to change the grpcio
version specifier from "^1.74.0" to ">=1.75.1" (or a compatible caret range that
starts at 1.75.1) so the runtime check in health_event_pb2_grpc.py succeeds and
you won't get version mismatch errors.
In `@preflight-checks/nccl-allreduce/pyproject.toml`:
- Around line 22-27: The pyproject.toml pins protobuf = ">=6.31.1,<7.0.0" and
grpcio = "^1.74.0" but the Dockerfile currently installs older wheels
(protobuf>=5.x and grpcio>=1.60.x) with --no-deps, causing runtime mismatches;
update the Dockerfile package installation lines that install the prebuilt
wheels so they request protobuf>=6.31.1,<7.0.0 and grpcio>=1.74.0,<2.0.0 (or
otherwise match the pyproject.toml constraints) and ensure the install step
either allows dependency resolution or explicitly installs matching dependency
versions to keep runtime and pyproject constraints consistent.
In `@preflight-checks/nccl-allreduce/scripts/entrypoint.py`:
- Around line 105-112: The code does unguarded int() conversions of environment
variables (GANG_TIMEOUT_SECONDS -> gang_timeout and NPROCS_PER_NODE ->
nprocs_per_node) which will raise raw ValueError on bad input; wrap these
conversions in try/except blocks around int(...) for both the
GANG_TIMEOUT_SECONDS/DEFAULT_GANG_TIMEOUT conversion that sets gang_timeout and
the NPROCS_PER_NODE branch that sets nprocs_per_node (when not using
detect_gpu_count()), validate the parsed value is a positive integer (and
non-zero if required), and on parse/validation failure emit a clear error
message including the env var name and bad value and exit with a non-zero code
(e.g., sys.exit(1)) so failures are reported cleanly.
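
A sketch of the guarded conversions; the constant value and the fallback of 8 processes are illustrative, not taken from the real entrypoint:

```python
import os
import sys

DEFAULT_GANG_TIMEOUT = 600  # illustrative; the real constant lives in entrypoint.py


def _env_int(name: str, default: int) -> int:
    """Parse a positive integer env var, exiting with a clear message on bad input."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    try:
        value = int(raw)
    except ValueError:
        print(f"error: {name}={raw!r} is not an integer", file=sys.stderr)
        sys.exit(1)
    if value <= 0:
        print(f"error: {name}={raw!r} must be a positive integer", file=sys.stderr)
        sys.exit(1)
    return value


gang_timeout = _env_int("GANG_TIMEOUT_SECONDS", DEFAULT_GANG_TIMEOUT)
nprocs_per_node = _env_int("NPROCS_PER_NODE", 8)  # fallback of 8 is illustrative
```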
In `@preflight/pkg/config/config.go`:
- Around line 98-127: The HostPathMount.ReadOnly field is documented to default
to true but setDefaults() doesn't enforce this; update the setDefaults()
function to iterate HostPathMount entries (e.g., where the enclosing config's
ExtraHostPathMounts is handled) and if a HostPathMount has ReadOnly == nil set
it to a pointer to true (create a local variable trueVal := true and assign
&trueVal) so consumers no longer need to treat nil specially; ensure you modify
the same setDefaults() that initializes other defaults for the config containing
ExtraHostPathMounts.
🧹 Nitpick comments (15)
preflight-checks/nccl-allreduce/nccl_allreduce/logger.py (1)
31-31: Import `Callable` from `collections.abc` instead of `typing`.

`typing.Callable` is deprecated since Python 3.9 in favor of `collections.abc.Callable`. Since this project targets Python 3.10+, use the preferred import.

♻️ Proposed fix

```diff
-from typing import Any, Callable, Final
+from collections.abc import Callable
+from typing import Any, Final
```

preflight-checks/nccl-allreduce/Makefile (2)
28-29: Consider adding a `clean` target.

The Makefile defines `protos-clean` but lacks a top-level `clean` target. Adding one that delegates to `protos-clean` (and any future cleanup) improves ergonomics and satisfies the conventional Makefile interface.

♻️ Proposed addition (after the `all` target)

```diff
 .PHONY: all
 all: lint-test
+
+.PHONY: clean
+clean: protos-clean
```
62-80: `docker-build` and `docker-publish` duplicate the entire build command.

The only difference is `$(DOCKER_LOAD_ARG)` vs `--push`. This is a common pattern in Makefiles and acceptable, but if more build args are added later, they'll need updating in two places.

preflight/pkg/config/config.go (1)
186-196: No validation for `ExtraHostPathMounts` entries.

A missing `Name`, `HostPath`, or `MountPath` will silently produce broken volume/mount specs at webhook injection time. Consider validating required fields and the `HostPathType` value (if set) within `validate()`.

Proposed validation sketch

```diff
 func (c *FileConfig) validate() error {
 	if c.GangCoordination.Enabled {
 		timeout, err := time.ParseDuration(c.GangCoordination.Timeout)
 		if err != nil {
 			return fmt.Errorf("invalid gangCoordination.timeout %q: %w", c.GangCoordination.Timeout, err)
 		}
 		c.GangCoordination.TimeoutDuration = timeout
+
+		for i, m := range c.GangCoordination.ExtraHostPathMounts {
+			if m.Name == "" || m.HostPath == "" || m.MountPath == "" {
+				return fmt.Errorf("gangCoordination.extraHostPathMounts[%d]: name, hostPath, and mountPath are required", i)
+			}
+		}
 	}
 	return nil
 }
```

preflight/pkg/webhook/injector.go (2)
220-252: NCCL-specific volume mounts are injected into every gang init container.

`/dev/shm`, the NCCL topo ConfigMap, and all `extraHostPathMounts` are added to every init container when `gangCtx != nil`, including non-NCCL containers like `preflight-dcgm-diag`. While harmless for `/dev/shm`, mounting arbitrary host paths into containers that don't need them widens the attack surface unnecessarily.

Consider gating these mounts on the container name (e.g., checking for the NCCL allreduce container specifically), similar to how `injectDCGMEnv` gates on `container.Name != "preflight-dcgm-diag"`.
340-352: No `SizeLimit` on the RAM-backed emptyDir.

The `dshm` volume uses `StorageMediumMemory` without a `SizeLimit`. On nodes with constrained memory, NCCL could consume all available RAM. Consider setting an explicit size limit, or document that this is intentionally unbounded.

preflight-checks/nccl-allreduce/nccl_allreduce/benchmark.py (3)
238-294: Consider explicitly freeing the GPU tensor between test sizes.

Each call to `_run_single` allocates a potentially large tensor (e.g., 4 GB for float32). When `run()` iterates over multiple message sizes, the tensor from the previous iteration relies on Python GC for deallocation. For large sizes, this could cause an OOM before the old tensor is freed.

Proposed fix — explicit cleanup

```diff
     # Calculate bandwidth metrics
     avg_time = elapsed / self._iters
     algo_bw = size_bytes / avg_time / 1e9  # GB/s
     # Bus bandwidth accounts for ring/tree algorithm overhead
     bus_bw = algo_bw * (2 * (world_size - 1) / world_size)
     passed = bus_bw >= self._threshold

+    del tensor
+    torch.cuda.empty_cache()
+
     return TestResult(
```
255-270: Rename unused loop variables per linter.

Proposed fix

```diff
-        for i in range(self._warmup):
+        for _ in range(self._warmup):
             dist.all_reduce(tensor, op=dist.ReduceOp.SUM)

-        for i in range(self._iters):
+        for _ in range(self._iters):
             dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
```
72-95: `parse_size` doesn't handle bare numeric strings.

If someone passes `MESSAGE_SIZES="4294967296"` (bytes, no suffix), this raises a `ValueError`. Consider supporting bare integers as raw byte counts, or document that a suffix is always required.
preflight-checks/nccl-allreduce/nccl_allreduce/health.py (2)

102-141: Consider populating `entitiesImpacted` with the node entity.

`entitiesImpacted` is hardcoded to `[]`. Since the reporter already knows `node_name` and `COMPONENT_CLASS`, it could include the affected node entity to give the Platform Connector better correlation context. Not blocking, but worth considering.
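
A shape-only sketch of what the populated field could carry; the actual entity message and field names must be checked against `health_event.proto`, so plain dicts stand in for protobuf messages here:

```python
def build_entities_impacted(node_name: str) -> list[dict[str, str]]:
    """Shape sketch only: the real code would build protobuf entity messages
    (names to be confirmed against health_event.proto), not dicts."""
    return [{"entityType": "node", "entityValue": node_name}]
```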
183-207: Consider adding jitter and broadening the caught exception type.

Two minor robustness observations on the retry loop:

- Only `grpc.RpcError` is caught. While gRPC generally wraps connection failures in `RpcError`, transient OS-level errors (e.g., `OSError` if the socket path is temporarily unavailable during container startup) would bypass retries and propagate as an unhandled exception to `_send`, which raises `RuntimeError` only on exhausted retries.
- No jitter is applied to the backoff delay. For a single-reporter-per-node preflight this is fine, but adding small random jitter is a low-cost resilience improvement.
Neither is blocking for a WIP PR.
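
A sketch combining both suggestions, with the module's retry constants stubbed locally (the real `MAX_RETRIES` / `INITIAL_DELAY` / `BACKOFF_FACTOR` live in health.py):

```python
import random
from collections.abc import Callable
from time import sleep

import grpc

MAX_RETRIES = 5        # illustrative values; see health.py for the real ones
INITIAL_DELAY = 0.5
BACKOFF_FACTOR = 2.0


def send_with_retries(do_rpc: Callable[[], None]) -> bool:
    """Retry loop with jitter that also treats OSError as retryable."""
    delay = INITIAL_DELAY
    for attempt in range(MAX_RETRIES):
        try:
            do_rpc()
            return True
        except (grpc.RpcError, OSError):
            if attempt < MAX_RETRIES - 1:
                # Up to 10% random jitter so concurrent senders don't retry in lockstep.
                sleep(delay + random.uniform(0, delay * 0.1))
                delay *= BACKOFF_FACTOR
    return False
```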
preflight-checks/nccl-allreduce/nccl_allreduce/__main__.py (4)
123-136: Redundant string check and uncaught `ValueError` in the error path.

Two items:

- `"NCCL" in error_str or "nccl" in error_str.lower()`: the second condition already covers the first (case-insensitive). Simplify to just `"nccl" in error_str.lower()`.
- If `_handle_failure` is ever called with an error whose `error_code is None` (e.g., `SUCCESS`, `HEALTH_REPORT_FAILED`), `HealthReporter.send_failure` raises `ValueError`, which is not caught by the `except RuntimeError` in `_handle_failure` (line 272). Currently safe because all call sites pass errors with non-None codes, but this is fragile against future changes.

Proposed simplification for the string check

```diff
-        if "NCCL" in error_str or "nccl" in error_str.lower():
+        if "nccl" in error_str.lower():
```
265-277: `except RuntimeError` won't catch `ValueError` from `send_failure`.

`HealthReporter.send_failure` raises `ValueError` when the error type has no `error_code` (line 91 of health.py). This `except RuntimeError` block would not catch it. While current callers always pass errors with valid codes, consider catching `(RuntimeError, ValueError)` or `Exception` here for defense-in-depth.

Proposed fix

```diff
-    except RuntimeError as err:
+    except (RuntimeError, ValueError) as err:
```
137-141: Minor: prefer `contextlib.suppress(Exception)` over bare `try/except/pass`.

This aligns with the Ruff SIM105 hint and is slightly more idiomatic.

Proposed fix

```diff
+import contextlib
 ...
     finally:
-        try:
-            dist.destroy_process_group()
-        except Exception:
-            pass  # Ignore cleanup errors
+        with contextlib.suppress(Exception):
+            dist.destroy_process_group()
```
98-119: All ranks will send health failure events if `init_process_group` fails.

When `dist.init_process_group` fails, the rank is unknown, so every torchrun process calls `_handle_failure` → `HealthReporter.send_failure`. This can produce N duplicate health events (one per GPU process). This may be acceptable given the constraint, but worth documenting with a comment so future maintainers understand the trade-off.
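
One way to document and optionally mitigate the duplication, sketched under the assumption that torchrun's `LOCAL_RANK` env var is usable even when `init_process_group` fails:

```python
import os


def should_report_init_failure() -> bool:
    """De-duplication guard for the init-failure path.

    When dist.init_process_group fails, no global rank exists yet, so every
    torchrun worker reaches the failure handler. torchrun sets LOCAL_RANK per
    worker regardless, so letting only local rank 0 report collapses the N
    duplicate health events per node down to one.
    """
    return int(os.environ.get("LOCAL_RANK", "0")) == 0
```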
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@preflight/pkg/webhook/injector.go`:
- Around line 381-393: The added emptyDir Volume for shared /dev/shm (when
gangCtx != nil and dshmVolumeName is not in existingVolumes) uses
StorageMediumMemory with no SizeLimit; update the VolumeSource.EmptyDir for
dshmVolumeName where volumesToAdd is appended to include a SizeLimit (e.g.,
derive from the Pod's memory request or a configurable cap) using a
resource.Quantity (via resource.ParseQuantity or resource.MustParse) so the
in-RAM emptyDir cannot consume all node memory; ensure the logic references
dshmVolumeName and volumesToAdd and preserves existing behavior when SizeLimit
cannot be determined.
🧹 Nitpick comments (4)
preflight/pkg/config/config.go (1)
130-134: Consider validating `HostPathType` during config loading.

`HostPathType` is a free-form string here, validated only at injection time in `injector.go` (`parseHostPathType`). Invalid values silently skip the mount with a warning log. Validating at `Load()` time would surface misconfigurations earlier (startup vs. per-pod admission).

preflight/pkg/webhook/injector.go (2)
259-293: Duplicated ReadOnly defaulting logic — extract a helper.

The nil-means-true pattern for `ReadOnly` is repeated verbatim for `ExtraHostPathMounts` (lines 264-267) and `ExtraVolumeMounts` (lines 283-286). Extract a small helper to reduce duplication.

Proposed helper

```go
func isReadOnly(ro *bool) bool {
	if ro == nil {
		return true
	}
	return *ro
}
```

Then use `ReadOnly: isReadOnly(m.ReadOnly)` in both loops.
158-176: Add test coverage for IP resource mirroring logic.

This code mirrors base NIC quantities to `.IP` companion resources to ensure init containers match workload container footprints. The logic is correct—init containers capture the resource state at webhook time, independent of later platform injections. However, there's no test coverage verifying this behavior. Add tests to validate:

- `.IP` resources are mirrored when missing from workload containers
- Existing `.IP` resources are not overwritten
- Mirroring only happens for configured network resources
distros/kubernetes/nvsentinel/charts/preflight/values.yaml (1)
243-262: GCP multi-network resources are uncommented — consider commenting out for CSP-agnostic defaults.

All 16 GCP `networking.gke.io.networks/gpu-nic*` entries are active in the default values. While unused resources won't cause failures (they simply won't match on non-GCP clusters), this clutters the configuration and is inconsistent with the approach of keeping CSP-specific settings commented. Consider commenting these out with an explanatory note, similar to the provider-specific NCCL env vars.
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Actionable comments posted: 2
🧹 Nitpick comments (3)
preflight/pkg/webhook/injector.go (1)
126-132: Nit: prefer `range` over index-based loop.

A `range` loop is more idiomatic Go here since only the value is needed.

Suggested change

```diff
-	for idx := 0; idx < len(initContainers); idx++ {
+	for _, ic := range initContainers {
 		patches = append(patches, PatchOperation{
 			Op:    "add",
 			Path:  "/spec/initContainers/-",
-			Value: initContainers[idx],
+			Value: ic,
 		})
 	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@preflight/pkg/webhook/injector.go` around lines 126 - 132, The loop over initContainers should use an idiomatic range loop: replace the index-based for loop that appends PatchOperation entries into patches with a for _, ic := range initContainers { patches = append(patches, PatchOperation{ Op: "add", Path: "/spec/initContainers/-", Value: ic, }) } so you iterate values directly; update the block around initContainers, patches, and PatchOperation in injector.go accordingly.

distros/kubernetes/nvsentinel/charts/preflight/values.yaml (1)
159-178: Default `networkResourceNames` includes all CSP-specific resources.

The default values file now lists AWS EFA and all 8 GCP TCPXO NIC resources. While unused resource names are harmless (they won't match if the resources don't exist on the cluster), this makes the defaults CSP-aware rather than agnostic. Consider whether these should be moved to the respective overlay files (`values-efa.yaml`, `values-tcpxo.yaml`) and merged via Helm's list-merge behavior, or kept here for simplicity with a comment explaining the rationale.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@distros/kubernetes/nvsentinel/charts/preflight/values.yaml` around lines 159 - 178, The default list networkResourceNames currently contains CSP-specific entries (AWS EFA and eight GCP TCPXO NICs); update the chart to keep values.yaml CSP-agnostic by moving those CSP-specific strings into their respective overlay value files (e.g., values-efa.yaml for "vpc.amazonaws.com/efa" and values-tcpxo.yaml for the "networking.gke.io.networks/gpu-nicX" entries) and rely on Helm to merge lists when overlays are applied, or if you choose to keep them in values.yaml add a clear comment explaining why CSP-specific names are included; edit the networkResourceNames block in values.yaml and add/merge the entries into values-efa.yaml and values-tcpxo.yaml respectively.

preflight/pkg/config/config.go (1)
224-234: Consider validating required fields on `ExtraHostPathMounts` and `ExtraVolumeMounts`.

`HostPathMount` requires `Name`, `HostPath`, and `MountPath` to produce a valid Kubernetes volume/mount, and `ExtraVolumeMount` requires `Name` and `MountPath`. A missing field would silently produce a broken pod spec at admission time rather than failing early at config load. A quick validation loop in `validate()` would surface misconfigurations sooner.

Proposed addition to `validate()`

```diff
 func (c *FileConfig) validate() error {
 	if c.GangCoordination.Enabled {
 		timeout, err := time.ParseDuration(c.GangCoordination.Timeout)
 		if err != nil {
 			return fmt.Errorf("invalid gangCoordination.timeout %q: %w", c.GangCoordination.Timeout, err)
 		}
 		c.GangCoordination.TimeoutDuration = timeout
+
+		for i, m := range c.GangCoordination.ExtraHostPathMounts {
+			if m.Name == "" || m.HostPath == "" || m.MountPath == "" {
+				return fmt.Errorf("gangCoordination.extraHostPathMounts[%d]: name, hostPath, and mountPath are required", i)
+			}
+		}
+		for i, m := range c.GangCoordination.ExtraVolumeMounts {
+			if m.Name == "" || m.MountPath == "" {
+				return fmt.Errorf("gangCoordination.extraVolumeMounts[%d]: name and mountPath are required", i)
+			}
+		}
 	}
 	return nil
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@preflight/pkg/config/config.go` around lines 224 - 234, The validate() method on FileConfig should check that each entry in ExtraHostPathMounts has non-empty Name, HostPath, and MountPath and that each entry in ExtraVolumeMounts has non-empty Name and MountPath; update FileConfig.validate() to loop over c.ExtraHostPathMounts and c.ExtraVolumeMounts, validate required fields for each HostPathMount and VolumeMount (reporting the slice index and which field(s) are missing), and return a descriptive error (e.g., "invalid ExtraHostPathMounts[2]: missing HostPath") so misconfigurations fail fast instead of producing broken pod specs.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@distros/kubernetes/nvsentinel/charts/preflight/values.yaml`:
- Around line 183-191: The gangDiscovery block is incorrectly prefilled with
Volcano-specific values; change the default to an empty configuration that uses
the Kubernetes native WorkloadRef API by clearing or removing the
Volcano-specific fields (e.g., remove or empty name, annotationKeys, labelKeys,
podGroupGVR and minCountExpr under gangDiscovery) and move the Volcano
configuration into commented example text instead; update the gangDiscovery
stanza so it defaults to an empty object (or blank fields) and leave the
existing Volcano values as a commented example for reference.
In `@preflight/pkg/webhook/injector.go`:
- Around line 266-269: The validation for ExtraHostPathMounts is inconsistent
between buildInitContainers and injectVolumes causing potential missing volumes;
update both places to require the same fields by checking that m.Name,
m.HostPath and m.MountPath are all non-empty (or factor a single helper like
validateHostPathMount(m) and call it from buildInitContainers and injectVolumes)
so entries with a MountPath but missing HostPath are rejected consistently
before injection.
---
Duplicate comments:
In `@distros/kubernetes/nvsentinel/charts/preflight/values-efa.yaml`:
- Line 13: The image value currently points to a personal Docker Hub image
(image: xrfxlp/preflight-nccl-allreduce:10241702) and must be replaced before
merge; update the image entry used by the chart to the official project registry
(e.g., ghcr.io/nvidia/nvsentinel/preflight-nccl-allreduce) and set an
appropriate stable version tag (not a personal/dev tag), ensuring the change is
made to the image key in values-efa.yaml so Helm uses the official artifact.
In `@distros/kubernetes/nvsentinel/charts/preflight/values-ib.yaml`:
- Line 13: The values file contains a personal image reference
"xrfxlp/preflight-nccl-allreduce:10241702" under the image key; replace this
personal registry image with the approved repository/tag (or a chart/global
image placeholder) so the chart uses the official image (e.g., set image:
<official-registry>/preflight-nccl-allreduce:<version> or image: {{
.Values.global.preflightImage }}), ensuring no personal registry strings remain.
In `@distros/kubernetes/nvsentinel/charts/preflight/values-mnnvl-efa.yaml`:
- Line 19: The image field currently references a personal registry image
"xrfxlp/preflight-nccl-allreduce:10241702"; replace that value in the values
YAML (the image key) with the approved organization/official image or a standard
placeholder (e.g., <ORGANIZATION_REGISTRY>/<IMAGE>:<TAG> or the upstream image
name) before merging so no personal images remain in the chart configuration.
In `@distros/kubernetes/nvsentinel/charts/preflight/values-tcpxo.yaml`:
- Line 21: The image field currently points to a personal registry ("image:
xrfxlp/preflight-nccl-allreduce:10241702"); replace that value with the
approved/central registry image tag used by other overlays (or the chart's
templated image value), e.g., set the image key to the canonical image reference
used across overlays (or to a values template like {{ .Values.image.repository
}}:{{ .Values.image.tag }}) so no personal registry remains before merge.
In `@distros/kubernetes/nvsentinel/charts/preflight/values.yaml`:
- Around line 98-128: The initContainers block has the preflight checks
commented out; restore the preflight-dcgm-diag and preflight-nccl-loopback
containers by uncommenting their definitions under initContainers, ensuring the
container names (preflight-dcgm-diag, preflight-nccl-loopback), their image
fields, volumeMounts (nvsentinel-socket -> /var/run) and env entries
(BW_THRESHOLD_GBPS, TEST_SIZE_MB and optional SKIP_BANDWIDTH_CHECK) are present
and valid; verify the preflight-nccl-loopback env values (BW_THRESHOLD_GBPS
value "150", TEST_SIZE_MB "256") and keep optional SKIP_BANDWIDTH_CHECK
commented or set per default behavior before committing.
- Around line 134-135: The initContainer entry using name
"preflight-nccl-allreduce" currently references a personal registry image
"xrfxlp/preflight-nccl-allreduce:10241702"; replace that hardcoded personal
image with the approved/public image or a chart value reference (e.g., use a
values variable like preflightNcclAllreduce.image or the official registry/tag)
so the chart uses a configurable/official image instead of the personal registry
before merging.
- Around line 17-20: The values.yaml currently points to a personal
registry/image ("image.repository: xrfxlp/preflight" and tag "11251702") which
must be replaced before merge; update image.repository to the official
registry/image and set image.tag to the correct release tag (or use a values
variable like image.tag: "stable" or image.pullPolicy as appropriate) so that
the chart references the sanctioned image—change the entries under image
(repository and tag) in the preflight chart values.yaml to the approved
production values.
In `@preflight/pkg/config/config.go`:
- Around line 125-157: The documented default of ReadOnly being true for
HostPathMount and ExtraVolumeMount isn’t applied in setDefaults(); update the
setDefaults() implementation to set ReadOnly to a true pointer when it is nil
for both HostPathMount and ExtraVolumeMount (e.g., if m.ReadOnly == nil { v :=
true; m.ReadOnly = &v }), or use an existing helper to create a *bool(true), so
consumers no longer need to treat nil as true and the documented behavior is
enforced by the config types.
In `@preflight/pkg/webhook/injector.go`:
- Around line 399-411: The emptyDir for /dev/shm created in the injector when
gangCtx != nil (see dshmVolumeName, volumesToAdd, existingVolumes and
corev1.EmptyDirVolumeSource) lacks a SizeLimit and thus can exhaust node RAM;
update the EmptyDirVolumeSource to set a SizeLimit (e.g., use a configured
constant or pod annotation value) when Medium is StorageMediumMemory so the
memory-backed emptyDir is capped, ensure the value is applied where the volume
is appended and validate/parse the configured size before use.
---
Nitpick comments:
In `@distros/kubernetes/nvsentinel/charts/preflight/values.yaml`:
- Around line 159-178: The default list networkResourceNames currently contains
CSP-specific entries (AWS EFA and eight GCP TCPXO NICs); update the chart to
keep values.yaml CSP-agnostic by moving those CSP-specific strings into their
respective overlay value files (e.g., values-efa.yaml for
"vpc.amazonaws.com/efa" and values-tcpxo.yaml for the
"networking.gke.io.networks/gpu-nicX" entries) and rely on Helm to merge lists
when overlays are applied, or if you choose to keep them in values.yaml add a
clear comment explaining why CSP-specific names are included; edit the
networkResourceNames block in values.yaml and add/merge the entries into
values-efa.yaml and values-tcpxo.yaml respectively.
In `@preflight/pkg/config/config.go`:
- Around line 224-234: The validate() method on FileConfig should check that
each entry in ExtraHostPathMounts has non-empty Name, HostPath, and MountPath
and that each entry in ExtraVolumeMounts has non-empty Name and MountPath;
update FileConfig.validate() to loop over c.ExtraHostPathMounts and
c.ExtraVolumeMounts, validate required fields for each HostPathMount and
VolumeMount (reporting the slice index and which field(s) are missing), and
return a descriptive error (e.g., "invalid ExtraHostPathMounts[2]: missing
HostPath") so misconfigurations fail fast instead of producing broken pod specs.
In `@preflight/pkg/webhook/injector.go`:
- Around line 126-132: The loop over initContainers should use an idiomatic
range loop: replace the index-based for loop that appends PatchOperation entries
into patches with a for _, ic := range initContainers { patches =
append(patches, PatchOperation{ Op: "add", Path: "/spec/initContainers/-",
Value: ic, }) } so you iterate values directly; update the block around
initContainers, patches, and PatchOperation in injector.go accordingly.
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Actionable comments posted: 5
🧹 Nitpick comments (3)
preflight/pkg/webhook/injector.go (1)
559-603: Gang env injection logs at `Info` level per container — consider `Debug`.

`slog.Info` at line 565 fires for every init container that gets gang env vars. In a cluster with many gang pods, this could be noisy. The gang membership is already logged at `Info` level at lines 91-96. This per-container log would be better at `Debug`.

Proposed fix

```diff
-	slog.Info("Injecting gang environment variables", "gangID", gangCtx.GangID, "configMap", gangCtx.ConfigMapName)
+	slog.Debug("Injecting gang environment variables", "gangID", gangCtx.GangID, "configMap", gangCtx.ConfigMapName)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@preflight/pkg/webhook/injector.go` around lines 559 - 603, Change the per-container log in injectGangEnv from Info to Debug to reduce noise: update the slog.Info call inside Injector.injectGangEnv (the one logging "Injecting gang environment variables" with keys gangID and configMap) to use slog.Debug (preserving the same message and structured fields); leave higher-level gang membership logs (earlier Info logs around GangContext creation) unchanged and keep mergeEnvVars and GangContext usage intact.

distros/kubernetes/nvsentinel/charts/preflight/values.yaml (2)
109-109: Nit: stray double-comment `# #`.

Line 109 has `# # NCCL loopback test` — likely a leftover from toggling the comment. Should be `# NCCL loopback test` for consistency with the other section comments (lines 98, 129).

Proposed fix

```diff
-  # # NCCL loopback test - validates intra-node GPU-to-GPU communication (NVLink/PCIe)
+  # NCCL loopback test - validates intra-node GPU-to-GPU communication (NVLink/PCIe)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@distros/kubernetes/nvsentinel/charts/preflight/values.yaml` at line 109, The comment has a stray double-comment marker; update the comment string `# # NCCL loopback test` in values.yaml to a single `# NCCL loopback test` so it matches the style of nearby section headings (e.g., the other comment lines around the NCCL section); locate the offending comment text and replace the double `# #` with a single `#`.
155-178: Default `networkResourceNames` includes all GCP TCPXO NIC resources — intentional but notable.

The 16 GCP `networking.gke.io.networks/gpu-nicX` entries (8 base + 8 `.IP`) are included in the default values. While this is needed for `mirrorIPCompanionResources` to function and is harmless on non-GCP clusters (resources simply won't match), it makes the defaults somewhat CSP-aware.

Consider adding a brief inline comment grouping them, similar to the `# GCP multi-network resources (TCPXO)` on line 162, to help users understand these can be pruned on non-GCP deployments.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@distros/kubernetes/nvsentinel/charts/preflight/values.yaml` around lines 155 - 178, The default networkResourceNames list includes many GCP TCPXO entries which are CSP-specific; update the values.yaml by adding a concise inline grouping comment above the block of networking.gke.io entries (e.g., "// GCP multi-network resources (TCPXO) - can be pruned on non-GCP clusters") so users know these 16 gpu-nicX and gpu-nicX.IP entries are GCP-specific and only required for mirrorIPCompanionResources behavior; edit the networkResourceNames block and place the comment immediately before the first networking.gke.io entry to clearly group them.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@distros/kubernetes/nvsentinel/charts/preflight/files/ndv4-topo.xml`:
- Around line 2-33: The cpu elements in the ndv4-topo.xml diff all have the same
affinity attribute ("affinity=\"0000ffff,0000ffff\""), which is incorrect for a
4-NUMA-node system; update each <cpu numaid="0">, <cpu numaid="1">, <cpu
numaid="2"> and <cpu numaid="3"> to use distinct CPU affinity masks that
partition the 96 vCPUs across the four NUMA nodes (obtain masks from the actual
host via /sys/devices/system/node/node*/cpumap or numactl -H), and replace the
duplicated affinity values with the correct per-node masks so NCCL can detect
locality properly.
In `@distros/kubernetes/nvsentinel/charts/preflight/templates/configmap.yaml`:
- Around line 47-50: The template emits an extra blank line because the toYaml
call for extraVolumeMounts isn't left-trimmed; update the template block
handling extraVolumeMounts so the toYaml invocation uses the left-trim marker
(i.e., change the `{{ toYaml .Values.gangCoordination.extraVolumeMounts |
nindent 8 }}` invocation to use `{{- toYaml ... }}`) so the rendered YAML has no
blank line between the `extraVolumeMounts:` key and its indented list items.
- Around line 51-53: The template currently uses `ne (toString
.Values.gangCoordination.mirrorResourceClaims) ""`, which turns nil into the
string "<nil>" and renders invalid YAML; update the conditional to properly
detect unset/nil values by using Helm/Sprig's kind check, e.g. replace the `if`
with a guard like `if and (not (kindIs "invalid"
.Values.gangCoordination.mirrorResourceClaims)) (ne
.Values.gangCoordination.mirrorResourceClaims "")` (or at minimum `if not
(kindIs "invalid" .Values.gangCoordination.mirrorResourceClaims)`), so
`.Values.gangCoordination.mirrorResourceClaims` is only rendered when present
and non-empty.
In `@distros/kubernetes/nvsentinel/charts/preflight/values-mnnvl-efa.yaml`:
- Around line 62-69: The gangCoordination block is missing mirrorResourceClaims:
true so DRA ResourceClaims for GB200/MNNVL won't be mirrored; update the
gangCoordination configuration (the same section that contains
extraHostPathMounts) to add mirrorResourceClaims: true so init containers
receive IMEX channel claims and cross-node NVLink works correctly. Ensure the
key is added at the same nesting level as extraHostPathMounts under
gangCoordination and verify any template/renderer that defaults nil to false
will pick up this explicit true value.
- Around line 28-61: The overlay file replaces the entire initContainers list
which removes the base entries (preflight-dcgm-diag and
preflight-nccl-loopback); either (A) add a brief comment above the
initContainers block stating that this overlay intentionally replaces all
initContainers in the base (so MNNVL environments only run
preflight-nccl-allreduce), or (B) if you want to preserve the base checks,
duplicate the base initContainers entries (preflight-dcgm-diag and
preflight-nccl-loopback) into this file alongside preflight-nccl-allreduce so
all three run when merged; update the file near the initContainers /
preflight-nccl-allreduce section accordingly.
---
Duplicate comments:
In `@preflight/pkg/webhook/injector.go`:
- Around line 309-321: appendExtraHostPathMounts and collectExtraHostPathVolumes
validate different fields causing mounts without volumes; update
appendExtraHostPathMounts to perform the same validation as
collectExtraHostPathVolumes (i.e., skip entries when Name or HostPath is empty)
or centralize validation into a shared helper (e.g., validateExtraHostPathEntry)
and call it from both appendExtraHostPathMounts and collectExtraHostPathVolumes
so only entries with both Name and HostPath are used.
- Around line 453-465: The dshm emptyDir is created without a SizeLimit (risking
node RAM exhaustion); update the creation in injector.go where dshmVolumeName is
added to volumes (the block that builds corev1.Volume with
corev1.EmptyDirVolumeSource) to set a safe EmptyDir SizeLimit (e.g.,
resource.MustParse like "1Gi" or a configurable value) on the
EmptyDirVolumeSource.SizeLimit field, ensuring you import k8s resource types and
use a sensible default or config value rather than leaving it nil.
---
Nitpick comments:
In `@distros/kubernetes/nvsentinel/charts/preflight/values.yaml`:
- Line 109: The comment has a stray double-comment marker; update the comment
string `# # NCCL loopback test` in values.yaml to a single `# NCCL loopback
test` so it matches the style of nearby section headings (e.g., the other
comment lines around the NCCL section); locate the offending comment text and
replace the double `# #` with a single `#`.
- Around line 155-178: The default networkResourceNames list includes many GCP
TCPXO entries which are CSP-specific; update the values.yaml by adding a concise
inline grouping comment above the block of networking.gke.io entries (e.g., "//
GCP multi-network resources (TCPXO) - can be pruned on non-GCP clusters") so
users know these 16 gpu-nicX and gpu-nicX.IP entries are GCP-specific and only
required for mirrorIPCompanionResources behavior; edit the networkResourceNames
block and place the comment immediately before the first networking.gke.io entry
to clearly group them.
In `@preflight/pkg/webhook/injector.go`:
- Around line 559-603: Change the per-container log in injectGangEnv from Info
to Debug to reduce noise: update the slog.Info call inside
Injector.injectGangEnv (the one logging "Injecting gang environment variables"
with keys gangID and configMap) to use slog.Debug (preserving the same message
and structured fields); leave higher-level gang membership logs (earlier Info
logs around GangContext creation) unchanged and keep mergeEnvVars and
GangContext usage intact.
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Actionable comments posted: 3
🧹 Nitpick comments (12)
.github/workflows/lint-test.yml (1)
139-156: No test artifact upload for preflight checks — intentional?

Unlike `health-monitors-lint-test` and `modules-lint-test`, the `preflight-checks-lint-test` job does not upload coverage/test artifacts. This means preflight coverage won't appear in the consolidated coverage report. This is pre-existing for `dcgm-diag` and `nccl-loopback` as well, but worth noting as nccl-allreduce is added.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/lint-test.yml around lines 139 - 156, The preflight-checks-lint-test job is missing the artifact upload step so coverage/test artifacts from make -C preflight-checks/${{ matrix.component }} lint-test aren’t collected; add a step after the "Run lint and test" step that mirrors the artifact upload used in health-monitors-lint-test/modules-lint-test (use the same upload action and artifact name pattern) so artifacts for each matrix.component (dcgm-diag, nccl-allreduce, nccl-loopback) are uploaded and included in the consolidated coverage report.

preflight-checks/nccl-allreduce/scripts/entrypoint.py (1)
51-52: `sys.path` manipulation is fragile.

`sys.path.insert(0, ...)` is a common workaround for scripts, but consider using the `-m` flag or a proper package entry point (already defined in `pyproject.toml`) to avoid path hacking.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@preflight-checks/nccl-allreduce/scripts/entrypoint.py` around lines 51 - 52, The script currently mutates sys.path via sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) which is fragile; remove that manipulation and instead rely on running the package as a module or the package entry point declared in pyproject.toml (use python -m <package> or configure the entry point) so imports resolve normally; update entrypoint/launch instructions and the script's imports (keep the existing if __name__ == "__main__" guard) so the file can be executed as a proper package module rather than patching sys.path.
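
A sketch of the path-hack-free shape, assuming the script moves inside the package and is launched as `python -m nccl_allreduce.entrypoint` (the module name is an assumption):

```python
"""Entrypoint module: run with `python -m nccl_allreduce.entrypoint`.

Living inside the package means nccl_allreduce imports resolve without
mutating sys.path at startup.
"""

from nccl_allreduce.gang import GangWaiter  # resolves normally under `-m`; shown only to demonstrate this

def main() -> int:
    # Real gang-wait and torchrun launch logic goes here.
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
```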
preflight-checks/nccl-allreduce/nccl_allreduce/__main__.py (2)

130-142: NCCL error detection relies on string matching — consider documenting this heuristic.

Checking for `"NCCL"` or `"nccl"` in the error message is pragmatic but fragile. If PyTorch changes error message wording, legitimate NCCL errors could be re-raised as unhandled `RuntimeError`s. A brief inline comment noting this trade-off would help future maintainers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@preflight-checks/nccl-allreduce/nccl_allreduce/__main__.py` around lines 130 - 142, In the RuntimeError except block that checks for "NCCL" or "nccl" (the branch that calls log.error and returns _handle_failure with NCCLError.ALLREDUCE_TIMEOUT), add a concise inline comment explaining that detection is a heuristic based on string matching of the error message, that it may be brittle if upstream error wording changes, and that maintainers should consider more robust detection (e.g., specific exception types or error codes) in the future.
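
A runnable sketch of the documented heuristic with stand-in functions; the simulated error message is invented for the example:

```python
def run_benchmark() -> None:
    """Stand-in for the real allreduce call."""
    raise RuntimeError("NCCL communicator was aborted")  # simulated failure


def handle_nccl_failure(err: RuntimeError) -> None:
    """Stand-in for _handle_failure."""
    print(f"nccl failure: {err}")


try:
    run_benchmark()
except RuntimeError as err:
    # Heuristic: PyTorch surfaces NCCL failures as RuntimeError with "NCCL"
    # somewhere in the message. String matching is brittle if upstream wording
    # changes; revisit if PyTorch ever exposes a dedicated NCCL exception type.
    if "nccl" in str(err).lower():
        handle_nccl_failure(err)
    else:
        raise
```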
143-147: `contextlib.suppress` is more idiomatic for ignored cleanup errors.

Proposed fix

```diff
+import contextlib
 ...
     finally:
-        try:
-            dist.destroy_process_group()
-        except Exception:
-            pass  # Ignore cleanup errors
+        with contextlib.suppress(Exception):
+            dist.destroy_process_group()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@preflight-checks/nccl-allreduce/nccl_allreduce/__main__.py` around lines 143 - 147, Replace the try/except cleanup with contextlib.suppress to make ignored cleanup errors more idiomatic: wrap the call to dist.destroy_process_group() inside a contextlib.suppress(Exception) block in the same finally section, and add an import for contextlib if it's not already imported; refer to the finally block containing dist.destroy_process_group() to locate where to apply this change.

preflight-checks/nccl-allreduce/nccl_allreduce/gang.py (1)
203-206: Dead `except FileNotFoundError` — `_read_string` already handles it via the default.

Since `_read_string(KEY_PEERS, default="")` internally catches `FileNotFoundError` and returns `""`, the outer `try/except FileNotFoundError` on lines 203–206 is unreachable.

Simplify

```diff
-        try:
-            data = self._read_string(KEY_PEERS, default="")
-        except FileNotFoundError:
-            return []
+        data = self._read_string(KEY_PEERS, default="")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@preflight-checks/nccl-allreduce/nccl_allreduce/gang.py` around lines 203 - 206, The try/except around the call to _read_string(KEY_PEERS, default="") is dead because _read_string already catches FileNotFoundError and returns the default; remove the outer try/except and simply assign data = self._read_string(KEY_PEERS, default="") and proceed (update any following logic that expected early return to handle the empty string as before). Ensure you only modify the block using _read_string and KEY_PEERS in gang.py so behavior remains identical but without the unreachable FileNotFoundError handler.

preflight-checks/nccl-allreduce/nccl_allreduce/health.py (1)
168-202: New gRPC channel created on every retry attempt.

Each retry iteration opens and closes a fresh `grpc.insecure_channel`. For a Unix domain socket this is cheap, but you could hoist the channel outside the loop and only reconnect on failure if you want to reduce overhead.

Optional: reuse channel across retries

```diff
 def _send_with_retries(self, health_events: pb.HealthEvents) -> bool:
     delay = INITIAL_DELAY
-    for attempt in range(MAX_RETRIES):
-        try:
-            with grpc.insecure_channel(f"unix://{self._socket_path}") as channel:
+    with grpc.insecure_channel(f"unix://{self._socket_path}") as channel:
+        for attempt in range(MAX_RETRIES):
+            try:
                 stub = pb_grpc.PlatformConnectorStub(channel)
                 stub.HealthEventOccurredV1(
                     health_events,
                     timeout=RPC_TIMEOUT,
                 )
                 log.info("Health event sent successfully")
                 return True
-        except grpc.RpcError as err:
-            log.warning(
-                "Failed to send health event",
-                extra={
-                    "attempt": attempt + 1,
-                    "max_retries": MAX_RETRIES,
-                    "error": str(err),
-                },
-            )
-            if attempt < MAX_RETRIES - 1:
-                sleep(delay)
-                delay *= BACKOFF_FACTOR
+            except grpc.RpcError as err:
+                log.warning(
+                    "Failed to send health event",
+                    extra={
+                        "attempt": attempt + 1,
+                        "max_retries": MAX_RETRIES,
+                        "error": str(err),
+                    },
+                )
+                if attempt < MAX_RETRIES - 1:
+                    sleep(delay)
+                    delay *= BACKOFF_FACTOR
     return False
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@preflight-checks/nccl-allreduce/nccl_allreduce/health.py` around lines 168 - 202, The code in _send_with_retries creates a new grpc.insecure_channel on every retry; hoist channel creation outside the for-loop to reuse the same channel (created with grpc.insecure_channel(f"unix://{self._socket_path}")), create the pb_grpc.PlatformConnectorStub once, and only recreate the channel/stub if a grpc.RpcError indicates the channel is unusable (i.e., on failure before the next retry). Ensure you still apply the same backoff/sleep logic and that the function returns True on successful stub.HealthEventOccurredV1 call and False after MAX_RETRIES.

preflight/pkg/webhook/injector.go (3)
125-131: Prefer `range` over index-based loop.

The index variable `idx` is only used to subscript `initContainers`. A `range` loop is more idiomatic Go.

Suggested refactor

```diff
-	for idx := 0; idx < len(initContainers); idx++ {
+	for _, c := range initContainers {
 		patches = append(patches, PatchOperation{
 			Op:    "add",
 			Path:  "/spec/initContainers/-",
-			Value: initContainers[idx],
+			Value: c,
 		})
 	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@preflight/pkg/webhook/injector.go` around lines 125 - 131, The loop adding PatchOperation entries uses an index-only for-loop; replace it with a range over initContainers to be more idiomatic and avoid explicit indexing — iterate like `for _, c := range initContainers` and append PatchOperation with Value set to the loop variable (referencing initContainers, patches, PatchOperation) so behavior is unchanged but code is cleaner.
15-15: Missing package-level godoc for the `webhook` package.

Per coding guidelines, package-level godoc is required for all Go packages. Consider adding a `// Package webhook ...` comment before the `package webhook` declaration.

As per coding guidelines: "Package-level godoc required for all Go packages".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@preflight/pkg/webhook/injector.go` at line 15, Add a package-level godoc comment immediately above the "package webhook" declaration describing the purpose and usage of the webhook package (e.g., what it provides, important behaviors or invariants). Ensure the comment starts with "Package webhook" to satisfy Go tooling and lint rules and keep it short, one or two sentences summarizing the package API and intent so linters accept it.
193-199: Consider using `boolDefault` for `mirrorClaims` consistency.

The `boolDefault` helper (line 595) was introduced for exactly this pattern. Using it here would be consistent with its usage in `appendExtraHostPathMounts` and `appendExtraVolumeMounts`.

Suggested refactor

```diff
-	mirrorClaims := i.cfg.GangCoordination.MirrorResourceClaims != nil &&
-		*i.cfg.GangCoordination.MirrorResourceClaims
+	mirrorClaims := boolDefault(i.cfg.GangCoordination.MirrorResourceClaims, false)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@preflight/pkg/webhook/injector.go` around lines 193 - 199, Replace the manual nil-check assignment for mirrorClaims with the existing boolDefault helper to match other usages; specifically, in injector.go where mirrorClaims is computed from i.cfg.GangCoordination.MirrorResourceClaims, call boolDefault(i.cfg.GangCoordination.MirrorResourceClaims) (or the appropriate boolDefault signature) instead of the current nil-and-dereference logic so it is consistent with how appendExtraHostPathMounts and appendExtraVolumeMounts obtain their defaults.

distros/kubernetes/nvsentinel/charts/preflight/values.yaml (2)
172-189: All CSP network resources in the default allowlist — confirm this is benign when absent.

The webhook copies only resources that actually exist in the pod spec, so listing all supported resource names should be harmless on clusters that don't use them. However, adding a brief comment confirming this behavior would help operators who wonder why GCP-specific resources appear in their non-GCP deployment.

As per coding guidelines: "Document all values in Helm chart `values.yaml` with inline comments."

Suggested comment addition

```diff
 # Network resource names to copy to init containers (RDMA, InfiniBand, etc.)
+# List all supported network resource types here; unused resources are ignored.
+# The webhook only copies resources that exist in the pod spec.
 networkResourceNames:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@distros/kubernetes/nvsentinel/charts/preflight/values.yaml` around lines 172 - 189, Add an inline comment above the "GCP multi-network resources (TCPXO)" block in values.yaml clarifying that the default allowlist intentionally lists all CSP-specific network resource names (e.g., "networking.gke.io.networks/gpu-nic0", "networking.gke.io.networks/gpu-nic0.IP", etc.) and that the webhook only copies resources that actually exist in a pod spec, so these GCP-specific entries are harmless on non-GCP clusters; update the comment near the existing header to be concise and follow the chart's inline-commenting style.
214-216: Minor: `ncclTopoConfigMap` default is an empty string — consider using a comment-only example instead.

An empty string `""` passes the truthiness check in the template (`{{- if .Values.gangCoordination.ncclTopoConfigMap }}`), so this works correctly. However, using a commented-out example (like `mirrorResourceClaims`) would be more consistent with the other optional fields in this section.

As per coding guidelines: "Include examples for non-obvious configurations in Helm chart documentation."
Suggested change
# NCCL topology ConfigMap name - required for Azure NDv4/v5 InfiniBand # Without this, NCCL cannot map GPUs to IB NICs and falls back to TCP - ncclTopoConfigMap: "" + # ncclTopoConfigMap: "nccl-topo-ndv4" # Change to match your VM shape + ncclTopoConfigMap: ""🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@distros/kubernetes/nvsentinel/charts/preflight/values.yaml` around lines 214 - 216, The ncclTopoConfigMap default is currently an empty string which can be confusing; replace it with a commented example to match other optional fields and Helm docs (e.g., comment out a sample: # ncclTopoConfigMap: "my-nccl-topo-configmap") so maintainers see a usage example while keeping the field effectively unset for the gangCoordination template check (referencing ncclTopoConfigMap and the template condition {{- if .Values.gangCoordination.ncclTopoConfigMap }}).
distros/kubernetes/nvsentinel/charts/preflight/values-ib.yaml (1)
22-29: Consider adding a `BW_THRESHOLD_GBPS` override and brief inline comments for each env var.
The base default is 100 GB/s. If IB clusters are expected to achieve significantly higher bandwidth (the PR notes 165 Gbps on Azure), you may want an explicit threshold here. Also, brief inline comments on each env var would align with the project's documentation guidelines — e.g., what `NCCL_IB_PCI_RELAXED_ORDERING` controls and why `eth0` is chosen for `NCCL_SOCKET_IFNAME`.
As per coding guidelines: "Document all values in Helm chart `values.yaml` with inline comments."
Verify each finding against the current code and only fix it if needed. In `@distros/kubernetes/nvsentinel/charts/preflight/values-ib.yaml` around lines 22 - 29, Add a BW_THRESHOLD_GBPS override into the ncclAllreduceExtraEnv block and add brief inline comments for each env var; specifically, add an env entry named BW_THRESHOLD_GBPS (default "100") to allow overriding the 100 GB/s baseline, and add single-line comments for NCCL_TOPO_FILE (path to topology XML), NCCL_IB_PCI_RELAXED_ORDERING (controls PCI relaxed ordering for IB performance), and NCCL_SOCKET_IFNAME (interface used for NCCL sockets, e.g., "eth0" rationale), so that ncclAllreduceExtraEnv and its new BW_THRESHOLD_GBPS entry and the existing NCCL_* variables are documented per chart guidelines.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@distros/kubernetes/nvsentinel/charts/preflight/values.yaml`:
- Around line 228-233: The Helm values comment is incorrect: the field
MirrorResourceClaims actually defaults to true unconditionally when omitted (per
the config initialization), not only when gangCoordination is enabled. Update
the comment for mirrorResourceClaims to state that it defaults to true when not
set (i.e., omitted) and remove the conditional wording referring to
gangCoordination so it matches the config.go behavior for MirrorResourceClaims.
In `@preflight-checks/nccl-allreduce/scripts/entrypoint.py`:
- Line 147: Add missing type annotations: annotate _wait_for_gang to declare its
return type (use -> GangWaiter) and annotate the parameter(s) of
_launch_torchrun (e.g., cfg: _EntrypointConfig) and its return type if missing
(-> None or appropriate type). Also add the suggested import "from
nccl_allreduce.gang import GangConfig, GangWaiter" so the new type names
resolve; update the function signatures for _wait_for_gang and _launch_torchrun
accordingly.
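For illustration, a minimal sketch of the annotated signatures, assuming the import path suggested above; the `GangWaiter()` constructor arguments and exact return types are assumptions, not the module's confirmed API:

```python
from typing import Optional

from nccl_allreduce.gang import GangConfig, GangWaiter  # import suggested above


def _wait_for_gang(pod_name: str, timeout: int) -> Optional[GangConfig]:
    """Block until the gang forms; return its config, or None on failure."""
    waiter = GangWaiter()  # constructor arguments are an assumption here
    return waiter.wait(pod_name, timeout)


def _launch_torchrun(cfg: GangConfig) -> None:
    """Exec torchrun with arguments derived from the gang config."""
    ...
```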
- Around line 130-131: The environment variable names used to populate
gang_config_dir and gang_timeout are mismatched with the Go webhook injector;
update the reads in entrypoint.py (variables gang_config_dir and gang_timeout)
to use GANG_CONFIG_PATH and GANG_TIMEOUT respectively (instead of
GANG_CONFIG_DIR and GANG_TIMEOUT_SECONDS), preserving the existing default
fallbacks and integer parsing for gang_timeout so behavior remains the same.
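A minimal sketch of the corrected reads, with placeholder defaults standing in for the module's real constants:

```python
import os

# Placeholder defaults; the module's own constants should be used instead.
DEFAULT_GANG_CONFIG_DIR = "/etc/nvsentinel/gang"  # assumed path
DEFAULT_GANG_TIMEOUT = 600  # assumed seconds

# Env var names must match what the Go webhook injector exports.
gang_config_dir = os.getenv("GANG_CONFIG_PATH", DEFAULT_GANG_CONFIG_DIR)
gang_timeout = int(os.getenv("GANG_TIMEOUT", str(DEFAULT_GANG_TIMEOUT)))
```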
---
Duplicate comments:
In `@distros/kubernetes/nvsentinel/charts/preflight/values-mnnvl-efa.yaml`:
- Around line 45-52: The gangCoordination block is missing mirrorResourceClaims:
true, so DRA ResourceClaims (IMEX channel) won't be mirrored into init
containers due to the configmap.yaml kindIs "invalid" guard that omits nils;
update the values file by adding mirrorResourceClaims: true under
gangCoordination (next to extraHostPathMounts) so the template emits the setting
and init containers receive the GB200 MNNVL DRA claims.
In `@preflight-checks/nccl-allreduce/nccl_allreduce/__main__.py`:
- Line 218: The function _handle_success currently has an untyped parameter
result; add the type hint "result: BenchmarkResult" to the signature and update
imports to bring BenchmarkResult into scope (e.g., from its defining module
where other benchmark types live) so the annotation resolves; ensure you update
any existing import block (or add a new import) rather than using a local or
string annotation.
In `@preflight-checks/nccl-allreduce/nccl_allreduce/config.py`:
- Around line 72-84: The env var names used to populate gang_config_dir and
gang_timeout_seconds do not match the webhook injector's names; update the env
var keys so gang_config_dir reads from "GANG_CONFIG_PATH" (falling back to
DEFAULT_GANG_CONFIG_DIR) and gang_timeout_seconds reads from "GANG_TIMEOUT"
(using _parse_int with DEFAULT_GANG_TIMEOUT_SECONDS) and search/replace any
other usages of "GANG_CONFIG_DIR" and "GANG_TIMEOUT_SECONDS" in this module to
"GANG_CONFIG_PATH" and "GANG_TIMEOUT" respectively so the webhook-provided
values are honored.
In `@preflight-checks/nccl-allreduce/nccl_allreduce/gang.py`:
- Around line 222-224: When parsing the pod entry in the gang parsing logic (the
parts -> pod_name, pod_ip, rank logic), guard the numeric conversion so a
malformed rank doesn't produce an opaque ValueError; specifically wrap the
int(parts[2].strip()) conversion in a try/except ValueError, and on failure
raise a new ValueError (or include in the existing exception) that includes the
full offending line content and the parsed parts (e.g., show pod_name, pod_ip,
and the raw rank token) so GangWaiter.wait can log which line caused the parse
failure; reference the variables parts, pod_name, pod_ip, rank and the
GangWaiter.wait context when making this change.
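A sketch of the guarded conversion; the helper name `_parse_peer_line` is illustrative, not part of the module:

```python
def _parse_peer_line(line: str) -> tuple[str, str, int]:
    """Parse a 'pod_name,pod_ip,rank' entry, failing loudly on a bad rank."""
    parts = line.split(",")
    pod_name, pod_ip = parts[0].strip(), parts[1].strip()
    try:
        rank = int(parts[2].strip())
    except ValueError as err:
        # Include the full line and parsed fields so GangWaiter.wait can log
        # exactly which entry was malformed.
        raise ValueError(
            f"Malformed rank in gang peer line {line!r} "
            f"(pod_name={pod_name!r}, pod_ip={pod_ip!r}, rank token={parts[2]!r})"
        ) from err
    return pod_name, pod_ip, rank
```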
In `@preflight-checks/nccl-allreduce/scripts/entrypoint.py`:
- Around line 131-137: The int() conversions for GANG_TIMEOUT_SECONDS (assigned
to gang_timeout) and NPROCS_PER_NODE (assigned to nprocs_per_node) are unguarded
and should be wrapped in try/except ValueError; validate the env values before
converting, catch ValueError, emit a clear error message naming the env var and
its invalid value, and exit with a non‑zero code (e.g., sys.exit(1)) instead of
allowing a raw traceback; for NPROCS_PER_NODE, if the env exists but is invalid
treat it as an error and exit (do not silently fall back to
_detect_gpu_count()), otherwise when absent keep the current behavior of calling
_detect_gpu_count().
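One possible shape for the guarded parsing; `_require_positive_int` is a hypothetical helper, and `_detect_gpu_count` is the module's existing fallback:

```python
import os
import sys


def _require_positive_int(name: str, raw: str) -> int:
    """Parse an env var as a positive int, exiting non-zero on bad input."""
    try:
        value = int(raw)
    except ValueError:
        print(f"Invalid {name}={raw!r}: expected an integer", file=sys.stderr)
        sys.exit(1)
    if value <= 0:
        print(f"Invalid {name}={raw!r}: must be positive", file=sys.stderr)
        sys.exit(1)
    return value


gang_timeout = _require_positive_int(
    "GANG_TIMEOUT_SECONDS", os.getenv("GANG_TIMEOUT_SECONDS", "600")
)
nprocs_env = os.getenv("NPROCS_PER_NODE")
if nprocs_env is not None:
    # Set but invalid is an error; do not silently fall back.
    nprocs_per_node = _require_positive_int("NPROCS_PER_NODE", nprocs_env)
else:
    nprocs_per_node = _detect_gpu_count()  # existing behavior when unset
```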
In `@preflight/pkg/webhook/injector.go`:
- Around line 284-296: The appendExtraHostPathMounts function currently only
checks Name and MountPath while collectExtraHostPathVolumes checks Name and
HostPath, causing mounts without volumes; create a shared validator (e.g.,
isValidExtraHostPathMount or validateExtraHostPathEntry) that ensures the triple
Name, HostPath and MountPath are all non-empty and use that validator in both
appendExtraHostPathMounts and collectExtraHostPathVolumes (replace the existing
per-field checks) so entries are consistently skipped when any of Name, HostPath
or MountPath is missing.
- Around line 430-438: The dshm emptyDir is created with Medium: Memory but no
SizeLimit, risking node RAM exhaustion; update the code that appends the
corev1.Volume for dshm (look for dshmVolumeName, existingVolumes and volumes in
injector.go) to set EmptyDir.SizeLimit to a safe default or configurable value:
parse a quantity (k8s.io/apimachinery/pkg/api/resource) and assign a pointer to
EmptyDirVolumeSource.SizeLimit, and expose the limit as a config field or
env/flag (e.g., dshmSizeLimit) so callers can override it; ensure imports are
added and the SizeLimit is used when constructing the Volume.
---
Nitpick comments:
In @.github/workflows/lint-test.yml:
- Around line 139-156: The preflight-checks-lint-test job is missing the
artifact upload step so coverage/test artifacts from make -C
preflight-checks/${{ matrix.component }} lint-test aren’t collected; add a step
after the "Run lint and test" step that mirrors the artifact upload used in
health-monitors-lint-test/modules-lint-test (use the same upload action and
artifact name pattern) so artifacts for each matrix.component (dcgm-diag,
nccl-allreduce, nccl-loopback) are uploaded and included in the consolidated
coverage report.
In `@distros/kubernetes/nvsentinel/charts/preflight/values-ib.yaml`:
- Around line 22-29: Add a BW_THRESHOLD_GBPS override into the
ncclAllreduceExtraEnv block and add brief inline comments for each env var;
specifically, add an env entry named BW_THRESHOLD_GBPS (default "100") to allow
overriding the 100 GB/s baseline, and add single-line comments for
NCCL_TOPO_FILE (path to topology XML), NCCL_IB_PCI_RELAXED_ORDERING (controls
PCI relaxed ordering for IB performance), and NCCL_SOCKET_IFNAME (interface used
for NCCL sockets, e.g., "eth0" rationale), so that ncclAllreduceExtraEnv and its
new BW_THRESHOLD_GBPS entry and the existing NCCL_* variables are documented per
chart guidelines.
In `@distros/kubernetes/nvsentinel/charts/preflight/values.yaml`:
- Around line 172-189: Add an inline comment above the "GCP multi-network
resources (TCPXO)" block in values.yaml clarifying that the default allowlist
intentionally lists all CSP-specific network resource names (e.g.,
"networking.gke.io.networks/gpu-nic0", "networking.gke.io.networks/gpu-nic0.IP",
etc.) and that the webhook only copies resources that actually exist in a pod
spec, so these GCP-specific entries are harmless on non-GCP clusters; update the
comment near the existing header to be concise and follow the chart's
inline-commenting style.
- Around line 214-216: The ncclTopoConfigMap default is currently an empty
string which can be confusing; replace it with a commented example to match
other optional fields and Helm docs (e.g., comment out a sample: #
ncclTopoConfigMap: "my-nccl-topo-configmap") so maintainers see a usage example
while keeping the field effectively unset for the gangCoordination template
check (referencing ncclTopoConfigMap and the template condition {{- if
.Values.gangCoordination.ncclTopoConfigMap }}).
In `@preflight-checks/nccl-allreduce/nccl_allreduce/__main__.py`:
- Around line 130-142: In the RuntimeError except block that checks for "NCCL"
or "nccl" (the branch that calls log.error and returns _handle_failure with
NCCLError.ALLREDUCE_TIMEOUT), add a concise inline comment explaining that
detection is a heuristic based on string matching of the error message, that it
may be brittle if upstream error wording changes, and that maintainers should
consider more robust detection (e.g., specific exception types or error codes)
in the future.
- Around line 143-147: Replace the try/except cleanup with contextlib.suppress
to make ignored cleanup errors more idiomatic: wrap the call to
dist.destroy_process_group() inside a contextlib.suppress(Exception) block in
the same finally section, and add an import for contextlib if it's not already
imported; refer to the finally block containing dist.destroy_process_group() to
locate where to apply this change.
In `@preflight-checks/nccl-allreduce/nccl_allreduce/gang.py`:
- Around line 203-206: The try/except around the call to _read_string(KEY_PEERS,
default="") is dead because _read_string already catches FileNotFoundError and
returns the default; remove the outer try/except and simply assign data =
self._read_string(KEY_PEERS, default="") and proceed (update any following logic
that expected early return to handle the empty string as before). Ensure you
only modify the block using _read_string and KEY_PEERS in gang.py so behavior
remains identical but without the unreachable FileNotFoundError handler.
In `@preflight-checks/nccl-allreduce/nccl_allreduce/health.py`:
- Around line 168-202: The code in _send_with_retries creates a new
grpc.insecure_channel on every retry; hoist channel creation outside the
for-loop to reuse the same channel (created with
grpc.insecure_channel(f"unix://{self._socket_path}")), create the
pb_grpc.PlatformConnectorStub once, and only recreate the channel/stub if a
grpc.RpcError indicates the channel is unusable (i.e., on failure before the
next retry). Ensure you still apply the same backoff/sleep logic and that the
function returns True on successful stub.HealthEventOccurredV1 call and False
after MAX_RETRIES.
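A sketch of the hoisted channel, assuming the generated stub module is imported as `pb_grpc` and `MAX_RETRIES` is the module constant; only the structure matters here:

```python
import time

import grpc


def _send_with_retries(self, event) -> bool:
    """Send a health event, reusing one channel across retry attempts."""
    channel = grpc.insecure_channel(f"unix://{self._socket_path}")
    stub = pb_grpc.PlatformConnectorStub(channel)  # generated stub, assumed name
    try:
        for attempt in range(MAX_RETRIES):
            try:
                stub.HealthEventOccurredV1(event)
                return True
            except grpc.RpcError:
                # The channel may be unusable after a failure; rebuild it
                # before the next attempt, keeping the same backoff.
                channel.close()
                time.sleep(2**attempt)
                channel = grpc.insecure_channel(f"unix://{self._socket_path}")
                stub = pb_grpc.PlatformConnectorStub(channel)
        return False
    finally:
        channel.close()
```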
In `@preflight-checks/nccl-allreduce/scripts/entrypoint.py`:
- Around line 51-52: The script currently mutates sys.path via
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
which is fragile; remove that manipulation and instead rely on running the
package as a module or the package entry point declared in pyproject.toml (use
python -m <package> or configure the entry point) so imports resolve normally;
update entrypoint/launch instructions and the script's imports (keep the
existing if __name__ == "__main__" guard) so the file can be executed as a
proper package module rather than patching sys.path.
In `@preflight/pkg/webhook/injector.go`:
- Around line 125-131: The loop adding PatchOperation entries uses an index-only
for-loop; replace it with a range over initContainers to be more idiomatic and
avoid explicit indexing — iterate like `for _, c := range initContainers` and
append PatchOperation with Value set to the loop variable (referencing
initContainers, patches, PatchOperation) so behavior is unchanged but code is
cleaner.
- Line 15: Add a package-level godoc comment immediately above the "package
webhook" declaration describing the purpose and usage of the webhook package
(e.g., what it provides, important behaviors or invariants). Ensure the comment
starts with "Package webhook" to satisfy Go tooling and lint rules and keep it
short, one or two sentences summarizing the package API and intent so linters
accept it.
- Around line 193-199: Replace the manual nil-check assignment for mirrorClaims
with the existing boolDefault helper to match other usages; specifically, in
injector.go where mirrorClaims is computed from
i.cfg.GangCoordination.MirrorResourceClaims, call
boolDefault(i.cfg.GangCoordination.MirrorResourceClaims) (or the appropriate
boolDefault signature) instead of the current nil-and-dereference logic so it is
consistent with how appendExtraHostPathMounts and appendExtraVolumeMounts obtain
their defaults.
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Actionable comments posted: 1
🧹 Nitpick comments (4)
preflight-checks/nccl-allreduce/nccl_allreduce/tests/test_config.py (1)
36-49: Consider combining nested `with` statements.
Static analysis flagged these nested `with` statements. While functionally correct, combining them improves readability.
♻️ Proposed fix
 def test_rejects_non_numeric(self) -> None:
-    with patch.dict(os.environ, {"TEST_VAR": "abc"}):
-        with pytest.raises(ValueError, match="Invalid TEST_VAR"):
-            _parse_float("TEST_VAR", 0.0)
+    with patch.dict(os.environ, {"TEST_VAR": "abc"}), pytest.raises(ValueError, match="Invalid TEST_VAR"):
+        _parse_float("TEST_VAR", 0.0)

 def test_rejects_zero(self) -> None:
-    with patch.dict(os.environ, {"TEST_VAR": "0"}):
-        with pytest.raises(ValueError, match="must be positive"):
-            _parse_float("TEST_VAR", 1.0)
+    with patch.dict(os.environ, {"TEST_VAR": "0"}), pytest.raises(ValueError, match="must be positive"):
+        _parse_float("TEST_VAR", 1.0)

 def test_rejects_negative(self) -> None:
-    with patch.dict(os.environ, {"TEST_VAR": "-5.0"}):
-        with pytest.raises(ValueError, match="must be positive"):
-            _parse_float("TEST_VAR", 1.0)
+    with patch.dict(os.environ, {"TEST_VAR": "-5.0"}), pytest.raises(ValueError, match="must be positive"):
+        _parse_float("TEST_VAR", 1.0)
The same pattern applies to the `TestParseInt` and `TestConfigFromEnv` classes.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@preflight-checks/nccl-allreduce/nccl_allreduce/tests/test_config.py` around lines 36 - 49, Combine the nested context managers into a single with-statement in the test methods (e.g., in test_rejects_non_numeric, test_rejects_zero, test_rejects_negative) by merging patch.dict(...) and pytest.raises(...) into one line to improve readability; apply the same change to the analogous methods in TestParseInt and TestConfigFromEnv so each test uses a single with that contains both the environment patch and the expected exception context while keeping the same assertions that call _parse_float or the corresponding parse_int/config helpers.
distros/kubernetes/nvsentinel/charts/preflight/values.yaml (1)
178-195: Consider grouping GCP TCPXO resources under a comment block.
The expanded `networkResourceNames` list is quite long. While the comment at line 179 helps, consider adding a closing comment or blank line after the GCP resources to improve visual separation:
   - "networking.gke.io.networks/gpu-nic7"
   - "networking.gke.io.networks/gpu-nic7.IP"
+  # End GCP TCPXO resources
This is a minor readability suggestion.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@distros/kubernetes/nvsentinel/charts/preflight/values.yaml` around lines 178 - 195, The GCP TCPXO entries in the networkResourceNames list are long and need clearer separation: group the GCP entries (the keys starting with "networking.gke.io.networks/gpu-nicX" and "networking.gke.io.networks/gpu-nicX.IP") under a single comment header like "# GCP multi-network resources (TCPXO)" and add a closing blank line or a trailing comment after that block to visually separate it from the rest of networkResourceNames; update the values.yaml list so the group is clearly delimited while preserving all existing entries and their exact strings.
preflight-checks/nccl-allreduce/nccl_allreduce/__main__.py (2)
183-193: Consider logging benchmark failure on non-rank-0 processes before returning.
When `benchmark.run()` fails on a non-rank-0 process, it logs the error but silently returns the exit code. While rank-0 handles health reporting, having a debug/info log indicating the non-rank-0 behavior would help troubleshooting:
 except Exception as err:
     log.error("Benchmark execution failed", extra={"error": str(err)})
     if rank == 0:
         return _handle_failure(
             cfg,
             NCCLError.ALLREDUCE_TIMEOUT,
             f"NCCL all-reduce benchmark failed: {err}",
         )
+    log.debug("Non-rank-0 process returning failure exit code", extra={"rank": rank})
     return NCCLError.ALLREDUCE_TIMEOUT.value.exit_code
This is a minor observability improvement.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@preflight-checks/nccl-allreduce/nccl_allreduce/__main__.py` around lines 183 - 193, When benchmark.run(message_sizes) raises an exception, non-rank-0 branches currently return NCCLError.ALLREDUCE_TIMEOUT without additional context; add a log entry on those branches before returning. Specifically, in the except block surrounding benchmark.run, after log.error(...) and when rank != 0, emit a debug/info log (e.g., using log.debug or log.info) that includes the rank and the exception (err) to record that a non-root process failed and is returning the timeout exit code; keep the existing root behavior using _handle_failure and NCCLError.ALLREDUCE_TIMEOUT intact.
143-147: Use `contextlib.suppress` for cleaner exception handling.
Per static analysis (Ruff SIM105), the try-except-pass pattern can be simplified.
♻️ Suggested refactor
+import contextlib
+
 finally:
-    try:
-        dist.destroy_process_group()
-    except Exception:
-        pass  # Ignore cleanup errors
+    with contextlib.suppress(Exception):
+        dist.destroy_process_group()
Add the import at the top of the file with the other imports.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@preflight-checks/nccl-allreduce/nccl_allreduce/__main__.py` around lines 143 - 147, Replace the try/except/pass cleanup around dist.destroy_process_group() with contextlib.suppress(Exception) to simplify exception handling: import contextlib alongside the other imports and wrap the call to dist.destroy_process_group() in a with contextlib.suppress(Exception): block (replace the finally: try: ... except: pass pattern around dist.destroy_process_group()).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@preflight-checks/nccl-allreduce/nccl_allreduce/tests/test_errors.py`:
- Around line 61-64: The test test_error_def_is_frozen currently only checks
isinstance; update it to actually attempt mutation on the ErrorDef instance
returned by NCCLError.SUCCESS.value and assert that mutation raises the
appropriate exception (e.g., dataclasses.FrozenInstanceError or AttributeError).
Locate the test_error_def_is_frozen function and replace or extend the assertion
with a with pytest.raises(...) block that tries to set a field on the ErrorDef
instance (for example, error_def.code = 123 or error_def.message = "x") to
confirm immutability.
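A sketch of the strengthened test, assuming `NCCLError` lives in the module's errors file and `ErrorDef` is a frozen dataclass:

```python
from dataclasses import FrozenInstanceError

import pytest

from nccl_allreduce.errors import NCCLError  # assumed import path


def test_error_def_is_frozen() -> None:
    error_def = NCCLError.SUCCESS.value
    with pytest.raises(FrozenInstanceError):
        error_def.code = 123  # mutation must fail if the dataclass is frozen
```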
---
Duplicate comments:
In `@distros/kubernetes/nvsentinel/charts/preflight/values.yaml`:
- Around line 234-239: The values.yaml comment about mirrorResourceClaims must
match the actual behavior in the Go code: inspect the config handling where
MirrorResourceClaims is defaulted (the MirrorResourceClaims field in config.go)
and if the code unconditionally sets MirrorResourceClaims to true when nil,
update the values.yaml comment to state "Defaults to true when omitted
(unconditionally)" or leave as-is if it already states that; if instead the code
conditions the default based on gang coordination, change the comment to
describe that conditional behavior. Ensure the fix references
mirrorResourceClaims in values.yaml and the MirrorResourceClaims field/logic in
config.go so the doc and implementation are consistent.
In `@preflight-checks/nccl-allreduce/nccl_allreduce/config.py`:
- Around line 71-83: The env var names used in this module don't match the
webhook: update the getenv keys so gang_config_dir uses "GANG_CONFIG_PATH"
instead of "GANG_CONFIG_DIR" and gang_timeout_seconds reads "GANG_TIMEOUT"
instead of "GANG_TIMEOUT_SECONDS"; keep the same fallback defaults
(DEFAULT_GANG_CONFIG_DIR, DEFAULT_GANG_TIMEOUT_SECONDS) and parsing helpers
(_parse_int for gang_timeout_seconds, _parse_float usage remains unchanged for
bw_threshold_gbps) so values injected by the webhook are actually picked up.
In `@preflight-checks/nccl-allreduce/scripts/entrypoint.py`:
- Around line 131-137: The code currently calls int() directly on environment
values for GANG_TIMEOUT_SECONDS and NPROCS_PER_NODE which will raise raw
ValueError on invalid input; update the logic around gang_timeout and
nprocs_per_node to validate and parse the env vars safely: for
GANG_TIMEOUT_SECONDS, try to parse int(os.getenv(...)) in a try/except, ensure
the value is a positive integer, log a clear error and exit non-zero on failure;
for NPROCS_PER_NODE, if NPROCS_PER_NODE is set, similarly try/except the int
conversion and validate (>0) before assigning to nprocs_per_node, otherwise call
_detect_gpu_count(); ensure error messages reference the env var names and the
variables gang_timeout and nprocs_per_node so debugging is clear.
- Around line 130-131: The script reads environment variables gang_config_dir
and gang_timeout (via gang_config_dir = os.getenv("GANG_CONFIG_DIR", ...) and
gang_timeout = int(os.getenv("GANG_TIMEOUT_SECONDS", ...))), but the webhook
injector sets GANG_CONFIG_PATH and GANG_TIMEOUT; update the env var names used
by the script to match the injector (use os.getenv("GANG_CONFIG_PATH",
DEFAULT_GANG_CONFIG_DIR) for gang_config_dir and os.getenv("GANG_TIMEOUT",
str(DEFAULT_GANG_TIMEOUT)) for gang_timeout) so injected values are honored,
keeping the same defaults and casting to int for gang_timeout.
In `@preflight/pkg/webhook/injector.go`:
- Around line 283-296: The validation is inconsistent between
appendExtraHostPathMounts and collectExtraHostPathVolumes:
appendExtraHostPathMounts currently only skips entries with empty Name or
MountPath while collectExtraHostPathVolumes skips Name or HostPath, which can
cause mounts without matching volumes. Update appendExtraHostPathMounts to use
the same validation as collectExtraHostPathVolumes (skip if Name, MountPath, or
HostPath is empty) or move validation to config loading so both functions rely
on the same sanitized entries; locate and adjust the checks in
appendExtraHostPathMounts and collectExtraHostPathVolumes to ensure both require
Name, MountPath, and HostPath.
---
Nitpick comments:
In `@distros/kubernetes/nvsentinel/charts/preflight/values.yaml`:
- Around line 178-195: The GCP TCPXO entries in the networkResourceNames list
are long and need clearer separation: group the GCP entries (the keys starting
with "networking.gke.io.networks/gpu-nicX" and
"networking.gke.io.networks/gpu-nicX.IP") under a single comment header like "#
GCP multi-network resources (TCPXO)" and add a closing blank line or a trailing
comment after that block to visually separate it from the rest of
networkResourceNames; update the values.yaml list so the group is clearly
delimited while preserving all existing entries and their exact strings.
In `@preflight-checks/nccl-allreduce/nccl_allreduce/__main__.py`:
- Around line 183-193: When benchmark.run(message_sizes) raises an exception,
non-rank-0 branches currently return NCCLError.ALLREDUCE_TIMEOUT without
additional context; add a log entry on those branches before returning.
Specifically, in the except block surrounding benchmark.run, after
log.error(...) and when rank != 0, emit a debug/info log (e.g., using log.debug
or log.info) that includes the rank and the exception (err) to record that a
non-root process failed and is returning the timeout exit code; keep the
existing root behavior using _handle_failure and NCCLError.ALLREDUCE_TIMEOUT
intact.
- Around line 143-147: Replace the try/except/pass cleanup around
dist.destroy_process_group() with contextlib.suppress(Exception) to simplify
exception handling: import contextlib alongside the other imports and wrap the
call to dist.destroy_process_group() in a with contextlib.suppress(Exception):
block (replace the finally: try: ... except: pass pattern around
dist.destroy_process_group()).
In `@preflight-checks/nccl-allreduce/nccl_allreduce/tests/test_config.py`:
- Around line 36-49: Combine the nested context managers into a single
with-statement in the test methods (e.g., in test_rejects_non_numeric,
test_rejects_zero, test_rejects_negative) by merging patch.dict(...) and
pytest.raises(...) into one line to improve readability; apply the same change
to the analogous methods in TestParseInt and TestConfigFromEnv so each test uses
a single with that contains both the environment patch and the expected
exception context while keeping the same assertions that call _parse_float or
the corresponding parse_int/config helpers.
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Actionable comments posted: 2
🧹 Nitpick comments (1)
preflight-checks/nccl-allreduce/nccl_allreduce/tests/test_config.py (1)
36-49: Consider combining nested `with` statements.
Static analysis flags that nested `with` statements can be combined for cleaner code. This is a minor style improvement.
Example refactor
 def test_rejects_non_numeric(self) -> None:
-    with patch.dict(os.environ, {"TEST_VAR": "abc"}):
-        with pytest.raises(ValueError, match="Invalid TEST_VAR"):
-            _parse_float("TEST_VAR", 0.0)
+    with patch.dict(os.environ, {"TEST_VAR": "abc"}), pytest.raises(ValueError, match="Invalid TEST_VAR"):
+        _parse_float("TEST_VAR", 0.0)

 def test_rejects_zero(self) -> None:
-    with patch.dict(os.environ, {"TEST_VAR": "0"}):
-        with pytest.raises(ValueError, match="must be positive"):
-            _parse_float("TEST_VAR", 1.0)
+    with patch.dict(os.environ, {"TEST_VAR": "0"}), pytest.raises(ValueError, match="must be positive"):
+        _parse_float("TEST_VAR", 1.0)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@preflight-checks/nccl-allreduce/nccl_allreduce/tests/test_config.py` around lines 36 - 49, Combine the nested context managers in the three tests to use a single with statement per test: replace the nested "with patch.dict(os.environ, {...}): with pytest.raises(...):" pattern in test_rejects_non_numeric, test_rejects_zero, and test_rejects_negative with a single "with patch.dict(os.environ, {...}), pytest.raises(...):" line so the tests still call _parse_float("TEST_VAR", ...) but use the consolidated context manager form for clarity and style.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@distros/kubernetes/nvsentinel/charts/preflight/values.yaml`:
- Around line 167-170: Add a commented concrete example for the
ncclAllreduceExtraEnv values entry to show how to pass extra environment
variables into the preflight-nccl-allreduce init container; update the
values.yaml near the ncclAllreduceExtraEnv: [] block to include a commented
example demonstrating an array of env objects (e.g., name/value and
name/valueFrom patterns) so operators can see how to override/extend the fabric
profile when using the preflight-nccl-allreduce init container.
- Around line 129-155: The new initContainer entry named
preflight-nccl-allreduce is missing inline comments for the new fields; add
concise Helm values.yaml comments describing the image (preflight-nccl-allreduce
image source and intended tag usage), the securityContext block and the added
capability IPC_LOCK (why IPC_LOCK is required for RDMA memory registration), and
brief notes for env (what BW_THRESHOLD_GBPS, MESSAGE_SIZES,
GANG_TIMEOUT_SECONDS, NCCL_DEBUG, NCCL_DEBUG_SUBSYS control) and volumeMounts
(purpose of nvsentinel-socket and mountPath); update the comments next to the
symbols image, securityContext, IPC_LOCK, env, and volumeMounts in the
preflight-nccl-allreduce block.
---
Duplicate comments:
In `@preflight-checks/nccl-allreduce/nccl_allreduce/config.py`:
- Around line 71-83: The environment variable names used in config.py are
inconsistent with the webhook: change the getenv call for gang_config_dir to
read "GANG_CONFIG_PATH" instead of "GANG_CONFIG_DIR", and change the call to
_parse_int for gang_timeout_seconds to use "GANG_TIMEOUT" instead of
"GANG_TIMEOUT_SECONDS" so the variables gang_config_dir and gang_timeout_seconds
pick up values injected by the Go webhook (leave the variable names and defaults
intact).
In `@preflight-checks/nccl-allreduce/nccl_allreduce/tests/test_errors.py`:
- Around line 61-64: The test test_error_def_is_frozen currently only checks
type; update it to attempt mutating the ErrorDef instance returned by
NCCLError.SUCCESS.value and assert that mutation raises the expected exception
(e.g. dataclasses.FrozenInstanceError or AttributeError). Specifically, in
test_error_def_is_frozen get error_def = NCCLError.SUCCESS.value, then use
pytest.raises(FrozenInstanceError) (import from dataclasses) around a mutation
like setting error_def.code = <new value> to verify the dataclass is actually
frozen.
In `@preflight-checks/nccl-allreduce/scripts/entrypoint.py`:
- Around line 23-38: The docstring and code use environment variables
GANG_CONFIG_DIR and GANG_TIMEOUT_SECONDS but the webhook injector may still
export GANG_CONFIG_PATH and GANG_TIMEOUT; update the webhook injector to export
GANG_CONFIG_DIR and GANG_TIMEOUT_SECONDS (or change the entrypoint to accept the
injector's names) and make the two sides consistent, then update the comment
block in entrypoint.py to reflect the final chosen names; search for occurrences
of GANG_CONFIG_PATH, GANG_CONFIG_DIR, GANG_TIMEOUT, and GANG_TIMEOUT_SECONDS and
ensure all references (injector and the Python entrypoint functions/classes) use
the same variable names.
- Around line 118-136: In _load_config(), validate and safely parse
GANG_TIMEOUT_SECONDS and NPROCS_PER_NODE: replace direct int() calls with
guarded parsing (try/except ValueError/TypeError) for GANG_TIMEOUT_SECONDS
(default DEFAULT_GANG_TIMEOUT) and for NPROCS_PER_NODE (fall back to calling
_detect_gpu_count() only on empty env), ensure both parsed values are positive
(>0), and if parsing fails or values are non-positive call log.error with a
clear message identifying the variable (e.g., "invalid GANG_TIMEOUT_SECONDS" or
"invalid NPROCS_PER_NODE") and return None instead of letting an exception
propagate; update references to gang_timeout and nprocs_per_node accordingly so
callers receive a clean None on bad config.
In `@preflight/pkg/webhook/injector.go`:
- Around line 283-296: The appendExtraHostPathMounts function currently adds
VolumeMounts for entries missing HostPath which collectExtraHostPathVolumes
rejects; update appendExtraHostPathMounts to mirror the same validation as
collectExtraHostPathVolumes by skipping any mount where m.Name, m.MountPath, or
m.HostPath is empty (i.e., require m.HostPath != "" before appending) so
VolumeMounts always have matching volumes from
cfg.GangCoordination.ExtraHostPathMounts.
---
Nitpick comments:
In `@preflight-checks/nccl-allreduce/nccl_allreduce/tests/test_config.py`:
- Around line 36-49: Combine the nested context managers in the three tests to
use a single with statement per test: replace the nested "with
patch.dict(os.environ, {...}): with pytest.raises(...):" pattern in
test_rejects_non_numeric, test_rejects_zero, and test_rejects_negative with a
single "with patch.dict(os.environ, {...}), pytest.raises(...):" line so the
tests still call _parse_float("TEST_VAR", ...) but use the consolidated context
manager form for clarity and style.
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Merging this branch will decrease overall coverage
Coverage by file
Changed files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.
Changed unit test files
Actionable comments posted: 5
🧹 Nitpick comments (12)
preflight/pkg/webhook/injector.go (2)
544-544: `slog.Info` in `injectGangEnv` fires once per init container.
`injectGangEnv` is called inside the `buildInitContainers` loop (line 226), so a pod with two init containers emits this log line twice with identical fields. Downgrade to `slog.Debug` or move the log to the call site after the loop to avoid duplicate structured-log noise.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@preflight/pkg/webhook/injector.go` at line 544, The structured log in injectGangEnv ("Injecting gang environment variables") is emitted for each init container causing duplicate entries; change the call in injectGangEnv from slog.Info to slog.Debug (or remove it and place a single slog.Info at the caller buildInitContainers after the loop) so the message is logged only once—update the slog call inside injectGangEnv (or add the single log in buildInitContainers) accordingly.
274-279: Consider deduplicating mirrored DRA resource claims.
`container.Resources.Claims` is appended unconditionally. If the init-container template already declares some claims that overlap with pod-level claims, this produces duplicate entries and may be rejected by the API server.
♻️ Proposed fix
+existingClaims := make(map[string]bool)
+for _, c := range container.Resources.Claims {
+    existingClaims[c.Name] = true
+}
 if mirrorClaims {
     for _, podClaim := range podResourceClaims {
-        container.Resources.Claims = append(container.Resources.Claims, corev1.ResourceClaim{
-            Name: podClaim.Name,
-        })
+        if !existingClaims[podClaim.Name] {
+            container.Resources.Claims = append(container.Resources.Claims, corev1.ResourceClaim{
+                Name: podClaim.Name,
+            })
+        }
     }
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@preflight/pkg/webhook/injector.go` around lines 274 - 279, When mirrorClaims is true we currently append every podResourceClaim into container.Resources.Claims unconditionally, causing duplicates; update the logic in injector.go (the mirrorClaims branch that iterates podResourceClaims and appends to container.Resources.Claims) to skip claims whose Name already exists in container.Resources.Claims by first building a set of existing claim names (from container.Resources.Claims) and only appending podResourceClaims with names not present in that set, ensuring duplicated ResourceClaim entries are not produced for init-container templates.
preflight/pkg/config/config.go (2)
231-243: Shared pointer alias across all default-ReadOnly mounts.
All `ExtraHostPathMounts` and `ExtraVolumeMounts` entries that lack an explicit `ReadOnly` value will share the same `*bool` pointer (`trueVal`). If any downstream code ever mutates through that pointer (e.g., `*mount.ReadOnly = false`), it would silently affect every mount sharing the pointer. Safer to allocate per-entry.
♻️ Proposed fix
-trueVal := true
-
 for i := range c.ExtraHostPathMounts {
     if c.ExtraHostPathMounts[i].ReadOnly == nil {
+        trueVal := true
         c.ExtraHostPathMounts[i].ReadOnly = &trueVal
     }
 }
 for i := range c.ExtraVolumeMounts {
     if c.ExtraVolumeMounts[i].ReadOnly == nil {
+        trueVal := true
         c.ExtraVolumeMounts[i].ReadOnly = &trueVal
     }
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@preflight/pkg/config/config.go` around lines 231 - 243, The current loops in config.go assign the same trueVal pointer to all mounts, creating a shared *bool that could be mutated across entries; instead allocate a distinct bool for each entry by creating a new local variable (e.g., tmp := true) inside each iteration and assign its address to ExtraHostPathMounts[i].ReadOnly and ExtraVolumeMounts[i].ReadOnly respectively so every mount gets its own *bool rather than sharing the same pointer.
147-151: Consider validating `HostPathType` against known Kubernetes values.
`HostPathType` is a free-form string. An invalid value (e.g., a typo like `"Dirctory"`) would pass config loading silently and only fail at pod admission time. A validation step in `validate()` would surface misconfigurations earlier.
Verify each finding against the current code and only fix it if needed. In `@preflight/pkg/config/config.go` around lines 147 - 151, The HostPathType field is a free-form string and should be validated early; update the validate() function to check config.HostPathType against the allowed Kubernetes values ("", "Directory", "DirectoryOrCreate", "File", "FileOrCreate", "Socket", "CharDevice", "BlockDevice") and return a clear error when it is non-empty and not one of these options (include the invalid value in the message). Locate the HostPathType field definition and add this lookup/validation logic inside the existing validate() method (or helper called from validate()) so mis-typed values fail config validation instead of during pod admission.
preflight-checks/nccl-allreduce/nccl_allreduce/gang.py (1)
201-204: Unreachable `except FileNotFoundError` branch.
`_read_string(KEY_PEERS, default="")` already catches `FileNotFoundError` internally and returns `""` when a default is provided, so the outer `except FileNotFoundError` on line 203 is dead code. Not harmful, but misleading to future readers.
♻️ Simplified version
-try:
-    data = self._read_string(KEY_PEERS, default="")
-except FileNotFoundError:
-    return []
+data = self._read_string(KEY_PEERS, default="")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@preflight-checks/nccl-allreduce/nccl_allreduce/gang.py` around lines 201 - 204, The except FileNotFoundError branch is unreachable because _read_string(KEY_PEERS, default="") already handles FileNotFoundError and returns "" when missing; remove the try/except and directly assign data = self._read_string(KEY_PEERS, default="") then continue the existing logic (e.g., return [] when data == "" or parse peers otherwise). Update references in the gang.py method containing KEY_PEERS and _read_string accordingly so the code is clearer and no longer includes the dead except block.
preflight/pkg/controller/gang_controller_test.go (1)
158-159: Empty config skips all new NCCL topo logic — consider adding a test for `ensureNCCLTopoConfigMap`.
`testCfg` is initialized as `&config.Config{}`, so `NCCLTopoConfigMap` and `NCCLTopoData` are empty strings, causing `ensureNCCLTopoConfigMap` to short-circuit in every test case. The new ConfigMap creation path (Get → Create, TOCTOU handling, AlreadyExists guard) has no test coverage.
Consider adding a test case where `testCfg.GangCoordination.NCCLTopoConfigMap` and `NCCLTopoData` are set and verifying the topo ConfigMap is created alongside the gang ConfigMap.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@preflight/pkg/controller/gang_controller_test.go` around lines 158 - 159, The tests create NewGangController with an empty testCfg so ensureNCCLTopoConfigMap always returns early; add a test scenario that sets testCfg.GangCoordination.NCCLTopoConfigMap and testCfg.GangCoordination.NCCLTopoData to non-empty values, then call the same controller logic that creates the gang ConfigMap and assert that the NCCL topo ConfigMap is created as well (covering Get→Create TOCTOU and AlreadyExists paths); specifically modify the gang_controller_test to initialize testCfg with those fields, invoke the controller reconcile or the helper that calls ensureNCCLTopoConfigMap, and verify the ConfigMap exists and contains the provided NCCLTopoData.
preflight/pkg/controller/gang_controller.go (2)
233-258: Consider adding a `managed-by` label for cleanup discoverability.
The NCCL topo ConfigMap has `component` and `name` labels but no `app.kubernetes.io/managed-by` label. Adding one would make it straightforward to list and clean up auto-created topo ConfigMaps (e.g., `kubectl get cm -l app.kubernetes.io/managed-by=nvsentinel-preflight`).
♻️ Suggested label addition
 Labels: map[string]string{
     "app.kubernetes.io/component": "nccl-topo",
     "app.kubernetes.io/name":      "nvsentinel",
+    "app.kubernetes.io/managed-by": "nvsentinel-preflight",
 },
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@preflight/pkg/controller/gang_controller.go` around lines 233 - 258, Add a managed-by label to the NCCL topo ConfigMap labels map so auto-created topo ConfigMaps are discoverable/cleanable; update the Labels map in the cm object (where cm is created and gcfg.NCCLTopoConfigMap is used) to include "app.kubernetes.io/managed-by": "nvsentinel-preflight" alongside the existing "app.kubernetes.io/component" and "app.kubernetes.io/name" entries.
203-209: NCCL topo ConfigMap creation adds latency to the admission webhook path.
`ensureNCCLTopoConfigMap` performs a Get + conditional Create against the API server synchronously inside `RegisterPod`, which runs in the webhook's admission path. While the existing `EnsureConfigMap` call already adds latency, this doubles the API calls for pods with NCCL topo configured. If API server latency is high, this could push admission close to the webhook timeout.
Consider offloading this to the controller's reconcile loop or a goroutine if webhook latency becomes a concern.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@preflight/pkg/controller/gang_controller.go` around lines 203 - 209, The call to ensureNCCLTopoConfigMap inside RegisterPod performs a synchronous Get+Create against the API server (duplicating EnsureConfigMap work) and can spike admission latency; move this work out of the webhook path by either (a) enqueueing the pod/namespace to the gang controller's reconcile loop so the reconcile handler ensures the NCCL topo ConfigMap asynchronously, or (b) spawning a background goroutine from RegisterPod that deduplicates requests (use a namespace-level singleflight/mutex or a map of in-flight requests keyed by reg.Namespace) and performs the Get/Create off the admission path; update ensureNCCLTopoConfigMap (and any EnsureConfigMap usage) to return errors via logs/metrics instead of blocking admission and ensure failures are retried by the controller reconcile logic.
preflight-checks/nccl-allreduce/nccl_allreduce/tests/test_config.py (2)
68-71: `TestParseInt` is missing a `test_rejects_negative` counterpart.
`TestParseFloat` has both `test_rejects_zero` and `test_rejects_negative`, but `TestParseInt` only has `test_rejects_zero`. Both helpers share the same `<= 0` guard, so adding the negative case gives symmetric coverage.
🧪 Suggested addition
+def test_rejects_negative(self) -> None:
+    with patch.dict(os.environ, {"TEST_VAR": "-1"}), pytest.raises(ValueError, match="must be positive"):
+        _parse_int("TEST_VAR", 1)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@preflight-checks/nccl-allreduce/nccl_allreduce/tests/test_config.py` around lines 68 - 71, Add a new test method test_rejects_negative to the TestParseInt tests that mirrors TestParseFloat's negative case: use patch.dict to set {"TEST_VAR": "-1"} and assert that _parse_int("TEST_VAR", 1) raises ValueError with a "must be positive" match; place the new test alongside the existing test_rejects_zero so both zero and negative inputs are covered for _parse_int.
37-49: Combine nested `with` statements — flagged as errors by Ruff (SIM117).
All `pytest.raises` blocks use nested `with` statements. Python's `with` supports multiple context managers in one statement.
♻️ Example fix (pattern applies to all similar blocks in this file)
-with patch.dict(os.environ, {"TEST_VAR": "abc"}):
-    with pytest.raises(ValueError, match="Invalid TEST_VAR"):
-        _parse_float("TEST_VAR", 0.0)
+with patch.dict(os.environ, {"TEST_VAR": "abc"}), pytest.raises(ValueError, match="Invalid TEST_VAR"):
+    _parse_float("TEST_VAR", 0.0)
The same pattern applies to Lines 42-43, 47-48, 64-65, 69-70, 117-118, 124-125, 131-132, and 137-138.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@preflight-checks/nccl-allreduce/nccl_allreduce/tests/test_config.py` around lines 37 - 49, Replace nested context-manager blocks with single combined with-statements for each test to satisfy Ruff SIM117: in functions test_rejects_zero, test_rejects_negative (and other listed tests) combine the patch.dict and pytest.raises into one line, e.g. use "with patch.dict(os.environ, {...}), pytest.raises(...):" and call _parse_float("TEST_VAR", ...) inside that single with-block; update all occurrences referencing _parse_float and the test function names accordingly.
preflight-checks/nccl-allreduce/nccl_allreduce/__main__.py (1)
144-147: Use `contextlib.suppress` instead of `try/except/pass` (Ruff SIM105).
♻️ Proposed refactor
+import contextlib
 ...
-try:
-    dist.destroy_process_group()
-except Exception:
-    pass  # Ignore cleanup errors
+with contextlib.suppress(Exception):
+    dist.destroy_process_group()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@preflight-checks/nccl-allreduce/nccl_allreduce/__main__.py` around lines 144 - 147, The try/except/pass around cleaning up the PyTorch process group should be replaced with contextlib.suppress to follow Ruff SIM105; wrap the call to dist.destroy_process_group() with contextlib.suppress(Exception) and add the corresponding import (either import contextlib or from contextlib import suppress) near the top of the module so the cleanup block uses suppress instead of an empty except in __main__.py.
preflight-checks/nccl-allreduce/Dockerfile (1)
37-37: Poetry `1.8.2` is well behind the current stable release.
The current stable Poetry release is `2.3.2` (Feb 1, 2026); the 1.x branch last published `1.8.5` (Dec 6, 2024). Poetry 1.8.2 (March 2024) is no longer maintained.
Bump to `1.8.5` to pick up security and bug fixes. For migration to Poetry 2.x, note that `poetry export` was removed in Poetry 2.0 (Jan 5, 2025) and requires installing the `poetry-plugin-export` plugin, either via `poetry self add poetry-plugin-export` or by declaring it in `pyproject.toml` under `[tool.poetry.requires-plugins]`. The `poetry build` command remains available with new options added. Plan the migration accordingly if the build process relies on either command.
♻️ Proposed immediate fix
-    pip install poetry==1.8.2
+    pip install poetry==1.8.5
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@preflight-checks/nccl-allreduce/Dockerfile` at line 37, Update the Poetry installation in the Dockerfile: replace the hardcoded version "poetry==1.8.2" with "poetry==1.8.5" to pick up security/bug fixes (update the pip install line that currently reads pip install poetry==1.8.2), and if you plan to migrate to Poetry 2.x later, add steps to install the poetry-plugin-export (via poetry self add poetry-plugin-export or declaring it under [tool.poetry.requires-plugins] in pyproject.toml) because poetry export was removed in 2.0; ensure any downstream build steps that rely on poetry export or changed CLI behavior are adjusted accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@preflight-checks/nccl-allreduce/nccl_allreduce/__main__.py`:
- Around line 106-116: The NCCL process-group is using the default 30-minute
timeout; update the dist.init_process_group call to pass an explicit timeout
derived from the existing gang timeout (e.g., use
datetime.timedelta(seconds=gang_timeout)) so the NCCL init will abort within the
preflight's gang_timeout; modify the code around dist.init_process_group to
import datetime (if needed) and call dist.init_process_group(backend="nccl",
timeout=datetime.timedelta(seconds=gang_timeout)) (use the actual gang timeout
variable name used in this module) and keep the surrounding logs unchanged.
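A sketch of the explicit timeout, assuming `gang_timeout` carries the preflight's gang timeout in seconds:

```python
import datetime

import torch.distributed as dist

gang_timeout = 600  # seconds; the real value comes from the module's config

# Bound NCCL init (and subsequent collectives) by the preflight's own timeout
# instead of PyTorch's default of 30 minutes.
dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(seconds=gang_timeout),
)
```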
- Around line 99-104: The code currently sets NCCL_ASYNC_ERROR_HANDLING via
os.environ; update the environment variable handling so the new PyTorch name
TORCH_NCCL_ASYNC_ERROR_HANDLING is set (and keep setting
NCCL_ASYNC_ERROR_HANDLING for backward compatibility). Specifically, in the
__main__.py block that manipulates os.environ, add a check to set
os.environ["TORCH_NCCL_ASYNC_ERROR_HANDLING"]="1" when neither variable exists
(or mirror the value when only the old name is present), ensuring both
TORCH_NCCL_ASYNC_ERROR_HANDLING and NCCL_ASYNC_ERROR_HANDLING are defined to
avoid deprecation warnings and future breakage.
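A sketch of mirroring the two names so both old and new PyTorch builds see the same setting:

```python
import os

# Newer PyTorch reads TORCH_NCCL_ASYNC_ERROR_HANDLING; older releases read
# NCCL_ASYNC_ERROR_HANDLING. Define both, preferring whichever is already set.
value = (
    os.environ.get("TORCH_NCCL_ASYNC_ERROR_HANDLING")
    or os.environ.get("NCCL_ASYNC_ERROR_HANDLING")
    or "1"
)
os.environ.setdefault("TORCH_NCCL_ASYNC_ERROR_HANDLING", value)
os.environ.setdefault("NCCL_ASYNC_ERROR_HANDLING", value)
```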
In `@preflight-checks/nccl-allreduce/scripts/entrypoint.py`:
- Line 205: Wrap the os.execvp call inside _launch_torchrun with a try/except
that catches OSError, and on error log a clear message including the attempted
cmd and the exception (use the existing logger/process logger used elsewhere in
this module), then raise or exit with the module's structured NCCLError (use the
appropriate NCCLError code for "launcher not found" or a generic launcher
failure) instead of letting the raw traceback propagate; ensure callers of
_launch_torchrun retain the structured exception handling path.
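A sketch of the guarded exec; the specific `NCCLError` member is an assumption standing in for the module's real launcher-failure code:

```python
import os
import sys


def _launch_torchrun(cmd: list[str]) -> None:
    """Replace this process with torchrun, exiting via a structured code on failure."""
    try:
        os.execvp(cmd[0], cmd)  # only returns if the exec itself fails
    except OSError as err:
        print(f"Failed to launch {cmd!r}: {err}", file=sys.stderr)
        # LAUNCHER_FAILED is illustrative; use the module's actual error member.
        sys.exit(NCCLError.LAUNCHER_FAILED.value.exit_code)
```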
- Around line 92-94: The problem is that _wait_for_gang sends a
GANG_TIMEOUT health event but returns None, and main() treats any None as a
GANG_CONFIG_ERROR exit (2), masking the timeout (3). Fix by making
_wait_for_gang return an explicit exit code on failure (return
NCCLError.GANG_TIMEOUT.value.exit_code for TimeoutError, or other appropriate
NCCLError codes) instead of None, and update main() to check if gang_config is
an int (error code) and immediately return that code; otherwise proceed when
gang_config is a valid config object. Ensure you reference _wait_for_gang and
the gang_config variable and use NCCLError.GANG_TIMEOUT /
NCCLError.GANG_CONFIG_ERROR enum values for returned exit codes.
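A sketch of the disambiguated return, assuming the module's `NCCLError` members carry `exit_code` as described; all other names are placeholders:

```python
def _wait_for_gang(pod_name: str, timeout: int):
    """Return a GangConfig on success, or an int exit code on failure."""
    try:
        return waiter.wait(pod_name, timeout)  # waiter built earlier in the module
    except TimeoutError:
        send_gang_timeout_event()  # existing health reporting path, assumed name
        return NCCLError.GANG_TIMEOUT.value.exit_code


def main() -> int:
    result = _wait_for_gang(pod_name, gang_timeout)
    if isinstance(result, int):
        return result  # propagate the specific failure code (e.g., 3) unchanged
    _launch_torchrun(result)
    return NCCLError.SUCCESS.value.exit_code
```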
In `@preflight/pkg/webhook/injector.go`:
- Around line 298-312: The extra volume mounts are defaulting to ReadOnly=true
in appendExtraVolumeMounts (Injector.appendExtraVolumeMounts), which breaks
writable Unix socket volumes like TCPXO; update the default to false by changing
the call that constructs the VolumeMount ReadOnly field from
boolDefault(m.ReadOnly, true) to boolDefault(m.ReadOnly, false) (or ensure
callers explicitly set ReadOnly=false for socket volumes), so socket-based
plugins can open writable sockets during init and runtime.
---
Duplicate comments:
In `@distros/kubernetes/nvsentinel/charts/preflight/values.yaml`:
- Around line 167-170: Add a short commented usage example for
ncclAllreduceExtraEnv in values.yaml showing the expected YAML list of env var
objects (name/value or valueFrom) that will be appended to the
preflight-nccl-allreduce init container after networkFabric defaults; update the
ncclAllreduceExtraEnv block to include one or two commented entries (e.g.,
overriding FABRIC_PROFILE or setting FABRIC_LOG_LEVEL) and a brief note that
these are merged after networkFabric defaults so operators know format and
precedence.
- Around line 135-154: Add inline comment annotations for the image and
volumeMounts entries for the preflight-nccl-allreduce container: update the
image line (image: ghcr.io/nvidia/nvsentinel/preflight-nccl-allreduce:latest) to
include a brief comment describing what the image contains and when to pin/tag
it, and add a comment next to the volumeMounts nvsentinel-socket entry (name:
nvsentinel-socket, mountPath: /var/run) explaining the purpose of the socket
volume (e.g., sharing runtime sockets/IPC with host) and any required
permissions; modify the values under the preflight-nccl-allreduce container
definition so both the image and the nvsentinel-socket mount have inline Helm
values.yaml comments consistent with other documented fields.
In `@preflight-checks/nccl-allreduce/nccl_allreduce/config.py`:
- Around line 71-83: The env var names used to populate gang_config_dir and
gang_timeout_seconds are incorrect and must match the webhook: replace
os.getenv("GANG_CONFIG_DIR", ...) with os.getenv("GANG_CONFIG_PATH", ...) for
gang_config_dir and change _parse_int("GANG_TIMEOUT_SECONDS", ...) to
_parse_int("GANG_TIMEOUT", ...) for gang_timeout_seconds so operator-set values
(configMapMountPath and timeout) are honored at runtime; update any related
docs/comments so the expected env names (GANG_CONFIG_PATH, GANG_TIMEOUT) are
consistent across the codebase.
In `@preflight-checks/nccl-allreduce/nccl_allreduce/tests/test_errors.py`:
- Around line 61-64: The test test_error_def_is_frozen currently only checks
isinstance and doesn't verify immutability; update it to attempt a mutation on
an ErrorDef instance obtained via NCCLError.SUCCESS.value and assert that
mutation raises a TypeError (using pytest.raises) to confirm it's frozen, and
add import pytest at top of the test file; reference ErrorDef, NCCLError, and
test_error_def_is_frozen when making the change.
In `@preflight-checks/nccl-allreduce/scripts/entrypoint.py`:
- Around line 129-130: The env var names used when reading configuration
(gang_config_dir and gang_timeout) do not match the webhook injector; update the
getenv keys to "GANG_CONFIG_PATH" and "GANG_TIMEOUT" respectively, keeping the
same defaults (DEFAULT_GANG_CONFIG_DIR and DEFAULT_GANG_TIMEOUT) and parsing the
timeout to int as before so gang_config_dir = os.getenv("GANG_CONFIG_PATH",
DEFAULT_GANG_CONFIG_DIR) and gang_timeout = int(os.getenv("GANG_TIMEOUT",
str(DEFAULT_GANG_TIMEOUT))) while leaving the variable names (gang_config_dir,
gang_timeout) unchanged.
- Around line 130-136: Wrap the raw int conversions for GANG_TIMEOUT_SECONDS and
NPROCS_PER_NODE in validation with try/except ValueError: when parsing
os.getenv("GANG_TIMEOUT_SECONDS", str(DEFAULT_GANG_TIMEOUT)) assign to
gang_timeout only if int(...) succeeds otherwise log a structured warning/error
referencing the variable name and provided value and fall back to
DEFAULT_GANG_TIMEOUT; similarly, when parsing nprocs_env for nprocs_per_node,
catch ValueError, log a structured error mentioning NPROCS_PER_NODE and the
invalid value, and then either fall back to calling _detect_gpu_count() or exit
with a clear message; update the code paths around gang_timeout,
nprocs_per_node, GANG_TIMEOUT_SECONDS, NPROCS_PER_NODE and _detect_gpu_count()
accordingly so no raw ValueError traceback is raised.
In `@preflight/pkg/webhook/injector.go`:
- Around line 284-296: The appendExtraHostPathMounts function is inserting
VolumeMounts even when collectExtraHostPathVolumes skips creating the
corresponding Volumes (when m.HostPath == "" or when m.HostPathType is
unsupported), causing broken pod specs; update appendExtraHostPathMounts to
perform the same validation as collectExtraHostPathVolumes (skip mounts when
m.HostPath == "" or when m.HostPathType is not one of the supported types) or
extract a shared helper (e.g., validateExtraHostPath) used by both
appendExtraHostPathMounts and collectExtraHostPathVolumes to centralize the
checks against i.cfg.GangCoordination.ExtraHostPathMounts so VolumeMounts are
only added when a matching Volume will be created.
---
Nitpick comments:
In `@preflight-checks/nccl-allreduce/Dockerfile`:
- Line 37: Update the Poetry installation in the Dockerfile: replace the
hardcoded version "poetry==1.8.2" with "poetry==1.8.5" to pick up security/bug
fixes (update the pip install line that currently reads pip install
poetry==1.8.2), and if you plan to migrate to Poetry 2.x later, add steps to
install the poetry-plugin-export (via poetry self add poetry-plugin-export or
declaring it under [tool.poetry.requires-plugins] in pyproject.toml) because
poetry export was removed in 2.0; ensure any downstream build steps that rely on
poetry export or changed CLI behavior are adjusted accordingly.
In `@preflight-checks/nccl-allreduce/nccl_allreduce/__main__.py`:
- Around line 144-147: The try/except/pass around cleaning up the PyTorch
process group should be replaced with contextlib.suppress to follow Ruff SIM105;
wrap the call to dist.destroy_process_group() with
contextlib.suppress(Exception) and add the corresponding import (either import
contextlib or from contextlib import suppress) near the top of the module so the
cleanup block uses suppress instead of an empty except in __main__.py.
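For reference, the suppress form reads:

import contextlib

import torch.distributed as dist

# Replaces: try: dist.destroy_process_group() / except Exception: pass
with contextlib.suppress(Exception):
    dist.destroy_process_group()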
In `@preflight-checks/nccl-allreduce/nccl_allreduce/gang.py`:
- Around line 201-204: The except FileNotFoundError branch is unreachable
because _read_string(KEY_PEERS, default="") already handles FileNotFoundError
and returns "" when missing; remove the try/except and directly assign data =
self._read_string(KEY_PEERS, default="") then continue the existing logic (e.g.,
return [] when data == "" or parse peers otherwise). Update references in the
gang.py method containing KEY_PEERS and _read_string accordingly so the code is
clearer and no longer includes the dead except block.
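A sketch of the simplified logic, lifted out of the class for illustration; the key value and comma delimiter are assumptions about gang.py:

KEY_PEERS = "peers"  # actual key value lives in gang.py


def read_peers(read_string) -> list[str]:
    # read_string mirrors gang.py's _read_string(key, default=...), which
    # already returns the default when the file is missing, so no extra
    # except FileNotFoundError branch is needed.
    data = read_string(KEY_PEERS, default="")
    if not data:
        return []
    return [p for p in data.split(",") if p]  # delimiter assumed


# usage: read_peers(lambda key, default="": "nodeA,nodeB") -> ["nodeA", "nodeB"]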
In `@preflight-checks/nccl-allreduce/nccl_allreduce/tests/test_config.py`:
- Around line 68-71: Add a new test method test_rejects_negative to the
TestParseInt tests that mirrors TestParseFloat's negative case: use patch.dict
to set {"TEST_VAR": "-1"} and assert that _parse_int("TEST_VAR", 1) raises
ValueError with a "must be positive" match; place the new test alongside the
existing test_rejects_zero so both zero and negative inputs are covered for
_parse_int.
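One way to write it, assuming _parse_int lives in nccl_allreduce.config; the test is written directly in the combined with-statement form the next comment asks for:

import os
from unittest.mock import patch

import pytest

from nccl_allreduce.config import _parse_int  # assumed import path


def test_rejects_negative():
    with patch.dict(os.environ, {"TEST_VAR": "-1"}), pytest.raises(ValueError, match="must be positive"):
        _parse_int("TEST_VAR", 1)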
- Around line 37-49: Replace nested context-manager blocks with single combined
with-statements for each test to satisfy Ruff SIM117: in functions
test_rejects_zero, test_rejects_negative (and other listed tests) combine the
patch.dict and pytest.raises into one line, e.g. use "with
patch.dict(os.environ, {...}), pytest.raises(...):" and call
_parse_float("TEST_VAR", ...) inside that single with-block; update all
occurrences referencing _parse_float and the test function names accordingly.
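Before/after, using test_rejects_zero as the example (imports as in the snippet above, plus _parse_float from the same assumed module):

# Before (nested, flagged by SIM117):
#     with patch.dict(os.environ, {"TEST_VAR": "0"}):
#         with pytest.raises(ValueError, match="must be positive"):
#             _parse_float("TEST_VAR", 1.0)

# After (single combined with-statement):
def test_rejects_zero():
    with patch.dict(os.environ, {"TEST_VAR": "0"}), pytest.raises(ValueError, match="must be positive"):
        _parse_float("TEST_VAR", 1.0)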
In `@preflight/pkg/config/config.go`:
- Around line 231-243: The current loops in config.go assign the same trueVal
pointer to all mounts, creating a shared *bool that could be mutated across
entries; instead allocate a distinct bool for each entry by creating a new local
variable (e.g., tmp := true) inside each iteration and assign its address to
ExtraHostPathMounts[i].ReadOnly and ExtraVolumeMounts[i].ReadOnly respectively
so every mount gets its own *bool rather than sharing the same pointer.
- Around line 147-151: The HostPathType field is a free-form string and should
be validated early; update the validate() function to check config.HostPathType
against the allowed Kubernetes values ("", "Directory", "DirectoryOrCreate",
"File", "FileOrCreate", "Socket", "CharDevice", "BlockDevice") and return a
clear error when it is non-empty and not one of these options (include the
invalid value in the message). Locate the HostPathType field definition and add
this lookup/validation logic inside the existing validate() method (or helper
called from validate()) so mis-typed values fail config validation instead of
during pod admission.
In `@preflight/pkg/controller/gang_controller_test.go`:
- Around line 158-159: The tests create NewGangController with an empty testCfg
so ensureNCCLTopoConfigMap always returns early; add a test scenario that sets
testCfg.GangCoordination.NCCLTopoConfigMap and
testCfg.GangCoordination.NCCLTopoData to non-empty values, then call the same
controller logic that creates the gang ConfigMap and assert that the NCCL topo
ConfigMap is created as well (covering Get→Create TOCTOU and AlreadyExists
paths); specifically modify the gang_controller_test to initialize testCfg with
those fields, invoke the controller reconcile or the helper that calls
ensureNCCLTopoConfigMap, and verify the ConfigMap exists and contains the
provided NCCLTopoData.
In `@preflight/pkg/controller/gang_controller.go`:
- Around line 233-258: Add a managed-by label to the NCCL topo ConfigMap labels
map so auto-created topo ConfigMaps are discoverable/cleanable; update the
Labels map in the cm object (where cm is created and gcfg.NCCLTopoConfigMap is
used) to include "app.kubernetes.io/managed-by": "nvsentinel-preflight"
alongside the existing "app.kubernetes.io/component" and
"app.kubernetes.io/name" entries.
- Around line 203-209: The call to ensureNCCLTopoConfigMap inside RegisterPod
performs a synchronous Get+Create against the API server (duplicating
EnsureConfigMap work) and can spike admission latency; move this work out of the
webhook path by either (a) enqueueing the pod/namespace to the gang controller's
reconcile loop so the reconcile handler ensures the NCCL topo ConfigMap
asynchronously, or (b) spawning a background goroutine from RegisterPod that
deduplicates requests (use a namespace-level singleflight/mutex or a map of
in-flight requests keyed by reg.Namespace) and performs the Get/Create off the
admission path; update ensureNCCLTopoConfigMap (and any EnsureConfigMap usage)
to return errors via logs/metrics instead of blocking admission and ensure
failures are retried by the controller reconcile logic.
In `@preflight/pkg/webhook/injector.go`:
- Line 544: The structured log in injectGangEnv ("Injecting gang environment
variables") is emitted for each init container causing duplicate entries; change
the call in injectGangEnv from slog.Info to slog.Debug (or remove it and place a
single slog.Info at the caller buildInitContainers after the loop) so the
message is logged only once—update the slog call inside injectGangEnv (or add
the single log in buildInitContainers) accordingly.
- Around line 274-279: When mirrorClaims is true we currently append every
podResourceClaim into container.Resources.Claims unconditionally, causing
duplicates; update the logic in injector.go (the mirrorClaims branch that
iterates podResourceClaims and appends to container.Resources.Claims) to skip
claims whose Name already exists in container.Resources.Claims by first building
a set of existing claim names (from container.Resources.Claims) and only
appending podResourceClaims with names not present in that set, ensuring
duplicated ResourceClaim entries are not produced for init-container templates.
🧩 Analysis chain
🏁 Script executed:
find . -type f -name "__main__.py" | grep nccl

Repository: NVIDIA/NVSentinel
Length of output: 121
🏁 Script executed:
head -n 110 ./preflight-checks/nccl-allreduce/nccl_allreduce/__main__.py | tail -n 20

Repository: NVIDIA/NVSentinel
Length of output: 904
🌐 Web query:
TORCH_NCCL_ASYNC_ERROR_HANDLING PyTorch deprecation NCCL_ASYNC_ERROR_HANDLING
💡 Result:
PyTorch deprecated the NCCL env var name NCCL_ASYNC_ERROR_HANDLING in favor of the PyTorch-prefixed variable TORCH_NCCL_ASYNC_ERROR_HANDLING (rename noted as happening in PyTorch 2.2, with the old name kept for backward compatibility). When the old name is set, you may see a warning like: “Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead”. [1], [2]
What to do (migration)
- Stop setting: NCCL_ASYNC_ERROR_HANDLING=... (deprecated name) [2]
- Set instead: TORCH_NCCL_ASYNC_ERROR_HANDLING=... [1]
Example:
# old (deprecated)
# export NCCL_ASYNC_ERROR_HANDLING=1
# new
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1

Semantics (PyTorch)
TORCH_NCCL_ASYNC_ERROR_HANDLING controls how ProcessGroupNCCL reacts when NCCL async errors are detected by the watchdog:
- 0: no async error handling
- 1: abort communicator + tear down process
- 2: abort communicator only
- 3: tear down process only (default) [1]
Sources: PyTorch ProcessGroupNCCL env var docs for TORCH_NCCL_ASYNC_ERROR_HANDLING [1]; NVIDIA CUDA Graph troubleshooting note describing the rename to TORCH_NCCL_ASYNC_ERROR_HANDLING in PyTorch 2.2 [2].
[1] (docs.pytorch.org)
[2] (docs.nvidia.com)
Citations:
- 1: https://docs.pytorch.org/docs/stable/torch_nccl_environment_variables.html?utm_source=openai
- 2: https://docs.nvidia.com/dl-cuda-graph/latest/troubleshooting/capture-failures.html?utm_source=openai
🌐 Web query:
NGC 25.01 PyTorch version
💡 Result:
NGC PyTorch container release 25.01 (e.g., nvcr.io/nvidia/pytorch:25.01-py3) is based on PyTorch 2.6.0a0+ecf3bae40a. [1]
Sources: [1]
NCCL_ASYNC_ERROR_HANDLING is deprecated — use TORCH_NCCL_ASYNC_ERROR_HANDLING instead.
PyTorch deprecated NCCL_ASYNC_ERROR_HANDLING in favor of TORCH_NCCL_ASYNC_ERROR_HANDLING (PyTorch 2.2+), and NGC 25.01 ships PyTorch 2.6.x, which emits the warning: "Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead". Setting only the old name risks the watchdog being ignored in future PyTorch versions where backward compatibility is dropped.
Proposed fix
if "NCCL_ASYNC_ERROR_HANDLING" not in os.environ:
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"
+if "TORCH_NCCL_ASYNC_ERROR_HANDLING" not in os.environ:
+ os.environ["TORCH_NCCL_ASYNC_ERROR_HANDLING"] = "1"📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if "NCCL_TIMEOUT" not in os.environ: | |
| os.environ["NCCL_TIMEOUT"] = "1800" # 30 minutes default | |
| if "NCCL_ASYNC_ERROR_HANDLING" not in os.environ: | |
| os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1" | |
| if "NCCL_DEBUG" not in os.environ: | |
| os.environ["NCCL_DEBUG"] = "INFO" # Enable NCCL debug logging | |
| if "NCCL_TIMEOUT" not in os.environ: | |
| os.environ["NCCL_TIMEOUT"] = "1800" # 30 minutes default | |
| if "NCCL_ASYNC_ERROR_HANDLING" not in os.environ: | |
| os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1" | |
| if "TORCH_NCCL_ASYNC_ERROR_HANDLING" not in os.environ: | |
| os.environ["TORCH_NCCL_ASYNC_ERROR_HANDLING"] = "1" | |
| if "NCCL_DEBUG" not in os.environ: | |
| os.environ["NCCL_DEBUG"] = "INFO" # Enable NCCL debug logging |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@preflight-checks/nccl-allreduce/nccl_allreduce/__main__.py` around lines 99 -
104, The code currently sets NCCL_ASYNC_ERROR_HANDLING via os.environ; update
the environment variable handling so the new PyTorch name
TORCH_NCCL_ASYNC_ERROR_HANDLING is set (and keep setting
NCCL_ASYNC_ERROR_HANDLING for backward compatibility). Specifically, in the
__main__.py block that manipulates os.environ, add a check to set
os.environ["TORCH_NCCL_ASYNC_ERROR_HANDLING"]="1" when neither variable exists
(or mirror the value when only the old name is present), ensuring both
TORCH_NCCL_ASYNC_ERROR_HANDLING and NCCL_ASYNC_ERROR_HANDLING are defined to
avoid deprecation warnings and future breakage.
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
🧹 Nitpick comments (1)
preflight-checks/nccl-allreduce/nccl_allreduce/__main__.py (1)
247-274: Potential ValueError from HealthReporter.send_failure if an error with error_code=None is ever passed.

Currently all callers of _handle_failure pass errors with a non-None error_code (ALLREDUCE_INIT_FAILED, ALLREDUCE_TIMEOUT, ALLREDUCE_BW_DEGRADED), so this is safe today. However, HealthReporter.send_failure raises ValueError for errors with error_code=None, and this function only catches RuntimeError. A defensive guard or broader catch would future-proof this.

Proposed fix

 try:
     reporter = HealthReporter(
         socket_path=cfg.connector_socket,
         node_name=cfg.node_name,
         processing_strategy=cfg.processing_strategy,
     )
     reporter.send_failure(error, message)
-except RuntimeError as err:
+except (RuntimeError, ValueError) as err:
     log.error(
         "Failed to send health event",
         extra={"error": str(err)},
     )
     return NCCLError.HEALTH_REPORT_FAILED.value.exit_code

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@preflight-checks/nccl-allreduce/nccl_allreduce/__main__.py` around lines 247 - 274, The _handle_failure function currently calls HealthReporter.send_failure without guarding against ValueError when an NCCLError with error_code=None is passed; update _handle_failure to either validate the passed error (NCCLError) has a non-None error_code before calling reporter.send_failure (and log+return an appropriate exit code if missing) or broaden the exception handler to catch ValueError in addition to RuntimeError around reporter.send_failure, ensuring you still log the error and return NCCLError.HEALTH_REPORT_FAILED.value.exit_code; reference _handle_failure and HealthReporter.send_failure in your change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@preflight-checks/nccl-allreduce/nccl_allreduce/__main__.py`:
- Around line 101-103: dist.init_process_group in __main__.py currently uses the
default (30min) timeout; modify the call to pass an explicit, shorter timeout to
enable fast-fail (e.g., timeout=datetime.timedelta(seconds=300)). Use an
existing timeout value from your Config if available (e.g., Config.nccl_timeout
or similar), otherwise apply a sensible hardcoded default (300s); update the
dist.init_process_group(backend="nccl") invocation to include the timeout
argument and ensure you import datetime if needed and reference the chosen
Config field name.
In `@preflight-checks/nccl-allreduce/scripts/entrypoint.py`:
- Around line 129-130: The script is reading the wrong environment variable
names so injected values are ignored; update the code that sets gang_config_dir
and gang_timeout to read the same env vars the Go webhook injector sets (use
GANG_CONFIG_PATH instead of GANG_CONFIG_DIR for gang_config_dir, and use
GANG_TIMEOUT instead of GANG_TIMEOUT_SECONDS for gang_timeout), preserve the
default fallbacks (DEFAULT_GANG_CONFIG_DIR and DEFAULT_GANG_TIMEOUT) and ensure
gang_timeout is parsed to int as currently done in the gang_timeout assignment.
- Around line 129-136: The environment parsing uses unguarded int() calls for
gang_timeout and nprocs_per_node which will raise raw ValueError on malformed
input; wrap the int conversions for os.getenv("GANG_TIMEOUT_SECONDS",
str(DEFAULT_GANG_TIMEOUT)) and int(os.getenv("NPROCS_PER_NODE", "")) in
try/except ValueError, validate the parsed values (e.g., ensure timeout and
nprocs are positive integers), and on failure log a clear error via your logger
(or print) including the env var name and bad value and then exit with a
non-zero code; if NPROCS_PER_NODE is unset or invalid fall back to calling
_detect_gpu_count() as currently intended.
---
Nitpick comments:
In `@preflight-checks/nccl-allreduce/nccl_allreduce/__main__.py`:
- Around line 247-274: The _handle_failure function currently calls
HealthReporter.send_failure without guarding against ValueError when an
NCCLError with error_code=None is passed; update _handle_failure to either
validate the passed error (NCCLError) has a non-None error_code before calling
reporter.send_failure (and log+return an appropriate exit code if missing) or
broaden the exception handler to catch ValueError in addition to RuntimeError
around reporter.send_failure, ensuring you still log the error and return
NCCLError.HEALTH_REPORT_FAILED.value.exit_code; reference _handle_failure and
HealthReporter.send_failure in your change.
Can we add a comment here with a link to where the original file is sourced from?
# Per-fabric mount profiles. Keyed by networkFabric name.
# The selected profile's extraHostPathMounts and extraVolumeMounts are
# merged into gangCoordination at render time (see _helpers.tpl).
fabricProfiles:
Won't the user container also have to specify these vars? Wouldn't it be easier to just copy the vars from the user's container? In config we could have a pattern, say NCCL_*, and any var matching the pattern would be copied; a sketch of the idea follows below.
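An illustration (in Python, though the webhook itself is Go) of the suggested pattern-based copy: take env vars from the user's container spec whose names match configured patterns and mirror them into the injected init container. COPY_PATTERNS and vars_to_copy are hypothetical names, not anything in the PR:

import fnmatch

COPY_PATTERNS = ["NCCL_*", "TORCH_NCCL_*"]  # hypothetical config values


def vars_to_copy(user_env: dict[str, str]) -> dict[str, str]:
    # Keep only the vars whose names match one of the configured glob patterns.
    return {
        name: value
        for name, value in user_env.items()
        if any(fnmatch.fnmatch(name, pat) for pat in COPY_PATTERNS)
    }


print(vars_to_copy({"NCCL_DEBUG": "INFO", "PATH": "/usr/bin"}))
# -> {'NCCL_DEBUG': 'INFO'}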
        },
    )
    for i in range(self._warmup):
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
Can we make dist.ReduceOp.SUM a config as well? A sketch of one approach follows below.
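A hedged sketch of making the reduce op configurable; ALLREDUCE_OP and _REDUCE_OPS are assumed names, not the benchmark's actual config keys (and ReduceOp.AVG requires a recent PyTorch with the NCCL backend):

import os

import torch.distributed as dist

_REDUCE_OPS = {
    "sum": dist.ReduceOp.SUM,
    "max": dist.ReduceOp.MAX,
    "min": dist.ReduceOp.MIN,
    "avg": dist.ReduceOp.AVG,  # NCCL backend only, recent PyTorch
}

op_name = os.getenv("ALLREDUCE_OP", "sum").lower()
try:
    reduce_op = _REDUCE_OPS[op_name]
except KeyError:
    raise ValueError(f"unsupported ALLREDUCE_OP={op_name!r}; expected one of {sorted(_REDUCE_OPS)}")

# later, inside the benchmark loop:
# dist.all_reduce(tensor, op=reduce_op)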
Summary
Adds multi-node NCCL all-reduce as a preflight check. The webhook injects an init container that runs a bandwidth benchmark across all gang members before the workload starts. Validates GPU interconnect health across InfiniBand, EFA, TCPXO, and MNNVL fabrics.
Key changes:
- New preflight-nccl-allreduce container image (PyTorch + torchrun benchmark)
- networkFabric Helm selector with data-driven fabric profiles (ib, efa, mnnvl-efa, tcpxo)
- extraVolumeMounts for GCP TCPXO plugin injection
- NCCL topology ConfigMap support (ncclTopoShape)
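For context, a hedged sketch of the pass/fail core of such a benchmark: compute all-reduce bus bandwidth from bytes moved and elapsed time, then compare against a threshold (the 100.00 GB/s figure from the test logs below). The busbw = algbw * 2*(n-1)/n conversion is the standard nccl-tests convention for all-reduce; function names are illustrative, not the actual implementation:

def allreduce_busbw_gbps(size_bytes: int, seconds: float, world_size: int) -> float:
    # Algorithmic bandwidth: bytes moved per rank per second, in GB/s.
    algbw = size_bytes / seconds / 1e9
    # Bus bandwidth normalizes for the 2*(n-1)/n traffic factor of ring all-reduce.
    return algbw * 2 * (world_size - 1) / world_size


def meets_threshold(busbw: float, threshold_gbps: float = 100.0) -> bool:
    return busbw >= threshold_gbps


# e.g. a 1 GiB tensor reduced across 16 ranks in 12 ms:
bw = allreduce_busbw_gbps(1 << 30, 0.012, 16)
print(f"bus bandwidth {bw:.2f} GB/s, pass={meets_threshold(bw)}")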
Overall summary:
GCP H100 with TCPXO
Configmap for preflight:
Job spec submitted:
Init containers added:
Result in 331.5 GB/s:
AWS H100 with EFA:
Job spec submitted:
Init container in pod spec:
gives 346 GB/s:
Azure with A100 (NDv4)
And corresponding topo configmap created in workload namespace:
$ kd cm preflight-nccl-topo -n preflight-test-ns | wc -l
51
Submitted job spec:
Init containers were injected and bus bandwidth was 165.96 GB/s
In platform connector:
{ "time": "2026-02-18T10:43:48.009556013Z", "level": "INFO", "msg": "Successfully got item", "module": "platform-connectors", "version": "v0.7.1", "healthEvents": { "version": 1, "events": [ { "version": 1, "agent": "preflight-nccl-allreduce", "componentClass": "Node", "checkName": "NCCLAllReduceTest", "isHealthy": true, "message": "NCCL all-reduce bus bandwidth 165.96 GB/s meets threshold 100.00 GB/s", "metadata": { "node.kubernetes.io/instance-type": "Standard_ND96amsr_A100_v4", "nvidia.com/cuda.driver-version.full": "570.148.08", "nvidia.com/cuda.driver-version.major": "570", "nvidia.com/cuda.driver-version.minor": "148", "nvidia.com/cuda.driver-version.revision": "08", "nvidia.com/cuda.runtime-version.full": "12.8", "nvidia.com/cuda.runtime-version.major": "12", "nvidia.com/cuda.runtime-version.minor": "8", "nvidia.com/gpu.product": "NVIDIA-A100-SXM4-80GB", "providerID": "azure:..12", "topology.kubernetes.io/region": "southcentralus", "topology.kubernetes.io/zone": "0" }, "generatedTimestamp": { "seconds": 1771411427, "nanos": 980029000 }, "nodeName": "aks-gpu-xx-vmss00000c", "processingStrategy": 1 } ] } }AWS GB200 MNNVL with DRA support
Config with MNNVL enabled:
Since GB200 has multi-node NVLink and the cluster supports DRA for the IMEX channel, compute-domain is requested as a resource claim under resources in the job spec:
Corresponding resource claim template:
Init containers were injected upon the above submission as usual and bandwidth came out to be ~689 GB/s:
Type of Change
Component(s) Affected
Testing
Checklist