Add NCCL all reduce test in preflight framework #837

Open
XRFXLP wants to merge 18 commits into NVIDIA:main from XRFXLP:xrfxlp/766

Conversation

XRFXLP (Member) commented Feb 16, 2026

Summary

Adds a multi-node NCCL all-reduce as a preflight check. The webhook injects an init container that runs a bandwidth benchmark across all gang members before the workload starts, validating GPU interconnect health across InfiniBand, EFA, TCPXO, and MNNVL fabrics.

Key changes:

  1. preflight-nccl-allreduce container image (PyTorch + torchrun benchmark)
  2. networkFabric Helm selector with data-driven fabric profiles (ib, efa, mnnvl-efa, tcpxo)
  3. DRA resource claim mirroring for GB200 MNNVL/IMEX
  4. extraVolumeMounts for GCP TCPXO plugin injection
  5. Auto-created NCCL topology ConfigMap for Azure IB (ncclTopoShape)

Testing

Overall summary:

• AWS EFA (H100, Libfabric/RDMA): 346 GB/s bus bandwidth, threshold 100 GB/s
• Azure IB (A100, InfiniBand): 165 GB/s bus bandwidth, threshold 100 GB/s
• AWS MNNVL (GB200, Multi-Node NVLink): 689 GB/s bus bandwidth, threshold 100 GB/s
• GCP TCPXO (H100, FasTrak/Ethernet): 331 GB/s bus bandwidth, threshold 100 GB/s
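
All bandwidth figures above are bus bandwidth in the standard nccl-tests sense: the measured algorithm bandwidth (payload bytes divided by the average all-reduce time) scaled by 2(N-1)/N for N ranks. A minimal Python sketch of that conversion follows; the function and variable names are illustrative, not the exact benchmark.py code.

def bus_bandwidth_gbps(size_bytes: int, elapsed_s: float, iters: int, world_size: int) -> float:
    # Average time of one all-reduce over the timed iterations.
    avg_time = elapsed_s / iters
    # Algorithm bandwidth: payload bytes per second, expressed in GB/s.
    algo_bw = size_bytes / avg_time / 1e9
    # Bus bandwidth accounts for ring/tree overhead: each rank moves 2*(N-1)/N of the payload.
    return algo_bw * (2 * (world_size - 1) / world_size)

# Example: a 4 GiB all-reduce over 16 ranks taking 0.5 s for 20 timed iterations.
print(f"{bus_bandwidth_gbps(4 << 30, 0.5, 20, 16):.2f} GB/s")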

GCP H100 with TCPXO

ConfigMap for preflight:

$ kg cm preflight-config -n nvsentinel -o yaml | yq .data
config.yaml: |
  gpuResourceNames:
    - nvidia.com/gpu
  networkResourceNames:
    - nvidia.com/mlnxnics
    - vpc.amazonaws.com/efa
    - networking.gke.io.networks/gpu-nic0
    - networking.gke.io.networks/gpu-nic0.IP
     ....
    - networking.gke.io.networks/gpu-nic7
    - networking.gke.io.networks/gpu-nic7.IP
  dcgm:
    hostengineAddr: "nvidia-dcgm.gpu-operator.svc:5555"
    diagLevel: 2
    connectorSocket: "unix:///var/run/nvsentinel.sock"
    processingStrategy: "EXECUTE_REMEDIATION"
  gangDiscovery:
    annotationKeys:
    - scheduling.k8s.io/group-name
    labelKeys: []
    minCountExpr: podGroup.spec.minMember
    name: volcano
    podGroupGVR:
      group: scheduling.volcano.sh
      resource: podgroups
      version: v1beta1
  gangCoordination:
    enabled: true
    timeout: "10m"
    masterPort: 29500
    configMapMountPath: "/etc/preflight"
    extraVolumeMounts:
      - mountPath: /usr/local/nvidia
        name: nvtcpxo-libraries
        readOnly: true
      - mountPath: /dev/aperture_devices
        name: nvtcpxo-aperture-devices
  initContainers:
    - env:
      - name: BW_THRESHOLD_GBPS
        value: "100"
      - name: MESSAGE_SIZES
        value: 4G
      - name: GANG_TIMEOUT_SECONDS
        value: "600"
      - name: NCCL_DEBUG
        value: INFO
      - name: NCCL_DEBUG_SUBSYS
        value: INIT,NET
      - name: NCCL_SOCKET_IFNAME
        value: eth1,eth2,eth3,eth4,eth5,eth6,eth7,eth8
      - name: NCCL_CROSS_NIC
        value: "0"
      - name: NCCL_NVLS_ENABLE
        value: "0"
      - name: NCCL_FASTRAK_CTRL_DEV
        value: eth0
      - name: NCCL_FASTRAK_IFNAME
        value: eth1,eth2,eth3,eth4,eth5,eth6,eth7,eth8
      - name: NCCL_BUFFSIZE
        value: "8388608"
      - name: NCCL_FASTRAK_NUM_FLOWS
        value: "2"
      - name: NCCL_FASTRAK_USE_SNAP
        value: "1"
      - name: NCCL_FASTRAK_USE_LLCM
        value: "1"
      - name: NCCL_FASTRAK_LLCM_DEVICE_DIRECTORY
        value: /dev/aperture_devices
      - name: NCCL_FASTRAK_ENABLE_CONTROL_CHANNEL
        value: "0"
      - name: NCCL_FASTRAK_ENABLE_HOTPATH_LOGGING
        value: "0"
      - name: NCCL_FASTRAK_PLUGIN_ACCEPT_TIMEOUT_MS
        value: "600000"
      - name: NCCL_NET_GDR_LEVEL
        value: PIX
      - name: NCCL_SHIMNET_GUEST_CONFIG_CHECKER_CONFIG_FILE
        value: /usr/local/nvidia/lib64/a3plus_guest_config.textproto
      - name: NCCL_TUNER_PLUGIN
        value: libnccl-tuner.so
      - name: NCCL_TUNER_CONFIG_PATH
        value: /usr/local/nvidia/lib64/a3plus_tuner_config.textproto
      - name: LD_LIBRARY_PATH
        value: /usr/local/nvidia/lib64:/usr/lib/x86_64-linux-gnu
      - name: NCCL_PROTO
        value: Simple
      - name: NCCL_P2P_NVL_CHUNKSIZE
        value: "1048576"
      - name: NCCL_ALGO
        value: Ring,Tree
      - name: NCCL_MIN_NCHANNELS
        value: "4"
      - name: NCCL_DYNAMIC_CHUNK_SIZE
        value: "524288"
      - name: NCCL_P2P_NET_CHUNKSIZE
        value: "524288"
      - name: NCCL_P2P_PCI_CHUNKSIZE
        value: "524288"
      image: xrfxlp/preflight-nccl-allreduce:12211802
      name: preflight-nccl-allreduce
      securityContext:
        capabilities:
          add:
          - IPC_LOCK
      volumeMounts:
      - mountPath: /var/run
        name: nvsentinel-socket

Job spec submitted:

kubectl apply -f - <<EOF
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: preflight-test-2
  namespace: preflight-test-ns
spec:
  minAvailable: 2
  schedulerName: volcano
  tasks:
    - replicas: 2
      name: worker
      template:
        spec:
          restartPolicy: Never
          tolerations:
            - key: dedicated
              operator: Equal
              value: user-workload
              effect: NoExecute
            - key: dedicated
              operator: Equal
              value: user-workload
              effect: NoSchedule
          containers:
            - name: worker
              image: busybox
              command: ["sleep", "300"]
              resources:
                limits:
                  nvidia.com/gpu: 8
                  networking.gke.io.networks/gpu-nic0: "1"
                  ...
                  networking.gke.io.networks/gpu-nic7: "1"
                requests:
                  nvidia.com/gpu: 8
                  networking.gke.io.networks/gpu-nic0: "1"
                  ...
                  networking.gke.io.networks/gpu-nic7: "1"
EOF

Init containers added:

$ kgp -n preflight-test-ns preflight-test-2-worker-1 -o yaml | yq .spec.initContainers
- args:
    - |2

      set -ex
      chmod 755 /fts/entrypoint_rxdm_container.sh
      /fts/entrypoint_rxdm_container.sh --num_hops=2 --num_nics=8 --uid= --alsologtostderr --enforce_kernel_ipv6_support=false
  command:
    - /bin/sh
    - -c
  env:
    - name: LD_LIBRARY_PATH
      value: /usr/local/nvidia/lib64
  image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/tcpgpudmarxd-dev:v1.0.8
  imagePullPolicy: Always
  name: tcpxo-daemon
  resources: {}
  restartPolicy: Always
  securityContext:
    capabilities:
      add:
        - CAP_NET_ADMIN
        - CAP_NET_BIND_SERVICE
  terminationMessagePath: /dev/termination-log
  terminationMessagePolicy: File
  volumeMounts:
    - mountPath: /usr/local/nvidia
      name: nvtcpxo-libraries
      readOnly: true
    - mountPath: /hostsysfs
      name: nvtcpxo-sys
    - mountPath: /hostprocsysfs
      name: nvtcpxo-proc-sys
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-6kwx8
      readOnly: true
- env:
    - name: BW_THRESHOLD_GBPS
      value: "100"
.......(same as above)
    - name: NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
    - name: PLATFORM_CONNECTOR_SOCKET
      value: unix:///var/run/nvsentinel.sock
    - name: PROCESSING_STRATEGY
      value: EXECUTE_REMEDIATION
    - name: GANG_ID
      value: volcano-preflight-test-ns-preflight-test-2-bc8ffe65-017c-42f5-a4fa-dc9903980d7a
    - name: GANG_CONFIG_DIR
      value: /etc/preflight
    - name: MASTER_PORT
      value: "29500"
    - name: POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: POD_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.podIP
  image: xrfxlp/preflight-nccl-allreduce:12211802
  imagePullPolicy: IfNotPresent
  name: preflight-nccl-allreduce
  resources:
    limits:
      networking.gke.io.networks/gpu-nic0: "1"
       ...
      networking.gke.io.networks/gpu-nic7: "1"
      nvidia.com/gpu: "8"
    requests:
      cpu: 100m
      memory: 500Mi
      networking.gke.io.networks/gpu-nic0: "1"
       ...
      networking.gke.io.networks/gpu-nic7: "1"
      nvidia.com/gpu: "8"
  securityContext:
    capabilities:
      add:
        - IPC_LOCK
  terminationMessagePath: /dev/termination-log
  terminationMessagePolicy: File
  volumeMounts:
    - mountPath: /var/run
      name: nvsentinel-socket
    - mountPath: /etc/preflight
      name: nvsentinel-preflight-gang-config
      readOnly: true
    - mountPath: /dev/shm
      name: dshm
    - mountPath: /usr/local/nvidia
      name: nvtcpxo-libraries
      readOnly: true
    - mountPath: /dev/aperture_devices
      name: nvtcpxo-aperture-devices
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-6kwx8
      readOnly: true

Result: 331.5 GB/s bus bandwidth:

$ klog -n preflight-test-ns preflight-test-2-worker-0 -c preflight-nccl-allreduce | grep "Sending health event" | jq
{
  "event": "Sending health event",
  "level": "info",
  "is_healthy": true,
  "is_fatal": false,
  "error_code": null,
  "recommended_action": "NONE",
  "event_message": "NCCL all-reduce bus bandwidth 331.50 GB/s meets threshold 100.00 GB/s",
  "timestamp": "2026-02-18T08:17:28.610839Z",
  "module": "preflight-nccl-allreduce",
  "version": "0.1.0"
}

AWS H100 with EFA:

$ helm template preflight -n nvsentinel . --set networkFabric=efa --show-only templates/configmap.yaml | yq '.data["config.yaml"]' | yq .initContainers
- env:
    - name: BW_THRESHOLD_GBPS
      value: "100"
    - name: MESSAGE_SIZES
      value: 4G
    - name: GANG_TIMEOUT_SECONDS
      value: "600"
    - name: NCCL_DEBUG
      value: INFO
    - name: NCCL_DEBUG_SUBSYS
      value: INIT,NET
    - name: FI_EFA_USE_DEVICE_RDMA
      value: "1"
    - name: FI_PROVIDER
      value: efa
    - name: NCCL_SOCKET_IFNAME
      value: eth0
    - name: LD_LIBRARY_PATH
      value: /opt/amazon/ofi-nccl/lib:/opt/amazon/efa/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
  image: xrfxlp/preflight-nccl-allreduce:14381802
  name: preflight-nccl-allreduce
  securityContext:
    capabilities:
      add:
        - IPC_LOCK
  volumeMounts:
    - mountPath: /var/run
      name: nvsentinel-socket

Job spec submitted:

kubectl apply -f - <<EOF
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: preflight-test-2
  namespace: preflight-test-ns
spec:
  minAvailable: 2
  schedulerName: volcano
  tasks:
    - replicas: 2
      name: worker
      template:
        spec:
          restartPolicy: Never
          tolerations:
            - key: dedicated
              operator: Equal
              value: user-workload
              effect: NoExecute
            - key: dedicated
              operator: Equal
              value: user-workload
              effect: NoSchedule
          containers:
            - name: worker
              image: busybox
              command: ["sleep", "300"]
              resources:
                limits:
                  nvidia.com/gpu: 8
                  vpc.amazonaws.com/efa: 32
                requests:
                  nvidia.com/gpu: 8
                  vpc.amazonaws.com/efa: 32
EOF

Init container in pod spec:

$ kgp preflight-test-2-worker-0 -n preflight-test-ns -o yaml | yq .spec.initContainers
- env:
    - name: BW_THRESHOLD_GBPS
      value: "100"
    - name: MESSAGE_SIZES
      value: 4G
    - name: GANG_TIMEOUT_SECONDS
      value: "600"
    - name: NCCL_DEBUG
      value: INFO
    - name: NCCL_DEBUG_SUBSYS
      value: INIT,NET
    - name: FI_EFA_USE_DEVICE_RDMA
      value: "1"
    - name: FI_PROVIDER
      value: efa
    - name: NCCL_SOCKET_IFNAME
      value: eth0
    - name: LD_LIBRARY_PATH
      value: /opt/amazon/ofi-nccl/lib:/opt/amazon/efa/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
    - name: NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
    - name: PLATFORM_CONNECTOR_SOCKET
      value: unix:///var/run/nvsentinel.sock
    - name: PROCESSING_STRATEGY
      value: EXECUTE_REMEDIATION
    - name: GANG_ID
      value: volcano-preflight-test-ns-preflight-test-2-51e32b88-dc32-4efa-b391-e3347607af55
    - name: GANG_CONFIG_DIR
      value: /etc/preflight
    - name: MASTER_PORT
      value: "29500"
    - name: POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: POD_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.podIP
  image: xrfxlp/preflight-nccl-allreduce:14381802
  imagePullPolicy: IfNotPresent
  name: preflight-nccl-allreduce
  resources:
    limits:
      nvidia.com/gpu: "8"
      vpc.amazonaws.com/efa: "32"
    requests:
      cpu: 100m
      memory: 500Mi
      nvidia.com/gpu: "8"
      vpc.amazonaws.com/efa: "32"
  securityContext:
    capabilities:
      add:
        - IPC_LOCK
  terminationMessagePath: /dev/termination-log
  terminationMessagePolicy: File
  volumeMounts:
    - mountPath: /var/run
      name: nvsentinel-socket
    - mountPath: /etc/preflight
      name: nvsentinel-preflight-gang-config
      readOnly: true
    - mountPath: /dev/shm
      name: dshm
    - mountPath: /opt/amazon
      name: host-opt-amazon
      readOnly: true
    - mountPath: /usr/lib/x86_64-linux-gnu/libefa.so.1
      name: host-libefa
      readOnly: true
    - mountPath: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
      name: host-libibverbs
      readOnly: true
    - mountPath: /usr/lib/x86_64-linux-gnu/libibverbs
      name: host-libibverbs-dir
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-wlnl4
      readOnly: true

This gives 346 GB/s bus bandwidth:

$ klog -n preflight-test-ns preflight-test-2-worker-0 -c preflight-nccl-allreduce | grep "Sending health event" | jq
{
  "event": "Sending health event",
  "level": "info",
  "is_healthy": true,
  "is_fatal": false,
  "error_code": null,
  "recommended_action": "NONE",
  "event_message": "NCCL all-reduce bus bandwidth 346.36 GB/s meets threshold 100.00 GB/s",
  "timestamp": "2026-02-18T09:30:07.705494Z",
  "module": "preflight-nccl-allreduce",
  "version": "0.1.0"
}

Azure A100 with InfiniBand (NDv4)

$ helm template preflight -n nvsentinel . --set networkFabric=ib --set gangCoordination.ncclTopoShape=ndv4 --show-only templates/configmap.yaml | yq '.data["config.yaml"]' | yq .gangCoordination.ncclTopoData | head
<system version="1">
  <cpu numaid="0" affinity="0000ffff,0000ffff" arch="x86_64" vendor="AuthenticAMD" familyid="23" modelid="49">
    <pci busid="ffff:ff:01.0" class="0x060400" link_speed="16 GT/s" link_width="16">
      <pci busid="0003:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
      <pci busid="0103:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
      <pci busid="0004:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
      <pci busid="0104:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
    </pci>
  </cpu>
  <cpu numaid="1" affinity="0000ffff,0000ffff" arch="x86_64" vendor="AuthenticAMD" familyid="23" modelid="49">

And the corresponding topology ConfigMap is created in the workload namespace:

$ kd cm preflight-nccl-topo -n preflight-test-ns | wc -l
51

Submitted job spec:

kubectl apply -f - <<EOF
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: preflight-test-2
  namespace: preflight-test-ns
spec:
  minAvailable: 2
  schedulerName: volcano
  tasks:
    - replicas: 2
      name: worker
      template:
        spec:
          restartPolicy: Never
          tolerations:
            - key: dedicated
              operator: Equal
              value: user-workload
              effect: NoExecute
            - key: dedicated
              operator: Equal
              value: user-workload
              effect: NoSchedule
          containers:
            - name: worker
              image: busybox
              command: ["sleep", "300"]
              resources:
                limits:
                  nvidia.com/gpu: 8
                  nvidia.com/mlnxnics: 8
                requests:
                  nvidia.com/gpu: 8
                  nvidia.com/mlnxnics: 8
EOF

Init containers were injected and the bus bandwidth was 165.96 GB/s:

$ klog preflight-test-2-worker-0 -n preflight-test-ns -c preflight-nccl-allreduce | grep "Sending health event" | jq
{
  "event": "Sending health event",
  "level": "info",
  "is_healthy": true,
  "is_fatal": false,
  "error_code": null,
  "recommended_action": "NONE",
  "event_message": "NCCL all-reduce bus bandwidth 165.96 GB/s meets threshold 100.00 GB/s",
  "timestamp": "2026-02-18T10:43:47.981051Z",
  "module": "preflight-nccl-allreduce",
  "version": "0.1.0"
}

In the platform connector:

{
  "time": "2026-02-18T10:43:48.009556013Z",
  "level": "INFO",
  "msg": "Successfully got item",
  "module": "platform-connectors",
  "version": "v0.7.1",
  "healthEvents": {
    "version": 1,
    "events": [
      {
        "version": 1,
        "agent": "preflight-nccl-allreduce",
        "componentClass": "Node",
        "checkName": "NCCLAllReduceTest",
        "isHealthy": true,
        "message": "NCCL all-reduce bus bandwidth 165.96 GB/s meets threshold 100.00 GB/s",
        "metadata": {
          "node.kubernetes.io/instance-type": "Standard_ND96amsr_A100_v4",
          "nvidia.com/cuda.driver-version.full": "570.148.08",
          "nvidia.com/cuda.driver-version.major": "570",
          "nvidia.com/cuda.driver-version.minor": "148",
          "nvidia.com/cuda.driver-version.revision": "08",
          "nvidia.com/cuda.runtime-version.full": "12.8",
          "nvidia.com/cuda.runtime-version.major": "12",
          "nvidia.com/cuda.runtime-version.minor": "8",
          "nvidia.com/gpu.product": "NVIDIA-A100-SXM4-80GB",
          "providerID": "azure:..12",
          "topology.kubernetes.io/region": "southcentralus",
          "topology.kubernetes.io/zone": "0"
        },
        "generatedTimestamp": {
          "seconds": 1771411427,
          "nanos": 980029000
        },
        "nodeName": "aks-gpu-xx-vmss00000c",
        "processingStrategy": 1
      }
    ]
  }
}

AWS GB200 MNNVL with DRA support

Config with MNNVL enabled:

$ helm template preflight -n nvsentinel . --set networkFabric=mnnvl-efa --show-only templates/configmap.yaml | yq '.data["config.yaml"]' | yq .initContainers
- env:
    - name: BW_THRESHOLD_GBPS
      value: "100"
    - name: MESSAGE_SIZES
      value: 4G
    - name: GANG_TIMEOUT_SECONDS
      value: "600"
    - name: NCCL_DEBUG
      value: INFO
    - name: NCCL_DEBUG_SUBSYS
      value: INIT,NET
    - name: FI_EFA_USE_DEVICE_RDMA
      value: "1"
    - name: NCCL_MNNVL_ENABLE
      value: "1"
    - name: NCCL_NVLS_ENABLE
      value: "0"
    - name: NCCL_NET_GDR_LEVEL
      value: PHB
    - name: NCCL_P2P_NET_CHUNKSIZE
      value: "2097152"
    - name: LD_LIBRARY_PATH
      value: /opt/amazon-efa-ofi/ofi/lib:/opt/amazon-efa-ofi/efa/lib:/usr/local/nvidia/lib64:/usr/lib:/usr/lib64:/usr/local/cuda/lib64
.......

Since GB200 has multi-node NVLink and the cluster supports DRA for the IMEX channel, a compute-domain channel is requested as a resource claim under resources in the job spec:

kubectl create -f - <<'EOF'
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: preflight-test-2
  namespace: preflight-test-ns
spec:
  minAvailable: 2
  schedulerName: volcano
  tasks:
    - replicas: 2
      name: worker
      template:
        spec:
          restartPolicy: Never
          tolerations:
            - key: dedicated
              operator: Equal
              value: user-workload
              effect: NoExecute
            - key: dedicated
              operator: Equal
              value: user-workload
              effect: NoSchedule
          resourceClaims:
            - name: compute-domain-channel
              resourceClaimTemplateName: preflight-test-channel
          containers:
            - name: worker
              image: busybox
              command: ["sleep", "300"]
              resources:
                claims:
                  - name: compute-domain-channel
                limits:
                  nvidia.com/gpu: 4
                  vpc.amazonaws.com/efa: 4
                requests:
                  nvidia.com/gpu: 4
                  vpc.amazonaws.com/efa: 4
EOF

Corresponding resource claim template:

apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  creationTimestamp: "2026-02-17T06:09:11Z"
  finalizers:
  - resource.nvidia.com/computeDomain
  labels:
    resource.nvidia.com/computeDomain: b28e52b5-b103-4cf0-a3e3-1fc5da119ce7
    resource.nvidia.com/computeDomainTarget: Workload
  name: preflight-test-channel
  namespace: preflight-test-ns
  resourceVersion: "63522941"
  uid: 5a928419-e5e8-4b1d-83d8-7b7bb9a751c5
spec:
  metadata: {}
  spec:
    devices:
      config:
      - opaque:
          driver: compute-domain.nvidia.com
          parameters:
            allocationMode: Single
            apiVersion: resource.nvidia.com/v1beta1
            domainID: b28e52b5-b103-4cf0-a3e3-1fc5da119ce7
            kind: ComputeDomainChannelConfig
        requests:
        - channel
      requests:
      - exactly:
          allocationMode: ExactCount
          count: 1
          deviceClassName: compute-domain-default-channel.nvidia.com
        name: channel

Init containers were injected as usual for the above submission, and the bus bandwidth came out to ~689 GB/s:

$ klog preflight-test-2-worker-0 -n preflight-test-ns -c preflight-nccl-allreduce | grep "Sending health event" | jq
{
  "event": "Sending health event",
  "level": "info",
  "is_healthy": true,
  "is_fatal": false,
  "error_code": null,
  "recommended_action": "NONE",
  "event_message": "NCCL all-reduce bus bandwidth 689.04 GB/s meets threshold 100.00 GB/s",
  "timestamp": "2026-02-18T11:25:08.365548Z",
  "module": "preflight-nccl-allreduce",
  "version": "0.1.0"
}

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📚 Documentation
  • 🔧 Refactoring
  • 🔨 Build/CI

Component(s) Affected

  • Core Services
  • Documentation/CI
  • Fault Management
  • Health Monitors
  • Janitor
  • Other: ____________

Testing

  • Tests pass locally
  • Manual testing completed
  • No breaking changes (or documented)

Checklist

  • Self-review completed
  • Documentation updated (if needed)
  • Ready for review

Summary by CodeRabbit

  • New Features

    • Added NCCL all-reduce multi-node preflight check with runnable image and entrypoint for cluster validation.
    • Enhanced gang coordination: optional NCCL topology support, merged fabric overlays, and per-fabric mounts/env.
    • New network fabric profiles (InfiniBand, EFA, MNNVL‑EFA, TCPXO) and expanded cloud network resource detection.
    • Improved health event reporting for richer success/failure diagnostics.
  • Tests

    • Unit tests added for config, errors, and health reporting.
  • Chores

    • CI/build workflows updated to build/publish the new preflight check.

Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
coderabbitai bot (Contributor) commented Feb 16, 2026

📝 Walkthrough

Walkthrough

Adds an NCCL all-reduce preflight check: a new Python package and Docker image, Helm chart/fabric overlays and topology support, Go webhook/controller changes to mount topology and extra volumes, protobuf/gRPC health event plumbing, tests, CI matrix updates, and supporting topology data files.

Changes

• Helm chart & templates (distros/kubernetes/nvsentinel/charts/preflight/templates/configmap.yaml, .../values.yaml, .../templates/_helpers.tpl): Render ncclTopoConfigMap/ncclTopoData conditionally, compute fabric defaults, merge and emit extraHostPathMounts/extraVolumeMounts, conditional mirrorResourceClaims, and augment initContainers (deep-copy env for preflight-nccl-allreduce).
• Python: nccl-allreduce runtime & CLI (preflight-checks/nccl-allreduce/Dockerfile, .../pyproject.toml, .../scripts/entrypoint.py, .../__main__.py, .../__init__.py): New two-stage Docker build, project metadata and console entry, gang-aware entrypoint launching torchrun, main runtime logic with versioning, structured logging, and env-driven configuration.
• Python: benchmarking & config (preflight-checks/nccl-allreduce/nccl_allreduce/benchmark.py, .../config.py, .../errors.py): New Benchmark, TestResult/BenchmarkResult dataclasses and parse/format helpers; Config.from_env with validation; NCCLError definitions and exit-code mapping.
• Python: gang, health, logging, protos (.../gang.py, .../health.py, .../logger.py, protos/*): GangConfigReader/GangWaiter for ConfigMap-based gang discovery; HealthReporter sending HealthEvents over a unix socket with retries; structured JSON logger; generated health_event protobuf and gRPC stubs/typing.
• Python: build/test infra (preflight-checks/nccl-allreduce/Makefile, .../tests/*, .../tests/__init__.py): Makefile targets for protos, poetry setup, docker-build/publish; unit tests for config, errors, and health reporter.
• Go webhook & controller (preflight/pkg/config/config.go, preflight/pkg/webhook/injector.go, preflight/pkg/controller/gang_controller.go, preflight/pkg/controller/gang_controller_test.go, preflight/main.go): Added HostPathMount/ExtraVolumeMount types and defaults; injector refactor to collect/append gang volumes, extra hostPath/volume mounts, and pass podResourceClaims to initContainers; GangController now receives cfg and can create the NCCL topology ConfigMap.
• Topology data files (distros/kubernetes/nvsentinel/charts/preflight/files/ndv4-topo.xml, .../ndv5-topo.xml): New static XML topology descriptors for NDv4 and NDv5 instance topologies.
• CI / Build (.github/workflows/*, Makefile): Added nccl-allreduce entries to cleanup, build, lint, and publish matrices; root Makefile protos-generate now invokes nccl-allreduce protos generation.

Sequence Diagram(s)

sequenceDiagram
    participant Entrypoint as Entrypoint (pod)
    participant GangWaiter as GangWaiter
    participant CM as Gang ConfigMap
    participant Torchrun as torchrun
    participant NCCL as NCCL processes
    participant Health as HealthReporter
    participant PC as Platform Connector

    Entrypoint->>GangWaiter: wait(pod_name, timeout)
    GangWaiter->>CM: poll/read gang config
    CM-->>GangWaiter: GangConfig (when formed)
    GangWaiter-->>Entrypoint: GangConfig
    Entrypoint->>Torchrun: exec torchrun with args
    Torchrun->>NCCL: init process group / run benchmark
    alt Success
        NCCL->>Health: send_success(...)
        Health->>PC: HealthEventOccurredV1 (grpc over unix socket, with retries)
        PC-->>Health: ack
    else Failure
        NCCL->>Health: send_failure(...)
        Health->>PC: HealthEventOccurredV1 (fatal)
        PC-->>Health: ack
    end
    NCCL->>NCCL: cleanup
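
The diagram above maps onto the injected init container: the entrypoint blocks on the gang ConfigMap mounted at GANG_CONFIG_DIR, derives the rendezvous endpoint from the rank-0 peer, and execs torchrun toward MASTER_PORT. A minimal Python sketch under those assumptions; the peers file name, its name:ip:rank line layout, and the fixed 8 processes per node are illustrative and not the actual entrypoint.py.

import os
import subprocess
import sys
import time

def wait_for_gang(config_dir: str, timeout_s: int) -> list[str]:
    # Poll the mounted gang ConfigMap until the peer list appears.
    # The "peers" file name and name:ip:rank line format are assumptions for this sketch.
    peers_file = os.path.join(config_dir, "peers")
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if os.path.exists(peers_file):
            with open(peers_file) as f:
                lines = [line.strip() for line in f if line.strip()]
            if lines:
                return lines
        time.sleep(5)
    sys.exit("gang not formed before timeout")

def main() -> None:
    peers = wait_for_gang(
        os.environ.get("GANG_CONFIG_DIR", "/etc/preflight"),
        int(os.environ.get("GANG_TIMEOUT_SECONDS", "600")),
    )
    pod_ip = os.environ["POD_IP"]
    # Assumes this pod's IP appears in exactly one peer line; rank 0 is listed first.
    node_rank = next(i for i, p in enumerate(peers) if pod_ip in p)
    master_addr = peers[0].split(":")[1]
    cmd = [
        "torchrun",
        f"--nnodes={len(peers)}",
        "--nproc_per_node=8",  # or a detected GPU count
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",
        f"--master_port={os.environ.get('MASTER_PORT', '29500')}",
        "-m", "nccl_allreduce",
    ]
    sys.exit(subprocess.call(cmd))

if __name__ == "__main__":
    main()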

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~55 minutes

Poem

🐰 I hopped through XML maps and helm,

envs merged, and mounts at the helm,
gangs found their ranks in the night,
health events sailed off in flight,
all-reduce checks now sparkle and helm.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

• Docstring Coverage ⚠️ Warning: Docstring coverage is 54.65%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

• Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
• Title Check ✅ Passed: The title 'Add NCCL all reduce test in preflight framework' clearly and concisely describes the main change: introducing a new NCCL all-reduce health check as an injected init container in the preflight framework.


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 16

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
preflight/pkg/webhook/injector.go (1)

483-497: ⚠️ Potential issue | 🟠 Major

Update design documentation to reflect the env var rename to POD_NAME and POD_IP.

The env var rename from MY_POD_NAME/MY_POD_IP to POD_NAME/POD_IP is a breaking change for consumers. The preflight-checks code (nccl-allreduce) already expects POD_NAME, but the design document docs/designs/026-preflight-checks.md still references MY_POD_NAME and must be updated to reflect this change. Additionally, verify that the MongoDB Helm chart templates (which reference MY_POD_NAME) are not injected via this webhook or have been updated accordingly.

🤖 Fix all issues with AI agents
In `@distros/kubernetes/nvsentinel/charts/preflight/templates/configmap.yaml`:
- Around line 43-46: The rendered YAML has an extra blank line because the
template uses "{{ toYaml .Values.gangCoordination.extraHostPathMounts }}"
without the left-trim marker; update the template around the extraHostPathMounts
block to use the left-trim form (i.e., change the toYaml invocation for
.Values.gangCoordination.extraHostPathMounts to include "{{-") so it matches the
other toYaml calls and removes the unwanted blank line after the
"extraHostPathMounts:" key.

In `@distros/kubernetes/nvsentinel/charts/preflight/values.yaml`:
- Around line 186-194: The gangDiscovery section is currently populated with
Volcano-specific defaults; change it to be empty by default to match the
documented behavior: replace or clear the gangDiscovery map so it’s an empty map
(e.g., gangDiscovery: {}) and move the Volcano configuration (podGroupGVR, name,
annotationKeys, labelKeys, minCountExpr) into commented example lines so users
can opt-in to Volcano by uncommenting; ensure references to podGroupGVR,
minCountExpr and name remain in the commented example for clarity.
- Around line 129-132: The init container image for the NCCL all-reduce
preflight job is set to a personal registry image (image:
xrfxlp/preflight-nccl-allreduce:11131602) and should be updated to the project’s
official registry; locate the preflight entry with name:
preflight-nccl-allreduce and replace the image value with the canonical image
used by the chart (i.e., the same registry/repository pattern used by the main
preflight image reference above) or a values.yaml variable that points to the
maintained registry so the init container uses the official image.
- Around line 210-219: The values file currently enables an AWS-specific
hostPath under extraHostPathMounts (the entry with name: host-opt-amazon and
hostPath:/opt/amazon, mountPath:/opt/amazon) despite the comment saying "Keep
empty by default"; change this to leave extraHostPathMounts empty by default and
move the AWS example into a commented-out block (or fully comment out the
host-opt-amazon entry) so the default is CSP-agnostic while preserving the
example for users.
- Around line 146-170: The default values file is CSP-specific because the AWS
EFA env vars are uncommented; comment out the AWS EFA entries so the chart
remains provider-agnostic: change the FI_EFA_USE_DEVICE_RDMA and FI_PROVIDER env
entries and the EFA-specific LD_LIBRARY_PATH value to commented lines (same
style as the Azure InfiniBand block) so NCCL/GLOO/TP provider settings are only
enabled when intentionally uncommented.
- Around line 17-20: Replace the personal Docker Hub image and tag in the Helm
values (image.repository and image.tag) so they point to the official registry
and a semantic version: set image.repository to
ghcr.io/nvidia/nvsentinel/preflight (or the sanctioned canonical registry) and
change image.tag from "10581602" to a proper semantic version (e.g., "vX.Y.Z" or
"X.Y.Z") before merging; ensure image.pullPolicy remains appropriate
(IfNotPresent or Always) per release policy.
- Around line 98-128: The initContainers entries for preflight checks were fully
commented out; restore (uncomment) the preflight container definitions for
preflight-dcgm-diag and preflight-nccl-loopback in values.yaml so the preflight
checks are preserved, keeping the existing fields (name, image, volumeMounts,
env/Test env vars like BW_THRESHOLD_GBPS and TEST_SIZE_MB and any resources)
intact, ensure indentation and YAML validity after uncommenting, and verify the
nvsentinel-socket volumeMount and any SKIP_BANDWIDTH_CHECK logic remain correct
so the checks run as intended.

In `@preflight-checks/nccl-allreduce/Dockerfile`:
- Around line 75-81: The current RUN sequence installs the wheel with --no-deps
and separately forces protobuf/grpcio, which prevents pip from resolving a
consistent dependency set and ignores the wheel's pyproject constraints;
instead, run a single pip install that installs the wheel with its dependencies
(remove --no-deps on the /tmp/nvsentinel_nccl_allreduce*.whl install) and ensure
pip applies your constraints by passing --constraint /tmp/constraints.txt in
that same invocation (or use --no-build-isolation if you must respect pyproject
resolution), and remove the separate forced protobuf install so pip can resolve
a compatible protobuf/grpcio when installing the wheel.

In `@preflight-checks/nccl-allreduce/nccl_allreduce/__main__.py`:
- Line 210: The _handle_success function signature is missing a type hint for
the result parameter; update the signature def _handle_success(cfg: Config,
result) -> int to def _handle_success(cfg: Config, result: BenchmarkResult) ->
int, and ensure BenchmarkResult is imported or referenced (e.g., from
nccl_allreduce.types or the module where BenchmarkResult is declared) so the
annotation resolves.

In `@preflight-checks/nccl-allreduce/nccl_allreduce/config.py`:
- Around line 72-80: The env var names used when reading configuration are wrong
and don't match the webhook injector; update the calls that set gang_config_dir
and gang_timeout_seconds to read the webhook's names instead: replace reading
"GANG_CONFIG_DIR" with "GANG_CONFIG_PATH" for the gang_config_dir assignment and
replace reading "GANG_TIMEOUT_SECONDS" with "GANG_TIMEOUT" for the
gang_timeout_seconds assignment (these calls use _parse_float/_parse_int
respectively), so the runtime picks up values injected by injector.go rather
than silently falling back to defaults.

In `@preflight-checks/nccl-allreduce/nccl_allreduce/gang.py`:
- Around line 211-228: In _read_peers, parsing parts[2] with int(...) can raise
ValueError for malformed ranks; wrap the rank parse in a try/except in the loop
that builds peers (the block that sets rank = int(parts[2].strip()) if
len(parts) >= 3 else -1), catch ValueError, log a warning including the
offending line and the invalid rank string (use log.warning with extra={"line":
line, "bad_rank": parts[2].strip()}), and fall back to rank = -1 (or skip the
peer if you prefer consistent behavior) before appending PeerInfo so the error
is handled and the log clearly identifies the problematic peer line.
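
A short Python sketch of the suggested guard; the read_peers/PeerInfo shapes here are hypothetical, inferred from the prompt above rather than taken from gang.py.

from dataclasses import dataclass
import logging

log = logging.getLogger(__name__)

@dataclass
class PeerInfo:  # hypothetical shape; the real class in gang.py may differ
    hostname: str
    ip: str
    rank: int

def read_peers(lines: list[str]) -> list[PeerInfo]:
    peers: list[PeerInfo] = []
    for line in lines:
        parts = line.split(":")  # assumes well-formed host:ip[:rank] lines
        try:
            rank = int(parts[2].strip()) if len(parts) >= 3 else -1
        except ValueError:
            # Keep the peer but surface the offending line and bad rank in logs.
            log.warning("invalid rank in peer line",
                        extra={"line": line, "bad_rank": parts[2].strip()})
            rank = -1
        peers.append(PeerInfo(hostname=parts[0], ip=parts[1], rank=rank))
    return peers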

In
`@preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2_grpc.py`:
- Around line 9-26: The generated stub (health_event_pb2_grpc.py) defines
GRPC_GENERATED_VERSION = '1.75.1' and will raise at import if installed grpcio
is older; update pyproject.toml so the grpcio dependency is bumped to a
constraint that guarantees >=1.75.1 (e.g., change grpcio = "^1.74.0" to grpcio =
"^1.75.1") to match GRPC_GENERATED_VERSION and the existing grpcio-tools =
"1.75.1", then regenerate/update the lockfile and reinstall dependencies so the
environment uses a compatible grpcio.
- Around line 63-66: The generated gRPC file declares GRPC_GENERATED_VERSION =
'1.75.1' and uses grpc.method_handlers_generic_handler, so update the project
dependency to require grpcio >= 1.75.1: edit pyproject.toml to change the grpcio
version specifier from "^1.74.0" to ">=1.75.1" (or a compatible caret range that
starts at 1.75.1) so the runtime check in health_event_pb2_grpc.py succeeds and
you won't get version mismatch errors.

In `@preflight-checks/nccl-allreduce/pyproject.toml`:
- Around line 22-27: The pyproject.toml pins protobuf = ">=6.31.1,<7.0.0" and
grpcio = "^1.74.0" but the Dockerfile currently installs older wheels
(protobuf>=5.x and grpcio>=1.60.x) with --no-deps, causing runtime mismatches;
update the Dockerfile package installation lines that install the prebuilt
wheels so they request protobuf>=6.31.1,<7.0.0 and grpcio>=1.74.0,<2.0.0 (or
otherwise match the pyproject.toml constraints) and ensure the install step
either allows dependency resolution or explicitly installs matching dependency
versions to keep runtime and pyproject constraints consistent.

In `@preflight-checks/nccl-allreduce/scripts/entrypoint.py`:
- Around line 105-112: The code does unguarded int() conversions of environment
variables (GANG_TIMEOUT_SECONDS -> gang_timeout and NPROCS_PER_NODE ->
nprocs_per_node) which will raise raw ValueError on bad input; wrap these
conversions in try/except blocks around int(...) for both the
GANG_TIMEOUT_SECONDS/DEFAULT_GANG_TIMEOUT conversion that sets gang_timeout and
the NPROCS_PER_NODE branch that sets nprocs_per_node (when not using
detect_gpu_count()), validate the parsed value is a positive integer (and
non-zero if required), and on parse/validation failure emit a clear error
message including the env var name and bad value and exit with a non-zero code
(e.g., sys.exit(1)) so failures are reported cleanly.
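
An illustrative guard for those conversions; the helper name and default values are hypothetical, not the actual entrypoint.py code.

import os
import sys

def parse_positive_int_env(name: str, default: int) -> int:
    raw = os.environ.get(name, str(default))
    try:
        value = int(raw)
    except ValueError:
        sys.exit(f"{name} must be an integer, got {raw!r}")
    if value <= 0:
        sys.exit(f"{name} must be a positive integer, got {value}")
    return value

gang_timeout = parse_positive_int_env("GANG_TIMEOUT_SECONDS", 600)
nprocs_per_node = parse_positive_int_env("NPROCS_PER_NODE", 8)  # only when not using detect_gpu_count()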

In `@preflight/pkg/config/config.go`:
- Around line 98-127: The HostPathMount.ReadOnly field is documented to default
to true but setDefaults() doesn't enforce this; update the setDefaults()
function to iterate HostPathMount entries (e.g., where the enclosing config's
ExtraHostPathMounts is handled) and if a HostPathMount has ReadOnly == nil set
it to a pointer to true (create a local variable trueVal := true and assign
&trueVal) so consumers no longer need to treat nil specially; ensure you modify
the same setDefaults() that initializes other defaults for the config containing
ExtraHostPathMounts.
🧹 Nitpick comments (15)
preflight-checks/nccl-allreduce/nccl_allreduce/logger.py (1)

31-31: Import Callable from collections.abc instead of typing.

typing.Callable is deprecated since Python 3.9 in favor of collections.abc.Callable. Since this project targets Python 3.10+, use the preferred import.

♻️ Proposed fix
-from typing import Any, Callable, Final
+from collections.abc import Callable
+from typing import Any, Final
preflight-checks/nccl-allreduce/Makefile (2)

28-29: Consider adding a clean target.

The Makefile defines protos-clean but lacks a top-level clean target. Adding one that delegates to protos-clean (and any future cleanup) improves ergonomics and satisfies the conventional Makefile interface.

♻️ Proposed addition (after `all` target)
 .PHONY: all
 all: lint-test
+
+.PHONY: clean
+clean: protos-clean

62-80: docker-build and docker-publish duplicate the entire build command.

The only difference is $(DOCKER_LOAD_ARG) vs --push. This is a common pattern in Makefiles and acceptable, but if more build args are added later, they'll need updating in two places.

preflight/pkg/config/config.go (1)

186-196: No validation for ExtraHostPathMounts entries.

Missing Name, HostPath, or MountPath will silently produce broken volume/mount specs at webhook injection time. Consider validating required fields and the HostPathType value (if set) within validate().

Proposed validation sketch
 func (c *FileConfig) validate() error {
 	if c.GangCoordination.Enabled {
 		timeout, err := time.ParseDuration(c.GangCoordination.Timeout)
 		if err != nil {
 			return fmt.Errorf("invalid gangCoordination.timeout %q: %w", c.GangCoordination.Timeout, err)
 		}
 		c.GangCoordination.TimeoutDuration = timeout
+
+		for i, m := range c.GangCoordination.ExtraHostPathMounts {
+			if m.Name == "" || m.HostPath == "" || m.MountPath == "" {
+				return fmt.Errorf("gangCoordination.extraHostPathMounts[%d]: name, hostPath, and mountPath are required", i)
+			}
+		}
 	}
 	return nil
 }
preflight/pkg/webhook/injector.go (2)

220-252: NCCL-specific volume mounts are injected into every gang init container.

/dev/shm, the NCCL topo ConfigMap, and all extraHostPathMounts are added to every init container when gangCtx != nil, including non-NCCL containers like preflight-dcgm-diag. While harmless for /dev/shm, mounting arbitrary host paths into containers that don't need them widens the attack surface unnecessarily.

Consider gating these mounts on the container name (e.g., checking for the NCCL allreduce container specifically), similar to how injectDCGMEnv gates on container.Name != "preflight-dcgm-diag".


340-352: No SizeLimit on the RAM-backed emptyDir.

The dshm volume uses StorageMediumMemory without a SizeLimit. On nodes with constrained memory, NCCL could consume all available RAM. Consider setting an explicit size limit, or document that this is intentionally unbounded.

preflight-checks/nccl-allreduce/nccl_allreduce/benchmark.py (3)

238-294: Consider explicitly freeing the GPU tensor between test sizes.

Each call to _run_single allocates a potentially large tensor (e.g., 4 GB for float32). When run() iterates over multiple message sizes, the tensor from the previous iteration relies on Python GC for deallocation. For large sizes, this could cause an OOM before the old tensor is freed.

Proposed fix — explicit cleanup
         # Calculate bandwidth metrics
         avg_time = elapsed / self._iters
         algo_bw = size_bytes / avg_time / 1e9  # GB/s
 
         # Bus bandwidth accounts for ring/tree algorithm overhead
         bus_bw = algo_bw * (2 * (world_size - 1) / world_size)
 
         passed = bus_bw >= self._threshold
 
+        del tensor
+        torch.cuda.empty_cache()
+
         return TestResult(

255-270: Rename unused loop variables per linter.

Proposed fix
-        for i in range(self._warmup):
+        for _ in range(self._warmup):
             dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
-        for i in range(self._iters):
+        for _ in range(self._iters):
             dist.all_reduce(tensor, op=dist.ReduceOp.SUM)

72-95: parse_size doesn't handle bare numeric strings.

If someone passes MESSAGE_SIZES="4294967296" (bytes, no suffix), this raises a ValueError. Consider supporting bare integers as raw byte counts, or document that a suffix is always required.
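
A sketch of a parse_size that accepts both suffixed and bare values; binary units and the accepted suffixes are assumptions and may differ from the actual implementation.

def parse_size(value: str) -> int:
    # Accept "4G", "512M", "64K", or a bare integer byte count (binary units assumed).
    units = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}
    value = value.strip().upper()
    if value and value[-1] in units:
        return int(float(value[:-1]) * units[value[-1]])
    return int(value)

print(parse_size("4G"))           # 4294967296
print(parse_size("4294967296"))   # 4294967296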

preflight-checks/nccl-allreduce/nccl_allreduce/health.py (2)

102-141: Consider populating entitiesImpacted with the node entity.

entitiesImpacted is hardcoded to []. Since the reporter already knows node_name and COMPONENT_CLASS, it could include the affected node entity to give the Platform Connector better correlation context. Not blocking, but worth considering.


183-207: Consider adding jitter and broadening the caught exception type.

Two minor robustness observations on the retry loop:

  1. Only grpc.RpcError is caught. While gRPC generally wraps connection failures in RpcError, transient OS-level errors (e.g., OSError if the socket path is temporarily unavailable during container startup) would bypass retries and propagate as an unhandled exception to _send, which raises RuntimeError only on exhausted retries.

  2. No jitter is applied to the backoff delay. For a single-reporter-per-node preflight this is fine, but adding small random jitter is a low-cost resilience improvement.

Neither is blocking for a WIP PR.
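
A small sketch combining both suggestions (broader exception handling plus jittered backoff); the helper name, retry counts, and delays are illustrative, not the HealthReporter implementation.

import random
import time

import grpc

def send_with_retries(send_once, retries: int = 5, base_delay: float = 1.0) -> None:
    # Retry transient gRPC and socket-level failures with exponential backoff plus jitter.
    for attempt in range(retries):
        try:
            send_once()
            return
        except (grpc.RpcError, OSError) as err:
            if attempt == retries - 1:
                raise RuntimeError("health event send failed after retries") from err
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))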

preflight-checks/nccl-allreduce/nccl_allreduce/__main__.py (4)

123-136: Redundant string check and uncaught ValueError in the error path.

Two items:

  1. "NCCL" in error_str or "nccl" in error_str.lower() — the second condition already covers the first (case-insensitive). Simplify to just "nccl" in error_str.lower().

  2. If _handle_failure is ever called with an error whose error_code is None (e.g., SUCCESS, HEALTH_REPORT_FAILED), HealthReporter.send_failure raises ValueError, which is not caught by the except RuntimeError in _handle_failure (line 272). Currently safe because all call sites pass errors with non-None codes, but this is fragile against future changes.

Proposed simplification for the string check
-        if "NCCL" in error_str or "nccl" in error_str.lower():
+        if "nccl" in error_str.lower():

265-277: except RuntimeError won't catch ValueError from send_failure.

HealthReporter.send_failure raises ValueError when the error type has no error_code (line 91 of health.py). This except RuntimeError block would not catch it. While current callers always pass errors with valid codes, consider catching (RuntimeError, ValueError) or Exception here for defense-in-depth.

Proposed fix
-    except RuntimeError as err:
+    except (RuntimeError, ValueError) as err:

137-141: Minor: prefer contextlib.suppress(Exception) over bare try/except/pass.

This aligns with the Ruff SIM105 hint and is slightly more idiomatic.

Proposed fix
+import contextlib
 ...
     finally:
-        try:
-            dist.destroy_process_group()
-        except Exception:
-            pass  # Ignore cleanup errors
+        with contextlib.suppress(Exception):
+            dist.destroy_process_group()

98-119: All ranks will send health failure events if init_process_group fails.

When dist.init_process_group fails, the rank is unknown, so every torchrun process calls _handle_failure → HealthReporter.send_failure. This can produce N duplicate health events (one per GPU process). This may be acceptable given the constraint, but worth documenting with a comment so future maintainers understand the trade-off.

Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@preflight/pkg/webhook/injector.go`:
- Around line 381-393: The added emptyDir Volume for shared /dev/shm (when
gangCtx != nil and dshmVolumeName is not in existingVolumes) uses
StorageMediumMemory with no SizeLimit; update the VolumeSource.EmptyDir for
dshmVolumeName where volumesToAdd is appended to include a SizeLimit (e.g.,
derive from the Pod's memory request or a configurable cap) using a
resource.Quantity (via resource.ParseQuantity or resource.MustParse) so the
in-RAM emptyDir cannot consume all node memory; ensure the logic references
dshmVolumeName and volumesToAdd and preserves existing behavior when SizeLimit
cannot be determined.
🧹 Nitpick comments (4)
preflight/pkg/config/config.go (1)

130-134: Consider validating HostPathType during config loading.

HostPathType is a free-form string here, validated only at injection time in injector.go (parseHostPathType). Invalid values silently skip the mount with a warning log. Validating at Load() time would surface misconfigurations earlier (startup vs. per-pod admission).

preflight/pkg/webhook/injector.go (2)

259-293: Duplicated ReadOnly defaulting logic — extract a helper.

The nil-means-true pattern for ReadOnly is repeated verbatim for ExtraHostPathMounts (Lines 264-267) and ExtraVolumeMounts (Lines 283-286). Extract a small helper to reduce duplication.

Proposed helper
func isReadOnly(ro *bool) bool {
	if ro == nil {
		return true
	}
	return *ro
}

Then use ReadOnly: isReadOnly(m.ReadOnly) in both loops.


158-176: Add test coverage for IP resource mirroring logic.

This code mirrors base NIC quantities to .IP companion resources to ensure init containers match workload container footprints. The logic is correct—init containers capture the resource state at webhook time, independent of later platform injections. However, there's no test coverage verifying this behavior. Add tests to validate:

  • .IP resources are mirrored when missing from workload containers
  • Existing .IP resources are not overwritten
  • Mirroring only happens for configured network resources
distros/kubernetes/nvsentinel/charts/preflight/values.yaml (1)

243-262: GCP multi-network resources are uncommented — consider commenting out for CSP-agnostic defaults.

All 16 GCP networking.gke.io.networks/gpu-nic* entries are active in the default values. While unused resources won't cause failures (they simply won't match on non-GCP clusters), this clutters the configuration and is inconsistent with the approach of keeping CSP-specific settings commented. Consider commenting these out with an explanatory note, similar to the provider-specific NCCL env vars.

Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

🧹 Nitpick comments (3)
preflight/pkg/webhook/injector.go (1)

126-132: Nit: prefer range over index-based loop.

A range loop is more idiomatic Go here since only the value is needed.

Suggested change
-		for idx := 0; idx < len(initContainers); idx++ {
+		for _, ic := range initContainers {
 			patches = append(patches, PatchOperation{
 				Op:    "add",
 				Path:  "/spec/initContainers/-",
-				Value: initContainers[idx],
+				Value: ic,
 			})
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preflight/pkg/webhook/injector.go` around lines 126 - 132, The loop over
initContainers should use an idiomatic range loop: replace the index-based for
loop that appends PatchOperation entries into patches with a for _, ic := range
initContainers { patches = append(patches, PatchOperation{ Op: "add", Path:
"/spec/initContainers/-", Value: ic, }) } so you iterate values directly; update
the block around initContainers, patches, and PatchOperation in injector.go
accordingly.
distros/kubernetes/nvsentinel/charts/preflight/values.yaml (1)

159-178: Default networkResourceNames includes all CSP-specific resources.

The default values file now lists AWS EFA and all 8 GCP TCPXO NIC resources. While unused resource names are harmless (they won't match if the resources don't exist on the cluster), this makes the defaults CSP-aware rather than agnostic. Consider whether these should be moved to the respective overlay files (values-efa.yaml, values-tcpxo.yaml) and merged via Helm's list-merge behavior, or kept here for simplicity with a comment explaining the rationale.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@distros/kubernetes/nvsentinel/charts/preflight/values.yaml` around lines 159
- 178, The default list networkResourceNames currently contains CSP-specific
entries (AWS EFA and eight GCP TCPXO NICs); update the chart to keep values.yaml
CSP-agnostic by moving those CSP-specific strings into their respective overlay
value files (e.g., values-efa.yaml for "vpc.amazonaws.com/efa" and
values-tcpxo.yaml for the "networking.gke.io.networks/gpu-nicX" entries) and
rely on Helm to merge lists when overlays are applied, or if you choose to keep
them in values.yaml add a clear comment explaining why CSP-specific names are
included; edit the networkResourceNames block in values.yaml and add/merge the
entries into values-efa.yaml and values-tcpxo.yaml respectively.
preflight/pkg/config/config.go (1)

224-234: Consider validating required fields on ExtraHostPathMounts and ExtraVolumeMounts.

HostPathMount requires Name, HostPath, and MountPath to produce a valid Kubernetes volume/mount, and ExtraVolumeMount requires Name and MountPath. A missing field would silently produce a broken pod spec at admission time rather than failing early at config load. A quick validation loop in validate() would surface misconfigurations sooner.

Proposed addition to `validate()`
 func (c *FileConfig) validate() error {
 	if c.GangCoordination.Enabled {
 		timeout, err := time.ParseDuration(c.GangCoordination.Timeout)
 		if err != nil {
 			return fmt.Errorf("invalid gangCoordination.timeout %q: %w", c.GangCoordination.Timeout, err)
 		}
 		c.GangCoordination.TimeoutDuration = timeout
+
+		for i, m := range c.GangCoordination.ExtraHostPathMounts {
+			if m.Name == "" || m.HostPath == "" || m.MountPath == "" {
+				return fmt.Errorf("gangCoordination.extraHostPathMounts[%d]: name, hostPath, and mountPath are required", i)
+			}
+		}
+		for i, m := range c.GangCoordination.ExtraVolumeMounts {
+			if m.Name == "" || m.MountPath == "" {
+				return fmt.Errorf("gangCoordination.extraVolumeMounts[%d]: name and mountPath are required", i)
+			}
+		}
 	}
 
 	return nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preflight/pkg/config/config.go` around lines 224 - 234, The validate() method
on FileConfig should check that each entry in ExtraHostPathMounts has non-empty
Name, HostPath, and MountPath and that each entry in ExtraVolumeMounts has
non-empty Name and MountPath; update FileConfig.validate() to loop over
c.ExtraHostPathMounts and c.ExtraVolumeMounts, validate required fields for each
HostPathMount and VolumeMount (reporting the slice index and which field(s) are
missing), and return a descriptive error (e.g., "invalid ExtraHostPathMounts[2]:
missing HostPath") so misconfigurations fail fast instead of producing broken
pod specs.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@distros/kubernetes/nvsentinel/charts/preflight/values.yaml`:
- Around line 183-191: The gangDiscovery block is incorrectly prefilled with
Volcano-specific values; change the default to an empty configuration that uses
the Kubernetes native WorkloadRef API by clearing or removing the
Volcano-specific fields (e.g., remove or empty name, annotationKeys, labelKeys,
podGroupGVR and minCountExpr under gangDiscovery) and move the Volcano
configuration into commented example text instead; update the gangDiscovery
stanza so it defaults to an empty object (or blank fields) and leave the
existing Volcano values as a commented example for reference.

In `@preflight/pkg/webhook/injector.go`:
- Around line 266-269: The validation for ExtraHostPathMounts is inconsistent
between buildInitContainers and injectVolumes causing potential missing volumes;
update both places to require the same fields by checking that m.Name,
m.HostPath and m.MountPath are all non-empty (or factor a single helper like
validateHostPathMount(m) and call it from buildInitContainers and injectVolumes)
so entries with a MountPath but missing HostPath are rejected consistently
before injection.

---

Duplicate comments:
In `@distros/kubernetes/nvsentinel/charts/preflight/values-efa.yaml`:
- Line 13: The image value currently points to a personal Docker Hub image
(image: xrfxlp/preflight-nccl-allreduce:10241702) and must be replaced before
merge; update the image entry used by the chart to the official project registry
(e.g., ghcr.io/nvidia/nvsentinel/preflight-nccl-allreduce) and set an
appropriate stable version tag (not a personal/dev tag), ensuring the change is
made to the image key in values-efa.yaml so Helm uses the official artifact.

In `@distros/kubernetes/nvsentinel/charts/preflight/values-ib.yaml`:
- Line 13: The values file contains a personal image reference
"xrfxlp/preflight-nccl-allreduce:10241702" under the image key; replace this
personal registry image with the approved repository/tag (or a chart/global
image placeholder) so the chart uses the official image (e.g., set image:
<official-registry>/preflight-nccl-allreduce:<version> or image: {{
.Values.global.preflightImage }}), ensuring no personal registry strings remain.

In `@distros/kubernetes/nvsentinel/charts/preflight/values-mnnvl-efa.yaml`:
- Line 19: The image field currently references a personal registry image
"xrfxlp/preflight-nccl-allreduce:10241702"; replace that value in the values
YAML (the image key) with the approved organization/official image or a standard
placeholder (e.g., <ORGANIZATION_REGISTRY>/<IMAGE>:<TAG> or the upstream image
name) before merging so no personal images remain in the chart configuration.

In `@distros/kubernetes/nvsentinel/charts/preflight/values-tcpxo.yaml`:
- Line 21: The image field currently points to a personal registry ("image:
xrfxlp/preflight-nccl-allreduce:10241702"); replace that value with the
approved/central registry image tag used by other overlays (or the chart's
templated image value), e.g., set the image key to the canonical image reference
used across overlays (or to a values template like {{ .Values.image.repository
}}:{{ .Values.image.tag }}) so no personal registry remains before merge.

In `@distros/kubernetes/nvsentinel/charts/preflight/values.yaml`:
- Around line 98-128: The initContainers block has the preflight checks
commented out; restore the preflight-dcgm-diag and preflight-nccl-loopback
containers by uncommenting their definitions under initContainers, ensuring the
container names (preflight-dcgm-diag, preflight-nccl-loopback), their image
fields, volumeMounts (nvsentinel-socket -> /var/run) and env entries
(BW_THRESHOLD_GBPS, TEST_SIZE_MB and optional SKIP_BANDWIDTH_CHECK) are present
and valid; verify the preflight-nccl-loopback env values (BW_THRESHOLD_GBPS
value "150", TEST_SIZE_MB "256") and keep optional SKIP_BANDWIDTH_CHECK
commented or set per default behavior before committing.
- Around line 134-135: The initContainer entry using name
"preflight-nccl-allreduce" currently references a personal registry image
"xrfxlp/preflight-nccl-allreduce:10241702"; replace that hardcoded personal
image with the approved/public image or a chart value reference (e.g., use a
values variable like preflightNcclAllreduce.image or the official registry/tag)
so the chart uses a configurable/official image instead of the personal registry
before merging.
- Around line 17-20: The values.yaml currently points to a personal
registry/image ("image.repository: xrfxlp/preflight" and tag "11251702") which
must be replaced before merge; update image.repository to the official
registry/image and set image.tag to the correct release tag (or use a values
variable like image.tag: "stable" or image.pullPolicy as appropriate) so that
the chart references the sanctioned image—change the entries under image
(repository and tag) in the preflight chart values.yaml to the approved
production values.

In `@preflight/pkg/config/config.go`:
- Around line 125-157: The documented default of ReadOnly being true for
HostPathMount and ExtraVolumeMount isn’t applied in setDefaults(); update the
setDefaults() implementation to set ReadOnly to a true pointer when it is nil
for both HostPathMount and ExtraVolumeMount (e.g., if m.ReadOnly == nil { v :=
true; m.ReadOnly = &v }), or use an existing helper to create a *bool(true), so
consumers no longer need to treat nil as true and the documented behavior is
enforced by the config types.
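
A minimal sketch of that defaulting logic, assuming ReadOnly is a *bool on both mount types:

// Fragment for FileConfig.setDefaults(): treat a nil ReadOnly as true so the
// documented default is enforced by the config layer, not by each consumer.
func (c *FileConfig) setDefaults() {
	for i := range c.ExtraHostPathMounts {
		if c.ExtraHostPathMounts[i].ReadOnly == nil {
			v := true
			c.ExtraHostPathMounts[i].ReadOnly = &v
		}
	}
	for i := range c.ExtraVolumeMounts {
		if c.ExtraVolumeMounts[i].ReadOnly == nil {
			v := true
			c.ExtraVolumeMounts[i].ReadOnly = &v
		}
	}
}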

In `@preflight/pkg/webhook/injector.go`:
- Around line 399-411: The emptyDir for /dev/shm created in the injector when
gangCtx != nil (see dshmVolumeName, volumesToAdd, existingVolumes and
corev1.EmptyDirVolumeSource) lacks a SizeLimit and thus can exhaust node RAM;
update the EmptyDirVolumeSource to set a SizeLimit (e.g., use a configured
constant or pod annotation value) when Medium is StorageMediumMemory so the
memory-backed emptyDir is capped, ensure the value is applied where the volume
is appended and validate/parse the configured size before use.
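
A sketch of the capped volume, assuming the variable names above and an import of k8s.io/apimachinery/pkg/api/resource (the 1Gi value is illustrative and should ideally come from config):

// Cap the memory-backed /dev/shm emptyDir so it cannot exhaust node RAM.
sizeLimit := resource.MustParse("1Gi")
volumesToAdd = append(volumesToAdd, corev1.Volume{
	Name: dshmVolumeName,
	VolumeSource: corev1.VolumeSource{
		EmptyDir: &corev1.EmptyDirVolumeSource{
			Medium:    corev1.StorageMediumMemory,
			SizeLimit: &sizeLimit,
		},
	},
})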

---

Nitpick comments:
In `@distros/kubernetes/nvsentinel/charts/preflight/values.yaml`:
- Around line 159-178: The default list networkResourceNames currently contains
CSP-specific entries (AWS EFA and eight GCP TCPXO NICs); update the chart to
keep values.yaml CSP-agnostic by moving those CSP-specific strings into their
respective overlay value files (e.g., values-efa.yaml for
"vpc.amazonaws.com/efa" and values-tcpxo.yaml for the
"networking.gke.io.networks/gpu-nicX" entries) and rely on Helm to merge lists
when overlays are applied, or if you choose to keep them in values.yaml add a
clear comment explaining why CSP-specific names are included; edit the
networkResourceNames block in values.yaml and add/merge the entries into
values-efa.yaml and values-tcpxo.yaml respectively.

In `@preflight/pkg/config/config.go`:
- Around line 224-234: The validate() method on FileConfig should check that
each entry in ExtraHostPathMounts has non-empty Name, HostPath, and MountPath
and that each entry in ExtraVolumeMounts has non-empty Name and MountPath;
update FileConfig.validate() to loop over c.ExtraHostPathMounts and
c.ExtraVolumeMounts, validate required fields for each HostPathMount and
VolumeMount (reporting the slice index and which field(s) are missing), and
return a descriptive error (e.g., "invalid ExtraHostPathMounts[2]: missing
HostPath") so misconfigurations fail fast instead of producing broken pod specs.

In `@preflight/pkg/webhook/injector.go`:
- Around line 126-132: The loop over initContainers should use an idiomatic
range loop: replace the index-based for loop that appends PatchOperation entries
into patches with a for _, ic := range initContainers { patches =
append(patches, PatchOperation{ Op: "add", Path: "/spec/initContainers/-",
Value: ic, }) } so you iterate values directly; update the block around
initContainers, patches, and PatchOperation in injector.go accordingly.

Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

🧹 Nitpick comments (3)
preflight/pkg/webhook/injector.go (1)

559-603: Gang env injection logs at Info level per container — consider Debug.

slog.Info at line 565 fires for every init container that gets gang env vars. In a cluster with many gang pods, this could be noisy. The gang membership is already logged at Info level at lines 91-96. This per-container log would be better at Debug.

Proposed fix
-	slog.Info("Injecting gang environment variables", "gangID", gangCtx.GangID, "configMap", gangCtx.ConfigMapName)
+	slog.Debug("Injecting gang environment variables", "gangID", gangCtx.GangID, "configMap", gangCtx.ConfigMapName)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preflight/pkg/webhook/injector.go` around lines 559 - 603, Change the
per-container log in injectGangEnv from Info to Debug to reduce noise: update
the slog.Info call inside Injector.injectGangEnv (the one logging "Injecting
gang environment variables" with keys gangID and configMap) to use slog.Debug
(preserving the same message and structured fields); leave higher-level gang
membership logs (earlier Info logs around GangContext creation) unchanged and
keep mergeEnvVars and GangContext usage intact.
distros/kubernetes/nvsentinel/charts/preflight/values.yaml (2)

109-109: Nit: stray double-comment # #.

Line 109 has # # NCCL loopback test — likely a leftover from toggling the comment. Should be # NCCL loopback test for consistency with the other section comments (lines 98, 129).

Proposed fix
-  # # NCCL loopback test - validates intra-node GPU-to-GPU communication (NVLink/PCIe)
+  # NCCL loopback test - validates intra-node GPU-to-GPU communication (NVLink/PCIe)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@distros/kubernetes/nvsentinel/charts/preflight/values.yaml` at line 109, The
comment has a stray double-comment marker; update the comment string `# # NCCL
loopback test` in values.yaml to a single `# NCCL loopback test` so it matches
the style of nearby section headings (e.g., the other comment lines around the
NCCL section); locate the offending comment text and replace the double `# #`
with a single `#`.

155-178: Default networkResourceNames includes all GCP TCPXO NIC resources — intentional but notable.

The 16 GCP networking.gke.io.networks/gpu-nicX entries (8 base + 8 .IP) are included in the default values. While this is needed for mirrorIPCompanionResources to function and is harmless on non-GCP clusters (resources simply won't match), it makes the defaults somewhat CSP-aware.

Consider adding a brief inline comment grouping them, similar to the # GCP multi-network resources (TCPXO) on line 162, to help users understand these can be pruned on non-GCP deployments.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@distros/kubernetes/nvsentinel/charts/preflight/values.yaml` around lines 155
- 178, The default networkResourceNames list includes many GCP TCPXO entries
which are CSP-specific; update the values.yaml by adding a concise inline
grouping comment above the block of networking.gke.io entries (e.g., "// GCP
multi-network resources (TCPXO) - can be pruned on non-GCP clusters") so users
know these 16 gpu-nicX and gpu-nicX.IP entries are GCP-specific and only
required for mirrorIPCompanionResources behavior; edit the networkResourceNames
block and place the comment immediately before the first networking.gke.io entry
to clearly group them.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@distros/kubernetes/nvsentinel/charts/preflight/files/ndv4-topo.xml`:
- Around line 2-33: The cpu elements in the ndv4-topo.xml diff all have the same
affinity attribute ("affinity=\"0000ffff,0000ffff\""), which is incorrect for a
4-NUMA-node system; update each <cpu numaid="0">, <cpu numaid="1">, <cpu
numaid="2"> and <cpu numaid="3"> to use distinct CPU affinity masks that
partition the 96 vCPUs across the four NUMA nodes (obtain masks from the actual
host via /sys/devices/system/node/node*/cpumap or numactl -H), and replace the
duplicated affinity values with the correct per-node masks so NCCL can detect
locality properly.

In `@distros/kubernetes/nvsentinel/charts/preflight/templates/configmap.yaml`:
- Around line 47-50: The template emits an extra blank line because the toYaml
call for extraVolumeMounts isn't left-trimmed; update the template block
handling extraVolumeMounts so the toYaml invocation uses the left-trim marker
(i.e., change the `{{ toYaml .Values.gangCoordination.extraVolumeMounts |
nindent 8 }}` invocation to use `{{- toYaml ... }}`) so the rendered YAML has no
blank line between the `extraVolumeMounts:` key and its indented list items.
- Around line 51-53: The template currently uses `ne (toString
.Values.gangCoordination.mirrorResourceClaims) ""`, which turns nil into the
string "<nil>" and renders invalid YAML; update the conditional to properly
detect unset/nil values by using Helm/Sprig's kind check, e.g. replace the `if`
with a guard like `if and (not (kindIs "invalid"
.Values.gangCoordination.mirrorResourceClaims)) (ne
.Values.gangCoordination.mirrorResourceClaims "")` (or at minimum `if not
(kindIs "invalid" .Values.gangCoordination.mirrorResourceClaims)`), so
`.Values.gangCoordination.mirrorResourceClaims` is only rendered when present
and non-empty.
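
A hedged template sketch of the nil-safe guard (indentation must match the surrounding configmap.yaml block):

{{- if not (kindIs "invalid" .Values.gangCoordination.mirrorResourceClaims) }}
    mirrorResourceClaims: {{ .Values.gangCoordination.mirrorResourceClaims }}
{{- end }}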

In `@distros/kubernetes/nvsentinel/charts/preflight/values-mnnvl-efa.yaml`:
- Around line 62-69: The gangCoordination block is missing mirrorResourceClaims:
true so DRA ResourceClaims for GB200/MNNVL won't be mirrored; update the
gangCoordination configuration (the same section that contains
extraHostPathMounts) to add mirrorResourceClaims: true so init containers
receive IMEX channel claims and cross-node NVLink works correctly. Ensure the
key is added at the same nesting level as extraHostPathMounts under
gangCoordination and verify any template/renderer that defaults nil to false
will pick up this explicit true value.
- Around line 28-61: The overlay file replaces the entire initContainers list
which removes the base entries (preflight-dcgm-diag and
preflight-nccl-loopback); either (A) add a brief comment above the
initContainers block stating that this overlay intentionally replaces all
initContainers in the base (so MNNVL environments only run
preflight-nccl-allreduce), or (B) if you want to preserve the base checks,
duplicate the base initContainers entries (preflight-dcgm-diag and
preflight-nccl-loopback) into this file alongside preflight-nccl-allreduce so
all three run when merged; update the file near the initContainers /
preflight-nccl-allreduce section accordingly.

---

Duplicate comments:
In `@preflight/pkg/webhook/injector.go`:
- Around line 309-321: appendExtraHostPathMounts and collectExtraHostPathVolumes
validate different fields causing mounts without volumes; update
appendExtraHostPathMounts to perform the same validation as
collectExtraHostPathVolumes (i.e., skip entries when Name or HostPath is empty)
or centralize validation into a shared helper (e.g., validateExtraHostPathEntry)
and call it from both appendExtraHostPathMounts and collectExtraHostPathVolumes
so only entries with both Name and HostPath are used.
- Around line 453-465: The dshm emptyDir is created without a SizeLimit (risking
node RAM exhaustion); update the creation in injector.go where dshmVolumeName is
added to volumes (the block that builds corev1.Volume with
corev1.EmptyDirVolumeSource) to set a safe EmptyDir SizeLimit (e.g.,
resource.MustParse like "1Gi" or a configurable value) on the
EmptyDirVolumeSource.SizeLimit field, ensuring you import k8s resource types and
use a sensible default or config value rather than leaving it nil.

---

Nitpick comments:
In `@distros/kubernetes/nvsentinel/charts/preflight/values.yaml`:
- Line 109: The comment has a stray double-comment marker; update the comment
string `# # NCCL loopback test` in values.yaml to a single `# NCCL loopback
test` so it matches the style of nearby section headings (e.g., the other
comment lines around the NCCL section); locate the offending comment text and
replace the double `# #` with a single `#`.
- Around line 155-178: The default networkResourceNames list includes many GCP
TCPXO entries which are CSP-specific; update the values.yaml by adding a concise
inline grouping comment above the block of networking.gke.io entries (e.g., "#
GCP multi-network resources (TCPXO) - can be pruned on non-GCP clusters") so
users know these 16 gpu-nicX and gpu-nicX.IP entries are GCP-specific and only
required for mirrorIPCompanionResources behavior; edit the networkResourceNames
block and place the comment immediately before the first networking.gke.io entry
to clearly group them.

In `@preflight/pkg/webhook/injector.go`:
- Around line 559-603: Change the per-container log in injectGangEnv from Info
to Debug to reduce noise: update the slog.Info call inside
Injector.injectGangEnv (the one logging "Injecting gang environment variables"
with keys gangID and configMap) to use slog.Debug (preserving the same message
and structured fields); leave higher-level gang membership logs (earlier Info
logs around GangContext creation) unchanged and keep mergeEnvVars and
GangContext usage intact.

Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (12)
.github/workflows/lint-test.yml (1)

139-156: No test artifact upload for preflight checks — intentional?

Unlike health-monitors-lint-test and modules-lint-test, the preflight-checks-lint-test job does not upload coverage/test artifacts. This means preflight coverage won't appear in the consolidated coverage report. This is pre-existing for dcgm-diag and nccl-loopback as well, but worth noting as nccl-allreduce is added.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/lint-test.yml around lines 139 - 156, The
preflight-checks-lint-test job is missing the artifact upload step so
coverage/test artifacts from make -C preflight-checks/${{ matrix.component }}
lint-test aren’t collected; add a step after the "Run lint and test" step that
mirrors the artifact upload used in health-monitors-lint-test/modules-lint-test
(use the same upload action and artifact name pattern) so artifacts for each
matrix.component (dcgm-diag, nccl-allreduce, nccl-loopback) are uploaded and
included in the consolidated coverage report.
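
A hedged sketch of such a step, assuming actions/upload-artifact is acceptable and that coverage output lands under each component directory (mirror the exact action version and artifact naming used by the other jobs):

      - name: Upload test artifacts
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: preflight-checks-${{ matrix.component }}-coverage
          path: preflight-checks/${{ matrix.component }}/coverage*
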
preflight-checks/nccl-allreduce/scripts/entrypoint.py (1)

51-52: sys.path manipulation is fragile.

sys.path.insert(0, ...) is a common workaround for scripts, but consider using the -m flag or a proper package entry point (already defined in pyproject.toml) to avoid path hacking.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preflight-checks/nccl-allreduce/scripts/entrypoint.py` around lines 51 - 52,
The script currently mutates sys.path via sys.path.insert(0,
os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) which is fragile;
remove that manipulation and instead rely on running the package as a module or
the package entry point declared in pyproject.toml (use python -m <package> or
configure the entry point) so imports resolve normally; update entrypoint/launch
instructions and the script's imports (keep the existing if __name__ ==
"__main__" guard) so the file can be executed as a proper package module rather
than patching sys.path.
preflight-checks/nccl-allreduce/nccl_allreduce/__main__.py (2)

130-142: NCCL error detection relies on string matching — consider documenting this heuristic.

Checking for "NCCL" or "nccl" in the error message is pragmatic but fragile. If PyTorch changes error message wording, legitimate NCCL errors could be re-raised as unhandled RuntimeErrors. A brief inline comment noting this trade-off would help future maintainers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preflight-checks/nccl-allreduce/nccl_allreduce/__main__.py` around lines 130
- 142, In the RuntimeError except block that checks for "NCCL" or "nccl" (the
branch that calls log.error and returns _handle_failure with
NCCLError.ALLREDUCE_TIMEOUT), add a concise inline comment explaining that
detection is a heuristic based on string matching of the error message, that it
may be brittle if upstream error wording changes, and that maintainers should
consider more robust detection (e.g., specific exception types or error codes)
in the future.

143-147: contextlib.suppress is more idiomatic for ignored cleanup errors.

Proposed fix
+import contextlib
 ...
     finally:
-        try:
-            dist.destroy_process_group()
-        except Exception:
-            pass  # Ignore cleanup errors
+        with contextlib.suppress(Exception):
+            dist.destroy_process_group()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preflight-checks/nccl-allreduce/nccl_allreduce/__main__.py` around lines 143
- 147, Replace the try/except cleanup with contextlib.suppress to make ignored
cleanup errors more idiomatic: wrap the call to dist.destroy_process_group()
inside a contextlib.suppress(Exception) block in the same finally section, and
add an import for contextlib if it's not already imported; refer to the finally
block containing dist.destroy_process_group() to locate where to apply this
change.
preflight-checks/nccl-allreduce/nccl_allreduce/gang.py (1)

203-206: Dead except FileNotFoundError: _read_string already handles it via the default.

Since _read_string(KEY_PEERS, default="") internally catches FileNotFoundError and returns "", the outer try/except FileNotFoundError on lines 203–206 is unreachable.

Simplify
-        try:
-            data = self._read_string(KEY_PEERS, default="")
-        except FileNotFoundError:
-            return []
+        data = self._read_string(KEY_PEERS, default="")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preflight-checks/nccl-allreduce/nccl_allreduce/gang.py` around lines 203 -
206, The try/except around the call to _read_string(KEY_PEERS, default="") is
dead because _read_string already catches FileNotFoundError and returns the
default; remove the outer try/except and simply assign data =
self._read_string(KEY_PEERS, default="") and proceed (update any following logic
that expected early return to handle the empty string as before). Ensure you
only modify the block using _read_string and KEY_PEERS in gang.py so behavior
remains identical but without the unreachable FileNotFoundError handler.
preflight-checks/nccl-allreduce/nccl_allreduce/health.py (1)

168-202: New gRPC channel created on every retry attempt.

Each retry iteration opens and closes a fresh grpc.insecure_channel. For a Unix domain socket this is cheap, but you could hoist the channel outside the loop and only reconnect on failure if you want to reduce overhead.

Optional: reuse channel across retries
     def _send_with_retries(self, health_events: pb.HealthEvents) -> bool:
         delay = INITIAL_DELAY
 
-        for attempt in range(MAX_RETRIES):
-            try:
-                with grpc.insecure_channel(f"unix://{self._socket_path}") as channel:
+        with grpc.insecure_channel(f"unix://{self._socket_path}") as channel:
+            for attempt in range(MAX_RETRIES):
+                try:
                     stub = pb_grpc.PlatformConnectorStub(channel)
                     stub.HealthEventOccurredV1(
                         health_events,
                         timeout=RPC_TIMEOUT,
                     )
                     log.info("Health event sent successfully")
                     return True
-            except grpc.RpcError as err:
-                log.warning(
-                    "Failed to send health event",
-                    extra={
-                        "attempt": attempt + 1,
-                        "max_retries": MAX_RETRIES,
-                        "error": str(err),
-                    },
-                )
-                if attempt < MAX_RETRIES - 1:
-                    sleep(delay)
-                    delay *= BACKOFF_FACTOR
+                except grpc.RpcError as err:
+                    log.warning(
+                        "Failed to send health event",
+                        extra={
+                            "attempt": attempt + 1,
+                            "max_retries": MAX_RETRIES,
+                            "error": str(err),
+                        },
+                    )
+                    if attempt < MAX_RETRIES - 1:
+                        sleep(delay)
+                        delay *= BACKOFF_FACTOR
 
         return False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preflight-checks/nccl-allreduce/nccl_allreduce/health.py` around lines 168 -
202, The code in _send_with_retries creates a new grpc.insecure_channel on every
retry; hoist channel creation outside the for-loop to reuse the same channel
(created with grpc.insecure_channel(f"unix://{self._socket_path}")), create the
pb_grpc.PlatformConnectorStub once, and only recreate the channel/stub if a
grpc.RpcError indicates the channel is unusable (i.e., on failure before the
next retry). Ensure you still apply the same backoff/sleep logic and that the
function returns True on successful stub.HealthEventOccurredV1 call and False
after MAX_RETRIES.
preflight/pkg/webhook/injector.go (3)

125-131: Prefer range over index-based loop.

The index variable idx is only used to subscript initContainers. A range loop is more idiomatic Go.

Suggested refactor
-		for idx := 0; idx < len(initContainers); idx++ {
+		for _, c := range initContainers {
 			patches = append(patches, PatchOperation{
 				Op:    "add",
 				Path:  "/spec/initContainers/-",
-				Value: initContainers[idx],
+				Value: c,
 			})
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preflight/pkg/webhook/injector.go` around lines 125 - 131, The loop adding
PatchOperation entries uses an index-only for-loop; replace it with a range over
initContainers to be more idiomatic and avoid explicit indexing — iterate like
`for _, c := range initContainers` and append PatchOperation with Value set to
the loop variable (referencing initContainers, patches, PatchOperation) so
behavior is unchanged but code is cleaner.

15-15: Missing package-level godoc for the webhook package.

Per coding guidelines, package-level godoc is required for all Go packages. Consider adding a // Package webhook ... comment before the package webhook declaration.

As per coding guidelines: "Package-level godoc required for all Go packages".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preflight/pkg/webhook/injector.go` at line 15, Add a package-level godoc
comment immediately above the "package webhook" declaration describing the
purpose and usage of the webhook package (e.g., what it provides, important
behaviors or invariants). Ensure the comment starts with "Package webhook" to
satisfy Go tooling and lint rules and keep it short, one or two sentences
summarizing the package API and intent so linters accept it.

193-199: Consider using boolDefault for mirrorClaims consistency.

The boolDefault helper (Line 595) was introduced for exactly this pattern. Using it here would be consistent with its usage in appendExtraHostPathMounts and appendExtraVolumeMounts.

Suggested refactor
-	mirrorClaims := i.cfg.GangCoordination.MirrorResourceClaims != nil &&
-		*i.cfg.GangCoordination.MirrorResourceClaims
+	mirrorClaims := boolDefault(i.cfg.GangCoordination.MirrorResourceClaims, false)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preflight/pkg/webhook/injector.go` around lines 193 - 199, Replace the manual
nil-check assignment for mirrorClaims with the existing boolDefault helper to
match other usages; specifically, in injector.go where mirrorClaims is computed
from i.cfg.GangCoordination.MirrorResourceClaims, call
boolDefault(i.cfg.GangCoordination.MirrorResourceClaims) (or the appropriate
boolDefault signature) instead of the current nil-and-dereference logic so it is
consistent with how appendExtraHostPathMounts and appendExtraVolumeMounts obtain
their defaults.
distros/kubernetes/nvsentinel/charts/preflight/values.yaml (2)

172-189: All CSP network resources in the default allowlist — confirm this is benign when absent.

The webhook copies only resources that actually exist in the pod spec, so listing all supported resource names should be harmless on clusters that don't use them. However, adding a brief comment confirming this behavior would help operators who wonder why GCP-specific resources appear in their non-GCP deployment.

As per coding guidelines: "Document all values in Helm chart values.yaml with inline comments."

Suggested comment addition
 # Network resource names to copy to init containers (RDMA, InfiniBand, etc.)
+# List all supported network resource types here; unused resources are ignored.
+# The webhook only copies resources that exist in the pod spec.
 networkResourceNames:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@distros/kubernetes/nvsentinel/charts/preflight/values.yaml` around lines 172
- 189, Add an inline comment above the "GCP multi-network resources (TCPXO)"
block in values.yaml clarifying that the default allowlist intentionally lists
all CSP-specific network resource names (e.g.,
"networking.gke.io.networks/gpu-nic0", "networking.gke.io.networks/gpu-nic0.IP",
etc.) and that the webhook only copies resources that actually exist in a pod
spec, so these GCP-specific entries are harmless on non-GCP clusters; update the
comment near the existing header to be concise and follow the chart's
inline-commenting style.

214-216: Minor: ncclTopoConfigMap default is an empty string — consider using a comment-only example instead.

An empty string "" passes the truthiness check in the template ({{- if .Values.gangCoordination.ncclTopoConfigMap }}), so this works correctly. However, using a commented-out example (like mirrorResourceClaims) would be more consistent with the other optional fields in this section.

As per coding guidelines: "Include examples for non-obvious configurations in Helm chart documentation."

Suggested change
   # NCCL topology ConfigMap name - required for Azure NDv4/v5 InfiniBand
   # Without this, NCCL cannot map GPUs to IB NICs and falls back to TCP
-  ncclTopoConfigMap: ""
+  # ncclTopoConfigMap: "nccl-topo-ndv4"  # Change to match your VM shape
+  ncclTopoConfigMap: ""
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@distros/kubernetes/nvsentinel/charts/preflight/values.yaml` around lines 214
- 216, The ncclTopoConfigMap default is currently an empty string which can be
confusing; replace it with a commented example to match other optional fields
and Helm docs (e.g., comment out a sample: # ncclTopoConfigMap:
"my-nccl-topo-configmap") so maintainers see a usage example while keeping the
field effectively unset for the gangCoordination template check (referencing
ncclTopoConfigMap and the template condition {{- if
.Values.gangCoordination.ncclTopoConfigMap }}).
distros/kubernetes/nvsentinel/charts/preflight/values-ib.yaml (1)

22-29: Consider adding a BW_THRESHOLD_GBPS override and brief inline comments for each env var.

The base default is 100 GB/s. If IB clusters are expected to achieve significantly higher bandwidth (the PR notes 165 GB/s on Azure), you may want an explicit threshold here. Also, brief inline comments on each env var would align with the project's documentation guidelines, e.g., what NCCL_IB_PCI_RELAXED_ORDERING controls and why eth0 is chosen for NCCL_SOCKET_IFNAME.

As per coding guidelines: "Document all values in Helm chart values.yaml with inline comments."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@distros/kubernetes/nvsentinel/charts/preflight/values-ib.yaml` around lines
22 - 29, Add a BW_THRESHOLD_GBPS override into the ncclAllreduceExtraEnv block
and add brief inline comments for each env var; specifically, add an env entry
named BW_THRESHOLD_GBPS (default "100") to allow overriding the 100 GB/s
baseline, and add single-line comments for NCCL_TOPO_FILE (path to topology
XML), NCCL_IB_PCI_RELAXED_ORDERING (controls PCI relaxed ordering for IB
performance), and NCCL_SOCKET_IFNAME (interface used for NCCL sockets, e.g.,
"eth0" rationale), so that ncclAllreduceExtraEnv and its new BW_THRESHOLD_GBPS
entry and the existing NCCL_* variables are documented per chart guidelines.
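
A hedged example of how the documented block could read (the env var names come from the overlay; the values and the topology path are placeholders):

ncclAllreduceExtraEnv:
  # Minimum acceptable bus bandwidth in GB/s; raise above the 100 GB/s base
  # default if the IB fabric is expected to sustain more.
  - name: BW_THRESHOLD_GBPS
    value: "100"
  # Path to the NCCL topology XML (placeholder; must match the mounted ConfigMap).
  - name: NCCL_TOPO_FILE
    value: /etc/nccl/topo.xml
  # Controls PCI relaxed ordering for InfiniBand traffic.
  - name: NCCL_IB_PCI_RELAXED_ORDERING
    value: "1"
  # Interface used for NCCL bootstrap/control sockets.
  - name: NCCL_SOCKET_IFNAME
    value: eth0
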
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@distros/kubernetes/nvsentinel/charts/preflight/values.yaml`:
- Around line 228-233: The Helm values comment is incorrect: the field
MirrorResourceClaims actually defaults to true unconditionally when omitted (per
the config initialization), not only when gangCoordination is enabled. Update
the comment for mirrorResourceClaims to state that it defaults to true when not
set (i.e., omitted) and remove the conditional wording referring to
gangCoordination so it matches the config.go behavior for MirrorResourceClaims.

In `@preflight-checks/nccl-allreduce/scripts/entrypoint.py`:
- Line 147: Add missing type annotations: annotate _wait_for_gang to declare its
return type (use -> GangWaiter) and annotate the parameter(s) of
_launch_torchrun (e.g., cfg: _EntrypointConfig) and its return type if missing
(-> None or appropriate type). Also add the suggested import "from
nccl_allreduce.gang import GangConfig, GangWaiter" so the new type names
resolve; update the function signatures for _wait_for_gang and _launch_torchrun
accordingly.
- Around line 130-131: The environment variable names used to populate
gang_config_dir and gang_timeout are mismatched with the Go webhook injector;
update the reads in entrypoint.py (variables gang_config_dir and gang_timeout)
to use GANG_CONFIG_PATH and GANG_TIMEOUT respectively (instead of
GANG_CONFIG_DIR and GANG_TIMEOUT_SECONDS), preserving the existing default
fallbacks and integer parsing for gang_timeout so behavior remains the same.
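
A minimal sketch of the corrected reads in entrypoint.py, keeping the existing defaults (the constant names are assumptions; assumes os is imported):

# Read the names the webhook injector actually sets.
gang_config_dir = os.getenv("GANG_CONFIG_PATH", DEFAULT_GANG_CONFIG_DIR)
gang_timeout = int(os.getenv("GANG_TIMEOUT", str(DEFAULT_GANG_TIMEOUT)))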

---

Duplicate comments:
In `@distros/kubernetes/nvsentinel/charts/preflight/values-mnnvl-efa.yaml`:
- Around line 45-52: The gangCoordination block is missing mirrorResourceClaims:
true, so DRA ResourceClaims (IMEX channel) won't be mirrored into init
containers due to the configmap.yaml kindIs "invalid" guard that omits nils;
update the values file by adding mirrorResourceClaims: true under
gangCoordination (next to extraHostPathMounts) so the template emits the setting
and init containers receive the GB200 MNNVL DRA claims.

In `@preflight-checks/nccl-allreduce/nccl_allreduce/__main__.py`:
- Line 218: The function _handle_success currently has an untyped parameter
result; add the type hint "result: BenchmarkResult" to the signature and update
imports to bring BenchmarkResult into scope (e.g., from its defining module
where other benchmark types live) so the annotation resolves; ensure you update
any existing import block (or add a new import) rather than using a local or
string annotation.

In `@preflight-checks/nccl-allreduce/nccl_allreduce/config.py`:
- Around line 72-84: The env var names used to populate gang_config_dir and
gang_timeout_seconds do not match the webhook injector's names; update the env
var keys so gang_config_dir reads from "GANG_CONFIG_PATH" (falling back to
DEFAULT_GANG_CONFIG_DIR) and gang_timeout_seconds reads from "GANG_TIMEOUT"
(using _parse_int with DEFAULT_GANG_TIMEOUT_SECONDS) and search/replace any
other usages of "GANG_CONFIG_DIR" and "GANG_TIMEOUT_SECONDS" in this module to
"GANG_CONFIG_PATH" and "GANG_TIMEOUT" respectively so the webhook-provided
values are honored.

In `@preflight-checks/nccl-allreduce/nccl_allreduce/gang.py`:
- Around line 222-224: When parsing the pod entry in the gang parsing logic (the
parts -> pod_name, pod_ip, rank logic), guard the numeric conversion so a
malformed rank doesn't produce an opaque ValueError; specifically wrap the
int(parts[2].strip()) conversion in a try/except ValueError, and on failure
raise a new ValueError (or include in the existing exception) that includes the
full offending line content and the parsed parts (e.g., show pod_name, pod_ip,
and the raw rank token) so GangWaiter.wait can log which line caused the parse
failure; reference the variables parts, pod_name, pod_ip, rank and the
GangWaiter.wait context when making this change.
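
A sketch of the guarded conversion (variable names are taken from the comment; the surrounding loop and the line variable are assumed):

try:
    rank = int(parts[2].strip())
except ValueError as err:
    # Surface the offending line so GangWaiter.wait can log exactly what failed.
    raise ValueError(
        f"malformed gang peer entry {line!r}: "
        f"pod_name={parts[0]!r} pod_ip={parts[1]!r} rank={parts[2]!r}"
    ) from err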

In `@preflight-checks/nccl-allreduce/scripts/entrypoint.py`:
- Around line 131-137: The int() conversions for GANG_TIMEOUT_SECONDS (assigned
to gang_timeout) and NPROCS_PER_NODE (assigned to nprocs_per_node) are unguarded
and should be wrapped in try/except ValueError; validate the env values before
converting, catch ValueError, emit a clear error message naming the env var and
its invalid value, and exit with a non‑zero code (e.g., sys.exit(1)) instead of
allowing a raw traceback; for NPROCS_PER_NODE, if the env exists but is invalid
treat it as an error and exit (do not silently fall back to
_detect_gpu_count()), otherwise when absent keep the current behavior of calling
_detect_gpu_count().
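
A sketch of the guarded parsing, using the renamed GANG_TIMEOUT variable from the earlier comment (defaults and helper names such as DEFAULT_GANG_TIMEOUT and _detect_gpu_count are assumptions; assumes os and sys are imported):

def _parse_positive_int(name: str, raw: str) -> int:
    # Exit with a clear message naming the env var instead of a raw traceback.
    try:
        value = int(raw)
    except ValueError:
        print(f"ERROR: {name} must be an integer, got {raw!r}", file=sys.stderr)
        sys.exit(1)
    if value <= 0:
        print(f"ERROR: {name} must be positive, got {value}", file=sys.stderr)
        sys.exit(1)
    return value

gang_timeout = _parse_positive_int("GANG_TIMEOUT", os.getenv("GANG_TIMEOUT", str(DEFAULT_GANG_TIMEOUT)))
raw_nprocs = os.getenv("NPROCS_PER_NODE")
# An invalid value is an error; only fall back to GPU detection when unset.
nprocs_per_node = _parse_positive_int("NPROCS_PER_NODE", raw_nprocs) if raw_nprocs is not None else _detect_gpu_count()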

In `@preflight/pkg/webhook/injector.go`:
- Around line 284-296: The appendExtraHostPathMounts function currently only
checks Name and MountPath while collectExtraHostPathVolumes checks Name and
HostPath, causing mounts without volumes; create a shared validator (e.g.,
isValidExtraHostPathMount or validateExtraHostPathEntry) that ensures the triple
Name, HostPath and MountPath are all non-empty and use that validator in both
appendExtraHostPathMounts and collectExtraHostPathVolumes (replace the existing
per-field checks) so entries are consistently skipped when any of Name, HostPath
or MountPath is missing.
- Around line 430-438: The dshm emptyDir is created with Medium: Memory but no
SizeLimit, risking node RAM exhaustion; update the code that appends the
corev1.Volume for dshm (look for dshmVolumeName, existingVolumes and volumes in
injector.go) to set EmptyDir.SizeLimit to a safe default or configurable value:
parse a quantity (k8s.io/apimachinery/pkg/api/resource) and assign a pointer to
EmptyDirVolumeSource.SizeLimit, and expose the limit as a config field or
env/flag (e.g., dshmSizeLimit) so callers can override it; ensure imports are
added and the SizeLimit is used when constructing the Volume.

---

Nitpick comments:
In @.github/workflows/lint-test.yml:
- Around line 139-156: The preflight-checks-lint-test job is missing the
artifact upload step so coverage/test artifacts from make -C
preflight-checks/${{ matrix.component }} lint-test aren’t collected; add a step
after the "Run lint and test" step that mirrors the artifact upload used in
health-monitors-lint-test/modules-lint-test (use the same upload action and
artifact name pattern) so artifacts for each matrix.component (dcgm-diag,
nccl-allreduce, nccl-loopback) are uploaded and included in the consolidated
coverage report.

In `@distros/kubernetes/nvsentinel/charts/preflight/values-ib.yaml`:
- Around line 22-29: Add a BW_THRESHOLD_GBPS override into the
ncclAllreduceExtraEnv block and add brief inline comments for each env var;
specifically, add an env entry named BW_THRESHOLD_GBPS (default "100") to allow
overriding the 100 GB/s baseline, and add single-line comments for
NCCL_TOPO_FILE (path to topology XML), NCCL_IB_PCI_RELAXED_ORDERING (controls
PCI relaxed ordering for IB performance), and NCCL_SOCKET_IFNAME (interface used
for NCCL sockets, e.g., "eth0" rationale), so that ncclAllreduceExtraEnv and its
new BW_THRESHOLD_GBPS entry and the existing NCCL_* variables are documented per
chart guidelines.

In `@distros/kubernetes/nvsentinel/charts/preflight/values.yaml`:
- Around line 172-189: Add an inline comment above the "GCP multi-network
resources (TCPXO)" block in values.yaml clarifying that the default allowlist
intentionally lists all CSP-specific network resource names (e.g.,
"networking.gke.io.networks/gpu-nic0", "networking.gke.io.networks/gpu-nic0.IP",
etc.) and that the webhook only copies resources that actually exist in a pod
spec, so these GCP-specific entries are harmless on non-GCP clusters; update the
comment near the existing header to be concise and follow the chart's
inline-commenting style.
- Around line 214-216: The ncclTopoConfigMap default is currently an empty
string which can be confusing; replace it with a commented example to match
other optional fields and Helm docs (e.g., comment out a sample: #
ncclTopoConfigMap: "my-nccl-topo-configmap") so maintainers see a usage example
while keeping the field effectively unset for the gangCoordination template
check (referencing ncclTopoConfigMap and the template condition {{- if
.Values.gangCoordination.ncclTopoConfigMap }}).

In `@preflight-checks/nccl-allreduce/nccl_allreduce/__main__.py`:
- Around line 130-142: In the RuntimeError except block that checks for "NCCL"
or "nccl" (the branch that calls log.error and returns _handle_failure with
NCCLError.ALLREDUCE_TIMEOUT), add a concise inline comment explaining that
detection is a heuristic based on string matching of the error message, that it
may be brittle if upstream error wording changes, and that maintainers should
consider more robust detection (e.g., specific exception types or error codes)
in the future.
- Around line 143-147: Replace the try/except cleanup with contextlib.suppress
to make ignored cleanup errors more idiomatic: wrap the call to
dist.destroy_process_group() inside a contextlib.suppress(Exception) block in
the same finally section, and add an import for contextlib if it's not already
imported; refer to the finally block containing dist.destroy_process_group() to
locate where to apply this change.

In `@preflight-checks/nccl-allreduce/nccl_allreduce/gang.py`:
- Around line 203-206: The try/except around the call to _read_string(KEY_PEERS,
default="") is dead because _read_string already catches FileNotFoundError and
returns the default; remove the outer try/except and simply assign data =
self._read_string(KEY_PEERS, default="") and proceed (update any following logic
that expected early return to handle the empty string as before). Ensure you
only modify the block using _read_string and KEY_PEERS in gang.py so behavior
remains identical but without the unreachable FileNotFoundError handler.

In `@preflight-checks/nccl-allreduce/nccl_allreduce/health.py`:
- Around line 168-202: The code in _send_with_retries creates a new
grpc.insecure_channel on every retry; hoist channel creation outside the
for-loop to reuse the same channel (created with
grpc.insecure_channel(f"unix://{self._socket_path}")), create the
pb_grpc.PlatformConnectorStub once, and only recreate the channel/stub if a
grpc.RpcError indicates the channel is unusable (i.e., on failure before the
next retry). Ensure you still apply the same backoff/sleep logic and that the
function returns True on successful stub.HealthEventOccurredV1 call and False
after MAX_RETRIES.

In `@preflight-checks/nccl-allreduce/scripts/entrypoint.py`:
- Around line 51-52: The script currently mutates sys.path via
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
which is fragile; remove that manipulation and instead rely on running the
package as a module or the package entry point declared in pyproject.toml (use
python -m <package> or configure the entry point) so imports resolve normally;
update entrypoint/launch instructions and the script's imports (keep the
existing if __name__ == "__main__" guard) so the file can be executed as a
proper package module rather than patching sys.path.

In `@preflight/pkg/webhook/injector.go`:
- Around line 125-131: The loop adding PatchOperation entries uses an index-only
for-loop; replace it with a range over initContainers to be more idiomatic and
avoid explicit indexing — iterate like `for _, c := range initContainers` and
append PatchOperation with Value set to the loop variable (referencing
initContainers, patches, PatchOperation) so behavior is unchanged but code is
cleaner.
- Line 15: Add a package-level godoc comment immediately above the "package
webhook" declaration describing the purpose and usage of the webhook package
(e.g., what it provides, important behaviors or invariants). Ensure the comment
starts with "Package webhook" to satisfy Go tooling and lint rules and keep it
short, one or two sentences summarizing the package API and intent so linters
accept it.
- Around line 193-199: Replace the manual nil-check assignment for mirrorClaims
with the existing boolDefault helper to match other usages; specifically, in
injector.go where mirrorClaims is computed from
i.cfg.GangCoordination.MirrorResourceClaims, call
boolDefault(i.cfg.GangCoordination.MirrorResourceClaims) (or the appropriate
boolDefault signature) instead of the current nil-and-dereference logic so it is
consistent with how appendExtraHostPathMounts and appendExtraVolumeMounts obtain
their defaults.

Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (4)
preflight-checks/nccl-allreduce/nccl_allreduce/tests/test_config.py (1)

36-49: Consider combining nested with statements.

Static analysis flagged these nested with statements. While functionally correct, combining them improves readability.

♻️ Proposed fix
     def test_rejects_non_numeric(self) -> None:
-        with patch.dict(os.environ, {"TEST_VAR": "abc"}):
-            with pytest.raises(ValueError, match="Invalid TEST_VAR"):
-                _parse_float("TEST_VAR", 0.0)
+        with patch.dict(os.environ, {"TEST_VAR": "abc"}), pytest.raises(ValueError, match="Invalid TEST_VAR"):
+            _parse_float("TEST_VAR", 0.0)

     def test_rejects_zero(self) -> None:
-        with patch.dict(os.environ, {"TEST_VAR": "0"}):
-            with pytest.raises(ValueError, match="must be positive"):
-                _parse_float("TEST_VAR", 1.0)
+        with patch.dict(os.environ, {"TEST_VAR": "0"}), pytest.raises(ValueError, match="must be positive"):
+            _parse_float("TEST_VAR", 1.0)

     def test_rejects_negative(self) -> None:
-        with patch.dict(os.environ, {"TEST_VAR": "-5.0"}):
-            with pytest.raises(ValueError, match="must be positive"):
-                _parse_float("TEST_VAR", 1.0)
+        with patch.dict(os.environ, {"TEST_VAR": "-5.0"}), pytest.raises(ValueError, match="must be positive"):
+            _parse_float("TEST_VAR", 1.0)

The same pattern applies to the TestParseInt and TestConfigFromEnv classes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preflight-checks/nccl-allreduce/nccl_allreduce/tests/test_config.py` around
lines 36 - 49, Combine the nested context managers into a single with-statement
in the test methods (e.g., in test_rejects_non_numeric, test_rejects_zero,
test_rejects_negative) by merging patch.dict(...) and pytest.raises(...) into
one line to improve readability; apply the same change to the analogous methods
in TestParseInt and TestConfigFromEnv so each test uses a single with that
contains both the environment patch and the expected exception context while
keeping the same assertions that call _parse_float or the corresponding
parse_int/config helpers.
distros/kubernetes/nvsentinel/charts/preflight/values.yaml (1)

178-195: Consider grouping GCP TCPXO resources under a comment block.

The expanded networkResourceNames list is quite long. While the comment at line 179 helps, consider adding a closing comment or blank line after the GCP resources to improve visual separation:

   - "networking.gke.io.networks/gpu-nic7"
   - "networking.gke.io.networks/gpu-nic7.IP"
+  # End GCP TCPXO resources

This is a minor readability suggestion.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@distros/kubernetes/nvsentinel/charts/preflight/values.yaml` around lines 178
- 195, The GCP TCPXO entries in the networkResourceNames list are long and need
clearer separation: group the GCP entries (the keys starting with
"networking.gke.io.networks/gpu-nicX" and
"networking.gke.io.networks/gpu-nicX.IP") under a single comment header like "#
GCP multi-network resources (TCPXO)" and add a closing blank line or a trailing
comment after that block to visually separate it from the rest of
networkResourceNames; update the values.yaml list so the group is clearly
delimited while preserving all existing entries and their exact strings.
preflight-checks/nccl-allreduce/nccl_allreduce/__main__.py (2)

183-193: Consider logging benchmark failure on non-rank-0 processes before returning.

When benchmark.run() fails on a non-rank-0 process, it logs the error but silently returns the exit code. While rank-0 handles health reporting, having a debug/info log indicating the non-rank-0 behavior would help troubleshooting:

     except Exception as err:
         log.error("Benchmark execution failed", extra={"error": str(err)})
         if rank == 0:
             return _handle_failure(
                 cfg,
                 NCCLError.ALLREDUCE_TIMEOUT,
                 f"NCCL all-reduce benchmark failed: {err}",
             )
+        log.debug("Non-rank-0 process returning failure exit code", extra={"rank": rank})
         return NCCLError.ALLREDUCE_TIMEOUT.value.exit_code

This is a minor observability improvement.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preflight-checks/nccl-allreduce/nccl_allreduce/__main__.py` around lines 183
- 193, When benchmark.run(message_sizes) raises an exception, non-rank-0
branches currently return NCCLError.ALLREDUCE_TIMEOUT without additional
context; add a log entry on those branches before returning. Specifically, in
the except block surrounding benchmark.run, after log.error(...) and when rank
!= 0, emit a debug/info log (e.g., using log.debug or log.info) that includes
the rank and the exception (err) to record that a non-root process failed and is
returning the timeout exit code; keep the existing root behavior using
_handle_failure and NCCLError.ALLREDUCE_TIMEOUT intact.

143-147: Use contextlib.suppress for cleaner exception handling.

Per static analysis (Ruff SIM105), the try-except-pass pattern can be simplified.

♻️ Suggested refactor
+import contextlib
+
     finally:
-        try:
-            dist.destroy_process_group()
-        except Exception:
-            pass  # Ignore cleanup errors
+        with contextlib.suppress(Exception):
+            dist.destroy_process_group()

Add the import at the top of the file with the other imports.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preflight-checks/nccl-allreduce/nccl_allreduce/__main__.py` around lines 143
- 147, Replace the try/except/pass cleanup around dist.destroy_process_group()
with contextlib.suppress(Exception) to simplify exception handling: import
contextlib alongside the other imports and wrap the call to
dist.destroy_process_group() in a with contextlib.suppress(Exception): block
(replace the finally: try: ... except: pass pattern around
dist.destroy_process_group()).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@preflight-checks/nccl-allreduce/nccl_allreduce/tests/test_errors.py`:
- Around line 61-64: The test test_error_def_is_frozen currently only checks
isinstance; update it to actually attempt mutation on the ErrorDef instance
returned by NCCLError.SUCCESS.value and assert that mutation raises the
appropriate exception (e.g., dataclasses.FrozenInstanceError or AttributeError).
Locate the test_error_def_is_frozen function and replace or extend the assertion
with a with pytest.raises(...) block that tries to set a field on the ErrorDef
instance (for example, error_def.code = 123 or error_def.message = "x") to
confirm immutability.
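
A sketch of the strengthened test, assuming ErrorDef is a frozen dataclass and NCCLError is imported as in the existing test module:

import dataclasses

import pytest

def test_error_def_is_frozen() -> None:
    error_def = NCCLError.SUCCESS.value
    # A frozen dataclass must reject attribute assignment.
    with pytest.raises(dataclasses.FrozenInstanceError):
        error_def.exit_code = 123  # type: ignore[misc]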

---

Duplicate comments:
In `@distros/kubernetes/nvsentinel/charts/preflight/values.yaml`:
- Around line 234-239: The values.yaml comment about mirrorResourceClaims must
match the actual behavior in the Go code: inspect the config handling where
MirrorResourceClaims is defaulted (the MirrorResourceClaims field in config.go)
and if the code unconditionally sets MirrorResourceClaims to true when nil,
update the values.yaml comment to state "Defaults to true when omitted
(unconditionally)" or leave as-is if it already states that; if instead the code
conditions the default based on gang coordination, change the comment to
describe that conditional behavior. Ensure the fix references
mirrorResourceClaims in values.yaml and the MirrorResourceClaims field/logic in
config.go so the doc and implementation are consistent.

In `@preflight-checks/nccl-allreduce/nccl_allreduce/config.py`:
- Around line 71-83: The env var names used in this module don't match the
webhook: update the getenv keys so gang_config_dir uses "GANG_CONFIG_PATH"
instead of "GANG_CONFIG_DIR" and gang_timeout_seconds reads "GANG_TIMEOUT"
instead of "GANG_TIMEOUT_SECONDS"; keep the same fallback defaults
(DEFAULT_GANG_CONFIG_DIR, DEFAULT_GANG_TIMEOUT_SECONDS) and parsing helpers
(_parse_int for gang_timeout_seconds, _parse_float usage remains unchanged for
bw_threshold_gbps) so values injected by the webhook are actually picked up.

In `@preflight-checks/nccl-allreduce/scripts/entrypoint.py`:
- Around line 131-137: The code currently calls int() directly on environment
values for GANG_TIMEOUT_SECONDS and NPROCS_PER_NODE which will raise raw
ValueError on invalid input; update the logic around gang_timeout and
nprocs_per_node to validate and parse the env vars safely: for
GANG_TIMEOUT_SECONDS, try to parse int(os.getenv(...)) in a try/except, ensure
the value is a positive integer, log a clear error and exit non-zero on failure;
for NPROCS_PER_NODE, if NPROCS_PER_NODE is set, similarly try/except the int
conversion and validate (>0) before assigning to nprocs_per_node, otherwise call
_detect_gpu_count(); ensure error messages reference the env var names and the
variables gang_timeout and nprocs_per_node so debugging is clear.
- Around line 130-131: The script reads environment variables gang_config_dir
and gang_timeout (via gang_config_dir = os.getenv("GANG_CONFIG_DIR", ...) and
gang_timeout = int(os.getenv("GANG_TIMEOUT_SECONDS", ...))), but the webhook
injector sets GANG_CONFIG_PATH and GANG_TIMEOUT; update the env var names used
by the script to match the injector (use os.getenv("GANG_CONFIG_PATH",
DEFAULT_GANG_CONFIG_DIR) for gang_config_dir and os.getenv("GANG_TIMEOUT",
str(DEFAULT_GANG_TIMEOUT)) for gang_timeout) so injected values are honored,
keeping the same defaults and casting to int for gang_timeout.

In `@preflight/pkg/webhook/injector.go`:
- Around line 283-296: The validation is inconsistent between
appendExtraHostPathMounts and collectExtraHostPathVolumes:
appendExtraHostPathMounts currently only skips entries with empty Name or
MountPath while collectExtraHostPathVolumes skips Name or HostPath, which can
cause mounts without matching volumes. Update appendExtraHostPathMounts to use
the same validation as collectExtraHostPathVolumes (skip if Name, MountPath, or
HostPath is empty) or move validation to config loading so both functions rely
on the same sanitized entries; locate and adjust the checks in
appendExtraHostPathMounts and collectExtraHostPathVolumes to ensure both require
Name, MountPath, and HostPath.

---

Nitpick comments:
In `@distros/kubernetes/nvsentinel/charts/preflight/values.yaml`:
- Around line 178-195: The GCP TCPXO entries in the networkResourceNames list
are long and need clearer separation: group the GCP entries (the keys starting
with "networking.gke.io.networks/gpu-nicX" and
"networking.gke.io.networks/gpu-nicX.IP") under a single comment header like "#
GCP multi-network resources (TCPXO)" and add a closing blank line or a trailing
comment after that block to visually separate it from the rest of
networkResourceNames; update the values.yaml list so the group is clearly
delimited while preserving all existing entries and their exact strings.

In `@preflight-checks/nccl-allreduce/nccl_allreduce/__main__.py`:
- Around line 183-193: When benchmark.run(message_sizes) raises an exception,
non-rank-0 branches currently return NCCLError.ALLREDUCE_TIMEOUT without
additional context; add a log entry on those branches before returning.
Specifically, in the except block surrounding benchmark.run, after
log.error(...) and when rank != 0, emit a debug/info log (e.g., using log.debug
or log.info) that includes the rank and the exception (err) to record that a
non-root process failed and is returning the timeout exit code; keep the
existing root behavior using _handle_failure and NCCLError.ALLREDUCE_TIMEOUT
intact.
- Around line 143-147: Replace the try/except/pass cleanup around
dist.destroy_process_group() with contextlib.suppress(Exception) to simplify
exception handling: import contextlib alongside the other imports and wrap the
call to dist.destroy_process_group() in a with contextlib.suppress(Exception):
block (replace the finally: try: ... except: pass pattern around
dist.destroy_process_group()).

In `@preflight-checks/nccl-allreduce/nccl_allreduce/tests/test_config.py`:
- Around line 36-49: Combine the nested context managers into a single
with-statement in the test methods (e.g., in test_rejects_non_numeric,
test_rejects_zero, test_rejects_negative) by merging patch.dict(...) and
pytest.raises(...) into one line to improve readability; apply the same change
to the analogous methods in TestParseInt and TestConfigFromEnv so each test uses
a single with that contains both the environment patch and the expected
exception context while keeping the same assertions that call _parse_float or
the corresponding parse_int/config helpers.

Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
preflight-checks/nccl-allreduce/nccl_allreduce/tests/test_config.py (1)

36-49: Consider combining nested with statements.

Static analysis flags that nested with statements can be combined for cleaner code. This is a minor style improvement.

Example refactor
     def test_rejects_non_numeric(self) -> None:
-        with patch.dict(os.environ, {"TEST_VAR": "abc"}):
-            with pytest.raises(ValueError, match="Invalid TEST_VAR"):
-                _parse_float("TEST_VAR", 0.0)
+        with patch.dict(os.environ, {"TEST_VAR": "abc"}), pytest.raises(ValueError, match="Invalid TEST_VAR"):
+            _parse_float("TEST_VAR", 0.0)

     def test_rejects_zero(self) -> None:
-        with patch.dict(os.environ, {"TEST_VAR": "0"}):
-            with pytest.raises(ValueError, match="must be positive"):
-                _parse_float("TEST_VAR", 1.0)
+        with patch.dict(os.environ, {"TEST_VAR": "0"}), pytest.raises(ValueError, match="must be positive"):
+            _parse_float("TEST_VAR", 1.0)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preflight-checks/nccl-allreduce/nccl_allreduce/tests/test_config.py` around
lines 36 - 49, Combine the nested context managers in the three tests to use a
single with statement per test: replace the nested "with patch.dict(os.environ,
{...}): with pytest.raises(...):" pattern in test_rejects_non_numeric,
test_rejects_zero, and test_rejects_negative with a single "with
patch.dict(os.environ, {...}), pytest.raises(...):" line so the tests still call
_parse_float("TEST_VAR", ...) but use the consolidated context manager form for
clarity and style.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@distros/kubernetes/nvsentinel/charts/preflight/values.yaml`:
- Around line 167-170: Add a commented concrete example for the
ncclAllreduceExtraEnv values entry to show how to pass extra environment
variables into the preflight-nccl-allreduce init container; update the
values.yaml near the ncclAllreduceExtraEnv: [] block to include a commented
example demonstrating an array of env objects (e.g., name/value and
name/valueFrom patterns) so operators can see how to override/extend the fabric
profile when using the preflight-nccl-allreduce init container.
- Around line 129-155: The new initContainer entry named
preflight-nccl-allreduce is missing inline comments for the new fields; add
concise Helm values.yaml comments describing the image (preflight-nccl-allreduce
image source and intended tag usage), the securityContext block and the added
capability IPC_LOCK (why IPC_LOCK is required for RDMA memory registration), and
brief notes for env (what BW_THRESHOLD_GBPS, MESSAGE_SIZES,
GANG_TIMEOUT_SECONDS, NCCL_DEBUG, NCCL_DEBUG_SUBSYS control) and volumeMounts
(purpose of nvsentinel-socket and mountPath); update the comments next to the
symbols image, securityContext, IPC_LOCK, env, and volumeMounts in the
preflight-nccl-allreduce block.

---

Duplicate comments:
In `@preflight-checks/nccl-allreduce/nccl_allreduce/config.py`:
- Around line 71-83: The environment variable names used in config.py are
inconsistent with the webhook: change the getenv call for gang_config_dir to
read "GANG_CONFIG_PATH" instead of "GANG_CONFIG_DIR", and change the call to
_parse_int for gang_timeout_seconds to use "GANG_TIMEOUT" instead of
"GANG_TIMEOUT_SECONDS" so the variables gang_config_dir and gang_timeout_seconds
pick up values injected by the Go webhook (leave the variable names and defaults
intact).
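
A sketch of the corrected lookups in the config loader, assuming the existing helper names; the defaults shown ("/etc/preflight", 600) are placeholders and should remain whatever the module already uses:

    gang_config_dir = os.getenv("GANG_CONFIG_PATH", "/etc/preflight")  # was GANG_CONFIG_DIR
    gang_timeout_seconds = _parse_int("GANG_TIMEOUT", 600)             # was GANG_TIMEOUT_SECONDS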

In `@preflight-checks/nccl-allreduce/nccl_allreduce/tests/test_errors.py`:
- Around line 61-64: The test test_error_def_is_frozen currently only checks
type; update it to attempt mutating the ErrorDef instance returned by
NCCLError.SUCCESS.value and assert that mutation raises the expected exception
(e.g. dataclasses.FrozenInstanceError or AttributeError). Specifically, in
test_error_def_is_frozen get error_def = NCCLError.SUCCESS.value, then use
pytest.raises(FrozenInstanceError) (import from dataclasses) around a mutation
like setting error_def.code = <new value> to verify the dataclass is actually
frozen.
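
A sketch of the strengthened test, assuming ErrorDef exposes a code attribute as the comment suggests:

    from dataclasses import FrozenInstanceError

    import pytest

    def test_error_def_is_frozen(self) -> None:
        error_def = NCCLError.SUCCESS.value
        assert isinstance(error_def, ErrorDef)
        with pytest.raises(FrozenInstanceError):
            error_def.code = 999  # mutation must fail on a frozen dataclass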

In `@preflight-checks/nccl-allreduce/scripts/entrypoint.py`:
- Around line 23-38: The docstring and code use environment variables
GANG_CONFIG_DIR and GANG_TIMEOUT_SECONDS but the webhook injector may still
export GANG_CONFIG_PATH and GANG_TIMEOUT; update the webhook injector to export
GANG_CONFIG_DIR and GANG_TIMEOUT_SECONDS (or change the entrypoint to accept the
injector's names) and make the two sides consistent, then update the comment
block in entrypoint.py to reflect the final chosen names; search for occurrences
of GANG_CONFIG_PATH, GANG_CONFIG_DIR, GANG_TIMEOUT, and GANG_TIMEOUT_SECONDS and
ensure all references (injector and the Python entrypoint functions/classes) use
the same variable names.
- Around line 118-136: In _load_config(), validate and safely parse
GANG_TIMEOUT_SECONDS and NPROCS_PER_NODE: replace direct int() calls with
guarded parsing (try/except ValueError/TypeError) for GANG_TIMEOUT_SECONDS
(default DEFAULT_GANG_TIMEOUT) and for NPROCS_PER_NODE (fall back to calling
_detect_gpu_count() only on empty env), ensure both parsed values are positive
(>0), and if parsing fails or values are non-positive call log.error with a
clear message identifying the variable (e.g., "invalid GANG_TIMEOUT_SECONDS" or
"invalid NPROCS_PER_NODE") and return None instead of letting an exception
propagate; update references to gang_timeout and nprocs_per_node accordingly so
callers receive a clean None on bad config.
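
A sketch of the guarded parsing inside _load_config(), assuming DEFAULT_GANG_TIMEOUT, _detect_gpu_count(), and log are the names already used in entrypoint.py:

    def _load_config():
        try:
            gang_timeout = int(os.getenv("GANG_TIMEOUT_SECONDS", str(DEFAULT_GANG_TIMEOUT)))
        except ValueError:
            log.error("Invalid GANG_TIMEOUT_SECONDS; expected a positive integer")
            return None

        nprocs_env = os.getenv("NPROCS_PER_NODE", "")
        try:
            nprocs_per_node = int(nprocs_env) if nprocs_env else _detect_gpu_count()
        except ValueError:
            log.error("Invalid NPROCS_PER_NODE; expected a positive integer")
            return None

        if gang_timeout <= 0 or nprocs_per_node <= 0:
            log.error("GANG_TIMEOUT_SECONDS and NPROCS_PER_NODE must be positive")
            return None

        # ... build and return the config object from the validated values ...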

In `@preflight/pkg/webhook/injector.go`:
- Around line 283-296: The appendExtraHostPathMounts function currently adds
VolumeMounts for entries missing HostPath which collectExtraHostPathVolumes
rejects; update appendExtraHostPathMounts to mirror the same validation as
collectExtraHostPathVolumes by skipping any mount where m.Name, m.MountPath, or
m.HostPath is empty (i.e., require m.HostPath != "" before appending) so
VolumeMounts always have matching volumes from
cfg.GangCoordination.ExtraHostPathMounts.


Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
@github-actions

Merging this branch will decrease overall coverage

Impacted Packages Coverage Δ 🤖
github.com/nvidia/nvsentinel/preflight 0.00% (ø)
github.com/nvidia/nvsentinel/preflight/pkg/config 0.00% (ø)
github.com/nvidia/nvsentinel/preflight/pkg/controller 13.83% (-4.57%) 👎
github.com/nvidia/nvsentinel/preflight/pkg/webhook 0.00% (ø)

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/nvidia/nvsentinel/preflight/main.go 0.00% (ø) 319 0 319
github.com/nvidia/nvsentinel/preflight/pkg/config/config.go 0.00% (ø) 202 (+52) 0 202 (+52)
github.com/nvidia/nvsentinel/preflight/pkg/controller/gang_controller.go 13.83% (-4.57%) 282 (+70) 39 243 (+70) 👎
github.com/nvidia/nvsentinel/preflight/pkg/webhook/injector.go 0.00% (ø) 732 (+219) 0 732 (+219)

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/nvidia/nvsentinel/preflight/pkg/controller/gang_controller_test.go

@XRFXLP XRFXLP changed the title from "[WIP] NCCL allreduce" to "Add NCCL all reduce test in preflight framework" on Feb 18, 2026
@XRFXLP XRFXLP added the enhancement New feature or request label Feb 18, 2026
@XRFXLP XRFXLP self-assigned this Feb 18, 2026
@XRFXLP XRFXLP linked an issue Feb 18, 2026 that may be closed by this pull request
1 task
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (12)
preflight/pkg/webhook/injector.go (2)

544-544: slog.Info in injectGangEnv fires once per init container.

injectGangEnv is called inside the buildInitContainers loop (line 226), so a pod with two init containers emits this log line twice with identical fields. Downgrade to slog.Debug or move the log to the call site after the loop to avoid duplicate structured-log noise.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preflight/pkg/webhook/injector.go` at line 544, The structured log in
injectGangEnv ("Injecting gang environment variables") is emitted for each init
container causing duplicate entries; change the call in injectGangEnv from
slog.Info to slog.Debug (or remove it and place a single slog.Info at the caller
buildInitContainers after the loop) so the message is logged only once—update
the slog call inside injectGangEnv (or add the single log in
buildInitContainers) accordingly.

274-279: Consider deduplicating mirrored DRA resource claims.

container.Resources.Claims is appended unconditionally. If the init-container template already declares some claims that overlap with pod-level claims, this produces duplicate entries and may be rejected by the API server.

♻️ Proposed fix
+	existingClaims := make(map[string]bool)
+	for _, c := range container.Resources.Claims {
+		existingClaims[c.Name] = true
+	}
 	if mirrorClaims {
 		for _, podClaim := range podResourceClaims {
-			container.Resources.Claims = append(container.Resources.Claims, corev1.ResourceClaim{
-				Name: podClaim.Name,
-			})
+			if !existingClaims[podClaim.Name] {
+				container.Resources.Claims = append(container.Resources.Claims, corev1.ResourceClaim{
+					Name: podClaim.Name,
+				})
+			}
 		}
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preflight/pkg/webhook/injector.go` around lines 274 - 279, When mirrorClaims
is true we currently append every podResourceClaim into
container.Resources.Claims unconditionally, causing duplicates; update the logic
in injector.go (the mirrorClaims branch that iterates podResourceClaims and
appends to container.Resources.Claims) to skip claims whose Name already exists
in container.Resources.Claims by first building a set of existing claim names
(from container.Resources.Claims) and only appending podResourceClaims with
names not present in that set, ensuring duplicated ResourceClaim entries are not
produced for init-container templates.
preflight/pkg/config/config.go (2)

231-243: Shared pointer alias across all default-ReadOnly mounts.

All ExtraHostPathMounts and ExtraVolumeMounts entries that lack an explicit ReadOnly value will share the same *bool pointer (trueVal). If any downstream code ever mutates through that pointer (e.g., *mount.ReadOnly = false), it would silently affect every mount sharing the pointer. Safer to allocate per-entry.

♻️ Proposed fix
-	trueVal := true
-
 	for i := range c.ExtraHostPathMounts {
 		if c.ExtraHostPathMounts[i].ReadOnly == nil {
+			trueVal := true
 			c.ExtraHostPathMounts[i].ReadOnly = &trueVal
 		}
 	}
 
 	for i := range c.ExtraVolumeMounts {
 		if c.ExtraVolumeMounts[i].ReadOnly == nil {
+			trueVal := true
 			c.ExtraVolumeMounts[i].ReadOnly = &trueVal
 		}
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preflight/pkg/config/config.go` around lines 231 - 243, The current loops in
config.go assign the same trueVal pointer to all mounts, creating a shared *bool
that could be mutated across entries; instead allocate a distinct bool for each
entry by creating a new local variable (e.g., tmp := true) inside each iteration
and assign its address to ExtraHostPathMounts[i].ReadOnly and
ExtraVolumeMounts[i].ReadOnly respectively so every mount gets its own *bool
rather than sharing the same pointer.

147-151: Consider validating HostPathType against known Kubernetes values.

HostPathType is a free-form string. An invalid value (e.g., a typo like "Dirctory") would pass config loading silently and only fail at pod admission time. A validation step in validate() would surface misconfigurations earlier.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preflight/pkg/config/config.go` around lines 147 - 151, The HostPathType
field is a free-form string and should be validated early; update the validate()
function to check config.HostPathType against the allowed Kubernetes values ("",
"Directory", "DirectoryOrCreate", "File", "FileOrCreate", "Socket",
"CharDevice", "BlockDevice") and return a clear error when it is non-empty and
not one of these options (include the invalid value in the message). Locate the
HostPathType field definition and add this lookup/validation logic inside the
existing validate() method (or helper called from validate()) so mis-typed
values fail config validation instead of during pod admission.
preflight-checks/nccl-allreduce/nccl_allreduce/gang.py (1)

201-204: Unreachable except FileNotFoundError branch.

_read_string(KEY_PEERS, default="") already catches FileNotFoundError internally and returns "" when a default is provided, so the outer except FileNotFoundError on line 203 is dead code. Not harmful, but misleading to future readers.

♻️ Simplified version
-        try:
-            data = self._read_string(KEY_PEERS, default="")
-        except FileNotFoundError:
-            return []
+        data = self._read_string(KEY_PEERS, default="")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preflight-checks/nccl-allreduce/nccl_allreduce/gang.py` around lines 201 -
204, The except FileNotFoundError branch is unreachable because
_read_string(KEY_PEERS, default="") already handles FileNotFoundError and
returns "" when missing; remove the try/except and directly assign data =
self._read_string(KEY_PEERS, default="") then continue the existing logic (e.g.,
return [] when data == "" or parse peers otherwise). Update references in the
gang.py method containing KEY_PEERS and _read_string accordingly so the code is
clearer and no longer includes the dead except block.
preflight/pkg/controller/gang_controller_test.go (1)

158-159: Empty config skips all new NCCL topo logic — consider adding a test for ensureNCCLTopoConfigMap.

testCfg is initialized as &config.Config{}, so NCCLTopoConfigMap and NCCLTopoData are empty strings, causing ensureNCCLTopoConfigMap to short-circuit in every test case. The new ConfigMap creation path (Get → Create, TOCTOU handling, AlreadyExists guard) has no test coverage.

Consider adding a test case where testCfg.GangCoordination.NCCLTopoConfigMap and NCCLTopoData are set and verifying the topo ConfigMap is created alongside the gang ConfigMap.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preflight/pkg/controller/gang_controller_test.go` around lines 158 - 159, The
tests create NewGangController with an empty testCfg so ensureNCCLTopoConfigMap
always returns early; add a test scenario that sets
testCfg.GangCoordination.NCCLTopoConfigMap and
testCfg.GangCoordination.NCCLTopoData to non-empty values, then call the same
controller logic that creates the gang ConfigMap and assert that the NCCL topo
ConfigMap is created as well (covering Get→Create TOCTOU and AlreadyExists
paths); specifically modify the gang_controller_test to initialize testCfg with
those fields, invoke the controller reconcile or the helper that calls
ensureNCCLTopoConfigMap, and verify the ConfigMap exists and contains the
provided NCCLTopoData.
preflight/pkg/controller/gang_controller.go (2)

233-258: Consider adding a managed-by label for cleanup discoverability.

The NCCL topo ConfigMap has component and name labels but no app.kubernetes.io/managed-by label. Adding one would make it straightforward to list and clean up auto-created topo ConfigMaps (e.g., kubectl get cm -l app.kubernetes.io/managed-by=nvsentinel-preflight).

♻️ Suggested label addition
 		Labels: map[string]string{
 			"app.kubernetes.io/component": "nccl-topo",
 			"app.kubernetes.io/name":      "nvsentinel",
+			"app.kubernetes.io/managed-by": "nvsentinel-preflight",
 		},
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preflight/pkg/controller/gang_controller.go` around lines 233 - 258, Add a
managed-by label to the NCCL topo ConfigMap labels map so auto-created topo
ConfigMaps are discoverable/cleanable; update the Labels map in the cm object
(where cm is created and gcfg.NCCLTopoConfigMap is used) to include
"app.kubernetes.io/managed-by": "nvsentinel-preflight" alongside the existing
"app.kubernetes.io/component" and "app.kubernetes.io/name" entries.

203-209: NCCL topo ConfigMap creation adds latency to the admission webhook path.

ensureNCCLTopoConfigMap performs a Get + conditional Create against the API server synchronously inside RegisterPod, which runs in the webhook's admission path. While the existing EnsureConfigMap call already adds latency, this doubles the API calls for pods with NCCL topo configured. If API server latency is high, this could push admission close to the webhook timeout.

Consider offloading this to the controller's reconcile loop or a goroutine if webhook latency becomes a concern.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preflight/pkg/controller/gang_controller.go` around lines 203 - 209, The call
to ensureNCCLTopoConfigMap inside RegisterPod performs a synchronous Get+Create
against the API server (duplicating EnsureConfigMap work) and can spike
admission latency; move this work out of the webhook path by either (a)
enqueueing the pod/namespace to the gang controller's reconcile loop so the
reconcile handler ensures the NCCL topo ConfigMap asynchronously, or (b)
spawning a background goroutine from RegisterPod that deduplicates requests (use
a namespace-level singleflight/mutex or a map of in-flight requests keyed by
reg.Namespace) and performs the Get/Create off the admission path; update
ensureNCCLTopoConfigMap (and any EnsureConfigMap usage) to return errors via
logs/metrics instead of blocking admission and ensure failures are retried by
the controller reconcile logic.
preflight-checks/nccl-allreduce/nccl_allreduce/tests/test_config.py (2)

68-71: TestParseInt is missing a test_rejects_negative counterpart.

TestParseFloat has both test_rejects_zero and test_rejects_negative, but TestParseInt only has test_rejects_zero. Both helpers share the same <= 0 guard, so adding the negative case gives symmetric coverage.

🧪 Suggested addition
+    def test_rejects_negative(self) -> None:
+        with patch.dict(os.environ, {"TEST_VAR": "-1"}), pytest.raises(ValueError, match="must be positive"):
+            _parse_int("TEST_VAR", 1)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preflight-checks/nccl-allreduce/nccl_allreduce/tests/test_config.py` around
lines 68 - 71, Add a new test method test_rejects_negative to the TestParseInt
tests that mirrors TestParseFloat's negative case: use patch.dict to set
{"TEST_VAR": "-1"} and assert that _parse_int("TEST_VAR", 1) raises ValueError
with a "must be positive" match; place the new test alongside the existing
test_rejects_zero so both zero and negative inputs are covered for _parse_int.

37-49: Combine nested with statements — flagged as errors by Ruff (SIM117).

All pytest.raises blocks use nested with statements. Python's with supports multiple context managers in one statement.

♻️ Example fix (pattern applies to all similar blocks in this file)
-        with patch.dict(os.environ, {"TEST_VAR": "abc"}):
-            with pytest.raises(ValueError, match="Invalid TEST_VAR"):
-                _parse_float("TEST_VAR", 0.0)
+        with patch.dict(os.environ, {"TEST_VAR": "abc"}), pytest.raises(ValueError, match="Invalid TEST_VAR"):
+            _parse_float("TEST_VAR", 0.0)

The same pattern applies to Lines 42-43, 47-48, 64-65, 69-70, 117-118, 124-125, 131-132, and 137-138.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preflight-checks/nccl-allreduce/nccl_allreduce/tests/test_config.py` around
lines 37 - 49, Replace nested context-manager blocks with single combined
with-statements for each test to satisfy Ruff SIM117: in functions
test_rejects_zero, test_rejects_negative (and other listed tests) combine the
patch.dict and pytest.raises into one line, e.g. use "with
patch.dict(os.environ, {...}), pytest.raises(...):" and call
_parse_float("TEST_VAR", ...) inside that single with-block; update all
occurrences referencing _parse_float and the test function names accordingly.
preflight-checks/nccl-allreduce/nccl_allreduce/__main__.py (1)

144-147: Use contextlib.suppress instead of try/except/pass (Ruff SIM105).

♻️ Proposed refactor
+import contextlib
 ...
-        try:
-            dist.destroy_process_group()
-        except Exception:
-            pass  # Ignore cleanup errors
+        with contextlib.suppress(Exception):
+            dist.destroy_process_group()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preflight-checks/nccl-allreduce/nccl_allreduce/__main__.py` around lines 144
- 147, The try/except/pass around cleaning up the PyTorch process group should
be replaced with contextlib.suppress to follow Ruff SIM105; wrap the call to
dist.destroy_process_group() with contextlib.suppress(Exception) and add the
corresponding import (either import contextlib or from contextlib import
suppress) near the top of the module so the cleanup block uses suppress instead
of an empty except in __main__.py.
preflight-checks/nccl-allreduce/Dockerfile (1)

37-37: Poetry 1.8.2 is 2 major versions behind the current stable release.

The current stable Poetry release is 2.3.2 (Feb 1, 2026); the 1.x branch last published 1.8.5 (Dec 6, 2024). Poetry 1.8.2 (March 2024) is no longer maintained.

Bump to 1.8.5 to pick up security and bug fixes. For migration to Poetry 2.x, note that poetry export was removed in Poetry 2.0 (Jan 5, 2025) and requires installing the poetry-plugin-export plugin—either via poetry self add poetry-plugin-export or by declaring it in pyproject.toml under [tool.poetry.requires-plugins]. The poetry build command remains available with new options added. Plan the migration accordingly if the build process relies on either command.

♻️ Proposed immediate fix
-    pip install poetry==1.8.2
+    pip install poetry==1.8.5
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preflight-checks/nccl-allreduce/Dockerfile` at line 37, Update the Poetry
installation in the Dockerfile: replace the hardcoded version "poetry==1.8.2"
with "poetry==1.8.5" to pick up security/bug fixes (update the pip install line
that currently reads pip install poetry==1.8.2), and if you plan to migrate to
Poetry 2.x later, add steps to install the poetry-plugin-export (via poetry self
add poetry-plugin-export or declaring it under [tool.poetry.requires-plugins] in
pyproject.toml) because poetry export was removed in 2.0; ensure any downstream
build steps that rely on poetry export or changed CLI behavior are adjusted
accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@preflight-checks/nccl-allreduce/nccl_allreduce/__main__.py`:
- Around line 106-116: The NCCL process-group is using the default 30-minute
timeout; update the dist.init_process_group call to pass an explicit timeout
derived from the existing gang timeout (e.g., use
datetime.timedelta(seconds=gang_timeout)) so the NCCL init will abort within the
preflight's gang_timeout; modify the code around dist.init_process_group to
import datetime (if needed) and call dist.init_process_group(backend="nccl",
timeout=datetime.timedelta(seconds=gang_timeout)) (use the actual gang timeout
variable name used in this module) and keep the surrounding logs unchanged.
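
Pulled together, a minimal sketch of that call; cfg.gang_timeout_seconds stands in for whatever the module's gang-timeout value is actually named:

    import datetime

    import torch.distributed as dist

    dist.init_process_group(
        backend="nccl",
        timeout=datetime.timedelta(seconds=cfg.gang_timeout_seconds),
    )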
- Around line 99-104: The code currently sets NCCL_ASYNC_ERROR_HANDLING via
os.environ; update the environment variable handling so the new PyTorch name
TORCH_NCCL_ASYNC_ERROR_HANDLING is set (and keep setting
NCCL_ASYNC_ERROR_HANDLING for backward compatibility). Specifically, in the
__main__.py block that manipulates os.environ, add a check to set
os.environ["TORCH_NCCL_ASYNC_ERROR_HANDLING"]="1" when neither variable exists
(or mirror the value when only the old name is present), ensuring both
TORCH_NCCL_ASYNC_ERROR_HANDLING and NCCL_ASYNC_ERROR_HANDLING are defined to
avoid deprecation warnings and future breakage.

In `@preflight-checks/nccl-allreduce/scripts/entrypoint.py`:
- Line 205: Wrap the os.execvp call inside _launch_torchrun with a try/except
that catches OSError, and on error log a clear message including the attempted
cmd and the exception (use the existing logger/process logger used elsewhere in
this module), then raise or exit with the module's structured NCCLError (use the
appropriate NCCLError code for "launcher not found" or a generic launcher
failure) instead of letting the raw traceback propagate; ensure callers of
_launch_torchrun retain the structured exception handling path.
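
A minimal sketch of the guarded launch, assuming log and cmd are the existing names; LAUNCHER_FAILED is a stand-in for whichever NCCLError member the module actually defines for this case:

    try:
        os.execvp(cmd[0], cmd)  # on success this process is replaced by torchrun
    except OSError as err:
        log.error("Failed to exec torchrun", extra={"cmd": cmd, "error": str(err)})
        sys.exit(NCCLError.LAUNCHER_FAILED.value.exit_code)  # hypothetical member name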
- Around line 92-94: The problem is that _wait_for_gang sends a
GANG_TIMEOUT health event but returns None, and main() treats any None as a
GANG_CONFIG_ERROR exit (2), masking the timeout (3). Fix by making
_wait_for_gang return an explicit exit code on failure (return
NCCLError.GANG_TIMEOUT.value.exit_code for TimeoutError, or other appropriate
NCCLError codes) instead of None, and update main() to check if gang_config is
an int (error code) and immediately return that code; otherwise proceed when
gang_config is a valid config object. Ensure you reference _wait_for_gang and
the gang_config variable and use NCCLError.GANG_TIMEOUT /
NCCLError.GANG_CONFIG_ERROR enum values for returned exit codes.
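
One way to wire that up, sketched with assumed helper names (_read_gang_config, _report_gang_timeout) standing in for the module's existing discovery and health-event code:

    def _wait_for_gang(gang_config_dir: str, gang_timeout: int):
        try:
            return _read_gang_config(gang_config_dir, gang_timeout)  # existing wait/parse logic
        except TimeoutError as err:
            log.error("Timed out waiting for gang members", extra={"error": str(err)})
            _report_gang_timeout()  # existing GANG_TIMEOUT health event
            return NCCLError.GANG_TIMEOUT.value.exit_code

    # In main():
    gang_config = _wait_for_gang(gang_config_dir, gang_timeout)
    if isinstance(gang_config, int):
        return gang_config  # propagate the specific failure code (3 for GANG_TIMEOUT per the comment above)
    if gang_config is None:
        return NCCLError.GANG_CONFIG_ERROR.value.exit_code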

In `@preflight/pkg/webhook/injector.go`:
- Around line 298-312: The extra volume mounts are defaulting to ReadOnly=true
in appendExtraVolumeMounts (Injector.appendExtraVolumeMounts), which breaks
writable Unix socket volumes like TCPXO; update the default to false by changing
the call that constructs the VolumeMount ReadOnly field from
boolDefault(m.ReadOnly, true) to boolDefault(m.ReadOnly, false) (or ensure
callers explicitly set ReadOnly=false for socket volumes), so socket-based
plugins can open writable sockets during init and runtime.

---

Duplicate comments:
In `@distros/kubernetes/nvsentinel/charts/preflight/values.yaml`:
- Around line 167-170: Add a short commented usage example for
ncclAllreduceExtraEnv in values.yaml showing the expected YAML list of env var
objects (name/value or valueFrom) that will be appended to the
preflight-nccl-allreduce init container after networkFabric defaults; update the
ncclAllreduceExtraEnv block to include one or two commented entries (e.g.,
overriding FABRIC_PROFILE or setting FABRIC_LOG_LEVEL) and a brief note that
these are merged after networkFabric defaults so operators know format and
precedence.
- Around line 135-154: Add inline comment annotations for the image and
volumeMounts entries for the preflight-nccl-allreduce container: update the
image line (image: ghcr.io/nvidia/nvsentinel/preflight-nccl-allreduce:latest) to
include a brief comment describing what the image contains and when to pin/tag
it, and add a comment next to the volumeMounts nvsentinel-socket entry (name:
nvsentinel-socket, mountPath: /var/run) explaining the purpose of the socket
volume (e.g., sharing runtime sockets/IPC with host) and any required
permissions; modify the values under the preflight-nccl-allreduce container
definition so both the image and the nvsentinel-socket mount have inline Helm
values.yaml comments consistent with other documented fields.

In `@preflight-checks/nccl-allreduce/nccl_allreduce/config.py`:
- Around line 71-83: The env var names used to populate gang_config_dir and
gang_timeout_seconds are incorrect and must match the webhook: replace
os.getenv("GANG_CONFIG_DIR", ...) with os.getenv("GANG_CONFIG_PATH", ...) for
gang_config_dir and change _parse_int("GANG_TIMEOUT_SECONDS", ...) to
_parse_int("GANG_TIMEOUT", ...) for gang_timeout_seconds so operator-set values
(configMapMountPath and timeout) are honored at runtime; update any related
docs/comments so the expected env names (GANG_CONFIG_PATH, GANG_TIMEOUT) are
consistent across the codebase.

In `@preflight-checks/nccl-allreduce/nccl_allreduce/tests/test_errors.py`:
- Around line 61-64: The test test_error_def_is_frozen currently only checks
isinstance and doesn't verify immutability; update it to attempt a mutation on
an ErrorDef instance obtained via NCCLError.SUCCESS.value and assert that
mutation raises a TypeError (using pytest.raises) to confirm it's frozen, and
add import pytest at top of the test file; reference ErrorDef, NCCLError, and
test_error_def_is_frozen when making the change.

In `@preflight-checks/nccl-allreduce/scripts/entrypoint.py`:
- Around line 129-130: The env var names used when reading configuration
(gang_config_dir and gang_timeout) do not match the webhook injector; update the
getenv keys to "GANG_CONFIG_PATH" and "GANG_TIMEOUT" respectively, keeping the
same defaults (DEFAULT_GANG_CONFIG_DIR and DEFAULT_GANG_TIMEOUT) and parsing the
timeout to int as before so gang_config_dir = os.getenv("GANG_CONFIG_PATH",
DEFAULT_GANG_CONFIG_DIR) and gang_timeout = int(os.getenv("GANG_TIMEOUT",
str(DEFAULT_GANG_TIMEOUT))) while leaving the variable names (gang_config_dir,
gang_timeout) unchanged.
- Around line 130-136: Wrap the raw int conversions for GANG_TIMEOUT_SECONDS and
NPROCS_PER_NODE in validation with try/except ValueError: when parsing
os.getenv("GANG_TIMEOUT_SECONDS", str(DEFAULT_GANG_TIMEOUT)) assign to
gang_timeout only if int(...) succeeds otherwise log a structured warning/error
referencing the variable name and provided value and fall back to
DEFAULT_GANG_TIMEOUT; similarly, when parsing nprocs_env for nprocs_per_node,
catch ValueError, log a structured error mentioning NPROCS_PER_NODE and the
invalid value, and then either fall back to calling _detect_gpu_count() or exit
with a clear message; update the code paths around gang_timeout,
nprocs_per_node, GANG_TIMEOUT_SECONDS, NPROCS_PER_NODE and _detect_gpu_count()
accordingly so no raw ValueError traceback is raised.

In `@preflight/pkg/webhook/injector.go`:
- Around line 284-296: The appendExtraHostPathMounts function is inserting
VolumeMounts even when collectExtraHostPathVolumes skips creating the
corresponding Volumes (when m.HostPath == "" or when m.HostPathType is
unsupported), causing broken pod specs; update appendExtraHostPathMounts to
perform the same validation as collectExtraHostPathVolumes (skip mounts when
m.HostPath == "" or when m.HostPathType is not one of the supported types) or
extract a shared helper (e.g., validateExtraHostPath) used by both
appendExtraHostPathMounts and collectExtraHostPathVolumes to centralize the
checks against i.cfg.GangCoordination.ExtraHostPathMounts so VolumeMounts are
only added when a matching Volume will be created.


Comment on lines 99 to 104
Contributor


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

find . -type f -name "__main__.py" | grep nccl

Repository: NVIDIA/NVSentinel

Length of output: 121


🏁 Script executed:

head -n 110 ./preflight-checks/nccl-allreduce/nccl_allreduce/__main__.py | tail -n 20

Repository: NVIDIA/NVSentinel

Length of output: 904


🌐 Web query:

TORCH_NCCL_ASYNC_ERROR_HANDLING PyTorch deprecation NCCL_ASYNC_ERROR_HANDLING

💡 Result:

PyTorch deprecated the NCCL env var name NCCL_ASYNC_ERROR_HANDLING in favor of the PyTorch-prefixed variable TORCH_NCCL_ASYNC_ERROR_HANDLING (rename noted as happening in PyTorch 2.2, with the old name kept for backward compatibility). When the old name is set, you may see a warning like: “Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead”. [1], [2]

What to do (migration)

  • Stop setting:
    • NCCL_ASYNC_ERROR_HANDLING=... (deprecated name) [2]
  • Set instead:
    • TORCH_NCCL_ASYNC_ERROR_HANDLING=... [1]

Example:

# old (deprecated)
# export NCCL_ASYNC_ERROR_HANDLING=1

# new
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1

Semantics (PyTorch)

TORCH_NCCL_ASYNC_ERROR_HANDLING controls how ProcessGroupNCCL reacts when NCCL async errors are detected by the watchdog:

  • 0: no async error handling
  • 1: abort communicator + tear down process
  • 2: abort communicator only
  • 3: tear down process only (default) [1]

Sources: PyTorch ProcessGroupNCCL env var docs for TORCH_NCCL_ASYNC_ERROR_HANDLING [1]; NVIDIA CUDA Graph troubleshooting note describing the rename to TORCH_NCCL_ASYNC_ERROR_HANDLING in PyTorch 2.2 [2].

[1] (docs.pytorch.org)
[2] (docs.nvidia.com)


🌐 Web query:

NGC 25.01 PyTorch version

💡 Result:

NGC PyTorch container release 25.01 (e.g., nvcr.io/nvidia/pytorch:25.01-py3) is based on PyTorch 2.6.0a0+ecf3bae40a. [1]

Sources: [1]


NCCL_ASYNC_ERROR_HANDLING is deprecated — use TORCH_NCCL_ASYNC_ERROR_HANDLING instead.

PyTorch deprecated NCCL_ASYNC_ERROR_HANDLING in favor of TORCH_NCCL_ASYNC_ERROR_HANDLING (PyTorch 2.2+), and NGC 25.01 ships PyTorch 2.6.x, which emits the warning: "Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead". Setting only the old name risks the watchdog being ignored in future PyTorch versions where backward compatibility is dropped.

Proposed fix
 if "NCCL_ASYNC_ERROR_HANDLING" not in os.environ:
     os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"
+if "TORCH_NCCL_ASYNC_ERROR_HANDLING" not in os.environ:
+    os.environ["TORCH_NCCL_ASYNC_ERROR_HANDLING"] = "1"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
 if "NCCL_TIMEOUT" not in os.environ:
     os.environ["NCCL_TIMEOUT"] = "1800"  # 30 minutes default
 if "NCCL_ASYNC_ERROR_HANDLING" not in os.environ:
     os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"
+if "TORCH_NCCL_ASYNC_ERROR_HANDLING" not in os.environ:
+    os.environ["TORCH_NCCL_ASYNC_ERROR_HANDLING"] = "1"
 if "NCCL_DEBUG" not in os.environ:
     os.environ["NCCL_DEBUG"] = "INFO"  # Enable NCCL debug logging
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preflight-checks/nccl-allreduce/nccl_allreduce/__main__.py` around lines 99 -
104, The code currently sets NCCL_ASYNC_ERROR_HANDLING via os.environ; update
the environment variable handling so the new PyTorch name
TORCH_NCCL_ASYNC_ERROR_HANDLING is set (and keep setting
NCCL_ASYNC_ERROR_HANDLING for backward compatibility). Specifically, in the
__main__.py block that manipulates os.environ, add a check to set
os.environ["TORCH_NCCL_ASYNC_ERROR_HANDLING"]="1" when neither variable exists
(or mirror the value when only the old name is present), ensuring both
TORCH_NCCL_ASYNC_ERROR_HANDLING and NCCL_ASYNC_ERROR_HANDLING are defined to
avoid deprecation warnings and future breakage.

Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
preflight-checks/nccl-allreduce/nccl_allreduce/__main__.py (1)

247-274: Potential ValueError from HealthReporter.send_failure if an error with error_code=None is ever passed.

Currently all callers of _handle_failure pass errors with non-None error_code (ALLREDUCE_INIT_FAILED, ALLREDUCE_TIMEOUT, ALLREDUCE_BW_DEGRADED), so this is safe today. However, HealthReporter.send_failure raises ValueError for errors with error_code=None, and this function only catches RuntimeError. A defensive guard or broader catch would future-proof this.

Proposed fix
     try:
         reporter = HealthReporter(
             socket_path=cfg.connector_socket,
             node_name=cfg.node_name,
             processing_strategy=cfg.processing_strategy,
         )
         reporter.send_failure(error, message)
-    except RuntimeError as err:
+    except (RuntimeError, ValueError) as err:
         log.error(
             "Failed to send health event",
             extra={"error": str(err)},
         )
         return NCCLError.HEALTH_REPORT_FAILED.value.exit_code
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preflight-checks/nccl-allreduce/nccl_allreduce/__main__.py` around lines 247
- 274, The _handle_failure function currently calls HealthReporter.send_failure
without guarding against ValueError when an NCCLError with error_code=None is
passed; update _handle_failure to either validate the passed error (NCCLError)
has a non-None error_code before calling reporter.send_failure (and log+return
an appropriate exit code if missing) or broaden the exception handler to catch
ValueError in addition to RuntimeError around reporter.send_failure, ensuring
you still log the error and return
NCCLError.HEALTH_REPORT_FAILED.value.exit_code; reference _handle_failure and
HealthReporter.send_failure in your change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@preflight-checks/nccl-allreduce/nccl_allreduce/__main__.py`:
- Around line 101-103: dist.init_process_group in __main__.py currently uses the
default (30min) timeout; modify the call to pass an explicit, shorter timeout to
enable fast-fail (e.g., timeout=datetime.timedelta(seconds=300)). Use an
existing timeout value from your Config if available (e.g., Config.nccl_timeout
or similar), otherwise apply a sensible hardcoded default (300s); update the
dist.init_process_group(backend="nccl") invocation to include the timeout
argument and ensure you import datetime if needed and reference the chosen
Config field name.

In `@preflight-checks/nccl-allreduce/scripts/entrypoint.py`:
- Around line 129-130: The script is reading the wrong environment variable
names so injected values are ignored; update the code that sets gang_config_dir
and gang_timeout to read the same env vars the Go webhook injector sets (use
GANG_CONFIG_PATH instead of GANG_CONFIG_DIR for gang_config_dir, and use
GANG_TIMEOUT instead of GANG_TIMEOUT_SECONDS for gang_timeout), preserve the
default fallbacks (DEFAULT_GANG_CONFIG_DIR and DEFAULT_GANG_TIMEOUT) and ensure
gang_timeout is parsed to int as currently done in the gang_timeout assignment.
- Around line 129-136: The environment parsing uses unguarded int() calls for
gang_timeout and nprocs_per_node which will raise raw ValueError on malformed
input; wrap the int conversions for os.getenv("GANG_TIMEOUT_SECONDS",
str(DEFAULT_GANG_TIMEOUT)) and int(os.getenv("NPROCS_PER_NODE", "")) in
try/except ValueError, validate the parsed values (e.g., ensure timeout and
nprocs are positive integers), and on failure log a clear error via your logger
(or print) including the env var name and bad value and then exit with a
non-zero code; if NPROCS_PER_NODE is unset or invalid fall back to calling
_detect_gpu_count() as currently intended.


Collaborator


Can we add a comment here with a link to where the original file is sourced from?

Collaborator


Can we add a comment here with a link to where the original file is sourced from?

# Per-fabric mount profiles. Keyed by networkFabric name.
# The selected profile's extraHostPathMounts and extraVolumeMounts are
# merged into gangCoordination at render time (see _helpers.tpl).
fabricProfiles:
Collaborator


Won't the user container also have to specify these vars? Wouldn't it be easier to just copy the vars from the user's container? In config we could have a pattern, say NCCL_*, and any var that matches the pattern would be copied.

},
)
for i in range(self._warmup):
dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
Collaborator


Can we make dist.ReduceOp.SUM a config as well?
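
One way to do that (a sketch, not part of this PR): read the op from an environment variable — REDUCE_OP is a hypothetical name — and fall back to SUM:

    _REDUCE_OPS = {
        "sum": dist.ReduceOp.SUM,
        "max": dist.ReduceOp.MAX,
        "min": dist.ReduceOp.MIN,
    }
    reduce_op = _REDUCE_OPS.get(os.getenv("REDUCE_OP", "sum").lower(), dist.ReduceOp.SUM)
    for _ in range(self._warmup):
        dist.all_reduce(tensor, op=reduce_op)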


Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: Add preflight check for NCCL allreduce test

2 participants
