
Conversation


@dgchinner dgchinner commented Jan 22, 2026

Enhancement: Add SKU-based customisations for optimal hardware performance

Reason: Different VM types have different hardware and network topologies that need specific optimisations to be applied. This improves GPU performance on individual machines as well as cluster-wide networking performance.

Testing: installation and removal have been tested by mocking the API lookup that returns the current machine's SKU string. Manual installation can be run via:

# __MOCK_SKU=<sku type> /opt/hpc/azure/bin/setup_customisations.sh

to check that /etc/nccl.conf and the topology and graph file links in /var/hpc/azure/ have been set up properly.
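The verification step can be sketched as a small helper; this is illustrative only (the `check_sku_state` function, and the `NCCL_CONF`/`RUNTIME_DIR` override variables, are hypothetical — the PR's actual checks live in its test scripts):

```shell
#!/bin/sh
# Hypothetical verification helper for the SKU setup script. Checks that
# /etc/nccl.conf exists and, when a runtime topology file is present,
# that NCCL_TOPO_FILE was appended to it. Paths follow the PR layout but
# can be overridden for sandboxed testing.
NCCL_CONF="${NCCL_CONF:-/etc/nccl.conf}"
RUNTIME_DIR="${RUNTIME_DIR:-/var/hpc/azure}"

check_sku_state() {
    sku="$1"
    if [ ! -f "${NCCL_CONF}" ]; then
        echo "Failed: ${sku}: ${NCCL_CONF} missing"
        return 1
    fi
    # SKUs that install a topology file should also have NCCL_TOPO_FILE
    # recorded in nccl.conf.
    if [ -e "${RUNTIME_DIR}/topo.xml" ]; then
        grep -q '^NCCL_TOPO_FILE=' "${NCCL_CONF}" || {
            echo "Failed: ${sku}: NCCL_TOPO_FILE not set"
            return 1
        }
    fi
    echo "OK: ${sku}"
}
```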

Issue Tracker Tickets (Jira or BZ if any): RHELHPC-114

This PR depends on functionality in #48; it contains the same commits for managing HPC resource directories.

Summary by Sourcery

Add Azure VM SKU-aware customisation support to the HPC role, including resource directory management and automated hardware tuning setup.

New Features:

  • Introduce configurable SKU-based hardware tuning for Azure HPC VMs via the hpc_sku_customisation boolean variable.
  • Add Azure-specific resource and runtime directories for storing scripts, topology data, and runtime-selected configuration files.
  • Provide systemd-managed setup and removal scripts to apply or rollback NCCL, NVLink, and topology customisations per VM SKU.

Enhancements:

  • Document the new hpc_sku_customisation variable and its purpose in the role README.

Tests:

  • Add test role wiring to support validation of the new SKU customisation assets.

Define the directory hierarchy for cloud-specific tools, scripts,
resources and tests, and encode them into variables for common usage.

The structure we want to use for static files follows this template:

/opt/hpc/<cloud-vendor>/bin/		# one-off binaries and scripts
/opt/hpc/<cloud-vendor>/lib/...		# resources and libraries
/opt/hpc/<cloud-vendor>/tools/...	# standalone tools
/opt/hpc/<cloud-vendor>/tests/		# test scripts for local/CI testing

For runtime files (e.g. configuration files set up by boot
services), we will store them in:

/var/hpc/<cloud-vendor>/....

These common directories will be defined by the following set of
variables:

__hpc_<cloud>_resource_dir		# /opt/hpc/<cloud-vendor>/
__hpc_<cloud>_tools_dir			# /opt/hpc/<cloud-vendor>/tools/
__hpc_<cloud>_tests_dir			# /opt/hpc/<cloud-vendor>/tests/
__hpc_<cloud>_runtime_dir		# /var/hpc/<cloud-vendor>/

At the moment we only support Azure, so only those variables are
defined.
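The layout above can be exercised with a short shell sketch; the variable names mirror the commit message, but the helper itself is illustrative (the role creates these directories via Ansible tasks, not a script):

```shell
#!/bin/sh
# Sketch of the Azure directory layout encoded by the role variables.
# make_hpc_dirs takes an install prefix ("" on a real system) so the
# layout can be created in a sandbox for testing.
make_hpc_dirs() {
    prefix="$1"
    resource_dir="${prefix}/opt/hpc/azure"    # __hpc_azure_resource_dir
    runtime_dir="${prefix}/var/hpc/azure"     # __hpc_azure_runtime_dir

    for dir in "${resource_dir}/bin" "${resource_dir}/lib" \
               "${resource_dir}/tools" "${resource_dir}/tests" \
               "${runtime_dir}"; do
        install -d -m 0755 "$dir" || return 1
    done
}
```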

Signed-off-by: Dave Chinner <dchinner@redhat.com>
@sourcery-ai

sourcery-ai bot commented Jan 22, 2026

Reviewer's Guide

Adds Azure SKU-based customisation support to the HPC role by provisioning Azure-specific resource directories, installing SKU-specific topology/customisation artifacts, and wiring them into NCCL and systemd so topology/NVLink/NCCL are tuned per VM type.

Sequence diagram for SKU customisation at boot via systemd service

sequenceDiagram
    actor Admin
    participant Systemd as systemd
    participant Svc as sku_customisation_service
    participant Script as setup_sku_customisations.sh
    participant IMDS as AzureMetadataService
    participant Skuscript as Sku_specific_script
    participant NCCL as etc_nccl_conf
    participant Topo as Topology_runtime_files

    Admin->>Systemd: Enable sku_customisation.service
    Systemd->>Svc: Start at boot
    Svc->>Script: Execute setup_sku_customisations.sh

    Script->>NCCL: Create nccl.conf
    Script->>Topo: Create runtime topology directory

    alt MOCK_SKU set
        Script->>Script: sku = $__MOCK_SKU
    else No MOCK_SKU
        loop Retry up to 5 times
            Script->>IMDS: GET /metadata/instance
            IMDS-->>Script: vmSize
        end
    end

    Script->>Script: Normalise sku to lowercase

    alt Known SKU pattern
        Script->>Skuscript: Execute matching SKU script
        Skuscript->>Topo: Create/link topology and graph files
        Skuscript->>NCCL: Append NCCL tuning options
        opt ND or NC SKU
            Skuscript->>Systemd: Enable and start nvidia-fabricmanager
        end
    else Unknown SKU
        Script->>Script: Print "No SKU customization"
    end

    Script->>Topo: Check for topo.xml
    Script->>NCCL: Append NCCL_TOPO_FILE if topo.xml exists
    Script->>Topo: Check for graph.xml
    Script->>NCCL: Append NCCL_GRAPH_FILE if graph.xml exists

    Script-->>Svc: Exit
    Svc-->>Systemd: Service finished
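The SKU lookup and dispatch steps in the diagram can be sketched as below. This is a hedged sketch: the `__MOCK_SKU` fallback and lowercase normalisation follow the PR description, but the function names and the exact case patterns are illustrative, not the PR's actual code.

```shell
#!/bin/sh
# Sketch of the SKU dispatch logic. get_sku uses __MOCK_SKU when set;
# otherwise the real script queries the Azure IMDS vmSize field with
# retries. dispatch_sku maps a normalised SKU to a per-SKU script name;
# the glob patterns here are illustrative.
get_sku() {
    if [ -n "${__MOCK_SKU}" ]; then
        echo "${__MOCK_SKU}" | tr '[:upper:]' '[:lower:]'
        return
    fi
    # Real script: retry against IMDS, e.g.
    # curl -s -H Metadata:true \
    #   "http://169.254.169.254/metadata/instance/compute/vmSize?api-version=2021-02-01&format=text"
}

dispatch_sku() {
    sku="$1"
    case "${sku}" in
    standard_nd*_v4) echo "ndv4.sh" ;;
    standard_nd*_v5) echo "ndv5.sh" ;;
    standard_nc*_v4) echo "ncv4.sh" ;;
    *)               echo "No SKU customization for ${sku}" ;;
    esac
}
```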

Flow diagram for SKU-specific customisation scripts

flowchart TD
    subgraph NDv4
        A1[ndv4.sh] --> A2[Link ndv4-topo.xml to TOPOLOGY_FILE]
        A2 --> A3[Remove TOPOLOGY_GRAPH]
        A3 --> A4[Append NCCL_IB_PCI_RELAXED_ORDERING=1 to nccl.conf]
        A4 --> A5[Enable nvidia-fabricmanager]
        A5 --> A6[Start nvidia-fabricmanager]
        A6 --> A7[Check nvidia-fabricmanager is active]
        A7 --> A8{Active?}
        A8 -->|Yes| A9[Exit 0]
        A8 -->|No| A10[Print error and exit with code]
    end

    subgraph NDv5
        B1[ndv5.sh] --> B2[Link ndv5-topo.xml to TOPOLOGY_FILE]
        B2 --> B3[Remove TOPOLOGY_GRAPH]
        B3 --> B4[Append NCCL_IB_PCI_RELAXED_ORDERING=1 to nccl.conf]
        B4 --> B5[Enable nvidia-fabricmanager]
        B5 --> B6[Start nvidia-fabricmanager]
        B6 --> B7[Check nvidia-fabricmanager is active]
        B7 --> B8{Active?}
        B8 -->|Yes| B9[Exit 0]
        B8 -->|No| B10[Print error and exit with code]
    end

    subgraph NDv2
        C1[ndv2.sh] --> C2[Link ndv2-topo.xml to TOPOLOGY_FILE]
        C2 --> C3[Remove TOPOLOGY_GRAPH]
    end

    subgraph NCv4
        D1[ncv4.sh] --> D2[Link ncv4-topo.xml to TOPOLOGY_FILE]
        D2 --> D3[Link ncv4-graph.xml to TOPOLOGY_GRAPH]
    end

    subgraph NCv5
        E1[ncv5.sh] --> E2[Remove TOPOLOGY_FILE and TOPOLOGY_GRAPH]
        E2 --> E3[Check NVLink status]
        E3 --> E4{NVLink Inactive?}
        E4 -->|Yes| E5[Wait for Hyper-V PCI devices]
        E5 --> E6[Wait for NVIDIA PCI devices]
        E6 --> E7[Stop nvidia-dcgm.service]
        E7 --> E8[Unload NVIDIA modules]
        E8 --> E9[Load NVIDIA modules]
        E9 --> E10[Start nvidia-dcgm.service]
        E4 -->|No| E11[Skip reload]
        E10 --> E12[Recheck NVLink status]
        E11 --> E12
        E12 --> E13{NVLink Active?}
        E13 -->|Yes| E14[Exit 0]
        E13 -->|No| E15[Print error and exit 1]
    end

    subgraph NDv6
        F1[ndv6.sh] --> F2[Remove TOPOLOGY_FILE and TOPOLOGY_GRAPH]
    end

    subgraph HBv4
        G1[hbv4.sh] --> G2[Remove TOPOLOGY_FILE and TOPOLOGY_GRAPH]
    end

File-Level Changes

Change Details Files
Create Azure HPC resource and runtime directory structure used by SKU customisations.
  • Introduce __hpc_azure_resource_dir, __hpc_azure_tools_dir, __hpc_azure_tests_dir, and __hpc_azure_runtime_dir variables with default locations under /opt/hpc/azure and /var/hpc/azure.
  • Add Ansible tasks to stat and create the Azure resource and runtime directories with appropriate ownership and permissions.
vars/main.yml
tasks/main.yml
Add feature flag and documentation for enabling/disabling SKU customisations.
  • Introduce hpc_sku_customisation boolean defaulting to true.
  • Document hpc_sku_customisation in README with explanation of Azure VM tuning behaviour.
defaults/main.yml
README.md
Install SKU customisation assets and systemd service when enabled.
  • Add conditional Ansible block that checks for existing topology installation and copies topology XMLs and customisation scripts into the Azure resource directory.
  • Template and install setup/remove scripts and sku_customisation systemd unit, then enable the service.
tasks/main.yml
templates/sku/setup_sku_customisations.sh
templates/sku/remove_sku_customisations.sh
templates/sku/sku_customisation.service
Implement SKU-aware setup script that queries Azure IMDS (or a mock) and invokes per-SKU tuning scripts.
  • Create setup_sku_customisations.sh template that initialises /etc/nccl.conf, sets runtime topology directories, obtains the VM SKU from Azure IMDS (with retry) or $__MOCK_SKU, normalises it, and dispatches to SKU-specific scripts via a case statement.
  • Append NCCL_TOPO_FILE and NCCL_GRAPH_FILE to /etc/nccl.conf when corresponding runtime files exist.
templates/sku/setup_sku_customisations.sh
Provide SKU-specific topology descriptions and tuning scripts for key Azure VM types.
  • Add topology XMLs for NCv4, NDv2, NDv4, and NDv5 SKUs, including an additional NCv4 NCCL graph description.
  • Implement per-SKU bash scripts that set or clear topology/graph symlinks, apply NCCL settings, manage NVIDIA Fabric Manager where needed, and in the NCv5 case work around NVLink initialisation issues by waiting for PCI devices and reloading NVIDIA modules.
  • Add simple placeholder scripts for SKUs that currently only clear topology/graph (HBv4, NDv6).
files/sku/topology/ncv4-graph.xml
files/sku/topology/ncv4-topo.xml
files/sku/topology/ndv2-topo.xml
files/sku/topology/ndv4-topo.xml
files/sku/topology/ndv5-topo.xml
files/sku/customisations/ncv4.sh
files/sku/customisations/ndv2.sh
files/sku/customisations/ndv4.sh
files/sku/customisations/ndv5.sh
files/sku/customisations/ndv6.sh
files/sku/customisations/hbv4.sh



@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 3 issues and left some high-level feedback:

  • In the NDv4/NDv5 customisation scripts, the ln -sf ${$TOPOLOGY_SRC_DIR}/... paths use invalid bash parameter expansion (${$VAR}); these should be corrected to ${TOPOLOGY_SRC_DIR} so the symlinks are created correctly.
  • The systemd unit is templated to /etc/systemd/system/ as a directory with mode 0755; consider targeting a concrete unit file path (e.g. /etc/systemd/system/sku_customisation.service) with a typical unit file mode (0644) to avoid relying on implicit naming and directory semantics.
  • The ncv5.sh workaround script calls sudo for module and service operations, but these scripts appear intended to run as root (e.g. via systemd); dropping sudo would simplify execution and avoid failures on systems where sudo is not available or configured.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In the NDv4/NDv5 customisation scripts, the `ln -sf ${$TOPOLOGY_SRC_DIR}/...` paths use invalid bash parameter expansion (`${$VAR}`); these should be corrected to `${TOPOLOGY_SRC_DIR}` so the symlinks are created correctly.
- The systemd unit is templated to `/etc/systemd/system/` as a directory with mode `0755`; consider targeting a concrete unit file path (e.g. `/etc/systemd/system/sku_customisation.service`) with a typical unit file mode (0644) to avoid relying on implicit naming and directory semantics.
- The `ncv5.sh` workaround script calls `sudo` for module and service operations, but these scripts appear intended to run as root (e.g. via systemd); dropping `sudo` would simplify execution and avoid failures on systems where `sudo` is not available or configured.

## Individual Comments

### Comment 1
<location> `files/sku/customisations/ndv4.sh:4` </location>
<code_context>
+#!/bin/bash
+
+# Link the NDv4 topology file, no graph for this machine type
+ln -sf ${$TOPOLOGY_SRC_DIR}/topology/ndv4-topo.xml ${TOPOLOGY_FILE}
+rm -f ${TOPOLOGY_GRAPH}
+
</code_context>

<issue_to_address>
**issue (bug_risk):** Invalid parameter expansion `${$TOPOLOGY_SRC_DIR}` will cause the ln command to fail.

This will either expand to an empty string or throw a bad substitution error, so the symlink is never created. Please change to `${TOPOLOGY_SRC_DIR}` here and make the same fix in `ndv5.sh`.
</issue_to_address>

### Comment 2
<location> `files/sku/customisations/ndv5.sh:4` </location>
<code_context>
+#!/bin/bash
+
+# Link the NDv5 topology file, no graph for this machine type
+ln -sf ${$TOPOLOGY_SRC_DIR}/topology/ndv5-topo.xml ${TOPOLOGY_FILE}
+rm -f ${TOPOLOGY_GRAPH}
+
</code_context>

<issue_to_address>
**issue (bug_risk):** Same `${$TOPOLOGY_SRC_DIR}` parameter expansion bug as in ndv4.sh.

This will prevent the symlink from being created and the node from receiving the correct topology; please change it to `${TOPOLOGY_SRC_DIR}`.
</issue_to_address>

### Comment 3
<location> `files/sku/customisations/ndv4.sh:18` </location>
<code_context>
+error_code=$?
+if [ ${error_code} -ne 0 ]
+then
+    echo "NVIDIA Fabic Manager Inactive!"
+    exit ${error_code}
+fi
</code_context>

<issue_to_address>
**nitpick (typo):** Minor typo in error message for NVIDIA Fabric Manager.

The message currently prints `Fabic` instead of `Fabric`, which may affect log searches or alerts. Please update this here and in `ndv5.sh` where the same string appears.

Suggested implementation:

```
then
    echo "NVIDIA Fabric Manager Inactive!"
    exit ${error_code}
fi

```

You should also:
1. Locate the same string `NVIDIA Fabic Manager Inactive!` in `files/sku/customisations/ndv5.sh`.
2. Replace it with `NVIDIA Fabric Manager Inactive!` in that file as well, using the same edit pattern.
</issue_to_address>


These scripts tune the hardware and software according to the type
of Azure VM the image is running on. They run at machine startup
and source the relevant information according to the machine type
that is detected.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
When testing the setup script with unknown SKUs, it was found that
this fails to clean the configuration files properly:

....
Testing some_unknown_sku_for_testing
No SKU customization for some_unknown_sku_for_testing
Unknown SKU
Failed: some_unknown_sku_for_testing: /etc/nccl.conf not empty
$

Fix this up by removing the various SKU-specific files on unknown
SKU types.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
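A sketch of the fix this commit describes: on an unknown SKU, clear any stale SKU-specific state rather than leaving a previous SKU's configuration behind. The helper name and arguments are illustrative; the file names (nccl.conf, topo.xml, graph.xml) follow the PR description:

```shell
#!/bin/sh
# Sketch of the unknown-SKU cleanup: truncate nccl.conf and remove any
# stale topology/graph links so state from a previous SKU doesn't leak
# into the test failure seen above.
cleanup_unknown_sku() {
    nccl_conf="$1"
    runtime_dir="$2"
    : > "${nccl_conf}"
    rm -f "${runtime_dir}/topo.xml" "${runtime_dir}/graph.xml"
}
```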