@dgchinner dgchinner commented Jan 22, 2026

Enhancement: Test script for the SKU customisations feature in PR #49

This is dependent on the changes in PR #48 and PR #49; the commits are duplicated in the branch this PR is generated from.

Manual (mocked SKU) testing:

$ sudo /opt/hpc/azure/tests/test-sku-setup.sh --manual

Testing standard_nc96ads_a100_v4
Test Passed: standard_nc96ads_a100_v4

Testing standard_nd40rs_v2
Test Passed: standard_nd40rs_v2

Testing standard_nd96asr_v4
Job for nvidia-fabricmanager.service failed because the control process exited with error code.
See "systemctl status nvidia-fabricmanager.service" and "journalctl -xeu nvidia-fabricmanager.service" for details.
NVIDIA Fabric Manager Inactive!
Test Passed: standard_nd96asr_v4

Testing standard_hb176rs_v4
Test Passed: standard_hb176rs_v4

Testing standard_nc80adis_h100_v5
Check NVLink status after reloading NVIDIA kernel modules...
NVLink is Active.
Test Passed: standard_nc80adis_h100_v5

Testing standard_nd96isr_h200_v5
Job for nvidia-fabricmanager.service failed because the control process exited with error code.
See "systemctl status nvidia-fabricmanager.service" and "journalctl -xeu nvidia-fabricmanager.service" for details.
NVIDIA Fabric Manager Inactive!
Test Passed: standard_nd96isr_h200_v5

Testing standard_nd128isr_gb300_v6
Test Passed: standard_nd128isr_gb300_v6

Testing some_unknown_sku_for_testing
No SKU customization for some_unknown_sku_for_testing
Unknown SKU: some_unknown_sku_for_testing
Test Passed: some_unknown_sku_for_testing
$

Testing that the SKU customisation is installed correctly and the service is running on a VM running the built image (e.g. via a CI system):

$ sudo /opt/hpc/azure/tests/test-sku-setup.sh

Testing standard_nc8as_t4_v3
Unknown SKU: standard_nc8as_t4_v3
Test Passed: standard_nc8as_t4_v3
$

Issue Tracker Tickets (Jira or BZ if any): RHELHPC-126

Summary by Sourcery

Add Azure-specific SKU customisation support for NCCL/topology tuning on HPC VMs and provide tooling to manage and validate these configurations.

New Features:

  • Introduce configurable SKU customisation for Azure HPC VM types, including topology, NCCL, and hardware workaround scripts managed via a systemd service.
  • Add Azure HPC resource, tools, tests, and runtime directories to host SKU-specific configuration and runtime data.
  • Provide a test script to validate SKU customisation behaviour both via mocked SKUs and on real Azure VMs.

Enhancements:

  • Document the new hpc_sku_customisation boolean variable and default behaviour in the role README.

Tests:

  • Add SKU customisation test script and supporting files to verify correct installation and runtime behaviour across supported and unknown Azure SKUs.

Define the directory hierarchy for cloud-specific tools, scripts,
resources and tests, and encode them into variables for common usage.

The structure we want to use for static files follows this template:

/opt/hpc/<cloud-vendor>/bin/		# one-off binaries and scripts
/opt/hpc/<cloud-vendor>/lib/...		# resources and libraries
/opt/hpc/<cloud-vendor>/tools/...	# standalone tools
/opt/hpc/<cloud-vendor>/tests/		# test scripts for local/CI testing

For runtime files (e.g. configuration files set up by boot
services), we will store them in:

/var/hpc/<cloud-vendor>/....

These common directories will be defined by the following set of
variables:

__hpc_<cloud>_resource_dir		# /opt/hpc/<cloud-vendor>/
__hpc_<cloud>_tools_dir			# /opt/hpc/<cloud-vendor>/tools/
__hpc_<cloud>_tests_dir			# /opt/hpc/<cloud-vendor>/tests/
__hpc_<cloud>_runtime_dir		# /var/hpc/<cloud-vendor>/

At the moment we only support Azure, so only those variables are
defined.
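
As a rough illustration of the layout described above, the following shell sketch creates the Azure directories under a configurable root. The `make_hpc_dirs` function and its root-prefix argument are hypothetical and exist only so the sketch can run in a scratch directory; the role itself creates these directories through Ansible tasks.

```shell
#!/usr/bin/env bash
# Sketch only: create the Azure directory layout under an optional prefix.
# make_hpc_dirs and its prefix argument are illustrative names, not part of the role.
make_hpc_dirs() {
    local root="${1:-}"
    local resource_dir="${root}/opt/hpc/azure"
    local runtime_dir="${root}/var/hpc/azure"

    # Static files go under /opt/hpc/azure, runtime files under /var/hpc/azure.
    # -m 0755 applies to the final directories; parents use the current umask.
    mkdir -p -m 0755 "${resource_dir}/bin" "${resource_dir}/lib" \
        "${resource_dir}/tools" "${resource_dir}/tests" "${runtime_dir}"
}
```

Calling `make_hpc_dirs "$(mktemp -d)"` builds the tree in a throwaway location, which is a convenient way to inspect the layout without touching `/opt` or `/var`.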

Signed-off-by: Dave Chinner <dchinner@redhat.com>
These scripts tune the hardware and software according to the type
of Azure VM the image is running on. They run at machine startup
and source the relevant information according to the machine type
that is detected.
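
A minimal sketch of how a detected machine type could be mapped to a per-SKU script is shown here. The `sku_handler` function name and the glob patterns are assumptions chosen to match the SKU names that appear in this PR's test output; the patterns used by the real setup script may differ.

```shell
#!/usr/bin/env bash
# Sketch only: map a lowercased VM size to a customisation script name.
# The glob patterns are assumptions, not the patterns from the real script.
sku_handler() {
    case "$1" in
        standard_nc*ads_a100_v4)  echo "ncv4.sh" ;;
        standard_nd40rs_v2)       echo "ndv2.sh" ;;
        standard_nd96*_v4)        echo "ndv4.sh" ;;
        standard_nd96*_v5)        echo "ndv5.sh" ;;
        standard_nc*adis_h100_v5) echo "ncv5.sh" ;;
        standard_nd128*_v6)       echo "ndv6.sh" ;;
        standard_hb176*_v4)       echo "hbv4.sh" ;;
        *)                        return 1 ;;   # unknown SKU: no handler
    esac
}
```

Returning a non-zero status for unmatched names lets the caller decide what to do for unknown SKUs instead of baking that policy into the lookup.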

Signed-off-by: Dave Chinner <dchinner@redhat.com>
When testing the setup script with unknown SKUs, it was found that
this fails to clean the configuration files properly:

....
Testing some_unknown_sku_for_testing
No SKU customization for some_unknown_sku_for_testing
Unknown SKU
Failed: some_unknown_sku_for_testing: /etc/nccl.conf not empty
$

Fix this up by removing the various SKU-specific files on unknown
SKU types.
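
The cleanup described above could look roughly like the following. This is a sketch under stated assumptions: `clean_unknown_sku`, the `topo.xml` and `graph.xml` file names, and the default paths are hypothetical stand-ins for whatever the setup script actually tracks.

```shell
#!/usr/bin/env bash
# Sketch only: remove SKU-specific files when the SKU is not recognised.
# File names and default paths are assumptions for illustration.
clean_unknown_sku() {
    local topo_dir="${1:-/var/hpc/azure/topology}"
    local nccl_conf="${2:-/etc/nccl.conf}"

    # rm -f succeeds even when a file is already absent, so repeated
    # runs on an unknown SKU leave the system in the same clean state.
    rm -f "${topo_dir}/topo.xml" "${topo_dir}/graph.xml" "${nccl_conf}"
}
```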

Signed-off-by: Dave Chinner <dchinner@redhat.com>

sourcery-ai bot commented Jan 22, 2026

Reviewer's Guide

Adds Azure HPC SKU customisation support with scripts, topology files, and tests, wiring them into the Ansible role behind a new hpc_sku_customisation toggle and installing a systemd service to apply SKU-specific NCCL/topology tweaks at boot.

Sequence diagram for SKU customisation at boot via systemd service

sequenceDiagram
    participant Systemd
    participant SkuCustomisationService
    participant SetupScript
    participant AzureIMDS
    participant SkuCustomisationHandler
    participant NvidiaFabricManager
    participant NvidiaDCGM
    participant NCCL

    Systemd->>SkuCustomisationService: start sku_customisation.service
    SkuCustomisationService->>SetupScript: exec setup_sku_customisations.sh

    SetupScript->>SetupScript: init NCCL_CONF / topology runtime dirs
    alt sku mocked
        SetupScript->>SetupScript: read env __MOCK_SKU as sku
    else sku from IMDS
        loop up to 5 retries
            SetupScript->>AzureIMDS: GET /metadata/instance vmSize
            AzureIMDS-->>SetupScript: vmSize or empty
        end
        SetupScript->>SetupScript: tolower(sku)
    end

    alt known SKU pattern
        SetupScript->>SkuCustomisationHandler: run ncv4.sh | ndv4.sh | ndv5.sh | ndv2.sh | ncv5.sh | ndv6.sh | hbv4.sh
        SkuCustomisationHandler->>SetupScript: configure TOPOLOGY_FILE / TOPOLOGY_GRAPH
        opt NCCL tuning
            SkuCustomisationHandler->>SetupScript: append NCCL_* to NCCL_CONF
        end
        opt fabric manager control
            SkuCustomisationHandler->>NvidiaFabricManager: systemctl enable/start
            NvidiaFabricManager-->>SkuCustomisationHandler: is-active status
        end
        opt NVLink workaround
            SkuCustomisationHandler->>NvidiaDCGM: stop nvidia-dcgm.service
            SkuCustomisationHandler->>NvidiaDCGM: reload NVIDIA kernel modules
            SkuCustomisationHandler->>NvidiaDCGM: start nvidia-dcgm.service
        end
        SetupScript->>SetupScript: add NCCL_TOPO_FILE / NCCL_GRAPH_FILE to NCCL_CONF
    else unknown SKU
        SetupScript->>SetupScript: remove TOPOLOGY_FILE / TOPOLOGY_GRAPH / NCCL_CONF
    end

    SetupScript-->>SkuCustomisationService: exit status
    SkuCustomisationService-->>Systemd: service result

    Systemd-->>NCCL: NCCL_CONF ready for MPI jobs at runtime
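
The SKU detection step in the diagram above can be sketched as follows. The `detect_sku` function name, the 1 second pause between retries, the curl timeout, and the exact IMDS query string are assumptions; the diagram only states that the script reads `__MOCK_SKU` when set, otherwise queries IMDS for `vmSize` with up to 5 retries, and lowercases the result.

```shell
#!/usr/bin/env bash
# Sketch only: resolve the VM size, preferring the __MOCK_SKU override.
detect_sku() {
    local sku="${__MOCK_SKU:-}"
    local url="http://169.254.169.254/metadata/instance/compute/vmSize?api-version=2021-02-01&format=text"
    local attempt

    if [ -z "$sku" ]; then
        # Query IMDS up to 5 times; the pause and timeout are assumptions.
        for attempt in 1 2 3 4 5; do
            sku=$(curl -s --noproxy "*" -H "Metadata: true" --max-time 5 "$url") \
                && [ -n "$sku" ] && break
            sku=""
            sleep 1
        done
    fi

    [ -n "$sku" ] || return 1
    # Lowercase so that later pattern matching is case-insensitive.
    printf '%s\n' "$sku" | tr '[:upper:]' '[:lower:]'
}
```

With the override set, for example `__MOCK_SKU=Standard_ND96asr_v4 detect_sku`, no network request is made, which is what makes mocked testing possible on machines outside Azure.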

Sequence diagram for manual SKU setup test using mocked SKU

sequenceDiagram
    actor Admin
    participant TestScript
    participant SetupScript
    participant SkuCustomisationHandler
    participant NvidiaFabricManager

    Admin->>TestScript: sudo test-sku-setup.sh --manual
    TestScript->>TestScript: select test SKU
    loop for each test_sku
        TestScript->>TestScript: export __MOCK_SKU=test_sku
        TestScript->>SetupScript: run setup_sku_customisations.sh
        SetupScript->>SetupScript: detect __MOCK_SKU and set sku
        SetupScript->>SkuCustomisationHandler: dispatch per SKU
        opt SKU requires fabric manager
            SkuCustomisationHandler->>NvidiaFabricManager: enable/start
            NvidiaFabricManager-->>SkuCustomisationHandler: status
        end
        SetupScript-->>TestScript: exit code
        alt exit code 0
            TestScript->>Admin: "Test Passed: test_sku"
        else failure
            TestScript->>Admin: log failure details
        end
    end
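
The manual test loop in the diagram above can be sketched as follows. `run_mock_tests`, its argument order, and the wording of the failure message are hypothetical; only the `__MOCK_SKU` variable and the "Testing" / "Test Passed" output lines come from this PR.

```shell
#!/usr/bin/env bash
# Sketch only: run a setup command once per mocked SKU and report results.
# run_mock_tests and the failure message text are illustrative.
run_mock_tests() {
    local setup_cmd="$1"; shift
    local sku status failures=0

    for sku in "$@"; do
        echo "Testing ${sku}"
        # __MOCK_SKU is scoped to this single invocation of the setup command.
        if __MOCK_SKU="$sku" "$setup_cmd"; then
            echo "Test Passed: ${sku}"
        else
            status=$?
            echo "Failed: ${sku}: setup exited with status ${status}"
            failures=$((failures + 1))
        fi
    done
    # Succeed only if every SKU passed.
    [ "$failures" -eq 0 ]
}
```

Passing the setup command as an argument keeps the loop testable with a stub function, without needing the real setup script or root privileges.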

File-Level Changes

Change: Introduce Azure HPC resource/runtime directory layout and default variables, and ensure directories are created during role execution.
Details:
  • Define __hpc_azure_resource_dir, __hpc_azure_tools_dir, __hpc_azure_tests_dir, and __hpc_azure_runtime_dir in vars/main.yml.
  • Add an Ansible task block to stat and create the resource and runtime directories with root ownership and 0755 permissions.
Files:
  • vars/main.yml
  • tasks/main.yml

Change: Add an Ansible-controlled SKU customisation feature guarded by a new hpc_sku_customisation boolean variable.
Details:
  • Introduce hpc_sku_customisation default variable (true) and document it in README.md including its purpose for Azure SKU-specific tuning.
  • Add a task block that, when hpc_sku_customisation is true and not already installed, copies topology and customisation files, installs setup/removal scripts and tests, and installs/enables the sku_customisation systemd service.
Files:
  • defaults/main.yml
  • README.md
  • tasks/main.yml

Change: Provide SKU setup, removal, and test scripts that drive SKU-specific NCCL and topology configuration using Azure IMDS or a mocked SKU for tests.
Details:
  • Implement setup_sku_customisations.sh to query Azure IMDS (or use __MOCK_SKU), select per-SKU customisation scripts, manage topology files under /var/hpc/azure/topology, and populate /etc/nccl.conf including NCCL_TOPO_FILE/NCCL_GRAPH_FILE.
  • Implement remove_sku_customisations.sh to stop/disable nvidia-fabricmanager, unload nvidia_peermem, remove runtime topology files, and clear /etc/nccl.conf.
  • Implement test-sku-setup.sh to run manual mode tests over a fixed SKU list (via __MOCK_SKU) or CI mode using the real VM SKU, asserting presence/absence of topology/graph files and nccl.conf contents, and verifying the sku customisation service is active for supported SKUs.
Files:
  • templates/sku/setup_sku_customisations.sh
  • templates/sku/remove_sku_customisations.sh
  • templates/sku/test-sku-setup.sh

Change: Deliver per-SKU topology descriptions and customisation scripts for various Azure GPU/HPC VM types, including NVLink workaround for NCv5 and NCCL tuning/Fabric Manager enablement for NDv4/NDv5.
Details:
  • Add static topology XML files for NCv4, NDv2, NDv4, and NDv5 SKUs describing GPU and NIC PCI layout and, where applicable, NVLink relationships.
  • Add customisation scripts ncv4.sh, ndv2.sh, ndv4.sh, ndv5.sh, ndv6.sh, hbv4.sh, and ncv5.sh to configure topology symlinks, manage topology/graph removal when unused, set NCCL_IB_PCI_RELAXED_ORDERING where appropriate, and manage nvidia-fabricmanager and NVLink reinitialisation for NCv5.
  • Wire these SKU customisation scripts into setup_sku_customisations.sh’s case statement keyed by VM size patterns (e.g. standard_ndv4, standard_nc80adis_h100_v5, standard_nd128is_gb[2-3]00_v6).
Files:
  • files/sku/topology/ncv4-graph.xml
  • files/sku/topology/ncv4-topo.xml
  • files/sku/topology/ndv2-topo.xml
  • files/sku/topology/ndv4-topo.xml
  • files/sku/topology/ndv5-topo.xml
  • files/sku/customisations/ncv4.sh
  • files/sku/customisations/ndv2.sh
  • files/sku/customisations/ndv4.sh
  • files/sku/customisations/ndv5.sh
  • files/sku/customisations/ndv6.sh
  • files/sku/customisations/hbv4.sh
  • files/sku/customisations/ncv5.sh


@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 1 issue, and left some high level feedback:

  • The systemd unit name used in the test script (systemctl is-active --quiet sku-customisations) does not match the installed service name (sku_customisation.service); aligning these (and choosing a single spelling/format) will avoid false negatives in service state checks.
  • The IMDS SKU lookup and retry logic is duplicated between setup_sku_customisations.sh and test-sku-setup.sh; consider factoring this into a common helper or at least a shared function to keep the behavior consistent and easier to change.
  • Several SKU customisation scripts (e.g. NDv4/NDv5) contain very similar nvidia-fabricmanager enable/start/error-handling blocks; extracting this into a common helper or function would reduce repetition and the risk of divergent behavior between SKUs.
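
One possible shape for the shared Fabric Manager helper suggested in the last point is sketched here. `start_fabric_manager` is a hypothetical name, and returning a non-zero status when the unit is inactive is a design assumption; the "NVIDIA Fabric Manager Inactive!" message matches the output shown in the PR description.

```shell
#!/usr/bin/env bash
# Sketch only: one shared helper for the fabric manager start-up sequence.
# start_fabric_manager is a hypothetical name, not an existing function.
start_fabric_manager() {
    local unit="nvidia-fabricmanager.service"

    # Failures here are tolerated; the is-active check below decides the result.
    systemctl enable "$unit" >/dev/null 2>&1 || true
    systemctl start "$unit" || true

    if ! systemctl is-active --quiet "$unit"; then
        echo "NVIDIA Fabric Manager Inactive!"
        return 1
    fi
}
```

Each per-SKU script could then source the helper and call it once, so a change to the error handling only needs to be made in one place.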

## Individual Comments

### Comment 1
<location> `README.md:198` </location>
<code_context>
+Whether to install the hardware tuning files for different Azure VM types (SKUs).
+
+This will install definitions for optimal hardware configurations for the different types of high performance VMs that are typically used for HPC workloads in the Azure environment.
+These include Infiniband and GPU/NVLink and NCCL customisations, as well as any workarounds for specific hardware problems that may be needed.
+
+Default: `true`
</code_context>

<issue_to_address>
**suggestion (typo):** Use the standard spelling "InfiniBand" for the interconnect name.

Use the vendor-standard capitalization "InfiniBand" to match industry and Azure documentation conventions.

```suggestion
These include InfiniBand and GPU/NVLink and NCCL customisations, as well as any workarounds for specific hardware problems that may be needed.
```
</issue_to_address>


Add a script to enable both manual and automated testing of the
Azure SKU customisation scripts.

When running the tests manually, it will exercise all the different
supported SKU types via mocking and check that the appropriate links
are installed. It will not check that the customisation service is
active and running, as manual mode is expected to be used on dev
machines that are unsupported SKU types.

Manual testing like this may throw some warnings or errors because
hardware is not directly supported. For example, testing on a VM
type that does not have GPUs that are supported by the fabric
manager will result in warnings that the service failed to start:

$ sudo /opt/hpc/azure/tests/test-sku-setup.sh --manual
Testing standard_nc96ads_a100_v4
Test Passed: standard_nc96ads_a100_v4
Testing standard_nd40rs_v2
Test Passed: standard_nd40rs_v2
Testing standard_nd96asr_v4
Job for nvidia-fabricmanager.service failed because the control process exited with error code.
See "systemctl status nvidia-fabricmanager.service" and "journalctl -xeu nvidia-fabricmanager.service" for details.
NVIDIA Fabric Manager Inactive!
Test Passed: standard_nd96asr_v4
Testing standard_hb176rs_v4
Test Passed: standard_hb176rs_v4
Testing standard_nc80adis_h100_v5
Check NVLink status after reloading NVIDIA kernel modules...
NVLink is Active.
Test Passed: standard_nc80adis_h100_v5
Testing standard_nd96isr_h200_v5
Job for nvidia-fabricmanager.service failed because the control process exited with error code.
See "systemctl status nvidia-fabricmanager.service" and "journalctl -xeu nvidia-fabricmanager.service" for details.
NVIDIA Fabric Manager Inactive!
Test Passed: standard_nd96isr_h200_v5
$

Such warnings are fine.

When not in manual mode, the test expects that it is running on a
supported SKU VM (e.g. in the CI system) and will query the current
SKU type.

If the SKU is unsupported, it will check that no files are currently
installed, and it will fail if any are found:

$ sudo /opt/hpc/azure/tests/test-sku-setup.sh
Unknown SKU
Failed: Standard_NC8as_T4_v3: /etc/nccl.conf not empty
$

If the SKU is supported, it will check that appropriate files are
installed and the service is running.
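
The unsupported-SKU check described above might be expressed along these lines. `check_unsupported_sku` and its optional path argument are hypothetical; the failure message format follows the example output in this commit message.

```shell
#!/usr/bin/env bash
# Sketch only: on an unsupported SKU, fail if any NCCL configuration remains.
# check_unsupported_sku is a hypothetical name for illustration.
check_unsupported_sku() {
    local sku="$1"
    local nccl_conf="${2:-/etc/nccl.conf}"

    # -s is true only for an existing file with a size greater than zero,
    # so a missing file and an empty file both count as clean.
    if [ -s "$nccl_conf" ]; then
        echo "Failed: ${sku}: ${nccl_conf} not empty"
        return 1
    fi
    echo "Test Passed: ${sku}"
}
```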

Signed-off-by: Dave Chinner <dchinner@redhat.com>
@dgchinner dgchinner force-pushed the test-sku-customisations branch from 6201e98 to abb3d95 Compare January 22, 2026 07:54