SKU customisations #49
Define the directory hierarchy for cloud-specific tools, scripts, resources and tests, and encode them into variables for common usage. The structure we want to use for static files follows this template:

```
/opt/hpc/<cloud-vendor>/bin/      # one-off binaries and scripts
/opt/hpc/<cloud-vendor>/lib/...   # resources and libraries
/opt/hpc/<cloud-vendor>/tools/... # standalone tools
/opt/hpc/<cloud-vendor>/tests/    # test scripts for local/CI testing
```

For runtime files (e.g. configuration files set up by boot services), we will store them in /var/hpc/<cloud-vendor>/....

These common directories will be defined by the following set of variables:

```
__hpc_<cloud>_resource_dir  # /opt/hpc/<cloud-vendor>/
__hpc_<cloud>_tools_dir     # /opt/hpc/<cloud-vendor>/tools/
__hpc_<cloud>_tests_dir     # /opt/hpc/<cloud-vendor>/tests/
__hpc_<cloud>_runtime_dir   # /var/hpc/<cloud-vendor>/
```

At the moment we only support Azure, so only those variables are defined.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
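For illustration, with Azure as the only supported cloud these variables resolve to the paths below; this is a sketch of the layout described above, not the role's actual variable definitions.

```bash
# Illustrative Azure values for the common directory variables (sketch only)
__hpc_azure_resource_dir="/opt/hpc/azure/"      # static resources and libraries
__hpc_azure_tools_dir="/opt/hpc/azure/tools/"   # standalone tools
__hpc_azure_tests_dir="/opt/hpc/azure/tests/"   # local/CI test scripts
__hpc_azure_runtime_dir="/var/hpc/azure/"       # runtime files written at boot
```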
Reviewer's Guide
Adds Azure SKU-based customisation support to the HPC role by provisioning Azure-specific resource directories, installing SKU-specific topology/customisation artifacts, and wiring them into NCCL and systemd so topology/NVLink/NCCL are tuned per VM type.
Sequence diagram for SKU customisation at boot via the systemd service:
```mermaid
sequenceDiagram
actor Admin
participant Systemd as systemd
participant Svc as sku_customisation_service
participant Script as setup_sku_customisations.sh
participant IMDS as AzureMetadataService
participant Skuscript as Sku_specific_script
participant NCCL as etc_nccl_conf
participant Topo as Topology_runtime_files
Admin->>Systemd: Enable sku_customisation.service
Systemd->>Svc: Start at boot
Svc->>Script: Execute setup_sku_customisations.sh
Script->>NCCL: Create nccl.conf
Script->>Topo: Create runtime topology directory
alt MOCK_SKU set
Script->>Script: sku = $__MOCK_SKU
else No MOCK_SKU
loop Retry up to 5 times
Script->>IMDS: GET /metadata/instance
IMDS-->>Script: vmSize
end
end
Script->>Script: Normalise sku to lowercase
alt Known SKU pattern
Script->>Skuscript: Execute matching SKU script
Skuscript->>Topo: Create/link topology and graph files
Skuscript->>NCCL: Append NCCL tuning options
opt ND or NC SKU
Skuscript->>Systemd: Enable and start nvidia-fabricmanager
end
else Unknown SKU
Script->>Script: Print "No SKU customization"
end
Script->>Topo: Check for topo.xml
Script->>NCCL: Append NCCL_TOPO_FILE if topo.xml exists
Script->>Topo: Check for graph.xml
Script->>NCCL: Append NCCL_GRAPH_FILE if graph.xml exists
Script-->>Svc: Exit
Svc-->>Systemd: Service finished
```
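A minimal bash sketch of the boot-time flow above, assuming illustrative paths, an illustrative SKU-to-script mapping, and the standard Azure IMDS endpoint; it is not the actual setup script from this PR:

```bash
#!/bin/bash
# Sketch of the boot-time flow in the sequence diagram (illustrative only).
NCCL_CONF=/etc/nccl.conf
RUNTIME_TOPO_DIR=/var/hpc/azure/topology               # assumed runtime location
SKU_SCRIPT_DIR=/opt/hpc/azure/lib/sku/customisations   # assumed script location

: > "${NCCL_CONF}"              # start with a clean nccl.conf
mkdir -p "${RUNTIME_TOPO_DIR}"  # runtime topology directory

if [ -n "${__MOCK_SKU}" ]; then
    sku="${__MOCK_SKU}"         # test override, bypasses IMDS
else
    # Query the Azure Instance Metadata Service for the VM size, retrying a
    # few times in case networking is not yet up at boot.
    for i in 1 2 3 4 5; do
        sku=$(curl -sf -H "Metadata: true" \
            "http://169.254.169.254/metadata/instance/compute/vmSize?api-version=2021-02-01&format=text")
        [ -n "${sku}" ] && break
        sleep 2
    done
fi

sku="${sku,,}"                  # normalise to lower case

case "${sku}" in
standard_nd96*) "${SKU_SCRIPT_DIR}/ndv4.sh" ;;    # pattern is illustrative
standard_nc96*) "${SKU_SCRIPT_DIR}/ncv4.sh" ;;    # pattern is illustrative
*)              echo "No SKU customization for ${sku}" ;;
esac

# Point NCCL at whatever topology/graph files the SKU script installed.
[ -e "${RUNTIME_TOPO_DIR}/topo.xml" ] && \
    echo "NCCL_TOPO_FILE=${RUNTIME_TOPO_DIR}/topo.xml" >> "${NCCL_CONF}"
[ -e "${RUNTIME_TOPO_DIR}/graph.xml" ] && \
    echo "NCCL_GRAPH_FILE=${RUNTIME_TOPO_DIR}/graph.xml" >> "${NCCL_CONF}"
```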
Flow diagram for the SKU-specific customisation scripts:
```mermaid
flowchart TD
subgraph NDv4
A1[ndv4.sh] --> A2[Link ndv4-topo.xml to TOPOLOGY_FILE]
A2 --> A3[Remove TOPOLOGY_GRAPH]
A3 --> A4[Append NCCL_IB_PCI_RELAXED_ORDERING=1 to nccl.conf]
A4 --> A5[Enable nvidia-fabricmanager]
A5 --> A6[Start nvidia-fabricmanager]
A6 --> A7[Check nvidia-fabricmanager is active]
A7 --> A8{Active?}
A8 -->|Yes| A9[Exit 0]
A8 -->|No| A10[Print error and exit with code]
end
subgraph NDv5
B1[ndv5.sh] --> B2[Link ndv5-topo.xml to TOPOLOGY_FILE]
B2 --> B3[Remove TOPOLOGY_GRAPH]
B3 --> B4[Append NCCL_IB_PCI_RELAXED_ORDERING=1 to nccl.conf]
B4 --> B5[Enable nvidia-fabricmanager]
B5 --> B6[Start nvidia-fabricmanager]
B6 --> B7[Check nvidia-fabricmanager is active]
B7 --> B8{Active?}
B8 -->|Yes| B9[Exit 0]
B8 -->|No| B10[Print error and exit with code]
end
subgraph NDv2
C1[ndv2.sh] --> C2[Link ndv2-topo.xml to TOPOLOGY_FILE]
C2 --> C3[Remove TOPOLOGY_GRAPH]
end
subgraph NCv4
D1[ncv4.sh] --> D2[Link ncv4-topo.xml to TOPOLOGY_FILE]
D2 --> D3[Link ncv4-graph.xml to TOPOLOGY_GRAPH]
end
subgraph NCv5
E1[ncv5.sh] --> E2[Remove TOPOLOGY_FILE and TOPOLOGY_GRAPH]
E2 --> E3[Check NVLink status]
E3 --> E4{NVLink Inactive?}
E4 -->|Yes| E5[Wait for Hyper-V PCI devices]
E5 --> E6[Wait for NVIDIA PCI devices]
E6 --> E7[Stop nvidia-dcgm.service]
E7 --> E8[Unload NVIDIA modules]
E8 --> E9[Load NVIDIA modules]
E9 --> E10[Start nvidia-dcgm.service]
E4 -->|No| E11[Skip reload]
E10 --> E12[Recheck NVLink status]
E11 --> E12
E12 --> E13{NVLink Active?}
E13 -->|Yes| E14[Exit 0]
E13 -->|No| E15[Print error and exit 1]
end
subgraph NDv6
F1[ndv6.sh] --> F2[Remove TOPOLOGY_FILE and TOPOLOGY_GRAPH]
end
subgraph HBv4
G1[hbv4.sh] --> G2[Remove TOPOLOGY_FILE and TOPOLOGY_GRAPH]
end
```
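For concreteness, here is a sketch of what an NDv4/NDv5-style customisation script does according to the flowchart, with the `${TOPOLOGY_SRC_DIR}` expansion written correctly (see the review comments below). Variable values are assumed to be exported by the calling setup script; this is not the literal ndv4.sh from the PR.

```bash
#!/bin/bash
# Sketch of the NDv4/NDv5 branch of the flowchart (illustrative only).
# TOPOLOGY_SRC_DIR, TOPOLOGY_FILE and TOPOLOGY_GRAPH are assumed to be set
# by the calling setup script.

# Link the SKU topology file; this machine type has no graph file.
ln -sf "${TOPOLOGY_SRC_DIR}/topology/ndv4-topo.xml" "${TOPOLOGY_FILE}"
rm -f "${TOPOLOGY_GRAPH}"

# NCCL tuning for this SKU.
echo "NCCL_IB_PCI_RELAXED_ORDERING=1" >> /etc/nccl.conf

# NVSwitch-based machines need the fabric manager running.
systemctl enable nvidia-fabricmanager
systemctl start nvidia-fabricmanager
systemctl is-active --quiet nvidia-fabricmanager
error_code=$?
if [ ${error_code} -ne 0 ]; then
    echo "NVIDIA Fabric Manager Inactive!"
    exit ${error_code}
fi
exit 0
```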
Hey - I've found 3 issues and left some high-level feedback:
- In the NDv4/NDv5 customisation scripts, the `ln -sf ${$TOPOLOGY_SRC_DIR}/...` paths use invalid bash parameter expansion (`${$VAR}`); these should be corrected to `${TOPOLOGY_SRC_DIR}` so the symlinks are created correctly.
- The systemd unit is templated to `/etc/systemd/system/` as a directory with mode `0755`; consider targeting a concrete unit file path (e.g. `/etc/systemd/system/sku_customisation.service`) with a typical unit file mode (`0644`) to avoid relying on implicit naming and directory semantics.
- The `ncv5.sh` workaround script calls `sudo` for module and service operations, but these scripts appear intended to run as root (e.g. via systemd); dropping `sudo` would simplify execution and avoid failures on systems where `sudo` is not available or configured (see the sketch after this list).
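As an illustration of the last point, a root-run version of the NCv5 NVLink workaround (per the flow diagram in the reviewer's guide) needs no `sudo`. The module list and the `nvidia-smi`-based NVLink check below are assumptions, not the PR's actual `ncv5.sh`:

```bash
#!/bin/bash
# Sketch of the NCv5 workaround flow, run as root so no sudo is needed.
rm -f "${TOPOLOGY_FILE}" "${TOPOLOGY_GRAPH}"    # no static topology for this SKU

if nvidia-smi nvlink --status | grep -qi inactive; then
    # Wait for the NVIDIA PCI devices to appear (the real flow also waits for
    # the Hyper-V PCI bus), then bounce DCGM and the NVIDIA kernel modules.
    for i in $(seq 1 30); do
        lspci -d 10de: | grep -q . && break
        sleep 2
    done
    systemctl stop nvidia-dcgm.service
    modprobe -r nvidia_uvm nvidia_drm nvidia_modeset nvidia   # assumed module set
    modprobe -a nvidia nvidia_modeset nvidia_drm nvidia_uvm
    systemctl start nvidia-dcgm.service
fi

if nvidia-smi nvlink --status | grep -qi inactive; then
    echo "NVLink still inactive after module reload"
    exit 1
fi
exit 0
```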
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In the NDv4/NDv5 customisation scripts, the `ln -sf ${$TOPOLOGY_SRC_DIR}/...` paths use invalid bash parameter expansion (`${$VAR}`); these should be corrected to `${TOPOLOGY_SRC_DIR}` so the symlinks are created correctly.
- The systemd unit is templated to `/etc/systemd/system/` as a directory with mode `0755`; consider targeting a concrete unit file path (e.g. `/etc/systemd/system/sku_customisation.service`) with a typical unit file mode (0644) to avoid relying on implicit naming and directory semantics.
- The `ncv5.sh` workaround script calls `sudo` for module and service operations, but these scripts appear intended to run as root (e.g. via systemd); dropping `sudo` would simplify execution and avoid failures on systems where `sudo` is not available or configured.
## Individual Comments
### Comment 1
<location> `files/sku/customisations/ndv4.sh:4` </location>
<code_context>
+#!/bin/bash
+
+# Link the NDv4 topology file, no graph for this machine type
+ln -sf ${$TOPOLOGY_SRC_DIR}/topology/ndv4-topo.xml ${TOPOLOGY_FILE}
+rm -f ${TOPOLOGY_GRAPH}
+
</code_context>
<issue_to_address>
**issue (bug_risk):** Invalid parameter expansion `${$TOPOLOGY_SRC_DIR}` will cause the ln command to fail.
This will either expand to an empty string or throw a bad substitution error, so the symlink is never created. Please change to `${TOPOLOGY_SRC_DIR}` here and make the same fix in `ndv5.sh`.
</issue_to_address>
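To make the failure mode concrete, `${$VAR}` is rejected by bash rather than expanding to the variable's value; a quick sketch with illustrative paths:

```bash
TOPOLOGY_SRC_DIR=/opt/hpc/azure/lib
TOPOLOGY_FILE=/var/hpc/azure/topology/topo.xml   # illustrative target path

# Broken: bash reports "bad substitution" for ${$TOPOLOGY_SRC_DIR}, so no link is made.
#ln -sf ${$TOPOLOGY_SRC_DIR}/topology/ndv4-topo.xml ${TOPOLOGY_FILE}

# Fixed form suggested by the review:
ln -sf "${TOPOLOGY_SRC_DIR}/topology/ndv4-topo.xml" "${TOPOLOGY_FILE}"
```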
### Comment 2
<location> `files/sku/customisations/ndv5.sh:4` </location>
<code_context>
+#!/bin/bash
+
+# Link the NDv5 topology file, no graph for this machine type
+ln -sf ${$TOPOLOGY_SRC_DIR}/topology/ndv5-topo.xml ${TOPOLOGY_FILE}
+rm -f ${TOPOLOGY_GRAPH}
+
</code_context>
<issue_to_address>
**issue (bug_risk):** Same `${$TOPOLOGY_SRC_DIR}` parameter expansion bug as in ndv4.sh.
This will prevent the symlink from being created and the node from receiving the correct topology; please change it to `${TOPOLOGY_SRC_DIR}`.
</issue_to_address>
### Comment 3
<location> `files/sku/customisations/ndv4.sh:18` </location>
<code_context>
+error_code=$?
+if [ ${error_code} -ne 0 ]
+then
+ echo "NVIDIA Fabic Manager Inactive!"
+ exit ${error_code}
+fi
</code_context>
<issue_to_address>
**nitpick (typo):** Minor typo in error message for NVIDIA Fabric Manager.
The message currently prints `Fabic` instead of `Fabric`, which may affect log searches or alerts. Please update this here and in `ndv5.sh` where the same string appears.
Suggested implementation:
```
then
echo "NVIDIA Fabric Manager Inactive!"
exit ${error_code}
fi
```
You should also:
1. Locate the same string `NVIDIA Fabic Manager Inactive!` in `files/sku/customisations/ndv5.sh`.
2. Replace it with `NVIDIA Fabric Manager Inactive!` in that file as well, using the same edit pattern.
</issue_to_address>
These scripts tune the hardware and software according to the type of Azure VM the image is running on. They run at machine startup and source the relevant information according to the machine type that is detected. Signed-off-by: Dave Chinner <dchinner@redhat.com>
Force-pushed d5488c4 to 2a71672
When testing the setup script with unknown SKUs, it was found that this fails to clean the configuration files properly:

```
....
Testing some_unknown_sku_for_testing
No SKU customization for some_unknown_sku_for_testing
Unknown SKU
Failed: some_unknown_sku_for_testing: /etc/nccl.conf not empty
$
```

Fix this up by removing the various SKU-specific files on unknown SKU types.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
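A minimal sketch of the kind of cleanup this fix implies for unrecognised SKUs; the exact paths are assumed from the layout described earlier, not copied from the patch:

```bash
# Unknown SKU: make sure no stale customisation survives from a previous run.
: > /etc/nccl.conf                          # test expects an empty nccl.conf
rm -f /var/hpc/azure/topology/topo.xml \
      /var/hpc/azure/topology/graph.xml     # assumed runtime link names
```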
Enhancement: Add SKU-based customisations for optimal hardware performance
Reason: Different VM types have different hardware and network topologies that need specific optimisations. This improves GPU performance on individual machines as well as cluster-wide networking performance.
Testing: installation and removal have been tested by mocking the API lookup that returns the current machine's SKU string. Manual installation can be run via:

```
# __MOCK_SKU=<sku type> /opt/hpc/azure/bin/setup_customisations.sh
```

to check that /etc/nccl.conf and the topology and graph file links in /var/hpc/azure/ have been set up properly.
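For example, after a mocked run one might inspect the results like this (paths taken from the description above):

```bash
cat /etc/nccl.conf        # should contain the SKU's NCCL tuning settings
ls -l /var/hpc/azure/     # should show the topology/graph links for the SKU
```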
Issue Tracker Tickets (Jira or BZ if any): RHELHPC-114
This PR depends on functionality in #48; it contains the same commits for managing HPC resource directories.
Summary by Sourcery
Add Azure VM SKU-aware customisation support to the HPC role, including resource directory management and automated hardware tuning setup.