Installing Moby container runtime and NVIDIA Container Toolkit #47
Conversation
Reviewer's Guide

Adds optional installation and configuration of Docker using moby-engine/moby-cli and NVIDIA Container Toolkit, including repo/package definitions, runtime configuration for Docker and containerd, and associated Ansible variables and handlers.

Sequence diagram for provisioning GPU-enabled container runtimes:

```mermaid
sequenceDiagram
    actor Admin
    participant AnsibleController
    participant TargetNode
    participant DockerService as docker
    participant Containerd as containerd
    participant NvidiaCTK as nvidia_container_toolkit_runtime
    participant GPUDriver as nvidia_driver_GPU
    Admin->>AnsibleController: Run HPC role
    AnsibleController->>TargetNode: Apply tasks with hpc_install_docker=true
    AnsibleController->>TargetNode: Install moby-engine and moby-cli
    AnsibleController->>TargetNode: Enable and restart docker service
    AnsibleController->>TargetNode: Add nvidia-container-toolkit repo
    AnsibleController->>TargetNode: Install nvidia-container-toolkit
    AnsibleController->>TargetNode: nvidia-ctk runtime configure --runtime=docker
    AnsibleController->>TargetNode: Ensure /etc/containerd and config.toml
    AnsibleController->>TargetNode: Enable SystemdCgroup in config.toml
    AnsibleController->>TargetNode: nvidia-ctk runtime configure --runtime=containerd --set-as-default
    AnsibleController->>Containerd: Notify Restart containerd service
    Containerd-->>AnsibleController: containerd restarted
    Admin->>DockerService: docker run --rm --gpus all nvidia/cuda:12.9.0-base-ubi9 nvidia-smi
    DockerService->>Containerd: Create and start container
    Containerd->>NvidiaCTK: Use NVIDIA runtime for container
    NvidiaCTK->>GPUDriver: Initialize GPU access
    GPUDriver-->>NvidiaCTK: GPU ready
    NvidiaCTK-->>Containerd: Runtime configured with GPU
    Containerd-->>DockerService: Container running with GPU access
    DockerService-->>Admin: nvidia-smi output with GPU details
```
Hey - I've found 3 issues, and left some high level feedback:

- The `nvidia-ctk runtime configure` tasks always report `changed` when `rc == 0`, which will repeatedly reconfigure the runtimes and always restart containerd; consider parsing stdout or using a more specific `changed_when`/guard so they are idempotent and only trigger a restart when configuration actually changes.
- The Docker service is restarted directly in the install block; consider switching this to a handler so Docker restarts are deferred and coalesced with any other changes that might also need a restart (see the sketch after this list).
- The package lists for `__hpc_docker_packages` and `__hpc_nvidia_container_toolkit_packages` are pinned to exact versions, which may make maintenance and upgrades harder; consider allowing version overrides or using unversioned names with separate variables to control pinning.
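A minimal sketch of the handler-based restart suggested in the second point. The handler name is the one already defined in this PR; the install task shown here is illustrative:

```yaml
# tasks/main.yml (sketch): let a handler restart Docker instead of restarting in-line
- name: Install Docker packages
  package:
    name: "{{ __hpc_docker_packages }}"
    state: present
  notify: Restart docker service  # deferred and coalesced with other notifications

# handlers/main.yml (as in this PR)
- name: Restart docker service
  service:
    name: docker
    state: restarted
```

With a handler, Docker is restarted at most once per play, and only when one of the notifying tasks actually reports a change.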
## Individual Comments
### Comment 1
<location> `tasks/main.yml:723-726` </location>
<code_context>
+ not in __hpc_versionlock_check.stdout
+ loop: "{{ __hpc_nvidia_container_toolkit_packages }}"
+
+ - name: Configure NVIDIA Container Toolkit for Docker runtime
+ command: nvidia-ctk runtime configure --runtime=docker
+ register: __hpc_nvidia_ctk_docker_config
+ changed_when: __hpc_nvidia_ctk_docker_config.rc == 0
+ failed_when: false
+
</code_context>
<issue_to_address>
**suggestion:** The `nvidia-ctk` Docker runtime configuration task will report `changed` on every run, reducing idempotency signal.
Because `changed_when` is tied only to `rc == 0`, this task will always be reported as changed on success, even when the runtime is already configured. That obscures real changes and can trigger handlers unnecessarily. Consider deriving `changed_when` from the command output (e.g., only when it reports creating/updating config) or gating the task with a `creates`/similar condition if `nvidia-ctk` manages a specific file.
Suggested implementation:
```
- name: Configure NVIDIA Container Toolkit for Docker runtime
command: nvidia-ctk runtime configure --runtime=docker
args:
creates: /etc/nvidia-container-runtime/config.toml
register: __hpc_nvidia_ctk_docker_config
changed_when: __hpc_nvidia_ctk_docker_config.rc == 0
failed_when: false
```
Depending on how `nvidia-ctk` behaves in your environment, you may want to:
1. Confirm that `/etc/nvidia-container-runtime/config.toml` is the correct file created/managed by `nvidia-ctk runtime configure --runtime=docker` and adjust the path if your distribution uses a different location.
2. Consider removing `failed_when: false` if you *do* want a non-zero return code from `nvidia-ctk` on the first (and only) run to fail the play, rather than silently ignoring errors.
</issue_to_address>
### Comment 2
<location> `tasks/main.yml:750-727` </location>
<code_context>
+ line: ' SystemdCgroup = true'
+ backup: true
+
+ - name: Configure NVIDIA Container Toolkit for containerd runtime
+ command: nvidia-ctk runtime configure --runtime=containerd --set-as-default
+ register: __hpc_nvidia_ctk_containerd_config
+ changed_when: __hpc_nvidia_ctk_containerd_config.rc == 0
+ failed_when: false
+ notify: Restart containerd service
+
</code_context>
<issue_to_address>
**issue (bug_risk):** Ignoring failures from `nvidia-ctk` for containerd may hide misconfigurations or missing dependencies.
Because `failed_when: false` is set, any failure from `nvidia-ctk` (missing binary, bad flags, or missing containerd) is treated as success, making broken GPU setups hard to detect. If you need to tolerate only some failure modes, please narrow the condition (e.g., check specific return codes or error messages) instead of suppressing all failures, and consider logging a warning when configuration is skipped or fails.
</issue_to_address>
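One possible way to narrow the failure handling, sketched under the assumption that a missing `nvidia-ctk` binary is the only failure mode worth tolerating; the check task and its register name are illustrative:

```yaml
- name: Check whether nvidia-ctk is available
  command: which nvidia-ctk
  register: __hpc_nvidia_ctk_present
  changed_when: false
  failed_when: false

- name: Configure NVIDIA Container Toolkit for containerd runtime
  command: nvidia-ctk runtime configure --runtime=containerd --set-as-default
  register: __hpc_nvidia_ctk_containerd_config
  when: __hpc_nvidia_ctk_present.rc == 0
  notify: Restart containerd service

- name: Warn when nvidia-ctk is missing
  debug:
    msg: "nvidia-ctk not found; skipping containerd runtime configuration"
  when: __hpc_nvidia_ctk_present.rc != 0
```

Dropping `failed_when: false` on the configure task means a genuine `nvidia-ctk` error (bad flags, broken containerd setup) now fails the play instead of being silently ignored.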
### Comment 3
<location> `README.md:155` </location>
<code_context>
+### hpc_install_docker
+
+Whether to install moby-engine and moby-cli packages, and enable docker service.
+
+Default: `true`
</code_context>
<issue_to_address>
**nitpick (typo):** Consider removing the comma and adding articles for smoother grammar.
For example: "Whether to install the moby-engine and moby-cli packages and enable the Docker service."
```suggestion
Whether to install the moby-engine and moby-cli packages and enable the Docker service.
```
</issue_to_address>
Force-pushed: c03e6dc to eb33cb1
Please prefix the PR title with
```yaml
  description: Microsoft Production repository
  key: https://packages.microsoft.com/keys/microsoft.asc
  baseurl: https://packages.microsoft.com/rhel/9/prod/
__hpc_nvidia_container_toolkit_repo:
```
Add a comment above saying that only the url is used here. Nice to have name and description just for info.
Other variables in this form install repositories using the yum_repository module.
@spetrosi Updated. Earlier I tried to use yum_repository, but noticed that the repository contains both nvidia-container-toolkit and nvidia-container-toolkit-experimental, so I switched to the get_url method. Thank you so much.
```yaml
  loop: "{{ __hpc_nvidia_container_toolkit_packages }}"

- name: Configure NVIDIA Container Toolkit for Docker runtime
  command: nvidia-ctk runtime configure --runtime=docker
```
The role should be idempotent - i.e. run actions only when needed.
So the first time the role runs this task it results in state: changed, but the second time in skipped or ok.
One option would be to add a task above that runs some command to check whether this command has been run before, and then add a condition to this task so that it runs only when needed.
The other option is to use changed_when and check the output of this command, like you do above in Prevent updates of NVIDIA Container Toolkit packages.
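A minimal sketch of the first option, assuming the Docker configuration written by `nvidia-ctk` ends up in /etc/docker/daemon.json; the check task and its register name are illustrative:

```yaml
- name: Check whether the NVIDIA runtime is already configured for Docker
  command: grep -q nvidia /etc/docker/daemon.json
  register: __hpc_docker_nvidia_check
  changed_when: false
  failed_when: false  # a missing file simply means "not configured yet"

- name: Configure NVIDIA Container Toolkit for Docker runtime
  command: nvidia-ctk runtime configure --runtime=docker
  when: __hpc_docker_nvidia_check.rc != 0
```

On the second run the grep succeeds, the configure task is skipped, and the play reports ok instead of changed.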
@spetrosi updated based on https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html, thank you so much for your detailed guidance.
tasks/main.yml (outdated)

```yaml
    mode: '0755'

- name: Generate default containerd config
  shell: containerd config default > /etc/containerd/config.toml
```
Here again, only do this if the file doesn't exist.
You might be able to use the creates keyword of the shell module for this: https://docs.ansible.com/projects/ansible/latest/collections/ansible/builtin/shell_module.html - you'll have to use the cmd form instead of the "free form" command:

```yaml
- name: Generate default containerd config
  shell:
    cmd: containerd config default > /etc/containerd/config.toml
    creates: /etc/containerd/config.toml
```
- Install moby-cli and moby-engine
- Install nvidia-container-toolkit package
- Configure containerd to use NVIDIA runtime
Force-pushed: eb33cb1 to 6be2ac5
Updated, thank you so much.
Whether to install the moby-engine and moby-cli packages as well as enable the Docker service.

Default: `true`
Should this be true by default?
Ah - I see you must install docker if hpc_install_nvidia_container_toolkit is true

- Is it possible to use `podman` instead of `docker`, on platforms that support `podman`? We at Red Hat strongly prefer the use of `podman`.
- You could make it so that the default value of `hpc_install_docker` is set to `true` if `hpc_install_nvidia_container_toolkit` is true, e.g. in the defaults file: `hpc_install_docker: "{{ hpc_install_nvidia_container_toolkit }}"` (a fuller defaults sketch follows below). That way, if the user does not set hpc_install_docker explicitly, it will be set to true implicitly if the user sets hpc_install_nvidia_container_toolkit: true.
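A sketch of how the defaults file could express that relationship; the `false` default for the toolkit variable is an assumption for illustration, not taken from this PR:

```yaml
# defaults/main.yml (sketch)
hpc_install_nvidia_container_toolkit: false

# Docker is pulled in automatically when the toolkit is requested,
# unless the user sets hpc_install_docker explicitly.
hpc_install_docker: "{{ hpc_install_nvidia_container_toolkit }}"
```

Because role defaults have the lowest variable precedence, any explicit `hpc_install_docker` setting from the user still wins.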
@richm Updated, thank you so much for your detailed comment. About "Is it possible to use podman instead of docker", this is a good question. I need to confirm with the team, but my understanding is that we haven't fully tested the NVIDIA Container Toolkit with Podman yet. While Podman may be a future consideration, it's likely not covered in the current commit.
README.md (outdated)

Type: `bool`

### hpc_install_nvidia_container_toolkit
This doesn't list the default value and type
Updated, thank you so much.
handlers/main.yml (outdated)

```yaml
- name: Restart docker service
  service:
    name: docker
    state: restarted
```
missing newline at end of file
Updated, thank you so much.
```yaml
    not in __hpc_versionlock_check.stdout
  loop: "{{ __hpc_nvidia_container_toolkit_packages }}"

- name: Check if NVIDIA runtime is configured in Docker daemon
```
Prefer to use the Ansible find module for this: https://docs.ansible.com/projects/ansible/latest/collections/ansible/builtin/find_module.html

```yaml
- name: Check if NVIDIA runtime is configured in Docker daemon
  find:
    paths: [/etc/docker]
    recurse: false
    patterns: daemon.json
    use_regex: false
    contains: nvidia
  register: docker_nvidia_runtime_check
```

then change the when condition in Configure NVIDIA Container Toolkit for Docker runtime
re: hpc_install_docker - since it doesn't make sense to ever set it to false, just get rid of it, and automatically install docker if `hpc_install_nvidia_container_toolkit: true`.

If you want to support podman too, you could have a new variable `hpc_container_runtime` - default value is `docker` - the user can set `hpc_container_runtime: podman` to use podman.
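A hedged sketch of what such a switch could look like; the `__hpc_podman_packages` list and the exact task layout are assumptions, not part of this PR:

```yaml
# defaults/main.yml (sketch)
hpc_container_runtime: docker  # users may set 'podman' on platforms that support it

# tasks/main.yml (sketch): pick the package list for the selected runtime
- name: Install container runtime packages
  package:
    name: "{{ __hpc_docker_packages if hpc_container_runtime == 'docker' else __hpc_podman_packages }}"
    state: present
```

Podman typically consumes GPUs via CDI rather than a Docker daemon runtime entry, so the Docker/containerd configuration steps here would likely not apply as-is when `hpc_container_runtime: podman`.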
@richm Updated to use the find module, thank you so much for your detailed guidance. Regarding Podman, I will confirm with the team; it probably won't be covered in this commit as part of the short-term plan.
tasks/main.yml (outdated)

```yaml
- name: Configure NVIDIA Container Toolkit for Docker runtime
  command: nvidia-ctk runtime configure --runtime=docker
  when: docker_nvidia_runtime_check.rc != 0
```
`when: docker_nvidia_runtime_check.matched == 0`
tasks/main.yml (outdated)

```yaml
  register: nvidia_containerd_dropin

- name: Check if containerd config has drop-in imports
  command: grep -q 'imports = \["/etc/containerd/conf.d/\*\.toml"\]' /etc/containerd/config.toml
```
Use the Ansible find module method described above to check if a file exists with a specific string
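A minimal sketch of that check for the containerd drop-in imports, modeled on the find example above; the register name is illustrative:

```yaml
- name: Check if containerd config has drop-in imports
  find:
    paths: [/etc/containerd]
    recurse: false
    patterns: config.toml
    use_regex: false
    contains: 'imports = \["/etc/containerd/conf.d/\*\.toml"\]'
  register: containerd_imports_check
```

The drop-in task can then be gated with `when: containerd_imports_check.matched == 0`.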
Force-pushed: 5a86e9c to 5fe6da5
Enhancement:

Reason: Add new packages support for Moby container runtime and NVIDIA Container Toolkit

Reference: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html

Result:
```
docker run --rm --gpus all nvidia/cuda:12.9.0-base-ubi9 nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08 Driver Version: 575.57.08 CUDA Version: 12.9 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla T4 On | 00000001:00:00.0 Off | Off |
| N/A 32C P8 12W / 70W | 0MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
```
Issue Tracker Tickets (Jira or BZ if any):
RHELHPC-107
Summary by Sourcery
Add optional installation and configuration of Moby-based Docker and NVIDIA Container Toolkit with GPU-enabled container runtimes.
New Features:
Enhancements:
Documentation: