
Conversation

@lixuemin2016

@lixuemin2016 lixuemin2016 commented Jan 20, 2026

Enhancement:

  • Install moby-cli and moby-engine
  • Install nvidia-container-toolkit package
  • Configure containerd to use NVIDIA runtime

Reason:
Add support for the Moby container runtime and NVIDIA Container Toolkit packages.

Reference: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
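As a rough sketch, the enhancement bullets above map to Ansible tasks along these lines (task layout here is illustrative only; the actual implementation lives in this role's tasks/main.yml):

```yaml
# Illustrative sketch only -- not the role's exact tasks.
- name: Install Moby container runtime packages
  package:
    name:
      - moby-engine
      - moby-cli
    state: present

- name: Install NVIDIA Container Toolkit
  package:
    name: nvidia-container-toolkit
    state: present

- name: Configure containerd to use the NVIDIA runtime as default
  command: nvidia-ctk runtime configure --runtime=containerd --set-as-default
  notify: Restart containerd service
```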

Result:
docker run --rm --gpus all nvidia/cuda:12.9.0-base-ubi9 nvidia-smi

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08              Driver Version: 575.57.08      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       On  |   00000001:00:00.0 Off |                  Off |
| N/A   32C    P8             12W /  70W  |       0MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Issue Tracker Tickets (Jira or BZ if any):
RHELHPC-107

Summary by Sourcery

Add optional installation and configuration of Moby-based Docker and NVIDIA Container Toolkit with GPU-enabled container runtimes.

New Features:

  • Introduce role option to install moby-engine and moby-cli and manage the Docker service on RHEL 9 systems.
  • Introduce role option to install and configure NVIDIA Container Toolkit for Docker and containerd GPU support.

Enhancements:

  • Configure containerd with SystemdCgroup and NVIDIA runtime as default when NVIDIA Container Toolkit is enabled.
  • Add a safety check to prevent enabling NVIDIA Container Toolkit when Docker installation is disabled.
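The safety check mentioned above could be a simple fail task, roughly like the following (a sketch only; the actual message and condition come from tasks/main.yml):

```yaml
- name: Fail if NVIDIA Container Toolkit is requested without Docker
  fail:
    msg: >-
      hpc_install_nvidia_container_toolkit: true requires
      hpc_install_docker: true
  when:
    - hpc_install_nvidia_container_toolkit | bool
    - not hpc_install_docker | bool
```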

Documentation:

  • Document new hpc_install_docker and hpc_install_nvidia_container_toolkit variables and their defaults in the role README.

@sourcery-ai

sourcery-ai bot commented Jan 20, 2026

Reviewer's Guide

Adds optional installation and configuration of Docker using moby-engine/moby-cli and NVIDIA Container Toolkit, including repo/package definitions, runtime configuration for Docker and containerd, and associated Ansible variables and handlers.

Sequence diagram for provisioning GPU-enabled container runtimes

sequenceDiagram
  actor Admin
  participant AnsibleController
  participant TargetNode
  participant DockerService as docker
  participant Containerd as containerd
  participant NvidiaCTK as nvidia_container_toolkit_runtime
  participant GPUDriver as nvidia_driver_GPU

  Admin->>AnsibleController: Run HPC role
  AnsibleController->>TargetNode: Apply tasks with hpc_install_docker=true
  AnsibleController->>TargetNode: Install moby-engine and moby-cli
  AnsibleController->>TargetNode: Enable and restart docker service
  AnsibleController->>TargetNode: Add nvidia-container-toolkit repo
  AnsibleController->>TargetNode: Install nvidia-container-toolkit
  AnsibleController->>TargetNode: nvidia-ctk runtime configure --runtime=docker
  AnsibleController->>TargetNode: Ensure /etc/containerd and config.toml
  AnsibleController->>TargetNode: Enable SystemdCgroup in config.toml
  AnsibleController->>TargetNode: nvidia-ctk runtime configure --runtime=containerd --set-as-default
  AnsibleController->>Containerd: Notify Restart containerd service
  Containerd-->>AnsibleController: containerd restarted

  Admin->>DockerService: docker run --rm --gpus all nvidia/cuda:12.9.0-base-ubi9 nvidia-smi
  DockerService->>Containerd: Create and start container
  Containerd->>NvidiaCTK: Use NVIDIA runtime for container
  NvidiaCTK->>GPUDriver: Initialize GPU access
  GPUDriver-->>NvidiaCTK: GPU ready
  NvidiaCTK-->>Containerd: Runtime configured with GPU
  Containerd-->>DockerService: Container running with GPU access
  DockerService-->>Admin: nvidia-smi output with GPU details

File-Level Changes

Gate NVIDIA Container Toolkit installation on Docker presence to prevent unsupported configurations.
  • Add a fail task that aborts when NVIDIA Container Toolkit installation is requested without enabling Docker
  • Require hpc_install_docker to be true when hpc_install_nvidia_container_toolkit is true
tasks/main.yml
Introduce Ansible tasks to install and lock Docker (Moby) packages and manage the docker service.
  • Install moby-engine and moby-cli via a package task with ostree-aware backend selection
  • Enable and restart the docker service after package installation
  • Add dnf versionlocks for the Docker-related packages to prevent unintended updates
tasks/main.yml
vars/RedHat_9.yml
Introduce Ansible tasks to install, lock, and configure NVIDIA Container Toolkit for Docker and containerd runtimes.
  • Add NVIDIA Container Toolkit repo via get_url and install toolkit packages with ostree-aware backend selection
  • Version-lock NVIDIA Container Toolkit packages using dnf versionlock
  • Run nvidia-ctk to configure Docker runtime integration, ignoring failures
  • Create containerd config directory, generate default config.toml, and enable SystemdCgroup via lineinfile
  • Run nvidia-ctk to configure containerd as default NVIDIA runtime and trigger containerd restart via handler
tasks/main.yml
vars/RedHat_9.yml
handlers/main.yml
Expose new configuration toggles and default values for Docker and NVIDIA Container Toolkit in the role.
  • Add hpc_install_docker and hpc_install_nvidia_container_toolkit variables to defaults with true as the default
  • Document the new variables and their behavior in the README, including that NVIDIA Container Toolkit enables GPU support in Docker and containerd
  • Define Red Hat 9–specific repo and package lists for NVIDIA Container Toolkit and Docker (Moby) in vars
defaults/main.yml
README.md
vars/RedHat_9.yml
Add handler to restart containerd when NVIDIA runtime configuration changes.
  • Introduce a Restart containerd service handler that restarts the containerd service
  • Wire the handler to be notified when nvidia-ctk containerd configuration is applied
handlers/main.yml
tasks/main.yml
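The handler wiring described in the last entry typically looks like this (a sketch following standard Ansible conventions; the handler name matches the one quoted in the review):

```yaml
# handlers/main.yml -- sketch
- name: Restart containerd service
  service:
    name: containerd
    state: restarted
```

The task running `nvidia-ctk runtime configure --runtime=containerd` then adds `notify: Restart containerd service`, so containerd is restarted at most once per play, after all configuration changes are applied.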


@sourcery-ai sourcery-ai bot left a comment

Hey - I've found 3 issues, and left some high level feedback:

  • The nvidia-ctk runtime configure tasks always report changed when rc == 0, which will repeatedly reconfigure the runtimes and always restart containerd; consider parsing stdout or using a more specific changed_when/guard so they are idempotent and only trigger a restart when configuration actually changes.
  • The Docker service is restarted directly in the install block; consider switching this to a handler so Docker restarts are deferred and coalesced with any other changes that might also need a restart.
  • The package lists for __hpc_docker_packages and __hpc_nvidia_container_toolkit_packages are pinned to exact versions, which may make maintenance and upgrades harder; consider allowing version overrides or using unversioned names with separate variables to control pinning.
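One possible shape for the third point, using unversioned package names plus a separate pin variable (everything beyond the quoted `__hpc_*` list names is hypothetical):

```yaml
# defaults/main.yml -- hypothetical override point; empty string means "latest"
hpc_docker_package_version: ""

# vars/RedHat_9.yml -- append the version only when one is pinned
__hpc_docker_packages:
  - "moby-engine{{ '-' + hpc_docker_package_version if hpc_docker_package_version else '' }}"
  - "moby-cli{{ '-' + hpc_docker_package_version if hpc_docker_package_version else '' }}"
```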
## Individual Comments

### Comment 1
<location> `tasks/main.yml:723-726` </location>
<code_context>
+        not in __hpc_versionlock_check.stdout
+      loop: "{{ __hpc_nvidia_container_toolkit_packages }}"
+
+    - name: Configure NVIDIA Container Toolkit for Docker runtime
+      command: nvidia-ctk runtime configure --runtime=docker
+      register: __hpc_nvidia_ctk_docker_config
+      changed_when: __hpc_nvidia_ctk_docker_config.rc == 0
+      failed_when: false
+
</code_context>

<issue_to_address>
**suggestion:** The `nvidia-ctk` Docker runtime configuration task will report `changed` on every run, reducing idempotency signal.

Because `changed_when` is tied only to `rc == 0`, this task will always be reported as changed on success, even when the runtime is already configured. That obscures real changes and can trigger handlers unnecessarily. Consider deriving `changed_when` from the command output (e.g., only when it reports creating/updating config) or gating the task with a `creates`/similar condition if `nvidia-ctk` manages a specific file.

Suggested implementation:

```
    - name: Configure NVIDIA Container Toolkit for Docker runtime
      command: nvidia-ctk runtime configure --runtime=docker
      args:
        creates: /etc/nvidia-container-runtime/config.toml
      register: __hpc_nvidia_ctk_docker_config
      changed_when: __hpc_nvidia_ctk_docker_config.rc == 0
      failed_when: false

```

Depending on how `nvidia-ctk` behaves in your environment, you may want to:
1. Confirm that `/etc/nvidia-container-runtime/config.toml` is the correct file created/managed by `nvidia-ctk runtime configure --runtime=docker` and adjust the path if your distribution uses a different location.
2. Consider removing `failed_when: false` if you *do* want a non-zero return code from `nvidia-ctk` on the first (and only) run to fail the play, rather than silently ignoring errors.
</issue_to_address>

### Comment 2
<location> `tasks/main.yml:750-727` </location>
<code_context>
+        line: '            SystemdCgroup = true'
+        backup: true
+
+    - name: Configure NVIDIA Container Toolkit for containerd runtime
+      command: nvidia-ctk runtime configure --runtime=containerd --set-as-default
+      register: __hpc_nvidia_ctk_containerd_config
+      changed_when: __hpc_nvidia_ctk_containerd_config.rc == 0
+      failed_when: false
+      notify: Restart containerd service
+
</code_context>

<issue_to_address>
**issue (bug_risk):** Ignoring failures from `nvidia-ctk` for containerd may hide misconfigurations or missing dependencies.

Because `failed_when: false` is set, any failure from `nvidia-ctk` (missing binary, bad flags, or missing containerd) is treated as success, making broken GPU setups hard to detect. If you need to tolerate only some failure modes, please narrow the condition (e.g., check specific return codes or error messages) instead of suppressing all failures, and consider logging a warning when configuration is skipped or fails.
</issue_to_address>

### Comment 3
<location> `README.md:155` </location>
<code_context>

+### hpc_install_docker
+
+Whether to install moby-engine and moby-cli packages, and enable docker service.
+
+Default: `true`
</code_context>

<issue_to_address>
**nitpick (typo):** Consider removing the comma and adding articles for smoother grammar.

For example: "Whether to install the moby-engine and moby-cli packages and enable the Docker service."

```suggestion
Whether to install the moby-engine and moby-cli packages and enable the Docker service.
```
</issue_to_address>


@spetrosi
Collaborator

Please prefix the PR title with feat: to indicate that this introduces a feature. We use it to build changelog and versioning.

  description: Microsoft Production repository
  key: https://packages.microsoft.com/keys/microsoft.asc
  baseurl: https://packages.microsoft.com/rhel/9/prod/
__hpc_nvidia_container_toolkit_repo:
Collaborator

Add a comment above saying that only the url is used here; having name and description is nice just for info.
The other variables in this form install repositories using the yum_repository module.

Author

@spetrosi updated. I initially tried to use yum_repository, but noticed that the repository contains both nvidia-container-toolkit and nvidia-container-toolkit-experimental, so I switched to the get_url method. Thank you so much.

      loop: "{{ __hpc_nvidia_container_toolkit_packages }}"

    - name: Configure NVIDIA Container Toolkit for Docker runtime
      command: nvidia-ctk runtime configure --runtime=docker
Collaborator

The role should be idempotent, i.e. it should run actions only when needed.
So the first time the role runs this task it results in state: changed, but the second time it should result in state: skipped (or ok).
One option would be to add a task above that runs some command to check whether this command has already been run, and then add a condition to this task so it runs only when needed.
The other option is to use changed_when and check the output of this command, like you do above in Prevent updates of NVIDIA Container Toolkit packages.
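For the second option, a hedged sketch (the marker string that `nvidia-ctk` prints when it writes the config file is an assumption here and must be verified against the tool's real output before relying on it):

```yaml
- name: Configure NVIDIA Container Toolkit for Docker runtime
  command: nvidia-ctk runtime configure --runtime=docker
  register: __hpc_nvidia_ctk_docker_config
  # Assumed marker string -- check actual nvidia-ctk output (it may log to
  # stderr rather than stdout) before using this in production.
  changed_when: >-
    'Wrote updated config' in
    (__hpc_nvidia_ctk_docker_config.stdout + __hpc_nvidia_ctk_docker_config.stderr)
```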

Author

@lixuemin2016 lixuemin2016 Jan 22, 2026

@spetrosi updated based on https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html, thank you so much for your detailed guidance.

tasks/main.yml Outdated
        mode: '0755'

    - name: Generate default containerd config
      shell: containerd config default > /etc/containerd/config.toml
Collaborator

Here again, only do this if file doesn't exist.

Contributor

@richm richm Jan 20, 2026

You might be able to use the creates keyword of the shell module for this: https://docs.ansible.com/projects/ansible/latest/collections/ansible/builtin/shell_module.html - you'll have to use the cmd form instead of the "free form" command:

    - name: Generate default containerd config
      shell:
        cmd: containerd config default > /etc/containerd/config.toml
        creates:  /etc/containerd/config.toml

Author

@lixuemin2016 lixuemin2016 Jan 22, 2026

@richm @spetrosi, Updated, thank you so much for your detailed example.

- Install moby-cli and moby-engine
- Install nvidia-container-toolkit package
- Configure containerd to use NVIDIA runtime
@lixuemin2016
Author

Please prefix the PR title with feat: to indicate that this introduces a feature. We use it to build changelog and versioning.

Updated, thank you so much.


Whether to install the moby-engine and moby-cli packages as well as enable the Docker service.

Default: `true`
Contributor

Should this be true by default?

Contributor

Ah - I see you must install docker if hpc_install_nvidia_container_toolkit is true

  1. Is it possible to use podman instead of docker, on platforms that support podman? We at Red Hat strongly prefer the use of podman.

  2. You could make it so that the default value of hpc_install_docker is set to true if hpc_install_nvidia_container_toolkit is true e.g. in the defaults file:

hpc_install_docker: "{{ hpc_install_nvidia_container_toolkit }}"

that way, if the user does not set hpc_install_docker explicitly, it will be set to true implicitly if the user sets hpc_install_nvidia_container_toolkit: true

Author

@richm Updated, thank you so much for your detailed comment. Regarding "Is it possible to use podman instead of docker": that is a good question. I need to confirm with the team, but my understanding is that we haven't fully tested the NVIDIA Container Toolkit with Podman yet. While Podman may be a future consideration, it is likely not covered in the current commit.

README.md Outdated

Type: `bool`

### hpc_install_nvidia_container_toolkit
Contributor

This doesn't list the default value and type

Author

Updated, thank you so much.

- name: Restart docker service
  service:
    name: docker
    state: restarted
Contributor

missing newline at end of file

Author

Updated, thank you so much.



        not in __hpc_versionlock_check.stdout
      loop: "{{ __hpc_nvidia_container_toolkit_packages }}"

    - name: Check if NVIDIA runtime is configured in Docker daemon
Contributor

Prefer to use the Ansible find module for this: https://docs.ansible.com/projects/ansible/latest/collections/ansible/builtin/find_module.html

    - name: Check if NVIDIA runtime is configured in Docker daemon
      find:
        paths: [/etc/docker]
        recurse: false
        patterns: daemon.json
        use_regex: false
        contains: nvidia
      register: docker_nvidia_runtime_check

then change the when condition in Configure NVIDIA Container Toolkit for Docker runtime

Contributor

re: hpc_install_docker - since it doesn't make sense to ever set it to false, just get rid of it, and automatically install docker if hpc_install_nvidia_container_toolkit: true

If you want to support podman too, you could have a new variable

hpc_container_runtime - default value is docker - user can set hpc_container_runtime: podman to use podman

Author

@richm updated to use the find module, thank you so much for your detailed guidance. Regarding Podman, I will confirm with the team; it will probably not be covered by this commit in the short term.

tasks/main.yml Outdated

    - name: Configure NVIDIA Container Toolkit for Docker runtime
      command: nvidia-ctk runtime configure --runtime=docker
      when: docker_nvidia_runtime_check.rc != 0
Contributor

when: docker_nvidia_runtime_check.matched == 0

tasks/main.yml Outdated
      register: nvidia_containerd_dropin

    - name: Check if containerd config has drop-in imports
      command: grep -q 'imports = \["/etc/containerd/conf.d/\*\.toml"\]' /etc/containerd/config.toml
Contributor

Use the Ansible find module method described above to check if a file exists with a specific string
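Applied to this check, a find-based version might look like the following (note that find's `contains` parameter is a regular expression, so the literal dots, brackets, and asterisk need escaping; the register name is from the quoted task):

```yaml
- name: Check if containerd config has drop-in imports
  find:
    paths: [/etc/containerd]
    recurse: false
    patterns: config.toml
    use_regex: false
    contains: 'imports = \["/etc/containerd/conf\.d/\*\.toml"\]'
  register: containerd_imports_check

# The follow-up task would then use:
#   when: containerd_imports_check.matched == 0
```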
