Installing Moby container runtime and NVIDIA Container Toolkit #47
Conversation
Reviewer's Guide

Adds optional installation and configuration of Docker using moby-engine/moby-cli and NVIDIA Container Toolkit, including repo/package definitions, runtime configuration for Docker and containerd, and associated Ansible variables and handlers.

Sequence diagram for provisioning GPU-enabled container runtimes:

```mermaid
sequenceDiagram
    actor Admin
    participant AnsibleController
    participant TargetNode
    participant DockerService as docker
    participant Containerd as containerd
    participant NvidiaCTK as nvidia_container_toolkit_runtime
    participant GPUDriver as nvidia_driver_GPU
    Admin->>AnsibleController: Run HPC role
    AnsibleController->>TargetNode: Apply tasks with hpc_install_docker=true
    AnsibleController->>TargetNode: Install moby-engine and moby-cli
    AnsibleController->>TargetNode: Enable and restart docker service
    AnsibleController->>TargetNode: Add nvidia-container-toolkit repo
    AnsibleController->>TargetNode: Install nvidia-container-toolkit
    AnsibleController->>TargetNode: nvidia-ctk runtime configure --runtime=docker
    AnsibleController->>TargetNode: Ensure /etc/containerd and config.toml
    AnsibleController->>TargetNode: Enable SystemdCgroup in config.toml
    AnsibleController->>TargetNode: nvidia-ctk runtime configure --runtime=containerd --set-as-default
    AnsibleController->>Containerd: Notify Restart containerd service
    Containerd-->>AnsibleController: containerd restarted
    Admin->>DockerService: docker run --rm --gpus all nvidia/cuda:12.9.0-base-ubi9 nvidia-smi
    DockerService->>Containerd: Create and start container
    Containerd->>NvidiaCTK: Use NVIDIA runtime for container
    NvidiaCTK->>GPUDriver: Initialize GPU access
    GPUDriver-->>NvidiaCTK: GPU ready
    NvidiaCTK-->>Containerd: Runtime configured with GPU
    Containerd-->>DockerService: Container running with GPU access
    DockerService-->>Admin: nvidia-smi output with GPU details
```
Hey - I've found 3 issues, and left some high level feedback:

- The `nvidia-ctk runtime configure` tasks always report `changed` when `rc == 0`, which will repeatedly reconfigure the runtimes and always restart containerd; consider parsing stdout or using a more specific `changed_when`/guard so they are idempotent and only trigger a restart when configuration actually changes.
- The Docker service is restarted directly in the install block; consider switching this to a handler so Docker restarts are deferred and coalesced with any other changes that might also need a restart (see the sketch after this list).
- The package lists for `__hpc_docker_packages` and `__hpc_nvidia_container_toolkit_packages` are pinned to exact versions, which may make maintenance and upgrades harder; consider allowing version overrides or using unversioned names with separate variables to control pinning.
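A minimal sketch of the handler-based restart suggested in the second point. The handler name is the one already defined in this PR; the install task shown here is illustrative:

```yaml
# tasks/main.yml (sketch): let a handler restart Docker instead of restarting in-line
- name: Install Docker packages
  package:
    name: "{{ __hpc_docker_packages }}"
    state: present
  notify: Restart docker service  # deferred and coalesced with other notifications

# handlers/main.yml (as in this PR)
- name: Restart docker service
  service:
    name: docker
    state: restarted
```

With a handler, Docker is restarted at most once per play, and only when one of the notifying tasks actually reports a change.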
## Individual Comments
### Comment 1
<location> `tasks/main.yml:723-726` </location>
<code_context>
+ not in __hpc_versionlock_check.stdout
+ loop: "{{ __hpc_nvidia_container_toolkit_packages }}"
+
+ - name: Configure NVIDIA Container Toolkit for Docker runtime
+ command: nvidia-ctk runtime configure --runtime=docker
+ register: __hpc_nvidia_ctk_docker_config
+ changed_when: __hpc_nvidia_ctk_docker_config.rc == 0
+ failed_when: false
+
</code_context>
<issue_to_address>
**suggestion:** The `nvidia-ctk` Docker runtime configuration task will report `changed` on every run, reducing idempotency signal.
Because `changed_when` is tied only to `rc == 0`, this task will always be reported as changed on success, even when the runtime is already configured. That obscures real changes and can trigger handlers unnecessarily. Consider deriving `changed_when` from the command output (e.g., only when it reports creating/updating config) or gating the task with a `creates`/similar condition if `nvidia-ctk` manages a specific file.
Suggested implementation:
```
- name: Configure NVIDIA Container Toolkit for Docker runtime
command: nvidia-ctk runtime configure --runtime=docker
args:
creates: /etc/nvidia-container-runtime/config.toml
register: __hpc_nvidia_ctk_docker_config
changed_when: __hpc_nvidia_ctk_docker_config.rc == 0
failed_when: false
```
Depending on how `nvidia-ctk` behaves in your environment, you may want to:
1. Confirm that `/etc/nvidia-container-runtime/config.toml` is the correct file created/managed by `nvidia-ctk runtime configure --runtime=docker` and adjust the path if your distribution uses a different location.
2. Consider removing `failed_when: false` if you *do* want a non-zero return code from `nvidia-ctk` on the first (and only) run to fail the play, rather than silently ignoring errors.
</issue_to_address>
### Comment 2
<location> `tasks/main.yml:750-727` </location>
<code_context>
+ line: ' SystemdCgroup = true'
+ backup: true
+
+ - name: Configure NVIDIA Container Toolkit for containerd runtime
+ command: nvidia-ctk runtime configure --runtime=containerd --set-as-default
+ register: __hpc_nvidia_ctk_containerd_config
+ changed_when: __hpc_nvidia_ctk_containerd_config.rc == 0
+ failed_when: false
+ notify: Restart containerd service
+
</code_context>
<issue_to_address>
**issue (bug_risk):** Ignoring failures from `nvidia-ctk` for containerd may hide misconfigurations or missing dependencies.
Because `failed_when: false` is set, any failure from `nvidia-ctk` (missing binary, bad flags, or missing containerd) is treated as success, making broken GPU setups hard to detect. If you need to tolerate only some failure modes, please narrow the condition (e.g., check specific return codes or error messages) instead of suppressing all failures, and consider logging a warning when configuration is skipped or fails.
</issue_to_address>
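One possible way to narrow the failure handling, sketched under the assumption that a missing `nvidia-ctk` binary is the only failure mode worth tolerating; the check task and its register name are illustrative:

```yaml
- name: Check whether nvidia-ctk is available
  command: which nvidia-ctk
  register: __hpc_nvidia_ctk_present
  changed_when: false
  failed_when: false

- name: Configure NVIDIA Container Toolkit for containerd runtime
  command: nvidia-ctk runtime configure --runtime=containerd --set-as-default
  register: __hpc_nvidia_ctk_containerd_config
  when: __hpc_nvidia_ctk_present.rc == 0
  notify: Restart containerd service

- name: Warn when nvidia-ctk is missing
  debug:
    msg: "nvidia-ctk not found; skipping containerd runtime configuration"
  when: __hpc_nvidia_ctk_present.rc != 0
```

Dropping `failed_when: false` on the configure task means a genuine `nvidia-ctk` error (bad flags, broken containerd setup) now fails the play instead of being silently ignored.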
### Comment 3
<location> `README.md:155` </location>
<code_context>
+### hpc_install_docker
+
+Whether to install moby-engine and moby-cli packages, and enable docker service.
+
+Default: `true`
</code_context>
<issue_to_address>
**nitpick (typo):** Consider removing the comma and adding articles for smoother grammar.
For example: "Whether to install the moby-engine and moby-cli packages and enable the Docker service."
```suggestion
Whether to install the moby-engine and moby-cli packages and enable the Docker service.
```
</issue_to_address>
Force-pushed: c03e6dc to eb33cb1
Please prefix the PR title with
```yaml
  description: Microsoft Production repository
  key: https://packages.microsoft.com/keys/microsoft.asc
  baseurl: https://packages.microsoft.com/rhel/9/prod/
__hpc_nvidia_container_toolkit_repo:
```
Add a comment above saying that only the url is used here. Nice to have name and description just for info.
Other variables in this form install repositories using the yum_repository module.
@spetrosi Updated. Earlier I tried to use yum_repository, but noticed that the repository contains both nvidia-container-toolkit and nvidia-container-toolkit-experimental, so I switched to the get_url method. Thank you so much.
```yaml
  loop: "{{ __hpc_nvidia_container_toolkit_packages }}"

- name: Configure NVIDIA Container Toolkit for Docker runtime
  command: nvidia-ctk runtime configure --runtime=docker
```
The role should be idempotent - i.e. run actions only when needed.
So the first time the role runs this task it results in state: changed, but the second time in skipped or ok.
One option would be to add a task above that runs some command to check whether this command has been run before, and then add a condition to this task so that it runs only when needed.
The other option is to use changed_when and check the output of this command, like you do above in Prevent updates of NVIDIA Container Toolkit packages.
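A minimal sketch of the first option, assuming the Docker configuration written by `nvidia-ctk` ends up in /etc/docker/daemon.json; the check task and its register name are illustrative:

```yaml
- name: Check whether the NVIDIA runtime is already configured for Docker
  command: grep -q nvidia /etc/docker/daemon.json
  register: __hpc_docker_nvidia_check
  changed_when: false
  failed_when: false  # a missing file simply means "not configured yet"

- name: Configure NVIDIA Container Toolkit for Docker runtime
  command: nvidia-ctk runtime configure --runtime=docker
  when: __hpc_docker_nvidia_check.rc != 0
```

On the second run the grep succeeds, the configure task is skipped, and the play reports ok instead of changed.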
@spetrosi updated based on https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html, thank you so much for your detailed guidance.
tasks/main.yml (outdated)

```yaml
    mode: '0755'

- name: Generate default containerd config
  shell: containerd config default > /etc/containerd/config.toml
```
Here again, only do this if the file doesn't exist.
You might be able to use the creates keyword of the shell module for this: https://docs.ansible.com/projects/ansible/latest/collections/ansible/builtin/shell_module.html - you'll have to use the cmd form instead of the "free form" command:

```yaml
- name: Generate default containerd config
  shell:
    cmd: containerd config default > /etc/containerd/config.toml
    creates: /etc/containerd/config.toml
```
- Install moby-cli and moby-engine
- Install nvidia-container-toolkit package
- Configure containerd to use NVIDIA runtime
Force-pushed: eb33cb1 to 6be2ac5
Updated, thank you so much.
Whether to install the moby-engine and moby-cli packages as well as enable the Docker service.

Default: `true`
Should this be true by default?
Ah - I see you must install docker if hpc_install_nvidia_container_toolkit is true

- Is it possible to use `podman` instead of `docker`, on platforms that support `podman`? We at Red Hat strongly prefer the use of `podman`.
- You could make it so that the default value of `hpc_install_docker` is set to `true` if `hpc_install_nvidia_container_toolkit` is true, e.g. in the defaults file: `hpc_install_docker: "{{ hpc_install_nvidia_container_toolkit }}"` (a fuller defaults sketch follows below). That way, if the user does not set hpc_install_docker explicitly, it will be set to true implicitly if the user sets hpc_install_nvidia_container_toolkit: true.
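A sketch of how the defaults file could express that relationship; the `false` default for the toolkit variable is an assumption for illustration, not taken from this PR:

```yaml
# defaults/main.yml (sketch)
hpc_install_nvidia_container_toolkit: false

# Docker is pulled in automatically when the toolkit is requested,
# unless the user sets hpc_install_docker explicitly.
hpc_install_docker: "{{ hpc_install_nvidia_container_toolkit }}"
```

Because role defaults have the lowest variable precedence, any explicit `hpc_install_docker` setting from the user still wins.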
@richm Updated, thank you so much for your detailed comment. About "Is it possible to use podman instead of docker", this is a good question. I need to confirm with the team, but my understanding is that we haven't fully tested the NVIDIA Container Toolkit with Podman yet. While Podman may be a future consideration, it's likely not covered in the current commit.
README.md (outdated)

Type: `bool`

### hpc_install_nvidia_container_toolkit
This doesn't list the default value and type
Updated, thank you so much.
handlers/main.yml (outdated)

```yaml
- name: Restart docker service
  service:
    name: docker
    state: restarted
```
missing newline at end of file
Updated, thank you so much.
```yaml
    not in __hpc_versionlock_check.stdout
  loop: "{{ __hpc_nvidia_container_toolkit_packages }}"

- name: Check if NVIDIA runtime is configured in Docker daemon
```
Prefer to use the Ansible find module for this: https://docs.ansible.com/projects/ansible/latest/collections/ansible/builtin/find_module.html

```yaml
- name: Check if NVIDIA runtime is configured in Docker daemon
  find:
    paths: [/etc/docker]
    recurse: false
    patterns: daemon.json
    use_regex: false
    contains: nvidia
  register: docker_nvidia_runtime_check
```

then change the when condition in Configure NVIDIA Container Toolkit for Docker runtime
re: hpc_install_docker - since it doesn't make sense to ever set it to false, just get rid of it, and automatically install docker if `hpc_install_nvidia_container_toolkit: true`.

If you want to support podman too, you could have a new variable `hpc_container_runtime` - default value is `docker` - the user can set `hpc_container_runtime: podman` to use podman.
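A hedged sketch of what such a switch could look like; the `__hpc_podman_packages` list and the exact task layout are assumptions, not part of this PR:

```yaml
# defaults/main.yml (sketch)
hpc_container_runtime: docker  # users may set 'podman' on platforms that support it

# tasks/main.yml (sketch): pick the package list for the selected runtime
- name: Install container runtime packages
  package:
    name: "{{ __hpc_docker_packages if hpc_container_runtime == 'docker' else __hpc_podman_packages }}"
    state: present
```

Podman typically consumes GPUs via CDI rather than a Docker daemon runtime entry, so the Docker/containerd configuration steps here would likely not apply as-is when `hpc_container_runtime: podman`.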
@richm Updated to use the find module, thank you so much for your detailed guidance. Regarding Podman, I will confirm with the team; it probably won't be covered in this commit as part of the short-term plan.
tasks/main.yml (outdated)

```yaml
- name: Configure NVIDIA Container Toolkit for Docker runtime
  command: nvidia-ctk runtime configure --runtime=docker
  when: docker_nvidia_runtime_check.rc != 0
```
`when: docker_nvidia_runtime_check.matched == 0`
tasks/main.yml (outdated)

```yaml
  register: nvidia_containerd_dropin

- name: Check if containerd config has drop-in imports
  command: grep -q 'imports = \["/etc/containerd/conf.d/\*\.toml"\]' /etc/containerd/config.toml
```
Use the Ansible find module method described above to check if a file exists with a specific string
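A minimal sketch of that check for the containerd drop-in imports, modeled on the find example above; the register name is illustrative:

```yaml
- name: Check if containerd config has drop-in imports
  find:
    paths: [/etc/containerd]
    recurse: false
    patterns: config.toml
    use_regex: false
    contains: 'imports = \["/etc/containerd/conf.d/\*\.toml"\]'
  register: containerd_imports_check
```

The drop-in task can then be gated with `when: containerd_imports_check.matched == 0`.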
Force-pushed: 5a86e9c to 5fe6da5
Enhancement:

Reason: Add new packages support for Moby container runtime and NVIDIA Container Toolkit

Reference: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html

Result:
```
docker run --rm --gpus all nvidia/cuda:12.9.0-base-ubi9 nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08 Driver Version: 575.57.08 CUDA Version: 12.9 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla T4 On | 00000001:00:00.0 Off | Off |
| N/A 32C P8 12W / 70W | 0MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
```
Issue Tracker Tickets (Jira or BZ if any):
RHELHPC-107
Summary by Sourcery
Add optional installation and configuration of Moby-based Docker and NVIDIA Container Toolkit with GPU-enabled container runtimes.
New Features:
Enhancements:
Documentation: