Skip to content

Conversation

@kolyshkin
Copy link
Contributor

@kolyshkin kolyshkin commented Jan 27, 2026

This is an alternative to #5066, fixing runc exec vs go1.26 issue with 1.4.0.
Go 1.26.0 is to be released soon (in February).

Fixes: #5060
Closes: #5066


Since PR 4812, runc exec tries to use clone3 syscall with
CLONE_INTO_CGROUP, falling back to the old method if it is not
supported.

One issue with that approach is, a

Cmd cannot be reused after calling its [Cmd.Start], [Cmd.Run],
[Cmd.Output], or [Cmd.CombinedOutput] methods.

(from https://pkg.go.dev/os/exec#Cmd).

This is enforced since Go 1.26, see CL 728642, and so runc exec
actually fails in specific scenarios (go1.26 and no CLONE_INTO_CGROUP
support).

The easiest workaround is to pre-copy the p.cmd structure. From the
CL 734200 it looks like it is an acceptable way. If the upstream
will introduce cmd.Clone we'll switch to it.

As for the test case, it looks like adding Go 1.26 to the testing
matrix is sufficient to reveal the issue (some tests involving
runc exec fail with Go 1.26 before the fix).

@kolyshkin kolyshkin changed the title ci: add go 1.26 rc2 Fix exec vs go1.26 Jan 27, 2026
@kolyshkin kolyshkin added this to the 1.4.1 milestone Jan 27, 2026
@kolyshkin
Copy link
Contributor Author

kolyshkin commented Jan 27, 2026

Note that copying of exec.Cmd is not recommended but accepted by upstream, judging by CL 734200. They may introduce cmd.Clone method later and we'll switch to it.

@kolyshkin kolyshkin marked this pull request as ready for review January 27, 2026 01:11
@kolyshkin kolyshkin changed the title Fix exec vs go1.26 Fix runc exec vs go1.26 + older kernel Jan 27, 2026
@kolyshkin
Copy link
Contributor Author

@rata @lifubang @cyphar PTAL

@lifubang
Copy link
Member

This is a great solution—overall, LGTM.

Would you consider cherry-picking the unit test from #5066, or adding a similar test case? I’m concerned about whether the second run will succeed without it, for example, maybe some fds has been closed in the first run.

@AkihiroSuda AkihiroSuda added backport/1.3-todo A PR in main branch which needs to be backported to release-1.3 backport/1.4-todo A PR in main branch which needs to backported to release-1.4 backport/1.2-todo A PR in main branch which needs to be backported to release-1.2 labels Jan 28, 2026
@rata
Copy link
Member

rata commented Jan 28, 2026

@kolyshkin so let me see if I'm following correctly. The proposal to add cmd.Clone() about cloning mentions this:

Generally this involves making a shallow copy of all exported fields: cmd = &exec.Cmd{Path: cmd.Path, Args: cmd.Args, ...}.

However, we think it is kind of accepted upstream for now because this CL https://go-review.googlesource.com/c/go/+/734200 is changing the atomic.Bool that causes go vet or something to fail, it is changed to something else that avoid the warning. Therefore, they are allowing this workaround for now.

Am I getting it right?

In that case, it seems fine, but it might break again in 1.27 (as they implied this will be removed if cmd.Clone() is merged).

@lifubang lifubang removed backport/1.2-todo A PR in main branch which needs to be backported to release-1.2 backport/1.3-todo A PR in main branch which needs to be backported to release-1.3 labels Jan 28, 2026
@kolyshkin
Copy link
Contributor Author

This is a great solution—overall, LGTM.

Would you consider cherry-picking the unit test from #5066, or adding a similar test case? I’m concerned about whether the second run will succeed without it, for example, maybe some fds has been closed in the first run.

@lifubang this case is already (implicitly) covered by the existing tests. For example, runc exec (cgroup v2 + init process in non-root cgroup) succeeds makes a container, moves its init to a sub-cgroup and then enables a domain controller. In this situation, we can no longer join the default (container's top-level) cgroup (IOW clone3(CLONE_INTO_CGROUP) fails) and fall back to reading the init's cgroup. When we run this test on Go 1.26 (which forces the error on the second cmd.Exec), it fails.

The failing test part is here:

# an exec process can no longer join "/" after turning on a domain controller.
# falls back to "/foo".
runc exec test_cgroups_group cat /proc/self/cgroup
[ "$status" -eq 0 ]

The failure itself (before the commit with the fix) is here: https://github.com/opencontainers/runc/actions/runs/21379582395/job/61543423814

This is the reason why I'm adding Go 1.26 to the CI matrix.

I am working on simplifying the test from #5066 (in its current form it's too complicated) so maybe I will add something here, but in general it's not needed.

@kolyshkin
Copy link
Contributor Author

Am I getting it right?

In that case, it seems fine, but it might break again in 1.27 (as they implied this will be removed if cmd.Clone() is merged).

You are right, it may stop working in case they will use something in exec.Command that could/should not be copied. Let's hope they are to add the Clone method in this case, and we'll just switch to it (depending on Go version):

//go:build go1.27
func cloneCmd(cmd *exec.Cmd) *exec.Cmd {
       return cmd.Clone()
}

//go:build !go1.27
func cloneCmd(cmd *exec.Cmd) *exec.Cmd {
    cmdCopy = *cmd
    return &cmdCopy
}

and then, eventually, remove the cloneCmd wrapper and use cmd.Clone directly.

Copy link
Member

@thaJeztah thaJeztah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change LGTM

are new tests needed? (I see the other PR added some tests, but I haven't looked if they're good)

@thaJeztah
Copy link
Member

Oh, lol, missed the discussion (thanks browser refresh 😆)

This is mostly to test whether https://go.dev/cl/728642 results in
any test failures in the current CI matrix.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
@kolyshkin
Copy link
Contributor Author

@lifubang OK I've added a (slightly simplified) test from #5066 but in fact the same code path is covered by the existing runc exec (cgroup v2 + init process in non-root cgroup) succeeds bats test (perhaps in a less hackish manner), so it's up to you whether you want to include it or not.

@lifubang
Copy link
Member

but in fact the same code path is covered by the existing runc exec (cgroup v2 + init process in non-root cgroup) succeeds bats test

Yes, you are right.

so it's up to you whether you want to include it or not.

I think it's not necessary to include another unit test, from my perspective, you can remove the last commit.

@rata
Copy link
Member

rata commented Jan 29, 2026

You are right, it may stop working in case they will use something in exec.Command that could/should not be copied. Let's hope they are to add the Clone method in this case, and we'll just switch to it (depending on Go version):

//go:build go1.27
func cloneCmd(cmd *exec.Cmd) *exec.Cmd {
       return cmd.Clone()
}

//go:build !go1.27
func cloneCmd(cmd *exec.Cmd) *exec.Cmd {
    cmdCopy = *cmd
    return &cmdCopy
}

and then, eventually, remove the cloneCmd wrapper and use cmd.Clone directly.

@kolyshkin I'd still prefer to have the pre go 1.27 cloneCmd() just copy all the public fields of the struct. It is still quite simple (https://pkg.go.dev/os/exec#Cmd there are not so many, AI can probably do it :D) and we don't have to make a release on all the branches when go 1.27 is released. We can just switch whenever we want. We can just copy all of them, that is all.

The only potential downside is if more fields are added in 1.26 and we start using them and we forget to update this copy. It seems unlikely, I don't think 1.26 will add fields. This is short lived (ideally, until 1.27 comes out), then we can remove it. So I don't think this risk (that is the only I see) is a concern.

What do you think?

@kolyshkin
Copy link
Contributor Author

@kolyshkin I'd still prefer to have the pre go 1.27 cloneCmd() just copy all the public fields of the struct. It is still quite simple (https://pkg.go.dev/os/exec#Cmd there are not so many, AI can probably do it :D) and we don't have to make a release on all the branches when go 1.27 is released. We can just switch whenever we want. We can just copy all of them, that is all.

The only potential downside is if more fields are added in 1.26 and we start using them and we forget to update this copy. It seems unlikely, I don't think 1.26 will add fields. This is short lived (ideally, until 1.27 comes out), then we can remove it. So I don't think this risk (that is the only I see) is a concern.

What do you think?

@rata done, PTAL

@kolyshkin
Copy link
Contributor Author

A few things to note here:

  1. We can probably be fine with a post-copy (IOW copy the command only after the first Start failed), but it's more risk theoretically so I left things as is (ie pre-copy).
  2. I chose not to copy Process and ProcessState fields -- the first one is populated upon the successful start, the second -- upon the process exit.
  3. I chose not to copy Err -- in most cases it is populated upon the process start or exit, except that exec.Command() can also set it. Practically, exec.Command can set it on Windows or if the command path requires lookup -- neither of which is the case here. But for safely I've added the appropriate check (as a separate commit).

Theoretically, exec.Command can set cmd.Err.

Practically, this should never happen (Linux, Go <= 1.26, exePath is
absolute), but in the unlikely case it does, let's fail early.

This is related to the cloneCmd (to be introduced by the following
commit) which chooses to not copy the Err field. Theoretically,
exec.Command can set Err and so the first call to cmd.Start will fail
(since Err != nil), and the second call to cmd.Start may succeed because
Err == nil. Yet, this scenario is highly unlikely, but better be safe
than sorry.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Since [PR 4812], runc exec tries to use clone3 syscall with
CLONE_INTO_CGROUP, falling back to the old method if it is not
supported.

One issue with that approach is, a

> Cmd cannot be reused after calling its [Cmd.Start], [Cmd.Run],
> [Cmd.Output], or [Cmd.CombinedOutput] methods.

(from https://pkg.go.dev/os/exec#Cmd).

This is enforced since Go 1.26, see [CL 728642], and so runc exec
actually fails in specific scenarios (go1.26 and no CLONE_INTO_CGROUP
support).

The easiest workaround is to pre-copy the p.cmd structure (copy = *cmd).
From the [CL 734200] it looks like it is an acceptable way, but it might
break in the future as it also copies the private fields, so let's do a
proper field-by-field copy. If the upstream will add cmd.Clone method,
we will switch to it.

Also, we can probably be fine with a post-copy (once the first Start has
failed), but let's be conservative here and do a pre-copy.

[PR 4812]: opencontainers#4812
[CL 728642]: https://go.dev/cl/728642
[CL 734200]: https://go.dev/cl/734200

Reported-by: Efim Verzakov <efimverzakov@gmail.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
@kolyshkin
Copy link
Contributor Author

The first commit should be a draft until Go 1.26 is released, but we can patch this up later. Or, I can remove the first commit for now, for a cleaner git history.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport/1.4-todo A PR in main branch which needs to backported to release-1.4

Projects

None yet

Development

Successfully merging this pull request may close these issues.

runc 1.4 breaks the rule that cmd.Start is called only once that's why process Wait can return an error.

5 participants