The Wayback Machine - https://web.archive.org/web/20211210183341/https://github.com/opencontainers/runc/pull/3057
Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chown cgroup to process uid in container namespace #3057

Merged
merged 1 commit into from Dec 7, 2021

Conversation

@frasertweedale
Copy link
Contributor

@frasertweedale frasertweedale commented Jul 2, 2021

Delegating cgroups to the container enables more complex workloads,
including systemd-based workloads. The OCI runtime-spec was
recently updated to explicitly admit such delegation, through
specification of cgroup ownership semantics:

opencontainers/runtime-spec#1123

Pursuant to the updated OCI runtime-spec, change the ownership of
the container's cgroup directory and particular files therein, when
using cgroups v2 and when the cgroupfs is to be mounted read/write.

As a result of this change, systemd workloads can run in isolated
user namespaces on OpenShift when the sandbox's cgroupfs is mounted
read/write.

It might be possible to implement this feature in other cgroup
managers, but that work is deferred.

libcontainer/cgroups/systemd/v2.go Outdated Show resolved Hide resolved
Loading
libcontainer/cgroups/systemd/v2.go Outdated Show resolved Hide resolved
Loading
libcontainer/cgroups/systemd/v2.go Outdated Show resolved Hide resolved
Loading
libcontainer/factory_linux.go Outdated Show resolved Hide resolved
Loading
@kolyshkin
Copy link
Contributor

@kolyshkin kolyshkin commented Jul 2, 2021

In general this kinda makes sense, although I'm afraid systemd might overwrite that at some random moment.

This also needs test(s).

@frasertweedale
Copy link
Contributor Author

@frasertweedale frasertweedale commented Jul 2, 2021

In general this kinda makes sense, although I'm afraid systemd might overwrite that at some random moment.

This also needs test(s).

You can use systemd transient unit API to set a different user. Unfortunately, the uid needs to
have a corresponding passwd entity or else systemd refuses to proceed.

I filed an issue against systemd to relax this requirement. It was provisionally rejected:
systemd/systemd#19781. Another approach is to ship an NSSwitch
module to synthesise passwd entites for UIDs in recognised subuid ranges. But with changes
to support subuid ranges themselves coming from NSSwitch, I don't think it's a good idea
(right now). See this blog post for more detail: https://frasertweedale.github.io/blog-redhat/posts/2021-06-09-systemd-cgroups-subuid.html#discussion-and-next-steps .

I would like if we can push back and convince systemd project (i.e. Lennart) to just relax the checks
in systemd. Then runc could achieve correct cgroup ownership via systemd transient unit API. But I
don't think that can be hoped for.

libcontainer/cgroups/systemd/v2.go Outdated Show resolved Hide resolved
Loading
libcontainer/cgroups/systemd/v2.go Outdated Show resolved Hide resolved
Loading
Copy link
Member

@AkihiroSuda AkihiroSuda left a comment

For security reason, the default behavior should not be changed.
We can probably have a feature gate as an annotation

https://github.com/containers/crun/blob/main/crun.1.md#extensions-to-oci

https://github.com/opencontainers/runc/blob/master/docs/systemd.md#auxiliary-properties

@frasertweedale
Copy link
Contributor Author

@frasertweedale frasertweedale commented Jul 2, 2021

For security reason, the default behavior should not be changed.
We can probably have a feature gate as an annotation

https://github.com/containers/crun/blob/main/crun.1.md#extensions-to-oci

https://github.com/opencontainers/runc/blob/master/docs/systemd.md#auxiliary-properties

User namespaces is already behind an annotation (io.kubernetes.cri-o.userns-mode). And user namespaces are somewhat hobbled without the ability to manage cgroups (for the workloads that I care about anyway). So, do we really need a separate annotation to gate this feature? I don't mind if the answer is yes, but something to consider...

@AkihiroSuda
Copy link
Member

@AkihiroSuda AkihiroSuda commented Jul 2, 2021

I'm talking about OCI Runtime Spec annotations, not about Kubernetes annotations.

CRI-O may set the suggest OCI Runtime Spec annotation by default at their own risk.

@frasertweedale
Copy link
Contributor Author

@frasertweedale frasertweedale commented Jul 2, 2021

@AkihiroSuda thanks for the clarification. Fair enough. I'll gate it. I'll try to make all the suggested changes next week.

@frasertweedale
Copy link
Contributor Author

@frasertweedale frasertweedale commented Jul 2, 2021

@AkihiroSuda what would be an appropriate key/value for the annotation? "org.opencontainers.runc.chown-cgroup": "{true,false}"

@AkihiroSuda
Copy link
Member

@AkihiroSuda AkihiroSuda commented Jul 2, 2021

"org.opencontainers.runc.chown-cgroup": "{true,false}"

SGTM, but in the long term we should have proper JSON entry in OCI Runtime Spec https://github.com/opencontainers/runtime-spec

@frasertweedale
Copy link
Contributor Author

@frasertweedale frasertweedale commented Jul 2, 2021

"org.opencontainers.runc.chown-cgroup": "{true,false}"

SGTM, but in the long term we should have proper JSON entry in OCI Runtime Spec https://github.com/opencontainers/runtime-spec

Better would be to adjust the semantics such that the container process can create scopes/slices in its own cgroup, and manage its child processes/threads. But that is a discussion to be had over there, not here :)

libcontainer/cgroups/systemd/v2.go Outdated Show resolved Hide resolved
Loading
@frasertweedale frasertweedale force-pushed the feature/chown-cgroup branch from 9e9c973 to 3c94bb5 Jul 15, 2021
@frasertweedale
Copy link
Contributor Author

@frasertweedale frasertweedale commented Jul 15, 2021

Thanks @kolyshkin and @AkihiroSuda for your reviews. I implemented all of your requested changes, and added a test.

By the way, for testing with OpenShift, do not use 4.8.0, there is a regression in CRI-O 1.21 that prevents k8s annotations from being propagated to the OCI configuration. CRI-O 1.20 (OpenShift 4.7.x) is fine.

@frasertweedale frasertweedale force-pushed the feature/chown-cgroup branch 3 times, most recently from 43e5c14 to 0f4ffc4 Jul 16, 2021
@frasertweedale
Copy link
Contributor Author

@frasertweedale frasertweedale commented Jul 21, 2021

Ping @kolyshkin and @AkihiroSuda - are you able to re-review this PR soon?

@frasertweedale frasertweedale requested a review from AkihiroSuda Nov 29, 2021
@frasertweedale frasertweedale force-pushed the feature/chown-cgroup branch from c276066 to 4efd94e Nov 29, 2021
@AkihiroSuda AkihiroSuda requested a review from cyphar Nov 29, 2021
@frasertweedale frasertweedale changed the title chown cgroup to uid 0 in container namespace chown cgroup to process uid in container namespace Nov 29, 2021
tests/integration/cgroup_delegation.bats Outdated Show resolved Hide resolved
Loading
@frasertweedale frasertweedale force-pushed the feature/chown-cgroup branch from 4efd94e to 99cf0a8 Nov 29, 2021
libcontainer/cgroups/systemd/v2.go Outdated Show resolved Hide resolved
Loading
libcontainer/cgroups/systemd/v2.go Outdated Show resolved Hide resolved
Loading
@frasertweedale frasertweedale force-pushed the feature/chown-cgroup branch from 99cf0a8 to e155172 Nov 29, 2021
Copy link
Contributor

@kolyshkin kolyshkin left a comment

Overall looks good and ready; only left a single suggestion.

libcontainer/cgroups/systemd/v2.go Show resolved Hide resolved
Loading
@frasertweedale frasertweedale force-pushed the feature/chown-cgroup branch from e155172 to cce31b0 Nov 29, 2021
libcontainer/cgroups/systemd/v2.go Outdated Show resolved Hide resolved
Loading
@frasertweedale frasertweedale force-pushed the feature/chown-cgroup branch from cce31b0 to 8212dfa Nov 29, 2021
Delegating cgroups to the container enables more complex workloads,
including systemd-based workloads.  The OCI runtime-spec was
recently updated to explicitly admit such delegation, through
specification of cgroup ownership semantics:

  opencontainers/runtime-spec#1123

Pursuant to the updated OCI runtime-spec, change the ownership of
the container's cgroup directory and particular files therein, when
using cgroups v2 and when the cgroupfs is to be mounted read/write.

As a result of this change, systemd workloads can run in isolated
user namespaces on OpenShift when the sandbox's cgroupfs is mounted
read/write.

It might be possible to implement this feature in other cgroup
managers, but that work is deferred.

Signed-off-by: Fraser Tweedale <ftweedal@redhat.com>
@frasertweedale frasertweedale force-pushed the feature/chown-cgroup branch from 8212dfa to 35d20c4 Nov 29, 2021
Copy link
Contributor

@kolyshkin kolyshkin left a comment

LGTM

Copy link
Contributor

@kolyshkin kolyshkin left a comment

LGTM

@kolyshkin kolyshkin requested review from AkihiroSuda and thaJeztah Nov 30, 2021
@frasertweedale
Copy link
Contributor Author

@frasertweedale frasertweedale commented Dec 2, 2021

Thanks to all who have reviewed this PR and given feedback. We need more approvals and then someone can press the merge button. Ping @cyphar, @AkihiroSuda, @giuseppe, @jfroy.

@frasertweedale
Copy link
Contributor Author

@frasertweedale frasertweedale commented Dec 6, 2021

Any takers? @mrunalp?

cyphar
cyphar approved these changes Dec 7, 2021
Copy link
Member

@cyphar cyphar left a comment

LGTM.

@cyphar cyphar merged commit cdce249 into opencontainers:master Dec 7, 2021
23 checks passed
Loading
@frasertweedale
Copy link
Contributor Author

@frasertweedale frasertweedale commented Dec 7, 2021

Many thanks, @cyphar and all reviewers.

@frasertweedale
Copy link
Contributor Author

@frasertweedale frasertweedale commented Dec 7, 2021

I created #3311 to backport this feature to the release-1.0 branch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment