LXC uses cgroups to track and constrain resource use by containers. Historically cgroups have been administered through a filesystem interface. A root owned task can mount the cgroup filesystem and change its current cgroup or the limits of its cgroup. Lxc must therefore rely on apparmor to disallow cgroup mounts, and make sure to bind mount only the container’s own cgroup into the container. It must also calculate its own cgroup for each controller to choose and track a full new cgroup for a new container. Along with some other complications, this caused the amount of code in lxc to deal with cgroups to become quite large.
To help deal with this, we wrote cgmanager, the cgroup manager. Its primary goal was to allow any task to seamlessly and securely (in terms of the host’s safety) administer its own cgroups. Its secondary goal was to ensure that lxc could deal with cgroups equally simply regardless of whether it was nested.
Cgmanager presents a D-Bus interface for making cgroup administration requests. Every request is made in relation to the requesting task’s current cgroup. Therefore ‘lxc-start’ can simply request for cgroup u1 to be created, without having to worry about what cgroup it is in now.
To make this work, we read the (un-alterable) process credentials of the requesting task over the D-Bus socket. We can check the task’s current cgroup using /proc/pid/cgroup, as well as check its /proc/pid/status and /proc/pid/uid_map. For a simple request like ‘create a cgroup’, this is all the information we need.
For requests relating to another task (“Move that task to another cgroup”) or credentials (“Change ownership to that userid”), we have two cases. If the requestor is in the same namespaces as the cgmanager (which we can verify on recent kernels), then the requestor can pass the values as regular integers. We can then verify using /proc whether the requestor has the privilege to perform the access.
But if the requestor is in a different namespace, then we need to uids and pids converted. We do this by having the requestor pass SCM_CREDENTIALS over a file descriptor. When these are passed, the kernel (a) ensures that the requesting task has privilege to write those credentials, and (b) converts them from the requestor’s namespace to the reader (cgmanager).
The SCM-enhanced D-Bus calls are a bit more complicated to use than regular D-Bus calls, and can’t be made with (unpatched) dbus-send. Therefore we provide a cgmanager proxy (cgproxy) which accepts the plain D-Bus requests from a task which shares its namespaces and converts them to the enhanced messages. So when you fire up a Trusty containers host, it will run the cgmanager. Each container on that host can bind the cgmanager D-Bus socket and run a cgproxy. (The cgmanager upstart job will start the right daemon at startup) Lxc can administer cgroups the exact same way whether it is being run inside a container or on the host.
Cgmanager is now in main in trusty. When you log into a trusty desktop, logind should place you into your own cgroup, which you can verify by reading /proc/self/cgroup. If entries there look like
then you have your own delegated cgroups. If it instead looks like
then you do not. You can create your own cgroup using cgm, which is just a script to wrap rather long calls to dbus-send.
sudo cgm create all $USER sudo cgm chown all $USER $(id -u) $(id -g)
Next enter your shell into the new cgroup using
cgm movepid all $USER $$
Now you can go on to https://www.stgraber.org/2014/01/17/lxc-1-0-unprivileged-containers/ to run your unprivileged containers. Or, I sometimes like to stick a compute job in a separate freezer cgroup so I can freeze it if the cpu needs to cool down,
cgm create freezer cc bash docompile.sh & cgm movepid freezer cc $!
This way I can manually freeze the job when I like, or I can have a script watching my cpu temp as follows:
state="thawed" while [ 1 ]; do d=`cat /sys/devices/virtual/thermal/thermal_zone0/temp` || d=1000; d=$((d/1000)); if [ $d -gt 93 -a "$state" = "thawed" ]; then cgm setvalue freezer cc freezer.state FROZEN state="frozen" elif [ $d -lt 89 -a "$state" = "frozen" ]; then cgm setvalue freezer cc freezer.state THAWED state="thawed"; fi; sleep 1; done