PSA: nested lxc containers

lxc has long supported nesting containers. There’s a lot of (historically accurate) documentation out there saying to use the line

lxc.aa_profile = lxc-container-default-with-nesting

to enable that. Sadly, a fairly new kernel restriction now requires a bit more work. To accommodate it, the new way to enable nesting in lxc is to use the configuration line:

lxc.include = /usr/share/lxc/config/nesting.conf

That configuration file includes the old aa_profile line. If you have your own custom nesting profile, you can follow the above lxc.include line with your lxc.aa_profile line, i.e.

lxc.include = /usr/share/lxc/config/nesting.conf
lxc.aa_profile = my-custom-nesting-profile

If you’re using lxd, this of course does not affect you. You can continue to use the ‘security.nesting: true’ config property as always.


Containers – inspect, don’t introspect

You’ve got a whatzit daemon running in a VM. The VM starts acting suspiciously – a lot more cpu, memory, or i/o than you’d expect. What do you do? You could log in and look around. But if the VM’s been 0wned, you may be running trojaned tools in the VM. In that case, you’d be better off mounting the VM’s root disk and looking around from your (hopefully) safe root context.

The same of course is true in containers. lxc-attach is a very convenient tool, as it doesn’t even require you to be running ssh in the container. But you’re trusting the container to be pristine.

One of the cool things about containers is that you can inspect them pretty flexibly from the host. While the whatzit daemon is still running, you can strace it from the host, you can look, for instance, at its proc filesystem through /proc/$(pidof whatzit)/root/proc, and you can see its process tree by just running ps on the host (e.g. pstree, ps -axjf).
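As a minimal sketch of that host-side inspection: the helper name container_path is hypothetical, and the example assumes a single whatzit process (pidof prints several pids if there are more).

```shell
# Build a host-side path into the filesystem seen by process PID.
# /proc/PID/root is a kernel-provided link to that process's root directory.
container_path() {
    pid=$1
    path=$2
    printf '/proc/%s/root/%s\n' "$pid" "${path#/}"
}

# Example (run as root on the host; "whatzit" is the daemon from the text):
#   pid=$(pidof whatzit)
#   ls "$(container_path "$pid" /proc)"   # the container's view of /proc
#   strace -f -p "$pid"                   # trace it with the host's strace
#   ps -axjf                              # process tree as seen by the host
container_path 1234 /etc/passwd
```

Every tool in the commented example is the host's own binary, so nothing inside the container has to be trusted.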

So, the point of this post is mainly to recommend doing so 🙂 Importantly, I’m not claiming here “and therefore containers are better/safer” – that would be nonsense. (The trivial counterargument would be that the container shares – and can easily exploit – the host kernel.) Rather, the point is to use the appropriate tools and, then, to use them as well as possible by exploiting their advantages.


Cgroups are now handled a bit differently in Xenial

In the past, when you logged into an Ubuntu system, you would be placed into a cgroup which you owned, one per controller (e.g. memory, freezer, etc). The main reason for this was so that unprivileged users could use things like lxc.

However, this caused some trouble, especially through the cpuset controller. The problem is that when a cpu is plugged in, it is not added to any existing cpusets (in the legacy cgroup hierarchy, which we use). This is true even if you previously unplugged that cpu. So if your system has two cpus, when you first log in you have cpus 0-1. Cpu 1 gets unplugged and replugged, and now you only have 0. Now 0 gets unplugged…
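The sequence can be walked through with a toy model. This only simulates the behaviour described above and does not touch real cgroups. The function names and the space-separated list are assumptions made for illustration.

```shell
# A cpuset is modelled as a space-separated list of cpu numbers.

# unplug CPU "CPUSET": the cpu disappears from the existing cpuset.
unplug() {
    cpu=$1
    result=""
    for c in $2; do
        [ "$c" = "$cpu" ] || result="${result:+$result }$c"
    done
    echo "$result"
}

# replug CPU "CPUSET": in the legacy hierarchy the cpu is NOT added back.
replug() {
    echo "$2"
}

cpus="0 1"                      # what you own at login
cpus=$(unplug 1 "$cpus")
cpus=$(replug 1 "$cpus")
echo "after cpu 1 is unplugged and replugged: '$cpus'"
cpus=$(unplug 0 "$cpus")
echo "after cpu 0 is unplugged: '$cpus'"
```

The second line of output shows an empty set: the session's cpuset no longer contains any cpu.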

The cgroup creation was previously done through a systemd patch, and was not configurable. In Xenial, we’ve now reduced that patch to only work on the name=systemd cgroup. Other controllers are to be handled by the new libpam-cgm package. By default it only creates a cgroup for the freezer controller. You can change the list by editing /etc/pam.d/common-session. For instance, to add memory, you would change the line

optional -c freezer

to

optional -c freezer,memory

One more change expected to come to Xenial is to switch libpam-cgm to using lxcfs instead of cgmanager (or, just as likely, create a new conflicting libpam-cgroup package which does so). Since Xenial and later systems use systemd, which won’t boot without lxcfs anyway, we’ll lose no functionality by requiring lxcfs for unprivileged container creation on login.

On a side note, reducing the set of user-owned cgroups also required a patch to lxc. This means that with a mixture of releases you may run into trouble when nesting unprivileged containers in older releases. For instance, if you create an unprivileged Trusty container on a Xenial host, you won’t own the memory cgroup by default, even if you’re root in the container. At the moment Trusty’s lxc doesn’t know how to handle that when creating a nested container. The lxc patches should hopefully get SRUd, but in the meantime you can use the ubuntu-lxc ppas to get newer packages if needed. (Note that this is a non-issue when running lxd on the host.)


Nested containers in LXD

We’ve long considered nested containers an important use case in lxc. Lxd is no different in this regard. Lately there have been several questions about how to set this up.

If you are using privileged lxd containers (security.privileged: true), then the only thing you need to do is to set the security.nesting flag to true:

lxc launch ubuntu nestc1 -c security.nesting=true -c security.privileged=true

or to change an existing container:

lxc config set nestc1 security.nesting true

However, we heavily encourage the use of unprivileged containers whenever possible. Nesting with unprivileged containers works just as well, but requires an extra step.

Recall that unprivileged containers run in a user namespace. A user namespace has a mapping from host uids to container uids. For instance, the range of host uids 100000-199999 might be mapped to container uids 0-99999. The key insight for nesting is that you can only map uids which are defined in the parent container. So in this example, we cannot map uids 100000-199999 to nested containers because they do not exist inside the parent! So we have two choices – either choose uids which do exist, or increase the range passed to parent containers. Since lxd currently demands at least 65536 uids and gids, we’ll have to go with the latter.
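To make the mapping concrete, here is a small sketch of the translation a user namespace performs. The function name and argument order are hypothetical. The numbers are the ones from the example above.

```shell
# host_to_container HOST_UID HOST_START CONTAINER_START COUNT
# Print the container uid that HOST_UID maps to, or "unmapped" if the
# host uid falls outside the delegated range.
host_to_container() {
    uid=$1
    host_start=$2
    ct_start=$3
    count=$4
    if [ "$uid" -ge "$host_start" ] && [ "$uid" -lt $(( $host_start + $count )) ]; then
        echo $(( $uid - $host_start + $ct_start ))
    else
        echo unmapped
    fi
}

# Host uids 100000-199999 mapped to container uids 0-99999:
host_to_container 100000 100000 0 100000   # container root
host_to_container 150000 100000 0 100000
host_to_container 1000   100000 0 100000   # an ordinary host user
```

Seen from inside the container, only uids 0-99999 exist, which is why host uid 100000 itself cannot be handed on to a nested container.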

Generally this isn’t too complicated. If you wish to run container c3 in container c2 in container c1, you’ll need 65536 uids in c3; in c2 you’ll need 65536 for c2 itself plus the 65536 for c3; and in c1 you’ll need 65536 for c1 plus 65536 for c2 plus 65536 for c3.

Lxd will gain per-tenant uid mappings, but for now you create the allocations by editing /etc/subuid and /etc/subgid (or by using usermod). On the host, we’ll delegate the 196608 ids starting at 500000 to the root user:

sed -i '/^root:/d' /etc/subuid /etc/subgid
echo "root:500000:196608" >> /etc/subuid
echo "root:500000:196608" >> /etc/subgid

The first number is the first host uid being delegated, and the second is the number of ids in the range. We know lxd will map those to the same number of ids starting at 0. On the host we have all uids available, but in the first container only ids 0-196607 will be defined.

Now make sure lxd is stopped, then restart it and create a container:

lxc launch ubuntu c1 -c security.nesting=true

Log into c1, and set the subuid and subgid entries to:


Create your c2 container now,

lxc launch ubuntu c2 -c security.nesting=true

log in and this time set the subuid and subgid entries to:


Now you can create c3,

lxc launch ubuntu c3

You could of course go deeper, if you changed the allocations.
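The allocation at each level follows from the arithmetic above. The sketch below prints one /etc/subuid (and /etc/subgid) entry per level. It rests on an assumption: each container keeps ids 0-65535 for itself and delegates everything from 65536 upward to the next level. The helper name nest_entries is hypothetical.

```shell
# nest_entries DEPTH HOST_START [IDS_PER_CONTAINER]
# Print the root entry needed at each level so that DEPTH containers
# can be nested inside one another.
nest_entries() {
    depth=$1
    host_start=$2
    per=${3:-65536}
    level=0
    while [ "$level" -lt "$depth" ]; do
        # Ids still needed by this level's children, grandchildren, ...
        remaining=$(( ($depth - $level) * $per ))
        if [ "$level" -eq 0 ]; then
            echo "host: root:$host_start:$remaining"
        else
            echo "c$level: root:$per:$remaining"
        fi
        level=$(( $level + 1 ))
    done
}

nest_entries 3 500000
```

For three levels starting at host uid 500000 this prints 196608 ids for the host, 131072 for c1 and 65536 for c2. Going one level deeper means calling it with a depth of 4 and delegating 262144 ids on the host.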

If this all seems a bit too much work, I’ve written a little program (whose functionality may eventually move into lxd in some form or other) called uidmapviz, which aims to show you what allocations look like, and warns you if a configuration won’t work due to too few subuids.

Extra tip of the day

lxc file push and pull are very handy. Whether the container is running or not, instead of having to get ssh set up in the container or knowing where the rootfs is mounted, you can simply push files in or pull them out. For example, first export an image on the host:

lxc image export trusty

This produces the rootfs and metadata files for the image called ‘trusty’ (assuming it exists) in your current directory. Push them both into the container, using

lxc file push meta-ubuntu-trusty-14.04-amd64-server-20150928.tar.xz nestc1/meta-ubuntu-trusty-14.04-amd64-server-20150928.tar.xz
lxc file push ubuntu-trusty-14.04-amd64-server-20150928.tar.xz nestc1/ubuntu-trusty-14.04-amd64-server-20150928.tar.xz

then in the container

lxc image import /meta-ubuntu-trusty-14.04-amd64-server-20150928.tar.xz /ubuntu-trusty-14.04-amd64-server-20150928.tar.xz

which is how I copied images into containers for nesting, rather than waiting for lxd-images to pull images from the network.


Ambient capabilities

There are several problems with posix capabilities. The first is the name: capabilities are something entirely different, so now we have to distinguish between “classical” and “posix” capabilities. Next, capabilities come from a defunct posix draft. That’s a serious downside for some people.

But another complaint has come up several times since file capabilities were implemented in Linux: people wanted an easy way for a program, once it has capabilities, to keep them. Capabilities are re-calculated every time the task executes a new file, taking the executable file’s capabilities into account. If a file has no capabilities, then (outside of the special exception for root when SECBIT_NOROOT is off) the resulting privilege set will be empty. And for shellscripts, file capabilities are always empty.

Fundamental to posix capabilities is the concept that part of your authority stems from who you are, and part stems from the programs you run. In a world of trojan horses and signed binaries this may seem sensible, but in the real world it is not always desirable. In particular, consider a case where a program wants to run as non-root user, but with a few capabilities – perhaps only cap_net_admin. If there is a very small set of files which the program may want to execute with privilege, and none are scripts, then cap_net_admin could be added to the inheritable file privileges for each of those programs. Then only processes with cap_net_admin in their inheritable process capabilities will be able to run those programs with privilege. But what if the program wants to run *anything*, including scripts and without having to predict what will be executed? This currently is not possible.

Christoph Lameter has been facing this problem for some time, and requested an enhancement of posix capabilities to allow him to solve it. Not only did he raise the problem and provide a good, real use case, he also sent several patches for discussion. In the end, a concept of “ambient capabilities” was agreed to and implemented (final patch by Andy Lutomirski). It’s currently available in -mm.

Here is how it works:

(Note – for more background on posix capabilities as implemented in linux, please see this Linux Symposium paper. For an example of how to use file capabilities to run as non-root before ambient capabilities, see this Linux Journal article. The ambient capability set has gotten several LWN mentions as well.)

Tasks have a new capability set, pA, the ambient set. As Andy Lutomirski put it, “pA does what most people expect pI to do.” Bits can only be set in pA if they are in pP or pI, and they are dropped from pA if they are dropped from pP or pI. When a new file is executed, all bits in pA are enabled in pP. Note though that executing any file which has file capabilities, or using the SECBIT_KEEPCAPS prctl option (followed by setresuid), will clear pA after the next exec.
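The exec rule can be modelled with plain bitmasks. This is a simplified sketch based on the transformation documented in capabilities(7). It covers a non-root task only, and ignores setuid binaries and the effective set. The function name exec_transform is hypothetical.

```shell
# exec_transform pI pA pB fI fP
# Given the task's inheritable (pI), ambient (pA) and bounding (pB) sets and
# the file's inheritable (fI) and permitted (fP) sets, print the new
# permitted and ambient sets after exec as "pP pA".
exec_transform() {
    pI=$1; pA=$2; pB=$3; fI=$4; fP=$5
    # Executing a file that carries file capabilities clears the ambient set.
    if [ $(( $fI | $fP )) -ne 0 ]; then
        new_pA=0
    else
        new_pA=$pA
    fi
    new_pP=$(( ($pI & $fI) | ($fP & $pB) | $new_pA ))
    printf '0x%x 0x%x\n' "$new_pP" "$new_pA"
}

NET_ADMIN=0x1000      # bit 12, CAP_NET_ADMIN
NET_RAW=0x2000        # bit 13, CAP_NET_RAW
BOUND=0x3fffffffff

exec_transform $NET_ADMIN 0          $BOUND 0 0         # no pA: all privilege lost
exec_transform $NET_ADMIN $NET_ADMIN $BOUND 0 0         # pA survives a plain exec
exec_transform $NET_ADMIN $NET_ADMIN $BOUND 0 $NET_RAW  # file caps clear pA
```

The first call is the situation before ambient capabilities: a script or binary without file capabilities ends up with an empty permitted set even though the bit is in pI.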

So once a program moves CAP_NET_ADMIN into its pA, it can proceed to fork+exec a shellscript doing some /sbin/ip processing without losing CAP_NET_ADMIN.

How to use it (example):

Below is a test program, originally by Christoph, which I slightly modified. Write it to a file ‘ambient.c’. Build it, using

$ gcc -o ambient ambient.c -lcap-ng

Then assign it a set of file capabilities, for instance:

$ sudo setcap cap_net_raw,cap_net_admin,cap_sys_nice,cap_setpcap+p ambient

I was lazy and didn’t add interpretation of capabilities to ambient.c, so you’ll need to check /usr/include/linux/capability.h for the integers representing each capability. Run a shell with ambient capabilities by running, for instance:

$ ./ambient -c 13,12,23,8 /bin/bash

In this shell, check your capabilities:

$ grep Cap /proc/self/status
CapInh: 0000000000803100
CapPrm: 0000000000803100
CapEff: 0000000000803100
CapBnd: 0000003fffffffff
CapAmb: 0000000000803100

You can see that you have the requested ambient capabilities. If you run a new shell there, it retains those capabilities:

$ bash -c "grep Cap /proc/self/status"
CapInh: 0000000000803100
CapPrm: 0000000000803100
CapEff: 0000000000803100
CapBnd: 0000003fffffffff
CapAmb: 0000000000803100

What if we drop all but cap_net_admin from our inheritable set? We can test that using the ‘capsh’ program shipped with libcap:

$ capsh --caps=cap_net_admin=pi -- -c "grep Cap /proc/self/status"
CapInh: 0000000000001000
CapPrm: 0000000000001000
CapEff: 0000000000001000
CapBnd: 0000003fffffffff
CapAmb: 0000000000001000

As you can see, the other capabilities were dropped from our ambient, and hence from our effective set.
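The hex masks in these outputs can be decoded back into capability numbers with a few lines of shell. The helper name decode_caps is hypothetical. The numbers it prints are the bit positions defined in /usr/include/linux/capability.h.

```shell
# decode_caps MASK: print the capability numbers whose bits are set in MASK,
# lowest first, separated by commas.
decode_caps() {
    mask=$(( $1 ))
    bit=0
    out=""
    while [ "$mask" -ne 0 ] && [ "$bit" -lt 64 ]; do
        if [ $(( $mask & 1 )) -eq 1 ]; then
            out="${out:+$out,}$bit"
        fi
        mask=$(( $mask >> 1 ))
        bit=$(( $bit + 1 ))
    done
    echo "$out"
}

decode_caps 0x0000000000803100   # the mask from the ambient shell
decode_caps 0x0000000000001000   # the mask after capsh
```

The first mask decodes to 8,12,13,23 (cap_setpcap, cap_net_admin, cap_net_raw, cap_sys_nice), the same set passed with -c. The second decodes to 12, cap_net_admin alone.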

ambient.c source

/*
 * Test program for the ambient capabilities. This program spawns a shell
 * that allows running processes with a defined set of capabilities.
 *
 * (C) 2015 Christoph Lameter
 * (C) 2015 Serge Hallyn
 * Released under: GPL v3 or later.
 *
 * Compile using:
 *
 *     gcc -o ambient_test ambient_test.c -lcap-ng
 *
 * This program must have the following capabilities to run properly:
 * CAP_NET_RAW, CAP_NET_ADMIN, CAP_SYS_NICE
 *
 * A command to equip the binary with the right caps is:
 *
 *     setcap cap_net_raw,cap_net_admin,cap_sys_nice+p ambient_test
 *
 * To get a shell with additional caps that can be inherited by other processes:
 *
 *     ./ambient_test /bin/bash
 *
 * Verifying that it works:
 *
 * From the bash spawned by ambient_test run
 *
 *     cat /proc/$$/status
 *
 * and have a look at the capabilities.
 */

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <cap-ng.h>
#include <sys/prctl.h>
#include <linux/capability.h>

/*
 * Definitions from the kernel header files. These are going to be removed
 * when the /usr/include files have these defined.
 */
#ifndef PR_CAP_AMBIENT
#define PR_CAP_AMBIENT 47
#define PR_CAP_AMBIENT_IS_SET 1
#define PR_CAP_AMBIENT_RAISE 2
#define PR_CAP_AMBIENT_LOWER 3
#define PR_CAP_AMBIENT_CLEAR_ALL 4
#endif

/* Add cap to the inheritable set, then raise it in the ambient set. */
static void set_ambient_cap(int cap)
{
    int rc;

    capng_get_caps_process();
    rc = capng_update(CAPNG_ADD, CAPNG_INHERITABLE, cap);
    if (rc) {
        printf("Cannot add inheritable cap\n");
        exit(2);
    }
    capng_apply(CAPNG_SELECT_CAPS);

    /* Note the two 0s at the end. Kernel checks for these */
    if (prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, cap, 0, 0)) {
        perror("Cannot set cap");
        exit(1);
    }
}

void usage(const char *me) {
    printf("Usage: %s [-c caps] new-program new-args\n", me);
    exit(1);
}

int default_caplist[] = {CAP_NET_RAW, CAP_NET_ADMIN, CAP_SYS_NICE, -1};

/* Parse a comma-separated list of capability numbers into a -1 terminated array. */
int *get_caplist(const char *arg) {
    int i = 1;
    int *list = NULL;
    char *dup = strdup(arg), *tok;

    for (tok = strtok(dup, ","); tok; tok = strtok(NULL, ",")) {
        list = realloc(list, (i + 1) * sizeof(int));
        if (!list) {
            perror("out of memory");
            exit(1);
        }
        list[i-1] = atoi(tok);
        list[i] = -1;
        i++;
    }
    return list;
}

int main(int argc, char **argv)
{
    int i;
    int *caplist = NULL;
    int index = 1; // argv index for cmd to start

    if (argc < 2)
        usage(argv[0]);

    if (strcmp(argv[1], "-c") == 0) {
        if (argc <= 3) {
            usage(argv[0]);
        }
        caplist = get_caplist(argv[2]);
        index = 3;
    }

    if (!caplist) {
        caplist = (int *)default_caplist;
    }

    for (i = 0; caplist[i] != -1; i++) {
        printf("adding %d to ambient list\n", caplist[i]);
        set_ambient_cap(caplist[i]);
    }

    printf("Ambient_test forking shell\n");
    if (execv(argv[index], argv + index))
        perror("Cannot exec");

    return 0;
}


Tiling windows in Unity

Using the compiz grid plugin, Unity supports placing windows, one at a time, in a tiled-like fashion. However, there is no support for tiling a workspace in one fell stroke. That is something which users of dwm, wmii, i3, xmonad, awesome, qtile etc have come to expect.

A few years ago I ran across a python script called stiler which tiled all windows, mainly using wmctrl. I’ve made a few updates to make that work cleanly in Unity, and have been using that for about a week. Here is how it works:

windows-enter is mapped to “stiler term”. This starts a new terminal (of the type defined in ~/.stilerrc), then tiles the current desktop. windows-j and windows-k are mapped to ‘stiler simple-next’ and ‘stiler simple-prev’, which first call the ‘simple’ function to make sure windows are tiled if they weren’t already, then focus the next or previous window. So, if you have a set of windows which isn’t tiled (for instance you just exited a terminal), you can press windows-j to tile the remaining windows. windows-shift-j cycles the tile locations so that the active window becomes the first non-tiled, etc.

This is clearly very focused on a dwm-like experience. stiler also supports vertical and horizontal layouts, and could easily be taught others like matrix.

If this is something that anyone but me actually wants to use, I’ll package it properly in a ppa, but for now the script can be found at .


Publishing lxd images

While some work remains to be done for ‘lxc publish’, the current support is sufficient to show a full cycle of image workflow with lxd.

Ubuntu wily comes with systemd by default. Sometimes you might need a wily container with upstart. And to repeatedly reproduce some tests on wily with upstart, you might want to create a container image.

# lxc remote add lxc
# lxc launch lxc:ubuntu/wily/amd64 w1
# lxc exec w1 -- apt-get -y install upstart-bin upstart-sysv
# lxc stop w1
# lxc publish --public w1 --alias=wily-with-upstart
# lxc image copy wily-with-upstart remote:  # optional

Now you can start a new container using

# lxc launch wily-with-upstart w-test-1
# lxc exec w-test-1 -- ls -alh /sbin/init
lrwxrwxrwx 1 root root 7 May 18 10:20 /sbin/init -> upstart
# lxc exec w-test-1 run-my-tests

Importantly, because “--public” was passed to the lxc publish command, anyone who can reach your lxd server or the image server at “remote:” will also be able to use the image. Of course, for private images, don’t use “--public”.

