Ambient capabilities

There are several problems with posix capabilities. The first is the name: capabilities are something entirely different, so now we have to distinguish between “classical” and “posix” capabilities. Next, capabilities come from a defunct posix draft. That’s a serious downside for some people.

But another complaint has come up several times since file capabilities were implemented in Linux: people wanted an easy way for a program, once it has capabilities, to keep them. Capabilities are re-calculated every time the task executes a new file, taking the executable file’s capabilities into account. If a file has no capabilities, then (outside of the special exception for root when SECBIT_NOROOT is off) the resulting privilege set will be empty. And for shellscripts, file capabilities are always empty.

Fundamental to posix capabilities is the concept that part of your authority stems from who you are, and part stems from the programs you run. In a world of trojan horses and signed binaries this may seem sensible, but in the real world it is not always desirable. In particular, consider a case where a program wants to run as non-root user, but with a few capabilities – perhaps only cap_net_admin. If there is a very small set of files which the program may want to execute with privilege, and none are scripts, then cap_net_admin could be added to the inheritable file privileges for each of those programs. Then only processes with cap_net_admin in their inheritable process capabilities will be able to run those programs with privilege. But what if the program wants to run *anything*, including scripts and without having to predict what will be executed? This currently is not possible.

Christoph Lameter has been facing this problem for some time, and requested an enhancement of posix capabilities to allow him to solve it. Not only did he raise the problem and provide a good, real use case, he also sent several patches for discussion. In the end, a concept of “ambient capabilities” was agreed to and implemented (final patch by Andy Lutomirski). It’s currently available in -mm.

Here is how it works:

(Note – for more background on posix capabilities as implemented in linux, please see this Linux Symposium paper. For an example of how to use file capabilities to run as non-root before ambient capabilities, see this Linux Journal article. The ambient capability set has gotten several LWN mentions as well.)

Tasks have a new capability set, pA, the ambient set. As Andy Lutomirski put it, “pA does what most people expect pI to do.” A bit can only be set in pA if it is in both pP and pI, and it is dropped from pA if it is dropped from either pP or pI. When a new file is executed, all bits in pA are enabled in pP. Note though that executing any file which has file capabilities, or using the SECBIT_KEEPCAPS prctl option (followed by setresuid), will clear pA after the next exec.
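The rules above can be modelled with plain bitmasks. The following is only a sketch, not kernel code: the struct and function names are made up for illustration, the formulas follow the capabilities(7) description for a non-root task, and setuid/setgid files (which also clear pA) are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Capability sets of a task, one bit per capability number. */
struct task_caps {
    uint64_t permitted;   /* pP */
    uint64_t inheritable; /* pI */
    uint64_t effective;   /* pE */
    uint64_t ambient;     /* pA */
    uint64_t bounding;    /* bounding set */
};

/* Capability attributes of the file being executed. */
struct file_caps {
    uint64_t permitted;   /* fP */
    uint64_t inheritable; /* fI */
    bool effective;       /* fE is a single flag */
};

/* Raising a bit in pA is allowed only if it is in both pP and pI. */
bool raise_ambient(struct task_caps *t, int cap)
{
    assert(cap >= 0 && cap < 64);
    uint64_t bit = UINT64_C(1) << cap;

    if (!(t->permitted & bit) || !(t->inheritable & bit))
        return false;
    t->ambient |= bit;
    return true;
}

/* Recalculate the sets the way exec does for a non-root task. */
struct task_caps caps_after_exec(struct task_caps t, struct file_caps f)
{
    /* Hypothetical simplification: only file capabilities count here. */
    bool file_has_caps = f.permitted || f.inheritable || f.effective;
    struct task_caps n = t;

    n.ambient = file_has_caps ? 0 : t.ambient;
    n.permitted = (t.inheritable & f.inheritable)
                | (f.permitted & t.bounding)
                | n.ambient;
    n.effective = f.effective ? n.permitted : n.ambient;
    return n;
}
```

With a shell script (no file capabilities at all), a task that holds cap_net_admin only in pP ends up with an empty pP after exec, while a task that first raised it into pA keeps it in pP and pE.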

So once a program moves CAP_NET_ADMIN into its pA, it can proceed to fork+exec a shellscript doing some /sbin/ip processing without losing CAP_NET_ADMIN.

How to use it (example):

Below is a test program, originally by Christoph, which I slightly modified. Write it to a file ‘ambient.c’. Build it, using

$ gcc -o ambient ambient.c -lcap-ng

Then assign it a set of file capabilities, for instance:

$ sudo setcap cap_net_raw,cap_net_admin,cap_sys_nice,cap_setpcap+p ambient

I was lazy and didn’t add interpretation of capabilities to ambient.c, so you’ll need to check /usr/include/linux/capability.h for the integers representing each capability. Run a shell with ambient capabilities by running, for instance:

$ ./ambient -c 13,12,23,8 /bin/bash

In this shell, check your capabilities:

$ grep Cap /proc/self/status
CapInh: 0000000000803100
CapPrm: 0000000000803100
CapEff: 0000000000803100
CapBnd: 0000003fffffffff
CapAmb: 0000000000803100

You can see that you have the requested ambient capabilities. If you run a new shell there, it retains those capabilities:

$ bash -c "grep Cap /proc/self/status"
CapInh: 0000000000803100
CapPrm: 0000000000803100
CapEff: 0000000000803100
CapBnd: 0000003fffffffff
CapAmb: 0000000000803100

What if we drop all but cap_net_admin from our inheritable set? We can test that using the ‘capsh’ program shipped with libcap:

$ capsh --caps=cap_net_admin=pi -- -c "grep Cap /proc/self/status"
CapInh: 0000000000001000
CapPrm: 0000000000001000
CapEff: 0000000000001000
CapBnd: 0000003fffffffff
CapAmb: 0000000000001000

As you can see, the other capabilities were dropped from our ambient, and hence from our effective set.
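The hex values in these listings are just one bit per capability number. As a small illustration (the helper names are made up for this sketch), the mask for the list 13,12,23,8 used above can be computed like this:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Build the bitmask shown in /proc/<pid>/status from capability numbers. */
uint64_t caps_to_mask(const int *caps, size_t n)
{
    uint64_t mask = 0;

    for (size_t i = 0; i < n; i++) {
        assert(caps[i] >= 0 && caps[i] < 64);
        mask |= UINT64_C(1) << caps[i];
    }
    return mask;
}

/* Return nonzero if capability number cap is set in mask. */
int mask_has_cap(uint64_t mask, int cap)
{
    assert(cap >= 0 && cap < 64);
    return (int)((mask >> cap) & 1);
}
```

Capabilities 13, 12, 23 and 8 give 0x803100, and the 0x1000 left after the capsh call is bit 12 alone, cap_net_admin.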

ambient.c source

/*
 * Test program for the ambient capabilities. This program spawns a shell
 * that allows running processes with a defined set of capabilities.
 *
 * (C) 2015 Christoph Lameter
 * (C) 2015 Serge Hallyn
 * Released under: GPL v3 or later.
 *
 * Compile using:
 *     gcc -o ambient_test ambient_test.c -lcap-ng
 *
 * This program must have the following capabilities to run properly:
 * CAP_NET_RAW, CAP_NET_ADMIN, CAP_SYS_NICE (the default list below).
 * A command to equip the binary with the right caps is:
 *     setcap cap_net_raw,cap_net_admin,cap_sys_nice+p ambient_test
 *
 * To get a shell with additional caps that can be inherited by other processes:
 *     ./ambient_test /bin/bash
 *
 * Verifying that it works:
 * From the bash spawned by ambient_test run
 *     cat /proc/$$/status
 * and have a look at the capabilities.
 */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <cap-ng.h>
#include <sys/prctl.h>
#include <linux/capability.h>

/*
 * Definitions from the kernel header files. These are going to be removed
 * when the /usr/include files have these defined.
 */
#ifndef PR_CAP_AMBIENT
#define PR_CAP_AMBIENT 47
#define PR_CAP_AMBIENT_RAISE 2
#endif

/* Add cap to the inheritable set, then raise it in the ambient set. */
static void set_ambient_cap(int cap)
{
    int rc;

    /* Load the current process capabilities into libcap-ng's state. */
    capng_get_caps_process();
    rc = capng_update(CAPNG_ADD, CAPNG_INHERITABLE, cap);
    if (rc) {
        printf("Cannot add inheritable cap\n");
        exit(2);
    }
    /* Push the updated inheritable set to the kernel. */
    capng_apply(CAPNG_SELECT_CAPS);

    /* Note the two 0s at the end. Kernel checks for these */
    if (prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, cap, 0, 0)) {
        perror("Cannot set cap");
        exit(1);
    }
}

static void usage(const char *me)
{
    printf("Usage: %s [-c caps] new-program new-args\n", me);
    exit(1);
}

static int default_caplist[] = {CAP_NET_RAW, CAP_NET_ADMIN, CAP_SYS_NICE, -1};

/* Parse "13,12,23" into a -1 terminated array of capability numbers. */
static int *get_caplist(const char *arg)
{
    int i = 1;
    int *list = NULL;
    char *dup = strdup(arg), *tok;

    if (!dup) {
        perror("out of memory");
        exit(1);
    }
    for (tok = strtok(dup, ","); tok; tok = strtok(NULL, ",")) {
        list = realloc(list, (i + 1) * sizeof(int));
        if (!list) {
            perror("out of memory");
            exit(1);
        }
        list[i-1] = atoi(tok);
        list[i] = -1;
        i++;
    }
    free(dup);
    return list;
}

int main(int argc, char **argv)
{
    int i;
    int *caplist = NULL;
    int index = 1; // argv index for cmd to start

    if (argc < 2)
        usage(argv[0]);

    if (strcmp(argv[1], "-c") == 0) {
        if (argc <= 3)
            usage(argv[0]);
        caplist = get_caplist(argv[2]);
        index = 3;
    }

    if (!caplist)
        caplist = default_caplist;

    for (i = 0; caplist[i] != -1; i++) {
        printf("adding %d to ambient list\n", caplist[i]);
        set_ambient_cap(caplist[i]);
    }

    printf("Ambient_test forking shell\n");
    if (execv(argv[index], argv + index))
        perror("Cannot exec");

    return 0;
}


Tiling windows in Unity

Using the compiz grid plugin, Unity supports placing windows, one at a time, in a tiled-like fashion. However, there is no support for tiling a workspace in one fell swoop. That is something which users of dwm, wmii, i3, xmonad, awesome, qtile etc. come to expect.

A few years ago I ran across a python script called stiler which tiled all windows, mainly using wmctrl. I’ve made a few updates to make that work cleanly in Unity, and have been using that for about a week. Here is how it works:

windows-enter is mapped to “stiler term”. This starts a new terminal (of the type defined in ~/.stilerrc), then tiles the current desktop. windows-j and windows-k are mapped to ‘stiler simple-next’ and ‘stiler simple-prev’, which first call the ‘simple’ function to make sure windows are tiled if they weren’t already, then focus the next or previous window. So, if you have a set of windows which isn’t tiled (for instance you just exited a terminal), you can win-j to tile the remaining windows. Windows-shift-j cycles the tile locations so that the active window becomes the first non-tiled, etc.

This is clearly very focused on a dwm-like experience. stiler also supports vertical and horizontal layouts, and could easily be taught others like matrix.
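To give an idea of what a dwm-like layout computes, here is a rough sketch of the geometry. This is not stiler's code (stiler is a python script); the struct, the function name and the 50/50 split are assumptions made for the example. The first window takes the left half of the screen and the remaining windows share the right half in equal rows.

```c
#include <assert.h>

struct rect { int x, y, w, h; };

/*
 * Geometry of window i (0-based) out of n on a screen of sw x sh pixels.
 * Window 0 is the large left tile, the others are stacked on the right.
 */
struct rect tile_geometry(int i, int n, int sw, int sh)
{
    assert(n > 0 && i >= 0 && i < n);
    struct rect r = { 0, 0, sw, sh };   /* a lone window fills the screen */

    if (n == 1)
        return r;
    if (i == 0) {
        r.w = sw / 2;                   /* left half, full height */
        return r;
    }

    int rows = n - 1;
    int row_h = sh / rows;

    r.x = sw / 2;
    r.w = sw - sw / 2;
    r.y = (i - 1) * row_h;
    /* The last row absorbs any rounding remainder. */
    r.h = (i == n - 1) ? sh - r.y : row_h;
    return r;
}
```

Each resulting rectangle is what a tool such as wmctrl would then be asked to apply to the corresponding window.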

If this is something that anyone but me actually wants to use, I’ll package properly in ppa, but for now the script can be found at .


Publishing lxd images

While some work remains to be done for ‘lxc publish’, the current support is sufficient to show a full cycle of image workflow with lxd.

Ubuntu wily comes with systemd by default. Sometimes you might need a wily container with upstart. And to repeatedly reproduce some tests on wily with upstart, you might want to create a container image.

# lxc remote add lxc
# lxc launch lxc:ubuntu/wily/amd64 w1
# lxc exec w1 -- apt-get -y install upstart-bin upstart-sysv
# lxc stop w1
# lxc publish --public w1 --alias=wily-with-upstart
# lxc image copy wily-with-upstart remote:  # optional

Now you can start a new container using

# lxc launch wily-with-upstart w-test-1
# lxc exec w-test-1 -- ls -alh /sbin/init
lrwxrwxrwx 1 root root 7 May 18 10:20 /sbin/init -> upstart
# lxc exec w-test-1 run-my-tests

Importantly, because “--public” was passed to the lxc publish command, anyone who can reach your lxd server or the image server at “remote:” will also be able to use the image. Of course, for private images, don’t use “--public”.



LXD 0.3


LXD 0.3 has been released. This version provides huge usability improvements over past versions.

Getting started

Here’s an example of quickly getting started on a fresh Ubuntu 15.04 VM:

sudo add-apt-repository ppa:ubuntu-lxc/lxd-daily
sudo apt-get update
sudo apt-get install lxd
sudo lxd-images import lxc ubuntu trusty amd64 --alias ubuntu

(If you are using Ubuntu 14.04 Trusty, you can just add ppa:ubuntu-lxc/daily to get the up-to-date packages; if running something else, see the LXD website for instructions.)

lxd-images is a temporary script which downloads an image from You can also manually import any valid image tarball using the ‘lxc image import’ command, however the goal eventually is to have images automatically be downloaded (subject to your consent, i.e. depending on your current network situation) by the lxd package.

You can download and import a debian image by doing:

lxd-images import lxc debian wheezy amd64 --alias debian

You can view the list of available local images by doing:

ubuntu@vlxd:~$ lxc image list
| debian | 532fc26c | no     |             |
| ubuntu | 8d39d97e | no     |             |

Once the image is downloaded, you can launch containers based on it:

ubuntu@vlxd:~$ lxc launch debian d1
Creating container...done
Starting container...done
ubuntu@vlxd:~$ lxc launch ubuntu u2
Creating container...done
Starting container...done

Container manipulation

ubuntu@vlxd:~$ lxc list
| NAME |  STATE  |         IPV4          | IPV6 |
| u1   | RUNNING |, | ::1  |
| u2   | RUNNING |,  | ::1  |
| d1   | RUNNING |             | ::1  |

ubuntu@vlxd:~$ lxc exec u2 bash
root@u2:~# ping -c 1
PING ( 56(84) bytes of data.
64 bytes from icmp_seq=1 ttl=64 time=0.108 ms

--- ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.108/0.108/0.108/0.000 ms
root@u2:~# exit

ubuntu@vlxd:~$ lxc exec u1 -- ps -ef
root         1     0  0 04:40 ?        00:00:00 /sbin/init
root       399     1  0 04:40 ?        00:00:00 upstart-udev-bridge --daemon
root       433     1  0 04:40 ?        00:00:00 /lib/systemd/systemd-udevd --dae
syslog     547     1  0 04:40 ?        00:00:00 rsyslogd
root       570     1  0 04:40 ?        00:00:00 upstart-file-bridge --daemon
root       571     1  0 04:40 ?        00:00:00 upstart-socket-bridge --daemon
root      1380     1  0 04:40 ?        00:00:00 dhclient -1 -v -pf /run/dhclient
root      1446     1  0 04:40 tty4     00:00:00 /sbin/getty -8 38400 tty4
root      1448     1  0 04:40 tty2     00:00:00 /sbin/getty -8 38400 tty2
root      1449     1  0 04:40 tty3     00:00:00 /sbin/getty -8 38400 tty3
root      1458     1  0 04:40 ?        00:00:00 cron
root      1490     1  0 04:40 console  00:00:00 /sbin/getty -8 38400 console
root      1492     1  0 04:40 tty1     00:00:00 /sbin/getty -8 38400 tty1
root      1530     0  0 04:42 ?        00:00:00 ps -ef


Version 0.3 introduces container configuration and profiles. Both are configured using the ‘lxc config’ command. By default, new containers are created with the ‘default’ profile, which has a nic enabled on bridge lxcbr0. You can edit this profile by doing

lxc config profile edit default

which will bring the profile up in an external editor, and update when you save and exit.

To take the default profile out of container u1,

ubuntu@vlxd:~$ lxc config profile list
ubuntu@vlxd:~$ lxc config show u1
Profiles: default
ubuntu@vlxd:~$ lxc config profile apply u1
Profile (none) applied to u1
ubuntu@vlxd:~$ lxc config show u1

Now u1 won’t have any nics.

Let’s say we want a container to have two nics. We can do this a few ways. We can create a new profile with a second nic, and apply both profiles. We can create a new profile with two nics, and apply only that one. Or we can add the device right to the container, like so:

ubuntu@vlxd:~$ lxc config device add u1 eth1 nic nictype=bridged parent=lxcbr1
Device eth1 added to u1
ubuntu@vlxd:~$ lxc config device list u1
eth1: nic
eth0: nic

I’ve only shown local usage in this post. This means I’ve left out the exciting part – remote usage! I’ll leave that for the next post.

In the meantime, you can get lxd from the above-cited ppa, from the lxd website, or from github.


Introducing lxcfs

Last year around this time, we were announcing the availability of cgmanager, a daemon allowing users and programs to easily administer and delegate cgroups over a dbus interface. It was key to supporting nested containers and unprivileged users.

While its dbus interface turned out to have tremendous benefits (I wasn’t sold at first), there are programs which want to continue using the cgroup file interface. To support use of these in a container with the same delegation benefits of cgmanager, there is now lxcfs.

Lxcfs is a fuse filesystem mainly designed for use by lxc containers. On an Ubuntu 15.04 system, it will be used by default to provide two things: first, a virtualized view of some /proc files; and secondly, filtered access to the host’s cgroup filesystems.

The proc files filtered by lxcfs are cpuinfo, meminfo, stat, and uptime. These are filtered using cgroup information to show only the cpus and memory which are available to the reading task. They can be seen on the host under /var/lib/lxcfs/proc, and containers by default will bind-mount the proc files over the container’s proc files. There have been several attempts to push this virtualization into /proc itself, but those have been rejected. The proposed alternative was to write a library which all userspace would use to get filtered /proc information. Unfortunately no such effort seems to be taking off, and if it took off now it wouldn’t help with legacy containers. In contrast, lxcfs works perfectly with 12.04 and 14.04 containers.
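As a rough illustration of the kind of calculation involved (this is not lxcfs code, and the function name is made up), here is how the number of cpus to show could be derived from a cpuset-style list such as the “0-1,4” found in a cpuset cgroup’s cpuset.cpus file:

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Count the cpus named by a cpuset-style list such as "0-1,4".
 * Returns -1 if a range is written backwards (e.g. "3-1").
 */
int count_cpus(const char *list)
{
    assert(list != NULL);
    int count = 0;
    const char *p = list;

    while (*p) {
        char *end;
        long first = strtol(p, &end, 10);

        if (end == p)
            break;                      /* no more numbers, e.g. trailing newline */
        long last = first;
        p = end;
        if (*p == '-') {                /* a range "first-last" */
            last = strtol(p + 1, &end, 10);
            if (end == p + 1)
                break;
            p = end;
        }
        if (last < first)
            return -1;
        count += (int)(last - first + 1);
        if (*p == ',')
            p++;
    }
    return count;
}
```

A container restricted to cpus “0-1,4” would then see three processors in its cpuinfo, whatever the host has.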

The cgroups are mounted per-host-mounted-hierarchy under /var/lib/lxcfs/cgroup/. When a container is started, each filtered hierarchy will be bind-mounted under /sys/fs/cgroup/* in the container. The container cannot see any information for ancestor cgroups, so for instance /var/lib/lxcfs/cgroup/freezer will contain only a directory called ‘lxc’ or ‘user.slice’.
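The “no ancestors” rule boils down to a path check. The sketch below only illustrates the idea and is not taken from lxcfs; the function name and the example paths are assumptions. A path is visible if it is the caller’s own cgroup or lies below it.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* True if path is the cgroup base itself or a descendant of it. */
bool cgroup_path_visible(const char *base, const char *path)
{
    assert(base != NULL && path != NULL);
    size_t n = strlen(base);

    /* Everything is below the root cgroup. */
    if (n == 1 && base[0] == '/')
        return path[0] == '/';
    if (strncmp(base, path, n) != 0)
        return false;
    /* Require a component boundary so "/lxc/v10" is not under "/lxc/v1". */
    return path[n] == '\0' || path[n] == '/';
}
```

For a task in /lxc/v1, the path /lxc/v1/child is visible, while /lxc and /lxc/v10 are not.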

Lxcfs was instrumental in allowing us to boot systemd containers, both privileged and unprivileged. It also, through its proc filtering, answers a frequent years-old request. We do hope that kernel support for cgroup namespaces will eventually allow us to drop the cgroup part of lxcfs. Since we’ll need to support LTS containers for some time, that will definitely require cgroup namespace support for non-unified hierarchies, but that’s not out of the realm of possibilities.

Lxcfs is packaged in ubuntu 15.04, the source is hosted at, and news can be tracked at

In summary, on a 15.04 host, you can now create a container the usual way,

lxc-create -t download -n v1 -- -d ubuntu -r vivid -a amd64

The resulting container will have “correct” results for uptime, top, etc.

root@v1:~# uptime
03:09:08 up 0 min, 0 users, load average: 0.02, 0.13, 0.12

It will get cgroup hierarchies under /sys/fs/cgroup:

root@v1:~# find /sys/fs/cgroup/freezer/

And, it can run systemd as init.


Where does lxd fit in

Since its announcement, there appears to have been some confusion and concern about lxd, how it relates to lxc, and whether it will be taking away from lxc development.

When lxc was first started around 2007, it was mainly a userspace tool – some c code and some shell scripts – to exercise the in-development new kernel features intended for container and checkpoint-restart functionality. The lxc command line experience, after all these years, is quite set in stone. While it is not ideal (the mandatory -n annoys a lot of people), it has served us very well for a long time.

A few years ago, we took all of the main container-related functions which could be done with various commands, and exported them through the new ‘lxc API’. For instance, lxc-create had been a script, and lxc-start and lxc-execute were separate c programs. The new lxc ‘API’ was premised around a container object with methods, including ‘create’ and ‘start’, for the common operations.

From the start we had in mind at least python bindings to the API, and in quick order bindings came into being for C, python3, python2, go, lua, haskell, and more, allowing container administration from these languages without having to shell out to the lxc commands. So now code running on the same machine can manipulate containers. But we still have the arguably crufty command line language, and the API is local only.

lxd addresses those two issues. First, it presents a REST API for manipulating containers, thereby exporting container management over the network. Secondly, it offers a command line client using the REST API to administer containers across remote hosts. The command line API is basically what we came up with when we asked “what, after years of working with containers, would be the perfect, intuitive, most concise and still flexible CLI we could imagine?” For handling remote containers it borrows some good parts of the git remote API. (I say “we” here, but really the inestimable stgraber designed the CLI). This allows us to leave the legacy lxc api as-is for administering local containers (“lxc-create”, “lxc-start”, etc), while giving us a nicer API and easier administration using the new CLI (“lxc start c1”, “lxc start images:ubuntu/trusty/amd64 host2:new-container”).
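To make the “remote:name” convention concrete, here is a small sketch of splitting such an argument. It is an illustration only, not the lxd client’s parser (which is written in go); the struct, the function name and the 64-byte limits are assumptions.

```c
#include <assert.h>
#include <string.h>

struct target {
    char remote[64];    /* empty string means the local host */
    char name[64];      /* container or image name */
};

/* Split "remote:name" at the first ':'. Returns 0 on success, -1 if too long. */
int parse_target(const char *spec, struct target *out)
{
    assert(spec != NULL && out != NULL);
    const char *colon = strchr(spec, ':');
    size_t rlen = colon ? (size_t)(colon - spec) : 0;
    const char *name = colon ? colon + 1 : spec;

    if (rlen >= sizeof out->remote || strlen(name) >= sizeof out->name)
        return -1;
    memcpy(out->remote, spec, rlen);
    out->remote[rlen] = '\0';
    strcpy(out->name, name);
    return 0;
}
```

With this, “c1” is a local container, “host2:new-container” names a container on the remote host2, and “images:ubuntu/trusty/amd64” names an image on the remote called images.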

Above all, lxd exports a new interface over the network, but entirely wrapped around lxc. So lxc will not be going away, and focus on lxd will mean further improvements for lxc, not a shift away from lxc.


Live container migration – on its way

The criu project has been working hard to make application checkpoint/restart feasible. Tycho has implemented lxc-checkpoint and lxc-restart on top of that (as well as of course contributing the needed bits to criu itself), and now shows off first steps toward real live migration:

