LXD Image Updates

LXD is sweet. To create a ubuntu xenial container, just do

lxc launch ubuntu:xenial x1

The remote xenial image will be automatically fetched on the first use, and cached for future uses. Nifty. Hmm… but what then?

There are three server settings which guide how images are treated after their initial use. And setting these can make the difference between frequent higher bandwidth use and more uptodate images.

The discussion will be eased if we define a term: a ‘cached’ image is one which is created implicitly as part of a container creation. I.e. the first time that you do

lxc launch ubuntu:trusty t1

lxd will fetch the ubuntu:trusty image, implicitly creating a cached copy on the target remote. In contrast, you can explicitly create a non-cached image using

lxc image copy ubuntu:wily local: –alias wily

As you might expect, these two images are treated a bit differently. The first is cached for a while to save on future bandwidth, but it’s a second class citizen compared to the second image, which the user explicitly requested.

With that in mind, as I said earlier, there are three configuration settings which guide image behavior.

First there is the cached image auto update setting,

lxc config set images.auto_update_cached true

Any cached images which are created while this is set to true will be auto-updated. Note that setting this to true after creating a container will not change that pre-existing cached image’s auto update setting.

For explicitly created images, you can instead specify auto-update manually using

lxc image copy ubuntu:wily local: –alias wily –auto-update

The second setting is remote_cache_expiry:

lxc config set images.remote_cache_expiry 1

This specifies, in days, how long an unused cached image is kept in the cache before it is flushed. If this value is not specified, then it defaults to 10 days. Note that explicitly copied images are not expired.

Finally, auto_update_interval specifies how many hours should pass between lxd’s checks for whether a cached image should be updated. Setting this value to 0 disables the checks. So if you find yourself temporarily on a slow, expensive, 3G router, you may want to

lxc config set images.auto_update_interval 0

to save yourself from accidentally downloading a few hundred megs or more of updated image contents. On the other hand, when back at the datacenter, you might want to

lxc config set images.auto_update_interval 1

to keep as in sync as possible.

Posted in Uncategorized | 5 Comments

Docker in LXD

Since the very early days of upstream Linux containers – around 2006 – we’ve been distinguishing between ‘application’ and ‘system’ containers. (The definition of application containers has changed a bit, and their use case has changed a *lot*, but the general gist remains the same).

A few years ago I would get regular – daily! – queries by lots of people asking what I thought of Docker. Some asked because, as one of the early people involved in kernel container functionality, I’d be interested. Others did so because I had been working with http://linuxcontainers.org/lxc, a particular container administration suite, and thought I’d feel competitive. However, as we’ve said for a long time, Docker is a great tool for application containers and application container purposes. From a LXC/LXD perspective, we’re looking at different use cases. One of those is hosting containers in which to run Docker:)

And, in Ubuntu 16.04, you can easily do so. (The Docker patches to enable this are on their way upstream.) To run Docker inside a container, the container must have a few properties. These are conferred by the ‘docker’ profile. The docker profile does not include a network interface, so you’ll want to create a container with both the default and docker profiles:

lxc launch ubuntu-daily:xenial docker1 -p default -p docker

Now, enter the container and install the docker.io package:

lxc exec docker1 — apt update
lxc exec docker1 — apt install docker.io
lxc exec docker1 — docker pull ubuntu
lxc exec docker1 — docker run -it ubuntu bash

et voila, a docker container is running inside your lxd container. By itself this may seem like a novelty. However, when you start deploying the lxd hosts with openstack nova-lxd plugin or juju-lxd, the possibilities are endless.

Posted in Uncategorized | Leave a comment

PSA: nested lxc containers

lxc has long supported nesting containers. There’s a lot of (historically accurate) documentation out there saying to use the line

lxc.aa_profile = lxc-container-default-with-nesting

to enable that. Sadly, a somewhat new kernel restriction has recently required a bit more work. To support that, the new way to support nesting in lxc is to use the configuration line:

lxc.include = /usr/share/lxc/config/nesting.conf

That configuration file includes the old aa_profile line. If you have your own custom nesting profile, you can follow the above lxc.include line with your lxc.aa_profile line, i.e.

lxc.include = /usr/share/lxc/config/nesting.conf
lxc.aa_profile = my-custom-nesting-profile

If you’re using lxd, this of course does not affect you. You can continue to use the ‘security.nesting: true’ config property as always.

Posted in Uncategorized | Leave a comment

Containers – inspect, don’t introspect

You’ve got a whatzit daemon running in a VM. The VM starts acting suspiciously – a lot more cpu, memory, or i/o than you’d expect. What do you do? You could log in and look around. But if the VM’s been 0wned, you may be running trojaned tools in the VM. In that case, you’d be better off mounting the VM’s root disk and looking around from your (hopefully) safe root context.

The same of course is true in containers. lxc-attach is a very convenient tool, as it doesn’t even require you to be running ssh in the container. But you’re trusting the container to be pristine.

One of the cool things about containers is that you can inspect pretty flexibly from the host. While the whatzit daemon is still running, you can strace it from the host, you can look for instance at it’s proc filesystem through /proc/$(pidof whatzit)/root/proc, you can see its process tree by just doing ps (i.e. pstree, ps -axjf).

So, the point of this post is mainly to recommend doing so:) Importantly, I’m not claiming here “and therefore containers are better/safer” – that would be nonsense. (The trivial counter argument would be that the container shares – and can easily exploit – the shared kernel). Rather, the point is to use the appropriate tools and, then, to use them as well as possible by exploiting its advantages.

Posted in Uncategorized | Leave a comment

Cgroups are now handled a bit differently in Xenial

In the past, when you logged into an Ubuntu system, you would receive and be logged into a cgroup which you owned, one per controller (i.e. memory, freezer, etc). The main reason for this is so that unprivileged users can use things like lxc.

However this caused some trouble, especially through the cpuset controller. The problem is that when a cpu is plugged in, it is not added to any existing cpusets (in the legacy cgroup hierarchy, which we use). This is true even if you previously unplugged that cpu. So if your system has two cpus, when you first login you have cpus 0-1. 1 gets unplugged and replugged, now you only have 0. Now 0 gets unplugged…

The cgroup creation previously was done through a systemd patch, and is not configurable. In Xenial, we’ve now reduced that patch to only work on the name=systemd cgroup. Other controllers are to be handled by the new libpam-cgm package. By default it only creates a cgroup for the freezer controller. You can change the list by editing /etc/pam.d/common-session. For instance to add memory, you would change the line

optional pam_cgm.so -c freezer

to

optional pam_cgm.so -c freezer,memory

One more change expected to come to Xenial is to switch libpam-cgm to using lxcfs instead of cgmanager (or, just as likely, create a new conflicting libpam-cgroup package which does so). Since Xenial and later systems use systemd, which won’t boot without lxcfs anyway, we’ll lose no functionality by requiring lxcfs for unprivileged container creation on login.

On a side note, reducing the set of user-owned cgroups also required a patch to lxc. This means that in a mixture of nested lxcs, you may run into trouble if using nested unprivileged containers in older releases. For instance, if you create an unprivileged Trusty container on a Xenial host, you won’t own the memory cgroup by default, even if you’re root in the container. At the moment Trusty’s lxc doesn’t know how to handle that yet to create a nested container. The lxc patches should hopefully get SRUd, but in the meantime you can use the ubuntu-lxc ppas to get newer packages if needed. (Note that this is a non-issue when running lxd on the host.)

Posted in Uncategorized | Leave a comment

Nested containers in LXD

We’ve long considered nested containers an important use case in lxc. Lxd is no different in this regard. Lately there have been several questions
If you are using privileged lxd containers (security.privileged: true), then the only thing you need to do is to set the security.nesting flag to true:

lxc launch ubuntu nestc1 -c security.nesting=true -c security.privileged=true

or to change an existing container:

lxc config set nestc1 security.nesting true

However, we heavily encourage the use of unprivileged containers whenever possible. Nesting with unprivileged containers works just as well, but requires an extra step.

Recall that unprivileged users run in a user namespace. A user namespace has a mapping from host uids to container uids. For instance, the range of host uids 100000-199999 might be mapped to container uids 0-99999. The key insight for nesting is that you can only map uids which are defined in the parent container. So in this example, we cannot map uids 100000-199999 to nested containers because they do not exist! So we have two choices – either choose uids which do exist, or increase the range passed to parent containers. Since lxd currently demands at least 65k uids and gids, we’ll have to go with the latter.

Generally this isn’t too complicated. If you wish to run container c3 in container c2 in container c1, you’ll need 65536 uids in c3; in c2 you’ll need 65536 for c2 itself plus the 65536 for c3; and in c1 you’ll need 65536 for c1 plus 65536 for c2 plus 65536 for c3.

Lxd will gain per-tenant uid mappings, but for now you create the allocations by editing /etc/subuid and /etc/subgid (or by using usermod). On the host, we’ll delegate the 196608 ids starting at 500000 to the root user:

sed -i ‘/^root:/d’ /etc/subuid /etc/subgid
echo “root:500000:196608” >> /etc/subuid
echo “root:500000:196608” >> /etc/subgid

The first number is the host uid being delegated, and the second is the range. We know lxd will map those to the same number of ids starting at 0. On the host we have all uids available, but in the first container only ids 0-196607 will be defined.

Now make sure lxd is stopped, then restart it and create a container

lxc launch ubuntu c1 -c security.nesting=true

Log into c1, and set the subuid and subgid entries to:

root:65536:131072

Create your c2 container now,

lxc launch ubuntu c2 -c security.nesting=true

log in and this time set the subuid and subgid entires to:

root:65536:65536

Now you can create c3,

lxc launch ubuntu c3

You could of course go deeper, if you changed the allocations.

If this all seems a bit too much work, I’ve written a little program (whose functionality may eventually move into lxd in some form or other) called uidmapviz, which aims to show you what allocations look like, and warns you if a configuration won’t work due to too few subuids.

Extra tip of the day

lxc file push and pull are very handy. Whether the container is running or not, instead of having to get ssh set up in the container or knowing where the rootfs is mounted, you can simply

lxc image export trusty

This produces the rootfs and metadata files for the image called ‘trusty’ (assuming it exists) in your current directory. Push them both into the container, using

lxc file push meta-ubuntu-trusty-14.04-amd64-server-20150928.tar.xz nestc1/meta-ubuntu-trusty-14.04-amd64-server-20150928.tar.xz
lxc file push ubuntu-trusty-14.04-amd64-server-20150928.tar.xz nestc1/ubuntu-trusty-14.04-amd64-server-20150928.tar.xz

then in the container

lxc image import /meta-ubuntu-trusty-14.04-amd64-server-20150928.tar.xz /ubuntu-trusty-14.04-amd64-server-20150928.tar.xz

which is how i copied images into containers for nesting, rather than waiting for lxd-images to pull images from the network.

Posted in Uncategorized | 4 Comments

Ambient capabilities

There are several problems with posix capabilities. The first is the name: capabilities are something entirely different, so now we have to distinguish between “classical” and “posix” capabilities. Next, capabilities come from a defunct posix draft. That’s a serious downside for some people.

But another complaint has come up several times since file capabilities were implemented in Linux: people wanted an easy way for a program, once it has capabilities, to keep them. Capabilities are re-calculated every time the task executes a new file, taking the executable file’s capabilities into account. If a file has no capabilities, then (outside of the special exception for root when SECBIT_NOROOT is off) the resulting privilege set will be empty. And for shellscripts, file capabilities are always empty.

Fundamental to posix capabilities is the concept that part of your authority stems from who you are, and part stems from the programs you run. In a world of trojan horses and signed binaries this may seem sensible, but in the real world it is not always desirable. In particular, consider a case where a program wants to run as non-root user, but with a few capabilities – perhaps only cap_net_admin. If there is a very small set of files which the program may want to execute with privilege, and none are scripts, then cap_net_admin could be added to the inheritable file privileges for each of those programs. Then only processes with cap_net_admin in their inheritable process capabilities will be able to run those programs with privilege. But what if the program wants to run *anything*, including scripts and without having to predict what will be executed? This currently is not possible.

Christopher Lameter has been facing this problem for some time, and requested an enhancement of posix capabilities to allow him to solve it. Not only did he raise the problem and provide a good, real use case, he also sent several patches for discussion. In the end, a concept of “ambient capabilities” was agreed to and implemented (final patch by Andy Lutomirski). It’s currently available in -mm.

Here is how it works:

(Note – for more background on posix capabilities as implemented in linux, please see this Linux Symposium paper. For an example of how to use file capabilities to run as non-root before ambient capabilities, see this Linux Journal article. The ambient capability set has gotten several LWN mentions as well.)

Tasks have a new capability set, pA, the ambient set. As Andy Lutomirski put it, “pA does what most people expect pI to do.” Bits can only be set in pA if they are in pP or pI, and they are dropped from pA if they are dropped from pP or pI. When a new file is executed, all bits in pA are enabled in pP. Note though that executing any file which has file capabilities, or using the SECBIT_KEEPCAPS prctl option (followed by setresuid), will clear pA after the next exec.

So once a program moves CAP_NET_ADMIN into its pA, it can proceed to fork+exec a shellscript doing some /sbin/ip processing without losing CAP_NET_ADMIN.

How to use it (example):

Below is a test program, originally by Christopher, which I slightly modified. Write it to a file ‘ambient.c’. Build it, using

$ gcc -o ambient ambient.c -lcap-ng

Then assign it a set of file capabilities, for instance:

$ sudo setcap cap_net_raw,cap_net_admin,cap_sys_nice,cap_setpcap+p ambient

I was lazy and didn’t add interpretation of capabilities to ambient.c, so you’ll need to check /usr/include/linux/capability.h for the integers representing each capability. Run a shell with ambient capabilities by running, for instance:

$ ./ambient.c -c 13,12,23,8 /bin/bash

In this shell, check your capabilities:

$ grep Cap /proc/self/status
CapInh: 0000000000803100
CapPrm: 0000000000803100
CapEff: 0000000000803100
CapBnd: 0000003fffffffff
CapAmb: 0000000000803100

You can see that you have the requested ambient capabilities. If you run a new shell there, it retains those capabilities:

$ bash -c “grep Cap /proc/self/status”
CapInh: 0000000000803100
CapPrm: 0000000000803100
CapEff: 0000000000803100
CapBnd: 0000003fffffffff
CapAmb: 0000000000803100

What if we drop all but cap_net_admin from our inheritable set? We can test that using the ‘capsh’ program shipped with libcap:

$ capsh –caps=cap_net_admin=pi — -c “grep Cap /proc/self/status”
CapInh: 0000000000001000
CapPrm: 0000000000001000
CapEff: 0000000000001000
CapBnd: 0000003fffffffff
CapAmb: 0000000000001000

As you can see, the other capabilities were dropped from our ambient, and hence from our effective set.

================================================================================
ambient.c source
================================================================================
/*
* Test program for the ambient capabilities. This program spawns a shell
* that allows running processes with a defined set of capabilities.
*
* (C) 2015 Christoph Lameter
* (C) 2015 Serge Hallyn
* Released under: GPL v3 or later.
*
*
* Compile using:
*
* gcc -o ambient_test ambient_test.o -lcap-ng
*
* This program must have the following capabilities to run properly:
* Permissions for CAP_NET_RAW, CAP_NET_ADMIN, CAP_SYS_NICE
*
* A command to equip the binary with the right caps is:
*
* setcap cap_net_raw,cap_net_admin,cap_sys_nice+p ambient_test
*
*
* To get a shell with additional caps that can be inherited by other processes:
*
* ./ambient_test /bin/bash
*
*
* Verifying that it works:
*
* From the bash spawed by ambient_test run
*
* cat /proc/$$/status
*
* and have a look at the capabilities.
*/

#include
#include
#include
#include
#include
#include
#include

/*
* Definitions from the kernel header files. These are going to be removed
* when the /usr/include files have these defined.
*/
#define PR_CAP_AMBIENT 47
#define PR_CAP_AMBIENT_IS_SET 1
#define PR_CAP_AMBIENT_RAISE 2
#define PR_CAP_AMBIENT_LOWER 3
#define PR_CAP_AMBIENT_CLEAR_ALL 4

static void set_ambient_cap(int cap)
{
int rc;

capng_get_caps_process();
rc = capng_update(CAPNG_ADD, CAPNG_INHERITABLE, cap);
if (rc) {
printf(“Cannot add inheritable cap\n”);
exit(2);
}
capng_apply(CAPNG_SELECT_CAPS);

/* Note the two 0s at the end. Kernel checks for these */
if (prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, cap, 0, 0)) {
perror(“Cannot set cap”);
exit(1);
}
}

void usage(const char *me) {
printf(“Usage: %s [-c caps] new-program new-args\n”, me);
exit(1);
}

int default_caplist[] = {CAP_NET_RAW, CAP_NET_ADMIN, CAP_SYS_NICE, -1};

int *get_caplist(const char *arg) {
int i = 1;
int *list = NULL;
char *dup = strdup(arg), *tok;

for (tok = strtok(dup, “,”); tok; tok = strtok(NULL, “,”)) {
list = realloc(list, (i + 1) * sizeof(int));
if (!list) {
perror(“out of memory”);
exit(1);
}
list[i-1] = atoi(tok);
list[i] = -1;
i++;
}
return list;
}

int main(int argc, char **argv)
{
int rc, i, gotcaps = 0;
int *caplist = NULL;
int index = 1; // argv index for cmd to start

if (argc < 2)
usage(argv[0]);

if (strcmp(argv[1], "-c") == 0) {
if (argc <= 3) {
usage(argv[0]);
}
caplist = get_caplist(argv[2]);
index = 3;
}

if (!caplist) {
caplist = (int *)default_caplist;
}

for (i = 0; caplist[i] != -1; i++) {
printf("adding %d to ambient list\n", caplist[i]);
set_ambient_cap(caplist[i]);
}

printf("Ambient_test forking shell\n");
if (execv(argv[index], argv + index))
perror("Cannot exec");

return 0;
}
================================================================================

Posted in Uncategorized | Leave a comment