Call for testing: new qemu packages for raring

tl;dr

If you use qemu, kvm, or qemu-user in raring, please test the candidate packages in ppa:serge-hallyn/crossc.

Background

The qemu and kvm projects historically had somewhat different code bases with some different features and advantages. For years they have been trying to merge the bases, and now they are just about there.

There was also divergence between the Debian and Ubuntu packages. The Ubuntu functionality was offered through two source packages – qemu-kvm in main, and qemu-linaro in universe. The qemu-kvm tree provided kvm binaries for x86 and amd64, while qemu-linaro provided everything else. The qemu-linaro tree also provided bleeding edge arm patches which were not yet in upstream qemu-kvm or qemu trees.

The wonderful Debian qemu team has an experimental set of packages to use the 1.2 upstream qemu to replace both qemu and qemu-kvm. The packages in ppa:serge-hallyn/crossc are based on that tree. They have: some packaging changes to accommodate upgrades from our current packaging layouts (thanks to stgraber, slangasek and infinity for help with some thorny issues); changes to reflect things which are not in main in Ubuntu; and additional arm patches from the qemu-linaro 1.2 tree. With these packages, we will be able to collaborate much more closely with the Debian team.

I’d like to get these packages into the archive no later than early January. Therefore, if at all possible, please do test the candidate packages, both for clean upgrades from your current setup to the new package layout (in other words, looking for errors when doing ‘apt-get dist-upgrade’) and for regression bugs in qemu itself.

To test, do the following in a raring install:

sudo add-apt-repository ppa:serge-hallyn/crossc
sudo apt-get update

and then either

sudo apt-get dist-upgrade

if you already had the packages you are interested in installed, or

sudo apt-get install qemu-system # qemu-user and qemu-user-static if you want those

Please feel free to report any problems you find here or on the ubuntu-server mailing list.

Thanks!


Full Ubuntu container confined in a user namespace

I’ve mentioned user namespaces here before, and shown how to play a bit with them. When a task is cloned into a new user namespace, the uids in the namespace can be mapped (1-1, in blocks) to uids on the host – for instance uid 0 in the container could be uid 100000 on the host. The uids are translated at the kernel-userspace boundary (e.g. in stat), and capabilities for a namespaced task are only valid against objects owned by that namespace. The result is that root in a container is unprivileged on the host.
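The block mapping can be sketched in a few lines of Python. This only illustrates the arithmetic, it is not how the kernel implements it; the names `IdMap` and `to_host` and the block size of 10000 ids are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IdMap:
    """One contiguous block of ids mapped 1-1 from container to host."""
    container_start: int  # first id as seen inside the namespace
    host_start: int       # corresponding first id on the host
    count: int            # number of ids in the block (assumed value below)

    def to_host(self, container_id: int) -> Optional[int]:
        """Return the host id for container_id, or None if it is unmapped."""
        offset = container_id - self.container_start
        if 0 <= offset < self.count:
            return self.host_start + offset
        return None


# Container uid 0 maps to host uid 100000, as in the example above.
idmap = IdMap(container_start=0, host_start=100000, count=10000)

print(idmap.to_host(0))      # 100000: root in the container
print(idmap.to_host(1000))   # 101000: a regular user with uid 1000
print(idmap.to_host(65534))  # None: outside the mapped block
```

An id outside the mapped block has no host equivalent at all, which is why the sketch returns None rather than passing the id through.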

Eric has been making great progress in moving the kernel functionality upstream. With the newest 3.7 based ubuntu kernel, plus a few of his not yet merged patches, a milestone has been reached – it’s now possible to run a full ubuntu container in a user namespace!

First start up a fresh, up-to-date quantal vm or instance. Install my user namespace ppa, install the kernel and nsexec packages from there, create a container, and convert it to be namespaced:

sudo add-apt-repository ppa:serge-hallyn/userns-natty
sudo apt-get update
sudo apt-get dist-upgrade
sudo apt-get install linux-image-3.7.0-0-generic nsexec lxc
sudo lxc-create -t ubuntu -n q1
sudo container-userns-convert q1 100000
sudo reboot

The ‘container-userns-convert’ script just shifts the user and group ids of file owners in the container rootfs, and adds two lines to the container configuration file to tell lxc to clone the new user namespace and set up the uid/gid mappings.

Now you can start the container,

sudo lxc-start -n q1 -d
sudo lxc-console -n q1

Look around the container (try ‘sudo bash’); notice that it looks like a normal system, with ubuntu as uid 1000 and root as uid 0. But look from the host, and you will see that root tasks in the container are actually running as uid 100000, and ubuntu ones as uid 101000.

There are a few oddities (you can sudo on ttys 1-4, but sometimes it fails on /dev/console, and shutdown in the container does not kill init); the lxc package needs a few more changes (the cgroup setup needs to be moved to the container parent); and plenty of things are not yet allowed by the kernel (for example, mounting an ext4 filesystem).

But this is a full Ubuntu image, confined by a private user namespace!

After working out some kinks, we’ll next want to look into container startup by unprivileged users.


deploying multiple (connected) lxc compute nodes – with juju

This post got delayed a bit due to a few unexpected complications. First, it turns out that you cannot connect GRE tunnels in Amazon’s EC2 over the instances’ private addresses. You must use the public addresses. Second, quantal removed the openvswitch-datapath-dkms package because the openvswitch kernel module is now available upstream. However it turns out that the upstream openvswitch module does not yet provide GRE tunnels configurable through the db. Therefore hopefully the openvswitch-datapath-dkms package will soon be reintroduced, but meanwhile we will use it from the inestimable James Page’s “junk” ppa.

Oh, but first things second. What are we doing today? We’re going to use juju to fire off a set of lxc compute nodes, pre-populated with LVM backed pristine containers which can be very quickly cloned, and which will be able to communicate over an openvswitch private network no matter which compute node hosts them.

My use case for this is to set up for a long varied bug triage and replication session. It takes about 10-20 minutes (much longer on amazon, but setting a local mirror in /etc/default/lxc should speed that up there) to initially set up, after which starting a new container takes about 3 seconds.

There are two bzr trees involved. The actual juju charm is at lp:~serge-hallyn/charms/quantal/ovs-lxc/trunk. It relates one master compute node to any number of slave nodes. The master node will be used just as the slave ones, but is set apart to be the central openvswitch hub. So every slave will have a GRE tunnel to the master, and slaves can talk to each other over two GRE links (through the master). (You’ll want to check this out under ~/charms/quantal, i.e. “mkdir -p ~/charms/quantal; cd ~/charms/quantal; bzr branch lp:~serge-hallyn/charms/quantal/ovs-lxc/trunk ovs-lxc;”)
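To make the hub-and-spoke layout concrete, here is a small Python sketch of which nodes traffic passes through. It is only a model of the topology described above, not part of the charm; the function names and the node names "master", "slave1" and "slave2" are hypothetical.

```python
from typing import List


def gre_path(src: str, dst: str, hub: str = "master") -> List[str]:
    """Return the nodes a packet visits going from src to dst in the star."""
    if src == dst:
        return [src]
    if hub in (src, dst):
        return [src, dst]       # a single GRE link to or from the hub
    return [src, hub, dst]      # two GRE links, through the hub


def gre_links(src: str, dst: str, hub: str = "master") -> int:
    """Number of GRE links crossed between src and dst."""
    return len(gre_path(src, dst, hub)) - 1


print(gre_path("slave1", "slave2"))   # ['slave1', 'master', 'slave2']
print(gre_links("slave1", "master"))  # 1
```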

The other bzr tree is lp:~serge-hallyn/+junk/jujulxcscripts. The first script here is ‘juju-deploy-lxc’, which accepts a number of slaves to start, bootstraps juju, deploys the nodes, and relates each slave to the master. It finally runs ‘grabnodes’ which will gather information used by the other scripts.

Next, ‘startcontainer’ will clone and start a new container. It rotates round robin among the master and slaves each time it is invoked. With no arguments it will start an amd64 quantal container. It can also be called as

startcontainer precise

or

startcontainer quantal i386

for the obvious result.
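The round-robin behaviour and the argument defaults can be sketched as follows. This is a guess at the logic in Python, not the actual ‘startcontainer’ script: the state file, its JSON layout, and the names `next_node` and `parse_args` are all assumptions made for illustration.

```python
import json
import os
import tempfile
from typing import List, Tuple


def next_node(nodes: List[str], state_file: str) -> str:
    """Return the next node in round-robin order, remembering the position
    in state_file so that separate invocations keep rotating."""
    index = 0
    if os.path.exists(state_file):
        with open(state_file) as f:
            index = json.load(f)["next"]
    node = nodes[index % len(nodes)]
    with open(state_file, "w") as f:
        json.dump({"next": (index + 1) % len(nodes)}, f)
    return node


def parse_args(argv: List[str]) -> Tuple[str, str]:
    """Apply the defaults described above: a quantal release on amd64."""
    release = argv[0] if len(argv) > 0 else "quantal"
    arch = argv[1] if len(argv) > 1 else "amd64"
    return release, arch


# Hypothetical node names; the state lives in a throwaway directory here.
state = os.path.join(tempfile.mkdtemp(), "rr-state.json")
nodes = ["master", "slave1", "slave2"]
picks = [next_node(nodes, state) for _ in range(4)]
print(picks)                    # ['master', 'slave1', 'slave2', 'master']
print(parse_args(["precise"]))  # ('precise', 'amd64')
```

Keeping the position in a file is what lets each separate invocation pick up where the previous one left off.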

Finally, ‘sshcontainer (n)’ will ssh into the (n)th container you’ve started, starting with 0. The scripts don’t get too fancy or try to do too much – if you want much more, you might actually want to deploy openstack :)

I do hope at some point to expand this so as to use a (juju-deployed) ceph cluster for the container backing store. The charm is not as flexible as it ought to be, since it expects /dev/vdb or /dev/xvdb to be a spare drive and mounted on /mnt at instance startup, but this is good enough to work for me on Amazon ec2 as well as on an openstack based cloud, which is all I need to make this useful for myself.

It won’t work by default on a local (lxc-backed) juju config, but I will play with that as an exercise to investigate what sorts of site customizations we should support in juju-lxc. In particular, we’ll need to be able to use (a) lxc mount hooks (so cgroups can be mounted in the container) and (b) custom apparmor profiles.


Easily making a blockdev available to a container

Often it would be nice to mount an existing (lvm) block device into a container. For instance, to emulate an Amazon ec2 environment, I’d like to have /dev/vdb or /dev/xvdb as a block device.

So I wrote a mount hook which will ‘insert’ a block device from the host into the container. Of course in Ubuntu containers are clamped down so that the container isn’t allowed to use this device. So I use this script to set a container up to use a particular block device.

For instance, if I have a pristine lvm-backed container called ‘quantal-amd64’, and I want to run a container which has a 500M block device available as /dev/xvdb, I would do:

# clone a new container
sudo lxc-clone -s -o quantal-amd64 -n q1
# create a LVM block device in the lxc VG
sudo lvcreate -L 500M -n q1-d1 lxc
# expose the block device to the container as /dev/xvdb
sudo lxc-enabledev.sh /dev/lxc/q1-d1 xvdb

Now when I start the container, I can format the device and mount it:

sudo mkfs.ext2 /dev/xvdb
sudo mount /dev/xvdb /mnt
echo "hello world" | sudo tee /mnt/ab

Of course I can also format the device on the host, and preserve the device between multiple containers.

If this turns out to be something many people want, we can add support for this into lxc itself. But for the moment this meets my needs, and uses only existing lxc features.

One note: when you delete the container, you’ll want to also delete the custom apparmor profile which this created.


ecryptfs-backed containers

During this cycle, the lxc package gained the ability to call ‘hooks’ at various points of a container’s life cycle. Just today, a new hook point was added to the quantal package, which supports a simple use of ecryptfs backed containers.

Why would you want that, you might ask? Well, it offers a few advantages. First, if you’re running your containers on a cloud instance, you can rest assured that if your instance’s disk space is re-used for someone else’s instance without first being zeroed out, the container rootfs contents will not be revealed. Secondly, the un-encrypted rootfs contents are never mounted in the host’s namespace (though they are accessible by privileged tasks through /proc/$$/root), so unprivileged tasks on the host should not be able to read those contents either. Third, there is the usual ecryptfs advantage of supporting simple encrypted backups.

Currently it takes a few extra steps to make use of this. During the next cycle, we will hopefully move all this work into the standard ‘ubuntu’ container creation template, so that a simple

lxc-create -t ubuntu -n e1 -- -e 2be2810752901deb

will create a container whose rootfs is encrypted by the fekek in your keyring with sig 2be2810752901deb. But for now, you’ll need to do:

  • add ‘lxc.hook.premount = /usr/share/lxc/hooks/mountecryptfsroot’ to the container’s configuration file
  • change the rootfs to /var/lib/lxc/ecryptfs-root in the configuration file by setting ‘lxc.rootfs = /var/lib/lxc/ecryptfs-root’
  • add the line ‘mount -> /var/lib/lxc/ecryptfs-root’ to /etc/apparmor.d/abstractions/lxc/start-container
  • convert your container’s root filesystem to be ecryptfs-backed. Assuming your container is called ‘q1’, do
    • c=q1
    • mv /var/lib/lxc/$c/rootfs /var/lib/lxc/$c/rootfs.plain
    • mkdir /var/lib/lxc/$c/rootfs{,.crypt}
    • sig=`echo none | ecryptfs-add-passphrase | grep -v Passphrase | cut -d[ -f 2 | cut -d] -f 1`
    • mount -t ecryptfs -o ecryptfs_cipher=aes,ecryptfs_key_bytes=16,ecryptfs_passthrough=n,ecryptfs_enable_filename_crypto=n,ecryptfs_sig=${sig},sig=${sig},verbosity=0 /var/lib/lxc/$c/rootfs.crypt /var/lib/lxc/$c/rootfs
    • rsync -va /var/lib/lxc/$c/rootfs.plain/ /var/lib/lxc/$c/rootfs/
    • umount /var/lib/lxc/$c/rootfs
    • rm -rf /var/lib/lxc/$c/rootfs.plain
  • Now you can start your container by adding the passphrase to your in-kernel keyring using ‘ecryptfs-add-passphrase’, then starting your container as normal.
    • echo none | ecryptfs-add-passphrase
    • lxc-start -n q1

(These directions are copied from those in the /usr/share/lxc/hooks/mountecryptfsroot file)
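The grep/cut pipeline in the steps above pulls the key signature out of the text printed by ecryptfs-add-passphrase. As a rough Python equivalent, assuming the output has the shape shown in `sample` (that sample text and the name `extract_sig` are assumptions for illustration, not taken from the hook script):

```python
import re
from typing import Optional

# The signature is the hex string between square brackets.
SIG_RE = re.compile(r"\[([0-9a-f]+)\]")


def extract_sig(output: str) -> Optional[str]:
    """Return the first bracketed signature, skipping the prompt line."""
    for line in output.splitlines():
        if "Passphrase" in line:
            continue  # same effect as `grep -v Passphrase`
        match = SIG_RE.search(line)
        if match:
            return match.group(1)
    return None


# Assumed shape of the tool's output; only the sig value is from the text.
sample = (
    "Passphrase: \n"
    "Inserted auth tok with sig [2be2810752901deb] into the user session keyring\n"
)
print(extract_sig(sample))  # 2be2810752901deb
```

Unlike the cut pipeline, this version returns None when no signature is present, so a failed key insertion is easy to detect before attempting the mount.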


Playing with seccomp

Seccomp is a linux kernel feature by Andrea Arcangeli which limits the system calls which a task can use, by allowing a task to say “from now on, myself and my new children should not be able to do anything but compute, and read and write to already-open files (and return and exit).” The intent was to allow untrusted guests to use your cpu resources without them being able to abuse any other resources.

Years later, Will Drewry extended the seccomp idea by adding a new mode, typically called seccomp2. He had the brilliant idea to use a BPF (berkeley packet filter) compiler to allow clients to more flexibly define the constraints to be applied. He also managed to support many use cases by providing options for what to do on a denied action. The offending task can be killed; or the system call can be made to return a specified error code (e.g. ENOSYS). Or it can be traced.

In order to make seccomp2 easier to use, libseccomp was introduced. Over the next few weeks I want to add seccomp2 support to lxc containers. So I thought I’d first try a simple test program using libseccomp. Here is the program I used:

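#include <stdio.h>
#include <stdlib.h>
#include <seccomp.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <errno.h>

/*
 * Note: this uses the context-free libseccomp API that was current when
 * this was written. Later libseccomp releases have seccomp_init() return a
 * scmp_filter_ctx which is passed to seccomp_rule_add() and seccomp_load().
 */
int main(void)
{
	FILE *f1;
	int fd;
	int ret;

	/* Open the log file before the filter is loaded */
	f1 = fopen("/tmp/test1", "w");
	if (!f1) {
		perror("fopen /tmp/test1");
		exit(1);
	}

	/* Default action: any syscall not allowed below fails with errno 5 */
	ret = seccomp_init(SCMP_ACT_ERRNO(5));
	if (ret < 0)
		printf("Error from seccomp_init\n");
	/* Whitelist the few syscalls the rest of the program needs */
	ret = seccomp_rule_add(SCMP_ACT_ALLOW, SCMP_SYS(close), 0);
	if (!ret)
		ret = seccomp_rule_add(SCMP_ACT_ALLOW, SCMP_SYS(dup), 0);
	if (!ret)
		ret = seccomp_rule_add(SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
	if (!ret)
		ret = seccomp_rule_add(SCMP_ACT_ALLOW, SCMP_SYS(exit), 0);
	/* Load the filter into the kernel; it applies from here on */
	if (!ret)
		ret = seccomp_load();
	if (ret)
		printf("error setting seccomp\n");

	/* Writing to the already-open file is allowed */
	fprintf(f1, "hi there\n");
	/* Opening a new file is not allowed and should fail with errno 5 */
	fd = open("/tmp/test2", O_RDWR);
	if (fd >= 0)
		printf("error, was able to open f2\n");
	else
		fprintf(f1, "open returned %d errno %d\n", fd, errno);
	fclose(f1);
	exit(0);
}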

I installed libseccomp-dev, compiled the program, and executed it using

sudo apt-get -y install libseccomp-dev
gcc -o seccomp1 seccomp1.c -lseccomp
./seccomp1

The program opens /tmp/test1 for writing. Then it loads a seccomp policy which allows it only to write to files, close files, duplicate file descriptors, and exit. It then writes to the open file (allowed), tries to open a new file (not allowed), and writes the errno it received to the already open file. I told seccomp_init() to give us errno 5, so if you run the program you can check the output in /tmp/test1 to verify that errno 5 (EIO) is what you got. If I had called seccomp_init() with the SCMP_ACT_KILL argument, then the program would have been killed at the open() call, and the last line in the output file would not have been written. You can get much fancier by having the kernel send the offending program a SIGSYS and rewinding its execution, or notifying a tracer.
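The policy itself is simple enough to model in a few lines of Python. This is a toy model of the decision only: the real filter is a BPF program run by the kernel against syscall numbers, and `make_filter` and `check` are names invented for this sketch, not libseccomp calls.

```python
import errno
from typing import Callable, Iterable


def make_filter(default_errno: int, allowed: Iterable[str]) -> Callable[[str], int]:
    """Build a checker: 0 means the call proceeds, otherwise the errno
    the caller would see for a denied call."""
    allowed_set = frozenset(allowed)

    def check(syscall: str) -> int:
        return 0 if syscall in allowed_set else default_errno

    return check


# Same policy as the C program: default ERRNO(5), four syscalls allowed.
check = make_filter(errno.EIO, ["close", "dup", "write", "exit"])

for name in ("write", "open", "close"):
    result = check(name)
    print(name, "allowed" if result == 0 else "denied with errno %d" % result)
```

With a kill action instead of an errno, the model's denied branch would end the task at that point rather than return a value to it.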

Neat!


Crypto tutorial

One of the first real web pages I put up was a small set of tutorials on how basic crypto algorithms work. This was back in the days (mid 90s) when people actually tended to sit down and roll their own – for fun, for avoiding crypto export regulations (zounds!), and because you couldn’t just fire up python and say “encrypt this, plz”.

For old times’ sake I’ve put them back up as a google site, though I don’t expect to get traffic and emails about them like I used to :)

Nevertheless, despite the intro page assertion to the contrary, I’m tempted to add a few new bits. Perhaps a tutorial on differential cryptanalysis, and one on my favorite paper of all time, Claude Shannon’s exposition on information theory and entropy.
