LXC – improved clone support

Recently I took some time to work on implementing container clones through the lxc API. lxc-clone previously existed as a shell script which could create snapshot clones of lvm and btrfs containers. There were several shortcomings to this:

1. clone was not exportable through the API (to be used in python, lua, go and c programs). Now it is, so a Go program can create a container clone in one function call.
2. expanding the set of supported clone types became unsavory
3. overlayfs was only supported as ‘ephemeral containers’, which could be made persistent through the use of pre-mount hooks. They were not first class citizens. Now they are.

The result is now in upstream git as well as in the packages at the ubuntu-lxc/daily ppa. Supported backing store types currently include dir (directory), lvm, btrfs, overlayfs, and zfs. Hopefully loop and qemu-nbd will be added soon. Each behaves a bit differently due to the nature of the backing store itself, so I’ll go over them in turn. However, in my opinion the coolest thing you can do with this is:

# create a stock directory backed container
sudo lxc-create -t ubuntu -n dir1
# create an overlayfs snapshot of it
sudo lxc-clone -s -B overlayfs dir1 s1

The -s argument asks for a snapshot (rather than copy) clone, and -B specifies the backing store type for the new container. When container s1 starts, it will mount a private writeable overlay (/var/lib/lxc/s1/delta0) over a readonly mount of the original /var/lib/lxc/dir1/rootfs.
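
If you’re curious how that is wired up, you can peek at the new container’s configuration; the rootfs line should look roughly like the comment below (a sketch – the exact syntax may vary by lxc version):

sudo grep rootfs /var/lib/lxc/s1/config
# lxc.rootfs = overlayfs:/var/lib/lxc/dir1/rootfs:/var/lib/lxc/s1/delta0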

Now make some changes to start customizing s1. Checkpoint that state by cloning it:

sudo lxc-clone -s s1 s2

This will reference the same rootfs (/var/lib/lxc/dir1/rootfs) and rsync the overlayfs delta from s1 to s2. Now you can keep working on s1, keeping s2 as a checkpoint. Make more changes, and create your next snapshots:

sudo lxc-clone -s s1 s3

sudo lxc-clone -s s1 s4

If at some point you realize you need to go back to an older snapshot, say s3, then you can

sudo lxc-clone -s s1 s1_bad # just to make sure
sudo lxc-destroy -n s1
sudo lxc-clone -s s3 s1

and pick up where you left off. Finally, if you’re happy and want to tar up what you have to ship it or copy to another machine, clone it back to a directory backed container:

sudo lxc-clone -B dir s1 dir_ship
sudo tar zcf /var/lib/lxc/dir_ship.tgz /var/lib/lxc/dir_ship
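
On the destination machine you can then unpack the tarball under the same lxcpath and start the container; for example (assuming the default /var/lib/lxc on both ends):

sudo tar -C / -xzf dir_ship.tgz
sudo lxc-start -n dir_ship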

So far I’ve shown the dir (directory) and overlayfs backing stores. What is specific to directory backed containers is that they cannot be snapshotted, except by converting them to overlayfs backed containers. What is specific to overlayfs containers is that the original directory backed container must not be deleted, since the snapshot depends on it. (I’ll address this soon by marking the snapshotted container so that lxc-destroy will leave it alone, but that is not yet done.)

To use btrfs containers, the entire lxc configuration path must be on btrfs. However, since the configuration path is flexible, that’s not as bad as it used to be. For instance, I mounted a btrfs filesystem at $HOME/lxcbase, then did

sudo lxc-create -t ubuntu -P $HOME/lxcbase -n b1

(The ‘-P’ argument chooses a custom ‘lxcpath’, or lxc configuration path, other than the default /var/lib/lxc. You can also specify a global default other than /var/lib/lxc in /etc/lxc/lxc.conf.) lxc-create detects the btrfs and automatically makes the container a new subvolume, which can then be snapshotted:

sudo lxc-clone -s b1 b2

For zfs, a zfsroot can be specified in /etc/lxc/lxc.conf. I created a zfs pool called ‘lxc’ (which is actually the default for the lxc tools, so I did not list it in /etc/lxc/lxc.conf), then did

sudo lxc-create -B zfs -t ubuntu -n z1
or
sudo lxc-clone -B zfs dir1 z1

This created ‘lxc/z1’ as a new zfs fs and mounted it under /var/lib/lxc/z1/rootfs. Next I could

sudo lxc-clone -s z1 z2

lxc-destroy still needs some smarts built in to make zfs backed containers easier to destroy. That is because when lxc-clone creates z2 from z1, it must first create a snapshot ‘lxc/z1@z2’, then clone that to ‘lxc/z2’. So before you can destroy z1, you currently must

sudo lxc-destroy -n z2
sudo zfs destroy lxc/z1@z2
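
You can see that dependency for yourself by listing the pool’s snapshots:

sudo zfs list -t snapshot
# lxc/z1@z2 will be listed here until it is destroyed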

Finally, you can also use LVM. LVM snapshot container clones have been supported longer than any others (with btrfs being second). I like the fact that you can use any filesystem inside the LV. However, the two major shortcomings are that you cannot snapshot a snapshot, and that you must (depending at least on the filesystem type) choose a filesystem size in advance.

To clone LVM containers, you either need a vg called ‘lxc’, or you can specify a default vg in /etc/lxc/lxc.conf. You can create the initial lvm container with

sudo lxc-create -t ubuntu -B lvm -n lvm1 --fssize 2G --fstype xfs
or
sudo lxc-clone -B lvm dir1 lvm1

Then snapshot it using

sudo lxc-clone -s lvm1 lvm2
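
A quick way to see the relationship is to list the logical volumes in the ‘lxc’ volume group; lvm2 shows up as a snapshot LV with lvm1’s volume as its origin:

sudo lvs lxc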

Note that unlike overlayfs, snapshots in zfs, btrfs, and lvm are safe from having the base container destroyed. In btrfs, that is because the btrfs snapshot is metadata based, so destroying the base container simply does not delete any of the data in use by the snapshot container. LVM and zfs both will note that there are active snapshots of the base rootfs and prevent the base container from being destroyed.


gtd – managing projects

I learned about GTD 5 or 8 years ago, and started trying to use it almost immediately. Ever since then I keep all of my information in one gtd folder, with Projects and Reference folders, a nextactions file, etc. I’ve blogged before about my tickler file, which frankly rocks and never lets me down.

However, a few months ago I decided I wasn’t happy with my nextactions file. Sitting down for a bit to think about it, it was clear that the following keeps happening: some new project comes in. I only have time to jot a quick note, so I do so in nextactions. Later, another piece of information comes in, so I add it there. Over time, my nextactions file grows and is no longer a nextactions file.

I briefly tried simply not using the Projects/ directory, and keeping an indented/formatted structure in the nextactions file. But that does not work out – I spend most of my time gazing at too much information and/or ignoring parts which I hadn’t been working on recently. (I also briefly tried ETM and bug, which both are *very* neat, but they similarly didn’t work for me for GTD.)

I have a Projects directory, so why am I not using it? Doing so takes several steps (think of a name, make the directory, open a file in it, make the notes, exit), and after that I don’t have a good system for managing the project files. Looking at a project again involves several steps – cd into gtd/Projects, look around, cd into a project, look again. Clearly, project files needed better tools.

So I wrote up a simple ‘project’ script, with a corresponding bash_completion file. If info comes in for a trip I have to take in a few months, I can simply

	project start trip-sandiego-201303

or

	p s trip-sandiego-201303

This creates the project directory and opens vim with three buffers, one for each of three files – a summary, actions, and a log. (‘project new’ will create the project without pulling up vim with those files.) Later, I can

	project list

or (for short)

	p l

to list all open projects,

	p e tr<tab>

to edit the project – which again opens the same files, or

	p cat tr<tab>

to cat the files to stdout. I’ve added a ‘Postponed’ directory for projects which are on hold, so I can

	project postpone trip-sandiego-201303

or just

	p po tr<tab>

to temporarily move the project folder into Postponed, or

	p complete tr<tab>

to move the project folder into the Completed/ directory.
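
For the curious, here is a stripped-down sketch of what such a helper can look like – this is not the actual script (which is linked below), and the directory layout and choice of editor are assumptions:

#!/bin/bash
# sketch of a gtd 'project' helper - not the real script
GTD=$HOME/gtd
P=$GTD/Projects

case "$1" in
    start|s)     mkdir -p "$P/$2"
                 vim "$P/$2/summary" "$P/$2/actions" "$P/$2/log" ;;
    new|n)       mkdir -p "$P/$2" ;;
    list|l)      ls "$P" ;;
    edit|e)      vim "$P/$2/summary" "$P/$2/actions" "$P/$2/log" ;;
    cat)         cat "$P/$2"/* ;;
    postpone|po) mkdir -p "$GTD/Postponed" && mv "$P/$2" "$GTD/Postponed/" ;;
    complete|c)  mkdir -p "$GTD/Completed" && mv "$P/$2" "$GTD/Completed/" ;;
    *)           echo "usage: project start|new|list|edit|cat|postpone|complete [name]" ;;
esac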

I’ve been using this for a few months now, and am very happy with the result. The script and completion file are in lp:~serge-hallyn/+junk/gtdproject. It’s really not much, but so useful!


Qemu updates in raring

The raring feature freeze took effect last week. What’s been happening with qemu in the meantime?

A lot! I’ll touch on the following main changes in this post: package reorg, spice support, hugepages, uefi, and rbd support.

1. Package reorg

Perhaps best to begin with a bit of Ubuntu qemu packaging history. In hardy (before my time) Ubuntu shipped with separate qemu and kvm packages. This reflected the separate upstream qemu and kvm source trees. In August of 2009, upstream was already talking about merging the two trees, and Dustin Kirkland started a new qemu-kvm Ubuntu package which provided both qemu and kvm.

In 2010, a new ‘qemu-linaro’ source package was created in universe, to provide qemu with more bleeding-edge arm support from linaro. Eventually the qemu-kvm package provided the i386 and amd64 qemu-system binaries, qemu-common, and qemu-utils. All other target architecture system binaries, plus all qemu-user binaries, plus qemu-kvm-spice, came from qemu-linaro. This is clearly non-ideal from many viewpoints, especially QA testing and bug duplication. But any reorganization would have to make sure that upgrades work seamlessly for raring-to-raring, quantal-to-raring, and future LTS-to-LTS upgrades, for the many commonly used packages (qemu-kvm, the various qemu binary packages, and qemu-user).

In the traditional 6-month-plus-LTS Ubuntu cycle, raring was a good time (not too close to the next LTS) to try to straighten that out. It was also a good time in that upstream qemu and kvm were now very close together, and especially in that the wonderfully helpful Debian qemu team was also starting to merge Debian’s qemu and qemu-kvm sources into a new qemu source tree in Debian experimental.

And so, it’s done! The qemu-linaro and qemu-kvm source packages have been merged into qemu. Most arm patches from linaro are in our package, but you can still run linaro’s qemu from ppa at https://launchpad.net/~linaro-maintainers/+archive/tools/. The Ubuntu and Debian teams are working together, which should mean more stable packages in both, and combined resources in addressing bugs. Thanks especially to Michael Tokarev for helping to review the Ubuntu delta, and to infinity for more than once helping to figure out packaging issues I couldn’t have figured out on my own.

2. Spice support. Spice has finally made it into main! The qemu package in main therefore now supports spice, without the need to install a separate qemu-kvm-spice package. As a simple example, if you used to do:

kvm -vga vmware -vnc :1

then you can use spice by doing:

kvm -vga qxl -spice port=5900,disable-ticketing

then connect with spicec or spicy:

spicec -h hostname -p 5900

3. Transparent hugepages. The 1.4.0 qemu release includes support for transparent hugepages. This means that when hugepages are available, some of a qemu instance’s memory pages are automatically migrated from regular to huge pages. Hugepages offer performance improvements by (1) requiring fewer TLB entries for the same amount of memory, (2) requiring fewer page-table lookups on a TLB miss, and (3) requiring fewer page faults for nearby memory references (since each memory page is much larger).
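
You can check whether transparent hugepages are enabled, and how much anonymous memory is currently backed by them, with the standard kernel interfaces (nothing qemu-specific here):

cat /sys/kernel/mm/transparent_hugepage/enabled
grep AnonHugePages /proc/meminfo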

4. Hugetlbfs mount. While transparent hugepages are convenient, if you want a particular VM to run with hugepages backing all of its memory, you will want to use dedicated hugepages. To do this, simply set KVM_HUGEPAGES to 1 in /etc/init/qemu-kvm.conf, then add an entry to /etc/sysctl.conf like:

vm.nr_hugepages = 512

(for 1G of hugepages – 512 2M pages). Make sure to leave at least around 1G of memory not dedicated to hugepages. Then add the arguments

-mem-path /run/hugepages/kvm

to your kvm command. Dedicated hugepages are not new, but the automatic mounting of /run/hugepages/kvm is.
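
Putting the pieces together, a minimal end-to-end example might look like this (the disk image name and memory size are just placeholders):

sudo sysctl vm.nr_hugepages=512
grep HugePages_ /proc/meminfo    # verify the pages were reserved
kvm -m 512 -mem-path /run/hugepages/kvm -drive file=raring.img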

5. UEFI. If you install the ovmf package, then you can run qemu with a UEFI BIOS (to test secureboot, for instance) by adding the ‘-bios OVMF.fd’ argument to kvm. As was pointed out during vUDS, there are some bugs to work out to make this seamless.
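
As a sketch – qemu needs to be able to find OVMF.fd, for instance by copying it from wherever the ovmf package installs it into the current directory:

sudo apt-get install ovmf
cp /usr/share/ovmf/OVMF.fd .    # path may differ on your system
kvm -m 512 -bios OVMF.fd -cdrom raring.iso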

6. rbd. OK, this has been enabled since precise, but it’s still cool. You can use a ceph cluster to back your kvm instances (as an alternative to, say, nfs) to easily enable live migration. Just

qemu-img create -f rbd rbd:pool/vm1 10G
kvm -m 512 -drive format=rbd,file=rbd:pool/vm1 -cdrom raring.iso -boot d

See http://ceph.com/docs/master/rbd/qemu-rbd/ for more information.

So that’s what I can think of that is new in qemu this cycle. I hope you all enjoy it, and if you find upgrade issues please do raise a bug.


Experimenting with user namespaces

User namespaces are a really neat feature, but there are some subtleties involved which can make them perplexing to first play with. Here I’m going to show a few things you can do with them, with an eye to explaining some of the things which might otherwise be confusing.

First, you’ll need a bleeding edge kernel. A 3.9 kernel hand-compiled with user namespace support should be fine (some of the latest missing patches aren’t needed for these games as we won’t be creating full system containers). But for simplicity, you can simply fire up a new raring box and do:

sudo add-apt-repository ppa:ubuntu-lxc/kernel
sudo apt-get update
sudo apt-get dist-upgrade

Now get a few tools from my ppa – you can of course get the source for all from either the ppa, or from my bzr trees.

sudo add-apt-repository ppa:serge-hallyn/user-natty
sudo apt-get update
sudo apt-get dist-upgrade
sudo apt-get install nsexec uidmap

Now let’s try a first experiment. Run the following program from nsexec:

usernsselfmap

This is a simple program which forks a child which runs as root in a new user namespace. Here a brief reminder of how user namespaces are designed is in order. When a new user namespace is created, the task populating it starts as userid -1, nobody. At this point it has full privileges (POSIX capabilities), but those capabilities can only be used toward resources owned by the new namespace. Furthermore, the privileges will be lost as soon as the task runs exec(3) of a normal file. See the capabilities(7) manpage for an explanation.

At this point, userids from the parent namespace may be mapped into the child. For instance, one might map userids 0-9999 in the child to userids 100000-109999 on the host. This is done by writing values to /proc/pid/uid_map (and analogously to /proc/pid/gid_map). The task writing to the map files must have privilege over the parent uids being mapped in.
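
Each line in a map file has the form ‘first uid in the namespace, first uid in the parent namespace, count’. The 0-9999 to 100000-109999 example above would therefore be written (by a sufficiently privileged task, with $pid being the namespaced task’s pid) as:

echo "0 100000 10000" > /proc/$pid/uid_map
echo "0 100000 10000" > /proc/$pid/gid_map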

This is where usernsselfmap comes in. You currently do not have privilege over userids on the host – except your own. usernsselfmap simply maps uid 0 in the container to your own userid on the host. Then it changes to gid and uid 0, and finally executes a shell.
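
You can verify what usernsselfmap did from inside the shell it gives you; your own host uid will appear where 1000 is shown here:

id -u                     # reports 0 in the new namespace
cat /proc/self/uid_map    # shows something like:  0  1000  1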

Now look around this shell

ifconfig
ifconfig eth0 down

Note that even though you have CAP_SYS_ADMIN, you cannot change the host’s network settings. However, you can now unshare a new network namespace (still without having privilege on the host) and create network devices in that namespace:

nsexec -cmn /bin/bash
ifconfig
ip link add type veth
ifconfig veth0 10.0.0.1 up
ifconfig -a

Note also that you can’t read under /root. But you can unshare a new mount namespace and mount your $HOME onto /root:

ls /root
# permission denied
nsexec -m /bin/bash
mount --bind $HOME /root
ls /root
# homedir contents

Now, in addition to the kernel implementation of user namespaces, Eric Biederman has also provided a patchset against shadow to add a concept of subuids and subgids. Briefly, you can modify login.defs to say that every new user should be allocated 10000 (unique) uids and gids above 100000. Then when you add a new user, it will automatically receive a set of 10000 unique subuids. These allocations are stored in /etc/subuid and /etc/subgid, and two new setuid-root binaries, newuidmap and newgidmap (which are shipped in the uidmap binary package, generated from the shadow source package) may be used by an unprivileged user to map userids in a child user namespace to his allocated subuids on the host.
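
The format of those files is one line per user, name:start:count. So after adding a new user you would see something like:

cat /etc/subuid
# ubuntu:100000:10000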

To conclude this post, here is an example of using the new shadow package along with nsexec to manually create a user namespace with more than one userid. First, use usermod to allocate some subuids and subgids for your user (who I’ll assume is user ‘ubuntu’ on an ec2 host) since it likely was created before subuids were configured:

sudo usermod ubuntu -v 110000-120000 -w 110000-120000

Now open two terminals as user ubuntu (or a split byobu screen). In one of them, run

nsexec -UW -s 0 -S 0 /bin/bash
about to unshare with 10000000
Press any key to exec (I am 5358)

You’ve asked nsexec to unshare its user namespace (-U), to wait for a keypress before executing /bin/bash (-W), and to switch to userid 0 (-s 0) and groupid 0 (-S 0) before starting that shell. In this example nsexec tells you it is process id 5358, so that you can map userids to it. So from the other shell do:

newuidmap 5358 0 110000 10000
newgidmap 5358 0 110000 10000
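
If you like, you can confirm that the mappings were applied before pressing a key in the first terminal:

cat /proc/5358/uid_map
# 0  110000  10000
cat /proc/5358/gid_map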

Now hit return in the nsexec window, and you will see something like:

root@server:~#

Now you can play around as before, but this time you can also switch to userids other than root.

root@server:~# newuidshell 1001
ubuntu2@server:~/nsexec

But since we’ve not set up a proper container (or chroot), and since our userid maps to 111001, which is not 1001, we can’t actually write to ubuntu2’s files or read any files which are not world readable.

This then will be the basis of ongoing and upcoming work to enable unprivileged users to create and use containers. Exciting!

(One note: I am here using an old toy ‘nsexec’ for manipulating namespaces. This will eventually be deprecated in favor of the new programs in upstream util-linux. However there has not yet been a release of util-linux with those patches, so they are not yet in the ubuntu package.)

The source tree for the modified shadow package is at lp:~serge-hallyn/ubuntu/raring/shadow/shadow-userns and source for utilities in the nsexec package is at lp:~serge-hallyn/+junk/nsexec.


User Namespaces LXC meeting

Last week we held an irc meeting to talk about user namespaces as they relate to lxc containers. The IRC log is posted at https://wiki.ubuntu.com/LxcUsernsIrcChat .

I had two goals for this meeting. The first was to make sure that lxc developers were familiar with user namespaces, so that as new patches started rolling in to accommodate user namespaces, more people might be inclined to review them – and spot my silly errors. The other was to discuss some design issues in the lxc code.

I began with some background on user namespaces, their design, motivation, and current status, topped off by a little demo on ec2. Then we proceeded to discuss future needed changes.

There are two terrific advantages to using user namespaces.

The first is that host resources are not subject to privilege in the container. That is, root in the container is not root on the host, and a fully privileged task in a container cannot exert any privilege over any resources which are not owned by the container. This advantage is fully realized right now when using lxc with a custom kernel, as per http://s3hh.wordpress.com/2012/10/31/full-ubuntu-container-confined-in-a-user-namespace. By the time raring is released, I hope for the stock raring lxc, with a custom kernel from ppa:ubuntu-lxc/kernels, to be usable in place of my personal ppa.

The second advantage of user namespaces is that they will allow unprivileged users to create and use containers. There are little things which will require privilege – like mapping userids into the container, and hooking the container’s network interface somehow into the host. Each of those can be enabled by small privileged helpers and configured in advance (and generically). So that, by 14.04 LTS, an unprivileged user should be able to safely, out of the box, do

lxc-create -t ubuntu -n r1
lxc-start -n r1

This should also be tremendously helpful for safer usage of juju with the local provider.

The steps needed (or, at least, hopefully most of them) to get to that point are discussed in the meeting log above.


Call for testing: new qemu packages for raring

tl;dr

If you use qemu, kvm, or qemu-user in raring, please test the candidate packages in ppa:serge-hallyn/crossc.

Background

The qemu and kvm projects historically had somewhat different code bases with some different features and advantages. For years they have been trying to merge the bases, and now they are just about there.

There was also divergence between the Debian and Ubuntu packages. The Ubuntu functionality was offered through two source packages – qemu-kvm in main, and qemu-linaro in universe. The qemu-kvm tree provided kvm binaries for x86 and amd64, while qemu-linaro provided everything else. The qemu-linaro tree also provided bleeding edge arm patches which were not yet in upstream qemu-kvm or qemu trees.

The wonderful Debian qemu team has an experimental set of packages to use the 1.2 upstream qemu to replace both qemu and qemu-kvm. The packages in ppa:serge-hallyn/crossc are based on that tree. They have: some packaging changes to accommodate upgrades from our current packaging layouts (thanks to stgraber, slangasek and infinity for help with some thorny issues); changes to reflect things which are not in main in Ubuntu; and additional arm patches from the qemu-linaro 1.2 tree. With these packages, we will be able to collaborate much more closely with the Debian team.

I’d like to get these packages into the archive no later than early January. Therefore, if at all possible, please do test the candidate packages, both for clean upgrades from your current setup to the new package layout (in other words, looking for errors when doing ‘apt-get dist-upgrade’) and for regression bugs in qemu itself.

To test, do the following in a raring install:

sudo add-apt-repository ppa:serge-hallyn/crossc
sudo apt-get update

and then either

sudo apt-get dist-upgrade

if you already had the packages you are interested in installed, or

sudo apt-get install qemu-system # qemu-user and qemu-user-static if you want those
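
For a very rough functional smoke test after upgrading, something along these lines should still boot an installer (the image and iso names are just placeholders):

qemu-img create -f qcow2 test.img 5G
kvm -m 512 -drive file=test.img -cdrom raring-server-amd64.iso -boot d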

Please feel free to report any issues here or on the ubuntu-server mailing list.

Thanks!


Full Ubuntu container confined in a user namespace

I’ve mentioned user namespaces here before, and shown how to play a bit with them. When a task is cloned into a new user namespace, the uids in the namespace can be mapped (1-1, in blocks) to uids on the host – for instance uid 0 in the container could be uid 100000 on the host. The uids are translated at the kernel-userspace boundary (i.e. stat, etc), and capabilities for a namespaced task are only valid against objects owned by that namespace. The result is that root in a container is unprivileged on the host.

Eric has been making great progress in moving the kernel functionality upstream. With the newest 3.7 based ubuntu kernel, plus a few of his not yet merged patches, a milestone has been reached – it’s now possible to run a full ubuntu container in a user namespace!

First start up a fresh, up-to-date quantal vm or instance. Install my user namespace ppa, install the kernel and nsexec packages from there, create a container, and convert it to be namespaced:

sudo add-apt-repository ppa:serge-hallyn/userns-natty
sudo apt-get update
sudo apt-get dist-upgrade
sudo apt-get install linux-image-3.7.0-0-generic nsexec lxc
sudo lxc-create -t ubuntu -n q1
sudo container-userns-convert q1 100000
sudo reboot

The ‘container-userns-convert’ script just shifts the user and group ids of file owners in the container rootfs, and adds two lines to the container configuration file to tell lxc to clone the new user namespace and set up the uid/gid mappings.
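
I won’t paste the script here, but the two configuration lines it adds are roughly of the following form – treat this as a sketch, as the exact key name and range size may differ by lxc version (100000 being the offset passed above):

lxc.id_map = u 0 100000 10000
lxc.id_map = g 0 100000 10000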

Now you can start the container,

sudo lxc-start -n q1 -d
sudo lxc-console -n q1

Look around the container (sudo bash); notice that it looks like a normal system, with ubuntu as uid 1000 and root as uid 0. But look from the host, and you see that root tasks in the container are actually running as uid 100000, and ubuntu ones as uid 101000.
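
One easy way to convince yourself of the shift, using the default paths from above:

# inside the container
sudo touch /root/created-in-container
# on the host
ls -ln /var/lib/lxc/q1/rootfs/root/created-in-container
# the file's numeric owner is 100000, not 0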

There are a few oddnesses (you can sudo on ttys 1-4, but sometimes it fails on /dev/console, and shutdown in the container does not kill init); the lxc package needs a few more changes (the cgroup setup needs to be moved to the container parent); and plenty of things are not yet allowed by the kernel (such as mounting an ext4 filesystem).

But this is a full Ubuntu image, confined by a private user namespace!

After working out some kinks, we’ll next want to look into container startup by unprivileged users.
