Introducing cgmanager

LXC uses cgroups to track and constrain resource use by containers. Historically, cgroups have been administered through a filesystem interface: a root-owned task can mount the cgroup filesystem and change its current cgroup or its cgroup’s limits. Lxc must therefore rely on apparmor to disallow cgroup mounts, and make sure to bind mount only the container’s own cgroup into the container. It must also work out its own cgroup for each controller in order to choose and track a full new cgroup for each new container. Along with some other complications, this caused the cgroup-handling code in lxc to grow quite large.

To help deal with this, we wrote cgmanager, the cgroup manager. Its primary goal was to allow any task to seamlessly and securely (in terms of the host’s safety) administer its own cgroups. Its secondary goal was to ensure that lxc could deal with cgroups equally simply regardless of whether it was nested.

Cgmanager presents a D-Bus interface for making cgroup administration requests. Every request is interpreted relative to the requesting task’s current cgroup. Therefore ‘lxc-start’ can simply request that cgroup u1 be created, without having to worry about which cgroup it is in now.
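
For the curious, a raw request looks something like the following dbus-send call; the socket path, object path and method signature here are my recollection of the cgmanager docs, so double-check them against your version:

dbus-send --print-reply --address=unix:path=/sys/fs/cgroup/cgmanager/sock \
    --type=method_call /org/linuxcontainers/cgmanager \
    org.linuxcontainers.cgmanager0_0.Create string:'cpuset' string:'u1'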

To make this work, we read the (un-alterable) process credentials of the requesting task over the D-Bus socket. We can check the task’s current cgroup using /proc/pid/cgroup, as well as check its /proc/pid/status and /proc/pid/uid_map. For a simple request like ‘create a cgroup’, this is all the information we need.
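
Concretely, the checks are of this sort, for a requestor with pid $pid (cgmanager of course reads these files itself rather than shelling out):

cat /proc/$pid/cgroup    # its current cgroup for each controller
cat /proc/$pid/status    # the Uid:/Gid: lines give its credentials
cat /proc/$pid/uid_map   # its user namespace mappings, if any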

For requests relating to another task (“Move that task to another cgroup”) or credentials (“Change ownership to that userid”), we have two cases. If the requestor is in the same namespaces as the cgmanager (which we can verify on recent kernels), then the requestor can pass the values as regular integers. We can then verify using /proc whether the requestor has the privilege to perform the access.

But if the requestor is in a different namespace, then we need the uids and pids converted. We do this by having the requestor pass SCM_CREDENTIALS over a file descriptor. When these are passed, the kernel (a) ensures that the requesting task has the privilege to write those credentials, and (b) converts them from the requestor’s namespace to that of the reader (cgmanager).

The SCM-enhanced D-Bus calls are a bit more complicated to use than regular D-Bus calls, and can’t be made with (unpatched) dbus-send. Therefore we provide a cgmanager proxy (cgproxy), which accepts plain D-Bus requests from a task which shares its namespaces and converts them into the enhanced messages. So when you fire up a Trusty container host, it will run cgmanager. Each container on that host can bind mount the cgmanager D-Bus socket and run a cgproxy. (The cgmanager upstart job will start the right daemon at startup.) Lxc can administer cgroups the exact same way whether it is being run inside a container or on the host.

Using cgmanager

Cgmanager is now in main in trusty. When you log into a trusty desktop, logind should place you into your own cgroup, which you can verify by reading /proc/self/cgroup. If entries there look like

2:cpuset:/user/1000.user/c2.session

then you have your own delegated cgroups. If it instead looks like

2:cpuset:/

then you do not. You can create your own cgroup using cgm, which is just a script to wrap rather long calls to dbus-send.

sudo cgm create all $USER
sudo cgm chown all $USER $(id -u) $(id -g)

Next enter your shell into the new cgroup using

cgm movepid all $USER $$
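
You can verify the move by re-reading /proc/self/cgroup; each entry should now show your new cgroup:

cat /proc/self/cgroup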

Now you can go on to https://www.stgraber.org/2014/01/17/lxc-1-0-unprivileged-containers/ to run your unprivileged containers. Or, I sometimes like to stick a compute job in a separate freezer cgroup so I can freeze it if the cpu needs to cool down,

cgm create freezer cc
bash docompile.sh &
cgm movepid freezer cc $!
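
Freezing and thawing it by hand is then just a matter of setting freezer.state:

cgm setvalue freezer cc freezer.state FROZEN
cgm setvalue freezer cc freezer.state THAWED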

This way I can manually freeze the job when I like, or I can have a script watching my cpu temp as follows:

#!/bin/sh
# Freeze the compute job's cgroup when the cpu runs hot, thaw it once it cools.
state="thawed"
while true; do
	# temperature is reported in millidegrees C; treat a failed read as cool
	d=`cat /sys/devices/virtual/thermal/thermal_zone0/temp` || d=1000
	d=$((d/1000))
	if [ "$d" -gt 93 ] && [ "$state" = "thawed" ]; then
		cgm setvalue freezer cc freezer.state FROZEN
		state="frozen"
	elif [ "$d" -lt 89 ] && [ "$state" = "frozen" ]; then
		cgm setvalue freezer cc freezer.state THAWED
		state="thawed"
	fi
	sleep 1
done

Upcoming Qemu changes for 14.04

Qemu 2.0 is looking to be released on April 4. Ubuntu 14.04 closes on April 10, with release on April 17. How’s that for timing. Currently the qemu package in trusty has hundreds of patches, the majority of which fall into two buckets – old omap3 patches from qemu-linaro, and new aarch64 patches from upstream.

So I’d like to do two things. First, I’d like to drop the omap3 patches. Please please, if you need these, let me know. I’ve hung onto them, without ever hearing from any user who wanted them, since the qemu tree replaced both the qemu-kvm and qemu-linaro packages.

Second, I’ve filed for an FFE to hopefully get qemu 2.0 into 14.04. I’ll be pushing candidate packages to ppa:ubuntu-virt/candidate, starting tomorrow if all goes well. After a few days, if testing seems ok, I will put out a wider call for testing. After -rc0, if testing is going great, I will start pushing rc’s to the archive, and maybe, just maybe, we can call 2.0 ready in time for 14.04!
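
If you’d like to help test, pulling from the candidate PPA is the usual dance (exactly which qemu packages you want to upgrade will depend on your use case):

sudo add-apt-repository ppa:ubuntu-virt/candidate
sudo apt-get update
sudo apt-get upgrade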

Emulating tagged views in unity

I like tagged tiling window managers. I like tiling because it lets me avoid tedious window move+resize. I like tagged wm because I can add multiple tags to windows so that different tag views can show different subsets of my windows – irc and mail, irc and task1, task1 and browsers, task2 and email…

Unity doesn’t tile, but it has the grid plugin, which is quite nice. But what about a tagged view? There used to be a compiz plugin called group. In the past when I’ve tried it, it didn’t seem to quite fit my needs, and beyond that I couldn’t find it in recent releases.

I briefly considered building it straight into unity, but I really just wanted something working with less than an hour of effort. So I implemented it as a script, winmark. Winmark takes a single-character mark (think of marking in vi, ma, ‘a) and stores or restores the layout of the currently un-minimized windows under that mark (in ~/.local/share/winmark/a). Another tiny C program grabs the keyboard to read a mark character, then calls winmark with that character.

So now I can hit shift-F9 a to store the current layout, set up a different layout, hit shift-F9 b to store that, then restore them with F9 a and F9 b.

I’m not packaging this right now as I *suspect* this is the sort of thing no one but me would want. However I’m mentioning it here in case I’m wrong. The source is at lp:~serge-hallyn/+junk/markwindows.

There’s definite room for improvement, but I’ve hit my hour time limit, and it is useful as is :) Potential improvements would include showing overlay previews as with application switching, and restoring the stacking order.

Quickly run Ubuntu cloud images locally using uvtool

We have long been able to test Ubuntu isos very easily by
using ‘testdrive’. It syncs releases/architectures you are interested
in and starts them in kvm. Very nice. But nowadays, in addition to
the isos, we also distribute cloud images. They are the basis for
cloud instances and ubuntu-cloud containers, but starting a local vm
based on them took some manual steps. Now you can use ‘uvtool’ to
easily sync and launch vms with cloud images.

uvtool is in the archive in trusty and saucy, but if you’re on precise
you’ll need the ppa:

sudo add-apt-repository ppa:uvtool-dev/trunk
sudo apt-get update
sudo apt-get install uvtool

Now you can sync the release you like, using a command like:

uvt-simplestreams-libvirt sync release=saucy
or
uvt-simplestreams-libvirt sync --source http://cloud-images.ubuntu.com/daily release=trusty arch=amd64

See what you’ve got synced:

uvt-simplestreams-libvirt query

then launch a vm

uvt-kvm create xxx release=saucy arch=amd64
uvt-kvm list

and connect to it

uvt-kvm ssh --insecure -l ubuntu xxx

While it exists you can manage it using libvirt,

virsh list
virsh destroy xxx
virsh start xxx
virsh dumpxml xxx

Doing so, you can find out that the backing store is a qcow snapshot
of the ‘canonical’ image. If you decided you wanted to publish a
resulting vm, you could of course convert the backing store to a
raw file or whatever:

sudo qemu-img convert -f qcow2 -O raw /var/lib/uvtool/libvirt/images/xxx.qcow xxx.img
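
To see the snapshot relationship for yourself, qemu-img info reports the backing file (assuming the image lives at the same path as above):

sudo qemu-img info /var/lib/uvtool/libvirt/images/xxx.qcow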

When you’re done, destroy it using

uvt-kvm destroy xxx

Very nice!

RSS over Pocket

When google reader went away, I switched to rss2email (r2e) which forwards rss feeds I follow to my inbox. Soon after that I was to take a trip, and I wanted to be able to read blogs on my e-reader while on the road and when disconnected.

The e-reader has readitlater (now called Pocket) installed, and pocket accepts links over email. Perfect!

So I started procmailing the r2e emails into a separate rss folder, and wrote a little script, run hourly by cron, to send the article link embedded in emails in that folder to Pocket.
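
The procmail rule itself is nothing fancy; something along these lines, though the header you match on and the folder format will depend on your rss2email and mail setup (X-RSS-Feed and the mbox folder name here are just guesses at what mine uses):

# file anything rss2email delivers into the rss folder
:0:
* ^X-RSS-Feed:
rss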

Now I just sync readitlater on the nook, and read all my blog posts at my leisure! If the file called do-rss-forward does not exist in my home directory, then the script does nothing. So when I want to read blogs with mutt I just move that file out of the way.

The script can be seen here. You’ll of course have to fill in your Pocket-registered email address, and need a mailer running on localhost.

Have fun!

announcing lxc-snapshot

In April, lxc-clone gained the ability to create overlayfs snapshot clones of directory backed containers. In May, I wrote a little lxc-snap program based on that, which introduced simple ‘snapshots’ to enable easy incremental development of container images. But a standalone program is not only more tedious to discover and install, it will also tend to break when the lxc API changes.

Now (well, recently) the ability to make snapshots has been moved into the lxc API itself, and the program lxc-snapshot, based on that, is shipped with lxc. (Leaving lxc-snap happily deprecated.)

As an example, let’s say you have a container c1, and you want to test a change in its /lib/init/fstab. You can snapshot it,

sudo lxc-snapshot -n c1

test your change, and, if you don’t like the result, you can recreate the original container using

sudo lxc-snapshot -n c1 -r snap0

The snapshot is stored as a full snapshot-cloned container, and restoring is done as a copy-clone using the original container’s backing store type. If your original container was /var/lib/lxc/c1, then the first snapshot will be /var/lib/lxcsnaps/c1/snap0, the next will be /var/lib/lxcsnaps/c1/snap1, etc.
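
So after a couple of snapshots you can see them laid out on disk with

sudo ls /var/lib/lxcsnaps/c1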

There are some complications. Restoring a container to its original name as done in the above example will work if you have a btrfs backed container. But if your original container was directory backed, then the snapshot will be overlayfs-based, and will depend on the original container’s rootfs existing. Therefore it will pin the original container, and you’ll need to restore the snapshot to a new name, i.e.

sudo lxc-snapshot -n c1 -r snap0 c2

If you want to see a list of snapshots for container c1, do

sudo lxc-snapshot -n c1 -L

If you want to store a comment with the snapshot, you can

echo "This is before my /lib/init/fstab change" >> comment1
sudo lxc-snapshot -n c1 -c comment1

And then when you do

sudo lxc-snapshot -n c1 -L -C

you’ll see the snapshot comments after each snapshot.

There is certainly room for lots of feature development in lxc-snapshot. It could add removal support, improve snapshot comment support, sort snapshots in the listing, and for that matter could work around the overlayfs shortcomings to allow restoring a container to its original name. So if someone is looking for something to do, here’s one of many things waiting for an owner :) Meanwhile it seems to me plenty useful as is.

Have fun!

libvirt defaults (and openvswitch bridge performance)

The libvirt-bin package in Ubuntu installs a default NATed virtual network,
virbr0. This isn’t always the best choice for everyone, however it “just
works” everywhere. It also provides some simple protection – the VMs aren’t
exposed on the network for all attackers to see.
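
You can see how that default network is defined (the NAT setup, its subnet, and so on) with virsh:

virsh net-list --all
virsh net-dumpxml default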

Two alternatives are sometimes suggested. One is to simply default to a
non-NATed bridge. The biggest reason we can’t do this is that it would break
users with wireless cards. Another issue is that instead of simply tacking
something new onto the network, we have to usurp the default network interface
into our new bridge. It’s impossible to guess all the ways users might have
already customized their network.

The other alternative is to use an openvswitch bridge. This actually has the
same problems as the linux bridge – you still can’t add a VM nic to an
openvswitch-bridged wireless NIC, and we still would be modifying the default
network.

However the suggestion did make me wonder – how would ovs bridges compare to
linux ones in terms of performance? I’d have expected them to be slower (as a
tradeoff for much greater flexibility), but I was surprised when I was told
that ovs bridges are expected to perform better. So I set up a quick test, and
sure enough!

I set up two laptops running saucy, connected over a physical link. On one of
them I installed a saucy VM. Then I ran iperf over the physical link from the other
laptop to the VM. When the VM was attached using a linux bridge, I got:

830 M/s
757 M/s
755 M/s
827 M/s
821 M/s

When I instead used an openvswitch bridge, I got:

925 M/s
925 M/s
925 M/s
916 M/s
924 M/s
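
If you want to reproduce the comparison, the two bridge setups look roughly like this (the interface names here are placeholders, not the exact ones I used):

# linux bridge
sudo brctl addbr br0
sudo brctl addif br0 eth0
# openvswitch bridge
sudo ovs-vsctl add-br ovsbr0
sudo ovs-vsctl add-port ovsbr0 eth0
# then run iperf -s in the VM and iperf -c <vm-ip> from the other laptop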

So, if we’re going to go with a new default, using openvswitch seems like a
good way to go. I’m still loath to make changes to the default, however a
script (sitting next to libvirt-migrate-qemu-disks) which users can optionally
run to do the gruntwork for them might be workable.
