Nested lxc

One of the core features of cgmanager is to easily, safely, and transparently support the cgroup requirements of container nesting. Processes can administer cgroups exactly the same way whether inside a container or not. This also makes nested lxc very easy.

To create a container in which you can use cgroups, first create a container as usual (note: do this on an Ubuntu 14.04 system, unless you have already enabled all the pieces you need, which I am not covering here):

sudo lxc-create -t download -n t1 -- -d ubuntu -r trusty -a amd64

Now, to bind the cgmanager socket inside the container, add to its configuration:

echo "lxc.mount.auto = cgroup" | sudo tee -a /var/lib/lxc/t1/config

If you also want to be able to start nested containers, then you need to use an apparmor profile which allows lxc mounting:

echo "lxc.aa_profile = lxc-container-default-with-nesting" | \
	sudo tee -a /var/lib/lxc/t1/config

Now, simply start the container:

sudo lxc-start -n t1

You can run the cgmanager testsuite,

sudo apt-get -y install cgmanager-tests
cd /usr/share/cgmanager/tests
sudo ./runtests.sh

and use the cgm program to interact with cgmanager

cgm ping
sudo cgm create all compile
sudo cgm chown all compile 1000 1000
cgm movepid all compile $$

If you changed the aa_profile to permit nesting, then you can simply create and use containers inside the t1 container.
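
For example (assuming the container has network access so that the download template works from inside it), creating and starting a nested container from the host looks just like it does on the host itself:

sudo lxc-attach -n t1 -- lxc-create -t download -n nested -- -d ubuntu -r trusty -a amd64
sudo lxc-attach -n t1 -- lxc-start -n nested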

What I showed here is using privileged (root-owned) containers. In this case, the lxc-container-default-with-nesting profile is actually far less safe than the default profile. However, when using unprivileged containers (https://www.stgraber.org/2014/01/17/lxc-1-0-unprivileged-containers/) for at least the first layer, nesting works the exact same way, and the profile safety difference becomes moot.


Introducing cgmanager

LXC uses cgroups to track and constrain resource use by containers. Historically, cgroups have been administered through a filesystem interface. A root owned task can mount the cgroup filesystem and change its current cgroup or the limits of its cgroup. Lxc must therefore rely on apparmor to disallow cgroup mounts, and take care to bind mount only the container’s own cgroup into the container. It must also calculate its own cgroup for each controller in order to choose and track a new cgroup for each new container. Along with some other complications, this caused the cgroup handling code in lxc to grow quite large.
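
For illustration, that raw filesystem interface looks like this (cgroup v1 as found on trusty; the mount point and the ‘mygroup’ name are just conventions for the example):

sudo mount -t cgroup -o freezer freezer /sys/fs/cgroup/freezer
sudo mkdir /sys/fs/cgroup/freezer/mygroup
echo $$ | sudo tee /sys/fs/cgroup/freezer/mygroup/tasks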

To help deal with this, we wrote cgmanager, the cgroup manager. Its primary goal was to allow any task to seamlessly and securely (in terms of the host’s safety) administer its own cgroups. Its secondary goal was to ensure that lxc could deal with cgroups equally simply regardless of whether it was nested.

Cgmanager presents a D-Bus interface for making cgroup administration requests. Every request is made in relation to the requesting task’s current cgroup. Therefore ‘lxc-start’ can simply request that cgroup u1 be created, without having to worry about which cgroup it is in now.
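
To give a flavor of the interface, a raw create request with dbus-send looks roughly like this (the method name and argument types here are from memory, so treat this as a sketch and check the cgmanager documentation for the real signatures):

dbus-send --print-reply --address=unix:path=/sys/fs/cgroup/cgmanager/sock \
	--type=method_call /org/linuxcontainers/cgmanager \
	org.linuxcontainers.cgmanager0_0.Create string:'cpuset' string:'u1'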

To make this work, we read the (un-alterable) process credentials of the requesting task over the D-Bus socket. We can check the task’s current cgroup using /proc/pid/cgroup, as well as check its /proc/pid/status and /proc/pid/uid_map. For a simple request like ‘create a cgroup’, this is all the information we need.
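
These are all ordinary /proc files, so for a hypothetical requestor pid you can eyeball the same information yourself:

pid=1234	# hypothetical requestor pid
cat /proc/$pid/cgroup	# its current cgroup, per controller
grep '^[UG]id' /proc/$pid/status	# its uids and gids
cat /proc/$pid/uid_map	# its user namespace mapping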

For requests relating to another task (“Move that task to another cgroup”) or credentials (“Change ownership to that userid”), we have two cases. If the requestor is in the same namespaces as the cgmanager (which we can verify on recent kernels), then the requestor can pass the values as regular integers. We can then verify using /proc whether the requestor has the privilege to perform the access.

But if the requestor is in a different namespace, then we need the uids and pids converted. We do this by having the requestor pass them as SCM_CREDENTIALS over a file descriptor. When these are passed, the kernel (a) ensures that the requesting task has the privilege to write those credentials, and (b) converts them from the requestor’s namespace to that of the reader (cgmanager).

The SCM-enhanced D-Bus calls are a bit more complicated to use than regular D-Bus calls, and can’t be made with (unpatched) dbus-send. Therefore we provide a cgmanager proxy (cgproxy) which accepts plain D-Bus requests from a task sharing its namespaces and converts them to the enhanced messages. So when you fire up a trusty container host, it will run cgmanager. Each container on that host can bind the cgmanager D-Bus socket and run a cgproxy. (The cgmanager upstart job will start the right daemon at startup.) Lxc can administer cgroups the exact same way whether it is being run inside a container or on the host.

Using cgmanager

Cgmanager is now in main in trusty. When you log into a trusty desktop, logind should place you into your own cgroup, which you can verify by reading /proc/self/cgroup. If entries there look like

2:cpuset:/user/1000.user/c2.session

then you have your own delegated cgroups. If it instead looks like

2:cpuset:/

then you do not. You can create your own cgroup using cgm, which is just a script to wrap rather long calls to dbus-send.

sudo cgm create all $USER
sudo cgm chown all $USER $(id -u) $(id -g)

Next, move your shell into the new cgroup using

cgm movepid all $USER $$

Now you can go on to https://www.stgraber.org/2014/01/17/lxc-1-0-unprivileged-containers/ to run your unprivileged containers. Or, I sometimes like to stick a compute job in a separate freezer cgroup so I can freeze it if the cpu needs to cool down,

cgm create freezer cc
bash docompile.sh &
cgm movepid freezer cc $!

This way I can manually freeze the job when I like, or I can have a script watching my cpu temp as follows:

state="thawed"
while [ 1 ]; do
	d=`cat /sys/devices/virtual/thermal/thermal_zone0/temp` || d=1000;
	d=$((d/1000));
	if [ $d -gt 93 -a "$state" = "thawed" ]; then
		cgm setvalue freezer cc freezer.state FROZEN
		state="frozen"
	elif [ $d -lt 89 -a "$state" = "frozen" ]; then
		cgm setvalue freezer cc freezer.state THAWED
		state="thawed";
	fi;
	sleep 1;
done

Upcoming Qemu changes for 14.04

Qemu 2.0 is looking to be released on April 4. Ubuntu 14.04 closes on April 10, with release on April 17. How’s that for timing? Currently the qemu package in trusty has hundreds of patches, the majority of which fall into two buckets: old omap3 patches from qemu-linaro, and new aarch64 patches from upstream.

So I’d like to do two things. First, I’d like to drop the omap3 patches. Please, please, if you need these, let me know. I’ve hung onto them, without ever hearing from any user who wanted them, since the qemu tree replaced both the qemu-kvm and qemu-linaro packages.

Second, I’ve filed for an FFE to hopefully get qemu 2.0 into 14.04. I’ll be pushing candidate packages to ppa:ubuntu-virt/candidate, hopefully starting tomorrow. After a few days, if testing seems ok, I will put out a wider call for testing. After -rc0, if testing is going great, I will start pushing rcs to the archive, and maybe, just maybe, we can call 2.0 ready in time for 14.04!
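
If you’d like to test the candidate packages as they appear, enabling the ppa is the usual routine:

sudo add-apt-repository ppa:ubuntu-virt/candidate
sudo apt-get update
sudo apt-get upgrade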


Emulating tagged views in unity

I like tagged tiling window managers. I like tiling because it lets me avoid tedious window move+resize. I like tagged wms because I can add multiple tags to windows, so that different tag views can show different subsets of my windows: irc and mail, irc and task1, task1 and browsers, task2 and email…

Unity doesn’t tile, but it has the grid plugin, which is quite nice. But what about a tagged view? There used to be a compiz plugin called group. When I tried it in the past it didn’t seem to quite fit my needs, and beyond that I couldn’t find it in recent releases.

I briefly considered building it straight into unity, but I really just wanted something working with less than an hour’s work. So I implemented it as a script, winmark. Winmark takes a single-character mark (think of marking in vi: ma, 'a) and stores or restores the layout of the currently un-minimized windows under that mark (in ~/.local/share/winmark/a). Another tiny C program grabs the keyboard to read a mark character, then calls winmark with that character.
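
The real source is in the branch linked below; purely as a sketch of the idea (the store/restore argument, the file format, and the use of wmctrl are all my stand-ins here, not necessarily what winmark actually does), it might look like:

#!/bin/sh
# store or restore window geometries under a one-character mark
mode="$1"	# "store" or "restore"
mark="$2"	# single character, as when marking in vi
dir="$HOME/.local/share/winmark"
mkdir -p "$dir"
case "$mode" in
store)
	# record the id and geometry of each window wmctrl can see
	wmctrl -l -G | awk '{print $1, $3, $4, $5, $6}' > "$dir/$mark"
	;;
restore)
	# re-apply the saved geometry to each stored window id
	while read -r id x y w h; do
		wmctrl -i -r "$id" -e "0,$x,$y,$w,$h"
	done < "$dir/$mark"
	;;
esac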

So now I can hit shift-F9 a to store the current layout, set up a different layout, hit shift-F9 b to store that, then restore them with F9 a and F9 b.

I’m not packaging this right now as I *suspect* this is the sort of thing no one but me would want. However, I’m mentioning it here in case I’m wrong. The source is at lp:~serge-hallyn/+junk/markwindows.

There’s definite room for improvement, but I’ve hit my hour time limit, and it is useful as is :) Potential improvements would include showing overlay previews as with application switching, and restoring the stacking order.


Quickly run Ubuntu cloud images locally using uvtool

We have long been able to test Ubuntu isos very easily by using ‘testdrive’. It syncs the releases/architectures you are interested in and starts them in kvm. Very nice. But nowadays, in addition to the isos, we also distribute cloud images. They are the basis for cloud instances and ubuntu-cloud containers, but starting a local vm based on them took some manual steps. Now you can use ‘uvtool’ to easily sync and launch vms with cloud images.

uvtool is in the archive in trusty and saucy, but if you’re on precise you’ll need the ppa:

sudo add-apt-repository ppa:uvtool-dev/trunk
sudo apt-get update
sudo apt-get install uvtool

Now you can sync the release you like, using a command like:

uvt-simplestreams-libvirt sync release=saucy
or
uvt-simplestreams-libvirt sync --source http://cloud-images.ubuntu.com/daily release=trusty arch=amd64

See what you’ve got synced:

uvt-simplestreams-libvirt query

then launch a vm

uvt-kvm create xxx release=saucy arch=amd64
uvt-kvm list

and connect to it

uvt-kvm ssh --insecure -l ubuntu xxx

While it exists, you can manage it using libvirt,

virsh list
virsh destroy xxx
virsh start xxx
virsh dumpxml xxx

Doing so, you can find out that the backing store is a qcow snapshot of the ‘canonical’ image. If you decided you wanted to publish a resulting vm, you could of course convert the backing store to a raw file or whatever:

sudo qemu-img convert -f qcow2 -O raw /var/lib/uvtool/libvirt/images/xxx.qcow xxx.img
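
You can also ask qemu-img to show the backing relationship directly (same image path as in the convert above):

sudo qemu-img info /var/lib/uvtool/libvirt/images/xxx.qcow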

When you’re done, destroy it using

uvt-kvm destroy xxx

Very nice!


RSS over Pocket

When google reader went away, I switched to rss2email (r2e) which forwards rss feeds I follow to my inbox. Soon after that I was to take a trip, and I wanted to be able to read blogs on my e-reader while on the road and when disconnected.

The e-reader has readitlater (now called Pocket) installed, and Pocket accepts links over email. Perfect!

So I started procmailing the r2e emails into a separate rss folder, and wrote a little script, run hourly by cron, to send the article link embedded in emails in that folder to Pocket.

Now I just sync readitlater on the nook, and read all my blog posts at my leisure! If the file called do-rss-forward does not exist in my home directory, then the script does nothing. So when I want to read blogs with mutt I just move that file out of the way.
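
The actual script is linked below; just as a sketch of the shape (the Maildir path, the From address, and the link extraction are all assumptions of mine; add@getpocket.com is Pocket’s documented submit address):

#!/bin/sh
# forward the first link in each new rss mail to Pocket, then archive it
from="you@example.com"		# your Pocket-registered address
rssdir="$HOME/Maildir/.rss"	# wherever procmail files the r2e mail
[ -e "$HOME/do-rss-forward" ] || exit 0
for m in "$rssdir"/new/*; do
	[ -f "$m" ] || continue
	url=$(grep -o 'https*://[^ ">]*' "$m" | head -n1)
	[ -n "$url" ] || continue
	printf 'From: %s\nTo: add@getpocket.com\nSubject: %s\n\n%s\n' \
		"$from" "$url" "$url" | sendmail -t
	mv "$m" "$rssdir/cur/"
done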

The script can be seen here. You’ll of course have to fill in your Pocket-registered email address, and need a mailer running on localhost.

Have fun!


announcing lxc-snapshot

In April, lxc-clone gained the ability to create overlayfs snapshot clones of directory backed containers. In May, I wrote a little lxc-snap program based on that, which introduced simple ‘snapshots’ to enable incremental development of container images. But a standalone program is not only more tedious to discover and install, it will also tend to break when the lxc API changes.

Now (well, recently) the ability to make snapshots has been moved into the lxc API itself, and the program lxc-snapshot, based on that, is shipped with lxc. (Leaving lxc-snap happily deprecated.)

As an example, let’s say you have a container c1, and you want to test a change in its /lib/init/fstab. You can snapshot it,

sudo lxc-snapshot -n c1

test your change, and, if you don’t like the result, you can recreate the original container using

sudo lxc-snapshot -n c1 -r snap0

The snapshot is stored as a full snapshot-cloned container, and restoring is done as a copy-clone using the original container’s backing store type. If your original container was /var/lib/lxc/c1, then the first snapshot will be /var/lib/lxcsnaps/c1/snap0, the next will be /var/lib/lxcsnaps/c1/snap1, etc.
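
So after taking two snapshots of c1, you can see them sitting there directly:

sudo ls /var/lib/lxcsnaps/c1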

There are some complications. Restoring a container to its original name as done in the above example will work if you have a btrfs backed container. But if your original container was directory backed, then the snapshot will be overlayfs-based, and will depend on the original container’s rootfs existing. Therefore it will pin the original container, and you’ll need to restore the snapshot to a new name, e.g.

sudo lxc-snapshot -n c1 -r snap0 c2

If you want to see a list of snapshots for container c1, do

sudo lxc-snapshot -n c1 -L

If you want to store a comment with the snapshot, you can

echo "This is before my /lib/init/fstab change" >> comment1
sudo lxc-snapshot -n c1 -c comment1

And then when you do

sudo lxc-snapshot -n c1 -L -C

you’ll see the snapshot comments after each snapshot.

There is certainly room for lots of feature development in lxc-snapshot. It could add removal support, improve snapshot comment support, sort snapshots in the listing, and for that matter could work around the overlayfs shortcomings to allow restoring a container to its original name. So if someone is looking for something to do, here’s one of many things waiting for an owner :) Meanwhile it seems to me plenty useful as is.

Have fun!
