Live container migration – on its way

The criu project has been working hard to make application checkpoint/restart feasible. Tycho has implemented lxc-checkpoint and lxc-restart on top of that (as well as of course contributing the needed bits to criu itself), and now shows off first steps toward real live migration: http://tycho.ws/blog/2014/09/container-migration.html

Excellent!

Posted in Uncategorized | Tagged , , | Leave a comment

rsync.net feature: subuids

The problem: Some time ago, I had a server “in the wild” from which I
wanted some data backed up to my rsync.net account. I didn’t want to
put sensitive credentials on this server in case it got compromised.

The awesome admins at rsync.net pointed out their subuid feature. For
no extra charge, they’ll give you another uid, which can have its own
ssh keys, whose home directory is symbolically linked under your main
uid’s home directory. So the server can rsync backups to the subuid,
and if it is compromised, attackers cannot get at any info which didn’t
originate from that server anyway.

Very nice.

Posted in Uncategorized | Leave a comment

unprivileged btrfs clones

In 14.04 you can create unprivileged container clones usign overlayfs. Depending on your use case, these can be ideal, since the delta between your cloned and original containers is directly accessible as ~/.local/share/lxc/clonename/delta0/, ready to rsync.

However, that is not my use case. I like to keep a set of original containers updated for quick clone and use by my package build scripts or for manual use for bug reproduction etc. Overlayfs gets in the way here since updating the original container requires making sure no clones exist, else you can cause glitches or corruption in the clone.

Fortunately, if you are using ppa:ubuntu-lxc/daily, or building from git HEAD, then as of last week you can use btrfs clones with your unprivileged containers. This is perfect for me as I can update the originals while a long-running build is on-going in a clone, or if I just want to keep a clone around until i get time to extract the patch or bugfix or log contents sitting there.

So I create base containers using

lxc-create --template download -B btrfs --name c-trusty -- -d ubuntu -r trusty -a amd64

then have create_container and start_container scripts which basically do

lxc-clone --snapshot --orig c-trusty --new c-trusty-5

Perfect.

Posted in Uncategorized | Leave a comment

Xspice in containers

For some time I’ve been wanting to run ubuntu-desktop and others, remotely, in containers, using spice. Historically vnc has been the best way to do remote desktops, but spice should provide a far better experience. Unfortunately, Xspice has always failed for me, most recently segfaulting on startup. But fortunately, this is fixed in git, and I’m told a new release may be coming soon. While waiting for the new release (0.12.7?), I pushed a package based on git HEAD to ppa:serge-hallyn/virt.

You can create a container to test this with as follows:

lxc-create -t download -n desk1 -- -d ubuntu -r trusty -a amd64
lxc-start -n desk1 -d
lxc-attach -n desk1

Then inside that container shell,

add-apt-repository ppa:serge-hallyn/virt
apt-get update
apt-get install xserver-xspice ubuntu-desktop

ubuntu-desktop can take awhile to install. You can simply install fvwm and xterm if you want a quicker test. Once that’s all one, copy the xspice configuration file into your home directory, uncompress it, set the SpiceDisableTicketing option (or configure a password), and use the config file to configure an Xorg session:

cp /usr/share/doc/xserver-xspice/spiceqxl.xorg.conf.example.gz /root
cd /root
gunzip spiceqxl.xorg.conf.example.gz
cat >> spiceqxl.xorg.conf.example.gz << EOF
Option "SpiceDisableTicketing" "1"
EOF
/usr/bin/Xorg -config /root/spiceqxl.xorg.conf.example :2 &

Now fire up unity, xterm, or fvwm:

DISPLAY=:2 unity

Now connect using either spicy or spicec,

spicec -h  -p 5900

Of course if the container is on a remote host, you’ll want to set up some ssh port forwards to enable that, but if needed then that’s a subject for another post.

Posted in Uncategorized | 4 Comments

Nested lxc

One of the core features of cgmanager is to easily, safely, and transparently support the cgroup requirements of container nesting. Processes can administer cgroups exactly the same way whether inside a container or not. This also makes nested lxc very easy.

To create a container in which you can use cgroups, first create a container as usual (note, do this on an Ubuntu 14.04 system, unless you have enabled all the pieces you need – which I am not covering here):

sudo lxc-create -t download -n t1 -- -d ubuntu -r trusty -a amd64

Now to bind the cgmanager socket inside the container,

echo "lxc.mount.auto = cgroup" | sudo tee -a /var/lib/lxc/t1/config

If you also want to be able to start nested containers, then you need to use an apparmor profile which allows lxc mounting:

echo "lxc.aa_profile = lxc-container-default-with-nesting" | \
	sudo tee -a /var/lib/lxc/t1/config

Now, simply start the container

sudo lxc-start -n t1

You can run the cgmanager testsuite,

sudo apt-get -y install cgmanager-tests
cd /usr/share/cgmanager/tests
sudo ./runtests.sh

and use the cgm program to interact with cgmanager

cgm ping
sudo cgm create all compile
sudo cgm chown all compile 1000 1000
cgm movepid all compile $$

If you changed the aa_profile to permit nesting, then you can simply create and use containers inside the t1 container.

What I showed here is using privileged (root-owned) containers. In this case, the lxc-container-default-with-nesting profile is actually far less safe than the default profile. However, when using unprivileged containers (https://www.stgraber.org/2014/01/17/lxc-1-0-unprivileged-containers/) for at least the first layer, nesting works the exact same way, and the profile safety difference becomes moot.

Posted in Uncategorized | 4 Comments

Introducing cgmanager

LXC uses cgroups to track and constrain resource use by containers. Historically cgroups have been administered through a filesystem interface. A root owned task can mount the cgroup filesystem and change its current cgroup or the limits of its cgroup. Lxc must therefore rely on apparmor to disallow cgroup mounts, and make sure to bind mount only the container’s own cgroup into the container. It must also calculate its own cgroup for each controller to choose and track a full new cgroup for a new container. Along with some other complications, this caused the amount of code in lxc to deal with cgroups to become quite large.

To help deal with this, we wrote cgmanager, the cgroup manager. Its primary goal was to allow any task to seamlessly and securely (in terms of the host’s safety) administer its own cgroups. Its secondary goal was to ensure that lxc could deal with cgroups equally simply regardless of whether it was nested.

Cgmanager presents a D-Bus interface for making cgroup administration requests. Every request is made in relation to the requesting task’s current cgroup. Therefore ‘lxc-start’ can simply request for cgroup u1 to be created, without having to worry about what cgroup it is in now.

To make this work, we read the (un-alterable) process credentials of the requesting task over the D-Bus socket. We can check the task’s current cgroup using /proc/pid/cgroup, as well as check its /proc/pid/status and /proc/pid/uid_map. For a simple request like ‘create a cgroup’, this is all the information we need.

For requests relating to another task (“Move that task to another cgroup”) or credentials (“Change ownership to that userid”), we have two cases. If the requestor is in the same namespaces as the cgmanager (which we can verify on recent kernels), then the requestor can pass the values as regular integers. We can then verify using /proc whether the requestor has the privilege to perform the access.

But if the requestor is in a different namespace, then we need to uids and pids converted. We do this by having the requestor pass SCM_CREDENTIALS over a file descriptor. When these are passed, the kernel (a) ensures that the requesting task has privilege to write those credentials, and (b) converts them from the requestor’s namespace to the reader (cgmanager).

The SCM-enhanced D-Bus calls are a bit more complicated to use than regular D-Bus calls, and can’t be made with (unpatched) dbus-send. Therefore we provide a cgmanager proxy (cgproxy) which accepts the plain D-Bus requests from a task which shares its namespaces and converts them to the enhanced messages. So when you fire up a Trusty containers host, it will run the cgmanager. Each container on that host can bind the cgmanager D-Bus socket and run a cgproxy. (The cgmanager upstart job will start the right daemon at startup) Lxc can administer cgroups the exact same way whether it is being run inside a container or on the host.

Using cgmanager

Cgmanager is now in main in trusty. When you log into a trusty desktop, logind should place you into your own cgroup, which you can verify by reading /proc/self/cgroup. If entries there look like

2:cpuset:/user/1000.user/c2.session

then you have your own delegated cgroups. If it instead looks like

2:cpuset:/

then you do not. You can create your own cgroup using cgm, which is just a script to wrap rather long calls to dbus-send.

sudo cgm create all $USER
sudo cgm chown all $USER $(id -u) $(id -g)

Next enter your shell into the new cgroup using

cgm movepid all $USER $$

Now you can go on to https://www.stgraber.org/2014/01/17/lxc-1-0-unprivileged-containers/ to run your unprivileged containers. Or, I sometimes like to stick a compute job in a separate freezer cgroup so I can freeze it if the cpu needs to cool down,

cgm create freezer cc
bash docompile.sh &
cgm movepid freezer cc $!

This way I can manually freeze the job when I like, or I can have a script watching my cpu temp as follows:

state="thawed"
while [ 1 ]; do
	d=`cat /sys/devices/virtual/thermal/thermal_zone0/temp` || d=1000;
	d=$((d/1000));
	if [ $d -gt 93 -a "$state" = "thawed" ]; then
		cgm setvalue freezer cc freezer.state FROZEN
		state="frozen"
	elif [ $d -lt 89 -a "$state" = "frozen" ]; then
		cgm setvalue freezer cc freezer.state THAWED
		state="thawed";
	fi;
	sleep 1;
done
Posted in Uncategorized | Leave a comment

Upcoming Qemu changes for 14.04

Qemu 2.0 is looking to be released on April 4. Ubuntu 14.04 closes on April 10, with release on April 17. How’s that for timing. Currently the qemu package in trusty has hundreds of patches, the majority of which fall into two buckets – old omap3 patches from qemu-linaro, and new aarch64 patches from upstream.

So I’d like to do two things. FIrst, I’d like to drop the omap3 patches. Please please, if you need these, let me know. I’ve hung onto them, without ever hearing from any user who wanted them, since the qemu tree replaced both the qemu-kvm and qemu-linaro packages.

Second, I’ve filed for a FFE to hopefuly get qemu 2.0 into 14.04. I’ll be pushing candidate packages to ppa:ubuntu-virt/candidate hopefully starting tomorrow. After a few days, if testing seems ok, I will put out a wider call for testing. After -rc0, if testing is going great, I will start pushing rc’s to the archive, and maybe, just maybe, we can call 2.0 ready in time for 14.04!

Posted in Uncategorized | 1 Comment