Where does lxd fit in

Since its announcement, there appears to have been some confusion and concern about lxd, how it relates to lxc, and whether it will be taking away from lxc development.

When lxc was first started around 2007, it was mainly a userspace tool – some C code and some shell scripts – to exercise the new kernel features then under development for container and checkpoint-restart functionality. The lxc command line experience, after all these years, is quite set in stone. While it is not ideal (the mandatory -n annoys a lot of people), it has served us very well for a long time.

A few years ago, we took all of the main container-related functions which could be done with various commands and exported them through the new ‘lxc API’. For instance, lxc-create had been a script, and lxc-start and lxc-execute were separate C programs. The new lxc ‘API’ was built around a container object with methods, including ‘create’ and ‘start’, for the common operations.

From the start we had in mind at least Python bindings to the API, and in short order bindings came into being for C, Python 3, Python 2, Go, Lua, Haskell, and more, allowing container administration from these languages without having to shell out to the lxc commands. So now code running on the same machine can manipulate containers. But we still have the arguably crufty command line language, and the API is local only.

lxd addresses those two issues. First, it presents a REST API for manipulating containers, thereby exporting container management over the network. Second, it offers a command line client which uses that REST API to administer containers across remote hosts. The command line interface is basically what we came up with when we asked “what, after years of working with containers, would be the perfect, intuitive, most concise and still flexible CLI we could imagine?” For handling remote containers it borrows some good parts of the git remote API. (I say “we” here, but really the inestimable stgraber designed the CLI.) This allows us to leave the legacy lxc API as-is for administering local containers (“lxc-create”, “lxc-start”, etc), while giving us a nicer API and easier administration using the new CLI (“lxc start c1”, “lxc start images:ubuntu/trusty/amd64 host2:new-container”).
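In practice the git-style remote handling means you register a remote once and then prefix container names with it. A hypothetical session (the remote name and address are placeholders, and the syntax may still change as lxd matures):

lxc remote add host2 host2.example.com
lxc start c1
lxc start host2:c2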

In short, lxd exports a new interface over the network, but one wrapped entirely around lxc. So lxc will not be going away, and the focus on lxd will mean further improvements for lxc, not a shift away from it.


Live container migration – on its way

The criu project has been working hard to make application checkpoint/restart feasible. Tycho has implemented lxc-checkpoint and lxc-restart on top of that (as well as of course contributing the needed bits to criu itself), and now shows off first steps toward real live migration: http://tycho.ws/blog/2014/09/container-migration.html
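A rough sketch of what the checkpoint and restore steps look like with lxc-checkpoint (flag names as in current lxc git; treat this as illustrative, since the interface is still settling):

# dump the running container c1 to a directory, stopping it afterwards
sudo lxc-checkpoint -n c1 -D /tmp/c1-checkpoint -s
# restore it from that dump
sudo lxc-checkpoint -r -n c1 -D /tmp/c1-checkpoint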

Excellent!


rsync.net feature: subuids

The problem: Some time ago, I had a server “in the wild” from which I wanted some data backed up to my rsync.net account. I didn’t want to put sensitive credentials on this server in case it got compromised.

The awesome admins at rsync.net pointed out their subuid feature. For no extra charge, they’ll give you another uid, which can have its own ssh keys, whose home directory is symbolically linked under your main uid’s home directory. So the server can rsync backups to the subuid, and if it is compromised, attackers cannot get at any info which didn’t originate from that server anyway.
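The backup job on the server then only needs the subuid’s name and its ssh key. Something like the following (the hostname, uid and paths here are made-up placeholders):

rsync -az -e "ssh -i /root/.ssh/backup_key" /srv/data/ subuid123@server.rsync.net:server-backup/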

Very nice.


unprivileged btrfs clones

In 14.04 you can create unprivileged container clones using overlayfs. Depending on your use case, these can be ideal, since the delta between your cloned and original containers is directly accessible as ~/.local/share/lxc/clonename/delta0/, ready to rsync.
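For example, a hypothetical overlayfs clone and backup of its delta (container and host names are just examples):

lxc-clone --snapshot -B overlayfs --orig c-trusty --new c-trusty-ovl
rsync -a ~/.local/share/lxc/c-trusty-ovl/delta0/ backup-host:deltas/c-trusty-ovl/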

However, that is not my use case. I like to keep a set of original containers updated for quick cloning and use by my package build scripts, or for manual use in bug reproduction and the like. Overlayfs gets in the way here, since updating the original container requires making sure no clones exist; otherwise you risk glitches or corruption in the clones.

Fortunately, if you are using ppa:ubuntu-lxc/daily, or building from git HEAD, then as of last week you can use btrfs clones with your unprivileged containers. This is perfect for me, as I can update the originals while a long-running build is ongoing in a clone, or when I just want to keep a clone around until I get time to extract the patch or bugfix or log contents sitting there.

So I create base containers using

lxc-create --template download -B btrfs --name c-trusty -- -d ubuntu -r trusty -a amd64

then have create_container and start_container scripts which basically do

lxc-clone --snapshot --orig c-trusty --new c-trusty-5
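For the curious, a minimal sketch of what such a create_container script can look like (the naming scheme is made up for illustration, not my actual script):

#!/bin/sh
# snapshot-clone the btrfs base container into a uniquely named new container
base=c-trusty
new=${base}-$(date +%s)
lxc-clone --snapshot --orig $base --new $new
echo "created $new"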

Perfect.


Xspice in containers

For some time I’ve been wanting to run ubuntu-desktop and others, remotely, in containers, using spice. Historically VNC has been the best way to do remote desktops, but spice should provide a far better experience. Unfortunately, Xspice has always failed for me, most recently segfaulting on startup. Fortunately, this is fixed in git, and I’m told a new release may be coming soon. While waiting for the new release (0.12.7?), I pushed a package based on git HEAD to ppa:serge-hallyn/virt.

You can create a container to test this with as follows:

lxc-create -t download -n desk1 -- -d ubuntu -r trusty -a amd64
lxc-start -n desk1 -d
lxc-attach -n desk1

Then inside that container shell,

add-apt-repository ppa:serge-hallyn/virt
apt-get update
apt-get install xserver-xspice ubuntu-desktop

ubuntu-desktop can take a while to install. You can simply install fvwm and xterm if you want a quicker test. Once that’s all done, copy the xspice configuration file into your home directory, uncompress it, set the SpiceDisableTicketing option (or configure a password), and use the config file to start an Xorg session:

cp /usr/share/doc/xserver-xspice/spiceqxl.xorg.conf.example.gz /root
cd /root
gunzip spiceqxl.xorg.conf.example.gz
cat >> spiceqxl.xorg.conf.example << EOF
Option "SpiceDisableTicketing" "1"
EOF
/usr/bin/Xorg -config /root/spiceqxl.xorg.conf.example :2 &

Now fire up unity, xterm, or fvwm:

DISPLAY=:2 unity

Now connect using either spicy or spicec, pointing at the container’s address:

spicec -h <container-ip> -p 5900

Of course if the container is on a remote host, you’ll want to set up some ssh port forwards to enable that, but if needed then that’s a subject for another post.
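For the simple case, a local ssh forward to the container’s spice port is usually enough; roughly (host names are placeholders):

ssh -N -L 5900:<container-ip>:5900 user@remote-host
spicec -h localhost -p 5900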


Nested lxc

One of the core features of cgmanager is to easily, safely, and transparently support the cgroup requirements of container nesting. Processes can administer cgroups exactly the same way whether inside a container or not. This also makes nested lxc very easy.

To create a container in which you can use cgroups, first create a container as usual (note, do this on an Ubuntu 14.04 system, unless you have enabled all the pieces you need – which I am not covering here):

sudo lxc-create -t download -n t1 -- -d ubuntu -r trusty -a amd64

Now to bind the cgmanager socket inside the container,

echo "lxc.mount.auto = cgroup" | sudo tee -a /var/lib/lxc/t1/config

If you also want to be able to start nested containers, then you need to use an apparmor profile which allows lxc mounting:

echo "lxc.aa_profile = lxc-container-default-with-nesting" | \
	sudo tee -a /var/lib/lxc/t1/config

Now, simply start the container

sudo lxc-start -n t1

You can run the cgmanager testsuite,

sudo apt-get -y install cgmanager-tests
cd /usr/share/cgmanager/tests
sudo ./runtests.sh

and use the cgm program to interact with cgmanager

cgm ping
sudo cgm create all compile
sudo cgm chown all compile 1000 1000
cgm movepid all compile $$

If you changed the aa_profile to permit nesting, then you can simply create and use containers inside the t1 container.
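For example, attaching to t1 and creating a container inside it works just as it does on the host (assuming the nesting profile above and working networking inside t1):

sudo lxc-attach -n t1
# then, inside t1:
apt-get update && apt-get install -y lxc
lxc-create -t download -n nested -- -d ubuntu -r trusty -a amd64
lxc-start -n nested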

What I showed here is using privileged (root-owned) containers. In this case, the lxc-container-default-with-nesting profile is actually far less safe than the default profile. However, when using unprivileged containers (https://www.stgraber.org/2014/01/17/lxc-1-0-unprivileged-containers/) for at least the first layer, nesting works the exact same way, and the profile safety difference becomes moot.


Introducing cgmanager

LXC uses cgroups to track and constrain resource use by containers. Historically cgroups have been administered through a filesystem interface. A root owned task can mount the cgroup filesystem and change its current cgroup or the limits of its cgroup. Lxc must therefore rely on apparmor to disallow cgroup mounts, and make sure to bind mount only the container’s own cgroup into the container. It must also calculate its own cgroup for each controller in order to choose and track a full new cgroup for each new container. Along with some other complications, this caused lxc’s cgroup-handling code to become quite large.

To help deal with this, we wrote cgmanager, the cgroup manager. Its primary goal was to allow any task to seamlessly and securely (in terms of the host’s safety) administer its own cgroups. Its secondary goal was to ensure that lxc could deal with cgroups equally simply regardless of whether it was nested.

Cgmanager presents a D-Bus interface for making cgroup administration requests. Every request is made relative to the requesting task’s current cgroup. Therefore ‘lxc-start’ can simply request that cgroup u1 be created, without having to worry about what cgroup it is in now.

To make this work, we read the (un-alterable) process credentials of the requesting task over the D-Bus socket. We can check the task’s current cgroup using /proc/pid/cgroup, as well as check its /proc/pid/status and /proc/pid/uid_map. For a simple request like ‘create a cgroup’, this is all the information we need.
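For reference, that create request is just a D-Bus method call on cgmanager’s socket. Hand-rolled with dbus-send it looks roughly like this (object path and interface name as I recall them from the cgmanager documentation, so treat the exact spelling as an assumption; the cgm tool shown later wraps calls of this form):

dbus-send --print-reply --address=unix:path=/sys/fs/cgroup/cgmanager/sock \
	--type=method_call /org/linuxcontainers/cgmanager \
	org.linuxcontainers.cgmanager0_0.Create string:'freezer' string:'u1'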

For requests relating to another task (“Move that task to another cgroup”) or credentials (“Change ownership to that userid”), we have two cases. If the requestor is in the same namespaces as the cgmanager (which we can verify on recent kernels), then the requestor can pass the values as regular integers. We can then verify using /proc whether the requestor has the privilege to perform the access.

But if the requestor is in a different namespace, then we need the uids and pids converted. We do this by having the requestor pass SCM_CREDENTIALS over a file descriptor. When these are passed, the kernel (a) ensures that the requesting task has the privilege to write those credentials, and (b) converts them from the requestor’s namespace to that of the reader (cgmanager).

The SCM-enhanced D-Bus calls are a bit more complicated to use than regular D-Bus calls, and can’t be made with (unpatched) dbus-send. Therefore we provide a cgmanager proxy (cgproxy), which accepts plain D-Bus requests from a task sharing its namespaces and converts them to the enhanced messages. So when you fire up a Trusty container host, it will run the cgmanager. Each container on that host can bind the cgmanager D-Bus socket and run a cgproxy. (The cgmanager upstart job will start the right daemon at startup.) Lxc can administer cgroups the exact same way whether it is being run inside a container or on the host.

Using cgmanager

Cgmanager is now in main in trusty. When you log into a trusty desktop, logind should place you into your own cgroup, which you can verify by reading /proc/self/cgroup. If entries there look like

2:cpuset:/user/1000.user/c2.session

then you have your own delegated cgroups. If the entries instead look like

2:cpuset:/

then you do not. You can create your own cgroup using cgm, which is just a script to wrap rather long calls to dbus-send.

sudo cgm create all $USER
sudo cgm chown all $USER $(id -u) $(id -g)

Next, move your shell into the new cgroup using

cgm movepid all $USER $$

Now you can go on to https://www.stgraber.org/2014/01/17/lxc-1-0-unprivileged-containers/ to run your unprivileged containers. Or, I sometimes like to stick a compute job in a separate freezer cgroup so I can freeze it if the cpu needs to cool down,

cgm create freezer cc
bash docompile.sh &
cgm movepid freezer cc $!
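Freezing and thawing the job by hand is then just a matter of writing freezer.state through cgm:

cgm setvalue freezer cc freezer.state FROZEN
cgm setvalue freezer cc freezer.state THAWED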

This way I can manually freeze the job when I like, or I can have a script watching my cpu temp as follows:

state="thawed"
while [ 1 ]; do
	# read the cpu temperature (millidegrees C); on read failure default
	# to 1000 (1C) so the job does not stay frozen
	d=`cat /sys/devices/virtual/thermal/thermal_zone0/temp` || d=1000;
	d=$((d/1000));
	# freeze the cgroup above 93C, thaw it again below 89C
	if [ $d -gt 93 -a "$state" = "thawed" ]; then
		cgm setvalue freezer cc freezer.state FROZEN
		state="frozen"
	elif [ $d -lt 89 -a "$state" = "frozen" ]; then
		cgm setvalue freezer cc freezer.state THAWED
		state="thawed";
	fi;
	sleep 1;
done