libvirt defaults (and openvswitch bridge performance)

The libvirt-bin package in Ubuntu installs a default NATed virtual network,
virbr0. This isn’t always the best choice for everyone; however, it “just
works” everywhere. It also provides some simple protection – the VMs aren’t
exposed on the network for all attackers to see.

Two alternatives are sometimes suggested. One is to simply default to a
non-NATed bridge. The biggest reason we can’t do this is that it would break
users with wireless cards. Another issue is that instead of simply tacking
something new onto the network, we have to usurp the default network interface
into our new bridge. It’s impossible to guess all the ways users might have
already customized their network.

The other alternative is to use an openvswitch bridge. This actually has the
same problems as the linux bridge – you still can’t add a VM nic to an
openvswitch-bridged wireless NIC, and we still would be modifying the default
network.

However the suggestion did make me wonder – how would ovs bridges compare to
linux ones in terms of performance? I’d have expected them to be slower (as a
tradeoff for much greater flexibility), but I was surprised when I was told
that ovs bridges are expected to perform better. So I set up a quick test, and
sure enough!

I set up two laptops running saucy, connected over a physical link. On one of
them I installed a saucy VM. Then I ran iperf over the physical link from the
other laptop to the VM. When the VM was attached using a linux bridge, I got:

830 Mb/s
757 Mb/s
755 Mb/s
827 Mb/s
821 Mb/s

When I instead used an openvswitch bridge, I got:

925 Mb/s
925 Mb/s
925 Mb/s
916 Mb/s
924 Mb/s
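
As a rough sanity check on those runs (just a quick sketch averaging the numbers quoted above):

```shell
# average each set of iperf runs quoted above
avg() { echo "$@" | tr ' ' '\n' | awk '{ s += $1; n++ } END { printf "%d\n", s / n }'; }
avg 830 757 755 827 821   # linux bridge average: 798
avg 925 925 925 916 924   # openvswitch average: 923
```

That works out to roughly a 15% improvement on this (admittedly informal) test.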

So, if we’re going to go with a new default, using openvswitch seems like a
good way to go. I’m still loath to make changes to the default; however, a
script (sitting next to libvirt-migrate-qemu-disks) which users can optionally
run to do the gruntwork for them might be workable.


Line buffering (and talking computers and irc meetings)

When I was a kid, I wanted computers to talk to me (and vice versa). Why, decades later, am I still squinting at the screen during hour-long irc meetings? Basta! So the last two weeks I’ve been listening to our hour-long team irc meeting.

There is an excellent screen reader integrated with unity (well, it needs some tweaking to speed up the voice, which you can do in /etc/speech-dispatcher/speechd.conf), but it’s designed for a different purpose: every time you foreground a window it will switch to reading from that window. I want something simpler – I want a stream of text to be converted to speech as it comes in, while I play on other windows and desktops without the speech being affected. So the way I’ve decided to do it is as follows. First I connect using ‘sic’ (the ‘simple irc client’ from suckless.org) to the irc server. Sic uses simple text files for input and output, with all output (for all channels) going to the same file. So I start it as:

   rm -f sic.in; touch sic.in; tail -f sic.in | sic -h irc.freenode.net -n lurker | tee -a sic.out

and in one window I have a loop into which I can type:

    while read l; do [ -z "$l" ] && break; echo "$l" >> sic.in; done

Then I use the following to make it speak:

    tail -n 2 -f ~/sic/sic.out | egrep --line-buffered -v 'ChanServ|TOPIC|JOIN|PART|QUIT|ubottu' | awk -F\  '{ $1=""; $2=""; $3=""; print; fflush(); }' | espeak --stdout -s 220 -ven-us | aplay 2>/dev/null

All this does is filter out some of the stuff I don’t want to hear. (Yes the grep is “superfluous”, deal with it :) But the part which I found interesting enough to blog about was the extra steps in egrep and awk. By default, they do not use line buffers on output, because that could harm performance in normal use. But for what I’m doing, I need every line to be buffered to trigger the next command in the pipeline.

In egrep, this is done simply using --line-buffered. In awk, I do it by adding an fflush() after the print command. I actually had to spend some time looking into this, so I thought I’d share my results.

As a simple example, try the following:

	tail -f /etc/hosts | while read line; do echo $line; done

Does what you’d expect. Now insert awk into the pipeline to print only the hostname:

	tail -f /etc/hosts | awk '{ print $2 }' | while read line; do echo $line; done

There’s no output. That’s because awk is buffering and hasn’t yet flushed its stdout. So the while loop has no output. You can fix this by forcing it to fflush(), as in:

	tail -f /etc/hosts | awk '{ print $2; fflush(); }' | while read line; do echo $line; done

Now there is output.
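
As an aside, coreutils’ stdbuf can impose line buffering from outside, without editing the command itself. It only helps with tools that use plain stdio buffering – mawk, for one, manages its own output buffer and ignores it – so the fflush() trick remains the portable fix for awk. A sketch with cut, which does honor it:

```shell
# same pipeline idea, but forcing line-buffered output from outside
# the command with stdbuf instead of modifying the command
printf '127.0.0.1 localhost\n' | stdbuf -oL cut -d' ' -f2 | while read line; do echo "$line"; done
```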

So there you go. If you want to use something like irssi (which I also have a script for, to listen to hilights window output), you just need to do some tweaking to the grep and awk commands which drop the cruft.

Have fun!


Creating and using containers – without privilege

Today I posted a (working but mainly POC) patchset against lxc which allows me to create and start ubuntu-cloud containers – completely as an unprivileged user. For more details see the introductory email to the patchset at http://sourceforge.net/mailarchive/forum.php?thread_name=1374246151-7069-9-git-send-email-serge.hallyn%40ubuntu.com&forum_name=lxc-devel

Glossing over prerequisites (which you can see in the email), the actual commands I used were:

lxc-create -t /home/serge/lxc-ubuntu-cloud -P /home/serge/lxcbase -n x3 -f default.conf -- -T precise.tar.gz
lxc-start -P /home/serge/lxcbase -n x3

There’s more work to be done:

  • unprivileged containers cannot (yet) be networked
  • something needs to set up per-user cgroups at boot or login
  • something needs to create a per-user lockdir under /run
  • user namespaces need to be enabled in the default kernels
  • template handling of caching and locking needs to be made saner with respect to configurable lxcpaths

These are pretty minor, though, compared to what we’ve already achieved:

  • User namespace support (minus XFS support) is in the kernel – thanks to a heroic effort by Eric Biederman (and sponsored by Canonical).
  • The work needed to enable subuids – also written by Eric – was accepted into our shadow package in saucy.
  • The basic patchset to enable use of user namespaces by privileged users has been in lxc for some time now.

Background on user namespaces:

When you create a new user namespace, it is initially unmapped. Your task has uid and gid -1. You can then map ranges of userids from the parent namespace onto userids in the new namespace. For instance, if you are userid 1000, then you can map uid 1000 in the parent to uid 0 in the namespace. From the kernel’s point of view, you can only map uids which you have privilege over – either by being that uid, or by having CAP_SYS_ADMIN in the parent.
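
The mapping arithmetic itself is simple range translation. As a toy illustration (map_uid here is a made-up helper, not part of any real tool), this is how one uid map line – namespace-start, host-start, count, the order used by /proc/PID/uid_map – translates a host uid:

```shell
# toy helper: translate a host uid into a namespace uid given one
# mapping line "ns_start host_start count"; "unmapped" if out of range
map_uid() {
    local host=$1 ns_start=$2 host_start=$3 count=$4
    if [ "$host" -ge "$host_start" ] && [ "$host" -lt $((host_start + count)) ]; then
        echo $((ns_start + host - host_start))
    else
        echo unmapped
    fi
}
map_uid 100000 0 100000 65536   # host uid 100000 is root (0) in the namespace
map_uid 2000   0 100000 65536   # host uid 2000 falls outside the range: unmapped
```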

This is where subuids come in. /etc/subuid and /etc/subgid list the ranges of uids and gids which each user is allowed to map. newuidmap and newgidmap are setuid-root programs which respect those files, allowing unprivileged users to map their allotted subuids.

Lxc uses these programs (indirectly through the ‘usernsexec’ program) to allow unprivileged users to map their allotted subuids to containers.

Of note is that regular DAC and MAC remain unchanged. Therefore although I as user serge/uid 1000 may have 100000-199999 as my subuids, I do not own files owned by those subuids! To work around this, map the uids together into a namespace. For instance, if you are uid 1000 and want to create a file owned by uid 100000, you can

touch foo
usernsexec -m b:0:100000:1 -m b:1000:1000 -- /bin/chown 0 foo

This maps 100000 on the host to root in the new namespace, and 1000 on the host to 1000 in the namespace. So host uid 100000 actually has privilege over the namespace, including over host uid 1000. It is therefore allowed to chown a file owned by uid 1000 to uid 0 (which is host uid 100000). You end up with foo owned by uid 100000. You can do the same sort of games to clean up containers (and lxc-destroy will do so).


RSS

When I saw the news that Google reader was going away, my first thought, like a lot of other people, was “woohoo, it’s going to be fun writing a new way to follow RSS.” This past weekend I looked at the calendar and, like most people, shouted “oh no, July 1 is just about here and I’ve done nothing.” I originally figured I would write a script to push links to pocket, which formats very nicely on the nook and in browsers. But then I ran into rss2email. I may hack it to send just one pocket-able url (though writing from scratch may be more fun) but for now, unchanged and sending full posts in email which I can read anywhere, it’s keeping me quite happy.


2013 Linux Security Summit CFP closing soon

Just a short reminder that if you were interested in submitting a talk for the linux security summit, the call for participation (at http://kernsec.org/wiki/index.php/Linux_Security_Summit_2013) will be closing tomorrow, Friday Jun 14.

The summit will be held September 19-20 in New Orleans, co-located with LinuxCon. Hope to see you there!


Introducing lxc-snap

lxc-snap: lxc container snapshot management tool

BACKGROUND

Lxc supports containers backed by overlayfs snapshots. The way this is
typically done is to create a container backed by a regular directory,
then create a new container which mounts the first container’s rootfs
as a read-only lower mount, with a new private delta directory as
its read-write upper mount. For instance, you could

sudo lxc-create -t ubuntu -n r0 # create a normal directory
sudo lxc-clone -B overlayfs -s r0 o1 # create overlayfs clone

The second container, o1, when started up will mount /var/lib/lxc/o1/delta0
as a writeable overlay on top of /var/lib/lxc/r0/rootfs, and use that as its
root filesystem.

From here you can clone o1 to a new container o2. This simply copies the
overlayfs delta from o1 to o2, and is done with:

sudo lxc-clone -s o1 o2

LXC-SNAP

One of the obvious use cases of these snapshot clones is to support
incremental development of rootfs images. Make some changes, snapshot,
make some more changes, snapshot, revert…

lxc-snap is a small program using the lxc API to more easily support
this use case. You begin with an overlayfs backed container, make some
changes, snapshot, make some more changes, snapshot… This is a simpler
model than manually using clone, because you continue developing the same
container while the snapshots are kept out of the way until you need them.

EXAMPLE

Create your first container

sudo lxc-create -t ubuntu -n base
sudo lxc-clone -s -B overlayfs base mysql

Now make initial customizations, and snapshot:

sudo lxc-snap mysql

This will create a snapshot container /var/lib/lxcsnaps/mysql_0. You can actually
start it up if you like using ‘sudo lxc-start -P /var/lib/lxcsnaps -n mysql_0’.
(However, that is not recommended, as it will cause changes in the rootfs.)

Next, make some more changes. Write a comment about the changes you made in this
version,

echo "Initial definition of table doomahicky" > /tmp/comment

sudo lxc-snap -c /tmp/comment mysql

Do this a few times. Now you realize you lost something you needed. You can
see the list of containers which have snapshots using

lxc-snap -l

and the list of versions of container mysql using

lxc-snap -l mysql

Note that it shows you the time when the snapshot was created, and any comments
you logged with the snapshot. You see that what you wanted was version 2, so
recover that snapshot. You can destroy container mysql and restore version 2
to it, or (I would recommend) use a different name to restore the snapshot to.

Use a different name with:

sudo lxc-snap -r mysql_2 mysql_tmp

or destroy mysql and restore the snapshot to it using

sudo lxc-destroy -n mysql
sudo lxc-snap -r mysql_2 mysql

When you’d like to export a container, you can clone it back to a directory
backed container and tar it up:

sudo lxc-clone -B dir mysql mysql_ship
sudo tar zcf /srv/mysql_ship.tar.gz /var/lib/lxc/mysql_ship

BUILD AND INSTALL

To use lxc-snap, you currently need to be using lxc from the ubuntu-lxc
daily ppa. On an ubuntu system (at least 12.04) you can

sudo add-apt-repository ppa:ubuntu-lxc/daily
sudo apt-get update
sudo apt-get dist-upgrade
sudo apt-get install lxc

lxc-snap will either become a part of the lxc package, or will become a
separate package. Currently it is available at
git://github.com/hallyn/lxc-snap. Fetch it using:

git clone git://github.com/hallyn/lxc-snap

Then build lxc-snap by typing ‘make’.

cd lxc-snap
make

Install into /usr/bin by typing

sudo DESTDIR=/usr make install

or install into /home/$USER/bin by typing

mkdir -p /home/$USER/bin
DESTDIR=/home/$USER make install

Note that lxc-snap is in very early development. Its usage may
change over time, and as it currently ships a copy of the liblxc .h
files it needs, it may occasionally break and need to be updated
from git and rebuilt. Using a package (as soon as one becomes
available) is recommended.


LXC – improved clone support

Recently I took some time to work on implementing container clones through the lxc API. lxc-clone previously existed as a shell script which could create snapshot clones of lvm and btrfs containers. There were several shortcomings to this:

1. clone was not exportable through the API (to be used in python, lua, go and c programs). Now it is, so a Go program can create a container clone in one function call.
2. expanding the set of supported clone types became unsavory
3. overlayfs was only supported as ‘ephemeral containers’, which could be made persistent through the use of pre-mount hooks. They were not first class citizens. Now they are.

The result is now in upstream git as well as in the packages at the ubuntu-lxc/daily ppa. Supported backing store types currently include dir (directory), lvm, btrfs, overlayfs, and zfs. Hopefully loop and qemu-nbd will be added soon. They each are somewhat different due to the nature of the backing store itself, so I’ll go over each. However in my opinion the coolest thing you can do with this is:

# create a stock directory backed container
sudo lxc-create -t ubuntu -n dir1
# create an overlayfs snapshot of it
sudo lxc-clone -s -B overlayfs dir1 s1

The -s argument asks for a snapshot (rather than copy) clone, and -B specifies the backing store type for the new container. When container s1 starts, it will mount a private writeable overlay (/var/lib/lxc/dir1/delta0) over a readonly mount of the original /var/lib/lxc/dir1/rootfs.

Now make some changes to start customizing s1. Checkpoint that state by cloning it:

sudo lxc-clone -s s1 s2

This will reference the same rootfs (/var/lib/lxc/dir1/rootfs) and rsync the overlayfs delta from s1 to s2. Now you can keep working on s1, keeping s2 as a checkpoint. Make more changes, and create your next snapshot

sudo lxc-clone -s s1 s3

sudo lxc-clone -s s1 s4

If at some point you realize you need to go back to an older snapshot, say s3, then you can

sudo lxc-clone -s s1 s1_bad # just to make sure
sudo lxc-destroy -n s1
sudo lxc-clone -s s3 s1

and pick up where you left off. Finally, if you’re happy and want to tar up what you have to ship it or copy to another machine, clone it back to a directory backed container:

sudo lxc-clone -B dir s1 dir_ship
sudo tar zcf /var/lib/lxc/dir_ship.tgz /var/lib/lxc/dir_ship

So far I’ve shown dir (directory) backing store and overlayfs. Specific to directory backed containers is that they cannot be snapshotted, except by converting them to overlayfs backed containers. Specific to overlayfs containers is that the original directory backed container must not be deleted, since the snapshot depends on it. (I’ll address this soon, marking the snapshotted container so that lxc-destroy will leave it alone, but that is not yet done)
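
In the meantime you can at least see which base a snapshot depends on by inspecting its config: the rootfs entry of an overlayfs clone names the read-only lower directory, then the writeable delta. A quick sketch (the exact lxc.rootfs line format is an assumption based on the containers above, and the config is faked into a temp file purely for illustration):

```shell
# fake up a config resembling an overlayfs clone's (line format assumed),
# then extract the read-only base rootfs the snapshot depends on
conf=$(mktemp)
echo 'lxc.rootfs = overlayfs:/var/lib/lxc/dir1/rootfs:/var/lib/lxc/s1/delta0' > "$conf"
base=$(sed -n 's/^lxc.rootfs *= *overlayfs:\([^:]*\):.*/\1/p' "$conf")
echo "$base"   # /var/lib/lxc/dir1/rootfs
rm -f "$conf"
```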

To use btrfs containers, the entire lxc configuration path must be btrfs. However since the configuration path is flexible, that’s not as bad as it used to be. For instance, I mounted a btrfs at $HOME/lxcbase, then did

sudo lxc-create -t ubuntu -P $HOME/lxcbase -n b1

(The ‘-P’ argument chooses a custom ‘lxcpath’, or lxc configuration path, other than the default /var/lib/lxc. You can also specify a global default other than /var/lib/lxc in /etc/lxc/lxc.conf.) lxc-create detects the btrfs and automatically makes the container a new subvolume, which can then be snapshotted:

sudo lxc-clone -s b1 b2

For zfs, a zfsroot can be specified in /etc/lxc/lxc.conf. I created a zfs pool called ‘lxc’ (which is actually the default for the lxc tools, so I did not list it in /etc/lxc/lxc.conf), then did

sudo lxc-create -B zfs -t ubuntu -n z1
or
sudo lxc-clone -B zfs dir1 z1

This created ‘lxc/z1’ as a new zfs fs and mounted it under /var/lib/lxc/z1/rootfs. Next I could

sudo lxc-clone -s z1 z2

Now lxc-destroy still needs some smarts built in to make zfs backed containers easier to destroy. That is because when lxc-clone creates z2 from z1, it must first create a snapshot ‘lxc/z1@z2’, then clone that to ‘lxc/z2’. So before you can destroy z1, you currently must

sudo lxc-destroy -n z2
sudo zfs destroy lxc/z1@z2

Finally, you can also use LVM. LVM snapshot container clones have been supported longer than any others (with btrfs being second). I like the fact that you can use any filesystem inside the LV. However, the two major shortcomings are that you cannot snapshot a snapshot, and that you must (depending at least on the filesystem type) choose a filesystem size in advance.

To clone LVM containers, you either need a vg called ‘lxc’, or you can specify a default vg in /etc/lxc/lxc.conf. You can create the initial lvm container with

sudo lxc-create -t ubuntu -n lvm1 --fssize 2G --fstype xfs
or
sudo lxc-clone -B lvm dir1 lvm1

Then snapshot it using

sudo lxc-clone -s lvm1 lvm2

Note that unlike overlayfs, snapshots in zfs, btrfs, and lvm are safe from having the base container destroyed. In btrfs, that is because the btrfs snapshot is metadata based, so destroying the base container simply does not delete any of the data in use by the snapshot container. LVM and zfs both will note that there are active snapshots of the base rootfs and prevent the base container from being destroyed.
