2013 Linux Security Summit CFP closing soon

Just a short reminder: if you are interested in submitting a talk for the Linux Security Summit, the call for participation (at http://kernsec.org/wiki/index.php/Linux_Security_Summit_2013) closes tomorrow, Friday, June 14.

The summit will be held September 19-20 in New Orleans, co-located with LinuxCon. Hope to see you there!


Introducing lxc-snap

lxc-snap: lxc container snapshot management tool

BACKGROUND

Lxc supports containers backed by overlayfs snapshots. The way this is
typically done is to create a container backed by a regular directory,
then create a new container which mounts the first container’s rootfs
as a read-only lower mount, with a new private delta directory as
its read-write upper mount. For instance, you could

sudo lxc-create -t ubuntu -n r0 # create a normal directory
sudo lxc-clone -B overlayfs -s r0 o1 # create overlayfs clone

The second container, o1, when started will mount /var/lib/lxc/o1/delta0
as a writable overlay on top of /var/lib/lxc/r0/rootfs, and use that as its
root filesystem.

From here you can clone o1 to a new container o2. This simply copies the
overlayfs delta from o1 to o2, and is done with

sudo lxc-clone -s o1 o2

LXC-SNAP

One of the obvious use cases of these snapshot clones is to support
incremental development of rootfs images. Make some changes, snapshot,
make some more changes, snapshot, revert…

lxc-snap is a small program using the lxc API to more easily support
this use case. You begin with an overlayfs-backed container, make some
changes, snapshot, make some more changes, snapshot… This is a simpler
model than manually using clone because you keep developing the same
container, o1, while the snapshots are kept out of the way until you need them.

EXAMPLE

Create your first container

sudo lxc-create -t ubuntu -n base
sudo lxc-clone -s -B overlayfs base mysql

Now make initial customizations, and snapshot:

sudo lxc-snap mysql

This will create a snapshot container /var/lib/lxcsnaps/mysql_0. You can actually
start it if you like, using 'sudo lxc-start -P /var/lib/lxcsnaps -n mysql_0'.
(However, that is not recommended, as it will cause changes in the rootfs.)

Next, make some more changes. Write a comment about the changes you made in this
version,

echo "Initial definition of table doomahicky" > /tmp/comment

sudo lxc-snap -c /tmp/comment mysql

Do this a few times. Now you realize you lost something you needed. You can
see the list of containers which have snapshots using

lxc-snap -l

and the list of versions of container mysql using

lxc-snap -l mysql

Note that it shows you the time when each snapshot was created, along with any
comment you logged with it. You see that what you wanted was version 2, so
recover that snapshot. You can destroy container mysql and restore version 2
to it, or (as I would recommend) restore the snapshot to a different name.

Use a different name with:

sudo lxc-snap -r mysql_2 mysql_tmp

or destroy mysql and restore the snapshot to it using

sudo lxc-destroy -n mysql
sudo lxc-snap -r mysql_2 mysql

When you’d like to export a container, you can clone it back to a directory
backed container and tar it up:

sudo lxc-clone -B dir mysql mysql_ship
sudo tar zcf /srv/mysql_ship.tar.gz /var/lib/lxc/mysql_ship
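
To import that archive on another machine, a minimal sketch (assuming the archive was created as above and the destination uses the default /var/lib/lxc path) would be:

# copy mysql_ship.tar.gz over, then on the destination host:
sudo tar zxf mysql_ship.tar.gz -C /
sudo lxc-start -n mysql_ship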

BUILD AND INSTALL

To use lxc-snap, you currently need to be using lxc from the ubuntu-lxc
daily ppa. On an ubuntu system (at least 12.04) you can

sudo add-apt-repository ppa:ubuntu-lxc/daily
sudo apt-get update
sudo apt-get dist-upgrade
sudo apt-get install lxc

lxc-snap will either become a part of the lxc package, or will become a
separate package. Currently it is available at
git://github.com/hallyn/lxc-snap. Fetch it using:

git clone git://github.com/hallyn/lxc-snap

Then build lxc-snap by typing ‘make’.

cd lxc-snap
make

Install into /usr/bin by typing

sudo DESTDIR=/usr make install

or install into /home/$USER/bin by typing

mkdir /home/$USER/bin
DESTDIR=/home/$USER make install
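
If you install into your home directory, make sure $HOME/bin is on your PATH (for instance by adding this to your shell profile):

export PATH="$HOME/bin:$PATH"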

Note that lxc-snap is in very early development. Its usage may
change over time, and since it currently ships a copy of the liblxc .h
files it needs, it may occasionally break and need to be updated
from git and rebuilt. Using a package (as soon as one becomes
available) is recommended.


LXC – improved clone support

Recently I took some time to work on implementing container clones through the lxc API. lxc-clone previously existed as a shell script which could create snapshot clones of lvm and btrfs containers. There were several shortcomings to this:

1. clone was not exportable through the API (to be used in python, lua, go and c programs). Now it is, so a Go program can create a container clone in one function call.
2. expanding the set of supported clone types became unsavory
3. overlayfs was only supported as ‘ephemeral containers’, which could be made persistent through the use of pre-mount hooks. They were not first class citizens. Now they are.

The result is now in upstream git as well as in the packages at the ubuntu-lxc/daily ppa. Supported backing store types currently include dir (directory), lvm, btrfs, overlayfs, and zfs. Hopefully loop and qemu-nbd will be added soon. They each are somewhat different due to the nature of the backing store itself, so I’ll go over each. However in my opinion the coolest thing you can do with this is:

# create a stock directory backed container
sudo lxc-create -t ubuntu -n dir1
# create an overlayfs snapshot of it
sudo lxc-clone -s -B overlayfs dir1 s1

The -s argument asks for a snapshot (rather than copy) clone, and -B specifies the backing store type for the new container. When container s1 starts, it will mount a private writeable overlay (/var/lib/lxc/s1/delta0) over a readonly mount of the original /var/lib/lxc/dir1/rootfs.
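
A quick way to convince yourself of what is happening (purely for illustration) is to start s1 and watch the copy-on-write data appear in its delta directory:

sudo lxc-start -n s1 -d
# ... make some changes inside the container ...
sudo ls /var/lib/lxc/s1/delta0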

Now make some changes to start customizing s1. Checkpoint that state by cloning it:

sudo lxc-clone -s s1 s2

This will reference the same rootfs (/var/lib/lxc/dir1/rootfs) and rsync the overlayfs delta from s1 to s2. Now you can keep working on s1, keeping s2 as a checkpoint. Make more changes, and create your next snapshot

sudo lxc-clone -s s1 s3

sudo lxc-clone -s s1 s4

If at some point you realize you need to go back to an older snapshot, say s3, then you can

sudo lxc-clone -s s1 s1_bad # just to make sure
sudo lxc-destroy -n s1
sudo lxc-clone -s s3 s1

and pick up where you left off. Finally, if you’re happy and want to tar up what you have to ship it or copy to another machine, clone it back to a directory backed container:

sudo lxc-clone -B dir s1 dir_ship
sudo tar zcf /var/lib/lxc/dir_ship.tgz /var/lib/lxc/dir_ship

So far I’ve shown the dir (directory) backing store and overlayfs. Specific to directory backed containers is that they cannot be snapshotted, except by converting them to overlayfs backed containers. Specific to overlayfs containers is that the original directory backed container must not be deleted, since the snapshot depends on it. (I’ll address this soon by marking the snapshotted container so that lxc-destroy will leave it alone, but that is not yet done.)

To use btrfs containers, the entire lxc configuration path must be btrfs. However since the configuration path is flexible, that’s not as bad as it used to be. For instance, I mounted a btrfs at $HOME/lxcbase, then did

sudo lxc-create -t ubuntu -P $HOME/lxcbase -n b1

(The '-P' argument chooses a custom 'lxcpath', or lxc configuration path, other than the default /var/lib/lxc. You can also specify a global default other than /var/lib/lxc in /etc/lxc/lxc.conf.) lxc-create detects the btrfs and automatically makes the container a new subvolume, which can then be snapshotted:

sudo lxc-clone -s b1 b2
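
For reference, preparing such a btrfs lxcpath from scratch might look like this (the device name is just a placeholder for whatever disk or partition you have spare):

sudo mkfs.btrfs /dev/sdb1     # placeholder device
mkdir -p $HOME/lxcbase
sudo mount /dev/sdb1 $HOME/lxcbase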

For zfs, a zfsroot can be specified in /etc/lxc/lxc.conf. I created a zfs pool called ‘lxc’ (which is actually the default for the lxc tools, so I did not list it in /etc/lxc/lxc.conf), then did

sudo lxc-create -B zfs -t ubuntu -n z1
or
sudo lxc-clone -B zfs dir1 z1

This created 'lxc/z1' as a new zfs fs and mounted it under /var/lib/lxc/z1/rootfs. Next I could

sudo lxc-clone -s z1 z2

lxc-destroy still needs some smarts built in to make zfs backed containers easier to destroy. That is because when lxc-clone creates z2 from z1, it must first create a snapshot 'lxc/z1@z2', then clone that to 'lxc/z2'. So before you can destroy z1, you currently must

sudo lxc-destroy -n z2
sudo zfs destroy lxc/z1@z2
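
If you don't already have a zfs pool, creating one named 'lxc' (the tools' default, as mentioned above) is a one-liner; the device name here is a placeholder:

sudo zpool create lxc /dev/sdc     # placeholder device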

Finally, you can also use LVM. LVM snapshot container clones have been supported longer than any others (with btrfs being second). I like the fact that you can use any filesystem inside the LV. However, the two major shortcomings are that you cannot snapshot a snapshot, and that you must (depending at least on the filesystem type) choose a filesystem size in advance.

To clone LVM containers, you either need a vg called 'lxc', or you can specify a default vg in /etc/lxc/lxc.conf. You can create the initial lvm container with

sudo lxc-create -t ubuntu -n lvm1 --fssize 2G --fstype xfs
or
sudo lxc-clone -B lvm dir1 lvm1

Then snapshot it using

sudo lxc-clone -s lvm1 lvm2
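
If your system does not yet have a volume group named 'lxc', a minimal sketch of creating one (again with a placeholder device) is:

sudo pvcreate /dev/sdd     # placeholder device
sudo vgcreate lxc /dev/sdd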

Note that unlike overlayfs, snapshots in zfs, btrfs, and lvm are safe from having the base container destroyed. In btrfs, that is because the btrfs snapshot is metadata based, so destroying the base container simply does not delete any of the data in use by the snapshot container. LVM and zfs both will note that there are active snapshots of the base rootfs and prevent the base container from being destroyed.


gtd – managing projects

I learned about GTD 5 or 8 years ago, and started trying to use it almost immediately. Ever since then I keep all of my information in one gtd folder, with Projects and Reference folders, a nextactions file, etc. I’ve blogged before about my tickler file, which frankly rocks and never lets me down.

However, a few months ago I decided I wasn’t happy with my nextactions file. Sitting down for a bit to think about it, it was clear that the following happens: some new project comes in. I only have time to jot a quick note, so I do so in nextactions. Later, another piece of information comes in, so I add it there. Over time, my nextactions file grows and is no longer really a list of next actions.

I briefly tried simply not using the Projects/ directory, and keeping an indented/formatted structure in the nextactions file. But that does not work out – I spend most of my time either gazing at too much information, or ignoring parts which I hadn’t been working on recently. (I also briefly tried ETM and bug, which are both *very* neat, but they similarly didn’t work for me for GTD.)

I have a Projects directory, so why am I not using it? Doing so takes several steps (think of a name, make the directory, open a file in it, make the notes, exit), and after that I don’t have a good system for managing the project files. Looking at a project again involves several steps – cd into gtd/Projects, look around, cd into a project, look again. Clearly, project files needed better tools.

So I wrote up a simple ‘project’ script, with a corresponding bash_completion file. If info comes in for a trip I have to take in a few months, I can simply

	project start trip-sandiego-201303

or

	p s trip-sandiego-201303

This creates the project directory and opens vim with three buffers, one for each of the three files – a summary, actions, and log. ('project new' will create the project without pulling up vim.) Later, I can

	project list

or (for short)

	p l

to list all open projects,

	p e tr<tab>

to edit the project – which again opens the same files, or

	p cat tr<tab>

to cat the files to stdout. I’ve added a ‘Postponed’ directory for projects which are on hold, so I can

	project postpone trip-sandiego-201303

or just

	p po tr<tab>

to temporarily move the project folder into Postponed, or

	p complete tr<tab>

to move the project folder into the Completed/ directory.
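
For the curious, the 'start' subcommand boils down to something like the following sketch (the gtd location, file names, and editor invocation here are my guesses; the real script is in the branch below):

	# create the project directory and open the three files as vim buffers
	project_start() {
		local dir="$HOME/gtd/Projects/$1"
		mkdir -p "$dir"
		( cd "$dir" && vim summary actions log )
	}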

I’ve been using this for a few months now, and am very happy with the result. The script and completion file are in lp:~serge-hallyn/+junk/gtdproject. It’s really not much, but so useful!


Qemu updates in raring

The raring feature freeze took effect last week. What’s been happening with qemu in the meantime?

A lot! I’ll touch on the following main changes in this post: package reorg, spice support, hugepages, uefi, and rbd support.

* package reorg

Perhaps best to begin with a bit of Ubuntu qemu packaging history. In hardy (before my time) Ubuntu shipped with separate qemu and kvm packages. This reflected the separate upstream qemu and kvm source trees. In August of 2009, upstream was already talking about merging the two trees, and Dustin Kirkland started a new qemu-kvm Ubuntu package which provided both qemu and kvm.

In 2010, a new ‘qemu-linaro’ source package was created in universe, to provide qemu with more bleeding-edge arm support from linaro. Eventually the qemu-kvm package provided the i386 and amd64 qemu-system binaries, qemu-common, and qemu-utils. All other target architecture system binaries, plus all qemu-user binaries, plus qemu-kvm-spice, came from qemu-linaro. This is clearly non-ideal from many viewpoints, not least QA testing and bug duplication. But any reorganization would have to make sure that upgrades work seamlessly for raring-to-raring, quantal-to-raring, and future LTS-to-LTS upgrades, for the many commonly used packages (qemu-kvm, the qemu packages for the various architectures, and qemu-user).

In the traditional 6-month-plus-LTS Ubuntu cycle, raring was a good time (not too close to the next LTS) to try to straighten that out. It was also a good time in that upstream qemu and kvm were now very close together, and especially in that the wonderfully helpful Debian qemu team was also starting to merge Debian’s qemu and qemu-kvm sources into a new qemu source tree in Debian experimental.

And so, it’s done! The qemu-linaro and qemu-kvm source packages have been merged into qemu. Most arm patches from linaro are in our package, but you can still run linaro’s qemu from ppa at https://launchpad.net/~linaro-maintainers/+archive/tools/. The Ubuntu and Debian teams are working together, which should mean more stable packages in both, and combined resources in addressing bugs. Thanks especially to Michael Tokarev for helping to review the Ubuntu delta, and to infinity for more than once helping to figure out packaging issues I couldn’t have figured out on my own.

* Spice support. Spice has finally made it into main! The qemu package in main therefore finally supports spice, without having to install a separate qemu-kvm-spice package. As a simple example, if you used to do:

kvm -vga vmware -vnc :1

then you can use spice by doing:

kvm -vga qxl -spice port=5900,disable-ticketing

then connect with spicec or spicy:

spicec -h hostname -p 5900
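
Or, if you prefer spicy, the equivalent connection (same host and port) would be:

spicy -h hostname -p 5900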

* Transparent hugepages. The 1.4.0 qemu release includes support for transparent hugepages. This means that when hugepages are available, qemu instances migrate some memory pages from regular to huge pages. Hugepages offer performance improvements due to (1) requiring fewer TLB entries for the same amount of memory, (2) requiring fewer lookups per page, and (3) requiring fewer page faults for nearby memory references (since each memory page is much larger).
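
To check whether transparent hugepages are enabled on the host, and how much anonymous memory is currently backed by them, you can look at:

cat /sys/kernel/mm/transparent_hugepage/enabled
grep AnonHugePages /proc/meminfo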

* Hugetlbfs mount. While transparent hugepages are convenient, if you want a particular VM to run with hugepages backing all of its memory, you will want to use dedicated hugepages. To do this, simply set KVM_HUGEPAGES to 1 in /etc/init/qemu-kvm.conf, then add an entry to /etc/sysctl.conf like:

vm.nr_hugepages = 512

(for 1G of hugepages – 512 2M pages). Make sure to leave at least around 1G of memory not dedicated to hugepages. Then add the arguments

-mem-path /run/hugepages/kvm

to your kvm command. Dedicated hugepages are not new, but the automatic mounting of /run/hugepages/kvm is.
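
You can verify that the pages were reserved, and watch them being consumed while the VM runs, with:

grep Huge /proc/meminfo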

* UEFI. If you install the ovmf package, then you can run qemu with a UEFI BIOS (to test secureboot, for instance) by adding the '-bios OVMF.fd' argument to kvm. As was pointed out during vUDS, there are some bugs to work out to make this seamless.
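
Putting that together, a minimal invocation might look like the following (on Ubuntu the ovmf package installs the firmware under /usr/share/ovmf/, but check your package if the path differs):

kvm -m 1024 -bios /usr/share/ovmf/OVMF.fd -cdrom raring.iso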

* rbd. OK, this has been enabled since precise, but it’s still cool. You can use a ceph cluster to back your kvm instances (as an alternative to, say, nfs) to easily enable live migration. Just

qemu-img create -f rbd rbd:pool/vm1 10G
kvm -m 512 -drive format=rbd,file=rbd:pool/vm1 -cdrom raring.iso -boot d
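
Since the disk image lives in the ceph cluster rather than on local storage, a live migration then only has to move the guest's memory. A rough sketch (hostnames and ports here are made up, and both hosts need access to the cluster):

# on the destination host, start a listening instance with the same rbd drive
kvm -m 512 -drive format=rbd,file=rbd:pool/vm1 -incoming tcp:0:4444
# then, in the source instance's qemu monitor:
migrate -d tcp:desthost:4444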

See http://ceph.com/docs/master/rbd/qemu-rbd/ for more information.

So that’s what I can think of that is new in qemu this cycle. I hope you all enjoy it, and if you find upgrade issues please do raise a bug.


Experimenting with user namespaces

User namespaces are a really neat feature, but there are some subtleties involved which can make them perplexing to first play with. Here I’m going to show a few things you can do with them, with an eye to explaining some of the things which might otherwise be confusing.

First, you’ll need a bleeding edge kernel. A 3.9 kernel hand-compiled with user namespace support should be fine (some of the latest missing patches aren’t needed for these games as we won’t be creating full system containers). But for simplicity, you can simply fire up a new raring box and do:

sudo add-apt-repository ppa:ubuntu-lxc/kernel
sudo apt-get update
sudo apt-get dist-upgrade

Now get a few tools from my ppa – you can of course get the source for all from either the ppa, or from my bzr trees.

sudo add-apt-repository ppa:serge-hallyn/user-natty
sudo apt-get update
sudo apt-get dist-upgrade
sudo apt-get install nsexec uidmap

Now let’s try a first experiment. Run the following program from nsexec:

usernsselfmap

This is a simple program which forks a child which runs as root in a new user namespace. Here a brief reminder of how user namespaces are designed is in order. When a new user namespace is created, the task populating it starts as userid -1, nobody. At this point it has full privileges (POSIX capabilities), but those capabilities can only be used toward resources owned by the new namespace. Furthermore, the privileges will be lost as soon as the task runs exec(3) of a normal file. See the capabilities(7) manpage for an explanation.

At this point, userids from the parent namespace may be mapped into the child. For instance, one might map userids 0-9999 in the child to userids 100000-109999 on the host. This is done by writing values to /proc/pid/uid_map (and analogously to /proc/pid/gid_map). The task writing to the map files must have privilege over the parent uids being mapped in.
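Conceptually, each line written to the map file is 'id-inside-ns id-outside-ns count', so mapping root in the child namespace to your own uid looks something like this (1000 is a stand-in for your uid, and $child_pid for the namespaced task):

echo "0 1000 1" > /proc/$child_pid/uid_map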

This is where usernsselfmap comes in. You currently do not have privilege over userids on the host – except your own. usernsselfmap simply maps uid 0 in the container to your own userid on the host. Then it changes to gid and uid 0, and finally executes a shell.

Now look around this shell

ifconfig
ifconfig eth0 down

Note that even though you have CAP_SYS_ADMIN, you cannot change the host’s network settings. However, you can now unshare a new network namespace (still without having privilege on the host) and create network devices in that namespace

nsexec -cmn /bin/bash
ifconfig
ip link add type veth
ifconfig veth0 10.0.0.1 up
ifconfig -a

Note also that you can’t read under /root. But you can unshare a new mount namespace and bind-mount your $HOME onto /root:

ls /root
# permission denied
nsexec -m /bin/bash
mount --bind $HOME /root
ls /root
# homedir contents

Now, in addition to the kernel implementation of user namespaces, Eric Biederman has also provided a patchset against shadow to add a concept of subuids and subgids. Briefly, you can modify login.defs to say that every new user should be allocated 10000 (unique) uids and gids above 100000. Then when you add a new user, it will automatically receive a set of 10000 unique subuids. These allocations are stored in /etc/subuid and /etc/subgid, and two new setuid-root binaries, newuidmap and newgidmap (which are shipped in the uidmap binary package, generated from the shadow source package) may be used by an unprivileged user to map userids in a child user namespace to his allocated subuids on the host.
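
The allocations themselves are plain text, one line per user. An entry granting user 'ubuntu' the 10000 subuids starting at 100000 would look like this in /etc/subuid (and analogously in /etc/subgid):

ubuntu:100000:10000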

To conclude this post, here is an example of using the new shadow package along with nsexec to manually create a user namespace with more than one userid. First, use usermod to allocate some subuids and subgids for your user (who I’ll assume is user ‘ubuntu’ on an ec2 host) since it likely was created before subuids were configured:

sudo usermod ubuntu -v 110000-120000 -W 110000-120000

Now open two terminals as user ubuntu (or a split byobu screen). In the one, run

nsexec -UW -s 0 -S 0 /bin/bash
about to unshare with 10000000
Press any key to exec (I am 5358)

You’ve asked nsexec to unshare its user namespace (-U), to wait for a keypress before executing /bin/bash (-W), and to switch to userid 0 (-s 0) and groupid 0 (-S 0) before starting that shell. In this example nsexec tells you it is process id 5358, so that you can map userids to it. So from the other shell do:

newuidmap 5358 0 110000 10000
newgidmap 5358 0 110000 10000

Now hit return in the nsexec window, and you will see something like:

root@server:~#

Now you can play around as above, but unlike above, you can also switch to userids other than root.

root@server:~# newuidshell 1001
ubuntu2@server:~/nsexec

But since we’ve not set up a proper container (or chroot), and since our userid maps to 111001, which is not 1001, we can’t actually write to ubuntu2’s files or read any files which are not world readable.

This then will be the basis of ongoing and upcoming work to facilitate unprivileged users creating and using containers. Exciting!

(One note: I am here using an old toy ‘nsexec’ for manipulating namespaces. This will eventually be deprecated in favor of the new programs in upstream util-linux. However there has not yet been a release of util-linux with those patches, so they are not yet in the ubuntu package.)

The source tree for the modified shadow package is at lp:~serge-hallyn/ubuntu/raring/shadow/shadow-userns and source for utilities in the nsexec package is at lp:~serge-hallyn/+junk/nsexec.


User Namespaces LXC meeting

Last week we held an irc meeting to talk about user namespaces as they relate to lxc containers. The IRC log is posted at https://wiki.ubuntu.com/LxcUsernsIrcChat .

I had two goals for this meeting. The first was to make sure that lxc developers were familiar with user namespaces, so that as new patches started rolling in to accommodate user namespaces, more people might be inclined to review them – and spot my silly errors. The other was to discuss some design issues in the lxc code.

I began with some background on user namespaces, their design, motivation, and current status, topped off by a little demo on ec2. Then we proceeded to discuss future needed changes.

There are two terrific advantages to using user namespaces.

The first is that host resources are not subject to privilege in the container. That is, root in the container is not root on the host, and a fully privileged task in a container cannot exert any privilege over any resources which are not owned by the container. This advantage is fully realized right now when using lxc with a custom kernel, as per http://s3hh.wordpress.com/2012/10/31/full-ubuntu-container-confined-in-a-user-namespace. By the time raring is released, I hope for the stock raring lxc, with a custom kernel from ppa:ubuntu-lxc/kernels, to be usable in place of my personal ppa.

The second advantage of user namespaces is that they will allow unprivileged users to create and use containers. There are a few small things which will still require privilege – like mapping userids into the container, and hooking the container’s network interface somehow into the host. Each of those can be enabled by small privileged helpers and configured in advance (and generically). So, by 14.04 LTS, an unprivileged user should be able to safely, out of the box, do

lxc-create -t ubuntu -n r1
lxc-start -n r1

This should also be tremendously helpful for safer use of juju with the local provider.

The steps needed (or, at least, hopefully most of them) to get to that point are discussed in the meeting log above.
