User namespaces – available to play!

Over the past few months, Eric Biederman has been working on completing the user namespace. Briefly, an unprivileged user can create a user namespace, where he can pretend to be root and start new namespaces (e.g. network and pid) which he will own. (Note that creating namespaces in child user namespaces isn’t yet allowed, but will be.) With respect to anything he owns – for instance new network interfaces which he creates in his own network namespace – he should have privilege. But he should not be able to escape his existing privileges in the parent user namespace. This should finally allow an unprivileged user to create a new filesystem tree and chroot into it, without the risk of maliciously confusing setuid applications on the host (for instance by bind mounting his own /etc/passwd).

Eric’s new design is based on a 1-1 uid mapping (by ranges) from uids in the container to uids on the host. For instance, uid 0 in the namespace may really be uid 999990 on the host. Users can be pre-allocated their own private ranges to use however they please. For instance, each user may get 10,000 uids, with the first user’s range starting at 100,000. The uid and gid mappings are exposed and manipulated through /proc/pid/uid_map and /proc/pid/gid_map, which contain:

namespace_first_uid host_first_uid number_of_uids

For instance, if it contains “0 100000 1000”, then uids 0 through 999 in the namespace will map to uids 100000 through 100999 on the host, respectively. To write to the uid map, you must be privileged in your namespace, and your namespace must have the source ids mapped. (The mappings can be nested in the obvious way.) In userspace, we expect to have a small setuid-root program which unprivileged users can call to map uids. That program will consult a root-owned file which lists the permitted mappings. Right now we are using /etc/id_permission/uids and /etc/id_permission/gids. If /etc/id_permission/uids has

1000:100000:9999
1001:110000:9999

then uid 1000 (user hallyn) will be allowed to map the uids 100000 through 109999, and 1001 (user jschmoe) will be allowed to map uids 110000 through 119999.
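As a rough sketch of how such a file could be consulted, the helper below looks up one user's entry and prints the first and last host uid it covers. The function name permitted_range is made up for illustration and is not part of any existing tool. It follows the reading given above, where the third field is added to the start of the range (so 1000:100000:9999 ends at 109999).

```shell
# permitted_range USER_UID FILE
# Hypothetical helper: print "first last" for each entry in FILE that
# belongs to USER_UID. Entries look like user_uid:first_host_uid:extent.
permitted_range() {
    while IFS=: read -r uid first extent; do
        if [ "$uid" = "$1" ]; then
            # Assumption taken from the example above: last = first + extent
            echo "$first $((first + extent))"
        fi
    done < "$2"
}
```

Calling permitted_range 1000 /etc/id_permission/uids would then print the range for user hallyn, and print nothing for a uid with no entry.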

Eric’s git tree is here. His patchset applied to the ubuntu quantal kernel tree is here, and the resulting kernel is built and available at ppa:serge-hallyn/userns-natty.

So you can try it out! Like so:

Start an amazon ec2 instance of precise. Find an ami to use (ami=`ubuntu-cloudimg-query precise`) and start it up (ec2-run-instances -k myid $ami). Log in and update /etc/apt/sources.list to look as follows:

deb http://us-east-1.ec2.archive.ubuntu.com/ubuntu/ quantal main universe
deb-src http://us-east-1.ec2.archive.ubuntu.com/ubuntu/ quantal main universe

then update (sudo apt-get update && sudo apt-get -y dist-upgrade). Add my userns-natty ppa (sudo add-apt-repository ppa:serge-hallyn/userns-natty) and update again (sudo apt-get update && sudo apt-get -y dist-upgrade), then reboot into the new kernel.

As I’ve said, the uid mapping is in /proc/self/uid_map. On the host that looks like

0 0 4294967295
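To make the three columns concrete, here is a minimal sketch that translates a uid inside a namespace to the corresponding host uid by walking a map file. The function name ns_to_host_uid is hypothetical. It assumes only the "namespace_first_uid host_first_uid number_of_uids" layout described earlier, and it reads /proc/self/uid_map unless another file is given.

```shell
# ns_to_host_uid NS_UID [MAPFILE]
# Hypothetical helper: print the host uid that NS_UID maps to, or
# return 1 if no line of the map covers it.
ns_to_host_uid() {
    id=$1
    while read -r ns_first host_first count; do
        # A line covers ids ns_first .. ns_first + count - 1
        if [ "$id" -ge "$ns_first" ] && [ "$id" -lt $((ns_first + count)) ]; then
            echo $((host_first + id - ns_first))
            return 0
        fi
    done < "${2:-/proc/self/uid_map}"
    return 1
}
```

With the host map shown above, every uid translates to itself. With a map of "0 100000 1000", uid 999 becomes 100999 and uid 1000 is unmapped.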

Grab nsexec from my ppa to create new namespaces (sudo apt-get install nsexec) and run

sudo nsexec -cU /bin/bash

Inside the new namespace, /proc/self/uid_map is empty. So we need to add some mappings. From a root terminal on the host (not in the new namespace), do

echo "0 555550 10" > /proc/$pid/uid_map
echo "0 555550 10" > /proc/$pid/gid_map

where $pid is the process id of the shell in the namespace. The nsexec package includes a utility called uidmap which will do this for you, so you can just do

sudo uidmap $pid 555550 10

(This utility will soon support being run setuid-root and consulting the above-mentioned /etc/id_permission/ files.)

Now, back in the nsexec shell, switch to the new namespaced root userid with newuidexec (from the nsexec package):

newuidexec 0

Now you can do:

#id
uid=0(root) gid=0(root) groups=0(root)
#touch /tmp/zzz
#ls -l /root
ls: cannot open directory /root: Permission denied
#ls -l /tmp/zzz
-rw-r--r-- 1 root root 0 May 9 16:45 zzz

while back in your host root shell, you see:

#ls -l /tmp
-rw-r--r-- 1 555550 555550 0 May 9 16:45 zzz

The same thing will happen with all cases where a uid crosses the user->kernel api. For instance if you send credentials over a unix socket to a task in another user namespace, the uid will be converted to a valid mapping in the other user namespace, or, if none exists, to the overflowuid.
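The fallback can be sketched in the reverse direction too: given a host uid and the map of the receiving namespace, print the uid that namespace would see. This is only an illustration of the rule, not kernel code. The name host_to_ns_uid is made up, and 65534 is used as the default overflow value (the kernel exposes the real one in /proc/sys/kernel/overflowuid).

```shell
# host_to_ns_uid HOST_UID MAPFILE [OVERFLOWUID]
# Hypothetical helper: print the uid seen inside the namespace whose
# map is in MAPFILE, or the overflow uid if HOST_UID is not mapped.
host_to_ns_uid() {
    id=$1 map=$2 overflow=${3:-65534}
    while read -r ns_first host_first count; do
        if [ "$id" -ge "$host_first" ] && [ "$id" -lt $((host_first + count)) ]; then
            echo $((ns_first + id - host_first))
            return 0
        fi
    done < "$map"
    # No mapping exists in the receiving namespace
    echo "$overflow"
}
```

With the "0 555550 10" map from the walkthrough, host uid 555550 shows up as 0 inside the namespace, while an unmapped host uid such as 1000 shows up as the overflow uid.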

So, after many years, user namespaces are real! Perhaps the biggest remaining obstacle to using user namespaces for a real distro container is converting more capable() calls to ns_capable(). Soon.


LXC in precise and beyond

I haven’t blogged about lxc for some time. Recently Stéphane showed (http://www.stgraber.org/2012/03/04/booting-an-ubuntu-12-04-virtual-machine-in-an-lxc-container/) what much of the lxc-related work we did this cycle accomplished: making it possible to boot a stock ubuntu image in a container, and streamlining the creation of new containers.

We also now have documentation in the ubuntu server guide. You can see it at https://help.ubuntu.com/12.04/serverguide/lxc.html.

But perhaps the most … intense work this cycle was the addition of apparmor support. The thing we wanted most from apparmor was not yet available: the ability to mediate mounts in the container. If we want to say “the container cannot write to /proc/sysrq-trigger”, then for that to be useful we either need to say “/sysrq-trigger relative to a proc mount”, or we need to be able to prevent /proc being mounted anywhere else (like /mnt). John Johansen in a huge effort implemented the kernel apparmor functionality (in a way acceptable upstream!) and a nice addition to the apparmor profile language, and was always helpful as we were shaking out bugs.

In the end it was tight, but 12.04 now has containers constrained by apparmor by default!

The apparmor support works as follows. First, /usr/bin/lxc-start is automatically transitioned to its own profile, where it is only allowed to mount into the container’s tree. Then, just before executing the container’s init, lxc-start transitions to the container’s own profile. Each container configuration can specify a custom profile (its name should start with “lxc-”, though “unconfined” is also valid); if none is specified, “lxc-default” is used. The default policy attempts to protect the host from accidental container abuses – such as writing to /proc/sysrq-trigger and /proc/mem, changing its cgroup settings (including its devices whitelist), or mounting the host’s devpts instance and subsequently manipulating host ptys. The goal in 12.04 is not to protect the host from a malicious root user in a container, but from accidental abuses in the container.
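The profile selection rule can be written down as a small shell sketch. This is not lxc's own code, and the name resolve_profile is made up. It also treats the “lxc-” naming convention as a hard requirement, which is stricter than the recommendation stated here.

```shell
# resolve_profile [NAME]
# Hypothetical helper: print the apparmor profile a container would
# run under, given the (possibly empty) name from its configuration.
resolve_profile() {
    case ${1:-} in
        "")               echo lxc-default ;;  # nothing specified
        unconfined|lxc-*) echo "$1" ;;         # accepted as given
        *)                echo "invalid profile: $1" >&2; return 1 ;;
    esac
}
```

So resolve_profile with no argument yields lxc-default, while resolve_profile lxc-web and resolve_profile unconfined are passed through unchanged.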

An important apparmor feature missing in 12.04, however, is support for stacked profiles. Stacked profiles will implement a profile hierarchy. They will make it possible to have a container, running in its own restrictive profile, further load profiles. For instance, a container will be able to load the libvirt profiles – so that the container is protected from libvirt – but with that libvirt profile being subordinate to the container’s profile.

Since that support is not currently there, one must choose: either run the container in a profile and not allow it to load or transition to any further profiles, or run the container unconfined, and allow the container to load profiles. By default, the former is chosen, as that will usually be the best choice from the host’s point of view.

The next few releases then will be very exciting from a container security point of view. For 12.10, we hope to further protect the host from containers using seccomp2, which implements a per-process system call filter. We also intend to hook the high-level testsuite into a jenkins instance, and start a code rewrite which will better support good unit tests. For 13.04, we hope to be able to exploit user namespaces and support stacked apparmor profiles. Finally, other features we hope to complete by 13.10 (though getting all of them done is unlikely) include cgroup fake roots, a devices namespace, and a system log namespace.

In terms of general features over those same releases, we will add apport hooks for better debug support, container hooks at various states (e.g. post-create and pre-start), and greater scriptability by providing a liblxc api. And the user namespace should, before 14.04, allow us to support container use by unprivileged users.

If you’ll be at UDS next week, there will be a high level overview and demo of lxc, followed by q/a – see https://blueprints.launchpad.net/ubuntu/+spec/foundations-q-containers-demo.


First round of kvm performance tests

Here are the raw results from my first set of kvm performance runs. These were all using disk images on an ext4 filesystem on the host, and using the scripts I showed in the previous post.

To reiterate: for each test, I cloned the disk image, booted it once to wget a linux kernel tarball, then booted it 10 more times – 5 times with a read-intensive workload, and 5 with a write-intensive workload. Due to some hiccoughs, some of the tests ended up being run twice (far apart in time). In those cases I list both sets of results, but perhaps I shouldn’t have, as it obfuscates patterns.

You may notice that often the first result is an outlier, quite a bit higher than the rest. Moreover, this pattern is far more extreme for write tests than read tests.

My theory is that this is due to preallocation effects. This is borne out by the fact that the best results, as well as those with the smallest 95% confidence interval, are those with preallocated qcow2.

Note that pre-allocated qcow2 files still only preallocate metadata. A further test of my theory will be to see what we get with a kvm root disk on a host lvm partition.

The results I show here are:

1. the disk image parameters
2. the list of times, in seconds from kvm vm startup to shutdown, of read- and write-heavy workloads respectively
3. the mean +/- 95% confidence interval for reads and writes.
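For readers who want to recompute item 3, here is a minimal sketch. It assumes the interval is the usual Student's t interval on the sample standard deviation, which looks consistent with the reported numbers but is not taken from the processing scripts themselves. The function name ci95 is made up, and the t critical values are hardcoded for the two sample sizes that occur here (5 and 10 runs).

```shell
# ci95 TIME...
# Hypothetical helper: print "mean +/- halfwidth" of a 95% confidence
# interval for the given run times. Only 5 or 10 values are supported.
ci95() {
    printf '%s\n' "$@" | awk '
        NF { x[n++] = $1; sum += $1 }
        END {
            if (n < 2) exit 1
            mean = sum / n
            for (i = 0; i < n; i++) ss += (x[i] - mean) ^ 2
            sd = sqrt(ss / (n - 1))            # sample standard deviation
            # two-sided 95% t critical values, indexed by degrees of freedom
            t[4] = 2.776445; t[9] = 2.262157
            if (!((n - 1) in t)) exit 1
            printf "%.6f +/- %.6f\n", mean, t[n - 1] * sd / sqrt(n)
        }'
}
```

For example, ci95 72.336 69.728 70.104 69.88 65.224 uses the read times of the raw, cache=unsafe, aio=threads, virtio run and comes out at about 69.45 +/- 3.22.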

I’ve sorted them by mean write time, ascending. Of course cache=unsafe dominates the fastest times. But if cache=unsafe is not acceptable for you, then raw and qed with cache=writeback, aio=threads, and if=virtio appear to offer good performance for both read and write.
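Each result header names the parameters that were varied. As a rough sketch (not the script actually used for these runs), this is how one header line might be turned into a kvm -drive argument. The helper name drive_arg and the image path are made up; file=, format=, if=, cache= and aio= are standard -drive suboptions. The prealloc field applies when the image is created rather than when it is attached, so it is ignored here.

```shell
# drive_arg IMAGE type T cache C aio A if I [prealloc P]
# Hypothetical helper: build the value for a kvm -drive option from
# the key/value pairs of a result header.
drive_arg() {
    image=$1
    shift
    while [ $# -ge 2 ]; do
        case $1 in
            type)  fmt=$2 ;;
            cache) cache=$2 ;;
            aio)   aio=$2 ;;
            "if")  iface=$2 ;;
            *)     ;;                # e.g. prealloc: not a -drive setting
        esac
        shift 2
    done
    echo "file=$image,format=$fmt,if=$iface,cache=$cache,aio=$aio"
}
```

A call such as drive_arg /srv/guest.img type raw cache writeback aio threads if virtio prealloc no produces a string that could be passed as kvm -drive "$(...)".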

The raw data and scripts which I used to process the data are at http://people.canonical.com/~serge/kvm-perf.ext4/. There you can also find zwrites (the results below) and zreads (ordered by mean read time ascending).

========== Results: ==========

type qcow2 cache unsafe aio native if virtio prealloc yes
reads: [71.908, 69.98, 69.3, 70.736, 65.208, 72.79599999999999, 70.904, 69.824, 70.28, 69.688]
writes: [213.828, 210.642, 220.054, 210.911, 206.659, 221.735, 213.031, 209.053, 209.906, 226.05]
readinterval: 70.062400 +/- 1.439401
writeinterval: 214.186900 +/- 4.513276

type qcow2 cache unsafe aio threads if virtio prealloc yes
reads: [67.9, 65.64, 70.428, 64.916, 70.024, 77.756, 70.784, 70.004, 70.264, 69.972]
writes: [212.236, 223.84300000000002, 221.059, 210.651, 208.778, 219.92000000000002, 213.113, 212.90699999999998, 211.906, 210.511]
readinterval: 69.768800 +/- 2.498142
writeinterval: 214.492400 +/- 3.690026

type raw cache unsafe aio threads if virtio prealloc no
reads: [72.336, 69.728, 70.104, 69.88, 65.224]
writes: [207.495, 218.463, 211.106, 215.195, 220.608]
readinterval: 69.454400 +/- 3.218633
writeinterval: 214.573400 +/- 6.630368

type raw cache unsafe aio native if virtio prealloc no
reads: [72.144, 64.828, 72.092, 69.784, 69.284]
writes: [199.113, 222.543, 223.154, 217.915, 225.99]
readinterval: 69.626400 +/- 3.703418
writeinterval: 217.743000 +/- 13.422588

type qed cache unsafe aio threads if virtio prealloc no
reads: [67.196, 70.212, 70.672, 71.116, 70.843]
writes: [223.852, 240.652, 221.107, 220.10399999999998, 225.446]
readinterval: 70.007800 +/- 1.993900
writeinterval: 226.232200 +/- 10.352019

type qed cache unsafe aio native if virtio prealloc no
reads: [68.2, 70.88, 69.996, 69.772, 69.86]
writes: [245.998, 222.684, 215.52, 217.809, 231.90699999999998]
readinterval: 69.741600 +/- 1.202580
writeinterval: 226.783600 +/- 15.454367

type qcow2 cache unsafe aio threads if virtio prealloc no
reads: [72.904, 69.812, 69.892, 70.336, 70.668]
writes: [236.522, 235.276, 232.071, 215.768, 215.321]
readinterval: 70.722400 +/- 1.574275
writeinterval: 226.991600 +/- 13.132206

type qcow2 cache unsafe aio native if virtio prealloc no
reads: [72.96000000000001, 69.29599999999999, 70.752, 70.912, 69.956]
writes: [239.856, 228.488, 224.102, 218.566, 226.488]
readinterval: 70.775200 +/- 1.717330
writeinterval: 227.500000 +/- 9.738037

type raw cache writeback aio threads if virtio prealloc no
reads: [73.16, 69.304, 69.992, 71.08, 70.532]
writes: [276.42, 261.144, 241.359, 244.405, 246.88400000000001]
readinterval: 70.813600 +/- 1.821672
writeinterval: 254.042400 +/- 18.165892

type qed cache writeback aio threads if virtio prealloc no
reads: [73.756, 71.78, 71.136, 71.112, 64.944]
writes: [272.872, 245.23, 256.286, 254.131, 248.39]
readinterval: 70.545600 +/- 4.112407
writeinterval: 255.381800 +/- 13.318741

type raw cache writeback aio native if virtio prealloc no
reads: [73.224, 69.336, 69.68, 69.148, 64.804]
writes: [290.128, 245.698, 241.717, 249.07, 254.622]
readinterval: 69.238400 +/- 3.712638
writeinterval: 256.247000 +/- 24.240085

type qed cache writeback aio native if virtio prealloc no
reads: [67.792, 69.28, 66.684, 66.828, 71.19200000000001]
writes: [291.802, 255.961, 233.31, 247.034, 262.969]
readinterval: 68.355200 +/- 2.351399
writeinterval: 258.215200 +/- 27.068852

type raw cache unsafe aio native if ide prealloc no
reads: [75.904, 75.06, 76.044, 78.048, 76.064]
writes: [314.539, 293.501, 296.839, 279.1, 279.287]
readinterval: 76.224000 +/- 1.366151
writeinterval: 292.653200 +/- 18.201841

type qcow2 cache unsafe aio threads if ide prealloc yes
reads: [78.036, 73.988, 74.996, 75.03999999999999, 75.224, 77.032, 75.176, 73.94, 76.014, 73.984]
writes: [299.203, 278.489, 295.129, 281.617, 314.042, 308.012, 296.013, 295.495, 317.636, 283.315]
readinterval: 75.343000 +/- 0.967447
writeinterval: 296.895100 +/- 9.584470

type qcow2 cache unsafe aio native if ide prealloc yes
reads: [76.934, 74.356, 74.688, 75.984, 75.12, 78.564, 73.388, 74.08, 75.03999999999999, 73.932]
writes: [309.87, 327.163, 286.801, 289.347, 288.831, 304.659, 297.211, 291.344, 284.538, 294.106]
readinterval: 75.208600 +/- 1.120707
writeinterval: 297.387000 +/- 9.410988

type qcow2 cache unsafe aio native if ide prealloc no
reads: [76.868, 73.864, 73.852, 76.98, 74.012]
writes: [345.193, 301.156, 278.149, 305.268, 284.979]
readinterval: 75.115200 +/- 2.052318
writeinterval: 302.949000 +/- 32.444686

type qcow2 cache unsafe aio threads if ide prealloc no
reads: [77.604, 73.372, 75.088, 74.044, 74.004]
writes: [341.525, 292.598, 295.211, 294.166, 297.212]
readinterval: 74.822400 +/- 2.076512
writeinterval: 304.142400 +/- 26.031010

type raw cache unsafe aio threads if ide prealloc no
reads: [77.86, 78.176, 74.608, 77.97200000000001, 77.428]
writes: [309.522, 319.155, 291.431, 293.868, 321.356]
readinterval: 77.208800 +/- 1.836889
writeinterval: 307.066400 +/- 17.283457

type qed cache unsafe aio threads if ide prealloc no
reads: [79.51599999999999, 74.479, 74.916, 74.356, 75.904]
writes: [354.66700000000003, 310.124, 310.858, 324.602, 316.314]
readinterval: 75.834200 +/- 2.664899
writeinterval: 323.313000 +/- 22.918687

type qed cache unsafe aio native if ide prealloc no
reads: [77.088, 75.3, 75.732, 74.9, 74.232]
writes: [356.678, 335.235, 321.761, 307.689, 349.126]
readinterval: 75.450400 +/- 1.327346
writeinterval: 334.097800 +/- 24.729279

type qcow2 cache writeback aio native if virtio prealloc yes
reads: [72.884, 71.376, 76.564, 70.008, 69.876, 73.116, 69.728, 69.752, 69.6, 69.784]
writes: [323.319, 345.656, 332.029, 336.356, 355.151, 344.764, 337.784, 319.897, 339.707, 334.06600000000003]
readinterval: 71.268800 +/- 1.639358
writeinterval: 336.872900 +/- 7.487037

type qcow2 cache writeback aio native if virtio prealloc no
reads: [72.79599999999999, 71.79599999999999, 70.152, 65.28, 69.72]
writes: [341.786, 336.824, 329.788, 329.311, 356.161]
readinterval: 69.948800 +/- 3.588496
writeinterval: 338.774000 +/- 13.679155

type qcow2 cache writeback aio threads if virtio prealloc no
reads: [72.8, 69.8, 70.2, 71.832, 71.19200000000001]
writes: [346.134, 334.753, 335.104, 339.325, 339.099]
readinterval: 71.164800 +/- 1.509693
writeinterval: 338.883000 +/- 5.695326

type qcow2 cache writeback aio threads if virtio prealloc yes
reads: [73.68, 70.06, 69.836, 69.932, 69.98, 72.868, 70.02, 69.976, 70.096, 69.998]
writes: [330.735, 331.071, 362.747, 346.53, 355.036, 329.624, 346.205, 329.016, 325.988, 337.155]
readinterval: 70.644600 +/- 1.002024
writeinterval: 339.410700 +/- 8.983949

type raw cache writeback aio native if ide prealloc no
reads: [78.51599999999999, 74.78, 74.42, 74.82, 73.972]
writes: [403.936, 400.882, 424.5, 400.947, 400.139]
readinterval: 75.301600 +/- 2.271043
writeinterval: 406.080800 +/- 12.912044

type raw cache writethrough aio native if virtio prealloc no
reads: [68.528, 69.676, 69.264, 69.828, 69.884]
writes: [488.821, 413.534, 422.732, 405.829, 385.253]
readinterval: 69.436000 +/- 0.698544
writeinterval: 423.233800 +/- 48.653135

type qed cache writeback aio native if ide prealloc no
reads: [77.28, 74.352, 75.672, 75.132, 75.088]
writes: [425.21, 458.31, 430.34, 403.30899999999997, 401.867]
readinterval: 75.504800 +/- 1.363138
writeinterval: 423.807200 +/- 28.697195

type raw cache writethrough aio native if virtio prealloc no
reads: [68.888, 70.984, 66.88, 71.70400000000001, 70.268]
writes: [477.113, 414.424, 419.382, 419.52, 399.91700000000003]
readinterval: 69.744800 +/- 2.371302
writeinterval: 426.071200 +/- 36.795115

type qed cache writeback aio threads if ide prealloc no
reads: [78.491, 75.324, 74.484, 78.36, 74.88]
writes: [443.45, 419.983, 450.796, 426.647, 407.002]
readinterval: 76.307800 +/- 2.429239
writeinterval: 429.575600 +/- 21.975754

type raw cache writethrough aio threads if virtio prealloc no
reads: [73.868, 70.0, 71.092, 69.78, 69.44800000000001]
writes: [504.962, 416.249, 412.897, 414.723, 399.332]
readinterval: 70.837600 +/- 2.238366
writeinterval: 429.632600 +/- 52.949876

type raw cache writeback aio threads if ide prealloc no
reads: [76.632, 76.628, 74.16, 73.888, 74.751]
writes: [439.41, 453.696, 444.509, 416.295, 463.588]
readinterval: 75.211800 +/- 1.653519
writeinterval: 443.499600 +/- 22.084035

type qcow2 cache writethrough aio threads if virtio prealloc yes
reads: [74.012, 70.752, 69.896, 70.29599999999999, 65.612, 72.592, 70.732, 70.432, 69.753, 65.75]
writes: [512.072, 404.015, 422.361, 408.045, 393.726, 597.382, 432.496, 436.96, 438.931, 429.013]
readinterval: 69.982700 +/- 1.871147
writeinterval: 447.500100 +/- 44.198589

type qed cache writethrough aio threads if virtio prealloc no
reads: [72.752, 66.028, 71.484, 71.212, 71.304]
writes: [596.895, 404.658, 448.447, 425.341, 410.709]
readinterval: 70.556000 +/- 3.236448
writeinterval: 457.210000 +/- 99.194061

type qcow2 cache writethrough aio native if virtio prealloc yes
reads: [72.659, 70.34, 71.608, 69.868, 70.392, 73.036, 71.476, 70.316, 70.68, 71.172]
writes: [475.648, 413.068, 443.214, 428.523, 416.772, 624.818, 449.145, 472.21, 466.734, 444.459]
readinterval: 71.154700 +/- 0.751826
writeinterval: 463.459100 +/- 43.440139

type qcow2 cache writeback aio native if ide prealloc yes
reads: [77.03999999999999, 74.048, 75.028, 74.112, 74.8, 76.726, 73.856, 75.792, 74.356, 73.984]
writes: [459.85699999999997, 421.821, 470.205, 450.886, 481.52, 473.843, 437.646, 495.09, 471.445, 487.481]
readinterval: 74.974200 +/- 0.834674
writeinterval: 464.979400 +/- 16.293238

type qed cache writethrough aio native if virtio prealloc no
reads: [74.544, 66.272, 69.86, 73.132, 71.624]
writes: [588.529, 445.01800000000003, 448.782, 437.321, 426.602]
readinterval: 71.086400 +/- 3.980641
writeinterval: 469.250400 +/- 83.459586

type qcow2 cache writeback aio threads if ide prealloc no
reads: [78.688, 74.852, 75.036, 74.072, 74.156]
writes: [454.594, 465.601, 455.023, 482.267, 506.537]
readinterval: 75.360800 +/- 2.367901
writeinterval: 472.804400 +/- 27.253850

type qcow2 cache writeback aio native if ide prealloc no
reads: [77.55199999999999, 75.632, 75.276, 75.044, 74.084]
writes: [476.841, 478.186, 469.609, 463.385, 500.178]
readinterval: 75.517600 +/- 1.581568
writeinterval: 477.639800 +/- 17.301061

type raw cache none aio native if virtio prealloc no
reads: [77.68, 76.292, 77.52, 76.364, 77.612]
writes: [482.216, 502.968, 477.784, 470.051, 487.512]
readinterval: 77.093600 +/- 0.871224
writeinterval: 484.106200 +/- 15.314034

type qcow2 cache writethrough aio native if virtio prealloc no
reads: [72.472, 72.96000000000001, 71.732, 70.104, 69.744]
writes: [671.941, 430.206, 439.514, 425.877, 455.022]
readinterval: 71.402400 +/- 1.768546
writeinterval: 484.512000 +/- 130.834077

type qcow2 cache none aio native if virtio prealloc no
reads: [77.4, 80.688, 74.376, 80.46000000000001, 77.212]
writes: [543.64, 442.76, 487.82, 495.176, 463.548]
readinterval: 78.027200 +/- 3.249008
writeinterval: 486.588800 +/- 47.207473

type qcow2 cache writeback aio threads if ide prealloc yes
reads: [79.724, 74.616, 74.316, 75.844, 76.21600000000001, 79.748, 74.876, 74.252, 74.06, 74.087]
writes: [508.752, 468.39, 533.871, 483.505, 514.277, 491.674, 453.55899999999997, 476.473, 484.364, 472.598]
readinterval: 75.773900 +/- 1.581155
writeinterval: 488.746300 +/- 17.207381

type qcow2 cache writethrough aio threads if virtio prealloc no
reads: [73.144, 70.876, 69.86, 66.216, 71.856]
writes: [649.269, 448.008, 461.003, 456.055, 436.461]
readinterval: 70.390400 +/- 3.265897
writeinterval: 490.159200 +/- 111.039277

type qcow2 cache none aio native if virtio prealloc yes
reads: [76.964, 72.572, 77.236, 77.088, 76.838, 76.26, 76.508, 76.916, 76.42, 76.668]
writes: [476.156, 470.74, 499.128, 443.656, 474.364, 507.94, 551.748, 556.968, 489.34, 534.26]
readinterval: 76.347000 +/- 0.973792
writeinterval: 500.430000 +/- 26.690797

type qed cache none aio threads if virtio prealloc no
reads: [77.992, 77.724, 79.184, 77.332, 76.77199999999999]
writes: [529.155, 458.584, 526.116, 492.068, 522.908]
readinterval: 77.800800 +/- 1.116445
writeinterval: 505.766200 +/- 37.604137

type raw cache none aio threads if virtio prealloc no
reads: [78.104, 76.996, 77.632, 76.46000000000001, 76.88]
writes: [482.376, 503.86, 548.272, 490.648, 506.72]
readinterval: 77.214400 +/- 0.808136
writeinterval: 506.375200 +/- 31.565464

type qed cache none aio native if virtio prealloc no
reads: [76.84, 76.74, 77.22800000000001, 77.164, 77.712]
writes: [491.132, 540.764, 483.748, 526.808, 507.348]
readinterval: 77.136800 +/- 0.475029
writeinterval: 509.960000 +/- 29.651642

type qcow2 cache none aio threads if virtio prealloc yes
reads: [78.20400000000001, 78.18, 73.3, 76.816, 78.876, 77.71600000000001, 77.032, 77.596, 77.29599999999999, 72.02]
writes: [503.988, 495.144, 525.72, 542.712, 503.664, 518.556, 543.72, 495.94, 494.844, 516.844]
readinterval: 76.703600 +/- 1.598942
writeinterval: 514.113200 +/- 13.339433

type qcow2 cache none aio threads if virtio prealloc no
reads: [77.828, 76.864, 76.988, 77.236, 77.852]
writes: [611.708, 497.179, 561.54, 560.4639999999999, 520.304]
readinterval: 77.353600 +/- 0.575956
writeinterval: 550.239000 +/- 54.556150

type raw cache directsync aio native if virtio prealloc no
reads: [78.256, 76.624, 71.852, 76.236, 76.92]
writes: [664.876, 544.736, 512.644, 499.308, 541.676]
readinterval: 75.977600 +/- 3.014994
writeinterval: 552.648000 +/- 81.477149

type qcow2 cache directsync aio native if virtio prealloc yes
reads: [78.828, 77.076, 77.70400000000001, 77.424, 77.652]
writes: [713.552, 548.688, 535.836, 587.892, 558.424]
readinterval: 77.736800 +/- 0.817399
writeinterval: 588.878400 +/- 89.754168

type raw cache directsync aio threads if virtio prealloc no
reads: [79.28, 77.21600000000001, 77.79599999999999, 77.044, 76.98]
writes: [692.844, 577.1, 568.328, 549.608, 557.408]
readinterval: 77.663200 +/- 1.191259
writeinterval: 589.057600 +/- 73.201025

type qed cache directsync aio native if virtio prealloc no
reads: [79.116, 76.72800000000001, 76.94, 77.732, 77.172]
writes: [737.116, 593.968, 609.876, 513.872, 539.968]
readinterval: 77.537600 +/- 1.190214
writeinterval: 598.960000 +/- 107.443724

type qcow2 cache directsync aio native if virtio prealloc no
reads: [77.256, 72.05199999999999, 77.128, 76.908, 76.98]
writes: [799.844, 569.932, 563.976, 538.548, 567.46]
readinterval: 76.064800 +/- 2.790327
writeinterval: 607.952000 +/- 134.103245

type qcow2 cache directsync aio threads if virtio prealloc yes
reads: [79.72, 77.94800000000001, 77.69200000000001, 72.248, 72.94]
writes: [708.444, 575.008, 606.627, 633.548, 519.684]
readinterval: 76.109600 +/- 4.112374
writeinterval: 608.662200 +/- 86.982030

type qcow2 cache none aio native if ide prealloc no
reads: [89.71600000000001, 89.036, 89.024, 88.9, 89.03999999999999]
writes: [595.172, 634.948, 644.144, 639.612, 546.932]
readinterval: 89.143200 +/- 0.404064
writeinterval: 612.161600 +/- 51.342335

type raw cache none aio native if ide prealloc no
reads: [91.14, 90.804, 89.412, 89.592, 90.2]
writes: [667.164, 583.376, 594.404, 613.016, 649.488]
readinterval: 90.229600 +/- 0.928064
writeinterval: 621.489600 +/- 44.458414

type qcow2 cache none aio native if ide prealloc yes
reads: [91.644, 91.194, 89.924, 88.98400000000001, 89.364, 92.06, 89.9, 89.923, 89.44, 89.628]
writes: [615.552, 682.836, 621.06, 580.528, 675.52, 680.48, 701.068, 624.996, 578.476, 642.944]
readinterval: 90.206100 +/- 0.748667
writeinterval: 640.346000 +/- 31.068144

type qcow2 cache directsync aio threads if virtio prealloc no
reads: [77.964, 77.168, 76.944, 77.748, 76.852]
writes: [847.968, 632.316, 576.068, 586.364, 592.548]
readinterval: 77.335200 +/- 0.614677
writeinterval: 647.052800 +/- 141.947814

type raw cache none aio threads if ide prealloc no
reads: [91.884, 91.132, 90.904, 90.128, 90.892]
writes: [662.24, 663.763, 664.66, 642.552, 660.632]
readinterval: 90.988000 +/- 0.780231
writeinterval: 658.769400 +/- 11.416450

type qed cache directsync aio threads if virtio prealloc no
reads: [79.904, 77.904, 77.031, 77.564, 78.2]
writes: [844.788, 623.52, 637.112, 572.912, 646.456]
readinterval: 78.120600 +/- 1.350330
writeinterval: 664.957600 +/- 129.702072

type qcow2 cache none aio threads if ide prealloc no
reads: [93.376, 91.456, 91.196, 91.112, 91.048]
writes: [700.452, 711.628, 647.22, 666.568, 604.964]
readinterval: 91.637600 +/- 1.221937
writeinterval: 666.166400 +/- 53.214826

type qed cache none aio native if ide prealloc no
reads: [92.14, 89.748, 89.26, 89.652, 90.08]
writes: [750.128, 649.292, 553.716, 668.968, 741.012]
readinterval: 90.176000 +/- 1.410714
writeinterval: 672.623200 +/- 98.906635

type qcow2 cache none aio threads if ide prealloc yes
reads: [92.124, 91.2, 91.708, 91.939, 93.928, 91.276, 92.848, 91.872, 91.28, 91.224]
writes: [572.0840000000001, 692.144, 677.74, 628.88, 638.872, 715.38, 681.768, 672.928, 762.628, 704.492]
readinterval: 91.939900 +/- 0.622067
writeinterval: 674.691600 +/- 37.364491

type qed cache none aio threads if ide prealloc no
reads: [91.82, 91.072, 91.94, 90.988, 91.676]
writes: [807.588, 715.08, 534.876, 706.944, 697.044]
readinterval: 91.499200 +/- 0.545591
writeinterval: 692.306400 +/- 122.336222

type raw cache writethrough aio threads if ide prealloc no
reads: [80.444, 75.284, 77.084, 78.196, 74.964]
writes: [1057.925, 661.521, 661.179, 657.715, 616.761]
readinterval: 77.194400 +/- 2.790265
writeinterval: 731.020200 +/- 228.111179

type raw cache writethrough aio native if ide prealloc no
reads: [77.06, 77.696, 78.05199999999999, 75.844, 76.932]
writes: [1068.291, 667.963, 655.981, 670.858, 668.072]
readinterval: 77.116800 +/- 1.051292
writeinterval: 746.233000 +/- 223.657676

type raw cache writethrough aio native if ide prealloc no
reads: [79.696, 76.776, 77.616, 76.336, 76.9]
writes: [1071.565, 656.022, 704.98, 676.375, 652.853]
readinterval: 77.464800 +/- 1.650613
writeinterval: 752.359000 +/- 223.061965

type qcow2 cache writethrough aio threads if ide prealloc yes
reads: [77.896, 76.924, 78.644, 76.14, 76.056, 80.5, 75.36, 78.684, 74.19200000000001, 77.584]
writes: [1073.265, 653.21, 660.449, 703.049, 668.19, 1494.953, 640.01, 656.11, 688.671, 663.771]
readinterval: 77.198000 +/- 1.322313
writeinterval: 790.167800 +/- 199.748005

type qcow2 cache writethrough aio native if ide prealloc yes
reads: [78.85, 76.992, 76.696, 75.832, 78.388, 90.128, 78.012, 85.189, 75.583, 78.056]
writes: [1058.099, 672.306, 667.347, 681.649, 681.3, 1538.195, 705.342, 694.396, 669.951, 670.794]
readinterval: 79.372600 +/- 3.321813
writeinterval: 803.937900 +/- 203.330296

type qed cache writethrough aio threads if ide prealloc no
reads: [80.132, 76.8, 77.749, 76.44800000000001, 77.976]
writes: [1383.955, 727.032, 710.931, 686.392, 731.373]
readinterval: 77.821000 +/- 1.788330
writeinterval: 847.936600 +/- 372.699807

type qcow2 cache writethrough aio threads if ide prealloc no
reads: [80.388, 76.792, 75.988, 76.672, 77.064]
writes: [1593.794, 670.108, 708.875, 705.443, 682.376]
readinterval: 77.380800 +/- 2.144576
writeinterval: 872.119200 +/- 501.321377

type qed cache writethrough aio native if ide prealloc no
reads: [81.412, 74.928, 76.94, 77.22800000000001, 76.1]
writes: [1480.35, 732.154, 667.413, 754.678, 738.492]
readinterval: 77.321600 +/- 3.048239
writeinterval: 874.617400 +/- 422.465560

type qcow2 cache writethrough aio native if ide prealloc no
reads: [81.044, 79.763, 76.916, 77.988, 75.5]
writes: [1660.47, 725.64, 702.448, 727.507, 728.033]
readinterval: 78.242200 +/- 2.741941
writeinterval: 908.819600 +/- 521.897933

type qcow2 cache directsync aio threads if ide prealloc yes
reads: [93.824, 92.00399999999999, 100.856, 100.088, 92.068]
writes: [1329.155, 881.064, 847.872, 887.36, 963.416]
readinterval: 95.768000 +/- 5.418941
writeinterval: 981.773400 +/- 246.773343

type qcow2 cache directsync aio native if ide prealloc yes
reads: [93.732, 92.676, 89.956, 90.819, 90.252]
writes: [1282.052, 874.672, 885.06, 977.332, 955.54]
readinterval: 91.487000 +/- 2.037346
writeinterval: 994.931200 +/- 206.685491

type raw cache directsync aio threads if ide prealloc no
reads: [94.944, 91.656, 100.0, 94.032, 100.28]
writes: [1328.724, 879.88, 977.32, 971.396, 950.028]
readinterval: 96.182400 +/- 4.728545
writeinterval: 1021.469600 +/- 218.629245

type raw cache directsync aio native if ide prealloc no
reads: [95.152, 89.572, 90.036, 89.944, 89.86]
writes: [1327.28, 971.624, 946.948, 953.932, 957.163]
readinterval: 90.912800 +/- 2.950376
writeinterval: 1031.389400 +/- 205.684453

type qed cache directsync aio native if ide prealloc no
reads: [95.112, 90.032, 90.132, 89.996, 89.988]
writes: [1616.092, 966.284, 984.184, 1004.828, 892.448]
readinterval: 91.052000 +/- 2.818989
writeinterval: 1092.767200 +/- 367.036271

type qed cache directsync aio threads if ide prealloc no
reads: [94.988, 92.06, 92.27600000000001, 92.712, 92.78]
writes: [1799.532, 896.048, 958.512, 929.48, 915.404]
readinterval: 92.963200 +/- 1.453926
writeinterval: 1099.795200 +/- 486.517142

type qcow2 cache directsync aio native if ide prealloc no
reads: [92.824, 89.524, 89.368, 89.74, 89.248]
writes: [1789.824, 970.24, 957.368, 984.06, 864.708]
readinterval: 90.140800 +/- 1.876408
writeinterval: 1113.240000 +/- 473.205353

type qcow2 cache directsync aio threads if ide prealloc no
reads: [94.124, 91.852, 92.76400000000001, 91.94800000000001, 91.882]
writes: [2009.812, 888.624, 1048.88, 927.432, 891.664]
readinterval: 92.514000 +/- 1.212233
writeinterval: 1153.282400 +/- 600.007495
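
The readinterval and writeinterval lines above appear to be the mean of the five runs plus or minus the half-width of a 95% confidence interval from Student's t distribution with 4 degrees of freedom; recomputing a couple of the read entries by hand gives the same figures. A minimal sketch of that calculation in shell and awk follows. The function name ci95 and the hard-coded t value are assumptions of this sketch, not the tooling that produced the numbers.

```shell
# ci95: print "mean +/- half-width" for exactly five samples.
# The half-width is t * s / sqrt(n), where s is the sample standard
# deviation and t = 2.776445 is the two-sided 95% Student's t value
# for n - 1 = 4 degrees of freedom (an assumption of this sketch).
ci95() {
    printf '%s\n' "$@" | awk '
        { x[NR] = $1; sum += $1 }
        END {
            n = NR
            if (n != 5) {
                print "ci95: expected exactly 5 samples" > "/dev/stderr"
                exit 1
            }
            mean = sum / n
            for (i = 1; i <= n; i++)
                ss += (x[i] - mean) ^ 2
            sd = sqrt(ss / (n - 1))
            printf "%f +/- %f\n", mean, 2.776445 * sd / sqrt(n)
        }'
}

# The reads from the "qcow2 directsync native ide prealloc yes" entry:
ci95 93.732 92.676 89.956 90.819 90.252
```

The wide write intervals mostly come from the first write run of each set, which is consistently much slower than the other four.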


Kvm performance runs under way

I’ve finally gotten the kvm performance tests rolling. I’m hoping to have the first set of results some time next week. I installed a new precise server image on a laptop with 100G for rootfs (ext4), and a 100G partition for the guest images. I installed a precise server guest in a 10G base image on the rootfs, and for the first set of tests formatted the other partition as ext4. For each test, the base image gets copied to the experimental partition, where I intend to test xfs, jfs, etc.

The guest has the following upstart script.

{{{
# iotest - perform 5 read and 5 write tests
# shut down after each test so caller can time it
description "Run 5 read and 5 write tests, one per reboot"
author "Serge Hallyn"

start on runlevel [2345]
stop on runlevel [!2345]
console output

script
    # First boot only: create the counters, fetch the test tarball, power off.
    dosetup()
    {
        echo 0 > /etc/nreadtests
        echo 0 > /etc/nwritetests
        cd /root
        which wget > /dev/null 2>&1 || {
            sudo apt-get update
            sudo apt-get -y install wget
        }
        wget http://www.kernel.org/pub/linux/kernel/v3.0/linux-3.3.2.tar.bz2
        shutdown -h now
        exit 0
    }

    # Read-heavy run: walk the whole filesystem and list the tarball, twice.
    doreadtest()
    {
        cd /root
        find / -type f > /dev/null 2>&1 || true
        tar jvft linux-3.3.2.tar.bz2 > /dev/null
        find / -type f > /dev/null 2>&1 || true
        tar jvft linux-3.3.2.tar.bz2 > /dev/null
    }

    # Write-heavy run: write a 2G file three times, then unpack and
    # remove the kernel tree, syncing after each step.
    dowritetest()
    {
        rm -f /root/bigfile
        dd if=/dev/zero of=/root/bigfile bs=1G count=2
        sync || true
        rm -f /root/bigfile
        dd if=/dev/zero of=/root/bigfile bs=1G count=2
        sync || true
        rm -f /root/bigfile
        dd if=/dev/zero of=/root/bigfile bs=1G count=2
        sync || true
        cd /root
        tar jxf linux-3.3.2.tar.bz2
        sync || true
        rm -rf linux-3.3.2
        sync || true
    }

    if [ ! -f /etc/nreadtests ]; then
        dosetup
    fi
    nreads=`cat /etc/nreadtests`
    nwrites=`cat /etc/nwritetests`
    echo "nreads is $nreads"
    echo "nwrites is $nwrites"
    # 5 read boots, then 5 write boots, matching the 11 runs per
    # configuration (1 setup + 5 + 5) that the host script performs.
    if [ "$nreads" -lt 5 ]; then
        doreadtest
        nreads=$((nreads+1))
        echo $nreads > /etc/nreadtests
        shutdown -h now
        exit 0
    fi
    if [ "$nwrites" -lt 5 ]; then
        dowritetest
        nwrites=$((nwrites+1))
        echo $nwrites > /etc/nwritetests
        shutdown -h now
        exit 0
    fi
    echo "iotest: at end (should not reach here until done with all runs)"
end script
}}}

On the first boot it downloads a kernel tarball, then shuts down. On each of the next 5 boots it does some heavy read activity, and on each of the 5 after that some heavy write activity. It shuts down after every run, so the host can time each boot from start to power-off.

On the host I am running the following script:

{{{
#!/bin/bash

#host fs: xfs, ext4, ext3, ext2, jfs, lvm, lvm snapshot
#interface: virtio, ide
#cache: unsafe, none, writeback, writethrough, directsync
#aio: threads, native

runperf()
{
    # one setup, 5 read, 5 write tests
    for i in `seq 1 11`; do
        echo "run $i" >> perfout
        echo "kvm -m 1024 $* -net nic,model=virtio -net tap,ifname=tap0,script=no,downscript=no -vnc :1"
        time kvm -m 1024 $* -net nic,model=virtio -net tap,ifname=tap0,script=no,downscript=no -vnc :1
    done
}

runtests()
{
    disk=$1
    drive=$2

    # "qcow2pre" is not a real image format: it means qcow2 with
    # preallocated metadata. Work this out once, before the loops, so
    # that every cache/aio/interface combination gets the same options.
    prealloc=""
    if [ "$drive" = "qcow2pre" ]; then
        prealloc="-o preallocation=metadata"
        drive=qcow2
    fi

    for cache in unsafe none writeback writethrough directsync; do
        for aio in threads native; do
            for myif in virtio ide; do
                echo "drive type $drive cache $cache aio $aio myif $myif starting"
                echo "drive type $drive cache $cache aio $aio myif $myif starting" >> perfout
                # Start every combination from a fresh copy of the base image.
                echo "Converting disk..."
                echo "qemu-img convert -f raw -O $drive $prealloc /home/serge/base.img /srv/x.img"
                echo "qemu-img convert -f raw -O $drive $prealloc /home/serge/base.img /srv/x.img" >> perfout
                qemu-img convert -f raw -O $drive $prealloc /home/serge/base.img /srv/x.img
                echo "Starting runs"
                runperf -drive file=$disk,aio=$aio,cache=$cache,if=$myif,index=0
            done
        done
    done
}

# NOTE need to create each hostfs by hand and run this script
# lvm and lvm snapshot count as more hostfs's
# NOTE AFAIK preallocation only works with qcow2. That is handled inside
# runtests
for format in qcow2 qed raw qcow2pre; do
    echo "Starting format $format"
    echo "Starting format $format" >> perfout
    runtests /srv/x.img $format
done
}}}

I left a console logged in doing ‘tail -f perfout’ so I can try to monitor progress without perturbing the system. I think the first set of qcow2 runs took about 24 hours, so figure about 4 days for the ext4 results.

I think I’ll only check the many cache/aio/etc options with ext4. Then I may pick one combination and use that in another script to test with lvm, and lvm snapshot, and with raw format image file on a hostfs of xfs, jfs, etc.

I’ll update when the first round of results is done.


The linux command line (book)

“The linux command line”, published by No Starch Press, sells itself to people who are new to linux, and have been enjoying its gui goodness, but who now want to experience some of the famed power of the command line.

And it absolutely lives up to that promise. It starts very gently, showing you how to comfortably get around the shell with a few simple commands. Then it slowly expands to exploring the whole system. Along the way, it’s very readable, so even though I’ve been using unix for 20 years and didn’t have much to learn in this part, I still enjoyed reading it page by page.

While it was readable, it also included plenty of examples to encourage new users to experiment. I’m convinced that this book will help a new user of any level to become more proficient with the Linux command line, and have fun doing it.

Here’s where the book surprised me. About halfway through, I started having more and more “oh really?” moments. I started really slowing down. I’d never used ‘[[ $x =~ regexp ]]’ or the ‘join’ command, for instance. And when I started using unix (sunos around 1993) we didn’t have no fancy option to sort ls output by size, you did that by hand, so I’ve been doing that ever since, even though gnu ls has now had an option for that for, I think, more than a decade. Since I’d expected to finish the book quickly, reading every page cover to cover, and then blog about it, it became frustrating :) But that’s only because I don’t want to miss a single potentially new tip.

So before and partway into reading the book, I was planning to recommend it to anyone using or interested in linux who has not yet had a chance to delve into the command line. Now, I’d recommend it for just about anyone.


gtd next-actions

For years now, I’ve kept the following directory structure to support my gtd workflow:

gtd/
   done.otl
   next_actions.otl
   someday_maybe.otl
   waiting_on.otl
   Projects/
   Reference/
   tickler/

I’ve discussed the tickler folder before. But while I like having the next_actions.otl file, I feel it’s stopping me from really following the gtd workflow. I tend to ignore some of the things under Projects/, as I have to manually look through those to find a next action.

So I’m going to try something a bit different. I’m going to not have a next_actions file. Rather, I’ll tag items in files under Projects/ with ‘nextaction:’, and use ~/bin/nextactions which is basically just

grep -r "^nextaction" $HOME/gtd/Projects

Now that doesn’t make me look at files not having next actions, but it keeps me working under ~/gtd/Projects rather than keeping next_actions.otl open in an editor, which otherwise is the temptation. I can also easily print out all top-level files or directories which don’t have a nextaction in or under them,

====================================================
#!/bin/bash

d="$HOME/gtd/Projects"

# Print every top-level project with no nextaction line in or under it.
for f in "$d"/*; do
	if ! grep -rq "^nextaction" "$f"; then
		echo "$f"
	fi
done
====================================================

so I can trivially see which project I’ve ignored. Yikes, 23 of them.

We’ll see, I may not like it. For instance, it may seem more awkward cut-pasting things into gtd/done.otl. But I think it could improve my workflow.


Nested kvm guests

Yesterday, the right honorable James Page asked whether nested kvm was supported. It’s been long supported on AMD, but for a long time the answer has been “check back later” for Intel. I hadn’t checked in a while though, so I took a quick look. And lo! It appears to have been introduced in the upstream kernel in May 2011. It is turned off by default. To turn it on, you must provide the ‘nested=1’ parameter when loading the kvm_intel module.
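
As a rough sketch of what that looks like by hand: the commented modprobe lines reload the module with nesting on, and the helper reports the current setting. The helper name nested_status is made up for illustration. The sysfs path is where the kernel exposes module parameters, and the value reads as Y or 1 when nesting is on, depending on kernel version.

```shell
# Reload kvm_intel with nesting enabled (needs root, and no running VMs):
#   modprobe -r kvm_intel
#   modprobe kvm_intel nested=1

# nested_status: report whether the loaded kvm_intel module has nesting on.
# The optional argument overrides the sysfs path (handy for testing).
nested_status() {
    f="${1:-/sys/module/kvm_intel/parameters/nested}"
    if [ ! -r "$f" ]; then
        # Module not loaded, or not an Intel host.
        echo unknown
        return 0
    fi
    case "$(cat "$f")" in
        Y|1) echo enabled ;;
        *)   echo disabled ;;
    esac
}

nested_status
```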

I did a few tests with that parameter, and saw no instability nor performance degradation. So as of today, qemu-kvm in precise will by default enable nesting on Intel. If you want to turn it off, edit /etc/default/qemu-kvm and set KVM_NESTED="".

The userspace qemu-kvm doesn’t need any changes to use this, however you do have to pass either ‘-cpu host’ or ‘-cpu qemu64,+vmx’ to the qemu command line options. So, for instance, I was testing with:

kvm -cpu host -drive file=x.img,if=virtio,cache=none,index=0 -m 1024 -redir tcp:2222::22

In that VM I started a nested minimal ubuntu install and compiled a tiny program, ii. With nested kvm that took 0.8s. Without it (that is, without passing ‘-cpu host’ to the top level qemu), it took 2.8 seconds.
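
A quick way to tell which of those two cases you are in is to check, inside the first-level guest, whether the virtual cpu advertises the hardware virtualization flag (vmx on Intel, svm on AMD). A minimal sketch, with has_virt_flag being a name invented here:

```shell
# has_virt_flag: print "yes" if the cpu flags include vmx or svm, else "no".
# The optional argument overrides /proc/cpuinfo (handy for testing).
has_virt_flag() {
    f="${1:-/proc/cpuinfo}"
    if grep -Eq '^flags[[:space:]]*:.*[[:space:]](vmx|svm)([[:space:]]|$)' "$f"; then
        echo yes
    else
        echo no
    fi
}

has_virt_flag
```

If this prints "no" in the guest, a nested kvm there cannot use hardware acceleration and will fall back to plain emulation, which would fit the slower timing above.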

Enjoy!
