Over the past few months, Eric Biederman has been working on completing the user namespace. Briefly, unprivileged users can create a user namespace, where he can pretend to be root and start new namespaces (i.e. network and pid) which he will own (Note, creating namespaces in child user namespaces isn’t yet allowed, but will be). With respect to anything he owns – for instance new network interfaces which he creates in his own network namespace – he should have privilege. But he should not be able to escape his existing privileges in the parent user namespace. This finally should allow an unprivileged user to create a new filesystem tree and chroot into it, without risk of maliciously confusing setuid applications on the host (for instance by bind mounting his own /etc/passwd).
Eric’s new design is based on a 1-1 uid mapping (by ranges) from uids
in the container to uids on the host. For instance, uid 0 in the namespace may really be uid 999990 on the host. Users can be pre-allocated their own private ranges to use however they please. For instance each user may get 10,000 uids, with the first user’s range starting at 100,000. The uid and gid mappings are exposed and manipulated through /proc/pid/uid_map and /proc/pid/gid_map, which contain:
namespace_first_uid host_first_uid number_of_uids
For instance if it contains “0 100000 1000”, then uids 0 through 1000 in the namespace will map to uids 100000 through 101000 on the host, respectively. To write to the uid map, you must be privileged in your namespace, and your namespace must have the source ids mapped. (The mappings can be nested in the obvious way). In userspace, we expect to have a small setuid-root program which unprivileged users can call to map uids. That program will consult a root owned file which lists the permitted mappings. Right now we are using /etc/id_permission/uids and /etc/id_permission/gids. If /etc/id_permission/uids has
then uid 1000 (user hallyn) will be allowed to map the uids 100000 through 109999, and 1001 (user jschmoe) will be allowed to map uids 110000 through 119999.
So you can try it out! Like so:
Start an amazon ec2 instance of precise. Find an ami to use (ami=`ubuntu-cloudimg-query precise`) and start it up (ec2-run-instances -k myid $ami). Log in and update /etc/apt/sources.list to look as follows:
then update (sudo apt-get update && sudo apt-get -y dist-upgrade). Add my userns-natty ppa (sudo add-apt-repository ppa:serge-hallyn/userns-natty) and update again (sudo apt-get update && sudo apt-get -y dist-upgrade), then reboot into the new kernel.
As I’ve said, the uid mapping is in /proc/self/uid_map. On the host that looks like
0 0 4294967295
Grab nsexec from my ppa to create new namespaces (sudo apt-get install nsexec) and run
sudo nsexec -cU /bin/bash
Inside the new namespace, /proc/self/uid_map is empty. So we need to add some mappings. From a root terminal on the host (not in the new namespace), do
echo “0 555550 10” > /proc/$pid/uid_map
echo “0 555550 10” > /proc/$pid/gid_map
Where $pid is the process id of the shell in the namespace. The nsexec package includes a utility called uidmap which will do this for you, so you can just do
sudo uidmap $pid 555550 10
(This utility will soon support being run setuid-root and consulting the above-mentioned /etc/id_permission/files)
Now back in the nsexec shell, switch to the new namespaced root userid using newuidexec (from the nsexec package) using:
Now you can do:
#id uid=0(root) gid=0(root) groups=0(root) #touch /tmp/zzz #ls -l /root ls: cannot open directory /root: Permission denied #ls -l /tmp/zzz -rw-r--r-- 1 root root 0 May 9 16:45 zzz
while back in your host root shell, you see:
#ls -l /tmp -rw-r--r-- 1 55550 55550 0 May 9 16:45 zzz
The same thing will happen with all cases where a uid crosses the user->kernel api. For instance if you send credentials over a unix socket to a task in another user namespace, the uid will be converted to a valid mapping in the other user namespace, or, if none exists, to the overflowuid.
So, after many years, user namespaces are real! Perhaps the biggest remaining obstacle to using user namespaces for a real distro container is converting more capable() calls to ns_capable(). Soon.