I'm learning about Linux user namespaces and I'm observing behavior that isn't completely clear to me.
I've created a range of UIDs in the initial user namespace to which I can map UIDs of a child user namespace via the newuidmap command. These are my settings:
$ grep '^woky:' /etc/subuid
woky:200000:10000
$ id -u
1000
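To make the arithmetic explicit, here is a minimal sketch of how I read that /etc/subuid line, assuming the name:first-ID:count format described in subuid(5). The function name subuid_ranges is my own and not part of any tool.

```python
def subuid_ranges(subuid_text, user):
    """Return (first_uid, count) pairs that subuid-style text grants to `user`.

    Assumes each entry has the form name:first_id:count.
    """
    ranges = []
    for line in subuid_text.splitlines():
        parts = line.strip().split(":")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, first, count = parts
        if name == user:
            ranges.append((int(first), int(count)))
    return ranges


# The entry from my /etc/subuid, passed in as a string.
first, count = subuid_ranges("woky:200000:10000\n", "woky")[0]
print(first, first + count - 1)  # first and last subordinate UID
```

So, if I read the format correctly, my user may hand out the UIDs 200000 through 209999.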
Then I tried to create a new user namespace and map its UID range [0-10000) to [200000-210000) in the parent user namespace:
First terminal:
$ PS1='% ' unshare -U bash
% echo $$
1337
% id
uid=65534(nobody) gid=65534(nobody) groups=65534(nobody)
Second terminal:
$ ps -p 1337 -o uid
UID
1000
$ newuidmap 1337 0 200000 10000
$ ps -p 1337 -o uid
UID
1000
First terminal:
% id
uid=65534(nobody) gid=65534(nobody) groups=65534(nobody)
So the UIDs inside and outside the new user namespace didn't change, even though newuidmap completed successfully.
Then I found the following article, http://www.itinken.com/blog/2016/Sep/exploring-unprivileged-containers/, which opened my eyes a bit. I repeated the previous scenario, but instead of the unshare command I used the following test-unshare.py script, which I took from the article and slightly modified:
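For reference, this is the mapping I intended with "newuidmap 1337 0 200000 10000", written as a minimal sketch. It assumes the three-field layout from user_namespaces(7) (first ID inside the namespace, first ID outside, length). The helper names parse_uid_map and inside_to_outside are hypothetical, and this only models the lookup, not what the kernel actually does.

```python
def parse_uid_map(text):
    """Parse uid_map-style text into (inside_start, outside_start, count) tuples."""
    return [
        tuple(int(field) for field in line.split())
        for line in text.splitlines()
        if line.strip()
    ]


def inside_to_outside(uid, uid_map):
    """Translate a UID inside the namespace to the parent's UID, or None if unmapped."""
    for inside_start, outside_start, count in uid_map:
        if inside_start <= uid < inside_start + count:
            return outside_start + (uid - inside_start)
    return None


uid_map = parse_uid_map("0 200000 10000\n")
print(inside_to_outside(0, uid_map))      # 200000
print(inside_to_outside(9999, uid_map))   # 209999
print(inside_to_outside(10000, uid_map))  # None, just past the mapped range
```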
#!/usr/bin/python3
import os
from cffi import FFI
CLONE_NEWUSER = 0x10000000
ffi = FFI()
ffi.cdef('int unshare(int flags);')
libc = ffi.dlopen(None)
libc.unshare(CLONE_NEWUSER)
print("user id = %d, process id = %d" % (os.getuid(), os.getpid()))
input("Press Enter to continue...")
# The uid must be set to 0 to avoid losing capabilities when creating the shell.
os.setuid(0)
os.execlp('/bin/bash', 'bash')
First terminal:
$ python3 ./test-unshare.py
user id = 65534, process id = 1337
Press Enter to continue...
Second terminal:
$ ps -p 1337 -o uid
UID
1000
$ newuidmap 1337 0 200000 10000
$ ps -p 1337 -o uid
UID
1000
First terminal:
<Enter>
bash: /home/woky/.bashrc: Permission denied
bash-4.4# id
uid=0(root) gid=65534(nobody) groups=65534(nobody)
Second terminal:
$ ps -p 1337 -o uid
UID
200000
Now it looks like what I expected from the beginning. My theory about why the UIDs in the first example didn't change is the following:
The unshare command called execve(2) to run /bin/bash without first calling setuid(2). As a result, the shell lost all its capabilities (as mentioned in user_namespaces(7)) and can no longer change its UID from 65534. In the second case, the process changed its UID to 0 because it still had the capabilities to do so, and Linux mapped that UID to 200000 outside the new user namespace (according to /proc/1337/uid_map, which newuidmap wrote). This would mean that the first process in a new user namespace has to call setuid(START_UID) before execve(2), or it stays stuck at 65534.
Is that correct?
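One way I could check the capability part of this theory is to look at the CapEff line in /proc/<pid>/status, which holds the effective capability set as a hexadecimal bitmask. Below is a minimal sketch of that check. The helper names are my own, and the two status snippets are made-up samples rather than output captured from my sessions.

```python
CAP_SETUID = 7  # bit number of CAP_SETUID in <linux/capability.h>


def effective_caps(status_text):
    """Extract the effective capability bitmask from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def has_cap(cap_mask, cap_number):
    """Return True if the capability with the given bit number is set."""
    return bool((cap_mask >> cap_number) & 1)


# Hypothetical samples: one process with no capabilities, one with a full set.
no_caps = "Name:\tbash\nCapEff:\t0000000000000000\n"
all_caps = "Name:\tpython3\nCapEff:\t0000003fffffffff\n"

print(has_cap(effective_caps(no_caps), CAP_SETUID))   # False
print(has_cap(effective_caps(all_caps), CAP_SETUID))  # True
```

On a live system the text would come from reading /proc/1337/status, with 1337 being the PID from my transcripts.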
However, in this scenario, the process in the new user namespace didn't have to call setuid(2) and yet its UID changed:
First terminal:
$ PS1='% ' unshare -U bash
% echo $$
1337
% id
uid=65534(nobody) gid=65534(nobody) groups=65534(nobody)
Second terminal:
$ ps -p 1337 -o uid
UID
1000
$ echo '500000 1000 1' >/proc/1337/uid_map
$ ps -p 1337 -o uid
UID
1000
First terminal:
% id
uid=500000 gid=65534(nobody) groups=65534(nobody)
Please explain all of these situations in depth.
My journey started when I tried to understand what the /etc/subuid file is for. It's used by Docker and LXC, but only a few documents explain it. Sorry for the verbosity. It took me a really long time to get this far and I still don't understand it fully, so I'm collecting everything I know here.