A minimal Go program that demonstrates real vs effective UID — the kernel mechanism behind setuid binaries and why Docker chose a daemon model instead.
mkdir -p tmp
go build -o ./tmp/setuid-demo .
# Without setuid — both UIDs are yours
./tmp/setuid-demo
# real uid: 1000
# effective uid: 1000
# → both UIDs match — running as yourself, no privilege escalationVM / shared folder note: the binary must be on a filesystem mounted without
nosuid.
/tmpon Ubuntu is atmpfswithnosuid— setuid is silently ignored there.- Your macOS home shared into Lima via virtiofs also cannot store Linux UID 0 ownership.
- Use
/var/tmp(ext4, nonosuid) instead.
# Inside a linux machine or VM, copy to /var/tmp first
cp ./tmp/setuid-demo /var/tmp && cd /var/tmp
sudo chown root setuid-demo
sudo chmod u+s setuid-demo
ls -la setuid-demo
# -rwsr-xr-x root ...
# ^
# 's' = setuid bit: run as file owner (root), not as the calling user
# Now run as your normal user
./setuid-demo
# real uid: 1000 ← still you
# effective uid: 0 ← kernel treats this process as rootWhen the kernel execs a binary with the setuid bit set, it switches the effective UID to the file owner before your code runs. The real UID stays as the caller. The process therefore has the privileges of the owner (root here), not the user who ran it.
Classic real-world example: /usr/bin/passwd. A normal user can run it, but it needs root to write /etc/shadow. The setuid bit on the binary is what grants that.
A setuid binary that manages overlay mounts, network namespaces, and cgroups is a huge privileged attack surface — every bug in it is a local root exploit available to any user on the machine. Docker's original answer was to move all the privileged code into one long-lived root process (the daemon) and expose a defined API over a Unix socket, so the attack surface is at least bounded.
User namespaces — the kernel feature that makes rootless containers possible — were not ready when Docker launched.
| Date | Event | Source |
|---|---|---|
| 2013-02-18 | Linux 3.8 released — user namespaces merged, but implementation incomplete and full of privilege escalation CVEs | kernelnewbies.org/Linux_3.8 |
| 2013-03-23 | Docker 0.1 released — user namespaces too new and dangerous to rely on | github.com/moby/moby releases |
| 2014-12-07 | Linux 3.18 — OverlayFS merged into mainline; Docker had been using out-of-tree AUFS | kernelnewbies.org/Linux_3.18 |
| 2016–2022 | Most distros disabled unprivileged user namespaces by default due to ongoing CVEs; Ubuntu added kernel.unprivileged_userns_clone=0 |
Ubuntu bug #1555338 |
| 2019-07 | Rootless Docker released as experimental, using fuse-overlayfs as a workaround for missing kernel support |
moby/moby#38050 |
| 2021-02-14 | Linux 5.11 — unprivileged OverlayFS merged; rootless containers no longer need fuse-overlayfs workaround |
kernelnewbies.org/Linux_5.11 |
| 2021-08 | Rootless Docker graduates to stable in Docker Engine 20.10 | docs.docker.com/engine/security/rootless |
| 2023-10 | Ubuntu 23.10+ ships with kernel.apparmor_restrict_unprivileged_userns=1 — even today, distros restrict this by default |
Ubuntu Blog: Unprivileged user namespace restrictions |
The daemon model was the pragmatic choice in 2013: concentrate all privileged operations in one controlled root process rather than trusting a kernel feature that was unstable and disabled on most production systems.
The trade-off: anyone with write access to /var/run/docker.sock has effective root. Adding a user to the docker group is functionally equivalent to adding them to sudo.
Only trusted users should be allowed to control your Docker daemon. This is a direct consequence of some powerful Docker features. Specifically, Docker allows you to share a directory between the Docker host and a guest container without limiting the access rights of the container. This means you can start a container where
/hostis the/directory on your host — and the container can alter your host filesystem without any restriction. This is similar to how virtualization systems allow filesystem resource sharing: nothing prevents you from sharing your root filesystem (or even your root block device) with a virtual machine. The difference is that with Docker there is no hypervisor boundary — it is just a process with a bind mount.Concretely:
docker run -v /:/host --rm alpine chroot /hostgives you a root shell on the host. No exploit required — just socket access.
The practical consequence: treat /var/run/docker.sock with the same sensitivity as an SSH private key for root.
On Linux, UID 0 (root) is special — the kernel checks it everywhere. Containers share the host kernel, so a process that is UID 0 inside a container is UID 0 on the host kernel. That's why container escapes are so dangerous.
User namespaces let the kernel lie about UIDs. When you create a user namespace, you give it a mapping like:
container UID 0 → host UID 100000
container UID 1 → host UID 100001
...
container UID 65535 → host UID 165535
Inside the namespace, the process thinks it is root (UID 0) and can do root-like things within its own namespace. But the host kernel sees it as UID 100000 — an ordinary unprivileged user. If it escapes the container, it lands on the host as UID 100000: harmless.
Every kernel resource (Files and directories, mounted filesystems, network interfaces, processes, cgroups, ipc objects) has an owning user namespace — the namespace that was current when the resource was created. The kernel's capability check is always:
"Does the process have this capability in the namespace that owns the target resource?"
When Linux boots, there is one user namespace — the initial namespace (init_user_ns). Every file on disk, every network interface, every mounted filesystem was created under it. When you call unshare --user, you create a child namespace:
init_user_ns ← owns everything that existed at boot
│ ← files, mounts, network interfaces, host PIDs
│
└─ your_user_ns
owner = UID 501 (your host UID)
mapping: namespace UID 0 → host UID 501
Your CAP_SYS_ADMIN only reaches the subtree below your_user_ns. The host's resources are owned by init_user_ns — they are unreachable.
# On Ubuntu, apparmor may block unprivileged user namespaces — check first:
sysctl kernel.apparmor_restrict_unprivileged_userns
# If non-zero, enable with:
sudo sysctl -w kernel.apparmor_restrict_unprivileged_userns=0
unshare --user --map-root-user sh
# You appear as root (UID 0), have CAP_SYS_ADMIN in your_user_ns
# ALLOWED — /var/tmp is world-writable, you effectively own it
touch /var/tmp/test
# REJECTED — /mnt is owned by host root (UID 0), which has no mapping
# here and therefore shows up as nobody/nogroup
touch /mnt/test
# touch: cannot touch '/mnt/test': Permission deniedThe kernel never lets a child namespace's capabilities bubble up to affect resources owned by a parent namespace. Your "root" inside the namespace is structurally cut off from the host — not just by policy, but by the kernel's ownership model.
The CVEs from 2013–2018 were mostly cases where a specific kernel subsystem forgot to check namespace ownership for some resource. Examples:
CAP_NET_ADMINinside a user namespace was used to manipulate host routing tables because the routing code skipped the ownership check.ptraceacross namespace boundaries could read host process memory because the ptrace path didn't verify the namespace relationship.
Each CVE was a missing ownership check in one subsystem — not a flaw in the design, but in the implementation. The concept was sound; every code path had to be individually audited. That's why distros disabled the feature for nearly a decade while the kernel caught up.
Traditional Docker has two privileged components — dockerd and containerd — both running as host root. Rootless mode moves the entire stack into a user namespace:
host (your user, UID 1000)
└─ rootlesskit ← sets up user namespace + network namespace
└─ dockerd ← runs inside namespace, thinks it's root
└─ containerd
└─ your container (UID 0 inside = UID 100000+ on host)
rootlesskit calls newuidmap/newgidmap (small setuid helpers that ship with the OS) to write the UID maps, then drops its own privileges. After that, no setuid or root is involved anywhere.
| Traditional Docker | Rootless Docker | |
|---|---|---|
| Daemon runs as | host root | your UID |
| Socket compromise | full host root | your UID only |
| Container escape | full host root | your UID only |
| Overlay filesystem | native OverlayFS | fuse-overlayfs (slower) |
| Port binding < 1024 | works | needs sysctl net.ipv4.ip_unprivileged_port_start |
Podman has no daemon at all. Each podman run is just a regular process in your user namespace — there's nothing listening on a socket to compromise. The attack surface reduces to the process itself and the kernel's namespace implementation.
The fundamental insight: containers don't need root, they need namespaces — and user namespaces let unprivileged processes create all the other namespaces they need.
# Combine user ns with pid, net, mount, and uts namespaces in one shot
unshare --user --map-root-user --pid --net --mount --uts --fork --mount-proc sh
# Inside: fully isolated namespace stack, no root required
id # uid=0(root)
hostname container-demo # own UTS namespace — won't affect host
hostname # container-demo
ip link # only sees lo — isolated network namespace
ls /proc # own PID namespace — only sees its own processes
exit
# Back on host: hostname unchanged, network unchanged# Start a user namespace in the background
unshare --user --map-root-user sleep 60 &
PID=$!
# Read the UID and GID maps the kernel is using
cat /proc/$PID/uid_map
# 0 1000 1
cat /proc/$PID/gid_map
# 0 1000 1
# On the host, this process is still your UID
ps -o uid,pid,comm -p $PID
# UID PID COMMAND
# 1000 ... sleep
kill $PID# lsns shows every namespace and which process owns it
lsns
# Filter to user namespaces only
lsns --type userexecve(2)— how the kernel applies the setuid bit when loading a binarygetuid(2)/geteuid(2)— real vs effective UID syscallscredentials(7)— full explanation of process credentials (real, effective, saved UIDs/GIDs)user_namespaces(7)— how rootless containers remap UIDs without privilege escalationcapabilities(7)— the modern replacement for blanket root: fine-grained privilege splitting
- Docker security: daemon attack surface — official docs on the socket risk
- Run the Docker daemon as a non-root user (rootless mode) — setup guide for rootless Docker