Deep dive · 15 min read

Linux cgroups and namespaces (Docker basis)

How runc, Docker, and Kubernetes are built from kernel primitives. cgroup v1 vs v2, CFS quotas, OOM scoring, user namespaces.

The kernel primitives in detail

Containers exist because of three things the Linux kernel grew over 15 years:

Namespaces (mount namespaces in 2002; user namespaces completed in kernel 3.8, 2013) for isolation.
cgroups (v1 from 2008, v2 from 2016) for resource control.
Capabilities and seccomp for privilege confinement.

There is no struct container in the kernel. A container is a process whose task_struct happens to reference its own pid namespace, net namespace, mount namespace, etc., plus a cgroup with limits.
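One way to see this from userspace: each process exposes its namespace references as symlinks under /proc/PID/ns, with targets such as pid:[4026531836]. The sketch below parses those targets and compares two processes; two processes share a namespace when the inode numbers match. The helper names are made up for illustration, and namespaces_of only works on Linux.

```python
import os
import re

# Link targets under /proc/<pid>/ns look like "pid:[4026531836]".
_NS_LINK = re.compile(r"^(?P<kind>[a-z_]+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target):
    """Split a namespace symlink target into (kind, inode)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return match.group("kind"), int(match.group("inode"))


def namespaces_of(pid="self"):
    """Return {entry_name: inode} for one process (Linux only).

    Reading another user's process may require extra privileges.
    """
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for name in os.listdir(ns_dir):
        _kind, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        result[name] = inode
    return result


def shared_namespaces(a, b):
    """Names of the namespaces that two {name: inode} maps have in common."""
    return sorted(name for name in a if name in b and a[name] == b[name])
```

Comparing namespaces_of("self") on the host with namespaces_of(PID) for a containerized process should show different inodes for the namespace types the container has unshared.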

Namespaces, deeper

Each namespace type is independent. You can be in a new PID namespace but share the host's network, or have isolated network and shared filesystems. Docker happens to use most of them together, but they're orthogonal.

PID namespace

Inside a new pid namespace, your processes start at PID 1. PID 1 has special responsibilities: orphaned processes are reparented to it, and it must wait() on them when they exit. If your container's PID 1 never reaps its children, zombies accumulate. This is why Docker has --init to run a tiny init like tini.

PID 1 also has signal restrictions: a signal sent from inside the container reaches PID 1 only if PID 1 has installed a handler for it. SIGKILL and SIGSTOP cannot be handled, so kill -9 1 inside a container does nothing. From the host (an ancestor pid namespace), SIGKILL and SIGSTOP are always delivered, which is how docker kill can still stop the container.
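The reaping duty boils down to a wait loop. The following is a minimal sketch of that loop, not how tini is implemented: reap_all and spawn are illustrative names, and os.waitstatus_to_exitcode assumes Python 3.9 or newer. A real init would keep running this loop (typically driven by SIGCHLD) rather than return.

```python
import os


def reap_all():
    """Block until every child of this process has exited.

    Returns {pid: exit_code}. Each successful waitpid() removes one
    zombie entry from the process table.
    """
    reaped = {}
    while True:
        try:
            pid, status = os.waitpid(-1, 0)  # -1 means "any child"
        except ChildProcessError:
            return reaped  # no children left to wait for
        reaped[pid] = os.waitstatus_to_exitcode(status)


def spawn(exit_code):
    """Fork a child that exits immediately with exit_code; return its PID."""
    pid = os.fork()
    if pid == 0:
        os._exit(exit_code)  # child: exit without running parent cleanup
    return pid
```

Until reap_all runs, each exited child stays in the process table as a zombie; that is the state that accumulates when PID 1 never waits.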

Network namespace

Each net namespace has its own loopback, its own interfaces, its own routing table, its own iptables/nftables rules, its own /proc/net.

Docker's default bridge mode: create a veth pair. One end goes into the container's net namespace as eth0. The other end attaches to a Linux bridge (docker0) on the host. NAT rules in iptables forward traffic between docker0 and the host's real interface.
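The bridge wiring can be approximated with iproute2 commands. The sketch below only builds the command lines; it does not run them. It assumes a named network namespace created beforehand with ip netns add, which is a simplification: Docker talks to the kernel over netlink and does not use named namespaces. The interface names are placeholders.

```python
def bridge_setup_commands(netns, bridge="docker0",
                          host_if="veth-host", peer_if="veth-ctr"):
    """Return iproute2 commands (argv lists) that mimic bridge-mode wiring.

    netns is assumed to be a named namespace (see `ip netns add`).
    Running these requires root; this function only builds the lists.
    """
    return [
        # 1. Create the veth pair: two connected virtual interfaces.
        ["ip", "link", "add", host_if, "type", "veth", "peer", "name", peer_if],
        # 2. Move one end into the container's network namespace.
        ["ip", "link", "set", peer_if, "netns", netns],
        # 3. Attach the host end to the bridge and bring it up.
        ["ip", "link", "set", host_if, "master", bridge],
        ["ip", "link", "set", host_if, "up"],
        # 4. Inside the namespace, rename the peer to eth0 and bring it up.
        ["ip", "netns", "exec", netns, "ip", "link", "set", peer_if, "name", "eth0"],
        ["ip", "netns", "exec", netns, "ip", "link", "set", "eth0", "up"],
    ]
```

Address assignment and the NAT rules are left out; they would follow the same pattern with ip addr and iptables.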

--network host skips this and shares the host's network. Faster (no veth, no NAT), less isolated.

Mount namespace and pivot_root

Each mount namespace has its own mount table. A mount in one namespace doesn't appear in another by default.

pivot_root(new, old) swaps the root: new becomes /, the old root becomes accessible at old. The container can then umount(old) to hide the host's filesystem entirely.

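Docker images are layered: each filesystem-changing instruction in a Dockerfile (RUN, COPY, ADD) creates a layer on top of the base image's layers. At container start, Docker's storage driver (overlay2 on most installs) mounts the layers via overlayfs and hands the merged directory to runc as the rootfs: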

lowerdir=layer1:layer2:layer3 (read-only)
upperdir=writable layer (copy-on-write)
workdir=internal scratch
merged at /var/lib/docker/.../merged

Reads search top to bottom, writes go to the upper layer, original files are COW'd on first write. This is how you can run 100 containers from one image and not use 100x the disk.
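The lookup and copy-on-write behaviour can be modelled in a few lines. This is a toy model with dictionaries standing in for directories, not overlayfs itself: it ignores whiteouts (deletions), directories, and metadata. overlay_options builds the option string that mount -t overlay expects, where the leftmost lowerdir entry is the topmost layer.

```python
def overlay_options(lower_layers, upperdir, workdir):
    """Build the -o string for `mount -t overlay`; lower_layers is top-first."""
    return f"lowerdir={':'.join(lower_layers)},upperdir={upperdir},workdir={workdir}"


class OverlayModel:
    """Toy overlay: each layer is a dict mapping path -> file contents."""

    def __init__(self, lower_layers):
        self.lower = lower_layers  # shared and read-only; index 0 is the top
        self.upper = {}            # private writable layer of one container

    def read(self, path):
        # Search the upper layer first, then the lower layers top to bottom.
        for layer in [self.upper, *self.lower]:
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def write(self, path, data):
        # Writes land in the upper layer only; lower layers are never touched.
        self.upper[path] = data
```

Two OverlayModel instances built from the same list of lower layers share those layers by reference, which mirrors why many containers from one image cost little extra disk.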

User namespace

The big one for security. UID 0 inside the container can map to UID 100000 on the host. Even if the container process is "root," it is an ordinary unprivileged user from the host's point of view.

The mapping lives in /proc/PID/uid_map:

0 100000 65536

Means: UIDs 0-65535 inside the container map to host UIDs 100000-165535.
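The translation is simple range arithmetic. A small sketch, with illustrative function names, that parses the contents of a uid_map (or gid_map) file and maps an ID inside the namespace to the ID seen outside it:

```python
def parse_id_map(text):
    """Parse uid_map/gid_map contents into (inside_start, outside_start, count)."""
    entries = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, count = (int(field) for field in line.split())
            entries.append((inside, outside, count))
    return entries


def to_host_id(entries, container_id):
    """Translate an ID inside the namespace to the outside ID.

    Returns None when the ID falls in no mapped range.
    """
    for inside, outside, count in entries:
        if inside <= container_id < inside + count:
            return outside + (container_id - inside)
    return None
```

For the mapping above, to_host_id(parse_id_map("0 100000 65536"), 1000) gives 101000, and any ID of 65536 or higher is unmapped.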

Docker can enable this with the daemon's --userns-remap option (userns-remap in daemon.json). Many production setups don't, because it complicates volume mounts (file ownership needs to be remapped). Podman uses user namespaces by default for rootless mode.

One host kernel, two containers, each with private namespaces and cgroup limits.

cgroups v1 vs v2

v1 (2008): each controller (cpu, memory, blkio, etc.) had its own hierarchy. You could put a process in different positions of different hierarchies. Powerful but confusing.

v2 (2016, default on most modern distros): unified hierarchy. One tree. Each cgroup can enable any combination of controllers. Simpler model.

Some controllers were redesigned in v2 (most notably the io controller, which is now weight-based and works well, vs. v1's broken blkio).

cgroup v2 paths live under /sys/fs/cgroup/. systemd creates a tree like:

/sys/fs/cgroup/system.slice/sshd.service/
/sys/fs/cgroup/user.slice/user-1000.slice/session-1.scope/
/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/pod-xxx/container-yyy/

Files in each directory expose the controllers: cpu.max, memory.max, io.weight, pids.max, cgroup.procs (list of PIDs in this group).

CPU control: weight vs bandwidth

cpu.weight (1-10000, default 100): relative share. If two cgroups have weight 100 and 200, the second gets 2x the CPU when contended.
cpu.max ("MAX PERIOD"): hard bandwidth cap. "50000 100000" means 50ms of CPU per 100ms period = 0.5 cores worth.

Kubernetes maps requests.cpu to cpu.weight and limits.cpu to cpu.max. A pod with limit 500m gets cpu.max "50000 100000."
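The limit-to-quota arithmetic can be written out directly. This sketch assumes the 100ms period used in the example above; the function names are illustrative and this is not the kubelet's code.

```python
DEFAULT_PERIOD_US = 100_000  # 100ms, the period in the example above


def millicores_to_cpu_max(millicores, period_us=DEFAULT_PERIOD_US):
    """Format a CPU limit in millicores as a cpu.max value ("QUOTA PERIOD")."""
    quota_us = millicores * period_us // 1000
    return f"{quota_us} {period_us}"


def cpu_max_to_cores(value):
    """Convert a cpu.max value back to cores; None means unlimited ("max")."""
    quota, period = value.split()
    if quota == "max":
        return None
    return int(quota) / int(period)
```

A 500m limit yields "50000 100000", and a 2-core limit yields "200000 100000": the quota may exceed the period because it counts CPU time across all cores.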

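The bandwidth controller works in fixed periods (100ms by default). The quota counts CPU time summed across all threads, so a pod with a 50ms quota and several busy threads can burn through it in 30ms of wall time or less; it is then throttled for the remaining 70ms and resumes at the next period. For bursty workloads (web servers with spike loads), this can cause severe tail latency: p99 can go from 10ms to 100ms+ just because of throttling.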
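A back-of-the-envelope model makes the latency cost concrete. The sketch below is a simplification, not the scheduler's algorithm: it assumes the burst starts at the beginning of a period with the full quota available, and that all threads stay busy until the work is done.

```python
def wall_time_ms(cpu_needed_ms, quota_ms, period_ms=100.0, threads=1):
    """Estimate wall-clock time for a burst of CPU work under a quota.

    Simplified model: the burst starts at a period boundary and every
    thread is busy until the work finishes.
    """
    elapsed = 0.0
    remaining = float(cpu_needed_ms)
    # CPU time one period can absorb: the quota, or less if there are
    # too few threads to consume the whole quota within the period.
    per_period = min(quota_ms, threads * period_ms)
    while remaining > per_period:
        remaining -= per_period
        elapsed += period_ms  # run until the quota is gone, then wait
    return elapsed + remaining / threads
```

Under this model a single-threaded request needing 60ms of CPU with a 50ms/100ms quota finishes at 110ms instead of 60ms. With four threads the same work would take 15ms unthrottled but finishes at 102.5ms, because the quota is exhausted after 12.5ms of wall time.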

Many teams skip CPU limits entirely in Kubernetes, relying on requests for scheduling and letting bursts use spare capacity. Memory limits are necessary (you can't safely overcommit memory the way you can CPU).

Memory control

memory.max: hard limit. When usage hits it and reclaim can't free enough, the cgroup OOM killer picks a victim inside the cgroup.
memory.high: soft limit. Hit it = kernel throttles allocations (puts the cgroup under reclaim pressure).
memory.low: below this, this cgroup is protected from reclaim.
memory.swap.max: swap quota.

The cgroup OOM killer is a different beast from the system OOM killer. It only kills inside the cgroup. The rest of the host is untouched. From outside, a containerized process suddenly dies and the host looks fine. Common cause of "my container restarts mysteriously."

Use dmesg | grep -i killed on the host to see OOM events.
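cgroup v2 also keeps per-cgroup counters in a memory.events file next to memory.max, including an oom_kill count. A sketch of reading it, assuming the file's "key value" per line layout; the function names are illustrative:

```python
def parse_flat_keyed(text):
    """Parse a cgroup 'flat keyed' file ("key value" per line) into a dict."""
    values = {}
    for line in text.splitlines():
        if line.strip():
            key, value = line.split()
            values[key] = int(value)
    return values


def oom_kills_since(before, after):
    """Number of OOM kills in the cgroup between two memory.events samples."""
    return after.get("oom_kill", 0) - before.get("oom_kill", 0)
```

Sampling the file before and after a suspicious restart and comparing the two dicts shows whether the cgroup OOM killer was involved.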

IO and pids controllers

io.weight (1-10000, default 100): like cpu.weight, for block I/O.
io.max: byte/IOPS caps per device.
pids.max: maximum number of processes in this cgroup. Prevents fork bombs.
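io.max lines are keyed by device number, followed by key=value caps. A small sketch of parsing and formatting them, assuming the "MAJ:MIN rbps=... wbps=... riops=... wiops=..." layout from the kernel's cgroup v2 documentation; the function names are illustrative:

```python
def parse_io_max(text):
    """Parse io.max contents into {"MAJ:MIN": {key: int or None}}.

    None stands for "max", meaning no limit on that dimension.
    """
    limits = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        device, *pairs = line.split()
        limits[device] = {}
        for pair in pairs:
            key, value = pair.split("=", 1)
            limits[device][key] = None if value == "max" else int(value)
    return limits


def format_io_max(device, **caps):
    """Build a line to write into io.max, e.g. "8:16 rbps=2097152 wiops=120"."""
    return " ".join([device, *(f"{key}={value}" for key, value in caps.items())])
```

format_io_max("8:16", rbps=2097152, wiops=120) produces a line capping device 8:16 at 2 MiB/s of reads and 120 write IOPS.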

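io control on cgroup v1 was broken for many workloads: buffered writes were not attributed to the cgroup that dirtied the pages, so blkio limits only applied to direct and synchronous I/O. v2 charges writeback to the right cgroup, so io.max works as expected. Proportional sharing via io.weight additionally depends on a component that implements it, such as the BFQ scheduler or the io.cost controller; mq-deadline on its own does not.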

Capabilities and seccomp

Even after namespaces and cgroups, a "root" process in the container has many capabilities. Linux capabilities split root's privileges into ~40 buckets (CAP_NET_ADMIN, CAP_SYS_TIME, CAP_DAC_OVERRIDE, etc.). Docker drops most by default; the container keeps a minimal set.

docker run --cap-add SYS_ADMIN adds capabilities back. Be careful: SYS_ADMIN is basically root.
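To check what a process actually holds, decode the CapEff bitmask from /proc/PID/status. The sketch below maps bit positions to capability names following the numbering in the kernel's capability header. The mask in the usage note is one commonly seen for a default Docker container; treat it as an example, since the default set can differ between versions.

```python
# Capability names indexed by bit number.
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID",
    "CAP_SETPCAP", "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE",
    "CAP_NET_BROADCAST", "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK",
    "CAP_IPC_OWNER", "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT",
    "CAP_SYS_PTRACE", "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT",
    "CAP_SYS_NICE", "CAP_SYS_RESOURCE", "CAP_SYS_TIME",
    "CAP_SYS_TTY_CONFIG", "CAP_MKNOD", "CAP_LEASE", "CAP_AUDIT_WRITE",
    "CAP_AUDIT_CONTROL", "CAP_SETFCAP", "CAP_MAC_OVERRIDE",
    "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM", "CAP_BLOCK_SUSPEND",
    "CAP_AUDIT_READ", "CAP_PERFMON", "CAP_BPF", "CAP_CHECKPOINT_RESTORE",
]


def decode_caps(hex_mask):
    """Turn a hex mask such as the CapEff field into capability names."""
    mask = int(hex_mask, 16)
    return [name for bit, name in enumerate(CAP_NAMES) if mask & (1 << bit)]
```

decode_caps("00000000a80425fb") returns 14 names, including CAP_CHOWN and CAP_NET_BIND_SERVICE but not CAP_SYS_ADMIN or CAP_SYS_TIME.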

seccomp adds another layer: filter which syscalls are allowed. Docker's default profile is an allowlist that blocks around 44 of the 300+ syscalls (keyctl, add_key, etc.) that are rarely needed and have been used in escapes.

The combination: namespace isolation + cgroup limits + dropped capabilities + seccomp filter is what makes containers feel isolated.

The OCI spec and runc

Docker, podman, containerd, cri-o all delegate the actual "create container" step to a runtime that implements the OCI runtime spec. runc is the reference implementation.

Given a config.json describing namespaces, mounts, capabilities, seccomp, and a rootfs directory, runc:

Creates the namespaces via clone().
Sets up mounts (binds, tmpfs, etc.).
Pivots root.
Drops capabilities.
Applies seccomp.
exec()s the entrypoint.
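For a sense of what runc consumes, here is a sketch that builds a pared-down config.json as a Python dict. The field names follow the OCI runtime spec, but the values are placeholders and the result is deliberately incomplete: it omits mounts and the seccomp section, so it is not meant to be runnable as-is. minimal_oci_config is an illustrative helper, not part of any tool.

```python
import json


def minimal_oci_config(args, cpu_quota_us=50_000, cpu_period_us=100_000,
                       memory_limit_bytes=256 * 1024 * 1024):
    """Build a stripped-down OCI runtime config (illustrative values only)."""
    kept_caps = ["CAP_KILL", "CAP_NET_BIND_SERVICE"]  # placeholder minimal set
    return {
        "ociVersion": "1.0.2",
        "process": {
            "args": list(args),            # the entrypoint runc will exec()
            "cwd": "/",
            "user": {"uid": 0, "gid": 0},
            "capabilities": {
                "bounding": kept_caps,
                "effective": kept_caps,
                "permitted": kept_caps,
            },
        },
        "root": {"path": "rootfs", "readonly": True},  # target of the pivot
        "linux": {
            # One entry per namespace to create.
            "namespaces": [
                {"type": kind}
                for kind in ("pid", "network", "mount", "ipc", "uts")
            ],
            # Limits that end up as cgroup settings.
            "resources": {
                "cpu": {"quota": cpu_quota_us, "period": cpu_period_us},
                "memory": {"limit": memory_limit_bytes},
            },
        },
    }


def to_config_json(config):
    """Serialize the dict the way it would be stored in config.json."""
    return json.dumps(config, indent=2)
```

Running runc spec in an empty directory generates a complete default config.json that can be compared against this skeleton.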

That's it. There's no daemon required to keep the container alive; runc creates it and exits. The container runs as a normal process tree under whatever spawned it.

Container escape vectors (briefly)

Privileged containers (--privileged): all capabilities, all devices, no seccomp. Escape is trivial.
Mounting host /: if a container can mount the host filesystem, it owns the host.
CAP_SYS_ADMIN: lets you mount, modify namespaces, etc. Escape paths exist.
Shared PID namespace with host: kill any host process.
Kernel vulns: the kernel is shared, so a kernel vuln reachable from inside the container is a potential escape. Patch kernels.
No user namespace (Docker's default) or a misconfigured one: UID 0 in container = UID 0 on host = root.

Run containers as non-root (USER nobody in Dockerfile, or securityContext.runAsNonRoot in Kubernetes). Use user namespaces for defense in depth. Don't run privileged unless you absolutely must.

Real-world Kubernetes pitfalls

Observability inside containers

/proc/PID/cgroup shows which cgroup a process is in. From inside a container:

cat /proc/self/cgroup
# 0::/system.slice/docker-abc123.scope   (host cgroup namespace)
# 0::/                                    (private cgroup namespace, Docker's default on cgroup v2)

/sys/fs/cgroup/... (the same path) gives current usage:

cat /sys/fs/cgroup/memory.current   # bytes used
cat /sys/fs/cgroup/cpu.stat          # nr_throttled, throttled_usec
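The two lookups can be chained: read the cgroup path from /proc/PID/cgroup, then build the matching path under /sys/fs/cgroup. A sketch with illustrative function names, assuming the cgroup v2 line format 0::PATH and the usual mount point:

```python
import os


def cgroup_v2_path(proc_cgroup_text):
    """Return the cgroup v2 path from /proc/<pid>/cgroup contents, or None.

    The v2 entry is the line with hierarchy ID 0 and an empty controller
    list; any v1 lines (non-zero ID) are skipped.
    """
    for line in proc_cgroup_text.splitlines():
        if not line.strip():
            continue
        hierarchy_id, controllers, path = line.split(":", 2)
        if hierarchy_id == "0" and controllers == "":
            return path
    return None


def cgroup_file(path, name, root="/sys/fs/cgroup"):
    """Path of a control file such as memory.current for a cgroup path."""
    return os.path.join(root, path.lstrip("/"), name)
```

For a cgroup path of "/" the result is simply /sys/fs/cgroup/NAME, which matches the commands above.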

cpu.stat's throttled_usec is the smoking gun for CPU throttling: if it's growing, your container is being paused.
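"Growing" is easiest to judge from two samples. The sketch below parses cpu.stat and reports what fraction of periods were throttled between the samples; it assumes the nr_periods, nr_throttled, and throttled_usec fields that cpu.stat exposes when the cpu controller is enabled. The function names are illustrative.

```python
def parse_cpu_stat(text):
    """Parse cpu.stat ("key value" per line) into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        if line.strip():
            key, value = line.split()
            stats[key] = int(value)
    return stats


def throttle_report(before, after):
    """Summarize throttling between two cpu.stat samples."""
    periods = after["nr_periods"] - before["nr_periods"]
    throttled = after["nr_throttled"] - before["nr_throttled"]
    throttled_usec = after["throttled_usec"] - before["throttled_usec"]
    return {
        # Share of enforcement periods in which the cgroup hit its quota.
        "throttled_fraction": throttled / periods if periods else 0.0,
        # Total time spent paused, in milliseconds.
        "throttled_ms": throttled_usec / 1000,
    }
```

A throttled_fraction that stays well above zero under normal load suggests the CPU limit is too tight for the workload's burst pattern.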

Mental model

A container is a process wearing a costume:

Namespaces are the costume: it sees PID 1, eth0, /, hostname, all as if it were alone on a fresh machine.
cgroups are the wristband at a concert: you can drink up to your quota, no more.
Capabilities and seccomp are the bouncer: yes you can wear root's hat, but you still can't load kernel modules or change the system clock.
Overlayfs is the wardrobe: many costumes layered over a shared base, cheap to clone.

There's no magic. It's just careful use of primitives the kernel has been growing for two decades, wrapped in an OCI spec that everyone agreed on.

