Linux cgroups and namespaces (Docker basis)

Namespaces give processes a private view; cgroups limit their resource use. Together, that's a container.

The two halves of a container

A container is not a kernel object. It's a process tree wrapped in:

Namespaces for isolation: own PIDs, own network interfaces, own mount table, own user IDs, own hostname, own IPC.
cgroups for limits: CPU share, memory cap, block I/O quota, pids limit.

Plus a chroot-like filesystem root (pivot_root) and a seccomp filter for syscall restriction. Docker, containerd, runc, podman all use these same primitives.

Namespaces, the eight kinds

Namespace	Isolates
pid	Process IDs. The first process in the namespace becomes PID 1 and plays the role of init.
net	Network interfaces, routing, iptables, /proc/net
mnt	Mount points. `pivot_root` to a new fs root.
uts	Hostname, domainname
ipc	SysV IPC, POSIX message queues
user	User and group IDs. For example, UID 0 in the container can map to UID 1000 on the host.
cgroup	Hides outer cgroup hierarchy
time (Linux 5.6+)	Boot and monotonic clocks

Create with clone(CLONE_NEW*) flags or unshare. Enter with setns. List with ls -l /proc/PID/ns/.
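Each entry under /proc/PID/ns/ is a symlink whose target names the namespace kind and its inode, for example pid:[4026531836]. Two processes share a namespace exactly when those inodes match. The sketch below reads the links from Python; the helper names (parse_ns_link, namespaces_of, shares_namespace) are made up for illustration, and it assumes a Linux host with /proc mounted.

```python
import os
import re

# A namespace link target looks like "pid:[4026531836]".
_NS_LINK = re.compile(r"^(?P<kind>[a-z_]+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target):
    """Split a namespace link target into (kind, inode)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return match.group("kind"), int(match.group("inode"))


def namespaces_of(pid="self"):
    """Return {entry name: inode} for a process's namespaces (Linux only)."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for name in sorted(os.listdir(ns_dir)):
        target = os.readlink(os.path.join(ns_dir, name))
        _kind, inode = parse_ns_link(target)
        result[name] = inode
    return result


def shares_namespace(pid_a, pid_b, name):
    """True if both processes sit in the same namespace of the given kind."""
    return namespaces_of(pid_a)[name] == namespaces_of(pid_b)[name]
```

Calling namespaces_of() with no argument inspects the current process. Reading another process's links usually requires the same user or root. Comparing a containerized process against PID 1 with shares_namespace(pid, 1, "net") is a quick way to see which namespaces it has left behind.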

cgroups, the resource controllers

cgroups v2 (unified hierarchy, modern):

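cpu: weight (cpu.weight), bandwidth cap (cpu.max as "MAX PERIOD", both in microseconds; the literal max as the quota means no cap).
memory: memory.max (hard limit; the OOM killer runs if usage cannot be reclaimed below it), memory.high (soft limit; above it the group is throttled and its memory reclaimed).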
io: weight-based scheduling, bandwidth caps per device.
pids: pids.max prevents fork bombs.
hugetlb: huge page reservations.
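The controller files are plain text, so setting a limit is a file write. A minimal sketch, assuming the cgroup directory already exists and the controllers are enabled for it; cpu_max_value and apply_limits are hypothetical helper names, and the 100000 microsecond default period is the kernel's default for cpu.max.

```python
import os


def cpu_max_value(cpus=None, period_us=100_000):
    """Format a cpu.max value as "MAX PERIOD"; quota "max" means no cap."""
    if cpus is None:
        return f"max {period_us}"
    quota_us = int(round(cpus * period_us))
    return f"{quota_us} {period_us}"


def apply_limits(cgroup_dir, cpus=None, memory_bytes=None, max_pids=None):
    """Write limit files into an existing cgroup v2 directory.

    Returns the {file name: value} mapping that was written.
    """
    values = {"cpu.max": cpu_max_value(cpus)}
    if memory_bytes is not None:
        values["memory.max"] = str(memory_bytes)
    if max_pids is not None:
        values["pids.max"] = str(max_pids)
    for name, value in values.items():
        with open(os.path.join(cgroup_dir, name), "w") as f:
            f.write(value + "\n")
    return values
```

For example, apply_limits(path, cpus=1.5, memory_bytes=512 * 1024 * 1024, max_pids=200) writes "150000 100000" to cpu.max: 150 ms of CPU time per 100 ms period, i.e. one and a half cores. Writing under /sys/fs/cgroup normally needs root or a delegated subtree.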

Each container runs in its own cgroup. Kubernetes translates CPU requests into cpu.weight, and CPU and memory limits into cpu.max and memory.max.
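To make the request and limit mapping concrete, here is a rough sketch of the arithmetic. It assumes the linear conversion that runtimes have used to turn cgroup v1 style CPU shares (2 to 262144) into a cgroup v2 weight (1 to 10000); the exact formula, floors, and rounding can differ between runtime versions, and the function names are illustrative only.

```python
def cpu_request_to_weight(milli_cpu):
    """Map a CPU request in millicores to a cpu.weight value.

    Assumed pipeline: millicores -> v1-style shares (1024 per core),
    then a linear rescale of [2, 262144] onto [1, 10000].
    """
    shares = milli_cpu * 1024 // 1000
    shares = max(2, min(shares, 262144))
    return 1 + ((shares - 2) * 9999) // 262142


def cpu_limit_to_max(milli_cpu, period_us=100_000):
    """Map a CPU limit in millicores to a cpu.max "MAX PERIOD" string."""
    quota_us = milli_cpu * period_us // 1000
    return f"{quota_us} {period_us}"


print(cpu_request_to_weight(1000))  # 39
print(cpu_limit_to_max(500))        # 50000 100000
```

Under this mapping a request is only a relative weight: it matters when CPUs are contended. A limit is an absolute ceiling that is enforced every period, even on an idle machine.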

A container is namespaces (private view) plus cgroups (bounded resources).

What Docker does, briefly

clone(CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS | CLONE_NEWIPC | CLONE_NEWUTS) to create the new namespaces (plus CLONE_NEWUSER when user-namespace remapping is enabled).
Mount the container's filesystem (overlayfs on top of image layers).
pivot_root to the new fs root.
Set up the network: create a veth pair, move one end into the container's net namespace, bridge the other end.
Apply cgroup limits: write to /sys/fs/cgroup/.../cpu.max, etc.
Apply seccomp filter and capabilities.
exec the container's command.

The low-level container runtime (runc) performs these steps. Docker, containerd, and podman are layers above it that handle image management and container lifecycle.
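The namespace flags from the first step are ordinary bit flags. The sketch below builds the same mask and passes it to unshare(2) through ctypes. It uses unshare rather than clone because clone is awkward to call from Python; that substitution is this example's choice, not what runc does. The flag values come from <linux/sched.h>; the unshare wrapper is a hypothetical helper and needs Linux plus root (or CAP_SYS_ADMIN) to succeed.

```python
import ctypes
import os

# Namespace flag values from <linux/sched.h>.
CLONE_NEWNS = 0x00020000    # mount
CLONE_NEWUTS = 0x04000000   # hostname, domainname
CLONE_NEWIPC = 0x08000000   # SysV IPC, POSIX message queues
CLONE_NEWUSER = 0x10000000  # user and group IDs
CLONE_NEWPID = 0x20000000   # process IDs
CLONE_NEWNET = 0x40000000   # network stack

# The same set of namespaces as the clone() call in the step list.
CONTAINER_FLAGS = (
    CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS | CLONE_NEWIPC | CLONE_NEWUTS
)


def unshare(flags):
    """Detach the calling process into new namespaces via unshare(2)."""
    libc = ctypes.CDLL(None, use_errno=True)  # resolved lazily; Linux only
    if libc.unshare(flags) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


print(hex(CONTAINER_FLAGS))  # 0x6c020000
```

One caveat with this approach: after unshare(CLONE_NEWPID) the caller itself keeps its old PID, and only its next child becomes PID 1 in the new namespace. A typical sequence is unshare(CONTAINER_FLAGS), then fork, then pivot_root and exec in the child.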

"A container is just a process with private namespaces (pid, net, mnt, uts, ipc, user) and cgroup-imposed resource limits. Namespaces give it the illusion of being alone; cgroups cap its CPU, memory, I/O. Add overlayfs for layered images and seccomp for syscall restriction, and you have Docker. There's no 'container' kernel object. It's existing primitives stitched together by runc."
