Linux cgroups and namespaces (Docker basis)

Namespaces give processes a private view; cgroups limit their resource use. Together, that's a container.

The two halves of a container

A container is not a kernel object. It's a process tree wrapped in:

  1. Namespaces for isolation: own PIDs, own network interfaces, own mount table, own user IDs, own hostname, own IPC.
  2. cgroups for limits: CPU share, memory cap, block I/O quota, pids limit.

Plus a chroot-like filesystem root (pivot_root) and a seccomp filter for syscall restriction. Docker, containerd, runc, podman all use these same primitives.

Namespaces, the eight kinds

Namespace      Isolates
pid            Process IDs. The first process is PID 1 inside the namespace and acts as init.
net            Network interfaces, routing, iptables, /proc/net
mnt            Mount points. pivot_root to a new fs root.
uts            Hostname, domainname
ipc            SysV IPC, POSIX message queues
user           User and group IDs, e.g. UID 0 in the container mapped to UID 1000 on the host.
cgroup         Hides the outer cgroup hierarchy
time (newer)   Boot and monotonic clocks

Create with clone(CLONE_NEW*) flags or unshare. Enter with setns. List with ls -l /proc/PID/ns/.
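As a rough illustration of the listing step, the sketch below reads the symlinks under /proc/PID/ns and compares two processes. The function names and the proc_root parameter are assumptions made for this example; they are not part of any existing tool.

```python
import os


def list_namespaces(pid="self", proc_root="/proc"):
    """Return {namespace_name: link_target} for one process.

    Each entry under /proc/PID/ns is a symlink whose target looks like
    'pid:[4026531836]'. proc_root is a hypothetical parameter that lets
    the function be pointed at a different directory for testing.
    """
    ns_dir = os.path.join(proc_root, str(pid), "ns")
    result = {}
    for name in sorted(os.listdir(ns_dir)):
        result[name] = os.readlink(os.path.join(ns_dir, name))
    return result


def shared_namespaces(pid_a, pid_b, proc_root="/proc"):
    """Names of the namespaces that both processes are members of."""
    a = list_namespaces(pid_a, proc_root)
    b = list_namespaces(pid_b, proc_root)
    # Identical link targets mean the same namespace instance.
    return sorted(name for name in a if name in b and a[name] == b[name])
```

On a Linux host, list_namespaces() with no arguments describes the calling process. Comparing a containerized process against PID 1 of the host would typically show few or no shared namespaces, though reading another user's /proc/PID/ns entries may require elevated privileges.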

cgroups, the resource controllers

cgroups v2 (unified hierarchy, modern):

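  • cpu: weight (cpu.weight), bandwidth cap (cpu.max, written as "MAX PERIOD": quota and period in microseconds, so "50000 100000" allows half a CPU).
  • memory: memory.max (hard limit, OOM kill on overshoot), memory.high (soft limit; above it the kernel throttles the cgroup and reclaims its memory aggressively).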
  • io: weight-based scheduling, bandwidth caps per device.
  • pids: pids.max prevents fork bombs.
  • hugetlb: limits on huge page usage.
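To make the interface files concrete, here is a minimal sketch that renders values for cpu.max, memory.max and pids.max and writes them into a cgroup directory. The helper names and parameters are hypothetical. On a real system the directory would be an existing cgroup under /sys/fs/cgroup, the relevant controllers would need to be enabled in the parent's cgroup.subtree_control, and writing usually requires root.

```python
import os


def cgroup_v2_limits(cpus=None, memory_bytes=None, max_pids=None,
                     period_us=100_000):
    """Build {file_name: contents} for three cgroup v2 interface files.

    None means "no limit", which cgroup v2 spells as the string "max".
    """
    quota = "max" if cpus is None else str(int(round(cpus * period_us)))
    return {
        # cpu.max holds "QUOTA PERIOD", both in microseconds.
        "cpu.max": f"{quota} {period_us}",
        "memory.max": "max" if memory_bytes is None else str(int(memory_bytes)),
        "pids.max": "max" if max_pids is None else str(int(max_pids)),
    }


def apply_limits(cgroup_dir, limits):
    """Write each value into the matching file inside cgroup_dir."""
    for name, value in limits.items():
        with open(os.path.join(cgroup_dir, name), "w") as f:
            f.write(value + "\n")
```

For example, cgroup_v2_limits(cpus=0.5, memory_bytes=256 * 1024 * 1024, max_pids=100) yields "50000 100000" for cpu.max, "268435456" for memory.max and "100" for pids.max.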

Each container runs in its own cgroup. On cgroups v2, Kubernetes translates CPU requests into cpu.weight and CPU and memory limits into cpu.max and memory.max.
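One way to see which cgroup a process belongs to is to read /proc/PID/cgroup, where each line has the form hierarchy-ID:controller-list:path and the cgroups v2 entry is the one that starts with "0::". The parser below is a small sketch; the function name and the sample path in the usage note are made up for illustration.

```python
def cgroup_v2_path(proc_cgroup_text):
    """Return the cgroup v2 path found in /proc/PID/cgroup contents.

    Returns None when no unified-hierarchy ("0::") entry is present,
    as on a host that only uses cgroups v1.
    """
    for line in proc_cgroup_text.splitlines():
        parts = line.strip().split(":", 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        hierarchy_id, controllers, path = parts
        if hierarchy_id == "0" and controllers == "":
            return path
    return None
```

Given the hypothetical input "0::/example.slice/demo.scope", the function returns "/example.slice/demo.scope". Appending that path to /sys/fs/cgroup gives the directory that holds the process's cpu.max, memory.max and related files, assuming the usual mount point.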

A container is namespaces (private view) plus cgroups (bounded resources).

What Docker does, briefly

  1. clone(CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS | CLONE_NEWIPC | CLONE_NEWUTS) to create the new namespaces.
  2. Mount the container's filesystem (overlayfs on top of image layers).
  3. pivot_root to the new fs root.
  4. Set up the network: create a veth pair, move one end into the container's net namespace, bridge the other end.
  5. Apply cgroup limits: write to /sys/fs/cgroup/.../cpu.max, etc.
  6. Apply seccomp filter and capabilities.
  7. exec the container's command.
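Step 1 can be sketched in Python by combining the CLONE_NEW* bits and handing them to unshare(2) through libc. This is only an illustration of the flag arithmetic, not how runc is implemented; the helper names are assumptions, and the constants are copied from <linux/sched.h>.

```python
import ctypes
import os

# Flag values from <linux/sched.h>.
NAMESPACE_FLAGS = {
    "mnt": 0x00020000,  # CLONE_NEWNS
    "uts": 0x04000000,  # CLONE_NEWUTS
    "ipc": 0x08000000,  # CLONE_NEWIPC
    "pid": 0x20000000,  # CLONE_NEWPID
    "net": 0x40000000,  # CLONE_NEWNET
}


def namespace_flags(names):
    """OR together the CLONE_NEW* bits for the given namespace names."""
    flags = 0
    for name in names:
        flags |= NAMESPACE_FLAGS[name]
    return flags


def unshare(names):
    """Call unshare(2) via libc. Linux only; usually needs CAP_SYS_ADMIN.

    With "pid" in names, the caller itself stays in its old PID
    namespace; only children created afterwards land in the new one.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.unshare(namespace_flags(names)) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

The five namespaces from step 1 combine to namespace_flags(["pid", "net", "mnt", "ipc", "uts"]) == 0x6C020000. Calling unshare() without sufficient privileges is expected to raise OSError with EPERM.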

The low-level "container runtime" (runc) does this. Docker, containerd, and podman are layers above it that handle image management and orchestration.

The interview answer

"A container is just a process with private namespaces (pid, net, mnt, uts, ipc, user) and cgroup-imposed resource limits. Namespaces give it the illusion of being alone; cgroups cap its CPU, memory, I/O. Add overlayfs for layered images and seccomp for syscall restriction, and you have Docker. There's no 'container' kernel object. It's existing primitives stitched together by runc."
