Linux cgroups and namespaces (Docker basis)
Namespaces give processes a private view; cgroups limit their resource use. Together, that's a container.
The two halves of a container
A container is not a kernel object. It's a process tree wrapped in:
- Namespaces for isolation: own PIDs, own network interfaces, own mount table, own user IDs, own hostname, own IPC.
- cgroups for limits: CPU share, memory cap, block I/O quota, pids limit.
Plus a chroot-like filesystem root (pivot_root) and a seccomp filter for syscall restriction. Docker, containerd, runc, podman all use these same primitives.
Namespaces, the eight kinds
| Namespace | Isolates |
|---|---|
| pid | Process IDs. The first process in the namespace becomes PID 1 and plays the init role. |
| net | Network interfaces, routing, iptables, /proc/net |
| mnt | Mount points. pivot_root to a new fs root. |
| uts | Hostname, domainname |
| ipc | SysV IPC, POSIX message queues |
| user | User and group IDs. Can map UID 0 in the container to an unprivileged UID (for example 1000) on the host. |
| cgroup | Hides outer cgroup hierarchy |
| time (newer) | Boot and monotonic clocks |
Create with clone(CLONE_NEW*) flags or unshare. Enter with setns. List with ls -l /proc/PID/ns/.
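
As a rough illustration of the ls -l /proc/PID/ns/ step, the sketch below reads the same symlinks from Python. Each link target has the form kind:[inode], and two processes share a namespace when the inodes match. The helper names (parse_ns_link, list_namespaces, same_namespace) are made up for this example, and it returns an empty result where /proc is not available.

```python
import os
import re

# Symlink targets under /proc/<pid>/ns look like "pid:[4026531836]".
_NS_LINK = re.compile(r"^(?P<kind>[a-z_]+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target):
    """Split a namespace symlink target into (kind, inode)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"unexpected namespace link: {target!r}")
    return match.group("kind"), int(match.group("inode"))


def list_namespaces(pid="self"):
    """Return {entry name: inode} for a process, or {} if /proc is unavailable."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    try:
        entries = os.listdir(ns_dir)
    except OSError:
        return result  # not Linux, or not allowed to inspect this PID
    for name in sorted(entries):
        try:
            _, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        except (OSError, ValueError):
            continue  # skip entries we cannot read or do not recognise
        result[name] = inode
    return result


def same_namespace(pid_a, pid_b, name):
    """True when both processes expose the same inode for the given entry."""
    a = list_namespaces(pid_a).get(name)
    b = list_namespaces(pid_b).get(name)
    return a is not None and a == b


if __name__ == "__main__":
    for name, inode in list_namespaces().items():
        print(f"{name:20s} {inode}")
```

Comparing a containerized process against PID 1 on the host with same_namespace(pid, 1, "net") is one way to check whether it really has its own network namespace; reading another user's process usually requires root.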
cgroups, the resource controllers
cgroups v2 (unified hierarchy, modern):
- cpu: weight (cpu.weight), bandwidth cap (cpu.max as "MAX PERIOD": at most MAX microseconds of CPU time per PERIOD microseconds, or "max" for no cap).
- memory: memory.max (hard limit, OOM kill on overshoot), memory.high (soft, throttles allocations).
- io: weight-based scheduling, bandwidth caps per device.
- pids: pids.max prevents fork bombs.
- hugetlb: huge page reservations.
Each container runs in its own cgroup. Kubernetes translates CPU requests into cpu.weight and limits into cpu.max and memory.max.
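
To make the cpu.max "MAX PERIOD" format concrete, here is a minimal sketch that converts a CPU limit in cores into that string and writes cpu.max, memory.max and pids.max into a directory. The function names and the 100000 microsecond default period are choices made for this example. Writing to a real cgroup under /sys/fs/cgroup needs root and the controllers enabled, so the demo targets a temporary directory instead.

```python
import os
import tempfile


def format_cpu_max(cpus, period_us=100_000):
    """Render a cpu.max value "MAX PERIOD"; cpus=None means no cap."""
    if cpus is None:
        return f"max {period_us}"
    if cpus <= 0:
        raise ValueError("cpus must be positive")
    quota_us = round(cpus * period_us)  # CPU time allowed per period
    return f"{quota_us} {period_us}"


def parse_cpu_max(text):
    """Return the CPU cap in cores, or None when the quota is "max"."""
    quota, period = text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def write_limits(cgroup_dir, cpus=None, memory_bytes=None, pids=None):
    """Write cpu.max, memory.max and pids.max files into cgroup_dir."""
    values = {
        "cpu.max": format_cpu_max(cpus),
        "memory.max": "max" if memory_bytes is None else str(memory_bytes),
        "pids.max": "max" if pids is None else str(pids),
    }
    for name, value in values.items():
        with open(os.path.join(cgroup_dir, name), "w") as handle:
            handle.write(value)
    return values


if __name__ == "__main__":
    # Stand-in directory: a real run would point at a cgroup the caller owns.
    with tempfile.TemporaryDirectory() as fake_cgroup:
        written = write_limits(fake_cgroup, cpus=0.5,
                               memory_bytes=256 * 1024 * 1024, pids=100)
        for name, value in written.items():
            print(f"{name}: {value}")
```

Half a core with the default period comes out as "50000 100000", and two cores as "200000 100000", since the quota may exceed the period on multi-CPU machines.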
What Docker does, briefly
- clone(CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS | CLONE_NEWIPC | CLONE_NEWUTS) to create the new namespaces.
- Mount the container's filesystem (overlayfs on top of image layers).
- pivot_root to the new fs root.
- Set up the network: create a veth pair, move one end into the container's net namespace, bridge the other end.
- Apply cgroup limits: write to /sys/fs/cgroup/.../cpu.max, etc.
- Apply seccomp filter and capabilities.
- exec the container's command.
The "container runtime" (runc) does this. Docker, containerd, podman are layers above runc that handle image management and orchestration.
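
The clone call in the first step takes its namespaces as one OR-ed bitmask. The sketch below shows how that mask is assembled and decoded. The constant values are copied from <linux/sched.h> so the example has no dependencies; the lookup table and helper names are invented for illustration, and nothing here actually creates a namespace.

```python
# Flag values as defined in <linux/sched.h>.
CLONE_NEWTIME = 0x00000080
CLONE_NEWNS = 0x00020000      # mount namespace
CLONE_NEWCGROUP = 0x02000000
CLONE_NEWUTS = 0x04000000
CLONE_NEWIPC = 0x08000000
CLONE_NEWUSER = 0x10000000
CLONE_NEWPID = 0x20000000
CLONE_NEWNET = 0x40000000

# Hypothetical lookup table keyed by the short names used in /proc/PID/ns.
FLAG_BY_KIND = {
    "time": CLONE_NEWTIME,
    "mnt": CLONE_NEWNS,
    "cgroup": CLONE_NEWCGROUP,
    "uts": CLONE_NEWUTS,
    "ipc": CLONE_NEWIPC,
    "user": CLONE_NEWUSER,
    "pid": CLONE_NEWPID,
    "net": CLONE_NEWNET,
}


def clone_flags(kinds):
    """OR together the CLONE_NEW* flags for the requested namespace kinds."""
    flags = 0
    for kind in kinds:
        flags |= FLAG_BY_KIND[kind]  # KeyError on an unknown kind
    return flags


def kinds_in(flags):
    """Decode a bitmask back into a sorted list of namespace kinds."""
    return sorted(kind for kind, bit in FLAG_BY_KIND.items() if flags & bit)


if __name__ == "__main__":
    docker_like = clone_flags(["pid", "net", "mnt", "ipc", "uts"])
    print(hex(docker_like))       # 0x6c020000
    print(kinds_in(docker_like))  # ['ipc', 'mnt', 'net', 'pid', 'uts']
```

A real runtime passes a mask like this to clone or unshare with sufficient privileges (or together with a user namespace), then carries on with the mount, network, cgroup and seccomp steps.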
The interview answer
"A container is just a process with private namespaces (pid, net, mnt, uts, ipc, user) and cgroup-imposed resource limits. Namespaces give it the illusion of being alone; cgroups cap its CPU, memory, I/O. Add overlayfs for layered images and seccomp for syscall restriction, and you have Docker. There's no 'container' kernel object. It's existing primitives stitched together by runc."
Learn more
- Docs: cgroups v2 (kernel.org)
- Docs: namespaces overview (man7.org)