System calls and user vs kernel mode
Rings, the syscall ABI, KPTI cost, vDSO, seccomp, eBPF tracing, and what io_uring does differently.
Privilege rings, the hardware story
x86 defines four privilege rings, 0 through 3. In practice every modern OS uses two: ring 0 (kernel) and ring 3 (userspace). Each ring has different instruction privileges. Ring 0 can write control registers (CR3, CR4), execute privileged instructions such as HLT and INVLPG, access I/O ports, and modify page tables. Ring 3 cannot: by default, attempting any of these from ring 3 raises a general-protection fault.
The MMU enforces ring boundaries via page table entries. Each PTE has a U/S bit. Pages marked Supervisor are accessible only from ring 0. Attempting to access a Supervisor page from ring 3 triggers a page fault. This is how the kernel protects itself from userspace.
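The U/S check can be sketched as a toy model. The flag positions below are the architectural low bits of an x86 page table entry; the function name access_allowed is made up for illustration, and the model deliberately ignores upper-level table entries, SMEP/SMAP, and rings 1 and 2.

```python
# Low flag bits of an x86 page table entry (architectural bit positions).
PTE_PRESENT = 1 << 0   # P: the page is mapped
PTE_WRITABLE = 1 << 1  # R/W: writes are allowed
PTE_USER = 1 << 2      # U/S: set = user-accessible, clear = supervisor-only


def access_allowed(pte, ring):
    """Toy model (hypothetical helper): may code in `ring` touch this page?"""
    if not pte & PTE_PRESENT:
        return False   # not-present page: page fault in any ring
    if ring == 3 and not pte & PTE_USER:
        return False   # supervisor page touched from ring 3: page fault
    return True


kernel_page = PTE_PRESENT | PTE_WRITABLE            # U/S clear
user_page = PTE_PRESENT | PTE_WRITABLE | PTE_USER   # U/S set

print(access_allowed(kernel_page, ring=3))
print(access_allowed(kernel_page, ring=0))
print(access_allowed(user_page, ring=3))
```

Only the first call is refused: the same supervisor page is reachable from ring 0 but not from ring 3.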
ARM has its own version (Exception Levels EL0-EL3). RISC-V has machine, supervisor, user modes. The principles are the same.
The syscall instruction
On older 32-bit x86 Linux, syscalls went through a software interrupt: int 0x80. That worked but was slow, because the interrupt is dispatched through the IDT, with the privilege checks and state saving that implies.
sysenter (Intel) and syscall (AMD) are fast-path instructions added to give syscalls a dedicated, lower-overhead entry. On x86_64 Linux:
- Userspace puts the syscall number in rax and args in rdi, rsi, rdx, r10, r8, r9.
- Executes syscall. The CPU saves rip into rcx, rflags into r11, and jumps to the address in MSR_LSTAR (the syscall entry point).
- The CPU switches privilege level to ring 0 but does NOT switch stacks; the kernel entry does that manually.
- The kernel entry sets up the kernel stack, switches CR3 if KPTI is on, then dispatches via sys_call_table[rax].
- After the work, sysret puts rcx back into rip, restores user flags, drops to ring 3.
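To see the number-in-rax convention from userspace, here is a minimal sketch that goes through libc's generic syscall() wrapper via ctypes. It assumes x86_64 Linux, where getpid is syscall number 39; other architectures use different numbers, so the helper (raw_getpid, a name invented here) falls back to os.getpid() elsewhere.

```python
import ctypes
import os
import platform

# x86_64 Linux syscall number for getpid (assumption: other architectures differ).
SYS_GETPID_X86_64 = 39


def raw_getpid():
    """Invoke getpid by number through libc's syscall() wrapper when possible."""
    if platform.system() != "Linux" or platform.machine() != "x86_64":
        return os.getpid()  # fall back where 39 means something else
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    # libc places the number in rax and executes the syscall instruction.
    return libc.syscall(ctypes.c_long(SYS_GETPID_X86_64))


print(raw_getpid() == os.getpid())
```

The wrapper does the register setup for you; the point is only that the kernel identifies the call by number, not by name.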
KPTI and the Meltdown tax
Meltdown (CVE-2017-5754) showed that userspace could read kernel memory via speculative execution side channels. The mitigation is Kernel Page Table Isolation (KPTI): the kernel maintains two sets of page tables. The userspace set has only a tiny stub of kernel mappings (just the syscall entry trampoline). The kernel set has everything.
On every syscall, KPTI writes CR3 to load the full kernel page tables. On return, it switches back. A CR3 write traditionally flushes all non-global TLB entries, which would be devastating at syscall rates. PCID (process-context identifier) makes this cheaper by tagging TLB entries with a context ID, so the switch does not have to flush them.
Even with PCID, KPTI makes a syscall two to three times as expensive. Before KPTI, a syscall cost roughly 100-150ns; after, roughly 300-500ns (exact figures depend on the CPU). For syscall-heavy workloads (proxies, network servers, databases) this was a measurable regression. Workloads that were already async/batched saw little impact.
vDSO: syscalls that aren't
Some syscalls don't need to actually enter the kernel. clock_gettime(CLOCK_MONOTONIC) reads a timestamp that the kernel updates and exposes via a shared memory page. The kernel maps a small ELF object (the vDSO) into every process's address space; userspace calls into it as a function call.
vDSO functions on Linux x86_64:
- clock_gettime
- gettimeofday
- time
- getcpu
These cost about 20ns instead of 300ns. Glibc transparently uses vDSO when available. You can see it in your process via cat /proc/PID/maps | grep vdso.
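The same lookup can be done programmatically. This is a minimal sketch: find_vdso is a hypothetical helper, and the sample line is illustrative data in the /proc/PID/maps format, not output from a real process.

```python
def find_vdso(maps_text):
    """Return (start, end) addresses of the [vdso] mapping, or None if absent."""
    for line in maps_text.splitlines():
        fields = line.split()
        if fields and fields[-1] == "[vdso]":
            start, end = fields[0].split("-")
            return int(start, 16), int(end, 16)
    return None


# Illustrative line in /proc/PID/maps format (addresses are made up).
sample = "7ffd3a5f2000-7ffd3a5f4000 r-xp 00000000 00:00 0      [vdso]\n"
print(find_vdso(sample))

# On Linux, the same function works on the live mapping of this process.
try:
    with open("/proc/self/maps") as f:
        print(find_vdso(f.read()))
except OSError:
    print("no /proc/self/maps on this system")
```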
strace, ltrace, and the cost of tracing
strace uses ptrace to intercept every syscall. The traced process stops twice per syscall, once on entry and once on exit, and each stop means a context switch to strace and back plus a fair amount of bookkeeping. strace can slow a syscall-heavy program 10-50x.
For production observability, use eBPF-based tools:
- bpftrace 'tracepoint:syscalls:sys_enter_read { @[comm] = count(); }'
- bcc-tools/syscount
- perf trace
These hook into kernel tracepoints (with eBPF in the case of bpftrace and bcc). Their overhead is low enough for production use, and they can attach to an already-running process without restarting it.
seccomp: restricting what userspace can do
seccomp filters which syscalls a process is allowed to make. Docker, Kubernetes, Chrome sandboxes, and Firefox all use it. A seccomp BPF filter inspects the syscall number and argument values (it cannot dereference pointers) and returns an action such as ALLOW, KILL, ERRNO, or TRAP.
The filter is a classic BPF program that the process (or its container runtime) installs, typically at startup. Once installed it cannot be removed, and it is inherited across fork and exec. Combined with namespaces and cgroups, seccomp is the third leg of the container security stool.
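The decision a filter makes can be modeled in a few lines. This is a userspace toy, not the seccomp API: the policy, the toy_filter name, and the string actions are invented for illustration, and the syscall numbers are the x86_64 ones.

```python
# x86_64 syscall numbers (assumption: other architectures use different numbers).
SYS_READ, SYS_WRITE, SYS_EXIT, SYS_EXIT_GROUP, SYS_OPENAT = 0, 1, 60, 231, 257

ALLOW, ERRNO, KILL = "ALLOW", "ERRNO", "KILL"


def toy_filter(nr, args):
    """Hypothetical policy: pick an action from the syscall number and argument values."""
    if nr in (SYS_READ, SYS_EXIT, SYS_EXIT_GROUP):
        return ALLOW
    if nr == SYS_WRITE:
        # Argument check: only allow writes to stdout (1) or stderr (2).
        return ALLOW if args[0] in (1, 2) else ERRNO
    if nr == SYS_OPENAT:
        return ERRNO   # fail the call with an error instead of killing the process
    return KILL        # default-deny everything else


print(toy_filter(SYS_WRITE, (1, 0, 0)))
print(toy_filter(SYS_WRITE, (5, 0, 0)))
print(toy_filter(SYS_OPENAT, (0, 0, 0)))
print(toy_filter(59, ()))  # 59 is execve on x86_64
```

A real filter expresses the same comparisons as BPF instructions over the syscall number and argument registers; the default-deny shape at the end is a common design choice for sandboxes.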
io_uring: the redesign
The traditional syscall model is one-call, one-result. io_uring (introduced in 5.1, mature by 5.10+) is many-calls, many-results via shared memory rings.
Two rings: submission queue (SQ) and completion queue (CQ). Userspace writes operations (read, write, accept, fsync, even openat) into the SQ. One syscall (io_uring_enter) tells the kernel to process them. The kernel writes results into the CQ. Userspace reads them.
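The flow through the two rings can be sketched with a toy model. Nothing here touches the real io_uring interface: Ring and toy_enter are hypothetical names, the "kernel" is an ordinary function, and real entries carry far more fields than an (operation, user_data) pair.

```python
class Ring:
    """Fixed-size single-producer/single-consumer ring with free-running counters."""

    def __init__(self, entries):
        assert entries > 0 and entries & (entries - 1) == 0, "size must be a power of two"
        self.slots = [None] * entries
        self.mask = entries - 1
        self.head = 0  # next slot the consumer reads
        self.tail = 0  # next slot the producer writes

    def push(self, item):
        if self.tail - self.head == len(self.slots):
            return False  # ring full
        self.slots[self.tail & self.mask] = item
        self.tail += 1
        return True

    def pop(self):
        if self.head == self.tail:
            return None  # ring empty
        item = self.slots[self.head & self.mask]
        self.head += 1
        return item


def toy_enter(sq, cq):
    """Stand-in for io_uring_enter: drain the SQ, post one completion per request."""
    processed = 0
    while (req := sq.pop()) is not None:
        op, user_data = req
        cq.push((user_data, f"{op} done"))
        processed += 1
    return processed


sq, cq = Ring(4), Ring(8)
for i, op in enumerate(["read", "write", "fsync"]):
    sq.push((op, i))       # userspace queues three operations, no syscall yet
n = toy_enter(sq, cq)      # one "syscall" covers the whole batch
completions = [cq.pop() for _ in range(n)]
print(n, completions)
```

Three operations cost one call into toy_enter, which is the point of the design: the per-operation work is a memory write, and the kernel crossing is paid once per batch.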
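In submission-queue polling mode (IORING_SETUP_SQPOLL), even the io_uring_enter syscall goes away: a kernel thread polls the SQ continuously, so a busy application makes zero syscalls per I/O at steady state. If the thread goes idle, userspace has to wake it with io_uring_enter. This is a large part of why io_uring can beat epoll for high-throughput storage and network workloads.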
Trade-offs: complexity (the API has many footguns), security (it has had several CVEs), and not every operation has equal kernel support yet. Many production systems still use epoll for compatibility.
The syscall ABI is the most stable thing in Linux
Linus Torvalds has a famous policy: do not break userspace. The syscall numbers, semantics, and behavior are stable across kernel versions. A statically linked binary built for Linux 2.6 still runs on Linux 6.x because syscalls 0-300 still mean the same things.
This is why containers work: the container has its own libc but uses the host kernel's syscalls. Compatibility is at the syscall layer, not the library layer.
This is also why Windows Subsystem for Linux (WSL1) was possible: Microsoft implemented the Linux syscall ABI on top of the NT kernel. WSL2 switched to a real Linux kernel in a VM because some syscalls were too complex to emulate fully.
How many syscalls is too many
A rough heuristic for a network server: anything over 100k syscalls per second per core deserves investigation. At 300ns per syscall, 100k/s is 30ms/s of pure syscall overhead, or 3% of CPU on that alone before any work.
Take a database doing 100k queries per second per core: a naive design (one read per page, one write per WAL flush) easily reaches ~500k syscalls/s. With batching (group commit, vectored I/O) the ratio improves dramatically.
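The arithmetic behind these figures is short enough to write down. This sketch assumes the flat 300ns per syscall used above; syscall_overhead is a name invented for the example.

```python
def syscall_overhead(calls_per_second, cost_ns=300):
    """Fraction of one core's time spent on syscall entry/exit alone."""
    return calls_per_second * cost_ns / 1_000_000_000


for rate in (100_000, 500_000):
    print(f"{rate:>7} syscalls/s -> {syscall_overhead(rate):.0%} of a core")
```

At 100k syscalls/s this gives 3% of a core, and at the naive database's 500k syscalls/s it gives 15%, before any useful work is done.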
Modern proxies (Envoy, HAProxy in some modes) can use io_uring or kTLS to cut syscall counts, in favorable cases by an order of magnitude.
Cost table from real measurements
| Operation | Cost |
|---|---|
| getpid | 50ns |
| clock_gettime (vDSO) | 20ns |
| clock_gettime (no vDSO) | 300ns |
| read from pipe, no data | 400ns |
| write to /dev/null | 200ns |
| open + close | 1us |
| sendto on UDP socket | 800ns |
| epoll_wait with 1 ready fd | 500ns |
| io_uring submit + complete (no polling) | 300ns amortized in batch |
Pitfalls
Mental model
A syscall is a phone call from your office to the building's security desk. You can do most things at your desk, but for anything involving the building's resources (mail, printer, kicking someone out), you call security. The call is fast (~300ns), but each call interrupts your flow. The trick is to batch: instead of calling once per envelope, hand over the whole mail bin at once. That is what readv, sendmmsg, and io_uring are.
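Handing over the whole mail bin looks like this with vectored I/O. It is a minimal sketch using os.writev on a pipe (Unix only); the three buffers are arbitrary example data.

```python
import os

r, w = os.pipe()
chunks = [b"hello, ", b"world", b"\n"]

# One writev call hands all three buffers to the kernel in a single syscall,
# instead of three separate write calls.
written = os.writev(w, chunks)
data = os.read(r, 64)

os.close(r)
os.close(w)
print(written, data)
```

The reader sees one contiguous byte stream; only the number of kernel crossings on the write side changes.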
Learn more
- Docs: Linux syscall table (man7.org)
- Paper: io_uring paper and docs (Jens Axboe)
- Docs: Brendan Gregg: BPF Performance Tools (Brendan Gregg)
- Article: LWN: KPTI (LWN)