System calls and user vs kernel mode
A syscall is a controlled door from your code into the kernel. Cost: ~100-500ns per crossing, more with KPTI.
The answer
CPUs have privilege rings. Ring 0 is the kernel: full access to hardware, page tables, every instruction. Ring 3 is userspace: restricted instructions, no direct hardware access. (ARM64 calls the same split exception levels EL1 and EL0.) To go from ring 3 to ring 0, your code executes a syscall instruction (x86_64) or svc (ARM64), which traps into a kernel entry point whose address the kernel registered at boot (the LSTAR MSR on x86_64). The kernel does the work and returns via sysret (eret on ARM64).
Direct cost of a syscall is about 100-500ns. With KPTI (Meltdown mitigation) it can be 1us+, because each crossing also switches page tables. Indirect cost is of the same kind as for a context switch: the kernel's code and data push yours out of the caches, TLB, and branch predictor, and the more kernel code the call touches, the colder your state is when it returns.
What requires a syscall
- Any I/O: read, write, send, recv, open, close.
- Any memory map change: mmap, munmap, brk, mprotect.
- Any thread or process operation: fork, clone, exit, wait.
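- Any time-related call that has to sleep or arm a timer: nanosleep, timerfd_settime. Reading the clock (clock_gettime, gettimeofday) is normally served by the vDSO and only falls back to a real syscall when the clock source cannot be read from user mode.
- Any sync that has to block or wake another thread: futex (an uncontended lock stays in userspace), plus signal setup and delivery (sigaction, kill).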
Things that look like syscalls but aren't (they live in the vDSO, mapped into your address space): clock_gettime, gettimeofday, getcpu, time. The kernel maps a small shared library into every process so these hot calls run in user mode.
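One way to see that mapping is to look for the `[vdso]` entry in `/proc/self/maps`. The sketch below is Python and assumes Linux with `/proc` mounted; `find_vdso` and `vdso_of_current_process` are illustrative helper names, not part of any library.

```python
from typing import Iterable, Optional, Tuple


def find_vdso(maps_lines: Iterable[str]) -> Optional[Tuple[int, int, str]]:
    """Return (start, end, perms) of the [vdso] mapping, or None if absent.

    Each line of /proc/<pid>/maps has the form:
    start-end perms offset dev inode pathname
    """
    for line in maps_lines:
        fields = line.split()
        if fields and fields[-1] == "[vdso]":
            start, end = (int(part, 16) for part in fields[0].split("-"))
            return start, end, fields[1]
    return None


def vdso_of_current_process() -> Optional[Tuple[int, int, str]]:
    # Assumption: Linux with /proc mounted; elsewhere open() raises OSError.
    with open("/proc/self/maps") as maps:
        return find_vdso(maps)
```

Calling `vdso_of_current_process()` on a Linux machine should return a small address range, typically a few pages marked readable and executable. The address usually changes between runs because of address space layout randomization.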
The mechanism
userspace                        kernel
---------                        ------
mov $NR_read, %rax               (syscall handler)
mov fd, %rdi                     <- read args from regs
mov buf, %rsi
mov len, %rdx
syscall           -------->      switch to ring 0
                                 do the work
                                 put result in %rax
                  <--------      sysret
check %rax for error
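The six argument registers on Linux x86_64 are rdi, rsi, rdx, r10, r8, r9. r10 takes the place of rcx from the function-call ABI because the syscall instruction overwrites rcx with the return address (and r11 with the saved flags). The return value is in rax. On failure the kernel returns a negative errno value in the range -4095 to -1; the libc wrapper converts that into a -1 return with errno set.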
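The error convention is easy to get wrong, so here is a minimal sketch of how a wrapper might interpret the raw value left in rax. It is written in Python for readability and is not glibc's actual code; `decode_syscall_return` and `describe` are hypothetical names.

```python
import errno
import os
from typing import Tuple

# Linux reserves the raw return values -4095..-1 for errors.
MAX_ERRNO = 4095


def decode_syscall_return(raw: int) -> Tuple[int, int]:
    """Map a raw kernel return value to (result, err).

    On failure result is -1 and err is the positive errno,
    which mirrors what a libc wrapper hands back to C code.
    On success err is 0 and result is the raw value unchanged.
    """
    if -MAX_ERRNO <= raw <= -1:
        return -1, -raw
    return raw, 0


def describe(raw: int) -> str:
    result, err = decode_syscall_return(raw)
    if err:
        name = errno.errorcode.get(err, str(err))
        return f"error {name}: {os.strerror(err)}"
    return f"ok, result={result}"
```

For example, a read on a closed file descriptor leaves `-EBADF` in rax, which `decode_syscall_return` turns into `(-1, errno.EBADF)`. Anything outside the reserved range, such as a byte count, passes through untouched.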
Cost matters
| Syscall pattern | Approximate cost |
|---|---|
| getpid (near-minimal kernel round trip) | 50ns |
| clock_gettime (vDSO) | 20ns |
| write to /dev/null | 200ns |
| read from disk (cached) | 1us |
| read from disk (uncached) | 100us (SSD) to 10ms (HDD) |
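These figures vary with CPU, kernel version, and mitigations, so it is worth measuring on your own machine. The rough Python sketch below assumes Linux; `ns_per_call` and `measure` are illustrative names. The interpreter adds its own per-call overhead, so the absolute numbers come out well above the table. Compare each row against the empty-call baseline rather than reading it as the raw syscall cost.

```python
import os
import time
from typing import Callable, Dict


def ns_per_call(fn: Callable[[], object], n: int) -> float:
    """Average wall-clock nanoseconds per call of fn over n iterations."""
    start = time.perf_counter_ns()
    for _ in range(n):
        fn()
    return (time.perf_counter_ns() - start) / n


def measure(n: int = 200_000) -> Dict[str, float]:
    fd = os.open(os.devnull, os.O_WRONLY)
    try:
        return {
            # Interpreter overhead only: no kernel or vDSO involved.
            "empty call (baseline)": ns_per_call(lambda: None, n),
            # Usually served by the vDSO, so no ring transition.
            "clock_gettime": ns_per_call(
                lambda: time.clock_gettime(time.CLOCK_MONOTONIC), n
            ),
            # A real syscall that does almost no work in the kernel.
            "getpid": ns_per_call(os.getpid, n),
            # A real syscall through the VFS layer.
            "write 1 byte to /dev/null": ns_per_call(
                lambda: os.write(fd, b"x"), n
            ),
        }
    finally:
        os.close(fd)


if __name__ == "__main__":
    for name, ns in measure().items():
        print(f"{name:28s} {ns:8.0f} ns/call")
```

For numbers closer to the hardware, the same loop in C avoids the interpreter overhead.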
strace can slow a syscall-heavy program by 10x or more because it intercepts every syscall via ptrace, stopping the traced process on each entry and exit. Use perf or eBPF (bpftrace) for low-overhead tracing.
How to do fewer
- readv and writev: pass multiple buffers in one syscall.
- sendmmsg and recvmmsg: send/receive multiple UDP datagrams in one call.
- io_uring: submit a batch of operations, get notifications when done. One syscall for thousands of operations.
- Buffered I/O (FILE* in C, BufferedWriter in Java): library buffers data and flushes in large syscalls.
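Two of these ideas fit in a short Python sketch, assuming a POSIX system where `os.writev` is available. `write_gathered` and `BufferedFd` are hypothetical names. `BufferedFd` only illustrates the idea behind FILE* or BufferedWriter; it is not how either is implemented.

```python
import os
from typing import Sequence


def write_gathered(fd: int, chunks: Sequence[bytes]) -> int:
    """Write all chunks with a single writev call instead of one write each.

    Returns the number of bytes written, which can be short on
    sockets or pipes under pressure; real code should loop.
    """
    return os.writev(fd, chunks)


class BufferedFd:
    """Collect small writes in userspace and flush them in large syscalls."""

    def __init__(self, fd: int, capacity: int = 64 * 1024) -> None:
        self.fd = fd
        self.capacity = capacity
        self.buf = bytearray()
        self.syscalls = 0  # number of write calls actually issued

    def write(self, data: bytes) -> None:
        self.buf += data  # no kernel crossing here
        if len(self.buf) >= self.capacity:
            self.flush()

    def flush(self) -> None:
        while self.buf:
            written = os.write(self.fd, self.buf)
            self.syscalls += 1
            del self.buf[:written]  # keep any unwritten tail for the next loop
```

With the default 64 KiB capacity, ten thousand 16-byte writes through `BufferedFd` reach the kernel as a handful of `write` calls rather than ten thousand. The trade-off is that data sits in user memory until `flush` runs, so a crash before the flush loses it.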
The interview answer
"Userspace is ring 3, kernel is ring 0. A syscall is the controlled entry point: the syscall instruction traps into a fixed handler, the kernel dispatches by syscall number, returns via sysret. Cost is 100-500ns plus KPTI overhead. The trap itself is fast; the real trick is doing fewer of them. io_uring is the modern answer at scale because you submit batches and get notified, instead of one syscall per operation."
Learn more
- Docs: Linux syscall table (man7.org)
- Article: How does strace work (Julia Evans)