System calls and user vs kernel mode

A syscall is a controlled door from your code into the kernel. Cost: ~100-500ns per crossing, more with KPTI.

The answer

x86 CPUs have privilege rings (ARM64 calls them exception levels, EL0 and EL1). Ring 0 is the kernel: full access to hardware, page tables, every instruction. Ring 3 is userspace: restricted instructions, no direct hardware access. To go from ring 3 to ring 0, your code executes a syscall instruction (x86_64) or svc (ARM64), which traps into a kernel entry point whose address the kernel registered at boot (the LSTAR MSR on x86_64), so userspace cannot choose where it lands. The kernel does the work and returns via sysret (eret on ARM64).

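Direct cost of a syscall is about 100-500ns. With KPTI (the Meltdown mitigation, which switches page tables on every entry and exit) it can be 1us+. Indirect cost is the same kind a context switch causes, usually on a smaller scale: kernel code and data evict your cache lines, TLB entries, and branch predictor state, so your code runs slower for a while after the call returns. The more kernel code the call touches, the worse this gets.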

What requires a syscall

Any I/O: read, write, send, recv, open, close.
Any memory map change: mmap, munmap, brk, mprotect.
Any thread or process operation: fork, clone, exit, wait.
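Any time-related call that sleeps or sets a timer: nanosleep, timer_create, timerfd_settime. Reading the clock (clock_gettime, gettimeofday) is normally served by the vDSO and only falls back to a real syscall when the kernel's clock source can't be read from user mode.
Any sync that has to block or wake another thread: futex (an uncontended lock stays in userspace; only contention enters the kernel), signal setup and delivery.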

Things that look like syscalls but aren't (they live in the vDSO, mapped into your address space): clock_gettime, gettimeofday, getcpu, time. The kernel maps a small shared library into every process so these hot calls run in user mode.
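One way to check this on a Linux machine is to look for the [vdso] region in the process's own memory map. A minimal sketch, assuming Linux with /proc mounted; the helper name vdso_mappings is made up for illustration:

```python
import os
import time

def vdso_mappings(maps_path="/proc/self/maps"):
    """Return the memory-map lines that describe the vDSO region (empty if none found)."""
    if not os.path.exists(maps_path):
        return []  # not Linux, or /proc is not mounted
    with open(maps_path) as maps:
        return [line.rstrip("\n") for line in maps if line.rstrip().endswith("[vdso]")]

if __name__ == "__main__":
    for line in vdso_mappings():
        print(line)  # address range, permissions, offset, ..., then the name [vdso]
    # On most Linux setups libc serves this call from the vDSO without entering the kernel.
    print(time.clock_gettime(time.CLOCK_MONOTONIC))
```

If the region is listed, running the program under strace should typically show no clock_gettime entries even though the program calls it.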

The mechanism

userspace                  kernel
---------                  ------
mov $NR_read, %rax         (syscall handler)
mov fd, %rdi                <- read args from regs
mov buf, %rsi
mov len, %rdx
syscall          -------->  switch to ring 0
                            do the work
                            put result in %rax
                 <--------  sysret
check %rax for error

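The 6 arg registers on Linux x86_64 are rdi rsi rdx r10 r8 r9. The fourth is r10 rather than rcx (which the normal function-call ABI uses) because the syscall instruction overwrites rcx with the return address and r11 with the saved flags. The syscall number goes in rax, and so does the return value. On error the kernel returns a negative errno value (-1 to -4095); the libc wrapper turns that into a return of -1 and sets errno.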
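The number-plus-registers convention can be poked at from a high-level language through libc's generic syscall() wrapper. A minimal sketch, assuming Linux on x86_64 or aarch64 with a libc that exports syscall(); the helper raw_syscall and the small number table are illustrative, not a complete or official list:

```python
import ctypes
import errno
import os
import platform

# Syscall numbers differ per architecture; this sketch only knows these two tables.
SYSCALL_NUMBERS = {
    "x86_64": {"getpid": 39, "close": 3},
    "aarch64": {"getpid": 172, "close": 57},
}

libc = ctypes.CDLL(None, use_errno=True)  # symbols of the already-loaded libc
libc.syscall.restype = ctypes.c_long

def raw_syscall(name, *args):
    """Invoke a syscall by number; return (result, errno), with errno 0 on success."""
    number = SYSCALL_NUMBERS[platform.machine()][name]
    result = libc.syscall(ctypes.c_long(number), *(ctypes.c_long(a) for a in args))
    if result == -1:
        # The kernel returned -errno in the result register;
        # libc converted that to -1 and stored the positive value in errno.
        return -1, ctypes.get_errno()
    return result, 0

if __name__ == "__main__":
    print(raw_syscall("getpid"), os.getpid())             # same pid both ways
    print(raw_syscall("close", -1), errno.errorcode[errno.EBADF])
```

Closing file descriptor -1 is used only because it is a harmless way to force an error and see the errno translation.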

Syscall flow on x86_64 Linux.

Cost matters

Syscall pattern	Approximate cost
getpid (trivial work, almost pure trap overhead)	50ns
clock_gettime (vDSO, no trap)	20ns
write to /dev/null	200ns
read from file (page cache hit)	1us
read from disk (uncached)	100us (SSD) to 10ms (HDD)
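A rough way to see the relative ordering yourself is a tight timing loop. This is a sketch, not a benchmark: the interpreter's own call overhead is comparable to a cheap syscall, so the absolute numbers will come out higher than the table, and the "empty call" row is there to show that baseline. The helper names ns_per_call and measure are made up for illustration.

```python
import os
import time

def ns_per_call(fn, iterations=100_000):
    """Average wall-clock nanoseconds per call of fn, interpreter overhead included."""
    start = time.perf_counter_ns()
    for _ in range(iterations):
        fn()
    return (time.perf_counter_ns() - start) / iterations

def measure():
    """Time a few cheap calls; returns a dict of label -> average ns per call."""
    devnull = os.open(os.devnull, os.O_WRONLY)
    try:
        return {
            "empty call (baseline)": ns_per_call(lambda: None),
            "getpid": ns_per_call(lambda: os.getpid()),
            "clock_gettime": ns_per_call(lambda: time.clock_gettime(time.CLOCK_MONOTONIC)),
            "write 1 byte to /dev/null": ns_per_call(lambda: os.write(devnull, b"x")),
        }
    finally:
        os.close(devnull)

if __name__ == "__main__":
    for label, ns in measure().items():
        print(f"{label:28s} {ns:8.0f} ns")
```

Subtracting the baseline row from the others gives a crude estimate of the per-call cost; for trustworthy numbers, the same loop in C or a tool like perf is the better choice.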

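strace can slow a syscall-heavy program by 10x or more because it intercepts every syscall via ptrace: the traced process stops on entry and again on exit, and each stop costs context switches to the tracer and back. Use perf or eBPF (bpftrace) for low-overhead tracing.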

How to do fewer

readv and writev: pass multiple buffers in one syscall.
sendmmsg and recvmmsg: send/receive multiple UDP datagrams in one call.
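io_uring: queue operations in a ring buffer shared with the kernel, submit the whole batch with a single io_uring_enter call, and reap completions from a second ring. One syscall can cover thousands of operations.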
Buffered I/O (FILE* in C, BufferedWriter in Java): library buffers data and flushes in large syscalls.
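The writev case is the easiest of these to try. The sketch below writes the same chunks two ways and counts how many syscalls each approach issues. It assumes a Unix system where os.writev is available and fewer chunks than the system's IOV_MAX limit (commonly 1024); the function names write_each and write_batched are made up for illustration.

```python
import os

def write_each(fd, chunks):
    """One write() syscall per chunk (more on short writes). Returns the syscall count."""
    calls = 0
    for chunk in chunks:
        view = memoryview(chunk)
        while view:
            written = os.write(fd, view)
            view = view[written:]
            calls += 1
    return calls

def write_batched(fd, chunks):
    """Gather all chunks into writev() calls. Returns the syscall count."""
    calls = 0
    pending = [memoryview(c) for c in chunks if c]
    while pending:
        written = os.writev(fd, pending)
        calls += 1
        # Drop buffers that were written completely, trim one that was written partially.
        while written and pending:
            if written >= len(pending[0]):
                written -= len(pending[0])
                pending.pop(0)
            else:
                pending[0] = pending[0][written:]
                written = 0
    return calls

if __name__ == "__main__":
    chunks = [b"line %d\n" % i for i in range(100)]
    fd = os.open(os.devnull, os.O_WRONLY)
    try:
        print("write :", write_each(fd, chunks), "syscalls")
        print("writev:", write_batched(fd, chunks), "syscalls")
    finally:
        os.close(fd)
```

Both functions produce the same bytes on the file descriptor; the only difference is how many times the process crosses into the kernel to do it.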

The interview answer

"Userspace is ring 3, kernel is ring 0. A syscall is the controlled entry point: the syscall instruction traps into a fixed handler, the kernel dispatches by syscall number, returns via sysret. Cost is 100-500ns plus KPTI overhead. The trap itself is fast; the real trick is doing fewer of them. io_uring is the modern answer at scale because you submit batches and get notified, instead of one syscall per operation."

Learn more

Docs
Linux syscall table (man7.org)
Article
Julia Evans: How does strace work?