Context switching
Direct vs indirect cost, TLB shootdowns, PCID, KPTI, and how to measure switch overhead on real systems.
The mechanism
When the kernel decides to switch tasks, it executes a function called __schedule (Linux). The flow:
- Save the current task's volatile state into its task_struct: the general-purpose registers, the floating-point and vector state if used, and the stack pointer.
- Update accounting: how much CPU time the outgoing task got, and whether it crossed any cgroup or rlimit threshold.
- Pick the next task from the run queue. CFS (through 6.5) walks a red-black tree to find the leftmost (least-virtual-runtime) task; EEVDF (6.6 onward) keeps the tree but picks the eligible task with the earliest virtual deadline.
- If the next task is in a different process, write its page table base (CR3 on x86_64) to switch address spaces.
- Load the next task's registers and return via iret or sysret to user mode.
The hardware part is fast. The expensive part is everything around it.
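The "pick the next task" step can be sketched as a toy model. This is not kernel code: it assumes a CFS-style rule (run whoever has the least virtual runtime), uses a binary heap in place of the red-black tree, and the names run_toy_scheduler, tasks, and slice_ns are made up for illustration.

```python
import heapq


def run_toy_scheduler(tasks, slice_ns, steps):
    """Toy CFS-style pick loop.

    tasks maps a task name to its weight (hypothetical values).
    Returns the order in which tasks were chosen to run.
    """
    # Heap of (vruntime, name); the smallest vruntime plays the role
    # of the leftmost node in the kernel's red-black tree.
    queue = [(0.0, name) for name in sorted(tasks)]
    heapq.heapify(queue)
    order = []
    for _ in range(steps):
        vruntime, name = heapq.heappop(queue)
        order.append(name)
        # Heavier tasks accrue virtual runtime more slowly,
        # so they get picked more often.
        vruntime += slice_ns / tasks[name]
        heapq.heappush(queue, (vruntime, name))
    return order


if __name__ == "__main__":
    print(run_toy_scheduler({"a": 1, "b": 2}, slice_ns=1, steps=6))
```

With weights 1 and 2, task b ends up running twice as often as task a, which is the proportional-share behavior the virtual-runtime ordering is meant to produce.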
Direct cost: roughly 1us
The instructions themselves are cheap. Saving and restoring 16 general-purpose registers, an FPU state (XSAVE), and a stack pointer is well under 100ns of register-shuffling work. The kernel code path adds bookkeeping, scheduler decisions, and accounting, which brings the total direct cost to about 1us on modern x86_64 Linux.
Direct cost grows if FPU/AVX state is dirty, because the kernel must save and restore the full vector register file. AVX-512 has 32 zmm registers of 64 bytes each, so a full save/restore moves 2KB (32 x 64 bytes) plus the mask registers.
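The arithmetic is small enough to check directly. A minimal sketch, counting only the zmm and opmask (k0-k7, 64 bits each) registers; the real XSAVE area also carries legacy and header regions that are not modeled here.

```python
ZMM_REGISTERS = 32   # zmm0-zmm31
ZMM_BYTES = 64       # 512 bits each
MASK_REGISTERS = 8   # k0-k7
MASK_BYTES = 8       # 64 bits each


def avx512_register_bytes():
    """Return (zmm bytes, mask bytes, total) for the AVX-512 register file."""
    zmm = ZMM_REGISTERS * ZMM_BYTES
    masks = MASK_REGISTERS * MASK_BYTES
    return zmm, masks, zmm + masks


if __name__ == "__main__":
    print(avx512_register_bytes())
```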
Indirect cost: cache and TLB
This is the cost nobody benchmarks. When task B starts running, the L1 cache (32KB typical) is full of task A's data, the L2 (1MB) holds a mix of both tasks' lines at best, and the L3 might or might not have B's lines depending on how long B was off.
If B's working set is 200KB, every memory access it makes for the first few thousand instructions is at best an L2 hit (roughly 12-14 cycles) instead of an L1 hit (4-5 cycles). Worse, the TLB (64-128 entries for L1 dTLB) might have zero entries for B's address space, so every access to a new page also triggers a page walk (up to 4 memory accesses with 4-level paging on x86_64) until the TLB warms back up.
Measurements from real workloads suggest the indirect cost is 5-50us depending on working set size and how long the task ran before being switched out. For a small task that fits in L1 entirely, indirect cost is near zero. For a database query touching MB of data, indirect cost dominates.
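A back-of-envelope estimate shows why the range is so wide. The sketch below assumes every cache line of the working set misses exactly once and pays a fixed penalty; it ignores prefetching and overlapping misses, so treat it as a rough upper bound. The function name and the penalty and clock values are illustrative assumptions, not measurements.

```python
def refill_cost_us(working_set_bytes, miss_penalty_cycles, cpu_ghz, line_bytes=64):
    """Estimate the time to re-fetch a working set after a switch, in microseconds."""
    # Number of cache lines, rounded up.
    lines = -(-working_set_bytes // line_bytes)
    cycles = lines * miss_penalty_cycles
    # A clock of cpu_ghz GHz executes cpu_ghz * 1000 cycles per microsecond.
    return cycles / (cpu_ghz * 1000.0)


if __name__ == "__main__":
    # 200KB working set on an assumed 3 GHz core.
    print(refill_cost_us(200 * 1024, 10, 3.0))  # lines refilled from L2
    print(refill_cost_us(200 * 1024, 40, 3.0))  # lines refilled from L3
```

Under these assumptions a 200KB working set is 3200 lines, which costs about 11us if the lines come from L2 and about 43us if they come from L3.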
TLB shootdowns and PCID
When you switch to a different process, the TLB is full of stale virtual-to-physical mappings for the old process. Historically, x86 invalidates the entire TLB on a CR3 write. This is fine if the new task runs long, painful if you bounce between two processes at high rate.
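PCID (Process Context Identifier) tags each TLB entry with a context ID, so switching CR3 with PCID enabled does NOT have to flush the TLB. The hardware field is 12 bits wide; Linux has used PCID since kernel 4.14, though it recycles a small pool of IDs per CPU rather than giving every process its own. The TLB entries for the previous process stay, and when you switch back, they are still valid as long as they have not been evicted or the ID reused.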
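The effect of tagging can be illustrated with a toy TLB. This sketch assumes unlimited capacity and an unlimited supply of IDs, which real hardware and Linux do not have; ToyTLB and bounce are hypothetical names used only for this example.

```python
class ToyTLB:
    """Toy TLB: a set of (context id, virtual page number) entries."""

    def __init__(self, use_pcid):
        self.use_pcid = use_pcid
        self.entries = set()
        self.current = None
        self.misses = 0

    def switch_to(self, pcid):
        # Without PCID, loading a new address space drops every entry.
        if not self.use_pcid:
            self.entries.clear()
        self.current = pcid

    def access(self, vpn):
        # With PCID the tag is part of the lookup key.
        key = (self.current if self.use_pcid else 0, vpn)
        if key not in self.entries:
            self.misses += 1
            self.entries.add(key)


def bounce(use_pcid, rounds=10, pages=8):
    """Ping-pong between two address spaces and count TLB misses."""
    tlb = ToyTLB(use_pcid)
    for _ in range(rounds):
        for pcid in (1, 2):
            tlb.switch_to(pcid)
            for vpn in range(pages):
                tlb.access(vpn)
    return tlb.misses


if __name__ == "__main__":
    print(bounce(use_pcid=False), bounce(use_pcid=True))
```

In this model, bouncing between two processes ten times costs 160 misses without tagging and 16 with it, because tagged entries survive the switch.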
KPTI (Kernel Page Table Isolation, the Meltdown mitigation) hurts syscall and context switch performance because it forces a CR3 write on every kernel entry and exit to swap between user and kernel page tables. With PCID this is partially mitigated, because the CR3 write no longer flushes the TLB. Without PCID, syscall cost roughly doubled when KPTI shipped in 2018.
A multi-core system also has to deal with TLB shootdowns. If one CPU modifies a page table entry that another CPU has cached in its TLB, the modifying CPU sends an IPI (inter-processor interrupt) to all CPUs that might have the entry, telling them to invalidate. Shootdowns show up in /proc/interrupts as TLB lines. High shootdown rate means heavy mmap/munmap activity or memory pressure.
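A small parser makes it easy to sample the shootdown counters over time. This sketch assumes the usual x86 Linux layout, where the row starts with "TLB:" followed by one counter per CPU and a text description; the function name is made up for this example.

```python
def tlb_shootdowns_per_cpu(interrupts_text):
    """Return the per-CPU TLB shootdown counts from /proc/interrupts text."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "TLB:":
            counts = []
            for field in fields[1:]:
                # Counters come first; the description text follows.
                if not field.isdigit():
                    break
                counts.append(int(field))
            return counts
    return []


if __name__ == "__main__":
    try:
        with open("/proc/interrupts") as f:
            print(tlb_shootdowns_per_cpu(f.read()))
    except OSError:
        print("no /proc/interrupts on this system")
```

The counters are cumulative since boot, so a rate needs two samples and a subtraction.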
Voluntary vs involuntary
A voluntary context switch happens when your code calls something that blocks: read on a socket with no data, pthread_mutex_lock when the mutex is held, sleep, epoll_wait. Your task asked to be parked.
An involuntary context switch happens when the scheduler preempts you: your time slice expired (a few milliseconds under CFS, scaled by load and CPU count rather than fixed), a higher-priority task became runnable, or you got migrated to another CPU.
pidstat -w 1 shows both columns:
PID     cswch/s   nvcswch/s   Command
1234    1250.00   45.00       web-server
5678    5.00      2400.00     cpu-burner
The web-server has high voluntary switches because it is I/O-bound. The cpu-burner has high involuntary switches because it is competing for CPU with other work.
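The same counters are exposed per process in /proc/<pid>/status as voluntary_ctxt_switches and nonvoluntary_ctxt_switches. A minimal sketch of reading them; the function name is made up for this example, and the values are cumulative totals rather than per-second rates.

```python
def ctxt_switches(status_text):
    """Return (voluntary, nonvoluntary) switch counts from /proc/<pid>/status text."""
    wanted = {"voluntary_ctxt_switches": None, "nonvoluntary_ctxt_switches": None}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in wanted:
            wanted[key] = int(value.strip())
    return wanted["voluntary_ctxt_switches"], wanted["nonvoluntary_ctxt_switches"]


if __name__ == "__main__":
    try:
        with open("/proc/self/status") as f:
            print(ctxt_switches(f.read()))
    except OSError:
        print("no /proc/self/status on this system")
```

Sampling twice and dividing the difference by the interval gives the rates that pidstat reports.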
Migrations across CPUs
When the scheduler moves a task from CPU 0 to CPU 1, the L1 and L2 caches on CPU 1 do not have the task's data. Until they warm up, every load comes from L3 or memory. This is why pinning latency-sensitive threads to a CPU often helps.
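perf stat -e cpu-migrations counts migrations. taskset -c 2,3 ./prog pins a program to CPUs 2 and 3. For real-time work, chrt and sched_setattr give you SCHED_FIFO or SCHED_DEADLINE policies that remove involuntary preemption almost entirely: a SCHED_FIFO task gives up the CPU only when it blocks, yields, or a higher-priority real-time task becomes runnable.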
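Pinning can also be done from inside the program. The sketch below uses Python's os.sched_setaffinity, which is available on Linux only; it pins the calling process to the lowest-numbered CPU it is currently allowed on, which is an arbitrary choice made for the example. The helper name is hypothetical.

```python
import os


def pin_to_one_cpu():
    """Pin the calling process to a single allowed CPU; return it, or None if unsupported."""
    if not hasattr(os, "sched_setaffinity"):
        return None  # not Linux, nothing to do
    allowed = os.sched_getaffinity(0)  # 0 means the calling process
    target = min(allowed)
    os.sched_setaffinity(0, {target})
    return target


if __name__ == "__main__":
    print("pinned to CPU", pin_to_one_cpu())
```

A pinned task cannot be migrated, so it keeps its L1 and L2 contents, at the price of waiting if that one CPU is busy.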
NUMA makes this worse. On a multi-socket machine, memory allocated on socket 0 is much slower to access from socket 1 (roughly 200ns vs 80ns). If a task's memory is allocated on one socket and the task then migrates to another, every memory access pays the NUMA tax. numactl and the kernel.numa_balancing sysctl control this.
How to measure on a real system
# system-wide
vmstat 1
# columns: cs is context switches per second
# per-process
pidstat -w 1
# per-workload, very precise
perf stat -e context-switches,cpu-migrations,cache-misses,dTLB-load-misses ./prog
# microbenchmark one switch cost
# lmbench
lat_ctx -P 1 -s 0 2
The lat_ctx tool from lmbench measures the latency of a context switch directly by ping-ponging between two processes over a pipe. On a modern x86_64 with cold cache, lat_ctx reports 1.5-5us depending on the working-set size parameter.
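The ping-pong idea can be sketched in a few lines. This is not lat_ctx: it is a Unix-only Python approximation, and the result includes pipe syscall and interpreter overhead, so it overstates the true switch cost. If the two processes land on different CPUs, it measures wakeup latency rather than a switch on one core. The function name is made up for this example.

```python
import os
import time


def pipe_pingpong_us(rounds=10000):
    """Average round-trip time in microseconds for one byte bounced between two processes."""
    p2c_r, p2c_w = os.pipe()  # parent -> child
    c2p_r, c2p_w = os.pipe()  # child -> parent
    pid = os.fork()
    if pid == 0:
        # Child: echo each byte back until the parent closes its end.
        os.close(p2c_w)
        os.close(c2p_r)
        try:
            while os.read(p2c_r, 1):
                os.write(c2p_w, b"x")
        finally:
            os._exit(0)
    os.close(p2c_r)
    os.close(c2p_w)
    start = time.perf_counter()
    for _ in range(rounds):
        os.write(p2c_w, b"x")  # wake the child
        os.read(c2p_r, 1)      # block until it answers
    elapsed = time.perf_counter() - start
    os.close(p2c_w)            # EOF makes the child exit
    os.close(c2p_r)
    os.waitpid(pid, 0)
    return elapsed / rounds * 1e6


if __name__ == "__main__":
    print("round trip: %.2f us" % pipe_pingpong_us())
```

Each round trip contains two blocking handoffs, so halving the result gives a crude per-switch figure when both processes share a CPU.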
How to reduce context switches in real code
- Batch I/O. read(fd, buf, 65536) is one syscall and at most one switch. 64 reads of 1KB each are 64 syscalls and up to 64 switches.
- Use vectored I/O. writev and readv move multiple buffers in one syscall.
- Async I/O. epoll, kqueue, io_uring let one thread service thousands of connections without one switch per connection.
- Coalesce wakeups. If you have a producer and consumer, signal the consumer once per batch, not once per item.
- Avoid sleeping for short waits. nanosleep(1us) is two switches (out and back). A spin loop is sometimes cheaper if the wait is shorter than the switch cost.
- Tune scheduler. sysctl kernel.sched_min_granularity_ns and friends control how often CFS preempts (on kernels 5.13 and later these knobs live under /sys/kernel/debug/sched/ instead of sysctl). Defaults are good for desktop; longer values reduce switches on a server.
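The batching point is easy to verify by counting calls. A minimal sketch that writes a 64KB temporary file and reads it back with two chunk sizes; it counts only reads that return data, and it relies on regular-file reads returning the full requested size, which is the normal behavior on Linux. The helper names are made up for this example.

```python
import os
import tempfile


def count_reads(path, chunk_size):
    """Count the os.read calls that return data when reading path in chunk_size pieces."""
    calls = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            data = os.read(fd, chunk_size)
            if not data:  # empty result means end of file
                break
            calls += 1
    finally:
        os.close(fd)
    return calls


def demo():
    """Return (reads needed with 1KB chunks, reads needed with one 64KB buffer)."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"\0" * 65536)
        path = f.name
    try:
        return count_reads(path, 1024), count_reads(path, 65536)
    finally:
        os.unlink(path)


if __name__ == "__main__":
    print(demo())
```

Every one of those calls crosses into the kernel, and on a socket or pipe every one is a point where the task can block and be switched out.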
Goroutines and async dodge most of this
The big win of M:N schedulers and async runtimes is that switching between goroutines (or futures) is a userspace operation. No syscall, no register save/restore beyond what's needed, no TLB or cache change because you stay in the same address space and probably the same CPU. A goroutine switch is around 200ns. An async task resume is around 50ns. That is why one event loop can do what a thousand threads cannot.
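The idea of switching in userspace can be shown with plain Python generators. This is a cooperative round-robin sketch, not Go's runtime or any particular async framework, and it does not reproduce the timings quoted above; worker and run_round_robin are hypothetical names.

```python
from collections import deque


def worker(name, steps, log):
    """A task that records its progress and yields control after each step."""
    for i in range(steps):
        log.append((name, i))
        yield  # cooperative switch point: no syscall, same address space


def run_round_robin(tasks):
    """Resume each task in turn until all of them have finished."""
    ready = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)  # the "context switch" is an ordinary function resume
        except StopIteration:
            continue    # task finished, drop it
        ready.append(task)


if __name__ == "__main__":
    log = []
    run_round_robin([worker("a", 2, log), worker("b", 2, log)])
    print(log)
```

The whole scheduler is a queue and a function call, which is why the per-switch cost sits so far below a kernel context switch.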
Mental model
Think of context switching as packing up your desk every time someone else wants to use it. The papers (registers) go in a drawer fast. But the post-its on the monitor (cache), the bookmarks in your browser (TLB), the memory of what page of the book you were on (branch predictor): all gone. The new person sets up from scratch. Direct cost: a few seconds to swap drawers. Indirect cost: 20 minutes to get back into flow.