Context switching

Switching the CPU from one task to another costs roughly 1-5us directly, plus cache pollution. Minimize how often it happens.

The answer

A context switch saves the current task's registers and stack pointer into its task_struct, restores another task's registers, switches page tables if it is a different process, and jumps. Direct cost is about 1us. Indirect cost is a cold L1 cache, cold TLB, and cold branch predictor, which can stretch the effective cost to 10-100us depending on working set size.

Every syscall that can block, every wait on I/O, and every preemption tick is a potential context switch. At roughly 1us of direct cost each, a service doing 100k context switches per second per core spends about 10% of that core on switching alone, before counting any cache effects.
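The 10% figure is plain arithmetic: switches per second times cost per switch, divided by the microseconds in a second. A minimal sketch follows; the function name is made up for illustration, and the 5us "effective" cost is an assumed value picked only to show how quickly the number grows once cache effects are included.

```python
def switch_overhead_fraction(switches_per_sec: float, cost_us: float) -> float:
    """Fraction of one core's time spent on context switches."""
    return switches_per_sec * cost_us / 1_000_000


# Direct cost only, using the ~1us figure from the text.
direct = switch_overhead_fraction(100_000, 1.0)

# Assumed 5us effective cost per switch (illustrative, not measured).
effective = switch_overhead_fraction(100_000, 5.0)

print(f"direct: {direct:.0%}, assumed effective: {effective:.0%}")
```

Plugging in your own measured switch rate (from vmstat or pidstat) gives a quick upper bound on what reducing switches could buy.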

What gets switched

Hardware state: general-purpose registers, instruction pointer, stack pointer, flags, FPU/SSE state if used.

Memory state: when switching to a different process, the kernel loads a new value into CR3 (the page table base register on x86), which flushes non-global TLB entries unless PCID is enabled. Switching between threads of the same process skips this step.

Kernel state: scheduler queues, run queue position, cgroup accounting.

Things that get blown away as side effects: L1/L2 cache lines belonging to the previous task, branch predictor history, return address stack.

The kernel saves and restores register state on every context switch.
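To make the sequence concrete, here is a toy model in Python. It is not kernel code: the Task and Cpu classes, their fields, and the flush counter are all invented for illustration. It only shows the order of operations described above: save the outgoing registers, change the page table root if the address space differs, count a TLB flush unless PCID is on, then restore the incoming registers.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    address_space: int  # stands in for the CR3 value (page table root)
    saved_regs: dict = field(default_factory=dict)


@dataclass
class Cpu:
    pcid_enabled: bool = False
    regs: dict = field(default_factory=dict)
    cr3: int = 0
    tlb_flushes: int = 0

    def switch(self, prev: Task, nxt: Task) -> None:
        prev.saved_regs = dict(self.regs)    # save outgoing register state
        if nxt.address_space != self.cr3:    # different process: new page tables
            self.cr3 = nxt.address_space
            if not self.pcid_enabled:
                self.tlb_flushes += 1        # untagged TLB entries are dropped
        self.regs = dict(nxt.saved_regs)     # restore incoming register state


a = Task("a", address_space=1, saved_regs={"ip": 0x1000, "sp": 0x7000})
a_thread = Task("a-thread", address_space=1, saved_regs={"ip": 0x2000, "sp": 0x8000})
b = Task("b", address_space=2, saved_regs={"ip": 0x3000, "sp": 0x9000})

cpu = Cpu(regs=dict(a.saved_regs), cr3=a.address_space)
cpu.switch(a, a_thread)  # same address space: no CR3 change, no flush
cpu.switch(a_thread, b)  # different process: CR3 changes, TLB flushed
print(cpu.tlb_flushes)
```

The model leaves out everything that makes the real thing expensive (caches, branch predictor, scheduler bookkeeping); it is only meant to fix the ordering in your head.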

How to see it

# total context switches per second on the system (the cs column)
vmstat 1

# per-process voluntary (cswch/s) vs involuntary (nvcswch/s)
pidstat -w 1

# perf stat for one workload
perf stat -e context-switches,cpu-migrations ./my-program

Voluntary switches mean your code blocked (read, recv, mutex). Involuntary switches mean the scheduler preempted you (time slice expired, higher-priority task became runnable). A high involuntary count points to CPU contention. A high voluntary count points to I/O wait or lock contention.
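A process can also read its own counters from inside the code. The sketch below assumes a Unix-like system where Python's resource module is available and the kernel fills in ru_nvcsw and ru_nivcsw (Linux does). The helper names and the "balanced" label are made up for illustration.

```python
import resource
import time


def switch_counts() -> tuple[int, int]:
    """Return (voluntary, involuntary) context switches for this process so far."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_nvcsw, usage.ru_nivcsw


def dominant_kind(voluntary: int, involuntary: int) -> str:
    """Say which kind of switch dominates, to decide which fix to look at."""
    if voluntary == involuntary:
        return "balanced"
    return "voluntary" if voluntary > involuntary else "involuntary"


v0, i0 = switch_counts()
for _ in range(20):
    time.sleep(0.001)  # a blocking sleep normally gives up the CPU voluntarily
v1, i1 = switch_counts()

print(f"voluntary: +{v1 - v0}, involuntary: +{i1 - i0}")
print("dominant:", dominant_kind(v1 - v0, i1 - i0))
```

Sampling the counters before and after a suspect code path is a cheap way to see whether that path blocks or gets preempted, without attaching an external tool.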

How to reduce it

Batch I/O. One syscall reading 64KB beats 64 syscalls reading 1KB each.
Use event-driven or async I/O (epoll, io_uring) instead of one-thread-per-connection.
Pin threads to cores (taskset, sched_setaffinity) for latency-sensitive work.
Avoid spinlocks in userspace unless contention is rare and short; otherwise use futex-based mutexes, which only enter the kernel (and possibly switch) on actual contention.
Bump thread priority for the hot path if and only if you have measured.
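The batching point from the list above can be checked by counting read calls. This sketch writes a 64KB temporary file and reads it back with 1KB and 64KB buffers through os.read, which issues one read(2) syscall per call. The count_reads helper is hypothetical, and the file is a stand-in for whatever your service actually reads.

```python
import os
import tempfile


def count_reads(path: str, chunk_size: int) -> tuple[int, int]:
    """Read the whole file; return (bytes_read, read_calls_that_returned_data)."""
    fd = os.open(path, os.O_RDONLY)
    total = 0
    calls = 0
    try:
        while True:
            data = os.read(fd, chunk_size)  # one read(2) syscall per call
            if not data:                    # empty bytes means end of file
                break
            total += len(data)
            calls += 1
    finally:
        os.close(fd)
    return total, calls


with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * 64 * 1024)
    path = f.name

small = count_reads(path, 1024)        # many small reads
large = count_reads(path, 64 * 1024)   # one large read
os.unlink(path)

print("1KB buffer :", small)
print("64KB buffer:", large)
```

Not every syscall becomes a context switch, but each one is an opportunity to block or be preempted, so fewer and larger calls shrink the number of opportunities.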

The interview answer

"Direct cost is roughly 1 microsecond, but the cache pollution easily makes it 10x that for a fat working set. The biggest wins come from reducing the number of switches: bigger batches, async I/O, fewer threads. I look at pidstat -w to see voluntary vs involuntary. Voluntary means I/O wait, involuntary means CPU contention, and the fix differs by which one dominates."

Learn more

Brendan Gregg: Systems Performance
OSTEP: Mechanism: Limited Direct Execution