CPU scheduling algorithms
From round-robin to MLFQ, CFS, EEVDF, and SCHED_DEADLINE. Plus the practical knobs: nice, cgroups, RT throttling.
What a scheduler must decide
Three questions, every scheduling tick:
- Which task runs next?
- For how long?
- On which CPU?
The answers trade off three goals: fairness (every task gets its share), latency (interactive tasks wake quickly), and throughput (don't waste CPU on switching). No algorithm wins all three.
Algorithm history in 90 seconds
- FCFS (first-come first-served): non-preemptive. Long jobs block short ones (convoy effect).
- SJF/SRTF (shortest job first): optimal average wait time, requires future knowledge.
- Round Robin: time slices, fair-ish, but uniform slice size is wrong if tasks have different needs.
- Multilevel Feedback Queue (MLFQ): multiple priority queues, jobs that use full slice get demoted, jobs that yield stay high. Heuristic for "interactive vs batch." Solaris and old Linux used variants.
- Lottery and stride scheduling: each task gets tickets, scheduler picks proportionally.
- CFS: weighted fair queueing on virtual runtime. Linux 2.6.23 to 6.5.
- EEVDF: earliest eligible virtual deadline first, adds latency knob. Linux 6.6+.
CFS in detail
CFS treats CPU like a fluid: every runnable task accumulates "wall time it has used" weighted by its priority. The weight for a task with nice value n is 1024 * 1.25^(-n), so nice -20 has weight ~88761 and nice +19 has weight ~15. A task's vruntime increments at real_time * NICE_0_LOAD / task_weight.
The runqueue is a red-black tree keyed by vruntime. Leftmost = smallest vruntime = "behind" = should run next. When a task runs, its vruntime grows; eventually another task is leftmost and we switch.
Sleep handling: when a task sleeps, its vruntime is frozen. When it wakes, the kernel adjusts vruntime to max(vruntime, min_vruntime - sched_latency/2) so it does not get an unbounded boost from sleeping forever. This is the wakeup preemption logic.
Key tunables in /proc/sys/kernel/:
sched_min_granularity_ns: minimum time a task runs before considering preemption. Default 2.25ms.sched_latency_ns: target window in which all runnable tasks should run once. Default ~18ms.sched_wakeup_granularity_ns: how much vruntime advantage a waking task needs to preempt the current task.
For server workloads, longer min_granularity (10-20ms) reduces context switches and helps throughput. For desktop/interactive, the shorter defaults help.
EEVDF: adding a latency dimension
CFS only let you control weight (via nice). It assumed all tasks wanted the same latency. EEVDF (Eligible Virtual Deadline First, Linux 6.6, by Peter Zijlstra) adds a per-task latency_nice knob.
Each task has a virtual deadline: vruntime + slice / weight. The scheduler picks the eligible task with the earliest virtual deadline. Eligibility comes from a "virtual lag" concept: a task is eligible if it has fallen behind its fair share.
The practical upshot: you can have two equally-weighted tasks where one preempts faster because its slice is smaller (lower latency_nice). Without giving up CPU share. CFS couldn't do this without lying about weight.
Real-time scheduling classes
Real-time means "the kernel will not preempt me except for a higher RT task." It does NOT mean fast. A SCHED_FIFO task is slower per-instruction than a SCHED_NORMAL task because cache effects are the same; the win is determinism.
SCHED_FIFO
- 99 priority levels, 1 (low) to 99 (high).
- A FIFO task runs until it: blocks, yields, or a higher-priority RT task becomes runnable.
- Same-priority FIFO tasks form a FIFO queue; new arrivals do NOT preempt.
SCHED_RR
- Same as FIFO, but same-priority tasks share via round-robin with a default 100ms quantum.
SCHED_DEADLINE
- Specify (runtime, deadline, period). Kernel admits if total bandwidth permits.
- Implemented as Earliest Deadline First (EDF) with Constant Bandwidth Server (CBS) for isolation.
- Strictly more powerful than FIFO/RR; the modern choice for real-time.
RT throttling
By default, real-time tasks are throttled to 95% of CPU per second (sched_rt_runtime_us=950000 / sched_rt_period_us=1000000). This is a safety net: if a buggy RT task spins, you still get 50ms/s of CPU for non-RT tasks (including login shells), so you can fix it.
Setting sched_rt_runtime_us=-1 disables throttling. Do this only if you know what you are doing.
SMP scheduling
Each CPU has its own runqueue. The scheduler tries to keep tasks on the same CPU (cache affinity) but periodically rebalances. Rebalancing fires:
- On wakeup: if a task wakes up and its previous CPU is busy, the scheduler may place it elsewhere.
- On tick: each CPU runs load balancing periodically.
- On idle: an idle CPU steals work from the busiest sibling.
Rebalancing respects NUMA: tasks prefer to stay on the same NUMA node as their memory. sched_domain hierarchy describes how aggressive to be at each level (SMT > LLC > socket > NUMA).
taskset -c 2,3 prog pins prog to CPUs 2 and 3. chrt --fifo --pid 99 PID makes a process SCHED_FIFO priority 99.
cgroups and group scheduling
cpu.shares (cgroup v1) or cpu.weight (v2) lets you share CPU between groups, not just tasks. Two cgroups with equal weight get 50/50 of contended CPU regardless of how many tasks each has. Three tasks in one cgroup and one task in another? Still 50/50 across groups; the three split 50% among themselves.
This is how Kubernetes resource requests/limits work. A pod with cpu.request=500m gets weight scaled accordingly; a pod with cpu.limit=500m gets bandwidth throttled (cpu.max in v2).
CPU throttling (CFS bandwidth control) is critical to understand for Kubernetes: a pod that exceeds its quota in a 100ms period is paused for the rest of the period. This causes tail latency spikes when bursty workloads hit their cgroup limit.
Scheduler latency in practice
The metric: from "task became runnable" to "task starts running on CPU." Call it runq_latency.
To measure: perf sched record then perf sched latency. Or use bcc's runqlat:
$ runqlat 10 1
Tracing run queue latency... Hit Ctrl-C to end.
usecs : count distribution
0 -> 1 : 12342 |********...
2 -> 3 : 8421 |*****...
4 -> 7 : 1203 |*
8 -> 15 : 234 |
16 -> 31 : 32 |
32 -> 63 : 5 |
A healthy server has p99 runq_latency under 1ms. Numbers above that mean CPU is contended and your service is queueing behind other work.
Common knobs and when to touch them
| Knob | Default | When to change |
|---|---|---|
| nice value | 0 | -5 to -10 for latency-sensitive, +5 for batch |
| sched_min_granularity_ns | 2.25ms | Increase to 10ms for throughput servers |
| sched_rt_runtime_us | 950000 | Lower if you have buggy RT, raise only if you must |
| cpu.weight (cgroup v2) | 100 | Per-tenant fairness |
| cpu.max (cgroup v2) | max | Hard cap, often misused |
| isolcpus boot param | empty | Reserve CPUs for RT, exclude from balancer |
How to think about scheduling in interviews
- Fairness: CFS/EEVDF. O(log N). Nice value controls weight.
- Latency: EEVDF latency_nice, or real-time classes.
- Hard deadlines: SCHED_DEADLINE with admission control.
- Group fairness: cgroups with cpu.weight.
- Pinning: taskset or sched_setaffinity for cache locality.
- Measurement: runqlat (eBPF), perf sched, /proc/PID/sched.
The job of the scheduler is to be invisible. If you are thinking about it, something is wrong (or you have unusually demanding real-time requirements).
Mental model
The scheduler is a barista. CFS gives every customer (task) the same priority unless they've already gotten coffee recently (high vruntime). EEVDF lets some customers say "I'm in a hurry" (low latency_nice) without giving them more total coffee. SCHED_FIFO is "the barista works on this customer until they leave, period." SCHED_DEADLINE is "this customer needs their coffee by 3pm, please plan." cgroups is "table 5 collectively gets 30% of the barista's time, divide it however you want among yourselves."
Learn more
- Docs
- Article
- Docskernel.org: sched documentationkernel.org
- DocsBrendan Gregg: Linux PerformanceBrendan Gregg