CPU scheduling algorithms

From round-robin to MLFQ, CFS, EEVDF, and SCHED_DEADLINE. Plus the practical knobs: nice, cgroups, RT throttling.

What a scheduler must decide

Three questions, every scheduling tick:

Which task runs next?
For how long?
On which CPU?

The answers trade off three goals: fairness (every task gets its share), latency (interactive tasks wake quickly), and throughput (don't waste CPU on switching). No algorithm wins all three.

Algorithm history in 90 seconds

FCFS (first-come first-served): non-preemptive. Long jobs block short ones (convoy effect).
SJF/SRTF (shortest job first): optimal average wait time, requires future knowledge.
Round Robin: time slices, fair-ish, but uniform slice size is wrong if tasks have different needs.
Multilevel Feedback Queue (MLFQ): multiple priority queues, jobs that use full slice get demoted, jobs that yield stay high. Heuristic for "interactive vs batch." Solaris and old Linux used variants.
Lottery and stride scheduling: each task gets tickets, scheduler picks proportionally.
CFS: weighted fair queueing on virtual runtime. Linux 2.6.23 to 6.5.
EEVDF: earliest eligible virtual deadline first, adds latency knob. Linux 6.6+.

CFS treats CPU like a fluid: every runnable task accumulates "wall time it has used" weighted by its priority. The weight for a task with nice value n is 1024 * 1.25^(-n), so nice -20 has weight ~88761 and nice +19 has weight ~15. A task's vruntime increments at real_time * NICE_0_LOAD / task_weight.

The runqueue is a red-black tree keyed by vruntime. Leftmost = smallest vruntime = "behind" = should run next. When a task runs, its vruntime grows; eventually another task is leftmost and we switch.

Sleep handling: when a task sleeps, its vruntime is frozen. When it wakes, the kernel adjusts vruntime to max(vruntime, min_vruntime - sched_latency/2) so it does not get an unbounded boost from sleeping forever. This is the wakeup preemption logic.

Key tunables in /proc/sys/kernel/:

sched_min_granularity_ns: minimum time a task runs before considering preemption. Default 2.25ms.
sched_latency_ns: target window in which all runnable tasks should run once. Default ~18ms.
sched_wakeup_granularity_ns: how much vruntime advantage a waking task needs to preempt the current task.

For server workloads, longer min_granularity (10-20ms) reduces context switches and helps throughput. For desktop/interactive, the shorter defaults help.

EEVDF: adding a latency dimension

CFS only let you control weight (via nice). It assumed all tasks wanted the same latency. EEVDF (Eligible Virtual Deadline First, Linux 6.6, by Peter Zijlstra) adds a per-task latency_nice knob.

Each task has a virtual deadline: vruntime + slice / weight. The scheduler picks the eligible task with the earliest virtual deadline. Eligibility comes from a "virtual lag" concept: a task is eligible if it has fallen behind its fair share.

The practical upshot: you can have two equally-weighted tasks where one preempts faster because its slice is smaller (lower latency_nice). Without giving up CPU share. CFS couldn't do this without lying about weight.

EEVDF separates 'how much CPU' from 'how quickly I get to run'.

Real-time scheduling classes

Real-time means "the kernel will not preempt me except for a higher RT task." It does NOT mean fast. A SCHED_FIFO task is slower per-instruction than a SCHED_NORMAL task because cache effects are the same; the win is determinism.

SCHED_FIFO

99 priority levels, 1 (low) to 99 (high).
A FIFO task runs until it: blocks, yields, or a higher-priority RT task becomes runnable.
Same-priority FIFO tasks form a FIFO queue; new arrivals do NOT preempt.

SCHED_RR

Same as FIFO, but same-priority tasks share via round-robin with a default 100ms quantum.

SCHED_DEADLINE

Specify (runtime, deadline, period). Kernel admits if total bandwidth permits.
Implemented as Earliest Deadline First (EDF) with Constant Bandwidth Server (CBS) for isolation.
Strictly more powerful than FIFO/RR; the modern choice for real-time.

RT throttling

By default, real-time tasks are throttled to 95% of CPU per second (sched_rt_runtime_us=950000 / sched_rt_period_us=1000000). This is a safety net: if a buggy RT task spins, you still get 50ms/s of CPU for non-RT tasks (including login shells), so you can fix it.

Setting sched_rt_runtime_us=-1 disables throttling. Do this only if you know what you are doing.

SMP scheduling

Each CPU has its own runqueue. The scheduler tries to keep tasks on the same CPU (cache affinity) but periodically rebalances. Rebalancing fires:

On wakeup: if a task wakes up and its previous CPU is busy, the scheduler may place it elsewhere.
On tick: each CPU runs load balancing periodically.
On idle: an idle CPU steals work from the busiest sibling.

Rebalancing respects NUMA: tasks prefer to stay on the same NUMA node as their memory. sched_domain hierarchy describes how aggressive to be at each level (SMT > LLC > socket > NUMA).

taskset -c 2,3 prog pins prog to CPUs 2 and 3. chrt --fifo --pid 99 PID makes a process SCHED_FIFO priority 99.

cgroups and group scheduling

cpu.shares (cgroup v1) or cpu.weight (v2) lets you share CPU between groups, not just tasks. Two cgroups with equal weight get 50/50 of contended CPU regardless of how many tasks each has. Three tasks in one cgroup and one task in another? Still 50/50 across groups; the three split 50% among themselves.

This is how Kubernetes resource requests/limits work. A pod with cpu.request=500m gets weight scaled accordingly; a pod with cpu.limit=500m gets bandwidth throttled (cpu.max in v2).

CPU throttling (CFS bandwidth control) is critical to understand for Kubernetes: a pod that exceeds its quota in a 100ms period is paused for the rest of the period. This causes tail latency spikes when bursty workloads hit their cgroup limit.

Scheduler latency in practice

The metric: from "task became runnable" to "task starts running on CPU." Call it runq_latency.

To measure: perf sched record then perf sched latency. Or use bcc's runqlat:

$ runqlat 10 1
Tracing run queue latency... Hit Ctrl-C to end.
     usecs               : count     distribution
         0 -> 1          : 12342    |********...
         2 -> 3          : 8421     |*****...
         4 -> 7          : 1203     |*
         8 -> 15         : 234      |
        16 -> 31         : 32       |
        32 -> 63         : 5        |

A healthy server has p99 runq_latency under 1ms. Numbers above that mean CPU is contended and your service is queueing behind other work.

Common knobs and when to touch them

Knob	Default	When to change
nice value	0	-5 to -10 for latency-sensitive, +5 for batch
sched_min_granularity_ns	2.25ms	Increase to 10ms for throughput servers
sched_rt_runtime_us	950000	Lower if you have buggy RT, raise only if you must
cpu.weight (cgroup v2)	100	Per-tenant fairness
cpu.max (cgroup v2)	max	Hard cap, often misused
isolcpus boot param	empty	Reserve CPUs for RT, exclude from balancer

How to think about scheduling in interviews

Fairness: CFS/EEVDF. O(log N). Nice value controls weight.
Latency: EEVDF latency_nice, or real-time classes.
Hard deadlines: SCHED_DEADLINE with admission control.
Group fairness: cgroups with cpu.weight.
Pinning: taskset or sched_setaffinity for cache locality.
Measurement: runqlat (eBPF), perf sched, /proc/PID/sched.

The job of the scheduler is to be invisible. If you are thinking about it, something is wrong (or you have unusually demanding real-time requirements).

Mental model

The scheduler is a barista. CFS gives every customer (task) the same priority unless they've already gotten coffee recently (high vruntime). EEVDF lets some customers say "I'm in a hurry" (low latency_nice) without giving them more total coffee. SCHED_FIFO is "the barista works on this customer until they leave, period." SCHED_DEADLINE is "this customer needs their coffee by 3pm, please plan." cgroups is "table 5 collectively gets 30% of the barista's time, divide it however you want among yourselves."

Learn more

Docs
OSTEP scheduling chaptersOSTEP
Article
LWN: EEVDF schedulerLWN
Docs
kernel.org: sched documentationkernel.org
Docs
Brendan Gregg: Linux PerformanceBrendan Gregg

Deep dive15 min read← Back to crisp