epoll, kqueue, io_uring

How each works under the hood, edge vs level triggering, fixed buffers, SQPOLL, and the security history of io_uring.

Why we needed something better than select/poll

select and poll have two problems at scale:

O(N) cost per call. The kernel scans every fd in the set. With 10,000 fds and 10 active, each call does work proportional to 10,000 just to find those 10.
Bitmap/array copying. The fd set is copied from user to kernel on every call, even if nothing changed.
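To make the O(N) cost concrete, here is a minimal sketch that watches 64 pipes with poll(), makes exactly one of them readable, and counts how many array slots userspace has to inspect afterwards. The names poll_and_scan and demo_poll_scan, and the size NPIPES, are illustrative assumptions rather than anything from a real codebase.

```c
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

#define NPIPES 64  /* illustrative size, well under default fd limits */

/* Calls poll() once and counts readable fds. *scanned reports how many
 * array slots userspace had to inspect after the call returned. */
static int poll_and_scan(struct pollfd *fds, nfds_t nfds, size_t *scanned)
{
    *scanned = 0;
    int n = poll(fds, nfds, 0);          /* whole array is passed in every call */
    if (n <= 0)
        return n;
    int ready = 0;
    for (nfds_t i = 0; i < nfds; i++) {  /* O(N) scan to find the few ready fds */
        (*scanned)++;
        if (fds[i].revents & POLLIN)
            ready++;
    }
    return ready;
}

/* Opens NPIPES pipes, makes exactly one readable, and returns the number of
 * slots scanned to find it (-1 on setup failure; fds leak in that case,
 * which is acceptable for a sketch). *ready_out gets the ready count. */
long demo_poll_scan(int *ready_out)
{
    int pipes[NPIPES][2];
    struct pollfd fds[NPIPES];
    for (int i = 0; i < NPIPES; i++) {
        if (pipe(pipes[i]) != 0)
            return -1;
        fds[i].fd = pipes[i][0];
        fds[i].events = POLLIN;
        fds[i].revents = 0;
    }
    if (write(pipes[7][1], "x", 1) != 1)
        return -1;
    size_t scanned = 0;
    *ready_out = poll_and_scan(fds, NPIPES, &scanned);
    for (int i = 0; i < NPIPES; i++) {
        close(pipes[i][0]);
        close(pipes[i][1]);
    }
    return (long)scanned;
}
```

One fd is ready, yet the loop still visits all 64 slots. Scale that ratio to 10 ready out of 10,000 and the waste is obvious.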

Both APIs were fine in 1986 when you had 50 fds. They are terrible in 2026 when you have 100k.

epoll and kqueue both fix this by separating registration from waiting. You register your interest once; subsequent calls only return what changed.

epoll internals

epoll_create1 returns an fd backed by an eventpoll kernel structure containing:

A red-black tree of the fds you've registered (the "interest list"), keyed by the underlying file and fd number.
A ready list (linked list) of fds that became ready.
A wait queue for the epoll fd itself.

When you register an fd with epoll_ctl(EPOLL_CTL_ADD), the kernel:

Adds the fd to the rbtree (O(log N)).
Hooks into the underlying file's poll table. When the file becomes ready (data arrives on a socket, for example), the file's wakeup callback adds the fd to the eventpoll's ready list.

When you call epoll_wait:

If the ready list has entries, copy them out, return.
Otherwise, sleep on the eventpoll's wait queue. When a file becomes ready, the callback wakes you.

Cost is O(K) where K is the number of ready fds. Independent of how many fds you registered.
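The register-once, wait-many flow described above can be sketched in a few lines. This is a minimal example under the assumption that a pipe stands in for a socket; demo_epoll_once is a hypothetical helper name.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Registers the read end of a pipe once, makes it readable, and returns
 * what a single epoll_wait() reports: 1 if exactly our fd came back
 * readable, 0 if nothing was ready, -1 on error. */
int demo_epoll_once(void)
{
    int p[2];
    if (pipe(p) != 0)
        return -1;
    int epfd = epoll_create1(EPOLL_CLOEXEC);
    if (epfd < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    /* Registration: happens once, inserts into the interest list. */
    struct epoll_event ev = { .events = EPOLLIN };
    ev.data.fd = p[0];

    int result = -1;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == 0 &&
        write(p[1], "x", 1) == 1) {
        /* Waiting: only ready entries are copied out. */
        struct epoll_event out[8];
        int n = epoll_wait(epfd, out, 8, 1000);
        if (n == 1 && out[0].data.fd == p[0] && (out[0].events & EPOLLIN))
            result = 1;
        else if (n == 0)
            result = 0;
    }
    close(epfd);
    close(p[0]);
    close(p[1]);
    return result;
}
```

In a real server the epoll_ctl call happens when a connection is accepted and the epoll_wait call sits in the event loop; the two are decoupled, which is the whole point.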

Figure: epoll internals. Registered fds form an rbtree; ready ones land in a list.

Level vs edge triggering, deeper

Level-triggered (default, like poll): epoll_wait returns the fd as long as the condition (data available) is true. If you read half and there's more, the next epoll_wait will return it again.

Edge-triggered (EPOLLET): epoll_wait returns the fd only on transitions. From "no data" to "data available" = one notification. If you don't drain fully and more data arrives later, you get another notification only on the next arrival.

ET is faster because it causes fewer wakeups. But it has a strict requirement: the fd must be non-blocking, and on each notification you must read until you get EAGAIN. Otherwise the leftover data sits there with no further notification until new data arrives.
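A small experiment makes the difference visible. The sketch below writes two bytes to a pipe, waits once, reads only one byte, then checks again with a zero timeout. It also includes a drain loop of the kind ET mode requires. The function names drain and second_wait_after_partial_read are hypothetical, and the pipe is an assumed stand-in for a socket.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Reads until the kernel reports EAGAIN, as ET mode requires.
 * fd must be non-blocking. Returns total bytes read, or -1 on error. */
long drain(int fd)
{
    char buf[4096];
    long total = 0;
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) { total += n; continue; }
        if (n == 0) return total;                        /* EOF */
        if (errno == EINTR) continue;                    /* retry */
        if (errno == EAGAIN || errno == EWOULDBLOCK) return total;
        return -1;
    }
}

/* Returns the result of the second epoll_wait() after a partial read.
 * Pass 0 for level-triggered or EPOLLET for edge-triggered.
 * *leftover receives how many bytes drain() found still buffered. */
int second_wait_after_partial_read(unsigned extra_flags, long *leftover)
{
    int p[2], second = -1;
    *leftover = -1;
    if (pipe(p) != 0)
        return -1;
    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN | extra_flags }, out;
    ev.data.fd = p[0];
    char c;
    if (epfd >= 0 &&
        fcntl(p[0], F_SETFL, O_NONBLOCK) == 0 &&
        epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == 0 &&
        write(p[1], "ab", 2) == 2 &&
        epoll_wait(epfd, &out, 1, 1000) == 1 &&   /* first notification */
        read(p[0], &c, 1) == 1) {                 /* read only half */
        second = epoll_wait(epfd, &out, 1, 0);    /* zero timeout: just check */
        *leftover = drain(p[0]);                  /* the byte left behind */
    }
    if (epfd >= 0)
        close(epfd);
    close(p[0]);
    close(p[1]);
    return second;
}
```

With extra_flags set to 0 the second wait reports the fd again; with EPOLLET it reports nothing even though one byte is still buffered. That silent leftover byte is exactly the stall that the drain loop exists to prevent.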

Nginx uses ET. Most application frameworks default to LT because the ET failure mode (data is not lost, but a connection silently stalls) is hard to debug.

EPOLLEXCLUSIVE: the thundering herd fix

Imagine N workers, each with its own epoll instance, all watching the same listening socket. When a connection arrives, the kernel wakes them all. They all race to accept; one wins, and the others get EAGAIN (assuming the socket is non-blocking). This is the thundering herd.

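EPOLLEXCLUSIVE (Linux 4.5+) tells the kernel to wake one or more of the attached epoll instances per event instead of all of them; in practice that is usually one. Set it in epoll_ctl(EPOLL_CTL_ADD) when registering the listening fd; it cannot be applied or changed later with EPOLL_CTL_MOD. Nginx uses it. Many homegrown servers don't and waste CPU.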
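Registering with the flag is a one-line change. Here is a minimal sketch; add_exclusive and demo_exclusive_add_only are hypothetical helper names, and a pipe stands in for the listening socket so the example needs no network. The second function checks the add-only rule documented in the epoll_ctl man page: once an fd is attached with EPOLLEXCLUSIVE, EPOLL_CTL_MOD on it fails with EINVAL.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Attaches fd to epfd in exclusive-wakeup mode.
 * Returns 0 on success, -1 with errno set on failure. */
int add_exclusive(int epfd, int fd)
{
    struct epoll_event ev = { .events = EPOLLIN | EPOLLEXCLUSIVE };
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Returns 1 if the exclusive ADD succeeded and a later MOD was rejected
 * with EINVAL, 0 if the MOD behaved differently, -1 on setup failure. */
int demo_exclusive_add_only(void)
{
    int p[2];
    if (pipe(p) != 0)
        return -1;
    int epfd = epoll_create1(0);
    int ok = -1;
    if (epfd >= 0 && add_exclusive(epfd, p[0]) == 0) {
        struct epoll_event ev = { .events = EPOLLIN };
        ev.data.fd = p[0];
        /* Exclusive registrations cannot be modified afterwards. */
        int rc = epoll_ctl(epfd, EPOLL_CTL_MOD, p[0], &ev);
        ok = (rc == -1 && errno == EINVAL) ? 1 : 0;
    }
    if (epfd >= 0)
        close(epfd);
    close(p[0]);
    close(p[1]);
    return ok;
}
```

In a multi-worker server each worker would call add_exclusive on its own epoll fd with the shared listening socket. To change the event mask later, delete the registration and add it again.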

SO_REUSEPORT (Linux 3.9+) is a complementary fix: each worker has its own listening socket bound to the same port, the kernel distributes incoming connections across them, and no epoll state is shared at all. Many modern multi-worker servers offer this as an option.

kqueue: the FreeBSD answer

kqueue does what epoll does for fds, plus more event sources:

EVFILT_READ, EVFILT_WRITE: like epoll for sockets and files.
EVFILT_VNODE: file changes (NOTE_DELETE, NOTE_WRITE, NOTE_RENAME). Used by fs watchers on macOS.
EVFILT_PROC: process events (NOTE_EXIT, NOTE_FORK).
EVFILT_SIGNAL: synchronous signal delivery.
EVFILT_TIMER: timers.
EVFILT_USER: userspace-triggered events.

The unified API is elegant. On Linux you need eventfd + signalfd + timerfd + inotify + epoll to cover the same ground.

The kevent syscall does both registration and waiting in one call:

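#include <sys/event.h>  /* kqueue(), kevent(), EV_SET; kq is the fd returned by kqueue() */
struct kevent change, event;
EV_SET(&change, fd, EVFILT_READ, EV_ADD | EV_ENABLE, 0, 0, NULL);  /* register: fd readable */
int n = kevent(kq, &change, 1, &event, 1, NULL);  /* apply 1 change, then block for up to 1 event */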

Performance is similar to epoll. The unified API is the main win.

io_uring: a different model

io_uring throws out "tell me when ready" entirely. Instead, you tell the kernel what to do.

Two ring buffers shared between user and kernel via mmap:

Submission queue (SQ): a ring of indices into an array of io_uring_sqe structs. Userspace fills entries and advances the tail; the kernel reads entries and advances the head.
Completion queue (CQ): a ring of io_uring_cqe structs. The kernel fills entries and advances the tail; userspace reads entries and advances the head.

To submit an operation:

Get the next SQE from the SQ.
Fill it: opcode, fd, buffer, offset, user_data.
Bump the SQ tail.
Call io_uring_enter to wake the kernel (or skip this in SQPOLL mode).

To process completions:

Read the CQ head and tail.
For each new CQE, get the result and your original user_data.
Bump the CQ head.

The liburing library wraps all this nicely. Raw io_uring is doable but error-prone.
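The head/tail protocol is easier to follow in a toy model. The sketch below is NOT the io_uring ABI: the struct layouts, the toy_ names, and the fake "kernel" function are all assumptions made for illustration, and it is single-threaded so it omits the memory barriers real shared rings need. It only mirrors who advances which index.

```c
#include <stdint.h>

#define RING_ENTRIES 8u                  /* must be a power of two */
#define RING_MASK (RING_ENTRIES - 1u)

/* Toy stand-ins for io_uring_sqe / io_uring_cqe; not the kernel's layout. */
struct toy_sqe { uint8_t opcode; int32_t fd; uint64_t user_data; };
struct toy_cqe { uint64_t user_data; int32_t res; };

struct toy_ring {
    uint32_t sq_head, sq_tail;           /* free-running consumer/producer indices */
    uint32_t cq_head, cq_tail;
    struct toy_sqe sqes[RING_ENTRIES];
    struct toy_cqe cqes[RING_ENTRIES];
};

/* Userspace side: fill the next SQE, then bump the SQ tail.
 * Returns 0 on success, -1 if the SQ is full. */
int toy_submit(struct toy_ring *r, uint8_t opcode, int32_t fd, uint64_t user_data)
{
    if (r->sq_tail - r->sq_head == RING_ENTRIES)
        return -1;
    struct toy_sqe *sqe = &r->sqes[r->sq_tail & RING_MASK];
    sqe->opcode = opcode;
    sqe->fd = fd;
    sqe->user_data = user_data;
    r->sq_tail++;                        /* a real ring needs a release store here */
    return 0;
}

/* Pretend kernel: consume SQEs, advance the SQ head, post one CQE per op.
 * The "result" simply echoes the fd. Returns the number of ops processed. */
int toy_kernel_run(struct toy_ring *r)
{
    int done = 0;
    while (r->sq_head != r->sq_tail &&
           r->cq_tail - r->cq_head < RING_ENTRIES) {
        struct toy_sqe *sqe = &r->sqes[r->sq_head & RING_MASK];
        struct toy_cqe *cqe = &r->cqes[r->cq_tail & RING_MASK];
        cqe->user_data = sqe->user_data; /* user_data travels back untouched */
        cqe->res = sqe->fd;
        r->sq_head++;
        r->cq_tail++;
        done++;
    }
    return done;
}

/* Userspace side: copy out one CQE and bump the CQ head.
 * Returns 1 if a completion was read, 0 if the CQ is empty. */
int toy_reap(struct toy_ring *r, struct toy_cqe *out)
{
    if (r->cq_head == r->cq_tail)
        return 0;
    *out = r->cqes[r->cq_head & RING_MASK];
    r->cq_head++;
    return 1;
}
```

The indices run freely and are masked only on access, which is why the ring size must be a power of two. The user_data field is how you match a completion back to the request that caused it.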

io_uring modes

Default: userspace calls io_uring_enter to submit. Kernel processes ops, writes CQEs.
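SQPOLL: a kernel thread polls the SQ, so userspace can submit without a syscall while that thread is awake. The thread spins on a CPU core while there is work and goes to sleep after sq_thread_idle milliseconds of inactivity; once it sleeps, the kernel sets IORING_SQ_NEED_WAKEUP and you must call io_uring_enter with IORING_ENTER_SQ_WAKEUP. Under sustained load, budget a dedicated core for it. Throughput is maximal.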
IOPOLL: for block devices that support it, the submitting thread polls for completion instead of being interrupted. Lower latency for storage.
Fixed files (IORING_REGISTER_FILES): pre-register fds to skip atomic refcounting on each op.
Fixed buffers (IORING_REGISTER_BUFFERS): pre-register buffers to skip page-pinning on each op.

Combining SQPOLL + fixed files + fixed buffers + zero-copy send is how you hit millions of ops/sec.
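Requesting SQPOLL comes down to a flag and an idle timeout in the parameters passed at ring setup. The sketch below only builds those parameters and checks the wakeup flag; it does not create a ring, so it runs even where io_uring is blocked. It assumes the kernel UAPI header linux/io_uring.h is installed, and make_sqpoll_params and sq_needs_wakeup are hypothetical helper names.

```c
#include <linux/io_uring.h>
#include <string.h>

/* Builds setup parameters that request a kernel-side SQ polling thread.
 * idle_ms is how long the poller spins on an empty SQ before sleeping. */
struct io_uring_params make_sqpoll_params(unsigned idle_ms)
{
    struct io_uring_params p;
    memset(&p, 0, sizeof p);
    p.flags = IORING_SETUP_SQPOLL;
    p.sq_thread_idle = idle_ms;
    return p;
}

/* Given the value read from the SQ ring's flags word, reports whether the
 * poller has gone to sleep. If so, the next submit must call io_uring_enter
 * with IORING_ENTER_SQ_WAKEUP instead of skipping the syscall. */
int sq_needs_wakeup(unsigned sq_flags)
{
    return (sq_flags & IORING_SQ_NEED_WAKEUP) != 0;
}
```

These parameters would be handed to the io_uring_setup syscall (or to liburing's setup helpers). Fixed files and fixed buffers are registered separately, after the ring exists.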

Figure: io_uring shared rings. SQ for ops in, CQ for results out, optionally no syscalls in steady state.

io_uring security history

io_uring has accumulated a long list of CVEs since it landed in Linux 5.1, largely because the attack surface is huge: it reimplements a large share of the I/O syscalls and adds asynchronous kernel-side execution on top. Some examples:

CVE-2022-29582: use-after-free caused by a race in io_uring timeout handling, local privilege escalation.
CVE-2023-2598: out-of-bounds memory access in fixed buffers.
Multiple use-after-free and refcount bugs in 2023.

In response, Google disabled io_uring in ChromeOS and on its production servers, and restricted it for Android apps. Recent Docker releases block the io_uring syscalls in the default seccomp profile. Use io_uring on trusted workloads; think carefully before exposing it to untrusted containers.

How runtimes pick

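libuv (Node.js, others): epoll on Linux, kqueue on macOS/BSD, IOCP on Windows. io_uring support is limited and experimental (recent versions can use it for some file operations); the network poller is still epoll.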
Tokio (Rust): mio crate (epoll/kqueue). io_uring via tokio-uring crate (separate runtime).
Go: epoll for net poller. No io_uring; the runtime team has resisted because it adds complexity.
Java NIO: epoll on Linux. io_uring via Netty's incubator API.
Nginx: epoll by default on Linux. io_uring support has circulated as an optional add-on; verify that your build actually includes it before relying on it.

Performance: rough numbers

On a single core, modern Linux, 10k concurrent idle TCP connections, occasional bursts:

API	Ops/sec	Notes
select/poll	~50k	O(N) scan kills you
epoll	~500k	Solid baseline
io_uring (no SQPOLL)	~700k	Less syscall overhead
io_uring (SQPOLL + fixed)	~1.5M	At the cost of a dedicated core

These are rough numbers. Benchmark your own workload; the variance is huge.

When epoll is still the right answer

Your kernel is 5.4 or older. (RHEL 8 ships 4.18 by default; many production systems are not on 5.10+.)
You need predictable security posture (less code surface than io_uring).
Your throughput target is already met. epoll handles way more than most apps need.
Your library/runtime doesn't support io_uring well yet.

Common pitfalls

Mental model

epoll is a smart receptionist: "tell me which of these phones is ringing, I'll answer it." kqueue is the same receptionist who also watches the office door, the calendar, and the printer. io_uring is a personal assistant: "here's a list of things, please do them and put the results in this tray." Each step pushes more work from your code into the kernel, reducing the syscalls you need to make per unit of useful work.

Learn more

Docs: epoll man page (man7.org)
Paper: kqueue paper (Lemon, 2001), Jonathan Lemon
Paper: io_uring documentation, Jens Axboe
Article: "When TCP sockets refuse to die", Cloudflare
