epoll, kqueue, io_uring
How each works under the hood, edge vs level triggering, fixed buffers, SQPOLL, and the security history of io_uring.
Why we needed something better than select/poll
select and poll have two problems at scale:
- O(N) cost per call. The kernel scans every fd in the set. With 10,000 fds and 10 active, each call does work proportional to all 10,000 just to find the 10.
- Bitmap/array copying. The fd set is copied from user to kernel on every call, even if nothing changed.
Both APIs were fine in 1986 when you had 50 fds. They are terrible in 2026 when you have 100k.
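To make the two costs concrete, here is a minimal sketch of a poll-based readiness check. The helper name `poll_ready_count` is hypothetical; it only illustrates that the whole array is handed to the kernel and then scanned in full on every call, however few fds are ready.

```c
#include <poll.h>
#include <stdlib.h>

/* Returns how many of the n descriptors are readable right now, or -1 on
 * error. Both O(n) steps happen on every call: the full pollfd array is
 * passed to the kernel, and all n results are scanned afterwards. */
int poll_ready_count(const int *fds, int n)
{
    struct pollfd *set = calloc((size_t)n, sizeof *set);
    if (set == NULL)
        return -1;

    for (int i = 0; i < n; i++) {
        set[i].fd = fds[i];
        set[i].events = POLLIN;
    }

    int rc = poll(set, (nfds_t)n, 0);   /* timeout 0: check without blocking */
    int ready = 0;
    if (rc > 0) {
        for (int i = 0; i < n; i++)     /* scan everything to find the few */
            if (set[i].revents & POLLIN)
                ready++;
    }

    free(set);
    return rc < 0 ? -1 : ready;
}
```

A real server would keep the array around between calls, but the kernel-side copy and scan would still cover every entry.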
epoll and kqueue both fix this by separating registration from waiting. You register your interest once; subsequent calls only return what changed.
epoll internals
epoll_create1 returns an fd backed by an eventpoll kernel structure containing:
- A red-black tree of fds you've registered (the "interest list"). Keyed by the (file, fd number) pair.
- A ready list (linked list) of fds that became ready.
- A wait queue for the epoll fd itself.
When you register an fd with epoll_ctl(EPOLL_CTL_ADD), the kernel:
- Adds the fd to the rbtree (O(log N)).
- Hooks a callback into the underlying file's wait queue (through the file's poll method). When the file becomes ready (data arrives on a socket, for example), that callback adds the fd to the eventpoll's ready list.
When you call epoll_wait:
- If the ready list has entries, copy them out, return.
- Otherwise, sleep on the eventpoll's wait queue. When a file becomes ready, the callback wakes you.
Cost is O(K) where K is the number of ready fds. Independent of how many fds you registered.
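The register-once, wait-many pattern described above can be sketched as follows. The helper names `add_readable` and `collect_ready` are hypothetical wrappers introduced for illustration; the underlying calls are the standard `epoll_ctl` and `epoll_wait`.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Register fd for readability (level-triggered) on an existing epoll
 * instance. Done once per fd, not once per wait. */
int add_readable(int ep, int fd)
{
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    return epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev);
}

/* Wait up to timeout_ms and copy the ready fds into out[0..max).
 * Returns the number of ready fds, 0 on timeout, -1 on error.
 * Only ready fds come back, no matter how many are registered. */
int collect_ready(int ep, int *out, int max, int timeout_ms)
{
    struct epoll_event events[64];
    if (max > 64)
        max = 64;

    int n = epoll_wait(ep, events, max, timeout_ms);
    for (int i = 0; i < n; i++)
        out[i] = events[i].data.fd;
    return n;
}
```

Usage would be one `epoll_create1(0)`, one `add_readable` per connection, then a loop around `collect_ready`.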
Level vs edge triggering, deeper
Level-triggered (default, like poll): epoll_wait returns the fd as long as the condition (data available) is true. If you read half and there's more, the next epoll_wait will return it again.
Edge-triggered (EPOLLET): epoll_wait returns the fd only on transitions. From "no data" to "data available" = one notification. If you don't drain fully and more data arrives later, you get another notification only on the next arrival.
ET can be faster because it produces fewer wakeups. But it has a strict requirement: when you read in ET mode, you must keep reading until you get EAGAIN, which also means the fd must be non-blocking. Otherwise the leftover data sits there with no further notification until something else changes.
Nginx uses ET. Most application frameworks default to LT because ET's failure mode (the data is not lost, just stuck unread) is hard to debug.
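The difference is easy to observe with a pipe. The sketch below is illustrative only, and all helper names in it are hypothetical: `notifications_after_partial_read` writes 100 bytes, reads 10, and counts how many of two back-to-back `epoll_wait` calls report the fd, while `drain` shows the read-until-EAGAIN loop that ET mode requires.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

int set_nonblocking(int fd)
{
    int fl = fcntl(fd, F_GETFL, 0);
    return fl < 0 ? -1 : fcntl(fd, F_SETFL, fl | O_NONBLOCK);
}

/* ET-safe read loop: keep reading until EAGAIN (or EOF).
 * Returns total bytes read, or -1 on a real error. fd must be non-blocking. */
long drain(int fd)
{
    char buf[4096];
    long total = 0;
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) { total += n; continue; }
        if (n == 0) return total;                              /* EOF */
        if (errno == EINTR) continue;
        if (errno == EAGAIN || errno == EWOULDBLOCK) return total;
        return -1;
    }
}

/* extra_flags: 0 for level-triggered, EPOLLET for edge-triggered.
 * Returns how many of the two epoll_wait calls reported the fd, -1 on error. */
int notifications_after_partial_read(uint32_t extra_flags)
{
    int p[2];
    if (pipe(p) < 0)
        return -1;

    int seen = -1;
    int ep = epoll_create1(0);
    char msg[100] = {0}, part[10];
    struct epoll_event ev = { .events = EPOLLIN | extra_flags, .data.fd = p[0] };
    struct epoll_event got;

    if (ep >= 0 && set_nonblocking(p[0]) == 0 &&
        epoll_ctl(ep, EPOLL_CTL_ADD, p[0], &ev) == 0 &&
        write(p[1], msg, sizeof msg) == (ssize_t)sizeof msg) {
        int first = epoll_wait(ep, &got, 1, 0);       /* data just arrived */
        ssize_t n = read(p[0], part, sizeof part);    /* 90 bytes remain */
        int second = epoll_wait(ep, &got, 1, 0);      /* nothing new arrived */
        if (first >= 0 && second >= 0 && n == (ssize_t)sizeof part)
            seen = first + second;
    }

    if (ep >= 0)
        close(ep);
    close(p[0]);
    close(p[1]);
    return seen;
}
```

With level triggering both waits report the fd, because 90 bytes are still readable. With `EPOLLET` only the first one does, which is exactly the "stuck data" case if the reader stops before EAGAIN.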
EPOLLEXCLUSIVE: the thundering herd fix
Imagine N worker threads, each with its own epoll instance watching the same listening socket. When a connection arrives, the kernel wakes them ALL. They all race to accept; one wins, the others return EAGAIN. This is the thundering herd.
EPOLLEXCLUSIVE (Linux 4.5+) tells the kernel to wake one or more of the waiting epoll instances per event rather than all of them. Add it via epoll_ctl when registering the listening fd. Nginx uses it. Many homegrown servers don't and waste CPU.
SO_REUSEPORT is a complementary fix: each worker has its own listening socket, kernel balances incoming connections across them, no shared epoll state at all. This is what modern servers prefer.
kqueue: the FreeBSD answer
kqueue does what epoll does for fds, plus more event sources:
- EVFILT_READ, EVFILT_WRITE: like epoll for sockets and files.
- EVFILT_VNODE: file changes (NOTE_DELETE, NOTE_WRITE, NOTE_RENAME). Used by fs watchers on macOS.
- EVFILT_PROC: process events (NOTE_EXIT, NOTE_FORK).
- EVFILT_SIGNAL: synchronous signal delivery.
- EVFILT_TIMER: timers.
- EVFILT_USER: userspace-triggered events.
The unified API is elegant. On Linux you need eventfd + signalfd + timerfd + inotify + epoll to cover the same ground.
The kevent syscall does both registration and waiting in one call:
struct kevent change, event;
EV_SET(&change, fd, EVFILT_READ, EV_ADD | EV_ENABLE, 0, 0, NULL);
int n = kevent(kq, &change, 1, &event, 1, NULL);
Performance is similar to epoll. The unified API is the main win.
io_uring: a different model
io_uring throws out "tell me when ready" entirely. Instead, you tell the kernel what to do.
Two ring buffers shared between user and kernel via mmap:
- Submission queue (SQ): array of io_uring_sqe structs. Userspace fills entries, advances tail. Kernel reads entries, advances head.
- Completion queue (CQ): array of io_uring_cqe structs. Kernel fills, advances tail. Userspace reads, advances head.
To submit an operation:
- Get the next SQE from the SQ.
- Fill it: opcode, fd, buffer, offset, user_data.
- Bump the SQ tail.
- Call io_uring_enter to wake the kernel (or skip this in SQPOLL mode).
To process completions:
- Read the CQ head and tail.
- For each new CQE, get the result and your original user_data.
- Bump the CQ head.
The liburing library wraps all this nicely. Raw io_uring is doable but error-prone.
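To make the head/tail protocol concrete without depending on kernel support, here is a toy single-process model of the two rings. It is not the io_uring API: the `toy_*` names, the 8-entry size, and the single `TOY_OP_DOUBLE` opcode are invented for illustration, and the "kernel" is just a function call. What it does mirror is the ownership rule (the producer writes the tail, the consumer writes the head) and the acquire/release pairing on the indices.

```c
#include <stdatomic.h>
#include <stdint.h>

#define RING_ENTRIES 8u                 /* power of two: slot = counter & mask */
#define RING_MASK (RING_ENTRIES - 1u)
#define TOY_OP_DOUBLE 1                 /* hypothetical opcode: res = arg * 2 */

struct toy_sqe { uint8_t opcode; int32_t arg; uint64_t user_data; };
struct toy_cqe { uint64_t user_data; int32_t res; };

struct toy_ring {
    _Atomic uint32_t sq_head, sq_tail;  /* "kernel" owns head, "user" owns tail */
    _Atomic uint32_t cq_head, cq_tail;  /* "user" owns head, "kernel" owns tail */
    struct toy_sqe sqes[RING_ENTRIES];
    struct toy_cqe cqes[RING_ENTRIES];
};

/* User side: fill the next SQE, then publish it by bumping the SQ tail. */
int toy_submit(struct toy_ring *r, uint8_t opcode, int32_t arg, uint64_t user_data)
{
    uint32_t tail = atomic_load_explicit(&r->sq_tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->sq_head, memory_order_acquire);
    if (tail - head == RING_ENTRIES)
        return -1;                                          /* SQ full */
    r->sqes[tail & RING_MASK] = (struct toy_sqe){ opcode, arg, user_data };
    atomic_store_explicit(&r->sq_tail, tail + 1, memory_order_release);
    return 0;
}

/* "Kernel" side: consume SQEs, run each op, post one CQE per op.
 * Returns the number of operations processed. */
int toy_kernel_run(struct toy_ring *r)
{
    int done = 0;
    uint32_t head = atomic_load_explicit(&r->sq_head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->sq_tail, memory_order_acquire);
    while (head != tail) {
        uint32_t ctail = atomic_load_explicit(&r->cq_tail, memory_order_relaxed);
        uint32_t chead = atomic_load_explicit(&r->cq_head, memory_order_acquire);
        if (ctail - chead == RING_ENTRIES)
            break;                                          /* CQ full */
        struct toy_sqe sqe = r->sqes[head & RING_MASK];
        int32_t res = sqe.opcode == TOY_OP_DOUBLE ? sqe.arg * 2 : -1;
        r->cqes[ctail & RING_MASK] = (struct toy_cqe){ sqe.user_data, res };
        atomic_store_explicit(&r->cq_tail, ctail + 1, memory_order_release);
        atomic_store_explicit(&r->sq_head, ++head, memory_order_release);
        done++;
    }
    return done;
}

/* User side: pop one completion and bump the CQ head. Returns 0 on success. */
int toy_reap(struct toy_ring *r, struct toy_cqe *out)
{
    uint32_t head = atomic_load_explicit(&r->cq_head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->cq_tail, memory_order_acquire);
    if (head == tail)
        return -1;                                          /* nothing completed */
    *out = r->cqes[head & RING_MASK];
    atomic_store_explicit(&r->cq_head, head + 1, memory_order_release);
    return 0;
}
```

The `user_data` field is copied untouched from SQE to CQE, which is how a caller matches a completion back to the request that produced it.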
io_uring modes
- Default: userspace calls io_uring_enter to submit. Kernel processes ops, writes CQEs.
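- SQPOLL: a kernel thread polls the SQ, so userspace submits with no syscall while that thread is awake. After an idle period the thread sleeps, and one io_uring_enter call is needed to wake it. Costs a CPU core spinning while the ring is busy, but throughput is maximal.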
- IOPOLL: for block devices that support it, the submitting thread polls for completion instead of being interrupted. Lower latency for storage.
- Fixed files (IORING_REGISTER_FILES): pre-register fds to skip atomic refcounting on each op.
- Fixed buffers (IORING_REGISTER_BUFFERS): pre-register buffers to skip page-pinning on each op.
Combining SQPOLL + fixed files + fixed buffers + zero-copy send is how you hit millions of ops/sec.
io_uring security history
io_uring has had several CVEs since its introduction in Linux 5.1, largely because the attack surface is huge (asynchronous versions of a large share of the I/O syscalls, plus the kernel machinery to run them). Some examples:
- CVE-2022-29582: use-after-free caused by a race in io_uring timeout handling, local privilege escalation.
- CVE-2023-2598: out-of-bounds memory access in fixed buffers.
- Multiple use-after-free and refcount bugs in 2023.
In response, Google disabled io_uring in ChromeOS and on its production servers, and restricted it for Android apps. Docker has io_uring blocked by default in some seccomp profiles. Use io_uring on trusted workloads; think carefully before exposing it to untrusted containers.
How runtimes pick
- libuv (Node.js, others): epoll on Linux, kqueue on macOS/BSD, IOCP on Windows. io_uring is not the event loop backend; its use is limited and experimental.
- Tokio (Rust): mio crate (epoll/kqueue). io_uring via tokio-uring crate (separate runtime).
- Go: epoll for net poller. No io_uring; the runtime team has resisted because it adds complexity.
- Java NIO: epoll on Linux. io_uring via Netty's incubator API.
- Nginx: epoll by default. io_uring optional in recent versions.
Performance: rough numbers
On a single core, modern Linux, 10k concurrent idle TCP connections, occasional bursts:
| API | Ops/sec | Notes |
|---|---|---|
| select/poll | ~50k | O(N) scan kills you |
| epoll | ~500k | Solid baseline |
| io_uring (no SQPOLL) | ~700k | Less syscall overhead |
| io_uring (SQPOLL + fixed) | ~1.5M | At the cost of a dedicated core |
These are rough; benchmark your workload, the variance is huge.
When epoll is still the right answer
- Your kernel is 5.4 or older. (RHEL 8 ships 4.18 by default; many production systems are not on 5.10+.)
- You need predictable security posture (less code surface than io_uring).
- Your throughput target is already met. epoll handles way more than most apps need.
- Your library/runtime doesn't support io_uring well yet.
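If the choice has to be made at runtime, one option is to probe for io_uring and fall back to epoll. The sketch below is one possible approach, not an established recipe: the function name `io_uring_available` is hypothetical, and it assumes the kernel headers provide `<linux/io_uring.h>` and `__NR_io_uring_setup`.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/io_uring.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Returns 1 if a tiny io_uring instance can be created, 0 otherwise.
 * On failure errno hints at the reason, for example ENOSYS on a kernel
 * without io_uring or EPERM when a seccomp profile blocks the syscall. */
int io_uring_available(void)
{
#ifdef __NR_io_uring_setup
    struct io_uring_params params;
    memset(&params, 0, sizeof params);          /* the struct must be zeroed */

    long fd = syscall(__NR_io_uring_setup, 1u, &params);
    if (fd < 0)
        return 0;
    close((int)fd);                             /* probe only, discard the ring */
    return 1;
#else
    errno = ENOSYS;
    return 0;
#endif
}
```

A server could call this once at startup and select its epoll code path when it returns 0.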
Common pitfalls
Mental model
epoll is a smart receptionist: "tell me which of these phones is ringing, I'll answer it." kqueue is the same receptionist who also watches the office door, the calendar, and the printer. io_uring is a personal assistant: "here's a list of things, please do them and put the results in this tray." Each step pushes more work from your code into the kernel, reducing the syscalls you need to make per unit of useful work.
Learn more
- Docs: epoll man page (man7.org)
- Paper: kqueue paper (Jonathan Lemon, 2001)
- Paper: io_uring documentation (Jens Axboe)
- Article