Blocking, non-blocking, async I/O

From select to io_uring, edge vs level triggering, why thread-per-conn dies at 10k, and how async runtimes hide it all.

The four models, formally

Blocking I/O (synchronous, blocking)

The default. read(fd, buf, n) on a socket with no data: the kernel parks your thread on a wait queue attached to the socket. When data arrives, the kernel wakes you up. Read returns the data, you continue.

Programming model: dead simple. Read, get data, process, repeat. One thread per connection.

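Scaling problem: each thread needs a stack (8MB of virtual address space reserved by default on Linux), kernel structures, and context switch cost. 10k connections = 10k threads = 80GB of virtual memory. Stack pages are committed only when touched, so real RSS is lower but still in the GB range, and the scheduler is now thrashing.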

Non-blocking I/O (synchronous, non-blocking)

Set the fd to O_NONBLOCK with fcntl. Now read() returns immediately: with data if available, or -1 errno=EAGAIN if not.

By itself this is useless: polling in a loop wastes 100% CPU. The point is to combine with multiplexing: only call read when you know data is ready.
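As a rough illustration of the O_NONBLOCK behavior, the sketch below sets the flag and attempts a single read. The helper names set_nonblocking and try_read, and the -2 return convention, are assumptions made for this example.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Add O_NONBLOCK to the fd's existing status flags. Returns 0 or -1. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Attempt one read without ever sleeping.
 * Returns bytes read (0 means EOF), -1 on a real error,
 * or -2 when no data is available right now (EAGAIN/EWOULDBLOCK). */
ssize_t try_read(int fd, void *buf, size_t len)
{
    ssize_t n = read(fd, buf, len);
    if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return -2;
    return n;
}
```

Calling try_read in a tight loop is exactly the busy-polling the text warns about; the helper becomes useful once something else tells you when to call it.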

I/O multiplexing (select, poll, epoll, kqueue)

You hand the kernel a set of fds and say "wake me when any of them is ready." This is the foundation of all modern network servers.

select (4.2BSD, 1983; later standardized in POSIX): bitmap of fd numbers, limited to FD_SETSIZE (typically 1024) fds, O(N) scan in kernel and userspace per call.
poll (System V, 1986; later standardized in POSIX): array of struct pollfd, no fd limit, still O(N).
epoll (Linux, 2002): kernel maintains an interest list; epoll_wait returns only the ready fds. O(1) per ready fd, scales to millions.
kqueue (FreeBSD, macOS, 2000): similar idea, broader scope (also handles signals, timers, fs events).

epoll and kqueue solved C10K. The pattern:

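#include <sys/epoll.h>

struct epoll_event event = { .events = EPOLLIN, .data.fd = fd };  // watch fd for readability
struct epoll_event events[64];                                     // filled in by epoll_wait

int ep = epoll_create1(0);
epoll_ctl(ep, EPOLL_CTL_ADD, fd, &event);    // register fd once
while (1) {
    int n = epoll_wait(ep, events, 64, -1);  // block until at least one fd is ready
    for (int i = 0; i < n; i++) {
        handle(events[i].data.fd);           // handle() is the application's own callback
    }
}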

One thread, thousands of connections.

Asynchronous I/O (true async)

You submit an operation. The kernel does it. You get notified when done, with the result. The kernel did the actual data movement, you never called read.

POSIX AIO exists but is poorly implemented on Linux (glibc uses a userspace thread pool, so it is not actually async in the kernel). Windows IOCP has been doing this well since Windows NT 3.5 (1994). Linux finally has a real answer in io_uring (5.1+, generally considered production-ready from 5.10).

Sync vs async: the distinction is who performs the data transfer. Multiplexing is still synchronous: it only tells you about readiness, and you still call read yourself.

Edge-triggered vs level-triggered

epoll has two modes:

Level-triggered (LT, default): epoll_wait returns the fd as ready as long as data is available to read. Easier to program, same semantics as poll/select.
Edge-triggered (ET): epoll_wait returns the fd only when it transitions from "no data" to "data." Once notified, you must read until EAGAIN; otherwise the leftover data sits unreported until more data arrives and triggers a new edge, which may never happen.

ET can be faster (one event per arrival instead of one per epoll_wait call until you drain the fd). But it requires careful "read until EAGAIN" logic on non-blocking fds. nginx uses ET. Many simpler codebases stick with LT.

select vs poll vs epoll, complexity

For N watched fds with K ready:

API	Setup cost per call	Wait cost	Limit
select	O(N) bitmap copy	O(N)	FD_SETSIZE (1024)
poll	O(N) array copy	O(N)	no hard limit
epoll	O(1) (register once)	O(K)	none

At N=10000, K=10 (only 10 fds active), each epoll_wait touches about 10 entries where select or poll touch 10000: three orders of magnitude less per-call work.
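To make the N versus K difference concrete, here is a small sketch that counts how many entries userspace has to examine after one wait with each API. The function names ready_with_poll and ready_with_epoll are hypothetical, and the epoll version creates its instance inside the call only to keep the example short; a real server registers once and reuses the instance.

```c
#include <poll.h>
#include <stdlib.h>
#include <sys/epoll.h>
#include <unistd.h>

/* poll(): pass all n entries in, then scan all n entries afterwards.
 * Returns the number of readable fds; *scanned gets the entries examined. */
int ready_with_poll(const int *fds, int n, int *scanned)
{
    struct pollfd *pfds = calloc((size_t)n, sizeof *pfds);
    int ready = 0;

    *scanned = 0;
    if (!pfds)
        return -1;
    for (int i = 0; i < n; i++) {
        pfds[i].fd = fds[i];
        pfds[i].events = POLLIN;
    }
    if (poll(pfds, (nfds_t)n, 0) < 0) {
        free(pfds);
        return -1;
    }
    for (int i = 0; i < n; i++) {              /* O(N) scan in userspace */
        (*scanned)++;
        if (pfds[i].revents & POLLIN)
            ready++;
    }
    free(pfds);
    return ready;
}

/* epoll: the kernel hands back only the ready entries.
 * Returns the number of readable fds; *scanned gets the entries examined. */
int ready_with_epoll(const int *fds, int n, int *scanned)
{
    struct epoll_event out[64];
    int ep = epoll_create1(0);

    *scanned = 0;
    if (ep == -1)
        return -1;
    for (int i = 0; i < n; i++) {              /* registration, done once in real code */
        struct epoll_event ev = { .events = EPOLLIN };
        ev.data.fd = fds[i];
        if (epoll_ctl(ep, EPOLL_CTL_ADD, fds[i], &ev) == -1) {
            close(ep);
            return -1;
        }
    }
    int k = epoll_wait(ep, out, 64, 0);        /* O(K): only ready fds come back */
    if (k > 0)
        *scanned = k;
    close(ep);
    return k;
}
```

With 50 watched pipes of which 3 have data, the poll version examines 50 entries and the epoll version examines 3.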

io_uring: why it's different

epoll tells you readiness. You still call read in a syscall. For high-throughput servers, that syscall (one per ready fd) can become the bottleneck.

io_uring inverts this: userspace and kernel share two ring buffers (submission queue SQ, completion queue CQ) in mmap'd memory. To submit operations, userspace writes entries into the SQ and bumps the tail pointer. To process them, the kernel reads SQ entries (either on io_uring_enter syscall or via a kernel thread in SQPOLL mode) and writes completions to the CQ.
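The submit-by-bumping-the-tail mechanism can be modeled in a few lines. What follows is a toy model only, not the io_uring API: struct toy_sqe, struct toy_ring, toy_submit, and toy_consume are all hypothetical names, the entry layout is invented, and the "kernel" side is just another function in the same process. It is meant to show why queuing an operation needs no syscall.

```c
#include <stdatomic.h>
#include <stdint.h>

#define RING_ENTRIES 8u                      /* must be a power of two */
#define RING_MASK    (RING_ENTRIES - 1u)

/* Hypothetical submission entry; the real io_uring SQE layout is different. */
struct toy_sqe {
    uint8_t  opcode;
    int      fd;
    uint64_t user_data;
};

struct toy_ring {
    _Atomic uint32_t head;                   /* advanced by the consumer ("kernel") */
    _Atomic uint32_t tail;                   /* advanced by the producer ("userspace") */
    struct toy_sqe   entries[RING_ENTRIES];
};

/* Producer: fill the next slot, then publish it by bumping tail.
 * Returns 0 on success, -1 if the ring is full. */
int toy_submit(struct toy_ring *r, struct toy_sqe sqe)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (tail - head == RING_ENTRIES)
        return -1;
    r->entries[tail & RING_MASK] = sqe;
    /* Release store: the entry is visible before the new tail is. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return 0;
}

/* Consumer: take everything published so far (up to max entries).
 * Returns the number of entries copied into out. */
int toy_consume(struct toy_ring *r, struct toy_sqe *out, int max)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    int n = 0;

    while (head != tail && n < max) {
        out[n++] = r->entries[head & RING_MASK];
        head++;
    }
    atomic_store_explicit(&r->head, head, memory_order_release);
    return n;
}
```

Indices grow without wrapping and are masked only when indexing, so "full" is simply tail minus head equal to the ring size. Many entries can be queued with plain memory writes before a single notification tells the consumer to look.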

Benefits:

One syscall amortized over many ops.
True async: read actually returns data via the CQ, not just "fd is ready."
Supports almost every operation: read, write, accept, connect, openat, fsync, recvmsg, sendmsg, splice.
Zero-copy with IORING_OP_SEND_ZC.
Linked operations: "do these in sequence."

Downsides:

API has many sharp edges; getting it right is hard.
Security CVEs have been frequent; some environments disable it.
Not all operations are equally optimized.

io_uring is the future for high-perf servers; epoll remains the durable, well-understood default.

Async runtimes: how higher-level frameworks hide this

You don't write epoll loops in production. You use a runtime:

Node.js: libuv runs an epoll/kqueue loop. Your JavaScript callbacks are scheduled when fds are ready.
Python asyncio: the event loop uses the selectors module, which picks epoll on Linux. async/await coroutines are suspended and resumed by callbacks the loop schedules.
Go: the runtime's netpoller uses epoll on Linux. net.Conn.Read looks blocking to your goroutine, but the runtime parks the goroutine and multiplexes all parked sockets through the netpoller instead of blocking an OS thread per connection.
Tokio/Rust: epoll via mio crate, futures resumed when ready, work-stealing scheduler on top.
Java NIO/Netty: Selector wraps epoll, Netty adds buffer pooling and zero-copy.

The pattern is identical everywhere: one (or a few) OS threads run an event loop; user code runs as small units of work that yield on I/O.
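Stripped of everything a real runtime adds, that pattern can be sketched in C as a table of callbacks dispatched from an epoll loop. This is a bare-bones illustration, not how any of the runtimes above is actually implemented; struct handler, loop_add, loop_run_once, and count_bytes are hypothetical names.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* One unit of user code: a callback the loop runs when fd is readable. */
struct handler {
    int   fd;
    void (*on_readable)(struct handler *self);
    void *state;
};

/* Register a handler; the pointer travels through the kernel in data.ptr. */
int loop_add(int ep, struct handler *h)
{
    struct epoll_event ev = { .events = EPOLLIN };
    ev.data.ptr = h;
    return epoll_ctl(ep, EPOLL_CTL_ADD, h->fd, &ev);
}

/* One loop iteration: wait up to timeout_ms, then run each ready callback.
 * Returns the number of callbacks dispatched, or -1 on error. */
int loop_run_once(int ep, int timeout_ms)
{
    struct epoll_event events[64];
    int n = epoll_wait(ep, events, 64, timeout_ms);

    for (int i = 0; i < n; i++) {
        struct handler *h = events[i].data.ptr;
        h->on_readable(h);                   /* must return quickly */
    }
    return n;
}

/* Example callback: consume what is available, add the byte count to *state. */
void count_bytes(struct handler *self)
{
    char buf[256];
    ssize_t n = read(self->fd, buf, sizeof buf);

    if (n > 0)
        *(long *)self->state += n;
}
```

A server would call loop_run_once in a loop with a timeout of -1. Every callback runs on the loop's thread, so a callback that does not return quickly delays all the others.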

The blocking-call-in-event-loop trap

If you call a synchronous function that blocks (regular-file I/O, where O_NONBLOCK has no effect; DNS resolution via getaddrinfo; a synchronous database driver; a CPU-heavy computation), the entire event loop is stalled. Every other connection on that loop waits.

Examples that have caused production incidents:

Node.js using fs.readFileSync in a request handler. p99 latency spikes when disk is slow.
Python asyncio using requests library (sync) instead of aiohttp. One slow upstream stalls everything.
Go calling cgo with a blocking C call. The runtime spawns a new OS thread to compensate, but the cost is a new OS thread per concurrent blocking call.
Java NIO calling JDBC on the event thread. JDBC is synchronous.

The fix: never block in an event loop. Move blocking work to a thread pool: worker_threads in Node.js, asyncio.to_thread in Python, a dedicated executor in Java. Go mostly handles this for you, because the runtime schedules other goroutines while one is blocked.

File I/O is special

Disk I/O does not play well with readiness APIs on Linux. poll and select always report regular files as "ready to read," even if the read will block for 100ms on disk, and epoll_ctl refuses to register them at all (EPERM). For true async file I/O on Linux, the usual answer is io_uring.

POSIX AIO on Linux is implemented as a userspace thread pool inside glibc. It works but is slow. Most production systems either use synchronous file I/O in a thread pool, or io_uring for high-throughput cases.
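The readiness behavior of regular files is easy to check on a Linux machine. The two helpers below are a sketch with hypothetical names (poll_says_readable, epoll_register_errno): one asks poll whether an fd is readable right now, the other reports what epoll_ctl says when asked to watch it.

```c
#include <errno.h>
#include <poll.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Returns 1 if poll() reports fd readable right now, 0 if not, -1 on error. */
int poll_says_readable(int fd)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int n = poll(&pfd, 1, 0);

    if (n < 0)
        return -1;
    return (n == 1 && (pfd.revents & POLLIN)) ? 1 : 0;
}

/* Try to register fd with a fresh epoll instance.
 * Returns 0 if accepted, otherwise the errno value from the failing call. */
int epoll_register_errno(int fd)
{
    struct epoll_event ev = { .events = EPOLLIN };
    ev.data.fd = fd;

    int ep = epoll_create1(0);
    if (ep == -1)
        return errno;
    int rc = (epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev) == -1) ? errno : 0;
    close(ep);
    return rc;
}
```

For an empty pipe, poll reports "not readable" and epoll accepts the fd. For a regular file, poll reports "readable" regardless of what the disk is doing, and epoll_ctl fails with EPERM.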

Practical decision tree

Few connections (<100), simple service: blocking + thread per connection.
100-10000 connections, network heavy: epoll/kqueue (or async runtime that uses them).
10000+ connections, latency critical: io_uring on Linux, IOCP on Windows.
Disk I/O dominates: io_uring or a thread pool for sync I/O.
Mixed workload: use an async runtime that knows when to offload to threads.

Common pitfalls

Mental model

Blocking I/O is calling someone and waiting on hold. Non-blocking is checking your email every 30 seconds. Multiplexing (epoll) is having an assistant who tells you "you have mail from Alice and Bob now," so you only check when there's something. Async I/O (io_uring) is asking your assistant "fetch the document from Alice and put it on my desk," and you keep working until they tap your shoulder. Each step removes a layer of "the thread is waiting for the network."

Learn more

Article: C10K, Dan Kegel
Paper: io_uring paper, Jens Axboe
Article: Cloudflare: io_uring is faster than epoll, Cloudflare
Docs: Linux epoll man page, man7.org
