2026-05-08
Every read() or write() costs a syscall: roughly 100ns of mode switch and register save/restore, plus the extra work added by post-Meltdown/Spectre mitigations such as kernel page-table isolation (page-table switches and TLB pressure). For a server doing 1M IOPS, that is 100ms of pure overhead per second per core, burned before any actual work. io_uring (Linux 5.1+) removes most of this cost by replacing per-operation syscalls with shared-memory ring buffers between userspace and the kernel.
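The arithmetic behind that figure is easy to check. The sketch below is only a back-of-the-envelope model: the function names are made up for illustration, and the 100ns cost is the rough estimate used above, not a measurement.

```c
#include <assert.h>

/* Fraction of one core spent purely on syscall entry/exit. */
double syscall_overhead_fraction(double syscalls_per_sec, double ns_per_syscall)
{
    assert(syscalls_per_sec >= 0 && ns_per_syscall >= 0);
    return syscalls_per_sec * ns_per_syscall / 1e9;   /* ns per second -> fraction */
}

/* Same estimate when one syscall carries a whole batch of operations. */
double batched_overhead_fraction(double ops_per_sec, double ns_per_syscall,
                                 double ops_per_syscall)
{
    assert(ops_per_syscall >= 1);
    return syscall_overhead_fraction(ops_per_sec / ops_per_syscall, ns_per_syscall);
}
```

With these assumptions, syscall_overhead_fraction(1e6, 100) comes out at 0.1, the 100ms per second quoted above. Batching 32 operations per syscall in the same model drops that to roughly 0.3% of a core.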
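The mechanism is two lock-free, single-producer/single-consumer ring buffers in memory mapped between your process and the kernel: the submission queue (SQ), where userspace places submission queue entries (SQEs) for the kernel to consume, and the completion queue (CQ), where the kernel posts completion queue entries (CQEs) for userspace to consume.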
You submit work by writing memory, with no syscall. You reap completions by reading memory, with no syscall. The only syscall on the hot path is io_uring_enter(), which tells the kernel to process pending SQEs (and can optionally wait for completions), and even that can be skipped with SQPOLL mode: a kernel thread polls the SQ tail and processes entries as they appear. As long as that thread is awake, the hot path becomes purely memory operations.
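To make "submitting by writing memory" concrete, here is a minimal userspace model of one single-producer/single-consumer ring using C11 atomics. It is a sketch of the general technique, not the kernel's actual layout: struct entry, struct spsc_ring, ring_push and ring_pop are hypothetical names, and the real rings are set up with io_uring_setup() and mmap().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_ENTRIES 8u                    /* must be a power of two */
#define RING_MASK (RING_ENTRIES - 1u)
static_assert((RING_ENTRIES & RING_MASK) == 0, "ring size must be a power of two");

/* Hypothetical stand-in for an SQE or CQE; not the kernel's struct. */
struct entry { uint64_t user_data; };

struct spsc_ring {
    _Atomic uint32_t head;                 /* advanced only by the consumer */
    _Atomic uint32_t tail;                 /* advanced only by the producer */
    struct entry slots[RING_ENTRIES];
};

/* Producer: fill a slot, then publish it with a release store to tail. */
bool ring_push(struct spsc_ring *r, struct entry e)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_ENTRIES)
        return false;                      /* ring is full */
    r->slots[tail & RING_MASK] = e;
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer: acquire-load tail, copy the slot out, then release it via head. */
bool ring_pop(struct spsc_ring *r, struct entry *out)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail)
        return false;                      /* ring is empty */
    *out = r->slots[head & RING_MASK];
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}
```

The indices are free-running counters masked on access, so full and empty states are distinguishable without wasting a slot. In the submission direction the application plays the producer and the kernel the consumer; for completions the roles are reversed.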
Real-world example: ScyllaDB and recent versions of PostgreSQL (18+, via the io_method = io_uring setting) use io_uring for storage I/O. The Cloudflare team reported their proxy moving from epoll+read to io_uring cut CPU per request by ~30% under load, primarily by collapsing accept/recv/send into a single batched submission. fio benchmarks on NVMe routinely show io_uring matching SPDK's userspace driver within 5%, while still using the kernel block layer.
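Beyond avoiding syscalls, io_uring supports operations epoll never could, since epoll only reports readiness and does nothing useful for regular files: IORING_OP_READ, IORING_OP_WRITE, IORING_OP_OPENAT, IORING_OP_STATX, IORING_OP_SENDMSG, and even buffered file I/O, which Linux native AIO (io_submit(2)) famously couldn't do without falling back to synchronous behavior. Linked SQEs (the IOSQE_IO_LINK flag) let you express dependencies (open → read → close) as one submission; chains that hand a freshly opened file to later steps rely on direct descriptors, which need a newer kernel.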
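As a sketch of how a linked chain is expressed, the function below fills two SQEs by hand using only the kernel's <linux/io_uring.h> definitions. It uses a simpler chain than the one above, write followed by fsync on an already-open descriptor, because that needs no direct-descriptor setup. The function name and the user_data values are hypothetical, and in a real program the SQEs would be slots in the mapped submission queue (or obtained through liburing) rather than a local array.

```c
#include <assert.h>
#include <linux/io_uring.h>
#include <stdint.h>
#include <string.h>

/* Fill two SQEs so that the fsync starts only after the write has succeeded. */
void prep_write_then_fsync(struct io_uring_sqe sqes[2], int fd,
                           const void *buf, unsigned len, uint64_t offset)
{
    assert(sqes != NULL && buf != NULL);
    memset(sqes, 0, 2 * sizeof(sqes[0]));  /* unused SQE fields must be zero */

    sqes[0].opcode = IORING_OP_WRITE;
    sqes[0].fd = fd;
    sqes[0].addr = (uint64_t)(uintptr_t)buf;
    sqes[0].len = len;
    sqes[0].off = offset;
    sqes[0].flags = IOSQE_IO_LINK;         /* link the next SQE to this one */
    sqes[0].user_data = 1;                 /* echoed back in the matching CQE */

    sqes[1].opcode = IORING_OP_FSYNC;
    sqes[1].fd = fd;
    sqes[1].user_data = 2;                 /* end of chain: no IOSQE_IO_LINK */
}
```

If the write fails or completes short, the kernel cancels the rest of the chain and the fsync's CQE carries -ECANCELED, so each link still produces its own completion to check.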
Rule of thumb: if your workload exceeds ~50K IOPS per thread, syscall overhead becomes a measurable fraction of CPU. Switch to io_uring. Below that, epoll is simpler and the win is marginal.
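Gotchas: Buffers passed in SQEs must remain valid until the corresponding CQE arrives, because the kernel reads from or writes to them asynchronously. Use IORING_REGISTER_BUFFERS to pin and map buffer pages once (then issue the fixed-buffer read/write opcodes against them), saving the per-op pinning and mapping cost. And the SQPOLL thread spins on a core for as long as it is polling: it only goes to sleep after sq_thread_idle milliseconds without submissions (about one second if you leave it at 0), after which the next submission has to wake it with io_uring_enter() and IORING_ENTER_SQ_WAKEUP. Tune sq_thread_idle to trade idle CPU burn against wakeup latency.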
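For the SQPOLL gotcha, the knob lives in the parameters passed at ring creation. The sketch below uses the raw syscall and kernel headers rather than liburing; sqpoll_params and setup_sqpoll_ring are hypothetical helper names, and any idle value you pass is a choice to tune, not a recommendation. Ring creation can fail depending on kernel version, privileges, or sandboxing, so the return value must be checked.

```c
#include <assert.h>
#include <linux/io_uring.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fill setup parameters for SQPOLL with an explicit idle timeout. */
void sqpoll_params(struct io_uring_params *p, unsigned idle_ms)
{
    assert(p != NULL);
    memset(p, 0, sizeof(*p));              /* reserved fields must be zero */
    p->flags = IORING_SETUP_SQPOLL;        /* kernel thread polls the SQ */
    p->sq_thread_idle = idle_ms;           /* ms without work before it sleeps */
}

/* Create the ring. Returns a ring fd, or -1 with errno set. */
int setup_sqpoll_ring(unsigned entries, unsigned idle_ms)
{
    struct io_uring_params p;
    sqpoll_params(&p, idle_ms);
    return (int)syscall(__NR_io_uring_setup, entries, &p);
}
```

A short timeout frees the core sooner during quiet periods, at the price of more wakeup syscalls when traffic resumes; a long one keeps the hot path syscall-free but spends a core on polling.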
