Daily Low-Level Programming: io_uring: Asynchronous I/O Without the Syscall Tax


2026-05-08

Every read() or write() costs a syscall: roughly 100ns of mode switch and register save/restore, plus extra work such as page-table switches (KPTI) on kernels running Meltdown/Spectre mitigations. For a server doing 1M IOPS, that's 100ms of pure overhead per second per core, or 10% of the core burned before any actual work. io_uring (Linux 5.1+) removes most of this cost by replacing per-operation syscalls with shared-memory ring buffers between userspace and the kernel.
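The arithmetic above can be sketched as a tiny helper. This is only a back-of-the-envelope model: the function name and the ios_per_syscall parameter are made up for illustration, and the 100ns figure is the estimate from the text, not a measurement.

```c
#include <assert.h>

/*
 * Fraction of one core spent purely on kernel entry/exit.
 * ios_per_syscall models batching: 1 means one syscall per I/O,
 * 32 means one syscall submits 32 I/Os (hypothetical parameter).
 */
static double syscall_overhead_fraction(double iops, double ns_per_syscall,
                                        double ios_per_syscall)
{
    assert(iops >= 0.0 && ns_per_syscall >= 0.0 && ios_per_syscall >= 1.0);
    double syscalls_per_sec = iops / ios_per_syscall;
    /* nanoseconds of overhead per second, divided by nanoseconds per second */
    return syscalls_per_sec * ns_per_syscall / 1e9;
}
```

With 1M IOPS at 100ns and one syscall per I/O this gives 0.10, the 100ms per second quoted above. Amortizing one kernel entry over 32 I/Os drops it to roughly 0.3% of a core. Plug in a cost measured on your own machine before drawing conclusions.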

The mechanism is two lock-free, single-producer/single-consumer ring buffers in memory mapped between your process and the kernel:

Submission Queue (SQ): userspace writes I/O requests (opcode, fd, buffer, offset) as SQEs and bumps the tail.
Completion Queue (CQ): kernel writes results as CQEs and bumps its tail; userspace reads from the head.
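To make the head/tail protocol concrete, here is a toy single-producer/single-consumer ring in portable C11. It is a simplified model, not the kernel's layout: the real SQ adds an index array and lives in mmap'd memory, and every name here (toy_ring, toy_ring_push, toy_ring_pop) is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_ENTRIES 8u                 /* must be a power of two */
#define RING_MASK (RING_ENTRIES - 1u)

struct toy_ring {
    _Atomic uint32_t head;              /* advanced only by the consumer */
    _Atomic uint32_t tail;              /* advanced only by the producer */
    uint64_t slots[RING_ENTRIES];       /* stand-in for SQEs or CQEs */
};

static void toy_ring_init(struct toy_ring *r)
{
    assert((RING_ENTRIES & RING_MASK) == 0);
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
}

/* Producer: write the entry, then publish it by bumping the tail. */
static bool toy_ring_push(struct toy_ring *r, uint64_t entry)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_ENTRIES)
        return false;                   /* ring full */
    r->slots[tail & RING_MASK] = entry;
    /* Release: the slot contents must be visible before the new tail. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer: read the entry, then free the slot by bumping the head. */
static bool toy_ring_pop(struct toy_ring *r, uint64_t *out)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail)
        return false;                   /* ring empty */
    *out = r->slots[head & RING_MASK];
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}
```

For the SQ, userspace plays the producer and the kernel the consumer; for the CQ the roles flip. Because each index has exactly one writer, no locks are needed, only the release/acquire pairing on head and tail.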

You submit work by writing memory, with no syscall. You reap completions by reading memory, again with no syscall. The only syscall is io_uring_enter(), which tells the kernel to process pending SQEs (a single call can submit a whole batch and optionally wait for completions), and even that can be skipped with SQPOLL mode: a kernel thread polls the SQ tail and processes entries as they appear. The hot path becomes purely memory operations.

Real-world example: ScyllaDB (through its Seastar framework) and PostgreSQL 18+ (with io_method = io_uring) can use io_uring for storage I/O. The Cloudflare team reported their proxy moving from epoll+read to io_uring cut CPU per request by ~30% under load, primarily by collapsing accept/recv/send into a single batched submission. fio benchmarks on NVMe routinely show io_uring matching SPDK's userspace driver within 5%, while still using the kernel block layer.

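Beyond avoiding syscalls, io_uring covers operations that epoll's readiness model never could: IORING_OP_READ, IORING_OP_WRITE, IORING_OP_OPENAT, IORING_OP_STATX, IORING_OP_SENDMSG, and even buffered file I/O, which Linux native AIO (io_submit(2), libaio) famously couldn't do without falling back to synchronous behavior. Linked SQEs (IOSQE_IO_LINK) let you express dependencies (open → read → close) as one submission; if one step fails, the rest of the chain is cancelled. Chaining an open into a read needs direct descriptors (a fixed file slot), since the read has to name the file before the open has completed.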
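As a rough sketch of linking, the helper below fills two raw SQEs so that an fsync runs only after a write succeeds. It uses a simpler write → fsync chain instead of open → read → close, because both steps can share an already-open fd. The function name and user_data values are invented for illustration, and it assumes kernel headers new enough to define IORING_OP_WRITE (Linux 5.6+). It only prepares the entries; placing them in the SQ and submitting is left out.

```c
#include <assert.h>
#include <linux/io_uring.h>
#include <stdint.h>
#include <string.h>

/*
 * Fill two SQEs forming a chain: write, then fsync.
 * The caller is assumed to own sqes[0..1], for example two consecutive
 * slots in the mmap'd SQE array.
 */
static void prep_write_then_fsync(struct io_uring_sqe sqes[2], int fd,
                                  const void *buf, uint32_t len, uint64_t off)
{
    assert(sqes != NULL && buf != NULL);
    memset(sqes, 0, 2 * sizeof(sqes[0]));

    sqes[0].opcode = IORING_OP_WRITE;
    sqes[0].fd = fd;
    sqes[0].addr = (uint64_t)(uintptr_t)buf;
    sqes[0].len = len;
    sqes[0].off = off;
    sqes[0].flags = IOSQE_IO_LINK;      /* the next SQE waits for this one */
    sqes[0].user_data = 1;              /* arbitrary tag echoed in the CQE */

    sqes[1].opcode = IORING_OP_FSYNC;
    sqes[1].fd = fd;
    sqes[1].user_data = 2;              /* last in the chain: no link flag */
}
```

Each SQE still produces its own CQE. If the write fails, the fsync is not executed and completes with -ECANCELED, so the completion handler should check both results.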

Rule of thumb: if your workload exceeds ~50K IOPS per thread, syscall overhead becomes a measurable fraction of CPU. Switch to io_uring. Below that, epoll is simpler and the win is marginal.

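Gotchas: Buffers passed in SQEs must remain valid until the corresponding CQE arrives, because the kernel reads or writes them asynchronously. Use IORING_REGISTER_BUFFERS to pin and map buffer pages once, then issue IORING_OP_READ_FIXED / IORING_OP_WRITE_FIXED to skip the per-op pinning cost. And SQPOLL burns a full core while its thread is polling: the thread goes to sleep only after sq_thread_idle milliseconds without submissions, after which you must wake it with io_uring_enter() and IORING_ENTER_SQ_WAKEUP.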
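The SQPOLL knobs can be sketched with the raw kernel structures. The two helper names are hypothetical, and the sketch deliberately stops short of calling io_uring_setup(2) or mapping the rings: it only shows which fields and flags are involved.

```c
#include <assert.h>
#include <linux/io_uring.h>
#include <stdbool.h>
#include <string.h>

/*
 * Setup parameters requesting an SQ polling thread that goes to sleep
 * after idle_ms milliseconds without new submissions.
 */
static struct io_uring_params sqpoll_params(unsigned idle_ms)
{
    struct io_uring_params p;
    memset(&p, 0, sizeof(p));
    p.flags = IORING_SETUP_SQPOLL;
    p.sq_thread_idle = idle_ms;
    return p;
}

/*
 * sq_flags is assumed to point at the SQ ring's flags word (found via
 * p.sq_off.flags after mmap). Returns true once the poller has gone to
 * sleep. Real code also needs a full memory barrier between publishing
 * the new SQ tail and this load; __atomic_load_n is a GCC/Clang builtin.
 */
static bool sq_needs_wakeup(unsigned *sq_flags)
{
    assert(sq_flags != NULL);
    unsigned flags = __atomic_load_n(sq_flags, __ATOMIC_ACQUIRE);
    return (flags & IORING_SQ_NEED_WAKEUP) != 0;
}
```

In use, the params struct goes to io_uring_setup(2), or to io_uring_queue_init_params() if you use liburing. After each submission you check the flag, and only when it is set do you pay for an io_uring_enter() call with IORING_ENTER_SQ_WAKEUP. liburing's io_uring_submit() performs this check for you. A longer idle time keeps the hot path syscall-free across short gaps in traffic, at the price of more spinning.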

See it in action: Check out Diego Didona - Understanding Modern Storage APIs: A systematic study of libaio, SPDK, and io_uring by Systor Conference to see this theory applied.

Key Takeaway: io_uring replaces per-I/O syscalls with shared-memory ring buffers, turning the hot path into pure memory operations and unlocking syscall-free, batched, dependency-aware async I/O.
