Daily Low-Level Programming: Direct I/O and the O_DIRECT Flag: Bypassing the Page Cache

2026-05-07

The page cache is usually your friend, but sometimes it is overhead you don't want. Opening a file with O_DIRECT asks the kernel to skip the cache and transfer data (typically by DMA) directly between userspace buffers and the storage device. The data path becomes app buffer → device, not app buffer → page cache → device.

Why bypass the cache? Three legitimate reasons:

Double-buffering waste: Databases like PostgreSQL, MySQL/InnoDB, and Oracle maintain their own buffer pools. Letting the kernel also cache the same data wastes RAM and CPU on redundant copies.
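Predictable latency: The page cache can introduce writeback storms when many dirty pages flush at once. Direct I/O removes that source of jitter, so latency tracks the device more closely, which matters for write-ahead logs and trading systems. It does not make timing fully deterministic: the device and the I/O scheduler still add variance.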
Cache pollution: A backup tool reading 2 TB of cold data shouldn't evict your hot working set. O_DIRECT keeps cold data out of the cache entirely.

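The brutal alignment rules. O_DIRECT enforces three constraints, and violating any of them typically makes the read or write fail with EINVAL (exact behavior varies by filesystem and kernel version):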

The buffer address must be aligned to the device's logical block size (typically 512 B or 4 KiB).
The transfer length must be a multiple of that block size.
The file offset must also be a multiple of that block size.
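
The three rules can be checked before issuing a request. The sketch below is only an illustration: dio_request_ok and dio_round_up are hypothetical helper names, and the code assumes the block size is a power of two (true for 512 B and 4 KiB).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/* Hypothetical helper: returns true when a request satisfies all three
 * O_DIRECT constraints for the given block size.
 * Assumes `align` is a power of two (e.g. 512 or 4096). */
static bool dio_request_ok(const void *buf, size_t len, off_t offset, size_t align)
{
    uintptr_t mask = (uintptr_t)align - 1;

    return ((uintptr_t)buf & mask) == 0          /* buffer address is aligned  */
        && ((uintptr_t)len & mask) == 0          /* length is a block multiple */
        && ((uintmax_t)offset & mask) == 0;      /* offset is a block multiple */
}

/* Hypothetical helper: round a length up to the next multiple of `align`
 * (power of two), e.g. to size a buffer for a partial final block. */
static size_t dio_round_up(size_t len, size_t align)
{
    return (len + align - 1) & ~(align - 1);
}
```

Checking up front turns a bare EINVAL from the kernel into an error you can report with context, such as which of the three values was misaligned.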

You can't rely on malloc(4096): glibc only guarantees alignment suitable for any built-in type (16 bytes on x86-64), not block alignment. Use posix_memalign(&buf, 4096, size) or aligned_alloc(4096, size), keeping size a multiple of the alignment for portability. Query the required alignment with statx() using STATX_DIOALIGN (Linux 6.1+), which returns stx_dio_mem_align and stx_dio_offset_align.
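
A minimal sketch of that query-then-allocate sequence follows. The helper names dio_mem_align and dio_alloc are made up for this example, and the 4096-byte fallback passed by the caller is an assumption for kernels, C libraries, or filesystems that do not report STATX_DIOALIGN.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical helper: ask the kernel for the direct I/O memory alignment
 * of `path`. Returns `fallback` when STATX_DIOALIGN is not available at
 * build time, the call fails, or the file reports no direct I/O support. */
static size_t dio_mem_align(const char *path, size_t fallback)
{
#ifdef STATX_DIOALIGN
    struct statx stx;
    memset(&stx, 0, sizeof stx);
    if (statx(AT_FDCWD, path, 0, STATX_DIOALIGN, &stx) == 0
        && (stx.stx_mask & STATX_DIOALIGN)
        && stx.stx_dio_mem_align != 0)
        return stx.stx_dio_mem_align;
#else
    (void)path;  /* headers predate STATX_DIOALIGN */
#endif
    return fallback;
}

/* Hypothetical helper: allocate a zeroed buffer aligned for direct I/O.
 * Returns NULL on failure; release the buffer with free(). */
static void *dio_alloc(size_t align, size_t size)
{
    void *buf = NULL;

    /* posix_memalign needs a power-of-two multiple of sizeof(void *),
     * but the kernel may report a smaller memory alignment. */
    if (align < sizeof(void *))
        align = sizeof(void *);
    if (posix_memalign(&buf, align, size) != 0)
        return NULL;
    memset(buf, 0, size);
    return buf;
}
```

A buffer from dio_alloc can then be passed to pread() or pwrite() on a descriptor opened with O_DIRECT, as long as the length and file offset respect stx_dio_offset_align (or the fallback block size).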

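Concrete example: PostgreSQL. For decades Postgres has relied on the page cache for caching beyond shared_buffers and on fsync() for durability. PG 16 (2023) added a developer-only setting, debug_io_direct, that opens data files and/or WAL with O_DIRECT, and PG 18 (2025) added asynchronous I/O through io_method (worker or io_uring). Direct I/O is still not recommended for production there, but the motivation is the one above: the shared_buffers pool already caches pages. For illustration, on a 256 GB server with 64 GB of shared_buffers, double caching can tie up on the order of 60 GB of RAM holding pages that shared_buffers already has, RAM that could otherwise hold more index data.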

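Rule of thumb for sizing. Direct I/O wins when your application's own cache achieves a better hit rate than the page cache would with the same memory. As a rough heuristic, if your DB buffer pool is more than 25% of RAM and its replacement policy fits the workload better than the kernel's generic LRU-style lists (database pools typically use tuned variants such as clock sweep or midpoint-insertion LRU), O_DIRECT tends to pay off. Below that threshold, let the kernel cache for you.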

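Gotcha: O_DIRECT does not imply O_SYNC. A completed write may still sit in the drive's volatile write cache, and metadata needed to find the data (such as a new file size) may not be on disk yet. You still need fdatasync() (or O_DSYNC) for durability: direct I/O bypasses the kernel's cache, not the drive's.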
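
The write-then-sync pattern might look like the sketch below. write_durable is a hypothetical helper name. The function works on any file descriptor, and when the descriptor was opened with O_DIRECT the caller is assumed to pass a buffer, length, and offset that already meet the alignment rules.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write `len` bytes at `offset`, then force the data
 * (and the metadata needed to read it back) to stable storage.
 * Returns 0 on success, -1 with errno set on failure. */
static int write_durable(int fd, const void *buf, size_t len, off_t offset)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, offset);
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted before writing: retry */
            return -1;
        }
        if (n == 0) {              /* no progress: report instead of spinning */
            errno = EIO;
            return -1;
        }
        p += n;
        len -= (size_t)n;
        offset += n;
    }

    /* A successful pwrite() says nothing about the drive's volatile cache;
     * fdatasync() is what requests the flush. */
    return fdatasync(fd);
}
```

Opening the file with O_WRONLY | O_DIRECT | O_DSYNC instead would make each write durable on return, at the cost of a flush per write rather than one per batch.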

See it in action: check out "why read is faster when using O_DIRECT flag?" by Peter Schneider to see this theory applied.

Key Takeaway: O_DIRECT trades the page cache's adaptive caching for predictable latency and zero double-buffering — but only pays off when your app's own cache is smarter than the kernel's, and the alignment rules are unforgiving.
