Daily Low-Level Programming: The Page Cache: Why Your Second Read Is 1000x Faster

2026-05-07

When you read() a file on Linux, the kernel doesn't go to disk if it can avoid it. It serves the data from the page cache — a giant, opportunistic cache of file-backed pages kept in otherwise-free RAM. Understanding it is the difference between debugging "why is my benchmark suspiciously fast?" and "why does production fall off a cliff at 3am?"

How it works. Every buffered file read goes through the page cache. The kernel looks the page up in the file's own per-inode index, keyed by the offset within the file in page-sized units (typically 4 KiB). On a hit, it copies the data into your buffer with no disk I/O. On a miss, it allocates a page, issues a block-layer read, and (importantly) does readahead: it pulls in subsequent pages, typically tens to hundreds of KiB with the window growing while access stays sequential, on the assumption you'll keep going. This is why sequential reads on cold files can still approach SSD bandwidth: the next pages are already in flight before you ask for them.
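The hit/miss/readahead flow can be sketched with a toy model. This is not how the kernel is implemented: ToyPageCache, its dict index, and the fixed readahead_pages window are all hypothetical stand-ins chosen for illustration.

```python
PAGE_SIZE = 4096  # bytes per page, matching the 4 KiB figure above


class ToyPageCache:
    """Toy model only: a dict stands in for the kernel's per-file page index."""

    def __init__(self, readahead_pages=4):
        self.pages = {}                      # (inode, page_index) -> bytes
        self.readahead_pages = readahead_pages
        self.disk_reads = 0                  # simulated block-layer requests

    def _fetch_from_disk(self, disk, inode, first, count):
        # One simulated I/O request that fills up to `count` consecutive pages.
        self.disk_reads += 1
        data = disk[inode]
        for idx in range(first, first + count):
            start = idx * PAGE_SIZE
            if start >= len(data):
                break                        # past end of file
            self.pages[(inode, idx)] = data[start:start + PAGE_SIZE]

    def read_page(self, disk, inode, idx):
        key = (inode, idx)
        if key not in self.pages:
            # Miss: fetch the requested page plus a fixed readahead window.
            self._fetch_from_disk(disk, inode, idx, 1 + self.readahead_pages)
        return self.pages.get(key, b"")      # empty bytes past end of file


# Sequentially read an 8-page "file" stored under a made-up inode number.
disk = {42: bytes(8 * PAGE_SIZE)}
cache = ToyPageCache(readahead_pages=3)
for i in range(8):
    cache.read_page(disk, 42, i)
print(cache.disk_reads)  # prints 2: pages 0-3 in one request, 4-7 in another
```

Eight sequential page reads cost two simulated disk requests instead of eight, which is the effect readahead is after.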

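Writes are deferred. A buffered write() just dirties a page in cache and returns. Kernel writeback threads (the old pdflush threads were replaced in Linux 2.6.32) start flushing dirty pages in the background once they exceed vm.dirty_background_ratio (default 10% of available memory) or have been dirty longer than vm.dirty_expire_centisecs (default 3000, i.e. 30 s). At vm.dirty_ratio (default 20%) the writing process itself is throttled until pages are written back. Pull the power cord before the flush and you lose them. fsync() blocks until your file's dirty pages are durable.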
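A minimal sketch of a write that does not return until the data has been handed to stable storage, assuming Linux. The helper name durable_write is made up for this example; the directory fsync is there because a newly created file also needs its directory entry flushed.

```python
import os


def durable_write(path, data):
    """Write `data` to `path` and block until the kernel reports it durable."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while len(view) > 0:
            written = os.write(fd, view)   # write() may be partial
            view = view[written:]
        os.fsync(fd)                       # flush this file's dirty pages
    finally:
        os.close(fd)

    # Flush the directory so the new directory entry survives a crash too.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Without the os.fsync calls, the function would return as soon as the pages were dirtied in cache, which is the deferred behaviour described above.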

Real-world example. Postgres deliberately uses a modest shared_buffers (25% of RAM is a common starting point) and lets the page cache hold much of the rest of the database. A SELECT on a hot table never touches disk; a tool like iostat will show close to zero reads. But after a server reboot the cache is empty, and queries can run orders of magnitude slower until the working set is "warm" again. This is why ops teams run pg_prewarm or just cat table_files > /dev/null after maintenance.
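The cat-to-/dev/null trick amounts to reading the file and throwing the bytes away. A rough equivalent, with the hypothetical name prewarm and an arbitrary 1 MiB chunk size, might look like this:

```python
def prewarm(path, chunk_size=1 << 20):
    """Read a file sequentially and discard the data, like `cat path > /dev/null`.

    Returns the number of bytes read. The side effect we want is that the
    kernel populates the page cache as the reads go by.
    """
    total = 0
    with open(path, "rb", buffering=0) as f:   # unbuffered: plain read() calls
        while True:
            chunk = f.read(chunk_size)
            if not chunk:                      # empty bytes means end of file
                break
            total += len(chunk)
    return total
```

This only helps if the file fits in memory that the kernel is willing to keep for cache; under memory pressure the pages can be evicted again.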

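Inspecting it. /proc/meminfo shows Cached: (page-cache pages, which also counts tmpfs and shared memory) and Dirty: (modified, not yet flushed). vmtouch tells you what fraction of a specific file is resident. echo 3 > /proc/sys/vm/drop_caches (as root) drops clean page-cache pages plus dentries and inodes; dirty pages are not dropped, so run sync first. This is useful for honest benchmarks.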
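Reading those counters programmatically is a few lines of parsing. The sketch below assumes the usual "Name:   value kB" layout of /proc/meminfo; parse_meminfo and cache_summary are illustrative names, not part of any library.

```python
def parse_meminfo(text):
    """Return {field: int} for the 'Name:   value [kB]' lines of /proc/meminfo.

    Most values are in kB; a few fields (e.g. HugePages_Total) are plain counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def cache_summary(path="/proc/meminfo"):
    """Pick out the page-cache related counters (Linux only)."""
    with open(path) as f:
        info = parse_meminfo(f.read())
    return {key: info.get(key) for key in ("Cached", "Dirty", "Writeback")}
```

Calling cache_summary() before and after a large write shows Dirty rise and then fall back as writeback catches up.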

Rule of thumb. These are order-of-magnitude figures. A page-cache hit on a small read costs roughly 100 ns (a system call plus a memcpy). An NVMe miss costs ~100 µs. A spinning-disk miss costs ~10 ms. That's a 1000x cliff between cache and NVMe, and 100,000x to spinning rust, so your tail latency is dominated by your miss rate, not your hit speed. If your p50 looks great but p99 is awful, you're very likely missing the cache on the long tail.
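To make the arithmetic concrete, here is a small calculation using the round numbers from the rule of thumb. The 99% hit rate is an assumed figure for illustration, not a measurement.

```python
HIT_NS = 100                # page-cache hit, from the rule of thumb
NVME_MISS_NS = 100_000      # 100 µs
DISK_MISS_NS = 10_000_000   # 10 ms


def mean_latency_ns(hit_rate, miss_ns, hit_ns=HIT_NS):
    """Expected latency of one read given a cache hit rate in [0, 1]."""
    return hit_rate * hit_ns + (1.0 - hit_rate) * miss_ns


def miss_share_of_mean(hit_rate, miss_ns, hit_ns=HIT_NS):
    """Fraction of the mean latency that is contributed by misses."""
    miss_part = (1.0 - hit_rate) * miss_ns
    return miss_part / mean_latency_ns(hit_rate, miss_ns, hit_ns)


# Assumed 99% hit rate: the mean is about 1.1 µs on NVMe, and about 91%
# of that comes from the 1% of reads that miss.
print(round(mean_latency_ns(0.99, NVME_MISS_NS)))        # prints 1099
print(round(miss_share_of_mean(0.99, NVME_MISS_NS), 2))  # prints 0.91
```

With the same hit rate against a spinning disk the mean is about 100 µs, almost all of it from misses, and in both cases every request at p99 or beyond is paying the full miss cost.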

Bypassing it. O_DIRECT skips the page cache. It is used by databases that manage their own buffer pool (MySQL InnoDB when configured with innodb_flush_method=O_DIRECT, ScyllaDB) and don't want double-buffering. posix_fadvise(POSIX_FADV_DONTNEED) asks the kernel to evict a file's pages you know you won't reread, freeing RAM for things that benefit; it only drops clean pages, so flush dirty data first.
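A sketch of the POSIX_FADV_DONTNEED call from Python, assuming Linux and Python's os.posix_fadvise wrapper. The helper name drop_file_from_cache is made up, and the call is advisory: the kernel may keep pages that are dirty, mapped, or otherwise in use.

```python
import os


def drop_file_from_cache(path):
    """Ask the kernel to evict one file's cached pages (advisory)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Dirty pages are not dropped, so flush them first.
        os.fsync(fd)
        # offset=0, len=0 means "from the start to the end of the file".
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
```

Unlike writing to drop_caches, this needs no root privileges and touches only the one file, which makes it a gentler way to get cold-cache timings for a single input.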

See it in action: Check out Easy Persistent Cache For Slow Functions With Joblib by Jimi V. (Bitswired) to see this theory applied.

Key Takeaway: Free RAM isn't wasted — Linux fills it with file pages, and the gap between a page-cache hit (~100 ns) and an NVMe miss (~100 µs) is what shapes your tail latency.
