Daily Low-Level Programming: Memory Prefetching: Telling the CPU What You'll Need Next

2026-05-13

A cache miss to DRAM costs ~100ns — roughly 300 cycles on a 3GHz CPU. The CPU's hardware prefetcher tries to hide this latency by detecting access patterns and pulling cache lines in before you ask for them. But it only handles predictable strides: sequential walks, fixed offsets, and simple 2D patterns. The moment your code chases pointers, hashes into a table, or walks a tree, the prefetcher gives up and you pay full DRAM latency on every miss.

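That's where software prefetching comes in. On x86, the PREFETCHT0, PREFETCHT1, PREFETCHT2, and PREFETCHNTA instructions ask the CPU to start loading a cache line: T0 into all cache levels including L1, T1 into L2 and beyond, T2 into L3 (on implementations that distinguish it), and NTA non-temporally, to minimize cache pollution. Exact placement is implementation-specific. They're hints: they don't fault, don't block, and are silently dropped on bad addresses. In C, GCC and Clang expose this as __builtin_prefetch(addr, rw, locality), where rw is 0 for a read or 1 for a write, and locality runs from 0 (no temporal locality) to 3 (keep in all cache levels). Both rw and locality must be compile-time constants.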
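A minimal sketch of the builtin's argument forms, assuming GCC or Clang. The function name touch_with_hints is hypothetical; only __builtin_prefetch is a real builtin. The instruction named in each comment is what GCC typically emits on x86 and may differ by compiler, flags, and target.

```c
#include <stddef.h>

/* Issues each kind of hint for one address, then returns the value stored
   there. The hints never change program results, only (possibly) timing.
   touch_with_hints is a hypothetical name used for illustration. */
int touch_with_hints(const int *p) {
    __builtin_prefetch(p);        /* defaults: rw = 0 (read), locality = 3 */
    __builtin_prefetch(p, 0, 3);  /* read, keep in all levels (typically PREFETCHT0) */
    __builtin_prefetch(p, 0, 0);  /* read, no temporal locality (typically PREFETCHNTA) */
    __builtin_prefetch(p, 1, 3);  /* line is about to be written */

    /* An invalid target address is tolerated: the hint is dropped, no fault.
       The address *expression* must still be valid to evaluate. */
    __builtin_prefetch((const void *)0);

    return *p;
}
```

Calling touch_with_hints on any valid int pointer returns the pointed-to value unchanged, which is the point: a prefetch is a performance hint with no architectural side effects.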

The real-world win: linked-list traversal. Consider summing values from a linked list of 10 million nodes scattered across the heap:

Naive loop: each node = node->next stalls ~100ns waiting for the next cache line. 10M × 100ns = 1 second.
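With prefetch-ahead: __builtin_prefetch(node->next) issued as soon as the current node is loaded, before its payload is processed, so the next node's fetch overlaps with the current node's work; the more work per node, the more latency it hides. Reaching further with node->next->next does not help by itself: evaluating that expression is a demand load of node->next, which stalls on the very miss you are trying to hide and dereferences NULL at the tail. Prefetching two or more nodes ahead requires the future address to be available without a miss, for example from a skip pointer stored in the current node.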
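The two loops can be sketched as follows, assuming a hypothetical struct node with a next pointer and a long payload. The sketch prefetches node->next rather than node->next->next, because reading the latter needs node->next loaded first and would dereference NULL at the tail. It illustrates the shape of the code, not a measured speedup.

```c
#include <stddef.h>

/* Hypothetical node layout used for illustration. */
struct node {
    struct node *next;
    long value;
};

/* Baseline: each step waits for the next node's cache line. */
long sum_list_naive(const struct node *head) {
    long total = 0;
    for (const struct node *n = head; n != NULL; n = n->next)
        total += n->value;
    return total;
}

/* Same traversal, but the next node's line is requested before the
   current node's payload is processed. */
long sum_list_prefetch(const struct node *head) {
    long total = 0;
    for (const struct node *n = head; n != NULL; n = n->next) {
        /* n->next is already in cache (same node as n->value here).
           At the tail this prefetches NULL, which is a harmless no-op. */
        __builtin_prefetch(n->next, 0, 3);
        total += n->value; /* stand-in for real per-node work */
    }
    return total;
}
```

Both functions return the same sum for any list. With a payload as cheap as one addition there is little work to overlap, so expect the benefit to show up only when per-node processing is substantial.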

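In production, a well-tuned prefetch distance on pointer-chasing code typically yields a 1.5x–3x speedup, but measure before committing. Postgres uses this in btree scans and ClickHouse uses it in hash join probes. The Linux kernel is a cautionary case: its list_for_each iterators used to prefetch the next entry, and the prefetch was removed in 2011 after benchmarks showed it slowed down the common case of short lists.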

Rule of thumb for prefetch distance: prefetch N iterations ahead, where N ≈ memory_latency / per_iteration_work. If your loop body takes 20ns and DRAM latency is 100ns, prefetch 5 iterations ahead. Too close and the line hasn't arrived; too far and it gets evicted before use, or you prefetch past valid data and pollute cache with garbage.
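As a rough illustration of the rule of thumb, here is a sketch that computes a distance and applies it to a gather loop over an index array. The names prefetch_distance and gather_sum are hypothetical, and the latency and work figures are inputs you would have to measure yourself. The index array is read sequentially, so the future address idx[i + dist] is cheap to obtain, while the table accesses are the ones that miss.

```c
#include <stddef.h>

/* Rule of thumb: N ~= memory_latency / per_iteration_work, rounded up.
   Returns 0 when there is no per-iteration work to overlap with. */
size_t prefetch_distance(unsigned latency_ns, unsigned work_ns) {
    if (work_ns == 0)
        return 0;
    return (latency_ns + work_ns - 1) / work_ns;
}

/* Sums table[idx[i]] for i in [0, n), prefetching dist iterations ahead.
   dist == 0 disables prefetching. */
long gather_sum(const long *table, const size_t *idx, size_t n, size_t dist) {
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        /* Overflow-safe form of (i + dist < n): never read idx[] past its
           end, and never prefetch past valid data. */
        if (dist != 0 && dist < n - i)
            __builtin_prefetch(&table[idx[i + dist]], 0, 3);
        total += table[idx[i]];
    }
    return total;
}
```

With the figures from the text, prefetch_distance(100, 20) gives 5. Treat that as a starting point and sweep nearby values under a benchmark, since the best distance shifts with the hardware and the loop body.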

When NOT to prefetch:

Sequential access — the hardware prefetcher already handles this, and software prefetch just wastes issue slots.
Tight loops where the data is already L1-resident — you're adding instructions for no benefit.
When you can't predict the address far enough in advance (e.g., the address itself requires a cache miss to compute).

Verify with perf stat -e L1-dcache-load-misses,LLC-load-misses before and after. If LLC misses don't drop, your prefetch is either too late, too early, or targeting addresses the hardware prefetcher already caught.

See it in action: Check out CPU Cache Explained – Why Your Processor Needs Its Own Memory by Turtle Code to see this theory applied.

Key Takeaway: Software prefetching wins on pointer-chasing workloads where access patterns defeat the hardware prefetcher — but only when you can compute the future address before the CPU stalls waiting for it.
