2026-05-13
A cache miss to DRAM costs ~100ns — roughly 300 cycles on a 3GHz CPU. The CPU's hardware prefetcher tries to hide this latency by detecting access patterns and pulling cache lines in before you ask for them. But it only handles predictable strides: sequential walks, fixed offsets, and simple 2D patterns. The moment your code chases pointers, hashes into a table, or walks a tree, the prefetcher gives up and you pay full DRAM latency on every miss.
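That's where software prefetching comes in. On x86, the PREFETCHT0, PREFETCHT1, PREFETCHT2, and PREFETCHNTA instructions ask the CPU to start loading a cache line: T0 into L1 and the levels beyond it, T1 into L2, T2 into L3, and NTA non-temporally, which limits cache pollution for data you will touch only once. Exact placement is microarchitecture-specific. They're hints: they don't block, they don't fault even on an invalid address, and the CPU is free to drop them. In C, GCC and Clang expose this as __builtin_prefetch(addr, rw, locality), where rw is 0 for a read or 1 for a write, and locality runs from 0 (no temporal locality, like NTA) to 3 (keep in all cache levels, like T0). Both are optional compile-time constants that default to 0 and 3.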
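As a rough illustration of the builtin's three arguments, here is a minimal sketch of a batched gather: one pass issues prefetches for every slot it is about to read, and a second pass does the actual loads. The function name gather_sum and the index-table layout are assumptions made for this example, and the batch is assumed small enough (a few dozen entries) that the prefetched lines are still cached when the second pass reaches them.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns the sum of table[idx[i]] for i in [0, n).
 * Every idx[i] must be a valid index into table. */
uint64_t gather_sum(const uint64_t *table, const uint32_t *idx, size_t n)
{
    /* Pass 1: hint every slot we are about to read.
     * rw = 0 means "read"; locality = 3 asks to keep the line in all
     * cache levels (GCC and Clang emit PREFETCHT0 for this on x86). */
    for (size_t i = 0; i < n; i++)
        __builtin_prefetch(&table[idx[i]], 0, 3);

    /* Pass 2: the real loads, which ideally now hit in cache. */
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += table[idx[i]];
    return sum;
}
```

The prefetch itself never faults, but the address expression is ordinary code: idx[i] is a real load and must be in bounds.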
The real-world win: linked-list traversal. Consider summing values from a linked list of 10 million nodes scattered across the heap:
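Without prefetch: node = node->next stalls ~100ns waiting for the next cache line. 10M × 100ns = 1 second.
With prefetch: __builtin_prefetch(node->next->next) is issued while processing the current node, so the next-next node is fetched in parallel with current work. Evaluating node->next->next is itself a real load from node->next, so it needs a NULL check, and it only pays off when each node carries enough work to overlap with the fetch.
In production, a well-tuned prefetch distance on pointer-chasing code can yield a 1.5x–3x speedup. Postgres uses this in btree scans; the Linux kernel carried it in list_for_each until 2011, when it was removed after measurements showed it hurt more often than it helped; ClickHouse uses it in hash join probes.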
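The linked-list traversal above could be sketched roughly as follows. The struct layout and the name sum_list are assumptions for illustration, not code from any of the projects mentioned. Note the guard: the prefetch instruction never faults, but reading next->next is an ordinary load and would crash on a NULL next.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node layout, assumed for this sketch. */
struct node {
    struct node *next;
    uint64_t value;
};

/* Sums every value in the list while hinting the node two hops ahead. */
uint64_t sum_list(const struct node *head)
{
    uint64_t sum = 0;
    for (const struct node *n = head; n != NULL; n = n->next) {
        const struct node *next = n->next;
        /* next->next is a real load from the next node, so guard it.
         * Prefetching a NULL result is harmless: the hint is dropped. */
        if (next != NULL)
            __builtin_prefetch(next->next, 0, 3);
        sum += n->value; /* stand-in for the per-node work */
    }
    return sum;
}
```

With a loop body this small there is little work to overlap, so any gain is likely modest. When per-node work is heavier, a plain __builtin_prefetch(n->next) at the top of the loop is a simpler variant worth measuring first.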
Rule of thumb for prefetch distance: prefetch N iterations ahead, where N ≈ memory_latency / per_iteration_work. If your loop body takes 20ns and DRAM latency is 100ns, prefetch 5 iterations ahead. Too close and the line hasn't arrived; too far and it gets evicted before use, or you prefetch past valid data and pollute cache with garbage.
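To make the rule of thumb concrete, here is a small sketch that computes the distance and applies it to an indirect array walk. The names prefetch_distance and sum_indirect are hypothetical, and the latency and per-iteration figures are inputs you would have to measure on your own hardware, not constants.

```c
#include <stddef.h>
#include <stdint.h>

/* N = memory_latency / per_iteration_work, rounded up, and at least 1. */
size_t prefetch_distance(unsigned latency_ns, unsigned work_ns)
{
    if (work_ns == 0)
        work_ns = 1; /* avoid dividing by zero */
    size_t d = ((size_t)latency_ns + work_ns - 1) / work_ns;
    return d ? d : 1;
}

/* Sums table[idx[i]] while prefetching the slot `dist` iterations ahead.
 * Every idx[i] must be a valid index into table. */
uint64_t sum_indirect(const uint64_t *table, const uint32_t *idx,
                      size_t n, size_t dist)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* Stay inside idx[]: reading idx[i + dist] is a real load,
         * so it must not run past the end of the array. */
        if (dist < n - i)
            __builtin_prefetch(&table[idx[i + dist]], 0, 3);
        sum += table[idx[i]];
    }
    return sum;
}
```

With the numbers from the text, prefetch_distance(100, 20) gives 5. The bounds check is what keeps the loop from prefetching past valid data near the end of the array.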
When NOT to prefetch:
Verify with perf stat -e L1-dcache-load-misses,LLC-load-misses before and after. If LLC misses don't drop, your prefetch is either too late, too early, or targeting addresses the hardware prefetcher already caught.
