Hardware Prefetching: How CPUs Predict Your Memory Access Patterns

2026-05-01

Cache hierarchies are useless if data isn't in them when the CPU needs it. Hardware prefetchers solve this by detecting memory access patterns and fetching data before the CPU asks for it. They're pattern-matching engines sitting between the CPU and memory, and understanding them changes how you write performance-critical code.

The core problem: An L3 cache miss that goes all the way to DRAM costs roughly 200 cycles or more on modern hardware. If your code touches memory in a predictable pattern, why wait? A prefetcher can issue the fetch about 200 cycles early, so the data is already in cache when the load executes and most or all of the latency is hidden.
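To make the "200 cycles early" idea concrete, here is a minimal sketch of the arithmetic: how many loop iterations ahead a fetch has to be issued so that it arrives in time. The function name prefetch_distance and both parameters are assumptions made for illustration; real latencies and per-iteration costs vary by machine and have to be measured.

```c
#include <stddef.h>

/* Hypothetical helper: iterations of lookahead needed so that a fetch
   issued now has completed by the time the loop reaches that element.
   Both inputs are illustrative estimates, not measured values. */
size_t prefetch_distance(size_t latency_cycles, size_t cycles_per_iter)
{
    if (cycles_per_iter == 0)
        return 0; /* no meaningful estimate without a per-iteration cost */

    /* Ceiling division: round up so the data is never late. */
    return (latency_cycles + cycles_per_iter - 1) / cycles_per_iter;
}
```

With the 200-cycle figure above and a loop body that takes about 10 cycles, prefetch_distance(200, 10) gives 20 iterations of lookahead. A faster loop body needs a proportionally longer lookahead.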

Modern CPUs run several prefetchers at once, including the stride prefetchers and the newer "irregular" prefetchers covered below.

Real-world impact: Consider traversing a linked list versus an array. Both touch the same amount of data, but the array gets near-perfect prefetching because its stride is constant, while a linked list whose nodes are scattered across the heap gives the prefetcher nothing to work with: the address of the next node is known only after the current node has been loaded. This is why on an Intel Core i7, iterating a 1-million-element array might run at 90% of DRAM bandwidth, while a pointer-chasing linked list over the same data achieves under 10%. The hardware is identical; only the access pattern changed.
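The two traversals look like this in C. This is a minimal sketch for comparison only; the struct node layout and the function names sum_array and sum_list are assumptions for illustration, and the code does no timing itself.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical list node used only for this illustration. */
struct node {
    uint64_t value;
    struct node *next;
};

/* Sequential traversal: the address of element i+1 can be computed
   without loading element i, so a stride prefetcher can run ahead. */
uint64_t sum_array(const uint64_t *data, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += data[i];
    return total;
}

/* Pointer chasing: the next address is stored inside the current node,
   so each load has to finish before the next one can even be issued. */
uint64_t sum_list(const struct node *head)
{
    uint64_t total = 0;
    for (const struct node *p = head; p != NULL; p = p->next)
        total += p->value;
    return total;
}
```

Both functions do the same arithmetic. The difference is the data dependency: in sum_list every load address depends on the result of the previous load, which serializes the cache misses.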

Rule of thumb: A stride prefetcher needs 2–3 accesses to lock onto a pattern and typically prefetches 8–20 lines ahead. If your inner loop touches fewer than 3 elements before changing pattern, the prefetcher never engages. This is why short inner loops over irregular data structures are the worst case for memory performance.
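The lock-on behavior can be sketched as a toy software model. This is an illustration under stated assumptions, not any vendor's actual design: the struct stride_detector, the function stride_observe, and the LOCK_THRESHOLD of two matching strides are all hypothetical, and the model predicts only one address ahead rather than several lines.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed threshold: two consecutive identical strides, i.e. the
   detector locks on the third access of a regular pattern. */
#define LOCK_THRESHOLD 2

/* Hypothetical toy model of a single stride-tracking entry. */
struct stride_detector {
    uint64_t last_addr;    /* most recent address observed */
    int64_t  last_stride;  /* difference between the last two addresses */
    unsigned confidence;   /* consecutive times the same stride was seen */
    bool     has_last;     /* false until the first access arrives */
};

/* Feed one address. Returns true and writes the predicted next address
   to *predicted once the same non-zero stride has repeated enough. */
bool stride_observe(struct stride_detector *d, uint64_t addr,
                    uint64_t *predicted)
{
    if (!d->has_last) {
        d->last_addr = addr;
        d->has_last = true;
        return false; /* one access carries no stride information */
    }

    int64_t stride = (int64_t)(addr - d->last_addr);
    if (stride != 0 && stride == d->last_stride)
        d->confidence++;                      /* pattern continues */
    else
        d->confidence = (stride != 0) ? 1 : 0; /* pattern broke: restart */

    d->last_stride = stride;
    d->last_addr = addr;

    if (d->confidence >= LOCK_THRESHOLD) {
        *predicted = addr + (uint64_t)stride;
        return true;
    }
    return false;
}
```

Starting from a zero-initialized detector and feeding it 0x1000, 0x1040, 0x1080, the first two calls return false and the third returns true with a prediction of 0x10C0. A single access that breaks the stride resets the confidence, which mirrors why short, irregular inner loops never give the prefetcher a chance to engage.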

When prefetching hurts: Prefetchers consume memory bandwidth and pollute caches. Random access patterns (hash tables, pointer-heavy graphs) can cause prefetchers to fetch useless data, evicting useful lines. This is why some HPC workloads explicitly disable hardware prefetchers via model-specific registers (MSRs) and rely on software prefetch instead (the __builtin_prefetch compiler builtin or the x86 prefetcht0 instruction), for cases where the programmer knows the pattern but the hardware can't detect it.
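A typical case where the programmer knows the pattern but the hardware does not is an indirect (gather) access such as table[idx[i]]: the index array is read sequentially, but the table addresses look random. Below is a minimal sketch using __builtin_prefetch as provided by GCC and Clang. The function name gather_sum and the PREFETCH_AHEAD value of 16 are assumptions for illustration; a useful lookahead has to be found by measurement on the target machine.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed lookahead in loop iterations; tune by measurement. */
#define PREFETCH_AHEAD 16

/* Sum table[idx[i]] for i in [0, n). Every value in idx must be a
   valid index into table. */
uint64_t gather_sum(const uint64_t *table, const uint32_t *idx, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n) {
            /* Arguments: address, 0 = prefetch for read,
               3 = high temporal locality (keep in all cache levels). */
            __builtin_prefetch(&table[idx[i + PREFETCH_AHEAD]], 0, 3);
        }
        total += table[idx[i]];
    }
    return total;
}
```

The prefetch is only a hint, so the result is the same with or without it; only the timing changes. The bounds check keeps the code from reading idx past the end of the array, since reading the index itself is a real load, not a hint.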

AMD's Zen 4 and Intel's Golden Cove both added more sophisticated "irregular" prefetchers that attempt to learn complex patterns using small lookup tables of recent address deltas — blurring the line between traditional stride detection and machine learning.

See it in action: Check out Prefetching Explained: How CPUs Predict Your Memory Accesses by CodeLucky to see this theory applied.

Key Takeaway: Hardware prefetchers hide memory latency by detecting access patterns automatically, which is why predictable, strided memory access (arrays, contiguous buffers) dramatically outperforms pointer-chasing code. The CPU literally sees your data coming.
