Daily Hardware Architecture: Hardware Prefetching: How CPUs Predict Your Memory Access Patterns

2026-05-01

Cache hierarchies are useless if data isn't in them when the CPU needs it. Hardware prefetchers solve this by detecting memory access patterns and fetching data before the CPU asks for it. They're pattern-matching engines sitting between the CPU and memory, and understanding them changes how you write performance-critical code.

The core problem: A load that misses the L3 and has to go to DRAM costs roughly 200 or more cycles on modern hardware. If your code touches memory in a predictable pattern, why wait? A prefetcher that issues the fetch about 200 cycles early can hide most or all of that latency.
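
To make "200 cycles early" concrete, the sketch below converts a miss latency into a lookahead distance measured in loop iterations. The function name and the example figures are assumptions for illustration, not measurements from any particular CPU.

```c
#include <assert.h>
#include <stddef.h>

/* Number of loop iterations ahead a fetch must be issued so the data
 * arrives before it is needed: ceil(latency / cycles_per_iteration).
 * Both inputs are assumed figures supplied by the caller. */
size_t prefetch_distance(size_t miss_latency_cycles,
                         size_t cycles_per_iteration)
{
    assert(cycles_per_iteration > 0);   /* avoid division by zero */
    return (miss_latency_cycles + cycles_per_iteration - 1)
           / cycles_per_iteration;
}
```

With an assumed 200-cycle miss and a loop body that takes 10 cycles per iteration, the fetch has to be issued 20 iterations ahead; a slower loop body needs a shorter lookahead.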

Modern CPUs use several prefetcher types simultaneously:

Next-line prefetcher: The simplest — if you access cache line N, speculatively fetch N+1. Intel calls this the "DCU prefetcher" on the L1. It's cheap, effective for sequential access, and almost always enabled.
Stride prefetcher: Detects constant-stride patterns. If you access addresses 0x1000, 0x1040, 0x1080 (a stride of 64 bytes), it predicts that 0x10C0 is next. This handles array traversals with non-unit stride, such as reading one field from every element of an array of structs. Intel's stride prefetcher is IP-based: it sits at the L1, tracks each load instruction separately, and can detect strides of up to 2 KB.
Spatial prefetcher (pair prefetcher): When you access one half of a 128-byte aligned pair, it fetches the other half. This is Intel's L2 adjacent-line prefetcher — it effectively doubles your fetch granularity.
Stream prefetcher: Tracks ascending or descending address sequences and prefetches ahead by multiple lines. Intel's L2 streamer can look ahead by up to 20 cache lines once a pattern is confirmed.
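
The stride and pair prefetchers described above can be modeled in a few lines of C. This is a toy software model, not how any real CPU implements them: the `stride_entry` layout, the `REPEATS_NEEDED` threshold, and both function names are assumptions chosen so that the 0x1000, 0x1040, 0x1080 example predicts 0x10C0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES     64u
#define REPEATS_NEEDED 1    /* assumed: one repeated delta, i.e. three accesses */

/* State for one tracked access stream (hypothetical layout). */
typedef struct {
    uint64_t last_addr;
    int64_t  stride;
    int      repeats;   /* times the current stride has repeated */
    bool     valid;     /* false until the first access is seen */
} stride_entry;

/* Feed one demand address into the model. Returns true and writes the
 * next expected address to *predicted once the same nonzero stride has
 * repeated REPEATS_NEEDED times; returns false otherwise. */
bool stride_observe(stride_entry *e, uint64_t addr, uint64_t *predicted)
{
    assert(e != NULL && predicted != NULL);

    if (!e->valid) {                 /* first access: nothing to compare */
        e->valid = true;
        e->last_addr = addr;
        e->stride = 0;
        e->repeats = 0;
        return false;
    }

    int64_t delta = (int64_t)(addr - e->last_addr);
    e->last_addr = addr;

    if (delta != 0 && delta == e->stride) {
        if (e->repeats < REPEATS_NEEDED)
            e->repeats++;            /* same stride seen again */
    } else {
        e->stride = delta;           /* new candidate stride, start over */
        e->repeats = 0;
    }

    if (e->repeats >= REPEATS_NEEDED) {
        *predicted = addr + (uint64_t)e->stride;
        return true;
    }
    return false;
}

/* Pair-prefetcher model: the other 64-byte line in the same
 * 128-byte aligned pair as addr. */
uint64_t buddy_line(uint64_t addr)
{
    uint64_t line_base = addr & ~(uint64_t)(LINE_BYTES - 1);
    return line_base ^ LINE_BYTES;
}
```

Feeding the three addresses from the stride example into a zero-initialized `stride_entry` returns false twice and then predicts 0x10C0; a jump to an unrelated address resets the model, which is the "never engages" behavior short irregular loops run into.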

Real-world impact: Consider traversing a linked list versus an array. Both touch the same amount of data, but the array gets near-perfect prefetching (unit or otherwise constant stride), while each address in a linked list whose nodes are scattered through memory depends on the previous load, so the prefetcher can't predict it. This is why, on an Intel Core i7, iterating a 1-million-element array might run at 90% of DRAM bandwidth, while a pointer-chasing linked list over the same data achieves under 10%. The hardware is identical; only the access pattern changed.
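
A minimal sketch of the two traversals, for readers who want to time them on their own machine. The `node` type and the helper names are hypothetical, and the shuffle uses a small LCG purely to scatter the link order; no timing harness or results are included here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct node {
    uint64_t     value;
    struct node *next;
} node;

/* Sequential walk: the address of element i+1 is known before
 * element i arrives, so a prefetcher can run ahead. */
uint64_t sum_array(const uint64_t *a, size_t n)
{
    assert(a != NULL || n == 0);
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Pointer chase: the next address is stored inside the node that is
 * still being loaded, so each load waits on the previous one. */
uint64_t sum_list(const node *head)
{
    uint64_t s = 0;
    for (const node *p = head; p != NULL; p = p->next)
        s += p->value;
    return s;
}

/* Link nodes[0..n) in a pseudo-random order so that successive nodes
 * are not adjacent in memory. order[] is caller-provided scratch space
 * of n entries. Each node's value is set to its own index. */
node *link_shuffled(node *nodes, size_t *order, size_t n, uint64_t seed)
{
    assert(nodes != NULL && order != NULL && n > 0);
    for (size_t i = 0; i < n; i++)
        order[i] = i;
    for (size_t i = n - 1; i > 0; i--) {        /* Fisher-Yates shuffle */
        seed = seed * 6364136223846793005ULL + 1442695040888963407ULL;
        size_t j = (size_t)((seed >> 33) % (i + 1));
        size_t t = order[i];
        order[i] = order[j];
        order[j] = t;
    }
    for (size_t i = 0; i < n; i++) {
        nodes[order[i]].value = order[i];
        nodes[order[i]].next = (i + 1 < n) ? &nodes[order[i + 1]] : NULL;
    }
    return &nodes[order[0]];
}
```

Both functions compute the same sum over the same values; any difference in run time comes from the order in which memory is visited. The gap only shows up clearly when the data set is much larger than the last-level cache.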

Rule of thumb: A stride prefetcher needs 2–3 accesses to lock onto a pattern and typically prefetches 8–20 lines ahead. If your inner loop touches fewer than 3 elements before changing pattern, the prefetcher never engages. This is why short inner loops over irregular data structures are the worst case for memory performance.

When prefetching hurts: Prefetchers consume memory bandwidth and pollute caches. Random access patterns (hash tables, pointer-heavy graphs) cause prefetchers to fetch useless data, evicting useful lines. This is why some HPC workloads explicitly disable hardware prefetchers via model-specific registers (MSRs) and rely instead on software prefetch (the __builtin_prefetch compiler builtin or the x86 prefetcht0 instruction) in cases where the programmer knows the pattern but the hardware can't detect it.
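
One case where the programmer knows the pattern and the hardware does not is an indexed gather: the index array is read sequentially, but the data addresses it names look random. The sketch below uses the GCC/Clang `__builtin_prefetch` builtin; the function name and the `PREFETCH_AHEAD` value are assumptions, and the right distance has to be found by measurement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PREFETCH_AHEAD 16   /* assumed lookahead in iterations; tune by measuring */

/* Sum data[idx[i]] for i in [0, n). Future entries of idx are cheap to
 * read because idx is walked sequentially, so the loop can request the
 * data element it will need PREFETCH_AHEAD iterations from now. */
uint64_t gather_sum(const uint64_t *data, const uint32_t *idx, size_t n)
{
    assert((data != NULL && idx != NULL) || n == 0);
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n) {
            /* Arguments: address, rw (0 = read), locality (3 = high
             * temporal locality; on x86 this typically becomes
             * prefetcht0). The builtin is only a hint and does not
             * change the result. */
            __builtin_prefetch(&data[idx[i + PREFETCH_AHEAD]], 0, 3);
        }
        s += data[idx[i]];
    }
    return s;
}
```

The bounds check keeps the lookahead from reading past the end of `idx`. Whether the hint helps depends on the data size and the machine, so it is worth timing the loop with and without it.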

AMD's Zen 4 and Intel's Golden Cove both added more sophisticated "irregular" prefetchers that attempt to learn complex patterns using small lookup tables of recent address deltas — blurring the line between traditional stride detection and machine learning.

See it in action: Check out Prefetching Explained: How CPUs Predict Your Memory Accesses by CodeLucky to see this theory applied.

Key Takeaway: Hardware prefetchers hide memory latency by detecting access patterns automatically, which is why predictable, strided memory access (arrays, contiguous buffers) dramatically outperforms pointer-chasing code — the CPU literally sees your data coming.
