Daily Digital Circuits: Hardware Prefetchers: How Hardware Predicts Your Next Memory Access Before You Make It

2026-06-09

A DRAM access costs roughly 200-300 cycles on a modern core; an L1 hit costs about 4. The CPU can't afford to wait for memory on every miss, so it tries to fetch the line before you ask for it. That's a hardware prefetcher: a state machine watching your access stream, guessing what you'll touch next, and issuing speculative loads to fill the cache ahead of the demand stream.

Three flavors dominate real silicon:

Next-line prefetcher: on a miss to address A, also fetch the line at A+64 (the next cache line, assuming 64-byte lines). Cheap, catches sequential code/data, but useless on large-stride or pointer-chasing workloads.
Stride prefetcher: tracks the delta between consecutive misses from the same PC (program counter). If a load instruction repeatedly misses at addresses 0x1000, 0x1080, 0x1100, the stride is 0x80, and the prefetcher issues 0x1180, 0x1200, etc. Lives in a small table indexed by PC, ~64-256 entries.
Stream prefetcher: like stride, but it tracks several concurrent streams of ascending or descending line addresses and ramps up how far ahead it runs (the prefetch distance) as confidence builds. Intel documents its L2 streamer as running up to 20 lines ahead of the demand load.
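The stride mechanism above can be sketched as a toy model. This is an illustration under assumptions, not any vendor's design: the class name, the confidence threshold of 2, the degree of 2, and the oldest-first eviction are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    last_addr: int       # address of the most recent miss from this PC
    stride: int = 0      # last observed delta between misses
    confidence: int = 0  # how many times in a row that delta repeated


class StridePrefetcher:
    """Toy PC-indexed stride prefetcher (hypothetical parameters)."""

    def __init__(self, table_size=64, threshold=2, degree=2):
        self.table = {}               # pc -> Entry
        self.table_size = table_size  # assumed table capacity
        self.threshold = threshold    # repeats needed before prefetching
        self.degree = degree          # prefetches issued per trigger

    def on_miss(self, pc, addr):
        """Record a miss and return the addresses to prefetch (maybe none)."""
        entry = self.table.get(pc)
        if entry is None:
            if len(self.table) >= self.table_size:
                # Evict the oldest inserted entry (dicts keep insertion order).
                self.table.pop(next(iter(self.table)))
            self.table[pc] = Entry(last_addr=addr)
            return []

        stride = addr - entry.last_addr
        if stride != 0 and stride == entry.stride:
            entry.confidence += 1
        else:
            entry.stride = stride
            entry.confidence = 1 if stride != 0 else 0
        entry.last_addr = addr

        if entry.confidence >= self.threshold:
            return [addr + entry.stride * k for k in range(1, self.degree + 1)]
        return []
```

Feeding it misses at 0x1000, 0x1080, 0x1100 from one PC yields nothing for the first two and then 0x1180 and 0x1200 on the third, matching the example in the list. Real designs differ in how they index, age, and train the table.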

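Real example (matrix traversal): iterating a float A[1024][1024] in row-major order walks memory 4 bytes at a time, so 16 consecutive loads share each 64-byte line and the next-line and stream prefetchers keep L2 misses near zero. Switch to column-major access and consecutive inner-loop loads are 4096 bytes apart (one full row). A PC-indexed stride prefetcher can still detect that stride, but every load now lands on a different 4 KB page, so the TLB thrashes, the prefetcher competes with the page walker, and prefetchers that do not cross page boundaries contribute little. Same data, different loop order, and a slowdown that can approach 10×, largely because the prefetchers can no longer keep up.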
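To make the two traversals concrete, here is a small sketch that computes the byte offsets each loop order touches. It assumes 4-byte floats, 64-byte lines, and 4 KiB pages; the function names are illustrative, and it models addresses only, not timing.

```python
ROWS = COLS = 1024
ELEM = 4     # sizeof(float)
LINE = 64    # assumed cache line size in bytes
PAGE = 4096  # assumed page size in bytes


def addresses(order, count):
    """Byte offsets of the first `count` loads for a traversal order."""
    out = []
    for n in range(count):
        if order == "row":
            i, j = divmod(n, COLS)  # inner loop walks the column index j
        else:
            j, i = divmod(n, ROWS)  # inner loop walks the row index i
        out.append((i * COLS + j) * ELEM)
    return out


def summarize(order, count=1024):
    """Return (set of strides, distinct lines touched, distinct pages touched)."""
    addrs = addresses(order, count)
    strides = {b - a for a, b in zip(addrs, addrs[1:])}
    lines = {a // LINE for a in addrs}
    pages = {a // PAGE for a in addrs}
    return strides, len(lines), len(pages)


if __name__ == "__main__":
    for order in ("row", "col"):
        strides, lines, pages = summarize(order)
        print(order, sorted(strides), lines, pages)
```

Over the first 1024 loads, the row-major walk has a 4-byte stride and stays within 64 lines on a single page, while the column-major walk has a 4096-byte stride and touches 1024 lines on 1024 different pages.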

The accuracy/coverage tradeoff: aggressive prefetching wastes DRAM bandwidth and pollutes the cache with lines you never use. Accuracy is the fraction of issued prefetches that turn out useful (the line was demanded before eviction); coverage is the fraction of would-be demand misses that prefetching eliminates. Pushing coverage up usually means issuing more speculative prefetches, which drags accuracy down. A rule of thumb: keep accuracy above 50% or you risk being net-negative, since a wasted prefetch can evict a line that was still in use, and DRAM bandwidth is finite.
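A minimal sketch of how the two metrics fall out of three counters. The counter names are hypothetical; real hardware exposes similar events under vendor-specific names.

```python
def prefetch_metrics(issued, useful, remaining_misses):
    """Compute (accuracy, coverage) from raw counts.

    issued:           total prefetches sent to memory
    useful:           prefetched lines that were demanded before eviction
    remaining_misses: demand misses that still went to memory
    """
    accuracy = useful / issued if issued else 0.0
    # Without prefetching, every useful prefetch would have been a miss too.
    baseline_misses = useful + remaining_misses
    coverage = useful / baseline_misses if baseline_misses else 0.0
    return accuracy, coverage


if __name__ == "__main__":
    acc, cov = prefetch_metrics(issued=1000, useful=600, remaining_misses=400)
    print(f"accuracy={acc:.2f} coverage={cov:.2f}")
```

With 1000 prefetches issued, 600 of them useful, and 400 demand misses left over, both accuracy and coverage come out to 0.60.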

Calculation: If your prefetcher issues 1 prefetch per demand miss with 60% accuracy, and a useful prefetch saves 250 cycles while a wasted one costs ~5 cycles of bandwidth contention, net savings per demand miss = 0.6 × 250 − 0.4 × 5 = 148 cycles. That's why even imperfect prefetchers are huge wins.
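The same arithmetic as a small function, so you can vary the inputs. The function and parameter names are illustrative, and the cost model is the simple linear one from the calculation above, nothing more.

```python
def net_savings(accuracy, saved=250, wasted=5, prefetches_per_miss=1.0):
    """Expected cycles saved per demand miss under the linear model above."""
    return prefetches_per_miss * (accuracy * saved - (1 - accuracy) * wasted)


def break_even_accuracy(saved=250, wasted=5):
    """Accuracy at which savings and waste cancel out in this model."""
    return wasted / (saved + wasted)


if __name__ == "__main__":
    print(round(net_savings(0.6), 3))         # the 60% case from the text
    print(round(break_even_accuracy(), 4))
```

In this model the break-even accuracy is 5 / 255, about 2%, far below the 50% rule of thumb. The 5-cycle penalty only covers bandwidth contention; once a wasted prefetch also evicts a line that was in use, its real cost moves much closer to a full miss and the break-even point rises accordingly.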

Pointer-chasing (linked lists, hash tables) defeats all of these — the next address is data-dependent, not pattern-predictable. That's why modern chips also ship indirect prefetchers that learn correlations like "load at PC X produces an address used by load at PC Y," but accuracy collapses fast outside narrow workloads.
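A quick sketch of why stride detection has nothing to lock onto here. It compares the address deltas seen while walking a list whose nodes sit contiguously in memory against one whose nodes are scattered. The 64-byte node size, the base address, and the seeded shuffle are assumptions chosen for the illustration.

```python
import random

NODE = 64  # assumed node size: one cache line


def chase_order(n, shuffled, seed=0):
    """Addresses visited when walking an n-node list from head to tail."""
    slots = list(range(n))
    if shuffled:
        # Scatter the nodes, as a long-lived heap tends to do.
        random.Random(seed).shuffle(slots)
    return [0x10000 + s * NODE for s in slots]


def distinct_deltas(addrs):
    """Number of different address deltas between consecutive loads."""
    return len({b - a for a, b in zip(addrs, addrs[1:])})


if __name__ == "__main__":
    print(distinct_deltas(chase_order(64, shuffled=False)))
    print(distinct_deltas(chase_order(64, shuffled=True)))
```

The contiguous list shows a single repeating delta, which a stride prefetcher would pick up; the scattered list shows many different deltas with no pattern to train on. On real hardware there is a second problem the sketch does not model: the address of the next node is not known until the current load returns.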

See it in action: Check out Prefetching Explained: How CPUs Predict Your Memory Accesses by CodeLucky to see this theory applied.

Key Takeaway: Hardware prefetchers hide memory latency at almost no cost when access patterns are predictable. Write loops with small, regular strides and the silicon will fetch your data before you ask for it.
