Daily Hardware Architecture: MSHRs: How Caches Juggle Multiple Cache Misses at Once

2026-05-11

A naive cache stalls on every miss until the data comes back. That is catastrophic: a single miss that goes all the way to DRAM costs 200+ cycles, and modern CPUs need dozens of misses in flight simultaneously to hide that latency. The structure that makes this possible is the Miss Status Holding Register (MSHR), sometimes called a Miss Address File.

An MSHR is a small table entry that tracks one outstanding miss. When a load misses L1, the sequence is:

1. The cache controller allocates a free MSHR entry, storing the missing cache line address, the destination physical register, the requesting core/thread ID, and the type of access.
2. It issues a fill request to the next cache level.
3. It marks the load as "miss pending" in the load queue, so younger independent ops can keep executing instead of waiting behind it. (Retirement is still in order; the pending load only holds up the head of the ROB once it gets there.)
4. When the line eventually arrives, the controller uses the metadata stashed in the MSHR to wake up the waiting loads, then deallocates the entry.

The clever part is secondary misses: if a second load misses on a line that is already in flight, you don't want to send a second fill request for it. Each MSHR has multiple secondary entry slots that let later loads piggyback onto the existing request. When the line arrives, every waiting load gets serviced. This is called miss coalescing or MSHR merging.
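The allocate, merge, and fill steps described above can be sketched as a toy software model. This is an illustration only, not how any particular core implements it: the class and method names, the 64-byte line size, and the limit of four merge slots per entry are all assumptions made for the example.

```python
from dataclasses import dataclass, field

LINE_BYTES = 64  # assumed cache line size


@dataclass
class MSHREntry:
    line_addr: int                               # address of the missing line
    waiters: list = field(default_factory=list)  # (dest_reg, offset) per pending load


class MSHRFile:
    """Toy model: a fixed pool of MSHRs, each with a bounded number of merge slots."""

    def __init__(self, num_entries=10, slots_per_entry=4):
        self.num_entries = num_entries
        self.slots_per_entry = slots_per_entry
        self.entries = {}        # line_addr -> MSHREntry
        self.fill_requests = []  # lines requested from the next cache level

    def on_miss(self, addr, dest_reg):
        line = addr // LINE_BYTES * LINE_BYTES
        waiter = (dest_reg, addr - line)
        entry = self.entries.get(line)
        if entry is not None:
            # Secondary miss: piggyback on the in-flight request if a slot is free.
            if len(entry.waiters) >= self.slots_per_entry:
                return "stall"
            entry.waiters.append(waiter)
            return "secondary"
        if len(self.entries) >= self.num_entries:
            return "stall"  # every MSHR is busy, so the cache has to block
        # Primary miss: allocate an entry and send exactly one fill request.
        self.entries[line] = MSHREntry(line, [waiter])
        self.fill_requests.append(line)
        return "primary"

    def on_fill(self, line):
        # The line arrived: free the entry and return every load to wake up.
        return self.entries.pop(line).waiters
```

In this model, two loads to 0x1008 and 0x1010 fall in the same 64-byte line, so the first returns "primary", the second returns "secondary", and only one fill request is recorded. A later `on_fill(0x1000)` hands back both waiters at once.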

A cache without MSHRs is called blocking. A cache with N MSHRs is N-way non-blocking, meaning N distinct miss addresses can be outstanding. Intel Skylake's L1D has roughly 10 fill buffers (its MSHR equivalent); Apple M1's L1D has around 8; Zen 4 has ~22 at L2. Once they are exhausted, the cache cannot accept a miss to any new line, so further misses stall regardless of how many ROB entries are free.

Rule of thumb (Little's Law applied to memory): Sustainable memory bandwidth = (MSHRs × cache line size) / miss latency. With 10 MSHRs, 64-byte lines, and 80 ns latency: 10 × 64 / 80e-9 ≈ 8 GB/s per core. This is why a single thread can't saturate a 50 GB/s memory channel — you're MSHR-bound, not bandwidth-bound. You need multiple cores or aggressive prefetching (which also consumes MSHRs) to fill the pipe.
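The rule of thumb is easy to check numerically. The short sketch below reproduces the 8 GB/s figure and inverts the formula to estimate how many misses would have to be in flight to reach 50 GB/s at the same latency. The function names are made up for this example.

```python
import math


def mshr_bound_bandwidth(mshrs, line_bytes, latency_s):
    """Little's Law: bytes in flight divided by the time each miss takes."""
    return mshrs * line_bytes / latency_s


def mshrs_needed(target_bytes_per_s, line_bytes, latency_s):
    """Outstanding misses required to sustain a target bandwidth."""
    return math.ceil(target_bytes_per_s * latency_s / line_bytes)


if __name__ == "__main__":
    bw = mshr_bound_bandwidth(mshrs=10, line_bytes=64, latency_s=80e-9)
    print(f"{bw / 1e9:.1f} GB/s per core")
    print(mshrs_needed(50e9, line_bytes=64, latency_s=80e-9), "misses in flight for 50 GB/s")
```

With the numbers from the text, the second call works out to 63 outstanding misses, far more than a single core's 10 fill buffers.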

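Real example: Linked-list traversal is the classic MSHR killer. Each pointer chase depends on the previous load, so only one miss is ever outstanding and your 10 MSHRs sit idle. When the nodes live in DRAM, every node costs a full miss latency (about 80 ns with the figures above) instead of the roughly 8 ns per line that ten overlapped misses could sustain. Store the nodes contiguously in an array, so that each address is known without waiting for the previous load, and the prefetcher and out-of-order engine can fill the MSHRs in parallel; throughput can jump by up to 10×.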
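A back-of-the-envelope timing model makes the gap concrete. The sketch below is idealized and rests on assumptions not stated in the newsletter: every access misses to a distinct line, the latency is a fixed 80 ns, and issue bandwidth, prefetcher behavior, and DRAM bandwidth limits are ignored. The function name is hypothetical.

```python
def traversal_time_ns(n_misses, mshrs, latency_ns, dependent):
    """Idealized time to complete n_misses, each to a distinct cache line."""
    if dependent:
        # Pointer chase: the next address is unknown until the previous
        # load returns, so the misses are fully serialized.
        return n_misses * latency_ns
    # Independent addresses: each miss issues as soon as an MSHR is free.
    free_at = [0] * mshrs  # time at which each MSHR becomes available
    finish = 0
    for _ in range(n_misses):
        i = min(range(mshrs), key=free_at.__getitem__)  # earliest-free MSHR
        finish = free_at[i] + latency_ns
        free_at[i] = finish
    return finish


if __name__ == "__main__":
    chase = traversal_time_ns(1000, mshrs=10, latency_ns=80, dependent=True)
    array = traversal_time_ns(1000, mshrs=10, latency_ns=80, dependent=False)
    print(chase / 1000, "ns per node when chasing pointers")
    print(array / 1000, "ns per line with independent addresses")
```

Under these assumptions the dependent chain pays the full latency per node, while the independent stream is limited only by how many MSHRs can overlap, which is where the factor of ten comes from.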

See it in action: Check out 18 Cache Design 2 by Yifan GPU to see this theory applied.

Key Takeaway: MSHRs cap how many cache misses a core can have in flight, making them the hidden ceiling on memory-level parallelism and single-threaded bandwidth.
