Daily Hardware Architecture: The Memory-Level Parallelism (MLP) Wall: Why Adding More Out-of-Order Window Stops Helping

2026-06-07

Out-of-order execution exists largely to hide memory latency. When a load misses to DRAM (~200-300 cycles), the CPU keeps executing past it, hoping to find another independent load to issue in parallel. The number of outstanding misses a core can sustain is its Memory-Level Parallelism, and for memory-bound code it is usually the real bottleneck, rather than ROB size or issue width.

MLP is gated by the smallest of several hardware structures:

L1 MSHRs: typically 10-16 per core. Once full, new misses stall at issue.
L2 fill buffers: usually 16-32 entries shared across L1 I/D misses.
LFB (Line Fill Buffers) on Intel: Intel's name for its L1D MSHRs. Famously 10 for many generations, raised to 12 or more on newer designs.
Load queue depth: ~128 entries on modern x86, but you can't fill it if MSHRs are saturated.

Here's the rule of thumb. By Little's Law: achievable bandwidth = outstanding misses × line size / latency. With 10 LFBs, 64-byte lines, and 80ns DRAM latency: 10 × 64 / 80ns = 8 GB/s per core. That's it. Doesn't matter that your DRAM channel does 25 GB/s — one core physically cannot pull more because it can't keep enough requests in flight.
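To make the arithmetic easy to replay with other numbers, here is a minimal sketch of the Little's Law estimate above. The function names are made up for illustration, and the inputs are the article's round figures rather than measurements. It also inverts the formula to ask how many misses would have to be in flight to saturate a 25 GB/s channel.

```python
def littles_law_bandwidth_gbps(outstanding_misses, line_bytes, latency_ns):
    """Sustained bandwidth in GB/s. Bytes per nanosecond equals decimal GB/s."""
    return outstanding_misses * line_bytes / latency_ns


def misses_needed(target_gbps, line_bytes, latency_ns):
    """Little's Law inverted: misses in flight required to sustain target_gbps."""
    return target_gbps * latency_ns / line_bytes


# The article's figures: 10 LFBs, 64-byte lines, 80 ns DRAM latency.
per_core = littles_law_bandwidth_gbps(10, 64, 80.0)
# How many misses one core would need in flight to fill a 25 GB/s channel.
needed = misses_needed(25.0, 64, 80.0)

print(f"per-core ceiling: {per_core:.1f} GB/s")
print(f"misses in flight needed for 25 GB/s: {needed:.2f}")
```

Under these assumptions a single core would need roughly three times as many miss slots as it has before the DRAM channel became the limit.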

Real-world example: pointer-chasing a linked list. Each load depends on the previous one, so MLP = 1. At 80ns per hop, you get 12.5M nodes/second, which is about 800 MB/s of effective bandwidth (counting one 64-byte line per node) on a machine rated for 50 GB/s. This is why std::list traversal benchmarks look catastrophic and why game engines flatten everything into arrays. The fix isn't a bigger ROB; everything in the window already depends on the one outstanding load, so extra entries just hold more instructions that can't issue. The fix is independent misses: prefetch, software pipelining, or restructured data layout so the CPU sees multiple chase chains at once.
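A rough sketch of both halves of that argument, under stated assumptions: chase_throughput is a toy model (one 64-byte line per node, every hop a DRAM miss, 10 miss slots), and walk_interleaved shows the shape of the "multiple chase chains" fix on a hypothetical index-linked array. Python will not show the speedup itself; the point is only that the hops inside one step do not depend on each other.

```python
def chase_throughput(chains, latency_ns, miss_slots=10, line_bytes=64):
    """Toy model: (nodes/s, MB/s) when `chains` independent lists are walked together."""
    in_flight = min(chains, miss_slots)          # MLP is capped by the miss slots
    nodes_per_sec = in_flight * 1e9 / latency_ns
    return nodes_per_sec, nodes_per_sec * line_bytes / 1e6


def walk_interleaved(next_idx, heads, steps):
    """Advance every chain one hop per step; hops within a step are independent."""
    cursors = list(heads)
    for _ in range(steps):
        for c in range(len(cursors)):
            cursors[c] = next_idx[cursors[c]]    # one dependent load per chain
    return cursors


# One chain: the article's 12.5M nodes/s and 800 MB/s.
single = chase_throughput(chains=1, latency_ns=80.0)
# Four chains in flight: four times the throughput in this model.
four = chase_throughput(chains=4, latency_ns=80.0)
# Sixteen chains: capped by the 10 assumed miss slots.
capped = chase_throughput(chains=16, latency_ns=80.0)

# Two hypothetical chains stored in one array: 0 -> 2 -> 4 -> 0 and 1 -> 3 -> 5 -> 1.
next_idx = [2, 3, 4, 5, 0, 1]
ends = walk_interleaved(next_idx, heads=[0, 1], steps=2)
print(single, four, capped, ends)
```

The model ignores hardware prefetchers and cache hits, so treat its outputs as ceilings for the stated assumptions, not predictions.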

This is also why more than doubling ROB size from 224 to 512 entries (Skylake → Golden Cove) yields surprisingly modest gains on memory-bound workloads. The window is much bigger, but the LFB count has grown far more slowly. You're parking more instructions behind nearly the same number of outstanding misses. A bigger window lets you see further; MLP determines whether seeing further does anything.
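One way to see why the bigger window stops paying off is a back-of-the-envelope model: if an independent miss shows up every N instructions, a window of R entries can hold about R / N of them, and the miss slots cap whatever the window exposes. This is a deliberately crude sketch; the miss spacings (16 and 64 instructions) and the 12-slot cap are illustrative assumptions, not measurements of any real core.

```python
def exposed_misses(rob_entries, instrs_per_miss, miss_slots):
    """Return (misses visible in the window, misses actually in flight after the cap)."""
    in_window = rob_entries // instrs_per_miss
    return in_window, min(in_window, miss_slots)


# Dense misses (one every 16 instructions): both windows hit the 12-slot cap.
dense_small = exposed_misses(rob_entries=224, instrs_per_miss=16, miss_slots=12)
dense_large = exposed_misses(rob_entries=512, instrs_per_miss=16, miss_slots=12)

# Sparse misses (one every 64 instructions): here the larger window does help.
sparse_small = exposed_misses(rob_entries=224, instrs_per_miss=64, miss_slots=12)
sparse_large = exposed_misses(rob_entries=512, instrs_per_miss=64, miss_slots=12)

print("dense: ", dense_small, dense_large)
print("sparse:", sparse_small, sparse_large)
```

In this model the extra entries only matter when misses are sparse enough that the smaller window could not reach the miss-slot cap in the first place.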

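The architectural lesson: when profiling a memory-bound loop, measure outstanding L1 misses via perf counters (l1d_pend_miss.pending on Intel). That event accumulates the number of outstanding misses every cycle, so divide it by l1d_pend_miss.pending_cycles to get the average number in flight. If that average is pinned near the LFB count (about 10), your ROB doesn't matter, your IPC doesn't matter, your clock speed barely matters. You've hit the MLP wall, and only restructuring the access pattern moves it.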
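As a sketch of how that measurement might be turned into a number: l1d_pend_miss.pending accumulates outstanding misses per cycle, so dividing it by a companion count of cycles with at least one miss outstanding (assumed here to be l1d_pend_miss.pending_cycles, whose availability varies by microarchitecture) gives the average misses in flight. The helper names, the 90% threshold, and the counter values below are all hypothetical.

```python
def average_mlp(pending, pending_cycles):
    """Average outstanding L1D misses over cycles that had at least one outstanding."""
    if pending_cycles == 0:
        return 0.0                      # no miss was ever outstanding
    return pending / pending_cycles


def near_wall(mlp, miss_slots=10, fraction=0.9):
    """Heuristic: treat the loop as MLP-bound if it averages >= 90% of the miss slots."""
    return mlp >= fraction * miss_slots


# Hypothetical counter readings, e.g. collected with something like:
#   perf stat -e l1d_pend_miss.pending,l1d_pend_miss.pending_cycles ./app
streaming = average_mlp(pending=9_800_000_000, pending_cycles=1_000_000_000)
chasing = average_mlp(pending=1_050_000_000, pending_cycles=1_000_000_000)

print(f"streaming loop: MLP ~ {streaming:.1f}, at the wall: {near_wall(streaming)}")
print(f"pointer chase:  MLP ~ {chasing:.2f}, at the wall: {near_wall(chasing)}")
```

An average close to the slot count suggests the loop is already using every miss slot; an average near 1 points at a dependency chain rather than a slot shortage.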

Key Takeaway: Single-core memory bandwidth is capped by outstanding miss slots, not DRAM speed — Little's Law turns ~10 LFBs into a hard ceiling that no amount of ROB growth can lift.
