Daily Hardware Architecture: The Fill Buffer: How Caches Track Lines That Haven't Arrived Yet

2026-06-05

When an L1 cache misses, the line it wants is somewhere downstream — maybe L2, maybe L3, maybe DRAM 200 cycles away. The cache can't just freeze. It needs a place to remember the miss is in flight and a staging area for the bytes when they finally arrive. That's the fill buffer (Intel calls them Line Fill Buffers; AMD calls them Miss Address Buffers; ARM uses the term "linefill buffer" directly).

A fill buffer entry holds:

The physical address of the missing line
A bitmask tracking which bytes have arrived (lines come back in chunks, often 16 or 32 bytes per beat)
The list of in-flight loads/stores waiting on this line
State bits for coherence (am I getting this Shared or Exclusive?)
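
As a rough sketch, the four fields above can be modeled in Python. The class and method names (FillBufferEntry, record_chunk, has_bytes, complete) are hypothetical, and the 64-byte line size is an assumption; this is an illustration of the bookkeeping, not a description of any vendor's actual entry layout.

```python
from dataclasses import dataclass, field

LINE_SIZE = 64  # bytes per cache line (assumed)


@dataclass
class FillBufferEntry:
    """Hypothetical model of one fill buffer entry."""
    line_addr: int                       # physical address of the missing line
    valid_mask: int = 0                  # bit i set => byte i has arrived
    waiters: list = field(default_factory=list)  # IDs of loads/stores parked here
    coherence_state: str = "pending"     # becomes e.g. "S" or "E" with the response

    def record_chunk(self, offset: int, length: int) -> None:
        """Mark bytes [offset, offset + length) of the line as arrived."""
        self.valid_mask |= ((1 << length) - 1) << offset

    def has_bytes(self, offset: int, length: int) -> bool:
        """True if every byte in [offset, offset + length) has arrived."""
        need = ((1 << length) - 1) << offset
        return (self.valid_mask & need) == need

    def complete(self) -> bool:
        """True once the whole line is present and can be written to the cache."""
        return self.valid_mask == (1 << LINE_SIZE) - 1
```

A load waiting on bytes 40..47 would be appended to waiters at miss time and released as soon as has_bytes(40, 8) turns true, even if complete() is still false.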

The fill buffer is what makes early restart and critical-word-first possible. If a load wants byte 40 of a 64-byte line, the memory controller can return the chunk containing byte 40 first; the fill buffer notices that the required bytes are now present and wakes the load before the rest of the line lands. The load completes, and the remaining bytes drizzle in over the next few cycles to fill the cache.
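
To make the byte-40 example concrete, here is a small sketch comparing in-order delivery with critical-word-first delivery. The function names are hypothetical, the 16-byte beat size is taken from the range mentioned above, and the wrap-around ordering is an assumption (real interconnects may use other burst orders).

```python
def critical_word_first_order(offset, line_size=64, chunk=16):
    """Chunk start offsets, beginning with the chunk that holds `offset`
    and wrapping around the line (assumed ordering)."""
    n = line_size // chunk
    first = offset // chunk
    return [((first + i) % n) * chunk for i in range(n)]


def beats_until_ready(offset, length, order, chunk=16):
    """Number of beats that must arrive before bytes
    [offset, offset + length) are all present."""
    need = ((1 << length) - 1) << offset
    mask = 0
    for beat, start in enumerate(order, start=1):
        mask |= ((1 << chunk) - 1) << start   # this chunk's bytes are now valid
        if (mask & need) == need:
            return beat                        # early restart: wake the load here
    raise ValueError("requested bytes fall outside the line")


in_order = [0, 16, 32, 48]
cwf = critical_word_first_order(40)            # [32, 48, 0, 16]
print(beats_until_ready(40, 8, in_order))      # 3
print(beats_until_ready(40, 8, cwf))           # 1
```

Under these assumptions the load wakes after one beat instead of three; the remaining three beats only matter for completing the line in the cache.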

Fill buffers are also where write-combining happens for WC memory — stores merge into a fill buffer entry instead of going to cache, then drain as a single burst.
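
The merging behavior can be sketched as follows. WriteCombiningEntry and its methods are hypothetical names, the 64-byte line is assumed, and the model ignores eviction policy and partially filled buffers, which real hardware has to handle.

```python
LINE_SIZE = 64  # bytes per line (assumed)


class WriteCombiningEntry:
    """Hypothetical model of a fill buffer entry used for WC stores."""

    def __init__(self, line_addr: int):
        self.line_addr = line_addr
        self.data = bytearray(LINE_SIZE)
        self.written_mask = 0              # bit i set => byte i has been stored

    def store(self, addr: int, payload: bytes) -> None:
        """Merge a store into the entry instead of sending it to the cache."""
        offset = addr - self.line_addr
        if offset < 0 or offset + len(payload) > LINE_SIZE:
            raise ValueError("store does not fall inside this line")
        self.data[offset:offset + len(payload)] = payload
        self.written_mask |= ((1 << len(payload)) - 1) << offset

    def is_full(self) -> bool:
        """True when every byte of the line has been written."""
        return self.written_mask == (1 << LINE_SIZE) - 1

    def drain(self) -> bytes:
        """Return the merged line as one burst and reset the entry."""
        merged = bytes(self.data)
        self.data = bytearray(LINE_SIZE)
        self.written_mask = 0
        return merged


wc = WriteCombiningEntry(0x2000)
for i in range(8):                         # eight 8-byte stores cover the line
    wc.store(0x2000 + 8 * i, bytes([i]) * 8)
burst = wc.drain() if wc.is_full() else None
```

Eight separate stores leave the core as a single 64-byte transfer in this model, which is the point of write-combining.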

Concrete example: Skylake-class Intel cores are usually cited as having 10 to 12 fill buffers per core; take 10 for the arithmetic here. That caps outstanding L1D misses at 10. A loop streaming through memory with one cache miss every ~6 ns and a 60 ns miss latency (a DRAM-class figure; an L2 hit returns in a few nanoseconds) needs 10 in-flight misses to hide the latency (Little's Law: 60/6 = 10). That equals the buffer count, so this loop sits right at the limit. Cross that threshold and the core stalls on fill buffer allocation, not on memory itself. This is why perf's l1d_pend_miss.fb_full counter exists; it's a direct readout of "you've hit the parallelism wall."
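
The Little's Law arithmetic can be checked with a few lines of Python. The function names are hypothetical; the inputs are the figures quoted in the text.

```python
import math


def misses_in_flight_needed(latency_ns, miss_interval_ns):
    """Little's Law: concurrency = latency x arrival rate,
    rounded up to a whole number of outstanding misses."""
    return math.ceil(latency_ns / miss_interval_ns)


def stalls_on_allocation(latency_ns, miss_interval_ns, fill_buffers):
    """True if the loop wants more outstanding misses than there are buffers."""
    return misses_in_flight_needed(latency_ns, miss_interval_ns) > fill_buffers


print(misses_in_flight_needed(60, 6))        # 10
print(stalls_on_allocation(60, 6, 10))       # False: exactly at the limit
print(stalls_on_allocation(80, 6, 10))       # True: 14 misses wanted, 10 buffers
```

Raising the latency or the miss rate pushes the required concurrency past the buffer count, which is the situation the fb_full counter reports.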

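Rule of thumb: Maximum sustainable memory bandwidth per core ≈ (fill buffers × line size) / memory latency. Using 10 fill buffers and an 80 ns memory latency: (10 × 64 B) / 80 ns = 8 GB/s per core. That's why a single thread issuing demand misses can't saturate a 50 GB/s memory system on its own: it's structurally bottlenecked by the fill buffer count, not by DRAM. Hardware prefetching into L2 can raise the observed figure by shortening the latency each L1 miss sees, but the limit is still set by concurrency rather than by DRAM bandwidth.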
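
The rule of thumb is easy to evaluate for other parameter choices. This sketch uses a hypothetical helper name and the numbers from the text; bytes per nanosecond is numerically equal to GB/s (with 1 GB = 10^9 bytes).

```python
def per_core_bandwidth_gb_s(fill_buffers, line_bytes, latency_ns):
    """Upper bound on demand-miss bandwidth for one core:
    (outstanding lines x bytes per line) / latency.
    Bytes per ns equals GB/s when 1 GB = 1e9 bytes."""
    return fill_buffers * line_bytes / latency_ns


print(per_core_bandwidth_gb_s(10, 64, 80))   # 8.0
```

Plugging in a lower effective latency or a larger buffer count shows how sensitive the ceiling is to each term; halving the latency doubles the bound.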

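Fill buffers also became infamous as the leak channel for MDS / RIDL (CVE-2018-12130, Microarchitectural Fill Buffer Data Sampling). Stale data sitting in a fill buffer entry could be speculatively forwarded to a faulting load, leaking across hyperthreads. The mitigation (executing VERW on transitions to less-privileged code, such as kernel exit and VM entry) explicitly scrubs them; because VERW only clears the buffers at that moment, protecting against a sibling hyperthread also requires scheduling restrictions or disabling SMT.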

Key Takeaway: Fill buffers are the cache's outstanding-miss tracker, and their count sets a hard ceiling on single-thread memory bandwidth via Little's Law.