Critical Word First and Early Restart: How Caches Return the Important Bytes Before the Line Finishes Loading

2026-06-08

When a load misses in L1 and is served from L2 (or worse, DRAM), the core needs a full 64-byte cache line. The data path is usually narrower than the line: on-chip links commonly move 16 or 32 bytes per beat, so the line arrives as a burst of 2-4 beats, and a 64-bit DDR4 channel moves only 8 bytes per beat, so a line takes 8. A naive design waits for every beat before letting the load complete. Two techniques avoid that wait.

Critical Word First (CWF): The burst is reordered so that the beat containing the 8-byte word the load actually needs arrives first, instead of the burst starting at byte 0 of the line. If your load wants bytes 32-39 and the bus moves 16 bytes per beat, the burst sequence becomes [32-47, 48-63, 0-15, 16-31]. DDR SDRAM supports this directly: the low-order column address bits sent with the read command select which beat of the burst comes out first, and the remaining beats follow in an order defined by the burst type. Burst chop is a separate feature that shortens the burst; it does not do the reordering.
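To make the reordering concrete, here is a minimal sketch of a wrap-around burst order. The function name cwf_burst_order and its parameters are made up for illustration, and the simple wrap shown here matches the example above; real DRAM burst orderings can differ in detail.

```python
def cwf_burst_order(critical_offset, line_bytes=64, beat_bytes=16):
    """Return the beats of one line fill as (first_byte, last_byte) pairs,
    starting with the beat that holds critical_offset and wrapping around.

    Hypothetical helper: models a plain wrap-around order only.
    """
    if not 0 <= critical_offset < line_bytes:
        raise ValueError("offset must fall inside the line")
    beats = line_bytes // beat_bytes
    first = critical_offset // beat_bytes  # beat containing the wanted word
    order = []
    for i in range(beats):
        beat = (first + i) % beats  # wrap past the end of the line
        start = beat * beat_bytes
        order.append((start, start + beat_bytes - 1))
    return order


print(cwf_burst_order(32))
# [(32, 47), (48, 63), (0, 15), (16, 31)]
```

With a 32-byte beat the same call shape gives a two-beat burst: cwf_burst_order(40, beat_bytes=32) starts with bytes 32-63 and then wraps to bytes 0-31.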

Early Restart: The moment the critical word lands in the fill buffer, forward it to the waiting load and let dependent instructions wake up. The rest of the line keeps streaming in behind it, populating the cache. The load doesn't block until the whole line is resident.
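The two techniques are separable, and a toy model shows what each one buys. The sketch below counts how many beats must arrive before the waiting load can finish under three policies. The policy names and the function beats_until_load_completes are assumptions for illustration, and the model assumes the load fits inside a single beat.

```python
def beats_until_load_completes(critical_offset, policy,
                               line_bytes=64, beat_bytes=16):
    """Count the beats that must arrive before the waiting load can finish.

    policy (hypothetical labels):
      "wait_for_line" - natural order, load completes after the last beat
      "early_restart" - natural order, load completes when its beat arrives
      "cwf"           - critical beat sent first, plus early restart
    Assumes the load does not straddle two beats.
    """
    beats = line_bytes // beat_bytes
    critical_beat = critical_offset // beat_bytes
    if policy == "wait_for_line":
        return beats
    if policy == "early_restart":
        return critical_beat + 1
    if policy == "cwf":
        return 1
    raise ValueError(f"unknown policy: {policy}")


for policy in ("wait_for_line", "early_restart", "cwf"):
    print(policy, beats_until_load_completes(32, policy))
# wait_for_line 4
# early_restart 3
# cwf 1
```

Early restart on its own only helps when the wanted word happens to sit near the start of the line; CWF is what makes the first beat the useful one regardless of offset.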

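Concrete example: assume an L2-to-L1 data path that returns 32 bytes per cycle, as on some of Intel's ring-bus-era cores. A 64-byte fill then takes 2 cycles on the data path. With CWF + early restart, the load's data is available after 1 cycle of transfer instead of 2. That halves the transfer portion of the miss, not the whole miss: the L2 lookup latency that precedes the first beat is unchanged, so the perceived saving on an L2 hit is one cycle. The saving grows when the line travels over a narrower path and the burst spans more beats. Summed over millions of misses per second in a memory-bound workload like graph traversal (for example the GAP Benchmark Suite), those cycles can still add up.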

The gotcha: Any other access to the same line while it's mid-fill must check which parts of the line have already arrived. If a second load wants bytes 0-7 and only bytes 32-63 have arrived so far, that load stalls or replays. This is why a fill buffer needs valid bits at the granularity of a beat (or finer), not just a single "in flight" bit for the whole line.
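Here is a rough sketch of that bookkeeping, assuming one valid bit per 32-byte beat. The FillBuffer class and its method names are invented for illustration and do not describe any particular core's fill buffer.

```python
class FillBuffer:
    """Toy model of one fill-buffer entry with a valid bit per beat."""

    def __init__(self, line_bytes=64, beat_bytes=32):
        self.line_bytes = line_bytes
        self.beat_bytes = beat_bytes
        self.valid = [False] * (line_bytes // beat_bytes)

    def beat_arrived(self, offset):
        """Mark the beat containing byte `offset` as valid."""
        self.valid[offset // self.beat_bytes] = True

    def can_service(self, offset, size):
        """True if every beat touched by [offset, offset + size) has arrived."""
        if size <= 0 or offset < 0 or offset + size > self.line_bytes:
            raise ValueError("access must lie inside the line")
        first = offset // self.beat_bytes
        last = (offset + size - 1) // self.beat_bytes
        return all(self.valid[first:last + 1])

    def complete(self):
        """True once the whole line is resident."""
        return all(self.valid)


fb = FillBuffer()
fb.beat_arrived(32)            # critical beat (bytes 32-63) lands first
print(fb.can_service(32, 8))   # True: the first load can restart early
print(fb.can_service(0, 8))    # False: a second load to bytes 0-7 must wait
fb.beat_arrived(0)             # remaining beat (bytes 0-31) arrives
print(fb.can_service(0, 8))    # True
```

A single "in flight" bit could only answer complete(); the per-beat bits are what let the first load proceed while the second one waits.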

Rule of thumb: compared with waiting for the whole line, CWF + early restart saves (N-1)/N of the transfer time on the critical load, where N is the number of beats in the fill. For a 64-byte line on a 32-byte bus (N = 2) that is 50%; on a 16-byte bus (N = 4) it is 75%. The fraction applies to the data-transfer time only, not to the fixed latency before the first beat, and only the first consumer of the line benefits.
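The arithmetic is easy to check, and it is worth seeing how the number shrinks once the latency before the first beat is counted. In this sketch the function names, the one-beat-per-cycle assumption, and the fixed_cycles value of 12 are all made up for illustration; the (N-1)/N fraction describes the data transfer itself.

```python
from fractions import Fraction


def transfer_savings(line_bytes, bus_bytes):
    """Best-case fraction of the data-transfer time saved: (N - 1) / N."""
    n = line_bytes // bus_bytes
    return Fraction(n - 1, n)


def whole_miss_savings(line_bytes, bus_bytes, fixed_cycles):
    """Fraction of the whole miss saved, assuming one beat per cycle and
    `fixed_cycles` of latency before the first beat (a made-up parameter).

    Baseline (wait for the line): fixed_cycles + N cycles.
    CWF + early restart:          fixed_cycles + 1 cycles.
    """
    n = line_bytes // bus_bytes
    return Fraction(n - 1, fixed_cycles + n)


print(transfer_savings(64, 32))                     # 1/2
print(transfer_savings(64, 16))                     # 3/4
print(whole_miss_savings(64, 16, fixed_cycles=12))  # 3/16
```

Under those assumptions, the 75% figure for a 4-beat fill becomes roughly 19% of the whole miss, which is still a useful saving but a much smaller headline.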

This is also why line splits (loads crossing a cache-line boundary) are so painful: if both lines miss, you are waiting on two separate fills, each with its own critical word, and the slower one dictates completion.

Key Takeaway: Cache misses don't have to wait for the full line — CWF reorders the burst so your specific word arrives first, and early restart wakes up dependent instructions before the rest of the line has even finished streaming in.
