2026-06-03
You've seen hardware prefetchers before, but the stream buffer is a specific structure that solves a sharp problem: how do you prefetch aggressively without trashing the cache when you're wrong? The answer is to prefetch into a separate, small FIFO that sits beside L1, not inside it.
Norman Jouppi introduced stream buffers in 1990 as a complement to victim caches. The idea: when a load misses L1, allocate a stream buffer and start fetching the sequential cache lines that follow (line+1, line+2, line+3, line+4) into a small FIFO, typically 4 to 16 entries deep. The lines never enter L1 until the program actually demands them. On each later L1 miss, the controller checks the stream buffers in parallel with the normal miss path; if the address matches the head of a stream, that line is promoted into L1 and the FIFO advances, triggering another sequential prefetch at the tail.
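The allocate, head-match, and promote steps above can be sketched as a small functional model. This is a minimal sketch, not Jouppi's hardware: the names `StreamBuffer` and `run` are made up for illustration, addresses are already cache-line numbers, L1 is treated as unbounded, and timing is ignored, so a prefetched line counts as available the moment it is issued.

```python
from collections import deque


class StreamBuffer:
    """One sequential stream buffer: a FIFO of prefetched line numbers."""

    def __init__(self, depth=4):
        self.depth = depth
        self.fifo = deque()    # head of the stream is at the left
        self.next_line = None  # next line to prefetch at the tail

    def allocate(self, miss_line):
        """Restart the stream just past a line that missed everywhere."""
        self.fifo.clear()
        self.next_line = miss_line + 1
        self._refill()

    def _refill(self):
        # Keep `depth` sequential lines queued (assumed to arrive instantly).
        while len(self.fifo) < self.depth:
            self.fifo.append(self.next_line)
            self.next_line += 1

    def lookup(self, line):
        """On an L1 miss, compare against the head of the FIFO only."""
        if self.fifo and self.fifo[0] == line:
            self.fifo.popleft()  # the line is promoted into L1
            self._refill()       # one more sequential prefetch at the tail
            return True
        return False


def run(lines, depth=4):
    """Classify each access as an L1 hit, a stream buffer hit, or a memory miss."""
    l1 = set()  # assumption: unbounded L1, so no evictions
    sb = StreamBuffer(depth)
    stats = {"l1_hit": 0, "sb_hit": 0, "mem_miss": 0}
    for line in lines:
        if line in l1:
            stats["l1_hit"] += 1
        elif sb.lookup(line):
            stats["sb_hit"] += 1
            l1.add(line)
        else:
            stats["mem_miss"] += 1
            l1.add(line)
            sb.allocate(line)
    return stats


if __name__ == "__main__":
    # Sequential sweep: only the very first line goes all the way to memory.
    print(run(range(100)))
```

In this model a sequential sweep pays one memory miss and is served from the FIFO afterwards, while a non-sequential miss simply restarts the stream and leaves L1 untouched.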
Why a separate buffer instead of just prefetching into L1? Two reasons:
Concrete example: a bulk copy. Consider memcpy of a 1 MB array. Without prefetching, the first load to every cache line misses, even though the access pattern is perfectly sequential. With a 4-entry stream buffer, the first miss puts 4 more lines in flight. When the CPU reaches line+1, that line is either already in the stream buffer or well on its way, so it is promoted to L1 in a couple of cycles instead of paying the full 200+ cycles of DRAM latency. Once the buffer is deep enough to cover that latency, effective bandwidth approaches the DRAM peak instead of being latency-limited.
Rule of thumb: stream buffer depth should cover the memory latency. If DRAM latency is 200 cycles and you consume one line every 16 cycles, you need 200/16 = 12.5, rounded up to 13 entries, to fully hide it. Modern Intel CPUs carry the idea forward in the L2 "Streamer" prefetcher, which runs a comparable distance ahead of demand and tracks both ascending and descending address sequences, although it fills the cache itself rather than a side FIFO.
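The depth rule can be checked with a toy timing model. This is a sketch under loud assumptions: memory bandwidth is unlimited, every request takes exactly `latency` cycles, the consumer spends `cycles_per_line` on each line, and a new prefetch is issued the moment a line leaves the head of the FIFO. The function names are illustrative only.

```python
import math


def required_depth(latency, cycles_per_line):
    """Smallest depth whose queued lines cover one full memory latency."""
    return math.ceil(latency / cycles_per_line)


def stall_cycles(n_lines, depth, latency=200, cycles_per_line=16):
    """Total cycles the consumer waits while sweeping n_lines sequential lines."""
    # At t=0 the demand miss for line 0 and the first `depth` prefetches
    # are all issued together (assumption: unlimited bandwidth).
    arrive = [latency] * min(n_lines, depth + 1)
    now = 0
    stalled = 0
    for i in range(n_lines):
        if arrive[i] > now:  # line i has not arrived yet: stall
            stalled += arrive[i] - now
            now = arrive[i]
        # Lines 1.. come from the FIFO; taking one frees a slot,
        # so the next unfetched line is requested now.
        if i >= 1 and len(arrive) < n_lines:
            arrive.append(now + latency)
        now += cycles_per_line  # time spent consuming line i
    return stalled


if __name__ == "__main__":
    print("required depth:", required_depth(200, 16))
    for depth in (4, 12, 13):
        print(depth, stall_cycles(1000, depth))
```

With 13 entries the only stall left in this model is the unavoidable first miss; at 12 entries a small stall reappears, and at 4 entries the loop spends most of its time waiting.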
Multiple stream buffers run in parallel, typically 4 to 8, so a workload touching several arrays at once (think of a stencil computation reading three rows) gets a dedicated FIFO per stream. When a new miss doesn't match any existing stream, the least recently used stream is flushed and reallocated, much like LRU cache replacement.
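The effect of having enough streams can be seen in a stripped-down model of a multi-way stream buffer. As a sketch only: the input is assumed to be the sequence of line numbers that miss L1, each stream is represented just by the line at its head (its FIFO always holds consecutive lines, so the head identifies it), timing is ignored, and the replacement policy is LRU. The `simulate` function and the three-row trace are hypothetical.

```python
def simulate(miss_lines, n_streams=4):
    """Return (stream_buffer_hits, memory_misses) for a trace of L1 misses."""
    heads = []  # head line of each active stream, least recently used first
    sb_hits = 0
    mem_misses = 0
    for line in miss_lines:
        if line in heads:
            # Head match: the FIFO advances and the stream becomes most recent.
            heads.remove(line)
            heads.append(line + 1)
            sb_hits += 1
        else:
            # No stream matches: flush the LRU stream and restart it here.
            if len(heads) == n_streams:
                heads.pop(0)
            heads.append(line + 1)
            mem_misses += 1
    return sb_hits, mem_misses


if __name__ == "__main__":
    # Stencil-like trace: three rows read in lockstep, far apart in memory.
    trace = [base + i for i in range(100) for base in (0, 1000, 2000)]
    print("4 streams:", simulate(trace, n_streams=4))
    print("2 streams:", simulate(trace, n_streams=2))
```

With four streams each row keeps its own FIFO and only the first touch of each row goes to memory. With two streams the three rows keep evicting one another, so no stream survives long enough to produce a hit.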
The elegance: stream buffers convert latency-bound sequential code into bandwidth-bound code, without ever risking the working set that's actually in L1.
