Daily Hardware Architecture: The Stream Buffer: How CPUs Prefetch Sequential Data Without Polluting the Cache

2026-06-03

You've seen hardware prefetchers before, but the stream buffer is a specific structure that solves a sharp problem: how do you prefetch aggressively without trashing the cache when you're wrong? The answer is to prefetch into a separate, small FIFO that sits beside L1, not inside it.

Norman Jouppi introduced stream buffers in 1990, alongside victim caches. The idea: when a load misses L1, allocate a stream buffer and start fetching the sequential cache lines that follow (line+1, line+2, line+3, line+4) into a small FIFO, typically 4 to 16 entries deep. The lines do not enter L1 until the program actually demands them. On each later miss, the L1 controller checks the stream buffers in parallel; if the address matches the head of a stream, that line is promoted into L1 and the FIFO advances, triggering another sequential prefetch at the tail.
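The allocate, match-at-head, promote, and refill cycle described above can be sketched as a toy model. This is an illustration only, not a description of any real controller: the class and function names are made up, L1 is modelled as an unbounded set of line addresses, and prefetched lines are assumed to arrive instantly.

```python
from collections import deque


class StreamBuffer:
    """Toy model of one sequential stream buffer (a FIFO of line addresses)."""

    def __init__(self, depth=4):
        self.depth = depth
        self.fifo = deque()     # prefetched line addresses, head on the left
        self.next_line = None   # next line address to prefetch at the tail

    def allocate(self, miss_line):
        """Restart the stream just past the line that missed."""
        self.fifo.clear()
        self.next_line = miss_line + 1
        self._refill()

    def _refill(self):
        # Keep the FIFO full by prefetching successive lines at the tail.
        while len(self.fifo) < self.depth:
            self.fifo.append(self.next_line)
            self.next_line += 1

    def lookup(self, line):
        """Head-only check: on a match, pop the head and prefetch one more."""
        if self.fifo and self.fifo[0] == line:
            self.fifo.popleft()
            self._refill()
            return True
        return False


def run(lines, depth=4):
    """Return (L1 hits, stream buffer hits, full misses) for a line trace."""
    l1, sb = set(), StreamBuffer(depth)
    hits_l1 = hits_sb = misses = 0
    for line in lines:
        if line in l1:
            hits_l1 += 1
        elif sb.lookup(line):
            hits_sb += 1
            l1.add(line)        # promotion into L1 happens only on demand
        else:
            misses += 1
            l1.add(line)
            sb.allocate(line)   # start a new stream after the missing line
    return hits_l1, hits_sb, misses


print(run(range(100, 116)))  # (0, 15, 1): one cold miss, then every line is at the head
print(run([0, 2, 4, 6]))     # (0, 0, 4): a stride of two lines never matches the head
```

The second trace shows the limit of a purely sequential, head-only design: any pattern that skips lines keeps reallocating the stream.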

Why a separate buffer instead of just prefetching into L1? Two reasons:

No pollution. If the stream prediction is wrong (the program never asks for line+3), the unused lines are simply discarded when the stream buffer is reallocated. They never evict real working-set data from L1.
No tag pressure. Prefetched lines occupy no L1 tag or data entries until they are used, and the lookup is cheap: in the basic design only the tag at the FIFO head is compared against the miss address. That's one comparator per stream, not a full set lookup.
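To make the pollution argument concrete, here is a small sketch that counts how many hot lines survive one wrong prefetch burst. Everything in it is an assumption chosen for illustration: a direct-mapped L1 with 8 sets, a working set that fills every set, a single demand miss to line 64, and a function name invented for this example.

```python
def hot_lines_surviving(prefetch_into_l1, num_sets=8, depth=4):
    """Count hot lines still resident in a direct-mapped L1 after one wrong prefetch burst."""
    l1 = {}  # set index -> resident line address

    def install(line):
        l1[line % num_sets] = line   # direct-mapped: overwrite the current occupant

    hot = list(range(num_sets))      # lines 0..7 occupy every set
    for line in hot:
        install(line)

    miss_line = 64                   # one demand miss to an unrelated region
    install(miss_line)               # the demanded line enters L1 in both designs
    guesses = [miss_line + i for i in range(1, depth + 1)]
    if prefetch_into_l1:
        for line in guesses:         # wrong guesses displace hot lines
            install(line)
    # Otherwise the guesses sit in the stream buffer and L1 is left untouched.

    return sum(1 for line in hot if l1.get(line % num_sets) == line)


print(hot_lines_surviving(prefetch_into_l1=True))   # 3
print(hot_lines_surviving(prefetch_into_l1=False))  # 7
```

In both cases the demanded line itself displaces one hot line; the difference is the four extra evictions caused by guesses that were never used.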

Concrete example: a large copy. Consider a memcpy of a 1 MB array. Every first touch of a new source line misses L1, but the access pattern is perfectly sequential. With a 4-entry stream buffer, the first miss puts the next 4 lines in flight. When the CPU asks for line+1, the fetch is already under way; once the line has arrived, promoting it to L1 takes a couple of cycles instead of a 200+ cycle trip to DRAM. With a deep enough buffer, effective bandwidth approaches the DRAM peak instead of being latency-limited.
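A rough timing model shows the effect on the read side of such a copy. It is a sketch under several assumptions that are not from any real machine: 64-byte lines (so 1 MB is 16384 lines), a fixed 200-cycle DRAM latency, 16 cycles of CPU work per line, unlimited memory bandwidth, and no modelling of the store stream. The function names are hypothetical.

```python
def demand_only_cycles(num_lines, latency=200, consume=16):
    """No prefetching: every line pays the full latency, then is consumed."""
    return num_lines * (latency + consume)


def streamed_read_cycles(num_lines, depth, latency=200, consume=16):
    """One stream buffer of `depth` entries (depth >= 1) feeding a sequential reader."""
    issue = [0] * num_lines   # cycle at which each line's fetch is issued
    start = [0] * num_lines   # cycle at which the CPU starts using each line
    cpu_free = 0              # cycle at which the CPU finishes the previous line
    for i in range(num_lines):
        if i > depth:
            # Popping line i-depth from the head frees a slot for line i.
            issue[i] = start[i - depth]
        # Lines 0..depth are all issued at cycle 0: the demand miss plus the first burst.
        ready = issue[i] + latency
        start[i] = max(ready, cpu_free)   # stall if the line has not arrived yet
        cpu_free = start[i] + consume
    return cpu_free


lines = (1 << 20) // 64                   # 16384 lines in 1 MB
print(demand_only_cycles(lines))          # 3538944
print(streamed_read_cycles(lines, 4))
print(streamed_read_cycles(lines, 13))    # 262344
```

In this model a 4-entry buffer settles at about 50 cycles per line (four lines per 200-cycle round trip), well below the 216 cycles of the demand-only case but still latency-limited. With 13 entries the reader never stalls after the first miss.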

Rule of thumb: Stream buffer depth should cover the memory latency. If DRAM latency is 200 cycles and you consume one line every 16 cycles, you need 200/16 = 12.5, rounded up to 13 entries, to fully hide latency. Modern Intel CPUs implement a related idea in the L2 "Streamer" prefetcher, which tracks both ascending and descending access sequences, although it prefetches into the L2 or last-level cache rather than into a separate FIFO.
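The rule of thumb is a ceiling division. A minimal helper, with a hypothetical name and treating latency and consumption rate as fixed numbers, looks like this:

```python
import math


def min_stream_depth(latency_cycles, cycles_per_line):
    """Smallest FIFO depth that keeps prefetches ahead of a steady consumer."""
    return math.ceil(latency_cycles / cycles_per_line)


print(min_stream_depth(200, 16))  # 13
print(min_stream_depth(200, 4))   # 50: a faster consumer needs a deeper buffer
```

Real latency varies with DRAM state and contention, so a value computed this way is a starting point rather than a guarantee.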

Multiple stream buffers run in parallel, typically 4 to 8, so a workload touching several arrays simultaneously (think of a stencil computation reading three rows) gets a dedicated FIFO per stream. When a new miss matches no existing stream, the least recently used stream is flushed and reallocated, much like cache replacement.
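A sketch of several stream buffers working together, assuming least-recently-used replacement among streams and treating every access as an L1 miss to keep the model small. The class name, the row stride of 1000 lines, and the trace are all invented for illustration.

```python
from collections import OrderedDict, deque


class MultiStreamBuffer:
    """Toy model: several sequential FIFOs with LRU replacement among them."""

    def __init__(self, num_streams=4, depth=4):
        self.num_streams = num_streams
        self.depth = depth              # assumed to be at least 1
        self.streams = OrderedDict()    # id -> FIFO, ordered from LRU to MRU
        self.next_id = 0

    def access(self, line):
        """Return True if the miss for `line` hits the head of some stream."""
        for sid, fifo in list(self.streams.items()):
            if fifo[0] == line:
                tail_next = fifo[-1] + 1
                fifo.popleft()
                fifo.append(tail_next)          # refill at the tail
                self.streams.move_to_end(sid)   # mark as most recently used
                return True
        # No stream matched: evict the LRU stream if all are in use.
        if len(self.streams) >= self.num_streams:
            self.streams.popitem(last=False)
        self.streams[self.next_id] = deque(range(line + 1, line + 1 + self.depth))
        self.next_id += 1
        return False


def stencil_hits(num_streams, rows=3, cols=8, row_stride=1000):
    """Interleave sequential reads of `rows` rows and count stream hits."""
    msb = MultiStreamBuffer(num_streams)
    hits = 0
    for i in range(cols):
        for r in range(rows):
            hits += msb.access(r * row_stride + i)
    return hits


print(stencil_hits(num_streams=4))  # 21: three cold misses, then every access hits
print(stencil_hits(num_streams=2))  # 0: three streams thrash two buffers
```

The second result shows why the number of buffers matters: with fewer buffers than concurrent streams, each allocation evicts a stream that is about to be used again.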

The elegance: stream buffers convert latency-bound sequential code into bandwidth-bound code, without ever risking the working set that's actually in L1.

See it in action: Check out 3 2 8 Software Prefetching to Reduce Miss Rate or Miss Penalty by Prof. Dr. Ben H. Juurlink to see this theory applied.

Key Takeaway: Stream buffers prefetch sequential lines into a separate FIFO beside L1, hiding DRAM latency on streaming workloads while keeping wrong guesses from evicting useful data.
