Daily Hardware Architecture: Store-to-Load Forwarding: How CPUs Read What Hasn't Been Written Yet

2026-05-15

When a store executes, its address and data sit in the store buffer, often for tens of cycles, until the store retires and commits to the L1 cache. If a younger load reads the same address during that window, making it wait for the commit would stall the load and every instruction that depends on it. Store-to-load forwarding (STLF) is the hardware shortcut: the load takes its data directly from the store buffer entry instead of from the cache.

The mechanism lives inside the load/store unit. When a load issues, it performs a CAM (content-addressable memory) search across all in-flight store buffer entries. If it finds an older store to the same address with data ready, it forwards from the youngest such store. If the store's data isn't ready yet, the load waits. If the addresses only partially overlap, things get painful.

The alignment gotcha: Forwarding requires the load to be fully contained within a single store, and on older x86 implementations also aligned to the store's boundaries. A 4-byte store at offset 0 followed by a 2-byte load at offset 1? Forwards fine on modern Intel. But a 1-byte store followed by an 8-byte load that includes that byte? The load must wait for the store to drain to L1 and then re-read the line: a store-forwarding stall costing roughly 10-20 cycles.
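The lookup and the containment rule can be sketched as a toy software model. This is only an illustration under simplifying assumptions: real hardware does the search in parallel, every store address is assumed to be known already, and the names StoreEntry, LoadOutcome and classify_load are hypothetical, not taken from any vendor documentation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of one store buffer entry (hypothetical layout). */
typedef struct {
    uint64_t addr;       /* address of the first byte written */
    uint8_t  size;       /* bytes written: 1, 2, 4 or 8 */
    bool     data_ready; /* store data has already been produced */
} StoreEntry;

typedef enum {
    LOAD_FROM_CACHE, /* no older in-flight store overlaps the load */
    LOAD_FORWARDED,  /* youngest overlapping store fully covers the load */
    LOAD_WAIT_DATA,  /* covering store found, but its data is not ready */
    LOAD_BLOCKED     /* partial overlap: wait for the store to reach L1 */
} LoadOutcome;

/* True if the store and the load share at least one byte. */
static bool overlaps(const StoreEntry *s, uint64_t addr, uint8_t size) {
    return addr < s->addr + s->size && s->addr < addr + size;
}

/* True if every byte of the load lies inside the store. */
static bool contains(const StoreEntry *s, uint64_t addr, uint8_t size) {
    return addr >= s->addr && addr + size <= s->addr + s->size;
}

/* stores[0] is the oldest entry and stores[n-1] the youngest;
 * all entries are assumed to be older than the load. */
LoadOutcome classify_load(const StoreEntry *stores, size_t n,
                          uint64_t load_addr, uint8_t load_size) {
    /* Walk from youngest to oldest: the first overlapping store decides. */
    for (size_t i = n; i-- > 0; ) {
        const StoreEntry *s = &stores[i];
        if (!overlaps(s, load_addr, load_size))
            continue;
        if (!contains(s, load_addr, load_size))
            return LOAD_BLOCKED;
        return s->data_ready ? LOAD_FORWARDED : LOAD_WAIT_DATA;
    }
    return LOAD_FROM_CACHE;
}
```

In this model a 2-byte load at offset 1 of a 4-byte store classifies as LOAD_FORWARDED, while an 8-byte load over a 1-byte store classifies as LOAD_BLOCKED, matching the two cases above. The model deliberately ignores the extra alignment restrictions some implementations add.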

Concrete example: Type-punning via memory is a classic offender:

*(uint64_t*)buf = value;                                  // 8-byte store
uint32_t lo = *(uint32_t*)buf;                            // 4-byte load, contained: forwards (~5 cycles)
*(uint32_t*)buf = lo_val; *(uint32_t*)(buf+4) = hi_val;   // two 4-byte stores
uint64_t v = *(uint64_t*)buf;                             // 8-byte load spanning both stores: stall, ~15-20 cycles
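One way to sidestep the stalling pattern is to never go through memory at all. The sketch below assumes a little-endian target (as on x86), and the function names combine_halves and combine_via_memory are hypothetical. The second function reproduces the memory round trip for comparison, using memcpy instead of pointer casts so that it does not depend on strict-aliasing behavior.

```c
#include <stdint.h>
#include <string.h>

/* Build the 64-bit value in registers: there is no store,
 * so there is nothing for a later load to forward from. */
uint64_t combine_halves(uint32_t lo, uint32_t hi) {
    return ((uint64_t)hi << 32) | lo;
}

/* The memory round trip: two 4-byte stores followed by one
 * 8-byte load that spans both. An optimizing compiler may fold
 * this into registers, so the stall is not guaranteed to appear
 * in the generated code. */
uint64_t combine_via_memory(uint32_t lo, uint32_t hi) {
    unsigned char buf[8];
    uint64_t v;
    memcpy(buf, &lo, sizeof lo);      /* 4-byte store at offset 0 */
    memcpy(buf + 4, &hi, sizeof hi);  /* 4-byte store at offset 4 */
    memcpy(&v, buf, sizeof v);        /* 8-byte load over both stores */
    return v;
}
```

On a little-endian machine both functions return the same value; only the first is guaranteed to avoid the store buffer entirely.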

This is one reason memcpy implementations pay close attention to size and alignment: a small unaligned copy that is split into several narrow stores and then reloaded with one wide load will tank performance.

The rule of thumb: Loads should be fully contained within a single prior store and ideally aligned to the store's natural boundary. Stack spills (compiler-generated stores followed by reloads of the same value) are the highest-volume forwarding case. Some modern CPUs even have "zero-cycle" forwarding paths, sometimes called memory renaming, for the common case where the store data is still in a physical register.

Intel's perf counter ld_blocks.store_forward tells you when this hurts. If you see millions per second, look for: type punning across mismatched widths, vectorized stores followed by scalar loads of pieces, or union-style code reading bytes after writing words.
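To see the counter move, a small loop that repeats the mismatched-width pattern can help. The following is a minimal sketch, not a calibrated benchmark: the names Punned and blocked_forwarding_loop are hypothetical, and the volatile members are there only to force the compiler to emit real stores and loads. Called from a small driver with a large iteration count and run under something like perf stat -e ld_blocks.store_forward, it would be expected to raise the count on Intel CPUs that expose this event, though the exact numbers depend on the microarchitecture.

```c
#include <stdint.h>

/* Union punning is permitted in C; volatile forces each access
 * to actually touch memory instead of being folded away. */
typedef union {
    volatile uint32_t half[2];
    volatile uint64_t whole;
} Punned;

/* Repeats the blocked pattern: two 4-byte stores, then one
 * 8-byte load that spans both of them. */
uint64_t blocked_forwarding_loop(uint32_t iterations) {
    Punned p;
    uint64_t sum = 0;
    for (uint32_t i = 0; i < iterations; i++) {
        p.half[0] = i;   /* 4-byte store to the first half */
        p.half[1] = i;   /* 4-byte store to the second half */
        sum += p.whole;  /* 8-byte load covering both stores */
    }
    return sum;
}
```

Both halves hold the same value, so the result does not depend on byte order and the returned sum can be checked against a version that combines the halves in registers.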

Key Takeaway: Store-to-load forwarding lets a load read from a store that has not yet reached the cache, but only if the load is fully contained within a single store. Mismatched widths or partial overlaps cause stalls that cost more than an ordinary cache hit would.
