2026-05-15
When a store executes, its data sits in the store buffer, often for tens of cycles, before it commits to the L1 cache. If a younger load reads the same address during that window, making it wait for the commit would stall the load and everything that depends on it. Store-to-load forwarding (STLF) is the hardware shortcut: the load grabs the data directly from the store buffer entry, bypassing the cache entirely.
The mechanism lives inside the load/store unit. When a load issues, it performs a CAM (content-addressable memory) search across all in-flight store buffer entries. If it finds an older store to the same address with data ready (the youngest such store, if there are several), it forwards. If the store's data isn't ready yet, the load stalls. If the addresses only partially overlap, things get painful.
The alignment gotcha: Forwarding requires the load to be fully contained within a single store; some x86 implementations, mostly older ones, additionally require the load to be aligned relative to the store. A 4-byte store at offset 0 followed by a 2-byte load at offset 1? Forwards fine on modern Intel. But a 1-byte store followed by an 8-byte load that includes that byte? The load must wait for the store to drain to L1 and then read from the cache: a store-forwarding stall costing roughly 10-20 cycles.
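The lookup and the containment rule described above can be sketched as a small software model. This is only an illustration: real hardware compares all entries in parallel and tracks far more state, and the names here (`store_entry`, `load_outcome`, `lookup_load`) are made up for the sketch. It also ignores the extra alignment restrictions some implementations have.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one store buffer entry. */
typedef struct {
    uint64_t addr;       /* start address of the store */
    uint8_t  size;       /* store width in bytes */
    bool     data_ready; /* has the store's data been produced yet? */
} store_entry;

typedef enum {
    LOAD_FROM_CACHE, /* no older store overlaps the load */
    LOAD_FORWARDED,  /* load fully contained in a store whose data is ready */
    LOAD_WAIT_DATA,  /* contained, but the store's data is not ready yet */
    LOAD_BLOCKED     /* partial overlap: wait for the store to drain to L1 */
} load_outcome;

/* entries[0..count) are the stores older than the load, oldest first. */
load_outcome lookup_load(const store_entry *entries, size_t count,
                         uint64_t load_addr, uint8_t load_size)
{
    uint64_t load_end = load_addr + load_size;

    /* Scan youngest to oldest: the most recent overlapping store decides. */
    for (size_t i = count; i-- > 0; ) {
        const store_entry *s = &entries[i];
        uint64_t store_end = s->addr + s->size;

        bool overlaps = load_addr < store_end && s->addr < load_end;
        if (!overlaps)
            continue;

        bool contained = s->addr <= load_addr && load_end <= store_end;
        if (!contained)
            return LOAD_BLOCKED;

        return s->data_ready ? LOAD_FORWARDED : LOAD_WAIT_DATA;
    }
    return LOAD_FROM_CACHE;
}
```

With this model, a 4-byte store at offset 0 followed by a 2-byte load at offset 1 comes back as `LOAD_FORWARDED`, while an 8-byte load over a 1-byte store comes back as `LOAD_BLOCKED`, matching the two cases above.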
Concrete example: Type-punning via memory is a classic offender:
    *(uint64_t*)buf = value;           // 8-byte store
    uint32_t lo = *(uint32_t*)buf;     // 4-byte load, contained → forwards (~5 cycles)

    *(uint32_t*)buf = lo_val;          // two 4-byte stores
    *(uint32_t*)(buf+4) = hi_val;
    uint64_t v = *(uint64_t*)buf;      // 8-byte load spanning both stores → stall, ~15-20 cycles

This is one reason memcpy implementations check alignment first: small unaligned copies that are split into multiple stores and then reloaded as a wide load will tank performance.
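One way to sidestep the second pattern (two narrow stores, then a wide load) is to build the wide value in registers and never round-trip through memory. Here is a minimal sketch, assuming a little-endian target so that both functions return the same value; the function names are invented for illustration. The memory version uses `memcpy` for the punning so it stays well-defined C, and an optimizing compiler may well turn it into the register version on its own. The stall tends to show up when the stores and the load are far enough apart (different functions, opaque pointers) that the compiler cannot see through them.

```c
#include <stdint.h>
#include <string.h>

/* Slow pattern if it really goes through memory:
 * two 4-byte stores followed by one 8-byte load spanning both. */
uint64_t combine_via_memory(uint32_t lo, uint32_t hi)
{
    unsigned char buf[8];
    uint64_t v;

    memcpy(buf, &lo, sizeof lo);      /* 4-byte store at offset 0 */
    memcpy(buf + 4, &hi, sizeof hi);  /* 4-byte store at offset 4 */
    memcpy(&v, buf, sizeof v);        /* 8-byte load over both stores */
    return v;
}

/* Register-only alternative: no store buffer involved at all. */
uint64_t combine_in_registers(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}
```

The same idea works in the other direction: pulling the high half out of a 64-bit value with `(uint32_t)(value >> 32)` avoids the store and reload altogether.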
The rule of thumb: Loads should be fully contained within a single prior store and ideally aligned to the store's natural boundary. Stack spills (compiler-generated stores of a register followed by reloads of the same value) are the highest-volume forwarding case; some recent CPUs even have "zero-cycle" forwarding paths, often called memory renaming, for the common case where the store data is still held in a physical register.
Intel's perf counter ld_blocks.store_forward tells you when this hurts. If you see millions per second, look for: type punning across mismatched widths, vectorized stores followed by scalar loads of pieces, or union-style code reading bytes after writing words.
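To see the counter react, a tiny reproducer helps. The sketch below is an assumption-laden illustration, not a calibrated benchmark: the names `pun_t`, `blocked_loop`, and `forwarded_loop` are made up, the checksums assume a little-endian machine, and the `volatile` union is there to make the compiler actually emit the narrow and wide memory accesses (which it does in practice, though the exact instructions are up to the compiler).

```c
#include <stdint.h>

/* Union-based punning is permitted in C; volatile keeps the accesses in memory. */
typedef union {
    uint64_t wide;
    uint32_t half[2];
} pun_t;

/* Two 4-byte stores, then an 8-byte load spanning both: expected to block. */
uint64_t blocked_loop(uint64_t iters)
{
    volatile pun_t p;
    uint64_t sum = 0;

    for (uint64_t i = 0; i < iters; i++) {
        p.half[0] = (uint32_t)i;
        p.half[1] = (uint32_t)(i >> 32);
        sum += p.wide;
    }
    return sum;
}

/* One 8-byte store, then a contained 4-byte load: expected to forward. */
uint64_t forwarded_loop(uint64_t iters)
{
    volatile pun_t p;
    uint64_t sum = 0;

    for (uint64_t i = 0; i < iters; i++) {
        p.wide = i;
        sum += p.half[0];
    }
    return sum;
}
```

Calling each function from a small `main` with a large iteration count and running the binary under `perf stat -e ld_blocks.store_forward` on an Intel machine should show the count climbing with `blocked_loop` and staying near zero with `forwarded_loop`, if the hardware behaves as described above.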
