The Load-Hit-Store Penalty: When Your CPU Trips Over Its Own Feet

2026-05-14

You'd think the store buffer makes writes free — fire and forget, let the CPU drain it to cache in the background. But there's a nasty case where the store buffer becomes a speed trap: when a load reads from an address that a recent store just wrote to, before that store has committed. This is the load-hit-store (LHS) penalty, and it's one of the most underappreciated performance cliffs in modern CPUs.

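Here's the mechanism. When a load issues, the CPU checks the store queue for matching addresses. If it finds one, it can sometimes forward the stored value directly to the load without waiting for the store to reach cache. This is store-to-load forwarding, and when it works, it's cheap: a handful of cycles, roughly in line with an L1 hit. But forwarding only works under strict conditions: every byte the load reads must come from a single store still in the queue, so the load can be no wider than that store and must sit entirely inside its footprint.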

Violate these, and the CPU can't forward. Instead, the load must wait for the store to drain to L1 cache and then re-execute. On Intel Skylake-era cores, this costs 10-20 cycles. On older PowerPC chips (notoriously the PS3's Cell), it could hit 40+ cycles. Game developers called these "LHS stalls" and hunted them down with profilers.

Real-world example: The classic offender is type-punning via memory:
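
A minimal sketch of that pattern in C, assuming a 32-bit float. The helper name float_bits_via_memory is hypothetical, and volatile is there only so the compiler keeps a real store and a real load instead of optimizing the round trip away:

```c
#include <assert.h>
#include <stdint.h>

static_assert(sizeof(float) == sizeof(uint32_t),
              "sketch assumes a 32-bit float");

/* Hypothetical helper: reinterpret a float's bits by bouncing the value
   through memory. Reading a union member other than the one last written
   is permitted in C (it is not in C++). */
uint32_t float_bits_via_memory(float f)
{
    /* volatile forces an actual store followed by an actual load. */
    volatile union { float f; uint32_t u; } slot;

    slot.f = f;     /* 4-byte store */
    return slot.u;  /* 4-byte load from the same address */
}
```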

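Same address, same size, same alignment: forwarding works, and you pay only a few cycles. Now change it: store a 32-bit int, then load a 64-bit double that covers it. The load is wider than the store, forwarding fails, and you eat ~15 cycles. (The reverse case, storing 64 bits and reading back the low 32, sits fully inside the store and generally forwards fine on modern x86 cores.) This is also why memcpy-based type punning sometimes outperforms union tricks: compilers can often turn a small fixed-size memcpy into a register move, so the value never makes the round trip through memory at all.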
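
For comparison, here is a sketch of the memcpy-based pun, assuming a 64-bit double. The name double_bits is hypothetical. Whether the copy actually becomes a register move depends on the compiler and optimization level, so treat that as something to confirm in the generated assembly rather than a guarantee:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(double) == sizeof(uint64_t),
              "sketch assumes a 64-bit double");

/* Hypothetical helper: copy the object representation of a double into
   an integer of the same size. The source and destination sizes match,
   so no narrower or wider access is involved. */
uint64_t double_bits(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return bits;
}
```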

Rule of thumb: If you store N bytes and then load M bytes from the same region, you only get fast forwarding when M ≤ N and the load is fully inside the store's footprint. Reading wider than you wrote, or reading from a misaligned subset, is the trap.
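
The rule of thumb can be written down as a small predicate. This is only a model of the rule as stated here, not of any particular microarchitecture, and real cores may add further restrictions. The function name and the offset/size parameters are illustrative:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true when the load's byte range
   [load_off, load_off + load_size) lies entirely inside the store's
   range [store_off, store_off + store_size). Offsets are in bytes
   from a common base address. */
bool load_fits_in_store(size_t store_off, size_t store_size,
                        size_t load_off, size_t load_size)
{
    assert(store_size > 0 && load_size > 0);
    return load_off >= store_off &&
           load_off + load_size <= store_off + store_size;
}
```

With an 8-byte store at offset 0, a 4-byte load at offset 4 passes the check, while a 4-byte load at offset 6 runs past the end of the store and fails it, as does any 8-byte load over a 4-byte store.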

Another pattern that bites: writing structure fields individually then reading the whole struct as a single wide load. The wide load can't be reconstructed from multiple narrow store-queue entries (most CPUs won't merge across entries for forwarding), so you stall waiting for all of them to drain.
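
A sketch of that pattern, with a register-only alternative next to it. The type Pair64 and both function names are hypothetical, and volatile again exists only to keep the stores and the load in the generated code. Whether a given core actually stalls here depends on its microarchitecture:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical type: two 32-bit fields overlaid with one 64-bit view. */
typedef union {
    struct { uint32_t lo, hi; } parts;
    uint64_t whole;
} Pair64;

static_assert(sizeof(Pair64) == sizeof(uint64_t),
              "sketch assumes no padding between the fields");

/* Two narrow stores followed by one wide load spanning both of them.
   The value of the result depends on the target's byte order. */
uint64_t pack_via_memory(uint32_t lo, uint32_t hi)
{
    volatile Pair64 p;
    p.parts.lo = lo;  /* 4-byte store */
    p.parts.hi = hi;  /* 4-byte store */
    return p.whole;   /* 8-byte load covering both stores */
}

/* Builds the 64-bit value with shifts and ORs, with no round trip
   through memory. Matches pack_via_memory on little-endian targets. */
uint64_t pack_in_registers(uint32_t lo, uint32_t hi)
{
    return (uint64_t)lo | ((uint64_t)hi << 32);
}
```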

You can spot LHS stalls in perf via ld_blocks.store_forward on Intel — if that counter is high relative to total loads, your code is tripping on its own stores.

Key Takeaway: Store-to-load forwarding is fast only when the load fits entirely inside a single prior store; reading wider or differently-aligned data than you just wrote forces the CPU to drain the store buffer first, costing 10-20 cycles.
