2026-04-21
When your CPU executes a store instruction, the data doesn't go straight to the cache. It enters a store buffer — a small queue (typically 42-56 entries on modern x86 cores, ~12-20 on ARM) that sits between the execution unit and the L1 data cache. This structure is the hardware reason memory ordering matters to programmers.
Why store buffers exist: Stores are slow. They must check cache tags, potentially evict lines, and participate in coherence protocols. A store buffer lets the core continue executing without waiting: the store retires from the pipeline while its data sits in the buffer, and the buffer drains to the cache in the background (only after retirement, once the store is no longer speculative). This is a massive performance win — without it, every store would stall the pipeline for 4-5 cycles minimum.
The reordering consequence: Because stores sit in the buffer while subsequent loads execute immediately against the cache, a load can overtake an older store to a different address. This is called store-load reordering, and it's the one reordering x86 (TSO — Total Store Order) permits. ARM and RISC-V with their weaker models allow all four combinations: store-load, store-store, load-load, and load-store reordering.
Store forwarding: If a load hits the same address as a pending store in the buffer, the hardware forwards the data directly — no cache access needed. But there's a catch: partial overlaps (e.g., a 4-byte store followed by an 8-byte load at the same address) often cause a store forwarding stall of ~10-15 cycles on Intel cores. This is a real performance trap in code that aliases memory through different-width accesses.
Concrete example: The classic double-checked locking pattern breaks without a memory barrier precisely because of the store buffer. Thread A writes the object fields then writes the pointer. Those stores enter the buffer. On x86, stores drain in order (no store-store reordering), so it works at the hardware level. On ARM, stores can drain out of order — another core might see the pointer before the fields. A dmb barrier (ARM) between the field stores and the pointer store enforces the required ordering; x86's mfence goes further and drains the store buffer entirely, though TSO already makes it unnecessary for this particular store-store case.
Rule of thumb for store buffer sizing: You need enough entries to cover the latency of an L1 cache write multiplied by your store throughput — Little's law applied to the buffer. At 1 store/cycle and ~5 cycle L1 latency, you need at least 5 entries just to avoid stalls. Real CPUs over-provision heavily (Intel Golden Cove has 56 entries) because cache misses that go to L2/L3 or DRAM can occupy an entry for hundreds of cycles.
Observing this in practice: On Linux, perf stat exposes the Intel counter ld_blocks.store_forward, which counts store forwarding failures. If this counter is high in your hot loop, you likely have aliased accesses at different widths — restructure your data to avoid partial overlaps.
