Store Buffers and Memory Ordering Hardware

2026-04-21

When your CPU executes a store instruction, the data doesn't go straight to the cache. It enters a store buffer — a small queue (typically 42-56 entries on modern x86 cores, ~12-20 on ARM) that sits between the execution unit and the L1 data cache. This structure is the hardware reason memory ordering matters to programmers.

Why store buffers exist: Stores are slow. They must check cache tags, potentially evict lines, and participate in coherence protocols. A store buffer lets the core continue executing without waiting. The store "retires" from the pipeline into the buffer, and the buffer drains to cache in the background. This is a massive performance win — without it, every store would stall the pipeline for 4-5 cycles minimum.

The reordering consequence: Because stores sit in the buffer while subsequent loads execute immediately against the cache, a load can overtake an older store to a different address. This is called store-load reordering, and it's the one reordering x86 (TSO — Total Store Order) permits. ARM and RISC-V with their weaker models allow all four combinations: store-load, store-store, load-load, and load-store reordering.

Store forwarding: If a load hits the same address as a pending store in the buffer, the hardware forwards the data directly — no cache access needed. But there's a catch: partial overlaps (e.g., a 4-byte store followed by an 8-byte load at the same address) often cause a store forwarding stall of ~10-15 cycles on Intel cores. This is a real performance trap in code that aliases memory through different-width accesses.

Concrete example: The classic double-checked locking pattern breaks without a memory barrier precisely because of the store buffer. Thread A writes the object fields, then writes the pointer; those stores enter the buffer. On x86, stores drain in order (no store-store reordering), so the hardware preserves the publish order — though the compiler can still reorder plain stores, which is why you need atomics even there. On ARM, stores can drain out of order: another core might see the pointer before the fields. A dmb (ARM) or mfence (x86) instruction forces the store buffer to drain before proceeding.

Rule of thumb for store buffer sizing: You need enough entries to cover the latency of an L1 cache write multiplied by your store throughput. At 1 store/cycle and ~5 cycle L1 latency, you need at least 5 entries just to avoid stalls. Real CPUs over-provision heavily (Intel Golden Cove has 56 entries) because cache misses that go to L2/L3 or DRAM can occupy an entry for hundreds of cycles.

Observing this in practice: Linux's perf stat exposes the Intel counter ld_blocks.store_forward, which counts store-forwarding failures. If this counter is high in your hot loop, you likely have aliased accesses at different widths — restructure your data to avoid partial overlaps.

See it in action: Fedor Pikus's CppCon 2017 talk “C++ atomics, from basic to advanced. What do they really do?” shows this theory applied.
Key Takeaway: Store buffers decouple execution from the memory hierarchy for performance, but their asynchronous draining is the hardware root cause of memory reordering that makes barriers and atomics necessary in concurrent code.