Daily Hardware Architecture

Load/Store Queues: How CPUs Track Memory Operations in Flight

2026-05-05

Out-of-order CPUs juggle dozens of memory operations simultaneously, but memory has a problem registers don't: aliasing. A load and an older store whose address has not been computed yet might touch the same byte, and the CPU cannot know until both addresses resolve. The Load Queue (LQ) and Store Queue (SQ) are the structures that keep this chaos legal.

When a load or store enters the pipeline, it gets a slot in the LQ or SQ in program order. The address and data may arrive later, in any order. The queues track three things:

Disambiguation: When a load executes, it searches the SQ for older stores to the same address. If it finds one, it takes the data directly from the youngest such store (store-to-load forwarding) instead of going to cache.
Ordering violations: If a store's address resolves after a younger load already read that address from cache, the load got stale data. The CPU squashes the load and everything after it.
Commit ordering: Stores stay in the SQ until they retire, then drain to the L1 cache in program order. This in-order drain is a large part of how x86 keeps its TSO (total store order) model despite out-of-order execution.
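
The three jobs above can be sketched as a toy model. This is a minimal illustration, not how any real core is built: the class and method names (ToyLSQ, resolve_store, and so on) are hypothetical, a store's address and data arrive together, accesses are whole words, and every resolved store at the head of the queue is treated as already retired.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Store:
    seq: int                      # program-order sequence number
    addr: Optional[int] = None    # None until the store resolves
    data: Optional[int] = None


@dataclass
class Load:
    seq: int
    addr: int
    value: Optional[int] = None
    src: int = -1                 # seq of the forwarding store, -1 = cache
    executed: bool = False


class ToyLSQ:
    """Toy load/store queue (hypothetical model, whole-word accesses only)."""

    def __init__(self, memory):
        self.memory = memory      # dict addr -> value, stands in for L1
        self.sq = []              # stores, oldest first
        self.lq = []              # loads, oldest first

    def alloc_store(self, seq):
        st = Store(seq)
        self.sq.append(st)
        return st

    def alloc_load(self, seq, addr):
        ld = Load(seq, addr)
        self.lq.append(ld)
        return ld

    def execute_load(self, ld):
        # Disambiguation: search youngest-to-oldest for an older store
        # whose resolved address matches, and forward from it.
        for st in reversed(self.sq):
            if st.seq < ld.seq and st.addr == ld.addr:
                ld.value, ld.src = st.data, st.seq
                break
        else:
            # No match among resolved older stores: read the cache.
            ld.value, ld.src = self.memory.get(ld.addr, 0), -1
        ld.executed = True
        return ld.value

    def resolve_store(self, st, addr, data):
        st.addr, st.data = addr, data
        # Ordering violation: a younger load already executed to this
        # address and took its value from something older than this store.
        victims = [ld.seq for ld in self.lq
                   if ld.executed and ld.seq > st.seq
                   and ld.addr == addr and ld.src < st.seq]
        if not victims:
            return None
        first = min(victims)
        for ld in self.lq:        # squash that load and all younger loads
            if ld.seq >= first:
                ld.executed, ld.value, ld.src = False, None, -1
        return first

    def drain(self):
        # Commit ordering: write resolved stores to memory oldest-first,
        # stopping at the first store that has not resolved yet.
        while self.sq and self.sq[0].addr is not None:
            st = self.sq.pop(0)
            self.memory[st.addr] = st.data
```

In this sketch, a load that executes before an older store has resolved reads the cache; when that store later resolves to the same address, resolve_store returns the sequence number of the first squashed load, and re-executing the load picks up the forwarded value.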

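Concrete example: Intel Skylake has a 72-entry Load Buffer and a 56-entry Store Buffer. AMD Zen 4 is reported at roughly 136 in-flight loads and 64 stores (vendor-documented queue sizes and capacities measured by microbenchmarks can differ). When a memory-bound benchmark plateaus at a certain unrolling factor, LQ/SQ capacity is one likely culprit: the CPU cannot track more in-flight memory ops. The reorder buffer and other structures can also be the limit, so confirm with performance counters.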
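
As a rough back-of-the-envelope aid, the capacity argument can be written as arithmetic. The helper below is a hypothetical sketch that considers only the load and store queues (it ignores the reorder buffer, fill buffers, and everything else that can cap in-flight work); the 72 and 56 are the Skylake figures quoted above.

```python
def iterations_in_flight(lq_entries, sq_entries, loads_per_iter, stores_per_iter):
    """Upper bound on loop iterations whose memory ops fit in the queues.

    Hypothetical helper: only LQ/SQ capacity is modeled. Returns None
    when the loop body has no memory operations at all.
    """
    limits = []
    if loads_per_iter:
        limits.append(lq_entries // loads_per_iter)
    if stores_per_iter:
        limits.append(sq_entries // stores_per_iter)
    return min(limits) if limits else None


# Loop body with 2 loads and 1 store, on the Skylake numbers from the text.
print(iterations_in_flight(72, 56, loads_per_iter=2, stores_per_iter=1))   # 36
# Same loop unrolled 8x: 16 loads and 8 stores per (unrolled) iteration.
print(iterations_in_flight(72, 56, loads_per_iter=16, stores_per_iter=8))  # 4
```

Once the unrolled body's memory ops approach the queue size, further unrolling cannot expose more memory-level parallelism under this simplified model.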

Store-to-load forwarding pitfalls: Forwarding only works for clean overlaps. If you store 4 bytes at address X and load 8 bytes at X-2, the load partially overlaps the store: it needs bytes the store does not hold. Most CPUs can't forward in this case, so the load stalls for ~10–20 cycles until the store has drained to L1, then reads from cache. This is a real performance bug: narrow writes followed by a wider read of the same bytes (common in serialization code) can tank throughput.

Rule of thumb: Store-to-load forwarding latency is typically 4–5 cycles when the load is fully contained in the store, with an exact address and size match being the safest case. Misaligned or partial overlap: 10–20+ cycles. Same cache line but different bytes: free (no dependency).
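
The rule of thumb can be expressed as a small byte-range check. This is a simplified sketch with a hypothetical function name; real CPUs add further conditions (alignment, cache-line crossings, specific size combinations) that vary by microarchitecture and are not modeled here.

```python
def classify_forwarding(store_addr, store_size, load_addr, load_size):
    """Classify a store followed by a younger load, by byte ranges only.

    Returns "independent" (no shared bytes), "forward" (load fully
    contained in the store), or "stall" (partial overlap).
    """
    s_lo, s_hi = store_addr, store_addr + store_size   # [s_lo, s_hi)
    l_lo, l_hi = load_addr, load_addr + load_size      # [l_lo, l_hi)
    if l_hi <= s_lo or s_hi <= l_lo:
        return "independent"   # different bytes, even if same cache line
    if s_lo <= l_lo and l_hi <= s_hi:
        return "forward"       # every byte the load needs is in the store
    return "stall"             # load needs bytes the store does not supply


X = 0x1000
# The example from the text: store 4 bytes at X, load 8 bytes at X-2.
assert classify_forwarding(X, 4, X - 2, 8) == "stall"
# Exact match, and a narrower load inside a wider store.
assert classify_forwarding(X, 8, X, 8) == "forward"
assert classify_forwarding(X, 8, X + 4, 4) == "forward"
# Neighbouring bytes in the same cache line: no dependency.
assert classify_forwarding(X, 4, X + 4, 4) == "independent"
```

A check like this can be handy when reviewing serialization code: any write/read pair that lands in the "stall" case is a candidate for widening the write or narrowing the read.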

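The LQ/SQ is also where memory fences (mfence on x86, dmb on ARM) do their work: a full fence holds back younger loads and stores until the older stores have drained from the SQ and the older loads have completed. ARM's weaker memory model allows more reordering, so the queues have fewer ordering rules to enforce and can, for example, let stores become visible out of program order; x86's stronger TSO guarantees demand more bookkeeping. Queue size is a separate design choice, and weakly ordered cores do not necessarily have smaller queues.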
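
One way to picture a full fence is as an issue rule over the program-ordered stream of memory operations. The sketch below is a deliberately coarse, hypothetical model: "completed" means a load has its data or a store has drained to cache, and only a full barrier is modeled (not the weaker one-directional variants).

```python
def may_issue(ops, i, completed):
    """Return True if ops[i] is allowed to proceed under a full-fence rule.

    ops: list of "load" | "store" | "fence" in program order.
    completed: set of indices that are done (loads have data, stores have
    drained to cache, fences have finished).
    """
    # Nothing may pass an older fence that has not finished.
    if any(ops[j] == "fence" and j not in completed for j in range(i)):
        return False
    # The fence itself finishes only once every older op is done.
    if ops[i] == "fence":
        return all(j in completed for j in range(i))
    return True


ops = ["store", "fence", "load"]
assert may_issue(ops, 0, set())          # the store is free to go
assert not may_issue(ops, 1, set())      # fence waits for the store to drain
assert not may_issue(ops, 2, {0})        # load cannot pass the pending fence
assert may_issue(ops, 1, {0})            # store drained: fence can complete
assert may_issue(ops, 2, {0, 1})         # fence done: load may issue
```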

Key Takeaway: The Load/Store Queue is what makes out-of-order memory access safe — and its forwarding rules are why misaligned writes-then-reads can secretly cost you 20 cycles.
