Daily Hardware Architecture: Memory Dependence Predictors: Teaching CPUs When to Stop Being Paranoid

Memory Dependence Predictors: Teaching CPUs When to Stop Being Paranoid

2026-05-12

Out-of-order CPUs want to execute loads as early as possible, but there's a problem: a load might depend on a prior store whose address isn't computed yet. Playing it safe — waiting for every older store address before issuing a load — kills performance. Playing it loose — executing loads speculatively past unresolved stores — triggers expensive pipeline flushes when you get it wrong. Memory dependence predictors learn which loads tend to collide with which stores, so the CPU knows when to be brave and when to be cautious.

The naive approach is store-set predictor, used in Intel cores since the Pentium 4 era and refined heavily in Alder Lake and Zen 4. Each load PC is associated with a "store set" — the set of store PCs that have previously aliased with it. When the predictor says "this load has matched store X before," the scheduler holds the load until store X resolves. Loads with empty store sets fly free.

The structure is typically two tables: an SSIT (Store Set ID Table) indexed by instruction PC, mapping to a small integer, and an LFST (Last Fetched Store Table) indexed by that integer, holding the inflight store that the matching load must wait for. Set sizes are tiny — 4096 entries × 8-bit IDs is typical, so the whole thing fits in under 8 KB.

Concrete example: Linked list traversal. node = node->next in a loop does a load that depends on the previous iteration's load result, but never on any store in the loop body. The predictor quickly learns "load has empty store set" and issues loads speculatively as soon as the address register is ready, exposing parallelism across iterations. Meanwhile, an idiom like *p = x; y = *p; teaches the predictor that the load always aliases — it learns to serialize them after one or two mispredictions.

Rule of thumb: a memory-order misprediction costs roughly the same as a branch misprediction — 15–20 cycles to flush and replay. If a predictor reduces mispredictions from 5% of loads to 0.1%, on a workload doing one load per 4 instructions at IPC 4, you save about (0.049 × 18 / 4) ≈ 0.22 cycles per instruction, or ~20% throughput on memory-heavy code.

The hardest cases are polymorphic loads — the same load PC sometimes aliases, sometimes doesn't, depending on dynamic context. Modern predictors hash in a snippet of branch history (path-based indexing) to disambiguate. AMD's Zen 4 added per-load confidence counters that suppress prediction when behavior is too inconsistent — better to stall than to flush.

See it in action: Check out Magnus explains why chess is addictive 😮 by GJ_Chess to see this theory applied.

Key Takeaway: Memory dependence predictors are bet-management systems: they let loads run ahead of unresolved stores when history says it's safe, and quietly serialize them when it isn't.

All newsletters