2026-05-07
Register renaming solved false dependencies between registers, but it left a glaring hole: when the compiler runs out of registers and spills to the stack, you get a store such as mov [rsp-8], rax followed almost immediately by a reload such as mov rbx, [rsp-8]. Even when store-to-load forwarding supplies the data, that round trip through the load/store unit costs roughly 4-5 cycles, about the same as an L1 hit, even though the data never logically left the CPU. Memory renaming (loosely, store-to-load forwarding's smarter cousin; not to be confused with the stack engine, which only tracks push/pop adjustments to rsp) catches these patterns and takes that latency off the critical path.
The mechanism builds on what the load/store queue already does. Standard store-to-load forwarding matches an in-flight store's address against a younger load and forwards the data without waiting for the cache, but it still requires both addresses to be computed and the load to execute. Memory renaming goes further: it predicts at rename time that a load will hit a recent store, and aliases the load's destination physical register directly to the store's source physical register. To its consumers the load behaves like a register-to-register move. The load generally still executes in the background to verify the prediction, but dependent instructions no longer wait for it.
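The aliasing step can be sketched as a toy rename stage. This is an illustrative model, not a description of any shipping core: the Renamer class, its method names, and the use of a symbolic slot string such as "rsp-8" in place of a predicted address match are all assumptions made for the example, and the background verification of the load is left out.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Renamer:
    """Toy rename stage (hypothetical): architectural -> physical register map."""
    rat: dict = field(default_factory=dict)        # arch reg -> phys reg
    store_src: dict = field(default_factory=dict)  # stack slot -> phys reg that was stored
    _ids: count = field(default_factory=lambda: count(0))

    def write(self, reg):
        """An ordinary instruction writing `reg` gets a fresh physical register."""
        self.rat[reg] = f"p{next(self._ids)}"
        return self.rat[reg]

    def store(self, slot, src_reg):
        """Remember which physical register supplied the data for this slot."""
        self.store_src[slot] = self.rat[src_reg]

    def load(self, slot, dst_reg):
        """Alias the destination to the store's source when the slot is tracked."""
        if slot in self.store_src:
            self.rat[dst_reg] = self.store_src[slot]  # no new physical register
            return "renamed"
        self.write(dst_reg)                           # normal load path
        return "executed"


r = Renamer()
r.write("rax")                     # rax -> p0
r.store("rsp-8", "rax")            # spill
outcome = r.load("rsp-8", "rbx")   # reload: rbx now shares p0 with rax
```

After the reload, rbx and rax map to the same physical register, so a consumer of rbx can issue as soon as the original producer of rax has finished. Real hardware also has to reference-count shared physical registers, which this sketch ignores.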
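AMD shipped a stack-oriented form of this (the "memfile") with the first Zen core in 2017, and later cores are reported to handle more general patterns. Ivy Bridge (2012) is sometimes credited, but the rename-stage feature it introduced was register move elimination; Intel's memory renaming has been reported only in more recent cores. The CPU maintains a small predictor table keyed on the store/load instruction pointers and offsets. On a hit, the loaded value is available to dependent instructions with effectively zero added latency.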
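A minimal sketch of such a predictor table follows, assuming it is indexed by the load's instruction pointer and remembers the store instruction pointer that last fed it, with a saturating confidence counter. The class name, the threshold values, and the training interface are hypothetical; vendors do not publish the real table organization.

```python
class MemRenamePredictor:
    """Hypothetical predictor: load IP -> (store IP, confidence counter)."""

    def __init__(self, threshold=2, max_conf=3):
        self.threshold = threshold  # confidence needed before predicting
        self.max_conf = max_conf    # saturation limit of the counter
        self.table = {}             # load_ip -> [store_ip, confidence]

    def predict(self, load_ip):
        """Return the store IP to alias to, or None if confidence is too low."""
        entry = self.table.get(load_ip)
        if entry is not None and entry[1] >= self.threshold:
            return entry[0]
        return None

    def train(self, load_ip, actual_store_ip):
        """Update once the load's real producer is known.

        actual_store_ip is None when the load read from cache rather
        than from an in-flight store.
        """
        entry = self.table.get(load_ip)
        if actual_store_ip is None:
            if entry is not None:
                entry[1] = 0                     # pairing broke: lose confidence
        elif entry is not None and entry[0] == actual_store_ip:
            entry[1] = min(entry[1] + 1, self.max_conf)
        else:
            self.table[load_ip] = [actual_store_ip, 0]  # new or changed pairing


p = MemRenamePredictor()
for _ in range(3):
    p.train(load_ip=0x401020, actual_store_ip=0x401000)
prediction = p.predict(0x401020)  # confident after three matching observations
```

With these example parameters the first observation allocates the entry and the next two raise its confidence to the threshold, so the predictor stays silent until a pairing has repeated three times. Any change of producer resets the counter.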
Real-world example: a recursive function with many local variables on x86-64. The ABI requires a function to preserve the callee-saved registers (rbx, rbp, r12-r15), so the ones it uses are typically pushed at function entry and popped at return. Without memory renaming, each popped value becomes usable only after a store-forward or L1 load latency of about 4 cycles. With memory renaming, the popped value is available at rename. For a function pushing 5 registers, that removes up to 20 cycles of load latency per call. The out-of-order core overlaps independent pops, so the gain on the critical path is usually smaller, but it still adds up in recursion-heavy code like tree traversals or parsers.
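Rule of thumb: a successful memory rename saves roughly load-use latency minus rename latency ≈ 4 cycles on modern x86. AMD reports memory renaming hit rates of 40-60% on typical workloads; if that range holds for spill reloads, roughly half of them skip the load-use latency. Performance counters give only an indirect view. On Intel, perf stat -e ld_blocks.store_forward counts loads that were blocked because an earlier store could not forward to them, which is the failure case rather than a rename hit. On AMD, ls_dispatch.ld_dispatch counts dispatched loads and is useful mainly as a denominator. Neither event counts renamed loads directly, so check the vendor's performance-monitoring documentation for your core before concluding that the predictor is working.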
The catch: misprediction is expensive. If the predictor aliases a load to the wrong store (different address, same offset pattern), the pipeline must flush from the load forward — typically 15-20 cycles. So predictors are conservative, only triggering when the IP+offset pair has a strong history.
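The trade-off behind that conservatism can be worked out directly. The sketch below assumes the article's figures (about 4 cycles saved per correct rename, 15-20 cycles lost per flush) and a simple expected-value model in which every prediction is either fully right or fully wrong; the function names are made up for the example.

```python
def net_cycles_per_prediction(accuracy, saving=4.0, flush_penalty=15.0):
    """Expected cycles gained per predicted load (negative means a net loss)."""
    return accuracy * saving - (1.0 - accuracy) * flush_penalty


def break_even_accuracy(saving=4.0, flush_penalty=15.0):
    """Accuracy at which the expected gain is exactly zero."""
    return flush_penalty / (saving + flush_penalty)


low = break_even_accuracy(flush_penalty=15.0)   # 15 / 19
high = break_even_accuracy(flush_penalty=20.0)  # 20 / 24
gain = net_cycles_per_prediction(0.99, flush_penalty=20.0)
```

Under these assumptions the predictor has to be right about 79% of the time with a 15-cycle flush, and about 83% of the time with a 20-cycle flush, just to break even. At 99% accuracy and a 20-cycle penalty, the expected gain is about 3.76 cycles per predicted load, which is why a design would rather skip a prediction than make a doubtful one.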
