2026-05-26
We covered memory renaming earlier — the trick where CPUs spot push rax / pop rax pairs and forward the value through a register instead of bouncing it through L1. It's beautiful when it works. But it fails constantly, and understanding why tells you a lot about the limits of speculation.
The renamer maintains a small stack tracking structure — usually 16-32 entries — that maps recent stack store addresses to physical register tags. When a load comes through and its address matches, the CPU bypasses the cache entirely, treating the load as a register-to-register move. Apple's M-series and Intel since Ice Lake both do this. Reported speedup on call-heavy code: 5-15%.
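To make the lookup concrete, here is a minimal sketch of such a tracking table as a toy software model. Everything in it is an assumption made for illustration: the class name `StackRenameTable`, the oldest-first eviction policy, and the register tags like `p17` are hypothetical, and real hardware does not document these details.

```python
from collections import OrderedDict


class StackRenameTable:
    """Toy model of a stack-store tracking table (not any real CPU's design)."""

    def __init__(self, capacity=24):
        self.capacity = capacity
        self.entries = OrderedDict()  # stack address -> physical register tag

    def store(self, addr, reg_tag):
        # A stack store records which physical register holds the stored value.
        self.entries.pop(addr, None)  # a newer store to the same slot replaces the old one
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # assumed policy: evict the oldest entry
        self.entries[addr] = reg_tag

    def load(self, addr):
        # Hit: the load can become a register move. Miss (None): it goes to the cache.
        return self.entries.get(addr)


table = StackRenameTable(capacity=2)
table.store(0x7FF0, "p17")   # push rax
print(table.load(0x7FF0))    # pop rax: forwarded from p17
table.store(0x7FE8, "p18")
table.store(0x7FE0, "p19")   # third store evicts the entry for 0x7FF0
print(table.load(0x7FF0))    # None: falls back to a real load
```

With a capacity of 2, the third store pushes out the oldest entry, so the final load misses and would have to go through the cache.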
The failure modes:
- When RSP is computed dynamically (e.g., sub rsp, rax with a runtime value), the renamer can't statically match the offset. It gives up and falls back to the load/store queue.
- When a store may alias a tracked slot (e.g., memcpy with a stack destination), the dependence predictor forces a real memory operation.
- Context switches, setjmp/longjmp, and exception unwinding invalidate the tracking table. The structure is per-context and gets flushed.
Concrete example: A recursive Fibonacci on Apple M2, measured with MEM_LOAD_RETIRED.L1_HIT counters. Shallow recursion (depth < 16): ~85% of stack loads renamed away, never touching L1. Depth 32: drops to ~60%. Depth 64: under 30%. The table is roughly 24 entries on M2; once you blow past that, every spill is a real cache access.
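The capacity cliff in that example can be sketched with a toy model. This is a simplification under stated assumptions, not a reproduction of the measurements: it models a linear chain of calls rather than Fibonacci's tree recursion, assumes one spill per frame by default, and assumes the table drops its oldest entry when full. The function name `renamed_fraction` and its parameters are hypothetical.

```python
from collections import deque


def renamed_fraction(depth, capacity=24, spills_per_frame=1):
    """Fraction of stack reloads served by a toy FIFO tracking table."""
    table = deque(maxlen=capacity)  # oldest tracked slot drops out when full
    slot = 0
    # Descend: every frame spills its registers to fresh stack slots.
    for _ in range(depth * spills_per_frame):
        table.append(slot)
        slot += 1
    # Unwind: every frame reloads its slots, most recent first.
    hits = 0
    for s in range(slot - 1, -1, -1):
        if table and table[-1] == s:
            table.pop()  # slot was still tracked, so the reload is renamed
            hits += 1
    return hits / slot


for depth in (8, 16, 32, 64):
    print(depth, f"{renamed_fraction(depth):.0%}")
```

In this model the renamed fraction is min(capacity, spills) / spills, so with 24 entries it stays at 100% through depth 16, then falls to 24/32 at depth 32 and 24/64 at depth 64. The measured figures above are lower at every depth, which is expected since real code has the other failure modes as well.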
Rule of thumb: Memory renaming covers leaf and shallow functions well. Once your call stack exceeds ~20 frames of active spilling, assume the renamer has tapped out and L1 latency (4-5 cycles) is back in the critical path. This is one reason aggressive inlining still matters even on CPUs with "free" stack traffic — fewer frames means more of your spills stay in the rename table.
The hardware can only track what fits in a small CAM. Software still has to be considerate.
