The Memory Renamer's Limits: Why Stack Tracking Hardware Gives Up

2026-05-26

We covered memory renaming earlier — the trick where CPUs spot push rax / pop rax pairs and forward the value through a register instead of bouncing it through L1. It's beautiful when it works. But it fails constantly, and understanding why tells you a lot about the limits of speculation.

The renamer maintains a small stack tracking structure — usually 16-32 entries — that maps recent stack store addresses to physical register tags. When a load comes through and its address matches, the CPU bypasses the cache entirely, treating the load as a register-to-register move. Apple's M-series and Intel since Ice Lake both do this. Reported speedup on call-heavy code: 5-15%.
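
To make the lookup concrete, here is a minimal software sketch of such a tracking table. It is a toy model, not a description of any real microarchitecture: the class name StackRenameTable, the oldest-first eviction policy, and the register tag strings are all assumptions made for illustration. The default capacity of 24 is taken from the M2 estimate later in this post.

```python
from collections import OrderedDict


class StackRenameTable:
    """Toy model: maps recent stack store addresses to physical register tags."""

    def __init__(self, capacity=24):
        self.capacity = capacity
        self.entries = OrderedDict()  # address -> register tag, oldest first

    def store(self, address, reg_tag):
        """Record a stack store; evict the oldest entry when the table is full."""
        if address in self.entries:
            del self.entries[address]  # re-insert so this address becomes newest
        elif len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # assumed policy: drop the oldest
        self.entries[address] = reg_tag

    def load(self, address):
        """Return the forwarded register tag, or None if the load must go to L1."""
        return self.entries.get(address)


table = StackRenameTable(capacity=4)
table.store(0x7FF0, "p17")      # push rax: the value lives in physical register p17
forwarded = table.load(0x7FF0)  # pop rax: address matches, no cache access needed
for i in range(4):              # four more stores overflow the 4-entry table
    table.store(0x7FE8 - 8 * i, f"p{20 + i}")
evicted = table.load(0x7FF0)    # the first entry is gone, so this load misses
```

In this sketch the first load is forwarded from p17, while the second returns None because the entry was pushed out by later stores. That second case is the capacity limit the rest of this post is about.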

The failure modes: deep call stacks that overflow the tracking table, dynamic RSP, size mismatches, and aliasing.

Concrete example: a recursive Fibonacci on Apple M2, measured with L1 load-hit counters (MEM_LOAD_RETIRED.L1_HIT is the Intel name for this kind of event). Shallow recursion (depth < 16): ~85% of stack loads renamed away, never touching L1. Depth 32: drops to ~60%. Depth 64: under 30%. The table is roughly 24 entries on M2; once you blow past that, every spill is a real cache access.
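
The shape of that falloff can be sketched with a capacity-only model. The function below is an idealized illustration, not a reproduction of the measurement: it assumes a single call chain that descends to a given depth and unwinds, one spill per frame, and a table that keeps only the most recent entries. The name renamed_fraction and the one-spill-per-frame simplification are assumptions.

```python
from collections import deque


def renamed_fraction(depth, capacity=24):
    """Fraction of reloads that hit the table for one descent to `depth` and back."""
    table = deque(maxlen=capacity)  # the oldest spill falls out when the table is full
    for frame in range(depth):      # descending: each frame spills one value
        table.append(frame)
    hits = 0
    for frame in reversed(range(depth)):  # unwinding: each frame reloads its spill
        if table and table[-1] == frame:
            table.pop()
            hits += 1
    return hits / depth


for depth in (8, 16, 32, 64):
    print(depth, round(renamed_fraction(depth), 3))
```

With a 24-entry table this model yields 1.0 up to depth 24, then 0.75 at depth 32 and 0.375 at depth 64. The measured figures are lower at every depth. The model ignores everything except capacity, so it should be read as a rough ceiling on the trend rather than a prediction.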

Rule of thumb: Memory renaming covers leaf and shallow functions well. Once your call stack exceeds ~20 frames of active spilling, assume the renamer has tapped out and L1 latency (4-5 cycles) is back in the critical path. This is one reason aggressive inlining still matters even on CPUs with "free" stack traffic — fewer frames means more of your spills stay in the rename table.
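
As a back-of-the-envelope check on that rule, the sketch below turns a renamed fraction into an average cost per stack load. It assumes a renamed load costs zero cycles and a non-renamed load pays the full L1 latency. Both are simplifications, and the function name avg_stack_load_latency is hypothetical.

```python
def avg_stack_load_latency(renamed_fraction, l1_latency_cycles=4.0):
    """Average cycles per stack load, assuming renamed loads cost zero cycles."""
    if not 0.0 <= renamed_fraction <= 1.0:
        raise ValueError("renamed_fraction must be between 0 and 1")
    return (1.0 - renamed_fraction) * l1_latency_cycles


# Renamed fractions quoted in the Fibonacci example (depth < 16, 32, 64).
for fraction in (0.85, 0.60, 0.30):
    cycles = avg_stack_load_latency(fraction)
    print(f"{fraction:.2f} renamed -> {cycles:.1f} cycles per load")
```

Under these assumptions, going from 85% to 30% renamed raises the average from about 0.6 to about 2.8 cycles per stack load at a 4-cycle L1. That is the cost inlining avoids when it removes the frames in the first place.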

The hardware can only track what fits in a small CAM. Software still has to be considerate.

Key Takeaway: Memory renaming eliminates stack traffic only within a small tracking window — deep call stacks, dynamic RSP, size mismatches, and aliasing all force the CPU to fall back to real cache accesses.
