2026-05-16
Your x86-64 program has 16 general-purpose registers: RAX, RBX, RCX, and so on. That's the architectural register file — what the ISA exposes. But inside a modern Intel or AMD core, there are roughly 180–280 physical registers backing them. Understanding this gap explains why register pressure isn't what you think it is, and why two instructions writing the same register can run in parallel.
The trick is register renaming. When the rename stage processes a decoded instruction like add rax, rbx, the Register Alias Table (RAT) looks up which physical registers currently hold RAX and RBX for the source operands, then maps the destination, RAX, to a free physical register. Each new write to RAX allocates a fresh physical register; the old one still exists, holding the old value for any in-flight instruction that needs it. The architectural name is just a label that gets reassigned.
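To make the mapping concrete, here is a minimal sketch of a RAT with a free list. Everything in it is hypothetical: the Renamer class, its rename method, and the eight-entry physical register file are illustration only, and real hardware also has to recycle registers at retirement and recover from mispredictions, which this toy ignores.

```python
from collections import deque


class Renamer:
    """Toy register alias table (RAT) plus a free list of physical registers."""

    def __init__(self, arch_regs, num_physical):
        self.free = deque(range(num_physical))
        # Each architectural register starts out backed by one physical register.
        self.rat = {name: self.free.popleft() for name in arch_regs}

    def rename(self, dst, srcs):
        # Sources read whichever physical register currently holds their value.
        phys_srcs = [self.rat[s] for s in srcs]
        # The destination gets a fresh physical register; the old one is untouched.
        phys_dst = self.free.popleft()
        self.rat[dst] = phys_dst
        return phys_dst, phys_srcs


r = Renamer(["rax", "rbx"], num_physical=8)
first = r.rename("rax", ["rax", "rbx"])   # add rax, rbx
second = r.rename("rax", ["rax", "rbx"])  # add rax, rbx again
print(first)   # (2, [0, 1])
print(second)  # (3, [2, 1])
```

The second add reads physical register 2, the result of the first, and writes physical register 3. Physical register 0, the original RAX, is never overwritten.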
This eliminates two kinds of false dependencies:
Write-after-write (WAW): mov rax, [mem1]; mov rax, [mem2]. The second write doesn't have to wait for the first. They get different physical registers.
Write-after-read (WAR): add rbx, rax; mov rax, 0. The mov doesn't wait for add to finish reading RAX.
Concrete example. Consider a loop that zeroes RAX with xor rax, rax before reusing it. The zeroing XOR is special-cased: the renamer recognizes the idiom, gives RAX a physical register that reads as zero, and doesn't dispatch the XOR to an execution unit at all. On Intel cores since Sandy Bridge that means zero latency and no execution port consumed. Compilers emit xor eax, eax rather than mov eax, 0 for two reasons: the encoding is shorter (2 bytes versus 5), and the XOR is handled at rename while the mov still needs an execution port. Both forms break the dependency on the old EAX value, since a mov with an immediate never reads its destination. One difference to keep in mind: xor overwrites the flags and mov does not, so the two are interchangeable only where the flags are dead.
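The WAW and WAR cases above can be checked mechanically. The sketch below counts hazards in a four-instruction sequence, first by architectural name and then after a toy rename pass. The helpers hazards and rename are made up for this illustration, and instructions are reduced to a (destination, sources) pair, which ignores memory operands and flags.

```python
def hazards(instrs):
    """Count (RAW, WAR, WAW) pairs between each instruction and every earlier one."""
    raw = war = waw = 0
    for i, (dst_i, srcs_i) in enumerate(instrs):
        for dst_j, srcs_j in instrs[:i]:
            raw += sum(1 for s in srcs_i if s == dst_j)  # reads an earlier result
            war += 1 if dst_i in srcs_j else 0           # overwrites an earlier source
            waw += 1 if dst_i == dst_j else 0            # overwrites an earlier result
    return raw, war, waw


def rename(instrs, arch_regs):
    """Give every write a fresh register number (toy model: nothing is ever freed)."""
    rat = {name: i for i, name in enumerate(arch_regs)}
    next_free = len(arch_regs)
    renamed = []
    for dst, srcs in instrs:
        phys_srcs = [rat[s] for s in srcs]
        rat[dst] = next_free
        renamed.append((next_free, phys_srcs))
        next_free += 1
    return renamed


prog = [
    ("rbx", ["rbx", "rax"]),  # add rbx, rax
    ("rax", []),              # mov rax, 0
    ("rcx", ["rcx", "rax"]),  # add rcx, rax
    ("rax", []),              # mov rax, 1
]
renamed = rename(prog, ["rax", "rbx", "rcx"])
print(hazards(prog))     # (1, 3, 1)
print(hazards(renamed))  # (1, 0, 0)
```

Only the true dependency survives renaming: add rcx, rax still has to wait for the mov rax, 0 that produced its input.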
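Rule of thumb. Skylake has 180 integer PRF entries and a reorder buffer (ROB) of 224 entries. If your loop body spills because of register pressure, the bottleneck is the 16 architectural registers, not the 180 physical ones: the compiler can only name 16 values at a time, however many the hardware can hold. The PRF becomes the limit in a different situation. Every in-flight instruction that writes an integer register holds a physical register until a later write to the same architectural register retires, so a long run of register-writing instructions stuck behind a slow one (a cache-missing load, say) can exhaust the PRF and stall the renamer before the ROB fills. Roughly: PRF stalls happen when in-flight instructions need more live values than there are physical registers.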
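A back-of-the-envelope version of that rule, using the Skylake numbers above. This is a simplified model, not a description of the real allocator: it assumes exactly 16 physical registers are pinned to committed state and that a fixed fraction of instructions writes an integer register. The function window_limit and the write_fraction parameter are invented for this sketch.

```python
def window_limit(rob_entries, prf_entries, arch_regs, write_fraction):
    """Return (which structure fills first, max in-flight instructions)."""
    # Assumption: arch_regs physical registers always hold committed state,
    # so only the remainder can hold results of in-flight instructions.
    rename_regs = prf_entries - arch_regs
    if write_fraction <= 0:
        return "ROB", rob_entries
    # In-flight instructions the PRF can cover if this fraction writes a register.
    prf_window = rename_regs / write_fraction
    if prf_window < rob_entries:
        return "PRF", int(prf_window)
    return "ROB", rob_entries


print(window_limit(224, 180, 16, 1.0))  # ('PRF', 164)
print(window_limit(224, 180, 16, 0.5))  # ('ROB', 224)
```

Under these assumptions the PRF is the first to fill only when more than about 73% of in-flight instructions (164 / 224) write an integer register. Stores, compares, and branches don't, which is why ordinary code tends to hit the ROB limit first.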
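Why this matters in practice. Loop unrolling exposes more independent work to the out-of-order core, and the renamer allocates more physical registers to track the overlapping iterations. The compiler writing r8, r9, r10, r11 across unrolled iterations isn't wasting registers: values that are live at the same time, such as separate accumulators, need separate architectural names, and separate accumulators turn one long true-dependency chain into several short ones. Reusing one architectural register for a value that is fully overwritten each iteration is not a problem, because the renamer removes the WAW and WAR hazards. The cases it cannot fix are true dependencies and a few false ones it doesn't catch, such as partial-register writes like mov al, 1, which on many cores must merge with the old value of RAX.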
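The effect of spreading work over r8 to r11 shows up in a toy dataflow model. The sketch below tracks only true (read-after-write) dependencies, on the assumption that the renamer has already removed the rest, and it pretends every instruction takes one cycle with unlimited execution units. The critical_path helper is hypothetical, written for this example.

```python
def critical_path(instrs, latency=1):
    """Cycles for the longest true-dependency chain, given (dst, srcs) instructions."""
    ready = {}  # register name -> cycle at which its newest value is available
    finish = 0
    for dst, srcs in instrs:
        # An instruction starts once all of its inputs have been produced.
        start = max((ready.get(s, 0) for s in srcs), default=0)
        ready[dst] = start + latency
        finish = max(finish, ready[dst])
    return finish


# Eight additions into a single accumulator: add rax, [mem_i]
one_acc = [("rax", ["rax"]) for _ in range(8)]

# The same eight additions spread over four accumulators, then combined.
accs = ["r8", "r9", "r10", "r11"]
four_acc = [(accs[i % 4], [accs[i % 4]]) for i in range(8)]
four_acc += [("r8", ["r8", "r9"]), ("r10", ["r10", "r11"]), ("r8", ["r8", "r10"])]

print(critical_path(one_acc))   # 8
print(critical_path(four_acc))  # 4
```

The four-accumulator version executes three more instructions but finishes in half the cycles, because its longest chain is two additions plus a two-level combine.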
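xor eax, eax is about as close to free as an instruction gets: no execution port and no latency, just a slot in the front-end and an entry in the ROB.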
