2026-05-15
Modern x86 cores execute instructions out of order — issuing whichever instruction has its operands ready, regardless of program order. But architecturally, software must observe results as if instructions ran sequentially. The bridge between these two realities is the Reorder Buffer (ROB).
When the front-end decodes an instruction, it allocates a ROB entry and renames its destination register to a physical register from the pool. The instruction is dispatched to a scheduler, executes whenever ready, and writes its result into the physical register. The ROB entry is marked "complete" — but the result is speculative. It is not yet architecturally visible.
Only at retirement, when the ROB's head pointer reaches the entry, does the result commit: the architectural register state is updated, the instruction's stores become eligible to drain from the store buffer to the cache, any pending exception is raised, and the instruction is "really done." Retirement happens strictly in program order, even though execution didn't.
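The allocate, complete, retire lifecycle described above can be sketched as a toy model. This is a minimal illustration, not a description of any real core: `RobEntry`, `ReorderBuffer`, and the `done` flag are hypothetical names, and renaming, schedulers, and exceptions are left out entirely.

```python
from collections import deque
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RobEntry:
    name: str
    done: bool = False  # set when execution finishes; result is still speculative


class ReorderBuffer:
    """Toy ROB: entries are allocated and retired strictly in program order."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = deque()  # left end is the head (oldest instruction)

    def allocate(self, name: str) -> Optional[RobEntry]:
        """Append a new entry at the tail; return None when the ROB is full."""
        if len(self.entries) >= self.capacity:
            return None
        entry = RobEntry(name)
        self.entries.append(entry)
        return entry

    def retire(self, width: int) -> List[str]:
        """Retire up to `width` completed entries, starting at the head."""
        retired = []
        while self.entries and self.entries[0].done and len(retired) < width:
            retired.append(self.entries.popleft().name)
        return retired


rob = ReorderBuffer(capacity=4)
load, add, sub = (rob.allocate(n) for n in ("load", "add", "sub"))

# The younger instructions finish executing first (out of order).
add.done = True
sub.done = True
blocked = rob.retire(width=4)   # head (load) is not done, so nothing retires

load.done = True
in_order = rob.retire(width=4)  # now everything retires in program order
print(blocked, in_order)
```

The only ordering rule lives in `retire`: it never looks past an incomplete head, which is what keeps completed but younger results invisible to software.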
Why this matters:
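Size rule of thumb: Intel Golden Cove has a 512-entry ROB; Apple's M3 has an estimated ~670; Zen 4 has 320. The entry count is itself the maximum in-flight window, measured in instructions; divide it by the rate at which instructions flow through the core to see how much time that window covers. At 4 GHz, a 512-entry ROB holds ~128 ns of speculative work if the core sustains one instruction per cycle, roughly one DRAM round trip, but only ~21 ns if it fills at 6 instructions per cycle. Retire width (typically 6–8 instructions/cycle on modern cores) bounds how quickly the window drains once the head unblocks.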
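The 128 ns figure can be checked with a short calculation. This is a sketch, assuming instructions flow through the window at a steady rate; `window_ns` and its parameter names are hypothetical, chosen for illustration.

```python
def window_ns(rob_entries: int, instrs_per_cycle: float, freq_ghz: float) -> float:
    """Time in ns for `rob_entries` instructions to flow through the core."""
    cycles = rob_entries / instrs_per_cycle
    return cycles / freq_ghz  # 1 GHz is one cycle per nanosecond


at_ipc_1 = window_ns(512, 1.0, 4.0)  # one instruction per cycle
at_ipc_6 = window_ns(512, 6.0, 4.0)  # six instructions per cycle
print(f"{at_ipc_1:.1f} ns at 1 instr/cycle, {at_ipc_6:.1f} ns at 6 instr/cycle")
```

The 128 ns number corresponds to one instruction per cycle; a core that fills the same window at 6 instructions per cycle covers only about a sixth of that time.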
Concrete example: Consider load r1, [rsi]; add r2, r3, r4; sub r5, r2, 1; load r6, [rdi]. The first load misses to DRAM (200 cycles). The add and sub do not depend on it, so they execute in cycles 2–3 and sit completed in the ROB. The second load issues immediately, overlapping with the first: this is memory-level parallelism. All three younger instructions wait for the head load to retire before any architectural state changes. If the CPU runs out of ROB entries while waiting, allocation stops and the front-end stalls; this is why ROB size directly bounds memory-level parallelism on cache-miss-heavy workloads.
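The timeline of this four-instruction example can be reproduced with a small dataflow model. It is a sketch under stated assumptions, none of which come from a real core: all four instructions are allocated in cycle 1, execution units are unlimited, retire width is at least 4, and the second load is assumed to miss for 200 cycles as well. `Instr` and `simulate` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Instr:
    text: str
    latency: int                                    # execution latency in cycles
    deps: List[int] = field(default_factory=list)   # indices of older producers


def simulate(program: List[Instr], alloc_cycle: int = 1) -> Tuple[List[int], List[int]]:
    """Return (finish_cycle, retire_cycle) for each instruction."""
    done: List[int] = []
    retire: List[int] = []
    for ins in program:
        # An instruction can start the cycle after it is allocated and
        # after every producer it depends on has finished.
        ready = max([alloc_cycle] + [done[d] for d in ins.deps])
        finish = ready + ins.latency
        done.append(finish)
        # In-order retirement: never earlier than the previous instruction.
        prev = retire[-1] if retire else 0
        retire.append(max(finish + 1, prev))
    return done, retire


program = [
    Instr("load r1, [rsi]", latency=200),       # misses to DRAM
    Instr("add r2, r3, r4", latency=1),
    Instr("sub r5, r2, 1", latency=1, deps=[1]),
    Instr("load r6, [rdi]", latency=200),       # assumption: also misses
]
done, retire = simulate(program)
for ins, d, r in zip(program, done, retire):
    print(f"{ins.text:<16} finishes cycle {d:>3}, retires cycle {r}")
```

In this model the add and sub finish in cycles 2 and 3 but cannot retire until cycle 202, and the two misses overlap almost completely instead of costing roughly 400 cycles back to back.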
When a ROB-full stall counter dominates in perf (the exact event name varies by microarchitecture), you've found a workload bottlenecked not by compute but by the CPU's ability to hide latency.
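To get a feel for what such a counter measures, the toy model below counts the cycles in which allocation is blocked because the ROB is full behind a single outstanding miss. It is a sketch with hypothetical names (`rob_full_stall_cycles` and its parameters) and a strong simplifying assumption: every younger instruction completes but none can retire until the head load returns.

```python
def rob_full_stall_cycles(rob_size: int, alloc_width: int, miss_latency: int) -> int:
    """Cycles with allocation blocked while one head load is outstanding."""
    occupied = 1  # the missing load sitting at the head
    stalls = 0
    for _ in range(miss_latency):
        free = rob_size - occupied
        if free == 0:
            stalls += 1                         # ROB full: front-end cannot allocate
        else:
            occupied += min(alloc_width, free)  # keep filling behind the head
    return stalls


for size in (320, 512):
    stalled = rob_full_stall_cycles(size, alloc_width=6, miss_latency=200)
    print(f"{size}-entry ROB: stalled {stalled} of 200 cycles")
```

Under these assumptions a larger ROB shortens the stalled portion of each miss but does not remove it, which matches the intuition that the window buys latency tolerance rather than raw compute.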
