Daily Low-Level Programming: The Reorder Buffer: How the CPU Commits Instructions in Order It Didn't Execute Them

2026-05-15

Modern x86 cores execute instructions out of order — issuing whichever instruction has its operands ready, regardless of program order. But architecturally, software must observe results as if instructions ran sequentially. The bridge between these two realities is the Reorder Buffer (ROB).

After the front end decodes an instruction, the rename/allocate stage assigns it a ROB entry and renames its destination register to a physical register from the free pool. The instruction is dispatched to a scheduler, executes once its operands are ready, and writes its result into the physical register. The ROB entry is then marked "complete", but the result is still speculative: it is not yet architecturally visible.
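The renaming step can be sketched in a few lines. This is a toy model, not how any particular core implements it: the function name `rename`, the tuple encoding of instructions, and the `p0`, `p1`, ... register names are all made up for illustration.

```python
def rename(program, free_regs):
    """Rename a list of (dest, srcs) instructions given in program order.

    Returns a list of (phys_dest, phys_srcs). Hypothetical model: real
    hardware uses a register alias table and a hardware free list.
    """
    mapping = {}            # architectural name -> newest physical register
    free = list(free_regs)  # copy so the caller's list is untouched
    renamed = []
    for dest, srcs in program:
        # Sources read whichever physical register currently holds the value;
        # a source that was never written keeps its architectural name here.
        phys_srcs = tuple(mapping.get(s, s) for s in srcs)
        # Every write gets a fresh physical register, which removes
        # write-after-write and write-after-read hazards.
        phys_dest = free.pop(0)
        mapping[dest] = phys_dest
        renamed.append((phys_dest, phys_srcs))
    return renamed


program = [
    ("r1", ("r2", "r3")),   # r1 = r2 + r3
    ("r1", ("r1", "r4")),   # r1 = r1 + r4  (second write to r1)
    ("r5", ("r1",)),        # r5 = f(r1)
]
for line in rename(program, ["p0", "p1", "p2"]):
    print(line)
```

Note that the two writes to r1 land in different physical registers, so the older value survives until the younger write retires. That is what makes it possible to throw speculative results away later.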

Only at retirement, when the ROB's head pointer reaches the entry, does the result commit: the architectural register state is updated, stores become eligible to drain from the store buffer to the cache, any pending exception is raised, and the instruction is "really done." Retirement happens strictly in program order, even though execution didn't.
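In-order retirement can be modeled as a queue that only ever pops from the head. The sketch below assumes each instruction's completion cycle is known up front and that an entry may retire the cycle after it completes. Both are simplifications, and `RobEntry` and `retire_order` are illustrative names.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class RobEntry:
    name: str        # label for the instruction
    done_cycle: int  # cycle in which execution finishes (assumed known)


def retire_order(entries, retire_width=4):
    """Return (cycle, name) pairs in retirement order.

    `entries` must be in program order; the head of the deque is the
    oldest instruction still in flight.
    """
    rob = deque(entries)
    retired = []
    cycle = 0
    while rob:
        cycle += 1
        slots = retire_width
        # Only the head may retire, and only once it has finished executing.
        while rob and slots and rob[0].done_cycle < cycle:
            retired.append((cycle, rob.popleft().name))
            slots -= 1
    return retired


program = [RobEntry("load", 10), RobEntry("add", 2), RobEntry("sub", 3)]
print(retire_order(program, retire_width=2))
```

The add and sub finish in cycles 2 and 3, yet neither can retire until the load at the head finishes in cycle 10. After that, the backlog drains at the retire width.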

Why this matters:

Precise exceptions: If instruction N faults, all instructions before N retire normally; everything after N (which may have already executed!) is squashed by flushing the ROB. The architectural state looks exactly as if execution stopped at N.
Branch misprediction recovery: Same mechanism — flush every ROB entry after the mispredicted branch, free their physical registers, restart fetch.
Speculative execution attacks: Spectre/Meltdown exploit the gap between execution and retirement — squashed instructions leave microarchitectural side effects (cache state) even though they never retired.
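The flush used for both exceptions and mispredictions amounts to truncating the ROB at a given point and recycling the physical registers of everything younger. A minimal sketch, with a hypothetical `Entry` type and `squash_from` helper; it ignores restoring the rename map, which real hardware also has to do:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    op: str
    phys_reg: Optional[int]  # destination physical register, None if there is none


def squash_from(rob: List[Entry], first_squashed: int, free_list: List[int]) -> None:
    """Remove rob[first_squashed:] and return their physical registers to free_list."""
    for entry in rob[first_squashed:]:
        if entry.phys_reg is not None:
            free_list.append(entry.phys_reg)
    del rob[first_squashed:]


# Program order: index 0 is the oldest entry (the ROB head).
rob = [Entry("add", 40), Entry("branch", None), Entry("load", 41), Entry("mul", 42)]
free_list = [43, 44]

# The branch at index 1 was mispredicted: keep it, squash everything younger.
squash_from(rob, 2, free_list)
print([e.op for e in rob], free_list)
```

For a faulting instruction N the same call would start at N itself, since N must not commit its result either. Nothing in this model undoes cache fills made by the squashed load, which is exactly the gap the speculative execution attacks rely on.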

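Size rule of thumb: Intel Golden Cove has a 512-entry ROB; Apple's M3 has ~670; Zen 4 has 320. The ROB size is itself the maximum in-flight window: there can never be more instructions (strictly, micro-ops) between allocation and retirement than there are entries. Retire width (typically 6–8 instructions/cycle on modern cores) only determines how quickly a full ROB can drain. At 4 GHz, a 512-entry ROB holds ~128 ns of work if the core sustains one instruction per cycle, which is on the order of a DRAM round trip; at higher IPC the same window covers proportionally less time.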
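The arithmetic behind the window estimate is short enough to check directly. The helper below is a back-of-the-envelope sketch: `window_ns` is a made-up name, the sustained IPC values are illustrative, and the 4 GHz clock is applied to both cores purely for comparison.

```python
def window_ns(rob_entries, ipc, clock_ghz):
    """Nanoseconds needed to push `rob_entries` instructions through at a sustained IPC."""
    cycles = rob_entries / ipc
    return cycles / clock_ghz  # cycles divided by (cycles per ns)


# ROB sizes quoted in the text; clock and IPC values are assumptions.
for name, entries in [("Golden Cove", 512), ("Zen 4", 320)]:
    for ipc in (1, 2, 6):
        print(f"{name:12s} IPC={ipc}: {window_ns(entries, ipc, 4.0):6.1f} ns")
```

At one instruction per cycle, 512 entries correspond to 128 ns. At six per cycle the same ROB fills in about 21 ns, so a long miss stalls a high-IPC workload sooner than the headline figure suggests.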

Concrete example: Consider load r1, [rsi]; add r2, r3, r4; sub r5, r2, 1; load r6, [rdi]. The first load misses all the way to DRAM (roughly 200 cycles). The add does not depend on the load, and the sub depends only on the add, so both execute in cycles 2–3 and sit completed in the ROB. The second load is also independent and issues immediately; if it misses too, the two misses overlap, which is memory-level parallelism. All three younger instructions wait for the head load to retire before any architectural state changes. If the CPU runs out of ROB entries while waiting, allocation stalls and the front end backs up, so no further loads can enter the window. This is why ROB size bounds memory-level parallelism on cache-miss-heavy workloads.
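The link between ROB size and memory-level parallelism can be illustrated with a toy model. It assumes that instruction 0 is a load that misses, that every `miss_every`-th instruction after it is an independent missing load, and that nothing retires until the head load returns. Other real limits (load buffer entries, outstanding-miss buffers) are ignored, and `overlapped_misses` is a hypothetical name.

```python
def overlapped_misses(rob_size, miss_every, alloc_width=6, miss_latency=200):
    """Count missing loads that enter the ROB while the head miss is outstanding."""
    allocated = 0  # instructions allocated into the ROB so far
    misses = 0     # missing loads among them (including the head load)
    for _cycle in range(miss_latency):
        for _ in range(alloc_width):
            if allocated == rob_size:
                break  # ROB full: allocation stalls until the head retires
            if allocated % miss_every == 0:
                misses += 1
            allocated += 1
    return misses


for size in (64, 320, 512):
    print(size, overlapped_misses(size, miss_every=64))
```

With one miss every 64 instructions, a 512-entry ROB overlaps 8 misses under the head load and a 320-entry ROB overlaps 5. A 64-entry ROB overlaps none beyond the head itself, so each miss is paid for in full.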

When perf attributes back-end stalls to a full ROB (the exact event name varies by microarchitecture), you've found a workload bottlenecked not by compute but by the CPU's ability to hide latency.

Key Takeaway: The Reorder Buffer lets the CPU execute instructions in any order while retiring them in program order, giving you precise exceptions and recoverable speculation at the cost of a hard ceiling on in-flight work.
