The Reorder Buffer: How CPUs Unscramble Out-of-Order Results

2026-05-03

We covered out-of-order execution previously — how CPUs execute instructions in whatever order the data dependencies allow. But there's a critical problem: if instructions finish out of order, how does the CPU maintain the illusion of sequential execution? The answer is the Reorder Buffer (ROB), one of the most important structures in a modern CPU core.

The ROB is a circular buffer that tracks every in-flight instruction in program order. When an instruction is renamed and dispatched, it is allocated an entry at the ROB's tail. When it finishes execution (potentially out of order), its entry is marked complete and, in the classic design, its result is written into that entry. But the instruction only retires, that is, commits its result to architectural state, once it reaches the ROB's head, which by construction means every older instruction has already retired. This in-order retirement from the head is the key constraint.
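To make the allocate / complete / retire cycle concrete, here is a minimal Python sketch of the mechanism described above. It is a toy model, not how any real core is built: the names ReorderBuffer, RobEntry, allocate, complete and retire are made up for illustration, and a deque stands in for the hardware circular buffer.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class RobEntry:
    seq: int                      # program-order sequence number
    dest: str                     # architectural destination register
    value: Optional[int] = None   # filled in when execution completes
    done: bool = False            # completion bit


class ReorderBuffer:
    """Toy ROB: left end of the deque is the head (oldest instruction)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()
        self.arch_regs = {}       # committed (architectural) state
        self.next_seq = 0

    def allocate(self, dest):
        """Take an entry at the tail, in program order."""
        if len(self.entries) == self.capacity:
            raise RuntimeError("ROB full: the front end would stall here")
        entry = RobEntry(self.next_seq, dest)
        self.entries.append(entry)
        self.next_seq += 1
        return entry.seq

    def complete(self, seq, value):
        """Record a result; may be called in any order."""
        for entry in self.entries:
            if entry.seq == seq:
                entry.value, entry.done = value, True
                return
        raise KeyError(seq)

    def retire(self):
        """Commit from the head while the oldest entry is done."""
        retired = []
        while self.entries and self.entries[0].done:
            entry = self.entries.popleft()
            self.arch_regs[entry.dest] = entry.value
            retired.append(entry.seq)
        return retired


rob = ReorderBuffer(capacity=4)
a = rob.allocate("r1")
b = rob.allocate("r2")
c = rob.allocate("r3")

rob.complete(c, 30)      # youngest instruction finishes first
first = rob.retire()     # nothing retires: the head (a) is not done
rob.complete(a, 10)
second = rob.retire()    # a retires; b still blocks c
rob.complete(b, 20)
third = rob.retire()     # b and c retire together, in order
```

Note that instruction c finished first but retired last: completion order is free, retirement order is not.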

Why does this matter? Consider what happens during a mispredicted branch. The CPU may have speculatively executed 50 instructions down the wrong path. Because none of those results have been committed to architectural state (they're sitting in ROB entries that haven't reached the head), the CPU simply discards every ROB entry younger than the mispredicted branch and restarts fetch on the correct path. Clean recovery, no corrupted state. Without the ROB, rollback would be a nightmare.
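The flush itself can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the ROB is modeled as a plain list of dicts in program order, flush_after is a hypothetical helper, and real hardware would also have to restore its register rename state, which is omitted here.

```python
def flush_after(rob, branch_seq):
    """Keep the branch and everything older; squash everything younger."""
    kept = [e for e in rob if e["seq"] <= branch_seq]
    squashed = [e["seq"] for e in rob if e["seq"] > branch_seq]
    return kept, squashed


arch_regs = {"r1": 1, "r2": 2}     # committed state before the flush
rob = [
    {"seq": 10, "op": "add r1", "done": True},
    {"seq": 11, "op": "branch", "done": True},    # turns out mispredicted
    {"seq": 12, "op": "mul r2", "done": True},    # wrong path, already executed
    {"seq": 13, "op": "sub r1", "done": False},   # wrong path, still in flight
]

snapshot = dict(arch_regs)
rob, squashed = flush_after(rob, branch_seq=11)
remaining = [e["seq"] for e in rob]
```

Even though instruction 12 had already produced a result, nothing it computed ever reached arch_regs, so squashing it needs no undo step.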

The ROB also enables precise exceptions. If instruction #47 faults, the CPU can retire instructions #1–46, report the exception at exactly instruction #47, and discard everything after it. The OS sees a clean, precise architectural state — as if the processor had executed instructions one at a time.
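A small Python sketch of that retirement rule, assuming a list-of-dicts ROB in program order and a hypothetical retire_in_order helper (the field names pc, dest, value, done and fault are illustrative, not taken from any real design):

```python
def retire_in_order(rob, arch_regs):
    """Retire from the head; stop at the first fault or unfinished entry.

    Returns the PC of the faulting instruction, or None if no fault
    was reached.
    """
    for i, entry in enumerate(rob):
        if not entry["done"]:
            del rob[:i]            # drop what retired, keep the rest in flight
            return None
        if entry["fault"]:
            rob.clear()            # discard the faulting entry and all younger ones
            return entry["pc"]
        arch_regs[entry["dest"]] = entry["value"]
    rob.clear()
    return None


arch_regs = {"r1": 0, "r2": 0, "r3": 0, "r4": 0}
rob = [
    {"pc": 0x100, "dest": "r1", "value": 11, "done": True, "fault": False},
    {"pc": 0x104, "dest": "r2", "value": 22, "done": True, "fault": False},
    {"pc": 0x108, "dest": "r3", "value": None, "done": True, "fault": True},
    {"pc": 0x10C, "dest": "r4", "value": 44, "done": True, "fault": False},
]

fault_pc = retire_in_order(rob, arch_regs)
```

The instruction at 0x10C had already computed 44, but because it is younger than the fault, r4 is never updated. The exception handler sees exactly the state left by the two older instructions.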

Real-world sizing: Intel's Golden Cove cores (Alder Lake, 2021) have a 512-entry ROB. AMD's Zen 4 uses a 320-entry ROB. Apple's Firestorm cores reportedly have ~630 entries. A larger ROB means more instructions can be in flight, which helps the CPU find more instruction-level parallelism — especially useful when waiting on cache misses.

Rule of thumb: to hide a long-latency event completely, the ROB must be deep enough to hold every instruction the core could dispatch while it waits. If an L3 cache miss takes ~40 ns and your core runs at 5 GHz, that's 200 cycles. At roughly 4–6 instructions dispatched per cycle (call it 5), you would need 200 × 5 = 1,000 instructions in flight to keep the core fully fed during the miss. Every one of those instructions occupies a ROB entry until it retires, and the load/store queues, schedulers, and register file have to scale along with it, so even a 512-entry ROB covers only part of such a miss. This is why ROBs keep growing generation over generation.
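The arithmetic is easy to play with in Python. The helper name slots_to_cover is made up for this sketch, and the inputs are the same round numbers used above rather than measurements:

```python
def slots_to_cover(latency_ns, freq_ghz, width):
    """Return (cycles of latency, in-flight instructions needed to hide it)."""
    cycles = latency_ns * freq_ghz     # ns * (cycles per ns)
    return cycles, cycles * width


cycles, slots = slots_to_cover(latency_ns=40, freq_ghz=5, width=5)
# cycles == 200, slots == 1000

# Turned around: how many cycles can a 512-entry ROB cover at 5 per cycle?
covered_cycles = 512 / 5               # 102.4 cycles
```

At these numbers a 512-entry window fills up roughly halfway through the miss, after which dispatch stalls until the head can retire.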

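Each ROB entry typically stores: the destination register tag, the computed result value (or, in designs with a physical register file, just a pointer to it), a completion bit, an exception flag, and the PC for recovery. At 512 entries, this structure consumes significant area and power. Designers also use rename-stage techniques like move elimination (handling register-to-register moves by remapping registers instead of executing them) and zeroing idiom detection (e.g., xor eax, eax) to relieve the scheduler and execution ports. In most implementations these instructions still occupy a ROB entry, but they are ready to retire without ever executing.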
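For a rough feel of scale, the fields listed above can be tallied in Python. The bit widths below are illustrative guesses chosen for this sketch, not figures from any vendor, and real entries carry additional bookkeeping that is not modeled here.

```python
# Hypothetical field widths in bits; every number here is an assumption.
ENTRY_FIELDS = {
    "dest_tag": 9,      # enough to name one of 512 registers
    "value": 64,        # result stored in the entry (classic design)
    "done": 1,          # completion bit
    "exception": 1,     # exception flag
    "pc": 48,           # PC for recovery
}

ROB_ENTRIES = 512

bits_per_entry = sum(ENTRY_FIELDS.values())
total_bits = bits_per_entry * ROB_ENTRIES
total_bytes = total_bits // 8
```

Under these assumptions each entry is 123 bits and the whole buffer is a bit under 8 KB of storage. The raw capacity is modest; the cost comes from the many read and write ports needed to allocate, complete, and retire several instructions every cycle.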

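One subtle design choice is where result values live. In older designs (for example Intel's P6 family through Nehalem), each ROB entry holds the result itself, which is copied into a separate retirement register file when the instruction retires. Most current high-performance cores, including Intel's since Sandy Bridge and AMD's Zen family, instead keep both speculative and committed values in a single merged physical register file, with the ROB holding only register tags. The merged design avoids the copy at retirement but complicates register reclamation: a physical register can be freed only once a later instruction that overwrites the same architectural register has retired.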

Key Takeaway: The Reorder Buffer is the hardware bookkeeper that lets CPUs execute instructions out of order for performance while retiring them in order for correctness, enabling both speculative recovery and precise exceptions.
