Daily Hardware Architecture: The Retirement Unit: How CPUs Commit Results in Program Order

2026-05-22

Out-of-order CPUs execute instructions whenever their operands are ready, but the architectural state must update as if the program ran sequentially. The retirement unit (also called the commit stage) is where speculation becomes reality. It's the final pipeline stage where instructions transition from "executed" to "officially happened."

Retirement walks the head of the Reorder Buffer (ROB) in program order. An instruction can retire only when:

It has completed execution without exception
All older instructions have already retired
Any branch it depended on resolved correctly
For stores: the address and data are ready in the store buffer (the write to L1 itself happens only after retirement)

When retirement fires, several things happen atomically from the architectural viewpoint: the rename map's "architectural" pointer advances to the new physical register, the old physical register goes back to the free list, the ROB slot is released, and stores become eligible to leave the store buffer. If the instruction faulted (or, in designs that defer branch recovery to commit, was a mispredicted branch), retirement instead triggers a pipeline flush, squashing everything younger and restoring the rename map to its last committed state. Many cores recover from branch mispredictions earlier, at execute, using rename-map checkpoints.
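The retirement rules and bookkeeping described above can be sketched as a toy Python model. This is a minimal illustration, not a description of any real core: the names `RobEntry` and `RetireModel` are hypothetical, each instruction is assumed to write exactly one register, and faults, branches, and stores are left out.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class RobEntry:
    arch_reg: str        # architectural destination register, e.g. "r1"
    new_phys: int        # physical register allocated at rename time
    done: bool = False   # set when execution completes


class RetireModel:
    """Toy in-order retirement from the ROB head (hypothetical structure)."""

    def __init__(self, arch_map, free_list, width=8):
        self.arch_map = dict(arch_map)    # committed arch -> phys mapping
        self.free_list = list(free_list)  # physical registers available to rename
        self.rob = deque()                # entries in program order, head at index 0
        self.width = width                # max instructions retired per cycle

    def retire_cycle(self):
        """Retire up to `width` completed entries, stopping at the first incomplete one."""
        retired = 0
        while self.rob and retired < self.width and self.rob[0].done:
            entry = self.rob.popleft()                   # release the ROB slot
            old_phys = self.arch_map[entry.arch_reg]
            self.arch_map[entry.arch_reg] = entry.new_phys  # commit the new mapping
            self.free_list.append(old_phys)              # recycle the previous register
            retired += 1
        return retired
```

If the head entry is not marked `done`, `retire_cycle()` returns 0 even when every younger entry has finished, which is the in-order constraint in miniature.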

Retirement width matters. Intel's Golden Cove can retire up to 8 µops per cycle, and Apple's M1 Firestorm is reported to retire at least as many. But sustained retirement is bounded by the slowest instruction at the ROB head: a single cache-missing load can stall retirement for hundreds of cycles while the ROB fills behind it. This is why ROB sizes have ballooned: Golden Cove has 512 entries, and M1 is estimated at ~630. The ROB must be large enough to hide L2/L3 latency without stalling the front end.

Concrete example: Consider a loop where instruction N is a load that misses to DRAM (~250 cycles), and N+1 through N+200 are independent ALU ops. The ALU ops execute and sit in the ROB with their "complete" bits set. Retirement is blocked at N. If the ROB holds 512 entries and you average 3 µops/cycle of dispatch, you fill it in ~170 cycles, and then the front end stalls for the remaining ~80 cycles of the miss. This is the memory wall made visible: bigger ROBs buy you more latency tolerance, linearly.
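The arithmetic in this example can be checked with a short Python sketch. It assumes the ROB is empty when the miss begins and that dispatch continues at a constant rate until the ROB is full; the function name `rob_fill` is made up for illustration.

```python
def rob_fill(rob_entries, dispatch_rate, miss_latency):
    """Return (cycles until the ROB is full, cycles the front end then sits stalled).

    Assumes an empty ROB at the start of the miss and a constant dispatch rate.
    """
    fill_cycles = rob_entries / dispatch_rate
    stall_cycles = max(0.0, miss_latency - fill_cycles)
    return fill_cycles, stall_cycles


# Figures from the example: 512-entry ROB, 3 uops/cycle, ~250-cycle miss
fill, stall = rob_fill(rob_entries=512, dispatch_rate=3, miss_latency=250)
print(f"ROB full after ~{fill:.0f} cycles, front end stalled ~{stall:.0f} cycles")
```

With these inputs the ROB fills after about 171 cycles, leaving roughly 79 cycles in which nothing new can be dispatched. Doubling `rob_entries` doubles `fill_cycles`, which is the linear relationship described above.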

Rule of thumb: ROB_size ≥ retire_width × worst_tolerable_latency. For an 8-wide machine targeting 300-cycle DRAM tolerance: 8 × 300 = 2400 entries needed for full overlap at peak retirement rate. Real CPUs provide roughly 20 to 25% of that (512 to ~630 entries), which is one reason DRAM-bound code never reaches peak IPC.
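As a quick check of the rule of thumb, the following Python sketch computes the required ROB size and compares it with the two ROB sizes quoted earlier. The helper names are hypothetical, and the M1 figure is an estimate.

```python
def rob_needed(retire_width, latency_cycles):
    """Entries required to keep retiring at full width across the whole latency."""
    return retire_width * latency_cycles


def covered_latency(rob_entries, retire_width):
    """Latency (in cycles) a given ROB can hide at full retirement width."""
    return rob_entries / retire_width


needed = rob_needed(retire_width=8, latency_cycles=300)  # 2400 entries

# ROB sizes quoted in this article; the M1 number is an estimate
for name, size in {"Golden Cove": 512, "M1 Firestorm (approx.)": 630}.items():
    share = size / needed
    print(f"{name}: {size}/{needed} entries = {share:.0%}, "
          f"hides ~{covered_latency(size, 8):.0f} cycles at 8-wide")
```

Under these assumptions, a 512-entry ROB covers about 21% of the target and a ~630-entry ROB about 26%, or about 64 to 79 cycles of latency at full width.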

Retirement is also where precise exceptions are delivered. The CPU may have speculatively executed past a faulting instruction, but the fault only "happens" when that instruction reaches the head of the ROB — everything younger is discarded, giving the OS handler a clean architectural snapshot.
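A minimal Python sketch of this behavior follows. It assumes a hypothetical ROB format of `(name, done, faulted)` tuples in program order; real hardware tracks far more state per entry.

```python
from collections import deque


def retire_until_fault(rob):
    """Retire from the ROB head in program order.

    rob: deque of (name, done, faulted) tuples, oldest first (hypothetical format).
    Returns (committed names, faulting name or None, squashed names).
    """
    committed = []
    while rob:
        name, done, faulted = rob[0]
        if not done:
            break                 # head still executing: nothing younger may retire
        rob.popleft()
        if faulted:
            squashed = [n for n, _, _ in rob]
            rob.clear()           # discard all younger, speculative work
            return committed, name, squashed
        committed.append(name)
    return committed, None, []


# The faulting load has an older add ahead of it and a younger mul behind it
rob = deque([("add", True, False), ("load", True, True), ("mul", True, False)])
print(retire_until_fault(rob))
```

Here `add` commits, the fault is reported for `load`, and `mul` is squashed even though it had finished executing. A fault recorded on an entry that is not yet at the head is not delivered until every older entry has retired.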

See it in action: Check out How CPUs Commit Instructions The Retirement Stage Software Execution by Software Explained to see this theory applied.

Key Takeaway: Retirement is the CPU's commit point — it converts speculative execution into architectural truth, in program order, and its width and queue depth bound how much latency the machine can hide.
