Daily Hardware Architecture: Out-of-Order Execution: How CPUs Break the Rules to Go Faster

2026-04-23

Your compiler emits instructions in a specific order. Your CPU ignores it. Out-of-order (OoO) execution lets the processor find and execute independent instructions while earlier ones are stalled — waiting on cache misses, division units, or unresolved branches. The program's visible results stay in-order. The internal execution does not.

The machinery that makes this work has three key stages:

Rename (Front-end): Instructions are decoded and their architectural registers (like x86's RAX, RBX) are mapped to a much larger set of physical registers. This eliminates false dependencies (write-after-write and write-after-read hazards), which come from reusing a register name rather than from real data flow. If two instructions both write to RAX, renaming gives them different physical registers so they can execute in parallel. A Zen 5 core has 240 physical integer registers backing just 16 architectural ones.
Issue/Execute (Middle): Instructions wait in a structure called the Reservation Station (Intel) or Scheduler (AMD). Once all of an instruction's source operands are ready, it is dispatched to an execution unit, regardless of its original program position. A Golden Cove core can track up to 512 instructions in flight in its reorder buffer.
Retire (Back-end): The Reorder Buffer (ROB) tracks original program order. Instructions commit their results to architectural state strictly in-order. If instruction #7 finishes before #4, it waits in the ROB until #4 (and #5, #6) retire first. This preserves the illusion of sequential execution.
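
To make the rename step concrete, here is a minimal sketch in Python. It is a toy model, not how any real core implements renaming: the names `Instr` and `rename`, the four-register architectural set, and the pool of 8 physical registers are all assumptions chosen for illustration. It also never returns registers to the free list, which real hardware does once the instruction that overwrote them retires.

```python
from collections import namedtuple

# Hypothetical instruction format: one destination, a tuple of sources.
Instr = namedtuple("Instr", "dst srcs")

def rename(program, arch_regs, num_phys):
    """Give every destination write a fresh physical register."""
    # Each architectural register starts out mapped to its own physical one.
    table = {reg: f"p{i}" for i, reg in enumerate(arch_regs)}
    free = [f"p{i}" for i in range(len(arch_regs), num_phys)]
    renamed = []
    for ins in program:
        # Sources read the current mapping, so true dependencies survive.
        srcs = tuple(table[s] for s in ins.srcs)
        if not free:
            raise RuntimeError("no free physical register: rename would stall")
        # A fresh destination removes write-after-write and
        # write-after-read conflicts on the architectural name.
        dst = free.pop(0)
        table[ins.dst] = dst
        renamed.append(Instr(dst, srcs))
    return renamed

program = [
    Instr("rax", ("rbx",)),  # first write to rax
    Instr("rcx", ("rax",)),  # reads the first rax (true dependency)
    Instr("rax", ("rdx",)),  # second, unrelated write to rax
]
renamed = rename(program, ["rax", "rbx", "rcx", "rdx"], num_phys=8)
for ins in renamed:
    print(ins.dst, "<-", ins.srcs)
```

After renaming, the two writes to rax land in different physical registers (p4 and p6 in this sketch), while the second instruction still reads p4. The third instruction no longer has to wait for the first two.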

Real-world example: Consider a load that misses the inner caches and takes about 50 cycles to return from L3. Without OoO, the pipeline stalls as soon as an instruction needs the loaded value, and everything behind that instruction waits out the miss. With OoO, the CPU scans ahead, finds dozens of independent instructions (ALU ops, other loads that hit L1, address calculations), and executes them during those 50 otherwise dead cycles. On server workloads with frequent cache misses, OoO can improve throughput by 2–3x over an equivalently clocked in-order core.
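
The effect can be reproduced with a toy scheduler. The sketch below is an illustration under strong assumptions, not a model of a real pipeline: it issues one instruction per cycle, ignores ROB capacity and in-order retirement, and the in-order variant stalls when an instruction's operands are not ready. The function name `simulate` and the 22-instruction program are made up for the example.

```python
def simulate(program, out_of_order, width=1):
    """Return the cycle at which the last result becomes available.

    program: list of (latency, deps); deps are indices of earlier
    instructions whose results this instruction needs.
    """
    done_at = {}                       # index -> cycle its result is ready
    pending = list(range(len(program)))
    cycle = 0
    while pending:
        issued = 0
        for idx in list(pending):
            if issued == width:
                break
            latency, deps = program[idx]
            ready = all(d in done_at and done_at[d] <= cycle for d in deps)
            if ready:
                done_at[idx] = cycle + latency
                pending.remove(idx)
                issued += 1
            elif not out_of_order:
                break                  # in-order: nothing passes a stalled instruction
        cycle += 1
    return max(done_at.values())

LOAD_MISS = 50                              # assumed L3 latency from the text
program = [(LOAD_MISS, [])]                 # 0: load that misses to L3
program += [(1, [0])]                       # 1: consumer of the loaded value
program += [(1, []) for _ in range(20)]     # 2..21: independent ALU ops

in_order_cycles = simulate(program, out_of_order=False)
ooo_cycles = simulate(program, out_of_order=True)
print("in-order:", in_order_cycles)
print("out-of-order:", ooo_cycles)
```

Under this model the in-order run finishes at cycle 71 and the out-of-order run at cycle 51: the 20 independent operations are hidden entirely inside the miss. With more independent work, or a wider issue width, more of the stall can be hidden.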

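Rule of thumb: The ROB size roughly bounds how far ahead the CPU can look. A stalled instruction at the head of the ROB blocks retirement, so the core can keep finding independent work only until the ROB fills. With a 512-entry ROB filling at about one instruction per cycle, that covers a stall of up to ~500 cycles; a core that renames several instructions per cycle fills the same ROB proportionally faster and covers a correspondingly shorter stall. In practice, dependency chains limit this further: for pointer-chasing workloads, a realistic "useful window" is closer to the square root of the ROB size (~23 instructions for a 512-entry ROB).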
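
The arithmetic behind this rule of thumb is easy to check. The sketch below assumes the ROB fills at a steady rate while its oldest instruction is stalled, which is a simplification. The helper names and the list of fill rates are illustrative choices, not measurements of any particular core.

```python
import math

def cycles_to_fill_rob(rob_entries, instrs_per_cycle):
    """Cycles of stall the ROB can absorb before it fills up."""
    return rob_entries / instrs_per_cycle

def pointer_chasing_window(rob_entries):
    """The square-root heuristic quoted in the text."""
    return math.sqrt(rob_entries)

rob = 512
for rate in (1, 2, 4, 6):  # assumed sustained fill rates, instructions per cycle
    cycles = cycles_to_fill_rob(rob, rate)
    print(f"{rate} instr/cycle -> ROB fills in ~{cycles:.0f} cycles")
print(f"sqrt heuristic: ~{pointer_chasing_window(rob):.1f} instructions")
```

The faster the front end fills the ROB, the shorter the stall it can cover, which is one reason wide cores need large ROBs.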

This is also why in-order cores (like ARM Cortex-A55 efficiency cores) are so much smaller. The rename unit, reservation stations, and ROB account for a large share of a big core's area and power. Apple's Firestorm core dedicates roughly 3x the silicon to its OoO engine compared to its actual execution units. The performance payoff justifies it, but only for high-performance workloads, which is why many phone chips pair big OoO cores with small in-order ones.

See it in action: Check out "Your CPU Finishes Instructions… Before They’re Supposed To" by Software Explained to see this theory applied.

Key Takeaway: Out-of-order execution transforms wasted stall cycles into useful work by dynamically finding independent instructions, using register renaming to eliminate false dependencies and a reorder buffer to preserve the illusion of sequential execution.
