Out-of-Order Execution: How CPUs Break the Rules to Go Faster

2026-04-23

Your compiler emits instructions in a specific order. Your CPU ignores it. Out-of-order (OoO) execution lets the processor find and execute independent instructions while earlier ones are stalled — waiting on cache misses, division units, or unresolved branches. The program's visible results stay in-order. The internal execution does not.

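The machinery that makes this work has three key stages: register renaming, which removes false dependencies between instructions; out-of-order issue from the reservation stations, which starts each instruction as soon as its operands are ready; and in-order retirement through the reorder buffer, which commits results in program order.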
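To make the renaming part of this machinery concrete, here is a minimal sketch of a rename table in Python. It is a toy model, not how any particular core is built: the `rename` function, the `r`/`p` register names, and the unlimited supply of physical registers are all assumptions made for illustration (real hardware has a finite physical register file and recycles entries).

```python
from itertools import count


def rename(program, num_arch_regs=4):
    """Rename architectural registers (r0, r1, ...) to physical ones (p0, p1, ...).

    program: list of (dest, src1, src2) architectural register names.
    Returns the same instructions expressed in physical registers.
    """
    fresh = count()  # assumption: an endless supply of physical registers
    # Each architectural register starts out mapped to its own physical register.
    table = {f"r{i}": f"p{next(fresh)}" for i in range(num_arch_regs)}
    renamed = []
    for dest, src1, src2 in program:
        # Sources read whichever physical register holds the latest value.
        s1, s2 = table[src1], table[src2]
        # Every write gets a brand-new physical register, so a later write to
        # the same architectural name never clobbers an earlier value.
        table[dest] = f"p{next(fresh)}"
        renamed.append((table[dest], s1, s2))
    return renamed


program = [
    ("r1", "r2", "r3"),  # r1 = r2 op r3
    ("r0", "r1", "r2"),  # r0 = r1 op r2  (true dependency on r1)
    ("r1", "r3", "r3"),  # r1 = r3 op r3  (reuses the name r1: a false dependency)
]
for inst in rename(program):
    print(inst)
# ('p4', 'p2', 'p3')
# ('p5', 'p4', 'p2')
# ('p6', 'p3', 'p3')
```

In the renamed output the third instruction shares no register with the first two, so the false dependencies on `r1` are gone and it is free to execute early. The true dependency of the second instruction on the first survives as `p4`.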

Real-world example: Consider a load that misses in the closer caches and takes about 50 cycles to come back from L3. Without OoO, the pipeline stalls at the first instruction that needs the loaded value (in the simplest designs, at the load itself), and everything behind it waits out those 50 cycles. With OoO, the CPU scans ahead, finds dozens of independent instructions (ALU ops, other loads that hit L1, address calculations), and executes them during those 50 otherwise dead cycles. On server workloads with frequent cache misses, OoO can improve throughput by 2–3x over an equivalently clocked in-order core.
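The effect can be sketched with a toy timing model. Everything here is a simplifying assumption rather than a description of real hardware: the `simulate` function is hypothetical, the core issues one instruction per cycle, the window is unlimited, and the in-order variant stalls at the first instruction whose inputs are not ready.

```python
def simulate(program, out_of_order, width=1):
    """Return the cycle at which the last result of `program` is available.

    program: list of (latency, deps), where deps are indices of earlier
    instructions whose results this instruction needs.
    """
    done_at = {}                        # index -> cycle its result is available
    pending = list(range(len(program)))
    cycle = 0
    while pending:
        issued = 0
        for i in list(pending):         # scan in program order, oldest first
            latency, deps = program[i]
            ready = all(d in done_at and done_at[d] <= cycle for d in deps)
            if ready and issued < width:
                done_at[i] = cycle + latency
                pending.remove(i)
                issued += 1
            elif not out_of_order:
                break                   # in-order: nothing may pass this instruction
        cycle += 1
    return max(done_at.values())


# One 50-cycle load miss, one instruction that uses the loaded value,
# then 20 independent single-cycle ALU ops.
program = [(50, []), (1, [0])] + [(1, [])] * 20

print(simulate(program, out_of_order=False))  # 71
print(simulate(program, out_of_order=True))   # 51
```

In this model the out-of-order run hides all 20 ALU ops inside the miss, so the total is just the miss plus the one dependent instruction. The in-order run pays for the miss first and the ALU ops afterwards.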

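Rule of thumb: The size of the reorder buffer (ROB) roughly bounds how far ahead the CPU can look. While the oldest instruction is stalled, nothing can retire, so the ROB fills at the rate new instructions enter it. With a 512-entry ROB and about one instruction entering per cycle, the CPU can cover stalls of up to ~500 cycles, provided it finds enough independent work; a core that sustains four instructions per cycle fills the same ROB in ~128 cycles. In practice, dependency chains limit this further: as a rough heuristic, the "useful window" for pointer-chasing workloads is closer to the square root of the ROB size (~23 instructions for a 512-entry ROB).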
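A quick back-of-the-envelope check of this rule of thumb, as a sketch: while the oldest instruction is stalled nothing retires, so the ROB is full after roughly (entries / instructions dispatched per cycle) cycles. The `rob_fill_cycles` helper and the dispatch widths below are illustrative assumptions, not figures for any particular core.

```python
import math


def rob_fill_cycles(rob_entries, dispatch_width):
    """Cycles until the ROB is full while its oldest entry is stalled."""
    return rob_entries / dispatch_width


for width in (1, 4, 8):
    print(width, rob_fill_cycles(512, width))
# 1 512.0
# 4 128.0
# 8 64.0

# The square-root heuristic for a 512-entry ROB.
print(round(math.sqrt(512), 1))  # 22.6
```

The wider the core, the sooner the window runs out, which is one reason wide cores tend to carry large ROBs.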

This is also why in-order cores (like ARM Cortex-A55 efficiency cores) are so much smaller. The rename unit, reservation stations, and ROB account for a large share of a big core's area and power. Apple's Firestorm core reportedly dedicates roughly 3x the silicon to its OoO engine compared to its actual execution units. The performance payoff justifies it, but only for high-performance workloads, which is why many phone chips pair big OoO cores with small efficiency cores that are often in-order.

See it in action: Check out "Your CPU Finishes Instructions… Before They’re Supposed To" by Software Explained to see this theory applied.

Key Takeaway: Out-of-order execution transforms wasted stall cycles into useful work by dynamically finding independent instructions, using register renaming to eliminate false dependencies and a reorder buffer to preserve the illusion of sequential execution.
