Daily Hardware Architecture: Reservation Stations: How CPUs Park Instructions Until Their Operands Show Up

2026-05-14

You've seen the reorder buffer track results and load/store queues track memory ops. But there's a third structure sitting between dispatch and execution that's arguably the heart of out-of-order execution: the reservation station (RS), sometimes called the scheduler or issue queue. It's where instructions wait for their inputs to become available, then pounce on a free execution port the instant they can.

The classic Tomasulo algorithm (IBM System/360 Model 91, 1967) introduced the idea: each instruction sits in a slot that holds, for every source operand, either the operand value or a tag identifying which in-flight instruction will produce it. When a result is broadcast on the common data bus (CDB), every RS entry compares its waiting tags against the broadcast tag, and any entry that matches latches in the value. When all operands are present and a suitable port is free, the entry "wakes up" and issues.
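
The tag-or-value bookkeeping described above can be sketched in a few lines of Python. This is a minimal illustration, not a model of any real core: the class names (Operand, RSEntry), the snoop method, and the tag strings are all made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Operand:
    """One source operand: either a captured value or a tag to wait on."""
    value: Optional[int] = None   # filled once the producer broadcasts
    tag: Optional[str] = None     # in-flight producer's tag, if still pending

    @property
    def ready(self) -> bool:
        return self.tag is None


@dataclass
class RSEntry:
    """Hypothetical reservation-station slot for a two-source instruction."""
    op: str
    dest_tag: str
    src1: Operand
    src2: Operand

    def snoop(self, bcast_tag: str, bcast_value: int) -> None:
        """Compare each waiting tag with the CDB broadcast; latch on a match."""
        for src in (self.src1, self.src2):
            if src.tag == bcast_tag:
                src.value = bcast_value
                src.tag = None

    @property
    def ready(self) -> bool:
        return self.src1.ready and self.src2.ready


# ADD waits on an in-flight MUL (tag "t1"); its other input is already known.
add = RSEntry("add", "t2", Operand(tag="t1"), Operand(value=5))
assert not add.ready

add.snoop("t1", 42)     # MUL completes and broadcasts (tag, value) on the CDB
assert add.ready
print(add.src1.value + add.src2.value)   # 47
```

In hardware the comparison in snoop happens in every entry in parallel, which is what makes the structure a CAM rather than a loop.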

Two main flavors in modern designs:

Distributed (per-port) RS: AMD Zen's integer side is the familiar example. Each execution port (or small cluster) has its own queue. Simpler wakeup logic and less wire fanout, but instructions get bound to a queue at dispatch: if your ALU queues fill up, integer math stalls even if the FP queues are idle.
Unified RS: Intel's approach from P6 through Skylake, whose scheduler is a 97-entry unified pool feeding 8 ports. More flexible load balancing, but the wakeup CAM (content-addressable memory) gets brutally expensive as it grows. That's why these structures top out around 100 entries even though the ROB is far larger (224 entries on Skylake, 350+ on newer cores).
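
To make the load-balancing tradeoff concrete, here is a rough Python sketch that counts how many instructions of a burst get accepted at dispatch under each organization. The queue sizes and the instruction mix are illustrative only, and the model deliberately ignores instructions issuing and freeing entries.

```python
def dispatch_distributed(ops, queue_capacity):
    """Each op class has its own queue; dispatch is in order, so the first
    op that finds its queue full blocks everything behind it."""
    occupancy = {kind: 0 for kind in queue_capacity}
    accepted = 0
    for kind in ops:
        if occupancy[kind] == queue_capacity[kind]:
            break
        occupancy[kind] += 1
        accepted += 1
    return accepted


def dispatch_unified(ops, total_capacity):
    """Any op can take any free entry in the shared pool."""
    return min(len(ops), total_capacity)


# Illustrative sizes: two 8-entry queues versus one 16-entry pool.
burst = ["int"] * 12 + ["fp"] * 2
print(dispatch_distributed(burst, {"int": 8, "fp": 8}))  # 8
print(dispatch_unified(burst, 16))                        # 14
```

Same total capacity, but the distributed design stalls on the ninth integer op while the FP queue sits empty.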

The wakeup–select loop is one of the tightest timing paths in a CPU. Every cycle: (1) broadcast completing tags, (2) every entry checks if all operands are ready, (3) an arbiter picks at most one ready instruction per port. For dependent single-cycle ops to issue back to back, all three steps have to fit in one clock. This loop is why scheduler size, not ROB size, often gates IPC. It is also why dependents of a load are woken speculatively, before the L1 hit is confirmed, which leads to replays when the load misses.
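
The three steps can be written out as one function per scheduler cycle. This is a software sketch under assumptions: the Entry fields are hypothetical, each instruction is pre-bound to a port, and oldest-first selection is one common policy rather than a claim about any particular core.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    age: int                      # dispatch order; lower is older
    dest: str                     # tag this instruction will produce
    port: str                     # execution port it is bound to
    waiting: set = field(default_factory=set)   # source tags still outstanding


def wakeup_select(entries, completing_tags):
    """One scheduler cycle: broadcast tags, find ready entries, pick per port."""
    # Steps 1 and 2 (wakeup): clear matching tags; ready means none remain.
    for e in entries:
        e.waiting -= completing_tags
    ready = [e for e in entries if not e.waiting]

    # Step 3 (select): the oldest ready entry wins each port.
    picked = {}
    for e in sorted(ready, key=lambda e: e.age):
        picked.setdefault(e.port, e)

    issued = list(picked.values())
    for e in issued:
        entries.remove(e)         # issuing frees the scheduler slot
    return issued


rs = [
    Entry(0, "t0", "p0", {"ld"}),
    Entry(1, "t1", "p0", {"ld"}),
    Entry(2, "t2", "p1", {"t0"}),
]
print([e.dest for e in wakeup_select(rs, {"ld"})])   # ['t0']
print([e.dest for e in wakeup_select(rs, {"t0"})])   # ['t1', 't2']
```

In the first cycle two entries become ready but share port p0, so only the older one issues. In the second, the loser finally goes, alongside the instruction that was waiting on t0.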

Concrete example: On Skylake, a load that hits L1 has a 4-cycle load-to-use latency in the best case (5 with more complex addressing). The scheduler optimistically wakes dependents 3 cycles after issue, assuming an L1 hit, so that they reach execution just as the data arrives. If the load actually misses, every dependent instruction already issued gets replayed: it is re-executed when the data finally arrives. Tight loops chasing pointers can burn 20%+ of issue slots on replays alone.
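
A back-of-the-envelope model shows how replays eat issue slots. Everything here is assumed for illustration: the function name, the loop shape (4 instructions per iteration with one load), the 50% miss rate, and the 2 dependents issued before the miss is detected. None of these are measured figures.

```python
def replay_fraction(instrs_per_iter, miss_rate, dependents_in_shadow):
    """Share of issue slots spent on replays, per loop iteration.

    Each miss forces the dependents that already issued to issue again,
    so replays add miss_rate * dependents_in_shadow extra slots on average.
    """
    replays = miss_rate * dependents_in_shadow
    return replays / (instrs_per_iter + replays)


# Hypothetical pointer-chasing loop: 4 instructions, half the loads miss L1,
# and 2 dependents have issued by the time each miss is known.
print(f"{replay_fraction(4, 0.5, 2):.0%}")   # 20%
```

The fraction climbs quickly as the loop body shrinks, since the replayed instructions become a larger part of the total work.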

Rule of thumb: RS entries ≈ ROB/3 to ROB/4, though real designs vary (Skylake's 97 of 224 is closer to ROB/2). If instructions waiting on a slow operation (divide, L2 miss) pile up until they occupy every scheduler entry, dispatch stalls even though the ROB still has plenty of free slots. On Intel cores that expose the event, perf stat -e resource_stalls.rs tells you when it's happening; check perf list first, since event names vary by microarchitecture.
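
The stall is easy to see in a toy model where every dispatched instruction waits on one slow producer, so nothing issues and scheduler entries never free up. The function name is hypothetical, the sizes are Skylake's (97 RS entries, 224 ROB entries), and the 4-wide dispatch is an assumption for the example.

```python
def cycles_until_dispatch_stalls(rs_size, rob_size, dispatch_width):
    """Dispatch instructions that all wait on one slow producer.

    Returns (cycles until dispatch stalls, free ROB entries at that point).
    """
    rs = rob = cycles = 0
    while rs < rs_size and rob < rob_size:
        n = min(dispatch_width, rs_size - rs, rob_size - rob)
        rs += n          # each instruction takes a scheduler entry...
        rob += n         # ...and a ROB entry
        cycles += 1
    return cycles, rob_size - rob


cycles, rob_free = cycles_until_dispatch_stalls(
    rs_size=97, rob_size=224, dispatch_width=4
)
print(cycles, rob_free)   # 25 127
```

After 25 cycles the scheduler is full and dispatch stops with 127 ROB entries still unused, far less time than an L2 miss takes to resolve.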

Key Takeaway: The reservation station is the wakeup-and-select engine that decides which ready instruction issues each cycle — and its size, not the ROB's, usually sets the real out-of-order window.
