2026-05-14
You've seen the reorder buffer track results and load/store queues track memory ops. But there's a third structure sitting between dispatch and execution that's arguably the heart of out-of-order: the reservation station (RS), sometimes called the scheduler or issue queue. It's where instructions wait for their inputs to become available, then pounce on a free execution port the instant they can.
The classic Tomasulo algorithm (IBM 360/91, 1967) introduced the idea: each instruction sits in a slot holding either the operand value or a tag identifying which in-flight instruction will produce it. When a result is broadcast on the common data bus (CDB), every RS entry compares its waiting tags against the broadcast tag. Matches latch in the value. When all operands are present and a suitable port is free, the entry "wakes up" and issues.
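The tag-matching step above can be sketched in a few lines of Python. This is a minimal illustration, not a model of any real core: the `Operand` and `RSEntry` classes, the `snoop` method, and the tag names such as `"rs3"` are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Operand:
    value: Optional[int] = None  # filled in once the producer broadcasts
    tag: Optional[str] = None    # producer's tag while the value is pending

    @property
    def ready(self) -> bool:
        return self.tag is None


@dataclass
class RSEntry:
    op: str
    src1: Operand
    src2: Operand

    def snoop(self, cdb_tag: str, cdb_value: int) -> None:
        """Compare each waiting operand against the broadcast tag."""
        for src in (self.src1, self.src2):
            if src.tag == cdb_tag:
                src.value = cdb_value  # latch the broadcast value
                src.tag = None         # operand is no longer waiting

    @property
    def ready(self) -> bool:
        return self.src1.ready and self.src2.ready


# One operand already holds a value, the other waits on producer "rs3".
entry = RSEntry("add", Operand(value=5), Operand(tag="rs3"))
before = entry.ready
entry.snoop("rs7", 99)   # unrelated broadcast: no match, nothing latched
after_miss = entry.ready
entry.snoop("rs3", 37)   # matching broadcast: value latched
after_hit = entry.ready
operands = (entry.src1.value, entry.src2.value)
```

In hardware the comparison in `snoop` happens in every entry in parallel, which is what makes large schedulers expensive; the loop here only mimics the logic, not the cost.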
Two main flavors in modern designs:
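The wakeup–select loop is one of the hardest timing paths in an out-of-order core. Every cycle: (1) broadcast the tags of completing instructions, (2) every entry checks whether all of its operands are ready, (3) an arbiter picks ready instructions, at most one per free port. All of this has to fit in one clock if dependent single-cycle ALU ops are to issue back to back. Because the loop's delay grows with the number of entries, schedulers stay small, which is why scheduler size, not ROB size, often gates IPC. It is also why dependents of a load are woken speculatively, before the cache hit is confirmed, which leads to replay when the load misses L1.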
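The three steps of one scheduler cycle might look roughly like this in Python. It is a sketch under stated assumptions: the `wakeup_select` function, the dict layout of an entry, and the oldest-first, one-instruction-per-port arbitration policy are illustrative choices, not a description of any particular design.

```python
def wakeup_select(entries, completing_tags, free_ports):
    """Run one scheduler cycle and return the names of issued instructions.

    entries: list of dicts, oldest first, each with
             'name', 'waiting' (set of pending tags), and 'port'.
    The list is mutated: issued entries are removed.
    """
    # Steps 1 and 2 (wakeup): every entry drops the tags broadcast this cycle.
    for e in entries:
        e["waiting"] -= completing_tags

    # Step 3 (select): oldest-first, at most one instruction per free port.
    # The arbitration policy is an assumption made for this sketch.
    ports = set(free_ports)
    issued = []
    for e in list(entries):  # iterate over a copy so removal is safe
        if not e["waiting"] and e["port"] in ports:
            ports.remove(e["port"])
            issued.append(e["name"])
            entries.remove(e)
    return issued


entries = [
    {"name": "i1", "waiting": {"t0"}, "port": 0},
    {"name": "i2", "waiting": set(), "port": 0},
    {"name": "i3", "waiting": {"t0"}, "port": 1},
    {"name": "i4", "waiting": {"t9"}, "port": 1},
]
cycle1 = wakeup_select(entries, {"t0"}, {0, 1})  # i1 and i3 win their ports
cycle2 = wakeup_select(entries, set(), {0, 1})   # i2 issues; i4 still waits
```

Note that `i2` was ready in the first cycle but lost port 0 to the older `i1`: being ready is necessary for issue, not sufficient.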
Concrete example: On Skylake, a load that hits L1 has a 4-cycle load-to-use latency. The scheduler optimistically wakes dependents 3 cycles after the load issues, assuming an L1 hit, so that they reach execution just as the data arrives. If the load actually misses, every dependent instruction that has already issued gets replayed, that is, re-executed when the data finally arrives. Tight pointer-chasing loops can burn 20%+ of issue slots on replays alone.
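To make the timing concrete, here is a toy model of speculative wakeup and replay. The 4-cycle hit latency and 3-cycle wakeup offset come from the text; everything else is an assumption for illustration, including the `miss_latency` default of 12 cycles, the function names, and the simplification that each direct dependent is replayed exactly once per miss.

```python
L1_HIT_LATENCY = 4  # cycles, from the text
SPEC_WAKEUP = 3     # dependents are woken this many cycles after load issue


def dependent_issue_cycles(load_issue, hit, miss_latency=12):
    """Cycles at which one dependent issues.

    miss_latency is a hypothetical load-to-use latency for an L1 miss,
    not a measured Skylake figure.
    """
    first_try = load_issue + SPEC_WAKEUP
    if hit:
        return [first_try]
    # Assumption: the retry keeps the same lead over data arrival
    # that the speculative wakeup has in the hit case.
    lead = L1_HIT_LATENCY - SPEC_WAKEUP
    return [first_try, load_issue + miss_latency - lead]


def replay_fraction(miss_rate, dependents_per_load=1):
    """Share of issue slots spent on replays in a toy loop of one load
    plus its direct dependents, each replayed once per miss."""
    useful = 1 + dependents_per_load
    replays = miss_rate * dependents_per_load
    return replays / (useful + replays)


hit_case = dependent_issue_cycles(0, hit=True)    # one issue, no waste
miss_case = dependent_issue_cycles(0, hit=False)  # wasted issue, then retry
wasted = replay_fraction(0.5)
```

In this model the first issue in `miss_case` is a wasted slot. Real replay schemes can also re-execute instructions further down the dependency chain, so the toy fraction is best read as a lower bound on the effect, not a prediction.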
Rule of thumb: RS entries ≈ ROB/3 to ROB/4. If your hot loop has more than ~30 in-flight dependent instructions waiting on a slow operation (divide, L2 miss), you'll fill the scheduler and stall dispatch even though the ROB has hundreds of empty slots. On Intel cores that expose the event, perf stat -e resource_stalls.rs tells you when it's happening; event names vary by microarchitecture, so check perf list on your machine first.
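A back-of-the-envelope simulation shows how the scheduler fills long before the ROB does. This is a deliberately crude sketch: the `dispatch_stall` function is hypothetical, the sizes (60 RS entries, 192 ROB entries, 4-wide dispatch, a 40-cycle slow operation) are illustrative values, and it assumes every instruction behind the slow operation depends on it and that the slow operation frees its RS entry when it issues.

```python
def dispatch_stall(rs_size, rob_size, slow_latency, dispatch_width=4):
    """Count cycles in which dispatch is fully blocked while a slow op runs.

    Returns (stall_cycles, rob_entries_used_when_the_slow_op_completes).
    """
    rs_used = 0   # dependents parked in the scheduler
    rob_used = 1  # the slow op itself occupies one ROB entry
    stall_cycles = 0
    for _ in range(slow_latency):
        # Dispatch is limited by width and by free RS and ROB entries.
        room = min(dispatch_width, rs_size - rs_used, rob_size - rob_used)
        if room == 0:
            stall_cycles += 1
        rs_used += room
        rob_used += room
    return stall_cycles, rob_used


stalls, rob_used = dispatch_stall(rs_size=60, rob_size=192, slow_latency=40)
rob_free = 192 - rob_used
```

With these numbers the scheduler fills after 15 cycles, dispatch sits idle for the remaining 25, and the ROB still has well over a hundred free entries when the slow operation completes.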
