2026-05-19
Out-of-order execution gets the headlines, but the actual decision-making organ is the issue queue (also called the scheduler or reservation station). It's the small, brutally hot piece of silicon that decides, every cycle, which of dozens of waiting instructions get to fire on which execution port. Get this wrong and your 6-wide superscalar runs like a 2-wide.
What's inside. Each entry holds a decoded µop, a destination tag, two source tags, and two "ready" bits. When a producer's destination tag is broadcast (its result travels on the bypass network alongside), every entry compares its source tags against the broadcast tag (CAM lookup). If a source matches, that source's ready bit flips. When both ready bits are set, the instruction has been woken up and becomes a candidate for select.
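Here's a minimal sketch of that wakeup step in Python. The `Entry` class, the `broadcast` helper, and the tag numbers are all made up for illustration; real hardware does every comparison in parallel in a CAM, while this loops over entries one at a time.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    uop: str      # decoded µop (just a name here)
    dest: int     # destination tag
    srcs: tuple   # two source tags
    ready: list = field(default_factory=lambda: [False, False])

    def wakeup(self, tag: int) -> None:
        # Compare each source tag against the broadcast tag.
        for i, src in enumerate(self.srcs):
            if src == tag:
                self.ready[i] = True

    @property
    def selectable(self) -> bool:
        # Both ready bits set: the entry may compete in select.
        return all(self.ready)


def broadcast(queue, tag):
    # Hypothetical helper: every entry sees every broadcast tag.
    for entry in queue:
        entry.wakeup(tag)


queue = [
    Entry("add", dest=10, srcs=(1, 2)),
    Entry("mul", dest=11, srcs=(10, 3)),  # depends on the add's result
]
broadcast(queue, 1)
broadcast(queue, 2)
print([e.selectable for e in queue])
```

After the two broadcasts the `add` has both sources ready, while the `mul` is still waiting on tags 10 and 3.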
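Wakeup vs. select. These are two distinct phases, usually crammed into one cycle: wakeup is the tag broadcast and compare that sets ready bits, and select is the arbitration that picks, for each port, which ready entry actually issues. They share a cycle so that a single-cycle instruction's dependent can issue in the very next cycle.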
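The select half can be sketched as a small arbiter. This version assumes an oldest-first policy, which is a common choice but not something the text above specifies; the `select` function, the tuple layout, and the port numbers are hypothetical.

```python
def select(ready_entries, num_ports):
    """Grant at most one ready entry per port, oldest first.

    ready_entries: list of (age, name, allowed_ports); lower age means older.
    Returns a dict mapping port number to the granted entry's name.
    """
    grants = {}
    for age, name, ports in sorted(ready_entries, key=lambda e: e[0]):
        for port in sorted(ports):
            if port < num_ports and port not in grants:
                grants[port] = name
                break  # this entry has been issued
    return grants


ready = [
    (3, "add", {0, 1}),
    (1, "mul", {0}),
    (2, "load", {2}),
    (4, "sub", {0}),   # youngest, and port 0 will already be taken
]
grants = select(ready, num_ports=3)
print(grants)
```

The `sub` is ready but loses arbitration for port 0 to the older `mul`, so it waits for the next cycle. That gap between "ready" and "issued" is exactly what select exists to resolve.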
Unified vs. distributed. Intel's P-cores use a unified scheduler (~97 entries on Golden Cove) feeding the compute ports. AMD Zen uses distributed schedulers: separate queues for integer (4×24), FP (64), and memory (28×2), with exact sizes varying by generation. Unified gives better utilization; distributed gives faster wakeup-select cycles because each queue is smaller. The tradeoff is essentially CAM width vs. load balancing.
Speculative wakeup and the replay problem. Here's the dirty secret: schedulers guess that loads will hit L1. A dependent instruction is woken up on the assumption that the load's data will arrive after the L1-hit latency (4-5 cycles), before the hit or miss is actually known, so it can execute the moment the data hits the bypass network. If the load misses L1, every speculatively-woken dependent must be replayed: pulled back into the queue and re-issued. Long chains of dependent loads on misses can cause replay storms that tank IPC.
Real example: pointer-chasing through a linked list. Each load depends on the previous. If everything hits L1 (4-cycle latency), you sustain one load every 4 cycles. One L1 miss that hits in L2 (12 cycles) doesn't just stall the chain. It also triggers replay of every speculatively-scheduled dependent, costing you maybe 6-8 wasted issue slots downstream.
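To put rough numbers on the pointer chase, here's a toy model. It uses the 4-cycle and 12-cycle latencies from the text, and it assumes each load that takes longer than the scheduler guessed wastes exactly one issue slot (its direct dependent). Real machines also cancel µops woken further down the chain, so treat the replay count as a lower bound. The `chase` function is hypothetical.

```python
L1_HIT = 4    # cycles, as quoted in the text
L2_HIT = 12   # cycles, as quoted in the text


def chase(latencies, assumed=L1_HIT):
    """Return (total_cycles, replays) for a chain of dependent loads.

    latencies: actual latency of each load, in chain order.
    assumed:   latency the scheduler guesses when waking the dependent.
    """
    total = 0
    replays = 0
    for i, actual in enumerate(latencies):
        has_dependent = i + 1 < len(latencies)
        if actual > assumed and has_dependent:
            # The next load was issued early, found no data, and must re-issue.
            replays += 1
        # The chain is serial, so latencies simply add up.
        total += actual
    return total, replays


all_hits = chase([L1_HIT] * 8)
one_miss = chase([L1_HIT] * 3 + [L2_HIT] + [L1_HIT] * 4)
print(all_hits, one_miss)
```

Eight hits take 32 cycles with no replays. Swapping one hit for an L2 access stretches the chain to 40 cycles and adds at least one wasted issue slot on top of the stall.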
Rule of thumb: issue queue capacity ≈ issue width × average µop latency × 4. A 6-wide machine with 5-cycle average latency needs ~120 entries to avoid stalling on full queues. Below that, the front-end has to throttle.
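The rule of thumb is a one-liner to compute. In this sketch the function name and the `headroom` parameter are my own labels for the ×4 factor, not established terminology.

```python
def issue_queue_entries(issue_width, avg_latency, headroom=4):
    """Rule-of-thumb capacity: issue width x average µop latency x headroom."""
    return issue_width * avg_latency * headroom


needed = issue_queue_entries(issue_width=6, avg_latency=5)
print(needed)
```

For the 6-wide, 5-cycle case this gives 120. Width times latency alone (30) is only the number of µops in flight at steady state; the remaining factor is slack for instructions that sit waiting on operands.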
