Daily Hardware Architecture: The Allocator and Free List: How CPUs Hand Out Resources to Incoming Instructions

2026-05-20

Every cycle, a modern CPU's frontend dumps a fistful of decoded µops on the backend's doorstep and says "find them homes." The allocator (often folded into the rename/dispatch stage) is the bureaucrat that hands each µop the resources it needs to enter the out-of-order engine: a reorder buffer (ROB) entry, a reservation station slot, a physical register if it writes one, and a load queue or store queue entry if it touches memory. If any one of those resources is exhausted, allocation stops, the µop queue backs up, and the frontend stalls behind it. The simple mental model, and the one used here, is all-or-nothing per cycle: no half-allocation, no partial dispatch.
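To make the all-or-nothing rule concrete, here is a minimal Python sketch of the check an allocator performs each cycle. The resource names (`rob`, `rs`, `int_reg`, `load_q`), the counts, and the `Uop` and `try_allocate` helpers are all invented for illustration; real hardware does this with comparators and counters, not dictionaries.

```python
from dataclasses import dataclass


@dataclass
class Uop:
    name: str
    needs: dict  # hypothetical: resource name -> entries required


def try_allocate(group, free):
    """Allocate the whole group or nothing; returns True on success.

    `free` maps resource name -> free entries and is updated only on success.
    """
    # Sum the group's demand for each resource.
    demand = {}
    for uop in group:
        for resource, count in uop.needs.items():
            demand[resource] = demand.get(resource, 0) + count
    # A shortfall in any single resource blocks the entire group this cycle.
    if any(free.get(r, 0) < n for r, n in demand.items()):
        return False
    for r, n in demand.items():
        free[r] -= n
    return True


# Illustrative numbers only: plenty of ROB/RS space, one integer register left.
free = {"rob": 4, "rs": 4, "int_reg": 1, "load_q": 2}
group = [
    Uop("add", {"rob": 1, "rs": 1, "int_reg": 1}),
    Uop("load", {"rob": 1, "rs": 1, "int_reg": 1, "load_q": 1}),
]
stalled = not try_allocate(group, free)
```

With only one integer register free, the two-µop group is refused even though every other resource has room, and `free` is left untouched.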

The trickiest resource is the physical register. Intel's Golden Cove, for example, has ~280 physical integer registers backing 16 architectural ones. The allocator pulls free registers from a free list, typically a circular FIFO of register IDs that aren't currently mapped to any architectural register or in-flight instruction. A physical register is not freed when the instruction that wrote it commits; it is freed when the next instruction to write the same architectural register commits. At that point nothing can still read the old value, so the old physical register returns to the free list.
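The allocate-at-rename, free-at-commit cycle can be sketched in a few lines of Python. This is a toy model under stated assumptions: the `FreeListRenamer` class and its method names are hypothetical, a `deque` stands in for the circular FIFO, and branch-misprediction recovery is ignored entirely.

```python
from collections import deque


class FreeListRenamer:
    """Toy rename stage: a FIFO free list plus a rename map."""

    def __init__(self, num_arch, num_phys):
        # Architectural register i starts out mapped to physical register i.
        self.rename_map = list(range(num_arch))
        # Every other physical register starts on the free list.
        self.free_list = deque(range(num_arch, num_phys))

    def rename_dest(self, arch_reg):
        """Give arch_reg a fresh physical register.

        Returns (new_phys, old_phys), or None when the free list is empty,
        which is the case where the allocator would stall.
        """
        if not self.free_list:
            return None
        new_phys = self.free_list.popleft()  # read from the head
        old_phys = self.rename_map[arch_reg]
        self.rename_map[arch_reg] = new_phys
        return new_phys, old_phys

    def commit(self, old_phys):
        """At commit the previous mapping is dead; recycle it at the tail."""
        self.free_list.append(old_phys)


r = FreeListRenamer(num_arch=4, num_phys=6)
first = r.rename_dest(0)   # takes p4, remembers p0 as the old mapping
second = r.rename_dest(0)  # takes p5, remembers p4
third = r.rename_dest(1)   # free list is empty, so this returns None
r.commit(first[1])         # first write commits and releases p0
fourth = r.rename_dest(1)  # p0 is reused immediately
```

Note that each instruction carries its `old_phys` along until commit. That is the register it releases, not the one it was just given.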

The free list itself is usually a small SRAM with head and tail pointers. Allocating N registers per cycle means reading N entries from the head; freeing N means writing N to the tail. This sounds trivial until you realize a 6-wide machine might allocate 6 and free 6 in the same cycle — that's 12 ports on a structure that needs to stay fast. Most designs split it into banks or use a bitmap (one bit per physical register) with priority encoders to find free entries, trading area for port count.
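The bitmap alternative is easy to sketch in software. The version below is an illustration only: `BitmapFreeList` is a made-up name, a Python integer plays the role of the bit vector, and the loop that repeatedly isolates the lowest set bit stands in for a chain of priority encoders.

```python
class BitmapFreeList:
    """Free list as a bitmap: bit i set means physical register i is free."""

    def __init__(self, num_phys, num_arch):
        # Registers num_arch .. num_phys-1 start out free.
        self.bits = ((1 << num_phys) - 1) & ~((1 << num_arch) - 1)

    def allocate(self, n):
        """Pick the n lowest-numbered free registers, or None if too few."""
        if bin(self.bits).count("1") < n:
            return None
        picked = []
        bits = self.bits
        for _ in range(n):
            lowest = bits & -bits                    # isolate lowest set bit
            picked.append(lowest.bit_length() - 1)   # convert to register ID
            bits &= ~lowest                          # hide it from the next pick
        self.bits = bits
        return picked

    def free(self, regs):
        """Mark the given registers free again by setting their bits."""
        for reg in regs:
            self.bits |= 1 << reg


fl = BitmapFreeList(num_phys=8, num_arch=4)
a = fl.allocate(2)  # lowest two free registers
fl.free([4])        # register 4 is released
b = fl.allocate(3)  # takes everything that is left
c = fl.allocate(1)  # nothing free, so None
```

Freeing is just setting bits, which needs no ordering and no tail pointer. The cost moves to the allocation side, where each extra register per cycle needs another encoder over the whole vector.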

Concrete example: Intel's Golden Cove allocates up to 6 µops per cycle. It has separate register files, each with its own free list: integer (~280 registers), vector (~332 registers), and AVX-512 mask (predicate) registers. If you're running AVX-512-heavy code, you can exhaust the vector free list while the integer one sits half-empty, and the frontend stalls anyway. This is why mixing scalar and vector work can sometimes achieve better throughput than pure vector code: the two kinds of work draw on separate resource pools.

Rule of thumb: Maximum useful in-flight instructions ≈ min(ROB size, physical register file size − architectural registers). On Golden Cove, the ROB has 512 entries but the integer PRF has only ~280 registers, which leaves about 264 for renaming once the 16 architectural registers are accounted for. For integer-heavy code you'll hit register pressure before ROB pressure. The PRF, not the ROB, is your effective window.
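The rule of thumb is simple enough to write as a one-line function. The `effective_window` name is hypothetical, and the Golden Cove figures are the approximate ones quoted above, so treat the result as an estimate.

```python
def effective_window(rob_size, prf_size, arch_regs):
    """Rule-of-thumb bound on useful in-flight instructions.

    Assumes every in-flight instruction writes one register of this class.
    Stores and branches do not, so real code can sit somewhat above this
    bound before the register file runs dry.
    """
    return min(rob_size, prf_size - arch_regs)


# Approximate Golden Cove integer-side figures from the text.
golden_cove_int = effective_window(rob_size=512, prf_size=280, arch_regs=16)
```

For these inputs the register term (280 − 16) is far smaller than the ROB term, so the register file sets the limit.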

Resource exhaustion shows up in perf counters. Older Intel cores expose events such as RESOURCE_STALLS.RS and RESOURCE_STALLS.ROB; newer cores and other vendors have their own equivalents, so check the event list for your CPU. If allocator stalls dominate, unrolling the loop further won't help. You need to shorten dependency chains so instructions retire and free their registers sooner.

Key Takeaway: The allocator is an all-or-nothing gate at the OoO engine's entrance — and physical registers, not the ROB, are usually the resource that runs out first.
