Daily Hardware Architecture: The Banked Register File: How CPUs Fake More Read Ports Than They Actually Have


2026-06-05

A 6-wide superscalar CPU theoretically needs 12 read ports on its register file (two operands per instruction). But each port adds wires and multiplexers to every cell, so the array's area grows roughly as O(ports²). A monolithic 12-port SRAM would be slower than the ALUs it feeds. So designers cheat: they bank the register file into smaller arrays, each with fewer ports, and pray that simultaneous reads spread across the banks.

The trick: split 128 physical registers into, say, 4 banks of 32 registers each, where bank assignment is determined by low-order bits of the physical register number. Each bank has only 3 read ports instead of 12. Total read bandwidth stays at 12 reads/cycle, as long as no more than 3 of the reads issued in the same cycle target the same bank.
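
The mapping can be sketched in a few lines of Python. This is a minimal illustration, assuming the 4-bank, 3-ports-per-bank layout from the example; the function names are hypothetical, and it simplifies by charging one port per read even when two instructions read the same register.

```python
from collections import Counter

NUM_BANKS = 4        # assumed: 128 physical registers split into 4 banks
PORTS_PER_BANK = 3   # assumed: read ports available on each bank


def bank_of(preg: int) -> int:
    """Bank index taken from the low-order bits of the physical register number."""
    return preg & (NUM_BANKS - 1)  # valid because NUM_BANKS is a power of two


def oversubscribed_banks(source_regs: list[int]) -> dict[int, int]:
    """Return {bank: excess reads} for banks asked for more reads than they have ports."""
    reads_per_bank = Counter(bank_of(r) for r in source_regs)
    return {
        bank: count - PORTS_PER_BANK
        for bank, count in reads_per_bank.items()
        if count > PORTS_PER_BANK
    }


# Twelve reads spread evenly: three per bank, so every port is busy and nothing stalls.
balanced = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
# Four reads that all map to bank 0: one more than its three ports can serve.
clustered = [0, 4, 8, 12]

print(oversubscribed_banks(balanced))   # {}
print(oversubscribed_banks(clustered))  # {0: 1}
```

Note that the clustered case uses only 4 of the 12 available ports and still stalls, which is the whole risk of banking.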

When a cycle's reads ask a bank for more than it has ports, that's a bank conflict, and at least one of the instructions has to wait a cycle. The hardware tries to avoid this by tracking bank assignments during register renaming and steering allocations to balance the banks.
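
One way to picture the stall is a per-cycle arbiter that hands out read ports oldest-first. The sketch below is an assumption about how such arbitration could work, not a description of any shipping core; `select_for_issue` and the 4-bank, 3-port parameters are illustrative.

```python
NUM_BANKS = 4        # assumed bank count
PORTS_PER_BANK = 3   # assumed read ports per bank


def select_for_issue(ready):
    """Pick instructions for one cycle.

    ready: list of (name, [source physical registers]), oldest first.
    Returns (issued, deferred) lists of instruction names.
    """
    free_ports = [PORTS_PER_BANK] * NUM_BANKS
    issued, deferred = [], []
    for name, sources in ready:
        # Count how many ports this instruction needs on each bank.
        needed = [0] * NUM_BANKS
        for reg in sources:
            needed[reg % NUM_BANKS] += 1
        if all(needed[b] <= free_ports[b] for b in range(NUM_BANKS)):
            for b in range(NUM_BANKS):
                free_ports[b] -= needed[b]
            issued.append(name)
        else:
            deferred.append(name)  # bank conflict: retry next cycle
    return issued, deferred


ready = [
    ("add", [0, 4]),   # both sources in bank 0
    ("sub", [8, 1]),   # bank 0 and bank 1
    ("mul", [12, 2]),  # bank 0 again: would be the fourth read of bank 0
    ("xor", [5, 6]),   # banks 1 and 2
]
issued, deferred = select_for_issue(ready)
print(issued)    # ['add', 'sub', 'xor']
print(deferred)  # ['mul']
```

Here `mul` loses a cycle even though banks 1 through 3 still have free ports, while the younger `xor` slips past it.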

Real example: The Alpha 21264 (1998) was famous for taking a related idea to an extreme. Rather than splitting the registers, it duplicated the integer register file into two full copies, one per execution cluster, each holding all 80 registers and each having 4 read ports. A result produced in one cluster took an extra cycle to reach the other cluster's copy. The compiler and scheduler worked together to keep dependent instructions in the same cluster. AMD's Zen cores use a similar banked-and-clustered approach for their FP register file; bank conflicts on the FP side are a measurable performance counter event.

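Rule of thumb for port scaling: SRAM area scales as roughly (read_ports + write_ports)². Going from a 6-port to a 12-port register file isn't 2× the area; it's closer to 4×, and the access time grows too. Splitting the file into N arrays with P/N ports each shrinks the per-cell cost by N². If each array is a full copy of the registers, total area is roughly N × (P/N)² = P²/N, so 4 copies save ~75%. If each array is a bank holding only 1/N of the registers, the total drops to roughly P²/N², so in this idealized model a 4-bank split saves ~94% of the storage area for the same aggregate bandwidth, at the cost of conflict stalls and the routing needed to steer each read to its bank.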
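
The arithmetic is easy to check numerically. The sketch below assumes the quadratic rule of thumb holds exactly and that area is simply entries × ports², ignoring decoders, routing, and any ports that are not divided. The N × (P/N)² expression corresponds to counting every array at full size; when each bank holds only its share of the registers, the same rule gives a smaller figure. `relative_area` is a hypothetical helper.

```python
def relative_area(entries: int, ports: int) -> int:
    """Rule-of-thumb storage area: one unit per cell, scaled by ports squared."""
    return entries * ports ** 2


REGS = 128  # assumed total physical registers, as in the running example

six_port = relative_area(REGS, 6)
twelve_port = relative_area(REGS, 12)
print(twelve_port / six_port)  # 4.0

# Four banks, each holding a quarter of the registers with 3 ports.
banked = 4 * relative_area(REGS // 4, 3)
# Four full copies, each holding every register with 3 ports.
replicated = 4 * relative_area(REGS, 3)

print(banked / twelve_port)      # 0.0625
print(replicated / twelve_port)  # 0.25
```

Real designs land somewhere less favorable than these ratios, since the crossbar between banks and execution units is not free.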

The downside is uneven utilization. If a hot loop happens to keep allocating registers that hash to bank 0, you get persistent conflicts even when banks 1–3 sit idle. Modern renamers include bank-aware allocation heuristics — preferring free physical registers in underused banks when the choice is otherwise arbitrary. It's the same idea as NUMA-aware memory placement, just at nanosecond scale inside a single core.
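
A bank-aware allocator can be sketched as a free list kept per bank. The details of real renamer heuristics are not described here, so treat this as one plausible policy under stated assumptions: it balances by how many registers are live in each bank, and the class name and 4-bank layout are hypothetical.

```python
NUM_BANKS = 4  # assumed bank count


class BankAwareFreeList:
    """Hypothetical free list that hands out registers from the least-occupied bank."""

    def __init__(self, num_regs: int):
        # One free list per bank, using the low-order-bits mapping.
        self.free = [[] for _ in range(NUM_BANKS)]
        for reg in range(num_regs):
            self.free[reg % NUM_BANKS].append(reg)
        self.in_use = [0] * NUM_BANKS

    def allocate(self) -> int:
        candidates = [b for b in range(NUM_BANKS) if self.free[b]]
        if not candidates:
            raise RuntimeError("no free physical registers")
        # Prefer the bank with the fewest live registers; ties go to the lowest index.
        bank = min(candidates, key=lambda b: self.in_use[b])
        self.in_use[bank] += 1
        return self.free[bank].pop(0)

    def release(self, reg: int) -> None:
        bank = reg % NUM_BANKS
        self.in_use[bank] -= 1
        self.free[bank].append(reg)


free_list = BankAwareFreeList(128)
allocated = [free_list.allocate() for _ in range(8)]
print(allocated)         # [0, 1, 2, 3, 4, 5, 6, 7]
print(free_list.in_use)  # [2, 2, 2, 2]

# Free both bank-0 registers; the next two allocations are steered back to bank 0.
free_list.release(0)
free_list.release(4)
print(free_list.allocate(), free_list.allocate())  # 8 12
```

Balancing by live-register count is only a proxy; what actually matters is how often each bank is read, which a hardware heuristic might estimate differently.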


Key Takeaway: Wide CPUs can't afford true multi-ported register files, so they bank the storage and hope parallel reads land in different banks — when they don't, you pay a stall.
