2026-05-24
You've seen how CPUs rename registers, schedule instructions, and write results back. But there's a problem: if every instruction had to write its result to the physical register file and the next instruction had to read it back out, you'd burn at least 2 cycles between dependent operations. Dependent chains are everywhere (loop counters, pointer chasing, running sums), so on a modern CPU that would be catastrophic. The bypass network (also called the forwarding network or operand bypass) is the dense web of wires that lets a freshly computed result skip the register-file round trip and feed directly into the next instruction's execution unit.
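To put rough numbers on that, here is a back-of-the-envelope sketch of a dependent chain. The function name `chain_cycles` and the 2-cycle register-file round trip are illustrative assumptions, not measurements from any real core.

```python
def chain_cycles(n_ops: int, exec_latency: int = 1, regfile_round_trip: int = 0) -> int:
    """Cycles to finish n_ops instructions where each one needs the previous result."""
    # Every link in the chain pays the execution latency plus any extra cycles
    # spent pushing the result through the register file before it can be read.
    return n_ops * (exec_latency + regfile_round_trip)


with_bypass = chain_cycles(1000)                           # result forwarded directly
without_bypass = chain_cycles(1000, regfile_round_trip=2)  # ASSUMED 2 extra cycles per link

print(with_bypass, without_bypass)  # 1000 3000
```

Under these assumptions the same chain takes three times as long without forwarding, and no amount of extra execution width helps, because every op is waiting on the one before it.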
Picture it physically: each execution unit's output isn't just routed to the register file write port. It's also fanned out to multiplexers sitting in front of the operand inputs of every execution unit, including its own. When the scheduler wakes up a dependent instruction, it doesn't wait for the producer to finish writeback. It tells the consumer's input mux to grab the value directly off the producer's output wire.
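A toy model of one operand mux may make this concrete. Everything here (`BypassWire`, `select_operand`, the tag-matching loop) is a hypothetical simplification: in real hardware the scheduler drives the mux select lines directly rather than searching a list.

```python
from dataclasses import dataclass


@dataclass
class BypassWire:
    tag: int    # physical register the in-flight result is destined for
    value: int  # value currently sitting on the execution unit's output


def select_operand(src_tag, bypass_wires, register_file):
    """Model one operand mux: prefer a matching bypass wire, else read the register file."""
    for wire in bypass_wires:
        if wire.tag == src_tag:
            return wire.value, "bypass"
    return register_file[src_tag], "regfile"


# Physical register 3 has not been written back yet, so the register file is stale.
register_file = {1: 10, 2: 20, 3: 0}
wires = [BypassWire(tag=3, value=30)]

print(select_operand(3, wires, register_file))  # (30, 'bypass')
print(select_operand(1, wires, register_file))  # (10, 'regfile')
```

The consumer of register 3 gets the fresh value 30 off the wire, while the older operand in register 1 comes from the register file as usual.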
The math is brutal. On a CPU with N execution units, full bypassing requires roughly N² forwarding paths per operand input. Skylake has 8 execution ports; that's an upper bound of 64 bypass paths, each carrying 64+ bits across the chip. (The real count is lower, since not every port produces a result that needs forwarding; the store ports, for example, do not.) The wires themselves become a power and timing nightmare. Long bypass wires can't make the cycle, so designers introduce partial bypass networks: full forwarding within a cluster of nearby units, and an extra cycle of delay to reach distant ones.
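The quadratic growth, and what clustering buys, can be sketched in a few lines. The function names and the 4+4 cluster split are illustrative assumptions, not a description of any shipping design.

```python
def full_bypass_paths(n_units: int, operands_per_unit: int = 1) -> int:
    """Every unit's output reaches every operand input of every unit."""
    return n_units * n_units * operands_per_unit


def clustered_bypass_paths(cluster_sizes, operands_per_unit: int = 1) -> int:
    """Count only the fast paths: full forwarding inside each cluster."""
    return sum(full_bypass_paths(size, operands_per_unit) for size in cluster_sizes)


print(full_bypass_paths(8))                       # 64
print(full_bypass_paths(8, operands_per_unit=2))  # 128
print(clustered_bypass_paths([4, 4]))             # 32
```

Splitting 8 units into two clusters of 4 halves the number of single-cycle paths. The slower cross-cluster links still exist, but they no longer have to meet single-cycle timing.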
Real-world example: AMD's Zen architecture splits integer and FP execution into separate clusters, each with its own scheduler and register file. Bypass is full within a cluster, but crossing between them costs at least an extra cycle. This is why moving a value from an XMM register to a GPR (via MOVD) costs more than the instruction itself suggests: you're paying the cross-domain tax. Intel's Ice Lake similarly groups its integer ALUs to keep bypass wires short.
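A small model shows how crossing penalties add up along a dependency chain. The 1-cycle `crossing_penalty` and the per-op latencies are placeholder values for illustration; real penalties vary by microarchitecture and instruction.

```python
def chain_latency(ops, crossing_penalty: int = 1) -> int:
    """Total latency of a dependent chain of (domain, latency) ops.

    Adds crossing_penalty whenever a result is consumed in a different
    domain from the one that produced it.
    """
    total = 0
    prev_domain = None
    for domain, latency in ops:
        if prev_domain is not None and domain != prev_domain:
            total += crossing_penalty  # value has to leave its bypass domain
        total += latency
        prev_domain = domain
    return total


same_domain = chain_latency([("int", 1), ("int", 1), ("int", 1)])
ping_pong = chain_latency([("int", 1), ("fp", 1), ("int", 1)])

print(same_domain, ping_pong)  # 3 5
```

The ops are identical in count and nominal latency, yet the chain that bounces between domains is two cycles slower. That is the pattern to look for when a kernel mixes integer and floating-point work on the same value.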
Rule of thumb: If you see an instruction with "1 cycle latency" in Agner Fog's instruction tables, that's the bypass network working. The result genuinely never visits the register file before the dependent op consumes it. It is still written back in parallel, but the consumer takes its copy off a wire where the value lives for only a few hundred picoseconds. If you see latency jump to 2+ cycles for the same operation when the producer and consumer sit in different domains (int↔FP, vector↔scalar), that's a bypass gap.
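This is why type punning via memory (storing a float, then loading it back as an int) can be slower than a direct register-to-register move, even though the move itself pays the cross-domain penalty. The move crosses the domain boundary once. The store/load pair goes through the load/store queue, relies on store-to-load forwarding, and re-enters the core via a load port, which typically costs several cycles.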
