Daily Hardware Architecture: CPU Pipeline Stages and Hazards: Why Your CPU Is a Factory Assembly Line

2026-04-24

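A CPU pipeline splits instruction execution into stages so multiple instructions overlap, like an assembly line. A classic 5-stage RISC pipeline has: Fetch → Decode → Execute → Memory → Writeback. Without pipelining, each instruction occupies the whole datapath for 5 cycles before the next one can start. With it, a new instruction enters every cycle, so once the pipeline is full you ideally retire one instruction per cycle: up to a 5× throughput gain, even though each individual instruction still takes 5 cycles from start to finish.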
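To make the arithmetic concrete, here is a minimal sketch of the ideal cycle counts, assuming no hazards and one cycle per stage. The function names are hypothetical, chosen only for this illustration.

```python
def unpipelined_cycles(n_instructions: int, n_stages: int = 5) -> int:
    """Each instruction runs through every stage before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions: int, n_stages: int = 5) -> int:
    """Ideal pipeline: n_stages cycles to fill, then one instruction finishes per cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


def ideal_speedup(n_instructions: int, n_stages: int = 5) -> float:
    """Ratio of unpipelined to pipelined cycle counts (assumes no stalls at all)."""
    return unpipelined_cycles(n_instructions, n_stages) / pipelined_cycles(
        n_instructions, n_stages
    )


if __name__ == "__main__":
    for n in (1, 10, 100, 1_000_000):
        print(n, round(ideal_speedup(n), 3))
```

The speedup only approaches 5× for long instruction streams, because the first instruction still needs 5 cycles to fill the pipeline.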

But reality fights back. Hazards are situations where the next instruction can't proceed on schedule. There are three kinds, and understanding them explains half of microarchitectural complexity.

Data Hazards occur when an instruction depends on a result not yet produced. Consider:

ADD R1, R2, R3 — result in R1 available after Execute
SUB R4, R1, R5 — needs R1 in Decode/Execute

Without intervention, SUB reads a stale R1. The fix is forwarding (bypassing): dedicated paths route the ALU output directly back to the ALU input, avoiding a 2-cycle stall. Nearly every modern CPU implements this. However, load-use hazards can't be fully hidden by forwarding: if an instruction loads from memory, the data isn't available until the end of the Memory stage, so an instruction that uses the loaded value immediately afterward must wait. This forces a 1-cycle pipeline bubble. Compilers actively reorder instructions to fill that slot with independent work, which is why -O2 code sometimes looks shuffled.
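The stall counts above can be reproduced with a small model. This is a sketch under stated assumptions, not a description of any real core: a 5-stage in-order pipeline, a register file that writes in the first half of a cycle and reads in the second half, and hypothetical names (Instr, count_stalls) invented for the example.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Instr(NamedTuple):
    op: str                   # e.g. "ADD", "SUB", "LOAD"
    dest: Optional[str]       # destination register, or None
    srcs: Tuple[str, ...]     # source registers


def count_stalls(program: Sequence[Instr], forwarding: bool = True) -> int:
    """Count bubble cycles caused by read-after-write hazards.

    Timing for a producer in Decode at cycle t: Execute t+1, Memory t+2, Writeback t+3.
    ready[r] is the earliest cycle a consumer of r may occupy Decode.
    """
    ready = {}
    prev_decode = -1
    stalls = 0
    for ins in program:
        earliest = prev_decode + 1
        needed = max([ready.get(r, 0) for r in ins.srcs], default=0)
        decode = max(earliest, needed)
        stalls += decode - earliest
        if ins.dest is not None:
            if not forwarding:
                ready[ins.dest] = decode + 3   # wait for Writeback
            elif ins.op == "LOAD":
                ready[ins.dest] = decode + 2   # data arrives at end of Memory
            else:
                ready[ins.dest] = decode + 1   # ALU result forwarded after Execute
        prev_decode = decode
    return stalls


add_sub = [Instr("ADD", "R1", ("R2", "R3")), Instr("SUB", "R4", ("R1", "R5"))]
load_use = [Instr("LOAD", "R1", ("R2",)), Instr("ADD", "R3", ("R1", "R4"))]
scheduled = [
    Instr("LOAD", "R1", ("R2",)),
    Instr("ADD", "R6", ("R7", "R8")),   # independent instruction fills the slot
    Instr("ADD", "R3", ("R1", "R4")),
]

if __name__ == "__main__":
    print(count_stalls(add_sub, forwarding=False), count_stalls(add_sub))
    print(count_stalls(load_use), count_stalls(scheduled))
```

The third program shows what the compiler's reordering buys: an unrelated instruction placed after the load absorbs the bubble.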

Control Hazards arise from branches. When a branch is fetched, the CPU doesn't know its outcome or target until the branch resolves a few stages later, yet it has to keep fetching something in the meantime. A taken branch in a 5-stage pipeline wastes 1–2 cycles of fetched-but-wrong instructions. You've already studied branch prediction, which exists specifically to hide this cost. In deeper pipelines, the penalty grows: Intel's Pentium 4 (Prescott) had a 31-stage pipeline, so a mispredicted branch threw away well over 20 cycles of in-flight work. This is why deep pipelines demand excellent predictors.

Structural Hazards happen when two instructions need the same hardware unit simultaneously — say, one instruction accessing memory for a load while another is being fetched (both need the memory port). The classic fix: use separate L1 instruction and data caches (Harvard-style split). This is why virtually every modern CPU has distinct L1I and L1D caches.
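A toy model of that conflict, assuming a 5-stage pipeline where an instruction fetched at cycle c reaches its Memory stage at cycle c + 3, a single shared memory port in which the data access wins, and no other stalls. The function name and its parameters are hypothetical.

```python
from typing import Sequence


def fetch_stalls(is_mem_op: Sequence[bool], split_caches: bool = False) -> int:
    """Count cycles in which Fetch must wait because a load/store holds the port.

    is_mem_op[i] is True when instruction i is a load or store.
    With split_caches=True, Fetch and Memory never contend.
    """
    port_busy = set()   # cycles in which a load/store occupies the shared port
    cycle = 0           # next cycle in which Fetch would like to run
    stalls = 0
    for mem in is_mem_op:
        while not split_caches and cycle in port_busy:
            cycle += 1
            stalls += 1
        if mem:
            port_busy.add(cycle + 3)   # this instruction's Memory stage
        cycle += 1
    return stalls


if __name__ == "__main__":
    program = [True, True, False, False, False]   # two loads, then three ALU ops
    print(fetch_stalls(program), fetch_stalls(program, split_caches=True))
```

With the shared port, each load delays a later fetch by one cycle; with split caches the same program incurs no structural stalls in this model.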

Rule of thumb: a deeper pipeline buys a higher clock frequency, but with diminishing returns. Each stage does less work, yet the fixed overhead of the registers between stages doesn't shrink, so going from 5 to 10 stages might let you boost the clock 40–60% rather than doubling it. Meanwhile, misprediction and hazard penalties scale roughly linearly with depth. Modern designs settle around 10–20 stages as the sweet spot. ARM Cortex-A78 uses ~13 stages; Apple's M-series performance cores are reported to use ~16.
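The trade-off can be sketched with a toy model. Everything here is an assumption made for illustration: total logic delay of 10 time units split evenly across stages, 1 unit of fixed per-stage register overhead (which gives a 50% clock gain from 5 to 10 stages), and a hazard cost of 0.01 CPI per stage. The numbers are not measurements, and the depth at which this model turns over should not be read as a prediction for real designs.

```python
def cycle_time(depth: int, logic_delay: float = 10.0, latch_overhead: float = 1.0) -> float:
    """Clock period: an even share of the logic plus a fixed per-stage overhead."""
    return logic_delay / depth + latch_overhead


def time_per_instruction(
    depth: int,
    logic_delay: float = 10.0,
    latch_overhead: float = 1.0,
    hazard_cpi_per_stage: float = 0.01,
) -> float:
    """Average time per instruction: CPI (growing linearly with depth) times clock period."""
    cpi = 1.0 + hazard_cpi_per_stage * depth
    return cpi * cycle_time(depth, logic_delay, latch_overhead)


if __name__ == "__main__":
    for depth in (5, 10, 20, 40, 80):
        print(depth, round(cycle_time(depth), 3), round(time_per_instruction(depth), 3))
```

Under these assumptions each doubling of depth helps less than the previous one, and past some point the growing hazard penalty outweighs the faster clock.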

A practical calculation: if your branch predictor is 95% accurate and the misprediction penalty is 15 cycles, the effective CPI cost from branches alone is 0.20 × 0.05 × 15 = 0.15 CPI (assuming ~20% of instructions are branches). On an ideal base CPI of 1, that's 15% more cycles per instruction, or roughly a 13% drop in throughput. That is substantial, and exactly why prediction accuracy matters more as pipelines deepen.
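The same calculation as a small helper, so the inputs can be varied. This is a sketch assuming an ideal base CPI of 1 and no other stall sources; the function names are hypothetical.

```python
def branch_cpi_penalty(branch_fraction: float, accuracy: float, mispredict_penalty: float) -> float:
    """Extra cycles per instruction caused by mispredicted branches."""
    return branch_fraction * (1.0 - accuracy) * mispredict_penalty


def throughput_loss(extra_cpi: float, base_cpi: float = 1.0) -> float:
    """Fraction of throughput lost when CPI rises from base_cpi to base_cpi + extra_cpi."""
    return 1.0 - base_cpi / (base_cpi + extra_cpi)


if __name__ == "__main__":
    extra = branch_cpi_penalty(branch_fraction=0.20, accuracy=0.95, mispredict_penalty=15)
    print(round(extra, 3), round(throughput_loss(extra), 3))
    # Same predictor on a deeper pipeline with a 30-cycle penalty:
    print(round(branch_cpi_penalty(0.20, 0.95, 30), 3))
```

Doubling the penalty doubles the CPI cost at the same accuracy, which is the sense in which deeper pipelines put more pressure on the predictor.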

See it in action: Check out Introduction to CPU Pipelining by Merlin Wellington to see this theory applied.

Key Takeaway: Pipelining multiplies throughput by overlapping instruction stages, but data, control, and structural hazards constantly threaten that gain — forwarding, branch prediction, and split caches are the essential countermeasures baked into every modern CPU.
