Daily Low-Level Programming: The CPU Pipeline and Instruction-Level Parallelism

2026-05-10

Modern CPUs don't execute instructions one at a time. A single core has dozens of instructions in flight simultaneously, in various stages of completion. Understanding this pipeline is the difference between code that runs at 0.5 IPC (instructions per cycle) and code that runs at 4+ IPC on the same hardware.

The classic 5-stage pipeline (Fetch, Decode, Execute, Memory, Writeback) is the textbook model. Real x86 cores (Intel Golden Cove, AMD Zen 4) have roughly 15-20 stages and are deeply superscalar: they fetch and decode several instructions per cycle (up to 6 on Golden Cove), turn them into micro-ops, and dispatch them to 10+ execution ports in parallel. The reorder buffer (ROB) holds hundreds of in-flight micro-ops: 320 entries on Zen 4 and 512 on Golden Cove.

Out-of-order execution is what makes this work. The CPU sees a window of upcoming instructions and executes any whose inputs are ready, regardless of program order. Results are retired in program order so software sees sequential execution. Three kinds of hazard can still stall the pipeline:

Data hazards: instruction B needs A's result. Solved by forwarding/bypassing, but a load-to-use dependency on an L1 hit still costs 4-5 cycles.
Control hazards: branches. The CPU predicts and speculates; a mispredict costs ~15-20 cycles to flush.
Structural hazards: two instructions need the same execution port. Skylake has 4 ALU ports but only 2 load ports.
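
To make the control hazard concrete, here is a minimal sketch in C that counts elements above a threshold two ways. The function names are hypothetical, chosen for illustration. On random data the branchy form can suffer frequent mispredicts, while the branchless form turns the branch into a data dependency. An optimizing compiler may already emit the branchless form for both, so treat this as an illustration rather than a guaranteed speedup.

```c
#include <stddef.h>

/* Branchy version: one data-dependent conditional branch per element. */
size_t count_above_branchy(const int *a, size_t n, int threshold) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (a[i] > threshold) {
            count++;
        }
    }
    return count;
}

/* Branchless version: the comparison yields 0 or 1, which is added directly,
   so there is no data-dependent branch for the predictor to get wrong. */
size_t count_above_branchless(const int *a, size_t n, int threshold) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        count += (size_t)(a[i] > threshold);
    }
    return count;
}
```

Both functions return the same count; only the shape of the work handed to the pipeline differs.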

Concrete example: summing a million floats. The naive loop creates a serial dependency chain: each add depends on the previous accumulator value. FP add latency is ~4 cycles, so you get one add per 4 cycles, or ~1 GFLOP/s at a 4 GHz clock.
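
For reference, the naive loop might look like the sketch below. The name sum_naive is hypothetical. Every iteration reads and writes the same accumulator, so no add can start until the previous one has finished.

```c
#include <stddef.h>

/* Single accumulator: each add depends on the result of the one before it. */
float sum_naive(const float *a, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        sum += a[i];  /* needs the previous value of sum: a serial chain */
    }
    return sum;
}
```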

Unroll with 4 independent accumulators:

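float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
size_t i = 0;
for (; i + 4 <= n; i += 4) {        /* four independent dependency chains */
  s0 += a[i];     s1 += a[i + 1];
  s2 += a[i + 2]; s3 += a[i + 3];
}
float sum = (s0 + s1) + (s2 + s3);
for (; i < n; i++) sum += a[i];     /* tail when n is not a multiple of 4 */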

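Now four independent chains run in parallel through the pipeline. Throughput rises to about one add per cycle, roughly 4x faster; cores that can start two FP adds per cycle have room for more accumulators. Compilers generally won't make this change for you with floats, because reordering the additions changes the rounding, unless you allow it with a flag such as -ffast-math. Add SIMD and you can get up to another 8x (8 floats per 256-bit AVX register).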

Rule of thumb: independent chains needed = latency × throughput. To saturate a unit with N-cycle latency and 1-per-cycle throughput, you need N independent dependency chains in flight. FP add: 4 chains. FMA on Skylake: 4 chains per FMA port, so 8 to keep both ports busy. Integer mul: 3 chains.
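
The saturation condition above (N-cycle latency at 1-per-cycle throughput needs N chains) generalizes to chains = latency × operations started per cycle. A minimal sketch of that arithmetic, with a hypothetical helper name:

```c
/* Independent chains needed to keep a unit busy:
   latency in cycles times operations the unit can start per cycle. */
unsigned chains_needed(unsigned latency_cycles, unsigned ops_per_cycle) {
    return latency_cycles * ops_per_cycle;
}
```

With the figures from the text, chains_needed(4, 1) gives 4 for FP add and chains_needed(3, 1) gives 3 for integer multiply. A unit with 4-cycle latency that can start two operations per cycle would need chains_needed(4, 2), which is 8.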

This is why perf stat reports instructions and cycles separately: IPC (instructions divided by cycles) tells you how much pipeline parallelism your code actually extracts. On a core that can sustain 4+ IPC, anything below 1.0 usually means you're stalled on dependencies, mispredicts, or cache misses.

See it in action: Check out Instruction Level Parallelism (ILP) - Georgia Tech - HPCA: Part 2 by Udacity to see this theory applied.

Key Takeaway: The CPU can run many instructions in parallel, but only if your code exposes independent dependency chains — break serial chains to multiply throughput.
