2026-05-10
Modern CPUs don't execute instructions one at a time. A single core has dozens of instructions in flight simultaneously, in various stages of completion. Understanding this pipeline is the difference between code that runs at 0.5 IPC (instructions per cycle) and code that runs at 4+ IPC on the same hardware.
The classic 5-stage pipeline (Fetch, Decode, Execute, Memory, Writeback) is the textbook model. Real x86 cores (Intel Golden Cove, AMD Zen 4) have 15-20 stages and are deeply superscalar: they fetch ~6 instructions per cycle, decode them into micro-ops, and dispatch them to 10+ execution ports in parallel. The reorder buffer (ROB), which tracks everything in flight, has 320 entries on Zen 4 and 512 on Golden Cove.
Out-of-order execution is what makes this work. The CPU sees a window of upcoming instructions and executes any whose inputs are ready, regardless of program order. Results are retired in program order so software sees sequential execution. Hazards stall the pipeline: data dependencies (an instruction waiting on an earlier result), branch mispredictions, and cache misses.
Concrete example: summing a million floats. The naive loop creates a serial dependency chain: each add depends on the previous accumulator value. FP add latency is ~4 cycles, so you get one add per 4 cycles, or ~1 GFLOP/s on a ~4 GHz core.
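For reference, the naive loop described above might look like the following. This is a minimal sketch; the function name sum_naive and its signature are illustrative assumptions, not taken from any particular codebase.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: sums n floats with a single accumulator. */
float sum_naive(const float *a, size_t n) {
    assert(a != NULL || n == 0);
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        /* Each add needs the previous value of sum, so the adds
           form one serial chain and cannot overlap in the pipeline. */
        sum += a[i];
    }
    return sum;
}
```

The loop-carried dependency on sum is the whole problem: the out-of-order window is full of adds, but each one is waiting on the one before it.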
Unroll with 4 independent accumulators:
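/* assumes const float *a and size_t n (from <stddef.h>) are in scope */
float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
size_t i = 0;
for (; i + 4 <= n; i += 4) {        /* four independent dependency chains */
    s0 += a[i];     s1 += a[i + 1];
    s2 += a[i + 2]; s3 += a[i + 3];
}
float sum = (s0 + s1) + (s2 + s3);
for (; i < n; i++) sum += a[i];     /* leftover elements when n % 4 != 0 */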
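Now four independent chains run in parallel through the pipeline. Throughput reaches one add per cycle (the throughput of a single FP add/FMA unit), 4x faster than the naive loop. The compiler will not do this for you at default settings, because reordering floating-point adds changes rounding. Add SIMD and you get up to another 8x with 256-bit vectors (8 floats each).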
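The SIMD step can be sketched in the same style. The version below is an illustration under stated assumptions: it uses the GCC/Clang vector_size extension (not standard C, and not the x86 intrinsics you would see in hand-tuned code), and the names v8f and sum_simd are hypothetical. It keeps four independent vector accumulators, so there are still four chains, each 8 floats wide.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* 8 packed floats (256 bits). vector_size is a GCC/Clang extension. */
typedef float v8f __attribute__((vector_size(32)));

/* Hypothetical helper: sums n floats using 4 vector accumulators. */
float sum_simd(const float *a, size_t n) {
    assert(a != NULL || n == 0);
    v8f acc0 = {0}, acc1 = {0}, acc2 = {0}, acc3 = {0};
    size_t i = 0;
    for (; i + 32 <= n; i += 32) {
        v8f v0, v1, v2, v3;
        /* memcpy is the portable way to do an unaligned vector load. */
        memcpy(&v0, a + i,      sizeof v0);
        memcpy(&v1, a + i + 8,  sizeof v1);
        memcpy(&v2, a + i + 16, sizeof v2);
        memcpy(&v3, a + i + 24, sizeof v3);
        acc0 += v0; acc1 += v1;   /* four independent vector chains */
        acc2 += v2; acc3 += v3;
    }
    v8f acc = (acc0 + acc1) + (acc2 + acc3);
    float sum = 0.0f;
    for (int k = 0; k < 8; k++) sum += acc[k];   /* horizontal reduce */
    for (; i < n; i++) sum += a[i];              /* scalar tail */
    return sum;
}
```

Whether this compiles to 256-bit instructions depends on the target flags (for example -mavx2 on x86); without them the compiler splits each vector into narrower operations. The result can differ from the scalar loop in the last bits, since the adds happen in a different order.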
Rule of thumb: independent chains needed = latency × throughput. To saturate a unit with N-cycle latency and 1-per-cycle throughput, you need N independent dependency chains in flight. FP add: 4 chains. FMA on Skylake: 4 chains per FMA unit (8 to fill both FMA ports). Integer mul: 3 chains.
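As a worked version of the rule of thumb, the sketch below computes the chain count as latency (cycles) times throughput (operations per cycle) and uses it to size the accumulator array. The names chains_needed, CHAINS, and sum_chains are illustrative, and the value 4 assumes the 4-cycle, 1-per-cycle FP add figures quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Chains needed to keep a unit busy: latency (cycles) x throughput (ops/cycle). */
unsigned chains_needed(unsigned latency_cycles, unsigned ops_per_cycle) {
    return latency_cycles * ops_per_cycle;
}

/* Assumed: 4-cycle FP add latency x 1 add per cycle = 4 chains. */
#define CHAINS 4

/* Hypothetical helper: sums n floats with CHAINS independent accumulators. */
float sum_chains(const float *a, size_t n) {
    assert(a != NULL || n == 0);
    float acc[CHAINS] = {0};
    size_t i = 0;
    for (; i + CHAINS <= n; i += CHAINS) {
        for (size_t k = 0; k < CHAINS; k++) {
            acc[k] += a[i + k];   /* chain k only depends on acc[k] */
        }
    }
    float sum = 0.0f;
    for (size_t k = 0; k < CHAINS; k++) sum += acc[k];
    for (; i < n; i++) sum += a[i];   /* leftover elements */
    return sum;
}
```

For example, chains_needed(4, 2) gives 8, which is the count you would try on a core with two FP add units. This only pays off if the compiler fully unrolls the inner loop and keeps acc in registers, which typically requires optimization to be enabled.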
This is why perf stat reports instructions and cycles separately: IPC tells you how much pipeline parallelism your code actually extracts. On a core that can sustain 4+, anything below 1.0 usually means you're stalled on dependencies, mispredicts, or cache misses.
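The IPC figure is just the ratio of those two counters. perf stat prints it for you as "insn per cycle", but if you are collecting raw counts yourself, a small helper like this (the name ipc is hypothetical) does the arithmetic:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: instructions per cycle from two raw counter values. */
double ipc(uint64_t instructions, uint64_t cycles) {
    assert(cycles > 0);
    return (double)instructions / (double)cycles;
}
```

With made-up counts of 2,000,000,000 instructions over 4,000,000,000 cycles, ipc returns 0.5, which is the stalled end of the range described at the top of this post.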
