Branch Prediction: Why Your CPU Guesses, and What Happens When It's Wrong

2026-04-30

A modern out-of-order CPU pipeline is 15–20+ stages deep. When it encounters a conditional branch (je, bne), it won't know the outcome for several more cycles. Stalling would waste all those pipeline stages. So the CPU guesses which way the branch goes and speculatively executes down that path. If correct, zero penalty. If wrong, it flushes the pipeline and restarts — costing roughly 15–25 cycles on current x86 hardware.

How the predictor works: Modern CPUs use a combination of structures:

Real-world example — sorted vs. unsorted data: Consider summing array elements greater than 128 in a 0–255 range:

for (int i = 0; i < N; i++) if (data[i] > 128) sum += data[i];

On sorted data, the branch follows a clean pattern: long run of not-taken, then long run of taken. The predictor learns this trivially. On random data, the branch is essentially a coin flip — ~50% mispredict rate. The measured difference is dramatic: sorted runs 3–6x faster than unsorted on the same data, purely due to branch misprediction cost.

Rule of thumb: Each misprediction costs ~20 wasted cycles. If a branch in a hot loop mispredicts 50% of the time and executes 100M iterations, that's 50M × 20 = 1 billion wasted cycles — roughly 0.3 seconds on a 3 GHz core.

Practical mitigation techniques:

Note that cmov isn't always faster — it introduces a data dependency. If the branch is highly predictable (99%+ one direction), a well-predicted branch outperforms cmov because the CPU speculates past it with no dependency stall. Profile before converting.

See it in action: Check out Branch Prediction: How CPUs Guess Your Code
#39;s Future 🔮 by CodeLucky to see this theory applied.
Key Takeaway: Branch mispredictions flush the pipeline at ~20 cycles each; in hot loops with unpredictable branches, restructure to branchless code or sort your data to restore predictable patterns.