Daily Low-Level Programming: Branch Prediction: Why Your CPU Guesses, and What Happens When It's Wrong

Branch Prediction: Why Your CPU Guesses, and What Happens When It's Wrong

2026-04-30

A modern out-of-order CPU pipeline is 15–20+ stages deep. When it encounters a conditional branch (je, bne), it won't know the outcome for several more cycles. Stalling would waste all those pipeline stages. So the CPU guesses which way the branch goes and speculatively executes down that path. If correct, zero penalty. If wrong, it flushes the pipeline and restarts — costing roughly 15–25 cycles on current x86 hardware.

How the predictor works: Modern CPUs use a combination of structures:

Branch Target Buffer (BTB): A cache mapping branch instruction addresses to their predicted target addresses.
Pattern History Table (PHT): Indexed by recent branch history (a shift register of taken/not-taken bits), storing 2-bit saturating counters. A counter must mispredict twice before flipping its prediction — this resists noise from occasional deviations.
TAGE predictors (used in modern Intel/AMD/ARM): multiple tables indexed by different history lengths, allowing accurate prediction of patterns hundreds of branches deep.

Real-world example — sorted vs. unsorted data: Consider summing array elements greater than 128 in a 0–255 range:

for (int i = 0; i < N; i++) if (data[i] > 128) sum += data[i];

On sorted data, the branch follows a clean pattern: long run of not-taken, then long run of taken. The predictor learns this trivially. On random data, the branch is essentially a coin flip — ~50% mispredict rate. The measured difference is dramatic: sorted runs 3–6x faster than unsorted on the same data, purely due to branch misprediction cost.

Rule of thumb: Each misprediction costs ~20 wasted cycles. If a branch in a hot loop mispredicts 50% of the time and executes 100M iterations, that's 50M × 20 = 1 billion wasted cycles — roughly 0.3 seconds on a 3 GHz core.

Practical mitigation techniques:

Branchless code: Replace if (x > 128) sum += x with sum += x & -(x > 128) or use conditional moves (cmov on x86, csel on ARM). No branch, no misprediction.
Compiler hints: __builtin_expect(condition, 1) tells GCC/Clang the likely path, improving code layout so the common path is fall-through (which the BTB handles better).
Profile-guided optimization (PGO): The compiler instruments branches, collects real execution data, then recompiles with optimal layout and hint placement.
Sort or partition data before processing when branch-heavy logic dominates runtime.

Note that cmov isn't always faster — it introduces a data dependency. If the branch is highly predictable (99%+ one direction), a well-predicted branch outperforms cmov because the CPU speculates past it with no dependency stall. Profile before converting.

See it in action: Check out Branch Prediction: How CPUs Guess Your Code

#39;s Future 🔮 by CodeLucky to see this theory applied.

Key Takeaway: Branch mispredictions flush the pipeline at ~20 cycles each; in hot loops with unpredictable branches, restructure to branchless code or sort your data to restore predictable patterns.

All newsletters