Branch Prediction: How CPUs Bet on the Future

2026-04-21

A modern CPU pipeline is 15–20 stages deep. When a conditional branch appears, the CPU won't know the outcome for another 10+ cycles. Stalling would be catastrophic — on typical code, roughly one in five instructions is a branch. So the CPU predicts the direction and speculatively fetches/executes down that path. If wrong, it flushes the pipeline and pays the full misprediction penalty.

The simplest predictor: a 2-bit saturating counter. Each branch gets a counter (0–3). Values 0–1 predict "not taken," 2–3 predict "taken." A taken outcome increments the counter; a not-taken outcome decrements it, saturating at 0 and 3. The key insight: a single anomalous outcome (like a loop exit) doesn't immediately flip a strongly biased prediction. This alone gets ~85% accuracy on typical code.

Two-level adaptive prediction was the breakthrough. It uses a Branch History Register (BHR) — a shift register recording the last N outcomes (taken/not-taken) of a branch. This pattern indexes into a Pattern History Table (PHT) of 2-bit counters. Now the predictor learns correlations: "when this branch followed the pattern TTNT, it next goes T." Intel's Pentium Pro used this scheme.

Modern CPUs use TAGE (TAgged GEometric history length) predictors. TAGE keeps multiple tables indexed by different history lengths (e.g., 5, 10, 20, 40, 80, 160 branches of context). On a lookup, the table with the longest matching history wins. This handles both short-period loops (short history) and complex data-dependent patterns (long history). AMD Zen and recent Intel cores use TAGE variants, achieving ~97% accuracy.

Rule of thumb for the misprediction cost: multiply pipeline depth by issue width. A 20-stage, 4-wide superscalar CPU wastes roughly 20 × 4 = 80 instruction slots of work per mispredict (about 20 cycles of 4-wide issue). At a 3% mispredict rate with a branch every 5 instructions, that's 0.03 × 80 / 5 = ~0.48 wasted issue slots per instruction executed — about 0.12 extra cycles per instruction on a 4-wide machine, a significant bite out of your IPC budget.

Real-world implication you've felt: sorting an array before doing branch-heavy processing on it (like filtering values above a threshold) can run 2–5× faster. The sorted data makes the branch pattern predictable (long runs of taken, then long runs of not-taken) versus random data that thrashes the predictor. This is the famous "Why is processing a sorted array faster?" result.

Indirect branches (virtual function calls, switch statements via jump tables) are harder — the target address changes, not just taken/not-taken. CPUs maintain a separate Branch Target Buffer (BTB) that caches recent target addresses keyed by the branch's PC. Polymorphic call sites with many targets remain a performance headache in C++ and Java.

Key Takeaway: Branch predictors are pattern-matching machines that learn from history — writing code with predictable branch patterns (sorted data, consistent loop bounds, minimal polymorphic dispatch) directly translates to fewer pipeline flushes and higher throughput.