Daily Hardware Architecture: TAGE Branch Predictor: How CPUs Use Geometric History to Win the Prediction War

TAGE Branch Predictor: How CPUs Use Geometric History to Win the Prediction War

2026-05-17

TAGE (TAgged GEometric history length) is the predictor that quietly powers modern high-performance cores — variants ship in Intel Sunny Cove and later, AMD Zen 3+, IBM POWER, and Arm Neoverse V-series. It's been the reigning champion of the Championship Branch Prediction contests for over a decade. If perceptrons taught CPUs to weight history linearly, TAGE taught them to choose how much history matters per branch.

The core insight: different branches need different amounts of history. A loop back-edge needs almost none ("taken 999 times in a row"). A branch deep inside a recursive parser may only become predictable when you correlate it with 200+ prior branches. A single fixed history length is always wrong for somebody.

The structure: TAGE is a stack of tagged tables indexed by progressively longer histories, growing in a geometric series (e.g., 5, 13, 34, 87, 219 bits). Plus one base predictor (usually a bimodal table) that always provides a default.

Each tagged entry stores: a tag (partial hash of PC + history), a 3-bit saturating counter (the prediction), and a 2-bit useful counter (has this entry been recently helpful?).
On lookup, all tables are probed in parallel. The prediction comes from the longest-history table that produces a tag hit. If nothing hits, fall back to the bimodal.
On a misprediction, allocate a new entry in a table with a longer history than the one that mispredicted — but only steal a slot whose useful counter is 0.

Why geometric and not linear? Useful correlations are sparse at long histories. Doubling each step (5 → 13 → 34...) covers a huge range with only 5–7 tables. A linear progression (10, 20, 30...) would waste silicon on redundant lengths and miss the deep correlations entirely.

Concrete example: consider if (buffer[i] < threshold) inside a sort. The branch outcome depends on data, not local history — bimodal gives ~50%. But if the data was just partitioned by a previous pass, the outcome correlates with branches ~50 instructions back. TAGE's L3 table (history ~34) catches this; a gshare with fixed 12-bit history can't. SPEC2017 traces show TAGE-SC-L hitting ~3 MPKI (mispredicts per kilo-instruction) vs ~6 MPKI for tournament predictors — a 2× reduction that translates to roughly 5–10% IPC on branchy code.

Rule of thumb: each halving of branch MPKI is worth roughly 3–5% IPC on a deep out-of-order core, because a misprediction costs ~15–20 cycles of pipeline flush. Going from 6 → 3 MPKI on a 3 IPC core saves ~9 cycles per 1000 instructions — directly visible as performance.

Real implementations add a statistical corrector (SC) and a loop predictor (L) — hence "TAGE-SC-L" — to patch the few patterns TAGE alone gets wrong.

Key Takeaway: TAGE wins by letting each branch pick its own optimal history length from a geometric series of tagged tables, instead of forcing one global history size on all branches.

All newsletters