2026-05-17
TAGE (TAgged GEometric history length) is the predictor that quietly powers modern high-performance cores — variants ship in Intel Sunny Cove and later, AMD Zen 3+, IBM POWER, and Arm Neoverse V-series. It's been the reigning champion of the Championship Branch Prediction contests for over a decade. If perceptrons taught CPUs to weight history linearly, TAGE taught them to choose how much history matters per branch.
The core insight: different branches need different amounts of history. A loop back-edge needs almost none ("taken 999 times in a row"). A branch deep inside a recursive parser may only become predictable when you correlate it with 200+ prior branches. A single fixed history length is always wrong for somebody.
The structure: TAGE is a stack of tagged tables indexed by progressively longer histories, growing in a geometric series (e.g., 5, 13, 34, 87, 219 bits). Plus one base predictor (usually a bimodal table) that always provides a default.
Why geometric and not linear? Useful correlations are sparse at long histories. Doubling each step (5 → 13 → 34...) covers a huge range with only 5–7 tables. A linear progression (10, 20, 30...) would waste silicon on redundant lengths and miss the deep correlations entirely.
Concrete example: consider if (buffer[i] < threshold) inside a sort. The branch outcome depends on data, not local history — bimodal gives ~50%. But if the data was just partitioned by a previous pass, the outcome correlates with branches ~50 instructions back. TAGE's L3 table (history ~34) catches this; a gshare with fixed 12-bit history can't. SPEC2017 traces show TAGE-SC-L hitting ~3 MPKI (mispredicts per kilo-instruction) vs ~6 MPKI for tournament predictors — a 2× reduction that translates to roughly 5–10% IPC on branchy code.
Rule of thumb: each halving of branch MPKI is worth roughly 3–5% IPC on a deep out-of-order core, because a misprediction costs ~15–20 cycles of pipeline flush. Going from 6 → 3 MPKI on a 3 IPC core saves ~9 cycles per 1000 instructions — directly visible as performance.
Real implementations add a statistical corrector (SC) and a loop predictor (L) — hence "TAGE-SC-L" — to patch the few patterns TAGE alone gets wrong.
