2026-06-08
Modern CPUs have 15-20 pipeline stages. By the time a branch instruction resolves at stage 12, you've already fetched a dozen more instructions. If you guessed wrong, you flush them all — a 15-cycle penalty. With branches every 5-6 instructions in typical code, accurate prediction isn't a nice-to-have; it's the only way pipelines work at all.
The 2-bit saturating counter — the workhorse: Each branch gets a 2-bit state: Strongly Taken (11), Weakly Taken (10), Weakly Not-Taken (01), Strongly Not-Taken (00). Each time the branch resolves, increment on taken, decrement on not-taken, saturating at the ends. The MSB is your prediction.
Why 2 bits and not 1? A loop branch is taken 99 times then not-taken once at exit. A 1-bit predictor flips on the exit and mispredicts the next entry too. The 2-bit version requires two consecutive mispredictions to flip its prediction — one wrong guess at loop exit doesn't cost you the next loop.
The Branch History Table (BHT): A direct-mapped array of these 2-bit counters, indexed by PC bits. A 4096-entry BHT is just 8 Kbits of SRAM — tiny — but achieves ~93% accuracy on typical code.
Two-level adaptive (gshare) — the real magic: XOR the PC with a Global History Register (the last N branch outcomes shifted in). Use the XOR result to index the counter table. This captures correlation between branches:
if (x > 0) ... followed by if (x < 10) ... — the second branch's behavior correlates with the first.The Branch Target Buffer (BTB): Prediction direction (taken/not-taken) is only half the problem. For taken branches you also need the address. The BTB is a cache mapping PC → target address, queried in parallel with the BHT during fetch.
Rule of thumb: Misprediction cost = pipeline depth × IPC. On a 15-stage, 4-wide CPU: ~60 instructions wasted per misprediction. At 95% accuracy with one branch every 5 instructions, you waste ~0.6 instructions per branch — a 12% overhead. Push to 99% and that drops to 2.4%. Every fraction of a percent matters; that's why modern predictors are absurdly complex (TAGE, perceptron, neural).
Real-world example: The Apple M1's branch predictor is rumored to use TAGE-like tables and achieves >99% accuracy on SPEC benchmarks. This is a major reason the M1 sustains 8-wide decode where x86 cores stall at 4-6 wide — accurate prediction keeps the wide front-end fed.
