Branch Predictors: How Hardware Guesses the Future to Keep the Pipeline Full

2026-06-08

Modern CPUs have 15-20 pipeline stages. By the time a branch instruction resolves at stage 12, you've already fetched a dozen more instructions. If you guessed wrong, you flush them all — a 15-cycle penalty. With branches every 5-6 instructions in typical code, accurate prediction isn't a nice-to-have; it's the only way pipelines work at all.

The 2-bit saturating counter — the workhorse: Each branch gets a 2-bit state: Strongly Taken (11), Weakly Taken (10), Weakly Not-Taken (01), Strongly Not-Taken (00). Each time the branch resolves, increment on taken, decrement on not-taken, saturating at the ends. The MSB is your prediction.

Why 2 bits and not 1? A loop branch is taken 99 times then not-taken once at exit. A 1-bit predictor flips on the exit and mispredicts the next entry too. The 2-bit version requires two consecutive mispredictions to flip its prediction — one wrong guess at loop exit doesn't cost you the next loop.

The Branch History Table (BHT): A direct-mapped array of these 2-bit counters, indexed by PC bits. A 4096-entry BHT is just 8 Kbits of SRAM — tiny — but achieves ~93% accuracy on typical code.

Two-level adaptive (gshare) — the real magic: XOR the PC with a Global History Register (the last N branch outcomes shifted in). Use the XOR result to index the counter table. This captures correlation between branches:

The Branch Target Buffer (BTB): Prediction direction (taken/not-taken) is only half the problem. For taken branches you also need the address. The BTB is a cache mapping PC → target address, queried in parallel with the BHT during fetch.

Rule of thumb: Misprediction cost = pipeline depth × IPC. On a 15-stage, 4-wide CPU: ~60 instructions wasted per misprediction. At 95% accuracy with one branch every 5 instructions, you waste ~0.6 instructions per branch — a 12% overhead. Push to 99% and that drops to 2.4%. Every fraction of a percent matters; that's why modern predictors are absurdly complex (TAGE, perceptron, neural).

Real-world example: The Apple M1's branch predictor is rumored to use TAGE-like tables and achieves >99% accuracy on SPEC benchmarks. This is a major reason the M1 sustains 8-wide decode where x86 cores stall at 4-6 wide — accurate prediction keeps the wide front-end fed.

See it in action: Check out branch prediction failures how to cope and optimize by CodeIgnite to see this theory applied.
Key Takeaway: Branch predictors trade a few kilobits of SRAM and some XOR gates for tens of percent IPC — they're the unsung heroes that make deep pipelines economically viable.

All newsletters