The Pattern History Table: How CPUs Learn Branch Behavior Over Time

2026-05-15

We've covered branch prediction broadly, but the Pattern History Table (PHT) deserves its own lesson — it's the actual data structure that decides whether your if is taken. Understanding its geometry tells you why some branches predict perfectly and others thrash forever.

A PHT is an array of 2-bit saturating counters, indexed by some hash of the branch's PC (and often global history). The counter states are:

The high bit is the prediction. Each branch outcome nudges the counter ±1 (saturating). The 2-bit design is the magic: a single mispredict doesn't flip a strongly-biased branch. A loop that runs 1000 iterations and exits once still predicts "taken" the next time you enter the loop.

Indexing matters more than size. A naive PHT indexed by PC[11:2] aliases: two unrelated branches sharing the same low bits collide and corrupt each other's counters. Modern predictors (gshare, GAg, TAGE) XOR the PC with a global history register — a shift register of recent branch outcomes — so the same branch under different control-flow contexts gets different PHT entries.

Concrete example: Consider a loop processing a sorted array with if (x > threshold). Once you cross the threshold, every branch goes the same way. A 2-bit counter saturates at 11, and you get near-100% accuracy. Now shuffle the array randomly — outcomes become 50/50, the counter oscillates between 01 and 10, and you mispredict ~50% of the time. This is the famous Stack Overflow "why is processing a sorted array faster" answer: a 4-6× speedup purely from PHT behavior.

Rule of thumb: A modern PHT has roughly 16K entries × 2 bits = 4KB of state per predictor table. With 12 bits of index, you have 4096 slots. If your hot loop has more than ~4096 unique branch contexts, you'll alias and lose accuracy. Mispredict cost on a deep pipeline is 15-20 cycles — at 1B branches/sec with 5% mispredict rate, that's 750M-1B wasted cycles per second, or roughly 25-30% of a 3GHz core.

TAGE (the predictor in modern Intel/AMD/ARM cores) layers multiple PHTs indexed with different history lengths and picks whichever has the most confident "tag match" — giving near-optimal prediction for both short-period and long-period patterns. It's the reason indirect branches in interpreters got dramatically faster after ~2015.

See it in action: Check out The CPU Cache - Short Animated Overview by BitLemon to see this theory applied.
Key Takeaway: The PHT is just an array of 2-bit counters, but its indexing scheme — how PC bits mix with global history — determines whether your branches predict perfectly or alias into chaos.

All newsletters