2026-05-17
By the mid-2000s, two-level predictors with global history were hitting a wall. To capture longer correlations you needed a bigger Pattern History Table, but PHT size grows exponentially with history length. Daniel Jiménez proposed a different angle in 2001: use a perceptron — a single-layer neural network — whose storage grows only linearly with history length. AMD shipped it in Bobcat (2011), then Bulldozer, Jaguar, and every Zen core since.
The math is almost embarrassingly simple. For each branch PC, store a vector of small signed weights w[0..N], one per history bit (plus a bias w[0]). To predict, compute:
y = w[0] + Σ (w[i] × h[i]) where h[i] is +1 for taken, -1 for not-takeny ≥ 0, predict taken; otherwise not-takenBecause history bits are ±1, the multiply degenerates into add-or-subtract. The whole dot product is a wide carry-save adder tree — fast, but its depth grows with history length, which is the main implementation headache.
Training happens at retirement. If the prediction was wrong, or if |y| was below a threshold θ (meaning low confidence), update each weight: w[i] += t × h[i], where t is +1 if the branch was actually taken, -1 otherwise. Weights saturate at a small range like ±127 (8 bits). That's it. No backprop, no learning rate tuning — one pass of perceptron learning per resolved branch.
Why this beats giant PHTs: a 2-level predictor with 60 bits of history would need a 2^60-entry table — impossible. A perceptron with 60 weights of 8 bits each is 60 bytes per branch. AMD's Zen reportedly uses ~60-bit histories with perceptron-based prediction, achieving misprediction rates well under 3% on SPEC.
Real example: consider a loop where branch B is taken whenever branches A and C earlier in the history were both taken. A bimodal predictor sees B as roughly 50/50 — useless. A two-level predictor needs to memorize the full history pattern. A perceptron learns positive weights on the A and C positions and near-zero weights elsewhere, generalizing across irrelevant history bits.
Limitation: perceptrons can only learn linearly separable functions. Classic XOR-style correlations (taken iff exactly one of A,C is taken) defeat them. Modern CPUs hybridize: TAGE handles the linearly-non-separable cases via tagged geometric histories, while perceptrons or their descendants (hashed perceptron, piecewise linear) catch the long-history correlations TAGE misses.
Rule of thumb: perceptron storage = (history_length + 1) × weight_bits per entry. 64-bit history × 8-bit weights = 65 bytes/branch — cheap enough to track thousands of branches.
