2026-05-21
Conditional branches are only half the prediction problem. The other half is where a branch goes — trivial for direct jumps (the target is encoded in the instruction), but a real puzzle for indirect branches like call rax, jmp [rdx+8], vtable dispatches, and computed gotos. The CPU needs the target address before the load that produces it has even completed. Enter the Branch Target Buffer (BTB).
The BTB is a cache keyed by the branch's instruction pointer. Each entry stores the target address observed the last time (or last few times) that branch was taken. When the front-end fetches an instruction, it looks up the IP in the BTB in parallel with decode. On a hit, fetch continues at the predicted target, often 15 or more cycles before the actual target is computed. On a miss, the pipeline either stalls or speculates down the fall-through path and pays a 15–20 cycle misprediction penalty when the real target arrives.
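To make the lookup concrete, here is a minimal sketch of a direct-mapped, last-target BTB. The class name, the 4096-entry default, and the modulo indexing are illustrative assumptions, not a description of any shipping design; real BTBs are set-associative and store partial tags.

```python
class LastTargetBTB:
    """Toy direct-mapped BTB: remembers the most recent target per branch IP."""

    def __init__(self, num_entries=4096):
        self.num_entries = num_entries
        self.entries = {}  # index -> (tag, target)

    def predict(self, ip):
        """Return the predicted target for the branch at ip, or None on a miss."""
        entry = self.entries.get(ip % self.num_entries)
        if entry is not None and entry[0] == ip:
            return entry[1]
        return None

    def update(self, ip, actual_target):
        """Record the resolved target, evicting any branch that shares the index."""
        self.entries[ip % self.num_entries] = (ip, actual_target)


btb = LastTargetBTB()
cold_prediction = btb.predict(0x400)   # nothing recorded yet, so a miss
btb.update(0x400, 0x900)               # the branch resolves to 0x900
warm_prediction = btb.predict(0x400)   # now predicted from the stored entry
```

Two branches whose IPs differ by a multiple of num_entries share an index and evict each other, which is the aliasing that associativity is meant to reduce.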
Modern Intel BTBs hold roughly 4K–8K entries, organized as a set-associative cache with multiple ways. ARM's Cortex-A78 has a 6K-entry main BTB plus a smaller, faster 64-entry "nano-BTB" for the hottest branches. Both designs fold branch history (the path taken to reach this branch) into the lookup, so the same indirect call site can get different predictions depending on how execution arrived there. This is critical for polymorphic dispatch, where one vtable call resolves to different methods based on caller context.
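The effect of history on the lookup can be sketched with a toy predictor keyed by (IP, recent path). Everything here is an assumption for illustration: the addresses, the class name, and the use of the exact tuple of recent branch IPs as the key, where real hardware hashes history into a limited number of index bits.

```python
CALLER_1, CALLER_2 = 0x100, 0x200     # two hypothetical callers
HELPER, SITE = 0x4F0, 0x500           # shared helper and its indirect call site
METHOD_X, METHOD_Y = 0x900, 0xA00     # the two possible call targets


class PathHistoryBTB:
    """Toy BTB keyed by (branch IP, IPs of the last few branches executed)."""

    def __init__(self, history_len):
        self.history_len = history_len
        self.path = ()
        self.table = {}

    def predict(self, ip):
        return self.table.get((ip, self.path))

    def update(self, ip, actual_target):
        self.table[(ip, self.path)] = actual_target
        if self.history_len:
            self.path = (self.path + (ip,))[-self.history_len:]


def site_hits(btb, trace, site):
    """Replay (ip, target) pairs; return (hits, total) for one branch site."""
    hits = total = 0
    for ip, target in trace:
        if ip == site:
            total += 1
            hits += btb.predict(ip) == target
        btb.update(ip, target)
    return hits, total


# Each caller enters the helper, whose call site then resolves differently.
trace = [(CALLER_1, HELPER), (SITE, METHOD_X),
         (CALLER_2, HELPER), (SITE, METHOD_Y)] * 50

no_history = site_hits(PathHistoryBTB(history_len=0), trace, SITE)
with_history = site_hits(PathHistoryBTB(history_len=1), trace, SITE)
```

In this trace the history-free predictor never gets the call site right, because the target alternates on every visit, while one entry of path history separates the two callers and misses only on the two cold lookups.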
Real-world example: An interpreter written in C++ might run its main loop through one computed goto: goto *handlers[opcode]. With that single indirect jump, the BTB sees one branch site with hundreds of targets, which predicts terribly. The classic optimization is threaded code: duplicate the dispatch at the end of every opcode handler. Now the BTB sees N distinct branch sites, each with its own correlated pattern (after ADD, the next opcode is often STORE). CPython has shipped computed-goto dispatch since Python 3.1, whose release notes cite speedups of up to 20% depending on the system, compiler, and benchmark.
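The difference between central and threaded dispatch can be approximated with a small simulation. This is a sketch under strong assumptions: the predictor is a pure last-target table with no history, the opcode names and loop body are made up, and each handler is treated as exactly one branch site. Predictors that use history do better on the central case than this model suggests.

```python
def count_hits(opcodes, threaded):
    """Count correct predictions of a last-target BTB over an opcode stream.

    Central dispatch: every dispatch uses the same branch site.
    Threaded dispatch: the site is the handler of the opcode that just ran.
    """
    last_target = {}  # branch site -> most recently observed next opcode
    hits = 0
    for prev_op, next_op in zip(opcodes, opcodes[1:]):
        site = prev_op if threaded else "central_dispatch"
        hits += last_target.get(site) == next_op
        last_target[site] = next_op
    return hits


# Hypothetical bytecode loop, executed 100 times.
loop_body = ["LOAD", "ADD", "STORE", "LOAD", "ADD", "STORE", "JUMP"]
opcodes = loop_body * 100

dispatches = len(opcodes) - 1
central_hits = count_hits(opcodes, threaded=False)
threaded_hits = count_hits(opcodes, threaded=True)
```

In this stream no opcode follows itself, so the central site never repeats its previous target, while the threaded sites learn that LOAD leads to ADD, ADD to STORE, and JUMP to LOAD. Only the STORE handler, whose successor alternates between LOAD and JUMP, keeps mispredicting.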
Rule of thumb: An indirect call costs ~1 cycle when the BTB predicts it correctly and ~20 cycles on a miss. If a single call site dispatches to more than ~16 hot targets, threading the dispatch (one indirect jump per handler) can cut the misprediction rate from 50%+ to under 5%.
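The rule of thumb translates into a simple expected-cost calculation. The function below is a back-of-the-envelope sketch that takes the ~1 and ~20 cycle figures above as defaults and ignores everything else that affects real pipelines, such as overlap with other work.

```python
def expected_dispatch_cycles(miss_rate, hit_cost=1.0, miss_cost=20.0):
    """Average cycles per indirect branch for a given misprediction rate."""
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return (1.0 - miss_rate) * hit_cost + miss_rate * miss_cost


central = expected_dispatch_cycles(0.50)   # single hot dispatch site
threaded = expected_dispatch_cycles(0.05)  # one dispatch per handler
speedup = central / threaded               # ratio of per-dispatch cost only
```

With these inputs the average dispatch drops from about 10.5 cycles to about 1.95, roughly a 5x reduction in dispatch overhead. The whole-program gain is smaller, since handlers also do real work.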
The BTB is also the attack surface for Spectre v2: an attacker poisons BTB entries to redirect a victim's indirect branch into a gadget, leaking data through cache side channels. Mitigations like IBRS and retpolines exist precisely because the BTB is shared across privilege levels and (historically) across hyperthreads.
