2026-05-21
You already know the Branch Target Buffer caches where direct branches go. But what about indirect branches — call rax, virtual function dispatch, jump tables, function pointers? The CPU can't read the target from the instruction; it has to predict it from history. That predictor is the Indirect Branch Predictor (IBP), and it's the gun that fired Spectre v2.
The IBP keys on the branch's address (plus global history) and stores predicted targets. Crucially, on pre-2018 Intel hardware, the predictor table was shared across privilege levels and across hyperthread siblings. Two unrelated indirect branches that happened to alias to the same predictor entry would train each other.
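To make the aliasing concrete, here is a minimal toy model in C++. It is a sketch, not a description of any real predictor: the table size, the `index` function, and the names `ToyIndirectPredictor` and `prediction_after_training` are all hypothetical, and real hardware mixes in branch history and uses undocumented hashing. The only point it illustrates is that an untagged, shared table lets one branch set the prediction for another.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Toy model of a shared indirect branch predictor: a direct-mapped table
// indexed by the low bits of the branch address. Entries carry no privilege
// or thread tag, which is the property the text describes.
struct ToyIndirectPredictor {
    static constexpr std::size_t kEntries = 256;   // hypothetical table size
    std::array<std::uint64_t, kEntries> target{};  // 0 means "no prediction yet"

    // Hypothetical index function: keep only the low 8 bits of the address.
    static std::size_t index(std::uint64_t branch_addr) {
        return static_cast<std::size_t>(branch_addr % kEntries);
    }

    // What the predictor would guess for this branch.
    std::uint64_t predict(std::uint64_t branch_addr) const {
        return target[index(branch_addr)];
    }

    // Called when a branch resolves: remember where it actually went.
    void train(std::uint64_t branch_addr, std::uint64_t actual_target) {
        target[index(branch_addr)] = actual_target;
    }
};

// Trains the predictor through attacker_branch only, then asks what it
// predicts for victim_branch. If the two addresses alias, the victim
// inherits the attacker's target.
inline std::uint64_t prediction_after_training(std::uint64_t attacker_branch,
                                               std::uint64_t victim_branch,
                                               std::uint64_t gadget) {
    ToyIndirectPredictor predictor;
    predictor.train(attacker_branch, gadget);
    return predictor.predict(victim_branch);
}
```

In this model a userspace branch at 0x401234 and a kernel branch at 0xffffffff81001234 share the low byte 0x34, so training the first sets the prediction for the second. A kernel branch ending in 0x35 is unaffected.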
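The attack: an attacker in userspace finds an indirect branch in the kernel (say, a function pointer call in a syscall path). They train an aliasing branch of their own so that the kernel's branch is predicted to jump to a gadget they chose, then invoke the syscall. The kernel speculatively executes the gadget, whose dependent loads leave a cache footprint the attacker can measure. A typical gadget looks like:
    mov rax, [rdi]; mov rbx, [rax+rcx*8]
The mitigation: retpolines. Instead of emitting call rax, the compiler emits a thunk that uses the Return Stack Buffer (which is per-thread and harder to poison) to redirect control: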
    call set_up_target      ; pushes the address of capture_spec as the return address
capture_spec:
    pause
    lfence
    jmp capture_spec        ; speculation trap: spins harmlessly
set_up_target:
    mov [rsp], rax          ; overwrite the return address with the real target
    ret                     ; RSB predicts capture_spec; architecturally jumps to rax
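The CPU's speculative path goes nowhere useful (the pause; lfence; jmp loop), while the architectural path correctly returns to the address held in rax. Microcode updates and newer CPUs also expose IBRS, IBPB, and STIBP: hardware controls that restrict the predictor after kernel entry (IBRS), flush it with a barrier on context switch (IBPB), and stop hyperthread siblings from steering each other's predictions (STIBP).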
Real-world cost: Linux's retpoline mitigation added roughly 5–25% overhead to syscall-heavy workloads in 2018. Network packet processing took the worst hit because every protocol dispatch is an indirect call. This is why CONFIG_RETPOLINE kernels were measurably slower, and why subsequent CPUs added always-on hardware modes with near-zero cost: Intel's enhanced IBRS (eIBRS, on Cascade Lake and later, including Ice Lake) and AMD's Automatic IBRS (on Zen 4). On that hardware the kernel can skip retpolines.
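To check which Spectre v2 mitigation a particular Linux machine is running with, the kernel publishes a one-line status string in sysfs. The sketch below reads it in C++17. The function name `spectre_v2_status` is hypothetical; the path is the standard sysfs location, but the file is absent on non-Linux systems and very old kernels, so the function returns an empty optional in that case. The wording of the string varies by kernel version, so this sketch does not try to parse it.

```cpp
#include <fstream>
#include <optional>
#include <string>

// Returns the kernel's Spectre v2 status line, or std::nullopt if the file
// cannot be opened or is empty (for example on a non-Linux system).
inline std::optional<std::string> spectre_v2_status(
    const std::string& path =
        "/sys/devices/system/cpu/vulnerabilities/spectre_v2") {
    std::ifstream in(path);
    std::string line;
    if (!in || !std::getline(in, line)) {
        return std::nullopt;
    }
    return line;
}
```

A caller would print the returned string when it is present. On a retpoline kernel the line mentions retpolines, and on hardware using the enhanced or automatic IBRS modes it names that mode instead.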
Rule of thumb: One indirect call costs ~1–2 cycles when predicted correctly, ~15–20 cycles on misprediction, and ~25–40 cycles when wrapped in a retpoline. C++ vtables, function-pointer dispatch tables, and JIT trampolines all pay this tax — devirtualization (LTO, PGO, final classes) isn't just about inlining, it's about removing predictor pressure.
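As an illustration of the devirtualization point, here is a minimal C++ sketch. `Shape`, `Square`, `total_area_virtual`, and `total_area_final` are hypothetical names for this example. Whether the compiler actually turns the second loop into direct or inlined calls depends on the compiler and optimization level, so treat this as showing what `final` permits rather than what it guarantees.

```cpp
#include <cstddef>

struct Shape {
    virtual ~Shape() = default;
    virtual double area() const = 0;
};

// 'final' promises that no class derives from Square, so a call to area()
// through a Square has exactly one possible target.
struct Square final : Shape {
    explicit Square(double s) : side(s) {}
    double area() const override { return side * side; }
    double side;
};

// The dynamic type is unknown here, so each area() call is an indirect
// call through the vtable and relies on the indirect branch predictor.
inline double total_area_virtual(const Shape* const* shapes, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += shapes[i]->area();
    }
    return sum;
}

// The static type is the final class Square, so the compiler is allowed to
// emit a direct call (or inline it) and no predictor entry is needed.
inline double total_area_final(const Square* squares, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += squares[i].area();
    }
    return sum;
}
```

Both functions compute the same result. The difference is only in the kind of call the compiler is permitted to generate, which is where the cycle counts in the rule of thumb come from.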
