The Return Stack Buffer: How the CPU Predicts Function Returns

2026-05-13

You already know branch predictors guess conditional jumps. But ret is an indirect jump — its target is whatever address sits on top of the stack. The CPU can't wait for that load to resolve; the pipeline would stall for dozens of cycles. So it keeps a dedicated predictor just for returns: the Return Stack Buffer (RSB), also called the Return Address Stack.

The RSB is a small hardware stack — typically 16 entries on Intel, 16–32 on AMD, 8 on older ARM cores. Every time the CPU decodes a call, it pushes the predicted return address (the instruction after the call) onto the RSB. Every ret pops from it and speculatively jumps there. If the actual return address (loaded from memory later) matches, you pay nothing. If it doesn't, the pipeline flushes — 15–25 cycles wasted.
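The push-on-call, pop-on-ret behavior described above can be sketched as a toy software model. This is only an illustration: the class name, the addresses, and the choice to count an empty buffer as a miss are assumptions made for the sketch, not a description of any specific CPU.

```python
from collections import deque


class ReturnStackBuffer:
    """Toy model of an RSB: a fixed-depth stack of predicted return addresses."""

    def __init__(self, depth=16):
        # A deque with maxlen drops its oldest entry on overflow,
        # similar in spirit to a small circular hardware buffer.
        self.entries = deque(maxlen=depth)
        self.hits = 0
        self.misses = 0

    def call(self, return_address):
        """Model a call: push the address of the instruction after the call."""
        self.entries.append(return_address)

    def ret(self, actual_return_address):
        """Model a ret: pop a prediction and compare it with the real target."""
        predicted = self.entries.pop() if self.entries else None
        if predicted == actual_return_address:
            self.hits += 1
        else:
            self.misses += 1  # assumption: an empty RSB counts as a miss
        return predicted


rsb = ReturnStackBuffer(depth=16)
software_stack = []  # the real return addresses, as stored in memory

# Three nested calls, then three returns in the matching order.
for address in (0x1004, 0x2008, 0x300C):  # made-up addresses
    software_stack.append(address)
    rsb.call(address)

while software_stack:
    rsb.ret(software_stack.pop())

print(rsb.hits, rsb.misses)  # 3 0
```

As long as calls and returns pair up and the chain fits in the buffer, every prediction matches the address on the real stack.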

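Where this breaks: call chains deeper than the RSB, calls and returns that don't pair up, and code that switches stacks behind the CPU's back.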

Real example: Go's goroutine scheduler used to do stack switches via assembly that confused the RSB. Switching between two goroutines would mispredict every return until the RSB drained. The fix was an explicit call/ret dance to keep the RSB balanced — measurable throughput gains on call-heavy workloads.
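To see why a stack switch hurts, here is a toy replay of the general effect. It is not Go's runtime code: the two tasks, their addresses, and the event format are all hypothetical, and the RSB is modeled as a plain list with no depth limit to keep the focus on pairing.

```python
def run(events):
    """Replay (op, address) events against one shared RSB; count ret misses."""
    rsb, misses = [], 0
    for op, address in events:
        if op == "call":
            rsb.append(address)
        elif op == "ret":
            predicted = rsb.pop() if rsb else None
            misses += predicted != address
    return misses


# Task A and task B each have their own software stack of return addresses.
a_frames = [0xA000, 0xA010, 0xA020]
b_frames = [0xB000, 0xB010, 0xB020]

# B was parked earlier, three calls deep. A now makes three calls, then the
# scheduler swaps the stack pointer to B's stack (no call or ret involved),
# and B unwinds its three frames against A's RSB entries.
switched = [("call", a) for a in a_frames]
switched += [("ret", b) for b in reversed(b_frames)]

# Control case: A unwinds its own frames, so calls and returns pair up.
balanced = [("call", a) for a in a_frames]
balanced += [("ret", a) for a in reversed(a_frames)]

print(run(switched), run(balanced))  # 3 0
```

The hardware stack only sees call and ret instructions, so swapping the software stack underneath it leaves predictions that belong to the wrong task.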

Spectre-RSB / Retbleed connection: Attackers learned to poison RSB entries so that returns speculatively jump to attacker-chosen gadgets. Mitigations such as RSB stuffing, which overwrites all 16 entries with a known safe address on a context switch, exist for this reason. Stuffing has a cost: after the switch, returns that unwind frames created earlier pop a stuffed entry instead of their real target, so up to the first ~16 of them can mispredict even when your real call chain is shallower than the RSB.
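A rough sketch of that trade-off, using a plain list as the RSB. The addresses, the five-frame victim, and the helper names are hypothetical, and real mitigations are implemented in assembly rather than anything like this.

```python
RSB_DEPTH = 16
GADGET = 0x41414141  # hypothetical attacker-chosen gadget address
BENIGN = 0x0BAD0000  # hypothetical address of a harmless speculation trap


def stuff(rsb):
    """Model RSB stuffing: overwrite every slot with the benign address."""
    rsb[:] = [BENIGN] * RSB_DEPTH


def run_returns(rsb, real_targets):
    """Pop one prediction per return; report where speculation went."""
    speculated = []
    for _ in real_targets:
        speculated.append(rsb.pop() if rsb else None)
    return speculated


# The victim resumes and unwinds five frames it created before the switch.
victim_targets = [0x5000, 0x5010, 0x5020, 0x5030, 0x5040]

# Without stuffing: entries left behind by the attacker steer speculation.
poisoned = [GADGET] * RSB_DEPTH
unsafe = run_returns(poisoned, victim_targets)

# With stuffing: the same returns speculate to the benign address instead.
stuffed = [GADGET] * RSB_DEPTH
stuff(stuffed)
safe = run_returns(stuffed, victim_targets)

mispredicted = sum(s != t for s, t in zip(safe, victim_targets))
print(GADGET in unsafe, GADGET in safe, mispredicted)  # True False 5
```

In this model stuffing removes the attacker's influence but does not make the predictions correct: all five returns still miss, they just miss somewhere harmless.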

Rule of thumb: If your hot path has a call chain deeper than ~15 frames and returns frequently, you're paying for RSB underflow. Inline aggressively, flatten recursion into iteration, or tail-call where possible. On CPUs that expose a return-misprediction event, perf stat -e br_misp_retired.return will show you the count directly; event names vary by microarchitecture, so check perf list for the one your CPU supports.
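For a back-of-the-envelope feel for underflow, the sketch below nests a given number of calls, unwinds them all, and counts how many returns find no entry. The 16-entry depth and 20-cycle penalty are the ballpark figures used in this post, the addresses are made up, and treating an empty RSB as a guaranteed miss is a simplifying assumption (some cores fall back to another predictor).

```python
from collections import deque


def unwind_cost(call_depth, rsb_depth=16, flush_penalty=20):
    """Nest call_depth calls, return from all of them, and count RSB misses."""
    rsb = deque(maxlen=rsb_depth)  # oldest entry is dropped on overflow
    real_stack = []
    for frame in range(call_depth):
        return_address = 0x1000 + 4 * frame  # made-up addresses
        real_stack.append(return_address)
        rsb.append(return_address)

    misses = 0
    while real_stack:
        actual = real_stack.pop()
        predicted = rsb.pop() if rsb else None  # assumption: empty RSB = miss
        if predicted != actual:
            misses += 1
    return misses, misses * flush_penalty


for depth in (8, 16, 24, 64):
    misses, cycles = unwind_cost(depth)
    print(f"depth {depth:3d}: {misses:2d} mispredicted returns, ~{cycles} cycles")
```

This models one full descent and unwind. Code that bounces between the innermost few frames keeps hitting the RSB, so the penalty only shows up when the chain actually unwinds past the buffer's depth.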

See it in action: Check out Return Address Stack (RAS) - Georgia Tech - HPCA: Part 1 by Udacity to see this theory applied.
Key Takeaway: The CPU predicts ret with a tiny hardware stack, typically 16 entries deep. Exceed its depth, mismatch calls and returns, or do stack tricks, and the affected returns each cost a pipeline flush of roughly 20 cycles.
