Daily Low-Level Programming: The PAUSE Instruction: Why Spin Loops Need a Hint to the CPU

2026-06-05

You wrote a spinlock. It works. But your benchmark shows the contended case is 10x slower than it should be, and your power draw spiked. The fix is one instruction: PAUSE (encoded as F3 90, which is REP NOP — older CPUs decode it as a no-op).

What the CPU thinks a spin loop is. To the out-of-order engine, a tight loop reading a memory location looks like a stream of independent loads, so it speculatively issues dozens of loads of the same address ahead of the loop-exit branch. When the value finally changes (another core writes it), the core detects a possible memory order violation and triggers a machine clear: every speculative load that observed the stale value has to be flushed and re-executed. That flush costs ~50-100 cycles on the way out of the loop, and the speculative work burns power for as long as you spin.

What PAUSE does. Three things, none of which is an architectural guarantee but all of which you can observe on real hardware:

De-pipelines the loop. Tells the CPU "don't speculate ahead" — only one load of the lock variable is in flight at a time.
Yields SMT resources. On a hyperthreaded core, PAUSE hands fetch/decode bandwidth to the sibling thread. Without it, your spin loop starves the other logical CPU sharing your physical core.
Inserts a delay. Pre-Skylake: ~10 cycles. Skylake+: ~140 cycles. Ice Lake reverted to ~40. The exact number is microarchitecture-dependent and that matters — code tuned for one generation can spin too long on another.
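To make the three effects concrete, here is a minimal sketch of the kind of wait loop they apply to, written in Rust. The names wait_for_flag and demo_wait are hypothetical, made up for this illustration. On x86-64, std::hint::spin_loop() lowers to PAUSE; what it emits on other targets is up to the standard library.

```rust
use std::hint::spin_loop;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

/// Spins until `flag` becomes true, issuing the spin-loop hint after every
/// failed check. Returns how many failed checks happened (illustrative only).
pub fn wait_for_flag(flag: &AtomicBool) -> u64 {
    let mut spins = 0u64;
    while !flag.load(Ordering::Acquire) {
        // On x86-64 this is PAUSE: it discourages the core from piling up
        // speculative loads of `flag` and gives an SMT sibling room to run.
        spin_loop();
        spins += 1;
    }
    spins
}

/// Hypothetical demo: one thread sets the flag, the caller spins on it.
pub fn demo_wait() -> bool {
    let flag = Arc::new(AtomicBool::new(false));
    let setter = {
        let flag = Arc::clone(&flag);
        thread::spawn(move || flag.store(true, Ordering::Release))
    };
    wait_for_flag(&flag);
    setter.join().unwrap();
    flag.load(Ordering::Acquire)
}
```

Calling demo_wait() returns true once the setter thread has published the flag. The spin count returned by wait_for_flag varies from run to run, so treat it as a curiosity, not a benchmark.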

Concrete example. Linux's arch_spin_lock on x86 uses rep; nop (PAUSE) in its wait loop. Intel measured the uncontended-to-contended transition: removing PAUSE made the lock release latency jump from ~25ns to ~250ns on a 2-socket system, because the memory order violation flush ran every time the writer's invalidate reached the spinner.

The rule of thumb. Any loop that reads a memory location while waiting for another thread to write it needs PAUSE in the body. In C on x86: __builtin_ia32_pause() (GCC/Clang) or _mm_pause() (from <immintrin.h>). In Rust: std::hint::spin_loop(). ARM has a roughly equivalent hint, YIELD (and ARMv8.7 added WFET, wait-for-event-with-timeout, for the same purpose). If you spin without a hint, expect ~10x worse release latency and a measurable power penalty on laptops.
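Applied to a lock, the rule of thumb might look like the following test-and-test-and-set sketch. SpinLock and contended_count are hypothetical names for this illustration, not taken from any library, and this is not how Linux or any production lock is implemented. The point is only where the hint goes: in the read-only wait loop, between attempts to take the lock.

```rust
use std::hint::spin_loop;
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Minimal test-and-test-and-set spinlock (illustrative, not production-grade).
pub struct SpinLock {
    locked: AtomicBool,
}

impl SpinLock {
    pub const fn new() -> Self {
        Self { locked: AtomicBool::new(false) }
    }

    pub fn lock(&self) {
        loop {
            // Attempt to take the lock with a single atomic read-modify-write.
            if self
                .locked
                .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
                .is_ok()
            {
                return;
            }
            // Lock is held (or the weak CAS failed spuriously): wait with
            // plain loads, and hint the CPU on every iteration.
            while self.locked.load(Ordering::Relaxed) {
                spin_loop();
            }
        }
    }

    pub fn unlock(&self) {
        self.locked.store(false, Ordering::Release);
    }
}

/// Hypothetical stress helper: `threads` threads each bump a shared counter
/// `iters` times while holding the lock, then the total is returned.
pub fn contended_count(threads: usize, iters: u64) -> u64 {
    let lock = Arc::new(SpinLock::new());
    let counter = Arc::new(AtomicU64::new(0));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let lock = Arc::clone(&lock);
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..iters {
                    lock.lock();
                    // Separate load and store: only correct because the
                    // lock serializes the critical section.
                    let v = counter.load(Ordering::Relaxed);
                    counter.store(v + 1, Ordering::Relaxed);
                    lock.unlock();
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    counter.load(Ordering::Relaxed)
}
```

If the lock is correct, contended_count(4, 5_000) returns 20_000. To see the cost the article describes, you would time this with and without the spin_loop() call on your own hardware; no numbers are claimed here.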

The Skylake gotcha. When Intel bumped PAUSE from ~10 to ~140 cycles, real-world code regressed. Lock implementations that called PAUSE a fixed number of times in a tight backoff loop suddenly waited far too long before checking the lock again, missing wake-up windows. If you maintain a userspace lock library, tune your backoff schedule against a measured PAUSE latency, not an assumed one.
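One way to avoid hard-coding a PAUSE count is to express backoff as a time budget and convert it to a hint count at startup. The sketch below is an assumption-laden illustration: measure_pause_ns and pauses_for_budget are hypothetical helpers, the measurement uses wall-clock time (std::time::Instant) instead of cycle counters, and the clamp bounds are arbitrary.

```rust
use std::hint::spin_loop;
use std::time::{Duration, Instant};

/// Estimates the average cost of one spin-loop hint in nanoseconds by timing
/// `samples` back-to-back hints. Includes loop overhead, so it is approximate.
pub fn measure_pause_ns(samples: u32) -> f64 {
    let samples = samples.max(1); // avoid dividing by zero
    let start = Instant::now();
    for _ in 0..samples {
        spin_loop();
    }
    start.elapsed().as_nanos() as f64 / samples as f64
}

/// Converts a backoff time budget into a number of hints, given a measured
/// per-hint cost. The result is clamped to 1..=1_000_000 (arbitrary bounds).
pub fn pauses_for_budget(budget: Duration, pause_ns: f64) -> u32 {
    // Guard against a zero or near-zero measurement on targets where the
    // hint costs almost nothing.
    let per_hint = pause_ns.max(0.1);
    let n = budget.as_nanos() as f64 / per_hint;
    n.clamp(1.0, 1_000_000.0) as u32
}
```

With this shape, a 1 microsecond budget becomes 100 hints if a hint measures 10 ns and far fewer if it measures 140 ns, so the wall-clock wait stays roughly constant across microarchitectures. A typical use would be to call measure_pause_ns once at startup and feed the result into pauses_for_budget for each backoff step.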

Key Takeaway: PAUSE isn't just a no-op. It tells the CPU "this is a spin loop," preventing speculative load pile-ups, yielding SMT bandwidth to the sibling thread, and inserting a microarchitecture-specific delay that is roughly 14x longer on Skylake than on the CPUs before it.