2026-06-05
You wrote a spinlock. It works. But your benchmark shows the contended case running 10x slower than it should, and your power draw has spiked. The fix is one instruction: PAUSE (encoded as F3 90, which is REP NOP, so CPUs that predate it execute it as a plain no-op).
What the CPU thinks a spin loop is. To the out-of-order engine, a tight loop reading a memory location looks like a stream of independent loads it can run ahead on, so it pipelines dozens of speculative loads of the same address. When the value finally changes (another core writes it), a memory-order machine clear fires: every in-flight speculative load has to be flushed and re-executed, because each one observed the stale value. That flush costs ~50-100 cycles, and the speculation burns power the whole time you are spinning.
What PAUSE does. Three things, none of which are documented as guarantees but all of which are real:
Concrete example. Linux's arch_spin_lock on x86 uses rep; nop (PAUSE) in its wait loop. Intel measured the uncontended-to-contended transition: removing PAUSE made the lock release latency jump from ~25ns to ~250ns on a 2-socket system, because the memory-order-violation flush ran every time the writer's invalidate reached the spinner.
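If you want to see the effect on your own hardware, a rough harness like the one below is enough to start with. It is a sketch, not Intel's methodology: two threads bounce a flag back and forth, and you time the same loop with and without the hint. The names ping_pong, wait_for, no_hint and with_hint are made up for this post, and the numbers you get will depend heavily on core placement, SMT and frequency scaling.

```rust
use std::hint;
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

/// Spins until `flag` equals `expected`, calling `relax` after every failed check.
fn wait_for(flag: &AtomicU32, expected: u32, relax: fn()) {
    while flag.load(Ordering::Acquire) != expected {
        relax();
    }
}

/// Hypothetical harness: bounces a flag between two threads `round_trips`
/// times and returns the total elapsed time (thread start-up included).
pub fn ping_pong(round_trips: u32, relax: fn()) -> Duration {
    let flag = Arc::new(AtomicU32::new(0));
    let partner_flag = Arc::clone(&flag);

    // Partner: wait for 1, answer with 0.
    let partner = thread::spawn(move || {
        for _ in 0..round_trips {
            wait_for(&partner_flag, 1, relax);
            partner_flag.store(0, Ordering::Release);
        }
    });

    // This thread: send 1, wait for the 0 that answers it.
    let start = Instant::now();
    for _ in 0..round_trips {
        flag.store(1, Ordering::Release);
        wait_for(&flag, 0, relax);
    }
    let elapsed = start.elapsed();

    partner.join().unwrap();
    elapsed
}

/// The naive wait body: nothing between consecutive loads.
pub fn no_hint() {}

/// The wait body with the spin-loop hint (PAUSE on x86).
pub fn with_hint() {
    hint::spin_loop();
}
```

Call ping_pong(1_000_000, no_hint) and ping_pong(1_000_000, with_hint), divide each result by the round-trip count, and compare. Pin the two threads to different physical cores if you want numbers that mean anything.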
The rule of thumb. Any loop that reads a memory location while waiting for another thread to write it needs PAUSE (or the platform's equivalent) in the body. In C: __builtin_ia32_pause() or _mm_pause() from <immintrin.h>. In Rust: std::hint::spin_loop(), which emits PAUSE on x86. ARM's closest equivalent is the YIELD instruction (and ARMv8.7 added WFET, wait-for-event-with-timeout, for the same purpose). If you spin without it, expect ~10x worse release latency and a measurable power penalty on laptops.
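Here is what that rule looks like in a lock, as a minimal sketch rather than something to ship. SpinLock, with_lock and contended_count are names invented for this example. The lock is a plain test-and-test-and-set design with no backoff, no fairness, and no recovery if the closure panics while holding it.

```rust
use std::cell::UnsafeCell;
use std::hint;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

/// Illustrative test-and-test-and-set spinlock. Not production-ready:
/// a panic inside `with_lock` leaves the lock held forever.
pub struct SpinLock<T> {
    locked: AtomicBool,
    value: UnsafeCell<T>,
}

// Safety: all access to `value` is serialized by `locked`.
unsafe impl<T: Send> Sync for SpinLock<T> {}

impl<T> SpinLock<T> {
    pub fn new(value: T) -> Self {
        SpinLock {
            locked: AtomicBool::new(false),
            value: UnsafeCell::new(value),
        }
    }

    /// Runs `f` with exclusive access to the protected value.
    pub fn with_lock<R>(&self, f: impl FnOnce(&mut T) -> R) -> R {
        loop {
            // Attempt the acquire with a single atomic read-modify-write.
            if self
                .locked
                .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
                .is_ok()
            {
                break;
            }
            // Lock is held: wait with plain loads, hinting on every iteration.
            // On x86 this hint compiles to PAUSE.
            while self.locked.load(Ordering::Relaxed) {
                hint::spin_loop();
            }
        }
        // Safety: we hold the lock, so no other thread can reach `value`.
        let result = f(unsafe { &mut *self.value.get() });
        self.locked.store(false, Ordering::Release);
        result
    }
}

/// Hypothetical demo: `threads` workers each increment a shared counter
/// `iters` times under the lock; returns the final count.
pub fn contended_count(threads: usize, iters: u64) -> u64 {
    let lock = Arc::new(SpinLock::new(0u64));
    let workers: Vec<_> = (0..threads)
        .map(|_| {
            let lock = Arc::clone(&lock);
            thread::spawn(move || {
                for _ in 0..iters {
                    lock.with_lock(|n| *n += 1);
                }
            })
        })
        .collect();
    for worker in workers {
        worker.join().unwrap();
    }
    lock.with_lock(|n| *n)
}
```

The inner while loop is the part the rule is about: it only reads, so the cache line can stay shared between spinners, and the hint sits between every pair of reads.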
The Skylake gotcha. With Skylake, Intel raised PAUSE latency from about 10 cycles to as many as 140, and real-world code regressed. Lock implementations that issued a fixed number of PAUSEs per backoff step suddenly waited roughly 14x longer before checking the lock again, missing the window in which it was free. If you maintain a userspace lock library, your backoff schedule needs to be tuned against measured PAUSE latency, not assumed.
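One way to tune against measured latency is to express the backoff cap as a time budget and convert it to a PAUSE count at startup. The sketch below assumes that approach. measure_pause_ns, pauses_for_budget, Backoff and calibrated_backoff are hypothetical names, and the wall-clock measurement is crude (it includes loop overhead and is sensitive to frequency scaling), so treat it as a starting point.

```rust
use std::hint;
use std::time::Instant;

/// Estimates the cost of one spin-loop hint, in nanoseconds, by timing `iters` of them.
pub fn measure_pause_ns(iters: u32) -> f64 {
    let iters = iters.max(1);
    let start = Instant::now();
    for _ in 0..iters {
        hint::spin_loop();
    }
    start.elapsed().as_nanos() as f64 / f64::from(iters)
}

/// Converts a wait budget in nanoseconds into a number of hints,
/// given a measured per-hint cost.
pub fn pauses_for_budget(budget_ns: f64, pause_ns: f64) -> u32 {
    // Floor the per-hint cost so a zero or noisy measurement cannot divide by zero.
    let per_pause = pause_ns.max(0.1);
    // Always allow at least one hint, and cap the count at an arbitrary upper bound.
    (budget_ns / per_pause).floor().clamp(1.0, 1_000_000.0) as u32
}

/// Exponential backoff whose longest step is `cap` hints.
pub struct Backoff {
    next: u32,
    cap: u32,
}

impl Backoff {
    pub fn new(cap: u32) -> Self {
        Backoff { next: 1, cap: cap.max(1) }
    }

    /// Issues the current number of hints, doubles the next step up to the cap,
    /// and returns how many hints were issued.
    pub fn spin(&mut self) -> u32 {
        let issued = self.next;
        for _ in 0..issued {
            hint::spin_loop();
        }
        self.next = self.next.saturating_mul(2).min(self.cap);
        issued
    }
}

/// Builds a backoff whose longest step is roughly `max_wait_ns` on this machine.
pub fn calibrated_backoff(max_wait_ns: f64) -> Backoff {
    Backoff::new(pauses_for_budget(max_wait_ns, measure_pause_ns(100_000)))
}
```

With a fixed time budget, a machine where the hint costs 14x more ends up with a cap 14x smaller, so the longest wait between lock checks stays about the same across CPU generations.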
