2026-06-04
When a core has nothing to do, spinning on a memory location burns power and starves the SMT sibling thread of execution resources. MONITOR/MWAIT is the hardware mechanism that lets a core sleep until a specific cache line is written, without polling.
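The protocol is two instructions: MONITOR arms the address-monitoring hardware on the cache line containing the address in RAX, and MWAIT then puts the core into an optimized wait state until that line is written or an interrupt arrives. Between the two, the waiter re-reads the location, because a write that landed before MONITOR was armed generates no wakeup.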
The clever part: there's no polling loop and no syscall. The wake signal piggybacks on the cache coherence traffic that would happen anyway when another core writes to that line. A write from another core sends an invalidation message to this core's L1; the monitor logic sees the invalidation and breaks out of MWAIT.
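A minimal user-space sketch of the arm, re-check, wait pattern behind this mechanism. MONITOR/MWAIT themselves are ring-0 only, so the sketch uses the user-mode UMONITOR/UMWAIT intrinsics instead. Assumptions: GCC or Clang on x86-64; the names has_waitpkg, wait_umwait, wait_spin and wait_for_change are made up for illustration; the 100,000-cycle timeout is arbitrary; CPUs without WAITPKG fall back to a PAUSE spin.

```c
#include <assert.h>
#include <cpuid.h>      /* __get_cpuid_count */
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h>  /* _umonitor, _umwait, _mm_pause, __rdtsc */

/* CPUID.(EAX=7,ECX=0):ECX bit 5 enumerates WAITPKG (UMONITOR/UMWAIT). */
static int has_waitpkg(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return (ecx >> 5) & 1;
}

/* Sleep-wait path; only called when has_waitpkg() returned 1. */
__attribute__((target("waitpkg")))
static uint32_t wait_umwait(_Atomic uint32_t *word, uint32_t old)
{
    uint32_t cur;

    while ((cur = atomic_load_explicit(word, memory_order_acquire)) == old) {
        /* 1. Arm the monitor on the line that holds *word. */
        _umonitor((void *)word);
        /* 2. Re-check: a write that landed before arming sends no wakeup. */
        if (atomic_load_explicit(word, memory_order_acquire) != old)
            continue;
        /* 3. Wait for a write, an interrupt, or the TSC deadline.
         *    Control bit 0 = 1 requests the lighter C0.1 state.
         *    The timeout value is an arbitrary choice for this sketch. */
        _umwait(1, __rdtsc() + 100000);
    }
    return cur;
}

/* Fallback for CPUs without WAITPKG: classic spin with PAUSE. */
static uint32_t wait_spin(_Atomic uint32_t *word, uint32_t old)
{
    uint32_t cur;

    while ((cur = atomic_load_explicit(word, memory_order_acquire)) == old)
        _mm_pause();
    return cur;
}

/* Block until *word differs from old; returns the new value. */
uint32_t wait_for_change(_Atomic uint32_t *word, uint32_t old)
{
    assert(word != NULL);
    return has_waitpkg() ? wait_umwait(word, old) : wait_spin(word, old);
}
```

The outer loop re-reads the word after every wakeup, so a timeout or a spurious wake simply re-arms the monitor and waits again.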
Real-world example: DPDK and Spinlocks. DPDK's poll-mode drivers traditionally spin at 100% CPU waiting for packets. Modern DPDK offers rte_power_monitor(), which wraps UMONITOR/UMWAIT (the user-mode variants from the WAITPKG extension, first shipped in Tremont and later in Alder Lake and Sapphire Rapids) to sleep on the status word of the next RX descriptor, the one at the ring's software tail index. When the NIC DMAs a completed descriptor into that slot, the cache line invalidation wakes the core in ~50ns, comparable to spinning, but the core drops to ~5W instead of ~25W. Across a 64-core packet processor, that's 64 × 20W ≈ 1.3kW saved at idle.
The Linux kernel uses MWAIT in cpuidle: when a CPU goes idle, it executes MWAIT with a C-state hint chosen by the menu/teo governor based on predicted idle duration. The mwait_idle_with_hints() function in arch/x86/include/asm/mwait.h is the entry point.
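The hint itself is a small bit field in EAX: bits 7:4 select the target C-state index (0 means C1) and bits 3:0 select a sub-state. A minimal sketch of packing and unpacking it follows. The helper names are made up for illustration; the kernel header has its own macros for the same job (MWAIT_HINT2CSTATE and MWAIT_HINT2SUBSTATE). Which named C-state a given index maps to is model-specific, so treat the numbers in the usage note as examples only.

```c
#include <assert.h>
#include <stdint.h>

#define MWAIT_SUBSTATE_BITS 4
#define MWAIT_FIELD_MASK    0xFu

/* Build the EAX hint: bits 7:4 = C-state index, bits 3:0 = sub-state.
 * Index 0 requests C1, index 1 the next deeper state, and so on. */
static inline uint32_t mwait_make_hint(unsigned int cstate_index,
                                       unsigned int substate)
{
    assert(cstate_index <= MWAIT_FIELD_MASK && substate <= MWAIT_FIELD_MASK);
    return (cstate_index << MWAIT_SUBSTATE_BITS) | substate;
}

/* Extract the C-state index (bits 7:4) from a hint. */
static inline unsigned int mwait_hint_cstate_index(uint32_t hint)
{
    return (hint >> MWAIT_SUBSTATE_BITS) & MWAIT_FIELD_MASK;
}

/* Extract the sub-state (bits 3:0) from a hint. */
static inline unsigned int mwait_hint_substate(uint32_t hint)
{
    return hint & MWAIT_FIELD_MASK;
}
```

For instance, mwait_make_hint(0, 0) yields 0x00 (plain C1) and mwait_make_hint(2, 0) yields 0x20.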
Rule of thumb: The wake latency scales with C-state depth.
If your latency budget is under 10µs, force C1 via intel_idle.max_cstate=1 on the kernel command line (or processor.max_cstate=1 when the ACPI idle driver is in use instead of intel_idle). Otherwise the governor will happily put cores into C6 and your tail latencies will explode.
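Before pinning C-states, it helps to see what exit latencies the kernel advertises on the machine at hand. The sketch below walks the cpuidle sysfs directory of one CPU and prints each state's name and exit latency in microseconds. The function name print_cpuidle_latencies is made up for illustration, and the sysfs layout (stateN/name, stateN/latency) is assumed to match a reasonably recent Linux kernel.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Print the name and exit latency of each idle state found under base,
 * e.g. "/sys/devices/system/cpu/cpu0/cpuidle".
 * Returns the number of states printed (0 if the directory is absent). */
int print_cpuidle_latencies(const char *base)
{
    int n = 0;

    assert(base != NULL);
    for (;; n++) {
        char path[256], name[64];
        unsigned int latency_us;
        FILE *f;

        snprintf(path, sizeof path, "%s/state%d/name", base, n);
        f = fopen(path, "r");
        if (!f)
            break;                          /* no more states */
        if (!fgets(name, sizeof name, f))
            name[0] = '\0';
        fclose(f);
        name[strcspn(name, "\n")] = '\0';   /* strip trailing newline */

        snprintf(path, sizeof path, "%s/state%d/latency", base, n);
        f = fopen(path, "r");
        if (!f)
            break;
        if (fscanf(f, "%u", &latency_us) != 1)
            latency_us = 0;
        fclose(f);

        printf("state%d %-8s exit latency %u us\n", n, name, latency_us);
    }
    return n;
}
```

Calling print_cpuidle_latencies("/sys/devices/system/cpu/cpu0/cpuidle") on a bare-metal Linux box typically lists a polling state followed by progressively deeper states; inside containers or VMs the directory may be missing and the function returns 0.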
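One trap: MONITOR's watched region is implementation-defined, often 64 bytes but sometimes 128. CPUID leaf 5 reports the exact sizes: EAX[15:0] is the smallest monitor line and EBX[15:0] the largest. A write to any variable that shares the watched region wakes the core, so false wakeups from adjacent variables are the MWAIT version of false sharing; pad and align the watched word to the largest line size.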
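A minimal sketch of both halves of that advice: querying the monitor line size and giving the watched word a region of its own. It assumes GCC or Clang on x86 (for <cpuid.h>); the names monitor_line, read_monitor_line, wake_word and MONITOR_LINE_MAX are made up for illustration, and the 128-byte constant is a conservative guess that should be checked against what CPUID reports on the target machine.

```c
#include <assert.h>
#include <cpuid.h>   /* __get_cpuid */
#include <stddef.h>
#include <stdint.h>

struct monitor_line {
    unsigned int smallest;  /* CPUID.05H:EAX[15:0], in bytes */
    unsigned int largest;   /* CPUID.05H:EBX[15:0], in bytes */
};

/* Returns 1 and fills *out if MONITOR/MWAIT is enumerated, else 0. */
int read_monitor_line(struct monitor_line *out)
{
    unsigned int eax, ebx, ecx, edx;

    assert(out != NULL);
    /* CPUID.01H:ECX bit 3 says whether MONITOR/MWAIT exists at all. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx) || !(ecx & (1u << 3)))
        return 0;
    if (!__get_cpuid(5, &eax, &ebx, &ecx, &edx))
        return 0;
    out->smallest = eax & 0xFFFFu;
    out->largest  = ebx & 0xFFFFu;
    return 1;
}

/* Assumed upper bound on the monitor line size for this sketch. */
#define MONITOR_LINE_MAX 128

/* The watched word sits alone in its own aligned, padded region, so
 * writes to neighbouring data cannot trigger a wakeup. */
struct wake_word {
    _Alignas(MONITOR_LINE_MAX) volatile uint32_t value;
};
```

Because of the alignment specifier, the compiler pads struct wake_word out to MONITOR_LINE_MAX bytes, so arrays of these structs keep one word per watched region. Hypervisors often hide MONITOR/MWAIT from guests, in which case read_monitor_line returns 0.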
