Daily Low-Level Programming: The MONITOR/MWAIT Instructions: How Idle Cores Wait Without Burning Power

2026-06-04

When a core has nothing to do, spinning on a memory location burns power and starves the SMT sibling thread of execution resources. MONITOR/MWAIT is the hardware mechanism that lets a core sleep until a specific cache line is written, without polling.

The protocol is two instructions:

MONITOR: takes a linear address in RAX (ECX and EDX carry optional extensions and hints, normally zero) and arms a hardware watchpoint on the cache line containing that address. The cache coherence machinery (a MESI-family protocol) will notify this core if any other agent writes to that line.
MWAIT: puts the core into an implementation-defined C-state (C1, C1E, C3, etc., requested through the hint in EAX, with ECX carrying extension flags) until the monitored line is written, an interrupt fires, or an NMI/SMI arrives.
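
The two-step protocol above can be sketched in C with GCC/Clang inline assembly. This is an illustrative sketch, not kernel code: mwait_hint() and wait_for_flag() are hypothetical helper names, and the wait loop only works at ring 0 (in user mode both instructions fault with #UD). The hint layout follows the Intel SDM encoding, where EAX bits 7:4 select the target state (0 means C1, 0xF means C0) and bits 3:0 select the sub-state; how those values map to marketing names such as C3 or C6 is processor-specific.

```c
#include <assert.h>
#include <stdint.h>

/* Build the MWAIT hint passed in EAX.
 * state_field: EAX[7:4], 0 requests C1, 1 the next deeper state, 0xF stays in C0.
 * substate:    EAX[3:0], selects a sub-state such as C1E. */
static inline uint32_t mwait_hint(unsigned state_field, unsigned substate)
{
    assert(state_field <= 0xF && substate <= 0xF);
    return (uint32_t)((state_field << 4) | substate);
}

#if defined(__x86_64__) || defined(__i386__)
/* Ring 0 only. Sleeps until *flag becomes nonzero.
 * The re-check between MONITOR and MWAIT closes the race where the
 * write lands after the first load but before the monitor is armed. */
static inline void wait_for_flag(volatile uint32_t *flag, uint32_t hint)
{
    while (*flag == 0) {
        /* MONITOR: address in RAX/EAX, ECX = extensions (0), EDX = hints (0) */
        __asm__ volatile("monitor" :: "a"(flag), "c"(0), "d"(0) : "memory");
        if (*flag != 0)
            break;
        /* MWAIT: hint in EAX, ECX = extensions (0). May also wake on an
         * interrupt or spuriously, hence the enclosing loop. */
        __asm__ volatile("mwait" :: "a"(hint), "c"(0) : "memory");
    }
}
#endif
```

For example, mwait_hint(0, 0) yields 0x00 (C1) and mwait_hint(0, 1) yields 0x01, the value Linux's intel_idle tables use for C1E on many Intel cores.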

The clever part: there's no polling loop and no syscall. The wake signal piggybacks on the cache coherence traffic that would happen anyway when another core writes to that line. A write from another core sends an invalidation message to this core's L1; the monitor logic sees the invalidation and breaks out of MWAIT.

Real-world example: DPDK and Spinlocks. DPDK's poll-mode drivers traditionally spin at 100% CPU waiting for packets. Modern DPDK provides rte_power_monitor(), which wraps UMONITOR/UMWAIT (the user-mode variants, part of the WAITPKG extension that debuted in Tremont and later shipped in Alder Lake and Sapphire Rapids) to sleep on the cache line holding the RX descriptor at the ring's current tail position. When the NIC writes a completed descriptor back via DMA, the cache line invalidation wakes the core in ~50ns, comparable to spinning, but the core drops to ~5W instead of ~25W. Across a 64-core packet processor, that 20W per core adds up to roughly 1.3 kW saved at idle.
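
The user-mode pattern can be sketched with the compiler intrinsics for UMONITOR/UMWAIT. This is not DPDK's implementation (rte_power_monitor() has its own condition struct and driver callbacks); has_waitpkg(), umwait_once(), and wait_nonzero() are hypothetical names, and the sketch assumes GCC 9+ or a recent Clang. It falls back to a PAUSE spin when CPUID does not report WAITPKG, and to plain bounded polling on non-x86 builds so that it still compiles.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#include <x86intrin.h>

/* WAITPKG is reported in CPUID.(EAX=7,ECX=0):ECX bit 5. */
static int has_waitpkg(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return (ecx >> 5) & 1;
}

/* Arm the monitor, re-check, then sleep until a write or the TSC deadline. */
__attribute__((target("waitpkg")))
static void umwait_once(volatile uint32_t *word, uint64_t tsc_deadline)
{
    _umonitor((void *)word);
    if (*word == 0)                /* the write may have landed before arming */
        _umwait(0, tsc_deadline);  /* control bit 0 clear requests C0.2 */
}

/* Returns 1 once *word is nonzero, 0 if `budget` TSC cycles pass first. */
static int wait_nonzero(volatile uint32_t *word, uint64_t budget)
{
    assert(word);
    const uint64_t deadline = __rdtsc() + budget;
    const int use_umwait = has_waitpkg();

    while (*word == 0) {
        if (__rdtsc() >= deadline)
            return 0;
        if (use_umwait)
            umwait_once(word, deadline);
        else
            _mm_pause();           /* fallback: polite spin */
    }
    return 1;
}
#else
/* Non-x86 builds: bounded polling; `budget` counts polls, not cycles. */
static int wait_nonzero(volatile uint32_t *word, uint64_t budget)
{
    assert(word);
    for (uint64_t i = 0; i <= budget; i++)
        if (*word != 0)
            return 1;
    return 0;
}
#endif
```

A poll-mode loop would call wait_nonzero() on the word the device is expected to write next, then process whatever arrived. The operating system can cap UMWAIT's maximum sleep time, so the loop must tolerate early returns.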

The Linux kernel uses MWAIT in cpuidle: when a CPU goes idle, it executes MWAIT with a C-state hint chosen by the menu/teo governor based on predicted idle duration. The mwait_idle_with_hints() function in arch/x86/include/asm/mwait.h is the entry point.

Rule of thumb: The wake latency scales with C-state depth.

C1 (MWAIT halt): ~1 µs wake — caches retained, voltage held
C3: ~50 µs wake — L1/L2 flushed to L3
C6: ~200 µs wake — core power-gated, full state save to SRAM
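
To compare these ballpark numbers with what a particular machine reports, Linux exposes each cpuidle state under /sys/devices/system/cpu/cpuN/cpuidle/stateM/, including its name and exit latency in microseconds. The helper below is a minimal sketch; read_idle_state() is a hypothetical name, and it simply returns -1 where cpuidle is not exposed (containers, some VMs).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Read one cpuidle state's name and exit latency (µs) from sysfs.
 * Returns 0 on success, -1 if the state or the cpuidle directory is absent. */
static int read_idle_state(int cpu, int state,
                           char *name, size_t cap, long *latency_us)
{
    char path[128];
    FILE *f;

    assert(name && cap > 0 && latency_us);

    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%d/cpuidle/state%d/name", cpu, state);
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(name, (int)cap, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    name[strcspn(name, "\n")] = '\0';   /* strip the trailing newline */

    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%d/cpuidle/state%d/latency", cpu, state);
    f = fopen(path, "r");
    if (!f)
        return -1;
    int n = fscanf(f, "%ld", latency_us);
    fclose(f);
    return n == 1 ? 0 : -1;
}
```

Calling it with state = 0, 1, 2, ... until it returns -1 lists every state the active cpuidle driver registered for that CPU.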

If your latency budget is under 10µs, force C1 via intel_idle.max_cstate=1 on the kernel command line — otherwise the governor will happily put cores into C6 and your tail latencies will explode.
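
The budget reasoning above amounts to picking the deepest state whose wake latency still fits. The sketch below hard-codes the rule-of-thumb figures from this post as an assumption; deepest_within() is a hypothetical helper, and the real menu/teo governors also weigh predicted idle duration against each state's target residency.

```c
#include <assert.h>
#include <stddef.h>

struct idle_state {
    const char *name;
    unsigned wake_us;   /* approximate wake latency in microseconds */
};

/* Ballpark figures from the rule of thumb; real values are per-CPU.
 * Must be ordered from shallowest to deepest. */
static const struct idle_state states[] = {
    { "C1", 1 }, { "C3", 50 }, { "C6", 200 },
};

/* Deepest state whose wake latency fits the budget, or NULL if even
 * the shallowest state is too slow (the caller should keep polling). */
static const struct idle_state *deepest_within(unsigned budget_us)
{
    const struct idle_state *best = NULL;
    for (size_t i = 0; i < sizeof states / sizeof states[0]; i++) {
        assert(i == 0 || states[i - 1].wake_us <= states[i].wake_us);
        if (states[i].wake_us <= budget_us)
            best = &states[i];
    }
    return best;
}
```

With a 10 µs budget this returns C1, which is the outcome the max_cstate setting enforces by brute force.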

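One trap: MONITOR's watched region is implementation-defined, often 64 bytes but sometimes 128. CPUID leaf 5 reports the exact bounds: the smallest monitor-line size in EAX[15:0] and the largest in EBX[15:0]. A write to any other variable that shares the monitored line also wakes the core; these false wakeups are the MWAIT version of false sharing, and the fix is the same: give the wake word its own aligned, padded line.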
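
Querying the monitor-line size and padding the wake word might look like the sketch below. It uses the <cpuid.h> helper shipped with GCC and Clang; monitor_line_sizes() and struct wake_word are hypothetical names, and the 128-byte alignment is an assumption that should be checked against the reported maximum on the target CPU.

```c
#include <assert.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Returns 1 and fills both sizes (in bytes) if CPUID leaf 5 is available,
 * else 0. Zero sizes mean MONITOR/MWAIT is not exposed (common in VMs). */
static int monitor_line_sizes(unsigned *min_bytes, unsigned *max_bytes)
{
    assert(min_bytes && max_bytes);
#if defined(__x86_64__) || defined(__i386__)
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(5, &eax, &ebx, &ecx, &edx))
        return 0;
    *min_bytes = eax & 0xFFFFu;   /* EAX[15:0]: smallest monitor-line size */
    *max_bytes = ebx & 0xFFFFu;   /* EBX[15:0]: largest monitor-line size */
    return 1;
#else
    return 0;                     /* not an x86 build */
#endif
}

/* Wake word padded to a conservative 128 bytes so that no neighbouring
 * variable can share its monitored line and cause false wakeups. */
struct wake_word {
    _Alignas(128) volatile unsigned value;
};
```

Because of the alignment, sizeof(struct wake_word) is 128, so an array of these gives each waiter a private line even on parts that monitor 128 bytes.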

Key Takeaway: MONITOR/MWAIT lets a core sleep on a cache line and wake via the coherence protocol — same wake signal as a spinlock, but at idle power.
