The UMWAIT Instruction: User-Space Idle Without a Syscall

2026-06-05

You've seen MONITOR/MWAIT, the ring-0 instructions that let an idle core sleep until a cache line is written. But what if a user-space thread wants to wait for a memory location without burning a CPU in a PAUSE loop? On Intel parts before 2019, you couldn't: MWAIT faults in ring 3. Intel's Tremont and Tiger Lake cores added the WAITPKG instructions UMONITOR, UMWAIT, and TPAUSE, the user-mode versions.

How it works. UMONITOR rax arms an address-range monitor on the cache line containing [rax]. UMWAIT ecx then puts the logical core into a low-power wait until one of three things happens: (1) a write touches the monitored line, (2) the TSC reaches the deadline in edx:eax, or (3) an interrupt arrives. Bit 0 of the control register (ecx here) picks the requested state: clear means C0.2 (deeper sleep, slower wake, frees more resources for the sibling hyperthread); set means C0.1 (shallower, faster wake). Exact wake latencies are implementation-specific. Wakeups can also be spurious, so always re-check the condition after UMWAIT returns. TPAUSE is the same but without the monitor: just a timed nap.
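
To make the operand layout concrete, here is a minimal sketch of how the control word and the edx:eax deadline could be computed before issuing UMWAIT or TPAUSE. The names (umwait_operands, umwait_encode, ns_to_tsc) are hypothetical, and the TSC frequency is an assumption the caller has to supply; nothing here executes the instructions themselves.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 0 of the control operand: 0 requests C0.2, 1 requests C0.1. */
enum { UMWAIT_CTRL_C02 = 0u, UMWAIT_CTRL_C01 = 1u };

/* Hypothetical container for the three register operands. */
struct umwait_operands {
    uint32_t control; /* goes in the r32 operand (ecx in the text) */
    uint32_t eax;     /* low 32 bits of the TSC deadline */
    uint32_t edx;     /* high 32 bits of the TSC deadline */
};

/* Convert nanoseconds to TSC ticks. tsc_hz is an assumed, caller-supplied
 * frequency. The multiply can overflow for very long waits, which is fine
 * for the microsecond-scale waits discussed here. */
static uint64_t ns_to_tsc(uint64_t ns, uint64_t tsc_hz)
{
    return ns * tsc_hz / 1000000000ull;
}

/* Build the operands for a wait of wait_ticks starting at now_tsc.
 * The deadline is an absolute TSC value, not a relative duration. */
static struct umwait_operands umwait_encode(uint64_t now_tsc,
                                            uint64_t wait_ticks,
                                            bool prefer_fast_wake)
{
    uint64_t deadline = now_tsc + wait_ticks;
    struct umwait_operands ops;
    ops.control = prefer_fast_wake ? UMWAIT_CTRL_C01 : UMWAIT_CTRL_C02;
    ops.eax = (uint32_t)deadline;
    ops.edx = (uint32_t)(deadline >> 32);
    return ops;
}
```

With compiler intrinsics the edx:eax split is handled for you; the point of the sketch is that the deadline is an absolute timestamp and that only bit 0 of the control word carries meaning.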

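The OS controls the ceiling. The IA32_UMWAIT_CONTROL MSR caps how long user code can sleep (Linux defaults to 100,000 TSC cycles, a few tens of microseconds at typical TSC frequencies) and can disable C0.2 altogether. If the requested deadline lies beyond that cap, UMWAIT returns when the cap expires with CF=1. This keeps a thread from parking for long stretches on a core the scheduler wants back.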
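
On Linux, the current cap can be inspected from user space. The sketch below assumes the umwait_control sysfs directory that recent kernels expose on WAITPKG-capable machines; the helper name read_sysfs_u64 is hypothetical, and on systems without the feature the files are simply absent.

```c
#include <stdio.h>

/* sysfs files exposed by Linux on WAITPKG-capable CPUs.
 * max_time is in TSC cycles; 0 means no limit. */
#define UMWAIT_MAX_TIME_PATH   "/sys/devices/system/cpu/umwait_control/max_time"
#define UMWAIT_ENABLE_C02_PATH "/sys/devices/system/cpu/umwait_control/enable_c02"

/* Read one unsigned decimal value from a file.
 * Returns 0 on success, -1 if the file is missing or unparsable. */
static int read_sysfs_u64(const char *path, unsigned long long *out)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    int ok = (fscanf(f, "%llu", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Print the OS-imposed limits, or a note if they are not exposed. */
static void print_umwait_limits(void)
{
    unsigned long long max_time = 0, c02 = 0;
    if (read_sysfs_u64(UMWAIT_MAX_TIME_PATH, &max_time) != 0 ||
        read_sysfs_u64(UMWAIT_ENABLE_C02_PATH, &c02) != 0) {
        printf("umwait_control not exposed on this system\n");
        return;
    }
    printf("max_time: %llu TSC cycles, C0.2 %s\n",
           max_time, c02 ? "enabled" : "disabled");
}
```

A program that wants predictable wait slices could read max_time once at startup and keep its own deadlines below it, so that CF=1 becomes the exception rather than the norm.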

Real-world example: DPDK polling. A DPDK worker thread polls an RX ring descriptor for new packets. The classic loop is while (!desc->done) _mm_pause(); — which burns 100% CPU and shows as a fully loaded core in top. With UMWAIT:
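
A minimal sketch of what such a loop might look like, assuming GCC or Clang on x86 and the WAITPKG intrinsics from the compiler's intrinsic headers. The rx_desc layout, the function name wait_for_desc, and the WAIT_SLICE_TICKS value are hypothetical; this is not DPDK's actual implementation. The caller is assumed to have verified WAITPKG support first, since the instructions fault on CPUs that lack it.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <x86intrin.h>

/* Hypothetical RX descriptor: the producer sets done to nonzero
 * when a packet is ready. */
struct rx_desc {
    _Atomic uint32_t done;
};

#define UMWAIT_C01        1u        /* control bit 0 set: request C0.1 */
#define WAIT_SLICE_TICKS  20000ull  /* assumed per-wait budget in TSC ticks */

/* Wait until desc->done becomes nonzero.
 * Returns how many times UMWAIT was executed. */
__attribute__((target("waitpkg")))
static unsigned wait_for_desc(struct rx_desc *desc)
{
    unsigned sleeps = 0;

    while (!atomic_load_explicit(&desc->done, memory_order_acquire)) {
        /* Arm the monitor on the cache line holding the flag. */
        _umonitor((void *)&desc->done);

        /* Re-check after arming: a write that landed between the first
         * check and UMONITOR would otherwise be missed. */
        if (atomic_load_explicit(&desc->done, memory_order_acquire))
            break;

        /* Sleep until a write, the deadline, or an interrupt. The carry
         * flag result (OS limit hit) is ignored because the loop
         * re-checks the flag anyway. */
        (void)_umwait(UMWAIT_C01, __rdtsc() + WAIT_SLICE_TICKS);
        sleeps++;
    }
    return sleeps;
}
```

The check, arm, re-check, wait ordering is the important part: it closes the race where the producer writes the flag just before the monitor is armed, and it also makes spurious wakeups harmless.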

Power draw drops 30-50% on idle cores, the sibling hyperthread regains ~15% throughput, and packet-arrival latency stays under 10µs. Intel's own measurements on Sapphire Rapids show C0.2 saves ~1.5W per core at idle versus a PAUSE spin.

Rule of thumb. If your spin-wait expects to wait longer than a cache miss but shorter than a syscall (roughly 100ns to 50µs), UMWAIT with C0.1 beats both PAUSE-spinning and futex(). Below 100ns, just PAUSE-spin — the C-state transition costs more than you save. Above 50µs, go to the kernel.
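
The rule of thumb can be written down as a small selection function. This is only a sketch of the thresholds stated above (100ns and 50µs); the enum and function names are hypothetical, and real cutoffs would need measuring on the target hardware.

```c
#include <stdint.h>

/* The three waiting strategies the rule of thumb distinguishes. */
enum wait_strategy {
    WAIT_PAUSE_SPIN,  /* very short waits: spin with PAUSE */
    WAIT_UMWAIT_C01,  /* medium waits: UMWAIT requesting C0.1 */
    WAIT_KERNEL       /* long waits: block in the kernel, e.g. futex() */
};

#define SPIN_LIMIT_NS    100u    /* below this, the state transition costs more than it saves */
#define UMWAIT_LIMIT_NS  50000u  /* above this, go to the kernel */

/* Pick a strategy from the expected wait time in nanoseconds. */
static enum wait_strategy choose_wait(uint64_t expected_wait_ns)
{
    if (expected_wait_ns < SPIN_LIMIT_NS)
        return WAIT_PAUSE_SPIN;
    if (expected_wait_ns <= UMWAIT_LIMIT_NS)
        return WAIT_UMWAIT_C01;
    return WAIT_KERNEL;
}
```

In practice the expected wait is rarely known up front, so a common arrangement is to escalate: spin briefly, then UMWAIT for a bounded number of slices, then fall back to a blocking call.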

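Check CPUID.(EAX=7,ECX=0):ECX[5] (WAITPKG) before using. AMD parts have historically offered the separate MONITORX/MWAITX pair rather than WAITPKG, so test the feature bit instead of assuming support from the vendor or model.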
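
A minimal sketch of that feature check, assuming GCC or Clang and their cpuid.h helper __get_cpuid_count. The function names are hypothetical.

```c
#include <cpuid.h>
#include <stdint.h>

#define WAITPKG_ECX_BIT 5  /* CPUID leaf 7, subleaf 0, ECX bit 5 */

/* Test the WAITPKG bit in a raw ECX value from leaf 7, subleaf 0. */
static int ecx_has_waitpkg(uint32_t ecx)
{
    return (int)((ecx >> WAITPKG_ECX_BIT) & 1u);
}

/* Returns 1 if UMONITOR/UMWAIT/TPAUSE are available, 0 otherwise. */
static int cpu_has_waitpkg(void)
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    /* __get_cpuid_count returns 0 when leaf 7 is not supported. */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return ecx_has_waitpkg(ecx);
}
```

Doing this once at startup and storing the result in a flag or function pointer keeps the check off the hot path.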

Key Takeaway: UMWAIT lets user-space threads sleep on a memory address for microseconds without entering the kernel — closing the gap between PAUSE-spinning and futex() with hardware-level power savings.
