TLB Shootdowns: Why Unmapping Memory on One Core Stalls All the Others

2026-05-27

Every core caches virtual-to-physical translations in its own private TLB. When one core modifies a page table (via munmap, mprotect, madvise(MADV_DONTNEED), swap-out, or COW), every other core that has cached that translation now holds a stale entry. On x86, the hardware does not invalidate TLBs coherently across cores, so the OS must do it in software. The mechanism is called a TLB shootdown.

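The sequence on x86 Linux:

1. The initiating core changes the page table entry.
2. It invalidates its own TLB entry and sends an inter-processor interrupt (IPI) to every other core in the process's mm_cpumask.
3. Each target core takes the interrupt, invalidates the stale entry in its own TLB, and acknowledges.
4. The initiator waits until every target has acknowledged before it continues.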
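
A toy model can make the x86 shootdown protocol concrete. The sketch below is not kernel code: the Core class, the shootdown function, and the dictionary-based TLBs are all hypothetical names invented for illustration, and mm_cpumask is simplified to "every core in the list". It only shows why a page table change on one core forces work on all the others.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    """One CPU core with a private TLB mapping virtual page -> physical frame."""
    core_id: int
    tlb: dict = field(default_factory=dict)
    ipis_handled: int = 0

    def handle_flush_ipi(self, vpage: int) -> bool:
        """Interrupt handler: drop the stale entry, then acknowledge."""
        self.tlb.pop(vpage, None)
        self.ipis_handled += 1
        return True  # the acknowledgement


def shootdown(page_table: dict, cores: list, initiator: Core, vpage: int) -> int:
    """Unmap vpage and invalidate it on every core. Returns the IPIs sent."""
    page_table.pop(vpage, None)      # change the page table
    initiator.tlb.pop(vpage, None)   # local invalidation on the initiator
    # Assumption: every other core is in mm_cpumask and gets an IPI.
    targets = [c for c in cores if c is not initiator]
    acks = [c.handle_flush_ipi(vpage) for c in targets]
    assert all(acks)                 # the initiator waits for every ack
    return len(targets)


# Four cores that have all cached the translation for virtual page 0x42.
page_table = {0x42: 0x1000}
cores = [Core(i, tlb={0x42: 0x1000}) for i in range(4)]

ipis_sent = shootdown(page_table, cores, initiator=cores[0], vpage=0x42)
stale = [c.core_id for c in cores if 0x42 in c.tlb]
print(ipis_sent, stale)  # prints: 3 []
```

In the real kernel the targets run concurrently and the initiator spins while it waits; the sequential loop here stands in for that round-trip.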

The cost: an IPI round-trip is ~1–3 µs on modern Xeons. With 64 cores all running threads of the same process, a single munmap can stall the originator for tens of microseconds or more and steal cycles from every other core. At scale, this overhead can dominate the cost of the unmap itself.

Real-world example: A JVM with 200 threads on a 96-core box calling System.gc(). The collector unmaps reclaimed regions, and each munmap fires shootdowns at every core running the process, up to all 96. Production traces have shown 40% of "GC pause time" being TLB shootdown IPIs, not actual collection work. The same pattern hits Go's scavenger when it releases memory back to the OS.
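
One way to check whether a workload like this is shootdown-heavy is the per-CPU "TLB" row that x86 Linux exposes in /proc/interrupts. The sketch below parses that row from a text snapshot. The function name tlb_shootdown_counts and the sample numbers are made up for illustration, and the row may be absent on other architectures.

```python
def tlb_shootdown_counts(interrupts_text: str) -> list:
    """Return per-CPU TLB shootdown counts from /proc/interrupts text."""
    for line in interrupts_text.splitlines():
        label, _, rest = line.partition(":")
        if label.strip() == "TLB":
            # Numeric fields are per-CPU counters; the trailing words
            # ("TLB shootdowns") are the row description.
            return [int(tok) for tok in rest.split() if tok.isdigit()]
    return []  # no TLB row found


# Illustrative, made-up snapshot in the x86 /proc/interrupts layout.
SAMPLE = """\
           CPU0       CPU1       CPU2       CPU3
 LOC:     812345     799120     801233     790001   Local timer interrupts
 TLB:      15230      14876      15012      14990   TLB shootdowns
"""

counts = tlb_shootdown_counts(SAMPLE)
print(sum(counts), counts)
```

On a live system, pass the contents of /proc/interrupts instead of SAMPLE. Taking two snapshots around a GC cycle and subtracting the totals gives a rough shootdown count for that interval.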

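Rule of thumb: A shootdown costs ~1.5 µs × (active cores in mm_cpumask). At 64 cores that's ~100 µs per unmap. If you're unmapping in a hot loop, batch the operations or use MADV_FREE instead of MADV_DONTNEED. MADV_FREE defers the actual reclaim until the kernel is under memory pressure, which can make it cheaper overall. It may still flush TLB entries for pages it marks clean, so measure before assuming the shootdown is gone.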
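
The rule of thumb is easy to turn into a back-of-the-envelope calculator. The sketch below uses the ~1.5 µs per-core figure from the text. The function names are hypothetical, and it rests on an idealized assumption that each batch of unmaps costs exactly one shootdown, which real kernels only approximate.

```python
def shootdown_cost_us(active_cores: int, per_core_us: float = 1.5) -> float:
    """Estimated cost of one shootdown: ~1.5 us per active core."""
    return per_core_us * active_cores


def total_cost_us(n_unmaps: int, active_cores: int, batch_size: int = 1,
                  per_core_us: float = 1.5) -> float:
    """Estimated cost of n_unmaps when grouped into batches.

    Assumption: each batch triggers exactly one shootdown.
    """
    n_shootdowns = -(-n_unmaps // batch_size)  # ceiling division
    return n_shootdowns * shootdown_cost_us(active_cores, per_core_us)


one = shootdown_cost_us(64)                     # a single unmap at 64 cores
loop = total_cost_us(1000, 64)                  # 1000 unmaps, one at a time
batched = total_cost_us(1000, 64, batch_size=50)
print(one, loop, batched)  # prints: 96.0 96000.0 1920.0
```

Under these assumptions, 1000 unbatched unmaps at 64 cores cost about 96 ms of shootdown time, while batching them 50 at a time brings that down to roughly 2 ms.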

Mitigations the kernel already does: coalescing multiple invalidations into a single full flush (MOV CR3) when the per-page list exceeds tlb_single_page_flush_ceiling (default 33 pages), and PCID tags so a MOV CR3 doesn't blow away unrelated entries. ARM is better off here — TLBI instructions broadcast over the interconnect, no IPI needed — but the stall on the interconnect is real too.
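
The flush-ceiling heuristic boils down to a single comparison. The sketch below models only that decision, using the default of 33 pages quoted above. The function names and the "per-page" / "full" labels are invented for illustration and do not correspond to kernel symbols.

```python
TLB_SINGLE_PAGE_FLUSH_CEILING = 33  # default value quoted in the text


def flush_strategy(n_pages: int,
                   ceiling: int = TLB_SINGLE_PAGE_FLUSH_CEILING) -> str:
    """Pick per-page invalidation or one full flush for a range of pages."""
    return "full" if n_pages > ceiling else "per-page"


def invalidation_ops(n_pages: int,
                     ceiling: int = TLB_SINGLE_PAGE_FLUSH_CEILING) -> int:
    """Flush operations each core performs: one per page, or one full flush."""
    return 1 if flush_strategy(n_pages, ceiling) == "full" else n_pages


for pages in (1, 33, 34, 512):
    print(pages, flush_strategy(pages), invalidation_ops(pages))
```

The trade-off is that a full flush is a single operation but also discards translations that were still valid, so the cores pay for it later in extra TLB misses.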

Key Takeaway: Unmapping memory isn't a local operation — it's a synchronous, all-cores IPI storm whose cost scales linearly with how many cores share your address space.
