2026-05-27
Every core caches virtual-to-physical translations in its own private TLB. When one core modifies a page table (via munmap, mprotect, madvise(MADV_DONTNEED), swap-out, or COW), every other core that might have cached that translation now holds a stale entry. On x86 the hardware does not keep TLBs coherent across cores, so the OS must invalidate them in software. The mechanism is called a TLB shootdown.
The sequence on x86 Linux:
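1. The originating core updates the page table entry and invalidates its own TLB entry with INVLPG.
2. The kernel reads mm_cpumask to find which cores have ever run threads of this address space.
3. It sends an inter-processor interrupt (IPI) to each of those cores, using CALL_FUNCTION_VECTOR or a dedicated TLB vector.
4. Each target core takes the interrupt, runs flush_tlb_func, executes INVLPG (or a full MOV CR3 for batch flushes), and ACKs.

The cost: an IPI round-trip is ~1–3 µs on modern Xeons, and the originator waits for every ACK. With 64 cores all running threads of the same process, a single munmap can stall the originator for tens of microseconds and steal cycles from every other core. At scale this dominates.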
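The sequence above can be sketched as a toy model. This is not kernel code: `Core`, `shootdown`, and the dictionary-based TLB are hypothetical names invented for illustration, and the "IPI" is just a function call. It only shows who gets invalidated and how many ACKs the originator has to wait for.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    """Hypothetical stand-in for one CPU with a private TLB."""
    cpu_id: int
    tlb: dict = field(default_factory=dict)  # virtual page -> physical frame

    def invlpg(self, vpage):
        # Drop a single cached translation, as INVLPG does for one page.
        self.tlb.pop(vpage, None)


def shootdown(cores, mm_cpumask, originator, vpage):
    """Invalidate vpage on the originator, then on every other core in mm_cpumask.

    Returns the number of simulated IPIs, which equals the number of ACKs
    the originator must collect before it can continue.
    """
    cores[originator].invlpg(vpage)  # local flush first
    targets = [cpu for cpu in sorted(mm_cpumask) if cpu != originator]
    acks = 0
    for cpu in targets:
        cores[cpu].invlpg(vpage)  # stands in for flush_tlb_func on the target
        acks += 1                 # target acknowledges
    return acks


# Cores 0-2 have run this address space and cached the page; core 3 has not.
cores = {i: Core(i, {0x1000: 0xA000} if i < 3 else {}) for i in range(4)}
ipis = shootdown(cores, mm_cpumask={0, 1, 2}, originator=0, vpage=0x1000)
print(f"IPIs sent: {ipis}")
```

In the model, the number of IPIs grows with the size of mm_cpumask, not with the number of cores in the machine, which is why a process spread across every core is the worst case.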
Real-world example: a JVM with 200 threads on a 96-core box calling System.gc(). The collector unmaps reclaimed regions, and each munmap fires shootdowns at every other core in mm_cpumask, which here can mean the whole machine. Production traces have shown 40% of "GC pause time" being TLB shootdown IPIs, not actual collection work. The same pattern hits Go's scavenger when it releases memory back to the OS.
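Rule of thumb: a shootdown costs ~1.5 µs × (active cores in mm_cpumask). At 64 cores that's ~100 µs per unmap. Treat this as a pessimistic bound: IPIs to different cores partly overlap, so measured stalls are often lower. If you're unmapping in a hot loop, batch the operations into fewer, larger calls, or consider MADV_FREE instead of MADV_DONTNEED. MADV_FREE defers reclaiming the pages until memory pressure, which avoids the refault and zeroing cost on reuse. On Linux it still flushes the TLB for the range when it clears the dirty bits, so measure before counting on it to remove shootdowns.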
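The rule of thumb is simple enough to write down as a back-of-the-envelope calculator. The sketch below is an estimate only: `shootdown_cost_us` and `loop_cost_us` are hypothetical helper names, the 1.5 µs per-core figure is the one quoted above, and the batching model assumes that one batched call triggers one shootdown regardless of how many pages it covers.

```python
import math


def shootdown_cost_us(active_cores, per_core_us=1.5):
    """Estimated stall for one shootdown: per-core cost times cores in mm_cpumask."""
    return per_core_us * active_cores


def loop_cost_us(n_unmaps, active_cores, batch_size=1, per_core_us=1.5):
    """Estimated total stall for n_unmaps operations, issued batch_size at a time.

    Assumption: each batched call costs a single shootdown.
    """
    calls = math.ceil(n_unmaps / batch_size)
    return calls * shootdown_cost_us(active_cores, per_core_us)


print(shootdown_cost_us(64))                 # 96.0 (about 100 µs)
print(loop_cost_us(1000, 64))                # 96000.0 (about 96 ms unbatched)
print(loop_cost_us(1000, 64, batch_size=32)) # 3072.0 (about 3 ms in 32 batched calls)
```

Under these assumptions, batching 32 unmaps per call cuts the estimated stall by roughly the batch factor, which is the argument for coalescing unmaps in a hot loop.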
Mitigations the kernel already applies: it coalesces multiple invalidations into a single full flush (MOV CR3) when the per-page list exceeds tlb_single_page_flush_ceiling (default 33 pages), and it uses PCID tags so that a MOV CR3 doesn't blow away entries belonging to other address spaces. ARM is better off here: TLBI instructions broadcast over the interconnect, with no IPI needed. The stall on the interconnect is real too, though.
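The coalescing decision can be illustrated with a small sketch. This is a simplified model, not the kernel's actual logic: `plan_flush` is a hypothetical function, and it returns instruction names as strings rather than executing anything. The threshold of 33 is the default quoted above.

```python
TLB_SINGLE_PAGE_FLUSH_CEILING = 33  # default value quoted in the text


def plan_flush(pages, ceiling=TLB_SINGLE_PAGE_FLUSH_CEILING):
    """Choose between per-page invalidation and one full flush.

    pages: list of page-aligned virtual addresses to invalidate.
    Returns the simulated instruction sequence a target core would run.
    """
    if len(pages) > ceiling:
        # Too many pages: one full flush is cheaper than many INVLPGs.
        return ["MOV CR3"]
    return [f"INVLPG {page:#x}" for page in pages]


small = [0x1000 * i for i in range(1, 4)]   # 3 pages
large = [0x1000 * i for i in range(1, 35)]  # 34 pages, above the ceiling
print(plan_flush(small))
print(plan_flush(large))
```

The trade-off in the model is that a full flush costs a fixed amount up front but forces later TLB misses on entries that were still valid, which is why the switch happens only past a threshold.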
