Daily Low-Level Programming: TLB Shootdowns: Why Unmapping Memory on One Core Stalls All the Others


2026-05-27

Every core caches virtual-to-physical translations in its own private TLB. When one core modifies a page table (via munmap, mprotect, madvise(MADV_DONTNEED), swap-out, or a COW fault), every other core that might have cached the old translation now holds a stale entry. On x86 the hardware does not keep TLBs coherent across cores, so the OS must invalidate them in software. That mechanism is called a TLB shootdown.

The sequence on x86 Linux:

1. The initiating core updates the PTE and flushes its own TLB entry with INVLPG.
2. The kernel reads mm_cpumask to find which cores may still hold cached translations for this address space.
3. It sends each of them an inter-processor interrupt (IPI), using vector CALL_FUNCTION_VECTOR or a dedicated TLB vector.
4. Each target core takes the interrupt, runs flush_tlb_func, executes INVLPG (or a full flush via MOV CR3 for large batches), and ACKs.
5. The initiating core spins until all ACKs arrive, then returns to userspace.
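
The steps above can be sketched as a toy model. This is plain Python, not kernel code: the `Core` class, the `shootdown` function, and the dictionary-based TLB and page table are all hypothetical stand-ins, meant only to show which cores get interrupted and which do not.

```python
class Core:
    """Hypothetical stand-in for one CPU with a private TLB."""

    def __init__(self, cpu_id):
        self.cpu_id = cpu_id
        self.tlb = {}            # virtual page -> physical frame
        self.ipis_received = 0

    def invlpg(self, vpage):
        # Drop one cached translation, like INVLPG on a single address.
        self.tlb.pop(vpage, None)


def shootdown(initiator, cores, mm_cpumask, page_table, vpage):
    """Unmap vpage and invalidate it on every core in mm_cpumask.

    Returns the number of IPIs the initiator had to send and wait for.
    """
    # Step 1: update the PTE and flush the local TLB entry.
    page_table.pop(vpage, None)
    initiator.invlpg(vpage)

    # Steps 2-3: pick the remote cores listed in the mask.
    targets = [c for c in cores
               if c.cpu_id in mm_cpumask and c is not initiator]

    # Step 4: each target takes the interrupt and flushes its own entry.
    acks = 0
    for core in targets:
        core.ipis_received += 1
        core.invlpg(vpage)
        acks += 1

    # Step 5: the initiator may only continue once every target has ACKed.
    assert acks == len(targets)
    return acks


# Four cores; cores 0-2 run threads of the process, core 3 runs something else.
cores = [Core(i) for i in range(4)]
page_table = {0x1000: 0xAA000}
for core in cores[:3]:
    core.tlb[0x1000] = 0xAA000
cores[3].tlb[0x1000] = 0xBB000   # same virtual page, different address space

ipis = shootdown(cores[0], cores, {0, 1, 2}, page_table, 0x1000)
print(ipis)                      # 2
print(cores[3].ipis_received)    # 0
```

In this model the loop over targets runs sequentially. On real hardware the targets handle their interrupts concurrently while the initiator spins, so the model shows who pays, not how long they pay.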

The cost: an IPI round-trip is ~1–3 µs on modern Xeons. With 64 cores all running threads of the same process, a single munmap can stall the originator for tens of microseconds and steal cycles from every other core. On many-core machines that unmap frequently, this overhead can dominate the cost of the unmap itself.

Real-world example: a JVM with 200 threads on a 96-core box calls System.gc(). The collector unmaps reclaimed regions, and each munmap sends shootdown IPIs to every other core in the process's mm_cpumask, which here can mean all 96 cores. Production traces have shown 40% of "GC pause time" going to TLB shootdown IPIs rather than actual collection work. The same pattern hits Go's scavenger when it releases memory back to the OS.
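
One rough way to check whether a workload like this is shootdown-heavy on x86 Linux is to sample the `TLB` row of /proc/interrupts before and after the interval of interest. The sketch below parses that row from text. The function names and the two sample snapshots are made up for illustration, and the parser assumes the usual layout of per-CPU counts followed by a "TLB shootdowns" label.

```python
def tlb_shootdowns_per_cpu(interrupts_text):
    """Return the per-CPU counts from the 'TLB:' row, or [] if absent."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "TLB:":
            counts = []
            for field in fields[1:]:
                if not field.isdigit():
                    break        # reached the trailing "TLB shootdowns" label
                counts.append(int(field))
            return counts
    return []


def shootdown_delta(before_text, after_text):
    """Total shootdowns handled across all CPUs between two snapshots."""
    before = tlb_shootdowns_per_cpu(before_text)
    after = tlb_shootdowns_per_cpu(after_text)
    return sum(after) - sum(before)


# Made-up two-CPU snapshots in the /proc/interrupts layout.
SAMPLE_BEFORE = """\
           CPU0       CPU1
 TLB:       1200       3400   TLB shootdowns
"""
SAMPLE_AFTER = """\
           CPU0       CPU1
 TLB:       1500       3900   TLB shootdowns
"""

print(shootdown_delta(SAMPLE_BEFORE, SAMPLE_AFTER))   # 800
```

On a real machine, the two snapshots would come from reading /proc/interrupts around the event being studied, for example just before and just after a forced collection. The counters are system-wide, so other processes on the box add noise.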

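Rule of thumb: budget ~1.5 µs × (active cores in mm_cpumask) per shootdown. At 64 cores that's ~100 µs per unmap. Treat this as a pessimistic bound: target cores handle their IPIs concurrently, so the stall the initiator actually sees is often closer to the tens of microseconds quoted above. If you're unmapping in a hot loop, batch the operations into fewer, larger calls. Switching from MADV_DONTNEED to MADV_FREE can also cut cost, since it defers reclaiming the pages until memory pressure and spares the re-fault when the memory is reused. It still modifies PTEs, though, so don't assume it eliminates the shootdown; measure before relying on it.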
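
To make the arithmetic and the batching advice concrete, here is a minimal sketch using Python's `mmap.madvise` (Python 3.8+, Linux). The helper names and the `demo` function are hypothetical, and the 1.5 µs constant is just the rule-of-thumb figure from the text, not a measurement. It also assumes roughly one shootdown per madvise call, which is a simplification.

```python
import mmap

PAGE = mmap.PAGESIZE
PER_CORE_US = 1.5   # the rule-of-thumb figure from the text, not a measurement


def estimated_shootdown_us(cores, calls=1):
    """Rule-of-thumb stall estimate for `calls` flushing operations."""
    return PER_CORE_US * cores * calls


def release_per_page(buf, npages):
    """Anti-pattern: one madvise call, and potentially one shootdown, per page."""
    for i in range(npages):
        buf.madvise(mmap.MADV_DONTNEED, i * PAGE, PAGE)
    return npages            # number of madvise calls issued


def release_batched(buf, npages):
    """One madvise call covering the whole range."""
    buf.madvise(mmap.MADV_DONTNEED, 0, npages * PAGE)
    return 1


def demo(npages=256):
    # Private anonymous mapping, so MADV_DONTNEED really drops the pages.
    buf = mmap.mmap(-1, npages * PAGE, flags=mmap.MAP_PRIVATE)
    try:
        buf[:] = b"\x01" * len(buf)          # touch every page
        slow_calls = release_per_page(buf, npages)
        buf[:] = b"\x01" * len(buf)          # fault the pages back in
        fast_calls = release_batched(buf, npages)
        first_byte = buf[0]                  # dropped pages read back as zero
    finally:
        buf.close()
    return slow_calls, fast_calls, first_byte


print(estimated_shootdown_us(64))             # 96.0
print(estimated_shootdown_us(64, calls=256))  # 24576.0
```

Under these assumptions, releasing 256 pages one at a time on a 64-core process is estimated at about 24.6 ms of shootdown stall, against about 0.1 ms for the single batched call. A single-threaded Python process only occupies one core, so running `demo()` shows the call pattern rather than real shootdown cost.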

Mitigations the kernel already applies: coalescing multiple invalidations into a single full flush (MOV CR3) when the number of pages to flush exceeds tlb_single_page_flush_ceiling (default 33 pages), and PCID tags so a MOV CR3 doesn't blow away unrelated entries. ARM is better off here, because TLBI instructions broadcast over the interconnect with no IPI needed, but the stall on the interconnect is real too.
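
The coalescing decision can be illustrated with a small sketch. The function name and return shape are hypothetical, and the strict greater-than comparison is an assumption based on the word "exceeds" above. It models only the policy, not the kernel's actual code.

```python
TLB_SINGLE_PAGE_FLUSH_CEILING = 33   # default cited in the text


def choose_flush(npages, ceiling=TLB_SINGLE_PAGE_FLUSH_CEILING):
    """Return ("per-page", n_invlpg) or ("full", 1) for a range of npages."""
    if npages > ceiling:
        return ("full", 1)           # one full flush instead of many INVLPGs
    return ("per-page", npages)      # one INVLPG per page in the range


print(choose_flush(8))     # ('per-page', 8)
print(choose_flush(33))    # ('per-page', 33)
print(choose_flush(34))    # ('full', 1)
```

The trade-off is that a full flush costs one instruction on each target but forces that core to refill translations it still needed, while per-page flushing keeps the rest of the TLB warm at the price of one INVLPG per page.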

Key Takeaway: Unmapping memory isn't a local operation — it's a synchronous, all-cores IPI storm whose cost scales linearly with how many cores share your address space.
