2026-05-25
When you call pthread_kill, invalidate a TLB entry on another core, or wake a thread on a different CPU, something has to physically deliver that message across silicon. That something is the Local APIC (Advanced Programmable Interrupt Controller) — one per logical core — and the messages it sends are Inter-Processor Interrupts (IPIs).
Every core has its own Local APIC, memory-mapped at 0xFEE00000 by default (or accessed via x2APIC MSRs starting at 0x800). To send an IPI, you write to the Interrupt Command Register (ICR): target APIC ID, vector number, and delivery mode. The destination core's APIC raises an interrupt at the specified vector, and the receiving CPU jumps to the IDT entry for that vector — the same mechanism as a device interrupt, just sourced from another core.
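To make the ICR write concrete, here is a rough sketch of how the three fields could be packed into the single 64-bit value used in x2APIC mode. The bit positions follow the Intel SDM layout as best I understand it, and the function name, constants, and example vector 0x40 are made up for illustration. Nothing here touches hardware.

```python
# Sketch only: pack the fields of an x2APIC ICR write into one 64-bit value.
# Bit positions are an assumption based on the Intel SDM layout; verify them
# against the manual before relying on this.

X2APIC_ICR_MSR = 0x830  # MSR address of the ICR in x2APIC mode

# Delivery modes (ICR bits 10:8)
FIXED = 0b000
NMI = 0b100
INIT = 0b101
STARTUP = 0b110


def encode_x2apic_icr(dest_apic_id: int, vector: int, delivery_mode: int = FIXED) -> int:
    """Return the 64-bit value a kernel would write to the ICR MSR."""
    if not 0 <= vector <= 0xFF:
        raise ValueError("vector must fit in 8 bits")
    if not 0 <= dest_apic_id <= 0xFFFFFFFF:
        raise ValueError("x2APIC IDs are 32 bits")
    if delivery_mode not in (FIXED, NMI, INIT, STARTUP):
        raise ValueError("unsupported delivery mode in this sketch")

    value = vector                 # bits 7:0   vector
    value |= delivery_mode << 8    # bits 10:8  delivery mode
    value |= 1 << 14               # bit 14     level = assert
    value |= dest_apic_id << 32    # bits 63:32 destination APIC ID
    return value


# Hypothetical example: fixed-mode IPI to APIC ID 3 on vector 0x40.
print(hex(encode_x2apic_icr(dest_apic_id=3, vector=0x40)))
```

In legacy xAPIC mode the same fields are split across two 32-bit memory-mapped registers, so the packing differs; the sketch covers only the x2APIC case.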
The killer use case: TLB shootdowns. When you munmap() a page, the kernel must invalidate that page's TLB entry on every core that might have cached it. The local core executes invlpg, but invlpg only affects the core that runs it, so remote TLBs can't be flushed directly. Instead the kernel sends an IPI to each affected core. On current Linux x86_64 the flush request rides on the generic function-call IPIs (CALL_FUNCTION_VECTOR, 0xFC, or CALL_FUNCTION_SINGLE_VECTOR, 0xFB); 0xFD is RESCHEDULE_VECTOR, the scheduler's wakeup IPI. Each remote core takes the interrupt, flushes the entry, acks, and resumes. The originating core spins waiting for all acks.
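The shootdown handshake can be sketched as a toy, single-threaded model. This is not how the kernel is written: the Core class, the set standing in for a TLB, and the function names are all hypothetical, and real acks arrive asynchronously while the initiator spins.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    cpu_id: int
    tlb: set = field(default_factory=set)  # virtual pages this core has cached
    ipis_taken: int = 0

    def handle_flush_ipi(self, page: int) -> bool:
        """Toy interrupt handler: drop the entry (invlpg analogue), then ack."""
        self.ipis_taken += 1
        self.tlb.discard(page)
        return True  # the ack


def shootdown(cores, initiator_id, page, targets):
    """Flush `page` locally, then on every remote target; return the ack count."""
    cores[initiator_id].tlb.discard(page)  # local invlpg, no IPI needed
    remote = [cpu for cpu in targets if cpu != initiator_id]
    acks = 0
    for cpu in remote:
        # Stands in for "send IPI, remote core runs its handler".
        if cores[cpu].handle_flush_ipi(page):
            acks += 1
    # A real initiator would spin here until acks == len(remote).
    return acks


# Four cores, each caching the same two pages; core 0 unmaps page 0x1000.
cores = [Core(i, {0x1000, 0x2000}) for i in range(4)]
acks = shootdown(cores, initiator_id=0, page=0x1000, targets=range(4))
print(acks, [sorted(c.tlb) for c in cores])
```

The point of the model is the shape of the cost: one interrupt per remote core, and the initiator cannot proceed until every one of them has answered.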
Why this is expensive: An IPI takes ~1–3 microseconds round-trip, roughly 3,000 to 9,000 cycles at 3 GHz. With 64 cores, a single page unmap can take 50+ microseconds of pure shootdown overhead, with every target core taking a pipeline-flushing interrupt mid-execution. This is why hyperscalers obsess over MADV_FREE (lazy reclaim) instead of munmap(), and why mprotect() on a hot region can tank latency across an entire socket.
Rule of thumb: Cost of an IPI-based operation ≈ 2µs × number_of_target_cores. Munmapping one 4KB page across 32 cores costs ~64µs — more than the unmap itself by orders of magnitude.
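The rule of thumb is simple enough to write down as arithmetic. A minimal sketch, assuming the 2µs per-target figure from above and a hypothetical 3 GHz clock for the cycle conversion; the function names are illustrative.

```python
def shootdown_cost_us(target_cores: int, per_ipi_us: float = 2.0) -> float:
    """Rule-of-thumb cost: per-IPI round trip times the number of target cores."""
    return per_ipi_us * target_cores


def us_to_cycles(us: float, ghz: float = 3.0) -> int:
    """Convert microseconds to cycles at an assumed clock speed (GHz)."""
    return round(us * ghz * 1000)


cost = shootdown_cost_us(32)          # one 4KB page, 32 target cores
print(cost, us_to_cycles(cost))
```

Treat the result as an order-of-magnitude estimate rather than a measurement; the model charges every target the full round trip.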
Delivery modes worth knowing:
NMI: perf sampling; interrupts even with IRQs disabled.
You can watch IPIs live: watch -n1 "grep -E '^ *(TLB|RES|CAL):' /proc/interrupts". The quotes matter: without them the pipe filters watch's own output instead of running inside it. The TLB row counts shootdowns; RES counts scheduler reschedule IPIs; CAL counts function-call IPIs (smp_call_function and friends). High numbers here often explain mysterious tail-latency spikes.
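If you would rather get totals than eyeball per-CPU columns, a small parser does the job. This is a sketch that assumes the usual /proc/interrupts layout (a row label, a colon, one count per CPU, then a description); the function name and the sample text are made up for illustration.

```python
from itertools import takewhile


def ipi_totals(interrupts_text, labels=("TLB", "RES", "CAL")):
    """Sum the per-CPU counts for the selected rows of /proc/interrupts text."""
    totals = {}
    for line in interrupts_text.splitlines():
        label, sep, rest = line.partition(":")
        label = label.strip()
        if not sep or label not in labels:
            continue  # header line or a row we do not care about
        # Per-CPU counts come first; the description text follows them.
        counts = takewhile(str.isdigit, rest.split())
        totals[label] = sum(int(c) for c in counts)
    return totals


# Invented sample in the assumed layout, not real measurements.
SAMPLE = """\
           CPU0       CPU1
  0:         36          0   IO-APIC    2-edge      timer
RES:       1234       5678   Rescheduling interrupts
CAL:        100        200   Function call interrupts
TLB:         10         20   TLB shootdowns
"""

print(ipi_totals(SAMPLE))
```

On a Linux box you would pass in open("/proc/interrupts").read() instead of SAMPLE. Taking two snapshots a second apart and subtracting gives a rate, which is usually more telling than the raw totals.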
