Daily Low-Level Programming: The APIC and Inter-Processor Interrupts: How Cores Talk to Each Other

2026-05-25

When you call pthread_kill, invalidate a TLB entry on another core, or wake a thread on a different CPU, something has to physically deliver that message across silicon. That something is the Local APIC (Advanced Programmable Interrupt Controller) — one per logical core — and the messages it sends are Inter-Processor Interrupts (IPIs).

Every core has its own Local APIC, memory-mapped at 0xFEE00000 by default (or accessed via x2APIC MSRs starting at 0x800). To send an IPI, you write to the Interrupt Command Register (ICR): target APIC ID, vector number, and delivery mode. The destination core's APIC raises an interrupt at the specified vector, and the receiving CPU jumps to the IDT entry for that vector — the same mechanism as a device interrupt, just sourced from another core.
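To make the ICR write concrete, here is a minimal sketch of how the register value could be composed. The bit positions follow the xAPIC layout as commonly documented (vector in bits 0-7, delivery mode in bits 8-10, level-assert in bit 14, destination APIC ID in the top byte of the high dword). The function names are hypothetical, and the code only builds the numbers; it does not touch hardware.

```python
from enum import IntEnum


class DeliveryMode(IntEnum):
    """Delivery mode field of the ICR (bits 8-10)."""
    FIXED = 0b000
    LOWEST_PRIORITY = 0b001
    SMI = 0b010
    NMI = 0b100
    INIT = 0b101
    STARTUP = 0b110


# xAPIC register offsets from the 0xFEE00000 base.
ICR_LOW_OFFSET = 0x300
ICR_HIGH_OFFSET = 0x310
# x2APIC exposes the ICR as a single 64-bit MSR.
X2APIC_ICR_MSR = 0x830

LEVEL_ASSERT = 1 << 14  # set for everything except INIT level de-assert


def encode_xapic_icr(dest_apic_id, vector, mode=DeliveryMode.FIXED):
    """Return (high, low) dwords for a physical-destination IPI (hypothetical helper)."""
    if not 0 <= dest_apic_id <= 0xFF:
        raise ValueError("xAPIC destination IDs are 8 bits")
    if not 0 <= vector <= 0xFF:
        raise ValueError("vector must fit in 8 bits")
    low = vector | (int(mode) << 8) | LEVEL_ASSERT
    high = dest_apic_id << 24
    return high, low


def encode_x2apic_icr(dest_apic_id, vector, mode=DeliveryMode.FIXED):
    """Return the 64-bit value for the x2APIC ICR MSR (destination in bits 32-63)."""
    if not 0 <= dest_apic_id <= 0xFFFFFFFF:
        raise ValueError("x2APIC destination IDs are 32 bits")
    if not 0 <= vector <= 0xFF:
        raise ValueError("vector must fit in 8 bits")
    return (dest_apic_id << 32) | vector | (int(mode) << 8) | LEVEL_ASSERT


if __name__ == "__main__":
    high, low = encode_xapic_icr(dest_apic_id=3, vector=0xFC)
    print(f"write {high:#010x} to base+{ICR_HIGH_OFFSET:#x}, "
          f"then {low:#010x} to base+{ICR_LOW_OFFSET:#x}")
```

In xAPIC mode the high dword is written first, because the write to the low dword is what actually sends the IPI. In x2APIC mode a single MSR write does both.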

The killer use case: TLB shootdowns. When you munmap() a page, the kernel must invalidate that page's TLB entry on every core that might have cached it. The local core runs invlpg, but that instruction only affects the core executing it, so remote TLBs can't be flushed directly. Instead the kernel sends an IPI to each affected core. On current Linux x86_64 this rides on the generic function-call IPI (CALL_FUNCTION_VECTOR, 0xFC); 0xFD is its neighbor RESCHEDULE_VECTOR. Each remote core takes the interrupt, executes invlpg, acks, and resumes. The originating core spins waiting for all acks.
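The send, invalidate, ack, wait pattern described above can be modeled in user space. The following is a toy simulation only, assuming threads stand in for cores, a queue stands in for IPI delivery, and a set stands in for the TLB; the Core class and shootdown function are hypothetical names, not kernel APIs.

```python
import queue
import threading


class Core(threading.Thread):
    """Toy model of a CPU core: a TLB (set of pages) plus an IPI mailbox."""

    def __init__(self, core_id, cached_pages):
        super().__init__(daemon=True)
        self.core_id = core_id
        self.tlb = set(cached_pages)
        self.mailbox = queue.Queue()

    def run(self):
        # "Interrupt handler" loop: take a request, invalidate, ack.
        while True:
            msg = self.mailbox.get()
            if msg is None:  # shutdown sentinel
                return
            page, ack = msg
            self.tlb.discard(page)  # stands in for invlpg
            ack.set()               # tell the initiator we are done


def shootdown(initiator, remotes, page):
    """Invalidate `page` locally, then on every remote core, and wait for acks."""
    initiator.tlb.discard(page)  # local invlpg
    acks = []
    for core in remotes:
        ack = threading.Event()
        core.mailbox.put((page, ack))  # stands in for sending the IPI
        acks.append(ack)
    for ack in acks:
        ack.wait()  # the real kernel spins here
    return len(acks)


if __name__ == "__main__":
    cores = [Core(i, {0x1000, 0x2000}) for i in range(4)]
    for c in cores[1:]:
        c.start()
    sent = shootdown(cores[0], cores[1:], 0x1000)
    print(f"sent {sent} simulated IPIs")
    for c in cores[1:]:
        c.mailbox.put(None)
```

The point of the model is the shape of the protocol: the initiator cannot make progress until the slowest target has responded, which is where the latency comes from.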

Why this is expensive: An IPI takes roughly 1–3 microseconds round-trip, which is 3,000 to 9,000 cycles at 3 GHz. With 64 cores, a single page unmap can cost 50+ microseconds of pure shootdown overhead in the worst case, with every target core taking a pipeline-flushing interrupt mid-execution. This is why hyperscalers obsess over MADV_FREE (lazy reclamation) over munmap(), and why mprotect() on a hot region can tank latency across an entire socket.

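Rule of thumb: Cost of an IPI-based operation ≈ 2µs × number_of_target_cores. Munmapping one 4KB page across 32 cores costs ~64µs, orders of magnitude more than the unmap itself. Treat this as a pessimistic bound: IPIs to different cores overlap in flight, so measured cost is often lower.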
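The arithmetic behind these figures is easy to reproduce. The sketch below assumes a 3 GHz clock for the cycle conversion and uses the 2µs per-core figure from the rule of thumb; both are illustrative inputs rather than measurements, and the function names are hypothetical.

```python
def ipi_cycles(latency_us, clock_ghz=3.0):
    """Convert an IPI round-trip latency in microseconds to CPU cycles."""
    return latency_us * clock_ghz * 1000  # 1 GHz = 1000 cycles per microsecond


def shootdown_cost_us(target_cores, per_ipi_us=2.0):
    """Rule-of-thumb estimate: per-IPI cost times the number of target cores."""
    return per_ipi_us * target_cores


if __name__ == "__main__":
    for latency in (1.0, 2.0, 3.0):
        print(f"{latency} us IPI = {ipi_cycles(latency):.0f} cycles at 3 GHz")
    for cores in (8, 32, 64):
        print(f"{cores} target cores: about {shootdown_cost_us(cores):.0f} us")
```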

Delivery modes worth knowing:

Fixed: deliver to a specific core (used for shootdowns, scheduler wakeups).
Lowest priority: hardware delivers to whichever target core is currently running at the lowest interrupt priority (used for device interrupts in many configurations).
NMI: non-maskable, used for watchdogs and perf sampling — interrupts even with IRQs disabled.
INIT/SIPI: how the BSP wakes up application processors during boot. The entire SMP bringup sequence is an INIT IPI followed by one or two Startup IPIs.
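For the INIT/SIPI case, the interesting detail is that the Startup IPI's vector field is not an IDT index: it encodes the 4 KiB page where the application processor starts executing in real mode (start address = vector << 12). The sketch below models the conventional INIT, SIPI, SIPI sequence as data, assuming a hypothetical trampoline address of 0x8000. The delays in the comments follow the commonly cited MultiProcessor Specification guidance and are an assumption, not something this code enforces.

```python
def sipi_vector(trampoline_addr):
    """Return the SIPI vector for a real-mode trampoline address."""
    if trampoline_addr % 0x1000 != 0:
        raise ValueError("trampoline must be 4 KiB aligned")
    if not 0 <= trampoline_addr < 0x100000:
        raise ValueError("trampoline must sit below 1 MiB (real mode)")
    return trampoline_addr >> 12


def ap_boot_sequence(apic_id, trampoline_addr):
    """List the IPIs a BSP would send as (delivery_mode, apic_id, vector) tuples."""
    vec = sipi_vector(trampoline_addr)
    return [
        ("INIT", apic_id, 0),       # reset the AP; commonly followed by ~10 ms wait
        ("STARTUP", apic_id, vec),  # first SIPI; commonly followed by ~200 us wait
        ("STARTUP", apic_id, vec),  # second SIPI, a retry kept for older hardware
    ]


if __name__ == "__main__":
    for step in ap_boot_sequence(apic_id=1, trampoline_addr=0x8000):
        print(step)
```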

You can watch IPIs live: watch -n1 "grep -E 'TLB|RES|CAL' /proc/interrupts". The TLB row counts shootdowns; RES counts scheduler reschedule IPIs; CAL counts function-call IPIs (the smp_call_function family). The counters are cumulative since boot, so watch how fast they grow: rapidly climbing numbers here often explain mysterious tail-latency spikes.
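If you would rather collect these counters programmatically, a small parser is enough. This is a sketch that assumes the usual /proc/interrupts layout on Linux (a header row of CPU names, then one labeled row per interrupt source with one count per CPU); parse_ipi_counts is a hypothetical helper name.

```python
import os


def parse_ipi_counts(text, labels=("RES", "CAL", "TLB")):
    """Return {label: [count_cpu0, count_cpu1, ...]} for the requested rows."""
    lines = text.splitlines()
    if not lines:
        return {}
    # The header row names one column per online CPU.
    ncpu = sum(1 for tok in lines[0].split() if tok.startswith("CPU"))
    counts = {}
    for line in lines[1:]:
        name, sep, rest = line.partition(":")
        name = name.strip()
        if not sep or name not in labels:
            continue
        # The first ncpu tokens are per-CPU counts; the rest is a description.
        counts[name] = [int(t) for t in rest.split()[:ncpu] if t.isdigit()]
    return counts


if __name__ == "__main__" and os.path.exists("/proc/interrupts"):
    with open("/proc/interrupts") as f:
        for name, per_cpu in parse_ipi_counts(f.read()).items():
            print(f"{name}: total={sum(per_cpu)} max_per_cpu={max(per_cpu, default=0)}")
```

Sampling this twice and subtracting gives a rate, which is usually more telling than the raw totals.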

See it in action: Check out Operating System Architecture - 005 : What is Interrupt in computer programming ? #os #tutorial by The Digital Folks to see this theory applied.

Key Takeaway: Every kernel-initiated cross-core notification (TLB invalidation, thread wakeup, even AP boot) rides on an IPI sent via the Local APIC's ICR, and the cost scales roughly linearly with how many cores you target.
