The Performance Monitoring Unit: Why "perf" Sees Things Your Code Can't

2026-05-24

Every modern x86 core ships with a Performance Monitoring Unit (PMU): a bank of hardware counters that tick on micro-architectural events the ISA otherwise hides. When you run perf stat ./a.out, you're not sampling. You're reading hardware registers that the CPU increments at essentially no cost, in parallel with your code.

The PMU has two register types: fixed-function counters (each always counts one thing: instructions retired, core cycles, or reference cycles) and general-purpose counters (programmable: pick an event from a per-microarchitecture list, write its event code and unit mask to IA32_PERFEVTSELx, and the matching IA32_PMCx starts incrementing). Skylake has 3 fixed counters per thread and 8 GP counters per physical core; with hyperthreading on, the GP counters are split 4 per thread.
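To make "write its code to IA32_PERFEVTSELx" concrete, here is a minimal sketch of how an event-select value is assembled, assuming the architectural bit layout (event select in bits 0-7, unit mask in bits 8-15, USR in bit 16, OS in bit 17, EN in bit 22). The function names are hypothetical, and the event used is the architectural "LLC Misses" event (event 0x2E, umask 0x41). Nothing here touches an MSR; in practice the kernel does that write for you.

```python
# Bit positions in IA32_PERFEVTSELx (architectural layout).
USR = 1 << 16  # count while the core runs user-mode code
OS = 1 << 17   # count while the core runs kernel-mode code
EN = 1 << 22   # enable the counter


def perfevtsel(event, umask, user=True, kernel=True):
    """Build the value that would be written to IA32_PERFEVTSELx (hypothetical helper)."""
    if not 0 <= event <= 0xFF or not 0 <= umask <= 0xFF:
        raise ValueError("event and umask must each fit in one byte")
    value = event | (umask << 8) | EN
    if user:
        value |= USR
    if kernel:
        value |= OS
    return value


def perf_raw_name(event, umask):
    """Spell the same event in perf's raw syntax: r<umask><event> in hex."""
    return f"r{(umask << 8) | event:x}"


# Architectural "LLC Misses" event: event select 0x2E, unit mask 0x41.
print(hex(perfevtsel(0x2E, 0x41)))  # 0x43412e
print(perf_raw_name(0x2E, 0x41))    # r412e
```

The second helper is the practical one: perf stat -e r412e ./a.out requests the same raw event without relying on a kernel alias.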

The killer feature is PEBS (Precise Event-Based Sampling). A naïve interrupt-on-overflow is delivered some time after the event, typically dozens of instructions later because of out-of-order execution, so the sampled instruction pointer "skids" away from the culprit. With PEBS, the CPU itself writes a record (RIP, registers, latency, data address) into a memory buffer set up by the kernel when the counter overflows, attached to the actual retiring instruction. That's how perf record -e cache-misses:pp can point at the exact load that missed.

Concrete example. You have a hash table with a mysterious 8% slowdown. perf stat -e cycles,instructions,LLC-load-misses,dTLB-load-misses ./bench shows an IPC of 0.4 (terrible) and 12M LLC misses on 200M loads, a 6% miss rate. Switch to perf record -e LLC-load-misses:pp and the report fingers one line: the next pointer dereference during chain traversal. The key lives in a separate allocation from the node that holds next, so every step of the traversal touches two unrelated cache lines. Inline the key into the node and IPC jumps to 2.1.
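The two derived numbers in that example are simple ratios of raw counts. A minimal sketch, where the miss and load counts come from the example and the cycle and instruction counts are made-up values chosen only to reproduce an IPC of 0.4:

```python
def ipc(instructions, cycles):
    """Instructions retired per core cycle."""
    return instructions / cycles


def miss_rate(misses, accesses):
    """Fraction of accesses that missed."""
    return misses / accesses


# Counts from the example above.
llc_load_misses = 12_000_000
loads = 200_000_000

# Hypothetical counts (not from a real run), picked to give IPC = 0.4.
instructions = 4_000_000_000
cycles = 10_000_000_000

print(f"IPC: {ipc(instructions, cycles):.2f}")                     # IPC: 0.40
print(f"LLC miss rate: {miss_rate(llc_load_misses, loads):.1%}")   # LLC miss rate: 6.0%
```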

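Rule of thumb. An L3 miss costs ~200–300 cycles on modern Xeons. If LLC-load-misses × 250 / cycles > 0.2, you're memory-bound and instruction-level micro-optimization won't help much: fix the data layout. Conversely, if branch-misses × 15 / cycles > 0.1, you're losing a large share of your cycles to mispredicts. In top-down terms that is bad speculation rather than front-end-bound, and the fix is more predictable (or fewer) branches.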
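The rule of thumb is mechanical enough to script. A rough sketch, assuming the per-event costs and thresholds quoted above (250 cycles per LLC miss, 15 cycles per mispredict, cut-offs of 0.2 and 0.1); the function names and the sample counts are hypothetical, and the result is a first-pass estimate rather than a measurement:

```python
def stall_shares(cycles, llc_load_misses, branch_misses,
                 llc_miss_cost=250, mispredict_cost=15):
    """Estimate the share of cycles lost to LLC misses and to branch mispredicts."""
    memory_share = llc_load_misses * llc_miss_cost / cycles
    mispredict_share = branch_misses * mispredict_cost / cycles
    return memory_share, mispredict_share


def diagnose(cycles, llc_load_misses, branch_misses):
    """Apply the 0.2 and 0.1 thresholds from the rule of thumb."""
    memory_share, mispredict_share = stall_shares(
        cycles, llc_load_misses, branch_misses)
    findings = []
    if memory_share > 0.2:
        findings.append("memory-bound")
    if mispredict_share > 0.1:
        findings.append("mispredict-bound")
    return findings


# Hypothetical counts: 10G cycles, 12M LLC load misses, 20M branch misses.
print(stall_shares(10_000_000_000, 12_000_000, 20_000_000))  # (0.3, 0.03)
print(diagnose(10_000_000_000, 12_000_000, 20_000_000))      # ['memory-bound']
```

Both shares are upper-bound estimates: out-of-order execution overlaps some miss latency with useful work, so treat a value just over a threshold as a hint to look closer, not as a verdict.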

Gotchas. The hardware counters are per-core, not per-process: the kernel saves and restores them on context switches to give per-task counts, and time-multiplexes them when you ask for more events than there are counters. In that case the displayed values are scaled estimates (note the "(50.00%)" in perf output). Virtualization usually hides the PMU from the guest unless the hypervisor explicitly exposes it (kvm -cpu host,+pmu). And event names like cache-misses are kernel aliases that map to different raw events per microarch; perf list shows what's actually wired up.
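The scaling behind that "(50.00%)" is a linear extrapolation: the raw count is multiplied by the ratio of the time the event was enabled to the time it actually occupied a hardware counter. A minimal sketch with hypothetical numbers (the function name and the sample values are illustrative, not perf's own code):

```python
def scale_multiplexed(raw_count, time_enabled_ns, time_running_ns):
    """Extrapolate a multiplexed count to the full enabled window.

    Returns (scaled_count, percent_running), or None if the event was never
    scheduled onto a counter, in which case there is nothing to extrapolate.
    """
    if time_running_ns == 0:
        return None
    scaled = round(raw_count * time_enabled_ns / time_running_ns)
    percent_running = 100 * time_running_ns / time_enabled_ns
    return scaled, percent_running


# Hypothetical: the event held a counter for 1 s out of a 2 s run.
scaled, pct = scale_multiplexed(6_000_000, 2_000_000_000, 1_000_000_000)
print(f"{scaled:,} ({pct:.2f}%)")  # 12,000,000 (50.00%)
```

The extrapolation assumes the event rate was the same while the counter was scheduled out. For bursty workloads that assumption fails, so when a number matters, request few enough events that nothing gets multiplexed.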

Key Takeaway: The PMU turns invisible micro-architectural events into countable numbers — and PEBS turns them into precise source-line attributions, which is the only honest way to know why your code is slow.
