2026-05-07
Every modern CPU core ships with a performance monitoring unit (PMU): a small army of dedicated hardware counters that tally microarchitectural events without slowing the pipeline. They're how perf, VTune, and uProf see inside a running CPU.
A PMU has two flavors of registers:
The counters are free in the steady state — they're just adders next to the pipeline stages that already detect these events. The cost shows up at sample time: when a counter overflows, it raises a Performance Monitoring Interrupt (PMI), the kernel grabs the instruction pointer, and the user pays an interrupt's worth of cycles.
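To make the overflow mechanism concrete, here is a toy model in Python. It is not real PMU or kernel code. It assumes a 48-bit counter (widths vary by CPU) that is preloaded with 2^48 minus the desired period, so the counter wraps after exactly that many events. The names ToyCounter, arm, and count are invented for illustration.

```python
COUNTER_BITS = 48  # assumed width; real widths vary by CPU
COUNTER_MAX = 1 << COUNTER_BITS


class ToyCounter:
    """Toy model of one PMU counter that raises a 'PMI' on overflow."""

    def __init__(self, period):
        self.period = period
        self.pmis = 0
        self.arm()

    def arm(self):
        # Preload so that exactly `period` more events wrap the counter.
        self.value = COUNTER_MAX - self.period

    def count(self, events):
        """Add `events` to the counter, re-arming after every overflow."""
        while events > 0:
            room = COUNTER_MAX - self.value  # events left until overflow
            if events < room:
                self.value += events
                return
            events -= room
            self.pmis += 1  # overflow: hardware would raise a PMI here
            self.arm()      # the handler reloads the counter for the next sample


counter = ToyCounter(period=1_000_000)
counter.count(3_500_000)
print(counter.pmis)  # 3 overflows; 500,000 events are pending toward the 4th
```

Between overflows nothing happens except addition, which is why counting is cheap and only the samples cost cycles.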
Real example, diagnosing a slow loop: A hash table lookup runs at 1.2 IPC and you suspect cache pressure. Run perf stat -e cycles,instructions,L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./bench (the load counts are the denominators for the miss rates). You see an 18% L1 miss rate but a 0.3% LLC miss rate. Diagnosis: the working set fits in L2 or the LLC but not in L1, so prefetching or restructuring the data will help, but adding or speeding up RAM won't.
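Each miss rate in that diagnosis is misses divided by loads at the same cache level. A minimal Python sketch of the arithmetic follows. The counts are made-up illustrative numbers chosen to reproduce the 18% and 0.3% figures, not measurements, and miss_rate is a hypothetical helper.

```python
def miss_rate(misses, loads):
    """Return misses as a fraction of loads (0.0 if nothing was loaded)."""
    return misses / loads if loads else 0.0


# Illustrative counts only, not real measurements.
counts = {
    "L1-dcache-loads": 1_000_000_000,
    "L1-dcache-load-misses": 180_000_000,
    "LLC-loads": 50_000_000,
    "LLC-load-misses": 150_000,
}

l1 = miss_rate(counts["L1-dcache-load-misses"], counts["L1-dcache-loads"])
llc = miss_rate(counts["LLC-load-misses"], counts["LLC-loads"])
print(f"L1 miss rate:  {l1:.1%}")
print(f"LLC miss rate: {llc:.1%}")
```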
For statistical profiling, perf record -e cycles -F 999 programs a counter to overflow every ~3.3M cycles at 3.3GHz, sampling the IP at each PMI. That's how you build a flame graph without instrumenting code.
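The ~3.3M figure is just the clock rate divided by the sampling rate. A quick Python check, assuming a fixed 3.3 GHz clock (sample_period is a hypothetical helper name):

```python
def sample_period(clock_hz, samples_per_sec):
    """Average number of cycles between samples at a given sampling rate."""
    return clock_hz / samples_per_sec


period = sample_period(3.3e9, 999)
print(f"{period / 1e6:.2f}M cycles between samples")
```

In practice the clock is not fixed, so with -F the kernel keeps adjusting the period to stay near the requested rate. Treat the result as an average rather than an exact overflow interval.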
The catch is skid: with out-of-order execution, by the time the PMI is delivered the CPU has often retired dozens more instructions, so the reported IP isn't the instruction that caused the event. Intel's PEBS (Precise Event-Based Sampling) and AMD's IBS (Instruction-Based Sampling) address this by having the hardware itself record the sampled instruction's address and state instead of leaving that to the interrupt handler; PEBS writes its records to a buffer the kernel drains later. Where the hardware and the event support it, prefer perf record -e cycles:pp, which requests zero-skid sampling.
Rule of thumb: If your IPC is below 1.0 on modern x86, you're probably memory-bound or branch-mispredict-bound, so check cache-misses and branch-misses first. Above 2.0 IPC, you're more likely compute-bound; on Intel, look at port pressure with the uops_dispatched_port.* events (exact event names vary by microarchitecture).
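The rule of thumb can be written down as a small triage function. This is a sketch of the heuristic only, with the 1.0 and 2.0 thresholds taken from the text; triage is a hypothetical helper, and the counts passed in are illustrative, chosen to give the 1.2 IPC of the hash table example.

```python
def triage(cycles, instructions):
    """Apply the IPC rule of thumb and suggest which events to check next."""
    ipc = instructions / cycles
    if ipc < 1.0:
        hint = "likely memory- or branch-bound: check cache-misses, branch-misses"
    elif ipc > 2.0:
        hint = "likely compute-bound: check port pressure"
    else:
        hint = "inconclusive: gather more counters"
    return ipc, hint


# Illustrative counts only, not real measurements.
ipc, hint = triage(cycles=5_000_000_000, instructions=6_000_000_000)
print(f"IPC {ipc:.2f}: {hint}")
```

An IPC between the two thresholds, like 1.2 here, says little on its own, which is why the hash table example goes straight to the cache counters.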
Counters can also be virtualized per-VM (usually called a vPMU), so guests can profile themselves, though hypervisors often disable this to prevent side-channel leakage of host activity.
