Daily Hardware Architecture: Hardware Performance Counters: The CPU's Built-In Black Box Recorder

2026-05-07

Every modern CPU ships with a small army of performance monitoring units (PMUs) — dedicated hardware counters that tally microarchitectural events without slowing the pipeline. They're how perf, VTune, and uProf see inside a running CPU.

A PMU has two flavors of registers:

Fixed-function counters: hardwired to count specific events — instructions retired, unhalted cycles, reference cycles. Intel has 3-4 of these per core.
Programmable counters: 4-8 per core (more with hyperthreading off). You write an event selector MSR with an event code + umask, and the counter starts incrementing on matching events. Cache misses, branch mispredicts, µop dispatches, port utilization — hundreds of events are exposed.
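
To make the "event code + umask" step concrete, here is a minimal sketch that packs those fields into an event selector value. It assumes the Intel architectural IA32_PERFEVTSELx bit layout (event select in bits 0-7, umask in bits 8-15, then the USR, OS, edge, interrupt, enable, invert and counter-mask fields); other vendors and extended fields differ. The function name encode_perfevtsel is hypothetical, and the code only computes the value. Actually writing the MSR requires kernel privileges.

```python
# Sketch only: compute an event selector value under the assumed
# Intel architectural IA32_PERFEVTSELx layout. Nothing is written to hardware.
def encode_perfevtsel(event, umask, usr=True, os=True, enable=True,
                      edge=False, interrupt=False, invert=False, cmask=0):
    """Pack event select, unit mask and control bits into one integer."""
    for name, field in (("event", event), ("umask", umask), ("cmask", cmask)):
        if not 0 <= field <= 0xFF:
            raise ValueError(f"{name} must fit in 8 bits")
    value = event                    # bits 0-7: event select
    value |= umask << 8              # bits 8-15: unit mask
    value |= int(usr) << 16          # count in user mode
    value |= int(os) << 17           # count in kernel mode
    value |= int(edge) << 18         # edge detect
    value |= int(interrupt) << 20    # raise a PMI on overflow
    value |= int(enable) << 22       # enable the counter
    value |= int(invert) << 23       # invert the counter-mask comparison
    value |= cmask << 24             # bits 24-31: counter mask
    return value

# Event 0x2E with umask 0x41 is the architectural "LLC misses" event.
print(hex(encode_perfevtsel(0x2E, 0x41)))  # 0x43412e
```

The low 16 bits (0x412e here) are what perf accepts as a raw event, for example perf stat -e r412e.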

The counters are free in the steady state — they're just adders next to the pipeline stages that already detect these events. The cost shows up at sample time: when a counter overflows, it raises a Performance Monitoring Interrupt (PMI), the kernel grabs the instruction pointer, and the user pays an interrupt's worth of cycles.
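
A rough way to reason about that sample-time cost is to multiply the sample rate by the cycles each PMI consumes. The sketch below does only that arithmetic; the 5,000-cycle PMI cost is a placeholder assumption, not a measured figure, and the function name is hypothetical.

```python
# Back-of-the-envelope sketch: fraction of CPU time spent servicing PMIs.
def sampling_overhead_fraction(samples_per_sec, cycles_per_pmi, clock_hz):
    """Cycles spent in PMI handling per second, divided by cycles per second."""
    return samples_per_sec * cycles_per_pmi / clock_hz

# Assumed inputs: 1,000 samples/s, 5,000 cycles per PMI, 3 GHz clock.
overhead = sampling_overhead_fraction(1_000, 5_000, 3.0e9)
print(f"{overhead:.2%}")  # 0.17%
```

The useful property is the linear scaling: ten times the sample rate costs ten times the overhead, which is why very high sampling frequencies perturb the workload they measure.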

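Real example: diagnosing a slow loop. A hash table lookup runs at 1.2 IPC and you suspect cache pressure. Run perf stat -e cycles,instructions,L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./bench (the two -loads events give perf the denominators it needs to print miss rates). You see an 18% L1 miss rate but only a 0.3% LLC miss rate. Diagnosis: the working set overflows L1 but still fits in the larger on-chip caches (L2/LLC), so almost nothing goes out to DRAM. Prefetching or restructuring the data layout will help; adding or upgrading RAM won't.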
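
The ratios in this example are plain divisions over raw counter values. Here is a small sketch of that arithmetic; the counts are made-up numbers chosen to reproduce the figures above, and the denominators assume you also counted L1-dcache-loads and LLC-loads. The function name summarize is hypothetical.

```python
# Sketch: derive IPC and cache miss rates from raw perf stat counts.
def summarize(counts):
    """Return IPC plus L1 and LLC load miss rates as fractions."""
    return {
        "ipc": counts["instructions"] / counts["cycles"],
        "l1_miss_rate": counts["L1-dcache-load-misses"] / counts["L1-dcache-loads"],
        "llc_miss_rate": counts["LLC-load-misses"] / counts["LLC-loads"],
    }

# Hypothetical counts, picked to match the example in the text.
example = {
    "cycles": 10_000_000_000,
    "instructions": 12_000_000_000,
    "L1-dcache-loads": 4_000_000_000,
    "L1-dcache-load-misses": 720_000_000,
    "LLC-loads": 700_000_000,
    "LLC-load-misses": 2_100_000,
}

stats = summarize(example)
print(f"IPC {stats['ipc']:.2f}, "
      f"L1 miss {stats['l1_miss_rate']:.1%}, "
      f"LLC miss {stats['llc_miss_rate']:.1%}")
```

Note that each miss rate is relative to accesses at its own cache level: 0.3% of LLC loads miss, not 0.3% of all loads.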

For statistical profiling, perf record -e cycles -F 999 programs a counter to overflow every ~3.3M cycles at 3.3GHz, sampling the IP at each PMI. That's how you build a flame graph without instrumenting code.
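
The ~3.3M figure is just the clock rate divided by the requested sample frequency. A minimal sketch of that arithmetic follows, assuming the cycle counter ticks at a constant 3.3 GHz; in practice perf adjusts the period on the fly in frequency mode, so treat this as the steady-state target rather than a fixed setting. The function name is hypothetical.

```python
# Sketch: events between overflows for a target sampling frequency.
def sample_period(events_per_sec, samples_per_sec):
    """Number of events the counter should count before overflowing."""
    return events_per_sec / samples_per_sec

# 3.3 GHz cycle counter sampled at 999 Hz, as in perf record -F 999.
period = sample_period(3.3e9, 999)
print(f"{period:,.0f} cycles between samples")  # 3,303,303 cycles between samples
```

The odd-looking 999 (rather than 1000) is a common choice to avoid sampling in lockstep with periodic activity such as timer ticks.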

The catch: skid. Out-of-order execution means that by the time the PMI fires, the CPU has retired dozens more instructions, so the reported IP isn't the instruction that caused the event. Intel's PEBS (Precise Event-Based Sampling) and AMD's IBS (Instruction-Based Sampling) address this by having hardware snapshot architectural state when the triggering µop retires and write it to a buffer the kernel drains later. Use perf record -e cycles:pp to request precise sampling wherever the hardware supports it.

Rule of thumb: If your IPC is below 1.0 on modern x86, you're memory-bound or branch-mispredict-bound — check cache-misses and branch-misses first. Above 2.0 IPC, you're compute-bound; look at port pressure with uops_dispatched_port.* events.
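
The rule of thumb can be written down as a tiny triage function. This is a sketch of the heuristic exactly as stated, with the 1.0 and 2.0 thresholds taken from the text; the "inconclusive" middle band is my own filler for the range the rule does not cover, and the function name is hypothetical.

```python
# Sketch of the IPC triage heuristic; thresholds come from the rule of thumb.
def first_suspect(instructions, cycles):
    """Suggest where to look first, based on instructions per cycle."""
    ipc = instructions / cycles
    if ipc < 1.0:
        return "memory- or branch-bound: check cache-misses and branch-misses"
    if ipc > 2.0:
        return "compute-bound: check port pressure (uops_dispatched_port.*)"
    # Assumption: the rule says nothing about 1.0-2.0, so make no call.
    return "inconclusive: collect more counters before guessing"

print(first_suspect(instructions=6_000_000_000, cycles=10_000_000_000))
print(first_suspect(instructions=25_000_000_000, cycles=10_000_000_000))
```

It is a starting point for choosing which counters to collect next, not a diagnosis on its own.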

Counters can also be virtualized per-VM (Intel calls this vPMU), so guests can profile themselves — though hypervisors often disable this to prevent side-channel leakage of host activity.

See it in action: Check out "These Are Not Your Grand Daddy's CPU Performance Counters - CPU Hardware Performance Counters..." by Black Hat to see this theory applied.

Key Takeaway: PMU counters are nearly-free hardware that turns "the program is slow" into "the program is missing L1 18% of the time" — but always use precise sampling (PEBS/IBS) so the IP you blame is the IP that's actually guilty.
