2026-05-17
When you say "CPU," you mean the cores. But a modern x86 die has two power and clock domains: the core (your registers, ALUs, L1/L2) and the uncore (L3 cache, memory controllers, ring/mesh interconnect, PCIe root, snoop filters, QPI/UPI links). Intel calls it the uncore; AMD's closest equivalent is the Infinity Fabric and the I/O die (IOD). The two domains run at different frequencies, have different power states, and are tuned for completely different workloads.
The split exists because cores got faster more quickly than memory did. The uncore is everything between your L2 and DRAM, and it's a shared resource that all cores fight over. A 32-core Xeon has 32 core clocks but one uncore clock governing the L3 and its interconnect, and that single frequency decides how fast cross-core communication, L3 hits, and DRAM accesses happen.
Three practical consequences:
- On Linux, the intel_uncore_frequency driver exposes the uncore frequency limits under /sys/devices/system/cpu/intel_uncore_frequency/.
- When cores idle into C-states, the uncore can downclock from 2.4 GHz to 800 MHz to save power. The next L3 miss then pays a wake-up penalty, sometimes 200+ ns of added latency on the first access after idle.
- Eight cores running memcpy won't run 8x faster than one: they share the same memory controllers, which sit in the uncore. One core can typically sustain 15–20 GB/s; the rest is contention.

Real-world example: Netflix's FreeBSD video servers hit a wall where adding cores stopped helping throughput. Diagnosis: the uncore was downclocking under "bursty" workloads. Cores spent enough time idle between packets that the power governor scaled the uncore down, which added latency to every PCIe DMA from the NIC. Pinning the uncore to max frequency via MSR 0x620 recovered 30% throughput.
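To make the MSR 0x620 pinning and the sysfs limits concrete, here is a minimal Python sketch. It assumes the field layout the Linux uncore-frequency driver uses for MSR 0x620 (max ratio in bits 6:0, min ratio in bits 14:8, one ratio step = 100 MHz) and the driver's min_freq_khz / max_freq_khz files; verify both against the documentation for your part and kernel. The function names are hypothetical, and the code only computes values. It never writes an MSR.

```python
from pathlib import Path

# Directory named in the text; present only when the Linux driver is loaded.
UNCORE_SYSFS = Path("/sys/devices/system/cpu/intel_uncore_frequency")
RATIO_MHZ = 100  # assumption: one ratio step is 100 MHz


def decode_ratio_limit(msr_value: int) -> dict:
    """Split a raw MSR 0x620 value into min/max uncore frequency in MHz."""
    max_ratio = msr_value & 0x7F         # assumed layout: bits 6:0
    min_ratio = (msr_value >> 8) & 0x7F  # assumed layout: bits 14:8
    return {"min_mhz": min_ratio * RATIO_MHZ, "max_mhz": max_ratio * RATIO_MHZ}


def pin_to_max(msr_value: int) -> int:
    """Return a new value whose min ratio equals the existing max ratio."""
    max_ratio = msr_value & 0x7F
    cleared = msr_value & ~(0x7F << 8)   # clear only the min-ratio field
    return cleared | (max_ratio << 8)


def read_sysfs_limits(root: Path = UNCORE_SYSFS) -> dict:
    """Map each package/die directory to (min_khz, max_khz); empty if absent."""
    limits = {}
    if not root.is_dir():
        return limits
    for domain in sorted(root.iterdir()):
        lo, hi = domain / "min_freq_khz", domain / "max_freq_khz"
        if lo.is_file() and hi.is_file():
            limits[domain.name] = (int(lo.read_text()), int(hi.read_text()))
    return limits
```

Under these assumptions, a raw value of 0x0818 decodes to an 800 MHz floor and a 2400 MHz ceiling, and pin_to_max(0x0818) returns 0x1818, which removes the room to downclock. Where the sysfs interface exists, raising min_freq_khz to match max_freq_khz expresses the same intent without touching the MSR directly.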
Rule of thumb: uncore frequency × 8 bytes/cycle ≈ peak L3 bandwidth per ring stop. A 2.4 GHz uncore on a single ring stop gives you ~19 GB/s of L3 read bandwidth. If your "fast in-cache" benchmark plateaus near this figure, you're uncore-bound, not core-bound. Check turbostat's UncMHz column: if it's bouncing, your latency measurements are lying to you.
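The rule-of-thumb arithmetic can be sketched in a few lines of Python. The 8 bytes/cycle figure is the one from the text, not a measured constant. The 10% tolerance used to decide that a benchmark has hit the ceiling is an arbitrary assumption, and both function names are hypothetical.

```python
BYTES_PER_CYCLE = 8  # rule-of-thumb figure from the text


def l3_ceiling_gbps(uncore_mhz: float, bytes_per_cycle: int = BYTES_PER_CYCLE) -> float:
    """Estimated peak L3 read bandwidth in GB/s for one ring stop."""
    return uncore_mhz * 1e6 * bytes_per_cycle / 1e9


def looks_uncore_bound(measured_gbps: float, uncore_mhz: float,
                       tolerance: float = 0.10) -> bool:
    """True if the measured plateau sits within `tolerance` of the ceiling."""
    ceiling = l3_ceiling_gbps(uncore_mhz)
    return measured_gbps >= ceiling * (1 - tolerance)


if __name__ == "__main__":
    for mhz in (2400, 800):
        print(f"{mhz} MHz uncore -> ~{l3_ceiling_gbps(mhz):.1f} GB/s ceiling")
```

At 2400 MHz the estimate is 19.2 GB/s. At 800 MHz it drops to 6.4 GB/s, so an uncore that bounces between those two frequencies moves the ceiling by a factor of three while the benchmark runs.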
