Daily Low-Level Programming: The Uncore: Why Your "CPU" Is Actually Two Different Chips

2026-05-17

When you say "CPU," you mean the cores. But a modern x86 die has two power and clock domains: the core (your registers, ALUs, L1/L2) and the uncore (L3 cache, memory controllers, ring/mesh interconnect, PCIe root, snoop filters, QPI/UPI links). Intel calls it the uncore; AMD calls the equivalent the Infinity Fabric / IOD. The two domains run at different frequencies, have different power states, and are tuned for completely different workloads.

The split exists because core speed improved much faster than memory speed did. The uncore is everything between your L2 and DRAM, and it is a shared resource that all cores fight over. A 32-core Xeon has 32 core clocks but one uncore clock controlling the L3 ring, and that single frequency decides how fast cross-core communication, L3 hits, and DRAM accesses happen.

Three practical consequences:

Uncore frequency scaling is independent. Linux exposes it via /sys/devices/system/cpu/intel_uncore_frequency/. When cores idle into C-states, the uncore can downclock, for example from 2.4 GHz to 800 MHz, to save power. The next L3 miss then pays a wake-up penalty, sometimes 200+ ns of added latency on the first access after idle.
L3 latency depends on geography. On a ring-bus Xeon, L3 slices are distributed around the ring. Hitting the slice attached to your core takes ~10 ns; hitting the slice on the opposite side of a 28-core ring takes ~25 ns. Same "L3," 2.5x difference. Mesh interconnects (Skylake-SP onward) flatten this but don't eliminate it.
Memory bandwidth is uncore-limited, not core-limited. Eight cores doing memcpy won't run 8x faster than one, because they share the same memory controllers, which sit in the uncore. One core can typically pull 15–20 GB/s on its own; past that point, extra cores mostly add contention.
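
To see whether the uncore on a given Linux box is free to downclock, you can read the limits from the sysfs directory mentioned above. This is a minimal sketch, assuming the intel_uncore_frequency driver is loaded and uses its usual package_XX_die_YY subdirectories with min_freq_khz, max_freq_khz and (on newer kernels) current_freq_khz files. The function names read_uncore_limits and is_pinned are made up for this example.

```python
from pathlib import Path

# Directory named in the text; only present when the Intel uncore driver is loaded.
UNCORE_SYSFS = Path("/sys/devices/system/cpu/intel_uncore_frequency")


def read_uncore_limits(base=UNCORE_SYSFS):
    """Return {domain_name: {attribute: kHz}} for each uncore domain found.

    Returns an empty dict when the directory is absent (driver not loaded,
    non-Intel CPU, or a virtual machine).
    """
    base = Path(base)
    limits = {}
    if not base.is_dir():
        return limits
    for domain in sorted(base.glob("package_*_die_*")):
        values = {}
        for attr in ("min_freq_khz", "max_freq_khz", "current_freq_khz"):
            try:
                values[attr] = int((domain / attr).read_text().strip())
            except (OSError, ValueError):
                # Assumption: some attributes may be missing on older kernels.
                continue
        limits[domain.name] = values
    return limits


def is_pinned(values):
    """True when min equals max, so the governor cannot downclock the uncore."""
    return "min_freq_khz" in values and values["min_freq_khz"] == values.get("max_freq_khz")


if __name__ == "__main__":
    for name, values in read_uncore_limits().items():
        state = "pinned" if is_pinned(values) else "free to scale"
        print(name, values, state)
```

If min_freq_khz is well below max_freq_khz, the first access after an idle period is a candidate for the wake-up penalty described above.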

Real-world example: Netflix's FreeBSD video servers hit a wall where adding cores stopped helping throughput. Diagnosis: uncore frequency was downclocking under "bursty" workloads because cores spent enough time idle between packets that the power governor scaled the uncore down — adding latency to every PCIe DMA from the NIC. Pinning the uncore to max frequency via MSR 0x620 recovered 30% throughput.
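
For a sense of what "pinning via MSR 0x620" involves, here is a hedged sketch of the bit manipulation. It assumes the layout used by the Linux uncore frequency driver for this register: maximum ratio in bits 6:0, minimum ratio in bits 14:8, one ratio step equal to 100 MHz. Check your CPU's documentation before relying on it. The read_msr helper goes through Linux's /dev/cpu/N/msr device (root and the msr module required); it is not how FreeBSD exposes MSRs, and writing the value back is deliberately left out.

```python
import os
import struct

MSR_UNCORE_RATIO_LIMIT = 0x620  # register named in the text
RATIO_MHZ = 100                 # assumption: one ratio step = 100 MHz


def decode_ratio_limit(value):
    """Split the MSR value into (min_mhz, max_mhz)."""
    max_ratio = value & 0x7F         # assumed bits 6:0
    min_ratio = (value >> 8) & 0x7F  # assumed bits 14:8
    return min_ratio * RATIO_MHZ, max_ratio * RATIO_MHZ


def pin_to_max(value):
    """Return a new MSR value whose minimum ratio equals its maximum ratio.

    All other bits are preserved unchanged.
    """
    max_ratio = value & 0x7F
    return (value & ~(0x7F << 8)) | (max_ratio << 8)


def read_msr(cpu, reg=MSR_UNCORE_RATIO_LIMIT):
    """Read a 64-bit MSR via /dev/cpu/<cpu>/msr (Linux only, needs root)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, reg))[0]
    finally:
        os.close(fd)


if __name__ == "__main__":
    example = (8 << 8) | 24  # min ratio 8, max ratio 24
    print(decode_ratio_limit(example))
    print(decode_ratio_limit(pin_to_max(example)))
```

With the example value, the decoded range goes from 800–2400 MHz to a fixed 2400 MHz once the minimum is raised to match the maximum, which is the effect pinning is after.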

Rule of thumb: Uncore frequency × 8 bytes/cycle ≈ peak L3 bandwidth per ring stop. A 2.4 GHz uncore on a single ring stop gives you ~19 GB/s of L3 read bandwidth. If your "fast in-cache" benchmark plateaus below this, you're uncore-bound, not core-bound. Check turbostat's UncMHz column: if it's bouncing, your latency measurements are lying to you.
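
The rule of thumb is easy to turn into a quick calculation. The sketch below just restates the arithmetic from the paragraph above; the 8 bytes/cycle figure comes from the text, while the function names and the 100 MHz "bouncing" tolerance are assumptions picked for illustration.

```python
def peak_l3_bandwidth_gb_s(uncore_ghz, bytes_per_cycle=8):
    """Rule-of-thumb ceiling: GHz x bytes/cycle = GB/s for one ring stop."""
    return uncore_ghz * bytes_per_cycle


def fraction_of_ceiling(measured_gb_s, uncore_ghz, bytes_per_cycle=8):
    """How close a measured in-cache bandwidth gets to the rule-of-thumb ceiling."""
    return measured_gb_s / peak_l3_bandwidth_gb_s(uncore_ghz, bytes_per_cycle)


def is_bouncing(unc_mhz_samples, tolerance_mhz=100):
    """True when UncMHz samples (e.g. copied from turbostat) vary by more than
    tolerance_mhz. The tolerance is an arbitrary illustrative threshold."""
    if not unc_mhz_samples:
        return False
    return max(unc_mhz_samples) - min(unc_mhz_samples) > tolerance_mhz


if __name__ == "__main__":
    for ghz in (2.4, 0.8):
        print(f"{ghz} GHz uncore: ~{peak_l3_bandwidth_gb_s(ghz):.1f} GB/s per ring stop")
    print(is_bouncing([2400, 800, 2400]))
```

At 2.4 GHz this gives about 19.2 GB/s; at a downclocked 800 MHz the same formula gives about 6.4 GB/s, which is why a bouncing uncore clock makes in-cache numbers hard to trust.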

See it in action: Check out "[2024] CPU Cores & Threads Explained in 6 Minutes" by Indigo Software to see this theory applied.

Key Takeaway: Half your CPU runs at a different clock than the other half, and it's the half that owns the L3, the memory controllers, and every PCIe device — so its frequency, not your core's, often decides real-world performance.
