Daily Low-Level Programming: The Last-Level Cache and the Ring Bus: Why Core 0 Sees Memory Differently Than Core 15

2026-05-18

Your L1 and L2 caches are private to each core, but the L3 (Last-Level Cache) is shared. On modern Intel server chips, that sharing is not free — the L3 is physically sliced, with one slice sitting next to each core, and a ring bus (or mesh, on newer Xeons) connects them. Your address doesn't live in "the L3" — it lives in one specific slice, determined by a hash of the physical address.

Here's the consequence: when core 0 accesses an address whose slice is co-located with core 15, the request travels across the ring and the response travels back. Each hop costs roughly 1 cycle. On a 20-core bidirectional ring, the worst case is ~10 hops each way, so the hops alone add about 20 cycles, and arbitration and queuing on a busy ring add more. The same L3 hit can cost 35 cycles for a near slice and 70+ cycles for a far one: a 2x spread for what your profiler calls "an L3 hit."
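The hop arithmetic can be sketched as a toy model. This is a simplification under stated assumptions, not a description of real silicon: it assumes a single bidirectional ring with one stop per core, uses the 35-cycle base and 1-cycle hop cost quoted above as defaults, and ignores arbitration and queuing. The function names are made up for illustration.

```python
def ring_hops(src: int, dst: int, stops: int) -> int:
    """Shortest hop count between two stops on a bidirectional ring."""
    d = abs(src - dst) % stops
    return min(d, stops - d)  # go whichever way around is shorter


def l3_hit_cycles(core: int, slice_id: int, stops: int,
                  base: int = 35, cycles_per_hop: int = 1) -> int:
    """Toy estimate: base latency plus request and response hops.

    Assumption: no contention, so this is a lower bound on the spread.
    """
    return base + 2 * cycles_per_hop * ring_hops(core, slice_id, stops)


if __name__ == "__main__":
    for s in (0, 5, 10, 15):
        hops = ring_hops(0, s, 20)
        print(f"core 0 -> slice {s}: {hops} hops, ~{l3_hit_cycles(0, s, 20)} cycles")
```

Note that in this model slice 15 is only 5 hops from core 0, because the request can go the short way around; the farthest slice on a 20-stop ring is slice 10. The contention-free model also tops out well below 70 cycles, which suggests the rest of the measured spread comes from effects the model leaves out.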

The hash is deliberately scrambled to prevent any one slice from becoming a hotspot. It operates on cache-line addresses, so you cannot easily predict which slice owns a given line, and consecutive cache lines often live in different slices; even a single page is spread across many slices. This is good for bandwidth (parallelism across slices) but bad for latency predictability.
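To make the idea of a slice hash concrete, here is a minimal sketch of one common shape such a hash could take: each bit of the slice index is the parity (XOR) of a subset of physical address bits. The bit selections below are invented for illustration and are not Intel's; the real functions are undocumented and differ between CPU models. The sketch assumes 64-byte lines and 8 slices.

```python
from collections import Counter

LINE_BITS = 6  # assumption: 64-byte cache lines, so bits 0-5 are the line offset


def _mask(*bits: int) -> int:
    """Build an integer mask with the given bit positions set."""
    return sum(1 << b for b in bits)


# Hypothetical masks, one per bit of the slice index (3 bits -> 8 slices).
TOY_MASKS = (
    _mask(6, 10, 12, 14, 17),
    _mask(7, 11, 13, 15, 18),
    _mask(8, 9, 12, 16, 19),
)


def slice_of(paddr: int) -> int:
    """Toy slice index: each output bit is the parity of the masked address bits."""
    idx = 0
    for i, mask in enumerate(TOY_MASKS):
        parity = bin(paddr & mask).count("1") & 1
        idx |= parity << i
    return idx


if __name__ == "__main__":
    # Consecutive cache lines land in different slices.
    print([slice_of(n << LINE_BITS) for n in range(10)])
    # Over a large linear range, the lines spread evenly across slices.
    print(Counter(slice_of(n << LINE_BITS) for n in range(4096)))
```

Because no mask touches bits below LINE_BITS, every byte in a line maps to the same slice, while neighboring lines scatter. That is the bandwidth-friendly, latency-unfriendly behavior described above.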

Real-world example: A trading firm pinned their market-data thread to core 0 and their order-submission thread to core 19 on a 20-core Xeon, assuming "isolation = good." Latency was inconsistent. The fix: pin both threads to adjacent cores (0 and 1). The shared cache lines for the order book now resolved through nearby L3 slices, and tail latency dropped by ~40%. Counterintuitively, putting threads closer on the ring beat spreading them out.
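Pinning itself is a small amount of code. The sketch below uses Python's `os.sched_setaffinity`, which is available on Linux only; passing 0 as the first argument applies the mask to the calling thread. The thread names are hypothetical stand-ins for the two threads in the story. One caveat worth stating loudly: logical CPU numbers are not guaranteed to correspond to neighboring positions on the die, so "CPU 0 and CPU 1" is an assumption you should check against your machine's topology.

```python
import os
import threading


def pin_current_thread(cpu: int) -> None:
    """Restrict the calling thread to one logical CPU (Linux only)."""
    os.sched_setaffinity(0, {cpu})


def run_pinned(cpu: int, fn, *args) -> threading.Thread:
    """Start fn(*args) in a new thread that pins itself to `cpu` first."""
    def body():
        pin_current_thread(cpu)
        fn(*args)

    t = threading.Thread(target=body)
    t.start()
    return t


def report(name: str, out: dict) -> None:
    """Record the affinity mask the calling thread ended up with."""
    out[name] = os.sched_getaffinity(0)


if __name__ == "__main__":
    allowed = sorted(os.sched_getaffinity(0))
    # Assumption: the two lowest allowed CPU ids are physical neighbors.
    a, b = allowed[0], allowed[min(1, len(allowed) - 1)]
    seen = {}
    threads = [run_pinned(a, report, "market_data", seen),
               run_pinned(b, report, "order_submit", seen)]
    for t in threads:
        t.join()
    print(seen)
```

Each thread pins itself, so the main thread's affinity mask is left unchanged.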

Mesh topology (Skylake-SP and later): Intel replaced the ring with a 2D mesh because a single ring doesn't scale much past ~12 stops: latency grows linearly with core count, and the larger ring-era Xeons already had to split their cores across two rings joined by bridges. The mesh gives O(√n) worst-case hops instead of O(n), but introduces its own surprise: both dimensions now matter, and the "distance" between two cores depends on their (x,y) coordinates on the die.
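The O(n) versus O(√n) claim is easy to check with a few lines of arithmetic. This sketch assumes an idealized layout: one stop per core, a single bidirectional ring, and a mesh that is the squarest grid able to hold n tiles, with requests moving along one axis and then the other. Real dies also place memory controllers and I/O on the mesh, so treat the numbers as shapes, not measurements.

```python
import math


def ring_worst_hops(n: int) -> int:
    """Farthest stop on a bidirectional ring of n stops."""
    return n // 2


def mesh_hops(a: tuple, b: tuple) -> int:
    """Manhattan distance between two (x, y) tiles."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def mesh_worst_hops(n: int) -> int:
    """Corner-to-corner distance on the squarest grid that holds n tiles."""
    rows = math.isqrt(n)
    cols = math.ceil(n / rows)
    return mesh_hops((0, 0), (cols - 1, rows - 1))


if __name__ == "__main__":
    for n in (12, 16, 28, 64):
        print(f"{n:3d} cores: ring worst = {ring_worst_hops(n):2d} hops, "
              f"mesh worst = {mesh_worst_hops(n):2d} hops")
```

At a dozen cores the two are nearly tied, which fits the observation that the ring's problems only show up as core counts grow.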

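Rule of thumb: On a ring-bus CPU with N ring stops, the farthest slice is N/2 hops away in each direction, so a round trip adds about N hops. Expect L3 latency to start at base_latency for the nearest slice and reach somewhere between base_latency + 1·N and base_latency + 2·N cycles for the farthest, depending on per-hop cost and ring load. If your hot data is shared between two threads, co-locate them on adjacent cores, so that the LLC slices they hit are roughly equidistant from both. You can probe slice topology with the CBox performance counters (uncore events such as UNC_CBO_CACHE_LOOKUP; exact event names vary by microarchitecture) to see which slice serves your workload.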

This is also why benchmarks vary across runs even on an isolated machine: the kernel may schedule your thread on a different core, and now the "same" cache hit takes a different number of cycles.

See it in action: Check out How to improve RAM Speed? by WePC to see this theory applied.

Key Takeaway: The L3 cache is not one cache — it's a collection of slices connected by an on-die network, and which slice owns your data determines whether an "L3 hit" takes 35 or 70 cycles.
