Daily Hardware Architecture: On-Chip Interconnects: How Cores Talk to Each Other

2026-05-02

You have 16 cores, a shared L3 cache sliced into banks, memory controllers, and an IO block. How do they all communicate? The on-chip interconnect is the nervous system of a modern processor, and its design directly determines latency, bandwidth, and how well performance scales with core count.

Ring Bus: Intel used a bidirectional ring in its server line from Sandy Bridge through Broadwell. Each core, cache slice, and agent sits as a stop on the ring. A message hops stop to stop, nominally taking one cycle per hop. With 10 stops, the worst case is 5 hops (half the ring, because a message can travel in either direction). Worst-case distance grows linearly with the number of stops: fine for 4-10 cores, but a hypothetical single ring with 28 stops has a 14-hop worst case, which at roughly 1ns per hop (a conservative figure, since one cycle at 2 GHz is 0.5ns) adds ~14ns of unloaded latency just to reach a cache slice.
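The hop arithmetic for a ring can be checked with a few lines of Python. This is a minimal sketch, assuming one stop per core and ignoring the extra stops real rings have for other agents; the function names are illustrative, not taken from any vendor tool.

```python
def ring_hops(src: int, dst: int, stops: int) -> int:
    """Shortest hop count between two stops on a bidirectional ring."""
    clockwise = (dst - src) % stops
    # A bidirectional ring lets the message take whichever way is shorter.
    return min(clockwise, stops - clockwise)


def worst_case_ring_hops(stops: int) -> int:
    """Largest shortest-path distance on the ring: half the ring, rounded down."""
    return stops // 2


if __name__ == "__main__":
    # Assumption: one ring stop per core.
    for stops in (4, 10, 28):
        print(stops, "stops: worst case", worst_case_ring_hops(stops), "hops")
```

Because the worst case is always half the number of stops, doubling the core count doubles the worst-case distance.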

Mesh: Starting with Skylake-SP (Xeon Scalable), Intel switched its server chips to a 2D mesh. Cores are arranged in a grid, and messages use dimension-ordered routing: a message travels all the way along one axis, then turns once onto the other. A 6×5 mesh serving 28 cores has a worst-case distance of 9 hops (5+4, corner to corner), versus 14 on a 28-stop ring. More importantly, a mesh provides multiple parallel paths, so aggregate bandwidth scales with the number of nodes rather than being bottlenecked by a single ring's width.
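Dimension-ordered routing is simple enough to sketch directly. The example below assumes a grid addressed as (column, row) and routes along X first, then Y; the axis order is an arbitrary choice for illustration and does not change the hop count. The names xy_route and mesh_hops are hypothetical.

```python
from typing import List, Tuple

Node = Tuple[int, int]  # (column, row) position of a tile in the grid


def xy_route(src: Node, dst: Node) -> List[Node]:
    """Tiles visited from src to dst, moving along X first and then along Y."""
    x, y = src
    path = [(x, y)]
    step_x = 1 if dst[0] > x else -1
    while x != dst[0]:          # finish all horizontal movement first
        x += step_x
        path.append((x, y))
    step_y = 1 if dst[1] > y else -1
    while y != dst[1]:          # then move vertically to the destination
        y += step_y
        path.append((x, y))
    return path


def mesh_hops(src: Node, dst: Node) -> int:
    """Hop count equals the number of links crossed (Manhattan distance)."""
    return len(xy_route(src, dst)) - 1


if __name__ == "__main__":
    # Corner to corner on a 6-column by 5-row grid.
    print(mesh_hops((0, 0), (5, 4)))
```

Corner to corner on a 6×5 grid the route crosses 5 horizontal and 4 vertical links, matching the 9-hop worst case.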

AMD's Infinity Fabric: AMD takes a different approach: a scalable, crossbar-like fabric connecting Core Complex Dies (CCDs), each typically containing 8 cores that share an L3 (on Zen 2, the 8 cores were split into two 4-core complexes with separate L3s). Communication between cores that share an L3 is fast, but cross-CCD traffic traverses the Infinity Fabric at higher latency (roughly 40ns additional on EPYC). This is why NUMA-aware placement matters even within a single AMD socket.

Rule of thumb: Each mesh/ring hop costs roughly 1ns. If your interconnect adds H hops, expect H × 1ns added to cache-to-cache transfers. On a 6×5 mesh, the average hop count between two randomly chosen tiles is approximately (6+5)/3 ≈ 3.7 hops, so expect ~4ns of average interconnect latency for L3 accesses.
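The (columns + rows)/3 shortcut can be compared against a brute-force average over every source and destination pair. This sketch assumes uniformly random traffic, counts a tile talking to itself as 0 hops, and uses the 1ns-per-hop rule of thumb from the text; the function name is hypothetical.

```python
from itertools import product


def average_mesh_hops(cols: int, rows: int) -> float:
    """Mean Manhattan distance over all ordered (source, destination) tile pairs."""
    nodes = list(product(range(cols), range(rows)))
    total = sum(
        abs(ax - bx) + abs(ay - by)
        for (ax, ay), (bx, by) in product(nodes, nodes)
    )
    return total / (len(nodes) ** 2)


if __name__ == "__main__":
    cols, rows = 6, 5
    exact = average_mesh_hops(cols, rows)   # about 3.54 hops
    shortcut = (cols + rows) / 3            # about 3.67 hops
    ns_per_hop = 1.0                        # rule-of-thumb assumption
    print(f"exact={exact:.2f} shortcut={shortcut:.2f} "
          f"latency~{exact * ns_per_hop:.1f}ns")
```

The shortcut slightly overestimates on small grids, but both figures round to about 4ns.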

Bandwidth matters too. A ring with a 32-byte wide data path running at 2 GHz delivers 32 B × 2 GHz = 64 GB/s per direction. A mesh with the same link width on a 6-wide grid provides 6 parallel vertical channels, or 6 × 64 = 384 GB/s of bisection bandwidth per direction. This is why server chips moved to meshes: parallel workloads saturate ring bandwidth long before they saturate a mesh.
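The bandwidth figures above are simple products, shown here as a sketch. It assumes one transfer of the full link width per cycle and counts a single direction only; the function names are hypothetical.

```python
def link_bandwidth_gb_s(width_bytes: int, clock_ghz: float) -> float:
    """Peak one-direction bandwidth of a link moving width_bytes every cycle."""
    return width_bytes * clock_ghz  # bytes/cycle * 1e9 cycles/s = GB/s


def mesh_bisection_gb_s(parallel_links: int, width_bytes: int, clock_ghz: float) -> float:
    """Peak one-direction bandwidth across a cut crossed by parallel_links links."""
    return parallel_links * link_bandwidth_gb_s(width_bytes, clock_ghz)


if __name__ == "__main__":
    print(link_bandwidth_gb_s(32, 2.0))      # single ring direction
    print(mesh_bisection_gb_s(6, 32, 2.0))   # six vertical mesh channels
```

These are peak numbers; sustained bandwidth under contention will be lower.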

Practical impact for programmers: On mesh architectures, L3 access latency varies depending on which slice holds your data (determined by a hash of the address) and how far that slice is from the requesting core. As illustrative figures, two threads sharing data on adjacent cores might see ~12ns L3 latency, while the same access from a far corner might take ~20ns. Tools like perf stat with offcore_response events can expose this non-uniformity. Pinning communicating threads to nearby cores on mesh processors can yield measurable speedups in latency-sensitive code.
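On Linux, pinning can be tried from Python with os.sched_setaffinity. This is a minimal sketch, not a tuning recipe: it only shows the pinning call, and it assumes you have already worked out which logical CPU numbers sit close together on the die, which the operating system's numbering does not guarantee. The helper name pin_to_cpu is hypothetical.

```python
import os
from typing import Set


def pin_to_cpu(cpu: int) -> Set[int]:
    """Restrict the caller to one logical CPU (Linux only); return the new affinity set."""
    if not hasattr(os, "sched_setaffinity"):
        raise OSError("os.sched_setaffinity is not available on this platform")
    os.sched_setaffinity(0, {cpu})       # pid 0 means the caller
    return set(os.sched_getaffinity(0))


if __name__ == "__main__":
    allowed = sorted(os.sched_getaffinity(0))
    # Assumption: the first allowed CPU is the one we want. In practice,
    # choose CPUs based on the measured or documented topology.
    print(pin_to_cpu(allowed[0]))
```

Measure before and after pinning, since the benefit depends on the workload and on where the shared data lands.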

See it in action: Check out System on Chip (SoC) Explained by ALL ABOUT ELECTRONICS to see this theory applied.

Key Takeaway: On-chip interconnect topology (ring, mesh, or fabric) determines how core-to-core and core-to-cache latency scales with core count. Meshes trade the ring's simplicity for parallel paths and aggregate bandwidth that keep 20+ core designs viable.
