2026-06-02
Snooping coherence (MESI on a shared bus) works great for 4-8 cores. Every cache watches every transaction, and the bus arbitrates ordering. But scale to 64 cores and the bus becomes a screaming bottleneck: every miss broadcasts to every cache, every cache must check its tags, and total snoop work grows as O(N²), since N cores each generate misses that all N caches have to look up. Directory-based coherence replaces broadcast with a phone book: a directory tracks exactly which caches hold each line, so the home node only messages the caches that actually care.
The structure: Each cache line in memory (or LLC) has a directory entry containing:
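A minimal sketch of such an entry, assuming a textbook layout: a coherence state, a sharer bit-vector, and an owner id. The names `DirState`, `DirectoryEntry`, `add_sharer`, and `sharer_list` are hypothetical, chosen for illustration rather than taken from any real design.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class DirState(Enum):
    UNCACHED = "U"   # no cache holds the line
    SHARED = "S"     # one or more caches hold a clean copy
    MODIFIED = "M"   # exactly one cache holds a dirty copy


@dataclass
class DirectoryEntry:
    """Assumed per-line directory entry (illustrative field names)."""
    state: DirState = DirState.UNCACHED
    sharers: int = 0              # bit i set => core i holds the line
    owner: Optional[int] = None   # meaningful only in MODIFIED

    def add_sharer(self, core: int) -> None:
        # Record that `core` now holds a clean copy.
        self.sharers |= 1 << core
        self.state = DirState.SHARED

    def sharer_list(self) -> List[int]:
        # Decode the bit-vector into a list of core ids.
        return [i for i in range(self.sharers.bit_length())
                if (self.sharers >> i) & 1]


entry = DirectoryEntry()
entry.add_sharer(5)
entry.add_sharer(9)
print(entry.sharer_list())  # [5, 9]
```

The sharer vector is the part that scales with core count: one bit per core, per line.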
The protocol flow for a read miss on core 5:
For a write, the home invalidates exactly the cores in the sharer vector — not all 64 cores. That's the win.
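The read-miss and write paths can be sketched as a toy home-node model for a single line. This is a simplification under stated assumptions, not any vendor's protocol: the class `Directory`, its methods, and the message strings are all hypothetical, and acknowledgements, transient states, and write-backs are left out.

```python
from typing import List, Optional, Set


class Directory:
    """Toy home-node directory for one cache line (illustrative only)."""

    def __init__(self) -> None:
        self.sharers: Set[int] = set()      # cores holding a clean copy
        self.owner: Optional[int] = None    # core holding a dirty copy
        self.log: List[str] = []            # messages the home node sends

    def read_miss(self, core: int) -> None:
        if self.owner == core:
            return  # the owner already has the line, so this is not a miss
        if self.owner is not None:
            # Dirty elsewhere: forward to the owner, which supplies the
            # data and keeps a shared copy.
            self.log.append(f"fwd_read -> core {self.owner}")
            self.sharers.add(self.owner)
            self.owner = None
        else:
            self.log.append(f"data from memory -> core {core}")
        self.sharers.add(core)

    def write(self, core: int) -> List[int]:
        # Invalidate only the caches the directory knows hold the line.
        targets = sorted(self.sharers - {core})
        if self.owner is not None and self.owner != core:
            targets.append(self.owner)
        for t in targets:
            self.log.append(f"invalidate -> core {t}")
        self.sharers = set()
        self.owner = core
        return targets


d = Directory()
for c in (5, 9, 12):
    d.read_miss(c)
print(d.write(5))  # [9, 12]: two invalidates, not 63
```

With 64 cores and three sharers, the write costs two targeted messages; a snooping bus would have made all 63 other caches check their tags.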
Real example: AMD's EPYC chips and Intel's Xeon Scalable use directory-style coherence at the socket level via their mesh/IO die. The Xeon's caching and home agent (CHA) at each LLC slice is the directory for the addresses it owns. When core 5 on socket A writes a line cached on socket B, the CHA owning that address sends a targeted invalidate across UPI — not a broadcast.
The cost — directory storage: A full bit-vector directory costs N bits per cache line for N cores. Rule of thumb:
Directory overhead = N bits / (line size in bytes × 8 bits per byte) = N / 512 for 64 B lines
For 64 cores: 64/512 = 12.5% extra storage on top of the data being tracked, just for the sharer bits. At 256 cores it's 50%, which is untenable. So real designs use limited pointer schemes (track only k sharers, fall back to broadcast when exceeded) or coarse vectors (one bit per group of 4 cores).
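The arithmetic is easy to check, and the same calculation shows why the compressed schemes help. A rough sketch, assuming 64 B lines and counting only sharer-tracking bits (state bits and any overflow flag are ignored); the function names are made up for illustration.

```python
def full_vector_overhead(n_cores: int, line_bytes: int = 64) -> float:
    # One presence bit per core, per cache line.
    return n_cores / (line_bytes * 8)


def coarse_vector_overhead(n_cores: int, group_size: int,
                           line_bytes: int = 64) -> float:
    # One bit per group of `group_size` cores (rounded up).
    bits = -(-n_cores // group_size)
    return bits / (line_bytes * 8)


def limited_pointer_overhead(n_cores: int, k: int,
                             line_bytes: int = 64) -> float:
    # k pointers, each wide enough to name any core.
    pointer_bits = (n_cores - 1).bit_length()
    return k * pointer_bits / (line_bytes * 8)


print(full_vector_overhead(64))            # 0.125
print(full_vector_overhead(256))           # 0.5
print(coarse_vector_overhead(256, 4))      # 0.125
print(limited_pointer_overhead(256, 4))    # 0.0625
```

Under these assumptions, grouping 256 cores by 4 brings the cost back to the 64-core figure, and four 8-bit pointers halve it again. The price is precision: a coarse bit invalidates the whole group, and a pointer overflow falls back to broadcast.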
The latency tax: Snooping resolves a miss in one bus round-trip. Directory protocols often need three hops: requester → home → owner → requester. That extra hop is why NUMA-aware software still matters — keeping the home node and the sharers in the same socket avoids cross-socket directory walks.
