Daily Hardware Architecture: The Directory-Based Coherence Protocol: How Many-Core CPUs Replace Snooping with a Phone Book

The Directory-Based Coherence Protocol: How Many-Core CPUs Replace Snooping with a Phone Book

2026-06-02

Snooping coherence (MESI on a shared bus) works great for 4-8 cores. Every cache watches every transaction, and the bus arbitrates ordering. But scale to 64 cores and the bus becomes a screaming bottleneck — every miss broadcasts to every cache, every cache must check its tags, and the snoop traffic grows as O(N²). Directory-based coherence replaces broadcast with a phone book: a directory tracks exactly which caches hold each line, so the home node only messages the caches that actually care.

The structure: Each cache line in memory (or LLC) has a directory entry containing:

State bits — Uncached, Shared, or Modified/Exclusive
Sharer vector — which cores hold a copy (a bitmap, or a pointer list, or coarse-grained groups)
Owner ID — which core has the dirty copy, if any

The protocol flow for a read miss on core 5:

Core 5 sends a read request to the home node (determined by hashing the physical address).
Home node consults the directory. If state is Shared/Uncached, it forwards data and sets bit 5 in the sharer vector.
If state is Modified, home forwards the request to the owner. Owner ships the line to core 5 and writes back to home. Directory transitions to Shared with bits for owner and core 5.

For a write, the home invalidates exactly the cores in the sharer vector — not all 64 cores. That's the win.

Real example: AMD's EPYC chips and Intel's Xeon Scalable use directory-style coherence at the socket level via their mesh/IO die. The Xeon's caching and home agent (CHA) at each LLC slice is the directory for the addresses it owns. When core 5 on socket A writes a line cached on socket B, the CHA owning that address sends a targeted invalidate across UPI — not a broadcast.

The cost — directory storage: A full bit-vector directory costs N bits per cache line for N cores. Rule of thumb:

Directory overhead = (N bits) / (cache line size × 8) = N / 512 for 64B lines

For 64 cores: 64/512 = 12.5% memory overhead just for tracking. At 256 cores it's 50% — untenable. So real designs use limited pointer schemes (track only k sharers, fall back to broadcast when exceeded) or coarse vectors (one bit per group of 4 cores).

The latency tax: Snooping resolves a miss in one bus round-trip. Directory protocols often need three hops: requester → home → owner → requester. That extra hop is why NUMA-aware software still matters — keeping the home node and the sharers in the same socket avoids cross-socket directory walks.

Key Takeaway: Directory coherence trades broadcast bandwidth for targeted messaging by storing a per-line sharer list, making many-core scaling possible at the cost of three-hop latency and significant directory storage overhead.

All newsletters