Daily Low-Level Programming: The Snoop Filter: Why Adding More Cores Made Cache Coherence Get Slower

2026-05-23

A snooping MESI implementation keeps caches coherent by broadcasting: when core 7 wants exclusive access to a line, every other core must check its L1/L2 and respond. With 4 cores this is cheap. With 56 cores on a Xeon socket, broadcasting every coherence message would saturate the ring/mesh interconnect. The fix: the snoop filter, a directory in the uncore that tracks which cores hold which cache lines.

Instead of asking every core "does anyone have line X?", the home agent responsible for the address (the CHA on Intel's mesh-based parts) consults the snoop filter, which says "line X is in core 3's L2, exclusive", and only core 3 gets snooped. The filter is itself a tag array (typically inclusive of all per-core L2s), sized to roughly match the aggregate private cache footprint.
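The difference between a broadcast and a filtered snoop can be sketched with a toy directory. This is a minimal model, not how any real uncore is built: the `SnoopFilter` class, its method names, and the 56-core count are illustrative assumptions, and real filters track state per line in a set-associative tag array.

```python
from collections import defaultdict


class SnoopFilter:
    """Toy directory: maps a cache-line address to the set of cores holding it."""

    def __init__(self):
        self.holders = defaultdict(set)

    def record_fill(self, line, core):
        # Called when `core` brings `line` into its private L2.
        self.holders[line].add(core)

    def snoop_targets(self, line, requester):
        # Only cores the directory lists as holders need to be snooped.
        return self.holders.get(line, set()) - {requester}


def broadcast_targets(num_cores, requester):
    # Without a filter, every other core must check its caches.
    return set(range(num_cores)) - {requester}


NUM_CORES = 56  # hypothetical socket size
sf = SnoopFilter()
sf.record_fill(line=0x1000, core=3)

filtered = sf.snoop_targets(0x1000, requester=7)
broadcast = broadcast_targets(NUM_CORES, requester=7)
print(len(filtered), len(broadcast))  # 1 55
```

In this model, core 7's request for line `0x1000` costs one snoop instead of 55, and a request for a line nobody holds costs none.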

Here's where it bites you: the snoop filter has limited capacity. On Skylake-SP server parts, it tracks roughly the sum of all L2 tags. When it fills, evicting an entry forces a back-invalidation: the corresponding line gets yanked out of the core's L2 even though that core was actively using it. Your hot working set silently disappears from L2 because some unrelated thread on another core touched cold lines and overflowed the directory.
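The back-invalidation effect described above can be reproduced in a few lines. This is a rough sketch under loud assumptions: the filter is modeled as fully associative with oldest-first replacement, L2 capacity is not modeled at all, lines are never shared, and the filter only sees L2 misses (hits stay inside the core, so a hot line can look old to the directory). The `Core` and `SnoopFilter` classes and the tiny capacity of 8 are hypothetical.

```python
from collections import OrderedDict


class SnoopFilter:
    """Toy shared directory with a fixed number of entries, oldest-first eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # line -> owning Core, in allocation order

    def allocate(self, core, line):
        if len(self.entries) >= self.capacity:
            victim_line, victim_core = self.entries.popitem(last=False)
            # Back-invalidation: the filter can no longer track this line,
            # so it must be removed from the owner's L2 as well.
            victim_core.l2.discard(victim_line)
            victim_core.back_invalidated += 1
        self.entries[line] = core


class Core:
    def __init__(self, snoop_filter):
        self.sf = snoop_filter
        self.l2 = set()  # lines currently resident (L2 capacity not modeled)
        self.misses = 0
        self.back_invalidated = 0

    def load(self, line):
        if line in self.l2:
            return  # L2 hit: the filter never observes it
        self.misses += 1
        self.sf.allocate(self, line)
        self.l2.add(line)


sf = SnoopFilter(capacity=8)
hot, scanner = Core(sf), Core(sf)
hot_lines = range(0, 4)

# The hot core loops over a tiny working set: 4 cold misses, then only hits.
for _ in range(3):
    for line in hot_lines:
        hot.load(line)
warm_misses = hot.misses

# An unrelated core streams through 16 cold lines and overflows the filter.
for line in range(1000, 1016):
    scanner.load(line)

# The hot core touches its working set again and misses on every line.
for line in hot_lines:
    hot.load(line)

print(warm_misses, hot.misses, hot.back_invalidated)  # 4 8 4
```

The hot core never ran out of room in its own L2 in this model; all four extra misses come from the scanner pushing its entries out of the shared filter.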

Real-world example: Netflix engineers debugged a Cassandra workload where adding cores to a 2-socket box made p99 latency worse. Profiling showed L2 miss rates climbing on cores whose working sets were stable and fit comfortably in L2. The culprit: a background scanner thread on another core was streaming through gigabytes of cold data, flooding the snoop filter, and back-invalidating Cassandra's hot index pages out of L2. Pinning the scanner to a dedicated CCX with `taskset` restored latency. The hot data hadn't been evicted by capacity pressure in its own L2; it was evicted by coherence bookkeeping pressure two levels up.
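The pinning step can also be done from inside a process rather than with `taskset`. The sketch below uses Python's `os.sched_setaffinity`, which is available on Linux only. The helper name `pin_to_cpus` and the choice of "highest-numbered allowed CPU" are illustrative assumptions; which CPU ids actually sit behind a separate snoop filter depends on the machine's topology and has to be looked up (for example with `lscpu`).

```python
import os


def pin_to_cpus(cpus, pid=0):
    """Restrict `pid` (0 means the calling process) to the given CPU ids."""
    allowed = os.sched_getaffinity(pid)
    target = set(cpus) & allowed
    if not target:
        raise ValueError(
            f"none of {sorted(cpus)} are in the allowed set {sorted(allowed)}"
        )
    os.sched_setaffinity(pid, target)
    return os.sched_getaffinity(pid)


# Hypothetical layout: treat the highest-numbered allowed CPU as the one
# reserved for the background scanner.
scanner_cpu = max(os.sched_getaffinity(0))
print(pin_to_cpus({scanner_cpu}))
```

Pinning an already-running process from the shell looks like `taskset -cp <cpu-list> <pid>`; either way, the point is to keep the streaming thread off the cores that share a directory with the latency-sensitive ones.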

Rule of thumb: snoop filter capacity ≈ Σ(per-core L2 size) ÷ line size. On a 28-core Skylake-SP with 1 MiB L2 per core and 64-byte lines, that's 458,752 entries (about 459K). If your aggregate working set across cores exceeds this, even split among threads that never share data, you'll see back-invalidation storms. Diagnose with an L2 miss counter such as `perf stat -e mem_load_retired.l2_miss` (event names vary by microarchitecture; check `perf list`) and look for L2 misses that don't correlate with your code's actual reuse distance.
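The rule of thumb is simple enough to check by hand, but a small helper makes it easy to plug in other core counts. This is just the arithmetic from the paragraph above; the function names and the 1.5 MiB per-thread working set in the example are illustrative assumptions, and real filters are set-associative, so they can overflow before the total is reached.

```python
MiB = 1024 * 1024


def snoop_filter_entries(cores, l2_bytes_per_core, line_bytes=64):
    """Approximate entry count: sum of per-core L2 sizes divided by line size."""
    return cores * l2_bytes_per_core // line_bytes


def exceeds_filter(per_thread_bytes, entries, line_bytes=64):
    """True if the threads' private working sets need more lines than the filter tracks."""
    # Round each working set up to whole cache lines.
    lines_needed = sum(-(-b // line_bytes) for b in per_thread_bytes)
    return lines_needed > entries


entries = snoop_filter_entries(cores=28, l2_bytes_per_core=1 * MiB)
print(entries)  # 458752

# Hypothetical load: 28 threads, each touching 1.5 MiB of private data.
print(exceeds_filter([3 * MiB // 2] * 28, entries))  # True
```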

The deeper irony: cache coherence, sold as "your cores share memory transparently," has a cost structure that scales worse than the caches it protects. The filter is one reason an 8-core chiplet sometimes beats a monolithic 32-core die for latency-sensitive workloads: fewer cores share each directory, so there are fewer unrelated threads competing for its entries and less back-invalidation pressure.

Key Takeaway: The snoop filter scales cache coherence to many cores, but its finite capacity means unrelated threads can evict your hot data from L2 through coherence bookkeeping, even when your own L2 has room to spare.
