2026-05-10
Two threads write to two completely separate variables. No locks, no atomics needed. Performance tanks anyway. Welcome to false sharing: the cache coherence protocol's revenge on naive data layout.
Cache coherence operates at cache line granularity, not variable granularity. On x86, that's 64 bytes. When CPU 0 writes to a line, the MESI protocol invalidates that line in every other core's cache. The next read from CPU 1 incurs a cache miss — even if CPU 1 only cares about a different byte in the same line.
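To make "line granularity" concrete, here is a minimal sketch of how one might check whether two addresses fall in the same line. The helper names (cache_line_index, same_cache_line) and the CACHE_LINE_SIZE constant are illustrative assumptions, not part of any library, and the pointer-to-integer conversion assumes a flat address space.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 64-byte lines, as stated above for x86. */
#define CACHE_LINE_SIZE 64u

/* Index of the cache line that contains address p. */
static uintptr_t cache_line_index(const void *p) {
    return (uintptr_t)p / CACHE_LINE_SIZE;
}

/* True when both addresses live in the same line, meaning a write to one
 * invalidates other cores' cached copies of the other as well. */
static bool same_cache_line(const void *a, const void *b) {
    return cache_line_index(a) == cache_line_index(b);
}
```

Two adjacent uint64_t fields in a line-aligned struct would return true here, which is exactly the situation the coherence protocol penalizes.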
The classic shape:
struct stats { uint64_t thread_a_count; uint64_t thread_b_count; };
Thread A increments thread_a_count in a tight loop.
Thread B increments thread_b_count in a tight loop.
Real-world example: Linux's per-CPU counters. Early versions of network stack stats stored multiple counters in one struct. Profiling showed bizarre slowdowns under load: two unrelated counters incrementing on different CPUs caused stalls of 100 ns or more per operation. The fix was ____cacheline_aligned_in_smp, defined in <linux/cache.h>, which on SMP builds aligns the annotated struct to a cache line boundary so it no longer shares a line with its neighbors.
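The two-counter shape can be reproduced in user space. The following is a rough sketch, assuming POSIX threads and 64-byte lines; shared_stats, padded_stats, bump and run_pair are hypothetical names chosen for illustration. The counters are volatile only so the compiler keeps the store inside the loop, and each thread writes only its own counter, so there is no data race.

```c
#include <assert.h>
#include <pthread.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Both counters packed together: they normally share one 64-byte line. */
struct shared_stats {
    volatile uint64_t a;
    volatile uint64_t b;
};

/* Each counter starts its own line (assumption: 64-byte lines). */
struct padded_stats {
    alignas(64) volatile uint64_t a;
    alignas(64) volatile uint64_t b;
};
static_assert(offsetof(struct padded_stats, b) == 64, "b must start a new line");

struct job {
    volatile uint64_t *counter;
    uint64_t iters;
};

/* Thread body: increment this thread's own counter iters times. */
static void *bump(void *arg) {
    struct job *j = arg;
    for (uint64_t i = 0; i < j->iters; i++)
        (*j->counter)++;
    return NULL;
}

/* Runs one thread per counter and returns elapsed wall-clock seconds,
 * or -1.0 if a thread could not be created. */
static double run_pair(volatile uint64_t *a, volatile uint64_t *b,
                       uint64_t iters) {
    struct job ja = { a, iters }, jb = { b, iters };
    pthread_t ta, tb;
    struct timespec t0, t1;

    timespec_get(&t0, TIME_UTC);
    if (pthread_create(&ta, NULL, bump, &ja) != 0)
        return -1.0;
    if (pthread_create(&tb, NULL, bump, &jb) != 0) {
        pthread_join(ta, NULL);
        return -1.0;
    }
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    timespec_get(&t1, TIME_UTC);

    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Calling run_pair once on the two fields of a shared_stats and once on the two fields of a padded_stats, with the same iteration count, gives two timings to compare. Build with -pthread. The size of the gap depends on the machine and on which cores the scheduler picks, so treat any single run as indicative only.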
Detection: perf c2c (cache-to-cache) is purpose-built for this. Collect samples with perf c2c record, then run perf c2c report, which ranks cache lines by HITM (hit-modified) count, the smoking gun. Look for lines where multiple CPUs report loads/stores at different offsets within the same line.
Rule of thumb: Per-thread mutable data must be ≥64-byte aligned and ≥64 bytes in size. If you're writing a per-thread counter array, declare it as:
struct counter { uint64_t val; char pad[56]; } __attribute__((aligned(64)));
Or in C++17+, alignas(std::hardware_destructive_interference_size), with the constant declared in <new>.
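For portable C11, one possible variant of the per-thread counter array uses alignas from <stdalign.h> instead of the GCC attribute. Because sizeof is always a multiple of the alignment, the compiler supplies the padding, and the static_assert lines check that it did. CACHE_LINE_SIZE, MAX_THREADS, counter_add and counter_sum are hypothetical names for this sketch.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64   /* assumption: 64-byte lines */
#define MAX_THREADS 8        /* hypothetical upper bound on threads */

/* One counter per cache line; the tail padding is implicit. */
struct counter {
    alignas(CACHE_LINE_SIZE) uint64_t val;
};

static_assert(sizeof(struct counter) == CACHE_LINE_SIZE,
              "each counter must fill exactly one line");
static_assert(alignof(struct counter) == CACHE_LINE_SIZE,
              "each counter must start on a line boundary");

static struct counter counters[MAX_THREADS];

/* Each thread passes its own tid, so it only ever writes its own line. */
static void counter_add(unsigned tid, uint64_t n) {
    counters[tid].val += n;
}

/* Sum across threads. Call only once the writers have finished
 * (for example after joining them), since this reads without
 * synchronization. */
static uint64_t counter_sum(void) {
    uint64_t total = 0;
    for (unsigned i = 0; i < MAX_THREADS; i++)
        total += counters[i].val;
    return total;
}
```

The cost is 64 bytes per thread instead of 8, which for a handful of threads is negligible next to the stalls it avoids.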
The math: A coherence miss costs roughly 40–100 ns on a single socket, 100–300 ns cross-socket. A tight loop that would otherwise do 100M increments per second can, in the worst case where every increment misses, need 4–10 s for those same 100M increments on a single socket (100M × 40–100 ns). That is seconds of accumulated stall per thread. The fix, 56 bytes of padding, is essentially free.
Counterintuitive corner: Read-only sharing is fine. The MESI "Shared" state allows multiple caches to hold a line simultaneously. False sharing only bites when at least one party writes. A common misdiagnosis is padding read-mostly config structs that don't actually need it, while leaving the hot writable counter naked.
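As a sketch of where the padding belongs, the layout below leaves the read-mostly fields packed together and gives each hot written counter its own line. The struct and field names are hypothetical, and 64-byte lines are assumed. The read-mostly block is also kept off the counters' lines, because readers sharing a line with a writer still take the invalidations.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64   /* assumption: 64-byte lines */

/* Hypothetical server state, for illustration only. */
struct server_state {
    /* Read-mostly: set at startup, then only read. These can share a
     * line with each other; every core keeps it in the Shared state. */
    uint32_t max_connections;
    uint32_t timeout_ms;
    uint64_t feature_flags;

    /* Hot and written constantly by one thread: gets its own line. */
    alignas(CACHE_LINE_SIZE) uint64_t requests_served;

    /* Second hot counter with a different writer: another line. */
    alignas(CACHE_LINE_SIZE) uint64_t bytes_sent;
};

static_assert(offsetof(struct server_state, feature_flags) < CACHE_LINE_SIZE,
              "config fields stay together in the first line");
static_assert(offsetof(struct server_state, requests_served) == CACHE_LINE_SIZE,
              "first hot counter starts the second line");
static_assert(offsetof(struct server_state, bytes_sent) == 2 * CACHE_LINE_SIZE,
              "second hot counter starts the third line");
```

The struct grows to three lines, but only the fields that are actually written at high rate pay for isolation.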
