2026-05-19
When core 0 writes to address X and core 1 reads X a nanosecond later, why does core 1 see the new value? Not magic — a state machine called MESI (Modified, Exclusive, Shared, Invalid) running silently in every cache line.
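Each L1 cache line carries two bits of state metadata, enough to encode the four states: Modified (the only copy, and dirty), Exclusive (the only copy, and clean), Shared (clean, possibly cached by other cores too), and Invalid (no usable data).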
Cores exchange messages over the ring/mesh interconnect: Read, ReadForOwnership (RFO), Invalidate, WriteBack. The key rule: before any core writes a line, every other core must invalidate its copy. That handshake is what makes your `mov [x], 1` eventually visible to thread B.
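To make the key rule concrete, here is a toy model of a single cache line as seen by several private caches. It is a sketch, not how any real CPU implements coherence: the `Line` class, its `log` of messages, and the choice to log an upgrade from S as `Invalidate` and a write from I as `RFO` are all assumptions made for illustration.

```python
from enum import IntEnum


class State(IntEnum):
    # Four states, so two bits are enough.
    INVALID = 0
    SHARED = 1
    EXCLUSIVE = 2
    MODIFIED = 3


class Line:
    """One cache line across n_cores private caches (hypothetical toy model)."""

    def __init__(self, n_cores):
        self.state = [State.INVALID] * n_cores
        self.log = []  # coherence messages in the order they were sent

    def read(self, core):
        if self.state[core] != State.INVALID:
            return  # hit: no coherence traffic
        self.log.append(("Read", core))
        holders = [c for c, s in enumerate(self.state) if s != State.INVALID]
        for c in holders:
            if self.state[c] == State.MODIFIED:
                self.log.append(("WriteBack", c))  # dirty data must be supplied
            self.state[c] = State.SHARED  # M/E/S -> S
        # Alone in the system: Exclusive. Otherwise: Shared.
        self.state[core] = State.SHARED if holders else State.EXCLUSIVE

    def write(self, core):
        s = self.state[core]
        if s == State.MODIFIED:
            return  # already the owner: silent hit
        if s == State.EXCLUSIVE:
            self.state[core] = State.MODIFIED  # silent upgrade, no messages
            return
        # From S or I, the core must first kill every other copy.
        self.log.append(("Invalidate" if s == State.SHARED else "RFO", core))
        for c in range(len(self.state)):
            if c != core and self.state[c] != State.INVALID:
                if self.state[c] == State.MODIFIED:
                    self.log.append(("WriteBack", c))
                self.state[c] = State.INVALID
        self.state[core] = State.MODIFIED


line = Line(2)
line.read(1)   # core 1: I -> E (nobody else has it)
line.read(0)   # core 0: I -> S, core 1: E -> S
line.write(0)  # core 0 sends Invalidate, core 1: S -> I, core 0: S -> M
print([s.name for s in line.state], line.log)
```

The invariant to look for: whenever one entry of `line.state` is `MODIFIED`, every other entry is `INVALID`.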
Concrete example — the producer/consumer cost:
Thread A on core 0 writes a flag every iteration. Thread B on core 1 reads it. First write: core 0 has the line in S (because B previously read it). Core 0 must send an Invalidate, wait for the ack from core 1 (line goes I on core 1), then transition its own line to M and write. Core 1's next read misses, sends a Read request, core 0 transitions M→S and forwards the line. Each ping-pong costs ~40–80 cycles versus ~4 cycles for an L1 hit on a private line.
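The same trace can be replayed as a back-of-the-envelope calculation. This sketch assumes one write plus one read counts as one ping-pong, and it simply plugs in the cycle figures quoted above; `count_ping_pongs` and the constants are hypothetical names, and the numbers are ballpark, not measurements.

```python
# Figures quoted in the text; treat them as rough estimates.
PING_PONG_CYCLES = (40, 80)
L1_HIT_CYCLES = 4


def count_ping_pongs(iterations):
    """Count line transfers for a two-core flag write/read loop (toy model)."""
    # Core 1 has read the flag before, so both cores start in Shared.
    core0, core1 = "S", "S"
    transfers = 0
    for _ in range(iterations):
        # Producer on core 0 writes the flag.
        if core0 != "M":
            core1 = "I"  # Invalidate sent, core 1 acks
            core0 = "M"  # core 0 now owns the line and writes
        # Consumer on core 1 reads the flag.
        if core1 == "I":
            core0, core1 = "S", "S"  # read miss: core 0 goes M->S and forwards
            transfers += 1
    return transfers


iterations = 1_000_000
transfers = count_ping_pongs(iterations)
shared_cost = tuple(transfers * c for c in PING_PONG_CYCLES)
private_cost = iterations * L1_HIT_CYCLES
slowdown = tuple(c / private_cost for c in shared_cost)
print(transfers, shared_cost, private_cost, slowdown)
```

Under these assumptions every iteration pays for a transfer, so the shared flag comes out roughly 10x to 20x more expensive than the same loop on a private line.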
This is why an uncontended atomic increment costs somewhere between ~5 and ~20 cycles depending on the microarchitecture, while a contended one costs 50+. Same instruction, completely different cost: the difference is whether the line is in E/M on your core or bouncing between cores via Invalidate/RFO traffic.
Rule of thumb: count the state transitions, not the instructions. When a cache line is shared by N cores, each write has to invalidate up to N−1 other copies and wait for their acks, and the next read on each of those cores is a miss. On a 16-core Xeon, a single contended counter can saturate the coherence fabric long before it saturates a single core's ALU.
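One way to count those transitions is a toy model that tracks only which cores hold a valid copy. It assumes the cores take turns writing in round-robin order, and the `readers_between_writes` switch (a made-up parameter) decides whether every core re-reads the line between writes, as pollers would. Real interconnects use directories and snoop filters, so take the output as an intuition aid only.

```python
def invalidations_per_write(n_cores, readers_between_writes):
    """Average number of remote copies invalidated per write (toy model)."""
    sharers = set()  # cores currently holding a valid copy of the line
    invalidated = 0
    writes = 0
    for step in range(n_cores * 100):
        writer = step % n_cores  # cores take turns writing
        if readers_between_writes:
            # Every core polls the value, so all of them hold the line in S.
            sharers = set(range(n_cores))
        # The writer must invalidate every copy except its own.
        invalidated += len(sharers - {writer})
        sharers = {writer}  # writer now holds the only copy, in M
        writes += 1
    return invalidated / writes


polled = invalidations_per_write(16, readers_between_writes=True)
write_only = invalidations_per_write(16, readers_between_writes=False)
print(polled, write_only)
```

In this model, pure write-write contention invalidates about one copy per write (the previous owner's), while a line that all 16 cores also read costs 15 invalidations per write. The readers are what turn a contended counter into fabric-wide traffic.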
Practical implications:
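`perf stat -e cache-misses,mem_load_l3_miss_retired.remote_hitm` gives you a rough count of how often you're paying the penalty. The second event is Intel-specific, and as far as I know `remote_hitm` counts loads served from a modified line in another socket's cache, so it mostly captures cross-socket bouncing rather than every core-to-core transfer.
MESI is what makes shared memory feel like memory. It's also what makes shared memory expensive.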
