Daily Low-Level Programming: The Cache Coherence Protocol: How MESI Keeps Your Cores From Lying to Each Other

2026-05-19

When core 0 writes to address X and core 1 reads X a nanosecond later, why does core 1 see the new value? Not magic — a state machine called MESI (Modified, Exclusive, Shared, Invalid) running silently in every cache line.

Each L1 cache line carries two bits of state metadata:

Modified (M): This core has the only copy and it's dirty. Memory is stale.
Exclusive (E): This core has the only copy and it's clean. Memory matches.
Shared (S): Multiple cores have clean copies. Memory matches.
Invalid (I): Garbage. Must refetch before use.
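
As a rough illustration of the four states above, here is a minimal C++ sketch. The numeric encoding and the helper names (`can_read_locally`, `can_write_silently`, `is_dirty`) are assumptions made for this example, not any real CPU's layout.

```cpp
#include <cstdint>

// Four states fit in two bits. The numeric values are an arbitrary
// choice for this sketch, not a real hardware encoding.
enum class Mesi : std::uint8_t {
    Invalid   = 0,  // no usable copy; must refetch
    Shared    = 1,  // clean; other cores may also hold it
    Exclusive = 2,  // clean; only copy
    Modified  = 3,  // dirty; only copy, memory is stale
};

// A load can be served from this cache in any state except Invalid.
constexpr bool can_read_locally(Mesi s) { return s != Mesi::Invalid; }

// A store needs no coherence traffic only when no other core can hold a copy.
constexpr bool can_write_silently(Mesi s) {
    return s == Mesi::Exclusive || s == Mesi::Modified;
}

// Only a Modified line differs from memory, so only it needs a WriteBack
// before it can be dropped.
constexpr bool is_dirty(Mesi s) { return s == Mesi::Modified; }

static_assert(static_cast<unsigned>(Mesi::Modified) < 4u, "state fits in two bits");
```

The useful split is between `can_read_locally` and `can_write_silently`: a Shared line is free to read but not free to write.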

Cores exchange messages over the ring/mesh interconnect: Read, ReadForOwnership (RFO), Invalidate, WriteBack. The key rule: before any core writes a line, every other core must invalidate its copy. That handshake is what makes your `mov [x], 1` eventually visible to thread B.
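
The per-cache state machine can be sketched as a single transition function. This is a simplified textbook MESI, not a model of any particular CPU; the `Event`, `BusAction`, and `Step` types and the `others_have_copy` parameter are names invented for this example.

```cpp
#include <cassert>

enum class Mesi { Invalid, Shared, Exclusive, Modified };

// Events one cache can observe: its own core's accesses, plus requests
// snooped from other cores.
enum class Event { LocalRead, LocalWrite, SnoopRead, SnoopRfoOrInvalidate };

// What this cache has to send on the interconnect in response.
enum class BusAction { None, Read, ReadForOwnership, Invalidate, WriteBack };

struct Step { Mesi next; BusAction action; };

// `others_have_copy` only matters on a LocalRead miss: it picks S versus E.
Step transition(Mesi s, Event e, bool others_have_copy = false) {
    switch (e) {
    case Event::LocalRead:
        if (s != Mesi::Invalid) return {s, BusAction::None};  // hit
        return {others_have_copy ? Mesi::Shared : Mesi::Exclusive, BusAction::Read};
    case Event::LocalWrite:
        if (s == Mesi::Modified || s == Mesi::Exclusive)
            return {Mesi::Modified, BusAction::None};          // silent upgrade
        if (s == Mesi::Shared)
            return {Mesi::Modified, BusAction::Invalidate};    // kill other copies first
        return {Mesi::Modified, BusAction::ReadForOwnership};  // from Invalid
    case Event::SnoopRead:
        if (s == Mesi::Modified)  return {Mesi::Shared, BusAction::WriteBack};
        if (s == Mesi::Exclusive) return {Mesi::Shared, BusAction::None};
        return {s, BusAction::None};
    case Event::SnoopRfoOrInvalidate:
        // Another core wants to write: drop our copy, flushing it first if dirty.
        return {Mesi::Invalid,
                s == Mesi::Modified ? BusAction::WriteBack : BusAction::None};
    }
    assert(false && "unreachable");
    return {s, BusAction::None};
}
```

For example, `transition(Mesi::Shared, Event::LocalWrite)` yields Modified plus an Invalidate, which is the key rule in code form, while the same write from Exclusive generates no traffic at all.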

Concrete example — the producer/consumer cost:

Thread A on core 0 writes a flag every iteration. Thread B on core 1 reads it. On the first write, core 0 holds the line in S (because B previously read it). Core 0 must send an Invalidate, wait for the ack from core 1 (whose copy goes to I), then transition its own line to M and write. Core 1's next read misses and sends a Read request; core 0 transitions M→S, writes the dirty data back, and forwards the line. Each ping-pong costs ~40–80 cycles within a socket, versus ~4 cycles for an L1 hit on a private line.
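
To make the ping-pong countable, here is a toy model of one cache line seen by several cores. It tracks states and tallies traffic only; it does not model timing, memory, or a real interconnect. `LineModel` and `ping_pong` are hypothetical names for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Mesi { Invalid, Shared, Exclusive, Modified };

// One cache line as seen by every core, plus counters for coherence traffic.
struct LineModel {
    std::vector<Mesi> state;   // state[i] is core i's copy
    long invalidations = 0;    // copies killed by another core's write
    long transfers = 0;        // misses that had to fetch the line

    explicit LineModel(std::size_t cores) : state(cores, Mesi::Invalid) {}

    void read(std::size_t core) {
        assert(core < state.size());
        if (state[core] != Mesi::Invalid) return;     // hit, no traffic
        bool shared = false;
        for (std::size_t i = 0; i < state.size(); ++i) {
            if (i == core || state[i] == Mesi::Invalid) continue;
            state[i] = Mesi::Shared;                  // any existing holder ends up Shared
            shared = true;
        }
        state[core] = shared ? Mesi::Shared : Mesi::Exclusive;
        ++transfers;
    }

    void write(std::size_t core) {
        assert(core < state.size());
        if (state[core] == Mesi::Invalid) ++transfers; // RFO also fetches the data
        for (std::size_t i = 0; i < state.size(); ++i) {
            if (i == core || state[i] == Mesi::Invalid) continue;
            state[i] = Mesi::Invalid;                 // invalidate before write
            ++invalidations;
        }
        state[core] = Mesi::Modified;
    }
};

// Producer on core 0 writes the flag, consumer on core 1 reads it, `rounds` times.
LineModel ping_pong(int rounds) {
    LineModel line(2);
    line.read(1);              // B read the flag earlier...
    line.read(0);              // ...so core 0 starts in Shared, as in the text
    line.invalidations = line.transfers = 0;  // count only the steady state
    for (int r = 0; r < rounds; ++r) {
        line.write(0);         // core 0: S -> M, core 1: S -> I
        line.read(1);          // core 1 misses, core 0: M -> S
    }
    return line;
}
```

In this model every round costs exactly one invalidation and one line transfer, so `ping_pong(1000)` reports 1,000 of each. At the ~40–80 cycles per ping-pong quoted above, that is roughly 40,000–80,000 cycles spent on coherence alone.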

This is why an uncontended atomic increment costs somewhere between a handful and a couple dozen cycles (depending on the microarchitecture), while a contended one costs 50+. Same instruction, completely different cost: the difference is whether the line is already in E/M on your core or bouncing between cores via Invalidate/RFO traffic.

Rule of thumb: Count the state transitions, not the instructions. When a line is shared by N cores, each write must invalidate up to N−1 other copies and wait for their acknowledgements. On a 16-core Xeon, a single contended counter can saturate the coherence fabric long before it saturates a single core's ALU.

Practical implications:

Per-CPU counters aggregated lazily beat one shared atomic by 10–100x under contention.
Read-mostly data wants to live in S on every core — adding even a rare write forces global invalidation.
The infamous "writer starves readers" pattern: once a writer takes M, every reader misses until the line returns to S.
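`perf stat -e cache-misses,mem_load_l3_miss_retired.remote_hitm` counts loads served from a modified line in another socket's cache on Intel server parts; for cores on the same socket, the matching event on many Intel CPUs is `mem_load_l3_hit_retired.xsnp_hitm`. Event names vary by microarchitecture, so check `perf list`. `perf c2c` goes a step further and attributes the HITM traffic to specific cache lines.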
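
The per-CPU counter idea can be approximated in portable C++ with one padded counter per thread. This is a sketch, not a drop-in replacement for real per-CPU data: the 64-byte line size is an assumption about the target, `ShardedCounter` and `run_workers` are names made up for this example, and it assumes C++17 or later so the vector's allocator honors the alignment.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// 64 bytes is the usual x86 line size; an assumption, check your target.
constexpr std::size_t kCacheLine = 64;

// One counter per shard, padded to a full line so increments from
// different threads never touch the same cache line.
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<std::uint64_t> value{0};
};

class ShardedCounter {
public:
    explicit ShardedCounter(std::size_t shards) : shards_(shards) {
        assert(shards > 0);
    }

    // Hot path: touches only the caller's shard, so that line can stay in M locally.
    void add(std::size_t shard, std::uint64_t n = 1) {
        shards_[shard % shards_.size()].value.fetch_add(n, std::memory_order_relaxed);
    }

    // Cold path: lazy aggregation. Each call pulls every shard's line once.
    std::uint64_t read() const {
        std::uint64_t total = 0;
        for (const auto& s : shards_) total += s.value.load(std::memory_order_relaxed);
        return total;
    }

private:
    std::vector<PaddedCounter> shards_;
};

// Spawns `threads` workers; each adds `per_thread` increments to its own shard.
std::uint64_t run_workers(std::size_t threads, std::uint64_t per_thread) {
    ShardedCounter counter(threads);
    std::vector<std::thread> pool;
    for (std::size_t t = 0; t < threads; ++t)
        pool.emplace_back([&counter, t, per_thread] {
            for (std::uint64_t i = 0; i < per_thread; ++i) counter.add(t);
        });
    for (auto& th : pool) th.join();  // join makes all increments visible to read()
    return counter.read();
}
```

The trade-off is that `read()` is no longer a single load: it walks every shard and may see a slightly stale total while writers are running, which is usually acceptable for statistics counters.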

MESI is what makes shared memory feel like memory. It's also what makes shared memory expensive.

Key Takeaway: A write to a cache line that other cores also hold must first invalidate their copies, so the cost of atomics and shared variables scales with how many cores are touching the line, not how fast any one core is.
