Cache Coherence Protocols: How MESI Keeps Multicore Honest

2026-04-22

You already know about store buffers and memory ordering. But there's a deeper problem: when Core 0 writes to address X, how does Core 1 know its cached copy of X is now stale? This is the cache coherence problem, and the hardware solves it with a protocol called MESI.

Every cache line (typically 64 bytes) carries a state tag — one of four values:

The magic is in the transitions. When Core 0 wants to write a Shared line, it broadcasts an invalidation on the bus (or through a snoop filter/directory). Every other core holding that line must transition it to Invalid and acknowledge. Only after all acknowledgments arrive can Core 0 move to Modified and commit the write. This is why cross-core writes are expensive — they're not just memory operations, they're negotiations.

Real-world impact: false sharing. Imagine two threads updating independent variables that happen to sit on the same 64-byte cache line. Each write invalidates the other core's copy, forcing a full coherence round-trip. On a modern Intel processor, that round-trip costs roughly 40–70 ns between cores on the same die. If your hot loop hits this, you're burning 100+ cycles per iteration on invisible protocol traffic. The fix: pad your structs so independent hot fields land on separate cache lines (alignas(64) in C++).

Rule of thumb: A cache line bouncing between two cores in Modified state costs roughly 50–100x more than a local L1 hit (~1 ns vs 50–70 ns). If your profiler shows high "contested" or "remote HITM" counts (using perf c2c on Linux), false sharing is usually the culprit.

Modern CPUs extend MESI to MESIF (Intel, adding a Forward state to reduce redundant responses) or MOESI (AMD, adding an Owned state that lets dirty data be shared without writing back to memory first — saving bandwidth). The core idea is identical: every cache line has a state machine, and cores coordinate through invalidations and acknowledgments to maintain a single coherent view of memory.

Directory-based coherence (used in multi-socket NUMA systems) replaces broadcast snooping with a centralized tracker, reducing bus traffic at the cost of an extra lookup. This is why cross-socket coherence on a dual-socket server can cost 100–200 ns — roughly 2x the intra-die cost.

Key Takeaway: MESI ensures every core sees a consistent view of memory by attaching a state machine to each cache line, but this coherence traffic is the hidden cost behind false sharing and cross-core communication — making data layout as important as algorithm design on multicore hardware.

All newsletters