Daily Digital Circuits: MESI Cache Coherence: How Hardware Keeps Multiple Caches Consistent Without Asking the Software

2026-06-07

When you have four CPU cores, each with its own L1 cache, the same memory line can sit in several caches at once. If core 0 writes to address X while core 1 still has the old value cached, core 1 will keep reading stale data unless something intervenes. The hardware fix is a coherence protocol, and the canonical one is MESI.

Every cache line carries two state bits encoding one of four states:

Modified (M): this cache has the only copy, and it's dirty (different from RAM). It must be written back to RAM when it is evicted or when another core reads it.
Exclusive (E): this cache has the only copy, and it's clean (matches RAM). It can transition to M silently on a write, with no bus traffic.
Shared (S): the line is clean (matches RAM) and other caches may hold read-only copies too. A write requires invalidating those copies first.
Invalid (I): the line is not present (or has been invalidated by another core's write).
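The four states can be summarised by what each one permits locally. The sketch below is only an illustration: the two-bit values are an arbitrary encoding chosen for this example (real CPUs do not document theirs), and the helper names `read_hits`, `write_is_silent` and `needs_writeback` are made up here.

```python
from enum import Enum


class State(Enum):
    # Two-bit values are an arbitrary illustrative encoding, not any real CPU's.
    INVALID = 0b00
    SHARED = 0b01
    EXCLUSIVE = 0b10
    MODIFIED = 0b11


def read_hits(state):
    """A local read is served from the cache in every state except I."""
    return state is not State.INVALID


def write_is_silent(state):
    """A local write needs no bus traffic only when no other copy can exist."""
    return state in (State.EXCLUSIVE, State.MODIFIED)


def needs_writeback(state):
    """Only a dirty line holds data that RAM does not."""
    return state is State.MODIFIED


if __name__ == "__main__":
    for s in State:
        print(f"{s.name:<9} bits={s.value:02b} "
              f"read_hits={read_hits(s)} "
              f"write_is_silent={write_is_silent(s)} "
              f"needs_writeback={needs_writeback(s)}")
```

Note that S is the only valid state in which a write is not silent: the cache has the data but not the right to change it.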

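The protocol runs on a snoop bus (or a directory in larger systems). When a core wants to write a line it holds in S, it broadcasts an invalidation request on the bus; if it does not hold the line at all, the request is a Read-For-Ownership (RFO), which fetches the data and invalidates other copies in one step. Every other cache snoops, finds matching lines, and transitions them to I. The writer then moves to M and writes locally. When a core wants to read a line in I, it broadcasts a read. If no other cache holds the line, it is loaded from RAM in E. If other caches hold it in E or S, every copy ends up in S. If another cache has it in M, that cache supplies the data and downgrades M → S, writing the line back to RAM as well, since S copies must match RAM.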
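A toy model can make these transitions concrete. The following is a minimal sketch, not a description of any real hardware: it tracks a single cache line, treats the bus as a plain Python object, logs every write request as "RFO" for simplicity, and always writes dirty data back to RAM on an M → S downgrade. The `Cache` and `Bus` classes and all their methods are hypothetical names invented for this example.

```python
from enum import Enum


class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"


class Cache:
    """One core's copy of a single cache line (hypothetical model)."""

    def __init__(self, core_id, bus):
        self.core_id = core_id
        self.bus = bus
        self.state = State.I
        self.value = None

    def read(self):
        if self.state is State.I:
            self.bus.log.append(f"core{self.core_id}: read")
            holders = self.bus.other_holders(self)
            for other in holders:
                if other.state is State.M:
                    self.bus.memory = other.value  # dirty owner writes back
                other.state = State.S              # M/E/S -> S
            self.value = self.bus.memory
            # Alone -> E, otherwise every copy is S.
            self.state = State.S if holders else State.E
        return self.value  # M, E and S all hit locally

    def write(self, value):
        if self.state in (State.S, State.I):
            self.bus.log.append(f"core{self.core_id}: RFO")
            for other in self.bus.other_holders(self):
                if other.state is State.M:
                    self.bus.memory = other.value  # write back before dropping
                other.state = State.I
                other.value = None
        # From E the upgrade is silent: nothing is logged on the bus.
        self.state = State.M
        self.value = value


class Bus:
    """Hypothetical snoop bus connecting the caches to one word of RAM."""

    def __init__(self, n_caches, initial_value=0):
        self.memory = initial_value
        self.log = []
        self.caches = [Cache(i, self) for i in range(n_caches)]

    def other_holders(self, requester):
        return [c for c in self.caches
                if c is not requester and c.state is not State.I]


if __name__ == "__main__":
    bus = Bus(n_caches=2)
    c0, c1 = bus.caches
    steps = [("core0 read", c0.read), ("core0 write 42", lambda: c0.write(42)),
             ("core1 read", c1.read), ("core1 write 7", lambda: c1.write(7))]
    for label, action in steps:
        action()
        print(f"{label:<15} core0={c0.state.name} core1={c1.state.name}")
    print(bus.log)
```

In this run core 0 goes I → E → M without any write traffic on the bus, core 1's read forces M → S with a writeback, and core 1's write then invalidates core 0. At no point do two caches hold the line in M.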

Real-world example: The infamous false sharing performance bug. Two threads on different cores update independent variables that happen to share a 64-byte cache line. Coherence is tracked per line, not per variable, so each write finds the line invalidated by the other core and triggers an RFO, which in turn invalidates the other core's copy. The line ping-pongs between caches at bus speed instead of staying local. A loop that should run at L1 speed (~4 cycles per access) runs at coherence-miss speed (~100+ cycles). Padding the variables onto separate cache lines can give up to a 20–50× speedup in the worst case, where nearly every access was a coherence miss, with zero algorithmic change.
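Whether two variables share a line is purely a question of byte offsets. The sketch below uses Python's `ctypes` to lay out two C-style structs and checks which 64-byte line each field lands on. It illustrates the layout only and does not measure performance. The struct and field names are invented for this example, and it assumes the struct itself starts on a cache-line boundary.

```python
import ctypes

CACHE_LINE = 64  # bytes; assumed line size, matching the 64-byte figure above


class Packed(ctypes.Structure):
    # Two independent counters laid out back to back.
    _fields_ = [("counter_a", ctypes.c_uint64),
                ("counter_b", ctypes.c_uint64)]


class Padded(ctypes.Structure):
    # Same counters, with filler bytes pushing counter_b onto the next line.
    _fields_ = [("counter_a", ctypes.c_uint64),
                ("_pad", ctypes.c_uint8 * (CACHE_LINE - ctypes.sizeof(ctypes.c_uint64))),
                ("counter_b", ctypes.c_uint64)]


def line_of(struct_type, field_name):
    """Index of the cache line a field starts on, relative to the struct start."""
    return getattr(struct_type, field_name).offset // CACHE_LINE


def shares_line(struct_type, first, second):
    return line_of(struct_type, first) == line_of(struct_type, second)


if __name__ == "__main__":
    for struct_type in (Packed, Padded):
        print(struct_type.__name__,
              "counter_b offset:", struct_type.counter_b.offset,
              "shares a line:", shares_line(struct_type, "counter_a", "counter_b"))
```

The padded layout spends 56 extra bytes per pair of counters. That trade of memory for isolation is the whole fix: in C or C++ the same effect is usually obtained with an alignment attribute rather than a hand-written pad field.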

Rule of thumb: A coherence miss costs roughly 3–10× a normal L1 miss because it requires bus arbitration, a snoop response, and often a writeback. If your line is in M on another core, count on roughly 80–150 cycles to get it; if it's in S on N other cores, all N must acknowledge the invalidation before your write completes.
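These figures can be turned into a back-of-the-envelope estimate. The sketch below is a deliberately crude model: it takes the approximate 4-cycle L1 hit and 100-cycle coherence miss quoted in this article as constants, adds a hypothetical `miss_fraction` parameter, and ignores everything a real core does to hide latency (store buffers, out-of-order execution, prefetching).

```python
L1_HIT_CYCLES = 4            # approximate figure quoted in the article
COHERENCE_MISS_CYCLES = 100  # lower end of the article's "~100+" figure


def loop_cycles(n_accesses, miss_fraction,
                hit=L1_HIT_CYCLES, miss=COHERENCE_MISS_CYCLES):
    """Total cycles if miss_fraction of the accesses are coherence misses."""
    hits = n_accesses * (1 - miss_fraction)
    misses = n_accesses * miss_fraction
    return hits * hit + misses * miss


def slowdown(miss_fraction):
    """Cost relative to a loop whose accesses all hit in L1."""
    return loop_cycles(1, miss_fraction) / loop_cycles(1, 0)


if __name__ == "__main__":
    for fraction in (0.0, 0.1, 0.5, 1.0):
        print(f"miss fraction {fraction:.1f}: {slowdown(fraction):.1f}x slower")
```

Under these assumptions a loop in which every access is a coherence miss comes out 25 times slower than one that always hits in L1, and the penalty shrinks quickly as the miss fraction drops. Treat the output as an order-of-magnitude guide, not a prediction for any particular machine.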

Modern x86 extends MESI to MESIF (Intel, adds Forward for designating one sharer to respond) or MOESI (AMD, adds Owned for dirty-shared lines that don't need immediate writeback). The basic invariant is identical: at most one writer, or any number of readers, never both.

See it in action: Check out CPU Cache Write Policies (Write Through, Write Back, Write Allocate, No Write Allocate) by BitLemon to see this theory applied.

Key Takeaway: MESI enforces single-writer/multi-reader at cache-line granularity by snooping a bus, and false sharing is what happens when software accidentally weaponizes that mechanism against itself.
