2026-06-01
Reads from a cache are conceptually simple: the line is either there or it isn't. Writes are where cache designers have to make hard choices, because a write changes state, and that change has to propagate somewhere eventually. The policy a cache picks determines its bandwidth, its coherence cost, and how it behaves when it misses.
Write-through sends every store to the next level of the hierarchy immediately. The cache holds a copy, but the source of truth is downstream. This is brutally simple: no dirty bits, no eviction surprises, and coherence is easier because the next level is always current. The price is bandwidth: every store consumes write bandwidth at the next level. Some L1 data caches (the Intel Pentium 4, the AMD Bulldozer family, certain ARM cores) are write-through, usually backed by a small write buffer that absorbs and coalesces bursts before they hit L2.
Write-back updates only the local cache line and marks it dirty. The write to the next level is deferred until the line is evicted, or until a coherence request or explicit flush forces it out. This is what nearly every modern x86 L1/L2/L3 uses for normal (write-back cacheable) memory. Bandwidth to lower levels drops dramatically: a hot variable updated a million times causes one downstream write, not a million. The cost is complexity: every line needs a dirty bit, every eviction needs to check it, and cache coherence protocols (MESI's M state exists precisely for this) have to track which cache owns the dirty copy.
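The hot-variable claim is easy to check with a toy model. The sketch below is not a model of any real cache: it assumes a tiny fully associative LRU cache with 64-byte lines and write-allocate, and the function name `downstream_writes` and its parameters are made up for illustration. It only counts how many writes reach the next level under each policy.

```python
from collections import OrderedDict

LINE_SIZE = 64  # bytes per cache line (assumed)

def downstream_writes(addresses, policy, capacity_lines=4):
    """Count writes sent to the next level for a stream of store addresses.

    policy is "write-through" or "write-back"; the cache is a tiny
    fully associative LRU model that allocates a line on a write miss.
    """
    cache = OrderedDict()  # line number -> dirty flag, kept in LRU order
    writes = 0
    for addr in addresses:
        line = addr // LINE_SIZE
        if line in cache:
            cache.move_to_end(line)  # hit: refresh LRU position
        else:
            if len(cache) >= capacity_lines:
                _, dirty = cache.popitem(last=False)  # evict the LRU line
                if dirty:
                    writes += 1  # write-back pays only at eviction
            cache[line] = False
        if policy == "write-through":
            writes += 1  # every store goes downstream immediately
        elif policy == "write-back":
            cache[line] = True  # just mark the line dirty
        else:
            raise ValueError(f"unknown policy: {policy}")
    # Dirty lines still resident at the end must eventually be written out.
    writes += sum(1 for dirty in cache.values() if dirty)
    return writes

hot_variable = [0x1000] * 1_000_000  # one address stored to a million times
wt = downstream_writes(hot_variable, "write-through")
wb = downstream_writes(hot_variable, "write-back")
print(wt, wb)
```

With a single hot address, the write-through count equals the number of stores, while the write-back count is the one final flush of the dirty line.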
Orthogonal to that choice is what happens on a write miss. Write-allocate (commonly implemented as fetch-on-write) loads the line into the cache first, then writes. No-write-allocate (also called write-around) sends the store directly to the next level and skips caching it. Write-back caches almost always pair with write-allocate (you need the line in cache to mark it dirty). Write-through caches sometimes use no-write-allocate to avoid polluting the cache with write-only data.
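The pollution argument can be illustrated with another toy model. This is a sketch under stated assumptions: a fully associative LRU cache of 8 lines, a hypothetical trace format of ("r" or "w", line number) pairs, and an invented function name `read_misses`. It counts read misses when a read-hot working set is interleaved with a stream of lines that are written once and never read.

```python
from collections import OrderedDict

def read_misses(trace, write_allocate, capacity_lines=8):
    """Count read misses for a trace of ("r" | "w", line) accesses.

    Fully associative LRU model. On a write miss, write-allocate
    installs the line; no-write-allocate leaves the cache untouched
    and (conceptually) forwards the store to the next level.
    """
    cache = OrderedDict()  # line number -> None, kept in LRU order
    misses = 0
    for op, line in trace:
        if line in cache:
            cache.move_to_end(line)  # hit: refresh LRU position
            continue
        if op == "r":
            misses += 1
        elif not write_allocate:
            continue  # write miss bypasses the cache entirely
        if len(cache) >= capacity_lines:
            cache.popitem(last=False)  # evict the least recently used line
        cache[line] = None
    return misses

# Hypothetical workload: 8 read-hot lines, interleaved with a
# write-only stream of lines that are never read back.
trace = []
for i in range(1000):
    trace.append(("r", i % 8))
    trace.append(("w", 1000 + i))

with_allocate = read_misses(trace, write_allocate=True)
without_allocate = read_misses(trace, write_allocate=False)
print(with_allocate, without_allocate)
```

In this deliberately adversarial trace the write-only lines push the hot set out before it is reused, so the write-allocate run misses on every read, while the no-write-allocate run misses only on the first touch of each hot line. Real workloads sit somewhere between these extremes.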
Concrete example: a tight loop zeroing a 1 GB buffer. With write-allocate, every cache line is fetched from memory just so it can be overwritten, then written back on eviction, which doubles memory traffic for no benefit. This is why x86 has the MOVNT family of non-temporal stores (MOVNTI, MOVNTDQ, MOVNTPS and relatives), which go through write-combining buffers instead of allocating lines in the cache, and why memset implementations switch to streaming stores past a size threshold. Measured on AMD64: roughly a 50% bandwidth gain on large zeroing with non-temporal stores.
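The "doubling" is simple bookkeeping, sketched below. The model assumes 64-byte lines, a buffer much larger than the cache so every line is eventually evicted, and stores that cover whole lines; `zeroing_traffic` is an illustrative name, not a real API. It counts idealized DRAM traffic only and says nothing about achieved throughput.

```python
LINE_SIZE = 64  # bytes per cache line (assumed)

def zeroing_traffic(buffer_bytes, streaming):
    """Return (bytes_read, bytes_written) at DRAM for zeroing a buffer.

    Assumes the buffer is far larger than the cache, so every line is
    eventually evicted, and that stores cover whole lines.
    """
    lines = buffer_bytes // LINE_SIZE
    if streaming:
        # Non-temporal stores: full lines are written without a fill.
        return 0, lines * LINE_SIZE
    # Write-allocate + write-back: fill each line, then evict it dirty.
    return lines * LINE_SIZE, lines * LINE_SIZE

GIB = 1 << 30
cached = zeroing_traffic(GIB, streaming=False)
streamed = zeroing_traffic(GIB, streaming=True)
ratio = sum(cached) / sum(streamed)
print(cached, streamed, ratio)
```

Under these assumptions the cached path moves twice as many bytes as the streaming path. Measured speedups are typically smaller than 2x because reads and writes do not cost the same on a real memory system.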
Rule of thumb: dirty-line write-back traffic ≈ store rate × dirty evictions per store × line size. For a workload issuing 10^9 stores per second into 64-byte lines, where 10% of stores eventually cause a dirty eviction, that's ~6.4 GB/s of write-back bandwidth, easily a meaningful fraction of DRAM bandwidth on a memory-bound workload.
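The rule of thumb is a one-line product, so it is easy to plug in other numbers. The helper below is a sketch; the function and parameter names are illustrative, and the inputs are the example figures from the text rather than measurements.

```python
def writeback_bandwidth(stores_per_sec, dirty_evictions_per_store, line_bytes=64):
    """Estimate write-back traffic in bytes per second."""
    return stores_per_sec * dirty_evictions_per_store * line_bytes

# Example from the text: 1e9 stores/s, 10% cause a dirty eviction, 64 B lines.
bw = writeback_bandwidth(1e9, 0.10)
print(f"{bw / 1e9:.1f} GB/s")  # prints "6.4 GB/s"
```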
