2026-04-27
You already know caches exist. Today we dig into why there are multiple levels and what engineering constraints force each level to be designed so differently.
L1: Speed at all costs. The L1 cache sits on the critical path of every load, so its load-to-use latency has to stay at about 4–5 cycles; anything slower stalls dependent instructions almost immediately. On a 5 GHz core one cycle is 200 picoseconds, so that whole budget is roughly 1 ns. This brutally constrains its size: typically 32–48 KB for data and 32–64 KB for instructions. Why split them? Because the fetch unit and the load/store unit both need access every cycle, and a unified array with enough ports and capacity to serve both would be too slow. L1 is almost always set-associative (8-way or 12-way) to reduce conflict misses, with a line size of 64 bytes. Apple's M-series pushed L1D to 128 KB, helped by 16 KB pages (which let a virtually indexed, physically tagged cache grow without adding ways) and by a lower clock target than x86 designs.
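To see how set associativity and the 64-byte line interact, here is a minimal sketch of how a cache might split an address into tag, set index, and line offset. The function name `split_address` and the 32 KB, 8-way default geometry are illustrative assumptions, and the sketch assumes power-of-two set counts and line sizes; real L1 designs also have to deal with virtual versus physical indexing, which is ignored here.

```python
def split_address(addr, cache_bytes=32 * 1024, ways=8, line_bytes=64):
    """Split a byte address into (tag, set_index, line_offset).

    Hypothetical helper for illustration; assumes the number of sets
    and the line size are both powers of two.
    """
    num_sets = cache_bytes // (ways * line_bytes)  # 32 KB / (8 * 64 B) = 64 sets
    offset_bits = line_bytes.bit_length() - 1      # log2(64) = 6
    index_bits = num_sets.bit_length() - 1         # log2(64) = 6
    line_offset = addr & (line_bytes - 1)
    set_index = (addr >> offset_bits) & (num_sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, set_index, line_offset


a = 0x12345
b = a + 4096  # 64 sets * 64 B = 4 KB, so b lands in the same set as a

print(split_address(a))  # (18, 13, 5)
print(split_address(b))  # (19, 13, 5)
```

Addresses 4 KB apart share a set index and differ only in the tag. With 8 ways, eight such lines can coexist before one has to be evicted, which is the conflict-miss pressure that associativity is there to relieve.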
L2: The bandwidth bridge. L2 sits per-core (in modern designs) and is typically 256 KB–2 MB. It runs at core clock but tolerates a latency of roughly 12–16 cycles. This extra time budget lets designers use denser SRAM cells and higher associativity (8- to 16-way is common). L2's primary job is to absorb L1 misses before they hit the shared interconnect. AMD's Zen 4 uses a 1 MB L2 per core; Intel's Golden Cove uses 1.25 MB on client parts and 2 MB on server parts. The trend is upward because working sets keep growing.
L3 (LLC): Shared, and historically inclusive. L3 is shared across cores (all cores on the die for most Intel parts, all cores in a CCX for AMD) and is often 16–96 MB on server parts. Latency is 30–50+ cycles. It's typically organized as a distributed slice-per-core architecture connected via a ring bus or mesh. AMD's V-Cache stacks extra L3 via 3D packaging, hitting 96 MB on a CCD, a trick reported to boost gaming performance by 15–25%, largely from reduced DRAM accesses.
The key rule of thumb: each cache level is roughly 8–10× larger and 3–5× slower than the one above it. For a 5 GHz core, one cycle is 0.2 ns, so a 12-cycle L2 hit costs about 2.4 ns and a 40-cycle L3 hit about 8 ns.
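To make the cycle-to-time conversion concrete, here is a minimal sketch. The helper name `cycles_to_ns` is made up for illustration, and the 12- and 40-cycle figures are simply representative L2 and L3 latencies taken from the ranges in this post, not measurements of any particular chip.

```python
def cycles_to_ns(cycles, clock_ghz=5.0):
    """Convert a latency in core cycles to nanoseconds.

    At f GHz one cycle lasts 1/f ns, so latency_ns = cycles / f.
    """
    return cycles / clock_ghz


# Representative latencies (assumed for illustration, in core cycles).
latencies = {"L2": 12, "L3": 40}

for level, cycles in latencies.items():
    print(f"{level}: {cycles} cycles = {cycles_to_ns(cycles):.1f} ns")
# L2: 12 cycles = 2.4 ns
# L3: 40 cycles = 8.0 ns
```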
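Inclusion policies matter. An inclusive L3 guarantees that anything in L1/L2 is also in L3, which simplifies coherence snoops because you only check L3's tags. But it wastes capacity (L1/L2 contents are duplicated in L3), and evicting a line from L3 forces it out of the inner caches as well. A non-inclusive, non-exclusive (NINE) policy, used by Intel's server parts since Skylake-SP, avoids the duplication but requires a separate snoop filter to track which core might hold a line. AMD's Zen L3 goes a step further: it is a victim cache filled by L2 evictions, so its contents are mostly exclusive of L2.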
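The capacity cost of inclusion is easy to put numbers on. The sketch below is a back-of-the-envelope model, not a description of any real chip: the function name `unique_capacity_mb`, the policy labels, and the 8-core, 1 MB L2, 32 MB L3 example are all assumptions, and L1 is ignored because it is small by comparison.

```python
def unique_capacity_mb(l3_mb, l2_mb_per_core, cores, policy):
    """Return (min, max) MB of distinct data that L2 + L3 can hold together.

    Hypothetical model for illustration only:
      inclusive: every L2 line also lives in L3, so L3 size is the ceiling.
      exclusive: no line is in both, so the capacities add.
      nine:      duplication is allowed but not required, so the answer
                 falls somewhere between the two extremes.
    """
    l2_total = l2_mb_per_core * cores
    if policy == "inclusive":
        return (l3_mb, l3_mb)
    if policy == "exclusive":
        return (l3_mb + l2_total, l3_mb + l2_total)
    if policy == "nine":
        return (l3_mb, l3_mb + l2_total)
    raise ValueError(f"unknown policy: {policy}")


for policy in ("inclusive", "nine", "exclusive"):
    print(policy, unique_capacity_mb(32, 1, 8, policy))
# inclusive (32, 32)
# nine (32, 40)
# exclusive (40, 40)
```

In this example, inclusion gives up as much as 8 MB of the 40 MB of SRAM on the die, and the gap widens as per-core L2 grows relative to L3.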
Concrete calculation: If your working set is 3 MB and your per-core L2 is 1 MB, then, assuming accesses are spread uniformly over the working set, about 2/3 of accesses that miss L1 will also miss L2 and go to L3. If L3 latency is 40 cycles vs L2's 12, each of those misses costs 3.3× as long as an L2 hit, which is exactly why tuning data structures to fit in L2 (loop tiling, struct packing) gives measurable speedups.
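The same arithmetic can be written down as a small sketch. It assumes uniformly random accesses over the working set, that the whole working set fits in L3, and the 12- and 40-cycle latencies used above; the function names are made up for illustration. Real access patterns have locality, so treat the result as a rough estimate.

```python
def l2_miss_fraction(working_set_mb, l2_mb):
    """Fraction of L1 misses that also miss L2, under uniform random access."""
    if working_set_mb <= l2_mb:
        return 0.0  # everything fits in L2
    return 1.0 - l2_mb / working_set_mb


def avg_l1_miss_latency(working_set_mb, l2_mb, l2_cycles=12, l3_cycles=40):
    """Average cycles to service an L1 miss, assuming L3 holds the rest."""
    miss = l2_miss_fraction(working_set_mb, l2_mb)
    return (1.0 - miss) * l2_cycles + miss * l3_cycles


miss = l2_miss_fraction(3, 1)
avg = avg_l1_miss_latency(3, 1)
print(f"L2 miss fraction: {miss:.2f}")        # L2 miss fraction: 0.67
print(f"average L1-miss cost: {avg:.1f} cycles")  # average L1-miss cost: 30.7 cycles
```

Shrinking the working set to 1 MB drops the average back to 12 cycles in this model, which is the kind of gain that tiling and struct packing are after.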
