Daily Hardware Architecture: Cache Inclusion Policies: Why L2 Sometimes Duplicates L1 and Sometimes Doesn't

2026-05-31

When you have a multi-level cache hierarchy, you face a question that sounds simple but has enormous consequences: if a line is in L1, must it also be in L2? The answer defines your inclusion policy, and it shapes everything from coherence traffic to effective cache capacity.

There are three options:

Inclusive: Every line in L1 must also be in L2. L2's contents are a superset of L1's.
Exclusive: A line lives in exactly one level. L1 and L2 never overlap.
NINE (Non-Inclusive Non-Exclusive): No guarantee either way. Lines may or may not be duplicated.
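
To make the three definitions concrete, here is a minimal Python sketch that treats each cache as a plain set of line addresses and reports which invariants a given snapshot satisfies. The function name satisfied_policies and the set-of-integers model are illustrative assumptions, not how hardware tracks tags.

```python
def satisfied_policies(l1: set, l2: set) -> list:
    """Return the inclusion invariants that a snapshot of cache contents satisfies.

    l1 and l2 are sets of line addresses (a hypothetical, simplified model).
    """
    policies = []
    if l1 <= l2:
        # Inclusive: every L1 line is also present in L2.
        policies.append("inclusive")
    if l1.isdisjoint(l2):
        # Exclusive: no line is present in both levels.
        policies.append("exclusive")
    # NINE promises nothing, so every snapshot is legal under it.
    policies.append("NINE")
    return policies


if __name__ == "__main__":
    print(satisfied_policies({1, 2}, {1, 2, 3}))
    print(satisfied_policies({1, 2}, {3, 4}))
    print(satisfied_policies({1, 2}, {2, 3}))
```

The third snapshot, with partial overlap, is legal only under NINE. That is the point of the policy: it is defined by the absence of a guarantee rather than by a rule.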

Inclusive (Intel's traditional choice for its shared L3, on server parts through Broadwell): Snoops from other cores only need to check L2. If a line isn't in L2, it can't be in L1 either. This makes coherence cheap, but it costs you. Your effective capacity is just L2's size, because L1's contents are duplicated. Worse, when L2 evicts a line, it must back-invalidate L1, even if L1 was using that line heavily. L1 hits never reach L2, so a line that is hot in L1 can look cold to L2's replacement policy and get picked as the victim. The result: a hot L1 line gets nuked because L2 chose its victim poorly.
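
The back-invalidation problem is easy to reproduce in a toy model. The sketch below assumes fully associative caches with LRU replacement, and assumes that L1 hits do not update L2's recency information. The class name InclusiveHierarchy and the tiny capacities are hypothetical and chosen only to keep the trace short.

```python
from collections import OrderedDict


class InclusiveHierarchy:
    """Toy two-level inclusive cache: fully associative, LRU at both levels."""

    def __init__(self, l1_lines, l2_lines):
        self.l1 = OrderedDict()  # oldest entry first
        self.l2 = OrderedDict()
        self.l1_lines = l1_lines
        self.l2_lines = l2_lines
        self.back_invalidations = 0

    def access(self, addr):
        """Return 'L1', 'L2', or 'MEM' depending on where the line was found."""
        if addr in self.l1:
            # L1 hit: L2 never sees it, so L2's LRU order is not refreshed.
            self.l1.move_to_end(addr)
            return "L1"
        if addr in self.l2:
            self.l2.move_to_end(addr)
            where = "L2"
        else:
            where = "MEM"
            if len(self.l2) >= self.l2_lines:
                victim, _ = self.l2.popitem(last=False)
                if victim in self.l1:
                    # Inclusion forces L1 to give up the line as well.
                    del self.l1[victim]
                    self.back_invalidations += 1
            self.l2[addr] = True
        if len(self.l1) >= self.l1_lines:
            # The L1 victim is simply dropped; L2 still holds a copy.
            self.l1.popitem(last=False)
        self.l1[addr] = True
        return where


if __name__ == "__main__":
    h = InclusiveHierarchy(l1_lines=2, l2_lines=4)
    # "A" is reused constantly while B..E stream through.
    for addr in ["A", "B", "A", "C", "A", "D", "A", "E", "A"]:
        print(addr, h.access(addr))
    print("back-invalidations:", h.back_invalidations)
```

In this trace "A" hits in L1 three times, yet it is the oldest line from L2's point of view. When "E" arrives, L2 evicts "A", L1 loses it too, and the final access to "A" goes all the way to memory.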

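Exclusive (AMD's traditional choice: an exclusive L2 in the K7/K8 era, and a mostly exclusive victim L3 in Zen through Zen 2): Effective capacity = L1 + L2. Great for capacity, but every L1 miss that hits L2 requires a swap: the L2 line moves to L1, and the L1 victim moves to L2. That swap costs extra bandwidth between the levels. Snoops must check both levels, because a line found in neither L2 nor a filter could still be sitting in L1.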
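
The swap described above can be sketched the same way. This toy model again assumes fully associative LRU caches. The ExclusiveHierarchy name and its swaps counter are hypothetical, introduced only for illustration.

```python
from collections import OrderedDict


class ExclusiveHierarchy:
    """Toy two-level exclusive cache: a line lives in L1 or L2, never both."""

    def __init__(self, l1_lines, l2_lines):
        self.l1 = OrderedDict()  # oldest entry first
        self.l2 = OrderedDict()
        self.l1_lines = l1_lines
        self.l2_lines = l2_lines
        self.swaps = 0

    def access(self, addr):
        """Return 'L1', 'L2', or 'MEM' depending on where the line was found."""
        if addr in self.l1:
            self.l1.move_to_end(addr)
            return "L1"
        if addr in self.l2:
            # The line leaves L2 as it is promoted to L1.
            del self.l2[addr]
            where = "L2"
        else:
            where = "MEM"
        if len(self.l1) >= self.l1_lines:
            # The L1 victim is demoted into L2 instead of being dropped.
            victim, _ = self.l1.popitem(last=False)
            if len(self.l2) >= self.l2_lines:
                self.l2.popitem(last=False)  # leaves the hierarchy entirely
            self.l2[victim] = True
            if where == "L2":
                self.swaps += 1  # promotion and demotion happened together
        self.l1[addr] = True
        return where


if __name__ == "__main__":
    h = ExclusiveHierarchy(l1_lines=2, l2_lines=2)
    for addr in ["A", "B", "C", "D", "A"]:
        print(addr, h.access(addr))
    print("distinct lines cached:", len(set(h.l1) | set(h.l2)))
    print("swaps:", h.swaps)
```

After the trace, the hierarchy holds four distinct lines with only 2 + 2 entries of storage, which is the capacity benefit. The final access to "A" is the swap case: "A" moves up from L2 while the L1 victim moves down.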

NINE (Intel Skylake-X onward for the L3, most ARM big cores): The pragmatic middle. The outer cache doesn't promise to hold the inner cache's contents, but it doesn't force them out either: when the outer level evicts a line, the inner level may keep its copy. Without inclusion, you need a separate snoop filter (often called a coherence directory) to track which lines the inner caches hold. Skylake-X's massive 1MB L2 made inclusion too expensive, because an inclusive L3 would have had to duplicate every core's 1MB L2 and waste die area, so Intel switched.
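
A snoop filter can be pictured as a map from line address to the set of cores that may hold the line privately. The sketch below is a simplified assumption: the SnoopFilter class and its method names are hypothetical, and the dictionary is unbounded, whereas a real filter has finite capacity and must handle its own evictions.

```python
from collections import defaultdict


class SnoopFilter:
    """Toy directory: which cores may hold each line in their private caches."""

    def __init__(self):
        self.sharers = defaultdict(set)  # line address -> set of core ids

    def record_fill(self, core, addr):
        """A core brought the line into its private cache."""
        self.sharers[addr].add(core)

    def record_evict(self, core, addr):
        """A core dropped the line from its private cache."""
        self.sharers[addr].discard(core)
        if not self.sharers[addr]:
            del self.sharers[addr]  # no sharers left, free the entry

    def cores_to_probe(self, requester, addr):
        """Cores that must be snooped when `requester` asks for `addr`."""
        return sorted(self.sharers.get(addr, set()) - {requester})


if __name__ == "__main__":
    sf = SnoopFilter()
    sf.record_fill(core=0, addr=0x40)
    sf.record_fill(core=2, addr=0x40)
    print(sf.cores_to_probe(requester=1, addr=0x40))  # cores 0 and 2 hold it
    print(sf.cores_to_probe(requester=1, addr=0x80))  # nobody holds it
```

The filter answers the same question an inclusive cache answers for free ("could any private cache hold this line?"), but it stores only tags and sharer information rather than a duplicate copy of the data.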

Concrete example: On Haswell (inclusive L3), a single core thrashing 6MB of data could evict useful lines from other cores' L1/L2 caches via back-invalidation, because L3 evictions cascade upward. On Skylake-X (NINE L3), this cross-core interference largely disappears.
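
The cross-core effect can be illustrated with a small simulation. This is a sketch under strong assumptions: one private cache per core, one shared last-level cache, fully associative LRU everywhere, and capacities of a few lines instead of megabytes. The function name simulate and the trace format are hypothetical.

```python
from collections import OrderedDict


def simulate(trace, shared_lines, private_lines, inclusive):
    """Count private-cache misses per core for a trace of (core, addr) pairs.

    With inclusive=True, a shared-cache eviction back-invalidates every
    private copy. With inclusive=False (NINE-like), private copies survive.
    """
    shared = OrderedDict()
    private = {}  # core id -> OrderedDict of lines
    misses = {}   # core id -> private-cache miss count
    for core, addr in trace:
        cache = private.setdefault(core, OrderedDict())
        misses.setdefault(core, 0)
        if addr in cache:
            cache.move_to_end(addr)  # private hit; shared cache is not touched
            continue
        misses[core] += 1
        if addr in shared:
            shared.move_to_end(addr)
        else:
            if len(shared) >= shared_lines:
                victim, _ = shared.popitem(last=False)
                if inclusive:
                    for other in private.values():
                        other.pop(victim, None)  # back-invalidation
            shared[addr] = True
        if len(cache) >= private_lines:
            cache.popitem(last=False)
        cache[addr] = True
    return misses


if __name__ == "__main__":
    # Core 0 reuses two hot lines; core 1 streams through new lines.
    trace = []
    for i in range(8):
        trace += [(0, "H0"), (0, "H1"), (1, f"S{i}")]
    print("inclusive:", simulate(trace, 4, 2, inclusive=True))
    print("NINE-like:", simulate(trace, 4, 2, inclusive=False))
```

Core 0's working set fits in its private cache in both runs. In the NINE-like run it only takes its two cold misses, while in the inclusive run core 1's streaming pushes the hot lines out of the shared cache and core 0 keeps missing on data it never stopped using.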

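Rule of thumb: If the outer cache is less than ~4× the size of the inner cache it must include, inclusion wastes too much capacity to justify the coherence simplicity. Intel's switch illustrates this. When L2 grew from 256KB (8× L1's 32KB) to 1MB (32× L1), keeping L1's contents duplicated in L2 stayed cheap. The L3 went the other way: it shrank from 2.5MB per core on Broadwell server parts to 1.375MB per core on Skylake-X, only about 1.4× the new L2. At that ratio most of the L3 would have been spent mirroring L2, and inclusion at L3 became indefensible.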
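
The arithmetic behind the rule of thumb is short enough to write down. The sketch below computes what share of an outer cache is spent mirroring the inner one under strict inclusion. The function names and the 4× default are illustrative, and the 1.375MB per-core L3 figure for Skylake-X is an assumption in case it is not stated in the surrounding text.

```python
KB = 1024
MB = 1024 * KB


def duplicated_fraction(inner_bytes, outer_bytes):
    """Share of the outer cache that mirrors the inner cache under inclusion."""
    return inner_bytes / outer_bytes


def inclusion_looks_wasteful(inner_bytes, outer_bytes, min_ratio=4):
    """Apply the ~4x rule of thumb: outer should be at least min_ratio x inner."""
    return outer_bytes / inner_bytes < min_ratio


# (inner size, outer size) pairs; the last entry uses an assumed per-core L3 size.
configs = {
    "32KB L1 inside 256KB L2": (32 * KB, 256 * KB),
    "32KB L1 inside 1MB L2": (32 * KB, 1 * MB),
    "1MB L2 inside 1.375MB L3 per core": (1 * MB, 1.375 * MB),
}

if __name__ == "__main__":
    for name, (inner, outer) in configs.items():
        frac = duplicated_fraction(inner, outer)
        verdict = "wasteful" if inclusion_looks_wasteful(inner, outer) else "acceptable"
        print(f"{name}: {frac:.1%} duplicated, {verdict}")
```

The first two configurations duplicate an eighth or less of the outer cache. The third would spend well over half of the L3 on copies of L2 data, which is the case where inclusion stops making sense.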

The hidden cost everyone forgets: back-invalidations show up as L1 misses in your profiler with no obvious cause. If you see mysterious L1 misses on inclusive hardware, suspect another core's L2/L3 activity.

Key Takeaway: Inclusion trades cache capacity for cheap coherence checks, and modern CPUs increasingly abandon it because snoop filters give you the coherence benefits without sacrificing the capacity.
