2026-05-31
When a set-associative cache needs to bring in a new line, it has to evict one. Which one? That choice — the replacement policy — quietly determines whether your working set fits or thrashes. True LRU sounds obvious, but real CPUs almost never implement it.
Why true LRU is too expensive. For a 16-way set, true LRU needs to maintain a total ordering of 16 ways on every access. That's roughly log2(16!) ≈ 44 bits of state per set, plus update logic on every hit. Multiply by thousands of sets and you've burned area and power tracking something the cache doesn't even need to be exactly right about.
Pseudo-LRU (PLRU) — the workhorse. A binary tree of 1-bit "which half was used more recently" flags. For 16 ways you only need 15 bits per set. On access, walk the tree flipping bits to point away from the accessed way. On eviction, follow the bits to a victim. Cheap, decent — and what Intel used in L1/L2 caches for decades.
The scan problem. Both LRU and PLRU assume recently-used means soon-to-be-used. Then someone runs memcpy over a 100MB array. Every line is touched exactly once, every line marked MRU, and your useful working set gets flushed out the back. LRU is provably terrible for any access pattern larger than the cache.
RRIP (Re-Reference Interval Prediction). Intel adopted this for L3 around Ivy Bridge. Each line gets a small counter (typically 2 bits) predicting how soon it'll be re-referenced: 0 = "near-immediate," 3 = "distant future." On insertion, new lines get value 2 (not 0!) — meaning "I don't trust you yet, prove you're useful." On hit, the counter drops to 0. On eviction, find a line with value 3; if none exists, increment everyone and search again.
This one trick — inserting at "distant" instead of "recent" — means a streaming scan can't evict your hot data. The streamed lines get inserted at 2, never hit again, age to 3, and get evicted before touching your real working set. SHiP and DRRIP extend this with PC-based or set-dueling adaptivity.
Rule of thumb: If your working set fits in cache, PLRU and RRIP perform nearly identically. If it's between 1× and 2× cache size, RRIP wins by 10–30% miss rate. If it's massive and streaming, RRIP behaves almost like cache bypass while LRU collapses.
Real-world example: Database hash joins on tables larger than L3. With LRU/PLRU, the probe phase's sequential scan over the build-side hash table evicts the hash table itself. RRIP-style insertion preserves the frequently-probed buckets — measurable 15–25% throughput gains on TPC-H queries with large joins.
