Daily Hardware Architecture: The Victim TLB: How CPUs Catch Translations the Main TLB Just Evicted


2026-06-06

The main TLB is small and brutally fast: an L1 DTLB typically holds around 64 entries and is looked up in parallel with the L1 data cache access, within the same few cycles. When it evicts a translation, getting that translation back is expensive: an L2 TLB lookup (12+ cycles) or, worse, a full page walk (up to four dependent memory accesses with 4-level paging on x86-64, which can cost hundreds of cycles when the page-table entries miss in the caches). The victim TLB is a small fully associative buffer that catches evicted L1 TLB entries before they fall to L2, giving thrashing workloads a second chance at a near-zero-latency hit.

The design mirrors the victim cache: when an L1 TLB miss happens, the lookup checks the victim TLB in parallel with the L2 TLB. A hit promotes the entry back to L1 TLB (swapping with the entry being evicted) and completes in ~2 extra cycles instead of 12+. The L1 TLB stays small and fast; the victim TLB absorbs conflict misses without bloating critical-path logic.
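The lookup-and-promote path described above can be sketched as a toy model. This is only an illustration, assuming a fully associative L1 with true LRU replacement and an oldest-first victim buffer; the class name `TwoLevelTLB` and the `walk` callback (a stand-in for the L2 TLB and page walker) are hypothetical, and real hardware performs these checks in parallel rather than one after another.

```python
from collections import OrderedDict


class TwoLevelTLB:
    """Toy model: fully associative LRU L1 TLB plus a small victim TLB."""

    def __init__(self, l1_entries=64, victim_entries=8):
        self.l1_entries = l1_entries
        self.victim_entries = victim_entries
        self.l1 = OrderedDict()      # vpn -> pfn, least recently used first
        self.victim = OrderedDict()  # vpn -> pfn, oldest eviction first

    def _fill_l1(self, vpn, pfn):
        """Insert into L1; the displaced LRU entry moves to the victim TLB."""
        if len(self.l1) >= self.l1_entries:
            old_vpn, old_pfn = self.l1.popitem(last=False)
            if self.victim_entries > 0:
                if len(self.victim) >= self.victim_entries:
                    self.victim.popitem(last=False)  # drop the oldest victim
                self.victim[old_vpn] = old_pfn
        self.l1[vpn] = pfn

    def lookup(self, vpn, walk):
        """Return (pfn, source), where source is 'l1', 'victim', or 'walk'."""
        if vpn in self.l1:
            self.l1.move_to_end(vpn)  # refresh LRU position
            return self.l1[vpn], "l1"
        if vpn in self.victim:
            pfn = self.victim.pop(vpn)  # promote: leave the victim TLB...
            self._fill_l1(vpn, pfn)     # ...and swap with L1's LRU entry
            return pfn, "victim"
        pfn = walk(vpn)                 # stand-in for L2 TLB / page walk
        self._fill_l1(vpn, pfn)
        return pfn, "walk"


# Tiny configuration so the behaviour is easy to follow by hand.
tlb = TwoLevelTLB(l1_entries=2, victim_entries=1)
sources = [tlb.lookup(v, lambda vpn: vpn + 0x1000)[1] for v in [1, 2, 3, 1, 2, 3]]
print(sources)
```

With a 2-entry L1 and a 1-entry victim, the first pass over three pages walks every time, and the second pass is served entirely from the victim buffer: each hit swaps the requested entry into L1 and pushes the next-needed one into the victim.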

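Why this matters specifically for TLBs: TLB working sets are weirdly shaped. A program cycling through 65 pages in a tight loop will thrash a 64-entry fully associative LRU TLB constantly: each access evicts exactly the translation that will be needed next, so every access misses. With even an 8-entry victim TLB, the just-evicted translation is still sitting in the victim when it is needed and gets promoted back on each access. At the ~2-cycle versus 12+ cycle figures above, that is roughly a 6× latency reduction on the recurring miss, and more when the alternative is a page walk.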
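To put numbers on the 65-page loop, here is a back-of-the-envelope cost model. It is a sketch that assumes the penalties quoted earlier in this article (about 2 extra cycles for a victim hit, 12 for an L2 TLB hit at the low end of "12+") and a fully associative LRU L1, under which a cyclic loop over 65 pages misses on every access. The function name and parameters are made up for illustration.

```python
def extra_cycles_per_access(l1_miss_rate, victim_hit_share,
                            victim_penalty=2, l2_penalty=12):
    """Average extra translation cycles per access, beyond an L1 TLB hit.

    victim_hit_share is the fraction of L1 misses served by the victim TLB;
    the remainder is assumed to hit in the L2 TLB (no page walks modelled).
    """
    victim_part = l1_miss_rate * victim_hit_share * victim_penalty
    l2_part = l1_miss_rate * (1.0 - victim_hit_share) * l2_penalty
    return victim_part + l2_part


# 65 pages cycled through a 64-entry LRU TLB: every access misses L1.
without_victim = extra_cycles_per_access(1.0, 0.0)  # all misses go to L2
with_victim = extra_cycles_per_access(1.0, 1.0)     # all misses hit the victim
print(without_victim, with_victim, without_victim / with_victim)
```

At the 12-cycle lower bound the ratio works out to 6×; an L2 TLB penalty of around 20 cycles would give 10×.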

Real-world example: AMD's Zen 2 and Zen 3 use a 64-entry L1 DTLB backed by a 2048-entry L2 DTLB (Zen 4 enlarges both levels). Some Arm cores (Cortex-A78, Neoverse N2) are described as adding a small fully associative "micro-TLB" tier between their own L1 and L2 TLBs that functions as a victim buffer for the main L1 DTLB, though in Arm's documentation "micro-TLB" more commonly names the small first-level TLB itself. On database workloads with hash tables spanning ~200 4KB pages, such a tier can cut dTLB-load-miss penalties by as much as 15-25% of total miss latency, because the hot translations bounce between L1 and the victim tier instead of going to L2.

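Rule of thumb: a victim TLB is worth it when the working set only slightly exceeds the L1 TLB, roughly 1.0×-1.5× its size. Below 1.0× there's nothing to catch; well above that the victim TLB also thrashes and you need more L2 TLB capacity instead. In the worst case of a strictly cyclic access pattern under LRU, the victim only helps while the working set fits in L1 plus victim entries, so for a 64-entry L1 an 8-16 entry victim fully covers 65-80 pages (up to 1.25×); workloads with more locality keep benefiting somewhat further out.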
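The edge of the sweet spot is easy to probe with a small simulation. The sketch below assumes the harshest case, a strictly cyclic page loop over a fully associative LRU L1 with an oldest-first victim buffer; real access patterns have more locality, so treat the cliff it shows as a lower bound on usefulness rather than a measurement. The function name and defaults are illustrative only.

```python
from collections import OrderedDict


def victim_hit_fraction(working_set, l1_entries=64, victim_entries=16, passes=20):
    """Fraction of L1 misses served by the victim TLB for a cyclic page loop.

    Returns None when the loop produces no L1 misses after the warm-up pass.
    Assumes victim_entries >= 1.
    """
    l1, victim = OrderedDict(), OrderedDict()
    l1_misses = victim_hits = 0
    for i in range(working_set * passes):
        if i == working_set:          # end of warm-up pass: reset counters
            l1_misses = victim_hits = 0
        vpn = i % working_set
        if vpn in l1:
            l1.move_to_end(vpn)       # L1 hit: refresh LRU position
            continue
        l1_misses += 1
        if vpn in victim:
            victim_hits += 1
            del victim[vpn]           # promoted entry leaves the victim TLB
        if len(l1) >= l1_entries:
            evicted, _ = l1.popitem(last=False)
            if len(victim) >= victim_entries:
                victim.popitem(last=False)
            victim[evicted] = True    # L1's LRU entry becomes a victim
        l1[vpn] = True
    return victim_hits / l1_misses if l1_misses else None


results = {w: victim_hit_fraction(w) for w in (64, 65, 80, 81, 96)}
print(results)
```

Under these assumptions a 16-entry victim catches every L1 miss for working sets of 65 to 80 pages and none from 81 pages upward, which is why sizing the victim to the expected overshoot matters more than the ratio alone.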

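A subtle constraint: victim TLB entries must carry the same ASID (address space ID) and page-size tags as the main TLB, so that a lookup never matches another address space's translation and a 4KB and a 2MB translation for overlapping virtual ranges are never both live. Invalidations aimed at the main TLB (INVLPG, TLB shootdowns) have to search the victim as well. Since every lookup and invalidation compares against all entries of a fully associative structure, this is one reason victim TLBs stay small.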
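As a rough illustration of the tagging requirement, the sketch below models entries that match on ASID and page size, plus a single-page invalidation that searches both structures. It is a simplified model, not the semantics of any particular ISA: `TlbEntry`, `invalidate_page`, and the list-based TLBs are hypothetical, and details such as global pages are ignored.

```python
from dataclasses import dataclass

PAGE_4K = 12  # log2 of the page size in bytes
PAGE_2M = 21


@dataclass(frozen=True)
class TlbEntry:
    asid: int
    page_shift: int  # PAGE_4K or PAGE_2M
    vpn: int         # virtual address >> page_shift
    pfn: int

    def matches(self, asid, vaddr):
        """An entry hits only if the ASID and the size-specific VPN both match."""
        return self.asid == asid and (vaddr >> self.page_shift) == self.vpn


def invalidate_page(l1, victim, asid, vaddr):
    """Remove every translation of vaddr for this ASID from BOTH structures."""
    removed = 0
    for entries in (l1, victim):
        stale = [e for e in entries if e.matches(asid, vaddr)]
        for e in stale:
            entries.remove(e)
        removed += len(stale)
    return removed


# A 4KB mapping in L1, and in the victim a 2MB mapping covering the same
# address plus a 4KB mapping that belongs to a different address space.
l1 = [TlbEntry(asid=1, page_shift=PAGE_4K, vpn=0x200, pfn=0x5000)]
victim = [
    TlbEntry(asid=1, page_shift=PAGE_2M, vpn=0x1, pfn=0x40),
    TlbEntry(asid=2, page_shift=PAGE_4K, vpn=0x200, pfn=0x77),
]
removed = invalidate_page(l1, victim, asid=1, vaddr=0x200000)
print(removed, l1, victim)
```

The invalidation removes both of ASID 1's translations for the address, whichever structure and page size they live in, and leaves ASID 2's entry alone. Skipping the victim search would leave a stale translation that a later promotion could bring back into L1.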

Key Takeaway: The victim TLB catches translations the L1 TLB just evicted, converting expensive L2 TLB lookups and page walks into near-free re-hits for workloads whose working set barely exceeds L1 TLB capacity.
