2026-04-22
Every memory access your program makes uses a virtual address. Before anything hits the cache or DRAM, the CPU must translate that virtual address to a physical one via the page table. The problem: page tables live in memory, and a full walk through a 4-level page table (as on x86-64) costs four sequential, dependent memory accesses. At ~100 cycles per DRAM access, that's ~400 cycles of latency for any translation that has to walk the table from DRAM. This is where the Translation Lookaside Buffer (TLB) saves you.
A TLB is a small, fully associative (or highly set-associative) cache that stores recent virtual-to-physical page mappings. Modern CPUs use a split, multi-level TLB hierarchy: separate first-level instruction and data TLBs (ITLB and DTLB), backed by a larger unified second-level TLB (STLB).
Concrete example: an Intel Alder Lake P-core has a 96-entry L1 DTLB and a 2,048-entry L2 STLB. With 4KB pages, the L1 DTLB covers 96 × 4KB = 384KB of address space; the L2 STLB covers 2,048 × 4KB = 8MB. If your randomly accessed working set exceeds 8MB, most accesses miss both TLB levels and pay for a full page walk.
Rule of thumb: A TLB miss with a 4-level page walk costs roughly 4× your L2/L3 latency (the page walk accesses are themselves cacheable in the data caches, which helps enormously). On a modern CPU, expect 20–50 cycles for a cached page walk, but 200+ cycles if the page table entries aren't in cache.
This is exactly why huge pages (2MB or 1GB) matter for performance. A 2MB huge page lets each TLB entry cover 512× more address space. That same 96-entry L1 DTLB now covers 192MB instead of 384KB. Databases like PostgreSQL and runtimes like the JVM use huge pages specifically to reduce TLB pressure.
Hardware also helps with a page walk cache (also called a Paging Structure Cache), which caches intermediate levels of the page table. Intel CPUs cache PML4, PDPT, and PD entries separately, so a "miss" often only needs to fetch the final PT level — turning a 4-access walk into a 1-access walk.
TLB shootdowns are the dirty secret of multicore: when one core updates a page table mapping, all other cores that might have cached that mapping must be interrupted via IPI (inter-processor interrupt) to invalidate their TLB entries. This is why frequent mmap/munmap in multithreaded code can silently kill performance — each unmap triggers a cross-core TLB shootdown that can stall every participating core for microseconds.
