Translation Lookaside Buffers (TLBs): How Hardware Caches the Page Table So Every Load Doesn't Become Five

2026-06-08

Every load and store your CPU executes uses a virtual address. The DRAM only understands physical addresses. Somewhere between the load instruction and the cache, hardware has to translate one to the other — and it has to do it in roughly one clock cycle, or your pipeline stalls on every memory access.

The translation lives in the page table, a tree structure in memory. On x86-64 with 4-level paging, walking that tree takes four memory accesses, one per level. If every load triggered a page walk, each load would become five memory accesses (four for translation, one for the data), so translation would consume 4× the memory traffic of the data itself. The TLB is the cache that prevents this.
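To make the four-level walk concrete, here is a minimal sketch of how a 48-bit x86-64 virtual address splits into four 9-bit table indices plus a 12-bit page offset (4 KB pages). The function names are illustrative, not taken from any real kernel or library.

```python
# x86-64 4-level paging with 4 KB pages: a 48-bit virtual address is
# four 9-bit table indices followed by a 12-bit page offset.
LEVELS = ("PML4", "PDPT", "PD", "PT")  # root first, leaf last
INDEX_BITS = 9
OFFSET_BITS = 12


def split_virtual_address(va: int) -> dict:
    """Return the per-level table indices and the page offset for a VA."""
    fields = {"offset": va & ((1 << OFFSET_BITS) - 1)}
    shift = OFFSET_BITS
    # Peel off indices from the leaf level (PT) up to the root (PML4).
    for level in reversed(LEVELS):
        fields[level] = (va >> shift) & ((1 << INDEX_BITS) - 1)
        shift += INDEX_BITS
    return fields


def accesses_per_load_without_tlb() -> int:
    # One memory read per table level, plus the data access itself.
    return len(LEVELS) + 1


va = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 5
fields = split_virtual_address(va)
# fields == {"offset": 5, "PT": 4, "PD": 3, "PDPT": 2, "PML4": 1}
```

Each index selects one entry in one table, so a walk reads one entry per level before the data access can even start.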

A TLB is a small, fully- or set-associative CAM-like structure indexed by virtual page number. Each entry stores the corresponding physical page number plus permission bits (read/write/execute, user/supervisor) and status bits (dirty, accessed). On a hit, translation completes in about one cycle. On a miss, a hardware page walker (a dedicated state machine) traverses the page table and refills the TLB. That typically costs 20–100+ cycles, and more if the walk itself misses in the data cache.
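The hit/miss behavior can be sketched as a toy software model. This is only an illustration under simplifying assumptions: a fully associative TLB with LRU replacement, a plain dict standing in for the page table and walker, and no permission bits. The class name TinyTLB is hypothetical.

```python
from collections import OrderedDict

PAGE_SHIFT = 12  # 4 KB pages


class TinyTLB:
    """Toy fully associative TLB with LRU replacement."""

    def __init__(self, entries, page_table):
        self.entries = entries
        self.page_table = page_table  # dict VPN -> PFN, stands in for the walker
        self.cache = OrderedDict()    # VPN -> PFN, least recently used first
        self.hits = 0
        self.misses = 0

    def translate(self, va):
        vpn = va >> PAGE_SHIFT
        offset = va & ((1 << PAGE_SHIFT) - 1)
        if vpn in self.cache:
            self.hits += 1
            self.cache.move_to_end(vpn)             # mark as most recently used
        else:
            self.misses += 1
            self.cache[vpn] = self.page_table[vpn]  # "page walk" and refill
            if len(self.cache) > self.entries:
                self.cache.popitem(last=False)      # evict least recently used
        return (self.cache[vpn] << PAGE_SHIFT) | offset


page_table = {vpn: vpn + 100 for vpn in range(8)}
tlb = TinyTLB(entries=4, page_table=page_table)
first = tlb.translate(0x1234)   # miss: VPN 1 is walked and cached
second = tlb.translate(0x1FF8)  # hit: same page, different offset
```

Real TLBs do the lookup in parallel hardware rather than a dictionary, but the bookkeeping (look up by VPN, refill on miss, evict when full) is the same idea.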

Real example (the cost of a TLB miss): An Intel Skylake L1 dTLB has 64 entries for 4 KB pages, so it covers 64 × 4 KB = 256 KB of virtual address space. Walk linearly through a 1 GB array and you'll miss the dTLB at every page boundary, 262,144 times in all. With a ~25-cycle walk every 4 KB, that is 25 / 4096 ≈ 0.006 cycles of overhead per byte (about 6 cycles per KB), which a sequential scan absorbs easily. The painful case is random access: when each access lands on a different page, every one of them can pay the full ~25 cycles, and that's before you account for cache misses.
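The arithmetic is easy to check directly. The sketch below uses the figures quoted above (64 entries, 4 KB pages, a ~25-cycle walk) and assumes exactly one walk per page touched in a sequential scan; the function names are illustrative.

```python
KB = 1024
GB = 1024 ** 3


def tlb_reach_bytes(entries, page_size):
    """Virtual address space covered when every entry maps one page."""
    return entries * page_size


def sequential_walk_overhead(array_bytes, page_size, walk_cycles):
    """Walks and amortized cost for a linear scan: one miss per page."""
    walks = array_bytes // page_size
    total_cycles = walks * walk_cycles
    return walks, total_cycles, total_cycles / array_bytes


reach = tlb_reach_bytes(64, 4 * KB)  # 262144 bytes = 256 KB
walks, cycles, per_byte = sequential_walk_overhead(1 * GB, 4 * KB, 25)
# walks == 262144, cycles == 6553600, per_byte == 25 / 4096 (about 0.0061)
```

Amortized over a whole page the walk is cheap; it only dominates when consecutive accesses land on different pages.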

Rule of thumb: If your working set fits in (L1 dTLB entries × page size), translations are effectively free. Above that, consider huge pages (2 MB on x86; on Linux, requested via madvise(MADV_HUGEPAGE) or hugetlbfs). A single 2 MB entry covers what would otherwise take 512 4 KB entries, expanding per-entry reach 512×.
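As a rough sketch of the madvise route, Python's mmap module exposes the same hint on Linux (Python 3.8+). The helper name is hypothetical, and the hint is only a request: the kernel may ignore it, and the call can fail outright if transparent huge pages are not available, which is why the sketch reports whether the hint was accepted.

```python
import mmap

HUGE_PAGE = 2 * 1024 * 1024  # 2 MB huge page on x86-64


def alloc_with_huge_page_hint(size):
    """Map anonymous memory and ask the kernel to back it with huge pages.

    Returns (mapping, hinted). hinted is False when the platform lacks
    MADV_HUGEPAGE or the kernel rejected the request.
    """
    # Round up to a whole number of 2 MB units.
    size = -(-size // HUGE_PAGE) * HUGE_PAGE
    if not hasattr(mmap, "MADV_HUGEPAGE"):
        return mmap.mmap(-1, size), False
    buf = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    try:
        # Advisory only. The mapping is page-aligned but not necessarily
        # 2 MB-aligned, so only aligned 2 MB sub-ranges can become huge pages.
        buf.madvise(mmap.MADV_HUGEPAGE)
        return buf, True
    except OSError:
        return buf, False


buf, hinted = alloc_with_huge_page_hint(3 * 1024 * 1024)
buf[0] = 1  # touching the memory is what actually triggers allocation
```

A program that needs guaranteed huge pages would typically use hugetlbfs instead, since that route reserves the pages up front rather than hinting.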

This is why database engineers obsess over huge pages: a hash table with random access patterns can spend 30–50% of its cycles in page walks if it overflows the STLB (the larger second-level TLB that backs the L1 TLBs). Context switches make it worse. TLB entries are either untagged or tagged with an ID from a small address-space ID space (PCID on x86, ASID on ARM), so a switch can flush much of the structure, and the next process pays a wave of walks.
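The context-switch effect can be illustrated with a small simulation. This is a toy model under stated assumptions: an LRU TLB, a trace of (process_id, vpn) accesses, and an unlimited tag space in the tagged case. The function name and the trace are made up for illustration.

```python
from collections import OrderedDict


def count_misses(trace, entries, tagged):
    """Count TLB misses for a trace of (process_id, vpn) accesses.

    tagged=False: the TLB is flushed whenever the running process changes.
    tagged=True:  entries are keyed by (process_id, vpn) and survive switches.
    """
    tlb = OrderedDict()  # least recently used first
    misses = 0
    current = None
    for pid, vpn in trace:
        if pid != current:
            if not tagged:
                tlb.clear()  # untagged entries would alias, so drop them all
            current = pid
        key = (pid, vpn)
        if key in tlb:
            tlb.move_to_end(key)
        else:
            misses += 1
            tlb[key] = True
            if len(tlb) > entries:
                tlb.popitem(last=False)
    return misses


# Two processes, each touching the same 8 pages, alternating for 10 time slices.
trace = [(pid, vpn) for _ in range(5) for pid in (1, 2) for vpn in range(8)]
untagged_misses = count_misses(trace, entries=64, tagged=False)  # 80
tagged_misses = count_misses(trace, entries=64, tagged=True)     # 16
```

With flushing, every time slice re-walks all 8 pages; with tags, each process pays for its pages once and then hits for the rest of the trace.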

Key Takeaway: The TLB is a hardware cache for virtual-to-physical address translation; when your working set exceeds TLB entries × page size, every memory access risks a multi-cycle page walk, which is why huge pages dramatically speed up large random-access workloads.
