2026-05-30
When you picture a cache, you probably imagine one big SRAM block holding cache lines. In reality, a cache is two physically separate arrays: the tag array (small, holds address tags + metadata) and the data array (large, holds the actual cache lines). How a CPU schedules accesses to these two arrays largely determines the cache's latency, power, and bandwidth.
What lives where:
Three ways to access them:
Real example: Intel Sunny Cove L1D, 48 KB, 12-way set associative. With parallel access, every load reads all 12 tag entries of the selected set (~720 bits, about 60 bits per entry) and all 12 data ways (12 × 512 = 6144 bits). By bit count, the data array burns roughly 8× the dynamic power of the tag array per access, which is why the L2 (1.25 MB, 20-way in the server parts) switches to serial access: reading 20 ways × 64 bytes per access would melt the chip.
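The Sunny Cove L1D figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch: `cache_geometry` is a hypothetical helper, and the 60-bit tag entry is an assumption back-derived from "~720 bits across 12 ways", not a published figure.

```python
# Back-of-envelope geometry for a set-associative cache with parallel access.
# tag_entry_bits=60 is an assumption derived from ~720 bits / 12 ways.

def cache_geometry(size_bytes, ways, line_bytes=64, tag_entry_bits=60):
    """Return set count and bits read per parallel access."""
    sets = size_bytes // (ways * line_bytes)
    tag_bits_read = ways * tag_entry_bits    # every way's tag entry is read
    data_bits_read = ways * line_bytes * 8   # every way's data line is read
    return {
        "sets": sets,
        "tag_bits_read": tag_bits_read,
        "data_bits_read": data_bits_read,
        "data_to_tag_ratio": data_bits_read / tag_bits_read,
    }

l1d = cache_geometry(48 * 1024, ways=12)
print(l1d)
```

For the 48 KB, 12-way case this gives 64 sets, 720 tag bits, and 6144 data bits per access, a ratio of about 8.5.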
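Rule of thumb for power: tag array power ≈ data array power ÷ (lineSize_bytes × 8 ÷ tagBits), i.e. power per way scales with the number of bits read. For a 64-byte line with 50-bit tags: 512/50 ≈ 10× ratio. Parallel access reads N tags and N data ways (about 11N in units of one tag read), while serial access reads N tags and only the one matching data way (about N + 10). So serial access cuts a cache's read energy by roughly (N−1)/N × 10/11; for an 8-way cache, that's about 80% savings.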
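The rule of thumb can be checked numerically. The sketch below assumes that read energy is proportional to bits read and that the serial access hits (so exactly one data way is read); both function names are illustrative.

```python
# Energy model in units of "one tag-entry read".
# Assumptions: energy is proportional to bits read; the serial access is a hit.

def data_to_tag_ratio(line_bytes, tag_bits):
    """How many times more bits a data way holds than a tag entry."""
    return line_bytes * 8 / tag_bits

def serial_read_savings(ways, ratio):
    """Fraction of read energy saved by serial vs. parallel access."""
    parallel = ways * (1 + ratio)   # N tags + N data ways
    serial = ways + ratio           # N tags + 1 data way
    return 1 - serial / parallel

ratio = data_to_tag_ratio(64, 50)          # 512 / 50 = 10.24, rounded to 10 in the text
savings_8way = serial_read_savings(8, 10)  # equals (7/8) * (10/11)
print(f"ratio={ratio:.2f}, 8-way savings={savings_8way:.1%}")
```

Algebraically, 1 − (N + r)/(N(1 + r)) simplifies to (N−1)/N × r/(1 + r), which is the (N−1)/N × 10/11 expression for r = 10.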
The multi-port problem: L1D needs two reads + one write per cycle on a modern superscalar core. Multi-porting the data array is expensive (area scales roughly as ports²), so designers often multi-port only the tag array and use banked data arrays, splitting cache lines across SRAM banks so two accesses to different banks proceed in parallel. Bank conflicts then become a real performance event you can measure with PMCs (performance monitoring counters).
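To make the bank-conflict condition concrete, here is a toy model. The layout is an assumption chosen for illustration (8 banks, each 8 bytes wide, so one 64-byte line spans all banks); real designs differ in bank count, width, and whether two accesses to the same chunk can share one bank read.

```python
# Toy model of a banked L1D data array.
# NUM_BANKS and BANK_WIDTH are hypothetical, not taken from any specific CPU.

NUM_BANKS = 8    # assumed number of SRAM banks
BANK_WIDTH = 8   # assumed bytes served by one bank per cycle

def bank_of(addr):
    """Bank selected by the address bits just above the bank-width offset."""
    return (addr // BANK_WIDTH) % NUM_BANKS

def bank_conflict(addr_a, addr_b):
    """True if two same-cycle reads need different chunks from the same bank."""
    same_chunk = addr_a // BANK_WIDTH == addr_b // BANK_WIDTH
    return bank_of(addr_a) == bank_of(addr_b) and not same_chunk

print(bank_conflict(0x1000, 0x1008))  # adjacent chunks land in different banks
print(bank_conflict(0x1000, 0x1040))  # 64 bytes apart maps to the same bank
```

Under this layout, two loads whose addresses differ by a multiple of 64 bytes always collide, which is the kind of access pattern a bank-conflict counter would expose.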
