Daily Hardware Architecture: The Tag Array vs. The Data Array: Why Caches Are Really Two Caches

2026-05-30

When you picture a cache, you probably imagine one big SRAM block holding cache lines. The reality: every cache is actually two physically separate arrays — the tag array (small, holds addresses + metadata) and the data array (huge, holds the actual cache lines). How a CPU schedules accesses to these two arrays determines its latency, power, and bandwidth.

What lives where:

Tag array: upper address bits (the tag), valid bit, dirty bit, MESI state, LRU/replacement bits, ECC. Maybe 40-60 bits per line.
Data array: the 64-byte cache line itself = 512 bits, plus ECC. Roughly 10× larger than the tag array.
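
To make the tag/data split concrete, here is a minimal sketch of how an address divides into tag, set index, and line offset. The `CacheGeometry` class and the 48-bit address width are illustrative assumptions, not a description of any specific CPU; the sketch also assumes the set count and line size are powers of two.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheGeometry:
    size_bytes: int       # total data-array capacity
    ways: int             # associativity
    line_bytes: int = 64  # cache line size
    addr_bits: int = 48   # ASSUMPTION: illustrative physical address width

    @property
    def sets(self) -> int:
        return self.size_bytes // (self.ways * self.line_bytes)

    @property
    def offset_bits(self) -> int:
        # Bits that select a byte within the line.
        return (self.line_bytes - 1).bit_length()

    @property
    def index_bits(self) -> int:
        # Bits that select the set.
        return (self.sets - 1).bit_length()

    @property
    def tag_bits(self) -> int:
        # Whatever is left of the address is stored in the tag array.
        return self.addr_bits - self.index_bits - self.offset_bits

    def split(self, addr: int):
        """Return (tag, set index, byte offset) for an address."""
        offset = addr & (self.line_bytes - 1)
        index = (addr >> self.offset_bits) & (self.sets - 1)
        tag = addr >> (self.offset_bits + self.index_bits)
        return tag, index, offset


geom = CacheGeometry(size_bytes=48 * 1024, ways=12)
print(geom.sets, geom.tag_bits, geom.split(0x12345))
```

Under these assumptions a 48 KB, 12-way cache has 64 sets and a 36-bit tag; adding the valid, dirty, coherence, replacement, and ECC bits is what pushes each tag entry into the 40-60 bit range.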

Three ways to access them:

Serial (tag-then-data): Read tags first, find the hit way, then read only that way's data. Saves power (one data way read instead of N), but adds a cycle of latency. Used in L2/L3 where latency already dominates.
Parallel: Read all N tags and all N data ways simultaneously, then mux the correct one based on tag comparison. Fast but burns power reading data you'll throw away. Used in L1D for latency-critical access.
Way-predicted: Predict the way, read only that data way in parallel with the tag check. Best of both, but mispredictions cost a replay. (Covered in your previous Way Predictor lesson — this is the partner system.)
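
The three policies above can be compared by counting how many tag entries and data ways each one reads for a single access. This is a toy model, not a timing simulator: the `lookup` function, its mode names, and the rule that a mispredicted hit costs one extra data-way read on replay are all assumptions made for illustration.

```python
def lookup(set_tags, want_tag, mode, predicted_way=0):
    """Return (hit_way, tag_reads, data_way_reads) for one access to one set.

    set_tags holds the tag stored in each way of the set (None = invalid way).
    """
    n = len(set_tags)
    hit_way = next((w for w, t in enumerate(set_tags) if t == want_tag), None)

    if mode == "serial":
        # Tags first; the data array is touched only on a hit, and only one way.
        return hit_way, n, 1 if hit_way is not None else 0
    if mode == "parallel":
        # Every data way is read alongside every tag; a mux picks the winner.
        return hit_way, n, n
    if mode == "way_predicted":
        # One data way is read speculatively alongside the tag check.
        data_reads = 1
        if hit_way is not None and hit_way != predicted_way:
            data_reads += 1  # ASSUMPTION: the replay reads the correct way
        return hit_way, n, data_reads
    raise ValueError(f"unknown mode: {mode}")


tags = [0x10, 0x22, 0x35, 0x47]  # a hypothetical 4-way set
for mode in ("serial", "parallel", "way_predicted"):
    print(mode, lookup(tags, 0x35, mode, predicted_way=2))
```

The model counts reads only; it says nothing about the extra cycle serial access pays or the replay latency of a misprediction.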

Real example: Intel Sunny Cove L1D, 48 KB, 12-way set associative. Parallel access reads all 12 tag entries (~720 bits) and all 12 data entries (6144 bits) on every access. The data array burns roughly 8× the dynamic power of the tag array per access, which is why the L2 (1.25 MB, 20-way in the server configuration) switches to serial access: reading 20 ways × 64 bytes (10,240 bits) per access would melt the chip.
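
The bit counts in this example are easy to reproduce. The sketch below assumes about 60 bits per tag entry (the figure implied by ~720 bits across 12 ways) and reuses the same entry width for the L2 purely for illustration; `bits_read` is a hypothetical helper, not a vendor-published model.

```python
LINE_BITS = 64 * 8  # one 64-byte cache line


def bits_read(ways, tag_entry_bits, mode):
    """Return (tag_bits, data_bits) read by one access that hits."""
    tag_bits = ways * tag_entry_bits             # all tags are always compared
    data_ways = ways if mode == "parallel" else 1
    return tag_bits, data_ways * LINE_BITS


l1d_parallel = bits_read(12, 60, "parallel")
l2_parallel = bits_read(20, 60, "parallel")      # what the L2 avoids
l2_serial = bits_read(20, 60, "serial")

print("L1D parallel:", l1d_parallel)
print("L2 parallel: ", l2_parallel)
print("L2 serial:   ", l2_serial)
print("L1D data/tag ratio:", round(l1d_parallel[1] / l1d_parallel[0], 1))
```

With these inputs the serial L2 reads 512 data bits per hit instead of 10,240, a 20× reduction in data-array traffic.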

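Rule of thumb for power: assuming per-access energy scales with the number of bits read, tag array power ≈ data array power ÷ (lineSize_bytes × 8 ÷ tagBits). For a 64-byte line with 50-bit tags: 512/50 ≈ 10× ratio. A parallel access to an N-way cache reads N tags and N data ways, while a serial hit reads N tags and one data way, so serial access cuts a cache's read energy by roughly (N−1)/N × 10/11. For an 8-way cache, that's about 80% savings.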
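
The rule of thumb can be checked numerically. This sketch assumes energy is proportional to bits read and that the access hits (so serial reads exactly one data way); `serial_read_savings` is an illustrative name, and real arrays have fixed costs such as decoders and sense amplifiers that this ignores.

```python
def serial_read_savings(ways, line_bytes=64, tag_bits=50):
    """Fraction of read energy saved by serial vs. parallel access on a hit."""
    ratio = line_bytes * 8 / tag_bits   # data-way energy in units of one tag read
    parallel = ways * (1 + ratio)       # N tags + N data ways
    serial = ways + ratio               # N tags + 1 data way
    return 1 - serial / parallel


def closed_form(ways, ratio):
    """The (N-1)/N x r/(r+1) form of the same expression."""
    return (ways - 1) / ways * ratio / (ratio + 1)


for n in (4, 8, 12, 20):
    print(n, f"{serial_read_savings(n):.1%}")
```

For an 8-way cache this lands just under 80%, and the savings grow with associativity because more discarded data ways are avoided.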

The dual-port problem: L1D needs two reads + one write per cycle to feed a modern superscalar core. Multi-porting the data array is expensive (area scales roughly as ports²), so designers often multi-port only the tag array and use banked data arrays, splitting cache lines across SRAM banks so two accesses to different banks proceed in parallel. Bank conflicts then become a real perf event you can measure with PMCs.
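
A bank conflict check can be sketched in a few lines. The layout here (8 banks, each 8 bytes wide, interleaved across a 64-byte line) is a hypothetical configuration chosen for illustration, and the model treats any two same-cycle accesses to the same bank as a conflict; real designs differ in bank count, width, and which same-bank cases they can merge.

```python
NUM_BANKS = 8         # ASSUMPTION: hypothetical bank count
BANK_WIDTH_BYTES = 8  # ASSUMPTION: hypothetical bank width


def bank_of(addr):
    """Bank holding the byte at addr, with banks interleaved every 8 bytes."""
    return (addr // BANK_WIDTH_BYTES) % NUM_BANKS


def conflicts(addr_a, addr_b):
    """True if two same-cycle accesses would collide on one single-ported bank."""
    return bank_of(addr_a) == bank_of(addr_b)


print(conflicts(0x00, 0x08))  # adjacent 8-byte words: different banks
print(conflicts(0x00, 0x40))  # 64 bytes apart: same bank in this layout
```

In this layout, addresses a multiple of 64 bytes apart always collide, which is why strided access patterns are a common trigger for the bank-conflict counters.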

See it in action: Check out "L11 4 how caches work" by David Black-Schaffer to see this theory applied.

Key Takeaway: A cache is two arrays, not one — and the latency/power/bandwidth tradeoff of how you sequence their access defines every cache level from L1 to L3.
