2026-05-17
You think of DRAM as a flat array with one latency number. It isn't. The memory system is organized as channels → DIMMs → ranks → banks → rows → columns, and the memory controller has to open (activate) a row into the bank's row buffer before any column read works. Where your address falls in that hierarchy determines whether your access takes 15 ns or 80 ns.
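A DRAM bank has one row buffer (typically 8 KB). Three cases:
Row hit: the row you want is already open, so the controller issues only the column read (CAS latency).
Row miss (bank idle, no row open): the controller must activate the row first (tRCD), then read the column.
Row conflict: a different row is open, so the controller must precharge it (tRP), activate yours (tRCD), and only then read the column.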
That's why streaming sequential access is so much faster than random: sequential reads hit the same open row repeatedly. Random reads across a large working set thrash the row buffer with conflicts.
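To make the hit/miss/conflict distinction concrete, here is a minimal sketch of a row-buffer model. It assumes an open-page policy (a row stays open until another row in the same bank is requested), 16 banks, and a toy mapping in which consecutive 8 KB chunks rotate across banks. The names `classify`, `sequential`, and `scattered` are illustrative, not from any real tool.

```python
import random

ROW_BYTES = 8 * 1024   # row-buffer size quoted in the text
NUM_BANKS = 16         # assumed bank count for this toy model

def classify(addresses):
    """Count row hits, misses (bank idle), and conflicts for a list of byte addresses."""
    open_row = {}  # bank -> row currently held in that bank's row buffer
    counts = {"hit": 0, "miss": 0, "conflict": 0}
    for addr in addresses:
        chunk = addr // ROW_BYTES      # which 8 KB chunk the address falls in
        bank = chunk % NUM_BANKS       # toy mapping: chunks rotate across banks
        row = chunk // NUM_BANKS
        if bank not in open_row:
            counts["miss"] += 1        # nothing open yet: activate only
        elif open_row[bank] == row:
            counts["hit"] += 1         # row already open: column read only
        else:
            counts["conflict"] += 1    # wrong row open: precharge, then activate
        open_row[bank] = row
    return counts

def sequential(n, line=64):
    """n consecutive cache-line addresses."""
    return [i * line for i in range(n)]

def scattered(n, span=1 << 30, line=64, seed=0):
    """n cache-line addresses drawn uniformly from a 1 GiB working set."""
    rng = random.Random(seed)
    return [rng.randrange(span // line) * line for _ in range(n)]

if __name__ == "__main__":
    print("sequential:", classify(sequential(4096)))
    print("scattered: ", classify(scattered(4096)))
```

Under these assumptions the sequential walk pays one activate per 8 KB chunk and hits on the other 127 cache lines, while the scattered walk is almost entirely conflicts. Real controllers may close rows early or hash addresses, so treat the counts as illustrative.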
The controller also interleaves physical addresses across channels and banks to enable parallelism. A typical DDR4 system with 2 channels × 2 DIMMs × 2 ranks × 16 banks has 128 banks that can operate in parallel. The controller's address-mapping function (often XOR-based) decides which bank each cache line lands in. Pathological access patterns, like striding by a multiple of the full interleave period so that the channel- and bank-select bits never change, can serialize everything onto one bank and tank your bandwidth by 8x.
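The effect of the mapping function can be sketched with a made-up bit layout. Everything below is hypothetical: real controllers use vendor-specific and often undocumented mappings. The layout assumes 64-byte cache lines, then 1 channel bit, 4 bank bits, 1 rank bit, and 1 DIMM bit, with an optional XOR of the bank bits against the next-higher address bits.

```python
LINE_BITS = 6  # 64-byte cache line (assumed)

def decode(addr, xor_hash=False):
    """Map a physical address to (channel, dimm, rank, bank) under a hypothetical layout."""
    x = addr >> LINE_BITS
    channel = x & 0x1; x >>= 1   # 2 channels
    bank = x & 0xF;    x >>= 4   # 16 banks per rank
    rank = x & 0x1;    x >>= 1   # 2 ranks per DIMM
    dimm = x & 0x1;    x >>= 1   # 2 DIMMs per channel
    if xor_hash:
        bank ^= x & 0xF          # fold higher address bits into the bank index
    return channel, dimm, rank, bank

def banks_touched(stride, n=1024, xor_hash=False):
    """Number of distinct banks hit by n accesses at a fixed byte stride."""
    return len({decode(i * stride, xor_hash) for i in range(n)})

if __name__ == "__main__":
    print(banks_touched(64))                    # cache-line stride
    print(banks_touched(8192))                  # stride equal to the interleave period
    print(banks_touched(8192, xor_hash=True))   # same stride with XOR hashing
```

In this toy layout the interleave period is 2^13 = 8 KB, so an 8 KB stride leaves every select bit unchanged and all accesses fall on a single bank. The XOR variant spreads the same stride over 16 banks, which is the kind of pathology XOR-based mappings are meant to soften.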
Real-world example: A column-major matrix of doubles with 4096 rows, walked row by row, strides by 4096 × 8 B = 32 KB per access. On a system whose channel- and bank-select bits all sit below address bit 15 (and are not XOR-hashed with higher bits), every access can land on the same bank but in a different row, which means pure row conflicts. Transposing the matrix or blocking it to fit in L2 turns those into row hits, and the same code runs 5–10x faster. This is why "cache-blocked" matrix multiplies aren't just about cache: they're about row-buffer locality too.
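The stride arithmetic can be checked with a short sketch. It assumes a square 4096 × 4096 matrix of 8-byte doubles stored column-major starting at address 0, and an 8 KB DRAM row. The helper names are illustrative only.

```python
ELEM = 8          # sizeof(double)
ROWS = 4096       # assumed square matrix: 4096 rows by 4096 columns
ROW_BYTES = 8192  # assumed DRAM row-buffer size

def addr_col_major(r, c, rows=ROWS):
    """Byte offset of element (r, c) in a column-major matrix."""
    return (c * rows + r) * ELEM

def row_walk(r, n):
    """Addresses touched when walking n elements along matrix row r."""
    return [addr_col_major(r, c) for c in range(n)]

def col_walk(c, n):
    """Addresses touched when walking n elements down matrix column c."""
    return [addr_col_major(r, c) for r in range(n)]

def distinct_dram_rows(addrs):
    """How many different 8 KB regions the address list touches."""
    return len({a // ROW_BYTES for a in addrs})

if __name__ == "__main__":
    walk = row_walk(3, 1024)
    print("stride in bytes:", walk[1] - walk[0])
    print("distinct low-15-bit patterns:", len({a & 0x7FFF for a in walk}))
    print("8 KB regions, row walk:", distinct_dram_rows(walk))
    print("8 KB regions, column walk:", distinct_dram_rows(col_walk(0, 1024)))
```

The row walk strides by 2^15 bytes, so the low 15 address bits never change and each access falls in a different 8 KB region. The same 1024 elements walked down a column fit in a single 8 KB region.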
Rule of thumb: Sequential DRAM bandwidth is roughly 4–5x random-access bandwidth on the same hardware. If your perf counters show high UNC_M_CAS_COUNT.RD with a low row-buffer hit rate (visible via pcm-memory or Intel PCM's row-hit metric), you're paying tRP + tRCD (precharge plus activate) on top of the column read for most accesses. Restructure for sequentiality, or block to keep rows open.
The controller also reorders requests (FR-FCFS — First-Ready, First-Come-First-Served) to maximize row hits, which is why measuring single-access latency in isolation tells you almost nothing about loaded latency.
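A minimal sketch of the FR-FCFS idea, compared against plain in-order service. It is simplified on purpose: all requests are assumed to be queued up front, each is a `(bank, row)` pair, and there is no starvation limit or timing model, all of which a real controller would have. The function names are illustrative.

```python
def fcfs(requests):
    """Serve (bank, row) requests strictly in arrival order; return (order, row_hits)."""
    open_row, hits = {}, 0
    for bank, row in requests:
        if open_row.get(bank) == row:
            hits += 1
        open_row[bank] = row
    return list(requests), hits

def fr_fcfs(requests):
    """Serve the oldest request that hits an open row; otherwise the oldest request."""
    pending = list(requests)   # index 0 is the oldest
    open_row, hits, order = {}, 0, []
    while pending:
        # First-ready: find the oldest pending request whose row is already open.
        pick = next(
            (i for i, (b, r) in enumerate(pending) if open_row.get(b) == r),
            0,  # nothing is ready, so fall back to the oldest request
        )
        bank, row = pending.pop(pick)
        if open_row.get(bank) == row:
            hits += 1
        open_row[bank] = row
        order.append((bank, row))
    return order, hits

if __name__ == "__main__":
    # Two rows of one bank requested alternately: worst case for in-order service.
    reqs = [(0, 1), (0, 2), (0, 1), (0, 2), (0, 1), (0, 2)]
    print("FCFS   :", fcfs(reqs))
    print("FR-FCFS:", fr_fcfs(reqs))
```

On the alternating pattern, in-order service conflicts on every request, while the reordering scheduler groups requests to the same row together. That grouping is also why an individual request's latency depends on what else is queued around it.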
