Daily Low-Level Programming: The Memory Controller and DRAM Timings: Why Your "Random" Access Has Five Different Latencies

2026-05-17

You think of DRAM as a flat array with one latency number. It isn't. The memory system is organized as channels → DIMMs → ranks → banks → rows → columns, and the memory controller has to open (activate) a row into the bank's row buffer before any column read works. Where your address falls in that hierarchy, and which row the bank currently has open, determines whether your access takes roughly 15 ns or 80 ns.

A DRAM bank has one row buffer (typically 8 KB). Three cases:

Row hit: the row you want is already open. Just issue CAS. ~13 ns (tCL).
Row empty: no row open. Issue ACTIVATE then CAS. ~26 ns (tRCD + tCL).
Row conflict: different row open in same bank. PRECHARGE → ACTIVATE → CAS. ~40 ns (tRP + tRCD + tCL).
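The three cases can be written down as a tiny latency model. This is a minimal sketch, not a model of any real part: the timing constants are assumed round values chosen to match the approximate figures above, and access_latency_ns is a hypothetical helper name.

```python
# Illustrative timings in nanoseconds. These are assumed values picked to
# match the rounded numbers in the text, not figures from a datasheet.
T_CL = 13.0   # CAS latency: column read from an already-open row
T_RCD = 13.0  # ACTIVATE-to-CAS delay: time to open a row
T_RP = 14.0   # PRECHARGE time: time to close the currently open row


def access_latency_ns(open_row, wanted_row):
    """Classify one read against a bank's row-buffer state.

    open_row is the row currently held in the row buffer, or None if the
    bank is precharged (no row open). Returns (case, latency_ns).
    """
    if open_row is None:
        # Row empty: ACTIVATE, then CAS.
        return "empty", T_RCD + T_CL
    if open_row == wanted_row:
        # Row hit: CAS only.
        return "hit", T_CL
    # Row conflict: PRECHARGE the old row, ACTIVATE the new one, then CAS.
    return "conflict", T_RP + T_RCD + T_CL


hit = access_latency_ns(7, 7)        # ("hit", 13.0)
empty = access_latency_ns(None, 7)   # ("empty", 26.0)
conflict = access_latency_ns(3, 7)   # ("conflict", 40.0)
```

The model ignores everything outside the bank (queuing in the controller, bus transfer time, refresh), so it only describes the device-side part of the latency.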

That's why streaming sequential access is so much faster than random: sequential reads hit the same open row repeatedly. Random reads across a large working set thrash the row buffer with conflicts.

The controller also interleaves physical addresses across channels and banks to enable parallelism. A typical DDR4 system with 2 channels × 2 DIMMs × 2 ranks × 16 banks = 128 banks, each able to hold its own row open, although banks on the same channel still share that channel's data bus. The controller's address-mapping function (often XOR-based) decides which bank each cache line lands in. Pathological access patterns, like striding by exactly the span the mapping interleaves over, can serialize everything onto one bank and tank your bandwidth by as much as 8x.
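To see why XOR-based mappings exist, here is a toy comparison of a plain bit-slice mapping against one that XORs row bits into the bank index. The bit positions are assumptions made up for illustration (real controllers use vendor-specific and usually undocumented functions), and bank_plain and bank_xor are hypothetical names.

```python
# Assumed toy layout: bits 0-12 are the offset within an 8 KB row,
# bits 13-16 select one of 16 banks, bits 17 and up select the row.
BANK_SHIFT = 13
BANK_BITS = 4
ROW_SHIFT = BANK_SHIFT + BANK_BITS
BANK_MASK = (1 << BANK_BITS) - 1


def bank_plain(addr):
    """Bank index taken directly from the address bits."""
    return (addr >> BANK_SHIFT) & BANK_MASK


def bank_xor(addr):
    """Bank index hashed with the low row bits, as a simple XOR scheme."""
    row = addr >> ROW_SHIFT
    return bank_plain(addr) ^ (row & BANK_MASK)


# A pathological stride: exactly the span covered by all 16 banks (128 KiB),
# so the plain bank bits never change from one access to the next.
stride = 1 << ROW_SHIFT
addrs = [i * stride for i in range(64)]

plain_banks = {bank_plain(a) for a in addrs}  # every access hits bank 0
xor_banks = {bank_xor(a) for a in addrs}      # accesses spread over 16 banks
```

Under the plain mapping this stride puts every access on one bank with a new row each time. The XOR variant spreads the same addresses over all 16 banks, which is the property the hashing is meant to buy.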

Real-world example: Walking down a column of a row-major matrix of doubles with 4096 columns strides by 32 KB per access (4096 × 8 bytes). On a system whose bank-select bits all sit below bit 15, every access can land on the same bank with a different row: pure row conflicts. Transposing the matrix or blocking it to fit in L2 turns those into row hits, and the same code can run 5–10x faster. This is why "cache-blocked" matrix multiplies aren't just about cache: they're about row-buffer locality too.
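A small simulation makes the stride effect concrete. It is a sketch under loud assumptions: a toy system with only 4 banks whose select bits sit at bits 13-14, row numbers starting at bit 15, no caches or prefetchers (every element access is treated as a DRAM access), and a 256-row matrix to keep the run short. All names are hypothetical.

```python
# Assumed toy mapping: 8 KB rows (bits 0-12), 4 banks (bits 13-14),
# row number from bit 15 upward.
BANK_SHIFT = 13
BANK_BITS = 2
ROW_SHIFT = BANK_SHIFT + BANK_BITS

N_COLS = 4096   # columns per matrix row, as in the text
N_ROWS = 256    # assumed matrix height, kept small for the simulation
ELEM = 8        # sizeof(double)


def decode(addr):
    """Split a physical address into (bank, row) under the toy mapping."""
    bank = (addr >> BANK_SHIFT) & ((1 << BANK_BITS) - 1)
    row = addr >> ROW_SHIFT
    return bank, row


def row_hit_rate(addresses):
    """Fraction of accesses that find their row already open in the bank."""
    open_rows = {}   # bank -> currently open row
    hits = 0
    total = 0
    for addr in addresses:
        bank, row = decode(addr)
        if open_rows.get(bank) == row:
            hits += 1
        open_rows[bank] = row   # the accessed row is left open afterwards
        total += 1
    return hits / total


# Walk down column 0 of a row-major matrix: stride of N_COLS * ELEM = 32 KB.
column_walk = [i * N_COLS * ELEM for i in range(N_ROWS)]
# Walk along matrix row 0: stride of 8 bytes.
row_walk = [j * ELEM for j in range(N_COLS)]

column_rate = row_hit_rate(column_walk)  # 0.0: same bank, new row each time
row_rate = row_hit_rate(row_walk)        # 4092 / 4096: one miss per bank
```

With these assumptions the column walk never gets a row hit, while the row walk misses only on the first touch of each of the four banks.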

Rule of thumb: Sequential DRAM bandwidth is roughly 4–5x random-access bandwidth on the same hardware. If your perf counters show high UNC_M_CAS_COUNT.RD with low row-buffer hit rate (visible via pcm-memory or Intel PCM's row-hit metric), you're paying tRP+tRCD on every access. Restructure for sequentiality or block to keep rows open.
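If your tooling gives you raw command counts rather than a ready-made hit rate, a rough estimate follows from the fact that every access that is not a row hit needs an ACTIVATE. The sketch below assumes you can obtain an ACTIVATE count alongside the CAS count (whether and under what event name depends on your platform); the function name and the sample numbers are made up for illustration.

```python
def estimated_row_hit_rate(cas_count, act_count):
    """Approximate row-buffer hit rate from command counts.

    Each row-empty or row-conflict access costs one ACTIVATE, so the share
    of CAS commands that did not need one is roughly 1 - ACT / CAS.
    Refresh and the controller's page-close policy also cause activates,
    so treat the result as an estimate, not an exact figure.
    """
    if cas_count <= 0:
        raise ValueError("cas_count must be positive")
    return max(0.0, 1.0 - act_count / cas_count)


# Hypothetical sample: 1,000,000 CAS commands and 750,000 ACTIVATEs.
rate = estimated_row_hit_rate(1_000_000, 750_000)  # 0.25
```

A value this low would suggest that most accesses are opening a row first, which is the pattern the rule of thumb warns about.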

The controller also reorders requests (FR-FCFS — First-Ready, First-Come-First-Served) to maximize row hits, which is why measuring single-access latency in isolation tells you almost nothing about loaded latency.
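The reordering idea can be sketched in a few lines. This is a toy version of FR-FCFS under simplifying assumptions: all requests are already queued, there is no timing model, and "ready" means only "targets the row currently open in its bank". Request and fr_fcfs_order are hypothetical names.

```python
from collections import namedtuple

# arrival: position in arrival order; bank and row: where the request lands.
Request = namedtuple("Request", ["arrival", "bank", "row"])


def fr_fcfs_order(requests, open_rows=None):
    """Return requests in the order a toy FR-FCFS scheduler issues them.

    First-ready: prefer the oldest request that hits its bank's open row.
    Otherwise fall back to the oldest request overall (FCFS).
    """
    open_rows = dict(open_rows or {})   # bank -> currently open row
    pending = sorted(requests, key=lambda r: r.arrival)
    issued = []
    while pending:
        hits = [r for r in pending if open_rows.get(r.bank) == r.row]
        chosen = hits[0] if hits else pending[0]
        pending.remove(chosen)
        open_rows[chosen.bank] = chosen.row   # the chosen row is now open
        issued.append(chosen)
    return issued


# Four requests to bank 0 that alternate between rows 1 and 2.
queue = [Request(0, 0, 1), Request(1, 0, 2), Request(2, 0, 1), Request(3, 0, 2)]
order = [r.arrival for r in fr_fcfs_order(queue)]  # [0, 2, 1, 3]
```

In strict arrival order this queue would be one row-empty access followed by three conflicts. The reordered schedule groups the requests by row and turns two of those conflicts into hits, which is also why the latency of any single request depends on what else is in the queue.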

See it in action: Check out "What Is RAM & RAM Timing Explained by @Jayztwocents at Micro Center" by Micro Center to see this theory applied.

Key Takeaway: DRAM isn't a flat array — it's a hierarchy of banks with row buffers, and whether your access is a row hit, row empty, or row conflict can swing latency by 3x and bandwidth by 5x.
