Cache Lines and Memory Alignment

2026-04-21

Every time your CPU reads a single byte from memory, it actually fetches an entire cache line — typically 64 bytes on x86 and most ARM processors. Understanding this mechanism is the difference between code that flies and code that crawls.

The CPU cache hierarchy (L1, L2, L3) operates exclusively in cache-line-sized chunks. When you access address 0x1000, the hardware fetches bytes 0x1000–0x103F into a single cache line. This has two critical consequences for how you write low-level code:

1. Spatial locality pays off. If you iterate through a contiguous array, the first access pays for the miss and pulls in a full 64-byte line. For an array of int32_t, that's 16 elements per cache line, so sequential traversal sees roughly 15 hits for every miss. Contrast this with a linked list whose nodes are scattered across the heap — every pointer chase is likely a fresh cache miss costing 50–100+ nanoseconds if it goes all the way to DRAM.

2. Alignment prevents straddling. A uint64_t at address 0x103E spans two cache lines (bytes 0x103E–0x1045 cross the boundary at 0x1040). The CPU must now fetch two cache lines and stitch the value together. On some ARM cores, this is an outright fault. On x86, it works but costs a penalty. The rule: a datum of size N should sit at an address divisible by N (up to the cache line size).

Real-world example: struct padding. The compiler inserts invisible padding bytes so each field lands on its natural alignment, which means field order matters: a carelessly ordered struct that compiles to 16 bytes can shrink to 12 simply by grouping fields from largest to smallest.

That 25% size reduction means more structs per cache line, fewer misses, and measurably faster iteration over large arrays.

Rule of thumb for estimation: L1 cache hit ≈ 1 ns, DRAM fetch ≈ 100 ns. If your inner loop causes one cache miss per iteration across 10 million elements, that's roughly 10M × 100 ns = 1 second of pure memory stall. Making it cache-friendly can cut that to 10M × 1 ns = 10 ms — a 100× improvement.

False sharing is the multi-threaded trap. If two threads write to different variables that share a cache line, the cores constantly invalidate each other's copy — the line ping-pongs between caches. The fix: pad or align hot per-thread variables to 64-byte boundaries using __attribute__((aligned(64))) or alignas(64).

See it in action: check out How Cache Works Inside a CPU by BitLemon for this theory applied in practice.

Key Takeaway: The CPU always moves data in 64-byte cache lines, so organizing your data for contiguous, aligned access — and avoiding false sharing across threads — is one of the highest-leverage performance optimizations available at the systems level.