NUMA: Why Memory Location Matters More Than Memory Speed

2026-05-02

On a single-socket machine, all RAM is equally far from the CPU. Add a second socket and that assumption shatters. Non-Uniform Memory Access (NUMA) means each CPU socket has "local" memory attached to it, and accessing memory attached to a different socket costs more — typically 1.5–2x the latency.

A modern dual-socket server is organized into NUMA nodes. Each node is a CPU socket plus its locally-attached DIMMs. The nodes are connected by an interconnect (Intel's UPI, AMD's Infinity Fabric). When CPU 0 reads memory attached to CPU 1, the request must traverse that interconnect — adding 40–80ns of latency on top of the ~80ns local access. That's not a rounding error; it's a 50–100% penalty on every cache miss.
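
To get a feel for those numbers, here is a small back-of-the-envelope sketch. It assumes a simple linear model (every miss pays the local latency, and a given fraction of misses also pays the interconnect hop); the function name and the model itself are illustrative, not a measurement.

```python
def average_miss_latency_ns(local_ns: float, remote_extra_ns: float,
                            remote_fraction: float) -> float:
    """Expected cache-miss latency when a fraction of misses go to the remote node."""
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    return local_ns + remote_fraction * remote_extra_ns


# Figures from the text: ~80 ns local, 40 to 80 ns extra for the remote hop.
for extra in (40, 80):
    for fraction in (0.0, 0.5, 1.0):
        avg = average_miss_latency_ns(80, extra, fraction)
        print(f"extra={extra}ns remote={fraction:.0%} -> {avg:.0f}ns "
              f"({avg / 80:.2f}x local)")
```

Even with only half the misses going remote, the average miss in this model is 25–50% slower than a purely local one.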

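You can inspect your topology right now with numactl --hardware, which lists each node's CPUs, memory size, and the distance matrix between nodes; lscpu shows a shorter summary.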
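
If you would rather read the topology from a program, the sketch below parses what Linux exposes under /sys/devices/system/node. It assumes a Linux host with that sysfs directory present; the function names are made up for this example, and on a machine without the directory it simply reports nothing.

```python
from pathlib import Path


def parse_cpulist(text: str) -> list[int]:
    """Expand a kernel CPU list such as '0-3,8-11' into individual CPU numbers."""
    cpus: list[int] = []
    for part in text.strip().split(","):
        if not part:
            continue  # memory-only nodes have an empty list
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def read_numa_topology(root: str = "/sys/devices/system/node") -> dict[int, list[int]]:
    """Map NUMA node id -> CPU numbers; empty dict if no node directories exist."""
    topology: dict[int, list[int]] = {}
    base = Path(root)
    if not base.is_dir():
        return topology
    for node_dir in base.glob("node[0-9]*"):
        cpulist = node_dir / "cpulist"
        if cpulist.is_file():
            topology[int(node_dir.name[4:])] = parse_cpulist(cpulist.read_text())
    return dict(sorted(topology.items()))


if __name__ == "__main__":
    for node, cpus in read_numa_topology().items():
        print(f"node {node}: cpus {cpus}")
```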

Linux uses a first-touch policy by default: the physical page is allocated on the node where the thread that first writes to it is running. This is why initialization patterns matter enormously. If thread 0 on node 0 initializes a giant array, then thread 1 on node 1 processes half of it, that half suffers remote-access penalties for the lifetime of the program.
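
The usual way to work with first-touch is to let each worker initialize the slice it will later process. The sketch below shows the shape of that pattern in Python, assuming Linux (where os.sched_setaffinity with pid 0 pins the calling thread) and a hypothetical layout of CPUs per node; the helper names are invented for illustration, and pinning is best effort.

```python
import mmap
import os
import threading

PAGE = mmap.PAGESIZE


def page_aligned_chunks(total_bytes, workers):
    """Split [0, total_bytes) into one byte range per worker, each starting on a page boundary."""
    pages = -(-total_bytes // PAGE)    # ceiling division: pages needed
    per_worker = -(-pages // workers)  # pages per worker, rounded up
    return [
        (min(w * per_worker * PAGE, total_bytes),
         min((w + 1) * per_worker * PAGE, total_bytes))
        for w in range(workers)
    ]


def touch_chunk(buf, start, end, cpus):
    """Pin the calling thread (best effort), then write one byte per page in its range."""
    if cpus and hasattr(os, "sched_setaffinity"):
        try:
            os.sched_setaffinity(0, cpus)  # on Linux, pid 0 means the calling thread
        except OSError:
            pass  # requested CPUs unavailable; continue unpinned
    for offset in range(start, end, PAGE):
        buf[offset] = 1  # the first write faults the page in from this thread


def first_touch_init(total_bytes, cpu_sets):
    """Allocate an anonymous buffer and let one thread per CPU set first-touch its own chunk."""
    buf = mmap.mmap(-1, total_bytes)  # anonymous mapping, not yet backed by pages
    chunks = page_aligned_chunks(total_bytes, len(cpu_sets))
    threads = [
        threading.Thread(target=touch_chunk, args=(buf, start, end, cpus))
        for (start, end), cpus in zip(chunks, cpu_sets)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return buf, chunks


if __name__ == "__main__":
    # Hypothetical layout: node 0 owns CPUs 0-3, node 1 owns CPUs 4-7.
    buf, chunks = first_touch_init(64 * PAGE, [{0, 1, 2, 3}, {4, 5, 6, 7}])
    print(chunks)
```

The worker that later processes a chunk should be pinned to the same CPU set that touched it; otherwise the careful initialization buys nothing.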

Real-world example: A database team saw a 35% throughput drop after migrating from single-socket to dual-socket machines. The buffer pool was initialized by one thread at startup, so first-touch placed every page on node 0. Half their worker threads ran on node 1 and hit remote memory on every query. The fix: launching with numactl --interleave=all ./database, so that the pages touched during the initialization phase are round-robined across nodes and each node holds an even share of the buffer pool. For tighter control, they later used mbind() to bind specific memory regions to the nodes whose threads would use them.
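
A toy model makes the difference between the two placements concrete. It is only a simulation: it assumes two nodes, uniform access to every page, and strict round-robin interleaving, and none of the function names correspond to a real API.

```python
from collections import Counter


def first_touch_single_thread(num_pages, init_node):
    """Every page lands on the node of the one thread that initialized it."""
    return [init_node] * num_pages


def interleave_placement(num_pages, nodes):
    """Round-robin pages across nodes, as an interleave policy does."""
    return [nodes[i % len(nodes)] for i in range(num_pages)]


def remote_fraction(placement, accessing_node):
    """Share of pages that are remote for a thread running on accessing_node."""
    remote = sum(1 for node in placement if node != accessing_node)
    return remote / len(placement)


single = first_touch_single_thread(1000, init_node=0)
spread = interleave_placement(1000, nodes=[0, 1])

for name, placement in (("first-touch by one thread", single),
                        ("interleaved", spread)):
    print(name, dict(Counter(placement)),
          "remote for node 0:", remote_fraction(placement, 0),
          "remote for node 1:", remote_fraction(placement, 1))
```

Interleaving does not make anything local; it trades a 0%/100% split for an even 50% on both nodes, which is why explicit binding is the next step when you know which threads use which regions.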

Rule of thumb: If your program is memory-bandwidth-bound and runs on a NUMA system, misplaced pages can cost you 30–50% of throughput. The fix is usually one of three approaches: (1) interleave with numactl --interleave=all for shared data, (2) bind threads and their memory to the same node with numactl --cpunodebind=N --membind=N, or (3) use numa_alloc_onnode() and pthread_setaffinity_np() in code for fine-grained control.
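
If you launch workers from a script, the first two approaches reduce to building the right numactl command line. The helpers below are a sketch with invented names; they only assemble the argument list using the flags quoted above and do not run anything, and ./database is the example binary from this post.

```python
import shlex


def interleave_cmd(program):
    """Wrap a command so its pages are interleaved across all nodes."""
    return ["numactl", "--interleave=all", *program]


def bind_cmd(node, program):
    """Wrap a command so its threads and memory stay on one node."""
    return ["numactl", f"--cpunodebind={node}", f"--membind={node}", *program]


print(shlex.join(interleave_cmd(["./database"])))
print(shlex.join(bind_cmd(1, ["./database"])))
```

Either list can be passed to subprocess.run on a machine where numactl is installed.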

The kernel tries to help with AutoNUMA (enabled via /proc/sys/kernel/numa_balancing), which periodically unmaps pages to detect access patterns and migrates them closer to the accessing CPU. It works, but it's reactive: it can only correct bad placement after your hot loop has already paid for it. Explicit placement usually wins for latency-sensitive workloads.
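
Before benchmarking, it is worth checking whether automatic balancing is on, since it can move pages underneath your measurements. A minimal sketch, assuming the sysctl file named above holds a single integer (0 meaning disabled, non-zero meaning enabled); the function name is illustrative.

```python
from pathlib import Path


def numa_balancing_setting(path="/proc/sys/kernel/numa_balancing"):
    """Return the integer stored in the sysctl file, or None if it is absent or unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


setting = numa_balancing_setting()
if setting is None:
    print("numa_balancing not available on this system")
else:
    print("automatic NUMA balancing is", "enabled" if setting else "disabled")
```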

See it in action: Check out "C++ Masters student doesn't know how to allocate memory on the heap." by Coding Jesus (getcracked.io) to see this theory applied.

Key Takeaway: On multi-socket systems, a thread accessing memory attached to a remote CPU socket pays 1.5–2x latency per cache miss. Bind your threads and their data to the same NUMA node, or pay a hidden tax on every memory access.