2026-04-22
Modern CPUs don't execute memory operations in the order you wrote them. The processor reorders loads and stores for performance — and on most architectures, this reordering is visible to other cores. This is the source of some of the most brutal concurrency bugs in systems programming.
Why reordering happens: A CPU has a store buffer that batches writes before committing them to cache. It also speculatively executes loads ahead of earlier instructions. On x86, stores are never reordered with other stores, and loads are never reordered with other loads — but a later load can pass an earlier store (StoreLoad reordering). ARM and RISC-V have a weaker model: they permit all four reordering types (LoadLoad, LoadStore, StoreLoad, StoreStore) unless you explicitly prevent it.
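The StoreLoad case can be made concrete with the classic store-buffer litmus test. This is an illustrative sketch (the function name `store_buffer_once` is mine, not from the text): under seq_cst atomics, the outcome r1 == 0 && r2 == 0 is forbidden, because sequential consistency rules out StoreLoad reordering; with relaxed atomics, even x86 can produce 0/0 via its store buffer.

```cpp
// Store-buffer litmus test sketch: two threads each store to one variable,
// then load the other. Under seq_cst, at least one thread must observe the
// other's store, so r1 == 0 && r2 == 0 can never happen.
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1 = 0, r2 = 0;

void store_buffer_once() {
    x.store(0);
    y.store(0);
    std::thread a([] {
        x.store(1, std::memory_order_seq_cst);
        r1 = y.load(std::memory_order_seq_cst);  // may see 0 or 1
    });
    std::thread b([] {
        y.store(1, std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_seq_cst);  // may see 0 or 1
    });
    a.join();
    b.join();
    assert(!(r1 == 0 && r2 == 0));  // forbidden under seq_cst
}
```

Swap both orders to memory_order_relaxed and the assertion can fire on real hardware, which is exactly the reordering being described.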
A concrete bug. Consider a lock-free flag pattern between two threads:
```
// Thread A          // Thread B
data = 42;           while (!ready) { }
ready = 1;           use(data);
```

Without a barrier, Thread B can observe ready == 1 and still read stale data. On ARM, the stores to data and ready can be reordered (StoreStore), and Thread B's loads can be reordered too (LoadLoad). On x86, the hardware forbids both of those reorderings, but with plain non-atomic accesses the compiler is still free to reorder them, so the code is broken there as well. Either way, correctness requires explicit ordering.
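A minimal corrected version of this pattern, using std::atomic with release/acquire ordering (the function names are mine for illustration):

```cpp
// Fixed flag pattern: release on the publishing store, acquire on the
// consuming load. The pairing guarantees the consumer sees data == 42.
#include <atomic>
#include <thread>

int data = 0;                 // plain data, ordered by the atomic flag below
std::atomic<int> ready{0};

void producer() {
    data = 42;                                  // (1) write payload
    ready.store(1, std::memory_order_release);  // (2) publish: (1) cannot sink below this
}

int consumer() {
    while (ready.load(std::memory_order_acquire) == 0) { }  // spin until published
    return data;  // the acquire load synchronizes with the release store
}
```

The release store keeps the write to data from moving after it; the acquire load keeps the read of data from moving before it.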
The fix — barrier instructions:
- x86: MFENCE (full barrier), SFENCE (store barrier), LFENCE (load barrier). In practice, x86's strong model means you mostly only need MFENCE, or a locked instruction like LOCK XCHG, for the StoreLoad case.
- ARM: DMB (data memory barrier) with variants: DMB ISH for the inner-shareable domain (typical multi-core), DMB ISHST for store-only ordering.
- C/C++: atomic_thread_fence(memory_order_seq_cst) or, better, std::atomic with the appropriate memory order. The compiler emits the right barrier for your target.

Rule of thumb for cost: a full memory barrier (MFENCE on x86, DMB ISH on ARM) costs roughly 20–80 cycles depending on contention and microarchitecture, about the same as an L2 cache miss. So barriers aren't free, but they're far cheaper than a mutex (which itself contains barriers, plus a syscall under contention). In a tight loop doing millions of iterations, unnecessary barriers add measurable overhead; placing them precisely matters.
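Standalone fences can express the same message-passing pattern as ordered atomic operations. A sketch, with hypothetical names (`send`/`receive`/`payload` are mine): the atomic accesses themselves are relaxed, and the fences supply the ordering.

```cpp
// Fence-based message passing: a release fence before the relaxed store
// pairs with an acquire fence after the relaxed load that observes it.
#include <atomic>
#include <thread>

int payload = 0;
std::atomic<int> flag{0};

void send() {
    payload = 99;
    std::atomic_thread_fence(std::memory_order_release);  // orders payload before flag
    flag.store(1, std::memory_order_relaxed);
}

int receive() {
    while (flag.load(std::memory_order_relaxed) == 0) { }
    std::atomic_thread_fence(std::memory_order_acquire);  // orders flag before payload read
    return payload;  // guaranteed 99 once the flag is seen
}
```

On ARM each fence lowers to a DMB; on x86 both acquire and release fences are compiler-only barriers, since the hardware already provides those orderings.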
Practical guidance: Prefer std::atomic with memory_order_acquire on loads and memory_order_release on stores instead of seq_cst everywhere. On ARMv7, a seq_cst store emits a full DMB both before and after the operation, while acquire/release need only one-sided barriers; ARMv8 goes further and provides dedicated load-acquire/store-release instructions (LDAR/STLR) that make them cheap. On x86, acquire and release are typically free (just compiler barriers) because the hardware already provides those guarantees — only seq_cst stores require an MFENCE or an implicitly locked instruction like XCHG.
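A spinlock is a good place to see this guidance applied. A minimal sketch (the `SpinLock` class is mine, not from the text): taking the lock needs acquire so the critical section cannot float above it, and releasing needs release so critical-section writes cannot sink below it. seq_cst would also be correct, just stronger and slower than necessary on ARM.

```cpp
// Minimal test-and-set spinlock using acquire/release instead of seq_cst.
#include <atomic>
#include <thread>

class SpinLock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        // acquire: later reads/writes cannot be reordered before taking the lock
        while (locked.exchange(true, std::memory_order_acquire)) { }
    }
    void unlock() {
        // release: earlier reads/writes cannot be reordered after releasing it
        locked.store(false, std::memory_order_release);
    }
};
```

Two threads incrementing a shared counter under this lock will never lose an update, because each unlock's release pairs with the next lock's acquire.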
