Stack Overflow Unanswered: lock free ring buffer works on x86 but breaks on ARM

lock free ring buffer works on x86 but breaks on ARM - memory barriers not helping?

2026-05-28


Tags: c, concurrency, arm, memory-barriers

Score: 1 | Views: 211

The asker has a single-producer / single-consumer ring buffer in C that runs cleanly on x86 (Intel i7) but intermittently corrupts data on a Cortex-A53. They've added memory barriers and the problem persists. This is the classic "my lock-free code works on x86 but explodes on ARM" trap.

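Why it's hard. x86 has a strong memory model (TSO): stores from a single core become visible to other cores in program order, and loads are not reordered with earlier loads. The only reordering it permits is a later load passing an earlier store to a different address. So a lot of "lock-free" code that's quietly wrong appears correct on x86 because the hardware papers over missing synchronization. (The compiler can still reorder plain accesses on any architecture, so x86 only hides the hardware half of the problem.) ARMv8 has a weak memory model: loads and stores to different addresses can be reordered with each other, so one core's writes can become visible to other cores out of program order unless you tell the CPU otherwise.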
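The reordering described above is easiest to see in the two-variable "message passing" pattern that a ring buffer is built on. The following is a minimal sketch, not the asker's code: the names payload, ready and message_passing_once are hypothetical, and it assumes a C11 toolchain with POSIX threads.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static int payload;        /* plain data, published via the flag below */
static atomic_int ready;   /* 0 = not published, 1 = published */

static void *writer(void *arg) {
    (void)arg;
    payload = 42;  /* write the data first */
    /* release: the payload write cannot become visible after this store */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static void *reader(void *arg) {
    /* acquire: the payload read below cannot be satisfied before this load */
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;  /* spin until the writer has published */
    *(int *)arg = payload;
    return NULL;
}

/* Runs one writer/reader round and returns the value the reader observed. */
int message_passing_once(void) {
    pthread_t w, r;
    int seen = -1;
    payload = 0;
    atomic_store(&ready, 0);
    int rc = pthread_create(&r, NULL, reader, &seen);
    assert(rc == 0);
    rc = pthread_create(&w, NULL, writer, NULL);
    assert(rc == 0);
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return seen;
}
```

With the release/acquire pair in place, the reader is guaranteed to see the payload once it sees the flag. If both orderings were weakened to memory_order_relaxed, or the flag were a plain int, the read of payload would be a data race: x86 hardware would usually still hide it, while an ARMv8 core is allowed to expose it.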

Most likely root causes.

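Wrong barrier on the wrong side. For an SPSC ring you need release semantics when the producer publishes a new head (so the payload writes are visible before the index update) and acquire semantics when the consumer reads head (so payload reads happen after seeing the new index). The same pairing applies in the other direction: the consumer publishes tail with release and the producer reads tail with acquire, so a slot is not overwritten while it is still being read. A single __sync_synchronize() on one side isn't enough.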
Plain (non-atomic) loads/stores of the indices. The compiler can hoist, fuse, or tear them. Barriers are useless if the access itself isn't atomic. Use atomic_load_explicit/atomic_store_explicit with memory_order_acquire and memory_order_release.
False sharing. Head and tail on the same 64-byte cache line cause ping-ponging. That costs throughput rather than correctness, but it changes timing enough to mask ordering bugs, so a ring that "works" may only be working because of its layout.
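The last two points can be addressed in the declaration of the indices. This is a sketch under assumptions: the struct name ring_indices and the CACHE_LINE constant are hypothetical, and 64 bytes is the line size quoted above rather than something queried from the hardware.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>

#define CACHE_LINE 64  /* assumption: 64-byte cache lines */

struct ring_indices {
    /* _Atomic stops the compiler from hoisting, fusing, or tearing accesses;
       alignas gives each index a cache line of its own. */
    alignas(CACHE_LINE) _Atomic size_t head;  /* written by the producer only */
    alignas(CACHE_LINE) _Atomic size_t tail;  /* written by the consumer only */
};

/* Compile-time check that the two indices cannot share a line. */
static_assert(offsetof(struct ring_indices, tail) -
              offsetof(struct ring_indices, head) >= CACHE_LINE,
              "head and tail must not share a cache line");
```

Padding trades a little memory for independence between the two threads. It does not fix any ordering bug by itself; it only stops the layout from hiding one.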

Direction to a fix. Rewrite with C11 atomics:

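// producer: assumes the caller has checked for n free bytes and that the copy does not wrap
memcpy(&buf[head], src, n);
// release: the payload written above becomes visible before the new head
atomic_store_explicit(&g_head, (head + n) & mask, memory_order_release);

// consumer
// acquire: pairs with the producer's release store, so the payload is readable
size_t h = atomic_load_explicit(&g_head, memory_order_acquire);
if (h != tail) { memcpy(dst, &buf[tail], ...); ... }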

The acquire/release pair typically compiles to LDAR/STLR on ARMv8 (AArch64), which is exactly what you need, and it is usually cheaper than a full DMB ISH.
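For reference, here is a fuller sketch of an SPSC ring that applies acquire/release in both directions. It is an illustration under assumptions, not the asker's implementation: the names spsc_ring, spsc_init, spsc_push, spsc_pop and RING_CAP are hypothetical, elements are single ints rather than byte ranges, and the indices run freely and are masked only when indexing the buffer, which lets full and empty be told apart without wasting a slot.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_CAP  8u               /* hypothetical capacity; must be a power of two */
#define RING_MASK (RING_CAP - 1u)

typedef struct {
    int buf[RING_CAP];
    alignas(64) _Atomic size_t head;  /* next slot to write; producer-owned */
    alignas(64) _Atomic size_t tail;  /* next slot to read; consumer-owned */
} spsc_ring;

void spsc_init(spsc_ring *r) {
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
}

/* Producer side. Returns false when the ring is full. */
bool spsc_push(spsc_ring *r, int value) {
    /* Own index: only this thread writes it, so relaxed is enough. */
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    /* Acquire pairs with the consumer's release store of tail, so the
       consumer has finished reading a slot before it is overwritten. */
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_CAP)
        return false;
    r->buf[head & RING_MASK] = value;
    /* Release: the payload write above is visible before the new head. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Consumer side. Returns false when the ring is empty. */
bool spsc_pop(spsc_ring *r, int *out) {
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    /* Acquire pairs with the producer's release store of head. */
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail)
        return false;
    *out = r->buf[tail & RING_MASK];
    /* Release: the read above completes before the slot is handed back. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}
```

The intended use is one thread calling only spsc_push and another calling only spsc_pop on the same spsc_ring. Each thread reads its own index with relaxed ordering because nobody else writes it; only the cross-thread reads and the publishing stores need acquire and release.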

Gotchas. Cortex-A53 is in-order but still reorders memory operations (it's the memory subsystem, not the pipeline, that matters). If the ring buffer is shared with a peripheral or DMA engine, you also need DMB OSH / DSB and possibly non-cacheable mappings, because acquire/release only synchronize between CPU cores. And don't trust testing: a working run proves nothing on weak-memory hardware. Run TSan (-fsanitize=thread), which can report unsynchronized accesses as data races even on x86, and stress-test on the ARM target for long runs with producer and consumer pinned to different cores.

The challenge: Code that "passes" on x86 isn't lock-free-correct — it's just running on hardware generous enough to forgive missing acquire/release semantics that ARM exposes ruthlessly.
