2026-04-29
You already understand memory barriers and ordering. Now we look at the hardware primitives that make lock-free programming possible: atomic operations, and specifically the king of them all, compare-and-swap (CAS).
A regular read-modify-write on a shared variable is not safe. Between your load and your store, another core can intervene. Atomic operations solve this by making the CPU guarantee that the entire read-modify-write happens as one indivisible unit, visible to all cores.
How CAS works at the hardware level: On x86, the LOCK CMPXCHG instruction locks the cache line, compares the value at a memory address with an expected value, and if they match, swaps in a new value — all atomically. On ARM, you use an LDXR/STXR (load-exclusive/store-exclusive) pair: the store-exclusive fails if another core touched that cache line between your load and store, and you retry in a loop.
Real-world example — a lock-free stack push:
top pointer.next to point to that top.top is still what I read, replace it with my new node."This pattern eliminates mutexes entirely. The Linux kernel uses it extensively in its atomic_t type for reference counting, and cmpxchg() appears throughout the scheduler and memory allocator.
The cost: A LOCK-prefixed instruction on x86 costs roughly 10–100 ns depending on contention, because it must acquire exclusive ownership of the cache line via the MESI protocol. Under no contention, expect ~10–20 ns. Under heavy contention with 8 cores hammering the same line, expect 80+ ns due to cache-line bouncing. Rule of thumb: an uncontended atomic costs about 10x a plain memory access (~1 ns), but is still 10–50x cheaper than a full mutex lock/unlock cycle (~200–500 ns with a kernel round-trip).
Pitfalls to know:
CMPXCHG16B does 128-bit CAS for exactly this purpose).STXR can fail even without real contention (cache eviction, interrupt). Always use a retry loop.In C11, use atomic_compare_exchange_strong() or __atomic_compare_exchange_n() in GCC. Prefer _strong on x86 (no spurious failures) and _weak in loops on ARM (cheaper per attempt).
