Daily Low-Level Programming: Atomic Operations and Compare-and-Swap

2026-04-29

You already understand memory barriers and ordering. Now we look at the hardware primitives that make lock-free programming possible: atomic operations, and specifically the king of them all, compare-and-swap (CAS).

A regular read-modify-write on a shared variable is not safe. Between your load and your store, another core can intervene. Atomic operations solve this by making the CPU guarantee that the entire read-modify-write happens as one indivisible unit, visible to all cores.
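To make the difference concrete, here is a minimal sketch that assumes POSIX threads are available. The names (run_counter_demo, fused_counter, split_counter) and the thread and iteration counts are made up for illustration. One counter is bumped with a single atomic read-modify-write; the other is bumped with a separate atomic load and atomic store, which leaves exactly the gap described above without introducing an undefined data race.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define NUM_THREADS 4        /* hypothetical demo values */
#define INCREMENTS  100000

static atomic_long fused_counter;  /* incremented as one indivisible step */
static atomic_long split_counter;  /* incremented as load, then store */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < INCREMENTS; i++) {
        /* One atomic read-modify-write: no other core can get in between. */
        atomic_fetch_add(&fused_counter, 1);

        /* Two separate atomic steps: another thread may store between the
         * load and the store, and that update is then overwritten. */
        long seen = atomic_load(&split_counter);
        atomic_store(&split_counter, seen + 1);
    }
    return NULL;
}

/* Runs the workers, returns the fused counter, and reports the split
 * counter through split_out (if non-NULL). */
long run_counter_demo(long *split_out)
{
    pthread_t threads[NUM_THREADS];
    int started = 0;

    atomic_store(&fused_counter, 0);
    atomic_store(&split_counter, 0);

    while (started < NUM_THREADS &&
           pthread_create(&threads[started], NULL, worker, NULL) == 0)
        started++;
    for (int i = 0; i < started; i++)
        pthread_join(threads[i], NULL);

    if (split_out != NULL)
        *split_out = atomic_load(&split_counter);
    return atomic_load(&fused_counter);
}
```

If all threads start, the fused counter always ends at NUM_THREADS * INCREMENTS. The split counter can never exceed that and, on a multi-core machine, will usually fall short because of lost updates.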

How CAS works at the hardware level: On x86, the LOCK CMPXCHG instruction takes exclusive ownership of the cache line, compares the value at a memory address with an expected value, and if they match, stores a new value, all as one atomic step. On AArch64, the classic approach is an LDXR/STXR (load-exclusive/store-exclusive) pair: the store-exclusive fails if the exclusive monitor was cleared between your load and store (for example, because another core wrote to that location), and you retry in a loop. ARMv8.1 and later also offer single-instruction CAS through the LSE extension.
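In C terms, the contract looks roughly like the sketch below. cas_model is a plain, non-atomic reference model written only to show the semantics; cas_int is a thin, hypothetically named wrapper around the real C11 call, which the compiler lowers to the instructions described above.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Reference model ONLY: what CAS does, written as ordinary C.
 * This version is NOT atomic; the hardware does all of this in one step. */
bool cas_model(int *addr, int *expected, int desired)
{
    if (*addr == *expected) {
        *addr = desired;      /* values matched: install the new value */
        return true;
    }
    *expected = *addr;        /* mismatch: report what was actually there */
    return false;
}

/* The real operation, using C11 <stdatomic.h>. */
bool cas_int(atomic_int *addr, int *expected, int desired)
{
    return atomic_compare_exchange_strong(addr, expected, desired);
}
```

Note that on failure the C11 function writes the observed value back into expected, which is what makes retry loops convenient: the next attempt already has the fresh value.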

Real-world example — a lock-free stack push:

1. Load the current top pointer.
2. Set your new node's next to point to that top.
3. CAS: "If top is still what I read, replace it with my new node."
4. If CAS fails (another thread pushed first), loop and retry.
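The steps above can be sketched in C11 as follows. The struct and function names (node, lf_stack, lf_init, lf_push) are hypothetical, and the sketch covers push only; a safe pop needs extra care because of the ABA problem and memory reclamation.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical node and stack types for this sketch. */
struct node {
    int          value;
    struct node *next;
};

struct lf_stack {
    _Atomic(struct node *) top;
};

void lf_init(struct lf_stack *s)
{
    atomic_init(&s->top, NULL);
}

/* Pushes value onto the stack; returns false only if malloc fails. */
bool lf_push(struct lf_stack *s, int value)
{
    struct node *n = malloc(sizeof *n);
    if (n == NULL)
        return false;
    n->value = value;

    /* Load the current top and point the new node at it. */
    n->next = atomic_load_explicit(&s->top, memory_order_relaxed);

    /* CAS: if top is still n->next, replace it with n. On failure the call
     * refreshes n->next with the current top, so the loop simply retries.
     * Release ordering publishes the node's fields to whoever reads top. */
    while (!atomic_compare_exchange_weak_explicit(
               &s->top, &n->next, n,
               memory_order_release, memory_order_relaxed)) {
        /* another thread pushed first (or the weak CAS failed spuriously) */
    }
    return true;
}
```

The weak variant is used here because the call already sits in a retry loop, so an occasional spurious failure costs only one more iteration.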

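This pattern lets threads update the stack without taking a mutex. The Linux kernel leans on the same family of primitives: its atomic_t type is used widely for reference counting (mostly through atomic increments and decrements rather than CAS loops), and cmpxchg() appears throughout the scheduler and memory allocator.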

The cost: A LOCK-prefixed instruction on x86 costs roughly 10–100 ns depending on contention, because the core must hold the cache line in an exclusive state under the coherence protocol (MESI or a variant of it). With no contention, expect about 10–20 ns. With 8 cores hammering the same line, expect 80 ns or more due to cache-line bouncing. Treat these figures as rough; they vary by microarchitecture. Rule of thumb: an uncontended atomic costs about 10x a plain cached memory access (~1 ns), but is still 10–50x cheaper than a mutex lock/unlock cycle that has to enter the kernel (~200–500 ns). An uncontended mutex avoids that round-trip, since its fast path is typically just an atomic operation in user space.

Pitfalls to know:

The ABA problem: CAS checks value equality, not identity. If a value changes from A→B→A, your CAS succeeds when it shouldn't. Solution: use a version counter alongside the pointer (a "tagged pointer" — x86-64's CMPXCHG16B does 128-bit CAS for exactly this purpose).
Spurious failures on ARM: STXR can fail even without real contention (cache eviction, interrupt). Always use a retry loop.
False sharing: Two independent atomics on the same 64-byte cache line will contend. Pad your atomic variables to cache-line boundaries.
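Here is a small sketch of two of these fixes. For ABA, it packs a 32-bit index and a 32-bit version counter into one 64-bit word so that an ordinary 64-bit CAS can check both; this is a portable stand-in for the pointer-plus-counter scheme that CMPXCHG16B enables, not the 128-bit version itself. For false sharing, it aligns a counter to an assumed 64-byte line. All names (tag_pack, tagged_update, aba_demo, padded_counter) are hypothetical.

```c
#include <stdalign.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumption: 64-byte cache lines */

/* Low 32 bits: index (stands in for a pointer). High 32 bits: version. */
uint64_t tag_pack(uint32_t index, uint32_t version)
{
    return ((uint64_t)version << 32) | index;
}
uint32_t tag_index(uint64_t tagged)   { return (uint32_t)tagged; }
uint32_t tag_version(uint64_t tagged) { return (uint32_t)(tagged >> 32); }

/* CAS that installs new_index and bumps the version on every success. */
bool tagged_update(_Atomic uint64_t *slot, uint64_t *expected,
                   uint32_t new_index)
{
    uint64_t desired = tag_pack(new_index, tag_version(*expected) + 1);
    return atomic_compare_exchange_strong(slot, expected, desired);
}

/* Replays A -> B -> A in one thread. Returns true if the plain CAS was
 * fooled by the stale value while the tagged CAS rejected it. */
bool aba_demo(void)
{
    enum { A = 1, B = 2, C = 3 };
    atomic_uint      naive  = A;
    _Atomic uint64_t tagged = tag_pack(A, 0);

    /* "Thread 1" takes its snapshots, then stalls. */
    unsigned stale_naive  = atomic_load(&naive);
    uint64_t stale_tagged = atomic_load(&tagged);

    /* "Thread 2" changes the value A -> B -> A. */
    atomic_store(&naive, B);
    atomic_store(&naive, A);
    uint64_t cur = atomic_load(&tagged);
    tagged_update(&tagged, &cur, B);          /* (A, v0) -> (B, v1) */
    cur = atomic_load(&tagged);
    tagged_update(&tagged, &cur, A);          /* (B, v1) -> (A, v2) */

    /* "Thread 1" resumes with its stale snapshots. */
    bool naive_ok  = atomic_compare_exchange_strong(&naive, &stale_naive, C);
    bool tagged_ok = tagged_update(&tagged, &stale_tagged, C);
    return naive_ok && !tagged_ok;
}

/* One counter per cache line, so neighbors in an array do not contend. */
struct padded_counter {
    alignas(CACHE_LINE) atomic_long value;
};
_Static_assert(sizeof(struct padded_counter) == CACHE_LINE,
               "padded_counter should occupy exactly one cache line");
```

A 32-bit version can eventually wrap around, so this narrows the ABA window rather than closing it outright; a wider counter makes a wrap less likely.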

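In C11, use atomic_compare_exchange_strong() or atomic_compare_exchange_weak() from <stdatomic.h>; GCC and Clang also provide the __atomic_compare_exchange_n() builtin. On failure, all of them write the observed value back into the expected argument. Choose by usage rather than by architecture: use _weak inside retry loops (it may fail spuriously, but it maps directly onto LDXR/STXR on ARM and is cheaper per attempt, while on x86 it compiles to the same LOCK CMPXCHG as _strong), and use _strong when a single attempt must not fail spuriously.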
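As an illustration of the _weak-in-a-loop idiom, here is a sketch of an "atomic maximum" helper. The name fetch_max_int is hypothetical and the memory orders are one reasonable choice, not the only one.

```c
#include <stdatomic.h>

/* Raises *target to at least candidate.
 * Returns the value observed just before the update (or the current
 * value, if it was already >= candidate and nothing was written). */
int fetch_max_int(atomic_int *target, int candidate)
{
    int observed = atomic_load_explicit(target, memory_order_relaxed);

    /* Stop as soon as the stored value is already large enough.
     * Otherwise try to install candidate; a failed attempt (real or
     * spurious) refreshes observed, and the condition is re-tested. */
    while (observed < candidate &&
           !atomic_compare_exchange_weak_explicit(
               target, &observed, candidate,
               memory_order_acq_rel, memory_order_relaxed)) {
        /* retry with the refreshed value */
    }
    return observed;
}
```

Because the loop re-checks the condition on every pass, a spurious failure is harmless here, which is what makes the weak variant a good fit.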

See it in action: Check out Process Synchronization - Compare and Swap Instruction by Dr. Vinod D to see this theory applied.

Key Takeaway: Compare-and-swap (or its load-exclusive/store-exclusive equivalent) is the fundamental hardware primitive behind most lock-free data structures. It turns a conditional update into a single atomic operation, but you must account for ABA, contention costs, and architecture-specific retry semantics.
