Daily Low-Level Programming: Read-Copy-Update (RCU): Lock-Free Reads at Production Scale

2026-05-05

RCU is the synchronization primitive that lets the Linux kernel scale to thousands of cores. It separates readers from writers entirely: readers pay close to zero synchronization cost, with no locks, no atomic read-modify-write operations, and no memory barriers on most architectures. Writers do all the work.

The core trick: never mutate shared data in place. Instead:

Reader: dereference a pointer, use the data, done.
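Writer: allocate a new copy, modify it, publish it by swapping the pointer (a single aligned store with release ordering, rcu_assign_pointer() in the kernel, so the new copy's contents become visible before the pointer does), then wait until all pre-existing readers finish before freeing the old version.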
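
The reader and writer steps above can be sketched in userspace C11. This is a minimal sketch, not kernel code: struct route, global_route, reader_next_hop() and writer_set_next_hop() are hypothetical names, acquire/release atomics stand in for the kernel's rcu_dereference()/rcu_assign_pointer(), a single writer is assumed, and the wait for readers is left to the caller.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical shared object; field names are illustrative only. */
struct route {
    int next_hop;
    int metric;
};

/* The single shared pointer that readers dereference (starts as NULL). */
static _Atomic(struct route *) global_route;

/* Reader: one load of the pointer, then plain reads through it. */
int reader_next_hop(void)
{
    struct route *r = atomic_load_explicit(&global_route, memory_order_acquire);
    return r ? r->next_hop : -1;
}

/* Writer: copy, modify the private copy, publish with a release store.
 * On success returns 0 and hands the old version back through old_out.
 * The caller must not free it until all pre-existing readers are done
 * (that wait is not modelled here). Assumes one writer at a time. */
int writer_set_next_hop(int next_hop, struct route **old_out)
{
    struct route *old = atomic_load_explicit(&global_route, memory_order_relaxed);
    struct route *fresh = malloc(sizeof(*fresh));
    if (!fresh)
        return -1;
    if (old)
        *fresh = *old;          /* copy the current version */
    else
        fresh->metric = 0;      /* first version: nothing to copy */
    fresh->next_hop = next_hop; /* modify the private copy only */
    atomic_store_explicit(&global_route, fresh, memory_order_release);
    *old_out = old;
    return 0;
}
```

Readers that loaded the old pointer before the store keep seeing a consistent old version; readers that load after it see the fully initialized new one. Nothing is ever modified in place.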

That "wait until readers finish" step is called a grace period. The kernel determines it has elapsed once every CPU has passed through a quiescent state, typically a context switch, the idle loop, or a return to user space. In the classic non-preemptible configuration, if every CPU has scheduled at least once since the pointer swap, no thread can still hold the old pointer: reading it requires being inside an rcu_read_lock()/rcu_read_unlock() critical section, which disables preemption and therefore prevents context switches. Preemptible kernels (CONFIG_PREEMPT_RCU) let readers be preempted and instead track, per task, whether it is still inside a read-side critical section.
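
The quiescent-state rule can be illustrated with a toy, single-threaded simulation. This is a sketch under stated assumptions, not how the kernel implements it (the real implementation uses a hierarchical tree of per-CPU state): NR_CPUS, qs_count, struct gp_snapshot, note_quiescent_state(), gp_start() and gp_elapsed() are all hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4   /* hypothetical CPU count for the simulation */

/* Per-CPU count of quiescent states (context switch, idle, user return). */
static unsigned long qs_count[NR_CPUS];

/* Snapshot the writer takes right after swapping the pointer. */
struct gp_snapshot {
    unsigned long seen[NR_CPUS];
};

/* Called by the simulated scheduler whenever a CPU is quiescent. */
void note_quiescent_state(int cpu)
{
    qs_count[cpu]++;
}

/* Writer: record where every CPU stands at the time of the swap. */
void gp_start(struct gp_snapshot *snap)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        snap->seen[cpu] = qs_count[cpu];
}

/* The grace period has elapsed once every CPU has moved past its snapshot. */
bool gp_elapsed(const struct gp_snapshot *snap)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        if (qs_count[cpu] == snap->seen[cpu])
            return false;   /* this CPU may still be inside a reader */
    return true;
}
```

A writer would call gp_start() after the swap and free the old version only once gp_elapsed() returns true. A single CPU that has not yet reported a quiescent state is enough to hold the grace period open.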

Concrete example, kernel routing table lookup: every packet on a 100Gbps NIC needs a route lookup. With a reader-writer lock, even a read-side acquisition writes to the lock word, so the cache line holding the lock bounces between cores on every lookup, capping you at maybe 10M lookups/sec across the box. With RCU, each core reads its own cached copy of the routing table pointer, with no coherence traffic as long as the pointer is unchanged. Routes are updated rarely (BGP convergence, seconds apart), so the writer cost is irrelevant. Result: lookups scale close to linearly with cores.
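
To put the 100Gbps figure in perspective, the packet rate follows from standard Ethernet framing: each frame carries 20 extra bytes on the wire (preamble, start delimiter, inter-frame gap). The helper below is an illustrative sketch; packets_per_second() and ETH_WIRE_OVERHEAD_BYTES are names made up for this example.

```c
#include <assert.h>

/* Per-frame overhead on the wire: 7-byte preamble + 1-byte start
 * delimiter + 12-byte inter-frame gap. */
#define ETH_WIRE_OVERHEAD_BYTES 20

/* Packets per second at a given line rate and frame size
 * (frame size includes the 4-byte FCS). */
double packets_per_second(double link_bits_per_sec, unsigned frame_bytes)
{
    double bits_per_frame = 8.0 * (frame_bytes + ETH_WIRE_OVERHEAD_BYTES);
    return link_bits_per_sec / bits_per_frame;
}
```

With minimum-size 64-byte frames, packets_per_second(100e9, 64) comes to roughly 148.8 million packets per second. With full 1518-byte frames it is about 8.1 million. A lookup path that tops out near 10M lookups/sec therefore cannot keep up with small-packet traffic at line rate.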

The reader fast path in pseudo-C:

rcu_read_lock(); → on non-preemptible kernels just preempt_disable(): at most a single increment of the preemption counter
p = rcu_dereference(global_ptr); → a plain load (plus a compiler barrier; on Alpha, a real barrier)
use p
rcu_read_unlock(); → preempt_enable(), the matching decrement

Rule of thumb: RCU wins when reads outnumber writes by at least 10:1 and readers are short. It loses when writers are frequent (grace periods stack up, memory reclaim falls behind) or when readers need to block (classic RCU readers can't sleep, though SRCU, "sleepable RCU", and the Tasks Trace variant relax this).

Userspace: liburcu gives you the same primitives. LTTng and Knot DNS use it for hot read paths, and DPDK ships its own QSBR-based RCU library for the same purpose.

The gotcha: grace periods can be tens of milliseconds. If you free a 1GB structure protected by RCU, that memory sits unreclaimed until the grace period elapses. Under memory pressure, this can OOM you while "free" memory is technically pending reclamation.
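
For a steady stream of updates, a rough estimate of the memory waiting on grace periods is update rate × bytes retired per update × grace-period length. The function below is a back-of-the-envelope sketch of that arithmetic; rcu_pending_bytes() and its parameters are hypothetical, and it assumes a constant update rate and a fixed grace-period duration.

```c
#include <assert.h>
#include <stdint.h>

/* Approximate bytes retired but not yet reclaimable at any moment,
 * assuming a constant update rate and a fixed grace-period length. */
uint64_t rcu_pending_bytes(uint64_t updates_per_sec,
                           uint64_t bytes_per_update,
                           uint64_t grace_period_ms)
{
    return updates_per_sec * bytes_per_update * grace_period_ms / 1000;
}
```

For example, 50,000 updates per second that each retire a 4096-byte object, with a 30 ms grace period, leave about 6.1 MB pending at any time: rcu_pending_bytes(50000, 4096, 30) returns 6144000. The figure grows linearly with each of the three factors, which is why frequent writers or large objects make the backlog a concern.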

Key Takeaway: RCU trades deferred memory reclamation for nearly free reads, making it the right primitive whenever read throughput matters more than write latency.
