The Leader Election Pattern: Picking One Node to Do the Singleton Work

2026-05-22

You scaled your service to five replicas for availability. Great. Now you need to run a nightly cleanup job, poll a third-party API every minute, or process a queue that demands strict ordering. If all five replicas do it, you get duplicate work, race conditions, and rate-limit violations. If you hardcode "only replica-0 does it," you've lost your availability the moment replica-0 dies.

Leader election solves this: the cluster agrees on exactly one node to perform the singleton work, and automatically picks a new one when that node dies.

How it works in practice. Every node tries to acquire a lease — a time-bounded lock in a coordination store (etcd, ZooKeeper, Consul, Redis with Redlock, or a database row with a TTL). The winner becomes the leader, periodically renews the lease, and does the singleton work. Losers sit in standby, retrying the acquisition. If the leader crashes, its lease expires, and a standby grabs it.
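The acquire/renew/expire cycle above can be sketched with a toy, single-process lease store. This is a minimal illustration, not a real coordination client: `InMemoryLeaseStore`, `FakeClock`, and `simulate_failover` are hypothetical names invented here, and a production store (etcd, ZooKeeper, Consul) would perform the check-and-set atomically on the server side.

```python
import threading
import time
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Lease:
    holder: str
    expires_at: float


class InMemoryLeaseStore:
    """Hypothetical stand-in for a coordination store holding one lease."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._lock = threading.Lock()
        self._lease: Optional[Lease] = None

    def try_acquire(self, node_id: str, ttl: float) -> bool:
        """Take the lease if it is free or expired; renew it if we already hold it."""
        with self._lock:
            now = self._clock()
            lease = self._lease
            if lease is None or lease.expires_at <= now or lease.holder == node_id:
                self._lease = Lease(holder=node_id, expires_at=now + ttl)
                return True
            return False  # someone else holds a live lease: stay in standby

    def leader(self) -> Optional[str]:
        """Return the current holder, or None if the lease is free or expired."""
        with self._lock:
            lease = self._lease
            if lease is not None and lease.expires_at > self._clock():
                return lease.holder
            return None


class FakeClock:
    """Manually advanced clock so expiry can be shown without sleeping."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds


def simulate_failover() -> List[Optional[str]]:
    clock = FakeClock()
    store = InMemoryLeaseStore(clock)
    observed = []
    store.try_acquire("node-a", ttl=1.5)  # node-a wins the race
    store.try_acquire("node-b", ttl=1.5)  # node-b loses and stays in standby
    observed.append(store.leader())
    clock.advance(0.5)
    store.try_acquire("node-a", ttl=1.5)  # renewal: lease now expires at t=2.0
    clock.advance(2.0)                    # node-a crashes; at t=2.5 the lease has lapsed
    observed.append(store.leader())
    store.try_acquire("node-b", ttl=1.5)  # the standby's next retry succeeds
    observed.append(store.leader())
    return observed
```

Calling `simulate_failover()` returns `["node-a", None, "node-b"]`: a leader, a gap while the dead leader's lease runs out, then the standby taking over. That gap is the failover time the lease TTL controls.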

Concrete example. Kubernetes controllers use this constantly. The kube-controller-manager runs as multiple replicas for HA, but only one is "active." They race to update a Lease object in the API server every few seconds. If the active one stops renewing for ~15 seconds, another takes over. Your own services can do the same with the client-go/tools/leaderelection package — about 30 lines of setup.

The rule of thumb for lease duration. Pick a renewal interval that's roughly 10× your typical network round-trip, and a lease TTL that's roughly 3× that renewal interval. So with a 50ms RTT to etcd, renew every 500ms and lease for 1.5s. A shorter lease means faster failover but more risk of flapping under load spikes; a longer lease is more stable but slower to recover. A common production setting is renew=2s, lease=15s, which is deliberately more conservative than the 3× rule (the TTL is 7.5× the renewal interval): it accepts up to roughly 15s of downtime after a leader crash in order to avoid false failovers during GC pauses.
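The arithmetic behind that rule of thumb is small enough to write down. The sketch below works in integer milliseconds to avoid floating-point noise; `lease_timings_ms` and `ttl_to_renew_ratio` are hypothetical helper names, and the default factors of 10 and 3 are just the rule of thumb stated above, not values from any library.

```python
from typing import Tuple


def lease_timings_ms(rtt_ms: int, renew_factor: int = 10, ttl_factor: int = 3) -> Tuple[int, int]:
    """Derive (renew_interval_ms, lease_ttl_ms) from a typical round-trip time."""
    renew_ms = rtt_ms * renew_factor  # renew about every 10 round-trips
    ttl_ms = renew_ms * ttl_factor    # lease lasts about 3 renewal intervals
    return renew_ms, ttl_ms


def ttl_to_renew_ratio(renew_ms: int, ttl_ms: int) -> float:
    """How many renewal intervals fit into one lease TTL."""
    return ttl_ms / renew_ms


# The worked example from the text: 50ms RTT gives renew=500ms, lease=1500ms.
renew_ms, ttl_ms = lease_timings_ms(50)

# The production setting from the text (renew=2s, lease=15s) has a ratio of 7.5.
production_ratio = ttl_to_renew_ratio(2_000, 15_000)
```

A higher ratio means more consecutive renewals can be missed before the lease lapses, which is the slack that absorbs GC pauses and load spikes.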

The trap nobody warns you about: split-brain on long pauses. Suppose the leader gets stuck in a 20-second GC pause. Its lease expires, a new leader takes over, and then the old leader wakes up — still thinking it's the leader — and writes to your database. Two leaders, corrupted state.

The defense is fencing tokens: every lease grant comes with a monotonically increasing number. The leader sends that number with every write, and the storage layer rejects any write whose token is lower than the highest one it has already seen. etcd and ZooKeeper can supply such a number natively (for example, an etcd key revision or a ZooKeeper zxid), but the check itself has to happen in the system being written to. If your store can't enforce fencing, leader election alone is not safe for writes; it's only safe for idempotent operations or work you can afford to do twice.
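The storage-side half of fencing can be sketched as follows. This is an illustrative in-memory model under stated assumptions: `FencedStore` and `StaleTokenError` are hypothetical names, and the tokens 1 and 2 stand in for whatever monotonically increasing number your coordination store hands out with each lease grant.

```python
from typing import Any, Dict, Optional


class StaleTokenError(Exception):
    """Raised when a write carries a token older than one already seen."""


class FencedStore:
    """Hypothetical storage layer that enforces fencing tokens on writes."""

    def __init__(self) -> None:
        self._highest_token = 0
        self._data: Dict[str, Any] = {}

    def write(self, key: str, value: Any, token: int) -> None:
        # Reject any writer whose lease has been superseded by a newer grant.
        if token < self._highest_token:
            raise StaleTokenError(
                f"token {token} is older than {self._highest_token}"
            )
        self._highest_token = token
        self._data[key] = value

    def read(self, key: str) -> Optional[Any]:
        return self._data.get(key)


def simulate_paused_leader() -> bool:
    """Replay the GC-pause scenario; return True if the stale write was rejected."""
    store = FencedStore()
    store.write("job", "started by old leader", token=1)  # old leader, lease #1
    # Old leader pauses; its lease expires; a new leader is granted lease #2.
    store.write("job", "taken over by new leader", token=2)
    try:
        # Old leader wakes up, still believing it holds the lease.
        store.write("job", "late write by old leader", token=1)
    except StaleTokenError:
        return store.read("job") == "taken over by new leader"
    return False
```

Here `simulate_paused_leader()` returns `True`: the late write is refused and the new leader's value survives. Note that the old leader never has to realize it was deposed; the store protects itself.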

Key Takeaway: Leader election gives you a singleton in a redundant cluster — but without fencing tokens, a paused leader can wake up and corrupt the state a new leader has already started writing.
