2026-05-27
Read repair fixes inconsistencies when data is accessed, but what about data nobody reads? Cold keys can diverge silently for months. The anti-entropy pattern is a background process that periodically compares replicas and reconciles differences, regardless of read traffic. It's the janitor that sweeps the corners read repair never visits.
The naive approach, sending every key-value pair between replicas, is catastrophic. A node with 100GB of data would saturate the network just to verify consistency. The trick is Merkle trees: each replica builds a tree where leaves hash key ranges and internal nodes hash their children. Two replicas compare root hashes first. If they match, you're done: only a single hash crossed the wire and no key data was transferred. If they differ, you recurse only into the subtrees that disagree, until you reach the leaves whose key ranges actually need repair.
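The compare-and-recurse idea can be sketched in a few lines of Python. This is a minimal illustration, not how any particular database implements it. The names token, leaf_for, build_tree and differing_leaves are hypothetical, the 64-bit token space and SHA-256 are assumptions chosen for the example, and both trees live in one process here, whereas real replicas would exchange hashes over the network.

```python
import hashlib

TOKEN_BITS = 64  # assumption: keys are placed in a token space [0, 2**64)


def token(key: str) -> int:
    """Map a key to a position in the token space (illustrative partitioner)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


def leaf_for(key: str, num_leaves: int) -> int:
    """Each leaf owns one contiguous slice of the token space."""
    return (token(key) * num_leaves) >> TOKEN_BITS


def build_tree(data: dict[str, str], num_leaves: int) -> list[list[bytes]]:
    """Return the tree as levels: levels[0] is the leaves, levels[-1] is [root].

    num_leaves must be a power of two.
    """
    buckets: list[list[tuple[str, str]]] = [[] for _ in range(num_leaves)]
    for key, value in data.items():
        buckets[leaf_for(key, num_leaves)].append((key, value))

    leaves = []
    for bucket in buckets:
        hasher = hashlib.sha256()
        # Sort so both replicas hash the same range in the same order.
        for key, value in sorted(bucket):
            hasher.update(key.encode() + b"\x00" + value.encode() + b"\x00")
        leaves.append(hasher.digest())

    levels = [leaves]
    while len(levels[-1]) > 1:
        below = levels[-1]
        # Each internal node hashes the concatenation of its two children.
        levels.append([
            hashlib.sha256(below[i] + below[i + 1]).digest()
            for i in range(0, len(below), 2)
        ])
    return levels


def differing_leaves(a: list[list[bytes]], b: list[list[bytes]]) -> list[int]:
    """Walk down from the root, descending only into subtrees whose hashes differ."""
    mismatched = []
    stack = [(len(a) - 1, 0)]  # (level, index), starting at the root
    while stack:
        level, index = stack.pop()
        if a[level][index] == b[level][index]:
            continue  # identical subtree: nothing below needs checking
        if level == 0:
            mismatched.append(index)
        else:
            stack.append((level - 1, 2 * index))
            stack.append((level - 1, 2 * index + 1))
    return sorted(mismatched)


if __name__ == "__main__":
    replica_a = {f"user:{i}": f"v{i}" for i in range(1000)}
    replica_b = dict(replica_a)
    replica_b["user:42"] = "stale"  # one divergent key
    # Prints the index of the one leaf whose range contains "user:42".
    print(differing_leaves(build_tree(replica_a, 16), build_tree(replica_b, 16)))
```

Only the key ranges behind the returned leaf indexes would need to be streamed. When the roots match, the walk stops after a single comparison.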
Real-world example: Cassandra's nodetool repair runs anti-entropy across a cluster. Suppose nodes A and B each hold 10 million keys split into 32,768 Merkle tree leaves (~305 keys per leaf). If 100 keys diverged due to a network blip last week, the trees differ in at most 100 leaves, and in fewer (maybe 50-80) if the divergent keys cluster in the same ranges. At 80 leaves you transfer the keys in those leaves (~25,000 keys) instead of all 10 million, roughly a 400x reduction in repair traffic. Amazon's Dynamo (the system described in the 2007 paper, not to be confused with the DynamoDB service), Riak, and ScyllaDB have all used variants of this.
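The numbers in this example are easy to check. The helper below is a back-of-the-envelope sketch: repair_estimate is a hypothetical name, and it assumes keys are spread evenly across leaves and that every key in a mismatched leaf gets streamed.

```python
def repair_estimate(total_keys: int, num_leaves: int, mismatched_leaves: int) -> tuple[float, float]:
    """Return (keys streamed, reduction factor versus copying every key).

    Assumes keys are spread evenly across leaves and that a mismatched
    leaf is repaired by streaming every key it covers.
    """
    keys_per_leaf = total_keys / num_leaves
    keys_streamed = mismatched_leaves * keys_per_leaf
    reduction = total_keys / keys_streamed
    return keys_streamed, reduction


if __name__ == "__main__":
    for leaves in (50, 80, 100):
        streamed, reduction = repair_estimate(10_000_000, 32_768, leaves)
        print(f"{leaves} leaves: ~{streamed:,.0f} keys streamed, ~{reduction:.0f}x less than a full copy")
```

With 80 mismatched leaves this works out to about 24,400 keys and a 410x reduction, in line with the rounded figures above. The reduction factor is simply the leaf count divided by the number of mismatched leaves, so more leaves means finer-grained repair at the cost of a larger tree to build and compare.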
Rule of thumb for scheduling: run full anti-entropy within your gc_grace_seconds window (Cassandra's default is 864,000 seconds, i.e. 10 days). When you delete a key, Cassandra writes a tombstone, and that tombstone becomes eligible for garbage collection at compaction once the window has passed. If anti-entropy hasn't reconciled before then, a node that missed the delete still holds the old value with nothing left to overrule it, so the row is resurrected: the dreaded "zombie data" problem. Schedule repairs to complete in roughly half your GC grace period to leave a safety margin.
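As a rough illustration of that rule of thumb, a monitoring check might look like the sketch below. The names repair_deadline and repair_status, the status strings, and the 0.5 safety factor are assumptions made for this example. Only the 864,000-second default comes from Cassandra.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_GC_GRACE_SECONDS = 864_000  # Cassandra's default gc_grace_seconds (10 days)


def repair_deadline(last_completed: datetime,
                    gc_grace_seconds: int = DEFAULT_GC_GRACE_SECONDS,
                    safety_factor: float = 0.5) -> datetime:
    """When the next full repair should have finished, per the half-window rule."""
    return last_completed + timedelta(seconds=gc_grace_seconds * safety_factor)


def repair_status(last_completed: datetime, now: datetime,
                  gc_grace_seconds: int = DEFAULT_GC_GRACE_SECONDS,
                  safety_factor: float = 0.5) -> str:
    """Classify how stale the last completed repair is (labels are illustrative)."""
    age = (now - last_completed).total_seconds()
    if age >= gc_grace_seconds:
        return "at_risk"  # tombstones may already be purged: zombie data is possible
    if age >= gc_grace_seconds * safety_factor:
        return "overdue"  # past the safety margin, still inside the GC grace window
    return "ok"


if __name__ == "__main__":
    last = datetime(2026, 5, 1, tzinfo=timezone.utc)
    print(repair_deadline(last))
    print(repair_status(last, last + timedelta(days=6)))
```

The check is deliberately conservative: it measures from the last completed repair, so a repair that is still running does not reset the clock.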
Watch out for these pitfalls:
Anti-entropy is the eventual in "eventual consistency" doing its job. Without it, your replicas slowly rot — and the rot only surfaces when an old key is finally accessed, often during an incident at 3am.
