2026-05-24
In a distributed system, a node that has crashed and a node that's just slow look identical from the outside: silence. The heartbeat pattern turns that silence into a signal. Each node periodically sends a small "I'm alive" message to peers (or to a coordinator), and the absence of those messages within a deadline is treated as failure.
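To make the idea concrete, here is a minimal sketch of the receiving side, assuming a single process that tracks peers in memory. The class name `HeartbeatMonitor`, its methods, and the injectable `clock` parameter are all hypothetical names chosen for illustration; a real system would also need the network transport that delivers the messages.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each peer (illustrative helper)."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock       # injectable so the logic can be tested
        self.last_seen = {}      # node id -> time of the last heartbeat

    def record_heartbeat(self, node_id):
        """Call whenever an "I'm alive" message arrives from node_id."""
        self.last_seen[node_id] = self.clock()

    def suspected_dead(self):
        """Return the nodes whose silence has exceeded the deadline."""
        now = self.clock()
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


# Usage with a fake clock so the example is deterministic.
fake_now = [0.0]
monitor = HeartbeatMonitor(timeout_s=3.0, clock=lambda: fake_now[0])
monitor.record_heartbeat("a")
monitor.record_heartbeat("b")
fake_now[0] = 2.0
monitor.record_heartbeat("a")      # "a" keeps talking, "b" goes silent
fake_now[0] = 4.0
print(monitor.suspected_dead())    # ['b']
```

Note that the monitor cannot tell whether "b" crashed or is merely slow. It only knows that the deadline passed, which is exactly the trade-off the pattern makes.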
The mechanics are simple, but the tuning is where engineers get burned. You pick two numbers: the heartbeat interval (how often to send) and the timeout (how long to wait before declaring a node dead). Set them too tight and you'll false-positive every time the network hiccups, triggering needless failovers. Set them too loose and you'll keep routing traffic to a corpse.
Rule of thumb: the timeout should be at least 3× the heartbeat interval, plus a margin for network jitter. If you heartbeat every 1 second, don't declare a node dead until at least 3–5 seconds of silence. This tolerates one or two dropped packets without flapping. For cross-region links, where latency variance is higher, raise the multiplier to 5–10×.
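The arithmetic behind that rule can be sketched in a few lines. The function names and the 0.5-second default jitter margin below are assumptions made for illustration, not values from any particular system, and the loss calculation ignores delivery latency.

```python
def recommended_timeout(interval_s, multiplier=3.0, jitter_margin_s=0.5):
    """Timeout = multiplier x interval + jitter margin (rule of thumb only)."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return multiplier * interval_s + jitter_margin_s


def tolerated_losses(interval_s, timeout_s):
    """How many consecutive heartbeats can be lost before the timeout fires.

    After k lost heartbeats, heartbeat k+1 must still arrive within
    timeout_s of the last one received. Delivery latency is ignored.
    """
    return max(0, int(timeout_s // interval_s) - 1)


timeout = recommended_timeout(interval_s=1.0)
print(timeout)                           # 3.5
print(tolerated_losses(1.0, timeout))    # 2
```

With a 1-second interval and a 3.5-second timeout, two heartbeats in a row can vanish and the third still arrives in time, which is where "one or two dropped packets" comes from.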
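Real-world example: Kubernetes uses heartbeats from the kubelet to the control plane. With the long-standing default settings, each node reports in every 10 seconds (originally as a NodeStatus update; newer releases use a lightweight Lease object for the same purpose). After 40 seconds without a heartbeat, the node is marked NotReady. After 5 more minutes, pods get evicted. All of these values are configurable, and defaults have shifted between releases, so check the version you run. That long eviction delay is deliberate: a transient network blip shouldn't drain a node and reschedule 50 pods, because the cure (mass rescheduling) is worse than the disease (a brief outage).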
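The two-stage escalation described above (a cheap reaction first, an expensive one much later) can be sketched as a small function. This is not Kubernetes code; `node_action` and its parameter names are hypothetical, and the defaults simply mirror the 40-second and 5-minute figures quoted in the text.

```python
def node_action(silence_s, not_ready_after_s=40.0, evict_delay_s=300.0):
    """Map seconds of heartbeat silence to an escalation stage.

    Stage 1 (cheap, reversible): mark the node not ready.
    Stage 2 (expensive): evict its workloads, only after a further delay.
    """
    if silence_s < not_ready_after_s:
        return "ready"
    if silence_s < not_ready_after_s + evict_delay_s:
        return "not_ready"
    return "evict"


for silence in (5, 45, 339, 340):
    print(silence, node_action(silence))
# 5 ready
# 45 not_ready
# 339 not_ready
# 340 evict
```

Separating the two thresholds lets the system stop sending new work to a suspect node quickly while deferring the disruptive step until the outage is clearly not transient.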
Three pitfalls worth internalizing:
The Phi Accrual failure detector, used by Cassandra and Akka, is worth knowing about: instead of applying a hard timeout, it outputs a continuous suspicion level (phi) based on the statistical distribution of past heartbeat inter-arrival times. You set a suspicion threshold, not a timeout, and the detector adapts to actual network conditions.
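A simplified sketch of the idea follows. It assumes heartbeat gaps follow an exponential distribution, which keeps the math to one line; this is a deliberate simplification and not the exact formula used by any specific library. The class name, the window size of 100, and the threshold of 8 are illustrative assumptions.

```python
import math
from collections import deque


class PhiAccrualDetector:
    """Simplified phi accrual detector (exponential approximation)."""

    def __init__(self, window_size=100):
        self.intervals = deque(maxlen=window_size)  # recent heartbeat gaps
        self.last_arrival = None

    def heartbeat(self, now):
        """Record a heartbeat that arrived at time `now` (seconds)."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        """Suspicion level: -log10 of P(a gap at least this long is normal)."""
        if not self.intervals:
            return 0.0               # not enough history to judge
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0               # degenerate history, avoid dividing by 0
        elapsed = now - self.last_arrival
        # Exponential model: P(gap > elapsed) = exp(-elapsed / mean),
        # so -log10 of that is elapsed / (mean * ln 10).
        return elapsed / (mean * math.log(10))


detector = PhiAccrualDetector()
for t in range(6):                   # heartbeats at t = 0, 1, 2, 3, 4, 5
    detector.heartbeat(float(t))

THRESHOLD = 8.0                      # assumed suspicion threshold
print(detector.phi(6.0) > THRESHOLD)    # False: one second of silence
print(detector.phi(25.0) > THRESHOLD)   # True: twenty seconds of silence
```

Because phi is scaled by the observed mean gap, the same threshold behaves sensibly on a fast local network and on a slow cross-region link, with no per-link timeout tuning.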
