2026-05-25
In a distributed system, you can't trust wall-clock timestamps to order events. Clocks drift, NTP can step them backward, and two machines can record "the same moment" seconds apart. If you've ever seen a comment appear before the post it replies to in an event log, you've hit clock skew. Vector clocks solve this by tracking causality instead of time.
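A vector clock is a map of {node_id → counter} attached to every event. The rules are simple: (1) before a local event, a node increments its own counter; (2) when it sends a message, it attaches a copy of its clock; (3) when it receives a message, it sets each entry to the maximum of its own value and the received value, then increments its own counter.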
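A minimal sketch of the standard update rules (bump your own counter on a local event or send; take the entry-wise maximum on receive, then bump). The class name VectorClock and the methods tick and receive are illustrative choices, not taken from any particular library.

```python
class VectorClock:
    """Illustrative vector clock for one node; names are hypothetical."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counters = {}  # node_id -> counter; a missing entry means 0

    def tick(self):
        """Local event or send: increment own counter, return a copy to attach."""
        self.counters[self.node_id] = self.counters.get(self.node_id, 0) + 1
        return dict(self.counters)

    def receive(self, remote):
        """Receive: entry-wise max with the sender's clock, then count the receive."""
        for node, count in remote.items():
            self.counters[node] = max(self.counters.get(node, 0), count)
        return self.tick()


a = VectorClock("a")
b = VectorClock("b")
msg = a.tick()          # clock attached to the message a sends
after = b.receive(msg)  # b's clock once it has processed that message
```

After this exchange, b's clock records one event from a and one from b, so anything b does next is causally after a's send.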
To compare two events A and B, treat any missing entry as 0. If every entry in A ≤ the matching entry in B (and at least one is strictly less), A happened-before B; the mirror-image test tells you B happened-before A. If neither dominates, they're concurrent: you cannot order them, and your application must decide what to do.
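The comparison can be sketched as a small function over plain dicts. The function name compare and its string return values are assumptions made for illustration; entries absent from a clock are treated as 0.

```python
def compare(a, b):
    """Classify clock a relative to clock b.

    Returns "before", "after", "equal", or "concurrent".
    """
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"       # identical clocks: the same event
    if a_le_b:
        return "before"      # a happened-before b
    if b_le_a:
        return "after"       # b happened-before a
    return "concurrent"      # neither dominates: a conflict to resolve


ordered = compare({"a": 1}, {"a": 1, "b": 1})
conflict = compare({"a": 2}, {"a": 1, "b": 1})
```

Here ordered is "before" because the second clock has seen everything the first has plus one more event, while conflict is "concurrent" because each side has seen something the other has not.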
Real-world example: Amazon's Dynamo (the forerunner of DynamoDB and the model for Riak) used vector clocks on shopping carts. If two replicas accept writes during a partition, one adding milk and the other adding bread, the vector clocks are concurrent, not ordered. Dynamo doesn't pick a winner; it returns both versions to the client (Riak calls them "siblings") and lets the cart-merge logic union them. Result: no items vanish from your cart just because your phone hit one replica and your laptop hit another.
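A hedged sketch of what client-side cart merging might look like. The function merge_siblings, the (clock, cart) pair layout, and the replica names are hypothetical; this is not Dynamo's or Riak's actual API, only the union-the-items idea described above.

```python
def merge_siblings(siblings):
    """Merge concurrent cart versions.

    siblings: list of (clock, cart) pairs, where clock is a dict of
    node_id -> counter and cart is a set of item names.
    Returns one (clock, cart) pair that descends from every sibling.
    """
    merged_clock = {}
    merged_cart = set()
    for clock, cart in siblings:
        for node, count in clock.items():
            # Entry-wise max, so the result dominates each input clock.
            merged_clock[node] = max(merged_clock.get(node, 0), count)
        merged_cart |= set(cart)  # union: keep every item from every version
    return merged_clock, merged_cart


phone = ({"replica_1": 1}, {"milk"})
laptop = ({"replica_2": 1}, {"bread"})
clock, cart = merge_siblings([phone, laptop])
```

The merged clock is what the client would send back with its next write, so the store can tell that the new version supersedes both siblings. A plain union never drops items, which also means a removal made on only one side can reappear after the merge.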
Rule of thumb on size: a vector clock grows linearly with the number of distinct writers it has ever seen. With N writers and one 8-byte counter per entry, you're carrying at least N × 8 bytes of metadata per object, before counting the node IDs themselves. With 10,000 mobile clients each writing once, that's 80 KB of clock attached to a 200-byte cart. This is why production systems use dotted version vectors (which let entries be keyed by replica server rather than by client) or prune entries for nodes that haven't written in days. Keep the writer set bounded, or your metadata will dwarf your data.
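The arithmetic, plus one possible pruning policy, as a rough sketch. The helpers clock_overhead_bytes and prune, the last_write map, and the age threshold are all assumptions for illustration; real systems choose their own pruning rules, and pruning too eagerly can make ordered versions look concurrent.

```python
def clock_overhead_bytes(num_writers, counter_bytes=8):
    """Lower bound on clock size: one fixed-width counter per writer."""
    return num_writers * counter_bytes


def prune(clock, last_write, now, max_age_seconds):
    """Drop entries for nodes whose last recorded write is too old.

    last_write maps node_id -> timestamp of that node's latest write.
    Nodes with no recorded timestamp are kept.
    """
    return {
        node: count
        for node, count in clock.items()
        if now - last_write.get(node, now) <= max_age_seconds
    }


overhead = clock_overhead_bytes(10_000)  # 10,000 writers, 8 bytes each
ratio = overhead / 200                   # clock size relative to a 200-byte cart
pruned = prune({"a": 3, "b": 1}, {"a": 1000, "b": 100}, now=1100, max_age_seconds=500)
```

With these numbers the clock is 80,000 bytes, 400 times the size of the cart it describes, and the pruning call keeps only node a because node b last wrote 1,000 seconds ago.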
When to reach for them: multi-master replication, CRDTs, collaborative editing, any system where two writers can legitimately disagree and you need to detect — not hide — the conflict. When not to: single-leader systems (a monotonic sequence number is enough) or anything with strict linearizability via consensus (Raft already gives you total order).
The deeper insight: vector clocks don't tell you what time something happened. They tell you what must have happened before it. In distributed systems, that's the only kind of "when" that's actually reliable.
