Vector clocks and logical time
Lamport, vector, version vectors, hybrid logical clocks. Why wall clocks fail and how each successor fixes the gap.
Why wall clocks fail
A distributed system has N machines, each with its own clock. The clocks are not synchronized perfectly. NTP gives you maybe 10ms accuracy on a LAN, much worse on the internet. GPS+atomic gives sub-microsecond (Spanner's TrueTime).
Real issues:
- Clock drift: hardware oscillators run at slightly different rates. A cheap clock can drift 100ms/day.
- NTP adjustments: clocks can jump forward, backward, slew slowly.
- Leap seconds: rare but real (Cloudflare had a 2017 leap-second incident).
- Virtualization: VM clocks can pause and resume unpredictably.
Consequences if you use wall clock for ordering:
- Event A happens after B (by causality) but A.timestamp < B.timestamp. LWW picks the wrong winner.
- Logs from different machines can't be sorted into causal order.
- Tokens that depend on time become non-monotonic across machines.
Logical clocks solve this without requiring tight time sync.
Happens-before relation
Leslie Lamport defined the happens-before relation (->) in his 1978 paper:
- If A and B are events in the same process and A occurs first, A -> B.
- If A is a send and B is the matching receive, A -> B.
- Transitivity: if A -> B and B -> C, then A -> C.
If A -> B is false and B -> A is false, A and B are concurrent (A || B).
This relation captures causality. The challenge: how to track it efficiently?
Lamport clocks
A single counter per process. Rules:
- Before each local event, increment counter.
- When sending a message, include current counter.
- When receiving a message with counter t, set local counter = max(local, t) + 1.
Property: if A -> B, then Lamport(A) < Lamport(B).
The reverse is NOT true. Lamport(A) < Lamport(B) does not imply A -> B. They could be concurrent and just happen to have ordered counter values.
For total order: break ties using process ID. Now every event has a unique (timestamp, pid) pair, and you can sort. The total order extends the partial order of happens-before.
Lamport clocks are tiny (one integer) and good for total ordering. They are insufficient for detecting concurrency, which is why we need vectors.
Vector clocks
Each process p keeps a vector V_p of length N (number of processes). V_p[i] is p's belief about i's counter.
Rules:
- Before each local event at p, increment V_p[p].
- When sending a message at p, include V_p.
- When receiving at q with vector V_msg, for each i set V_q[i] = max(V_q[i], V_msg[i]). Then increment V_q[q].
Comparison:
- V1 happens-before V2 (V1 < V2): for all i, V1[i] <= V2[i], and there exists some i with V1[i] < V2[i].
- V1 equals V2: V1[i] == V2[i] for all i.
- V1 concurrent with V2: neither V1 < V2 nor V2 < V1.
Now: V1 < V2 if and only if event 1 happens-before event 2. Vector clocks precisely capture causality.
Cost: O(N) bytes per event, where N is number of processes. Doesn't scale to thousands of clients.
Version vectors
Vector clocks indexed by process don't fit replicated systems. Replicas, not clients, are what you care about. Version vectors are vector clocks indexed by replica.
Each replica r tracks V_r[i] for each replica i. Same rules, but the dimensionality is the number of replicas (small, say 3-5) rather than the number of clients (potentially millions).
Used in Dynamo, Riak, CouchDB, COPS. Practical and bounded.
Detecting concurrent writes
In Riak (and similar), when you write a value, the system attaches a version vector. When you read, you get the value and its version. When you write back, include the version you read.
If two clients read v1 and both write back, the system gets two writes with vectors that are concurrent. It returns both versions as siblings. The application must merge.
This avoids LWW data loss. The cost: application complexity.
Compare to LWW (DynamoDB default): concurrent writes are silently resolved by timestamp. One wins. The other is lost. Clock skew makes this unreliable.
Pruning vector clocks
A long-lived vector clock can grow if new clients join over time. Practical systems prune:
- TTL per entry: drop entries older than N hours.
- Bounded size: keep top K entries.
- Use server-assigned client IDs instead of arbitrary IDs.
Dynamo had a famous bug where vector clocks grew unbounded under high concurrency. The fix was to assign vector entries to the coordinator node, not the client.
Hybrid logical clocks
HLC (Kulkarni, Demirbas, et al, 2014) combines wall clock and logical counter. Each timestamp is (physical, logical).
Rules at process p with wall time pt:
- On local event: l' = max(pt, l, lr) where lr is the largest received logical. If l' == old, increment c. Else c = 0.
In practice (simplified):
- Compact: 64-bit integer (48 bits physical, 16 bits logical).
- Monotonic: never decreases.
- Approximates wall time when clocks are reasonably synced.
- Captures causality when they aren't.
Used by:
- CockroachDB: HLC drives MVCC timestamps.
- MongoDB: cluster time uses HLC.
- YugabyteDB: same idea.
HLC requires bounded clock skew for serializability guarantees. CockroachDB uses a max_offset setting (default 500ms). If skew exceeds this, the node halts itself.
TrueTime: Spanner's hardware trick
Google's Spanner uses TrueTime, an API that returns an interval [earliest, latest] within which now() falls. The interval width is the uncertainty, kept under 7ms by GPS and atomic clocks across datacenters.
To commit at timestamp t:
- Acquire locks.
- Pick t = TrueTime.now().latest.
- Wait until TrueTime.now().earliest > t. This commit-wait ensures t is in the past.
- Apply commit.
This gives external consistency (strict serializability). The cost: hardware infrastructure (GPS receivers, atomic clocks per datacenter) and ~7ms commit-wait latency.
Without TrueTime, you need explicit consensus (Paxos/Raft) or accept weaker guarantees. CockroachDB chose HLC and accepts weaker guarantees (serializability, not strict serializability).
Use cases
-
Causal consistency: deliver writes in causal order across replicas. Vector clocks or causal contexts.
-
Conflict detection: returning sibling versions instead of silently overwriting. Riak, Voldemort.
-
Snapshot reads in MVCC: timestamp every write, read at a consistent point. HLC enables this without tight time sync.
-
Distributed tracing: total ordering of spans across services. Lamport-like timestamps in trace contexts.
-
Operational transforms (collaborative editing): each operation has a vector clock, transforms applied in causal order.
Common bugs
- Forgetting to increment own entry on send. Cascade of incorrect comparisons.
- Comparing vectors of different dimensions. Pad with zeros or use a map.
- Treating vector clocks as ordering metadata only. They must drive merge logic.
- Using wall clock for LWW under skew. Data loss from clock-skewed concurrent writes.
- Letting vector grow unbounded. Memory and bandwidth degrade.
Practical advice
- Default to HLC (CockroachDB's approach) for new systems that need causality + bounded size.
- Use version vectors for replicated stores where you can tolerate sibling values.
- Use Lamport clocks for total ordering without causality precision (logs, trace spans).
- Use TrueTime if you have Google's infrastructure budget.
- Never use raw wall clock for conflict resolution. Always at minimum a (wall, node_id) tuple to ensure uniqueness.
What to read next
- Lamport's 1978 paper. Short, foundational.
- DDIA chapter 8 on time, ordering, and lies.
- HLC paper for the modern compact approach.
- CockroachDB blog on living without atomic clocks.
Learn more
- PaperTime, clocks, and the ordering of eventsLeslie Lamport
- ArticleDesigning Data-Intensive Applications, Chapter 8Martin Kleppmann
- PaperDynamo paperAWS
- PaperHybrid Logical Clocks paperKulkarni et al
- ArticleCockroachDB HLC blogCockroachDB