Replication strategies
Deep dive on single-leader failover, multi-leader conflict resolution, leaderless quorums, and what breaks in production.
What replication is for
Three goals, often conflicting:
- Latency: serve reads from a replica near the user. 50ms instead of 200ms cross-region.
- Availability: survive node, rack, AZ, region failures. Failover instead of downtime.
- Throughput: scale reads horizontally. Writes still bottleneck at the leader.
A fourth, sometimes: durability. More replicas = more copies of your data. AWS S3 stores objects redundantly across at least 3 AZs by default and is designed for 11 nines of durability.
The strategy you pick optimizes some of these and pays for it on others.
Single-leader: the workhorse
One node accepts writes. Changes stream to N-1 followers via a replication log (Postgres WAL, MySQL binlog, MongoDB oplog).
Replication can be:
- Statement-based: ship the SQL. Breaks on nondeterministic functions (NOW(), RAND()). MySQL legacy mode.
- WAL/physical: ship the byte-level changes. Postgres streaming replication. Tight coupling to version, no schema flexibility.
- Logical/row-based: ship the row changes. MySQL row-based binlog, Postgres logical replication. Cross-version friendly, slightly slower.
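Why statement-based replication breaks on nondeterministic functions is easy to show in a toy model. This is a minimal sketch, not how any real database implements replication: each node is a Python list, and `apply_statement` and `apply_row` are hypothetical stand-ins for replaying a statement versus applying a shipped row.

```python
import random

def apply_statement(table, rng):
    # Statement-based: every node re-executes "INSERT ... VALUES (RAND())"
    # itself, using its own local random state.
    table.append(rng.random())

def apply_row(table, value):
    # Row-based: the follower applies the concrete value the leader produced.
    table.append(value)

# Each node has its own RNG state (seeds are arbitrary, for repeatability).
leader_rng = random.Random(1)
follower_rng = random.Random(2)

leader, stmt_follower, row_follower = [], [], []

apply_statement(leader, leader_rng)            # leader executes the statement
apply_statement(stmt_follower, follower_rng)   # follower re-executes it
apply_row(row_follower, leader[-1])            # follower copies the leader's row

print("statement-based in sync:", leader == stmt_follower)
print("row-based in sync:", leader == row_follower)
```

The statement-based follower ends up with a different value from the leader, while the row-based follower matches it.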
Sync modes:
- Synchronous: leader waits for at least one follower ack. Zero data loss on leader failure. Higher commit latency.
- Asynchronous: leader returns on local commit. Can lose recent writes. Default in most setups.
- Semi-sync: hybrid. Wait for a follower ack, but fall back to async if none arrives within a timeout. MySQL's semi-sync replication does this.
For Postgres: synchronous_commit = on plus synchronous_standby_names = 'ANY 1 (sb1, sb2)' gives you sync to at least one of two standbys. This is the production-grade setup.
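To see why waiting for any one of two standbys is attractive, here is a rough latency model. It is a simplification under stated assumptions: `standby_wait_ms` is a hypothetical helper, the ack times are invented numbers, and a real commit also pays for the local WAL flush.

```python
def standby_wait_ms(ack_ms, k):
    """Time the leader waits until any k standbys have acknowledged.

    ack_ms maps standby name -> time in ms until its ack arrives.
    """
    if not 1 <= k <= len(ack_ms):
        raise ValueError("k must be between 1 and the number of standbys")
    # The k-th fastest ack is the one that releases the commit.
    return sorted(ack_ms.values())[k - 1]

# Hypothetical ack times: sb1 is in the same AZ, sb2 is having a slow moment.
acks = {"sb1": 2.0, "sb2": 45.0}

print(standby_wait_ms(acks, 1))  # wait for any one standby
print(standby_wait_ms(acks, 2))  # wait for both standbys
```

With `k=1` a single slow standby does not drag commit latency up; with `k=2` every commit waits for the slowest one.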
Failover: where everything goes wrong
Failover is the operation of promoting a follower when the leader dies. Most production incidents happen here.
Steps:
- Detect leader is dead (timeout, health check fail).
- Pick a new leader (most up-to-date follower, or via consensus).
- Reconfigure clients to talk to new leader.
- Old leader, if it comes back, must not accept writes.
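The "pick a new leader" step can be sketched as choosing the healthy follower with the highest replicated log position. This is a simplified illustration: the `Follower` type and its `replayed_lsn` field are hypothetical, and real tools also hold a lock in a consensus store before promoting.

```python
from dataclasses import dataclass

@dataclass
class Follower:
    name: str
    replayed_lsn: int  # hypothetical: last log position this follower applied
    healthy: bool

def pick_new_leader(followers):
    """Return the healthy follower that has applied the most of the log."""
    candidates = [f for f in followers if f.healthy]
    if not candidates:
        raise RuntimeError("no healthy follower to promote")
    return max(candidates, key=lambda f: f.replayed_lsn)

followers = [
    Follower("a", replayed_lsn=1040, healthy=True),
    Follower("b", replayed_lsn=1100, healthy=True),
    Follower("c", replayed_lsn=1200, healthy=False),  # furthest ahead but down
]
print(pick_new_leader(followers).name)
```

Even the best candidate can be behind the dead leader under async replication, which is where the lost writes below come from.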
What goes wrong:
- Split brain: old leader comes back, thinks it is still leader, accepts writes. Now you have two leaders with conflicting data. Fix: fencing tokens, or consensus-based leader election.
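- Lost writes: async replication means new leader is behind. GitHub October 2018: a 43-second network partition triggered a cross-region failover, both sides ended up with writes the other lacked, and reconciling them took about 24 hours of degraded service.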
- Cascade failure: replicas can't keep up with new leader, fall behind, get evicted, replication storm.
- False positives in failure detection: a slow or briefly partitioned leader looks dead. STONITH (shoot the other node in the head) is only safe with accurate failure detection. False positives cause unnecessary failovers.
Tools that do this right: Patroni for Postgres (uses etcd or Consul for leader election), Orchestrator for MySQL, MongoDB's built-in replica set.
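The fencing-token fix for split brain can be sketched in a few lines. This assumes each newly elected leader receives a strictly larger token from the election system and sends it with every write; `FencedStore` is a hypothetical storage layer, not any particular product's API.

```python
class FencedStore:
    """Storage that rejects writes carrying a stale fencing token."""

    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token, key, value):
        # A token lower than one already seen means the writer was
        # superseded by a newer leader, so the write is refused.
        if token < self.highest_token:
            return False
        self.highest_token = token
        self.data[key] = value
        return True

store = FencedStore()
print(store.write(1, "name", "Bob"))     # old leader, token 1
print(store.write(2, "name", "Robert"))  # new leader elected with token 2
print(store.write(1, "name", "Bobby"))   # old leader wakes up, still on token 1
print(store.data["name"])
```

The old leader can believe whatever it likes about its role; the storage layer refuses its writes once a higher token has been seen.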
Multi-leader: when geography matters
Two or more nodes accept writes. They replicate to each other. Used for:
- Multi-region active-active (write latency must be local).
- Multi-datacenter for resilience.
- Offline-capable clients (CouchDB, mobile).
The hard problem: write conflicts. Node A in EU writes name=Bob. Node B in US writes name=Robert. Same row, same time. Now what?
Resolution strategies:
- Last-write-wins (LWW): pick by timestamp. Easy. Loses data silently. Clock skew makes this unreliable.
- Vector clocks: detect concurrent writes, return both versions, force application to merge.
- CRDTs: data structures that merge deterministically regardless of order. G-counters, OR-sets, RGA for text.
- Application merge: surface both versions, let user resolve. Git does this.
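How vector clocks detect concurrent writes can be sketched with plain dicts. This is a minimal illustration: a clock maps node name to a counter, and `compare` is a hypothetical helper, not a library function.

```python
def compare(a, b):
    """Compare two vector clocks given as dicts of node -> counter.

    Returns "equal", "before" (a happened before b), "after", or "concurrent".
    """
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"

# EU and US each wrote without having seen the other's write.
print(compare({"eu": 1}, {"us": 1}))
# The second write had already seen the first one.
print(compare({"eu": 1}, {"eu": 1, "us": 1}))
```

A "concurrent" result is the signal to keep both versions and hand them to the application. LWW would instead drop one of them silently.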
CRDTs are a strong fit for collaborative apps. Figma's multiplayer editing is CRDT-inspired, though it relies on a central server rather than a pure CRDT. Riak supports CRDTs natively. The math: any two replicas that have received the same set of updates are in the same state, regardless of delivery order.
The catch: CRDTs only exist for specific data types. Counters, sets, registers, sequences. Building one for arbitrary application state is hard.
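A G-counter is about the smallest CRDT there is, and it shows why merges converge. The sketch below is illustrative only: each node increments its own slot, and merge takes the per-node maximum, which is commutative, associative, and idempotent.

```python
class GCounter:
    """Grow-only counter CRDT: one slot per node, merged by element-wise max."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}

    def increment(self, n=1):
        # A node only ever increases its own slot.
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Taking the max per slot makes merge order and repetition irrelevant.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

eu, us = GCounter("eu"), GCounter("us")
eu.increment(2)
us.increment(3)

eu.merge(us)
us.merge(eu)
us.merge(eu)  # merging again changes nothing

print(eu.value(), us.value())
```

Both replicas end at the same total no matter who merges first. A counter that also needs decrements requires a different structure (a PN-counter), which is the "specific data types" catch in practice.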
Leaderless: the Dynamo model
No leader. Client picks N replicas, writes to W, reads from R. Replicas use anti-entropy and read-repair to converge.
The quorum math:
- N=3, W=2, R=2: every read quorum overlaps the latest acknowledged write (W+R>N), survives 1 failure. Standard.
- N=5, W=3, R=3: survives 2 failures. More durable, slower.
- N=3, W=3, R=1: read-heavy, can't tolerate any write-time failure.
- N=3, W=1, R=1: fast and eventually consistent.
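The arithmetic behind these configurations is small enough to put in a helper. `quorum_profile` is a hypothetical function for illustration. It reports whether read and write quorums must overlap and how many replica failures each path tolerates.

```python
def quorum_profile(n, w, r):
    """Summarize a leaderless quorum configuration."""
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("w and r must be between 1 and n")
    return {
        # W + R > N forces every read set to share a replica with every write set.
        "overlap": w + r > n,
        # Writes still succeed while at least w replicas are reachable.
        "write_failures_tolerated": n - w,
        # Reads still succeed while at least r replicas are reachable.
        "read_failures_tolerated": n - r,
    }

for n, w, r in [(3, 2, 2), (5, 3, 3), (3, 3, 1), (3, 1, 1)]:
    print((n, w, r), quorum_profile(n, w, r))
```

Overlap guarantees that a read touches at least one replica holding the latest acknowledged write. It does not by itself make the system linearizable.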
Read repair: when client reads from R replicas and gets different values, it writes the newest back to the stale replicas. Synchronous repair.
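Read repair can be sketched with replicas as plain dicts. This is a simplification under an assumption: each value carries an integer version, where real systems use timestamps or vector clocks, and `read_with_repair` is a hypothetical name.

```python
def read_with_repair(replicas, key):
    """Read key from every replica given, return the newest value, fix stale copies.

    Each replica is a dict mapping key -> (version, value).
    """
    responses = [(rep, rep.get(key)) for rep in replicas]
    found = [entry for _, entry in responses if entry is not None]
    if not found:
        return None
    newest = max(found, key=lambda entry: entry[0])
    for rep, entry in responses:
        # Write the newest version back to any replica that is missing or behind.
        if entry is None or entry[0] < newest[0]:
            rep[key] = newest
    return newest[1]

r1 = {"name": (2, "Robert")}
r2 = {"name": (1, "Bob")}   # stale
r3 = {}                     # missed the write entirely

print(read_with_repair([r1, r2, r3], "name"))
print(r2, r3)
```

Only keys that actually get read are repaired this way, which is why a background anti-entropy process is still needed for cold data.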
Anti-entropy: background process compares replicas using Merkle trees and reconciles differences. Asynchronous repair.
Hinted handoff: if a replica is down at write time, a different node accepts the write and "hints" to deliver it later when the target recovers.
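Cassandra uses all three, as did the original Dynamo. DynamoDB, despite the name, is a different design: each partition has a leader replica, so it is not a leaderless store.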
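Hinted handoff can be sketched as a fallback node holding writes on behalf of a down replica. The `Node` class and the `write` and `replay_hints` helpers below are hypothetical names for illustration, not any database's API.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}
        self.hints = []  # (target_name, key, value) kept for unreachable nodes

def write(target, fallback, key, value):
    """Write to target, or leave a hint on fallback if target is down."""
    if target.up:
        target.data[key] = value
    else:
        fallback.hints.append((target.name, key, value))

def replay_hints(holder, target):
    """Deliver hints addressed to target once it is back; keep the rest."""
    remaining = []
    for name, key, value in holder.hints:
        if name == target.name and target.up:
            target.data[key] = value
        else:
            remaining.append((name, key, value))
    holder.hints = remaining

a, b = Node("a"), Node("b")
a.up = False
write(a, b, "name", "Bob")   # a is down, so b stores a hint
a.up = True
replay_hints(b, a)           # a recovers and receives the write
print(a.data, b.hints)
```

The hint lives only on the fallback node until it is delivered, so hinted handoff improves write availability but is not a substitute for read repair or anti-entropy.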
Read scaling pitfalls
The naive "add replicas to scale reads" plan has limits:
- Replication lag: replicas are always behind. For read-your-writes, you must route to primary or use a freshness token.
- Stale reads under load: when primary is overloaded, replicas fall further behind. Exactly when you need them most.
- Write amplification: every replica must apply every write. Writes don't scale by adding replicas.
- Connection sprawl: each replica is another connection pool to manage.
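The freshness-token idea for read-your-writes can be sketched as a routing decision. Assumptions here: the session remembers the log position of its last write, and each replica reports how far it has applied. `choose_node` and the LSN values are hypothetical.

```python
def choose_node(session_lsn, replica_lsns):
    """Pick a replica that has applied the session's last write, else the primary.

    session_lsn: log position of the client's most recent write.
    replica_lsns: dict of replica name -> last applied log position.
    """
    for name, applied in replica_lsns.items():
        if applied >= session_lsn:
            return name
    return "primary"

replicas = {"replica-1": 980, "replica-2": 1005}

print(choose_node(1000, replicas))  # one replica has caught up
print(choose_node(1010, replicas))  # nobody has, fall back to the primary
```

Under heavy lag every read falls back to the primary, which is the "stale reads under load" problem in a different form.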
Real solution stack:
- Caching layer (Redis) for hot reads. Hits primary only on miss.
- Read replicas for analytics and reporting, not user-facing reads.
- For user-facing scale, shard (split data across primaries).
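The caching bullet is the cache-aside pattern. A minimal sketch follows, with a dict standing in for Redis and a hypothetical `Primary` class that counts queries. Invalidation on write is deliberately left out.

```python
class Primary:
    """Stand-in for the primary database; counts how often it is queried."""

    def __init__(self, rows):
        self.rows = rows
        self.queries = 0

    def fetch(self, key):
        self.queries += 1
        return self.rows[key]

def cached_read(key, cache, primary):
    # Serve from the cache when possible; go to the primary only on a miss.
    if key in cache:
        return cache[key]
    value = primary.fetch(key)
    cache[key] = value
    return value

primary = Primary({"user:1": "Bob"})
cache = {}

for _ in range(3):
    cached_read("user:1", cache, primary)

print(primary.queries)
```

Three reads cost one primary query. A real deployment also needs a TTL or explicit invalidation so the cache does not serve stale rows indefinitely.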
CAP and consistency by strategy
| Strategy | C under partition | Failover time | Conflict handling |
|---|---|---|---|
| Single-leader sync | Yes (loses A) | 5-30s | None (no concurrent writes) |
| Single-leader async | Eventual | 5-30s, data loss | None |
| Multi-leader | Eventual | Zero | Required (LWW, CRDT, app) |
| Leaderless W+R>N | Quorum overlap (not linearizable by itself) | Zero | LWW or vector clocks |
| Leaderless W+R<=N | Eventual | Zero | LWW or vector clocks |
Common failure modes
- Replication lag spike during backup. Backup loads the primary, replicas can't keep up. Run backups from a dedicated replica.
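- Long-running query on replica blocks replication. On Postgres, WAL replay pauses when it conflicts with a running standby query, up to max_standby_streaming_delay, after which the query is canceled. Use hot_standby_feedback (at the cost of bloat on the primary) or kill long queries.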
- Network partition causes split-brain. Use a consensus system (etcd, ZooKeeper) for leader election, not heartbeats alone.
- Cross-region replication saturates link. Compress and rate-limit, or shard by region.
- Schema migration on multi-leader. Different nodes might have different schemas mid-deploy. Use expand-contract migration pattern.
Picking a strategy
Decision tree:
- Single region, single AZ? Single-leader with one sync follower. Done.
- Single region, multiple AZ for HA? Single-leader with sync followers across AZs. Patroni + etcd.
- Multi-region for latency? Either: read replicas in each region (writes go cross-region) or multi-leader with CRDTs.
- Multi-region for compliance (GDPR)? Shard by region, each region is its own single-leader cluster.
- Need zero-downtime writes during node failures? Leaderless (Cassandra, Scylla, Riak).
99% of apps should pick option 1 or 2 and not look further. The complexity of multi-leader or leaderless is rarely worth it under 100M users.
What to read next
- DDIA chapter 5. The canonical treatment.
- Dynamo paper for the original leaderless design.
- Martin Kleppmann's CRDT talks for the collaborative-app angle.
- Postgres replication docs for the production playbook.
Learn more
- Article: Designing Data-Intensive Applications, Chapter 5 (Martin Kleppmann)
- Docs: PostgreSQL replication docs (PostgreSQL)
- Paper: Dynamo paper (AWS)
- Talk: CRDTs explained (Martin Kleppmann)