Deep dive · 15 min read

Replication strategies

Deep dive on single-leader failover, multi-leader conflict resolution, leaderless quorums, and what breaks in production.

What replication is for

Three goals, often conflicting:

Latency: serve reads from a replica near the user. 50ms instead of 200ms cross-region.
Availability: survive node, rack, AZ, region failures. Failover instead of downtime.
Throughput: scale reads horizontally. Writes still bottleneck at the leader.

A fourth, sometimes: durability. More replicas = more copies of your data. AWS S3 Standard stores each object redundantly across at least 3 AZs and is designed for 11 nines of durability.

The strategy you pick optimizes some of these and pays for it on others.

Single-leader: the workhorse

One node accepts writes. Changes stream to N-1 followers via a replication log (Postgres WAL, MySQL binlog, MongoDB oplog).

Replication can be:

Statement-based: ship the SQL. Breaks on nondeterministic functions (NOW(), RAND()). MySQL legacy mode.
WAL/physical: ship the byte-level changes. Postgres streaming replication. Tight coupling to version, no schema flexibility.
Logical/row-based: ship the row changes. MySQL row-based binlog, Postgres logical replication. Cross-version friendly, slightly slower.
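
To make the difference between statement-based and row-based shipping concrete, here is a minimal sketch. The dict "tables" and the function names are illustrative assumptions, not any database's API; `rng.random()` stands in for a nondeterministic SQL function such as RAND().

```python
import random

def apply_statement(table, rng):
    # Statement-based: every node re-executes "UPDATE t SET token = RAND()".
    table["token"] = rng.random()

def apply_row_change(table, row_image):
    # Row-based: every node applies the values the leader already computed.
    table.update(row_image)

leader, follower_stmt, follower_row = {}, {}, {}

# The leader executes the statement with its own source of randomness.
apply_statement(leader, random.Random(1))

# A statement-based follower re-runs it with different random state and diverges.
apply_statement(follower_stmt, random.Random(2))

# A row-based follower receives the resulting row image and matches exactly.
apply_row_change(follower_row, dict(leader))

statement_diverged = follower_stmt != leader
row_matches = follower_row == leader
```

The row image costs more bytes on the wire than the statement, which is one reason row-based shipping is somewhat slower.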

Sync modes:

Synchronous: leader waits for at least one follower ack. No acknowledged write is lost on leader failure, as long as a synced follower survives. Higher commit latency.
Asynchronous: leader returns on local commit. Can lose recent writes. Default in most setups.
Semi-sync: hybrid. Wait for a follower ack, but fall back to async after a timeout. MySQL's semisynchronous replication works this way.

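For Postgres: synchronous_commit = on plus synchronous_standby_names = 'ANY 1 (sb1, sb2)' makes each commit wait until at least one of the two standbys has flushed it. If neither standby is reachable, commits block rather than silently degrade to async. This is the production-grade setup.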
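
The acknowledgement rule behind the sync modes can be sketched in a few lines. This is a toy model under stated assumptions: `Standby`, `commit`, and `required_acks` are hypothetical names, and a real leader would block and wait instead of returning False.

```python
from dataclasses import dataclass

@dataclass
class Standby:
    name: str
    reachable: bool = True

def commit(standbys, mode, required_acks=1):
    """Return True if the leader may report the commit to the client.

    mode="async": report success after the local commit, ignoring standbys.
    mode="sync":  report success only once `required_acks` standbys have
                  acknowledged, loosely mirroring 'ANY k (...)' quorum commit.
    """
    if mode == "async":
        return True
    acks = sum(1 for s in standbys if s.reachable)
    # A real leader would keep waiting here; the sketch just reports the outcome.
    return acks >= required_acks

one_down = [Standby("sb1"), Standby("sb2", reachable=False)]
both_down = [Standby("sb1", reachable=False), Standby("sb2", reachable=False)]
```

With `one_down` a sync commit still succeeds, which is the point of listing two standbys and requiring only one ack.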

Figure: Postgres semi-sync replication

Failover: where everything goes wrong

Failover is the operation of promoting a follower when the leader dies. Most production incidents happen here.

Steps:

Detect leader is dead (timeout, health check fail).
Pick a new leader (most up-to-date follower, or via consensus).
Reconfigure clients to talk to new leader.
Old leader, if it comes back, must not accept writes.
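
The four steps can be sketched as follows. Everything here is an illustrative assumption: `Follower`, `replayed_position`, the `routing` dict, and the epoch counter are hypothetical stand-ins for what a tool like Patroni tracks through its consensus store.

```python
from dataclasses import dataclass

@dataclass
class Follower:
    name: str
    replayed_position: int  # hypothetical log position, higher = more recent
    healthy: bool = True

def leader_is_dead(last_heartbeat, now, timeout=10.0):
    # Step 1: declare the leader dead after `timeout` seconds of silence.
    return now - last_heartbeat > timeout

def pick_new_leader(followers):
    # Step 2: prefer the healthy follower that has replayed the most log.
    candidates = [f for f in followers if f.healthy]
    if not candidates:
        raise RuntimeError("no healthy follower to promote")
    return max(candidates, key=lambda f: f.replayed_position)

def fail_over(followers, routing, epoch):
    new_leader = pick_new_leader(followers)
    # Step 3: point clients at the new leader.
    routing["leader"] = new_leader.name
    # Step 4: bump the epoch so the old leader can be recognised as stale.
    return epoch + 1

followers = [Follower("a", 100), Follower("b", 120), Follower("c", 130, healthy=False)]
routing = {"leader": "old"}
new_epoch = fail_over(followers, routing, epoch=7)
```

Note that the most advanced follower, "c", is skipped because it is unhealthy, so the writes between positions 120 and 130 are exactly the kind that can be lost.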

What goes wrong:

Split brain: old leader comes back, thinks it is still leader, accepts writes. Now you have two leaders with conflicting data. Fix: fencing tokens, or consensus-based leader election.
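Lost writes: async replication means the new leader may be behind. GitHub, October 2018: a 43-second network partition triggered a cross-region failover, each site ended up holding writes the other lacked, and service was degraded for about 24 hours while the data was reconciled.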
Cascade failure: replicas can't keep up with new leader, fall behind, get evicted, replication storm.
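False positives: a timeout that is too short declares a slow but healthy leader dead and causes an unnecessary failover. STONITH (shoot the other node in the head) fences the old leader, but it is only as safe as the failure detection behind it.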
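
A fencing token is a number that increases with every leader election; the storage layer refuses writes that carry an older token than one it has already seen. A minimal sketch, assuming a hypothetical `FencedStorage` class and made-up token values:

```python
class FencedStorage:
    """Storage that rejects writes carrying an older fencing token."""

    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token, key, value):
        if token < self.highest_token:
            # A newer leader has already written; this sender is stale.
            return False
        self.highest_token = token
        self.data[key] = value
        return True

storage = FencedStorage()
old_leader_token, new_leader_token = 33, 34

storage.write(old_leader_token, "name", "Bob")      # accepted
storage.write(new_leader_token, "name", "Robert")   # accepted, raises the bar
stale_write_accepted = storage.write(old_leader_token, "name", "Bobby")
```

The old leader never needs to learn that it was demoted. The check lives in the resource being protected, which is why this works even when the old leader is partitioned away.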

Tools that do this right: Patroni for Postgres (uses etcd or Consul for leader election), Orchestrator for MySQL, MongoDB's built-in replica set.

Multi-leader: when geography matters

Two or more nodes accept writes. They replicate to each other. Used for:

Multi-region active-active (write latency must be local).
Multi-datacenter for resilience.
Offline-capable clients (CouchDB, mobile).

The hard problem: write conflicts. Node A in EU writes name=Bob. Node B in US writes name=Robert. Same row, same time. Now what?

Resolution strategies:

Last-write-wins (LWW): pick by timestamp. Easy. Loses data silently. Clock skew makes this unreliable.
Vector clocks: detect concurrent writes, return both versions, force application to merge.
CRDTs: data structures that merge deterministically regardless of order. G-counters, OR-sets, RGA for text.
Application merge: surface both versions, let user resolve. Git does this.
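
The difference between LWW and vector clocks shows up on the Bob/Robert example. In this sketch the clock layout (`{node: counter}`), the timestamps, and the function names are assumptions chosen for illustration.

```python
def compare(a, b):
    """Compare two vector clocks given as {node: counter} dicts."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # a happened before b, so b can overwrite a
    if b_le_a:
        return "after"
    return "concurrent"      # neither saw the other: a real conflict

def lww(write_a, write_b):
    # Last-write-wins: keep the (timestamp, value) pair with the larger timestamp.
    return max(write_a, write_b, key=lambda w: w[0])

eu = {"clock": {"eu": 1}, "write": (1000.002, "Bob")}
us = {"clock": {"us": 1}, "write": (1000.001, "Robert")}

verdict = compare(eu["clock"], us["clock"])
winner = lww(eu["write"], us["write"])
```

LWW quietly keeps "Bob" and drops "Robert" on a one-millisecond difference, while the vector clock comparison reports the writes as concurrent and leaves the merge decision to the application.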

CRDTs are a strong fit for collaborative apps. Figma's multiplayer engine is CRDT-inspired, simplified by having a central server order updates. Riak supports CRDTs natively. The math: any pair of replicas converges to the same state once they exchange all updates.

The catch: CRDTs only exist for specific data types. Counters, sets, registers, sequences. Building one for arbitrary application state is hard.
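
The simplest CRDT, a G-counter, shows why convergence holds. This is a minimal sketch with a hypothetical `GCounter` class: each node increments only its own slot, and merging takes the per-slot maximum, which is commutative, associative, and idempotent.

```python
class GCounter:
    """Grow-only counter: one slot per node, merge takes the per-slot max."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}

    def increment(self, amount=1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def merge(self, other):
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self):
        return sum(self.counts.values())

eu, us = GCounter("eu"), GCounter("us")
eu.increment(3)
us.increment(2)

# Exchange state in both directions; repeating a merge changes nothing.
eu.merge(us)
us.merge(eu)
us.merge(eu)
```

The same design cannot express a decrement or an arbitrary business rule, which is the limitation described above.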

Leaderless: the Dynamo model

No leader. Each key lives on N replicas. The client (or a coordinator node) sends every write to all N and waits for W acks, and waits for R responses on a read. Replicas use anti-entropy and read repair to converge.

The quorum math:

N=3, W=2, R=2: W+R>N, so every read quorum overlaps the latest write quorum. Survives 1 failure. Standard.
N=5, W=3, R=3: survives 2 failures. More durable, slower.
N=3, W=3, R=1: read-heavy, can't tolerate any write-time failure.
N=3, W=1, R=1: fast and eventually consistent.
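
The quorum arithmetic for these configurations can be checked directly. The helper names below are hypothetical; `always_overlaps` brute-forces every possible write set and read set to confirm that the W+R>N rule really guarantees a shared replica.

```python
from itertools import combinations

def quorum_properties(n, w, r):
    return {
        "overlap": w + r > n,               # every read set meets every write set
        "write_failures_tolerated": n - w,
        "read_failures_tolerated": n - r,
    }

def always_overlaps(n, w, r):
    # Enumerate all W-sized write sets and R-sized read sets over N replicas.
    replicas = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(replicas, w)
               for rs in combinations(replicas, r))

standard = quorum_properties(3, 2, 2)
fast = quorum_properties(3, 1, 1)
```

Overlap only says a read contacts at least one replica holding the latest acknowledged write. Picking the right value among the responses still depends on versioning.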

Read repair: when client reads from R replicas and gets different values, it writes the newest back to the stale replicas. Synchronous repair.
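
A minimal sketch of read repair, assuming each replica is a dict mapping a key to a hypothetical `(version, value)` pair and that the client simply contacts the first R replicas:

```python
def read_with_repair(replicas, key, r):
    """Read `key` from the first `r` replicas and repair stale copies.

    replicas: list of dicts mapping key -> (version, value).
    Contacting the first `r` replicas is a simplification; a real client
    would pick them by latency or token ownership.
    """
    contacted = replicas[:r]
    responses = [rep.get(key, (0, None)) for rep in contacted]
    newest = max(responses, key=lambda vv: vv[0])
    for rep, seen in zip(contacted, responses):
        if seen[0] < newest[0]:
            rep[key] = newest   # write the newest version back to the stale copy
    return newest[1]

a = {"k": (2, "new")}
b = {"k": (1, "old")}
c = {"k": (1, "old")}
result = read_with_repair([a, b, c], "k", r=2)
```

Replica `c` was not part of the read, so it stays stale. Keys that are rarely read are never repaired this way, which is why a background process is also needed.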

Anti-entropy: background process compares replicas using Merkle trees and reconciles differences. Asynchronous repair.
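
The idea behind Merkle-tree comparison can be sketched with a two-level tree: hash each bucket of keys into a leaf, hash the leaves into a root, and only look at individual buckets when the roots differ. The bucket count and helper names are assumptions; a real tree would be deeper and descend level by level.

```python
import hashlib

def digest(data):
    return hashlib.sha256(data.encode()).hexdigest()

def bucket_of(key, buckets):
    # Deterministic bucket assignment so both replicas agree on the layout.
    return int(digest(key), 16) % buckets

def leaf_hashes(store, buckets=4):
    """Hash the sorted contents of each key bucket; these are the leaves."""
    leaves = []
    for b in range(buckets):
        items = sorted((k, v) for k, v in store.items() if bucket_of(k, buckets) == b)
        leaves.append(digest(repr(items)))
    return leaves

def root_hash(leaves):
    level = leaves
    while len(level) > 1:
        # Hash adjacent pairs together until a single root remains.
        level = [digest("".join(level[i:i + 2])) for i in range(0, len(level), 2)]
    return level[0]

def differing_buckets(store_a, store_b, buckets=4):
    la, lb = leaf_hashes(store_a, buckets), leaf_hashes(store_b, buckets)
    if root_hash(la) == root_hash(lb):
        return []   # roots match: nothing to transfer
    return [i for i in range(buckets) if la[i] != lb[i]]

replica_a = {"k1": "v1", "k2": "v2", "k3": "v3"}
replica_b = dict(replica_a)
in_sync = differing_buckets(replica_a, replica_b)
replica_b["k2"] = "stale"
out_of_sync = differing_buckets(replica_a, replica_b)
```

Only the bucket containing the changed key needs to be exchanged, so the cost of a comparison scales with the amount of divergence, not with the size of the dataset.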

Hinted handoff: if a replica is down at write time, a different node accepts the write and "hints" to deliver it later when the target recovers.
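
A minimal sketch of hinted handoff, with a hypothetical `Node` class and a single fallback node holding the hints:

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}
        self.hints = []   # (target_name, key, value) held on behalf of other nodes

def write(key, value, preferred, fallback):
    """Write to each preferred replica, or leave a hint on `fallback` if it is down."""
    for node in preferred:
        if node.up:
            node.data[key] = value
        else:
            fallback.hints.append((node.name, key, value))

def deliver_hints(holder, nodes_by_name):
    """Replay stored hints to targets that have recovered; keep the rest."""
    remaining = []
    for target_name, key, value in holder.hints:
        target = nodes_by_name[target_name]
        if target.up:
            target.data[key] = value
        else:
            remaining.append((target_name, key, value))
    holder.hints = remaining

a, b, c, d = Node("a"), Node("b"), Node("c"), Node("d")
c.up = False
write("k", "v", preferred=[a, b, c], fallback=d)
hints_while_down = list(d.hints)

c.up = True
deliver_hints(d, {"c": c})
```

Whether a hinted write counts toward W is a design choice. Counting it keeps writes available during failures but weakens the quorum overlap guarantee.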

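Cassandra uses all three, as did the original Dynamo design. DynamoDB, the managed AWS service, is a different system: it replicates each partition through an elected leader.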

Figure: Leaderless quorum write with anti-entropy

Read scaling pitfalls

The naive "add replicas to scale reads" plan has limits:

Replication lag: replicas are always behind. For read-your-writes, you must route to primary or use a freshness token.
Stale reads under load: when primary is overloaded, replicas fall further behind. Exactly when you need them most.
Write amplification: every replica must apply every write. Writes don't scale by adding replicas.
Connection sprawl: each replica is another connection pool to manage.
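
Read-your-writes routing with a freshness token can be sketched like this. The token is assumed to be the log position of the caller's last write, and `choose_node` and the replica names are hypothetical.

```python
def choose_node(session_token, replicas, primary="primary"):
    """Pick a node for a read.

    session_token: log position of the caller's last write (hypothetical).
    replicas: {name: replayed log position}.
    """
    fresh = [name for name, pos in replicas.items() if pos >= session_token]
    # Fall back to the primary when no replica has caught up yet.
    return min(fresh) if fresh else primary

replicas = {"r1": 100, "r2": 125}
after_write_at_120 = choose_node(120, replicas)
after_write_at_130 = choose_node(130, replicas)
no_recent_write = choose_node(0, replicas)
```

Under heavy lag every session ends up on the fallback path, which pushes read load back onto the primary. That is the "exactly when you need them most" failure described above.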

Real solution stack:

Caching layer (Redis) for hot reads. Hits primary only on miss.
Read replicas for analytics and reporting, not user-facing reads.
For user-facing scale, shard (split data across primaries).
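
The caching layer in the first item is usually a cache-aside pattern. In this sketch plain dicts stand in for Redis and the primary database, and the `CacheAside` class is a hypothetical illustration, not a client library.

```python
class CacheAside:
    def __init__(self, primary):
        self.primary = primary     # dict standing in for the primary database
        self.cache = {}            # dict standing in for Redis
        self.primary_reads = 0

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        self.primary_reads += 1    # miss: go to the primary once
        value = self.primary[key]
        self.cache[key] = value
        return value

    def put(self, key, value):
        self.primary[key] = value
        self.cache.pop(key, None)  # invalidate rather than update the cached copy

store = CacheAside({"u1": "Bob"})
first = store.get("u1")
second = store.get("u1")
reads_before_write = store.primary_reads
store.put("u1", "Robert")
third = store.get("u1")
```

Invalidating on write keeps the sketch simple. A production setup would also want a TTL so that a missed invalidation cannot serve stale data forever.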

CAP and consistency by strategy

Strategy	Consistency under partition	Failover time	Conflict handling
Single-leader sync	Yes (loses A)	5-30s	None (no concurrent writes)
Single-leader async	Eventual	5-30s, data loss	None
Multi-leader	Eventual	Zero	Required (LWW, CRDT, app)
Leaderless W+R>N	Quorum reads see the latest acked write (not linearizable in general)	Zero	LWW or vector clocks
Leaderless W+R<=N	Eventual	Zero	LWW or vector clocks

Common failure modes

Replication lag spike during backup. Backup loads the primary, replicas can't keep up. Run backups from a dedicated replica.
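Long-running query on replica blocks replication. On Postgres, WAL replay pauses when it conflicts with a running standby query, for up to max_standby_streaming_delay before the query is cancelled. Use hot_standby_feedback (at the cost of bloat on the primary) or kill long queries.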
Network partition causes split-brain. Use a consensus system (etcd, ZooKeeper) for leader election, not heartbeats alone.
Cross-region replication saturates link. Compress and rate-limit, or shard by region.
Schema migration on multi-leader. Different nodes might have different schemas mid-deploy. Use expand-contract migration pattern.

Picking a strategy

Decision tree:

Single region, single AZ? Single-leader with one sync follower. Done.
Single region, multiple AZ for HA? Single-leader with sync followers across AZs. Patroni + etcd.
Multi-region for latency? Either: read replicas in each region (writes go cross-region) or multi-leader with CRDTs.
Multi-region for compliance (GDPR)? Shard by region, each region is its own single-leader cluster.
Need zero-downtime writes during node failures? Leaderless (Cassandra, ScyllaDB, Riak).

99% of apps should pick option 1 or 2 and not look further. The complexity of multi-leader or leaderless is rarely worth it under 100M users.

Learn more

Article: Designing Data-Intensive Applications, Chapter 5 (Martin Kleppmann)
Docs: PostgreSQL replication docs (PostgreSQL)
Paper: Dynamo paper (AWS)
Talk: CRDTs explained (Martin Kleppmann)
Article: GitHub MySQL HA postmortem (GitHub)

