Streaming replication
How WAL streaming works, sync vs async semantics, replication slots, logical vs physical replication, failover orchestration, and the data loss profiles of each setup.
Postgres replication is the foundation for HA, read scale-out, geo-distribution, and zero-downtime migrations. The mechanism is simple (ship WAL) but the operational surface is large.
WAL: the source of truth
Everything in replication is WAL. Every change to the database (insert, update, delete, DDL, index changes) writes a WAL record before touching the heap. The WAL is the durability log. It is also the replication stream.
WAL records carry a Log Sequence Number (LSN), a 64-bit position in the WAL stream. Replicas track which LSN they have received and which they have applied.
Physical streaming replication
The default. Replica is a byte-for-byte copy of the primary. WAL records are shipped raw and replayed.
Setup:
- On primary:
wal_level = replica,max_wal_senders = 10, configurepg_hba.confto allow replica. - On replica:
pg_basebackupto clone the primary, then start withprimary_conninfoandstandby.signal. - Replica connects, streams WAL, replays in real time.
# primary postgresql.conf
wal_level = replica
max_wal_senders = 10
max_replication_slots = 10
# replica postgresql.conf
primary_conninfo = 'host=primary port=5432 user=replicator'Synchronous replication semantics
synchronous_commit levels (per transaction):
off: do not wait for local WAL fsync. Fast but loses recent commits on primary crash.local: wait for local fsync (default for non-sync setup).remote_write: wait for sync replica to write WAL to OS. Replica crash before fsync can lose data.on(default): wait for sync replica to fsync WAL. Durable on both primary and replica.remote_apply: wait for sync replica to apply WAL. Replica reads see the write.
synchronous_standby_names controls which replicas count. Common pattern: 2 of 3 sync replicas must ack, so any one can fail without blocking writes.
synchronous_standby_names = 'ANY 2 (replica1, replica2, replica3)'Replication slots
Without a slot, the primary may recycle WAL files before a replica receives them, breaking replication. The replica then needs a full re-clone.
A replication slot guarantees the primary keeps WAL until the replica confirms receipt. Always use slots in production.
SELECT pg_create_physical_replication_slot('replica1_slot');Risk: if the replica is down for days, WAL accumulates on the primary and fills disk. Monitor:
SELECT slot_name, active, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS bytes_behind
FROM pg_replication_slots;Drop dead slots fast.
Logical replication
Physical replication ships byte-level WAL. The replica is identical. Logical replication ships logical changes (INSERT, UPDATE, DELETE per row) decoded from WAL. The replica can have a different schema, different version, different OS.
Use cases:
- Zero-downtime major version upgrade: replicate from Postgres 13 to 15, cut over.
- Cross-database migrations: replicate specific tables between databases.
- CDC pipelines: replicate to Kafka, ETL, audit log.
- Sharding migration: replicate to a sharded setup while reading from the old one.
-- on source
CREATE PUBLICATION my_pub FOR TABLE users, orders;
-- on target
CREATE SUBSCRIPTION my_sub
CONNECTION 'host=source dbname=app'
PUBLICATION my_pub;Limitations:
- Initial sync can be slow for big tables
- DDL not replicated automatically
- Sequences not replicated
- Tables must have a primary key or
REPLICA IDENTITY FULL
Failover
When the primary dies, you must promote a replica. Manual failover:
pg_ctl promote -D /var/lib/postgresql/dataThen update application config to point to the new primary. Slow, error-prone, requires human at 3am.
Automated failover tools:
- Patroni: leader election via etcd/consul, automated promotion, requires Pg cluster orchestration.
- repmgr: replication manager, manual or automated failover.
- pg_auto_failover: Microsoft's tool, simpler than Patroni.
- AWS RDS Multi-AZ: managed, sub-minute failover with automated DNS update.
Split brain and fencing
If the primary is unreachable but still alive (network partition), and you promote a replica, you have two primaries. Both accept writes. Reconciliation is impossible.
Patroni uses a distributed consensus store (etcd, Zookeeper, consul) for leader election. Only the node that holds the lease writes. Fencing kills the old primary before promoting the new.
Read your writes
Async replicas are behind. If your user submits a form and immediately refreshes, the replica may not have the write yet.
Fixes:
- Stick reads to primary after a write: track per-session "last write LSN" in app, route reads to primary until replica catches up.
synchronous_commit = remote_apply: write waits for replica apply. Subsequent read on replica is safe.- Cookie-based replica selection: set cookie for N seconds after write, route to primary during that window.
Lag monitoring
SELECT
application_name,
state,
pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS sent_lag,
pg_wal_lsn_diff(pg_current_wal_lsn(), write_lsn) AS write_lag,
pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn) AS flush_lag,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag
FROM pg_stat_replication;replay_lag is what matters for read traffic. Alert at 1MB+ on a healthy system.
hot_standby_feedback
By default, replicas can be aborted with "canceling statement due to conflict with recovery" when the primary's VACUUM removes tuples the replica's query is still reading.
hot_standby_feedback = on makes the replica tell the primary about its oldest snapshot, holding back VACUUM. Trade-off: long replica queries cause primary bloat.
Common pitfalls
- No replication slot: replica falls behind, primary recycles WAL, full re-clone needed.
- Slot held by dead replica: WAL accumulates, disk fills. Monitor and drop.
- Sync replication with one replica: if replica fails, primary blocks writes. Always have at least 2 sync candidates.
- Reading from replica for "real-time" data: lag breaks UX. Use sync replication or route to primary.
- Logical replication without primary keys: silent slowdown or failure. Always add PKs.
The senior take
Physical streaming replication is the foundation. Always use replication slots. Run at least one sync replica for HA, async replicas for read scale-out and geo. Use logical replication for migrations, version upgrades, and CDC. Automate failover with Patroni or use managed (RDS, Cloud SQL). Monitor lag in bytes, not seconds. The default async setup loses recent commits on failover, factor that into your durability promises.
Learn more
- DocsPostgreSQL: ReplicationPostgreSQL
- DocsPostgreSQL: Logical ReplicationPostgreSQL
- DocsPatroni DocumentationPatroni