Deep dive15 min read← Back to crisp

Streaming replication

How WAL streaming works, sync vs async semantics, replication slots, logical vs physical replication, failover orchestration, and the data loss profiles of each setup.

Postgres replication is the foundation for HA, read scale-out, geo-distribution, and zero-downtime migrations. The mechanism is simple (ship WAL) but the operational surface is large.

WAL: the source of truth

Everything in replication is WAL. Every change to the database (insert, update, delete, DDL, index changes) writes a WAL record before touching the heap. The WAL is the durability log. It is also the replication stream.

WAL records carry a Log Sequence Number (LSN), a 64-bit position in the WAL stream. Replicas track which LSN they have received and which they have applied.

Physical streaming replication

The default. Replica is a byte-for-byte copy of the primary. WAL records are shipped raw and replayed.

Setup:

On primary: wal_level = replica, max_wal_senders = 10, configure pg_hba.conf to allow replica.
On replica: pg_basebackup to clone the primary, then start with primary_conninfo and standby.signal.
Replica connects, streams WAL, replays in real time.

# primary postgresql.conf
wal_level = replica
max_wal_senders = 10
max_replication_slots = 10
 
# replica postgresql.conf
primary_conninfo = 'host=primary port=5432 user=replicator'

Typical topology: one sync replica for HA, async replicas for read scale-out and geo

Synchronous replication semantics

synchronous_commit levels (per transaction):

off: do not wait for local WAL fsync. Fast but loses recent commits on primary crash.
local: wait for local fsync (default for non-sync setup).
remote_write: wait for sync replica to write WAL to OS. Replica crash before fsync can lose data.
on (default): wait for sync replica to fsync WAL. Durable on both primary and replica.
remote_apply: wait for sync replica to apply WAL. Replica reads see the write.

synchronous_standby_names controls which replicas count. Common pattern: 2 of 3 sync replicas must ack, so any one can fail without blocking writes.

synchronous_standby_names = 'ANY 2 (replica1, replica2, replica3)'

Replication slots

Without a slot, the primary may recycle WAL files before a replica receives them, breaking replication. The replica then needs a full re-clone.

A replication slot guarantees the primary keeps WAL until the replica confirms receipt. Always use slots in production.

SELECT pg_create_physical_replication_slot('replica1_slot');

Risk: if the replica is down for days, WAL accumulates on the primary and fills disk. Monitor:

SELECT slot_name, active, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS bytes_behind
FROM pg_replication_slots;

Drop dead slots fast.

Logical replication

Physical replication ships byte-level WAL. The replica is identical. Logical replication ships logical changes (INSERT, UPDATE, DELETE per row) decoded from WAL. The replica can have a different schema, different version, different OS.

Use cases:

Zero-downtime major version upgrade: replicate from Postgres 13 to 15, cut over.
Cross-database migrations: replicate specific tables between databases.
CDC pipelines: replicate to Kafka, ETL, audit log.
Sharding migration: replicate to a sharded setup while reading from the old one.

-- on source
CREATE PUBLICATION my_pub FOR TABLE users, orders;
 
-- on target
CREATE SUBSCRIPTION my_sub
  CONNECTION 'host=source dbname=app'
  PUBLICATION my_pub;

Limitations:

Initial sync can be slow for big tables
DDL not replicated automatically
Sequences not replicated
Tables must have a primary key or REPLICA IDENTITY FULL

Failover

When the primary dies, you must promote a replica. Manual failover:

pg_ctl promote -D /var/lib/postgresql/data

Then update application config to point to the new primary. Slow, error-prone, requires human at 3am.

Automated failover tools:

Patroni: leader election via etcd/consul, automated promotion, requires Pg cluster orchestration.
repmgr: replication manager, manual or automated failover.
pg_auto_failover: Microsoft's tool, simpler than Patroni.
AWS RDS Multi-AZ: managed, sub-minute failover with automated DNS update.

Split brain and fencing

If the primary is unreachable but still alive (network partition), and you promote a replica, you have two primaries. Both accept writes. Reconciliation is impossible.

Patroni uses a distributed consensus store (etcd, Zookeeper, consul) for leader election. Only the node that holds the lease writes. Fencing kills the old primary before promoting the new.

Read your writes

Async replicas are behind. If your user submits a form and immediately refreshes, the replica may not have the write yet.

Fixes:

Stick reads to primary after a write: track per-session "last write LSN" in app, route reads to primary until replica catches up.
synchronous_commit = remote_apply: write waits for replica apply. Subsequent read on replica is safe.
Cookie-based replica selection: set cookie for N seconds after write, route to primary during that window.

Lag monitoring

SELECT
  application_name,
  state,
  pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS sent_lag,
  pg_wal_lsn_diff(pg_current_wal_lsn(), write_lsn) AS write_lag,
  pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn) AS flush_lag,
  pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag
FROM pg_stat_replication;

replay_lag is what matters for read traffic. Alert at 1MB+ on a healthy system.

hot_standby_feedback

By default, replicas can be aborted with "canceling statement due to conflict with recovery" when the primary's VACUUM removes tuples the replica's query is still reading.

hot_standby_feedback = on makes the replica tell the primary about its oldest snapshot, holding back VACUUM. Trade-off: long replica queries cause primary bloat.

Common pitfalls

No replication slot: replica falls behind, primary recycles WAL, full re-clone needed.
Slot held by dead replica: WAL accumulates, disk fills. Monitor and drop.
Sync replication with one replica: if replica fails, primary blocks writes. Always have at least 2 sync candidates.
Reading from replica for "real-time" data: lag breaks UX. Use sync replication or route to primary.
Logical replication without primary keys: silent slowdown or failure. Always add PKs.

The senior take

Physical streaming replication is the foundation. Always use replication slots. Run at least one sync replica for HA, async replicas for read scale-out and geo. Use logical replication for migrations, version upgrades, and CDC. Automate failover with Patroni or use managed (RDS, Cloud SQL). Monitor lag in bytes, not seconds. The default async setup loses recent commits on failover, factor that into your durability promises.

Learn more

Docs
PostgreSQL: ReplicationPostgreSQL
Docs
PostgreSQL: Logical ReplicationPostgreSQL
Docs
Patroni DocumentationPatroni