MVCC in Postgres
How Postgres implements snapshot isolation with xmin/xmax tuple headers, HOT updates, VACUUM, and freeze maps, plus the wraparound disaster scenarios.
MVCC is the single most important implementation detail in Postgres. It shapes your indexes, your performance, your operational burden, and your bugs.
The tuple header
Every row in Postgres is preceded by a 23-byte header. The fields that matter for MVCC:
xmin: the transaction id that inserted this tuplexmax: the transaction id that deleted (or locked) this tuple, 0 if alivecmin/cmax: command id within the transaction, for self-visibilityt_ctid: pointer to the new version, if this tuple was updatedt_infomask: flags (committed, aborted, frozen, hint bits)
A SELECT walks the heap, reads each tuple header, and applies the visibility rule.
Snapshot composition
A snapshot is not just a single number. It is:
xmin: the oldest transaction still in progress when this snapshot was takenxmax: the next xid to be assignedxip[]: array of in-progress xids betweenxminandxmax
A tuple with xmin = T is visible if T < snapshot.xmin, or if T < snapshot.xmax AND T is not in xip[] AND T committed.
This is why a transaction taken at the same instant as another transaction's commit may or may not see that commit depending on ordering. There is no global clock.
UPDATE is INSERT plus mark-dead
When you UPDATE a row, Postgres:
- Finds the existing tuple
- Sets the old tuple's
xmaxto current xid - Writes a new tuple with
xmin= current xid,xmax= 0 - Sets the old tuple's
t_ctidto point to the new one - Writes all this to the WAL
If the new tuple fits in the same page AND no indexed columns changed, Postgres uses HOT (Heap-Only Tuple) update: the new tuple is on the same page and the index still points to the old one, which has a chain to the new. This saves index writes.
If HOT does not apply, every index on the table gets a new entry. A table with 10 indexes turns one UPDATE into 11 writes.
DELETE is not free
DELETE does not remove the row. It sets xmax to current xid. The tuple sits there until VACUUM reclaims it. This is why bulk DELETE followed by bulk INSERT does not shrink the table. Use TRUNCATE or partition swap.
VACUUM in detail
VACUUM has two jobs:
- Reclaim dead tuples. Find tuples whose
xmaxis committed and older than the globalxminhorizon. Mark their space as reusable in the free space map. - Freeze old live tuples. Any tuple with
xminolder thanvacuum_freeze_min_agegets itsxminrewritten to a special "frozen" value. This is for wraparound prevention.
VACUUM does not lock the table for reads or writes. It only takes a SHARE UPDATE EXCLUSIVE lock, which conflicts with DDL but not with normal queries. VACUUM FULL rewrites the table and takes ACCESS EXCLUSIVE. Avoid VACUUM FULL in production. Use pg_repack instead.
The global xmin horizon
VACUUM cannot clean any tuple newer than the oldest active snapshot in the cluster. This is the "xmin horizon." If any transaction is open for hours, VACUUM is stuck.
Common causes of held-back horizon:
- A long analytics query
- An idle-in-transaction connection (someone called
BEGINand went to lunch) - A replication slot with
hot_standby_feedback = onand a slow replica - A prepared transaction (two-phase commit) that was never committed or rolled back
Check with:
SELECT pid, state, xact_start, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start;Transaction ID wraparound
Postgres xid is 32 bits. After 2^31 transactions (~2.1 billion), the counter wraps. To prevent ancient tuples from appearing in the future, Postgres "freezes" old tuples by setting a flag that says "visible to everyone, ignore xmin."
If autovacuum cannot keep up, the cluster approaches wraparound. At 10 million transactions remaining, Postgres warns. At 1 million, it refuses new transactions and goes into single-user mode. Sentry and Mailchimp have both written post-mortems about this.
Monitor with:
SELECT datname, age(datfrozenxid)
FROM pg_database
ORDER BY age(datfrozenxid) DESC;If any database is over 1 billion, you have a problem.
Index bloat is real
Indexes do not have visibility info. Every entry points to a heap tuple. When a tuple dies, its index entries die too, but the index does not know until a scan visits the heap and the heap says "dead." Index entries are reclaimed lazily by VACUUM.
A heavily updated table with 5 indexes might have indexes 5x larger than the data. REINDEX CONCURRENTLY rebuilds without locking.
Common pitfalls
SELECT FOR UPDATEin Read Committed creates write skew. Two transactions both lock different rows that should be jointly consistent. UseSERIALIZABLEor application-level locks.pg_dumpopens a long transaction. On a busy database, this can hold back the xmin horizon and cause massive bloat during the backup window.- Forgetting that COUNT(*) is slow. Because of MVCC, Postgres cannot maintain an accurate row count cheaply.
COUNT(*)scans the heap (or an index withIndex Only Scanplus visibility map). - Idle-in-transaction connections. Use
idle_in_transaction_session_timeoutto kill them automatically.
The senior take
MVCC is a trade. You get lock-free reads and writes, consistent snapshots, and clean transaction semantics. You pay with bloat, VACUUM operational burden, and wraparound risk. Tune autovacuum aggressively, kill idle transactions, monitor pg_stat_user_tables.n_dead_tup, and your DB will run for years. Ignore it and your 200GB table becomes 2TB.
Learn more
- DocsPostgreSQL: Concurrency ControlPostgreSQL
- ArticleThe Internals of PostgreSQLHironobu Suzuki
- ArticleSentry: Transaction ID Wraparound in PostgresSentry Engineering