Deep dive15 min read← Back to crisp

MVCC in Postgres

How Postgres implements snapshot isolation with xmin/xmax tuple headers, HOT updates, VACUUM, and freeze maps, plus the wraparound disaster scenarios.

MVCC is the single most important implementation detail in Postgres. It shapes your indexes, your performance, your operational burden, and your bugs.

The tuple header

Every row in Postgres is preceded by a 23-byte header. The fields that matter for MVCC:

xmin: the transaction id that inserted this tuple
xmax: the transaction id that deleted (or locked) this tuple, 0 if alive
cmin/cmax: command id within the transaction, for self-visibility
t_ctid: pointer to the new version, if this tuple was updated
t_infomask: flags (committed, aborted, frozen, hint bits)

A SELECT walks the heap, reads each tuple header, and applies the visibility rule.

MVCC visibility check per tuple

Snapshot composition

A snapshot is not just a single number. It is:

xmin: the oldest transaction still in progress when this snapshot was taken
xmax: the next xid to be assigned
xip[]: array of in-progress xids between xmin and xmax

A tuple with xmin = T is visible if T < snapshot.xmin, or if T < snapshot.xmax AND T is not in xip[] AND T committed.

This is why a transaction taken at the same instant as another transaction's commit may or may not see that commit depending on ordering. There is no global clock.

UPDATE is INSERT plus mark-dead

When you UPDATE a row, Postgres:

Finds the existing tuple
Sets the old tuple's xmax to current xid
Writes a new tuple with xmin = current xid, xmax = 0
Sets the old tuple's t_ctid to point to the new one
Writes all this to the WAL

If the new tuple fits in the same page AND no indexed columns changed, Postgres uses HOT (Heap-Only Tuple) update: the new tuple is on the same page and the index still points to the old one, which has a chain to the new. This saves index writes.

If HOT does not apply, every index on the table gets a new entry. A table with 10 indexes turns one UPDATE into 11 writes.

DELETE is not free

DELETE does not remove the row. It sets xmax to current xid. The tuple sits there until VACUUM reclaims it. This is why bulk DELETE followed by bulk INSERT does not shrink the table. Use TRUNCATE or partition swap.

VACUUM in detail

VACUUM has two jobs:

Reclaim dead tuples. Find tuples whose xmax is committed and older than the global xmin horizon. Mark their space as reusable in the free space map.
Freeze old live tuples. Any tuple with xmin older than vacuum_freeze_min_age gets its xmin rewritten to a special "frozen" value. This is for wraparound prevention.

VACUUM does not lock the table for reads or writes. It only takes a SHARE UPDATE EXCLUSIVE lock, which conflicts with DDL but not with normal queries. VACUUM FULL rewrites the table and takes ACCESS EXCLUSIVE. Avoid VACUUM FULL in production. Use pg_repack instead.

The global xmin horizon

VACUUM cannot clean any tuple newer than the oldest active snapshot in the cluster. This is the "xmin horizon." If any transaction is open for hours, VACUUM is stuck.

Common causes of held-back horizon:

A long analytics query
An idle-in-transaction connection (someone called BEGIN and went to lunch)
A replication slot with hot_standby_feedback = on and a slow replica
A prepared transaction (two-phase commit) that was never committed or rolled back

Check with:

SELECT pid, state, xact_start, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start;

Transaction ID wraparound

Postgres xid is 32 bits. After 2^31 transactions (~2.1 billion), the counter wraps. To prevent ancient tuples from appearing in the future, Postgres "freezes" old tuples by setting a flag that says "visible to everyone, ignore xmin."

If autovacuum cannot keep up, the cluster approaches wraparound. At 10 million transactions remaining, Postgres warns. At 1 million, it refuses new transactions and goes into single-user mode. Sentry and Mailchimp have both written post-mortems about this.

Monitor with:

SELECT datname, age(datfrozenxid)
FROM pg_database
ORDER BY age(datfrozenxid) DESC;

If any database is over 1 billion, you have a problem.

Index bloat is real

Indexes do not have visibility info. Every entry points to a heap tuple. When a tuple dies, its index entries die too, but the index does not know until a scan visits the heap and the heap says "dead." Index entries are reclaimed lazily by VACUUM.

A heavily updated table with 5 indexes might have indexes 5x larger than the data. REINDEX CONCURRENTLY rebuilds without locking.

Common pitfalls

SELECT FOR UPDATE in Read Committed creates write skew. Two transactions both lock different rows that should be jointly consistent. Use SERIALIZABLE or application-level locks.
pg_dump opens a long transaction. On a busy database, this can hold back the xmin horizon and cause massive bloat during the backup window.
Forgetting that COUNT(*) is slow. Because of MVCC, Postgres cannot maintain an accurate row count cheaply. COUNT(*) scans the heap (or an index with Index Only Scan plus visibility map).
Idle-in-transaction connections. Use idle_in_transaction_session_timeout to kill them automatically.

The senior take

MVCC is a trade. You get lock-free reads and writes, consistent snapshots, and clean transaction semantics. You pay with bloat, VACUUM operational burden, and wraparound risk. Tune autovacuum aggressively, kill idle transactions, monitor pg_stat_user_tables.n_dead_tup, and your DB will run for years. Ignore it and your 200GB table becomes 2TB.

Learn more

Docs
PostgreSQL: Concurrency ControlPostgreSQL
Article
The Internals of PostgreSQLHironobu Suzuki
Article
Sentry: Transaction ID Wraparound in PostgresSentry Engineering