Deep dive15 min read← Back to crisp

B-tree indexes

B-tree page structure, Lehman-Yao concurrency, fillfactor, deduplication, partial and expression indexes, and how to read EXPLAIN for index usage.

B-tree is the workhorse index of relational databases. Postgres's implementation is a refined Lehman-Yao B+tree with high concurrency, deduplication since v13, and bottom-up index deletion since v14. Understanding the page structure is what separates good index design from cargo cult.

Page structure

A B-tree in Postgres is a tree of 8KB pages. Each page holds:

Page header (24 bytes)
An array of IndexTuples, each holding a key and a heap TID
Free space in the middle
Line pointers at the bottom

Leaf pages hold keys plus TID pointers to heap rows. Internal pages hold keys plus pointers to child pages. The leaves form a doubly-linked list (Postgres has both forward and backward links since v12) for efficient range scans in either direction.

B-tree with bidirectional leaf links

Lookup algorithm

Start at root.
Binary search the keys to find the child pointer.
Descend.
Repeat until leaf.
Binary search the leaf for the key.
Follow the TID to the heap, check visibility, return row.

For 100 million rows, height is typically 4. Each level is one page read. With shared_buffers caching the upper levels, lookups take 1-2 disk I/Os.

Lehman-Yao concurrency

Classical B-tree concurrency uses tree-wide locks or hand-over-hand latches. Lehman-Yao adds a high key and a right link to each page. A reader descending the tree can detect "the page I am on was split since I read its parent" by checking if the search key is greater than the high key, and follow the right link.

This means Postgres B-tree:

Readers never block writers
Writers take only page-level latches, briefly
Splits propagate up the tree atomically with WAL records

Fillfactor

Default fillfactor for B-tree leaves is 90 (90 percent full, 10 percent free). For frequently inserted but rarely updated keys (timestamps, sequence IDs), set higher. For random keys (UUIDs), lower is better to delay page splits.

CREATE INDEX idx_users_uuid ON users (uuid) WITH (fillfactor = 70);

Page splits

When a leaf is full and a new key arrives, Postgres splits the page in half. Half the keys move to a new page, the parent gets a new entry. If the parent is full, it splits too. In the worst case, splits cascade to the root, increasing tree height.

Random key inserts (UUIDs) cause splits everywhere. Sequential key inserts (auto-increment IDs) only split the rightmost page, which is efficient. This is why UUIDs are slower for B-tree than bigints. Use ULIDs or UUIDv7 (timestamp-prefixed) to get back the locality.

Deduplication (v13+)

If many heap rows share the same indexed value (low-cardinality columns), v13's deduplication stores one key with a list of TIDs instead of one entry per TID. Saves space on indexes over status, country_code, etc.

SELECT idx_scan, indexrelname, pg_relation_size(indexrelid)
FROM pg_stat_user_indexes
WHERE schemaname = 'public';

Bottom-up index deletion (v14+)

Before v14, MVCC bloat would cause page splits even when the page had many dead tuples. v14 added "bottom-up" deletion: before splitting, check if any tuples are dead and reclaim them first. Massively reduced bloat on heavily updated tables.

Partial indexes

Index only rows matching a predicate. Smaller index, faster scans.

CREATE INDEX idx_active_users ON users (email) WHERE deleted_at IS NULL;

Postgres will use this index for queries that include WHERE deleted_at IS NULL. Common pattern for "soft deleted" rows or "open" tickets.

Expression indexes

Index a computed value.

CREATE INDEX idx_lower_email ON users (lower(email));
SELECT * FROM users WHERE lower(email) = 'yash@example.com';

Without the expression index, the query would have to scan and compute lower() per row. With it, direct lookup.

INCLUDE columns (covering indexes)

INCLUDE adds payload columns to leaves without using them for sorting. Enables index-only scans for queries that need a few extra columns.

CREATE INDEX idx_orders_user ON orders (user_id) INCLUDE (total, created_at);
SELECT user_id, total, created_at FROM orders WHERE user_id = 42;

No heap fetch needed.

Reading EXPLAIN for B-tree

Composite index column order: the real rules

The textbook rule "most selective first" is wrong as stated. The right rules:

Equality columns before range columns. Index (status, created_at) works for WHERE status='open' AND created_at > x. Index (created_at, status) does not.
Most-used equality column first. If 90 percent of queries filter by tenant_id, put it first.
Sort columns last. Index (tenant_id, created_at) allows WHERE tenant_id=5 ORDER BY created_at.

When the planner ignores your index

Low selectivity: query returns 30 percent of rows. Seq scan is cheaper.
Stale statistics: ANALYZE not run. Planner has wrong row counts. Run ANALYZE after bulk loads.
Type mismatch: WHERE int_col = '5' (text). Implicit cast prevents index use. Cast the literal.
Function on column: WHERE lower(email) = ? without expression index.
OR conditions: sometimes ignored unless you use UNION or a BitmapOr.

Maintenance

REINDEX INDEX CONCURRENTLY idx_foo;

CONCURRENTLY builds a new index without locking writes, then atomically swaps. Use after detecting bloat:

SELECT schemaname, relname, idx_scan, pg_relation_size(indexrelid)
FROM pg_stat_user_indexes
WHERE idx_scan = 0 AND schemaname = 'public';

Zero scans for weeks means the index is dead weight. Drop it.

Common pitfalls

Indexing low-cardinality columns alone. CREATE INDEX ON users (is_active) is rarely useful. Use partial indexes instead.
Forgetting CONCURRENTLY. CREATE INDEX without it takes a write lock for minutes on big tables. Always use CONCURRENTLY in production.
Indexing every foreign key by reflex. FKs are good index candidates but not always needed. Check actual query patterns.
Not monitoring index bloat. Use pgstattuple or pg_bloat_check.sh. Rebuild when free space exceeds 30 percent.

The senior take

B-tree is the default for a reason: it handles equality, range, sort, and prefix in one structure with predictable O(log n) cost. Design composite indexes for your top queries, use partial and expression indexes to keep them small, and audit idx_scan to drop dead weight. Every index you keep costs you write throughput.

Learn more

Docs
PostgreSQL: B-tree ImplementationPostgreSQL
Paper
Lehman-Yao: Efficient Locking for Concurrent Operations on B-TreesACM TODS 1981
Article
Use the Index, LukeMarkus Winand