B-tree indexes
B-tree page structure, Lehman-Yao concurrency, fillfactor, deduplication, partial and expression indexes, and how to read EXPLAIN for index usage.
B-tree is the workhorse index of relational databases. Postgres's implementation is a refined Lehman-Yao B+tree with high concurrency, deduplication since v13, and bottom-up index deletion since v14. Understanding the page structure is what separates good index design from cargo cult.
Page structure
A B-tree in Postgres is a tree of 8KB pages. Each page holds:
- Page header (24 bytes)
- An array of
IndexTuples, each holding a key and a heap TID - Free space in the middle
- Line pointers at the bottom
Leaf pages hold keys plus TID pointers to heap rows. Internal pages hold keys plus pointers to child pages. The leaves form a doubly-linked list (Postgres has both forward and backward links since v12) for efficient range scans in either direction.
Lookup algorithm
- Start at root.
- Binary search the keys to find the child pointer.
- Descend.
- Repeat until leaf.
- Binary search the leaf for the key.
- Follow the TID to the heap, check visibility, return row.
For 100 million rows, height is typically 4. Each level is one page read. With shared_buffers caching the upper levels, lookups take 1-2 disk I/Os.
Lehman-Yao concurrency
Classical B-tree concurrency uses tree-wide locks or hand-over-hand latches. Lehman-Yao adds a high key and a right link to each page. A reader descending the tree can detect "the page I am on was split since I read its parent" by checking if the search key is greater than the high key, and follow the right link.
This means Postgres B-tree:
- Readers never block writers
- Writers take only page-level latches, briefly
- Splits propagate up the tree atomically with WAL records
Fillfactor
Default fillfactor for B-tree leaves is 90 (90 percent full, 10 percent free). For frequently inserted but rarely updated keys (timestamps, sequence IDs), set higher. For random keys (UUIDs), lower is better to delay page splits.
CREATE INDEX idx_users_uuid ON users (uuid) WITH (fillfactor = 70);Page splits
When a leaf is full and a new key arrives, Postgres splits the page in half. Half the keys move to a new page, the parent gets a new entry. If the parent is full, it splits too. In the worst case, splits cascade to the root, increasing tree height.
Random key inserts (UUIDs) cause splits everywhere. Sequential key inserts (auto-increment IDs) only split the rightmost page, which is efficient. This is why UUIDs are slower for B-tree than bigints. Use ULIDs or UUIDv7 (timestamp-prefixed) to get back the locality.
Deduplication (v13+)
If many heap rows share the same indexed value (low-cardinality columns), v13's deduplication stores one key with a list of TIDs instead of one entry per TID. Saves space on indexes over status, country_code, etc.
SELECT idx_scan, indexrelname, pg_relation_size(indexrelid)
FROM pg_stat_user_indexes
WHERE schemaname = 'public';Bottom-up index deletion (v14+)
Before v14, MVCC bloat would cause page splits even when the page had many dead tuples. v14 added "bottom-up" deletion: before splitting, check if any tuples are dead and reclaim them first. Massively reduced bloat on heavily updated tables.
Partial indexes
Index only rows matching a predicate. Smaller index, faster scans.
CREATE INDEX idx_active_users ON users (email) WHERE deleted_at IS NULL;Postgres will use this index for queries that include WHERE deleted_at IS NULL. Common pattern for "soft deleted" rows or "open" tickets.
Expression indexes
Index a computed value.
CREATE INDEX idx_lower_email ON users (lower(email));
SELECT * FROM users WHERE lower(email) = 'yash@example.com';Without the expression index, the query would have to scan and compute lower() per row. With it, direct lookup.
INCLUDE columns (covering indexes)
INCLUDE adds payload columns to leaves without using them for sorting. Enables index-only scans for queries that need a few extra columns.
CREATE INDEX idx_orders_user ON orders (user_id) INCLUDE (total, created_at);
SELECT user_id, total, created_at FROM orders WHERE user_id = 42;No heap fetch needed.
Reading EXPLAIN for B-tree
Composite index column order: the real rules
The textbook rule "most selective first" is wrong as stated. The right rules:
- Equality columns before range columns. Index
(status, created_at)works forWHERE status='open' AND created_at > x. Index(created_at, status)does not. - Most-used equality column first. If 90 percent of queries filter by
tenant_id, put it first. - Sort columns last. Index
(tenant_id, created_at)allowsWHERE tenant_id=5 ORDER BY created_at.
When the planner ignores your index
- Low selectivity: query returns 30 percent of rows. Seq scan is cheaper.
- Stale statistics:
ANALYZEnot run. Planner has wrong row counts. RunANALYZEafter bulk loads. - Type mismatch:
WHERE int_col = '5'(text). Implicit cast prevents index use. Cast the literal. - Function on column:
WHERE lower(email) = ?without expression index. - OR conditions: sometimes ignored unless you use
UNIONor aBitmapOr.
Maintenance
REINDEX INDEX CONCURRENTLY idx_foo;CONCURRENTLY builds a new index without locking writes, then atomically swaps. Use after detecting bloat:
SELECT schemaname, relname, idx_scan, pg_relation_size(indexrelid)
FROM pg_stat_user_indexes
WHERE idx_scan = 0 AND schemaname = 'public';Zero scans for weeks means the index is dead weight. Drop it.
Common pitfalls
- Indexing low-cardinality columns alone.
CREATE INDEX ON users (is_active)is rarely useful. Use partial indexes instead. - Forgetting
CONCURRENTLY.CREATE INDEXwithout it takes a write lock for minutes on big tables. Always useCONCURRENTLYin production. - Indexing every foreign key by reflex. FKs are good index candidates but not always needed. Check actual query patterns.
- Not monitoring index bloat. Use
pgstattupleorpg_bloat_check.sh. Rebuild when free space exceeds 30 percent.
The senior take
B-tree is the default for a reason: it handles equality, range, sort, and prefix in one structure with predictable O(log n) cost. Design composite indexes for your top queries, use partial and expression indexes to keep them small, and audit idx_scan to drop dead weight. Every index you keep costs you write throughput.
Learn more
- DocsPostgreSQL: B-tree ImplementationPostgreSQL
- Paper
- ArticleUse the Index, LukeMarkus Winand