Join strategies (nested, hash, merge)
How each join algorithm works internally, when the planner picks each, the memory and disk costs, and how to debug bad join choices.
Joins are where queries live or die. Picking the wrong algorithm for the data size and shape turns a 100ms query into a 10-minute one. The planner usually picks well, but understanding the trade-offs lets you diagnose when it does not.
Nested Loop in detail
for outer_row in outer:
for inner_row in inner:
if outer_row.key == inner_row.key:
emit(outer_row, inner_row)Without an index on inner, this is O(n*m). Catastrophic for large inputs.
With an index on inner.key, inner lookup is O(log m), so total is O(n log m). Excellent for small outer.
Postgres uses nested loop when:
- Outer rows are few (estimated under ~1000)
- Inner has a usable index
- Or when no other join is possible (cross join, non-equi join)
When nested loop becomes a disaster
If the planner estimates outer = 10 rows but actual is 1 million, you get 1 million index lookups. Each is fast (~10us) but total is 10 seconds. The query was supposed to take milliseconds.
This is the classic "bad join plan" issue. Cause: stale statistics, correlated columns, parameter sniffing on prepared statements.
Fix:
- Run
ANALYZE - Add extended statistics for correlated columns
- For prepared statements, force custom plans:
SET plan_cache_mode = force_custom_plan
Hash Join in detail
# Build phase
hash_table = {}
for row in build_side: # smaller side
hash_table.setdefault(row.key, []).append(row)
# Probe phase
for row in probe_side:
for match in hash_table.get(row.key, []):
emit(row, match)O(n + m). Both sides scanned once. Hash table held in work_mem.
Postgres uses hash join when:
- Equi-join (
=) - Neither side is sorted
- Build side fits in
work_mem(or fits in batches) - Sizes too large for nested loop
Batches and spills
If hash table does not fit in work_mem, Postgres partitions both sides by hash and processes one partition at a time. Each partition becomes a separate hash join. Both sides spill to temp files.
Hash Join
Hash Cond: (a.id = b.a_id)
Buckets: 16384 Batches: 8 Memory Usage: 4096kB
Batches: 8 means 8 spills to disk. Each batch reads both sides from temp files. Massive slowdown. Fix: increase work_mem.
Build side selection
Postgres picks the smaller (estimated) side as the build side. Wrong estimate -> builds on the larger side -> bigger hash table -> more spills. Check with EXPLAIN ANALYZE.
Merge Join in detail
i, j = 0, 0
while i < len(a) and j < len(b):
if a[i].key < b[j].key:
i += 1
elif a[i].key > b[j].key:
j += 1
else:
# match - emit and handle duplicates
emit(a[i], b[j])
# advance carefully for duplicatesO(n + m) after sorting. Sort cost is O(n log n) per side.
Best when:
- Both sides are large
- Both can be produced sorted (index on join key)
- Output needs to be sorted (e.g., for window functions or downstream merge)
Common in:
- Joining two huge tables both clustered on the join key
- Window functions that join and partition by same key
- Outer joins where order matters
Sort cost dominates
If both sides need explicit sorting, hash join usually wins for equi-joins. Merge join shines when sorts are free (Index Scan in key order) or when both sides are already physically clustered.
How the planner chooses
The planner cost-models each candidate:
- Nested loop cost = outer_rows * inner_lookup_cost
- Hash join cost = build_cost + probe_cost + (spills * extra_io)
- Merge join cost = sort_a + sort_b + merge_cost
Picks the lowest. Wrong estimates -> wrong choice. The most common failure mode is severely underestimating outer rows in nested loop.
Specific patterns
Anti-join (NOT EXISTS)
SELECT * FROM orders o
WHERE NOT EXISTS (SELECT 1 FROM refunds r WHERE r.order_id = o.id);Postgres uses hash anti-join: hash refunds, scan orders, emit those not in hash. Much faster than NOT IN (which has NULL handling pitfalls).
Semi-join (EXISTS)
SELECT * FROM orders o
WHERE EXISTS (SELECT 1 FROM line_items li WHERE li.order_id = o.id);Similar to anti-join but emits matches. Hash semi-join. Stops at first match per outer row.
LATERAL joins
SELECT u.name, recent.title
FROM users u,
LATERAL (
SELECT title FROM posts p
WHERE p.user_id = u.id
ORDER BY created_at DESC
LIMIT 3
) recent;Per-row correlated subquery. Always nested loop. Useful for "top N per group."
Range joins
SELECT * FROM events e
JOIN windows w ON e.timestamp BETWEEN w.start AND w.end;Only nested loop works for range conditions (no hash, no merge). Build a GIST index on the range type or use int4range/tstzrange to enable index-supported range joins.
EXPLAIN signals
Nested Loop (cost=0.43..8.45 rows=1)
If actual rows are 1,000,000, you have a problem.
Hash Join ... Batches: 64
Hash table spilled 64 times. Increase work_mem.
Merge Join ... Sort Method: external merge Disk: 200000kB
Sort spilled. Increase work_mem or add an index for sorted output.
Forcing plans (debugging only)
SET enable_nestloop = off;
SET enable_hashjoin = off;
SET enable_mergejoin = off;Use to confirm the planner has a better option. Never set in production.
Common pitfalls
- Joining on different types.
int = textrequires cast, breaks index usage, forces nested loop. - Functions in join condition.
JOIN ... ON lower(a.email) = b.email. Add expression index or precompute. - OR in join condition. Often expands to UNION internally. Restructure as two joins with UNION.
- Three-way joins with bad order. Planner enumerates orders up to
geqo_threshold(default 12 tables). Beyond that, genetic algorithm, sometimes worse.
The senior take
Three join algorithms cover everything. Nested loop for small outer with indexed inner. Hash join for medium-large equi-joins. Merge join for huge pre-sorted inputs. The planner picks based on row count estimates. Bad estimates -> bad plans. Most join performance problems are statistics problems in disguise. Always check estimated vs actual rows in EXPLAIN before blaming the planner.
Learn more
- DocsPostgreSQL: Planner/OptimizerPostgreSQL
- ArticleDesigning Data-Intensive Applications, Chapter 10Martin Kleppmann
- TalkBruce Momjian: Inside the Postgres Query OptimizerBruce Momjian