Deep dive15 min read← Back to crisp

Join strategies (nested, hash, merge)

How each join algorithm works internally, when the planner picks each, the memory and disk costs, and how to debug bad join choices.

Joins are where queries live or die. Picking the wrong algorithm for the data size and shape turns a 100ms query into a 10-minute one. The planner usually picks well, but understanding the trade-offs lets you diagnose when it does not.

Nested Loop in detail

for outer_row in outer:
    for inner_row in inner:
        if outer_row.key == inner_row.key:
            emit(outer_row, inner_row)

Without an index on inner, this is O(n*m). Catastrophic for large inputs.

With an index on inner.key, inner lookup is O(log m), so total is O(n log m). Excellent for small outer.

Postgres uses nested loop when:

Outer rows are few (estimated under ~1000)
Inner has a usable index
Or when no other join is possible (cross join, non-equi join)

Nested Loop with indexed inner

When nested loop becomes a disaster

If the planner estimates outer = 10 rows but actual is 1 million, you get 1 million index lookups. Each is fast (~10us) but total is 10 seconds. The query was supposed to take milliseconds.

This is the classic "bad join plan" issue. Cause: stale statistics, correlated columns, parameter sniffing on prepared statements.

Fix:

Run ANALYZE
Add extended statistics for correlated columns
For prepared statements, force custom plans: SET plan_cache_mode = force_custom_plan

Hash Join in detail

# Build phase
hash_table = {}
for row in build_side:  # smaller side
    hash_table.setdefault(row.key, []).append(row)
 
# Probe phase
for row in probe_side:
    for match in hash_table.get(row.key, []):
        emit(row, match)

O(n + m). Both sides scanned once. Hash table held in work_mem.

Postgres uses hash join when:

Equi-join (=)
Neither side is sorted
Build side fits in work_mem (or fits in batches)
Sizes too large for nested loop

Batches and spills

If hash table does not fit in work_mem, Postgres partitions both sides by hash and processes one partition at a time. Each partition becomes a separate hash join. Both sides spill to temp files.

Hash Join
  Hash Cond: (a.id = b.a_id)
  Buckets: 16384  Batches: 8  Memory Usage: 4096kB

Batches: 8 means 8 spills to disk. Each batch reads both sides from temp files. Massive slowdown. Fix: increase work_mem.

Build side selection

Postgres picks the smaller (estimated) side as the build side. Wrong estimate -> builds on the larger side -> bigger hash table -> more spills. Check with EXPLAIN ANALYZE.

Merge Join in detail

i, j = 0, 0
while i < len(a) and j < len(b):
    if a[i].key < b[j].key:
        i += 1
    elif a[i].key > b[j].key:
        j += 1
    else:
        # match - emit and handle duplicates
        emit(a[i], b[j])
        # advance carefully for duplicates

O(n + m) after sorting. Sort cost is O(n log n) per side.

Best when:

Both sides are large
Both can be produced sorted (index on join key)
Output needs to be sorted (e.g., for window functions or downstream merge)

Common in:

Joining two huge tables both clustered on the join key
Window functions that join and partition by same key
Outer joins where order matters

Sort cost dominates

If both sides need explicit sorting, hash join usually wins for equi-joins. Merge join shines when sorts are free (Index Scan in key order) or when both sides are already physically clustered.

How the planner chooses

The planner cost-models each candidate:

Nested loop cost = outer_rows * inner_lookup_cost
Hash join cost = build_cost + probe_cost + (spills * extra_io)
Merge join cost = sort_a + sort_b + merge_cost

Picks the lowest. Wrong estimates -> wrong choice. The most common failure mode is severely underestimating outer rows in nested loop.

Specific patterns

Anti-join (NOT EXISTS)

SELECT * FROM orders o
WHERE NOT EXISTS (SELECT 1 FROM refunds r WHERE r.order_id = o.id);

Postgres uses hash anti-join: hash refunds, scan orders, emit those not in hash. Much faster than NOT IN (which has NULL handling pitfalls).

Semi-join (EXISTS)

SELECT * FROM orders o
WHERE EXISTS (SELECT 1 FROM line_items li WHERE li.order_id = o.id);

Similar to anti-join but emits matches. Hash semi-join. Stops at first match per outer row.

LATERAL joins

SELECT u.name, recent.title
FROM users u,
LATERAL (
  SELECT title FROM posts p
  WHERE p.user_id = u.id
  ORDER BY created_at DESC
  LIMIT 3
) recent;

Per-row correlated subquery. Always nested loop. Useful for "top N per group."

Range joins

SELECT * FROM events e
JOIN windows w ON e.timestamp BETWEEN w.start AND w.end;

Only nested loop works for range conditions (no hash, no merge). Build a GIST index on the range type or use int4range/tstzrange to enable index-supported range joins.

EXPLAIN signals

Nested Loop  (cost=0.43..8.45 rows=1)

If actual rows are 1,000,000, you have a problem.

Hash Join  ... Batches: 64

Hash table spilled 64 times. Increase work_mem.

Merge Join  ... Sort Method: external merge  Disk: 200000kB

Sort spilled. Increase work_mem or add an index for sorted output.

Forcing plans (debugging only)

SET enable_nestloop = off;
SET enable_hashjoin = off;
SET enable_mergejoin = off;

Use to confirm the planner has a better option. Never set in production.

Common pitfalls

Joining on different types. int = text requires cast, breaks index usage, forces nested loop.
Functions in join condition. JOIN ... ON lower(a.email) = b.email. Add expression index or precompute.
OR in join condition. Often expands to UNION internally. Restructure as two joins with UNION.
Three-way joins with bad order. Planner enumerates orders up to geqo_threshold (default 12 tables). Beyond that, genetic algorithm, sometimes worse.

The senior take

Three join algorithms cover everything. Nested loop for small outer with indexed inner. Hash join for medium-large equi-joins. Merge join for huge pre-sorted inputs. The planner picks based on row count estimates. Bad estimates -> bad plans. Most join performance problems are statistics problems in disguise. Always check estimated vs actual rows in EXPLAIN before blaming the planner.

Learn more

Docs
PostgreSQL: Planner/OptimizerPostgreSQL
Article
Designing Data-Intensive Applications, Chapter 10Martin Kleppmann
Talk
Bruce Momjian: Inside the Postgres Query OptimizerBruce Momjian