Query planner and EXPLAIN
How the Postgres cost-based optimizer enumerates plans, estimates selectivity, and where it goes wrong. Plus how to read EXPLAIN like a senior engineer.
The Postgres planner is a sophisticated cost-based optimizer that turns declarative SQL into procedural execution plans. Understanding how it makes decisions, and how it can be wrong, is the difference between writing queries and tuning them.
The optimization pipeline
SQL → Parse Tree → Query Tree (post-rewrite) → Plan Tree → Execution
The planner stage is where the magic happens:
- Path generation: for each scan, generate possible access paths (seq scan, index scan, bitmap scan)
- Join ordering: enumerate join orders (dynamic programming up to ~10 tables, genetic algorithm beyond)
- Join method selection: for each pair, consider nested loop, hash join, merge join
- Cost estimation: assign cost to each candidate
- Plan selection: pick the lowest-cost complete plan
For 10 tables, there are about 10 million possible join orders. The planner uses dynamic programming to keep it tractable.
The cost model
Costs are in arbitrary units, calibrated so sequential page reads cost 1.0. Configurable parameters:
seq_page_cost = 1.0: cost to read a page sequentiallyrandom_page_cost = 4.0: cost to read a page randomly (default; lower for SSD: 1.1)cpu_tuple_cost = 0.01: cost to process one rowcpu_index_tuple_cost = 0.005: cost per index entrycpu_operator_cost = 0.0025: cost per operator (function call)
For SSD, set random_page_cost = 1.1. Default of 4.0 is for spinning rust and biases the planner against index scans.
Statistics
The planner needs to estimate "how many rows will this WHERE clause return?" It uses statistics collected by ANALYZE:
- Most Common Values (MCVs): top N values and their frequencies
- Histogram: equi-depth buckets of remaining values
n_distinct: estimated number of distinct valuesnull_frac: fraction of nulls- Correlation: how sorted the column is vs physical order (affects index scan cost)
Per-column. Increase resolution with ALTER TABLE foo ALTER COLUMN bar SET STATISTICS 1000; then ANALYZE.
Extended statistics
Default stats assume independence between columns. If columns are correlated (city and zip_code), estimates are wrong.
CREATE STATISTICS s_address ON city, zip_code FROM addresses;
ANALYZE addresses;This collects:
ndistinct: distinct count across the column groupdependencies: functional dependencies (zip determines city)mcv: multi-column MCVs
Massive improvement on correlated filters.
EXPLAIN output, decoded
Hash Join (cost=14.50..28.50 rows=100 width=64) (actual time=0.234..1.456 rows=98 loops=1)
Hash Cond: (o.user_id = u.id)
Buffers: shared hit=42 read=8
-> Seq Scan on orders o (cost=0.00..10.00 rows=1000 width=32) (actual rows=1000 loops=1)
Buffers: shared hit=10
-> Hash (cost=2.00..2.00 rows=10 width=32) (actual rows=10 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Seq Scan on users u (cost=0.00..2.00 rows=10 width=32) (actual rows=10 loops=1)
Reading this:
- Bottom-up execution: scan users, build hash table, scan orders, probe hash, output joined rows
- Hash join cost 14.50..28.50: planner estimate
- actual time 0.234..1.456: real time (startup..total per loop)
- rows=100 estimated vs 98 actual: good estimate
- Buffers shared hit=42 read=8: 42 pages from cache, 8 from disk
Signals of trouble
- Row estimate off by 10x or more: stats are wrong. Run
ANALYZE. Consider extended stats. Rows Removed by Filter: index returned tuples that did not match the full predicate. Add the predicate to the index or use a partial index.Heap Fetches: with index only scan, count of times it had to fetch the heap because the page was not all-visible. RunVACUUM.- Nested Loop with high outer count: O(n*m). If outer is 10,000 and inner is index scan, can still be fast (10,000 lookups). If inner is seq scan, disaster.
Sort Method: external merge Disk: 50000kB: sort spilled. Increasework_mem.Batches: 8in a Hash node: hash table did not fit in memory. Increasework_mem.
When the planner is wrong
Three common scenarios:
Correlated columns
WHERE city = 'Bangalore' AND state = 'Karnataka' estimates as freq(city) * freq(state). But every Bangalore row has state Karnataka. Estimate is 10-100x too low. Fix: extended statistics.
Function on indexed column
WHERE date_trunc('day', created_at) = '2026-06-21' cannot use a B-tree on created_at. Either rewrite as WHERE created_at >= '2026-06-21' AND created_at < '2026-06-22' or create an expression index.
Skewed data
99 percent of rows have status = 'closed', 1 percent are open. Planner sees average selectivity. Use MCVs (default 100, increase if needed).
Plan cache and parameter sniffing
Prepared statements are planned once and reused. Postgres tries to detect when re-planning is needed (generic vs custom plan, decided by plan_cache_mode). If your queries have wildly different parameter selectivities, force custom plans:
SET plan_cache_mode = force_custom_plan;Hints, or the lack thereof
For actual hints, install the pg_hint_plan extension. Use sparingly.
Reading EXPLAIN like a senior
- Look at total execution time at the bottom
- Find the slowest node (often a Sort, Hash, or large Seq Scan)
- Check estimated vs actual rows at that node
- Check Buffers: how much disk I/O
- If estimate is wrong, fix stats. If estimate is right, fix the plan (index, rewrite).
Common pitfalls
EXPLAINwithoutANALYZE: just estimates, can be deeply wrongANALYZEon a transaction that modifies data: it runs the query for real- Not running
ANALYZEafter bulk loads: stats are stale, planner uses pre-load row counts - Trusting plan from dev DB: prod has different data distribution, plan will differ
The senior take
The planner is a cost model fed by statistics. Garbage stats give garbage plans. Run ANALYZE after bulk changes, set up extended statistics for correlated columns, set random_page_cost to match your storage, and read EXPLAIN from the bottom up. Most "Postgres is slow" tickets are missing indexes or stale stats, not planner bugs.
Learn more
- DocsPostgreSQL: The Planner OptimizerPostgreSQL
- Docs
- Talk