Deep dive15 min read← Back to crisp

Partitioning and sharding

Declarative partitioning internals, pruning rules, partition-wise joins, sharding architectures (Citus, app-level, CockroachDB), and the operational reality of distributed Postgres.

Partitioning and sharding are the two answers to "the table is too big." They solve different problems and have very different operational profiles. Most teams confuse them.

Declarative partitioning (since Postgres 10)

Before Postgres 10, partitioning was inheritance plus triggers. Painful. Since Postgres 10, declarative partitioning treats partitions as first-class:

CREATE TABLE measurements (
  id bigserial,
  device_id int NOT NULL,
  recorded_at timestamptz NOT NULL,
  value double precision
) PARTITION BY RANGE (recorded_at);
 
CREATE TABLE measurements_2026_06 PARTITION OF measurements
  FOR VALUES FROM ('2026-06-01') TO ('2026-07-01');
 
CREATE TABLE measurements_2026_07 PARTITION OF measurements
  FOR VALUES FROM ('2026-07-01') TO ('2026-08-01');

Inserts to measurements are routed to the right partition by the partition key. Queries against measurements are pruned to relevant partitions.

Partition pruning

The planner eliminates partitions that cannot match the WHERE clause. Two phases:

Plan-time pruning: WHERE has constants. Pruned during planning.
Execution-time pruning: WHERE has parameters (prepared statement, partition key from join). Pruned during execution.

Partition pruning eliminates non-matching partitions

Verify pruning with EXPLAIN. If you see all partitions scanned, pruning failed.

Partition strategies

Range partitioning

Best for time-series, sequential IDs, anything with natural ordering.

CREATE TABLE events PARTITION BY RANGE (created_at);

Pros: easy to drop old data, predictable size growth. Cons: hot partition (current month) gets all writes.

List partitioning

Best for tenant isolation, region-based sharding within one DB.

CREATE TABLE orders PARTITION BY LIST (country);
CREATE TABLE orders_in PARTITION OF orders FOR VALUES IN ('IN');
CREATE TABLE orders_us PARTITION OF orders FOR VALUES IN ('US');
CREATE TABLE orders_rest PARTITION OF orders DEFAULT;

Pros: physical isolation per tenant or region. Cons: requires explicit partition per value, default partition catches the rest.

Hash partitioning

Best for even distribution when no natural key range.

CREATE TABLE users PARTITION BY HASH (id);
CREATE TABLE users_0 PARTITION OF users FOR VALUES WITH (modulus 4, remainder 0);
CREATE TABLE users_1 PARTITION OF users FOR VALUES WITH (modulus 4, remainder 1);
CREATE TABLE users_2 PARTITION OF users FOR VALUES WITH (modulus 4, remainder 2);
CREATE TABLE users_3 PARTITION OF users FOR VALUES WITH (modulus 4, remainder 3);

Pros: even write distribution. Cons: cannot drop "old" partitions, no logical meaning to a partition.

Sub-partitioning

Each partition can itself be partitioned. Common: range by month, then list by tenant.

CREATE TABLE events PARTITION BY RANGE (created_at);
CREATE TABLE events_2026_06 PARTITION OF events FOR VALUES FROM ('2026-06-01') TO ('2026-07-01')
  PARTITION BY LIST (tenant_id);

Indexes on partitions

Indexes on the parent are automatically created on each partition. Same for primary keys (since Postgres 11) but they must include the partition key.

Local indexes per partition are smaller and faster to maintain. Global indexes (single index across all partitions) do not exist in Postgres.

Partition-wise joins

If two tables are partitioned the same way, Postgres can join partition-to-partition instead of building one big join. Enable:

SET enable_partitionwise_join = on;
SET enable_partitionwise_aggregate = on;

Off by default because it uses more memory. Huge speedup when tables are co-partitioned by a join key.

Operational benefits

Bulk delete: DROP TABLE events_2025_01 is instant. No VACUUM, no bloat, no WAL.
Backups per partition: dump only recent partitions.
Different storage per partition: old partitions on cheap disks (tablespaces).
Parallel maintenance: VACUUM, REINDEX run per partition.

Sharding: when partitioning is not enough

Partitioning helps within one node. When you exceed one node's storage, CPU, or write throughput, you need to spread across nodes. That is sharding.

When you actually need to shard

Modern single-node Postgres can handle:

100TB+ disk (with tiered storage)
100K writes per second
1M reads per second from replicas
64-128 CPUs

Most teams that "need sharding" actually need: better indexes, better hardware, read replicas, connection pooling.

Citus

Citus is a Postgres extension that adds distributed query execution. Architecture:

Coordinator: routes queries, holds metadata
Workers: hold actual data shards
Distributed tables: sharded by a column, e.g., tenant_id
Reference tables: small tables replicated to every worker

SELECT create_distributed_table('events', 'tenant_id');

Multi-tenant SaaS is the killer Citus use case. Each tenant's data lives on one shard, queries are routed by tenant_id, no cross-shard joins.

App-level sharding

You decide the shard. Client library routes by hash of user_id to one of N Postgres instances. Each instance is independent.

Pros: simple per-instance, no extension. Cons: cross-shard queries are your problem. Re-sharding is painful. Schema changes must coordinate across instances.

Used by: Instagram, Notion, Figma (early), most large Postgres deployments.

CockroachDB / YugabyteDB

Distributed SQL with Postgres wire protocol. Native sharding, distributed transactions, multi-region.

Pros: SQL semantics, distributed by default. Cons: not Postgres. Different optimizer, different operational model, sometimes surprising performance.

Sharding trade-offs

Cross-shard transactions: lose ACID across shards unless you use 2-phase commit (slow) or eventual consistency.
Cross-shard joins: either ship data through the coordinator (slow) or denormalize.
Re-sharding: doubling shard count requires re-distributing data. Plan for this from day one.
Hot shards: one tenant dominates. Need per-shard load monitoring and ability to migrate.

Common pitfalls

Partitioning before scale matters. A 10GB table does not need partitioning. Overhead exceeds benefit.
Wrong partition key. Partitioning by a column not in your WHERE means full scans.
Too many partitions. Each partition adds planner overhead. Postgres handles thousands but at hundreds the planner slows down.
Sharding before exhausting one node. Operational complexity 10x. Do it last.
Sharding by entity that crosses shards. Sharding by user_id when most queries filter by tenant_id means every query hits every shard.

Partition tables when you have time-series at scale and want fast deletion plus smaller indexes. Use range-by-time, drop old partitions monthly. Shard only when you have proven a single beefy Postgres cannot keep up after read replicas, connection pooling, and proper indexing. When you shard, pick the key carefully because re-sharding is painful. Most teams do not need sharding; they need better tuning.

Learn more

Docs
PostgreSQL: Table PartitioningPostgreSQL
Docs
Citus Distributed PostgresCitus
Article
Notion: Sharding PostgresNotion Engineering