Sharding and partitioning

Hash vs range, consistent hashing, shard key archaeology, hot shard mitigation, cross-shard queries and transactions.

Why shard

You shard when one node can't hold the working set, serve the load, or be scaled vertically. The trigger is usually one of:

Storage: data exceeds the largest disk you can attach.
Write throughput: one node can't keep up with the write rate, even with batching.
Read throughput: read replicas are not enough, or you need write scaling.
Failure blast radius: one shard going down should not take everything down.

If you can avoid sharding by buying a bigger box or adding read replicas, do that. Sharding adds permanent operational complexity. The decision is one-way for years.

Partition strategies

Three main ways:

Hash partitioning: hash(key) determines shard. Uniform load. Range queries require scatter-gather.
Range partitioning: contiguous key ranges per shard. Range queries are local. Hot ranges (recent timestamps) are a risk.
Directory/lookup: explicit mapping from key to shard. Maximum flexibility. The directory becomes a SPOF.

Hash is the default for OLTP. Range is for time-series and ordered datasets. Directory is for migration phases and ad-hoc rebalancing.

Consistent hashing in detail

Naive hash partitioning: shard = hash(key) % N. If you change N (add or remove a node), nearly all keys get remapped. For a 100-node cluster adding 1 node, 99% of keys move. Unusable.

Consistent hashing maps both keys and nodes to a ring (typically 0 to 2^32 - 1). Each key is owned by the next node clockwise on the ring. Adding a node only takes keys from one neighbor.

Variants:

Pure consistent hashing: nodes hash to single points. Load is uneven (max load ~log(N) times average).
Virtual nodes (vnodes): each physical node maps to K points on the ring. Smooths load. Cassandra uses 256 vnodes per node by default.
Jump consistent hash (Google, 2014): no ring at all, just a deterministic mapping. Faster, but only supports adding/removing the last bucket.
Rendezvous hashing: hash(key, node_id) for every node, pick the max. Simple, good for small N.

Consistent hashing with virtual nodes

Shard key archaeology

The shard key is the most consequential decision. It determines:

Load distribution (uniform vs hot shards).
Query patterns (single-shard vs scatter-gather).
Transaction scope (single-shard vs distributed).
Resharding ease (can you split without rewriting).

Properties of a good shard key:

High cardinality. Millions of distinct values, not dozens.
Uniform distribution. Skew creates hot shards.
Query alignment. Most queries filter on it.
Stable. Should not change after row is created (otherwise the row moves).

Common shard keys and their issues:

Key	Pro	Con
user_id	Cardinality, tenant isolation	Celebrity problem (Bieber effect)
timestamp	Time-series friendly	All writes hit latest shard (hot)
auto-increment ID	Simple	All writes hit latest shard, predictable enumeration
hash(user_id)	Uniform	Loses user_id locality for queries
(tenant_id, user_id)	Tenant isolation, can split tenants	Schema becomes load-bearing

Discord shards messages by (channel_id, bucket) where bucket is a time bucket. Channel locality + bounded shard size + reasonable distribution.

Stripe sharded by merchant_id initially. Worked until some merchants got huge. They had to migrate to (merchant_id, hash) composite. Took years and a custom migration tooling stack.

Hot shards: causes and mitigations

A hot shard takes disproportionate load. 5% of shards, 80% of traffic. Symptoms: high p99 latency on those shards, replication lag, cascading failures.

Causes:

Skewed shard key (tenant size variance).
Sequential keys (timestamp, auto-increment).
Celebrity effect (one user, massive read traffic).
Bot traffic, DDOS, viral content.

Mitigations:

Salt the key: append random suffix to spread one logical entity across N shards. Reads must fan out to N. Trade-off: write scale, read complexity.
Read replicas for the hot shard: scales reads, not writes.
Caching layer: Redis in front of hot shards.
Dedicated hardware for hot shards: keep the same key, give it a beefier node.
Resharding to split the hot shard: surgery but works.

Twitter's celebrity fix was a hybrid: tweets sharded by user_id, but reads for celebrities went through a fan-out cache (Manhattan + custom cache layer).

Cross-shard queries

The moment you shard, queries that don't filter by shard key become scatter-gather:

Send query to all shards in parallel.
Gather partial results.
Merge, sort, paginate at the coordinator.

This is slow (limited by slowest shard, p99 not p50) and expensive (every shard pays). Avoid by:

Choosing a shard key that matches your dominant query pattern.
Denormalizing: store data twice, once per query pattern.
Secondary indexes that are themselves sharded by the query key.
Materialized views that aggregate across shards.

Cassandra famously punishes you for this. Materialized views are the official answer; in practice teams write data twice manually.

Cross-shard transactions

Distributed transactions across shards require 2PC or a saga pattern. Both are painful.

2PC: classic distributed transaction. Coordinator orchestrates prepare and commit. Blocks on failure. Limited throughput. Used inside Spanner, Vitess, CockroachDB internally.
Saga: break transaction into local transactions plus compensating actions. No atomicity, eventual consistency. Used by microservices.

For SaaS multi-tenant: design so transactions never cross tenants. Then sharding by tenant_id gives you single-shard transactions for free.

Online resharding

You will eventually need to split or merge shards. Options:

Pre-allocate virtual shards: hash space split into 4096 virtual shards mapped to N physical nodes. Adding capacity = remap some virtual shards to new nodes. Cassandra and DynamoDB do this.
Online split: copy data while serving writes, then atomic switchover. Vitess split workflow:
- Take source shard, dual-write to two new target shards.
- Backfill historical data from source to targets.
- Verify checksums match.
- Switch reads to targets.
- Switch writes to targets.
- Decommission source.
Cold migration: script-driven, take downtime. Last resort.

Stripe's online migration playbook is the canonical write-up. Dual-write, backfill, verify, cut over reads, cut over writes, archive source. Months of work per migration.

Stripe-style online resharding

Operational gotchas

Hot shard hides until production. Synthetic load is uniform. Real users are zipfian.
Cross-shard joins look fine in dev (small data, all on one shard) and explode in prod.
Backups across shards are not transactionally consistent unless you snapshot atomically.
Schema migrations across shards: deploy schema change to all shards before deploying code that uses it. Otherwise you race.
Capacity planning: each shard needs headroom for the moment when an adjacent shard fails and load spills over.

When not to shard

Default to NOT sharding. Sharding adds:

Operational complexity (monitoring per shard, alerting, rebalancing).
Application complexity (shard routing, cross-shard queries).
Failure modes (shard down, shard imbalance, shard split-brain).
Migration costs (changing shard key is a multi-year project at scale).

Cheaper alternatives:

Vertical scaling: 96-core, 768GB RAM, 100TB NVMe servers exist.
Read replicas: scale reads without sharding.
Caching: Redis or memcached in front absorbs read load.
Workload separation: OLTP on one cluster, analytics on another.
Tiered storage: hot data on SSD, cold on S3.

Postgres can serve millions of QPS on a single beefy node. Discord ran on a single MongoDB primary for years before sharding. Notion ran on one Postgres for a long time.

The rule: shard when you can prove you've exhausted vertical scale, not before.

Learn more

Article
Designing Data-Intensive Applications, Chapter 6Martin Kleppmann
Docs
Vitess shardingVitess
Article
Stripe online migrationsStripe
Paper
Consistent hashing paperAkamai/MIT
Article
Discord trillion messagesDiscord

Deep dive15 min read← Back to crisp

Sharding and partitioning

Hash vs range, consistent hashing, shard key archaeology, hot shard mitigation, cross-shard queries and transactions.

Why shard

You shard when one node can't hold the working set, serve the load, or be scaled vertically. The trigger is usually one of:

Storage: data exceeds the largest disk you can attach.
Write throughput: one node can't keep up with the write rate, even with batching.
Read throughput: read replicas are not enough, or you need write scaling.
Failure blast radius: one shard going down should not take everything down.

If you can avoid sharding by buying a bigger box or adding read replicas, do that. Sharding adds permanent operational complexity. The decision is one-way for years.

Partition strategies

Three main ways:

Hash partitioning: hash(key) determines shard. Uniform load. Range queries require scatter-gather.
Range partitioning: contiguous key ranges per shard. Range queries are local. Hot ranges (recent timestamps) are a risk.
Directory/lookup: explicit mapping from key to shard. Maximum flexibility. The directory becomes a SPOF.

Hash is the default for OLTP. Range is for time-series and ordered datasets. Directory is for migration phases and ad-hoc rebalancing.

Consistent hashing in detail

Naive hash partitioning: shard = hash(key) % N. If you change N (add or remove a node), nearly all keys get remapped. For a 100-node cluster adding 1 node, 99% of keys move. Unusable.

Consistent hashing maps both keys and nodes to a ring (typically 0 to 2^32 - 1). Each key is owned by the next node clockwise on the ring. Adding a node only takes keys from one neighbor.

Variants:

Pure consistent hashing: nodes hash to single points. Load is uneven (max load ~log(N) times average).
Virtual nodes (vnodes): each physical node maps to K points on the ring. Smooths load. Cassandra uses 256 vnodes per node by default.
Jump consistent hash (Google, 2014): no ring at all, just a deterministic mapping. Faster, but only supports adding/removing the last bucket.
Rendezvous hashing: hash(key, node_id) for every node, pick the max. Simple, good for small N.

Consistent hashing with virtual nodes

Shard key archaeology

The shard key is the most consequential decision. It determines:

Load distribution (uniform vs hot shards).
Query patterns (single-shard vs scatter-gather).
Transaction scope (single-shard vs distributed).
Resharding ease (can you split without rewriting).

Properties of a good shard key:

High cardinality. Millions of distinct values, not dozens.
Uniform distribution. Skew creates hot shards.
Query alignment. Most queries filter on it.
Stable. Should not change after row is created (otherwise the row moves).

Common shard keys and their issues:

Key	Pro	Con
user_id	Cardinality, tenant isolation	Celebrity problem (Bieber effect)
timestamp	Time-series friendly	All writes hit latest shard (hot)
auto-increment ID	Simple	All writes hit latest shard, predictable enumeration
hash(user_id)	Uniform	Loses user_id locality for queries
(tenant_id, user_id)	Tenant isolation, can split tenants	Schema becomes load-bearing

Discord shards messages by (channel_id, bucket) where bucket is a time bucket. Channel locality + bounded shard size + reasonable distribution.

Stripe sharded by merchant_id initially. Worked until some merchants got huge. They had to migrate to (merchant_id, hash) composite. Took years and a custom migration tooling stack.

Hot shards: causes and mitigations

A hot shard takes disproportionate load. 5% of shards, 80% of traffic. Symptoms: high p99 latency on those shards, replication lag, cascading failures.

Causes:

Skewed shard key (tenant size variance).
Sequential keys (timestamp, auto-increment).
Celebrity effect (one user, massive read traffic).
Bot traffic, DDOS, viral content.

Mitigations:

Salt the key: append random suffix to spread one logical entity across N shards. Reads must fan out to N. Trade-off: write scale, read complexity.
Read replicas for the hot shard: scales reads, not writes.
Caching layer: Redis in front of hot shards.
Dedicated hardware for hot shards: keep the same key, give it a beefier node.
Resharding to split the hot shard: surgery but works.

Twitter's celebrity fix was a hybrid: tweets sharded by user_id, but reads for celebrities went through a fan-out cache (Manhattan + custom cache layer).

Cross-shard queries

The moment you shard, queries that don't filter by shard key become scatter-gather:

Send query to all shards in parallel.
Gather partial results.
Merge, sort, paginate at the coordinator.

This is slow (limited by slowest shard, p99 not p50) and expensive (every shard pays). Avoid by:

Choosing a shard key that matches your dominant query pattern.
Denormalizing: store data twice, once per query pattern.
Secondary indexes that are themselves sharded by the query key.
Materialized views that aggregate across shards.

Cassandra famously punishes you for this. Materialized views are the official answer; in practice teams write data twice manually.

Cross-shard transactions

Distributed transactions across shards require 2PC or a saga pattern. Both are painful.

2PC: classic distributed transaction. Coordinator orchestrates prepare and commit. Blocks on failure. Limited throughput. Used inside Spanner, Vitess, CockroachDB internally.
Saga: break transaction into local transactions plus compensating actions. No atomicity, eventual consistency. Used by microservices.

For SaaS multi-tenant: design so transactions never cross tenants. Then sharding by tenant_id gives you single-shard transactions for free.

Online resharding

You will eventually need to split or merge shards. Options:

Pre-allocate virtual shards: hash space split into 4096 virtual shards mapped to N physical nodes. Adding capacity = remap some virtual shards to new nodes. Cassandra and DynamoDB do this.
Online split: copy data while serving writes, then atomic switchover. Vitess split workflow:
- Take source shard, dual-write to two new target shards.
- Backfill historical data from source to targets.
- Verify checksums match.
- Switch reads to targets.
- Switch writes to targets.
- Decommission source.
Cold migration: script-driven, take downtime. Last resort.

Stripe's online migration playbook is the canonical write-up. Dual-write, backfill, verify, cut over reads, cut over writes, archive source. Months of work per migration.

Stripe-style online resharding

Operational gotchas

Hot shard hides until production. Synthetic load is uniform. Real users are zipfian.
Cross-shard joins look fine in dev (small data, all on one shard) and explode in prod.
Backups across shards are not transactionally consistent unless you snapshot atomically.
Schema migrations across shards: deploy schema change to all shards before deploying code that uses it. Otherwise you race.
Capacity planning: each shard needs headroom for the moment when an adjacent shard fails and load spills over.

When not to shard

Default to NOT sharding. Sharding adds:

Operational complexity (monitoring per shard, alerting, rebalancing).
Application complexity (shard routing, cross-shard queries).
Failure modes (shard down, shard imbalance, shard split-brain).
Migration costs (changing shard key is a multi-year project at scale).

Cheaper alternatives:

Vertical scaling: 96-core, 768GB RAM, 100TB NVMe servers exist.
Read replicas: scale reads without sharding.
Caching: Redis or memcached in front absorbs read load.
Workload separation: OLTP on one cluster, analytics on another.
Tiered storage: hot data on SSD, cold on S3.

Postgres can serve millions of QPS on a single beefy node. Discord ran on a single MongoDB primary for years before sharding. Notion ran on one Postgres for a long time.

The rule: shard when you can prove you've exhausted vertical scale, not before.

Learn more

Article
Designing Data-Intensive Applications, Chapter 6Martin Kleppmann
Docs
Vitess shardingVitess
Article
Stripe online migrationsStripe
Paper
Consistent hashing paperAkamai/MIT
Article
Discord trillion messagesDiscord

Sharding and partitioning

Why shard

Partition strategies

Consistent hashing in detail

Shard key archaeology

Hot shards: causes and mitigations

Cross-shard queries

Cross-shard transactions

Online resharding

Operational gotchas

When not to shard

What to read next

Learn more

Sharding and partitioning

Why shard

Partition strategies

Consistent hashing in detail

Shard key archaeology

Hot shards: causes and mitigations

Cross-shard queries

Cross-shard transactions

Online resharding

Operational gotchas

When not to shard

What to read next

Learn more