Sharding and partitioning
Hash vs range, consistent hashing, shard key archaeology, hot shard mitigation, cross-shard queries and transactions.
Why shard
You shard when one node can't hold the working set, serve the load, or be scaled vertically. The trigger is usually one of:
- Storage: data exceeds the largest disk you can attach.
- Write throughput: one node can't keep up with the write rate, even with batching.
- Read throughput: read replicas are not enough, or you need write scaling.
- Failure blast radius: one shard going down should not take everything down.
If you can avoid sharding by buying a bigger box or adding read replicas, do that. Sharding adds permanent operational complexity. The decision is one-way for years.
Partition strategies
Three main ways:
- Hash partitioning: hash(key) determines shard. Uniform load. Range queries require scatter-gather.
- Range partitioning: contiguous key ranges per shard. Range queries are local. Hot ranges (recent timestamps) are a risk.
- Directory/lookup: explicit mapping from key to shard. Maximum flexibility. The directory becomes a SPOF.
Hash is the default for OLTP. Range is for time-series and ordered datasets. Directory is for migration phases and ad-hoc rebalancing.
Consistent hashing in detail
Naive hash partitioning: shard = hash(key) % N. If you change N (add or remove a node), nearly all keys get remapped. For a 100-node cluster adding 1 node, 99% of keys move. Unusable.
Consistent hashing maps both keys and nodes to a ring (typically 0 to 2^32 - 1). Each key is owned by the next node clockwise on the ring. Adding a node only takes keys from one neighbor.
Variants:
- Pure consistent hashing: nodes hash to single points. Load is uneven (max load ~log(N) times average).
- Virtual nodes (vnodes): each physical node maps to K points on the ring. Smooths load. Cassandra uses 256 vnodes per node by default.
- Jump consistent hash (Google, 2014): no ring at all, just a deterministic mapping. Faster, but only supports adding/removing the last bucket.
- Rendezvous hashing: hash(key, node_id) for every node, pick the max. Simple, good for small N.
Shard key archaeology
The shard key is the most consequential decision. It determines:
- Load distribution (uniform vs hot shards).
- Query patterns (single-shard vs scatter-gather).
- Transaction scope (single-shard vs distributed).
- Resharding ease (can you split without rewriting).
Properties of a good shard key:
- High cardinality. Millions of distinct values, not dozens.
- Uniform distribution. Skew creates hot shards.
- Query alignment. Most queries filter on it.
- Stable. Should not change after row is created (otherwise the row moves).
Common shard keys and their issues:
| Key | Pro | Con |
|---|---|---|
| user_id | Cardinality, tenant isolation | Celebrity problem (Bieber effect) |
| timestamp | Time-series friendly | All writes hit latest shard (hot) |
| auto-increment ID | Simple | All writes hit latest shard, predictable enumeration |
| hash(user_id) | Uniform | Loses user_id locality for queries |
| (tenant_id, user_id) | Tenant isolation, can split tenants | Schema becomes load-bearing |
Discord shards messages by (channel_id, bucket) where bucket is a time bucket. Channel locality + bounded shard size + reasonable distribution.
Stripe sharded by merchant_id initially. Worked until some merchants got huge. They had to migrate to (merchant_id, hash) composite. Took years and a custom migration tooling stack.
Hot shards: causes and mitigations
A hot shard takes disproportionate load. 5% of shards, 80% of traffic. Symptoms: high p99 latency on those shards, replication lag, cascading failures.
Causes:
- Skewed shard key (tenant size variance).
- Sequential keys (timestamp, auto-increment).
- Celebrity effect (one user, massive read traffic).
- Bot traffic, DDOS, viral content.
Mitigations:
- Salt the key: append random suffix to spread one logical entity across N shards. Reads must fan out to N. Trade-off: write scale, read complexity.
- Read replicas for the hot shard: scales reads, not writes.
- Caching layer: Redis in front of hot shards.
- Dedicated hardware for hot shards: keep the same key, give it a beefier node.
- Resharding to split the hot shard: surgery but works.
Twitter's celebrity fix was a hybrid: tweets sharded by user_id, but reads for celebrities went through a fan-out cache (Manhattan + custom cache layer).
Cross-shard queries
The moment you shard, queries that don't filter by shard key become scatter-gather:
- Send query to all shards in parallel.
- Gather partial results.
- Merge, sort, paginate at the coordinator.
This is slow (limited by slowest shard, p99 not p50) and expensive (every shard pays). Avoid by:
- Choosing a shard key that matches your dominant query pattern.
- Denormalizing: store data twice, once per query pattern.
- Secondary indexes that are themselves sharded by the query key.
- Materialized views that aggregate across shards.
Cassandra famously punishes you for this. Materialized views are the official answer; in practice teams write data twice manually.
Cross-shard transactions
Distributed transactions across shards require 2PC or a saga pattern. Both are painful.
- 2PC: classic distributed transaction. Coordinator orchestrates prepare and commit. Blocks on failure. Limited throughput. Used inside Spanner, Vitess, CockroachDB internally.
- Saga: break transaction into local transactions plus compensating actions. No atomicity, eventual consistency. Used by microservices.
For SaaS multi-tenant: design so transactions never cross tenants. Then sharding by tenant_id gives you single-shard transactions for free.
Online resharding
You will eventually need to split or merge shards. Options:
-
Pre-allocate virtual shards: hash space split into 4096 virtual shards mapped to N physical nodes. Adding capacity = remap some virtual shards to new nodes. Cassandra and DynamoDB do this.
-
Online split: copy data while serving writes, then atomic switchover. Vitess split workflow:
- Take source shard, dual-write to two new target shards.
- Backfill historical data from source to targets.
- Verify checksums match.
- Switch reads to targets.
- Switch writes to targets.
- Decommission source.
-
Cold migration: script-driven, take downtime. Last resort.
Stripe's online migration playbook is the canonical write-up. Dual-write, backfill, verify, cut over reads, cut over writes, archive source. Months of work per migration.
Operational gotchas
- Hot shard hides until production. Synthetic load is uniform. Real users are zipfian.
- Cross-shard joins look fine in dev (small data, all on one shard) and explode in prod.
- Backups across shards are not transactionally consistent unless you snapshot atomically.
- Schema migrations across shards: deploy schema change to all shards before deploying code that uses it. Otherwise you race.
- Capacity planning: each shard needs headroom for the moment when an adjacent shard fails and load spills over.
When not to shard
Default to NOT sharding. Sharding adds:
- Operational complexity (monitoring per shard, alerting, rebalancing).
- Application complexity (shard routing, cross-shard queries).
- Failure modes (shard down, shard imbalance, shard split-brain).
- Migration costs (changing shard key is a multi-year project at scale).
Cheaper alternatives:
- Vertical scaling: 96-core, 768GB RAM, 100TB NVMe servers exist.
- Read replicas: scale reads without sharding.
- Caching: Redis or memcached in front absorbs read load.
- Workload separation: OLTP on one cluster, analytics on another.
- Tiered storage: hot data on SSD, cold on S3.
Postgres can serve millions of QPS on a single beefy node. Discord ran on a single MongoDB primary for years before sharding. Notion ran on one Postgres for a long time.
The rule: shard when you can prove you've exhausted vertical scale, not before.
What to read next
- DDIA chapter 6. The foundation.
- Vitess docs for production-grade sharding patterns.
- Stripe's online migration blog for the canonical resharding playbook.
- Discord's trillion-messages blog for shard key choices under extreme load.
Learn more
- ArticleDesigning Data-Intensive Applications, Chapter 6Martin Kleppmann
- DocsVitess shardingVitess
- ArticleStripe online migrationsStripe
- PaperConsistent hashing paperAkamai/MIT
- ArticleDiscord trillion messagesDiscord