In revision.
Crisp5 min readGo deeper →

Sharding and partitioning

Split data across nodes by key. Hash for uniformity, range for scans. The shard key choice is permanent and load-bearing.

Sharding splits one logical dataset across multiple physical nodes by a shard key. You shard when one node cannot hold the data, can't serve the write load, or can't be scaled vertically cheaper than horizontally.

Two ways to split

  1. Hash partitioning: hash(key) mod N. Uniform distribution. Range scans require fan-out.
  2. Range partitioning: shard 1 holds keys A-F, shard 2 holds G-M. Range scans are local. Hot ranges are a risk.

Hash is the default. Range is for time-series and ordered scans.

Consistent hashing

Naive hash mod N: when N changes (add/remove a node), every key gets remapped. Catastrophic.

Consistent hashing: map nodes and keys to points on a ring. Each key is owned by the next node clockwise. Adding a node only moves keys from one neighbor. ~1/N of keys remapped, not all.

Used by: DynamoDB, Cassandra, Memcached, Redis Cluster (variant).

Picking the shard key

This is the single most consequential design decision in the system. Once you ship, you cannot easily change it. Stripe took 4 years to migrate one shard key.

Rules:

  • Cardinality: must have many distinct values. user_id good, country_code bad.
  • Distribution: must be uniform. created_at is biased toward recent.
  • Query alignment: queries must filter by shard key. Otherwise every query scatter-gathers.
  • Tenant boundary: for multi-tenant SaaS, shard by tenant_id so one customer's data is on one shard.

Twitter learned this with celebrities. They sharded tweets by user_id. Then Justin Bieber happened. One user, one shard, billions of read requests. They had to special-case.

Resharding is painful

When a shard fills up or gets hot, you must split it. Options:

  1. Pre-split: hash space is 4096 virtual shards mapped to N physical. Add a physical node, remap virtual shards. Cassandra and DynamoDB do this.
  2. Online resharding: copy data while serving writes, switch over atomically. Vitess, MongoDB.
  3. Cold migration: write a script, take downtime. The fallback.

Plan for resharding before you ship. Use vshards from day 1.

Hash partitioning with virtual shards

Learn more