Leader election

Mechanisms, failure modes, fencing, lease vs lock semantics, real patterns from production systems.

Why elect a leader at all

Leaderless systems exist (Dynamo, Cassandra) and avoid election entirely. But many problems become easier with a leader:

Sequencing writes (give them a total order).
Coordinating background jobs (only one cron runs at a time).
Managing shared resources (one process owns the lock).
Driving replication (followers learn from leader).
Garbage collection, leader-based load balancing, leader-based heartbeating.

The leader is a coordination point. Whether you need one depends on the problem. If you do, you need real election, not heartbeats.

What can go wrong without consensus

Naive election: nodes ping each other, lowest IP wins.

Failure modes:

Network partition: both sides think they are the only survivors. Both elect leaders. Split-brain. Writes diverge.
Slow node: appears dead, gets replaced, comes back with stale state.
Clock skew: timeouts fire on one side faster, leadership flaps.
Asymmetric partitions: A can see B, B can't see A. B thinks A is dead while A still sees B, so the two disagree about who should lead.

The fix is consensus: only a majority can elect a leader. Minority cannot make progress.
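
A minimal sketch of the majority rule, assuming every node knows the fixed cluster size. The function names are illustrative and not taken from any library. With five nodes split 3/2, only the three-node side may elect.

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def can_elect(reachable_nodes: int, cluster_size: int) -> bool:
    """A partition side may elect a leader only if it holds a majority.

    reachable_nodes counts the node itself plus every peer it can reach.
    """
    return reachable_nodes >= quorum_size(cluster_size)
```

Two disjoint sides can never both reach the quorum size, which is what rules out two leaders being elected in the same term.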

Coordinator-service pattern

Use ZooKeeper, etcd, or Consul. They run Raft (or ZAB for ZK) internally. Your app interacts via primitives.

Etcd election

Etcd exposes a Campaign primitive:

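// Assumes: import "go.etcd.io/etcd/client/v3/concurrency", plus an existing
// *clientv3.Client named client and a context.Context named ctx.
session, err := concurrency.NewSession(client, concurrency.WithTTL(15))
if err != nil {
    return err
}
defer session.Close()

election := concurrency.NewElection(session, "/myapp/leader")
// Campaign blocks until this node becomes leader or ctx is cancelled.
if err := election.Campaign(ctx, "node-1"); err != nil {
    return err
}
// I am leader. The session's background keepalives renew the lease.
// If they stop for longer than the TTL, leadership is lost; watch session.Done().
defer election.Resign(context.Background())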

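Etcd uses a lease (the session) under the hood. If your node crashes or pauses past the TTL, the lease expires and leadership passes to the next campaigner in the queue. Etcd guarantees that at most one campaigner holds the election key at a time. A paused ex-leader can still briefly believe it is leader, which is why the fencing tokens discussed later matter.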

ZooKeeper election

ZooKeeper's recipe: create ephemeral sequential znodes. Lowest sequence number is leader. Each node watches the one immediately before it.

/election/node_000000001  <- leader
/election/node_000000002  <- watches 001
/election/node_000000003  <- watches 002

When the leader's session ends (session timeout or explicit close), its znode disappears. The watcher fires, and the next node checks whether it now holds the lowest sequence number and, if so, becomes leader.

This is the "herd-avoiding" recipe: only one node gets notified per leader change, not all of them.
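
The decision each participant makes can be sketched as a pure function over the list of child znodes. This is a sketch under assumptions: the helper names are hypothetical, and fetching the children and setting the watch (done through a ZooKeeper client) are left out.

```python
from typing import List, Optional, Tuple


def sequence_number(znode: str) -> int:
    """Extract the numeric suffix ZooKeeper appends to sequential znodes."""
    return int(znode.rsplit("_", 1)[1])


def leader_and_watch(children: List[str], me: str) -> Tuple[bool, Optional[str]]:
    """Return (am_i_leader, znode_to_watch) for this participant.

    children: znode names under the election path, in any order.
    me: the name of the znode this process created.
    """
    ordered = sorted(children, key=sequence_number)
    index = ordered.index(me)
    if index == 0:
        return True, None              # lowest sequence number: leader
    return False, ordered[index - 1]   # watch only the immediate predecessor
```

When the watch fires, re-list the children and call the function again rather than assuming leadership: the predecessor that disappeared may have been a non-leader.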

Consul election

Consul has a sessions API plus a KV store. Acquire a session-locked KV key. If session expires, key is released.

PUT /v1/kv/leader?acquire=session-id

Same idea: lease-based, expires on failure.

Embedded Raft pattern

Your application runs Raft directly. No external dependency. Used by CockroachDB, TiKV, Kafka KRaft, RethinkDB.

Tradeoffs:

Pro: no operational dependency on a separate cluster.
Pro: tighter integration, lower latency.
Con: you must operate Raft yourself.
Con: bigger blast radius (Raft bugs take down your app).

Use libraries: hashicorp/raft (Go), apache/ratis (Java), tikv/raft-rs (Rust).

Lease vs lock

Subtle distinction:

Lock: I hold this resource until I release it. If I crash, the system must figure out and release.
Lease: I hold this resource for at most T seconds, then it auto-releases. If I want to keep it, I renew.

Leases are safer. Locks without expiration are a deadlock waiting to happen (process crashes holding lock, lock never released, system stalls).

Practically all real distributed lock systems use leases: Redis SET NX EX, etcd leases, ZooKeeper ephemeral nodes (tied to a session timeout). Even when called "locks," they expire.

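The lease duration bounds the recovery time: after a crash, nobody else can take over until the lease expires. Shorter lease = faster failover, but more renewal traffic and more spurious failovers on brief stalls. Longer lease = slower failover, more downtime on crash. Typical: 5-30 seconds.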
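
To make the lease semantics concrete, here is a minimal in-memory sketch. It assumes a single process acting as the lease service, and the class and its clock parameter are illustrative only. A real system keeps this state in a replicated store.

```python
import time
from typing import Callable, Optional


class Lease:
    """In-memory lease: at most one holder, auto-expiring after ttl seconds."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self.holder: Optional[str] = None
        self.expires_at = 0.0

    def _expired(self) -> bool:
        return self.holder is None or self.clock() >= self.expires_at

    def acquire(self, node: str) -> bool:
        """Grant the lease only if it is free or has expired."""
        if not self._expired():
            return False
        self.holder = node
        self.expires_at = self.clock() + self.ttl
        return True

    def renew(self, node: str) -> bool:
        """Extend the lease; fails once the node's lease has expired or moved."""
        if self.holder == node and not self._expired():
            self.expires_at = self.clock() + self.ttl
            return True
        return False
```

The default clock is monotonic, so wall-clock adjustments on the lease service do not shorten or lengthen a lease. A crashed holder simply stops calling renew, and the next acquire succeeds once ttl seconds have passed since the last renewal.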

Fencing tokens

The hard truth: lease-based leader election does not give you exclusive access to anything. It gives you a strong probabilistic guarantee that there is one leader, but not absolute exclusion.

Scenario:

Leader A holds lease, working.
Leader A's process pauses (GC pause, network glitch, etc).
Lease expires.
Leader B is elected, starts work.
Leader A wakes up, still thinks it's leader (it has not yet checked the clock or learned that the lease expired).
A and B both write to shared storage.

Fix: fencing tokens. Each leader gets a monotonically increasing term/epoch. Every operation includes the token. Storage rejects operations with stale tokens.

Leader A writes with token=5: ok, stored token now=5
Leader B writes with token=6: ok, stored token now=6
Leader A writes with token=5: rejected, current=6

ZooKeeper's zxid is a fencing token. Etcd revisions are. Spanner's transaction IDs are. Custom systems must build this in.
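
The storage-side check can be sketched as follows. This is an illustrative in-memory store with hypothetical names, assuming the storage layer sees every write and can remember the highest token it has accepted.

```python
class StaleTokenError(Exception):
    """Raised when a write carries a token older than one already accepted."""


class FencedStore:
    """Key-value store that rejects writes from leaders with stale tokens."""

    def __init__(self) -> None:
        self.highest_token = 0
        self.data = {}

    def write(self, token: int, key: str, value: str) -> None:
        if token < self.highest_token:
            raise StaleTokenError(
                f"token {token} rejected, current={self.highest_token}"
            )
        self.highest_token = token  # remember the newest leader seen so far
        self.data[key] = value
```

Replaying the trace above: a write with token 5 succeeds, a write with token 6 succeeds, and a later write with token 5 raises StaleTokenError. The check has to live in the storage layer, because the paused leader cannot be trusted to notice that it has been replaced.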

Without fencing, your election protocol does not actually prevent concurrent writes. Martin Kleppmann's "How to do distributed locking" post is the canonical explanation, with a famous critique of the Redlock algorithm for missing this.

Election storms

A subtle failure: every leader change triggers a flurry of activity (cache warmup, connection rebuild, state recovery). If leaders churn, the cluster spends all its time recovering and never making progress.

Causes:

Lease too short: small network blips trigger failover.
Heartbeat path saturated: leader can't get heartbeats out, gets demoted, re-elected, repeat.
Buggy health check: false positives bounce the leader.

Mitigations:

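Pre-vote (Raft extension): before starting an election, a node asks its peers whether they would vote for it, and only proceeds if a majority says yes. A node that cannot win never disrupts the current leader.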
Backoff: after a recent failover, wait longer before another.
Separate heartbeat from data plane: dedicated low-latency path for heartbeats.
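
A rough sketch of the pre-vote check, under simplifying assumptions: log freshness is compared by index only (real Raft compares the last log term first), and peer state is passed in directly instead of being gathered over RPC. All names here are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PeerState:
    last_leader_contact: float  # time this peer last heard from a leader
    last_log_index: int         # index of this peer's newest log entry


def grants_pre_vote(peer: PeerState, now: float, election_timeout: float,
                    candidate_last_log_index: int) -> bool:
    """A peer says yes only if it has lost the leader and the candidate is current."""
    leader_seems_dead = now - peer.last_leader_contact >= election_timeout
    candidate_up_to_date = candidate_last_log_index >= peer.last_log_index
    return leader_seems_dead and candidate_up_to_date


def should_start_election(peers: List[PeerState], now: float,
                          election_timeout: float,
                          candidate_last_log_index: int) -> bool:
    """Start a real election only if a majority, counting self, would vote yes."""
    grants = 1 + sum(
        grants_pre_vote(p, now, election_timeout, candidate_last_log_index)
        for p in peers
    )
    cluster_size = len(peers) + 1
    return grants >= cluster_size // 2 + 1
```

A node returning from a partition sees peers that recently heard from the leader, collects no grants, and so never increments its term or forces the healthy leader to step down.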

CockroachDB and TiKV both had election storm incidents in their early years. Now both use pre-vote and separate heartbeat paths.

Multiple leaders by design

Sometimes you want multiple leaders, each for a subset of work. Partition-leader pattern:

Each shard has its own leader.
Leaders elected per shard.
Load distributed across N leaders, not concentrated on one.

Kafka does this: each partition has a leader, leaders distributed across brokers.

CockroachDB does this: each range has its own Raft group and leader, leaders rebalanced across nodes.

This avoids the single-leader bottleneck. Cost: more election overhead (N partitions = N election groups).
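
The load-spreading goal can be illustrated with a simple least-loaded assignment. This is only a sketch of the balancing idea with hypothetical names. It is not how Kafka or CockroachDB place leaders, and it replaces per-shard election with a static assignment to keep the example short.

```python
from collections import Counter
from typing import Dict, List


def assign_shard_leaders(shards: List[str], nodes: List[str]) -> Dict[str, str]:
    """Give each shard a leader, always picking the least-loaded node."""
    load = Counter({node: 0 for node in nodes})
    leaders: Dict[str, str] = {}
    for shard in shards:
        # Break ties by node name so the result is deterministic.
        node = min(nodes, key=lambda n: (load[n], n))
        leaders[shard] = node
        load[node] += 1
    return leaders
```

With six shards and three nodes, each node ends up leading two shards, so no single node carries all the leader traffic.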

Practical patterns

Cron job that runs once

You have a fleet of N app servers, want one (and only one) to run a cron job at midnight.

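Naive: pick by hostname hash. Breaks if that host is down. Better: every server tries to acquire an etcd lock at midnight. The first to acquire runs the job and releases the lock when done. Because a waiting server can acquire the lock after that release, the job should also record a completion marker (for example a key named for the date) that later holders check before running.

import etcd3  # python-etcd3; the original names no client library, so this is an assumption

etcd_client = etcd3.client()
with etcd_client.lock("/cron/nightly-export", ttl=600) as lock:  # waits up to 10 s by default
    if lock.is_acquired():
        run_nightly_export()  # should first check a "done for today" marker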
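
A lock alone does not stop a second server from running the job after the first one releases it. One way to close that gap is a per-day completion marker, checked while holding the lock. The sketch below uses a plain dict to stand in for the coordination store, and the key layout and function name are assumptions made for illustration.

```python
import datetime
from typing import Callable, Dict


def run_once_per_day(markers: Dict[str, str], job_name: str,
                     today: datetime.date, node: str,
                     job: Callable[[], None]) -> bool:
    """Run job unless a completion marker for today already exists.

    Call this only while holding the lock, so the check-then-set is safe.
    Returns True if this call ran the job.
    """
    marker_key = f"/cron/{job_name}/done/{today.isoformat()}"
    if marker_key in markers:
        return False
    job()
    markers[marker_key] = node  # written only after the job succeeded
    return True
```

Writing the marker after the job means a crash mid-run leaves no marker, so another server retries. That gives at-least-once execution, and the job itself should tolerate a rerun.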

Singleton background worker

You have a fleet, want exactly one worker process consuming a queue.

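import etcd3  # python-etcd3 (an assumption); its TTL lock stands in for the election call

etcd = etcd3.client()
lock = etcd.lock("/workers/queue-processor", ttl=30)
lock.acquire(timeout=None)  # block until this process holds the lock
# I am the singleton. Keep renewing the lease, and stop once it is lost.
while lock.is_acquired():
    lock.refresh()              # renew before working, not after
    msg = queue.pop(timeout=5)  # hypothetical queue API; a bounded wait keeps renewals regular
    if msg is not None:
        process(msg)            # must finish well within the 30 s TTL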

Distributed cache invalidator

You want one node to broadcast cache invalidations to N edge servers.

Same pattern: elect a leader, leader does the broadcasting.

Anti-patterns

"Heartbeat between nodes, lowest IP wins." Works in dev, splits brain in prod. Use real consensus.
"Database row as a lock." A row in MySQL with is_leader = 1 and a last_heartbeat column. Same failure modes as the heartbeat approach above, plus a database round-trip.
"Redlock as exclusive lock." Redlock issues no fencing token, so on its own it cannot guarantee exclusive access. Use fencing tokens.
"Re-elect on every minor blip." Aggressive timeouts cause election storms. 10-30s lease is a good baseline.
"Forget about clock skew." If your election depends on time synchronization, you need NTP, and even then skew of ~10ms or more is possible. Prefer logical timestamps (terms, zxid) for ordering.

Learn more

Paper: Raft paper, Ongaro and Ousterhout
Docs: ZooKeeper recipes, ZooKeeper
Article: How to do distributed locking, Martin Kleppmann
Docs: Etcd documentation, etcd
Repo: Hashicorp Raft library, Hashicorp
