Gossip protocols
Push, pull, push-pull, anti-entropy, SWIM. How gossip actually scales, what it costs, and where it breaks.
Why epidemics
Demers et al at Xerox PARC (1987) showed you can update a replicated database by having each node periodically exchange information with random peers. The math: information spreads exponentially. After O(log N) rounds, every node has seen the update with high probability.
Properties that make gossip useful:
- Decentralized: no single coordinator.
- Robust to failures: if a few nodes go down, the rest still propagate.
- Network-friendly: each round has bounded fanout, no broadcast storms.
- Self-healing: state eventually converges even after partitions.
Limitations:
- Eventually consistent: takes seconds to converge.
- Probabilistic: hard to bound worst-case delivery time.
- Bandwidth grows with state size: gossip becomes expensive for large state.
- Not ordered: no causal or total order out of the box.
Gossip is for cluster meta-state, not for high-throughput application data.
Three gossip styles
-
Push: A sends state to B. B doesn't ask. Fast to converge initially, but redundant once most nodes know.
-
Pull: A asks B for state. B sends what A doesn't have. Slow at the start (need to randomly find someone with the info), fast at the end.
-
Push-pull: A and B exchange. A sends its digest, B sends differences, then A sends differences. Best of both. Cassandra uses this.
For an update of size U over N nodes:
- Push: ~U * N * log N total bytes.
- Pull: ~U * N * log N total bytes after warmup.
- Push-pull: ~U * N * log N, faster convergence.
Anti-entropy vs rumor mongering
Two distinct goals:
- Anti-entropy: full state reconciliation. A and B compare everything, merge differences. Slow per round, guaranteed convergence. Used by Cassandra for periodic catchup.
- Rumor mongering: spread a specific update fast. Each node remembers updates as "hot," propagates aggressively, then "cools" and stops. Fast initial spread, eventually fades.
In practice, systems run both: rumor mongering for fresh updates, anti-entropy for background reconciliation.
SWIM: failure detection done well
Scalable Weakly-consistent Infection-style Process Group Membership protocol (Das, Gupta, Motivala, 2002). The de facto standard for cluster membership and failure detection.
Direct probing
Each node, every T seconds (typically 1s):
- Pick a random alive member M.
- Send a ping to M.
- If reply within timeout, M is alive.
- If no reply, proceed to indirect probing.
Indirect probing
If direct ping fails:
- Pick K random members (typically K=3).
- Ask each to ping M on your behalf.
- If any of them gets a reply, M is alive.
- If none do, M is suspect.
Indirect probing eliminates false positives from network blips between A and M specifically. M might be alive from B's perspective.
Suspicion mechanism
When M is suspect:
- A gossips "M is suspect" to other nodes.
- M, if alive and hears this, broadcasts "I am alive" (refutation).
- If no refutation in suspect_timeout, M is declared dead.
This avoids premature death declarations during transient issues.
Why SWIM scales
Round time T is constant regardless of cluster size. Each node does O(1) work per round. Failure detection time is bounded by suspect_timeout + protocol_period.
For a 1000-node cluster with T=1s, K=3, suspect_timeout=10s, expected detection time is ~10s with very low false positive rate.
Lifeguard improvements
Hashicorp's Lifeguard paper (2018) addresses SWIM weaknesses:
-
Local Health Multiplier: nodes that are overloaded (high local request latency) bump their own timeouts. Avoids spurious suspicions of healthy nodes.
-
Dogpile prevention: if many nodes are suspecting the same target, slow down probing. Don't pile on.
-
Buddy system: each node remembers who pinged it last, gives them slight priority for ack. Reduces tail latency in failure detection.
These are now part of Hashicorp's Memberlist (used by Consul, Serf, Nomad).
Cassandra's gossip
Cassandra uses push-pull gossip to maintain cluster state. Each second, each node:
- Picks 1-3 random nodes.
- Exchanges digests of known state versions.
- Sends/receives full state for entries the other is behind on.
State exchanged includes:
- Endpoint state: schema version, load, gossip version, status (NORMAL, JOINING, LEAVING).
- Application state: anything Cassandra needs to know about each node.
For a 100-node cluster, ~3 KB per gossip exchange, ~3 KB/s outbound per node. Negligible bandwidth.
Phi accrual failure detection is layered on top. Instead of a binary up/down, each node computes a "phi" score for each other node based on inter-arrival times of gossip messages. Higher phi = more likely dead. Configurable threshold.
Convergence math
For push gossip with fanout 1 in a perfect network:
- After round r, expected fraction of nodes that know = 1 - (1 - 1/N)^(2^r)
- For N=1000, ~10 rounds to reach >99% of nodes.
With fanout f and N nodes, expected rounds to reach all = ~log_f(N).
Real networks: add some buffer. 15-20 rounds for a 1000-node cluster with 1s rounds = 15-20 seconds to full convergence.
Bandwidth analysis
Each round per node: K messages of size S each. Total per node: K * S / T bytes/sec.
For SWIM with K=1 ping, S=1 KB, T=1s: 1 KB/s per node. Negligible.
For Cassandra gossip with K=3 exchanges, S=3 KB digest plus deltas, T=1s: ~10 KB/s per node. Still small.
For full state gossip (anti-entropy with large state): can become expensive. Tune frequency.
Failure modes
-
Partitioned cluster: each partition runs its own gossip, declares the other dead. When healed, conflicting state must be reconciled. Cassandra has a generation counter to handle this.
-
Flapping nodes: node oscillates between alive and dead. Causes churn in routing tables, hot/cold caches. Mitigation: hysteresis, longer suspect timeouts.
-
State explosion: if you gossip everything ever, state grows unbounded. Use TTLs, prune old state.
-
Slow convergence: in WAN deployments, RTT dominates. Increase gossip period to amortize cost, accept slower convergence.
-
Membership thrashing: rapidly adding/removing nodes during deploys can confuse gossip. Plan deploys, drain nodes gracefully.
When NOT to use gossip
Anti-patterns:
- Distributed locks. Use Raft/etcd/ZooKeeper.
- Leader election. Use consensus.
- Configuration that needs immediate effect. Push config via consensus.
- Atomic updates. Gossip can't give you atomicity.
- Sequencing. No order guarantees.
Gossip is for soft state and eventual cluster awareness. Hard guarantees need consensus.
Real systems
| System | Gossip use |
|---|---|
| Cassandra | Cluster state, schema, load |
| Consul | Membership, failure detection (SWIM) |
| Serf | General-purpose SWIM membership |
| Riak | Cluster state |
| Akka Cluster | Membership and failure detection |
| Redis Cluster | Cluster bus uses gossip-like protocol |
| Bitcoin / Ethereum | Block and transaction propagation |
| Docker Swarm | Manager membership |
Note that most of these layer real consensus on top for things that matter (Cassandra Paxos for LWT, Consul Raft for KV).
Implementation gotchas
-
UDP vs TCP: SWIM uses UDP for pings (low overhead). State exchange can use TCP for reliability.
-
Encryption: gossip messages should be encrypted. Hashicorp's tools use AES-GCM.
-
Compression: digests compress well. Cassandra uses LZ4.
-
Membership boundaries: who's in the cluster? Typically a seed list. New nodes contact seeds, get initial state, join gossip.
-
Generation counters: distinguish "node X just rejoined" from "node X is still here." Each restart bumps the counter.
What to read next
- Epidemic algorithms paper (Demers 1987). Foundational.
- SWIM paper for the failure detector everyone uses.
- Lifeguard paper for the modern improvements.
- Hashicorp Memberlist source for a clean Go implementation.
Learn more
- PaperEpidemic algorithms for replicated database maintenanceDemers et al, Xerox PARC
- PaperSWIM paperDas, Gupta, Motivala
- PaperLifeguard paper (SWIM improvements)Hashicorp
- RepoHashicorp MemberlistHashicorp
- DocsCassandra gossip docsApache Cassandra