Load balancing strategies
Algorithms, layers, health checks, consistent hashing, autoscaling, and the production gotchas of load balancing at scale.
What problem a load balancer solves
A single server has finite CPU, memory, and bandwidth. To scale beyond it, you put N servers behind one address and distribute requests. The load balancer is the "one address."
Beyond capacity, an LB provides:
- Failure isolation: bad backend removed automatically.
- Rolling deploys: drain a backend, deploy, return it to pool.
- SSL termination at one place.
- Request routing (L7): by path, host, cookie.
- Observability: centralized metrics on all traffic.
Layer 4 vs Layer 7
L4 LBs operate on TCP/UDP. They see the connection's 5-tuple: source IP, source port, destination IP, destination port, and protocol. They typically forward connections by hashing that 5-tuple to a backend.
Examples: AWS NLB, HAProxy in TCP mode, IPVS, Cilium with XDP, Maglev (Google).
L4 is fast. Some implementations run in the kernel or in hardware ASICs. Throughput in millions of packets per second.
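A minimal sketch of 5-tuple hashing, assuming a simple hash-then-modulo mapping. The function name `pick_backend` and the addresses are illustrative, and real L4 balancers use much cheaper non-cryptographic hashes than SHA-256.

```python
import hashlib

def pick_backend(src_ip, src_port, dst_ip, dst_port, protocol, backends):
    """Map a connection's 5-tuple to one backend, deterministically."""
    key = f"{protocol}|{src_ip}|{src_port}|{dst_ip}|{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes of the digest as an unsigned integer.
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]

backends = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
first = pick_backend("203.0.113.7", 51512, "198.51.100.10", 443, "tcp", backends)
again = pick_backend("203.0.113.7", 51512, "198.51.100.10", 443, "tcp", backends)
# Every packet of the same connection hashes to the same backend.
```

The modulo step is the weak point: changing the number of backends remaps most connections, which is the problem consistent hashing addresses.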
L7 LBs parse HTTP. They route by Host header, URL path, cookie, or JWT claim. They can rewrite headers, compress, decompress, and do A/B testing.
Examples: Nginx, Envoy, AWS ALB, Cloudflare, Kong.
L7 is slower (10-100x more CPU per request) but enables all the routing intelligence modern apps need.
Common architecture: L4 in front of L7. L4 handles raw connection distribution; L7 does the smart routing per request within a connection.
Algorithms in detail
Round-robin
Cycle through backends in order. Easy, fair if requests cost roughly the same.
Pitfall: round-robin ignores how busy each backend is. If request cost varies, a backend still working through an expensive request receives its next turn anyway, so the slow backend gets slower.
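The pitfall can be made concrete with a toy model. This sketch assumes each request has a known cost in arbitrary work units; `assign_round_robin` is a hypothetical helper, not any LB's API.

```python
from itertools import cycle

def assign_round_robin(request_costs, backends):
    """Return total work per backend when requests are dealt out in strict rotation."""
    work = {backend: 0 for backend in backends}
    rotation = cycle(backends)
    for cost in request_costs:
        work[next(rotation)] += cost
    return work

# Uniform requests: perfectly even.
even = assign_round_robin([1, 1, 1, 1, 1, 1], ["a", "b", "c"])
# Every third request is expensive: backend "a" absorbs all of them.
skewed = assign_round_robin([10, 1, 1, 10, 1, 1], ["a", "b", "c"])
```

Each backend receives the same number of requests in both runs; only the work differs, which is exactly what round-robin cannot see.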
Weighted round-robin
Each backend has a weight. Higher weight gets more requests proportionally. Useful when backends have different capacities (some 2 vCPU, some 8 vCPU).
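One common way to implement this is the "smooth" weighted round-robin variant, which interleaves picks instead of sending bursts to the heaviest backend. The sketch below is an illustration of that technique with made-up backend names and weights, not a copy of any particular LB's code.

```python
from collections import Counter

class SmoothWeightedRoundRobin:
    def __init__(self, weights):
        self.weights = dict(weights)              # backend -> configured weight
        self.current = {b: 0 for b in weights}    # backend -> running score
        self.total = sum(self.weights.values())

    def pick(self):
        # Every backend gains its weight; the leader is chosen and pays back the total.
        for backend, weight in self.weights.items():
            self.current[backend] += weight
        chosen = max(self.current, key=self.current.get)
        self.current[chosen] -= self.total
        return chosen

lb = SmoothWeightedRoundRobin({"big-8vcpu": 8, "small-2vcpu-a": 2, "small-2vcpu-b": 2})
counts = Counter(lb.pick() for _ in range(12))
# Over one full cycle (12 picks) each backend is chosen exactly as often as its weight.
```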
Least connections
Track active connections per backend; pick the one with fewest. Adapts to slow requests automatically.
Caveat: requires the LB to track state per backend. Works well for long-lived connections (database pool, WebSocket); less useful for very short HTTP requests.
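A minimal sketch of the bookkeeping, assuming the LB is told when each connection opens and closes. The class and method names are illustrative, and ties go to the first backend listed.

```python
class LeastConnections:
    def __init__(self, backends):
        self.active = {backend: 0 for backend in backends}

    def acquire(self):
        """Pick the backend with the fewest active connections and count the new one."""
        backend = min(self.active, key=self.active.get)
        self.active[backend] += 1
        return backend

    def release(self, backend):
        """Call when a connection closes."""
        self.active[backend] -= 1

lb = LeastConnections(["a", "b"])
slow = lb.acquire()      # goes to "a" and stays open
second = lb.acquire()    # goes to "b"
lb.release(second)       # "b" finishes quickly
third = lb.acquire()     # "a" is still busy, so "b" is picked again
```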
Power of 2 choices
Pick 2 random backends. Send to the one with fewer active requests. Despite being almost random, this dramatically reduces tail latency compared to round-robin or random.
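The theory is the balls-into-bins result of Azar, Broder, Karlin, and Upfal, which Mitzenmacher extended to queueing systems: with two choices the maximum load grows as log log N / log 2, instead of the log N / log log N of a single random choice. In practice, p99 tail latency often drops 2-5x compared to random.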
Used by Envoy (its least-request policy picks between two random hosts when weights are equal), Finagle, and Linkerd.
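The effect is easy to see in the simplified balls-into-bins model, where requests are assigned once and never complete. The simulation below is a sketch under that assumption; `max_load` and the pool size are illustrative.

```python
import random

def max_load(num_backends, num_requests, choices, rng):
    """Assign each request to the least-loaded of `choices` random backends."""
    load = [0] * num_backends
    for _ in range(num_requests):
        candidates = rng.sample(range(num_backends), choices)
        target = min(candidates, key=lambda i: load[i])
        load[target] += 1
    return max(load)

rng = random.Random(42)  # fixed seed so the run is repeatable
random_max = max_load(10_000, 10_000, choices=1, rng=rng)
p2c_max = max_load(10_000, 10_000, choices=2, rng=rng)
# With two choices the busiest backend ends up noticeably less loaded.
```

A real LB compares active requests rather than lifetime totals, but the comparison step is the same.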
Consistent hashing
Hash a key (URL, session ID, user ID) to a position on a ring. Backends also hash to positions. Key goes to the next backend clockwise on the ring.
Adding a backend to a pool of N moves only about 1/(N+1) of keys, and removing one moves about 1/N; a plain hash-mod-N scheme would move almost all of them. This is critical for cache layers where you do not want a full rehash on scaling.
Virtual nodes: hash each backend at multiple positions (e.g., 100 virtual positions per real backend). Smooths distribution and reduces hot spots.
Used by: memcached client libraries, Cassandra, DynamoDB, and many distributed caches and CDN routing layers.
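A minimal ring with virtual nodes might look like the sketch below. The hash function, the `backend#i` naming of virtual nodes, and the cache names are assumptions for illustration.

```python
import bisect
import hashlib

def ring_hash(value):
    digest = hashlib.sha256(value.encode()).digest()
    return int.from_bytes(digest[:8], "big")

class HashRing:
    def __init__(self, backends, vnodes=100):
        self.vnodes = vnodes
        self.points = []  # sorted (position, backend) pairs
        for backend in backends:
            self.add(backend)

    def add(self, backend):
        for i in range(self.vnodes):
            bisect.insort(self.points, (ring_hash(f"{backend}#{i}"), backend))

    def remove(self, backend):
        self.points = [p for p in self.points if p[1] != backend]

    def lookup(self, key):
        # First point at or after the key's position, wrapping around the ring.
        index = bisect.bisect_left(self.points, (ring_hash(key), ""))
        return self.points[index % len(self.points)][1]

keys = [f"user-{i}" for i in range(10_000)]
ring = HashRing(["cache-1", "cache-2", "cache-3", "cache-4"])
before = {k: ring.lookup(k) for k in keys}
ring.add("cache-5")
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
moved_fraction = len(moved) / len(keys)  # expected to be near 1/5
```

Every key that moves lands on the new backend; keys never shuffle between the existing four.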
Maglev hashing
Google's software load balancer (Maglev paper, NSDI 2016) uses a different consistent hashing variant optimized for fast lookups and even distribution: a precomputed lookup table of size M, where M is a prime much larger than the number of backends. Each backend claims entries in the table in turn; lookup is O(1).
Used by Envoy, AWS NLB underneath, Cloudflare.
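The table population step follows the paper's description: each backend has its own permutation of table slots (an offset and a skip), and backends take turns claiming their next free slot. The sketch below is a simplified rendering of that idea; the hash helper `_h`, its salts, and the backend names are assumptions.

```python
import hashlib
from collections import Counter

def _h(value, salt):
    digest = hashlib.sha256(f"{salt}:{value}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def build_maglev_table(backends, table_size=65537):
    """Fill a lookup table of prime size; backends take turns claiming free slots."""
    if not backends:
        raise ValueError("need at least one backend")
    offsets = [_h(b, "offset") % table_size for b in backends]
    skips = [_h(b, "skip") % (table_size - 1) + 1 for b in backends]
    next_index = [0] * len(backends)
    table = [None] * table_size
    filled = 0
    while True:
        for i, backend in enumerate(backends):
            # Walk backend i's slot permutation until it reaches a free slot.
            while True:
                slot = (offsets[i] + next_index[i] * skips[i]) % table_size
                next_index[i] += 1
                if table[slot] is None:
                    break
            table[slot] = backend
            filled += 1
            if filled == table_size:
                return table

def lookup(table, key):
    return table[_h(key, "key") % len(table)]

table = build_maglev_table(["b1", "b2", "b3", "b4", "b5"])
shares = Counter(table)
chosen = lookup(table, "203.0.113.7:51512")
```

Because backends claim slots in strict rotation, their shares of the table differ by at most one entry.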
EWMA (Exponentially Weighted Moving Average)
Track each backend's recent response latency. Use 1 / (smoothed latency) as a weight in random selection. Fast backends get more requests.
Used by Finagle's P2C with peak EWMA. Envoy's LEAST_REQUEST is a related approach that weights by active requests rather than smoothed latency.
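A sketch of the weighting described above, assuming the LB records a latency sample after each response. The smoothing factor `alpha`, the initial latency, and the class name are illustrative; latencies must stay above zero for the inverse to be defined.

```python
import random
from collections import Counter

class EwmaBalancer:
    def __init__(self, backends, alpha=0.2, initial_latency=0.1, rng=None):
        self.alpha = alpha
        self.latency = {b: initial_latency for b in backends}  # seconds, smoothed
        self.rng = rng or random.Random()

    def record(self, backend, seconds):
        """Blend a new latency sample into the backend's moving average."""
        old = self.latency[backend]
        self.latency[backend] = self.alpha * seconds + (1 - self.alpha) * old

    def pick(self):
        """Weighted random choice with weight 1 / smoothed latency."""
        backends = list(self.latency)
        weights = [1.0 / self.latency[b] for b in backends]
        return self.rng.choices(backends, weights=weights, k=1)[0]

lb = EwmaBalancer(["fast", "slow"], rng=random.Random(7))
for _ in range(50):
    lb.record("fast", 0.010)
    lb.record("slow", 0.100)
picks = Counter(lb.pick() for _ in range(10_000))
# "fast" is about 10x quicker, so it should receive roughly 10 of every 11 requests.
```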
Random
Pick a backend uniformly at random. Fair in expectation, terrible tail behavior. Use only as a baseline.
Health checks
A backend can die or degrade. The LB needs to know.
Active health checks
LB sends a probe request periodically (every 1-30 seconds). If N consecutive probes fail, remove from pool. If M consecutive probes pass after removal, return to pool.
Common probe: GET /health that returns 200 OK quickly without doing real work.
Best practice: probe a real endpoint that exercises critical dependencies (DB connection, cache reachability), but keep the probe cheap, because every check then adds load to those dependencies. A healthy /health that says "OK" while the app cannot read DB is worse than no health check.
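The eject-after-N, readmit-after-M rule for active checks is a small state machine. A minimal sketch, with threshold names chosen for illustration:

```python
class HealthTracker:
    def __init__(self, unhealthy_threshold=3, healthy_threshold=2):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.healthy = True
        self._streak = 0  # consecutive probes that contradict the current state

    def observe(self, probe_ok):
        """Feed one probe result; return whether the backend is in the pool."""
        if probe_ok == self.healthy:
            self._streak = 0
            return self.healthy
        self._streak += 1
        needed = self.unhealthy_threshold if self.healthy else self.healthy_threshold
        if self._streak >= needed:
            self.healthy = not self.healthy
            self._streak = 0
        return self.healthy

tracker = HealthTracker(unhealthy_threshold=3, healthy_threshold=2)
results = [False, False, True, False, False, False, True, True]
states = [tracker.observe(ok) for ok in results]
# Two failures then a success do not eject; three in a row do; two passes readmit.
```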
Passive health checks
Track real request outcomes. If error rate or latency spikes for a backend, mark it unhealthy.
Outlier detection (Envoy): "this backend's error rate is 5x the cluster average, eject for 30 seconds."
Combine active and passive. Passive catches problems as soon as real traffic starts failing; active confirms recovery before re-pooling.
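A rule in the spirit of "5x the cluster average" can be sketched as below. This is not Envoy's algorithm: it assumes per-backend request and error counts over a recent window, and it compares each backend against the rest of the pool so that one bad backend does not inflate its own baseline. The `floor_rate` and `min_requests` guards are illustrative.

```python
def find_outliers(stats, factor=5.0, min_requests=100, floor_rate=0.001):
    """stats maps backend -> (requests, errors) for a recent window."""
    outliers = []
    for backend, (requests, errors) in stats.items():
        if requests < min_requests:
            continue  # too little traffic to judge
        rest_requests = sum(r for b, (r, _) in stats.items() if b != backend)
        rest_errors = sum(e for b, (_, e) in stats.items() if b != backend)
        if rest_requests == 0:
            continue
        baseline = max(rest_errors / rest_requests, floor_rate)
        if errors / requests >= factor * baseline:
            outliers.append(backend)
    return outliers

window = {"a": (1000, 10), "b": (1000, 12), "c": (1000, 300)}
ejected = find_outliers(window)   # "c" fails 30% of requests versus about 1% elsewhere
quiet = find_outliers({"a": (1000, 1), "b": (1000, 2)})
```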
Slow start
When a backend rejoins, ramp up traffic gradually. Otherwise it gets blasted with the steady-state share immediately and might fall over again. Linear ramp over 30-60 seconds.
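A linear ramp can be expressed as a weight that grows with time since the backend rejoined. The function name, the 60-second default, and the weight scale are assumptions for illustration.

```python
def slow_start_weight(seconds_since_rejoin, ramp_seconds=60.0, full_weight=100, min_weight=1):
    """Linearly scale a backend's weight from min_weight up to full_weight."""
    if seconds_since_rejoin >= ramp_seconds:
        return full_weight
    fraction = max(seconds_since_rejoin, 0.0) / ramp_seconds
    # Never drop to zero, or the backend would receive no traffic at all.
    return max(min_weight, round(full_weight * fraction))

ramp = [slow_start_weight(t) for t in (0, 15, 30, 60, 120)]
```

The resulting weight can feed any weighted algorithm, such as weighted round-robin.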
Sticky sessions
Some applications keep state in memory: WebSocket connections, in-process session caches, gRPC streaming. The next request from the same client must go to the same backend.
- IP hash: route by client IP. Breaks behind NAT or shared proxies.
- Cookie hash: LB sets a cookie identifying the backend; subsequent requests carry it. Works through NAT.
- Consistent hash on session ID.
Trade-offs: stickiness causes uneven load. A power user hot-spots their backend. Mitigation: aim for stateless backends (move session state to Redis); fall back to stickiness only when necessary.
Connection draining
Before removing a backend (deploy, scale-down), tell the LB to drain it: stop sending new requests but let existing ones complete. Wait for in-flight to finish (with a timeout), then remove.
AWS ALB calls this "deregistration delay." Kubernetes achieves the same effect with a preStop hook plus terminationGracePeriodSeconds.
Without draining: TCP RSTs to in-flight users, 5xx errors during every deploy.
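The drain sequence is: stop new traffic, poll in-flight requests, give up at a timeout. A minimal sketch, where `Backend`, `drain`, and the injected `clock` and `sleep` parameters are hypothetical names used to keep the example testable:

```python
import time

class Backend:
    def __init__(self, name):
        self.name = name
        self.accepting = True   # the LB routes new requests only while this is True
        self.in_flight = 0

def drain(backend, timeout_seconds=30.0, poll_seconds=0.5,
          clock=time.monotonic, sleep=time.sleep):
    """Stop new traffic and wait for in-flight requests; True if drained cleanly."""
    backend.accepting = False
    deadline = clock() + timeout_seconds
    while backend.in_flight > 0 and clock() < deadline:
        sleep(poll_seconds)
    return backend.in_flight == 0

# Demo with a simulated sleep that completes one request per poll.
backend = Backend("web-1")
backend.in_flight = 3

def finish_one_request(_seconds):
    backend.in_flight -= 1

drained = drain(backend, sleep=finish_one_request)
```

If `drain` returns False the timeout expired, and removing the backend will cut off whatever is still in flight.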
Service mesh and sidecar LBs
In Kubernetes, a service mesh (Istio, Linkerd) puts an L7 proxy (Envoy, Linkerd-proxy) as a sidecar in every pod. The sidecar handles load balancing client-side: each app talks to localhost; the sidecar picks a backend.
Advantages:
- No centralized LB bottleneck.
- Mesh-level mTLS, retries, circuit breakers.
- Per-request routing decisions with full app context.
Disadvantages: an extra container in every pod, more CPU and memory, and more operational complexity.
Global load balancing
Above the cluster level, you balance across regions:
- DNS-based: authoritative DNS returns different IPs per region (geo-DNS).
- Anycast: same IP in multiple regions; BGP picks nearest.
- Latency-based: route to lowest-latency region per user.
- Health-based: failover to next region when primary fails.
Examples: AWS Route 53, Cloudflare Load Balancing, Akamai GTM, Google Cloud Load Balancing.
Autoscaling and the LB
The LB doesn't scale your backends, but it provides part of the signal: request rate, active connections, and latency per backend. Autoscalers consume these metrics alongside host-level ones such as CPU and queue depth.
Caveat: autoscaling reaction time (1-5 minutes typical) is much slower than traffic spikes. Keep headroom in the steady-state pool.
Common production failures
Health check storm
100 backends, health check every 1 second to a /health that does a DB query. Health checks alone produce 100 QPS on your DB. Scale up backends and the rate scales linearly. Solution: cheap, dependency-free /health endpoint; or rate-limit checks.
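The arithmetic is worth writing down, because it has one more multiplier than it first appears. The sketch assumes every LB node probes every backend independently, which is common but not universal; `health_check_qps` is a hypothetical helper.

```python
def health_check_qps(backends, lb_nodes, interval_seconds, queries_per_check=1):
    """Dependency queries per second generated by health checks alone."""
    return backends * lb_nodes * queries_per_check / interval_seconds

single_lb = health_check_qps(backends=100, lb_nodes=1, interval_seconds=1.0)
three_lbs = health_check_qps(backends=100, lb_nodes=3, interval_seconds=1.0)
relaxed = health_check_qps(backends=100, lb_nodes=3, interval_seconds=10.0)
```

Lengthening the interval from 1 to 10 seconds cuts the load tenfold, at the cost of slower failure detection.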
Thundering herd after restart
Deploy restarts 50 backends in parallel. They all come up at the same time, all reconnect to backend services (DB, cache), all warm up cold caches simultaneously. Stagger restarts.
Stale DNS in client libraries
Some HTTP client libraries cache DNS forever and never re-resolve. When you replace a backend, old clients keep hitting dead IPs. Force periodic DNS re-resolution or use connection pooling that honors TTL.
Connection coalescing across deploys
HTTP/2 reuses connections. After deploy, old connections hold to old (now-drained) backends until they time out. Use GOAWAY frames to gracefully close.
Asymmetric routing through L4 LBs
Source NAT vs Direct Server Return (DSR). DSR is faster (responses bypass the LB) but requires the backend to know the LB's VIP. Mismatched configs cause weird connection failures.
Sticky session blow-up on deploy
You rolled out a deploy. Stickiness routed 30% of users to a backend that no longer exists. They see errors until cookies expire. Solution: drain before terminate, or use session storage outside the backend.
Numbers to memorize
- Round-robin variance under random work: 10-20%.
- Power of 2 choices vs random: 2-5x p99 reduction.
- Consistent hash reshuffle on N to N+1: 1/(N+1) of keys.
- Maglev table size: usually 65537 (prime).
- Typical health check: every 5-30 seconds, 2-3 failures to eject.
- Connection draining timeout: 30-300 seconds.
- Idle connection timeout on many LBs: around 60 seconds (for example, the AWS ALB default).
What to know for interviews
Be ready to explain why round-robin is not always the right answer, how consistent hashing works at a whiteboard level, what L4 vs L7 means, and how health checks both help and hurt. Bonus: explain power of 2 choices and why it works.
Learn more
- Docs: HAProxy documentation (HAProxy)
- Docs: Envoy load balancing (Envoy)
- Docs: Nginx load balancing (Nginx)
- Paper: The Power of Two Random Choices (Michael Mitzenmacher)