Circuit breakers and bulkheads
Why retries amplify failures, how breakers stop cascades, how bulkheads isolate, and how service meshes change the math.
A microservice architecture is a distributed monolith if the failure modes are not isolated. Without breakers and bulkheads, one slow dependency can drag your entire service into the ground. The math is unavoidable: every thread that is stuck waiting on a slow remote call is a thread that cannot serve a healthy request. Run out of threads, and even your /health endpoint stops responding.
The cascade math
Service A has 100 threads. Each thread serves one request. It calls Service B, which normally responds in 50ms. So A can serve 100 / 0.05 = 2000 req/sec.
Service B starts taking 5 seconds. A's threads block for 5s each. Capacity drops to 100 / 5 = 20 req/sec. A's queue fills. Upstream callers time out. They retry. A is now under more load while serving less. A falls over. The outage propagates upward to every service that depends on A.
One slow dependency took down the whole system in three steps: blocked threads, a full queue, and retry amplification. The breaker stops the cascade.
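The arithmetic above can be reproduced in a few lines. This is only a back-of-the-envelope sketch: `max_throughput` is a hypothetical helper name, and it assumes each thread handles exactly one blocking call at a time, as in the example.

```python
def max_throughput(threads: int, latency_s: float) -> float:
    """Upper bound on requests/sec when every thread blocks on one call at a time."""
    return threads / latency_s


healthy = max_throughput(100, 0.05)   # B responds in 50 ms
degraded = max_throughput(100, 5.0)   # B responds in 5 s

print(f"healthy:  {healthy:.0f} req/sec")
print(f"degraded: {degraded:.0f} req/sec")
print(f"capacity lost: {1 - degraded / healthy:.0%}")
```

The point of running the numbers is that nothing in A changed. The same 100 threads serve a hundredth of the traffic purely because B got slower.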
How the breaker stops it
When the breaker is open, A doesn't call B. It returns the fallback immediately (cached value, default, or 503). A's threads complete fast. A's capacity stays high. The retries from upstream still get fast responses (even if degraded). The cascade is contained to "B is broken, the B-dependent feature is degraded."
State machine in detail
Closed. Normal operation. Every call is attempted. Failures and successes are counted in a sliding window (typically 10-30 seconds).
Open. Tripped. All calls fail fast. No request hits the downstream. After a cooldown timer (typically 30s to 2min), transition to half-open.
Half-open. Trial period. Allow one or a small number of probe calls. If successful, close. If failed, reopen with a fresh cooldown timer.
The probe is critical. Without it, the breaker has no way to know if the downstream recovered. Some implementations also let through a small percentage of traffic in half-open instead of one probe, smoothing the recovery.
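The three states can be sketched as a small class. This is a minimal illustration, not a production breaker: the names `CircuitBreaker` and `CircuitOpenError` are made up for this example, it counts consecutive failures instead of using a sliding window, and it is not thread-safe, so under concurrency more than one probe could slip through in half-open.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without touching the downstream."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock          # injectable so tests can control time
        self.state = "closed"
        self.failures = 0           # consecutive failures seen while closed
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.cooldown_s:
                raise CircuitOpenError("failing fast; downstream not called")
            self.state = "half_open"    # cooldown elapsed: this call is the probe
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # A success while closed resets the count; a successful probe closes the breaker.
        self.failures = 0
        self.state = "closed"

    def _on_failure(self):
        self.failures += 1
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()   # fresh cooldown timer
```

A caller would wrap each downstream call as `breaker.call(fetch_profile, user_id)` (where `fetch_profile` is whatever function talks to B) and catch `CircuitOpenError` to return the fallback.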
Trip thresholds
Two main strategies:
Failure count threshold. "Trip after 10 consecutive failures." Simple. Brittle: one bad batch trips even if the next 1000 calls would succeed.
Failure rate threshold. "Trip if failure rate exceeds 50% over the last 30 seconds, with at least 20 requests." Better. The minimum request count prevents flapping on low-traffic services.
Add a slow-call threshold. "Treat calls >5s as failures even if they eventually succeed." A slow downstream is as bad as a failing one - it eats threads.
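A failure-rate trip rule with a minimum request count and a slow-call threshold might look like the sketch below. `RateWindow` and its parameter names are assumptions for illustration; the defaults simply mirror the numbers quoted above (30 seconds, 20 requests, 50%, 5 seconds).

```python
from collections import deque


class RateWindow:
    """Sliding time window of call outcomes used to decide whether to trip."""

    def __init__(self, window_s=30.0, min_requests=20, failure_rate=0.5, slow_call_s=5.0):
        self.window_s = window_s
        self.min_requests = min_requests
        self.failure_rate = failure_rate
        self.slow_call_s = slow_call_s
        self.calls = deque()    # (timestamp, counted_as_failure)

    def record(self, now, ok, duration_s):
        # A slow call counts as a failure even if it eventually succeeded.
        failed = (not ok) or duration_s > self.slow_call_s
        self.calls.append((now, failed))
        self._evict(now)

    def _evict(self, now):
        # Drop samples older than the window.
        while self.calls and now - self.calls[0][0] > self.window_s:
            self.calls.popleft()

    def should_trip(self, now):
        self._evict(now)
        total = len(self.calls)
        if total < self.min_requests:
            return False    # too little traffic to judge; avoids flapping
        failures = sum(1 for _, failed in self.calls if failed)
        return failures / total > self.failure_rate
```

The caller passes in timestamps (for example from `time.monotonic()`), which keeps the window easy to test with synthetic time.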
Timeouts: the floor
A breaker requires timeouts. Without them, your call hangs forever. The breaker never sees a failure to count. Always set:
- Connection timeout. 1-3s. How long to establish the TCP/TLS connection.
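- Read timeout. 3-10s for most APIs. How long to wait for the server to send data. In most clients this limits each wait between bytes, not the whole response.
- Total timeout. A deadline for the whole call, from connect to last byte. Sometimes lower than read timeout for endpoints with strict SLAs.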
Many HTTP clients default to "no timeout," which is criminal. Always set them explicitly.
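One way to enforce a total timeout on top of per-step timeouts is to carry a deadline through the call and cap every step by what is left. The `Deadline` class below is a hypothetical helper, not part of any HTTP library; the values it returns would be passed to whatever timeout parameters your client exposes.

```python
import time


class Deadline:
    """Total time budget for one logical call, shared by connect, read, and retries."""

    def __init__(self, total_s, clock=time.monotonic):
        self.clock = clock
        self.expires_at = clock() + total_s

    def remaining(self):
        return max(0.0, self.expires_at - self.clock())

    def timeout_for(self, per_step_s):
        """Per-step timeout, capped by what is left of the total budget."""
        left = self.remaining()
        if left <= 0:
            raise TimeoutError("total timeout exceeded")
        return min(per_step_s, left)
```

For example, with `deadline = Deadline(8.0)`, a connect step asks for `deadline.timeout_for(2.0)` and a later read step asks for `deadline.timeout_for(5.0)`. If earlier steps already burned most of the budget, the read gets only the remainder instead of a fresh 5 seconds.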
Bulkheads in detail
The bulkhead pattern isolates resources per dependency so one bad dependency cannot starve others.
Three implementations:
Thread pool isolation (Hystrix style). Each downstream gets its own thread pool. Calls run on that pool. If the pool is exhausted, calls fail fast (bulkhead full). Other downstreams have their own pools, unaffected.
Semaphore isolation. Cheaper than thread pools. Each downstream has a semaphore with N permits. Each call acquires a permit, releases on completion. Permits exhausted = fail fast. No new threads created.
Connection pool isolation. Each downstream has its own HTTP connection pool. Pool exhausted = fail fast. Most production systems do this.
```python
# Bulkheads via separate clients.
# HttpClient stands in for your HTTP library's pooled client; it is not a real class.
stripe = HttpClient(max_connections=20, timeout=5)
twilio = HttpClient(max_connections=10, timeout=3)
internal = HttpClient(max_connections=60, timeout=2)
```
If Stripe goes slow, only the 20 Stripe connections are stuck. The 60 internal connections keep serving. The /health endpoint doesn't even know.
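Semaphore isolation is small enough to sketch with the standard library. `Bulkhead` and `BulkheadFullError` are names invented for this example; the key detail is the non-blocking acquire, which turns "pool exhausted" into an immediate rejection instead of another waiting thread.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when all permits for a dependency are in use."""


class Bulkhead:
    def __init__(self, name, max_concurrent):
        self.name = name
        self._permits = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Non-blocking acquire: if no permit is free, fail fast rather than queue.
        if not self._permits.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name}: no permits left")
        try:
            return fn(*args, **kwargs)
        finally:
            self._permits.release()
```

One `Bulkhead` per dependency, sized like the connection pools above, keeps a slow Stripe from consuming capacity that belongs to other calls. The call still runs on the caller's thread, so it needs a timeout to release the permit.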
Fallback strategies
When the breaker is open, what do you return?
Cached value. Acceptable for read-heavy data with some staleness tolerance. Product catalog, user profile, feature flags. Use a cache that keeps serving stale entries on error (stale-while-revalidate plus stale-if-error) so it always has data even if the downstream has been down for hours.
Default value. "Recommendations" returns the trending list instead of personalized. "Currency conversion" uses yesterday's rates. Reasonable defaults that don't break the user flow.
Empty success. For optional enrichment. The user's social posts feed shows real posts; if the trending sidebar API is broken, hide the sidebar. Don't show an error.
Graceful error. A 503 with a clear message and Retry-After. Best for the rare cases where there is no useful fallback.
Queue for later. Write the operation to a queue. Process when the downstream recovers. Good for writes like "send email" or "post to slack."
The fallback must be meaningfully better than the failure. If all your fallback does is return a bare "Service unavailable," skip it and let the original error through.
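The first two strategies chain naturally: try the live call, fall back to the last cached value, then to a default, and only surface the error when nothing useful is left. The sketch below uses a plain dict as the cache and a hypothetical `fetch_with_fallback` helper; a real cache would also track entry age.

```python
def fetch_with_fallback(key, fetch, cache, default=None):
    """Return (value, source), where source is 'live', 'stale-cache', or 'default'."""
    try:
        value = fetch(key)
    except Exception:
        if key in cache:
            return cache[key], "stale-cache"
        if default is not None:
            return default, "default"
        raise   # no useful fallback: surface the error (for example as a 503)
    cache[key] = value   # refresh the cache on every successful live call
    return value, "live"
```

Returning the source alongside the value makes it easy to count fallbacks and to tell the UI that it is showing degraded data.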
Retries inside vs outside the breaker
The order matters.
Retry inside breaker (wrong). The breaker wraps the whole retry loop, so a call that fails three times counts as one failure, and a call that succeeds on the second or third attempt counts as a success. The breaker never sees the real failure rate, so it never trips for a slowly degrading service.
Retry outside breaker (correct). The retry loop wraps the breaker, so the breaker sees every attempt. The retry logic respects the breaker state: if open, fail fast without retrying. This makes the breaker's metrics accurate.
In practice with libraries: wrap the call in the breaker first, then wrap that in the retry, so retry is the outermost layer. Resilience4j composes these explicitly.
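The composition can be shown with two tiny stand-ins. Everything here is hypothetical and simplified for illustration: `CountingBreaker` opens after a number of consecutive failures and never resets, and `retry` has no backoff. What matters is the nesting: the retry loop calls the breaker, and the breaker calls the downstream.

```python
class CircuitOpenError(Exception):
    """Raised by the breaker when it refuses to call the downstream."""


class CountingBreaker:
    """Stand-in breaker: opens after `threshold` consecutive failures."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.failures = 0
        self.attempts_seen = 0   # downstream calls the breaker actually allowed

    def call(self, fn):
        if self.failures >= self.threshold:
            raise CircuitOpenError("open")
        self.attempts_seen += 1
        try:
            result = fn()
        except Exception:
            self.failures += 1
            raise
        self.failures = 0
        return result


def retry(fn, max_attempts=3):
    """Retry wrapper that gives up immediately when the breaker is open."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return fn()
        except CircuitOpenError:
            raise                # open breaker: fail fast, do not retry
        except Exception as exc:
            last_error = exc
    raise last_error


def flaky():
    raise ConnectionError("downstream down")


breaker = CountingBreaker(threshold=5)


def guarded():
    # Breaker inside, retry outside: every attempt passes through the breaker.
    return breaker.call(flaky)
```

Calling `retry(guarded)` twice shows the effect. The first call makes three attempts and the breaker counts all three. The second call makes two more, the breaker opens at five, and the third attempt is rejected without touching the downstream.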
Service mesh: breakers in the network
Istio and Linkerd implement this at the proxy level: Istio calls it outlier detection, Linkerd calls it failure accrual. Every pod has a sidecar. The sidecar tracks failure rates per upstream pod. If a pod fails too often, it is ejected from the load balancer pool for a cooldown.
This is circuit breaking at the network layer. Application code doesn't know about it.
Pros: zero code change, language-agnostic, consistent policy across services.
Cons: less context-aware. The mesh can't know that "this 500 means downstream is broken" vs "this 500 means the user sent bad input." App-level breakers can distinguish.
Use both. Mesh for the coarse "this pod is bad, route around it." App for the fine "this dependency is bad, use the fallback."
Half-open probe storms
Subtle bug: if you have 100 pods of service A all using the same breaker config, all 100 transition to half-open at roughly the same time. They all send probes to B at once. B was just recovering and gets hammered again. Trip storm.
Fixes:
- Add jitter to the cooldown timer per pod.
- Use a centralized breaker state (Redis) so only one probe goes out across the fleet.
- Use the mesh's outlier detection, which ejects individual unhealthy pods of B instead of opening and closing for the whole dependency at once.
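The first fix, jitter on the cooldown, is a one-liner in most codebases. A minimal sketch, assuming a uniform spread of plus or minus 20% around the base cooldown (`jittered_cooldown` and `jitter_fraction` are illustrative names and values):

```python
import random


def jittered_cooldown(base_s: float, jitter_fraction: float = 0.2, rng=random) -> float:
    """Cooldown drawn uniformly from [base * (1 - j), base * (1 + j)]."""
    spread = base_s * jitter_fraction
    return base_s + rng.uniform(-spread, spread)
```

Each pod draws its own value when its breaker opens, so with a 30 second base the probes from 100 pods arrive spread over roughly 12 seconds instead of in the same instant.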
Adaptive concurrency
Beyond simple breakers, look at adaptive concurrency control. Libraries like Netflix's concurrency-limits use TCP-style congestion control. They dynamically adjust the in-flight request limit to the downstream based on observed latency.
When the downstream gets slower, the limit drops. When the downstream recovers, the limit grows. Replaces the manual "set max connections to 20."
Worth learning if you operate high-traffic services with variable downstream performance.
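To make the idea concrete, here is a simplified additive-increase/multiplicative-decrease (AIMD) limit, the classic TCP-style rule. This is not the concurrency-limits API. `AimdLimit`, its parameters, and the fixed latency target are assumptions for this sketch; real libraries estimate the baseline latency instead of hard-coding it.

```python
class AimdLimit:
    """In-flight request limit that grows slowly and shrinks quickly."""

    def __init__(self, initial=20, min_limit=1, max_limit=200,
                 latency_target_s=0.5, backoff=0.9):
        self.limit = initial
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.latency_target_s = latency_target_s
        self.backoff = backoff

    def on_sample(self, latency_s: float, dropped: bool = False) -> int:
        """Feed one completed request; returns the updated limit."""
        if dropped or latency_s > self.latency_target_s:
            # Multiplicative decrease: back off fast when the downstream struggles.
            self.limit = max(self.min_limit, int(self.limit * self.backoff))
        else:
            # Additive increase: probe for more capacity one slot at a time.
            self.limit = min(self.max_limit, self.limit + 1)
        return self.limit
```

The limit would feed a bulkhead-style gate: admit a request only while the in-flight count is below `limit`, and reject the rest immediately.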
Observability
Track per breaker:
- State (closed/open/half-open). Should be closed 99%+ of the time.
- Trip events. Alert on every trip in critical paths.
- Fallback rate. Percentage of calls returning fallback. Spike = breaker open or downstream degraded.
- Downstream latency p50/p99. Tells you why the breaker tripped.
Track per bulkhead:
- Pool utilization. Sustained high = need more capacity or the downstream is slow.
- Rejections. Pool exhausted, fail-fast triggered.
Testing
Two unfair questions to ask before shipping:
- "What happens if downstream X returns 500 for every request for 5 minutes?" Test it. Block the downstream in staging. Confirm the breaker trips, fallbacks return, dependent endpoints stay green.
- "What happens if downstream X takes 30 seconds to respond?" Test it. Inject latency. Confirm timeouts fire, bulkhead fills, fail-fast kicks in, the rest of the service is unaffected.
If you cannot answer these with evidence, you do not actually have resilience. You have wishful thinking.
What I would tell a junior engineer
Every line of code that crosses the network is a chance to take down your service. Three lines of defense, in order:
- Timeout. Always. No exceptions.
- Bulkhead. Separate pool per dependency. Failure stays in its lane.
- Breaker. Stop calling when broken. Fallback when possible.
Together they are the cheapest insurance against the most expensive bugs.
Learn more
- Book: Michael Nygard, Release It (Pragmatic Bookshelf)
- Article: Martin Fowler, Circuit Breaker (martinfowler.com)
- Article: AWS Builders Library, timeouts, retries, and backoff (AWS Builders Library)
- Repo: Netflix Hystrix (Netflix)
- Docs: Istio, outlier detection (Istio docs)