Circuit breakers and bulkheads
Circuit breakers stop calling a failing dependency; bulkheads stop one failing dependency from drowning the rest of your service.
A circuit breaker wraps a remote call. When the remote starts failing, the breaker "trips" and short-circuits subsequent calls for a cooling period. This protects you from cascading failure and protects the downstream from a thundering herd of retries.
Three states
- Closed. Normal. Calls pass through. Failures counted.
- Open. Tripped. Calls fail fast with a fallback. No real call made.
- Half-open. Cooling period over. Let one probe call through. If it succeeds, close. If it fails, reopen.
When to trip
Common policy: 50% failure rate over a 30-second window with at least 20 requests. The minimum request count prevents tripping on a single 503.
Cooling period: 30 seconds to 5 minutes. Too short and you keep banging on a dead service. Too long and you stay degraded after the service recovers.
Bulkheads
Named after ship hulls. A bulkhead isolates one part of your system so one failure does not flood everything.
In code: give each downstream its own thread pool, connection pool, or rate limit. If Stripe is slow, only the threads dedicated to Stripe are stuck. Threads for your DB are free. The /health endpoint still responds.
# Bad: one shared pool
http_client = HttpClient(max_connections=100)
# Good: per-dependency pools
stripe_client = HttpClient(max_connections=20)
twilio_client = HttpClient(max_connections=20)
internal_api = HttpClient(max_connections=60)Fallbacks
When the breaker is open, what do you return?
- Cached value. Stale but better than nothing.
- Default value. "Recommended for you" returns top 10 instead of personalized.
- Empty + 200. For optional features.
- Error + 503. When there is no graceful degradation.
A fallback is only useful if it is meaningfully better than the failure mode. Don't add fallbacks just to have them.
Libraries
- Resilience4j (Java). Modern Hystrix successor.
- Polly (.NET). Mature, well-documented.
- Tenacity / pybreaker (Python). Decent options.
- Service mesh (Istio, Linkerd). Breakers at the network layer, no app code needed.
My rule
Every cross-network call has a breaker, a timeout, and a bulkhead. The timeout is the floor (3-5s for most APIs). The breaker is the ceiling (stop retrying when broken). The bulkhead is the wall (one bad neighbor doesn't kill the rest).
Learn more
- ArticleMichael Nygard: Release It (2nd edition)Pragmatic Bookshelf
- ArticleMartin Fowler: Circuit Breakermartinfowler.com
- RepoNetflix Hystrix wikiNetflix