2026-04-22
When your service calls another service, what happens when that dependency is down? Without protection, every request blocks for 30 seconds waiting for a timeout, your thread pool fills up, and your entire system crashes. One failing dependency takes down everything. This is called cascading failure, and the circuit breaker pattern exists to prevent it.
The pattern works much like an electrical circuit breaker. It has three states:

- **CLOSED** — normal operation. Requests pass through to the dependency, and failures are counted.
- **OPEN** — the breaker has tripped. Requests fail immediately without calling the dependency at all.
- **HALF_OPEN** — after a cooldown period, a single probe request is allowed through to test whether the dependency has recovered.
Here's a minimal implementation in pseudocode:
```
state = CLOSED, failureCount = 0, lastFailureTime = null

on each call:
    if state is OPEN:
        if now - lastFailureTime < cooldown: fail fast
        else: state = HALF_OPEN          # cooldown elapsed: let one probe through
    attempt the call
    on success: failureCount = 0, state = CLOSED
    on failure: failureCount += 1
                if failureCount >= threshold or state is HALF_OPEN:
                    state = OPEN, lastFailureTime = now
```
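The pseudocode above can be sketched as a small Python class. This is an illustrative single-threaded sketch, not any particular library's API; the names `CircuitBreaker`, `call`, and the injectable `clock` parameter are my own, and a production version would need locking around state transitions.

```python
import time

CLOSED, OPEN, HALF_OPEN = "CLOSED", "OPEN", "HALF_OPEN"

class CircuitBreaker:
    def __init__(self, threshold=5, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold   # consecutive failures before tripping
        self.cooldown = cooldown     # seconds to stay OPEN before probing
        self.clock = clock           # injectable clock, handy for testing
        self.failure_count = 0
        self.state = CLOSED
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.state == OPEN:
            if self.clock() - self.opened_at < self.cooldown:
                # Fail fast: don't touch the dependency while cooling down.
                raise RuntimeError("circuit open: failing fast")
            # Cooldown elapsed: allow one probe request through.
            self.state = HALF_OPEN
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        self.failure_count = 0
        self.state = CLOSED

    def _on_failure(self):
        self.failure_count += 1
        # A failed probe re-opens immediately; otherwise trip at the threshold.
        if self.state == HALF_OPEN or self.failure_count >= self.threshold:
            self.state = OPEN
            self.opened_at = self.clock()
```

With a fake clock you can walk the breaker through all three states: trip it with repeated failures, watch it fail fast while OPEN, then advance time past the cooldown and see a successful probe close it again.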
Real-world example: Your checkout service calls a payment gateway. The gateway starts returning 500 errors. Without a circuit breaker, every checkout attempt hangs for your full timeout (say 10 seconds), your users stare at a spinner, and your request queue backs up. With a circuit breaker configured at 5 failures in 60 seconds, after 5 failed payments the breaker opens. Subsequent checkout attempts instantly get a "Payment temporarily unavailable, please try again shortly" message. Your service stays responsive. After 30 seconds of cooldown, one request probes the gateway — if it's back, traffic resumes.
Rule of thumb for tuning: Set your failure threshold to timeout_seconds × normal_requests_per_second × 0.5. If your timeout is 5 seconds and you normally send 10 req/s, trip after roughly 25 failures. This prevents tripping on isolated errors while catching real outages within a few seconds. Start conservative (trip early) and loosen from there.
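The rule of thumb above is easy to encode. The function name `failure_threshold` is mine, not from any library; the `max(1, ...)` floor is an assumption so that very low-traffic services still get a sane value.

```python
def failure_threshold(timeout_seconds: float, requests_per_second: float) -> int:
    """Rule of thumb: trip after roughly half the requests that could be
    in flight during one timeout window have failed."""
    return max(1, round(timeout_seconds * requests_per_second * 0.5))
```

For the worked example in the text (5-second timeout, 10 req/s), this yields a threshold of 25 failures.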
A final practical note: most languages have battle-tested libraries, such as resilience4j (Java), Polly (.NET), opossum (Node.js), and pybreaker (Python). Use one instead of hand-rolling; they handle edge cases like concurrent state transitions and sliding-window failure counters.
