Circuit Breaker Pattern — Timeouts Alone Kill Thread Pools
The thread pool hit 100% utilization in 2 minutes when the payment gateway leaked connections.
- Circuit Breaker Pattern: a state machine that stops requests to a failing dependency
- Closed: requests pass, failure counter increments on each failure
- Open: all requests fail immediately, no network call made, threads freed
- Half-Open: after timeout, limited probes test if service has recovered
- Performance insight: fail-fast reduces thread pool exhaustion by up to 90% under high failure rates
- Production insight: thread pool starvation stays silent until requests start timing out; a circuit breaker stops it before that point
Imagine your house has a breaker panel. When too many appliances run at once and the wiring gets dangerously hot, the breaker trips and cuts power before your house burns down. You don't keep plugging things in. You wait, fix the problem, then carefully flip the switch back on. A Circuit Breaker in software does exactly this: when a downstream service keeps failing, it 'trips' and stops sending it requests so the whole system doesn't catch fire. It then quietly tests the waters before fully reconnecting.
Distributed systems fail in ways that monoliths never do. A single slow database call can hold a thread. A hundred slow calls can hold a thread pool. At that point your entire service — which is otherwise perfectly healthy — is completely unavailable, brought down not by its own bugs but by something it was talking to. This is called cascading failure, and it's responsible for some of the most spectacular production outages in the industry.
The Circuit Breaker pattern exists to break that cascade. Instead of letting your service hammer a failing dependency indefinitely, it interposes a state machine between your code and the remote call. When failures breach a threshold, the breaker opens and subsequent calls fail fast — immediately, without touching the network — giving the downstream system breathing room to recover and protecting your own thread pool from exhaustion.
By the end of this article you'll understand exactly how the three-state machine works under the hood, how to tune failure thresholds and timeout windows without guessing, how to set up a production-grade breaker in Java, and the real-world gotchas that bite teams even when they think they've set it up correctly. We'll also compare the two dominant counting strategies, count-based and time-based sliding windows, so you can choose the right one for your traffic pattern.
What Is the Circuit Breaker Pattern?
The Circuit Breaker pattern is a state machine that monitors remote calls and opens when failures exceed a threshold. Its primary job is to fail fast, rather than slowly, when a dependency is unhealthy, and to give that dependency time to recover without being flooded with requests.
Think of it as a safety valve. In a closed state, all requests pass through normally. Each failure increments a counter. When the counter hits the configured threshold, the breaker trips to open, and subsequent requests are rejected immediately with an exception. After a recovery timeout, the breaker transitions to half-open, allowing a limited number of probe requests. If these succeed, the breaker closes again. If they fail, it reopens.
The pattern decouples error handling from business logic. You don't have to write try-catch blocks in every method that calls an external service. Instead, the circuit breaker centralises failure detection and recovery.
In terms of the electrical analogy:
- Failures = current overload
- Open state = tripped breaker, no current flows
- Half-open state = attempt to reset breaker
- Closed state = normal flow after reset
The Three States and Their Transitions
The circuit breaker operates in three distinct states:
CLOSED — Normal operation. All requests pass through. Each failure increments an internal counter. When the counter reaches the threshold, the breaker transitions to OPEN. In a count-based window, failures are counted within a fixed number of requests (e.g., 5 failures out of the last 10 requests). In time-based windows, failures are counted within a time window (e.g., 5 failures in the last 10 seconds).
OPEN — Requests are rejected immediately without calling the downstream service. The breaker remains open for a configurable recovery timeout. After this timeout, it transitions to HALF_OPEN.
HALF_OPEN — A limited number of probe requests are allowed through. If enough probes succeed to meet the success threshold, the breaker transitions back to CLOSED (and resets the failure count). If a probe fails, the breaker returns to OPEN and restarts the recovery timeout. The number of probes and the success threshold are configurable.
The transition from HALF_OPEN to CLOSED should require a minimum number of consecutive successes (e.g., 3) to prevent flaps. A single success is not enough — one probe could succeed by luck while the downstream is still degraded.
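The transitions above can be sketched in plain Java. This is a minimal illustration, not a library API: the class name `SimpleCircuitBreaker`, its constructor parameters, and the injected millisecond clock are all assumptions made for the example. It counts consecutive failures instead of using a sliding window, and it does not cap the number of concurrent probes in HALF_OPEN, both of which a production breaker would need.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Minimal sketch of the CLOSED / OPEN / HALF_OPEN state machine (illustrative only). */
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;   // consecutive failures that trip the breaker
    private final int successThreshold;   // consecutive probe successes required to close
    private final long recoveryTimeoutMs; // time to stay OPEN before allowing probes
    private final LongSupplier clockMs;   // injectable clock so the timeout is testable

    private State state = State.CLOSED;
    private int failures = 0;
    private int probeSuccesses = 0;
    private long openedAtMs = 0L;

    SimpleCircuitBreaker(int failureThreshold, int successThreshold,
                         long recoveryTimeoutMs, LongSupplier clockMs) {
        this.failureThreshold = failureThreshold;
        this.successThreshold = successThreshold;
        this.recoveryTimeoutMs = recoveryTimeoutMs;
        this.clockMs = clockMs;
    }

    /** Current state; OPEN becomes HALF_OPEN once the recovery timeout has elapsed. */
    synchronized State state() {
        if (state == State.OPEN && clockMs.getAsLong() - openedAtMs >= recoveryTimeoutMs) {
            state = State.HALF_OPEN;
            probeSuccesses = 0;
        }
        return state;
    }

    /** Runs the call if the breaker allows it and records the outcome. */
    <T> T call(Supplier<T> remoteCall) {
        if (state() == State.OPEN) {
            // Fail fast: the remote call is never invoked while OPEN.
            throw new IllegalStateException("circuit open: failing fast");
        }
        try {
            T result = remoteCall.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            throw e;
        }
    }

    private synchronized void onSuccess() {
        if (state == State.HALF_OPEN) {
            // Close only after enough consecutive probe successes.
            if (++probeSuccesses >= successThreshold) {
                state = State.CLOSED;
                failures = 0;
            }
        } else {
            failures = 0; // a success resets the consecutive-failure count
        }
    }

    private synchronized void onFailure() {
        // Any failed probe reopens; in CLOSED the threshold must be reached first.
        if (state == State.HALF_OPEN || ++failures >= failureThreshold) {
            state = State.OPEN;
            openedAtMs = clockMs.getAsLong();
            failures = 0;
        }
    }
}
```

Passing the clock in as a `LongSupplier` (for example `System::currentTimeMillis` in real use) keeps the recovery timeout testable without sleeping.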
Implementing a Circuit Breaker in Java: Production-Grade Approach
Building a circuit breaker from scratch is educational, but for production you should use a battle-tested library. Two popular choices in Java are Resilience4j and Spring Cloud Circuit Breaker. This section focuses on Resilience4j, which provides sliding-window failure counting and event listeners, with thread pool isolation available through its separate Bulkhead module.
Resilience4j's circuit breaker supports two counting strategies:
- count-based: failures in the last N calls (e.g., last 10 calls)
- time-based: failures within a time window (e.g., last 10 seconds)
Each strategy has its own internal sliding window implementation. The count-based strategy uses a circular array holding the outcomes of the last N calls. The time-based strategy uses a circular array of N per-second buckets, each holding aggregated counts for the calls made in that second, so individual calls are not stored. Both record a call in O(1) and consume memory proportional to the configured window size.
Count-Based vs Time-Based Sliding Windows: The Right Strategy for Your Traffic
The sliding window strategy determines how failures are aggregated. Count-based windows consider the last N requests. Time-based windows consider all requests within the last T duration. Both have trade-offs that matter in production.
Count-based is simple: keep a circular buffer of the last N call results. Each new call overwrites the oldest. Failure rate = failures / N. It works well when the request rate is roughly constant. During low traffic, however, the buffer can hold results that are minutes or hours old, so stale successes may dilute a fresh burst of failures and keep the breaker from tripping, while stale failures may trip it long after the problem has passed.
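The circular buffer described above might look roughly like this in Java. `CountBasedWindow` and its method names are hypothetical, chosen for this sketch. One deliberate difference from the plain failures / N formula: until the buffer has filled, the sketch divides by the number of calls recorded so far, so early results are not diluted by empty slots.

```java
/** Sketch of a count-based sliding window over the last N call outcomes (illustrative only). */
final class CountBasedWindow {
    private final boolean[] failed; // circular buffer: true means the call failed
    private int next = 0;           // slot the next outcome will overwrite
    private int recorded = 0;       // filled slots, capped at the buffer size
    private int failures = 0;       // running count of failures currently in the buffer

    CountBasedWindow(int size) {
        this.failed = new boolean[size];
    }

    /** Records one outcome in O(1), evicting the oldest outcome once the buffer is full. */
    void record(boolean callFailed) {
        if (recorded == failed.length) {
            if (failed[next]) {
                failures--; // the outcome being overwritten was a failure
            }
        } else {
            recorded++;
        }
        failed[next] = callFailed;
        if (callFailed) {
            failures++;
        }
        next = (next + 1) % failed.length;
    }

    /** Failure rate over the recorded outcomes, or 0.0 when nothing has been recorded. */
    double failureRate() {
        return recorded == 0 ? 0.0 : (double) failures / recorded;
    }
}
```

Keeping a running `failures` count means the rate never requires a scan of the buffer, which is what makes each recording constant-time.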
A straightforward time-based window keeps a list of timestamped results. Each call records its result and timestamp, and records older than the window duration are evicted. This adapts naturally to traffic variations: during a spike, the window fills quickly; during a lull, it drains. The memory overhead is higher because every call inside the window is stored: O(requestsInWindow) for this approach versus O(windowSize) for count-based. Bucketed implementations that keep one aggregate per time slice avoid this cost at the price of per-call precision.
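A minimal sketch of the timestamp-list approach follows. `TimeBasedWindow` is a hypothetical class for illustration; timestamps are passed in explicitly (in milliseconds) so the eviction logic is easy to follow and test, and the sketch assumes timestamps arrive in non-decreasing order.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Sketch of a time-based sliding window that stores one entry per call (illustrative only). */
final class TimeBasedWindow {
    private static final class Entry {
        final long timestampMs;
        final boolean failed;

        Entry(long timestampMs, boolean failed) {
            this.timestampMs = timestampMs;
            this.failed = failed;
        }
    }

    private final Deque<Entry> entries = new ArrayDeque<>(); // oldest entry first
    private final long windowMs;
    private int failures = 0; // failures among the entries currently held

    TimeBasedWindow(long windowMs) {
        this.windowMs = windowMs;
    }

    /** Records one outcome at the given time after dropping expired entries. */
    void record(long nowMs, boolean callFailed) {
        evictExpired(nowMs);
        entries.addLast(new Entry(nowMs, callFailed));
        if (callFailed) {
            failures++;
        }
    }

    /** Failure rate over calls still inside the window, or 0.0 if the window is empty. */
    double failureRate(long nowMs) {
        evictExpired(nowMs);
        return entries.isEmpty() ? 0.0 : (double) failures / entries.size();
    }

    /** Removes entries whose age is at least the window duration. */
    private void evictExpired(long nowMs) {
        while (!entries.isEmpty() && nowMs - entries.peekFirst().timestampMs >= windowMs) {
            if (entries.pollFirst().failed) {
                failures--;
            }
        }
    }
}
```

The deque grows with the number of calls inside the window, which is the memory cost discussed above.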
Which one should you use? If your traffic is uniform (e.g., 100 req/s constantly), count-based is fine. If your traffic is bursty (e.g., periodic batch jobs that drive request spikes), time-based is more accurate because it measures real time, not request count.
Production Gotchas: What Bites Teams That Think They've Set It Up Correctly
Even with a working circuit breaker, teams hit common pitfalls that cause outages. Here are the six most dangerous ones.
1. Circuit breaker on timeout only, not on exception type. Many configurations only count timeouts as failures. But network errors, 5xx responses, and often 429 rate-limit responses should be counted too. If you only count timeouts, a service returning 503 errors will never trip the breaker.
2. Half-open probes that don't match real traffic. The probe request is often a simple health check, but the real failure could be a specific endpoint that's slow. Fix: make the probe a representative call, or wrap the real method call in a decorator that records success or failure on every call, including those made while half-open.
3. Not isolating thread pools per circuit breaker. If all downstream calls share one thread pool, a single slow dependency can occupy most of the pool before its breaker trips, starving calls to healthy dependencies. Separate thread pools per dependency (using Resilience4j's Bulkhead) prevent this.
4. Recovery timeout too short. Setting the open-state duration to 5 seconds for a database that takes 30 seconds to restart causes continuous open/half-open flapping. As a rule of thumb, set the recovery timeout to at least the P99 recovery time of the downstream service plus 50%; for that 30-second restart, that means at least 45 seconds.
5. Forgetting to reset failures on success. Some custom implementations never reset the failure count on a successful call while in CLOSED state. This causes the breaker to open after X total failures, even if they occurred days apart. Always reset the failure count after a successful call if you're using a count-based approach (or rely on a sliding window).
6. No fallback mechanism. Circuit breakers reject requests when open. If you don't provide a fallback (e.g., a cached response or a default value), the user gets an error. Combine the circuit breaker with a fallback method for a better user experience.
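Gotchas 1 and 6 can be illustrated with a small helper. Everything here is a hypothetical sketch, not part of any library: `BreakerPolicies`, `isFailureStatus`, `isFailureException`, and `withFallback` are names invented for the example, and the choice to count 429 as a failure is an assumption you should check against how your dependency uses rate limiting.

```java
import java.io.IOException;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

/** Hypothetical helpers for failure classification and fallbacks (illustrative only). */
final class BreakerPolicies {
    private BreakerPolicies() { }

    /** Counts server errors (5xx) and rate limiting (429) as breaker failures. */
    static boolean isFailureStatus(int httpStatus) {
        return httpStatus >= 500 || httpStatus == 429;
    }

    /** Counts timeouts and I/O errors as breaker failures; other exceptions are ignored. */
    static boolean isFailureException(Throwable error) {
        return error instanceof TimeoutException || error instanceof IOException;
    }

    /** Returns the protected call's result, or the fallback value if the call throws. */
    static <T> T withFallback(Supplier<T> protectedCall, Supplier<T> fallback) {
        try {
            return protectedCall.get();
        } catch (RuntimeException e) {
            // Covers both a rejected call (breaker open) and a failed remote call.
            return fallback.get();
        }
    }
}
```

A caller might wrap a breaker-protected lookup as `withFallback(() -> breaker.call(remoteLookup), () -> cachedValue)`, where `breaker`, `remoteLookup`, and `cachedValue` stand in for whatever your service already has.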
The Day the Thread Pool Died
- Always wrap every remote call in a circuit breaker — even "reliable" internal services fail
- Thread pool exhaustion is a silent killer; monitor thread pool usage with alerts at 80%
- Timeouts alone are not enough — they just make the failure slower
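A rough back-of-the-envelope check shows why a timeout only slows the failure down. By Little's law, the average number of threads held is the arrival rate multiplied by how long each call holds its thread. The figures below (100 requests per second, a 2-second timeout, a 1 ms fail-fast rejection) are made-up numbers for illustration, and `ThreadDemand` is a hypothetical helper.

```java
/** Little's law applied to thread usage (illustrative numbers only). */
final class ThreadDemand {
    private ThreadDemand() { }

    /** Average threads held = arrival rate (req/s) * time each call holds a thread (s). */
    static double threadsHeld(double requestsPerSecond, double secondsPerCall) {
        return requestsPerSecond * secondsPerCall;
    }

    /** Hung dependency with a 2 s timeout: 100 req/s * 2 s = 200 threads held. */
    static double withTimeoutOnly() {
        return threadsHeld(100.0, 2.0);
    }

    /** Open breaker rejecting in about 1 ms: 100 req/s * 0.001 s, well under one thread. */
    static double withOpenBreaker() {
        return threadsHeld(100.0, 0.001);
    }
}
```

Under these assumed numbers, a 200-thread pool is fully occupied by calls that are all going to fail anyway, while the open breaker leaves the pool essentially idle.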
Common mistakes to avoid
- Setting failure threshold too high
- Using a generic health endpoint for half-open probes
- Sharing thread pool across all circuit breakers
- Forgetting to reset failure count on success in custom implementations
- Not providing a fallback
Interview Questions on This Topic
Explain the three states of a circuit breaker and how transitions happen.