Circuit Breaker Pattern with Spring Cloud and Resilience4j
Master the Circuit Breaker pattern with Spring Cloud Resilience4j: CLOSED/OPEN/HALF_OPEN states, @CircuitBreaker, sliding windows, failure thresholds, and Actuator monitoring.
- Circuit breaker has three states: CLOSED (normal), OPEN (failing fast), HALF_OPEN (testing recovery)
- Annotate methods with @CircuitBreaker(name = "cbName", fallbackMethod = "fallbackMethod"), provided by the Resilience4j Spring Boot starter
- Configure sliding window type (COUNT_BASED vs TIME_BASED), failure rate threshold, and slow call threshold in Resilience4j config
- Monitor state and metrics via /actuator/circuitbreakers and /actuator/circuitbreakerevents
- Listen to CircuitBreakerOnStateTransitionEvent for alerts and operational visibility
A circuit breaker works exactly like the electrical circuit breaker in your home. When too many 'faults' happen (too many failing service calls), the breaker trips OPEN and stops sending requests to the failing service — just like a tripped breaker stops electricity to protect your appliances. After a cooldown, it goes HALF_OPEN and tries a few test requests; if they succeed, it closes back to normal.
Distributed systems fail in ways that monoliths never do. A microservice calling an inventory service that's responding in 30 seconds instead of 300 milliseconds will exhaust its thread pool in minutes, causing requests to queue up, then cascade failures to every caller upstream. Without a circuit breaker, one slow service can take down an entire platform.
The circuit breaker pattern was popularized by Michael Nygard in 'Release It!' and brought to JVM microservices by Netflix with Hystrix. Netflix placed Hystrix in maintenance mode in 2018, and the Spring Cloud ecosystem migrated to Resilience4j, a lightweight, modular fault-tolerance library that implements circuit breaking, rate limiting, bulkhead isolation, retry, and timeout as composable decorators.
The production pain point that drives adoption is almost always a cascading failure event. A downstream service degrades, API calls start timing out at 10 seconds instead of 200 milliseconds, and thread pools fill up in seconds. Without circuit breaking, callers retry their requests, which adds more load to the already-struggling downstream, creating a positive feedback loop of failure. A circuit breaker breaks this loop by failing fast for a configurable period.
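The thread-pool arithmetic behind that scenario can be sketched with Little's law (threads in use is roughly arrival rate times latency). The pool size of 200 and the 50 requests per second below are assumed numbers chosen for illustration, not measurements from a real system.

```java
// Back-of-the-envelope model of thread demand for a blocking caller.
// All figures used in main() are illustrative assumptions.
final class ThreadDemand {
    /** Little's law: average concurrent calls = arrival rate x time per call. */
    static double threadsNeeded(int requestsPerSecond, long latencyMillis) {
        return requestsPerSecond * latencyMillis / 1000.0;
    }

    /** Seconds until the pool is fully occupied, or infinity if steady state fits in the pool. */
    static double secondsUntilExhausted(int poolSize, int requestsPerSecond, long latencyMillis) {
        if (threadsNeeded(requestsPerSecond, latencyMillis) <= poolSize) {
            return Double.POSITIVE_INFINITY;
        }
        // Demand exceeds the pool, so no call finishes before the pool fills:
        // threads are simply claimed at the arrival rate.
        return poolSize / (double) requestsPerSecond;
    }

    public static void main(String[] args) {
        System.out.println(threadsNeeded(50, 200));                  // healthy: 200 ms per call
        System.out.println(threadsNeeded(50, 10_000));               // degraded: 10 s per call
        System.out.println(secondsUntilExhausted(200, 50, 10_000));  // time until the pool is full
    }
}
```

Under these assumptions a healthy dependency keeps about 10 threads busy, while the degraded one would need 500, so a 200-thread pool is saturated after 4 seconds.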
Spring Cloud CircuitBreaker provides a unified abstraction over Resilience4j (and optionally Sentinel, Spring Retry) with Spring Boot auto-configuration. The @CircuitBreaker annotation itself comes from the Resilience4j Spring Boot starter; applied to Spring beans, it integrates with Spring AOP to wrap method calls with circuit breaker logic transparently. The fallbackMethod receives the exception so your fallback logic can distinguish circuit-open scenarios from genuine business errors.
Sliding window configuration is where teams most often make mistakes. COUNT_BASED windows make decisions based on the last N calls, which works well for high-traffic services but reacts slowly on low-traffic services. TIME_BASED windows evaluate calls in the last N seconds, which is more appropriate for services with variable traffic patterns but requires enough requests per second to generate meaningful statistics.
This guide covers every aspect of the circuit breaker pattern as implemented in Spring Cloud with Resilience4j, from basic annotation usage to advanced event-driven alerting and Actuator-based monitoring in production.
Circuit Breaker State Machine: CLOSED, OPEN, HALF_OPEN
Understanding the Resilience4j state machine is prerequisite to correct configuration. Resilience4j defines six states: the three normal states CLOSED, OPEN, and HALF_OPEN, plus three special states, DISABLED, FORCED_OPEN, and METRICS_ONLY. The operational states are the first three.
CLOSED is normal operation. All calls pass through to the downstream service. Each call outcome is recorded in the sliding window. The failure rate and slow call rate are computed once minimum-number-of-calls have been recorded. If either rate reaches or exceeds its threshold, the circuit transitions to OPEN.
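OPEN is the protective state. All calls immediately throw CallNotPermittedException without touching the downstream service. The fallback method (if configured) is called instead. The circuit remains OPEN for wait-duration-in-open-state (default 60s). By default, the first call attempted after that period moves it to HALF_OPEN; set automatic-transition-from-open-to-half-open-enabled to true if the transition should happen on a timer without waiting for a call.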
HALF_OPEN is the probing state. A limited number of calls (permitted-number-of-calls-in-half-open-state) are allowed through to test whether the downstream service has recovered. All other calls are rejected immediately with CallNotPermittedException. After the permitted calls complete, the circuit closes if the failure rate is below the threshold. If the rate is at or above the threshold, the circuit opens again and starts another wait period.
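The three operational states can be modelled in a few lines of plain Java. This is a simplified, single-threaded sketch and not Resilience4j's implementation: the class name MiniBreaker is hypothetical, time is passed in explicitly, and the CLOSED state counts every call since the last transition instead of using a sliding window.

```java
/** Simplified model of the CLOSED / OPEN / HALF_OPEN cycle (hypothetical, not library code). */
final class MiniBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int minimumCalls;         // calls needed in CLOSED before the rate is evaluated
    private final double failureThreshold;  // percent, e.g. 50.0
    private final long waitMillis;          // how long to stay OPEN
    private final int halfOpenCalls;        // probe calls evaluated in HALF_OPEN

    private State state = State.CLOSED;
    private int calls;
    private int failures;
    private long openedAt;

    MiniBreaker(int minimumCalls, double failureThreshold, long waitMillis, int halfOpenCalls) {
        this.minimumCalls = minimumCalls;
        this.failureThreshold = failureThreshold;
        this.waitMillis = waitMillis;
        this.halfOpenCalls = halfOpenCalls;
    }

    State state() { return state; }

    /** False means the call is rejected, the model's stand-in for CallNotPermittedException. */
    boolean allow(long nowMillis) {
        if (state == State.OPEN && nowMillis - openedAt >= waitMillis) {
            moveTo(State.HALF_OPEN, nowMillis); // wait period over: start probing
        }
        return state != State.OPEN;
    }

    /** Records the outcome of a call that was allowed through. */
    void record(boolean success, long nowMillis) {
        if (state == State.OPEN) return;
        calls++;
        if (!success) failures++;
        int needed = (state == State.HALF_OPEN) ? halfOpenCalls : minimumCalls;
        if (calls < needed) return; // not enough data yet
        double failureRate = 100.0 * failures / calls;
        if (failureRate >= failureThreshold) {
            moveTo(State.OPEN, nowMillis);
        } else if (state == State.HALF_OPEN) {
            moveTo(State.CLOSED, nowMillis);
        }
    }

    private void moveTo(State next, long nowMillis) {
        state = next;
        calls = 0;
        failures = 0;
        if (next == State.OPEN) openedAt = nowMillis;
    }

    public static void main(String[] args) {
        MiniBreaker cb = new MiniBreaker(4, 50.0, 1000, 2);
        for (int i = 0; i < 4; i++) { cb.allow(0); cb.record(false, 0); }
        System.out.println(cb.state());      // after four failures
        System.out.println(cb.allow(500));   // still inside the wait period
        System.out.println(cb.allow(1000));  // wait period elapsed
        cb.record(true, 1000);
        cb.record(true, 1001);
        System.out.println(cb.state());      // after two successful probes
    }
}
```

Passing the clock in as a parameter keeps the model deterministic, which makes the transitions easy to step through by hand.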
DISABLED and FORCED_OPEN are manually set states for operational control, useful for maintenance windows or chaos engineering. The state can be forced via the Actuator management endpoint, or programmatically by looking up the breaker in CircuitBreakerRegistry and calling transitionToDisabledState() or transitionToForcedOpenState().
State transition events are valuable for operational visibility. Register a consumer on the breaker's event publisher (circuitBreaker.getEventPublisher().onStateTransition(...)) to fire alerts (PagerDuty, Slack) when circuits open; if you republish those events as Spring application events, an @EventListener for CircuitBreakerOnStateTransitionEvent works as well. A circuit opening is a signal that requires investigation: either the downstream service is degraded or your circuit breaker thresholds are misconfigured.
@CircuitBreaker Annotation and Fallback Methods
The @CircuitBreaker annotation (io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker, shipped with the Resilience4j Spring Boot starter) wraps the annotated method in a Resilience4j circuit breaker via Spring AOP. The name attribute should match a configured instance under resilience4j.circuitbreaker.instances; a name with no matching instance gets the default configuration. The fallbackMethod attribute specifies the name of a method in the same class that returns the same type.
Fallback method signatures must include all parameters of the original method plus a Throwable parameter as the last argument. The Throwable receives the exception that caused the fallback to trigger — this is critical for distinguishing between circuit-open scenarios (CallNotPermittedException) and actual downstream failures (FeignException, ConnectException). A fallback for a circuit-open scenario should return cached data; a fallback for a genuine 503 should return an appropriate error response or propagate the exception.
Multiple fallback methods can be chained for different exception types. If you define fallback methods with specific exception types as the last parameter (IOException fallback, TimeoutException fallback, Throwable fallback), Resilience4j selects the most specific matching method. This allows fine-grained fallback logic without a big switch statement in a single fallback method.
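The "most specific match wins" rule can be illustrated without the library. The sketch below is a stand-in model, assuming the lookup walks up the thrown exception's class hierarchy; FallbackResolver and its methods are hypothetical names, and in real code the handlers would be overloaded fallback methods on the bean.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeoutException;
import java.util.function.Function;

/** Models choosing the fallback whose exception type is closest to the thrown one. */
final class FallbackResolver<R> {
    private final Map<Class<? extends Throwable>, Function<Throwable, R>> handlers = new HashMap<>();

    FallbackResolver<R> on(Class<? extends Throwable> type, Function<Throwable, R> handler) {
        handlers.put(type, handler);
        return this;
    }

    R resolve(Throwable thrown) {
        // Exact type first, then each superclass up to Throwable.
        for (Class<?> c = thrown.getClass(); c != null; c = c.getSuperclass()) {
            Function<Throwable, R> handler = handlers.get(c);
            if (handler != null) return handler.apply(thrown);
        }
        throw new IllegalStateException("no fallback registered", thrown);
    }

    /** Demo wiring with three fallbacks of decreasing specificity. */
    static String describe(Throwable thrown) {
        FallbackResolver<String> resolver = new FallbackResolver<String>()
                .on(IOException.class, t -> "io fallback")
                .on(TimeoutException.class, t -> "timeout fallback")
                .on(Throwable.class, t -> "generic fallback");
        return resolver.resolve(thrown);
    }

    public static void main(String[] args) {
        System.out.println(describe(new FileNotFoundException("missing"))); // subclass of IOException
        System.out.println(describe(new TimeoutException()));
        System.out.println(describe(new IllegalArgumentException()));
    }
}
```

A FileNotFoundException lands on the IOException handler because that is its nearest registered ancestor, while anything unrelated falls through to the Throwable handler.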
AOP limitations are the most common source of confusion: @CircuitBreaker only works when the call goes through a Spring proxy. Calling an annotated method from within the same class (this.protectedMethod()) bypasses the AOP proxy and the circuit breaker doesn't activate. The calling code must inject the Spring bean and call the method through the injected reference, or use self-injection.
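The self-invocation problem is a property of proxies in general, so it can be reproduced with a plain JDK dynamic proxy. In this sketch the interception handler only records method names as a stand-in for circuit breaker logic; InventoryClient and RealClient are hypothetical types, and Spring's own proxies (JDK or CGLIB) are assumed to behave the same way for calls made through this.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

final class SelfInvocationDemo {
    interface InventoryClient {
        String fetchStock(String sku);       // imagine @CircuitBreaker on this method
        String fetchStockTwice(String sku);  // calls fetchStock through 'this'
    }

    static final class RealClient implements InventoryClient {
        public String fetchStock(String sku) {
            return "stock:" + sku;
        }

        public String fetchStockTwice(String sku) {
            // Self-invocation: goes straight to the target object, never through the proxy.
            return this.fetchStock(sku) + "," + this.fetchStock(sku);
        }
    }

    /** Wraps the target in a proxy that records every call it intercepts. */
    static InventoryClient proxied(InventoryClient target, List<String> intercepted) {
        InvocationHandler handler = (proxy, method, args) -> {
            intercepted.add(method.getName()); // stand-in for circuit breaker logic
            return method.invoke(target, args);
        };
        return (InventoryClient) Proxy.newProxyInstance(
                InventoryClient.class.getClassLoader(),
                new Class<?>[] { InventoryClient.class },
                handler);
    }

    /** Calls both methods through the proxy and returns what the proxy saw. */
    static List<String> interceptedCalls() {
        List<String> intercepted = new ArrayList<>();
        InventoryClient client = proxied(new RealClient(), intercepted);
        client.fetchStock("A1");
        client.fetchStockTwice("A1");
        return intercepted;
    }

    public static void main(String[] args) {
        System.out.println(interceptedCalls());
    }
}
```

The proxy sees one fetchStock call and one fetchStockTwice call; the two fetchStock calls made inside fetchStockTwice never reach it, which is exactly how an annotated method ends up unprotected.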
COUNT_BASED vs TIME_BASED Sliding Windows
Resilience4j supports two sliding window algorithms: COUNT_BASED and TIME_BASED. Choosing the wrong one for your traffic pattern is one of the most common circuit breaker configuration mistakes.
COUNT_BASED (default) uses a circular array of the last N call outcomes. A sliding-window-size of 20 means the circuit breaker evaluates the most recent 20 calls. This is efficient (O(N) memory for the window, O(1) computation per call) and reacts quickly to failure bursts. The limitation: on low-traffic services (5 calls per minute), 20 calls represents 4 minutes of history. A failure burst at minute 1 stays in the window until minute 5. On high-traffic services, 20 calls is a fraction of a second of history, potentially too reactive.
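A count-based window is essentially a ring buffer with a running failure count. The following is a minimal model of that idea, not Resilience4j's code; CountWindow is a hypothetical name, and returning -1 while too few calls are recorded mirrors how the library reports an unavailable rate.

```java
/** Count-based sliding window over the last N call outcomes (illustrative model). */
final class CountWindow {
    private final boolean[] failed;  // ring buffer: true means the call failed
    private final int minimumCalls;  // rate is not evaluated below this many calls
    private int next;                // slot the next outcome will overwrite
    private int recorded;            // outcomes stored so far, capped at the window size
    private int failures;            // running count of failures inside the window

    CountWindow(int size, int minimumCalls) {
        this.failed = new boolean[size];
        this.minimumCalls = minimumCalls;
    }

    void record(boolean callFailed) {
        if (recorded == failed.length) {
            if (failed[next]) failures--; // evict the oldest outcome
        } else {
            recorded++;
        }
        failed[next] = callFailed;
        if (callFailed) failures++;
        next = (next + 1) % failed.length;
    }

    /** Failure rate in percent, or -1 while fewer than minimumCalls are recorded. */
    double failureRate() {
        if (recorded < minimumCalls) return -1.0;
        return 100.0 * failures / recorded;
    }

    public static void main(String[] args) {
        CountWindow window = new CountWindow(4, 2);
        boolean[] outcomes = { true, true, false, false, false, false };
        for (boolean callFailed : outcomes) {
            window.record(callFailed);
            System.out.println(window.failureRate());
        }
    }
}
```

Each record is constant time because the evicted outcome adjusts the running count directly; the old failures only leave the window once enough newer calls have pushed them out.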
TIME_BASED divides time into N one-second epochs and maintains a circular array of epoch data. A sliding-window-size of 60 evaluates calls from the last 60 seconds. This provides consistent time-based semantics regardless of traffic volume. The limitation is data volume: at 1 call per second, a 60-second window holds about 60 calls, which is enough for meaningful statistics, but at 1 call per minute the same window rarely contains enough data to compute a meaningful failure rate.
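The per-second buckets can be modelled as follows. This is a simplified sketch, not the library's implementation: TimeWindow is a hypothetical name, timestamps are passed in as non-negative epoch seconds, and the rate is computed by scanning all buckets, whereas a production implementation would keep a running total.

```java
import java.util.Arrays;

/** Time-based sliding window with one bucket per second (illustrative model). */
final class TimeWindow {
    private final long[] bucketSecond; // epoch second each bucket currently represents
    private final int[] calls;
    private final int[] failures;
    private final int minimumCalls;

    TimeWindow(int windowSeconds, int minimumCalls) {
        this.bucketSecond = new long[windowSeconds];
        Arrays.fill(bucketSecond, Long.MIN_VALUE); // marks buckets as never used
        this.calls = new int[windowSeconds];
        this.failures = new int[windowSeconds];
        this.minimumCalls = minimumCalls;
    }

    void record(long epochSecond, boolean callFailed) {
        int i = (int) (epochSecond % bucketSecond.length);
        if (bucketSecond[i] != epochSecond) {
            // The bucket still holds data from an older second: reuse it.
            bucketSecond[i] = epochSecond;
            calls[i] = 0;
            failures[i] = 0;
        }
        calls[i]++;
        if (callFailed) failures[i]++;
    }

    /** Failure rate in percent over the last N seconds, or -1 if too few calls. */
    double failureRate(long nowSecond) {
        long oldestIncluded = nowSecond - bucketSecond.length + 1;
        int totalCalls = 0;
        int totalFailures = 0;
        for (int i = 0; i < bucketSecond.length; i++) {
            if (bucketSecond[i] >= oldestIncluded && bucketSecond[i] <= nowSecond) {
                totalCalls += calls[i];
                totalFailures += failures[i];
            }
        }
        if (totalCalls < minimumCalls) return -1.0;
        return 100.0 * totalFailures / totalCalls;
    }

    public static void main(String[] args) {
        TimeWindow window = new TimeWindow(3, 2);
        window.record(10, true);
        window.record(10, true);
        window.record(11, false);
        window.record(11, false);
        System.out.println(window.failureRate(11)); // both seconds inside the window
        System.out.println(window.failureRate(13)); // second 10 has aged out
        System.out.println(window.failureRate(14)); // nothing recent enough
    }
}
```

Unlike the count-based model, old failures disappear with the passage of time even when no new calls arrive, which is also why a quiet service can drop below minimum-number-of-calls and stop being evaluated.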
For high-traffic services (100+ RPS): use COUNT_BASED with window size 100-200. For medium-traffic services (1-100 RPS): either works; COUNT_BASED with size 50 is a safe default. For low-traffic services (<1 RPS): use TIME_BASED with a longer window (300 seconds) and size minimum-number-of-calls to the volume you actually expect within that window. Services with highly variable traffic (batch jobs, event-driven) should use TIME_BASED.
The minimum-number-of-calls setting acts as a guard — the failure rate is only evaluated after this many calls have been recorded. Set it to at least 10-20 to avoid opening the circuit on a single burst of test failures.
Bulkhead and TimeLimiter as Circuit Breaker Companions
Circuit breakers work best when combined with two other Resilience4j patterns: TimeLimiter and Bulkhead. Without them, even a well-tuned circuit breaker can be circumvented by slow calls that never fail (they just take forever) or by too many concurrent calls exhausting thread pools.
TimeLimiter wraps calls with a hard timeout. When the timeout expires, a TimeoutException is thrown, which counts as a failure in the circuit breaker's sliding window. This is essential for preventing thread pool exhaustion from slow dependencies — without a timeout, threads block indefinitely waiting for responses. Configure timeout-duration to be slightly above your P99 SLA target (not P99 of the downstream service, but your user-facing SLA).
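The effect of a hard timeout can be sketched with the JDK's CompletableFuture. This is an illustration of the idea, not Resilience4j's TimeLimiter; TimeBound and slowDependency are hypothetical names, and the sleep stands in for a slow downstream call.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

final class TimeBound {
    /** Runs the call on another thread and stops waiting once the timeout passes. */
    static <T> T callWithTimeout(Supplier<T> call, Duration timeout) throws Exception {
        CompletableFuture<T> future = CompletableFuture.supplyAsync(call);
        // get(...) throws TimeoutException at the deadline. The caller's thread is
        // freed, but the underlying task keeps running until it finishes on its own.
        return future.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
    }

    /** Hypothetical downstream call that takes roughly the given time. */
    static String slowDependency(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "response";
    }

    static String outcome(long dependencyMillis, long timeoutMillis) {
        try {
            return callWithTimeout(() -> slowDependency(dependencyMillis),
                    Duration.ofMillis(timeoutMillis));
        } catch (TimeoutException e) {
            return "timed out"; // a circuit breaker would record this as a failure
        } catch (Exception e) {
            return "failed: " + e;
        }
    }

    public static void main(String[] args) {
        System.out.println(outcome(0, 1000)); // dependency answers well within the limit
        System.out.println(outcome(500, 50)); // dependency is slower than the limit
    }
}
```

Note the caveat in the comment: a timeout releases the waiting thread but does not stop the work already in flight, which is one reason to pair timeouts with a concurrency limit.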
Bulkhead limits concurrent calls to a downstream service. There are two implementations: SemaphoreBulkhead (limits concurrent calls with a semaphore, synchronous) and ThreadPoolBulkhead (uses a separate thread pool, enabling async execution). SemaphoreBulkhead rejects calls that would exceed the concurrency limit with BulkheadFullException immediately — this fast rejection prevents thread pool exhaustion in the calling service. ThreadPoolBulkhead offloads calls to a dedicated thread pool, isolating the calling service's thread pool from downstream slowness.
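The semaphore variant is small enough to sketch with java.util.concurrent.Semaphore. This is a minimal model of the fast-rejection behaviour, not the library's SemaphoreBulkhead; SemaphoreGate is a hypothetical name and IllegalStateException stands in for BulkheadFullException.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Limits concurrent calls and rejects immediately when the limit is reached. */
final class SemaphoreGate {
    private final Semaphore permits;

    SemaphoreGate(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    <T> T execute(Supplier<T> call) {
        if (!permits.tryAcquire()) { // non-blocking: never queues the caller
            throw new IllegalStateException("bulkhead full");
        }
        try {
            return call.get();
        } finally {
            permits.release(); // always hand the permit back
        }
    }

    /** Tries a second call while the only permit is already held by the first. */
    static String nestedCallOutcome() {
        SemaphoreGate gate = new SemaphoreGate(1);
        Supplier<String> inner = () -> "inner call ran";
        Supplier<String> outer = () -> {
            try {
                return gate.execute(inner);
            } catch (IllegalStateException e) {
                return "inner call rejected";
            }
        };
        return gate.execute(outer);
    }

    public static void main(String[] args) {
        System.out.println(nestedCallOutcome());
    }
}
```

The nested call is a deterministic way to show a second caller arriving while the single permit is in use; with real traffic the second caller would be another thread.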
The combination of CircuitBreaker + TimeLimiter + Bulkhead provides defense-in-depth: Bulkhead prevents too many concurrent calls (fast rejection), TimeLimiter prevents calls from taking too long (timeout), CircuitBreaker prevents calling the service at all when it's degraded (fail fast). Use @CircuitBreaker + @TimeLimiter + @Bulkhead annotations on the same method, or compose them programmatically via Resilience4j decorators. Note that @TimeLimiter only applies to methods returning a CompletionStage or a reactive type, so a synchronous method has to be made asynchronous before the annotation can be used.
Actuator Monitoring: /actuator/circuitbreakers Endpoint
Spring Boot Actuator exposes Resilience4j circuit breaker state and metrics through dedicated endpoints when resilience4j-spring-boot3 and the actuator dependency are on the classpath. The primary endpoint is /actuator/circuitbreakers which returns the current state, metrics, and configuration for all registered circuit breakers.
Key metrics from /actuator/circuitbreakers: state (CLOSED/OPEN/HALF_OPEN), failureRate (percentage), slowCallRate (percentage), numberOfBufferedCalls, numberOfFailedCalls, numberOfSlowCalls, numberOfNotPermittedCalls (rejected by OPEN circuit). These metrics provide real-time operational insight without needing to wait for a Prometheus scrape.
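The /actuator/circuitbreakerevents endpoint provides a historical log of recent events: SUCCESS, ERROR, IGNORED_ERROR, NOT_PERMITTED (rejected), STATE_TRANSITION, FAILURE_RATE_EXCEEDED, and SLOW_CALL_RATE_EXCEEDED, among others. You can filter by circuit breaker name and event type using path segments: /actuator/circuitbreakerevents/{name}/{eventType}. This is invaluable for diagnosing intermittent failures, because you can see the exact sequence of events that led to a circuit opening.
For Prometheus-based monitoring, add the resilience4j-micrometer dependency together with a Prometheus registry (micrometer-registry-prometheus). The register-health-indicator: true setting is separate: it publishes each breaker under /actuator/health and does not affect metrics. Key Prometheus metrics, as named in recent versions (verify against your own /actuator/prometheus output): resilience4j_circuitbreaker_state (one gauge series per state label, with value 1 for the current state and 0 for the others), resilience4j_circuitbreaker_failure_rate (gauge, percentage), resilience4j_circuitbreaker_slow_call_rate (gauge, percentage), resilience4j_circuitbreaker_calls_seconds_count (call counts by kind: successful, failed, ignored), and resilience4j_circuitbreaker_not_permitted_calls_total (counter). Create dashboards showing failure rate trends over time and alert on sustained failure rate above threshold.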
Resilience4j Configuration Reference and Tuning Guide
Correct threshold tuning requires understanding your services' baseline performance characteristics. Start by measuring your P50, P95, P99, and P999 response times and error rates under normal load. The circuit breaker thresholds should trip when the service deviates significantly from baseline — not on normal variance.
Failure rate threshold: Start at 50% for most services. For critical services (payment, auth) where any degradation is unacceptable, lower to 30%. For batch or background services where partial failures are tolerable, raise to 70%. Track the false positive rate — if the circuit opens more than once per week without a genuine downstream issue, the threshold is too sensitive.
Slow call threshold: Set slow-call-duration-threshold to 2-3x your P99 response time. If P99 is 300ms, that gives a threshold of 600-900ms. Set slow-call-rate-threshold to 60-70%, so the breaker treats the dependency as slow only when the majority of calls are slow, not on occasional P999 events.
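As a worked example of that rule of thumb, the sketch below derives a threshold from a P99 value and then computes the slow call rate for a batch of latencies. The sample latencies are made up for illustration, SlowCallTuning is a hypothetical helper, and the "strictly greater than" comparison is an assumption about how a slow call is classified.

```java
import java.time.Duration;

final class SlowCallTuning {
    /** Candidate slow-call-duration-threshold: a whole multiple of the baseline P99. */
    static Duration thresholdFromP99(Duration p99, int multiplier) {
        return p99.multipliedBy(multiplier);
    }

    /** Percentage of calls that took longer than the threshold. */
    static double slowCallRate(long[] durationsMillis, Duration threshold) {
        int slow = 0;
        for (long duration : durationsMillis) {
            if (duration > threshold.toMillis()) slow++;
        }
        return 100.0 * slow / durationsMillis.length;
    }

    public static void main(String[] args) {
        Duration p99 = Duration.ofMillis(300);
        Duration lower = thresholdFromP99(p99, 2);
        Duration upper = thresholdFromP99(p99, 3);
        System.out.println(lower.toMillis() + " to " + upper.toMillis() + " ms");

        // Made-up latencies from a degraded period, in milliseconds.
        long[] sample = { 120, 250, 700, 950, 300, 1000, 800, 200, 650, 100 };
        System.out.println(slowCallRate(sample, lower) + " % slow");
    }
}
```

With the 600ms threshold, half of the sample counts as slow, which stays below a 60% slow-call-rate-threshold and would not open the circuit on its own.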
Wait duration in open state: This determines how long the circuit stays OPEN before trying again. It should be long enough for the downstream service to recover, but not so long that you miss recovery. 30 seconds is a reasonable default; for services with known recovery patterns (database failover takes 45 seconds), set it accordingly.
Permitted calls in HALF_OPEN: Set to 5-10 for a statistically meaningful sample. With 1 permitted call, a single recovered response closes the circuit; if the service is intermittently recovering, this causes rapid open-close cycling.
Missing Slow-Call Threshold Caused 8-Minute Outage During Black Friday
- Failure rate alone is insufficient for circuit breaking.
- Slow services that eventually respond are just as dangerous as services that fail outright.
- Always configure slow-call-duration-threshold and slow-call-rate-threshold alongside failure-rate-threshold, and add time limiters to bound maximum wait time.
curl -s http://your-service:8080/actuator/circuitbreakers | python3 -m json.tool
curl -s http://your-service:8080/actuator/circuitbreakerevents/PaymentService/STATE_TRANSITION | python3 -m json.tool
Key takeaways
- this.method() bypasses the circuit breaker entirely
Common mistakes to avoid
- Not configuring slow-call-duration-threshold
- Calling a @CircuitBreaker-annotated method via this.method() instead of through the injected proxy (self.method())
- Including business exceptions (404, 409, 422) in record-exceptions
- Setting minimum-number-of-calls to 1 or not setting it
- Not adding register-health-indicator: true
- Using fallback to silently succeed on write operations when circuit is OPEN
Interview Questions on This Topic
Explain the three states of a circuit breaker and what triggers transitions between them.