Senior 8 min · May 23, 2026

Circuit Breaker Pattern with Spring Cloud and Resilience4j

Master the Circuit Breaker pattern with Spring Cloud Resilience4j: CLOSED/OPEN/HALF_OPEN states, @CircuitBreaker, sliding windows, failure thresholds, and Actuator monitoring.

Naren · Founder
Quick Answer
  • Circuit breaker has three states: CLOSED (normal), OPEN (failing fast), HALF_OPEN (testing recovery)
  • Annotate methods with @CircuitBreaker(name = "cbName", fallbackMethod = "fallbackMethod") from Resilience4j's Spring Boot module (io.github.resilience4j.circuitbreaker.annotation)
  • Configure sliding window type (COUNT_BASED vs TIME_BASED), failure rate threshold, and slow call threshold in Resilience4j config
  • Monitor state and metrics via /actuator/circuitbreakers and /actuator/circuitbreakerevents
  • Subscribe to CircuitBreakerOnStateTransitionEvent through the breaker's event publisher for alerts and operational visibility
✦ Definition · ~90s read
What is Circuit Breaker Pattern with Spring Cloud and Resilience4j?

The Circuit Breaker pattern is a stability pattern that prevents cascading failures in distributed systems by monitoring call failure rates and short-circuiting calls when failures exceed a threshold. It maintains a state machine with three states: CLOSED (all calls pass through, failures are counted), OPEN (all calls fail immediately with a fallback, no downstream calls made), and HALF_OPEN (a limited number of probe calls are allowed through; if they succeed, the circuit closes; if they fail, it opens again).

Resilience4j implements this with a circular array of the last N call outcomes for COUNT_BASED sliding windows and per-second aggregation buckets for TIME_BASED windows. Each recorded call outcome (success, failure, slow success, slow failure) updates the window; exceptions listed in ignore-exceptions count as neither success nor failure. The failure rate is computed as failed calls / total calls; the slow call rate as slow calls (those exceeding slow-call-duration-threshold) / total calls.

When either rate reaches its threshold, the circuit opens. The recorded statistics are reset on every state transition, so a circuit that has just closed starts with an empty window.
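
The rate calculation can be sketched in plain Java. This is a simplified illustration, not Resilience4j's actual implementation: the class name, fields, and the convention of returning -1 until enough calls are recorded are assumptions made for the example.

```java
/** Simplified COUNT_BASED window: remembers the last N outcomes in a circular array. */
final class CountBasedWindowSketch {
    private final boolean[] failed;   // true = that slot recorded a failure
    private final int minimumCalls;   // stand-in for minimum-number-of-calls
    private int next = 0;             // slot that will be overwritten next
    private int recorded = 0;         // calls seen so far, capped at the window size
    private int failures = 0;         // running failure total for the window

    CountBasedWindowSketch(int windowSize, int minimumCalls) {
        this.failed = new boolean[windowSize];
        this.minimumCalls = minimumCalls;
    }

    /** Records one outcome, evicting the oldest outcome once the window is full. */
    void record(boolean isFailure) {
        if (recorded == failed.length && failed[next]) {
            failures--;               // the evicted outcome was a failure
        }
        failed[next] = isFailure;
        if (isFailure) {
            failures++;
        }
        next = (next + 1) % failed.length;
        if (recorded < failed.length) {
            recorded++;
        }
    }

    /** Failure rate in percent, or -1 while fewer than minimumCalls are recorded. */
    float failureRate() {
        if (recorded < minimumCalls) {
            return -1f;
        }
        return 100f * failures / recorded;
    }
}
```

Keeping a running failure total means each call costs O(1) work, while the array itself costs O(N) memory for a window of N calls.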

Spring Cloud CircuitBreaker wraps Resilience4j with Spring-idiomatic configuration and a CircuitBreakerFactory abstraction, while Resilience4j's own Spring Boot module supplies the AOP-based @CircuitBreaker annotation and the Spring Boot Actuator integration for runtime monitoring. Each circuit breaker is a named instance: the name is used as the key in configuration, metrics tags, and event logs, so naming them descriptively (after the target service or method) is important for operational clarity.

Plain-English First

A circuit breaker works exactly like the electrical circuit breaker in your home. When too many 'faults' happen (too many failing service calls), the breaker trips OPEN and stops sending requests to the failing service — just like a tripped breaker stops electricity to protect your appliances. After a cooldown, it goes HALF_OPEN and tries a few test requests; if they succeed, it closes back to normal.

Distributed systems fail in ways that monoliths never do. A microservice calling an inventory service that's responding in 30 seconds instead of 300 milliseconds will exhaust its thread pool in minutes, causing requests to queue up, then cascade failures to every caller upstream. Without a circuit breaker, one slow service can take down an entire platform.

The circuit breaker pattern was popularized by Michael Nygard in 'Release It!' and formalized for JVM microservices by Netflix with Hystrix. Netflix placed Hystrix in maintenance mode in 2018, and the Spring Cloud ecosystem migrated to Resilience4j, a lightweight, modular fault-tolerance library that implements circuit breaking, rate limiting, bulkhead isolation, retry, and timeout as composable decorators.

The production pain point that drives adoption is almost always a cascading failure event. A downstream service degrades, API calls start timing out at 10 seconds instead of 200 milliseconds, and thread pools fill up in seconds. Without circuit breaking, callers retry their requests, which adds more load to the already-struggling downstream, creating a positive feedback loop of failure. A circuit breaker breaks this loop by failing fast for a configurable period.

Spring Cloud CircuitBreaker provides a unified abstraction over Resilience4j (and optionally Sentinel, Spring Retry) with Spring Boot auto-configuration. The @CircuitBreaker annotation, which comes from Resilience4j's Spring Boot module, integrates with Spring AOP to wrap method calls on Spring beans with circuit breaker logic transparently. The fallbackMethod receives the exception so your fallback logic can distinguish circuit-open scenarios from genuine business errors.

Sliding window configuration is where teams most often make mistakes. COUNT_BASED windows make decisions based on the last N calls, which works well for high-traffic services but reacts slowly on low-traffic services. TIME_BASED windows evaluate calls in the last N seconds, which is more appropriate for services with variable traffic patterns but requires enough requests per second to generate meaningful statistics.

This guide covers every aspect of the circuit breaker pattern as implemented in Spring Cloud with Resilience4j, from basic annotation usage to advanced event-driven alerting and Actuator-based monitoring in production.

Circuit Breaker State Machine: CLOSED, OPEN, HALF_OPEN

Understanding the Resilience4j state machine is a prerequisite to correct configuration. Resilience4j defines six states: the three normal states CLOSED, OPEN, and HALF_OPEN, plus the special states DISABLED, FORCED_OPEN, and METRICS_ONLY. Day-to-day operation involves only the first three.

CLOSED is normal operation. All calls pass through to the downstream service. Each call outcome is recorded in the sliding window. The failure rate and slow call rate are computed after minimum-number-of-calls have been recorded. If either rate exceeds its threshold, the circuit transitions to OPEN.

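OPEN is the protective state. All calls immediately throw CallNotPermittedException without touching the downstream service. The fallback method (if configured) is called instead. The circuit remains OPEN for wait-duration-in-open-state (default 60s). By default the move to HALF_OPEN then happens on the next call that arrives after the wait has elapsed; set automatic-transition-from-open-to-half-open-enabled: true if you want a background timer to make the transition even when no calls arrive.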

HALF_OPEN is the probing state. A limited number of calls (permitted-number-of-calls-in-half-open-state) are allowed through to test if the downstream service has recovered. All other calls fail immediately (no waiting). After the permitted calls complete, if the failure rate is below the threshold, the circuit closes. If it's above, the circuit opens again and starts another wait period.
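
The three operational states can be sketched as a small state machine. This is a simplified, single-threaded illustration rather than Resilience4j's implementation: the class and method names are hypothetical, and time is passed in explicitly so the behavior is easy to follow.

```java
/** Simplified CLOSED / OPEN / HALF_OPEN state machine (not thread-safe). */
final class BreakerStateSketch {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final long waitInOpenMillis;       // stand-in for wait-duration-in-open-state
    private final int permittedProbes;         // stand-in for permitted-number-of-calls-in-half-open-state
    private final float failureRateThreshold;  // percent
    private State state = State.CLOSED;
    private long openedAtMillis;
    private int probesStarted;
    private int probesDone;
    private int probeFailures;

    BreakerStateSketch(long waitInOpenMillis, int permittedProbes, float failureRateThreshold) {
        this.waitInOpenMillis = waitInOpenMillis;
        this.permittedProbes = permittedProbes;
        this.failureRateThreshold = failureRateThreshold;
    }

    State state() {
        return state;
    }

    /** Called when a rate reaches its threshold while CLOSED, or when probing fails. */
    void tripOpen(long nowMillis) {
        state = State.OPEN;
        openedAtMillis = nowMillis;
    }

    /** Returns true if the call may go downstream, false if it must fail fast. */
    boolean tryAcquire(long nowMillis) {
        if (state == State.OPEN && nowMillis - openedAtMillis >= waitInOpenMillis) {
            state = State.HALF_OPEN;           // wait elapsed: start probing
            probesStarted = 0;
            probesDone = 0;
            probeFailures = 0;
        }
        switch (state) {
            case CLOSED:
                return true;
            case HALF_OPEN:
                if (probesStarted < permittedProbes) {
                    probesStarted++;
                    return true;
                }
                return false;                  // extra calls are rejected while probing
            default:
                return false;                  // OPEN
        }
    }

    /** Records a probe outcome; decides once all permitted probes have finished. */
    void onProbeResult(boolean failure, long nowMillis) {
        if (state != State.HALF_OPEN) {
            return;
        }
        probesDone++;
        if (failure) {
            probeFailures++;
        }
        if (probesDone == permittedProbes) {
            float rate = 100f * probeFailures / probesDone;
            if (rate >= failureRateThreshold) {
                tripOpen(nowMillis);           // still unhealthy: start another wait
            } else {
                state = State.CLOSED;
            }
        }
    }
}
```

In this sketch the OPEN to HALF_OPEN move happens lazily inside tryAcquire, so it only occurs when a call arrives after the wait has elapsed.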

DISABLED and FORCED_OPEN are manually set states for operational control — useful for maintenance windows or chaos engineering. The state can be forced via the Actuator management endpoint or programmatically via CircuitBreakerRegistry.

State transition events are valuable for operational visibility. Register a consumer with circuitBreaker.getEventPublisher().onStateTransition(...), which receives a CircuitBreakerOnStateTransitionEvent, and use it to fire alerts (PagerDuty, Slack) when circuits open. A circuit opening is a signal that requires investigation: either the downstream service is degraded or your circuit breaker thresholds are misconfigured.

DISABLED vs FORCED_OPEN for Operational Control
DISABLED turns the circuit breaker off entirely: all calls pass through, and no metrics or events are recorded apart from the state transition itself. FORCED_OPEN rejects all calls and likewise records no metrics. METRICS_ONLY lets every call through but keeps recording metrics and events. Use FORCED_OPEN for maintenance windows when you need to stop traffic to a service, METRICS_ONLY when you want to observe a dependency without ever blocking it, and DISABLED only for emergency debugging when you need to rule out the circuit breaker as a cause of issues.
Production Insight
A circuit breaker opening is a production incident signal — wire it to your alerting system and require a human review before the circuit closes rather than relying solely on automatic HALF_OPEN testing.
Key Takeaway
The three operational states (CLOSED/OPEN/HALF_OPEN) form a state machine; listen to state transitions for alerting and use FORCED_OPEN for planned maintenance windows.

@CircuitBreaker Annotation and Fallback Methods

The @CircuitBreaker annotation (io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker, provided by Resilience4j's Spring Boot module) wraps the annotated method in a Resilience4j circuit breaker via Spring AOP. The name attribute should match a configured instance under resilience4j.circuitbreaker.instances; a name with no matching entry gets the default configuration. The fallbackMethod attribute specifies the name of a method in the same class that returns the same type.

Fallback method signatures must include all parameters of the original method plus a Throwable parameter as the last argument. The Throwable receives the exception that caused the fallback to trigger — this is critical for distinguishing between circuit-open scenarios (CallNotPermittedException) and actual downstream failures (FeignException, ConnectException). A fallback for a circuit-open scenario should return cached data; a fallback for a genuine 503 should return an appropriate error response or propagate the exception.
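
The distinction between a rejected call and a genuine downstream failure can be sketched without any framework. In this minimal example CircuitOpenException is a local stand-in for Resilience4j's CallNotPermittedException, and the class, the cache field, and the method names are hypothetical.

```java
import java.util.function.Supplier;

/** Plain-Java sketch of a fallback that inspects the exception it receives. */
final class FallbackSketch {
    /** Local stand-in for CallNotPermittedException (assumption for this example). */
    static final class CircuitOpenException extends RuntimeException {
        CircuitOpenException(String message) {
            super(message);
        }
    }

    private String cachedPrice; // last known good value, null until the first success

    /** Runs the call; on failure hands the exception to the fallback, as the aspect would. */
    String getPrice(Supplier<String> downstreamCall) {
        try {
            String fresh = downstreamCall.get();
            cachedPrice = fresh;
            return fresh;
        } catch (RuntimeException e) {
            return priceFallback(e);
        }
    }

    /** Serves cached data only when the breaker rejected the call; otherwise surfaces an error. */
    String priceFallback(Throwable cause) {
        if (cause instanceof CircuitOpenException && cachedPrice != null) {
            return cachedPrice;
        }
        throw new IllegalStateException("price service unavailable", cause);
    }
}
```

With the real annotation, the fallback would be a method in the same bean that takes the original parameters followed by the exception parameter.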

Multiple fallback methods can be chained for different exception types. If you define fallback methods with specific exception types as the last parameter (IOException fallback, TimeoutException fallback, Throwable fallback), Resilience4j selects the most specific matching method. This allows fine-grained fallback logic without a big switch statement in a single fallback method.

AOP limitations are the most common source of confusion: @CircuitBreaker only works when the call goes through a Spring proxy. Calling an annotated method from within the same class (this.protectedMethod()) bypasses the AOP proxy and the circuit breaker doesn't activate. The calling code must inject the Spring bean and call the method through the injected reference, or use self-injection.
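
The proxy bypass is easy to reproduce with a plain JDK dynamic proxy. This is only a sketch of the mechanism: the interface, class, and counter are hypothetical, and the counter marks the point where circuit breaker logic would run in a real proxy.

```java
import java.lang.reflect.Proxy;
import java.util.concurrent.atomic.AtomicInteger;

/** Shows why self-invocation never reaches proxy-based interception. */
final class SelfInvocationSketch {
    interface PaymentApi {
        String charge();
        String chargeTwice();
    }

    static final class PaymentService implements PaymentApi {
        @Override
        public String charge() {
            return "charged";
        }

        // Both inner calls go to 'this', so the proxy never sees them.
        @Override
        public String chargeTwice() {
            return charge() + "+" + this.charge();
        }
    }

    /** Wraps the target in a JDK proxy that counts every call it intercepts. */
    static PaymentApi proxied(PaymentApi target, AtomicInteger intercepted) {
        return (PaymentApi) Proxy.newProxyInstance(
                PaymentApi.class.getClassLoader(),
                new Class<?>[] { PaymentApi.class },
                (proxy, method, args) -> {
                    intercepted.incrementAndGet(); // breaker logic would run here
                    return method.invoke(target, args);
                });
    }
}
```

Calling chargeTwice() through the proxy is intercepted once, for the outer call only; the two inner charge() calls run directly on the target object.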

Self-Invocation Bypasses Spring AOP Circuit Breaker
Calling a @CircuitBreaker-annotated method from within the same class using 'this.method()' bypasses the Spring AOP proxy and the circuit breaker does NOT activate. Either inject the bean into itself (@Autowired private PaymentService self, adding @Lazy if Spring reports a circular reference) or refactor the method into a separate Spring bean that is injected.
Production Insight
Do not use @CircuitBreaker fallbacks to silently succeed on write operations (payments, order creation) — failed writes must surface as errors so users know to retry and systems don't lose data.
Key Takeaway
@CircuitBreaker fallback methods must have the same parameters plus a Throwable; multiple overloads with specific exception types allow fine-grained fallback logic.

COUNT_BASED vs TIME_BASED Sliding Windows

Resilience4j supports two sliding window algorithms: COUNT_BASED and TIME_BASED. Choosing the wrong one for your traffic pattern is one of the most common circuit breaker configuration mistakes.

COUNT_BASED (default) uses a circular array of the last N call outcomes. A sliding-window-size of 20 means the circuit breaker evaluates the most recent 20 calls. This is efficient (O(N) memory, O(1) computation per call) and reacts quickly to failure bursts. The limitation: on low-traffic services (5 calls per minute), 20 calls represent 4 minutes of history, so a failure burst at minute 1 stays in the window until minute 5. On high-traffic services, 20 calls cover only milliseconds of history, which can make the breaker too reactive.

TIME_BASED divides time into N one-second buckets and maintains a circular array of per-bucket aggregates. A sliding-window-size of 60 evaluates calls from the last 60 seconds. This provides consistent time-based semantics regardless of traffic volume. The limitation is sparse traffic: at 1 call per second, a 60-second window holds about 60 calls, which is enough for meaningful statistics, but at 1 call per minute the same window rarely holds enough data to compute a meaningful failure rate.
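
A time-based window can be sketched as one aggregate bucket per second, reused in a ring. The class below is an illustrative simplification with hypothetical names: it sums the buckets on every read, whereas a production implementation would more likely keep a running total.

```java
import java.util.Arrays;

/** Simplified TIME_BASED window: one aggregate bucket per second, reused in a ring. */
final class TimeBasedWindowSketch {
    private final long[] bucketSecond; // epoch second each bucket currently represents
    private final int[] calls;
    private final int[] failures;

    TimeBasedWindowSketch(int windowSeconds) {
        bucketSecond = new long[windowSeconds];
        calls = new int[windowSeconds];
        failures = new int[windowSeconds];
        Arrays.fill(bucketSecond, -1L);    // -1 marks a bucket that was never used
    }

    void record(long epochSecond, boolean isFailure) {
        int i = (int) (epochSecond % bucketSecond.length);
        if (bucketSecond[i] != epochSecond) {
            bucketSecond[i] = epochSecond; // bucket held an older second: reuse it
            calls[i] = 0;
            failures[i] = 0;
        }
        calls[i]++;
        if (isFailure) {
            failures[i]++;
        }
    }

    /** Failure rate in percent over the window ending at nowSecond, or -1 if it holds no calls. */
    float failureRate(long nowSecond) {
        int total = 0;
        int failed = 0;
        for (int i = 0; i < bucketSecond.length; i++) {
            boolean used = bucketSecond[i] >= 0;
            if (used && nowSecond - bucketSecond[i] < bucketSecond.length) {
                total += calls[i];
                failed += failures[i];
            }
        }
        return total == 0 ? -1f : 100f * failed / total;
    }
}
```

Old outcomes drop out because time passes, not because new calls push them out, which is why a quiet service can end up with a nearly empty window.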

For high-traffic services (100+ RPS): use COUNT_BASED with window size 100-200. For medium-traffic services (1-100 RPS): either works; COUNT_BASED with size 50 is a safe default. For low-traffic services (<1 RPS): use TIME_BASED with a longer window (300 seconds) and set minimum-number-of-calls to match the volume you expect within that window. Services with highly variable traffic (batch jobs, event-driven) should use TIME_BASED.

The minimum-number-of-calls setting acts as a guard — the failure rate is only evaluated after this many calls have been recorded. Set it to at least 10-20 to avoid opening the circuit on a single burst of test failures.

Set minimum-number-of-calls for Low-Traffic Services
If minimum-number-of-calls is set to 1, a single failed request on a low-traffic service (100% failure rate on 1 call) opens the circuit. Leaving it at the default of 100 has the opposite problem on low-traffic services: the rate may never be evaluated. Set it to a value that represents a statistically meaningful sample: at least 10 for COUNT_BASED, or 5 for TIME_BASED with a 2+ minute window.
Production Insight
Use COUNT_BASED for services processing 10+ requests per second; use TIME_BASED for batch endpoints and services with highly variable traffic to avoid stale window data.
Key Takeaway
COUNT_BASED is reactive and memory-efficient; TIME_BASED provides consistent temporal semantics — choose based on your traffic pattern and tune minimum-number-of-calls to avoid false opens.

Bulkhead and TimeLimiter as Circuit Breaker Companions

Circuit breakers work best when combined with two other Resilience4j patterns: TimeLimiter and Bulkhead. Without them, even a well-tuned circuit breaker can be circumvented by slow calls that never fail (they just take forever) or by too many concurrent calls exhausting thread pools.

TimeLimiter wraps calls with a hard timeout. When the timeout expires, a TimeoutException is thrown, which counts as a failure in the circuit breaker's sliding window. This is essential for preventing thread pool exhaustion from slow dependencies — without a timeout, threads block indefinitely waiting for responses. Configure timeout-duration to be slightly above your P99 SLA target (not P99 of the downstream service, but your user-facing SLA).

Bulkhead limits concurrent calls to a downstream service. There are two implementations: SemaphoreBulkhead (limits concurrent calls with a semaphore, synchronous) and ThreadPoolBulkhead (uses a separate thread pool, enabling async execution). SemaphoreBulkhead rejects calls that would exceed the concurrency limit with BulkheadFullException immediately — this fast rejection prevents thread pool exhaustion in the calling service. ThreadPoolBulkhead offloads calls to a dedicated thread pool, isolating the calling service's thread pool from downstream slowness.

The combination of CircuitBreaker + TimeLimiter + Bulkhead provides defense-in-depth: Bulkhead prevents too many concurrent calls (fast rejection), TimeLimiter prevents calls from taking too long (timeout), CircuitBreaker prevents calling the service at all when it's degraded (fail fast). Use @CircuitBreaker + @TimeLimiter + @Bulkhead annotations on the same method (note that @TimeLimiter only applies to methods with asynchronous return types such as CompletableFuture or a reactive type), or compose them programmatically via Resilience4j decorators.
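
The bulkhead and timeout halves of that stack can be sketched with standard-library primitives. This is an illustration of the two ideas under simplifying assumptions, not how Resilience4j implements them: the class name is hypothetical, and RejectedExecutionException stands in for BulkheadFullException.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

/** Semaphore-style bulkhead combined with a hard timeout. */
final class BulkheadTimeoutSketch {
    private final Semaphore permits;   // stand-in for max-concurrent-calls
    private final long timeoutMillis;  // stand-in for timeout-duration

    BulkheadTimeoutSketch(int maxConcurrentCalls, long timeoutMillis) {
        this.permits = new Semaphore(maxConcurrentCalls);
        this.timeoutMillis = timeoutMillis;
    }

    /** Rejects immediately when the bulkhead is full; otherwise bounds the wait. */
    <T> T call(Supplier<T> downstream, ExecutorService pool)
            throws InterruptedException, ExecutionException, TimeoutException {
        if (!permits.tryAcquire()) {
            throw new RejectedExecutionException("bulkhead full"); // fail fast, no queueing
        }
        Callable<T> task = downstream::get;
        Future<T> future = pool.submit(task);
        try {
            // Throws TimeoutException when the downstream call is too slow.
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } finally {
            future.cancel(true);       // interrupts the worker if it is still running
            permits.release();
        }
    }
}
```

A circuit breaker wrapped around this call would record the TimeoutException as a failure, which is how persistently slow dependencies end up opening the circuit.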

SemaphoreBulkhead Is Synchronous, ThreadPoolBulkhead Is Async
SemaphoreBulkhead (Bulkhead.Type.SEMAPHORE) limits concurrency but the calling thread still does the work. ThreadPoolBulkhead (Bulkhead.Type.THREADPOOL) offloads the work to a separate thread pool, completely isolating your service's thread pool from downstream slowness. Use THREADPOOL for I/O-bound downstream calls in synchronous Servlet-based applications.
Production Insight
Size max-concurrent-calls in the bulkhead to roughly 50% of your downstream service's documented concurrent request limit; this leaves headroom for other callers.
Key Takeaway
The production-grade resilience stack is CircuitBreaker + TimeLimiter + Bulkhead; each addresses a different failure mode: state-based protection, timeout enforcement, and concurrency limiting.

Actuator Monitoring: /actuator/circuitbreakers Endpoint

Spring Boot Actuator exposes Resilience4j circuit breaker state and metrics through dedicated endpoints when resilience4j-spring-boot3 and the actuator dependency are on the classpath. The primary endpoint is /actuator/circuitbreakers which returns the current state, metrics, and configuration for all registered circuit breakers.

Key metrics from /actuator/circuitbreakers: state (CLOSED/OPEN/HALF_OPEN), failureRate (percentage), slowCallRate (percentage), numberOfBufferedCalls, numberOfFailedCalls, numberOfSlowCalls, numberOfNotPermittedCalls (rejected by OPEN circuit). These metrics provide real-time operational insight without needing to wait for a Prometheus scrape.

The /actuator/circuitbreakerevents endpoint provides a log of recent events: SUCCESS, ERROR, IGNORED_ERROR, NOT_PERMITTED (rejected), STATE_TRANSITION, FAILURE_RATE_EXCEEDED, and SLOW_CALL_RATE_EXCEEDED. You can filter by circuit breaker name and event type using path segments (/actuator/circuitbreakerevents/{name}/{eventType}). This is invaluable for diagnosing intermittent failures, because you can see the exact sequence of events that led to a circuit opening.

For Prometheus-based monitoring, add the resilience4j-micrometer and micrometer-registry-prometheus dependencies and expose the prometheus Actuator endpoint. Key Prometheus metrics: resilience4j_circuitbreaker_state (one gauge per state, distinguished by a state tag, with value 1 for the current state and 0 otherwise), resilience4j_circuitbreaker_failure_rate (gauge, percentage), resilience4j_circuitbreaker_calls (timer tagged by kind: successful, failed, ignored), and resilience4j_circuitbreaker_not_permitted_calls (counter). Exact metric suffixes depend on the Micrometer version. Create dashboards showing failure rate trends over time and alert on sustained failure rate above threshold.

Wire Circuit Breaker State to /actuator/health
Set register-health-indicator: true and allow-health-indicator-to-fail: true in circuit breaker config, and enable the indicator with management.health.circuitbreakers.enabled: true. An OPEN circuit breaker then causes the /actuator/health endpoint to return DOWN. If your Kubernetes readiness probe is driven by that health status, the probe fails and the pod is removed from load balancer rotation. This prevents cascading failures at the infrastructure level.
Production Insight
Create a Grafana dashboard showing failure rate, slow call rate, and state transitions over time for each circuit breaker; the trend before an open event reveals whether it's a true service degradation or a configuration issue.
Key Takeaway
Use /actuator/circuitbreakers for real-time state, /actuator/circuitbreakerevents for historical event analysis, and Prometheus metrics for trending and alerting.

Resilience4j Configuration Reference and Tuning Guide

Correct threshold tuning requires understanding your services' baseline performance characteristics. Start by measuring your P50, P95, P99, and P999 response times and error rates under normal load. The circuit breaker thresholds should trip when the service deviates significantly from baseline — not on normal variance.

Failure rate threshold: Start at 50% for most services. For critical services (payment, auth) where any degradation is unacceptable, lower to 30%. For batch or background services where partial failures are tolerable, raise to 70%. Track the false positive rate — if the circuit opens more than once per week without a genuine downstream issue, the threshold is too sensitive.

Slow call threshold: Set slow-call-duration-threshold to 2-3x your P99 response time. If P99 is 300ms, set the threshold to 600-900ms. Set slow-call-rate-threshold to 60-70% so that the circuit opens only when the majority of calls are slow, not on occasional P999 events.

Wait duration in open state: This determines how long the circuit stays OPEN before trying again. It should be long enough for the downstream service to recover, but not so long that you miss recovery. 30 seconds is a reasonable default; for services with known recovery patterns (database failover takes 45 seconds), set it accordingly.

Permitted calls in HALF_OPEN: Set to 5-10 for a statistically meaningful sample. With 1 permitted call, a single recovered response closes the circuit; if the service is intermittently recovering, this causes rapid open-close cycling.
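
The starting points above could be captured in a shared template, sketched here as an application.yml fragment. The template name internal-service and the instance names are illustrative, and the numbers are the guide's suggested defaults for a service with a P99 of about 300ms, not measured values.

```yaml
resilience4j:
  circuitbreaker:
    configs:
      internal-service:                  # illustrative template name
        sliding-window-type: COUNT_BASED
        sliding-window-size: 100
        minimum-number-of-calls: 20
        failure-rate-threshold: 50
        slow-call-duration-threshold: 900ms
        slow-call-rate-threshold: 60
        wait-duration-in-open-state: 30s
        permitted-number-of-calls-in-half-open-state: 5
    instances:
      inventoryClient:                   # illustrative instance name
        base-config: internal-service
      paymentService:                    # critical service: trips earlier
        base-config: internal-service
        failure-rate-threshold: 30
```

Each instance inherits everything from the template and overrides only what differs, so the payment breaker here changes a single threshold.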

Use base-config to Avoid Configuration Duplication
Resilience4j supports base-config in the instances section to inherit from a config template and override specific properties. Define templates like 'internal-service' and 'critical-service' in the configs section, then reference them in instances. This eliminates copy-paste configuration errors and ensures consistent baseline settings across similar services.
Production Insight
Test circuit breaker behavior under realistic load before production: use the force-open endpoint to simulate OPEN state and verify fallbacks work correctly, then test HALF_OPEN recovery with controlled downstream failure injection.
Key Takeaway
Use base-config inheritance for consistent templates; tune thresholds from your measured P99 baseline; test all three circuit states in staging before production deployment.
● Production incident · POST-MORTEM · severity: high

Missing Slow-Call Threshold Caused 8-Minute Outage During Black Friday

Symptom
Order processing P99 latency climbed to 45 seconds during Black Friday peak. Payment processing appeared UP in monitoring (no failures), but checkout conversion dropped to 12% of normal. Thread pools at 95% saturation.
Assumption
The team assumed circuit breakers were protecting against the payment service — all circuit breakers showed CLOSED in Actuator because the payment service was returning 200 responses, just very slowly.
Root cause
The payment service was experiencing database connection pool exhaustion and responding in 15-20 seconds instead of its normal 500ms. Since responses eventually returned 200 OK, the failure-rate-threshold-based circuit breakers never opened. The order service thread pool filled with threads waiting for payment responses. No slow-call-rate-threshold was configured.
Fix
Added slow-call-duration-threshold: 2s and slow-call-rate-threshold: 60 to all payment service circuit breaker instances. Added a time limiter (timeout) of 3 seconds that throws TimeoutException (recorded as a failure). Added a bulkhead to limit concurrent payment calls to 50, preventing thread pool exhaustion even when the circuit is CLOSED.
Key lesson
  • Failure rate alone is insufficient for circuit breaking.
  • Slow services that eventually respond are just as dangerous as services that fail outright.
  • Always configure slow-call-duration-threshold and slow-call-rate-threshold alongside failure-rate-threshold, and add time limiters to bound maximum wait time.
Production debug guide · Symptom → root cause → fix · 5 entries
Symptom · 01
Circuit breaker is OPEN but never transitions to HALF_OPEN
Fix
Check the wait-duration-in-open-state configuration; the default is 60 seconds. If the downstream service recovers faster than this, you're leaving revenue on the table. Also check automatic-transition-from-open-to-half-open-enabled: when it is false (the default), the breaker only moves to HALF_OPEN when a call arrives after the wait has elapsed, so an idle breaker keeps reporting OPEN. Verify the state reported by GET /actuator/circuitbreakers. If the circuit is stuck, you can manually transition via the management API.
Symptom · 02
Circuit breaker opens on low traffic (5 calls failed out of 5)
Fix
The minimum-number-of-calls setting (default 100) specifies how many calls must be recorded before failure rate is evaluated. If this is set too low (e.g., 5), a single spike of failures opens the circuit prematurely. Check your resilience4j.circuitbreaker.instances.{name}.minimum-number-of-calls setting and increase it to at least 20-50 for a meaningful sample size. For very low-traffic services, use TIME_BASED sliding windows with a longer window duration.
Symptom · 03
BusinessException (404, 409) is being counted as a circuit breaker failure
Fix
By default, all exceptions increment the failure counter. Configure ignore-exceptions to exclude business exceptions that represent valid responses (not service health issues): ignore-exceptions: [com.example.ResourceNotFoundException, com.example.ValidationException]. A 404 means 'resource not found', not 'service is broken', so it should not count toward failure rate. Conversely, if you set record-exceptions explicitly, make sure TimeoutException and ConnectException are in the list, because only listed exceptions are then recorded as failures.
Symptom · 04
HALF_OPEN state immediately returns to OPEN on first probe call failure
Fix
The permitted-number-of-calls-in-half-open-state setting determines how many probe calls are made before a decision. If set to 1, a single failure reopens the circuit. Increase to 5-10 so the circuit evaluates a sample before deciding. Also check if the downstream service is actually recovering — use the circuitbreakerevents Actuator endpoint to see each call's outcome in HALF_OPEN state.
Symptom · 05
@CircuitBreaker fallback method not being called
Fix
Verify the fallback method signature exactly matches the protected method's parameters plus an additional Throwable parameter at the end. The fallback method must be in the same class as the protected method, and the protected method must be invoked through the Spring proxy, that is, from another bean rather than via this. Check that the class is a Spring-managed bean (not instantiated with new). The fallback method name must match exactly and is case-sensitive. Enable DEBUG logging for io.github.resilience4j to see AOP advice application.
★ Debug Cheat Sheet · Fast diagnosis commands for circuit breaker issues in production
Circuit breaker stuck in OPEN state
Immediate action
Check current state and timing via Actuator
Commands
curl -s http://your-service:8080/actuator/circuitbreakers | python3 -m json.tool
curl -s http://your-service:8080/actuator/circuitbreakerevents/PaymentService/STATE_TRANSITION | python3 -m json.tool
Fix now
Check wait-duration-in-open-state; if the downstream is confirmed healthy, manually transition the breaker (for example via transitionToHalfOpenState() on the CircuitBreaker instance)
Too many false circuit opens
Immediate action
Check failure rate and event history
Commands
curl -s http://your-service:8080/actuator/circuitbreakerevents/InventoryClient | python3 -m json.tool | grep '"type"'
curl -s 'http://your-service:8080/actuator/metrics/resilience4j.circuitbreaker.failure.rate?tag=name:InventoryClient'
Fix now
Increase minimum-number-of-calls to 20+ and add ignore-exceptions for 4xx business exceptions
Circuit breaker metrics not appearing in Prometheus
Immediate action
Verify Micrometer Resilience4j integration is on classpath
Commands
curl -s http://your-service:8080/actuator/metrics | python3 -m json.tool | grep resilience4j
curl -s http://your-service:8080/actuator/prometheus | grep 'resilience4j_circuitbreaker'
Fix now
Add the resilience4j-micrometer and micrometer-registry-prometheus dependencies; expose the prometheus endpoint via management.endpoints.web.exposure.include
Fallback not triggering in unit tests
Immediate action
Verify test is using Spring context with AOP enabled
Commands
grep -r 'CircuitBreakerRegistry\|@SpringBootTest\|@ExtendWith(SpringExtension' src/test/ | head -10
curl -s http://your-service:8080/actuator/health | python3 -m json.tool | grep -A5 circuitBreaker
Fix now
Use @SpringBootTest for integration tests; CircuitBreaker AOP requires Spring proxy — unit tests with plain new MyService() won't trigger it
COUNT_BASED vs TIME_BASED Sliding Windows
| Aspect | COUNT_BASED | TIME_BASED |
|---|---|---|
| Memory usage | Fixed (circular array of N outcomes) | Fixed (N one-second aggregate buckets) |
| Best for | High-traffic services (10+ RPS) | Low/variable-traffic services (<10 RPS) |
| Window definition | Last N calls | Calls in last N seconds |
| Staleness risk | Low on high traffic | Low (time-bounded) |
| Sparse traffic risk | Window may contain old data | Window may be nearly empty |
| Recommended size | 50-200 calls | 30-120 seconds |
| minimum-number-of-calls | 10-20% of window size | 5-10 calls |
| Configuration keys | sliding-window-type: COUNT_BASED, sliding-window-size: 100 | sliding-window-type: TIME_BASED, sliding-window-size: 60 |

Key takeaways

1. Always configure slow-call-duration-threshold alongside failure-rate-threshold; services that are slow but not failing are just as dangerous as services that fail outright
2. Use ignore-exceptions for business exceptions (404, 409, 422) so legitimate business logic doesn't trigger circuit opening
3. @CircuitBreaker only works through Spring AOP proxies; calling annotated methods via this.method() bypasses the circuit breaker entirely
4. The production resilience stack is CircuitBreaker + TimeLimiter + Bulkhead, each addressing a distinct failure mode: state-based protection, timeout enforcement, and concurrency limiting
5. Wire circuit breaker health to /actuator/health with register-health-indicator: true so OPEN circuits automatically remove pods from Kubernetes load balancer rotation

Common mistakes to avoid

6 patterns
×

Not configuring slow-call-duration-threshold

Symptom
Slow dependencies (responding in 15+ seconds) never trigger the circuit breaker because they return 200 OK, causing thread pool exhaustion
Fix
Always configure both failure-rate-threshold AND slow-call-rate-threshold + slow-call-duration-threshold; add TimeLimiter to bound maximum wait time
×

Calling @CircuitBreaker-annotated method via this.method()

Symptom
Circuit breaker never activates; fallback method never called; no circuit breaker metrics recorded for the method
Fix
Inject the Spring bean and call through the proxy, or use self-injection (@Autowired private MyService self) and call self.method()
×

Including business exceptions (404, 409, 422) in record-exceptions

Symptom
Circuit breaker opens during normal peak traffic when legitimate 404s (resource not found) push failure rate above threshold
Fix
Add business exceptions to ignore-exceptions; only record infrastructure failures (IOException, TimeoutException, 5xx HTTP errors) as circuit breaker failures
×

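Setting minimum-number-of-calls to 1 or leaving it untuned

Symptom
With a value of 1, the circuit opens after a single failed request (100% failure rate on 1 call), causing unnecessary outages during transient failures; with the untouched default of 100, a low-traffic service may never evaluate the failure rate at all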
Fix
Set minimum-number-of-calls to at least 10-20 for meaningful statistical evaluation before the failure rate triggers circuit opening
×

Not adding register-health-indicator: true

Symptom
Circuit breaker state is invisible to Kubernetes readiness probes; an OPEN circuit doesn't remove the pod from load balancer rotation
Fix
Set register-health-indicator: true and allow-health-indicator-to-fail: true; OPEN circuit breakers then surface as DOWN in /actuator/health
×

Using fallback to silently succeed on write operations when circuit is OPEN

Symptom
Users get a success response for a payment or order creation that never happened; data inconsistency and lost revenue
Fix
For write operations, fallbacks should throw a descriptive exception (PaymentServiceUnavailableException) with a user-facing retry message; never silently succeed
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 (JUNIOR): Explain the three states of a circuit breaker and what triggers transiti...
Q02 (SENIOR): What is the difference between failure-rate-threshold and slow-call-rate...
Q03 (SENIOR): How do you prevent business exceptions (404, validation errors) from cou...
Q04 (SENIOR): A circuit breaker opens on a service that's actually healthy. How would ...
Q05 (SENIOR): What happens when both @CircuitBreaker and @TimeLimiter are applied to t...
Q06 (SENIOR): When would you choose TIME_BASED over COUNT_BASED sliding windows?
Q07 (SENIOR): How do you test circuit breaker behavior in integration tests?
Q08 (SENIOR): What is the Bulkhead pattern and how does it complement the Circuit Brea...

Q01 of 08 · JUNIOR

Explain the three states of a circuit breaker and what triggers transitions between them.

ANSWER
CLOSED (normal): all calls pass through, outcomes are recorded in the sliding window. Transitions to OPEN when failure rate or slow call rate exceeds their configured thresholds after minimum-number-of-calls have been recorded. OPEN (protective): all calls fail immediately with CallNotPermittedException without calling the downstream service. Automatically transitions to HALF_OPEN after wait-duration-in-open-state. HALF_OPEN (probing): limited calls (permitted-number-of-calls-in-half-open-state) pass through to test recovery. Transitions back to CLOSED if their failure rate is below threshold; back to OPEN if above.
FAQ · 6 QUESTIONS

Frequently Asked Questions

01. Is Resilience4j compatible with Spring Boot 3.x?
02. Can I use @CircuitBreaker with reactive WebFlux methods?
03. How many circuit breakers should a microservice have?
04. What happens to in-flight requests when a circuit breaker transitions from CLOSED to OPEN?
05. How do I monitor circuit breaker events in production?
06. Can the circuit breaker be configured to automatically recover from OPEN to CLOSED?