Intermediate 8 min · May 23, 2026

Circuit Breaker Pattern with Spring Cloud and Resilience4j

Q: Is Resilience4j compatible with Spring Boot 3.x?

Yes. Use the resilience4j-spring-boot3 dependency (not resilience4j-spring-boot2). The Spring Cloud CircuitBreaker starter (spring-cloud-starter-circuitbreaker-resilience4j) includes the correct version for your Spring Boot version when using the Spring Cloud BOM.

Q: Can I use @CircuitBreaker with reactive WebFlux methods?

Yes, provided the method returns Mono or Flux. With resilience4j-reactor on the classpath, the Resilience4j @CircuitBreaker aspect recognizes reactive return types and applies the circuit breaker to the publisher rather than to the method call. If you use the Spring Cloud CircuitBreaker abstraction instead of the annotation, inject ReactiveCircuitBreakerFactory and call run(...) on the breaker it creates. Alternatively, apply CircuitBreakerOperator.of(circuitBreaker) in the reactive chain.

Q: How many circuit breakers should a microservice have?

One per dependency is the standard. If your service calls 5 downstream services, you should have 5 circuit breakers — one for each. Within a single downstream service, you might have multiple circuit breakers for different operation categories (read vs write) if they have different SLAs or if failures in one category should not affect the other. Avoid having a single circuit breaker for all outbound calls — it provides too coarse-grained protection.

Q: What happens to in-flight requests when a circuit breaker transitions from CLOSED to OPEN?

In-flight requests (already executing when the circuit opens) are not interrupted — they complete normally. The circuit breaker only affects new incoming calls. Requests that arrive after the transition immediately receive CallNotPermittedException. This means there's a brief overlap where some requests get through while the circuit is transitioning, which is acceptable behavior.

Q: How do I monitor circuit breaker events in production?

Use three complementary approaches: (1) /actuator/circuitbreakers for current state and real-time metrics; (2) /actuator/circuitbreakerevents for a historical event log with filtering by name and type; (3) Prometheus metrics (resilience4j_circuitbreaker_state, resilience4j_circuitbreaker_failure_rate) for trending and alerting in Grafana. For push-based alerting, register a consumer with circuitBreaker.getEventPublisher().onStateTransition(...), which receives a CircuitBreakerOnStateTransitionEvent, and send alerts when circuits open.

Q: Can the circuit breaker be configured to automatically recover from OPEN to CLOSED?

Not directly — it always goes through HALF_OPEN first. Set automatic-transition-from-open-to-half-open-enabled: true to make the transition to HALF_OPEN automatic after wait-duration-in-open-state. In HALF_OPEN, if the probe calls succeed, the circuit closes automatically. This means the full automatic recovery path is OPEN → (wait) → HALF_OPEN → (probe) → CLOSED, which is the correct behavior for validating that the downstream service is actually healthy.

Master the Circuit Breaker pattern with Spring Cloud Resilience4j: CLOSED/OPEN/HALF_OPEN states, @CircuitBreaker, sliding windows, failure thresholds, and Actuator monitoring.

Naren Founder & Principal Engineer

20+ years shipping production Java in banking & fintech. Written from production experience, not tutorials.

Last updated: July 04, 2026

⚡Quick Answer

Circuit breaker has three states: CLOSED (normal), OPEN (failing fast), HALF_OPEN (testing recovery)
Annotate methods with @CircuitBreaker(name = "cbName", fallbackMethod = "fallbackMethod") from Resilience4j's Spring Boot integration
Configure sliding window type (COUNT_BASED vs TIME_BASED), failure rate threshold, and slow call threshold in Resilience4j config
Monitor state and metrics via /actuator/circuitbreakers and /actuator/circuitbreakerevents
Listen to CircuitBreakerOnStateTransitionEvent for alerts and operational visibility

✦ Definition (~90s read)

What is Circuit Breaker Pattern with Resilience4j?

The Circuit Breaker pattern is a stability pattern that prevents cascading failures in distributed systems by monitoring call failure rates and short-circuiting calls when failures exceed a threshold. It maintains a state machine with three states: CLOSED (all calls pass through, failures are counted), OPEN (all calls fail immediately with a fallback, no downstream calls made), and HALF_OPEN (a limited number of probe calls are allowed through; if they succeed, the circuit closes; if they fail, it opens again).

Resilience4j implements this with a circular array of the last N call outcomes for COUNT_BASED sliding windows and a circular array of one-second buckets for TIME_BASED windows. Each recorded call outcome (success, failure, slow call) updates the window; ignored exceptions count as neither success nor failure. The failure rate is computed as failed calls / total calls; slow call rate as slow calls (exceeding slow-call-duration-threshold) / total calls.

When either rate reaches its threshold, the circuit opens. When the circuit closes again, it starts with a fresh window, so failures recorded before the outage no longer count.

Spring Cloud CircuitBreaker wraps Resilience4j with Spring-idiomatic configuration and a vendor-neutral factory API, while Resilience4j's own Spring Boot module supplies the AOP-based @CircuitBreaker annotation and the Spring Boot Actuator integration for runtime monitoring. Each circuit breaker is a named instance: the name is used as the key in configuration, metrics tags, and event logs, so naming them descriptively (after the target service or method) is important for operational clarity.

Plain-English First

A circuit breaker works exactly like the electrical circuit breaker in your home. When too many 'faults' happen (too many failing service calls), the breaker trips OPEN and stops sending requests to the failing service — just like a tripped breaker stops electricity to protect your appliances. After a cooldown, it goes HALF_OPEN and tries a few test requests; if they succeed, it closes back to normal.

Distributed systems fail in ways that monoliths never do. A microservice calling an inventory service that's responding in 30 seconds instead of 300 milliseconds will exhaust its thread pool in minutes, causing requests to queue up, then cascade failures to every caller upstream. Without a circuit breaker, one slow service can take down an entire platform.

The circuit breaker pattern was popularized by Michael Nygard in 'Release It!' and formalized for JVM microservices by Netflix with Hystrix. Netflix placed Hystrix in maintenance mode in 2018, and the Spring Cloud ecosystem migrated to Resilience4j, a lightweight, modular fault-tolerance library that implements circuit breaking, rate limiting, bulkhead isolation, retry, and timeout as composable decorators.

The production pain point that drives adoption is almost always a cascading failure event. A downstream service degrades, API calls start timing out at 10 seconds instead of 200 milliseconds, and thread pools fill up in seconds. Without circuit breaking, callers retry their requests, which adds more load to the already-struggling downstream, creating a positive feedback loop of failure. A circuit breaker breaks this loop by failing fast for a configurable period.

Spring Cloud CircuitBreaker provides a unified abstraction over Resilience4j (and optionally Sentinel, Spring Retry) with Spring Boot auto-configuration. The @CircuitBreaker annotation on Spring beans integrates with Spring AOP to wrap method calls with circuit breaker logic transparently. The fallbackMethod receives the exception so your fallback logic can distinguish circuit-open scenarios from genuine business errors.

Sliding window configuration is where teams most often make mistakes. COUNT_BASED windows make decisions based on the last N calls, which works well for high-traffic services but reacts slowly on low-traffic services. TIME_BASED windows evaluate calls in the last N seconds, which is more appropriate for services with variable traffic patterns but requires enough requests per second to generate meaningful statistics.

This guide covers every aspect of the circuit breaker pattern as implemented in Spring Cloud with Resilience4j, from basic annotation usage to advanced event-driven alerting and Actuator-based monitoring in production.

Circuit Breaker State Machine: CLOSED, OPEN, HALF_OPEN

Understanding the Resilience4j state machine is a prerequisite to correct configuration. Resilience4j defines six states: the three normal states CLOSED, OPEN, and HALF_OPEN, plus the special states DISABLED, FORCED_OPEN, and METRICS_ONLY. The first three drive normal operation.

CLOSED is normal operation. All calls pass through to the downstream service. Each call outcome is recorded in the sliding window. The failure rate and slow call rate are computed after minimum-number-of-calls have been recorded. If either rate exceeds its threshold, the circuit transitions to OPEN.

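OPEN is the protective state. Calls are rejected immediately with CallNotPermittedException without touching the downstream service, and the fallback method (if configured) runs instead. The circuit remains OPEN for wait-duration-in-open-state (default 60s). After that, the next incoming call moves it to HALF_OPEN; with automatic-transition-from-open-to-half-open-enabled: true, the transition happens on a timer even when no traffic arrives.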

HALF_OPEN is the probing state. A limited number of calls (permitted-number-of-calls-in-half-open-state) are allowed through to test if the downstream service has recovered. All other calls fail immediately (no waiting). After the permitted calls complete, if the failure rate is below the threshold, the circuit closes. If it's above, the circuit opens again and starts another wait period.
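The three operational states described above can be sketched with the standard library alone. This is a deliberately simplified illustration, not Resilience4j's implementation: the class name MiniBreaker and its parameters are hypothetical, it uses a cumulative counter instead of a sliding window, and it is single-threaded.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Simplified three-state breaker for illustration; not Resilience4j's implementation. */
class MiniBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int minCalls;          // calls needed before the rate is evaluated in CLOSED
    private final double thresholdPct;   // failure rate (percent) that opens the circuit
    private final long waitMillis;       // how long to stay OPEN before probing
    private final int probeCalls;        // calls evaluated in HALF_OPEN
    private final LongSupplier clock;    // injected millisecond clock keeps the sketch testable

    private State state = State.CLOSED;
    private int calls;
    private int failures;
    private long openedAt;

    MiniBreaker(int minCalls, double thresholdPct, long waitMillis, int probeCalls, LongSupplier clock) {
        this.minCalls = minCalls;
        this.thresholdPct = thresholdPct;
        this.waitMillis = waitMillis;
        this.probeCalls = probeCalls;
        this.clock = clock;
    }

    State state() { return state; }

    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (state == State.OPEN) {
            if (clock.getAsLong() - openedAt < waitMillis) {
                return fallback.get();        // fail fast, downstream untouched
            }
            transition(State.HALF_OPEN);      // wait elapsed: this call becomes a probe
        }
        try {
            T result = action.get();
            record(false);
            return result;
        } catch (RuntimeException e) {
            record(true);
            return fallback.get();
        }
    }

    private void record(boolean failed) {
        calls++;
        if (failed) failures++;
        int needed = (state == State.HALF_OPEN) ? probeCalls : minCalls;
        if (calls < needed) return;           // not enough data yet
        double rate = 100.0 * failures / calls;
        if (rate >= thresholdPct) {
            transition(State.OPEN);
        } else if (state == State.HALF_OPEN) {
            transition(State.CLOSED);         // probes succeeded
        }
    }

    private void transition(State next) {
        state = next;
        calls = 0;
        failures = 0;
        if (next == State.OPEN) openedAt = clock.getAsLong();
    }

    public static void main(String[] args) {
        long[] now = {0};
        MiniBreaker breaker = new MiniBreaker(4, 50.0, 1_000, 2, () -> now[0]);
        for (int i = 0; i < 4; i++) {
            breaker.call(() -> { throw new IllegalStateException("down"); }, () -> "fallback");
        }
        System.out.println(breaker.state());
        now[0] = 1_000;                       // simulate the wait duration elapsing
        breaker.call(() -> "ok", () -> "fallback");
        breaker.call(() -> "ok", () -> "fallback");
        System.out.println(breaker.state());
    }
}
```

Note that the move from OPEN to HALF_OPEN in this sketch happens only when a call arrives after the wait has elapsed, which mirrors the behaviour you get without automatic transition enabled.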

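DISABLED and FORCED_OPEN are manually set states for operational control, useful for maintenance windows or chaos engineering. They can be set programmatically by looking up the breaker in the CircuitBreakerRegistry and calling transitionToDisabledState() or transitionToForcedOpenState(), or through the Actuator management endpoint where your Resilience4j version exposes a state-update operation.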

State transition events are valuable for operational visibility. Register a consumer with circuitBreaker.getEventPublisher().onStateTransition(...), which receives a CircuitBreakerOnStateTransitionEvent, and fire alerts (PagerDuty, Slack) when circuits open. A circuit opening is a signal that requires investigation: either the downstream service is degraded or your circuit breaker thresholds are misconfigured.

DISABLED vs FORCED_OPEN for Operational Control

DISABLED turns the circuit breaker off: all calls pass through, and no metrics or call events are recorded. FORCED_OPEN rejects all calls, again without recording metrics. In both states only the state transition itself is published as an event. Use FORCED_OPEN for maintenance windows when you need to stop traffic to a service. Use DISABLED only for emergency debugging when you need to rule out the circuit breaker as a cause of issues. If you want calls to pass through while metrics keep being recorded, METRICS_ONLY is the state for that.

Production Insight

A circuit breaker opening is a production incident signal — wire it to your alerting system and require a human review before the circuit closes rather than relying solely on automatic HALF_OPEN testing.

Key Takeaway

The three operational states (CLOSED/OPEN/HALF_OPEN) form a state machine; listen to state transitions for alerting and use FORCED_OPEN for planned maintenance windows.

thecodeforge.io

Spring Cloud Circuit Breaker

@CircuitBreaker Annotation and Fallback Methods

The @CircuitBreaker annotation (io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker, provided by Resilience4j's Spring Boot module) wraps the annotated method in a Resilience4j circuit breaker via Spring AOP. The name attribute must match a configured instance in resilience4j.circuitbreaker.instances. The fallbackMethod attribute specifies the name of a method in the same class that returns the same type.

Fallback method signatures must include all parameters of the original method plus a Throwable parameter as the last argument. The Throwable receives the exception that caused the fallback to trigger — this is critical for distinguishing between circuit-open scenarios (CallNotPermittedException) and actual downstream failures (FeignException, ConnectException). A fallback for a circuit-open scenario should return cached data; a fallback for a genuine 503 should return an appropriate error response or propagate the exception.

You can define several fallback overloads for different exception types. If the fallback methods share a name and differ in the exception type of the last parameter (IOException fallback, TimeoutException fallback, Throwable fallback), Resilience4j selects the most specific matching method. This allows fine-grained fallback logic without a big switch statement in a single fallback method.
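To make "most specific matching method" concrete, here is a small stdlib-only sketch of that selection rule. It is an illustration under assumptions, not how Resilience4j resolves fallbacks internally: FallbackResolver and its methods are hypothetical names, and handlers are registered by exception class instead of being discovered from method overloads.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeoutException;
import java.util.function.Function;

/** Picks the handler registered for the closest superclass of the thrown exception. */
class FallbackResolver {
    private final Map<Class<? extends Throwable>, Function<Throwable, String>> handlers = new HashMap<>();

    FallbackResolver on(Class<? extends Throwable> type, Function<Throwable, String> handler) {
        handlers.put(type, handler);
        return this;
    }

    String resolve(Throwable t) {
        // Walk from the concrete exception class up towards Throwable;
        // the first registered handler found is the most specific match.
        for (Class<?> c = t.getClass(); c != null; c = c.getSuperclass()) {
            Function<Throwable, String> handler = handlers.get(c);
            if (handler != null) return handler.apply(t);
        }
        throw new IllegalStateException("no fallback registered for " + t.getClass(), t);
    }

    public static void main(String[] args) {
        FallbackResolver resolver = new FallbackResolver()
            .on(IOException.class, t -> "io-fallback")
            .on(TimeoutException.class, t -> "timeout-fallback")
            .on(Throwable.class, t -> "generic-fallback");
        System.out.println(resolver.resolve(new FileNotFoundException("missing")));
        System.out.println(resolver.resolve(new TimeoutException("slow")));
        System.out.println(resolver.resolve(new IllegalArgumentException("bad input")));
    }
}
```

The design point is that a catch-all Throwable handler acts as the safety net, while narrower handlers take precedence whenever they apply.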

AOP limitations are the most common source of confusion: @CircuitBreaker only works when the call goes through a Spring proxy. Calling an annotated method from within the same class (this.protectedMethod()) bypasses the AOP proxy and the circuit breaker doesn't activate. The calling code must inject the Spring bean and call the method through the injected reference, or use self-injection.

Self-Invocation Bypasses Spring AOP Circuit Breaker

Calling a @CircuitBreaker-annotated method from within the same class using this.method() bypasses the Spring AOP proxy and the circuit breaker does NOT activate. Either inject the bean into itself (@Autowired private PaymentService self, adding @Lazy if your Spring Boot version rejects the circular reference) or refactor the method into a separate Spring bean that is injected.
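The proxy bypass can be reproduced without Spring. The sketch below uses a JDK dynamic proxy as a stand-in for the Spring AOP proxy; SelfInvocationDemo, PaymentOps, and the method names are hypothetical, and the interceptor only records which calls it saw.

```java
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

class SelfInvocationDemo {
    interface PaymentOps {
        String charge();
        String chargeAll();
    }

    static class PaymentService implements PaymentOps {
        public String charge() { return "charged"; }
        // this.charge() goes straight to the target object, never through the proxy
        public String chargeAll() { return this.charge(); }
    }

    // Names of the methods the proxy intercepted (where a breaker would have been applied)
    static final List<String> intercepted = new ArrayList<>();

    static PaymentOps proxied(PaymentOps target) {
        return (PaymentOps) Proxy.newProxyInstance(
            PaymentOps.class.getClassLoader(),
            new Class<?>[] { PaymentOps.class },
            (proxy, method, args) -> {
                intercepted.add(method.getName());
                return method.invoke(target, args);
            });
    }

    public static void main(String[] args) {
        PaymentOps service = proxied(new PaymentService());
        service.chargeAll();
        // Only the outer call is listed; the inner charge() was never intercepted.
        System.out.println(intercepted);
    }
}
```

Calling charge() through the proxy reference is intercepted as expected, which is why self-injection or a separate bean fixes the problem.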

Production Insight

Do not use @CircuitBreaker fallbacks to silently succeed on write operations (payments, order creation) — failed writes must surface as errors so users know to retry and systems don't lose data.

Key Takeaway

@CircuitBreaker fallback methods must have the same parameters plus a Throwable; multiple overloads with specific exception types allow fine-grained fallback logic.

COUNT_BASED vs TIME_BASED Sliding Windows

Resilience4j supports two sliding window algorithms: COUNT_BASED and TIME_BASED. Choosing the wrong one for your traffic pattern is one of the most common circuit breaker configuration mistakes.

COUNT_BASED (default) uses a circular array of the last N call outcomes. A sliding-window-size of 20 means the circuit breaker evaluates the most recent 20 calls. This is efficient (O(N) memory for the window, O(1) work per call because totals are updated incrementally) and reacts quickly to failure bursts. The limitation: on low-traffic services (5 calls per minute), 20 calls represents 4 minutes of history, so a failure burst at minute 1 stays in the window until minute 5. On high-traffic services, 20 calls can be a fraction of a second of history, which may be too reactive.
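A count-based window of this kind can be sketched in a few lines. This is a minimal illustration of the circular-array idea, assuming a hypothetical CountWindow class; it tracks only failures, not slow calls, and is not thread-safe.

```java
/** Circular array of the last N outcomes with an incrementally maintained failure count. */
class CountWindow {
    private final boolean[] failed;   // failed[i] is true if that recorded call failed
    private int next;                 // index that the next outcome will overwrite
    private int size;                 // number of outcomes recorded so far, capped at N
    private int failures;             // running total of failures inside the window

    CountWindow(int n) {
        failed = new boolean[n];
    }

    void record(boolean isFailure) {
        if (size == failed.length) {
            if (failed[next]) failures--;   // evict the oldest outcome
        } else {
            size++;
        }
        failed[next] = isFailure;
        if (isFailure) failures++;
        next = (next + 1) % failed.length;
    }

    double failureRate() {
        return size == 0 ? 0.0 : 100.0 * failures / size;
    }

    public static void main(String[] args) {
        CountWindow window = new CountWindow(4);
        window.record(true);
        window.record(true);
        window.record(false);
        window.record(false);
        System.out.println(window.failureRate());
        window.record(false);               // pushes the oldest failure out of the window
        System.out.println(window.failureRate());
    }
}
```

Each record call touches one array slot and one counter, which is where the constant per-call cost comes from.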

TIME_BASED divides time into N one-second buckets and maintains a circular array of per-bucket aggregates. A sliding-window-size of 60 evaluates calls from the last 60 seconds. This provides consistent time-based semantics regardless of traffic volume. The limitation is sample size: at 1 call per second, a 60-second window holds about 60 calls, which is enough for meaningful statistics, but at 1 call per minute the same window rarely has enough data to compute a meaningful failure rate.
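The bucket idea can be sketched as follows. TimeWindow is a hypothetical name, the caller passes the current time in whole seconds (assumed non-negative) so the example stays deterministic, and the sketch recomputes totals on read instead of maintaining them incrementally as a production implementation would.

```java
import java.util.Arrays;

/** One bucket per second in a circular array; stale buckets are reset when reused. */
class TimeWindow {
    private final long[] bucketSecond;  // which epoch second each bucket currently holds
    private final int[] calls;
    private final int[] failures;

    TimeWindow(int seconds) {
        bucketSecond = new long[seconds];
        calls = new int[seconds];
        failures = new int[seconds];
        Arrays.fill(bucketSecond, -1);  // -1 marks a bucket that has never been used
    }

    void record(long nowSec, boolean isFailure) {
        int i = (int) (nowSec % calls.length);
        if (bucketSecond[i] != nowSec) {  // bucket belongs to an older second: reuse it
            bucketSecond[i] = nowSec;
            calls[i] = 0;
            failures[i] = 0;
        }
        calls[i]++;
        if (isFailure) failures[i]++;
    }

    double failureRate(long nowSec) {
        int totalCalls = 0;
        int totalFailures = 0;
        for (int i = 0; i < calls.length; i++) {
            boolean inWindow = bucketSecond[i] >= 0
                && bucketSecond[i] <= nowSec
                && nowSec - bucketSecond[i] < calls.length;
            if (inWindow) {
                totalCalls += calls[i];
                totalFailures += failures[i];
            }
        }
        return totalCalls == 0 ? 0.0 : 100.0 * totalFailures / totalCalls;
    }

    public static void main(String[] args) {
        TimeWindow window = new TimeWindow(3);
        window.record(10, true);
        window.record(11, false);
        System.out.println(window.failureRate(11));
        System.out.println(window.failureRate(13)); // second 10 has aged out of the window
    }
}
```

Unlike the count-based window, old failures disappear here simply because time passes, even if no new calls arrive.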

For high-traffic services (100+ RPS): use COUNT_BASED with window size 100-200. For medium-traffic services (1-100 RPS): either works; COUNT_BASED with size 50 is a safe default. For low-traffic services (<1 RPS): use TIME_BASED with a longer window (300 seconds) and set minimum-number-of-calls to match the expected volume in that window. Services with highly variable traffic (batch jobs, event-driven) should use TIME_BASED.

The minimum-number-of-calls setting acts as a guard — the failure rate is only evaluated after this many calls have been recorded. Set it to at least 10-20 to avoid opening the circuit on a single burst of test failures.

Set minimum-number-of-calls for Low-Traffic Services

If minimum-number-of-calls is set very low (for example 1), a single failed request on a low-traffic service (100% failure rate on 1 call) opens the circuit. The default is 100, which a low-traffic service may never reach within its window, so set it explicitly to a value that represents a statistically meaningful sample: at least 10 for COUNT_BASED, or 5 for TIME_BASED with a 2+ minute window.

Production Insight

Use COUNT_BASED for services processing 10+ requests per second; use TIME_BASED for batch endpoints and services with highly variable traffic to avoid stale window data.

Key Takeaway

COUNT_BASED is reactive and memory-efficient; TIME_BASED provides consistent temporal semantics — choose based on your traffic pattern and tune minimum-number-of-calls to avoid false opens.

Bulkhead and TimeLimiter as Circuit Breaker Companions

Circuit breakers work best when combined with two other Resilience4j patterns: TimeLimiter and Bulkhead. Without them, even a well-tuned circuit breaker can be circumvented by slow calls that never fail (they just take forever) or by too many concurrent calls exhausting thread pools.

TimeLimiter wraps calls with a hard timeout. When the timeout expires, a TimeoutException is thrown, which counts as a failure in the circuit breaker's sliding window. This is essential for preventing thread pool exhaustion from slow dependencies — without a timeout, threads block indefinitely waiting for responses. Configure timeout-duration to be slightly above your P99 SLA target (not P99 of the downstream service, but your user-facing SLA).
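The timeout behaviour can be illustrated with CompletableFuture from the standard library (Java 9 or later for orTimeout). This is a sketch of the idea, not the Resilience4j TimeLimiter; TimeLimitSketch and callWithTimeout are hypothetical names.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

class TimeLimitSketch {
    /**
     * Runs the action on another thread and waits at most timeoutMillis for it.
     * On timeout, join() throws CompletionException whose cause is a TimeoutException.
     * The abandoned task is not interrupted; it keeps running in the background.
     */
    static <T> T callWithTimeout(Supplier<T> action, long timeoutMillis) {
        return CompletableFuture.supplyAsync(action)
            .orTimeout(timeoutMillis, TimeUnit.MILLISECONDS)
            .join();
    }

    public static void main(String[] args) {
        System.out.println(callWithTimeout(() -> "fast response", 1_000));
        try {
            callWithTimeout(() -> {
                try {
                    Thread.sleep(2_000);      // simulated slow dependency
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                return "late response";
            }, 50);
        } catch (CompletionException e) {
            // A breaker wrapped around this call would record the timeout as a failure.
            System.out.println("timed out: " + (e.getCause() instanceof TimeoutException));
        }
    }
}
```

The caller is released at the deadline, which is the property that protects its thread pool; the slow work itself still has to finish or be cancelled separately.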

Bulkhead limits concurrent calls to a downstream service. There are two implementations: SemaphoreBulkhead (limits concurrent calls with a semaphore, synchronous) and ThreadPoolBulkhead (uses a separate thread pool, enabling async execution). SemaphoreBulkhead rejects calls that would exceed the concurrency limit with BulkheadFullException immediately — this fast rejection prevents thread pool exhaustion in the calling service. ThreadPoolBulkhead offloads calls to a dedicated thread pool, isolating the calling service's thread pool from downstream slowness.
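The semaphore variant is simple enough to sketch with java.util.concurrent.Semaphore. SemaphoreBulkheadSketch and BulkheadFull are hypothetical names standing in for the Resilience4j bulkhead and its BulkheadFullException; the sketch rejects immediately instead of waiting for a permit.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

class SemaphoreBulkheadSketch {
    static class BulkheadFull extends RuntimeException {
        BulkheadFull() { super("bulkhead is full"); }
    }

    private final Semaphore permits;

    SemaphoreBulkheadSketch(int maxConcurrentCalls) {
        permits = new Semaphore(maxConcurrentCalls);
    }

    <T> T call(Supplier<T> action) {
        if (!permits.tryAcquire()) {   // no waiting: reject as soon as the limit is hit
            throw new BulkheadFull();
        }
        try {
            return action.get();       // the calling thread does the work
        } finally {
            permits.release();         // always return the permit
        }
    }

    public static void main(String[] args) {
        SemaphoreBulkheadSketch bulkhead = new SemaphoreBulkheadSketch(1);
        // The nested call arrives while the only permit is held, so it is rejected.
        String result = bulkhead.call(() -> {
            try {
                return bulkhead.call(() -> "inner call ran");
            } catch (BulkheadFull e) {
                return "inner call rejected";
            }
        });
        System.out.println(result);
    }
}
```

Releasing the permit in a finally block matters: a permit leaked on an exception would shrink the bulkhead permanently.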

The combination of CircuitBreaker + TimeLimiter + Bulkhead provides defense-in-depth: Bulkhead prevents too many concurrent calls (fast rejection), TimeLimiter prevents calls from taking too long (timeout), CircuitBreaker prevents calling the service at all when it's degraded (fail fast). Use @CircuitBreaker + @TimeLimiter + @Bulkhead annotations on the same method, or compose them programmatically via Resilience4j decorators. Note that @TimeLimiter applies only to methods with an asynchronous return type such as CompletableFuture or a reactive publisher; a plain blocking method cannot be timed out by the annotation.

SemaphoreBulkhead Is Synchronous, ThreadPoolBulkhead Is Async

SemaphoreBulkhead (Bulkhead.Type.SEMAPHORE) limits concurrency but the calling thread still does the work. ThreadPoolBulkhead (Bulkhead.Type.THREADPOOL) offloads the work to a separate thread pool, completely isolating your service's thread pool from downstream slowness. Use THREADPOOL for I/O-bound downstream calls in synchronous Servlet-based applications.

Production Insight

Size max-concurrent-calls in the bulkhead to roughly 50% of your downstream service's documented concurrent request limit; this leaves headroom for other callers.

Key Takeaway

The production-grade resilience stack is CircuitBreaker + TimeLimiter + Bulkhead; each addresses a different failure mode: state-based protection, timeout enforcement, and concurrency limiting.

Actuator Monitoring: /actuator/circuitbreakers Endpoint

Spring Boot Actuator exposes Resilience4j circuit breaker state and metrics through dedicated endpoints when resilience4j-spring-boot3 and the actuator dependency are on the classpath. The primary endpoint is /actuator/circuitbreakers which returns the current state, metrics, and configuration for all registered circuit breakers.

Key metrics from /actuator/circuitbreakers: state (CLOSED/OPEN/HALF_OPEN), failure rate and slow call rate (percentages), and the number of buffered, failed, slow, and not-permitted calls (calls rejected by an OPEN circuit). These metrics provide real-time operational insight without needing to wait for a Prometheus scrape.

The /actuator/circuitbreakerevents endpoint provides a historical log of individual events, with types such as SUCCESS, ERROR, IGNORED_ERROR, NOT_PERMITTED (rejected), and STATE_TRANSITION. You can filter by circuit breaker name and event type. This is invaluable for diagnosing intermittent failures: you can see the exact sequence of events that led to a circuit opening.

For Prometheus-based monitoring, add the resilience4j-micrometer dependency together with micrometer-registry-prometheus and expose the prometheus Actuator endpoint. Key Prometheus metrics: resilience4j_circuitbreaker_state (one gauge per state label, set to 1 for the current state and 0 otherwise), resilience4j_circuitbreaker_failure_rate (gauge, percentage), the call timer resilience4j_circuitbreaker_calls_seconds (tagged by kind: successful, failed, ignored), and resilience4j_circuitbreaker_not_permitted_calls_total (counter). Create dashboards showing failure rate trends over time and alert on sustained failure rate above threshold.

Wire Circuit Breaker State to /actuator/health

Set register-health-indicator: true and allow-health-indicator-to-fail: true in the circuit breaker config, and enable the indicator with management.health.circuitbreakers.enabled: true. An OPEN circuit breaker then causes the /actuator/health endpoint to return DOWN. If your Kubernetes readiness probe is based on that health status, the probe fails and the pod is removed from load balancer rotation. This prevents cascading failures at the infrastructure level.

Production Insight

Create a Grafana dashboard showing failure rate, slow call rate, and state transitions over time for each circuit breaker; the trend before an open event reveals whether it's a true service degradation or a configuration issue.

Key Takeaway

Use /actuator/circuitbreakers for real-time state, /actuator/circuitbreakerevents for historical event analysis, and Prometheus metrics for trending and alerting.

Resilience4j Configuration Reference and Tuning Guide

Correct threshold tuning requires understanding your services' baseline performance characteristics. Start by measuring your P50, P95, P99, and P999 response times and error rates under normal load. The circuit breaker thresholds should trip when the service deviates significantly from baseline — not on normal variance.

Failure rate threshold: Start at 50% for most services. For critical services (payment, auth) where any degradation is unacceptable, lower to 30%. For batch or background services where partial failures are tolerable, raise to 70%. Track the false positive rate — if the circuit opens more than once per week without a genuine downstream issue, the threshold is too sensitive.

Slow call threshold: Set slow-call-duration-threshold to 2-3x your P99 response time. If P99 is 300ms, set the threshold to 600-900ms. Set slow-call-rate-threshold to 60-70% so the circuit opens only when the majority of calls are slow, not on occasional P999 events.

Wait duration in open state: This determines how long the circuit stays OPEN before trying again. It should be long enough for the downstream service to recover, but not so long that you miss recovery. 30 seconds is a reasonable default; for services with known recovery patterns (database failover takes 45 seconds), set it accordingly.

Permitted calls in HALF_OPEN: Set to 5-10 for a statistically meaningful sample. With 1 permitted call, a single recovered response closes the circuit; if the service is intermittently recovering, this causes rapid open-close cycling.

Use base-config to Avoid Configuration Duplication

Resilience4j supports base-config in the instances section to inherit from a config template and override specific properties. Define templates like 'internal-service' and 'critical-service' in the configs section, then reference them in instances. This eliminates copy-paste configuration errors and ensures consistent baseline settings across similar services.

Production Insight

Test circuit breaker behavior under realistic load before production: use the force-open endpoint to simulate OPEN state and verify fallbacks work correctly, then test HALF_OPEN recovery with controlled downstream failure injection.

Key Takeaway

Use base-config inheritance for consistent templates; tune thresholds from your measured P99 baseline; test all three circuit states in staging before production deployment.

Why You Need an Abstraction Layer (and Why Hystrix Died)

Before Spring Cloud Circuit Breaker, you were locked into Netflix Hystrix. Want to swap to Resilience4j? Rewrite your entire codebase. That coupling cost teams weeks. Hystrix went into maintenance mode in 2018. Teams stuck on it couldn't migrate without surgery.

Spring Cloud Circuit Breaker fixes this. It's an abstraction: a thin API layer over implementations like Resilience4j, Spring Retry, or Sentinel, or even a stub for testing. Your service code calls CircuitBreakerFactory.create("name"). The implementation underneath can swap without touching a single business logic line.

This matters in production because circuit breaker libraries evolve. Resilience4j is the current king, but five years from now? Your abstraction lets you rip and replace without a rewrite. The Spring Boot auto-configuration detects which starter you've pulled in (e.g., spring-cloud-starter-circuitbreaker-resilience4j) and wires the factory bean automatically.

Don't code to a specific library. Code to the abstraction. Your future self debugging a production outage at 2 AM will thank you.

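AlbumService.java (Java)

// io.thecodeforge - java tutorial
import java.util.List;
import org.springframework.cloud.client.circuitbreaker.CircuitBreaker;
import org.springframework.cloud.client.circuitbreaker.CircuitBreakerFactory;
import org.springframework.core.ParameterizedTypeReference;
import org.springframework.http.HttpMethod;
import org.springframework.stereotype.Service;
import org.springframework.web.client.RestTemplate;

@Service
public class AlbumService {
    private final CircuitBreakerFactory<?, ?> factory;
    private final RestTemplate restTemplate; // assumes a RestTemplate bean is defined

    public AlbumService(CircuitBreakerFactory<?, ?> factory, RestTemplate restTemplate) {
        this.factory = factory;
        this.restTemplate = restTemplate;
    }

    public List<Album> fetchAlbums() {
        CircuitBreaker cb = factory.create("album-service");
        // run(call, fallback): the fallback receives the Throwable on failure or open circuit
        return cb.run(() -> restTemplate.exchange(
            "https://jsonplaceholder.typicode.com/albums", HttpMethod.GET, null,
            new ParameterizedTypeReference<List<Album>>() {}
        ).getBody(), throwable -> fallbackAlbums());
    }

    // Static default; a real fallback might read from a local cache instead
    private List<Album> fallbackAlbums() {
        return List.of();
    }
}

Output

On third-party failure or an open circuit: returns the fallback list (empty in this sketch) instead of propagating the exception.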

Production Trap:

Never put fallback logic into the lambda that touches the same failing dependency. That's just stacking failures. Fallbacks should be in-memory caches, static defaults, or calls to a completely isolated backup service.

Key Takeaway

Always use the CircuitBreakerFactory abstraction. Never import Resilience4j or any implementation class directly into your business logic.

Global vs. Specific Configuration — Stop Copy-Pasting Tuning

Default circuit breaker settings are for demos, not production. You need per-service tuning because a payment API and a product search have completely different tolerance for failure.

Spring Cloud lets you set global defaults in application.yml, then override per circuit breaker name. This is critical: your inventory-service circuit breaker might allow 5 failures in 10 seconds, while payment-gateway is more strict—2 failures in 30 seconds.

The WHY: One misconfigured circuit breaker can cascade and take down your whole app. A payment gateway that's too lenient will hammer a dead service, exhausting your thread pool and crashing unrelated endpoints. Rate limiting? That's a separate knob—don't confuse it with circuit breakers.

Global config applies when you haven't specified a name-specific block. The overrides are granular down to sliding window type, minimum calls, and wait duration. Use them. Every service is different.

application.yml (YAML)

# io.thecodeforge - java tutorial
resilience4j:
  circuitbreaker:
    configs:
      default:
        sliding-window-type: COUNT_BASED
        sliding-window-size: 10
        minimum-number-of-calls: 5
        failure-rate-threshold: 50
        wait-duration-in-open-state: 5s
        permitted-number-of-calls-in-half-open-state: 3
    instances:
      payment-gateway:
        base-config: default
        sliding-window-size: 20
        failure-rate-threshold: 40
        wait-duration-in-open-state: 10s
      inventory-service:
        base-config: default

Output

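payment-gateway circuit breaker: evaluated once 5 calls are recorded (minimum-number-of-calls inherited from default); opens when the failure rate reaches 40%, which is 8 failures over a full window of 20. Stays OPEN for 10 seconds. inventory-service: uses the defaults (50% threshold over a 10-call window, 5-second wait).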
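The arithmetic behind those numbers is worth checking by hand when you tune thresholds. The helper below is a hypothetical illustration (ThresholdMath is not a Resilience4j class) and assumes the circuit opens when the failure rate is greater than or equal to the threshold.

```java
class ThresholdMath {
    /** Smallest number of failures among recordedCalls whose rate reaches thresholdPct. */
    static int failuresToOpen(int recordedCalls, double thresholdPct) {
        return (int) Math.ceil(recordedCalls * thresholdPct / 100.0);
    }

    public static void main(String[] args) {
        // payment-gateway: full window of 20 calls at a 40% threshold
        System.out.println(failuresToOpen(20, 40));
        // payment-gateway right after minimum-number-of-calls (5) is reached
        System.out.println(failuresToOpen(5, 40));
        // inventory-service: full window of 10 calls at a 50% threshold
        System.out.println(failuresToOpen(10, 50));
    }
}
```

The second case is the one teams tend to overlook: with only 5 calls recorded, 2 failures are already enough to reach 40%, so a small minimum-number-of-calls makes the breaker far more sensitive than the full-window figure suggests.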

Production Trap:

Setting wait-duration-in-open-state too low (e.g., 1 second) causes a thundering herd on HALF_OPEN. The circuit opens, immediately tries again, fails, opens again. This oscillation trashes your logs and kills performance. Minimum 5 seconds for most services.

Key Takeaway

Global config for safety net. Specific config for critical services. Always set wait-duration-in-open-state to at least 5 seconds.

RateLimiter and Retry — Don't Bolt Them On, Configure Them

Newcomers glue @RateLimiter onto a method that already has @CircuitBreaker. That's not composition—that's chaos. Resilience4j modules are designed to stack, but the order matters.

First, understand WHY each exists: RateLimiter rejects requests that exceed a rate (protects your service). Retry tries again on transient failures (protects the caller). Circuit breaker opens the circuit on systemic failures (protects the downstream).

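Resilience4j lets you stack the annotations on one method: @Retry(name="x") + @CircuitBreaker(name="x") + @RateLimiter(name="x"). By default the aspects nest as Retry ( CircuitBreaker ( RateLimiter ( method ) ) ), so Retry is the outermost layer and every retry attempt passes through the circuit breaker and is recorded separately. The order is controlled by the aspect-order properties (for example resilience4j.retry.retry-aspect-order and resilience4j.circuitbreaker.circuit-breaker-aspect-order), not by where the annotations appear on the method. This matters in production because three retries against a failing dependency put three failures into the sliding window and triple the load on the downstream.

Configuration tip: reusing one instance name across modules keeps configuration and metrics easy to correlate, but each module still reads its own section (resilience4j.retry.instances.x, resilience4j.circuitbreaker.instances.x). Put the fallback on one annotation only. With the default order, a fallback on @CircuitBreaker returns a value to the Retry aspect, which then sees a success and never retries; a fallback on @Retry runs after the attempts are exhausted.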
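The effect of nesting order and fallback placement can be demonstrated with plain Supplier decorators. This is a sketch under assumptions, not the Resilience4j decorator API: CompositionOrder and its three helper methods are hypothetical, and the "recording" layer merely counts failures where a circuit breaker would record them.

```java
import java.util.function.Supplier;

class CompositionOrder {
    static int recordedFailures = 0;

    /** Inner layer: counts each failed attempt, as a circuit breaker would record it. */
    static <T> Supplier<T> recording(Supplier<T> next) {
        return () -> {
            try {
                return next.get();
            } catch (RuntimeException e) {
                recordedFailures++;
                throw e;
            }
        };
    }

    /** Outer layer: calls the wrapped supplier up to maxAttempts times. */
    static <T> Supplier<T> retrying(Supplier<T> next, int maxAttempts) {
        if (maxAttempts < 1) throw new IllegalArgumentException("maxAttempts must be >= 1");
        return () -> {
            RuntimeException last = null;
            for (int attempt = 0; attempt < maxAttempts; attempt++) {
                try {
                    return next.get();
                } catch (RuntimeException e) {
                    last = e;
                }
            }
            throw last;
        };
    }

    /** Converts any failure of the wrapped supplier into a fixed fallback value. */
    static <T> Supplier<T> withFallback(Supplier<T> next, T fallback) {
        return () -> {
            try {
                return next.get();
            } catch (RuntimeException e) {
                return fallback;
            }
        };
    }

    public static void main(String[] args) {
        Supplier<String> failing = () -> { throw new IllegalStateException("down"); };
        // Fallback outermost: every attempt runs and is recorded before the fallback answers.
        String outer = withFallback(retrying(recording(failing), 3), "pending").get();
        System.out.println(outer + ", recorded failures: " + recordedFailures);
        // Fallback inside the retry: the first failure is masked, so no retry happens.
        recordedFailures = 0;
        String inner = retrying(withFallback(recording(failing), "pending"), 3).get();
        System.out.println(inner + ", recorded failures: " + recordedFailures);
    }
}
```

Both compositions return the same value to the caller; the difference only shows up in how many attempts reach the dependency, which is exactly why a misplaced fallback is hard to spot in testing.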

OrderService.java (Java)

// io.thecodeforge - java tutorial
// Fallback sits on @Retry only: Retry is the outermost aspect by default,
// so the fallback runs once, after the final attempt has failed.
@Retry(name = "checkout", fallbackMethod = "checkoutFallback")
@CircuitBreaker(name = "checkout")
@RateLimiter(name = "checkout")
public Order checkout(String cartId) {
    return paymentClient.charge(cartId);
}

private Order checkoutFallback(String cartId, Throwable t) {
    log.error("Checkout failed for {}: {}", cartId, t.getMessage());
    return new Order(cartId, OrderStatus.PENDING, "Checkout failed, try again later");
}

Output

If the payment API keeps failing: Retry makes up to 3 attempts (the default max-attempts). Each attempt passes through the CircuitBreaker and RateLimiter, so the breaker records every failed attempt. After the last attempt fails, the fallback returns a pending order.

Production Trap:

Avoid putting a fallback on both @Retry and @CircuitBreaker. The fallback on the inner aspect (CircuitBreaker by default) handles the exception first and returns normally, so the outer Retry never sees a failure and neither retries nor runs its own fallback. Use one fallback method on the outermost annotation and let it handle all exception types.

Key Takeaway

Compose modules with intention: Retry for transient blips, RateLimiter for request throttling, CircuitBreaker for systemic down. Never duplicate fallback methods across annotations.

● Production incident · Post-mortem · Severity: high

Missing Slow-Call Threshold Caused 8-Minute Outage During Black Friday

Symptom

Order processing P99 latency climbed to 45 seconds during Black Friday peak. Payment processing appeared UP in monitoring (no failures), but checkout conversion dropped to 12% of normal. Thread pools at 95% saturation.

Assumption

The team assumed circuit breakers were protecting against the payment service — all circuit breakers showed CLOSED in Actuator because the payment service was returning 200 responses, just very slowly.

Root cause

The payment service was experiencing database connection pool exhaustion and responding in 15-20 seconds instead of its normal 500ms. Since responses eventually returned 200 OK, the failure-rate-threshold-based circuit breakers never opened. The order service thread pool filled with threads waiting for payment responses. No slow-call-rate-threshold was configured.

Fix

Added slow-call-duration-threshold: 2s and slow-call-rate-threshold: 60 to all payment service circuit breaker instances. Added a time limiter (timeout) of 3 seconds that throws TimeoutException (recorded as a failure). Added a bulkhead to limit concurrent payment calls to 50, preventing thread pool exhaustion even when the circuit is CLOSED.

Key lesson

Failure rate alone is insufficient for circuit breaking.
Slow services that eventually respond are just as dangerous as services that fail outright.
Always configure slow-call-duration-threshold and slow-call-rate-threshold alongside failure-rate-threshold, and add time limiters to bound maximum wait time.
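The incident is easy to reproduce in miniature: calls that succeed slowly leave the failure rate at zero while the slow call rate climbs. The sketch below is a hypothetical illustration (SlowCallTracker is not a Resilience4j class) that treats a call as slow when its duration is greater than the threshold and uses cumulative counters rather than a sliding window.

```java
class SlowCallTracker {
    private final long slowThresholdMillis;
    private int calls;
    private int failures;
    private int slowCalls;

    SlowCallTracker(long slowThresholdMillis) {
        this.slowThresholdMillis = slowThresholdMillis;
    }

    void record(long durationMillis, boolean failed) {
        calls++;
        if (failed) failures++;
        if (durationMillis > slowThresholdMillis) slowCalls++;  // slow even if it returned 200 OK
    }

    double failureRate() {
        return calls == 0 ? 0.0 : 100.0 * failures / calls;
    }

    double slowCallRate() {
        return calls == 0 ? 0.0 : 100.0 * slowCalls / calls;
    }

    /** Opens when either rate reaches its threshold. */
    boolean shouldOpen(double failureThresholdPct, double slowThresholdPct) {
        return failureRate() >= failureThresholdPct || slowCallRate() >= slowThresholdPct;
    }

    public static void main(String[] args) {
        SlowCallTracker tracker = new SlowCallTracker(2_000);   // slow-call-duration-threshold: 2s
        for (int i = 0; i < 10; i++) {
            tracker.record(15_000, false);                      // 200 OK after 15 seconds
        }
        System.out.println("failure rate: " + tracker.failureRate());
        System.out.println("slow call rate: " + tracker.slowCallRate());
        System.out.println("opens: " + tracker.shouldOpen(50, 60));
    }
}
```

With only a failure-rate check, the same ten calls would never open the circuit, which matches what the team saw in Actuator during the outage.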

Production debug guide · Symptom → root cause → fix · 5 entries

Symptom · 01

Circuit breaker is OPEN but never transitions to HALF_OPEN

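Fix

Check wait-duration-in-open-state first: the default is 60 seconds, and if the downstream service recovers faster than that, you're leaving revenue on the table. By default the transition to HALF_OPEN only happens when a call arrives after the wait has elapsed; on low-traffic instances, set automatic-transition-from-open-to-half-open-enabled: true so the transition does not depend on traffic. Also check for code that drives the state machine directly (for example, calling transitionToOpenState() or transitionToForcedOpenState() on the instance) and keeps the circuit from leaving OPEN. Confirm the current state with GET /actuator/circuitbreakers and review the STATE_TRANSITION entries in /actuator/circuitbreakerevents. If the circuit is stuck and the downstream is confirmed healthy, transition it manually by calling transitionToHalfOpenState() on the instance obtained from CircuitBreakerRegistry.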

Symptom · 02

Circuit breaker opens on low traffic (5 calls failed out of 5)

Fix

The minimum-number-of-calls setting (default 100) specifies how many calls must be recorded in the sliding window before the failure rate is evaluated. If it is set too low (e.g., 5), a short burst of failures opens the circuit prematurely. Check resilience4j.circuitbreaker.instances.{name}.minimum-number-of-calls and raise it to at least 20-50 for a meaningful sample size. For very low-traffic services, use a TIME_BASED sliding window with a longer window duration.

Symptom · 03

BusinessException (404, 409) is being counted as a circuit breaker failure

Fix

By default, all exceptions increment the failure counter. Configure ignore-exceptions to exclude business exceptions that represent valid responses (not service health issues): ignore-exceptions: [com.example.ResourceNotFoundException, com.example.ValidationException]. A 404 means 'resource not found', not 'service is broken', so it should not count toward the failure rate. Conversely, if you set record-exceptions at all, make sure TimeoutException and ConnectException are in it: once that list is defined, any exception outside it (and not ignored) is counted as a success.

Symptom · 04

HALF_OPEN state immediately returns to OPEN on first probe call failure

Fix

The permitted-number-of-calls-in-half-open-state setting determines how many probe calls are made before a decision. If set to 1, a single failure reopens the circuit. Increase to 5-10 so the circuit evaluates a sample before deciding. Also check if the downstream service is actually recovering — use the circuitbreakerevents Actuator endpoint to see each call's outcome in HALF_OPEN state.

Symptom · 05

@CircuitBreaker fallback method not being called

Fix

Verify that the fallback method has the same return type and parameters as the protected method, plus one additional exception parameter (Throwable or a more specific type) at the end. The fallback must be declared in the same class as the protected method. The protected method itself must be called through the Spring proxy: a call made from inside the same class via this.method() bypasses the AOP advice entirely. Check that the class containing the annotated method is a Spring-managed bean (not instantiated with new). The fallbackMethod name must match exactly; it is case-sensitive. Enable DEBUG logging for io.github.resilience4j to see AOP advice application.
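
Symptom 03 in the guide above depends on the order in which an exception is classified. The sketch below models that order in plain Java under stated assumptions: the ignore list is checked first, an empty record list means every exception is recorded as a failure, and a non-empty record list turns everything outside it into a success. ExceptionClassifier and the local ResourceNotFoundException are hypothetical stand-ins for illustration, not library classes.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeoutException;

/** Hypothetical stand-in for a business exception such as com.example.ResourceNotFoundException. */
final class ResourceNotFoundException extends RuntimeException {
}

/** Hypothetical sketch of how ignore-exceptions and record-exceptions interact. */
final class ExceptionClassifier {
    enum Outcome { IGNORED, FAILURE, SUCCESS }

    private final List<Class<? extends Throwable>> ignoreExceptions;
    private final List<Class<? extends Throwable>> recordExceptions;

    ExceptionClassifier(List<Class<? extends Throwable>> ignoreExceptions,
                        List<Class<? extends Throwable>> recordExceptions) {
        this.ignoreExceptions = ignoreExceptions;
        this.recordExceptions = recordExceptions;
    }

    Outcome classify(Throwable t) {
        // 1. Ignored exceptions count as neither success nor failure.
        if (matches(ignoreExceptions, t)) {
            return Outcome.IGNORED;
        }
        // 2. With no record list, every remaining exception is a failure.
        //    With a record list, only listed types (and their subclasses) are failures.
        if (recordExceptions.isEmpty() || matches(recordExceptions, t)) {
            return Outcome.FAILURE;
        }
        // 3. Anything else is treated as a successful call.
        return Outcome.SUCCESS;
    }

    private static boolean matches(List<Class<? extends Throwable>> types, Throwable t) {
        return types.stream().anyMatch(type -> type.isInstance(t));
    }
}
```

The third branch is the one that surprises people: once a record list exists, an unlisted infrastructure exception silently stops counting against the circuit.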

Debug Cheat Sheet: fast diagnosis commands for circuit breaker issues in production

Circuit breaker stuck in OPEN state

Immediate action

Check current state and timing via Actuator

Commands

curl -s http://your-service:8080/actuator/circuitbreakers | python3 -m json.tool

curl -s http://your-service:8080/actuator/circuitbreakerevents/PaymentService/STATE_TRANSITION | python3 -m json.tool

Fix now

Check wait-duration-in-open-state; if the downstream is confirmed healthy, force HALF_OPEN by calling transitionToHalfOpenState() on the circuit breaker instance

COUNT_BASED vs TIME_BASED Sliding Windows

Aspect	COUNT_BASED	TIME_BASED
Memory usage	Fixed (circular array of N outcomes)	Fixed (circular array of N per-second buckets)
Best for	High-traffic services (10+ RPS)	Low/variable-traffic services (<10 RPS)
Window definition	Last N calls	Calls in last N seconds
Staleness risk	Low on high traffic	Low (time-bounded)
Sparse traffic risk	Window may contain old data	Window may be nearly empty
Recommended size	50-200 calls	30-120 seconds
minimum-number-of-calls	10-20% of window size	5-10 calls
Configuration keys	sliding-window-type: COUNT_BASED, sliding-window-size: 100	sliding-window-type: TIME_BASED, sliding-window-size: 60
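
To make the TIME_BASED column of the table concrete, here is a rough sketch of a window built from per-second buckets. It is an illustration under assumptions, not the library's code: TimeBasedWindow is a hypothetical name, timestamps are passed in as non-negative epoch seconds so the behavior is deterministic, and an empty window reports a failure rate of 0.

```java
import java.util.Arrays;

/** Hypothetical sketch of a time-based sliding window with one bucket per second. */
final class TimeBasedWindow {
    private final long[] bucketSecond; // epoch second each bucket currently holds
    private final int[] total;
    private final int[] failed;

    TimeBasedWindow(int windowSeconds) {
        bucketSecond = new long[windowSeconds];
        total = new int[windowSeconds];
        failed = new int[windowSeconds];
        Arrays.fill(bucketSecond, Long.MIN_VALUE); // marks every bucket as unused
    }

    /** Records one call outcome in the bucket for the given epoch second. */
    void record(long epochSecond, boolean failure) {
        int i = (int) (epochSecond % bucketSecond.length);
        if (bucketSecond[i] != epochSecond) {
            // The bucket holds data from an older second: reset and reuse it.
            bucketSecond[i] = epochSecond;
            total[i] = 0;
            failed[i] = 0;
        }
        total[i]++;
        if (failure) {
            failed[i]++;
        }
    }

    /** Number of calls recorded in the last N seconds, including the current second. */
    int calls(long nowEpochSecond) {
        return sum(total, nowEpochSecond);
    }

    /** Failure percentage over the last N seconds; 0 when no calls are in the window. */
    float failureRate(long nowEpochSecond) {
        int calls = sum(total, nowEpochSecond);
        return calls == 0 ? 0f : 100f * sum(failed, nowEpochSecond) / calls;
    }

    private int sum(int[] counters, long now) {
        int result = 0;
        for (int i = 0; i < counters.length; i++) {
            boolean inWindow = bucketSecond[i] > now - bucketSecond.length && bucketSecond[i] <= now;
            if (inWindow) {
                result += counters[i];
            }
        }
        return result;
    }
}
```

Memory stays fixed at N buckets however many calls arrive, and old outcomes age out on their own. That is also the sparse-traffic risk from the table: after a quiet minute the window is simply empty.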

⚙ Quick Reference

3 commands from this guide

File	Command / Code	Purpose
CircuitBreakerService.java	@Service	Why You Need an Abstraction Layer (and Why Hystrix Died)
application.yml	resilience4j:	Global vs. Specific Configuration
OrderService.java	@Retry(name = "checkout", fallbackMethod = "checkoutFallback")	RateLimiter and Retry

Key takeaways

Always configure slow-call-duration-threshold alongside failure-rate-threshold; services that are slow but not failing are just as dangerous as services that fail outright

Use ignore-exceptions for business exceptions (404, 409, 422) so legitimate business logic doesn't trigger circuit opening

@CircuitBreaker only works through Spring AOP proxies; calling annotated methods via this.method() bypasses the circuit breaker entirely

The production resilience stack is CircuitBreaker + TimeLimiter + Bulkhead, each addressing a distinct failure mode: state-based protection, timeout enforcement, and concurrency limiting

Wire circuit breaker health to /actuator/health with register-health-indicator: true and allow-health-indicator-to-fail: true so OPEN circuits automatically remove pods from Kubernetes load balancer rotation

Common mistakes to avoid

6 patterns

Not configuring slow-call-duration-threshold

Symptom

Slow dependencies (responding in 15+ seconds) never trigger the circuit breaker because they return 200 OK, causing thread pool exhaustion

Fix

Always configure both failure-rate-threshold AND slow-call-rate-threshold + slow-call-duration-threshold; add TimeLimiter to bound maximum wait time

Calling @CircuitBreaker-annotated method via this.method()

Symptom

Circuit breaker never activates; fallback method never called; no circuit breaker metrics recorded for the method

Fix

Inject the Spring bean and call through the proxy, or use self-injection (@Autowired private MyService self) and call self.method()

Including business exceptions (404, 409, 422) in record-exceptions

Symptom

Circuit breaker opens during normal peak traffic when legitimate 404s (resource not found) push failure rate above threshold

Fix

Add business exceptions to ignore-exceptions; only record infrastructure failures (IOException, TimeoutException, 5xx HTTP errors) as circuit breaker failures

Setting minimum-number-of-calls to 1 or another very low value

Symptom

Circuit opens after a single failed request (100% failure rate on 1 call), causing unnecessary outages during transient failures

Fix

Set minimum-number-of-calls to at least 10-20 for meaningful statistical evaluation before the failure rate triggers circuit opening

Not adding register-health-indicator: true

Symptom

Circuit breaker state is invisible to Kubernetes readiness probes; an OPEN circuit doesn't remove the pod from load balancer rotation

Fix

Set register-health-indicator: true and allow-health-indicator-to-fail: true; OPEN circuit breakers then surface as DOWN in /actuator/health

Using fallback to silently succeed on write operations when circuit is OPEN

Symptom

Users get a success response for a payment or order creation that never happened; data inconsistency and lost revenue

Fix

For write operations, fallbacks should throw a descriptive exception (PaymentServiceUnavailableException) with a user-facing retry message; never silently succeed
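
The this.method() mistake in the list above is easiest to see with a plain JDK dynamic proxy. The sketch below is an analogy for how a Spring AOP proxy behaves, assuming an interface-based proxy; PaymentOperations, PaymentService, and ProxyDemo are hypothetical names, and the handler only counts intercepted calls where a real aspect would apply circuit breaker logic.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.concurrent.atomic.AtomicInteger;

interface PaymentOperations {
    String checkout();

    String charge();
}

final class PaymentService implements PaymentOperations {
    @Override
    public String checkout() {
        // Self-invocation: this call goes straight to the target object,
        // so the proxy (and any advice attached to charge()) never sees it.
        return this.charge();
    }

    @Override
    public String charge() {
        return "charged";
    }
}

final class ProxyDemo {
    /** Wraps the target in a JDK dynamic proxy that counts intercepted calls to charge(). */
    static PaymentOperations protect(PaymentOperations target, AtomicInteger interceptedCharges) {
        InvocationHandler handler = (proxy, method, args) -> {
            if (method.getName().equals("charge")) {
                interceptedCharges.incrementAndGet(); // stand-in for circuit breaker advice
            }
            return method.invoke(target, args);
        };
        return (PaymentOperations) Proxy.newProxyInstance(
                PaymentOperations.class.getClassLoader(),
                new Class<?>[] {PaymentOperations.class},
                handler);
    }
}
```

Calling checkout() on the proxy returns "charged" while the counter stays at 0, because the inner charge() call never passes through the handler. Calling charge() on the proxy directly increments the counter, which is why injecting the bean (or a self-reference) and calling through it restores the protection.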

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · JUNIOR · Explain the three states of a circuit breaker and what triggers transiti...

Q02 · SENIOR · What is the difference between failure-rate-threshold and slow-call-rate...

Q03 · SENIOR · How do you prevent business exceptions (404, validation errors) from cou...

Q04 · SENIOR · A circuit breaker opens on a service that's actually healthy. How would ...

Q05 · SENIOR · What happens when both @CircuitBreaker and @TimeLimiter are applied to t...

Q06 · SENIOR · When would you choose TIME_BASED over COUNT_BASED sliding windows?

Q07 · SENIOR · How do you test circuit breaker behavior in integration tests?

Q08 · SENIOR · What is the Bulkhead pattern and how does it complement the Circuit Brea...

Q01 of 08 · JUNIOR

Explain the three states of a circuit breaker and what triggers transitions between them.

ANSWER

CLOSED (normal): all calls pass through, and their outcomes are recorded in the sliding window. Transitions to OPEN when the failure rate or slow-call rate reaches its configured threshold, once minimum-number-of-calls have been recorded.

OPEN (protective): all calls fail immediately with CallNotPermittedException without calling the downstream service. Transitions to HALF_OPEN after wait-duration-in-open-state has elapsed.

HALF_OPEN (probing): a limited number of calls (permitted-number-of-calls-in-half-open-state) pass through to test recovery. Transitions back to CLOSED if the failure rate and slow-call rate of those calls are below their thresholds; back to OPEN otherwise.
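
The three-state answer above can be sketched as a small state machine. This is a simplified model under stated assumptions, not Resilience4j's implementation: SimpleBreaker is a hypothetical class, it keeps cumulative counters since the last transition instead of a sliding window, it tracks failures only (no slow calls), and time is passed in as milliseconds so the transitions are deterministic.

```java
/** Hypothetical sketch of the CLOSED / OPEN / HALF_OPEN state machine. */
final class SimpleBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final float failureRateThreshold; // percent
    private final int minimumNumberOfCalls;
    private final int permittedCallsInHalfOpen;
    private final long waitInOpenMillis;

    private State state = State.CLOSED;
    private long openedAtMillis;
    private int halfOpenPermits;
    private int calls;
    private int failures;

    SimpleBreaker(float failureRateThreshold, int minimumNumberOfCalls,
                  int permittedCallsInHalfOpen, long waitInOpenMillis) {
        this.failureRateThreshold = failureRateThreshold;
        this.minimumNumberOfCalls = minimumNumberOfCalls;
        this.permittedCallsInHalfOpen = permittedCallsInHalfOpen;
        this.waitInOpenMillis = waitInOpenMillis;
    }

    State state() {
        return state;
    }

    /** Returns true if a call may proceed; false models a CallNotPermittedException. */
    boolean tryAcquire(long nowMillis) {
        if (state == State.OPEN && nowMillis - openedAtMillis >= waitInOpenMillis) {
            transitionTo(State.HALF_OPEN, nowMillis); // wait elapsed: start probing
        }
        if (state == State.OPEN) {
            return false;
        }
        if (state == State.HALF_OPEN) {
            if (halfOpenPermits >= permittedCallsInHalfOpen) {
                return false; // all probe slots are taken
            }
            halfOpenPermits++;
        }
        return true;
    }

    /** Records the outcome of a permitted call and applies any resulting transition. */
    void record(boolean failure, long nowMillis) {
        if (state == State.OPEN) {
            return; // late result from an in-flight call; ignored in this sketch
        }
        calls++;
        if (failure) {
            failures++;
        }
        int required = (state == State.HALF_OPEN) ? permittedCallsInHalfOpen : minimumNumberOfCalls;
        if (calls < required) {
            return; // not enough data to evaluate yet
        }
        if (100f * failures / calls >= failureRateThreshold) {
            transitionTo(State.OPEN, nowMillis);
        } else if (state == State.HALF_OPEN) {
            transitionTo(State.CLOSED, nowMillis);
        }
    }

    private void transitionTo(State next, long nowMillis) {
        state = next;
        calls = 0;
        failures = 0;
        halfOpenPermits = 0;
        if (next == State.OPEN) {
            openedAtMillis = nowMillis;
        }
    }
}
```

In this model the move from OPEN to HALF_OPEN happens lazily, on the first tryAcquire() after the wait has elapsed, which is why a circuit with no traffic can appear to stay OPEN.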

FAQ · 6 QUESTIONS

Frequently Asked Questions

Is Resilience4j compatible with Spring Boot 3.x?

Can I use @CircuitBreaker with reactive WebFlux methods?

How many circuit breakers should a microservice have?

What happens to in-flight requests when a circuit breaker transitions from CLOSED to OPEN?

How do I monitor circuit breaker events in production?

Can the circuit breaker be configured to automatically recover from OPEN to CLOSED?

Naren Founder & Principal Engineer

20+ years shipping production Java in banking & fintech. Written from production experience, not tutorials.

✓ Verified

production tested

July 04, 2026

last updated
