Retry Mechanism in Spring Boot
Master Spring Boot retry: @Retryable, @Recover, Resilience4j with exponential backoff and jitter, RetryTemplate, idempotency, and retry vs circuit breaker patterns.
- Use @Retryable for declarative retry on specific exceptions with configurable backoff and max attempts
- Always pair @Retryable with @Recover to handle exhausted retries gracefully without bubbling exceptions
- Add jitter to exponential backoff to prevent retry thundering herd when many clients fail simultaneously
- Idempotency is mandatory before enabling retry — retrying non-idempotent operations causes double charges, duplicate records, and data corruption
- Use circuit breaker alongside retry: retry handles transient failures, circuit breaker stops retrying when a downstream is systemically down
Retry is like redialing a phone number when the line is busy. You don't give up on the first busy signal — you wait a moment and try again. Exponential backoff means each wait is longer than the last, so you're not hammering a struggling service. Jitter is like adding a random few seconds to your wait so you and 10,000 other callers don't all redial at exactly the same millisecond and crash the exchange again.
Your payment service calls a bank API. Network blips happen. The bank's load balancer hiccups for 200ms. Without retry, your customer sees 'Payment failed, please try again' for a transient error that would have resolved itself in a fifth of a second. With naive retry, all 10,000 concurrent failed requests retry simultaneously and take down the bank's already-struggling service.
Getting retry right is one of the most impactful resilience improvements you can make to a distributed system. Spring Boot provides two battle-tested options: Spring Retry (declarative, annotation-driven, lightweight) and Resilience4j (feature-rich, metrics-integrated, production-hardened). Both support exponential backoff, jitter, custom retry conditions, and fallback logic.
But retry is a sharp knife. Retry a non-idempotent operation and you get double charges. Retry without a circuit breaker and you amplify load on an already-down service. Retry with fixed backoff across synchronized clients and you create a thundering herd. Every retry decision involves trade-offs that experienced engineers sweat over.
This guide walks through both Spring Retry and Resilience4j with complete production configurations, explains when to use each, covers the idempotency requirements that must come first, and draws the precise boundary between retry and circuit breaker. Code examples run on Spring Boot 3.x with Java 17+.
Spring Retry: @Retryable and @Recover
Spring Retry's annotation-driven API is the fastest path from zero to production retry. Add spring-retry and spring-boot-starter-aop to your classpath, put @EnableRetry on your configuration, and annotate methods with @Retryable. Spring AOP wraps the bean in a proxy that intercepts calls, catches specified exceptions, and retries according to your configuration.
The @Retryable annotation has several key attributes: retryFor (value or include before Spring Retry 2.0) specifies which exception types trigger retry; noRetryFor (formerly exclude) lists exceptions that should immediately propagate without retry; maxAttempts (default 3) controls total attempts including the first; backoff configures the delay strategy between attempts.
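The @Backoff annotation controls timing: delay is the initial delay in milliseconds, multiplier enables exponential growth (delay × multiplier^n for the nth retry, counting from 0), maxDelay caps the delay, and random=true adds jitter. With random=true and a multiplier set, Spring Retry uses ExponentialRandomBackOffPolicy, which scales each computed delay by a random factor between 1 and the multiplier, so the actual wait falls between the current and the next deterministic step. Always set random=true in production.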
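The delay arithmetic can be sketched in plain Java. This is an illustration, not Spring Retry's own code: the class and method names are hypothetical, and it assumes the randomized policy scales the deterministic delay by a factor drawn uniformly between 1 and the multiplier, then applies the maxDelay cap.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Plain-Java sketch of exponential backoff arithmetic (hypothetical helper, not library code). */
final class BackoffSketch {

    private BackoffSketch() {}

    /** Deterministic delay before retry number retryIndex (0 = first retry), capped at maxDelayMs. */
    static long baseDelayMs(long delayMs, double multiplier, long maxDelayMs, int retryIndex) {
        double raw = delayMs * Math.pow(multiplier, retryIndex);
        return (long) Math.min(raw, (double) maxDelayMs);
    }

    /**
     * Jittered delay: the base delay scaled by a factor drawn uniformly from
     * [1, multiplier), then capped again at maxDelayMs.
     */
    static long jitteredDelayMs(long delayMs, double multiplier, long maxDelayMs, int retryIndex) {
        long base = baseDelayMs(delayMs, multiplier, maxDelayMs, retryIndex);
        double factor = 1.0 + ThreadLocalRandom.current().nextDouble() * (multiplier - 1.0);
        return Math.min((long) (base * factor), maxDelayMs);
    }
}
```

With delay = 1000, multiplier = 2.0 and maxDelay = 10000, the deterministic steps are 1000, 2000, 4000, 8000 and then 10000 ms, and each jittered wait lands somewhere between its own step and the next one.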
@Recover is the fallback for exhausted retries. It must be in the same class as @Retryable, have a compatible return type, and accept the exception as its first parameter (with the same additional parameters as the retried method). Without @Recover, Spring Retry rethrows the last exception after exhausting attempts.
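The contract between the retried method and its recovery path can be shown without any framework. The sketch below is a plain-Java illustration of the semantics described above, not the Spring Retry implementation; callWithRetry and its parameters are hypothetical names, and backoff sleeping is left out to keep it short.

```java
import java.util.function.Function;
import java.util.function.Supplier;

/** Plain-Java sketch of retry-then-recover semantics (hypothetical helper). */
final class RetrySketch {

    private RetrySketch() {}

    /**
     * Calls op up to maxAttempts times in total, first call included.
     * Returns the first successful result. Once attempts are exhausted,
     * the last exception is handed to recover instead of being rethrown.
     */
    static <T> T callWithRetry(int maxAttempts, Supplier<T> op, Function<RuntimeException, T> recover) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.get();
            } catch (RuntimeException e) {
                last = e; // a real policy would sleep here according to its backoff
            }
        }
        return recover.apply(last);
    }
}
```

Note that maxAttempts = 3 means one initial call plus two retries, and that the recovery function must return the same type as the operation, which mirrors the "compatible return type" rule for @Recover.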
A common gotcha: @Retryable doesn't work when calling the method from within the same class. Spring AOP creates a proxy around the bean, but self-calls bypass the proxy. Either inject the service into itself (self-injection, typically with @Lazy to avoid a circular-dependency error) or, better, extract the retryable method into a dedicated infrastructure class.
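The proxy bypass is easy to reproduce with a JDK dynamic proxy, which is one of the mechanisms Spring AOP can use. The sketch below is an analogy rather than Spring code: BankGateway, DirectBankGateway and recordingProxy are hypothetical names, and the interceptor only records method names where a retry interceptor would wrap the call.

```java
import java.lang.reflect.Proxy;
import java.util.List;

interface BankGateway {
    String chargeAll();
    String chargeOne();
}

final class DirectBankGateway implements BankGateway {
    @Override
    public String chargeAll() {
        // Self-call: "this" is the raw target, not the proxy, so no interceptor runs.
        return "all:" + this.chargeOne();
    }

    @Override
    public String chargeOne() {
        return "one";
    }
}

final class ProxyDemo {

    private ProxyDemo() {}

    /** Wraps target in a JDK proxy that records the name of each method it intercepts. */
    static BankGateway recordingProxy(BankGateway target, List<String> intercepted) {
        return (BankGateway) Proxy.newProxyInstance(
                BankGateway.class.getClassLoader(),
                new Class<?>[] {BankGateway.class},
                (proxy, method, args) -> {
                    intercepted.add(method.getName());
                    return method.invoke(target, args);
                });
    }
}
```

Calling chargeAll() through the proxy records only chargeAll, even though chargeOne() also runs. If chargeOne() carried the retry annotation, that inner call would never be retried.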
Resilience4j Retry: Exponential Backoff with Jitter
Resilience4j is the production-grade choice when you need rich configurability, Micrometer metrics integration, reactive support, or composability with circuit breaker and bulkhead. It works well with Spring Boot's auto-configuration through the resilience4j-spring-boot3 starter.
Resilience4j Retry wraps a Supplier, Callable, or function and retries it on configurable exceptions. The Java annotation API (@Retry from resilience4j-spring-boot3) is the most convenient for Spring services — it works via AOP similarly to Spring Retry.
The power of Resilience4j is in its IntervalFunction options. Simple fixed delay is available, but the real value is exponential randomized backoff: IntervalFunction.ofExponentialRandomBackoff() computes the delay as initialInterval × multiplier^(attempt - 1), then randomizes it within plus or minus a configurable randomization factor (0.5 by default, so the actual wait lands between 50% and 150% of the computed value). This breaks synchronization between retrying clients.
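For comparison with multiplicative jitter, here is a plain-Java sketch of symmetric randomization around the exponential interval. It is not Resilience4j's source: the names are hypothetical, and it assumes the randomized value is drawn uniformly from interval × (1 - factor) up to interval × (1 + factor).

```java
import java.util.concurrent.ThreadLocalRandom;

/** Plain-Java sketch of plus-or-minus jitter around an exponential interval (hypothetical helper). */
final class RandomizedIntervalSketch {

    private RandomizedIntervalSketch() {}

    /** Exponential interval for a 1-based attempt number: initialMs * multiplier^(attempt - 1). */
    static double exponentialMs(long initialMs, double multiplier, int attempt) {
        return initialMs * Math.pow(multiplier, attempt - 1);
    }

    /** Picks a value uniformly from [interval * (1 - factor), interval * (1 + factor)). */
    static long randomizedMs(long initialMs, double multiplier, double factor, int attempt) {
        double interval = exponentialMs(initialMs, multiplier, attempt);
        double delta = factor * interval;
        double min = interval - delta;
        return (long) (min + ThreadLocalRandom.current().nextDouble() * 2 * delta);
    }
}
```

With an initial interval of 500 ms, a multiplier of 2.0 and a factor of 0.5, the third attempt centers on 2000 ms and the randomized wait falls between 1000 and 3000 ms.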
Resilience4j integrates with Micrometer out of the box — every retry instance exposes metrics: resilience4j.retry.calls (tagged with kind=successful_with_retry, failed_with_retry, successful_without_retry, failed_without_retry). This gives you precise visibility into retry rates in production without adding instrumentation code.
The most powerful production pattern is composing Retry with CircuitBreaker: wrap the retry in a circuit breaker so that when failure rate exceeds the threshold, the circuit opens and retry stops immediately rather than burning retry budget on a genuinely down service. Resilience4j's decorator API makes this composition clean.
Idempotency: The Prerequisite for Retry
No retry discussion is complete without idempotency — it's the non-negotiable prerequisite. Retrying a non-idempotent operation is worse than not retrying at all, because you get silent data corruption instead of visible failures. Before enabling retry on any operation, ask: 'If this executes twice, what happens?' For database reads and idempotent updates (set status = X where status = Y), retry is safe. For inserts, payment charges, or email sends, you must implement idempotency first.
Idempotency implementation has three layers. First, use the idempotency features built into external APIs: Stripe accepts an Idempotency-Key header that deduplicates charges for 24 hours; AWS S3 PUT operations are naturally idempotent; most modern payment processors support this. Always use these native capabilities.
Second, for your own APIs, implement idempotency keys at the application layer. The client generates a stable key (ideally from the business request: SHA256 of orderId + amount + currency), includes it in the request, and your service stores the key alongside the result in a database table. On retry, you detect the key exists and return the cached result without re-executing.
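A minimal plain-Java sketch of this application-layer scheme follows. Everything in it is illustrative: IdempotencySketch, keyFor and executeOnce are hypothetical names, the in-memory map stands in for the database table, and the canonical string format ("orderId|amount|currency" with trailing zeros stripped from the amount) is an assumption rather than a standard.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Plain-Java sketch of an idempotency-key store (hypothetical; a real one lives in a database). */
final class IdempotencySketch {

    // In-memory stand-in for the table that maps idempotency key -> stored result.
    private final Map<String, String> results = new ConcurrentHashMap<>();

    /** Stable key from business fields: hex SHA-256 over "orderId|amount|currency". */
    static String keyFor(String orderId, BigDecimal amount, String currency) {
        // Strip trailing zeros so 10.0 and 10.00 produce the same key.
        String canonical = orderId + "|" + amount.stripTrailingZeros().toPlainString() + "|" + currency;
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha256.digest(canonical.getBytes(StandardCharsets.UTF_8));
            return String.format("%064x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should be available on every Java platform", e);
        }
    }

    /** Runs operation at most once per key; repeat calls return the stored result. */
    String executeOnce(String key, Supplier<String> operation) {
        return results.computeIfAbsent(key, k -> operation.get());
    }
}
```

A retried request that carries the same key gets the first result back and the charge runs once. In a real service the lookup and insert would be a single database statement guarded by a unique constraint, so that concurrent retries on different instances are also deduplicated.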
Third, use database-level idempotency for database operations: unique constraints prevent duplicate inserts; INSERT ... ON CONFLICT DO NOTHING (PostgreSQL) or INSERT IGNORE (MySQL) handle concurrent retries safely. For state transitions, WHERE clauses on current state (UPDATE orders SET status='CONFIRMED' WHERE id=? AND status='PENDING') make updates safe to retry.
For distributed systems, the idempotency key table should have an expiry TTL matched to your retry window. Don't keep idempotency records forever — 24-48 hours covers all reasonable retry scenarios and prevents unbounded growth.
RetryTemplate: Programmatic Retry
While annotations cover most use cases, RetryTemplate gives you programmatic control over retry logic — useful for batch jobs, complex retry conditions, dynamic retry configuration based on runtime state, or testing retry behavior directly.
RetryTemplate is configurable with RetryPolicy (how many times and on what conditions) and BackOffPolicy (delay strategy). Common policies include SimpleRetryPolicy (max attempts), ExceptionClassifierRetryPolicy (different policies per exception type), and TimeoutRetryPolicy (retry until a wall-clock deadline). BackOff policies include FixedBackOffPolicy, ExponentialBackOffPolicy, and ExponentialRandomBackOffPolicy.
RetryTemplate's execute() method takes a RetryCallback (the operation) and optionally a RecoveryCallback (fallback). The RetryContext passed to the callback contains retry count and the last exception — useful for logging or conditional logic within the retried operation.
A powerful pattern is ExceptionClassifierRetryPolicy: map specific exception types to specific retry policies. Retry ServerException up to 5 times with exponential backoff; retry ThrottledException up to 10 times with longer delays; immediately propagate ValidationException without any retry.
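The classification idea can be sketched in plain Java as a map from exception type to attempt budget. This is not ExceptionClassifierRetryPolicy itself: ClassifiedRetrySketch and its methods are hypothetical, per-type backoff is omitted, and in the usage note JDK exceptions stand in for the ServerException and ValidationException named above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Plain-Java sketch of per-exception retry budgets (hypothetical helper). */
final class ClassifiedRetrySketch {

    // Checked in insertion order, so register more specific exception types first.
    private final Map<Class<? extends RuntimeException>, Integer> maxAttemptsByType = new LinkedHashMap<>();

    ClassifiedRetrySketch register(Class<? extends RuntimeException> type, int maxAttempts) {
        maxAttemptsByType.put(type, maxAttempts);
        return this;
    }

    /** Total attempts allowed for this exception; unregistered types get 1, meaning no retry. */
    int maxAttemptsFor(RuntimeException e) {
        for (Map.Entry<Class<? extends RuntimeException>, Integer> entry : maxAttemptsByType.entrySet()) {
            if (entry.getKey().isInstance(e)) {
                return entry.getValue();
            }
        }
        return 1;
    }

    <T> T execute(Supplier<T> op) {
        int attempt = 0;
        while (true) {
            attempt++;
            try {
                return op.get();
            } catch (RuntimeException e) {
                if (attempt >= maxAttemptsFor(e)) {
                    throw e; // the budget for this exception type is spent
                }
                // A real policy would sleep here using the backoff chosen for this type.
            }
        }
    }
}
```

Registering IllegalStateException with a budget of 5 gives that failure five attempts in total, while an unregistered IllegalArgumentException propagates on the first failure, which is the behavior you want for validation errors.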
Retry vs. Circuit Breaker: Drawing the Boundary
Retry and circuit breaker are complementary resilience patterns that solve different problems and must be combined to handle the full spectrum of failures. Understanding exactly where one ends and the other begins prevents over-retrying and amplifying load on already-stressed services.
Retry addresses transient failures: the downstream succeeded a moment ago and will succeed again soon. The failure is temporary — a brief network blip, a momentary pod restart, a short GC pause on the downstream. Retry waits and tries again, with the expectation of eventual success within a handful of attempts.
Circuit breaker addresses systemic failures: the downstream has been failing consistently for a meaningful period. Retrying in this state is counterproductive — you're adding load to an already-struggling system and burning your thread pool waiting for timeouts. The circuit breaker tracks failure rate over a sliding window, and when it exceeds a threshold (typically 50%), the circuit opens. In open state, calls fail immediately (without attempting the downstream) for a configured cooldown period. After cooldown, the circuit enters half-open, allows a small number of probe calls, and closes if they succeed.
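The correct composition: circuit breaker wraps retry. For each call: the circuit breaker checks if the circuit is open (fail fast if so); if closed, the retry logic executes; if the retry exhausts without success, that counts as a circuit breaker failure event. This way retry handles transient failures within a healthy circuit, and the circuit breaker detects when retry is systematically failing and opens to stop the bleeding. One caveat: with Resilience4j's annotations the default aspect order applies Retry as the outermost aspect, so this nesting has to be set up explicitly, either by changing the aspect order or by composing the decorators in code.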
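The call flow described above can be sketched in plain Java. This is a deliberately simplified model, not Resilience4j: it opens after a number of consecutive failed retry sequences instead of tracking a sliding-window failure rate, it takes the current time as a parameter so the behavior is easy to test, and all names are hypothetical.

```java
import java.util.function.Supplier;

/** Plain-Java sketch of a circuit breaker wrapping a retry loop (hypothetical, simplified). */
final class BreakerWrapsRetrySketch {

    /** Thrown when the circuit is open and the downstream is not attempted. */
    static final class CircuitOpenException extends RuntimeException {
        CircuitOpenException() {
            super("circuit open: failing fast");
        }
    }

    private final int failureThreshold; // exhausted retry sequences before the circuit opens
    private final long cooldownMillis;
    private final int maxAttempts;
    private int consecutiveFailures = 0;
    private long openedAt = -1; // -1 means the circuit is closed

    BreakerWrapsRetrySketch(int failureThreshold, long cooldownMillis, int maxAttempts) {
        if (failureThreshold < 1 || maxAttempts < 1) {
            throw new IllegalArgumentException("failureThreshold and maxAttempts must be at least 1");
        }
        this.failureThreshold = failureThreshold;
        this.cooldownMillis = cooldownMillis;
        this.maxAttempts = maxAttempts;
    }

    synchronized <T> T call(Supplier<T> op, long nowMillis) {
        boolean open = openedAt >= 0;
        if (open && nowMillis - openedAt < cooldownMillis) {
            throw new CircuitOpenException(); // fail fast, downstream untouched
        }
        // Either closed, or the cooldown elapsed and this call is the half-open probe.
        try {
            T result = retry(op);
            consecutiveFailures = 0;
            openedAt = -1; // success closes the circuit
            return result;
        } catch (RuntimeException e) {
            // One exhausted retry sequence counts as one breaker failure.
            consecutiveFailures++;
            if (open || consecutiveFailures >= failureThreshold) {
                openedAt = nowMillis; // open, or re-open after a failed probe
            }
            throw e;
        }
    }

    private <T> T retry(Supplier<T> op) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.get();
            } catch (RuntimeException e) {
                last = e; // backoff sleep omitted for brevity
            }
        }
        throw last;
    }
}
```

With a threshold of 2 and three attempts per call, a dead downstream absorbs six calls in total before the circuit opens. After that, callers get CircuitOpenException immediately until the cooldown allows a probe.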
Key metrics to monitor: retry rate (what % of calls require at least one retry), retry success rate (what % of retried calls eventually succeed), circuit breaker state transitions (closed → open means systemic failure), and time in open state (how long services are failing fast). If retry success rate drops below 50%, your retry budget is being wasted on a systemic failure — the circuit breaker threshold needs tuning.
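The two retry ratios can be derived from four call counters, one per outcome kind (successful or failed, with or without retry). The sketch below shows the arithmetic only. The class and parameter names are hypothetical, and returning 1.0 when nothing was retried is a convention chosen here, not a standard.

```java
/** Plain-Java sketch of retry ratio arithmetic from four outcome counters (hypothetical helper). */
final class RetryMetricsSketch {

    private RetryMetricsSketch() {}

    /** Share of all calls that needed at least one retry, in [0, 1]. */
    static double retryRate(long okNoRetry, long okWithRetry, long failedNoRetry, long failedWithRetry) {
        long total = okNoRetry + okWithRetry + failedNoRetry + failedWithRetry;
        return total == 0 ? 0.0 : (double) (okWithRetry + failedWithRetry) / total;
    }

    /** Share of retried calls that eventually succeeded, in [0, 1]. */
    static double retrySuccessRate(long okWithRetry, long failedWithRetry) {
        long retried = okWithRetry + failedWithRetry;
        // No retried calls means no retry budget was wasted, so report 1.0.
        return retried == 0 ? 1.0 : (double) okWithRetry / retried;
    }
}
```

For example, 900 clean successes, 60 successes after retry, 10 failures without retry and 30 failures after retry give a retry rate of 9% and a retry success rate of about 67%.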
The Missing Piece: Backoff Policies That Don't Kill Your Backend
The default retry is a blunt instrument. Three retries with a 1-second delay might work for a local database deadlock, but against a flaky external API it becomes a denial-of-service attack. Production incidents taught me to never retry without exponential backoff. Spring Retry gives you multiple backoff strategies via the @Backoff annotation and ExponentialBackOffPolicy. The key setting is multiplier: each delay is the previous one times the multiplier (e.g., 1s, 2s, 4s, 8s with delay = 1000 and multiplier = 2.0). Add maxDelay to cap the ceiling. For high-throughput systems, combine this with jitter to avoid thundering herd problems on your dependencies.
The default backoff is fixed. That's fine for testing. In production, fixed backoff is how you accidentally DDoS your own database. Always pair @Retryable with @Backoff(delay = 1000, multiplier = 2.0, maxDelay = 10000). Your ops team will thank you when that AWS RDS failover happens at 3 AM.
Add random = true to @Backoff to spread retry windows. For Kafka consumers, this prevents the entire cluster from hammering the DB at the same second.
Retry Configuration: Externalize It, Don't Hardcode It
How many times have you hotfixed a retry count at 2 AM because your payment provider started throttling? Hardcoded @Retryable(maxAttempts = 3) means a recompile and deploy cycle. Stop doing that. Spring Retry supports externalized values through maxAttemptsExpression on @Retryable and delayExpression, multiplierExpression, and maxDelayExpression on @Backoff. You feed these from application.yml or environment variables. This is especially useful for multi-region or multi-tenant setups where different environments have different SLOs. The pattern is simple: use expressions like #{${retry.payment.max-attempts}} and define the defaults in your properties file. One caveat: templated #{...} expressions are evaluated once at startup, so when your API starts returning 429s, bumping the delay takes a config change and a restart, not a rebuild. Spring Retry 2.0 can also evaluate non-templated expressions at call time (for example a bean reference such as @retryProps.maxAttempts, where retryProps is a hypothetical bean name), which is what you need if the next retry must pick up a refreshed value. Always externalize retry parameters. Your pager will appreciate the difference between a config change and a full deploy at 3 AM.
Group retry settings in a @ConfigurationProperties class. Then inject them as a bean and reference them in @Retryable expressions. Avoid stringly-typed magic numbers in code.
The Thundering Herd: Synchronized Retries Took Down the Auth Service
- Jitter is not optional — it's mandatory.
- Without randomization in retry delays, synchronized failures create synchronized retry storms.
- Always add jitter to backoff, and always pair retry with a circuit breaker to stop retrying a service that's genuinely down.
To confirm that retry is enabled and that both dependencies are declared in the build, run:
grep -r '@EnableRetry' src/main/java/ --include='*.java'
grep -r 'spring-retry\|spring-boot-starter-aop' pom.xml build.gradle
Common mistakes to avoid
- Using @Retryable with self-invocation (calling from the same class)
- Fixed backoff delay without jitter across many clients. Fix: IntervalFunction.ofExponentialRandomBackoff(), because jitter desynchronizes retry waves
- Retrying non-idempotent operations (payments, email sends, record creation)
- Not configuring noRetryFor for client errors (4xx HTTP status)
- Missing @Recover method: retries exhaust and throw the raw exception
- Retry without circuit breaker: amplifying load on a downed service
Interview Questions on This Topic
What is exponential backoff with jitter and why is jitter critical?