Senior 12 min · May 23, 2026

Spring Boot API Timeout Handling: RestTemplate, WebClient, Resilience4j & More

Master API timeout handling in Spring Boot 3.

Naren · Founder
Plain-English first. Then code. Then the interview question.
Quick Answer
  • Set connectTimeout and readTimeout on SimpleClientHttpRequestFactory for RestTemplate
  • Set responseTimeout() and ChannelOption.CONNECT_TIMEOUT_MILLIS on the Reactor Netty HttpClient behind WebClient for reactive clients
  • Add timeout=30 to @Transactional to kill long-running database transactions
  • Wrap async calls with Resilience4j TimeLimiter or CompletableFuture.orTimeout()
  • Return 503 (Service Unavailable) for internal timeouts, 504 (Gateway Timeout) when a downstream dependency times out
✦ Definition · ~90s read
What is Spring Boot API Timeout Handling?

An API timeout in Spring Boot is a bounded wait: a maximum duration the application is willing to block waiting for an external response — an HTTP call, a database row lock, an asynchronous future, or a reactive stream event — before abandoning the operation and returning control to the caller. Without explicit timeout configuration, Java's blocking I/O primitives and Spring's higher-level abstractions default to waiting indefinitely, which turns any upstream latency spike into a thread-starvation event for your service.

Timeouts operate at different layers of the stack. The connect timeout governs how long the TCP handshake is allowed to take before the client abandons the connection attempt. The read (socket) timeout governs how long the client waits for data bytes to arrive after a connection is established.

These are separate failure modes — a connect timeout usually means the target host is unreachable, while a read timeout usually means the server accepted the connection but is processing slowly. Transaction timeouts apply to the database layer and are enforced by the JPA provider or JDBC driver, not the HTTP client.
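The difference between the two failure modes can be seen with nothing but the JDK. The sketch below is an illustration, not production code: it opens a local server socket that never sends a byte, so the TCP handshake succeeds immediately and only the read timeout can fire. The class and method names are made up for this example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

final class ConnectVsReadTimeoutSketch {

    /** Connects to a local server that never responds and reports which timeout fired. */
    static String probeSilentServer(int connectTimeoutMs, int readTimeoutMs) {
        // The OS completes the handshake from the listen backlog even though accept() is never called.
        try (ServerSocket silentServer = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket()) {
            try {
                // Connect timeout: bounds the TCP handshake only.
                client.connect(new InetSocketAddress(InetAddress.getLoopbackAddress(),
                        silentServer.getLocalPort()), connectTimeoutMs);
            } catch (SocketTimeoutException e) {
                return "connect timeout";
            }
            // Read timeout: bounds each blocking read on the established connection.
            client.setSoTimeout(readTimeoutMs);
            try {
                int firstByte = client.getInputStream().read();
                return firstByte == -1 ? "connection closed" : "data received";
            } catch (SocketTimeoutException e) {
                return "read timeout";
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(probeSilentServer(1_000, 200));
    }
}
```

Against a host that is unreachable rather than silent, the same method would be expected to take the "connect timeout" branch instead, which is the distinction the paragraph above draws.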

Resilience4j TimeLimiter works at the Java Future/Reactive layer and can wrap any async computation regardless of the underlying transport.

Plain-English First

Think of timeouts like a restaurant kitchen rule: if a dish isn't ready in 30 minutes, the waiter stops waiting and apologizes to the customer rather than making them sit there forever. In Spring Boot, timeouts are those rules — they tell your application exactly how long to wait for a database query, a remote API call, or an async job before giving up and returning a clean error instead of hanging indefinitely.

It starts with a single slow third-party API. On Monday morning, response times creep from 200 ms to 12 seconds. Within minutes, your connection pool is exhausted, your thread pool is saturated, and your entire service is unresponsive — not because your code is broken, but because you never told it how long to wait. This is the cascade that takes down production systems every week, and yet timeout configuration remains one of the most neglected areas in Spring Boot services.

Timeout handling is not a single knob. Every I/O boundary in a Spring Boot application — HTTP client calls, database transactions, reactive streams, asynchronous tasks, and Feign clients in microservice chains — has its own timeout mechanism with its own defaults (often none). A service that sets timeouts only on RestTemplate but forgets its @Transactional methods or its WebClient reactive pipeline is still a ticking clock.

The HTTP status codes you return on timeout failures carry real operational meaning. A 503 Service Unavailable signals that your own service cannot fulfill the request right now, while a 504 Gateway Timeout tells the caller that your service was acting as a proxy and a downstream dependency didn't respond in time. Conflating these makes incident diagnosis dramatically harder for consumers and on-call engineers alike.

Modern Spring Boot 3.x and Java 17+ give you a rich toolkit: the classic SimpleClientHttpRequestFactory for RestTemplate, the reactive responseTimeout on WebClient, declarative transaction timeouts via @Transactional, and Resilience4j's TimeLimiter for wrapping any CompletableFuture or Mono. Feign clients in microservice chains introduce their own timeout layer that interacts with — and can override — the underlying HTTP client settings.

This guide walks through every timeout surface area with production-grade configuration, real incident patterns, and copy-paste-ready code for Spring Boot 3.x with Java 17+. By the end you will have a systematic approach to auditing your service for timeout gaps and a reference cheat sheet for diagnosing timeout failures in production.

RestTemplate Timeout Configuration with SimpleClientHttpRequestFactory

RestTemplate is Spring's classic synchronous HTTP client, still widely used in Spring Boot 3.x legacy codebases. Despite being in maintenance mode (new code is steered toward RestClient or WebClient), it remains common in enterprise services, and understanding its timeout configuration is essential for anyone maintaining or migrating such a service.

The timeout configuration lives in the ClientHttpRequestFactory implementation. The default factory used by a no-arg new RestTemplate() is SimpleClientHttpRequestFactory, which wraps Java's HttpURLConnection. Critically, SimpleClientHttpRequestFactory applies no timeout by default: its connectTimeout and readTimeout fields default to -1, which leaves HttpURLConnection at its own default of 0, meaning wait forever. This is the most common source of thread-starvation incidents in Spring Boot services.

For production use, always construct RestTemplate with an explicit factory configuration. The two critical properties are connectTimeout (how long to wait for the TCP connection to be established) and readTimeout (how long to wait for data to arrive on an established connection). Both are specified in milliseconds.
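A minimal configuration sketch of that advice follows. The class name and the 3-second and 5-second values are illustrative assumptions; tune them against the p99 latency of the upstream you are calling.

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.http.client.SimpleClientHttpRequestFactory;
import org.springframework.web.client.RestTemplate;

@Configuration
public class RestTemplateTimeoutConfig {

    @Bean
    public RestTemplate restTemplate() {
        SimpleClientHttpRequestFactory factory = new SimpleClientHttpRequestFactory();
        factory.setConnectTimeout(3_000); // milliseconds allowed for the TCP handshake
        factory.setReadTimeout(5_000);    // milliseconds allowed per blocking read on the open connection
        return new RestTemplate(factory);
    }
}
```

Injecting this bean everywhere, instead of calling new RestTemplate() in service code, keeps the timeout policy in one place.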

For Apache HttpClient-backed RestTemplate (recommended for connection pooling), use HttpComponentsClientHttpRequestFactory. This gives you full control over connection pool size, connection TTL, and per-request timeouts. The RequestConfig builder lets you set connectionRequestTimeout (how long to wait to acquire a connection from the pool, a third timeout surface often missed), connectTimeout, and the read timeout, which Apache HttpClient 4 calls socketTimeout and HttpClient 5 (the version Spring Boot 3.x uses) calls responseTimeout.

RestTemplate does not natively support reactive timeout propagation. If your service uses a reactive gateway layer (Spring Cloud Gateway) that sets a deadline on the incoming request, RestTemplate will not respect that deadline. This is the primary architectural reason to migrate to WebClient for new services — WebClient integrates with Project Reactor's context propagation and can participate in deadline-aware pipelines.

The Default RestTemplate Has No Timeout
A new RestTemplate() constructed without a factory will wait indefinitely for both connection and socket read. Always inject a configured RestTemplate bean — never call new RestTemplate() directly in service code.
Production Insight
In microservice architectures using a service mesh (Istio/Linkerd), the mesh can enforce its own timeouts transparently. However, mesh timeouts are a safety net — they cause hard TCP resets that leave connections in ambiguous states. Application-level timeouts are cleaner and produce more useful error handling. Run both: application-level timeouts slightly shorter than mesh timeouts.
Key Takeaway
RestTemplate has three distinct timeout surfaces: connection acquisition from pool, TCP connect, and socket read. All three must be configured explicitly — the defaults are infinite.

WebClient Timeout Configuration for Reactive Services

WebClient is the modern, non-blocking HTTP client in Spring WebFlux and Spring Boot 3.x. Its timeout configuration is more nuanced than RestTemplate because it operates on a reactive pipeline with multiple observable points. A common mistake is configuring the HTTP-level responseTimeout but forgetting the TCP-level connection timeout, or vice versa.

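responseTimeout is the timeout most services configure first. Reactor Netty starts it once the request has been sent and fires it when no response data arrives within the configured duration, so it covers the common case of an upstream that accepts the request and then stalls. It is set on the Reactor Netty HttpClient via .responseTimeout(Duration.ofSeconds(5)), and that client is handed to WebClient.Builder through a ReactorClientHttpConnector. Because it limits the gap between network reads rather than the total exchange time, a response that keeps trickling in can outlive it; the .timeout(Duration) operator on the resulting Mono is what enforces an end-to-end deadline.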

The TCP-level timeout is configured on the same Reactor Netty HttpClient: through the now-deprecated tcpConfiguration() in older Reactor Netty releases, or through option() and doOnConnected() in Reactor Netty 1.0+. The connection timeout (TCP handshake) is set via ChannelOption.CONNECT_TIMEOUT_MILLIS. The read and write idle timeouts are set via ReadTimeoutHandler and WriteTimeoutHandler in the channel pipeline. These fire if no data is read or written within the specified duration for as long as the handler is installed, whereas responseTimeout is only active while a response is pending.

For production services, you typically want both levels configured: responseTimeout for the application-level SLA, and CONNECT_TIMEOUT_MILLIS for network-level fast-fail. If the target host is unreachable, CONNECT_TIMEOUT_MILLIS determines how quickly you fail; if it's reachable but slow to respond, responseTimeout determines how long you wait.
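Both levels can be wired together as in the sketch below. The bean name, base URL, and the 3-second and 5-second values are assumptions chosen for illustration; the Reactor Netty and Spring calls are the standard ones for this setup.

```java
import io.netty.channel.ChannelOption;
import io.netty.handler.timeout.ReadTimeoutHandler;
import io.netty.handler.timeout.WriteTimeoutHandler;
import java.time.Duration;
import java.util.concurrent.TimeUnit;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.http.client.reactive.ReactorClientHttpConnector;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.netty.http.client.HttpClient;

@Configuration
public class WebClientTimeoutConfig {

    @Bean
    public WebClient inventoryWebClient() {
        HttpClient httpClient = HttpClient.create()
                // Network-level fast-fail: give up on the TCP handshake after 3 s.
                .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, 3_000)
                // Application-level limit: no response data for 5 s fails the request.
                .responseTimeout(Duration.ofSeconds(5))
                // Optional idle handlers on the channel pipeline.
                .doOnConnected(connection -> connection
                        .addHandlerLast(new ReadTimeoutHandler(5, TimeUnit.SECONDS))
                        .addHandlerLast(new WriteTimeoutHandler(5, TimeUnit.SECONDS)));

        return WebClient.builder()
                .baseUrl("http://inventory.internal") // hypothetical upstream
                .clientConnector(new ReactorClientHttpConnector(httpClient))
                .build();
    }
}
```

A single call can then tighten its own budget at the reactive layer, for example inventoryWebClient.get().uri("/stock/{sku}", sku).retrieve().bodyToMono(String.class).timeout(Duration.ofSeconds(2)), where the URI is again a placeholder.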

Per-request timeout overrides are supported via the httpRequest() callback on the request spec, which exposes Reactor Netty's native HttpClientRequest and its responseTimeout() setter, or more cleanly, by using the .timeout(Duration) operator on the resulting Mono/Flux at the reactive layer. This allows different endpoints of the same service to have different timeout budgets without needing separate WebClient instances.

Note that WebClient's non-blocking nature means thread exhaustion looks different: it's the Netty event-loop threads (typically one per CPU core) that saturate rather than a servlet thread pool. Netty event-loop threads are designed to handle thousands of concurrent connections, but each connection that's waiting for a slow response still occupies an in-flight request slot in the event loop queue. High concurrency to a slow upstream will eventually exhaust available memory for queued requests even if threads appear free.

responseTimeout vs CONNECT_TIMEOUT_MILLIS Are Independent
Setting only responseTimeout does not configure the TCP connect timeout. If the target host is down or unreachable and CONNECT_TIMEOUT_MILLIS is not set, the connection attempt falls back to Netty's default connect timeout of 30 seconds, far longer than most latency budgets. Always set both.
Production Insight
In Kubernetes, DNS resolution adds latency to first connection establishment. If you see intermittent connect timeouts but the service is healthy, check ndots configuration and consider setting CONNECT_TIMEOUT_MILLIS to at least 5 seconds for cross-namespace or cross-cluster calls where DNS lookup is involved.
Key Takeaway
WebClient requires timeout configuration at two independent layers: the application-level responseTimeout and the network-level CONNECT_TIMEOUT_MILLIS. Missing either leaves a failure mode in which requests hang far longer than your latency budget allows.

@Transactional Timeout and Database Transaction Management

Database transactions are a frequently overlooked timeout surface. A @Transactional method that runs a slow query or waits on a row lock can hold a database connection for minutes, exhausting the HikariCP connection pool and starving the rest of the application of DB access. The timeout attribute on @Transactional directly addresses this: it instructs the transaction manager to set a deadline, and once that deadline has passed, the next database operation fails with a TransactionTimedOutException and the transaction is rolled back.

The timeout value is in seconds (not milliseconds, unlike most other Spring timeout configurations — a common mistake). The timeout countdown begins when the transaction is opened, not when the query starts executing. This means for methods that do pre-processing before the first database call, the effective database query budget is timeout minus setup time.

Propagation interacts with timeout in important ways. A @Transactional(timeout=30) method that calls another @Transactional method with the default REQUIRED propagation will share the same transaction and the same timeout budget. The inner method does not reset or extend the timeout. However, calling a @Transactional(propagation=REQUIRES_NEW, timeout=10) method creates a new transaction with its own 10-second budget, independent of the outer transaction.

At the JDBC level, the remaining transaction time is applied to each statement through Statement.setQueryTimeout() (by Hibernate, in most Spring Boot applications). With most drivers the database then cancels the query server-side when the timeout fires, which is more efficient than client-side cancellation: the database terminates the query and releases server resources immediately. This is why a transaction timeout is a better fit than simply wrapping a service call in a Resilience4j TimeLimiter for database-heavy operations.

For read-only queries, @Transactional(readOnly=true, timeout=10) is the recommended pattern. The readOnly hint allows Hibernate to skip dirty checking, and the timeout provides the safety net. For write operations, keep the transaction timeout tight and push long-running work outside the transaction boundary.
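The patterns described in this section might look as follows. This is a sketch under assumptions: the service names, table names, and SQL are invented for illustration, and JdbcTemplate is used only to keep the example short.

```java
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Propagation;
import org.springframework.transaction.annotation.Transactional;

@Service
public class OrderReportService {

    private final JdbcTemplate jdbcTemplate;
    private final AuditWriter auditWriter;

    public OrderReportService(JdbcTemplate jdbcTemplate, AuditWriter auditWriter) {
        this.jdbcTemplate = jdbcTemplate;
        this.auditWriter = auditWriter;
    }

    // Read-only query with a 10-second budget (the unit is seconds, not milliseconds).
    @Transactional(readOnly = true, timeout = 10)
    public long countOpenOrders() {
        Long count = jdbcTemplate.queryForObject(
                "SELECT COUNT(*) FROM orders WHERE status = 'OPEN'", Long.class);
        return count == null ? 0L : count;
    }

    // Write path: a tight budget, shared by any REQUIRED-propagation calls made inside it.
    @Transactional(timeout = 30)
    public void closeOrder(long orderId) {
        jdbcTemplate.update("UPDATE orders SET status = 'CLOSED' WHERE id = ?", orderId);
        auditWriter.record(orderId); // separate bean, so the REQUIRES_NEW proxy is applied
    }
}

@Service
class AuditWriter {

    private final JdbcTemplate jdbcTemplate;

    AuditWriter(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // Runs in its own transaction with an independent 10-second budget.
    @Transactional(propagation = Propagation.REQUIRES_NEW, timeout = 10)
    public void record(long orderId) {
        jdbcTemplate.update("INSERT INTO order_audit (order_id) VALUES (?)", orderId);
    }
}
```

The audit write lives in a second bean because a call from one method to another inside the same class bypasses the transactional proxy, so the inner annotation would be ignored.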

@Transactional timeout is in SECONDS, not milliseconds
Unlike connectTimeout (milliseconds) and responseTimeout (Duration), @Transactional(timeout=N) takes N in seconds. Setting timeout=5000 does not mean 5 seconds — it means 5000 seconds. This is a subtle API inconsistency that causes accidental permissive timeouts.
Production Insight
Hibernate translates @Transactional timeout to JDBC Statement.setQueryTimeout(). However, this only applies to queries executed within the transaction; it is not a dedicated limit on row-lock waits. For lock-wait timeouts, set spring.jpa.properties.jakarta.persistence.lock.timeout=5000 (milliseconds; Spring Boot 3 uses the jakarta.* namespace rather than javax.*) or use database-level lock_timeout settings.
Key Takeaway
@Transactional(timeout=N) is the most effective timeout for database operations because it cancels the query server-side, immediately releasing database resources. Use it on every service method that touches the database.

Resilience4j TimeLimiter and CompletableFuture.orTimeout()

Resilience4j TimeLimiter provides a declarative timeout mechanism that works at the Java Future/Reactive layer, independent of the underlying I/O implementation. It is particularly useful for wrapping CompletableFuture-based async operations, third-party SDK calls that don't expose timeout configuration, or complex orchestration flows where multiple I/O calls need a single shared deadline.

TimeLimiter works by scheduling a cancellation task after the configured timeoutDuration. If the CompletableFuture or Mono completes before the deadline, the cancellation is discarded. If the deadline fires first, a TimeoutException is propagated and, when cancelRunningFuture is enabled, the future is cancelled. The important nuance is that cancellation does not guarantee the underlying work stops: CompletableFuture.cancel() never interrupts the thread running the task, and even a plain Future, whose cancel(true) does send an interrupt, cannot stop code that ignores interruption (such as a blocking JDBC read). This is a key operational concern: you may return a timeout response to the caller while the background work continues consuming resources.
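The gap between cancelling a future and stopping its work can be shown with the JDK alone. In the sketch below (names invented for this example), a task sleeps on a worker thread while the caller cancels its CompletableFuture; the method reports whether the worker ever saw an interrupt.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

final class CancelDoesNotInterruptSketch {

    /** Returns true if the worker thread was interrupted when its future was cancelled. */
    static boolean workerInterruptedAfterCancel() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch finished = new CountDownLatch(1);
        AtomicBoolean interrupted = new AtomicBoolean(false);
        try {
            CompletableFuture<Void> future = CompletableFuture.runAsync(() -> {
                started.countDown();
                try {
                    Thread.sleep(300); // stands in for a blocking call
                } catch (InterruptedException e) {
                    interrupted.set(true);
                } finally {
                    finished.countDown();
                }
            }, pool);

            started.await();
            future.cancel(true);                 // marks the future as cancelled only
            finished.await(5, TimeUnit.SECONDS); // the task still runs to completion
            return interrupted.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
    }
}
```

The method returns false: the caller sees a cancelled future while the task sleeps through its full 300 ms, which is the zombie-task behaviour to watch for in thread pool metrics.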

For pure Java 9+ code without Resilience4j, CompletableFuture.orTimeout(5, TimeUnit.SECONDS) is a lightweight alternative that schedules a TimeoutException if the future doesn't complete in time. CompletableFuture.completeOnTimeout(defaultValue, 5, TimeUnit.SECONDS) goes further and completes the future with a fallback value instead of an exception — useful for non-critical data fetches where a default is acceptable.
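A small sketch of both JDK methods, with a simulated remote call standing in for real I/O. The slowLookup helper and the "fresh" and "fallback" values are assumptions made up for this illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class FutureTimeoutSketch {

    /** Simulated remote call (hypothetical): completes with "fresh" after latencyMs. */
    static CompletableFuture<String> slowLookup(long latencyMs) {
        return CompletableFuture.supplyAsync(() -> "fresh",
                CompletableFuture.delayedExecutor(latencyMs, TimeUnit.MILLISECONDS));
    }

    /** completeOnTimeout: substitute a default value instead of failing. */
    static String withFallback(long latencyMs, long timeoutMs) {
        return slowLookup(latencyMs)
                .completeOnTimeout("fallback", timeoutMs, TimeUnit.MILLISECONDS)
                .join();
    }

    /** orTimeout: complete exceptionally with a TimeoutException. */
    static boolean timesOut(long latencyMs, long timeoutMs) {
        try {
            slowLookup(latencyMs).orTimeout(timeoutMs, TimeUnit.MILLISECONDS).join();
            return false;
        } catch (CompletionException e) {
            // join() wraps the TimeoutException in a CompletionException.
            return e.getCause() instanceof TimeoutException;
        }
    }
}
```

With a 2-second simulated latency and a 50 ms timeout, withFallback yields "fallback" and timesOut reports true; with a 10 ms latency and a 2-second timeout, the real value comes through and no timeout is raised.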

In Resilience4j, TimeLimiter integrates cleanly with CircuitBreaker: a timeout counts as a failure for circuit breaker state machine purposes. This means a sustained stream of timeouts will open the circuit breaker and fail fast, preventing thread/resource exhaustion. This integration is one of the key reasons to prefer Resilience4j over raw orTimeout() for production services.

Annotation-based Resilience4j configuration via @TimeLimiter requires the method to return CompletableFuture or Mono/Flux. For blocking service methods, wrap them in CompletableFuture.supplyAsync() within a bounded thread pool — but be aware this shifts the blocking to the thread pool's threads rather than the caller's thread.
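The annotation-based setup could be sketched like this. The service, the LegacyInventorySdk interface, the instance name "inventory", and the fallback value are all hypothetical; the timeout itself is assumed to come from configuration such as resilience4j.timelimiter.instances.inventory.timeout-duration=2s.

```java
import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import io.github.resilience4j.timelimiter.annotation.TimeLimiter;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import org.springframework.stereotype.Service;

@Service
public class InventoryLookupService {

    private final Executor inventoryExecutor; // assumed to be a bounded thread pool bean
    private final LegacyInventorySdk sdk;     // hypothetical blocking SDK without its own timeout

    public InventoryLookupService(Executor inventoryExecutor, LegacyInventorySdk sdk) {
        this.inventoryExecutor = inventoryExecutor;
        this.sdk = sdk;
    }

    // The blocking call moves onto the bounded pool; TimeLimiter bounds how long the caller waits,
    // and each timeout is recorded as a failure by the surrounding CircuitBreaker.
    @TimeLimiter(name = "inventory")
    @CircuitBreaker(name = "inventory", fallbackMethod = "stockUnknown")
    public CompletableFuture<Integer> stockLevel(String sku) {
        return CompletableFuture.supplyAsync(() -> sdk.blockingStockLookup(sku), inventoryExecutor);
    }

    // Fallback: same parameters plus the failure cause, same return type.
    private CompletableFuture<Integer> stockUnknown(String sku, Throwable cause) {
        return CompletableFuture.completedFuture(-1);
    }
}

// Hypothetical third-party SDK used only for this example.
interface LegacyInventorySdk {
    int blockingStockLookup(String sku);
}
```

Keeping the executor bounded matters here: a timed-out lookup may keep running on its pool thread, so an unbounded pool would simply move the exhaustion problem elsewhere.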

cancelRunningFuture=true Does Not Guarantee Thread Termination
Setting cancelRunningFuture=true cancels the future, but for a CompletableFuture that never interrupts the worker thread, and even where an interrupt is delivered (a plain Future from an ExecutorService), threads blocked on JDBC, network I/O, or code that swallows InterruptedException will continue running. Monitor your thread pool's active count after enabling TimeLimiter to detect zombie tasks.
Production Insight
Pair TimeLimiter with Micrometer metrics. Resilience4j publishes resilience4j.timelimiter.calls with tags for kind (successful, timeout, failed). Alert when timeout rate for any instance exceeds 5% of total calls — this indicates the timeoutDuration may be too tight for current p99 latency, or the upstream is degrading.
Key Takeaway
Resilience4j TimeLimiter adds timeout behavior to any CompletableFuture or Mono and integrates directly with CircuitBreaker so sustained timeouts trigger circuit opening. Use it for third-party SDK calls and async orchestration flows.

503 vs 504: Returning the Right HTTP Status on Timeout

The HTTP status code returned when a timeout occurs is not cosmetic — it carries semantic meaning that affects how upstream services, API gateways, load balancers, and monitoring systems respond. Conflating 503 and 504 makes incident diagnosis significantly harder and can lead to incorrect retry behavior from clients.

503 Service Unavailable means your service itself is currently unable to handle the request. Use this when the timeout is internal: your own thread pool is exhausted, your circuit breaker is open, your database connection pool is drained, or a resource your service owns is unavailable. The Retry-After header should be included to give clients a hint about when to retry. Load balancers and API gateways typically remove a 503-returning instance from their upstream pool.

504 Gateway Timeout means your service was acting as a proxy or gateway and a downstream dependency failed to respond in time. Your service received the request, forwarded it downstream, and that downstream service did not respond within the timeout window. Use this when the timeout occurs in a client call to another service — a payment gateway, an inventory service, a shipping provider API. The consumer knows the failure is not in your service but in something your service depends on.

The practical implication: if your order service times out calling the inventory service, return 504 to the API gateway. The API gateway (or the caller) can then make an informed decision — retry the call, use cached data, or surface a specific error to the user. If you return 503, the caller may conclude your order service is down and stop sending traffic to it, when in fact only the inventory dependency is degraded.

In Spring Boot, the mapping is implemented via @ExceptionHandler or a global @ControllerAdvice. Different exception types from different timeout mechanisms map to different HTTP statuses. TimeLimiter throws TimeoutException, WebClient throws WebClientRequestException wrapping various IO exceptions, and @Transactional timeout throws TransactionTimedOutException.
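The classification logic behind such a handler can be sketched without any framework. The mapping below is one possible policy, not a standard: it walks the cause chain and treats JDK network timeouts as downstream failures (504) and local resource timeouts as internal ones (503). In a real service the same decision would sit inside @ExceptionHandler methods and would also cover the Spring and Resilience4j exception types named above.

```java
import java.net.SocketTimeoutException;
import java.net.http.HttpTimeoutException;
import java.sql.SQLTimeoutException;
import java.util.concurrent.RejectedExecutionException;

final class TimeoutStatusSketch {

    static final int INTERNAL_SERVER_ERROR = 500;
    static final int SERVICE_UNAVAILABLE = 503;
    static final int GATEWAY_TIMEOUT = 504;

    /** Picks an HTTP status by inspecting the exception and its causes. */
    static int statusFor(Throwable error) {
        for (Throwable t = error; t != null; t = t.getCause()) {
            // A call to another service timed out: we acted as a gateway.
            if (t instanceof SocketTimeoutException || t instanceof HttpTimeoutException) {
                return GATEWAY_TIMEOUT;
            }
            // Our own resources (database, executor queue) could not serve the request in time.
            if (t instanceof SQLTimeoutException || t instanceof RejectedExecutionException) {
                return SERVICE_UNAVAILABLE;
            }
        }
        return INTERNAL_SERVER_ERROR; // not a timeout at all
    }
}
```

Walking the cause chain matters because HTTP clients usually wrap the original I/O exception, so checking only the outermost type would send most downstream timeouts to the 500 branch.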

Never Return 500 for Timeout Scenarios
Returning 500 Internal Server Error for timeout conditions hides the true cause from clients, monitoring systems, and SLA tracking. Load balancers and API gateways treat 500, 503, and 504 differently in their routing logic. Use the semantically correct status code every time.
Production Insight
Some API gateways (AWS API Gateway, Kong) treat 504 specially and emit their own 504 before your service does if their own timeout fires. Configure your service's timeout to be shorter than the gateway's timeout so your error handling code (including fallbacks and logging) executes before the gateway terminates the connection.
Key Takeaway
503 means your service is the problem; 504 means a downstream dependency is the problem. Map every timeout exception type to the semantically correct HTTP status in a global @ControllerAdvice.

Feign Client Timeout Propagation in Microservice Chains

Feign clients in a microservice architecture introduce a layered timeout problem. Each Feign client has its own timeout configuration, which interacts with — and can be overridden by — the underlying HTTP client (OkHttp or Apache HttpClient). Additionally, the calling service's own response timeout budget must be considered holistically across the chain.

Feign's timeout configuration has two levels: the Feign-level default (set via Request.Options) and the HTTP client-level timeout (set on the underlying OkHttpClient or HttpClient bean). The Feign-level default takes precedence for most scenarios, but some HTTP client implementations may enforce their own limits more strictly. The safest approach is to configure both consistently.
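At the Feign level, the per-client settings could be declared as in this sketch. The class name and the 2-second and 5-second values are assumptions; the class would be referenced from a client through @FeignClient(name = "inventory", configuration = InventoryFeignConfig.class), where "inventory" is a placeholder.

```java
import feign.Request;
import feign.Retryer;
import java.util.concurrent.TimeUnit;
import org.springframework.context.annotation.Bean;

// Per-client Feign configuration; not annotated with @Configuration so it is
// not picked up by component scanning and applied to every client.
public class InventoryFeignConfig {

    @Bean
    public Request.Options requestOptions() {
        // connect timeout 2 s, read timeout 5 s, follow redirects
        return new Request.Options(2, TimeUnit.SECONDS, 5, TimeUnit.SECONDS, true);
    }

    @Bean
    public Retryer retryer() {
        // Explicitly no retries: safe for non-idempotent calls.
        return Retryer.NEVER_RETRY;
    }
}
```

Whatever values are chosen here, the timeouts on the underlying OkHttpClient or Apache HttpClient bean should be set to match, so that neither layer surprises the other.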

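In a microservice chain (A → B → C → D), timeouts nest rather than add up: B waits at most its own read timeout for C, no matter how long C is prepared to wait for D. If B's Feign client has a 10-second read timeout calling C, and C's Feign client also has a 10-second read timeout calling D, B gives up on C at the same moment C gives up on D, so C's error handling and fallback never reach B. Timeouts should therefore shrink toward the end of the chain (for example 12 seconds for A → B, 10 seconds for B → C, 8 seconds for C → D), and a service that makes several sequential downstream calls must budget for the sum of their timeouts. In practice, derive each hop's timeout from the end-to-end latency budget with appropriate margins.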
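One way to reason about a per-hop budget is to cap each downstream timeout by whatever remains of the caller's deadline. The helper below is a plain-Java sketch of that arithmetic; the method name and the idea of passing the remaining budget explicitly are assumptions for illustration, not a Feign feature.

```java
import java.time.Duration;

final class TimeoutBudgetSketch {

    /**
     * Timeout to use for the next downstream call: the configured per-call timeout,
     * capped by what is left of the caller's budget after reserving a safety margin.
     */
    static Duration nextHopTimeout(Duration remainingBudget, Duration configuredTimeout, Duration margin) {
        Duration available = remainingBudget.minus(margin);
        if (available.isNegative() || available.isZero()) {
            return Duration.ZERO; // nothing left: fail fast instead of calling downstream
        }
        return available.compareTo(configuredTimeout) < 0 ? available : configuredTimeout;
    }
}
```

For instance, with 6 seconds of budget left, a configured 10-second read timeout, and a 1-second margin for the service's own error handling, the next call gets 5 seconds; with 12 seconds left it gets the full configured 10.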

Idempotency is critical when retrying across Feign clients. Spring Cloud OpenFeign's default retryer (Retryer.NEVER_RETRY) disables retries, which is the safe default for non-idempotent operations. For idempotent operations (GET, PUT with the same data), configure a Retryer with a retry period and maxAttempts. For POST operations that may create resources, idempotency keys (sent as a request header, stored with Redis SETNX by the receiver) prevent duplicate resource creation on retry after a timeout.
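The receiver-side idea can be sketched in memory. In the example below a ConcurrentHashMap stands in for Redis SETNX, purely as an assumption to keep the code self-contained; a real deployment would need a shared store with expiry so that all instances see the same keys.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

final class IdempotencySketch {

    // In-memory stand-in for a shared key store such as Redis.
    private final ConcurrentMap<String, String> results = new ConcurrentHashMap<>();

    /** Runs the operation at most once per key; a repeated key gets the stored result back. */
    String executeOnce(String idempotencyKey, Supplier<String> operation) {
        return results.computeIfAbsent(idempotencyKey, key -> operation.get());
    }
}
```

If a caller times out and retries a POST with the same key, the second call returns the result of the first instead of creating a second resource.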

Spring Cloud LoadBalancer integrates with Feign and adds retry logic at the load-balancer layer. If a Feign call fails with an IOException (including timeout), Spring Cloud can retry on a different instance. Configure this carefully — retrying a write operation on timeout to a different instance is only safe with idempotency key support on the target service.

Feign Timeout Does Not Override Underlying HTTP Client Timeout
If you configure OkHttpClient with a 30-second read timeout and Feign Request.Options with a 5-second read timeout, Feign's 5-second limit applies. But if the OkHttpClient read timeout is set to 2 seconds, it will fire before Feign's 5-second limit. Always configure both consistently.
Production Insight
In distributed traces (Zipkin/Jaeger), Feign timeout failures appear as a span with error=true and a short duration, while the downstream service may show no corresponding span if the timeout fired before the request arrived. This asymmetry is a key diagnostic signal — a Feign span that ends in error with no matching downstream span indicates a network-level or connect timeout, not a slow response.
Key Takeaway
Feign timeout configuration must match the underlying HTTP client configuration, and idempotency keys are mandatory for safe retry of write operations after a timeout in a microservice chain.

Spring MVC Request Timeout: The Async Request You Didn't Configure

You can set a timeout on Spring MVC requests without touching RestTemplate or WebClient. The spring.mvc.async.request-timeout property controls how long the container waits for an asynchronous result before sending a 503 back to the client. This is not a database timeout or an HTTP client timeout: it bounds the entire request processing pipeline when a controller returns a Callable or DeferredResult.

Why does this matter? If your controller returns a Callable, Spring hands off execution to a separate thread pool. If that thread hangs on a slow database query, the client sits waiting. Without this property, the servlet container's own default async timeout applies (30 seconds on Tomcat), and you lose control over when the error response is sent. Setting spring.mvc.async.request-timeout=5000 tells Spring to raise an AsyncRequestTimeoutException after 5 seconds. Catch it with @ExceptionHandler and return a proper 503.

Common trap: developers set timeouts on the HTTP client but forget the async timeout. The client times out and retries, but the server thread is still tied up with the original request. The async timeout only ends the response; it does not stop the worker thread, so the work running behind the Callable still needs its own timeout. Always set the async request timeout when using deferred results.

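AsyncTimeoutConfig.java (Java)
// io.thecodeforge - java tutorial
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.ControllerAdvice;
import org.springframework.web.bind.annotation.ExceptionHandler;
import org.springframework.web.context.request.async.AsyncRequestTimeoutException;
import org.springframework.web.servlet.config.annotation.AsyncSupportConfigurer;
import org.springframework.web.servlet.config.annotation.WebMvcConfigurer;

@Configuration
public class AsyncTimeoutConfig {

    // Programmatic equivalent of spring.mvc.async.request-timeout=5000
    @Bean
    public WebMvcConfigurer asyncTimeoutConfigurer() {
        return new WebMvcConfigurer() {
            @Override
            public void configureAsyncSupport(AsyncSupportConfigurer configurer) {
                configurer.setDefaultTimeout(5000); // milliseconds, i.e. 5 seconds
            }
        };
    }

    // Maps the async timeout to an explicit 503 response
    @ControllerAdvice
    public static class TimeoutHandler {
        @ExceptionHandler(AsyncRequestTimeoutException.class)
        public ResponseEntity<String> handleAsyncTimeout() {
            return ResponseEntity.status(HttpStatus.SERVICE_UNAVAILABLE).body("Request timed out");
        }
    }
}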
Output
GET /slow-async -> 503 Service Unavailable after 5 seconds
Production Trap:
Async request timeout only fires for Callable/DeferredResult controllers. If you use ResponseEntity with direct blocking calls, this property does nothing. Pair it with an HTTP client timeout.
Key Takeaway
Spring MVC async request timeout is your last line of defense—never let the container hang waiting for a response.

RestClient Timeout: The Modern Non-Reactive Client

Spring Boot 3.2 introduced RestClient as the synchronous successor to RestTemplate: the same fluent API style as WebClient, but blocking. With RestClient, you configure timeouts through the underlying ClientHttpRequestFactory, the same approach as RestTemplate but with a cleaner builder pattern.

Why use RestClient over RestTemplate? RestClient is where new development happens; RestTemplate is in maintenance mode and receives only minor changes and bug fixes. RestClient gives you connectTimeout and readTimeout via SimpleClientHttpRequestFactory, and additionally connectionRequestTimeout via HttpComponentsClientHttpRequestFactory.

The critical mistake: setting connect timeout to 30 seconds when you mean read timeout. Connect timeout covers the TCP handshake and should be 1–3 seconds. Read timeout covers waiting for the response and should match your SLA. Example: an external payment API guarantees a 2-second response, so set the read timeout to 3 seconds. Any longer, and you're masking upstream failures.

Also: always set connectionRequestTimeout when using connection pools. Without it, a thread can sit waiting for a pooled connection long after the caller has given up if the pool is exhausted.

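RestClientTimeoutConfig.java (Java)
// io.thecodeforge - java tutorial
import org.apache.hc.client5.http.config.RequestConfig;
import org.apache.hc.client5.http.impl.classic.CloseableHttpClient;
import org.apache.hc.client5.http.impl.classic.HttpClients;
import org.apache.hc.core5.util.Timeout;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.http.client.HttpComponentsClientHttpRequestFactory;
import org.springframework.web.client.RestClient;

@Configuration
public class RestClientTimeoutConfig {

    @Bean
    public RestClient restClient() {
        // Read timeout is set on Apache HttpClient 5 directly: in Spring Framework 6.0 and 6.1,
        // HttpComponentsClientHttpRequestFactory.setReadTimeout() only logs a warning.
        RequestConfig requestConfig = RequestConfig.custom()
                .setResponseTimeout(Timeout.ofSeconds(3))
                .build();
        CloseableHttpClient httpClient = HttpClients.custom()
                .setDefaultRequestConfig(requestConfig)
                .build();

        HttpComponentsClientHttpRequestFactory factory =
                new HttpComponentsClientHttpRequestFactory(httpClient);
        factory.setConnectTimeout(2000);           // TCP connect: 2 s
        factory.setConnectionRequestTimeout(1000); // wait for a pooled connection: 1 s

        return RestClient.builder()
                .requestFactory(factory)
                .baseUrl("https://api.payments.com")
                .build();
    }

    // Usage:
    // restClient.get().uri("/status").retrieve().body(String.class);
}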
Output
Payment API call fails within 2 s if the connection cannot be established, or after 3 s without response data (plus up to 1 s waiting for a pooled connection)
Migration Path:
Replace RestTemplate with RestClient incrementally. Both use the same request factory. Change one endpoint at a time. No service disruption.
Key Takeaway
RestClient is the non-reactive future for Spring Boot. Don't wait for RestTemplate to break—migrate now and set distinct connect vs read timeouts.
● Production incident · POST-MORTEM · severity: high

The Black Friday Cascade: How a Missing WebClient Timeout Took Down Checkout

Symptom
Checkout service health check returning 500. All HTTP requests hanging for 20-30 seconds then failing. HikariCP connection pool exhausted. Netty worker thread pool at 100% utilization. No circuit breaker tripping.
Assumption
On-call engineer assumed a database issue due to HikariCP exhaustion metrics and began investigating slow queries. The payment gateway team was not paged for 18 minutes.
Root cause
The WebClient instance calling the payment gateway had no responseTimeout configured. The gateway was responding slowly (25 s) due to their own upstream bank API issues. Each checkout request held a Netty event-loop thread for 25 seconds. With 512 event-loop threads and 3 requests/second sustained, all threads were saturated within 170 seconds. HikariCP exhaustion was a secondary symptom — threads holding Netty workers also held open JDBC connections.
Fix
Added .responseTimeout(Duration.ofSeconds(5)) to the WebClient builder. Added Resilience4j CircuitBreaker wrapping the payment client with a 60% failure rate threshold. Added fallback to a queued async payment flow. Redeployed in 11 minutes; recovery in under 2 minutes after deployment.
Key lesson
  • WebClient is non-blocking but Netty event-loop threads are still finite.
  • A slow upstream with no timeout exhausts the reactive thread pool just as a blocking RestTemplate exhausts the servlet thread pool.
  • Every WebClient instance that calls an external service must have responseTimeout set.
  • Add circuit breakers so a single slow dependency cannot monopolize the thread pool.
Production debug guide · Symptom → root cause → fix · 5 entries
Symptom · 01
Service hangs and threads pile up; eventually returns 500 after 60+ seconds
Fix
You likely have no timeout configured. Take a thread dump (jstack <pid>) and look for threads blocked on socket read. Identify which upstream host is involved, then add connectTimeout and readTimeout (RestTemplate) or responseTimeout (WebClient) for that client. Start with 5 s read timeout and tune based on p99 latency of the upstream.
Symptom · 02
HikariCP 'Connection is not available, request timed out after 30000ms' in logs
Fix
Threads are holding DB connections longer than the connection-timeout allows new requests to acquire one. Check for missing @Transactional(timeout=N) on slow queries. Run SHOW PROCESSLIST (MySQL) or SELECT * FROM pg_stat_activity (Postgres) to find long-running queries. Add transaction timeouts and optimize slow queries; short-term mitigation is to increase pool size cautiously.
Symptom · 03
Feign client in microservice chain returns 504 to the caller but you see a 200 from the downstream service in its logs
Fix
The Feign read timeout fired before the downstream service finished responding. The downstream service completed the work but the response arrived after Feign already closed the connection. This is especially dangerous for non-idempotent write operations — the action completed but the caller thinks it failed. Add idempotency keys (Redis SETNX) on write endpoints and ensure Feign read timeouts align with the downstream p99+buffer. Return 504 to the upstream caller to signal the gateway-timeout semantics.
Symptom · 04
Resilience4j TimeLimiter fires with TimeoutException but CompletableFuture task keeps running in background
Fix
TimeLimiter cancels the Future, but a CompletableFuture's worker thread is never interrupted, and a plain Future's thread stops only if the task respects interruption. For tasks submitted to an ExecutorService, check Thread.currentThread().isInterrupted() periodically; for CompletableFuture tasks, pass an explicit cancellation flag or rely on the timeout of the underlying I/O. For blocking JDBC calls, set @Transactional(timeout=N) so the database itself cancels the query. Use ThreadPoolTaskExecutor with a bounded queue so runaway background tasks don't exhaust executor threads.
Symptom · 05
503 returned by service even though downstream responded within SLA
Fix
Check whether your own service's thread pool or connection pool exhausted before the downstream responded. Look at actuator metrics: http.server.requests histogram, executor.active gauge, hikaricp.connections.active. The timeout may be firing due to queue wait time, not actual I/O latency. Increase corePoolSize or reduce the work done before the downstream call.
★ Debug Cheat Sheet
Copy-paste commands for diagnosing timeout issues in Spring Boot production services
Service threads blocked on socket I/O
Immediate action
Take a thread dump to identify blocked threads and upstream host
Commands
jstack $(pgrep -f 'java.*myapp') | grep -B 8 -A 12 'BLOCKED\|SocketInputStream'
curl -s http://localhost:8080/actuator/metrics/jvm.threads.states | jq '.measurements'
Fix now
Add connectTimeout(3000) and readTimeout(5000) to the RestTemplate or WebClient targeting the identified host
HikariCP connection pool exhausted in production
Immediate action
Check active connection count and long-running DB queries immediately
Commands
curl -s http://localhost:8080/actuator/metrics/hikaricp.connections.active | jq '.'
psql -U appuser -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' ORDER BY duration DESC LIMIT 10;"
Fix now
Kill long-running queries with SELECT pg_terminate_backend(pid) then add @Transactional(timeout=30) to offending service methods
Resilience4j TimeLimiter metrics show high timeout rate
Immediate action
Check TimeLimiter metrics and compare to downstream service latency
Commands
curl -s http://localhost:8080/actuator/metrics/resilience4j.timelimiter.calls | jq '.measurements'
curl -s http://localhost:8080/actuator/health | jq '.components.circuitBreakers'
Fix now
Adjust timeoutDuration in TimeLimiterConfig to p99 latency + 20% buffer; open the circuit breaker if failure rate exceeds threshold
Kubernetes pod restarting; timeout errors in logs
Immediate action
Check if liveness probe timeout is too aggressive relative to GC pauses or startup time
Commands
kubectl describe pod myapp-pod-xyz -n production | grep -A 10 'Liveness\|Readiness'
kubectl logs myapp-pod-xyz -n production --previous | tail -100
Fix now
Increase livenessProbe.timeoutSeconds and add initialDelaySeconds to allow JVM startup and cache warm-up before probes begin
Timeout Mechanisms Comparison
| Mechanism | Layer | Config Unit | Async-Safe | Best For |
| --- | --- | --- | --- | --- |
| SimpleClientHttpRequestFactory | HTTP Client | Milliseconds | No | Legacy RestTemplate |
| HttpComponentsClientHttpRequestFactory | HTTP Client + Pool | Milliseconds | No | Pooled RestTemplate in production |
| WebClient responseTimeout | Reactive HTTP | Duration | Yes | Non-blocking WebFlux services |
| WebClient CONNECT_TIMEOUT_MILLIS | TCP Layer | Milliseconds | Yes | Fast-fail on unreachable hosts |
| @Transactional(timeout=N) | DB Transaction | Seconds | No | Database query/lock timeouts |
| Resilience4j TimeLimiter | Future/Reactive | Duration | Yes | Wrapping any async operation |
| CompletableFuture.orTimeout() | Java Future | TimeUnit | Yes | Lightweight async timeout without Resilience4j |
| Feign Request.Options | HTTP Client | Milliseconds | No | Service-to-service Feign calls |
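As an illustration of the CompletableFuture.orTimeout() row, the sketch below bounds an async call with the JDK alone (Java 9+). The names `slowCall` and `fetchWithTimeout`, the delays, and the fallback value are all made up for the example; the sleep stands in for real I/O.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class OrTimeoutSketch {

    /** Simulates a slow downstream call; the sleep stands in for real I/O. */
    static CompletableFuture<String> slowCall(Duration delay) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                Thread.sleep(delay.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "fresh-value";
        });
    }

    /** Bounds the wait with orTimeout and falls back only when the failure is a timeout. */
    static String fetchWithTimeout(Duration delay, Duration limit) {
        return slowCall(delay)
                .orTimeout(limit.toMillis(), TimeUnit.MILLISECONDS)
                .exceptionally(ex -> {
                    // The exception may arrive wrapped in a CompletionException.
                    Throwable cause = ex instanceof CompletionException && ex.getCause() != null
                            ? ex.getCause() : ex;
                    if (cause instanceof TimeoutException) {
                        return "cached-fallback";
                    }
                    throw new CompletionException(cause); // other failures still propagate
                })
                .join();
    }

    public static void main(String[] args) {
        System.out.println(fetchWithTimeout(Duration.ofMillis(10), Duration.ofSeconds(2)));
        System.out.println(fetchWithTimeout(Duration.ofSeconds(1), Duration.ofMillis(100)));
    }
}
```

The first call finishes inside its limit and returns the real value; the second exceeds it and returns the fallback. Note that orTimeout only completes the future exceptionally: the simulated call keeps sleeping on its worker thread until it finishes on its own.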

Key takeaways

1
Every I/O boundary in Spring Boot needs an explicit timeout
RestTemplate, WebClient, @Transactional, Feign, and async tasks each have their own timeout setting, and several of them default to waiting indefinitely
2
WebClient requires two independent timeout configurations
responseTimeout for the HTTP exchange and CONNECT_TIMEOUT_MILLIS for the TCP handshake — missing either creates a hang scenario
3
@Transactional(timeout=N) takes N in seconds (not milliseconds) and translates to JDBC Statement.setQueryTimeout(), which cancels the query server-side for efficient resource release
4
Return 503 when your service is the failure source and 504 when a downstream dependency timed out
correct status codes are essential for incident diagnosis and caller retry logic
5
Feign retry on timeout is only safe for non-idempotent operations when idempotency keys (stored with Redis SETNX) are used to prevent duplicate resource creation on retry
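The 503-versus-504 rule from takeaway 4 can be expressed as a small classifier. This is a sketch under assumptions: `TimeoutStatusMapper` and `statusFor` are hypothetical names, and which exception types count as "downstream" versus "own resource" is a choice made here for illustration. It walks the cause chain because client libraries commonly wrap the underlying `SocketTimeoutException`.

```java
import java.net.SocketTimeoutException;
import java.sql.SQLTimeoutException;
import java.util.concurrent.RejectedExecutionException;

class TimeoutStatusMapper {
    static final int INTERNAL_SERVER_ERROR = 500;
    static final int SERVICE_UNAVAILABLE = 503;
    static final int GATEWAY_TIMEOUT = 504;

    /** Maps a failure to an HTTP status by inspecting the exception and its causes. */
    static int statusFor(Throwable failure) {
        for (Throwable t = failure; t != null; t = t.getCause()) {
            if (t instanceof SocketTimeoutException) {
                return GATEWAY_TIMEOUT; // a downstream call did not answer in time
            }
            if (t instanceof SQLTimeoutException || t instanceof RejectedExecutionException) {
                return SERVICE_UNAVAILABLE; // our own database or thread pool is the bottleneck
            }
        }
        return INTERNAL_SERVER_ERROR; // not a timeout this sketch recognises
    }

    public static void main(String[] args) {
        Throwable wrapped = new RuntimeException("I/O error",
                new SocketTimeoutException("Read timed out"));
        System.out.println(statusFor(wrapped));
        System.out.println(statusFor(new SQLTimeoutException("query cancelled")));
        System.out.println(statusFor(new IllegalStateException("bug")));
    }
}
```

In a Spring application this logic would typically sit inside an exception handler; the mapping itself is the part that matters for monitoring and for caller retry decisions.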

Common mistakes to avoid

7 patterns
×

Using `new RestTemplate()` without a factory

Symptom
Service threads pile up indefinitely when an upstream slows down; HikariCP exhaustion follows
Fix
Always construct RestTemplate with a configured factory bean: SimpleClientHttpRequestFactory or HttpComponentsClientHttpRequestFactory with explicit connectTimeout and readTimeout
×

Setting @Transactional(timeout=5000) thinking the unit is milliseconds

Symptom
Transactions effectively never time out (5000 seconds ≈ 83 minutes); slow queries hold connections forever
Fix
The timeout attribute is in seconds. Use @Transactional(timeout=5) for a 5-second limit. Document this prominently in code comments given the counter-intuitive unit.
×

Configuring responseTimeout on WebClient but not CONNECT_TIMEOUT_MILLIS

Symptom
When a downstream host becomes completely unreachable, each connection attempt waits for the default connect timeout (30 seconds in Netty) instead of failing fast within a few seconds
Fix
Always set both: .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, 3_000) and .responseTimeout(Duration.ofSeconds(5)) on the HttpClient builder
×

Returning 500 for all timeout scenarios

Symptom
Monitoring cannot distinguish internal failures from downstream dependency failures; callers cannot implement correct retry logic; load balancers may remove healthy instances
Fix
Return 503 for internal resource exhaustion (own DB timeout, circuit breaker open); return 504 when a downstream service call timed out
×

Retrying non-idempotent operations after a timeout without idempotency keys

Symptom
Duplicate records created, duplicate payments charged, double inventory reserved — visible to end users as data corruption
Fix
Generate a deterministic idempotency key (order ID + operation type) for write operations and include it as a header. The receiver uses Redis SETNX to deduplicate. Only then is retry-on-timeout safe.
×

Setting Feign retry to Retryer.Default without considering non-idempotent calls

Symptom
POST requests to create resources are retried on timeout, creating duplicate resources since the original request completed server-side
Fix
Use Retryer.NEVER_RETRY as the default. Enable retries per client only for explicitly idempotent operations (GET, idempotent PUT). Always use idempotency keys.
×

Assuming Resilience4j TimeLimiter stops the underlying thread

Symptom
Service returns 504 quickly, but background threads continue running slow DB queries or HTTP calls, eventually exhausting the thread pool anyway
Fix
Pair TimeLimiter with @Transactional(timeout=N) for DB operations so the database cancels the query server-side. For HTTP calls, rely on the WebClient/RestTemplate socket timeout, which does stop the I/O at the network layer.
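The last mistake in this list is easy to reproduce with the JDK alone. The sketch below does not use Resilience4j; it only shows the cancellation semantics of the two future types a timeout wrapper might cancel. `CancellationSketch` and its methods are hypothetical names, and the 300 ms sleep stands in for a slow query or HTTP call.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

class CancellationSketch {

    /** Slow task that records whether it finished normally or was interrupted. */
    static Runnable slowTask(CountDownLatch started, CountDownLatch done,
                             AtomicBoolean finished, AtomicBoolean interrupted) {
        return () -> {
            started.countDown();
            try {
                Thread.sleep(300); // stands in for a slow query or HTTP call
                finished.set(true);
            } catch (InterruptedException e) {
                interrupted.set(true);
            } finally {
                done.countDown();
            }
        };
    }

    /** Waits 50 ms for a result, cancels on timeout, then waits for the task to end. */
    static void timeOutAndCancel(Future<?> future, CountDownLatch started, CountDownLatch done) {
        try {
            started.await(); // make sure the task is running before the clock starts
            future.get(50, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
        try {
            done.await(2, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    /** True if a cancelled CompletableFuture task still ran to completion. */
    static boolean completableFutureKeepsRunning() {
        CountDownLatch started = new CountDownLatch(1), done = new CountDownLatch(1);
        AtomicBoolean finished = new AtomicBoolean(), interrupted = new AtomicBoolean();
        Future<?> f = CompletableFuture.runAsync(slowTask(started, done, finished, interrupted));
        timeOutAndCancel(f, started, done);
        return finished.get() && !interrupted.get();
    }

    /** True if cancel(true) interrupted a task submitted to an ExecutorService. */
    static boolean executorTaskIsInterrupted() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            CountDownLatch started = new CountDownLatch(1), done = new CountDownLatch(1);
            AtomicBoolean finished = new AtomicBoolean(), interrupted = new AtomicBoolean();
            Future<?> f = pool.submit(slowTask(started, done, finished, interrupted));
            timeOutAndCancel(f, started, done);
            return interrupted.get() && !finished.get();
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("CompletableFuture task kept running: " + completableFutureKeepsRunning());
        System.out.println("Executor task was interrupted: " + executorTaskIsInterrupted());
    }
}
```

Both lines report `true`. The difference comes from the JDK itself: `CompletableFuture.cancel` ignores its `mayInterruptIfRunning` argument, while a `Future` returned by `ExecutorService.submit` interrupts the worker thread. Even then, interruption only helps when the blocked call reacts to it, which is why the database and socket timeouts remain the reliable backstop.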
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · JUNIOR
What is the difference between connectTimeout and readTimeout in RestTem...
Q02 · JUNIOR
When should a Spring Boot service return 503 vs 504 on a timeout?
Q03 · SENIOR
How does @Transactional(timeout=30) interact with nested @Transactional ...
Q04 · SENIOR
Explain how Resilience4j TimeLimiter works and why cancelRunningFuture=t...
Q05 · SENIOR
How do you configure WebClient to fail fast when a downstream host is co...
Q06 · SENIOR
What is the correct approach for retry-on-timeout for a Feign POST call ...
Q07 · SENIOR
How would you implement timeout budget propagation in a microservice cha...
Q08 · SENIOR
What metrics would you alert on to detect timeout problems before users ...
Q09 · SENIOR
How does WebClient handle backpressure differently from RestTemplate in ...
Q01 of 09 · JUNIOR

What is the difference between connectTimeout and readTimeout in RestTemplate?

ANSWER
connectTimeout is how long to wait for the TCP handshake to complete; it fires when the target host is unreachable or not accepting connections. readTimeout (socket timeout) is how long to wait for data bytes to arrive after a connection is established; it fires when the server accepted the connection but is slow to respond. They are independent failure modes: a service can be quick to connect to but slow to respond (requiring a longer read timeout), or the reverse. Both must be configured explicitly, because the defaults in SimpleClientHttpRequestFactory are -1, which leaves the JDK's default of no timeout in place.
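The read-timeout half of this answer can be observed with the JDK's `HttpURLConnection`, the client that SimpleClientHttpRequestFactory delegates to. This is a sketch under assumptions: `ReadTimeoutSketch` and `probe` are hypothetical names, and the "server" is just a listening socket that never replies. The connect phase succeeds because the operating system completes the TCP handshake for a listening socket, so only the read timeout can fire.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.ServerSocket;
import java.net.SocketTimeoutException;
import java.net.URI;

class ReadTimeoutSketch {

    /** Calls a local socket that accepts TCP connections but never sends a response. */
    static String probe(int connectTimeoutMs, int readTimeoutMs) {
        try (ServerSocket silentServer = new ServerSocket(0)) { // port 0: any free port
            URI uri = URI.create("http://127.0.0.1:" + silentServer.getLocalPort() + "/slow");
            HttpURLConnection conn = (HttpURLConnection) uri.toURL().openConnection();
            conn.setConnectTimeout(connectTimeoutMs); // bounds the TCP handshake
            conn.setReadTimeout(readTimeoutMs);       // bounds each wait for response bytes
            try {
                return "status " + conn.getResponseCode();
            } catch (SocketTimeoutException e) {
                // The handshake succeeded, so this is the read timeout firing.
                return "timed out waiting for response";
            } finally {
                conn.disconnect();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        System.out.println(probe(1000, 200));
        System.out.println("elapsed ms: " + (System.nanoTime() - start) / 1_000_000);
    }
}
```

With the read timeout removed (a value of 0), the same call would block until the listening socket is closed, which is the thread-starvation behaviour the explicit settings are meant to prevent.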
FAQ · 7 QUESTIONS

Frequently Asked Questions

01
Does setting a timeout on Resilience4j TimeLimiter stop the underlying database query?
02
What happens if a Feign client timeout fires but the downstream service already completed the operation?
03
Should I set timeouts at the Kubernetes Service/Ingress level or in the application?
04
How do I configure different timeout values for different Feign client methods?
05
What is the connection pool acquisition timeout in HikariCP and when does it matter?
06
Can I use CompletableFuture.orTimeout() instead of Resilience4j TimeLimiter?
07
How should I handle timeout configuration in integration tests?