Production Debugging in Spring Boot: Fix It Before It Fixes You
Real production Spring Boot debugging scenarios.
- Connection pool exhaustion looks like slow everything, not OOM.
- Thread dumps + heap dumps are your primary tools. Don't guess. jstack, jmap, and heap.hprof are the holy trinity.
- Slow queries hide behind Hibernate stats. Enable them before the incident.
- Roll back a deployment first. It's the fastest diagnostic step you have.
Think of your Spring Boot app as a busy restaurant kitchen. If the stove breaks (database connection), every order slows down, but the kitchen still looks busy. You don't fix it by yelling at the waiters — you check the gas line. Debugging in production means finding the real broken pipe, not the symptom.
You just got paged at 2:17 AM. Your service is serving HTTP 503s. The health check is failing, but the CPU is idle and memory usage looks normal. Your first instinct? Restart the pod. Don't. That destroys the evidence.
That crucible moment separates junior ops from senior engineers. The ones who restart are hoping the problem goes away. The ones who first snapshot the JVM state — thread dump, heap dump, JMX metrics — are hunting. I've seen teams burn 8 hours on a mystery that was a single, misconfigured connection pool timeout.
In my second year running a payment processing service, we lost $40k in failed transactions because nobody knew that spring.datasource.hikari.connection-timeout defaults to 30 seconds. Thirty seconds of stalled requests cascading through a cluster. That's the kind of incident that teaches you to read configuration the way a diver reads an oxygen tank gauge.
This isn't a tutorial. It's a field manual. Every pattern here comes from a real outage I personally debugged or cleaned up after. I'm going to show you the exact commands, the exact tools, and the exact thought process. You'll learn why HikariPool-1 - Connection is not available, request timed out after 30000ms is almost never your database's fault. And you'll learn how to prove that in under five minutes.
The Connection Pool is a Liar: How HikariCP Hides Real Problems
HikariCP is the best connection pool in the Java world. It's fast, lightweight, and rarely the source of a bug. But it's the first thing people blame when requests slow down. Stop blaming the pool. The pool is just a mirror. It reflects your database's behavior.
I witnessed a post-mortem where a team doubled maximum-pool-size from 10 to 50, then 50 to 200. The system collapsed faster. Why? They had a single query running full table scans that held locks for 30+ seconds. More connections meant more concurrent slow queries, which meant more contention, which meant more timeouts. The pool wasn't the problem — it was the canary.
Here's the pattern: Threads wait on the pool → pool is exhausted → you think 'need more connections'. Wrong. The real question is: why are existing connections not coming back? Three root causes account for 90% of cases:
1. A @Transactional service method that's too broad, holding a connection while doing I/O (HTTP call, file read).
2. A raw JDBC Connection that was opened but never closed because an exception skipped the close() call (no finally block, no try-with-resources).
3. A database deadlock or long-running query that the pool's connectionTimeout can't outrun.
Fix: Before touching the pool size, find the slow queries. Enable Hibernate stats, read the logs, and find the N+1s or missing indexes. Then, if you still have issues, look at spring.datasource.hikari.max-lifetime: if it's set higher than your DB's wait_timeout, the database closes connections the pool still believes are alive, and you'll get ghost connections. Keep it a little shorter than wait_timeout.
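To make the "connections not coming back" failure concrete, here is a minimal sketch using only the JDK. ToyPool is a hypothetical stand-in for a connection pool (permits play the role of connections); it is not HikariCP's implementation, only the same borrow-with-timeout behaviour in miniature.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Hypothetical toy pool: each permit stands in for one pooled connection. */
final class ToyPool {
    private final Semaphore permits;

    ToyPool(int size) {
        this.permits = new Semaphore(size, true);
    }

    /** Returns true if a "connection" was obtained within the timeout. */
    boolean borrow(long timeoutMs) {
        try {
            return permits.tryAcquire(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Returns a "connection" to the pool. */
    void giveBack() {
        permits.release();
    }

    public static void main(String[] args) {
        ToyPool pool = new ToyPool(2);
        pool.borrow(10);                      // held by a slow request
        pool.borrow(10);                      // held by another slow request
        System.out.println(pool.borrow(50));  // third caller times out
        pool.giveBack();                      // one slow request finally finishes
        System.out.println(pool.borrow(50));  // now the borrow succeeds
    }
}
```

The third borrow fails even though nothing is wrong with the pool itself. A bigger pool only delays that moment; getting the first two borrowers to return sooner removes it.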
Setting spring.datasource.hikari.connection-timeout to a value higher than your database's wait_timeout or innodb_lock_wait_timeout means the pool will never cleanly time out. You'll get 'Connection is not available' after 30s instead of a fast 5s failure. Always keep connection-timeout ≤ DB lock timeout.
Thread Dumps Are Your X-Ray: How to Read Them Under Fire
When the app is slow but not dead, and you have no clear error, a thread dump is your only friend. It shows you exactly what every thread is doing at one instant. It's a snapshot of your entire application's state.
I had a situation where a Kafka consumer kept failing to commit offsets. The logs showed 'commit cannot be completed since the group has already rebalanced'. The team spent 2 days tuning max.poll.interval.ms and session.timeout.ms. Nothing worked. I took a thread dump. Found 12 threads in BLOCKED state fighting over a shared HashMap in a singleton bean. The consumers were processing messages slowly because they were waiting on each other to finish writing to the map. The rebalance was a cascade effect.
Tools: jstack <pid> > dump.txt is your first move. On Kubernetes, use kubectl exec <pod> -- jstack 1 > dump.txt (PID 1 is usually the Java process in a container). Then grep for BLOCKED, WAITING, and TIMED_WAITING. Look for threads stuck on org.apache.tomcat.util.threads.TaskQueue.offer or com.zaxxer.hikari.pool.HikariPool.getConnection. Those are your smoking guns.
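When exec access is awkward, the same per-thread state is available in-process through the JDK's ThreadMXBean. The sketch below is a minimal illustration, not a replacement for jstack; the class name ThreadStateCounter is hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.EnumMap;
import java.util.Map;

final class ThreadStateCounter {
    /** Counts the live threads of this JVM by Thread.State at one instant. */
    static Map<Thread.State, Integer> countByState() {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        ThreadInfo[] infos =
            ManagementFactory.getThreadMXBean().dumpAllThreads(false, false);
        for (ThreadInfo info : infos) {
            if (info != null) {
                counts.merge(info.getThreadState(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        countByState().forEach((state, n) -> System.out.println(state + ": " + n));
    }
}
```

A sudden jump in BLOCKED or TIMED_WAITING counts is the same signal you would grep for in a jstack file, just cheaper to collect repeatedly.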
Pro tip: Take 3 thread dumps, each 10 seconds apart. A single dump can be misleading — a thread might be temporarily waiting on GC. Three dumps show you what's persistent. Compare them. If the same thread is waiting in the same method in all three, you found the culprit.
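The three-dump comparison can be automated. The sketch below assumes the usual jstack text layout (a quoted thread name, a java.lang.Thread.State line, then "at ..." frames) and keeps only threads whose state and top frame are identical in every dump. DumpDiff and its method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DumpDiff {
    private static final String STATE_PREFIX = "java.lang.Thread.State:";

    /** Maps thread name to "STATE @ top frame" for one jstack dump. */
    static Map<String, String> summarize(String dump) {
        Map<String, String> out = new LinkedHashMap<>();
        String name = null;
        String state = null;
        for (String raw : dump.split("\n")) {
            String line = raw.trim();
            if (line.startsWith("\"")) {
                int end = line.indexOf('"', 1);
                name = end > 0 ? line.substring(1, end) : null;
                state = null;
            } else if (name != null && line.startsWith(STATE_PREFIX)) {
                state = line.substring(STATE_PREFIX.length()).trim().split(" ")[0];
            } else if (name != null && state != null && line.startsWith("at ")) {
                out.put(name, state + " @ " + line.substring(3));
                name = null; // only the top frame is recorded
            }
        }
        return out;
    }

    /** Threads whose state and top frame are the same in every dump. */
    static Map<String, String> persistent(List<String> dumps) {
        Map<String, String> common = null;
        for (String dump : dumps) {
            Map<String, String> current = summarize(dump);
            if (common == null) {
                common = current;
            } else {
                common.entrySet().removeIf(e -> !e.getValue().equals(current.get(e.getKey())));
            }
        }
        return common == null ? Map.of() : common;
    }
}
```

Idle worker threads parked in their queue will also show up as persistent, so read the result with the frame in mind: a thread stuck in your own code or in a pool's getConnection is the interesting one.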
Exec into the pod. Use kubectl logs <pod> --previous to get the last log before a crash, but for thread dumps, exec is fine. Or better: expose the Spring Actuator threaddump endpoint behind authentication. Then curl it.
jstack fluency matters. Recognize BLOCKED on a synchronized block vs WAITING on a LockSupport.park(). They mean different things.
Heap Dumps: The Art of Finding the 1MB Leak in 25GB
Heap dumps scare most engineers. They're huge, slow to analyze, and the tooling is arcane. But they're the only way to find the memory leak that's been slowly killing your service for days.
I once analyzed a 12GB heap dump from a production instance that was crashing every 31 hours. The dominant class was byte[]. I sorted by retained heap. The top consumer was a map in a singleton bean holding byte[] keys. Turned out a developer had used a ConcurrentHashMap as a cache, but the keys were generated from UUID.randomUUID(): every request created a new key, and nothing ever evicted them. 12GB of byte arrays that would never be read again.
Start with jmap -dump:live,format=b,file=/tmp/heap.hprof <pid>. The live flag triggers a full GC first, so you only dump live objects. That's critical — it removes the noise of garbage. Then open with Eclipse MAT (or JProfiler if you're on a Mac). Use the 'Leak Suspects' report. It finds the single biggest retained object set. 9 times out of 10, that's your leak.
Common patterns: ThreadLocal variables that accumulate data per request thread (e.g., an MDC context not cleared), HttpClient response bodies that weren't closed, and @Cacheable caches without @CacheEvict or TTL configuration.
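The "cache without eviction" pattern has a small JDK-only counterexample. The sketch below is one way to bound a map-based cache, assuming least-recently-used eviction is acceptable; BoundedCache is a hypothetical name, and the class is not thread-safe on its own.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical LRU cache: evicts the least recently used entry past maxEntries. */
final class BoundedCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    BoundedCache(int maxEntries) {
        super(16, 0.75f, true); // true = iterate in access order, which gives LRU
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called by LinkedHashMap after each insertion.
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        BoundedCache<String, byte[]> cache = new BoundedCache<>(2);
        cache.put("a", new byte[8]);
        cache.put("b", new byte[8]);
        cache.get("a");              // "a" becomes most recently used
        cache.put("c", new byte[8]); // evicts "b"
        System.out.println(cache.keySet());
    }
}
```

In a real service you would wrap it with Collections.synchronizedMap or use a dedicated caching library. The point is only that the map has an upper bound, so per-request keys cannot grow the heap forever.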
Treat -XX:+HeapDumpOnOutOfMemoryError as the must-have JVM arg.
Actuator is Your Emergency Dashboard: Use It Before It's an Emergency
Never set management.endpoints.web.exposure.include=* with no auth. Attackers could change log levels to enable remote code execution via expression injection in some configurations.
Transactional Pitfalls: The @Transactional That Breaks Everything
@Transactional seems simple. It's not. The most common bug I see is a service method annotated @Transactional that internally makes a REST call to another microservice. The HTTP call takes 2 seconds. The database connection stays open for those 2 seconds. That's two seconds of holding a connection that could serve another request.
In a high-traffic system with 10 pool connections and 100 requests/second, only 10 requests can be in the transaction at any time. The rest are queued at the pool. Response times spike. The pool exhausts. Every request fails.
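The arithmetic behind that collapse is Little's Law: connections in use equal arrival rate times hold time. A minimal sketch using the numbers from the example above (100 requests/second, 2 seconds per transaction, 10 connections); the class and method names are hypothetical.

```java
final class PoolMath {
    /** Little's Law: average connections in use = arrival rate x hold time. */
    static double connectionsNeeded(double requestsPerSecond, double holdSeconds) {
        return requestsPerSecond * holdSeconds;
    }

    /** Upper bound on sustained throughput for a fixed pool size. */
    static double maxThroughput(int poolSize, double holdSeconds) {
        return poolSize / holdSeconds;
    }

    public static void main(String[] args) {
        System.out.println(connectionsNeeded(100, 2.0)); // demand with the HTTP call inside
        System.out.println(maxThroughput(10, 2.0));      // what 10 connections can sustain
        System.out.println(connectionsNeeded(100, 0.02)); // demand if the hold time drops to 20 ms
    }
}
```

With a 2 second hold, 100 requests/second need about 200 connections, while a pool of 10 sustains only 5 requests/second. Moving the HTTP call out of the transaction so the hold time drops to an assumed 20 ms brings the demand down to about 2 connections.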
Self-invocation is another classic. If you have a @Transactional method in a Spring bean, and you call it from another method in the same class, the annotation is ignored. Spring's AOP proxies only intercept external calls. I've debugged a production issue where a save() that should have been transactional was not, causing partial writes and data corruption. The fix: either call the method from another class, or inject the proxy (awful pattern, but it works, provided proxy exposure is enabled with exposeProxy = true): ((YourService) AopContext.currentProxy()).foo()
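The mechanism is easy to reproduce without Spring. The sketch below uses a plain JDK dynamic proxy as a stand-in for Spring's transactional proxy; OrderService, OrderServiceImpl, and the recording interceptor are all hypothetical, and Spring's real proxies do more, but the call path is the same.

```java
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

interface OrderService {
    void placeOrder(); // plain method
    void save();       // imagine @Transactional on this one
}

final class OrderServiceImpl implements OrderService {
    @Override
    public void placeOrder() {
        save(); // this.save(): a direct call that never passes through the proxy
    }

    @Override
    public void save() {
        // database writes would go here
    }
}

final class SelfInvocationDemo {
    /** Wraps target in a proxy that records every call it intercepts. */
    static OrderService proxied(OrderService target, List<String> intercepted) {
        return (OrderService) Proxy.newProxyInstance(
            OrderService.class.getClassLoader(),
            new Class<?>[] { OrderService.class },
            (proxy, method, args) -> {
                intercepted.add(method.getName()); // where a transaction would begin
                return method.invoke(target, args);
            });
    }

    public static void main(String[] args) {
        List<String> intercepted = new ArrayList<>();
        OrderService service = proxied(new OrderServiceImpl(), intercepted);
        service.placeOrder();
        System.out.println(intercepted); // prints [placeOrder]
    }
}
```

Only placeOrder is intercepted. The inner save() runs on the raw target, so any behaviour attached to the proxy, such as opening a transaction, never happens for it.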
Isolation levels matter too. @Transactional(isolation = Isolation.REPEATABLE_READ) on Postgres can cause serialization failures under high contention: "could not serialize access due to concurrent update" (at SERIALIZABLE you will also see "could not serialize access due to read/write dependencies among transactions"). If you don't handle CannotSerializeTransactionException properly, the entire request fails.
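Handling a serialization failure usually means re-running the whole unit of work. The sketch below shows that shape with JDK types only. SerializationFailure is a hypothetical stand-in for whatever exception your data layer throws, and backoff between attempts is left out for brevity.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for the exception raised on a serialization failure. */
final class SerializationFailure extends RuntimeException {
    SerializationFailure(String message) {
        super(message);
    }
}

final class TxRetry {
    /** Re-runs the unit of work, up to maxAttempts times, when it fails to serialize. */
    static <T> T run(int maxAttempts, Supplier<T> unitOfWork) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        SerializationFailure last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return unitOfWork.get(); // each attempt must start a fresh transaction
            } catch (SerializationFailure e) {
                last = e; // the failed transaction was rolled back, so a retry is safe
            }
        }
        throw last; // every attempt failed: surface the last failure
    }
}
```

In a Spring application the retry has to sit outside the transactional method, so that each attempt gets a new transaction rather than reusing the one that already failed.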
Use @Transactional(propagation = Propagation.REQUIRES_NEW) for long-running sub-operations that should not hold the outer transaction. Or better: extract the DB work into a separate service class. Self-invocation bugs vanish when you inject the other service.
Configuration That Saves Your Weekend: loggers, delays, and flags
I have a standard set of configurations I add to every Spring Boot project before it goes to production. They've saved me more than once.
spring.datasource.hikari.leak-detection-threshold=30000 — This logs a stack trace when a connection is held longer than 30 seconds. It's the fastest way to find a connection leak. Don't run it in high-traffic dev for long, but turn it on immediately during an incident.
logging.level.org.springframework.transaction.interceptor=TRACE — Logs every transaction begin and commit. It's verbose but invaluable when debugging why a transaction is taking too long.
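spring.jpa.properties.hibernate.generate_statistics=true: Logs per-session metrics, including how many JDBC statements were prepared and executed and how long they took. An N+1 shows up as a single request executing on the order of 100 JDBC statements to load 100 entities. To see each SQL statement itself, also set logging.level.org.hibernate.SQL=DEBUG.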
server.tomcat.threads.max=200 combined with server.tomcat.accept-count=100 — Controls thread pool size. If your app is I/O heavy (many external calls), increase max threads. If CPU-bound, decrease it. Tune this before you need to.
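And the one that's saved me more times than any other: management.endpoint.health.probes.enabled=true. This exposes separate liveness and readiness groups under the health endpoint (/actuator/health/liveness and /actuator/health/readiness) for Kubernetes probes. You can be 'live' but not 'ready'. For example, when the database is down, you want Kubernetes to stop routing traffic to you (readiness fails) but not kill the pod (liveness passes). This prevents the restart loop during a DB outage. Note that the readiness group does not include the database check by default; add it with management.endpoint.health.group.readiness.include=readinessState,db.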
Don't set leak-detection-threshold=10000 (10 seconds) and forget it in production. In one incident it logged a stack trace for every connection held for more than 10 seconds, and the log volume caused a 200GB disk fill in 4 hours. Use this flag sparingly in production: 30 seconds is safe, 10 seconds is risky.
Add leak-detection-threshold, generate_statistics, and probes.enabled=true to every production app.
The Silent Connection Pool Leak That Took Down Payments
The code called DataSourceUtils.getConnection() inside a loop without closing it in a finally block. Intermittent exceptions in the loop skipped the close. The thread-local held the connection, so HikariCP's eviction thread never saw it. After ~500 requests, the pool had 10 leaked connections. After 1000, it was dead.
The fix included adding @Transactional on the service method so Spring manages the connection lifecycle, replacing the raw JDBC code with Spring's JdbcTemplate, which automatically closes connections, and setting spring.datasource.hikari.leak-detection-threshold=60000 to alert on any connection held >60s.
- If you manually open a JDBC connection in Spring Boot, you're writing buggy code.
- Use JdbcTemplate or @Transactional.
- Always.
- There is no exception.
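The leak described in this incident comes down to an exception jumping over a close call. The sketch below reproduces it with a hypothetical FakeConnection and an open-connection counter, then shows the try-with-resources form that closes on every path. It illustrates the mechanism only; it is not the code from the incident.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class LeakDemo {
    /** Number of FakeConnection instances opened and not yet closed. */
    static final AtomicInteger open = new AtomicInteger();

    /** Hypothetical stand-in for a pooled JDBC connection. */
    static final class FakeConnection implements AutoCloseable {
        FakeConnection() {
            open.incrementAndGet();
        }

        @Override
        public void close() {
            open.decrementAndGet();
        }
    }

    /** Leaks when the work throws: close() is never reached. */
    static void leaky(boolean fail) {
        FakeConnection c = new FakeConnection();
        if (fail) {
            throw new IllegalStateException("intermittent failure");
        }
        c.close();
    }

    /** Closes on every path, including when the work throws. */
    static void safe(boolean fail) {
        try (FakeConnection c = new FakeConnection()) {
            if (fail) {
                throw new IllegalStateException("intermittent failure");
            }
        }
    }
}
```

After one failing call to leaky the counter stays at 1, and it grows by one with each further failure; safe leaves it unchanged. For connections obtained through Spring's DataSourceUtils, the matching release call is DataSourceUtils.releaseConnection(con, dataSource), placed in a finally block.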
For pool exhaustion, run jstack <pid> | grep -A 20 'HikariPool' to see which threads are waiting. Next, check the database with SELECT * FROM pg_stat_activity WHERE state = 'active' for long-running queries or locks. Finally, enable HikariCP leak detection with spring.datasource.hikari.leak-detection-threshold=30000 in your next deployment.
For a suspected heap leak, add -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heap.hprof to JVM args if not already present. Use jcmd <pid> GC.heap_dump /tmp/heap.hprof now. Analyze with Eclipse MAT. Look for byte[] instances held by ThreadLocal or HttpClient response bodies. 90% of the time, it's a stream that wasn't consumed, or a cache without eviction.
For slow transactions, enable logging.level.org.springframework.transaction.interceptor=TRACE. Check if your @Transactional(readOnly = true) is actually using a read-only connection: if your DB driver doesn't support it, it's ignored. Profile with spring.jpa.properties.hibernate.generate_statistics=true. Look for N+1 queries printed in logs.
For memory growth outside the heap, check /proc/<pid>/status for VmRSS. Add -XX:NativeMemoryTracking=detail then jcmd <pid> VM.native_memory summary. Common culprits: Spring Boot with Netty, gRPC, or image processing libraries.
- jstack <pid> | grep -A 30 'HikariPool'
- SELECT * FROM pg_locks WHERE NOT granted;
- leak-detection-threshold=30000
Key takeaways
- Mind your @Transactional boundaries. Never make an HTTP call inside a transaction.
- Enable leak-detection-threshold and probes.enabled=true before you need them.
Common mistakes to avoid
- Using raw JDBC Connection inside a Spring Boot service without closing in finally block. Fix: use JdbcTemplate, which auto-closes connections, or add @Transactional to the method and inject EntityManager instead.
- Setting spring.datasource.hikari.maximum-pool-size too high (e.g. 200).
- Not enabling Hibernate statistics before an incident. Fix: add spring.jpa.properties.hibernate.generate_statistics=true to application.yml. Turn it on now. Restart. You'll get detailed query logs.
- Calling a @Transactional method from within the same class (self-invocation).
- Exposing Actuator endpoints without authentication. Fix: set spring.security.user.name=admin and spring.security.user.password=${ACTUATOR_PASSWORD}. Use a secrets manager for the password. Or lock down via network policies.
Interview Questions on This Topic
A Spring Boot service starts returning 503 errors after running for 12 hours. Thread dumps show many threads parked in TIMED_WAITING state inside HikariPool.getConnection. What do you check first?
Run SELECT * FROM pg_stat_activity WHERE state != 'idle' to find queries running longer than 30 seconds. Then check if any @Transactional method is making a slow HTTP call while holding a connection.