Mid 6 min · May 23, 2026

Production Debugging in Spring Boot: Fix It Before It Fixes You

Real production Spring Boot debugging scenarios.

Naren · Founder
Plain-English first. Then code. Then the interview question.
Quick Answer
  • Connection pool exhaustion looks like slow everything, not OOM.
  • Thread dumps + heap dumps are your primary tools. Don't guess.
  • jstack, jmap, and heap.hprof are the holy trinity.
  • Slow queries hide behind Hibernate stats. Enable them before the incident.
  • Roll back a deployment first. It's the fastest diagnostic step you have.
Definition · ~90s read
What is Production Debugging in Spring Boot?

Production debugging in Spring Boot is the skill of diagnosing and fixing failures under live traffic, without bringing the system down. It's not unit testing. It's not running in your IDE. It's connecting to a running JVM, dumping thread stacks, reading heap dumps, and changing log levels on the fly.

Real debugging means understanding that OutOfMemoryError: Java heap space is rarely about heap space — it's about a leak, a config mistake, or a runaway batch process. It means knowing that a HikariPool-1 - Connection is not available error often hides behind a deadlocked transaction or a slow query that never returned.

It means having the reflexes to toggle logging.level.org.springframework.transaction.interceptor=TRACE without restarting the pod.

This guide gives you the battle-tested patterns. No theory. No fluff. Just the commands and mental models that prevent 3AM pages.

Plain-English First

Think of your Spring Boot app as a busy restaurant kitchen. If the stove breaks (database connection), every order slows down, but the kitchen still looks busy. You don't fix it by yelling at the waiters — you check the gas line. Debugging in production means finding the real broken pipe, not the symptom.

You just got paged at 2:17 AM. Your service is serving HTTP 503s. The health check is failing, but the CPU is idle and memory usage looks normal. Your first instinct? Restart the pod. Don't. That destroys the evidence.

That crucible moment separates junior ops from senior engineers. The ones who restart are hoping the problem goes away. The ones who first snapshot the JVM state — thread dump, heap dump, JMX metrics — are hunting. I've seen teams burn 8 hours on a mystery that was a single, misconfigured connection pool timeout.

In my second year running a payment processing service, we lost $40k in failed transactions because nobody knew that spring.datasource.hikari.connection-timeout defaults to 30 seconds. Thirty seconds of stalled requests cascading through a cluster. That's the kind of incident that teaches you to read configuration the way a diver reads an oxygen tank gauge.

This isn't a tutorial. It's a field manual. Every pattern here comes from a real outage I personally debugged or cleaned up after. I'm going to show you the exact commands, the exact tools, and the exact thought process. You'll learn why HikariPool-1 - Connection is not available, request timed out after 30000ms is almost never your database's fault. And you'll learn how to prove that in under five minutes.

The Connection Pool is a Liar: How HikariCP Hides Real Problems

HikariCP is the best connection pool in the Java world. It's fast, lightweight, and rarely the source of a bug. But it's the first thing people blame when requests slow down. Stop blaming the pool. The pool is just a mirror. It reflects your database's behavior.

I witnessed a post-mortem where a team doubled maximum-pool-size from 10 to 50, then 50 to 200. The system collapsed faster. Why? They had a single query running full table scans that held locks for 30+ seconds. More connections meant more concurrent slow queries, which meant more contention, which meant more timeouts. The pool wasn't the problem — it was the canary.

Here's the pattern: Threads wait on the pool → pool is exhausted → you think 'need more connections'. Wrong. The real question is: why are existing connections not coming back? Three root causes account for 90% of cases:
1. A @Transactional service method that's too broad, holding a connection while doing I/O (HTTP call, file read).
2. A raw JDBC Connection that was opened but never closed due to an exception skirting the finally block.
3. A database deadlock or long-running query that the pool's connectionTimeout can't outrun.
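Root cause 2 is the easiest to see in isolation. The sketch below is a toy, not HikariCP: ToyPool and ToyConnection are hypothetical stand-ins, with a Semaphore playing the role of a 10-connection pool. It only illustrates how an exception that skips close() permanently removes a connection from circulation, and how try-with-resources hands it back.

```java
import java.util.concurrent.Semaphore;

class PoolLeakDemo {
    /** Hypothetical toy pool: each permit stands in for one pooled connection. */
    static final class ToyPool {
        private final Semaphore permits;
        ToyPool(int size) { permits = new Semaphore(size); }

        ToyConnection borrow() {
            if (!permits.tryAcquire()) {
                throw new IllegalStateException("Connection is not available");
            }
            return new ToyConnection(this);
        }

        int available() { return permits.availablePermits(); }
    }

    /** Hypothetical toy connection: close() returns the permit to the pool. */
    static final class ToyConnection implements AutoCloseable {
        private final ToyPool pool;
        private boolean closed;
        ToyConnection(ToyPool pool) { this.pool = pool; }

        @Override
        public void close() {
            if (!closed) {
                closed = true;
                pool.permits.release();
            }
        }
    }

    /** The bug: an exception between borrow() and close() skips the close. */
    static void leakyWork(ToyPool pool) {
        ToyConnection c = pool.borrow();
        if (true) throw new RuntimeException("boom");
        c.close(); // never reached
    }

    /** The fix: try-with-resources closes the connection on every exit path. */
    static void safeWork(ToyPool pool) {
        try (ToyConnection c = pool.borrow()) {
            throw new RuntimeException("boom");
        }
    }

    /** Runs `failures` failing calls against a pool of 10 and reports what is left. */
    static int availableAfter(boolean useSafe, int failures) {
        ToyPool pool = new ToyPool(10);
        for (int i = 0; i < failures; i++) {
            try {
                if (useSafe) safeWork(pool); else leakyWork(pool);
            } catch (RuntimeException expected) {
                // the caller sees the failure either way; only the pool state differs
            }
        }
        return pool.available();
    }

    public static void main(String[] args) {
        System.out.println("leaky path, connections left: " + availableAfter(false, 4)); // 6
        System.out.println("safe path, connections left:  " + availableAfter(true, 4));  // 10
    }
}
```

With real JDBC the same shape applies to Connection, Statement, and ResultSet, all of which are AutoCloseable.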

Fix: Before touching the pool size, find the slow queries. Enable Hibernate stats, read the logs, and find the N+1s or missing indexes. Then, if you still have issues, look at spring.datasource.hikari.maxLifetime: if it's set higher than your DB's wait_timeout, the database closes connections the pool still believes are alive, and you'll get ghost connections.

Production Trap:
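connection-timeout is how long a caller waits to borrow a connection from the pool, not how long a query may run. If spring.datasource.hikari.connection-timeout is higher than your database's lock timeout (innodb_lock_wait_timeout on MySQL, lock_timeout on Postgres), callers keep queuing behind work the database has already given up on. You'll get 'Connection is not available' after 30s instead of a fast 5s failure. Always keep connection-timeout ≤ DB lock timeout.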
Production Insight
A connection pool exhausted at 20 connections is a crisis. Exhausted at 200 is the same crisis, just more expensive.
Key Takeaway
Before increasing pool size, find the query or transaction holding connections too long. The pool size is a symptom, not the cure.

Thread Dumps Are Your X-Ray: How to Read Them Under Fire

When the app is slow but not dead, and you have no clear error, a thread dump is your only friend. It shows you exactly what every thread is doing at one instant. It's a snapshot of your entire application's state.

I had a situation where a Kafka consumer kept failing to commit offsets. The logs showed 'commit cannot be completed since the group has already rebalanced'. The team spent 2 days tuning max.poll.interval.ms and session.timeout.ms. Nothing worked. I took a thread dump. Found 12 threads in BLOCKED state fighting over a shared HashMap in a singleton bean. The consumers were processing messages slowly because they were waiting on each other to finish writing to the map. The rebalance was a cascade effect.

Tools: jstack <pid> > dump.txt is your first move. On Kubernetes, use kubectl exec <pod> -- jstack 1 > dump.txt (PID 1 is usually the Java process in a container). Then grep for BLOCKED, WAITING, and TIMED_WAITING. Look for request threads stuck in com.zaxxer.hikari.pool.HikariPool.getConnection: that's your smoking gun for pool starvation. Threads parked in org.apache.tomcat.util.threads.TaskQueue are idle Tomcat workers waiting for requests; if you find none of them, the request thread pool itself is saturated.

Pro tip: Take 3 thread dumps, each 10 seconds apart. A single dump can be misleading — a thread might be temporarily waiting on GC. Three dumps show you what's persistent. Compare them. If the same thread is waiting in the same method in all three, you found the culprit.
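The "compare three dumps" idea can also be sketched in-process with the JDK's ThreadMXBean. This is an illustration of the comparison logic, not a replacement for jstack; StuckThreadFinder is a hypothetical helper name, and it assumes you can run code inside the target JVM. It keeps only threads that are non-RUNNABLE and sit in the same state at the same top frame in every sample.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;

class StuckThreadFinder {
    /** One sample: thread id -> "name STATE at topFrame" for every non-RUNNABLE thread. */
    static Map<Long, String> snapshot() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        Map<Long, String> out = new HashMap<>();
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info == null || info.getThreadState() == Thread.State.RUNNABLE) continue;
            StackTraceElement[] stack = info.getStackTrace();
            if (stack.length == 0) continue;
            out.put(info.getThreadId(),
                    info.getThreadName() + " " + info.getThreadState() + " at " + stack[0]);
        }
        return out;
    }

    /** Takes `samples` snapshots and keeps threads whose description never changed. */
    static Map<Long, String> persistent(int samples, long intervalMillis) {
        Map<Long, String> stuck = snapshot();
        for (int i = 1; i < samples; i++) {
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
            Map<Long, String> next = snapshot();
            stuck.entrySet().removeIf(e -> !e.getValue().equals(next.get(e.getKey())));
        }
        return stuck;
    }

    public static void main(String[] args) {
        // Three samples, 10 seconds apart, as suggested in the text.
        persistent(3, 10_000).values().forEach(System.out::println);
    }
}
```

Idle pool workers will show up too, since they also wait in one place. The interesting entries are the ones parked inside your own code or inside a pool's getConnection call.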

Senior Shortcut:
In Kubernetes, you don't always need to exec into the pod. kubectl logs <pod> --previous gets you the last logs from the container that crashed. For thread dumps, exec is fine. Or better: expose the Spring Actuator threaddump endpoint behind authentication. Then curl it.
Production Insight
A thread dump taken under load is worth 10,000 lines of logs.
Key Takeaway
Learn jstack fluency. Recognize BLOCKED on a synchronized block vs WAITING on a LockSupport.park(). They mean different things.
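To make the BLOCKED versus WAITING distinction concrete, here is a small JDK-only sketch. ThreadStateDemo and its thread names are made up for illustration. One thread tries to enter a synchronized block whose monitor is held by someone else, the other calls LockSupport.park(), and the demo reads back the states a thread dump would report.

```java
import java.util.concurrent.locks.LockSupport;

class ThreadStateDemo {
    /** Spins (for at most ~5 seconds) until the thread reaches the wanted state. */
    static Thread.State awaitState(Thread t, Thread.State wanted) {
        long deadline = System.nanoTime() + 5_000_000_000L;
        while (t.getState() != wanted && System.nanoTime() < deadline) {
            Thread.onSpinWait();
        }
        return t.getState();
    }

    /** Returns { state of the monitor waiter, state of the parked thread }. */
    static Thread.State[] observe() {
        Object monitor = new Object();
        Thread blocked = new Thread(() -> { synchronized (monitor) { } }, "blocked-on-monitor");
        Thread parked = new Thread(() -> LockSupport.park(), "parked-by-locksupport");
        blocked.setDaemon(true);
        parked.setDaemon(true);

        Thread.State[] seen = new Thread.State[2];
        synchronized (monitor) {            // we own the monitor, so `blocked` cannot enter
            blocked.start();
            parked.start();
            seen[0] = awaitState(blocked, Thread.State.BLOCKED);
            seen[1] = awaitState(parked, Thread.State.WAITING);
        }
        LockSupport.unpark(parked);         // let the parked thread finish
        return seen;
    }

    public static void main(String[] args) {
        Thread.State[] seen = observe();
        System.out.println("synchronized contention -> " + seen[0]);
        System.out.println("LockSupport.park()      -> " + seen[1]);
    }
}
```

BLOCKED points at a specific monitor owned by another thread, which the dump names. WAITING via park() is what java.util.concurrent locks and queues use, so you have to read further down the stack to see what the thread is actually waiting for.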

Heap Dumps: The Art of Finding the 1MB Leak in 25GB

Heap dumps scare most engineers. They're huge, slow to analyze, and the tooling is arcane. But they're the only way to find the memory leak that's been slowly killing your service for days.

I once analyzed a 12GB heap dump from a production instance that was crashing every 31 hours. The dominant class was byte[]. I sorted by retained heap. The top consumer was a ConcurrentHashMap in a singleton bean. Turned out a developer had used it as a cache, but the keys were generated from UUID.randomUUID(): every request created a new key, no entry was ever read again, and nothing ever evicted them. 12GB of byte arrays that were still reachable and completely useless.
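A minimal sketch of the missing piece, a size bound, using only the JDK. This is not the code from that incident, and in a real Spring Boot app you would more likely reach for a cache library with TTL and size limits. BoundedCacheDemo and lruCache are hypothetical names; the technique is LinkedHashMap in access order with removeEldestEntry.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

class BoundedCacheDemo {
    /** A small LRU cache: once maxEntries is exceeded, the least recently used entry is dropped. */
    static <K, V> Map<K, V> lruCache(int maxEntries) {
        return Collections.synchronizedMap(new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        });
    }

    /** Simulates the leak pattern: every "request" inserts under a brand-new random key. */
    static int sizeAfter(int requests, int maxEntries) {
        Map<String, byte[]> cache = lruCache(maxEntries);
        for (int i = 0; i < requests; i++) {
            cache.put(UUID.randomUUID().toString(), new byte[1024]);
        }
        return cache.size();
    }

    public static void main(String[] args) {
        // With an unbounded map this would be 10000 entries and growing.
        System.out.println("entries retained: " + sizeAfter(10_000, 100)); // 100
    }
}
```

A bound only caps the damage. Keys that can never repeat mean the cache has a 0% hit rate, so the real fix is a key derived from the request, not a random one.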

Start with jmap -dump:live,format=b,file=/tmp/heap.hprof <pid>. The live flag triggers a full GC first, so you only dump live objects. That's critical — it removes the noise of garbage. Then open with Eclipse MAT (or JProfiler if you're on a Mac). Use the 'Leak Suspects' report. It finds the single biggest retained object set. 9 times out of 10, that's your leak.

Common patterns: ThreadLocal variables that accumulate data per request thread (e.g., an MDC context not cleared), HttpClient response bodies that weren't closed, and @Cacheable caches without @CacheEvict or TTL configuration.
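The ThreadLocal pattern is worth seeing once, because pooled threads make it invisible in tests. In this hedged sketch a single-thread executor stands in for a Tomcat worker thread; REQUEST_CONTEXT and ThreadLocalLeakDemo are hypothetical names. The first task plays one request and the second plays the next request served by the same thread.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ThreadLocalLeakDemo {
    static final ThreadLocal<String> REQUEST_CONTEXT = new ThreadLocal<>();

    /** Returns what the second "request" sees in the ThreadLocal left behind by the first. */
    static String leakedValue(boolean cleanUp) {
        ExecutorService worker = Executors.newFixedThreadPool(1); // one reused thread
        try {
            Runnable firstRequest = () -> {
                REQUEST_CONTEXT.set("user=alice");
                try {
                    // ... handle the request ...
                } finally {
                    if (cleanUp) REQUEST_CONTEXT.remove(); // the line that is usually missing
                }
            };
            Callable<String> secondRequest = REQUEST_CONTEXT::get;

            worker.submit(firstRequest).get();
            return worker.submit(secondRequest).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            worker.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("without remove(): " + leakedValue(false)); // user=alice
        System.out.println("with remove():    " + leakedValue(true));  // null
    }
}
```

The stale value is both a memory leak (it lives as long as the worker thread) and a correctness bug (the next request sees someone else's context).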

Interview Gold:
In an interview, if they ask about memory leaks, say 'The most common cause in Spring Boot is holding references in ThreadLocal or @SessionScope beans without cleaning them up.' Then mention -XX:+HeapDumpOnOutOfMemoryError as the must-have JVM arg.
Production Insight
A heap dump from production is a crime scene. Don't alter the body (restart) before the coroner (MAT) arrives.
Key Takeaway
Don't guess the leak. Let Eclipse MAT's 'Leak Suspects' report tell you. It's rarely where you think.

Actuator is Your Emergency Dashboard: Use It Before It's an Emergency

Never Do This:
Don't expose Actuator endpoints to the public internet without authentication. I've seen a startup using management.endpoints.web.exposure.include=* with no auth. Attackers could change log levels to enable remote code execution via expression injection in some configurations.
Production Insight
If you haven't configured Actuator security on day one, you're building on sand.
Key Takeaway
Enable Actuator from the first commit. Secure it from the second. You'll need thread dumps and log-level changes during an incident, and you want them ready.

Transactional Pitfalls: The @Transactional That Breaks Everything

@Transactional seems simple. It's not. The most common bug I see is a service method annotated @Transactional that internally makes a REST call to another microservice. The HTTP call takes 2 seconds. The database connection stays open for those 2 seconds. That's two seconds of holding a connection that could serve another request.

In a high-traffic system with 10 pool connections and 100 requests/second, only 10 requests can be in the transaction at any time. The rest are queued at the pool. Response times spike. The pool exhausts. Every request fails.
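The arithmetic behind that collapse is Little's Law: connections in use ≈ arrival rate × time each request holds a connection. The sketch below plugs in the numbers from the text (100 requests/second, a 2-second hold, 10 connections) and compares them with an assumed 20 ms hold once the HTTP call is moved outside the transaction. The 20 ms figure and the PoolSizingMath name are illustrative assumptions.

```java
class PoolSizingMath {
    /** Little's Law: average connections busy = arrival rate x hold time. */
    static double connectionsNeeded(int requestsPerSecond, long holdMillis) {
        return requestsPerSecond * holdMillis / 1000.0;
    }

    /** Upper bound on throughput for a fixed pool: each connection serves 1/holdTime requests per second. */
    static double maxThroughput(int poolSize, long holdMillis) {
        return poolSize * 1000.0 / holdMillis;
    }

    public static void main(String[] args) {
        // HTTP call inside the transaction: connection held for ~2 s.
        System.out.println(connectionsNeeded(100, 2000)); // 200.0 connections needed
        System.out.println(maxThroughput(10, 2000));      // 5.0 requests/second possible

        // Assumed: only the DB work (~20 ms) stays inside the transaction.
        System.out.println(connectionsNeeded(100, 20));   // 2.0 connections needed
        System.out.println(maxThroughput(10, 20));        // 500.0 requests/second possible
    }
}
```

At a 2-second hold, a 10-connection pool tops out at 5 requests/second while 100 arrive, so the queue grows by roughly 95 requests every second until callers hit the pool timeout.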

Self-invocation is another classic. If you have a @Transactional method in a Spring bean, and you call it from another method in the same class, the annotation is ignored. Spring's AOP proxies only intercept external calls. I've debugged a production issue where a save() that should have been transactional was not, causing partial writes and data corruption. The fix: either call the method from another class, or go through the proxy (awful pattern, but it works: ((YourService) AopContext.currentProxy()).foo(), which requires exposeProxy = true on the AOP configuration).
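The mechanism is easier to believe once you see it without Spring. The sketch below uses a plain JDK dynamic proxy as a stand-in for Spring's transactional proxy; the "begin tx" and "commit tx" log lines are fake, and OrderService is a hypothetical interface. The proxy wraps every call that arrives from outside, but the internal this.save() call never passes through it.

```java
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

class SelfInvocationDemo {
    interface OrderService {
        void placeOrder();
        void save();
    }

    static final List<String> LOG = new ArrayList<>();

    static class OrderServiceImpl implements OrderService {
        @Override
        public void placeOrder() {
            save(); // really this.save(): a direct call on the target, not on the proxy
        }

        @Override
        public void save() {
            LOG.add("save");
        }
    }

    /** Stand-in for a transactional proxy: logs around every call made through it. */
    static OrderService withFakeTransactions(OrderService target) {
        return (OrderService) Proxy.newProxyInstance(
                OrderService.class.getClassLoader(),
                new Class<?>[] { OrderService.class },
                (proxy, method, args) -> {
                    LOG.add("begin tx for " + method.getName());
                    try {
                        return method.invoke(target, args);
                    } finally {
                        LOG.add("commit tx for " + method.getName());
                    }
                });
    }

    static List<String> run() {
        LOG.clear();
        withFakeTransactions(new OrderServiceImpl()).placeOrder();
        return new ArrayList<>(LOG);
    }

    public static void main(String[] args) {
        // Note what is missing: there is no "begin tx for save".
        run().forEach(System.out::println);
    }
}
```

If only save() carried the transactional behaviour, this call path would run with no transaction at all, which is exactly the partial-write scenario described above.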

Isolation levels matter too. @Transactional(isolation = Isolation.REPEATABLE_READ) on Postgres can cause serialization failures under high contention: 'could not serialize access due to concurrent update' (under SERIALIZABLE you'll also see 'could not serialize access due to read/write dependencies among transactions'). If you don't catch CannotSerializeTransactionException and retry the whole transaction, the entire request fails.
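Handling it properly usually means retrying the whole transaction from the start. Here is a minimal sketch of that retry loop with no Spring on the classpath: SerializationFailure is a hypothetical stand-in for the exception your data-access layer throws, and retry is an illustrative helper, not a library API.

```java
import java.util.function.Supplier;

class SerializationRetry {
    /** Hypothetical stand-in for the serialization-failure exception thrown by the data layer. */
    static class SerializationFailure extends RuntimeException {
        SerializationFailure(String message) { super(message); }
    }

    /**
     * Runs the whole unit of work again on serialization failure, up to maxAttempts times.
     * Any other exception propagates immediately.
     */
    static <T> T retry(int maxAttempts, Supplier<T> transactionalWork) {
        if (maxAttempts < 1) throw new IllegalArgumentException("maxAttempts must be >= 1");
        SerializationFailure last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return transactionalWork.get(); // must open and commit its own transaction
            } catch (SerializationFailure e) {
                last = e; // the database rolled this attempt back; safe to start over
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        String result = retry(5, () -> {
            if (++calls[0] < 3) throw new SerializationFailure("could not serialize access");
            return "committed";
        });
        System.out.println(result + " after " + calls[0] + " attempts"); // committed after 3 attempts
    }
}
```

The retry has to sit outside the transactional boundary, so each attempt starts a fresh transaction. A production version would typically add a short randomized backoff between attempts.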

Senior Shortcut:
Use @Transactional(propagation = Propagation.REQUIRES_NEW) for sub-operations that must commit independently of the outer transaction. Be careful: the outer transaction is suspended, not released, so the thread holds two connections at once. Or better: extract the DB work into a separate service class. Self-invocation bugs vanish when you inject the other service.
Production Insight
If your transaction holds a connection while it calls someone else's API, you're designing a failure cascade.
Key Takeaway
Transactions should be as narrow as possible. Never mix I/O with DB work in the same transactional boundary.

Configuration That Saves Your Weekend: loggers, delays, and flags

I have a standard set of configurations I add to every Spring Boot project before it goes to production. They've saved me more than once.

spring.datasource.hikari.leak-detection-threshold=30000 — This logs a stack trace when a connection is held longer than 30 seconds. It's the fastest way to find a connection leak. Don't run it in high-traffic dev for long, but turn it on immediately during an incident.

logging.level.org.springframework.transaction.interceptor=TRACE — Logs every transaction begin and commit. It's verbose but invaluable when debugging why a transaction is taking too long.

spring.jpa.properties.hibernate.generate_statistics=true: collects per-session metrics, including how many JDBC statements ran and how long they took. You'll spot N+1 queries as a session that executed something like 101 JDBC statements to load 100 entities.

server.tomcat.threads.max=200 combined with server.tomcat.accept-count=100 — Controls thread pool size. If your app is I/O heavy (many external calls), increase max threads. If CPU-bound, decrease it. Tune this before you need to.

And the one that's saved me more times than any other: management.endpoint.health.probes.enabled=true. This exposes separate liveness and readiness groups under the health endpoint for Kubernetes probes. You can be 'live' but not 'ready': for example, when the database is down, you want Kubernetes to stop routing traffic to you (readiness fails) but not kill the pod (liveness passes). This prevents the restart loop during a DB outage. Note that the readiness group does not include the database check by default; add it with management.endpoint.health.group.readiness.include=readinessState,db.

The Classic Bug:
I once saw a team set leak-detection-threshold=10000 (10 seconds) and forget it in production. It logged a stack trace every time a connection was held for more than 10 seconds, which their batch jobs did constantly. The log volume caused a 200GB disk fill in 4 hours. Use this flag sparingly in production: 30 seconds is safe, 10 seconds is risky.
Production Insight
The difference between a 30-minute debug and a 3-hour one is often a single configuration flag you turned on before the incident.
Key Takeaway
Proactive configuration is better than reactive debugging. Add leak-detection-threshold, generate_statistics, and probes.enabled=true to every production app.
Production incident · Post-mortem · Severity: high

The Silent Connection Pool Leak That Took Down Payments

Symptom
Payment processing slowed to a crawl. Intermittent timeouts. Alerts showed Nginx 502s and HikariPool exceptions. CPU and memory looked fine. Datadog showed connection pool saturation, then exhaustion.
Assumption
The database is overloaded. Add more replicas. Increase pool size. Call the DBA at 3 AM.
Root cause
A thread-local connection leak. A developer had opened a DataSourceUtils.getConnection() inside a loop without closing it in a finally block. Intermittent exceptions in the loop skipped the close. The thread-local held the connection, so HikariCP's eviction thread never saw it. After ~500 requests, the pool had 10 leaked connections. After 1000, it was dead.
Fix
1. Deploy a rolling restart to clear leaked connections.
2. Add @Transactional on the service method so Spring manages the connection lifecycle.
3. Replace the raw JDBC code with Spring's JdbcTemplate, which automatically closes connections.
4. Set spring.datasource.hikari.leak-detection-threshold=60000 to alert on any connection held >60s.
Key lesson
  • If you manually open a JDBC connection in Spring Boot, you're writing buggy code. Use JdbcTemplate or @Transactional. Always. There is no exception.
Production debug guide: symptom → root cause → fix for the failures that actually happen
Symptom · 01
HTTP requests hang, then timeout after 30s. HikariCP logs: 'Connection is not available, request timed out after 30000ms'.
Fix
Don't restart. Run jstack <pid> | grep -A 20 'HikariPool' to see which threads are waiting. Next, check the database with SELECT * FROM pg_stat_activity WHERE state = 'active' for long-running queries or locks. Finally, enable HikariCP leak detection with spring.datasource.hikari.leak-detection-threshold=30000 in your next deployment.
Symptom · 02
OutOfMemoryError: Java heap space with no obvious memory leak in the code.
Fix
Immediately add -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heap.hprof to JVM args if not already present. Use jcmd <pid> GC.heap_dump /tmp/heap.hprof now. Analyze with Eclipse MAT. Look for byte[] instances held by ThreadLocal or HttpClient response bodies. 90% of the time, it's a stream that wasn't consumed, or a cache without eviction.
Symptom · 03
DB queries run fine in a script but are slow from Spring Boot.
Fix
Enable Spring's transaction logging: logging.level.org.springframework.transaction.interceptor=TRACE. Check if your @Transactional(readOnly = true) is actually using a read-only connection — if your DB driver doesn't support it, it's ignored. Profile with spring.jpa.properties.hibernate.generate_statistics=true. Look for N+1 queries printed in logs.
Symptom · 04
Pod restarts without any Java error, CPU spikes to 100% before crash.
Fix
This is usually a native memory leak. JVM heap looks fine, but native code (e.g., netty direct buffers, JNI) is leaking off-heap. Check /proc/<pid>/status for VmRSS. Add -XX:NativeMemoryTracking=detail then jcmd <pid> VM.native_memory summary. Common culprits: Spring Boot with Netty, gRPC, or image processing libraries.
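Before restarting with Native Memory Tracking, you can get a first reading of NIO direct-buffer usage from inside the JVM through the standard BufferPoolMXBean. A hedged sketch: DirectMemoryProbe is a hypothetical helper, and this counter only covers buffers allocated through ByteBuffer.allocateDirect. Libraries that allocate off-heap memory through other paths, and JNI allocations, will not appear here.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

class DirectMemoryProbe {
    /** Bytes currently held by NIO direct buffers, or -1 if the "direct" pool is not reported. */
    static long directBytesUsed() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long before = directBytesUsed();
        ByteBuffer buffer = ByteBuffer.allocateDirect(16 * 1024 * 1024); // 16 MB off-heap
        long after = directBytesUsed();

        // The JVM heap barely moves, but the process RSS grows by the buffer size.
        System.out.println("direct bytes before: " + before);
        System.out.println("direct bytes after:  " + after);
        System.out.println("buffer capacity:     " + buffer.capacity());
    }
}
```

If this number keeps climbing between samples while heap usage stays flat, direct buffers are a good first suspect. If it stays flat while VmRSS grows, the leak is somewhere this counter can't see, and NMT is the next step.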
Debug Cheat Sheet: commands for fast diagnosis in production
All requests hang → pool exhaustion
Immediate action
Thread dump. Look for threads waiting on a connection.
Commands
jstack <pid> | grep -A 30 'HikariPool'
SELECT * FROM pg_locks WHERE NOT granted;
Fix now
Rollback deployment or restart. Add leak-detection-threshold=30000.
OOM without obvious heavy object creation
Immediate action
Heap dump immediately before restart.
Commands
jcmd <pid> GC.heap_dump /tmp/heap.hprof
jcmd <pid> VM.native_memory summary
Fix now
Add -XX:+HeapDumpOnOutOfMemoryError and analyze with MAT.
Slow responses, high RT, low throughput
Immediate action
Raise the Hibernate stats log level on the fly via Actuator (statistics collection itself must already be enabled).
Commands
curl -X POST localhost:8081/actuator/loggers/org.hibernate.stat -H 'Content-Type: application/json' -d '{"configuredLevel":"TRACE"}'
tail -f /var/log/app/spring.log | grep 'hibernate.stat'
Fix now
Add spring.jpa.properties.hibernate.generate_statistics=true to config and restart.
Cache Strategies in Production
Cache Type | Spring Annotation | Eviction Strategy | Best For
In-Memory (Caffeine) | @Cacheable + @CacheConfig(cacheNames="...") | TTL via spring.cache.caffeine.spec=expireAfterWrite=10m | Small, fast-growing caches like user sessions
Redis | @Cacheable with RedisCacheManager | TTL via spring.cache.redis.time-to-live; manual @CacheEvict | Shared caches across instances, high throughput
Distributed (Hazelcast) | @Cacheable with Hazelcast | TTL + max-size policy | Large, clustered caches needing near-cache speed
Database (JPA 2nd level) | @Cacheable with Hibernate | Region-based eviction | Entity caching that stays consistent with DB writes

Key takeaways

1. Connection pool exhaustion is almost never about the pool size. It's about queries or transactions holding connections too long. Find the slow query first.
2. Thread dumps are free and fast. Take three before you touch anything. They show you the real state of the JVM.
3. Heap dumps don't lie. Eclipse MAT's 'Leak Suspects' report finds the dominating memory consumer in minutes. Don't guess.
4. Narrow your @Transactional boundaries. Never make an HTTP call inside a transaction: you're holding a connection hostage.
5. Enable Hibernate statistics and Actuator endpoints on day one. Configure leak-detection-threshold and probes.enabled=true before you need them.

Common mistakes to avoid

5 patterns
×

Using raw JDBC Connection inside a Spring Boot service without closing in finally block

Symptom
HikariPool connection exhaustion after a few hundred requests
Fix
Replace with JdbcTemplate which auto-closes connections, or add @Transactional to the method and inject EntityManager instead.
×

Setting spring.datasource.hikari.maximum-pool-size too high (e.g. 200)

Symptom
Database CPU spikes, slow queries, connection storms on DB restart
Fix
Size it with the (DB core count × 2) + 1 rule of thumb. For a typical app with fast queries, 10-20 per instance is enough. Increase incrementally, and only after proving the bottleneck is pool size.
×

Not enabling Hibernate statistics before an incident

Symptom
Cannot trace slow queries or N+1 problems in production
Fix
Add spring.jpa.properties.hibernate.generate_statistics=true to application.yml. Turn it on now. Restart. You'll get per-session statement counts and timings in the logs.
×

Calling a @Transactional method from within the same class (self-invocation)

Symptom
Transaction boundary ignored — partial DB writes, no rollback on exception
Fix
Inject the service into itself (circular dependency) or call the method from a different service. Better: extract transactional logic into a separate DAO/service class.
×

Exposing Actuator endpoints without authentication

Symptom
Any external attacker can change log levels, trigger shutdown, or see heap dumps
Fix
With Spring Security on the classpath, set spring.security.user.name=admin and spring.security.user.password=${ACTUATOR_PASSWORD}. Use a secrets manager for the password. Or lock down via network policies.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR
A Spring Boot service starts returning 503 errors after running for 12 h...
Q02SENIOR
What's the difference between `@Transactional(propagation = REQUIRES_NEW...
Q03SENIOR
Your Spring Boot app uses HikariCP with maximum-pool-size=10. During a l...
Q04SENIOR
You have a memory leak in a Spring Boot app. The heap dump shows a `Conc...
Q05JUNIOR
How do you hot-change a log level in a running Spring Boot application w...
Q06JUNIOR
What is the difference between `@RestControllerAdvice` and `@ControllerA...
Q07SENIOR
Explain how a thread dump can help debug a Java deadlock. What patterns ...
Q08SENIOR
Your Spring Boot app has a scheduled task (`@Scheduled`) that runs every...
Q01 of 08 · SENIOR

A Spring Boot service starts returning 503 errors after running for 12 hours. Thread dumps show many threads in BLOCKED state on a ReentrantLock inside HikariPool. What do you check first?

ANSWER
I check the database for long-running queries or deadlocks. The pool lock is not the root cause; it's a consequence. Run SELECT * FROM pg_stat_activity WHERE state != 'idle' ORDER BY query_start and look for queries that have been running longer than 30 seconds. Then check if any @Transactional method is making a slow HTTP call while holding a connection.
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
How do I get a thread dump from a Kubernetes pod running Spring Boot?
02
What's the first thing to check when HikariCP says 'Connection is not available'?
03
Should I use `@EnableTransactionManagement` in Spring Boot?
04
What is the difference between Hibernate's `org.hibernate.stat.Statistics` and `spring.jpa.properties.hibernate.generate_statistics`?
05
Why did my Spring Boot app crash with 'Unable to create native thread' even though heap usage was low?