Spring Boot Production Deployment on Kubernetes
Deploy Spring Boot on Kubernetes with production-grade YAML, readiness/liveness probes, graceful shutdown, HPA, resource limits, and zero-downtime strategies.
- Use /actuator/health/readiness for readiness probe and /actuator/health/liveness for liveness probe — never swap them
- Set spring.lifecycle.timeout-per-shutdown-phase=30s and terminationGracePeriodSeconds=60 for graceful shutdown
- Always set both requests AND limits for CPU and memory: the scheduler places pods by their requests, and utilisation-based HPA computes its percentages against them
- HPA requires metrics-server installed; for custom metrics use Prometheus Adapter or KEDA
- ConfigMap for non-sensitive config, Secret for credentials — never bake secrets into container images
Running Spring Boot on Kubernetes is like managing a restaurant chain. Each pod is a kitchen. The readiness probe checks if the kitchen is ready to take orders (DB connected, caches warm). The liveness probe checks if the kitchen hasn't caught fire. HPA is the manager who opens more kitchens during lunch rush and closes them at night. Graceful shutdown is the 'last orders' call — no new customers, finish the current tables.
Deploying a Spring Boot application to Kubernetes for the first time typically takes an afternoon. Getting that deployment production-ready — with proper health checks, graceful shutdown, resource sizing, autoscaling, and zero-downtime rollouts — takes considerably longer, often because the gap between 'it runs' and 'it is production-safe' is invisible until an incident exposes it.
The most common first mistake is missing or misconfigured health probes. A liveness probe that points to /actuator/health (which includes database health) will kill a healthy pod the moment the database becomes temporarily unavailable. That creates a cascade: the loss of DB connectivity triggers pod restarts, the restarted pods all open fresh connection pools at once, and the surge worsens the DB overload. Each probe must target the endpoint whose health indicators match its purpose.
The second common mistake is omitting graceful shutdown. Without it, the application exits as soon as Kubernetes sends SIGTERM, even though requests are still being processed and some traffic may still be routed to the pod. With a 5-10 second request timeout and no graceful shutdown, every request in flight at deployment time fails, typically surfacing to the client as a 503. At 10 deployments per day, that is 10 bursts of errors per day that accumulate in SLO burn rate dashboards.
Resource requests and limits are equally critical. Without resource requests, the Kubernetes scheduler cannot make informed placement decisions, and a utilisation-based HPA has no baseline because utilisation is calculated as a percentage of the request. Without memory limits, a leak in one pod can consume all node memory, triggering OOM kills or evictions of unrelated pods on the same node.
This guide covers the full production deployment stack: Deployment/Service/ConfigMap/Secret YAML, probe configuration, graceful shutdown setup, HPA with both CPU and custom Prometheus metrics, resource sizing formulas, and the operational checklist for zero-downtime deploys.
Complete Kubernetes Deployment YAML
A production Spring Boot deployment on Kubernetes uses up to eight resource types working together: Deployment, Service, ConfigMap, Secret, HorizontalPodAutoscaler, PodDisruptionBudget, ServiceAccount, and (optionally) a NetworkPolicy. Omitting any of the first seven creates an operational gap.
The Deployment spec defines the pod template, container configuration, resource requests/limits, health probes, lifecycle hooks, and rolling update strategy. The strategy.rollingUpdate section should specify maxUnavailable=0 (zero pods taken down before new ones are ready) and maxSurge=1 (one extra pod spun up during rollout) for true zero-downtime deploys.
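The rollout settings described above might look like the following minimal sketch. The name my-service, the production namespace, the image reference, port 8080, and the replica count are placeholders, not values prescribed by this guide; the resource figures are the starting values quoted in the resource sizing section and should be replaced with measured ones.

```yaml
# Minimal Deployment sketch; names, image, port and replica count are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
  namespace: production
  labels:
    app: my-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-service
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # no old pod is removed before a new pod is Ready
      maxSurge: 1         # at most one extra pod exists during the rollout
  template:
    metadata:
      labels:
        app: my-service
    spec:
      serviceAccountName: my-service   # hypothetical ServiceAccount name
      containers:
        - name: app
          image: registry.example.com/my-service:1.0.0   # hypothetical image
          ports:
            - name: http
              containerPort: 8080
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              cpu: "1"
              memory: 1Gi
```

With maxUnavailable set to 0, the rollout only makes progress when new pods pass their readiness probe, so a broken release stalls instead of replacing healthy pods.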
Resource requests must be set accurately based on observed metrics, not guessed. Under-setting requests causes the scheduler to pack too many pods on one node, leading to noisy-neighbour CPU throttling. Over-setting requests wastes capacity and prevents efficient bin-packing. The target is to set requests at the p50 (median) observed usage and limits at the p99 plus a 20% buffer.
The ServiceAccount with minimal RBAC permissions follows the principle of least privilege. Spring Cloud Kubernetes uses the service account to read ConfigMaps for live configuration reload — scope this permission tightly.
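As an illustration of tight scoping, the sketch below grants a ServiceAccount read-only access to ConfigMaps in a single namespace. All names are placeholders, and the exact verbs and resources Spring Cloud Kubernetes needs depend on which of its features are enabled, so treat the rule list as an assumption to verify against your setup.

```yaml
# Least-privilege sketch; names are hypothetical.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-service
  namespace: production
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role                      # namespaced Role, not a ClusterRole
metadata:
  name: my-service-config-reader
  namespace: production
rules:
  - apiGroups: [""]             # core API group
    resources: ["configmaps"]
    verbs: ["get", "list", "watch"]   # read-only
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: my-service-config-reader
  namespace: production
subjects:
  - kind: ServiceAccount
    name: my-service
    namespace: production
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: my-service-config-reader
```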
A PodDisruptionBudget with minAvailable=1 (or maxUnavailable=1 for larger deployments) ensures that node drains during cluster maintenance do not take down all pods simultaneously. This is frequently overlooked and causes unnecessary downtime during infrastructure maintenance.
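A PodDisruptionBudget of this kind could be written as follows. The name, namespace, and label selector are placeholders and must match the labels on your own pods.

```yaml
# PodDisruptionBudget sketch; name and labels are assumptions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-service
  namespace: production
spec:
  minAvailable: 1          # voluntary evictions must leave at least one pod running
  selector:
    matchLabels:
      app: my-service
```

Note that a budget of minAvailable=1 on a single-replica Deployment blocks node drains entirely, which is one more reason to run at least two replicas.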
Health Probes: Readiness vs Liveness vs Startup
Kubernetes provides three probe types that serve distinct purposes. Confusing them is one of the most common and damaging production misconfigurations.
The liveness probe answers: 'Is this container worth keeping alive?' Kubernetes kills and restarts a container that fails the liveness probe. Therefore, liveness should only check application-internal state — is the JVM running, is the main thread alive, is there no deadlock. It must NOT check external dependencies like databases. If the database goes down, the application is still alive and should remain alive, waiting for the database to recover. Checking external health in the liveness probe causes pod restarts on infrastructure failures, worsening the situation.
The readiness probe answers: 'Is this container ready to receive traffic?' Kubernetes removes a pod from the Service's endpoint pool when it fails the readiness probe, stopping new traffic from routing to it. Readiness SHOULD check external dependencies: is the database connection pool healthy, is the cache warm, are required downstream services reachable. A pod that is failing readiness is still alive and counted in replica count — it just does not receive traffic.
The startup probe answers: 'Has the application finished starting up?' It runs instead of liveness and readiness until it succeeds, giving slow-starting applications (Spring Boot with many auto-configurations, database migration on startup) time to initialise without triggering liveness failures. Once the startup probe succeeds, Kubernetes switches to liveness and readiness probes.
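The three probes might be wired up roughly as below. This is a container-level fragment, not a full manifest; port 8080 and every timing value are assumptions to tune against your own startup and response times. Pointing the startup probe at the liveness endpoint is one common choice, not the only one.

```yaml
# Fragment of spec.template.spec.containers[0]; port and timings are assumptions.
startupProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  periodSeconds: 5
  failureThreshold: 30      # allows up to 5s x 30 = 150s for startup
livenessProbe:
  httpGet:
    path: /actuator/health/liveness    # internal state only
    port: 8080
  periodSeconds: 10
  failureThreshold: 3       # restart after about 30s of consecutive failures
readinessProbe:
  httpGet:
    path: /actuator/health/readiness   # may include external dependencies
    port: 8080
  periodSeconds: 5
  failureThreshold: 3       # removed from endpoints after about 15s of failures
```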
Spring Boot Actuator exposes the liveness and readiness endpoints when management.endpoint.health.probes.enabled=true (there is no separate startup endpoint; a startup probe normally reuses one of the two). The /actuator/health/liveness endpoint reports LivenessState (CORRECT or BROKEN). The /actuator/health/readiness endpoint reports ReadinessState (ACCEPTING_TRAFFIC or REFUSING_TRAFFIC). Your application can read these states via ApplicationAvailability and change them by publishing an AvailabilityChangeEvent.
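On the Spring side, a configuration along these lines enables the probe endpoints. By default the readiness group contains only the application's own readiness state, so external checks have to be added to the group explicitly; including the db indicator here is an assumption that the application has a DataSource and that you want database health to gate traffic.

```yaml
# application.yaml sketch; the db indicator is an assumption about your app.
management:
  endpoint:
    health:
      probes:
        enabled: true                    # exposes /actuator/health/liveness and /readiness
      group:
        liveness:
          include: livenessState         # internal state only
        readiness:
          include: readinessState,db     # readiness also fails when the database is down
```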
Graceful Shutdown
Graceful shutdown is the process by which a Spring Boot application, upon receiving SIGTERM from Kubernetes, stops accepting new requests, allows in-flight requests to complete, then cleanly shuts down its connections and resources. Without it, every rolling deployment or node drain causes visible errors for the requests in flight at the moment of pod termination.
The graceful shutdown sequence in Kubernetes is: (1) Kubernetes marks the pod Terminating and begins removing it from Service endpoints, so new traffic stops being routed to it; (2) in parallel, the kubelet runs the preStop hook and, once the hook returns, sends SIGTERM to the container; (3) the application's shutdown hooks run: Spring stops accepting new requests, waits for in-flight requests to complete (up to the timeout), then stops its SmartLifecycle components in phase order (HTTP server, Kafka consumers, JDBC pool, caches); (4) once terminationGracePeriodSeconds has elapsed since termination began, Kubernetes sends SIGKILL regardless.
Critical timing issue: endpoint removal (step 1) and container termination (step 2) proceed concurrently, and endpoint removal still has to propagate to kube-proxy, ingress controllers, and load balancers. Without a preStop hook, SIGTERM arrives immediately, and the load balancer may still route requests to the pod for a few hundred milliseconds after the application has started shutting down. A preStop sleep of 5 seconds delays SIGTERM long enough to cover this window. The sleep counts against terminationGracePeriodSeconds.
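On the Kubernetes side, the preStop sleep and the grace period could be declared as in this pod-spec fragment. It assumes the container image ships a shell with sleep; distroless images need a different mechanism.

```yaml
# Fragment of spec.template.spec; assumes sh and sleep exist in the image.
terminationGracePeriodSeconds: 60   # must cover the preStop sleep plus application shutdown
containers:
  - name: app
    lifecycle:
      preStop:
        exec:
          # keep serving while endpoint removal propagates; SIGTERM follows the sleep
          command: ["sh", "-c", "sleep 5"]
```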
Spring Boot 2.3+ supports graceful shutdown natively. Set server.shutdown=graceful and spring.lifecycle.timeout-per-shutdown-phase=30s. The timeout is per phase: Spring's SmartLifecycle beans are stopped in ordered phases and each phase gets up to 30 seconds. Total shutdown time could therefore be multiple minutes for complex applications with many lifecycle phases, so terminationGracePeriodSeconds must cover the preStop sleep plus the realistic worst-case total.
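In application.yaml form, the two properties named above look like this; the 30s value is the one used in this guide and should be adjusted to your longest expected request.

```yaml
# application.yaml sketch of the graceful shutdown settings.
server:
  shutdown: graceful                    # stop accepting new requests, drain in-flight ones
spring:
  lifecycle:
    timeout-per-shutdown-phase: 30s     # upper bound for each lifecycle phase
```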
HorizontalPodAutoscaler: CPU and Custom Metrics
The HorizontalPodAutoscaler (HPA) automatically adjusts the replica count of a Deployment based on observed metrics. The most common trigger is CPU utilisation, but production systems often need to scale on custom metrics like request queue depth, Kafka consumer lag, or active WebSocket connections.
CPU-based HPA uses the Kubernetes Metrics Server, which by default scrapes CPU usage from the kubelet every 15 seconds. Utilisation is measured as a percentage of each pod's CPU request. HPA calculates the desired replica count as: ceil(currentReplicas × currentMetricValue / desiredMetricValue). For example, with targetCPUUtilizationPercentage=70, 3 replicas at 90% CPU → ceil(3 × 90/70) = ceil(3.86) = 4 replicas.
Custom metrics HPA requires additional infrastructure. The Prometheus Adapter translates Prometheus metrics into the Kubernetes custom metrics API. KEDA (Kubernetes Event-Driven Autoscaling) is a more powerful alternative that supports scaling to zero and a wider range of metric sources including Kafka consumer lag, RabbitMQ queue depth, and AWS SQS queue length.
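As a sketch of the KEDA route, a ScaledObject that scales on Kafka consumer lag might look like the following. The broker address, consumer group, topic, lag threshold, and replica bounds are all hypothetical. KEDA creates and manages its own HPA for the target, so a workload should not also have a hand-written HPA.

```yaml
# KEDA ScaledObject sketch; broker, group, topic and thresholds are assumptions.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-service-consumer
  namespace: production
spec:
  scaleTargetRef:
    name: my-service            # Deployment to scale
  minReplicaCount: 0            # scale to zero when there is no lag
  maxReplicaCount: 10
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka.example.internal:9092
        consumerGroup: my-service
        topic: orders
        lagThreshold: "50"      # target lag per replica; KEDA metadata values are strings
```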
HPA has configurable scale-up and scale-down stabilisation windows to prevent flapping. The default scale-up stabilisation is 0 seconds (scale up immediately) and scale-down is 300 seconds (wait 5 minutes before scaling down). This asymmetry is intentional — scale up fast to handle load spikes, scale down slowly to avoid premature termination.
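A CPU-based HPA with the stabilisation windows spelled out explicitly could be written as below. The replica bounds and the 70% target are illustrative assumptions; the two window values simply restate the defaults described above.

```yaml
# HPA sketch (autoscaling/v2); replica bounds and target are assumptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-service
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70          # percent of the pods' CPU request
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0       # react to load spikes immediately
    scaleDown:
      stabilizationWindowSeconds: 300     # wait 5 minutes before removing pods
```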
VerticalPodAutoscaler (VPA) adjusts resource requests/limits based on observed usage, complementing HPA. Use VPA in recommendation mode first (never in auto mode in production) to understand actual resource needs before setting requests and limits manually.
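Recommendation-only mode corresponds to updateMode "Off", as in this sketch. It assumes the VPA components are installed in the cluster (they are not part of a default installation), and the object names are placeholders.

```yaml
# VPA sketch in recommendation-only mode; requires the VPA components to be installed.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-service
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  updatePolicy:
    updateMode: "Off"    # compute recommendations only; never evict or resize pods
```

The recommendations then appear in the object's status, for example via kubectl describe vpa my-service -n production.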
ConfigMap and Secret Management
Kubernetes ConfigMaps and Secrets decouple application configuration from container images, enabling the same image to run in development, staging, and production with different configuration values. This is a core Twelve-Factor App principle.
ConfigMaps store non-sensitive configuration: feature flags, tuning parameters, service URLs, Spring profiles. Secrets store sensitive values: database passwords, API keys, TLS certificates, Kafka credentials. Secret values are only base64-encoded, which is an encoding and not encryption; by default they are stored unencrypted in etcd, so enable encryption at rest for production clusters.
Spring Boot reads Kubernetes configuration via environment variables (preferred for simple values) or mounted files (preferred for complex configuration like Java keystore files or multi-line YAML). For dynamic configuration reload without a pod restart, Spring Cloud Kubernetes can watch or poll the ConfigMap for changes and refresh @RefreshScope beans.
Never bake secrets into Docker images or commit them to source control. Use external secret management: AWS Secrets Manager + External Secrets Operator, HashiCorp Vault + Vault Agent Injector, or Sealed Secrets for GitOps. The External Secrets Operator syncs secrets from external stores into Kubernetes Secrets automatically, including rotation.
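With the External Secrets Operator, the sync is declared through an ExternalSecret resource, roughly as sketched here. The store name, remote key path, and refresh interval are hypothetical, and the apiVersion depends on the operator release installed (newer releases serve external-secrets.io/v1), so check it against your cluster.

```yaml
# ExternalSecret sketch; store name, remote key and apiVersion are assumptions.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: my-service-db
  namespace: production
spec:
  refreshInterval: 1h                 # how often the external store is re-read
  secretStoreRef:
    name: aws-secrets-manager         # hypothetical ClusterSecretStore
    kind: ClusterSecretStore
  target:
    name: my-service-db               # Kubernetes Secret created and kept in sync
  data:
    - secretKey: password             # key inside the Kubernetes Secret
      remoteRef:
        key: prod/my-service/db       # hypothetical entry in the external store
        property: password
```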
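For sensitive Spring Boot properties, prefer environment variable injection over mounted files: mounted files may be accessible to other processes in the container. Environment variables are not a hard boundary either, since they are visible to anything that can inspect the process environment, so avoid logging or dumping the environment. Use Kubernetes Secret references in env.valueFrom.secretKeyRef for fine-grained per-key injection.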
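Per-key injection might look like the following container fragment. The ConfigMap and Secret names are placeholders; SPRING_DATASOURCE_PASSWORD maps to spring.datasource.password through Spring Boot's relaxed binding of environment variables.

```yaml
# Fragment of spec.template.spec.containers[0]; ConfigMap and Secret names are assumptions.
envFrom:
  - configMapRef:
      name: my-service-config          # every key becomes an environment variable
env:
  - name: SPRING_DATASOURCE_PASSWORD   # binds to spring.datasource.password
    valueFrom:
      secretKeyRef:
        name: my-service-db
        key: password                  # only this key of the Secret is injected
```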
Resource Sizing and Zero-Downtime Deploy Checklist
Resource sizing for Spring Boot on Kubernetes requires measuring actual usage, not guessing. Start with VPA in recommendation mode for one week, then set requests at the p50 of observed usage and limits at p99 + 20%.
For a typical Spring Boot REST service with moderate traffic: memory request 512Mi, memory limit 1Gi, CPU request 250m, CPU limit 1000m. Adjust based on your JVM heap configuration (-XX:MaxRAMPercentage=75 → 768Mi heap for 1Gi limit), number of threads (each thread uses ~1MB stack by default), and HikariCP pool size (each connection costs ~10MB on the PostgreSQL server side, which constrains how many connections all replicas together can open).
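One way to tie the heap size to the container limit is to pass the flag through the JAVA_TOOL_OPTIONS environment variable, which the HotSpot JVM reads at startup. This fragment is a sketch under the assumption that the image does not already set heap flags in its entrypoint.

```yaml
# Fragment of spec.template.spec.containers[0]; assumes no heap flags are set elsewhere.
env:
  - name: JAVA_TOOL_OPTIONS
    value: "-XX:MaxRAMPercentage=75"   # 75% of the 1Gi limit = 768Mi maximum heap
resources:
  limits:
    memory: 1Gi                        # the remaining 256Mi covers metaspace, thread stacks, buffers
```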
Zero-downtime deployment requires all of the following to be true simultaneously: (1) readiness probe passes before pod receives traffic, (2) maxUnavailable=0 so old pods stay until new pods are ready, (3) preStop sleep covers load balancer propagation delay, (4) terminationGracePeriodSeconds > shutdown time, (5) PodDisruptionBudget preserves minimum replicas if a node drain coincides with the rollout (a PDB governs voluntary evictions, not the rolling update itself), (6) new version is backward-compatible with the current database schema (no breaking migrations before deployment completes).
Database migration strategy for zero-downtime: use Flyway or Liquibase with non-breaking migrations. The rule: never remove or rename columns in the same migration as deploying new code that stops using them. Phase it: (1) add new column, deploy new code that writes to both old and new column; (2) backfill new column; (3) deploy code that reads from new column only; (4) remove old column in a later migration after all pods are on the new version.
Liveness Probe Pointed at /actuator/health Causes Pod Restart Storm During RDS Maintenance
The liveness probe was configured with httpGet path: /actuator/health. Spring Boot's /actuator/health includes a DataSourceHealthIndicator that reports DOWN when the database is unreachable. When RDS failed over (2-minute window), all pods reported DOWN to the liveness probe. Kubernetes interpreted this as 'the application is crashed' and killed/restarted all pods. The restart surge created 200+ new HikariCP connections trying to reach still-unreachable RDS, worsening connection exhaustion when RDS came back online.
- The liveness probe must ONLY check if the JVM process is alive and not deadlocked, never external dependencies.
- Use /actuator/health/liveness for liveness, /actuator/health/readiness for readiness. They exist for exactly this reason.
kubectl describe pod -l app=my-service -n production | grep -A 20 'Last State\|Events'
kubectl logs -l app=my-service --previous -n production --tail=100
Common mistakes to avoid
- Checking external dependencies (DB, Redis) in the liveness probe
- Using Docker ENTRYPOINT in shell form instead of exec form
- Not setting CPU/memory resource requests
- Setting terminationGracePeriodSeconds less than spring.lifecycle.timeout-per-shutdown-phase
- Running with a single replica (replicas: 1) in production
- No preStop hook: race condition between SIGTERM and load balancer endpoint removal
Interview Questions on This Topic
What is the difference between a liveness probe and a readiness probe in Kubernetes?