Advanced 8 min · May 23, 2026

Your Rollback Is a Lie: Real Blue/Green and Canary Strategies for Spring Boot

Q: How do I roll back a Flyway migration in production?

Short answer: you don't. Flyway undo exists but is not production-safe because it can delete data. Instead, roll forward: create a new migration that reverts the schema change or adds a default. Never delete a migration file that has already been applied. Flyway validates the applied migrations in its schema history table against the files it finds at startup, and a file that has gone missing can fail that validation and stop the app from starting.

Q: What is the difference between `kubectl rollout undo` and `helm rollback`?

`kubectl rollout undo` only reverts the Deployment's pod template to a previous ReplicaSet. It does not revert ConfigMaps, Secrets, or any other Kubernetes resources. `helm rollback` reverts the entire Helm release, including all resources defined in the chart — ConfigMaps, Secrets, Services, Deployments, etc. Helm rollback is more comprehensive and safer for full rollbacks.

Q: Can I use a feature flag to make rollbacks safer?

Yes. Feature flags decouple deployment from feature activation. Deploy the new code behind a flag. If the feature breaks, flip the flag off. No code rollback needed. This is especially useful for database schema changes — deploy code that handles both old and new schema, then flip the flag for the new behavior. Rollback is flipping the flag, not rolling back the code.

Q: How do I verify that my rollback actually worked?

Run a comprehensive smoke test immediately after rollback. Check health endpoints, run a few critical API calls (login, search, checkout), verify database connection with a SELECT 1, and check cache connectivity. Monitor the error rate for 5-10 minutes. If you don't have a smoke test suite, write one before you need it. A rollback without verification is just hope.

Q: What is the best rollback strategy for a stateless Spring Boot microservice?

Blue/green with a hot standby. Since the service is stateless, you don't have to worry about cached sessions or database drift (assuming your database changes are backward-compatible). Flip traffic back to the old stack. This works in under 10 seconds with a properly configured load balancer. For database changes, use the multi-deploy pattern: add columns first, then change code, then remove old columns.

Stop faking rollbacks.

Naren Founder & Principal Engineer

20+ years shipping production Java in banking & fintech. Written from production experience, not tutorials.

✓ Production

production tested

July 04, 2026

last updated

1,697

articles · all by Naren

Before you start · 30 min

✓ Deep production experience
✓ Understanding of internals and trade-offs
✓ Experience debugging complex systems

⚡Quick Answer

Blue/green requires traffic switching at the load balancer level, not just a new pod
Canary rollouts need metrics-driven traffic shifting (request latency, error rate, 5xx count)
Database schema changes are the #1 reason rollbacks fail — always plan for backward-compatible schema
Stateful sessions and distributed caches (Redis, Hazelcast) break naive rollback strategies
Spring Boot's graceful shutdown and readiness probes are useless if your load balancer ignores them

✦ Definition~90s read

What is Deployment Rollback Strategies for Spring Boot?

A deployment rollback is reverting a running application to a previous known-good version after a bad deployment. This isn't clicking "rollback" in Jenkins. It's a coordinated sequence of reversing code, configuration, database state, and traffic routing. If you think git revert is a rollback strategy, you haven't been paged at 3 AM.

Rollback strategies differ by deployment model. Blue/green means you keep the old stack warm and just flip traffic back. Canary means you slowly bleed off traffic from the bad version and ramp up the stable one. Rolling update rollback is the most dangerous — Kubernetes just re-pulls the old image, but any persistent state mutation is permanent.

Spring Boot doesn't have a built-in rollback mechanism. It's an application framework, not a deployment tool. Your rollback strategy lives in your CI/CD pipeline, your database migration tool (Flyway/Liquibase), and your load balancer configuration. Treat it like a fire drill. Practice it monthly. Because when production goes down, nobody reads documentation.

Plain-English First

Imagine you're a chef who just changed the recipe for the house special. You don't swap the whole menu at once — you make one plate with the new recipe, taste it, then decide. A rollback is tossing that new batch and going back to the old recipe. If you changed the flour supplier (database schema), you can't just switch back — the old recipe won't work with new flour. That's when dinner service explodes.

You just pushed a Spring Boot jar to production. Five minutes later, error rates spike. Customers can't log in. Orders fail. Your boss is staring. You think "I'll just roll back." So you re-deploy the old image. Traffic returns. But now every order shows a null user. Welcome to Tuesday.

I've seen this exact scenario four times in my career. Each time, the developer who pushed the "simple fix" swore they could roll back safely. Two of them cost the company over $100k in lost revenue. One cost a CTO his job. The common thread? They didn't understand the difference between rolling back code and rolling back state.

A Spring Boot application is more than a jar file. It's connected to a database, a message queue, a cache, and external APIs. When you deploy a new version that changes the schema, writes new cache keys, or publishes events of a different shape, the old version can't read or process what the new version left behind. This is the rollback trap.

Most teams think about deployment strategies. Few think about rollback strategies. They treat rollback as a revert button. It's not. It's an operational maneuver that requires planning before you ever push a deploy. This article will show you what actually breaks, what commands to run when it does, and how to design your Spring Boot pipelines so rollback is boring, not terrifying.

Why Git Revert Is Not a Production Rollback

You pushed a commit. You see it break. Your instinct says git revert HEAD and redeploy. That's fine for a staging environment. In production, that revert might remove a database migration file that already ran. Flyway sees the file is gone and throws an exception. Now your app won't start. You've made things worse.

I once worked with a team that used git revert as their rollback strategy. It worked exactly twice. The third time, the revert pulled out a Flyway migration. The pod wouldn't start because Flyway expected a migration that no longer existed. They spent 45 minutes manually inserting rows into the flyway_schema_history table to unstick the schema.

The problem is that source control and runtime state are different things. Your database, cache, and message queue don't care about git history. They only care about the current state. A rollback of code doesn't reverse mutations made to external systems. You need a strategy that explicitly handles state reversal.

For Spring Boot specifically, your rollback must handle three things: code version (the jar), configuration (application.yml, ConfigMap, Vault), and schema (Flyway/Liquibase). If any one of these is out of sync, you have a partial rollback. Partial rollbacks are worse than no rollback — they give the illusion of safety while your data rots.

The rule: never rely on git for a production rollback. Keep your old artifacts versioned in a repository (Nexus, Artifactory) with immutable tags. When you roll back, deploy the exact older jar, not a git revert.

Production Trap:

If your rollback deploys an old jar but the database was already migrated forward, your app starts with a schema mismatch. Flyway's 'clean' is not an option: it drops every object in the schema. Design for rolling forward instead. Either the old app tolerates the new schema, or you ship a new migration that restores compatibility with the old code.

Production Insight

I've seen five rollback failures caused by git revert. Exactly zero were fixed by more git archaeology.

Key Takeaway

A rollback is a deploy of the previous version, not a revert of the last commit.

thecodeforge.io

Spring Boot Deployment Rollback

Blue/Green Deployment: The Old Stack Must Be Hot, Not Cold

Blue/green is the most reliable rollback strategy. You keep two identical environments. You route traffic to blue. You deploy the new version to green. Test it. Flip traffic. If it breaks, flip back. Sounds simple. The reality is that "keeping the old stack hot" costs money and requires management overhead.

I've seen teams cut corners. They scale down the blue environment to zero after the green flip. Then when they need to roll back, they have to wait 5 minutes for pods to spin up. Those 5 minutes are an eternity when every request fails. The rollback that should take 10 seconds takes 5 minutes. Customers leave.

The rule: keep the old stack at full production capacity for at least 15 minutes after the flip. Yes, it doubles your compute cost for 15 minutes. That's the price of safety. If your CTO pushes back, ask them how much an outage costs per minute. The math is easy.

For Spring Boot specifically, blue/green requires stateless design. If your app stores session data in-memory or uses a local cache, flipping traffic breaks those sessions. Use Redis or Hazelcast for sessions and cache. The new version connects to the same cache cluster. That way a flip back doesn't drop users.

The load balancer matters too. Your NGINX or AWS ALB must support weighted traffic switching and health checks. If your health check passes but the app returns 500, you've got a false positive. Always deep-check: hit /actuator/health and verify it returns 200 with a readable body.
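A deep check like the one described above can be sketched in plain Java. This is a minimal sketch, not a drop-in probe: the class name, the timeouts, and the default URL are illustrative, and the body check assumes the Actuator health JSON starts with the top-level status field. A production check would parse the JSON properly.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

/** Sketch of a "deep" health check: the status code AND the body must both look healthy. */
final class DeepHealthCheck {

    /** Pure decision logic, kept separate so it can be tested without a network. */
    static boolean isHealthy(int statusCode, String body) {
        if (statusCode != 200 || body == null) {
            return false;
        }
        // Assumption: the top-level "status" field comes first in the JSON body.
        // Checking the prefix avoids matching a nested component that is UP
        // while the overall status is DOWN.
        String compact = body.replaceAll("\\s+", "");
        return compact.startsWith("{\"status\":\"UP\"");
    }

    /** Fetches the endpoint and applies isHealthy; any failure counts as unhealthy. */
    static boolean probe(String url) {
        try {
            HttpClient client = HttpClient.newBuilder()
                    .connectTimeout(Duration.ofSeconds(2))
                    .build();
            HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                    .timeout(Duration.ofSeconds(3))
                    .GET()
                    .build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            return isHealthy(response.statusCode(), response.body());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical default; point it at the stack you are about to route traffic to.
        String url = args.length > 0 ? args[0] : "http://localhost:8080/actuator/health";
        System.out.println(probe(url) ? "healthy" : "unhealthy");
    }
}
```

Run it against the idle stack before the flip and against the active stack right after. A 200 with a DOWN body fails the check, which is the false positive a status-code-only probe misses.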

Senior Shortcut:

Combine blue/green with two Kubernetes Services (my-app-blue, my-app-green), each selecting one stack's pods. Point a single Ingress at one of them. Rollback is a single kubectl patch on the Ingress to switch the backend service name. No DNS propagation delay.

Production Insight

The longest blue/green rollback I ever executed took 7 seconds. That's because the old stack never scaled down. The shortest outage caused by a cold blue stack was 14 minutes. Fourteen minutes of 500 errors.

Key Takeaway

Blue/green is only safe if the old stack is hot. Cold blue stacks are a rollback trap.

Canary Deployments: Metrics-Driven Traffic Shifting

Canary deployments release your change to a small subset of users first. If metrics hold steady, you ramp the percentage up. If they degrade, you cut the canary off. The key phrase is "metrics hold steady." Most teams set up canary deployments without defining what "steady" means.

I audited a team that used canary deployments for every change. Their metrics were: CPU usage, memory, and request count. Those are infrastructure metrics. They don't tell you anything about business impact. A new Spring Boot version could return wrong data to 1% of users and CPU would look fine. Your canary would ramp to 100% while customer support explodes.

You need application-level metrics for a real canary. Error rate (HTTP 5xx), p99 latency, and business-specific metrics like "completed checkout" or "failed login attempts." Spring Boot Actuator exposes Micrometer metrics. Push those to Prometheus or Datadog. Configure your canary tool to watch those specific metrics.
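To make "steady" concrete, here is a minimal sketch of a canary gate in plain Java. It is not the API of any rollout tool. The class, the Window type, and every threshold are hypothetical and should come from your own traffic. It compares the canary's 5xx rate and p99 latency with the stable baseline, and it refuses to judge a canary that has barely received traffic.

```java
/** Sketch of a canary gate: compares canary metrics with the stable baseline. */
final class CanaryGate {

    enum Decision { PROMOTE, HOLD, ROLLBACK }

    /** One observation window of metrics for either the baseline or the canary. */
    static final class Window {
        final long requests;
        final long serverErrors;   // HTTP 5xx responses
        final double p99Millis;

        Window(long requests, long serverErrors, double p99Millis) {
            this.requests = requests;
            this.serverErrors = serverErrors;
            this.p99Millis = p99Millis;
        }

        double errorRate() {
            return requests == 0 ? 0.0 : (double) serverErrors / requests;
        }
    }

    // Hypothetical thresholds; tune them from your own traffic.
    static final long MIN_REQUESTS = 500;
    static final double MAX_ERROR_RATE_DELTA = 0.01;  // one percentage point above baseline
    static final double MAX_P99_RATIO = 1.25;         // 25% slower than baseline

    static Decision evaluate(Window baseline, Window canary) {
        // Too little traffic proves nothing: zero errors on zero requests is not "healthy".
        if (canary.requests < MIN_REQUESTS) {
            return Decision.HOLD;
        }
        if (canary.errorRate() - baseline.errorRate() > MAX_ERROR_RATE_DELTA) {
            return Decision.ROLLBACK;
        }
        if (canary.p99Millis > baseline.p99Millis * MAX_P99_RATIO) {
            return Decision.ROLLBACK;
        }
        return Decision.PROMOTE;
    }

    public static void main(String[] args) {
        Window baseline = new Window(10_000, 20, 180.0);
        Window canary = new Window(1_000, 50, 190.0);
        System.out.println(evaluate(baseline, canary));
    }
}
```

A business metric such as completed checkouts fits the same shape: add it as another field on the window and another comparison in evaluate.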

When the canary fails, your rollback is instant: you stop sending traffic to the canary pods. But the database problem still exists. If the canary version wrote records with a new schema, you're stuck. That's why canary deployments require database changes to be backward-compatible for at least two versions. While the canary runs, it writes new-format data and the old version keeps writing old-format data. If the canary is rolled back, the old version must still be able to read the new-format records the canary left behind.

Consider using a feature flag (LaunchDarkly, Flagsmith) for schema changes. Deploy the code that can read both formats. Flip the flag on for canary users. If the canary fails, flip the flag off. No code rollback needed.
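Here is a minimal sketch of "code that can read both formats", in plain Java with rows modeled as maps. The class name is hypothetical, the boolean parameter stands in for whatever your flag service returns, and the pay_account_id column is borrowed from the incident later in this article.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Sketch: one code path that tolerates both the old and the new row format. */
final class OrderRowCodec {

    /** Reads pay_account_id when present; rows written by the old version simply lack it. */
    static Optional<String> readPayAccountId(Map<String, Object> row) {
        Object value = row.get("pay_account_id");
        return value == null ? Optional.empty() : Optional.of(value.toString());
    }

    /**
     * Builds the columns to write. The new column is written only when the flag is on,
     * so turning the flag off returns to old-format writes without a redeploy.
     */
    static Map<String, Object> toRow(long orderId, String payAccountId, boolean newFormatEnabled) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("id", orderId);
        if (newFormatEnabled && payAccountId != null) {
            row.put("pay_account_id", payAccountId);
        }
        return row;
    }

    public static void main(String[] args) {
        Map<String, Object> written = toRow(42L, "acct-7", true);
        System.out.println(readPayAccountId(written).orElse("(none)"));
    }
}
```

The reader never assumes the column is there, and the writer only produces it behind the flag. That is what makes "flip the flag off" a complete rollback for this change.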

Interview Gold:

Be ready to explain the difference between circuit breakers (Resilience4j) and canary rollback. A circuit breaker is a runtime protection for a single service. A canary is a deployment strategy. They solve different failure modes.

Production Insight

The most dangerous canary metric is zero errors — it often means no traffic is reaching the canary. Add a 'requests received' counter and alert if it's below expected.

Key Takeaway

Your canary is only as good as the metrics you measure. Business metrics > infrastructure metrics.

Rolling Update Rollback: Kubernetes' Worst Default Behavior

Kubernetes rolling updates have a rollback feature. You run kubectl rollout undo deployment my-app. It re-pulls the previous image. That's it. It doesn't touch the database. It doesn't invalidate caches. It doesn't revert config maps. It just runs an older Docker image.

This is the most dangerous rollback strategy because it gives the illusion of a full rollback. The app starts. Health checks pass. But the database has new columns. The cache has stale keys. The message queue has undeliverable events. You're running old code against new state. That's a ticking time bomb.

I once helped a team recover from exactly this. They ran kubectl rollout undo and saw green health checks. Two hours later, a scheduled batch job ran that consumed the new-format events from the queue. The old code couldn't deserialize them. The job failed. Then it retried. It failed again. The queue backed up. The batch job backlog grew to 4 hours before anyone noticed.

The fix is to never use kubectl rollout undo as your only rollback mechanism. Instead, deploy through a Helm chart. Each helm upgrade records a new revision of the release. Rollback is helm rollback my-app <revision>. Helm rollback also reverts the ConfigMaps and Secrets that the chart manages, which kubectl rollout undo cannot.

If you're stuck with rolling updates for some reason, you must make your Spring Boot application resilient to schema drift and state changes. Use repository patterns that detect schema changes and fail gracefully. Use idempotent queue consumers. Use cache keys that expire quickly. Defensive coding is your only safety net.

Never Do This:

Never combine kubectl rollout undo with a database migration tool that auto-applies changes. Flyway on startup + rolling update rollback is a guaranteed footgun. Either the migration runs and fails, or it doesn't run and you're inconsistent.

Production Insight

Every time I've seen a production outage caused by a rollback, it was a kubectl rollout undo that didn't revert database changes. Every. Single. Time.

Key Takeaway

Kubernetes rolling update rollback only reverts the image. For anything else, use Helm rollback or a full blue/green flip.

Database Migrations: The Rollback Trap You Must Design For

Database migrations are the single biggest barrier to safe rollbacks. Flyway and Liquibase are great tools, but they're forward-only by design. You can run flyway undo in some editions, but it's not a production-safe feature. Undoing a migration can delete data. Deleting data is rarely a safe operation.

The alternative is to design every migration to be backward-compatible. That means: no drops. No renames. No NOT NULL columns without a default. No schema changes that the old code can't tolerate. You stage changes over multiple deploys.
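Rules like these are easy to forget under deadline pressure, so it can help to check them in CI. The following is a rough sketch of such a check in plain Java. The class name and the patterns are my own illustration, not a Flyway or Liquibase feature. It is a heuristic: it splits on semicolons and does not understand string literals, comments, or vendor-specific syntax.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

/** Heuristic lint for migration SQL: flags statements that old code may not tolerate. */
final class MigrationLint {

    private static final Pattern DROP = Pattern.compile("\\bDROP (COLUMN|TABLE)\\b");
    private static final Pattern RENAME = Pattern.compile("\\bRENAME\\b");

    static List<String> findRisks(String sql) {
        List<String> risks = new ArrayList<>();
        for (String statement : sql.split(";")) {
            // Normalize case and whitespace so the simple patterns below can match.
            String s = statement.toUpperCase(Locale.ROOT).replaceAll("\\s+", " ").trim();
            if (s.isEmpty()) {
                continue;
            }
            if (DROP.matcher(s).find()) {
                risks.add("drop: " + s);
            }
            if (RENAME.matcher(s).find()) {
                risks.add("rename: " + s);
            }
            boolean addsColumn = s.contains(" ADD COLUMN ");
            if (addsColumn && s.contains("NOT NULL") && !s.contains(" DEFAULT ")) {
                risks.add("NOT NULL without default: " + s);
            }
        }
        return risks;
    }

    public static void main(String[] args) {
        String migration =
                "ALTER TABLE orders ADD COLUMN pay_account_id VARCHAR(64) NOT NULL;";
        findRisks(migration).forEach(System.out::println);
    }
}
```

Fail the build when findRisks returns anything, and require an explicit override for the final cleanup deploy that really does drop the old column.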

Here's the pattern I use in production:

Deploy 1: Add the new column as nullable. Both old and new code ignore it.
Deploy 2: Update the new code to read and write the new column. Old code still works.
Deploy 3: Backfill the new column for all existing rows.
Deploy 4: Add a NOT NULL constraint.
Deploy 5: Drop the old column.

Every step before the final drop is reversible because the code at each stage supports both the old and new schema.

This is slow. It takes 3-5 deploys to change a single column. That's fine. Deploy velocity isn't more important than data integrity. If your organization can't tolerate a 24-hour window for a schema change, they haven't experienced a data loss incident yet. They will.

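For Spring Boot specifically, Flyway settings live under the spring.flyway prefix. Use spring.flyway.baseline-on-migrate=true only when you are adopting Flyway on an existing, non-empty schema that has no history table yet; it records a baseline so that migrations up to the baseline version are skipped. Use spring.flyway.out-of-order=true carefully, and only if you know what you're doing. And never, ever set spring.flyway.clean-disabled=false in production. That enables flyway clean, which drops every object in the schema. Leaving clean disabled is exactly the safety net you need if a migration goes wrong.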

The Classic Bug:

A junior dev added a NOT NULL column with no default. Rollback was impossible because the old code tried to insert rows without that column. The fix: ALTER TABLE orders ALTER COLUMN pay_account_id DROP NOT NULL; and then ALTER TABLE orders ALTER COLUMN pay_account_id SET DEFAULT '';

Production Insight

I've personally dealt with three production incidents caused by NOT NULL columns added without defaults. Every one required rolling forward, not rolling back.

Key Takeaway

Design every database migration to be backward-compatible for at least one full deploy cycle. No exceptions.

Graceful Shutdown and Readiness Probes: The Silent Saboteurs

Spring Boot's graceful shutdown is configured with server.shutdown=graceful. This tells the embedded Tomcat to stop accepting new requests and wait for in-flight requests to complete. But it only works if Kubernetes (or whatever orchestrator) respects the SIGTERM and gives the JVM time to finish.

Most Kubernetes deployments don't configure a preStop hook. The kubelet sends a SIGTERM and then waits for the terminationGracePeriodSeconds (default 30 seconds). If your app is still processing a request after 30 seconds, the kubelet sends a SIGKILL. Your app dies mid-request. Users see 502 errors.

The fix: add a preStop hook that sleeps for 10 seconds to let the load balancer detect the pod is not ready, then set terminationGracePeriodSeconds to 60 or higher. In your application.yml, set spring.lifecycle.timeout-per-shutdown-phase=45s. The grace period starts counting when the preStop hook starts, so the budget is 10 seconds of sleep plus up to 45 seconds of draining: 55 seconds in total, which leaves a 5-second margin before the kill signal.
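The arithmetic is worth writing down, because the grace period has to cover the preStop sleep and the drain timeout together. This is a tiny sketch with hypothetical names. The numbers are the ones from this section plus the Kubernetes default of 30 seconds.

```java
/** Sketch: does the shutdown sequence fit inside the pod's termination grace period? */
final class ShutdownBudget {

    /**
     * Seconds left over before SIGKILL. The grace period covers the preStop hook
     * as well as the application's own drain time, so both are subtracted.
     */
    static int marginSeconds(int gracePeriodSeconds, int preStopSleepSeconds, int drainTimeoutSeconds) {
        return gracePeriodSeconds - preStopSleepSeconds - drainTimeoutSeconds;
    }

    static boolean fits(int gracePeriodSeconds, int preStopSleepSeconds, int drainTimeoutSeconds) {
        return marginSeconds(gracePeriodSeconds, preStopSleepSeconds, drainTimeoutSeconds) >= 0;
    }

    public static void main(String[] args) {
        // prints "margin with 60s grace: 5s"
        System.out.println("margin with 60s grace: " + marginSeconds(60, 10, 45) + "s");
        // prints "fits in the default 30s grace: false"
        System.out.println("fits in the default 30s grace: " + fits(30, 10, 45));
    }
}
```

If you raise the drain timeout later, raise terminationGracePeriodSeconds with it. A negative margin means the kubelet kills the JVM while requests are still in flight.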

But that's only half the problem. Your readiness probe must reflect the actual healthy state of the app. Not just "is Tomcat running?" but "can I connect to the database, the queue, and the cache?" If your readiness probe passes but the database is down, traffic routes to a broken pod.

Spring Boot Actuator's health probe aggregates health indicators. Customize it. Add a database health check, a Redis health check, and a Kafka health check. Make readiness fail if any downstream dependency is unavailable. This prevents the load balancer from routing traffic to pods that can't serve requests.
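The aggregation rule is simple: ready only if every dependency check passes, and a check that throws counts as a failure. Here is a plain-Java sketch of that rule. It is not the Actuator API; in a real Spring Boot app you would implement HealthIndicator beans and add them to the readiness health group. The class name and the lambda checks are placeholders.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

/** Sketch: readiness is the logical AND of every registered dependency check. */
final class ReadinessGate {

    private final Map<String, BooleanSupplier> checks = new LinkedHashMap<>();

    ReadinessGate register(String name, BooleanSupplier check) {
        checks.put(name, check);
        return this;
    }

    /** Runs every check; a check that throws is recorded as failed. */
    Map<String, Boolean> results() {
        Map<String, Boolean> out = new LinkedHashMap<>();
        checks.forEach((name, check) -> {
            boolean ok;
            try {
                ok = check.getAsBoolean();
            } catch (RuntimeException e) {
                ok = false;
            }
            out.put(name, ok);
        });
        return out;
    }

    boolean isReady() {
        return !results().containsValue(false);
    }

    public static void main(String[] args) {
        ReadinessGate gate = new ReadinessGate()
                .register("db", () -> true)       // placeholder for a SELECT 1
                .register("redis", () -> true)    // placeholder for a PING
                .register("kafka", () -> {        // placeholder for a broker lookup
                    throw new IllegalStateException("broker unreachable");
                });
        System.out.println(gate.results() + " ready=" + gate.isReady());
    }
}
```

Keep each check cheap and bounded by a timeout. A readiness probe that hangs on a slow dependency is as misleading as one that always passes.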

Senior Shortcut:

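Use the /actuator/info endpoint to expose the current build version, so your pipeline or a smoke test can confirm which version is serving traffic during a blue/green flip. In application.yml, set info.app.version to '@project.version@' (quoted, because @ is a reserved character in YAML); Maven resource filtering, which spring-boot-starter-parent configures, substitutes the value at build time. On Spring Boot 2.6 and later, also set management.info.env.enabled=true, because info.* properties are no longer exposed by default.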

Production Insight

I once watched a deployment roll out across 20 pods. 18 of them started successfully, but the readiness probe passed before the database connection was established. The load balancer routed traffic to pods with broken database connections. We got 502 errors for 30 seconds until all pods were healthy. The fix: add a 5-second initial delay to the readiness probe.

Key Takeaway

Graceful shutdown means nothing if your readiness probe is lying. Make probes reflect real application health, not just Tomcat status.

Do Not Dockerize What You Cannot Debug in Production

Dockerized Spring Boot is the standard deployment unit now. But a container that cannot be inspected in production is just a faster way to go blind during a rollback incident. Before you even think about rollback strategies, ensure your container is debuggable at runtime.

The trap is minimalism. Slim base images save tens of megabytes of disk space but leave out curl, telnet, and JDK tools such as jstack. The first time your pod crashes with an OOM or hangs on a stuck thread, you will be exec-ing into a container that cannot tell you anything, and a distroless one does not even have a shell. That is a bad day.

You need a base image that includes diagnostic tools but does not balloon your attack surface. Use the official Eclipse Temurin images with the few tools you need added, or a distroless image whose debug variant ships a busybox shell. Either way, validate your container's health endpoint returns a response, the JVM can dump a heap, and you can read thread states. Do this in staging, not after the pager goes off.

The WHY: A rollback that requires a two-hour rebuild because you cannot inspect the current image is not a rollback. It's a failure.

DockerExample.javaJAVA

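# io.thecodeforge: debuggable Dockerfile for Spring Boot 3.x
# The JRE image does not ship JDK tools such as jstack and jcmd; use a JDK base image if you need them.
FROM eclipse-temurin:17-jre
RUN apt-get update && apt-get install -y --no-install-recommends curl jq procps && rm -rf /var/lib/apt/lists/*
# Create a non-root user and a heap dump directory that user can write to
RUN groupadd --system spring && useradd --system --gid spring spring \
  && mkdir -p /tmp/dumps && chown spring:spring /tmp/dumps
WORKDIR /app
COPY target/payment-processor-1.0.0.jar app.jar
USER spring
EXPOSE 8080
HEALTHCHECK --interval=30s --timeout=3s --start-period=15s --retries=3 \
  CMD curl -f http://localhost:8080/actuator/health || exit 1
ENTRYPOINT ["java", "-XX:+UseContainerSupport", "-XX:+HeapDumpOnOutOfMemoryError", "-XX:HeapDumpPath=/tmp/dumps", "-jar", "app.jar"]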

Output

container runs as non-root, exposes health check, dumps heap on OOM

Production Trap:

Minimal images save disk but kill debugging. Add the diagnostic tools you need or switch to a fuller Temurin base image. A container you cannot exec into is a black box during a rollback firefight.

Key Takeaway

Never deploy a Spring Boot container that cannot be inspected at runtime. Health checks, heap dumps, and a shell are not optional.

Rollback Is Not Recovery Without Immutable Tags

The most common rollback mistake I see in production is using the 'latest' tag for containers. Teams do git revert, rebuild, push 'latest', and then wonder why the old behavior is still broken. 'latest' is a floating pointer. It breaks reproducibility, which is the entire point of containers.

Immutable tags are your lifeline during a rollback. Every build gets a unique tag — usually the commit SHA or a semantic version. When you deploy to production, you record the exact tag running on every node. When things go wrong, you do not rebuild. You redeploy the previous tag. That takes seconds, not a CI pipeline cycle.

The WHY: A rollback that requires a compile step is already too late. Your codebase may have moved, dependencies may have changed, or the CI pipeline may be broken. You cannot undo a bad deployment with another build. You undo it by restoring a known-good artifact.

Combine immutable tags with a deployment manifest — plain YAML or a Terraform template — that pins the exact image and environment variables. When the pager wakes you at 3 AM, you run one command: kubectl set image deployment/payment-processor payment-processor=registry.example.com/payment-processor:abc1234. That is it.

TagStrategy.javaJAVA

<!-- io.thecodeforge: pom.xml fragment that tags the image with the commit SHA -->
<!-- The git.commit.id property is populated by a git-commit-id Maven plugin, configured separately -->
<!-- Spotify has archived this plugin; treat the fragment as illustrative -->
<plugin>
  <groupId>com.spotify</groupId>
  <artifactId>dockerfile-maven-plugin</artifactId>
  <version>1.4.13</version>
  <configuration>
    <repository>registry.thecodeforge.io/payment-processor</repository>
    <tag>${git.commit.id}</tag>
  </configuration>
</plugin>

# deployment.yaml: Deployment manifest with a pinned tag
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-processor
spec:
  selector:
    matchLabels:
      app: payment-processor
  template:
    metadata:
      labels:
        app: payment-processor
    spec:
      containers:
        - name: payment-processor
          image: registry.thecodeforge.io/payment-processor:abc1234

Output

Every build produces a unique, immutable image tag. Rollback by redeploying the previous tag.

Rule of Thumb:

If your container tag is 'latest', you do not have a rollback strategy. You have a lottery ticket. Switch to commit SHA or semantic version tags before the next incident.
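You can enforce that rule before anything reaches the cluster. This is a minimal sketch of such a check in plain Java, with a hypothetical class name. It only inspects the syntax of the image reference. Whether a tag is truly immutable depends on your registry refusing overwrites, which this code cannot verify.

```java
/** Sketch: reject image references that float ("latest" or no tag at all). */
final class ImageRef {

    /** True when the reference is pinned: a digest, or an explicit tag other than "latest". */
    static boolean isPinned(String image) {
        if (image == null || image.isBlank()) {
            return false;
        }
        if (image.contains("@sha256:")) {
            return true;                       // digest references cannot float
        }
        // Look for the tag only in the last path segment, so a registry port
        // such as registry.example.com:5000 is not mistaken for a tag.
        String lastSegment = image.substring(image.lastIndexOf('/') + 1);
        int colon = lastSegment.lastIndexOf(':');
        if (colon < 0) {
            return false;                      // no tag means "latest" is assumed
        }
        String tag = lastSegment.substring(colon + 1);
        return !tag.isEmpty() && !tag.equals("latest");
    }

    public static void main(String[] args) {
        String image = args.length > 0 ? args[0] : "registry.example.com/payment-processor:abc1234";
        System.out.println(image + " pinned=" + isPinned(image));
    }
}
```

Wire a check like this into the pipeline step that renders your manifests, and fail the deploy when any container image is not pinned.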

Key Takeaway

Immutable tags make rollback a declarative revert to a known artifact. 'latest' guarantees you will rebuild under pressure.

● Production incident · POST-MORTEM · severity: high

The Schema That Wouldn't Roll Back

Symptom

After deploy, payment processing returns 500s. Logs show 'column pay_account_id does not exist' on Order entity persistence. New orders work. Old orders fail.

Assumption

The developer assumed the old code would use the old column. But the shared database schema had already been altered by the migration.

Root cause

A Flyway V4__add_pay_account.sql ran as part of the new deploy. It added a NOT NULL column with no default. Old code didn't know about it. When the old jar tried to save an Order without populating pay_account_id, Hibernate threw a constraint violation. Rolling back the code didn't revert the schema change.

Fix

1) Pinned the load balancer to new version only (stop splitting traffic between old and new). 2) Added a default value to the new column in a new migration. 3) Reverted the code change permanently after fixing the schema. 4) Made sure nothing in the deploy pipeline can auto-trigger a schema rollback.

Key lesson

Database migrations are forward-only.
You must design them to be backward-compatible for at least one deploy cycle.
Never, ever add a NOT NULL column without a default in a rolling update.

Production debug guide

Symptom → root cause → fix for the failures that actually happen

Symptom · 01

After blue/green flip, users report data corruption or missing fields

→

Fix

Check if you have a sticky session or cache dependency. Run kubectl exec <old-pod> -- curl -s localhost:8080/actuator/health to verify the old pod is actually healthy. Compare Redis keys between old and new versions. The fix: use separate cache namespaces per deploy version, or invalidate the cache on rollback.

Symptom · 02

Canary rollout ramps to 10%, then the 5xx error rate spikes

→

Fix

Check if your metrics pipeline (Prometheus/Datadog) has a lag. Your rollout tool might be acting on stale data. The fix: add a cooldown period of 2-3 minutes between traffic shifts. Run kubectl get pods -l version=canary --show-labels to confirm the canary pods exist and carry the expected labels, then check their request-count metric to confirm they are actually receiving traffic.

Symptom · 03

Git revert of a config change in application.yml — service still fails

→

Fix

Spring Boot reads configuration at startup, so reverting a config commit changes nothing until the pods restart with the reverted values. Kubernetes does not restart pods when a ConfigMap changes, so check whether your ConfigMap or Vault secret was actually reverted and whether the pods were restarted afterward. The fix: never use git revert for production configs. Use a versioned config service like Spring Cloud Config or Vault with an audit trail.

Symptom · 04

Graceful shutdown takes 30+ seconds, during which requests fail

→

Fix

Spring Boot's graceful shutdown waits for in-flight requests, but only until the kubelet's grace period runs out. Without a preStop hook, the load balancer can keep sending new requests to a pod that has already received SIGTERM, and those requests fail. The fix: set server.shutdown=graceful, configure spring.lifecycle.timeout-per-shutdown-phase=45s, add a preStop sleep, and raise terminationGracePeriodSeconds above the sum of the two. Verify with curl localhost:8080/actuator/health before re-routing traffic.

★ Debug Cheat Sheet

Commands for fast diagnosis in production

Database schema mismatch after rollback

Immediate action

Check current Flyway migration version

Commands

kubectl exec <pod> -- curl -s localhost:8080/actuator/flyway | jq '.contexts[].flywayBeans[].migrations[] | select(.state=="SUCCESS") | .version'

kubectl exec <pod> -- /bin/bash -c 'psql "$DATABASE_URL" -c "SELECT version, script, installed_on FROM flyway_schema_history ORDER BY installed_rank DESC LIMIT 3;"'

Fix now

Roll forward with a new migration that adds default or makes column nullable. Never roll back a migration in production.

Rollback Strategy Comparison for Spring Boot

Strategy	Rollback Time	State Safety	Cost	Complexity
Blue/Green (hot standby)	10-30 seconds	High (if stateless)	2x compute during overlap	Medium
Canary (metrics-based)	10-60 seconds	Medium (database trap)	Low (only canary pods)	High
Rolling Update (kubectl rollout undo)	30-120 seconds	Low (state not reverted)	None	Low
Helm Rollback	30-90 seconds	Medium (reverts ConfigMaps/Secrets)	None	Medium
Git Revert + Redeploy	5-15 minutes	Very low (state never touched)	None	Low

Key takeaways

A rollback is a deploy of the previous version, not a revert of the last commit. Code, config, and database must be treated as separate concerns.

Database migrations are forward-only. Design every migration to be backward-compatible for at least one deploy cycle. NOT NULL columns without defaults are the #1 rollback killer.

Blue/green is the safest rollback strategy, but only if you keep the old stack hot. Cold blue stacks are a rollback trap that turns 10-second flips into 10-minute outages.

Your canary metrics must include business-level signals, not just CPU and memory. If you only measure infrastructure, you'll ramp to 100% with a broken app.

Graceful shutdown and readiness probes are only as good as their configuration. A readiness probe that passes while the database is down is a liar. Fix it.

Common mistakes to avoid

5 patterns

Using `kubectl rollout undo` without checking database schema state

Symptom

App starts but fails on any database write with column mismatch errors

Fix

Always pair rolling update rollbacks with a schema check. Run flyway info or liquibase status before considering the rollback complete.

Not keeping blue/green old stack warm — scaling it to zero

Symptom

Rollback takes 5-10 minutes because pods must spin up and cache warm

Fix

Keep old stack at full capacity for at least 15 minutes. Use a TTL in your deployment pipeline to auto-scale old stack down after safe period.

Canary deployment without business metrics

Symptom

Error rate looks healthy but conversion rate drops. No alert fires.

Fix

Add custom Micrometer counters for critical business operations (e.g., checkout.success, checkout.failure). Wire them into your canary rollback decision.

Forgetting to invalidate cache on rollback

Symptom

After rollback, users still see data written by the new version because the cache keeps serving new-format entries

Fix

Add a cache invalidation step to your rollback playbook. Use Redis SCAN and DEL for patterns, or bump a version cache key that all objects reference.

Assuming Helm rollback reverts everything

Symptom

Helm rollback completes but database migrations remain applied

Fix

Helm rollback only touches Kubernetes resources (Deployments, ConfigMaps). Database state must be managed separately with backward-compatible migrations.
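For the cache invalidation fix in the list above, one option is to namespace every cache key by deploy version, so a rolled-back version never reads entries the newer version wrote. This is a minimal sketch in plain Java. The class name is hypothetical and the in-memory map stands in for a shared Redis instance.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: per-version cache namespaces over a shared store. */
final class VersionedCache {

    private final Map<String, String> store;   // stand-in for a shared Redis instance
    private final String namespace;

    VersionedCache(Map<String, String> sharedStore, String deployVersion) {
        this.store = sharedStore;
        this.namespace = "v:" + deployVersion + ":";
    }

    /** Prefixes the logical key with this deploy's namespace. */
    String key(String logicalKey) {
        return namespace + logicalKey;
    }

    void put(String logicalKey, String value) {
        store.put(key(logicalKey), value);
    }

    String get(String logicalKey) {
        return store.get(key(logicalKey));
    }

    public static void main(String[] args) {
        Map<String, String> shared = new HashMap<>();
        VersionedCache oldVersion = new VersionedCache(shared, "abc1234");
        VersionedCache newVersion = new VersionedCache(shared, "def5678");

        newVersion.put("user:17", "{\"payAccountId\":\"acct-7\"}");
        // The old version misses and falls back to the database instead of
        // deserializing an entry in a format it does not understand.
        System.out.println("old version sees: " + oldVersion.get("user:17"));
    }
}
```

The cost is a cold cache right after a flip or a rollback. Put a TTL on every entry so the abandoned namespace expires on its own.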

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR

You push a Spring Boot app with a Flyway migration that adds a NOT NULL ...

Q02SENIOR

Describe the difference between a readiness probe and a liveness probe i...

Q03SENIOR

How do you design a database migration to be backward-compatible for a b...

Q04SENIOR

You have a canary deployment that shows 0% error rate but 100% failed tr...

Q05JUNIOR

A junior dev says 'we can just use git revert for rollbacks.' Why is tha...

Q06SENIOR

How can you ensure that a Spring Boot application fails gracefully durin...

Q07SENIOR

Explain the trade-offs between blue/green and canary for a high-traffic ...

Q08SENIOR

Your Spring Boot app's graceful shutdown causes 5-second delays in Kuber...

Q01 of 08 · SENIOR

You push a Spring Boot app with a Flyway migration that adds a NOT NULL column. You roll back the code with `kubectl rollout undo`. What happens when the old code tries to insert a row?

ANSWER

The old code doesn't know about the new column, so its INSERT statements leave it out. Hibernate will not paper over the gap because we set spring.jpa.hibernate.ddl-auto=validate in production, which never alters the schema. The INSERT fails with a PostgreSQL error: 'null value in column "new_col" violates not-null constraint'. The fix is to never add a NOT NULL column without a default while old code can still write to the table. For rollback safety, add the column as nullable, backfill, then add the constraint in a separate deploy once no old version remains.
