Mid 8 min · May 23, 2026

Your Rollback Is a Lie: Real Blue/Green and Canary Strategies for Spring Boot

Stop faking rollbacks.

Naren · Founder
Quick Answer
  • Blue/green requires traffic switching at the load balancer level, not just a new pod
  • Canary rollouts need metrics-driven traffic shifting (request latency, error rate, 5xx count)
  • Database schema changes are the #1 reason rollbacks fail — always plan for backward-compatible schema
  • Stateful sessions and distributed caches (Redis, Hazelcast) break naive rollback strategies
  • Spring Boot's graceful shutdown and readiness probes are useless if your load balancer ignores them
Definition
What Is a Deployment Rollback?

A deployment rollback is reverting a running application to a previous known-good version after a bad deployment. This isn't clicking "rollback" in Jenkins. It's a coordinated sequence of reversing code, configuration, database state, and traffic routing. If you think git revert is a rollback strategy, you haven't been paged at 3 AM.

Rollback strategies differ by deployment model. Blue/green means you keep the old stack warm and just flip traffic back. Canary means you slowly bleed off traffic from the bad version and ramp up the stable one. Rolling update rollback is the most dangerous — Kubernetes just re-pulls the old image, but any persistent state mutation is permanent.

Spring Boot doesn't have a built-in rollback mechanism. It's an application framework, not a deployment tool. Your rollback strategy lives in your CI/CD pipeline, your database migration tool (Flyway/Liquibase), and your load balancer configuration. Treat it like a fire drill. Practice it monthly. Because when production goes down, nobody reads documentation.

Plain-English First

Imagine you're a chef who just changed the recipe for the house special. You don't swap the whole menu at once — you make one plate with the new recipe, taste it, then decide. A rollback is tossing that new batch and going back to the old recipe. If you changed the flour supplier (database schema), you can't just switch back — the old recipe won't work with new flour. That's when dinner service explodes.

You just pushed a Spring Boot jar to production. Five minutes later, error rates spike. Customers can't log in. Orders fail. Your boss is staring. You think "I'll just roll back." So you re-deploy the old image. Traffic returns. But now every order shows a null user. Welcome to Tuesday.

I've seen this exact scenario four times in my career. Each time, the developer who pushed the "simple fix" swore they could roll back safely. Two of them cost the company over $100k in lost revenue. One cost a CTO his job. The common thread? They didn't understand the difference between rolling back code and rolling back state.

A Spring Boot application is more than a jar file. It's connected to a database, a message queue, a cache, and external APIs. When you deploy a new version that changes the schema, writes new cache keys, or publishes events of a different shape, the old version can't read or process what the new version left behind. This is the rollback trap.

Most teams think about deployment strategies. Few think about rollback strategies. They treat rollback as a revert button. It's not. It's an operational maneuver that requires planning before you ever push a deploy. This article will show you what actually breaks, what commands to run when it does, and how to design your Spring Boot pipelines so rollback is boring, not terrifying.

Why Git Revert Is Not a Production Rollback

You pushed a commit. You see it break. Your instinct says git revert HEAD and redeploy. That's fine for a staging environment. In production, that revert might remove a database migration file that already ran. On the next startup, Flyway's validation finds an applied migration with no matching file and throws an exception. Now your app won't start. You've made things worse.

I once worked with a team that used git revert as their rollback strategy. It worked exactly twice. The third time, the revert pulled out a Flyway migration. The pod wouldn't start because Flyway expected a migration that no longer existed. They spent 45 minutes manually inserting rows into the flyway_schema_history table to unstick the schema.
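
The failure mode above can be modeled in a few lines. This is a simplified sketch of the idea behind a migration-history validation, not Flyway's actual implementation; the class and method names are hypothetical. It compares the versions recorded as applied in the database with the migration files shipped in the artifact being started.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Simplified model of a migration-history check.
// This is NOT Flyway's code; it only mirrors the idea.
class MigrationHistoryCheck {

    // Versions recorded as applied in the database that have
    // no matching migration file in the artifact being started.
    static Set<String> appliedButMissing(List<String> appliedInDb, List<String> filesInArtifact) {
        Set<String> missing = new TreeSet<>(appliedInDb);
        missing.removeAll(filesInArtifact);
        return missing;
    }

    public static void main(String[] args) {
        List<String> appliedInDb = List.of("1", "2", "3", "4"); // V4 already ran in production
        List<String> afterRevert = List.of("1", "2", "3");      // the revert deleted V4's file

        Set<String> missing = appliedButMissing(appliedInDb, afterRevert);
        if (!missing.isEmpty()) {
            System.out.println("Validation would fail, applied but not shipped: " + missing);
        }
    }
}
```

Any artifact that lacks an already-applied migration file trips the same check, which is why the schema history has to be part of the rollback plan and not an afterthought.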

The problem is that source control and runtime state are different things. Your database, cache, and message queue don't care about git history. They only care about the current state. A rollback of code doesn't reverse mutations made to external systems. You need a strategy that explicitly handles state reversal.

For Spring Boot specifically, your rollback must handle three things: code version (the jar), configuration (application.yml, ConfigMap, Vault), and schema (Flyway/Liquibase). If any one of these is out of sync, you have a partial rollback. Partial rollbacks are worse than no rollback — they give the illusion of safety while your data rots.

The rule: never rely on git for a production rollback. Keep your old artifacts versioned in a repository (Nexus, Artifactory) with immutable tags. When you roll back, deploy the exact older jar, not a git revert.

Production Trap:
If your rollback deploys an old jar but the database was already migrated forward, your app starts with a schema mismatch. Flyway's 'clean' is not an option. You must design forward-thinking rollbacks — either the old app tolerates the new schema, or you deploy a new migration that reverts the schema backward compatibly.
Production Insight
I've seen five rollback failures caused by git revert. Exactly zero were fixed by more git archaeology.
Key Takeaway
A rollback is a deploy of the previous version, not a revert of the last commit.

Blue/Green Deployment: The Old Stack Must Be Hot, Not Cold

Blue/green is the most reliable rollback strategy. You keep two identical environments. You route traffic to blue. You deploy the new version to green. Test it. Flip traffic. If it breaks, flip back. Sounds simple. The reality is that "keeping the old stack hot" costs money and requires management overhead.

I've seen teams cut corners. They scale down the blue environment to zero after the green flip. Then when they need to roll back, they have to wait 5 minutes for pods to spin up. Those 5 minutes are an eternity when every request fails. The rollback that should take 10 seconds takes 5 minutes. Customers leave.

The rule: keep the old stack at full production capacity for at least 15 minutes after the flip. Yes, it doubles your compute cost for 15 minutes. That's the price of safety. If your CTO pushes back, ask them how much an outage costs per minute. The math is easy.

For Spring Boot specifically, blue/green requires stateless design. If your app stores session data in-memory or uses a local cache, flipping traffic breaks those sessions. Use Redis or Hazelcast for sessions and cache. The new version connects to the same cache cluster. That way a flip back doesn't drop users.

The load balancer matters too. Your NGINX or AWS ALB must support weighted traffic switching and health checks. If your health check passes but the app returns 500, you've got a false positive. Always deep-check: hit /actuator/health and verify it returns 200 with a readable body.
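
A deep check of this kind might look like the sketch below. It uses only the JDK's HTTP client; the class name is hypothetical, and the string match on the body is a deliberate simplification that assumes the top-level status field comes first in the JSON, so a real check would parse the response properly.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

class DeepHealthCheck {

    // Healthy only if the status code is 200 AND the body reports UP.
    // Assumption: the top-level "status" field is the first field in the body.
    static boolean isHealthy(int statusCode, String body) {
        if (statusCode != 200 || body == null) {
            return false;
        }
        String compact = body.replaceAll("\\s", "");
        return compact.startsWith("{\"status\":\"UP\"");
    }

    // Calls the health endpoint with short timeouts and applies the check above.
    static boolean check(URI healthUrl) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))
                .build();
        HttpRequest request = HttpRequest.newBuilder(healthUrl)
                .timeout(Duration.ofSeconds(3))
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        return isHealthy(response.statusCode(), response.body());
    }
}
```

A pipeline step could call DeepHealthCheck.check(URI.create("http://my-app-green:8080/actuator/health")) before flipping traffic; the host name here is a placeholder.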

Senior Shortcut:
Combine blue/green with two ordinary Kubernetes Services. Create two separate Services (my-app-blue, my-app-green), each selecting one stack. Point a single Ingress to one of them. Rollback is a single kubectl patch on the Ingress to switch the backend service name. No DNS propagation delay.
Production Insight
The longest blue/green rollback I ever executed took 7 seconds. That's because the old stack never scaled down. The shortest outage caused by a cold blue stack was 14 minutes. Fourteen minutes of 500 errors.
Key Takeaway
Blue/green is only safe if the old stack is hot. Cold blue stacks are a rollback trap.

Canary Deployments: Metrics-Driven Traffic Shifting

Canary deployments release your change to a small subset of users first. If metrics hold steady, you ramp the percentage up. If they degrade, you cut the canary off. The key phrase is "metrics hold steady." Most teams set up canary deployments without defining what "steady" means.

I audited a team that used canary deployments for every change. Their metrics were: CPU usage, memory, and request count. Those are infrastructure metrics. They don't tell you anything about business impact. A new Spring Boot version could return wrong data to 1% of users and CPU would look fine. Your canary would ramp to 100% while customer support explodes.

You need application-level metrics for a real canary. Error rate (HTTP 5xx), p99 latency, and business-specific metrics like "completed checkout" or "failed login attempts." Spring Boot Actuator exposes Micrometer metrics. Push those to Prometheus or Datadog. Configure your canary tool to watch those specific metrics.
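
One way to make "steady" concrete is a small gate that compares canary and baseline numbers. The sketch below is illustrative only: the class name, the thresholds, and the minimum-traffic floor are all assumptions, and a real setup would read these values from Prometheus or Datadog and add a business metric alongside them.

```java
class CanaryGate {

    enum Decision { PROMOTE, HOLD, ROLLBACK }

    // All thresholds are illustrative assumptions, not recommendations.
    static final long MIN_REQUESTS = 500;            // below this the sample is too small to judge
    static final double MAX_ERROR_RATE_DELTA = 0.01; // canary 5xx rate may exceed baseline by 1 point
    static final double MAX_P99_RATIO = 1.25;        // canary p99 may be at most 25% slower

    static Decision evaluate(long canaryRequests, long canary5xx, double canaryP99Ms,
                             long baselineRequests, long baseline5xx, double baselineP99Ms) {
        // Zero errors on zero traffic proves nothing, so wait for more data.
        if (canaryRequests < MIN_REQUESTS || baselineRequests == 0) {
            return Decision.HOLD;
        }
        double canaryErrorRate = (double) canary5xx / canaryRequests;
        double baselineErrorRate = (double) baseline5xx / baselineRequests;

        if (canaryErrorRate - baselineErrorRate > MAX_ERROR_RATE_DELTA) {
            return Decision.ROLLBACK;
        }
        if (canaryP99Ms > baselineP99Ms * MAX_P99_RATIO) {
            return Decision.ROLLBACK;
        }
        return Decision.PROMOTE;
    }
}
```

Comparing against the baseline instead of a fixed number keeps the gate from firing on noise that affects both versions equally.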

When the canary fails, your rollback is instant: you stop sending traffic to the canary pods. But the database problem still exists. If the canary version wrote records with a new schema, you're stuck. That's why canary deployments require database changes to be backward-compatible for at least two versions. The canary writes new-format data. The old version reads old-format data. If the canary is rolled back, the old version must still be able to read the new-format records the canary left behind.

Consider using a feature flag (LaunchDarkly, Flagsmith) for schema changes. Deploy the code that can read both formats. Flip the flag on for canary users. If the canary fails, flip the flag off. No code rollback needed.

Interview Gold:
Be ready to explain the difference between circuit breakers (Resilience4j) and canary rollback. A circuit breaker is a runtime protection for a single service. A canary is a deployment strategy. They solve different failure modes.
Production Insight
The most dangerous canary metric is zero errors — it often means no traffic is reaching the canary. Add a 'requests received' counter and alert if it's below expected.
Key Takeaway
Your canary is only as good as the metrics you measure. Business metrics > infrastructure metrics.

Rolling Update Rollback: Kubernetes' Worst Default Behavior

Kubernetes rolling updates have a rollback feature. You run kubectl rollout undo deployment my-app. It restores the Deployment's previous pod template, which in practice means the previous image. That's it. It doesn't touch the database. It doesn't invalidate caches. It doesn't revert ConfigMap contents. It just runs an older Docker image.

This is the most dangerous rollback strategy because it gives the illusion of a full rollback. The app starts. Health checks pass. But the database has new columns. The cache has stale keys. The message queue has undeliverable events. You're running old code against new state. That's a ticking time bomb.

I once helped a team recover from exactly this. They ran kubectl rollout undo and saw green health checks. Two hours later, a scheduled batch job ran that consumed the new-format events from the queue. The old code couldn't deserialize them. The job failed. Then it retried. It failed again. The queue backed up. The batch job backlog grew to 4 hours before anyone noticed.

The fix is to never use kubectl rollout undo as your only rollback mechanism. Instead, deploy through a Helm chart. Each helm upgrade records a new revision of the release. Rollback is helm rollback my-app <revision>. Helm rollback also reverts the ConfigMaps and Secrets that are part of the chart, which kubectl rollout undo cannot.

If you're stuck with rolling updates for some reason, you must make your Spring Boot application resilient to schema drift and state changes. Use repository patterns that detect schema changes and fail gracefully. Use idempotent queue consumers. Use cache keys that expire quickly. Defensive coding is your only safety net.
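
As a rough illustration of an idempotent consumer that also survives events in a newer format, consider the sketch below. Everything in it is hypothetical: the class name, the schema-version field on events, and the in-memory set, which stands in for a durable store such as a database table.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class TolerantConsumer {

    enum Outcome { PROCESSED, DUPLICATE, PARKED }

    // Highest event format this code version understands (assumption for the sketch).
    static final int MAX_SUPPORTED_SCHEMA = 1;

    private final Set<String> processedIds = new HashSet<>(); // stand-in for a durable store
    private final List<String> parked = new ArrayList<>();    // stand-in for a dead-letter queue

    Outcome handle(String eventId, int schemaVersion) {
        // Park events written by a newer version instead of failing and retrying forever.
        if (schemaVersion > MAX_SUPPORTED_SCHEMA) {
            parked.add(eventId);
            return Outcome.PARKED;
        }
        // Set.add returns false when the ID was already seen, which makes redelivery harmless.
        if (!processedIds.add(eventId)) {
            return Outcome.DUPLICATE;
        }
        // Business logic would run here.
        return Outcome.PROCESSED;
    }

    List<String> parkedEventIds() {
        return List.copyOf(parked);
    }
}
```

Parking unknown formats keeps the queue moving after a rollback; the parked events can be replayed once a version that understands them is live again.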

Never Do This:
Never combine kubectl rollout undo with a database migration tool that auto-applies changes. Flyway on startup + rolling update rollback is a guaranteed footgun. Either the migration runs and fails, or it doesn't run and you're inconsistent.
Production Insight
Every time I've seen a production outage caused by a rollback, it was a kubectl rollout undo that didn't revert database changes. Every. Single. Time.
Key Takeaway
Kubernetes rolling update rollback only reverts the image. For anything else, use Helm rollback or a full blue/green flip.

Database Migrations: The Rollback Trap You Must Design For

Database migrations are the single biggest barrier to safe rollbacks. Flyway and Liquibase are great tools, but they're forward-only by design. You can run flyway undo in some editions, but it's not a production-safe feature. Undoing a migration can delete data. Deleting data is rarely a safe operation.

The alternative is to design every migration to be backward-compatible. That means: no drops. No renames. No NOT NULL columns without a default. No schema changes that the old code can't tolerate. You stage changes over multiple deploys.

Here's the pattern I use in production. Deploy 1: Add the new column as nullable. Both old and new code ignore it. Deploy 2: Update the new code to read and write the new column. Old code still works. Deploy 3: Backfill the new column for all existing rows. Deploy 4: Add a NOT NULL constraint. Deploy 5: Drop old column. Every step is reversible because the code at each stage supports both the old and new schema.
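
The compatibility rule behind these steps can be checked mechanically. The sketch below is a toy model, not a real schema tool: the class and the id and total columns are hypothetical, while pay_account_id matches the column used elsewhere in this article. It encodes one rule: an INSERT that omits a column succeeds only if that column is nullable or has a default.

```java
import java.util.List;
import java.util.Set;

class SchemaCompat {

    // Minimal description of a table column for this sketch.
    static final class Column {
        final String name;
        final boolean nullable;
        final boolean hasDefault;

        Column(String name, boolean nullable, boolean hasDefault) {
            this.name = name;
            this.nullable = nullable;
            this.hasDefault = hasDefault;
        }
    }

    // An INSERT that omits a column works only if the column is nullable or has a default.
    static boolean insertSucceeds(List<Column> table, Set<String> columnsWritten) {
        for (Column c : table) {
            if (!columnsWritten.contains(c.name) && !c.nullable && !c.hasDefault) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Set<String> oldCodeWrites = Set.of("id", "total"); // old code has never heard of pay_account_id

        List<Column> afterDeploy1 = List.of(
                new Column("id", false, false),
                new Column("total", false, false),
                new Column("pay_account_id", true, false));   // nullable: safe for old code

        List<Column> notNullNoDefault = List.of(
                new Column("id", false, false),
                new Column("total", false, false),
                new Column("pay_account_id", false, false));  // the rollback killer

        System.out.println(insertSucceeds(afterDeploy1, oldCodeWrites));     // true
        System.out.println(insertSucceeds(notNullNoDefault, oldCodeWrites)); // false
    }
}
```

The same check shows why the NOT NULL constraint in Deploy 4 is only safe once every version you might roll back to already writes the column.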

This is slow. It takes 3-5 deploys to change a single column. That's fine. Deploy velocity isn't more important than data integrity. If your organization can't tolerate a 24-hour window for a schema change, they haven't experienced a data loss incident yet. They will.

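For Spring Boot specifically, Flyway settings live under the spring.flyway prefix. Set spring.flyway.baseline-on-migrate=true only when you are adopting Flyway on an existing, non-empty schema that has no history table yet; it records a baseline version so earlier migrations are skipped. Use spring.flyway.out-of-order=true carefully, and only if you know what you're doing. And never, ever set spring.flyway.clean-disabled=false in production. That enables flyway clean, which drops every object in the schema. Leaving clean disabled is exactly the safety net you need when someone reaches for it after a migration goes wrong.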

The Classic Bug:
A junior dev added a NOT NULL column with no default. Rollback was impossible because the old code tried to insert rows without that column. The fix: ALTER TABLE orders ALTER COLUMN pay_account_id DROP NOT NULL; and then ALTER TABLE orders ALTER COLUMN pay_account_id SET DEFAULT '';
Production Insight
I've personally dealt with three production incidents caused by NOT NULL columns added without defaults. Every one required rolling forward, not rolling back.
Key Takeaway
Design every database migration to be backward-compatible for at least one full deploy cycle. No exceptions.

Graceful Shutdown and Readiness Probes: The Silent Saboteurs

Spring Boot's graceful shutdown is configured with server.shutdown=graceful. This tells the embedded Tomcat to stop accepting new requests and wait for in-flight requests to complete. But it only works if Kubernetes (or whatever orchestrator) respects the SIGTERM and gives the JVM time to finish.

Most Kubernetes deployments don't configure a preStop hook. The kubelet sends a SIGTERM and then waits for the terminationGracePeriodSeconds (default 30 seconds). If your app is still processing a request after 30 seconds, the kubelet sends a SIGKILL. Your app dies mid-request. Users see 502 errors.

The fix: add a preStop hook that sleeps for 10 seconds to let the load balancer detect the pod is not ready, then set terminationGracePeriodSeconds to 60 or higher. In your application.yml, set spring.lifecycle.timeout-per-shutdown-phase=45s. The preStop sleep counts against the grace period, so with 60 seconds total the JVM receives SIGTERM at the 10-second mark, has up to 45 seconds to drain in-flight requests, and keeps a 5-second margin before the kill signal.
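
The timing budget is plain arithmetic, because the preStop sleep is spent inside the termination grace period. A tiny helper, with a hypothetical class name, makes it easy to sanity-check a set of values before they go into a manifest.

```java
import java.time.Duration;

class ShutdownBudget {

    // Time left between the end of the drain window and SIGKILL.
    // A negative result means the kubelet kills the JVM before draining can finish.
    static Duration margin(Duration terminationGracePeriod,
                           Duration preStopSleep,
                           Duration drainTimeout) {
        return terminationGracePeriod.minus(preStopSleep).minus(drainTimeout);
    }

    public static void main(String[] args) {
        Duration spare = margin(Duration.ofSeconds(60), Duration.ofSeconds(10), Duration.ofSeconds(45));
        System.out.println("Margin before SIGKILL: " + spare.getSeconds() + "s"); // 5s

        Duration tooTight = margin(Duration.ofSeconds(30), Duration.ofSeconds(10), Duration.ofSeconds(45));
        System.out.println("Default grace period is enough: " + !tooTight.isNegative()); // false
    }
}
```

With the default 30-second grace period the same 45-second drain window cannot fit, which is the mid-request kill described above.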

But that's only half the problem. Your readiness probe must reflect the actual healthy state of the app. Not just "is Tomcat running?" but "can I connect to the database, the queue, and the cache?" If your readiness probe passes but the database is down, traffic routes to a broken pod.

Spring Boot Actuator's health probe aggregates health indicators. Customize it. Add a database health check, a Redis health check, and a Kafka health check. Make readiness fail if any downstream dependency is unavailable. This prevents the load balancer from routing traffic to pods that can't serve requests.
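
The aggregation rule itself is simple enough to sketch without Spring. In a real application each check would be an Actuator HealthIndicator included in the readiness group; the class below is a hypothetical plain-Java stand-in that only shows the "any dependency down means not ready" logic.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

class ReadinessAggregator {

    private final Map<String, BooleanSupplier> checks = new LinkedHashMap<>();

    // Registers a named dependency check, for example "db" or "redis".
    ReadinessAggregator register(String name, BooleanSupplier check) {
        checks.put(name, check);
        return this;
    }

    // Ready only when every dependency check passes.
    // A check that throws is treated the same as a failed check.
    boolean isReady() {
        for (BooleanSupplier check : checks.values()) {
            try {
                if (!check.getAsBoolean()) {
                    return false;
                }
            } catch (RuntimeException e) {
                return false;
            }
        }
        return true;
    }
}
```

Treating an exception as "down" matters: a connection timeout usually surfaces as a thrown exception, not as a clean false.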

Senior Shortcut:
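Use the /actuator/info endpoint to expose the current build version. A smoke test, or a load balancer that supports response-body matching, can check this to ensure traffic is reaching the correct version during a blue/green flip. Add info.app.version=@project.version@ to application.properties (in application.yml, quote the placeholder). On Spring Boot 2.6 or later, also set management.info.env.enabled=true so that info.* properties are exposed.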
Production Insight
I once watched a deployment roll out across 20 pods. 18 of them started successfully, but the readiness probe passed before the database connection was established. The load balancer routed traffic to pods with broken database connections. We got 502 errors for 30 seconds until all pods were healthy. The fix: add a 5-second initial delay to the readiness probe.
Key Takeaway
Graceful shutdown means nothing if your readiness probe is lying. Make probes reflect real application health, not just Tomcat status.
Production incident · Post-mortem · Severity: high

The Schema That Wouldn't Roll Back

Symptom
After deploy, payment processing returns 500s. Logs show 'null value in column "pay_account_id" violates not-null constraint' on Order entity persistence. Orders saved by the new version work. Orders saved by the old version fail.
Assumption
The developer assumed the old code would use the old column. But the shared database schema had already been altered by the migration.
Root cause
A Flyway V4__add_pay_account.sql ran as part of the new deploy. It added a NOT NULL column with no default. Old code didn't know about it. When the old jar tried to save an Order without populating pay_account_id, Hibernate threw a constraint violation. Rolling back the code didn't revert the schema change.
Fix
1) Pinned the load balancer to new version only (stop splitting traffic between old and new). 2) Added a default value to the new column in a new migration. 3) Reverted the code change permanently after fixing the schema. 4) Made it a pipeline rule that a code rollback never triggers a schema rollback; any schema reversal ships as a new, reviewed forward migration.
Key lesson
  • Database migrations are forward-only.
  • You must design them to be backward-compatible for at least one deploy cycle.
  • Never, ever make a NOT NULL column without a default in a rolling update.
Production debug guide: symptom → root cause → fix for the failures that actually happen
Symptom · 01
After blue/green flip, users report data corruption or missing fields
Fix
Check if you have a sticky session or cache dependency. Run kubectl exec -it <old-pod> -- wget -qO- localhost:8080/actuator/health to verify old pod is actually healthy. Compare Redis keys between old and new versions. The fix: use separate cache namespaces per deploy version, or invalidate cache on rollback.
Symptom · 02
Canary rollout ramps to 10%, then the error rate spikes
Fix
Check if your metrics pipeline (Prometheus/Datadog) has a lag. Your rollout tool might be acting on stale data. The fix: add a cooldown period of 2-3 minutes between traffic shifts. Run kubectl get pods -l version=canary --show-labels to confirm the canary pods exist and carry the labels your traffic split selects on, then check their request counters to confirm they are actually receiving traffic.
Symptom · 03
Git revert of a config change in application.yml — service still fails
Fix
Spring Boot reads configuration once at startup, so a reverted value only takes effect after the pods restart with the reverted config in place. A git revert does nothing by itself if the live ConfigMap or Vault secret was changed separately and never reverted, so check those first. The fix: never use git revert for production configs. Use a versioned config service like Spring Cloud Config or Vault with audit trail.
Symptom · 04
Graceful shutdown takes 30+ seconds, during which requests fail
Fix
Spring Boot's graceful shutdown waits for in-flight requests. If your Kubernetes preStop hook isn't configured, the kubelet sends SIGTERM and the JVM might kill active threads. The fix: set server.shutdown=graceful and configure spring.lifecycle.timeout-per-shutdown-phase=45s. Verify with curl localhost:8080/actuator/health before re-routing traffic.
Debug Cheat Sheet: commands for fast diagnosis in production
Database schema mismatch after rollback
Immediate action
Check current Flyway migration version
Commands
kubectl exec <pod> -- curl -s localhost:8080/actuator/flyway | jq '.contexts[].flywayBeans[].migrations[] | select(.state=="SUCCESS") | .version'
kubectl exec <pod> -- /bin/sh -c 'psql "$DATABASE_URL" -c "SELECT version, script, installed_on FROM flyway_schema_history ORDER BY installed_rank DESC LIMIT 3;"'
Fix now
Roll forward with a new migration that adds default or makes column nullable. Never roll back a migration in production.
Spring Boot app crashes on startup after rollback due to bean creation conflict
Immediate action
Check application logs for bean definition overrides
Commands
kubectl logs --tail=200 <pod> | grep -E "(BeanDefinitionOverrideException|ConflictingBeanDefinition)"
java -jar my-app.jar --debug 2>&1 | grep "Overriding bean definition"
Fix now
Add spring.main.allow-bean-definition-overriding=true temporarily, but the real fix is to ensure your rollback doesn't change bean names across versions.
Rolling update in Kubernetes shows pods in CrashLoopBackOff because the old jar references a removed API
Immediate action
Check readiness probe failures
Commands
kubectl describe pod <pod> | grep -A5 "Readiness probe failed:"
kubectl exec -it <pod> -- wget -qO- http://localhost:8080/actuator/health | jq .
Fix now
Temporarily disable the readiness probe with kubectl patch deployment my-app -p '{"spec":{"template":{"spec":{"containers":[{"name":"my-app","readinessProbe":null}]}}}}' and fix the API dependency. This is duct tape, so fix the dependency ASAP.
Rollback Strategy Comparison for Spring Boot
Strategy | Rollback Time | State Safety | Cost | Complexity
Blue/Green (hot standby) | 10-30 seconds | High (if stateless) | 2x compute during overlap | Medium
Canary (metrics-based) | 10-60 seconds | Medium (database trap) | Low (only canary pods) | High
Rolling Update (kubectl rollout undo) | 30-120 seconds | Low (state not reverted) | None | Low
Helm Rollback | 30-90 seconds | Medium (reverts ConfigMaps/Secrets) | None | Medium
Git Revert + Redeploy | 5-15 minutes | Very low (state never touched) | None | Low

Key takeaways

1. A rollback is a deploy of the previous version, not a revert of the last commit. Code, config, and database must be treated as separate concerns.
2. Database migrations are forward-only. Design every migration to be backward-compatible for at least one deploy cycle. NOT NULL columns without defaults are the #1 rollback killer.
3. Blue/green is the safest rollback strategy, but only if you keep the old stack hot. Cold blue stacks are a rollback trap that turns 10-second flips into 10-minute outages.
4. Your canary metrics must include business-level signals, not just CPU and memory. If you only measure infrastructure, you'll ramp to 100% with a broken app.
5. Graceful shutdown and readiness probes are only as good as their configuration. A readiness probe that passes while the database is down is a liar. Fix it.

Common mistakes to avoid

Using `kubectl rollout undo` without checking database schema state

Symptom
App starts but fails on any database write with column mismatch errors
Fix
Always pair rolling update rollbacks with a schema check. Run flyway info or liquibase status before considering the rollback complete.

Not keeping blue/green old stack warm — scaling it to zero

Symptom
Rollback takes 5-10 minutes because pods must spin up and cache warm
Fix
Keep old stack at full capacity for at least 15 minutes. Use a TTL in your deployment pipeline to auto-scale old stack down after safe period.

Canary deployment without business metrics

Symptom
Error rate looks healthy but conversion rate drops. No alert fires.
Fix
Add custom Micrometer counters for critical business operations (e.g., checkout.success, checkout.failure). Wire them into your canary rollback decision.

Forgetting to invalidate cache on rollback

Symptom
After rollback, users still see new-format data because the cache keeps serving entries that the new version wrote
Fix
Add a cache invalidation step to your rollback playbook. Use Redis SCAN and DEL for patterns, or bump a version cache key that all objects reference.

Assuming Helm rollback reverts everything

Symptom
Helm rollback completes but database migrations remain applied
Fix
Helm rollback only touches Kubernetes resources (Deployments, ConfigMaps). Database state must be managed separately with backward-compatible migrations.
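
For the cache invalidation mistake above, the "bump a version key" approach can be as small as a key-naming helper. This is a minimal sketch with a hypothetical class name; it assumes the version string identifies the data format, not the build number, so that two builds sharing a format also share cache entries.

```java
class VersionedCacheKeys {

    // Prefix every cache key with the format version of the code that wrote it,
    // so versions with different formats never read each other's entries.
    static String key(String formatVersion, String entity, String id) {
        return "v" + formatVersion + ":" + entity + ":" + id;
    }

    public static void main(String[] args) {
        System.out.println(key("1", "order", "42")); // v1:order:42
        System.out.println(key("2", "order", "42")); // v2:order:42
    }
}
```

After a rollback the old code only looks up its own prefix, so new-format entries are ignored and can simply expire; the old-format entries it finds may be stale, which is why short TTLs still matter.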

Interview Questions on This Topic

Q01 (Senior): You push a Spring Boot app with a Flyway migration that adds a NOT NULL ...
Q02 (Senior): Describe the difference between a readiness probe and a liveness probe i...
Q03 (Senior): How do you design a database migration to be backward-compatible for a b...
Q04 (Senior): You have a canary deployment that shows 0% error rate but 100% failed tr...
Q05 (Junior): A junior dev says 'we can just use git revert for rollbacks.' Why is tha...
Q06 (Senior): How can you ensure that a Spring Boot application fails gracefully durin...
Q07 (Senior): Explain the trade-offs between blue/green and canary for a high-traffic ...
Q08 (Senior): Your Spring Boot app's graceful shutdown causes 5-second delays in Kuber...

Q01 of 08 · Senior

You push a Spring Boot app with a Flyway migration that adds a NOT NULL column. You roll back the code with `kubectl rollout undo`. What happens when the old code tries to insert a row?

ANSWER
The old code doesn't know about the new column. Hibernate's schema update is disabled because we set spring.jpa.hibernate.ddl-auto=validate in production. The INSERT will fail with a PostgreSQL error: 'null value in column "new_col" violates not-null constraint'. The fix is to never make a NOT NULL column in a migration unless you backfill all existing rows first. For rollback safety, make the column nullable, backfill, then add the constraint in a separate deploy.
Frequently Asked Questions

01. How do I roll back a Flyway migration in production?
02. What is the difference between `kubectl rollout undo` and `helm rollback`?
03. Can I use a feature flag to make rollbacks safer?
04. How do I verify that my rollback actually worked?
05. What is the best rollback strategy for a stateless Spring Boot microservice?