Your Rollback Is a Lie: Real Blue/Green and Canary Strategies for Spring Boot
Stop faking rollbacks.
- Blue/green requires traffic switching at the load balancer level, not just a new pod
- Canary rollouts need metrics-driven traffic shifting (request latency, error rate, 5xx count)
- Database schema changes are the #1 reason rollbacks fail — always plan for backward-compatible schema
- Stateful sessions and distributed caches (Redis, Hazelcast) break naive rollback strategies
- Spring Boot's graceful shutdown and readiness probes are useless if your load balancer ignores them
Imagine you're a chef who just changed the recipe for the house special. You don't swap the whole menu at once — you make one plate with the new recipe, taste it, then decide. A rollback is tossing that new batch and going back to the old recipe. If you changed the flour supplier (database schema), you can't just switch back — the old recipe won't work with new flour. That's when dinner service explodes.
You just pushed a Spring Boot jar to production. Five minutes later, error rates spike. Customers can't log in. Orders fail. Your boss is staring. You think "I'll just roll back." So you re-deploy the old image. Traffic returns. But now every order shows a null user. Welcome to Tuesday.
I've seen this exact scenario four times in my career. Each time, the developer who pushed the "simple fix" swore they could roll back safely. Two of them cost the company over $100k in lost revenue. One cost a CTO his job. The common thread? They didn't understand the difference between rolling back code and rolling back state.
A Spring Boot application is more than a jar file. It's connected to a database, a message queue, a cache, and external APIs. When you deploy a new version that changes the schema, writes new cache keys, or publishes events of a different shape, the old version can't read or process what the new version left behind. This is the rollback trap.
Most teams think about deployment strategies. Few think about rollback strategies. They treat rollback as a revert button. It's not. It's an operational maneuver that requires planning before you ever push a deploy. This article will show you what actually breaks, what commands to run when it does, and how to design your Spring Boot pipelines so rollback is boring, not terrifying.
Why Git Revert Is Not a Production Rollback
You pushed a commit. You see it break. Your instinct says git revert HEAD and redeploy. That's fine for a staging environment. In production, that revert might remove a database migration file that already ran. Flyway sees the file is gone and throws an exception. Now your app won't start. You've made things worse.
I once worked with a team that used git revert as their rollback strategy. It worked exactly twice. The third time, the revert pulled out a Flyway migration. The pod wouldn't start because Flyway expected a migration that no longer existed. They spent 45 minutes manually inserting rows into the flyway_schema_history table to unstick the schema.
The problem is that source control and runtime state are different things. Your database, cache, and message queue don't care about git history. They only care about the current state. A rollback of code doesn't reverse mutations made to external systems. You need a strategy that explicitly handles state reversal.
For Spring Boot specifically, your rollback must handle three things: code version (the jar), configuration (application.yml, ConfigMap, Vault), and schema (Flyway/Liquibase). If any one of these is out of sync, you have a partial rollback. Partial rollbacks are worse than no rollback — they give the illusion of safety while your data rots.
The rule: never rely on git for a production rollback. Keep your old artifacts versioned in a repository (Nexus, Artifactory) with immutable tags. When you roll back, deploy the exact older jar, not a git revert.
Blue/Green Deployment: The Old Stack Must Be Hot, Not Cold
Blue/green is the most reliable rollback strategy. You keep two identical environments. You route traffic to blue. You deploy the new version to green. Test it. Flip traffic. If it breaks, flip back. Sounds simple. The reality is that "keeping the old stack hot" costs money and requires management overhead.
I've seen teams cut corners. They scale down the blue environment to zero after the green flip. Then when they need to roll back, they have to wait 5 minutes for pods to spin up. Those 5 minutes are an eternity when every request fails. The rollback that should take 10 seconds takes 5 minutes. Customers leave.
The rule: keep the old stack at full production capacity for at least 15 minutes after the flip. Yes, it doubles your compute cost for 15 minutes. That's the price of safety. If your CTO pushes back, ask them how much an outage costs per minute. The math is easy.
For Spring Boot specifically, blue/green requires stateless design. If your app stores session data in-memory or uses a local cache, flipping traffic breaks those sessions. Use Redis or Hazelcast for sessions and cache. The new version connects to the same cache cluster. That way a flip back doesn't drop users.
The load balancer matters too. Your NGINX or AWS ALB must support weighted traffic switching and health checks. If your health check passes but the app returns 500, you've got a false positive. Always deep-check: hit /actuator/health and verify it returns 200 with a readable body.
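A deep check along those lines can be sketched in plain Java with the JDK's built-in HTTP client. This is a minimal sketch, not a definitive implementation: the class name `DeepHealthCheck`, the timeouts, and the assumption that the aggregate `status` field appears first in the Actuator response body are all illustrative choices.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DeepHealthCheck {
    // Assumption: the aggregate status is the first "status" field in the body.
    private static final Pattern STATUS =
            Pattern.compile("\"status\"\\s*:\\s*\"([A-Z_]+)\"");

    /** Pure decision: healthy only on HTTP 200 AND an aggregate status of UP. */
    static boolean isHealthy(int httpStatus, String body) {
        if (httpStatus != 200 || body == null) {
            return false;
        }
        Matcher m = STATUS.matcher(body);
        return m.find() && "UP".equals(m.group(1));
    }

    /** Calls /actuator/health; any I/O failure or timeout counts as unhealthy. */
    static boolean probe(String baseUrl) {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2)) // hypothetical timeout
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(baseUrl + "/actuator/health"))
                .timeout(Duration.ofSeconds(3))        // hypothetical timeout
                .GET()
                .build();
        try {
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            return isHealthy(response.statusCode(), response.body());
        } catch (IOException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Checking the body as well as the status code guards against a proxy or default page answering 200 on the app's behalf. Calling `DeepHealthCheck.probe("http://localhost:8080")` before a flip is one way to use it.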
Canary Deployments: Metrics-Driven Traffic Shifting
Canary deployments release your change to a small subset of users first. If metrics hold steady, you ramp the percentage up. If they degrade, you cut the canary off. The key phrase is "metrics hold steady." Most teams set up canary deployments without defining what "steady" means.
I audited a team that used canary deployments for every change. Their metrics were: CPU usage, memory, and request count. Those are infrastructure metrics. They don't tell you anything about business impact. A new Spring Boot version could return wrong data to 1% of users and CPU would look fine. Your canary would ramp to 100% while customer support explodes.
You need application-level metrics for a real canary. Error rate (HTTP 5xx), p99 latency, and business-specific metrics like "completed checkout" or "failed login attempts." Spring Boot Actuator exposes Micrometer metrics. Push those to Prometheus or Datadog. Configure your canary tool to watch those specific metrics.
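To make "steady" concrete, the decision can be written down as a small gate that compares the canary against the baseline over the same time window. The sketch below is an assumption-laden illustration: the class `CanaryGate`, the `Snapshot` fields, and every threshold are hypothetical and would need tuning against your own traffic. It does not call Prometheus, Datadog, or any canary tool.

```java
final class CanaryGate {
    enum Decision { PROMOTE, HOLD, ROLLBACK }

    /** Counters for one version, collected over the same time window. */
    static final class Snapshot {
        final long requests;
        final long serverErrors;      // HTTP 5xx responses
        final double p99Millis;
        final long checkoutAttempts;  // business metric (hypothetical)
        final long checkoutSuccesses;

        Snapshot(long requests, long serverErrors, double p99Millis,
                 long checkoutAttempts, long checkoutSuccesses) {
            this.requests = requests;
            this.serverErrors = serverErrors;
            this.p99Millis = p99Millis;
            this.checkoutAttempts = checkoutAttempts;
            this.checkoutSuccesses = checkoutSuccesses;
        }

        double errorRate() {
            return requests == 0 ? 0.0 : (double) serverErrors / requests;
        }

        double checkoutSuccessRate() {
            return checkoutAttempts == 0 ? 1.0 : (double) checkoutSuccesses / checkoutAttempts;
        }
    }

    // Hypothetical thresholds; tune them for your own system.
    static final long MIN_CANARY_REQUESTS = 500;
    static final double MAX_ERROR_RATE_INCREASE = 0.01; // absolute increase
    static final double MAX_P99_RATIO = 1.25;           // canary p99 / baseline p99
    static final double MAX_CHECKOUT_DROP = 0.02;       // absolute drop

    static Decision decide(Snapshot baseline, Snapshot canary) {
        // Too little traffic to judge: keep the current weight and wait.
        if (canary.requests < MIN_CANARY_REQUESTS) {
            return Decision.HOLD;
        }
        if (canary.errorRate() > baseline.errorRate() + MAX_ERROR_RATE_INCREASE) {
            return Decision.ROLLBACK;
        }
        if (canary.p99Millis > baseline.p99Millis * MAX_P99_RATIO) {
            return Decision.ROLLBACK;
        }
        // The business metric catches "wrong data, healthy infrastructure".
        if (canary.checkoutSuccessRate() < baseline.checkoutSuccessRate() - MAX_CHECKOUT_DROP) {
            return Decision.ROLLBACK;
        }
        return Decision.PROMOTE;
    }
}
```

The last check is the one that infrastructure metrics cannot replace: a canary with normal error rate and latency but a collapsed checkout success rate still gets rolled back.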
When the canary fails, your rollback is instant: you stop sending traffic to the canary pods. But the database problem still exists. If the canary version wrote records with a new schema, you're stuck. That's why canary deployments require database changes to be backward-compatible for at least two versions. The canary writes new-format data while the old version keeps writing old-format data. If the canary is rolled back, the old version must still be able to read the new-format records the canary left behind.
Consider using a feature flag (LaunchDarkly, Flagsmith) for schema changes. Deploy the code that can read both formats. Flip the flag on for canary users. If the canary fails, flip the flag off. No code rollback needed.
Rolling Update Rollback: Kubernetes' Worst Default Behavior
Kubernetes rolling updates have a rollback feature. You run kubectl rollout undo deployment my-app. It re-pulls the previous image. That's it. It doesn't touch the database. It doesn't invalidate caches. It doesn't revert config maps. It just runs an older Docker image.
This is the most dangerous rollback strategy because it gives the illusion of a full rollback. The app starts. Health checks pass. But the database has new columns. The cache has stale keys. The message queue has undeliverable events. You're running old code against new state. That's a ticking time bomb.
I once helped a team recover from exactly this. They ran kubectl rollout undo and saw green health checks. Two hours later, a scheduled batch job ran that consumed the new-format events from the queue. The old code couldn't deserialize them. The job failed. Then it retried. It failed again. The queue backed up. The batch job backlog grew to 4 hours before anyone noticed.
The fix is to never use kubectl rollout undo as your only rollback mechanism. Instead, deploy through a Helm chart. Each upgrade records a new revision of the release, and rollback is helm rollback my-app <revision>. Because Helm re-applies the chart as rendered for that revision, it also reverts the ConfigMaps and Secrets the chart manages, which kubectl rollout undo cannot. It still does not touch the database.
If you're stuck with rolling updates for some reason, you must make your Spring Boot application resilient to schema drift and state changes. Use repository patterns that detect schema changes and fail gracefully. Use idempotent queue consumers. Use cache keys that expire quickly. Defensive coding is your only safety net.
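The idempotent-consumer idea can be sketched without any messaging library. The example below is illustrative only: `TolerantConsumer`, the `Event` shape, and the `schemaVersion` field are hypothetical, and the in-memory set stands in for a durable deduplication store. It shows two behaviours: a redelivered event is applied once, and an event in a format the running code does not understand is parked instead of being retried forever.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class TolerantConsumer {
    // Highest event format this (old) code version can deserialize. Hypothetical.
    static final int MAX_SUPPORTED_SCHEMA_VERSION = 1;

    static final class Event {
        final String id;
        final int schemaVersion;
        final String payload;

        Event(String id, int schemaVersion, String payload) {
            this.id = id;
            this.schemaVersion = schemaVersion;
            this.payload = payload;
        }
    }

    enum Outcome { PROCESSED, DUPLICATE, PARKED }

    // Stand-in for a durable store of processed event IDs.
    private final Set<String> processedIds = new HashSet<>();
    // Stand-in for a dead-letter queue.
    private final List<Event> parked = new ArrayList<>();
    // Stand-in for the real side effect (database write, API call, ...).
    private final List<String> applied = new ArrayList<>();

    Outcome handle(Event event) {
        // Unknown format: park it for a newer version instead of failing and retrying.
        if (event.schemaVersion > MAX_SUPPORTED_SCHEMA_VERSION) {
            parked.add(event);
            return Outcome.PARKED;
        }
        // Set.add returns false when the ID was already seen, so redelivery is a no-op.
        if (!processedIds.add(event.id)) {
            return Outcome.DUPLICATE;
        }
        applied.add(event.payload);
        return Outcome.PROCESSED;
    }

    List<Event> parked() {
        return Collections.unmodifiableList(parked);
    }

    List<String> applied() {
        return Collections.unmodifiableList(applied);
    }
}
```

In a real service the processed-ID check and the side effect would need to commit together, otherwise a crash between the two reintroduces duplicates.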
Never combine kubectl rollout undo with a database migration tool that auto-applies changes. Flyway on startup + rolling update rollback is a guaranteed footgun. Either the migration runs and fails, or it doesn't run and you're inconsistent.
... kubectl rollout undo that didn't revert database changes. Every. Single. Time.
Database Migrations: The Rollback Trap You Must Design For
Database migrations are the single biggest barrier to safe rollbacks. Flyway and Liquibase are great tools, but they're forward-only by design. You can run flyway undo in some editions, but it's not a production-safe feature. Undoing a migration can delete data. Deleting data is rarely a safe operation.
The alternative is to design every migration to be backward-compatible. That means: no drops. No renames. No NOT NULL columns without a default. No schema changes that the old code can't tolerate. You stage changes over multiple deploys.
Here's the pattern I use in production. Deploy 1: Add the new column as nullable. Both old and new code ignore it. Deploy 2: Update the new code to read and write the new column. Old code still works. Deploy 3: Backfill the new column for all existing rows. Deploy 4: Add a NOT NULL constraint. Deploy 5: Drop the old column. Every step is reversible because the code at each stage supports both the old and new schema.
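The code side of the second deploy can be sketched as a mapper that reads the new column when it is populated, falls back to the old one, and writes both. This is a minimal illustration under stated assumptions: the column names `full_name` (old) and `display_name` (new) are hypothetical, and a plain `Map` stands in for a database row rather than JPA or JDBC.

```java
import java.util.HashMap;
import java.util.Map;

final class DualSchemaUserMapper {
    static final String OLD_COLUMN = "full_name";    // hypothetical legacy column
    static final String NEW_COLUMN = "display_name"; // hypothetical new column

    /** Prefer the new column; fall back to the old one for rows not yet backfilled. */
    static String readName(Map<String, Object> row) {
        Object value = row.get(NEW_COLUMN);
        if (value == null) {
            value = row.get(OLD_COLUMN);
        }
        return value == null ? null : value.toString();
    }

    /** Write both columns so a rolled-back version still finds its data. */
    static Map<String, Object> writeName(String name) {
        Map<String, Object> columns = new HashMap<>();
        columns.put(OLD_COLUMN, name);
        columns.put(NEW_COLUMN, name);
        return columns;
    }
}
```

Writing both columns is what keeps the rollback boring: the previous jar only knows the old column, and it keeps finding current values there until the final deploy removes it.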
This is slow. It takes 3-5 deploys to change a single column. That's fine. Deploy velocity isn't more important than data integrity. If your organization can't tolerate a 24-hour window for a schema change, they haven't experienced a data loss incident yet. They will.
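For Spring Boot specifically, use spring.flyway.baseline-on-migrate=true only when you introduce Flyway to an existing, non-empty schema: it records a baseline version so that migrations up to that version are treated as already applied. Use spring.flyway.out-of-order=true carefully, and only if you know what you're doing. And never, ever set spring.flyway.clean-disabled=false in production. That enables flyway clean, which drops every object in the configured schemas. Keeping clean disabled is exactly the safety net you need when a migration goes wrong and someone reaches for the wrong command.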
Graceful Shutdown and Readiness Probes: The Silent Saboteurs
Spring Boot's graceful shutdown is configured with server.shutdown=graceful. This tells the embedded Tomcat to stop accepting new requests and wait for in-flight requests to complete. But it only works if Kubernetes (or whatever orchestrator) respects the SIGTERM and gives the JVM time to finish.
Most Kubernetes deployments don't configure a preStop hook. The kubelet sends a SIGTERM and then waits for the terminationGracePeriodSeconds (default 30 seconds). If your app is still processing a request after 30 seconds, the kubelet sends a SIGKILL. Your app dies mid-request. Users see 502 errors.
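The fix: add a preStop hook that sleeps for 10 seconds to let the load balancer detect the pod is not ready, then set terminationGracePeriodSeconds to 60 or higher. In your application.yml, set spring.lifecycle.timeout-per-shutdown-phase=45s. The grace period starts counting when termination begins, so the preStop sleep uses 10 of the 60 seconds. That leaves 50 seconds after SIGTERM: up to 45 for in-flight requests to finish, plus about 5 seconds of headroom before the kill signal.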
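One way to sanity-check these timings is to compute the budget explicitly. The sketch below assumes the Kubernetes behaviour that the grace-period countdown begins when pod termination starts, so time spent in the preStop hook is deducted from the same budget. The class and method names are hypothetical.

```java
import java.time.Duration;

final class ShutdownBudget {
    /**
     * Time left between the end of Spring's shutdown phase and SIGKILL.
     * A negative result means the JVM can be killed mid-request.
     */
    static Duration headroom(Duration terminationGracePeriod,
                             Duration preStopSleep,
                             Duration shutdownPhaseTimeout) {
        // preStop runs inside the grace period, so subtract it first.
        return terminationGracePeriod.minus(preStopSleep).minus(shutdownPhaseTimeout);
    }

    /** Safe only when some headroom remains after the shutdown phase. */
    static boolean isSafe(Duration terminationGracePeriod,
                          Duration preStopSleep,
                          Duration shutdownPhaseTimeout) {
        Duration left = headroom(terminationGracePeriod, preStopSleep, shutdownPhaseTimeout);
        return !left.isNegative() && !left.isZero();
    }
}
```

With a 60-second grace period, a 10-second preStop sleep, and a 45-second shutdown phase, `headroom` returns 5 seconds. With the default 30-second grace period and the same other values it returns minus 25 seconds, which means the pod would be killed before Spring finishes draining.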
But that's only half the problem. Your readiness probe must reflect the actual healthy state of the app. Not just "is Tomcat running?" but "can I connect to the database, the queue, and the cache?" If your readiness probe passes but the database is down, traffic routes to a broken pod.
Spring Boot Actuator's health probe aggregates health indicators. Customize it. Add a database health check, a Redis health check, and a Kafka health check. Make readiness fail if any downstream dependency is unavailable. This prevents the load balancer from routing traffic to pods that can't serve requests.
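The aggregation rule itself is simple enough to show in plain Java. The sketch below is a stand-in for the idea, not the Actuator API: in a real application you would more likely rely on Actuator's health indicators and health groups (for example the management.endpoint.health.group.readiness.include property). `ReadinessAggregator` and the check names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

final class ReadinessAggregator {
    // Insertion order is kept so the first failure reported is deterministic.
    private final Map<String, BooleanSupplier> checks = new LinkedHashMap<>();

    ReadinessAggregator register(String name, BooleanSupplier check) {
        checks.put(name, check);
        return this;
    }

    /** Returns the name of the first failing dependency, or null when all pass. */
    String firstFailure() {
        for (Map.Entry<String, BooleanSupplier> entry : checks.entrySet()) {
            boolean ok;
            try {
                ok = entry.getValue().getAsBoolean();
            } catch (RuntimeException e) {
                // A check that throws (connection refused, timeout) counts as down.
                ok = false;
            }
            if (!ok) {
                return entry.getKey();
            }
        }
        return null;
    }

    /** Ready only when every registered dependency check passes. */
    boolean isReady() {
        return firstFailure() == null;
    }
}
```

A typical wiring would register one check per dependency, such as `register("db", ...)`, `register("redis", ...)`, and `register("kafka", ...)`, and report not-ready with the failing name so the on-call engineer sees which dependency took the pod out of rotation.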
Use the /actuator/info endpoint to expose the current build version. The load balancer can check this to ensure it's routing to the correct version during a blue/green flip. Add info.app.version=@project.version@ to application.properties, or the equivalent key in application.yml with the value quoted.
The Schema That Wouldn't Roll Back
- Database migrations are forward-only.
- You must design them to be backward-compatible for at least one deploy cycle.
- Never, ever make a NOT NULL column without a default in a rolling update.
- Run kubectl exec -it <old-pod> -- wget -qO- localhost:8080/actuator/health to verify the old pod is actually healthy. Compare Redis keys between old and new versions. The fix: use separate cache namespaces per deploy version, or invalidate cache on rollback.
- Run kubectl get pods -l version=canary --show-labels to confirm the canary pods exist and carry the labels your traffic routing selects on.
- If a preStop hook isn't configured, the kubelet sends SIGTERM immediately and the JVM may be killed with requests still in flight. The fix: set server.shutdown=graceful and configure spring.lifecycle.timeout-per-shutdown-phase=45s. Verify with curl localhost:8080/actuator/health before re-routing traffic.
- List the migrations Flyway reports as applied (the Actuator flyway endpoint nests them under contexts and flywayBeans):
  kubectl exec -it <pod> -- curl -s localhost:8080/actuator/flyway | jq '.contexts[].flywayBeans[].migrations[] | select(.state=="SUCCESS") | .version'
- Show the last three rows of the schema history table (single quotes on the outside so that $DATABASE_URL expands inside the pod, not in your local shell):
  kubectl exec -it <pod> -- /bin/bash -c 'psql "$DATABASE_URL" -c "SELECT version, script, installed_on FROM flyway_schema_history ORDER BY installed_rank DESC LIMIT 3;"'
Common mistakes to avoid
- Using `kubectl rollout undo` without checking database schema state. Run flyway info or liquibase status before considering the rollback complete.
- Not keeping the blue/green old stack warm, and scaling it to zero instead.
- Canary deployment without business metrics. Track business-level counters (for example, checkout.success, checkout.failure). Wire them into your canary rollback decision.
- Forgetting to invalidate cache on rollback.
- Assuming Helm rollback reverts everything.
Interview Questions on This Topic
You push a Spring Boot app with a Flyway migration that adds a NOT NULL column. You roll back the code with `kubectl rollout undo`. What happens when the old code tries to insert a row?
... spring.jpa.hibernate.ddl-auto=validate in production. The INSERT will fail with a PostgreSQL error: 'null value in column "new_col" violates not-null constraint'. The fix is to never make a NOT NULL column in a migration unless you backfill all existing rows first. For rollback safety, make the column nullable, backfill, then add the constraint in a separate deploy.