Rolling Deployments - Missing Readiness Probe 3-Min Error
A readiness probe returning 200 before DB pool init caused 3-minute 503 errors on every deploy.
- Rolling deployments replace old instances with new ones in batches, never taking the whole service down.
- maxUnavailable controls how many pods can be down at once; maxSurge controls how many extra pods can be created.
- Health checks gate every batch — Kubernetes won't send traffic until the new pod passes readiness.
- Version skew is guaranteed: old and new code coexist during rollout, so backward compatibility is mandatory.
- The biggest mistake is missing a readinessProbe: pods start but get traffic before their connection pool warms up.
Every time your team ships a new feature, there's a moment of genuine terror — the deployment window. In the old days, that meant taking your entire app offline at 2am on a Sunday, praying nothing breaks, and issuing a public apology if it does. For teams deploying multiple times a day, that approach simply doesn't scale. It's not just inconvenient; it's a business risk that your competitors have already solved.
Rolling deployments exist to eliminate that terror entirely. Instead of replacing your entire fleet of servers at once, you replace them in small batches. At any given moment, some instances are running the old version and some are running the new version. If something goes wrong with the new version, you've only exposed a fraction of your users to the problem — and you can halt the rollout immediately. The blast radius of a bad deploy shrinks from 'everyone is down' to 'a small percentage of requests hit the broken version for a few minutes'.
By the end of this article you'll understand exactly how a rolling deployment works under the hood, how to configure one in Kubernetes and a CI/CD pipeline, what makes them fail silently, and how to make yours bulletproof. You'll also be able to explain the trade-offs confidently in a system design interview — because rolling deployments come up constantly.
How a Rolling Deployment Actually Works Step by Step
A rolling deployment works by treating your server fleet like a queue. You decide on two numbers: the maximum number of instances you're willing to take offline at once (maxUnavailable) and the maximum number of extra instances you'll spin up during the transition (maxSurge). The deployment controller — whether that's Kubernetes, ECS, or a custom script — then orchestrates a loop.
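To make those two numbers concrete, here is a minimal sketch of how percentage values turn into pod counts. It assumes the Kubernetes behaviour of rounding maxSurge up and maxUnavailable down, with both defaulting to 25%; the function names are hypothetical and exist only for illustration.

```python
import math


def _resolve(value, replicas, round_up):
    """Turn an int or a percentage string like '25%' into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        exact = replicas * int(value[:-1]) / 100
        return math.ceil(exact) if round_up else math.floor(exact)
    return int(value)


def rollout_bounds(replicas, max_surge="25%", max_unavailable="25%"):
    """Return the pod-count limits the controller stays within during a rollout."""
    surge = _resolve(max_surge, replicas, round_up=True)              # rounded up
    unavailable = _resolve(max_unavailable, replicas, round_up=False)  # rounded down
    if surge == 0 and unavailable == 0:
        # With both at zero the rollout could never make progress.
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    return {
        "max_total_pods": replicas + surge,
        "min_available_pods": replicas - unavailable,
    }


if __name__ == "__main__":
    print(rollout_bounds(10))
    print(rollout_bounds(4, max_surge=1, max_unavailable=0))
```

With maxUnavailable set to 0, the rollout can only proceed by surging, so the cluster needs spare capacity for at least one extra pod.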
The loop looks like this: take a small batch of old instances out of the load balancer rotation, drain their in-flight requests, terminate them, start new instances with the new code, wait for those new instances to pass health checks, then add them back to the load balancer. Repeat until every instance is on the new version.
The crucial phrase in that loop is 'health checks'. The system won't move on to the next batch until the new instances actually prove they're healthy. This is what makes rolling deployments safe: the process is gated on real evidence that the new code works, not just the assumption that it compiled and started.
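The loop and its health-check gate can be sketched as a small simulation. This is a toy model, not how any real controller is implemented: the fleet is a list of labels, and is_healthy is a hypothetical callback standing in for the real health check.

```python
def rolling_update(replicas, batch_size, is_healthy):
    """Replace instances batch by batch, halting at the first unhealthy batch.

    Returns (fleet, completed). Each fleet entry is 'old', 'new', or
    'unready' (a new instance that failed its health check).
    """
    fleet = ["old"] * replicas
    start = 0
    while start < replicas:
        batch = range(start, min(start + batch_size, replicas))
        for idx in batch:
            # Start the new instance and record whether it passed its check.
            fleet[idx] = "new" if is_healthy(idx) else "unready"
        if any(fleet[idx] == "unready" for idx in batch):
            # Gate failed: stop here so the remaining old instances keep serving.
            return fleet, False
        start += batch_size
    return fleet, True


if __name__ == "__main__":
    print(rolling_update(6, 2, lambda idx: True))
    print(rolling_update(6, 2, lambda idx: idx != 3))
```

In the failing run, the untouched old instances at the end of the list are the reason a bad version only reaches part of your traffic.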
The downside is that during the rollout, two versions of your code are live simultaneously. If your new API changes a response shape that the old frontend depends on, or your new code writes a database column that the old code doesn't know about, you'll have problems. That's not a flaw in rolling deployments — it's a constraint that forces you to write backward-compatible code, which is a good habit regardless.
Wiring a Rolling Deployment Into Your CI/CD Pipeline
Understanding rolling deployments in isolation is one thing — getting them to fire automatically from a git push is another. The pattern that actually works in production ties three things together: your container registry, your deployment manifest, and a pipeline step that updates the image tag and triggers the rollout.
The anti-pattern is updating the manifest file by hand. The moment a human has to manually edit a YAML file and run kubectl apply, you've introduced the most dangerous variable in software: a tired human at 11pm. Instead, your CI pipeline should build the image, tag it with the exact git commit SHA (not 'latest' — never 'latest'), push it to the registry, and then use a tool like kubectl set image or kustomize to patch the deployment manifest and apply it automatically.
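As a rough illustration of that pipeline step, the sketch below builds the two kubectl invocations a CI job might run: one to point the deployment at the SHA-tagged image, one to wait for the rollout. The helper names, the registry path, and the 180s timeout are assumptions made for this example; only kubectl set image and kubectl rollout status --timeout are real commands.

```python
import re


def image_ref(registry_repo, git_sha):
    """Build an image reference tagged with a git SHA, rejecting tags like 'latest'."""
    if not re.fullmatch(r"[0-9a-f]{7,40}", git_sha):
        raise ValueError("expected a 7-40 character hex git SHA, got %r" % git_sha)
    return f"{registry_repo}:{git_sha}"


def rollout_commands(deployment, container, image, timeout="180s"):
    """Return the kubectl argument lists for triggering and watching a rollout."""
    return [
        # Patch the deployment's container image, which starts the rolling update.
        ["kubectl", "set", "image", f"deployment/{deployment}", f"{container}={image}"],
        # Wait for the rollout to finish or fail, bounded by the timeout.
        ["kubectl", "rollout", "status", f"deployment/{deployment}", f"--timeout={timeout}"],
    ]


if __name__ == "__main__":
    image = image_ref("registry.example.com/myapp", "a3f9c12")  # hypothetical registry
    for cmd in rollout_commands("myapp", "myapp", image):
        print(" ".join(cmd))
```

In a real job each list would be passed to something like subprocess.run(cmd, check=True), so a failed or timed-out rollout fails the pipeline.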
A GitHub Actions workflow is one common way to do exactly this. Because the image tag is the git SHA, every deployed version is traceable to a specific commit. If something goes wrong, you know exactly what changed. You can also use kubectl rollout undo to immediately revert to the previous SHA-tagged image, which is the rolling deployment equivalent of a one-command parachute.
The health check gates in your Deployment manifest are what make this pipeline safe to run on every merge to main. The pipeline doesn't need to babysit the rollout; Kubernetes does that. The pipeline's job is just to hand off the new image and trust the deployment strategy to do its work.
The Version Skew Problem — Why Your Code Must Be Backward Compatible
Here's the scenario nobody warns you about until it burns you. You're rolling out v2 of your user service. v2 renames the JSON field 'user_name' to 'username' (dropping the underscore, a perfectly reasonable cleanup). For about four minutes while the rollout happens, your load balancer is sending some requests to v1 pods and some to v2 pods.
Your frontend is calling the user service and reading 'user_name'. The v1 pods return it correctly. The v2 pods return 'username' instead. For those four minutes, roughly half your users see a broken UI where their name doesn't display. You've just created a production incident during what should have been a routine deploy.
This is called version skew — the period where multiple versions of the same service coexist. It's not optional during a rolling deployment; it's guaranteed. The fix isn't to avoid rolling deployments. The fix is the expand-contract pattern: in v2, return BOTH 'user_name' AND 'username'. In v3 (a later deploy), remove 'user_name'. You expand the interface first, let the world catch up, then contract it.
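The expand-contract sequence for the 'user_name' rename can be sketched as three response serializers. The function names and the shape of the user record are hypothetical; only the two field names come from the scenario above.

```python
def render_user_v1(user):
    """Original response shape."""
    return {"id": user["id"], "user_name": user["name"]}


def render_user_v2(user):
    """Expand: return BOTH field names so old and new clients both work."""
    return {"id": user["id"], "user_name": user["name"], "username": user["name"]}


def render_user_v3(user):
    """Contract: drop the old field once no client reads it any more."""
    return {"id": user["id"], "username": user["name"]}


def old_frontend_display_name(payload):
    """The not-yet-updated frontend only knows about 'user_name'."""
    return payload.get("user_name")


if __name__ == "__main__":
    user = {"id": 7, "name": "ada"}
    for render in (render_user_v1, render_user_v2, render_user_v3):
        print(render.__name__, old_frontend_display_name(render(user)))
```

The old frontend keeps working against v1 and v2 pods alike, which is what makes the v1-to-v2 rollout safe. Only v3 breaks it, so v3 ships after the frontend has switched to 'username'.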
The same principle applies to database migrations. Never drop a column or rename one in the same deploy that changes the code that reads it. Add the new column, deploy code that writes to both, migrate the data, deploy code that only reads the new column, then drop the old one. It's more steps, but each step is individually safe.
Rolling Back a Deployment: The Graceful Undo
No matter how well you test, a bad deploy will eventually happen. The question isn't if — it's how fast you can recover. Rolling deployments give you a built-in escape hatch that doesn't require a full re-deploy of the old version. Kubernetes stores the previous ReplicaSet configuration as a revision, and you can instantly start a rolling rollback with a single command.
The 'kubectl rollout undo' command doesn't snap all instances back at once. It uses the same rolling update strategy — it gradually replaces new pods with the previous version's pods. That means your rollback is just as safe as your forward deployment: health checks gate the process, traffic gradually shifts back, and if the rollback also fails (unlikely but possible), you can undo the rollback.
The catch is that Kubernetes only keeps a limited number of revisions. By default, it retains 10 old ReplicaSets, controlled by the revisionHistoryLimit field. Once you exceed that limit, older revisions are pruned, and you can't undo to them. If you need to roll back to a version from a month ago, you'll need to re-deploy the old image tag, not rely on rollout undo.
Another gotcha: if you accidentally deployed the same version twice (e.g., re-pushed the same image with a new tag but identical code), rollout undo will roll you back to the same broken version. Always verify the image tag in the previous ReplicaSet before trusting an undo.
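The pruning behaviour can be pictured with a deliberately simplified model. It assumes every deploy creates a new revision numbered from 1 and that no rollbacks have happened in between; undo_targets is a hypothetical helper, not part of any Kubernetes client.

```python
def undo_targets(deploy_count, revision_history_limit=10):
    """List the old revisions still retained after `deploy_count` deploys.

    The newest revision is the one currently running, so only the
    revisions before it are candidates for a rollback.
    """
    if deploy_count < 1 or revision_history_limit <= 0:
        return []
    old_revisions = list(range(1, deploy_count))
    # Only the most recent `revision_history_limit` old revisions are kept.
    return old_revisions[-revision_history_limit:]


if __name__ == "__main__":
    print(undo_targets(3))
    print(undo_targets(15))
```

After 15 deploys with the default limit, revisions 1 through 4 are gone in this model, so getting back to them means re-deploying their image tags.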
Database Migrations with Rolling Deployments: The Safe Way
If there's one thing that causes more rolling deployment failures than anything else, it's database schema changes. The problem is fundamental: your database is a shared, stateful resource. You can't have two versions of the schema while two versions of your code are running. Or can you?
The answer is the expand-contract pattern applied to database changes. Let's walk through a concrete example: adding a NOT NULL column to the users table. The naive approach: write one migration that adds the column with a NOT NULL constraint and no database-level default, and deploy the code that populates it in the same release. During the rolling rollout, old pods try to insert a row and fail because the NOT NULL constraint is already in place but the old code doesn't populate the column.
The safe approach:
- Expand: Add the column as nullable (no NOT NULL constraint). Deploy this migration while old pods are still running. The old code ignores the column.
- Backfill: Populate existing rows with a default value. This can be done as a background job or inline migration.
- Deploy new code: The new code populates the column on inserts. Old pods still run during the rollout and don't set the column; because it is nullable, their rows simply store NULL.
- Contract: Once all pods are on the new code, backfill any rows that old pods wrote as NULL during the rollout, then run a second migration to add the NOT NULL constraint. This is safe because every pod now writes the column.
Each step is individually reversible. If something goes wrong while deploying the new code, you can roll back the code without having to revert the schema.
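The steps above can be walked through against an in-memory SQLite database. This is a sketch under assumptions: the locale column and the helper names are invented for the example, and the final NOT NULL migration is left out because its syntax is engine-specific. The sketch stops at checking that the contract step would be safe.

```python
import sqlite3


def expand(conn):
    """Expand: add the column as nullable so old code can ignore it."""
    conn.execute("ALTER TABLE users ADD COLUMN locale TEXT")


def insert_old_code(conn, name):
    """An old pod's insert, which does not know about the new column."""
    conn.execute("INSERT INTO users (name) VALUES (?)", (name,))


def insert_new_code(conn, name, locale):
    """A new pod's insert, which populates the new column."""
    conn.execute("INSERT INTO users (name, locale) VALUES (?, ?)", (name, locale))


def backfill(conn, default_locale):
    """Fill rows that still hold NULL; returns how many rows were updated."""
    cur = conn.execute(
        "UPDATE users SET locale = ? WHERE locale IS NULL", (default_locale,)
    )
    return cur.rowcount


def safe_to_contract(conn):
    """True when no NULLs remain, so adding NOT NULL would not fail."""
    (nulls,) = conn.execute(
        "SELECT COUNT(*) FROM users WHERE locale IS NULL"
    ).fetchone()
    return nulls == 0


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    insert_old_code(conn, "alice")        # row that exists before the migration
    expand(conn)
    insert_old_code(conn, "bob")          # old pod still works mid-rollout
    insert_new_code(conn, "carol", "fr")  # new pod writes the column
    before = safe_to_contract(conn)
    updated = backfill(conn, "en")
    return before, updated, safe_to_contract(conn)


if __name__ == "__main__":
    print(demo())
```

Running the backfill after the last old pod is gone matters: any row an old pod writes afterwards would reintroduce a NULL and make the constraint migration fail.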
| Aspect | Rolling Deployment | Blue/Green Deployment |
|---|---|---|
| Downtime during deploy | Zero — traffic shifts gradually | Zero — traffic switches atomically |
| Resource cost | Low — only maxSurge extra instances needed | High — requires a full duplicate environment |
| Rollback speed | Slow — must re-roll forward or wait for undo | Instant — flip the load balancer back to blue |
| Version skew risk | High — two versions serve traffic simultaneously | None — only one version is live at any time |
| Best for | Stateless services with backward-compatible APIs | Stateful apps or high-risk releases needing instant rollback |
| Database migrations | Requires expand-contract pattern across multiple deploys | Easier: the green env runs the migration before cutover |
| Complexity | Low — built into Kubernetes natively | High — requires managing two environments |
| Traffic control | Coarse — batch-based percentage splits | Precise — can do weighted routing with a proxy |
Key Takeaways
- Rolling deployments replace instances in small batches — health checks gate every batch, so a bad deploy can only affect the fraction of traffic hitting the new instances, not everyone at once.
- The expand-contract pattern is non-negotiable for rolling deployments — any API field rename, database column change, or message schema update must be deployed in at least two phases to survive the version skew window.
- Never tag deployment images as 'latest' — use the git commit SHA so every running version is traceable to a specific code change and kubectl rollout undo reverts to a known, specific state.
- Running kubectl rollout status with a --timeout after you trigger the rollout is what separates a robust pipeline from a false-green one. Without that step, your CI reports success the moment the deployment is triggered, not when it actually finishes.
- Rollback uses the same rolling strategy — it's not instant. Plan incident response around that latency, not against it.
Common Mistakes to Avoid
- Using the 'latest' image tag in your deployment
Symptom: Your rollout appears to succeed but Kubernetes may not actually pull a new image because 'latest' was already cached on the node. The symptom is kubectl rollout status saying 'successfully rolled out' but your new code never actually running.
Fix: Always tag images with the git commit SHA (e.g. myapp:a3f9c12) and set imagePullPolicy: Always in your container spec. This guarantees every rollout pulls a specific, traceable image.
- Skipping the readinessProbe
Symptom: New pods start, pass the liveness check (meaning the process is running), and immediately receive live traffic — but your app takes 15 seconds to warm up its database connection pool. The symptom is a spike of 503 errors on every deploy, right as new pods come online.
Fix: Configure a separate readinessProbe that hits a /health/ready endpoint which returns 200 only after the app has fully initialized. Kubernetes will hold the pod out of rotation until it genuinely passes.
- Making a breaking database change in the same deploy as the code change
Symptom: You add a NOT NULL column to the users table and deploy the code that populates it in the same release. During the rollout, old pods try to insert rows without that column and crash with a database constraint violation.
Fix: Decouple schema changes from code changes using the expand-contract pattern. Deploy the column as nullable first, ship code that populates it, backfill existing rows, then add the NOT NULL constraint in a later deploy.
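For the readinessProbe mistake in particular, the application side of the fix can be sketched with the standard library alone. The /health/ready path comes from the fix above. The /health/live path, the pool_ready flag, and the handler name are assumptions for illustration, and a real service would set the flag from its own connection pool startup code.

```python
import threading
from http.server import BaseHTTPRequestHandler

# Set by the app's startup code once the (hypothetical) DB pool is warm.
pool_ready = threading.Event()


def readiness_status():
    """200 only after initialization has finished, 503 until then."""
    return 200 if pool_ready.is_set() else 503


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health/ready":
            status = readiness_status()  # gates traffic to this pod
        elif self.path == "/health/live":
            status = 200                 # the process is up, nothing more
        else:
            status = 404
        self.send_response(status)
        self.end_headers()


if __name__ == "__main__":
    print(readiness_status())  # before the pool is ready
    pool_ready.set()           # startup code would call this after pool init
    print(readiness_status())
```

Serving HealthHandler with http.server.HTTPServer and pointing the pod's readinessProbe at /health/ready keeps the pod out of rotation until the flag is set. Returning 200 from that endpoint unconditionally would recreate the 503 spike this article started with.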
Interview Questions on This Topic
- Q: What is version skew in the context of rolling deployments, and how does the expand-contract pattern solve it? Give a concrete database example. (Senior)
- Q: Walk me through what maxUnavailable and maxSurge actually control in a Kubernetes rolling deployment. What happens at the resource level if you set maxUnavailable to 0 on a cluster that's at 90% capacity? (Senior)
- Q: Your rolling deployment shows as 'successful' in the CI pipeline but users are reporting errors for about three minutes on every release. What are the three most likely root causes and how would you diagnose each one? (Mid-level)
Frequently Asked Questions
What is the difference between a rolling deployment and a blue/green deployment?
A rolling deployment gradually replaces old instances with new ones, so both versions serve traffic simultaneously during the rollout. A blue/green deployment runs two full identical environments and switches all traffic at once with a load balancer flip. Blue/green gives you instant rollback and zero version skew, but costs roughly double the infrastructure. Rolling deployments are cheaper but require your code to handle two versions coexisting.
How do I roll back a Kubernetes rolling deployment that went wrong?
Run kubectl rollout undo deployment/your-deployment-name. Kubernetes stores the previous ReplicaSet configuration and will immediately start rolling back to it using the same rolling strategy. You can also target a specific revision with --to-revision=2. This is why tagging images with git SHAs matters — the rollback actually goes back to a known, specific version of your code.
Can I use rolling deployments with a stateful service like a database?
Directly rolling out a stateful database (like a primary PostgreSQL instance) with a standard rolling deployment is dangerous because your data and schema are shared state — it's not like stateless app servers where any instance is interchangeable. For databases, blue/green or maintenance-window deployments are safer. Rolling deployments work well for stateless application services that sit in front of a database, as long as you handle schema migrations separately using the expand-contract pattern.
What is the default revision history limit in Kubernetes and why does it matter for rollback?
The default revisionHistoryLimit is 10. Once you exceed that, older ReplicaSets are pruned and cannot be rolled back to using rollout undo. If you need to roll back to a month-old version, you'll have to re-deploy the old image tag. For critical services, increase the limit to 20 or 30, or use a deployment tool that stores image tags externally.