Rolling Deployments - Missing Readiness Probe 3-Min Error
A readiness probe returning 200 before DB pool init caused 3-minute 503 errors on every deploy.
20+ years shipping production infrastructure and CI/CD at scale. Lessons pulled from things that broke in production.
- Rolling deployments replace old instances with new ones in batches, never taking the whole service down.
- maxUnavailable controls how many pods can be down at once; maxSurge controls how many extra pods can be created.
- Health checks gate every batch — Kubernetes won't send traffic until the new pod passes readiness.
- Version skew is guaranteed: old and new code coexist during rollout, so backward compatibility is mandatory.
- The biggest mistake is missing a readinessProbe: pods start but get traffic before their connection pool warms up.
Imagine a restaurant that wants to swap out every table with a fancier one. Instead of closing for the day, they replace one table at a time while customers keep eating. Some diners sit at the old tables, some at the new ones — but the restaurant never closes. That's a rolling deployment: you swap out old servers running old code for new ones, gradually, while real users keep getting served without ever seeing a 'down for maintenance' page.
Every time your team ships a new feature, there's a moment of genuine terror — the deployment window. In the old days, that meant taking your entire app offline at 2am on a Sunday, praying nothing breaks, and issuing a public apology if it does. For teams deploying multiple times a day, that approach simply doesn't scale. It's not just inconvenient; it's a business risk that your competitors have already solved.
Rolling deployments exist to eliminate that terror entirely. Instead of replacing your entire fleet of servers at once, you replace them in small batches. At any given moment, some instances are running the old version and some are running the new version. If something goes wrong with the new version, you've only exposed a fraction of your users to the problem — and you can halt the rollout immediately. The blast radius of a bad deploy shrinks from 'everyone is down' to 'a small percentage of requests hit the broken version for a few minutes'.
By the end of this article you'll understand exactly how a rolling deployment works under the hood, how to configure one in Kubernetes and a CI/CD pipeline, what makes them fail silently, and how to make yours bulletproof. You'll also be able to explain the trade-offs confidently in a system design interview — because rolling deployments come up constantly.
How Rolling Deployments Actually Work
A rolling deployment replaces instances of an old application version with a new one incrementally, keeping the service available throughout. The core mechanic: you update a subset of instances (e.g., 25% of a 100-instance cluster) to the new version, verify they're healthy, then proceed to the next batch. This avoids downtime and allows gradual exposure to changes.
Key properties: batch size (how many instances updated at once), max surge (extra instances allowed during update), and max unavailable (instances allowed to be down). In Kubernetes, a Deployment with strategy type RollingUpdate uses these to control the pace. The readiness probe is critical — if it's missing or misconfigured, the controller may consider a broken pod "ready" and continue rolling, causing widespread failure within minutes.
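To make the two knobs concrete, here is a minimal sketch of how percentage values turn into pod counts. It assumes the Kubernetes rounding rules (a percentage maxSurge rounds up, a percentage maxUnavailable rounds down); the function name `rollout_bounds` is made up for illustration.

```python
def rollout_bounds(replicas: int, max_surge_pct: int, max_unavailable_pct: int):
    """Return (max_total_pods, min_available_pods) during a RollingUpdate.

    Assumption: percentages are resolved the way Kubernetes documents it,
    with maxSurge rounded up and maxUnavailable rounded down.
    """
    surge = (replicas * max_surge_pct + 99) // 100      # integer ceiling
    unavailable = (replicas * max_unavailable_pct) // 100  # integer floor
    return replicas + surge, replicas - unavailable


# 10 replicas with 25% / 25%: at most 13 pods exist, at least 8 stay available.
bounds = rollout_bounds(10, 25, 25)
```

With 10 replicas, 25% does not divide evenly, which is exactly where the rounding direction matters: the controller may create 3 extra pods but may only take 2 out of service.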
Use rolling deployments for stateless services where zero-downtime updates are required and you can tolerate a brief mixed-version state. They're the default for most microservices because they balance safety with speed. Avoid them for stateful workloads or when backward-incompatible schema changes exist — blue/green or canary deployments are safer there.
How a Rolling Deployment Actually Works Step by Step
A rolling deployment works by treating your server fleet like a queue. You decide on two numbers: the maximum number of instances you're willing to take offline at once (maxUnavailable) and the maximum number of extra instances you'll spin up during the transition (maxSurge). The deployment controller — whether that's Kubernetes, ECS, or a custom script — then orchestrates a loop.
The loop looks like this: take a small batch of old instances out of the load balancer rotation, drain their in-flight requests, terminate them, start new instances with the new code, wait for those new instances to pass health checks, then add them back to the load balancer. Repeat until every instance is on the new version.
The crucial word in that loop is 'health checks'. The system won't move on to the next batch until the new instances actually prove they're healthy. This is what makes rolling deployments safe — the process is gated on real evidence that the new code works, not just the assumption that it compiled and started.
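The loop above can be sketched as a toy simulation. This is not how any real controller is implemented: the fleet is just a list of version strings, `passes_health_check` is a hypothetical callback standing in for a readiness probe, and draining and surge capacity are left out.

```python
from typing import Callable, List


def rolling_update(fleet: List[str], new_version: str, batch_size: int,
                   passes_health_check: Callable[[str], bool]) -> bool:
    """Upgrade `fleet` in place, one batch at a time.

    Returns True if every batch was replaced, False if the rollout halted.
    """
    for start in range(0, len(fleet), batch_size):
        batch = range(start, min(start + batch_size, len(fleet)))
        # Start one candidate per slot in the batch and gate on its health.
        candidates = [new_version for _ in batch]
        if not all(passes_health_check(c) for c in candidates):
            # Halt: this batch and all later ones keep the old version.
            return False
        for i, candidate in zip(batch, candidates):
            fleet[i] = candidate
    return True


fleet = ["v1"] * 5
completed = rolling_update(fleet, "v2", batch_size=2,
                           passes_health_check=lambda version: True)
```

The important property is the early return: a failed health check stops the loop, so the unhealthy version never reaches the remaining batches.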
The downside is that during the rollout, two versions of your code are live simultaneously. If your new API changes a response shape that the old frontend depends on, or your new code writes a database column that the old code doesn't know about, you'll have problems. That's not a flaw in rolling deployments — it's a constraint that forces you to write backward-compatible code, which is a good habit regardless.
Wiring a Rolling Deployment Into Your CI/CD Pipeline
Understanding rolling deployments in isolation is one thing — getting them to fire automatically from a git push is another. The pattern that actually works in production ties three things together: your container registry, your deployment manifest, and a pipeline step that updates the image tag and triggers the rollout.
The anti-pattern is updating the manifest file by hand. The moment a human has to manually edit a YAML file and run kubectl apply, you've introduced the most dangerous variable in software: a tired human at 11pm. Instead, your CI pipeline should build the image, tag it with the exact git commit SHA (not 'latest' — never 'latest'), push it to the registry, and then use a tool like kubectl set image or kustomize to patch the deployment manifest and apply it automatically.
A GitHub Actions workflow can do exactly this. Because the image tag is the git SHA, every deployed version is traceable to a specific commit. If something goes wrong, you know exactly what changed. You can also use kubectl rollout undo to immediately revert to the previous SHA-tagged image, which is the rolling deployment equivalent of a one-command parachute.
The health check gates in your Deployment manifest are what make this pipeline safe to run on every merge to main. The pipeline doesn't need to babysit the rollout: Kubernetes does that. The pipeline's job is just to hand off the new image and trust the deployment strategy to do its work.
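As a rough sketch of the hand-off step, the helper below builds the command sequence a pipeline job might run. The registry host, the app name, and the assumption that the container inside the Deployment has the same name as the app are all illustrative; the function only assembles argument lists and does not execute anything.

```python
from typing import List


def deploy_commands(registry: str, app: str, git_sha: str,
                    timeout: str = "120s") -> List[List[str]]:
    """Build the build/push/rollout commands for one commit.

    Assumption: the Deployment and its container are both named `app`.
    """
    if not git_sha or git_sha == "latest":
        raise ValueError("tag images with the commit SHA, never 'latest'")
    image = f"{registry}/{app}:{git_sha}"
    return [
        ["docker", "build", "-t", image, "."],
        ["docker", "push", image],
        # Patch the Deployment's image, which triggers the rolling update.
        ["kubectl", "set", "image", f"deployment/{app}", f"{app}={image}"],
        # Block until the rollout finishes or the timeout expires.
        ["kubectl", "rollout", "status", f"deployment/{app}",
         f"--timeout={timeout}"],
    ]


commands = deploy_commands("registry.example.com", "payments-api", "3f9c2ab")
```

Each inner list could be passed to `subprocess.run(cmd, check=True)` so that a failed step stops the job before the next one runs.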
The Version Skew Problem — Why Your Code Must Be Backward Compatible
Here's the scenario nobody warns you about until it burns you. You're rolling out v2 of your user service. v2 renames the JSON field 'user_name' to 'username' (dropping the underscore, a perfectly reasonable cleanup). For about four minutes while the rollout happens, your load balancer is sending some requests to v1 pods and some to v2 pods.
Your frontend is calling the user service and reading 'user_name'. The v1 pods return it correctly. The v2 pods return 'username' instead. For those four minutes, roughly half your users see a broken UI where their name doesn't display. You've just created a production incident during what should have been a routine deploy.
This is called version skew — the period where multiple versions of the same service coexist. It's not optional during a rolling deployment; it's guaranteed. The fix isn't to avoid rolling deployments. The fix is the expand-contract pattern: in v2, return BOTH 'user_name' AND 'username'. In v3 (a later deploy), remove 'user_name'. You expand the interface first, let the world catch up, then contract it.
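The expand-contract sequence for the 'user_name' rename can be sketched with three serializers and two readers. All function names here are hypothetical stand-ins for the service versions and the frontend builds.

```python
def serialize_v1(user: dict) -> dict:
    # Original shape: only the old field.
    return {"user_name": user["name"]}


def serialize_v2(user: dict) -> dict:
    # Expand: return both fields so old and new readers both work.
    return {"user_name": user["name"], "username": user["name"]}


def serialize_v3(user: dict) -> dict:
    # Contract: drop the old field once no reader depends on it.
    return {"username": user["name"]}


def old_frontend_read(payload: dict) -> str:
    return payload["user_name"]


def new_frontend_read(payload: dict) -> str:
    return payload["username"]


user = {"name": "ada"}
# During the v1 -> v2 rollout, the old frontend works against either version.
skew_window = [old_frontend_read(serialize_v1(user)),
               old_frontend_read(serialize_v2(user))]
```

Note that v3 is only safe to ship after the frontend has switched to `new_frontend_read`; the old reader raises a `KeyError` on a v3 payload.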
The same principle applies to database migrations. Never drop a column or rename one in the same deploy that changes the code that reads it. Add the new column, deploy code that writes to both, migrate the data, deploy code that only reads the new column, then drop the old one. It's more steps, but each step is individually safe.
Rolling Back a Deployment: The Graceful Undo
No matter how well you test, a bad deploy will eventually happen. The question isn't if — it's how fast you can recover. Rolling deployments give you a built-in escape hatch that doesn't require a full re-deploy of the old version. Kubernetes stores the previous ReplicaSet configuration as a revision, and you can instantly start a rolling rollback with a single command.
The 'kubectl rollout undo' command doesn't snap all instances back at once. It uses the same rolling update strategy — it gradually replaces new pods with the previous version's pods. That means your rollback is just as safe as your forward deployment: health checks gate the process, traffic gradually shifts back, and if the rollback also fails (unlikely but possible), you can undo the rollback.
The catch is that Kubernetes only keeps a limited number of revision histories. By default, it stores 10 revisions — controlled by the revisionHistoryLimit field. Once you exceed that limit, older revisions are pruned, and you can't undo to them. If you need to roll back to a version from a month ago, you'll need to re-deploy the old image tag, not rely on rollout undo.
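A toy model of that pruning behaviour, assuming the limit counts retained old revisions in addition to the current one. The class `RevisionHistory` is invented for illustration and tracks only image tags, not real ReplicaSets.

```python
from collections import deque


class RevisionHistory:
    """Keeps the current image plus a bounded list of older ones."""

    def __init__(self, limit: int = 10):
        self.current = None
        # A bounded deque silently drops the oldest entry when full.
        self.old = deque(maxlen=limit)

    def deploy(self, image: str) -> None:
        if self.current is not None:
            self.old.append(self.current)
        self.current = image

    def can_undo_to(self, image: str) -> bool:
        return image in self.old


history = RevisionHistory(limit=10)
for n in range(12):
    history.deploy(f"app:sha-{n}")
```

After twelve deploys, `app:sha-0` has been pruned, so getting back to it means re-deploying that tag rather than using an undo.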
Another gotcha: if you accidentally deployed the same version twice (e.g., re-pushed the same image with a new tag but identical code), rollout undo will roll you back to the same broken version. Always verify the image tag in the previous ReplicaSet before trusting an undo.
Database Migrations with Rolling Deployments: The Safe Way
If there's one thing that causes more rolling deployment failures than anything else, it's database schema changes. The problem is fundamental: your database is a shared, stateful resource. You can't have two versions of the schema while two versions of your code are running. Or can you?
The answer is the expand-contract pattern applied to database changes. Let's walk through a concrete example: adding a NOT NULL column to the users table. The naive approach is a single migration that adds the column with its NOT NULL constraint and no database-level default, shipped together with the code that populates it. During the rollout, old pods try to insert a row and fail, because the constraint is already in place but the old code doesn't populate the column.
The safe approach:
- Expand: Add the column as nullable (no NOT NULL constraint). Deploy this migration while old pods are still running. The old code ignores the column.
- Backfill: Populate existing rows with a default value. This can be done as a background job or inline migration.
- Deploy new code: The new code populates the column on inserts. Old pods still run but they don't set the column. Because the column is nullable, their inserts still succeed and simply leave it NULL.
- Contract: Once all pods are on the new code, backfill any rows the old pods wrote during the rollout, then run a second migration to add the NOT NULL constraint. This is safe because every pod now writes the column and no NULL rows remain.
Each step is individually reversible. If something goes wrong in step 3 (deploying the new code), you can roll back the code without having to revert the schema.
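The steps above can be walked through against an in-memory SQLite database. This is a sketch under assumptions: the `plan` column and the two insert helpers are invented, and because SQLite cannot add a NOT NULL constraint to an existing column, the contract step is represented by checking its precondition (no NULL rows left) instead of running the real constraint migration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")


def old_code_insert(name: str) -> None:
    # Old pods know nothing about the new column.
    conn.execute("INSERT INTO users (name) VALUES (?)", (name,))


def new_code_insert(name: str, plan: str) -> None:
    conn.execute("INSERT INTO users (name, plan) VALUES (?, ?)", (name, plan))


old_code_insert("alice")

# 1. Expand: add the column as nullable; old pods keep working.
conn.execute("ALTER TABLE users ADD COLUMN plan TEXT")
old_code_insert("bob")

# 2. Backfill existing rows.
conn.execute("UPDATE users SET plan = 'free' WHERE plan IS NULL")

# 3. Rollout window: new and old pods insert side by side.
new_code_insert("carol", "pro")
old_code_insert("dave")  # succeeds, leaves plan NULL

# 4. Contract precondition: mop up rows written by old pods, then verify.
conn.execute("UPDATE users SET plan = 'free' WHERE plan IS NULL")
null_rows = conn.execute(
    "SELECT COUNT(*) FROM users WHERE plan IS NULL").fetchone()[0]
total_rows = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```

Only when `null_rows` is zero is it safe to run the constraint migration on a database that supports it.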
- Every change to a shared resource must be two deployments: first add the new thing, then remove the old thing.
- During the window, both old and new coexist — your code must handle both.
- The order matters: expand the schema first, then deploy code that uses the new schema, then contract.
- For column drops: the contract is the drop step. Never drop a column in the same deploy that stops using it.
Use Case: When Rolling Saves Your Ass (and When It Won't)
Don't use rolling deployments because they're trendy. Use them when zero-downtime matters and your instances are stateless cattle, not pets. Rolling shines for web backends, API servers, and worker pools: any workload where you can spin up a fresh pod and drain traffic from an old one without a user noticing. It fails spectacularly with stateful workloads like databases or legacy monoliths holding in-memory sessions. If your app can't handle two versions coexisting for minutes, rolling will corrupt data faster than a junior with sudo rm -rf.
The rule: if you can't afford a version skew window, use blue-green or canary. Rolling assumes your code is backward-compatible. It's not magic. It's orchestrated serial replacement. Every new pod must serve traffic alongside the old ones once it's verified healthy. That means health checks must be ruthless and fast. Rollout progress is gated on readiness, so a slow readiness probe stretches a five-minute rollout into a fifteen-minute one, with reduced capacity the whole time.
Choose rolling when you need gradual traffic migration and can tolerate partial capacity during the swap. Choose something else when you can't.
The Pod-Template-Hash Label: Kubernetes' Secret Weapon for Rollout Sanity
Ever wonder how Kubernetes knows which pods belong to which version during a rolling update? It's not magic. It's the pod-template-hash label. When you create or update a Deployment, the Deployment controller adds this label to each ReplicaSet it creates and to every pod that ReplicaSet manages. The value is a deterministic hash of the pod template spec. Change the image tag? New hash. Change an env var? New hash. Same spec? Same hash.
This lets the Deployment controller identify exactly which ReplicaSet owns which pods. During a rollout, it scales down the old ReplicaSet, scales up the new ReplicaSet with a different hash, and never mixes them up. You can see it with kubectl get pods --show-labels. If a pod has pod-template-hash=abc123, you know it belongs to the ReplicaSet whose name ends in abc123. This is critical for rollbacks: kubectl rollout undo simply scales down the current ReplicaSet and scales up the previous one, identified by its hash.
Don't rely on pod names alone: they're ephemeral. The hash is the ground truth. If you're debugging a rollout failure, first check which hashes are running: kubectl get replicasets -l app=yourapp. If you see two ReplicaSets with different hashes and neither is scaling, your rollout is stuck, most likely because of a bad health check or a resource shortage.
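The "same spec, same hash" property is easy to demonstrate with a toy function. To be clear about the assumptions: this is not the algorithm Kubernetes uses and its output will not match any label in a real cluster. It hashes a canonical JSON rendering with CRC32 purely to show determinism, and the template dictionaries are made up.

```python
import json
import zlib


def template_hash(pod_template: dict) -> str:
    """Toy stand-in for pod-template-hash: stable for equal specs."""
    # Sorting keys makes the rendering independent of dict ordering.
    canonical = json.dumps(pod_template, sort_keys=True, separators=(",", ":"))
    return format(zlib.crc32(canonical.encode("utf-8")), "08x")


template_v1 = {"image": "registry.example.com/app:v1", "env": {"MODE": "prod"}}
same_spec_reordered = {"env": {"MODE": "prod"}, "image": "registry.example.com/app:v1"}
template_v2 = {"image": "registry.example.com/app:v2", "env": {"MODE": "prod"}}

hash_v1 = template_hash(template_v1)
hash_v2 = template_hash(template_v2)
```

Two templates with identical content produce the same value regardless of key order, while changing only the image tag produces a different one, which is what lets a controller tell the old and new ReplicaSets apart.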
When you run kubectl rollout undo, Kubernetes scales down the current ReplicaSet (by hash) and scales up the previous ReplicaSet (different hash). It does NOT delete the old ReplicaSet: it keeps it around, up to the revision history limit (revisionHistoryLimit, default 10). You can manually delete old ReplicaSets to free resources, but don't if you think you'll need a rollback.
Rollover: What Happens When You Deploy Again Mid-Rollout
You kicked off a rolling deployment. Five minutes in, your CTO screams 'stop the rollout, we need a hotfix now.' You push a new image tag. What happens? Kubernetes doesn't wait for the old rollout to finish. It creates a new ReplicaSet with the latest spec and starts scaling it up while scaling down the old ones. This is called a rollover.
The Deployment controller reconciles toward the desired state immediately. If you had 10 replicas at image v1, started rolling to v2 (5 new, 5 old), then pushed v3, the controller stops scaling up v2 and starts creating v3 pods. It scales down both the v1 and v2 ReplicaSets to make room for v3. This is efficient but dangerous: any migration that shipped with v2 may already have run even though v2 never finished rolling out, and for a while pods from three versions can be serving traffic at once.
The rollover respects your maxSurge and maxUnavailable settings, so it won't exceed pod limits, but it can thrash. If you push three versions in ten minutes, you'll have pods coming and going like a revolving door. The classic production mistake is pushing a new tag before the previous rollout completes, then wondering why some pods have different configs. The fix: use kubectl rollout pause before making changes during an active rollout, or have your CI/CD pipeline check rollout status before allowing the next deploy.
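One way a pipeline could implement that pre-deploy check is to inspect the Deployment object before pushing the next image. The sketch below assumes the input is the parsed JSON from kubectl get deployment -o json and compares the standard status counters; the function name `rollout_complete` and the sample dictionaries are illustrative.

```python
def rollout_complete(deployment: dict) -> bool:
    """Return True when no rollout is in flight for this Deployment."""
    meta = deployment["metadata"]
    spec = deployment["spec"]
    status = deployment.get("status", {})
    desired = spec.get("replicas", 1)

    # The controller has not yet observed the latest spec change.
    if status.get("observedGeneration", 0) < meta.get("generation", 0):
        return False

    updated = status.get("updatedReplicas", 0)
    return (
        updated == desired                                # all pods on new template
        and status.get("replicas", 0) == updated          # no old pods left
        and status.get("availableReplicas", 0) == updated  # all new pods available
    )


mid_rollout = {
    "metadata": {"generation": 7},
    "spec": {"replicas": 10},
    "status": {"observedGeneration": 7, "replicas": 12,
               "updatedReplicas": 5, "availableReplicas": 9},
}
finished = {
    "metadata": {"generation": 7},
    "spec": {"replicas": 10},
    "status": {"observedGeneration": 7, "replicas": 10,
               "updatedReplicas": 10, "availableReplicas": 10},
}
```

A pipeline step could refuse to deploy while `rollout_complete` returns False, which prevents an accidental rollover from a second merge landing mid-rollout.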
The Readiness Probe That Wasn't: How a Missing Endpoint Caused 3-Minute Errors on Every Deploy
- A readiness probe that checks only HTTP serving is a false positive — it must validate the service is ready to handle real traffic.
- Don't trust a CI pipeline that reports 'successful rollout' without checking the health dashboard for the first 5 minutes after deploy.
- Connection pool warm-up is not instant — account for it in your probe design.
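The difference between a probe that only proves the HTTP server is up and one that proves the service can do real work can be sketched as two handlers. Everything here is hypothetical: `AppState`, its flags, and the handler names are stand-ins, and a real service would wire the result into whatever endpoint its readinessProbe calls.

```python
class AppState:
    """Tracks startup milestones for one instance."""

    def __init__(self) -> None:
        self.http_listening = False
        self.db_pool_ready = False


def naive_readiness(state: AppState):
    # The bug: reports ready as soon as the HTTP server answers.
    if state.http_listening:
        return 200, "ok"
    return 503, "http server not started"


def readiness(state: AppState):
    # Ready only when every dependency needed for real traffic is up.
    if not state.http_listening:
        return 503, "http server not started"
    if not state.db_pool_ready:
        return 503, "db pool warming up"
    return 200, "ready"


state = AppState()
state.http_listening = True  # server is up, pool is still initialising
during_warmup = (naive_readiness(state), readiness(state))
state.db_pool_ready = True
after_warmup = readiness(state)
```

During warm-up the naive handler already returns 200, so the pod would receive traffic it cannot serve; the stricter handler keeps returning 503 until the pool is ready.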
kubectl rollout status deployment/payments-api --timeout=10s
kubectl describe deployment/payments-api | grep -A 10 'Conditions:'
Key takeaways
Common mistakes to avoid
- Using the 'latest' image tag in your deployment
- Skipping the readinessProbe
- Making a breaking database change in the same deploy as the code change
Interview Questions on This Topic
What is version skew in the context of rolling deployments, and how does the expand-contract pattern solve it? Give a concrete database example.