Rolling Deployments Explained: Zero-Downtime Releases Done Right
Every time your team ships a new feature, there's a moment of genuine terror — the deployment window. In the old days, that meant taking your entire app offline at 2am on a Sunday, praying nothing breaks, and issuing a public apology if it does. For teams deploying multiple times a day, that approach simply doesn't scale. It's not just inconvenient; it's a business risk that your competitors have already solved.
Rolling deployments exist to eliminate that terror entirely. Instead of replacing your entire fleet of servers at once, you replace them in small batches. At any given moment, some instances are running the old version and some are running the new version. If something goes wrong with the new version, you've only exposed a fraction of your users to the problem — and you can halt the rollout immediately. The blast radius of a bad deploy shrinks from 'everyone is down' to 'a small percentage of requests hit the broken version for a few minutes'.
By the end of this article you'll understand exactly how a rolling deployment works under the hood, how to configure one in Kubernetes and a CI/CD pipeline, what makes them fail silently, and how to make yours bulletproof. You'll also be able to explain the trade-offs confidently in a system design interview — because rolling deployments come up constantly.
How a Rolling Deployment Actually Works Step by Step
A rolling deployment works by treating your server fleet like a queue. You decide on two numbers: the maximum number of instances you're willing to take offline at once (maxUnavailable) and the maximum number of extra instances you'll spin up during the transition (maxSurge). The deployment controller — whether that's Kubernetes, ECS, or a custom script — then orchestrates a loop.
The loop looks like this: take a small batch of old instances out of the load balancer rotation, drain their in-flight requests, terminate them, start new instances with the new code, wait for those new instances to pass health checks, then add them back to the load balancer. Repeat until every instance is on the new version.
The crucial word in that loop is 'health checks'. The system won't move on to the next batch until the new instances actually prove they're healthy. This is what makes rolling deployments safe — the process is gated on real evidence that the new code works, not just the assumption that it compiled and started.
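The batch-and-gate loop described above can be sketched in a few lines of Python. This is a simplified simulation, not real orchestration code: the instance names, the `is_healthy` stand-in, and the version-prefix convention are all invented for illustration.

```python
def is_healthy(instance: str) -> bool:
    """Stand-in for an HTTP health check; always passes in this sketch."""
    return True

def rolling_update(old_instances: list[str], new_version: str,
                   max_unavailable: int = 1) -> list[str]:
    """Replace instances in batches of max_unavailable, gated on health checks."""
    fleet = list(old_instances)
    while any(not inst.startswith(new_version) for inst in fleet):
        # Pick the next batch of old instances to replace.
        batch = [i for i in fleet if not i.startswith(new_version)][:max_unavailable]
        for inst in batch:
            fleet.remove(inst)  # drain in-flight requests, then terminate (simulated)
            replacement = f"{new_version}-{inst.split('-', 1)[1]}"
            if not is_healthy(replacement):  # the gate: never proceed past a failure
                raise RuntimeError(f"{replacement} failed health check; rollout halted")
            fleet.append(replacement)  # add the new instance back to the load balancer
    return fleet

fleet = rolling_update(["v1-a", "v1-b", "v1-c"], "v2")
print(fleet)  # every instance now runs v2
```

Note that the halt condition is the whole point: a real controller stops promoting batches the moment a health check fails, which is exactly what bounds the blast radius.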
The downside is that during the rollout, two versions of your code are live simultaneously. If your new API changes a response shape that the old frontend depends on, or your new code writes a database column that the old code doesn't know about, you'll have problems. That's not a flaw in rolling deployments — it's a constraint that forces you to write backward-compatible code, which is a good habit regardless.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
  labels:
    app: payments-api
spec:
  replicas: 6                        # We're running 6 instances in total
  selector:
    matchLabels:
      app: payments-api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1              # Never take more than 1 instance offline at a time
      maxSurge: 2                    # Allow up to 2 extra instances during the rollout
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      containers:
        - name: payments-api
          image: myregistry/payments-api:v2.4.1   # The new version we're rolling out
          ports:
            - containerPort: 8080
          readinessProbe:            # Kubernetes won't route traffic until this passes
            httpGet:
              path: /health/ready    # Our app exposes a readiness endpoint
              port: 8080
            initialDelaySeconds: 10  # Give the app 10s to start before probing
            periodSeconds: 5         # Check every 5 seconds
            failureThreshold: 3      # Three consecutive failures = not ready
          livenessProbe:             # Kubernetes restarts the pod if this fails
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 20
            periodSeconds: 10
          resources:
            requests:
              cpu: "250m"
              memory: "256Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"
```
```
deployment.apps/payments-api configured

$ kubectl rollout status deployment/payments-api
Waiting for deployment "payments-api" rollout to finish: 2 out of 6 new replicas have been updated...
Waiting for deployment "payments-api" rollout to finish: 3 out of 6 new replicas have been updated...
Waiting for deployment "payments-api" rollout to finish: 4 out of 6 new replicas have been updated...
Waiting for deployment "payments-api" rollout to finish: 5 out of 6 new replicas have been updated...
Waiting for deployment "payments-api" rollout to finish: 1 old replicas are pending termination...
deployment "payments-api" successfully rolled out
```
Wiring a Rolling Deployment Into Your CI/CD Pipeline
Understanding rolling deployments in isolation is one thing — getting them to fire automatically from a git push is another. The pattern that actually works in production ties three things together: your container registry, your deployment manifest, and a pipeline step that updates the image tag and triggers the rollout.
The anti-pattern is updating the manifest file by hand. The moment a human has to manually edit a YAML file and run kubectl apply, you've introduced the most dangerous variable in software: a tired human at 11pm. Instead, your CI pipeline should build the image, tag it with the exact git commit SHA (not 'latest' — never 'latest'), push it to the registry, and then use a tool like kubectl set image or kustomize to patch the deployment manifest and apply it automatically.
The pipeline below shows a GitHub Actions workflow that does exactly this. Notice how the image tag is the git SHA — that means every deployed version is traceable to a specific commit. If something goes wrong, you know exactly what changed. You can also use kubectl rollout undo to immediately revert to the previous SHA-tagged image, which is the rolling deployment equivalent of a one-command parachute.
The health check gates in the deployment YAML you saw in the previous section are what make this pipeline safe to run on every merge to main. The pipeline doesn't need to babysit the rollout — Kubernetes does that. The pipeline's job is just to hand off the new image and trust the deployment strategy to do its work.
```yaml
name: Build and Rolling-Deploy Payments API

on:
  push:
    branches:
      - main                         # Only deploy from main branch merges

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: myorg/payments-api
  KUBE_NAMESPACE: production
  DEPLOYMENT_NAME: payments-api

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write                # Needed to push to GitHub Container Registry
    steps:
      - name: Checkout source code
        uses: actions/checkout@v4

      - name: Log in to GitHub Container Registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Build and push Docker image
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          # Tag with the git SHA so every image is 100% traceable to a commit
          tags: |
            ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
            ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:latest

      - name: Configure kubectl
        uses: azure/k8s-set-context@v3
        with:
          method: kubeconfig
          kubeconfig: ${{ secrets.KUBE_CONFIG }}   # Store your kubeconfig as a repo secret

      - name: Trigger rolling deployment with new image
        run: |
          # This command patches the deployment in-place — Kubernetes handles the rolling strategy
          kubectl set image deployment/${{ env.DEPLOYMENT_NAME }} \
            ${{ env.DEPLOYMENT_NAME }}=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }} \
            --namespace=${{ env.KUBE_NAMESPACE }}

      - name: Wait for rollout to complete (fail pipeline if deployment fails)
        run: |
          # This blocks the pipeline until the rollout finishes or times out after 5 minutes
          kubectl rollout status deployment/${{ env.DEPLOYMENT_NAME }} \
            --namespace=${{ env.KUBE_NAMESPACE }} \
            --timeout=5m

      - name: Roll back automatically if rollout failed
        if: failure()                # This step only runs if the previous step failed
        run: |
          echo "Rollout failed — reverting to previous deployment"
          kubectl rollout undo deployment/${{ env.DEPLOYMENT_NAME }} \
            --namespace=${{ env.KUBE_NAMESPACE }}
```
```
deployment.apps/payments-api image updated

Run kubectl rollout status deployment/payments-api --namespace=production --timeout=5m
Waiting for deployment "payments-api" rollout to finish: 1 out of 6 new replicas have been updated...
Waiting for deployment "payments-api" rollout to finish: 2 out of 6 new replicas have been updated...
Waiting for deployment "payments-api" rollout to finish: 3 out of 6 new replicas have been updated...
Waiting for deployment "payments-api" rollout to finish: 4 out of 6 new replicas have been updated...
Waiting for deployment "payments-api" rollout to finish: 5 out of 6 new replicas have been updated...
deployment "payments-api" successfully rolled out
```
The Version Skew Problem — Why Your Code Must Be Backward Compatible
Here's the scenario nobody warns you about until it burns you. You're rolling out v2 of your user service. v2 renames the JSON field 'user_name' to 'username' (dropping the underscore — a perfectly reasonable cleanup). For about four minutes while the rollout happens, your load balancer is sending some requests to v1 pods and some to v2 pods.
Your frontend is calling the user service and reading 'user_name'. The v1 pods return it correctly. The v2 pods return 'username' instead. For those four minutes, roughly half your users see a broken UI where their name doesn't display. You've just created a production incident during what should have been a routine deploy.
This is called version skew — the period where multiple versions of the same service coexist. It's not optional during a rolling deployment; it's guaranteed. The fix isn't to avoid rolling deployments. The fix is the expand-contract pattern: in v2, return BOTH 'user_name' AND 'username'. In v3 (a later deploy), remove 'user_name'. You expand the interface first, let the world catch up, then contract it.
The same principle applies to database migrations. Never drop a column or rename one in the same deploy that changes the code that reads it. Add the new column, deploy code that writes to both, migrate the data, deploy code that only reads the new column, then drop the old one. It's more steps, but each step is individually safe.
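Those migration steps map cleanly onto a checklist. The sketch below lays them out for the 'user_name' to 'username' rename from the earlier example; the table name, column names, and SQL are illustrative, not from a real schema, and the comment-only entries stand in for application deploys rather than SQL statements.

```python
# Expand-contract migration for renaming users.user_name to users.username,
# split across separate deploys so old and new code always coexist safely.
MIGRATION_PHASES = [
    ("expand",     "ALTER TABLE users ADD COLUMN username TEXT NULL;"),          # schema: add nullable column
    ("dual-write", "-- deploy app code that writes BOTH user_name AND username"),
    ("backfill",   "UPDATE users SET username = user_name WHERE username IS NULL;"),
    ("read-new",   "-- deploy app code that reads ONLY username"),
    ("contract",   "ALTER TABLE users DROP COLUMN user_name;"),                  # schema: now safe to drop
]

for phase, step in MIGRATION_PHASES:
    print(f"{phase:>10}: {step}")
```

Each phase is independently deployable and independently reversible, which is exactly what makes the whole sequence safe to run through a rolling deployment.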
```python
from flask import Flask, jsonify
from dataclasses import dataclass
from typing import Optional

app = Flask(__name__)

@dataclass
class User:
    id: int
    display_name: str
    email: str

# Simulating a database fetch
def fetch_user_from_db(user_id: int) -> Optional[User]:
    mock_users = {
        1: User(id=1, display_name="Ada Lovelace", email="ada@example.com"),
        2: User(id=2, display_name="Grace Hopper", email="grace@example.com"),
    }
    return mock_users.get(user_id)

@app.route("/users/<int:user_id>")
def get_user(user_id: int):
    user = fetch_user_from_db(user_id)
    if not user:
        return jsonify({"error": "User not found"}), 404

    # EXPAND PHASE: This is v2 of the API.
    # We've renamed the field from 'user_name' to 'username' internally,
    # but we still return BOTH keys during the rollout transition window.
    # Old clients reading 'user_name' keep working.
    # New clients reading 'username' also work.
    # We'll remove 'user_name' in a separate v3 deploy once all clients are updated.
    response_payload = {
        "id": user.id,
        "username": user.display_name,   # New field name — clients should migrate to this
        "user_name": user.display_name,  # Deprecated — kept for backward compatibility only
        "email": user.email,
        "_meta": {
            "deprecated_fields": ["user_name"],  # Signal to API consumers that migration is needed
            "api_version": "v2",
        },
    }
    return jsonify(response_payload), 200

if __name__ == "__main__":
    app.run(debug=False, port=8080)
```
```json
{
  "_meta": {
    "api_version": "v2",
    "deprecated_fields": ["user_name"]
  },
  "email": "ada@example.com",
  "id": 1,
  "user_name": "Ada Lovelace",
  "username": "Ada Lovelace"
}
```
| Aspect | Rolling Deployment | Blue/Green Deployment |
|---|---|---|
| Downtime during deploy | Zero — traffic shifts gradually | Zero — traffic switches atomically |
| Resource cost | Low — only maxSurge extra instances needed | High — requires a full duplicate environment |
| Rollback speed | Slower — rollback is itself a rolling update (kubectl rollout undo re-rolls the previous version batch by batch) | Instant — flip the load balancer back to blue |
| Version skew risk | High — two versions serve traffic simultaneously | None — only one version is live at any time |
| Best for | Stateless services with backward-compatible APIs | Stateful apps or high-risk releases needing instant rollback |
| Database migrations | Requires expand-contract pattern across multiple deploys | Easier — blue env handles migration before cutover |
| Complexity | Low — built into Kubernetes natively | High — requires managing two environments |
| Traffic control | Coarse — exposure is dictated by batch size, not a chosen percentage | All-or-nothing by default — atomic switch, with weighted shifting possible via a proxy |
🎯 Key Takeaways
- Rolling deployments replace instances in small batches — health checks gate every batch, so a bad deploy can only affect the fraction of traffic hitting the new instances, not everyone at once.
- The expand-contract pattern is non-negotiable for rolling deployments — any API field rename, database column change, or message schema update must be deployed in at least two phases to survive the version skew window.
- Never tag deployment images as 'latest' — use the git commit SHA so every running version is traceable to a specific code change and kubectl rollout undo reverts to a known, specific state.
- The kubectl rollout status --timeout flag is what separates a robust pipeline from a false-green one — without it your CI reports success the moment deployment is triggered, not when it actually finishes.
⚠ Common Mistakes to Avoid
- ✕ Mistake 1: Using the 'latest' image tag in your deployment — Your rollout appears to succeed but Kubernetes may not actually pull a new image because 'latest' was already cached on the node. The symptom is kubectl rollout status saying 'successfully rolled out' but your new code never actually running. Fix: always tag images with the git commit SHA (e.g. myapp:a3f9c12) and set imagePullPolicy: Always in your container spec. This guarantees every rollout pulls a specific, traceable image.
- ✕ Mistake 2: Skipping the readinessProbe — New pods start, pass the liveness check (meaning the process is running), and immediately receive live traffic — but your app takes 15 seconds to warm up its database connection pool. The symptom is a spike of 503 errors on every deploy, right as new pods come online. Fix: configure a separate readinessProbe that hits a /health/ready endpoint which returns 200 only after the app has fully initialized. Kubernetes will hold the pod out of rotation until it genuinely passes.
- ✕ Mistake 3: Making a breaking database change in the same deploy as the code change — You add a NOT NULL column to the users table and deploy the code that populates it in the same release. During the rollout, old pods try to insert rows without that column and crash with a database constraint violation. Fix: decouple schema changes from code changes using the expand-contract pattern. Deploy the column as nullable first, ship code that populates it, backfill existing rows, then add the NOT NULL constraint in a later deploy.
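The liveness-versus-readiness distinction in Mistake 2 is easy to see in a small simulation. This is a stdlib-only sketch, not a real server: the `App` class and its warm-up timer are invented to show why a pod can be alive (process running) while still not ready (connection pool cold).

```python
import time

class App:
    """Minimal stand-in for a service with separate liveness and readiness checks."""

    def __init__(self, warmup_seconds: float):
        self.started_at = time.monotonic()
        self.warmup_seconds = warmup_seconds  # simulates e.g. DB pool initialization

    def liveness(self) -> int:
        # Liveness: the process is up. True from the moment the app starts.
        return 200

    def readiness(self) -> int:
        # Readiness: only 200 once warm-up has finished, so the orchestrator
        # keeps this instance out of the load balancer until it can serve traffic.
        warmed = time.monotonic() - self.started_at >= self.warmup_seconds
        return 200 if warmed else 503

app = App(warmup_seconds=0.05)
codes = [app.liveness(), app.readiness()]  # alive, but NOT yet ready (503)
time.sleep(0.06)                           # wait out the simulated warm-up
codes.append(app.readiness())              # now ready (200), safe to route traffic
```

If you only probed `liveness()`, traffic would arrive during that warm-up window, which is exactly the 503 spike described in Mistake 2.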
Interview Questions on This Topic
- Q: What is version skew in the context of rolling deployments, and how does the expand-contract pattern solve it? Give a concrete database example.
- Q: Walk me through what maxUnavailable and maxSurge actually control in a Kubernetes rolling deployment. What happens at the resource level if you set maxUnavailable to 0 on a cluster that's at 90% capacity?
- Q: Your rolling deployment shows as 'successful' in the CI pipeline but users are reporting errors for about three minutes on every release. What are the three most likely root causes and how would you diagnose each one?
Frequently Asked Questions
What is the difference between a rolling deployment and a blue/green deployment?
A rolling deployment gradually replaces old instances with new ones, so both versions serve traffic simultaneously during the rollout. A blue/green deployment runs two full identical environments and switches all traffic at once with a load balancer flip. Blue/green gives you instant rollback and zero version skew, but costs roughly double the infrastructure. Rolling deployments are cheaper but require your code to handle two versions coexisting.
How do I roll back a Kubernetes rolling deployment that went wrong?
Run kubectl rollout undo deployment/your-deployment-name. Kubernetes stores the previous ReplicaSet configuration and will immediately start rolling back to it using the same rolling strategy. You can also target a specific revision with --to-revision=2. This is why tagging images with git SHAs matters — the rollback actually goes back to a known, specific version of your code.
Can I use rolling deployments with a stateful service like a database?
Directly rolling out a stateful database (like a primary PostgreSQL instance) with a standard rolling deployment is dangerous because your data and schema are shared state — it's not like stateless app servers where any instance is interchangeable. For databases, blue/green or maintenance-window deployments are safer. Rolling deployments work well for stateless application services that sit in front of a database, as long as you handle schema migrations separately using the expand-contract pattern.
Written and reviewed by senior developers with real-world experience across enterprise, startup and open-source projects. Every article on TheCodeForge is written to be clear, accurate and genuinely useful — not just SEO filler.