
Rolling Deployments Explained: Zero-Downtime Releases Done Right

In Plain English 🔥
Imagine a restaurant that wants to swap out every table with a fancier one. Instead of closing for the day, they replace one table at a time while customers keep eating. Some diners sit at the old tables, some at the new ones — but the restaurant never closes. That's a rolling deployment: you swap out old servers running old code for new ones, gradually, while real users keep getting served without ever seeing a 'down for maintenance' page.

Every time your team ships a new feature, there's a moment of genuine terror — the deployment window. In the old days, that meant taking your entire app offline at 2am on a Sunday, praying nothing breaks, and issuing a public apology if it does. For teams deploying multiple times a day, that approach simply doesn't scale. It's not just inconvenient; it's a business risk that your competitors have already solved.

Rolling deployments exist to eliminate that terror entirely. Instead of replacing your entire fleet of servers at once, you replace them in small batches. At any given moment, some instances are running the old version and some are running the new version. If something goes wrong with the new version, you've only exposed a fraction of your users to the problem — and you can halt the rollout immediately. The blast radius of a bad deploy shrinks from 'everyone is down' to 'a small percentage of requests hit the broken version for a few minutes'.

By the end of this article you'll understand exactly how a rolling deployment works under the hood, how to configure one in Kubernetes and a CI/CD pipeline, what makes them fail silently, and how to make yours bulletproof. You'll also be able to explain the trade-offs confidently in a system design interview — because rolling deployments come up constantly.

How a Rolling Deployment Actually Works Step by Step

A rolling deployment works by treating your server fleet like a queue. You decide on two numbers: the maximum number of instances you're willing to take offline at once (maxUnavailable) and the maximum number of extra instances you'll spin up during the transition (maxSurge). The deployment controller — whether that's Kubernetes, ECS, or a custom script — then orchestrates a loop.
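The two knobs translate directly into hard bounds on instance counts. A minimal sketch of that arithmetic (the helper function is hypothetical, written just for illustration), using the same numbers as this article's manifest — 6 replicas, maxUnavailable of 1, maxSurge of 2:

```python
def rolling_update_bounds(replicas: int, max_unavailable: int, max_surge: int):
    """Bounds that maxUnavailable and maxSurge place on a rolling update."""
    min_available = replicas - max_unavailable   # fewest instances ever serving traffic
    max_total = replicas + max_surge             # most instances ever running at once
    return min_available, max_total

# 6 replicas, maxUnavailable=1, maxSurge=2:
print(rolling_update_bounds(6, 1, 2))  # (5, 8)
```

In other words, that configuration guarantees at least 5 instances serve traffic at every moment, while the cluster must have headroom for up to 8.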

The loop looks like this: take a small batch of old instances out of the load balancer rotation, drain their in-flight requests, terminate them, start new instances with the new code, wait for those new instances to pass health checks, then add them back to the load balancer. Repeat until every instance is on the new version.

The crucial word in that loop is 'health checks'. The system won't move on to the next batch until the new instances actually prove they're healthy. This is what makes rolling deployments safe — the process is gated on real evidence that the new code works, not just the assumption that it compiled and started.
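The loop above can be sketched in a few lines. This is an illustrative model, not how any real controller is implemented — the four callbacks (`start_instance`, `drain_and_terminate`, `is_healthy`) are hypothetical stand-ins for load-balancer and orchestrator APIs:

```python
import time

def rolling_deploy(instances, batch_size, new_version,
                   start_instance, drain_and_terminate, is_healthy):
    """Replace `instances` in batches of `batch_size`, gating each
    batch on health checks before moving to the next one."""
    remaining = list(instances)
    new_fleet = []
    while remaining:
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        for old in batch:
            drain_and_terminate(old)          # out of rotation, drain, stop
        fresh = [start_instance(new_version) for _ in batch]
        for inst in fresh:
            while not is_healthy(inst):       # gate on real evidence of health
                time.sleep(1)                 # poll until readiness passes
        new_fleet.extend(fresh)               # batch is healthy: move on
    return new_fleet

# Tiny demo with stub callbacks: three old instances, batch size 1.
fleet = rolling_deploy(["old-1", "old-2", "old-3"], 1, "v2",
                       start_instance=lambda v: v + "-pod",
                       drain_and_terminate=lambda inst: None,
                       is_healthy=lambda inst: True)
print(fleet)  # ['v2-pod', 'v2-pod', 'v2-pod']
```

Note that the `while not is_healthy(...)` gate is what distinguishes this from a naive restart-everything script: a new instance that never becomes healthy stalls the rollout instead of letting a broken version take over the fleet.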

The downside is that during the rollout, two versions of your code are live simultaneously. If your new API changes a response shape that the old frontend depends on, or your new code writes a database column that the old code doesn't know about, you'll have problems. That's not a flaw in rolling deployments — it's a constraint that forces you to write backward-compatible code, which is a good habit regardless.

kubernetes-rolling-deployment.yaml · YAML
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
  labels:
    app: payments-api
spec:
  replicas: 6                          # We're running 6 instances in total
  selector:
    matchLabels:
      app: payments-api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1                # Never take more than 1 instance offline at a time
      maxSurge: 2                      # Allow up to 2 extra instances during the rollout
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      containers:
        - name: payments-api
          image: myregistry/payments-api:v2.4.1   # The new version we're rolling out
          ports:
            - containerPort: 8080
          readinessProbe:              # Kubernetes won't route traffic until this passes
            httpGet:
              path: /health/ready      # Our app exposes a readiness endpoint
              port: 8080
            initialDelaySeconds: 10    # Give the app 10s to start before probing
            periodSeconds: 5           # Check every 5 seconds
            failureThreshold: 3        # Three consecutive failures = not ready
          livenessProbe:               # Kubernetes restarts the pod if this fails
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 20
            periodSeconds: 10
          resources:
            requests:
              cpu: "250m"
              memory: "256Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"
▶ Output
$ kubectl apply -f kubernetes-rolling-deployment.yaml
deployment.apps/payments-api configured

$ kubectl rollout status deployment/payments-api
Waiting for deployment "payments-api" rollout to finish: 2 out of 6 new replicas have been updated...
Waiting for deployment "payments-api" rollout to finish: 3 out of 6 new replicas have been updated...
Waiting for deployment "payments-api" rollout to finish: 4 out of 6 new replicas have been updated...
Waiting for deployment "payments-api" rollout to finish: 5 out of 6 new replicas have been updated...
Waiting for deployment "payments-api" rollout to finish: 1 old replicas are pending termination...
deployment "payments-api" successfully rolled out
⚠️ Watch Out: maxUnavailable: 0 Is Not Free Safety

Setting maxUnavailable to 0 means Kubernetes must spin up new instances before removing old ones. This sounds safer, but it requires spare capacity: during the rollout the cluster has to run up to maxSurge extra pods on top of your normal replica count. If your cluster is near capacity, the new pods will get stuck in Pending state and the rollout will stall indefinitely. Always pair maxUnavailable: 0 with confirmed headroom in your cluster.

Wiring a Rolling Deployment Into Your CI/CD Pipeline

Understanding rolling deployments in isolation is one thing — getting them to fire automatically from a git push is another. The pattern that actually works in production ties three things together: your container registry, your deployment manifest, and a pipeline step that updates the image tag and triggers the rollout.

The anti-pattern is updating the manifest file by hand. The moment a human has to manually edit a YAML file and run kubectl apply, you've introduced the most dangerous variable in software: a tired human at 11pm. Instead, your CI pipeline should build the image, tag it with the exact git commit SHA (not 'latest' — never 'latest'), push it to the registry, and then use a tool like kubectl set image or kustomize to patch the deployment manifest and apply it automatically.

The pipeline below shows a GitHub Actions workflow that does exactly this. Notice how the image tag is the git SHA — that means every deployed version is traceable to a specific commit. If something goes wrong, you know exactly what changed. You can also use kubectl rollout undo to immediately revert to the previous SHA-tagged image, which is the rolling deployment equivalent of a one-command parachute.

The health check gates in the deployment YAML you saw in the previous section are what make this pipeline safe to run on every merge to main. The pipeline doesn't need to babysit the rollout — Kubernetes does that. The pipeline's job is just to hand off the new image and trust the deployment strategy to do its work.

.github/workflows/deploy-payments-api.yml · YAML
name: Build and Rolling-Deploy Payments API

on:
  push:
    branches:
      - main                           # Only deploy from main branch merges

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: myorg/payments-api
  KUBE_NAMESPACE: production
  DEPLOYMENT_NAME: payments-api

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write                  # Needed to push to GitHub Container Registry

    steps:
      - name: Checkout source code
        uses: actions/checkout@v4

      - name: Log in to GitHub Container Registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Build and push Docker image
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          # Tag with the git SHA so every image is 100% traceable to a commit
          tags: |
            ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
            ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:latest

      - name: Configure kubectl
        uses: azure/k8s-set-context@v3
        with:
          method: kubeconfig
          kubeconfig: ${{ secrets.KUBE_CONFIG }}  # Store your kubeconfig as a repo secret

      - name: Trigger rolling deployment with new image
        run: |
          # This command patches the deployment in-place — Kubernetes handles the rolling strategy
          kubectl set image deployment/${{ env.DEPLOYMENT_NAME }} \
            ${{ env.DEPLOYMENT_NAME }}=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }} \
            --namespace=${{ env.KUBE_NAMESPACE }}

      - name: Wait for rollout to complete (fail pipeline if deployment fails)
        run: |
          # This blocks the pipeline until the rollout finishes or times out after 5 minutes
          kubectl rollout status deployment/${{ env.DEPLOYMENT_NAME }} \
            --namespace=${{ env.KUBE_NAMESPACE }} \
            --timeout=5m

      - name: Roll back automatically if rollout failed
        if: failure()                  # This step only runs if the previous step failed
        run: |
          echo "Rollout failed — reverting to previous deployment"
          kubectl rollout undo deployment/${{ env.DEPLOYMENT_NAME }} \
            --namespace=${{ env.KUBE_NAMESPACE }}
▶ Output
Run kubectl set image deployment/payments-api payments-api=ghcr.io/myorg/payments-api:a3f9c12...
deployment.apps/payments-api image updated

Run kubectl rollout status deployment/payments-api --namespace=production --timeout=5m
Waiting for deployment "payments-api" rollout to finish: 1 out of 6 new replicas have been updated...
Waiting for deployment "payments-api" rollout to finish: 2 out of 6 new replicas have been updated...
Waiting for deployment "payments-api" rollout to finish: 3 out of 6 new replicas have been updated...
Waiting for deployment "payments-api" rollout to finish: 4 out of 6 new replicas have been updated...
Waiting for deployment "payments-api" rollout to finish: 5 out of 6 new replicas have been updated...
deployment "payments-api" successfully rolled out
⚠️ Pro Tip: Always Block the Pipeline on Rollout Status

Without the kubectl rollout status step, your pipeline reports 'success' the moment it triggers the deployment — not when it actually finishes. That means a broken deployment looks green in your CI dashboard while your users are hitting errors. The --timeout flag ensures the pipeline fails loudly if the rollout stalls, giving you an automatic tripwire for bad deploys.

The Version Skew Problem — Why Your Code Must Be Backward Compatible

Here's the scenario nobody warns you about until it burns you. You're rolling out v2 of your user service. v2 renames the JSON field 'user_name' to 'username' (snake_case to camelCase — a perfectly reasonable cleanup). For about four minutes while the rollout happens, your load balancer is sending some requests to v1 pods and some to v2 pods.

Your frontend is calling the user service and reading 'user_name'. The v1 pods return it correctly. The v2 pods return 'username' instead. For those four minutes, roughly half your users see a broken UI where their name doesn't display. You've just created a production incident during what should have been a routine deploy.

This is called version skew — the period where multiple versions of the same service coexist. It's not optional during a rolling deployment; it's guaranteed. The fix isn't to avoid rolling deployments. The fix is the expand-contract pattern: in v2, return BOTH 'user_name' AND 'username'. In v3 (a later deploy), remove 'user_name'. You expand the interface first, let the world catch up, then contract it.

The same principle applies to database migrations. Never drop a column or rename one in the same deploy that changes the code that reads it. Add the new column, deploy code that writes to both, migrate the data, deploy code that only reads the new column, then drop the old one. It's more steps, but each step is individually safe.
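The multi-step migration sequence can be made concrete with a small sketch. This uses SQLite and made-up column names (`full_name` as the old column, `display_name` as the new one) purely for illustration — the point is the expand phase: the new column is added nullable, transitional code writes both columns, and existing rows are backfilled before anything is dropped:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT)")

# Step 1 (expand): add the new column as NULLABLE, so old code that
# knows nothing about it can still insert rows without crashing.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

# Step 2: transitional code writes BOTH columns, so readers of either
# the old or the new column see correct data during the skew window.
def create_user(conn, user_id, name):
    conn.execute(
        "INSERT INTO users (id, full_name, display_name) VALUES (?, ?, ?)",
        (user_id, name, name),
    )

# Step 3: backfill rows that old code wrote before the new column existed.
conn.execute("UPDATE users SET display_name = full_name WHERE display_name IS NULL")

create_user(conn, 1, "Ada Lovelace")
row = conn.execute(
    "SELECT full_name, display_name FROM users WHERE id = 1"
).fetchone()
print(row)  # ('Ada Lovelace', 'Ada Lovelace')
```

Only after every running version reads `display_name` do the contract steps happen, in later deploys: stop writing `full_name`, then drop the column and (if needed) add a NOT NULL constraint on the new one.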

user_service_backward_compatible_response.py · PYTHON
from flask import Flask, jsonify
from dataclasses import dataclass
from typing import Optional

app = Flask(__name__)

@dataclass
class User:
    id: int
    display_name: str
    email: str

# Simulating a database fetch
def fetch_user_from_db(user_id: int) -> Optional[User]:
    mock_users = {
        1: User(id=1, display_name="Ada Lovelace", email="ada@example.com"),
        2: User(id=2, display_name="Grace Hopper", email="grace@example.com"),
    }
    return mock_users.get(user_id)

@app.route("/users/<int:user_id>")
def get_user(user_id: int):
    user = fetch_user_from_db(user_id)

    if not user:
        return jsonify({"error": "User not found"}), 404

    # EXPAND PHASE: This is v2 of the API.
    # We've renamed the field from 'user_name' to 'username' internally,
    # but we still return BOTH keys during the rollout transition window.
    # Old clients reading 'user_name' keep working.
    # New clients reading 'username' also work.
    # We'll remove 'user_name' in a separate v3 deploy once all clients are updated.
    response_payload = {
        "id": user.id,
        "username": user.display_name,       # New field name — clients should migrate to this
        "user_name": user.display_name,      # Deprecated — kept for backward compatibility only
        "email": user.email,
        "_meta": {
            "deprecated_fields": ["user_name"],  # Signal to API consumers that migration is needed
            "api_version": "v2"
        }
    }

    return jsonify(response_payload), 200

if __name__ == "__main__":
    app.run(debug=False, port=8080)
▶ Output
$ curl http://localhost:8080/users/1
{
  "_meta": {
    "api_version": "v2",
    "deprecated_fields": ["user_name"]
  },
  "email": "ada@example.com",
  "id": 1,
  "user_name": "Ada Lovelace",
  "username": "Ada Lovelace"
}
🔥 Interview Gold: The Expand-Contract Pattern

When an interviewer asks 'how do you handle database migrations with rolling deployments?', the expand-contract answer (also called parallel change) is exactly what senior engineers say. It shows you've actually shipped rolling deployments and hit the database migration wall, not just read about the concept.
| Aspect | Rolling Deployment | Blue/Green Deployment |
| --- | --- | --- |
| Downtime during deploy | Zero — traffic shifts gradually | Zero — traffic switches atomically |
| Resource cost | Low — only maxSurge extra instances needed | High — requires a full duplicate environment |
| Rollback speed | Slow — must re-roll forward or wait for undo | Instant — flip the load balancer back to blue |
| Version skew risk | High — two versions serve traffic simultaneously | None — only one version is live at any time |
| Best for | Stateless services with backward-compatible APIs | Stateful apps or high-risk releases needing instant rollback |
| Database migrations | Requires expand-contract pattern across multiple deploys | Easier — blue env handles migration before cutover |
| Complexity | Low — built into Kubernetes natively | High — requires managing two environments |
| Traffic control | Coarse — batch-based percentage splits | Precise — can do weighted routing with a proxy |

🎯 Key Takeaways

  • Rolling deployments replace instances in small batches — health checks gate every batch, so a bad deploy can only affect the fraction of traffic hitting the new instances, not everyone at once.
  • The expand-contract pattern is non-negotiable for rolling deployments — any API field rename, database column change, or message schema update must be deployed in at least two phases to survive the version skew window.
  • Never tag deployment images as 'latest' — use the git commit SHA so every running version is traceable to a specific code change and kubectl rollout undo reverts to a known, specific state.
  • The kubectl rollout status --timeout flag is what separates a robust pipeline from a false-green one — without it your CI reports success the moment deployment is triggered, not when it actually finishes.

⚠ Common Mistakes to Avoid

  • Mistake 1: Using the 'latest' image tag in your deployment — Your rollout appears to succeed but Kubernetes may not actually pull a new image because 'latest' was already cached on the node. The symptom is kubectl rollout status saying 'successfully rolled out' but your new code never actually running. Fix: always tag images with the git commit SHA (e.g. myapp:a3f9c12) and set imagePullPolicy: Always in your container spec. This guarantees every rollout pulls a specific, traceable image.
  • Mistake 2: Skipping the readinessProbe — New pods start, pass the liveness check (meaning the process is running), and immediately receive live traffic — but your app takes 15 seconds to warm up its database connection pool. The symptom is a spike of 503 errors on every deploy, right as new pods come online. Fix: configure a separate readinessProbe that hits a /health/ready endpoint which returns 200 only after the app has fully initialized. Kubernetes will hold the pod out of rotation until it genuinely passes.
  • Mistake 3: Making a breaking database change in the same deploy as the code change — You add a NOT NULL column to the users table and deploy the code that populates it in the same release. During the rollout, old pods try to insert rows without that column and crash with a database constraint violation. Fix: decouple schema changes from code changes using the expand-contract pattern. Deploy the column as nullable first, ship code that populates it, backfill existing rows, then add the NOT NULL constraint in a later deploy.
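The readiness/liveness distinction behind Mistake 2 is worth seeing in miniature. This is a framework-free sketch (the `App` class and its method names are invented for illustration): the process is 'live' from the moment it starts, but only becomes 'ready' after slow initialization finishes, which is exactly what keeps Kubernetes from routing traffic to a cold pod:

```python
class App:
    """Sketch of the live-vs-ready distinction a rolling deploy relies on."""

    def __init__(self):
        self._ready = False          # not ready until warm-up completes

    def warm_up(self):
        # Stand-in for the slow parts: opening DB connection pools,
        # loading caches, compiling templates, etc.
        self._ready = True

    def health_live(self):
        # Liveness: is the process running at all? Always yes once started.
        return 200

    def health_ready(self):
        # Readiness: can this instance actually serve traffic yet?
        return 200 if self._ready else 503

app = App()
print(app.health_live(), app.health_ready())   # 200 503 -> live, held out of rotation
app.warm_up()
print(app.health_live(), app.health_ready())   # 200 200 -> now safe to receive traffic
```

If liveness and readiness were the same check, a pod would either receive traffic before warm-up (the 503 spike from Mistake 2) or be killed and restarted forever while it initializes — splitting them is what lets the rollout wait without punishing a slow start.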

Interview Questions on This Topic

  • Q: What is version skew in the context of rolling deployments, and how does the expand-contract pattern solve it? Give a concrete database example.
  • Q: Walk me through what maxUnavailable and maxSurge actually control in a Kubernetes rolling deployment. What happens at the resource level if you set maxUnavailable to 0 on a cluster that's at 90% capacity?
  • Q: Your rolling deployment shows as 'successful' in the CI pipeline but users are reporting errors for about three minutes on every release. What are the three most likely root causes and how would you diagnose each one?

Frequently Asked Questions

What is the difference between a rolling deployment and a blue/green deployment?

A rolling deployment gradually replaces old instances with new ones, so both versions serve traffic simultaneously during the rollout. A blue/green deployment runs two full identical environments and switches all traffic at once with a load balancer flip. Blue/green gives you instant rollback and zero version skew, but costs roughly double the infrastructure. Rolling deployments are cheaper but require your code to handle two versions coexisting.

How do I roll back a Kubernetes rolling deployment that went wrong?

Run kubectl rollout undo deployment/your-deployment-name. Kubernetes stores the previous ReplicaSet configuration and will immediately start rolling back to it using the same rolling strategy. You can also target a specific revision with --to-revision=2. This is why tagging images with git SHAs matters — the rollback actually goes back to a known, specific version of your code.

Can I use rolling deployments with a stateful service like a database?

Directly rolling out a stateful database (like a primary PostgreSQL instance) with a standard rolling deployment is dangerous because your data and schema are shared state — it's not like stateless app servers where any instance is interchangeable. For databases, blue/green or maintenance-window deployments are safer. Rolling deployments work well for stateless application services that sit in front of a database, as long as you handle schema migrations separately using the expand-contract pattern.

TheCodeForge Editorial Team · Verified Author

Written and reviewed by senior developers with real-world experience across enterprise, startup and open-source projects. Every article on TheCodeForge is written to be clear, accurate and genuinely useful — not just SEO filler.

Forged with 🔥 at TheCodeForge.io — Where Developers Are Forged