Intermediate 8 min · March 06, 2026

Rolling Deployments - Missing Readiness Probe 3-Min Error

Q: What is the difference between a rolling deployment and a blue/green deployment?

A rolling deployment gradually replaces old instances with new ones, so both versions serve traffic simultaneously during the rollout. A blue/green deployment runs two full identical environments and switches all traffic at once with a load balancer flip. Blue/green gives you instant rollback and zero version skew, but costs roughly double the infrastructure. Rolling deployments are cheaper but require your code to handle two versions coexisting.

Q: How do I roll back a Kubernetes rolling deployment that went wrong?

Run kubectl rollout undo deployment/your-deployment-name. Kubernetes stores the previous ReplicaSet configuration and will immediately start rolling back to it using the same rolling strategy. You can also target a specific revision with --to-revision=2. This is why tagging images with git SHAs matters — the rollback actually goes back to a known, specific version of your code.

Q: Can I use rolling deployments with a stateful service like a database?

Directly rolling out a stateful database (like a primary PostgreSQL instance) with a standard rolling deployment is dangerous because your data and schema are shared state — it's not like stateless app servers where any instance is interchangeable. For databases, blue/green or maintenance-window deployments are safer. Rolling deployments work well for stateless application services that sit in front of a database, as long as you handle schema migrations separately using the expand-contract pattern.

Q: What is the default revision history limit in Kubernetes and why does it matter for rollback?

The default revisionHistoryLimit is 10. Once you exceed that, older ReplicaSets are pruned and cannot be rolled back to using rollout undo. If you need to roll back to a month-old version, you'll have to re-deploy the old image tag. For critical services, increase the limit to 20 or 30, or use a deployment tool that stores image tags externally.

A readiness probe returning 200 before DB pool init caused 3-minute 503 errors on every deploy.

Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Lessons pulled from things that broke in production.

✓ Production tested · Last updated July 19, 2026 · 2,466 articles, all by Naren

Before you start⏱ 25 min

✓Solid grasp of DevOps fundamentals
✓Comfortable with command-line tools
✓Basic Linux administration knowledge


⚡Quick Answer

Rolling deployments replace old instances with new ones in batches, never taking the whole service down.
maxUnavailable controls how many pods can be down at once; maxSurge controls how many extra pods can be created.
Health checks gate every batch — Kubernetes won't send traffic until the new pod passes readiness.
Version skew is guaranteed: old and new code coexist during rollout, so backward compatibility is mandatory.
The biggest mistake is missing a readinessProbe: pods start but get traffic before their connection pool warms up.

✦ Definition~90s read

What Is a Rolling Deployment?

A rolling deployment is a strategy for updating running software by gradually replacing old instances with new ones, one pod or server at a time, rather than taking the entire system down. Kubernetes, AWS ECS, and Nomad all implement this natively: they spin up a new replica, wait for it to pass health checks, then terminate an old one, repeating until all instances are updated.

The core problem rolling deployments solve is zero-downtime updates — users never see a 503 because there's always at least one healthy instance serving traffic. But the trade-off is complexity: during the rollout, two versions of your code run simultaneously, which means every request must be handled correctly by either version.

If your new code can't read data written by the old code, or vice versa, you get silent data corruption or 500s. This is the version skew problem, and it's why backward compatibility isn't optional — it's the fundamental constraint that makes rolling deployments safe.

The 3-minute readiness probe error you're hitting is Kubernetes telling you your new pods aren't passing their health checks fast enough, which stalls the entire rollout and eventually fails it. That probe is your safety net: it prevents traffic from reaching a pod that isn't ready to serve, but if your app takes longer than the configured threshold to initialize, the deployment hangs.

You'll see this most often with apps that do heavy startup work: loading ML models, warming caches, or running database migrations on boot. The fix isn't to remove the probe. Make startup faster, raise initialDelaySeconds or failureThreshold to match your real startup time, or add a startup probe, which holds off liveness and readiness checks until the app has finished starting.
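As a rough sketch of the arithmetic behind a startup probe: the container gets roughly initialDelaySeconds plus failureThreshold × periodSeconds to come up before Kubernetes restarts it. The function names and the numbers below are illustrative assumptions, not values from the incident.

```python
def startup_budget_seconds(failure_threshold: int,
                           period_seconds: int,
                           initial_delay_seconds: int = 0) -> int:
    """Approximate time a startupProbe allows before the container is restarted."""
    return initial_delay_seconds + failure_threshold * period_seconds


def fits_startup_budget(worst_case_startup_s: int,
                        failure_threshold: int,
                        period_seconds: int,
                        initial_delay_seconds: int = 0) -> bool:
    """True if the slowest startup you have measured fits inside the probe budget."""
    budget = startup_budget_seconds(failure_threshold, period_seconds,
                                    initial_delay_seconds)
    return worst_case_startup_s <= budget


if __name__ == "__main__":
    # 30 failures allowed, probed every 10s: a 300s budget
    print(startup_budget_seconds(30, 10))
    # A 180s model load fits in 300s, but not in 3 x 5 = 15s
    print(fits_startup_budget(180, 30, 10))
    print(fits_startup_budget(180, 3, 5))
```

Measure the worst-case startup under production-like load before picking the numbers; a budget that only fits the happy path will fail on the first cold cache.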

Rolling deployments are not the right tool for everything. If your database schema changes are backward-incompatible (e.g., dropping a column), you need a blue/green deployment or a phased approach with expand-contract migrations. And if your service is stateful, like a message queue or a database itself, rolling deployments can cause split-brain scenarios; those systems need StatefulSets with ordered, graceful pod management.

For stateless HTTP services, though, rolling deployments are the default because they're simple, resource-efficient, and don't require extra infrastructure like load balancer reconfiguration. Just remember: the readiness probe is your deployment's heartbeat monitor.

Ignore it, and your rollout silently bleeds traffic to broken pods.

Plain-English First

Imagine a restaurant that wants to swap out every table with a fancier one. Instead of closing for the day, they replace one table at a time while customers keep eating. Some diners sit at the old tables, some at the new ones — but the restaurant never closes. That's a rolling deployment: you swap out old servers running old code for new ones, gradually, while real users keep getting served without ever seeing a 'down for maintenance' page.

Every time your team ships a new feature, there's a moment of genuine terror — the deployment window. In the old days, that meant taking your entire app offline at 2am on a Sunday, praying nothing breaks, and issuing a public apology if it does. For teams deploying multiple times a day, that approach simply doesn't scale. It's not just inconvenient; it's a business risk that your competitors have already solved.

Rolling deployments exist to eliminate that terror entirely. Instead of replacing your entire fleet of servers at once, you replace them in small batches. At any given moment, some instances are running the old version and some are running the new version. If something goes wrong with the new version, you've only exposed a fraction of your users to the problem — and you can halt the rollout immediately. The blast radius of a bad deploy shrinks from 'everyone is down' to 'a small percentage of requests hit the broken version for a few minutes'.

By the end of this article you'll understand exactly how a rolling deployment works under the hood, how to configure one in Kubernetes and a CI/CD pipeline, what makes them fail silently, and how to make yours bulletproof. You'll also be able to explain the trade-offs confidently in a system design interview — because rolling deployments come up constantly.

How Rolling Deployments Actually Work

A rolling deployment replaces instances of an old application version with a new one incrementally, keeping the service available throughout. The core mechanic: you update a subset of instances (e.g., 25% of a 100-instance cluster) to the new version, verify they're healthy, then proceed to the next batch. This avoids downtime and allows gradual exposure to changes.

Key properties: batch size (how many instances updated at once), max surge (extra instances allowed during update), and max unavailable (instances allowed to be down). In Kubernetes, a Deployment with strategy type RollingUpdate uses these to control the pace. The readiness probe is critical — if it's missing or misconfigured, the controller may consider a broken pod "ready" and continue rolling, causing widespread failure within minutes.
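To make max surge and max unavailable concrete, here is a small sketch that turns the two settings into the pod-count bounds a rollout stays within. It assumes the rounding Kubernetes documents for percentages (maxSurge rounds up, maxUnavailable rounds down); the function names are hypothetical.

```python
from typing import Dict, Union

IntOrPercent = Union[int, str]


def resolve(value: IntOrPercent, replicas: int, round_up: bool) -> int:
    """Turn an absolute number or a string like '25%' into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        scaled = replicas * int(value[:-1])
        # Integer arithmetic avoids floating-point surprises
        return -(-scaled // 100) if round_up else scaled // 100
    return int(value)


def rollout_bounds(replicas: int,
                   max_surge: IntOrPercent,
                   max_unavailable: IntOrPercent) -> Dict[str, int]:
    """Upper bound on total pods and lower bound on available pods during a rollout."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    if surge == 0 and unavailable == 0:
        # With both at zero the controller could never make progress
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    return {
        "max_total_pods": replicas + surge,
        "min_available_pods": replicas - unavailable,
    }


if __name__ == "__main__":
    print(rollout_bounds(6, 2, 1))
    print(rollout_bounds(10, "25%", "25%"))
```

For 6 replicas with maxSurge 2 and maxUnavailable 1, the rollout never exceeds 8 pods and never drops below 5 available ones.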

Use rolling deployments for stateless services where zero-downtime updates are required and you can tolerate a brief mixed-version state. They're the default for most microservices because they balance safety with speed. Avoid them for stateful workloads or when backward-incompatible schema changes exist — blue/green or canary deployments are safer there.

⚠ Missing Readiness Probe = Silent Disaster

Without a readiness probe, Kubernetes treats a pod as ready the instant its containers start — even if the app returns 500s. A rolling update then proceeds, breaking all traffic.

📊 Production Insight

A team deployed a new version with a database migration that took 30 seconds to complete. The Deployment had no readiness probe, so the first batch of pods was marked ready as soon as its containers started, and the controller kept rolling. Within 3 minutes, 90% of pods were serving errors.

Symptom: gradual increase in 5xx errors across all instances, no single pod crash, no obvious alert.

Rule of thumb: always define a readiness probe that checks actual application readiness (e.g., /health/ready endpoint) and set initialDelaySeconds to account for startup time.
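One way to make a readiness endpoint reflect actual readiness is to track the dependencies the app still waits on and return 503 until all of them have reported in. This is a minimal sketch: ReadinessGate and the dependency names are hypothetical, and wiring check() into a /health/ready route is left to your web framework.

```python
import threading
from typing import Dict, Iterable, Tuple


class ReadinessGate:
    """Tracks startup dependencies; reports ready only when none are pending."""

    def __init__(self, required: Iterable[str]) -> None:
        self._pending = set(required)
        self._lock = threading.Lock()

    def mark_ready(self, dependency: str) -> None:
        # Call this after the dependency has really finished initializing,
        # e.g. once the DB pool has opened and validated its connections.
        with self._lock:
            self._pending.discard(dependency)

    def check(self) -> Tuple[int, Dict[str, object]]:
        """Return (HTTP status, JSON-serializable body) for the readiness endpoint."""
        with self._lock:
            pending = sorted(self._pending)
        if pending:
            return 503, {"status": "not_ready", "waiting_on": pending}
        return 200, {"status": "ready"}


if __name__ == "__main__":
    gate = ReadinessGate(["db_pool", "cache"])
    print(gate.check())      # 503 while anything is pending
    gate.mark_ready("db_pool")
    gate.mark_ready("cache")
    print(gate.check())      # 200 once everything has reported in
```

The point of the design is that the endpoint cannot return 200 by default: readiness has to be earned by each dependency.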

🎯 Key Takeaway

A missing or naive readiness probe is the #1 cause of failed rolling deployments — it lets broken pods pass as healthy.

Batch size and max surge directly control blast radius: smaller batches = safer, but slower rollouts.

Always test a rolling update in a non-production environment with realistic traffic patterns before touching production.

thecodeforge.io

Rolling Deployments

How a Rolling Deployment Actually Works Step by Step

A rolling deployment works by treating your server fleet like a queue. You decide on two numbers: the maximum number of instances you're willing to take offline at once (maxUnavailable) and the maximum number of extra instances you'll spin up during the transition (maxSurge). The deployment controller — whether that's Kubernetes, ECS, or a custom script — then orchestrates a loop.

The loop looks like this: take a small batch of old instances out of the load balancer rotation, drain their in-flight requests, terminate them, start new instances with the new code, wait for those new instances to pass health checks, then add them back to the load balancer. Repeat until every instance is on the new version.

The crucial word in that loop is 'health checks'. The system won't move on to the next batch until the new instances actually prove they're healthy. This is what makes rolling deployments safe — the process is gated on real evidence that the new code works, not just the assumption that it compiled and started.
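The loop can be sketched as a toy simulation. This is not how any real controller is implemented; the fleet is just a list of version strings and is_healthy is a stand-in for a readiness check, both assumptions made for illustration.

```python
from typing import Callable, List


def rolling_update(fleet: List[str],
                   new_version: str,
                   batch_size: int,
                   is_healthy: Callable[[str, int], bool]) -> bool:
    """Replace fleet entries batch by batch; halt at the first unhealthy batch.

    Returns True if every instance was updated. On a halt, instances in
    later batches are left untouched on the old version.
    """
    for start in range(0, len(fleet), batch_size):
        batch = range(start, min(start + batch_size, len(fleet)))
        for index in batch:
            fleet[index] = new_version          # swap in the new instance
        if not all(is_healthy(new_version, index) for index in batch):
            return False                        # gate: do not touch the next batch
    return True


if __name__ == "__main__":
    healthy_fleet = ["v1"] * 6
    print(rolling_update(healthy_fleet, "v2", 2, lambda version, index: True))
    print(healthy_fleet)

    broken_fleet = ["v1"] * 6
    print(rolling_update(broken_fleet, "v2", 2, lambda version, index: False))
    print(broken_fleet)   # only the first batch was exposed to the bad version
```

With a failing health check, four of six instances never leave v1, which is the blast-radius limit the gate buys you.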

The downside is that during the rollout, two versions of your code are live simultaneously. If your new API changes a response shape that the old frontend depends on, or your new code writes a database column that the old code doesn't know about, you'll have problems. That's not a flaw in rolling deployments — it's a constraint that forces you to write backward-compatible code, which is a good habit regardless.

kubernetes-rolling-deployment.yamlYAML

apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
  labels:
    app: payments-api
spec:
  replicas: 6                          # We're running 6 instances in total
  selector:
    matchLabels:
      app: payments-api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1                # Never take more than 1 instance offline at a time
      maxSurge: 2                      # Allow up to 2 extra instances during the rollout
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      containers:
        - name: payments-api
          image: myregistry/payments-api:v2.4.1   # The new version we're rolling out
          ports:
            - containerPort: 8080
          readinessProbe:              # Kubernetes won't route traffic until this passes
            httpGet:
              path: /health/ready      # Our app exposes a readiness endpoint
              port: 8080
            initialDelaySeconds: 10    # Give the app 10s to start before probing
            periodSeconds: 5           # Check every 5 seconds
            failureThreshold: 3        # Three consecutive failures = not ready
          livenessProbe:               # Kubernetes restarts the pod if this fails
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 20
            periodSeconds: 10
          resources:
            requests:
              cpu: "250m"
              memory: "256Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"

Output

$ kubectl apply -f kubernetes-rolling-deployment.yaml

deployment.apps/payments-api configured

$ kubectl rollout status deployment/payments-api

Waiting for deployment "payments-api" rollout to finish: 2 out of 6 new replicas have been updated...

Waiting for deployment "payments-api" rollout to finish: 3 out of 6 new replicas have been updated...

Waiting for deployment "payments-api" rollout to finish: 4 out of 6 new replicas have been updated...

Waiting for deployment "payments-api" rollout to finish: 5 out of 6 new replicas have been updated...

Waiting for deployment "payments-api" rollout to finish: 1 old replicas are pending termination...

deployment "payments-api" successfully rolled out

⚠ Watch Out: maxUnavailable: 0 Is Not Free Safety

Setting maxUnavailable to 0 means Kubernetes must spin up new instances before removing old ones. This sounds safer, but it means the cluster needs spare capacity for up to maxSurge extra pods while the rollout runs. If your cluster is near capacity, the new pods will get stuck in Pending state and the rollout will stall. Always pair maxUnavailable: 0 with a confirmed headroom buffer in your cluster.

📊 Production Insight

Real world: A team used maxUnavailable: 0 on a cluster at 85% capacity.

Result: new pods stayed Pending, rollout stalled, no new code shipped for 3 hours.

Rule: always leave at least 10% cluster headroom when using maxUnavailable: 0.

🎯 Key Takeaway

Rolling deployments replace instances in small batches.

Health checks gate every batch, so a failure can only affect a fraction of traffic.

Version skew is guaranteed — backward compatibility is not optional.

Wiring a Rolling Deployment Into Your CI/CD Pipeline

Understanding rolling deployments in isolation is one thing — getting them to fire automatically from a git push is another. The pattern that actually works in production ties three things together: your container registry, your deployment manifest, and a pipeline step that updates the image tag and triggers the rollout.

The anti-pattern is updating the manifest file by hand. The moment a human has to manually edit a YAML file and run kubectl apply, you've introduced the most dangerous variable in software: a tired human at 11pm. Instead, your CI pipeline should build the image, tag it with the exact git commit SHA (not 'latest' — never 'latest'), push it to the registry, and then use a tool like kubectl set image or kustomize to patch the deployment manifest and apply it automatically.

The pipeline below shows a GitHub Actions workflow that does exactly this. Notice how the image tag is the git SHA — that means every deployed version is traceable to a specific commit. If something goes wrong, you know exactly what changed. You can also use kubectl rollout undo to immediately revert to the previous SHA-tagged image, which is the rolling deployment equivalent of a one-command parachute.

The health check gates in the deployment YAML you saw in the previous section are what make this pipeline safe to run on every merge to main. The pipeline doesn't need to babysit the rollout — Kubernetes does that. The pipeline's job is just to hand off the new image and trust the deployment strategy to do its work.
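The hand-off logic (set the image, block on rollout status, undo on failure) can be sketched in Python with the command runner injected so it can be tested without a cluster. The deploy and image_tag helpers are hypothetical; the kubectl subcommands are the same ones used elsewhere in this article.

```python
import subprocess
from typing import Callable, List

Runner = Callable[[List[str]], int]


def kubectl(args: List[str]) -> int:
    """Default runner: shells out to kubectl and returns its exit code."""
    return subprocess.run(["kubectl", *args]).returncode


def image_tag(image_ref: str) -> str:
    """Extract the tag, ignoring any ':port' in the registry host."""
    last_segment = image_ref.rsplit("/", 1)[-1]
    return last_segment.split(":", 1)[1] if ":" in last_segment else ""


def deploy(image_ref: str, deployment: str, container: str, namespace: str,
           run: Runner = kubectl, timeout: str = "5m") -> bool:
    """Trigger a rolling update; undo it if the rollout does not complete."""
    if image_tag(image_ref) in ("", "latest"):
        raise ValueError("deploy an immutable tag such as the git SHA, not 'latest'")
    target = f"deployment/{deployment}"
    ns = f"--namespace={namespace}"
    if run(["set", "image", target, f"{container}={image_ref}", ns]) != 0:
        return False                      # nothing was rolled out, nothing to undo
    if run(["rollout", "status", target, ns, f"--timeout={timeout}"]) == 0:
        return True
    run(["rollout", "undo", target, ns])  # rollout stalled or failed: revert
    return False
```

Passing a fake runner in tests lets you assert the exact command sequence for the failure path, which is the path you least want to discover in production.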

.github/workflows/deploy-payments-api.ymlYAML

name: Build and Rolling-Deploy Payments API

on:
  push:
    branches:
      - main                           # Only deploy from main branch merges

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: myorg/payments-api
  KUBE_NAMESPACE: production
  DEPLOYMENT_NAME: payments-api

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write                  # Needed to push to GitHub Container Registry

    steps:
      - name: Checkout source code
        uses: actions/checkout@v4

      - name: Log in to GitHub Container Registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Build and push Docker image
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          # Tag with the git SHA so every image is 100% traceable to a commit
          tags: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}

      - name: Configure kubectl
        uses: azure/k8s-set-context@v3
        with:
          method: kubeconfig
          kubeconfig: ${{ secrets.KUBE_CONFIG }}  # Store your kubeconfig as a repo secret

      - name: Trigger rolling deployment with new image
        run: |
          # This command patches the deployment in-place — Kubernetes handles the rolling strategy
          kubectl set image deployment/${{ env.DEPLOYMENT_NAME }} \
            ${{ env.DEPLOYMENT_NAME }}=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }} \
            --namespace=${{ env.KUBE_NAMESPACE }}

      - name: Wait for rollout to complete (fail pipeline if deployment fails)
        id: rollout
        run: |
          # This blocks the pipeline until the rollout finishes or times out after 5 minutes
          kubectl rollout status deployment/${{ env.DEPLOYMENT_NAME }} \
            --namespace=${{ env.KUBE_NAMESPACE }} \
            --timeout=5m

      - name: Roll back automatically if rollout failed
        # failure() alone is true when ANY earlier step failed, so also check the rollout step itself
        if: failure() && steps.rollout.outcome == 'failure'
        run: |
          echo "Rollout failed, reverting to previous deployment"
          kubectl rollout undo deployment/${{ env.DEPLOYMENT_NAME }} \
            --namespace=${{ env.KUBE_NAMESPACE }}

Output

Run kubectl set image deployment/payments-api payments-api=ghcr.io/myorg/payments-api:a3f9c12...

deployment.apps/payments-api image updated

Run kubectl rollout status deployment/payments-api --namespace=production --timeout=5m

Waiting for deployment "payments-api" rollout to finish: 1 out of 6 new replicas have been updated...

Waiting for deployment "payments-api" rollout to finish: 2 out of 6 new replicas have been updated...

Waiting for deployment "payments-api" rollout to finish: 3 out of 6 new replicas have been updated...

Waiting for deployment "payments-api" rollout to finish: 4 out of 6 new replicas have been updated...

Waiting for deployment "payments-api" rollout to finish: 5 out of 6 new replicas have been updated...

deployment "payments-api" successfully rolled out

💡Pro Tip: Always Block the Pipeline on Rollout Status

Without the kubectl rollout status step, your pipeline reports 'success' the moment it triggers the deployment — not when it actually finishes. That means a broken deployment looks green in your CI dashboard while your users are hitting errors. The --timeout flag ensures the pipeline fails loudly if the rollout stalls, giving you an automatic tripwire for bad deploys.

📊 Production Insight

Without rollout status blocking, a production incident can go unnoticed for 10+ minutes.

The pipeline reports green, the deploy actually failed silently, and users see errors.

Rule: always block CI pipeline on rollout status with a timeout.

🎯 Key Takeaway

Never tag images as 'latest' — use the git commit SHA.

Always block CI pipeline on rollout status with --timeout.

An auto-rollback step on failure is your safety net.

The Version Skew Problem — Why Your Code Must Be Backward Compatible

Here's the scenario nobody warns you about until it burns you. You're rolling out v2 of your user service. v2 renames the JSON field 'user_name' to 'username' (dropping the underscore, a perfectly reasonable cleanup). For about four minutes while the rollout happens, your load balancer is sending some requests to v1 pods and some to v2 pods.

Your frontend is calling the user service and reading 'user_name'. The v1 pods return it correctly. The v2 pods return 'username' instead. For those four minutes, roughly half your users see a broken UI where their name doesn't display. You've just created a production incident during what should have been a routine deploy.

This is called version skew — the period where multiple versions of the same service coexist. It's not optional during a rolling deployment; it's guaranteed. The fix isn't to avoid rolling deployments. The fix is the expand-contract pattern: in v2, return BOTH 'user_name' AND 'username'. In v3 (a later deploy), remove 'user_name'. You expand the interface first, let the world catch up, then contract it.
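The consumer side of expand-contract is a tolerant reader: prefer the new field, fall back to the old one. A minimal sketch, assuming the payload is already parsed into a dict (read_username is a hypothetical helper, not part of any library):

```python
from typing import Any, Mapping


def read_username(payload: Mapping[str, Any]) -> str:
    """Read the display name from either a v1 or a v2 user payload."""
    # Prefer the new key; fall back to the deprecated one during the skew window
    for key in ("username", "user_name"):
        value = payload.get(key)
        if isinstance(value, str) and value:
            return value
    raise KeyError("payload has neither 'username' nor 'user_name'")


if __name__ == "__main__":
    print(read_username({"id": 1, "user_name": "Ada Lovelace"}))   # v1 pod
    print(read_username({"id": 1, "username": "Ada Lovelace",
                         "user_name": "Ada Lovelace"}))            # v2 pod
```

A client written this way works against v1, v2, and the later v3 that drops 'user_name', so it can ship before, during, or after the server rollout.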

The same principle applies to database migrations. Never drop a column or rename one in the same deploy that changes the code that reads it. Add the new column, deploy code that writes to both, migrate the data, deploy code that only reads the new column, then drop the old one. It's more steps, but each step is individually safe.

user_service_backward_compatible_response.pyPYTHON

from flask import Flask, jsonify
from dataclasses import dataclass
from typing import Optional

app = Flask(__name__)

@dataclass
class User:
    id: int
    display_name: str
    email: str

# Simulating a database fetch
def fetch_user_from_db(user_id: int) -> Optional[User]:
    mock_users = {
        1: User(id=1, display_name="Ada Lovelace", email="ada@example.com"),
        2: User(id=2, display_name="Grace Hopper", email="grace@example.com"),
    }
    return mock_users.get(user_id)

@app.route("/users/<int:user_id>")
def get_user(user_id: int):
    user = fetch_user_from_db(user_id)

    if not user:
        return jsonify({"error": "User not found"}), 404

    # EXPAND PHASE: This is v2 of the API.
    # We've renamed the field from 'user_name' to 'username' internally,
    # but we still return BOTH keys during the rollout transition window.
    # Old clients reading 'user_name' keep working.
    # New clients reading 'username' also work.
    # We'll remove 'user_name' in a separate v3 deploy once all clients are updated.
    response_payload = {
        "id": user.id,
        "username": user.display_name,       # New field name — clients should migrate to this
        "user_name": user.display_name,      # Deprecated — kept for backward compatibility only
        "email": user.email,
        "_meta": {
            "deprecated_fields": ["user_name"],  # Signal to API consumers that migration is needed
            "api_version": "v2"
        }
    }

    return jsonify(response_payload), 200

if __name__ == "__main__":
    app.run(debug=False, port=8080)

Output

$ curl http://localhost:8080/users/1

{
  "_meta": {
    "api_version": "v2",
    "deprecated_fields": ["user_name"]
  },
  "email": "ada@example.com",
  "id": 1,
  "user_name": "Ada Lovelace",
  "username": "Ada Lovelace"
}

🔥Interview Gold: The Expand-Contract Pattern

When an interviewer asks 'how do you handle database migrations with rolling deployments?', the expand-contract answer (also called parallel change) is exactly what senior engineers say. It shows you've actually shipped rolling deployments and hit the database migration wall, not just read about the concept.

📊 Production Insight

A real incident: renaming a field during a rolling deploy caused a 4-minute window where 50% of requests returned partial data.

The root cause: no backward compatibility in the new API version.

Rule: always use expand-contract for any interface or schema change.

🎯 Key Takeaway

Rolling deployments guarantee version skew.

Backward compatibility is not optional — use expand-contract.

Database migrations must be separate from code changes.

Rolling Back a Deployment: The Graceful Undo

No matter how well you test, a bad deploy will eventually happen. The question isn't if — it's how fast you can recover. Rolling deployments give you a built-in escape hatch that doesn't require a full re-deploy of the old version. Kubernetes stores the previous ReplicaSet configuration as a revision, and you can instantly start a rolling rollback with a single command.

The 'kubectl rollout undo' command doesn't snap all instances back at once. It uses the same rolling update strategy — it gradually replaces new pods with the previous version's pods. That means your rollback is just as safe as your forward deployment: health checks gate the process, traffic gradually shifts back, and if the rollback also fails (unlikely but possible), you can undo the rollback.

The catch is that Kubernetes only keeps a limited number of revision histories. By default, it stores 10 revisions — controlled by the revisionHistoryLimit field. Once you exceed that limit, older revisions are pruned, and you can't undo to them. If you need to roll back to a version from a month ago, you'll need to re-deploy the old image tag, not rely on rollout undo.

Another gotcha: if you accidentally deployed the same version twice (e.g., re-pushed the same image with a new tag but identical code), rollout undo will roll you back to the same broken version. Always verify the image tag in the previous ReplicaSet before trusting an undo.
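Both gotchas can be modeled in a few lines. The sketch below assumes revisionHistoryLimit counts old ReplicaSets retained in addition to the current one; the Revision type and helper names are hypothetical, and the history is hand-built rather than read from a cluster.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Revision:
    number: int
    image: str


def retained(history: List[Revision], limit: int) -> List[Revision]:
    """Current revision plus the newest `limit` older ones, oldest first."""
    if not history:
        return []
    ordered = sorted(history, key=lambda rev: rev.number)
    current, older = ordered[-1], ordered[:-1]
    kept_old = older[-limit:] if limit > 0 else []
    return kept_old + [current]


def undo_target(history: List[Revision], limit: int) -> Optional[Revision]:
    """Revision a plain 'rollout undo' would return to, if one is still retained."""
    kept = retained(history, limit)
    return kept[-2] if len(kept) >= 2 else None


def undo_changes_image(history: List[Revision], limit: int) -> bool:
    """False when undo would land on the same image that is running now."""
    target = undo_target(history, limit)
    return target is not None and target.image != retained(history, limit)[-1].image


if __name__ == "__main__":
    history = [Revision(n, f"payments-api:sha{n}") for n in range(1, 13)]
    print([rev.number for rev in retained(history, 10)])   # revision 1 is pruned
    same_twice = [Revision(1, "a"), Revision(2, "b"), Revision(3, "b")]
    print(undo_changes_image(same_twice, 10))              # undo would not help
```

Running the second check before an undo catches the case where the "previous" revision carries the same broken image.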

rollback-commands.shBASH

# Quick rollback to the previous revision
kubectl rollout undo deployment/payments-api --namespace=production

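# Rollback to a specific revision (list revisions first)
kubectl rollout history deployment/payments-api --namespace=production
# Output:
# REVISION  CHANGE-CAUSE
# 1         <none>
# 2         <none>
# 3         <none>
# (CHANGE-CAUSE is populated only if the deployment has the annotation kubernetes.io/change-cause)

# Inspect revision 2, including its image, before rolling back to it
kubectl rollout history deployment/payments-api --revision=2 --namespace=production

# Target revision 2
kubectl rollout undo deployment/payments-api --to-revision=2 --namespace=production

# Check rollback status
kubectl rollout status deployment/payments-api --namespace=production --timeout=5m

# List the old and new ReplicaSets along with their images
kubectl get rs -l app=payments-api --namespace=production -o wide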

Output

$ kubectl rollout undo deployment/payments-api --namespace=production

deployment.apps/payments-api rolled back

$ kubectl rollout status deployment/payments-api --timeout=5m

Waiting for deployment "payments-api" rollout to finish: 2 out of 6 new replicas have been updated...

Waiting for deployment "payments-api" rollout to finish: 3 out of 6 new replicas have been updated...

deployment "payments-api" successfully rolled out

⚠ Rollout Undo Is Not Instant — Plan for It

Rollout undo is a rolling deployment itself. It takes the same time as the original rollout. If your service can't tolerate a 3-minute transition period, you may need a blue/green deployment instead. Rolling undo is safe but not fast — design your incident response around that latency.

📊 Production Insight

Rollout undo can take as long as the original deploy — it's not instant.

Default revision history limit is 10 — after that, you can't undo to older versions.

Rule: ensure you have a quick recovery path for urgent rollbacks (e.g., blue/green or feature flags).

🎯 Key Takeaway

Rollback uses the same rolling strategy — it's not instant.

revisionHistoryLimit controls how many versions you can undo to.

Always verify the target revision before issuing an undo.

Database Migrations with Rolling Deployments: The Safe Way

Few things cause more rolling deployment failures than database schema changes. The problem is fundamental: your database is a shared, stateful resource. You can't have two versions of the schema while two versions of your code are running. Or can you?

The answer is the expand-contract pattern applied to database changes. Let's walk through a concrete example: adding a NOT NULL column to the users table. The naive approach: add the column with its NOT NULL constraint already in place and ship the code that populates it, all in one deploy. During the rollout, old pods try to insert a row and fail because the NOT NULL constraint is already in place but the old code doesn't populate the column.

The safe approach:

1. Expand: Add the column as nullable (no NOT NULL constraint). Deploy this migration while old pods are still running. The old code ignores the column.
2. Backfill: Populate existing rows with a default value. This can be done as a background job or inline migration.
3. Deploy new code: The new code populates the column on inserts. Old pods still run and don't set the column; because it is nullable, their inserts still succeed.
4. Contract: Once all pods are on the new code, re-run the backfill to catch any rows old pods wrote in the meantime, then run a second migration to add the NOT NULL constraint. This is safe because every pod now writes the column.

Each step is individually reversible. If something goes wrong in step 3, you can roll back the code without having to revert the schema.

expand-contract-migration.sqlSQL

-- ========== Phase 1: Expand ==========
-- Add the column as nullable. This is safe to run while old code is live.
ALTER TABLE users ADD COLUMN timezone TEXT;

-- ========== Phase 2: Backfill ==========
-- Populate existing rows with a default value.
UPDATE users SET timezone = 'UTC' WHERE timezone IS NULL;

-- ========== Phase 3: Deploy new code ==========
-- Also done in CI/CD: deploy new app version that always sets timezone
-- (This is done outside the database migration)

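-- ========== Phase 4: Contract ==========
-- After confirming all pods are on the new version, backfill any rows
-- that old pods inserted during the rollout, then add the constraint.
UPDATE users SET timezone = 'UTC' WHERE timezone IS NULL;
ALTER TABLE users ALTER COLUMN timezone SET NOT NULL;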

Output

-- Running each phase separately ensures zero downtime.

-- If Phase 1 fails, rollback is straightforward: ALTER TABLE users DROP COLUMN timezone;
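The phases above can be exercised end to end against an in-memory SQLite database. This is a sketch under stated assumptions: the table layout and helper names are hypothetical, and because SQLite has no ALTER COLUMN ... SET NOT NULL, the contract step is represented by the pre-check you would run before adding the constraint on a real database.

```python
import sqlite3
from typing import Tuple


def old_insert(db: sqlite3.Connection, name: str) -> None:
    """Old code: does not know the timezone column exists."""
    db.execute("INSERT INTO users (name) VALUES (?)", (name,))


def new_insert(db: sqlite3.Connection, name: str, timezone: str) -> None:
    """New code: always sets timezone."""
    db.execute("INSERT INTO users (name, timezone) VALUES (?, ?)", (name, timezone))


def backfill(db: sqlite3.Connection) -> int:
    """Fill missing timezones; returns the number of rows touched."""
    cursor = db.execute("UPDATE users SET timezone = 'UTC' WHERE timezone IS NULL")
    return cursor.rowcount


def safe_to_contract(db: sqlite3.Connection) -> bool:
    """The NOT NULL constraint can only be added once no NULLs remain."""
    (nulls,) = db.execute(
        "SELECT COUNT(*) FROM users WHERE timezone IS NULL").fetchone()
    return nulls == 0


def demo() -> Tuple[bool, bool]:
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    old_insert(db, "ada")
    db.execute("ALTER TABLE users ADD COLUMN timezone TEXT")   # Phase 1: expand
    backfill(db)                                               # Phase 2: backfill
    old_insert(db, "grace")        # Phase 3: an old pod is still serving writes
    new_insert(db, "linus", "Europe/Helsinki")
    before = safe_to_contract(db)  # old pod left a NULL behind
    backfill(db)                   # catch rows written during the rollout
    after = safe_to_contract(db)
    return before, after


if __name__ == "__main__":
    print(demo())
```

The first check comes back False because the old pod wrote a row after the initial backfill, which is why the contract step should always verify before it constrains.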

Mental Model

Expand-Contract Mental Model

Think of it as adding a new door before tearing down the old one. You never block an exit.

Every change to a shared resource must be two deployments: first add the new thing, then remove the old thing.
During the window, both old and new coexist — your code must handle both.
The order matters: expand the schema first, then deploy code that uses the new schema, then contract.
For column drops: the contract is the drop step. Never drop a column in the same deploy that stops using it.

📊 Production Insight

A team added a NOT NULL column in the same deploy as code changes.

Old pods crashed on insert with constraint violation — 5 minutes of write failures.

Fix they implemented: the expand-contract pattern across three deploys.

Rule: never combine schema changes with code changes in one rolling deploy.

🎯 Key Takeaway

Database migrations and code changes must be in separate deploys.

Expand-contract prevents the version skew from breaking the database.

Each phase must be individually reversible.

Use Case: When Rolling Saves Your Ass (and When It Won't)

Don't use rolling deployments because they're trendy. Use them when zero-downtime matters and your instances are stateless cattle, not pets. Rolling shines for web backends, API servers, and worker pools: any workload where you can spin up a fresh pod and drain traffic from an old one without a user noticing.

It fails spectacularly with stateful workloads like databases or legacy monoliths holding in-memory sessions. If your app can't handle two versions coexisting for minutes, rolling will corrupt data faster than a junior with sudo rm -rf. The rule: if you can't afford a version skew window, use blue-green. Rolling assumes your code is backward-compatible.

It's not magic; it's orchestrated serial replacement. Every new pod serves traffic alongside the old ones as soon as it's verified healthy. That means health checks must be ruthless and fast. A slow readiness probe turns a five-minute rollout into a fifteen-minute one.

Choose rolling when you need gradual traffic migration and can tolerate partial capacity during the swap. Choose something else when you can't.

rolling-deployment-use-case.yaml · YAML

# io.thecodeforge
# Fragments for comparison only: selector and pod template are omitted for brevity.
# Bad rolling candidate: stateful DB (don't do this)
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  replicas: 3
  updateStrategy:
    type: RollingUpdate  # Car crash waiting to happen: pods restart by ordinal, unaware of primary/replica roles
---
# Good rolling candidate: stateless web API
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-api
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2        # up to 12 pods may exist during the rollout
      maxUnavailable: 1  # at least 9 pods stay available

Output

Deployment configured for rolling update with 10 replicas

maxSurge=2 allows 2 extra pods during rollout

maxUnavailable=1 ensures at least 9 pods always serve traffic
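
The arithmetic behind those two numbers is easy to check by hand. The sketch below is a small, hypothetical helper (rollout_bounds is not part of any Kubernetes client) that computes the pod ceiling and the availability floor for a Deployment. It assumes the documented rounding rules for percentage values: maxSurge rounds up, maxUnavailable rounds down.

```python
def _resolve(value, replicas, round_up):
    """Turn an absolute int or a percentage string such as "25%" into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        scaled = replicas * int(value[:-1])
        # Integer math avoids float rounding surprises.
        return -(-scaled // 100) if round_up else scaled // 100
    return int(value)


def rollout_bounds(replicas, max_surge, max_unavailable):
    """Return (max_total_pods, min_available_pods) during a rolling update."""
    surge = _resolve(max_surge, replicas, round_up=True)
    unavailable = _resolve(max_unavailable, replicas, round_up=False)
    if surge == 0 and unavailable == 0:
        # With no surge and no allowed unavailability the rollout has no room
        # to replace anything; this sketch treats that as an error.
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    return replicas + surge, replicas - unavailable


print(rollout_bounds(10, 2, 1))          # (12, 9)
print(rollout_bounds(10, "25%", "25%"))  # (13, 8)
```

Running the numbers before a deploy shows whether the surge ceiling fits inside your resource quota, which is exactly the situation the trap below describes.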

⚠ Production Trap:

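Setting maxUnavailable to 0 sounds safe, but it means every replacement pod must come up as surge capacity before an old pod is removed. If resource quota or node capacity has no room for that extra pod, or the new pod never becomes Ready, the entire rollout blocks. Set it to 1 or 10% to give yourself room.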

🎯 Key Takeaway

Rolling deployments only work for stateless, backward-compatible services. If your app can't handle two versions in flight, pick a different strategy.

The Pod-Template-Hash Label: Kubernetes' Secret Weapon for Rollout Sanity

Ever wonder how Kubernetes knows which pods belong to which version during a rolling update? It's not magic. It's the pod-template-hash label. When you create or update a Deployment, the Deployment controller adds this label to each ReplicaSet it creates, to that ReplicaSet's selector, and to its pod template, so every pod carries it. The value is a deterministic hash of the pod template. Change the image tag? New hash. Change an env var? New hash. Same spec? Same hash. This lets the Deployment controller identify exactly which ReplicaSet owns which pods. During a rollout, it scales down the old ReplicaSet, scales up the new ReplicaSet under a different hash, and never mixes them up. You can see it with kubectl get pods --show-labels. If a pod of Deployment yourapp has pod-template-hash=abc123, you know it belongs to ReplicaSet yourapp-abc123. This is critical for rollbacks: kubectl rollout undo restores the previous revision's pod template, which hashes to the same value as before, so the previous ReplicaSet scales back up while the current one scales down. Don't rely on pod names alone; they're ephemeral. The hash is the ground truth. If you're debugging a rollout failure, first check which hashes are running: kubectl get replicasets -l app=yourapp. If you see two ReplicaSets with different hashes and neither is scaling, your rollout is stuck, most likely on a failing readiness check or a resource shortage.

inspect-rollout-hash.sh · BASH

# io.thecodeforge
# See which pods belong to which ReplicaSet version
kubectl get pods -l app=payment-api \
  -o custom-columns=NAME:.metadata.name,HASH:.metadata.labels.pod-template-hash

# Check rollout history with hashes
kubectl rollout history deployment/payment-api

# Inspect a specific revision's pod template hash
kubectl rollout history deployment/payment-api --revision=2 | grep pod-template-hash

Output

NAME HASH

payment-api-7d8f9c9b6-abc12 7d8f9c9b6

payment-api-7d8f9c9b6-def34 7d8f9c9b6

payment-api-5e6f7a8b9-ghi56 5e6f7a8b9

🔥Pro Tip:

When you kubectl rollout undo, Kubernetes scales down the current ReplicaSet (by hash) and scales up the previous ReplicaSet (different hash). It does NOT delete the old ReplicaSet—it keeps it around for the rollout history limit (default 10). You can manually delete old ReplicaSets to free resources, but don't if you think you'll need a rollback.

🎯 Key Takeaway

The pod-template-hash label is the unique identifier for every version of your deployment. Use it to trace which pods belong to which rollout, never rely on pod names.
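
Grouping pods by that label is simple to script. The sketch below is illustrative only: pods_by_template_hash is a hypothetical helper, and the sample dictionary mimics the shape of kubectl get pods -o json using the pod names from the output above.

```python
from collections import defaultdict


def pods_by_template_hash(pod_list):
    """Group pod names by their pod-template-hash label.

    pod_list is the parsed JSON of a pod list (a dict with an "items" array).
    Pods without the label are grouped under "<none>".
    """
    groups = defaultdict(list)
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        pod_hash = meta.get("labels", {}).get("pod-template-hash", "<none>")
        groups[pod_hash].append(meta.get("name", "<unnamed>"))
    return {pod_hash: sorted(names) for pod_hash, names in groups.items()}


# Sample data shaped like the kubectl JSON output (values are illustrative).
sample = {
    "items": [
        {"metadata": {"name": "payment-api-7d8f9c9b6-abc12",
                      "labels": {"app": "payment-api", "pod-template-hash": "7d8f9c9b6"}}},
        {"metadata": {"name": "payment-api-7d8f9c9b6-def34",
                      "labels": {"app": "payment-api", "pod-template-hash": "7d8f9c9b6"}}},
        {"metadata": {"name": "payment-api-5e6f7a8b9-ghi56",
                      "labels": {"app": "payment-api", "pod-template-hash": "5e6f7a8b9"}}},
    ]
}

grouped = pods_by_template_hash(sample)
for pod_hash, names in sorted(grouped.items()):
    print(pod_hash, len(names), names)
```

In practice the input would come from parsing the output of kubectl get pods -l app=payment-api -o json. Two hashes with pods in both groups means a rollout is in flight or stuck.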

Rollover: What Happens When You Deploy Again Mid-Rollout

You kicked off a rolling deployment. Five minutes in, your CTO screams 'stop the rollout, we need a hotfix now.' You push a new image tag. What happens? Kubernetes doesn't wait for the in-flight rollout to finish. It creates a new ReplicaSet with the latest spec and starts scaling it up while scaling down all the older ones. This is called a rollover. The Deployment controller reconciles toward the newest desired state immediately. If you had 10 replicas at image v1, started rolling to v2 (5 new, 5 old), then pushed v3, the controller abandons the v2 rollout and starts creating v3 pods. It scales down both the v1 and v2 ReplicaSets to make room for v3. This is efficient but dangerous: any migration that v2 already ran stays applied to the database even though v2 never fully shipped, and for a while three versions serve traffic side by side. The rollover respects your maxSurge and maxUnavailable settings, so it won't exceed pod limits, but it can thrash. If you push three versions in ten minutes, you'll have pods coming and going like a revolving door. Production mistake: pushing a new tag before the previous rollout completes, then wondering why some pods have different configs. The fix: have your CI/CD pipeline check rollout status before allowing the next deploy, or, when you must make several changes, run kubectl rollout pause first so they ship as a single rollout once you resume.

rollover-scenario.sh · BASH

# io.thecodeforge
# Scenario: mid-rollout, push v3 before v2 finishes
kubectl set image deployment/payment-api \
  payment-api=payment-api:v2
# 2 minutes later...
kubectl set image deployment/payment-api \
  payment-api=payment-api:v3  # Rollover!

# Check what ReplicaSets exist
kubectl get replicasets -l app=payment-api

# If you need to pause the rollout first:
kubectl rollout pause deployment/payment-api
# Then make changes (nothing rolls out while paused)
kubectl set image deployment/payment-api payment-api=payment-api:v4
kubectl rollout resume deployment/payment-api

Output

NAME DESIRED CURRENT READY AGE

payment-api-7d8f9c9b6 6 6 6 1h # v1, scaling down

payment-api-5e6f7a8b9 3 3 3 2m # v2, abandoned mid-rollout, scaling down

payment-api-9a1b2c3d4 3 3 1 30s # v3, rollover in progress

⚠ Production Trap:

Rollover during a database migration is catastrophic. If v2 ran a migration that v3 reverts, you'll get schema conflicts. Always check rollout status before pushing the next version, or use a staging gate in your pipeline.

🎯 Key Takeaway

A rollover cancels the current rollout and starts a new one. It's useful for hotfixes but dangerous mid-migration. Pause the deployment before making changes to avoid chaos.
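
A pipeline gate that refuses to deploy while a rollout is still in flight can be sketched in a few lines. The function below is a hypothetical helper that inspects the parsed JSON of a Deployment object and approximates the checks kubectl rollout status performs; the field names come from the Deployment spec and status, while the sample values are invented.

```python
def rollout_state(deployment):
    """Return (done, reason) for a Deployment given as parsed JSON (a dict)."""
    meta = deployment.get("metadata", {})
    spec = deployment.get("spec", {})
    status = deployment.get("status", {})
    desired = spec.get("replicas", 1)

    # The controller must have seen the latest spec before its status means anything.
    if status.get("observedGeneration", 0) < meta.get("generation", 0):
        return False, "controller has not observed the latest spec yet"

    updated = status.get("updatedReplicas", 0)
    if updated < desired:
        return False, f"{updated} of {desired} replicas updated"

    total = status.get("replicas", 0)
    if total > updated:
        return False, f"{total - updated} old replicas pending termination"

    available = status.get("availableReplicas", 0)
    if available < updated:
        return False, f"{available} of {updated} updated replicas available"

    return True, "rollout complete"


# Illustrative snapshots: one taken mid-rollover, one after the rollout finished.
mid_rollover = {
    "metadata": {"generation": 3},
    "spec": {"replicas": 10},
    "status": {"observedGeneration": 3, "replicas": 12,
               "updatedReplicas": 3, "availableReplicas": 10},
}
finished = {
    "metadata": {"generation": 3},
    "spec": {"replicas": 10},
    "status": {"observedGeneration": 3, "replicas": 10,
               "updatedReplicas": 10, "availableReplicas": 10},
}

print(rollout_state(mid_rollover))
print(rollout_state(finished))
```

A CI job could feed this the output of kubectl get deployment payment-api -o json and fail the next deploy step whenever done is False, which prevents accidental rollovers.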

● Production incident · POST-MORTEM · severity: high

The Readiness Probe That Wasn't: How a Missing Endpoint Caused 3-Minute Errors on Every Deploy

Symptom

A payments API serving ~2000 req/s. Every deploy triggered a spike of 503 errors for approximately 3 minutes, then recovery. The CI pipeline reported 'successful rollout'.

Assumption

The team assumed the database was slow to reconnect, so they added connection retry logic. That didn't fix it. Then they assumed the liveness probe was the culprit, but pods were never restarted.

Root cause

The readiness probe endpoint /health/ready returned 200 as soon as the HTTP server was up, before the database connection pool was fully initialized. Each new pod passed the readiness check and got traffic while its pool had only 1 connection ready, so requests beyond that single connection's capacity failed. The 3-minute error window matched the time it took to drain the old pods and for the new pods' pools to fill up.

Fix

Changed the readiness probe to ping the database: /health/ready endpoint calls SELECT 1 and only returns 200 when the connection pool reports at least 5 idle connections. Also added a startupProbe to allow 15 seconds for initialization before the readiness cycle begins.

Key lesson

A readiness probe that checks only HTTP serving is a false positive — it must validate the service is ready to handle real traffic.
Don't trust a CI pipeline that reports 'successful rollout' without checking the health dashboard for the first 5 minutes after deploy.
Connection pool warm-up is not instant — account for it in your probe design.
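
The readiness logic from the fix can be sketched as a pure function. This is a minimal illustration rather than the team's actual handler: readiness_check, ping_db, and the way the idle connection count is obtained are all assumptions, and only the threshold of 5 idle connections comes from the post-mortem.

```python
MIN_IDLE_CONNECTIONS = 5  # threshold described in the post-mortem


def readiness_check(ping_db, idle_connections, min_idle=MIN_IDLE_CONNECTIONS):
    """Return (http_status, message) for a /health/ready style endpoint.

    ping_db is any callable that runs a cheap query such as SELECT 1 and
    returns True on success. idle_connections is the pool's current idle count.
    """
    try:
        db_ok = bool(ping_db())
    except Exception:
        # Any failure to reach the database means the pod is not ready.
        db_ok = False

    if not db_ok:
        return 503, "database ping failed"
    if idle_connections < min_idle:
        return 503, f"pool warming up: {idle_connections}/{min_idle} idle connections"
    return 200, "ready"


def broken_ping():
    """Stand-in for a database that is unreachable."""
    raise ConnectionError("db down")


print(readiness_check(lambda: True, 1))   # pool not warm yet
print(readiness_check(lambda: True, 5))   # ready for traffic
print(readiness_check(broken_ping, 5))    # database unreachable
```

Keeping the decision in a plain function makes it easy to unit test and to wire into whichever web framework serves the probe endpoint.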

Production debug guide

Symptom → Action pattern for the three most common rolling deployment failures

Symptom · 01

Rollout is stuck: 'Waiting for rollout to finish: 0 out of X new replicas have been updated'

→

Fix

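Check if the cluster has enough capacity. Run kubectl describe deployment to see conditions and events, then describe the new ReplicaSet and any Pending pods. Look for 'FailedCreate' (often a quota limit) or 'Insufficient cpu' (a scheduling failure). If maxSurge is too high and the cluster is near capacity, reduce maxSurge or scale up nodes.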

Symptom · 02

New pods crash immediately after startup (CrashLoopBackOff)

→

Fix

Check logs of the new pod: kubectl logs deployment/name --previous, or target the crashing pod by name. Often it's a configuration issue (wrong env vars, missing secret) or a code startup error. The rollout stalls on its own because the crashing pods never become Ready, so the controller stops removing old pods once maxUnavailable is reached.

Symptom · 03

Deployment reports 'successfully rolled out' but old pods never terminate

→

Fix

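Look at the stuck pods first: kubectl get pods shows them as Terminating, and kubectl describe pod shows why. Common causes are a long terminationGracePeriodSeconds, a preStop hook or process that ignores SIGTERM, a finalizer that never clears, or an unreachable node. PodDisruptionBudgets (PDBs) are rarely the cause: they apply to evictions such as node drains and do not limit a Deployment's rolling update. Running kubectl get pdb is still worth it if the pods are being removed by a drain rather than by the rollout.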

★ Rolling Deployment Quick Debug Cheat Sheet

Five commands to diagnose a stalled or failed rolling deployment in Kubernetes. Run these in order.

Rollout in progress but not making progress

Immediate action

Check rollout status with timeout

Commands

kubectl rollout status deployment/payments-api --timeout=10s

kubectl describe deployment/payments-api | grep -A 10 'Conditions:'

Fix now

If the rollout is stuck on capacity, check resource limits and node capacity. If capacity is not the problem, revert with kubectl rollout undo deployment/payments-api.
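
The Conditions block printed by kubectl describe can also be read programmatically. The sketch below is a hypothetical helper that scans status.conditions in a Deployment's parsed JSON for the standard condition types (Progressing, ReplicaFailure, Available); the sample object and its message text are invented for illustration.

```python
def diagnose(deployment):
    """Return a list of human-readable findings from a Deployment's conditions."""
    findings = []
    for cond in deployment.get("status", {}).get("conditions", []):
        ctype = cond.get("type")
        status = cond.get("status")  # Kubernetes reports "True", "False" or "Unknown"
        reason = cond.get("reason")
        if ctype == "Progressing" and status == "False" and reason == "ProgressDeadlineExceeded":
            findings.append("stalled: progress deadline exceeded")
        elif ctype == "ReplicaFailure" and status == "True":
            findings.append(f"replica creation failing: {cond.get('message', reason)}")
        elif ctype == "Available" and status == "False":
            findings.append("below minimum availability")
    return findings or ["no failing conditions reported"]


# Illustrative Deployment status for a rollout that ran out of quota.
stuck = {
    "status": {
        "conditions": [
            {"type": "Available", "status": "True", "reason": "MinimumReplicasAvailable"},
            {"type": "ReplicaFailure", "status": "True", "reason": "FailedCreate",
             "message": "exceeded quota"},
            {"type": "Progressing", "status": "False", "reason": "ProgressDeadlineExceeded"},
        ]
    }
}

for finding in diagnose(stuck):
    print(finding)
```

A stalled Progressing condition with no ReplicaFailure usually points at readiness rather than capacity, which narrows down where to look next.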


Rolling Deployment vs Blue/Green Deployment

Aspect	Rolling Deployment	Blue/Green Deployment
Downtime during deploy	Zero — traffic shifts gradually	Zero — traffic switches atomically
Resource cost	Low — only maxSurge extra instances needed	High — requires a full duplicate environment
Rollback speed	Slow — must re-roll forward or wait for undo	Instant — flip the load balancer back to blue
Version skew risk	High — two versions serve traffic simultaneously	None — only one version is live at any time
Best for	Stateless services with backward-compatible APIs	Stateful apps or high-risk releases needing instant rollback
Database migrations	Requires expand-contract pattern across multiple deploys	Easier, since the green environment runs the migration before cutover
Complexity	Low — built into Kubernetes natively	High — requires managing two environments
Traffic control	Coarse — batch-based percentage splits	Precise — can do weighted routing with a proxy

⚙ Quick Reference

8 commands from this guide

File	Command / Code	Purpose
kubernetes-rolling-deployment.yaml	apiVersion: apps/v1	How a Rolling Deployment Actually Works Step by Step
.github/workflows/deploy-payments-api.yml	name: Build and Rolling-Deploy Payments API	Wiring a Rolling Deployment Into Your CI/CD Pipeline
user_service_backward_compatible_response.py	from flask import Flask, jsonify	The Version Skew Problem
rollback-commands.sh	kubectl rollout undo deployment/payments-api --namespace=production	Rolling Back a Deployment
expand-contract-migration.sql	ALTER TABLE users ADD COLUMN timezone TEXT;	Database Migrations with Rolling Deployments
rolling-deployment-use-case.yaml	apiVersion: apps/v1	Use Case
inspect-rollout-hash.sh	kubectl get pods -l app=payment-api \	The Pod-Template-Hash Label
rollover-scenario.sh	kubectl set image deployment/payment-api \	Rollover

Key takeaways

Rolling deployments replace instances in small batches: health checks gate every batch, so a bad deploy can only affect the fraction of traffic hitting the new instances, not everyone at once.

The expand-contract pattern is non-negotiable for rolling deployments: any API field rename, database column change, or message schema update must be deployed in at least two phases to survive the version skew window.

Never tag deployment images as 'latest': use the git commit SHA so every running version is traceable to a specific code change and kubectl rollout undo reverts to a known, specific state.

kubectl rollout status with a --timeout is what separates a robust pipeline from a false-green one: without that blocking check, your CI reports success the moment the deployment is triggered, not when it actually finishes.

Rollback uses the same rolling strategy: it's not instant. Plan incident response around that latency, not against it.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR

What is version skew in the context of rolling deployments, and how does...

Q02 · SENIOR

Walk me through what maxUnavailable and maxSurge actually control in a K...

Q03 · SENIOR

Your rolling deployment shows as 'successful' in the CI pipeline but use...

Q01 of 03 · SENIOR

What is version skew in the context of rolling deployments, and how does the expand-contract pattern solve it? Give a concrete database example.

ANSWER

Version skew is the period during a rolling deployment when both old and new versions of your service serve traffic. Any change to an API contract or database schema must be backward-compatible during this window. The expand-contract pattern solves this by splitting the change into phases: first add the new field or column (expand) while keeping the old one, then deploy code that uses the new field, then remove the old one (contract) in a later deploy. For a database migration, you'd add the column as nullable, deploy code that populates it on every write, backfill the existing rows, then add the NOT NULL constraint in a separate deploy.


Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Lessons pulled from things that broke in production.

July 19, 2026

last updated
