
Blue-Green Deployment — Database Migration Rollback Traps

A dropped column broke both environments mid-switch, corrupting orders with NULLs.
🔥 Advanced — solid DevOps foundation required
In this tutorial, you'll learn
  • Blue-green deployment enables instant rollback by switching traffic between two identical environments — the deploy and the switch are separate concerns.
  • Database schema changes require the expand-contract pattern; backward-compatibility is non-negotiable.
  • Traffic switching can be DNS (non-atomic), load balancer (atomic), or service mesh (gradual). Choose based on your tolerance for mixed version traffic.
Quick Answer
  • Blue-green deployment runs two identical environments, switching traffic atomically
  • Traffic switch at DNS, load balancer, or service mesh level — each with trade-offs
  • Database migrations need backward-compatible schema: old and new must coexist
  • Rollback is a routing change, not a re-deploy — but only if you keep the old environment warm
  • Biggest mistake: deploying schema changes that break the old version still serving traffic
🚨 START HERE

Blue-Green Deploy Quick Debug Commands

Commands to diagnose a broken blue-green deployment fast. Run in order.
Environment unresponsive after switch

Immediate Action: Verify both environments are running
Commands:
curl -I https://blue.example.com/health && curl -I https://green.example.com/health
docker compose -p blue ps && docker compose -p green ps
Fix Now: If one environment is down, switch DNS/LB back to the healthy one immediately. Then investigate.
Database errors in logs

Immediate Action: Check if migration ran in both environments
Commands:
kubectl logs -n blue deploy/api -c app --tail=100 | grep -i error
kubectl exec -n green deploy/api -- cat /var/app/db/version.txt
Fix Now: If migration didn't run in green, run it manually: `kubectl exec -n green deploy/migration -- ./migrate`
Traffic not reaching new environment

Immediate Action: Inspect load balancer target groups
Commands:
aws elbv2 describe-target-groups --names blue-green-tg
curl -H 'Host: app.com' http://<green-private-ip>/health
Fix Now: If target group has no healthy instances, check security groups and health check endpoints.
Mixed version responses seen by users

Immediate Action: Check TTL and CDN cache status
Commands:
dig example.com +short
curl -H 'Cache-Control: no-cache' https://example.com/api/version
Fix Now: For a DNS-based switch, lower the TTL to 60s well before the cutover. For an LB switch, shorten the connection-draining timeout on the old target group so long-lived connections move to the new environment sooner.
Production Incident

The Silent Database Migration That Corrupted Orders

A team performed a flawless blue-green switch, but the database migration was not backward compatible. The new environment corrupted data, and the old environment, still running for rollback, started failing.
Symptom: Five minutes after the traffic switch, support tickets spiked: users reported orders showing up with NULL totals. A rollback was initiated, but it did not help: the migration had already run before the switch, so the old environment was affected too.
Assumption: The team assumed database migrations could be run at the same time as the code deploy. They didn't consider that the old environment still needed to read the old schema for rollback.
Root cause: The migration script dropped a column that the old application code required. When traffic was rolled back, the old app crashed on startup because the column no longer existed. The migration was irreversible without a restore.
Fix: Rewrite the migration to be backward-compatible: add the new column as nullable, deploy both environments, switch traffic, then drop the old column in a separate release. Use a migration versioning tool that supports conditional execution based on environment.
Key Lesson
  • Database migrations must be backward-compatible: the old schema must continue to work until the old environment is decommissioned.
  • Run migrations as a separate step before the blue-green switch, not during the deploy.
  • Always test rollback by actually rolling back in a staging environment. Theory isn't enough.
Production Debug Guide

Symptom → Action: What to do when the switch goes wrong

New environment returns 502 Bad Gateway after switch → Check if health checks are passing: curl localhost:port/health. Verify backend processes are running. Look for missing env vars or configs that differ from blue.
Some users see old version, some see new version → DNS propagation delay: reduce TTL to 60s before switch. If using a load balancer, check that sticky sessions are disabled or properly configured. Flush CDN cache.
Rollback doesn't fix the issue, both environments broken → Database schema change is irreversible. You need point-in-time restore. Never assume rollback via blue-green alone will save you from schema mutations.
New environment passes health checks but fails under load → Warm up the new environment with a small traffic fraction first (canary). Use gradual switch: 10% → 50% → 100%. Monitor p99 latency and error rate during switch.

Every deployment is a calculated gamble. You're shipping code that has never run under real production traffic into a live system that users depend on right now. The traditional approach (stop the app, deploy, restart, pray) trades availability for simplicity. At small scale that's fine. At production scale, that maintenance window is a revenue event, a support ticket storm, and a trust problem all at once. Amazon is widely reported to have found that every 100 ms of added latency cost roughly 1% in sales. Downtime isn't measured in minutes; it's measured in dollars and reputation.

Blue-green deployment solves the deployment risk problem at its root. Instead of mutating your live environment in place, you build a complete, parallel environment — run every health check and smoke test against it while real traffic still hits the original — then switch. The switch is a routing change, not a deployment. That distinction is everything. Your rollback is equally trivial: re-route traffic back. No re-deploys, no frantic hotfixes at 2am, no partial states.

By the end of this article you'll understand the full internal mechanics of blue-green deployments including DNS vs load-balancer vs service-mesh switching strategies, the database migration problem that trips up most teams, how to wire this into a real CI/CD pipeline with Nginx and shell scripting, the subtle failure modes nobody talks about in blog posts, and exactly how to answer the curveball questions interviewers throw at senior candidates.

What is Blue-Green Deployment?

Blue-green deployment is a release pattern that keeps two production environments running simultaneously. Let's call them Blue (the current live) and Green (the new version). You deploy the new version to Green while all real traffic still hits Blue. Once Green passes all health checks and smoke tests, you flip the traffic router — DNS, load balancer, or service mesh — so that incoming requests go to Green. Blue stays live, idle but ready, serving as an instant rollback target.

The magic isn't in the deploy. It's in the switch. A routing change at the load balancer is fast, effectively atomic, and reversible. You don't redeploy anything during rollback; you just flip the switch back. That's also why blue-green depends on backward-compatible database migrations: if the old version can't run against the migrated schema, flipping traffic back gains you nothing and you've lost the rollback benefit.

/etc/nginx/conf.d/blue-green.conf · NGINX
# TheCodeForge - Nginx blue-green traffic switch
upstream blue {
    server 10.0.1.10:80;  # blue environment
    server 10.0.1.11:80;
}

upstream green {
    server 10.0.2.10:80;  # green environment
    server 10.0.2.11:80;
}

server {
    listen 80;
    location / {
        # Switch this line to flip traffic: blue -> green
        proxy_pass http://green;
        # For rollback, change to http://blue
    }
}
Mental Model
Mental Model: Two Taxi Stands
Think of blue-green as two taxi stands outside a terminal: one takes passengers now, the other waits empty.
  • Blue is the active taxi line — customers board immediately.
  • Green is the backup line, fully fueled and ready — but empty.
  • You move the sign ('TAXI HERE') to the other line when ready.
  • If the new line has problems, you move the sign back. No car re-parked.
  • The key: both lines stay identical except for the passenger load.
📊 Production Insight
The switch itself is the riskiest moment — not the deploy.
Measure p99 latency and error rate during the cutover window.
If you see a spike, abort and rollback immediately — don't wait for 'it might stabilise'.
🎯 Key Takeaway
Blue-green makes rollback a routing decision, not a deploy.
The cost: double infrastructure while both environments run.
Choose it when uptime SLA is non-negotiable and rollback speed matters more than hardware cost.
When to use blue-green vs other strategies
If: Zero-downtime required, quick rollback critical
Use: Blue-green deployment (keep the idle environment hot)
If: Gradual traffic shift needed (e.g., testing with real users)
Use: Canary releases (route a small % to the new version first)
If: Infrastructure cost is a constraint
Use: Rolling deployment (update instances sequentially without duplicate cost)

Traffic Switching Mechanisms: DNS, Load Balancer, and Service Mesh

The switch is the core of blue-green. Three common mechanisms exist, each with different properties.

DNS-based switching: You update a DNS record (e.g., change the A record from the blue LB IP to the green LB IP). Simple, no extra infrastructure. But the change is not atomic: resolvers and clients keep the old answer until their cached TTL expires, which takes minutes to hours, and some ignore the TTL altogether. For a clean cutover, lower the TTL to a small value (60s) well in advance, at least one full period of the old TTL before the switch (a day ahead is a common safe margin). During propagation, some users hit blue, some hit green. If your service can't handle dual versions for a few minutes, this isn't for you.

Load balancer switching: Your LB has target groups for blue and green. You swap the active target group. This is near-instant (seconds). The LB handles health checks and connection draining. Most production systems use this — AWS ALB, Nginx upstream, HAProxy. The catch: the LB is a single point of failure if not redundant.

Service mesh switching: Tools like Istio or Consul use traffic routing rules to shift percentages. You can do canary within blue-green — route 10% to green, observe, then shift 100%. This gives the best observability but adds complexity to the mesh control plane.
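
To make the DNS caveat concrete, here is a minimal sketch that estimates how long mixed traffic can last after a DNS cutover. It assumes you feed it the text printed by `dig +noall +answer <name>`, where the second field of each answer line is the TTL in seconds. The function names `max_ttl` and `cutover_safe_after` are hypothetical helpers, not part of any tool, and resolvers that ignore TTLs will stretch the real window further.

```shell
#!/usr/bin/env bash
# Sketch: estimate the worst-case mixed-traffic window for a DNS cutover.
# Input format assumed: lines as printed by `dig +noall +answer <name>`,
# e.g. "example.com.  300  IN  A  10.0.1.10" (field 2 is the TTL).

# Print the largest TTL (seconds) found in dig answer lines on stdin.
# Prints 0 when no answer lines are present.
max_ttl() {
  awk 'NF >= 5 && $2 ~ /^[0-9]+$/ { if ($2 > max) max = $2 } END { print max + 0 }'
}

# Print the epoch second after which TTL-respecting resolvers have all
# expired the old record: cutover time plus the TTL in force at cutover.
cutover_safe_after() {
  local cutover_epoch="$1" ttl="$2"
  echo $(( cutover_epoch + ttl ))
}
```

A typical use would be `dig +noall +answer example.com | max_ttl` shortly before the switch: if the number is still in the hundreds, the TTL reduction has not taken effect yet and the cutover should wait.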

switch-blue-green.sh · BASH
#!/bin/bash
# TheCodeForge - Switch traffic from blue to green via AWS ALB
# Usage: ./switch-blue-green.sh prod
set -euo pipefail

ENV="${1:?Usage: $0 <env>}"
# Example ARNs: replace with the values for your account
BLUE_TG="arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/blue-${ENV}/abc123"
GREEN_TG="arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/green-${ENV}/def456"
LISTENER_ARN="arn:aws:elasticloadbalancing:us-east-1:123456789012:listener/app/${ENV}-alb/xyz789/abc123"
RULE_ARN="arn:aws:elasticloadbalancing:us-east-1:123456789012:listener-rule/app/${ENV}-alb/xyz789/abc123/def456"

echo "Switching ${ENV} traffic from blue to green..."

# Look up the target group the priority-1 rule currently forwards to.
# Priority is returned as a string, so compare against '1', not the number 1.
CURRENT_TG=$(aws elbv2 describe-rules \
  --listener-arn "${LISTENER_ARN}" \
  --query "Rules[?Priority=='1'].Actions[0].ForwardConfig.TargetGroups[0].TargetGroupArn" \
  --output text)

echo "Current target group: ${CURRENT_TG}"

# Point the priority-1 rule at green (a default rule would need modify-listener instead)
aws elbv2 modify-rule \
  --rule-arn "${RULE_ARN}" \
  --actions "Type=forward,ForwardConfig={TargetGroups=[{TargetGroupArn=${GREEN_TG}}]}" \
  > /dev/null

echo "Switch complete. Traffic now goes to green."
echo "To roll back, point the same rule at ${BLUE_TG}."
▶ Output
Switching prod traffic from blue to green...
Current target group: arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/blue-prod/abc123
Switch complete. Traffic now goes to green.
To roll back, point the same rule at arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/blue-prod/abc123.
⚠ Warning: DNS Propagation Delays
DNS-based switching is not atomic. If you have a 5-minute TTL and switch at 12:00, some users will keep hitting the old version until at least 12:05, and longer if their resolver ignores the TTL. For zero-downtime that means you must support both versions simultaneously. Prefer LB or service mesh switches for true atomicity.
📊 Production Insight
I see teams rely on DNS switches for simplicity and get caught by propagation.
You can't force a client to refresh its DNS cache. Use health-check draining on the old target group to avoid serving broken connections.
The most reliable approach: use a load balancer with a 'switch' API call.
🎯 Key Takeaway
Atomic switching requires a layer 4/7 load balancer or service mesh.
DNS is not atomic — it's eventual.
Test the switch mechanism, not just the deploy, in a staging environment.
Choosing the right switching mechanism
If: You need a near-instant switch (seconds)
Use: Load balancer target group swap or service mesh traffic policy
If: You have low TTL control and can tolerate minutes of mixed traffic
Use: DNS-based switch (simplest, but plan for dual-version compatibility)
If: You want gradual traffic shift with fine-grained observability
Use: Service mesh traffic splitting (Istio, Consul, or eBPF-based routing)

The Database Migration Problem

Blue-green deployment is straightforward when your release only changes application code. But when you need database schema changes — adding a column, renaming a table, changing a constraint — you face a dilemma.

The problem: both environments (blue and green) access the same database. When you deploy green with new code expecting a new column, but the database hasn't been migrated yet, green crashes. If you migrate the database before the switch, blue (still live) breaks because its code can't handle the new schema.

The solution: expand-contract pattern. Every schema change must be backward-compatible. That means:
  • Add new columns as nullable or with default values.
  • Never rename or drop columns in the same release that changes code.
  • Use three-phase deployment:
    1) Deploy a migration that adds the new schema (nullable, no code changes).
    2) Deploy new code that uses both old and new schema.
    3) After all traffic is on new code, deploy a cleanup migration to remove old columns.

Tools like Flyway or Liquibase help version migrations and enforce order. But the real discipline is the team agreeing on backward-compatibility as a non-negotiable rule.
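
One way to back that team rule with automation is a pre-merge check that rejects destructive DDL unless the migration is explicitly marked as a contract step. The following is a minimal sketch, not a feature of Flyway or Liquibase: the phase labels `expand` and `contract` and the function names are assumptions made for illustration, and the pattern match is deliberately crude (it will also flag the words inside SQL comments).

```shell
#!/usr/bin/env bash
# Sketch: block destructive schema changes outside the contract phase.

# Exit 0 if the SQL on stdin contains DROP COLUMN, DROP TABLE, or a RENAME.
is_destructive() {
  grep -Eiq '(drop[[:space:]]+(column|table)|rename[[:space:]]+(column|to))'
}

# check_migration <phase> <file>
# phase is "expand" or "contract" (hypothetical labels for this sketch).
check_migration() {
  local phase="$1" file="$2"
  if [ "$phase" != "contract" ] && is_destructive < "$file"; then
    echo "REJECT: $file contains destructive DDL but phase is '$phase'" >&2
    return 1
  fi
  echo "OK: $file allowed in phase '$phase'"
}
```

Running `check_migration expand V12__add_new_status.sql` in CI (the file name is illustrative) would let additive migrations through and force anyone dropping or renaming a column to declare it as a contract step, which is a natural point for a reviewer to ask whether the old environment is really gone.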

schema-migration-expand-contract.sql · SQL
-- TheCodeForge - Expand-Contract Migration Example

-- Phase 1: Expand (add new column, backward-compatible)
ALTER TABLE orders ADD COLUMN new_status VARCHAR(20) NULL;
-- Old code sees 'status' only, new code sees both.

-- Phase 2: Deploy new code that writes to both 'status' and 'new_status',
-- then backfill rows that were written before dual-writes started.
UPDATE orders SET new_status = status WHERE new_status IS NULL;
-- After switch, both environments can read/write.

-- Phase 3: Contract (after green is live and blue is decommissioned)
ALTER TABLE orders DROP COLUMN status;
-- Note: this rename is itself a breaking change for code that reads
-- 'new_status'. Run it only once the code no longer depends on that name.
ALTER TABLE orders RENAME COLUMN new_status TO status;
🔥Key Insight
The expand-contract pattern multiplies the number of deployment steps: each schema change takes three releases instead of one. This is the price of zero-downtime database changes. Skipping phases leads to production incidents.
📊 Production Insight
You cannot atomically switch a database schema.
If you run a migration during the blue-green switch, you risk corrupting data if the new code has a bug.
Always decouple schema changes from code deployments — use feature flags or expand-contract.
🎯 Key Takeaway
Every schema change must be backward-compatible.
Expand-contract is the only safe pattern for blue-green database migrations.
If you can't make a backward-compatible change, don't use blue-green for that release — use a different strategy like feature flags.
Decision: Should you use blue-green with this schema change?
If: Schema change adds a new table (no existing schema modified)
Use: Safe. Both environments can coexist with the new table.
If: Schema change adds a nullable column with default
Use: Safe. Old code ignores the new column.
If: Schema change renames or drops a column
Use: Not safe. You must use expand-contract or avoid blue-green for this release.

CI/CD Pipeline for Blue-Green with Nginx and Shell Scripts

A production-grade blue-green pipeline needs automation. Here's a practical example using a CI/CD tool, Nginx as a soft switch (via upstream config reload), and idempotent shell scripts.

The pipeline:
1. Build and test your application.
2. Deploy to the idle environment (say, green). The deployment script checks which environment is live by querying a file or health check endpoint.
3. Run smoke tests against the new environment directly (internal load balancer, not public).
4. If tests pass, run the Nginx config reload script to switch traffic from blue to green.
5. Monitor for 10 minutes. If errors exceed threshold, run rollback script (reload with blue upstream).
6. If stable, decommission the old environment (optional: keep warm for rollback).
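
Steps 3 and 5 both come down to "run a check, and keep trying for a while before giving up". A minimal sketch of a generic retry helper is shown here; the name `retry` and its argument order are choices made for this example, not an existing utility.

```shell
#!/usr/bin/env bash
# Sketch: run a command up to N times, pausing between attempts.

# retry <attempts> <delay_seconds> <command> [args...]
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
retry() {
  local attempts="$1" delay="$2"
  shift 2
  local i
  for (( i = 1; i <= attempts; i++ )); do
    if "$@"; then
      return 0
    fi
    # Do not sleep after the final failed attempt.
    if (( i < attempts )); then
      sleep "$delay"
    fi
  done
  return 1
}
```

For a smoke test this could be called as `retry 10 3 curl --fail --silent --output /dev/null http://green.internal.example.com/health`, which gives a slow-starting environment roughly 30 seconds to come up instead of betting on a single fixed sleep.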

The key script: a switch script that modifies /etc/nginx/conf.d/blue-green.conf and reloads Nginx gracefully (nginx -s reload). The rollback script does the same but reverts.

Idempotency is crucial: running the switch script twice should not cause errors. Track the current active environment in a simple file: /var/run/active-env.txt. The script reads this file before switching.
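
The state-file idea can be sketched as three small functions. This assumes the file holds a single word, `blue` or `green`, and treats a missing or unreadable file as "blue is active"; that default and the function names are assumptions for this example.

```shell
#!/usr/bin/env bash
# Sketch: track the active environment in a state file, idempotently.
STATE_FILE="${STATE_FILE:-/var/run/active-env.txt}"

# Print the active environment; fall back to "blue" if no valid state exists.
active_env() {
  local env=""
  [ -r "$STATE_FILE" ] && env="$(tr -d '[:space:]' < "$STATE_FILE")"
  case "$env" in
    blue|green) echo "$env" ;;
    *)          echo "blue" ;;
  esac
}

# Print the idle environment, i.e. the one to deploy to next.
idle_env() {
  if [ "$(active_env)" = "blue" ]; then echo "green"; else echo "blue"; fi
}

# mark_active <env>: record the new active environment.
# A no-op when <env> is already active, so re-running a switch is harmless.
mark_active() {
  local target="$1"
  [ "$(active_env)" = "$target" ] && return 0
  # Write to a temp file and rename so readers never see a half-written file.
  printf '%s\n' "$target" > "${STATE_FILE}.tmp" && mv "${STATE_FILE}.tmp" "$STATE_FILE"
}
```

Validating the file content instead of grepping for a substring means a corrupted or empty state file falls back to a known default rather than silently picking the wrong target.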

deploy-blue-green.sh · BASH
#!/bin/bash
# TheCodeForge - Automated blue-green deploy
set -euo pipefail

ENV="${1:-staging}"
# Exported on the assumption that the compose file references ${APP_VERSION}
export APP_VERSION="${2:-latest}"
STATE_FILE=/var/run/active-env.txt
NGINX_CONF=/etc/nginx/conf.d/blue-green.conf

# Determine current active environment (blue is assumed if no state file exists yet)
if grep -qs 'green' "${STATE_FILE}"; then
  TARGET="blue"
  CURRENT="green"
else
  TARGET="green"
  CURRENT="blue"
fi

echo "Deploying ${APP_VERSION} to ${TARGET} (${ENV})..."

# Deploy to target environment (docker compose or kubernetes)
docker compose -f "docker-compose.${ENV}.yml" -p "${TARGET}" up -d --pull always

echo "Waiting for health check..."
# Retry for about 30s rather than relying on a single fixed sleep
curl --fail --silent --show-error --retry 10 --retry-delay 3 --retry-connrefused \
  "http://${TARGET}.internal.example.com/health"

echo "Switching traffic from ${CURRENT} to ${TARGET}..."
sed -i "s|proxy_pass http://${CURRENT};|proxy_pass http://${TARGET};|" "${NGINX_CONF}"
nginx -t   # with set -e, an invalid config aborts here, before the reload
nginx -s reload

echo "${TARGET}" > "${STATE_FILE}"
echo "Deploy successful. Active environment: ${TARGET}"
▶ Output
Deploying v1.2.3 to green (staging)...
Pulling images...
Creating green_web_1 ... done
Waiting for health check...
OK
Switching traffic from blue to green...
Reloading nginx... done
Deploy successful. Active environment: green
💡Prod Tip
Always run the rollback script in your staging pipeline at least once per release. It's the only way to confirm the rollback path actually works. I've seen teams 'forget' to test rollback until production forces them.
📊 Production Insight
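Nginx reload is graceful: it doesn't drop connections in flight.
But if the new config has a syntax error, the reload is rejected and Nginx keeps running the old config, so the switch never takes effect even though your script may have recorded it as done.
Always validate config with 'nginx -t' before reloading.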
Test the switch script in CI: use a separate test Nginx instance.
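
A minimal sketch of a validate-then-reload switch that also restores the previous config when validation fails. The `NGINX` variable is an assumption added so the function can be pointed at a stub in CI; `nginx -t` and `nginx -s reload` are the standard commands, while `switch_upstream` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Sketch: edit the proxy_pass target, validate, and roll the file back on failure.
NGINX="${NGINX:-nginx}"   # override with a stub command or function in tests

# switch_upstream <conf_file> <from_upstream> <to_upstream>
switch_upstream() {
  local conf="$1" from="$2" to="$3"
  # Keep the previous version so a bad edit can be undone.
  cp "$conf" "${conf}.bak"
  sed "s|proxy_pass http://${from};|proxy_pass http://${to};|" "${conf}.bak" > "$conf"
  # nginx -t checks the whole configuration, including this file if it is included.
  if ! "$NGINX" -t >/dev/null 2>&1; then
    mv "${conf}.bak" "$conf"
    echo "config invalid after edit; restored previous version" >&2
    return 1
  fi
  "$NGINX" -s reload
}
```

Because the function returns non-zero when validation fails, a calling script can skip updating its state file, which keeps the recorded active environment in line with what Nginx is actually serving.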
🎯 Key Takeaway
Automate the switch but never skip the rollback test.
Use a state file to track active environment for idempotency.
Validate Nginx config before reload: a syntax error blocks the switch now and makes the next full restart fail.
Pipeline checks before automatic switch
If: Smoke tests fail on new environment
Use: Abort automatic switch. Notify team to investigate.
If: Health check passes but error rate above 1% after switch
Use: Trigger automatic rollback. Keep new environment for debugging.
If: All checks pass, no errors after 10 minutes
Use: Consider decommissioning old environment to save costs.

Failure Modes and Rollback Realities

Blue-green deployment promises instant rollback, but there are subtle failure modes that break that promise.

Failure 1: Environment mismatch. The green environment was deployed with a newer config (e.g., different database host, different API keys) that doesn't match the infrastructure of blue. When you rollback, blue may not work because its dependencies changed.

Failure 2: Data divergence. During the time green was live, users modified the database. The blue environment, when switched back, sees data that its code cannot handle (e.g., new column populated). Rollback becomes data repair, not instant.

Failure 3: Partial switch. If you use feature flags or gradual traffic routing, only part of the traffic switched. Rolling back means identifying exactly which users saw the new version and ensuring their session state is consistent.

Failure 4: Warmup landmines. Green passes health checks but fails because JIT compilation or connection pools weren't warm. Real load exposes these. Canary within blue-green (send 5% traffic to green first) catches this.

Mitigation: Use a gradual blue-green approach — switch 10% traffic, observe, then 100%. This gives you a real feedback loop before committing all users.

gradual-switch.sh · BASH
#!/bin/bash
# TheCodeForge - Gradual blue-green traffic shift with HAProxy
# Assumes haproxy.cfg has one server line per environment, for example:
#   server blue  10.0.1.10:80 weight 100
#   server green 10.0.2.10:80 weight 0
set -euo pipefail

CFG=/etc/haproxy/haproxy.cfg
PID=/var/run/haproxy.pid

# set_weights <green_percent>: green gets the given share, blue the remainder
set_weights() {
  sed -i -E "s/^([[:space:]]*server green .* weight )[0-9]+/\1$1/; s/^([[:space:]]*server blue .* weight )[0-9]+/\1$((100 - $1))/" "$CFG"
  haproxy -c -f "$CFG" > /dev/null                  # validate before reloading
  haproxy -f "$CFG" -p "$PID" -sf $(cat "$PID")     # graceful reload
}

for pct in 10 25 50 75 100; do
  echo "Setting green weight to ${pct}%..."
  set_weights "$pct"
  sleep 60  # Observe metrics
  # Placeholder check: replace with a query against your metrics system
  if grep -q "ERROR_RATE_THRESHOLD" /var/log/haproxy/errors.log; then
    echo "Error rate exceeded! Rolling back..."
    set_weights 0   # back to 100% blue, and reload so the revert takes effect
    exit 1
  fi
done
echo "Gradual switch complete. 100% green."
▶ Output
Setting green weight to 10%...
Setting green weight to 25%...
Setting green weight to 50%...
Setting green weight to 75%...
Setting green weight to 100%...
Gradual switch complete. 100% green.
⚠ Warning: Rollback is not always instant
If the new environment modifies data (files, database), rollback means reverting those changes. Blue-green works best for stateless services. For stateful, combine with careful migration planning and data snapshots.
📊 Production Insight
The assumption that rollback is just a router flip only holds if both environments share the same data and state.
If your new code writes to the database in a new format, you've created a state divergence.
Measure the time to actually recover: from decision to full rollback, including any data repair.

Observability and Monitoring During Blue-Green

You can't trust a switch you can't observe. During a blue-green deployment, you need real-time visibility into both environments.

Key metrics to monitor during and after switch
  • Request latency (p50, p90, p99) — compare blue vs green after switch.
  • Error rate (4xx, 5xx) — a spike indicates the new code has issues.
  • Resource utilisation (CPU, memory, connections) — green might need more resources under real load.
  • Business metrics — orders per minute, signup completions. These catch logic errors that don't cause HTTP errors.
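
A business-metric check can be as simple as comparing the current rate with a pre-deploy baseline. The sketch below assumes you already have both numbers (for example, orders per minute before and after the switch); where they come from is up to your metrics system, and `metric_dropped` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: detect a relative drop in a business metric after a switch.

# metric_dropped <baseline> <current> <max_drop_percent>
# Exit 0 if current is more than max_drop_percent below baseline, else 1.
# A baseline of zero or less is treated as "no data" and never reports a drop.
metric_dropped() {
  awk -v b="$1" -v c="$2" -v d="$3" 'BEGIN {
    if (b <= 0) exit 1
    # (b - c) / b > d / 100, rearranged to avoid division
    exit !((b - c) * 100 > d * b)
  }'
}
```

In a pipeline this might read `if metric_dropped "$baseline_orders" "$current_orders" 20; then ./rollback.sh; fi`, where the variable names and the rollback script are placeholders. It catches the case where every request returns HTTP 200 and orders still stop arriving.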

For tracing: use distributed tracing (Jaeger, Zipkin) to compare request paths. A new version might call different downstream services or have different timeouts.

Alerting: set up a 'deployment window' alert that triggers if error rate exceeds 0.5% for 1 minute after switch. This alert should be separate from your regular alerts — allow a brief grace period to avoid false positives.

Observability also means logging the switch itself. Log every switch attempt, success/failure, and rollback. This helps post-mortems.
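
A minimal sketch of a switch audit log, assuming a plain append-only text file is acceptable. The log path, the field names, and the function name `log_switch` are assumptions for illustration; a real setup might ship these lines to a central logging system instead.

```shell
#!/usr/bin/env bash
# Sketch: append one structured line per switch or rollback attempt.
SWITCH_LOG="${SWITCH_LOG:-/var/log/blue-green-switch.log}"

# log_switch <action> <from_env> <to_env> <result>
# Example: log_switch switch blue green success
log_switch() {
  local action="$1" from="$2" to="$3" result="$4"
  # UTC timestamp first so the file sorts and greps cleanly.
  printf '%s action=%s from=%s to=%s result=%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$action" "$from" "$to" "$result" >> "$SWITCH_LOG"
}
```

Calling it at the start, on success, and on failure of every switch gives a post-mortem an exact timeline to line up against latency and error graphs.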

blue-green-monitoring.yml · PROMETHEUS
# TheCodeForge - Prometheus recording and alerting rules for blue-green
# Use these to alert on deployment anomalies

groups:
  - name: blue-green
    rules:
      # Aggregate per job so the alert below can report {{ $labels.job }}
      - record: job:error_rate:ratio1m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[1m]))
          /
          sum by (job) (rate(http_requests_total[1m]))
      - alert: BlueGreenErrorBurst
        expr: job:error_rate:ratio1m > 0.005
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Blue-green switch may have caused error burst"
          description: "Error rate {{ $value | humanizePercentage }} for job {{ $labels.job }}"
🔥Observability Gap
Most teams monitor only HTTP status codes. Business metrics catch subtle regressions: e.g., a new search algorithm returns fewer results, but HTTP 200, so no alert fires. Add at least one business metric per deployment.
📊 Production Insight
Don't rely solely on synthetic health checks — they miss real user behavior.
Use request tracing to compare the path of a single request through blue vs green.
The first sign of trouble is often a business metric drop, not a 5xx spike.
🗂 Blue-Green vs Other Zero-Downtime Strategies
Compare key characteristics to choose the right pattern
Strategy | Rollback Time | Infrastructure Cost | Database Migration Support | Traffic Control Granularity
Blue-Green | Seconds (routing change) | 2x environment cost | Expand-contract required | Binary (100% to one environment)
Canary Release | Minutes (gradual rollback) | 1x + small fraction | Same as blue-green | Gradual (1% to 100%)
Rolling Deployment | No instant rollback (re-deploy the previous version) | 1x (sequential update) | Limited by per-instance update order | Per node, not per user
Feature Flag | Seconds (flag toggle) | 1x (flag in code) | Easy (feature flags shield old code) | Per user or per request

🎯 Key Takeaways

  • Blue-green deployment enables instant rollback by switching traffic between two identical environments — the deploy and the switch are separate concerns.
  • Database schema changes require the expand-contract pattern; backward-compatibility is non-negotiable.
  • Traffic switching can be DNS (non-atomic), load balancer (atomic), or service mesh (gradual). Choose based on your tolerance for mixed version traffic.
  • Always test rollback in staging — the rollback script is as important as the deploy script.
  • Observability must include business metrics, not just HTTP status codes, to catch logical regressions.

⚠ Common Mistakes to Avoid

    Deploying schema changes without expand-contract
    Symptom

    Blue environment crashes after switch because new column is missing or old code can't handle new schema.

    Fix

    Always deploy schema changes as backward-compatible: add columns nullable, never rename/drop in the same release as code changes. Use three-phase deployment.

    Assuming DNS switch is atomic
    Symptom

    Some users hit old version, some hit new version for several minutes. If old version can't serve requests meant for new version, errors occur.

    Fix

    Use load balancer or service mesh for atomic switch. If using DNS, set TTL to 60s at least 24h before and plan for dual-version compatibility.

    Skipping rollback testing
    Symptom

    When you need to rollback, the script fails or the old environment has been decommissioned or misconfigured.

    Fix

    Include rollback tests in your CI/CD pipeline. Keep the old environment warm and assert that rollback completes successfully before decommissioning.

    Only checking HTTP health, not business metrics
    Symptom

    New environment returns 200 but business logic is broken (e.g., orders not saving). Discovered hours later via user complaints.

    Fix

    Monitor business metrics: orders/minute, signups, cart sizes. Alert on drops relative to pre-deploy baseline.

Interview Questions on This Topic

  • Q: Explain the expand-contract pattern for database migrations in a blue-green deployment. Why is it necessary? (Senior)
    Expand-contract ensures that schema changes are backward-compatible. In phase 1 (expand), you add new columns as nullable or with defaults — old code ignores them. Phase 2 deploys code that uses both old and new schema; both environments can read/write. Phase 3 (contract) removes old columns after all traffic is on new code. Without this pattern, the live environment (blue) breaks during migration, or the new environment (green) breaks because the migration hasn't run yet. It's necessary because you cannot atomically change a schema shared by two code versions.
  • Q: You're using blue-green with a DNS-based traffic switch. TTL is set to 300 seconds. You need to cut over at 2:00 PM. What problems do you anticipate and how do you mitigate them? (Mid-level)
    Problem: DNS propagation takes up to 5 minutes, and longer for resolvers that ignore the TTL. During that window, some users resolve to blue, some to green. If the new version is incompatible with old data, errors occur. Mitigations: (1) Reduce TTL to 60s at least 24 hours before cutover to minimise the propagation window. (2) Ensure both versions can handle the same database schema and API contracts for at least 10 minutes. (3) If you need a near-instant switch, keep DNS pointed at a load balancer and swap the target group there; a DNS switch is always gradual.
  • Q: What's the biggest risk of blue-green deployment for stateful services? How do you mitigate it? (Senior)
    State divergence. The new environment may write data in a new format (e.g., different serialisation, new columns). When you rollback to blue, it can't read that data. Mitigations: (1) Use backward-compatible schema changes. (2) Feature flags to keep old code paths active. (3) If data is written during green's uptime, you must either forward-port the rollback (play back changes) or accept data loss. Best practice: make services as stateless as possible, pushing state to a separate layer (database, queue) that handles migrations independently.

Frequently Asked Questions

What is Blue-Green Deployment in simple terms?

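Blue-green deployment means running two identical production environments, only one of which receives live traffic. You deploy the new version to the idle one, test it there, then switch traffic over. If something goes wrong, you switch traffic back instead of redeploying.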

Can blue-green deployment work with microservices?

Yes, but each microservice should have its own blue-green pair. You can't have one traffic switch for all services because they have independent release cycles. Use service mesh to route per-service traffic fractionally.

What happens to in-flight requests during the switch?

If using a load balancer with connection draining, in-flight requests finish on the old environment before it's taken out of rotation. DNS switches do not handle this — new requests go to new environment, but old connections may still be served by old environment. Graceful shutdown (SIGTERM) is recommended.

Is blue-green expensive?

Yes, because you need double the infrastructure (e.g., two full environments). However, you can reduce cost by scaling down the idle environment to a minimum number of instances, only scaling up when needed for rollback readiness. Cloud auto-scaling helps.

How do I handle database rollback if the new schema change was not backward-compatible?

You need a point-in-time restore from a backup taken before the migration. This is not instant. The lesson: never make a non-backward-compatible change in a blue-green system. Use expand-contract and feature flags to avoid this situation entirely.

Naren Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.
