Senior 6 min · June 25, 2026

Design a Code Deployment System: Zero-Downtime Rollouts Without the 3AM Panic

Design a code deployment system for zero-downtime rollouts.

Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Written from production experience, not tutorials.

Last updated: June 25, 2026
Quick Answer

Design a deployment system by choosing a rollout strategy (blue-green, canary, or rolling), automating the pipeline with CI/CD, implementing health checks and automatic rollback, and monitoring error budgets. Start with blue-green for simplicity, then add canary releases for risk reduction.

Definition
What Is a Code Deployment System?

A code deployment system is the automated pipeline that moves code from version control to production servers, handling build, test, rollout, rollback, and traffic shifting with minimal human intervention.

Plain-English First

Think of it like replacing the engine on a flying plane. You don't just yank the old one out — you bring a second plane alongside, transfer passengers one by one, and if the new engine sputters, you switch back instantly. That's blue-green. Canary is like testing a new recipe on one table before serving the whole restaurant.

Every deployment is a controlled explosion. I've watched a single bad deploy take down a $2M/day e-commerce site because the team thought 'just push to prod' was a strategy. The problem isn't bad code — it's bad deployment design. Most tutorials teach you how to build a pipeline, but they don't tell you what happens when that pipeline fails at 3AM with a database migration that locks every table. This article gives you the battle-tested patterns for zero-downtime deployments, the exact health checks that prevent disasters, and the rollback mechanisms that save your weekend. By the end, you'll be able to design a deployment system that survives bad code, traffic spikes, and your own sleep-deprived mistakes.

Why Most Deployment Pipelines Are a House of Cards

Before we talk patterns, understand the failure modes. A deployment system isn't just a pipeline — it's a state machine that transitions your production system from version N to N+1. Every transition has a blast radius. The most common mistake? Treating deployment as a single step: build, push, restart. That's how you get a full outage when the new code has a bug that only manifests under real traffic. The hack people used before proper systems? SSH into every server, pull the new binary, and restart the service manually. That works for exactly one server. At scale, you need automation that handles partial failures, traffic draining, and health verification. Without it, you're one bad deploy away from a PagerDuty alert that wakes up the whole team.

NaiveDeployment.systemdesign
# io.thecodeforge: System Design tutorial

# DON'T do this: it's the house of cards
pipeline:
  - build:
      command: "mvn package"
  - deploy:
      command: "scp target/app.jar user@prod-server:/opt/app/ && ssh user@prod-server 'systemctl restart myapp'"
  - verify:
      command: "curl -f http://prod-server:8080/health"

# Problem: single server, no rollback, no traffic draining, no health check before traffic shift
Output
Build succeeds, deploy succeeds, verify passes — but if the new app has a memory leak, it crashes 5 minutes later. No automated rollback.
Never Do This:
Never deploy by SSH'ing into production and restarting services manually. You lose audit trail, rollback capability, and the ability to handle partial failures. Use a deployment controller (Kubernetes, Nomad, or a custom orchestrator) that manages the state machine for you.
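To make the "deployment is a state machine" idea concrete, here is a minimal sketch in Java. The state names and allowed transitions are my own illustrative assumptions, not taken from Kubernetes, Nomad, or any other orchestrator. The point is only that a controller should reject transitions that skip verification.

```java
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: states and transitions are illustrative assumptions,
// not the model of any real deployment controller.
final class DeploymentStateMachine {

    enum State { STABLE, DEPLOYING, VERIFYING, SHIFTING_TRAFFIC, ROLLING_BACK, FAILED }

    // Each state lists the only states it may move to next.
    private static final Map<State, Set<State>> ALLOWED = Map.of(
        State.STABLE, EnumSet.of(State.DEPLOYING),
        State.DEPLOYING, EnumSet.of(State.VERIFYING, State.ROLLING_BACK),
        State.VERIFYING, EnumSet.of(State.SHIFTING_TRAFFIC, State.ROLLING_BACK),
        State.SHIFTING_TRAFFIC, EnumSet.of(State.STABLE, State.ROLLING_BACK),
        State.ROLLING_BACK, EnumSet.of(State.STABLE, State.FAILED),
        State.FAILED, EnumSet.noneOf(State.class));

    private State current = State.STABLE;

    State current() {
        return current;
    }

    // Moves to the next state, or throws if the transition skips a step.
    void transition(State next) {
        if (!ALLOWED.get(current).contains(next)) {
            throw new IllegalStateException(current + " -> " + next + " is not allowed");
        }
        current = next;
    }

    public static void main(String[] args) {
        DeploymentStateMachine sm = new DeploymentStateMachine();
        sm.transition(State.DEPLOYING);
        sm.transition(State.VERIFYING);
        sm.transition(State.ROLLING_BACK); // verification failed
        sm.transition(State.STABLE);       // back on version N
        System.out.println("Final state: " + sm.current());
    }
}
```

The naive pipeline above effectively jumps from STABLE straight to serving traffic on the new version. With an explicit transition table, that jump is an error instead of an outage.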
[Figure: Zero-Downtime Deployment System Design. Overview of blue-green deployments, canary deployments, rolling updates, database migrations, rollback strategies, and health checks with feature flags. Warning shown in the figure: database migrations can break old code if they are not backward-compatible, so design migrations to be additive and reversible.]

Blue-Green Deployments: The Safety Net You Need

Blue-green deployment is the simplest zero-downtime pattern. You maintain two identical environments: blue (current production) and green (new version). You deploy to green, run health checks, then switch traffic from blue to green. If something goes wrong, you switch back. This is the pattern I use for critical services like payment processing. The key insight: the switch must be atomic from the user's perspective. DNS-based switching has propagation delays (minutes to hours). Load balancer switching (e.g., AWS ALB target group swap) is near-instant. The gotcha: you need double the infrastructure cost. For low-traffic services, that's fine. For high-traffic, consider canary deployments instead.

BlueGreenDeployment.systemdesign
# io.thecodeforge: System Design tutorial

# Blue-green deployment with AWS ALB
# Simplified CloudFormation-style sketch: logical IDs and required properties
# (VPC, subnets, listener ARN, rule priority) are omitted for brevity
resources:
  - type: AWS::AutoScaling::AutoScalingGroup
    properties:
      LaunchConfigurationName: !Ref GreenLaunchConfig
      MinSize: 2
      MaxSize: 10
      DesiredCapacity: 4
  - type: AWS::ElasticLoadBalancingV2::TargetGroup
    properties:
      Name: blue-tg
      Port: 8080
      HealthCheckPath: /health
  - type: AWS::ElasticLoadBalancingV2::TargetGroup
    properties:
      Name: green-tg
      Port: 8080
      HealthCheckPath: /health
  - type: AWS::ElasticLoadBalancingV2::ListenerRule
    properties:
      Actions:
        - Type: forward
          TargetGroupArn: !Ref BlueTG  # Switch to GreenTG during deploy
      Conditions:
        - Field: path-pattern
          Values: ["/*"]

# Deployment script (pseudo):
# 1. Update GreenTG with new instances
# 2. Wait for health checks to pass
# 3. Update listener rule to forward to GreenTG
# 4. Wait for traffic to drain from BlueTG
# 5. Terminate BlueTG instances (only once you no longer need to switch back)
Output
Traffic switches from blue to green in under 1 second. If health checks fail on green, the listener rule never updates — zero downtime.
Production Trap:
Health checks must test the full application stack, not just a ping endpoint. I've seen a team use /health that returned 200 even when the database connection was dead. Use a health check that queries the database and checks a critical external dependency. Otherwise, you'll switch traffic to a green environment that can't serve requests.
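The switch logic itself is small. Below is a hedged sketch in Java of the "health-check green, then flip atomically" step. `BlueGreenSwitch` and its methods are hypothetical names, and the `AtomicReference` stands in for whatever your load balancer uses to point at the live target group.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.BooleanSupplier;

// Hypothetical sketch: the AtomicReference stands in for a load balancer
// listener rule that points at the live environment.
final class BlueGreenSwitch {

    enum Env { BLUE, GREEN }

    private final AtomicReference<Env> live = new AtomicReference<>(Env.BLUE);

    Env live() {
        return live.get();
    }

    // Promotes the idle environment only after `requiredPasses` consecutive
    // successful health checks. Returns true if traffic was switched.
    boolean promoteIdle(BooleanSupplier idleHealthCheck, int requiredPasses) {
        for (int i = 0; i < requiredPasses; i++) {
            if (!idleHealthCheck.getAsBoolean()) {
                return false; // the live environment is never touched
            }
        }
        Env current = live.get();
        Env idle = (current == Env.BLUE) ? Env.GREEN : Env.BLUE;
        return live.compareAndSet(current, idle); // one atomic swap
    }

    // Rollback is the same pointer flip in the other direction.
    void rollbackTo(Env previous) {
        live.set(previous);
    }

    public static void main(String[] args) {
        BlueGreenSwitch lb = new BlueGreenSwitch();
        boolean switched = lb.promoteIdle(() -> true, 3);
        System.out.println("switched=" + switched + ", live=" + lb.live());
    }
}
```

In a real system the health check supplier would call the deep health endpoint described above, not a ping.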

Canary Deployments: Roll Out to 1% Before 100%

Canary deployments reduce risk by routing a small percentage of traffic to the new version before a full rollout. Start with 1% of users, monitor error rates and latency, then gradually increase to 5%, 25%, 50%, 100%. The magic is in the traffic splitting — you need a load balancer that supports weighted routing (e.g., AWS ALB with weighted target groups, or a service mesh like Istio). The gotcha: you must ensure that the canary instances can handle the traffic spike when you increase the weight. Auto-scaling based on CPU/memory is essential. I've seen a canary crash because the team increased weight from 1% to 10% too fast, and the new version's connection pool wasn't sized for the sudden load.

CanaryDeployment.systemdesign
# io.thecodeforge: System Design tutorial

# Canary deployment with Istio VirtualService (weight-based routing)
# Assumes a DestinationRule defines the "canary" and "stable" subsets.
# No header match here: a match block would restrict this route to requests
# carrying that header and leave all other traffic without a route.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: myapp
spec:
  hosts:
  - myapp.example.com
  http:
  - route:
    - destination:
        host: myapp
        subset: canary
      weight: 1
    - destination:
        host: myapp
        subset: stable
      weight: 99

# Deployment script:
# 1. Deploy canary version with subset: canary
# 2. Monitor error budget (e.g., <0.1% error rate for 5 minutes)
# 3. Increase canary weight to 5%, monitor again
# 4. Repeat until 100%
# 5. If error budget exceeded, set canary weight to 0% and rollback
Output
1% of traffic hits the canary. If error rate spikes, the canary weight is automatically set to 0% by the monitoring system.
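The ramp decision can be sketched as a pure function. This is a minimal Java example under my own assumptions: the step list (1, 5, 25, 50, 100) and the 0.1% threshold come from the text above, while `CanaryRamp` and `nextWeight` are hypothetical names, and the error and request counts are assumed to come from your metrics system for the last observation window.

```java
import java.util.List;

// Hypothetical sketch of the canary ramp decision. Counts are assumed to
// cover one observation window (for example, the last 5 minutes).
final class CanaryRamp {

    static final List<Integer> STEPS = List.of(1, 5, 25, 50, 100);

    // Returns the next canary weight in percent:
    // 0 means "roll back", the same weight means "hold".
    static int nextWeight(int currentWeight, long errors, long requests) {
        if (requests == 0) {
            return currentWeight; // no data yet: hold, do not promote blindly
        }
        // Budget is "< 0.1% errors". Integer math avoids floating-point edge cases.
        if (errors * 1000 >= requests) {
            return 0;
        }
        int idx = STEPS.indexOf(currentWeight);
        if (idx < 0) {
            throw new IllegalArgumentException("unknown weight: " + currentWeight);
        }
        return STEPS.get(Math.min(idx + 1, STEPS.size() - 1));
    }

    public static void main(String[] args) {
        System.out.println("healthy at 1%  -> " + nextWeight(1, 0, 10_000));
        System.out.println("failing at 25% -> " + nextWeight(25, 50, 10_000));
    }
}
```

Holding when there is no traffic data matters: a canary that received zero requests has proven nothing.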
[Figure: Canary Rollout Progression. Gradually shift traffic from 1% to 100%: route 1% of traffic to the new version, monitor error rates and latency, scale up to 5%, 25%, 50%, route 100% to the new version, then confirm stability and retire the old version. Note shown in the figure: traffic splitting requires a load balancer that supports weighted routing.]

Rolling Updates: The Kubernetes Default and Its Pitfalls

Rolling updates replace instances incrementally. This is the Kubernetes default strategy: it creates new pods, waits for them to become ready, and terminates old pods, within the limits set by maxSurge and maxUnavailable. The advantage: no extra infrastructure cost. The disadvantage: during the rollout, both old and new versions serve traffic simultaneously. If the new version has a bug that corrupts shared state (e.g., a database row format), the old version might also be affected. The biggest pitfall is those two settings. Set maxUnavailable: 0 to ensure zero downtime, but then you need enough spare capacity to handle the surge. I've seen a team run with maxSurge: 25% and maxUnavailable: 25% (the Kubernetes defaults). During a deploy, up to 25% of pods were unavailable, causing a capacity crunch under peak load.

RollingUpdateDeployment.systemdesign
# io.thecodeforge: System Design tutorial

# Kubernetes deployment with safe rolling update settings
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 10
  selector:              # Required in apps/v1; must match the template labels
    matchLabels:
      app: myapp
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # Only spin up 1 extra pod at a time
      maxUnavailable: 0  # Never terminate a pod until new one is ready
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: myapp
        image: myapp:1.2.3
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 15
          periodSeconds: 20

# Important: readiness probe must check that the app can serve traffic, not just that the process is alive.
Output
Kubernetes rolls out one pod at a time. Each new pod must pass the readiness probe before the old pod is terminated. Zero downtime if probes are correct.
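It helps to do the capacity arithmetic before you pick maxSurge and maxUnavailable. Here is a small Java sketch of that math. The class and method names are hypothetical. The rounding follows the Kubernetes Deployment behavior for percentages: maxSurge rounds up, maxUnavailable rounds down.

```java
// Hypothetical helper for reasoning about rolling update capacity.
// Percentages are resolved the way Kubernetes Deployments resolve them:
// maxSurge rounds up, maxUnavailable rounds down.
final class RollingUpdateCapacity {

    // Extra pods that may exist above the desired replica count.
    static int surgePods(int replicas, int maxSurgePercent) {
        return (replicas * maxSurgePercent + 99) / 100; // ceiling division
    }

    // Pods that may be unavailable at any point in the rollout.
    static int unavailablePods(int replicas, int maxUnavailablePercent) {
        return (replicas * maxUnavailablePercent) / 100; // floor division
    }

    // Worst-case number of pods still serving traffic.
    static int minAvailable(int replicas, int maxUnavailablePercent) {
        return replicas - unavailablePods(replicas, maxUnavailablePercent);
    }

    // Worst-case number of pods the cluster must be able to schedule.
    static int maxTotal(int replicas, int maxSurgePercent) {
        return replicas + surgePods(replicas, maxSurgePercent);
    }

    public static void main(String[] args) {
        int replicas = 10;
        System.out.println("25%/25%: at least " + minAvailable(replicas, 25)
            + " serving, at most " + maxTotal(replicas, 25) + " scheduled");
        System.out.println("maxUnavailable 0%: at least " + minAvailable(replicas, 0) + " serving");
    }
}
```

If your service needs all 10 replicas to survive peak load, a worst case of 8 serving pods is the capacity crunch described above.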
[Figure: Blue-Green vs Rolling Updates. Trade-offs in cost, risk, and complexity. Blue-green: two full environments, instant rollback by switching traffic, higher infrastructure cost, no mixed-version compatibility issues. Rolling update: no extra infrastructure, gradual replacement of instances, old and new versions serve together, rollback requires re-deploying the old version. Guidance shown in the figure: choose blue-green for critical services and rolling updates for stateless apps.]

Database Migrations: The Deployment Killer

Database migrations are the number one cause of deployment failures. The problem: code and schema must be compatible during the rollout. If you add a NOT NULL column without a default, old code that inserts rows will fail. If you rename a column, old code referencing the old name will crash. The solution: expand-contract pattern. First, expand the schema to support both old and new code (add columns, make them nullable, add default values). Deploy the new code that reads both old and new columns. Then, in a second deploy, contract the schema (remove old columns, make new columns NOT NULL). This takes two deployments, but it's safe. I've seen a team try to do a migration in one deploy — the result was a full table lock on a 500GB table that took 45 minutes, taking down the entire site.

ExpandContractMigration.systemdesign
-- io.thecodeforge: System Design tutorial
-- MySQL syntax; the old column is assumed to be called name

-- Phase 1: Expand schema (deploy before code change)
ALTER TABLE users ADD COLUMN display_name VARCHAR(255) NULL;
-- The old column (name) stays in place so old code keeps working

-- Deploy new code that reads display_name first, falls back to name
-- New code writes to both columns

-- Backfill existing rows before contracting (run in batches on large tables)
UPDATE users SET display_name = name WHERE display_name IS NULL;

-- Phase 2: Contract schema (after code is fully rolled out and nothing uses name)
ALTER TABLE users MODIFY COLUMN display_name VARCHAR(255) NOT NULL;
ALTER TABLE users DROP COLUMN name;

-- Never do this in one deploy:
-- ALTER TABLE users CHANGE COLUMN name display_name VARCHAR(255) NOT NULL; -- This will break old code!
Output
Phase 1 migration runs in seconds (adding nullable column). Old code continues to work. Phase 2 runs after all instances are updated.
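On the application side, the expand phase means "read new, fall back to old, write both". A minimal Java sketch of that compatibility layer follows. It models a row as a plain `Map` so it stays self-contained. `UserNameCompat` is a hypothetical name, and the old column name is passed in as a parameter because it depends on your schema.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the code that runs between the expand and contract
// phases. A row is modeled as a Map to keep the example self-contained.
final class UserNameCompat {

    static final String NEW_COLUMN = "display_name";

    // Read path: prefer the new column, fall back to the old one
    // for rows that were written before the migration.
    static String read(Map<String, String> row, String oldColumn) {
        String value = row.get(NEW_COLUMN);
        return (value != null) ? value : row.get(oldColumn);
    }

    // Write path: keep both columns in sync so old code still sees updates.
    static void write(Map<String, String> row, String oldColumn, String value) {
        row.put(NEW_COLUMN, value);
        row.put(oldColumn, value);
    }

    public static void main(String[] args) {
        Map<String, String> row = new HashMap<>();
        row.put("legacy_name", "Ada"); // row written by old code only
        System.out.println("before write: " + read(row, "legacy_name"));
        write(row, "legacy_name", "Ada L.");
        System.out.println("after write:  " + read(row, "legacy_name"));
    }
}
```

Only once every row has the new column populated and no running code reads the old one is it safe to contract.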

Rollback Strategies: When the New Version Is Toxic

A rollback is not just reverting the code — it's reverting the state. If the new version ran a database migration, rolling back the code might leave the schema in an incompatible state. The safest approach: make every deployment reversible. For code-only changes, a simple kubectl rollout undo works. For schema changes, you need a rollback migration script that reverses the schema change. Test the rollback in staging before every production deploy. I've seen a team deploy a migration that dropped a column, then tried to roll back — but the rollback script had a bug and failed. They had to restore from backup, losing 10 minutes of data. The lesson: always have a tested rollback plan, and never drop columns in the same deploy as the code change.

RollbackPlan.systemdesign
-- io.thecodeforge: System Design tutorial

-- Migration file: 001_add_display_name.sql
-- Up
ALTER TABLE users ADD COLUMN display_name VARCHAR(255) NULL;
-- Down
ALTER TABLE users DROP COLUMN display_name;

-- Deployment script with rollback:
-- 1. Run migration up
-- 2. Deploy new code
-- 3. If health checks fail:
--    a. Roll back code to the previous version
--    b. Run migration down (only after no running code uses the new column)
--    c. Verify health

-- Never deploy a migration without a corresponding down migration.
Output
If the deploy fails, the rollback script reverses the schema change, and the old code works again.

Health Checks: The Difference Between a Blip and a Meltdown

Health checks are your deployment's immune system. They must be aggressive and comprehensive. A readiness probe should check that the application can handle traffic — database connectivity, cache connectivity, and any critical downstream services. A liveness probe should check that the process is healthy — but don't make it too aggressive, or a transient spike will kill the pod. The gotcha: health checks must be independent of each other. If your readiness probe depends on the liveness probe's endpoint, a failure cascades. I've seen a team use the same endpoint for both — when the database was slow, the readiness probe failed, Kubernetes stopped sending traffic, the liveness probe also failed (because it hit the same slow endpoint), and Kubernetes killed the pod. The fix: use different endpoints with different timeouts.

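HealthCheckEndpoints.systemdesign
// io.thecodeforge: System Design tutorial

// Spring Boot health check endpoints
import org.springframework.dao.DataAccessException;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.http.ResponseEntity;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class HealthController {

    private final JdbcTemplate jdbcTemplate;
    private final StringRedisTemplate redisTemplate;

    // Both templates are injected by Spring
    public HealthController(JdbcTemplate jdbcTemplate, StringRedisTemplate redisTemplate) {
        this.jdbcTemplate = jdbcTemplate;
        this.redisTemplate = redisTemplate;
    }

    @GetMapping("/ready")
    public ResponseEntity<String> readiness() {
        // Check database connectivity
        try {
            jdbcTemplate.queryForObject("SELECT 1", Integer.class);
        } catch (DataAccessException e) {
            return ResponseEntity.status(503).body("Database unavailable");
        }
        // Check cache connectivity
        try {
            redisTemplate.opsForValue().get("health");
        } catch (Exception e) {
            return ResponseEntity.status(503).body("Cache unavailable");
        }
        return ResponseEntity.ok("Ready");
    }

    @GetMapping("/health")
    public ResponseEntity<String> liveness() {
        // Simple check: the process is alive and can answer HTTP
        return ResponseEntity.ok("Alive");
    }
}

// Kubernetes configuration:
// readinessProbe: /ready with 5s timeout
// livenessProbe: /health with 2s timeout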
Output
If database goes down, readiness probe returns 503, Kubernetes stops sending traffic, but the pod stays alive. When database recovers, readiness passes again, traffic resumes.

Feature Flags: Decouple Deploy from Release

Feature flags let you deploy code that is turned off, then enable it gradually without a new deploy. This is the ultimate safety net. You can deploy a new feature to production, test it internally, then enable it for 1% of users, then 100%. If something goes wrong, you disable the flag instantly — no rollback needed. The gotcha: feature flags add complexity. You need a flag management system (LaunchDarkly, Split, or a custom solution) and you must clean up flags after the feature is stable. I've seen a codebase with hundreds of stale flags that made the code impossible to understand. The rule: every flag must have an expiration date.

FeatureFlagExample.systemdesign
// io.thecodeforge: System Design tutorial

// Feature flag check in Java
// FeatureFlagClient and Order are placeholders for your flag SDK and domain model
public class CheckoutService {

    private final FeatureFlagClient flagClient;

    public CheckoutService(FeatureFlagClient flagClient) {
        this.flagClient = flagClient;
    }

    public void processOrder(Order order) {
        if (flagClient.isEnabled("new-checkout-flow", order.getUserId())) {
            // New checkout logic
            newCheckoutFlow(order);
        } else {
            // Old checkout logic
            oldCheckoutFlow(order);
        }
    }

    private void newCheckoutFlow(Order order) { /* new checkout logic */ }

    private void oldCheckoutFlow(Order order) { /* old checkout logic */ }
}

// Deployment: deploy code with flag disabled. Then enable flag for internal users. Then ramp to 100%.
Output
Code is deployed but new feature is invisible to users until the flag is enabled. If a bug is found, disable the flag — no rollback needed.
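If you build a custom flag system, the "enable for 1% of users" part is usually a deterministic hash of the user ID. Here is a minimal Java sketch under that assumption. `PercentageRollout` is a hypothetical name, and CRC32 is my choice of a stable hash. Commercial flag services use their own bucketing schemes.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical sketch of percentage-based flag evaluation. CRC32 is used
// because it is stable across JVMs and restarts, unlike Object.hashCode().
final class PercentageRollout {

    // Maps (flag, user) to a stable bucket in the range 0..99.
    static int bucket(String flagKey, String userId) {
        CRC32 crc = new CRC32();
        crc.update((flagKey + ":" + userId).getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % 100);
    }

    // A user is in the rollout when their bucket falls below the percentage.
    static boolean isEnabled(String flagKey, String userId, int rolloutPercent) {
        return bucket(flagKey, userId) < rolloutPercent;
    }

    public static void main(String[] args) {
        String flag = "new-checkout-flow";
        int enabled = 0;
        for (int i = 0; i < 10_000; i++) {
            if (isEnabled(flag, "user-" + i, 10)) {
                enabled++;
            }
        }
        System.out.println("enabled for " + enabled + " of 10000 users at 10%");
    }
}
```

Two properties matter here: the same user always gets the same answer, and a user who is in at 10% stays in when you ramp to 50%. Including the flag key in the hash keeps the same users from being the guinea pigs for every feature.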

When Not to Use a Complex Deployment System

Not every service needs blue-green or canary deployments. If you have a single-instance application with almost no traffic (e.g., an internal admin tool), a simple restart is fine. If your service is stateless and you can tolerate a few seconds of downtime, a rolling update with maxUnavailable: 1 is simpler and cheaper. The overengineering trap: I've seen teams set up canary deployments for a cron job that runs once a day. The cron job doesn't serve traffic, so there's nothing to canary. Use the simplest system that meets your uptime requirements. For most startups, a basic CI/CD pipeline with a rolling update and a manual rollback button is enough.

Production incident · Post-mortem · Severity: high

The 4GB Container That Kept Dying

Symptom
After a routine deploy, pods in the new replicaset crashed every 5 minutes with OOMKilled. The old replicaset was fine.
Assumption
Team assumed a memory leak in the new code — spent hours profiling heap dumps.
Root cause
The deployment YAML had resources.limits.memory: 4Gi but the new version's JVM heap was set to 4GB via -Xmx4g, leaving zero room for non-heap JVM memory (metaspace, thread stacks, GC structures). As the heap filled, total process memory crossed the container limit and the pod was OOMKilled.
Fix
Set -Xmx3g in the JVM args and resources.limits.memory: 4Gi — always leave 25% headroom for non-heap memory.
Key lesson
  • Container memory limits must account for the runtime's overhead, not just the application heap.
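The headroom rule is simple arithmetic, sketched here in Java. `HeapSizing` is a hypothetical helper, and the 25% figure is the rule of thumb from this post-mortem, not a universal constant. On container-aware JVMs, the `-XX:MaxRAMPercentage=75.0` flag expresses the same idea without hard-coding a heap size.

```java
// Hypothetical helper: derive a JVM max heap from a container memory limit,
// leaving a fixed percentage of headroom for non-heap memory.
final class HeapSizing {

    static long maxHeapMiB(long containerLimitMiB, int headroomPercent) {
        if (headroomPercent < 0 || headroomPercent >= 100) {
            throw new IllegalArgumentException("headroomPercent must be in 0..99");
        }
        return containerLimitMiB * (100 - headroomPercent) / 100;
    }

    public static void main(String[] args) {
        long limitMiB = 4096; // resources.limits.memory: 4Gi
        long heapMiB = maxHeapMiB(limitMiB, 25);
        System.out.println("-Xmx" + heapMiB + "m");
    }
}
```

For a 4Gi limit with 25% headroom this yields 3072 MiB, which matches the -Xmx3g fix.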
Production debug guide
Systematic recovery paths for the failure modes engineers actually hit.

Symptom 01
Deploy stuck: new pods crash-loop with CrashLoopBackOff
Fix
1. Check pod logs: kubectl logs <pod-name> --previous
2. Check events: kubectl describe pod <pod-name>
3. If OOMKilled, increase memory limits or reduce heap size
4. If the liveness probe is failing, check the health endpoint and adjust probe parameters

Symptom 02
Deploy completes but error rate spikes to 5%
Fix
1. Rollback immediately: kubectl rollout undo deployment/myapp
2. Check if canary weight was increased too fast and reduce it to 1%
3. Check if the new version has backward-incompatible changes (API, schema)
4. Add a feature flag to disable the new code path

Symptom 03
Database migration runs for 30+ minutes, blocking writes
Fix
1. Kill the migration if it's an ALTER TABLE that locks: use SHOW PROCESSLIST and KILL QUERY <id>
2. Use an online schema change tool (gh-ost) for future migrations
3. Restore from backup if data corruption occurred

Deployment Triage Cheat Sheet
First-response commands for when things go wrong, copy-paste ready.
Pods crash-looping with `CrashLoopBackOff`
Immediate action
Check logs from the previous instance
Commands
kubectl logs <pod-name> --previous
kubectl describe pod <pod-name> | grep -A5 Events
Fix now
If OOM: increase resources.limits.memory or reduce JVM heap. If probe failure: adjust initialDelaySeconds or fix health endpoint.
Deploy stuck at 'Waiting for rollout to finish'
Immediate action
Check rollout status and events
Commands
kubectl rollout status deployment/myapp
kubectl describe deployment/myapp | grep -A10 Conditions
Fix now
If ProgressDeadlineExceeded, check pod logs. Increase progressDeadlineSeconds or fix the underlying issue.
New version causes 5xx errors
Immediate action
Rollback immediately
Commands
kubectl rollout undo deployment/myapp
kubectl rollout status deployment/myapp
Fix now
After rollback, check if the issue is code or config. If config, fix and redeploy. If code, fix in next release.
Canary deployment shows elevated error rate
Immediate action
Set canary weight to 0%
Commands
kubectl patch virtualservice myapp --type='json' -p='[{"op":"replace","path":"/spec/http/0/route/0/weight","value":0},{"op":"replace","path":"/spec/http/0/route/1/weight","value":100}]'
kubectl get virtualservice myapp -o yaml | grep weight
Fix now
Investigate canary logs. Ensure backward compatibility. Redeploy with fix.
Feature / Aspect | Blue-Green | Canary | Rolling Update
Infrastructure cost | 2x (two full environments) | 1x + small canary pool | 1x
Rollback speed | Instant with a load balancer switch (DNS switches lag) | Instant (set weight to 0%) | Slow (rollout undo)
Traffic isolation | Full isolation | Partial (shared state risk) | None (mixed versions)
Complexity | Low | Medium | Low
Best for | Critical services, stateful apps | High-traffic, risk-averse teams | Stateless, low-traffic apps

Key takeaways

1. Blue-green deployments give instant rollback but cost double infrastructure: use for critical services.
2. Canary deployments reduce risk by gradually shifting traffic: always monitor error budgets and set automatic rollback.
3. Database migrations are the #1 deployment killer: use expand-contract pattern and online schema change tools.
4. Health checks must be comprehensive and independent: readiness for dependencies, liveness for process health.

Interview Questions on This Topic

Q01 (Senior): How does a blue-green deployment handle database schema changes that are...
Q02 (Senior): When would you choose a canary deployment over blue-green?
Q03 (Senior): What happens if a readiness probe fails during a rolling update?
Q04 (Junior): What is the difference between a readiness probe and a liveness probe?
Q05 (Senior): You deploy a new version that introduces a bug causing data corruption. ...
Q06 (Senior): Design a deployment system for a microservices architecture with 50 serv...

Q01 of 06 (Senior)

How does a blue-green deployment handle database schema changes that are not backward-compatible?

ANSWER
It doesn't — that's the problem. Blue-green assumes both environments can serve traffic independently. If the new schema is incompatible with the old code, switching traffic will break the old environment during rollback. The solution is expand-contract: make schema changes backward-compatible first, deploy, then clean up in a second deploy.

Frequently Asked Questions

01. What is the best deployment strategy for zero downtime?
02. What's the difference between blue-green and canary deployment?
03. How do I handle database migrations during deployment?
04. What happens if a Kubernetes rolling update gets stuck?