Design a Code Deployment System: Zero-Downtime Rollouts Without the 3AM Panic
20+ years shipping large-scale distributed systems. Written from production experience, not tutorials.
Design a deployment system by choosing a rollout strategy (blue-green, canary, or rolling), automating the pipeline with CI/CD, implementing health checks and automatic rollback, and monitoring error budgets. Start with blue-green for simplicity, then add canary releases for risk reduction.
Think of it like replacing the engine on a flying plane. You don't just yank the old one out — you bring a second plane alongside, transfer passengers one by one, and if the new engine sputters, you switch back instantly. That's blue-green. Canary is like testing a new recipe on one table before serving the whole restaurant.
Every deployment is a controlled explosion. I've watched a single bad deploy take down a $2M/day e-commerce site because the team thought 'just push to prod' was a strategy. The problem isn't bad code — it's bad deployment design. Most tutorials teach you how to build a pipeline, but they don't tell you what happens when that pipeline fails at 3AM with a database migration that locks every table. This article gives you the battle-tested patterns for zero-downtime deployments, the exact health checks that prevent disasters, and the rollback mechanisms that save your weekend. By the end, you'll be able to design a deployment system that survives bad code, traffic spikes, and your own sleep-deprived mistakes.
Why Most Deployment Pipelines Are a House of Cards
Before we talk patterns, understand the failure modes. A deployment system isn't just a pipeline — it's a state machine that transitions your production system from version N to N+1. Every transition has a blast radius. The most common mistake? Treating deployment as a single step: build, push, restart. That's how you get a full outage when the new code has a bug that only manifests under real traffic. The hack people used before proper systems? SSH into every server, pull the new binary, and restart the service manually. That works for exactly one server. At scale, you need automation that handles partial failures, traffic draining, and health verification. Without it, you're one bad deploy away from a PagerDuty alert that wakes up the whole team.
Blue-Green Deployments: The Safety Net You Need
Blue-green deployment is the simplest zero-downtime pattern. You maintain two identical environments: blue (current production) and green (new version). You deploy to green, run health checks, then switch traffic from blue to green. If something goes wrong, you switch back. This is the pattern I use for critical services like payment processing. The key insight: the switch must be atomic from the user's perspective. DNS-based switching has propagation delays (minutes to hours, depending on TTLs and resolver caching). Load balancer switching (e.g., an AWS ALB target group swap) is near-instant. The gotcha: you pay for double the infrastructure while both environments exist. For low-traffic services, that's fine. For high-traffic services, consider canary deployments instead.
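The deploy, verify, switch, switch-back sequence can be sketched in a few lines. This is a minimal in-memory model, not a real load balancer client: BlueGreenRouter, its fields, and the healthy callback are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class BlueGreenRouter:
    """Hypothetical stand-in for a load balancer that points at one environment."""

    versions: Dict[str, str] = field(
        default_factory=lambda: {"blue": "v1", "green": "v1"}
    )
    active: str = "blue"

    @property
    def idle(self) -> str:
        """The environment that is not receiving live traffic."""
        return "green" if self.active == "blue" else "blue"

    def deploy(self, version: str, healthy: Callable[[str], bool]) -> bool:
        """Deploy to the idle environment; switch only if it passes the health check."""
        target = self.idle
        self.versions[target] = version  # live traffic is untouched at this point
        if not healthy(target):
            return False  # the active environment keeps serving
        self.active = target  # a single pointer flip stands in for the atomic switch
        return True

    def rollback(self) -> None:
        """Switch back to the other environment, which still runs the old version."""
        self.active = self.idle
```

In a real setup the pointer flip would be a target group swap or similar, and healthy would call the green environment's health endpoint. The point of the sketch is the ordering: nothing changes for users until the check passes.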
Watch out for a /health endpoint that returns 200 even when the database connection is dead. Use a health check that queries the database and checks a critical external dependency. Otherwise, you'll switch traffic to a green environment that can't serve requests.
Canary Deployments: Roll Out to 1% Before 100%
Canary deployments reduce risk by routing a small percentage of traffic to the new version before a full rollout. Start with 1% of users, monitor error rates and latency, then gradually increase to 5%, 25%, 50%, 100%. The magic is in the traffic splitting — you need a load balancer that supports weighted routing (e.g., AWS ALB with weighted target groups, or a service mesh like Istio). The gotcha: you must ensure that the canary instances can handle the traffic spike when you increase the weight. Auto-scaling based on CPU/memory is essential. I've seen a canary crash because the team increased weight from 1% to 10% too fast, and the new version's connection pool wasn't sized for the sudden load.
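The staged ramp described above can be written as a small control loop. This is a sketch under stated assumptions: error_rate_at is a hypothetical hook standing in for "shift traffic, wait, read the metrics", and the 1% error threshold is an illustrative default, not a recommendation from the text.

```python
from typing import Callable, Sequence, Tuple

# Stage weights taken from the text: 1% -> 5% -> 25% -> 50% -> 100%.
STAGES: Tuple[int, ...] = (1, 5, 25, 50, 100)


def run_canary(
    error_rate_at: Callable[[int], float],
    max_error_rate: float = 0.01,
    stages: Sequence[int] = STAGES,
) -> Tuple[str, int]:
    """Step through the stages; stop at the first one that breaches the threshold.

    error_rate_at(weight) is a hypothetical hook that shifts `weight` percent of
    traffic to the canary, waits for a soak period, and returns the observed
    error rate as a fraction (0.02 means 2%).
    """
    for weight in stages:
        if error_rate_at(weight) > max_error_rate:
            # The caller is expected to set the canary weight back to 0.
            return ("rolled_back", weight)
    return ("promoted", stages[-1])
```

A real controller would also watch latency and give auto-scaling time to catch up before each step, which is exactly where the 1% to 10% jump in the story above went wrong.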
Rolling Updates: The Kubernetes Default and Its Pitfalls
Rolling updates replace instances one by one. Kubernetes does this by default: it spins up a new pod, waits for it to become ready, then terminates an old pod. The advantage: no extra infrastructure cost. The disadvantage: during the rollout, both old and new versions serve traffic simultaneously. If the new version has a bug that corrupts shared state (e.g., a database row format), the old version might also be affected. The biggest pitfall: the maxSurge and maxUnavailable settings. Set maxUnavailable: 0 to ensure zero downtime, but then you need enough capacity to handle the surge. I've seen a team set maxSurge: 25% and maxUnavailable: 25% — during a deploy, 25% of pods were unavailable, causing a capacity crunch under peak load.
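To see what a given pair of settings means for capacity, it helps to do the arithmetic before the deploy. The sketch below assumes percentage values and follows the Kubernetes rounding rule (maxSurge rounds up, maxUnavailable rounds down); rollout_bounds is a hypothetical helper name.

```python
from typing import Dict


def rollout_bounds(
    replicas: int, max_surge_pct: int, max_unavailable_pct: int
) -> Dict[str, int]:
    """Return the pod-count bounds a rolling update may reach mid-rollout."""
    # Integer arithmetic avoids float rounding surprises.
    surge = -(-replicas * max_surge_pct // 100)  # ceiling: maxSurge rounds up
    unavailable = replicas * max_unavailable_pct // 100  # floor: maxUnavailable rounds down
    return {
        "min_ready": replicas - unavailable,  # fewest pods guaranteed available
        "max_total": replicas + surge,  # most pods that can exist at once
    }
```

With 20 replicas and 25%/25%, the rollout may drop to 15 ready pods, which is the capacity crunch described above. With maxUnavailable at 0, ready pods never drop below 20, but the cluster needs room for up to 25.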
Database Migrations: The Deployment Killer
Database migrations are the number one cause of deployment failures. The problem: code and schema must be compatible during the rollout. If you add a NOT NULL column without a default, old code that inserts rows will fail. If you rename a column, old code referencing the old name will crash. The solution: expand-contract pattern. First, expand the schema to support both old and new code (add columns, make them nullable, add default values). Deploy the new code that reads both old and new columns. Then, in a second deploy, contract the schema (remove old columns, make new columns NOT NULL). This takes two deployments, but it's safe. I've seen a team try to do a migration in one deploy — the result was a full table lock on a 500GB table that took 45 minutes, taking down the entire site.
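Here is a minimal sketch of the expand phase using SQLite from the Python standard library. The users table, the full_name and display_name columns, and every function name are hypothetical. The contract step (dropping the old column) is deliberately left out: it belongs in a later deploy, once ready_to_contract returns True and no running code reads the old column.

```python
import sqlite3


def create_old_schema(conn: sqlite3.Connection) -> None:
    """Schema as the old code expects it."""
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT NOT NULL)")


def expand(conn: sqlite3.Connection) -> None:
    """Expand: add the new column as nullable so old inserts keep working."""
    conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


def old_code_insert(conn: sqlite3.Connection, full_name: str) -> None:
    """Old code knows nothing about display_name."""
    conn.execute("INSERT INTO users (full_name) VALUES (?)", (full_name,))


def new_code_insert(conn: sqlite3.Connection, name: str) -> None:
    """New code writes both columns while old code is still running."""
    conn.execute(
        "INSERT INTO users (full_name, display_name) VALUES (?, ?)", (name, name)
    )


def new_code_read(conn: sqlite3.Connection, user_id: int):
    """New code prefers the new column and falls back to the old one."""
    row = conn.execute(
        "SELECT COALESCE(display_name, full_name) FROM users WHERE id = ?",
        (user_id,),
    ).fetchone()
    return row[0] if row else None


def backfill(conn: sqlite3.Connection) -> int:
    """Copy old values into the new column; returns the number of rows touched."""
    cur = conn.execute(
        "UPDATE users SET display_name = full_name WHERE display_name IS NULL"
    )
    return cur.rowcount


def ready_to_contract(conn: sqlite3.Connection) -> bool:
    """Contract is only safe once no row still depends on the old column."""
    (missing,) = conn.execute(
        "SELECT COUNT(*) FROM users WHERE display_name IS NULL"
    ).fetchone()
    return missing == 0
```

On a large production table the backfill would run in small batches rather than as one UPDATE, for the same locking reasons described above.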
Rollback Strategies: When the New Version Is Toxic
A rollback is not just reverting the code; it's reverting the state. If the new version ran a database migration, rolling back the code might leave the schema in an incompatible state. The safest approach: make every deployment reversible. For code-only changes, a simple kubectl rollout undo works. For schema changes, you need a rollback migration script that reverses the schema change. Test the rollback in staging before every production deploy. I've seen a team deploy a migration that dropped a column, then try to roll back, but the rollback script had a bug and failed. They had to restore from backup, losing 10 minutes of data. The lesson: always have a tested rollback plan, and never drop columns in the same deploy as the code change.
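One cheap way to test a rollback script is a round trip: snapshot the schema, apply the migration, apply the rollback, and confirm the schema is back where it started. The sketch below does this against SQLite; the audit_log migration and the helper names are hypothetical, and a real check would run against a staging copy of the production database.

```python
import sqlite3

# Hypothetical migration pair used for illustration.
UP = "CREATE TABLE audit_log (id INTEGER PRIMARY KEY, event TEXT NOT NULL)"
DOWN = "DROP TABLE audit_log"


def schema_snapshot(conn: sqlite3.Connection):
    """Return the schema as a sorted list of (type, name, sql) rows."""
    return conn.execute(
        "SELECT type, name, sql FROM sqlite_master ORDER BY type, name"
    ).fetchall()


def rollback_is_clean(conn: sqlite3.Connection, up_sql: str, down_sql: str) -> bool:
    """Apply the migration, then the rollback, and compare schemas."""
    before = schema_snapshot(conn)
    conn.execute(up_sql)
    conn.execute(down_sql)
    return schema_snapshot(conn) == before
```

This only proves the schema round-trips. It says nothing about data the migration destroyed, which is why dropping a column needs a separate, later deploy.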
Health Checks: The Difference Between a Blip and a Meltdown
Health checks are your deployment's immune system. A readiness probe should be comprehensive: it checks that the application can handle traffic, including database connectivity, cache connectivity, and any critical downstream services. A liveness probe should check only that the process itself is healthy; make it too aggressive and a transient spike will kill the pod. The gotcha: health checks must be independent of each other. If your readiness probe depends on the liveness probe's endpoint, a failure cascades. I've seen a team use the same endpoint for both. When the database was slow, the readiness probe failed, Kubernetes stopped sending traffic, the liveness probe also failed (because it hit the same slow endpoint), and Kubernetes killed the pod. The fix: use different endpoints with different timeouts.
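The split between the two probes can be sketched as two handlers that share no code path. The function names, the (status, body) return shape, and the checks dictionary are assumptions for illustration; timeouts are configured on the probe side and are not modeled here.

```python
from typing import Callable, Dict, Tuple


def liveness() -> Tuple[int, str]:
    """Answers only 'is the process responsive?' and touches no dependencies."""
    return 200, "alive"


def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, bool]]:
    """Run every dependency check; any failure or exception means 'not ready'."""
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A dependency that raises is treated the same as one that reports failure.
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results
```

With this shape, a dead database turns readiness into a 503 and takes the pod out of rotation, while liveness keeps returning 200 so the pod is not restarted for a problem that a restart cannot fix.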
Feature Flags: Decouple Deploy from Release
Feature flags let you deploy code that is turned off, then enable it gradually without a new deploy. This is the ultimate safety net. You can deploy a new feature to production, test it internally, then enable it for 1% of users, then 100%. If something goes wrong, you disable the flag instantly — no rollback needed. The gotcha: feature flags add complexity. You need a flag management system (LaunchDarkly, Split, or a custom solution) and you must clean up flags after the feature is stable. I've seen a codebase with hundreds of stale flags that made the code impossible to understand. The rule: every flag must have an expiration date.
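A percentage rollout with an expiration date fits in a few lines. This is a sketch of a custom solution, not the API of LaunchDarkly or Split: the Flag class, its fields, and the hash-bucketing scheme are assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Flag:
    name: str
    rollout_percent: int  # 0 disables the flag, 100 enables it for everyone
    expires: date  # every flag carries an expiration date


def bucket(flag_name: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag: Flag, user_id: str) -> bool:
    """A user is in the rollout when their bucket falls below the percentage."""
    return bucket(flag.name, user_id) < flag.rollout_percent


def is_stale(flag: Flag, today: date) -> bool:
    """Flags past their expiration date should be removed from the code."""
    return today > flag.expires
```

Because the bucket depends only on the flag name and user ID, a user who sees the feature at 1% keeps seeing it at 25%, and setting rollout_percent to 0 is the instant off switch. A check like is_stale can run in CI to fail the build when a flag outlives its date.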
When Not to Use a Complex Deployment System
Not every service needs blue-green or canary deployments. If you have a single-instance application with no traffic (e.g., an internal admin tool), a simple restart is fine. If your service is stateless and you can tolerate a few seconds of downtime, a rolling update with maxUnavailable: 1 is simpler and cheaper. The overengineering trap: I've seen teams set up canary deployments for a cron job that runs once a day. The cron job doesn't serve traffic — there's nothing to canary. Use the simplest system that meets your uptime requirements. For most startups, a basic CI/CD pipeline with a rolling update and a manual rollback button is enough.
The 4GB Container That Kept Dying
The container had resources.limits.memory: 4Gi, but the new version's JVM heap was set to 4GB via -Xmx4g, leaving zero room for JVM overhead (metaspace, threads, GC). The container hit the limit instantly.
The fix: -Xmx3g in the JVM args and resources.limits.memory: 4Gi. Always leave 25% headroom for non-heap memory.
- Container memory limits must account for the runtime's overhead, not just the application heap.
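The headroom rule is simple arithmetic, and it is easy to check in a pre-deploy script. The sketch below applies the 25% figure from this incident; max_heap_mib and heap_fits are hypothetical helper names, and the right headroom for a given workload may differ.

```python
def max_heap_mib(limit_mib: int, headroom: float = 0.25) -> int:
    """Largest heap (in MiB) that leaves `headroom` of the limit for non-heap memory."""
    return int(limit_mib * (1 - headroom))


def heap_fits(limit_mib: int, heap_mib: int, headroom: float = 0.25) -> bool:
    """True when the configured heap respects the headroom rule."""
    return heap_mib <= max_heap_mib(limit_mib, headroom)
```

For a 4Gi limit (4096 MiB), max_heap_mib returns 3072, which is the -Xmx3g from the fix. The original -Xmx4g (4096 MiB) fails the check.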
1. Check the previous container's logs: kubectl logs <pod-name> --previous
2. Check events: kubectl describe pod <pod-name>
3. If OOMKilled, increase memory limits or reduce heap size
4. If liveness probe failing, check health endpoint and adjust probe parameters

1. Roll back: kubectl rollout undo deployment/myapp
2. Check if canary weight was increased too fast; if so, reduce to 1%
3. Check if new version has backward-incompatible changes (API, schema)
4. Add feature flag to disable new code path

1. Find and kill the blocking query: SHOW PROCESSLIST and KILL QUERY <id>
2. Use online schema change tool (gh-ost) for future migrations
3. Restore from backup if data corruption occurred

kubectl logs <pod-name> --previous
kubectl describe pod <pod-name> | grep -A5 Events
If OOMKilled: increase resources.limits.memory or reduce JVM heap. If probe failure: adjust initialDelaySeconds or fix health endpoint.
Key takeaways
Interview Questions on This Topic
How does a blue-green deployment handle database schema changes that are not backward-compatible?