Missing Health Check: DevOps Interview Gotcha Broke CI/CD
A missing health check caused a 45-minute outage despite green CI/CD.
20+ years shipping production code across the stack, with years spent interviewing engineers. Lessons pulled from things that broke in production.
- DevOps is a cultural and technical practice unifying dev and ops through automation, monitoring, and rapid feedback loops.
- Core components: CI/CD pipelines, Infrastructure as Code (Terraform), container orchestration (Kubernetes), and observability.
- Performance insight: High-performing DevOps teams deploy 46x more frequently and recover from failures 96x faster than low performers (figures reported in the 2017 State of DevOps Report from DORA and Puppet).
- Production insight: Without blameless post-mortems, the same outage repeats — automation alone won't fix cultural gaps.
- Biggest mistake: Treating DevOps as a tools role — the real value is in removing silos and automating feedback.
Imagine building a skyscraper where architects, bricklayers, electricians, and inspectors all work in separate buildings and only talk once a month. That's old-school software development. DevOps is what happens when you knock down those walls, put everyone in the same room, and give them walkie-talkies. It's the practice of making the people who write software and the people who run software work as one continuous, automated team — so your app ships faster, breaks less, and gets fixed in minutes instead of weeks.
DevOps interviews are brutal if you walk in memorising buzzwords. Interviewers at companies like Netflix, Spotify, and Stripe don't want you to recite a Wikipedia definition of CI/CD — they want to know if you've felt the pain of a 3am production outage and understand why the practices exist. The difference between a candidate who gets the offer and one who doesn't usually isn't technical depth alone — it's the ability to connect a tool or practice back to a real business problem it solves.
DevOps exists because the old model was broken. Developers would spend weeks writing code, hand a giant batch over a metaphorical wall to operations, and then watch chaos unfold — mismatched environments, undocumented configs, surprise dependencies. DevOps isn't a job title, it's a cultural and technical philosophy: automate everything that can be automated, deliver in small increments, and make feedback loops as short as possible.
By the end of this article you'll be able to answer the questions that trip most candidates up — not by reciting definitions, but by explaining the WHY behind Docker, Kubernetes, CI/CD pipelines, Infrastructure as Code, and monitoring. You'll also know the common traps interviewers set and how to sidestep them with confident, experience-flavoured answers.
Why DevOps Interview Questions Are a Filter for Real-World CI/CD Understanding
Top DevOps interview questions test whether you understand the integration and delivery pipeline as a system, not just a set of tools. They probe your grasp of automation, observability, and failure modes, especially subtle ones like a missing health check that silently breaks CI/CD. A health check is a probe (HTTP, TCP, or command) that validates a service is ready to serve traffic; without it, a deployment can appear successful while the application is actually dead.
In practice, a missing health check means the orchestrator (Kubernetes, Nomad, or a load balancer) has no way to detect a process that is hung or not yet ready. The rollout proceeds, the old pods are terminated, and traffic is routed to a non-responsive container. The failures cascade: monitoring alerts fire, the rollback is manual, and the pipeline still reports success despite zero uptime.
Use health checks in every deployment: readiness probes to gate traffic routing, liveness probes to trigger automatic restarts. They are the difference between a self-healing system and a silent outage, and a missing one is among the most common causes of 'deployment succeeded, app is down' incidents.
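To make the probe idea concrete, here is a minimal sketch of a readiness endpoint using only the Python standard library. The `/health` path, the `ready` flag, and the helper names are assumptions chosen for illustration; a real service would set the flag only after confirming its own dependencies are reachable.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical readiness flag: set it once the app can actually serve traffic.
ready = threading.Event()


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        # 200 tells the orchestrator to route traffic; 503 tells it to hold off.
        status = 200 if ready.is_set() else 503
        body = b"ok" if status == 200 else b"not ready"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep probe traffic out of the logs in this sketch


def start_health_server(port=0):
    """Start the endpoint on a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def probe_status(port, path="/health", host="127.0.0.1"):
    """Do what an HTTP probe does: GET the path and report the status code."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", path)
        return conn.getresponse().status
    finally:
        conn.close()
```

Until `ready.set()` is called the endpoint answers 503, which is the signal an orchestrator needs to keep the instance out of rotation. A probe aimed at any other path gets a 404 and never succeeds.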
Infrastructure as Code (IaC) and Automation
One of the most frequent questions is: 'Why do we need Infrastructure as Code?' In the past, servers were hand-crafted 'pets': if a production server crashed, no one knew exactly how it was configured. IaC turns infrastructure into 'cattle.' By defining your servers, networks, and databases in code (using tools like Terraform or Ansible), you make your environments reproducible, version-controlled, and far less prone to 'configuration drift.' This allows a DevOps engineer to spin up a mirror image of production in minutes for testing purposes.
Interviewers want to see that you understand the pain IaC solves: the 'it works on my machine' syndrome, the cost of manual patching, and the compliance nightmare of snowflake servers. Mentioning the principle of immutability—destroy and rebuild rather than patch—shows you've lived the trade-off between operational overhead and speed.
Containerization and Orchestration: Docker vs. Kubernetes
Interviewers often ask you to explain the relationship between Docker and Kubernetes. Think of Docker as the standardized shipping container: it packages the application and its dependencies so it runs the same anywhere. Kubernetes (K8s) is the crane and the cargo ship: it manages thousands of these containers, handling scaling, self-healing (restarting crashed containers), and load balancing across a cluster of machines.
The real depth comes from explaining the WHY: Docker solves environment consistency (no more 'works on my machine'). Kubernetes solves orchestration at scale. When an interviewer asks 'Should we use Docker or Kubernetes?' the correct answer is 'Both — they solve different problems.' If you're running a single service, Docker is enough. If you have multiple services that need to scale independently, you need K8s. Senior engineers also talk about readiness probes, resource limits, and network policies — because those are the things that actually break in production.
CI/CD Pipelines: The Automation Heartbeat
CI/CD is the engine that makes DevOps tick. Interviewers want to see you understand the difference between Continuous Integration (merge often, test automatically) and Continuous Delivery (every commit is deployable). The real power comes from the feedback loop: a good pipeline tells you the moment something breaks, so you fix it before it reaches production.
When asked about CI/CD, avoid reciting tools. Instead, talk about pipeline stages: lint → unit test → build → integration test → security scan → deploy to staging → smoke test → deploy to production. Explain why each stage exists and what happens if it fails. Mention that a well-designed pipeline is idempotent: running it twice on the same commit should produce the same result. Also, high-performing teams have less than 1 hour lead time for changes — that's the metric you want to optimise.
Think of the pipeline as a car assembly line:
- Each stage (lint, test, build) is a station that must pass before the car moves forward.
- If a station fails, the car is pulled off the line — no manual override without inspection.
- The final gate (production deployment) is the showroom floor — only flawless cars go there.
- Metrics like lead time and deployment frequency are the factory's KPIs — measure them religiously.
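The fail-fast behaviour described above can be sketched in a few lines of Python. This is an illustration only, not how any particular CI system is implemented; `run_pipeline` and the stage callables are hypothetical names.

```python
from typing import Callable, List, Tuple

# A stage is a name plus a check that returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run stages in order and stop at the first failure.

    Returns (succeeded, names of the stages that actually ran).
    """
    ran: List[str] = []
    for name, check in stages:
        ran.append(name)
        if not check():
            # Fail fast: later stages never run on a broken build.
            return False, ran
    return True, ran
```

Calling `run_pipeline([("lint", lint), ("unit test", unit), ("build", build)])` with a failing `unit` returns `False` and shows that `build` never started, which is the property an interviewer wants you to be able to explain.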
Monitoring and Observability: You Can't Improve What You Can't Measure
DevOps interviews often include questions about monitoring. The key distinction they're looking for is between monitoring (checking known metrics) and observability (the ability to infer unknown states from logs, metrics, and traces). Senior engineers know that dashboards are nice but debugging requires the three pillars: logs (what happened), metrics (how many times it happened), and traces (where it happened in a request's journey).
Interviewers want to hear that you don't just rely on dashboards — you build alerting with actionable thresholds, not noise. For example, alerting on CPU at 90% is useless if your app is IO-bound. The golden signals of monitoring (latency, traffic, errors, saturation) are a good start. Also, mention SLOs, SLIs, and error budgets to show you understand the business side — DevOps is about balancing reliability with velocity.
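The error-budget arithmetic is worth being able to do on a whiteboard. The sketch below assumes an availability SLO measured over a rolling 30-day window; the function names are illustrative.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for an availability SLO."""
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be between 0 and 100 (exclusive)")
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


def budget_remaining(slo_percent: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    budget = error_budget_minutes(slo_percent, window_days)
    return 1 - downtime_minutes / budget
```

A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime, so a single 45-minute outage overspends the whole month's budget. That is the kind of number that turns a reliability discussion into a business one.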
Incident Management and Blameless Post-Mortems
This is the part of DevOps that most candidates ignore. Interviewers at senior levels want to know how you handle incidents, not just how you set up CI/CD. They ask: 'Tell me about a time you handled a production outage.' The structure they expect: detection → containment → root cause analysis → fix → prevention.
Key principles: blameless culture (assume good intent), write a post-mortem within 48 hours, and follow up on action items. The goal is to improve the system, not to find a scapegoat. Senior engineers also talk about incident severity levels (SEV1, SEV2), escalation paths, and how they communicate during an outage. They mention that a good post-mortem has a timeline, a root cause analysis, and action items with owners and due dates.
Configuration Drift: Why Your IaC Will Lie to You
You'll run Terraform. It'll say 'No changes.' But your production server has a config file that doesn't match. That's configuration drift. It happens when someone SSHes in and 'just fixes something,' or an emergency patch gets applied manually. The infrastructure code thinks it's running version X. Reality is version Y. Next deploy, Terraform reverts the fix. Now you're down.
The WHY is simple: humans bypass automation under pressure. The HOW is preventive: immutable infrastructure. Don't patch running servers. Deploy new ones. Bake AMIs or container images, and enforce golden images with Packer. Treat servers like cattle, not pets. Your CI/CD pipeline should be the only path to production. If someone SSHes in after deployment, that's a policy violation, not a workaround.
Interviewers will ask: 'How do you detect drift?' Answer with automated compliance checks: tools like Cloud Custodian or Chef InSpec, or even a scheduled terraform plan that alerts on changes. If your IaC and your running environment don't match, you have a compliance incident, not a deploy.
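A scheduled drift check can be as small as the sketch below. It relies on the `-detailed-exitcode` flag of `terraform plan`, which exits with 0 when nothing would change, 1 on error, and 2 when changes are pending. The function names and status strings are assumptions made for illustration.

```python
import subprocess


def interpret_plan_exit_code(code: int) -> str:
    """Map `terraform plan -detailed-exitcode` exit codes to a status."""
    if code == 0:
        return "in-sync"   # live state matches the code
    if code == 2:
        return "drift"     # plan would change something: alert on this
    return "error"         # exit code 1 (or anything unexpected)


def check_drift(workdir: str) -> str:
    """Run a non-interactive plan in workdir and classify the result."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return interpret_plan_exit_code(result.returncode)
```

Note the limit of this approach: a plan only sees the resources and attributes Terraform manages. A config file edited by hand inside a server will still report "in-sync", which is why image-level compliance checks are needed as well.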
Git Workflows That Won't Make You Cry at 3 AM
A senior DevOps engineer doesn't just know git commands. They know which workflow prevents merge hell during a hotfix. The WHY: production doesn't care about your feature branch strategy. It cares about getting a fix out in 10 minutes. The HOW: trunk-based development. Main branch is always deployable. Short-lived feature branches (under 2 days). No long-running release branches unless you're pinned to a compliance calendar. When a SEV1 hits, you revert the last commit, not cherry-pick through 12 branches. To hide unfinished work, use feature flags (LaunchDarkly, for example), not long-lived git branches. Your CI/CD should run on every push to main, not only on PR merge. That's how you catch integration failures before they reach production.
Avoid the 'git flow' cargo cult. It was written in 2010 and assumes you have a quarterly release cycle, not daily deploys. If you hear a candidate describe a 6-branch workflow for a microservice, they've never been paged at midnight. The simplest test: can you roll back a single commit in under 60 seconds? If not, your git process is a liability.
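The feature-flag idea is easier to defend in an interview if you can sketch it. The following is a hypothetical in-process percentage rollout, not LaunchDarkly's API: it hashes the flag name and user ID into a stable bucket, so the same user always gets the same answer.

```python
import hashlib


def flag_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministically place a user in bucket 0..99 and compare to the rollout."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent
```

With this shape, unfinished code merges to main behind a flag set to 0 percent, and turning a feature off during an incident is a configuration change rather than a git operation.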
The Silent Pipeline: How a Missing Health Check Caused a 45-Minute Outage
The probe was configured with the wrong path (/healthz instead of /health). The container started, but the probe never succeeded, so the service was removed from the load balancer, yet the deployment was marked successful.
The fix pointed the probe at /health and added a startup probe to prevent the same issue during initial boot. The team also added a pipeline step that verifies the probe returns 200 before marking the deployment as complete.
- A green CI/CD pipeline doesn't mean the service is healthy; it means the pipeline ran.
- Always test readiness and liveness probes in a staging environment that mirrors production.
- Add synthetic monitoring that exercises the same endpoints as your probes, so you know the second a deployment goes sideways.
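A post-deploy verification step of the kind described in this incident could look like the sketch below. The function name, the retry defaults, and the injectable `fetch_status` callable are assumptions; in a real pipeline `fetch_status` would issue an HTTP GET against the service's health endpoint.

```python
import time
from typing import Callable


def wait_until_healthy(
    fetch_status: Callable[[], int],
    attempts: int = 10,
    delay_seconds: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until fetch_status() returns 200 or the attempts run out."""
    for attempt in range(attempts):
        try:
            if fetch_status() == 200:
                return True
        except OSError:
            pass  # connection refused or timed out: not healthy yet
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False
```

The pipeline step would exit non-zero when this returns `False`, so a deployment whose probe never succeeds fails the build instead of being reported green.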
docker ps -a | grep Exited. Also inspect resource limits: docker inspect <container> | jq '.[0].HostConfig.Memory'. If OOMKilled, increase memory or fix the memory leak.
kubectl get endpoints <service>. If endpoints are empty, check selector labels and readiness probes. Also check network policies: a misapplied NetworkPolicy can silently drop traffic.
kubectl logs <pod> --previous
kubectl describe pod <pod> | grep -A 10 'Last State'
Common mistakes to avoid
- The 'Tool-First' Fallacy
- Missing the 'Business Why'
- The 'Black Hole' Pipeline
- Assuming 'Automation' Replaces Human Judgment
Interview Questions on This Topic
Explain the 'Three Ways' of DevOps (Feedback, Flow, and Continuous Learning) and how you've applied them in a past project.