Senior 7 min · March 06, 2026

Missing Health Check: DevOps Interview Gotcha Broke CI/CD

A missing health check caused a 45-minute outage despite green CI/CD.

Naren Founder & Principal Engineer

20+ years shipping production code across the stack, with years spent interviewing engineers. Lessons pulled from things that broke in production.

Last updated: May 24, 2026
Quick Answer
  • DevOps is a cultural and technical practice unifying dev and ops through automation, monitoring, and rapid feedback loops.
  • Core components: CI/CD pipelines, Infrastructure as Code (Terraform), container orchestration (Kubernetes), and observability.
  • Performance insight: High-performing teams deploy 46x more frequently and recover from failures 96x faster than low performers (figures from the 2017 State of DevOps Report).
  • Production insight: Without blameless post-mortems, the same outage repeats — automation alone won't fix cultural gaps.
  • Biggest mistake: Treating DevOps as a tools role — the real value is in removing silos and automating feedback.
What Are the Top DevOps Interview Questions?

This article dissects a common DevOps interview trap: the missing health check in CI/CD pipelines. It's not about trivia. It exposes whether a candidate understands that a pipeline which builds and deploys, but never verifies that the deployed artifact is healthy, is just a fancy script.

Real-world CI/CD demands that every stage, from code commit to production, includes automated validation that the service is alive, responsive, and meeting SLAs. The 'gotcha' reveals if you've been burned by a silent failure where a deployment succeeded but the app was dead on arrival, costing hours of debugging and customer trust.

The piece covers the core pillars of DevOps practice that interviewers use to separate theory from battle-tested experience. Infrastructure as Code (IaC) with tools like Terraform or Pulumi isn't just about writing configuration files; it's about idempotent, version-controlled state management that prevents drift.

Containerization questions probe whether you understand Docker's layered filesystem and Kubernetes' control loop, not just memorize commands. CI/CD pipeline discussions go beyond Jenkins or GitHub Actions syntax to focus on artifact immutability, canary deployments, and rollback strategies.

Monitoring and observability sections tackle the difference between dashboards and actionable alerts—Prometheus metrics vs. structured logging with OpenTelemetry. Incident management drills into blameless post-mortems and the SRE approach to error budgets, where you measure mean time to recovery (MTTR) not just uptime.

The article positions these topics as filters because real DevOps isn't about tools; it's about systems thinking, automation rigor, and the discipline to catch failures before they reach users. If you can't explain why a health check in your pipeline matters more than your deployment tool, you haven't lived through a production outage at 3 AM.

Plain-English First

Imagine building a skyscraper where architects, bricklayers, electricians, and inspectors all work in separate buildings and only talk once a month. That's old-school software development. DevOps is what happens when you knock down those walls, put everyone in the same room, and give them walkie-talkies. It's the practice of making the people who write software and the people who run software work as one continuous, automated team — so your app ships faster, breaks less, and gets fixed in minutes instead of weeks.

DevOps interviews are brutal if you walk in memorising buzzwords. Interviewers at companies like Netflix, Spotify, and Stripe don't want you to recite a Wikipedia definition of CI/CD — they want to know if you've felt the pain of a 3am production outage and understand why the practices exist. The difference between a candidate who gets the offer and one who doesn't usually isn't technical depth alone — it's the ability to connect a tool or practice back to a real business problem it solves.

DevOps exists because the old model was broken. Developers would spend weeks writing code, hand a giant batch over a metaphorical wall to operations, and then watch chaos unfold: mismatched environments, undocumented configs, surprise dependencies. DevOps isn't a job title; it's a cultural and technical philosophy. Automate everything that can be automated, deliver in small increments, and make feedback loops as short as possible.

By the end of this article you'll be able to answer the questions that trip most candidates up — not by reciting definitions, but by explaining the WHY behind Docker, Kubernetes, CI/CD pipelines, Infrastructure as Code, and monitoring. You'll also know the common traps interviewers set and how to sidestep them with confident, experience-flavoured answers.

Why DevOps Interview Questions Are a Filter for Real-World CI/CD Understanding

Top DevOps interview questions test whether you understand the integration and delivery pipeline as a system, not just a set of tools. They probe your grasp of automation, observability, and failure modes, especially the subtle ones like a missing health check that silently breaks CI/CD.

A health check is a probe (HTTP, TCP, or command) that validates a service is ready to serve traffic; without it, a deployment can appear successful while the application is actually dead. In practice, a missing health check means the orchestrator (Kubernetes, Nomad, or a load balancer) never detects a crashed or stuck process. The deployment proceeds, the old pods are terminated, and traffic is routed to a non-responsive container. The failures cascade: monitoring alerts fire, rollbacks have to be done by hand, and the deployment pipeline reports success despite zero uptime.

Use health checks in every deployment: readiness probes for traffic routing, liveness probes for automatic restarts. They are the difference between a self-healing system and a silent outage, and a missing health check is one of the most common causes of 'deployment succeeded, app is down' incidents.
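To make the idea of a health check as a deployment gate concrete, here is a minimal Python sketch. It is an illustration, not the article's tooling: the function names `wait_until_healthy` and `http_probe`, the retry count, and the delay are all assumptions. The gate polls a probe and only reports success once the probe passes.

```python
import time
import urllib.request
from typing import Callable


def wait_until_healthy(
    probe: Callable[[], bool],
    attempts: int = 10,
    delay_seconds: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it reports healthy or the attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            if probe():
                return True
        except Exception:
            # A probe that raises (connection refused, timeout, HTTP error)
            # counts as an unhealthy result, not as a crash of the gate.
            pass
        if attempt < attempts:
            sleep(delay_seconds)
    return False


def http_probe(url: str, timeout_seconds: float = 2.0) -> Callable[[], bool]:
    """Build a probe that passes only when `url` answers with HTTP 200."""

    def probe() -> bool:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            return response.status == 200

    return probe
```

A pipeline step could call `wait_until_healthy(http_probe("https://staging.forge.io/health"))` and exit non-zero when it returns `False`, so a dead service turns the pipeline red instead of green.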

The Silent Deployment Trap
A missing health check doesn't fail the pipeline — it lets a broken build roll out to production, making the CI/CD green while users see errors.
Production Insight
A team deployed a new microservice version that crashed on startup due to a missing config file. Without a health check, the orchestrator treated the new pods as ready the moment their containers started, terminated the old pods, and routed traffic to the new, dead ones, causing a 5-minute outage per deployment.
Symptom: 100% of requests fail with 502 or timeout, but the deployment pipeline shows green and the orchestrator reports all pods as 'Running'.
Rule of thumb: Every container must have at least a liveness probe; every service exposed to traffic must have a readiness probe. Test them in staging with a deliberately failing endpoint.
Key Takeaway
Health checks are not optional — they are the safety net that prevents silent failures from reaching users.
A missing health check turns a successful deployment into a hidden outage; always validate both readiness and liveness.
In CI/CD, a green pipeline means nothing if the application doesn't respond — instrument health checks as a first-class deployment gate.
Figure: CI/CD Pipeline Health Check Flow (thecodeforge.io). From IaC to monitoring, the key stages in a robust DevOps pipeline: Infrastructure as Code (automate provisioning with Terraform, CloudFormation); Containerization & Orchestration (Docker for packaging, Kubernetes for scaling); CI/CD Pipeline (automated build, test, deploy); Monitoring & Observability (metrics, logs, traces to detect issues); Incident Management (blameless post-mortems for continuous improvement); Configuration Drift (IaC state vs reality; detect and remediate). Warning: a missing health check in the pipeline breaks CI/CD. Always include automated health checks before deployment.

Infrastructure as Code (IaC) and Automation

One of the most frequent questions is: 'Why do we need Infrastructure as Code?' In the past, servers were hand-crafted 'pets': if a production server crashed, no one knew exactly how it was configured. IaC turns infrastructure into 'cattle.' By defining your servers, networks, and databases in code (using tools like Terraform or Ansible), you ensure that your environments are reproducible, version-controlled, and far less prone to 'configuration drift.' This allows a DevOps engineer to spin up a mirror image of production in minutes for testing purposes.

Interviewers want to see that you understand the pain IaC solves: the 'it works on my machine' syndrome, the cost of manual patching, and the compliance nightmare of snowflake servers. Mentioning the principle of immutability—destroy and rebuild rather than patch—shows you've lived the trade-off between operational overhead and speed.

io/thecodeforge/terraform/main.tf (HCL)
# io.thecodeforge: Standard AWS Infrastructure Provisioning
# Declaring infrastructure as code ensures consistency across Dev, Staging, and Prod
provider "aws" {
  region = "us-east-1"
}

resource "aws_instance" "forge_server" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t2.micro"

  # In an interview, highlight how tags help in cost tracking and environment isolation
  tags = {
    Name        = "TheCodeForge-Production-Node"
    Environment = "Production"
    ManagedBy   = "Terraform"
    Project     = "ForgeCore"
  }

  # Ensure security groups are also handled via code, not manual console clicks
  # (aws_security_group.forge_sg is assumed to be defined elsewhere in this module)
  vpc_security_group_ids = [aws_security_group.forge_sg.id]
}
Output
# Terraform will create 1 resource: aws_instance.forge_server
Forge Tip: Embrace Immutability
When answering IaC questions, mention 'Immutability.' Instead of patching an old server (which leads to 'configuration drift'), DevOps teams use IaC to destroy the old one and deploy a fresh, updated version. This curbs the 'it works on my machine' syndrome and keeps your staging environment built from the same definitions as production.
Production Insight
The biggest IaC failure we've seen: a developer manually SSH'd into a production server to 'fix a quick bug' and forgot to backport the change to Terraform. Next deployment rolled back that fix — and brought down the payment system.
Rule: never use the console or SSH for production changes. If it's not in code, it doesn't exist.
Key Takeaway
IaC turns infrastructure into code: version-controlled, reproducible, auditable.
The golden rule: any manual change is a future outage waiting to happen.
Immutable deployments > patching in place.
IaC Tool Decision Tree
If: You need to manage cloud resources (AWS, GCP, Azure)
Use: Terraform. It's cloud-agnostic and has the widest provider ecosystem.
If: You're already in AWS and need a simpler, AWS-native approach
Use: AWS CloudFormation, but be aware of lock-in and slower feature adoption.
If: You need to configure existing servers (install packages, set configs)
Use: Ansible or Puppet. Terraform is for provisioning, not configuration management.
If: You need both provisioning and configuration in one tool
Use: Terraform + Ansible together. Terraform spins up infra, Ansible configures it.

Containerization and Orchestration: Docker vs. Kubernetes

Interviewers often ask you to explain the relationship between Docker and Kubernetes. Think of Docker as the standardized shipping container: it packages the application and its dependencies so it runs the same anywhere. Kubernetes (K8s) is the crane and the cargo ship: it manages thousands of these containers, handling scaling, self-healing (restarting crashed containers), and load balancing across a cluster of machines.

The real depth comes from explaining the WHY: Docker solves environment consistency (no more 'works on my machine'). Kubernetes solves orchestration at scale. When an interviewer asks 'Should we use Docker or Kubernetes?' the correct answer is 'Both — they solve different problems.' If you're running a single service, Docker is enough. If you have multiple services that need to scale independently, you need K8s. Senior engineers also talk about readiness probes, resource limits, and network policies — because those are the things that actually break in production.

io/thecodeforge/docker/Dockerfile (Dockerfile)
# io.thecodeforge: Optimized Multi-stage Build for Spring Boot
# Stage 1: Build - keeps the final image small and secure
FROM eclipse-temurin:17-jdk-alpine AS build
WORKDIR /workspace/app
COPY . .
RUN ./gradlew build -x test

# Stage 2: Runtime - only includes the JRE and the JAR
FROM eclipse-temurin:17-jre-alpine
WORKDIR /app
VOLUME /tmp

# Best Practice: Run as non-root user for production security
RUN addgroup -S forgegroup && adduser -S forgeuser -G forgegroup
USER forgeuser

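# Assumes build/libs contains exactly one JAR. If Gradle also emits a "-plain" JAR,
# narrow the glob: COPY fails when several sources target a single file.
COPY --from=build /workspace/app/build/libs/*.jar forge-app.jar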

EXPOSE 8080
ENTRYPOINT ["java", "-Djava.security.egd=file:/dev/./urandom", "-jar", "/app/forge-app.jar"]
Output
# Docker image built and ready for K8s deployment.
Real-World Context: More Than Isolation
Don't just say Docker 'isolates' apps. Explain that it reduces onboarding time from days to minutes because new developers don't have to install specific database versions or local runtimes. Mention that K8s 'Liveness' and 'Readiness' probes are the secret sauce that prevents your app from serving traffic before it's actually ready to handle it.
Production Insight
We once debugged a mysterious 5-second timeout on every request. Turns out the liveness probe was hitting an endpoint that internally called the database — and when the DB was slow, Kubernetes killed the pod. The app never had a chance to recover.
Rule: liveness probes should check only the app's process, not downstream dependencies.
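The split between the two probes can be sketched in a few lines of Python. This is a hedged illustration, not a Kubernetes API: the function names, the `startup_checks` mapping, and the response strings are assumptions. Liveness answers from the process alone, while readiness reports 503 until the app's own startup conditions (config loaded, cache warmed) hold.

```python
from typing import Callable, Dict, Tuple


def liveness() -> Tuple[int, str]:
    """Liveness: the process is up and able to answer. No dependency calls."""
    return 200, "alive"


def readiness(startup_checks: Dict[str, Callable[[], bool]]) -> Tuple[int, str]:
    """Readiness: report 503 while any startup check fails or raises."""
    failing = []
    for name, check in startup_checks.items():
        try:
            ok = check()
        except Exception:
            # A check that blows up means "not ready", never a crashed handler.
            ok = False
        if not ok:
            failing.append(name)
    if failing:
        return 503, "not ready: " + ", ".join(sorted(failing))
    return 200, "ready"
```

Wired into two separate HTTP endpoints, a slow startup only takes the pod out of rotation through the readiness result; it never triggers a restart through liveness.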
Key Takeaway
Docker gives you consistency; Kubernetes gives you resilience at scale.
Always separate liveness from readiness — and never chain them to downstream services.
Multi-stage builds can cut image size substantially, which means faster pulls and fewer vulnerabilities.
Container Orchestration Decision Tree
If: You have 1-5 services and low scaling needs
Use: Docker Compose or Docker Swarm. Both are simpler than K8s with less overhead.
If: You need auto-scaling, self-healing, and rolling updates at scale
Use: Kubernetes, but invest in a managed service (EKS, AKS, GKE) to reduce operational burden.
If: You're running batch jobs and not always-on services
Use: Consider AWS Fargate or Google Cloud Run. Serverless containers eliminate cluster management.

CI/CD Pipelines: The Automation Heartbeat

CI/CD is the engine that makes DevOps tick. Interviewers want to see you understand the difference between Continuous Integration (merge often, test automatically) and Continuous Delivery (every commit is deployable). The real power comes from the feedback loop: a good pipeline tells you the moment something breaks, so you fix it before it reaches production.

When asked about CI/CD, avoid reciting tools. Instead, talk about pipeline stages: lint → unit test → build → integration test → security scan → deploy to staging → smoke test → deploy to production. Explain why each stage exists and what happens if it fails. Mention that a well-designed pipeline is idempotent: running it twice on the same commit should produce the same result. Also, high-performing teams have less than 1 hour lead time for changes — that's the metric you want to optimise.
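The "what happens if it fails" part is easiest to show as code. Below is a minimal Python sketch of a fail-fast stage runner; the `Stage` type and `run_pipeline` are hypothetical names for illustration, and real CI systems express the same idea through job dependencies.

```python
from typing import Callable, List, Tuple

# A stage is a name plus an action that returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run stages in order and stop at the first failure.

    Returns (succeeded, names of the stages that actually ran).
    """
    executed: List[str] = []
    for name, action in stages:
        executed.append(name)
        if not action():
            # Later stages never run, so a failed test can never reach deploy.
            return False, executed
    return True, executed
```

If the unit test stage returns `False`, the returned list ends at that stage, which is the property an interviewer is probing for: nothing downstream of a failure executes.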

io/thecodeforge/github-actions/deploy.yml (YAML)
# io.thecodeforge: Production CI/CD Pipeline with Quality Gates
name: Forge CI/CD Pipeline
on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up JDK 17
        uses: actions/setup-java@v4
        with:
          java-version: '17'
          distribution: 'temurin'
      - name: Run unit tests with coverage
        run: |
          ./gradlew test jacocoTestReport
      - name: Upload coverage to Codecov
        uses: codecov/codecov-action@v3
        if: success()

  security-scan:
    runs-on: ubuntu-latest
    needs: build-and-test
    steps:
      - uses: actions/checkout@v4
      - name: Run Snyk to check for vulnerabilities
        uses: snyk/actions/gradle-jdk17@master
        env:
          SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}

  deploy-staging:
    runs-on: ubuntu-latest
    needs: [build-and-test, security-scan]
    if: github.ref == 'refs/heads/main'
    environment: staging
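    steps:
      # The Helm chart lives in the repo, so it must be checked out first.
      # Cluster credentials are assumed to be configured for this environment.
      - uses: actions/checkout@v4
      - name: Deploy to staging using Helm
        run: |
          helm upgrade --install forge-api ./charts/forge-api \
            --namespace staging \
            --set image.tag=${{ github.sha }} \
            --wait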

  health-check:
    runs-on: ubuntu-latest
    needs: deploy-staging
    steps:
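      - name: Run smoke tests against staging
        run: |
          # Fail the job unless /health answers successfully; retry transient errors
          curl --fail --silent --show-error --retry 5 --retry-delay 10 \
            https://staging.forge.io/health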

  deploy-production:
    runs-on: ubuntu-latest
    needs: health-check
    if: github.ref == 'refs/heads/main'
    environment: production
    steps:
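      - name: Deploy to production (rolling update)
        run: |
          # Use Flux or ArgoCD for GitOps; this is a simplified example
          kubectl set image deployment/forge-api forge-api=forge.io/forge-api:${{ github.sha }}
          # Fail the job if the new pods never become ready
          kubectl rollout status deployment/forge-api --timeout=180s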
Output
# Pipeline executes: Test → Security scan → Staging deploy → Smoke test → Production deploy
Mental Model: The Assembly Line
  • Each stage (lint, test, build) is a station that must pass before the car moves forward.
  • If a station fails, the car is pulled off the line — no manual override without inspection.
  • The final gate (production deployment) is the showroom floor — only flawless cars go there.
  • Metrics like lead time and deployment frequency are the factory's KPIs — measure them religiously.
Production Insight
The worst pipeline failure we caused: a team skipped the security scan to 'ship fast' and deployed a Docker image with a known CVE. Within 12 hours, attackers used the vulnerability to exfiltrate customer data.
Rule: never bypass a stage for speed — a broken pipeline gives false confidence. If a stage is flaky, fix the stage, don't skip it.
Key Takeaway
A pipeline is only as good as its feedback loop — make failures visible in under 5 minutes.
Never deploy to production without a post-deployment smoke test.
GitOps: the pipeline updates the repo, and the cluster pulls the change — no direct SSH or kubectl apply.
CI/CD Tooling Decision Guide
If: Your team is small and wants simplicity
Use: GitHub Actions or GitLab CI. There is no extra infrastructure to manage.
If: You need complex pipeline orchestration and visibility
Use: Jenkins or GoCD, but expect maintenance overhead. Consider managed CI/CD if you're not a core DevOps team.
If: You want GitOps (infrastructure as code for deployments)
Use: ArgoCD or Flux. They reconcile your cluster state with the Git repo automatically.

Monitoring and Observability: You Can't Improve What You Can't Measure

DevOps interviews often include questions about monitoring. The key distinction they're looking for is between monitoring (checking known metrics) and observability (the ability to infer unknown states from logs, metrics, and traces). Senior engineers know that dashboards are nice but debugging requires the three pillars: logs (what happened), metrics (how many times it happened), and traces (where it happened in a request's journey).

Interviewers want to hear that you don't just rely on dashboards — you build alerting with actionable thresholds, not noise. For example, alerting on CPU at 90% is useless if your app is IO-bound. The golden signals of monitoring (latency, traffic, errors, saturation) are a good start. Also, mention SLOs, SLIs, and error budgets to show you understand the business side — DevOps is about balancing reliability with velocity.
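An error budget is simple arithmetic, and being able to do it on a whiteboard helps. The Python sketch below assumes an availability SLO expressed as a fraction and a 30-day window; both function names are illustrative.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO over a window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1, e.g. 0.999")
    window_minutes = window_days * 24 * 60
    return window_minutes * (1.0 - slo_target)


def budget_remaining(slo_target: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return 1.0 - downtime_minutes / budget
```

A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime, so a single 45-minute outage overspends the whole month's budget, which is the kind of trade-off between reliability and velocity the SLO conversation is about.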

io/thecodeforge/prometheus/alert-rules.yml (YAML)
# io.thecodeforge: Production-grade Prometheus alerting rules
groups:
  - name: forge-production
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% for 2 minutes"
          # sum() drops the instance label, so report the aggregate ratio instead
          description: "5xx error ratio across all instances is {{ $value | humanizePercentage }}"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency above 1 second for 5 minutes"

      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"
Output
# Prometheus alerting rules active — provides actionable alerts with appropriate severity
Warning: The 'Dashboard Forest' Trap
Don't build 50 dashboards that no one looks at. Focus on a single pane of glass with the four golden signals. If an alert fires, make sure it includes a runbook link. Otherwise, you're just creating noise that gets ignored — and the real outage goes unnoticed.
Production Insight
At a previous company, we had a beautiful Grafana dashboard covering all SLOs. No one looked at it. When the payment service started failing, the error rate graph spiked, but the alerting was tuned to 5-minute windows — by the time the page went out, we'd already lost $10k in revenue.
Rule: alerts should be actionable and immediate. If you don't have a runbook, the alert is noise.
Key Takeaway
Monitoring tells you what's broken; observability tells you why.
Alerts must be actionable — include a link to the runbook.
The four golden signals: latency, traffic, errors, saturation — start here.
Monitoring vs Observability Decision
If: You know exactly what metrics to track and have static thresholds
Use: Start with monitoring (Prometheus + Grafana). Add alerting based on the golden signals.
If: You have microservices and need to debug complex, unknown failures
Use: Invest in observability: distributed tracing (Jaeger), structured logging (ELK), and metrics together.
If: You're on a tight budget but need to distinguish known issues from unknown
Use: A combination: Prometheus for metrics, Loki for logs (reuses Prometheus infra), and Tempo for traces, all in one stack.

Incident Management and Blameless Post-Mortems

This is the part of DevOps that most candidates ignore. Interviewers at senior levels want to know how you handle incidents, not just how you set up CI/CD. They ask: 'Tell me about a time you handled a production outage.' The structure they expect: detection → containment → root cause analysis → fix → prevention.

Key principles: blameless culture (assume good intent), write a post-mortem within 48 hours, and follow up on action items. The goal is to improve the system, not to find a scapegoat. Senior engineers also talk about incident severity levels (SEV1, SEV2), escalation paths, and how they communicate during an outage. They mention that a good post-mortem has a timeline, a root cause analysis, and action items with owners and due dates.

io/thecodeforge/postmortems/2026-04-22-outage.md (Markdown)
# Post-Mortem: April 22, 2026 - Payment Gateway Outage

## Severity
SEV1 (payment service down, 12% of users affected)

## Timeline (UTC)
- 14:23 - Alert: error rate spike > 10%
- 14:25 - On-call engineer acknowledges
- 14:30 - Identify that a recent config change removed the retry logic for the payment API
- 14:32 - Roll back the config change
- 14:45 - Service restored

## Root Cause
A config change to the payment service accidentally removed the retry logic. The change was committed without code review and deployed without testing.

## Action Items
| Action | Owner | Due |
|--------|-------|-----|
| Add validation to config changes | Alice | 2026-04-29 |
| Enforce mandatory code review for all config changes | Bob | 2026-04-26 |
| Add integration test that simulates payment API timeout | Carol | 2026-05-06 |
Output
# Post-mortem document with timeline, root cause, and action items.
Forge Tip: The 5 Whys in Post-Mortems
When investigating root cause, use the 5 Whys technique. Example: 'Why did the payment service fail? Because the API call timed out. Why did it time out? Because the circuit breaker opened. Why did the circuit breaker open? Because a downstream API went down. Why didn't we know about it? Because we didn't have a health check on that API.' The final why often reveals missing or insufficient monitoring.
Production Insight
We had an incident where the post-mortem blamed a developer for 'not testing enough.' The team became afraid to deploy. Velocity dropped 60% in the next quarter.
Rule: blameless post-mortems are not optional — they are the mechanism that prevents fear and maintains a healthy deployment cadence.
Key Takeaway
Incidents happen. How you handle them defines your team's maturity.
Blameless culture accelerates recovery and prevents fear.
Post-mortem action items must have owners and deadlines — otherwise it's just a meeting.
Incident Severity Classification
If: Service completely unavailable affecting all users
Use: SEV1. Immediate page to all on-call, war room, CEO notified.
If: Service degraded but still usable, partial user impact
Use: SEV2. Page the primary on-call, escalate if not resolved within 1 hour.
If: Minor bug, no user impact, but needs fix soon
Use: SEV3. Assign to engineer, fix in next sprint. No page.
If: Cosmetic issue, non-functional (e.g., wrong label)
Use: SEV4. Log it, fix when time permits.

Configuration Drift: Why Your IaC Will Lie to You

You'll run Terraform. It'll say 'No changes.' But your production server has a config file that doesn't match. That's configuration drift. It happens when someone SSHes in and 'just fixes something,' or when an emergency patch gets applied manually. The infrastructure code thinks it's running version X. Reality is version Y. Next deploy, Terraform reverts the fix. Now you're down.

The WHY is simple: humans bypass automation under pressure. The HOW is preventive: immutable infrastructure. Don't patch running servers. Deploy new ones. Bake AMIs or container images, and enforce golden images with Packer. Treat servers like cattle, not pets. Your CI/CD pipeline should be the only path to production. If someone SSHes in after deployment, that's a policy violation, not a workaround.

Interviewers will ask: 'How do you detect drift?' Answer with automated compliance checks: tools like Cloud Custodian, Chef InSpec, or even a scheduled Terraform plan that alerts on changes. If your IaC and your running environment don't match, you have a compliance incident, not a deploy.
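A scheduled plan that alerts on changes can be sketched in Python around Terraform's `-detailed-exitcode` flag, which makes `terraform plan` exit with 0 when nothing would change, 1 on error, and 2 when changes are pending. The function names and the injectable `runner` parameter are assumptions made for illustration, and the sketch presumes the checked-out configuration has already been fully applied, so any pending change points to drift.

```python
import subprocess


def classify_plan_exit_code(code: int) -> str:
    """Map `terraform plan -detailed-exitcode` exit codes to a drift verdict."""
    if code == 0:
        return "in-sync"  # plan succeeded and found nothing to change
    if code == 2:
        return "drift"    # plan succeeded and found pending changes
    return "error"        # exit code 1, or anything unexpected


def check_drift(workdir: str, runner=subprocess.run) -> str:
    """Run a non-interactive plan in `workdir` and classify the result."""
    command = ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"]
    result = runner(command, cwd=workdir, capture_output=True, text=True)
    return classify_plan_exit_code(result.returncode)
```

A cron job or scheduled CI run could call `check_drift` for each environment and page or open a ticket whenever the verdict is anything other than `"in-sync"`.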

drift_alert.tf (HCL)
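// io.thecodeforge.drift_alert
// Assumes data.aws_ami.ubuntu is defined elsewhere in this module.
resource "aws_instance" "web" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t3.micro"

  # user_data takes plain text (use user_data_base64 for encoded input).
  # It only runs at first boot, so a change replaces the instance.
  user_data                   = file("${path.module}/init.sh")
  user_data_replace_on_change = true

  lifecycle {
    # ignore_changes hides drift on the listed attributes, so keep the list
    # short: only values an approved external process is expected to modify.
    ignore_changes = [
      tags["PatchDate"],
    ]
  }
}

# Re-reads the live instance on every plan/apply
data "aws_instance" "web_current" {
  instance_id = aws_instance.web.id
}

# This output is only a reminder. Real detection comes from running
# `terraform plan -detailed-exitcode` in CI every hour (exit code 2 = changes).
output "drift_warning" {
  value = "Check instance ${aws_instance.web.id}. If tags differ, drift detected."
}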
Output
> Check instance i-0abcd1234. If tags differ, drift detected.
Production Trap:
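Never rely on Terraform's 'no changes' output alone. Run 'terraform plan -refresh-only' before every production deploy and review whatever it reports; 'terraform apply -refresh-only' only records the drift in state, it does not undo it. One orphaned manual change can cascade into a full outage.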
Key Takeaway
If your infrastructure code says one thing and production runs another, you have an incident waiting to happen. Always detect drift before it bites you.

Git Workflows That Won't Make You Cry at 3 AM

A senior DevOps engineer doesn't just know git commands. They know which workflow prevents merge hell during a hotfix. The WHY: production doesn't care about your feature branch strategy. It cares about getting a fix out in 10 minutes.

The HOW: trunk-based development. Main branch is always deployable. Short-lived feature branches (under 2 days). No long-running release branches unless you're pinned to a compliance calendar. When a SEV1 hits, you revert the last commit, not cherry-pick through 12 branches. For feature flags, use a flag service such as LaunchDarkly, not git branches. Your CI/CD should run on every push to main, not only on PR merge. That's how you catch integration failures before they reach production.

Avoid the 'git flow' cargo cult. It was written in 2010 and assumes you have a quarterly release cycle, not daily deploys. If you hear a candidate describe a 6-branch workflow for a microservice, they've never been paged at midnight. The simplest test: can you roll back a single commit in under 60 seconds? If not, your git process is a liability.

rollback.sh (Bash)
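#!/bin/bash
# io.thecodeforge.rollback
# Production rollback: revert single commit, fast
set -euo pipefail

BRANCH="main"
git checkout "$BRANCH"
git pull --ff-only origin "$BRANCH"

# Find the bad commit. --format=%h prints only the short SHA, so the value
# can be passed straight to git revert (--oneline would append the subject).
BAD_SHA=$(git log -1 --format=%h --grep="hotfix/vuln")
if [ -z "$BAD_SHA" ]; then
  echo "No hotfix commit found. Exiting safely."
  exit 0
fi

echo "Reverting: $BAD_SHA"
git revert --no-edit "$BAD_SHA"
git push origin "$BRANCH"

# Verify deploy works. The retries give the pipeline time to redeploy;
# tune the count and delay to your own deploy duration.
if ! curl -s --fail --retry 10 --retry-delay 15 https://api.example.com/health; then
  echo "Rollback failed! Pinging on-call."
  # Insert PagerDuty call here
  exit 1
fi

echo "Rollback successful."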
Output
Reverting: abc1234
Rollback successful.
Production Trap:
Never rebase a shared branch that's deployed. Rebase rewrites history. Revert creates a new commit. Only reverts are safe for hotfixes. You can fix commit messages later. You can't fix a live outage with a rebase.
Key Takeaway
Your git workflow should make rollback the easiest action, not the scariest. Trunk-based development with reverts is your safety net.
Production incident · Post-mortem · Severity: high

The Silent Pipeline: How a Missing Health Check Caused a 45-Minute Outage

Symptom
After a routine deployment, the API service was running but returning 503 for 45 minutes. Users saw errors, and the on-call rotation was paged.
Assumption
The team assumed that if the container started and the CI/CD pipeline passed, the service was healthy. They'd never tested the actual readiness probe.
Root cause
The Kubernetes readiness probe was configured with an incorrect path (/healthz instead of /health). The container started, but the probe never succeeded, so the service was removed from the load balancer — yet the deployment was marked successful.
Fix
Changed the readiness probe path to /health and added a startup probe to prevent the same issue during initial boot. Also added a pipeline step that verifies the probe returns 200 before marking the deployment as complete.
Key lesson
  • A green CI/CD pipeline doesn't mean the service is healthy — it means the pipeline ran.
  • Always test readiness and liveness probes in a staging environment that mirrors production.
  • Add synthetic monitoring that exercises the same endpoints as your probes, so you know the second a deployment goes sideways.
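The pipeline step described in the fix could look roughly like the sketch below. It polls a health endpoint until it returns HTTP 200 or the attempts run out. The function name wait_for_healthy, the attempt count, and the delay are illustrative assumptions, not the team's actual script.

```bash
# Poll a health endpoint until it returns HTTP 200 or attempts run out.
# Arguments: URL, max attempts (default 10), delay in seconds (default 3).
# All names and defaults here are assumptions for illustration.
wait_for_healthy() (
  url=$1
  attempts=${2:-10}
  delay=${3:-3}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -w prints only the status code; the response body is discarded.
    status=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url") || status=000
    if [ "$status" = "200" ]; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    echo "attempt $i/$attempts: got HTTP $status" >&2
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
)
```

In a pipeline this would typically run after kubectl rollout status deployment/<name> --timeout=120s, which itself fails when pods never become Ready. Chaining the two means the job goes red for both a stuck rollout and an unhealthy endpoint.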
Production debug guide
Symptom → Action guide for the three most frequent production pain points.
Symptom · 01
New deployment: containers crash-looping with no obvious error in logs.
Fix
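Check the container's exit code first: docker ps -a --filter status=exited. Then check whether the kernel OOM-killed it: docker inspect --format '{{.State.OOMKilled}}' <container>. To see the configured memory limit in bytes (0 means unlimited): docker inspect <container> | jq '.[0].HostConfig.Memory'. If OOMKilled is true, raise the memory limit or fix the memory leak.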
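Exit codes are easier to act on once you know the common ones. Here is a minimal sketch of a lookup helper. The function name explain_exit_code and the wording of each hint are assumptions; the codes themselves follow the usual shell convention that 128+N means the process was killed by signal N.

```bash
# Map a container exit code to the most likely cause.
# Hints are rules of thumb, not a diagnosis.
explain_exit_code() {
  case "$1" in
    0)   echo "clean exit: main process finished" ;;
    1)   echo "application error: read the logs" ;;
    126) echo "command not executable: check permissions" ;;
    127) echo "command not found: check entrypoint and PATH" ;;
    137) echo "SIGKILL (128+9): often the OOM killer" ;;
    143) echo "SIGTERM (128+15): container was asked to stop" ;;
    *)   echo "unrecognized code: check application docs" ;;
  esac
}
```

One way to feed it is explain_exit_code "$(docker inspect --format '{{.State.ExitCode}}' <container>)".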
Symptom · 02
Service healthy but traffic not reaching it (canary not getting traffic).
Fix
Verify ingress controller and service endpoints: kubectl get endpoints <service>. If endpoints are empty, check selector labels and readiness probes. Also check network policies — a misapplied NetworkPolicy can silently drop traffic.
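The empty-endpoints check can be scripted. This sketch asks the API server for the ready addresses behind a Service; pods that fail their readiness probe show up under notReadyAddresses instead, so they are not counted. The function name has_ready_endpoints and the default namespace are assumptions.

```bash
# Succeeds if the Service has at least one ready endpoint address.
# Arguments: service name, namespace (default "default").
has_ready_endpoints() (
  service=$1
  namespace=${2:-default}
  ips=$(kubectl get endpoints "$service" -n "$namespace" \
        -o jsonpath='{.subsets[*].addresses[*].ip}' 2>/dev/null)
  if [ -n "$ips" ]; then
    echo "ready endpoints: $ips"
  else
    echo "no ready endpoints for $service: check selector labels and readiness probes"
    return 1
  fi
)
```

If this fails while kubectl get pods shows the pods Running, the usual suspects are a selector that matches nothing or a readiness probe that never passes.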
Symptom · 03
CI/CD pipeline passes but deployment is broken (e.g., wrong image tag).
Fix
Add immutable tags (Git commit SHA) and enforce tag-based deployment policies. Reject pipelines that use 'latest' tag. Also add a post-deployment smoke test that runs against the actual deployed endpoint.
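A pipeline guard for the tag policy might look like the sketch below. It assumes tag-based image references (not @sha256 digests) and treats any 7 to 40 character lowercase hex tag as a Git SHA. The function name validate_image_tag is hypothetical.

```bash
# Accept only image references tagged with a 7-40 character hex Git SHA.
# Assumes tag-based references; digest references are out of scope here.
validate_image_tag() (
  name=${1##*/}   # drop registry host/port and repository path
  case "$name" in
    *:*) tag=${name##*:} ;;
    *)   echo "rejected: no tag, which resolves to latest"; return 1 ;;
  esac
  case "$tag" in
    latest)         echo "rejected: mutable tag 'latest'"; return 1 ;;
    ''|*[!0-9a-f]*) echo "rejected: tag '$tag' is not a Git SHA"; return 1 ;;
  esac
  if [ "${#tag}" -lt 7 ] || [ "${#tag}" -gt 40 ]; then
    echo "rejected: tag '$tag' is not a 7-40 character SHA"; return 1
  fi
  echo "ok: $tag"
)
```

Stripping the path first matters: a registry port such as registry.example.com:5000/api would otherwise be mistaken for a tag.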
★ Quick Debug Cheat Sheet for DevOps Interviews
Memorise these commands and recovery steps. They'll prove you've actually been in production.
kubectl get pods shows CrashLoopBackOff.
Immediate action
Check logs of the crashing container.
Commands
kubectl logs <pod> --previous
kubectl describe pod <pod> | grep -A 10 'Last State'
Fix now
Fix the error and redeploy. If the app is merely slow to start, add a startup probe with a higher failureThreshold so the liveness probe doesn't kill it mid-boot.
Docker image build succeeds but container exits immediately.
Immediate action
Run the container locally with interactive shell to inspect.
Commands
docker run -it --entrypoint /bin/sh <image>
docker logs --tail 100 <container>
Fix now
Check the entrypoint script for missing dependencies or environment variables. If you use multi-stage builds, confirm the final stage copies every binary and library the app needs at runtime.
Terraform apply fails with state lock error.
Immediate action
Identify who holds the lock and decide to force unlock (only if safe).
Commands
terraform force-unlock <LOCK_ID>
terraform init (if backend config changed)
Fix now
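Use remote state with locking (for example, an S3 backend with a DynamoDB lock table) so concurrent applies fail fast instead of corrupting state. Reduce contention by running applies through a single CI pipeline and keeping teams in separate workspaces.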
DevOps vs Traditional Ops
Concept | Traditional Ops | DevOps / SRE
Deployment | Manual, infrequent, high-risk | Automated (CI/CD), frequent, low-risk
Infrastructure | Manual configuration (Snowflakes) | Infrastructure as Code (Reproducible)
Monitoring | Reactive (Check after it breaks) | Proactive (Observability, Metrics, Tracing)
Failure Handling | Blame-oriented culture (Root Cause: Alice) | Blameless Post-mortems (Root Cause: Process)
Scaling | Requesting hardware weeks in advance | Auto-scaling based on CPU/Memory/Traffic

Key takeaways

1
DevOps is the elimination of 'silos'—Dev, Ops, and QA work together through automated, shared pipelines.
2
Infrastructure as Code (Terraform) and Containerization (Docker) are the technical prerequisites for a modern, scalable system.
3
CI/CD is the heartbeat of DevOps, enabling 'fail fast' and 'fix fast' mentalities.
4
Observability (Prometheus/Grafana/ELK) is non-negotiable; you cannot manage what you do not measure.
5
Practice daily: the forge only works when it's hot 🔥
6
Blameless post-mortems are a cultural superpower: they turn failures into learning, not fear.

Common mistakes to avoid

4 patterns
×

The 'Tool-First' Fallacy

Symptom
Team adopts Jenkins, Docker, and Kubernetes but still operates in silos — developers throw code over the wall to ops. Pipeline exists but culture is unchanged.
Fix
Start with culture and processes: shared on-call, joint code reviews, common SLOs. Tools are enablers, not solutions.
×

Missing the 'Business Why'

Symptom
Candidate talks about automation without connecting it to business outcomes. Interviewer sees a lack of strategic thinking.
Fix
Always frame technical decisions in terms of time-to-market, cost, reliability. 'We automated deployment because downtime cost us $5k/min' is stronger than 'We automated because it's cool.'
×

The 'Black Hole' Pipeline

Symptom
Pipeline deploys code to production, but there's no monitoring or logging to confirm it's healthy. Developers find out about outages from users.
Fix
Every pipeline must include a post-deployment smoke test and feed metrics to an observability stack. If you can't prove it's working, it's not deployed.
×

Assuming 'Automation' Replaces Human Judgment

Symptom
Team automates everything, including approval gates. A bad deployment goes straight to production because the pipeline was trusted without verification.
Fix
Automate routine checks, but keep human-in-the-loop for risky decisions (e.g., canary rollout with manual promotion). Trust but verify.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
Explain the 'Three Ways' of DevOps (Feedback, Flow, and Continuous Learn...
Q02 · SENIOR
Describe a scenario where a deployment failed in production. How did you...
Q03 · SENIOR
What is 'GitOps,' and how does it differ from traditional CI/CD workflow...
Q04 · SENIOR
How do you decide when to use a cache in a microservices architecture?
Q05 · SENIOR
Explain the CAP theorem and how it influences database selection in a di...
Q01 of 05 · SENIOR

Explain the 'Three Ways' of DevOps (Feedback, Flow, and Continuous Learning) and how you've applied them in a past project.

ANSWER
The Three Ways come from The Phoenix Project. Flow: make work visible, limit WIP, reduce batch sizes. In practice, we broke down a monolithic deployment into per-service pipelines, reducing lead time from weeks to hours. Feedback: amplify feedback loops so problems are caught early. We implemented canary deployments that automatically roll back if error rate exceeds 1%. Continuous Learning: blameless post-mortems and regular game days. We ran a quarterly 'Chaos Monkey' day where we intentionally killed services to test resilience.
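The automatic-rollback rule in that answer comes down to one comparison. Below is a sketch of the decision using integer math only, so it runs in plain sh. The function name canary_decision and its arguments are assumptions; in practice the error and request counts would come from your metrics system.

```bash
# Decide what to do with a canary from its error and request counts.
# Arguments: errors, total requests, threshold percent (default 1).
# Prints "rollback" only when the error rate strictly exceeds the threshold.
canary_decision() (
  errors=$1
  total=$2
  threshold_percent=${3:-1}
  if [ "$total" -eq 0 ]; then
    echo "hold"   # no traffic yet, so there is nothing to judge
    return 0
  fi
  # errors/total > threshold/100  is equivalent to  errors*100 > threshold*total
  if [ $((errors * 100)) -gt $((threshold_percent * total)) ]; then
    echo "rollback"
  else
    echo "promote"
  fi
)
```

A pipeline could act on the result with something like [ "$(canary_decision "$ERRORS" "$TOTAL")" = rollback ] && kubectl rollout undo deployment/<name>. Cross-multiplying avoids floating point, and a rate of exactly 1% does not trigger a rollback because the rule says "exceeds".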
FAQ · 4 QUESTIONS

Frequently Asked Questions

01
What is the difference between Continuous Delivery and Continuous Deployment?
02
How do you manage 'secrets' (passwords/keys) in a CI/CD pipeline?
03
What is 'Blue-Green Deployment' and why is it used?
04
What is the role of a 'runbook' in incident management?