Mid-level 9 min · March 29, 2026

CI/CD Skipped Jobs — Why 'Success' Deploys Old Code

Skipped build jobs pass needs checks silently, deploying stale artifacts.

Naren · Founder
Plain-English first. Then code. Then the interview question.
 ● Production Incident 🔎 Debug Guide
Quick Answer
  • Order pipeline stages by execution speed, not importance — fail fast, fail cheap
  • Use healthchecks with depends_on for real readiness, not startup order
  • Mount secrets as files, not env vars — enables rotation without restarts
  • Track DORA metrics: deployment frequency, lead time, change failure rate, MTTR
  • Separate readiness and liveness probes — liveness checks only in-process health
  • Tag images with SHA — never :latest in production; enables precise rollback
Plain-English First

Think of your codebase like a commercial kitchen. Amateur cooks prep everything at the end of service, then panic when the plate's wrong. A Michelin-starred kitchen has a quality check at every single station — the prep cook, the saucier, the expeditor — so a bad dish never reaches the dining room. CI/CD is that station-by-station quality system for software. Every time a developer adds something to the kitchen, it gets tasted, checked, and plated automatically before a single customer sees it. The difference between a restaurant that survives and one that gets shut down by health inspectors is exactly that discipline.

A fintech team I worked with was deploying to production manually every two weeks. One Friday afternoon, a developer copy-pasted a database migration script into the wrong environment, wiped a staging database that was being used as a shadow clone of prod, and triggered a three-hour incident that nearly became a four-hour customer-facing outage. The root cause wasn't the mistake — humans make mistakes. The root cause was that there was no automated gate to catch it.

CI/CD isn't a tool. It's a philosophy that says 'the longer you wait to integrate and ship, the more expensive your mistakes get.' The average high-performing team deploys to production multiple times per day with a change failure rate under 5%. The average low-performing team deploys once a month and spends 40% of their engineering time on unplanned work — firefighting regressions, rolling back broken releases, and manually babysitting deployments. Those aren't necessarily different companies. They're often the same company, two years apart, before and after it got serious about CI/CD.

By the end of this article, you'll know exactly how to structure a pipeline that catches failures before they reach production, which quality gates actually matter and which ones slow you down for no gain, where pipelines break down at scale and what to do about it, and how to roll out changes without taking the whole system down. You won't just understand CI/CD — you'll be able to walk into an existing codebase and diagnose exactly why its pipeline is failing its team.

Pipeline Architecture: Why Most Teams Build It Backwards

Most teams design their CI pipeline by asking 'what checks should we run?' That's the wrong question. The right question is 'in what order should failures be discovered, and what's the cost of discovering them late?' Every stage of your pipeline is a trade-off between feedback speed and coverage depth. If you put your 45-minute integration test suite before your 30-second linter, you're making every developer wait 45 minutes to learn they forgot a semicolon. I've seen this kill developer velocity at a mid-size SaaS company — engineers stopped running checks locally and just pushed to let CI find the problems, which turned the pipeline into a batch job instead of a fast feedback loop.

The principle is fail fast, fail cheap. Your pipeline stages should be ordered by execution time, ascending. Linting and static analysis run first — they're near-instant and catch a massive proportion of bugs. Unit tests second. Integration tests third. End-to-end tests last, and gated behind a merge to a protected branch. Every stage that fails short-circuits the rest. You don't run a 30-minute E2E suite against a commit that failed a type check.

Here's a production-grade GitHub Actions pipeline for a Node.js checkout service that demonstrates this ordering. Notice the explicit stage dependencies and the parallelisation of independent checks — security scanning runs parallel to unit tests because they don't share state.

One addition to this order: include a quick 'dependency caching restore' step before the first gate. It takes seconds but saves minutes in later stages. A common trap is caching node_modules but not the Docker layers — that's separate. Also, don't cache everything blindly; cache only what actually reduces build time. Measure cache hit rates with a dashboard.
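As a sketch, here is an explicit cache-restore step that also surfaces the hit/miss signal you can ship to a dashboard. The step id and the metric name are illustrative; note that setup-node's cache: 'npm' option already does the npm half of this implicitly.

```yaml
# Hypothetical explicit cache step: exposes cache-hit so you can measure it
- name: Restore npm cache
  id: npm-cache
  uses: actions/cache@v4
  with:
    path: ~/.npm
    key: npm-${{ runner.os }}-${{ hashFiles('package-lock.json') }}

# Emit the hit/miss as a line your telemetry job can scrape into a metric.
# Docker layer caching is configured separately (see cache-from/cache-to
# in the build-and-push job) and needs its own hit-rate measurement.
- name: Record cache hit/miss
  run: echo "npm_cache_hit=${{ steps.npm-cache.outputs.cache-hit }}"
```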

Another nuance: the order of failure discovery should also consider blast radius. A linting failure affects only code style and minor bugs — cheap to fix. A security vulnerability in a dependency might require a team-wide update. An integration test failure might indicate a broken contract between services. Order by cost of failure as well as speed; cheap failures first, expensive ones after they're gated by cheap checks.

checkout-service-ci.yml (YAML)
# io.thecodeforge — DevOps tutorial
# CI pipeline for a checkout service — GitHub Actions
# Ordered by: speed (fastest gates first), then coverage depth
# Principle: catch cheap failures before running expensive ones

name: Checkout Service CI

on:
  push:
    branches: ['**']          # Run on every branch push, not just main
  pull_request:
    branches: [main, staging]  # Gate merges to protected branches

env:
  NODE_VERSION: '20.x'
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}/checkout-service

jobs:
  # Stage 1: Sub-60-second gates
  lint-and-typecheck:
    name: Lint & Type Check
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'npm'
      - name: Install dependencies
        run: npm ci
      - name: Run ESLint
        run: npm run lint
      - name: TypeScript type check
        run: npm run typecheck

  security-scan:
    name: Dependency Security Scan
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run npm audit
        run: npm audit --audit-level=high
      - name: SAST scan with Semgrep
        uses: returntocorp/semgrep-action@v1
        with:
          config: 'p/nodejs'

  # Stage 2: Unit tests (only if Stage 1 passes)
  unit-tests:
    name: Unit Tests
    runs-on: ubuntu-latest
    needs: [lint-and-typecheck, security-scan]
    steps:
      - uses: actions/checkout@v4
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'npm'
      - name: Install dependencies
        run: npm ci
      - name: Run unit tests with coverage
        run: npm run test:unit -- --coverage
        env:
          DATABASE_URL: 'sqlite::memory:'
          PAYMENT_GATEWAY_URL: 'http://localhost:9999'
      - name: Upload coverage report
        uses: actions/upload-artifact@v4
        with:
          name: coverage-report
          path: coverage/
          retention-days: 7
      - name: Enforce coverage threshold
        run: npx nyc check-coverage --lines 80 --functions 80 --branches 75

  # Stage 3: Integration tests (PRs to protected branches, plus pushes to main)
  integration-tests:
    name: Integration Tests
    runs-on: ubuntu-latest
    needs: [unit-tests]
    # Must also run on pushes to main: build-and-push needs this job, so if
    # it were skipped there, build-and-push would be silently skipped too
    if: github.event_name == 'pull_request' || github.ref == 'refs/heads/main'
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_DB: checkout_test
          POSTGRES_USER: checkout_app
          POSTGRES_PASSWORD: ${{ secrets.TEST_DB_PASSWORD }}
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
        ports:
          - 5432:5432
      redis:
        image: redis:7-alpine
        options: --health-cmd "redis-cli ping" --health-interval 10s
        ports:
          - 6379:6379
    steps:
      - uses: actions/checkout@v4
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'npm'
      - name: Install dependencies
        run: npm ci
      - name: Run database migrations
        run: npm run db:migrate
        env:
          DATABASE_URL: postgres://checkout_app:${{ secrets.TEST_DB_PASSWORD }}@localhost:5432/checkout_test
      - name: Run integration tests
        run: npm run test:integration
        env:
          DATABASE_URL: postgres://checkout_app:${{ secrets.TEST_DB_PASSWORD }}@localhost:5432/checkout_test
          REDIS_URL: redis://localhost:6379
          NODE_ENV: test

  # Stage 4: Build and push Docker image (only on merge to main)
  build-and-push:
    name: Build & Push Image
    runs-on: ubuntu-latest
    needs: [integration-tests]
    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    permissions:
      contents: read
      packages: write
    outputs:
      image-digest: ${{ steps.build.outputs.digest }}
    steps:
      - uses: actions/checkout@v4
      - name: Log in to container registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Extract metadata for image tags
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=sha,prefix=sha-
            type=raw,value=latest,enable={{is_default_branch}}
      - name: Build and push Docker image
        id: build
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=registry,ref=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:buildcache
          cache-to: type=registry,ref=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:buildcache,mode=max
Output
Workflow triggered on push to main
✓ lint-and-typecheck (23s)
✓ security-scan (41s) [parallel with lint]
✓ unit-tests (1m 12s) [87% line coverage — threshold: 80%]
✓ integration-tests (3m 44s) [14 tests passed, 0 failed]
✓ build-and-push (2m 08s) [Pushed: ghcr.io/org/checkout-service:sha-a3f91c2]
Total wall-clock time: 7m 48s
Pipeline result: SUCCESS
Image digest: sha256:d4f2a1b9c8e3f5...
Production Trap: The 'needs' Trap That Skips Stages Silently
If a job is skipped (not failed — skipped, because of an 'if' condition), jobs that 'need' it are also skipped by default, without failing the workflow. This means a build-and-push job can be silently skipped if integration tests were skipped, and your CD step might then deploy an image that was never built — i.e. the previous one. Fix it: include a status function like 'always()' in the dependent job's 'if' so the skip doesn't propagate automatically, then check results explicitly — 'if: always() && (needs.integration-tests.result == "success" || needs.integration-tests.result == "skipped")' — and be deliberate about which skips are acceptable.
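As a sketch, here is the guarded form applied to a build-and-push job like the one above; adapt the list of acceptable upstream outcomes to your own pipeline.

```yaml
  build-and-push:
    needs: [integration-tests]
    # always() stops the default skip-propagation; the explicit result
    # check then decides which upstream outcomes are acceptable
    if: >-
      always() &&
      needs.integration-tests.result == 'success' &&
      github.ref == 'refs/heads/main' &&
      github.event_name == 'push'
```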
Production Insight
The biggest pipeline slowdown isn't test execution — it's waiting for infrastructure to spin up.
Teams with 15+ minute pipelines see 40% longer cycle time.
Rule: keep the fast path under 5 minutes or developers will bypass it.
Another hidden sink: downloading dependencies from scratch. Cache npm and Docker layers.
Watch out for service containers that don't reuse build caches — each pipeline run might rebuild entire dependency trees.
Key Takeaway
Order stages by execution time ascending.
Fail fast, fail cheap.
Your lint check should never wait for your E2E suite to even start.
And if you can't trust your pipeline, your team will find ways around it — that's the real failure.
Pipeline Stage Ordering Decision Tree
If: Stage runs in under 60 seconds and is stateless
Use: Run first — failure short-circuits all downstream
If: Stage requires external services (DB, cache, API)
Use: Push it later — service startup time adds latency
If: Stage can run independently of other stages
Use: Run in parallel with other independent stages
If: Stage takes >10 minutes and is rarely triggered
Use: Gate behind merge to protected branch — not every commit

Deployment Strategies That Don't Gamble Your Entire User Base

Here's a mistake I've seen kill a Black Friday deployment: a team built a perfect CI pipeline, then wired it directly to 'deploy everything to all pods immediately.' The pipeline was green. The deployment destroyed a third of their order throughput because a new Redis connection pool configuration had a subtle bug that only surfaced under real production load patterns. Their rollback took 22 minutes because they had no deployment strategy — it was all or nothing.

High-performing teams don't choose between 'deploy' and 'don't deploy.' They choose how much of their traffic takes the risk first. Blue-green deployments, canary releases, and feature flags are the three weapons in this arsenal, and they solve different problems. Blue-green is great for infrastructure changes where you need a clean cutover. Canary is best for application changes where you want to validate behaviour under real traffic before full rollout. Feature flags are best for functionality that you want to decouple from deployment entirely — ship the code, turn on the feature later.

The Kubernetes deployment below shows a canary release pattern using weight-based traffic splitting. The key insight is that your health checks must be meaningful — a pod that returns 200 on '/health' but fails to process payments is worse than a pod that's down, because it poisons a percentage of your real user traffic silently.

A nuance that often gets missed: canary analysis must include business metrics, not just HTTP status. A canary can pass on raw success rate while serving wrong data — stale cached prices return 200s, not 5xxs. Include order completion rate or revenue per request in your analysis.

Another trap: rolling back a canary isn't always safe. If the canary has been running for hours and the stable version has since been updated, rolling back means deploying an older version that might have its own issues. Keep canary windows short or use blue-green for the rollback path.

checkout-canary-deployment.yml (YAML)
# io.thecodeforge — DevOps tutorial
# Canary deployment pattern for checkout service on Kubernetes
# Uses: Argo Rollouts for progressive delivery

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout-service
  namespace: payments
spec:
  replicas: 10
  selector:
    matchLabels:
      app: checkout-service
  template:
    metadata:
      labels:
        app: checkout-service
    spec:
      containers:
        - name: checkout-service
          image: ghcr.io/org/checkout-service:sha-a3f91c2
          ports:
            - containerPort: 3000
          resources:
            requests:
              memory: '256Mi'
              cpu: '250m'
            limits:
              memory: '512Mi'
              cpu: '500m'
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 3000
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health/live
              port: 3000
            initialDelaySeconds: 30
            periodSeconds: 10
            failureThreshold: 5
          env:
            - name: NODE_ENV
              value: production
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: checkout-service-secrets
                  key: database-url
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause:
            duration: 5m
        - analysis:
            templates:
              - templateName: checkout-success-rate
            args:
              - name: service-name
                value: checkout-service
        - setWeight: 30
        - pause:
            duration: 10m
        - analysis:
            templates:
              - templateName: checkout-success-rate
              - templateName: checkout-p99-latency
        - setWeight: 100
      # Note: autoPromotionEnabled applies only to the blueGreen strategy.
      # For canary, add an un-timed 'pause: {}' step if you want the final
      # promotion to require manual approval.

---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: checkout-success-rate
  namespace: payments
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 60s
      count: 5
      successCondition: result[0] >= 0.95
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc.cluster.local:9090
          query: |
            sum(rate(
              http_requests_total{
                service="{{args.service-name}}",
                status!~"5.."
              }[5m]
            ))
            /
            sum(rate(
              http_requests_total{
                service="{{args.service-name}}"
              }[5m]
            ))
Output
Rollout initiated: checkout-service → sha-a3f91c2
[Step 1/5] Weight: 10% → canary pods
Waiting 5m for traffic sample...
Analysis: checkout-success-rate
Evaluation 1/5: success_rate=0.983 ✓
Evaluation 2/5: success_rate=0.991 ✓
Evaluation 3/5: success_rate=0.979 ✓
Evaluation 4/5: success_rate=0.986 ✓
Evaluation 5/5: success_rate=0.994 ✓
Analysis PASSED ✓
[Step 2/5] Weight: 30% → canary pods
Waiting 10m for traffic sample...
Analysis: checkout-success-rate + checkout-p99-latency
success_rate=0.988 ✓ p99_latency=142ms ✓
Analysis PASSED ✓
[Step 3/5] Weight: 100% — Full rollout
All 10 replicas running sha-a3f91c2
Rollout COMPLETE ✓ Total time: 17m 23s
Never Do This: Using the Same Health Endpoint for Readiness and Liveness
I've seen teams wire both readinessProbe and livenessProbe to '/health' and then wonder why Kubernetes is killing healthy pods under load. If your liveness check includes a database ping, a slow DB will trigger a restart loop — Kubernetes kills the pod, restarts it, it's slow again, kills it again. Separate them: liveness checks only internal process health (event loop alive, no deadlock), readiness checks external dependencies. A pod can be live but not ready — that's exactly the state you want during a downstream outage.
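A minimal sketch of that separation in a Node.js service. The handler shapes and the checkDeps dependency probe are illustrative, not taken from a real checkout codebase.

```javascript
// Liveness: in-process health only. If this function runs at all, the
// event loop is alive. Never call external dependencies here.
function liveness() {
  return { status: 200, body: { status: 'live' } };
}

// Readiness: external dependencies. A 503 here removes the pod from the
// Service's endpoints without restarting it, which is exactly what you
// want during a downstream outage.
async function readiness(checkDeps) {
  try {
    await checkDeps(); // e.g. SELECT 1 against Postgres, PING against Redis
    return { status: 200, body: { status: 'ready' } };
  } catch (err) {
    return { status: 503, body: { status: 'not-ready', reason: err.message } };
  }
}

// Example: a live pod whose database is down stays live but not ready.
readiness(async () => { throw new Error('db down'); }).then((r) =>
  console.log(liveness().status, r.status) // 200 503
);
```

Wire liveness to /health/live and readiness to /health/ready in your HTTP router; the Rollout manifest above already points the two probes at those separate paths.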
Production Insight
A canary release that only checks HTTP status is blind to business-logic failures.
One team's canary passed at 99.5% success rate but the new code was returning stale cached prices — no 5xx, just wrong data.
Rule: include business-level metrics in canary analysis (e.g., order completion rate).
Another pitfall: canary windows that are too short miss rare error conditions triggered by daily batch jobs or peak traffic.
Key Takeaway
Blue-green for infra changes, canary for app code, feature flags for feature rollout.
Each strategy covers a different risk.
Pick based on what you're changing, not what's trendy.
And always pair deployment strategy with a rollback that can be executed faster than the original rollout.
Deployment Strategy Decision Tree
If: Changing infrastructure (DB upgrades, new load balancer config)
Use: Blue-green — instant cutover with clean failback
If: Releasing new application code with unknown impact
Use: Canary with automated analysis — validate under real traffic
If: Shipping a feature that needs to be toggled per user or segment
Use: Feature flags — decouple deployment from release
If: Database schema change that needs to be backward-compatible
Use: Expand-contract pattern alongside any deployment strategy

The Secrets and Config Management Problem Nobody Talks About Until It's Too Late

I once got called into an incident at midnight because a developer had rotated an API key in AWS Secrets Manager, the application read that secret only at startup, and none of the running pods picked up the new value. Everything looked fine until someone did a routine deployment. Newly rolled pods started with the new key while pods that hadn't yet been replaced kept the old one cached from startup, so half the fleet was talking to the payment gateway with the old key and half with the new. The gateway's duplicate-detection logic flagged the mismatched requests and started rejecting transactions. It took 40 minutes to figure out the problem was secret rotation, not the deployment itself.

Config and secrets management is where CI/CD pipelines quietly accumulate debt. Teams hardcode environment-specific values into their pipelines, or they inject secrets as plain environment variables in their Kubernetes manifests, or they forget to handle secret rotation without a full restart. All three of these will burn you.

The pattern that works: secrets live in a dedicated secrets manager (AWS Secrets Manager, HashiCorp Vault, or at minimum Kubernetes Secrets encrypted at rest). They're injected at runtime, not build time. Your application watches for secret rotation and reloads without a restart. Your CI pipeline never has access to production secrets — it uses short-lived OIDC tokens to assume the minimum necessary role.

A concrete technique: use External Secrets Operator to sync secrets from AWS to Kubernetes as mounted volumes. Your app can watch the file for changes and reload config without a restart. This avoids the split-brain scenario entirely.

Additionally, manage config separately from secrets. Use ConfigMaps for non-sensitive configuration like feature flags or API endpoints. That way, you can update config without needing to rotate secrets, and vice versa. And always set up a pre-deployment validation that checks whether the target environment has the required secrets before even attempting the deployment — fail loud, not silent.
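A minimal sketch of that split — the ConfigMap name and keys here are illustrative:

```yaml
# Non-sensitive config: safe to change without touching the secrets manager
apiVersion: v1
kind: ConfigMap
metadata:
  name: checkout-service-config
  namespace: payments
data:
  LOG_LEVEL: 'info'
  PAYMENT_GATEWAY_URL: 'https://gateway.example.com'
  FEATURE_NEW_PRICING: 'false'
```

Reference it from the Deployment via envFrom or a volume mount; rotating the payment-gateway key never forces a config change, and flipping a flag never touches Secrets Manager.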

checkout-secrets-pipeline.yml (YAML)
# io.thecodeforge — DevOps tutorial
# Secrets management pattern: GitHub Actions + AWS OIDC + Secrets Manager

name: Checkout Service CD

on:
  push:
    branches: [main]

permissions:
  id-token: write
  contents: read

jobs:
  deploy-to-staging:
    name: Deploy to Staging
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials via OIDC
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789:role/checkout-service-deploy-staging
          aws-region: eu-west-1
          role-session-name: checkout-service-deploy-${{ github.run_id }}

      - name: Validate secrets exist before deploying
        run: |
          aws secretsmanager describe-secret --secret-id checkout-service/staging/database-url --query 'Name' --output text
          aws secretsmanager describe-secret --secret-id checkout-service/staging/payment-gateway-key --query 'Name' --output text
          echo "All required secrets confirmed present in Secrets Manager"

      - name: Get kubeconfig for staging cluster
        run: |
          aws eks update-kubeconfig --region eu-west-1 --name payments-staging-cluster --alias staging

      - name: Sync secrets from AWS Secrets Manager to Kubernetes
        run: |
          kubectl apply -f - <<EOF
          apiVersion: external-secrets.io/v1beta1
          kind: ExternalSecret
          metadata:
            name: checkout-service-secrets
            namespace: payments
          spec:
            refreshInterval: 1h
            secretStoreRef:
              name: aws-secrets-manager
              kind: ClusterSecretStore
            target:
              name: checkout-service-secrets
              creationPolicy: Owner
            data:
              - secretKey: database-url
                remoteRef:
                  key: checkout-service/staging/database-url
              - secretKey: payment-gateway-key
                remoteRef:
                  key: checkout-service/staging/payment-gateway-key
          EOF

      - name: Deploy to staging via Argo Rollouts
        run: |
          # metadata-action's type=sha tags use the short SHA by default, so
          # deploy with the short form, not the full ${{ github.sha }}
          kubectl argo rollouts set image checkout-service \
            checkout-service=ghcr.io/org/checkout-service:sha-${GITHUB_SHA::7} \
            --namespace payments

      - name: Wait for rollout to complete
        run: |
          kubectl argo rollouts status checkout-service --namespace payments --timeout 10m

      - name: Run smoke tests against staging
        run: |
          npm run test:smoke -- --base-url https://checkout-staging.internal.example.com --timeout 30000
        env:
          SMOKE_TEST_API_KEY: ${{ secrets.STAGING_SMOKE_TEST_KEY }}
Output
Deploy to Staging — checkout-service sha-a3f91c2
✓ AWS OIDC authentication successful
Role: checkout-service-deploy-staging
Session expires: 2024-01-15T14:32:00Z (1 hour)
✓ Secret validation passed
checkout-service/staging/database-url [EXISTS]
checkout-service/staging/payment-gateway-key [EXISTS]
✓ Kubeconfig updated for cluster: payments-staging-cluster
✓ ExternalSecret synced
checkout-service-secrets updated in namespace payments
Next refresh: 2024-01-15T15:00:00Z
✓ Rollout initiated: sha-a3f91c2
Canary: 10% → Analysis passed → 30% → Analysis passed → 100%
Rollout complete in 14m 52s
✓ Smoke tests passed
POST /api/v1/checkout — 201 (143ms)
GET /api/v1/orders/{id} — 200 (67ms)
POST /api/v1/checkout/confirm — 200 (298ms)
3/3 smoke tests passed
Deployment result: SUCCESS
Senior Shortcut: Mount Secrets as Files, Not Environment Variables
Mount Kubernetes Secrets as volume files, not env vars. Env vars are captured at pod startup and never refresh. A file mounted from a Secret updates when the Secret updates (within kubelet's sync period, default 60s). Your app can use a file watcher to reload config without restarting. This is how you get secret rotation without downtime. The pattern: mount to '/run/secrets/payment-gateway-key', read with fs.readFileSync, watch with chokidar or inotify.
Production Insight
Secret rotation without a restart plan creates split-brain states — half the pods on new creds, half on old.
This is the #1 cause of 'my deployment broke but I didn't change any code' incidents.
Rule: either rotate with zero-downtime via file watchers, or orchestrate a phased restart.
Also, never use environment-specific secrets in your pipeline YAML — keep them in the external manager only.
Key Takeaway
Mount secrets as files, not env vars.
Use External Secrets Operator for auto-sync.
Your CI pipeline should never touch production secrets directly — use OIDC and least-privilege roles.
And validate secrets exist before each deploy, not after a pod crashes.
Secrets Management Strategy Decision Tree
If: Secrets need to rotate without pod restart
Use: Mount as volume files with a file watcher in the app
If: Secrets change rarely and restart is acceptable
Use: Kubernetes Secrets as env vars with a periodic pod restart
If: Using AWS/GCP/Azure secrets manager
Use: External Secrets Operator to sync into K8s as volume mounts
If: CI pipeline needs access to secrets
Use: OIDC with least-privilege IAM roles — never store long-lived cloud credentials in GitHub Secrets

Observability in the Pipeline: You Can't Fix What You Can't See

A pipeline that tells you 'build failed' is nearly useless. A pipeline that tells you 'integration test checkout_service_test.ts:143 — assertion failed: expected order status CONFIRMED, received PAYMENT_PENDING — flaky for 3 of last 5 runs on this branch — median test duration increased 40% this week' is a co-pilot. The gap between those two things is observability.

High-performing teams treat their pipelines as first-class systems with their own monitoring. They track pipeline duration by stage, test flakiness rates by test file, deployment frequency, change failure rate, and mean time to recovery. These are the four DORA metrics, and if you're not measuring them, you don't know if your DevOps practice is improving or just getting more complicated.
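As a toy illustration, here are two of those DORA metrics computed from a list of deployment records. The record shape (failed, recoveryMinutes) is invented for the example; in practice these fields come from your deploy and incident tooling.

```javascript
// Change failure rate: failed deploys / total deploys.
// MTTR: mean minutes from failure to restored service.
function doraMetrics(deploys) {
  const failures = deploys.filter((d) => d.failed);
  const changeFailureRate =
    deploys.length === 0 ? 0 : failures.length / deploys.length;
  const mttrMinutes =
    failures.length === 0
      ? 0
      : failures.reduce((sum, d) => sum + d.recoveryMinutes, 0) /
        failures.length;
  return { changeFailureRate, mttrMinutes };
}

const lastWeek = [
  { failed: false },
  { failed: true, recoveryMinutes: 30 },
  { failed: true, recoveryMinutes: 10 },
  { failed: false },
];
console.log(doraMetrics(lastWeek)); // { changeFailureRate: 0.5, mttrMinutes: 20 }
```

The other two DORA metrics, deployment frequency and lead time, fall out of the same records once you stamp each deploy with commit and deploy timestamps.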

Flaky tests are the silent killer of CI trust. Once developers start seeing random failures they learn to re-run pipelines instead of fixing failures. That habit means they also re-run real failures, which means bugs start shipping. I've seen teams with a 30% flakiness rate on their test suite who had essentially no CI — the pipeline was there but no one believed it. The fix isn't to delete the flaky tests. It's to quarantine them, track them in your issue tracker, and fix them with the same urgency you'd fix a production bug.

One more thing: alert on pipeline performance degradation. A pipeline that quietly grows from 8 minutes to 20 minutes over two weeks is a sign of accumulating technical debt. Put a dashboard up and page the team if the median duration crosses a threshold.

Also consider 'observability for rollbacks.' Track which SHA was deployed when, how long rollback took, and whether the rollback successfully restored the previous state. This data helps you tune your deployment strategy and set better SLOs for recovery time.

pipeline-observability.yml (YAML)
# io.thecodeforge — DevOps tutorial
# Pipeline observability: tracking DORA metrics and test flakiness

name: Pipeline Telemetry

on:
  workflow_run:
    workflows: ['Checkout Service CI', 'Checkout Service CD']
    types: [completed]

jobs:
  record-pipeline-metrics:
    name: Record Pipeline Metrics
    runs-on: ubuntu-latest
    steps:
      - name: Calculate pipeline duration and outcome
        id: metrics
        run: |
          WORKFLOW_NAME="${{ github.event.workflow_run.name }}"
          WORKFLOW_CONCLUSION="${{ github.event.workflow_run.conclusion }}"
          START_TIME="${{ github.event.workflow_run.run_started_at }}"
          END_TIME="${{ github.event.workflow_run.updated_at }}"
          START_EPOCH=$(date -d "$START_TIME" +%s)
          END_EPOCH=$(date -d "$END_TIME" +%s)
          DURATION_SECONDS=$((END_EPOCH - START_EPOCH))
          echo "workflow_name=$WORKFLOW_NAME" >> $GITHUB_OUTPUT
          echo "conclusion=$WORKFLOW_CONCLUSION" >> $GITHUB_OUTPUT
          echo "duration=$DURATION_SECONDS" >> $GITHUB_OUTPUT
          echo "branch=${{ github.event.workflow_run.head_branch }}" >> $GITHUB_OUTPUT
          echo "sha=${{ github.event.workflow_run.head_sha }}" >> $GITHUB_OUTPUT

      - name: Push metrics to Datadog
        run: |
          # Build the payload in a heredoc so $NOW expands; a $(date +%s)
          # inside a single-quoted -d string would be sent literally
          NOW=$(date +%s)
          cat > payload.json <<EOF
          {
            "series": [
              {
                "metric": "ci.pipeline.duration_seconds",
                "type": "gauge",
                "points": [[$NOW, ${{ steps.metrics.outputs.duration }}]],
                "tags": [
                  "workflow:${{ steps.metrics.outputs.workflow_name }}",
                  "conclusion:${{ steps.metrics.outputs.conclusion }}",
                  "branch:${{ steps.metrics.outputs.branch }}",
                  "service:checkout-service"
                ]
              },
              {
                "metric": "ci.pipeline.runs_total",
                "type": "count",
                "points": [[$NOW, 1]],
                "tags": [
                  "workflow:${{ steps.metrics.outputs.workflow_name }}",
                  "conclusion:${{ steps.metrics.outputs.conclusion }}",
                  "service:checkout-service"
                ]
              }
            ]
          }
          EOF
          curl -s -X POST "https://api.datadoghq.com/api/v1/series" \
            -H "Content-Type: application/json" \
            -H "DD-API-KEY: ${{ secrets.DATADOG_API_KEY }}" \
            -d @payload.json

      - name: Alert on repeated failures
        if: steps.metrics.outputs.conclusion == 'failure'
        run: |
          RECENT_FAILURES=$(curl -s "https://api.datadoghq.com/api/v1/query?from=$(date -d '1 hour ago' +%s)&to=$(date +%s)&query=sum:ci.pipeline.runs_total{service:checkout-service,conclusion:failure,branch:${{ steps.metrics.outputs.branch }}}.as_count()" \
            -H "DD-API-KEY: ${{ secrets.DATADOG_API_KEY }}" \
            -H "DD-APPLICATION-KEY: ${{ secrets.DATADOG_APP_KEY }}" \
            | jq '.series[0]?.points | length // 0')
          if [ "${RECENT_FAILURES:-0}" -ge 3 ]; then
            # Events API v2 authenticates via routing_key in the body; no auth header needed
            curl -X POST https://events.pagerduty.com/v2/enqueue \
              -H "Content-Type: application/json" \
              -d '{
                "routing_key": "${{ secrets.PAGERDUTY_ROUTING_KEY }}",
                "event_action": "trigger",
                "payload": {
                  "summary": "CI pipeline failing repeatedly: ${{ steps.metrics.outputs.workflow_name }} on ${{ steps.metrics.outputs.branch }}",
                  "severity": "warning",
                  "source": "github-actions",
                  "custom_details": {
                    "workflow": "${{ steps.metrics.outputs.workflow_name }}",
                    "branch": "${{ steps.metrics.outputs.branch }}",
                    "sha": "${{ steps.metrics.outputs.sha }}",
                    "failures_last_hour": "'$RECENT_FAILURES'",
                    "run_url": "${{ github.event.workflow_run.html_url }}"
                  }
                }
              }'
          fi
Output
Pipeline Telemetry Recorded for checkout-service CI #847
Duration: 7m 48s
Conclusion: success
Tags: workflow=Checkout Service CI, conclusion=success, branch=main, service=checkout-service
Metrics pushed to Datadog:
- ci.pipeline.duration_seconds: 468
- ci.pipeline.runs_total: 1
Failure alert check: 0 failures in last hour — no alert triggered.
Test flakiness report (separate job):
checkout_service_test.ts:143 — flaky: 3/10 runs failed in last 24h (threshold 5%)
Alert triggered: flaky test quarantined, ticket created.
The Hidden Cost of Pipeline Degradation
A pipeline that grows from 8 to 20 minutes over two weeks isn't just slower — it erodes development velocity and trust. Developers start rebasing before CI finishes, merging with outdated heads, or pushing directly to bypass checks. Set an alert on median pipeline duration. If it crosses 10 minutes, the team should drop everything to investigate. A 2-minute increase is a blip; a 12-minute increase is a disaster waiting to happen.
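A throwaway sketch of that alert rule (sample durations and the threshold are illustrative; real data would come from your CI metrics store): compute the median of recent run durations and compare it against the 10-minute line.

```shell
#!/usr/bin/env bash
# Illustrative only: durations would normally come from your CI metrics API.
median() {
  printf '%s\n' "$@" | sort -n | awk '
    { a[NR] = $1 }
    END {
      if (NR % 2) m = a[(NR + 1) / 2]
      else m = int((a[NR / 2] + a[NR / 2 + 1]) / 2)
      print m
    }'
}

DURATIONS=(480 510 495 720 505)          # last five runs, in seconds
M=$(median "${DURATIONS[@]}")
if [ "$M" -gt 600 ]; then
  echo "ALERT: median pipeline duration ${M}s exceeds 600s"
else
  echo "ok: median pipeline duration ${M}s"   # prints this: median is 505s
fi
```

Median rather than mean matters here: one pathological 40-minute run should not page anyone, but a sustained shift should.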
Production Insight
Flaky tests don't just slow you down — they destroy trust in the pipeline.
Once developers auto-retry without investigation, you've lost your safety net.
Rule: track flakiness per test file and alert when any single test fails >5% of the time.
Also, pipeline performance degradation is a leading indicator of technical debt — don't ignore it.
Key Takeaway
Measure pipeline duration by stage and flakiness by test.
Alert on repeated failures in the same branch.
If you're not tracking DORA metrics, you're flying blind.
Build rollback observability into your pipeline — you'll need it.
Pipeline Observability Decision Tree
  • If you have no pipeline metrics at all: start with pipeline duration and conclusion per workflow.
  • If developers are ignoring CI failures: add flakiness tracking and alert on repeated failures per branch.
  • If pipeline duration is increasing over time: add per-stage duration metrics and alert on regression.
  • If you want to measure DevOps effectiveness: track all four DORA metrics (deploy frequency, lead time, change failure rate, MTTR).

Artifact Management and Immutable Releases: Ensuring Traceability from Code to Production

I once debugged a production incident where the team couldn't tell which version of the code was running. The pod logs showed app version '1.2.3' but the git tag 'v1.2.3' had been moved twice. The build had been triggered from a different branch than the deployment thought. That three-hour post-mortem started with 'what code is actually deployed right now?' and no one could answer.

High-performing teams treat artifacts as immutable. Every build produces a uniquely identified artifact — typically a container image tagged with the git commit SHA, plus a signed attestation of the build metadata. Once pushed to the registry, that tag is never overwritten. Deployments reference the exact SHA, so you always know what's running. Rollback is trivial: just re-deploy a previous SHA.

The key rules: tag with SHA (not 'latest'), store build metadata (commit, build URL, trigger) as image labels, sign artifacts for supply chain security, and never rebuild a SHA — if you need to patch, cut a new commit and new SHA. This is the foundation of reproducibility.

One more rule many teams miss: include an SBOM (Software Bill of Materials) as part of the artifact. This lets you answer questions like 'which version of Log4j is running' in minutes, not days. Cosign can attach the SBOM to the registry entry.

Additionally, automate the promotion of immutable artifacts through environments. The same SHA that passed CI and tests in staging should be the exact SHA that goes to production — no recompilation, no 'latest' tag substitution. Use a promotion workflow that only changes the deployment manifest, never the artifact itself.

artifact-immutable-pipeline.yml (YAML)
# io.thecodeforge — DevOps tutorial
# Immutable artifact pipeline: every build produces a unique, signed, tagged image

name: Build Immutable Artifact

on:
  push:
    branches: [main]

jobs:
  build-and-sign:
    name: Build & Sign Image
    runs-on: ubuntu-latest
    permissions:
      contents: write  # the workflow commits the manifest update back to the repo
      packages: write
      id-token: write
    steps:
      - uses: actions/checkout@v4

      - name: Log in to container registry
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Generate unique build metadata
        id: meta
        run: |
          echo "BUILD_TIME=$(date -u +%Y-%m-%dT%H:%M:%SZ)" >> $GITHUB_OUTPUT
          echo "COMMIT=${{ github.sha }}" >> $GITHUB_OUTPUT
          echo "TRIGGER=${{ github.event_name }}" >> $GITHUB_OUTPUT
          echo "WORKFLOW=${{ github.workflow }}" >> $GITHUB_OUTPUT

      - name: Build and tag image with SHA
        id: build
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: |
            ghcr.io/org/app:sha-${{ github.sha }}
          labels: |
            org.opencontainers.image.source=${{ github.repository }}
            org.opencontainers.image.revision=${{ github.sha }}
            org.opencontainers.image.created=${{ steps.meta.outputs.BUILD_TIME }}
            io.thecodeforge.build.trigger=${{ steps.meta.outputs.TRIGGER }}
            io.thecodeforge.build.workflow=${{ steps.meta.outputs.WORKFLOW }}

      - name: Install cosign
        uses: sigstore/cosign-installer@v3

      - name: Sign the image with cosign (keyless)
        run: |
          cosign sign --yes \
            --annotations "commit=${{ github.sha }}" \
            --annotations "repo=${{ github.repository }}" \
            ghcr.io/org/app:sha-${{ github.sha }}

      - name: Generate SBOM from the pushed image
        uses: anchore/sbom-action@v0
        with:
          image: ghcr.io/org/app:sha-${{ github.sha }}
          format: spdx-json
          output-file: ${{ runner.temp }}/sbom.spdx.json

      - name: Attest SBOM to registry
        run: |
          cosign attest --yes \
            --type spdxjson \
            --predicate ${{ runner.temp }}/sbom.spdx.json \
            ghcr.io/org/app:sha-${{ github.sha }}

      - name: Update deployment manifest with new SHA
        run: |
          sed -i "s|image: ghcr.io/org/app:.*|image: ghcr.io/org/app:sha-${{ github.sha }}|g" k8s/overlays/production/deployment-patch.yaml
          git config user.name "CI Bot"
          git config user.email "bot@example.com"
          git add k8s/
          git commit -m "Auto-update image to sha-${{ github.sha }}"
          git push
Output
Build and sign completed for sha-a3f91c2
✓ Image built: ghcr.io/org/app:sha-a3f91c2
✓ Labels embedded:
- org.opencontainers.image.revision: a3f91c2
- io.thecodeforge.build.trigger: push
✓ Image signed with cosign (keyless)
✓ SBOM generated and attested
✓ K8s manifest updated to sha-a3f91c2
Artifact is immutable — never overwritten.
Rollback: change image tag to previous SHA.
Artifacts as Railway Tickets
  • The SHA is the serial number — you can always trace which train (commit) it came from.
  • 'Latest' is a reusable ticket that lets anyone board without proving identity — lose it.
  • Signatures are the ticket stamp — they prove the ticket was issued by the official authority (your build system).
  • SBOM is the passenger manifest — you know every dependency that came along for the ride.
  • Immutable means you never punch the same serial number twice — every ride is unique.
Production Insight
Teams that use 'latest' cannot roll back reliably — the tag moves with every deploy.
If a bad deploy goes out, 'latest' now points to the broken version, and rollback tries to re-deploy 'latest' which is still broken.
Rule: tag with SHA, never overwrite tags, and store full build provenance in image labels.
Also, if you're promoting artifacts across environments, make the promotion a copy operation (not a retag) to preserve immutability.
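One way to sketch that copy-based promotion as a workflow step, assuming `crane` from go-containerregistry and hypothetical registry paths (if you also need signatures and attestations carried over, `cosign copy` does that):

```yaml
# Hypothetical promotion step: copies the exact image between registries
# instead of rebuilding or retagging it.
- name: Promote staging image to production registry
  run: |
    crane copy \
      ghcr.io/org/app-staging:sha-${{ github.sha }} \
      ghcr.io/org/app-prod:sha-${{ github.sha }}
```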
Key Takeaway
Immutable artifacts are the bedrock of reproducible deployments.
Tag with SHA, sign the image, generate an SBOM.
If you can't answer 'what's running in production right now?' in under 30 seconds, you don't have artifact management.
Promote the same SHA through environments — never rebuild or retag.
Artifact Tagging Strategy Decision Tree
  • If you need precise rollback capability: tag with git commit SHA, never overwrite tags.
  • If you need supply chain security: sign images with cosign and attach an SBOM.
  • If you need to trace which build produced a running image: embed build metadata (commit, trigger, workflow) as image labels.
  • If you need to patch a released artifact: cut a new commit and new SHA, never rebuild an existing tag.
● Production Incident · POST-MORTEM · Severity: High

The Silent Deployment: How a Skipped Build Caused a 2-Hour Outage

Symptom
After a routine merge to main, the pipeline reported 'success' but the staging environment showed no new code. A day later, the production deployment went through — same pipeline, same 'success' label — but the new feature was missing. Customers started seeing outdated checkout flows and payment errors.
Assumption
The team assumed that if the pipeline passes and the rollout completes, the new code must be running. They also assumed that 'needs' dependencies in GitHub Actions would fail the pipeline if a required job was skipped.
Root cause
The build-and-push job was guarded by if: github.ref == 'refs/heads/main' && github.event_name == 'push'. For PR merges, the event is pull_request on the merge commit, not push. The build job was skipped. The deploy job had needs: [build-and-push] — but because the build was skipped (not failed), the deploy job ran anyway using the old image tag. The 'latest' tag had already been moved by a previous successful build.
Fix
Changed the build trigger to also run on pull_request events (or use always() with explicit status checks). Added a check in the deploy job to verify that the image digest actually changed from the previous deployment. Added a smoke test that validates a specific version endpoint exposed by the application.
Key lesson
  • A skipped job is not a failed job — needs doesn't protect you from skips.
  • Use explicit if: needs.build.result == 'success' in downstream jobs.
  • Always validate the deployed artifact: check its hash, version, or commit SHA post-deployment.
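The key lessons above can be sketched as a job fragment (job names follow the post-mortem; everything else is assumed):

```yaml
deploy:
  needs: [build-and-push]
  # needs alone does not fail the pipeline when the build is skipped;
  # this condition makes deploy run only on a real success.
  if: ${{ !cancelled() && needs.build-and-push.result == 'success' }}
  runs-on: ubuntu-latest
  steps:
    - name: Deploy the freshly built image
      run: ./scripts/deploy.sh "sha-${{ github.sha }}"   # hypothetical script
```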
Production Debug Guide · Common symptoms and the exact actions to take when your pipeline lies to you · 5 entries
Symptom · 01
Pipeline reports success but no changes appear in the environment
Fix
Check the image tag in the running pod (kubectl get pod -o yaml | grep image). Compare with the expected SHA from the build. If they match, check if the application cache is stale. If they don't match, look for a skipped build job or a misplaced 'if' condition.
Symptom · 02
Deployment rollout hangs at 0% progress
Fix
Check pod events: kubectl describe pod. Look for ImagePullBackOff or CrashLoopBackOff. Verify the registry credentials are correct and the image exists. Check node capacity with kubectl describe node.
Symptom · 03
Secrets missing in the running container despite pipeline success
Fix
Check if the secret exists in the namespace: kubectl get secrets. If it's an ExternalSecret, check the operator logs. Verify the secret key names match what the deployment expects. If using env vars, note that they don't update on rotation — consider switching to volume mounts.
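A Kubernetes fragment of that volume-mount approach (secret name, container name, and mount path are assumptions); the kubelet refreshes the mounted files when the secret rotates, which env vars never do:

```yaml
# Hypothetical pod spec fragment: the secret arrives as files under /etc/secrets.
spec:
  volumes:
    - name: app-secrets
      secret:
        secretName: checkout-secrets
  containers:
    - name: api
      image: ghcr.io/org/app:sha-a3f91c2
      volumeMounts:
        - name: app-secrets
          mountPath: /etc/secrets
          readOnly: true
```

Note the usual caveat: mounting with `subPath` disables the automatic refresh.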
Symptom · 04
Flaky test failures that disappear on retry
Fix
Quarantine the test immediately — mark it as flaky in your test framework. Create a Jira ticket and assign it. Check if the test has any shared mutable state, timing dependencies, or relies on real network calls. After quarantine, run the test 100 times locally to confirm root cause.
Symptom · 05
Pipeline duration has doubled over the last week
Fix
Look at stage-level duration logs. Likely a new heavy integration test or an inefficient build cache. Check if npm ci is being used or if the package-lock.json changed. Examine Docker layer caching — builds may be re-downloading base layers if cache-from is misconfigured.
★ CI/CD Quick Debug Cheat SheetThe three most common pipeline failures and how to fix them in under 5 minutes
Deployed app doesn't reflect the latest commit
Immediate action
Check pod image tag and compare with expected build SHA
Commands
kubectl get pods -n <ns> -o jsonpath='{.items[0].spec.containers[0].image}'
Check the build log for the pushed image digest: grep 'digest:' build.log
Fix now
If the image is wrong, trigger a manual rebuild: gh workflow run deploy.yml. If the deployment used 'latest', recreate the pod with the correct SHA-tagged image.
Pipeline fails with 'connection refused' for database
Immediate action
Check if the service container is healthy, not just started
Commands
docker compose ps --all | grep db | grep -q healthy; echo $?
docker compose logs db | tail -20
Fix now
Add healthcheck to the database service and use condition: service_healthy in the depends_on block. Run the pipeline again.
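That fix looks roughly like this in docker-compose (service names and credentials are placeholders):

```yaml
services:
  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: example        # placeholder only
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 3s
      retries: 10
  api:
    build: .
    depends_on:
      db:
        condition: service_healthy       # wait for readiness, not just startup
```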
Test flakiness causing random CI failures
Immediate action
Isolate the flaky test, don't just retry
Commands
for i in $(seq 1 50); do npx jest --testPathPattern=<flaky_file> 2>&1 | grep -E 'PASS|FAIL'; done   # jest has no --repeat flag; loop instead
Check test isolation: look for shared mutable state between tests
Fix now
Add @flaky marker to the test, set test framework to retry 2 times max, create ticket to fix within 2 sprints. Meanwhile, add a flakiness threshold in CI that alerts but doesn't block the whole pipeline.
CI/CD Pipeline Strategies Comparison
  • Blue-Green: best for infrastructure changes and DB upgrades · rollback: instant (DNS switch) · traffic impact: zero-downtime · complexity: medium
  • Canary: best for application code with unknown impact · rollback: gradual (traffic rebalance) · traffic impact: partial exposure · complexity: high
  • Feature Flags: best for decoupling deployment from release · rollback: instant (toggle off) · traffic impact: zero-downtime · complexity: low
  • Rolling Update: best for standard app updates with minimal risk · rollback: progressive · traffic impact: minimal · complexity: low
  • Shadow Deployment: best for validating new versions with mirrored traffic · rollback: none needed · traffic impact: none · complexity: very high

Key takeaways

1. Order pipeline stages by execution time: catch cheap failures first, fail fast and cheap.
2. Use canary deployments with automated business-level analysis for application changes.
3. Mount secrets as files, not env vars, and validate their existence before deploying.
4. Track DORA metrics and pipeline duration trends: alert on degradation before trust erodes.
5. Tag every artifact with its git commit SHA and sign it: never use :latest in production.
6. A skipped job is not a failed job: add explicit status checks in downstream stages.

Common mistakes to avoid

5 patterns

Using depends_on without a healthcheck

Symptom
API crashes on startup with ECONNREFUSED because the database container started but is not yet ready to accept connections.
Fix
Add a healthcheck block to the database service using pg_isready, then use condition: service_healthy in the API depends_on block.

Storing secrets as environment variables in the pipeline YAML

Symptom
Secret rotation requires a full pipeline restart; secrets leaked in logs or build artifacts.
Fix
Use OIDC-based authentication to pull secrets from a vault at deploy time, and mount them as files in the container.

Using the :latest tag for production deployments

Symptom
Cannot roll back reliably because :latest points to the broken version; unknown which commit is actually running.
Fix
Tag every image with its git commit SHA. Never overwrite tags. Use SHA for all production deployments.

Putting long-running E2E tests before fast linting checks

Symptom
Developers wait 30+ minutes to discover a missing semicolon; they start bypassing the pipeline.
Fix
Order pipeline stages by execution time ascending. Lint and type-check first, unit tests second, integration tests third, E2E last.

Not separating readiness and liveness probes

Symptom
Kubernetes kills healthy pods under load because the liveness probe includes a database check that times out during a slow backend.
Fix
Use separate endpoints: /health/live for internal process health only, /health/ready for dependency checks. A pod can be live but not ready.
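A sketch of those two probes on a container (paths, port, and thresholds are assumptions):

```yaml
# /health/live checks only in-process health; a dead database must not
# get the pod killed. /health/ready gates whether traffic is routed here.
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  periodSeconds: 5
  failureThreshold: 2
```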
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR: What are the four DORA metrics and why do they matter?
Q02 · SENIOR: How do you handle secret rotation in a CI/CD pipeline without causing do...
Q03 · SENIOR: Explain the difference between a skipped job and a failed job in GitHub ...
Q04 · SENIOR: When would you choose a canary deployment over a blue-green deployment?
Q05 · SENIOR: What steps would you take to fix a flaky test that is causing random CI ...
Q01 of 05 · SENIOR

What are the four DORA metrics and why do they matter?

ANSWER
DORA metrics are: Deployment Frequency (how often you deploy to production), Lead Time for Changes (time from commit to production), Change Failure Rate (percentage of deployments causing failures), and Mean Time to Recovery (time to restore service after a failure). They matter because they provide a standardised way to measure DevOps performance. High-performing teams deploy multiple times per day with a change failure rate under 5%, while low performers deploy monthly with higher failure rates. Tracking these metrics tells you whether your CI/CD improvements actually work.
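As a worked example of the change failure rate arithmetic (the counts are made up):

```shell
# Change failure rate = 100 * failed deployments / total deployments.
# 2 failures out of 40 deploys lands exactly on the 5% high-performer line.
DEPLOYS=40
FAILED=2
CFR=$(( 100 * FAILED / DEPLOYS ))
echo "change failure rate: ${CFR}%"    # prints: change failure rate: 5%
```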
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
Should I run integration tests on every branch push?
02
How do I set up healthchecks in Docker Compose for CI/CD?
03
What's the fastest way to debug a deployment that didn't pick up the latest code?
04
Why is it dangerous to use :latest in production deployments?
05
How do I handle database migrations in a CI/CD pipeline without downtime?
🔥

That's CI/CD. Mark it forged?

9 min read · try the examples if you haven't

Previous: Rolling Deployments
14 / 14 · CI/CD
Next: Cloud Computing Explained: Models, Services, and Real-World Architecture