
CI/CD Best Practices: What High-Performing DevOps Teams Do Differently

πŸ“ Part of: CI/CD β†’ Topic 14 of 14
CI/CD best practices that separate elite DevOps teams from the rest: real pipeline patterns, failure modes, and production trade-offs explained.
βš™οΈ Intermediate β€” basic DevOps knowledge assumed
In this tutorial, you'll learn:
  • Order pipeline stages by execution time ascending, not by importance. Failing a 45-minute integration suite before a 30-second type check burns compute and developer patience for no reason.
  • The moment your team starts re-running pipelines without investigating failures, your CI is dead: you have a retry button, not a quality gate. Flakiness tracking and mandatory failure investigation are cultural and technical requirements, not optional extras.
  • Never use :latest in production Kubernetes deployments. SHA-tagged images are the only way to get meaningful rollbacks, reproducible environments, and an audit trail that doesn't lie to you at 2am.
⚡ Quick Answer
Think of your codebase like a commercial kitchen. Amateur cooks prep everything at the end of service, then panic when the plate's wrong. A Michelin-starred kitchen has a quality check at every single station β€” the prep cook, the saucier, the expeditor β€” so a bad dish never reaches the dining room. CI/CD is that station-by-station quality system for software. Every time a developer adds something to the kitchen, it gets tasted, checked, and plated automatically before a single customer sees it. The difference between a restaurant that survives and one that gets shut down by health inspectors is exactly that discipline.

A fintech team I worked with was deploying to production manually every two weeks. One Friday afternoon, a developer copy-pasted a database migration script into the wrong environment, wiped a staging database that was being used as a shadow clone of prod, and triggered a three-hour incident that nearly became a four-hour customer-facing outage. The root cause wasn't the mistake; humans make mistakes. The root cause was that there was no automated gate to catch it.

CI/CD isn't a tool. It's a philosophy that says 'the longer you wait to integrate and ship, the more expensive your mistakes get.' The average high-performing team deploys to production multiple times per day with a change failure rate under 5%. The average low-performing team deploys once a month and spends 40% of their engineering time on unplanned work: firefighting regressions, rolling back broken releases, and manually babysitting deployments. Those aren't different companies. They're the same company, two years apart, after one of them got serious about CI/CD.

By the end of this article, you'll know exactly how to structure a pipeline that catches failures before they reach production, which quality gates actually matter and which ones slow you down for no gain, where pipelines break down at scale and what to do about it, and how to roll out changes without taking the whole system down. You won't just understand CI/CD; you'll be able to walk into an existing codebase and diagnose exactly why its pipeline is failing its team.

Pipeline Architecture: Why Most Teams Build It Backwards

Most teams design their CI pipeline by asking 'what checks should we run?' That's the wrong question. The right question is 'in what order should failures be discovered, and what's the cost of discovering them late?' Every stage of your pipeline is a trade-off between feedback speed and coverage depth. If you put your 45-minute integration test suite before your 30-second linter, you're making every developer wait 45 minutes to learn they forgot a semicolon. I've seen this kill developer velocity at a mid-size SaaS company: engineers stopped running checks locally and just pushed to let CI run them, which turned the pipeline into a batch job instead of a fast feedback loop.

The principle is fail fast, fail cheap. Your pipeline stages should be ordered by execution time, ascending. Linting and static analysis run first; they're near-instant and catch a massive proportion of bugs. Unit tests second. Integration tests third. End-to-end tests last, gated behind a merge to a protected branch. Every stage that fails short-circuits the rest. You don't run a 30-minute E2E suite against a commit that failed a type check.

Here's a production-grade GitHub Actions pipeline for a Node.js checkout service that demonstrates this ordering. Notice the explicit stage dependencies and the parallelisation of independent checks: security scanning runs in parallel with linting and type checking because they don't share state.

checkout-service-ci.yml · YAML
# io.thecodeforge - DevOps tutorial
# CI pipeline for a checkout service (GitHub Actions)
# Ordered by: speed (fastest gates first), then coverage depth
# Principle: catch cheap failures before running expensive ones

name: Checkout Service CI

on:
  push:
    branches: ['**']          # Run on every branch push, not just main
  pull_request:
    branches: [main, staging]  # Gate merges to protected branches

env:
  NODE_VERSION: '20.x'
  # Store non-secret config at the workflow level so every job inherits it
  # Secrets come from GitHub Secrets - never hardcode them here
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}/checkout-service

jobs:

  # ─────────────────────────────────────────────
  # STAGE 1: Sub-60-second gates
  # If these fail, nothing else runs. No point burning compute.
  # ─────────────────────────────────────────────
  lint-and-typecheck:
    name: Lint & Type Check
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'npm'           # Cache node_modules by package-lock.json hash

      - name: Install dependencies
        run: npm ci              # ci installs exactly what's in package-lock - no surprises

      - name: Run ESLint
        run: npm run lint        # Fail fast: exit code 1 kills the job immediately

      - name: TypeScript type check
        run: npm run typecheck   # Separate from build - catches type errors without emitting JS

  # ─────────────────────────────────────────────
  # STAGE 1 (parallel): Security scan
  # Runs in parallel with lint - independent concern, same time budget
  # ─────────────────────────────────────────────
  security-scan:
    name: Dependency Security Scan
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run npm audit
        # --audit-level=high: fail on HIGH or CRITICAL vulns only
        # Don't fail on moderate - you'll be blocked forever on transitive deps
        run: npm audit --audit-level=high

      - name: SAST scan with Semgrep
        uses: returntocorp/semgrep-action@v1
        with:
          config: 'p/nodejs'    # Use the Node.js security ruleset - not the generic one

  # ─────────────────────────────────────────────
  # STAGE 2: Unit tests
  # Only runs if Stage 1 passes - needs: enforces the dependency
  # ─────────────────────────────────────────────
  unit-tests:
    name: Unit Tests
    runs-on: ubuntu-latest
    needs: [lint-and-typecheck, security-scan]  # Both Stage 1 jobs must pass
    steps:
      - uses: actions/checkout@v4

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'npm'

      - name: Install dependencies
        run: npm ci

      - name: Run unit tests with coverage
        run: npm run test:unit -- --coverage
        env:
          # Unit tests must NOT touch real external services
          # These point to in-memory fakes, not real infra
          DATABASE_URL: 'sqlite::memory:'
          PAYMENT_GATEWAY_URL: 'http://localhost:9999'  # Wiremock stub

      - name: Upload coverage report
        uses: actions/upload-artifact@v4
        with:
          name: coverage-report
          path: coverage/
          retention-days: 7     # Don't keep forever - storage costs add up

      - name: Enforce coverage threshold
        # Fail the pipeline if coverage drops below 80%
        # Don't gate on 100% - it incentivises writing useless tests
        run: npx nyc check-coverage --lines 80 --functions 80 --branches 75

  # ─────────────────────────────────────────────
  # STAGE 3: Integration tests
  # Spins up real dependencies via GitHub Actions service containers
  # Runs on PRs and on pushes to main - too slow for every feature-branch push
  # ─────────────────────────────────────────────
  integration-tests:
    name: Integration Tests
    runs-on: ubuntu-latest
    needs: [unit-tests]
    # Gate integration tests to PRs and to main itself - not every feature branch push.
    # This is the speed vs. coverage trade-off in action. Running them on main keeps
    # the downstream build-and-push job from being silently skipped.
    if: github.event_name == 'pull_request' || github.ref == 'refs/heads/main'
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_DB: checkout_test
          POSTGRES_USER: checkout_app
          POSTGRES_PASSWORD: ${{ secrets.TEST_DB_PASSWORD }}
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5     # Wait until Postgres is actually ready, not just started
        ports:
          - 5432:5432

      redis:
        image: redis:7-alpine
        options: --health-cmd "redis-cli ping" --health-interval 10s
        ports:
          - 6379:6379

    steps:
      - uses: actions/checkout@v4

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'npm'

      - name: Install dependencies
        run: npm ci

      - name: Run database migrations
        run: npm run db:migrate
        env:
          DATABASE_URL: postgres://checkout_app:${{ secrets.TEST_DB_PASSWORD }}@localhost:5432/checkout_test

      - name: Run integration tests
        run: npm run test:integration
        env:
          DATABASE_URL: postgres://checkout_app:${{ secrets.TEST_DB_PASSWORD }}@localhost:5432/checkout_test
          REDIS_URL: redis://localhost:6379
          NODE_ENV: test

  # ─────────────────────────────────────────────
  # STAGE 4: Build and push Docker image
  # Only on merge to main - you don't want an image per commit on feature branches
  # ─────────────────────────────────────────────
  build-and-push:
    name: Build & Push Image
    runs-on: ubuntu-latest
    needs: [integration-tests]
    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    permissions:
      contents: read
      packages: write           # Required to push to GitHub Container Registry
    outputs:
      image-digest: ${{ steps.build.outputs.digest }}  # Pass digest to deploy job
    steps:
      - uses: actions/checkout@v4

      - name: Log in to container registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Extract metadata for image tags
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=sha,prefix=sha-          # Tag with the short git SHA - enables precise rollbacks
            type=raw,value=latest,enable={{is_default_branch}}

      - name: Build and push Docker image
        id: build
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          # Cache layers from the registry - dramatically speeds up builds
          # Without this, every build re-downloads all base layers
          cache-from: type=registry,ref=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:buildcache
          cache-to: type=registry,ref=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:buildcache,mode=max
▶ Output
Workflow triggered on push to main

✓ lint-and-typecheck (23s)
✓ security-scan (41s) [parallel with lint]
✓ unit-tests (1m 12s) [87% line coverage, threshold: 80%]
✓ integration-tests (3m 44s) [14 tests passed, 0 failed]
✓ build-and-push (2m 08s) [Pushed: ghcr.io/org/checkout-service:sha-a3f91c2]

Total wall-clock time: 7m 48s
Pipeline result: SUCCESS
Image digest: sha256:d4f2a1b9c8e3f5...
⚠️
Production Trap: The 'needs' Trap That Skips Stages Silently
If a job is skipped (not failed - skipped, because of an 'if' condition), jobs that 'need' it will also be skipped by default, without failing. This means a build-and-push job can be silently skipped if integration tests were skipped, and your CD step might try to deploy an image that was never built. Fix it: use 'if: always()' combined with explicit status checks - "if: always() && (needs.integration-tests.result == 'success' || needs.integration-tests.result == 'skipped')" - and be deliberate about which skips are acceptable.
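As a sketch, that guard looks like this in a workflow job. The job and step bodies here are illustrative placeholders, not the full build job from the pipeline above:

```yaml
# Hypothetical job showing the skip-safe guard described in the trap above.
# always() forces this job's condition to be evaluated even when a needed
# job was skipped; the explicit result checks then decide whether to run.
build-and-push:
  runs-on: ubuntu-latest
  needs: [integration-tests]
  if: >-
    always() &&
    github.ref == 'refs/heads/main' &&
    (needs.integration-tests.result == 'success' ||
     needs.integration-tests.result == 'skipped')
  steps:
    - run: echo "Building - integration tests passed or were deliberately skipped"
```

The deliberate part is the 'skipped' clause: include it only for skips you have consciously decided are acceptable, otherwise require 'success' alone.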

Deployment Strategies That Don't Gamble Your Entire User Base

Here's a mistake I've seen kill a Black Friday deployment: a team built a perfect CI pipeline, then wired it directly to 'deploy everything to all pods immediately.' The pipeline was green. The deployment destroyed a third of their order throughput because a new Redis connection pool configuration had a subtle bug that only surfaced under real production load patterns. Their rollback took 22 minutes because they had no deployment strategy; it was all or nothing.

High-performing teams don't choose between 'deploy' and 'don't deploy.' They choose how much of their traffic takes the risk first. Blue-green deployments, canary releases, and feature flags are the three weapons in this arsenal, and they solve different problems. Blue-green is great for infrastructure changes where you need a clean cutover. Canary is best for application changes where you want to validate behaviour under real traffic before full rollout. Feature flags are best for functionality that you want to decouple from deployment entirely: ship the code, turn on the feature later.
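For contrast with the canary manifest that follows, here is a minimal sketch of the blue-green option as an Argo Rollouts strategy block. The service names are illustrative assumptions; they'd need matching Service resources:

```yaml
# Hypothetical blue-green strategy for the same Rollout: a clean cutover,
# no traffic weighting. All traffic flips from the old ReplicaSet to the
# new one in a single step once the preview stack has been verified.
strategy:
  blueGreen:
    activeService: checkout-service-active    # receives production traffic
    previewService: checkout-service-preview  # receives none until promotion
    autoPromotionEnabled: false    # a human (or an analysis run) flips the switch
    scaleDownDelaySeconds: 300     # keep the old stack warm for 5m - instant rollback
```

This is why blue-green suits infrastructure changes: there is never a moment when two versions serve production traffic simultaneously, which canary weighting inherently requires.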

The Kubernetes deployment below shows a canary release pattern using weight-based traffic splitting. The key insight is that your health checks must be meaningful: a pod that returns 200 on '/health' but fails to process payments is worse than a pod that's down, because it silently poisons a percentage of your real user traffic.

checkout-canary-deployment.yml · YAML
# io.thecodeforge - DevOps tutorial
# Canary deployment pattern for checkout service on Kubernetes
# Uses: Argo Rollouts for progressive delivery
# Why Argo Rollouts over plain Kubernetes Deployment:
# Plain Deployments have no concept of traffic weighting or analysis steps.
# You need a controller that understands progressive delivery semantics.

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout-service
  namespace: payments
spec:
  replicas: 10                  # Total desired replica count at full rollout
  selector:
    matchLabels:
      app: checkout-service

  template:
    metadata:
      labels:
        app: checkout-service
    spec:
      containers:
        - name: checkout-service
          image: ghcr.io/org/checkout-service:sha-a3f91c2  # Always pin to a specific SHA
          # Never use :latest in production - you lose reproducibility and rollback clarity

          ports:
            - containerPort: 3000

          # Resource requests AND limits β€” both required
          # Without requests, the scheduler can't make good placement decisions
          # Without limits, one bad pod can starve its neighbours
          resources:
            requests:
              memory: '256Mi'
              cpu: '250m'
            limits:
              memory: '512Mi'
              cpu: '500m'

          readinessProbe:
            httpGet:
              path: /health/ready   # Readiness != Liveness. Readiness means 'send me traffic'
              port: 3000
            initialDelaySeconds: 10  # Give the app time to initialise DB connections
            periodSeconds: 5
            failureThreshold: 3      # 3 consecutive failures = remove from load balancer

          livenessProbe:
            httpGet:
              path: /health/live     # Liveness: 'am I deadlocked or otherwise unrecoverable?'
              port: 3000
            initialDelaySeconds: 30  # Longer delay - a restart loop is worse than being slow
            periodSeconds: 10
            failureThreshold: 5

          env:
            - name: NODE_ENV
              value: production
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: checkout-service-secrets
                  key: database-url   # Pull from Kubernetes Secret, never from env literal

  strategy:
    canary:
      # The canary steps define the progressive rollout
      # Argo pauses at each step, runs analysis, then proceeds or aborts
      steps:
        - setWeight: 10           # Step 1: Send 10% of traffic to new version
        - pause:
            duration: 5m          # Wait 5 minutes - enough for p99 latency to show anomalies

        - analysis:               # Automated analysis before proceeding - this is the key gate
            templates:
              - templateName: checkout-success-rate
            args:
              - name: service-name
                value: checkout-service

        - setWeight: 30           # Step 2: Increase to 30% only if analysis passed
        - pause:
            duration: 10m

        - analysis:
            templates:
              - templateName: checkout-success-rate
              - templateName: checkout-p99-latency  # Check latency separately - success rate can hide slowdowns

        - setWeight: 100          # Full rollout only after both analysis steps pass

      # Automatic rollback: if any analysis step fails, Argo aborts the rollout
      # and shifts traffic back to the stable ReplicaSet automatically.
      # Note: autoPromotionEnabled is a blueGreen-only field. In a canary strategy,
      # an un-timed '- pause: {}' step is how you require manual promotion.

---
# AnalysisTemplate defines WHAT to measure during canary steps
# This queries your metrics backend (Prometheus) for real production signals
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: checkout-success-rate
  namespace: payments
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 60s             # Evaluate every 60 seconds during the pause window
      count: 5                  # Must get 5 consecutive passing evaluations
      successCondition: result[0] >= 0.95   # 95% success rate minimum
      failureLimit: 1           # One failure triggers automatic rollback
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc.cluster.local:9090
          query: |
            sum(rate(
              http_requests_total{
                service="{{args.service-name}}",
                status!~"5.."
              }[5m]
            ))
            /
            sum(rate(
              http_requests_total{
                service="{{args.service-name}}"
              }[5m]
            ))
            # This gives you: (non-5xx requests) / (all requests) = success rate
            # Note: as written this measures the whole service, canary and stable
            # combined. To isolate the canary pods, also match on the
            # rollouts-pod-template-hash label that Argo injects into them.
▶ Output
Rollout initiated: checkout-service → sha-a3f91c2

[Step 1/5] Weight: 10% → canary pods
Waiting 5m for traffic sample...
Analysis: checkout-success-rate
Evaluation 1/5: success_rate=0.983 ✓
Evaluation 2/5: success_rate=0.991 ✓
Evaluation 3/5: success_rate=0.979 ✓
Evaluation 4/5: success_rate=0.986 ✓
Evaluation 5/5: success_rate=0.994 ✓
Analysis PASSED ✓

[Step 2/5] Weight: 30% → canary pods
Waiting 10m for traffic sample...
Analysis: checkout-success-rate + checkout-p99-latency
success_rate=0.988 ✓ p99_latency=142ms ✓
Analysis PASSED ✓

[Step 3/5] Weight: 100% - Full rollout
All 10 replicas running sha-a3f91c2

Rollout COMPLETE ✓ Total time: 17m 23s
⚠️
Never Do This: Using the Same Health Endpoint for Readiness and Liveness
I've seen teams wire both readinessProbe and livenessProbe to '/health' and then wonder why Kubernetes is killing healthy pods under load. If your liveness check includes a database ping, a slow DB will trigger a restart loop: Kubernetes kills the pod, restarts it, it's slow again, kills it again. Separate them: liveness checks only internal process health (event loop alive, no deadlock), readiness checks external dependencies. A pod can be live but not ready, and that's exactly the state you want during a downstream outage.

The Secrets and Config Management Problem Nobody Talks About Until It's Too Late

I once got called into an incident at midnight because a developer had rotated an API key in AWS Secrets Manager, the application read that secret only at startup, and none of the running pods picked up the new value. The service was fine. Then someone did a routine deployment: pods recycled by the rollout came up with the new key, while pods not yet recycled kept the old one cached in memory. Suddenly part of the fleet was talking to the payment gateway with the old key and part with the new one. The gateway's duplicate-detection logic flagged the mismatched requests and started rejecting transactions. It took 40 minutes to figure out the problem was secret rotation, not the deployment itself.

Config and secrets management is where CI/CD pipelines quietly accumulate debt. Teams hardcode environment-specific values into their pipelines, or they inject secrets as plain environment variables in their Kubernetes manifests, or they forget to handle secret rotation without a full restart. All three of these will burn you.

The pattern that works: secrets live in a dedicated secrets manager (AWS Secrets Manager, HashiCorp Vault, or at minimum Kubernetes Secrets encrypted at rest). They're injected at runtime, not build time. Your application watches for secret rotation and reloads without a restart. Your CI pipeline never has access to production secrets; it uses short-lived OIDC tokens to assume the minimum necessary role.

checkout-secrets-pipeline.yml · YAML
# io.thecodeforge - DevOps tutorial
# Secrets management pattern: GitHub Actions + AWS OIDC + Secrets Manager
# Why OIDC instead of long-lived AWS access keys stored in GitHub Secrets?
# Long-lived keys are a credentials leak waiting to happen.
# OIDC issues a token per-workflow-run that expires in minutes.
# No static credentials. No rotation reminders. No 'who committed this key?' post-mortems.

name: Checkout Service CD

on:
  push:
    branches: [main]

permissions:
  id-token: write   # Required for OIDC - tells GitHub to generate an OIDC token
  contents: read

jobs:
  deploy-to-staging:
    name: Deploy to Staging
    runs-on: ubuntu-latest
    environment: staging   # GitHub Environment - enables deployment protection rules
    steps:
      - uses: actions/checkout@v4

      # Authenticate to AWS using OIDC - no static credentials needed
      # The AWS IAM role trusts the GitHub OIDC provider for this repo only
      - name: Configure AWS credentials via OIDC
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789:role/checkout-service-deploy-staging
          # This role has ONLY the permissions needed for staging deployment:
          # - ecr:GetAuthorizationToken, ecr:BatchGetImage (pull image)
          # - eks:DescribeCluster (get kubeconfig)
          # - secretsmanager:GetSecretValue (read staging secrets only)
          aws-region: eu-west-1
          role-session-name: checkout-service-deploy-${{ github.run_id }}

      - name: Validate secrets exist before deploying
        # Fail the pipeline here if secrets are missing - before touching the cluster
        # Better to fail loudly in CI than silently in a running pod
        run: |
          aws secretsmanager describe-secret \
            --secret-id checkout-service/staging/database-url \
            --query 'Name' \
            --output text
          aws secretsmanager describe-secret \
            --secret-id checkout-service/staging/payment-gateway-key \
            --query 'Name' \
            --output text
          echo "All required secrets confirmed present in Secrets Manager"

      - name: Get kubeconfig for staging cluster
        run: |
          aws eks update-kubeconfig \
            --region eu-west-1 \
            --name payments-staging-cluster \
            --alias staging

      - name: Sync secrets from AWS Secrets Manager to Kubernetes
        # Using External Secrets Operator - the right way to bridge AWS Secrets Manager and K8s
        # This creates/updates Kubernetes Secrets automatically when AWS secrets rotate
        # Your pods get the updated secret without restarting if you mount as volumes (not env vars)
        run: |
          # The ExternalSecret resource tells the External Secrets Operator:
          # "Watch this AWS secret, sync it here, refresh every hour"
          kubectl apply -f - <<EOF
          apiVersion: external-secrets.io/v1beta1
          kind: ExternalSecret
          metadata:
            name: checkout-service-secrets
            namespace: payments
          spec:
            refreshInterval: 1h       # Re-sync from AWS every hour - picks up rotations
            secretStoreRef:
              name: aws-secrets-manager
              kind: ClusterSecretStore
            target:
              name: checkout-service-secrets
              creationPolicy: Owner
            data:
              - secretKey: database-url
                remoteRef:
                  key: checkout-service/staging/database-url
              - secretKey: payment-gateway-key
                remoteRef:
                  key: checkout-service/staging/payment-gateway-key
          EOF

      - name: Deploy to staging via Argo Rollouts
        run: |
          # Update only the image tag - don't replace the entire manifest.
          # This preserves any manual overrides and reduces blast radius.
          # Use the short SHA to match the sha- tag pushed by the CI build.
          kubectl argo rollouts set image checkout-service \
            checkout-service=ghcr.io/org/checkout-service:sha-${GITHUB_SHA::7} \
            --namespace payments

      - name: Wait for rollout to complete
        run: |
          # Watch the rollout with a timeout - don't let a stuck rollout block your pipeline forever
          # 10 minutes is generous for canary + analysis steps
          kubectl argo rollouts status checkout-service \
            --namespace payments \
            --timeout 10m

      - name: Run smoke tests against staging
        # Smoke tests run AFTER the deployment - they prove the live environment works
        # Not the same as integration tests, which run in isolation pre-deploy
        run: |
          # Hit real endpoints in staging with real (test) data
          npm run test:smoke -- \
            --base-url https://checkout-staging.internal.example.com \
            --timeout 30000
        env:
          # Smoke test API key is a staging-specific test credential
          # It never touches production data
          SMOKE_TEST_API_KEY: ${{ secrets.STAGING_SMOKE_TEST_KEY }}
▶ Output
Deploy to Staging - checkout-service sha-a3f91c2

✓ AWS OIDC authentication successful
Role: checkout-service-deploy-staging
Session expires: 2024-01-15T14:32:00Z (1 hour)

✓ Secret validation passed
checkout-service/staging/database-url [EXISTS]
checkout-service/staging/payment-gateway-key [EXISTS]

✓ Kubeconfig updated for cluster: payments-staging-cluster

✓ ExternalSecret synced
checkout-service-secrets updated in namespace payments
Next refresh: 2024-01-15T15:00:00Z

✓ Rollout initiated: sha-a3f91c2
Canary: 10% → Analysis passed → 30% → Analysis passed → 100%
Rollout complete in 14m 52s

✓ Smoke tests passed
POST /api/v1/checkout - 201 (143ms)
GET /api/v1/orders/{id} - 200 (67ms)
POST /api/v1/checkout/confirm - 200 (298ms)
3/3 smoke tests passed

Deployment result: SUCCESS
⚠️
Senior Shortcut: Mount Secrets as Files, Not Environment Variables
Mount Kubernetes Secrets as volume files, not env vars. Env vars are captured at pod startup and never refresh. A file mounted from a Secret updates when the Secret updates (within kubelet's sync period, default 60s). Your app can use a file watcher to reload config without restarting. This is how you get secret rotation without downtime. The pattern: mount to '/run/secrets/payment-gateway-key', read with fs.readFileSync, watch with chokidar or inotify.
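A minimal sketch of the mount side of that pattern, assuming the Secret name used in the manifests above:

```yaml
# Mount the Secret as files under /run/secrets instead of env vars.
# kubelet refreshes the projected files when the Secret object changes,
# so a file-watching app picks up rotations without a pod restart.
spec:
  containers:
    - name: checkout-service
      volumeMounts:
        - name: service-secrets
          mountPath: /run/secrets   # one file per Secret key, e.g. payment-gateway-key
          readOnly: true
  volumes:
    - name: service-secrets
      secret:
        secretName: checkout-service-secrets
```

One caveat: Secrets consumed via subPath mounts do not receive updates, so mount the whole Secret as a directory as shown here.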

Observability in the Pipeline: You Can't Fix What You Can't See

A pipeline that tells you 'build failed' is nearly useless. A pipeline that tells you 'integration test checkout_service_test.ts:143: assertion failed, expected order status CONFIRMED, received PAYMENT_PENDING; flaky for 3 of last 5 runs on this branch; median test duration increased 40% this week' is a co-pilot. The gap between those two things is observability.

High-performing teams treat their pipelines as first-class systems with their own monitoring. They track pipeline duration by stage, test flakiness rates by test file, deployment frequency, change failure rate, and mean time to recovery. These are the four DORA metrics, and if you're not measuring them, you don't know if your DevOps practice is improving or just getting more complicated.

Flaky tests are the silent killer of CI trust. Once developers start seeing random failures, they learn to re-run pipelines instead of fixing them. That habit means they also re-run real failures, which means bugs start shipping. I've seen teams with a 30% flakiness rate on their test suite who had essentially no CI: the pipeline was there, but no one believed it. The fix isn't to delete the flaky tests. It's to quarantine them, track them in your issue tracker, and fix them with the same urgency you'd fix a production bug.
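One way to wire that quarantine into the pipeline itself. This is a sketch; the 'test:quarantine' npm script is a hypothetical name for a runner invocation that selects only tests tagged as flaky:

```yaml
# Hypothetical quarantine job: known-flaky tests still run and report,
# but can never block the pipeline. The main suite excludes them, so a
# red main suite always means a real failure worth investigating.
quarantined-tests:
  name: Quarantined (Flaky) Tests
  runs-on: ubuntu-latest
  continue-on-error: true          # job failure never fails the workflow
  steps:
    - uses: actions/checkout@v4
    - uses: actions/setup-node@v4
      with:
        node-version: '20.x'
        cache: 'npm'
    - run: npm ci
    - run: npm run test:quarantine  # flaky tests only - each tracked in the issue tracker
```

The quarantine only works if every quarantined test has an owner and a ticket; otherwise it becomes a graveyard instead of a hospital.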

pipeline-observability.yml · YAML
# io.thecodeforge - DevOps tutorial
# Pipeline observability: tracking DORA metrics and test flakiness
# This job runs after every workflow completion (success OR failure)
# It pushes pipeline telemetry to your metrics backend

name: Pipeline Telemetry

on:
  workflow_run:
    workflows: ['Checkout Service CI', 'Checkout Service CD']
    types: [completed]           # Triggers on both success and failure

jobs:
  record-pipeline-metrics:
    name: Record Pipeline Metrics
    runs-on: ubuntu-latest
    steps:
      - name: Calculate pipeline duration and outcome
        id: metrics
        run: |
          # GitHub provides start/end times for workflow runs via the API
          # We push these as custom metrics to track pipeline performance over time

          WORKFLOW_NAME="${{ github.event.workflow_run.name }}"
          WORKFLOW_CONCLUSION="${{ github.event.workflow_run.conclusion }}"
          # conclusion values: success | failure | cancelled | skipped | timed_out

          START_TIME="${{ github.event.workflow_run.run_started_at }}"
          END_TIME="${{ github.event.workflow_run.updated_at }}"

          # Convert to epoch for arithmetic
          START_EPOCH=$(date -d "$START_TIME" +%s)
          END_EPOCH=$(date -d "$END_TIME" +%s)
          DURATION_SECONDS=$((END_EPOCH - START_EPOCH))

          echo "workflow_name=$WORKFLOW_NAME" >> $GITHUB_OUTPUT
          echo "conclusion=$WORKFLOW_CONCLUSION" >> $GITHUB_OUTPUT
          echo "duration=$DURATION_SECONDS" >> $GITHUB_OUTPUT
          echo "branch=${{ github.event.workflow_run.head_branch }}" >> $GITHUB_OUTPUT
          echo "sha=${{ github.event.workflow_run.head_sha }}" >> $GITHUB_OUTPUT

      - name: Push metrics to Datadog
        run: |
          # Push pipeline metrics as custom Datadog metrics
          # These feed into dashboards tracking DORA metrics
          curl -X POST "https://api.datadoghq.com/api/v1/series" \
            -H "Content-Type: application/json" \
            -H "DD-API-KEY: ${{ secrets.DATADOG_API_KEY }}" \
            -d '{
              "series": [
                {
                  "metric": "ci.pipeline.duration_seconds",
                  "type": "gauge",
                  "points": [['"$(date +%s)"', ${{ steps.metrics.outputs.duration }}]],
                  "tags": [
                    "workflow:${{ steps.metrics.outputs.workflow_name }}",
                    "conclusion:${{ steps.metrics.outputs.conclusion }}",
                    "branch:${{ steps.metrics.outputs.branch }}",
                    "service:checkout-service"
                  ]
                },
                {
                  "metric": "ci.pipeline.runs_total",
                  "type": "count",
                  "points": [['"$(date +%s)"', 1]],
                  "tags": [
                    "workflow:${{ steps.metrics.outputs.workflow_name }}",
                    "conclusion:${{ steps.metrics.outputs.conclusion }}",
                    "service:checkout-service"
                  ]
                }
              ]
            }'

      - name: Alert on repeated failures
        # If the same workflow has failed 3 times in the last hour, page the on-call
        # This catches the 'everyone's re-running hoping it fixes itself' pattern
        if: steps.metrics.outputs.conclusion == 'failure'
        run: |
          # Query Datadog for recent failure count on this branch
          RECENT_FAILURES=$(curl -s \
            "https://api.datadoghq.com/api/v1/query?from=$(date -d '1 hour ago' +%s)&to=$(date +%s)&query=sum:ci.pipeline.runs_total{service:checkout-service,conclusion:failure,branch:${{ steps.metrics.outputs.branch }}}.as_count()" \
            -H "DD-API-KEY: ${{ secrets.DATADOG_API_KEY }}" \
            -H "DD-APPLICATION-KEY: ${{ secrets.DATADOG_APP_KEY }}" \
            | jq '.series[0].pointlist[-1][1] // 0')

          echo "Recent failures on this branch: $RECENT_FAILURES"

          if [ "$(echo "$RECENT_FAILURES >= 3" | bc)" -eq 1 ]; then
            # Send a PagerDuty alert β€” 3 consecutive failures is a real problem, not flakiness
            curl -X POST https://events.pagerduty.com/v2/enqueue \
              -H 'Content-Type: application/json' \
              -d '{
                "routing_key": "${{ secrets.PAGERDUTY_ROUTING_KEY }}",
                "event_action": "trigger",
                "payload": {
                  "summary": "CI pipeline failing repeatedly: ${{ steps.metrics.outputs.workflow_name }} on ${{ steps.metrics.outputs.branch }}",
                  "severity": "warning",
                  "source": "github-actions",
                  "custom_details": {
                    "workflow": "${{ steps.metrics.outputs.workflow_name }}",
                    "branch": "${{ steps.metrics.outputs.branch }}",
                    "sha": "${{ steps.metrics.outputs.sha }}",
                    "failures_last_hour": "'$RECENT_FAILURES'",
                    "run_url": "${{ github.event.workflow_run.html_url }}"
                  }
                }
              }'
            echo "PagerDuty alert sent for repeated CI failures"
          fi
β–Ά Output
Pipeline Telemetry β€” recording for: Checkout Service CI

βœ“ Metrics calculated
Workflow: Checkout Service CI
Conclusion: failure
Duration: 312 seconds
Branch: feature/new-promo-engine
SHA: b8e3a2f1

βœ“ Metrics pushed to Datadog
ci.pipeline.duration_seconds{service:checkout-service, conclusion:failure}: 312
ci.pipeline.runs_total{service:checkout-service, conclusion:failure}: 1

βœ“ Checking recent failure count on feature/new-promo-engine...
Recent failures (last 1h): 3

⚠ Threshold exceeded (β‰₯3 failures) β€” sending PagerDuty alert
Alert sent to on-call rotation: checkout-service-team
Severity: warning
Summary: CI pipeline failing repeatedly: Checkout Service CI on feature/new-promo-engine
🔥 Interview Gold: The Four DORA Metrics and What They Actually Measure

DORA's research identifies four metrics that predict software delivery performance: Deployment Frequency (how often you ship to production), Lead Time for Changes (commit to production in minutes/hours/days), Change Failure Rate (% of deployments causing an incident), and Mean Time to Recovery (how long to restore service after failure). Elite teams deploy multiple times per day, have sub-hour lead times, under 5% change failure rate, and recover in under one hour. If a team tells you they release monthly, their MTTR is measured in days — guaranteed.
| Deployment Strategy | Blue-Green | Canary Release | Feature Flags |
| --- | --- | --- | --- |
| Traffic control | Full cutover (0% → 100%) | Gradual (10% → 30% → 100%) | Per-user/per-segment rules |
| Rollback speed | Instant (DNS/LB switch) | Minutes (weighted traffic revert) | Instant (toggle off) |
| Infrastructure cost | 2x resource cost during deploy | 10-40% overhead during canary window | Minimal — same infrastructure |
| Validates real traffic | No — cutover before validation | Yes — live traffic on canary pods | Yes — real users on new feature |
| Database migration safety | Requires backward-compatible schemas | Requires backward-compatible schemas | Can gate migration behind flag |
| Best for | Infrastructure/config changes | Application code releases | Feature-level toggling, A/B tests |
| Risk surface | All-or-nothing if rollback fails | Limited blast radius at each step | Flag misconfiguration affects all users |
| Tooling required | Load balancer + duplicate environment | Argo Rollouts or Flagger | LaunchDarkly, Unleash, or custom Redis |
| DORA metric impact | Reduces change failure rate | Reduces change failure rate + MTTR | Increases deployment frequency |
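The canary column maps directly onto an Argo Rollouts strategy. A fragment of what the 10% → 30% → 100% progression looks like as a Rollout spec; replica counts and pause durations are illustrative, and a full Rollout also needs selector and template fields:

```yaml
# Illustrative fragment of an Argo Rollouts canary strategy.
# Durations and the service name are assumptions, not from this article.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout-service
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10            # send 10% of traffic to canary pods
        - pause: { duration: 5m }  # hold while metrics are evaluated
        - setWeight: 30
        - pause: { duration: 10m }
        # with no further steps, the rollout completes to 100%
```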

🎯 Key Takeaways

  • Order pipeline stages by execution time ascending, not by importance β€” failing a 45-minute integration suite before a 30-second type check is burning compute and developer patience for no reason
  • The moment your team starts re-running pipelines without investigating failures, your CI is dead β€” you have a retry button, not a quality gate; flakiness tracking and mandatory failure investigation are cultural and technical requirements, not optional
  • Never use :latest in production Kubernetes deployments β€” SHA-tagged images are the only way to get meaningful rollbacks, reproducible environments, and an audit trail that doesn't lie to you at 2am
  • Blue-green, canary, and feature flags are not interchangeable β€” blue-green can't validate real traffic before cutover, canary can't decouple feature release from deployment, and feature flags can't protect you from infrastructure changes; use the right tool for the specific risk you're managing

⚠ Common Mistakes to Avoid

  • ✕ Mistake 1: Running npm install instead of npm ci in CI — Symptom: builds that pass locally fail on CI with 'peer dependency conflict' or worse, pass CI but install a subtly different dependency version than production, causing 'works on my machine' bugs — Fix: always use npm ci in CI environments; it installs strictly from package-lock.json, fails if lock file is out of date, and never modifies it
  • ✕ Mistake 2: Storing long-lived AWS IAM access keys in GitHub Secrets for CI — Symptom: 'AWS Access Key exposed in public repository' GitHub security alert, or a credentials leak during a GitHub breach — Fix: replace with OIDC federation using aws-actions/configure-aws-credentials@v4 with role-to-assume; the generated token lives for the duration of one workflow run and has no static credentials to leak
  • ✕ Mistake 3: Setting identical readinessProbe and livenessProbe paths in Kubernetes — Symptom: 'CrashLoopBackOff' during high database latency events where Kubernetes repeatedly kills and restarts healthy pods, making a DB slowdown into a full service outage — Fix: separate the probes; liveness checks only in-process health (event loop, memory), readiness checks external dependencies; liveness failureThreshold should be 5+ to avoid restart loops on transient issues
  • ✕ Mistake 4: Using :latest as the Docker image tag in Kubernetes deployments — Symptom: running kubectl rollout undo fails to actually revert because 'latest' now points to the broken version; you have no way to know which code is running where — Fix: always tag images with the git SHA (type=sha in docker/metadata-action); you get exact traceability, meaningful rollback, and reproducible deployments
  • ✕ Mistake 5: Injecting secrets as Kubernetes environment variables instead of mounted files — Symptom: after rotating a secret in AWS Secrets Manager, pods continue using the old value until a manual restart; this creates a split-brain state where some pods use old credentials and some use new — Fix: mount secrets as volumes using External Secrets Operator; file-mounted secrets update within kubelet's syncFrequency (default 60s) without a pod restart; add a file watcher in your app to reload config on change
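The SHA-tagging fix for Mistake 4 can be sketched as a pair of workflow steps; the image name is illustrative:

```yaml
# Sketch of git-SHA image tagging with docker/metadata-action.
# The registry path is an assumption, not from this article.
- name: Docker metadata
  id: meta
  uses: docker/metadata-action@v5
  with:
    images: ghcr.io/example/checkout-service
    tags: |
      type=sha,format=long    # tags the image with the full git commit SHA

- name: Build and push
  uses: docker/build-push-action@v5
  with:
    push: true
    tags: ${{ steps.meta.outputs.tags }}
```

The deployment manifest then references the exact SHA tag, so kubectl rollout undo always points at a real, immutable image.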

Interview Questions on This Topic

  • Q: Your CI pipeline has a 25% flaky test rate. Developers have started auto-retrying failed builds without investigating. How do you fix the underlying problem without just deleting tests or disabling the pipeline requirement?
  • Q: When would you choose a blue-green deployment over a canary release for a microservice that owns its own PostgreSQL database and is receiving an additive schema migration?
  • Q: You're mid-canary deployment at 30% traffic when your success rate analysis shows 94.8% — just below your 95% threshold. Argo Rollouts automatically rolls back. But your SRE thinks it might have been a transient spike from an upstream dependency, not your code. What's your mitigation strategy going forward, and what does this reveal about your analysis template design?
  • Q: How does OIDC-based authentication for CI/CD differ from long-lived IAM keys in terms of the attack surface it exposes, and what specific IAM conditions would you use to restrict which GitHub workflow runs can assume the production deployment role?

Frequently Asked Questions

What's the difference between continuous delivery and continuous deployment?

Continuous delivery means every change is automatically built, tested, and packaged so it's always ready to deploy to production β€” but a human approves the actual production release. Continuous deployment removes that human gate entirely: every change that passes automated tests ships to production automatically. Most regulated industries (fintech, healthcare) practise continuous delivery and stop there because they need a manual approval step for compliance reasons β€” that's not a failure, it's intentional.
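That manual approval gate is typically a one-line change in the pipeline itself. A sketch as a GitHub Actions job, where the 'production' environment name is illustrative and the required-reviewers rule lives in the repository's environment settings, not in the YAML:

```yaml
# Continuous delivery with a human gate: the job pauses at the
# 'production' environment until a configured reviewer approves.
# Environment name and deploy script are illustrative assumptions.
deploy-production:
  name: Deploy to Production
  runs-on: ubuntu-latest
  environment: production     # approval rule is configured on the environment
  steps:
    - run: ./scripts/deploy.sh
```

Removing the protection rule on the environment turns the same pipeline into continuous deployment; the YAML doesn't change.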

How do I prevent secrets from leaking in GitHub Actions logs?

GitHub automatically masks values stored in GitHub Secrets from workflow logs, but only if the runner knows about them β€” secrets you construct dynamically at runtime from parts don't get masked. Never echo a secret directly, never construct a secret by concatenating values in a run step, and never pass secrets between jobs using outputs (outputs are visible in the workflow log). Pass secrets between jobs using artifacts stored with restricted permissions, or pull them fresh from a secrets manager in each job that needs them.
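For secrets you derive at runtime, GitHub Actions provides the ::add-mask:: workflow command to register a value with the runner's log masker before anything can print it. A minimal sketch, where the base value and the derivation are illustrative stand-ins for a real secret:

```shell
#!/usr/bin/env bash
# Hedged sketch: masking a runtime-derived secret in GitHub Actions.
# BASE_CREDENTIAL and the base64 derivation are illustrative; in a real
# workflow the base value would come from GitHub Secrets or a vault.
BASE_CREDENTIAL="example-value"
DERIVED_TOKEN="$(printf '%s' "$BASE_CREDENTIAL" | base64)"

# Register the derived value BEFORE any step could log it; from this
# point on the Actions runner redacts it from workflow logs.
echo "::add-mask::$DERIVED_TOKEN"
```

Outside of Actions the echo is a harmless no-op, which also makes the script safe to run locally.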

How do I handle database migrations safely in a CI/CD pipeline without causing downtime?

Expand-contract (also called parallel change) is the pattern: in phase one, deploy a migration that adds the new column without removing the old one β€” the running application still uses the old column. In phase two, deploy the application code that writes to both old and new columns. In phase three, deploy a migration that backfills and removes the old column once all running pods use the new schema. This means your migration and your deployment are never a single atomic change, which is the only way to keep rolling deployments safe with schema changes.
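The three phases can be sketched as separately deployed migrations; the table and column names here are illustrative, not from the article:

```sql
-- Hedged sketch of expand-contract; names are illustrative.

-- Phase 1 (expand): add the new column. Running pods still use the old one,
-- so this migration is safe to deploy on its own.
ALTER TABLE orders ADD COLUMN status_v2 TEXT;

-- Phase 2 is an application deploy, not a migration: the new code writes to
-- both status and status_v2, and reads status_v2 with a fallback to status.

-- Phase 3 (contract): backfill, then drop the old column, only after every
-- running pod is on the dual-write version.
UPDATE orders SET status_v2 = status WHERE status_v2 IS NULL;
ALTER TABLE orders DROP COLUMN status;
```

The key property is that any adjacent pair of application version and schema version is compatible, which is what keeps rolling deployments safe mid-rollout.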

At what scale does a monolithic CI pipeline start to break down, and what's the alternative?

A single pipeline for a monorepo breaks down around 30-50 developers when you hit two symptoms: pipeline duration exceeds 15 minutes even for unrelated changes, and flakiness in module A blocks deployments of module B. The alternative is affected-path detection β€” tools like Nx, Turborepo, or Bazel determine which packages changed and only run CI for the dependency graph downstream of that change. At true monorepo scale (hundreds of services), you also need remote caching so unchanged modules reuse prior build outputs across runs. The trap is implementing this too early β€” if your pipeline runs in under 8 minutes and you have fewer than 20 engineers, the complexity of affected-path detection isn't worth it yet.
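Before reaching for Nx or Turborepo, affected-path detection can be approximated with plain git. A sketch, where the service paths and the simulated diff are illustrative; in CI the CHANGED list would come from git diff --name-only origin/main...HEAD:

```shell
#!/usr/bin/env bash
# Hedged sketch of affected-path detection. CHANGED is a simulated diff;
# in a real pipeline it would come from git diff --name-only.
CHANGED="services/checkout/src/cart.ts
services/checkout/package.json
docs/README.md"

affected=""
for service in checkout payments; do
  # A service is affected if any changed file lives under its directory
  if echo "$CHANGED" | grep -q "^services/$service/"; then
    affected="${affected:+$affected }$service"
  fi
done
echo "affected services: ${affected:-none}"
```

This naive version misses shared-library dependencies, which is exactly the gap the dedicated tools close by walking the dependency graph.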

Naren Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.

Forged with πŸ”₯ at TheCodeForge.io β€” Where Developers Are Forged