Mid-level 16 min · March 29, 2026
DevOps Best Practices: What High-Performing Teams Do Differently

CI/CD Skipped Jobs — Why 'Success' Deploys Old Code

Skipped build jobs pass needs checks silently, deploying stale artifacts.

Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Notes here come from systems that actually shipped.

Last updated: June 21, 2026
Quick Answer
  • Order pipeline stages by execution speed, not importance — fail fast, fail cheap
  • Use healthchecks with depends_on for real readiness, not startup order
  • Mount secrets as files, not env vars — enables rotation without restarts
  • Track DORA metrics: deployment frequency, lead time, change failure rate, MTTR
  • Separate readiness and liveness probes — liveness checks only in-process health
  • Tag images with SHA — never :latest in production; enables precise rollback
Definition (~90s read)
What is DevOps?

CI/CD skipped jobs occur when a pipeline reports a successful build or deployment despite critical steps—like artifact promotion, secret injection, or environment-specific configuration—being bypassed or silently failing. This creates a dangerous illusion: the pipeline shows green, but the code reaching production is stale, misconfigured, or even from a previous build.

The root cause is almost always a pipeline architecture that treats success as a binary pass/fail on compilation and unit tests, ignoring the chain of dependencies required for a truly immutable release. Teams often build pipelines backwards by optimizing for speed over correctness—parallelizing jobs without enforcing ordering, using environment variables that differ between CI and production, or relying on mutable tags like latest that overwrite previous artifacts.

The result is a deployment that passes all checks but ships old or broken code because the actual artifact wasn't rebuilt, wasn't promoted to the registry, or had its secrets injected from a stale vault. This is not a theoretical edge case; it's a systemic failure pattern in organizations that conflate 'pipeline completion' with 'deployment safety.' The fix requires shifting from a linear job-runner mindset to a state-machine approach: every artifact must be built once, checksummed, and promoted through environments with cryptographic verification.

Tools like Sigstore for signing, OCI-compliant registries with immutable tags, and pipeline-as-code frameworks (e.g., Tekton, Argo Workflows) that enforce DAG dependencies are non-negotiable. Without this, your 'success' is just a ticking time bomb—and the bomb is always old code.
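As a rough illustration of "built once, checksummed, and promoted", here is a minimal Python sketch. It covers only the checksum comparison, not Sigstore signing, and the function names (`sha256_digest`, `promote`) and environment names are hypothetical, not part of any tool mentioned above.

```python
import hashlib


def sha256_digest(data: bytes) -> str:
    """Return the digest in the 'sha256:<hex>' form used by OCI registries."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def promote(artifact: bytes, recorded_digest: str, target_env: str) -> str:
    """Refuse to promote bytes whose digest differs from the one recorded at build time."""
    actual = sha256_digest(artifact)
    if actual != recorded_digest:
        raise ValueError(
            f"refusing to promote to {target_env}: "
            f"expected {recorded_digest}, got {actual}"
        )
    # Hypothetical return value: an environment-qualified reference to the artifact.
    return f"{target_env}@{actual}"


if __name__ == "__main__":
    built = b"checkout-service build output"
    digest = sha256_digest(built)          # recorded once, at build time
    print(promote(built, digest, "staging"))
```

The point of the design is that every environment receives the digest recorded at build time, so a stale or rebuilt artifact fails loudly at promotion instead of shipping under a green pipeline.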

Plain-English First

Think of your codebase like a commercial kitchen. Amateur cooks prep everything at the end of service, then panic when the plate's wrong. A Michelin-starred kitchen has a quality check at every single station — the prep cook, the saucier, the expeditor — so a bad dish never reaches the dining room. CI/CD is that station-by-station quality system for software. Every time a developer adds something to the kitchen, it gets tasted, checked, and plated automatically before a single customer sees it. The difference between a restaurant that survives and one that gets shut down by health inspectors is exactly that discipline.

A fintech team I worked with was deploying to production manually every two weeks. One Friday afternoon, a developer copy-pasted a database migration script into the wrong environment, wiped a staging database that was being used as a shadow clone of prod, and triggered a three-hour incident that nearly became a four-hour customer-facing outage. The root cause wasn't the mistake — humans make mistakes. The root cause was that there was no automated gate to catch it.

CI/CD isn't a tool. It's a philosophy that says 'the longer you wait to integrate and ship, the more expensive your mistakes get.' The average high-performing team deploys to production multiple times per day with a change failure rate under 5%. The average low-performing team deploys once a month and spends 40% of their engineering time on unplanned work — firefighting regressions, rolling back broken releases, and manually babysitting deployments. Those aren't different companies. They're the same company, two years apart, after one of them got serious about CI/CD.

By the end of this article, you'll know exactly how to structure a pipeline that catches failures before they reach production, which quality gates actually matter and which ones slow you down for no gain, where pipelines break down at scale and what to do about it, and how to roll out changes without taking the whole system down. You won't just understand CI/CD — you'll be able to walk into an existing codebase and diagnose exactly why its pipeline is failing its team.

Why CI/CD 'Success' Can Deploy Old Code

CI/CD skipped jobs occur when a pipeline stage is conditionally bypassed, often due to path filters, manual approvals, or failure thresholds. The core mechanic: a job that doesn't run is treated as 'successful' by default, so the pipeline proceeds without executing the intended build, test, or deploy step. This means the artifact from the previous run — possibly stale — gets promoted to production.

In practice, skipped jobs create a false sense of safety. For example, a 'deploy' job that only triggers on changes to a specific directory will silently skip when a config file outside that directory is updated. The pipeline shows green, but the running code is unchanged. The key property: pipeline status reflects job execution, not code freshness. Teams often miss this because they assume 'all jobs passed' means 'all jobs ran'.

Use explicit 'required' job markers and version pinning to prevent stale deployments. In real systems, this matters most during hotfixes or config-only changes — a skipped deploy can leave a critical fix unapplied while the dashboard reports success. Always validate that the deployed artifact matches the commit hash.
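One way to make "required" explicit is a final gate that inspects job results and the deployed commit, rather than trusting the pipeline colour. The sketch below is an assumption about how such a gate could look: `JobResult`, `gate_errors`, and the idea that the running service reports its commit SHA are all hypothetical, not features of a specific CI system.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class JobResult:
    name: str
    result: str  # "success", "failure", "cancelled", or "skipped"


def gate_errors(
    jobs: Iterable[JobResult],
    required: Iterable[str],
    expected_sha: str,
    deployed_sha: str,
) -> List[str]:
    """Return the reasons this run must not be reported as green."""
    by_name = {job.name: job.result for job in jobs}
    errors = []
    for name in required:
        # A required job that was skipped or never scheduled counts as a failure.
        result = by_name.get(name, "missing")
        if result != "success":
            errors.append(f"required job '{name}' finished as '{result}'")
    if deployed_sha != expected_sha:
        errors.append(
            f"deployed commit {deployed_sha} does not match expected {expected_sha}"
        )
    return errors


if __name__ == "__main__":
    run = [JobResult("build", "success"), JobResult("deploy", "skipped")]
    for problem in gate_errors(run, ["build", "deploy"], "a3f91c2", "9b1e07d"):
        print("GATE FAILED:", problem)
```

An empty list means every required job actually ran and succeeded, and what is running matches the commit that triggered the pipeline.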

Silent Stale Deploy
A skipped job is not a failed job — but it's also not a successful one. The pipeline treats absence as success, which is the root cause.
Production Insight
A team pushed a config change to production but the deploy job was skipped due to a path filter — the old binary ran with the new config, causing a silent mismatch.
The symptom: production logs showed the new config values, but the application behavior matched the old code, leading to hours of debugging.
Rule of thumb: always pin the artifact version in the deploy job and fail the pipeline if the deploy job is skipped — never trust a green pipeline with skipped steps.
Key Takeaway
A skipped job is not a passed job — it's an absent job that the pipeline treats as success.
Always validate that the deployed artifact matches the commit hash, regardless of pipeline status.
Use required job markers and explicit artifact versioning to prevent stale code from reaching production.
[Figure: CI/CD Skipped Jobs: Old Code Deployment. Pipeline flow showing how skipped jobs cause stale releases: Commit Trigger (code push starts pipeline) → Skipped Jobs (tests or builds bypassed) → Artifact Reuse (old artifact deployed as 'success') → Deploy to Prod (stale code goes live). Skipped jobs mask stale artifacts as successful releases; enforce immutable artifacts and never reuse them across pipelines.]

Pipeline Architecture: Why Most Teams Build It Backwards

Most teams design their CI pipeline by asking 'what checks should we run?' That's the wrong question. The right question is 'in what order should failures be discovered, and what's the cost of discovering them late?' Every stage of your pipeline is a trade-off between feedback speed and coverage depth. If you put your 45-minute integration test suite before your 30-second linter, you're making every developer wait 45 minutes to learn they forgot a semicolon. I've seen this kill developer velocity at a mid-size SaaS company: engineers stopped running checks locally and just pushed to let CI run them, which turned the pipeline into a batch job instead of a fast feedback loop.

The principle is fail fast, fail cheap. Your pipeline stages should be ordered by execution time, ascending. Linting and static analysis run first — they're near-instant and catch a massive proportion of bugs. Unit tests second. Integration tests third. End-to-end tests last, and gated behind a merge to a protected branch. Every stage that fails short-circuits the rest. You don't run a 30-minute E2E suite against a commit that failed a type check.

Here's a production-grade GitHub Actions pipeline for a Node.js checkout service that demonstrates this ordering. Notice the explicit stage dependencies and the parallelisation of independent checks: security scanning runs in parallel with linting and type checking because they don't share state, and unit tests wait for both.

One addition to this order: include a quick 'dependency caching restore' step before the first gate. It takes seconds but saves minutes in later stages. A common trap is caching node_modules but not the Docker layers — that's separate. Also, don't cache everything blindly; cache only what actually reduces build time. Measure cache hit rates with a dashboard.

Another nuance: the order of failure discovery should also consider blast radius. A linting failure affects only code style and minor bugs — cheap to fix. A security vulnerability in a dependency might require a team-wide update. An integration test failure might indicate a broken contract between services. Order by cost of failure as well as speed; cheap failures first, expensive ones after they're gated by cheap checks.

checkout-service-ci.yml (YAML)
# io.thecodeforge — DevOps tutorial
# CI pipeline for a checkout service — GitHub Actions
# Ordered by: speed (fastest gates first), then coverage depth
# Principle: catch cheap failures before running expensive ones

name: Checkout Service CI

on:
  push:
    branches: ['**']          # Run on every branch push, not just main
  pull_request:
    branches: [main, staging]  # Gate merges to protected branches

env:
  NODE_VERSION: '20.x'
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}/checkout-service

jobs:
  # Stage 1: Sub-60-second gates
  lint-and-typecheck:
    name: Lint & Type Check
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'npm'
      - name: Install dependencies
        run: npm ci
      - name: Run ESLint
        run: npm run lint
      - name: TypeScript type check
        run: npm run typecheck

  security-scan:
    name: Dependency Security Scan
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run npm audit
        run: npm audit --audit-level=high
      - name: SAST scan with Semgrep
        uses: returntocorp/semgrep-action@v1
        with:
          config: 'p/nodejs'

  # Stage 2: Unit tests (only if Stage 1 passes)
  unit-tests:
    name: Unit Tests
    runs-on: ubuntu-latest
    needs: [lint-and-typecheck, security-scan]
    steps:
      - uses: actions/checkout@v4
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'npm'
      - name: Install dependencies
        run: npm ci
      - name: Run unit tests with coverage
        run: npm run test:unit -- --coverage
        env:
          DATABASE_URL: 'sqlite::memory:'
          PAYMENT_GATEWAY_URL: 'http://localhost:9999'
      - name: Upload coverage report
        uses: actions/upload-artifact@v4
        with:
          name: coverage-report
          path: coverage/
          retention-days: 7
      - name: Enforce coverage threshold
        run: npx nyc check-coverage --lines 80 --functions 80 --branches 75

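  # Stage 3: Integration tests (on PRs to main/staging and on pushes to main)
  integration-tests:
    name: Integration Tests
    runs-on: ubuntu-latest
    needs: [unit-tests]
    # Must also run on pushes to main: build-and-push needs this job, and a skipped
    # dependency would cause build-and-push to be skipped as well.
    if: github.event_name == 'pull_request' || github.ref == 'refs/heads/main'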
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_DB: checkout_test
          POSTGRES_USER: checkout_app
          POSTGRES_PASSWORD: ${{ secrets.TEST_DB_PASSWORD }}
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
        ports:
          - 5432:5432
      redis:
        image: redis:7-alpine
        options: --health-cmd "redis-cli ping" --health-interval 10s
        ports:
          - 6379:6379
    steps:
      - uses: actions/checkout@v4
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'npm'
      - name: Install dependencies
        run: npm ci
      - name: Run database migrations
        run: npm run db:migrate
        env:
          DATABASE_URL: postgres://checkout_app:${{ secrets.TEST_DB_PASSWORD }}@localhost:5432/checkout_test
      - name: Run integration tests
        run: npm run test:integration
        env:
          DATABASE_URL: postgres://checkout_app:${{ secrets.TEST_DB_PASSWORD }}@localhost:5432/checkout_test
          REDIS_URL: redis://localhost:6379
          NODE_ENV: test

  # Stage 4: Build and push Docker image (only on merge to main)
  build-and-push:
    name: Build & Push Image
    runs-on: ubuntu-latest
    needs: [integration-tests]
    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    permissions:
      contents: read
      packages: write
    outputs:
      image-digest: ${{ steps.build.outputs.digest }}
    steps:
      - uses: actions/checkout@v4
      - name: Log in to container registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Extract metadata for image tags
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=sha,prefix=sha-
            type=raw,value=latest,enable={{is_default_branch}}
      - name: Build and push Docker image
        id: build
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=registry,ref=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:buildcache
          cache-to: type=registry,ref=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:buildcache,mode=max
Output
Workflow triggered on push to main
✓ lint-and-typecheck (23s)
✓ security-scan (41s) [parallel with lint]
✓ unit-tests (1m 12s) [87% line coverage — threshold: 80%]
✓ integration-tests (3m 44s) [14 tests passed, 0 failed]
✓ build-and-push (2m 08s) [Pushed: ghcr.io/org/checkout-service:sha-a3f91c2]
Total wall-clock time: 7m 48s
Pipeline result: SUCCESS
Image digest: sha256:d4f2a1b9c8e3f5...
Production Trap: The 'needs' Trap That Skips Stages Silently
If a job is skipped (not failed, but skipped because of an 'if' condition), jobs that 'need' it are also skipped by default, and nothing fails. This means a build-and-push job can be silently skipped when integration tests were skipped, and your CD step might try to deploy an image that was never built. Fix it: combine always() with explicit status checks, for example: if: always() && (needs.integration-tests.result == 'success' || needs.integration-tests.result == 'skipped'). Be deliberate about which skips are acceptable. Because always() also lets the job run after upstream failures, and a job skipped because its own dependency failed still reports 'skipped', list every job whose result matters in 'needs' and check each one explicitly.
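To make the difference concrete, here is a simplified Python model of the two behaviours. It is an approximation of how a job's run condition is evaluated, not GitHub's implementation, and the function names are hypothetical.

```python
from typing import Collection, Mapping


def default_needs(needs: Mapping[str, str]) -> bool:
    """Default behaviour: with an implicit success(), every needed job must have succeeded."""
    return all(result == "success" for result in needs.values())


def explicit_needs(needs: Mapping[str, str], may_skip: Collection[str] = ()) -> bool:
    """always() plus explicit checks: 'skipped' is tolerated only for jobs in may_skip."""
    return all(
        result == "success" or (result == "skipped" and name in may_skip)
        for name, result in needs.items()
    )


if __name__ == "__main__":
    needs = {"unit-tests": "success", "integration-tests": "skipped"}
    print("default runs:", default_needs(needs))
    print("explicit runs:", explicit_needs(needs, may_skip={"integration-tests"}))
```

Listing acceptable skips by name forces the decision into code review: a new conditional job cannot be skipped quietly without someone adding it to the allow-list.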
Production Insight
The biggest pipeline slowdown isn't test execution — it's waiting for infrastructure to spin up.
Teams with 15+ minute pipelines see 40% longer cycle time.
Rule: keep the fast path under 5 minutes or developers will bypass it.
Another hidden sink: downloading dependencies from scratch. Cache npm and Docker layers.
Watch out for service containers that don't reuse build caches — each pipeline run might rebuild entire dependency trees.
Key Takeaway
Order stages by execution time ascending.
Fail fast, fail cheap.
Your lint check should never wait for your E2E suite to even start.
And if you can't trust your pipeline, your team will find ways around it — that's the real failure.
Pipeline Stage Ordering Decision Tree
If: Stage runs in under 60 seconds and is stateless
Use: Run first; failure short-circuits all downstream stages
If: Stage requires external services (DB, cache, API)
Use: Push to later; service startup time adds latency
If: Stage can run independently of other stages
Use: Run in parallel with other independent stages
If: Stage takes >10 minutes and is rarely triggered
Use: Gate behind merge to protected branch, not every commit
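The decision tree above can be approximated in a few lines of Python. This is a sketch under stated assumptions: the `Stage` fields, the labels returned by `classify`, and the durations in the example are illustrative, and the "rarely triggered" condition is left out because it needs historical data.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Stage:
    name: str
    seconds: int          # typical execution time
    needs_services: bool  # requires a DB, cache, or external API


def classify(stage: Stage) -> str:
    """Map a stage onto the placement rules from the decision tree."""
    if stage.seconds < 60 and not stage.needs_services:
        return "first gate"
    if stage.seconds > 600:
        return "protected-branch only"
    if stage.needs_services:
        return "late stage"
    return "middle stage"


def order(stages: Iterable[Stage]) -> List[Stage]:
    """Stateless stages first, then by execution time ascending."""
    return sorted(stages, key=lambda s: (s.needs_services, s.seconds))


if __name__ == "__main__":
    pipeline = [
        Stage("e2e", 1800, True),
        Stage("integration", 224, True),
        Stage("unit", 72, False),
        Stage("lint", 23, False),
    ]
    for stage in order(pipeline):
        print(f"{stage.name:12s} {classify(stage)}")
```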

Deployment Strategies That Don't Gamble Your Entire User Base

Here's a mistake I've seen kill a Black Friday deployment: a team built a perfect CI pipeline, then wired it directly to 'deploy everything to all pods immediately.' The pipeline was green. The deployment destroyed a third of their order throughput because a new Redis connection pool configuration had a subtle bug that only surfaced under real production load patterns. Their rollback took 22 minutes because they had no deployment strategy — it was all or nothing.

High-performing teams don't choose between 'deploy' and 'don't deploy.' They choose how much of their traffic takes the risk first. Blue-green deployments, canary releases, and feature flags are the three weapons in this arsenal, and they solve different problems. Blue-green is great for infrastructure changes where you need a clean cutover. Canary is best for application changes where you want to validate behaviour under real traffic before full rollout. Feature flags are best for functionality that you want to decouple from deployment entirely — ship the code, turn on the feature later.

The Kubernetes deployment below shows a canary release pattern using weight-based traffic splitting. The key insight is that your health checks must be meaningful — a pod that returns 200 on '/health' but fails to process payments is worse than a pod that's down, because it poisons a percentage of your real user traffic silently.

A nuance that often gets missed: canary analysis must include business metrics, not just HTTP status. One team's canary passed at 99.5% success rate but the new code returned stale cached prices — no 5xx, just wrong data. Include order completion rate or revenue per request in your analysis.

Another trap: rolling back a canary isn't always safe. If the canary has been running for hours and the stable version has since been updated, rolling back means deploying an older version that might have its own issues. Keep canary windows short or use blue-green for the rollback path.

checkout-canary-deployment.yml (YAML)
# io.thecodeforge — DevOps tutorial
# Canary deployment pattern for checkout service on Kubernetes
# Uses: Argo Rollouts for progressive delivery

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout-service
  namespace: payments
spec:
  replicas: 10
  selector:
    matchLabels:
      app: checkout-service
  template:
    metadata:
      labels:
        app: checkout-service
    spec:
      containers:
        - name: checkout-service
          image: ghcr.io/org/checkout-service:sha-a3f91c2
          ports:
            - containerPort: 3000
          resources:
            requests:
              memory: '256Mi'
              cpu: '250m'
            limits:
              memory: '512Mi'
              cpu: '500m'
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 3000
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health/live
              port: 3000
            initialDelaySeconds: 30
            periodSeconds: 10
            failureThreshold: 5
          env:
            - name: NODE_ENV
              value: production
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: checkout-service-secrets
                  key: database-url
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause:
            duration: 5m
        - analysis:
            templates:
              - templateName: checkout-success-rate
            args:
              - name: service-name
                value: checkout-service
        - setWeight: 30
        - pause:
            duration: 10m
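        - analysis:
            templates:
              - templateName: checkout-success-rate
              - templateName: checkout-p99-latency   # defined separately (not shown here)
            args:
              - name: service-name
                value: checkout-service
        - setWeight: 100
      # For a manual gate before full rollout, add an indefinite `- pause: {}` step;
      # autoPromotionEnabled is a blueGreen-only field and does not apply to canary.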

---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: checkout-success-rate
  namespace: payments
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 60s
      count: 5
      successCondition: result[0] >= 0.95
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc.cluster.local:9090
          query: |
            sum(rate(
              http_requests_total{
                service="{{args.service-name}}",
                status!~"5.."
              }[5m]
            ))
            /
            sum(rate(
              http_requests_total{
                service="{{args.service-name}}"
              }[5m]
            ))
Output
Rollout initiated: checkout-service → sha-a3f91c2
[Step 1/5] Weight: 10% → canary pods
Waiting 5m for traffic sample...
Analysis: checkout-success-rate
Evaluation 1/5: success_rate=0.983 ✓
Evaluation 2/5: success_rate=0.991 ✓
Evaluation 3/5: success_rate=0.979 ✓
Evaluation 4/5: success_rate=0.986 ✓
Evaluation 5/5: success_rate=0.994 ✓
Analysis PASSED ✓
[Step 2/5] Weight: 30% → canary pods
Waiting 10m for traffic sample...
Analysis: checkout-success-rate + checkout-p99-latency
success_rate=0.988 ✓ p99_latency=142ms ✓
Analysis PASSED ✓
[Step 3/5] Weight: 100% — Full rollout
All 10 replicas running sha-a3f91c2
Rollout COMPLETE ✓ Total time: 17m 23s
Never Do This: Using the Same Health Endpoint for Readiness and Liveness
I've seen teams wire both readinessProbe and livenessProbe to '/health' and then wonder why Kubernetes is killing healthy pods under load. If your liveness check includes a database ping, a slow DB will trigger a restart loop — Kubernetes kills the pod, restarts it, it's slow again, kills it again. Separate them: liveness checks only internal process health (event loop alive, no deadlock), readiness checks external dependencies. A pod can be live but not ready — that's exactly the state you want during a downstream outage.
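The split can be sketched independently of any web framework. In this minimal Python example, `HealthState`, the heartbeat threshold, and the dependency check callables are all hypothetical; the handlers return a status code and message that a '/health/live' or '/health/ready' route could pass through.

```python
import time
from typing import Callable, Dict, Tuple


def _safe(check: Callable[[], bool]) -> bool:
    """Treat an exception from a dependency check as 'dependency down'."""
    try:
        return bool(check())
    except Exception:
        return False


class HealthState:
    """Tracks an in-process heartbeat and answers liveness and readiness separately."""

    def __init__(self, max_heartbeat_age: float = 5.0,
                 clock: Callable[[], float] = time.monotonic):
        self._max_age = max_heartbeat_age
        self._clock = clock
        self._last_beat = clock()

    def beat(self) -> None:
        """Call from the main loop; a stalled loop stops calling this."""
        self._last_beat = self._clock()

    def liveness(self) -> Tuple[int, str]:
        """In-process health only: no database or network calls here."""
        age = self._clock() - self._last_beat
        if age > self._max_age:
            return 500, f"main loop stalled for {age:.1f}s"
        return 200, "live"

    def readiness(self, checks: Dict[str, Callable[[], bool]]) -> Tuple[int, str]:
        """External dependencies: a failure takes the pod out of traffic, not out of service."""
        failing = [name for name, check in checks.items() if not _safe(check)]
        if failing:
            return 503, "not ready: " + ", ".join(sorted(failing))
        return 200, "ready"
```

With this shape, a database outage flips readiness to 503 while liveness stays 200, which is the "live but not ready" state described above.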
Production Insight
A canary release that only checks HTTP status is blind to business-logic failures.
One team's canary passed at 99.5% success rate but the new code was returning stale cached prices — no 5xx, just wrong data.
Rule: include business-level metrics in canary analysis (e.g., order completion rate).
Another pitfall: canary windows that are too short miss rare error conditions triggered by daily batch jobs or peak traffic.
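A business-level check can sit next to the HTTP success rate in the promotion decision. The Python sketch below is illustrative only: `canary_verdict`, the 5% relative-drop threshold, and the use of order completion rate as the business metric are assumptions, while the 0.95 success-rate floor mirrors the analysis template above.

```python
from typing import List, Tuple


def canary_verdict(
    http_success_rate: float,
    canary_completion_rate: float,
    baseline_completion_rate: float,
    min_success_rate: float = 0.95,
    max_relative_drop: float = 0.05,
) -> Tuple[bool, List[str]]:
    """Pass only if both the HTTP signal and the business signal look healthy."""
    reasons = []
    if http_success_rate < min_success_rate:
        reasons.append(
            f"HTTP success rate {http_success_rate:.3f} is below {min_success_rate:.3f}"
        )
    if baseline_completion_rate > 0:
        # Relative drop in order completion compared with the stable version.
        drop = (baseline_completion_rate - canary_completion_rate) / baseline_completion_rate
        if drop > max_relative_drop:
            reasons.append(f"order completion dropped {drop:.1%} versus baseline")
    return (not reasons, reasons)


if __name__ == "__main__":
    # No 5xx errors at all, yet far fewer customers finish checkout.
    ok, why = canary_verdict(0.995, 0.60, 0.80)
    print("promote" if ok else "abort", why)
```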
Key Takeaway
Blue-green for infra changes, canary for app code, feature flags for feature rollout.
Each strategy covers a different risk.
Pick based on what you're changing, not what's trendy.
And always pair deployment strategy with a rollback that can be executed faster than the original rollout.
Deployment Strategy Decision Tree
If: Changing infrastructure (DB upgrades, new load balancer config)
Use: Blue-green, for instant cutover with clean failback
If: Releasing new application code with unknown impact
Use: Canary with automated analysis, to validate under real traffic
If: Shipping a feature that needs to be toggled per user or segment
Use: Feature flags, to decouple deployment from release
If: Database schema change that needs to be backward-compatible
Use: Expand-contract pattern alongside any deployment strategy
[Figure: Skipped Job Pipeline Flow. How a bypassed stage silently deploys stale artifacts: Commit (developer pushes code change) → Path Filter (conditional skip on non-matching paths) → Skipped Build (job not run, marked success) → Old Artifact (previous build deployed unchanged) → Prod Deploy (pipeline green, code outdated). A skipped job defaults to success: no warning, no new code.]

The Secrets and Config Management Problem Nobody Talks About Until It's Too Late

I once got called into an incident at midnight because a developer had rotated an API key in AWS Secrets Manager, the application was reading that secret at startup only, and none of the running pods picked up the new value. The service was fine. Then someone did a routine deployment, pods restarted with the new secret, and suddenly half the fleet was talking to the payment gateway with the old key (still cached in pods that had not yet restarted) and half with the new key. The gateway's duplicate-detection logic flagged the mismatched requests and started rejecting transactions. It took 40 minutes to figure out the problem was secret rotation, not the deployment itself.

Config and secrets management is where CI/CD pipelines quietly accumulate debt. Teams hardcode environment-specific values into their pipelines, or they inject secrets as plain environment variables in their Kubernetes manifests, or they forget to handle secret rotation without a full restart. All three of these will burn you.

The pattern that works: secrets live in a dedicated secrets manager (AWS Secrets Manager, HashiCorp Vault, or at minimum Kubernetes Secrets encrypted at rest). They're injected at runtime, not build time. Your application watches for secret rotation and reloads without a restart. Your CI pipeline never has access to production secrets — it uses short-lived OIDC tokens to assume the minimum necessary role.

A concrete technique: use External Secrets Operator to sync secrets from AWS Secrets Manager into Kubernetes Secrets, then mount those Secrets into your pods as volumes. Your app can watch the file for changes and reload config without a restart. This avoids the split-brain scenario entirely.

Additionally, manage config separately from secrets. Use ConfigMaps for non-sensitive configuration like feature flags or API endpoints. That way, you can update config without needing to rotate secrets, and vice versa. And always set up a pre-deployment validation that checks whether the target environment has the required secrets before even attempting the deployment — fail loud, not silent.

checkout-secrets-pipeline.yml (YAML)
# io.thecodeforge — DevOps tutorial
# Secrets management pattern: GitHub Actions + AWS OIDC + Secrets Manager

name: Checkout Service CD

on:
  push:
    branches: [main]

permissions:
  id-token: write
  contents: read

jobs:
  deploy-to-staging:
    name: Deploy to Staging
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials via OIDC
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/checkout-service-deploy-staging
          aws-region: eu-west-1
          role-session-name: checkout-service-deploy-${{ github.run_id }}

      - name: Validate secrets exist before deploying
        run: |
          aws secretsmanager describe-secret --secret-id checkout-service/staging/database-url --query 'Name' --output text
          aws secretsmanager describe-secret --secret-id checkout-service/staging/payment-gateway-key --query 'Name' --output text
          echo "All required secrets confirmed present in Secrets Manager"

      - name: Get kubeconfig for staging cluster
        run: |
          aws eks update-kubeconfig --region eu-west-1 --name payments-staging-cluster --alias staging

      - name: Sync secrets from AWS Secrets Manager to Kubernetes
        run: |
          kubectl apply -f - <<EOF
          apiVersion: external-secrets.io/v1beta1
          kind: ExternalSecret
          metadata:
            name: checkout-service-secrets
            namespace: payments
          spec:
            refreshInterval: 1h
            secretStoreRef:
              name: aws-secrets-manager
              kind: ClusterSecretStore
            target:
              name: checkout-service-secrets
              creationPolicy: Owner
            data:
              - secretKey: database-url
                remoteRef:
                  key: checkout-service/staging/database-url
              - secretKey: payment-gateway-key
                remoteRef:
                  key: checkout-service/staging/payment-gateway-key
          EOF

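      - name: Deploy to staging via Argo Rollouts
        run: |
          # docker/metadata-action's type=sha tag uses the 7-character short SHA by default,
          # so truncate GITHUB_SHA to match the tag pushed by the CI workflow.
          kubectl argo rollouts set image checkout-service checkout-service=ghcr.io/org/checkout-service:sha-${GITHUB_SHA::7} --namespace payments

      - name: Wait for rollout to complete
        run: |
          # The canary steps pause for 15 minutes in total, so the timeout must exceed that.
          kubectl argo rollouts status checkout-service --namespace payments --timeout 30m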

      - name: Run smoke tests against staging
        run: |
          npm run test:smoke -- --base-url https://checkout-staging.internal.example.com --timeout 30000
        env:
          SMOKE_TEST_API_KEY: ${{ secrets.STAGING_SMOKE_TEST_KEY }}
Output
Deploy to Staging — checkout-service sha-a3f91c2
✓ AWS OIDC authentication successful
Role: checkout-service-deploy-staging
Session expires: 2024-01-15T14:32:00Z (1 hour)
✓ Secret validation passed
checkout-service/staging/database-url [EXISTS]
checkout-service/staging/payment-gateway-key [EXISTS]
✓ Kubeconfig updated for cluster: payments-staging-cluster
✓ ExternalSecret synced
checkout-service-secrets updated in namespace payments
Next refresh: 2024-01-15T15:00:00Z
✓ Rollout initiated: sha-a3f91c2
Canary: 10% → Analysis passed → 30% → Analysis passed → 100%
Rollout complete in 14m 52s
✓ Smoke tests passed
POST /api/v1/checkout — 201 (143ms)
GET /api/v1/orders/{id} — 200 (67ms)
POST /api/v1/checkout/confirm — 200 (298ms)
3/3 smoke tests passed
Deployment result: SUCCESS
Senior Shortcut: Mount Secrets as Files, Not Environment Variables
Mount Kubernetes Secrets as volume files, not env vars. Env vars are captured at pod startup and never refresh. A file mounted from a Secret updates when the Secret updates (after the kubelet's next sync, typically within about a minute; volumes mounted with subPath never receive the update). Your app can use a file watcher to reload config without restarting. This is how you get secret rotation without downtime. The pattern: mount to '/run/secrets/payment-gateway-key', read with fs.readFileSync, watch with chokidar or inotify.
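The reload logic itself is small. The sketch below shows the idea in Python rather than Node.js, and it swaps the file watcher for a simpler technique: checking the file's modification time on each read. `FileSecret` is a hypothetical helper, not a library class.

```python
import os


class FileSecret:
    """Re-reads a mounted secret file whenever its modification time changes."""

    def __init__(self, path: str):
        self._path = path
        self._mtime_ns = None
        self._value = None

    def get(self) -> str:
        # os.stat follows symlinks, so a swapped-in file is seen as a new mtime.
        mtime_ns = os.stat(self._path).st_mtime_ns
        if mtime_ns != self._mtime_ns:
            with open(self._path, encoding="utf-8") as handle:
                self._value = handle.read().strip()
            self._mtime_ns = mtime_ns
        return self._value


if __name__ == "__main__":
    # Path taken from the pattern above; it only exists inside a pod with the mount.
    secret_path = "/run/secrets/payment-gateway-key"
    if os.path.exists(secret_path):
        key = FileSecret(secret_path)
        print("loaded key of length", len(key.get()))
```

Calling `get()` at the point of use, instead of caching the value in a module-level constant, is what lets a rotated key take effect without a restart.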
Production Insight
Secret rotation without a restart plan creates split-brain states — half the pods on new creds, half on old.
This is the #1 cause of 'my deployment broke but I didn't change any code' incidents.
Rule: either rotate with zero-downtime via file watchers, or orchestrate a phased restart.
Also, never use environment-specific secrets in your pipeline YAML — keep them in the external manager only.
Key Takeaway
Mount secrets as files, not env vars.
Use External Secrets Operator for auto-sync.
Your CI pipeline should never touch production secrets directly — use OIDC and least-privilege roles.
And validate secrets exist before each deploy, not after a pod crashes.
Secrets Management Strategy Decision Tree
If: Secrets need to rotate without pod restart
Use: Mount as volume files with a file watcher in the app
If: Secrets change rarely and a restart is acceptable
Use: Kubernetes Secrets as env vars with a periodic pod restart
If: Using an AWS/GCP/Azure secrets manager
Use: External Secrets Operator to sync to K8s as volume mounts
If: CI pipeline needs access to secrets
Use: OIDC with least-privilege IAM roles; never store long-lived cloud credentials in GitHub Secrets
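
The file-mount pattern from the Senior Shortcut can be sketched in Python. This is a minimal polling variant, not the chokidar or inotify watcher mentioned above. The `ReloadingSecret` name and the modification-time comparison are illustrative assumptions, not part of any library.

```python
from pathlib import Path


class ReloadingSecret:
    """Re-reads a mounted secret file whenever its modification time changes."""

    def __init__(self, path):
        self._path = Path(path)
        self._mtime_ns = None
        self._value = None

    def get(self) -> str:
        # stat() follows symlinks, so a swapped link target shows up as a new mtime.
        mtime_ns = self._path.stat().st_mtime_ns
        if mtime_ns != self._mtime_ns:
            self._value = self._path.read_text().strip()
            self._mtime_ns = mtime_ns
        return self._value
```

An app would construct `ReloadingSecret("/run/secrets/payment-gateway-key")` once and call `get()` per request or per connection attempt, so a rotated key is picked up without a restart. Polling costs one `stat` per call; an event-based watcher avoids that at the price of more moving parts.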

Observability in the Pipeline: You Can't Fix What You Can't See

A pipeline that tells you 'build failed' is nearly useless. A pipeline that tells you 'integration test checkout_service_test.ts:143 — assertion failed: expected order status CONFIRMED, received PAYMENT_PENDING — flaky for 3 of last 5 runs on this branch — median test duration increased 40% this week' is a co-pilot. The gap between those two things is observability.

High-performing teams treat their pipelines as first-class systems with their own monitoring. They track pipeline duration by stage and test flakiness by test file, alongside the four DORA metrics: deployment frequency, lead time for changes, change failure rate, and mean time to recovery. If you're not measuring them, you don't know if your DevOps practice is improving or just getting more complicated.
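
To make the DORA metrics concrete, here is a rough sketch of how they could be computed from deployment records. The `Deployment` record shape and the `dora_metrics` function are hypothetical. A real setup would pull these fields from the deploy system and the incident tracker.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_metrics(deploys, window_days):
    """Compute the four DORA metrics over a window of deployments."""
    if not deploys:
        raise ValueError("no deployments in window")
    failures = [d for d in deploys if d.failed]
    restores = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time": median(d.deployed_at - d.committed_at for d in deploys),
        "change_failure_rate": len(failures) / len(deploys),
        # Mean time to recovery over failures that have been restored.
        "mttr": sum(restores, timedelta()) / len(restores) if restores else None,
    }
```

Median is used for lead time so one slow change does not dominate the number; MTTR stays a mean because that is how the metric is usually defined.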

Flaky tests are the silent killer of CI trust. Once developers start seeing random failures they learn to re-run pipelines instead of fixing failures. That habit means they also re-run real failures, which means bugs start shipping. I've seen teams with a 30% flakiness rate on their test suite who had essentially no CI — the pipeline was there but no one believed it. The fix isn't to delete the flaky tests. It's to quarantine them, track them in your issue tracker, and fix them with the same urgency you'd fix a production bug.
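
Quarantining starts with knowing which tests are flaky. The sketch below assumes you can export recent results as `(test_id, passed)` pairs. The `flaky_tests` name is hypothetical, and the 5% default mirrors the threshold used later in this section.

```python
from collections import defaultdict


def flaky_tests(results, threshold=0.05):
    """Return {test_id: failure_rate} for tests that fail intermittently.

    results: iterable of (test_id, passed) pairs from recent runs.
    """
    counts = defaultdict(lambda: [0, 0])  # test_id -> [failures, total]
    for test_id, passed in results:
        counts[test_id][1] += 1
        if not passed:
            counts[test_id][0] += 1

    flaky = {}
    for test_id, (failures, total) in counts.items():
        rate = failures / total
        # A test that always fails is broken, not flaky, so require at least one pass.
        if failures < total and rate > threshold:
            flaky[test_id] = rate
    return flaky
```

The output of this function is the quarantine list: each entry becomes a ticket, and the test is excluded from the blocking suite until it is fixed.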

One more thing: alert on pipeline performance degradation. A pipeline that quietly grows from 8 minutes to 20 minutes over two weeks is a sign of accumulating technical debt. Put a dashboard up and page the team if the median duration crosses a threshold.
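
A duration alert can be as small as the check below. It is a sketch under assumptions: the `should_alert` name is made up, durations are in seconds, and the 600-second default is only an example threshold.

```python
from statistics import median


def should_alert(recent_durations_s, threshold_s=600):
    """True when the median of recent run durations exceeds the threshold."""
    if not recent_durations_s:
        return False
    return median(recent_durations_s) > threshold_s
```

Using the median rather than the mean keeps a single stuck run from paging anyone; the alert fires only once slow runs are the norm.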

Also consider 'observability for rollbacks.' Track which SHA was deployed when, how long rollback took, and whether the rollback successfully restored the previous state. This data helps you tune your deployment strategy and set better SLOs for recovery time.

pipeline-observability.ymlYAML
# io.thecodeforge — DevOps tutorial
# Pipeline observability: tracking DORA metrics and test flakiness

name: Pipeline Telemetry

on:
  workflow_run:
    workflows: ['Checkout Service CI', 'Checkout Service CD']
    types: [completed]

jobs:
  record-pipeline-metrics:
    name: Record Pipeline Metrics
    runs-on: ubuntu-latest
    steps:
      - name: Calculate pipeline duration and outcome
        id: metrics
        run: |
          WORKFLOW_NAME="${{ github.event.workflow_run.name }}"
          WORKFLOW_CONCLUSION="${{ github.event.workflow_run.conclusion }}"
          START_TIME="${{ github.event.workflow_run.run_started_at }}"
          END_TIME="${{ github.event.workflow_run.updated_at }}"
          START_EPOCH=$(date -d "$START_TIME" +%s)
          END_EPOCH=$(date -d "$END_TIME" +%s)
          DURATION_SECONDS=$((END_EPOCH - START_EPOCH))
          echo "workflow_name=$WORKFLOW_NAME" >> $GITHUB_OUTPUT
          echo "conclusion=$WORKFLOW_CONCLUSION" >> $GITHUB_OUTPUT
          echo "duration=$DURATION_SECONDS" >> $GITHUB_OUTPUT
          echo "branch=${{ github.event.workflow_run.head_branch }}" >> $GITHUB_OUTPUT
          echo "sha=${{ github.event.workflow_run.head_sha }}" >> $GITHUB_OUTPUT

      - name: Push metrics to Datadog
        run: |
          NOW=$(date +%s)
          # Unquoted heredoc so $NOW expands; inside a single-quoted -d string
          # the shell would send the text "$(date +%s)" literally.
          curl -s -X POST "https://api.datadoghq.com/api/v1/series" \
            -H "Content-Type: application/json" \
            -H "DD-API-KEY: ${{ secrets.DATADOG_API_KEY }}" \
            -d @- <<EOF
          {
            "series": [
              {
                "metric": "ci.pipeline.duration_seconds",
                "type": "gauge",
                "points": [[$NOW, ${{ steps.metrics.outputs.duration }}]],
                "tags": [
                  "workflow:${{ steps.metrics.outputs.workflow_name }}",
                  "conclusion:${{ steps.metrics.outputs.conclusion }}",
                  "branch:${{ steps.metrics.outputs.branch }}",
                  "service:checkout-service"
                ]
              },
              {
                "metric": "ci.pipeline.runs_total",
                "type": "count",
                "points": [[$NOW, 1]],
                "tags": [
                  "workflow:${{ steps.metrics.outputs.workflow_name }}",
                  "conclusion:${{ steps.metrics.outputs.conclusion }}",
                  "service:checkout-service"
                ]
              }
            ]
          }
          EOF

      - name: Alert on repeated failures
        if: steps.metrics.outputs.conclusion == 'failure'
        run: |
          # -G with --data-urlencode keeps curl from glob-expanding the {...} tag filter.
          # Sum the point values; counting points would report buckets, not failures.
          RECENT_FAILURES=$(curl -s -G "https://api.datadoghq.com/api/v1/query" \
            --data-urlencode "from=$(date -d '1 hour ago' +%s)" \
            --data-urlencode "to=$(date +%s)" \
            --data-urlencode "query=sum:ci.pipeline.runs_total{service:checkout-service,conclusion:failure,branch:${{ steps.metrics.outputs.branch }}}.as_count()" \
            -H "DD-API-KEY: ${{ secrets.DATADOG_API_KEY }}" \
            -H "DD-APPLICATION-KEY: ${{ secrets.DATADOG_APP_KEY }}" \
            | jq '[.series[]?.pointlist[]? | .[1] // 0] | add // 0')
          if [ "$(echo "$RECENT_FAILURES >= 3" | bc)" -eq 1 ]; then
            # Events API v2 authenticates with the routing_key in the body; no Authorization header
            curl -X POST https://events.pagerduty.com/v2/enqueue \
              -H "Content-Type: application/json" \
              -d '{
                "routing_key": "${{ secrets.PAGERDUTY_ROUTING_KEY }}",
                "event_action": "trigger",
                "payload": {
                  "summary": "CI pipeline failing repeatedly: ${{ steps.metrics.outputs.workflow_name }} on ${{ steps.metrics.outputs.branch }}",
                  "severity": "warning",
                  "source": "github-actions",
                  "custom_details": {
                    "workflow": "${{ steps.metrics.outputs.workflow_name }}",
                    "branch": "${{ steps.metrics.outputs.branch }}",
                    "sha": "${{ steps.metrics.outputs.sha }}",
                    "failures_last_hour": "'$RECENT_FAILURES'",
                    "run_url": "${{ github.event.workflow_run.html_url }}"
                  }
                }
              }'
          fi
Output
Pipeline Telemetry Recorded for checkout-service CI #847
Duration: 7m 48s
Conclusion: success
Tags: workflow=Checkout Service CI, conclusion=success, branch=main, service=checkout-service
Metrics pushed to Datadog:
- ci.pipeline.duration_seconds: 468
- ci.pipeline.runs_total: 1
Failure alert check: 0 failures in last hour — no alert triggered.
Test flakiness report (separate job):
checkout_service_test.ts:143 — flaky: 3/10 runs failed in last 24h (threshold 5%)
Alert triggered: flaky test quarantined, ticket created.
The Hidden Cost of Pipeline Degradation
A pipeline that grows from 8 to 20 minutes over two weeks isn't just slower — it erodes development velocity and trust. Developers start rebasing before CI finishes, merging with outdated heads, or pushing directly to bypass checks. Set an alert on median pipeline duration. If it crosses 10 minutes, the team should drop everything to investigate. A 2-minute increase is a blip; a 12-minute increase is a disaster waiting to happen.
Production Insight
Flaky tests don't just slow you down — they destroy trust in the pipeline.
Once developers auto-retry without investigation, you've lost your safety net.
Rule: track flakiness per test file and alert when any single test fails >5% of the time.
Also, pipeline performance degradation is a leading indicator of technical debt — don't ignore it.
Key Takeaway
Measure pipeline duration by stage and flakiness by test.
Alert on repeated failures in the same branch.
If you're not tracking DORA metrics, you're flying blind.
Build rollback observability into your pipeline — you'll need it.
Pipeline Observability Decision Tree
If: You have no pipeline metrics at all
Use: Start with pipeline duration and conclusion per workflow
If: Developers are ignoring CI failures
Use: Add flakiness tracking and alert on repeated failures per branch
If: Pipeline duration is increasing over time
Use: Add per-stage duration metrics and alert on regression
If: You want to measure DevOps effectiveness
Use: Track all four DORA metrics: deploy frequency, lead time, change failure rate, MTTR

Artifact Management and Immutable Releases: Ensuring Traceability from Code to Production

I once debugged a production incident where the team couldn't tell which version of the code was running. The pod logs showed app version '1.2.3' but the git tag 'v1.2.3' had been moved twice. The build had been triggered from a different branch than the deployment thought. That three-hour post-mortem started with 'what code is actually deployed right now?' and no one could answer.

High-performing teams treat artifacts as immutable. Every build produces a uniquely identified artifact — typically a container image tagged with the git commit SHA, plus a signed attestation of the build metadata. Once pushed to the registry, that tag is never overwritten. Deployments reference the exact SHA, so you always know what's running. Rollback is trivial: just re-deploy a previous SHA.

The key rules: tag with SHA (not 'latest'), store build metadata (commit, build URL, trigger) as image labels, sign artifacts for supply chain security, and never rebuild a SHA — if you need to patch, cut a new commit and new SHA. This is the foundation of reproducibility.

One more rule many teams miss: include an SBOM (Software Bill of Materials) as part of the artifact. This lets you answer questions like 'which version of Log4j is running' in minutes, not days. Cosign can attach the SBOM to the registry entry.

Additionally, automate the promotion of immutable artifacts through environments. The same SHA that passed CI and tests in staging should be the exact SHA that goes to production — no recompilation, no 'latest' tag substitution. Use a promotion workflow that only changes the deployment manifest, never the artifact itself.
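
The "tag with SHA, never 'latest'" rule is easy to enforce mechanically. The sketch below scans manifest text for `image:` references that are neither a `sha-<40 hex>` tag (the tag convention this article uses) nor a digest. The function names are hypothetical, and the regex-based scan is a simplification of real YAML parsing.

```python
import re

SHA_TAG = re.compile(r":sha-[0-9a-f]{40}$")
DIGEST = re.compile(r"@sha256:[0-9a-f]{64}$")
IMAGE_LINE = re.compile(r"^\s*(?:-\s*)?image:\s*(\S+)", re.MULTILINE)


def is_immutable_ref(image: str) -> bool:
    """True for refs pinned to a full commit-SHA tag or a content digest."""
    return bool(SHA_TAG.search(image) or DIGEST.search(image))


def mutable_images(manifest_text: str) -> list:
    """Return every image reference in the manifest that could move over time."""
    refs = [ref.strip("\"'") for ref in IMAGE_LINE.findall(manifest_text)]
    return [ref for ref in refs if not is_immutable_ref(ref)]
```

Run as a pre-deploy check, a non-empty result from `mutable_images` fails the pipeline before a floating tag ever reaches a cluster.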

artifact-immutable-pipeline.ymlYAML
# io.thecodeforge — DevOps tutorial
# Immutable artifact pipeline: every build produces a unique, signed, tagged image

name: Build Immutable Artifact

on:
  push:
    branches: [main]

jobs:
  build-and-sign:
    name: Build & Sign Image
    runs-on: ubuntu-latest
    permissions:
      contents: write  # the final step commits and pushes the updated manifest
      packages: write
      id-token: write
    steps:
      - uses: actions/checkout@v4

      - name: Log in to container registry
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Generate unique build metadata
        id: meta
        run: |
          echo "BUILD_TIME=$(date -u +%Y-%m-%dT%H:%M:%SZ)" >> $GITHUB_OUTPUT
          echo "COMMIT=${{ github.sha }}" >> $GITHUB_OUTPUT
          echo "TRIGGER=${{ github.event_name }}" >> $GITHUB_OUTPUT
          echo "WORKFLOW=${{ github.workflow }}" >> $GITHUB_OUTPUT

      - name: Build and tag image with SHA
        id: build
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: |
            ghcr.io/org/app:sha-${{ github.sha }}
          labels: |
            org.opencontainers.image.source=${{ github.repository }}
            org.opencontainers.image.revision=${{ github.sha }}
            org.opencontainers.image.created=${{ steps.meta.outputs.BUILD_TIME }}
            io.thecodeforge.build.trigger=${{ steps.meta.outputs.TRIGGER }}
            io.thecodeforge.build.workflow=${{ steps.meta.outputs.WORKFLOW }}

      - name: Install cosign
        uses: sigstore/cosign-installer@v3

      - name: Sign the image with cosign
        run: |
          # Sign by digest, not by tag: the digest identifies exactly what was pushed
          cosign sign --yes \
            --annotations "commit=${{ github.sha }}" \
            --annotations "repo=${{ github.repository }}" \
            ghcr.io/org/app@${{ steps.build.outputs.digest }}

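      - name: Generate SBOM
        uses: anchore/sbom-action@v0
        with:
          # Scan the pushed image; a Dockerfile path alone does not describe what was built
          image: ghcr.io/org/app:sha-${{ github.sha }}
          format: spdx-json
          output-file: ${{ runner.temp }}/sbom.spdx.json

      - name: Attest SBOM to registry
        run: |
          cosign attest --yes \
            --type spdxjson \
            --predicate ${{ runner.temp }}/sbom.spdx.json \
            ghcr.io/org/app:sha-${{ github.sha }}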

      - name: Update deployment manifest with new SHA
        run: |
          sed -i "s|image: ghcr.io/org/app:.*|image: ghcr.io/org/app:sha-${{ github.sha }}|g" k8s/overlays/production/deployment-patch.yaml
          git config user.name "CI Bot"
          git config user.email "bot@example.com"
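          git add k8s/
          # Commit only when the manifest changed, so a re-run does not fail on an empty commit
          git diff --cached --quiet || git commit -m "Auto-update image to sha-${{ github.sha }}"
          git push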
Output
Build and sign completed for sha-a3f91c2
✓ Image built: ghcr.io/org/app:sha-a3f91c2
✓ Labels embedded:
- org.opencontainers.image.revision: a3f91c2
- io.thecodeforge.build.trigger: push
✓ Image signed with cosign (keyless)
✓ SBOM generated and attested
✓ K8s manifest updated to sha-a3f91c2
Artifact is immutable — never overwritten.
Rollback: change image tag to previous SHA.
Artifacts as Railway Tickets
  • The SHA is the serial number — you can always trace which train (commit) it came from.
  • 'Latest' is a reusable ticket that lets anyone board without proving identity — lose it.
  • Signatures are the ticket stamp — they prove the ticket was issued by the official authority (your build system).
  • SBOM is the passenger manifest — you know every dependency that came along for the ride.
  • Immutable means you never punch the same serial number twice — every ride is unique.
Production Insight
Teams that use 'latest' cannot roll back reliably — the tag moves with every deploy.
If a bad deploy goes out, 'latest' now points to the broken version, and rollback tries to re-deploy 'latest' which is still broken.
Rule: tag with SHA, never overwrite tags, and store full build provenance in image labels.
Also, if you're promoting artifacts across environments, make the promotion a copy operation (not a retag) to preserve immutability.
Key Takeaway
Immutable artifacts are the bedrock of reproducible deployments.
Tag with SHA, sign the image, generate an SBOM.
If you can't answer 'what's running in production right now?' in under 30 seconds, you don't have artifact management.
Promote the same SHA through environments — never rebuild or retag.
Artifact Tagging Strategy Decision Tree
If: You need precise rollback capability
Use: Tag with git commit SHA, never overwrite tags
If: You need supply chain security
Use: Sign images with cosign and attach SBOM
If: You need to trace which build produced a running image
Use: Embed build metadata (commit, trigger, workflow) as image labels
If: You need to patch a released artifact
Use: Cut a new commit and new SHA — never rebuild an existing tag

Push-Back Deployments: Why Your Rollback Is Already a Postmortem

Rollbacks are theater. You hit revert, the pipeline runs, and for the next 12 minutes your users still see the crash page you are trying to fix. Meanwhile your database migrations are irreversible, your cache is poisoned, and that half-migrated schema is now corrupting writes from both code versions. That's not a rollback — that's a triage call.

Real production safety means push-forward deployments: progressive delivery with automated regression gating. The pipeline doesn't just deploy — it monitors, measures, and decides in real time whether to continue the rollout or halt.

Your rollback becomes a single config flag: route traffic back to the previous stable version. No pipeline rebuild. No git revert. No DNS propagation panic. The key is separating artifact promotion from traffic routing. Build once, deploy everywhere, route with a knob.

Slap a metric threshold on your canary. If p99 latency or error rate breaches it, the pipeline aborts the rollout and notifies on-call. Your rollback is now instant because you never actually left the previous version serving most users.
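
The gating decision itself is a few lines of logic. This sketch is illustrative only: `CanarySample` and `gate` are hypothetical names, the default thresholds (500 ms p99, 1% error rate) are example values, and the `min_samples` rule is an added assumption so that one good reading cannot promote a release.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanarySample:
    p99_latency_ms: float
    error_rate: float


def gate(samples, max_p99_ms=500.0, max_error_rate=0.01, min_samples=3):
    """Return 'abort', 'wait', or 'promote' for the canary observed so far."""
    for s in samples:
        # Any single breach aborts: the stable version keeps serving most users.
        if s.p99_latency_ms > max_p99_ms or s.error_rate > max_error_rate:
            return "abort"
    return "promote" if len(samples) >= min_samples else "wait"
```

Keeping the decision in a pure function makes it unit-testable, which is hard to do when the same logic lives inside a shell loop in pipeline YAML.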

canary-gating-pipeline.ymlYAML
# io.thecodeforge — devops tutorial

name: canary-push-forward

on:
  push:
    branches: [main]

env:
  ARTIFACT_TAG: ${{ github.sha }}
  CANARY_PERCENT: 5
  ERROR_THRESHOLD_MS: 500
  ERROR_RATE_THRESHOLD: 0.01

jobs:
  build:
    runs-on: ubuntu-latest
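    steps:
      - uses: actions/checkout@v4  # docker build needs the source tree
      - name: Build immutable artifact
        run: |
          # Assumes the runner is already authenticated to registry.example.com
          docker build -t api:${{ env.ARTIFACT_TAG }} .
          docker tag api:${{ env.ARTIFACT_TAG }} registry.example.com/api:${{ env.ARTIFACT_TAG }}
          docker push registry.example.com/api:${{ env.ARTIFACT_TAG }}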

  deploy-canary:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - name: Route 5% traffic to canary
        run: |
          # Update service mesh or load balancer to send 5% traffic to new version
          kubectl set image deployment/api-canary api=registry.example.com/api:${{ env.ARTIFACT_TAG }}
          echo "Traffic shifted: ${{ env.CANARY_PERCENT }}%"

  observe:
    needs: deploy-canary
    runs-on: ubuntu-latest
    steps:
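      - name: Monitor canary for 60s
        run: |
          # Poll every 10s for 60s; a breach aborts, a clean window lets promote run.
          # (An unbounded loop here would never finish, so promote would never start.)
          for i in 1 2 3 4 5 6; do
            P99=$(curl -sf metrics-endpoint/p99_latency_ms)
            ERR=$(curl -sf metrics-endpoint/error_rate)
            echo "P99: $P99 ms, Error Rate: $ERR"
            # bc handles fractional values, which [ -gt ] does not
            if [ "$(echo "$P99 > ${{ env.ERROR_THRESHOLD_MS }}" | bc)" -eq 1 ] || [ "$(echo "$ERR > ${{ env.ERROR_RATE_THRESHOLD }}" | bc)" -eq 1 ]; then
              echo "Threshold breached! Aborting rollout."
              exit 1
            fi
            sleep 10
          done
          echo "Canary healthy for 60s."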

  promote:
    needs: observe
    runs-on: ubuntu-latest
    steps:
      - name: Rollout to full fleet
        run: |
          # Promote canary to all users
          kubectl set image deployment/api-prod api=registry.example.com/api:${{ env.ARTIFACT_TAG }}
          echo "Full rollout complete."
Output
Traffic shifted: 5%
P99: 210 ms, Error Rate: 0.002
P99: 225 ms, Error Rate: 0.001
...
Full rollout complete.
Senior Shortcut:
Never promote the canary to prod. Instead, shift the router to point all users to the already-tested canary instances. That way rollback is a switch flip, not a rebuild.
Key Takeaway
Push forward, never pull back. Rollbacks are for amateurs; traffic re-routing is for engineers who sleep through the night.
[Figure: Rollback vs. Push-Back. Why reverting a commit is not a safe recovery. Rollback: reverts code but not data; cache remains poisoned; DB migrations irreversible; 12-min window of crash pages. Push-Back: forward fix with data safety; cache invalidated on deploy; migration versioned with code; zero-downtime recovery path. Rollbacks are theater: push forward with safe, reversible deployments.]

Pipeline as Code: Your YAML Is Infrastructure. Treat It Like Prod.

I've seen more outages caused by a misplaced indent in a pipeline YAML than by database connection leaks. Seriously. Your CI/CD config is not a script — it's infrastructure. It deploys to prod, handles credentials, and decides what happens on failure. If you're not reviewing it like you review application code, you're one accidental rm -rf / away from a bad Friday.

Treat your pipeline YAML as first-class code. That means version control (duh), peer review, linting, and testing. Yes, testing your pipeline. Use act for local YAML validation, yamllint for formatting, and schema validation with check-jsonschema to catch structural errors before they hit the runner.

Pin your runner images and action versions. A floating ubuntu-latest becomes ubuntu-22.04, and actions/checkout@v3 becomes actions/checkout pinned to a full commit SHA (b4f9378 in the example below is shortened for display). Pinning limits supply chain attacks and keeps builds reproducible. One team I knew had their pipeline silently upgrade Node from 16 to 20 because actions/setup-node@v3 pulled a new minor. Two weeks of broken builds.

Write pipeline tests. Not integration tests — actual unit tests for your pipeline logic. If you have conditional steps or matrix builds, validate them in a staging pipeline before main. Your deployment pipeline is the most critical piece of infrastructure you own. Code review it.
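
One pipeline check that is cheap to automate is "every third-party action is pinned to a full commit SHA". The sketch below scans workflow text with a regex instead of parsing YAML, and skips local `./` actions. The `unpinned_actions` name is hypothetical.

```python
import re

USES = re.compile(r"^\s*(?:-\s*)?uses:\s*([^\s#]+)", re.MULTILINE)
FULL_SHA = re.compile(r"@[0-9a-f]{40}$")


def unpinned_actions(workflow_text: str) -> list:
    """Return every `uses:` reference not pinned to a 40-character commit SHA."""
    refs = USES.findall(workflow_text)
    return [
        ref for ref in refs
        if not ref.startswith("./")      # local actions are versioned with the repo
        and not FULL_SHA.search(ref)
    ]
```

Wired into the pull-request workflow, a non-empty result blocks the merge, so a floating `@v4` cannot slip back in during an unrelated change.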

pipeline-hardening.ymlYAML
# io.thecodeforge — devops tutorial

name: pipeline-audit

on:
  pull_request:
    paths: ['.github/workflows/*.yml', 'pipeline-tests/**']

jobs:
  lint:
    runs-on: ubuntu-22.04  # pinned, not latest
    steps:
      - uses: actions/checkout@b4f9378  # shortened for display; pin the full 40-character commit SHA
      - name: Validate YAML syntax
        run: yamllint .github/workflows/

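  validate-schema:
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@b4f9378  # shortened for display; pin the full 40-character commit SHA
      - name: Check JSON Schema of pipeline
        run: |
          pip install check-jsonschema
          # Validate pipeline against the vendored GitHub Actions workflow schema
          check-jsonschema --builtin-schema vendor.github-workflows .github/workflows/deploy.yml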

  test-matrix:
    runs-on: ubuntu-22.04
    strategy:
      matrix:
        env: [staging, production]
        include:
          - env: staging
            dry_run: true
          - env: production
            dry_run: false
    steps:
      - name: Simulate deploy
        run: |
          echo "Environment: ${{ matrix.env }}"
          echo "Dry run: ${{ matrix.dry_run }}"
          # In staging, we verify the script runs without error
          # In production, we gate on manual approval
          if [ "${{ matrix.dry_run }}" = "true" ]; then
            echo "DRY_RUN: Pipeline would execute."
          else
            echo "PROD: Requires approval gate."
          fi
Output
Environment: staging
Dry run: true
DRY_RUN: Pipeline would execute.
---
Environment: production
Dry run: false
PROD: Requires approval gate.
Production Trap:
Never use on: push for production deployments. Always require a PR with at least one reviewer. And for god's sake, add a manual approval step before any production-facing change. Automate everything except the final confirmation.
Key Takeaway
Lock your pipeline versions, lint your YAML, and review deploys like code. One bad pipeline commit costs more than one bad app commit.

Your Cloud Platform Is a Ticking Time Bomb: Stop Treating It Like a Black Box

Most teams deploy to the cloud without understanding the networking underneath. They copy-paste VPC configs from a blog post and wonder why cross-region latency kills their database writes in production.

The network is the platform. Routing, subnets, NAT gateways, security groups — these aren't ops abstractions. They are the runtime boundaries your code lives and dies inside. A misconfigured load balancer can silently drop 30% of your traffic for months before anyone notices.

The WHY: Your CI/CD pipeline deploys to a network you don't control. If you don't understand how traffic flows from the internet to your pod, you're shipping blind. Every team should have a network topology diagram that maps exactly how a request reaches production — and it should be code, not a Visio file.
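
A pipeline check for "no open inbound from the whole internet" can be sketched as below. The `IngressRule` shape and the `open_to_world` function are hypothetical; in practice the rules would come from your IaC plan output or the cloud provider's API. Treating port 3389 as sensitive alongside 22 is an added assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IngressRule:
    cidr: str
    from_port: int
    to_port: int


def open_to_world(rules, sensitive_ports=(22, 3389)):
    """Return rules that expose a sensitive port to 0.0.0.0/0 or ::/0."""
    return [
        r for r in rules
        if r.cidr in ("0.0.0.0/0", "::/0")
        and any(r.from_port <= p <= r.to_port for p in sensitive_ports)
    ]
```

A non-empty result fails the build. Port ranges are checked by containment, so a rule opening 0-65535 is caught as well as one opening only 22.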

vpc-stack.ymlYAML
# io.thecodeforge — devops tutorial

# CIDR blocks are inlined here for readability; parameterize them per environment
vpc:
  cidr: 10.0.0.0/16
  subnets:
    - name: public-a
      cidr: 10.0.1.0/24
      az: us-east-1a
      route: igw
    - name: private-b
      cidr: 10.0.2.0/24
      az: us-east-1b
      route: nat-gw

# Automate this check in your pipeline
verify:
  - subnet_peering_attached
  - no_open_0.0.0.0/0_inbound
Output
vpc-stack.yml created. Security group 0.0.0.0/0 inbound detected — failing build.
Production Trap:
If your security group allows 0.0.0.0/0 on port 22, fix it before you ship. That's not 'flexibility' — it's a breach waiting to happen.
Key Takeaway
The network is the platform. If you can't draw it, you don't understand it.

Every CI/CD pipeline has that one shell script — 300 lines of grep, sed, and unclosed if statements. It works on your laptop but explodes on a fresh Ubuntu 22.04 runner because the runner's bash and tool versions differ from yours (macOS still ships bash 3.2; Ubuntu 22.04 ships 5.1).

Scripts are infrastructure. They deploy code, mutate state, and fail silently. The WHY: A single unquoted variable in a shell script can wipe a production database. A YAML pipeline that calls a dozen shell scripts is a distributed monolith of pain — each script is a hidden failure domain with zero observability.

Fix it. Move shell logic to Go, Python, or at least use shellcheck in pre-commit. Every script needs a shebang, set -euo pipefail, and logs. If it doesn't exit with a non-zero code on failure, it's not a script — it's a wish.
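
The two mechanical rules above (a shebang and `set -euo pipefail`) can be checked before a script is allowed to merge. This is a sketch with a hypothetical `script_violations` helper; it complements shellcheck and does not replace it.

```python
import re


def script_violations(text: str) -> list:
    """Return the baseline rules a shell script breaks."""
    problems = []
    lines = text.splitlines()
    if not lines or not lines[0].startswith("#!"):
        problems.append("missing shebang")
    if not re.search(r"^\s*set -euo pipefail\b", text, re.MULTILINE):
        problems.append("missing 'set -euo pipefail'")
    return problems
```

Running it over every `*.sh` file in a pre-commit hook or lint job, and failing on any non-empty result, turns the rule from a convention into a gate.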

ci-script-check.ymlYAML
# io.thecodeforge — devops tutorial

# Enforce minimal standards on all scripts
stages:
  - lint
  - deploy

lint-scripts:
  stage: lint  # without an explicit stage, the job defaults to "test", which is not declared above
  image: ubuntu:22.04
  script:
    - apt-get update && apt-get install -y shellcheck
    - shellcheck deploy.sh
  rules:
    - changes: [ "*.sh" ]

deploy:
  stage: deploy
  script:
    - ./deploy.sh prod
  allow_failure: false
Output
deploy.sh: line 23: $1: unbound variable — pipeline failed.
Senior Shortcut:
Write a 10-line Python script instead of a 50-line bash mess. You get real error handling, argument parsing, and cross-platform compatibility for free.
Key Takeaway
If it can fail silently, it will — and it will fail in production.

Scripting Is the Glue That Holds Your Pipeline Together—Until It Breaks

Most DevOps pipelines fail not because of bad architecture, but because the scripts gluing stages together are fragile, untested, and environment-dependent. A shell script that works locally often breaks in CI because you relied on a default PATH, a specific OS version, or a tool installed 'somewhere.' Why this matters: a failed script mid-deployment can leave your system in an inconsistent state, requiring manual rollback. Instead of writing ad-hoc shell spaghetti, enforce three rules: use static analysis (shellcheck), pin versions of every dependency, and structure scripts as pure functions (input in, output out, no side effects). Never embed secrets in scripts—use a vault. Every script should fail fast with a clear exit code. Prefer Go or Python for complex logic; reserve shell for one-liners. Scripts are infrastructure. Treat them like prod.
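
As an illustration of "fail fast with a clear exit code", here is a small Python step runner that could replace a chain of shell calls. The `run_steps` name and the step tuples are hypothetical; the commands in the usage note are the ones used in this article's examples.

```python
import subprocess
import sys


def run_steps(steps):
    """Run each (name, command) in order; stop at the first failure.

    Returns 0 on success, or the failing command's exit code.
    """
    for name, cmd in steps:
        print(f"[step] {name}: {' '.join(cmd)}", flush=True)
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print(f"[fail] {name} exited with {result.returncode}", file=sys.stderr)
            return result.returncode
    return 0
```

A pipeline entry point would end with `sys.exit(run_steps([("lint", ["shellcheck", "deploy.sh"]), ("deploy", ["./deploy.sh", "prod"])]))`. Commands are passed as argument lists, so there is no shell word-splitting to get wrong, and the failing step's name ends up in the log.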

ci-pipeline.ymlYAML
# io.thecodeforge — devops tutorial

pipeline:
  stages:
    - name: validate-scripts
      script: shellcheck deploy.sh rollback.sh
    - name: test
      script: |
        set -euo pipefail
        go test ./...
    - name: build
      image: golang:1.22
      steps:
        - go build -o app .
    - name: deploy
      script: |
        set -euo pipefail
        ENV=${ENVIRONMENT:?required}
        ./deploy.sh "$ENV"
      secrets:
        - VAULT_TOKEN
Output
All stages local. Shellcheck passes. Build exits 0. Deploy uses vault token.
Production Trap:
Copying a script from Stack Overflow without understanding its side effects is how production gets wiped. Always test scripts in isolation first.
Key Takeaway
Every script must pass shellcheck, pin dependencies, and fail with a clear exit code—no exceptions.

Building Your DevOps Culture: Trust Must Be Replaced by Automation

DevOps culture isn't about tools—it's about eliminating the fear of deployment. If your team hesitates to push to production on a Friday, your culture is broken. Why: manual gates, hero deploys, and tribal knowledge create bottlenecks and blame games. The real fix: automate everything that causes hesitation. Start by making deployments reversible (instant rollback via feature flags and canary releases). Then enforce shared ownership: anyone can deploy, but only through a verified pipeline. Stop rewarding 'firefighters'—reward teams that build self-healing systems. Daily standups about 'ops' don't fix culture; forcing devs to own their code in production does. Pair rotating on-call responsibilities with postmortems that never assign blame. Finally, measure success by deployment frequency and mean time to recover (MTTR), not by uptime theater. Culture is the system you build—if you want trust, stop needing it.

culture-checklist.ymlYAML
# io.thecodeforge — devops tutorial

team_rules:
  - deploy_any_time: true
  - rollback_under_10_seconds: true
  - no_manual_gates: true
  - postmortem_blame_free: true
metrics:
  - deployment_frequency: daily
  - mttr: under_15_mins
  - toil_hours_per_week: under_2
Output
Team deploys 5x per day. MTTR average 8 minutes. No manual gates. Zero blame postmortems.
Production Trap:
Hiring a 'DevOps engineer' won't fix culture. If you still have a separate ops team doing deployments, you're building a silo, not a culture.
Key Takeaway
Culture is measured by deployment frequency and MTTR. Automate trust out of the equation.

Introduction — Day 1

DevOps is not a role or a toolset; it's a cultural and technical shift that demands you stop treating operations as a separate phase. Day 1 is the moment you accept that every commit is a potential deployment, every environment is ephemeral, and every failure is a data point. The WHY is simple: traditional handoffs waste time, breed blame, and create friction. Instead of siloed teams throwing code over a wall, DevOps forces shared ownership of the entire lifecycle from code to production. This means your first actions aren't about tools, but about aligning incentives. You must eliminate the 'it works on my machine' fallacy by standardizing environments early. Start with a single service, a single pipeline, and a single source of truth for configuration. Treat this as a science experiment — measure everything. If your team cannot explain how a change reaches production with confidence, you haven't started. Day 1 is about building the mental model: automation before manual heroics, observability before firefighting.

day1-baseline.ymlYAML
# io.thecodeforge — devops tutorial
name: baseline
on:
  push:
    branches: [main]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
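      - name: Lint commit message
        run: |
          msg=$(git log -1 --pretty=%s)
          echo "Commit: $msg"
          # Assumed convention: a Conventional Commits-style prefix such as "feat: ..."
          echo "$msg" | grep -Eq '^(feat|fix|docs|chore|refactor|test)(\([^)]+\))?: .+' || { echo "Commit message does not follow the convention"; exit 1; }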
      - name: Check for config drift
        run: |
          diff config/expected config/actual || echo "Drift detected"
Output
Commit: feat: add health endpoint
Drift detected
Production Trap:
Don't automate a broken process. Day 1 is about understanding your current workflow first, not forcing YAML onto chaos.
Key Takeaway
The foundation of DevOps is a shared mental model for how code reaches production, not which tools you install.

Let the journey begin

Once the baseline is set, the journey begins with creating feedback loops that outrun your mistakes. The WHY: velocity without safety is sabotage. Every pipeline change must be tested against a staging environment that mirrors production as closely as possible, not a half-configured VM from last year. Start by mapping your current bottleneck: is it the build time, the test execution, or the manual approval gate? Automate that first. Then introduce feature flags to decouple deployment from release, so you can push code without exposing it to all users. The journey is iterative: you will refactor your pipeline as often as your application code. Embrace small batch sizes: deploy every merged pull request, not weekly batches. Monitor deployment frequency and change failure rate as your primary metrics. If you find yourself writing postmortems for events that a pre-commit hook could have caught, you have not journeyed far enough. The ultimate goal is a state where deployment becomes an invisible, boring event. SREs shouldn't notice your releases. Users definitely shouldn't.

journey-feature-flag.yml (YAML)
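# io.thecodeforge - devops tutorial
name: deploy-with-flag
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    env:
      FEATURE_FLAG: "new_checkout_v2"
      # Assumption: the canary endpoint URL is stored as a repository secret with this name.
      CANARY_ENDPOINT: ${{ secrets.CANARY_ENDPOINT }}
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to canary
        run: |
          echo "Deploying to canary cluster"
          # --fail makes curl exit non-zero on an HTTP error, so the step fails too.
          curl --fail -X POST "$CANARY_ENDPOINT" \
            -H "Content-Type: application/json" \
            -d "{\"flag\":\"$FEATURE_FLAG\"}"
      - name: Health check
        run: |
          # Give the canary a few seconds to start before probing it.
          sleep 5
          # The default shell stops on the first failing command, so a failed
          # probe fails the step before the success message is printed.
          curl -f http://canary.health
          echo "Health check passed"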
Output
Deploying to canary cluster
Health check passed
Production Trap:
Feature flags accrue debt. Remove them once the feature stabilizes or they become permanent 'if-else' chains.
Key Takeaway
The journey is over when deployment is a non-event — not because nothing changes, but because changes are invisible and safe.
● Production incident · POST-MORTEM · severity: high

The Silent Deployment: How a Skipped Build Caused a 2-Hour Outage

Symptom
After a routine merge to main, the pipeline reported 'success' but the staging environment showed no new code. A day later, the production deployment went through — same pipeline, same 'success' label — but the new feature was missing. Customers started seeing outdated checkout flows and payment errors.
Assumption
The team assumed that if the pipeline passes and the rollout completes, the new code must be running. They also assumed that 'needs' dependencies in GitHub Actions would fail the pipeline if a required job was skipped.
Root cause
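The build-and-push job was guarded by if: github.ref == 'refs/heads/main' && github.event_name == 'push'. The workflow run that followed the merge was triggered by a pull_request event rather than a push, so the condition evaluated to false and the build job was skipped. The deploy job had needs: [build-and-push], but because the build was skipped (not failed), the deploy job ran anyway using the old image tag. Note that GitHub Actions skips the dependents of a skipped job by default; a downstream job opts out of that default once its own if: uses a status function such as always(), and from then on nothing checks the build result unless you check it explicitly. The 'latest' tag had already been moved by a previous successful build.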
Fix
Changed the build trigger to also run on pull_request events (or use always() with explicit status checks). Added a check in the deploy job to verify that the image digest actually changed from the previous deployment. Added a smoke test that validates a specific version endpoint exposed by the application.
Key lesson
  • A skipped job is not a failed job — needs doesn't protect you from skips.
  • Use explicit if: needs.build-and-push.result == 'success' in downstream jobs.
  • Always validate the deployed artifact: check its hash, version, or commit SHA post-deployment.
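
The three lessons above can be combined in one workflow. What follows is a minimal sketch, not the team's actual pipeline: the registry path registry.example.com/app, the staging URL, and the /version endpoint are hypothetical, and registry login is omitted for brevity. It shows the deploy job checking the build result explicitly, deploying the SHA-tagged image, and then verifying what is actually running.

```yaml
# Sketch only: registry path, staging URL, and /version endpoint are placeholders.
name: build-and-deploy
on:
  push:
    branches: [main]
jobs:
  build-and-push:
    runs-on: ubuntu-latest
    outputs:
      image_tag: ${{ steps.meta.outputs.image_tag }}
    steps:
      - uses: actions/checkout@v4
      - id: meta
        # Expose the commit SHA as the image tag for downstream jobs.
        run: echo "image_tag=${GITHUB_SHA}" >> "$GITHUB_OUTPUT"
      - name: Build and push (registry login omitted)
        run: |
          docker build -t "registry.example.com/app:${GITHUB_SHA}" .
          docker push "registry.example.com/app:${GITHUB_SHA}"
  deploy:
    needs: [build-and-push]
    # A skipped build has result 'skipped', so this condition is false and
    # the deploy cannot run against a stale image.
    if: needs.build-and-push.result == 'success'
    runs-on: ubuntu-latest
    steps:
      - name: Deploy the exact image that was just built
        run: |
          # Replace with the real deploy command; the tag comes from the build job.
          echo "Deploying registry.example.com/app:${{ needs.build-and-push.outputs.image_tag }}"
      - name: Verify the running version
        run: |
          # Hypothetical endpoint returning the commit SHA the app was built from.
          deployed=$(curl -fsS https://staging.example.com/version)
          test "$deployed" = "${GITHUB_SHA}"
```

The final step is the one that would have caught this incident: it fails the run whenever the environment reports a different commit than the one that triggered the pipeline, regardless of why the stale image got there.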
Production debug guide
Common symptoms and the exact actions to take when your pipeline lies to you · 5 entries
Symptom · 01
Pipeline reports success but no changes appear in the environment
Fix
Check the image tag in the running pod (kubectl get pod -o yaml | grep image). Compare with the expected SHA from the build. If they match, check if the application cache is stale. If they don't match, look for a skipped build job or a misplaced 'if' condition.
Symptom · 02
Deployment rollout hangs at 0% progress
Fix
Check pod events: kubectl describe pod. Look for ImagePullBackOff or CrashLoopBackOff. Verify the registry credentials are correct and the image exists. Check node capacity with kubectl describe node.
Symptom · 03
Secrets missing in the running container despite pipeline success
Fix
Check if the secret exists in the namespace: kubectl get secrets. If it's an ExternalSecret, check the operator logs. Verify the secret key names match what the deployment expects. If using env vars, note that they don't update on rotation — consider switching to volume mounts.
Symptom · 04
Flaky test failures that disappear on retry
Fix
Quarantine the test immediately — mark it as flaky in your test framework. Create a Jira ticket and assign it. Check if the test has any shared mutable state, timing dependencies, or relies on real network calls. After quarantine, run the test 100 times locally to confirm root cause.
Symptom · 05
Pipeline duration has doubled over the last week
Fix
Look at stage-level duration logs. Likely a new heavy integration test or an inefficient build cache. Check if npm ci is being used or if the package-lock.json changed. Examine Docker layer caching — builds may be re-downloading base layers if cache-from is misconfigured.
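
For the layer-caching check in Symptom 05, a correctly wired cache might look like the fragment below. This is a sketch under assumptions: it uses the docker/setup-buildx-action and docker/build-push-action actions with the GitHub Actions cache backend, and the image name is a placeholder. Other cache backends (registry, local) follow the same cache-from / cache-to pairing.

```yaml
# Sketch: a fragment of a GitHub Actions job; the image name is a placeholder.
steps:
  - uses: actions/checkout@v4
  - uses: docker/setup-buildx-action@v3
  - name: Build with layer cache
    uses: docker/build-push-action@v6
    with:
      context: .
      push: false
      tags: registry.example.com/app:${{ github.sha }}
      # Reuse layers from earlier runs via the GitHub Actions cache backend.
      cache-from: type=gha
      # mode=max also stores intermediate layers, not only those of the final image.
      cache-to: type=gha,mode=max
```

If cache-to is missing, every run reads from an empty cache, which is one common way a build quietly slows down.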
★ CI/CD Quick Debug Cheat Sheet
The three most common pipeline failures and how to fix them in under 5 minutes
Deployed app doesn't reflect the latest commit
Immediate action
Check pod image tag and compare with expected build SHA
Commands
kubectl get pods -n <ns> -o jsonpath='{.items[0].spec.containers[0].image}'
Check the build log for the pushed image digest: grep 'digest:' build.log
Fix now
If the image is wrong, trigger a manual rebuild: gh workflow run deploy.yml. If the deployment used 'latest', recreate the pod with the correct SHA-tagged image.
Pipeline fails with 'connection refused' for database
Immediate action
Check if the service container is healthy, not just started
Commands
docker compose ps --all | grep db | grep -q healthy; echo $?
docker compose logs db | tail -20
Fix now
Add healthcheck to the database service and use condition: service_healthy in the depends_on block. Run the pipeline again.
Test flakiness causing random CI failures
Immediate action
Isolate the flaky test, don't just retry
Commands
for i in $(seq 1 50); do npx jest <flaky_file> --silent >/dev/null 2>&1 && echo PASS || echo FAIL; done | sort | uniq -c
Check test isolation: look for shared mutable state between tests
Fix now
Add @flaky marker to the test, set test framework to retry 2 times max, create ticket to fix within 2 sprints. Meanwhile, add a flakiness threshold in CI that alerts but doesn't block the whole pipeline.
CI/CD Pipeline Strategies Comparison
Strategy | Best for | Rollback time | Traffic impact | Complexity
Blue-Green | Infrastructure changes, DB upgrades | Instant (DNS switch) | Zero-downtime | Medium
Canary | Application code with unknown impact | Gradual (traffic rebalance) | Partial exposure | High
Feature Flags | Decoupling deployment from release | Instant (toggle off) | Zero-downtime | Low
Rolling Update | Standard app updates with minimal risk | Progressive rollback | Minimal | Low
Shadow Deployment | Validating new versions with mirrored traffic | None needed | No impact | Very High

Key takeaways

1
Order pipeline stages by execution time: catch cheap failures first, fail fast and cheap.
2
Use canary deployments with automated business-level analysis for application changes.
3
Mount secrets as files, not env vars, and validate their existence before deploying.
4
Track DORA metrics and pipeline duration trends: alert on degradation before trust erodes.
5
Tag every artifact with its git commit SHA and sign it: never use :latest in production.
6
A skipped job is not a failed job: add explicit status checks in downstream stages.

Common mistakes to avoid

5 patterns
×

Using depends_on without a healthcheck

Symptom
API crashes on startup with ECONNREFUSED because the database container started but is not yet ready to accept connections.
Fix
Add a healthcheck block to the database service using pg_isready, then use condition: service_healthy in the API depends_on block.
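
A minimal Docker Compose sketch of that fix follows. The service names, image tag, and credentials are placeholders chosen for illustration, and the interval and retry values are assumptions to tune for your database.

```yaml
# Sketch: service names, image tag, and credentials are placeholders.
services:
  db:
    image: postgres:16
    environment:
      POSTGRES_USER: app
      POSTGRES_PASSWORD: example   # placeholder; inject a real secret in CI
      POSTGRES_DB: app
    healthcheck:
      # pg_isready exits 0 only once Postgres accepts connections.
      test: ["CMD-SHELL", "pg_isready -U app -d app"]
      interval: 5s
      timeout: 3s
      retries: 10
  api:
    build: .
    depends_on:
      db:
        # Wait for the healthcheck to pass, not merely for the container to start.
        condition: service_healthy
```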
×

Storing secrets as environment variables in the pipeline YAML

Symptom
Secret rotation requires a full pipeline restart; secrets leaked in logs or build artifacts.
Fix
Use OIDC-based authentication to pull secrets from a vault at deploy time, and mount them as files in the container.
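
On Kubernetes, mounting a secret as files could look like the sketch below. It assumes the vault integration has already synced the value into a Secret named db-credentials; that name, the Deployment name, the image, and the mount path are all hypothetical.

```yaml
# Sketch: resource names, image, and mount path are hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: checkout-api
  template:
    metadata:
      labels:
        app: checkout-api
    spec:
      containers:
        - name: api
          image: registry.example.com/checkout-api:3f2a9c1   # SHA tag, never :latest
          volumeMounts:
            - name: db-credentials
              mountPath: /etc/secrets/db   # each Secret key becomes a file here
              readOnly: true
      volumes:
        - name: db-credentials
          secret:
            secretName: db-credentials
```

The kubelet refreshes files projected from a Secret volume after the Secret changes (this does not apply to subPath mounts), so rotation needs no restart as long as the application re-reads the file instead of caching the value at startup.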
×

Using the :latest tag for production deployments

Symptom
Cannot roll back reliably because :latest points to the broken version; unknown which commit is actually running.
Fix
Tag every image with its git commit SHA. Never overwrite tags. Use SHA for all production deployments.
×

Putting long-running E2E tests before fast linting checks

Symptom
Developers wait 30+ minutes to discover a missing semicolon; they start bypassing the pipeline.
Fix
Order pipeline stages by execution time ascending. Lint and type-check first, unit tests second, integration tests third, E2E last.
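
As a rough illustration of that ordering in GitHub Actions, the sketch below chains the stages with needs so that each slower stage starts only after the cheaper one has passed. The npm script names (lint, test:unit, test:integration, test:e2e) are hypothetical.

```yaml
# Sketch: the npm script names are hypothetical.
name: ci
on:
  pull_request:
jobs:
  lint:               # seconds: cheapest signal first
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm run lint
  unit:               # starts only if lint passed
    needs: [lint]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm run test:unit
  integration:
    needs: [unit]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm run test:integration
  e2e:                # slowest stage runs last
    needs: [integration]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm run test:e2e
```

A strictly serial chain trades total wall-clock time for cheap early failure; stages of similar cost can share the same needs and run in parallel.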
×

Not separating readiness and liveness probes

Symptom
Kubernetes kills healthy pods under load because the liveness probe includes a database check that times out during a slow backend.
Fix
Use separate endpoints: /health/live for internal process health only, /health/ready for dependency checks. A pod can be live but not ready.
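
In a pod spec, the split might look like the fragment below. The container name, image, port 8080, and the timing values are assumptions for illustration; the two paths are the ones named in the fix above.

```yaml
# Sketch: container-level fragment of a pod spec; name, image, port, and timings are assumptions.
containers:
  - name: api
    image: registry.example.com/checkout-api:3f2a9c1
    ports:
      - containerPort: 8080
    livenessProbe:
      # In-process health only. Failing this restarts the container.
      httpGet:
        path: /health/live
        port: 8080
      periodSeconds: 10
      failureThreshold: 3
    readinessProbe:
      # Dependency checks. Failing this only removes the pod from Service
      # endpoints, so a slow database does not trigger restarts.
      httpGet:
        path: /health/ready
        port: 8080
      periodSeconds: 5
      failureThreshold: 3
```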
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
What are the four DORA metrics and why do they matter?
Q02 · SENIOR
How do you handle secret rotation in a CI/CD pipeline without causing do...
Q03 · SENIOR
Explain the difference between a skipped job and a failed job in GitHub ...
Q04 · SENIOR
When would you choose a canary deployment over a blue-green deployment?
Q05 · SENIOR
What steps would you take to fix a flaky test that is causing random CI ...
Q01 of 05 · SENIOR

What are the four DORA metrics and why do they matter?

ANSWER
DORA metrics are: Deployment Frequency (how often you deploy to production), Lead Time for Changes (time from commit to production), Change Failure Rate (percentage of deployments causing failures), and Mean Time to Recovery (time to restore service after a failure). They matter because they provide a standardised way to measure DevOps performance. High-performing teams deploy multiple times per day with a change failure rate under 5%, while low performers deploy monthly with higher failure rates. Tracking these metrics tells you whether your CI/CD improvements actually work.
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
Should I run integration tests on every branch push?
02
How do I set up healthchecks in Docker Compose for CI/CD?
03
What's the fastest way to debug a deployment that didn't pick up the latest code?
04
Why is it dangerous to use :latest in production deployments?
05
How do I handle database migrations in a CI/CD pipeline without downtime?