CI/CD Best Practices: What High-Performing DevOps Teams Do Differently
- Order pipeline stages by execution time ascending, not by importance: failing a 45-minute integration suite before a 30-second type check burns compute and developer patience for no reason
- The moment your team starts re-running pipelines without investigating failures, your CI is dead. You have a retry button, not a quality gate. Flakiness tracking and mandatory failure investigation are cultural and technical requirements, not optional extras
- Never use :latest in production Kubernetes deployments: SHA-tagged images are the only way to get meaningful rollbacks, reproducible environments, and an audit trail that doesn't lie to you at 2am
A fintech team I worked with was deploying to production manually every two weeks. One Friday afternoon, a developer copy-pasted a database migration script into the wrong environment, wiped a staging database that was serving as a shadow clone of prod, and triggered a three-hour incident that nearly became a four-hour customer-facing outage. The root cause wasn't the mistake; humans make mistakes. The root cause was that no automated gate existed to catch it.
CI/CD isn't a tool. It's a philosophy that says 'the longer you wait to integrate and ship, the more expensive your mistakes get.' The average high-performing team deploys to production multiple times per day with a change failure rate under 5%. The average low-performing team deploys once a month and spends 40% of its engineering time on unplanned work: firefighting regressions, rolling back broken releases, and manually babysitting deployments. Those aren't different companies. They're the same company, two years apart, before and after it got serious about CI/CD.
By the end of this article, you'll know exactly how to structure a pipeline that catches failures before they reach production, which quality gates actually matter and which ones slow you down for no gain, where pipelines break down at scale and what to do about it, and how to roll out changes without taking the whole system down. You won't just understand CI/CD; you'll be able to walk into an existing codebase and diagnose exactly why its pipeline is failing its team.
Pipeline Architecture: Why Most Teams Build It Backwards
Most teams design their CI pipeline by asking 'what checks should we run?' That's the wrong question. The right question is 'in what order should failures be discovered, and what's the cost of discovering them late?' Every stage of your pipeline is a trade-off between feedback speed and coverage depth. If you put your 45-minute integration test suite before your 30-second linter, you're making every developer wait 45 minutes to learn they forgot a semicolon. I've seen this kill developer velocity at a mid-size SaaS company: engineers stopped running checks locally and just pushed to let CI run them, which turned the pipeline into a batch job instead of a fast feedback loop.
The principle is fail fast, fail cheap. Your pipeline stages should be ordered by execution time, ascending. Linting and static analysis run first; they're near-instant and catch a large share of defects. Unit tests second. Integration tests third. End-to-end tests last, gated behind a merge to a protected branch. Every stage that fails short-circuits the rest. You don't run a 30-minute E2E suite against a commit that failed a type check.
Here's a production-grade GitHub Actions pipeline for a Node.js checkout service that demonstrates this ordering. Notice the explicit stage dependencies and the parallelisation of independent checks: security scanning runs in parallel with linting because the two jobs share no state.
```yaml
# io.thecodeforge -- DevOps tutorial
# CI pipeline for a checkout service -- GitHub Actions
# Ordered by: speed (fastest gates first), then coverage depth
# Principle: catch cheap failures before running expensive ones
name: Checkout Service CI

on:
  push:
    branches: ['**']           # Run on every branch push, not just main
  pull_request:
    branches: [main, staging]  # Gate merges to protected branches

env:
  # Store non-secret config at the workflow level so every job inherits it
  # Secrets come from GitHub Secrets -- never hardcode them here
  NODE_VERSION: '20.x'
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}/checkout-service

jobs:
  # ---------------------------------------------------------------
  # STAGE 1: Sub-60-second gates
  # If these fail, nothing else runs. No point burning compute.
  # ---------------------------------------------------------------
  lint-and-typecheck:
    name: Lint & Type Check
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'npm'  # Cache node_modules by package-lock.json hash
      - name: Install dependencies
        run: npm ci     # ci installs exactly what's in package-lock -- no surprises
      - name: Run ESLint
        run: npm run lint       # Fail fast: exit code 1 kills the job immediately
      - name: TypeScript type check
        run: npm run typecheck  # Separate from build -- catches type errors without emitting JS

  # ---------------------------------------------------------------
  # STAGE 1 (parallel): Security scan
  # Runs in parallel with lint -- independent concern, same time budget
  # ---------------------------------------------------------------
  security-scan:
    name: Dependency Security Scan
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run npm audit
        # --audit-level=high: fail on HIGH or CRITICAL vulns only
        # Don't fail on moderate -- you'll be blocked forever on transitive deps
        run: npm audit --audit-level=high
      - name: SAST scan with Semgrep
        uses: returntocorp/semgrep-action@v1
        with:
          config: 'p/nodejs'  # Use the Node.js security ruleset, not the generic one

  # ---------------------------------------------------------------
  # STAGE 2: Unit tests
  # Only runs if Stage 1 passes -- needs: enforces the dependency
  # ---------------------------------------------------------------
  unit-tests:
    name: Unit Tests
    runs-on: ubuntu-latest
    needs: [lint-and-typecheck, security-scan]  # Both Stage 1 jobs must pass
    steps:
      - uses: actions/checkout@v4
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'npm'
      - name: Install dependencies
        run: npm ci
      - name: Run unit tests with coverage
        run: npm run test:unit -- --coverage
        env:
          # Unit tests must NOT touch real external services
          # These point to in-memory fakes, not real infra
          DATABASE_URL: 'sqlite::memory:'
          PAYMENT_GATEWAY_URL: 'http://localhost:9999'  # Wiremock stub
      - name: Upload coverage report
        uses: actions/upload-artifact@v4
        with:
          name: coverage-report
          path: coverage/
          retention-days: 7  # Don't keep forever -- storage costs add up
      - name: Enforce coverage threshold
        # Fail the pipeline if coverage drops below 80%
        # Don't gate on 100% -- it incentivises writing useless tests
        run: npx nyc check-coverage --lines 80 --functions 80 --branches 75

  # ---------------------------------------------------------------
  # STAGE 3: Integration tests
  # Spins up real dependencies via GitHub Actions service containers
  # Too slow for every feature branch push -- this is the speed vs.
  # coverage trade-off in action
  # ---------------------------------------------------------------
  integration-tests:
    name: Integration Tests
    runs-on: ubuntu-latest
    needs: [unit-tests]
    # Run on PRs (the pre-merge gate) and on pushes to main (so the
    # build job below, which needs this one, is never skipped)
    if: github.event_name == 'pull_request' || github.ref == 'refs/heads/main'
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_DB: checkout_test
          POSTGRES_USER: checkout_app
          POSTGRES_PASSWORD: ${{ secrets.TEST_DB_PASSWORD }}
        # Wait until Postgres is actually ready, not just started
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
        ports:
          - 5432:5432
      redis:
        image: redis:7-alpine
        options: --health-cmd "redis-cli ping" --health-interval 10s
        ports:
          - 6379:6379
    steps:
      - uses: actions/checkout@v4
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'npm'
      - name: Install dependencies
        run: npm ci
      - name: Run database migrations
        run: npm run db:migrate
        env:
          DATABASE_URL: postgres://checkout_app:${{ secrets.TEST_DB_PASSWORD }}@localhost:5432/checkout_test
      - name: Run integration tests
        run: npm run test:integration
        env:
          DATABASE_URL: postgres://checkout_app:${{ secrets.TEST_DB_PASSWORD }}@localhost:5432/checkout_test
          REDIS_URL: redis://localhost:6379
          NODE_ENV: test

  # ---------------------------------------------------------------
  # STAGE 4: Build and push Docker image
  # Only on merge to main -- you don't want an image per commit on feature branches
  # ---------------------------------------------------------------
  build-and-push:
    name: Build & Push Image
    runs-on: ubuntu-latest
    needs: [integration-tests]
    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    permissions:
      contents: read
      packages: write  # Required to push to GitHub Container Registry
    outputs:
      image-digest: ${{ steps.build.outputs.digest }}  # Pass digest to deploy job
    steps:
      - uses: actions/checkout@v4
      - name: Log in to container registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Extract metadata for image tags
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          # type=sha tags the image with the git SHA -- enables precise rollbacks
          # latest is only applied on the default branch
          tags: |
            type=sha,prefix=sha-
            type=raw,value=latest,enable={{is_default_branch}}
      - name: Build and push Docker image
        id: build
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          # Cache layers from the registry -- dramatically speeds up builds
          # Without this, every build re-downloads all base layers
          cache-from: type=registry,ref=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:buildcache
          cache-to: type=registry,ref=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:buildcache,mode=max
```
A successful run on a push to main looks like this:

```text
✓ lint-and-typecheck  (23s)
✓ security-scan       (41s)     [parallel with lint]
✓ unit-tests          (1m 12s)  [87% line coverage, threshold: 80%]
✓ integration-tests   (3m 44s)  [14 tests passed, 0 failed]
✓ build-and-push      (2m 08s)  [Pushed: ghcr.io/org/checkout-service:sha-a3f91c2]

Total wall-clock time: 7m 48s
Pipeline result: SUCCESS
Image digest: sha256:d4f2a1b9c8e3f5...
```
Deployment Strategies That Don't Gamble Your Entire User Base
Here's a mistake I've seen kill a Black Friday deployment: a team built a perfect CI pipeline, then wired it directly to 'deploy everything to all pods immediately.' The pipeline was green. The deployment destroyed a third of their order throughput because a new Redis connection pool configuration had a subtle bug that only surfaced under real production load patterns. Their rollback took 22 minutes because they had no deployment strategy; it was all or nothing.
High-performing teams don't choose between 'deploy' and 'don't deploy.' They choose how much of their traffic takes the risk first. Blue-green deployments, canary releases, and feature flags are the three weapons in this arsenal, and they solve different problems. Blue-green is great for infrastructure changes where you need a clean cutover. Canary is best for application changes where you want to validate behaviour under real traffic before full rollout. Feature flags are best for functionality that you want to decouple from deployment entirely: ship the code, turn on the feature later.
The Kubernetes deployment below shows a canary release pattern using weight-based traffic splitting. The key insight is that your health checks must be meaningful: a pod that returns 200 on '/health' but fails to process payments is worse than a pod that's down, because it silently poisons a share of your real user traffic.
```yaml
# io.thecodeforge -- DevOps tutorial
# Canary deployment pattern for checkout service on Kubernetes
# Uses: Argo Rollouts for progressive delivery
# Why Argo Rollouts over a plain Kubernetes Deployment:
#   Plain Deployments have no concept of traffic weighting or analysis steps.
#   You need a controller that understands progressive delivery semantics.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout-service
  namespace: payments
spec:
  replicas: 10  # Total desired replica count at full rollout
  selector:
    matchLabels:
      app: checkout-service
  template:
    metadata:
      labels:
        app: checkout-service
    spec:
      containers:
        - name: checkout-service
          # Always pin to a specific SHA
          # Never use :latest in production -- you lose reproducibility and rollback clarity
          image: ghcr.io/org/checkout-service:sha-a3f91c2
          ports:
            - containerPort: 3000
          # Resource requests AND limits -- both required
          # Without requests, the scheduler can't make good placement decisions
          # Without limits, one bad pod can starve its neighbours
          resources:
            requests:
              memory: '256Mi'
              cpu: '250m'
            limits:
              memory: '512Mi'
              cpu: '500m'
          readinessProbe:
            httpGet:
              path: /health/ready  # Readiness != Liveness. Readiness means 'send me traffic'
              port: 3000
            initialDelaySeconds: 10  # Give the app time to initialise DB connections
            periodSeconds: 5
            failureThreshold: 3      # 3 consecutive failures = remove from load balancer
          livenessProbe:
            httpGet:
              path: /health/live   # Liveness: 'am I deadlocked or otherwise unrecoverable?'
              port: 3000
            initialDelaySeconds: 30  # Longer delay -- a restart loop is worse than being slow
            periodSeconds: 10
            failureThreshold: 5
          env:
            - name: NODE_ENV
              value: production
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: checkout-service-secrets
                  key: database-url  # Pull from a Kubernetes Secret, never an env literal
  strategy:
    canary:
      # The canary steps define the progressive rollout
      # Argo pauses at each step, runs analysis, then proceeds or aborts
      steps:
        - setWeight: 10  # Step 1: Send 10% of traffic to the new version
        - pause:
            duration: 5m  # Wait 5 minutes -- enough for p99 latency to show anomalies
        - analysis:
            # Automated analysis before proceeding -- this is the key gate
            templates:
              - templateName: checkout-success-rate
            args:
              - name: service-name
                value: checkout-service
        - setWeight: 30  # Step 2: Increase to 30% only if analysis passed
        - pause:
            duration: 10m
        - analysis:
            templates:
              - templateName: checkout-success-rate
              # Check latency separately -- success rate can hide slowdowns
              - templateName: checkout-p99-latency
        - setWeight: 100  # Full rollout only after both analysis steps pass
      # If any analysis step fails, Argo automatically rolls back to stable.
      # There is no blind auto-promotion here: every weight increase is gated
      # on a passed analysis run.
---
# AnalysisTemplate defines WHAT to measure during canary steps
# This queries your metrics backend (Prometheus) for real production signals
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: checkout-success-rate
  namespace: payments
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 60s    # Evaluate every 60 seconds during the pause window
      count: 5         # Must get 5 consecutive passing evaluations
      successCondition: result[0] >= 0.95  # 95% success rate minimum
      failureLimit: 1  # One failure triggers automatic rollback
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc.cluster.local:9090
          query: |
            sum(rate(
              http_requests_total{service="{{args.service-name}}", status!~"5.."}[5m]
            ))
            /
            sum(rate(
              http_requests_total{service="{{args.service-name}}"}[5m]
            ))
          # This gives you: (non-5xx requests) / (all requests) = success rate
          # Only measure the canary pods, not stable -- this is why service labels matter
```
A rollout under this configuration produces a progression like:

```text
[Step 1/5] Weight: 10% → canary pods
Waiting 5m for traffic sample...
Analysis: checkout-success-rate
  Evaluation 1/5: success_rate=0.983 ✓
  Evaluation 2/5: success_rate=0.991 ✓
  Evaluation 3/5: success_rate=0.979 ✓
  Evaluation 4/5: success_rate=0.986 ✓
  Evaluation 5/5: success_rate=0.994 ✓
Analysis PASSED ✓

[Step 2/5] Weight: 30% → canary pods
Waiting 10m for traffic sample...
Analysis: checkout-success-rate + checkout-p99-latency
  success_rate=0.988 ✓  p99_latency=142ms ✓
Analysis PASSED ✓

[Step 3/5] Weight: 100% → Full rollout
All 10 replicas running sha-a3f91c2
Rollout COMPLETE ✓  Total time: 17m 23s
```
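Blue-green and canary need infrastructure support, but the third strategy, feature flags, lives entirely in application code. Here's a minimal sketch of the core idea, deterministic percentage bucketing; the `FlagRule` shape and the hashing scheme are illustrative assumptions, and a real SDK (LaunchDarkly, Unleash) adds targeting rules, audit logs, and remote configuration on top of this.

```typescript
// Deterministic percentage-rollout flag check -- a sketch, not a real
// LaunchDarkly/Unleash API. FlagRule and the hashing scheme are assumptions.
import { createHash } from "crypto";

interface FlagRule {
  enabled: boolean;       // master kill switch: toggling off is an instant rollback
  rolloutPercent: number; // 0-100: share of users who see the feature
}

// Hash the user ID so the same user always lands in the same bucket --
// a 10% rollout must not flicker between requests.
function isFeatureOn(flag: FlagRule, userId: string): boolean {
  if (!flag.enabled) return false;
  const digest = createHash("sha256").update(userId).digest();
  const bucket = digest.readUInt32BE(0) % 100; // 0..99
  return bucket < flag.rolloutPercent;
}

// Usage: gate the new code path; the old path stays as the fallback
const newPromoEngine: FlagRule = { enabled: true, rolloutPercent: 10 };
const showPromo = isFeatureOn(newPromoEngine, "user-42"); // stable per user
```

The deterministic hash is the important design choice: it gives you a gradual rollout without any server-side state, and flipping `enabled` to false is the instant rollback the comparison table credits to feature flags.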
The Secrets and Config Management Problem Nobody Talks About Until It's Too Late
I once got called into an incident at midnight because a developer had rotated an API key in AWS Secrets Manager, the application read that secret only at startup, and none of the running pods picked up the new value. The service was fine. Then someone did a routine deployment, pods restarted with the new secret, and suddenly part of the fleet was talking to the payment gateway with the old key (still cached in long-running pods) and part with the new key. The gateway's duplicate-detection logic flagged the mismatched requests and started rejecting transactions. It took 40 minutes to figure out the problem was secret rotation, not the deployment itself.
Config and secrets management is where CI/CD pipelines quietly accumulate debt. Teams hardcode environment-specific values into their pipelines, inject secrets as plain environment variables in their Kubernetes manifests, or forget to handle secret rotation without a full restart. All three will burn you.
The pattern that works: secrets live in a dedicated secrets manager (AWS Secrets Manager, HashiCorp Vault, or at minimum Kubernetes Secrets encrypted at rest). They're injected at runtime, not build time. Your application watches for secret rotation and reloads without a restart. And your CI pipeline never has access to production secrets: it uses short-lived OIDC tokens to assume the minimum necessary role.
```yaml
# io.thecodeforge -- DevOps tutorial
# Secrets management pattern: GitHub Actions + AWS OIDC + Secrets Manager
# Why OIDC instead of long-lived AWS access keys stored in GitHub Secrets?
#   Long-lived keys are a credentials leak waiting to happen.
#   OIDC issues a token per workflow run that expires in minutes.
#   No static credentials. No rotation reminders. No 'who committed this key?' post-mortems.
name: Checkout Service CD

on:
  push:
    branches: [main]

permissions:
  id-token: write  # Required for OIDC -- tells GitHub to generate an OIDC token
  contents: read

jobs:
  deploy-to-staging:
    name: Deploy to Staging
    runs-on: ubuntu-latest
    environment: staging  # GitHub Environment -- enables deployment protection rules
    steps:
      - uses: actions/checkout@v4

      # Authenticate to AWS using OIDC -- no static credentials needed
      # The AWS IAM role trusts the GitHub OIDC provider for this repo only
      - name: Configure AWS credentials via OIDC
        uses: aws-actions/configure-aws-credentials@v4
        with:
          # This role has ONLY the permissions needed for staging deployment:
          #   - ecr:GetAuthorizationToken, ecr:BatchGetImage (pull image)
          #   - eks:DescribeCluster (get kubeconfig)
          #   - secretsmanager:GetSecretValue (read staging secrets only)
          role-to-assume: arn:aws:iam::123456789:role/checkout-service-deploy-staging
          aws-region: eu-west-1
          role-session-name: checkout-service-deploy-${{ github.run_id }}

      - name: Validate secrets exist before deploying
        # Fail the pipeline here if secrets are missing -- before touching the cluster
        # Better to fail loudly in CI than silently in a running pod
        run: |
          aws secretsmanager describe-secret \
            --secret-id checkout-service/staging/database-url \
            --query 'Name' \
            --output text
          aws secretsmanager describe-secret \
            --secret-id checkout-service/staging/payment-gateway-key \
            --query 'Name' \
            --output text
          echo "All required secrets confirmed present in Secrets Manager"

      - name: Get kubeconfig for staging cluster
        run: |
          aws eks update-kubeconfig \
            --region eu-west-1 \
            --name payments-staging-cluster \
            --alias staging

      - name: Sync secrets from AWS Secrets Manager to Kubernetes
        # Using External Secrets Operator -- the right way to bridge AWS Secrets Manager and K8s
        # This creates/updates Kubernetes Secrets automatically when AWS secrets rotate
        # Pods pick up the updated secret without restarting if you mount it as a volume (not env vars)
        run: |
          # The ExternalSecret resource tells the External Secrets Operator:
          # "Watch this AWS secret, sync it here, refresh every hour"
          kubectl apply -f - <<EOF
          apiVersion: external-secrets.io/v1beta1
          kind: ExternalSecret
          metadata:
            name: checkout-service-secrets
            namespace: payments
          spec:
            refreshInterval: 1h  # Re-sync from AWS every hour -- picks up rotations
            secretStoreRef:
              name: aws-secrets-manager
              kind: ClusterSecretStore
            target:
              name: checkout-service-secrets
              creationPolicy: Owner
            data:
              - secretKey: database-url
                remoteRef:
                  key: checkout-service/staging/database-url
              - secretKey: payment-gateway-key
                remoteRef:
                  key: checkout-service/staging/payment-gateway-key
          EOF

      - name: Deploy to staging via Argo Rollouts
        run: |
          # Update only the image tag -- don't replace the entire manifest
          # This preserves any manual overrides and reduces blast radius
          kubectl argo rollouts set image checkout-service \
            checkout-service=ghcr.io/org/checkout-service:sha-${{ github.sha }} \
            --namespace payments

      - name: Wait for rollout to complete
        run: |
          # Watch the rollout with a timeout -- don't let a stuck rollout block your pipeline forever
          # 10 minutes is generous for canary + analysis steps
          kubectl argo rollouts status checkout-service \
            --namespace payments \
            --timeout 10m

      - name: Run smoke tests against staging
        # Smoke tests run AFTER the deployment -- they prove the live environment works
        # Not the same as integration tests, which run in isolation pre-deploy
        run: |
          # Hit real endpoints in staging with real (test) data
          npm run test:smoke -- \
            --base-url https://checkout-staging.internal.example.com \
            --timeout 30000
        env:
          # The smoke test API key is a staging-specific test credential
          # It never touches production data
          SMOKE_TEST_API_KEY: ${{ secrets.STAGING_SMOKE_TEST_KEY }}
```
A successful staging deployment reads like:

```text
✓ AWS OIDC authentication successful
    Role: checkout-service-deploy-staging
    Session expires: 2024-01-15T14:32:00Z (1 hour)
✓ Secret validation passed
    checkout-service/staging/database-url [EXISTS]
    checkout-service/staging/payment-gateway-key [EXISTS]
✓ Kubeconfig updated for cluster: payments-staging-cluster
✓ ExternalSecret synced
    checkout-service-secrets updated in namespace payments
    Next refresh: 2024-01-15T15:00:00Z
✓ Rollout initiated: sha-a3f91c2
    Canary: 10% → Analysis passed → 30% → Analysis passed → 100%
    Rollout complete in 14m 52s
✓ Smoke tests passed
    POST /api/v1/checkout → 201 (143ms)
    GET /api/v1/orders/{id} → 200 (67ms)
    POST /api/v1/checkout/confirm → 200 (298ms)
    3/3 smoke tests passed

Deployment result: SUCCESS
```
Observability in the Pipeline: You Can't Fix What You Can't See
A pipeline that tells you 'build failed' is nearly useless. A pipeline that tells you 'integration test checkout_service_test.ts:143: assertion failed, expected order status CONFIRMED, received PAYMENT_PENDING; flaky on 3 of the last 5 runs on this branch; median test duration up 40% this week' is a co-pilot. The gap between those two things is observability.
High-performing teams treat their pipelines as first-class systems with their own monitoring. They track pipeline duration by stage, test flakiness by test file, deployment frequency, lead time for changes, change failure rate, and mean time to recovery. The last four are the DORA metrics, and if you're not measuring them, you don't know whether your DevOps practice is improving or just getting more complicated.
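Two of those metrics are simple arithmetic once you have deployment records. A sketch, with an assumed `Deployment` record shape; your deploy tooling's API is the real source of this data:

```typescript
// Back-of-envelope DORA arithmetic from deployment records.
// The Deployment shape is an assumption -- populate it from your deploy tool.
interface Deployment {
  at: Date;
  causedFailure: boolean; // did this deploy degrade service for users?
  recoveredAt?: Date;     // set once the failure was resolved
}

// Change failure rate: failed deploys / all deploys
function changeFailureRate(deploys: Deployment[]): number {
  if (deploys.length === 0) return 0;
  return deploys.filter((d) => d.causedFailure).length / deploys.length;
}

// Mean time to recovery, in minutes, averaged over resolved failures
function mttrMinutes(deploys: Deployment[]): number {
  const failed = deploys.filter((d) => d.causedFailure && d.recoveredAt);
  if (failed.length === 0) return 0;
  const totalMinutes = failed.reduce(
    (sum, d) => sum + (d.recoveredAt!.getTime() - d.at.getTime()) / 60000,
    0
  );
  return totalMinutes / failed.length;
}
```

Deployment frequency and lead time fall out of the same records (count per week; merge timestamp to deploy timestamp), which is why the telemetry job below pushes every run, not just failures.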
Flaky tests are the silent killer of CI trust. Once developers start seeing random failures, they learn to re-run pipelines instead of investigating them. That habit means they also re-run real failures, which means bugs start shipping. I've seen teams with a 30% flakiness rate on their test suite who had essentially no CI: the pipeline existed, but no one believed it. The fix isn't to delete the flaky tests. It's to quarantine them, track them in your issue tracker, and fix them with the same urgency you'd fix a production bug.
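It's worth pinning down what 'flaky' means precisely: a test that both passes and fails on the same commit. A sketch of that bookkeeping, with an assumed `TestRun` record shape you'd populate from your CI provider's API:

```typescript
// Flakiness detection sketch. TestRun is an assumed shape -- populate it
// from your CI provider's API (e.g. per-run JUnit reports).
interface TestRun {
  testFile: string;
  commitSha: string;
  passed: boolean;
}

// A test is flaky when it has BOTH a pass and a fail on the same commit:
// identical code, different outcomes. A test that always fails is just broken.
function flakyTests(runs: TestRun[], minRuns = 5): string[] {
  const byTest = new Map<string, TestRun[]>();
  for (const r of runs) {
    const list = byTest.get(r.testFile) ?? [];
    list.push(r);
    byTest.set(r.testFile, list);
  }

  const flaky: string[] = [];
  for (const [file, fileRuns] of byTest) {
    if (fileRuns.length < minRuns) continue; // not enough signal yet
    const outcomesByCommit = new Map<string, Set<boolean>>();
    for (const r of fileRuns) {
      const outcomes = outcomesByCommit.get(r.commitSha) ?? new Set<boolean>();
      outcomes.add(r.passed);
      outcomesByCommit.set(r.commitSha, outcomes);
    }
    // any commit observed with both outcomes marks the test as flaky
    if (Array.from(outcomesByCommit.values()).some((s) => s.size === 2)) {
      flaky.push(file);
    }
  }
  return flaky;
}
```

Run something like this nightly and feed the result into your quarantine list and issue tracker; the `minRuns` floor keeps you from flagging a test on a single coincidence.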
```yaml
# io.thecodeforge -- DevOps tutorial
# Pipeline observability: tracking DORA metrics and test flakiness
# This workflow runs after every workflow completion (success OR failure)
# It pushes pipeline telemetry to your metrics backend
name: Pipeline Telemetry

on:
  workflow_run:
    workflows: ['Checkout Service CI', 'Checkout Service CD']
    types: [completed]  # Triggers on both success and failure

jobs:
  record-pipeline-metrics:
    name: Record Pipeline Metrics
    runs-on: ubuntu-latest
    steps:
      - name: Calculate pipeline duration and outcome
        id: metrics
        run: |
          # GitHub provides start/end times for workflow runs in the event payload
          # We push these as custom metrics to track pipeline performance over time
          WORKFLOW_NAME="${{ github.event.workflow_run.name }}"
          WORKFLOW_CONCLUSION="${{ github.event.workflow_run.conclusion }}"
          # conclusion values: success | failure | cancelled | skipped | timed_out
          START_TIME="${{ github.event.workflow_run.run_started_at }}"
          END_TIME="${{ github.event.workflow_run.updated_at }}"

          # Convert to epoch for arithmetic
          START_EPOCH=$(date -d "$START_TIME" +%s)
          END_EPOCH=$(date -d "$END_TIME" +%s)
          DURATION_SECONDS=$((END_EPOCH - START_EPOCH))

          echo "workflow_name=$WORKFLOW_NAME" >> "$GITHUB_OUTPUT"
          echo "conclusion=$WORKFLOW_CONCLUSION" >> "$GITHUB_OUTPUT"
          echo "duration=$DURATION_SECONDS" >> "$GITHUB_OUTPUT"
          echo "branch=${{ github.event.workflow_run.head_branch }}" >> "$GITHUB_OUTPUT"
          echo "sha=${{ github.event.workflow_run.head_sha }}" >> "$GITHUB_OUTPUT"

      - name: Push metrics to Datadog
        run: |
          # Push pipeline metrics as custom Datadog metrics
          # These feed into dashboards tracking DORA metrics
          curl -X POST "https://api.datadoghq.com/api/v1/series" \
            -H "Content-Type: application/json" \
            -H "DD-API-KEY: ${{ secrets.DATADOG_API_KEY }}" \
            -d '{
              "series": [
                {
                  "metric": "ci.pipeline.duration_seconds",
                  "type": "gauge",
                  "points": [['"$(date +%s)"', ${{ steps.metrics.outputs.duration }}]],
                  "tags": [
                    "workflow:${{ steps.metrics.outputs.workflow_name }}",
                    "conclusion:${{ steps.metrics.outputs.conclusion }}",
                    "branch:${{ steps.metrics.outputs.branch }}",
                    "service:checkout-service"
                  ]
                },
                {
                  "metric": "ci.pipeline.runs_total",
                  "type": "count",
                  "points": [['"$(date +%s)"', 1]],
                  "tags": [
                    "workflow:${{ steps.metrics.outputs.workflow_name }}",
                    "conclusion:${{ steps.metrics.outputs.conclusion }}",
                    "service:checkout-service"
                  ]
                }
              ]
            }'

      - name: Alert on repeated failures
        # If the same workflow has failed 3 times in the last hour, page the on-call
        # This catches the "everyone's re-running hoping it fixes itself" pattern
        if: steps.metrics.outputs.conclusion == 'failure'
        run: |
          # Query Datadog for the recent failure count on this branch
          RECENT_FAILURES=$(curl -s \
            "https://api.datadoghq.com/api/v1/query?from=$(date -d '1 hour ago' +%s)&to=$(date +%s)&query=sum:ci.pipeline.runs_total{service:checkout-service,conclusion:failure,branch:${{ steps.metrics.outputs.branch }}}.as_count()" \
            -H "DD-API-KEY: ${{ secrets.DATADOG_API_KEY }}" \
            -H "DD-APPLICATION-KEY: ${{ secrets.DATADOG_APP_KEY }}" \
            | jq '.series[0].pointlist[-1][1] // 0')
          echo "Recent failures on this branch: $RECENT_FAILURES"

          if [ "$(echo "$RECENT_FAILURES >= 3" | bc)" -eq 1 ]; then
            # Send a PagerDuty alert -- 3 consecutive failures is a real problem, not flakiness
            # Events API v2 authenticates via the routing_key in the body
            curl -X POST https://events.pagerduty.com/v2/enqueue \
              -H 'Content-Type: application/json' \
              -d '{
                "routing_key": "${{ secrets.PAGERDUTY_ROUTING_KEY }}",
                "event_action": "trigger",
                "payload": {
                  "summary": "CI pipeline failing repeatedly: ${{ steps.metrics.outputs.workflow_name }} on ${{ steps.metrics.outputs.branch }}",
                  "severity": "warning",
                  "source": "github-actions",
                  "custom_details": {
                    "workflow": "${{ steps.metrics.outputs.workflow_name }}",
                    "branch": "${{ steps.metrics.outputs.branch }}",
                    "sha": "${{ steps.metrics.outputs.sha }}",
                    "failures_last_hour": "'"$RECENT_FAILURES"'",
                    "run_url": "${{ github.event.workflow_run.html_url }}"
                  }
                }
              }'
            echo "PagerDuty alert sent for repeated CI failures"
          fi
```
When a branch starts failing repeatedly, the telemetry job's output looks like:

```text
✓ Metrics calculated
    Workflow: Checkout Service CI
    Conclusion: failure
    Duration: 312 seconds
    Branch: feature/new-promo-engine
    SHA: b8e3a2f1
✓ Metrics pushed to Datadog
    ci.pipeline.duration_seconds{service:checkout-service, conclusion:failure}: 312
    ci.pipeline.runs_total{service:checkout-service, conclusion:failure}: 1
✓ Checking recent failure count on feature/new-promo-engine...
    Recent failures (last 1h): 3
⚠ Threshold exceeded (>=3 failures): sending PagerDuty alert
    Alert sent to on-call rotation: checkout-service-team
    Severity: warning
    Summary: CI pipeline failing repeatedly: Checkout Service CI on feature/new-promo-engine
```
| Deployment Strategy | Blue-Green | Canary Release | Feature Flags |
|---|---|---|---|
| Traffic control | Full cutover (0% → 100%) | Gradual (10% → 30% → 100%) | Per-user/per-segment rules |
| Rollback speed | Instant (DNS/LB switch) | Minutes (weighted traffic revert) | Instant (toggle off) |
| Infrastructure cost | 2x resource cost during deploy | 10-40% overhead during canary window | Minimal; same infrastructure |
| Validates real traffic | No; cutover happens before validation | Yes; live traffic hits canary pods | Yes; real users exercise the new feature |
| Database migration safety | Requires backward-compatible schemas | Requires backward-compatible schemas | Can gate migration behind flag |
| Best for | Infrastructure/config changes | Application code releases | Feature-level toggling, A/B tests |
| Risk surface | All-or-nothing if rollback fails | Limited blast radius at each step | Flag misconfiguration affects all users |
| Tooling required | Load balancer + duplicate environment | Argo Rollouts or Flagger | LaunchDarkly, Unleash, or custom Redis |
| DORA metric impact | Reduces change failure rate | Reduces change failure rate + MTTR | Increases deployment frequency |
🎯 Key Takeaways
- Order pipeline stages by execution time ascending, not by importance: failing a 45-minute integration suite before a 30-second type check burns compute and developer patience for no reason
- The moment your team starts re-running pipelines without investigating failures, your CI is dead. You have a retry button, not a quality gate. Flakiness tracking and mandatory failure investigation are cultural and technical requirements, not optional extras
- Never use :latest in production Kubernetes deployments: SHA-tagged images are the only way to get meaningful rollbacks, reproducible environments, and an audit trail that doesn't lie to you at 2am
- Blue-green, canary, and feature flags are not interchangeable: blue-green can't validate real traffic before cutover, canary can't decouple feature release from deployment, and feature flags can't protect you from infrastructure changes; use the right tool for the specific risk you're managing
⚠️ Common Mistakes to Avoid
- βMistake 1: Running npm install instead of npm ci in CI β Symptom: builds that pass locally fail on CI with 'peer dependency conflict' or worse, pass CI but install a subtly different dependency version than production, causing 'works on my machine' bugs β Fix: always use npm ci in CI environments; it installs strictly from package-lock.json, fails if lock file is out of date, and never modifies it
- βMistake 2: Storing long-lived AWS IAM access keys in GitHub Secrets for CI β Symptom: 'AWS Access Key exposed in public repository' GitHub security alert, or a credentials leak during a GitHub breach β Fix: replace with OIDC federation using aws-actions/configure-aws-credentials@v4 with role-to-assume; the generated token lives for the duration of one workflow run and has no static credentials to leak
- βMistake 3: Setting identical readinessProbe and livenessProbe paths in Kubernetes β Symptom: 'CrashLoopBackOff' during high database latency events where Kubernetes repeatedly kills and restarts healthy pods, making a DB slowdown into a full service outage β Fix: separate the probes; liveness checks only in-process health (event loop, memory), readiness checks external dependencies; liveness failureThreshold should be 5+ to avoid restart loops on transient issues
- βMistake 4: Using :latest as the Docker image tag in Kubernetes deployments β Symptom: running kubectl rollout undo fails to actually revert because 'latest' now points to the broken version; you have no way to know which code is running where β Fix: always tag images with the git SHA (type=sha in docker/metadata-action); you get exact traceability, meaningful rollback, and reproducible deployments
- Mistake 5: Injecting secrets as Kubernetes environment variables instead of mounted files. Symptom: after rotating a secret in AWS Secrets Manager, pods keep using the old value until a manual restart, creating a split-brain state where some pods use old credentials and some use new. Fix: mount secrets as volumes using the External Secrets Operator; file-mounted secrets update within kubelet's syncFrequency (default 60s) without a pod restart. Add a file watcher in your app to reload config on change.
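A sketch of the sync-and-mount setup (the SecretStore name and Secrets Manager key are placeholders, and the pod spec fragment is abbreviated):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
spec:
  refreshInterval: 1m            # how often ESO re-reads Secrets Manager
  secretStoreRef:
    name: aws-secrets-manager    # assumed SecretStore name
    kind: SecretStore
  target:
    name: db-credentials         # the Kubernetes Secret ESO keeps in sync
  dataFrom:
    - extract:
        key: prod/db             # placeholder Secrets Manager key
---
# Fragment of the Deployment's pod spec: mount the Secret as files.
# (env/envFrom would freeze the value at pod start; a volume does not.)
# spec:
#   volumes:
#     - name: db-credentials
#       secret:
#         secretName: db-credentials
#   containers:
#     - volumeMounts:
#         - name: db-credentials
#           mountPath: /etc/secrets/db
#           readOnly: true
```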
Interview Questions on This Topic
- Q: Your CI pipeline has a 25% flaky test rate. Developers have started auto-retrying failed builds without investigating. How do you fix the underlying problem without just deleting tests or disabling the pipeline requirement?
- Q: When would you choose a blue-green deployment over a canary release for a microservice that owns its own PostgreSQL database and is receiving an additive schema migration?
- Q: You're mid-canary deployment at 30% traffic when your success rate analysis shows 94.8%, just below your 95% threshold. Argo Rollouts automatically rolls back. But your SRE thinks it might have been a transient spike from an upstream dependency, not your code. What's your mitigation strategy going forward, and what does this reveal about your analysis template design?
- Q: How does OIDC-based authentication for CI/CD differ from long-lived IAM keys in terms of the attack surface it exposes, and what specific IAM conditions would you use to restrict which GitHub workflow runs can assume the production deployment role?
Frequently Asked Questions
What's the difference between continuous delivery and continuous deployment?
Continuous delivery means every change is automatically built, tested, and packaged so it's always ready to deploy to production, but a human approves the actual production release. Continuous deployment removes that human gate entirely: every change that passes automated tests ships to production automatically. Most regulated industries (fintech, healthcare) practice continuous delivery and stop there because they need a manual approval step for compliance reasons; that's not a failure, it's intentional.
How do I prevent secrets from leaking in GitHub Actions logs?
GitHub automatically masks values stored in GitHub Secrets from workflow logs, but only if the runner knows about them; secrets you construct dynamically at runtime from parts don't get masked. Never echo a secret directly, never construct a secret by concatenating values in a run step, and never pass secrets between jobs using outputs (outputs are visible in the workflow log). Pass secrets between jobs using artifacts stored with restricted permissions, or pull them fresh from a secrets manager in each job that needs them.
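A sketch of the safe pattern: each job reads the secret itself instead of receiving it from a previous job (job names and scripts are hypothetical):

```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/build.sh          # hypothetical script
        env:
          API_KEY: ${{ secrets.API_KEY }}   # masked in logs by the runner
  deploy:
    runs-on: ubuntu-latest
    needs: build
    steps:
      - uses: actions/checkout@v4
      # Re-read the secret here rather than receiving it via outputs from
      # 'build': job outputs land in the workflow log and are not covered
      # by masking when values are assembled dynamically.
      - run: ./scripts/deploy.sh         # hypothetical script
        env:
          API_KEY: ${{ secrets.API_KEY }}
```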
How do I handle database migrations safely in a CI/CD pipeline without causing downtime?
Expand-contract (also called parallel change) is the pattern: in phase one, deploy a migration that adds the new column without removing the old one; the running application still uses the old column. In phase two, deploy the application code that writes to both old and new columns. In phase three, deploy a migration that backfills and removes the old column once all running pods use the new schema. This means your migration and your deployment are never a single atomic change, which is the only way to keep rolling deployments safe with schema changes.
At what scale does a monolithic CI pipeline start to break down, and what's the alternative?
A single pipeline for a monorepo breaks down around 30-50 developers when you hit two symptoms: pipeline duration exceeds 15 minutes even for unrelated changes, and flakiness in module A blocks deployments of module B. The alternative is affected-path detection: tools like Nx, Turborepo, or Bazel determine which packages changed and only run CI for the dependency graph downstream of that change. At true monorepo scale (hundreds of services), you also need remote caching so unchanged modules reuse prior build outputs across runs. The trap is implementing this too early: if your pipeline runs in under 8 minutes and you have fewer than 20 engineers, the complexity of affected-path detection isn't worth it yet.
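As one concrete shape this can take, here is a sketch of an affected-path job using Nx in GitHub Actions (base-ref handling is simplified; real setups often use a helper like nrwl/nx-set-shas to pick the comparison commit):

```yaml
jobs:
  affected:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0    # Nx needs git history to diff against the base
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      # Run targets only for projects downstream of what actually changed,
      # instead of linting/testing/building the entire monorepo every push.
      - run: npx nx affected -t lint test build --base=origin/main --head=HEAD
```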
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.