Mid-level 6 min · March 06, 2026

Release Management: Never allow_failure for Prod Deploy

Error rates spiked from 0.

Naren · Founder
Quick Answer
  • Release management coordinates code changes from dev to prod through automated gates
  • Semantic versioning (Major.Minor.Patch) combined with trunk-based development reduces merge chaos
  • Environment promotion pipelines validate the same artifact at each stage — never rebuild between environments
  • Feature flags decouple deployment from release, enabling instant rollback via config change
  • A practiced rollback plan cuts recovery time from hours to minutes
  • The biggest mistake: treating quality gates as optional — a skipped gate is a skipped brake check
Plain-English First

Imagine a car factory. Every day, engineers make small improvements — better seats, a stronger engine, a new paint colour. Release management is the system that decides WHEN those changes get bolted onto the car, in WHAT ORDER, and how to quickly UNSCREW them if the new engine blows up. Without that system, every engineer would just show up and start welding things randomly. That's exactly what happens to software teams with no release process — and it's just as messy.

Every production outage has a creation story, and it almost always starts the same way: someone pushed a change without a plan. Maybe it was a hotfix at 11 PM, a config tweak that 'couldn't possibly break anything', or a big-bang deploy of six months of work all at once. Release management isn't bureaucracy for its own sake — it's the difference between your team owning deployments and deployments owning your team.

The core problem release management solves is coordination under uncertainty. Code works on your laptop. It works in staging. Then it hits production — a different database, different load, different config — and everything falls apart. A mature release process creates checkpoints, visibility, and escape hatches at every stage so that when something does go wrong (and it will), the blast radius is small and recovery is fast.

By the end of this article you'll understand how to structure a release pipeline with proper versioning, environment promotion gates, feature flags, and rollback strategies. You'll see real pipeline config, real branching patterns, and the exact mistakes that cause teams to lose weekends. Whether you're formalising a scrappy startup process or auditing an enterprise pipeline, these patterns apply.

Semantic Versioning and Git Branching — Your Release's DNA

Every release needs an identity before it needs a pipeline. That identity is a version number, and the most battle-tested system is Semantic Versioning: MAJOR.MINOR.PATCH. PATCH is a bug fix that doesn't change the API. MINOR adds functionality backwards-compatibly. MAJOR breaks something. This isn't just convention — tools like npm, Helm, and Terraform providers all resolve dependencies using these semantics, so a wrong version bump can silently pull in breaking changes across your whole stack.
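
To see why a wrong bump matters to consumers, here is a minimal sketch of a caret-style range check, similar in spirit to how package managers match versions. The function names are illustrative, and the rule is simplified (it assumes MAJOR is at least 1 and ignores pre-release tags), so treat it as an approximation rather than any tool's actual resolver.

```python
from typing import Tuple


def parse(version: str) -> Tuple[int, int, int]:
    """Split 'MAJOR.MINOR.PATCH' (optionally prefixed with 'v') into integers."""
    major, minor, patch = version.lstrip("v").split(".")
    return int(major), int(minor), int(patch)


def satisfies_caret(candidate: str, base: str) -> bool:
    """True if `candidate` falls inside a caret-style range anchored at `base`.

    Simplified rule (assumes MAJOR >= 1): same MAJOR, and not older than base.
    """
    cand, low = parse(candidate), parse(base)
    return cand[0] == low[0] and cand >= low


# A consumer that declared "^1.4.2" accepts any later 1.x release...
print(satisfies_caret("1.5.0", "1.4.2"))  # True
# ...but never crosses a MAJOR boundary on its own.
print(satisfies_caret("2.0.0", "1.4.2"))  # False
# Older versions are rejected too.
print(satisfies_caret("1.4.1", "1.4.2"))  # False
```

If a breaking change ships as 1.5.0 instead of 2.0.0, every consumer pinned to a 1.x range picks it up on the next install, which is the silent breakage described above.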

Your branching strategy should mirror your release cadence. GitFlow is powerful but heavyweight — use it when you maintain multiple live versions simultaneously (e.g., a SaaS product with enterprise clients on v2 and everyone else on v3). Trunk-based development is faster — developers merge small changes to main daily, and feature flags hide incomplete work from users. For most product teams shipping to a single production environment, trunk-based wins.

The critical rule: the pipeline, not the developer, assigns the version tag. A human typing '1.4.2' into a field is a human who will one day type '1.4.2' again by mistake. Let your CI system auto-tag based on commit conventions (Conventional Commits + semantic-release is the gold standard here).
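
The mapping from commit messages to version bumps can be sketched in a few lines. This is a simplified illustration of the Conventional Commits rules, not semantic-release's actual implementation; `bump_for` and `next_version` are hypothetical names, and the parser only handles the common `type(scope)!: subject` header shape.

```python
import re
from typing import List, Optional

# Simplified Conventional Commits header: type, optional (scope), optional "!", then ": ".
HEADER = re.compile(r"^(?P<type>[a-z]+)(\([^)]*\))?(?P<bang>!)?: ")


def bump_for(message: str) -> Optional[str]:
    """Return 'major', 'minor', 'patch', or None for a single commit message."""
    match = HEADER.match(message)
    if match is None:
        return None  # Not a conventional commit: it does not trigger a release
    if match["bang"] or "BREAKING CHANGE:" in message:
        return "major"
    if match["type"] == "feat":
        return "minor"
    if match["type"] == "fix":
        return "patch"
    return None  # docs, chore, refactor, etc. do not bump the version here


def next_version(current: str, messages: List[str]) -> str:
    """Apply the highest-priority bump found in `messages` to `current`."""
    major, minor, patch = (int(part) for part in current.split("."))
    bumps = {bump_for(message) for message in messages}
    if "major" in bumps:
        return f"{major + 1}.0.0"
    if "minor" in bumps:
        return f"{major}.{minor + 1}.0"
    if "patch" in bumps:
        return f"{major}.{minor}.{patch + 1}"
    return current  # Nothing releasable since the last tag


print(next_version("1.4.1", ["fix: handle empty cart"]))               # 1.4.2
print(next_version("1.4.2", ["feat: add express checkout", "fix: x"]))  # 1.5.0
print(next_version("1.5.0", ["feat!: drop legacy endpoints"]))          # 2.0.0
```

Because the bump is derived from the commit log, two runs over the same history always produce the same version, which is the property a hand-typed version number cannot guarantee.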

semantic-release-pipeline.yml (YAML)
# GitHub Actions workflow that automatically determines the next
# semantic version, tags the release, and publishes release notes.
# Triggered only on pushes to the main branch (i.e., after a PR merges).

name: Automated Semantic Release

on:
  push:
    branches:
      - main  # Only run on merged PRs — never on feature branches

jobs:
  release:
    name: Determine Version and Tag Release
    runs-on: ubuntu-latest
    permissions:
      contents: write       # Needed to push the git tag
      issues: write         # Needed to comment on resolved issues
      pull-requests: write  # Needed to comment on merged PRs

    steps:
      - name: Checkout full git history
        uses: actions/checkout@v4
        with:
          fetch-depth: 0  # CRITICAL: semantic-release needs full history to
                          # calculate the correct version bump. Shallow clones
                          # (the default) will cause it to fail silently.

      - name: Set up Node.js for semantic-release tooling
        uses: actions/setup-node@v4
        with:
          node-version: '20'

      - name: Install semantic-release and changelog plugin
        run: |
          npm install --save-dev \
            semantic-release \
            @semantic-release/changelog \
            @semantic-release/git
          # @semantic-release/changelog writes a CHANGELOG.md automatically
          # @semantic-release/git commits the changelog back to main

      - name: Run semantic-release
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          # semantic-release reads commit messages to decide the bump:
          # 'fix: ...'           -> PATCH bump  (1.4.1 -> 1.4.2)
          # 'feat: ...'          -> MINOR bump  (1.4.2 -> 1.5.0)
          # 'feat!: ...' or
          # 'BREAKING CHANGE:'   -> MAJOR bump  (1.5.0 -> 2.0.0)
        run: npx semantic-release

  build-and-push:
    name: Build Docker Image with Version Tag
    runs-on: ubuntu-latest
    needs: release  # Only runs AFTER the version tag exists in git

    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Extract the new version tag created by semantic-release
        id: get_version
        run: |
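          # Pull the most recently tagged commit across all refs. A plain
          # 'git describe' on HEAD can miss the new tag if a plugin such as
          # @semantic-release/git adds a release commit after the one checked out.
          RELEASE_VERSION=$(git describe --tags --abbrev=0 "$(git rev-list --tags --max-count=1)")
          echo "version=${RELEASE_VERSION}" >> "$GITHUB_OUTPUT"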
          echo "Detected release version: ${RELEASE_VERSION}"

      - name: Build and tag Docker image with immutable version
        run: |
          docker build \
            --tag myapp:${{ steps.get_version.outputs.version }} \
            --tag myapp:latest \
            --label "org.opencontainers.image.version=${{ steps.get_version.outputs.version }}" \
            --label "org.opencontainers.image.created=$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
            .
          # Tagging with BOTH the exact version AND 'latest' is intentional:
          # - Exact version tag is immutable — you can always roll back to it
          # - 'latest' tag is for convenience in non-production environments
          # NEVER deploy 'latest' to production. Always use the exact version.

      - name: Push to container registry
        run: |
          docker push myapp:${{ steps.get_version.outputs.version }}
          docker push myapp:latest
Output
✓ Checkout full git history
✓ Set up Node.js for semantic-release tooling
✓ Install semantic-release and changelog plugin
[semantic-release] › Starting semantic-release version 22.0.0
[semantic-release] › Loaded plugin: @semantic-release/commit-analyzer
[semantic-release] › Analyzing commits since v1.4.1
[semantic-release] › Found 1 feat commit — bumping MINOR version
[semantic-release] › The next release version is 1.5.0
[semantic-release] › Published GitHub release: v1.5.0
[semantic-release] › Updated CHANGELOG.md
Detected release version: v1.5.0
✓ Build Docker image: myapp:v1.5.0 and myapp:latest
✓ Pushed myapp:v1.5.0 to registry
✓ Pushed myapp:latest to registry
Watch Out: The Shallow Clone Trap
GitHub Actions defaults to a shallow clone (fetch-depth: 1). semantic-release needs the FULL git history to compute the correct version. Without fetch-depth: 0, it either errors out or resets to v1.0.0 on every run. This is the number-one setup mistake with automated versioning.
Production Insight
A team using GitFlow for a single production environment spent 3 hours each release merging long-lived branches.
After switching to trunk-based development with feature flags, they cut release prep time to 15 minutes.
Rule: pick your branching model based on the number of live versions you maintain, not the size of your team.
Key Takeaway
Auto-tag versions via commit conventions — never let a human type a version number.
Trunk-based development with feature flags is faster for single-environment teams.
The version is the artifact's identity, not the developer's choice.

Environment Promotion Gates — The Checkpoint System That Saves Weekends

Think of your environments as a series of airlocks on a spacecraft. Code moves from dev → staging → production, and each airlock only opens if a set of conditions is met. This is environment promotion, and the conditions are your quality gates. The idea is simple: a bug caught in staging costs far less than the same bug in production (10x is a common rule of thumb), whether you measure in time, customer trust, or revenue.

A quality gate is a hard stop, not a suggestion. Examples: test coverage must be at least 80%, no critical CVEs in the container image, p95 latency must not regress by more than 5% versus the last release, smoke tests must pass. The moment a gate becomes optional ('just this once, we're behind schedule'), it ceases to exist. Treat a skipped gate the same way you'd treat a skipped brake check.
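
As a rough sketch of what "hard stop" means in code, the gate thresholds above can be evaluated by one function whose result decides the exit code of the CI job. The names (`GateResult`, `evaluate_gates`, `all_gates_pass`) and the input values are assumptions for illustration; in a real pipeline the numbers would come from your coverage, scanner, and load-test tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GateResult:
    name: str
    passed: bool
    detail: str


def evaluate_gates(coverage_pct: float, critical_cves: int,
                   current_p95_ms: float, baseline_p95_ms: float,
                   min_coverage: float = 80.0,
                   max_regression_pct: float = 5.0) -> List[GateResult]:
    """Evaluate each gate independently so the report shows every failure."""
    regression_pct = (current_p95_ms - baseline_p95_ms) / baseline_p95_ms * 100
    return [
        GateResult("coverage", coverage_pct >= min_coverage,
                   f"{coverage_pct:.0f}% (min {min_coverage:.0f}%)"),
        GateResult("critical-cves", critical_cves == 0,
                   f"{critical_cves} found (max 0)"),
        GateResult("p95-regression", regression_pct <= max_regression_pct,
                   f"{regression_pct:.1f}% (max {max_regression_pct:.0f}%)"),
    ]


def all_gates_pass(results: List[GateResult]) -> bool:
    """A single failed gate blocks promotion; there is no override flag."""
    return all(result.passed for result in results)


results = evaluate_gates(coverage_pct=84, critical_cves=0,
                         current_p95_ms=253, baseline_p95_ms=250)
for result in results:
    print("PASS" if result.passed else "FAIL", result.name, result.detail)
# A CI wrapper would call sys.exit(1) when all_gates_pass(results) is False.
```

Note the deliberate absence of a `skip` or `force` parameter: if the function cannot be told to ignore a failure, nobody can do so under deadline pressure.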

The pattern that scales best is a promotion pipeline, not a parallel pipeline. Instead of having three separate pipelines (one per environment), you have ONE pipeline where each stage promotes the same artifact further. This means what you test is what you ship — the exact same Docker image SHA that passed staging tests is the one deployed to production. Never rebuild between environments.
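
The "same artifact" rule can also be enforced in code rather than by convention. The sketch below is a hypothetical in-memory ledger (the class and method names are made up for illustration) that refuses a promotion unless the exact image digest was verified in the previous environment.

```python
from typing import Dict, List

# Ordered list of environments an artifact is promoted through (assumed names).
STAGES: List[str] = ["staging", "production"]


class PromotionError(Exception):
    """Raised when an artifact tries to skip a stage or differs from what was tested."""


class PromotionLedger:
    """Records which image digest was verified in each environment."""

    def __init__(self) -> None:
        self._verified: Dict[str, str] = {}  # stage name -> verified digest

    def record_verified(self, stage: str, digest: str) -> None:
        self._verified[stage] = digest

    def check_promotion(self, target: str, digest: str) -> None:
        """Raise unless `digest` was verified in the stage just before `target`."""
        index = STAGES.index(target)
        if index == 0:
            return  # The first environment has no predecessor to check
        previous = STAGES[index - 1]
        if self._verified.get(previous) != digest:
            raise PromotionError(
                f"{digest} was not verified in {previous}; refusing to deploy to {target}"
            )


ledger = PromotionLedger()
ledger.record_verified("staging", "sha256:a3f9c12")
ledger.check_promotion("production", "sha256:a3f9c12")  # Passes silently

try:
    ledger.check_promotion("production", "sha256:rebuilt")  # A rebuilt image
except PromotionError as error:
    print(f"Blocked: {error}")
```

A rebuilt image gets a different digest even when the source commit is identical, so this check catches accidental rebuilds between environments as well as skipped stages.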

environment-promotion-pipeline.yml (YAML)
# GitLab CI/CD pipeline demonstrating environment promotion with quality gates.
# The SAME Docker image (identified by its SHA) moves through each environment.
# No rebuilds between environments — what passed testing IS what gets deployed.

stages:
  - build         # Compile and package the artifact once
  - test          # Run all automated quality gates
  - deploy-staging    # Automatic on every main branch push
  - verify-staging    # Automated smoke tests against staging
  - deploy-production # Manual trigger — human makes the final call

variables:
  REGISTRY: registry.example.com
  IMAGE_NAME: $REGISTRY/payments-service
  # IMAGE_TAG is derived from the git commit SHA, which guarantees
  # we always know exactly which code is running in any environment.
  IMAGE_TAG: $CI_COMMIT_SHORT_SHA

# ─── STAGE 1: Build ───────────────────────────────────────────────────────────
build-docker-image:
  stage: build
  image: docker:24
  services:
    - docker:24-dind  # Docker-in-Docker so we can build images in CI
  script:
    - docker build --tag $IMAGE_NAME:$IMAGE_TAG .
    - docker push $IMAGE_NAME:$IMAGE_TAG
    # We push immediately so subsequent stages can pull the same image.
    # No stage ever calls 'docker build' again — this is the single source of truth.
  only:
    - main

# ─── STAGE 2: Test (Quality Gates) ───────────────────────────────────────────
run-unit-and-integration-tests:
  stage: test
  image: $IMAGE_NAME:$IMAGE_TAG  # Run tests INSIDE the built image
  script:
    - pytest tests/ --cov=src --cov-fail-under=80
    # --cov-fail-under=80 is a hard gate: if coverage drops below 80%,
    # this job returns exit code 1, the pipeline stops, no deployment happens.
  coverage: '/TOTAL.*\s+(\d+%)$/'

scan-for-vulnerabilities:
  stage: test
  image: aquasec/trivy:latest
  script:
    - trivy image --exit-code 1 --severity CRITICAL $IMAGE_NAME:$IMAGE_TAG
    # exit-code 1 on CRITICAL CVEs means a known critical vulnerability
    # in the image will BLOCK deployment. This is non-negotiable.
    # You can set --severity HIGH,CRITICAL once your baseline is clean.

check-performance-regression:
  stage: test
  script:
    - |
      # Compare p95 response time against the last production benchmark.
      # If we're more than 5% slower, we don't ship.
      CURRENT_P95=$(run-load-test --output p95)
      BASELINE_P95=$(fetch-baseline --metric p95)
      REGRESSION=$(( (CURRENT_P95 - BASELINE_P95) * 100 / BASELINE_P95 ))
      if [ "$REGRESSION" -gt 5 ]; then
        echo "GATE FAILED: p95 latency regressed by ${REGRESSION}% (threshold: 5%)"
        exit 1
      fi
      echo "Performance gate passed. Regression: ${REGRESSION}%"

# ─── STAGE 3: Deploy to Staging ───────────────────────────────────────────────
deploy-to-staging:
  stage: deploy-staging
  environment:
    name: staging
    url: https://staging.example.com
  script:
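    # A literal block (|) keeps the backslash line continuations intact;
    # in a plain YAML scalar the lines are folded and the command breaks.
    - |
      kubectl set image deployment/payments-service \
        payments-service=$IMAGE_NAME:$IMAGE_TAG \
        --namespace=staging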
    # We're deploying the EXACT same $IMAGE_TAG that was built and tested.
    # kubectl set image updates the running deployment without recreating it.
    - kubectl rollout status deployment/payments-service --namespace=staging
    # rollout status blocks until the deployment is healthy or times out.
    # This ensures the next stage only runs if staging is actually up.
  only:
    - main

# ─── STAGE 4: Verify Staging ──────────────────────────────────────────────────
run-staging-smoke-tests:
  stage: verify-staging
  script:
    - |
      # Smoke tests hit the real staging URL and check critical user journeys:
      # login, create payment, view dashboard. Fast checks — not a full suite.
      newman run smoke-tests/payments-collection.json \
        --environment smoke-tests/staging-env.json \
        --reporters cli,junit \
        --reporter-junit-export smoke-test-results.xml
  artifacts:
    reports:
      junit: smoke-test-results.xml  # GitLab parses this to show pass/fail inline
  only:
    - main

# ─── STAGE 5: Deploy to Production (Manual Gate) ──────────────────────────────
deploy-to-production:
  stage: deploy-production
  environment:
    name: production
    url: https://app.example.com
  when: manual          # A human must click 'play' in the GitLab UI
  allow_failure: false  # Makes the manual job blocking: the pipeline waits for it, and a failed deploy fails the pipeline
  script:
    # A literal block (|) keeps the backslash line continuations intact.
    - |
      kubectl set image deployment/payments-service \
        payments-service=$IMAGE_NAME:$IMAGE_TAG \
        --namespace=production
    - kubectl rollout status deployment/payments-service --namespace=production
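    # Tag the production-deployed image with 'stable' so we always know
    # what the last known-good production image was. Move this tag only once
    # the release is verified healthy; otherwise 'stable' points at the very
    # image you may need to roll back from.
    - docker pull $IMAGE_NAME:$IMAGE_TAG  # This job's runner has no local copy of the image yet
    - docker tag $IMAGE_NAME:$IMAGE_TAG $IMAGE_NAME:stable
    - docker push $IMAGE_NAME:stable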
  only:
    - main
Output
Pipeline #4821 — commit a3f9c12 — branch: main
✓ build-docker-image (1m 43s) Image pushed: registry.example.com/payments-service:a3f9c12
✓ run-unit-and-integration-tests (2m 11s) Coverage: 84% (gate: 80%) ✓
✓ scan-for-vulnerabilities (0m 58s) 0 CRITICAL CVEs found ✓
✓ check-performance-regression (1m 20s) p95 regression: 1.2% (gate: 5%) ✓
✓ deploy-to-staging (0m 45s) Rollout complete in namespace: staging
✓ run-staging-smoke-tests (1m 02s) 12/12 smoke tests passed
⏸ deploy-to-production WAITING FOR MANUAL TRIGGER
→ Visit https://gitlab.example.com/pipelines/4821 to approve production deploy
Pro Tip: The 'Stable' Tag Is Your Rollback Anchor
Tagging the last successful production image as 'stable' means a rollback command is always one line: kubectl set image deployment/payments-service payments-service=registry.example.com/payments-service:stable --namespace=production. No searching through tags, no guessing which SHA was last good. Define your rollback procedure before you need it.
Production Insight
A fintech team skipped the vulnerability scan gate to meet a compliance deadline. The deploy went through and a known CVE in a logging library leaked PII to a public CloudWatch log group.
The gate was re-enabled the same day, but the breach notification cost $200k in fines.
Rule: if a gate can be skipped, it will be skipped under pressure — make it non-negotiable.
Key Takeaway
Quality gates must be hard programmatic stops, not optional suggestions.
Build once, promote the same artifact through all environments — never rebuild.
The 'stable' tag on the last production image is your fastest rollback anchor.

Feature Flags and Dark Launches — Separating Deployment from Release

Here's a mindset shift that changes everything: deployment and release are not the same thing. Deployment is 'the code is in production'. Release is 'users can see it'. Feature flags let you do the first without the second, and that separation is what enables teams to deploy dozens of times a day without chaos.

A dark launch means you ship the code to production but hide the new feature behind a flag. You can then turn it on for 1% of users, watch your error rates and latency, and either ramp up or kill it instantly — without a deployment. No pipeline run, no kubectl command, no 3 AM on-call page. Just a config change.
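
One common way to implement a stable percentage rollout is to hash the user ID into a bucket from 0 to 99. The sketch below assumes SHA-256 and a hypothetical `in_rollout` helper; flag platforms use their own hashing schemes, so this shows the idea rather than any vendor's algorithm.

```python
import hashlib


def in_rollout(flag_name: str, user_id: str, percentage: int) -> bool:
    """Deterministically place a user inside or outside a percentage rollout.

    The same (flag, user) pair always lands in the same bucket, so a user
    does not flip between old and new behaviour from one request to the next.
    """
    key = f"{flag_name}:{user_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100  # Stable bucket 0..99
    return bucket < percentage


# Ramp-up is monotonic: anyone included at 1% is still included at 10% and 100%.
for percentage in (0, 1, 10, 100):
    enabled = sum(in_rollout("express-checkout-v2", f"user-{n}", percentage)
                  for n in range(10_000))
    print(f"{percentage:>3}% configured -> {enabled} of 10000 sample users enabled")
```

Including the flag name in the hash key means different flags bucket users independently, so the same small group of users does not end up in every experiment.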

This pattern is especially powerful for database migrations, API breaking changes, and anything touching payments or authentication. The new code path runs alongside the old one until you're confident. Once 100% of traffic is on the new path and it's stable, you remove the flag and clean up the old code.

Tools like LaunchDarkly, Unleash (self-hosted), and even a simple database table can serve as your flag store. The important thing is that flags are owned, documented, and have a planned removal date — otherwise you accumulate 'flag debt' that makes your codebase unreadable.
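
Ownership and a removal date are easy to track if every flag is registered with that metadata from day one. The following is a minimal sketch, assuming a hypothetical `FlagRecord` structure and a 30-day default lifetime; a real registry would live in your flag tool or a database table rather than in memory.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List

DEFAULT_LIFETIME = timedelta(days=30)  # Assumed default; tune per team


@dataclass
class FlagRecord:
    name: str
    owner: str            # Team or person accountable for removing the flag
    created: date
    lifetime: timedelta = DEFAULT_LIFETIME

    @property
    def remove_by(self) -> date:
        return self.created + self.lifetime


def overdue_flags(flags: List[FlagRecord], today: date) -> List[str]:
    """Return the names of flags that have outlived their planned removal date."""
    return [flag.name for flag in flags if today > flag.remove_by]


flags = [
    FlagRecord("express-checkout-v2", owner="payments", created=date(2026, 1, 1)),
    FlagRecord("new-pricing-page", owner="growth", created=date(2026, 2, 20)),
]
print(overdue_flags(flags, today=date(2026, 3, 6)))  # ['express-checkout-v2']
```

Running a check like this in CI and failing the build on overdue flags turns flag cleanup from a good intention into a gate.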

feature_flag_checkout.py (Python)
# Real-world feature flag pattern for a payment checkout flow.
# We're rolling out a new 'express checkout' experience gradually.
# The flag system is Unleash (open-source, self-hostable).

import logging
from unleash_client import UnleashClient

logger = logging.getLogger(__name__)

# ── Initialise the Unleash client once at application startup ──────────────────
# In production this points to your Unleash server.
# In tests, swap in a stub or mock client to avoid network calls.
unleash_client = UnleashClient(
    url="https://unleash.internal.example.com/api",
    app_name="checkout-service",
    custom_headers={"Authorization": "*:production.your-secret-token"}
)
unleash_client.initialize_client()
# initialize_client() fetches all flag states and caches them.
# The client polls for updates in the background — no per-request network calls.


def process_checkout(user_id: str, cart_items: list, user_tier: str) -> dict:
    """
    Routes the user to either the new express checkout or the classic flow
    depending on the feature flag state for this specific user.

    The flag can be configured in Unleash to:
      - Be ON for specific user IDs (early adopters / beta testers)
      - Be ON for a % of users (gradual rollout)
      - Be ON only for users with user_tier == 'premium' (targeted release)
      - Be completely OFF (emergency kill-switch)
    """
    # Context tells Unleash WHO is asking, so it can apply targeting rules.
    # This is what makes flags smarter than a simple boolean.
    flag_context = {
        "userId": user_id,
        "properties": {
            "userTier": user_tier  # Custom property for tier-based targeting
        }
    }

    # is_enabled() is the key call — it checks the local cache, NOT the server,
    # so it adds microseconds of latency, not milliseconds.
    use_express_checkout = unleash_client.is_enabled(
        "express-checkout-v2",   # Flag name as defined in Unleash UI
        context=flag_context,
        fallback_function=lambda feature_name, ctx: False
        # fallback_function supplies the result (False here) when the flag cannot
        # be evaluated, for example because it is missing from the local cache.
        # NEVER let a flag evaluation crash your application; always define a fallback.
    )

    if use_express_checkout:
        logger.info(
            "express_checkout_used",
            extra={"user_id": user_id, "flag": "express-checkout-v2", "variant": "on"}
            # Structured logging lets you correlate flag state with error rates
            # in your observability platform (Datadog, Grafana, etc.)
        )
        return _run_express_checkout(user_id, cart_items)
    else:
        logger.info(
            "classic_checkout_used",
            extra={"user_id": user_id, "flag": "express-checkout-v2", "variant": "off"}
        )
        return _run_classic_checkout(user_id, cart_items)


def _run_express_checkout(user_id: str, cart_items: list) -> dict:
    """New express checkout — single-page, saved payment methods, faster UX."""
    # New implementation here. This runs in production for flagged users
    # while _run_classic_checkout handles everyone else.
    return {
        "status": "success",
        "flow": "express",
        "steps_completed": 1,
        "order_id": f"EXP-{user_id}-001"
    }


def _run_classic_checkout(user_id: str, cart_items: list) -> dict:
    """Existing checkout flow — kept alive until express is 100% rolled out."""
    return {
        "status": "success",
        "flow": "classic",
        "steps_completed": 3,
        "order_id": f"CLX-{user_id}-001"
    }


# ── Example usage ──────────────────────────────────────────────────────────────
if __name__ == "__main__":
    # Premium user — assume flag is configured to target 'premium' tier
    premium_result = process_checkout(
        user_id="user-99201",
        cart_items=[{"sku": "SHOE-42", "qty": 1}],
        user_tier="premium"
    )
    print(f"Premium user checkout: {premium_result}")

    # Free tier user — flag is OFF for this tier
    free_result = process_checkout(
        user_id="user-10042",
        cart_items=[{"sku": "HAT-L", "qty": 2}],
        user_tier="free"
    )
    print(f"Free user checkout: {free_result}")
Output
INFO express_checkout_used user_id=user-99201 flag=express-checkout-v2 variant=on
Premium user checkout: {'status': 'success', 'flow': 'express', 'steps_completed': 1, 'order_id': 'EXP-user-99201-001'}
INFO classic_checkout_used user_id=user-10042 flag=express-checkout-v2 variant=off
Free user checkout: {'status': 'success', 'flow': 'classic', 'steps_completed': 3, 'order_id': 'CLX-user-10042-001'}
Interview Gold: Deployment vs. Release
Interviewers love asking 'how do you deploy without risk?' The answer they want is feature flags — specifically the concept that you can deploy code to 100% of servers while releasing it to 0% of users. Bonus points for mentioning that this also makes rollbacks instant (flip the flag) versus slow (redeploy the previous version).
Production Insight
A SaaS team deployed a new pricing page with a permanent 'always-on' flag. The flag was never cleaned up, and 18 months later a refactor broke the old code path that no one remembered existed.
The pricing page stopped working for 15% of users who still hit the old code due to a stale cache key.
Rule: every flag needs a removal deadline. Set a 30-day expiry by default.
Key Takeaway
Deploy and release are separate — use flags to control user visibility.
Flags make rollbacks a config change, not a code deploy.
Accumulated flags become debt — schedule removal from day one.

Rollback Strategy — Planning for Failure Before It Happens

Mature teams don't ask 'will this deploy go wrong?' — they ask 'when it goes wrong, how fast can we recover?' A rollback strategy is not an admission of defeat. It's engineering discipline. The goal is to define your recovery path before you're stressed, sleep-deprived, and under pressure from a VP asking 'when will this be fixed?'

There are three levels of rollback you need to think about. First, application rollback: rolling back the Kubernetes deployment to the previous image SHA — this takes under a minute and handles most issues. Second, database rollback: this is harder. Schema migrations that delete columns or rename tables can't be trivially reversed. This is why every migration should be deployed in at least two phases — first add the new column, then (days later) remove the old one. Third, config rollback: if you're using a GitOps tool like Argo CD, every infrastructure change is a git commit, meaning a revert is a git revert. Fast and auditable.
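
The two-phase migration idea is easier to see with a concrete schema. Below is a minimal sketch using an in-memory SQLite database; the `customers` table and its `fullname` and `display_name` columns are hypothetical, and the point is only that after the first phase both the old and the new application code can read the table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, fullname TEXT)")
conn.execute("INSERT INTO customers (fullname) VALUES ('Ada Lovelace')")

# Phase 1 (expand): add the new column alongside the old one.
# Nothing is removed, so the previous app version keeps working.
conn.execute("ALTER TABLE customers ADD COLUMN display_name TEXT")

# Backfill existing rows so the new code path sees complete data.
conn.execute("UPDATE customers SET display_name = fullname WHERE display_name IS NULL")

# If the new release is rolled back now, the OLD query still succeeds...
old_row = conn.execute("SELECT fullname FROM customers WHERE id = 1").fetchone()
# ...and the NEW query works for the new release.
new_row = conn.execute("SELECT display_name FROM customers WHERE id = 1").fetchone()
print(old_row, new_row)

# Phase 2 (contract): shipped in a LATER release, once no deployed version
# reads the old column. It is only defined here, not executed.
CONTRACT_SQL = "ALTER TABLE customers DROP COLUMN fullname"
```

The rollback window is the time between the two phases: as long as the contract step has not run, an application rollback needs no database change at all.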

The most important rule: test your rollback in staging before every major release. A rollback you've never practiced is a rollback that will fail when you need it most.

rollback-runbook.sh (Bash)
#!/usr/bin/env bash
# ─────────────────────────────────────────────────────────────────────────────
# Rollback Runbook — payments-service
# Run this when a production deployment causes errors above the SLO threshold.
# Prerequisites: kubectl configured, Docker registry access, Slack webhook set.
# ─────────────────────────────────────────────────────────────────────────────

set -euo pipefail
# -e: exit immediately on any error
# -u: treat unset variables as errors (prevents silent config bugs)
# -o pipefail: a pipe fails if ANY command in it fails, not just the last one

# ── Configuration ─────────────────────────────────────────────────────────────
NAMESPACE="production"
DEPLOYMENT_NAME="payments-service"
CONTAINER_NAME="payments-service"
REGISTRY="registry.example.com"
# Injected from CI/CD secrets; fail fast with a clear message if it is missing.
SLACK_WEBHOOK_URL="${SLACK_WEBHOOK_URL:?SLACK_WEBHOOK_URL must be set}"

# ── Step 1: Confirm current broken state before touching anything ──────────────
echo "──────────────────────────────────────────────────"
echo "ROLLBACK INITIATED — $(date -u '+%Y-%m-%d %H:%M:%S UTC')"
echo "Deployment: ${DEPLOYMENT_NAME} in namespace: ${NAMESPACE}"
echo "──────────────────────────────────────────────────"

CURRENT_IMAGE=$(kubectl get deployment "${DEPLOYMENT_NAME}" \
  --namespace="${NAMESPACE}" \
  --output=jsonpath='{.spec.template.spec.containers[0].image}')
echo "Current (broken) image: ${CURRENT_IMAGE}"

# ── Step 2: Find the last known-good image (tagged as 'stable') ────────────────
# The 'stable' tag was set during the last SUCCESSFUL production deploy.
# See deploy-to-production stage in the pipeline config above.
STABLE_IMAGE="${REGISTRY}/${DEPLOYMENT_NAME}:stable"
echo "Rolling back to stable image: ${STABLE_IMAGE}"

# ── Step 3: Execute the rollback ──────────────────────────────────────────────
kubectl set image deployment/"${DEPLOYMENT_NAME}" \
  "${CONTAINER_NAME}=${STABLE_IMAGE}" \
  --namespace="${NAMESPACE}" \
  --record  # Writes this change to the rollout history; the flag is deprecated in newer kubectl releases

# Block until all pods are running the stable image.
# Timeout of 3 minutes — if it takes longer, something is seriously wrong.
kubectl rollout status deployment/"${DEPLOYMENT_NAME}" \
  --namespace="${NAMESPACE}" \
  --timeout=3m

echo "Rollback complete. Verifying pod health..."

# ── Step 4: Quick sanity check — are all pods Ready? ──────────────────────────
READY_PODS=$(kubectl get deployment "${DEPLOYMENT_NAME}" \
  --namespace="${NAMESPACE}" \
  --output=jsonpath='{.status.readyReplicas}')
READY_PODS="${READY_PODS:-0}"  # readyReplicas is omitted from the status when no pods are ready
DESIRED_PODS=$(kubectl get deployment "${DEPLOYMENT_NAME}" \
  --namespace="${NAMESPACE}" \
  --output=jsonpath='{.spec.replicas}')

if [ "${READY_PODS}" -ne "${DESIRED_PODS}" ]; then
  echo "WARNING: Only ${READY_PODS}/${DESIRED_PODS} pods are ready after rollback."
  echo "Check pod logs: kubectl logs -l app=${DEPLOYMENT_NAME} -n ${NAMESPACE}"
  exit 1
fi

echo "✓ All ${READY_PODS}/${DESIRED_PODS} pods are healthy."

# ── Step 5: Notify the team — a silent rollback is a sneaky rollback ───────────
curl --silent --fail --show-error \
  --request POST \
  --header 'Content-type: application/json' \
  --data "{
    \"text\": \":rotating_light: *ROLLBACK EXECUTED* :rotating_light:\",
    \"attachments\": [{
      \"color\": \"danger\",
      \"fields\": [
        {\"title\": \"Service\",         \"value\": \"${DEPLOYMENT_NAME}\",  \"short\": true},
        {\"title\": \"Environment\",      \"value\": \"${NAMESPACE}\",       \"short\": true},
        {\"title\": \"Rolled back FROM\", \"value\": \"${CURRENT_IMAGE}\",  \"short\": false},
        {\"title\": \"Rolled back TO\",   \"value\": \"${STABLE_IMAGE}\",   \"short\": false},
        {\"title\": \"Executed by\",      \"value\": \"${USER:-ci-system}\", \"short\": true},
        {\"title\": \"Time\",             \"value\": \"$(date -u '+%Y-%m-%d %H:%M UTC')\", \"short\": true}
      ]
    }]
  }" \
  "${SLACK_WEBHOOK_URL}"

echo ""
echo "Slack notification sent. Rollback complete."
echo "Next step: create a post-mortem issue and identify root cause before re-deploying."
Output
──────────────────────────────────────────────────
ROLLBACK INITIATED — 2024-11-14 02:17:43 UTC
Deployment: payments-service in namespace: production
──────────────────────────────────────────────────
Current (broken) image: registry.example.com/payments-service:a3f9c12
Rolling back to stable image: registry.example.com/payments-service:stable
deployment.apps/payments-service image updated
Waiting for deployment "payments-service" rollout to finish: 1 out of 3 new replicas have been updated...
Waiting for deployment "payments-service" rollout to finish: 2 out of 3 new replicas have been updated...
deployment "payments-service" successfully rolled out
Rollback complete. Verifying pod health...
✓ All 3/3 pods are healthy.
Slack notification sent. Rollback complete.
Next step: create a post-mortem issue and identify root cause before re-deploying.
Watch Out: Database Migrations Are Not Rollback-Friendly
kubectl rollout undo rolls back your app code in 60 seconds. It does NOT roll back your database schema. If your new code ran a migration that dropped a column, your old code will crash looking for it. The fix: always use the expand-contract pattern — add the new column, deploy, migrate data, deploy again removing the old column. Never drop columns in the same deploy that first uses the new schema.
Production Insight
A startup tried to roll back a deploy that had run a destructive migration. The app returned to the old version, but the database schema was already changed — the old code immediately crashed with column-not-found errors.
Recovery took 6 hours of manual SQL patching and a full database restore from backup.
Rule: never combine schema changes with app changes in the same deploy. Use expand-contract for every migration.
Key Takeaway
Roll back app code fast: kubectl rollout undo is your 60-second escape.
Database schema changes are NOT rollback-friendly — use expand-contract.
Test your rollback in staging before every major release. Untested rollbacks fail.

Release Automation and Changelog Generation — Closing the Loop

A release isn't complete until someone knows what changed. That's where automated changelog generation comes in. Semantic-release, git-cliff, or a custom script can parse Conventional Commit messages and produce a human-readable changelog, release notes, and even trigger notifications to Slack or email. The key is that this should be fully automated — a human writing 'Bug fixes and performance improvements' is a human wasting time and providing zero value.

Your changelog should be generated at the moment the release tag is created, not retroactively. The pipeline that tags the version should also write the changelog entry, attach it to the GitHub/GitLab release, and post a summary to the team channel. This gives every stakeholder — QA, product, support — a single source of truth for what's in the release.

Automation also means your release cycle can be shorter. Instead of a weekly release cadence where 50 changes bundle together, you can release each merged PR individually. The cost of a release drops to near zero. The only remaining constraint is the manual production gate for riskier changes. But even that can be automated with feature flags: merge to main, deploy to prod behind a flag, and release at your own pace.

.releaserc.yml · YAML
# semantic-release configuration file — defines plugins, release branches, and asset publishing.
# This file lives at the root of your repository and is read by semantic-release.

branches:
  - "main"                     # Releases from main branch
  - {name: "next", prerelease: "rc"}   # Pre-releases from 'next' branch (release candidates)

plugins:
  # ── 1. Analyze commit messages to determine the type of version bump ─────
  - "@semantic-release/commit-analyzer"

  # ── 2. Generate release notes from commit messages ────────────────────────
  - "@semantic-release/release-notes-generator"

  # ── 3. Update the CHANGELOG.md file in the repository ─────────────────────
  - "@semantic-release/changelog"

  # ── 4. Commit the updated CHANGELOG.md back to the repository ─────────────
  - "@semantic-release/git"

  # ── 5. Publish the release to GitHub (creates a GitHub Release with notes) ─
  # Release assets are an option of the github plugin, not a top-level key.
  # We attach the built Docker image SHA reference as an asset, so anyone
  # viewing the GitHub Release can see exactly what artifact was shipped.
  - - "@semantic-release/github"
    - assets:
        - path: "release-artifact.sha256"
          label: "Docker Image SHA256 Checksum"

# ── Commit message format ────────────────────────────────────────────────────
# semantic-release expects Angular-style Conventional Commits.
# Examples:
#   feat(cart): add multi-currency support
#   fix(auth): handle token expiry correctly
#   docs(readme): update installation guide
# A breaking change is declared in the commit footer, not in the subject line:
#   BREAKING CHANGE: remove deprecated /v1/api endpoint

preset: "angular"
# With the angular preset, 'feat' triggers a MINOR bump; 'fix' and 'perf' trigger a PATCH.
# Types such as 'docs' and 'chore' do not trigger a release by default.
# A 'BREAKING CHANGE' footer triggers a MAJOR bump.
Output
semantic-release v22.0.0
Plugin loaded: @semantic-release/commit-analyzer
Plugin loaded: @semantic-release/release-notes-generator
Plugin loaded: @semantic-release/changelog
Plugin loaded: @semantic-release/git
Plugin loaded: @semantic-release/github
Analyzed 12 commits since v2.3.0:
- 1 fix, 3 feat, 5 docs, 2 refactor, 1 BREAKING CHANGE
Next version: v3.0.0 (MAJOR bump)
✓ Updated CHANGELOG.md
✓ Committed changelog update
✓ Published GitHub Release: v3.0.0
✓ Posted release summary to Slack
Release Automation Mental Model
  • Conventional Commits are the raw material — each commit is labeled (fix, feat, breaking).
  • The commit analyzer is the quality inspector — it reads the labels and decides the bump (PATCH, MINOR, MAJOR).
  • The changelog generator is the packing department — it writes the release notes from the commit history.
  • The git tagging step is the shipping label — it stamps the artifact with an immutable version.
  • The GitHub Release is the shipping manifest — it tells the world what's inside the box.
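The "quality inspector" step in the list above can be sketched in a few lines of Python. This is a simplified illustration of how Angular-style commit messages map to a version bump, not semantic-release's actual implementation; the function names bump_for and next_version are hypothetical.

```python
import re

# Matches headers such as "feat(cart): add multi-currency support".
HEADER = re.compile(r"^(?P<type>[a-z]+)(?:\([^)]*\))?: .+")

# Commit types that produce a PATCH release in this sketch.
PATCH_TYPES = {"fix", "perf"}

# Ranking used to pick the largest bump across all commits.
ORDER = {None: 0, "patch": 1, "minor": 2, "major": 3}


def bump_for(message):
    """Return 'major', 'minor', 'patch', or None for a single commit message."""
    lines = message.splitlines()
    if not lines:
        return None
    # A BREAKING CHANGE footer wins regardless of the commit type.
    if any(line.startswith("BREAKING CHANGE:") for line in lines[1:]):
        return "major"
    match = HEADER.match(lines[0])
    if match is None:
        return None
    commit_type = match.group("type")
    if commit_type == "feat":
        return "minor"
    if commit_type in PATCH_TYPES:
        return "patch"
    return None  # docs, chore, refactor, ... do not trigger a release here


def next_version(current, messages):
    """Apply the largest bump found in `messages` to a 'MAJOR.MINOR.PATCH' string."""
    bump = max((bump_for(m) for m in messages), key=ORDER.__getitem__, default=None)
    major, minor, patch = (int(part) for part in current.split("."))
    if bump == "major":
        return f"{major + 1}.0.0"
    if bump == "minor":
        return f"{major}.{minor + 1}.0"
    if bump == "patch":
        return f"{major}.{minor}.{patch + 1}"
    return current  # nothing releasable since the last tag


if __name__ == "__main__":
    commits = [
        "fix(auth): handle token expiry correctly",
        "feat(cart): add multi-currency support",
        "feat(api): drop v1\n\nBREAKING CHANGE: remove deprecated /v1/api endpoint",
    ]
    print(next_version("2.3.0", commits))
```

The largest bump wins: a single commit with a BREAKING CHANGE footer moves 2.3.0 to 3.0.0, no matter how many fix or feat commits sit beside it.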
Production Insight
A team manually wrote release notes for months. The notes were always late, often inaccurate, and skipped critical breaking changes. A support ticket about a missing feature would come in, and the team would have to dig through git log to answer.
After adopting semantic-release with automatic changelogs, the release notes were published within 2 minutes of the tag. Support response time dropped by 40%.
Rule: automate everything after the merge — including the changelog and community announcement.
Key Takeaway
Automated changelogs save hours and prevent missed breaking changes.
semantic-release with Conventional Commits is a widely used way to get there.
A release isn't complete until the changelog is published and the team is notified.
● Production incident · POST-MORTEM · severity: high

The Missing Quality Gate That Took Down Payments for 47 Minutes

Symptom
Users reported 500 errors and payment timeouts after a routine dependency update deploy. Error rates spiked from 0.1% to 23% within two minutes.
Assumption
The team assumed the production deploy job would block on failing tests — it had been configured with allow_failure: true during a late-night sprint deadline and never reverted.
Root cause
The CI pipeline had an integration test stage that caught the bad dependency, but because the production deploy step was set to manual with allow_failure: true, the engineer simply ignored the red stage and clicked 'Deploy to Prod' anyway. No automated gate blocked it.
Fix
Changed the production deploy job to when: manual, allow_failure: false. Added a required approval from a second team member before deploy. Added an automated check that no test stages have failed in the last 30 minutes.
Key lesson
  • A quality gate that can be skipped is not a gate — it's a suggestion.
  • Never mark production deploy as allow_failure: true. If deploy happens despite failing tests, the gate is broken.
  • Every production deploy should require at least one human approval who did not author the change.
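The lessons above amount to a hard gate: every stage must have passed, and someone other than the author must have approved. The sketch below shows that logic in Python under simplifying assumptions. The DeployRequest shape and the function names are hypothetical, and the 30-minute window from the fix is omitted; in a real pipeline this rule would live in the CI system's own configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeployRequest:
    author: str
    approvers: tuple       # usernames who approved this deploy
    stage_results: dict    # stage name -> "passed", "failed", or "skipped"


def gate_violations(request):
    """Return a list of reasons the deploy must be blocked (empty means allowed)."""
    problems = []
    for stage, result in request.stage_results.items():
        # Anything other than an explicit pass blocks the deploy.
        if result != "passed":
            problems.append(f"stage '{stage}' is {result}")
    independent = [a for a in request.approvers if a != request.author]
    if not independent:
        problems.append("no approval from someone other than the author")
    return problems


def can_deploy(request):
    """Hard stop: there is no override flag and no allow_failure escape hatch."""
    return not gate_violations(request)


if __name__ == "__main__":
    request = DeployRequest(
        author="alice",
        approvers=("alice",),
        stage_results={"unit": "passed", "integration": "failed"},
    )
    for problem in gate_violations(request):
        print("BLOCKED:", problem)
```

Note that the function has no bypass parameter. If an emergency path is needed, it should be a separate, audited process, not a flag on the normal gate.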
Production debug guide
Symptom → Action guide for common release management failures · 4 entries
Symptom · 01
Deploy to staging works, but production deploy fails with image not found
Fix
Check the artifact tag. The production environment might be pulling a different tag than what was built. Verify the pipeline uses the same commit SHA tag across all stages.
Symptom · 02
Rollback command runs but pods stay on broken version
Fix
Check the rollout status and ensure the stable tag is correctly updated. Run kubectl rollout history to see previous revisions. The rollback may be targeting a replica set that is already scaled down.
Symptom · 03
Feature flag not taking effect for a subset of users
Fix
Check the Unleash admin UI for flag targeting rules. Verify the flag context includes the correct userId. Restart the service to force a cache refresh. The flag evaluation runs on a cached copy — stale cache is the most common cause.
Symptom · 04
Database migration fails during deploy
Fix
Check whether the migration is irreversible (e.g., dropping a column). If an undo migration was written, run flyway undo (available only in Flyway's paid editions); otherwise roll back the migration SQL manually. Never apply schema changes in the same deploy as the code that depends on the new schema; use expand-contract.
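For Symptom 01 above, the check "same commit SHA tag across all stages" can be automated. The following is a minimal Python sketch; the function name find_tag_drift and the image references are hypothetical, and in practice the deployed references would be read from the cluster or the pipeline's records.

```python
def find_tag_drift(built_image, deployed_images):
    """Return the environments whose image reference differs from the built artifact.

    built_image:      the reference produced by the build stage
    deployed_images:  mapping of environment name -> image reference in use
    """
    return {
        env: image
        for env, image in deployed_images.items()
        if image != built_image
    }


if __name__ == "__main__":
    built = "registry.example.com/myapp:a3f9c12"
    deployed = {
        "dev": "registry.example.com/myapp:a3f9c12",
        "staging": "registry.example.com/myapp:a3f9c12",
        "production": "registry.example.com/myapp:latest",
    }
    for env, image in find_tag_drift(built, deployed).items():
        print(f"{env} is running {image}, expected {built}")
```

A non-empty result means at least one environment is not running the artifact that was built and tested, which is the usual cause of an "image not found" or "works in staging" mismatch.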
★ Release Failures Quick-Response Guide
For on-call engineers facing release-related production issues. Real commands, no fluff.
Deploy caused errors — need to roll back fast
Immediate action
Stop the rollout and roll back to the last known-good image
Commands
kubectl rollout undo deployment/myapp --namespace=production
kubectl rollout status deployment/myapp --namespace=production --timeout=3m
Fix now
If rollout undo fails, manually set the image to the stable tag: kubectl set image deployment/myapp myapp=registry.example.com/myapp:stable --namespace=production
Container in crash loop after deploy
Immediate action
Tail logs from a crash-looping pod
Commands
kubectl logs --tail=100 -l app=myapp --namespace=production --previous
kubectl describe pod <pod-name> --namespace=production | grep -A 10 'Last State'
Fix now
Revert the deployment to the previous version: kubectl rollout undo deployment/myapp --namespace=production
Feature flag not working for any user
Immediate action
Kill the flag evaluation cache by restarting the service
Commands
kubectl rollout restart deployment/myapp --namespace=production
Check the Unleash server health: curl -s https://unleash.internal/health | jq .
Fix now
Manually override the flag in Unleash UI to a known working state, then restart the deployment again.
Pipeline stuck on 'pending' or 'waiting for approval'
Immediate action
Check if a manual approval is required but no approver is assigned
Commands
For GitLab: glab ci status --branch <branch>
For GitHub: gh run view <run-id> --json status,conclusion
Fix now
If the pipeline is blocked by an infrastructure failure (e.g., runner offline), re-run the pipeline after fixing the runner. If it's a manual gate, ensure an approver is notified in the on-call channel.
Release Management Strategies Compared
| Aspect | GitFlow Branching | Trunk-Based Development |
|---|---|---|
| Branch lifespan | Long-lived feature branches (days to weeks) | Short-lived branches (hours to 1-2 days max) |
| Release cadence | Scheduled releases (weekly, bi-weekly) | Continuous: multiple deploys per day possible |
| Parallel version support | Excellent: hotfix branches per version | Difficult: requires additional tooling |
| Feature flag requirement | Low: incomplete work stays on branch | High: flags hide incomplete features on main |
| Merge conflict risk | High: long-lived branches diverge | Low: frequent merges keep branches in sync |
| Best for | Enterprise with multiple live versions (SaaS v2/v3) | Product teams with single production environment |
| CI pipeline speed pressure | Lower: deploys are infrequent | High: pipeline must be fast (under 10 min) |
| Onboarding complexity | Higher: more branch types to learn | Lower: one branch, clear rules |

Key takeaways

1. Never rebuild artifacts between environments: build once with a commit-SHA tag, promote that exact immutable image through every stage from dev to prod.
2. Quality gates must be hard stops, not suggestions: a gate with an exception process is a gate that will fail you the night you can least afford it.
3. Deployment and release are separate concerns: feature flags let you deploy to 100% of infrastructure while releasing to 0% of users, making rollbacks a config change instead of a code deploy.
4. Every database migration needs an expand-contract strategy: you can roll back your app in 60 seconds, but you cannot roll back a dropped column. Plan migrations in two phases, always.
5. Automate everything after the merge: version tagging, changelog generation, and release notifications. Manual steps are delays and errors waiting to happen.

Common mistakes to avoid

4 patterns

Rebuilding the artifact per environment

Symptom
'It worked in staging but failed in prod' for config or dependency reasons. Each build is slightly different because it was built at a different time or on a different machine.
Fix
Build ONCE, push to a registry with an immutable tag (the git commit SHA), and promote that exact image through all environments. Never rebuild.

Making the manual production gate optional or skippable

Symptom
A broken prod at 4 PM on a Friday because someone auto-promoted through all stages during a hectic merge window.
Fix
Remove the 'allow_failure: true' setting from your production deploy job. Make the gate non-negotiable: without a manual approval, there is no prod deploy. Combine this with branch protection rules so only squash-merged PRs reach main.

Accumulating permanent feature flags

Symptom
A codebase with 40 flags, 30 of which are always-on and have been for 18 months, making every if-else a mystery.
Fix
Treat every flag as temporary infrastructure with a ticket to remove it. Set a 30-day expiry as a default. Add a CI lint step that fails if a flag in code hasn't been touched in over 60 days and is not explicitly marked as permanent. Flags are short-term tools, not long-term architecture.

Merging incomplete features behind flags without a cleanup plan

Symptom
After a feature is fully rolled out, no one remembers to remove the flag and old code path. The codebase becomes littered with dead branches that still execute for some users due to cached flag states.
Fix
Include a flag removal step in your feature rollout checklist. When the feature is at 100%, set a calendar reminder to remove the flag and delete the old code within two weeks. Use a linter that flags any file referencing a flag older than 60 days.
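The 60-day flag lint mentioned in the last two fixes could look roughly like the sketch below. It assumes a hypothetical in-code flag registry (the Flag dataclass) that records when each flag was last touched and whether it is intentionally permanent. How you collect that data, for example from git history or your flag service, is left open.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Threshold taken from the text above; adjust to your team's policy.
MAX_FLAG_AGE = timedelta(days=60)


@dataclass(frozen=True)
class Flag:
    name: str
    last_touched: date
    permanent: bool = False  # e.g. an operational kill switch that is meant to stay


def stale_flags(flags, today):
    """Return the names of non-permanent flags untouched for more than MAX_FLAG_AGE."""
    return sorted(
        flag.name
        for flag in flags
        if not flag.permanent and today - flag.last_touched > MAX_FLAG_AGE
    )


if __name__ == "__main__":
    registry = [
        Flag("new_checkout", date(2025, 11, 1)),
        Flag("dark_mode", date(2026, 2, 20)),
        Flag("payments_kill_switch", date(2024, 1, 1), permanent=True),
    ]
    offenders = stale_flags(registry, today=date.today())
    if offenders:
        # A non-zero exit code fails the CI job.
        raise SystemExit(f"Stale feature flags, remove or renew: {', '.join(offenders)}")
```

Passing today in as a parameter keeps the check deterministic in tests, and the permanent marker gives long-lived operational flags an explicit, reviewable exemption instead of a silent one.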
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
Walk me through what happens between a developer merging a PR and that c...
Q02 · SENIOR
We have a critical bug in production right now and the rollback is takin...
Q03 · SENIOR
What is the difference between a canary deployment and a feature flag, a...
Q01 of 03 · SENIOR

Walk me through what happens between a developer merging a PR and that code reaching production at your last company. What could stop it at each stage?

ANSWER
After merge, the CI pipeline builds a Docker image tagged with the commit SHA, runs unit and integration tests, scans for CVEs, and checks for performance regressions. If any check fails, the pipeline stops there and no deploy is possible. If all pass, the image is promoted to staging automatically, and smoke tests run against staging. Finally, a human must approve the production deploy. At every stage, a failure blocks progress. The only ways to bypass a gate are to make it optional or to ship a hotfix branch through a different process. The most common failure is a vulnerability scan set to allow_failure: true: its failures no longer block anything, so the gate becomes invisible.
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
What is the difference between continuous delivery and continuous deployment?
02
How many environments do I actually need in my CI/CD pipeline?
03
What is a canary deployment and how is it different from a blue-green deployment?
04
Should I use semantic-release or my own custom versioning script?
05
How do I convince my team to adopt release management best practices?