
Chaos Engineering Basics: Monitoring Failures Before They Find You

In Plain English 🔥
Imagine a fire drill at school. Nobody waits for a real fire to find out if the exits work — they set off the alarm on purpose to test the plan. Chaos engineering is that fire drill for your software. You intentionally break things in a controlled way to discover weaknesses before real users ever feel them. The monitoring part is the teacher standing by the exit with a clipboard, writing down exactly how long it took everyone to get out safely.

Every system fails eventually. The brutal truth most engineering teams learn too late is that the failure modes they never tested are always the ones that page them at 3 a.m. on a Friday. Netflix coined the term 'chaos engineering' after their migration to AWS exposed a hard reality: distributed systems fail in ways that are impossible to predict by reading code alone. You have to induce failure — deliberately, scientifically — to build genuine confidence in your system's resilience. Monitoring is what separates chaos engineering from plain vandalism: without deep observability, you're just breaking things and hoping for the best.

The problem chaos engineering solves is the gap between 'we think our system handles this' and 'we have evidence our system handles this.' Runbooks, architecture diagrams, and code reviews are all opinions. A chaos experiment with rigorous monitoring attached is a proof. When a database node disappears, does your read replica take over within your SLA? When a downstream service starts returning 500s, does your circuit breaker actually open, and does it show up in your dashboards before a customer tweets about it? These aren't hypothetical questions — they're experiments with measurable outcomes.

By the end of this article you'll be able to design a complete chaos experiment with a defined steady-state hypothesis, wire up the observability stack needed to validate it, interpret blast-radius telemetry in real time, and avoid the production mistakes that turn a controlled experiment into an uncontrolled incident. We'll use real tooling — Chaos Monkey, Litmus Chaos, Prometheus, and Grafana — with fully runnable configurations and the internal mechanics explained at every step.

Steady-State Hypothesis: The Contract Your Monitoring Must Enforce

Before you inject a single failure, you need a written, measurable definition of 'normal.' This is called the steady-state hypothesis (SSH), and it's the foundation that separates chaos engineering from random testing. Without it, your monitoring has nothing to compare against, and you can't tell whether a blip in your metrics is caused by your experiment or just Tuesday afternoon traffic.

A good SSH has three properties: it is measurable (a concrete metric, not a vague description), it is bounded (a specific threshold like p99 latency < 200ms, not 'fast enough'), and it is observable from your existing monitoring stack without manual inspection. Think of it as a contract — the experiment's job is to stress-test the system, and monitoring's job is to flag the moment that contract is breached.

The SSH drives everything downstream: which metrics you scrape, which alert thresholds you set, how long the experiment runs, and when you abort. Teams that skip this step end up running experiments they can't evaluate — they see metrics move, panic, rollback, and learn nothing. A Prometheus recording rule encoding your SSH turns your hypothesis into an automated referee that fires the moment the blast radius exceeds acceptable bounds.

Note the subtle but critical point: the SSH is about the system's outputs (latency, error rate, throughput), not about the fault you're injecting. You're not asserting 'the database will stay up.' You're asserting 'checkout will complete within 300ms for 99% of requests.' That distinction matters — it's entirely possible the database fails and checkout still hits your SLA via a cache layer. Monitoring that, not the database health itself, is the real experiment.
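To make the contract framing concrete: a Litmus probe is, at its core, a comparator applied to an output metric. The helper below is our own minimal sketch of that check in plain Python, not Litmus code; the operator strings mirror the `comparator.criteria` field you'll see in the manifest that follows.

```python
# Sketch: evaluating a steady-state hypothesis the way a promProbe
# comparator does — observed output metric vs. a bounded threshold.
# The function and mapping are illustrative, not part of the Litmus API.
import operator

COMPARATORS = {
    "<=": operator.le,
    ">=": operator.ge,
    "<": operator.lt,
    ">": operator.gt,
    "==": operator.eq,
}

def ssh_holds(observed: float, criteria: str, threshold: float) -> bool:
    """Return True if the observed output metric satisfies the SSH contract."""
    return COMPARATORS[criteria](observed, threshold)

# Checkout p99 of 189.2ms against the 300ms SSH threshold: contract holds
print(ssh_holds(189.2, "<=", 300.0))   # True
print(ssh_holds(347.1, "<=", 300.0))   # False — abort the experiment
```

Note what is being compared: checkout latency, an output, never "is the database pod running", an internal state.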

steady_state_hypothesis.yaml · YAML
# Litmus Chaos ChaosEngine manifest with embedded steady-state validation
# This defines both the fault injection AND the hypothesis monitoring together
# Run with: kubectl apply -f steady_state_hypothesis.yaml

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: checkout-service-resilience-test
  namespace: production-mirror          # NEVER run first experiments in live prod
spec:
  # The application under test — Litmus uses this to scope blast radius
  appinfo:
    appns: production-mirror
    applabel: "app=checkout-service"
    appkind: deployment

  # Keep experiment job resources after the run so logs survive for post-mortems
  jobCleanUpPolicy: retain

  # Steady-state hypothesis: these probes ARE your monitoring assertions
  # Litmus evaluates them before injection (baseline), during, and after
  experiments:
    - name: pod-cpu-hog                # Fault: saturate CPU on checkout pods
      spec:
        probe:
          # Probe 1: HTTP probe — does the service still respond?
          - name: checkout-endpoint-alive
            type: httpProbe
            httpProbe/inputs:
              url: "http://checkout-service.production-mirror.svc.cluster.local/health"
              insecureSkipVerify: false
              method:
                get:
                  criteria: ==         # Exact HTTP status code match
                  responseCode: "200"
            mode: Continuous           # Keep checking THROUGHOUT the experiment
            runProperties:
              probeTimeout: 5          # Seconds before probe is marked failed
              interval: 10             # Check every 10 seconds
              retry: 2                 # Allow 2 transient failures before flagging
              probePollingInterval: 2

          # Probe 2: Prometheus probe — p99 latency is the REAL hypothesis
          # If CPU is pegged but latency stays under 300ms, the system is resilient
          - name: checkout-p99-latency-under-300ms
            type: promProbe
            promProbe/inputs:
              # PromQL query evaluating our SSH threshold
              # histogram_quantile computes p99 from Prometheus histogram buckets
              endpoint: "http://prometheus.monitoring.svc.cluster.local:9090"
              query: |
                histogram_quantile(
                  0.99,
                  rate(
                    http_request_duration_seconds_bucket{
                      service="checkout-service",
                      route="/api/checkout"
                    }[2m]  
                  )
                ) * 1000
              # The experiment FAILS if p99 latency exceeds 300ms at any Continuous check
              comparator:
                criteria: "<="
                type: float
                value: "300"           # Milliseconds — our SSH threshold
            mode: Continuous
            runProperties:
              probeTimeout: 10
              interval: 15
              retry: 1                 # Only 1 retry — latency spikes matter

        components:
          env:
            # Fault parameters — scope and duration of CPU stress
            - name: CPU_CORES
              value: "2"               # Hog 2 cores per pod
            - name: CPU_LOAD
              value: "90"              # 90% utilization on those cores
            - name: TOTAL_CHAOS_DURATION
              value: "120"             # Run fault for 120 seconds
            - name: PODS_AFFECTED_PERC
              value: "50"              # Blast radius: only 50% of pods affected
            - name: RAMP_TIME
              value: "10"              # 10s stabilization before fault starts
▶ Output
chaosengine.litmuschaos.io/checkout-service-resilience-test created

--- Litmus Chaos Runner Output (kubectl logs -n production-mirror chaos-runner) ---

INFO[0000] Steady State Check — PRE-CHAOS
INFO[0002] Probe: checkout-endpoint-alive → PASS (HTTP 200)
INFO[0004] Probe: checkout-p99-latency-under-300ms → PASS (p99: 47.3ms)
INFO[0010] Steady state established. Injecting fault: pod-cpu-hog
INFO[0020] Fault active on pods: [checkout-7d9f4-xkp2n, checkout-7d9f4-m8qvl]
INFO[0030] Probe: checkout-endpoint-alive → PASS (HTTP 200)
INFO[0030] Probe: checkout-p99-latency-under-300ms → PASS (p99: 189.2ms) ← latency rising
INFO[0060] Probe: checkout-p99-latency-under-300ms → PASS (p99: 241.7ms) ← approaching limit
INFO[0090] Probe: checkout-p99-latency-under-300ms → FAIL (p99: 347.1ms > 300ms threshold)
WARN[0090] SSH VIOLATED — initiating experiment abort
INFO[0092] Fault rolled back. Pods restored.
INFO[0095] Steady State Check — POST-CHAOS
INFO[0097] Probe: checkout-p99-latency-under-300ms → PASS (p99: 52.1ms)

EXPERIMENT VERDICT: FAIL
Reason: p99 latency breached 300ms SLA under 90% CPU saturation on 50% of pods
Recommendation: Investigate CPU throttling limits and horizontal pod autoscaler lag
⚠️ Watch Out: Your SSH Must Be Pre-Measured, Not Best-Guessed
Teams often set SSH thresholds by intuition ('300ms feels right') rather than by measuring actual baseline p99 over 7 days of production traffic. Run `histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[7d]))` in Prometheus first. If your real p99 is already 250ms on a quiet day, a 300ms SSH gives you only 50ms of headroom, and your experiment will fail on ordinary traffic variance. Calibrate from data, not feelings.
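One way to make that calibration mechanical is to derive the threshold from the worst observed baseline, with explicit headroom. This is a sketch under our own conventions: the function name is hypothetical, and the 1.5× headroom factor echoes the guidance in the common-mistakes section at the end of this article.

```python
# Hedged sketch: calibrate an SSH threshold from observed baseline p99
# samples instead of intuition. The headroom multiplier is a convention,
# not a Litmus or Prometheus feature.
def calibrate_ssh_threshold(baseline_p99_ms: list[float], headroom: float = 1.5) -> float:
    """Set the SSH threshold at `headroom` x the worst observed baseline p99."""
    if not baseline_p99_ms:
        raise ValueError("need at least one baseline sample")
    return max(baseline_p99_ms) * headroom

# Hourly p99 samples (ms) over a week, including a 210ms peak-hour spike
weekly_p99 = [47.3, 52.1, 61.0, 88.4, 132.7, 210.0, 95.2]
print(calibrate_ssh_threshold(weekly_p99))   # 315.0 — not "300 feels right"
```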

Blast Radius Monitoring: Seeing Exactly How Far the Damage Spreads

Blast radius is how much of your system a failure actually touches. It sounds simple, but monitoring it in real time during an experiment is one of the hardest observability problems in practice. The reason: failure in distributed systems is rarely localized. A single pod killed by Chaos Monkey can cause retry storms upstream, exhaust connection pools in a shared database, trigger cascading timeouts three service hops away, and spike error rates in a completely unrelated service that shares the same thread pool.

To monitor blast radius properly, you need traces, not just metrics. Metrics tell you that something is broken. Distributed traces tell you which path through your system is breaking and how far the breakage travels. During a chaos experiment, a Jaeger or Tempo trace correlated with your fault injection timestamp is worth more than a wall of Grafana panels.

The practical architecture is this: your chaos tool writes a structured event (fault start, fault end, abort) to a shared event bus. Your monitoring stack consumes those events as annotations on time-series dashboards and as trace baggage propagated via OpenTelemetry. Now every metric spike and every slow trace is automatically contextualized against the fault window. Without this correlation, your engineers spend 40% of experiment time trying to figure out which anomalies are experiment artifacts versus pre-existing issues.
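As a sketch of that event-to-annotation bridge: Grafana's annotations API (`POST /api/annotations`) accepts region annotations with epoch-millisecond `time` and `timeEnd` fields, so fault-start and fault-end events map onto it directly. The helper below only builds the request payload; the event shape and function name are our own assumptions.

```python
# Sketch: translate a chaos tool's fault-start/fault-end event into a
# Grafana region annotation so every dashboard shows the fault window.
from datetime import datetime, timezone

def fault_window_annotation(experiment: str, start: datetime, end: datetime) -> dict:
    """Build a Grafana region-annotation payload covering the fault window."""
    return {
        "time": int(start.timestamp() * 1000),     # Grafana expects epoch millis
        "timeEnd": int(end.timestamp() * 1000),
        "tags": ["chaos", experiment],
        "text": f"Chaos fault window: {experiment}",
    }

start = datetime(2024, 5, 1, 14, 23, 10, tzinfo=timezone.utc)
end = datetime(2024, 5, 1, 14, 25, 10, tzinfo=timezone.utc)
payload = fault_window_annotation("pod-cpu-hog", start, end)
# Then post it, e.g.:
# httpx.post(f"{grafana_url}/api/annotations", json=payload,
#            headers={"Authorization": f"Bearer {token}"})
```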

Blast radius monitoring also needs to be multi-layered: infrastructure layer (node CPU, network I/O, disk), service layer (error rates per downstream dependency), and business layer (order completion rate, payment throughput). The business layer is the one most teams forget — and it's the only layer your CTO actually cares about during a post-mortem.
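The business layer is also the cheapest to compute: it's just a ratio of two counters. A minimal illustrative sketch — the counter semantics mirror the `orders_completed_total` / `orders_initiated_total` PromQL used in the alerting rules below, but the function itself is our own:

```python
# Sketch: the business-layer SLI behind an order-completion alert.
# Inputs are per-second rates, as produced by PromQL rate() over counters.
def order_completion_rate(completed_per_sec: float, initiated_per_sec: float) -> float:
    """Business-layer SLI: fraction of initiated orders that complete."""
    if initiated_per_sec == 0:
        return 1.0  # no traffic — treat as healthy rather than divide by zero
    return completed_per_sec / initiated_per_sec

# 97.3 completions/sec vs 100 initiations/sec → 0.973, above a 0.95 abort line
rate = order_completion_rate(97.3, 100.0)
print(rate >= 0.95)   # True — no business-layer abort
```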

chaos_observability_stack.yaml · YAML
# Prometheus alerting rules that fire DURING a chaos experiment
# These rules correlate with the Litmus chaos event annotations
# Apply with: kubectl apply -f chaos_observability_stack.yaml

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: chaos-blast-radius-alerts
  namespace: monitoring
  labels:
    # This label tells the Prometheus Operator to load these rules
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:
    - name: chaos-experiment-blast-radius
      # Evaluate every 10s during active experiments for fast feedback
      interval: 10s
      rules:

        # Rule 1: Detect upstream retry storm caused by fault injection
        # A retry storm means clients are hammering a degraded service
        - alert: ChaosInducedRetryStorm
          expr: |
            sum by (source_service, target_service) (
              rate(
                grpc_client_handled_total{
                  grpc_code=~"Unavailable|DeadlineExceeded|ResourceExhausted"
                }[1m]
              )
            ) > 50  # More than 50 retryable errors/sec between any two services
          for: 20s   # Must persist 20s — filters transient blips
          labels:
            severity: critical
            experiment_context: "blast-radius-propagation"
          annotations:
            summary: "Retry storm detected: {{ $labels.source_service }} → {{ $labels.target_service }}"
            description: |
              {{ $value | printf "%.1f" }} retryable errors/sec detected between services.
              This indicates the chaos fault has propagated beyond the intended blast radius.
              Check if circuit breaker on {{ $labels.source_service }} is open.
            runbook_url: "https://runbooks.internal/chaos/retry-storm"

        # Rule 2: Connection pool exhaustion — a common blast radius spillover
        # When a service slows down, connection pools fill up and starve other callers
        - alert: DatabaseConnectionPoolExhausted
          expr: |
            (
              db_pool_connections_in_use{pool="checkout-db-pool"}
              /
              db_pool_connections_max{pool="checkout-db-pool"}
            ) > 0.90  # Pool is more than 90% utilized
          for: 15s
          labels:
            severity: critical
            experiment_context: "blast-radius-database"
          annotations:
            summary: "DB connection pool near exhaustion during chaos experiment"
            description: |
              Pool utilization: {{ $value | humanizePercentage }}.
              New requests will queue or fail. Blast radius has reached the database tier.
              Experiment should be aborted if this persists beyond ramp-down window.

        # Rule 3: Business-layer impact — the metric that matters to leadership
        # Drop in order completion rate signals real user impact
        - alert: ChaosBusinessImpactDetected
          expr: |
            (
              rate(orders_completed_total[2m])
              /
              rate(orders_initiated_total[2m])
            ) < 0.95  # Completion rate dropped below 95%
          for: 30s
          labels:
            severity: page            # This one actually wakes someone up
            experiment_context: "blast-radius-business"
          annotations:
            summary: "Order completion rate below 95% — chaos experiment exceeding safe blast radius"
            description: |
              Current completion rate: {{ $value | humanizePercentage }}.
              Expected baseline: >99%. This represents real revenue impact.
              Abort the chaos experiment immediately via Litmus dashboard.

---
# Grafana annotation source — pushes chaos events as vertical lines on dashboards
# This makes every chart self-documenting during an experiment review
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-chaos-annotation-datasource
  namespace: monitoring
data:
  # Grafana reads this query to draw annotations on all dashboards
  # It queries the chaos event log stored as Prometheus labels
  annotation_query: |
    changes(
      litmuschaos_experiment_verdict{
        chaosresult_verdict=~"Pass|Fail|Stopped"
      }[1m]
    ) > 0
  # Annotation label shown in Grafana UI
  annotation_label: "Chaos Fault Window"
▶ Output
prometheusrule.monitoring.coreos.com/chaos-blast-radius-alerts created
configmap/grafana-chaos-annotation-datasource created

--- Prometheus alert evaluation (kubectl logs prometheus-0 -n monitoring | grep chaos) ---

t=14:23:10 level=info msg="Evaluating rule" group=chaos-experiment-blast-radius rule=ChaosInducedRetryStorm
t=14:23:20 level=info msg="Alert firing" alert=ChaosInducedRetryStorm
source_service=payment-service target_service=checkout-service value=73.4
labels: severity=critical experiment_context=blast-radius-propagation

t=14:23:35 level=info msg="Evaluating rule" rule=DatabaseConnectionPoolExhausted
t=14:23:35 level=info msg="Alert resolved" alert=DatabaseConnectionPoolExhausted
-- Pool dropped back to 81% after circuit breaker opened on checkout-service --

t=14:23:50 level=info msg="Evaluating rule" rule=ChaosBusinessImpactDetected
t=14:23:50 level=info msg="Alert NOT firing" -- order completion rate: 97.3% (above threshold)

--- Summary ---
Blast radius contained at: service mesh layer (checkout ↔ payment)
Database pool self-recovered: YES (circuit breaker functioned as designed)
Business impact: NONE (resilience mechanism worked)
Experiment verdict: PARTIAL PASS — retry storm exceeded threshold but auto-recovered
💡 Pro Tip: Use Exemplars to Link Metric Spikes Directly to Traces
Prometheus exemplars (enabled via `--enable-feature=exemplar-storage`) let you embed a trace ID inside a histogram sample. When your p99 spike shows up on a Grafana panel during an experiment, clicking the spike takes you directly to the Jaeger trace for the slowest request in that window. This cuts blast-radius investigation time from minutes to seconds. Configure your app's histogram metric with `WithExemplarFromTraceID()` in the OpenTelemetry SDK — it's two lines of code that pay dividends in every experiment post-mortem.

Automating Experiment Abort: Building a Self-Healing Chaos Pipeline

The most dangerous moment in a chaos experiment isn't when you inject the fault — it's the 90 seconds between 'something is wrong' and 'someone pushed the abort button.' Manual abort loops depend on an engineer watching a dashboard at exactly the right moment. In production-adjacent environments, that's too slow. You need automated abort conditions wired directly into your chaos tooling.

Litmus Chaos and Gremlin both support this via their probe-failure semantics: if a Continuous probe fails, the experiment engine automatically rolls back the fault and marks the experiment as failed. But that's only the first line of defense. The second — and more robust — layer is an external watchdog: a small service that subscribes to your alertmanager webhook, matches on the experiment_context label you saw in the previous section, and calls the chaos tool's API to abort if a severity:page alert fires.

This two-layer approach handles a subtle failure mode: what if the chaos experiment itself breaks your monitoring stack? If Prometheus can't scrape metrics because the node running it is the one you killed, your probe-based abort won't fire. The external watchdog needs to live on a separate node pool, with its own health check, and ideally in a different availability zone from the experiment's blast radius. Monitoring your monitoring during a chaos experiment sounds paranoid — until the one time it matters.
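One cheap way to implement that "monitor your monitoring" check is a staleness budget: if the watchdog hasn't seen a fresh Prometheus sample for more than a few scrape intervals, it treats the silence itself as an abort signal rather than as "no alerts, all good." A hedged sketch — the budget values and function name are our own convention:

```python
# Sketch: treat missing telemetry as a failure, not as health.
# If Prometheus goes quiet mid-experiment, the watchdog should abort.
def monitoring_is_alive(last_sample_age_seconds: float,
                        scrape_interval: float = 15.0,
                        missed_scrapes_budget: int = 3) -> bool:
    """Return False once more than `missed_scrapes_budget` scrapes are missed."""
    return last_sample_age_seconds <= scrape_interval * missed_scrapes_budget

print(monitoring_is_alive(20.0))    # True — within the 45s staleness budget
print(monitoring_is_alive(120.0))   # False — abort: silence is not good news
```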

The abort pipeline should also emit a structured event: experiment name, abort reason, which probe failed, current metric value vs. SSH threshold, and a direct link to the Grafana dashboard snapshot captured at abort time. This snapshot is gold for post-mortems — it captures the exact state of every metric at the moment the system broke, before any auto-healing obscures the evidence.

chaos_watchdog_service.py · PYTHON
#!/usr/bin/env python3
"""
Chaos Experiment Watchdog Service

Listens for Alertmanager webhook callbacks and automatically aborts
a running Litmus Chaos experiment if a severity:page alert fires
during the experiment window.

Designed to run on a SEPARATE NODE POOL from the experiment blast radius.
Requirements: pip install fastapi uvicorn httpx pydantic
Run with: uvicorn chaos_watchdog_service:watchdog_app --host 0.0.0.0 --port 8090
"""

import logging
import os
from datetime import datetime, timezone
from typing import Optional

import httpx
from fastapi import FastAPI, Request
from pydantic import BaseModel, Field

# ── Configuration ─────────────────────────────────────────────────────────────
LITMUS_API_URL = os.environ["LITMUS_CHAOS_API_URL"]        # e.g. http://litmus.monitoring:9002
LITMUS_API_TOKEN = os.environ["LITMUS_API_TOKEN"]          # Service account token
GRAFANA_API_URL = os.environ["GRAFANA_API_URL"]            # For snapshot capture on abort
GRAFANA_API_TOKEN = os.environ["GRAFANA_API_TOKEN"]

# Dashboard UID that shows all chaos-related panels — captured at abort time
CHAOS_DASHBOARD_UID = os.environ.get("CHAOS_DASHBOARD_UID", "chaos-blast-radius-overview")

# Only abort experiments if alerts carry this label — prevents false triggers
REQUIRED_EXPERIMENT_LABEL = "experiment_context"

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s — %(message)s"
)
logger = logging.getLogger("chaos-watchdog")

# ── Pydantic models matching Alertmanager webhook payload ──────────────────────
class AlertmanagerLabel(BaseModel):
    alertname: str
    severity: Optional[str] = None
    experiment_context: Optional[str] = None   # Set in our PrometheusRule labels
    namespace: Optional[str] = None

class AlertmanagerAnnotation(BaseModel):
    summary: Optional[str] = None
    description: Optional[str] = None

class AlertmanagerAlert(BaseModel):
    status: str                                # 'firing' or 'resolved'
    labels: AlertmanagerLabel
    annotations: AlertmanagerAnnotation
    startsAt: str
    generatorURL: str

class AlertmanagerWebhookPayload(BaseModel):
    version: str = Field(default="4")
    status: str
    alerts: list[AlertmanagerAlert]

# ── FastAPI app ────────────────────────────────────────────────────────────────
watchdog_app = FastAPI(
    title="Chaos Experiment Watchdog",
    description="Auto-aborts chaos experiments when SSH is violated",
    version="1.0.0"
)

@watchdog_app.post("/alertmanager/webhook")
async def handle_alertmanager_webhook(payload: AlertmanagerWebhookPayload, request: Request):
    """
    Alertmanager calls this endpoint when an alert fires or resolves.
    We only act on 'firing' alerts that carry our experiment_context label.
    """
    logger.info(f"Received webhook — status={payload.status}, alert_count={len(payload.alerts)}")

    for alert in payload.alerts:
        # Only act on actively firing alerts
        if alert.status != "firing":
            logger.debug(f"Skipping resolved alert: {alert.labels.alertname}")
            continue

        # Only abort if alert is explicitly tagged as chaos-experiment-related
        experiment_context = alert.labels.experiment_context
        if not experiment_context:
            logger.debug(f"Alert {alert.labels.alertname} has no experiment_context — skipping")
            continue

        # severity:page means business impact — always abort
        if alert.labels.severity == "page":
            logger.warning(
                f"PAGE-SEVERITY alert during chaos experiment! "
                f"Alert={alert.labels.alertname} "
                f"Context={experiment_context}"
            )

            # Step 1: Capture Grafana dashboard snapshot BEFORE aborting
            # This preserves the system state at the moment of breach
            snapshot_url = await capture_grafana_snapshot(
                dashboard_uid=CHAOS_DASHBOARD_UID,
                annotation=f"AUTO-ABORT: {alert.labels.alertname} at {datetime.now(timezone.utc).isoformat()}"
            )

            # Step 2: Abort the running chaos experiment via Litmus API
            abort_result = await abort_litmus_experiment(
                experiment_context=experiment_context,
                abort_reason=alert.annotations.summary or alert.labels.alertname
            )

            logger.info(
                f"Experiment aborted successfully. "
                f"Snapshot: {snapshot_url} "
                f"Litmus response: {abort_result}"
            )

    return {"status": "processed", "timestamp": datetime.now(timezone.utc).isoformat()}


async def capture_grafana_snapshot(dashboard_uid: str, annotation: str) -> str:
    """
    Creates a Grafana snapshot of the chaos dashboard at the current moment.
    Returns the public snapshot URL for inclusion in post-mortem reports.
    """
    async with httpx.AsyncClient(timeout=10.0) as grafana_client:
        # First, get the dashboard JSON model
        dashboard_response = await grafana_client.get(
            f"{GRAFANA_API_URL}/api/dashboards/uid/{dashboard_uid}",
            headers={"Authorization": f"Bearer {GRAFANA_API_TOKEN}"}
        )
        dashboard_response.raise_for_status()
        dashboard_model = dashboard_response.json()["dashboard"]

        # Create a snapshot — Grafana stores it and returns a share URL
        snapshot_response = await grafana_client.post(
            f"{GRAFANA_API_URL}/api/snapshots",
            headers={"Authorization": f"Bearer {GRAFANA_API_TOKEN}"},
            json={
                "dashboard": dashboard_model,
                "name": f"Chaos Abort Snapshot — {annotation}",
                "expires": 604800,     # Snapshot expires in 7 days
            }
        )
        snapshot_response.raise_for_status()
        snapshot_url = snapshot_response.json()["url"]
        logger.info(f"Grafana snapshot captured: {snapshot_url}")
        return snapshot_url


async def abort_litmus_experiment(experiment_context: str, abort_reason: str) -> dict:
    """
    Calls the Litmus Chaos API to stop the running experiment.
    experiment_context maps to the ChaosEngine name via our labeling convention.
    """
    # Derive the ChaosEngine name from the experiment_context label.
    # Convention: the ChaosEngine name is the experiment_context value
    # with the "blast-radius-" prefix stripped.
    chaos_engine_name = experiment_context.replace("blast-radius-", "")

    async with httpx.AsyncClient(timeout=15.0) as litmus_client:
        # Litmus REST API: PATCH the engine status to 'stop'
        stop_response = await litmus_client.patch(
            f"{LITMUS_API_URL}/api/chaosengine/{chaos_engine_name}",
            headers={
                "Authorization": f"Bearer {LITMUS_API_TOKEN}",
                "Content-Type": "application/json"
            },
            json={
                "spec": {
                    "engineState": "stop"    # Litmus graceful stop — rolls back faults
                },
                "metadata": {
                    "annotations": {
                        # Record WHY this was aborted — visible in Litmus dashboard
                        "chaos.abort.reason": abort_reason,
                        "chaos.abort.timestamp": datetime.now(timezone.utc).isoformat(),
                        "chaos.abort.source": "watchdog-service"
                    }
                }
            }
        )
        stop_response.raise_for_status()
        logger.info(f"Litmus experiment '{chaos_engine_name}' stopped via API")
        return stop_response.json()


@watchdog_app.get("/health")
async def health_check():
    """Liveness probe — the watchdog must be reachable during experiments."""
    return {"status": "healthy", "timestamp": datetime.now(timezone.utc).isoformat()}
▶ Output
INFO chaos-watchdog — Starting Chaos Experiment Watchdog on :8090
INFO chaos-watchdog — 14:23:49 Received webhook — status=firing, alert_count=2
INFO chaos-watchdog — 14:23:49 Alert ChaosInducedRetryStorm has experiment_context=blast-radius-propagation (severity=critical — not page, skipping abort)
WARN chaos-watchdog — 14:23:49 PAGE-SEVERITY alert during chaos experiment!
Alert=ChaosBusinessImpactDetected
Context=blast-radius-business
INFO chaos-watchdog — 14:23:50 Grafana snapshot captured:
https://grafana.internal/dashboard/snapshot/kLmN9pQrXyZ2aB
INFO chaos-watchdog — 14:23:51 Litmus experiment 'business' stopped via API
INFO chaos-watchdog — 14:23:51 Experiment aborted successfully.
Snapshot: https://grafana.internal/dashboard/snapshot/kLmN9pQrXyZ2aB
Litmus response: {"spec": {"engineState": "stop"}, "status": "updated"}
INFO uvicorn — 127.0.0.1:0 - "POST /alertmanager/webhook HTTP/1.1" 200 OK
🔥 Interview Gold: Why Two Abort Layers Beat One
Interviewers love this question: 'What happens if your chaos experiment breaks your monitoring stack?' The answer is exactly why you need an external watchdog *separate* from probe-based abort. Probe-based abort lives inside Litmus — if the cluster it runs in is compromised, the probe can't fire. The external watchdog runs in a separate AZ with its own Prometheus and Alertmanager. If a candidate only describes probe-based abort, they've demonstrated they haven't thought about second-order failure modes — which is the entire point of chaos engineering.
| Aspect | Litmus Chaos (CNCF) | Gremlin (SaaS) |
| --- | --- | --- |
| Deployment model | Self-hosted Kubernetes operator | SaaS with agent on your infra |
| SSH enforcement | Built-in Continuous/Edge probes with Prometheus integration | Attack halt conditions via Gremlin API, less native PromQL |
| Blast radius control | PODS_AFFECTED_PERC, namespace scoping, label selectors | Target sets with tag-based filters, AZ pinning in UI |
| Abort mechanism | Probe failure → automatic rollback, PATCH API for external abort | Attack stop via API or UI, webhooks for external triggers |
| Observability integration | Prometheus, Grafana annotation events via chaos_exporter | Pre-built integrations with Datadog, PagerDuty, Slack |
| Cost | Free (open-source), pay for LitmusChaos SaaS (Harness) | Paid per user/node — can be significant at scale |
| Experiment GitOps | YAML ChaosEngine manifests — version controlled natively | Scenario templates via API/Terraform, less kubectl-native |
| Learning curve | Higher — requires Kubernetes fluency | Lower — GUI-driven, good for teams new to chaos |
| Post-mortem artifacts | Manual — you build the snapshot pipeline (as above) | Built-in reports with metric overlays and timeline view |

🎯 Key Takeaways

  • A steady-state hypothesis must be a measurable output metric (p99 latency, error rate) — not a system health assertion. 'The pod stays up' is not an SSH. 'Checkout completes within 300ms for 99% of requests' is.
  • Blast radius monitoring requires all three layers: infrastructure (CPU/network), service (error rates per dependency), and business (order completion rate). Alerts without the business layer miss the only metric leadership acts on.
  • Probe-based abort inside Litmus is your first defense. An external watchdog running in a separate AZ and subscribing to Alertmanager webhooks is your second — critical for the case where the chaos experiment impacts your monitoring infrastructure itself.
  • Grafana snapshots captured at abort time are non-negotiable for post-mortems. Automated abort is fast enough that metrics recover before an engineer can open a browser, so the snapshot is the only permanent record of what the system looked like at the moment of failure.
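Capturing that snapshot can be wired into the abort path itself. A minimal sketch, assuming a hypothetical Grafana URL and API key: POST /api/snapshots is Grafana's snapshot endpoint, and expires set to 0 keeps the snapshot indefinitely.

```python
"""Sketch: build the Grafana snapshot request the abort handler would fire,
so the post-mortem has a permanent record. URL and key are illustrative."""
import json
import urllib.request

def snapshot_request(grafana_url: str, api_key: str, dashboard_json: dict,
                     name: str, expires_sec: int = 0) -> urllib.request.Request:
    """Build a POST /api/snapshots request (expires_sec=0 → never expires)."""
    body = json.dumps({"dashboard": dashboard_json,
                       "name": name, "expires": expires_sec}).encode()
    return urllib.request.Request(
        f"{grafana_url}/api/snapshots", data=body, method="POST",
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"})

# The abort handler would send this with urllib.request.urlopen(req):
req = snapshot_request("https://grafana.example.internal", "API_KEY",
                       {"title": "chaos-blast-radius"}, "abort-pod-kill-checkout")
print(req.get_full_url())  # → https://grafana.example.internal/api/snapshots
```

Naming the snapshot after the experiment and abort event makes it findable from the post-mortem doc weeks later.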

⚠ Common Mistakes to Avoid

  • Mistake 1: Running chaos experiments directly in production before ever testing in a staging mirror — Symptom: Real users are impacted, experiment can't be safely aborted because the blast radius is undefined — Fix: Always run the first three iterations of an experiment in a production-mirror namespace or environment with synthetic traffic. Use Litmus's appns field and Kubernetes namespace scoping to hard-constrain the blast radius. Only graduate to production after you've observed the experiment complete (pass and fail) in a controlled environment and verified your abort pipeline works.
  • Mistake 2: Setting SSH thresholds without measuring the real baseline — Symptom: Every experiment 'passes' even when the system is visibly degraded, or every experiment 'fails' within 10 seconds due to normal traffic variance — Fix: Graph histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{service="your-service"}[5m])) in Prometheus over a full week including peak hours and note the maximum. Set your SSH threshold at 1.5× the observed peak p99, not at an aspirational value. Also add a for: 30s window so single-packet blips don't trigger false SSH violations.
  • Mistake 3: Forgetting to label chaos-related Prometheus alerts with experiment_context — Symptom: Your on-call engineer gets paged during a planned chaos experiment, treats it as a real incident, rolls back a separate deployment as root cause, and the chaos experiment never gets properly analyzed — Fix: Add a static experiment_context label to every PrometheusRule alert that could fire during a chaos window. Configure Alertmanager to route alerts with this label to a dedicated 'chaos-experiments' receiver instead of the primary on-call rotation. Brief your on-call team before every experiment with a calendar block that includes the exact experiment window and a link to the Litmus dashboard.

Interview Questions on This Topic

  • Q: Walk me through how you'd design the observability stack for a chaos experiment targeting a payment service — specifically, what's your steady-state hypothesis, which metrics would you monitor, and how would you determine if the experiment should be automatically aborted?
  • Q: Your chaos experiment kills 50% of the pods in service A. Metrics for service A look fine, but service B's error rate spikes 30 seconds later. How does your monitoring setup detect this blast-radius propagation, and what does it tell you about service B's design?
  • Q: A candidate says 'We use probe-based abort in Litmus, so we're covered.' What's the failure mode this misses, and how would you architect around it?

Frequently Asked Questions

What is the difference between chaos engineering and load testing?

Load testing validates that your system handles expected traffic volume — it's about quantity. Chaos engineering validates that your system handles unexpected failures — it's about resilience. You can pass a load test and still have a catastrophic outage when a single availability zone goes down. The two are complementary: run load tests to establish baseline capacity, then run chaos experiments with load active to simulate real failure conditions under realistic traffic.

How do I know if my chaos experiment is too risky to run in production?

Ask three questions: Do you have a defined SSH with automated abort? Have you run this experiment in staging and seen both a pass and a controlled fail? Does your blast-radius control limit the experiment to less than 50% of capacity in any single tier? If any answer is 'no,' the experiment isn't ready for production. The steady-state hypothesis and automated abort are non-negotiable safety gates — running without them isn't chaos engineering, it's just breaking things.
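Those three questions make a natural pre-flight gate that can live in the experiment pipeline. A sketch, with illustrative field names:

```python
"""Sketch: the three-question production gate above as an explicit
pre-flight check. Field names are illustrative, not a Litmus API."""
from dataclasses import dataclass

@dataclass
class ExperimentPlan:
    has_ssh_with_auto_abort: bool       # defined SSH wired to automated abort
    passed_and_failed_in_staging: bool  # observed both a pass and a controlled fail
    max_tier_impact_pct: float          # worst-case % of capacity hit in any tier

def ready_for_production(plan: ExperimentPlan) -> bool:
    """All three safety gates must hold; any 'no' blocks the production run."""
    return (plan.has_ssh_with_auto_abort
            and plan.passed_and_failed_in_staging
            and plan.max_tier_impact_pct < 50.0)

print(ready_for_production(ExperimentPlan(True, True, 40.0)))   # → True
print(ready_for_production(ExperimentPlan(True, False, 40.0)))  # → False
```

Encoding the gate in code rather than a wiki checklist means a CI job can refuse to apply the ChaosEngine manifest when any answer is 'no.'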

What's the minimal observability setup needed before starting chaos engineering?

At minimum you need: Prometheus scraping your services with RED metrics (Rate, Errors, Duration), Grafana dashboards showing those metrics with at least 30 days of history (for SSH baseline calibration), and Alertmanager configured with at least one working receiver. Without these three, you can't define a steady-state hypothesis, can't observe blast radius propagation in real time, and can't trigger automated abort. Distributed tracing (Jaeger/Tempo) is strongly recommended but can be added incrementally — start with metrics-only chaos experiments and layer traces in as your practice matures.

TheCodeForge Editorial Team (Verified Author)

Written and reviewed by senior developers with real-world experience across enterprise, startup and open-source projects. Every article on TheCodeForge is written to be clear, accurate and genuinely useful — not just SEO filler.
