Senior 8 min · March 06, 2026

Chaos Engineering — Why Probe Abort Missed Our Retry Storm

A 50% pod fault triggered retry storms exhausting 98/100 DB connections.

Naren · Founder
Quick Answer
  • Chaos engineering deliberately injects failures to validate system resilience
  • Steady-state hypothesis (SSH) is a measurable output metric — not a health assertion
  • Blast radius monitoring requires three layers: infrastructure, service, and business
  • Automated abort via probes is first line; external watchdog in separate AZ is second
  • Without traces correlated to fault events, 40% of experiment time is wasted
Plain-English First

Imagine a fire drill at school. Nobody waits for a real fire to find out if the exits work — they set off the alarm on purpose to test the plan. Chaos engineering is that fire drill for your software. You intentionally break things in a controlled way to discover weaknesses before real users ever feel them. The monitoring part is the teacher standing by the exit with a clipboard, writing down exactly how long it took everyone to get out safely.

Every system fails eventually. The brutal truth most engineering teams learn too late is that the failure modes they never tested are always the ones that page them at 3 a.m. on a Friday. Netflix coined the term 'chaos engineering' after their migration to AWS exposed a hard reality: distributed systems fail in ways that are impossible to predict by reading code alone. You have to induce failure — deliberately, scientifically — to build genuine confidence in your system's resilience. Monitoring is what separates chaos engineering from plain vandalism: without deep observability, you're just breaking things and hoping for the best.

The problem chaos engineering solves is the gap between 'we think our system handles this' and 'we have evidence our system handles this.' Runbooks, architecture diagrams, and code reviews are all opinions. A chaos experiment with rigorous monitoring attached is a proof. When a database node disappears, does your read replica take over within your SLA? When a downstream service starts returning 500s, does your circuit breaker actually open, and does it show up in your dashboards before a customer tweets about it? These aren't hypothetical questions — they're experiments with measurable outcomes.

By the end of this article you'll be able to design a complete chaos experiment with a defined steady-state hypothesis, wire up the observability stack needed to validate it, interpret blast-radius telemetry in real time, and avoid the production mistakes that turn a controlled experiment into an uncontrolled incident. We'll use real tooling — Chaos Monkey, Litmus Chaos, Prometheus, and Grafana — with fully runnable configurations and the internal mechanics explained at every step.

Steady-State Hypothesis: The Contract Your Monitoring Must Enforce

Before you inject a single failure, you need a written, measurable definition of 'normal.' This is called the steady-state hypothesis (SSH), and it's the foundation that separates chaos engineering from random testing. Without it, your monitoring has nothing to compare against, and you can't tell whether a blip in your metrics is caused by your experiment or just Tuesday afternoon traffic.

A good SSH has three properties: it is measurable (a concrete metric, not a vague description), it is bounded (a specific threshold like p99 latency < 200ms, not 'fast enough'), and it is observable from your existing monitoring stack without manual inspection. Think of it as a contract — the experiment's job is to stress-test the system, and monitoring's job is to flag the moment that contract is breached.

The SSH drives everything downstream: which metrics you scrape, which alert thresholds you set, how long the experiment runs, and when you abort. Teams that skip this step end up running experiments they can't evaluate: they see metrics move, panic, roll back, and learn nothing. A Prometheus alerting rule encoding your SSH turns your hypothesis into an automated referee that fires the moment the blast radius exceeds acceptable bounds.

Note the subtle but critical point: the SSH is about the system's outputs (latency, error rate, throughput), not about the fault you're injecting. You're not asserting 'the database will stay up.' You're asserting 'checkout will complete within 300ms for 99% of requests.' That distinction matters — it's entirely possible the database fails and checkout still hits your SLA via a cache layer. Monitoring that, not the database health itself, is the real experiment.
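
The three SSH properties can be made concrete in a few lines of Python. This is only a minimal sketch: the class name, field names, and comparator table are illustrative assumptions, not part of Litmus or Prometheus. It shows that the hypothesis is a bound on an output metric and says nothing about which fault is injected.

```python
from dataclasses import dataclass
import operator

# Comparators an SSH bound may use; the keys mirror probe-style criteria strings.
_COMPARATORS = {
    "<=": operator.le,
    "<": operator.lt,
    ">=": operator.ge,
    ">": operator.gt,
}


@dataclass(frozen=True)
class SteadyStateHypothesis:
    """One measurable, bounded assertion about a system OUTPUT (hypothetical type)."""
    name: str         # human-readable label
    metric: str       # the output metric the bound applies to
    criteria: str     # one of the keys in _COMPARATORS
    threshold: float  # the bound, in the metric's own unit

    def holds(self, observed: float) -> bool:
        """Return True when the observed value satisfies the bound."""
        return _COMPARATORS[self.criteria](observed, self.threshold)


# The checkout hypothesis from the text: p99 latency stays at or under 300 ms.
checkout_ssh = SteadyStateHypothesis(
    name="checkout-p99-latency",
    metric="p99_latency_ms",
    criteria="<=",
    threshold=300.0,
)

print(checkout_ssh.holds(241.7))  # within bounds
print(checkout_ssh.holds(347.1))  # contract breached
```

Nothing in the object mentions the database or the CPU: whatever fault runs, only the observed output decides whether the contract holds.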

steady_state_hypothesis.yaml (YAML)
# Litmus Chaos ChaosEngine manifest with embedded steady-state validation
# This defines both the fault injection AND the hypothesis monitoring together
# Run with: kubectl apply -f steady_state_hypothesis.yaml

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: checkout-service-resilience-test
  namespace: production-mirror          # NEVER run first experiments in live prod
spec:
  # The application under test — Litmus uses this to scope blast radius
  appinfo:
    appns: production-mirror
    applabel: "app=checkout-service"
    appkind: deployment

  # Keep the experiment job and its logs after the run for post-mortem analysis.
  # Auto-stop on a violated hypothesis is a per-probe setting, not part of this policy.
  jobCleanUpPolicy: retain

  # Steady-state hypothesis: these probes ARE your monitoring assertions
  # Litmus evaluates them before injection (baseline), during, and after
  experiments:
    - name: pod-cpu-hog                # Fault: saturate CPU on checkout pods
      spec:
        probe:
          # Probe 1: HTTP probe — does the service still respond?
          - name: checkout-endpoint-alive
            type: httpProbe
            httpProbe/inputs:
              url: "http://checkout-service.production-mirror.svc.cluster.local/health"
              insecureSkipVerify: false
              method:
                get:
                  criteria: ==         # Exact HTTP status code match
                  responseCode: "200"
            mode: Continuous           # Keep checking THROUGHOUT the experiment
            runProperties:
              probeTimeout: 5          # Seconds before probe is marked failed
              interval: 10             # Check every 10 seconds
              retry: 2                 # Allow 2 transient failures before flagging
              probePollingInterval: 2

          # Probe 2: Prometheus probe — p99 latency is the REAL hypothesis
          # If CPU is pegged but latency stays under 300ms, the system is resilient
          - name: checkout-p99-latency-under-300ms
            type: promProbe
            promProbe/inputs:
              # PromQL query evaluating our SSH threshold
              # histogram_quantile computes p99 from Prometheus histogram buckets
              endpoint: "http://prometheus.monitoring.svc.cluster.local:9090"
              query: |
                # sum by (le) merges the per-pod series so the probe gets ONE value
                histogram_quantile(
                  0.99,
                  sum by (le) (
                    rate(
                      http_request_duration_seconds_bucket{
                        service="checkout-service",
                        route="/api/checkout"
                      }[2m]
                    )
                  )
                ) * 1000
              # The experiment FAILS if p99 latency exceeds 300ms at any Continuous check
              comparator:
                criteria: "<="
                type: float
                value: "300"           # Milliseconds — our SSH threshold
            mode: Continuous
            runProperties:
              probeTimeout: 10
              interval: 15
              retry: 1                 # Only 1 retry: latency spikes matter
              stopOnFailure: true      # Stop the experiment when this probe fails

        components:
          env:
            # Fault parameters — scope and duration of CPU stress
            - name: CPU_CORES
              value: "2"               # Hog 2 cores per pod
            - name: CPU_LOAD
              value: "90"              # 90% utilization on those cores
            - name: TOTAL_CHAOS_DURATION
              value: "120"             # Run fault for 120 seconds
            - name: PODS_AFFECTED_PERC
              value: "50"              # Blast radius: only 50% of pods affected
            - name: RAMP_TIME
              value: "10"              # 10s stabilization before fault starts
Output
chaosengine.litmuschaos.io/checkout-service-resilience-test created
--- Litmus Chaos Runner Output (kubectl logs -n production-mirror chaos-runner) ---
INFO[0000] Steady State Check — PRE-CHAOS
INFO[0002] Probe: checkout-endpoint-alive → PASS (HTTP 200)
INFO[0004] Probe: checkout-p99-latency-under-300ms → PASS (p99: 47.3ms)
INFO[0010] Steady state established. Injecting fault: pod-cpu-hog
INFO[0020] Fault active on pods: [checkout-7d9f4-xkp2n, checkout-7d9f4-m8qvl]
INFO[0030] Probe: checkout-endpoint-alive → PASS (HTTP 200)
INFO[0030] Probe: checkout-p99-latency-under-300ms → PASS (p99: 189.2ms) ← latency rising
INFO[0060] Probe: checkout-p99-latency-under-300ms → PASS (p99: 241.7ms) ← approaching limit
INFO[0090] Probe: checkout-p99-latency-under-300ms → FAIL (p99: 347.1ms > 300ms threshold)
WARN[0090] SSH VIOLATED — initiating experiment abort
INFO[0092] Fault rolled back. Pods restored.
INFO[0095] Steady State Check — POST-CHAOS
INFO[0097] Probe: checkout-p99-latency-under-300ms → PASS (p99: 52.1ms)
EXPERIMENT VERDICT: FAIL
Reason: p99 latency breached 300ms SLA under 90% CPU saturation on 50% of pods
Recommendation: Investigate CPU throttling limits and horizontal pod autoscaler lag
Watch Out: Your SSH Must Be Pre-Experiment, Not Best-Guess
Teams often set SSH thresholds by intuition ('300ms feels right') rather than by measuring the actual baseline p99 over 7 days of production traffic. Run histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[7d]))) in Prometheus first. If your real p99 is already 250ms on a quiet day, a 300ms SSH gives you only 50ms of headroom, and normal traffic variance alone can breach it. Calibrate from data, not feelings.
Production Insight
SSH thresholds set too tight trigger false aborts, so experiments never complete.
Set SSH to 1.5x the 7-day observed peak p99, not an aspirational value.
On the matching Prometheus alerting rule, add a 'for: 30s' window to filter transient blips caused by normal traffic variance.
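
The calibration rule above is simple arithmetic, sketched here in Python. The function names and sample values are illustrative assumptions; in practice the samples would come from your own 7-day Prometheus history.

```python
def calibrate_threshold(p99_samples_ms, multiplier=1.5):
    """Derive an SSH bound from observed p99 samples (one per measurement window).

    Uses the peak of the observed p99 values times a safety multiplier,
    following the 1.5x rule of thumb described above.
    """
    if not p99_samples_ms:
        raise ValueError("need at least one baseline sample")
    return max(p99_samples_ms) * multiplier


def headroom_ms(threshold_ms, quiet_day_p99_ms):
    """How much latency budget remains before normal traffic breaches the SSH."""
    return threshold_ms - quiet_day_p99_ms


# Hypothetical baseline samples in milliseconds.
baseline = [47.3, 52.1, 60.0]
print(calibrate_threshold(baseline))   # bound derived from the observed peak
print(headroom_ms(300.0, 250.0))       # the tight 50 ms case from the callout
```

A small headroom value is the early warning: the experiment will abort on ordinary variance before the fault has had any effect.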
Key Takeaway
Your SSH is a contract on outputs, not on infrastructure.
Measure real baseline over 7 days before picking a threshold.
The experiment passes even if infrastructure fails — as long as outputs stay within bounds.

Blast Radius Monitoring: Seeing Exactly How Far the Damage Spreads

Blast radius is how much of your system a failure actually touches. It sounds simple, but monitoring it in real time during an experiment is one of the hardest observability problems in practice. The reason: failure in distributed systems is rarely localized. A single pod killed by Chaos Monkey can cause retry storms upstream, exhaust connection pools in a shared database, trigger cascading timeouts three service hops away, and spike error rates in a completely unrelated service that shares the same thread pool.
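
To see why retries turn a local fault into a storm, it helps to work out the worst-case multiplication. The sketch below assumes every hop retries independently, every attempt fails, and no retry budget or circuit breaker intervenes; the function name is hypothetical.

```python
def worst_case_amplification(retries_per_hop: int, hops: int) -> int:
    """Maximum attempts reaching the deepest service for ONE user request.

    Each hop makes 1 original attempt plus `retries_per_hop` retries, and every
    attempt at one hop triggers a full set of attempts at the next hop, so the
    load multiplies per hop instead of adding.
    """
    if retries_per_hop < 0 or hops < 0:
        raise ValueError("retries_per_hop and hops must be non-negative")
    return (1 + retries_per_hop) ** hops


# Two retries at each of three hops: one request can become 27 attempts
# against the service at the bottom of the call chain.
print(worst_case_amplification(retries_per_hop=2, hops=3))
```

This is why a fault on half the pods of one service can exhaust a connection pool several hops away: the deepest dependency absorbs the multiplied load.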

To monitor blast radius properly, you need traces, not just metrics. Metrics tell you that something is broken. Distributed traces tell you which path through your system is breaking and how far the breakage travels. During a chaos experiment, a Jaeger or Tempo trace correlated with your fault injection timestamp is worth more than a wall of Grafana panels.

The practical architecture is this: your chaos tool writes a structured event (fault start, fault end, abort) to a shared event bus. Your monitoring stack consumes those events as annotations on time-series dashboards and as trace baggage propagated via OpenTelemetry. Now every metric spike and every slow trace is automatically contextualized against the fault window. Without this correlation, your engineers spend 40% of experiment time trying to figure out which anomalies are experiment artifacts versus pre-existing issues.
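
The correlation step can be sketched as a small lookup: given the fault windows published by the chaos tool, decide whether an anomaly belongs to an experiment or predates it. The `FaultWindow` type, the grace period, and the timestamps are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class FaultWindow:
    """A fault start/end pair as emitted by the chaos tool (hypothetical shape)."""
    experiment: str
    start: datetime
    end: datetime


def classify_anomaly(anomaly_at, windows, grace=timedelta(seconds=30)):
    """Return the experiment whose window (plus a grace period) covers the anomaly.

    Returns None when the anomaly falls outside every window, which marks it
    as a pre-existing or unrelated issue.
    """
    for window in windows:
        if window.start <= anomaly_at <= window.end + grace:
            return window.experiment
    return None


windows = [
    FaultWindow(
        experiment="checkout-service-resilience-test",
        start=datetime(2026, 3, 6, 14, 23, 0, tzinfo=timezone.utc),
        end=datetime(2026, 3, 6, 14, 25, 0, tzinfo=timezone.utc),
    )
]

spike = datetime(2026, 3, 6, 14, 23, 20, tzinfo=timezone.utc)
print(classify_anomaly(spike, windows))
```

The grace period exists because effects such as drained queues and reconnect bursts often trail the end of the fault by a few seconds.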

Blast radius monitoring also needs to be multi-layered: infrastructure layer (node CPU, network I/O, disk), service layer (error rates per downstream dependency), and business layer (order completion rate, payment throughput). The business layer is the one most teams forget — and it's the only layer your CTO actually cares about during a post-mortem.
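
One way to summarize the three layers during an experiment is to report the deepest layer that was breached. The sketch below is an assumption about how you might encode that, with hypothetical names; the breach flags would come from your alerting rules.

```python
from typing import Dict, Optional

# Ordered from closest to the fault to closest to the customer.
LAYER_ORDER = ("infrastructure", "service", "business")


def deepest_breached_layer(breaches: Dict[str, bool]) -> Optional[str]:
    """Return the layer nearest the customer that reports a breach, or None."""
    deepest = None
    for layer in LAYER_ORDER:
        if breaches.get(layer, False):
            deepest = layer
    return deepest


# CPU is saturated and a retry storm is visible, but orders still complete.
observed = {"infrastructure": True, "service": True, "business": False}
print(deepest_breached_layer(observed))
```

A result of "service" with no business breach means the resilience mechanisms held; a result of "business" is the signal to abort.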

chaos_observability_stack.yaml (YAML)
# Prometheus alerting rules that fire DURING a chaos experiment
# These rules correlate with the Litmus chaos event annotations
# Apply with: kubectl apply -f chaos_observability_stack.yaml

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: chaos-blast-radius-alerts
  namespace: monitoring
  labels:
    # This label tells the Prometheus Operator to load these rules
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:
    - name: chaos-experiment-blast-radius
      # Evaluate every 10s during active experiments for fast feedback
      interval: 10s
      rules:

        # Rule 1: Detect upstream retry storm caused by fault injection
        # A retry storm means clients are hammering a degraded service
        - alert: ChaosInducedRetryStorm
          expr: |
            sum by (source_service, target_service) (
              rate(
                grpc_client_handled_total{
                  grpc_code=~"Unavailable|DeadlineExceeded|ResourceExhausted"
                }[1m]
              )
            ) > 50  # More than 50 retryable errors/sec between any two services
          for: 20s   # Must persist 20s — filters transient blips
          labels:
            severity: critical
            experiment_context: "blast-radius-propagation"
          annotations:
            summary: "Retry storm detected: {{ $labels.source_service }} → {{ $labels.target_service }}"
            description: |
              {{ $value | printf "%.1f" }} retryable errors/sec detected between services.
              This indicates the chaos fault has propagated beyond the intended blast radius.
              Check if circuit breaker on {{ $labels.source_service }} is open.
            runbook_url: "https://runbooks.internal/chaos/retry-storm"

        # Rule 2: Connection pool exhaustion — a common blast radius spillover
        # When a service slows down, connection pools fill up and starve other callers
        - alert: DatabaseConnectionPoolExhausted
          expr: |
            (
              db_pool_connections_in_use{pool="checkout-db-pool"}
              /
              db_pool_connections_max{pool="checkout-db-pool"}
            ) > 0.90  # Pool is more than 90% utilized
          for: 15s
          labels:
            severity: critical
            experiment_context: "blast-radius-database"
          annotations:
            summary: "DB connection pool near exhaustion during chaos experiment"
            description: |
              Pool utilization: {{ $value | humanizePercentage }}.
              New requests will queue or fail. Blast radius has reached the database tier.
              Experiment should be aborted if this persists beyond ramp-down window.

        # Rule 3: Business-layer impact — the metric that matters to leadership
        # Drop in order completion rate signals real user impact
        - alert: ChaosBusinessImpactDetected
          expr: |
            (
              rate(orders_completed_total[2m])
              /
              rate(orders_initiated_total[2m])
            ) < 0.95  # Completion rate dropped below 95%
          for: 30s
          labels:
            severity: page            # This one actually wakes someone up
            experiment_context: "blast-radius-business"
          annotations:
            summary: "Order completion rate below 95% — chaos experiment exceeding safe blast radius"
            description: |
              Current completion rate: {{ $value | humanizePercentage }}.
              Expected baseline: >99%. This represents real revenue impact.
              Abort the chaos experiment immediately via Litmus dashboard.

---
# Grafana annotation source — pushes chaos events as vertical lines on dashboards
# This makes every chart self-documenting during an experiment review
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-chaos-annotation-datasource
  namespace: monitoring
data:
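  # Query for a Grafana annotation. Grafana does not read this key on its own:
  # paste it into the dashboard's annotation settings or your provisioning files.
  # It marks the moments when the Litmus exporter's verdict metric changes.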
  annotation_query: |
    changes(
      litmuschaos_experiment_verdict{
        chaosresult_verdict=~"Pass|Fail|Stopped"
      }[1m]
    ) > 0
  # Annotation label shown in Grafana UI
  annotation_label: "Chaos Fault Window"
Output
prometheusrule.monitoring.coreos.com/chaos-blast-radius-alerts created
configmap/grafana-chaos-annotation-datasource created
--- Prometheus alert evaluation (kubectl logs prometheus-0 -n monitoring | grep chaos) ---
t=14:23:10 level=info msg="Evaluating rule" group=chaos-experiment-blast-radius rule=ChaosInducedRetryStorm
t=14:23:20 level=info msg="Alert firing" alert=ChaosInducedRetryStorm
source_service=payment-service target_service=checkout-service value=73.4
labels: severity=critical experiment_context=blast-radius-propagation
t=14:23:35 level=info msg="Evaluating rule" rule=DatabaseConnectionPoolExhausted
t=14:23:35 level=info msg="Alert resolved" alert=DatabaseConnectionPoolExhausted
-- Pool dropped back to 81% after circuit breaker opened on checkout-service --
t=14:23:50 level=info msg="Evaluating rule" rule=ChaosBusinessImpactDetected
t=14:23:50 level=info msg="Alert NOT firing" -- order completion rate: 97.3% (above threshold)
--- Summary ---
Blast radius contained at: service mesh layer (checkout ↔ payment)
Database pool self-recovered: YES (circuit breaker functioned as designed)
Business impact: NONE (resilience mechanism worked)
Experiment verdict: PARTIAL PASS — retry storm exceeded threshold but auto-recovered
Pro Tip: Use Exemplars to Link Metric Spikes Directly to Traces
Prometheus exemplars (enabled via --enable-feature=exemplar-storage) let you attach a trace ID to a histogram sample. When your p99 spike shows up on a Grafana panel during an experiment, clicking an exemplar on the spike takes you directly to the Jaeger trace of a request recorded in that window. This cuts blast-radius investigation time from minutes to seconds. Attach the trace ID when you record the observation, for example with ObserveWithExemplar() in the Prometheus Go client or the exemplar= argument to observe() in the Python client; it is a small change that pays dividends in every experiment post-mortem.
Production Insight
Blast radius often spreads through shared resources like database connection pools.
Without traces, you see the metric spike but can't trace the propagation path.
Business-layer metrics (order completion rate) are the only ones leadership acts on.
Key Takeaway
Monitor three layers: infra, service, business.
Correlate fault events with traces using OpenTelemetry baggage.
The metric your CTO cares about is the one most teams don't measure.

Automating Experiment Abort: Building a Self-Healing Chaos Pipeline

The most dangerous moment in a chaos experiment isn't when you inject the fault — it's the 90 seconds between 'something is wrong' and 'someone pushed the abort button.' Manual abort loops depend on an engineer watching a dashboard at exactly the right moment. In production-adjacent environments, that's too slow. You need automated abort conditions wired directly into your chaos tooling.

Litmus Chaos and Gremlin both support this through probe-style health checks. In Litmus, a Continuous probe whose runProperties set stopOnFailure: true stops the experiment and reverts the fault as soon as the probe fails, and the experiment is marked as failed; without that flag, a failed probe only affects the final verdict. But that's only the first line of defense. The second, more robust layer is an external watchdog: a small service that receives Alertmanager webhooks, matches on the experiment_context label you saw in the previous section, and calls the chaos tool's API to abort if a severity:page alert fires.

This two-layer approach handles a subtle failure mode: what if the chaos experiment itself breaks your monitoring stack? If Prometheus can't scrape metrics because the node running it is the one you killed, your probe-based abort won't fire. The external watchdog needs to live on a separate node pool, with its own health check, and ideally in a different availability zone from the experiment's blast radius. Monitoring your monitoring during a chaos experiment sounds paranoid — until the one time it matters.
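
The watchdog's decision logic is worth isolating from its HTTP plumbing so it can be tested on its own. The sketch below is one possible shape, not the service's actual code: it adds a dead-man's-switch check (abort when monitoring has gone silent), which is an assumption layered on top of the alert matching described above. The heartbeat source and the 60-second limit are hypothetical.

```python
def should_abort(alert: dict) -> bool:
    """True for a firing, experiment-tagged alert with severity 'page'."""
    labels = alert.get("labels", {})
    return (
        alert.get("status") == "firing"
        and bool(labels.get("experiment_context"))
        and labels.get("severity") == "page"
    )


def decide(alerts, seconds_since_last_heartbeat, max_silence_seconds=60):
    """Combine alert matching with a dead-man's switch.

    If the monitoring stack has been silent for too long, the experiment may
    have taken it down, so the safe choice is to abort without any alert.
    """
    if seconds_since_last_heartbeat > max_silence_seconds:
        return "abort: monitoring silent"
    if any(should_abort(alert) for alert in alerts):
        return "abort: page alert"
    return "continue"


page = {"status": "firing",
        "labels": {"severity": "page", "experiment_context": "blast-radius-business"}}
critical = {"status": "firing",
            "labels": {"severity": "critical", "experiment_context": "blast-radius-propagation"}}

print(decide([critical, page], seconds_since_last_heartbeat=5))
print(decide([], seconds_since_last_heartbeat=120))
```

The second call is the case the paragraph warns about: no alert arrives because nothing is left to send one, and silence itself becomes the abort condition.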

The abort pipeline should also emit a structured event: experiment name, abort reason, which probe failed, current metric value vs. SSH threshold, and a direct link to the Grafana dashboard snapshot captured at abort time. This snapshot is gold for post-mortems — it captures the exact state of every metric at the moment the system broke, before any auto-healing obscures the evidence.
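
A minimal sketch of such an abort event, using only the standard library. The field names, the event name, and the snapshot URL are illustrative assumptions; the fields mirror the list in the paragraph above.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def build_abort_event(experiment: str, reason: str, failed_probe: str,
                      observed: float, threshold: float, snapshot_url: str,
                      at: Optional[datetime] = None) -> dict:
    """Assemble a JSON-serializable record of why an experiment was aborted."""
    at = at or datetime.now(timezone.utc)
    return {
        "event": "chaos.experiment.abort",  # hypothetical event name
        "experiment": experiment,
        "abort_reason": reason,
        "failed_probe": failed_probe,
        "observed_value": observed,
        "ssh_threshold": threshold,
        # How far past the bound the metric was when the abort fired.
        "breach_margin": round(observed - threshold, 3),
        "snapshot_url": snapshot_url,
        "timestamp": at.isoformat(),
    }


event = build_abort_event(
    experiment="checkout-service-resilience-test",
    reason="p99 latency breached SSH",
    failed_probe="checkout-p99-latency-under-300ms",
    observed=347.1,
    threshold=300.0,
    snapshot_url="https://grafana.example/snapshot/abc",  # placeholder URL
    at=datetime(2026, 3, 6, 14, 23, 51, tzinfo=timezone.utc),
)
print(json.dumps(event, indent=2))
```

Recording the observed value next to the threshold lets a post-mortem reader see the size of the breach without opening a dashboard.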

chaos_watchdog_service.py (Python)
#!/usr/bin/env python3
"""
Chaos Experiment Watchdog Service

Listens for Alertmanager webhook callbacks and automatically aborts
a running Litmus Chaos experiment if a severity:page alert fires
during the experiment window.

Designed to run on a SEPARATE NODE POOL from the experiment blast radius.
Requirements: pip install fastapi uvicorn httpx pydantic
Run with: uvicorn chaos_watchdog_service:watchdog_app --host 0.0.0.0 --port 8090
"""

import logging
import os
from datetime import datetime, timezone
from typing import Optional

import httpx
from fastapi import FastAPI, Request
from pydantic import BaseModel, Field

# ── Configuration ─────────────────────────────────────────────────────────────
LITMUS_API_URL = os.environ["LITMUS_CHAOS_API_URL"]        # e.g. http://litmus.monitoring:9002
LITMUS_API_TOKEN = os.environ["LITMUS_API_TOKEN"]          # Service account token
GRAFANA_API_URL = os.environ["GRAFANA_API_URL"]            # For snapshot capture on abort
GRAFANA_API_TOKEN = os.environ["GRAFANA_API_TOKEN"]

# Dashboard UID that shows all chaos-related panels — captured at abort time
CHAOS_DASHBOARD_UID = os.environ.get("CHAOS_DASHBOARD_UID", "chaos-blast-radius-overview")

# Only abort experiments if alerts carry this label — prevents false triggers
REQUIRED_EXPERIMENT_LABEL = "experiment_context"

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s — %(message)s"
)
logger = logging.getLogger("chaos-watchdog")

# ── Pydantic models matching Alertmanager webhook payload ──────────────────────
class AlertmanagerLabel(BaseModel):
    alertname: str
    severity: Optional[str] = None
    experiment_context: Optional[str] = None   # Set in our PrometheusRule labels
    namespace: Optional[str] = None

class AlertmanagerAnnotation(BaseModel):
    summary: Optional[str] = None
    description: Optional[str] = None

class AlertmanagerAlert(BaseModel):
    status: str                                # 'firing' or 'resolved'
    labels: AlertmanagerLabel
    annotations: AlertmanagerAnnotation
    startsAt: str
    generatorURL: str

class AlertmanagerWebhookPayload(BaseModel):
    version: str = Field(default="4")
    status: str
    alerts: list[AlertmanagerAlert]

# ── FastAPI app ────────────────────────────────────────────────────────────────
watchdog_app = FastAPI(
    title="Chaos Experiment Watchdog",
    description="Auto-aborts chaos experiments when SSH is violated",
    version="1.0.0"
)

@watchdog_app.post("/alertmanager/webhook")
async def handle_alertmanager_webhook(payload: AlertmanagerWebhookPayload, request: Request):
    """
    Alertmanager calls this endpoint when an alert fires or resolves.
    We only act on 'firing' alerts that carry our experiment_context label.
    """
    logger.info(f"Received webhook — status={payload.status}, alert_count={len(payload.alerts)}")

    for alert in payload.alerts:
        # Only act on actively firing alerts
        if alert.status != "firing":
            logger.debug(f"Skipping resolved alert: {alert.labels.alertname}")
            continue

        # Only abort if alert is explicitly tagged as chaos-experiment-related
        experiment_context = alert.labels.experiment_context
        if not experiment_context:
            logger.debug(f"Alert {alert.labels.alertname} has no experiment_context — skipping")
            continue

        # severity:page means business impact — always abort
        if alert.labels.severity == "page":
            logger.warning(
                f"PAGE-SEVERITY alert during chaos experiment! "
                f"Alert={alert.labels.alertname} "
                f"Context={experiment_context}"
            )

            # Step 1: Capture Grafana dashboard snapshot BEFORE aborting
            # This preserves the system state at the moment of breach.
            # Best-effort only: a Grafana outage must never block the abort.
            try:
                snapshot_url = await capture_grafana_snapshot(
                    dashboard_uid=CHAOS_DASHBOARD_UID,
                    annotation=f"AUTO-ABORT: {alert.labels.alertname} at {datetime.now(timezone.utc).isoformat()}"
                )
            except httpx.HTTPError as snapshot_error:
                logger.error(f"Snapshot capture failed, aborting anyway: {snapshot_error}")
                snapshot_url = "unavailable"

            # Step 2: Abort the running chaos experiment via Litmus API
            abort_result = await abort_litmus_experiment(
                experiment_context=experiment_context,
                abort_reason=alert.annotations.summary or alert.labels.alertname
            )

            logger.info(
                f"Experiment aborted successfully. "
                f"Snapshot: {snapshot_url} "
                f"Litmus response: {abort_result}"
            )

    return {"status": "processed", "timestamp": datetime.now(timezone.utc).isoformat()}


async def capture_grafana_snapshot(dashboard_uid: str, annotation: str) -> str:
    """
    Creates a Grafana snapshot of the chaos dashboard at the current moment.
    Returns the public snapshot URL for inclusion in post-mortem reports.
    """
    async with httpx.AsyncClient(timeout=10.0) as grafana_client:
        # First, get the dashboard JSON model
        dashboard_response = await grafana_client.get(
            f"{GRAFANA_API_URL}/api/dashboards/uid/{dashboard_uid}",
            headers={"Authorization": f"Bearer {GRAFANA_API_TOKEN}"}
        )
        dashboard_response.raise_for_status()
        dashboard_model = dashboard_response.json()["dashboard"]

        # Create a snapshot — Grafana stores it and returns a share URL
        snapshot_response = await grafana_client.post(
            f"{GRAFANA_API_URL}/api/snapshots",
            headers={"Authorization": f"Bearer {GRAFANA_API_TOKEN}"},
            json={
                "dashboard": dashboard_model,
                "name": f"Chaos Abort Snapshot — {annotation}",
                "expires": 604800,     # Snapshot expires in 7 days
            }
        )
        snapshot_response.raise_for_status()
        snapshot_url = snapshot_response.json()["url"]
        logger.info(f"Grafana snapshot captured: {snapshot_url}")
        return snapshot_url


async def abort_litmus_experiment(experiment_context: str, abort_reason: str) -> dict:
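    """
    Calls the Litmus Chaos API to stop the running experiment.
    The ChaosEngine name comes from CHAOS_ENGINE_NAME when set, otherwise it
    is derived from the experiment_context label.
    """
    # The experiment_context values in our PrometheusRule ("blast-radius-business",
    # "blast-radius-database", ...) name the impacted layer, not the ChaosEngine.
    # CHAOS_ENGINE_NAME is an assumed environment variable that pins the engine
    # explicitly; the label-derived name is kept only as a fallback.
    chaos_engine_name = os.environ.get(
        "CHAOS_ENGINE_NAME",
        experiment_context.replace("blast-radius-", ""),
    )
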
    async with httpx.AsyncClient(timeout=15.0) as litmus_client:
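        # PATCH spec.engineState to 'stop', which is how a ChaosEngine is halted.
        # The URL below assumes an HTTP gateway that forwards this patch to the
        # ChaosEngine resource; adapt it to however your cluster exposes Litmus.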
        stop_response = await litmus_client.patch(
            f"{LITMUS_API_URL}/api/chaosengine/{chaos_engine_name}",
            headers={
                "Authorization": f"Bearer {LITMUS_API_TOKEN}",
                "Content-Type": "application/json"
            },
            json={
                "spec": {
                    "engineState": "stop"    # Litmus graceful stop — rolls back faults
                },
                "metadata": {
                    "annotations": {
                        # Record WHY this was aborted — visible in Litmus dashboard
                        "chaos.abort.reason": abort_reason,
                        "chaos.abort.timestamp": datetime.now(timezone.utc).isoformat(),
                        "chaos.abort.source": "watchdog-service"
                    }
                }
            }
        )
        stop_response.raise_for_status()
        logger.info(f"Litmus experiment '{chaos_engine_name}' stopped via API")
        return stop_response.json()


@watchdog_app.get("/health")
async def health_check():
    """Liveness probe — the watchdog must be reachable during experiments."""
    return {"status": "healthy", "timestamp": datetime.now(timezone.utc).isoformat()}
Output
INFO chaos-watchdog — Starting Chaos Experiment Watchdog on :8090
INFO chaos-watchdog — 14:23:49 Received webhook — status=firing, alert_count=2
INFO chaos-watchdog — 14:23:49 Alert ChaosInducedRetryStorm has experiment_context=blast-radius-propagation (severity=critical — not page, skipping abort)
WARN chaos-watchdog — 14:23:49 PAGE-SEVERITY alert during chaos experiment!
Alert=ChaosBusinessImpactDetected
Context=blast-radius-business
INFO chaos-watchdog — 14:23:50 Grafana snapshot captured:
https://grafana.internal/dashboard/snapshot/kLmN9pQrXyZ2aB
INFO chaos-watchdog — 14:23:51 Litmus experiment 'business' stopped via API
INFO chaos-watchdog — 14:23:51 Experiment aborted successfully.
Snapshot: https://grafana.internal/dashboard/snapshot/kLmN9pQrXyZ2aB
Litmus response: {"spec": {"engineState": "stop"}, "status": "updated"}
INFO uvicorn — 127.0.0.1:0 - "POST /alertmanager/webhook HTTP/1.1" 200 OK
Interview Gold: Why Two Abort Layers Beat One
Interviewers love this question: 'What happens if your chaos experiment breaks your monitoring stack?' The answer is exactly why you need an external watchdog separate from probe-based abort. Probe-based abort lives inside Litmus — if the cluster it runs in is compromised, the probe can't fire. The external watchdog runs in a separate AZ with its own Prometheus and Alertmanager. If a candidate only describes probe-based abort, they've demonstrated they haven't thought about second-order failure modes — which is the entire point of chaos engineering.
Production Insight
If your experiment kills Prometheus, probe-based abort won't fire.
External watchdog must run in a separate AZ with independent monitoring.
Snapshot dashboards before abort — auto-healing erases evidence quickly.
Key Takeaway
Two abort layers: probe-based (fast) and watchdog (fail-safe).
Snapshot at abort time for post-mortem evidence.
Never skip the external watchdog — it's the safety net when your safety net fails.

Observability Pipelines for Chaos Experiments: From Metrics to Actionable Insights

Running a chaos experiment without a proper observability pipeline is like flying a plane without instruments — you might survive, but you won't learn why. The pipeline needs to collect metrics, traces, and logs from every tier of your stack, annotate them with experiment context, and route them to a dashboard that an engineer can read in real time. Without this, you're guessing.

The key components are: Prometheus for metrics (RED metrics per service: Rate, Errors, Duration), OpenTelemetry for distributed tracing (with baggage propagation to carry experiment IDs), and a structured logging system (JSON logs with experiment_id field). All three need to be timestamp-aligned with the fault injection events. Use the chaos tool's event stream to push annotations into Grafana so every chart shows vertical lines for fault start, end, and abort.
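
As a rough illustration of the structured-logging piece, the sketch below tags every log line with an experiment ID using only the Python standard library. The class name `ExperimentJsonFormatter`, the helper `build_logger`, and the exact field names other than `experiment_id` are assumptions made for this example, not part of any chaos tool.

```python
import json
import logging
from datetime import datetime, timezone


class ExperimentJsonFormatter(logging.Formatter):
    """Render each log record as one JSON object tagged with an experiment ID."""

    def __init__(self, experiment_id: str) -> None:
        super().__init__()
        self.experiment_id = experiment_id  # hypothetical ID, e.g. from the chaos tool

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            # UTC timestamps keep log lines alignable with fault-injection events.
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "experiment_id": self.experiment_id,
        }
        return json.dumps(payload)


def build_logger(name: str, experiment_id: str) -> logging.Logger:
    """Return a logger whose only handler emits experiment-tagged JSON to stderr."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler()
    handler.setFormatter(ExperimentJsonFormatter(experiment_id))
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate lines via the root logger
    return logger
```

With this in place, filtering on `experiment_id` in the log backend separates chaos-related lines from normal traffic.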

A common mistake is only collecting metrics at the service level. You need infrastructure metrics too: node CPU, memory, network I/O, disk latency. Without these, you can't tell whether a latency spike is caused by increased request processing time or by a noisy neighbor on the same node consuming CPU. This distinction matters because the fix is different: one requires scaling the service, the other requires node isolation.

Another overlooked piece is rollback observability: after the fault is removed, metrics should return to baseline within the ramp-down period. If they don't, that indicates lasting damage (leaked connections, corrupted state) that a simple pass/fail verdict will not reveal. Monitor the recovery trajectory: a system that doesn't fully recover is more dangerous than one that fails immediately.
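
One way to make the recovery check concrete is sketched below. It assumes a "higher is worse" metric such as p99 latency, samples taken after the fault is removed, and a 10% tolerance band around baseline; the function names and the tolerance value are illustrative assumptions.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (seconds since fault removal, metric value)


def recovery_time(samples: Sequence[Sample], baseline: float,
                  tolerance: float = 0.10) -> Optional[float]:
    """Return the time from which every later sample stays near baseline.

    Returns None if the metric never settles (or relapses and stays high).
    """
    limit = baseline * (1 + tolerance)
    settled_at: Optional[float] = None
    for t, value in samples:
        if value <= limit:
            if settled_at is None:
                settled_at = t
        else:
            settled_at = None  # a relapse resets the clock
    return settled_at


def recovered_within(samples: Sequence[Sample], baseline: float,
                     ramp_down_s: float, tolerance: float = 0.10) -> bool:
    """True only if the metric settled inside the ramp-down period."""
    t = recovery_time(samples, baseline, tolerance)
    return t is not None and t <= ramp_down_s
```

A trajectory that dips back to baseline and then climbs again returns `None`, which is the leaked-connection pattern described above.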

Production Insight
Post-fault recovery metrics that don't return to baseline indicate permanent damage.
Without infra metrics, you can't distinguish service degradation from noisy neighbor.
JSON structured logs with experiment_id let you filter chaos-related events from normal traffic.
Key Takeaway
Three data types: RED metrics, distributed traces, structured logs.
Annotate all signals with experiment context.
Monitor recovery trajectory — not just pass/fail.

Running Chaos Experiments in Production vs Staging: The Graduation Path

The safest approach is to run your first three iterations of each experiment in a production-mirror environment — identical configuration but synthetic traffic. This lets you validate your SSH, probe configuration, abort pipeline, and blast-radius controls without real user impact. Once you've seen the experiment pass and fail (both outcomes are valuable) in the mirror, graduate to a canary production experiment with a tiny blast radius (e.g., 5% of pods, only in one availability zone).

The graduation criteria are: (1) SSH is calibrated from 7-day baseline metrics, (2) automated abort has been verified to work — both probe-based and external watchdog, (3) on-call team has been briefed and has a Grafana dashboard link, (4) blast radius is constrained to less than 10% of capacity per tier, (5) a rollback plan exists and has been rehearsed. If any of these are missing, you're not ready for production.
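
The five criteria can be encoded as a pre-flight gate. The following is a minimal sketch; the `GraduationChecklist` dataclass and its field names are hypothetical and would be filled in by whatever process a team uses to track readiness.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GraduationChecklist:
    ssh_baseline_days: int            # days of baseline data behind the SSH
    probe_abort_verified: bool        # probe-based abort seen working
    watchdog_abort_verified: bool     # external watchdog abort seen working
    on_call_briefed: bool
    dashboard_link_shared: bool
    blast_radius_pct_per_tier: float  # percent of capacity affected per tier
    rollback_rehearsed: bool


def missing_criteria(c: GraduationChecklist) -> List[str]:
    """Return the unmet graduation criteria; an empty list means ready."""
    gaps: List[str] = []
    if c.ssh_baseline_days < 7:
        gaps.append("SSH not calibrated from a 7-day baseline")
    if not (c.probe_abort_verified and c.watchdog_abort_verified):
        gaps.append("automated abort not verified on both layers")
    if not (c.on_call_briefed and c.dashboard_link_shared):
        gaps.append("on-call not briefed or missing dashboard link")
    if c.blast_radius_pct_per_tier >= 10.0:
        gaps.append("blast radius not below 10% of capacity per tier")
    if not c.rollback_rehearsed:
        gaps.append("rollback plan not rehearsed")
    return gaps
```

Running this gate in the pipeline that launches experiments turns "you're not ready for production" into a hard stop rather than a judgment call.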

A production experiment should also have a safety word: a human override command that any engineer can issue to stop all experiments immediately. This is typically a simple kill switch in the chaos tool's API. Document the command and the process for using it in your on-call runbook. The goal is to make aborting an experiment as easy as starting one.
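
A kill switch for Litmus can be as small as a script that patches every running ChaosEngine to `engineState: stop`, the same state shown in the watchdog's abort response earlier. The sketch below assumes `kubectl` access to the cluster; the engine names and namespaces passed in are placeholders, and the script defaults to a dry run that only builds the commands.

```python
import json
import subprocess
from typing import Iterable, List, Tuple

# Merge patch that asks Litmus to stop the experiment.
STOP_PATCH = json.dumps({"spec": {"engineState": "stop"}})


def stop_command(name: str, namespace: str) -> List[str]:
    """Build the kubectl invocation that stops one ChaosEngine."""
    return [
        "kubectl", "patch", "chaosengine", name,
        "-n", namespace,
        "--type", "merge",
        "-p", STOP_PATCH,
    ]


def stop_all(engines: Iterable[Tuple[str, str]], dry_run: bool = True) -> List[List[str]]:
    """Stop every (name, namespace) pair; with dry_run=True only return the commands."""
    commands = [stop_command(name, ns) for name, ns in engines]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)  # raises if kubectl fails
    return commands
```

Printing the dry-run output into the on-call runbook gives engineers the exact command to paste during an incident.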

Finally, production experiments need a post-mortem SLA: within 24 hours of the experiment, the team should review the metrics, the abort logs, the Grafana snapshots, and decide whether to (a) graduate the experiment to a larger blast radius, (b) fix the issues found and re-run in a mirror, or (c) disable the experiment permanently because the risk outweighs the insight.

Production Insight
Never skip the production-mirror graduation step — real traffic patterns are unpredictable.
Document and rehearse the kill switch before running in production.
Post-mortem within 24 hours — delays lose the context of what happened.
Key Takeaway
Five graduation criteria before production experiments.
Every experiment needs a human-readable abort command.
Post-mortem within 24 hours with snapshot review.
● Production incident · POST-MORTEM · severity: high

The Retry Storm That Killed Our Checkout During a Chaos Experiment

Symptom
p99 latency on checkout endpoint jumped from 45ms to 3.2s. Order completion rate dropped to 72%. The on-call engineer was paged, but the alert labels didn't include 'experiment_context', so they assumed it was a real incident and rolled back a recent deployment that had nothing to do with the fault.
Assumption
SSH threshold of 300ms p99 latency provided enough headroom. The Litmus probe would automatically abort if breached. The blast radius was limited to 50% of pods, so the remaining 50% should handle traffic.
Root cause
The Litmus probe did abort when p99 exceeded 300ms at t+90s, but the damage had already propagated: the 50% affected pods became slow, causing upstream services to retry aggressively. Those retries exhausted the shared database connection pool (Tomcat max=100, 98 connections in use). The remaining healthy pods couldn't process requests because they couldn't get a database connection. The probe only monitored the checkout service endpoint, not the database pool utilization — so the abort came too late.
Fix
1. Add a database connection pool utilization probe to the Litmus experiment (continuous, threshold < 80%).
2. Implement an external watchdog service in a separate AZ that subscribes to Alertmanager webhooks with business-layer alerts.
3. Set Alertmanager routing to suppress primary on-call during planned experiments using a 'chaos-mode' label.
4. Reduce retry count on upstream services to 1 instead of 3 to prevent retry storms from overwhelming degraded dependencies.
Key lesson
  • Monitor downstream dependencies, not just the service under test — the blast radius often spreads through shared resources.
  • Probe-based abort is not enough: external watchdog with business impact detection catches what monitoring blind spots miss.
  • Alertmanager routing must separate experiment alerts from production incidents to avoid false alarms that waste on-call time and cause unnecessary rollbacks.
  • Always pre-brief the on-call team about planned experiment windows and provide a Grafana link to track experiment status.
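To see why cutting retries from 3 to 1 matters, here is a back-of-the-envelope sketch. It uses a deliberately simplified worst-case model (an assumption, not a measurement from this incident): requests that land on degraded pods fail every attempt, and requests on healthy pods succeed first time. The pool check mirrors the 80% probe threshold from the fix list.

```python
def attempt_amplification(degraded_fraction: float, max_retries: int) -> float:
    """Average attempts per original request under the worst-case model.

    Requests hitting degraded pods use 1 + max_retries attempts;
    requests hitting healthy pods use exactly 1.
    """
    if not 0.0 <= degraded_fraction <= 1.0:
        raise ValueError("degraded_fraction must be between 0 and 1")
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    healthy_fraction = 1.0 - degraded_fraction
    return healthy_fraction + degraded_fraction * (1 + max_retries)


def pool_breaches(in_use: int, max_size: int, threshold: float = 0.80) -> bool:
    """True when pool utilization is at or above the abort threshold."""
    if max_size <= 0:
        raise ValueError("max_size must be positive")
    return in_use / max_size >= threshold


# 50% of pods degraded: 3 retries means 2.5x attempts, 1 retry means 1.5x.
print(attempt_amplification(0.5, 3), attempt_amplification(0.5, 1))
# 98 of 100 connections in use is far past an 80% abort threshold.
print(pool_breaches(98, 100))
```

Under this model the same fault generates 2.5 times the normal attempt volume with three retries and 1.5 times with one, which is the difference between a saturated pool and one with headroom.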
Production debug guide
Symptom → Action guide for when a chaos experiment causes unexpected production impact (4 entries)
Symptom · 01
p99 latency spikes above SSH threshold but experiment doesn't abort
Fix
Check Litmus probe configuration: ensure probe mode is 'Continuous' not 'OnChaos'. Verify Prometheus endpoint connectivity. Check if Prometheus itself is overwhelmed: if the node running Prometheus is in the blast radius, probes will fail silently.
Symptom · 02
Error rate on a dependency service spikes after fault injection
Fix
Use Jaeger to find traces that cross the dependency boundary during the fault window. Look for retry storms: rate of grpc_client_handled_total with codes Unavailable/DeadlineExceeded. Then check circuit breaker state on the caller service.
Symptom · 03
Experiment passes SSH but business metric (e.g., order completion) drops
Fix
Your SSH is wrong — it's monitoring the wrong output metric. Add a business-layer metric to the experiment probe. Example: rate of orders_completed / rate of orders_initiated. Redefine SSH to include this ratio.
Symptom · 04
Alertmanager pages the on-call during a planned experiment
Fix
Check that your experiment_context label is set on all PrometheusRules and that Alertmanager routing separates that label to a 'chaos-experiments' receiver. Also send a calendar block with experiment time and a link to the Litmus dashboard before starting.
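The business-layer ratio from Symptom 03 can be folded into the steady-state check along these lines. This is a sketch: the 300ms latency limit comes from the incident above, while the 95% minimum completion ratio and the function names are assumptions chosen for illustration.

```python
from typing import Optional


def completion_ratio(completed_per_s: float, initiated_per_s: float) -> Optional[float]:
    """orders_completed rate divided by orders_initiated rate; None with no traffic."""
    if initiated_per_s <= 0:
        return None
    return completed_per_s / initiated_per_s


def ssh_holds(p99_ms: float, completed_per_s: float, initiated_per_s: float,
              p99_limit_ms: float = 300.0, min_ratio: float = 0.95) -> bool:
    """Steady state requires both the latency and the business metric to hold."""
    ratio = completion_ratio(completed_per_s, initiated_per_s)
    if ratio is None:
        return False  # no traffic means the hypothesis cannot be verified
    return p99_ms < p99_limit_ms and ratio >= min_ratio
```

With a 72% completion rate, `ssh_holds` returns `False` even when p99 latency looks healthy, which is exactly the case a latency-only SSH misses.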
★ Chaos Experiment Debugging Cheat Sheet
Five most common failure patterns during chaos experiments: diagnose and fix in under 60 seconds.
Probe fails immediately after fault injection
Immediate action
Check if the fault targets the same pod as the probe endpoint. If the probe URL points to the same service under test and the fault kills all pods, no service is available to respond.
Commands
kubectl get chaosengine <name> -n <ns> -o jsonpath='{.status.experimentStatuses[*].probeStatuses}'
kubectl logs chaos-runner -n <ns> --tail=50 | grep 'Probe'
Fix now
Set the probe to run against a different instance of the service or use an external endpoint that is not affected by the fault.
Blast radius exceeds configured percentage
Immediate action
Verify namespace scoping and label selectors. The fault may be using a loose app label that matches unintended pods.
Commands
kubectl get pods -l <applabel> -n <ns> --show-labels
kubectl describe chaosengine <name> -n <ns> | grep -A5 'Scope'
Fix now
Tighten the label selector. Add namespace scoping. Use PODS_AFFECTED_PERC and validate with a dry-run first.
Experiment doesn't abort even though metrics are bad
Immediate action
Check if the probe is in 'Continuous' mode and stopOnFailure is set in the probe's runProperties. If the probe is in 'Edge' mode, it only checks before and after the fault, not during.
Commands
kubectl get chaosengine <name> -n <ns> -o yaml | grep -E 'mode:|stopOnFailure'
curl <watchdog-url>/health
Fix now
Update ChaosEngine spec: set 'mode: Continuous' on probes and 'stopOnFailure: true' under each probe's runProperties.
Grafana dashboard doesn't show fault injection window
Immediate action
Check if chaos_exporter is running and if Prometheus is scraping its metrics. Without litmuschaos metrics, annotations won't appear.
Commands
kubectl get pods -n <ns> | grep chaos-exporter
curl -s 'http://prometheus.monitoring:9090/api/v1/query?query=litmuschaos_experiment_verdict' | jq .
Fix now
Deploy chaos-exporter and add a prometheus scrape config for it. Add a Grafana annotation query using changes(litmuschaos_experiment_verdict[1m]) > 0.
Watchdog service doesn't abort experiment when business impact detected
Immediate action
Check Alertmanager receiver configuration: the webhook must be configured to send alerts with severity=page to the watchdog URL. Also verify the watchdog can reach the Litmus API.
Commands
curl <watchdog-url>/health
curl <litmus-api>/api/chaosengine/<name> -H 'Authorization: Bearer <token>'
Fix now
Ensure the watchdog service account has network access to Litmus API. Add a test alert in Alertmanager to verify the webhook handler works.
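A test alert for the webhook handler can be built by hand. The sketch below constructs a trimmed-down payload in the shape Alertmanager sends to webhook receivers and posts it to the `/alertmanager/webhook` path seen in the watchdog log above. The `abort_candidates` helper mirrors the decision visible in that log (abort only on `severity=page` alerts carrying an `experiment_context` label); it is an assumption about the watchdog's logic, and the receiver name is a placeholder.

```python
import json
import urllib.request
from datetime import datetime, timezone
from typing import Dict, List


def build_test_payload(alert_labels: List[Dict[str, str]]) -> dict:
    """Build a minimal Alertmanager-style webhook body with one alert per label set."""
    now = datetime.now(timezone.utc).isoformat()
    return {
        "version": "4",
        "status": "firing",
        "receiver": "chaos-watchdog",  # placeholder receiver name
        "alerts": [
            {"status": "firing", "labels": labels, "annotations": {}, "startsAt": now}
            for labels in alert_labels
        ],
    }


def abort_candidates(payload: dict) -> List[str]:
    """Names of firing alerts that are page-severity and tied to an experiment."""
    names: List[str] = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        if (alert.get("status") == "firing"
                and labels.get("severity") == "page"
                and labels.get("experiment_context")):
            names.append(labels.get("alertname", "unknown"))
    return names


def post_payload(watchdog_url: str, payload: dict, timeout_s: float = 5.0) -> int:
    """POST the payload to the watchdog webhook and return the HTTP status code."""
    request = urllib.request.Request(
        watchdog_url.rstrip("/") + "/alertmanager/webhook",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return response.status
```

Posting a payload with one `critical` and one `page` alert should reproduce the log sequence shown earlier: the critical alert is skipped and the page alert triggers the abort path. Run it against a watchdog pointed at a throwaway experiment, since a working handler will actually stop the engine.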
Chaos Engineering Tools Comparison
| Aspect | Litmus Chaos (CNCF) | Gremlin (SaaS) |
| --- | --- | --- |
| Deployment model | Self-hosted Kubernetes operator | SaaS with agent on your infra |
| SSH enforcement | Built-in Continuous/Edge probes with Prometheus integration | Attack halt conditions via Gremlin API, less native PromQL |
| Blast radius control | PODS_AFFECTED_PERC, namespace scoping, label selectors | Target sets with tag-based filters, AZ pinning in UI |
| Abort mechanism | Probe failure → automatic rollback, PATCH API for external abort | Attack stop via API or UI, webhooks for external triggers |
| Observability integration | Prometheus, Grafana annotation events via chaos_exporter | Pre-built integrations with Datadog, PagerDuty, Slack |
| Cost | Free (open-source), pay for LitmusChaos SaaS (Harness) | Paid per user/node; can be significant at scale |
| Experiment GitOps | YAML ChaosEngine manifests, version controlled natively | Scenario templates via API/Terraform, less kubectl-native |
| Learning curve | Higher; requires Kubernetes fluency | Lower; GUI-driven, good for teams new to chaos |
| Post-mortem artifacts | Manual; you build the snapshot pipeline (as above) | Built-in reports with metric overlays and timeline view |

Key takeaways

1
A steady-state hypothesis must be a measurable output metric (p99 latency, error rate), not a system health assertion. 'The pod stays up' is not an SSH. 'Checkout completes within 300ms for 99% of requests' is.
2
Blast radius monitoring requires all three layers: infrastructure (CPU/network), service (error rates per dependency), and business (order completion rate). Alerts without the business layer miss the only metric leadership acts on.
3
Probe-based abort inside Litmus is your first defense. An external watchdog running in a separate AZ and subscribing to Alertmanager webhooks is your second, critical for the case where the chaos experiment impacts your monitoring infrastructure itself.
4
Grafana snapshots captured at abort time are non-negotiable for post-mortems. Auto-healing is fast enough that metrics recover before an engineer can open a browser, so the snapshot is the only permanent record of what the system looked like at the moment of failure.
5
Graduate experiments from production-mirror to canary to full production. Only run in production when SSH is calibrated, abort pipeline verified, blast radius constrained, and rollback plan rehearsed.

Common mistakes to avoid

5 patterns
×

Running chaos experiments directly in production without staging mirror

Symptom
Real users are impacted, experiment can't be safely aborted because the blast radius is undefined.
Fix
Always run the first three iterations of an experiment in a production-mirror namespace with synthetic traffic. Use Litmus's appns field and namespace scoping to hard-constrain blast radius. Only graduate after observing both pass and fail outcomes and verifying abort pipeline.
×

Setting SSH thresholds without measuring real baseline

Symptom
Every experiment 'passes' even when the system is visibly degraded, or every experiment 'fails' within 10 seconds due to normal traffic variance.
Fix
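Run max_over_time(histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))[7d:5m]) in Prometheus to get the peak p99 over a full week, including peak hours. (Putting [7d] directly inside rate() averages the whole week into one value and hides the peaks.) Set SSH threshold at 1.5x the observed peak p99. Add a for: 30s window to filter transient blips.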
×

Forgetting to label chaos-related Prometheus alerts with experiment_context

Symptom
On-call engineer gets paged during a planned chaos experiment, treats it as a real incident, rolls back a separate deployment as root cause, and the chaos experiment never gets properly analyzed.
Fix
Add a static experiment_context label to every PrometheusRule alert that could fire during a chaos window. Configure Alertmanager to route alerts with this label to a dedicated 'chaos-experiments' receiver instead of primary on-call rotation. Brief on-call team before every experiment with calendar block and Litmus dashboard link.
×

Only monitoring the service under test, ignoring downstream dependencies

Symptom
Fault injection causes retry storms that exhaust database connection pools, but the experiment passes because the service under test's metrics look fine during the fault window.
Fix
Add Prometheus probes that monitor downstream dependency metrics: database pool utilization (db_pool_connections_in_use/max), upstream retry rates (grpc_client_handled_total with error codes), and circuit breaker state. Set abort thresholds on these as well.
×

Relying solely on probe-based abort without an external watchdog

Symptom
A chaos experiment kills the node running Prometheus, probes can't fire, and the experiment continues unchecked until manual intervention.
Fix
Deploy an external watchdog service in a separate AZ that subscribes to Alertmanager webhooks with severity=page alerts. Ensure the watchdog has its own Prometheus and Alertmanager independent of the experiment's blast radius. Test the watchdog by simulating a page alert before every experiment.
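The baseline-calibration fix above (1.5x the observed peak p99, plus a `for: 30s` window) reduces to a few lines of arithmetic. The sketch below is illustrative: the sample values are made up, and `sustained_breach` only approximates how an alerting rule's `for:` clause behaves at a fixed evaluation interval.

```python
import math
from typing import Sequence


def calibrate_ssh_threshold(p99_samples_ms: Sequence[float], headroom: float = 1.5) -> float:
    """SSH threshold = headroom x the peak p99 seen during the baseline week."""
    if not p99_samples_ms:
        raise ValueError("need at least one baseline sample")
    return max(p99_samples_ms) * headroom


def sustained_breach(samples_ms: Sequence[float], threshold_ms: float,
                     interval_s: float, for_s: float = 30.0) -> bool:
    """True if the threshold is exceeded on consecutive evaluations spanning for_s.

    Spanning for_s seconds at a fixed interval takes ceil(for_s / interval_s) + 1
    consecutive breaching samples; any healthy sample resets the run.
    """
    needed = math.ceil(for_s / interval_s) + 1
    run = 0
    for value in samples_ms:
        run = run + 1 if value > threshold_ms else 0
        if run >= needed:
            return True
    return False


# Hypothetical week of hourly peak p99 readings, in milliseconds.
threshold = calibrate_ssh_threshold([38.0, 45.0, 61.0, 52.0])
print(threshold)  # 91.5
```

A single spike above the threshold does not trip `sustained_breach`; four consecutive breaching samples at a 10-second interval do.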
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
Walk me through how you'd design the observability stack for a chaos exp...
Q02 · SENIOR
Your chaos experiment kills 50% of the pods in service A. Metrics for se...
Q03 · SENIOR
A candidate says 'We use probe-based abort in Litmus, so we're covered.'...
Q01 of 03 · SENIOR

Walk me through how you'd design the observability stack for a chaos experiment targeting a payment service — specifically, what's your steady-state hypothesis, which metrics would you monitor, and how would you determine if the experiment should be automatically aborted?

ANSWER
Start by defining the SSH as a measurable output metric: 'p99 latency of the /charge endpoint stays below 500ms' plus 'payment success rate > 99.5%'. Both must be observable from Prometheus. I'd monitor three layers: infra (node CPU, database connection pool), service (error rates per downstream dependency, circuit breaker state), and business (payment completion rate, chargeback rate). Automatic abort comes from two places: Litmus Continuous probes on the SSH metrics, plus an external watchdog that subscribes to Alertmanager for business-layer alerts (e.g., if payment completion rate drops below 95%). The watchdog runs in a separate AZ with its own monitoring stack so it can fire even if the experiment kills Prometheus.
FAQ · 4 QUESTIONS

Frequently Asked Questions

01
What is the difference between chaos engineering and load testing?
02
How do I know if my chaos experiment is too risky to run in production?
03
What's the minimal observability setup needed before starting chaos engineering?
04
How do you handle a situation where the chaos experiment accidentally affects monitoring infrastructure?