Advanced 8 min · March 06, 2026

Chaos Engineering Basics

Chaos Engineering — Why Probe Abort Missed Our Retry Storm

Q: What is the difference between chaos engineering and load testing?

Load testing validates that your system handles expected traffic volume — it's about quantity. Chaos engineering validates that your system handles unexpected failures — it's about resilience. You can pass a load test and still have a catastrophic outage when a single availability zone goes down. The two are complementary: run load tests to establish baseline capacity, then run chaos experiments with load active to simulate real failure conditions under realistic traffic.

Q: How do I know if my chaos experiment is too risky to run in production?

Ask three questions: Do you have a defined SSH with automated abort? Have you run this experiment in staging and seen both a pass and a controlled fail? Does your blast-radius control limit the experiment to less than 50% of capacity in any single tier? If any answer is 'no,' the experiment isn't ready for production. The steady-state hypothesis and automated abort are non-negotiable safety gates — running without them isn't chaos engineering, it's just breaking things.

Q: What's the minimal observability setup needed before starting chaos engineering?

At minimum you need: Prometheus scraping your services with RED metrics (Rate, Errors, Duration), Grafana dashboards showing those metrics with at least 30 days of history (for SSH baseline calibration), and Alertmanager configured with at least one working receiver. Without these three, you can't define a steady-state hypothesis, can't observe blast radius propagation in real time, and can't trigger automated abort. Distributed tracing (Jaeger/Tempo) is strongly recommended but can be added incrementally — start with metrics-only chaos experiments and layer traces in as your practice matures.

Q: How do you handle a situation where the chaos experiment accidentally affects monitoring infrastructure?

This is exactly why you need the two-layer abort strategy. The Litmus probes will fail if monitoring goes down, but the external watchdog running in a separate AZ with its own Prometheus continues to work. The watchdog detects business-layer alerts (like order completion rate drop) and calls the Litmus API to abort. Additionally, you should never include monitoring infrastructure in the blast radius — set namespace scoping to exclude the monitoring namespace and avoid selecting monitoring pods with label selectors. Pre-brief the on-call team to manually abort via the kill switch if they see unexplained monitoring outages during an experiment window.

A 50% pod fault triggered retry storms exhausting 98/100 DB connections.

Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Lessons pulled from things that broke in production.

✓ Production tested · last updated July 18, 2026

Before you start ⏱ 30 min

✓ Production DevOps experience
✓ Deep understanding of the tool's internals
✓ Experience debugging distributed systems

⚡ Quick Answer

Chaos engineering deliberately injects failures to validate system resilience
Steady-state hypothesis (SSH) is a measurable output metric — not a health assertion
Blast radius monitoring requires three layers: infrastructure, service, and business
Automated abort via probes is first line; external watchdog in separate AZ is second
Without traces correlated to fault events, 40% of experiment time is wasted

✦ Definition · ~90s read

What Is Chaos Engineering?

Chaos Engineering is the disciplined practice of running controlled, hypothesis-driven experiments on a distributed system to uncover weaknesses before they cause customer-facing outages. It’s not about randomly breaking things — it’s about proactively validating that your system’s steady-state (normal behavior) holds under failure conditions.

You define a steady-state hypothesis using metrics like latency, error rates, and throughput, then inject a failure (e.g., kill a pod, saturate network, expire TLS certs) and measure whether the system deviates beyond acceptable bounds. The goal is to surface hidden dependencies, retry storms, or cascading failures that monitoring alone won’t catch until it’s too late.

In practice, Chaos Engineering sits between traditional testing (unit/integration) and full-blown production monitoring. Tools like Chaos Mesh, Litmus, and Gremlin automate experiment execution, blast radius control, and abort conditions. The key insight: you don’t run chaos experiments without a blast radius monitor that tracks how far the impact spreads (e.g., from one service to its downstream dependencies) and an automated abort that kills the experiment if the blast radius exceeds your safety threshold.

This is what separates chaos engineering from reckless testing — you build a self-healing pipeline that stops the experiment before it causes a real incident.

When not to use it: if your system isn’t instrumented with metrics, traces, and logs that can define a steady-state, you’re not ready. Start in staging with synthetic traffic, then graduate to production using a canary approach (e.g., 1% of traffic, short duration, strict abort conditions).

The maturity model: manual experiments → automated pipelines → continuous verification in CI/CD. Companies like Netflix, Amazon, and Microsoft run chaos experiments in production as a routine practice, but they've invested years in observability and blast radius controls.

For most teams, starting with a weekly 5-minute experiment in staging is the right first step.

Plain-English First

Imagine a fire drill at school. Nobody waits for a real fire to find out if the exits work — they set off the alarm on purpose to test the plan. Chaos engineering is that fire drill for your software. You intentionally break things in a controlled way to discover weaknesses before real users ever feel them. The monitoring part is the teacher standing by the exit with a clipboard, writing down exactly how long it took everyone to get out safely.

Every system fails eventually. The brutal truth most engineering teams learn too late is that the failure modes they never tested are always the ones that page them at 3 a.m. on a Friday. Netflix coined the term 'chaos engineering' after their migration to AWS exposed a hard reality: distributed systems fail in ways that are impossible to predict by reading code alone. You have to induce failure — deliberately, scientifically — to build genuine confidence in your system's resilience. Monitoring is what separates chaos engineering from plain vandalism: without deep observability, you're just breaking things and hoping for the best.

The problem chaos engineering solves is the gap between 'we think our system handles this' and 'we have evidence our system handles this.' Runbooks, architecture diagrams, and code reviews are all opinions. A chaos experiment with rigorous monitoring attached is a proof. When a database node disappears, does your read replica take over within your SLA? When a downstream service starts returning 500s, does your circuit breaker actually open, and does it show up in your dashboards before a customer tweets about it? These aren't hypothetical questions — they're experiments with measurable outcomes.

By the end of this article you'll be able to design a complete chaos experiment with a defined steady-state hypothesis, wire up the observability stack needed to validate it, interpret blast-radius telemetry in real time, and avoid the production mistakes that turn a controlled experiment into an uncontrolled incident. We'll use real tooling — Chaos Monkey, Litmus Chaos, Prometheus, and Grafana — with fully runnable configurations and the internal mechanics explained at every step.

Why Chaos Engineering Is Not About Breaking Things

Chaos engineering is the disciplined practice of injecting controlled failures into a production system to uncover weaknesses before they cause user-facing outages. The core mechanic is hypothesis-driven experimentation: you define a steady state (e.g., p99 latency < 200ms), introduce a fault (e.g., kill a pod, drop 30% of packets), and measure whether the system deviates from that steady state. It's not random destruction — it's a scientific method for resilience.

In practice, chaos experiments run in short, isolated windows with blast radius controls. Key properties: automated rollback on metric breach, gradual fault injection (e.g., start with 1% traffic loss), and observability hooks that capture every state change. The goal is to validate that retry logic, circuit breakers, and timeouts actually work under real pressure — not just in unit tests.
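The loop described above (check the steady state, ramp the fault gradually, roll back on a metric breach) can be sketched in a few lines. This is a minimal illustration, not how any particular chaos tool implements it: `inject`, `rollback`, and `read_p99_ms` are hypothetical callbacks you would wire to your own tooling, and the toy system at the bottom is invented purely to exercise the loop.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ExperimentResult:
    verdict: str            # "PASS" or "ABORTED"
    reached_percent: int    # last fault intensity that was applied
    observed_p99_ms: float  # metric value at the last check


def run_gradual_experiment(
    steps_percent: List[int],
    inject: Callable[[int], None],
    rollback: Callable[[], None],
    read_p99_ms: Callable[[], float],
    threshold_ms: float,
) -> ExperimentResult:
    """Ramp the fault step by step; stop as soon as the steady state is breached."""
    # Refuse to start if the system is already outside its steady state.
    baseline = read_p99_ms()
    if baseline > threshold_ms:
        return ExperimentResult("ABORTED", 0, baseline)

    reached, observed = 0, baseline
    try:
        for percent in steps_percent:
            inject(percent)
            reached = percent
            observed = read_p99_ms()
            if observed > threshold_ms:
                return ExperimentResult("ABORTED", reached, observed)
        return ExperimentResult("PASS", reached, observed)
    finally:
        # Rollback runs on every exit path once injection may have started.
        rollback()


# Toy system (hypothetical): p99 grows linearly with the injected fault percentage.
state = {"fault": 0, "rolled_back": False}


def inject(percent: int) -> None:
    state["fault"] = percent


def rollback() -> None:
    state["fault"] = 0
    state["rolled_back"] = True


def read_p99_ms() -> float:
    return 50.0 + 10.0 * state["fault"]


result = run_gradual_experiment([1, 5, 10, 30], inject, rollback, read_p99_ms, threshold_ms=200.0)
print(result)
```

Putting the rollback in a `finally` block is the design choice that matters here: the fault is removed whether the experiment passes, aborts, or raises an unexpected exception.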

Use chaos engineering when your system has dependencies (databases, caches, third-party APIs) or when you ship frequently. It matters because the worst outages come from cascading failures — a single timeout in one service triggers a retry storm that takes down the entire fleet. Probes that miss these storms are exactly why you run experiments, not postmortems.

⚠ Chaos ≠ Testing

Chaos engineering validates hypotheses about system behavior under stress; it is not a substitute for unit, integration, or load testing.

📊 Production Insight

A payment service ran a chaos experiment that killed one of three Redis replicas. The probe checked only the primary endpoint, so it reported 'healthy' while the retry logic in the app kept hammering the dead replica, causing a 12-second latency spike for 40% of transactions.

Symptom: p99 latency jumped from 50ms to 12s, but the health endpoint returned 200 OK.

Rule of thumb: Your probe must exercise the same code path as user traffic — if it doesn't trigger retries, it's lying to you.
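To see how much load a single-shot probe can miss, it helps to work out the retry amplification. The sketch below assumes each attempt fails independently with the same probability and that the client retries a fixed number of times; real retry policies with backoff and jitter behave differently, so treat the numbers as an illustration only.

```python
def expected_attempts(failure_probability: float, max_retries: int) -> float:
    """Expected backend calls per user request.

    Attempt k+1 only happens if the first k attempts all failed, so the
    expectation is 1 + p + p^2 + ... + p^max_retries.
    """
    if not 0.0 <= failure_probability <= 1.0:
        raise ValueError("failure_probability must be between 0 and 1")
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    return sum(failure_probability ** k for k in range(max_retries + 1))


healthy = expected_attempts(0.01, 3)   # almost no extra load
degraded = expected_attempts(0.5, 3)   # 1 + 0.5 + 0.25 + 0.125
layered = degraded ** 2                # two stacked layers that each retry 3 times

print(f"healthy: {healthy:.4f}x, degraded: {degraded:.3f}x, two layers: {layered:.3f}x")
```

A probe that calls the endpoint once sees the 1x picture. User traffic that goes through the retrying client sees the amplified one, which is why the probe has to travel the same code path.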

🎯 Key Takeaway

Chaos engineering is hypothesis-driven, not random destruction.

Always define a steady-state metric before injecting any fault.

A probe that doesn't trigger real code paths will miss cascading failures.

thecodeforge.io

Chaos Engineering Basics

Steady-State Hypothesis: The Contract Your Monitoring Must Enforce

Before you inject a single failure, you need a written, measurable definition of 'normal.' This is called the steady-state hypothesis (SSH), and it's the foundation that separates chaos engineering from random testing. Without it, your monitoring has nothing to compare against, and you can't tell whether a blip in your metrics is caused by your experiment or just Tuesday afternoon traffic.

A good SSH has three properties: it is measurable (a concrete metric, not a vague description), it is bounded (a specific threshold like p99 latency < 200ms, not 'fast enough'), and it is observable from your existing monitoring stack without manual inspection. Think of it as a contract — the experiment's job is to stress-test the system, and monitoring's job is to flag the moment that contract is breached.

The SSH drives everything downstream: which metrics you scrape, which alert thresholds you set, how long the experiment runs, and when you abort. Teams that skip this step end up running experiments they can't evaluate: they see metrics move, panic, roll back, and learn nothing. A Prometheus alerting rule that encodes your SSH turns the hypothesis into an automated referee that fires the moment the blast radius exceeds acceptable bounds.

Note the subtle but critical point: the SSH is about the system's outputs (latency, error rate, throughput), not about the fault you're injecting. You're not asserting 'the database will stay up.' You're asserting 'checkout will complete within 300ms for 99% of requests.' That distinction matters — it's entirely possible the database fails and checkout still hits your SLA via a cache layer. Monitoring that, not the database health itself, is the real experiment.
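One way to keep the hypothesis honest is to write it down as data before touching any tooling. The sketch below is a hypothetical representation, not a Litmus or Prometheus type: it only captures the three properties above (a named output metric, a comparator, a bounded threshold) and checks an observed value against them.

```python
import operator
from dataclasses import dataclass

# Supported comparators, mapped to their Python implementations.
_COMPARATORS = {
    "<=": operator.le,
    "<": operator.lt,
    ">=": operator.ge,
    ">": operator.gt,
}


@dataclass(frozen=True)
class SteadyStateHypothesis:
    metric: str       # an OUTPUT of the system, e.g. checkout p99 latency
    comparator: str   # one of the keys in _COMPARATORS
    threshold: float  # the bound that makes the hypothesis falsifiable
    unit: str

    def holds(self, observed: float) -> bool:
        """Return True while the observed value satisfies the contract."""
        return _COMPARATORS[self.comparator](observed, self.threshold)


# "Checkout completes within 300ms for 99% of requests"
checkout_ssh = SteadyStateHypothesis("checkout_p99_latency", "<=", 300.0, "ms")

within = checkout_ssh.holds(241.7)    # approaching the limit, still inside the contract
breached = checkout_ssh.holds(347.1)  # contract violated
print(within, breached)
```

Notice that nothing in the object mentions the database, the pods, or the fault. That is the point: the hypothesis stays valid no matter which failure you inject.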

steady_state_hypothesis.yaml (YAML)

# Litmus Chaos ChaosEngine manifest with embedded steady-state validation
# This defines both the fault injection AND the hypothesis monitoring together
# Run with: kubectl apply -f steady_state_hypothesis.yaml

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: checkout-service-resilience-test
  namespace: production-mirror          # NEVER run first experiments in live prod
spec:
  # The application under test — Litmus uses this to scope blast radius
  appinfo:
    appns: production-mirror
    applabel: "app=checkout-service"
    appkind: deployment

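  engineState: active                  # Set to "stop" to halt a running experiment manually
  chaosServiceAccount: pod-cpu-hog-sa  # Assumed name: must match the RBAC created for this experiment
  jobCleanUpPolicy: retain             # Keep job logs for post-mortem analysis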

  # Steady-state hypothesis: these probes ARE your monitoring assertions
  # Litmus evaluates them before injection (baseline), during, and after
  experiments:
    - name: pod-cpu-hog                # Fault: saturate CPU on checkout pods
      spec:
        probe:
          # Probe 1: HTTP probe — does the service still respond?
          - name: checkout-endpoint-alive
            type: httpProbe
            httpProbe/inputs:
              url: "http://checkout-service.production-mirror.svc.cluster.local/health"
              insecureSkipVerify: false
              method:
                get:
                  criteria: ==         # Exact HTTP status code match
                  responseCode: "200"
            mode: Continuous           # Keep checking THROUGHOUT the experiment
            runProperties:
              probeTimeout: 5          # Seconds before probe is marked failed
              interval: 10             # Check every 10 seconds
              retry: 2                 # Allow 2 transient failures before flagging
              probePollingInterval: 2
              stopOnFailure: true      # Stop the experiment when this probe fails

          # Probe 2: Prometheus probe — p99 latency is the REAL hypothesis
          # If CPU is pegged but latency stays under 300ms, the system is resilient
          - name: checkout-p99-latency-under-300ms
            type: promProbe
            promProbe/inputs:
              # PromQL query evaluating our SSH threshold
              # histogram_quantile computes p99 from Prometheus histogram buckets
              endpoint: "http://prometheus.monitoring.svc.cluster.local:9090"
              query: |
                histogram_quantile(
                  0.99,
                  sum by (le) (
                    rate(
                      http_request_duration_seconds_bucket{
                        service="checkout-service",
                        route="/api/checkout"
                      }[2m]
                    )
                  )
                ) * 1000
              # The experiment FAILS if p99 latency exceeds 300ms at any Continuous check
              comparator:
                criteria: "<="
                type: float
                value: "300"           # Milliseconds — our SSH threshold
            mode: Continuous
            runProperties:
              probeTimeout: 10
              interval: 15
              retry: 1                 # Only 1 retry: latency spikes matter
              stopOnFailure: true      # Without this, a failed probe only affects the verdict

        components:
          env:
            # Fault parameters — scope and duration of CPU stress
            - name: CPU_CORES
              value: "2"               # Hog 2 cores per pod
            - name: CPU_LOAD
              value: "90"              # 90% utilization on those cores
            - name: TOTAL_CHAOS_DURATION
              value: "120"             # Run fault for 120 seconds
            - name: PODS_AFFECTED_PERC
              value: "50"              # Blast radius: only 50% of pods affected
            - name: RAMP_TIME
              value: "10"              # 10s stabilization before fault starts

Output

chaosengine.litmuschaos.io/checkout-service-resilience-test created

--- Litmus Chaos Runner Output (kubectl logs -n production-mirror chaos-runner) ---
INFO[0000] Steady State Check - PRE-CHAOS
INFO[0002] Probe: checkout-endpoint-alive → PASS (HTTP 200)
INFO[0004] Probe: checkout-p99-latency-under-300ms → PASS (p99: 47.3ms)
INFO[0010] Steady state established. Injecting fault: pod-cpu-hog
INFO[0020] Fault active on pods: [checkout-7d9f4-xkp2n, checkout-7d9f4-m8qvl]
INFO[0030] Probe: checkout-endpoint-alive → PASS (HTTP 200)
INFO[0030] Probe: checkout-p99-latency-under-300ms → PASS (p99: 189.2ms) ← latency rising
INFO[0060] Probe: checkout-p99-latency-under-300ms → PASS (p99: 241.7ms) ← approaching limit
INFO[0090] Probe: checkout-p99-latency-under-300ms → FAIL (p99: 347.1ms > 300ms threshold)
WARN[0090] SSH VIOLATED - initiating experiment abort
INFO[0092] Fault rolled back. Pods restored.
INFO[0095] Steady State Check - POST-CHAOS
INFO[0097] Probe: checkout-p99-latency-under-300ms → PASS (p99: 52.1ms)

EXPERIMENT VERDICT: FAIL
Reason: p99 latency breached 300ms SLA under 90% CPU saturation on 50% of pods
Recommendation: Investigate CPU throttling limits and horizontal pod autoscaler lag

⚠ Watch Out: Calibrate Your SSH From Pre-Experiment Data, Not a Best Guess

Teams often set SSH thresholds by intuition ('300ms feels right') rather than by measuring actual baseline p99 over 7 days of production traffic. Run histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[7d]))) in Prometheus first. If your real p99 is already 250ms on a quiet day, a 300ms SSH leaves only 50ms of headroom, and normal traffic variance alone can fail the experiment. Calibrate from data, not feelings.

📊 Production Insight

SSH thresholds set too tight trigger false aborts, and experiments never complete.

Set SSH to 1.5x the 7-day observed peak p99, not an aspirational value.

In the matching Prometheus alerting rule, add a 'for: 30s' window to filter transient blips caused by normal traffic variance.
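Both rules of thumb are easy to sanity-check offline before you commit them to a manifest. The sketch below uses made-up daily peak values and made-up sample series. `breached_for` is a simplified approximation of how a 'for' window behaves (the condition must hold on consecutive samples spanning the whole window), not a reimplementation of Prometheus.

```python
import math
from typing import Sequence


def calibrate_threshold_ms(daily_peak_p99_ms: Sequence[float], headroom: float = 1.5) -> float:
    """Threshold = headroom x the highest daily peak p99 seen in the baseline window."""
    if not daily_peak_p99_ms:
        raise ValueError("need at least one baseline sample")
    return headroom * max(daily_peak_p99_ms)


def breached_for(
    samples_ms: Sequence[float],
    threshold_ms: float,
    sample_interval_s: int,
    for_seconds: int,
) -> bool:
    """True if the threshold is exceeded on consecutive samples spanning for_seconds."""
    # Samples at t, t+interval, ... must cover the full window, hence the +1.
    needed = math.ceil(for_seconds / sample_interval_s) + 1
    run = 0
    for value in samples_ms:
        run = run + 1 if value > threshold_ms else 0
        if run >= needed:
            return True
    return False


# Hypothetical peak p99 per day over one week, in milliseconds.
week = [180.0, 176.5, 191.2, 188.0, 200.0, 172.3, 185.9]
threshold = calibrate_threshold_ms(week)

# Hypothetical p99 readings taken every 10 seconds during an experiment.
blip = [120.0, 310.0, 130.0, 125.0, 140.0, 135.0]
sustained = [120.0, 310.0, 320.0, 335.0, 350.0, 140.0]

blip_aborts = breached_for(blip, threshold, sample_interval_s=10, for_seconds=30)
sustained_aborts = breached_for(sustained, threshold, sample_interval_s=10, for_seconds=30)
print(threshold, blip_aborts, sustained_aborts)
```

The single 310ms reading never triggers an abort, while four consecutive readings above the threshold do. That is the trade-off the window buys you: fewer false aborts in exchange for a slower reaction to a real breach.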

🎯 Key Takeaway

Your SSH is a contract on outputs, not on infrastructure.

Measure real baseline over 7 days before picking a threshold.

The experiment passes even if infrastructure fails — as long as outputs stay within bounds.

Blast Radius Monitoring: Seeing Exactly How Far the Damage Spreads

Blast radius is how much of your system a failure actually touches. It sounds simple, but monitoring it in real time during an experiment is one of the hardest observability problems in practice. The reason: failure in distributed systems is rarely localized. A single pod killed during an experiment can cause retry storms upstream, exhaust connection pools in a shared database, trigger cascading timeouts three service hops away, and spike error rates in a completely unrelated service that shares the same thread pool.

To monitor blast radius properly, you need traces, not just metrics. Metrics tell you that something is broken. Distributed traces tell you which path through your system is breaking and how far the breakage travels. During a chaos experiment, a Jaeger or Tempo trace correlated with your fault injection timestamp is worth more than a wall of Grafana panels.

The practical architecture is this: your chaos tool writes a structured event (fault start, fault end, abort) to a shared event bus. Your monitoring stack consumes those events as annotations on time-series dashboards and as trace baggage propagated via OpenTelemetry. Now every metric spike and every slow trace is automatically contextualized against the fault window. Without this correlation, your engineers spend 40% of experiment time trying to figure out which anomalies are experiment artifacts versus pre-existing issues.
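The correlation step itself is simple once the fault window exists as data. Here is a minimal sketch, assuming the chaos tool has already published the start and end timestamps. The `FaultWindow` type, the 60-second settle period, and the timestamps are all illustrative assumptions, not part of any tool's API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class FaultWindow:
    experiment: str
    started_at: datetime
    ended_at: datetime


def classify_anomaly(
    observed_at: datetime,
    window: FaultWindow,
    settle: timedelta = timedelta(seconds=60),
) -> str:
    """Label an anomaly by where it falls relative to the fault window."""
    if observed_at < window.started_at:
        return "pre-existing"   # was already happening before the fault
    if observed_at <= window.ended_at:
        return "during-fault"   # most likely an experiment artifact
    if observed_at <= window.ended_at + settle:
        return "recovery"       # system still settling after rollback
    return "unrelated"


start = datetime(2026, 3, 6, 14, 23, 0, tzinfo=timezone.utc)
window = FaultWindow(
    experiment="checkout-service-resilience-test",
    started_at=start,
    ended_at=start + timedelta(seconds=120),  # matches a 120s chaos duration
)

anomalies = [
    start - timedelta(seconds=50),
    start + timedelta(seconds=20),
    start + timedelta(seconds=150),
    start + timedelta(minutes=8),
]
labels = [classify_anomaly(ts, window) for ts in anomalies]
print(labels)
```

In a real stack the same labels would come from dashboard annotations and trace baggage rather than a function call, but the decision being automated is the same one.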

Blast radius monitoring also needs to be multi-layered: infrastructure layer (node CPU, network I/O, disk), service layer (error rates per downstream dependency), and business layer (order completion rate, payment throughput). The business layer is the one most teams forget — and it's the only layer your CTO actually cares about during a post-mortem.

chaos_observability_stack.yaml (YAML)

# Prometheus alerting rules that fire DURING a chaos experiment
# These rules correlate with the Litmus chaos event annotations
# Apply with: kubectl apply -f chaos_observability_stack.yaml

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: chaos-blast-radius-alerts
  namespace: monitoring
  labels:
    # This label tells the Prometheus Operator to load these rules
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:
    - name: chaos-experiment-blast-radius
      # Evaluate every 10s during active experiments for fast feedback
      interval: 10s
      rules:

        # Rule 1: Detect upstream retry storm caused by fault injection
        # A retry storm means clients are hammering a degraded service
        - alert: ChaosInducedRetryStorm
          expr: |
            sum by (source_service, target_service) (
              rate(
                grpc_client_handled_total{
                  grpc_code=~"Unavailable|DeadlineExceeded|ResourceExhausted"
                }[1m]
              )
            ) > 50  # More than 50 retryable errors/sec between any two services
          for: 20s   # Must persist 20s — filters transient blips
          labels:
            severity: critical
            experiment_context: "blast-radius-propagation"
          annotations:
            summary: "Retry storm detected: {{ $labels.source_service }} → {{ $labels.target_service }}"
            description: |
              {{ $value | printf "%.1f" }} retryable errors/sec detected between services.
              This indicates the chaos fault has propagated beyond the intended blast radius.
              Check if circuit breaker on {{ $labels.source_service }} is open.
            runbook_url: "https://runbooks.internal/chaos/retry-storm"

        # Rule 2: Connection pool exhaustion — a common blast radius spillover
        # When a service slows down, connection pools fill up and starve other callers
        - alert: DatabaseConnectionPoolExhausted
          expr: |
            (
              db_pool_connections_in_use{pool="checkout-db-pool"}
              /
              db_pool_connections_max{pool="checkout-db-pool"}
            ) > 0.90  # Pool is more than 90% utilized
          for: 15s
          labels:
            severity: critical
            experiment_context: "blast-radius-database"
          annotations:
            summary: "DB connection pool near exhaustion during chaos experiment"
            description: |
              Pool utilization: {{ $value | humanizePercentage }}.
              New requests will queue or fail. Blast radius has reached the database tier.
              Experiment should be aborted if this persists beyond ramp-down window.

        # Rule 3: Business-layer impact — the metric that matters to leadership
        # Drop in order completion rate signals real user impact
        - alert: ChaosBusinessImpactDetected
          expr: |
            (
              sum(rate(orders_completed_total[2m]))
              /
              sum(rate(orders_initiated_total[2m]))
            ) < 0.95  # Completion rate dropped below 95%
          for: 30s
          labels:
            severity: page            # This one actually wakes someone up
            experiment_context: "blast-radius-business"
          annotations:
            summary: "Order completion rate below 95% — chaos experiment exceeding safe blast radius"
            description: |
              Current completion rate: {{ $value | humanizePercentage }}.
              Expected baseline: >99%. This represents real revenue impact.
              Abort the chaos experiment immediately via Litmus dashboard.

---
# Chaos annotation query, stored in a ConfigMap so it is versioned with the alert rules
# Grafana does not load this ConfigMap by itself: paste the query into a dashboard
# annotation backed by the Prometheus data source, or provision it with the dashboard JSON
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-chaos-annotation-datasource
  namespace: monitoring
data:
  # Marks the moments when a Litmus experiment verdict changes
  # Assumes the Litmus chaos exporter is being scraped by Prometheus
  annotation_query: |
    changes(
      litmuschaos_experiment_verdict{
        chaosresult_verdict=~"Pass|Fail|Stopped"
      }[1m]
    ) > 0
  # Annotation label shown in Grafana UI
  annotation_label: "Chaos Fault Window"

Output

prometheusrule.monitoring.coreos.com/chaos-blast-radius-alerts created
configmap/grafana-chaos-annotation-datasource created

--- Prometheus alert evaluation (kubectl logs prometheus-0 -n monitoring | grep chaos) ---
t=14:23:10 level=info msg="Evaluating rule" group=chaos-experiment-blast-radius rule=ChaosInducedRetryStorm
t=14:23:20 level=info msg="Alert firing" alert=ChaosInducedRetryStorm
source_service=payment-service target_service=checkout-service value=73.4
labels: severity=critical experiment_context=blast-radius-propagation
t=14:23:35 level=info msg="Evaluating rule" rule=DatabaseConnectionPoolExhausted
t=14:23:35 level=info msg="Alert resolved" alert=DatabaseConnectionPoolExhausted
-- Pool dropped back to 81% after circuit breaker opened on checkout-service --
t=14:23:50 level=info msg="Evaluating rule" rule=ChaosBusinessImpactDetected
t=14:23:50 level=info msg="Alert NOT firing" -- order completion rate: 97.3% (above threshold)

--- Summary ---
Blast radius contained at: service mesh layer (checkout ↔ payment)
Database pool self-recovered: YES (circuit breaker functioned as designed)
Business impact: NONE (resilience mechanism worked)
Experiment verdict: PARTIAL PASS -- retry storm exceeded threshold but auto-recovered

💡 Pro Tip: Use Exemplars to Link Metric Spikes Directly to Traces

Prometheus exemplars (enabled via --enable-feature=exemplar-storage) let you attach a trace ID to a histogram sample. When a p99 spike shows up on a Grafana panel during an experiment, clicking an exemplar marker on the spike takes you to the trace of a request sampled in that window, provided the panel's data source links exemplars to Jaeger or Tempo. This cuts blast-radius investigation time from minutes to seconds. How you attach the trace ID depends on your metrics library: in Go's client_golang, for example, you record the observation through ObserveWithExemplar with the trace ID as a label. It is a few lines of code that pay dividends in every experiment post-mortem.

📊 Production Insight

Blast radius often spreads through shared resources like database connection pools.

Without traces, you see the metric spike but can't trace the propagation path.

Business-layer metrics (order completion rate) are the only ones leadership acts on.

🎯 Key Takeaway

Monitor three layers: infra, service, business.

Correlate fault events with traces using OpenTelemetry baggage.

The metric your CTO cares about is the one most teams don't measure.

Automating Experiment Abort: Building a Self-Healing Chaos Pipeline

The most dangerous moment in a chaos experiment isn't when you inject the fault — it's the 90 seconds between 'something is wrong' and 'someone pushed the abort button.' Manual abort loops depend on an engineer watching a dashboard at exactly the right moment. In production-adjacent environments, that's too slow. You need automated abort conditions wired directly into your chaos tooling.

Litmus Chaos and Gremlin both support automated halting, though the mechanics differ. In Litmus, a failing Continuous probe marks the experiment as failed, and it stops the fault early only when the probe's runProperties set stopOnFailure: true. Gremlin provides health checks that can halt a running scenario. But that's only the first line of defense. The second, more robust layer is an external watchdog: a small service that receives your Alertmanager webhook, matches on the experiment_context label you saw in the previous section, and calls the chaos tool's API to abort if a severity:page alert fires.

This two-layer approach handles a subtle failure mode: what if the chaos experiment itself breaks your monitoring stack? If Prometheus can't scrape metrics because the node running it is the one you killed, your probe-based abort won't fire. The external watchdog needs to live on a separate node pool, with its own health check, and ideally in a different availability zone from the experiment's blast radius. Monitoring your monitoring during a chaos experiment sounds paranoid — until the one time it matters.

The abort pipeline should also emit a structured event: experiment name, abort reason, which probe failed, current metric value vs. SSH threshold, and a direct link to the Grafana dashboard snapshot captured at abort time. This snapshot is gold for post-mortems — it captures the exact state of every metric at the moment the system broke, before any auto-healing obscures the evidence.
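A structured abort event with those fields might look like the sketch below. The field names and the snapshot URL are illustrative assumptions, not a Litmus or Grafana schema. The values are borrowed from the failed latency probe earlier in this article.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AbortEvent:
    experiment: str
    abort_reason: str
    failed_probe: str
    observed_value: float
    ssh_threshold: float
    unit: str
    dashboard_snapshot_url: str
    aborted_at: str  # ISO 8601, UTC


def build_abort_event(
    experiment: str,
    failed_probe: str,
    observed_value: float,
    ssh_threshold: float,
    unit: str,
    dashboard_snapshot_url: str,
    aborted_at: datetime,
) -> AbortEvent:
    """Assemble the event; the reason is derived so it always matches the numbers."""
    reason = f"{failed_probe}: observed {observed_value}{unit} vs threshold {ssh_threshold}{unit}"
    return AbortEvent(
        experiment=experiment,
        abort_reason=reason,
        failed_probe=failed_probe,
        observed_value=observed_value,
        ssh_threshold=ssh_threshold,
        unit=unit,
        dashboard_snapshot_url=dashboard_snapshot_url,
        aborted_at=aborted_at.astimezone(timezone.utc).isoformat(),
    )


event = build_abort_event(
    experiment="checkout-service-resilience-test",
    failed_probe="checkout-p99-latency-under-300ms",
    observed_value=347.1,
    ssh_threshold=300.0,
    unit="ms",
    # Placeholder URL: in practice this comes back from the snapshot call.
    dashboard_snapshot_url="https://grafana.example.internal/dashboard/snapshot/EXAMPLE",
    aborted_at=datetime(2026, 3, 6, 14, 24, 30, tzinfo=timezone.utc),
)

# One JSON line per abort is easy to ship to a log pipeline or event bus.
event_json = json.dumps(asdict(event), sort_keys=True)
print(event_json)
```

Deriving the abort reason from the observed value and threshold, rather than accepting free text, keeps the event consistent with the numbers it reports.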

chaos_watchdog_service.py (Python)

#!/usr/bin/env python3
"""
Chaos Experiment Watchdog Service

Listens for Alertmanager webhook callbacks and automatically aborts
a running Litmus Chaos experiment if a severity:page alert fires
during the experiment window.

Designed to run on a SEPARATE NODE POOL from the experiment blast radius.
Requirements: pip install fastapi uvicorn httpx pydantic
Run with: uvicorn chaos_watchdog_service:watchdog_app --host 0.0.0.0 --port 8090
"""

import asyncio
import logging
import os
from datetime import datetime, timezone
from typing import Optional

import httpx
from fastapi import FastAPI, HTTPException, Request
from pydantic import BaseModel, Field

# ── Configuration ─────────────────────────────────────────────────────────────
LITMUS_API_URL = os.environ["LITMUS_CHAOS_API_URL"]        # e.g. http://litmus.monitoring:9002
LITMUS_API_TOKEN = os.environ["LITMUS_API_TOKEN"]          # Service account token
GRAFANA_API_URL = os.environ["GRAFANA_API_URL"]            # For snapshot capture on abort
GRAFANA_API_TOKEN = os.environ["GRAFANA_API_TOKEN"]

# Dashboard UID that shows all chaos-related panels — captured at abort time
CHAOS_DASHBOARD_UID = os.environ.get("CHAOS_DASHBOARD_UID", "chaos-blast-radius-overview")

# Only abort experiments if alerts carry this label — prevents false triggers
REQUIRED_EXPERIMENT_LABEL = "experiment_context"

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s — %(message)s"
)
logger = logging.getLogger("chaos-watchdog")

# ── Pydantic models matching Alertmanager webhook payload ──────────────────────
class AlertmanagerLabel(BaseModel):
    alertname: str
    severity: Optional[str] = None
    experiment_context: Optional[str] = None   # Set in our PrometheusRule labels
    namespace: Optional[str] = None

class AlertmanagerAnnotation(BaseModel):
    summary: Optional[str] = None
    description: Optional[str] = None

class AlertmanagerAlert(BaseModel):
    status: str                                # 'firing' or 'resolved'
    labels: AlertmanagerLabel
    annotations: AlertmanagerAnnotation
    startsAt: str
    generatorURL: str

class AlertmanagerWebhookPayload(BaseModel):
    version: str = Field(default="4")
    status: str
    alerts: list[AlertmanagerAlert]

# ── FastAPI app ────────────────────────────────────────────────────────────────
watchdog_app = FastAPI(
    title="Chaos Experiment Watchdog",
    description="Auto-aborts chaos experiments when SSH is violated",
    version="1.0.0"
)

@watchdog_app.post("/alertmanager/webhook")
async def handle_alertmanager_webhook(payload: AlertmanagerWebhookPayload, request: Request):
    """
    Alertmanager calls this endpoint when an alert fires or resolves.
    We only act on 'firing' alerts that carry our experiment_context label.
    """
    logger.info(f"Received webhook — status={payload.status}, alert_count={len(payload.alerts)}")

    for alert in payload.alerts:
        # Only act on actively firing alerts
        if alert.status != "firing":
            logger.debug(f"Skipping resolved alert: {alert.labels.alertname}")
            continue

        # Only abort if alert is explicitly tagged as chaos-experiment-related
        experiment_context = alert.labels.experiment_context
        if not experiment_context:
            logger.debug(f"Alert {alert.labels.alertname} has no experiment_context — skipping")
            continue

        # severity:page means business impact — always abort
        if alert.labels.severity == "page":
            logger.warning(
                f"PAGE-SEVERITY alert during chaos experiment! "
                f"Alert={alert.labels.alertname} "
                f"Context={experiment_context}"
            )

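            # Step 1: Capture Grafana dashboard snapshot BEFORE aborting
            # This preserves the system state at the moment of breach.
            # A failed snapshot must never block the abort itself.
            try:
                snapshot_url = await capture_grafana_snapshot(
                    dashboard_uid=CHAOS_DASHBOARD_UID,
                    annotation=f"AUTO-ABORT: {alert.labels.alertname} at {datetime.now(timezone.utc).isoformat()}"
                )
            except httpx.HTTPError as snapshot_error:
                logger.error(f"Snapshot failed, aborting anyway: {snapshot_error}")
                snapshot_url = "unavailable"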

            # Step 2: Abort the running chaos experiment via Litmus API
            abort_result = await abort_litmus_experiment(
                experiment_context=experiment_context,
                abort_reason=alert.annotations.summary or alert.labels.alertname
            )

            logger.info(
                f"Experiment aborted successfully. "
                f"Snapshot: {snapshot_url} "
                f"Litmus response: {abort_result}"
            )

    return {"status": "processed", "timestamp": datetime.now(timezone.utc).isoformat()}


async def capture_grafana_snapshot(dashboard_uid: str, annotation: str) -> str:
    """
    Creates a Grafana snapshot of the chaos dashboard at the current moment.
    Returns the public snapshot URL for inclusion in post-mortem reports.
    """
    async with httpx.AsyncClient(timeout=10.0) as grafana_client:
        # First, get the dashboard JSON model
        dashboard_response = await grafana_client.get(
            f"{GRAFANA_API_URL}/api/dashboards/uid/{dashboard_uid}",
            headers={"Authorization": f"Bearer {GRAFANA_API_TOKEN}"}
        )
        dashboard_response.raise_for_status()
        dashboard_model = dashboard_response.json()["dashboard"]

        # Create a snapshot — Grafana stores it and returns a share URL
        snapshot_response = await grafana_client.post(
            f"{GRAFANA_API_URL}/api/snapshots",
            headers={"Authorization": f"Bearer {GRAFANA_API_TOKEN}"},
            json={
                "dashboard": dashboard_model,
                "name": f"Chaos Abort Snapshot — {annotation}",
                "expires": 604800,     # Snapshot expires in 7 days
            }
        )
        snapshot_response.raise_for_status()
        snapshot_url = snapshot_response.json()["url"]
        logger.info(f"Grafana snapshot captured: {snapshot_url}")
        return snapshot_url


async def abort_litmus_experiment(experiment_context: str, abort_reason: str) -> dict:
    """
    Calls the Litmus Chaos API to stop the running experiment.
    experiment_context maps to the ChaosEngine name via our labeling convention.
    """
    # Derive the ChaosEngine name from the experiment_context label
    # Convention: label value is "blast-radius-<ChaosEngine metadata.name>"
    chaos_engine_name = experiment_context.removeprefix("blast-radius-")

    async with httpx.AsyncClient(timeout=15.0) as litmus_client:
        # Litmus REST API: PATCH the engine status to 'stop'
        stop_response = await litmus_client.patch(
            f"{LITMUS_API_URL}/api/chaosengine/{chaos_engine_name}",
            headers={
                "Authorization": f"Bearer {LITMUS_API_TOKEN}",
                "Content-Type": "application/json"
            },
            json={
                "spec": {
                    "engineState": "stop"    # Litmus graceful stop — rolls back faults
                },
                "metadata": {
                    "annotations": {
                        # Record WHY this was aborted — visible in Litmus dashboard
                        "chaos.abort.reason": abort_reason,
                        "chaos.abort.timestamp": datetime.now(timezone.utc).isoformat(),
                        "chaos.abort.source": "watchdog-service"
                    }
                }
            }
        )
        stop_response.raise_for_status()
        logger.info(f"Litmus experiment '{chaos_engine_name}' stopped via API")
        return stop_response.json()


@watchdog_app.get("/health")
async def health_check():
    """Liveness probe — the watchdog must be reachable during experiments."""
    return {"status": "healthy", "timestamp": datetime.now(timezone.utc).isoformat()}

Output

INFO chaos-watchdog — Starting Chaos Experiment Watchdog on :8090

INFO chaos-watchdog — 14:23:49 Received webhook — status=firing, alert_count=2

INFO chaos-watchdog — 14:23:49 Alert ChaosInducedRetryStorm has experiment_context=blast-radius-propagation (severity=critical — not page, skipping abort)

WARN chaos-watchdog — 14:23:49 PAGE-SEVERITY alert during chaos experiment! Alert=ChaosBusinessImpactDetected Context=blast-radius-business

INFO chaos-watchdog — 14:23:50 Grafana snapshot captured: https://grafana.internal/dashboard/snapshot/kLmN9pQrXyZ2aB

INFO chaos-watchdog — 14:23:51 Litmus experiment 'business' stopped via API

INFO chaos-watchdog — 14:23:51 Experiment aborted successfully. Snapshot: https://grafana.internal/dashboard/snapshot/kLmN9pQrXyZ2aB Litmus response: {"spec": {"engineState": "stop"}, "status": "updated"}

INFO uvicorn — 127.0.0.1:0 - "POST /alertmanager/webhook HTTP/1.1" 200 OK

🔥Interview Gold: Why Two Abort Layers Beat One

Interviewers love this question: 'What happens if your chaos experiment breaks your monitoring stack?' The answer is exactly why you need an external watchdog separate from probe-based abort. Probe-based abort lives inside Litmus — if the cluster it runs in is compromised, the probe can't fire. The external watchdog runs in a separate AZ with its own Prometheus and Alertmanager. If a candidate only describes probe-based abort, they've demonstrated they haven't thought about second-order failure modes — which is the entire point of chaos engineering.

📊 Production Insight

If your experiment kills Prometheus, probe-based abort won't fire.

External watchdog must run in a separate AZ with independent monitoring.

Snapshot dashboards before abort — auto-healing erases evidence quickly.

🎯 Key Takeaway

Two abort layers: probe-based (fast) and watchdog (fail-safe).

Snapshot at abort time for post-mortem evidence.

Never skip the external watchdog — it's the safety net when your safety net fails.
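The watchdog's filtering rules are easier to trust when they can be tested without starting FastAPI or calling Litmus. The sketch below pulls the three checks from the webhook handler (firing status, `experiment_context` label, `page` severity) into a pure function. `AlertView` and `abort_decision` are hypothetical names introduced for illustration and are not part of the watchdog service above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AlertView:
    """Hypothetical, trimmed-down view of one Alertmanager alert."""
    status: str                                 # 'firing' or 'resolved'
    alertname: str
    severity: Optional[str] = None
    experiment_context: Optional[str] = None


def abort_decision(alert: AlertView) -> str:
    """Return 'abort', or a short reason for skipping, mirroring the handler's checks."""
    if alert.status != "firing":
        return "skip: not firing"
    if not alert.experiment_context:
        return "skip: no experiment_context label"
    if alert.severity != "page":
        return "skip: severity is not page"
    return "abort"


if __name__ == "__main__":
    alerts = [
        AlertView("firing", "ChaosInducedRetryStorm", "critical", "blast-radius-propagation"),
        AlertView("firing", "ChaosBusinessImpactDetected", "page", "blast-radius-business"),
        AlertView("firing", "DiskFull", "page", None),
    ]
    for alert in alerts:
        print(f"{alert.alertname}: {abort_decision(alert)}")
```

Keeping the decision separate from the side effects means a rule change (for example, also aborting on `critical`) can be reviewed and tested in isolation before it reaches the service that stops real experiments.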

Observability Pipelines for Chaos Experiments: From Metrics to Actionable Insights

Running a chaos experiment without a proper observability pipeline is like flying a plane without instruments — you might survive, but you won't learn why. The pipeline needs to collect metrics, traces, and logs from every tier of your stack, annotate them with experiment context, and route them to a dashboard that an engineer can read in real time. Without this, you're guessing.

The key components are: Prometheus for metrics (RED metrics per service: Rate, Errors, Duration), OpenTelemetry for distributed tracing (with baggage propagation to carry experiment IDs), and a structured logging system (JSON logs with experiment_id field). All three need to be timestamp-aligned with the fault injection events. Use the chaos tool's event stream to push annotations into Grafana so every chart shows vertical lines for fault start, end, and abort.
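As a minimal sketch of the structured-logging piece, the snippet below uses only the Python standard library to emit one JSON object per log line with an `experiment_id` field. The logger name `checkout`, the ID `hyp-1234`, and the helper names `JsonFormatter` and `build_logger` are assumptions chosen for illustration; a real service would likely plug this into its existing logging setup.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object, including experiment_id if present."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": record.created,               # epoch seconds, easy to align with fault events
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Attached by the LoggerAdapter below; None for logs outside an experiment
            "experiment_id": getattr(record, "experiment_id", None),
        }
        return json.dumps(entry)


def build_logger(stream, experiment_id: str) -> logging.LoggerAdapter:
    """Return a logger that tags every record with the given experiment_id."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    base = logging.getLogger("checkout")        # hypothetical service logger name
    base.handlers = [handler]
    base.setLevel(logging.INFO)
    base.propagate = False
    # LoggerAdapter passes its dict as `extra`, so the key becomes a record attribute
    return logging.LoggerAdapter(base, {"experiment_id": experiment_id})


if __name__ == "__main__":
    import sys
    log = build_logger(sys.stdout, "hyp-1234")
    log.info("connection pool at %d/%d", 98, 100)
```

With the field present on every line, filtering chaos-related events becomes a single query on `experiment_id` in whatever log backend is in use.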

A common mistake is only collecting metrics at the service level. You need infrastructure metrics too: node CPU, memory, network I/O, disk latency. Without these, you can't tell whether a latency spike is caused by increased request processing time or by a noisy neighbor on the same node consuming CPU. This distinction matters because the fix is different: one requires scaling the service, the other requires node isolation.

Another overlooked piece is rollback observability: after the fault is removed, metrics should return to baseline within the ramp-down period. If they don't, the experiment has caused lasting damage (leaked connections, corrupted state) that a pass/fail verdict alone will not reveal. Monitor the recovery trajectory: a system that never fully recovers is more dangerous than one that fails immediately.
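One way to turn "returns to baseline within the ramp-down period" into a check is sketched below. It assumes a lower-is-better metric such as p99 latency, samples expressed as (seconds since fault removal, value), and a 10% tolerance band around the baseline. The function names and the tolerance are illustrative assumptions, not values from any chaos tool.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (seconds since fault removal, metric value)


def recovery_time(samples: Sequence[Sample], baseline: float,
                  tolerance: float = 0.10) -> Optional[float]:
    """Return the first timestamp after which every later sample stays within
    baseline * (1 + tolerance), or None if the metric never settles."""
    limit = baseline * (1 + tolerance)
    settled_at: Optional[float] = None
    for t, value in samples:
        if value <= limit:
            if settled_at is None:
                settled_at = t
        else:
            settled_at = None   # a later breach resets the clock
    return settled_at


def recovered(samples: Sequence[Sample], baseline: float,
              ramp_down_s: float, tolerance: float = 0.10) -> bool:
    """True only if the metric settled, and did so within the ramp-down period."""
    t = recovery_time(samples, baseline, tolerance)
    return t is not None and t <= ramp_down_s


if __name__ == "__main__":
    healthy = [(0, 3200.0), (30, 900.0), (60, 120.0), (90, 48.0), (120, 46.0)]
    leaky = [(0, 3200.0), (60, 48.0), (90, 140.0), (120, 150.0)]
    print("healthy recovered:", recovered(healthy, baseline=45.0, ramp_down_s=120))
    print("leaky recovered:", recovered(leaky, baseline=45.0, ramp_down_s=120))
```

The reset on a later breach is the important part: a metric that dips back to baseline once and then drifts upward again is treated as not recovered, which is the pattern leaked connections tend to produce.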

📊 Production Insight

Post-fault recovery metrics that don't return to baseline indicate permanent damage.

Without infra metrics, you can't distinguish service degradation from noisy neighbor.

JSON structured logs with experiment_id let you filter chaos-related events from normal traffic.

🎯 Key Takeaway

Three data types: RED metrics, distributed traces, structured logs.

Annotate all signals with experiment context.

Monitor recovery trajectory — not just pass/fail.

Running Chaos Experiments in Production vs Staging: The Graduation Path

The safest approach is to run your first three iterations of each experiment in a production-mirror environment — identical configuration but synthetic traffic. This lets you validate your SSH, probe configuration, abort pipeline, and blast-radius controls without real user impact. Once you've seen the experiment pass and fail (both outcomes are valuable) in the mirror, graduate to a canary production experiment with a tiny blast radius (e.g., 5% of pods, only in one availability zone).

The graduation criteria are: (1) SSH is calibrated from 7-day baseline metrics, (2) automated abort has been verified to work — both probe-based and external watchdog, (3) on-call team has been briefed and has a Grafana dashboard link, (4) blast radius is constrained to less than 10% of capacity per tier, (5) a rollback plan exists and has been rehearsed. If any of these are missing, you're not ready for production.
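The five criteria can be encoded as a small pre-flight check so that "are we ready?" has a concrete, reviewable answer. This is a sketch only: `GraduationChecklist` and its field names are hypothetical, and the 10% limit is taken from criterion (4) above.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class GraduationChecklist:
    ssh_calibrated_from_7d_baseline: bool
    probe_abort_verified: bool
    watchdog_abort_verified: bool
    on_call_briefed_with_dashboard: bool
    blast_radius_pct_per_tier: float            # percent of capacity affected per tier
    rollback_rehearsed: bool

    def blockers(self) -> list[str]:
        """Names of every criterion that is not yet satisfied."""
        missing = [f.name for f in fields(self) if getattr(self, f.name) is False]
        if self.blast_radius_pct_per_tier >= 10.0:
            missing.append("blast_radius_pct_per_tier")
        return missing

    def ready_for_production(self) -> bool:
        return not self.blockers()


if __name__ == "__main__":
    checklist = GraduationChecklist(
        ssh_calibrated_from_7d_baseline=True,
        probe_abort_verified=True,
        watchdog_abort_verified=False,
        on_call_briefed_with_dashboard=True,
        blast_radius_pct_per_tier=25.0,
        rollback_rehearsed=True,
    )
    print("ready:", checklist.ready_for_production())
    print("blockers:", checklist.blockers())
```

Returning the list of blockers rather than a bare boolean gives the team something specific to fix before the next attempt.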

A production experiment should also have a safety word: a human override command that any engineer can issue to stop all experiments immediately. This is typically a simple kill switch in the chaos tool's API. Document the command and the process for using it in your on-call runbook. The goal is to make aborting an experiment as easy as starting one.

Finally, production experiments need a post-mortem SLA: within 24 hours of the experiment, the team should review the metrics, the abort logs, the Grafana snapshots, and decide whether to (a) graduate the experiment to a larger blast radius, (b) fix the issues found and re-run in a mirror, or (c) disable the experiment permanently because the risk outweighs the insight.

📊 Production Insight

Never skip the production-mirror graduation step — real traffic patterns are unpredictable.

Document and rehearse the kill switch before running in production.

Post-mortem within 24 hours — delays lose the context of what happened.

🎯 Key Takeaway

Five graduation criteria before production experiments.

Every experiment needs a documented kill switch that any engineer can trigger.

Post-mortem within 24 hours with snapshot review.

The Experiment Hypothesis: Your Assumption Is Wrong Until Proven

Every chaos experiment starts with a bet. You're betting your system can survive a specific failure without degrading user experience. Write that bet down before you touch a single config file. This is your hypothesis. It forces you to define what 'surviving' actually means. Not vague resilience. Concrete metrics. P99 latency under 200ms. Error rate below 0.1%. Transaction completion within 30 seconds. If you can't write a hypothesis with numbers, you're not ready to run the experiment. The failure will prove your assumptions wrong. That's the point. You're not testing the system. You're testing your understanding of the system. Netflix calls this the 'steady-state hypothesis.' I call it a reality check. Every time I've skipped writing a hypothesis, I've wasted hours debugging symptoms instead of causes. Write it. Run the experiment. Compare results. The gap between what you predicted and what happened is your engineering debt.

HypothesisValidator.java · JAVA

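// io.thecodeforge.sealed.hypothesis
import java.time.Instant;

public record ExperimentHypothesis(
    String experimentName,
    double expectedP99LatencyMs,
    double maxAllowedErrorRate,
    Instant startTime,
    Instant endTime
) {
    public static ExperimentHypothesis forRegionFailover() {
        Instant start = Instant.now();  // single clock read so the window is exactly 300s
        return new ExperimentHypothesis(
            "us-east-1-failover-to-us-west-2",
            250.0,  // expected P99 under failover
            0.005,  // 0.5% error rate max
            start,
            start.plusSeconds(300)
        );
    }

    // Returns true only if both observed values stay within the hypothesis bounds
    public boolean validate(double actualLatency, double actualErrorRate) {
        if (actualLatency > this.expectedP99LatencyMs) {
            System.err.println("Hypothesis FAILED: latency breach");
            System.err.printf("P99 was %.0fms (expected %.0fms max)%n",
                actualLatency, this.expectedP99LatencyMs);
            return false;
        }
        if (actualErrorRate > this.maxAllowedErrorRate) {
            System.err.println("Hypothesis FAILED: error rate breach");
            return false;
        }
        return true;
    }

    // Example run with illustrative observed values that breach the latency bound
    public static void main(String[] args) {
        forRegionFailover().validate(340.0, 0.002);
    }
}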

Output

Hypothesis FAILED: latency breach

P99 was 340ms (expected 250ms max)

⚠ Production Trap:

Never use 'best guess' numbers. Pull your baseline metrics from the last 72 hours of real production traffic. Anything else is cargo culting.
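A minimal sketch of deriving the threshold from real samples instead of a guess is shown below. It assumes the last 72 hours of latency samples have already been exported (for example from Prometheus) into a list of milliseconds. The 25% headroom factor and the function names are illustrative assumptions, not a recommendation from any tool.

```python
import statistics
from typing import Sequence


def p99(latencies_ms: Sequence[float]) -> float:
    """99th percentile of the samples, using the 'inclusive' quantile method."""
    # quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile
    return statistics.quantiles(latencies_ms, n=100, method="inclusive")[98]


def ssh_threshold(latencies_ms: Sequence[float], headroom: float = 1.25) -> float:
    """Baseline p99 multiplied by a headroom factor (1.25 is an assumed value)."""
    return p99(latencies_ms) * headroom


if __name__ == "__main__":
    # Stand-in for 72 hours of exported samples; replace with real data
    baseline_samples = [float(v) for v in range(1, 101)]
    print(f"baseline p99: {p99(baseline_samples):.2f} ms")
    print(f"SSH threshold: {ssh_threshold(baseline_samples):.2f} ms")
```

The resulting number is what belongs in the hypothesis, and it should be recomputed whenever the traffic profile or the deployment changes noticeably.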

🎯 Key Takeaway

If your hypothesis doesn't have numbers, you're not running an experiment. You're just breaking things for fun.

Game Day Drills: Why Paper Experiments Fail and Live Fire Works

I've seen teams spend six months 'preparing' for chaos experiments. They build diagrams. Write runbooks. Draw architecture maps. Then the first real failure hits, and everything burns. Here's why: you can't simulate the cognitive load of a production incident. The incident channel blowing up. Three different on-call engineers screaming conflicting theories. The clock ticking on your SLO budget. That stress changes how people think. Paper experiments don't account for panic. So stop running tabletop exercises. Schedule a game day. Pick a Friday afternoon. Terminate a database primary. Watch what happens. Not what the diagram says should happen. What actually happens. Your team will discover three things immediately: who actually knows the system, which runbooks are outdated, and which alarms nobody configured. The first game day is always chaos. The second one starts looking like an orchestrated dance. By the third one, you're catching regressions before they hit customers. That's the point. Build the muscle memory now, while you're in control.

GameDayOrchestrator.java · JAVA

// io.thecodeforge.sealed.gameday
import java.time.Duration;

public class GameDayOrchestrator {
    sealed interface Phase permits PreGame, Attack, Observe, Retrospective {}
    record PreGame(String hypothesisId, Duration timeout) implements Phase {}
    record Attack(String targetService, String failureType) implements Phase {}
    record Observe(String metricQuery, double threshold) implements Phase {}
    record Retrospective(String summary, boolean pass) implements Phase {}
    
    public static void main(String[] args) {
        Phase[] drillPhases = {
            new PreGame("hyp-1234", Duration.ofMinutes(5)),
            new Attack("payment-service", "terminate-3-instances"),
            new Observe("p99_latency{service='payment'}", 300.0),
            new Retrospective("Failover kicked in at 8s, SLO preserved", true)
        };
        
        for (Phase p : drillPhases) {
            System.out.println("Executing phase: " + p.getClass().getSimpleName());
            // Actual orchestration would trigger chaos-mesh or Gremlin here
        }
    }
}

Output

Executing phase: PreGame

Executing phase: Attack

Executing phase: Observe

Executing phase: Retrospective

🔥ROI Reality:

One game day drill uncovers more defects than a month of static analysis. The cost of a failed drill is documentation. The cost of a production incident is a page at 3 AM.

🎯 Key Takeaway

Paper plans don't survive contact with the enemy. Run live experiments or don't bother.

● Production incident · POST-MORTEM · severity: high

The Retry Storm That Killed Our Checkout During a Chaos Experiment

Symptom

p99 latency on checkout endpoint jumped from 45ms to 3.2s. Order completion rate dropped to 72%. The on-call engineer was paged, but the alert labels didn't include 'experiment_context', so they assumed it was a real incident and rolled back a recent deployment that had nothing to do with the fault.

Assumption

SSH threshold of 300ms p99 latency provided enough headroom. The Litmus probe would automatically abort if breached. The blast radius was limited to 50% of pods, so the remaining 50% should handle traffic.

Root cause

The Litmus probe did abort when p99 exceeded 300ms at t+90s, but the damage had already propagated: the 50% affected pods became slow, causing upstream services to retry aggressively. Those retries exhausted the shared database connection pool (Tomcat max=100, 98 connections in use). The remaining healthy pods couldn't process requests because they couldn't get a database connection. The probe only monitored the checkout service endpoint, not the database pool utilization — so the abort came too late.

Fix

1. Add a database connection pool utilization probe to the Litmus experiment (continuous, threshold < 80%).
2. Implement an external watchdog service in a separate AZ that subscribes to Alertmanager webhooks with business-layer alerts.
3. Set Alertmanager routing to suppress primary on-call during planned experiments using a 'chaos-mode' label.
4. Reduce retry count on upstream services to 1 instead of 3 to prevent retry storms from overwhelming degraded dependencies.
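The retry reduction in the last fix is easier to justify with a little arithmetic. Under the simplifying assumption that each attempt fails independently with probability p and a client retries up to r times, the expected number of attempts per request is 1 + p + p² + … + p^r. The sketch below computes that factor; the independence assumption is optimistic, since real retry storms are correlated and multiply across call layers.

```python
def expected_attempts(failure_prob: float, max_retries: int) -> float:
    """Expected attempts per request when each attempt fails independently
    with probability failure_prob and the client retries up to max_retries times."""
    return sum(failure_prob ** k for k in range(max_retries + 1))


if __name__ == "__main__":
    for p in (0.1, 0.5, 1.0):
        three = expected_attempts(p, max_retries=3)
        one = expected_attempts(p, max_retries=1)
        print(f"p={p}: 3 retries -> {three:.3f}x load, 1 retry -> {one:.3f}x load")
```

When the dependency is fully degraded (p = 1), three retries quadruple the load while one retry only doubles it, which is the difference between a connection pool that saturates and one that keeps some headroom.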

Key lesson

Monitor downstream dependencies, not just the service under test — the blast radius often spreads through shared resources.
Probe-based abort is not enough: external watchdog with business impact detection catches what monitoring blind spots miss.
Alertmanager routing must separate experiment alerts from production incidents to avoid false alarms that waste on-call time and cause unnecessary rollbacks.
Always pre-brief the on-call team about planned experiment windows and provide a Grafana link to track experiment status.

Production debug guide

Symptom → Action guide for when a chaos experiment causes unexpected production impact · 4 entries

Symptom · 01

p99 latency spikes above SSH threshold but experiment doesn't abort

→

Fix

Check Litmus probe configuration: ensure probe mode is 'Continuous', not 'OnChaos'. Verify Prometheus endpoint connectivity. Check whether Prometheus itself is overwhelmed: if the node running Prometheus is in the blast radius, probes will fail silently.

Symptom · 02

Error rate on a dependency service spikes after fault injection

→

Fix

Use Jaeger to find traces that cross the dependency boundary during the fault window. Look for retry storms: rate of grpc_client_handled_total with codes Unavailable/DeadlineExceeded. Then check circuit breaker state on the caller service.

Symptom · 03

Experiment passes SSH but business metric (e.g., order completion) drops

→

Fix

Your SSH is wrong — it's monitoring the wrong output metric. Add a business-layer metric to the experiment probe. Example: rate of orders_completed / rate of orders_initiated. Redefine SSH to include this ratio.
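As a rough sketch of that ratio check, the functions below take the increase in each counter over the same window and compare the ratio with a minimum. The names and the 0.95 default are assumptions for illustration; in practice the two deltas would come from a Prometheus query over `orders_completed` and `orders_initiated`.

```python
from typing import Optional


def completion_ratio(completed: float, initiated: float) -> Optional[float]:
    """Orders completed divided by orders initiated over the same window.
    Returns None when nothing was initiated, because the ratio is undefined."""
    if initiated <= 0:
        return None
    return completed / initiated


def ssh_holds(completed: float, initiated: float, min_ratio: float = 0.95) -> bool:
    """True if the business-layer SSH is satisfied for this window."""
    ratio = completion_ratio(completed, initiated)
    # No traffic means no evidence, so treat it as a breach rather than a silent pass
    return ratio is not None and ratio >= min_ratio


if __name__ == "__main__":
    print("healthy window:", ssh_holds(completed=99, initiated=100))
    print("incident window:", ssh_holds(completed=72, initiated=100))
```

Treating an empty window as a breach is a deliberate choice: an experiment that stops all traffic should not be able to pass simply because there was nothing left to measure.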

Symptom · 04

Alertmanager pages the on-call during a planned experiment

→

Fix

Check that your experiment_context label is set on all PrometheusRules and that Alertmanager routing separates that label to a 'chaos-experiments' receiver. Also send a calendar block with experiment time and a link to the Litmus dashboard before starting.

★ Chaos Experiment Debugging Cheat Sheet

Five most common failure patterns during chaos experiments: diagnose and fix in under 60 seconds.

Probe fails immediately after fault injection

Immediate action

Check if the fault targets the same pod as the probe endpoint. If the probe URL points to the same service under test and the fault kills all pods, no service is available to respond.

Commands

kubectl get chaosengine <name> -n <ns> -o jsonpath='{.status.experimentStatuses[*].probeStatuses}'

kubectl logs chaos-runner -n <ns> --tail=50 | grep 'Probe'

Fix now

Set the probe to run against a different instance of the service or use an external endpoint that is not affected by the fault.

Blast radius exceeds configured percentage

Experiment doesn't abort even though metrics are bad

Grafana dashboard doesn't show fault injection window

Watchdog service doesn't abort experiment when business impact detected

Chaos Engineering Tools Comparison

Aspect	Litmus Chaos (CNCF)	Gremlin (SaaS)
Deployment model	Self-hosted Kubernetes operator	SaaS with agent on your infra
SSH enforcement	Built-in Continuous/Edge probes with Prometheus integration	Attack halt conditions via Gremlin API, less native PromQL
Blast radius control	PODS_AFFECTED_PERC, namespace scoping, label selectors	Target sets with tag-based filters, AZ pinning in UI
Abort mechanism	Probe failure → automatic rollback, PATCH API for external abort	Attack stop via API or UI, webhooks for external triggers
Observability integration	Prometheus, Grafana annotation events via chaos_exporter	Pre-built integrations with Datadog, PagerDuty, Slack
Cost	Free (open-source), pay for LitmusChaos SaaS (Harness)	Paid per user/node — can be significant at scale
Experiment GitOps	YAML ChaosEngine manifests — version controlled natively	Scenario templates via API/Terraform, less kubectl-native
Learning curve	Higher — requires Kubernetes fluency	Lower — GUI-driven, good for teams new to chaos
Post-mortem artifacts	Manual — you build the snapshot pipeline (as above)	Built-in reports with metric overlays and timeline view

⚙ Quick Reference

5 commands from this guide

File	Command / Code	Purpose
steady_state_hypothesis.yaml	apiVersion: litmuschaos.io/v1alpha1	Steady-State Hypothesis
chaos_observability_stack.yaml	apiVersion: monitoring.coreos.com/v1	Blast Radius Monitoring
chaos_watchdog_service.py	"""	Automating Experiment Abort
HypothesisValidator.java	public record ExperimentHypothesis(	The Experiment Hypothesis
GameDayOrchestrator.java	public class GameDayOrchestrator {	Game Day Drills

Key takeaways

A steady-state hypothesis must be a measurable output metric (p99 latency, error rate), not a system health assertion. 'The pod stays up' is not an SSH. 'Checkout completes within 300ms for 99% of requests' is.

Blast radius monitoring requires all three layers: infrastructure (CPU/network), service (error rates per dependency), and business (order completion rate). Alerts without the business layer miss the only metric leadership acts on.

Probe-based abort inside Litmus is your first defense. An external watchdog running in a separate AZ and subscribing to Alertmanager webhooks is your second, and it is critical when the chaos experiment impacts your monitoring infrastructure itself.

Grafana snapshots captured at abort time are non-negotiable for post-mortems. Auto-healing is fast enough that metrics recover before an engineer can open a browser, so the snapshot is the only permanent record of what the system looked like at the moment of failure.

Graduate experiments from production-mirror to canary to full production. Only run in production when SSH is calibrated, abort pipeline verified, blast radius constrained, and rollback plan rehearsed.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR

Walk me through how you'd design the observability stack for a chaos exp...

Q02 · SENIOR

Your chaos experiment kills 50% of the pods in service A. Metrics for se...

Q03 · SENIOR

A candidate says 'We use probe-based abort in Litmus, so we're covered.'...

Q01 of 03 · SENIOR

Walk me through how you'd design the observability stack for a chaos experiment targeting a payment service — specifically, what's your steady-state hypothesis, which metrics would you monitor, and how would you determine if the experiment should be automatically aborted?

ANSWER

Start by defining the SSH as measurable output metrics: 'p99 latency of the /charge endpoint < 500ms' plus 'payment success rate > 99.5%'. Both must be observable from Prometheus. I'd monitor three layers: infra (node CPU, database connection pool), service (error rates per downstream dependency, circuit breaker state), and business (payment completion rate, chargeback rate). Automatic abort comes from two places: Litmus Continuous probes on the SSH metrics, plus an external watchdog that subscribes to Alertmanager for business-layer alerts (e.g., payment completion rate dropping below 95%). The watchdog runs in a separate AZ with its own monitoring stack so it can fire even if the experiment kills Prometheus.


Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Lessons pulled from things that broke in production.


July 18, 2026

last updated

