Chaos Engineering Basics: Monitoring Failures Before They Find You
Every system fails eventually. The brutal truth most engineering teams learn too late is that the failure modes they never tested are always the ones that page them at 3 a.m. on a Friday. Netflix coined the term 'chaos engineering' after their migration to AWS exposed a hard reality: distributed systems fail in ways that are impossible to predict by reading code alone. You have to induce failure — deliberately, scientifically — to build genuine confidence in your system's resilience. Monitoring is what separates chaos engineering from plain vandalism: without deep observability, you're just breaking things and hoping for the best.
The problem chaos engineering solves is the gap between 'we think our system handles this' and 'we have evidence our system handles this.' Runbooks, architecture diagrams, and code reviews are all opinions. A chaos experiment with rigorous monitoring attached is a proof. When a database node disappears, does your read replica take over within your SLA? When a downstream service starts returning 500s, does your circuit breaker actually open, and does it show up in your dashboards before a customer tweets about it? These aren't hypothetical questions — they're experiments with measurable outcomes.
By the end of this article you'll be able to design a complete chaos experiment with a defined steady-state hypothesis, wire up the observability stack needed to validate it, interpret blast-radius telemetry in real time, and avoid the production mistakes that turn a controlled experiment into an uncontrolled incident. We'll use real tooling — Chaos Monkey, Litmus Chaos, Prometheus, and Grafana — with fully runnable configurations and the internal mechanics explained at every step.
Steady-State Hypothesis: The Contract Your Monitoring Must Enforce
Before you inject a single failure, you need a written, measurable definition of 'normal.' This is called the steady-state hypothesis (SSH), and it's the foundation that separates chaos engineering from random testing. Without it, your monitoring has nothing to compare against, and you can't tell whether a blip in your metrics is caused by your experiment or just Tuesday afternoon traffic.
A good SSH has three properties: it is measurable (a concrete metric, not a vague description), it is bounded (a specific threshold like p99 latency < 200ms, not 'fast enough'), and it is observable from your existing monitoring stack without manual inspection. Think of it as a contract — the experiment's job is to stress-test the system, and monitoring's job is to flag the moment that contract is breached.
The SSH drives everything downstream: which metrics you scrape, which alert thresholds you set, how long the experiment runs, and when you abort. Teams that skip this step end up running experiments they can't evaluate — they see metrics move, panic, rollback, and learn nothing. A Prometheus recording rule encoding your SSH turns your hypothesis into an automated referee that fires the moment the blast radius exceeds acceptable bounds.
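As a concrete illustration, an SSH like "p99 checkout latency stays under 300ms" can be encoded as a recording rule plus a referee alert. This is a minimal sketch — the metric name, rule names, and label values here are illustrative assumptions, not taken from any particular stack:

```yaml
groups:
  - name: ssh-checkout
    rules:
      # Recording rule: p99 checkout latency in milliseconds, evaluated continuously
      - record: ssh:checkout_p99_latency_ms
        expr: |
          histogram_quantile(
            0.99,
            rate(http_request_duration_seconds_bucket{service="checkout-service"}[2m])
          ) * 1000
      # Referee alert: fires the moment the SSH contract is breached
      - alert: SteadyStateHypothesisViolated
        expr: ssh:checkout_p99_latency_ms > 300
        for: 30s                       # tolerate single-scrape blips
        labels:
          experiment_context: "ssh-checkout"
```

Because the recording rule runs whether or not an experiment is active, it doubles as your baseline: you can look at a week of `ssh:checkout_p99_latency_ms` history before ever choosing the threshold.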
Note the subtle but critical point: the SSH is about the system's outputs (latency, error rate, throughput), not about the fault you're injecting. You're not asserting 'the database will stay up.' You're asserting 'checkout will complete within 300ms for 99% of requests.' That distinction matters — it's entirely possible the database fails and checkout still hits your SLA via a cache layer. Monitoring that, not the database health itself, is the real experiment.
```yaml
# Litmus Chaos ChaosEngine manifest with embedded steady-state validation
# This defines both the fault injection AND the hypothesis monitoring together
# Run with: kubectl apply -f steady_state_hypothesis.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: checkout-service-resilience-test
  namespace: production-mirror        # NEVER run first experiments in live prod
spec:
  # The application under test — Litmus uses this to scope blast radius
  appinfo:
    appns: production-mirror
    applabel: "app=checkout-service"
    appkind: deployment
  # When the hypothesis is violated, Litmus can auto-stop the experiment
  jobCleanUpPolicy: retain            # Keep job logs for post-mortem analysis
  # Steady-state hypothesis: these probes ARE your monitoring assertions
  # Litmus evaluates them before injection (baseline), during, and after
  experiments:
    - name: pod-cpu-hog               # Fault: saturate CPU on checkout pods
      spec:
        probe:
          # Probe 1: HTTP probe — does the service still respond?
          - name: checkout-endpoint-alive
            type: httpProbe
            httpProbe/inputs:
              url: "http://checkout-service.production-mirror.svc.cluster.local/health"
              insecureSkipVerify: false
              method:
                get:
                  criteria: ==        # Exact HTTP status code match
                  responseCode: "200"
            mode: Continuous          # Keep checking THROUGHOUT the experiment
            runProperties:
              probeTimeout: 5         # Seconds before probe is marked failed
              interval: 10            # Check every 10 seconds
              retry: 2                # Allow 2 transient failures before flagging
              probePollingInterval: 2
          # Probe 2: Prometheus probe — p99 latency is the REAL hypothesis
          # If CPU is pegged but latency stays under 300ms, the system is resilient
          - name: checkout-p99-latency-under-300ms
            type: promProbe
            promProbe/inputs:
              # PromQL query evaluating our SSH threshold
              # histogram_quantile computes p99 from Prometheus histogram buckets
              endpoint: "http://prometheus.monitoring.svc.cluster.local:9090"
              query: |
                histogram_quantile(
                  0.99,
                  rate(
                    http_request_duration_seconds_bucket{
                      service="checkout-service",
                      route="/api/checkout"
                    }[2m]
                  )
                ) * 1000
              # The experiment FAILS if p99 latency exceeds 300ms at any Continuous check
              comparator:
                criteria: "<="
                type: float
                value: "300"          # Milliseconds — our SSH threshold
            mode: Continuous
            runProperties:
              probeTimeout: 10
              interval: 15
              retry: 1                # Only 1 retry — latency spikes matter
        components:
          env:
            # Fault parameters — scope and duration of CPU stress
            - name: CPU_CORES
              value: "2"              # Hog 2 cores per pod
            - name: CPU_LOAD
              value: "90"             # 90% utilization on those cores
            - name: TOTAL_CHAOS_DURATION
              value: "120"            # Run fault for 120 seconds
            - name: PODS_AFFECTED_PERC
              value: "50"             # Blast radius: only 50% of pods affected
            - name: RAMP_TIME
              value: "10"             # 10s stabilization before fault starts
```
```
--- Litmus Chaos Runner Output (kubectl logs -n production-mirror chaos-runner) ---
INFO[0000] Steady State Check — PRE-CHAOS
INFO[0002] Probe: checkout-endpoint-alive → PASS (HTTP 200)
INFO[0004] Probe: checkout-p99-latency-under-300ms → PASS (p99: 47.3ms)
INFO[0010] Steady state established. Injecting fault: pod-cpu-hog
INFO[0020] Fault active on pods: [checkout-7d9f4-xkp2n, checkout-7d9f4-m8qvl]
INFO[0030] Probe: checkout-endpoint-alive → PASS (HTTP 200)
INFO[0030] Probe: checkout-p99-latency-under-300ms → PASS (p99: 189.2ms) ← latency rising
INFO[0060] Probe: checkout-p99-latency-under-300ms → PASS (p99: 241.7ms) ← approaching limit
INFO[0090] Probe: checkout-p99-latency-under-300ms → FAIL (p99: 347.1ms > 300ms threshold)
WARN[0090] SSH VIOLATED — initiating experiment abort
INFO[0092] Fault rolled back. Pods restored.
INFO[0095] Steady State Check — POST-CHAOS
INFO[0097] Probe: checkout-p99-latency-under-300ms → PASS (p99: 52.1ms)
EXPERIMENT VERDICT: FAIL
Reason: p99 latency breached 300ms SLA under 90% CPU saturation on 50% of pods
Recommendation: Investigate CPU throttling limits and horizontal pod autoscaler lag
```
Blast Radius Monitoring: Seeing Exactly How Far the Damage Spreads
Blast radius is how much of your system a failure actually touches. It sounds simple, but monitoring it in real time during an experiment is one of the hardest observability problems in practice. The reason: failure in distributed systems is rarely localized. A single pod killed by Chaos Monkey can cause retry storms upstream, exhaust connection pools in a shared database, trigger cascading timeouts three service hops away, and spike error rates in a completely unrelated service that shares the same thread pool.
To monitor blast radius properly, you need traces, not just metrics. Metrics tell you that something is broken. Distributed traces tell you which path through your system is breaking and how far the breakage travels. During a chaos experiment, a Jaeger or Tempo trace correlated with your fault injection timestamp is worth more than a wall of Grafana panels.
The practical architecture is this: your chaos tool writes a structured event (fault start, fault end, abort) to a shared event bus. Your monitoring stack consumes those events as annotations on time-series dashboards and as trace baggage propagated via OpenTelemetry. Now every metric spike and every slow trace is automatically contextualized against the fault window. Without this correlation, engineers burn a large share of the experiment window just working out which anomalies are experiment artifacts and which are pre-existing issues.
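A small sketch of the event-to-annotation bridge: Grafana's `POST /api/annotations` endpoint accepts epoch-millisecond `time`/`timeEnd` fields, and supplying both renders the fault window as a shaded region. The function name, tag scheme, and the `GRAFANA_API_URL`/`GRAFANA_API_TOKEN` names in the usage comment are our own conventions, not a standard:

```python
from typing import Optional

def build_fault_annotation(
    experiment: str,
    fault: str,
    start_ms: int,
    end_ms: Optional[int] = None,
    dashboard_uid: Optional[str] = None,
) -> dict:
    """Build a payload for Grafana's POST /api/annotations endpoint.

    A region annotation (time + timeEnd, epoch milliseconds) renders as a
    shaded band, so every metric spike on the dashboard is visibly inside
    or outside the fault window.
    """
    payload = {
        "time": start_ms,
        "tags": ["chaos", experiment, fault],    # tag scheme is our own convention
        "text": f"Chaos fault window: {fault} ({experiment})",
    }
    if end_ms is not None:
        payload["timeEnd"] = end_ms              # makes it a region, not a point
    if dashboard_uid is not None:
        payload["dashboardUID"] = dashboard_uid  # omit to annotate every dashboard
    return payload

# Posting it at fault start (sketch — URL and token are deployment-specific):
# import time, httpx
# httpx.post(f"{GRAFANA_API_URL}/api/annotations",
#            headers={"Authorization": f"Bearer {GRAFANA_API_TOKEN}"},
#            json=build_fault_annotation("checkout-resilience", "pod-cpu-hog",
#                                        start_ms=int(time.time() * 1000)))
```

Calling the same builder with `end_ms` at fault rollback (or abort) closes the region, which is exactly the fault-window band the experiment review needs.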
Blast radius monitoring also needs to be multi-layered: infrastructure layer (node CPU, network I/O, disk), service layer (error rates per downstream dependency), and business layer (order completion rate, payment throughput). The business layer is the one most teams forget — and it's the only layer your CTO actually cares about during a post-mortem.
```yaml
# Prometheus alerting rules that fire DURING a chaos experiment
# These rules correlate with the Litmus chaos event annotations
# Apply with: kubectl apply -f chaos_observability_stack.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: chaos-blast-radius-alerts
  namespace: monitoring
  labels:
    # This label tells the Prometheus Operator to load these rules
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:
    - name: chaos-experiment-blast-radius
      # Evaluate every 10s during active experiments for fast feedback
      interval: 10s
      rules:
        # Rule 1: Detect upstream retry storm caused by fault injection
        # A retry storm means clients are hammering a degraded service
        - alert: ChaosInducedRetryStorm
          expr: |
            sum by (source_service, target_service) (
              rate(
                grpc_client_handled_total{
                  grpc_code=~"Unavailable|DeadlineExceeded|ResourceExhausted"
                }[1m]
              )
            ) > 50   # More than 50 retryable errors/sec between any two services
          for: 20s   # Must persist 20s — filters transient blips
          labels:
            severity: critical
            experiment_context: "blast-radius-propagation"
          annotations:
            summary: "Retry storm detected: {{ $labels.source_service }} → {{ $labels.target_service }}"
            description: |
              {{ $value | printf "%.1f" }} retryable errors/sec detected between services.
              This indicates the chaos fault has propagated beyond the intended blast radius.
              Check if circuit breaker on {{ $labels.source_service }} is open.
            runbook_url: "https://runbooks.internal/chaos/retry-storm"
        # Rule 2: Connection pool exhaustion — a common blast radius spillover
        # When a service slows down, connection pools fill up and starve other callers
        - alert: DatabaseConnectionPoolExhausted
          expr: |
            (
              db_pool_connections_in_use{pool="checkout-db-pool"}
              /
              db_pool_connections_max{pool="checkout-db-pool"}
            ) > 0.90   # Pool is more than 90% utilized
          for: 15s
          labels:
            severity: critical
            experiment_context: "blast-radius-database"
          annotations:
            summary: "DB connection pool near exhaustion during chaos experiment"
            description: |
              Pool utilization: {{ $value | humanizePercentage }}.
              New requests will queue or fail. Blast radius has reached the database tier.
              Experiment should be aborted if this persists beyond the ramp-down window.
        # Rule 3: Business-layer impact — the metric that matters to leadership
        # Drop in order completion rate signals real user impact
        - alert: ChaosBusinessImpactDetected
          expr: |
            (
              rate(orders_completed_total[2m])
              /
              rate(orders_initiated_total[2m])
            ) < 0.95   # Completion rate dropped below 95%
          for: 30s
          labels:
            severity: page   # This one actually wakes someone up
            experiment_context: "blast-radius-business"
          annotations:
            summary: "Order completion rate below 95% — chaos experiment exceeding safe blast radius"
            description: |
              Current completion rate: {{ $value | humanizePercentage }}.
              Expected baseline: >99%. This represents real revenue impact.
              Abort the chaos experiment immediately via the Litmus dashboard.
---
# Grafana annotation source — pushes chaos events as vertical lines on dashboards
# This makes every chart self-documenting during an experiment review
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-chaos-annotation-datasource
  namespace: monitoring
data:
  # Grafana reads this query to draw annotations on all dashboards
  # It queries the chaos event log stored as Prometheus labels
  annotation_query: |
    changes(
      litmuschaos_experiment_verdict{
        chaosresult_verdict=~"Pass|Fail|Stopped"
      }[1m]
    ) > 0
  # Annotation label shown in Grafana UI
  annotation_label: "Chaos Fault Window"
```
```
configmap/grafana-chaos-annotation-datasource created

--- Prometheus alert evaluation (kubectl logs prometheus-0 -n monitoring | grep chaos) ---
t=14:23:10 level=info msg="Evaluating rule" group=chaos-experiment-blast-radius rule=ChaosInducedRetryStorm
t=14:23:20 level=info msg="Alert firing" alert=ChaosInducedRetryStorm
  source_service=payment-service target_service=checkout-service value=73.4
  labels: severity=critical experiment_context=blast-radius-propagation
t=14:23:35 level=info msg="Evaluating rule" rule=DatabaseConnectionPoolExhausted
t=14:23:35 level=info msg="Alert resolved" alert=DatabaseConnectionPoolExhausted
  -- Pool dropped back to 81% after circuit breaker opened on checkout-service --
t=14:23:50 level=info msg="Evaluating rule" rule=ChaosBusinessImpactDetected
t=14:23:50 level=info msg="Alert NOT firing" -- order completion rate: 97.3% (above threshold)

--- Summary ---
Blast radius contained at: service mesh layer (checkout ↔ payment)
Database pool self-recovered: YES (circuit breaker functioned as designed)
Business impact: NONE (resilience mechanism worked)
Experiment verdict: PARTIAL PASS — retry storm exceeded threshold but auto-recovered
```
Automating Experiment Abort: Building a Self-Healing Chaos Pipeline
The most dangerous moment in a chaos experiment isn't when you inject the fault — it's the 90 seconds between 'something is wrong' and 'someone pushed the abort button.' Manual abort loops depend on an engineer watching a dashboard at exactly the right moment. In production-adjacent environments, that's too slow. You need automated abort conditions wired directly into your chaos tooling.
Litmus Chaos and Gremlin both support this via their probe-failure semantics: if a Continuous probe fails, the experiment engine automatically rolls back the fault and marks the experiment as failed. But that's only the first line of defense. The second — and more robust — layer is an external watchdog: a small service that subscribes to your alertmanager webhook, matches on the experiment_context label you saw in the previous section, and calls the chaos tool's API to abort if a severity:page alert fires.
This two-layer approach handles a subtle failure mode: what if the chaos experiment itself breaks your monitoring stack? If Prometheus can't scrape metrics because the node running it is the one you killed, your probe-based abort won't fire. The external watchdog needs to live on a separate node pool, with its own health check, and ideally in a different availability zone from the experiment's blast radius. Monitoring your monitoring during a chaos experiment sounds paranoid — until the one time it matters.
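One way to "monitor the monitoring" is a dead-man's-switch rule evaluated by a second, out-of-blast-radius Prometheus that scrapes the primary. A sketch, assuming the primary instance is scraped under a job name of `prometheus-primary` (the job and alert names here are illustrative):

```yaml
groups:
  - name: chaos-meta-monitoring
    rules:
      # Fires if the primary Prometheus stops answering scrapes entirely —
      # i.e. the probe-based abort path can no longer be trusted.
      # `absent()` covers the case where the series itself disappears.
      - alert: PrimaryPrometheusDownDuringChaos
        expr: up{job="prometheus-primary"} == 0 or absent(up{job="prometheus-primary"})
        for: 30s
        labels:
          severity: page
          experiment_context: "meta-monitoring"
        annotations:
          summary: "Primary Prometheus unreachable — abort any running chaos experiment"
```

Routing this alert to the watchdog means "our referee went blind" itself becomes an automatic abort condition rather than something discovered in the post-mortem.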
The abort pipeline should also emit a structured event: experiment name, abort reason, which probe failed, current metric value vs. SSH threshold, and a direct link to the Grafana dashboard snapshot captured at abort time. This snapshot is gold for post-mortems — it captures the exact state of every metric at the moment the system broke, before any auto-healing obscures the evidence.
```python
#!/usr/bin/env python3
"""
Chaos Experiment Watchdog Service

Listens for Alertmanager webhook callbacks and automatically aborts a running
Litmus Chaos experiment if a severity:page alert fires during the experiment
window. Designed to run on a SEPARATE NODE POOL from the experiment blast radius.

Requirements: pip install fastapi uvicorn httpx pydantic
Run with:     uvicorn chaos_watchdog_service:watchdog_app --host 0.0.0.0 --port 8090
"""
import logging
import os
from datetime import datetime, timezone
from typing import Optional

import httpx
from fastapi import FastAPI, Request
from pydantic import BaseModel, Field

# ── Configuration ─────────────────────────────────────────────────────────────
LITMUS_API_URL = os.environ["LITMUS_CHAOS_API_URL"]  # e.g. http://litmus.monitoring:9002
LITMUS_API_TOKEN = os.environ["LITMUS_API_TOKEN"]    # Service account token
GRAFANA_API_URL = os.environ["GRAFANA_API_URL"]      # For snapshot capture on abort
GRAFANA_API_TOKEN = os.environ["GRAFANA_API_TOKEN"]

# Dashboard UID that shows all chaos-related panels — captured at abort time
CHAOS_DASHBOARD_UID = os.environ.get("CHAOS_DASHBOARD_UID", "chaos-blast-radius-overview")

# Only abort experiments if alerts carry this label — prevents false triggers
REQUIRED_EXPERIMENT_LABEL = "experiment_context"

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s — %(message)s",
)
logger = logging.getLogger("chaos-watchdog")


# ── Pydantic models matching the Alertmanager webhook payload ─────────────────
class AlertmanagerLabel(BaseModel):
    alertname: str
    severity: Optional[str] = None
    experiment_context: Optional[str] = None  # Set in our PrometheusRule labels
    namespace: Optional[str] = None


class AlertmanagerAnnotation(BaseModel):
    summary: Optional[str] = None
    description: Optional[str] = None


class AlertmanagerAlert(BaseModel):
    status: str  # 'firing' or 'resolved'
    labels: AlertmanagerLabel
    annotations: AlertmanagerAnnotation
    startsAt: str
    generatorURL: str


class AlertmanagerWebhookPayload(BaseModel):
    version: str = Field(default="4")
    status: str
    alerts: list[AlertmanagerAlert]


# ── FastAPI app ───────────────────────────────────────────────────────────────
watchdog_app = FastAPI(
    title="Chaos Experiment Watchdog",
    description="Auto-aborts chaos experiments when SSH is violated",
    version="1.0.0",
)


@watchdog_app.post("/alertmanager/webhook")
async def handle_alertmanager_webhook(payload: AlertmanagerWebhookPayload, request: Request):
    """
    Alertmanager calls this endpoint when an alert fires or resolves.
    We only act on 'firing' alerts that carry our experiment_context label.
    """
    logger.info(f"Received webhook — status={payload.status}, alert_count={len(payload.alerts)}")

    for alert in payload.alerts:
        # Only act on actively firing alerts
        if alert.status != "firing":
            logger.debug(f"Skipping resolved alert: {alert.labels.alertname}")
            continue

        # Only abort if the alert is explicitly tagged as chaos-experiment-related
        experiment_context = alert.labels.experiment_context
        if not experiment_context:
            logger.debug(f"Alert {alert.labels.alertname} has no experiment_context — skipping")
            continue

        # severity:page means business impact — always abort
        if alert.labels.severity == "page":
            logger.warning(
                f"PAGE-SEVERITY alert during chaos experiment! "
                f"Alert={alert.labels.alertname} "
                f"Context={experiment_context}"
            )

            # Step 1: Capture a Grafana dashboard snapshot BEFORE aborting.
            # This preserves the system state at the moment of breach.
            snapshot_url = await capture_grafana_snapshot(
                dashboard_uid=CHAOS_DASHBOARD_UID,
                annotation=f"AUTO-ABORT: {alert.labels.alertname} at "
                           f"{datetime.now(timezone.utc).isoformat()}",
            )

            # Step 2: Abort the running chaos experiment via the Litmus API
            abort_result = await abort_litmus_experiment(
                experiment_context=experiment_context,
                abort_reason=alert.annotations.summary or alert.labels.alertname,
            )

            logger.info(
                f"Experiment aborted successfully. "
                f"Snapshot: {snapshot_url} "
                f"Litmus response: {abort_result}"
            )

    return {"status": "processed", "timestamp": datetime.now(timezone.utc).isoformat()}


async def capture_grafana_snapshot(dashboard_uid: str, annotation: str) -> str:
    """
    Creates a Grafana snapshot of the chaos dashboard at the current moment.
    Returns the snapshot URL for inclusion in post-mortem reports.
    """
    async with httpx.AsyncClient(timeout=10.0) as grafana_client:
        # First, get the dashboard JSON model
        dashboard_response = await grafana_client.get(
            f"{GRAFANA_API_URL}/api/dashboards/uid/{dashboard_uid}",
            headers={"Authorization": f"Bearer {GRAFANA_API_TOKEN}"},
        )
        dashboard_response.raise_for_status()
        dashboard_model = dashboard_response.json()["dashboard"]

        # Create a snapshot — Grafana stores it and returns a share URL
        snapshot_response = await grafana_client.post(
            f"{GRAFANA_API_URL}/api/snapshots",
            headers={"Authorization": f"Bearer {GRAFANA_API_TOKEN}"},
            json={
                "dashboard": dashboard_model,
                "name": f"Chaos Abort Snapshot — {annotation}",
                "expires": 604800,  # Snapshot expires in 7 days
            },
        )
        snapshot_response.raise_for_status()
        snapshot_url = snapshot_response.json()["url"]
        logger.info(f"Grafana snapshot captured: {snapshot_url}")
        return snapshot_url


async def abort_litmus_experiment(experiment_context: str, abort_reason: str) -> dict:
    """
    Calls the Litmus Chaos API to stop the running experiment.
    experiment_context maps to the ChaosEngine name via our labeling convention.
    """
    # Derive the ChaosEngine name from the experiment_context label.
    # Convention: the experiment_context label value matches ChaosEngine metadata.name
    chaos_engine_name = experiment_context.replace("blast-radius-", "")

    async with httpx.AsyncClient(timeout=15.0) as litmus_client:
        # Litmus REST API: PATCH the engine state to 'stop'
        stop_response = await litmus_client.patch(
            f"{LITMUS_API_URL}/api/chaosengine/{chaos_engine_name}",
            headers={
                "Authorization": f"Bearer {LITMUS_API_TOKEN}",
                "Content-Type": "application/json",
            },
            json={
                "spec": {
                    "engineState": "stop"  # Litmus graceful stop — rolls back faults
                },
                "metadata": {
                    "annotations": {
                        # Record WHY this was aborted — visible in the Litmus dashboard
                        "chaos.abort.reason": abort_reason,
                        "chaos.abort.timestamp": datetime.now(timezone.utc).isoformat(),
                        "chaos.abort.source": "watchdog-service",
                    }
                },
            },
        )
        stop_response.raise_for_status()
        logger.info(f"Litmus experiment '{chaos_engine_name}' stopped via API")
        return stop_response.json()


@watchdog_app.get("/health")
async def health_check():
    """Liveness probe — the watchdog must be reachable during experiments."""
    return {"status": "healthy", "timestamp": datetime.now(timezone.utc).isoformat()}
```
```
INFO chaos-watchdog — 14:23:49 Received webhook — status=firing, alert_count=2
INFO chaos-watchdog — 14:23:49 Alert ChaosInducedRetryStorm has experiment_context=blast-radius-propagation (severity=critical — not page, skipping abort)
WARN chaos-watchdog — 14:23:49 PAGE-SEVERITY alert during chaos experiment!
  Alert=ChaosBusinessImpactDetected
  Context=blast-radius-business
INFO chaos-watchdog — 14:23:50 Grafana snapshot captured:
  https://grafana.internal/dashboard/snapshot/kLmN9pQrXyZ2aB
INFO chaos-watchdog — 14:23:51 Litmus experiment 'business' stopped via API
INFO chaos-watchdog — 14:23:51 Experiment aborted successfully.
  Snapshot: https://grafana.internal/dashboard/snapshot/kLmN9pQrXyZ2aB
  Litmus response: {"spec": {"engineState": "stop"}, "status": "updated"}
INFO uvicorn — 127.0.0.1:0 - "POST /alertmanager/webhook HTTP/1.1" 200 OK
```
| Aspect | Litmus Chaos (CNCF) | Gremlin (SaaS) |
|---|---|---|
| Deployment model | Self-hosted Kubernetes operator | SaaS with agent on your infra |
| SSH enforcement | Built-in Continuous/Edge probes with Prometheus integration | Attack halt conditions via Gremlin API, less native PromQL |
| Blast radius control | PODS_AFFECTED_PERC, namespace scoping, label selectors | Target sets with tag-based filters, AZ pinning in UI |
| Abort mechanism | Probe failure → automatic rollback, PATCH API for external abort | Attack stop via API or UI, webhooks for external triggers |
| Observability integration | Prometheus, Grafana annotation events via chaos_exporter | Pre-built integrations with Datadog, PagerDuty, Slack |
| Cost | Free (open-source), pay for LitmusChaos SaaS (Harness) | Paid per user/node — can be significant at scale |
| Experiment GitOps | YAML ChaosEngine manifests — version controlled natively | Scenario templates via API/Terraform, less kubectl-native |
| Learning curve | Higher — requires Kubernetes fluency | Lower — GUI-driven, good for teams new to chaos |
| Post-mortem artifacts | Manual — you build the snapshot pipeline (as above) | Built-in reports with metric overlays and timeline view |
🎯 Key Takeaways
- A steady-state hypothesis must be a measurable output metric (p99 latency, error rate) — not a system health assertion. 'The pod stays up' is not an SSH. 'Checkout completes within 300ms for 99% of requests' is.
- Blast radius monitoring requires all three layers: infrastructure (CPU/network), service (error rates per dependency), and business (order completion rate). Alerts without the business layer miss the only metric leadership acts on.
- Probe-based abort inside Litmus is your first defense. An external watchdog running in a separate AZ and subscribing to Alertmanager webhooks is your second — critical for the case where the chaos experiment impacts your monitoring infrastructure itself.
- Grafana snapshots captured at abort time are non-negotiable for post-mortems. Automated abort is fast enough that metrics recover before an engineer can open a browser — the snapshot is the only permanent record of what the system looked like at the moment of failure.
⚠ Common Mistakes to Avoid
- ✕ Mistake 1: Running chaos experiments directly in production before ever testing in a staging mirror — Symptom: real users are impacted, and the experiment can't be safely aborted because the blast radius is undefined — Fix: always run the first three iterations of an experiment in a production-mirror namespace or environment with synthetic traffic. Use Litmus's `appns` field and Kubernetes namespace scoping to hard-constrain the blast radius. Only graduate to production after you've observed the experiment complete (pass and fail) in a controlled environment and verified your abort pipeline works.
- ✕ Mistake 2: Setting SSH thresholds without measuring a real baseline — Symptom: every experiment 'passes' even when the system is visibly degraded, or every experiment 'fails' within 10 seconds due to normal traffic variance — Fix: graph `histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{service="your-service"}[5m]))` in Prometheus over a full week including peak hours and note the observed maximum. Set your SSH threshold at 1.5× the observed peak p99, not at an aspirational value. Also add a `for: 30s` window so single-scrape blips don't trigger false SSH violations.
- ✕ Mistake 3: Forgetting to label chaos-related Prometheus alerts with `experiment_context` — Symptom: your on-call engineer gets paged during a planned chaos experiment, treats it as a real incident, rolls back an unrelated deployment as the "root cause," and the chaos experiment never gets properly analyzed — Fix: add a static `experiment_context` label to every PrometheusRule alert that could fire during a chaos window. Configure Alertmanager to route alerts with this label to a dedicated 'chaos-experiments' receiver instead of the primary on-call rotation, and brief your on-call team before every experiment with a calendar block that includes the exact experiment window and a link to the Litmus dashboard.
Interview Questions on This Topic
- Q: Walk me through how you'd design the observability stack for a chaos experiment targeting a payment service — specifically, what's your steady-state hypothesis, which metrics would you monitor, and how would you determine if the experiment should be automatically aborted?
- Q: Your chaos experiment kills 50% of the pods in service A. Metrics for service A look fine, but service B's error rate spikes 30 seconds later. How does your monitoring setup detect this blast-radius propagation, and what does it tell you about service B's design?
- Q: A candidate says 'We use probe-based abort in Litmus, so we're covered.' What's the failure mode this misses, and how would you architect around it?
Frequently Asked Questions
What is the difference between chaos engineering and load testing?
Load testing validates that your system handles expected traffic volume — it's about quantity. Chaos engineering validates that your system handles unexpected failures — it's about resilience. You can pass a load test and still have a catastrophic outage when a single availability zone goes down. The two are complementary: run load tests to establish baseline capacity, then run chaos experiments with load active to simulate real failure conditions under realistic traffic.
How do I know if my chaos experiment is too risky to run in production?
Ask three questions: Do you have a defined SSH with automated abort? Have you run this experiment in staging and seen both a pass and a controlled fail? Does your blast-radius control limit the experiment to less than 50% of capacity in any single tier? If any answer is 'no,' the experiment isn't ready for production. The steady-state hypothesis and automated abort are non-negotiable safety gates — running without them isn't chaos engineering, it's just breaking things.
What's the minimal observability setup needed before starting chaos engineering?
At minimum you need: Prometheus scraping your services with RED metrics (Rate, Errors, Duration), Grafana dashboards showing those metrics with at least 30 days of history (for SSH baseline calibration), and Alertmanager configured with at least one working receiver. Without these three, you can't define a steady-state hypothesis, can't observe blast radius propagation in real time, and can't trigger automated abort. Distributed tracing (Jaeger/Tempo) is strongly recommended but can be added incrementally — start with metrics-only chaos experiments and layer traces in as your practice matures.
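If you're starting from zero, the RED baseline can be as small as three recording rules per service — a sketch assuming conventional counter and histogram names (`http_requests_total`, `http_request_duration_seconds_bucket`); substitute whatever your instrumentation actually exports:

```yaml
groups:
  - name: red-checkout
    rules:
      # R — request rate
      - record: red:request_rate
        expr: sum(rate(http_requests_total{service="checkout-service"}[5m]))
      # E — error rate (5xx responses)
      - record: red:error_rate
        expr: sum(rate(http_requests_total{service="checkout-service", code=~"5.."}[5m]))
      # D — duration, p99 in milliseconds
      - record: red:p99_duration_ms
        expr: |
          histogram_quantile(
            0.99,
            rate(http_request_duration_seconds_bucket{service="checkout-service"}[5m])
          ) * 1000
```

Thirty days of history on these three series is enough to calibrate a first steady-state hypothesis.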