Senior 16 min · March 06, 2026

Alerting & On-Call — Why Silenced Services Hide Outages

A 7-day PagerDuty silence hid a 4-hour checkout outage.

Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Notes here come from systems that actually shipped.

Last updated: May 23, 2026
Quick Answer
  • Alert on symptoms users feel (latency, error rate, availability) — not causes engineers investigate (CPU, memory, disk)
  • The 'for' duration in Prometheus is your noise filter: a 2-minute minimum for critical alerts filters out most self-healing false pages
  • SLO burn rate alerting with multi-window expressions cuts alert volume while improving signal quality
  • On-call rotations need minimum 4 engineers — secondary on-call is non-negotiable for any SLA-backed service
  • Every alert needs a runbook with 3 steps: confirm it's real, apply mitigation, escalation criteria
  • Monthly alert audits are the highest-leverage practice — if an alert fired 3+ times without a fix, it's backlog work, not a page
Definition (~90s read)
What is Alerting and On-call?

Alerting and on-call is the operational practice of detecting when your production systems are in trouble and routing that signal to a human who can fix it. It exists because no system runs perfectly forever, and the gap between 'something broke' and 'someone knows' is where real outages happen.

The core tension is that every alert is a trade-off: too few and you miss incidents, too many and your team burns out ignoring noise. The entire discipline is about engineering that signal-to-noise ratio so that when your pager goes off, it actually matters.

Silenced services are the most common way teams accidentally hide outages. When an alert fires repeatedly for a known issue—a flaky dependency, a known deploy lag—someone silences it to stop the noise. But silencing is a memory leak: the alert disappears from dashboards, from incident reviews, from everyone's awareness.

The outage continues, but nobody sees it. This is why the golden rule of monitoring is to alert on symptoms (user-facing failures like high latency or error rates) rather than causes (CPU spikes, disk usage). Symptoms are what your customers feel; causes are implementation details that change as your system evolves.

Structuring on-call rotations that don't destroy your team means accepting that humans are bad at sustained vigilance. A common pattern for teams spread across regions is a follow-the-sun rotation with 8-12 hour shifts, backed by an escalation chain that targets acknowledgment within 5-15 minutes for critical alerts.

Tools like PagerDuty, Opsgenie, and Grafana OnCall handle routing, deduplication, and suppression—the plumbing that prevents the same incident from paging five people at once. But the real sanity-saver is the noise budget: your pager should be quiet 90% of the time.

If it's not, you're not doing alerting—you're doing noise. The fix is an alert audit loop where every page triggers a runbook review and a decision: tune the threshold, suppress the alert, or fix the underlying issue.

Plain-English First

Imagine your house has a smoke alarm that goes off every time you make toast. After a week, you'd probably rip the battery out — and then miss a real fire. That's exactly what happens with poorly designed software alerts. Engineers get paged so often for things that fix themselves that they start ignoring everything. Good alerting means your alarm only sounds when the house is actually burning, so the person on duty actually pays attention when it does. And when it does fire, there's a clear note on the wall saying 'open the front door, grab the extinguisher, call the fire brigade if it's still going after two minutes' — not a 40-page manual about the history of residential fire safety.

At 3 AM, your phone screams. You scramble to your laptop, bleary-eyed, only to discover the alert fired because a CPU spike lasted four seconds and self-corrected before you even logged in. You've lost sleep over nothing — and this happens four nights a week.

I've seen this pattern at companies of every size. The specifics vary — sometimes it's CPU, sometimes it's memory, sometimes it's a disk utilization alert on a volume that autoscales — but the shape is always the same. Engineers get paged for things that didn't need human attention. They lose sleep, lose trust in the paging system, and eventually lose patience with the whole programme. The best engineers leave first, because they have options. The ones who stay start silencing things. And then the real fire happens.

This is not a monitoring problem. It's a culture problem that presents as a technical one. The technical symptoms are easy to diagnose: too many alerts, thresholds too low, no runbooks, no audit process. But the underlying culture problem is that most engineering teams treat adding alerts as free and removing alerts as risky. Every post-mortem ends with 'add monitoring.' Nobody's post-mortem ever ends with 'delete the alert that's been crying wolf for six months.' That asymmetry is how you get a paging system that cries wolf 40 times a week and then fails to wake anyone up when the real incident hits.

The root cause isn't that teams care too little about monitoring — it's that they add alerts reactively, after every incident, without ever pruning the ones that stop being useful. Over time, the alert system becomes a noise machine. Engineers stop trusting it, start silencing pages, and miss the signals that actually matter. The fix isn't more dashboards. It's disciplined, intentional alerting philosophy backed by concrete practices for thresholds, routing, escalation, and rotation design.

By the end of this article you'll know how to audit your existing alert stack and identify the ones that don't serve you, how to write alerts that fire on symptoms not causes, how to structure an on-call rotation that doesn't erode your team's wellbeing, and how to use real tooling — Prometheus alerting rules, PagerDuty routing logic, and runbook templates — to make all of it operational and repeatable. The goal isn't a perfect alerting system. It's a system your engineers trust enough to actually pay attention to.

Why Silenced Services Hide Outages

Alerting and on-call best practices define how a team detects, responds to, and resolves system anomalies with minimal human latency. The core mechanic is a tiered escalation pipeline: alerts route from automated detection (e.g., p99 latency > 500ms for 5 minutes) to a primary on-call engineer, then to a secondary, and finally to incident management. Each hop happens only when the previous tier has not acknowledged within a defined timeout (commonly 10 to 15 minutes per tier).

In practice, effective alerting hinges on three properties: signal-to-noise ratio, actionable content, and escalation speed. A good alert fires only when human intervention is required — not for transient blips. Each alert must include the failing component, the observed vs. expected value, and a runbook link. Escalation must be automatic and time-bound; manual handoffs introduce minutes of delay.
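One way to hold alerts to the "failing component, observed value, runbook link" standard is a small lint step in CI. The sketch below is a hypothetical check, not a feature of Prometheus or promtool. It assumes each rule has already been parsed into a dict shaped like a Prometheus rule entry, and it assumes two team conventions: the failing component lives in a `service` label, and the observed value is surfaced by referencing `$value` in the description.

```python
def missing_alert_fields(rule: dict) -> list[str]:
    """Return the context fields an alert rule is missing (empty list = OK)."""
    labels = rule.get("labels") or {}
    annotations = rule.get("annotations") or {}
    problems = []
    # Failing component: assumed convention is a 'service' label.
    if not labels.get("service"):
        problems.append("labels.service")
    # Runbook link: the responder should never have to search for it.
    if not annotations.get("runbook_url"):
        problems.append("annotations.runbook_url")
    # Observed value: assumed convention is a $value reference in the description.
    if "$value" not in annotations.get("description", ""):
        problems.append("annotations.description ($value)")
    return problems


good_rule = {
    "alert": "HighErrorRateShopping",
    "labels": {"severity": "critical", "service": "shopping-cart"},
    "annotations": {
        "runbook_url": "https://runbooks.internal/shopping-cart-errors",
        "description": "Current error rate: {{ $value | humanizePercentage }}.",
    },
}
bare_rule = {"alert": "HighCPU", "annotations": {"summary": "CPU is high"}}

print(missing_alert_fields(good_rule))
print(missing_alert_fields(bare_rule))
```

Failing the build on a non-empty result keeps context-free alerts from ever reaching the pager.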

Use these practices in any production system where downtime costs exceed the overhead of maintaining on-call rotations. They matter most for services with strict SLOs (e.g., 99.9% uptime) or where a single engineer cannot know every subsystem. Without them, teams experience alert fatigue, missed critical pages, and prolonged mean-time-to-acknowledge (MTTA).

Silence Is Not Resolution
A silenced alert does not fix the underlying issue — it only hides the symptom until the next, often worse, failure surfaces.
Production Insight
A payment-processing service silenced a 'high 5xx rate' alert during a deployment because it fired every deploy. Three weeks later, a misconfigured load balancer caused 40% of requests to fail for 45 minutes before anyone noticed.
The exact symptom: the alert dashboard showed green because the silenced alert never escalated, and no new alert covered the specific error code.
Rule of thumb: never silence an alert for more than 24 hours without a linked ticket to fix the root cause; use temporary suppressions only during active incident response.
Key Takeaway
Every silenced alert is a ticking time bomb — fix the root cause, not the notification.
An alert that fires every deploy is a deployment process failure, not a monitoring problem.
Your on-call rotation is only as good as your alert routing: acknowledge timeouts must be under 15 minutes.
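The 24-hour rule is easy to state and easy to forget, so it helps to check it mechanically. This is a minimal sketch under stated assumptions: silences have been exported as dicts with `id`, `startsAt`, `endsAt`, and `comment` fields (roughly the shape of Alertmanager silence objects, so verify against your own tooling), and a linked ticket looks like `OPS-1234` somewhere in the comment. The sample data is invented for illustration.

```python
import re
from datetime import datetime, timedelta

MAX_UNTRACKED = timedelta(hours=24)
# Assumed ticket key format, e.g. OPS-1234. Adjust to your tracker.
TICKET_PATTERN = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def parse_ts(value: str) -> datetime:
    """Parse an RFC 3339 timestamp with a trailing 'Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def risky_silences(silences: list[dict]) -> list[str]:
    """Return ids of silences longer than 24h that have no ticket in the comment."""
    flagged = []
    for silence in silences:
        duration = parse_ts(silence["endsAt"]) - parse_ts(silence["startsAt"])
        has_ticket = bool(TICKET_PATTERN.search(silence.get("comment", "")))
        if duration > MAX_UNTRACKED and not has_ticket:
            flagged.append(silence["id"])
    return flagged


SILENCES = [
    {"id": "a1", "startsAt": "2026-03-01T09:00:00Z", "endsAt": "2026-03-08T09:00:00Z",
     "comment": "noisy during deploys"},
    {"id": "b2", "startsAt": "2026-03-01T09:00:00Z", "endsAt": "2026-03-08T09:00:00Z",
     "comment": "OPS-1234 flaky dependency, fix in progress"},
    {"id": "c3", "startsAt": "2026-03-01T09:00:00Z", "endsAt": "2026-03-01T11:00:00Z",
     "comment": "active incident response"},
]

print(risky_silences(SILENCES))
```

Only the 7-day silence without a ticket is flagged. The short incident-response suppression and the ticketed silence pass.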
[Figure: Alerting & On-Call Flow: From Symptoms to SLOs. How to structure alerts, rotations, and audits to avoid hidden outages. Panels: Alert on Symptoms, Not Causes (monitor user-facing signals); Structured On-Call Rotations (primary, secondary, and escalation tiers); Runbooks & SLOs (targets and step-by-step remediation); Alert Routing & Deduplication (suppress noise, group related alerts); Noise Budget & Pager Quiet (limit alerts to preserve attention); Handoff Protocol (clear shift transfer). Warning: never mute alerts without a documented reason and review. Source: thecodeforge.io]

Alert on Symptoms, Not Causes — The Golden Rule of Monitoring

The most common alerting mistake is alerting on what you think is wrong instead of what the user actually experiences. A high CPU alert fires and the engineer investigates — but CPU being high isn't inherently bad. Maybe a batch job is running. Maybe it's expected load from a traffic spike the autoscaler is still catching up to. Maybe the GC is doing a full collection. The user doesn't care about CPU. They care whether the checkout page loads and their payment goes through.

Symptomatic alerting means your alert fires on things users feel directly: high latency, elevated error rates, failed health checks, degraded availability. Google's SRE book formalised this as the Four Golden Signals — latency, traffic, errors, and saturation — and the framing has held up well. Alerts on these signals are almost always actionable because if error rate is 15%, something is broken for users right now, and the on-call engineer has a clear starting point.

Causal metrics like CPU, memory, and disk are better suited for dashboards and capacity planning, not paging. You investigate them after a symptom alert fires to understand why something is wrong — not to decide whether something is wrong. The distinction matters because it changes the engineer's mental model during an incident. A symptom alert says 'users are experiencing this.' A causal alert says 'something in your infrastructure crossed a threshold.' Only one of those tells you whether to act.

The practical test: before adding any alert, ask yourself 'If this fires at 3 AM, can the on-call engineer take a concrete action within five minutes?' Not 'investigate' — act. If the answer is 'investigate and check some dashboards and maybe escalate', it belongs on a dashboard, not a pager. The 3 AM test is deliberately adversarial: the human responding is sleep-deprived, potentially unfamiliar with the specific service, and operating under pressure. Write your alerts for that human, not for a well-rested engineer on a Tuesday morning.

Here's the pushback you'll hear, usually from engineers who've been on the wrong end of missed incidents: 'But what about catching problems early?' This is a reasonable concern wrapped around a false premise. Causal metrics do catch problems early — on dashboards, during business hours, reviewed by engineers who have context and aren't panicking. You don't need to wake someone up to look at a graph trending in the wrong direction. Set up a daily dashboard review as part of your team's operational rhythm. If a pattern in causal metrics consistently precedes a symptom, write a symptom-based alert for the user-facing impact, not the infrastructure metric that correlates with it. The correlation is interesting. The symptom is the truth.

The Four Golden Signals aren't a checklist to run through mechanically. They're a lens for asking the right question: is what I'm about to alert on something a user would feel? Latency tells you how long users wait. Traffic tells you how much demand you're handling and whether demand patterns are normal. Errors tell you how often you're actively failing users. Saturation tells you how close to the edge you are before one of the other three degrades. Any proposed alert that doesn't map cleanly to one of these four should face an explicit justification before it gets merged.

prometheus_symptom_alerts.yml (YAML)
# prometheus_symptom_alerts.yml
# These rules live in your Prometheus alerting rules directory.
# Reference them via the 'rule_files' block in prometheus.yml:
#   rule_files:
#     - /etc/prometheus/rules/*.yml
#
# Validate before applying:
#   promtool check rules /etc/prometheus/rules/prometheus_symptom_alerts.yml

groups:
  - name: user_facing_symptoms
    # Prometheus evaluates these rules every 60 seconds.
    # Shorter intervals increase Prometheus load without meaningfully
    # improving detection time for real incidents.
    interval: 60s
    rules:

      # ──────────────────────────────────────────────────────────
      # GOOD ALERT: Fires on what users actually experience
      # This is a SYMPTOM. Users are receiving 5xx responses.
      # The on-call engineer's action is unambiguous:
      # check error logs, look at recent deploys, investigate downstream.
      # ──────────────────────────────────────────────────────────
      - alert: HighErrorRateShopping
        expr: |
          (
            sum(rate(http_requests_total{
              service="shopping-cart",
              status=~"5.."
            }[5m]))
            /
            sum(rate(http_requests_total{
              service="shopping-cart"
            }[5m]))
          ) > 0.05
        # 'for: 2m' means Prometheus waits 2 minutes of sustained breach
        # before transitioning from 'pending' to 'firing'.
        # This single parameter eliminates the majority of false pages
        # from transient spikes and single bad scrape cycles.
        # Cost: you detect 2 minutes later. Benefit: far fewer 3 AM false alarms.
        for: 2m
        labels:
          severity: critical
          team: checkout
          # 'service' label is critical for Alertmanager routing AND
          # for inhibition rules — both depend on consistent service labels.
          service: shopping-cart
        annotations:
          summary: "Shopping cart error rate above 5% for 2 minutes"
          # The runbook_url annotation is non-negotiable.
          # Engineers responding to this alert should never have to search
          # for documentation during an active incident.
          runbook_url: "https://runbooks.internal/shopping-cart-errors"
          description: "Current error rate: {{ $value | humanizePercentage }}. Check recent deployments and downstream dependencies first."
          # dashboard_url lets the engineer jump directly to the relevant
          # Grafana panel without navigating the dashboard hierarchy.
          dashboard_url: "https://grafana.internal/d/checkout-overview?var-service=shopping-cart"

      # ──────────────────────────────────────────────────────────
      # GOOD ALERT: p99 latency degradation users feel
      # 5-minute 'for' duration is appropriate for latency —
      # a 30-second latency spike is often a cold cache or single
      # slow request. 5 minutes of sustained high p99 means
      # something structural has changed.
      # ──────────────────────────────────────────────────────────
      - alert: CheckoutLatencyHigh
        expr: |
          histogram_quantile(
            0.99,
            sum(rate(http_request_duration_seconds_bucket{
              service="checkout",
              handler="/api/purchase"
            }[10m])) by (le)
          ) > 2.0
        for: 5m
        labels:
          severity: warning
          team: checkout
          service: checkout
        annotations:
          summary: "p99 checkout latency exceeded 2s SLO threshold"
          runbook_url: "https://runbooks.internal/checkout-latency"
          description: "p99 latency is {{ $value | humanizeDuration }}. SLO threshold is 2.0s. Check database query times and downstream payment API latency."
          dashboard_url: "https://grafana.internal/d/checkout-latency"

      # ──────────────────────────────────────────────────────────
      # BAD ALERT — kept here intentionally as a contrast.
      # Do NOT uncomment this. CPU is a cause, not a symptom.
      #
      # Problems with this alert:
      # 1. 'What should the engineer DO?' — there is no clear action.
      #    Check CPU... then what? If users aren't affected, go back to sleep?
      # 2. It fires on batch jobs, GC pauses, autoscaling warmup,
      #    and any other expected load pattern.
      # 3. The threshold (80%) is arbitrary — why not 75%? Why not 90%?
      #    There is no principled answer because CPU% alone isn't the thing
      #    you actually care about.
      # ──────────────────────────────────────────────────────────
      # - alert: HighCPU
      #   expr: node_cpu_utilization > 0.80
      #   for: 1m
      #   annotations:
      #     summary: "CPU is high"
      #
      # If you inherited this alert, delete it today.
      # Add CPU to your Grafana dashboard instead.
Output
# When Prometheus evaluates these rules and HighErrorRateShopping fires:
# (The alert transitions: inactive -> pending -> firing over 2 minutes)
ALERT HighErrorRateShopping
Labels:
alertname = HighErrorRateShopping
severity = critical
team = checkout
service = shopping-cart
Annotations:
summary = Shopping cart error rate above 5% for 2 minutes
description = Current error rate: 7.34%. Check recent deployments and downstream dependencies first.
runbook_url = https://runbooks.internal/shopping-cart-errors
dashboard_url = https://grafana.internal/d/checkout-overview?var-service=shopping-cart
State: firing
ActiveAt: 2026-03-15T03:14:22Z
# In Alertmanager's /api/v2/alerts (the v1 API has been removed in current releases), this alert will show:
# - Matched receiver: pagerduty-critical (via severity=critical route)
# - Status: firing
# - Inhibited: false (no higher-severity alert suppressing it)
# The on-call engineer receives a PagerDuty page containing:
# - Alert name and summary
# - Clickable runbook URL
# - Current metric value (7.34% error rate)
# - Direct dashboard link
# Everything they need to start investigating in under 60 seconds.
Pro Tip: The 3 AM Test
Before committing any new alert rule, ask your team: 'If this woke someone up at 3 AM, would they know exactly what to do in under 5 minutes?' If anyone hesitates, the alert needs a better runbook, a higher threshold, a longer 'for' duration, or it needs to be demoted to a dashboard panel entirely. Run this test on your existing alerts too — not just new ones. Most alert libraries accumulate technical debt the same way codebases do. The 3 AM test applied to a backlog of 60 alerts will typically eliminate 20 of them in a single sitting.
Production Insight
CPU alerts fire on average 40+ times per week in a typical microservice setup running autoscaling and batch workloads.
The vast majority are false positives: batch jobs, GC pauses, autoscaling warmup, and cold-start traffic.
None of them tell the on-call engineer what to do, which means they all teach the engineer to ignore pages.
Rule: if it doesn't map to a Golden Signal and can't survive the 3 AM test, it's a dashboard panel masquerading as an alert.
Key Takeaway
Symptom alerts (latency, errors, availability) are almost always actionable — they describe what users feel.
Causal alerts (CPU, memory, disk) are almost always noise when they page — they describe what engineers investigate.
The 3 AM test: can the on-call engineer take a concrete action in 5 minutes? If not, it's a dashboard.
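When running the 3 AM test against an existing alert backlog, it helps to decide which alerts to review first. One option is the "fired 3+ times without a fix" rule from the Quick Answer. The sketch below assumes you can export page history as a list of dicts with an `alert` key and that you keep a set of alert names whose root cause has actually been fixed. Both shapes, and the sample data, are hypothetical.

```python
from collections import Counter


def audit_candidates(pages: list[dict], fixed_alerts: set[str], min_fires: int = 3) -> dict[str, int]:
    """Return {alert_name: fire_count} for alerts that paged repeatedly with no fix shipped."""
    counts = Counter(page["alert"] for page in pages)
    return {
        name: fires
        for name, fires in sorted(counts.items())
        if fires >= min_fires and name not in fixed_alerts
    }


# One month of invented page history.
PAGES = (
    [{"alert": "HighCPU"}] * 4
    + [{"alert": "CheckoutLatencyHigh"}] * 3
    + [{"alert": "HighErrorRateShopping"}]
)
FIXED = {"CheckoutLatencyHigh"}  # root cause already addressed

print(audit_candidates(PAGES, FIXED))
```

Anything this returns is backlog work: fix the cause, tune the rule, or demote it to a dashboard panel.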
Alert vs Dashboard vs Log: Where Does This Metric Go?
If: User-facing impact (error rate elevated, latency above SLO, availability degraded)
Use: Alert. This is a symptom. Page the on-call engineer with a runbook link and current metric value.
If: Infrastructure metric (CPU, memory, disk) with no confirmed current user impact
Use: Dashboard. Investigate during business hours, or investigate after a symptom alert fires to understand the cause.
If: Metric that reliably self-corrects within 60-90 seconds
Use: Set 'for: 2m' as a minimum. If it consistently self-corrects before 2 minutes, it never becomes an alert and belongs on a dashboard with an anomaly annotation.
If: Metric you're considering alerting on but you're unsure
Use: Apply the 3 AM test first. Then ask: has an incident ever occurred that this alert would have caught earlier than a symptom alert? If yes, it may be worth keeping. If no, it's a dashboard panel.
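To see why a 'for' duration filters out self-correcting blips, here is a simplified model of the inactive, pending, firing transitions. It is a sketch of the behaviour, not Prometheus code: it assumes one boolean per evaluation (did the expression breach the threshold?), a fixed 60-second evaluation interval, and it ignores options such as `keep_firing_for`.

```python
def alert_states(breaches: list[bool], interval_s: int = 60, for_s: int = 120) -> list[str]:
    """Return the alert state after each evaluation cycle."""
    states = []
    active_since = None  # time of the first breach in the current unbroken run
    for i, breached in enumerate(breaches):
        now = i * interval_s
        if not breached:
            # Any healthy evaluation resets the pending timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_long_enough = now - active_since >= for_s
        states.append("firing" if held_long_enough else "pending")
    return states


blip = [False, True, False, False, False]      # one bad evaluation, then recovery
sustained = [False, True, True, True, True]    # breach persists

print(alert_states(blip))
print(alert_states(sustained))
```

The blip never gets past pending, so nobody is paged. The sustained breach fires on the third consecutive bad evaluation, which is the two-minute detection delay the rule comment describes.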

Structuring On-Call Rotations That Don't Destroy Your Team

An on-call rotation is a social contract as much as it is a technical system. Engineers who feel the rotation is fair, predictable, and actively supported stay in it. Engineers who feel it's a punishment — or worse, an invisible tax that everyone pretends doesn't cost anything — churn. And the engineers who leave first are always the ones who have other options: the senior engineers, the ones who built the systems, the ones whose absence creates the next generation of undocumented incidents.

The fundamentals of a healthy rotation start with team size. You need at least four engineers to build a weekly rotation that gives people genuine recovery time between shifts. With three engineers, one person is always either on-call or just came off on-call — cognitive load never fully resets. With two, you're alternating weeks between two people and calling it a rotation. With one, you're not running a rotation at all; you're running a hero, and heroes burn out or leave.
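The team-size arithmetic is simple enough to sketch. The helper below generates a weekly primary rotation and reports how many weeks each engineer gets between shifts. The engineer names and the start date are placeholders, and the Monday-start check reflects a business-hours handoff convention rather than anything a scheduling tool requires.

```python
from datetime import date, timedelta


def weekly_rotation(engineers: list[str], first_monday: date, weeks: int) -> list[tuple[date, str]]:
    """Return (shift_start, engineer) pairs for a round-robin weekly rotation."""
    if first_monday.weekday() != 0:
        raise ValueError("rotation should start on a Monday")
    return [
        (first_monday + timedelta(weeks=i), engineers[i % len(engineers)])
        for i in range(weeks)
    ]


def weeks_off_between_shifts(team_size: int) -> int:
    """Weeks of recovery each engineer gets between primary shifts."""
    return team_size - 1


TEAM = ["alice", "bob", "carol", "david"]  # placeholder names
schedule = weekly_rotation(TEAM, date(2026, 1, 5), weeks=5)
for monday, engineer in schedule:
    print(monday.isoformat(), engineer)

for size in (2, 3, 4, 6):
    print(size, "engineers ->", weeks_off_between_shifts(size), "weeks off")
```

With two engineers the function returns a single week off, which is the "alternating weeks and calling it a rotation" case described above.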

Secondary on-call — a second engineer ready to be escalated to if the primary doesn't acknowledge within ten minutes — is non-negotiable for any service with a real SLA. It serves two purposes. The obvious one is redundancy: if the primary engineer is genuinely unavailable, the incident still gets coverage. The less obvious one is diagnostic: if escalation to secondary happens more than once per week, your primary alert volume is too high. The secondary escalation rate is a canary metric for rotation health that most teams never look at.
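Tracking the secondary escalation rate takes very little code once incident history is exportable. The sketch below assumes each incident is a dict with a `day` (a `date`) and an `escalated_to_secondary` flag. Those field names are hypothetical, as is the sample data, so map them to whatever your paging tool exports.

```python
from collections import Counter
from datetime import date


def secondary_escalations_per_week(incidents: list[dict]) -> dict[tuple[int, int], int]:
    """Count secondary escalations per (ISO year, ISO week)."""
    weekly = Counter()
    for incident in incidents:
        if incident["escalated_to_secondary"]:
            iso = incident["day"].isocalendar()
            weekly[(iso[0], iso[1])] += 1
    return dict(weekly)


def noisy_weeks(incidents: list[dict], limit: int = 1) -> list[tuple[int, int]]:
    """Return the weeks where secondary was paged more than `limit` times."""
    counts = secondary_escalations_per_week(incidents)
    return sorted(week for week, n in counts.items() if n > limit)


INCIDENTS = [
    {"day": date(2026, 3, 2), "escalated_to_secondary": True},
    {"day": date(2026, 3, 4), "escalated_to_secondary": True},
    {"day": date(2026, 3, 5), "escalated_to_secondary": False},
    {"day": date(2026, 3, 11), "escalated_to_secondary": True},
]

print(secondary_escalations_per_week(INCIDENTS))
print(noisy_weeks(INCIDENTS))
```

A week that shows up in `noisy_weeks` is a prompt to audit primary alert volume before touching the rotation itself.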

On-call handoff meetings deserve more ceremony than they typically receive. The outgoing engineer should document what fired, what was investigated, what commands were run, and what follow-up work was created. Without this, the same incidents repeat because the institutional knowledge of 'oh, that alert fires every Tuesday when the ETL job runs — just acknowledge and wait four minutes' lives in one engineer's head and evaporates with every rotation change. The handoff document is where that knowledge becomes team property.

Compensation matters — and how you structure it signals what you believe on-call is worth. Explicit on-call pay, time-off-in-lieu, or reduced sprint commitments during on-call weeks all communicate that the organisation understands the cost. The specifics matter less than the consistency and transparency.

But here's what I've observed across many teams: the psychological contract matters more than the compensation number. An engineer who receives $500 per on-call week but has no control over alert volume, whose feedback from retrospectives never changes anything, and who watches the same false alarms fire week after week without being fixed — that engineer feels trapped regardless of the pay. An engineer who receives time-off-in-lieu and can see, in every sprint, that the team is actively working to reduce alert noise and improve runbooks — that engineer feels respected. The best on-call programs are characterised by visible investment in reducing the burden, not just by compensating for it.

In 2026, most teams run distributed rotations spanning multiple time zones. This is both an opportunity and a design challenge. Done well, a distributed rotation means nobody carries the full 24-hour burden of a weekly shift — you can hand off at a natural boundary between regions and keep business-hours coverage for each timezone. Done poorly, it creates coordination overhead, unclear escalation paths when the person on-call is 9 time zones away, and handoff meetings that nobody can attend at a reasonable hour. If your team spans more than two time zones, design your rotation explicitly for the timezone distribution — don't just apply a single-timezone rotation template and hope the scheduling works out.
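A follow-the-sun design is worth checking for gaps and overlaps before it goes into a scheduling tool. The sketch below uses three hypothetical regional windows expressed in UTC hours. Real business hours shift with daylight saving time and office locations, so treat the numbers as placeholders.

```python
# (region, start_hour_utc, end_hour_utc), end exclusive. Hypothetical windows.
SHIFTS = [("APAC", 1, 9), ("EMEA", 9, 17), ("AMER", 17, 1)]


def _covers(start: int, end: int, hour: int) -> bool:
    """True if `hour` falls in [start, end), allowing windows that wrap past midnight."""
    if start < end:
        return start <= hour < end
    return hour >= start or hour < end


def on_call_region(hour_utc: int, shifts=SHIFTS) -> str:
    """Return the region that owns the pager at the given UTC hour."""
    for region, start, end in shifts:
        if _covers(start, end, hour_utc):
            return region
    raise ValueError(f"no region covers {hour_utc:02d}:00 UTC")


def coverage_problems(shifts=SHIFTS) -> list[int]:
    """Return UTC hours covered by zero regions or by more than one."""
    problems = []
    for hour in range(24):
        owners = [region for region, start, end in shifts if _covers(start, end, hour)]
        if len(owners) != 1:
            problems.append(hour)
    return problems


print(on_call_region(3), on_call_region(12), on_call_region(0))
print(coverage_problems())
```

An empty list from `coverage_problems` means every hour has exactly one owner. Any hour it returns is either an uncovered gap or an ambiguous escalation path.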

Rotation schedule changes are the most underestimated source of trust erosion. Once you publish a rotation, treat changes with the same communication discipline you'd apply to a production deployment. Engineers plan childcare, travel, and personal commitments around the schedule. A last-minute swap that affects a weekend isn't just inconvenient — it damages the sense that the rotation is a fair and predictable system, which is the only thing that makes it sustainable long-term.

pagerduty_escalation_policy.tf (HCL)
# pagerduty_escalation_policy.tf
# Terraform resource definitions for a PagerDuty escalation policy
# and primary on-call rotation schedule.
#
# Prerequisites:
#   - PagerDuty Terraform provider configured:
#     terraform {
#       required_providers {
#         pagerduty = {
#           source  = "PagerDuty/pagerduty"
#           version = "~> 3.0"
#         }
#       }
#     }
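#   - PAGERDUTY_TOKEN environment variable set
#   - Not shown in this file: pagerduty_schedule.checkout_secondary_rotation and the
#     data "pagerduty_user" lookups referenced below must be defined elsewhere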
#
# Apply: terraform init && terraform apply

# ──────────────────────────────────────────────
# ESCALATION POLICY
# Three levels: Primary -> Secondary -> EM
# This is the standard for any SLA-backed production service.
# ──────────────────────────────────────────────
resource "pagerduty_escalation_policy" "checkout_team" {
  name      = "Checkout Team — Production Escalation"
  # num_loops: after completing all escalation levels with no acknowledgment,
  # restart from Level 1 this many times before giving up.
  # 2 loops = the full chain repeats twice after the first pass (up to 3 passes in total)
  # before PagerDuty stops escalating the unacknowledged incident.
  num_loops = 2

  # LEVEL 1: Primary on-call engineer
  # Gets paged first via push notification + SMS simultaneously.
  # 10 minutes is the standard industry acknowledgment window —
  # long enough to wake up, short enough to limit outage duration.
  rule {
    escalation_delay_in_minutes = 10

    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.checkout_primary_rotation.id
    }
  }

  # LEVEL 2: Secondary on-call engineer
  # Triggered when primary doesn't acknowledge within 10 minutes.
  # This should be a RARE event — if it happens weekly, audit alert volume.
  # Secondary rotation uses different engineers than primary
  # so the same person isn't primary + secondary simultaneously.
  rule {
    escalation_delay_in_minutes = 10

    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.checkout_secondary_rotation.id
    }
  }

  # LEVEL 3: Engineering Manager
  # This level should fire fewer than once per quarter in a healthy team.
  # If it fires weekly, the EM needs to be in the alert audit conversation,
  # not just at the end of the escalation chain.
  rule {
    escalation_delay_in_minutes = 15

    target {
      type = "user_reference"
      id   = data.pagerduty_user.checkout_engineering_manager.id
    }
  }
}

# ──────────────────────────────────────────────
# PRIMARY ON-CALL SCHEDULE
# Weekly rotation across 4 engineers minimum.
# Fewer than 4 engineers = rotation gaps and burnout risk.
# ──────────────────────────────────────────────
resource "pagerduty_schedule" "checkout_primary_rotation" {
  name      = "Checkout — Primary On-Call"
  time_zone = "America/New_York"

  layer {
    name = "Weekly Primary Rotation"
    # Rotation starts Monday 9 AM — handoff during business hours
    # means the incoming engineer can get context before night coverage.
    start                        = "2026-01-05T09:00:00-05:00"  # 2026-01-05 is a Monday
    rotation_turn_length_seconds = 604800  # 7 days exactly
    rotation_virtual_start       = "2026-01-05T09:00:00-05:00"

    # 4 engineers minimum for a healthy weekly rotation.
    # Each engineer is primary once every 4 weeks.
    # With 4 people: 1 week on, 3 weeks off.
    # With 6 people: 1 week on, 5 weeks off — meaningful recovery time.
    users = [
      data.pagerduty_user.alice.id,
      data.pagerduty_user.bob.id,
      data.pagerduty_user.carol.id,
      data.pagerduty_user.david.id,
    ]
  }

  # For distributed teams spanning multiple time zones,
  # use multiple layers with restrictions to implement follow-the-sun:
  # layer 1: AMER engineers, active 9 AM - 6 PM EST
  # layer 2: EMEA engineers, active 9 AM - 6 PM GMT
  # layer 3: APAC engineers, active 9 AM - 6 PM SGT
  # Each layer has a 'restriction' block defining their active hours.
  # This keeps on-call within business hours for each region.
}
Output
# terraform apply output:
pagerduty_schedule.checkout_primary_rotation: Creating...
pagerduty_schedule.checkout_primary_rotation: Creation complete after 2s [id=P3X8K2A]
pagerduty_escalation_policy.checkout_team: Creating...
pagerduty_escalation_policy.checkout_team: Creation complete after 1s [id=PQRST99]
Apply complete! Resources: 2 added, 0 changed, 0 destroyed.
# Escalation timeline when an alert fires at 03:14 AM:
#
# T+00:00 — Alert fires in Prometheus, transitions to 'firing' state
# T+00:05 — Alertmanager routes to pagerduty-critical receiver
# T+00:05 — PagerDuty creates incident, pages Alice (primary)
# Notification: push notification + SMS simultaneously
# T+10:00 — Alice hasn't acknowledged (phone on silent, deep sleep)
# PagerDuty escalates to Bob (secondary)
# Bob receives push + SMS
# T+10:45 — Bob acknowledges on mobile app
# Alice receives 'incident acknowledged by Bob' notification
# T+20:00 — If Bob also didn't acknowledge, EM is paged
#
# Level 3 firing = something is structurally wrong with your rotation
# or your alert volume is unsustainable. Either needs fixing urgently.
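The timeline above follows directly from the per-level delays. As a quick sanity check for your own policy, the sketch below computes when each level is paged during one pass through the chain. It models a single pass only and deliberately leaves out the `num_loops` repeat behaviour. The level names are labels for illustration.

```python
def page_times(levels: list[tuple[str, int]]) -> list[tuple[int, str]]:
    """Return (minutes_after_incident, level_name) for one pass through the chain.

    Each level is (name, escalation_delay_in_minutes): how long that level has
    to acknowledge before the incident moves on to the next one.
    """
    timeline = []
    elapsed = 0
    for name, delay in levels:
        timeline.append((elapsed, name))
        elapsed += delay
    return timeline


def pass_duration(levels: list[tuple[str, int]]) -> int:
    """Minutes until every level has had its full acknowledgment window."""
    return sum(delay for _, delay in levels)


LEVELS = [("primary", 10), ("secondary", 10), ("engineering manager", 15)]

for minute, name in page_times(LEVELS):
    print(f"T+{minute:02d}:00 page {name}")
print("one full pass takes", pass_duration(LEVELS), "minutes")
```

If the pass duration is longer than the outage your SLA can absorb, shorten the delays rather than adding levels.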
Watch Out: Manager-First Escalation
Never put a manager at Level 1 escalation. It trains engineers to wait for someone else to handle incidents, creates a single point of failure when the manager is travelling or has a family emergency, and burns out your engineering manager faster than almost anything else. Managers who get paged at 3 AM for production incidents can't lead teams effectively the next morning. Keep managers at Level 3 as a genuine last resort — the signal that your entire rotation has gone dark, not the default when the first engineer is slow to respond.
Production Insight
Teams with fewer than 4 on-call engineers show 3x higher attrition rates within 18 months — the math is simple: burn enough sleep and good engineers leave.
Secondary on-call escalation more than once per week is a leading indicator of alert volume problems, not coverage problems — audit alerts before adjusting the rotation.
Distributed teams in 2026 should design rotations explicitly for timezone coverage, not apply a single-timezone template and hope scheduling works out.
Rule: treat gaps in the rotation schedule with the same urgency as gaps in production coverage. Both represent unacceptable risk.
Key Takeaway
A healthy rotation needs 4+ engineers, an active secondary, a handoff ritual, and visible investment in reducing noise.
Compensation without alert noise reduction is a band-aid. Fix the alert volume first — then the compensation validates the team's experience rather than apologising for it.
Fairness and predictability retain engineers. Schedule changes made without notice destroy the trust that makes rotations sustainable.
Rotation Design Based on Team Size and Distribution
If: Team of 4+ engineers, single timezone or overlapping timezones
Use: Weekly rotation with primary + secondary on-call. Each engineer is primary once per N weeks where N is team size. This is the sustainable baseline.
If: Team of 4+ engineers spanning 3+ distinct time zones
Use: Follow-the-sun rotation with PagerDuty layer restrictions by active hours. Keep on-call within business hours for each region. No engineer should carry overnight coverage for a timezone they don't live in.
If: Team of 2-3 engineers
Use: Daily rotation is possible but burnout risk is significant within 6 months. Escalate hiring as a reliability risk, not just a headcount request. Quantify the alert volume and overnight pages to make the cost visible to leadership.
If: Team of 1 engineer
Use: This is not a rotation. It is a single point of failure for both the service and the engineer. Escalate to management immediately with explicit framing: this is a business continuity risk, not a staffing preference.
If: Secondary escalation fires more than once per week
Use: Do not expand the secondary rotation. Audit primary alert volume first. If secondary is escalating often, the problem is too many pages, not too few people in the rotation.
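
The decision table above can be encoded as a small function, which is handy if you want a rotation health check in a team dashboard. This is only a sketch: the function name `rotation_advice` and its parameters are hypothetical, and the thresholds are the ones stated in the table.

```python
def rotation_advice(team_size: int, distinct_timezones: int = 1,
                    secondary_escalations_per_week: float = 0.0) -> str:
    """Map the rotation decision table to a one-line recommendation."""
    # Frequent secondary escalation points at alert volume, whatever the team size.
    if secondary_escalations_per_week > 1:
        return "audit primary alert volume before changing the rotation"
    if team_size <= 1:
        return "single point of failure: escalate as a business continuity risk"
    if team_size <= 3:
        return "daily rotation possible, burnout risk: escalate hiring"
    if distinct_timezones >= 3:
        return "follow-the-sun rotation within regional business hours"
    return f"weekly primary + secondary, primary once every {team_size} weeks"


print(rotation_advice(5))
print(rotation_advice(6, distinct_timezones=3))
```

The escalation check runs first on purpose: the table says to audit alert volume before touching the rotation, so that answer should win over any sizing advice.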

Runbooks, SLOs, and the Alert Audit Loop That Keeps You Sane

Every alert that fires should have a runbook. Not a wiki page describing the service architecture. Not a Confluence document that was last updated eighteen months ago. A runbook: a living, numbered document that tells the on-call engineer exactly what to check, what commands to run, and what decisions to make — optimised for a person who may be half-asleep, may not know this service deeply, and has about five minutes before stakeholders start asking questions.

The most common runbook failure is structure: engineers write runbooks like documentation. They lead with context — here's how this service works, here's its architecture, here's the dependency graph. That context belongs in the service's architecture docs. A runbook during an active incident needs to start with the thing the engineer does right now, not the context they'd need to understand the system from scratch. Lead with the first command they should run. Everything else is footnotes.

Runbooks don't need to be perfect on day one. The minimum viable runbook has three headings: 'Is this alert real?', 'Known mitigations', and 'When to escalate and who to call.' The first heading should have a specific command or dashboard link that lets the engineer confirm the alert reflects a real problem, not a metric collection glitch. The second should have the two or three mitigations that have worked historically, even if they're partial. The third should have an explicit escalation matrix — not 'contact the team' but 'if the database is unreachable, call the database team at this PagerDuty service; if the issue is the payment gateway, here's the vendor's emergency number.'
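
One way to keep the three-heading minimum honest is a lint check that runs whenever a runbook changes. The sketch below assumes runbooks are stored as text files with `#`-style headings; that storage format and the function name `missing_runbook_sections` are assumptions for illustration, not something this article's tooling prescribes.

```python
import re

REQUIRED_HEADINGS = (
    "Is this alert real?",
    "Known mitigations",
    "When to escalate and who to call",
)


def missing_runbook_sections(runbook_text: str) -> list[str]:
    """Return required headings that are absent or have no content under them."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in runbook_text.splitlines():
        heading = re.match(r"^#+\s*(.+?)\s*$", line)
        if heading:
            current = heading.group(1)
            sections[current] = []
        elif current is not None and line.strip():
            # Any non-blank line counts as content for the current heading.
            sections[current].append(line)
    return [h for h in REQUIRED_HEADINGS if not sections.get(h)]


draft = "# Is this alert real?\n1. Open the error-rate dashboard\n# Known mitigations\n"
print(missing_runbook_sections(draft))
```

A heading with nothing under it is reported the same way as a missing heading, because an empty 'Known mitigations' section helps nobody at 3 AM.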

As incidents happen, the on-call engineer appends what they learned. Within three months of consistently following this pattern, you have battle-tested documentation that reflects reality rather than design intent. The gap between design intent and operational reality is where most runbooks fail — and closing that gap is what makes the difference between a runbook that helps and one that the engineer closes after 30 seconds because it doesn't match what they're seeing.

SLO-based alerting is a different philosophy entirely, and it's worth understanding the shift it requires. Instead of alerting when error rate exceeds 1% — an arbitrary threshold chosen by someone who had a reasonable gut feeling — you alert when you're consuming your monthly error budget faster than sustainable. This ties your paging directly to whether you're going to breach the reliability commitment you've made to your users. It dramatically reduces alert volume while ensuring that every page represents a genuine threat to that commitment.
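
The arithmetic behind "faster than sustainable" is short enough to write down. The sketch below assumes a 30-day SLO window; the function names `burn_rate` and `hours_to_exhaustion` are illustrative helpers, not part of any library.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than sustainable the error budget is being spent."""
    budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return error_ratio / budget


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until a full window's budget is gone at a constant burn rate."""
    return window_days * 24 / rate


# A 1.4% error ratio against a 99.9% SLO burns budget about 14 times too fast.
rate = burn_rate(0.014, 0.999)
print(f"burn rate {rate:.1f}x, budget gone in {hours_to_exhaustion(rate):.0f} hours")
```

A burn rate of 1x spends exactly one window's budget per window. Anything above 1x breaches the SLO if it persists, and the size of the multiple tells you how long you have.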

The monthly alert audit is the most underrated practice in on-call operations. Pull a report of every alert that fired in the last 30 days. For each one: Was it actionable? Was there a documented response? If the same alert fired three or more times without a code fix being shipped, that's reliability debt. It belongs in the engineering backlog with a sprint assignment, not as a recurring 3 AM interruption that the team has collectively accepted as normal.

The audit meeting structure matters. Pull the data before the meeting: total alerts fired, percentage that resulted in an acknowledged incident with a documented response, mean time to acknowledge, and the ranked list of repeat offenders by firing frequency. Present it visually. The team's job during the meeting is to make one of three decisions about each alert on the list: keep it as is, fix the underlying condition that makes it fire too often, or delete it. Decide, don't discuss. Meetings that produce 'we should look into that' instead of 'deleted' or 'ticket created' waste their 30 minutes entirely.
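
The pre-meeting data pull can be a few lines of scripting. This sketch assumes you have already exported firings into a list of dicts; the field names `alertname`, `actionable`, and `fix_shipped` are hypothetical and would need mapping from whatever your paging tool exports.

```python
from collections import Counter


def audit_summary(firings: list[dict], repeat_threshold: int = 3) -> dict:
    """Summarise 30 days of alert firings for the audit meeting."""
    total = len(firings)
    actionable = sum(1 for f in firings if f["actionable"])
    counts = Counter(f["alertname"] for f in firings)
    fixed = {f["alertname"] for f in firings if f["fix_shipped"]}
    # Repeat offenders: fired often, and nobody shipped a fix.
    repeat_offenders = [
        (name, n) for name, n in counts.most_common()
        if n >= repeat_threshold and name not in fixed
    ]
    return {
        "total": total,
        "actionable_pct": round(100 * actionable / total, 1) if total else 0.0,
        "repeat_offenders": repeat_offenders,
    }


firings = (
    [{"alertname": "DiskFull", "actionable": False, "fix_shipped": False}] * 3
    + [{"alertname": "CheckoutErrors", "actionable": True, "fix_shipped": True}] * 3
)
print(audit_summary(firings))
```

The repeat-offender list is the agenda: each entry leaves the meeting as kept, ticketed, or deleted.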

The SLO frame changes the audit conversation in a valuable way. Instead of asking 'was this specific alert actionable?' you ask 'is our error budget on track this month?' If the budget is healthy and you have 80% remaining at the midpoint, many threshold alerts that fired are provably noise — they didn't threaten the SLO. If the budget is burning and you're at 40% at midpoint, you need more alerting sensitivity, not less. The budget number makes the decision about alert sensitivity a function of reliability risk rather than engineering anxiety.
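
The midpoint check reduces to comparing budget spent with time elapsed. A minimal sketch, assuming you can estimate the request volume for the full SLO window; `budget_remaining` and `sensitivity_advice` are illustrative names, and the advice strings paraphrase the paragraph above.

```python
def budget_remaining(slo_target: float, expected_window_requests: int,
                     failed_so_far: int) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is breached)."""
    allowed_failures = (1.0 - slo_target) * expected_window_requests
    return 1.0 - failed_so_far / allowed_failures


def sensitivity_advice(remaining: float, elapsed_fraction: float) -> str:
    """Compare budget spent with the share of the SLO window that has elapsed."""
    spent = 1.0 - remaining
    if spent > elapsed_fraction:
        return "burning faster than sustainable: increase alert sensitivity"
    return "on track: threshold alerts that fired are candidates for deletion"


# 99.9% SLO, 10M requests expected this month, 2,000 failures at the midpoint.
remaining = budget_remaining(0.999, 10_000_000, 2_000)
print(f"{remaining:.0%} remaining:", sensitivity_advice(remaining, elapsed_fraction=0.5))
```

With 6,000 failures instead of 2,000 the same call reports 40% remaining and flips the advice, which is the 80% versus 40% contrast described above.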

slo_burn_rate_alert.yml (YAML)
# slo_burn_rate_alert.yml
# Multi-window burn rate alerting — the approach from Google SRE Workbook Chapter 5.
#
# PREREQUISITES:
# - A defined SLO: for this example, 99.9% availability (0.1% error budget)
# - The service must emit http_requests_total with a 'status' label
# - A 30-day SLO measurement window (most common in production)
#
# HOW BURN RATE WORKS:
# If your SLO is 99.9%, your monthly error budget is 0.1% of all requests.
# A burn rate of 1x means you're consuming budget at exactly the rate that
# exhausts it in 30 days. A burn rate of 14x means you'll exhaust 30 days
# of budget in ~51 hours (720h / 14). That's when you wake someone up immediately.
# A burn rate of 3x means you'll exhaust budget in ~10 days (30d / 3): serious,
# but you have time to investigate during business hours.
# (The Workbook's own examples use 14.4x for 1h/5m and 6x for 6h/30m;
# the 14x and 3x thresholds here are this article's choices.)
#
# WHY MULTI-WINDOW:
# Single-window: after a short, severe spike, a 1h burn rate > 14x keeps
# firing for up to an hour after recovery, because the errors stay in the window.
# Multi-window: BOTH the 1h window AND the 5m window must exceed the threshold.
# The 1h window proves a meaningful share of budget has burned; the 5m window
# proves it is still burning, so the alert clears within minutes of recovery.
# This removes stale pages without meaningful detection delay.

groups:
  - name: slo_burn_rates
    rules:

      # ──────────────────────────────────────────────────────────
      # FAST BURN: 14x rate (budget exhausted in ~51 hours)
      # This is a drop-everything incident. Page immediately.
      # Multi-window: 1h (detects sustained fast burn) AND
      #               5m (confirms it's still happening right now).
      # ──────────────────────────────────────────────────────────
      - alert: CheckoutSLOFastBurn
        expr: |
          (
            (
              1 - (
                sum(rate(http_requests_total{
                  service="checkout",
                  status!~"5.."
                }[1h]))
                /
                sum(rate(http_requests_total{
                  service="checkout"
                }[1h]))
              )
            ) / 0.001
          ) > 14
          and
          (
            (
              1 - (
                sum(rate(http_requests_total{
                  service="checkout",
                  status!~"5.."
                }[5m]))
                /
                sum(rate(http_requests_total{
                  service="checkout"
                }[5m]))
              )
            ) / 0.001
          ) > 14
        # 'for' duration is deliberately short here — fast burn needs fast detection.
        # The multi-window expression does the noise filtering,
        # so a short 'for' doesn't create false positives.
        for: 2m
        labels:
          severity: critical
          alert_type: slo_burn
          service: checkout
        annotations:
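          summary: "Checkout SLO: fast burn, 30-day error budget exhausts in ~51 hours at current rate"
          runbook_url: "https://runbooks.internal/checkout-slo-burn"
          # Prometheus templates have no arithmetic helpers (no `div`), so the
          # hours-to-exhaustion division is done with the `query` template function.
          description: |
            Burn rate is {{ $value | printf "%.1f" }}x sustainable.
            At this rate, your monthly error budget exhausts in approximately
            {{ with printf "vector(720 / %f)" $value | query }}{{ . | first | value | printf "%.0f" }}{{ end }} hours.
            Immediate investigation required: this is SLA-threatening.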
          dashboard_url: "https://grafana.internal/d/slo-error-budget?var-service=checkout"

      # ──────────────────────────────────────────────────────────
      # SLOW BURN — 3x rate (budget exhausted in ~10 days)
      # Serious but not drop-everything. Investigate today.
      # Multi-window: 6h (confirms sustained trend, not a spike) AND
      #               30m (confirms it's still happening, not historical).
      # ──────────────────────────────────────────────────────────
      - alert: CheckoutSLOSlowBurn
        expr: |
          (
            (
              1 - (
                sum(rate(http_requests_total{
                  service="checkout",
                  status!~"5.."
                }[6h]))
                /
                sum(rate(http_requests_total{
                  service="checkout"
                }[6h]))
              )
            ) / 0.001
          ) > 3
          and
          (
            (
              1 - (
                sum(rate(http_requests_total{
                  service="checkout",
                  status!~"5.."
                }[30m]))
                /
                sum(rate(http_requests_total{
                  service="checkout"
                }[30m]))
              )
            ) / 0.001
          ) > 3
        # Longer 'for' here — slow burn is a trend, not an event.
        # We want to be certain before alerting at warning severity.
        for: 15m
        labels:
          severity: warning
          alert_type: slo_burn
          service: checkout
        annotations:
          summary: "Checkout SLO: slow burn — 30-day error budget exhausts in ~10 days at current rate"
          runbook_url: "https://runbooks.internal/checkout-slo-burn"
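          # Prometheus templates have no `div` helper; the `query` template
          # function performs the hours-to-exhaustion division instead.
          description: |
            Burn rate is {{ $value | printf "%.1f" }}x sustainable.
            Current trajectory exhausts monthly budget in approximately
            {{ with printf "vector(720 / %f)" $value | query }}{{ . | first | value | printf "%.0f" }}{{ end }} hours.
            Investigate during business hours: this does not require waking anyone up.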
          dashboard_url: "https://grafana.internal/d/slo-error-budget?var-service=checkout"
Output
# Fast burn alert fires during an incident at 03:17 AM:
ALERT CheckoutSLOFastBurn
Labels:
alertname = CheckoutSLOFastBurn
severity = critical
alert_type = slo_burn
service = checkout
Annotations:
summary = Checkout SLO: fast burn, 30-day error budget exhausts in ~51 hours at current rate
description = Burn rate is 18.3x sustainable. At this rate, your monthly error budget
exhausts in approximately 39 hours. Immediate investigation required.
runbook_url = https://runbooks.internal/checkout-slo-burn
dashboard_url = https://grafana.internal/d/slo-error-budget?var-service=checkout
State: firing
ActiveAt: 2026-03-15T03:17:44Z
# Simultaneously in Prometheus UI:
# CheckoutSLOSlowBurn: pending or firing once the 6h and 30m windows exceed 3x
# (Prometheus knows nothing about inhibition. Alertmanager suppresses the slow burn
# warning notification while the fast burn critical alert is active for the same service.)
# What the on-call engineer sees in PagerDuty:
# - Alert name + summary
# - Current burn rate: 18.3x
# - Time until budget exhaustion: ~39 hours
# - Direct link to the SLO error budget dashboard
# - Runbook link
# They know immediately: this is real, it's SLA-threatening,
# and here's where to start investigating.
Interview Gold: Why Multi-Window?
Single-window burn rate alerts have a specific failure mode that comes up in senior engineering interviews: a short, severe spike pushes the 1-hour burn rate over the fast-burn threshold, and it stays there for up to an hour after the service has recovered, because the errors are still inside the window. The alert keeps firing, the on-call engineer investigates and finds everything healthy: false page, trust eroded. Multi-window alerting (1h AND 5m both elevated simultaneously) requires the problem to be both significant and still happening. The 1-hour window shows that a meaningful share of budget has burned; the 5-minute window drops back below threshold within minutes of recovery, so the alert clears instead of paging on stale errors. The detection delay cost is minimal. The false positive reduction is substantial. That trade-off, explained clearly, is a reliable differentiator in SRE interviews.
Production Insight
Monthly alert audits consistently take 30 minutes but eliminate 10+ hours of false pages per month when run with data in hand and a bias toward deleting.
Alerts that fire 3+ times in 30 days without a fix are not monitoring — they're acknowledged technical debt that your team has decided is cheaper to tolerate than to fix. Make that decision explicit by creating a ticket.
Runbooks written after an incident are 10x more accurate than runbooks written before one — they reflect what actually happens, not what was designed to happen.
Rule: if the same alert pages you twice without a fix, create the ticket before it pages you a third time. Three times is a pattern you chose.
Key Takeaway
Runbooks start with three numbered steps: confirm it's real, apply the known mitigation, escalate on explicit criteria. Architecture context belongs at the bottom, not the top.
SLO burn rate alerting ties pages to your actual reliability commitments — it makes the question 'should we page someone?' a function of business risk, not arbitrary thresholds.
The monthly audit is your highest-leverage operational practice — 30 minutes of honest deletion saves your team more than any dashboard improvement.
Monthly Alert Audit Decision Tree
If: Alert fired 3+ times in 30 days with no code fix or configuration change
Use: Create a reliability debt ticket with a sprint assignment. Suppress the alert using amtool with the ticket URL as the required comment. Do not let it page for a fourth time without a plan.
If: Alert fired but there is no associated runbook or the runbook is empty
Use: Block that alert from paging until a 3-step runbook exists: confirm it's real, apply known mitigation, escalation criteria. An alert without a runbook is a liability, not monitoring.
If: SLO error budget is below 50% at the calendar midpoint of the month
Use: Treat as urgent: you're on a trajectory to breach. Freeze non-critical deployments, investigate the top error contributors, and lower the burn rate threshold for warning-severity alerts so further burn surfaces sooner. Review progress at the SRE team standup.
If: Monthly audit shows 80%+ of alerts were non-actionable (acknowledged and closed with no action taken)
Use: The alert system needs a reset, not a trim. Delete everything. Rebuild from the last six months of actual incidents. Write alerts only for the problems that happened, with the thresholds calibrated to the values seen during those incidents.

Alert Routing, Deduplication, and Suppression — The Plumbing That Makes It Work

Routing is where most alerting systems silently break in ways that are invisible until an incident proves it. Prometheus fires correctly. Alertmanager receives the alert. The alert routes to the wrong team — or to nobody at all — and the incident goes undetected. This failure is particularly dangerous because everything looks healthy: the alert fired, Alertmanager processed it, PagerDuty shows the service as active. The failure is in the gap between those systems, specifically in a label that's missing or mismatched.

Every alert label you add is implicitly a routing decision. The severity label determines which receiver handles the alert — PagerDuty for critical, Slack for warning. The team label determines which team's escalation policy is invoked. The service label enables inhibition rules and deduplication. Get any of these wrong and the alert either goes to the wrong destination or routes to Alertmanager's default receiver, which most teams configure as a catch-all Slack channel that nobody monitors at 3 AM.

Deduplication is Alertmanager's mechanism for not paging you multiple times for the same underlying problem. Two Prometheus instances — for example, a primary and a replica, or two regional scrape targets — will both fire the same alert when a metric breaches. Alertmanager groups them by label identity into a single notification. This is elegant when it works. The failure mode: two alerts that describe the same problem but have different labels — perhaps because one alert uses service="checkout" and another uses service="checkout-api" — are treated as separate alerts and generate separate pages. Label consistency across your alert rules is a more important operational practice than most teams realise, and it becomes critical when you have more than a handful of services.
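
A toy model makes the label-identity failure easy to see. The sketch below is a simplification, not Alertmanager's implementation: it only groups alerts by the values of the `group_by` labels, and the function name `group_alerts` is hypothetical.

```python
from collections import defaultdict


def group_alerts(alerts: list[dict],
                 group_by: tuple[str, ...] = ("alertname", "service")) -> dict:
    """Group alerts by the values of the group_by labels (missing label = empty)."""
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


alerts = [
    {"alertname": "HighErrorRate", "service": "checkout", "replica": "a"},
    {"alertname": "HighErrorRate", "service": "checkout", "replica": "b"},
    # Same underlying problem, but the rule author wrote a different service name.
    {"alertname": "HighErrorRate", "service": "checkout-api"},
]
for key, members in group_alerts(alerts).items():
    print(key, len(members))
```

The two replica alerts collapse into one group, while the `checkout-api` alert lands in a group of its own and becomes a second notification for the same incident.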

Suppression windows are your surgical tool for planned maintenance. Unlike blanket silences — which are blunt instruments that suppress all alerts from a service regardless of type — proper suppression targets specific alert names for specific time windows. The rule is simple and should be enforced at the tooling level: every silence requires a comment with a linked ticket URL, and every silence auto-expires within 4 hours maximum. If your maintenance window is longer than 4 hours, renew the silence manually. This creates intentional checkpoints where someone has to actively decide that the silence should continue.
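
Enforcing the silence rule "at the tooling level" could look like a small validator in front of whatever creates silences. This is a sketch under assumptions: the function name `silence_violations` is hypothetical, and it takes "ticket URL" literally by requiring an http(s) link in the comment.

```python
import re
from datetime import timedelta

MAX_SILENCE = timedelta(hours=4)
TICKET_URL = re.compile(r"https?://\S+")


def silence_violations(matchers: dict, duration: timedelta, comment: str) -> list[str]:
    """Return policy violations for a proposed silence; an empty list means allowed."""
    problems = []
    if "alertname" not in matchers:
        problems.append("blanket silence: add an alertname matcher")
    if duration > MAX_SILENCE:
        problems.append("duration exceeds 4h: renew manually instead")
    if not TICKET_URL.search(comment):
        problems.append("comment must contain a ticket URL")
    return problems


# The 7-day, service-wide, 'noisy' silence fails all three checks.
print(silence_violations({"service": "checkout"}, timedelta(hours=168), "noisy"))
```

Returning every violation at once, rather than stopping at the first, tells the engineer in one pass what a compliant silence looks like.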

Inhibition rules solve a specific and common problem: when a critical alert fires, you don't want five additional warning alerts for the same service all generating pages simultaneously. Alertmanager inhibition rules suppress lower-severity alerts when a higher-severity alert is already active for the same service. The result is one page for one incident instead of five pages for five symptoms of the same root cause. This is the difference between an on-call engineer who wakes up to a single clear incident and one who wakes up to a flood of notifications that obscures which one to start with.
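
The matching logic behind an inhibition rule is compact. The sketch below is a simplified model for reasoning about a rule, not Alertmanager's code; `is_inhibited` and its parameters are illustrative names that mirror the `source_match`, `target_match`, and `equal` fields of a rule.

```python
def is_inhibited(target: dict, active_alerts: list[dict], source_match: dict,
                 target_match: dict, equal: tuple[str, ...]) -> bool:
    """True if some active alert matching source_match suppresses target."""
    if any(target.get(k) != v for k, v in target_match.items()):
        return False  # the rule does not apply to this alert at all
    for source in active_alerts:
        if source is target:
            continue
        if any(source.get(k) != v for k, v in source_match.items()):
            continue
        # Labels in `equal` must agree; a label missing on both sides counts as equal.
        if all(source.get(name, "") == target.get(name, "") for name in equal):
            return True
    return False


rule = dict(source_match={"severity": "critical"},
            target_match={"severity": "warning"},
            equal=("service", "team"))
critical = {"alertname": "CheckoutSLOFastBurn", "severity": "critical", "service": "checkout"}
warning = {"alertname": "CheckoutLatencyHigh", "severity": "warning", "service": "checkout"}
print(is_inhibited(warning, [critical, warning], **rule))
```

The `equal` list is what stops a checkout outage from hiding a payments warning: both alerts must carry the same service (and team) before the warning is suppressed.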

One important operational practice that often gets skipped: test your routing after any label change. When a service is renamed, when an alert rule is refactored, when a team restructuring changes the team label values — routing breaks silently. amtool config routes test with a set of representative alert labels takes about 90 seconds and catches misroutes before an incident does. Add it to your CI pipeline as a validation step any time alertmanager.yml or alert rule files change.
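
A CI step for this can be a thin wrapper around amtool. The sketch below assumes `amtool` is on the PATH and that your version supports the `--verify.receivers` flag of `amtool config routes test`; the helper names and the case matrix are hypothetical.

```python
import subprocess


def routes_test_command(config_file: str, expected_receiver: str, labels: dict) -> list[str]:
    """Build an `amtool config routes test` invocation that fails on a misroute."""
    label_args = [f"{name}={value}" for name, value in sorted(labels.items())]
    return ["amtool", "config", "routes", "test",
            f"--config.file={config_file}",
            f"--verify.receivers={expected_receiver}",
            *label_args]


def check_routes(config_file: str, cases: list[tuple[str, dict]],
                 run=subprocess.run) -> list[str]:
    """Run every case; return the expected receivers whose routing check failed."""
    failures = []
    for expected_receiver, labels in cases:
        result = run(routes_test_command(config_file, expected_receiver, labels))
        if result.returncode != 0:
            failures.append(expected_receiver)
    return failures


CASES = [
    ("pagerduty-critical", {"severity": "critical", "service": "checkout"}),
    ("slack-warnings", {"severity": "warning", "service": "checkout"}),
]
print(routes_test_command("alertmanager.yml", *CASES[0]))
```

In CI you would call `check_routes("alertmanager.yml", CASES)` and fail the build if the returned list is non-empty. The `run` parameter exists so the wrapper can be unit-tested without amtool installed.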

alertmanager_routing.yml (YAML)
# alertmanager_routing.yml
# Load via: alertmanager --config.file=alertmanager.yml
#
# Validate before reloading:
#   amtool check-config alertmanager.yml
#
# Reload without restart (Alertmanager supports hot reload):
#   curl -X POST localhost:9093/-/reload

global:
  resolve_timeout: 5m
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

# ──────────────────────────────────────────────────────────────
# INHIBITION RULES
# When a critical alert is firing for a service,
# suppress all warning-severity alerts for the same service.
#
# Without this: a database outage causes 8 separate alerts
# (latency warning, error rate warning, availability warning, etc.)
# and the on-call engineer gets 8 pages for one incident.
# With this: they get 1 page — the critical alert that matters.
#
# 'equal' defines which labels must match for inhibition to apply.
# 'service' AND 'team' must both match to avoid cross-team suppression.
# ──────────────────────────────────────────────────────────────
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    # Both labels must match — inhibiting checkout critical
    # should not suppress payments warning.
    equal: ['service', 'team']

# ──────────────────────────────────────────────────────────────
# ROUTE TREE
# Routes are evaluated top-to-bottom, first match wins.
# The 'default' route at the root catches anything that
# doesn't match a specific child route.
#
# Key timing parameters:
# group_wait:      How long to buffer alerts before sending the first notification.
#                  30s for default (batch similar alerts). 10s for critical (act fast).
# group_interval:  How long to wait before sending updates to an existing alert group.
# repeat_interval: How long to wait before re-notifying about an unresolved alert.
# ──────────────────────────────────────────────────────────────
route:
  receiver: default-slack
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    # Critical severity → immediate PagerDuty page
    # Short group_wait (10s) minimises detection-to-page delay
    - match:
        severity: critical
      receiver: pagerduty-critical
      group_wait: 10s
      repeat_interval: 1h

    # Fast-burn SLO alerts → always page regardless of time of day.
    # Separate route ensures a fast burn still pages if someone
    # misconfigures its severity. It matches on alertname rather than
    # alert_type so the warning-severity slow burn falls through to Slack.
    - match_re:
        alertname: '.*SLOFastBurn'
      receiver: pagerduty-critical
      group_wait: 10s
      repeat_interval: 1h

    # Warning severity → Slack notification, no page
    # Engineers review these during business hours
    - match:
        severity: warning
      receiver: slack-warnings
      group_wait: 5m
      repeat_interval: 4h

# ──────────────────────────────────────────────────────────────
# RECEIVERS
# Each receiver defines a notification integration.
# The PagerDuty receiver requires a service integration key
# unique to each service — do not share keys across services,
# it makes incident routing in PagerDuty impossible to untangle.
# ──────────────────────────────────────────────────────────────
receivers:
  - name: default-slack
    slack_configs:
      - channel: '#alerts-default'
        send_resolved: true
        title: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'

  - name: pagerduty-critical
    pagerduty_configs:
      # Events API v2 integration key (matches the v2 pagerduty_url above).
      - routing_key: '<YOUR_PAGERDUTY_INTEGRATION_KEY>'
        # Pass severity through to PagerDuty for UI triage.
        # CommonLabels, not GroupLabels: severity is not in group_by,
        # so .GroupLabels.severity would render empty.
        severity: '{{ .CommonLabels.severity }}'
        # Include runbook and dashboard links in PagerDuty incident details
        details:
          runbook: '{{ (index .Alerts 0).Annotations.runbook_url }}'
          dashboard: '{{ (index .Alerts 0).Annotations.dashboard_url }}'

  - name: slack-warnings
    slack_configs:
      - channel: '#alerts-warnings'
        send_resolved: true
        title: '[WARNING] {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }} | {{ .Annotations.runbook_url }}{{ end }}'

# ──────────────────────────────────────────────────────────────
# SILENCE MANAGEMENT — use amtool, never the UI for production
#
# Good: time-bounded, ticket-linked, specific alert name
# amtool silence add \
#   alertname="CheckoutLatencyHigh" \
#   service="checkout" \
#   --duration=4h \
#   --comment="TICKET-1234: Planned database migration window"
#
# Bad: blanket silence, no ticket, long duration
# amtool silence add \
#   service="checkout" \
#   --duration=168h \
#   --comment="noisy"
# ↑ This is the pattern from the production incident above.
#   It hid a 4-hour outage. Never do this.
# ──────────────────────────────────────────────────────────────
Output
# Validating config:
$ amtool check-config alertmanager.yml
Checking 'alertmanager.yml'  SUCCESS
Found:
 - global config
 - route
 - 1 inhibit rules
 - 3 receivers
 - 0 templates
# Testing routing for a critical SLO burn alert:
$ amtool config routes test \
alertname=CheckoutSLOFastBurn \
severity=critical \
alert_type=slo_burn \
service=checkout
pagerduty-critical
# Correct — critical SLO alert routes to PagerDuty.
# Testing routing for a warning alert (should NOT page):
$ amtool config routes test \
alertname=CheckoutLatencyHigh \
severity=warning \
service=checkout
slack-warnings
# Correct — warning routes to Slack, not PagerDuty.
# Listing active silences:
$ amtool silence query --active
ID Matchers Ends At Creator Comment
a1b2c3d4-e5f6-7890-abcd-ef1234567890 alertname=CheckoutLatencyHigh,service=checkout 2026-03-15T07:00Z alice TICKET-1234: Planned migration
# Testing inhibition: if CheckoutSLOFastBurn (critical) is active,
# CheckoutLatencyHigh (warning, same service) should be inhibited.
# Verify via the Alertmanager API at /api/v2/alerts: the warning's
# status.state is "suppressed" and status.inhibitedBy is non-empty.
Think of Alertmanager as a Post Office
  • Alertmanager routes by label match — a missing 'team' label means the alert hits the default route, not your team's PagerDuty service. Test every label combination with amtool after any label change.
  • Deduplication groups alerts by identical labels — two alerts about the same problem with different labels (e.g. 'checkout' vs 'checkout-api') create two separate incidents. Label consistency is an operational requirement, not a style preference.
  • Suppression windows must be time-bounded (4h max), ticket-linked, and visible to the whole team. Blanket silences are the number one cause of undetected outages in teams with noisy alerting environments.
  • Inhibition rules let critical alerts suppress warnings for the same service — one page instead of five for the same incident. Without them, an on-call engineer wakes up to a flood of notifications that all point at the same root cause.
Production Insight
40% of PagerDuty misroutes in teams with more than 20 alert rules trace back to a single missing or renamed label — usually discovered during an incident, not before it.
Blanket silences are the single most common cause of undetected outages in teams that have previously experienced high alert volumes.
Routing validation should be a CI check, not a manual verification step. amtool config routes test with a matrix of expected label combinations takes 2 minutes to configure and catches misroutes before production does.
Rule: every silence needs a ticket URL and a maximum 4-hour duration. Enforce this at the API level — a policy that isn't enforced technically isn't a policy, it's a suggestion.
Key Takeaway
Every alert label is a routing decision — get one label wrong and the page goes to the wrong team or nowhere.
Deduplication requires label identity — two alerts for the same problem with different labels generate two separate incidents.
Silences should be surgical, time-bounded, ticket-linked, and visible to every engineer on the team. Anything less is a liability.
Routing and Suppression Decision Tree
If: Alert fires in Prometheus but no PagerDuty page arrives
Use: Run 'amtool config routes test' with the exact labels from the alert. Check for active silences with 'amtool silence query --active'. Verify the PagerDuty integration key is valid and the service is not in maintenance mode.
If: Same incident generates 5+ separate pages for the same service
Use: Configure Alertmanager inhibition rules: critical alerts should suppress warnings for the same service. Also check group_by configuration: are you grouping by both alertname and service? Grouping by service alone can merge unrelated alerts.
If: Planned maintenance window approaching in the next 24 hours
Use: Create a targeted, time-bounded silence with amtool. Silence specific alert names, not the entire service. Set duration to the maintenance window plus 30 minutes buffer. Include the ticket URL in the comment field.
If: Multiple teams receive the same alert notification
Use: Alert labels are too broad for the current route tree. Add team-specific labels to route correctly. Verify with 'amtool config routes test' for each team's expected label set before deploying the change.

The Noise Budget — Why Your Pager Should Be Quiet 90% of the Time

If your on-call engineer's phone buzzes more than once a shift, you've already lost. Every alert is a tax on focus. A noisy pager trains people to ignore it. That's how real outages slip through.

Here's the hard truth: Most alerts are noise. CPU at 80% isn't an emergency. A five-minute latency spike isn't an outage. Your monitoring tools are cheap; your engineer's attention isn't.

Set a noise budget. Calculate your team's tolerance — maybe 10 alerts per week per rotation. Anything above that gets triaged. Either automate the response (self-healing) or kill the alert. If it fires but never requires human action for 30 days, it's not an alert — it's a metric. Graph it, don't page on it.
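
Both checks in this section, the weekly budget and the "fired but nobody had to act" rule, are easy to compute from paging history. A minimal sketch, assuming firings are exported as dicts with hypothetical `alertname`, `fired_at`, and `human_action` fields; the defaults are the 10-per-rotation and 30-day figures from the paragraph above.

```python
from datetime import datetime, timedelta


def over_budget(page_times: list[datetime], rotation_start: datetime,
                max_per_rotation: int = 10,
                rotation_length: timedelta = timedelta(days=7)) -> bool:
    """True if pages inside the rotation window exceed the noise budget."""
    rotation_end = rotation_start + rotation_length
    in_window = [t for t in page_times if rotation_start <= t < rotation_end]
    return len(in_window) > max_per_rotation


def metrics_not_alerts(firings: list[dict], now: datetime,
                       lookback_days: int = 30) -> set[str]:
    """Alert names that fired in the lookback window but never needed human action."""
    cutoff = now - timedelta(days=lookback_days)
    recent = [f for f in firings if f["fired_at"] >= cutoff]
    fired = {f["alertname"] for f in recent}
    needed_action = {f["alertname"] for f in recent if f["human_action"]}
    return fired - needed_action


start = datetime(2026, 3, 2)
pages = [start + timedelta(hours=i) for i in range(11)]
print(over_budget(pages, start))
```

Anything `metrics_not_alerts` returns is a candidate to move from the pager to a dashboard.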

Your goal: when the pager goes off, it's a goddamn emergency. Every time. If your on-call engineer can't remember the last time they got paged for something that wasn't a real issue, you've built trust. That's the only metric that matters.

NoiseBudgetPolicy.yml (YAML)
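# io.thecodeforge: devops tutorial
# Illustrative schema for a team-defined policy file; it is not the
# config format of Prometheus, Alertmanager, or PagerDuty.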

alerting_policy:
  name: production-noise-budget
  rotation: weekly
  budget:
    max_alerts_per_rotation: 10
    alert_types:
      - critical
      - warning
    exceptions:
      - incident_id: INC-2025-039  # sev-1 only
  auto_remediation:
    cpu_high:
      threshold: 85%
      action: scale_up_replicas
  dead_alert_cleanup:
    if_fires_without_ack: 7 days
    action: disable
Output
Budget: max 10 alerts/week. Dead alerts disabled after 7 days. Auto-remediation on CPU > 85%.
Production Trap:
Don't confuse 'monitoring' with 'alerting'. If you page on every metric blip, you're not monitoring — you're spamming. Kill 90% of your alerts. You'll thank me when you're not woken up for a 2-minute spike.
Key Takeaway
If your pager fires more than once per shift, you are burning your team's attention. Automate the response or delete the alert. Never tolerate noise.

The Handoff Protocol That Prevents the 3 AM Drop

The worst call you'll ever get is the one where the previous shift 'forgot' to mention the database connection pool is leaking. I've seen it. The on-call engineer who just got paged has no context, no runbook, and zero clue why the cluster is melting.

Stop treating handoffs like a handshake at a party. Make them surgical. Every shift change MUST include: a written summary of any ongoing issues, the current state of all active alerts, and a quick sync. No exceptions. If your team is distributed across time zones, write it in a shared doc — not Slack. Slack scrolls away. Docs don't.

Standardize. A handoff checklist — timestamp, alerts fired, actions taken, next steps. If you're not doing this, you're gambling. One missed detail can turn a 10-minute incident into a 2-hour post-mortem about process failure.

Here's the rule: The incoming engineer should be able to start debugging within 30 seconds of reading the handoff. If they can't, you failed. Automate the status dump — pull alert history, open incidents, and runbook state into a single handoff report. No excuses.

HandoffChecklist.yml · YAML
# io.thecodeforge - devops tutorial

handoff_protocol:
  required:
    - current_alert_count: 3
    - active_incidents:
        - INC-045: database_conn_pool_leak
    - actions_taken: "restarted primary replica, monitor for 30 mins"
    - next_steps: "check slow queries at 0600 UTC"
  automation:
    generate_report:
      trigger: shift_end
      sources:
        - pagerduty_incidents
        - runbook_current_state
      output: shared_drive/handoffs/2025-04-14.md
    sync_required: true  # 5-min Slack huddle or async doc
Output
Handoff report generated: shared_drive/handoffs/2025-04-14.md. Active incidents: 1. Sync required — yes.
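The automated status dump can be as simple as a function that renders the four checklist fields (timestamp, alerts fired, actions taken, next steps) into one block of text. A minimal sketch follows; `render_handoff` and its parameters are hypothetical names, and in practice the inputs would be pulled from your incident tooling rather than passed by hand.

```python
from datetime import datetime, timezone


def render_handoff(timestamp, alerts_fired, actions_taken, next_steps,
                   active_incidents=()):
    """Render a plain-text handoff report from the checklist fields."""
    if not next_steps.strip():
        # A handoff with no next steps leaves the incoming engineer guessing.
        raise ValueError("next_steps must not be empty")
    lines = [
        f"Handoff {timestamp.isoformat()}",
        f"Alerts fired ({len(alerts_fired)}): " + (", ".join(alerts_fired) or "none"),
        "Active incidents: " + (", ".join(active_incidents) or "none"),
        f"Actions taken: {actions_taken}",
        f"Next steps: {next_steps}",
    ]
    return "\n".join(lines)


report = render_handoff(
    timestamp=datetime(2025, 4, 14, 6, 0, tzinfo=timezone.utc),
    alerts_fired=["checkout_error_rate"],
    actions_taken="restarted primary replica, monitor for 30 mins",
    next_steps="check slow queries at 0600 UTC",
    active_incidents=["INC-045: database_conn_pool_leak"],
)
print(report)
```

Refusing to render a report without next steps is a design choice in this sketch: it turns the "no exceptions" rule into something the tooling enforces.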
Senior Shortcut:
Write the handoff doc as you triage. Don't wait until the end of shift. Future you (or the person replacing you) will know exactly what you were thinking. It's cheap insurance against amnesia.
Key Takeaway
A handoff without a shared, timestamped doc is not a handoff — it's a wish. Automate the report. Every shift. No exceptions.
● Production incident · POST-MORTEM · severity: high

The Silenced PagerDuty Service That Hid a 4-Hour Outage

Symptom
Checkout service returning 500 errors for 4 hours. No PagerDuty alert fired. Discovery came via a customer support ticket that a customer escalated directly to the engineering Slack channel — the worst possible detection mechanism for a payment-critical service.
Assumption
The team assumed PagerDuty was healthy because the checkout service showed as 'active' in the PagerDuty dashboard. Active in the dashboard means the service exists and has an escalation policy — it says nothing about active silences. Nobody on the team knew to check the silence list because silences had always been temporary and personal. They'd never been treated as team-visible infrastructure.
Root cause
An engineer had created a PagerDuty silence rule matching all alerts from the checkout service with a 7-day duration. This happened on a Tuesday after their third consecutive night of false CPU alerts. Each one was the same: the checkout service's batch reconciliation job ran at 2 AM, pegged CPU to 85% for 90 seconds, and PagerDuty fired a P2. The engineer would acknowledge, see CPU returning to baseline, and go back to sleep. On the fourth night, they created a silence rule and went to bed for the rest of the week. The silence was entirely rational from their perspective. The CPU alerts were never actionable — they fired on a scheduled batch job that ran the same way every night. Nobody had ever fixed them because nobody had ever flagged them as wrong. They existed because a previous engineer added a CPU > 80% alert after a different incident two years prior, and the alert had silently outlived its usefulness. The silence expired the day after the outage ended. The team only discovered it existed when they pulled the PagerDuty audit log trying to understand why they weren't paged.
Fix
The immediate fix was removing the blanket silence and deleting every CPU-based alert from the checkout service. Those were replaced the same day with two symptom-based alerts: error rate above 5% for 2 minutes (critical, pages primary on-call) and p99 latency above 2 seconds for 5 minutes (warning, posts to Slack with a PagerDuty low-urgency incident).

The structural fix took two weeks. The team implemented a PagerDuty policy requiring all silence rules to include a comment field with a ticket URL. Silences without a linked ticket are blocked at the API level via a custom webhook that validates the comment format before allowing creation. Silence duration was capped at 4 hours maximum, enforced by the same webhook. A daily Slack digest was added showing all active silences across every service, visible to the whole engineering team in the #oncall-status channel.

The cultural fix took longer. The team ran a retrospective focused on why the CPU alert had existed for two years without anyone questioning it. The answer was that nobody felt empowered to delete an alert someone else had written. Deleting an alert felt like removing a safety mechanism, even if that mechanism had never once caught a real problem. They added an explicit team agreement: any alert that fires more than three times in 30 days without a corresponding code fix is automatically a backlog item, not a monitoring truth. The engineer who created the silence was not blamed. They were the person who finally made the failure mode visible.
Key lesson
  • Noisy alerts don't just waste time — they systematically train engineers to silence everything, including the alerts that matter. The silence was the correct response to a broken alerting system. Fix the system, not the behaviour.
  • Every silence rule must require a linked ticket and auto-expire within 4 hours maximum. Enforce this at the API level, not just as a policy — policies get forgotten under pressure.
  • Add a daily digest of active silences visible to the whole team. Silences should be as visible as deployments, not private workarounds buried in an individual's PagerDuty account.
  • If an engineer silences an entire service, treat it as a signal that your alert design has failed, not that the engineer made a mistake. The engineer is the symptom. The noise is the disease.
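The enforcement rule in the lessons above (ticket URL required, 4-hour cap) boils down to a small validation function that a webhook could call before accepting a silence. The sketch below is an assumption about how such a check might look, not the team's actual webhook: `validate_silence`, the URL pattern, and the example ticket URL are all hypothetical.

```python
import re
from datetime import datetime, timedelta

MAX_DURATION = timedelta(hours=4)
# Assumption: any http(s) URL counts as a ticket link; tighten to your tracker's format.
TICKET_URL = re.compile(r"https?://\S+")


def validate_silence(comment, starts_at, ends_at):
    """Return a list of policy violations; an empty list means the silence is allowed."""
    errors = []
    if not TICKET_URL.search(comment or ""):
        errors.append("comment must include a ticket URL")
    if ends_at - starts_at > MAX_DURATION:
        errors.append("duration exceeds 4 hours")
    return errors


start = datetime(2026, 1, 6, 2, 0)
# The silence from the incident: no ticket, 7 days long. Both rules reject it.
print(validate_silence("noisy cpu alerts", start, start + timedelta(days=7)))
# A compliant silence passes with no errors.
print(validate_silence("https://tickets.example.com/OPS-1", start,
                       start + timedelta(hours=4)))
```

Returning every violation at once, rather than failing on the first, means the engineer at 3 AM sees everything they need to fix in one rejection message.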
Production debug guide
Diagnose why your alerts aren't working before your next incident proves it
Symptom · 01
Alert fires but on-call engineer takes no action within 15 minutes
Fix
Check two things before blaming the engineer. First: does the alert have a runbook URL in its annotations? If not, the engineer is doing archaeology in the middle of an incident — they're not being slow, they're being handed an undocumented system at 3 AM. Add a runbook with exactly three numbered steps: (1) confirm the alert is real with a specific command or dashboard link, (2) apply the known mitigation if one exists, (3) escalation criteria and who to call. Second: is the alert actually actionable, or does it fire on something that resolves itself 80% of the time? If the engineer has learned to wait 5 minutes to see if it self-clears, your alert has already lost their trust.
Symptom · 02
Same alert fires 3+ times per week with no code fix shipped
Fix
This is reliability debt wearing a monitoring costume. Create a backlog ticket for the root cause — give it a severity label and a sprint assignment, not a 'someday' tag. Suppress the alert with amtool using the ticket URL as the required comment, and set a 4-hour expiry that must be renewed manually. If the ticket keeps getting deprioritised sprint after sprint, escalate it: an alert that fires three times per week without a fix is costing your team 15+ engineer-hours per month in interrupted sleep and cognitive context-switching. Put that number in the ticket.
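To put a number in the ticket, the arithmetic is just pages per month times the time each page costs. The sketch below reproduces a figure in the 15-hour range under one loud assumption: roughly 1.2 engineer-hours lost per page (wake-up, triage, lost sleep, context switch). That per-page figure is illustrative and not from any measurement; substitute your own.

```python
WEEKS_PER_MONTH = 52 / 12  # about 4.33


def monthly_cost_hours(pages_per_week, hours_lost_per_page):
    """Engineer-hours per month consumed by a recurring alert."""
    return pages_per_week * WEEKS_PER_MONTH * hours_lost_per_page


# Three pages per week is about 13 pages per month.
# hours_lost_per_page=1.2 is an assumed figure, not a measured one.
cost = monthly_cost_hours(pages_per_week=3, hours_lost_per_page=1.2)
print(f"{cost:.1f} engineer-hours per month")
```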
Symptom · 03
Engineer creates a blanket silence covering all alerts from a service
Fix
Investigate immediately — not to discipline the engineer, but because this is the most reliable leading indicator of alert fatigue reaching critical mass. Pull the alert history for that service for the last 30 days. Count how many alerts fired, how many were acknowledged within 5 minutes, and how many had documented mitigations. The pattern will show you exactly where your alert design broke down. Review every alert for that service against the 3 AM test and delete the non-actionable ones before lifting the silence. Lifting the silence without fixing the alerts just restarts the countdown to the next silence.
Symptom · 04
Escalation to secondary on-call happens more than once per week
Fix
This is your canary for primary alert volume being too high. Secondary escalation should be reserved for genuine unavailability — phone died, emergency, deep sleep during an off-peak window — not for situations where the primary engineer got paged 4 times in 3 hours and is physically unable to acknowledge fast enough. Pull the incident timeline: if escalations cluster between midnight and 6 AM, you have a volume problem. If they're spread throughout the day, you may have a coverage gap. Either way, audit alert count per shift before touching the rotation. Target: no more than 2 actionable pages per 12-hour shift, per Google SRE guidance.
Symptom · 05
On-call handoff meeting has no documentation from the previous rotation
Fix
Institute a mandatory handoff template enforced by your incident management tooling, not by trust. The template has four fields: alerts fired this week (count and names), actions taken (what was mitigated and how), follow-up tickets created (with links), and anything the next on-call needs to know that isn't in a ticket yet. The last field is where institutional knowledge lives or dies. Without it, the same database quirk that surprised the outgoing engineer will surprise the incoming one two weeks later. Make completion of the handoff template a requirement before the rotation transfer confirms in PagerDuty.
Symptom · 06
SLO burn rate alert fires but team doesn't know the current error budget status
Fix
The alert is doing its job — the dashboard isn't. A burn rate alert tells you the rate of consumption has exceeded a threshold. The dashboard needs to show you three things at a glance: how much budget you started the month with (in absolute error count and percentage), how much you've consumed so far, and at the current burn rate, when will you exhaust it. Engineers who don't have that context when the alert fires will either over-respond (treating a slow-burn warning as a full incident) or under-respond (dismissing a fast-burn critical as a blip). The dashboard is what turns the alert from a notification into a decision.
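The three dashboard numbers follow directly from the SLO definition. Here is a minimal sketch of the arithmetic, assuming traffic is roughly constant over the SLO period and that a burn rate of 1 means the budget runs out exactly at the end of the period. The function name and the example figures are hypothetical.

```python
def budget_status(slo, period_hours, total_requests, failed_requests, burn_rate):
    """Return (allowed errors, fraction of budget consumed, hours until exhaustion)."""
    budget_errors = (1 - slo) * total_requests   # errors the SLO allows this period
    consumed = failed_requests / budget_errors   # fraction of the budget already spent
    remaining = max(0.0, 1 - consumed)
    if burn_rate <= 0:
        return budget_errors, consumed, float("inf")
    # A full budget lasts period_hours / burn_rate; scale by what is left.
    hours_left = remaining * period_hours / burn_rate
    return budget_errors, consumed, hours_left


# Example: 99.9% SLO over 30 days, 10M expected requests, 4,000 failures so far,
# currently burning at 14.4x the sustainable rate.
budget, consumed, hours_left = budget_status(
    slo=0.999, period_hours=720, total_requests=10_000_000,
    failed_requests=4_000, burn_rate=14.4,
)
print(f"budget={budget:.0f} errors, consumed={consumed:.0%}, exhausted in {hours_left:.0f}h")
```

With these inputs, 40% of the budget is gone and the rest lasts about 30 hours at the current burn, which is exactly the context an engineer needs to decide between "incident now" and "ticket for the morning".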
★ On-Call Quick Debug Cheat Sheet
Fast diagnostics for alerting pipeline issues. Commands assume Prometheus, Alertmanager, and PagerDuty. Run these in order: each one narrows the failure surface before you need the next.
Alert not firing despite metric breach visible in Prometheus UI
Immediate action
Check Prometheus rule evaluation state — the metric being visible in the UI and the alerting rule evaluating correctly are two separate things
Commands
promtool check rules /etc/prometheus/rules/*.yml
curl -s localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | select(.state=="inactive") | {name: .name, lastError: .lastError}'
Fix now
Verify the 'for' duration isn't longer than the breach window — if your metric has only been above threshold for 90 seconds and your 'for' is 2m, the alert is in pending state, not firing. Check for label mismatches between the alert expression and the target metric: a single label difference between what your rule selects and what your metric exports results in zero matches and a permanently inactive alert. Use the Prometheus UI's Table view on the rule expression to confirm it returns data before assuming the rule logic is correct.
Alert fires in Prometheus but no PagerDuty page arrives
Immediate action
Check Alertmanager routing and active silences — these are the two most common gaps between a firing alert and a delivered page
Commands
curl -s localhost:9093/api/v2/alerts | jq '.[] | {alertname: .labels.alertname, status: .status.state, silencedBy: .status.silencedBy, receivers: [.receivers[].name]}'
amtool silence query
Fix now
Verify alert labels match a route in alertmanager.yml — copy the labels from the first command's output and run amtool config routes test with them explicitly. If the receivers field shows 'default-slack' when you expected 'pagerduty-critical', a label is missing or mismatched in your alert rule. Check for active silences that might be suppressing the alert by label match even if the silence wasn't created for this specific alert. Confirm the PagerDuty integration key is correct and the service is not in maintenance mode in the PagerDuty UI.
On-call engineer doesn't receive the PagerDuty page
Immediate action
Check notification rules and contact methods for the specific user — don't assume the account is configured correctly just because it exists
Commands
curl -s -H "Authorization: Token token=$PD_TOKEN" https://api.pagerduty.com/users/$USER_ID/notification_rules | jq '.notification_rules[]'
curl -s -H "Authorization: Token token=$PD_TOKEN" https://api.pagerduty.com/users/$USER_ID/contact_methods | jq '.contact_methods[]'
Fix now
Ensure push notification, SMS, and phone call are all configured for high-urgency incidents — do not rely on push notification alone, because phones in Do Not Disturb mode will block it. Verify the phone number on the account is current — engineers who change phones often forget to update PagerDuty. Test the full chain with a manual incident trigger before declaring it fixed, not just by checking the configuration looks correct. If the engineer uses the PagerDuty mobile app, verify the app has notification permissions on their specific device.
Too many pages per shift: engineer alert fatigue is approaching critical
Immediate action
Quantify before acting — you need data to make the right cuts, not the most recent cuts
Commands
curl -s -H "Authorization: Token token=$PD_TOKEN" "https://api.pagerduty.com/incidents?since=$(date -d '7 days ago' +%Y-%m-%d)&until=$(date +%Y-%m-%d)&limit=100" | jq '.incidents | length'
curl -s localhost:9093/api/v2/alerts | jq '[.[] | .labels.alertname] | group_by(.) | map({alert: .[0], count: length}) | sort_by(-.count)'
Fix now
If more than 10 pages per week, run a focused alert audit before the next rotation starts; do not wait for the monthly audit cycle. The second command ranks alert names by how many instances are active right now, which is a proxy for firing frequency rather than a full history. Start at the top and apply the 3 AM test to each one: can the on-call engineer take a concrete action within 5 minutes? If no, delete it or move it to a dashboard. Do not loosen thresholds as a first move, because that just delays the same alert. Delete or redesign first, then verify volume drops before declaring success.
Threshold-Based vs SLO Burn Rate Alerting
Aspect | Threshold-Based Alerting | SLO Burn Rate Alerting
What it monitors | Raw metric value at a point in time (e.g. error rate > 1%) | Rate of error budget consumption across a rolling time window
Alert volume | High: fires on any metric breach, including brief self-correcting spikes | Low: requires sustained budget impact across multiple time windows simultaneously
False positive rate | High: transient spikes, batch jobs, autoscaling events all trigger pages | Low: multi-window filtering requires sustained degradation, not momentary breaches
Business alignment | Poor: a 1% error rate threshold has no direct relationship to your SLA | Excellent: directly tied to whether you'll breach your reliability commitment to users
Complexity to implement | Low: one PromQL expression per alert, straightforward to write | Medium: requires a defined SLO, burn rate math, and multi-window expressions
Best for | Early-stage services, simple infrastructure monitoring, teams new to observability | Production services with defined SLOs, teams with mature observability practices
Recovery detection | Alert resolves immediately when metric drops below threshold | Alert resolves when burn rate normalises; may persist briefly after the incident resolves
On-call experience | Poor: engineers get woken for self-healing spikes they can't affect and don't understand | Good: every page represents a genuine, sustained threat to a business commitment
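The multi-window condition in the right-hand column is easier to see as code than as prose. The sketch below uses the commonly cited pairing of a 1-hour and a 5-minute window with a burn-rate factor of 14.4; the function names and example error ratios are illustrative, and a real deployment would express this as a PromQL rule rather than Python.

```python
def burn_rate(error_ratio, slo):
    """How many times faster than sustainable the error budget is being spent."""
    return error_ratio / (1 - slo)


def should_page(long_window_ratio, short_window_ratio, slo, factor=14.4):
    """Page only if BOTH windows burn faster than the factor.

    The long window proves the problem is sustained; the short window
    proves it is still happening right now.
    """
    return (burn_rate(long_window_ratio, slo) > factor
            and burn_rate(short_window_ratio, slo) > factor)


SLO = 0.999  # 99.9% availability target
print(should_page(0.02, 0.03, SLO))     # True: sustained and ongoing
print(should_page(0.02, 0.0005, SLO))   # False: already recovered
print(should_page(0.0005, 0.05, SLO))   # False: brief spike only
```

A plain threshold alert would have paged in all three cases. The second and third are the pages that train engineers to stop trusting the pager.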

Key takeaways

1. Alert on symptoms users feel (latency, error rate, availability), not causes engineers investigate. CPU, memory, and disk belong on dashboards where engineers look during business hours, not on pagers that wake people up at 3 AM.
2. The 'for' duration in Prometheus is your most powerful noise filter. A 2-minute minimum for critical alerts and 5-minute minimum for warnings eliminates the entire category of false pages from transient spikes and single bad scrape cycles.
3. SLO burn rate alerting with multi-window expressions dramatically reduces alert volume while improving signal quality. It only pages when your reliability commitment to users is genuinely threatened, not when an arbitrary metric threshold is briefly crossed.
4. The monthly alert audit (30 minutes, data in hand, bias toward deleting) is the single highest-leverage practice in on-call operations. An alert that fires three times without a fix is backlog work disguised as monitoring.
5. Every alert that pages someone must have a runbook. Not documentation about the system: a numbered response guide starting with 'confirm it's real,' followed by 'known mitigations,' followed by 'when and who to escalate to.' Architecture context belongs at the bottom.
6. Blanket silences are the number one cause of undetected outages in teams that have experienced alert fatigue. Enforce 4-hour maximum duration and mandatory ticket URLs at the API level: policies that aren't technically enforced are suggestions.

Common mistakes to avoid

6 patterns

Alerting on every metric that Prometheus exports by default

Symptom
50+ alerts firing weekly, most of them self-resolving within minutes. Engineers start to treat every page with suspicion — waiting to see if it clears before investigating. A real outage gets missed because the on-call engineer has learned that 70% of pages don't need a response. The alert system is working technically but has lost the team's trust operationally.
Fix
Run an alert audit before the volume gets this bad, but if you're already here, audit under pressure. Pull every alert that fired in the last 30 days. For each one, apply the 3 AM test: if this fired at 3 AM, would the on-call engineer know what concrete action to take within 5 minutes? If the answer is no — delete it today, not after discussion. Add alerts back only when an actual incident demonstrates that the metric predicted or explained a user-facing problem. Start from incidents, not from Prometheus's metric catalogue.

Setting 'for: 0m' on critical alerts

Symptom
A single bad Prometheus scrape cycle — which happens on virtually every production system at some frequency — triggers a P1 page at 3 AM. The on-call engineer investigates for 20 minutes, finds everything healthy, and goes back to sleep with slightly less trust in the system. After this happens three or four times, they start adding 'wait 5 minutes before looking at the laptop' as an informal personal policy — which is indistinguishable from ignoring the alert.
Fix
Set a minimum 'for' duration of 2 minutes for critical alerts and 5 minutes for warnings. The 'for' duration requires the metric to be in breach continuously for the specified time before the alert transitions from 'pending' to 'firing'. A single bad scrape affects one scrape interval, typically 15 to 60 seconds. A real incident lasts minutes to hours. The 2-minute minimum costs at most two minutes of detection delay on a real incident and eliminates the category of false pages that train engineers to distrust the system. If a team member argues 'but we need faster detection', the answer is: you need faster detection of real incidents, not faster false alarms.
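The pending-to-firing behaviour is easy to model. The sketch below is a simplified simulation, not Prometheus itself: it assumes one rule evaluation every 15 seconds and treats the alert as firing once the expression has been true continuously for at least the 'for' duration. The function name and sample series are hypothetical.

```python
def fires(breaches, interval_s, for_s):
    """Return True if the alert would reach 'firing' at any evaluation.

    breaches: one boolean per rule evaluation (True = expression is in breach).
    """
    active_since = None
    for i, breached in enumerate(breaches):
        t = i * interval_s
        if not breached:
            active_since = None      # streak broken: alert goes back to inactive
            continue
        if active_since is None:
            active_since = t         # first breach: alert becomes pending
        if t - active_since >= for_s:
            return True              # pending long enough: alert fires
    return False


one_bad_scrape = [False, True, False, False, False]
real_incident = [True] * 9           # in breach from t=0s through t=120s

print(fires(one_bad_scrape, interval_s=15, for_s=0))    # True: 'for: 0m' pages on a blip
print(fires(one_bad_scrape, interval_s=15, for_s=120))  # False: 'for: 2m' ignores it
print(fires(real_incident, interval_s=15, for_s=120))   # True: sustained breach still pages
```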

Writing runbooks that describe the system instead of the response

Symptom
On-call engineer opens the runbook during an active incident, reads two paragraphs about the service's architecture, a diagram of its database relationships, and a section on its deployment history — and then has to scroll to find the first actionable step. By the time they've oriented themselves in the document, 4 minutes have elapsed and they're still not sure what command to run.
Fix
Runbooks are incident tools, not architecture documents. Structure every runbook in a strict order: first, how to confirm the alert is real (one specific command or dashboard link, nothing more); second, the known mitigations in order of likelihood and simplicity; third, explicit escalation criteria and who to contact. Architecture context, service dependency diagrams, and historical incident notes belong at the bottom. The engineer responding to an incident should reach the first executable step within 30 seconds of opening the runbook. If they can't, the runbook is structured wrong.

Creating blanket silences without a ticket URL or expiration time

Symptom
An engineer silences all alerts from a service to survive a noisy rotation. The silence outlives its justification. A real outage occurs. No page fires. Discovery comes from a customer complaint 4 hours later. This is not a hypothetical — it's the production incident at the top of this article, and variations of it happen at most organisations that have experienced sustained alert fatigue.
Fix
Enforce silence hygiene at the API level, not just as a policy. Use a webhook that validates silence creation requests: silences without a ticket URL in the comment field are rejected. Maximum silence duration is 4 hours — enforced by the webhook, not by trust. Add a daily Slack digest of active silences visible to every engineer on the team. Treat any silence older than 4 hours without a renewal as a bug that requires investigation. The policy is simple: if you can't explain why a silence exists in one sentence and point to a ticket, it shouldn't exist.
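The daily digest is mostly formatting plus one check: is this silence older than the 4-hour limit? A minimal sketch, assuming each silence has already been fetched into a dict with `service`, `author`, `comment`, and `created_at` keys. Those field names, the function name, and the sample data are hypothetical and do not mirror any specific API response.

```python
from datetime import datetime, timedelta

MAX_AGE = timedelta(hours=4)


def silence_digest(silences, now):
    """Return one line per active silence, oldest first, flagging overdue ones."""
    lines = []
    for s in sorted(silences, key=lambda s: s["created_at"]):
        age = now - s["created_at"]
        hours = age.total_seconds() / 3600
        flag = " OVERDUE" if age > MAX_AGE else ""
        lines.append(
            f"{s['service']}: {hours:.1f}h old, by {s['author']}, {s['comment']}{flag}"
        )
    return lines


now = datetime(2026, 1, 6, 12, 0)
active = [
    {"service": "search", "author": "bob",
     "comment": "https://tickets.example.com/OPS-2",
     "created_at": datetime(2026, 1, 6, 10, 30)},
    {"service": "checkout", "author": "alice",
     "comment": "https://tickets.example.com/OPS-1",
     "created_at": datetime(2026, 1, 6, 2, 0)},
]
for line in silence_digest(active, now):
    print(line)
```

Sorting oldest first puts the silences most likely to have outlived their justification at the top of the channel message, where people actually read.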

Treating on-call as 'just part of the job' without compensation or meaningful support

Symptom
Senior engineers start leaving, citing 'work-life balance' in exit interviews when the real cause is three months of 3 AM false pages with no visible team investment in reducing them. The engineers who built the systems and know how to fix things quietly transfer to teams with better on-call culture. Institutional knowledge — the kind that lives in someone's head and not in any runbook — walks out with them.
Fix
Offer explicit on-call pay, time-off-in-lieu, or reduced sprint commitments during on-call weeks. But compensation alone doesn't fix this — it just makes the burden slightly more tolerable. Pair it with visible investment: show the team, in every sprint review, the alert volume trend, the audit results, the alerts that were deleted this month. Engineers who see the organisation treating alert noise as a real cost that deserves engineering resources stay longer than engineers who receive on-call pay for an experience that never gets better.

Not validating alert routing after changing Prometheus labels or service names

Symptom
Team renames a service from 'checkout' to 'checkout-service' during a refactoring sprint. Every alert rule is updated to use the new name. The Alertmanager route tree still matches on 'checkout'. All alerts for the service now route to the default Slack channel instead of PagerDuty. Nobody notices until an incident fires during the next rotation and the primary on-call engineer receives no page.
Fix
Add amtool config routes test to your CI pipeline as a required check on any change to alertmanager.yml or alert rule files. Maintain a test matrix of label combinations for each service that asserts the expected receiver. This catches misroutes in CI instead of in production. It takes about 90 seconds to configure and runs in under a second. The cost of not doing this is discovering a routing gap during an incident — at which point the gap becomes part of the incident timeline.
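The test matrix can live in a small script that CI runs. The sketch below builds one `amtool config routes test` invocation per row, using the `--config.file` and `--verify.receivers` flags, and counts non-zero exits as failures. The matrix rows, the `severity` label, the config path, and the warning-to-Slack mapping are assumptions for illustration; replace them with your own routing expectations.

```python
import subprocess

# (alert labels, expected receiver). Rows are illustrative assumptions.
ROUTE_MATRIX = [
    ({"service": "checkout-service", "severity": "critical"}, "pagerduty-critical"),
    ({"service": "checkout-service", "severity": "warning"}, "default-slack"),
]


def route_test_command(labels, receiver, config="alertmanager.yml"):
    """Build the amtool argument list for one routing assertion."""
    args = ["amtool", "config", "routes", "test",
            f"--config.file={config}", f"--verify.receivers={receiver}"]
    args += [f"{key}={value}" for key, value in sorted(labels.items())]
    return args


def run_matrix(matrix=ROUTE_MATRIX):
    """Run every assertion; return how many failed. Requires amtool on PATH."""
    failures = 0
    for labels, receiver in matrix:
        result = subprocess.run(route_test_command(labels, receiver))
        if result.returncode != 0:
            failures += 1
    return failures


print(" ".join(route_test_command(*ROUTE_MATRIX[0])))
```

In CI you would call `run_matrix()` and fail the build when it returns anything other than zero. The rename from 'checkout' to 'checkout-service' described above would have failed the first row before it ever reached production.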
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 (Senior): What's the difference between alerting on causes versus symptoms, and ca...
Q02 (Senior): How would you design an on-call rotation for a team of six engineers cov...
Q03 (Senior): A senior engineer proposes adding a CPU utilization alert at 80% thresho...
Q04 (Senior): Explain multi-window SLO burn rate alerting and why single-window burn r...
Q05 (Senior): Your team's alert volume has tripled after a major feature launch. Walk ...

Q01 of 05 (Senior)

What's the difference between alerting on causes versus symptoms, and can you give a concrete example of each from a web service context?

ANSWER
Cause-based alerting fires on infrastructure metrics that might indicate a problem is developing: CPU utilisation at 80%, memory pressure at 90%, disk filling up. Symptom-based alerting fires on what users actually experience: error rates above threshold, request latency exceeding an SLO target, failed health checks.

Concrete example: CPU at 80% is a cause alert. It might mean a problem is developing, or it might be a scheduled batch job, a GC cycle, or autoscaling warming up after a traffic spike. The CPU percentage alone gives the on-call engineer nothing actionable. 'Is the user impacted?' requires additional investigation before you even know whether the page warranted waking someone up.

The checkout API returning 500 errors to 5% of users is a symptom alert. Users are being impacted right now. The on-call engineer has a clear starting point: check recent deployments, look at downstream dependency health, check the error logs for the specific 500 pattern. The symptom tells you action is required and gives you a direction to start.

The practical test is what I call the 3 AM test: if this fires at 3 AM, can the on-call engineer take a concrete, directed action within 5 minutes? For CPU alerts, the answer is almost always 'investigate and see if anything else is wrong', which is investigation, not action. For symptom alerts on error rate or latency, the answer is 'yes, look at these specific things in this order.' That's the distinction that determines whether something belongs on a pager or a dashboard.
FAQ · 5 QUESTIONS

Frequently Asked Questions

01. How many alerts should a healthy on-call engineer receive per shift?
02. What's the difference between an SLO and an SLA, and why does it matter for alerting?
03. Should I use PagerDuty or OpsGenie for on-call routing?
04. How do you handle alerting during planned maintenance windows?
05. What's the right 'for' duration for a critical alert?