Intermediate 13 min · March 06, 2026

Alerting and On-call Best Practices

Alerting & On-Call — Why Silenced Services Hide Outages

Q: How many alerts should a healthy on-call engineer receive per shift?

Google's SRE book suggests a maximum of two incidents per 12-hour on-call shift as a starting target, with a long-term goal of automation handling the rest without human intervention. In practice, most mature teams aim for fewer than five actionable pages per week across the entire rotation. The operative word is actionable: pages that required the engineer to do something meaningful, not just acknowledge and close. If you count total pages including self-resolving ones, the number is meaningless. If the rotation is seeing more than five actionable pages per week, treat alert volume as a reliability incident with a dedicated sprint slot. It won't fix itself.

Q: What's the difference between an SLO and an SLA, and why does it matter for alerting?

An SLA — Service Level Agreement — is a contractual commitment to customers. Breach it and there are financial consequences: service credits, contract penalties, sometimes legal liability. An SLO — Service Level Objective — is your internal reliability target, intentionally set stricter than the SLA so your team has a buffer between 'we're getting close to the line' and 'we've crossed the line.' You alert on SLO burn rate so the team responds before the SLA is ever in danger. If you alert directly on SLA violation, you're already in breach when the page fires. The SLO is the early warning system; the SLA is the cliff edge. The whole point of SLO-based alerting is that you never need to see the cliff.
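As a rough sketch of what alerting on SLO burn rate can look like in Prometheus, the rule below pages when the error budget is being spent 14.4 times faster than a 30-day, 99.9% SLO allows, measured over both a 1-hour and a 5-minute window. The 99.9% target, the 14.4 factor (the fast-burn tier described in Google's SRE Workbook), the file name, and the alert name are assumptions for illustration. The `http_requests_total` metric, the `shopping-cart` labels, and the runbook URL are borrowed from the rule examples later in this article.

```yaml
# slo_burn_rate_alerts.yml (hypothetical file name)
# Assumes a 99.9% availability SLO over 30 days, so the error budget is 0.001.
groups:
  - name: slo_burn_rate
    rules:
      - alert: ShoppingCartErrorBudgetFastBurn
        # Long window (1h): has a meaningful share of the budget been spent?
        # Short window (5m): is the burn still happening right now?
        # Both ratios must exceed 14.4 x 0.001 before the alert goes pending.
        expr: |
          (
            sum(rate(http_requests_total{service="shopping-cart", status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total{service="shopping-cart"}[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{service="shopping-cart", status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="shopping-cart"}[5m]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
          team: checkout
          service: shopping-cart
        annotations:
          summary: "Shopping cart is burning its 30-day error budget 14.4x too fast"
          runbook_url: "https://runbooks.internal/shopping-cart-errors"
```

A sustained 14.4x burn uses about 2% of a 30-day budget in one hour. The short window is what lets the alert clear quickly once the errors stop, so the page tracks the incident rather than the tail of the 1-hour average.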

Q: Should I use PagerDuty or Opsgenie for on-call routing?

Both are mature, production-proven tools that support escalation policies, on-call schedules, and integrations with Prometheus, Grafana, Datadog, and most observability stacks. The honest answer is that the tool choice matters far less than the quality of alerts feeding into it and the runbooks attached to each incident type. A perfectly configured PagerDuty instance with 60 noisy CPU alerts will destroy your team's wellbeing just as effectively as a poorly configured one. If you're already on one platform and it works, the cost of switching rarely justifies the disruption. If you're choosing fresh: PagerDuty has a broader third-party integration ecosystem and more mature enterprise features; Opsgenie integrates more natively with Atlassian tooling, though Atlassian has announced it is retiring Opsgenie as a standalone product in favour of Jira Service Management, so check its current availability before committing. Pick based on your existing stack and team familiarity, then invest the savings in fixing your alert design.

Q: How do you handle alerting during planned maintenance windows?

Create a time-bounded, targeted silence in Alertmanager using amtool before the maintenance window begins. Silence specific alert names (the ones you know will fire during maintenance), not the entire service. Set the duration to the maintenance window plus a 30-minute buffer. Include the ticket URL in the comment field. After the window, verify the silence has expired with `amtool silence query` (it lists only unexpired silences by default; add `--expired` to see the ones that have ended) and confirm alerts are routing correctly again with a test page. Post the silence details in your team's standup channel so everyone knows it exists for the duration. The critical discipline: never create a silence matching all alerts from a service. If you don't know which specific alerts will fire during maintenance, you don't know what your maintenance will actually affect, and that's a different problem to solve before the window starts.

Q: What's the right 'for' duration for a critical alert?

At minimum: 2 minutes for critical alerts, 5 minutes for warnings. The 'for' duration is the time the metric must remain in continuous breach before the alert transitions from 'pending' to 'firing.' It filters transient spikes that self-resolve — a single bad scrape, a 15-second network glitch, a brief GC pause — without adding meaningful delay to detection of real incidents. Most production incidents persist for minutes to hours. The 2-minute minimum costs you nothing in real detection speed and eliminates the entire category of false pages that erode engineer trust in the paging system. If your argument for 'for: 0m' is that you need immediate detection, consider that you also need engineers to trust the alert when it fires. A 2-minute delay on detection is far less costly than engineers who've learned to wait 5 minutes before looking at the laptop.

A 7-day PagerDuty silence hid a 4-hour checkout outage.

Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Notes here come from systems that actually shipped.

✓ Production tested · Last updated July 27, 2026 · 1,713 articles, all by Naren

Before you start (⏱ 25 min)

✓ Solid grasp of DevOps fundamentals
✓ Comfortable with command-line tools
✓ Basic Linux administration knowledge


⚡ Quick Answer

Alert on symptoms users feel (latency, error rate, availability) — not causes engineers investigate (CPU, memory, disk)
The 'for' duration in Prometheus is your noise filter — 2-minute minimum for critical alerts eliminates self-healing false pages
SLO burn rate alerting with multi-window expressions cuts alert volume while improving signal quality
On-call rotations need minimum 4 engineers — secondary on-call is non-negotiable for any SLA-backed service
Every alert needs a runbook with 3 steps: confirm it's real, apply mitigation, escalation criteria
Monthly alert audits are the highest-leverage practice — if an alert fired 3+ times without a fix, it's backlog work, not a page

✦ Definition (~90s read)

What is Alerting and On-call?

Alerting and on-call is the operational practice of detecting when your production systems are in trouble and routing that signal to a human who can fix it. It exists because no system runs perfectly forever, and the gap between 'something broke' and 'someone knows' is where real outages happen.


The core tension is that every alert is a trade-off: too few and you miss incidents, too many and your team burns out ignoring noise. The entire discipline is about engineering that signal-to-noise ratio so that when your pager goes off, it actually matters.

Silenced services are the most common way teams accidentally hide outages. When an alert fires repeatedly for a known issue (a flaky dependency, a known deploy lag), someone silences it to stop the noise. But a silence behaves like a memory leak: the alert disappears from dashboards, from incident reviews, and from everyone's awareness. The outage continues, but nobody sees it.

The way to need fewer silences in the first place is the golden rule of monitoring: alert on symptoms (user-facing failures like high latency or error rates) rather than causes (CPU spikes, disk usage). Symptoms are what your customers feel; causes are implementation details that change as your system evolves.

Structuring on-call rotations that don't destroy your team means accepting that humans are bad at sustained vigilance. A common pattern for distributed teams is a follow-the-sun rotation with 8-12 hour shifts, backed by an escalation chain that guarantees a response within 5-15 minutes for critical alerts. Single-region teams more often run weekly shifts, which is the shape used in the examples later in this article.

Tools like PagerDuty, Opsgenie, and Grafana OnCall handle routing, deduplication, and suppression—the plumbing that prevents the same incident from paging five people at once. But the real sanity-saver is the noise budget: your pager should be quiet 90% of the time.

If it's not, you're not doing alerting—you're doing noise. The fix is an alert audit loop where every page triggers a runbook review and a decision: tune the threshold, suppress the alert, or fix the underlying issue.

Plain-English First

Imagine your house has a smoke alarm that goes off every time you make toast. After a week, you'd probably rip the battery out — and then miss a real fire. That's exactly what happens with poorly designed software alerts. Engineers get paged so often for things that fix themselves that they start ignoring everything. Good alerting means your alarm only sounds when the house is actually burning, so the person on duty actually pays attention when it does. And when it does fire, there's a clear note on the wall saying 'open the front door, grab the extinguisher, call the fire brigade if it's still going after two minutes' — not a 40-page manual about the history of residential fire safety.

At 3 AM, your phone screams. You scramble to your laptop, bleary-eyed, only to discover the alert fired because a CPU spike lasted four seconds and self-corrected before you even logged in. You've lost sleep over nothing — and this happens four nights a week.

I've seen this pattern at companies of every size. The specifics vary — sometimes it's CPU, sometimes it's memory, sometimes it's a disk utilization alert on a volume that autoscales — but the shape is always the same. Engineers get paged for things that didn't need human attention. They lose sleep, lose trust in the paging system, and eventually lose patience with the whole programme. The best engineers leave first, because they have options. The ones who stay start silencing things. And then the real fire happens.

This is not a monitoring problem. It's a culture problem that presents as a technical one. The technical symptoms are easy to diagnose: too many alerts, thresholds too low, no runbooks, no audit process. But the underlying culture problem is that most engineering teams treat adding alerts as free and removing alerts as risky. Every post-mortem ends with 'add monitoring.' Nobody's post-mortem ever ends with 'delete the alert that's been crying wolf for six months.' That asymmetry is how you get a paging system that cries wolf 40 times a week and then fails to wake anyone up when the real incident hits.

The root cause isn't that teams care too little about monitoring — it's that they add alerts reactively, after every incident, without ever pruning the ones that stop being useful. Over time, the alert system becomes a noise machine. Engineers stop trusting it, start silencing pages, and miss the signals that actually matter. The fix isn't more dashboards. It's disciplined, intentional alerting philosophy backed by concrete practices for thresholds, routing, escalation, and rotation design.

By the end of this article you'll know how to audit your existing alert stack and identify the ones that don't serve you, how to write alerts that fire on symptoms not causes, how to structure an on-call rotation that doesn't erode your team's wellbeing, and how to use real tooling — Prometheus alerting rules, PagerDuty routing logic, and runbook templates — to make all of it operational and repeatable. The goal isn't a perfect alerting system. It's a system your engineers trust enough to actually pay attention to.

Why Silenced Services Hide Outages

Alerting and on-call best practices define how a team detects, responds to, and resolves system anomalies with minimal human latency. The core mechanic is a tiered escalation pipeline: alerts route from automated detection (e.g., p99 latency > 500ms for 5 minutes) to a primary on-call engineer, then to a secondary, and finally to incident management if unacknowledged within a defined timeout (commonly 15 minutes).

In practice, effective alerting hinges on three properties: signal-to-noise ratio, actionable content, and escalation speed. A good alert fires only when human intervention is required — not for transient blips. Each alert must include the failing component, the observed vs. expected value, and a runbook link. Escalation must be automatic and time-bound; manual handoffs introduce minutes of delay.

Use these practices in any production system where downtime costs exceed the overhead of maintaining on-call rotations. They matter most for services with strict SLOs (e.g., 99.9% uptime) or where a single engineer cannot know every subsystem. Without them, teams experience alert fatigue, missed critical pages, and prolonged mean-time-to-acknowledge (MTTA).

⚠ Silence Is Not Resolution

A silenced alert does not fix the underlying issue — it only hides the symptom until the next, often worse, failure surfaces.

📊 Production Insight

A payment-processing service silenced a 'high 5xx rate' alert during a deployment because it fired every deploy. Three weeks later, a misconfigured load balancer caused 40% of requests to fail for 45 minutes before anyone noticed.

The exact symptom: the alert dashboard showed green because the silenced alert never escalated, and no new alert covered the specific error code.

Rule of thumb: never silence an alert for more than 24 hours without a linked ticket to fix the root cause; use temporary suppressions only during active incident response.
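One way to put a guardrail behind that rule of thumb is a low-severity Prometheus rule on Alertmanager's own silence count. The sketch below is a coarse proxy under stated assumptions: Prometheus must already scrape Alertmanager's `/metrics` endpoint (where `alertmanager_silences` is exported), and the file name, alert name, and `team` label are hypothetical. It cannot check whether a ticket is linked; it only prompts a human to look.

```yaml
# silence_hygiene_alerts.yml (hypothetical file name)
groups:
  - name: silence_hygiene
    rules:
      - alert: SilenceActiveOver24h
        # Becomes firing once at least one silence has been active at every
        # evaluation for 24 hours. It cannot tell one long silence from
        # several back-to-back short ones, so treat it as a prompt to review.
        expr: max(alertmanager_silences{state="active"}) > 0
        for: 24h
        labels:
          severity: warning   # route to chat or a ticket queue, never a pager
          team: checkout      # hypothetical owning team
        annotations:
          summary: "At least one Alertmanager silence has been active for 24h"
          description: "List the active silences with amtool and confirm each one links to a ticket for the root cause."
```

Keeping this at warning severity matters: a silence review is business-hours work, and paging on it would add exactly the kind of noise the rule is meant to reduce.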

🎯 Key Takeaway

Every silenced alert is a ticking time bomb — fix the root cause, not the notification.

An alert that fires every deploy is a deployment process failure, not a monitoring problem.

Your on-call rotation is only as good as your alert routing: acknowledge timeouts must be under 15 minutes.


Alert on Symptoms, Not Causes — The Golden Rule of Monitoring

The most common alerting mistake is alerting on what you think is wrong instead of what the user actually experiences. A high CPU alert fires and the engineer investigates — but CPU being high isn't inherently bad. Maybe a batch job is running. Maybe it's expected load from a traffic spike the autoscaler is still catching up to. Maybe the GC is doing a full collection. The user doesn't care about CPU. They care whether the checkout page loads and their payment goes through.

Symptomatic alerting means your alert fires on things users feel directly: high latency, elevated error rates, failed health checks, degraded availability. Google's SRE book formalised this as the Four Golden Signals — latency, traffic, errors, and saturation — and the framing has held up well. Alerts on these signals are almost always actionable because if error rate is 15%, something is broken for users right now, and the on-call engineer has a clear starting point.

Causal metrics like CPU, memory, and disk are better suited for dashboards and capacity planning, not paging. You investigate them after a symptom alert fires to understand why something is wrong — not to decide whether something is wrong. The distinction matters because it changes the engineer's mental model during an incident. A symptom alert says 'users are experiencing this.' A causal alert says 'something in your infrastructure crossed a threshold.' Only one of those tells you whether to act.

The practical test: before adding any alert, ask yourself 'If this fires at 3 AM, can the on-call engineer take a concrete action within five minutes?' Not 'investigate' — act. If the answer is 'investigate and check some dashboards and maybe escalate', it belongs on a dashboard, not a pager. The 3 AM test is deliberately adversarial: the human responding is sleep-deprived, potentially unfamiliar with the specific service, and operating under pressure. Write your alerts for that human, not for a well-rested engineer on a Tuesday morning.

Here's the pushback you'll hear, usually from engineers who've been on the wrong end of missed incidents: 'But what about catching problems early?' This is a reasonable concern wrapped around a false premise. Causal metrics do catch problems early — on dashboards, during business hours, reviewed by engineers who have context and aren't panicking. You don't need to wake someone up to look at a graph trending in the wrong direction. Set up a daily dashboard review as part of your team's operational rhythm. If a pattern in causal metrics consistently precedes a symptom, write a symptom-based alert for the user-facing impact, not the infrastructure metric that correlates with it. The correlation is interesting. The symptom is the truth.

The Four Golden Signals aren't a checklist to run through mechanically. They're a lens for asking the right question: is what I'm about to alert on something a user would feel? Latency tells you how long users wait. Traffic tells you how much demand you're handling and whether demand patterns are normal. Errors tell you how often you're actively failing users. Saturation tells you how close to the edge you are before one of the other three degrades. Any proposed alert that doesn't map cleanly to one of these four should face an explicit justification before it gets merged.

prometheus_symptom_alerts.yml (YAML)

# prometheus_symptom_alerts.yml
# These rules live in your Prometheus alerting rules directory.
# Reference them via the 'rule_files' block in prometheus.yml:
#   rule_files:
#     - /etc/prometheus/rules/*.yml
#
# Validate before applying:
#   promtool check rules /etc/prometheus/rules/prometheus_symptom_alerts.yml

groups:
  - name: user_facing_symptoms
    # Prometheus evaluates these rules every 60 seconds.
    # Shorter intervals increase Prometheus load without meaningfully
    # improving detection time for real incidents.
    interval: 60s
    rules:

      # ──────────────────────────────────────────────────────────
      # GOOD ALERT: Fires on what users actually experience
      # This is a SYMPTOM. Users are receiving 5xx responses.
      # The on-call engineer's action is unambiguous:
      # check error logs, look at recent deploys, investigate downstream.
      # ──────────────────────────────────────────────────────────
      - alert: HighErrorRateShopping
        expr: |
          (
            sum(rate(http_requests_total{
              service="shopping-cart",
              status=~"5.."
            }[5m]))
            /
            sum(rate(http_requests_total{
              service="shopping-cart"
            }[5m]))
          ) > 0.05
        # 'for: 2m' means Prometheus waits 2 minutes of sustained breach
        # before transitioning from 'pending' to 'firing'.
        # This single parameter eliminates the majority of false pages
        # from transient spikes and single bad scrape cycles.
        # Cost: you detect 2 minutes later. Benefit: far fewer 3 AM false alarms.
        for: 2m
        labels:
          severity: critical
          team: checkout
          # 'service' label is critical for Alertmanager routing AND
          # for inhibition rules — both depend on consistent service labels.
          service: shopping-cart
        annotations:
          summary: "Shopping cart error rate above 5% for 2 minutes"
          # The runbook_url annotation is non-negotiable.
          # Engineers responding to this alert should never have to search
          # for documentation during an active incident.
          runbook_url: "https://runbooks.internal/shopping-cart-errors"
          description: "Current error rate: {{ $value | humanizePercentage }}. Check recent deployments and downstream dependencies first."
          # dashboard_url lets the engineer jump directly to the relevant
          # Grafana panel without navigating the dashboard hierarchy.
          dashboard_url: "https://grafana.internal/d/checkout-overview?var-service=shopping-cart"

      # ──────────────────────────────────────────────────────────
      # GOOD ALERT: p99 latency degradation users feel
      # 5-minute 'for' duration is appropriate for latency —
      # a 30-second latency spike is often a cold cache or single
      # slow request. 5 minutes of sustained high p99 means
      # something structural has changed.
      # ──────────────────────────────────────────────────────────
      - alert: CheckoutLatencyHigh
        expr: |
          histogram_quantile(
            0.99,
            sum(rate(http_request_duration_seconds_bucket{
              service="checkout",
              handler="/api/purchase"
            }[10m])) by (le)
          ) > 2.0
        for: 5m
        labels:
          severity: warning
          team: checkout
          service: checkout
        annotations:
          summary: "p99 checkout latency exceeded 2s SLO threshold"
          runbook_url: "https://runbooks.internal/checkout-latency"
          description: "p99 latency is {{ $value | humanizeDuration }}. SLO threshold is 2.0s. Check database query times and downstream payment API latency."
          dashboard_url: "https://grafana.internal/d/checkout-latency"

      # ──────────────────────────────────────────────────────────
      # BAD ALERT — kept here intentionally as a contrast.
      # Do NOT uncomment this. CPU is a cause, not a symptom.
      #
      # Problems with this alert:
      # 1. 'What should the engineer DO?' — there is no clear action.
      #    Check CPU... then what? If users aren't affected, go back to sleep?
      # 2. It fires on batch jobs, GC pauses, autoscaling warmup,
      #    and any other expected load pattern.
      # 3. The threshold (80%) is arbitrary — why not 75%? Why not 90%?
      #    There is no principled answer because CPU% alone isn't the thing
      #    you actually care about.
      # ──────────────────────────────────────────────────────────
      # - alert: HighCPU
      #   expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.80
      #   for: 1m
      #   annotations:
      #     summary: "CPU is high"
      #
      # If you inherited this alert, delete it today.
      # Add CPU to your Grafana dashboard instead.

Output

# When Prometheus evaluates these rules and HighErrorRateShopping fires
# (the alert transitions: inactive -> pending -> firing over 2 minutes):

ALERT HighErrorRateShopping
  Labels:
    alertname = HighErrorRateShopping
    severity = critical
    team = checkout
    service = shopping-cart
  Annotations:
    summary = Shopping cart error rate above 5% for 2 minutes
    description = Current error rate: 7.34%. Check recent deployments and downstream dependencies first.
    runbook_url = https://runbooks.internal/shopping-cart-errors
    dashboard_url = https://grafana.internal/d/checkout-overview?var-service=shopping-cart
  State: firing
  ActiveAt: 2026-03-15T03:14:22Z

# In Alertmanager's /api/v2/alerts, this alert will show:
# - Matched receiver: pagerduty-critical (via severity=critical route)
# - Status: firing
# - Inhibited: false (no higher-severity alert suppressing it)

# The on-call engineer receives a PagerDuty page containing:
# - Alert name and summary
# - Clickable runbook URL
# - Current metric value (7.34% error rate)
# - Direct dashboard link
# Everything they need to start investigating in under 60 seconds.
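The sample output mentions a `pagerduty-critical` receiver reached through a `severity=critical` route, and the rule comments mention inhibition keyed on the `service` label. A minimal Alertmanager fragment that would produce that behaviour might look like the sketch below. Only the `pagerduty-critical` name comes from this article; the `slack-warnings` receiver, the channel, and both secret file paths are placeholders to adapt.

```yaml
# alertmanager.yml (fragment, illustrative values)
route:
  receiver: slack-warnings            # default for anything not matched below
  group_by: ['alertname', 'service']
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty-critical    # only critical alerts reach the pager

inhibit_rules:
  # While a critical alert is firing for a service, mute warning-level
  # alerts that carry the same 'service' label value.
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['service']

receivers:
  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key_file: /etc/alertmanager/secrets/pagerduty_routing_key
  - name: slack-warnings
    slack_configs:
      - channel: '#checkout-alerts'
        api_url_file: /etc/alertmanager/secrets/slack_webhook_url
```

Because the inhibition matches on `equal: ['service']`, it only applies when both alerts carry the same `service` value. In the rules above, `HighErrorRateShopping` (`shopping-cart`) would not inhibit `CheckoutLatencyHigh` (`checkout`), which is why consistent service labels matter.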

💡 Pro Tip: The 3 AM Test

Before committing any new alert rule, ask your team: 'If this woke someone up at 3 AM, would they know exactly what to do in under 5 minutes?' If anyone hesitates, the alert needs a better runbook, a higher threshold, a longer 'for' duration, or it needs to be demoted to a dashboard panel entirely. Run this test on your existing alerts too — not just new ones. Most alert libraries accumulate technical debt the same way codebases do. The 3 AM test applied to a backlog of 60 alerts will typically eliminate 20 of them in a single sitting.

📊 Production Insight

In my experience, CPU alerts can easily fire 40+ times per week in a microservice setup running autoscaling and batch workloads.

The vast majority are false positives: batch jobs, GC pauses, autoscaling warmup, and cold-start traffic.

None of them tell the on-call engineer what to do, which means they all teach the engineer to ignore pages.

Rule: if it doesn't map to a Golden Signal and can't survive the 3 AM test, it's a dashboard panel masquerading as an alert.

🎯 Key Takeaway

Symptom alerts (latency, errors, availability) are almost always actionable — they describe what users feel.

Causal alerts (CPU, memory, disk) are almost always noise when they page — they describe what engineers investigate.

The 3 AM test: can the on-call engineer take a concrete action in 5 minutes? If not, it's a dashboard.

Alert vs Dashboard vs Log: Where Does This Metric Go?

If: user-facing impact (error rate elevated, latency above SLO, availability degraded)
→ Use: Alert. This is a symptom. Page the on-call engineer with a runbook link and current metric value.

If: infrastructure metric (CPU, memory, disk) with no confirmed current user impact
→ Use: Dashboard. Investigate during business hours, or investigate after a symptom alert fires to understand the cause.

If: metric that self-corrects within 60-90 seconds reliably
→ Use: Set 'for: 2m' minimum. If it self-corrects before 2 minutes consistently, it never becomes an alert and belongs on a dashboard with an anomaly annotation.

If: metric you're considering alerting on but you're unsure
→ Use: Apply the 3 AM test first. Then ask: has an incident ever occurred that this alert would have caught earlier than a symptom alert? If yes, it may be worth keeping. If no, it's a dashboard panel.

Structuring On-Call Rotations That Don't Destroy Your Team

An on-call rotation is a social contract as much as it is a technical system. Engineers who feel the rotation is fair, predictable, and actively supported stay in it. Engineers who feel it's a punishment — or worse, an invisible tax that everyone pretends doesn't cost anything — churn. And the engineers who leave first are always the ones who have other options: the senior engineers, the ones who built the systems, the ones whose absence creates the next generation of undocumented incidents.

The fundamentals of a healthy rotation start with team size. You need at least four engineers to build a weekly rotation that gives people genuine recovery time between shifts. With three engineers, one person is always either on-call or just came off on-call — cognitive load never fully resets. With two, you're alternating weeks between two people and calling it a rotation. With one, you're not running a rotation at all; you're running a hero, and heroes burn out or leave.

Secondary on-call — a second engineer ready to be escalated to if the primary doesn't acknowledge within ten minutes — is non-negotiable for any service with a real SLA. It serves two purposes. The obvious one is redundancy: if the primary engineer is genuinely unavailable, the incident still gets coverage. The less obvious one is diagnostic: if escalation to secondary happens more than once per week, your primary alert volume is too high. The secondary escalation rate is a canary metric for rotation health that most teams never look at.

On-call handoff meetings deserve more ceremony than they typically receive. The outgoing engineer should document what fired, what was investigated, what commands were run, and what follow-up work was created. Without this, the same incidents repeat because the institutional knowledge of 'oh, that alert fires every Tuesday when the ETL job runs — just acknowledge and wait four minutes' lives in one engineer's head and evaporates with every rotation change. The handoff document is where that knowledge becomes team property.

Compensation matters — and how you structure it signals what you believe on-call is worth. Explicit on-call pay, time-off-in-lieu, or reduced sprint commitments during on-call weeks all communicate that the organisation understands the cost. The specifics matter less than the consistency and transparency.

But here's what I've observed across many teams: the psychological contract matters more than the compensation number. An engineer who receives $500 per on-call week but has no control over alert volume, whose feedback from retrospectives never changes anything, and who watches the same false alarms fire week after week without being fixed — that engineer feels trapped regardless of the pay. An engineer who receives time-off-in-lieu and can see, in every sprint, that the team is actively working to reduce alert noise and improve runbooks — that engineer feels respected. The best on-call programs are characterised by visible investment in reducing the burden, not just by compensating for it.

In 2026, most teams run distributed rotations spanning multiple time zones. This is both an opportunity and a design challenge. Done well, a distributed rotation means nobody carries the full 24-hour burden of a weekly shift — you can hand off at a natural boundary between regions and keep business-hours coverage for each timezone. Done poorly, it creates coordination overhead, unclear escalation paths when the person on-call is 9 time zones away, and handoff meetings that nobody can attend at a reasonable hour. If your team spans more than two time zones, design your rotation explicitly for the timezone distribution — don't just apply a single-timezone rotation template and hope the scheduling works out.

Rotation schedule changes are the most underestimated source of trust erosion. Once you publish a rotation, treat changes with the same communication discipline you'd apply to a production deployment. Engineers plan childcare, travel, and personal commitments around the schedule. A last-minute swap that affects a weekend isn't just inconvenient — it damages the sense that the rotation is a fair and predictable system, which is the only thing that makes it sustainable long-term.

pagerduty_escalation_policy.tf (HCL)

# pagerduty_escalation_policy.tf
# Terraform resource definitions for a PagerDuty escalation policy
# and primary on-call rotation schedule.
#
# Prerequisites:
#   - PagerDuty Terraform provider configured:
#     terraform {
#       required_providers {
#         pagerduty = {
#           source  = "PagerDuty/pagerduty"
#           version = "~> 3.0"
#         }
#       }
#     }
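#   - PAGERDUTY_TOKEN environment variable set
#   - pagerduty_schedule.checkout_secondary_rotation and the
#     data.pagerduty_user.* lookups referenced below are assumed to be
#     defined elsewhere in the module (not shown here)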
#
# Apply: terraform init && terraform apply

# ──────────────────────────────────────────────
# ESCALATION POLICY
# Three levels: Primary → Secondary → EM
# This is the standard for any SLA-backed production service.
# ──────────────────────────────────────────────
resource "pagerduty_escalation_policy" "checkout_team" {
  name      = "Checkout Team — Production Escalation"
  # num_loops: how many times the policy repeats from Level 1 after the
  # last level times out with no acknowledgment.
  # 2 repeats = up to 3 full passes through the chain in total.
  num_loops = 2

  # LEVEL 1: Primary on-call engineer
  # Gets paged first. The channels used (push, SMS, phone call) come from
  # each user's own PagerDuty notification rules, not from this policy.
  # 10 minutes is a common acknowledgment window:
  # long enough to wake up, short enough to limit outage duration.
  rule {
    escalation_delay_in_minutes = 10

    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.checkout_primary_rotation.id
    }
  }

  # LEVEL 2 — Secondary on-call engineer
  # Triggered when primary doesn't acknowledge within 10 minutes.
  # This should be a RARE event — if it happens weekly, audit alert volume.
  # Secondary rotation uses different engineers than primary
  # so the same person isn't primary + secondary simultaneously.
  rule {
    escalation_delay_in_minutes = 10

    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.checkout_secondary_rotation.id
    }
  }

  # LEVEL 3 — Engineering Manager
  # This level should fire fewer than once per quarter in a healthy team.
  # If it fires weekly, the EM needs to be in the alert audit conversation,
  # not just at the end of the escalation chain.
  rule {
    escalation_delay_in_minutes = 15

    target {
      type = "user_reference"
      id   = data.pagerduty_user.checkout_engineering_manager.id
    }
  }
}

# ──────────────────────────────────────────────
# PRIMARY ON-CALL SCHEDULE
# Weekly rotation across 4 engineers minimum.
# Fewer than 4 engineers = rotation gaps and burnout risk.
# ──────────────────────────────────────────────
resource "pagerduty_schedule" "checkout_primary_rotation" {
  name      = "Checkout — Primary On-Call"
  time_zone = "America/New_York"

  layer {
    name = "Weekly Primary Rotation"
    # Rotation starts Monday 9 AM (2026-01-05 is a Monday). Handoff during
    # business hours means the incoming engineer can get context before night coverage.
    start                        = "2026-01-05T09:00:00-05:00"
    rotation_turn_length_seconds = 604800  # 7 days exactly
    rotation_virtual_start       = "2026-01-05T09:00:00-05:00"

    # 4 engineers minimum for a healthy weekly rotation.
    # Each engineer is primary once every 4 weeks.
    # With 4 people: 1 week on, 3 weeks off.
    # With 6 people: 1 week on, 5 weeks off — meaningful recovery time.
    users = [
      data.pagerduty_user.alice.id,
      data.pagerduty_user.bob.id,
      data.pagerduty_user.carol.id,
      data.pagerduty_user.david.id,
    ]
  }

  # For distributed teams spanning multiple time zones,
  # use multiple layers with restrictions to implement follow-the-sun:
  # layer 1: AMER engineers, active 9 AM - 6 PM EST
  # layer 2: EMEA engineers, active 9 AM - 6 PM GMT
  # layer 3: APAC engineers, active 9 AM - 6 PM SGT
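  # Each layer has a 'restriction' block defining its active hours.
  # This keeps on-call within business hours for each region, but these three
  # 9-hour shifts leave 23:00-01:00 UTC uncovered: extend one layer or add a
  # fallback layer to close the gap.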
}

Output

# terraform apply output:
pagerduty_schedule.checkout_primary_rotation: Creating...
pagerduty_schedule.checkout_primary_rotation: Creation complete after 2s [id=P3X8K2A]
pagerduty_escalation_policy.checkout_team: Creating...
pagerduty_escalation_policy.checkout_team: Creation complete after 1s [id=PQRST99]

Apply complete! Resources: 2 added, 0 changed, 0 destroyed.

# Escalation timeline when an alert fires at 03:14 AM:
# T+00:00  Alert fires in Prometheus, transitions to 'firing' state
# T+00:10  Alertmanager routes to pagerduty-critical receiver (after the 10s group_wait)
# T+00:10  PagerDuty creates incident, pages Alice (primary)
#          Notification: push notification + SMS simultaneously
# T+10:00  Alice hasn't acknowledged (phone on silent, deep sleep)
#          PagerDuty escalates to Bob (secondary)
#          Bob receives push + SMS
# T+10:45  Bob acknowledges on mobile app
#          Alice receives 'incident acknowledged by Bob' notification
# T+20:00  If Bob also didn't acknowledge, EM is paged
#          Level 3 firing = something is structurally wrong with your rotation
#          or your alert volume is unsustainable. Either needs fixing urgently.
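
The timeline above is just a running sum of the escalation delays. A minimal Python sketch of that arithmetic follows; the function name `page_times` is hypothetical, and it assumes `num_loops` counts repeats after the first pass through the policy.

```python
from itertools import accumulate

# Delays mirror escalation_delay_in_minutes for Levels 1-3 in the policy above.
ESCALATION_DELAYS_MIN = [10, 10, 15]
LEVELS = ["primary", "secondary", "engineering manager"]


def page_times(delays, num_loops=0):
    """Return (minute, level_index) for every page until the policy gives up.

    A level is paged when the previous level's delay expires. After the last
    level's delay, the whole policy repeats `num_loops` more times.
    """
    cycle = sum(delays)
    offsets = [0] + list(accumulate(delays))[:-1]
    return [
        (loop * cycle + offset, level)
        for loop in range(num_loops + 1)
        for level, offset in enumerate(offsets)
    ]


if __name__ == "__main__":
    for minute, level in page_times(ESCALATION_DELAYS_MIN, num_loops=2):
        print(f"T+{minute:>3} min  page {LEVELS[level]}")
```

Running it with different delays is a quick way to see how long an unacknowledged incident can sit before the last level is reached.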

⚠ Watch Out: Manager-First Escalation

Never put a manager at Level 1 escalation. It trains engineers to wait for someone else to handle incidents, creates a single point of failure when the manager is travelling or has a family emergency, and burns out your engineering manager faster than almost anything else. Managers who get paged at 3 AM for production incidents can't lead teams effectively the next morning. Keep managers at Level 3 as a genuine last resort — the signal that your entire rotation has gone dark, not the default when the first engineer is slow to respond.

📊 Production Insight

Teams with fewer than 4 on-call engineers show 3x higher attrition rates within 18 months — the math is simple: burn enough sleep and good engineers leave.

Secondary on-call escalation more than once per week is a leading indicator of alert volume problems, not coverage problems — audit alerts before adjusting the rotation.

Distributed teams in 2026 should design rotations explicitly for timezone coverage, not apply a single-timezone template and hope scheduling works out.

Rule: treat gaps in the rotation schedule with the same urgency as gaps in production coverage. Both represent unacceptable risk.

🎯 Key Takeaway

A healthy rotation needs 4+ engineers, an active secondary, a handoff ritual, and visible investment in reducing noise.

Compensation without alert noise reduction is a band-aid. Fix the alert volume first — then the compensation validates the team's experience rather than apologising for it.

Fairness and predictability retain engineers. Schedule changes made without notice destroy the trust that makes rotations sustainable.

Rotation Design Based on Team Size and Distribution

If: Team of 4+ engineers, single timezone or overlapping timezones
→ Use: Weekly rotation with primary + secondary on-call. Each engineer is primary once per N weeks where N is team size. This is the sustainable baseline.

If: Team of 4+ engineers spanning 3+ distinct time zones
→ Use: Follow-the-sun rotation with PagerDuty layer restrictions by active hours. Keep on-call within business hours for each region. No engineer should carry overnight coverage for a timezone they don't live in.

If: Team of 2-3 engineers
→ Use: Daily rotation is possible but burnout risk is significant within 6 months. Escalate hiring as a reliability risk, not just a headcount request. Quantify the alert volume and overnight pages to make the cost visible to leadership.

If: Team of 1 engineer
→ Use: Not a rotation: this is a single point of failure for both the service and the engineer. Escalate to management immediately with explicit framing: this is a business continuity risk, not a staffing preference.

If: Secondary escalation fires more than once per week
→ Use: Do not expand the secondary rotation. Audit primary alert volume first. If secondary is escalating often, the problem is too many pages, not too few people in the rotation.
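
For the follow-the-sun case, it is worth checking that the regional shifts actually cover all 24 hours. The sketch below is one way to do that in Python. The shift hours and UTC offsets are illustrative assumptions (9 AM to 6 PM local, no daylight saving time), and the helper names are hypothetical.

```python
def covered_hours_utc(layers):
    """Return the set of UTC hours (0-23) covered by at least one layer.

    `layers` maps a region name to (utc_offset_hours, local_start, local_end);
    a shift covers local_start <= hour < local_end.
    """
    covered = set()
    for utc_offset, start, end in layers.values():
        for local_hour in range(start, end):
            covered.add((local_hour - utc_offset) % 24)
    return covered


def coverage_gaps(layers):
    """Return the sorted list of UTC hours with nobody on call."""
    return sorted(set(range(24)) - covered_hours_utc(layers))


# Hypothetical 9 AM - 6 PM shifts; offsets ignore daylight saving time.
LAYERS = {
    "AMER (EST, UTC-5)": (-5, 9, 18),
    "EMEA (GMT, UTC+0)": (0, 9, 18),
    "APAC (SGT, UTC+8)": (8, 9, 18),
}

if __name__ == "__main__":
    print("Uncovered UTC hours:", coverage_gaps(LAYERS))
```

With these assumed shifts the hours starting at 23:00 and 00:00 UTC are uncovered, so one region has to start earlier or finish later, or a fallback layer has to pick up the gap.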

Runbooks, SLOs, and the Alert Audit Loop That Keeps You Sane

Every alert that fires should have a runbook. Not a wiki page describing the service architecture. Not a Confluence document that was last updated eighteen months ago. A runbook: a living, numbered document that tells the on-call engineer exactly what to check, what commands to run, and what decisions to make — optimised for a person who may be half-asleep, may not know this service deeply, and has about five minutes before stakeholders start asking questions.

The most common runbook failure is structure: engineers write runbooks like documentation. They lead with context — here's how this service works, here's its architecture, here's the dependency graph. That context belongs in the service's architecture docs. A runbook during an active incident needs to start with the thing the engineer does right now, not the context they'd need to understand the system from scratch. Lead with the first command they should run. Everything else is footnotes.

Runbooks don't need to be perfect on day one. The minimum viable runbook has three headings: 'Is this alert real?', 'Known mitigations', and 'When to escalate and who to call.' The first heading should have a specific command or dashboard link that lets the engineer confirm the alert reflects a real problem, not a metric collection glitch. The second should have the two or three mitigations that have worked historically, even if they're partial. The third should have an explicit escalation matrix — not 'contact the team' but 'if the database is unreachable, call the database team at this PagerDuty service; if the issue is the payment gateway, here's the vendor's emergency number.'

As incidents happen, the on-call engineer appends what they learned. Within three months of consistently following this pattern, you have battle-tested documentation that reflects reality rather than design intent. The gap between design intent and operational reality is where most runbooks fail — and closing that gap is what makes the difference between a runbook that helps and one that the engineer closes after 30 seconds because it doesn't match what they're seeing.

SLO-based alerting is a different philosophy entirely, and it's worth understanding the shift it requires. Instead of alerting when error rate exceeds 1% — an arbitrary threshold chosen by someone who had a reasonable gut feeling — you alert when you're consuming your monthly error budget faster than sustainable. This ties your paging directly to whether you're going to breach the reliability commitment you've made to your users. It dramatically reduces alert volume while ensuring that every page represents a genuine threat to that commitment.
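
The arithmetic behind "consuming budget faster than sustainable" is short enough to sketch in Python. The 99.9% target and 30-day window are this article's example values; the function names are hypothetical.

```python
SLO_TARGET = 0.999        # 99.9% availability, the example SLO used in this article
WINDOW_HOURS = 30 * 24    # 30-day SLO window = 720 hours


def burn_rate(error_ratio, slo_target=SLO_TARGET):
    """Observed error ratio divided by the error ratio the SLO allows."""
    return error_ratio / (1.0 - slo_target)


def hours_to_exhaustion(rate, window_hours=WINDOW_HOURS):
    """Hours until a whole window's budget is spent at a constant burn rate."""
    return float("inf") if rate <= 0 else window_hours / rate


if __name__ == "__main__":
    for ratio in (0.001, 0.003, 0.014):
        rate = burn_rate(ratio)
        print(f"error ratio {ratio:.1%}: burn {rate:.1f}x, "
              f"budget gone in {hours_to_exhaustion(rate):.0f} h")
```

A 0.1% error ratio is exactly sustainable (1x), while 0.3% burns the budget three times too fast and exhausts it in 240 hours, or 10 days.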

The monthly alert audit is the most underrated practice in on-call operations. Pull a report of every alert that fired in the last 30 days. For each one: Was it actionable? Was there a documented response? If the same alert fired three or more times without a code fix being shipped, that's reliability debt. It belongs in the engineering backlog with a sprint assignment, not as a recurring 3 AM interruption that the team has collectively accepted as normal.

The audit meeting structure matters. Pull the data before the meeting — total alerts fired, percentage that resulted in an acknowledged incident with a documented response, mean time to acknowledge, and the ranked list of repeat offenders by firing frequency. Present it visually. The team's job during the meeting is to make three decisions about each alert on the list: keep it as is, fix the underlying condition that makes it fire too often, or delete it. Not discuss it — decide. Meetings that produce 'we should look into that' instead of 'deleted' or 'ticket created' waste their 30 minutes entirely.
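
Pulling the data before the meeting can be a small script. The Python sketch below assumes the firing history has already been exported into simple records; the `Firing` fields and the `audit` function are hypothetical, not an export format from any particular tool.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Firing:
    alertname: str
    actionable: bool    # did a human have to do something?
    fix_shipped: bool   # was a code or config fix shipped afterwards?


def audit(firings, repeat_threshold=3):
    """Summarise 30 days of firings into the numbers the meeting needs."""
    counts = Counter(f.alertname for f in firings)
    fixed = {f.alertname for f in firings if f.fix_shipped}
    actionable = sum(f.actionable for f in firings)
    repeat_offenders = [
        (name, n) for name, n in counts.most_common()
        if n >= repeat_threshold and name not in fixed
    ]
    return {
        "total": len(firings),
        "actionable_pct": round(100 * actionable / len(firings), 1) if firings else 0.0,
        "repeat_offenders": repeat_offenders,
    }


if __name__ == "__main__":
    history = [Firing("DiskFull", False, False)] * 4 + [Firing("CheckoutDown", True, True)]
    print(audit(history))
```

The `repeat_offenders` list is the agenda: each entry gets a keep, fix, or delete decision.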

The SLO frame changes the audit conversation in a valuable way. Instead of asking 'was this specific alert actionable?' you ask 'is our error budget on track this month?' If the budget is healthy and you have 80% remaining at the midpoint, many threshold alerts that fired are provably noise: they didn't threaten the SLO. If the budget is burning and only 40% remains at the midpoint, you need more alerting sensitivity, not less. The budget number makes the decision about alert sensitivity a function of reliability risk rather than engineering anxiety.
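
The midpoint check can be written down as a small calculation. This Python sketch assumes you know the expected request volume for the whole 30-day window and the failures so far; both function names and the traffic figures are hypothetical.

```python
def budget_remaining(expected_requests, failed_so_far, slo_target=0.999):
    """Fraction of the window's error budget still unspent (negative after a breach)."""
    allowed_failures = expected_requests * (1.0 - slo_target)
    return 1.0 - failed_so_far / allowed_failures


def on_track(remaining, elapsed_fraction):
    """True when the budget left is at least the share of the window left."""
    return remaining >= 1.0 - elapsed_fraction


if __name__ == "__main__":
    # Assumed traffic: 100M requests expected over the 30-day window.
    for failures in (20_000, 60_000):
        left = budget_remaining(100_000_000, failures)
        print(f"{failures} failures at midpoint: {left:.0%} left, on track: {on_track(left, 0.5)}")
```

With 80% left at the midpoint the budget is comfortably on track; with 40% left it is not, which is the signal to increase sensitivity.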

slo_burn_rate_alert.yml (YAML)

# slo_burn_rate_alert.yml
# Multi-window burn rate alerting — the approach from Google SRE Workbook Chapter 5.
#
# PREREQUISITES:
# - A defined SLO: for this example, 99.9% availability (0.1% error budget)
# - The service must emit http_requests_total with a 'status' label
# - A 30-day SLO measurement window (most common in production)
#
# HOW BURN RATE WORKS:
# If your SLO is 99.9%, your monthly error budget is 0.1% of all requests.
# A burn rate of 1x means you're consuming budget at exactly the rate that
# exhausts it in 30 days. A burn rate of 14x means you'll exhaust 30 days
# of budget in ~52 hours. That's when you wake someone up immediately.
# A burn rate of 3x means you'll exhaust budget in ~10 days — serious,
# but you have time to investigate during business hours.
#
# WHY MULTI-WINDOW:
# Single-window: after a sharp spike, the 1h burn rate can stay above 14x for
# up to an hour, so the alert keeps firing long after the service recovered.
# Multi-window: BOTH the 1h window AND the 5m window must exceed the threshold.
# The 1h window shows the budget impact is significant; the 5m window shows the
# burn is still happening, so the alert resolves within minutes of recovery.

groups:
  - name: slo_burn_rates
    rules:

      # ──────────────────────────────────────────────────────────
      # FAST BURN — 14x rate (budget exhausted in ~52 hours)
      # This is a drop-everything incident. Page immediately.
      # Multi-window: 1h (detects sustained fast burn) AND
      #               5m (confirms it's still happening right now).
      # ──────────────────────────────────────────────────────────
      - alert: CheckoutSLOFastBurn
        expr: |
          (
            (
              1 - (
                sum(rate(http_requests_total{
                  service="checkout",
                  status!~"5.."
                }[1h]))
                /
                sum(rate(http_requests_total{
                  service="checkout"
                }[1h]))
              )
            ) / 0.001
          ) > 14
          and
          (
            (
              1 - (
                sum(rate(http_requests_total{
                  service="checkout",
                  status!~"5.."
                }[5m]))
                /
                sum(rate(http_requests_total{
                  service="checkout"
                }[5m]))
              )
            ) / 0.001
          ) > 14
        # 'for' duration is deliberately short here — fast burn needs fast detection.
        # The multi-window expression does the noise filtering,
        # so a short 'for' doesn't create false positives.
        for: 2m
        labels:
          severity: critical
          alert_type: slo_burn
          service: checkout
        annotations:
          summary: "Checkout SLO: fast burn — 30-day error budget exhausts in ~52 hours at current rate"
          runbook_url: "https://runbooks.internal/checkout-slo-burn"
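          # Prometheus alert templates have no arithmetic functions, so the
          # hours-to-exhaustion figure is computed with a PromQL query.
          description: |
            Burn rate is {{ $value | printf "%.1f" }}x sustainable.
            At this rate, your monthly error budget exhausts in approximately
            {{ with printf "vector(720 / %f)" $value | query }}{{ . | first | value | printf "%.0f" }}{{ end }} hours.
            Immediate investigation required: this is SLA-threatening.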
          dashboard_url: "https://grafana.internal/d/slo-error-budget?var-service=checkout"

      # ──────────────────────────────────────────────────────────
      # SLOW BURN — 3x rate (budget exhausted in ~10 days)
      # Serious but not drop-everything. Investigate today.
      # Multi-window: 6h (confirms sustained trend, not a spike) AND
      #               30m (confirms it's still happening, not historical).
      # ──────────────────────────────────────────────────────────
      - alert: CheckoutSLOSlowBurn
        expr: |
          (
            (
              1 - (
                sum(rate(http_requests_total{
                  service="checkout",
                  status!~"5.."
                }[6h]))
                /
                sum(rate(http_requests_total{
                  service="checkout"
                }[6h]))
              )
            ) / 0.001
          ) > 3
          and
          (
            (
              1 - (
                sum(rate(http_requests_total{
                  service="checkout",
                  status!~"5.."
                }[30m]))
                /
                sum(rate(http_requests_total{
                  service="checkout"
                }[30m]))
              )
            ) / 0.001
          ) > 3
        # Longer 'for' here — slow burn is a trend, not an event.
        # We want to be certain before alerting at warning severity.
        for: 15m
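        labels:
          severity: warning
          # Not 'slo_burn': an Alertmanager route that pages on
          # alert_type=slo_burn would otherwise page for this warning too.
          alert_type: slo_burn_slow
          service: checkout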
        annotations:
          summary: "Checkout SLO: slow burn — 30-day error budget exhausts in ~10 days at current rate"
          runbook_url: "https://runbooks.internal/checkout-slo-burn"
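          # Prometheus alert templates have no arithmetic functions, so the
          # hours-to-exhaustion figure is computed with a PromQL query.
          description: |
            Burn rate is {{ $value | printf "%.1f" }}x sustainable.
            Current trajectory exhausts monthly budget in approximately
            {{ with printf "vector(720 / %f)" $value | query }}{{ . | first | value | printf "%.0f" }}{{ end }} hours.
            Investigate during business hours. This does not require waking anyone up.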
          dashboard_url: "https://grafana.internal/d/slo-error-budget?var-service=checkout"

Output

# Fast burn alert fires during an incident at 03:17 AM:
ALERT CheckoutSLOFastBurn
  Labels:
    alertname  = CheckoutSLOFastBurn
    severity   = critical
    alert_type = slo_burn
    service    = checkout
  Annotations:
    summary       = Checkout SLO: fast burn — 30-day error budget exhausts in ~52 hours at current rate
    description   = Burn rate is 18.3x sustainable. At this rate, your monthly error budget
                    exhausts in approximately 39 hours. Immediate investigation required.
    runbook_url   = https://runbooks.internal/checkout-slo-burn
    dashboard_url = https://grafana.internal/d/slo-error-budget?var-service=checkout
  State:    firing
  ActiveAt: 2026-03-15T03:17:44Z

# Simultaneously in Alertmanager:
# CheckoutSLOSlowBurn: suppressed, if it is also firing
# (Inhibited: the fast burn critical alert suppresses the slow burn warning
#  via Alertmanager inhibition rules for the same service label.
#  Prometheus itself still lists the alert as firing; inhibition only
#  stops the notification.)

# What the on-call engineer sees in PagerDuty:
# - Alert name + summary
# - Current burn rate: 18.3x
# - Time until budget exhaustion: ~39 hours
# - Direct link to the SLO error budget dashboard
# - Runbook link
# They know immediately: this is real, it's SLA-threatening,
# and here's where to start investigating.

🔥Interview Gold: Why Multi-Window?

Single-window burn rate alerts have a specific failure mode that comes up in senior engineering interviews: a sharp spike in error rate can push the 1-hour burn rate above the fast-burn threshold and keep it there for up to an hour after the service has recovered. The alert keeps firing, the on-call engineer investigates and finds everything healthy: false page, trust eroded. Multi-window alerting (1h AND 5m both elevated simultaneously) adds a recency check. The long window confirms that a meaningful share of the budget has been burned; the short window confirms the burn is still happening right now. Once errors stop, the 5-minute window drops below the threshold within minutes and the alert resolves. The detection delay cost is minimal. The false positive reduction is substantial. That trade-off, explained clearly, is a reliable differentiator in SRE interviews.
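
A small simulation makes the difference between the two conditions visible. The Python sketch below feeds a hypothetical per-minute error ratio (an hour of healthy traffic, a 5-minute total outage, then recovery) through a 1h-only condition and a 1h AND 5m condition. It assumes constant traffic and ignores the `for:` clause, so it illustrates the windows rather than Prometheus itself.

```python
THRESHOLD = 14.0   # fast-burn threshold from the alert rule
BUDGET = 0.001     # error budget for a 99.9% SLO


def burn(errors, t, window):
    """Burn rate over the trailing `window` minutes ending at minute t."""
    chunk = errors[max(0, t - window + 1): t + 1]
    return (sum(chunk) / len(chunk)) / BUDGET


def firing_minutes(errors, long=60, short=5, multi=True):
    """Minutes at which the alert condition is true."""
    minutes = []
    for t in range(len(errors)):
        hit = burn(errors, t, long) > THRESHOLD
        if multi:
            hit = hit and burn(errors, t, short) > THRESHOLD
        if hit:
            minutes.append(t)
    return minutes


# Per-minute error ratio: healthy, then a 5-minute total outage, then healthy.
errors = [0.0] * 60 + [1.0] * 5 + [0.0] * 120

if __name__ == "__main__":
    single = firing_minutes(errors, multi=False)
    multi = firing_minutes(errors, multi=True)
    print(f"1h only:   true for {len(single)} minutes")
    print(f"1h AND 5m: true for {len(multi)} minutes")
```

In this scenario the single-window condition stays true for about an hour after the outage ends, while the multi-window condition clears a few minutes after recovery.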

📊 Production Insight

Monthly alert audits consistently take 30 minutes but eliminate 10+ hours of false pages per month when run with data in hand and a bias toward deleting.

Alerts that fire 3+ times in 30 days without a fix are not monitoring — they're acknowledged technical debt that your team has decided is cheaper to tolerate than to fix. Make that decision explicit by creating a ticket.

Runbooks written after an incident are 10x more accurate than runbooks written before one — they reflect what actually happens, not what was designed to happen.

Rule: if the same alert pages you twice without a fix, create the ticket before it pages you a third time. Three times is a pattern you chose.

🎯 Key Takeaway

Runbooks start with three numbered steps: confirm it's real, apply the known mitigation, check the escalation criteria. Architecture context belongs at the bottom, not the top.

SLO burn rate alerting ties pages to your actual reliability commitments — it makes the question 'should we page someone?' a function of business risk, not arbitrary thresholds.

The monthly audit is your highest-leverage operational practice — 30 minutes of honest deletion saves your team more than any dashboard improvement.

Monthly Alert Audit Decision Tree

If: Alert fired 3+ times in 30 days with no code fix or configuration change
→ Use: Create a reliability debt ticket with a sprint assignment. Suppress the alert using amtool with the ticket URL as the required comment. Do not let it page for a fourth time without a plan.

If: Alert fired but there is no associated runbook or the runbook is empty
→ Use: Block that alert from paging until a 3-step runbook exists: confirm it's real, apply known mitigation, escalation criteria. An alert without a runbook is a liability, not monitoring.

If: SLO error budget is below 50% at the calendar midpoint of the month
→ Use: Treat as urgent: you're on a trajectory to breach. Freeze non-critical deployments, investigate the top error contributors, and review the slow-burn warning alert at the SRE team standup until the trend reverses.

If: Monthly audit shows 80%+ of alerts were non-actionable (acknowledged and closed with no action taken)
→ Use: The alert system needs a reset, not a trim. Delete everything. Rebuild from the last six months of actual incidents: write alerts only for the problems that happened, with the thresholds calibrated to the values seen during those incidents.

Alert Routing, Deduplication, and Suppression — The Plumbing That Makes It Work

Routing is where most alerting systems silently break in ways that are invisible until an incident proves it. Prometheus fires correctly. Alertmanager receives the alert. The alert routes to the wrong team — or to nobody at all — and the incident goes undetected. This failure is particularly dangerous because everything looks healthy: the alert fired, Alertmanager processed it, PagerDuty shows the service as active. The failure is in the gap between those systems, specifically in a label that's missing or mismatched.

Every alert label you add is implicitly a routing decision. The severity label determines which receiver handles the alert — PagerDuty for critical, Slack for warning. The team label determines which team's escalation policy is invoked. The service label enables inhibition rules and deduplication. Get any of these wrong and the alert either goes to the wrong destination or routes to Alertmanager's default receiver, which most teams configure as a catch-all Slack channel that nobody monitors at 3 AM.
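
To make the point concrete, here is a minimal Python sketch of first-match routing. The route list mirrors the three child routes in the alertmanager_routing.yml example later in this section. The `route` function is a simplified stand-in for illustration, not Alertmanager's implementation: it ignores regex matchers, nested routes, and `continue`.

```python
# (matchers, receiver) pairs, evaluated top to bottom like the route tree.
ROUTES = [
    ({"severity": "critical"}, "pagerduty-critical"),
    ({"alert_type": "slo_burn"}, "pagerduty-critical"),
    ({"severity": "warning"}, "slack-warnings"),
]
DEFAULT_RECEIVER = "default-slack"


def route(labels):
    """Return the receiver of the first route whose matchers all equal the alert's labels."""
    for matchers, receiver in ROUTES:
        if all(labels.get(key) == value for key, value in matchers.items()):
            return receiver
    return DEFAULT_RECEIVER


if __name__ == "__main__":
    good = {"alertname": "CheckoutLatencyHigh", "severity": "critical", "service": "checkout"}
    typo = {"alertname": "CheckoutLatencyHigh", "severity": "crit", "service": "checkout"}
    print(route(good))  # pagerduty-critical
    print(route(typo))  # default-slack
```

One misspelled label value is enough to move a page from PagerDuty to the catch-all channel, and nothing in the pipeline reports an error.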

Deduplication is Alertmanager's mechanism for not paging you multiple times for the same underlying problem. Two Prometheus instances — for example, a primary and a replica, or two regional scrape targets — will both fire the same alert when a metric breaches. Alertmanager groups them by label identity into a single notification. This is elegant when it works. The failure mode: two alerts that describe the same problem but have different labels — perhaps because one alert uses service="checkout" and another uses service="checkout-api" — are treated as separate alerts and generate separate pages. Label consistency across your alert rules is a more important operational practice than most teams realise, and it becomes critical when you have more than a handful of services.
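
The label-identity rule can be sketched in a few lines of Python. The `fingerprint` helper here is a hypothetical stand-in for Alertmanager's internal alert fingerprint; it only captures the idea that identity is the full label set.

```python
def fingerprint(labels):
    """Order-independent identity for a label set."""
    return tuple(sorted(labels.items()))


def deduplicate(alerts):
    """Keep one alert per distinct label set, preserving first-seen order."""
    unique = {}
    for labels in alerts:
        unique.setdefault(fingerprint(labels), labels)
    return list(unique.values())


if __name__ == "__main__":
    from_replica_a = {"alertname": "CheckoutErrors", "service": "checkout"}
    from_replica_b = {"service": "checkout", "alertname": "CheckoutErrors"}
    mislabelled = {"alertname": "CheckoutErrors", "service": "checkout-api"}
    print(len(deduplicate([from_replica_a, from_replica_b])))  # 1
    print(len(deduplicate([from_replica_a, mislabelled])))     # 2
```

The two replicas collapse into one alert, while the `checkout` versus `checkout-api` mismatch produces two, even though a human would call them the same problem.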

Suppression windows are your surgical tool for planned maintenance. Unlike blanket silences — which are blunt instruments that suppress all alerts from a service regardless of type — proper suppression targets specific alert names for specific time windows. The rule is simple and should be enforced at the tooling level: every silence requires a comment with a linked ticket URL, and every silence auto-expires within 4 hours maximum. If your maintenance window is longer than 4 hours, renew the silence manually. This creates intentional checkpoints where someone has to actively decide that the silence should continue.
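
Enforcing the rule at the tooling level could look like the Python sketch below: a check that a wrapper script or API proxy runs before creating a silence. The function name, the message wording, and the ticket URL in the example are hypothetical.

```python
import re
from datetime import timedelta

MAX_DURATION = timedelta(hours=4)
TICKET_URL = re.compile(r"https?://\S+")


def silence_violations(matchers, duration, comment):
    """Return a list of policy violations; an empty list means the silence is allowed."""
    problems = []
    if "alertname" not in matchers:
        problems.append("silence must target a specific alertname")
    if duration > MAX_DURATION:
        problems.append("duration exceeds the 4h maximum")
    if not TICKET_URL.search(comment):
        problems.append("comment must include a ticket URL")
    return problems


if __name__ == "__main__":
    print(silence_violations(
        {"alertname": "CheckoutLatencyHigh", "service": "checkout"},
        timedelta(hours=4),
        "https://tickets.internal/TICKET-1234 planned database migration",
    ))
    print(silence_violations({"service": "checkout"}, timedelta(hours=168), "noisy"))
```

The first call returns no violations; the second, a blanket week-long silence with no ticket, fails all three checks.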

Inhibition rules solve a specific and common problem: when a critical alert fires, you don't want five additional warning alerts for the same service all generating pages simultaneously. Alertmanager inhibition rules suppress lower-severity alerts when a higher-severity alert is already active for the same service. The result is one page for one incident instead of five pages for five symptoms of the same root cause. This is the difference between an on-call engineer who wakes up to a single clear incident and one who wakes up to a flood of notifications that obscures which one to start with.
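
The matching logic behind an inhibition rule can be sketched in Python as follows. This is a simplified model with a hypothetical `is_inhibited` helper, assuming a single rule where critical alerts inhibit warnings when the `service` and `team` labels are equal. A label missing on both sides is treated as equal, which matches how Alertmanager documents missing and empty labels.

```python
def is_inhibited(target, active_alerts, equal=("service", "team")):
    """True if a critical alert with matching `equal` labels is active.

    Missing labels compare as empty strings, so two alerts that both lack
    a 'team' label still count as equal on that label.
    """
    if target.get("severity") != "warning":
        return False
    return any(
        source.get("severity") == "critical"
        and all(source.get(key, "") == target.get(key, "") for key in equal)
        for source in active_alerts
    )


if __name__ == "__main__":
    active = [{"alertname": "CheckoutSLOFastBurn", "severity": "critical", "service": "checkout"}]
    same_service = {"alertname": "CheckoutLatencyHigh", "severity": "warning", "service": "checkout"}
    other_service = {"alertname": "PaymentsLatencyHigh", "severity": "warning", "service": "payments"}
    print(is_inhibited(same_service, active))   # True
    print(is_inhibited(other_service, active))  # False
```

The checkout warning is suppressed while the payments warning still notifies, which is the behaviour the `equal` list is there to guarantee.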

One important operational practice that often gets skipped: test your routing after any label change. When a service is renamed, when an alert rule is refactored, when a team restructuring changes the team label values — routing breaks silently. amtool config routes test with a set of representative alert labels takes about 90 seconds and catches misroutes before an incident does. Add it to your CI pipeline as a validation step any time alertmanager.yml or alert rule files change.

alertmanager_routing.yml (YAML)

# alertmanager_routing.yml
# Load via: alertmanager --config.file=alertmanager.yml
#
# Validate before reloading:
#   amtool check-config alertmanager.yml
#
# Reload without restart (Alertmanager supports hot reload):
#   curl -X POST localhost:9093/-/reload

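global:
  resolve_timeout: 5m
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'
  # Required by the slack_configs below unless each one sets its own api_url.
  slack_api_url: 'https://hooks.slack.com/services/REPLACE/WITH/WEBHOOK'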

# ──────────────────────────────────────────────────────────────
# INHIBITION RULES
# When a critical alert is firing for a service,
# suppress all warning-severity alerts for the same service.
#
# Without this: a database outage causes 8 separate alerts
# (latency warning, error rate warning, availability warning, etc.)
# and the on-call engineer gets 8 pages for one incident.
# With this: they get 1 page — the critical alert that matters.
#
# 'equal' defines which labels must match for inhibition to apply.
# 'service' AND 'team' must both match to avoid cross-team suppression.
# ──────────────────────────────────────────────────────────────
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    # Both labels must match — inhibiting checkout critical
    # should not suppress payments warning.
    equal: ['service', 'team']

# ──────────────────────────────────────────────────────────────
# ROUTE TREE
# Routes are evaluated top-to-bottom, first match wins.
# The 'default' route at the root catches anything that
# doesn't match a specific child route.
#
# Key timing parameters:
# group_wait:      How long to buffer alerts before sending the first notification.
#                  30s for default (batch similar alerts). 10s for critical (act fast).
# group_interval:  How long to wait before sending updates to an existing alert group.
# repeat_interval: How long to wait before re-notifying about an unresolved alert.
# ──────────────────────────────────────────────────────────────
route:
  receiver: default-slack
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    # Critical severity → immediate PagerDuty page
    # Short group_wait (10s) minimises detection-to-page delay
    - match:
        severity: critical
      receiver: pagerduty-critical
      group_wait: 10s
      repeat_interval: 1h

    # SLO burn rate alerts → always page regardless of time of day
    # Separate route ensures SLO alerts can't be accidentally
    # caught by a warning-level route if someone misconfigures severity
    - match:
        alert_type: slo_burn
      receiver: pagerduty-critical
      group_wait: 10s
      repeat_interval: 1h

    # Warning severity → Slack notification, no page
    # Engineers review these during business hours
    - match:
        severity: warning
      receiver: slack-warnings
      group_wait: 5m
      repeat_interval: 4h

# ──────────────────────────────────────────────────────────────
# RECEIVERS
# Each receiver defines a notification integration.
# The PagerDuty receiver requires a service integration key
# unique to each service — do not share keys across services,
# it makes incident routing in PagerDuty impossible to untangle.
# ──────────────────────────────────────────────────────────────
receivers:
  - name: default-slack
    slack_configs:
      - channel: '#alerts-default'
        send_resolved: true
        title: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'

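  - name: pagerduty-critical
    pagerduty_configs:
      # routing_key is the Events API v2 integration key (matches pagerduty_url
      # above); service_key is only for the legacy Events API v1 integration.
      - routing_key: '<YOUR_PAGERDUTY_INTEGRATION_KEY>'
        # Pass severity through to PagerDuty for UI triage.
        # CommonLabels, not GroupLabels: 'severity' is not in group_by,
        # so .GroupLabels.severity would render as an empty string.
        severity: '{{ .CommonLabels.severity }}'
        # Include runbook and dashboard links in PagerDuty incident details
        details:
          runbook: '{{ (index .Alerts 0).Annotations.runbook_url }}'
          dashboard: '{{ (index .Alerts 0).Annotations.dashboard_url }}'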

  - name: slack-warnings
    slack_configs:
      - channel: '#alerts-warnings'
        send_resolved: true
        title: '[WARNING] {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }} | {{ .Annotations.runbook_url }}{{ end }}'

# ──────────────────────────────────────────────────────────────
# SILENCE MANAGEMENT — use amtool, never the UI for production
#
# Good: time-bounded, ticket-linked, specific alert name
# amtool silence add \
#   alertname="CheckoutLatencyHigh" \
#   service="checkout" \
#   --duration=4h \
#   --comment="TICKET-1234: Planned database migration window"
#
# Bad: blanket silence, no ticket, long duration
# amtool silence add \
#   service="checkout" \
#   --duration=168h \
#   --comment="noisy"
# ↑ This is the pattern from the production incident above.
#   It hid a 4-hour outage. Never do this.
# ──────────────────────────────────────────────────────────────

Output

# Validating config:
$ amtool check-config alertmanager.yml
Checking 'alertmanager.yml'  SUCCESS
Found:
 - global config
 - route
 - 1 inhibit rules
 - 3 receivers
 - 0 templates

# Testing routing for a critical SLO burn alert:
$ amtool config routes test --config.file=alertmanager.yml \
    alertname=CheckoutSLOFastBurn \
    severity=critical \
    alert_type=slo_burn \
    service=checkout
pagerduty-critical
# Correct: critical SLO alert routes to PagerDuty.

# Testing routing for a warning alert (should NOT page):
$ amtool config routes test --config.file=alertmanager.yml \
    alertname=CheckoutLatencyHigh \
    severity=warning \
    service=checkout
slack-warnings
# Correct: warning routes to Slack, not PagerDuty.

# Listing active silences (amtool lists active silences by default):
$ amtool silence query
ID                                    Matchers                                                Ends At                  Created By  Comment
a1b2c3d4-e5f6-7890-abcd-ef1234567890  alertname="CheckoutLatencyHigh" service="checkout"  2026-03-15 07:00:00 UTC  alice       TICKET-1234: Planned migration

# Testing inhibition: if CheckoutSLOFastBurn (critical) is active,
# CheckoutLatencyHigh (warning, same service) should be inhibited.
# Verify via the Alertmanager API at /api/v2/alerts: the inhibited alert's
# status.inhibitedBy lists the critical alert.

Mental Model

Think of Alertmanager as a Post Office

Every alert label is an address on an envelope. If the address doesn't match any route in the route tree, the letter goes to the default receiver — and nobody checks the default pile at 3 AM. The failure is silent: the alert fired, Alertmanager processed it, and it disappeared into a Slack channel with 400 unread messages.

Alertmanager routes by label match — a missing 'team' label means the alert hits the default route, not your team's PagerDuty service. Test every label combination with amtool after any label change.
Deduplication groups alerts by identical labels — two alerts about the same problem with different labels (e.g. 'checkout' vs 'checkout-api') create two separate incidents. Label consistency is an operational requirement, not a style preference.
Suppression windows must be time-bounded (4h max), ticket-linked, and visible to the whole team. Blanket silences are the number one cause of undetected outages in teams with noisy alerting environments.
Inhibition rules let critical alerts suppress warnings for the same service — one page instead of five for the same incident. Without them, an on-call engineer wakes up to a flood of notifications that all point at the same root cause.

📊 Production Insight

40% of PagerDuty misroutes in teams with more than 20 alert rules trace back to a single missing or renamed label — usually discovered during an incident, not before it.

Blanket silences are the single most common cause of undetected outages in teams that have previously experienced high alert volumes.

Routing validation should be a CI check, not a manual verification step. amtool config routes test with a matrix of expected label combinations takes 2 minutes to configure and catches misroutes before production does.

Rule: every silence needs a ticket URL and a maximum 4-hour duration. Enforce this at the API level — a policy that isn't enforced technically isn't a policy, it's a suggestion.

🎯 Key Takeaway

Every alert label is a routing decision — get one label wrong and the page goes to the wrong team or nowhere.

Deduplication requires label identity — two alerts for the same problem with different labels generate two separate incidents.

Silences should be surgical, time-bounded, ticket-linked, and visible to every engineer on the team. Anything less is a liability.

Routing and Suppression Decision Tree

If: Alert fires in Prometheus but no PagerDuty page arrives
→ Use: Run 'amtool config routes test' with the exact labels from the alert. Check for active silences with 'amtool silence query' (it lists active silences by default). Verify the PagerDuty integration key is valid and the service is not in maintenance mode.

If: Same incident generates 5+ separate pages for the same service
→ Use: Configure Alertmanager inhibition rules: critical alerts should suppress warnings for the same service. Also check group_by: grouping by both alertname and service sends one notification per alert name, while grouping by service alone collapses them into a single notification at the risk of merging unrelated alerts.

If: Planned maintenance window approaching in the next 24 hours
→ Use: Create a targeted, time-bounded silence with amtool. Silence specific alert names, not the entire service. Set duration to the maintenance window plus 30 minutes buffer. Include the ticket URL in the comment field.

If: Multiple teams receive the same alert notification
→ Use: Alert labels are too broad for the current route tree. Add team-specific labels to route correctly. Verify with 'amtool config routes test' for each team's expected label set before deploying the change.

The Noise Budget: Why Your Pager Should Be Quiet 90% of the Time

If your on-call engineer's phone buzzes more than once a shift, you've already lost. Every alert is a tax on focus. A noisy pager trains people to ignore it. That's how real outages slip through.

Here's the hard truth: Most alerts are noise. CPU at 80% isn't an emergency. A five-minute latency spike isn't an outage. Your monitoring tools are cheap; your engineer's attention isn't.

Set a noise budget. Calculate your team's tolerance — maybe 10 alerts per week per rotation. Anything above that gets triaged. Either automate the response (self-healing) or kill the alert. If it fires but never requires human action for 30 days, it's not an alert — it's a metric. Graph it, don't page on it.
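The triage rule above can be sketched as a small audit function. It assumes you can export alert history as (name, fired_at, needed_human_action) tuples; that record shape, the alert names, and the function name are assumptions for illustration, while the 10-per-week budget and 30-day lookback come from the text.

```python
from datetime import datetime, timedelta

WEEKLY_BUDGET = 10             # roughly 10 alerts per week per rotation
LOOKBACK = timedelta(days=30)  # 30 days with no human action means "metric"


def triage(fires, now):
    """fires: iterable of (alert_name, fired_at, needed_human_action)."""
    recent = [f for f in fires if now - f[1] <= LOOKBACK]
    names = {name for name, _, _ in recent}
    acted_on = {name for name, _, needed in recent if needed}
    last_week = sum(1 for _, at, _ in recent if now - at <= timedelta(days=7))
    return {
        # Fired in the lookback window but never needed a human.
        "demote_to_dashboard": sorted(names - acted_on),
        "pages_last_7_days": last_week,
        "over_budget": last_week > WEEKLY_BUDGET,
    }


now = datetime(2026, 3, 1)
# Hypothetical history: a CPU alert every 12 hours for 10 days, never
# actionable, plus one error-rate alert that did need a response.
fires = [
    ("CheckoutCpuHigh", now - timedelta(hours=12 * i), False)
    for i in range(1, 21)
]
fires.append(("CheckoutErrorRate", now - timedelta(days=3), True))

result = triage(fires, now)
print(result)
```

With this sample history the CPU alert is the only candidate for demotion, and it alone accounts for most of the week's budget overrun.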

Your goal: when the pager goes off, it's a goddamn emergency. Every time. If your on-call engineer can't remember the last time they got paged for something that wasn't a real issue, you've built trust. That's the only metric that matters.

NoiseBudgetPolicy.yml · YAML

# io.thecodeforge - devops tutorial

alerting_policy:
  name: production-noise-budget
  rotation: weekly
  budget:
    max_alerts_per_rotation: 10
    alert_types:
      - critical
      - warning
    exceptions:
      - incident_id: INC-2025-039  # sev-1 only
  auto_remediation:
    cpu_high:
      threshold: 85%
      action: scale_up_replicas
  dead_alert_cleanup:
    if_fires_without_ack: 7 days
    action: disable

Output

Budget: max 10 alerts/week. Dead alerts disabled after 7 days. Auto-remediation on CPU > 85%.

⚠ Production Trap:

Don't confuse 'monitoring' with 'alerting'. If you page on every metric blip, you're not monitoring — you're spamming. Kill 90% of your alerts. You'll thank me when you're not woken up for a 2-minute spike.

🎯 Key Takeaway

If your pager fires more than once per shift, you are burning your team's attention. Automate the response or delete the alert; never tolerate noise.

The Handoff Protocol That Prevents the 3 AM Drop

The worst call you'll ever get is the one where the previous shift 'forgot' to mention the database connection pool is leaking. I've seen it. The on-call engineer who just got paged has no context, no runbook, and zero clue why the cluster is melting.

Stop treating handoffs like a handshake at a party. Make them surgical. Every shift change MUST include: a written summary of any ongoing issues, the current state of all active alerts, and a quick sync. No exceptions. If your team is distributed across time zones, write it in a shared doc — not Slack. Slack scrolls away. Docs don't.

Standardize. A handoff checklist — timestamp, alerts fired, actions taken, next steps. If you're not doing this, you're gambling. One missed detail can turn a 10-minute incident into a 2-hour post-mortem about process failure.

Here's the rule: The incoming engineer should be able to start debugging within 30 seconds of reading the handoff. If they can't, you failed. Automate the status dump — pull alert history, open incidents, and runbook state into a single handoff report. No excuses.
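One way to picture the automated status dump is a function that renders already-fetched data into a timestamped report. This sketch assumes the alert history and open incidents have been pulled from your tooling beforehand; the function name, the report layout, and the alert names are hypothetical, and the incident details reuse the example values from this section.

```python
from datetime import datetime, timezone


def render_handoff(generated_at, alerts_fired, open_incidents,
                   actions_taken, next_steps):
    """Render a plain-text handoff report from already-collected data."""
    lines = [f"On-call handoff ({generated_at.isoformat()})", ""]
    lines.append(f"Alerts fired: {len(alerts_fired)}")
    lines.extend(f"- {name}" for name in alerts_fired)
    lines.append("")
    lines.append(f"Open incidents: {len(open_incidents)}")
    lines.extend(
        f"- {incident_id}: {summary}"
        for incident_id, summary in open_incidents.items()
    )
    lines.append("")
    lines.append(f"Actions taken: {actions_taken}")
    lines.append(f"Next steps: {next_steps}")
    return "\n".join(lines)


report = render_handoff(
    generated_at=datetime(2025, 4, 14, 18, 0, tzinfo=timezone.utc),
    alerts_fired=["CheckoutErrorRate", "CheckoutLatencyP99", "DbConnPoolSaturation"],
    open_incidents={"INC-045": "database_conn_pool_leak"},
    actions_taken="restarted primary replica, monitor for 30 mins",
    next_steps="check slow queries at 0600 UTC",
)
print(report)
```

Keeping rendering separate from data collection means the same report can be written to a shared doc, attached to the rotation handover, or diffed against the previous shift's report.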

HandoffChecklist.yml · YAML

# io.thecodeforge - devops tutorial

handoff_protocol:
  required:
    - current_alert_count: 3
    - active_incidents:
        - INC-045: database_conn_pool_leak
    - actions_taken: "restarted primary replica, monitor for 30 mins"
    - next_steps: "check slow queries at 0600 UTC"
  automation:
    generate_report:
      trigger: shift_end
      sources:
        - pagerduty_incidents
        - runbook_current_state
      output: shared_drive/handoffs/2025-04-14.md
    sync_required: true  # 5-min Slack huddle or async doc

Output

Handoff report generated: shared_drive/handoffs/2025-04-14.md. Active incidents: 1. Sync required — yes.

💡Senior Shortcut:

Write the handoff doc as you triage. Don't wait until the end of shift. Future you (or the person replacing you) will know exactly what you were thinking. It's cheap insurance against amnesia.

🎯 Key Takeaway

A handoff without a shared, timestamped doc is not a handoff — it's a wish. Automate the report. Every shift. No exceptions.

● Production incident · POST-MORTEM · severity: high

The Silenced PagerDuty Service That Hid a 4-Hour Outage

Symptom

Checkout service returning 500 errors for 4 hours. No PagerDuty alert fired. Discovery came via a customer support ticket that a customer escalated directly to the engineering Slack channel — the worst possible detection mechanism for a payment-critical service.

Assumption

The team assumed PagerDuty was healthy because the checkout service showed as 'active' in the PagerDuty dashboard. Active in the dashboard means the service exists and has an escalation policy — it says nothing about active silences. Nobody on the team knew to check the silence list because silences had always been temporary and personal. They'd never been treated as team-visible infrastructure.

Root cause

An engineer had created a PagerDuty silence rule matching all alerts from the checkout service with a 7-day duration. This happened on a Tuesday after their third consecutive night of false CPU alerts. Each one was the same: the checkout service's batch reconciliation job ran at 2 AM, pegged CPU to 85% for 90 seconds, and PagerDuty fired a P2. The engineer would acknowledge, see CPU returning to baseline, and go back to sleep. On the fourth night, they created a silence rule and went to bed for the rest of the week.

The silence was entirely rational from their perspective. The CPU alerts were never actionable: they fired on a scheduled batch job that ran the same way every night. Nobody had ever fixed them because nobody had ever flagged them as wrong. They existed because a previous engineer added a CPU > 80% alert after a different incident two years prior, and the alert had silently outlived its usefulness.

The silence expired the day after the outage ended. The team only discovered it existed when they pulled the PagerDuty audit log trying to understand why they weren't paged.

Fix

The immediate fix was removing the blanket silence and deleting every CPU-based alert from the checkout service. Those were replaced the same day with two symptom-based alerts: error rate above 5% for 2 minutes (critical, pages primary on-call) and p99 latency above 2 seconds for 5 minutes (warning, posts to Slack with a PagerDuty low-urgency incident).

The structural fix took two weeks. The team implemented a PagerDuty policy requiring all silence rules to include a comment field with a ticket URL. Silences without a linked ticket are blocked at the API level via a custom webhook that validates the comment format before allowing creation. Silence duration was capped at 4 hours maximum, enforced by the same webhook. A daily Slack digest was added showing all active silences across every service, visible to the whole engineering team in the #oncall-status channel.

The cultural fix took longer. The team ran a retrospective focused on why the CPU alert had existed for two years without anyone questioning it. The answer was that nobody felt empowered to delete an alert someone else had written: deleting an alert felt like removing a safety mechanism, even if that mechanism had never once caught a real problem. They added an explicit team agreement: any alert that fires more than three times in 30 days without a corresponding code fix is automatically a backlog item, not a monitoring truth. The engineer who created the silence was not blamed. They were the person who finally made the failure mode visible.

Key lesson

Noisy alerts don't just waste time — they systematically train engineers to silence everything, including the alerts that matter. The silence was the correct response to a broken alerting system. Fix the system, not the behaviour.
Every silence rule must require a linked ticket and auto-expire within 4 hours maximum. Enforce this at the API level, not just as a policy — policies get forgotten under pressure.
Add a daily digest of active silences visible to the whole team. Silences should be as visible as deployments, not private workarounds buried in an individual's PagerDuty account.
If an engineer silences an entire service, treat it as a signal that your alert design has failed, not that the engineer made a mistake. The engineer is the symptom. The noise is the disease.

Production debug guide: Diagnose why your alerts aren't working before your next incident proves it (6 entries)

Symptom · 01

Alert fires but on-call engineer takes no action within 15 minutes

→

Fix

Check two things before blaming the engineer. First: does the alert have a runbook URL in its annotations? If not, the engineer is doing archaeology in the middle of an incident — they're not being slow, they're being handed an undocumented system at 3 AM. Add a runbook with exactly three numbered steps: (1) confirm the alert is real with a specific command or dashboard link, (2) apply the known mitigation if one exists, (3) escalation criteria and who to call. Second: is the alert actually actionable, or does it fire on something that resolves itself 80% of the time? If the engineer has learned to wait 5 minutes to see if it self-clears, your alert has already lost their trust.

Symptom · 02

Same alert fires 3+ times per week with no code fix shipped

→

Fix

This is reliability debt wearing a monitoring costume. Create a backlog ticket for the root cause — give it a severity label and a sprint assignment, not a 'someday' tag. Suppress the alert with amtool using the ticket URL as the required comment, and set a 4-hour expiry that must be renewed manually. If the ticket keeps getting deprioritised sprint after sprint, escalate it: an alert that fires three times per week without a fix is costing your team 15+ engineer-hours per month in interrupted sleep and cognitive context-switching. Put that number in the ticket.

Symptom · 03

Engineer creates a blanket silence covering all alerts from a service

→

Fix

Investigate immediately — not to discipline the engineer, but because this is the most reliable leading indicator of alert fatigue reaching critical mass. Pull the alert history for that service for the last 30 days. Count how many alerts fired, how many were acknowledged within 5 minutes, and how many had documented mitigations. The pattern will show you exactly where your alert design broke down. Review every alert for that service against the 3 AM test and delete the non-actionable ones before lifting the silence. Lifting the silence without fixing the alerts just restarts the countdown to the next silence.

Symptom · 04

Escalation to secondary on-call happens more than once per week

→

Fix

This is your canary for primary alert volume being too high. Secondary escalation should be reserved for genuine unavailability — phone died, emergency, deep sleep during an off-peak window — not for situations where the primary engineer got paged 4 times in 3 hours and is physically unable to acknowledge fast enough. Pull the incident timeline: if escalations cluster between midnight and 6 AM, you have a volume problem. If they're spread throughout the day, you may have a coverage gap. Either way, audit alert count per shift before touching the rotation. Target: no more than 2 actionable pages per 12-hour shift, per Google SRE guidance.
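The "do escalations cluster overnight?" check is simple arithmetic once you have the timestamps. A minimal sketch, assuming escalation times are available as local-time datetimes exported from your incident tool; the function name and the sample timestamps are made up for illustration.

```python
from datetime import datetime


def overnight_share(escalation_times):
    """Fraction of escalations that happened between 00:00 and 06:00."""
    if not escalation_times:
        return 0.0
    overnight = sum(1 for t in escalation_times if 0 <= t.hour < 6)
    return overnight / len(escalation_times)


# Hypothetical escalations to secondary on-call over one week.
escalations = [
    datetime(2026, 3, 2, 1, 15),
    datetime(2026, 3, 3, 2, 40),
    datetime(2026, 3, 4, 14, 5),
    datetime(2026, 3, 5, 4, 55),
]

share = overnight_share(escalations)
print(f"{share:.0%} of escalations happened between midnight and 6 AM")
```

A share well above what six hours out of twenty-four would predict by chance (25%) points toward overnight page volume rather than a daytime coverage gap.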

Symptom · 05

On-call handoff meeting has no documentation from the previous rotation

→

Fix

Institute a mandatory handoff template enforced by your incident management tooling, not by trust. The template has four fields: alerts fired this week (count and names), actions taken (what was mitigated and how), follow-up tickets created (with links), and anything the next on-call needs to know that isn't in a ticket yet. The last field is where institutional knowledge lives or dies. Without it, the same database quirk that surprised the outgoing engineer will surprise the incoming one two weeks later. Make completion of the handoff template a requirement before the rotation transfer confirms in PagerDuty.

Symptom · 06

SLO burn rate alert fires but team doesn't know the current error budget status

→

Fix

The alert is doing its job — the dashboard isn't. A burn rate alert tells you the rate of consumption has exceeded a threshold. The dashboard needs to show you three things at a glance: how much budget you started the month with (in absolute error count and percentage), how much you've consumed so far, and at the current burn rate, when will you exhaust it. Engineers who don't have that context when the alert fires will either over-respond (treating a slow-burn warning as a full incident) or under-respond (dismissing a fast-burn critical as a blip). The dashboard is what turns the alert from a notification into a decision.
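The three dashboard numbers described above come from straightforward arithmetic. This sketch assumes a request-based SLO, a known expected request volume for the period, and that the last hour's error rate continues unchanged; the function name and every sample figure are illustrative assumptions.

```python
def budget_status(slo, expected_requests, errors_so_far, errors_per_hour):
    """Summarise error budget state for a request-based SLO."""
    budget_errors = (1 - slo) * expected_requests
    remaining = max(budget_errors - errors_so_far, 0.0)
    hours_left = remaining / errors_per_hour if errors_per_hour > 0 else float("inf")
    return {
        "budget_errors": budget_errors,                      # what you started with
        "consumed_fraction": errors_so_far / budget_errors,  # how much is gone
        "hours_to_exhaustion": hours_left,                   # at the current rate
    }


# Hypothetical month: 99.9% SLO, 10 million expected requests,
# 4,000 errors so far, currently failing about 50 requests per hour.
status = budget_status(
    slo=0.999,
    expected_requests=10_000_000,
    errors_so_far=4_000,
    errors_per_hour=50,
)

print(f"budget:    {status['budget_errors']:.0f} errors")
print(f"consumed:  {status['consumed_fraction']:.0%}")
print(f"exhausted: in about {status['hours_to_exhaustion']:.0f} hours")
```

Under these assumptions the budget is about 10,000 errors, roughly 40% is consumed, and about 120 hours remain at the current rate. That is a slow burn worth a ticket, not a 3 AM incident, which is exactly the judgement the dashboard should make easy.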

★ On-Call Quick Debug Cheat Sheet

Fast diagnostics for alerting pipeline issues. Commands assume Prometheus, Alertmanager, and PagerDuty. Run these in order; each one narrows the failure surface before you need the next.

Alert not firing despite metric breach visible in Prometheus UI

Immediate action

Check Prometheus rule evaluation state — the metric being visible in the UI and the alerting rule evaluating correctly are two separate things

Commands

promtool check rules /etc/prometheus/rules/*.yml

curl -s localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | select(.state=="inactive") | {name: .name, lastError: .lastError}'

Fix now

Verify the 'for' duration isn't longer than the breach window — if your metric has only been above threshold for 90 seconds and your 'for' is 2m, the alert is in pending state, not firing. Check for label mismatches between the alert expression and the target metric: a single label difference between what your rule selects and what your metric exports results in zero matches and a permanently inactive alert. Use the Prometheus UI's Table view on the rule expression to confirm it returns data before assuming the rule logic is correct.

Threshold-Based vs SLO Burn Rate Alerting

Aspect	Threshold-Based Alerting	SLO Burn Rate Alerting
What it monitors	Raw metric value at a point in time (e.g. error rate > 1%)	Rate of error budget consumption across a rolling time window
Alert volume	High — fires on any metric breach, including brief self-correcting spikes	Low — requires sustained budget impact across multiple time windows simultaneously
False positive rate	High — transient spikes, batch jobs, autoscaling events all trigger pages	Low — multi-window filtering requires sustained degradation, not momentary breaches
Business alignment	Poor — a 1% error rate threshold has no direct relationship to your SLA	Excellent — directly tied to whether you'll breach your reliability commitment to users
Complexity to implement	Low — one PromQL expression per alert, straightforward to write	Medium — requires a defined SLO, burn rate math, and multi-window expressions
Best for	Early-stage services, simple infrastructure monitoring, teams new to observability	Production services with defined SLOs, teams with mature observability practices
Recovery detection	Alert resolves immediately when metric drops below threshold	Alert resolves when burn rate normalises — may persist briefly after the incident resolves
On-call experience	Poor — engineers get woken for self-healing spikes they can't affect and don't understand	Good — every page represents a genuine, sustained threat to a business commitment
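The multi-window filtering in the right-hand column can be sketched as a two-condition check. The 14.4 threshold and the idea of pairing a long window with a short one follow the commonly cited Google SRE Workbook example (1-hour and 5-minute windows); they are assumptions here, not values taken from this article, and the function names are hypothetical.

```python
def burn_rate(error_ratio, slo):
    """How many times faster than 'exactly on budget' errors are arriving."""
    return error_ratio / (1 - slo)


def should_page(long_window_error_ratio, short_window_error_ratio,
                slo, threshold=14.4):
    """Page only if BOTH windows are burning faster than the threshold.

    The long window proves the problem is sustained; the short window
    proves it is still happening right now.
    """
    return (
        burn_rate(long_window_error_ratio, slo) > threshold
        and burn_rate(short_window_error_ratio, slo) > threshold
    )


SLO = 0.999

# Sustained degradation: 2% errors over the hour, 3% in the last 5 minutes.
print(should_page(0.02, 0.03, SLO))

# Brief spike: the last 5 minutes look bad, but the hour is healthy.
print(should_page(0.002, 0.05, SLO))

# Already recovered: the hour looks bad, but the last 5 minutes are fine.
print(should_page(0.02, 0.0005, SLO))
```

Only the first case pages. The second is the self-correcting spike a plain threshold alert would have woken someone for, and the third is why the short window also lets the alert resolve soon after recovery.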

Key takeaways

Alert on symptoms users feel (latency, error rate, availability), not causes engineers investigate. CPU, memory, and disk belong on dashboards where engineers look during business hours, not on pagers that wake people up at 3 AM.

The 'for' duration in Prometheus is your most powerful noise filter. A 2-minute minimum for critical alerts and 5-minute minimum for warnings eliminates the entire category of false pages from transient spikes and single bad scrape cycles.

SLO burn rate alerting with multi-window expressions dramatically reduces alert volume while improving signal quality. It only pages when your reliability commitment to users is genuinely threatened, not when an arbitrary metric threshold is briefly crossed.

The monthly alert audit (30 minutes, data in hand, bias toward deleting) is the single highest-leverage practice in on-call operations. An alert that fires three times without a fix is backlog work disguised as monitoring.

Every alert that pages someone must have a runbook. Not documentation about the system: a numbered response guide starting with 'confirm it's real,' followed by 'known mitigations,' followed by 'when and who to escalate to.' Architecture context belongs at the bottom.

Blanket silences are the number one cause of undetected outages in teams that have experienced alert fatigue. Enforce 4-hour maximum duration and mandatory ticket URLs at the API level; policies that aren't technically enforced are suggestions.

Common mistakes to avoid

6 patterns

Alerting on every metric that Prometheus exports by default

Symptom

50+ alerts firing weekly, most of them self-resolving within minutes. Engineers start to treat every page with suspicion — waiting to see if it clears before investigating. A real outage gets missed because the on-call engineer has learned that 70% of pages don't need a response. The alert system is working technically but has lost the team's trust operationally.

Fix

Run an alert audit before the volume gets this bad, but if you're already here, audit under pressure. Pull every alert that fired in the last 30 days. For each one, apply the 3 AM test: if this fired at 3 AM, would the on-call engineer know what concrete action to take within 5 minutes? If the answer is no — delete it today, not after discussion. Add alerts back only when an actual incident demonstrates that the metric predicted or explained a user-facing problem. Start from incidents, not from Prometheus's metric catalogue.

Setting 'for: 0m' on critical alerts

Symptom

A single bad Prometheus scrape cycle — which happens on virtually every production system at some frequency — triggers a P1 page at 3 AM. The on-call engineer investigates for 20 minutes, finds everything healthy, and goes back to sleep with slightly less trust in the system. After this happens three or four times, they start adding 'wait 5 minutes before looking at the laptop' as an informal personal policy — which is indistinguishable from ignoring the alert.

Fix

Set a minimum 'for' duration of 2 minutes for critical alerts and 5 minutes for warnings. The 'for' duration requires the expression to be in breach at every rule evaluation for the specified time before the alert transitions from 'pending' to 'firing'. A single bad scrape affects one scrape interval, typically 15 to 60 seconds. A real incident lasts minutes to hours. The 2-minute minimum adds at most two minutes of detection delay to a real incident and eliminates the category of false pages that train engineers to distrust the system. If a team member argues 'but we need faster detection', the answer is: you need faster detection of real incidents, not faster false alarms.
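To see why the 'for' duration filters out single bad scrapes, here is a simplified model of the pending-to-firing transition. It is a sketch of the behaviour described above, not Prometheus's real evaluation code; the 15-second evaluation interval and the function name are assumptions.

```python
def alert_states(breaches, eval_interval_s, for_s):
    """Return the alert state after each rule evaluation.

    breaches: one boolean per evaluation, True if the expression matched.
    """
    states = []
    breach_started = None  # time at which the current unbroken breach began
    for i, breached in enumerate(breaches):
        now = i * eval_interval_s
        if not breached:
            breach_started = None  # any healthy evaluation resets the clock
            states.append("inactive")
            continue
        if breach_started is None:
            breach_started = now
        held_long_enough = now - breach_started >= for_s
        states.append("firing" if held_long_enough else "pending")
    return states


single_blip = [False, True, False]
sustained = [True] * 10

# With for: 2m, one bad evaluation never gets past 'pending'.
print(alert_states(single_blip, eval_interval_s=15, for_s=120))

# With for: 0m, the same blip pages immediately.
print(alert_states(single_blip, eval_interval_s=15, for_s=0))

# A sustained breach fires once it has held for the full two minutes.
print(alert_states(sustained, eval_interval_s=15, for_s=120))
```

In this model the sustained breach spends its first eight evaluations in 'pending' and fires on the ninth, which is the two minutes of delay you trade for never paging on a one-evaluation blip.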

Writing runbooks that describe the system instead of the response

Symptom

On-call engineer opens the runbook during an active incident, reads two paragraphs about the service's architecture, a diagram of its database relationships, and a section on its deployment history — and then has to scroll to find the first actionable step. By the time they've oriented themselves in the document, 4 minutes have elapsed and they're still not sure what command to run.

Fix

Runbooks are incident tools, not architecture documents. Structure every runbook in a strict order: first, how to confirm the alert is real (one specific command or dashboard link, nothing more); second, the known mitigations in order of likelihood and simplicity; third, explicit escalation criteria and who to contact. Architecture context, service dependency diagrams, and historical incident notes belong at the bottom. The engineer responding to an incident should reach the first executable step within 30 seconds of opening the runbook. If they can't, the runbook is structured wrong.

Creating blanket silences without a ticket URL or expiration time

Symptom

An engineer silences all alerts from a service to survive a noisy rotation. The silence outlives its justification. A real outage occurs. No page fires. Discovery comes from a customer complaint 4 hours later. This is not a hypothetical — it's the production incident at the top of this article, and variations of it happen at most organisations that have experienced sustained alert fatigue.

Fix

Enforce silence hygiene at the API level, not just as a policy. Use a webhook that validates silence creation requests: silences without a ticket URL in the comment field are rejected. Maximum silence duration is 4 hours — enforced by the webhook, not by trust. Add a daily Slack digest of active silences visible to every engineer on the team. Treat any silence older than 4 hours without a renewal as a bug that requires investigation. The policy is simple: if you can't explain why a silence exists in one sentence and point to a ticket, it shouldn't exist.

Treating on-call as 'just part of the job' without compensation or meaningful support

Symptom

Senior engineers start leaving, citing 'work-life balance' in exit interviews when the real cause is three months of 3 AM false pages with no visible team investment in reducing them. The engineers who built the systems and know how to fix things quietly transfer to teams with better on-call culture. Institutional knowledge — the kind that lives in someone's head and not in any runbook — walks out with them.

Fix

Offer explicit on-call pay, time-off-in-lieu, or reduced sprint commitments during on-call weeks. But compensation alone doesn't fix this — it just makes the burden slightly more tolerable. Pair it with visible investment: show the team, in every sprint review, the alert volume trend, the audit results, the alerts that were deleted this month. Engineers who see the organisation treating alert noise as a real cost that deserves engineering resources stay longer than engineers who receive on-call pay for an experience that never gets better.

Not validating alert routing after changing Prometheus labels or service names

Symptom

Team renames a service from 'checkout' to 'checkout-service' during a refactoring sprint. Every alert rule is updated to use the new name. The Alertmanager route tree still matches on 'checkout'. All alerts for the service now route to the default Slack channel instead of PagerDuty. Nobody notices until an incident fires during the next rotation and the primary on-call engineer receives no page.

Fix

Add 'amtool config routes test' to your CI pipeline as a required check on any change to alertmanager.yml or alert rule files. Maintain a test matrix of label combinations for each service that asserts the expected receiver. This catches misroutes in CI instead of in production. It takes a couple of minutes to configure and runs in under a second. The cost of not doing this is discovering a routing gap during an incident, at which point the gap becomes part of the incident timeline.
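A test matrix like the one described could be driven by a short script. This is a sketch under assumptions: the receiver names and label sets are hypothetical, and it relies on 'amtool config routes test' accepting --config.file and --verify.receivers and exiting non-zero when the resolved receiver differs from the expected one, so check that against your amtool version.

```python
import subprocess

# Hypothetical matrix: (labels on the alert, receiver we expect it to reach).
ROUTE_MATRIX = [
    ({"service": "checkout-service", "severity": "critical"}, "pagerduty-checkout"),
    ({"service": "checkout-service", "severity": "warning"}, "slack-checkout"),
]


def build_command(config_file, labels, receiver):
    """Build the amtool invocation for one row of the matrix."""
    args = [
        "amtool", "config", "routes", "test",
        f"--config.file={config_file}",
        f"--verify.receivers={receiver}",
    ]
    args += [f"{name}={value}" for name, value in sorted(labels.items())]
    return args


def run_matrix(config_file, matrix=ROUTE_MATRIX, runner=subprocess.run):
    """Run every row and return the rows whose routing check failed."""
    failures = []
    for labels, receiver in matrix:
        result = runner(build_command(config_file, labels, receiver))
        if result.returncode != 0:
            failures.append((labels, receiver))
    return failures


# Show the commands a CI job would run, without executing them.
for labels, receiver in ROUTE_MATRIX:
    print(" ".join(build_command("alertmanager.yml", labels, receiver)))
```

In CI, a step would call run_matrix("alertmanager.yml") and fail the build if the returned list is non-empty. Injecting the runner keeps the script testable on machines where amtool is not installed.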

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR

What's the difference between alerting on causes versus symptoms, and ca...

Q02 · SENIOR

How would you design an on-call rotation for a team of six engineers cov...

Q03 · SENIOR

A senior engineer proposes adding a CPU utilization alert at 80% thresho...

Q04 · SENIOR

Explain multi-window SLO burn rate alerting and why single-window burn r...

Q05 · SENIOR

Your team's alert volume has tripled after a major feature launch. Walk ...

Q01 of 05 · SENIOR

What's the difference between alerting on causes versus symptoms, and can you give a concrete example of each from a web service context?

ANSWER

Cause-based alerting fires on infrastructure metrics that might indicate a problem is developing: CPU utilisation at 80%, memory pressure at 90%, disk filling up. Symptom-based alerting fires on what users actually experience: error rates above threshold, request latency exceeding an SLO target, failed health checks.

Concrete example: CPU at 80% is a cause alert. It might mean a problem is developing, or it might be a scheduled batch job, a GC cycle, or autoscaling warming up after a traffic spike. The CPU percentage alone gives the on-call engineer nothing actionable. 'Is the user impacted?' requires additional investigation before you even know whether the page warranted waking someone up.

The checkout API returning 500 errors to 5% of users is a symptom alert. Users are being impacted right now. The on-call engineer has a clear starting point: check recent deployments, look at downstream dependency health, check the error logs for the specific 500 pattern. The symptom tells you action is required and gives you a direction to start.

The practical test is what I call the 3 AM test: if this fires at 3 AM, can the on-call engineer take a concrete, directed action within 5 minutes? For CPU alerts, the answer is almost always 'investigate and see if anything else is wrong', which is investigation, not action. For symptom alerts on error rate or latency, the answer is 'yes, look at these specific things in this order.' That's the distinction that determines whether something belongs on a pager or a dashboard.

FAQ · 5 QUESTIONS

Frequently Asked Questions

How many alerts should a healthy on-call engineer receive per shift?

What's the difference between an SLO and an SLA, and why does it matter for alerting?

Should I use PagerDuty or OpsGenie for on-call routing?

How do you handle alerting during planned maintenance windows?

What's the right 'for' duration for a critical alert?

Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Notes here come from systems that actually shipped.

✓ Verified: production tested

Last updated: July 27, 2026

1,713 articles, all by Naren
