
Alerting and On-Call Best Practices That Actually Reduce Burnout

In Plain English 🔥
Imagine your house has a smoke alarm that goes off every time you make toast. After a week, you'd probably rip the battery out — and then miss a real fire. That's exactly what happens with poorly designed software alerts. Good alerting means your alarm only sounds when the house is actually burning, so the person on duty actually pays attention when it does.

At 3 AM, your phone screams. You scramble to your laptop, bleary-eyed, only to discover the alert fired because a CPU spike lasted four seconds and self-corrected before you even logged in. You've lost sleep over nothing — and this happens four nights a week. This is the lived reality at hundreds of engineering teams right now, and it's not a monitoring problem. It's a culture problem disguised as a technical one. Alert fatigue is the silent killer of on-call programs, and it costs companies real engineers who quietly quit rather than endure another sleepless rotation.

The root cause isn't that teams care too little about monitoring — it's that they add alerts reactively, after every incident, without ever pruning the ones that stop being useful. Over time, the alert system becomes a noise machine. Engineers stop trusting it, start silencing pages, and miss the signals that actually matter. The fix isn't more dashboards. It's disciplined, intentional alerting philosophy backed by concrete practices for thresholds, routing, escalation, and rotation design.

By the end of this article you'll know how to audit your existing alert stack and throw out the ones that don't serve you, how to write alerts that fire on symptoms not causes, how to structure an on-call rotation that doesn't destroy your team's wellbeing, and how to use real tooling — Prometheus alerting rules, PagerDuty routing logic, and runbook templates — to make all of it operational and repeatable.

Alert on Symptoms, Not Causes — The Golden Rule of Monitoring

The most common alerting mistake is alerting on what you think is wrong instead of what the user actually experiences. A high CPU alert fires and the engineer investigates — but CPU being high isn't inherently bad. Maybe a batch job is running. Maybe it's expected load. The user doesn't care about CPU. They care whether the checkout page loads.

Symptomatic alerting means your alerts fire on things users feel directly: high latency, elevated error rates, failed health checks. These map closely onto the Four Golden Signals from Google's SRE book — latency, traffic, errors, and saturation. Alerts on these signals are almost always actionable: if the error rate is 15%, something is broken for users right now.

Causal metrics like CPU, memory, and disk are better suited for dashboards and capacity planning, not paging. You investigate them after an alert fires to understand why — not to decide whether something is wrong.

The practical test: before adding any alert, ask yourself 'If this fires at 3 AM, can the on-call engineer take a concrete action within five minutes?' If the answer is no, it belongs on a dashboard, not a pager.

prometheus_symptom_alerts.yml · YAML
# prometheus_symptom_alerts.yml
# These rules live in your Prometheus alerting rules directory.
# Load them via the 'rule_files' block in prometheus.yml.

groups:
  - name: user_facing_symptoms
    # Evaluate every 60 seconds
    interval: 60s
    rules:

      # GOOD ALERT: Fires on what users experience — high error rate
      # This is a SYMPTOM. Users are getting 5xx errors. Action is clear.
      - alert: HighErrorRateShopping
        expr: |
          (
            sum(rate(http_requests_total{service="shopping-cart", status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="shopping-cart"}[5m]))
          ) > 0.05
        # Wait 2 minutes before firing — avoids alerting on a single bad scrape
        for: 2m
        labels:
          severity: critical
          team: checkout
        annotations:
          summary: "Shopping cart error rate above 5% for 2 minutes"
          # Link directly to the runbook — engineers shouldn't have to search
          runbook_url: "https://runbooks.internal/shopping-cart-errors"
          description: "Current error rate: {{ $value | humanizePercentage }}"

      # GOOD ALERT: Fires on latency degradation users feel
      - alert: CheckoutLatencyHigh
        expr: |
          histogram_quantile(
            0.99,
            sum(rate(http_request_duration_seconds_bucket{
              service="checkout", handler="/api/purchase"
            }[10m])) by (le)
          ) > 2.0
        for: 5m
        labels:
          severity: warning
          team: checkout
        annotations:
          summary: "p99 checkout latency exceeded 2 seconds"
          runbook_url: "https://runbooks.internal/checkout-latency"
          description: "p99 latency is {{ $value }}s. SLO threshold is 2.0s."

      # BAD ALERT (shown for contrast — commented out intentionally)
      # Do NOT do this. CPU spiking doesn't mean users are affected.
      # - alert: HighCPU
      #   expr: node_cpu_utilization > 0.80
      #   for: 1m
      #   annotations:
      #     summary: "CPU is high" # <-- What should the engineer DO with this?
Output:
# When Prometheus evaluates these rules and HighErrorRateShopping fires:

ALERT HighErrorRateShopping
Labels:
alertname = HighErrorRateShopping
severity = critical
team = checkout
Annotations:
summary = Shopping cart error rate above 5% for 2 minutes
description = Current error rate: 7.34%
runbook_url = https://runbooks.internal/shopping-cart-errors
State: firing
ActiveAt: 2024-03-15T03:14:22Z
⚠️ Pro Tip: The 3 AM Test. Before committing any new alert rule, ask your team: 'If this woke someone up at 3 AM, would they know exactly what to do in under 5 minutes?' If anyone hesitates, the alert needs a better runbook, a higher threshold, or to be demoted to a dashboard panel instead.

Structuring On-Call Rotations That Don't Destroy Your Team

An on-call rotation is a social contract as much as it is a technical system. Engineers who feel the rotation is fair, predictable, and well-supported stay in it. Engineers who feel it's a punishment churn — and the institutional knowledge they carry walks out with them.

The fundamentals of a healthy rotation start with team size. You need at least four engineers to build a weekly rotation that gives people genuine recovery time. With fewer, someone is always on-call or just got off on-call, and cognitive load never fully resets.
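The four-engineer floor is easy to sanity-check with back-of-the-envelope arithmetic. The sketch below (an illustrative calculation, not a formal staffing model) shows how on-call load and recovery time change with rotation size:

```python
# Rough on-call load for a weekly rotation of n engineers.
def oncall_load(n_engineers: int) -> dict:
    """Fraction of the year spent on-call, and weeks of recovery between shifts."""
    return {
        "fraction_on_call": round(1 / n_engineers, 2),
        "weeks_per_year": round(52 / n_engineers, 1),
        "weeks_off_between_shifts": n_engineers - 1,
    }

for n in (2, 3, 4, 6):
    print(n, oncall_load(n))
# With 2 engineers, each spends half the year on-call with only one week off
# between shifts; with 4, it drops to a quarter of the year with three weeks off.
```

The takeaway: below four people, the "weeks off between shifts" number never gets large enough for cognitive load to reset.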

Secondary on-call — where a second engineer is ready to be escalated to if the primary doesn't acknowledge in ten minutes — is non-negotiable for any service with a real SLA. It's also a forcing function: if escalation to secondary happens more than once a week, your primary alert volume is too high.

On-call handoff meetings deserve as much ceremony as a sprint retrospective. The outgoing engineer should document what fired, what was investigated, and what follow-up work was created. Without this, the same incidents repeat endlessly because the fixes live only in one person's head.

Compensation matters too. Whether that's explicit on-call pay, time-off-in-lieu, or reduced sprint commitments during on-call weeks, teams that acknowledge the burden retain engineers far longer than teams that treat it as 'just part of the job'.

pagerduty_escalation_policy.tf · Terraform (HCL)
# pagerduty_escalation_policy.yml
# This is a Terraform representation of a PagerDuty escalation policy.
# Apply with: terraform apply
# Requires: pagerduty Terraform provider configured with your API token.

resource "pagerduty_escalation_policy" "checkout_team" {
  name      = "Checkout Team - Production Escalation"
  num_loops = 2  # After 2 full loops with no acknowledgment, re-alert from top

  # LEVEL 1: Primary on-call engineer gets paged first
  rule {
    escalation_delay_in_minutes = 10  # Wait 10 min before escalating to secondary

    target {
      type = "schedule_reference"
      # This schedule rotates weekly across your primary on-call engineers
      id   = pagerduty_schedule.checkout_primary_rotation.id
    }
  }

  # LEVEL 2: Secondary on-call gets paged if primary doesn't acknowledge
  rule {
    escalation_delay_in_minutes = 10  # Another 10 min before going to manager

    target {
      type = "schedule_reference"
      # Secondary rotation — separate schedule, separate engineers
      id   = pagerduty_schedule.checkout_secondary_rotation.id
    }
  }

  # LEVEL 3: Engineering manager is the last resort, not the first call
  rule {
    escalation_delay_in_minutes = 15

    target {
      type = "user_reference"
      # Direct page to the EM; this should be rare
      id   = data.pagerduty_user.checkout_engineering_manager.id
    }
  }
}

# Weekly rotating schedule for primary on-call
resource "pagerduty_schedule" "checkout_primary_rotation" {
  name      = "Checkout Primary On-Call"
  time_zone = "America/New_York"

  layer {
    name                         = "Weekly Rotation"
    start                        = "2024-01-08T09:00:00-05:00"
    # Rotate every 7 days — weekly on-call is the industry standard baseline
    rotation_turn_length_seconds = 604800
    rotation_virtual_start       = "2024-01-08T09:00:00-05:00"

    # Engineers in the rotation — minimum 4 for healthy coverage
    users = [
      data.pagerduty_user.alice.id,
      data.pagerduty_user.bob.id,
      data.pagerduty_user.carol.id,
      data.pagerduty_user.david.id,
    ]
  }
}
Output:
# terraform apply output:

pagerduty_schedule.checkout_primary_rotation: Creating...
pagerduty_schedule.checkout_primary_rotation: Creation complete after 1s [id=P3X8K2A]

pagerduty_escalation_policy.checkout_team: Creating...
pagerduty_escalation_policy.checkout_team: Creation complete after 1s [id=PQRST99]

Apply complete! Resources: 2 added, 0 changed, 0 destroyed.

# Escalation flow when an alert fires:
# T+0:00 → Alice (primary) receives page via push + SMS
# T+10:00 → No ack? Bob (secondary) receives page
# T+20:00 → No ack? Engineering Manager receives page
# T+35:00 → Loop 2 begins from Level 1 if still unacknowledged
⚠️ Watch Out: Manager-First Escalation. Never put a manager at Level 1 escalation. It trains engineers to wait for someone else to handle things, creates a single point of failure, and burns out your EM fast. Managers belong at Level 3 as a true last resort when your entire rotation has gone dark.

Runbooks, SLOs, and the Alert Audit Loop That Keeps You Sane

Every alert that fires should have a runbook — a living document that tells the on-call engineer exactly what to check, what commands to run, and what decisions to make. Without runbooks, every incident is an archaeology project. With them, even a junior engineer who's never seen the service before can respond effectively.

Runbooks don't need to be perfect on day one. Start with three headings: 'What is happening', 'Immediate steps to mitigate', and 'How to escalate if mitigation fails'. As incidents happen, the on-call engineer appends what they learned. Within a few months you have genuinely battle-tested documentation.

SLO-based alerting takes this further by tying your alerts to your error budget. Instead of alerting when error rate exceeds 1%, you alert when you're burning through your monthly error budget faster than sustainable. This dramatically reduces alert volume while ensuring you page only when business commitments are actually at risk.
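The arithmetic behind burn rates is simple enough to check by hand. This sketch (with illustrative numbers, not real telemetry) computes a burn rate from an observed error rate and an SLO target, and how long a 30-day error budget survives at that rate:

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than sustainable we are consuming error budget.

    A burn rate of 1.0 means the budget lasts exactly the SLO window.
    """
    budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return error_rate / budget

def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """At this burn rate, how long until the window's error budget is gone?"""
    return window_days * 24 / rate

r = burn_rate(error_rate=0.014, slo_target=0.999)  # 1.4% errors vs 0.1% budget
print(f"burn rate: {r:.0f}x")                      # burn rate: 14x
print(f"budget gone in ~{hours_to_exhaustion(r):.0f}h")
```

This is where the 14x fast-burn threshold in the rules below comes from: at 14x, a 30-day budget lasts roughly two days, so it warrants an immediate page.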

Finally, the most underrated practice in on-call is the monthly alert audit. Pull a report of every alert that fired in the last 30 days. For each one: Was it actionable? Was the response documented? If the same alert fired more than three times without a code fix, that's a reliability debt item that belongs on the backlog, not a recurring 3 AM wake-up.
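The audit itself can be a small script over your incident tool's export. The sketch below assumes a hypothetical list of firing records (the shape and alert names are invented for illustration); most tools' CSV exports can be massaged into something similar:

```python
from collections import Counter

# Hypothetical 30-day export: (alert_name, was_actionable)
firings = [
    ("HighErrorRateShopping", True),
    ("HighCPU", False), ("HighCPU", False), ("HighCPU", False),
    ("HighCPU", False),
    ("CheckoutLatencyHigh", True),
]

counts = Counter(name for name, _ in firings)
# Fired more than 3 times without a fix -> reliability debt for the backlog
noisy = sorted(name for name, n in counts.items() if n > 3)
# Never actionable -> candidate for deletion or demotion to a dashboard
non_actionable = sorted({name for name, ok in firings if not ok})

print("fired >3x (needs a code fix, not a pager):", noisy)
print("never actionable (demote to dashboard):", non_actionable)
```

Running even this crude classification monthly turns "the alerts feel noisy" into a concrete, prioritisable list.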

slo_burn_rate_alert.yml · YAML
# slo_burn_rate_alert.yml
# Multi-window burn rate alerting — the approach recommended by Google SRE Workbook.
# This fires when you're consuming error budget fast enough to exhaust it prematurely.
# Assumes: 30-day SLO window, 99.9% availability target (0.1% error budget).

groups:
  - name: slo_burn_rates
    rules:

      # Fast burn — consuming budget 14x faster than sustainable
      # At this rate, 30 days of error budget is gone in ~52 hours
      # Use SHORT windows to catch it quickly. PAGE IMMEDIATELY.
      - alert: CheckoutSLOFastBurn
        expr: |
          (
            # 1-hour window burn rate
            (
              1 - (
                sum(rate(http_requests_total{service="checkout", status!~"5.."}[1h]))
                /
                sum(rate(http_requests_total{service="checkout"}[1h]))
              )
            ) / 0.001  # 0.001 = 1 - 0.999 SLO target
          ) > 14
          and
          (
            # 5-minute window must also be burning fast — avoids single-minute noise
            (
              1 - (
                sum(rate(http_requests_total{service="checkout", status!~"5.."}[5m]))
                /
                sum(rate(http_requests_total{service="checkout"}[5m]))
              )
            ) / 0.001
          ) > 14
        labels:
          severity: critical
          alert_type: slo_burn
        annotations:
          summary: "Checkout SLO: fast burn rate — error budget exhausted in ~52h"
          runbook_url: "https://runbooks.internal/checkout-slo-burn"
          description: "Burn rate is {{ $value }}x the sustainable rate. Immediate action required."

      # Slow burn — consuming budget 3x faster than sustainable
      # At this rate, budget is gone in ~10 days. Urgent but not drop-everything.
      # Use LONGER windows — this is a trend, not a spike.
      - alert: CheckoutSLOSlowBurn
        expr: |
          (
            (
              1 - (
                sum(rate(http_requests_total{service="checkout", status!~"5.."}[6h]))
                /
                sum(rate(http_requests_total{service="checkout"}[6h]))
              )
            ) / 0.001
          ) > 3
          and
          (
            (
              1 - (
                sum(rate(http_requests_total{service="checkout", status!~"5.."}[30m]))
                /
                sum(rate(http_requests_total{service="checkout"}[30m]))
              )
            ) / 0.001
          ) > 3
        labels:
          severity: warning
          alert_type: slo_burn
        annotations:
          summary: "Checkout SLO: slow burn — error budget exhausted in ~10 days"
          runbook_url: "https://runbooks.internal/checkout-slo-burn"
          description: "Burn rate is {{ $value }}x sustainable. Investigate during business hours."
Output:
# Example: Alert fires during incident at 3:17 AM

ALERT CheckoutSLOFastBurn
Labels:
alertname = CheckoutSLOFastBurn
severity = critical
alert_type = slo_burn
Annotations:
summary = Checkout SLO: fast burn rate — error budget exhausted in ~52h
description = Burn rate is 18.3x the sustainable rate. Immediate action required.
runbook_url = https://runbooks.internal/checkout-slo-burn
State: firing
ActiveAt: 2024-03-15T03:17:44Z

# CheckoutSLOSlowBurn would appear in Prometheus UI as 'pending' or 'inactive'
# because slow burn requires 30min+ of sustained degradation to confirm
🔥 Interview Gold: Why Multi-Window? Single-window burn rate alerts have a critical flaw: a short spike can trigger a fast-burn alert even if it self-corrects immediately. Multi-window alerting (e.g. 1h AND 5m both elevated) requires the problem to persist across two time horizons simultaneously, which filters out transient noise without adding meaningful detection delay. This is a common interview differentiator.
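The filtering effect is easy to demonstrate with toy numbers. In the sketch below (toy values, not real telemetry), a transient spike pushes the short-window burn rate over threshold while the long window barely moves, so a multi-window condition stays quiet:

```python
def multi_window_fires(short_burn: float, long_burn: float,
                       threshold: float = 14.0) -> bool:
    """Fire only if BOTH windows exceed the threshold, mirroring the
    'and' between the 1h and 5m expressions in the rule above."""
    return short_burn > threshold and long_burn > threshold

# Transient spike: 5m window hot, 1h window calm -> no page
print(multi_window_fires(short_burn=40.0, long_burn=2.5))   # False

# Sustained incident: both windows elevated -> page
print(multi_window_fires(short_burn=22.0, long_burn=18.0))  # True
```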
| Aspect | Threshold-Based Alerting | SLO Burn Rate Alerting |
| --- | --- | --- |
| What it monitors | Raw metric value (e.g. error rate > 1%) | Rate of error budget consumption over time |
| Alert volume | High — fires on any breach, even brief | Low — requires sustained budget impact |
| False positive rate | High — transient spikes trigger pages | Low — multi-window filtering catches real problems |
| Business alignment | Poor — metric thresholds don't map to SLAs | Excellent — directly tied to reliability commitments |
| Complexity to set up | Low — one expr per alert | Medium — requires SLO definition and burn rate math |
| Best for | Early-stage services, simple infra monitoring | Production services with defined SLOs and SLAs |
| Recovery detection | Alert resolves when metric drops below threshold | Alert resolves when budget burn rate normalises |
| On-call friendliness | Poor — engineers get woken for self-healing spikes | Good — pages are meaningful and business-impacting |

🎯 Key Takeaways

  • Alert on symptoms users feel (latency, error rate, availability) — not causes engineers investigate (CPU, memory, disk). Causal metrics belong on dashboards.
  • The 'for' duration in Prometheus alert rules is your noise filter — always require at least 2 minutes of sustained breach before a critical alert fires to eliminate self-healing false pages.
  • SLO burn rate alerting with multi-window expressions dramatically reduces alert volume while improving signal quality — it fires only when your reliability commitments to users are genuinely at risk.
  • A monthly alert audit — reviewing every alert that fired, its resolution, and whether a code fix was created — is the single highest-leverage practice for keeping your on-call program healthy long-term.

⚠ Common Mistakes to Avoid

  • Mistake 1: Alerting on every metric that Prometheus exports — Symptom: 50+ alerts firing weekly, engineers start ignoring them all, a real outage gets missed — Fix: Audit every alert against the 3 AM test. If the on-call engineer can't take a concrete action within 5 minutes, delete it or move it to a dashboard. Start from zero and add alerts only when an incident proves one is needed.
  • Mistake 2: Setting 'for: 0m' on critical alerts (no minimum duration before firing) — Symptom: A single bad metric scrape causes a P1 page at 3 AM; engineer investigates and finds everything healthy — Fix: Always set a 'for' duration of at least 2 minutes for critical alerts and 5 minutes for warnings. This lets transient spikes self-resolve without paging anyone. The cost is slightly slower detection; the benefit is massively fewer false pages.
  • Mistake 3: Writing runbooks that describe the system instead of the response — Symptom: On-call engineer opens the runbook during an incident and finds paragraphs about architecture with no actionable steps, causing them to improvise and escalate unnecessarily — Fix: Every runbook must start with three numbered steps the engineer can execute in the first five minutes: (1) confirm the alert is real with a specific command or dashboard link, (2) apply a known mitigation if one exists, (3) escalation criteria if mitigation fails. Architecture context belongs at the bottom, not the top.

Interview Questions on This Topic

  • Q: What's the difference between alerting on causes versus symptoms, and can you give a concrete example of each from a web service context?
  • Q: How would you design an on-call rotation for a team of six engineers covering a service with a 99.9% SLA, and what safeguards would you put in place to prevent burnout?
  • Q: A senior engineer proposes adding a CPU utilization alert at 80% threshold to catch performance problems early. How do you push back on this, and what would you propose instead?

Frequently Asked Questions

How many alerts should a healthy on-call engineer receive per shift?

Google's SRE book recommends no more than two pages per 12-hour shift as a starting target, with a long-term goal of zero pages meaning everything resolved automatically. In practice, most mature teams aim for fewer than five actionable pages per week per engineer. If you're seeing more than that, treat alert volume itself as a reliability incident that needs a dedicated sprint.

What's the difference between an SLO and an SLA, and why does it matter for alerting?

An SLA (Service Level Agreement) is the contractual commitment to customers — breach it and there are financial consequences. An SLO (Service Level Objective) is your internal target, intentionally set stricter than the SLA so you have a buffer. You alert on SLO burn rate so that your team responds before the SLA is ever breached. Alerting directly on SLA violation means you're already in breach when the page fires — too late.
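That buffer is easy to quantify as downtime. Assuming an illustrative 99.9% internal SLO against a 99.5% contractual SLA over a 30-day month:

```python
def allowed_downtime_minutes(target: float, days: int = 30) -> float:
    """Minutes of full downtime a reliability target tolerates over the window."""
    return (1 - target) * days * 24 * 60

slo = allowed_downtime_minutes(0.999)  # ~43.2 minutes for the 99.9% SLO
sla = allowed_downtime_minutes(0.995)  # ~216 minutes for the 99.5% SLA
print(f"SLO budget: {slo:.1f} min, SLA budget: {sla:.1f} min")
print(f"buffer before contractual breach: {sla - slo:.1f} min")
```

Burning through the SLO budget still leaves nearly three hours of headroom before the SLA is breached, which is exactly the response window your burn-rate alerts are buying you.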

Should I use PagerDuty or OpsGenie for on-call routing?

Both are mature tools that support escalation policies, schedules, and integrations with Prometheus, Grafana, and Datadog. The real differentiator is your existing stack: PagerDuty has a larger third-party integration ecosystem, while OpsGenie integrates more natively with Atlassian tools like Jira and Confluence. Either tool only works as well as the alert quality feeding into it — choosing the right tool matters far less than designing your alerts and runbooks correctly first.

TheCodeForge Editorial Team · Verified Author

Written and reviewed by senior developers with real-world experience across enterprise, startup and open-source projects. Every article on TheCodeForge is written to be clear, accurate and genuinely useful — not just SEO filler.
