Alerting and On-Call Best Practices That Actually Reduce Burnout
At 3 AM, your phone screams. You scramble to your laptop, bleary-eyed, only to discover the alert fired because a CPU spike lasted four seconds and self-corrected before you even logged in. You've lost sleep over nothing — and this happens four nights a week. This is the lived reality for hundreds of engineering teams right now, and it's not a monitoring problem. It's a culture problem disguised as a technical one. Alert fatigue is the silent killer of on-call programs, and it costs companies real engineers who quietly quit rather than endure another sleepless rotation.
The root cause isn't that teams care too little about monitoring — it's that they add alerts reactively, after every incident, without ever pruning the ones that stop being useful. Over time, the alert system becomes a noise machine. Engineers stop trusting it, start silencing pages, and miss the signals that actually matter. The fix isn't more dashboards. It's disciplined, intentional alerting philosophy backed by concrete practices for thresholds, routing, escalation, and rotation design.
By the end of this article you'll know how to audit your existing alert stack and throw out the ones that don't serve you, how to write alerts that fire on symptoms not causes, how to structure an on-call rotation that doesn't destroy your team's wellbeing, and how to use real tooling — Prometheus alerting rules, PagerDuty routing logic, and runbook templates — to make all of it operational and repeatable.
Alert on Symptoms, Not Causes — The Golden Rule of Monitoring
The most common alerting mistake is alerting on what you think is wrong instead of what the user actually experiences. A high CPU alert fires and the engineer investigates — but CPU being high isn't inherently bad. Maybe a batch job is running. Maybe it's expected load. The user doesn't care about CPU. They care whether the checkout page loads.
Symptomatic alerting means your alerts fire on things users feel directly: high latency, elevated error rates, failed health checks. These map closely onto the Four Golden Signals from Google's SRE book — latency, traffic, errors, and saturation. Alerts on these signals are almost always actionable, because if the error rate is 15%, something is broken for users right now.
Causal metrics like CPU, memory, and disk are better suited for dashboards and capacity planning, not paging. You investigate them after an alert fires to understand why — not to decide whether something is wrong.
The practical test: before adding any alert, ask yourself 'If this fires at 3 AM, can the on-call engineer take a concrete action within five minutes?' If the answer is no, it belongs on a dashboard, not a pager.
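The 3 AM test can even be enforced mechanically, for example as a lint step in CI that rejects alert rules with no runbook link or no minimum duration. The sketch below assumes rule files have already been parsed into Python dicts mirroring the Prometheus rule structure; the `three_am_test` function name and its specific checks are illustrative, not a standard tool.

```python
# A minimal lint sketch for alert rules, assuming rules arrive as parsed dicts.

def three_am_test(rule: dict) -> list[str]:
    """Return the reasons this alert rule would fail the 3 AM test."""
    problems = []
    # No runbook link means the on-call engineer has to improvise at 3 AM.
    if "runbook_url" not in rule.get("annotations", {}):
        problems.append("missing runbook_url annotation")
    # No 'for' duration means a single bad scrape can page someone.
    if rule.get("for", "0m") in ("0m", "0s"):
        problems.append("no 'for' duration (fires on transient spikes)")
    return problems

rules = [
    {"alert": "HighErrorRateShopping", "for": "2m",
     "annotations": {"runbook_url": "https://runbooks.internal/shopping-cart-errors"}},
    {"alert": "HighCPU", "annotations": {"summary": "CPU is high"}},
]

for rule in rules:
    for problem in three_am_test(rule):
        print(f"{rule['alert']}: {problem}")
```

Run against the two rules above, only `HighCPU` is flagged — exactly the kind of cause-based, runbook-less alert the test is designed to catch.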
```yaml
# prometheus_symptom_alerts.yml
# These rules live in your Prometheus alerting rules directory.
# Load them via the 'rule_files' block in prometheus.yml.
groups:
  - name: user_facing_symptoms
    # Evaluate every 60 seconds
    interval: 60s
    rules:
      # GOOD ALERT: Fires on what users experience — high error rate
      # This is a SYMPTOM. Users are getting 5xx errors. Action is clear.
      - alert: HighErrorRateShopping
        expr: |
          (
            sum(rate(http_requests_total{service="shopping-cart", status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="shopping-cart"}[5m]))
          ) > 0.05
        # Wait 2 minutes before firing — avoids alerting on a single bad scrape
        for: 2m
        labels:
          severity: critical
          team: checkout
        annotations:
          summary: "Shopping cart error rate above 5% for 2 minutes"
          # Link directly to the runbook — engineers shouldn't have to search
          runbook_url: "https://runbooks.internal/shopping-cart-errors"
          description: "Current error rate: {{ $value | humanizePercentage }}"

      # GOOD ALERT: Fires on latency degradation users feel
      - alert: CheckoutLatencyHigh
        expr: |
          histogram_quantile(
            0.99,
            sum(rate(http_request_duration_seconds_bucket{
              service="checkout", handler="/api/purchase"
            }[10m])) by (le)
          ) > 2.0
        for: 5m
        labels:
          severity: warning
          team: checkout
        annotations:
          summary: "p99 checkout latency exceeded 2 seconds"
          runbook_url: "https://runbooks.internal/checkout-latency"
          description: "p99 latency is {{ $value }}s. SLO threshold is 2.0s."

      # BAD ALERT (shown for contrast — commented out intentionally)
      # Do NOT do this. CPU spiking doesn't mean users are affected.
      # - alert: HighCPU
      #   expr: node_cpu_utilization > 0.80
      #   for: 1m
      #   annotations:
      #     summary: "CPU is high"  # <-- What should the engineer DO with this?
```
```
ALERT HighErrorRateShopping
  Labels:
    alertname = HighErrorRateShopping
    severity = critical
    team = checkout
  Annotations:
    summary = Shopping cart error rate above 5% for 2 minutes
    description = Current error rate: 7.34%
    runbook_url = https://runbooks.internal/shopping-cart-errors
  State: firing
  ActiveAt: 2024-03-15T03:14:22Z
```
Structuring On-Call Rotations That Don't Destroy Your Team
An on-call rotation is a social contract as much as it is a technical system. Engineers who feel the rotation is fair, predictable, and well-supported stay in it. Engineers who feel it's a punishment churn — and the institutional knowledge they carry walks out with them.
The fundamentals of a healthy rotation start with team size. You need at least four engineers to build a weekly rotation that gives people genuine recovery time. With fewer, someone is always on-call or just got off on-call, and cognitive load never fully resets.
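The arithmetic behind the four-engineer floor is easy to make concrete: with n engineers on a weekly rotation, each person gets n-1 weeks off pager between shifts. The sketch below uses a hypothetical four-person team (the names match nothing real) to show how quickly the rotation cycles back around.

```python
# A round-robin weekly rotation sketch: with 4 engineers, each person
# is on-call one week in four and gets three full weeks to recover.
from itertools import cycle
from datetime import date, timedelta

def weekly_rotation(engineers: list[str], start: date, weeks: int) -> list[tuple[date, str]]:
    """Assign one engineer per week, round-robin, starting at `start`."""
    rotation = cycle(engineers)
    return [(start + timedelta(weeks=i), next(rotation)) for i in range(weeks)]

team = ["alice", "bob", "carol", "david"]
for week_start, engineer in weekly_rotation(team, date(2024, 1, 8), 6):
    print(week_start.isoformat(), engineer)
```

With three engineers instead of four, the same loop shows each person back on pager after only two weeks off — which is why smaller teams should share a rotation with a neighbouring team rather than run their own.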
Secondary on-call — where a second engineer is ready to be escalated to if the primary doesn't acknowledge in ten minutes — is non-negotiable for any service with a real SLA. It's also a forcing function: if escalation to secondary happens more than once a week, your primary alert volume is too high.
On-call handoff meetings deserve as much ceremony as a sprint retrospective. The outgoing engineer should document what fired, what was investigated, and what follow-up work was created. Without this, the same incidents repeat endlessly because the fixes live only in one person's head.
Compensation matters too. Whether that's explicit on-call pay, time-off-in-lieu, or reduced sprint commitments during on-call weeks, teams that acknowledge the burden retain engineers far longer than teams that treat it as 'just part of the job'.
```hcl
# pagerduty_escalation_policy.tf
# This is a Terraform representation of a PagerDuty escalation policy.
# Apply with: terraform apply
# Requires: pagerduty Terraform provider configured with your API token.

resource "pagerduty_escalation_policy" "checkout_team" {
  name      = "Checkout Team - Production Escalation"
  num_loops = 2  # After 2 full loops with no acknowledgment, re-alert from top

  # LEVEL 1 — Primary on-call engineer gets paged first
  rule {
    escalation_delay_in_minutes = 10  # Wait 10 min before escalating to secondary
    target {
      type = "schedule_reference"
      # This schedule rotates weekly across your primary on-call engineers
      id   = pagerduty_schedule.checkout_primary_rotation.id
    }
  }

  # LEVEL 2 — Secondary on-call gets paged if primary doesn't acknowledge
  rule {
    escalation_delay_in_minutes = 10  # Another 10 min before going to manager
    target {
      type = "schedule_reference"
      # Secondary rotation — separate schedule, separate engineers
      id   = pagerduty_schedule.checkout_secondary_rotation.id
    }
  }

  # LEVEL 3 — Engineering manager is the last resort, not the first call
  rule {
    escalation_delay_in_minutes = 15
    target {
      type = "user_reference"
      # Direct page to the EM — this should be rare
      id   = data.pagerduty_user.checkout_engineering_manager.id
    }
  }
}

# Weekly rotating schedule for primary on-call
resource "pagerduty_schedule" "checkout_primary_rotation" {
  name      = "Checkout Primary On-Call"
  time_zone = "America/New_York"

  layer {
    name  = "Weekly Rotation"
    start = "2024-01-08T09:00:00-05:00"
    # Rotate every 7 days — weekly on-call is the industry standard baseline
    rotation_turn_length_seconds = 604800
    rotation_virtual_start       = "2024-01-08T09:00:00-05:00"

    # Engineers in the rotation — minimum 4 for healthy coverage
    users = [
      data.pagerduty_user.alice.id,
      data.pagerduty_user.bob.id,
      data.pagerduty_user.carol.id,
      data.pagerduty_user.david.id,
    ]
  }
}
```
```
pagerduty_schedule.checkout_primary_rotation: Creating...
pagerduty_schedule.checkout_primary_rotation: Creation complete after 1s [id=P3X8K2A]
pagerduty_escalation_policy.checkout_team: Creating...
pagerduty_escalation_policy.checkout_team: Creation complete after 1s [id=PQRST99]

Apply complete! Resources: 2 added, 0 changed, 0 destroyed.

# Escalation flow when an alert fires:
# T+0:00  → Alice (primary) receives page via push + SMS
# T+10:00 → No ack? Bob (secondary) receives page
# T+20:00 → No ack? Engineering Manager receives page
# T+35:00 → Loop 2 begins from Level 1 if still unacknowledged
```
Runbooks, SLOs, and the Alert Audit Loop That Keeps You Sane
Every alert that fires should have a runbook — a living document that tells the on-call engineer exactly what to check, what commands to run, and what decisions to make. Without runbooks, every incident is an archaeology project. With them, even a junior engineer who's never seen the service before can respond effectively.
Runbooks don't need to be perfect on day one. Start with three headings: 'What is happening', 'Immediate steps to mitigate', and 'How to escalate if mitigation fails'. As incidents happen, the on-call engineer appends what they learned. Within a few months you have genuinely battle-tested documentation.
SLO-based alerting takes this further by tying your alerts to your error budget. Instead of alerting when error rate exceeds 1%, you alert when you're burning through your monthly error budget faster than sustainable. This dramatically reduces alert volume while ensuring you page only when business commitments are actually at risk.
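The numbers behind burn-rate alerting are simple enough to verify by hand. A burn rate of 1 means you are spending exactly one SLO window's worth of error budget per window; a burn rate of 14 exhausts a 30-day budget in roughly two days. A quick sketch of the arithmetic (the `hours_to_exhaustion` helper is illustrative, not a standard library):

```python
# Burn rate = observed error rate / error budget rate.
# At burn rate B, a 30-day error budget lasts 30 days / B.

def hours_to_exhaustion(burn_rate: float, slo_window_days: int = 30) -> float:
    """How long until the whole error budget is gone at this burn rate."""
    return slo_window_days * 24 / burn_rate

# For a 99.9% SLO (0.1% error budget):
fast = hours_to_exhaustion(14)  # fast burn: page immediately
slow = hours_to_exhaustion(3)   # slow burn: fix this week

print(f"14x burn: budget gone in {fast:.1f}h")        # ~51.4 hours
print(f"3x burn: budget gone in {slow / 24:.0f} days")  # 10 days
```

These two figures are exactly where the "~52h" and "~10 days" thresholds in multi-window burn-rate rules come from.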
Finally, the most underrated practice in on-call is the monthly alert audit. Pull a report of every alert that fired in the last 30 days. For each one: Was it actionable? Was the response documented? If the same alert fired more than three times without a code fix, that's a reliability debt item that belongs on the backlog, not a recurring 3 AM wake-up.
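The audit itself can be largely automated, assuming you can export the last 30 days of firings from your paging tool as simple records. The sketch below is a minimal version of that report; the `monthly_audit` function, its field names, and the `repeat_threshold` default are all hypothetical choices, not any vendor's API.

```python
# A minimal monthly alert audit: flag alerts that fired more than
# `repeat_threshold` times (reliability debt needing a code fix) and
# alerts that were never actionable (delete or move to a dashboard).
from collections import Counter

def monthly_audit(firings: list[dict], repeat_threshold: int = 3) -> dict[str, list[str]]:
    counts = Counter(f["alert"] for f in firings)
    ever_actionable: dict[str, bool] = {}
    for f in firings:
        ever_actionable[f["alert"]] = ever_actionable.get(f["alert"], False) or f["actionable"]

    report: dict[str, list[str]] = {"reliability_debt": [], "delete_or_dashboard": []}
    for alert, n in counts.items():
        if n > repeat_threshold:
            report["reliability_debt"].append(alert)
        if not ever_actionable[alert]:
            report["delete_or_dashboard"].append(alert)
    return report

firings = (
    [{"alert": "CheckoutSLOFastBurn", "actionable": True}] * 4
    + [{"alert": "HighCPU", "actionable": False}] * 2
)
print(monthly_audit(firings))
```

The output of the audit feeds directly into sprint planning: every entry in `reliability_debt` becomes a backlog item, and every entry in `delete_or_dashboard` is a candidate for removal at the next rules review.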
```yaml
# slo_burn_rate_alert.yml
# Multi-window burn rate alerting — the approach recommended by the Google SRE Workbook.
# This fires when you're consuming error budget fast enough to exhaust it prematurely.
# Assumes: 30-day SLO window, 99.9% availability target (0.1% error budget).
groups:
  - name: slo_burn_rates
    rules:
      # Fast burn — consuming budget 14x faster than sustainable.
      # At this rate, 30 days of error budget is gone in ~52 hours.
      # Use SHORT windows to catch it quickly. PAGE IMMEDIATELY.
      # First clause: 1-hour window burn rate. Second clause: the 5-minute
      # window must also be burning fast — avoids single-minute noise.
      # (0.001 = 1 - 0.999 SLO target.)
      - alert: CheckoutSLOFastBurn
        expr: |
          (
            (
              1 - (
                sum(rate(http_requests_total{service="checkout", status!~"5.."}[1h]))
                /
                sum(rate(http_requests_total{service="checkout"}[1h]))
              )
            ) / 0.001
          ) > 14
          and
          (
            (
              1 - (
                sum(rate(http_requests_total{service="checkout", status!~"5.."}[5m]))
                /
                sum(rate(http_requests_total{service="checkout"}[5m]))
              )
            ) / 0.001
          ) > 14
        labels:
          severity: critical
          alert_type: slo_burn
        annotations:
          summary: "Checkout SLO: fast burn rate — error budget exhausted in ~52h"
          runbook_url: "https://runbooks.internal/checkout-slo-burn"
          description: "Burn rate is {{ $value }}x the sustainable rate. Immediate action required."

      # Slow burn — consuming budget 3x faster than sustainable.
      # At this rate, budget is gone in ~10 days. Urgent but not drop-everything.
      # Use LONGER windows — this is a trend, not a spike.
      - alert: CheckoutSLOSlowBurn
        expr: |
          (
            (
              1 - (
                sum(rate(http_requests_total{service="checkout", status!~"5.."}[6h]))
                /
                sum(rate(http_requests_total{service="checkout"}[6h]))
              )
            ) / 0.001
          ) > 3
          and
          (
            (
              1 - (
                sum(rate(http_requests_total{service="checkout", status!~"5.."}[30m]))
                /
                sum(rate(http_requests_total{service="checkout"}[30m]))
              )
            ) / 0.001
          ) > 3
        labels:
          severity: warning
          alert_type: slo_burn
        annotations:
          summary: "Checkout SLO: slow burn — error budget exhausted in ~10 days"
          runbook_url: "https://runbooks.internal/checkout-slo-burn"
          description: "Burn rate is {{ $value }}x sustainable. Investigate during business hours."
```
```
ALERT CheckoutSLOFastBurn
  Labels:
    alertname = CheckoutSLOFastBurn
    severity = critical
    alert_type = slo_burn
  Annotations:
    summary = Checkout SLO: fast burn rate — error budget exhausted in ~52h
    description = Burn rate is 18.3x the sustainable rate. Immediate action required.
    runbook_url = https://runbooks.internal/checkout-slo-burn
  State: firing
  ActiveAt: 2024-03-15T03:17:44Z

# CheckoutSLOSlowBurn would appear in the Prometheus UI as 'pending' or 'inactive'
# because slow burn requires 30+ minutes of sustained degradation to confirm
```
| Aspect | Threshold-Based Alerting | SLO Burn Rate Alerting |
|---|---|---|
| What it monitors | Raw metric value (e.g. error rate > 1%) | Rate of error budget consumption over time |
| Alert volume | High — fires on any breach, even brief | Low — requires sustained budget impact |
| False positive rate | High — transient spikes trigger pages | Low — multi-window filtering catches real problems |
| Business alignment | Poor — metric thresholds don't map to SLAs | Excellent — directly tied to reliability commitments |
| Complexity to set up | Low — one expr per alert | Medium — requires SLO definition and burn rate math |
| Best for | Early-stage services, simple infra monitoring | Production services with defined SLOs and SLAs |
| Recovery detection | Alert resolves when metric drops below threshold | Alert resolves when budget burn rate normalises |
| On-call friendliness | Poor — engineers get woken for self-healing spikes | Good — pages are meaningful and business-impacting |
🎯 Key Takeaways
- Alert on symptoms users feel (latency, error rate, availability) — not causes engineers investigate (CPU, memory, disk). Causal metrics belong on dashboards.
- The 'for' duration in Prometheus alert rules is your noise filter — always require at least 2 minutes of sustained breach before a critical alert fires to eliminate self-healing false pages.
- SLO burn rate alerting with multi-window expressions dramatically reduces alert volume while improving signal quality — it fires only when your reliability commitments to users are genuinely at risk.
- A monthly alert audit — reviewing every alert that fired, its resolution, and whether a code fix was created — is the single highest-leverage practice for keeping your on-call program healthy long-term.
⚠ Common Mistakes to Avoid
- ✕ Mistake 1: Alerting on every metric that Prometheus exports
  - Symptom: 50+ alerts firing weekly, engineers start ignoring them all, a real outage gets missed.
  - Fix: Audit every alert against the 3 AM test. If the on-call engineer can't take a concrete action within 5 minutes, delete it or move it to a dashboard. Start from zero and add alerts only when an incident proves one is needed.
- ✕ Mistake 2: Setting 'for: 0m' on critical alerts (no minimum duration before firing)
  - Symptom: A single bad metric scrape causes a P1 page at 3 AM; the engineer investigates and finds everything healthy.
  - Fix: Always set a 'for' duration of at least 2 minutes for critical alerts and 5 minutes for warnings. This lets transient spikes self-resolve without paging anyone. The cost is slightly slower detection; the benefit is massively fewer false pages.
- ✕ Mistake 3: Writing runbooks that describe the system instead of the response
  - Symptom: The on-call engineer opens the runbook during an incident and finds paragraphs about architecture with no actionable steps, causing them to improvise and escalate unnecessarily.
  - Fix: Every runbook must start with three numbered steps the engineer can execute in the first five minutes: (1) confirm the alert is real with a specific command or dashboard link, (2) apply a known mitigation if one exists, (3) escalation criteria if mitigation fails. Architecture context belongs at the bottom, not the top.
Interview Questions on This Topic
- Q: What's the difference between alerting on causes versus symptoms, and can you give a concrete example of each from a web service context?
- Q: How would you design an on-call rotation for a team of six engineers covering a service with a 99.9% SLA, and what safeguards would you put in place to prevent burnout?
- Q: A senior engineer proposes adding a CPU utilization alert at 80% threshold to catch performance problems early. How do you push back on this, and what would you propose instead?
Frequently Asked Questions
How many alerts should a healthy on-call engineer receive per shift?
Google's SRE book recommends no more than two pages per 12-hour shift as a starting target, with a long-term goal of zero pages meaning everything resolved automatically. In practice, most mature teams aim for fewer than five actionable pages per week per engineer. If you're seeing more than that, treat alert volume itself as a reliability incident that needs a dedicated sprint.
What's the difference between an SLO and an SLA, and why does it matter for alerting?
An SLA (Service Level Agreement) is the contractual commitment to customers — breach it and there are financial consequences. An SLO (Service Level Objective) is your internal target, intentionally set stricter than the SLA so you have a buffer. You alert on SLO burn rate so that your team responds before the SLA is ever breached. Alerting directly on SLA violation means you're already in breach when the page fires — too late.
Should I use PagerDuty or OpsGenie for on-call routing?
Both are mature tools that support escalation policies, schedules, and integrations with Prometheus, Grafana, and Datadog. The real differentiator is your existing stack: PagerDuty has a larger third-party integration ecosystem, while OpsGenie integrates more natively with Atlassian tools like Jira and Confluence. Either tool only works as well as the alert quality feeding into it — choosing the right tool matters far less than designing your alerts and runbooks correctly first.
Written and reviewed by senior developers with real-world experience across enterprise, startup and open-source projects. Every article on TheCodeForge is written to be clear, accurate and genuinely useful — not just SEO filler.