Alerting & On-Call — Why Silenced Services Hide Outages
A 7-day PagerDuty silence hid a 4-hour checkout outage.
- Alert on symptoms users feel (latency, error rate, availability) — not causes engineers investigate (CPU, memory, disk)
- The 'for' duration in Prometheus is your noise filter — 2-minute minimum for critical alerts eliminates self-healing false pages
- SLO burn rate alerting with multi-window expressions cuts alert volume while improving signal quality
- On-call rotations need minimum 4 engineers — secondary on-call is non-negotiable for any SLA-backed service
- Every alert needs a runbook with 3 steps: confirm it's real, apply mitigation, escalation criteria
- Monthly alert audits are the highest-leverage practice — if an alert fired 3+ times without a fix, it's backlog work, not a page
At 3 AM, your phone screams. You scramble to your laptop, bleary-eyed, only to discover the alert fired because a CPU spike lasted four seconds and self-corrected before you even logged in. You've lost sleep over nothing — and this happens four nights a week.
I've seen this pattern at companies of every size. The specifics vary — sometimes it's CPU, sometimes it's memory, sometimes it's a disk utilization alert on a volume that autoscales — but the shape is always the same. Engineers get paged for things that didn't need human attention. They lose sleep, lose trust in the paging system, and eventually lose patience with the whole programme. The best engineers leave first, because they have options. The ones who stay start silencing things. And then the real fire happens.
This is not a monitoring problem. It's a culture problem that presents as a technical one. The technical symptoms are easy to diagnose: too many alerts, thresholds too low, no runbooks, no audit process. But the underlying culture problem is that most engineering teams treat adding alerts as free and removing alerts as risky. Every post-mortem ends with 'add monitoring.' Nobody's post-mortem ever ends with 'delete the alert that's been crying wolf for six months.' That asymmetry is how you get a paging system that cries wolf 40 times a week and then fails to wake anyone up when the real incident hits.
The root cause isn't that teams care too little about monitoring — it's that they add alerts reactively, after every incident, without ever pruning the ones that stop being useful. Over time, the alert system becomes a noise machine. Engineers stop trusting it, start silencing pages, and miss the signals that actually matter. The fix isn't more dashboards. It's disciplined, intentional alerting philosophy backed by concrete practices for thresholds, routing, escalation, and rotation design.
By the end of this article you'll know how to audit your existing alert stack and identify the ones that don't serve you, how to write alerts that fire on symptoms not causes, how to structure an on-call rotation that doesn't erode your team's wellbeing, and how to use real tooling — Prometheus alerting rules, PagerDuty routing logic, and runbook templates — to make all of it operational and repeatable. The goal isn't a perfect alerting system. It's a system your engineers trust enough to actually pay attention to.
Alert on Symptoms, Not Causes — The Golden Rule of Monitoring
The most common alerting mistake is alerting on what you think is wrong instead of what the user actually experiences. A high CPU alert fires and the engineer investigates — but CPU being high isn't inherently bad. Maybe a batch job is running. Maybe it's expected load from a traffic spike the autoscaler is still catching up to. Maybe the GC is doing a full collection. The user doesn't care about CPU. They care whether the checkout page loads and their payment goes through.
Symptomatic alerting means your alert fires on things users feel directly: high latency, elevated error rates, failed health checks, degraded availability. Google's SRE book formalised this as the Four Golden Signals — latency, traffic, errors, and saturation — and the framing has held up well. Alerts on these signals are almost always actionable because if error rate is 15%, something is broken for users right now, and the on-call engineer has a clear starting point.
Causal metrics like CPU, memory, and disk are better suited for dashboards and capacity planning, not paging. You investigate them after a symptom alert fires to understand why something is wrong — not to decide whether something is wrong. The distinction matters because it changes the engineer's mental model during an incident. A symptom alert says 'users are experiencing this.' A causal alert says 'something in your infrastructure crossed a threshold.' Only one of those tells you whether to act.
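To make the distinction concrete, here is a minimal sketch of a paging decision that only looks at what users felt. The `Request` type, the `symptom_report` function, and both thresholds are hypothetical names and values chosen for illustration; note that CPU, memory, and disk never appear as inputs.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int          # HTTP status code returned to the user
    latency_ms: float    # how long the user waited for the response


def symptom_report(requests, max_error_rate=0.05, max_p99_ms=2000.0):
    """Summarise what users experienced in one window of requests.

    The thresholds are illustrative defaults, not recommendations.
    """
    if not requests:
        return {"error_rate": 0.0, "p99_ms": 0.0, "page": False}

    errors = sum(1 for r in requests if r.status >= 500)
    error_rate = errors / len(requests)

    # Nearest-rank p99: the latency that 99% of requests were at or below.
    latencies = sorted(r.latency_ms for r in requests)
    rank = max(1, (99 * len(latencies) + 99) // 100)  # integer ceil(0.99 * n)
    p99_ms = latencies[rank - 1]

    # Page only when users are failing or waiting too long.
    page = error_rate > max_error_rate or p99_ms > max_p99_ms
    return {"error_rate": error_rate, "p99_ms": p99_ms, "page": page}
```

In a real stack this logic would live in a Prometheus rule over request metrics rather than in application code; the sketch only shows which inputs a symptom alert needs.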
The practical test: before adding any alert, ask yourself 'If this fires at 3 AM, can the on-call engineer take a concrete action within five minutes?' Not 'investigate' — act. If the answer is 'investigate and check some dashboards and maybe escalate', it belongs on a dashboard, not a pager. The 3 AM test is deliberately adversarial: the human responding is sleep-deprived, potentially unfamiliar with the specific service, and operating under pressure. Write your alerts for that human, not for a well-rested engineer on a Tuesday morning.
Here's the pushback you'll hear, usually from engineers who've been on the wrong end of missed incidents: 'But what about catching problems early?' This is a reasonable concern wrapped around a false premise. Causal metrics do catch problems early — on dashboards, during business hours, reviewed by engineers who have context and aren't panicking. You don't need to wake someone up to look at a graph trending in the wrong direction. Set up a daily dashboard review as part of your team's operational rhythm. If a pattern in causal metrics consistently precedes a symptom, write a symptom-based alert for the user-facing impact, not the infrastructure metric that correlates with it. The correlation is interesting. The symptom is the truth.
The Four Golden Signals aren't a checklist to run through mechanically. They're a lens for asking the right question: is what I'm about to alert on something a user would feel? Latency tells you how long users wait. Traffic tells you how much demand you're handling and whether demand patterns are normal. Errors tell you how often you're actively failing users. Saturation tells you how close to the edge you are before one of the other three degrades. Any proposed alert that doesn't map cleanly to one of these four should face an explicit justification before it gets merged.
Structuring On-Call Rotations That Don't Destroy Your Team
An on-call rotation is a social contract as much as it is a technical system. Engineers who feel the rotation is fair, predictable, and actively supported stay in it. Engineers who feel it's a punishment — or worse, an invisible tax that everyone pretends doesn't cost anything — churn. And the engineers who leave first are always the ones who have other options: the senior engineers, the ones who built the systems, the ones whose absence creates the next generation of undocumented incidents.
The fundamentals of a healthy rotation start with team size. You need at least four engineers to build a weekly rotation that gives people genuine recovery time between shifts. With three engineers, each person is on-call one week in three, so someone is always either on-call or has just come off it, and cognitive load never fully resets. With two, you're alternating weeks between two people and calling it a rotation. With one, you're not running a rotation at all; you're running a hero, and heroes burn out or leave.
Secondary on-call — a second engineer ready to be escalated to if the primary doesn't acknowledge within ten minutes — is non-negotiable for any service with a real SLA. It serves two purposes. The obvious one is redundancy: if the primary engineer is genuinely unavailable, the incident still gets coverage. The less obvious one is diagnostic: if escalation to secondary happens more than once per week, your primary alert volume is too high. The secondary escalation rate is a canary metric for rotation health that most teams never look at.
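The secondary escalation rate is easy to compute once you export escalation timestamps from your paging tool. A minimal sketch, assuming you already have those timestamps as Python datetimes; the function names and the one-per-week default are illustrative, taken from the rule of thumb above rather than from any tool's API.

```python
from collections import Counter
from datetime import datetime


def weekly_secondary_escalations(escalation_times):
    """Count escalations to secondary per ISO week.

    escalation_times: datetimes at which a page escalated past the primary.
    Returns {(iso_year, iso_week): count}.
    """
    counts = Counter()
    for ts in escalation_times:
        iso = ts.isocalendar()
        counts[(iso[0], iso[1])] += 1
    return dict(counts)


def weeks_over_threshold(escalation_times, max_per_week=1):
    """Weeks in which secondary was paged more than max_per_week times."""
    counts = weekly_secondary_escalations(escalation_times)
    return sorted(week for week, n in counts.items() if n > max_per_week)


# Example: two escalations in the first week of January, one in the second.
example = [datetime(2024, 1, 1, 3, 0), datetime(2024, 1, 3, 2, 30),
           datetime(2024, 1, 10, 4, 15)]
flagged = weeks_over_threshold(example)
```

Any week that shows up in `flagged` is a prompt to look at primary alert volume, not at the primary engineer.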
On-call handoff meetings deserve more ceremony than they typically receive. The outgoing engineer should document what fired, what was investigated, what commands were run, and what follow-up work was created. Without this, the same incidents repeat because the institutional knowledge of 'oh, that alert fires every Tuesday when the ETL job runs — just acknowledge and wait four minutes' lives in one engineer's head and evaporates with every rotation change. The handoff document is where that knowledge becomes team property.
Compensation matters — and how you structure it signals what you believe on-call is worth. Explicit on-call pay, time-off-in-lieu, or reduced sprint commitments during on-call weeks all communicate that the organisation understands the cost. The specifics matter less than the consistency and transparency.
But here's what I've observed across many teams: the psychological contract matters more than the compensation number. An engineer who receives $500 per on-call week but has no control over alert volume, whose feedback from retrospectives never changes anything, and who watches the same false alarms fire week after week without being fixed — that engineer feels trapped regardless of the pay. An engineer who receives time-off-in-lieu and can see, in every sprint, that the team is actively working to reduce alert noise and improve runbooks — that engineer feels respected. The best on-call programs are characterised by visible investment in reducing the burden, not just by compensating for it.
In 2026, most teams run distributed rotations spanning multiple time zones. This is both an opportunity and a design challenge. Done well, a distributed rotation means nobody carries the full 24-hour burden of a weekly shift — you can hand off at a natural boundary between regions and keep business-hours coverage for each timezone. Done poorly, it creates coordination overhead, unclear escalation paths when the person on-call is 9 time zones away, and handoff meetings that nobody can attend at a reasonable hour. If your team spans more than two time zones, design your rotation explicitly for the timezone distribution — don't just apply a single-timezone rotation template and hope the scheduling works out.
Rotation schedule changes are the most underestimated source of trust erosion. Once you publish a rotation, treat changes with the same communication discipline you'd apply to a production deployment. Engineers plan childcare, travel, and personal commitments around the schedule. A last-minute swap that affects a weekend isn't just inconvenient — it damages the sense that the rotation is a fair and predictable system, which is the only thing that makes it sustainable long-term.
Runbooks, SLOs, and the Alert Audit Loop That Keeps You Sane
Every alert that fires should have a runbook. Not a wiki page describing the service architecture. Not a Confluence document that was last updated eighteen months ago. A runbook: a living, numbered document that tells the on-call engineer exactly what to check, what commands to run, and what decisions to make — optimised for a person who may be half-asleep, may not know this service deeply, and has about five minutes before stakeholders start asking questions.
The most common runbook failure is structure: engineers write runbooks like documentation. They lead with context — here's how this service works, here's its architecture, here's the dependency graph. That context belongs in the service's architecture docs. A runbook during an active incident needs to start with the thing the engineer does right now, not the context they'd need to understand the system from scratch. Lead with the first command they should run. Everything else is footnotes.
Runbooks don't need to be perfect on day one. The minimum viable runbook has three headings: 'Is this alert real?', 'Known mitigations', and 'When to escalate and who to call.' The first heading should have a specific command or dashboard link that lets the engineer confirm the alert reflects a real problem, not a metric collection glitch. The second should have the two or three mitigations that have worked historically, even if they're partial. The third should have an explicit escalation matrix — not 'contact the team' but 'if the database is unreachable, call the database team at this PagerDuty service; if the issue is the payment gateway, here's the vendor's emergency number.'
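If your runbooks live as text or markdown files, the three-heading structure can be checked mechanically. This is a rough sketch of such a lint, assuming each heading sits on its own line (optionally prefixed with `#`); the function name and the exact heading strings are my assumptions based on the headings named above.

```python
REQUIRED_HEADINGS = (
    "Is this alert real?",
    "Known mitigations",
    "When to escalate and who to call",
)


def missing_runbook_sections(runbook_text):
    """Return required headings that are absent or appear out of order."""
    # Normalise each line: trim whitespace and any leading markdown '#'.
    lines = [line.strip().lstrip("#").strip() for line in runbook_text.splitlines()]
    missing = []
    position = -1  # index of the last required heading found so far
    for heading in REQUIRED_HEADINGS:
        try:
            position = lines.index(heading, position + 1)
        except ValueError:
            # Not found after the previous heading: absent or misplaced.
            missing.append(heading)
    return missing
```

Running this in CI against the runbook linked from each alert rule is one way to keep new alerts from shipping without a usable response guide.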
As incidents happen, the on-call engineer appends what they learned. Within three months of consistently following this pattern, you have battle-tested documentation that reflects reality rather than design intent. The gap between design intent and operational reality is where most runbooks fail — and closing that gap is what makes the difference between a runbook that helps and one that the engineer closes after 30 seconds because it doesn't match what they're seeing.
SLO-based alerting is a different philosophy entirely, and it's worth understanding the shift it requires. Instead of alerting when error rate exceeds 1% — an arbitrary threshold chosen by someone who had a reasonable gut feeling — you alert when you're consuming your monthly error budget faster than sustainable. This ties your paging directly to whether you're going to breach the reliability commitment you've made to your users. It dramatically reduces alert volume while ensuring that every page represents a genuine threat to that commitment.
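The arithmetic behind burn rate is small enough to sketch directly. Burn rate is the observed error ratio divided by the error budget (one minus the SLO target); a multi-window alert pages only when both a long and a short window exceed the threshold. The function names are hypothetical, and the 14.4 default is an assumption: it is a commonly cited value for a 1-hour and 5-minute window pair on a 30-day SLO, equivalent to spending about 2% of the monthly budget in one hour.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than sustainable the error budget is burning.

    A burn rate of 1.0 spends exactly the whole budget over the SLO period.
    """
    budget = 1.0 - slo_target  # e.g. roughly 0.001 for a 99.9% SLO
    return error_ratio / budget


def should_page(long_window_errors, short_window_errors, slo_target, threshold=14.4):
    """Multi-window check: both windows must exceed the burn-rate threshold.

    The long window (e.g. 1h) shows the burn is sustained; the short
    window (e.g. 5m) shows it is still happening right now.
    """
    return (
        burn_rate(long_window_errors, slo_target) > threshold
        and burn_rate(short_window_errors, slo_target) > threshold
    )
```

With a 99.9% SLO, a sustained 2% error ratio in both windows is a burn rate of about 20 and pages; the same 2% over the last hour with a recovered 0.05% in the last five minutes does not, because the short window says the problem has already stopped.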
The monthly alert audit is the most underrated practice in on-call operations. Pull a report of every alert that fired in the last 30 days. For each one: Was it actionable? Was there a documented response? If the same alert fired more than three times without a code fix being shipped, that's reliability debt — it belongs in the engineering backlog with a sprint assignment, not as a recurring 3 AM interruption that the team has collectively accepted as normal.
The audit meeting structure matters. Pull the data before the meeting — total alerts fired, percentage that resulted in an acknowledged incident with a documented response, mean time to acknowledge, and the ranked list of repeat offenders by firing frequency. Present it visually. The team's job during the meeting is to make three decisions about each alert on the list: keep it as is, fix the underlying condition that makes it fire too often, or delete it. Not discuss it — decide. Meetings that produce 'we should look into that' instead of 'deleted' or 'ticket created' waste their 30 minutes entirely.
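The pre-meeting report can be produced from a plain export of alert firings. A minimal sketch, assuming each firing is a dict with an alert name, whether it was actionable, and whether a fix shipped afterwards; those field names are hypothetical and would need mapping from whatever your paging tool exports.

```python
from collections import Counter


def audit_summary(firings, repeat_threshold=3):
    """Summarise a month of alert firings for the audit meeting.

    firings: list of dicts with keys 'alert' (name), 'actionable' (bool),
    and 'fix_shipped' (bool, whether a code fix followed this firing).
    """
    total = len(firings)
    actionable = sum(1 for f in firings if f["actionable"])
    counts = Counter(f["alert"] for f in firings)
    fixed = {f["alert"] for f in firings if f["fix_shipped"]}

    # Repeat offenders: fired more than the threshold with no fix shipped,
    # ranked by firing frequency (most frequent first).
    offenders = [
        (name, n) for name, n in counts.most_common()
        if n > repeat_threshold and name not in fixed
    ]
    return {
        "total": total,
        "actionable_pct": 100.0 * actionable / total if total else 0.0,
        "repeat_offenders": offenders,
    }
```

The `repeat_offenders` list is the meeting agenda: each entry gets a keep, fix, or delete decision.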
The SLO frame changes the audit conversation in a valuable way. Instead of asking 'was this specific alert actionable?' you ask 'is our error budget on track this month?' If the budget is healthy and you have 80% remaining at the midpoint, many threshold alerts that fired are provably noise — they didn't threaten the SLO. If the budget is burning and you're at 40% at midpoint, you need more alerting sensitivity, not less. The budget number makes the decision about alert sensitivity a function of reliability risk rather than engineering anxiety.
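The "budget remaining" number in that conversation is a single division. A sketch for a request-based SLO, assuming you can estimate the total request count for the whole SLO period (for example from last month's traffic); the function and parameter names are illustrative.

```python
def error_budget_remaining(slo_target, period_requests, failed_so_far):
    """Fraction of the period's error budget still unspent.

    period_requests is the expected request count for the whole SLO
    period, which is an estimate rather than a measured value.
    A negative result means the budget is already overspent.
    """
    allowed_failures = (1.0 - slo_target) * period_requests
    if allowed_failures <= 0:
        return 0.0
    return 1.0 - failed_so_far / allowed_failures
```

For a 99.9% SLO over an expected 10 million requests, the budget is about 10,000 failed requests; 2,000 failures at the midpoint leaves roughly 80% remaining, while 6,000 leaves roughly 40%.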
Alert Routing, Deduplication, and Suppression — The Plumbing That Makes It Work
Routing is where most alerting systems silently break in ways that are invisible until an incident proves it. Prometheus fires correctly. Alertmanager receives the alert. The alert routes to the wrong team — or to nobody at all — and the incident goes undetected. This failure is particularly dangerous because everything looks healthy: the alert fired, Alertmanager processed it, PagerDuty shows the service as active. The failure is in the gap between those systems, specifically in a label that's missing or mismatched.
Every alert label you add is implicitly a routing decision. The severity label determines which receiver handles the alert — PagerDuty for critical, Slack for warning. The team label determines which team's escalation policy is invoked. The service label enables inhibition rules and deduplication. Get any of these wrong and the alert either goes to the wrong destination or routes to Alertmanager's default receiver, which most teams configure as a catch-all Slack channel that nobody monitors at 3 AM.
Deduplication is Alertmanager's mechanism for not paging you multiple times for the same underlying problem. Two Prometheus instances — for example, a primary and a replica, or two regional scrape targets — will both fire the same alert when a metric breaches. Alertmanager groups them by label identity into a single notification. This is elegant when it works. The failure mode: two alerts that describe the same problem but have different labels — perhaps because one alert uses service="checkout" and another uses service="checkout-api" — are treated as separate alerts and generate separate pages. Label consistency across your alert rules is a more important operational practice than most teams realise, and it becomes critical when you have more than a handful of services.
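The label-identity behaviour is easier to see in a toy model. This sketch is not Alertmanager's actual fingerprinting code; it only assumes that two alerts are "the same" when their full label sets are equal, which is enough to show why a single mismatched label value doubles the pages.

```python
def fingerprint(labels):
    """Identity of an alert in this model: its full label set, order-independent."""
    return frozenset(labels.items())


def deduplicate(alerts):
    """Collapse alerts with identical label sets into one entry each."""
    unique = {}
    for labels in alerts:
        unique.setdefault(fingerprint(labels), labels)
    return list(unique.values())


# The same alert reported by two Prometheus instances: one notification.
same = [
    {"alertname": "HighErrorRate", "service": "checkout", "severity": "critical"},
    {"severity": "critical", "service": "checkout", "alertname": "HighErrorRate"},
]

# The same problem described with inconsistent labels: two notifications.
inconsistent = [
    {"alertname": "HighErrorRate", "service": "checkout", "severity": "critical"},
    {"alertname": "HighErrorRate", "service": "checkout-api", "severity": "critical"},
]
```

Here `deduplicate(same)` yields one alert and `deduplicate(inconsistent)` yields two, even though a human would read both pairs as the same incident.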
Suppression windows are your surgical tool for planned maintenance. Unlike blanket silences — which are blunt instruments that suppress all alerts from a service regardless of type — proper suppression targets specific alert names for specific time windows. The rule is simple and should be enforced at the tooling level: every silence requires a comment with a linked ticket URL, and every silence auto-expires within 4 hours maximum. If your maintenance window is longer than 4 hours, renew the silence manually. This creates intentional checkpoints where someone has to actively decide that the silence should continue.
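The silence policy above can be expressed as a small validation function that tooling calls before a silence is created. This is a sketch under assumptions: the dict shape loosely mirrors the fields of an Alertmanager silence (matchers, startsAt, endsAt, comment) but uses Python datetimes instead of timestamp strings, and the function name and the "must name an alertname" rule are my own reading of the policy.

```python
import re
from datetime import datetime, timedelta, timezone

MAX_SILENCE = timedelta(hours=4)
TICKET_URL = re.compile(r"https?://\S+")


def validate_silence(silence):
    """Return a list of policy violations for a proposed silence.

    silence: dict with 'matchers' (list of {'name', 'value'} dicts),
    'startsAt' and 'endsAt' (timezone-aware datetimes), and 'comment'.
    An empty list means the silence is acceptable.
    """
    problems = []
    if not TICKET_URL.search(silence.get("comment", "")):
        problems.append("comment must contain a ticket URL")
    if silence["endsAt"] - silence["startsAt"] > MAX_SILENCE:
        problems.append("duration exceeds 4 hours")
    # A silence that never names a specific alert is a blanket silence.
    if not any(m["name"] == "alertname" for m in silence["matchers"]):
        problems.append("must match a specific alertname")
    return problems


# Example: the 7-day, service-wide silence with no ticket fails all three checks.
start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
blanket = {
    "matchers": [{"name": "service", "value": "checkout"}],
    "startsAt": start,
    "endsAt": start + timedelta(days=7),
    "comment": "too noisy",
}
violations = validate_silence(blanket)
```

Renewal stays a deliberate act under this design: a longer maintenance window means creating a second silence that passes the same checks.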
Inhibition rules solve a specific and common problem: when a critical alert fires, you don't want five additional warning alerts for the same service all generating pages simultaneously. Alertmanager inhibition rules suppress lower-severity alerts when a higher-severity alert is already active for the same service. The result is one page for one incident instead of five pages for five symptoms of the same root cause. This is the difference between an on-call engineer who wakes up to a single clear incident and one who wakes up to a flood of notifications that obscures which one to start with.
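The effect of an inhibition rule can be modelled in a few lines. This is a simplified sketch, not Alertmanager's implementation: it assumes a single rule where any firing critical alert suppresses warning alerts that share the same service label, and the function name is hypothetical.

```python
def apply_inhibition(alerts):
    """Drop warning alerts for any service that has a critical alert firing.

    alerts: list of label dicts with 'alertname', 'severity', 'service'.
    """
    critical_services = {
        a["service"] for a in alerts if a["severity"] == "critical"
    }
    return [
        a for a in alerts
        if not (a["severity"] == "warning" and a["service"] in critical_services)
    ]


firing = [
    {"alertname": "CheckoutDown", "severity": "critical", "service": "checkout"},
    {"alertname": "CheckoutSlow", "severity": "warning", "service": "checkout"},
    {"alertname": "CheckoutQueueDepth", "severity": "warning", "service": "checkout"},
    {"alertname": "SearchSlow", "severity": "warning", "service": "search"},
]
notified = apply_inhibition(firing)
```

In this example the two checkout warnings are suppressed while the critical alert is active, and the unrelated search warning still goes out.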
One important operational practice that often gets skipped: test your routing after any label change. When a service is renamed, when an alert rule is refactored, when a team restructuring changes the team label values — routing breaks silently. amtool config routes test with a set of representative alert labels takes about 90 seconds and catches misroutes before an incident does. Add it to your CI pipeline as a validation step any time alertmanager.yml or alert rule files change.
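The idea of a routing test matrix can be sketched without Alertmanager at all. The model below is deliberately simplified: a flat, first-match-wins list of label requirements instead of a nested route tree, with hypothetical receiver names. It is meant to show the shape of the assertions, not to replace running the real check against your actual config.

```python
ROUTES = [
    # (required labels, receiver), evaluated top to bottom; first match wins.
    ({"team": "payments", "severity": "critical"}, "pagerduty-payments"),
    ({"team": "payments", "severity": "warning"}, "slack-payments"),
]
DEFAULT_RECEIVER = "slack-catch-all"


def resolve_receiver(labels, routes=ROUTES, default=DEFAULT_RECEIVER):
    """Return the receiver for an alert's labels under a flat route list."""
    for required, receiver in routes:
        if all(labels.get(k) == v for k, v in required.items()):
            return receiver
    return default


def misroutes(expectations, routes=ROUTES):
    """List (labels, expected, actual) for every case that routes wrong."""
    failures = []
    for labels, expected in expectations:
        actual = resolve_receiver(labels, routes)
        if actual != expected:
            failures.append((labels, expected, actual))
    return failures


# An alert rule that lost its team label during a refactor falls through
# to the catch-all receiver instead of paging anyone.
matrix = [
    ({"service": "checkout", "team": "payments", "severity": "critical"}, "pagerduty-payments"),
    ({"service": "checkout", "severity": "critical"}, "pagerduty-payments"),
]
failures = misroutes(matrix)
```

A non-empty `failures` list is what should fail the CI job.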
| Aspect | Threshold-Based Alerting | SLO Burn Rate Alerting |
|---|---|---|
| What it monitors | Raw metric value at a point in time (e.g. error rate > 1%) | Rate of error budget consumption across a rolling time window |
| Alert volume | High — fires on any metric breach, including brief self-correcting spikes | Low — requires sustained budget impact across multiple time windows simultaneously |
| False positive rate | High — transient spikes, batch jobs, autoscaling events all trigger pages | Low — multi-window filtering requires sustained degradation, not momentary breaches |
| Business alignment | Poor — a 1% error rate threshold has no direct relationship to your SLA | Excellent — directly tied to whether you'll breach your reliability commitment to users |
| Complexity to implement | Low — one PromQL expression per alert, straightforward to write | Medium — requires a defined SLO, burn rate math, and multi-window expressions |
| Best for | Early-stage services, simple infrastructure monitoring, teams new to observability | Production services with defined SLOs, teams with mature observability practices |
| Recovery detection | Alert resolves immediately when metric drops below threshold | Alert resolves when burn rate normalises — may persist briefly after the incident resolves |
| On-call experience | Poor — engineers get woken for self-healing spikes they can't affect and don't understand | Good — every page represents a genuine, sustained threat to a business commitment |
Key Takeaways
- Alert on symptoms users feel — latency, error rate, availability — not causes engineers investigate. CPU, memory, and disk belong on dashboards where engineers look during business hours, not on pagers that wake people up at 3 AM.
- The 'for' duration in Prometheus is your most powerful noise filter. A 2-minute minimum for critical alerts and 5-minute minimum for warnings eliminates the entire category of false pages from transient spikes and single bad scrape cycles.
- SLO burn rate alerting with multi-window expressions dramatically reduces alert volume while improving signal quality. It only pages when your reliability commitment to users is genuinely threatened — not when an arbitrary metric threshold is briefly crossed.
- The monthly alert audit — 30 minutes, data in hand, bias toward deleting — is the single highest-leverage practice in on-call operations. An alert that fires three times without a fix is backlog work disguised as monitoring.
- Every alert that pages someone must have a runbook. Not documentation about the system — a numbered response guide starting with 'confirm it's real,' followed by 'known mitigations,' followed by 'when and who to escalate to.' Architecture context belongs at the bottom.
- Blanket silences are the number one cause of undetected outages in teams that have experienced alert fatigue. Enforce 4-hour maximum duration and mandatory ticket URLs at the API level — policies that aren't technically enforced are suggestions.
Common Mistakes to Avoid
- Alerting on every metric that Prometheus exports by default
Symptom: 50+ alerts firing weekly, most of them self-resolving within minutes. Engineers start to treat every page with suspicion — waiting to see if it clears before investigating. A real outage gets missed because the on-call engineer has learned that 70% of pages don't need a response. The alert system is working technically but has lost the team's trust operationally.
Fix: Run an alert audit before the volume gets this bad, but if you're already here, audit under pressure. Pull every alert that fired in the last 30 days. For each one, apply the 3 AM test: if this fired at 3 AM, would the on-call engineer know what concrete action to take within 5 minutes? If the answer is no, delete it today, not after discussion. Add alerts back only when an actual incident demonstrates that the metric predicted or explained a user-facing problem. Start from incidents, not from Prometheus's metric catalogue.
- Setting 'for: 0m' on critical alerts
Symptom: A single bad Prometheus scrape cycle — which happens on virtually every production system at some frequency — triggers a P1 page at 3 AM. The on-call engineer investigates for 20 minutes, finds everything healthy, and goes back to sleep with slightly less trust in the system. After this happens three or four times, they start adding 'wait 5 minutes before looking at the laptop' as an informal personal policy — which is indistinguishable from ignoring the alert.
Fix: Set a minimum 'for' duration of 2 minutes for critical alerts and 5 minutes for warnings. The 'for' duration requires the metric to be in breach continuously for the specified time before the alert transitions from 'pending' to 'firing'. A single bad scrape affects one scrape interval, often around 15 seconds. A real incident lasts minutes to hours. The 2-minute minimum costs you almost nothing in real detection delay and eliminates the category of false pages that train engineers to distrust the system. If a team member argues 'but we need faster detection', the answer is: you need faster detection of real incidents, not faster false alarms.
- Writing runbooks that describe the system instead of the response
Symptom: On-call engineer opens the runbook during an active incident, reads two paragraphs about the service's architecture, a diagram of its database relationships, and a section on its deployment history — and then has to scroll to find the first actionable step. By the time they've oriented themselves in the document, 4 minutes have elapsed and they're still not sure what command to run.
Fix: Runbooks are incident tools, not architecture documents. Structure every runbook in a strict order: first, how to confirm the alert is real (one specific command or dashboard link, nothing more); second, the known mitigations in order of likelihood and simplicity; third, explicit escalation criteria and who to contact. Architecture context, service dependency diagrams, and historical incident notes belong at the bottom. The engineer responding to an incident should reach the first executable step within 30 seconds of opening the runbook. If they can't, the runbook is structured wrong.
- Creating blanket silences without a ticket URL or expiration time
Symptom: An engineer silences all alerts from a service to survive a noisy rotation. The silence outlives its justification. A real outage occurs. No page fires. Discovery comes from a customer complaint 4 hours later. This is not a hypothetical — it's the production incident at the top of this article, and variations of it happen at most organisations that have experienced sustained alert fatigue.
Fix: Enforce silence hygiene at the API level, not just as a policy. Put a webhook or validating proxy in front of silence creation requests: silences without a ticket URL in the comment field are rejected. Maximum silence duration is 4 hours, enforced by the webhook, not by trust. Add a daily Slack digest of active silences visible to every engineer on the team. Treat any silence older than 4 hours without a renewal as a bug that requires investigation. The policy is simple: if you can't explain why a silence exists in one sentence and point to a ticket, it shouldn't exist.
- Treating on-call as 'just part of the job' without compensation or meaningful support
Symptom: Senior engineers start leaving, citing 'work-life balance' in exit interviews when the real cause is three months of 3 AM false pages with no visible team investment in reducing them. The engineers who built the systems and know how to fix things quietly transfer to teams with better on-call culture. Institutional knowledge — the kind that lives in someone's head and not in any runbook — walks out with them.
Fix: Offer explicit on-call pay, time-off-in-lieu, or reduced sprint commitments during on-call weeks. But compensation alone doesn't fix this; it just makes the burden slightly more tolerable. Pair it with visible investment: show the team, in every sprint review, the alert volume trend, the audit results, the alerts that were deleted this month. Engineers who see the organisation treating alert noise as a real cost that deserves engineering resources stay longer than engineers who receive on-call pay for an experience that never gets better.
- Not validating alert routing after changing Prometheus labels or service names
Symptom: Team renames a service from 'checkout' to 'checkout-service' during a refactoring sprint. Every alert rule is updated to use the new name. The Alertmanager route tree still matches on 'checkout'. All alerts for the service now route to the default Slack channel instead of PagerDuty. Nobody notices until an incident fires during the next rotation and the primary on-call engineer receives no page.
Fix: Add amtool config routes test to your CI pipeline as a required check on any change to alertmanager.yml or alert rule files. Maintain a test matrix of label combinations for each service that asserts the expected receiver. This catches misroutes in CI instead of in production. It takes about 90 seconds to configure and runs in under a second. The cost of not doing this is discovering a routing gap during an incident, at which point the gap becomes part of the incident timeline.
Interview Questions on This Topic
- What's the difference between alerting on causes versus symptoms, and can you give a concrete example of each from a web service context? (Mid-level)
- How would you design an on-call rotation for a team of six engineers covering a service with a 99.9% SLA, and what safeguards would you put in place to prevent burnout? (Senior)
- A senior engineer proposes adding a CPU utilization alert at 80% threshold to catch performance problems early. How do you push back on this, and what would you propose instead? (Senior)
- Explain multi-window SLO burn rate alerting and why single-window burn rate alerts have a critical flaw. (Senior)
- Your team's alert volume has tripled after a major feature launch. Walk me through how you'd triage and reduce it. (Senior)
Frequently Asked Questions
How many alerts should a healthy on-call engineer receive per shift?
Google's SRE book recommends no more than two pages per 12-hour shift as a starting target, with a long-term goal of automation handling the rest without human intervention. In practice, most mature teams aim for fewer than five actionable pages per week across the entire rotation. The operative word is actionable — pages that required the engineer to do something meaningful, not acknowledge and close. If you're counting total pages including self-resolving ones, the number is meaningless. If you're seeing more than 5 actionable pages per week per engineer, treat alert volume as a reliability incident with a dedicated sprint slot. It won't fix itself.
What's the difference between an SLO and an SLA, and why does it matter for alerting?
An SLA — Service Level Agreement — is a contractual commitment to customers. Breach it and there are financial consequences: service credits, contract penalties, sometimes legal liability. An SLO — Service Level Objective — is your internal reliability target, intentionally set stricter than the SLA so your team has a buffer between 'we're getting close to the line' and 'we've crossed the line.' You alert on SLO burn rate so the team responds before the SLA is ever in danger. If you alert directly on SLA violation, you're already in breach when the page fires. The SLO is the early warning system; the SLA is the cliff edge. The whole point of SLO-based alerting is that you never need to see the cliff.
Should I use PagerDuty or OpsGenie for on-call routing?
Both are mature, production-proven tools that support escalation policies, on-call schedules, and integrations with Prometheus, Grafana, Datadog, and most observability stacks. The honest answer is that the tool choice matters far less than the quality of alerts feeding into it and the runbooks attached to each incident type. A perfectly configured PagerDuty instance with 60 noisy CPU alerts will destroy your team's wellbeing just as effectively as a poorly configured one. If you're already on one platform and it works, the cost of switching rarely justifies the disruption. If you're choosing fresh: PagerDuty has a broader third-party integration ecosystem and more mature enterprise features; OpsGenie integrates more natively with Atlassian tooling. Pick based on your existing stack and team familiarity, then invest the savings in fixing your alert design.
How do you handle alerting during planned maintenance windows?
Create a time-bounded, targeted silence in Alertmanager using amtool before the maintenance window begins. Silence specific alert names (the ones you know will fire during maintenance), not the entire service. Set the duration to the maintenance window plus a 30-minute buffer, renewing it if that exceeds your maximum silence duration. Include the ticket URL in the comment field. After the window, verify the silence has expired with amtool silence query, which lists active silences by default, and confirm alerts are routing correctly again with a test page. Post the silence details in your team's standup channel so everyone knows it exists for the duration. The critical discipline: never create a silence matching all alerts from a service. If you don't know which specific alerts will fire during maintenance, you don't know what your maintenance will actually affect, and that's a different problem to solve before the window starts.
What's the right 'for' duration for a critical alert?
At minimum: 2 minutes for critical alerts, 5 minutes for warnings. The 'for' duration is the time the metric must remain in continuous breach before the alert transitions from 'pending' to 'firing.' It filters transient spikes that self-resolve — a single bad scrape, a 15-second network glitch, a brief GC pause — without adding meaningful delay to detection of real incidents. Most production incidents persist for minutes to hours. The 2-minute minimum costs you nothing in real detection speed and eliminates the entire category of false pages that erode engineer trust in the paging system. If your argument for 'for: 0m' is that you need immediate detection, consider that you also need engineers to trust the alert when it fires. A 2-minute delay on detection is far less costly than engineers who've learned to wait 5 minutes before looking at the laptop.
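The pending-to-firing behaviour can be replayed with a small simulation. This is a sketch of the semantics as described above, not Prometheus code: it assumes the rule is evaluated at a fixed interval and that an alert fires once it has been continuously in breach for at least the 'for' duration. The function name and parameters are illustrative.

```python
def alert_states(breach_samples, eval_interval_s, for_s):
    """Replay rule evaluations and return the alert state after each one.

    breach_samples: list of bools, True when the expression is in breach
    at that evaluation. States are 'inactive', 'pending', or 'firing'.
    """
    states = []
    breach_time = None  # seconds spent continuously in breach so far
    for breached in breach_samples:
        if not breached:
            # Any healthy evaluation resets the clock.
            breach_time = None
            states.append("inactive")
            continue
        breach_time = 0 if breach_time is None else breach_time + eval_interval_s
        states.append("firing" if breach_time >= for_s else "pending")
    return states


# One bad evaluation between healthy ones, evaluated every 15 seconds.
blip = [False, True, False]
with_for_zero = alert_states(blip, eval_interval_s=15, for_s=0)
with_for_two_minutes = alert_states(blip, eval_interval_s=15, for_s=120)
```

With `for_s=0` the single blip goes straight to firing and pages someone; with `for_s=120` it only ever reaches pending and resolves on its own, while a breach that persists for two minutes still fires.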