SLI SLO SLA — Server Uptime Isn't Customer Uptime
99.99% server uptime masked an SLA breach because the SLI was misdefined.
20+ years shipping production infrastructure and CI/CD at scale. Written from production experience, not tutorials.
- SLI measures what actually happens: latency, availability, error rate.
- SLO sets the target threshold: "99.9% of requests under 200ms."
- SLA is the legal contract: breach it and you pay penalties.
- Error budget = 100% - SLO; it's your permission to deploy.
- Biggest mistake: setting SLO without tracking SLI first.
Imagine you hire a pizza delivery service that promises your pizza arrives within 30 minutes, 95% of the time. That promise (95% of deliveries within 30 minutes) is the target (SLO), the actual measurement of how long deliveries really took is the indicator (SLI), and the written contract you signed guaranteeing that promise, with a refund if they fail, is the agreement (SLA). SLI is what you measure, SLO is what you aim for, and SLA is what you're legally on the hook for.
Every time your app goes down at 2am, someone is paging an on-call engineer, a customer is losing money, and a trust contract is being violated. The difference between teams that handle outages gracefully and teams that scramble blindly almost always comes down to whether they've defined what 'good' looks like before everything breaks. SLIs, SLOs, and SLAs are the three-layer framework that forces that definition into existence — and they're the backbone of Site Reliability Engineering (SRE) as practised at Google, Netflix, and virtually every serious tech company.
The problem they solve is deceptively simple: how do you know if your service is performing well enough? Without precise definitions, 'the site is slow' is just vibes. Is 500ms response time acceptable? What about 800ms? What percentage of requests can fail before your users actually churn? SLIs give you the measurement, SLOs give you the threshold, and SLAs give those thresholds real commercial weight. Together they transform fuzzy gut-feelings into actionable engineering decisions.
By the end of this article you'll be able to write a real SLO for a web API, understand how to derive SLIs from Prometheus metrics, explain error budgets to a product manager, and avoid the three classic mistakes that cause teams to either over-promise in their SLAs or burn out their engineers chasing impossible uptime targets.
What Is a Service Level Indicator (SLI)?
An SLI is a raw measurement of some aspect of your service's behavior. Think of it as a gauge on your dashboard. Common SLIs include request latency, error rate, throughput, or availability. The key is that an SLI must be quantifiable, collected consistently, and aligned with what users perceive.
- Latency: p50, p95, p99 response times
- Error rate: proportion of 5xx responses
- Throughput: requests per second
- Availability: ratio of successful requests to total
You don't need to measure everything. Pick the few metrics that directly impact user satisfaction. Google's SRE book notes that a handful of representative indicators is usually enough to reason about a service's health.
- Thermometer gives a number, not a verdict — SLI gives raw data.
- Multiple thermometers give a better picture: latency + error rate + throughput.
- The same SLI can be healthy in one context and broken in another (e.g., 500ms is fine for an async job but too slow for a synchronous API call).
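As a minimal sketch of how these indicators fall out of raw request data, the Python below computes availability and a nearest-rank latency percentile. The Request record and its field names are hypothetical, not taken from any monitoring library; in production these numbers would normally come from your metrics system rather than an in-memory list.

```python
import math
from dataclasses import dataclass

@dataclass
class Request:
    status: int        # HTTP status code returned to the client
    latency_ms: float  # measured request duration in milliseconds

def availability(requests):
    """Ratio of non-5xx responses to all responses."""
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]

sample = [Request(200, 80), Request(200, 120), Request(500, 950), Request(200, 95)]
latencies = [r.latency_ms for r in sample]

print(availability(sample))       # 0.75
print(percentile(latencies, 50))  # 95
print(percentile(latencies, 95))  # 950
```

Note that the median looks healthy while the p95 exposes the slow failing request, which is why latency SLIs are usually stated as high percentiles.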
What Is a Service Level Objective (SLO)?
An SLO is the target value or range for an SLI over a specific time window. It's the promise you make to yourself (and your team) about how good the service should be. For example: "99.9% of requests will complete in under 300ms, measured over a rolling 30-day window."
SLOs are your internal reliability goals. They drive engineering decisions: if the SLO is at risk, you stop shipping features and fix stability. The time window matters — a 30-day window smooths out spikes but can hide long-term degradation. A 7-day window reacts faster but might trigger false alarms.
SLOs are also the basis for error budgets: the acceptable amount of unreliability (100% - SLO). An SLO of 99.9% means you can be down 0.1% of the time, which is about 43 minutes per month.
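The arithmetic behind that 43-minute figure can be sketched in a few lines. The function name is illustrative, and the 30-day window is an assumption that matches the example above.

```python
def allowed_downtime_minutes(slo_percent, window_days=30):
    """Minutes of full downtime the SLO tolerates over the window."""
    error_budget = (100 - slo_percent) / 100   # e.g. 99.9 -> 0.001
    return error_budget * window_days * 24 * 60

for slo in (99.5, 99.9, 99.99):
    print(slo, round(allowed_downtime_minutes(slo), 2))
# 99.5 216.0
# 99.9 43.2
# 99.99 4.32
```

Each extra nine cuts the budget by a factor of ten, which is worth showing to anyone asking for "just one more nine".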
What Is a Service Level Agreement (SLA)?
An SLA is a formal contract between you and your customer (or another team) that specifies the level of service you guarantee, often with financial or business penalties if it's breached. Unlike SLOs which are internal, SLAs are external commitments.
SLAs are usually expressed in terms of availability (e.g., "99.9% uptime per month") but can also include latency, support response times, or throughput. The key difference from SLOs is the consequence: breach an SLO and you have a postmortem; breach an SLA and you write a cheque.
Real-world example: the Amazon Compute SLA commits to 99.99% monthly uptime for EC2 at the region level. If availability drops below that, you get service credits. A money-back guarantee written into the contract: that's an SLA.
SLAs should be more lenient than your SLOs. If your SLA is 99.9% and your SLO is also 99.9%, you have zero buffer: the first time you miss your internal target, you are already in breach of contract. Best practice: set SLO tighter than SLA (e.g., internal SLO 99.95%, external SLA 99.9%).
How Error Budgets Connect SLIs, SLOs, and SLAs
Error budget is the amount of unreliability your service is allowed, defined as 100% minus your SLO. For a 99.9% SLO, you have 0.1% error budget (about 43 minutes/month). As long as you haven't exhausted the budget, you're free to deploy new features. Once the budget is depleted, you freeze releases until reliability is restored.
This mechanism solves the classic tension between feature velocity and stability. Instead of arguing about whether to ship, you have a data-driven policy: if error budget remaining > 0, ship; if zero, fix.
Error budgets are usually tracked over a rolling window (30 days) to reflect recent performance. They can disappear faster than you expect: a single 10-minute outage eats about 23% of your monthly budget for a 99.9% SLO (10 of 43.2 minutes).
Real-world use: Teams at Google use error budgets to decide if they can launch new features or must focus on reliability. It's not about perfection; it's about knowing when to push and when to hold.
- Each outage is a withdrawal from the account.
- If you hit zero, your team goes into 'debt recovery' mode — no new features until balance is restored.
- Can you carry over budget from month to month? Usually not: it resets with the window. Some teams use a quarterly window instead.
- Surplus budget is permission to innovate. It's not a sign of laziness; it's a resource.
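The bank-account view above can be sketched as a small ledger: outages are withdrawals, and the release policy reads the balance. The function names and the strict "ship only while budget remains" rule are illustrative assumptions; real policies often add warning thresholds before the freeze.

```python
def budget_minutes(slo_percent, window_days=30):
    """Total allowed downtime for the window, in minutes."""
    return (100 - slo_percent) / 100 * window_days * 24 * 60

def remaining_fraction(slo_percent, outages_minutes, window_days=30):
    """Share of the budget left after all outages in the window (negative means overdrawn)."""
    return 1 - sum(outages_minutes) / budget_minutes(slo_percent, window_days)

def release_decision(fraction_left):
    """Ship while any budget remains, otherwise freeze releases."""
    return "ship" if fraction_left > 0 else "freeze"

left = remaining_fraction(99.9, [10])            # one 10-minute outage
print(round(left, 3), release_decision(left))    # 0.769 ship

left = remaining_fraction(99.9, [10, 20, 15])    # 45 minutes of outages in total
print(round(left, 3), release_decision(left))    # -0.042 freeze
```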
Common Pitfalls in Implementing SLI/SLO/SLA
Teams often dive into defining SLOs without first understanding their SLIs. That's putting the cart before the horse. Here are the three biggest mistakes:
- Defining SLOs without data: You can't set a meaningful target unless you know your current baseline. Collect SLI data for at least two weeks first.
- SLO too strict: 99.99% sounds great on a slide deck, but it means you can afford only 4.3 minutes of downtime per month. That's brutal unless you have redundant infrastructure.
- SLA equals SLO: If your internal target is the same as your contractual promise, you have zero room for surprise. Always make SLO tighter than SLA.
Another subtle pitfall: measuring SLIs at the wrong granularity. A single global SLI might hide regional failures. Always consider segmenting by geography, data center, or critical endpoint.
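To illustrate the granularity pitfall, here is a small sketch in which a healthy-looking global availability hides a region that fails half its requests. The record layout, region names, and traffic numbers are all made up for the example.

```python
from collections import defaultdict

def availability_by(records, key):
    """Availability per distinct value of `key` (e.g. region or endpoint)."""
    counts = defaultdict(lambda: [0, 0])   # key value -> [good, total]
    for rec in records:
        bucket = counts[rec[key]]
        bucket[0] += 1 if rec["ok"] else 0
        bucket[1] += 1
    return {name: good / total for name, (good, total) in counts.items()}

# Hypothetical traffic: one large healthy region, one small broken one.
records = (
    [{"region": "us-east", "ok": True}] * 990
    + [{"region": "ap-south", "ok": True}] * 5
    + [{"region": "ap-south", "ok": False}] * 5
)

overall = sum(1 for r in records if r["ok"]) / len(records)
per_region = availability_by(records, "region")

print(overall)      # 0.995
print(per_region)   # {'us-east': 1.0, 'ap-south': 0.5}
```

The global number would pass a 99.5% SLO while every second request from the smaller region fails.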
How to Calculate Burn Rate Before Your SLO Catches Fire
You don't wait for the monthly SLO report to find out you're bleeding error budget. By then, you've already lost. Burn rate tells you, in real time, how fast you're consuming your allowed failures relative to the SLO window. If your SLO is 99.9% over 30 days, your total budget is 43.2 minutes of downtime. A burn rate of 1 means you'll hit zero exactly at the end of the window. A rate of 2 means you'll exhaust your budget in 15 days. Anything sustained above 1 is a code red: you will miss the SLO if nothing changes. Calculate it by dividing your actual error budget consumption rate by the ideal consumption rate; for a request-based SLI, that is the observed error ratio divided by (1 - SLO). Monitor this as a P1 alert threshold, not a dashboard afterthought. Set a high burn rate alert (e.g., 2x over 1 hour) to catch cascading failures before they blow a quarter's worth of reliability in a single deployment. Low burn rate alerts (e.g., 1x over 6 hours) catch slow regressions. These are your early warning systems. Implement them before you need them.
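A minimal sketch of that calculation, assuming a request-based SLI where the observed error ratio is already available (for example from a Prometheus query). The function names are illustrative.

```python
def burn_rate(error_ratio, slo_percent):
    """Observed error ratio divided by the error ratio the SLO allows."""
    allowed = (100 - slo_percent) / 100
    return error_ratio / allowed

def days_to_exhaustion(rate, window_days=30):
    """How long a full budget lasts if this burn rate is sustained."""
    return float("inf") if rate <= 0 else window_days / rate

rate = burn_rate(error_ratio=0.002, slo_percent=99.9)   # 0.2% of requests failing
print(round(rate, 2))                       # 2.0
print(round(days_to_exhaustion(rate), 1))   # 15.0
```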
Why Multi-Window, Multi-Burn-Rate Alerting Saves Your Weekend
A single burn rate alert window gives you false positives or misses entirely. Here's the fix: two windows, two thresholds. The short window (e.g., 1 hour) catches fast, catastrophic events. The long window (e.g., 6 hours) catches slow drifts from bad configs or resource leaks. You only alert when both windows exceed their respective burn rate thresholds. This eliminates the noise from brief traffic spikes or transient failures that self-heal. For example, a 5-minute burst of 503s during a deploy triggers the short window, but if the long window is clean, you skip the page. Your NOC thanks you. Implement this with Prometheus rules using two separate recording rules for burn rates over different time ranges, then combine them with an AND condition. This is how mature SRE teams filter out the 90% of alerts that don't matter. The math is simple: short window = 2x burn rate for 1 hour, long window = 1x burn rate for 6 hours. Any single failure mode that hits both simultaneously is real.
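The two-window rule can be sketched in plain Python before translating it into Prometheus rules. The thresholds are the ones quoted above (2x over 1 hour, 1x over 6 hours); the function names and the request counts are hypothetical.

```python
def window_burn_rate(errors, total, slo_percent):
    """Burn rate for one window, from raw error and request counts."""
    if total == 0:
        return 0.0
    return (errors / total) / ((100 - slo_percent) / 100)

def should_page(short_burn, long_burn, short_threshold=2.0, long_threshold=1.0):
    """Page only when BOTH windows exceed their thresholds (the AND condition)."""
    return short_burn > short_threshold and long_burn > long_threshold

# Deploy blip: errors spike in the last hour, but the 6-hour window is still clean.
blip = should_page(
    window_burn_rate(60, 10_000, 99.9),     # 1h window: burn rate 6.0
    window_burn_rate(60, 100_000, 99.9),    # 6h window: burn rate 0.6
)

# Sustained incident: both windows are burning.
incident = should_page(
    window_burn_rate(60, 10_000, 99.9),     # 1h window: burn rate 6.0
    window_burn_rate(300, 100_000, 99.9),   # 6h window: burn rate 3.0
)

print(blip, incident)   # False True
```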
How To SLA Your Way Out Of A Contractual Ambush
An SLA is a legal document, not a technical target. Your SLO is what you commit to internally. Your SLA is what you promise a customer in writing. Never let your SLA match your SLO. Always set the SLA lower (worse) than your SLO. Why? Because you need a buffer. If your SLO is 99.9% and your SLA is also 99.9%, one bad month means you've broken a contract. You pay penalties or lose the customer. Instead, set your SLO at 99.9% and your SLA at 99.5%. Now you have 0.4 percentage points of room for error before legal gets involved. This isn't being dishonest; it's being realistic. Your internal SLO is where you aim. Your SLA is the floor below which you promise compensation. Write it into the contract explicitly: "Service Level Target: 99.9% monthly. Service Level Commitment: 99.5% monthly." Also define the measurement window, exclusion windows (scheduled maintenance, customer-induced failures), and credit calculation. Penalties vary by contract; a typical structure is a credit of 5-15% of monthly fees per 0.1% below SLA. Without these definitions, your legal team has nothing to rely on except your monitoring data, which they'll ask you to defend in a deposition. Don't learn this lesson in a courtroom.
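As a rough sketch of what a credit calculation might look like, the function below applies a fixed credit per 0.1 percentage point of shortfall. Everything here is an assumption for illustration: the 10% rate (inside the 5-15% range mentioned above), rounding a partial step up to a full one, and the 100% cap. Your contract defines the real rules. Availability is passed in basis points (99.5% = 9950) to avoid floating-point surprises at the step boundaries.

```python
def service_credit(monthly_fee, measured_bp, sla_bp=9950,
                   credit_pct_per_step=10, step_bp=10, cap_pct=100):
    """Credit owed for the month, given measured availability in basis points."""
    shortfall_bp = max(sla_bp - measured_bp, 0)
    steps = -(-shortfall_bp // step_bp)   # ceiling division: a partial step counts in full
    credit_pct = min(steps * credit_pct_per_step, cap_pct)
    return monthly_fee * credit_pct / 100

print(service_credit(10_000, 9970))   # 0.0     (99.70%: above the 99.5% SLA)
print(service_credit(10_000, 9945))   # 1000.0  (99.45%: one partial step below)
print(service_credit(10_000, 9920))   # 3000.0  (99.20%: three steps below)
```

In this example a month at 99.70% misses the 99.9% internal target but costs nothing contractually, which is exactly the buffer the SLO/SLA gap is meant to buy.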
The Mental Model That Stops You From Wasting Time on the Wrong Metrics
Stop cargo-culting dashboards. SLIs, SLOs, and SLAs are a decision-making hierarchy, not a compliance checklist. Here's the why: You need to know what good looks like before you can promise it, and you need to know what you promised before you get sued.
The mental model is simple: SLIs are raw measurement. They answer "is it working?". SLOs are target guardrails. They answer "are we okay?". SLAs are contractual teeth. They answer "how much do we owe them when we screw up?". You pick SLIs that actually reflect user happiness (latency, error rate), not CPU. You set SLOs that give you room to ship without violating a promise. You let SLAs be driven by business risk, not engineering ego.
Most teams invert this pyramid. They start with an SLA because legal said so, then work backwards to guess at an SLO. That's how you end up measuring p99 latency on a batch job that users don't even hit. Fix your model first.
Real-World Example: How an E-Commerce App Killed Its Black Friday SLO in 12 Minutes
Here's what happens when theory meets a production clusterfuck. Your app has three critical services: product search, checkout, and payment. Each gets its own SLI/SLO. The mistake most teams make is rolling everything into a single "app" SLO. That's how you miss the payment service burning at 3x the rate while search is fine.
Set separate SLOs. Product search: p95 latency under 200ms, 99.5% success. Checkout: p99 latency under 1s, 99.9% success (because that's where users bail). Payment: 100% success, 5s timeout (because banks are slow). Now run Black Friday. Payment starts timing out due to upstream bank latency. Your 5s threshold burns budget fast. You see the burn rate alert at 14x. You have 12 minutes before you violate the SLO.
Because you have separate SLOs, you route traffic away from the failing payment provider in 2 minutes. You don't take down the whole site. You don't email customers about a 'site-wide issue'. You just failover a provider. Your SLO survives. Your SLA survives. Your bonus survives. That's why you model per-service, not per-app.
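A sketch of why the per-service view matters: with the made-up request counts below, the app-wide burn rate looks calm while payment burns at 14x. The search and checkout targets come from the example above; the payment target is a hypothetical 99.9% stand-in, because a burn rate is undefined when the target leaves no error budget at all.

```python
# Success-rate targets per service; the payment value is an assumed stand-in.
SLO_PERCENT = {"search": 99.5, "checkout": 99.9, "payment": 99.9}

def burn_rate(errors, total, slo_percent):
    """Observed error ratio divided by the error ratio the SLO allows."""
    return (errors / total) / ((100 - slo_percent) / 100)

# service -> (errors, total requests) in the alert window; hypothetical numbers
window = {
    "search": (200, 1_000_000),
    "checkout": (10, 50_000),
    "payment": (70, 5_000),
}

per_service = {
    name: burn_rate(errors, total, SLO_PERCENT[name])
    for name, (errors, total) in window.items()
}

all_errors = sum(errors for errors, _ in window.values())
all_total = sum(total for _, total in window.values())
app_wide = burn_rate(all_errors, all_total, 99.9)   # a single rolled-up "app" SLO

print({name: round(rate, 2) for name, rate in per_service.items()})
# {'search': 0.04, 'checkout': 0.2, 'payment': 14.0}
print(round(app_wide, 2))   # 0.27
```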
Tools and Technologies for Monitoring and Managing SRE Metrics
SLIs, SLOs, and SLAs are useless without the tools to measure them. Production observability platforms like Datadog, New Relic, and Google’s Stackdriver (now part of Google Cloud Observability) provide ready-made SLI dashboards and error budget tracking. Datadog’s SLO widget shows real-time burn rate against your target. New Relic’s SLO feature lets you define a numeric target and shows predicted exhaustion. Stackdriver’s Service Monitoring ties directly to GCP services, automatically generating SLIs for uptime, latency, and throughput. For deeper SLO management, dedicated tools like Nobl9 or Sloth codify SLOs as code and generate alerting policies, and chaos engineering tools like Gremlin help test whether those targets hold under failure. The WHY: manual calculation leads to stale data and missed breaches. Tooling automates the math and surfaces warnings before your weekend is ruined. Pick a platform that integrates with your existing stack. No one wins by adding yet another dashboard nobody watches.
Involving Stakeholders in SLO Negotiation
SLOs built in isolation by SREs get ignored. Stakeholders—product managers, business owners, and engineering leads—define customer expectations. WHY: You need shared ownership of reliability. Start by asking stakeholders: “What makes a request successful from the user’s view?” That answer yields a raw SLI, like “page load under 2 seconds.” Then negotiate: is 99.9% uptime worth the cost of over-provisioning? Use error budgets as the currency of this negotiation. Frame it as trade-offs: raising the SLO to 99.99% means fewer features shipped. Show them a burn rate chart: “If we hit this velocity, we stop deploys for 6 hours.” This makes reliability a business decision, not an engineering ultimatum. The outcome: a contract both sides respect. Without stakeholder buy-in, your SLO becomes a secret metric—and secret metrics never save weekends.
Balancing Ambition and Realism in SLO Targets
Setting an SLO at 99.999% sounds heroic but destroys your engineering velocity. The WHY: a higher SLO leaves a smaller error budget (99.999% allows roughly 26 seconds of downtime in 30 days), so every tiny outage burns a larger fraction of it. Realism means knowing your platform’s baseline: start by measuring current SLIs over 30 days. If your actual p99 latency is 800ms, a 99.9% SLO at 200ms is fantasy. Ambition means setting targets slightly above your baseline to drive improvements, but not so high that the error budget is exhausted by routine deploys. A balanced starting point: 99.5% for internal APIs, 99.9% for customer-facing endpoints. Use the burn rate formula: if you consume >10% of error budget in 1 hour, your target is too aggressive. Review quarterly, and tighten the SLO as reliability improves. The rule: your SLO should be a stretch, not a mirage.
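One way to sanity-check a proposed target against your baseline is to measure how often past requests would have met it. The sketch below assumes you can export a sample of observed latencies; the function names and the sample distribution are hypothetical.

```python
def attainment(latencies_ms, threshold_ms):
    """Fraction of observed requests that finished under the threshold."""
    return sum(1 for value in latencies_ms if value < threshold_ms) / len(latencies_ms)

def is_realistic(latencies_ms, threshold_ms, slo_percent):
    """True if the baseline already meets the proposed SLO."""
    return attainment(latencies_ms, threshold_ms) * 100 >= slo_percent

# Hypothetical 30-day baseline: most requests are fast, with a slow tail.
baseline = [120] * 900 + [400] * 90 + [800] * 10

print(attainment(baseline, 200))            # 0.9
print(is_realistic(baseline, 200, 99.9))    # False: fantasy for this baseline
print(is_realistic(baseline, 1000, 99.9))   # True: already achievable
```

A target the baseline misses by a wide margin is a project, not an SLO; one it barely clears is a reasonable stretch.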
The 99.9% Uptime That Lost a Million-Dollar Client
- Your SLI must reflect what your customer experiences, not what your infrastructure reports.
- An SLO that doesn't match the real user journey is a ticking time bomb.
- Always map SLIs to customer-facing metrics before committing to an SLA.
Prometheus error rate: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
Check recent changes: git log --oneline --since="24 hours ago"
Common mistakes to avoid
- Setting SLO before measuring SLI
- Using server-side uptime as the only SLI
- Making SLA identical to SLO
- Not segmenting SLIs by criticality
Interview Questions on This Topic
What is the difference between SLI, SLO, and SLA?