SLI SLO SLA — Server Uptime Isn't Customer Uptime
- SLI measures what actually happens: latency, availability, error rate.
- SLO sets the target threshold: "99.9% of requests under 200ms."
- SLA is the legal contract: breach it and you pay penalties.
- Error budget = 100% - SLO; it's your permission to deploy.
- Biggest mistake: setting SLO without tracking SLI first.
Every time your app goes down at 2am, someone is paging an on-call engineer, a customer is losing money, and a trust contract is being violated. The difference between teams that handle outages gracefully and teams that scramble blindly almost always comes down to whether they've defined what 'good' looks like before everything breaks. SLIs, SLOs, and SLAs are the three-layer framework that forces that definition into existence — and they're the backbone of Site Reliability Engineering (SRE) as practised at Google, Netflix, and virtually every serious tech company.
The problem they solve is deceptively simple: how do you know if your service is performing well enough? Without precise definitions, 'the site is slow' is just vibes. Is 500ms response time acceptable? What about 800ms? What percentage of requests can fail before your users actually churn? SLIs give you the measurement, SLOs give you the threshold, and SLAs give those thresholds real commercial weight. Together they transform fuzzy gut-feelings into actionable engineering decisions.
By the end of this article you'll be able to write a real SLO for a web API, understand how to derive SLIs from Prometheus metrics, explain error budgets to a product manager, and avoid the three classic mistakes that cause teams to either over-promise in their SLAs or burn out their engineers chasing impossible uptime targets.
What Is a Service Level Indicator (SLI)?
An SLI is a raw measurement of some aspect of your service's behavior. Think of it as a gauge on your dashboard. Common SLIs include request latency, error rate, throughput, or availability. The key is that an SLI must be quantifiable, collected consistently, and aligned with what users perceive.
- Latency: p50, p95, p99 response times
- Error rate: proportion of 5xx responses
- Throughput: requests per second
- Availability: ratio of successful requests to total
You don't need to measure everything. Pick the few metrics that directly impact user satisfaction. Google's SRE book advises choosing a handful of representative indicators rather than tracking every metric you can collect.
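To make the four SLI types concrete, here is a minimal sketch that derives availability, error rate, and latency percentiles from a batch of request records. The `Request` record, the nearest-rank percentile method, and the sample traffic are all assumptions made for illustration; a real service would pull these numbers from its metrics system.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code returned to the client
    latency_ms: float  # end-to-end response time in milliseconds


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def compute_slis(requests):
    """Derive basic SLIs from a non-empty list of Request records."""
    total = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)  # count 5xx responses
    latencies = [r.latency_ms for r in requests]
    return {
        "availability": (total - errors) / total,
        "error_rate": errors / total,
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
    }


# Hypothetical sample: 97 fast successes, 2 slow successes, 1 server error.
sample = (
    [Request(200, 120.0)] * 97
    + [Request(200, 850.0)] * 2
    + [Request(503, 30.0)]
)
slis = compute_slis(sample)
print(slis)
```

In this sample the median and p95 both look healthy at 120ms while the p99 is 850ms, which is why tail percentiles are usually tracked alongside the median.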
What Is a Service Level Objective (SLO)?
An SLO is the target value or range for an SLI over a specific time window. It's the promise you make to yourself (and your team) about how good the service should be. For example: "99.9% of requests will complete in under 300ms, measured over a rolling 30-day window."
SLOs are your internal reliability goals. They drive engineering decisions: if the SLO is at risk, you stop shipping features and fix stability. The time window matters — a 30-day window smooths out spikes but can hide long-term degradation. A 7-day window reacts faster but might trigger false alarms.
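A latency SLO of this shape can be checked with very little code. The sketch below assumes you already have the latencies for one evaluation window in a list; the function name, the 300ms threshold, and the sample window are illustrative, not a standard API.

```python
def slo_compliance(latencies_ms, threshold_ms, target_ratio):
    """Return (observed_ratio, met) for a latency SLO over one window."""
    good = sum(1 for latency in latencies_ms if latency < threshold_ms)
    observed = good / len(latencies_ms)
    return observed, observed >= target_ratio


# Hypothetical window: 10,000 requests, 15 of them slower than 300ms.
window = [120.0] * 9985 + [450.0] * 15
observed, met = slo_compliance(window, threshold_ms=300.0, target_ratio=0.999)
print(f"{observed:.2%} of requests under 300ms, SLO met: {met}")
```

With a 99.9% target, only 10 of these 10,000 requests may be slow, so 15 slow requests put the window out of compliance.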
SLOs are also the basis for error budgets: the acceptable amount of unreliability (100% - SLO). An SLO of 99.9% means you can be down 0.1% of the time, which is about 43 minutes per month.
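The downtime arithmetic is worth seeing for more than one target. This small sketch converts an availability SLO into allowed downtime, assuming a 30-day window and treating the budget purely as time (a request-based budget would count failed requests instead).

```python
def allowed_downtime_minutes(slo_percent, window_days=30):
    """Minutes of full downtime an availability SLO permits over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100


for slo in (99.0, 99.9, 99.95, 99.99):
    print(f"{slo}% -> {allowed_downtime_minutes(slo):.1f} minutes per 30 days")
```

Each extra nine cuts the budget by a factor of ten: 432 minutes at 99%, 43.2 at 99.9%, and 4.3 at 99.99%.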
What Is a Service Level Agreement (SLA)?
An SLA is a formal contract between you and your customer (or another team) that specifies the level of service you guarantee, often with financial or business penalties if it's breached. Unlike SLOs which are internal, SLAs are external commitments.
SLAs are usually expressed in terms of availability (e.g., "99.9% uptime per month") but can also include latency, support response times, or throughput. The key difference from SLOs is the consequence: breach an SLO and you have a postmortem; breach an SLA and you write a cheque.
Real-world example: the AWS Compute SLA promises 99.99% availability for EC2. If availability drops below that, you get service credits. That contractual remedy is what makes it an SLA rather than an internal goal.
SLAs should be more lenient than your SLOs. If your SLA is 99.9% and your SLO is also 99.9%, you have no safety margin: the moment you miss your internal target, you are also in breach of contract. Best practice: set the SLO tighter than the SLA (e.g., internal SLO 99.95%, external SLA 99.9%).
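One way to picture the gap between the two thresholds is as a three-state check. This is a sketch using the example values from the text (SLO 99.95%, SLA 99.9%); the function and its return strings are hypothetical.

```python
def classify_month(availability_pct, slo_pct=99.95, sla_pct=99.9):
    """Classify a month's measured availability against the SLO and the SLA."""
    if availability_pct < sla_pct:
        return "SLA breached"
    if availability_pct < slo_pct:
        return "SLO missed, SLA intact"
    return "within SLO"


for measured in (99.97, 99.92, 99.85):
    print(measured, "->", classify_month(measured))
```

The middle state is the early warning. If the SLO and SLA are set to the same value, that state disappears and the first signal of trouble is the breach itself.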
How Error Budgets Connect SLIs, SLOs, and SLAs
Error budget is the amount of unreliability your service is allowed, defined as 100% minus your SLO. For a 99.9% SLO, you have 0.1% error budget (about 43 minutes/month). As long as you haven't exhausted the budget, you're free to deploy new features. Once the budget is depleted, you freeze releases until reliability is restored.
This mechanism solves the classic tension between feature velocity and stability. Instead of arguing about whether to ship, you have a data-driven policy: if error budget remaining > 0, ship; if zero, fix.
Error budgets are usually tracked over a rolling window (30 days) to reflect recent performance. They can be consumed quickly: a single 10-minute outage eats about 23% of the 43-minute monthly budget for a 99.9% SLO.
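The ship-or-fix policy can be sketched as a small function. It assumes a time-based budget over a 30-day window and the simple rule described above (ship while any budget remains); real policies often add thresholds such as slowing releases once most of the budget is gone.

```python
def budget_status(slo_percent, downtime_minutes, window_days=30):
    """Return (fraction_consumed, minutes_remaining, decision) for an error budget."""
    budget = window_days * 24 * 60 * (100 - slo_percent) / 100
    consumed = downtime_minutes / budget
    remaining = budget - downtime_minutes
    decision = "ship" if remaining > 0 else "freeze"
    return consumed, remaining, decision


consumed, remaining, decision = budget_status(99.9, downtime_minutes=10)
print(f"consumed {consumed:.0%}, {remaining:.1f} minutes left -> {decision}")
```

For a 99.9% SLO, one 10-minute outage uses a little under a quarter of the monthly budget, and a second incident of 35 minutes or so would exhaust it.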
Real-world use: Teams at Google use error budgets to decide if they can launch new features or must focus on reliability. It's not about perfection; it's about knowing when to push and when to hold.
Common Pitfalls in Implementing SLI/SLO/SLA
Teams often dive into defining SLOs without first understanding their SLIs. That's putting the cart before the horse. Here are the three biggest mistakes:
- Defining SLOs without data: You can't set a meaningful target unless you know your current baseline. Collect SLI data for at least two weeks first.
- SLO too strict: 99.99% sounds great on a slide deck, but it means you can afford only 4.3 minutes of downtime per month. That's brutal unless you have redundant infrastructure.
- SLA equals SLO: If your internal target is the same as your contractual promise, you have zero room for surprise. Always make the SLO tighter than the SLA.
Another subtle pitfall: measuring SLIs at the wrong granularity. A single global SLI might hide regional failures. Always consider segmenting by geography, data center, or critical endpoint.
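The following sketch shows how a global number can mask a regional failure. The region names and traffic mix are invented for illustration; the same grouping works for data centers or endpoints.

```python
from collections import defaultdict


def availability_by_segment(requests):
    """requests: iterable of (segment, succeeded) pairs. Returns availability per segment."""
    totals = defaultdict(lambda: [0, 0])  # segment -> [successes, total]
    for segment, ok in requests:
        totals[segment][0] += int(ok)
        totals[segment][1] += 1
    return {seg: good / total for seg, (good, total) in totals.items()}


# Hypothetical traffic: two healthy regions and one small, badly degraded region.
traffic = (
    [("us-east", True)] * 6000
    + [("eu-west", True)] * 3900
    + [("ap-south", True)] * 50
    + [("ap-south", False)] * 50
)
overall = sum(ok for _, ok in traffic) / len(traffic)
per_region = availability_by_segment(traffic)
print(f"global: {overall:.2%}")
for region, value in per_region.items():
    print(f"{region}: {value:.2%}")
```

The global figure is 99.50%, which might pass a lenient SLO, while half of all requests in the smallest region are failing.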
| Aspect | SLI (Service Level Indicator) | SLO (Service Level Objective) | SLA (Service Level Agreement) |
|---|---|---|---|
| Definition | A raw measurement of service performance (e.g., latency, error rate) | A target value for an SLI over a time window | A contractual promise to a customer with penalties |
| Owner | Engineering / DevOps team | Engineering team (internal) | Legal / Business / Customer team |
| Example | p99 response time = 300ms | 99% of requests under 500ms over 30 days | Uptime >= 99.9% per month, credits if breached |
| Consequences if missed | You see red on dashboard | Stop feature releases, fix reliability | Pay financial penalties or lose customer trust |
| Time window | Real-time or short rolling window | Typically 28-30 days rolling | Monthly or quarterly, often fixed calendar |
| Relation | The raw data | The goal based on SLI | The promise based on SLO |
Key Takeaways
- SLI: measure what users experience, not what servers report.
- SLO: set a data-backed target, leave headroom above your SLA.
- SLA: only promise what you can measure and enforce.
- Error budget: the permission to deploy. Track it daily.
- Start simple: one SLI, one SLO, one SLA. Iterate as you learn.
Common Mistakes to Avoid
- Setting SLO before measuring SLI
Symptom: Your SLO is unachievable or too lenient because it's based on intuition, not data. You'll either miss it constantly or never challenge the team.
Fix: Instrument your service to collect SLIs for at least two weeks. Compute p99 latency, error rate, and availability baselines. Then set an SLO that's slightly stricter than the baseline.
- Using server-side uptime as the only SLI
Symptom: Your dashboards show 99.99% uptime, but users complain of errors. You miss the real problem because your SLI doesn't reflect user experience.
Fix: Include client-side metrics (RUM) or synthetic probes. Define the SLI as the proportion of successful user-facing requests, not just server health.
- Making SLA identical to SLO
Symptom: No headroom: a minor incident that pushes you just below 99.9% misses your internal target and triggers SLA penalties at the same moment. You have no buffer between the two.
Fix: Set the SLO 0.05 to 0.1 percentage points tighter than the SLA. For example, internal SLO at 99.95%, external SLA at 99.9%. That gives you a 0.05-point buffer (about 22 minutes/month).
- Not segmenting SLIs by criticality
Symptom: A broken checkout endpoint is hidden in a global 'all endpoints' average. You miss that your revenue-critical flow is failing.
Fix: Define separate SLIs and SLOs for critical user journeys (login, checkout, search) rather than a single service-level metric.
Interview Questions on This Topic
- Q: What is the difference between SLI, SLO, and SLA? (Junior)
- Q: How do you decide the right SLO for a new service? (Mid-level)
- Q: Your error budget is depleted. What do you do? (Senior)
- Q: Why is it risky to use a single global SLI? (Senior)
Frequently Asked Questions
Do I need all three (SLI, SLO, SLA) for every service?
Not necessarily. For internal services, an SLO may be sufficient to guide reliability improvements. For customer-facing services with contractual obligations, an SLA is needed. SLIs are always needed if you want to measure anything. Start with SLI + SLO for all services; add SLA only when there's a commercial agreement.
Can an SLO be the same as an SLA?
Technically yes, but it's risky. If your internal target equals your legal commitment, you have zero margin for error. A single outage could breach both. Best practice is to set SLO stricter (e.g., 99.95%) than SLA (e.g., 99.9%) to provide a buffer.
How often should we review SLIs and SLOs?
At least quarterly. SLIs may need to evolve as your system changes or user expectations shift. SLOs can be tightened as your reliability improves. Always review after a major outage or infrastructure change. Avoid changing SLOs reactively during an incident; wait until things stabilize.
What's the recommended granularity for an error budget?
Error budgets are typically tracked over a rolling 30-day window. This smooths daily variation while staying responsive to sustained issues. Some teams use a 7-day window for faster feedback, at the cost of noisier alerts. A rolling window never resets all at once; budget is regained only as old incidents age out of the window.
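A rolling window can be sketched with a fixed-length queue of daily counts. The class below is a hypothetical illustration, assuming a request-based budget and one (good, total) pair recorded per day; production systems usually compute this from time-series queries instead.

```python
from collections import deque


class RollingErrorBudget:
    def __init__(self, slo_ratio, window_days=30):
        self.slo_ratio = slo_ratio
        # deque with maxlen drops the oldest day automatically when a new one arrives
        self.days = deque(maxlen=window_days)

    def record_day(self, good, total):
        self.days.append((good, total))

    def remaining_fraction(self):
        """Fraction of the window's error budget still unspent (negative if overspent)."""
        total = sum(t for _, t in self.days)
        if total == 0:
            return 1.0
        bad = total - sum(g for g, _ in self.days)
        allowed_bad = (1 - self.slo_ratio) * total
        return 1 - bad / allowed_bad


budget = RollingErrorBudget(slo_ratio=0.999)
budget.record_day(good=98_000, total=100_000)   # day 1: incident, 2,000 failed requests
for _ in range(29):
    budget.record_day(good=100_000, total=100_000)
after_30_days = budget.remaining_fraction()

budget.record_day(good=100_000, total=100_000)  # day 31: the incident day ages out
after_31_days = budget.remaining_fraction()
print(f"day 30: {after_30_days:.0%} left, day 31: {after_31_days:.0%} left")
```

The incident keeps two thirds of the budget spent for the full 30 days, then the budget returns to 100% on the day the incident leaves the window.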