Intermediate 4 min · March 06, 2026

SLI SLO SLA — Server Uptime Isn't Customer Uptime

Naren · Founder
Plain-English first. Then code. Then the interview question.
Quick Answer
  • SLI measures what actually happens: latency, availability, error rate.
  • SLO sets the target threshold: "99.9% of requests under 200ms."
  • SLA is the legal contract: breach it and you pay penalties.
  • Error budget = 100% - SLO; it's your permission to deploy.
  • Biggest mistake: setting SLO without tracking SLI first.

Every time your app goes down at 2am, someone is paging an on-call engineer, a customer is losing money, and a trust contract is being violated. The difference between teams that handle outages gracefully and teams that scramble blindly almost always comes down to whether they've defined what 'good' looks like before everything breaks. SLIs, SLOs, and SLAs are the three-layer framework that forces that definition into existence — and they're the backbone of Site Reliability Engineering (SRE) as practised at Google, Netflix, and virtually every serious tech company.

The problem they solve is deceptively simple: how do you know if your service is performing well enough? Without precise definitions, 'the site is slow' is just vibes. Is 500ms response time acceptable? What about 800ms? What percentage of requests can fail before your users actually churn? SLIs give you the measurement, SLOs give you the threshold, and SLAs give those thresholds real commercial weight. Together they transform fuzzy gut-feelings into actionable engineering decisions.

By the end of this article you'll be able to write a real SLO for a web API, understand how to derive SLIs from Prometheus metrics, explain error budgets to a product manager, and avoid the three classic mistakes that cause teams to either over-promise in their SLAs or burn out their engineers chasing impossible uptime targets.

What Is a Service Level Indicator (SLI)?

An SLI is a raw measurement of some aspect of your service's behavior. Think of it as a gauge on your dashboard. Common SLIs include request latency, error rate, throughput, or availability. The key is that an SLI must be quantifiable, collected consistently, and aligned with what users perceive.

For a web API, typical SLIs are:
  • Latency: p50, p95, p99 response times
  • Error rate: proportion of 5xx responses
  • Throughput: requests per second
  • Availability: ratio of successful requests to total

You don't need to measure everything. Pick the few metrics that directly impact user satisfaction. Google's SRE book makes the same point: choose a handful of representative indicators rather than tracking every metric you can collect.
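
As a rough illustration of the SLIs listed above, the sketch below derives availability, error rate, and latency percentiles from a batch of request records. The `Request` type, the sample data, and the nearest-rank percentile method are assumptions made for this example; a production setup would normally read these values from a metrics system instead.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code returned to the client
    latency_ms: float  # time taken to serve the request, in milliseconds


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def compute_slis(requests):
    """Compute a few common SLIs from a non-empty list of requests."""
    total = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)  # 5xx counts as failure
    latencies = [r.latency_ms for r in requests]
    return {
        "availability": (total - errors) / total,
        "error_rate": errors / total,
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
    }


# Hypothetical sample: 97 fast successes, 2 slow successes, 1 server error.
sample = (
    [Request(200, 100.0)] * 97
    + [Request(200, 900.0)] * 2
    + [Request(503, 50.0)]
)
slis = compute_slis(sample)
print(slis)
```

In this sample the median and p95 both sit at 100 ms while p99 jumps to 900 ms, which is why tail percentiles are usually tracked alongside the median.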

What Is a Service Level Objective (SLO)?

An SLO is the target value or range for an SLI over a specific time window. It's the promise you make to yourself (and your team) about how good the service should be. For example: "99.9% of requests will complete in under 300ms, measured over a rolling 30-day window."
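
A minimal sketch of checking the example SLO above against one window of data. The function name, the default threshold, and the sample window are assumptions for illustration; the window is taken to be whatever set of requests falls inside the 30-day period.

```python
def meets_latency_slo(latencies_ms, threshold_ms=300.0, target=0.999):
    """Return (attained_fraction, ok) for 'target share of requests finish under threshold_ms'."""
    good = sum(1 for ms in latencies_ms if ms < threshold_ms)
    attained = good / len(latencies_ms)
    return attained, attained >= target


# Hypothetical window: 10,000 requests, 15 of them slower than 300 ms.
window = [120.0] * 9985 + [450.0] * 15
attained, ok = meets_latency_slo(window)
print(f"{attained:.2%} under threshold, SLO met: {ok}")
```

Here 99.85% of requests are fast enough, so a 99.9% objective is missed even though the service looks healthy at a glance.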

SLOs are your internal reliability goals. They drive engineering decisions: if the SLO is at risk, you stop shipping features and fix stability. The time window matters: a 30-day window smooths out spikes but is slow to reflect a recent degradation. A 7-day window reacts faster but might trigger false alarms.

SLOs are also the basis for error budgets: the acceptable amount of unreliability (100% - SLO). An SLO of 99.9% means you can be down 0.1% of the time, which is about 43 minutes per month.
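
The downtime arithmetic can be reproduced with a few lines. This is a sketch that assumes a 30-day window and treats the budget purely as minutes of full downtime; the function name is made up for the example.

```python
def allowed_downtime_minutes(slo_percent, window_days=30):
    """Minutes of full downtime an availability SLO permits within the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100


for slo in (99.0, 99.9, 99.95, 99.99):
    print(f"{slo}% SLO -> {allowed_downtime_minutes(slo):.1f} minutes per 30 days")
```

For 99.9% this gives about 43.2 minutes, and for 99.99% about 4.3 minutes.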

What Is a Service Level Agreement (SLA)?

An SLA is a formal contract between you and your customer (or another team) that specifies the level of service you guarantee, often with financial or business penalties if it's breached. Unlike SLOs which are internal, SLAs are external commitments.

SLAs are usually expressed in terms of availability (e.g., "99.9% uptime per month") but can also include latency, support response times, or throughput. The key difference from SLOs is the consequence: breach an SLO and you have a postmortem; breach an SLA and you write a cheque.

Real-world example: the Amazon Compute SLA commits to a Region-level monthly uptime of at least 99.99% for EC2. If availability drops below that, you get service credits. That's a financial remedy written into a contract, which is what makes it an SLA.

SLAs should be more lenient than your SLOs. If your SLA is 99.9% and your SLO is also 99.9%, you have zero safety margin: the moment you miss your internal target, you have also breached the contract. Best practice: set the SLO tighter than the SLA (e.g., internal SLO 99.95%, external SLA 99.9%).
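
One way to picture the gap between the two thresholds is a small classifier over measured availability. The function name, the labels, and the default values (taken from the example figures above) are illustrative assumptions.

```python
def classify_availability(measured_percent, slo_percent=99.95, sla_percent=99.9):
    """Place a measured availability relative to the internal SLO and the external SLA."""
    if measured_percent >= slo_percent:
        return "within SLO"
    if measured_percent >= sla_percent:
        return "SLO missed, SLA intact"  # internal warning, no penalties yet
    return "SLA breached"                # contractual consequences apply


for measured in (99.97, 99.92, 99.85):
    print(measured, "->", classify_availability(measured))
```

The middle band only exists when the SLO is stricter than the SLA; with equal thresholds every internal miss is also a contractual breach.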

How Error Budgets Connect SLIs, SLOs, and SLAs

Error budget is the amount of unreliability your service is allowed, defined as 100% minus your SLO. For a 99.9% SLO, you have 0.1% error budget (about 43 minutes/month). As long as you haven't exhausted the budget, you're free to deploy new features. Once the budget is depleted, you freeze releases until reliability is restored.

This mechanism solves the classic tension between feature velocity and stability. Instead of arguing about whether to ship, you have a data-driven policy: if error budget remaining > 0, ship; if zero, fix.
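
The ship-or-fix policy can be sketched as a request-based budget check. The function names and traffic numbers are hypothetical, and real policies often add thresholds (for example, slowing releases before the budget reaches zero) that this sketch leaves out.

```python
def error_budget_remaining(total_requests, failed_requests, slo=0.999):
    """Fraction of the error budget still unspent (negative once overspent).

    Assumes slo < 1, otherwise there is no budget to divide by.
    """
    allowed_failures = total_requests * (1 - slo)
    return 1 - failed_requests / allowed_failures


def release_decision(remaining):
    # The policy from the text: any budget left means ship, none means fix.
    return "ship" if remaining > 0 else "freeze"


# Hypothetical 30-day traffic of 1,000,000 requests, so about 1,000 failures are allowed.
healthy = error_budget_remaining(1_000_000, 400)
exhausted = error_budget_remaining(1_000_000, 1_200)
print(release_decision(healthy), release_decision(exhausted))
```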

Error budgets are usually tracked over a rolling window (30 days) to reflect recent performance. They can be consumed quickly: a single 10-minute full outage uses about 23% of the 43-minute monthly budget for a 99.9% SLO.
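
A minimal sketch of rolling-window tracking, assuming downtime is recorded once per day in minutes. The class name and the per-day granularity are choices made for this example; finer-grained tracking follows the same idea.

```python
from collections import deque


class RollingErrorBudget:
    """Track downtime against an availability SLO over a rolling window of days."""

    def __init__(self, slo_percent=99.9, window_days=30):
        self.budget_minutes = window_days * 24 * 60 * (100 - slo_percent) / 100
        # maxlen makes the oldest day fall out as each new day is recorded.
        self.daily_downtime = deque(maxlen=window_days)

    def record_day(self, downtime_minutes):
        self.daily_downtime.append(downtime_minutes)

    def consumed_fraction(self):
        return sum(self.daily_downtime) / self.budget_minutes


budget = RollingErrorBudget()
budget.record_day(10)            # one day with a 10-minute outage
for _ in range(29):
    budget.record_day(0)         # 29 clean days fill the rest of the window
after_outage = budget.consumed_fraction()

budget.record_day(0)             # day 31: the outage day ages out of the window
after_aging_out = budget.consumed_fraction()
print(f"{after_outage:.1%} consumed, then {after_aging_out:.1%}")
```

The 10-minute outage consumes roughly 23% of the budget and keeps counting against it until that day leaves the window.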

Real-world use: Teams at Google use error budgets to decide if they can launch new features or must focus on reliability. It's not about perfection; it's about knowing when to push and when to hold.

Common Pitfalls in Implementing SLI/SLO/SLA

Teams often dive into defining SLOs without first understanding their SLIs. That's putting the cart before the horse. Here are the three biggest mistakes:

  1. Defining SLOs without data: You can't set a meaningful target unless you know your current baseline. Collect SLI data for at least two weeks first.
  2. SLO too strict: 99.99% sounds great on a slide deck, but it means you can afford only 4.3 minutes of downtime per month. That's brutal unless you have redundant infrastructure.
  3. SLA equals SLO: If your internal target is the same as your contractual promise, you have zero room for surprise. Always make the SLO tighter than the SLA.

Another subtle pitfall: measuring SLIs at the wrong granularity. A single global SLI might hide regional failures. Always consider segmenting by geography, data center, or critical endpoint.
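
To see how a global figure can hide a failing segment, the sketch below compares overall availability with a per-endpoint breakdown. The segment names and traffic mix are invented for illustration.

```python
from collections import defaultdict


def availability_by_segment(requests):
    """requests: iterable of (segment, succeeded) pairs; returns availability per segment."""
    totals = defaultdict(int)
    good = defaultdict(int)
    for segment, succeeded in requests:
        totals[segment] += 1
        good[segment] += 1 if succeeded else 0
    return {segment: good[segment] / totals[segment] for segment in totals}


# Hypothetical traffic: search dominates volume, checkout is small but failing.
traffic = (
    [("search", True)] * 9900
    + [("checkout", True)] * 80
    + [("checkout", False)] * 20
)
overall = sum(1 for _, succeeded in traffic if succeeded) / len(traffic)
per_segment = availability_by_segment(traffic)
print(f"overall: {overall:.1%}", per_segment)
```

Overall availability is 99.8%, yet checkout succeeds only 80% of the time. A per-segment SLO on checkout would flag this long before the global number moved.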

SLI vs SLO vs SLA
| Aspect | SLI (Service Level Indicator) | SLO (Service Level Objective) | SLA (Service Level Agreement) |
|---|---|---|---|
| Definition | A raw measurement of service performance (e.g., latency, error rate) | A target value for an SLI over a time window | A contractual promise to a customer with penalties |
| Owner | Engineering / DevOps team | Engineering team (internal) | Legal / Business / Customer team |
| Example | p99 response time = 300ms | 99% of requests under 500ms over 30 days | Uptime >= 99.9% per month, credits if breached |
| Consequences if missed | You see red on dashboard | Stop feature releases, fix reliability | Pay financial penalties or lose customer trust |
| Time window | Real-time or short rolling window | Typically 28-30 days rolling | Monthly or quarterly, often fixed calendar |
| Relation | The raw data | The goal based on SLI | The promise based on SLO |

Key Takeaways

  • SLI: measure what users experience, not what servers report.
  • SLO: set a data-backed target, leave headroom above your SLA.
  • SLA: only promise what you can measure and enforce.
  • Error budget: the permission to deploy. Track it daily.
  • Start simple: one SLI, one SLO, one SLA. Iterate as you learn.

Common Mistakes to Avoid

  • Setting SLO before measuring SLI
    Symptom: Your SLO is unachievable or too lenient because it's based on intuition, not data. You'll either miss it constantly or never challenge the team.
    Fix: Instrument your service to collect SLIs for at least two weeks. Compute p99 latency, error rate, and availability baselines. Then set an initial SLO the service can actually meet at that baseline, and tighten it as reliability improves.
  • Using server-side uptime as the only SLI
    Symptom: Your dashboards show 99.99% uptime, but users complain of errors. You miss the real problem because your SLI doesn't reflect user experience.
    Fix: Include client-side metrics (RUM) or synthetic probes. Define SLI as proportion of successful user-facing requests, not just server health.
  • Making SLA identical to SLO
    Symptom: No headroom. A minor incident that bumps you from 99.95% to 99.90% triggers SLA penalties, because there is no buffer between an internal miss and a contractual breach.
    Fix: Set the SLO 0.05 to 0.1 percentage points tighter than the SLA. For example, internal SLO at 99.95%, external SLA at 99.9%. That gives you a 0.05-point buffer (about 22 minutes/month).
  • Not segmenting SLIs by criticality
    Symptom: A broken checkout endpoint is hidden in a global 'all endpoints' average. You miss that your revenue-critical flow is failing.
    Fix: Define separate SLIs and SLOs for critical user journeys (login, checkout, search) rather than a single service-level metric.

Interview Questions on This Topic

  • Q: What is the difference between SLI, SLO, and SLA? (Junior)
    SLI is a specific metric like latency or error rate that's measured. SLO is a target value for an SLI over a time window, e.g., 99% of requests under 500ms. SLA is a contractual commitment to a customer, often with financial penalties. The key distinction: SLI is the data, SLO is the goal, SLA is the promise.
  • Q: How do you decide the right SLO for a new service? (Mid-level)
    First, collect SLI data for at least two weeks to understand baseline performance. Then consider business impact: critical user journeys demand tighter SLOs. Use the error budget approach: start with a manageable SLO (e.g., 99.9% of requests under 500ms) and tighten gradually. Also factor in your infrastructure, since single-region deployments can't realistically promise 99.99%. Finally, set the SLO stricter than the SLA to leave headroom.
  • Q: Your error budget is depleted. What do you do? (Senior)
    Immediately freeze all non-critical releases and feature deployments. The team switches to 'reliability mode': identify the root cause through postmortems of the incidents in the budget window, then apply fixes such as adding circuit breakers, increasing redundancy, or optimizing slow endpoints. Releases resume once the error budget is positive again, which on a rolling window happens as the incidents age out. This is the core of data-driven reliability: when the budget is gone, you stop building and fix.
  • Q: Why is it risky to use a single global SLI? (Senior)
    A single global average hides regional or endpoint-level degradation. For example, a global SLI showing 99.9% of requests under the latency target could mask a critical region (like the EU) experiencing 2-second latency. The best approach is to segment SLIs by geography, data center, or user journey. Then set SLOs per segment, and alert on any segment approaching breach. This gives you actionable, granular reliability data.

Frequently Asked Questions

Do I need all three (SLI, SLO, SLA) for every service?

Not necessarily. For internal services, an SLO may be sufficient to guide reliability improvements. For customer-facing services with contractual obligations, an SLA is needed. SLIs are always needed if you want to measure anything. Start with SLI + SLO for all services; add SLA only when there's a commercial agreement.

Can an SLO be the same as an SLA?

Technically yes, but it's risky. If your internal target equals your legal commitment, you have zero margin for error. A single outage could breach both. Best practice is to set SLO stricter (e.g., 99.95%) than SLA (e.g., 99.9%) to provide a buffer.

How often should we review SLIs and SLOs?

At least quarterly. SLIs may need to evolve as your system changes or user expectations shift. SLOs can be tightened as your reliability improves. Always review after a major outage or infrastructure change. Avoid changing SLOs reactively during an incident; wait until things stabilize.

What's the recommended granularity for an error budget?

Error budgets are typically tracked over a rolling 30-day window. This smooths daily variation while staying responsive to sustained issues. Some teams use a 7-day window for faster feedback, but the shorter window is noisier and can trigger more alerts. A rolling window never resets all at once; budget comes back only as old incidents age out of the window.
