Senior 10 min · March 06, 2026

SLI SLO SLA — Server Uptime Isn't Customer Uptime

Q: Do I need all three (SLI, SLO, SLA) for every service?

Not necessarily. For internal services, an SLO may be sufficient to guide reliability improvements. For customer-facing services with contractual obligations, an SLA is needed. SLIs are always needed if you want to measure anything. Start with SLI + SLO for all services; add SLA only when there's a commercial agreement.

Q: Can an SLO be the same as an SLA?

Technically yes, but it's risky. If your internal target equals your legal commitment, you have zero margin for error. A single outage could breach both. Best practice is to set SLO stricter (e.g., 99.95%) than SLA (e.g., 99.9%) to provide a buffer.

Q: How often should we review SLIs and SLOs?

At least quarterly. SLIs may need to evolve as your system changes or user expectations shift. SLOs can be tightened as your reliability improves. Always review after a major outage or infrastructure change. Avoid changing SLOs reactively during an incident; wait until things stabilize.

Q: What's the recommended granularity for an error budget?

Error budgets are typically tracked over a rolling 30-day window. This smooths daily variation while being responsive to sustained issues. Some teams use a 7-day window for faster feedback, but more frequent alerts can cause noise. The rolling window resets only as old data ages out.

99.99% server uptime masked an SLA breach because SLI was misdefined.

Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Written from production experience, not tutorials.


Last updated: May 23, 2026


⚡Quick Answer

SLI measures what actually happens: latency, availability, error rate.
SLO sets the target threshold: "99.9% of requests under 200ms."
SLA is the legal contract: breach it and you pay penalties.
Error budget = 100% - SLO; it's your permission to deploy.
Biggest mistake: setting SLO without tracking SLI first.

✦ Definition

What is SLI SLO SLA?

An SLI is a raw measurement of some aspect of your service's behavior. Think of it as a gauge on your dashboard. Common SLIs include request latency, error rate, throughput, or availability. The key is that an SLI must be quantifiable, collected consistently, and aligned with what users perceive.


Plain-English First

Imagine you hire a pizza delivery service that promises your pizza arrives within 30 minutes, 95% of the time. That promise (30 minutes, 95% of the time) is the target (SLO), the actual measurement of how long deliveries really took is the indicator (SLI), and the written contract you signed guaranteeing that promise, with a refund if they fail, is the agreement (SLA). SLI is what you measure, SLO is what you aim for, and SLA is what you're legally on the hook for.

Every time your app goes down at 2am, someone is paging an on-call engineer, a customer is losing money, and a trust contract is being violated. The difference between teams that handle outages gracefully and teams that scramble blindly almost always comes down to whether they've defined what 'good' looks like before everything breaks. SLIs, SLOs, and SLAs are the three-layer framework that forces that definition into existence — and they're the backbone of Site Reliability Engineering (SRE) as practised at Google, Netflix, and virtually every serious tech company.

The problem they solve is deceptively simple: how do you know if your service is performing well enough? Without precise definitions, 'the site is slow' is just vibes. Is 500ms response time acceptable? What about 800ms? What percentage of requests can fail before your users actually churn? SLIs give you the measurement, SLOs give you the threshold, and SLAs give those thresholds real commercial weight. Together they transform fuzzy gut-feelings into actionable engineering decisions.

By the end of this article you'll be able to write a real SLO for a web API, understand how to derive SLIs from Prometheus metrics, explain error budgets to a product manager, and avoid the three classic mistakes that cause teams to either over-promise in their SLAs or burn out their engineers chasing impossible uptime targets.

What Is a Service Level Indicator (SLI)?

For a web API, typical SLIs are:

Latency: p50, p95, p99 response times
Error rate: proportion of 5xx responses
Throughput: requests per second
Availability: ratio of successful requests to total

You don't need to measure everything. Pick the few metrics that directly impact user satisfaction. Google's SRE book recommends choosing a handful of representative indicators rather than tracking every metric you can.

sli_latency.promqlPROMQL

# SLI: p99 latency over the last 5 minutes
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

Output

0.345 (meaning 99% of requests complete in 345ms)
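If you want to sanity-check what the dashboard reports, the same SLIs can be derived from raw request records. The following is a minimal sketch, assuming a hypothetical `Request` record with a status code and a duration; it uses a simple nearest-rank percentile, which is not how Prometheus interpolates histogram buckets.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code
    duration_s: float  # response time in seconds


def error_rate(requests: list[Request]) -> float:
    """Proportion of requests that returned a 5xx status."""
    if not requests:
        return 0.0
    return sum(1 for r in requests if r.status >= 500) / len(requests)


def latency_percentile(requests: list[Request], pct: float) -> float:
    """Nearest-rank percentile of response time, in seconds."""
    durations = sorted(r.duration_s for r in requests)
    if not durations:
        raise ValueError("no requests to measure")
    rank = max(1, math.ceil(pct * len(durations) / 100))
    return durations[rank - 1]


# Hypothetical sample: 97 fast successes, 2 slow successes, 1 server error
sample = (
    [Request(200, 0.1)] * 97
    + [Request(200, 0.9)] * 2
    + [Request(503, 0.2)]
)
print(f"error rate: {error_rate(sample):.1%}")
print(f"p50: {latency_percentile(sample, 50)}s")
print(f"p99: {latency_percentile(sample, 99)}s")
```

The p50 looks healthy while the p99 exposes the slow tail, which is why latency SLIs are usually defined on high percentiles.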

Mental Model: SLI as a Thermometer

Thermometer gives a number, not a verdict — SLI gives raw data.
Multiple thermometers give a better picture: latency + error rate + throughput.
The same SLI value can be healthy in one context and broken in another (e.g., 500ms is fine for an async job but too slow for a synchronous API call).

Production Insight

If your SLI measures the wrong thing (e.g., server uptime instead of user-visible errors), you'll have perfect dashboards during an outage.

Always validate your SLI against user complaints during incidents.

Rule: an SLI that doesn't match real user experience is worse than no SLI at all.

Key Takeaway

SLI is what you measure. Make sure it's what users care about.

Test your SLI definition by asking: if this metric goes bad, will the customer feel it?

If the answer is no, you're measuring the wrong thing.


What Is a Service Level Objective (SLO)?

An SLO is the target value or range for an SLI over a specific time window. It's the promise you make to yourself (and your team) about how good the service should be. For example: "99.9% of requests will complete in under 300ms, measured over a rolling 30-day window."

SLOs are your internal reliability goals. They drive engineering decisions: if the SLO is at risk, you stop shipping features and fix stability. The time window matters — a 30-day window smooths out spikes but can hide long-term degradation. A 7-day window reacts faster but might trigger false alarms.

SLOs are also the basis for error budgets: the acceptable amount of unreliability (100% - SLO). An SLO of 99.9% means you can be down 0.1% of the time, which is about 43 minutes per month.
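The downtime arithmetic is worth having as a helper so nobody does it in their head during a planning meeting. This is a small sketch; the function name and the 30-day default are assumptions for illustration.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of full outage the error budget permits over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


for target in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% over 30 days -> {minutes:.1f} minutes of budget")
```

Each extra nine cuts the budget by a factor of ten, which is the real cost hiding behind a tighter target.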

slo_definition.yamlYAML

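# Example SLO definition for a web API
# Illustrative schema only: the apiVersion and kind are hypothetical, not a published API.
apiVersion: slo.example.com/v1
kind: SLO
metadata:
  name: http-api-latency
spec:
  goal: 0.999  # 99.9% of requests must meet the threshold
  service: io.thecodeforge.api
  indicator:
    latency:
      thresholdMs: 300  # a request is "good" if it completes in under 300 ms
  window: 30d  # rolling 30 days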

Output

(No direct output — this is a config file)

Common SLO Trap: Choosing an Aggressive Target

A 99.99% SLO (four nines) allows only 4.3 minutes of downtime per month. Unless you've invested in multi-region redundancy and chaos engineering, you'll probably blow your error budget and demoralise the team. Start with 99.9% and tighten later.

Production Insight

Setting an SLO without knowing your baseline SLI is guessing.

You'll either set impossible targets that erode team morale or trivial ones that don't push improvement.

Rule: measure your current SLI for at least two weeks before defining your first SLO.

Key Takeaway

SLO is your internal quality target. It's not a customer promise — that's SLA.

Choose a window that matches your release cycle (30 days is standard).

The hardest part of SLO is not setting it, but defending it when features are at stake.

Choosing Your First SLO

If: Service is critical to revenue (e.g., checkout API)
→ Use: Set a tighter SLO: 99.9% of requests under 200ms

If: Service is an internal tool with low expectations
→ Use: Start with 99% of requests under 1s; you can always tighten

If: You have no monitoring in place
→ Use: Don't set an SLO yet. First instrument SLIs and collect data for one month.

If: You have multi-region deployment in production
→ Use: Consider a 99.99% SLO, but only if you have automated failover and load shedding.

What Is a Service Level Agreement (SLA)?

An SLA is a formal contract between you and your customer (or another team) that specifies the level of service you guarantee, often with financial or business penalties if it's breached. Unlike SLOs which are internal, SLAs are external commitments.

SLAs are usually expressed in terms of availability (e.g., "99.9% uptime per month") but can also include latency, support response times, or throughput. The key difference from SLOs is the consequence: breach an SLO and you have a postmortem; breach an SLA and you write a cheque.

Real-world example: the AWS Compute SLA commits to 99.99% monthly uptime for EC2 at the Region level. If availability drops below that, customers can claim service credits. A commitment with a financial consequence attached: that's an SLA.

SLAs should be more lenient than your SLOs. If your SLA is 99.9% and your SLO is also 99.9%, you have zero safety margin: the moment you miss your internal target you have also breached the contract. Best practice: set SLO tighter than SLA (e.g., internal SLO 99.95%, external SLA 99.9%).

sla_clause.txtTEXT

SLA: API Availability
- Provider guarantees 99.9% monthly uptime, excluding scheduled maintenance.
- Uptime measured as percentage of successful HTTP responses (2xx/3xx) to total requests.
- If uptime falls below 99.9%, Customer receives 10% service credit for the month.
- If uptime falls below 99.0%, Customer receives 25% service credit.
- Exclusions: Force majeure, DDoS attacks, customer-side misconfiguration.
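A clause like this is easy to turn into a function, which is a good way to check that the terms are unambiguous before anyone signs. The sketch below assumes the two tiers from the sample clause and treats a month with no traffic as no breach; both the function name and that zero-traffic rule are assumptions, not part of the clause.

```python
def service_credit_percent(good_requests: int, total_requests: int) -> int:
    """Credit owed under the sample clause: 10% below 99.9%, 25% below 99.0%."""
    if total_requests == 0:
        return 0  # assumption: no measured traffic means no breach
    uptime = 100.0 * good_requests / total_requests
    if uptime < 99.0:
        return 25
    if uptime < 99.9:
        return 10
    return 0


# Hypothetical months: (successful 2xx/3xx responses, total requests)
for good, total in [(99_950, 100_000), (99_500, 100_000), (98_000, 100_000)]:
    credit = service_credit_percent(good, total)
    print(f"{good}/{total} successful -> {credit}% credit")
```

Exclusions such as scheduled maintenance would have to be filtered out of both counts before calling the function.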

SLA vs SLO: The Rule of Thumb

Your SLO is the bar you hold yourself to. Your SLA is the bar your customer can hold you to. Always keep SLO stricter than SLA — that gap is your safety margin.

Production Insight

A common mistake is to set SLA based on competitor benchmarks without understanding your own infrastructure limits.

If your SLA demands 99.99% but your deployment is single-region, you'll eventually pay out.

Rule: only promise what you can measure and verify with auditable logs.

Key Takeaway

SLA = contract with teeth. Don't overpromise.

Leave headroom between SLA and SLO.

Test your SLA terms in failure scenarios before signing.

How Error Budgets Connect SLIs, SLOs, and SLAs

Error budget is the amount of unreliability your service is allowed, defined as 100% minus your SLO. For a 99.9% SLO, you have 0.1% error budget (about 43 minutes/month). As long as you haven't exhausted the budget, you're free to deploy new features. Once the budget is depleted, you freeze releases until reliability is restored.

This mechanism solves the classic tension between feature velocity and stability. Instead of arguing about whether to ship, you have a data-driven policy: if error budget remaining > 0, ship; if zero, fix.

Error budgets are usually tracked over a rolling window (30 days) to reflect recent performance. They can drain quickly: a single 10-minute full outage consumes about 23% of the monthly budget for a 99.9% SLO (10 of 43.2 minutes).

Real-world use: Teams at Google use error budgets to decide if they can launch new features or must focus on reliability. It's not about perfection; it's about knowing when to push and when to hold.

io/thecodeforge/error_budget.pyPYTHON

# Calculate remaining error budget for a service
class ErrorBudgetTracker:
    def __init__(self, slo_percent: float):
        if not 0 < slo_percent < 100:
            raise ValueError("SLO must be between 0 and 100 (exclusive)")
        self.slo = slo_percent / 100.0
        # Allowed failure fraction, e.g. 0.001 for a 99.9% SLO
        self.total_budget = 1.0 - self.slo

    def remaining(self, failures: int, total: int) -> float:
        """Return the fraction of the error budget still unspent (0.0 to 1.0)."""
        if total <= 0:
            return 1.0  # no traffic, so nothing has been consumed
        actual_unreliability = failures / total
        budget_consumed = actual_unreliability / self.total_budget
        return max(0.0, 1.0 - budget_consumed)

# Example: 99.9% SLO, 1,000,000 requests, 200 of them failed
tracker = ErrorBudgetTracker(99.9)
print(f"Remaining budget: {tracker.remaining(200, 1_000_000):.2%}")
# Output: Remaining budget: 80.00%

Output

Remaining budget: 80.00%

Mental Model: Error Budget as Bank Account

Each outage is a withdrawal from the account.
If you hit zero, your team goes into 'debt recovery' mode — no new features until balance is restored.
Can you carry unused budget over? Usually not: with a rolling window, budget only comes back as old failures age out. Some teams use a quarterly window instead.
Surplus budget is permission to innovate. It's not a sign of laziness; it's a resource.

Production Insight

If you never exhaust your error budget, your SLO is probably too loose.

If you always exhaust it, your SLO is too tight or your service is too unstable.

Rule: error budget should be consumed between 50-90% in a typical month to feel balanced.

Key Takeaway

Error budget is the currency of reliability engineering.

It turns subjective 'is it stable enough?' into a numeric yes/no.

Track it in a dashboard and alert when < 20% remains — don't wait for zero.

Common Pitfalls in Implementing SLI/SLO/SLA

Teams often dive into defining SLOs without first understanding their SLIs. That's putting the cart before the horse. Here are the three biggest mistakes:

Defining SLOs without data: You can't set a meaningful target unless you know your current baseline. Collect SLI data for at least two weeks first.
SLO too strict: 99.99% sounds great on a slide deck, but it means you can afford only 4.3 minutes of downtime per month. That's brutal unless you have redundant infrastructure.
SLA equals SLO: If your internal target is same as your contractual promise, you have zero room for surprise. Always make SLO tighter than SLA.

Another subtle pitfall: measuring SLIs at the wrong granularity. A single global SLI might hide regional failures. Always consider segmenting by geography, data center, or critical endpoint.
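To see how a global number can hide a regional failure, here is a minimal sketch that computes the success rate per segment. The region names and traffic mix are hypothetical.

```python
from collections import defaultdict


def success_rate_by_segment(events):
    """events: iterable of (segment, ok) pairs. Returns {segment: success ratio}."""
    totals = defaultdict(int)
    good = defaultdict(int)
    for segment, ok in events:
        totals[segment] += 1
        if ok:
            good[segment] += 1
    return {segment: good[segment] / totals[segment] for segment in totals}


# Hypothetical traffic: one large healthy region, one small degraded region
events = (
    [("us-east", True)] * 9_900
    + [("ap-south", True)] * 80
    + [("ap-south", False)] * 20
)
global_rate = sum(ok for _, ok in events) / len(events)
per_region = success_rate_by_segment(events)

print(f"global: {global_rate:.2%}")
for region, rate in per_region.items():
    print(f"{region}: {rate:.2%}")
```

The global figure stays close to target while one region fails one request in five, which is exactly the kind of problem an averaged SLI hides.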

The Single-Request Fallacy

Don't base your SLI on a single request every few seconds. Use rates over windows (e.g., 5-minute average) to smooth noise. Otherwise, a single timeout will falsely indicate a full outage.

Production Insight

I've seen teams set an aggressive SLO, then realize their monitoring system itself has 1% error rate.

The monitoring becomes the bottleneck — you can't measure reliability reliably.

Rule: ensure your observability pipeline's error rate is at least 10x lower than the error budget your SLO allows.

Key Takeaway

Measure before you commit. Set internal targets (SLO) stricter than external promises (SLA).

Segment your SLIs — don't average away problems.

The best reliability framework is useless if you can't measure it accurately.

Is Your SLI Good Enough?

If: SLI is based on server-side logs only
→ Use: Add client-side or synthetic data; server logs miss network and DNS failures.

If: SLI aggregates all endpoints together
→ Use: Segment by critical vs non-critical. A dashboard at 99.9% can hide a broken login endpoint.

If: SLI is collected from a single region
→ Use: If you serve multiple regions, measure per-region. Global averages mask regional degradation.

How to Calculate Burn Rate Before Your SLO Catches Fire

You don't wait for the monthly SLO report to find out you're bleeding error budget. By then, you've already lost. Burn rate tells you, in real time, how fast you're consuming your allowed failures relative to the SLO window. If your SLO is 99.9% over 30 days, you have 43.2 minutes of total downtime to spend. A burn rate of 1 means you'll hit zero exactly at the end of the window. A rate of 2 means you'll exhaust your budget in 15 days. Anything sustained above 1 means you run out early.

Calculate it by dividing your actual error budget consumption rate by the ideal consumption rate (equivalently, your observed error rate divided by the error rate the SLO allows). Treat it as a paging alert threshold, not a dashboard afterthought. Set a high burn rate alert (e.g., 2x over 1 hour) to catch cascading failures before a single deployment blows a month's worth of reliability. Low burn rate alerts (e.g., 1x over 6 hours) catch slow regressions. These are your early warning systems. Implement them before you need them.
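The burn rate arithmetic described here fits in a few lines. This is a sketch under the assumption that you can count bad and total events for the measurement period; the function names are illustrative.

```python
def burn_rate(bad_events: int, total_events: int, slo_percent: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    if total_events == 0:
        return 0.0
    allowed = (100.0 - slo_percent) / 100.0
    if allowed <= 0:
        raise ValueError("a 100% SLO leaves no error budget to burn")
    return (bad_events / total_events) / allowed


def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Days until a full budget is spent if the burn rate holds steady."""
    if rate <= 0:
        return float("inf")
    return window_days / rate


# Hypothetical hour of traffic: 20 failures out of 10,000 requests, 99.9% SLO
rate = burn_rate(20, 10_000, 99.9)
print(f"burn rate: {rate:.1f}x")
print(f"budget lasts: {days_to_exhaustion(rate):.1f} days")
```

A 0.2% error rate against a 0.1% allowance is a burn rate of 2, so a full 30-day budget lasts 15 days.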

BurnRateAlert.ymlYAML

# io.thecodeforge: devops tutorial
# Assumes slo_error_budget_consumed (hypothetical metric) is a counter of the
# fraction of the 30d error budget consumed, so its ideal rate is 1 per 30 days.

groups:
  - name: slo-burn-rate
    rules:
      - alert: HighBurnRate
        expr: |
          (
            rate(slo_error_budget_consumed[1h])
            /
            (1 / (30 * 24 * 3600))  # ideal per-second consumption for a 30d window
          ) > 2
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Burn rate > 2x for 5 minutes"
          description: "Will exhaust {{ $labels.slo }} budget in < 15 days"

      - alert: LowBurnRateSustained
        expr: |
          (
            rate(slo_error_budget_consumed[6h])
            /
            (1 / (30 * 24 * 3600))
          ) > 0.9
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Sustained low burn rate approaching budget limit"

Output

Alerts fire when burn rate exceeds thresholds. High burn rate: critical. Low burn rate sustained: warning.

Production Trap:

Burn rate alerts based on static thresholds fail during weekends or low-traffic periods. Tune your evaluation window to match traffic patterns or use multi-window, multi-burn-rate algorithms from Google's SRE workbook.

Key Takeaway

Burn rate > 1 means you're going to blow your SLO before the window ends. Alert on it, don't just report it.

Why Multi-Window, Multi-Burn-Rate Alerting Saves Your Weekend

A single burn rate alert window gives you false positives or misses entirely. Here's the fix: two windows, two thresholds. The short window (e.g., 1 hour) catches fast, catastrophic events. The long window (e.g., 6 hours) catches slow drifts from bad configs or resource leaks. You only alert when both windows exceed their respective burn rate thresholds. This eliminates the noise from brief traffic spikes or transient failures that self-heal. For example, a 5-minute burst of 503s during a deploy triggers the short window, but if the long window is clean, you skip the page. Your NOC thanks you. Implement this with Prometheus rules using two separate recording rules for burn rates over different time ranges, then combine them with an AND condition. This is how mature SRE teams filter out the 90% of alerts that don't matter. The math is simple: short window = 2x burn rate for 1 hour, long window = 1x burn rate for 6 hours. Any single failure mode that hits both simultaneously is real.

MultiWindowBurnRate.ymlYAML

# io.thecodeforge: devops tutorial
# slo_error_budget_consumed is a hypothetical counter: the fraction of the
# 30d error budget consumed. rate() is per-second, so normalise by 30d in seconds.

groups:
  - name: slo-burn-rate-multiwindow
    rules:
      # Recording rules for burn rates
      - record: job:burn_rate:1h
        expr: rate(slo_error_budget_consumed[1h]) / (1 / (30 * 24 * 3600))
      - record: job:burn_rate:6h
        expr: rate(slo_error_budget_consumed[6h]) / (1 / (30 * 24 * 3600))

      - alert: MultiWindowBurnRateExceeded
        expr: job:burn_rate:1h > 2 and job:burn_rate:6h > 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Real error budget burn detected across 1h and 6h windows"

Output

Alert fires only when both conditions are true. Single window spikes alone do not page.
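The AND condition is simple enough to reason about in plain code. Here is a minimal sketch of the paging decision, using the 2x and 1x thresholds from this section as defaults; the scenario names and burn rates are hypothetical.

```python
def should_page(
    short_burn: float,
    long_burn: float,
    short_threshold: float = 2.0,
    long_threshold: float = 1.0,
) -> bool:
    """Page only when both windows exceed their burn rate thresholds."""
    return short_burn > short_threshold and long_burn > long_threshold


# Hypothetical scenarios: (1h burn rate, 6h burn rate)
scenarios = {
    "deploy blip": (5.0, 0.4),    # short spike, long window clean
    "slow leak": (1.2, 1.1),      # long window hot, short window under threshold
    "real incident": (6.0, 1.8),  # both windows hot
}
for name, (short, long_) in scenarios.items():
    print(f"{name}: page={should_page(short, long_)}")
```

Note that the slow leak does not page under a strict AND rule, so it is worth pairing this with a lower-severity alert on the long window alone.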

Senior Shortcut:

Use the Google SRE Workbook's recommended paging thresholds for a 30-day SLO: a 14.4x burn rate over 1 hour (confirmed by a 5-minute window), or a 6x burn rate over 6 hours (confirmed by a 30-minute window). Adjust based on your error budget size.

Key Takeaway

Two windows, two thresholds, one AND condition. No more pager spam from transient blips.

How To SLA Your Way Out Of A Contractual Ambush

An SLA is a legal document, not a technical target. Your SLO is what you commit to internally; your SLA is what you promise a customer in writing. Never let the two match: set the SLA looser than the SLO so you have a buffer. If both are 99.9%, one bad month means you have broken a contract, and you pay penalties or lose the customer. If instead the SLO is 99.9% and the SLA is 99.5%, you have 0.4 percentage points of room before legal gets involved. This isn't being dishonest; it's being realistic. Your internal SLO is where you aim. Your SLA is the floor below which you promise compensation.

Write both into the contract explicitly: "Service Level Target: 99.9% monthly. Service Level Commitment: 99.5% monthly." Also define the measurement window, the exclusions (scheduled maintenance, customer-induced failures), and the credit calculation. A tiered schedule is common, for example credits of 5% to 15% of monthly fees depending on how far availability falls below the commitment. Without these definitions, your legal team has only your monitoring data to rely on, and you may be asked to defend it in a deposition. Don't learn this lesson in a courtroom.

SLAContractTerms.ymlYAML

# io.thecodeforge: devops tutorial

contract_terms:
  service_level_target: 99.9%  # internal SLO, not guaranteed
  service_level_commitment: 99.5%  # contractual SLA, guaranteed
  measurement_window: "calendar_month"
  exclusions:
    - scheduled_maintenance: "notified 72h in advance"
    - customer_induced: "rate limiting, misconfiguration"
    - force_majeure: "acts of god, war, etc."
  credit_calculation:
    - below_99.5_above_99.0: "5% monthly fee credit"
    - below_99.0_above_98.0: "10% monthly fee credit"
    - below_98.0: "15% monthly fee credit or 1 month free"
  dispute_resolution: "arbitration in 30 days"

Output

SLA terms define the legal boundary. SLO is internal. They are not the same number.
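Exclusions change the number you report, so it helps to compute availability the way the contract defines it. The sketch below assumes time-based measurement, removes excluded minutes from the measured window, and maps the result onto the sample credit tiers. How the exact tier boundaries (99.5%, 99.0%, 98.0%) are treated is an assumption, since the sample terms do not say.

```python
# (availability floor in percent, credit percent), taken from the sample terms
CREDIT_TIERS = [(99.5, 0), (99.0, 5), (98.0, 10)]
MAX_CREDIT = 15  # applies below the lowest floor


def monthly_availability(
    window_min: float, unplanned_down_min: float, excluded_min: float = 0.0
) -> float:
    """Availability in percent, with excluded minutes removed from the window."""
    measured = window_min - excluded_min
    if measured <= 0:
        raise ValueError("nothing left to measure after exclusions")
    return 100.0 * (measured - unplanned_down_min) / measured


def credit_percent(availability: float) -> int:
    """Credit owed for a given availability; boundary values count as met."""
    for floor, credit in CREDIT_TIERS:
        if availability >= floor:
            return credit
    return MAX_CREDIT


# Hypothetical 30-day month: 2h scheduled maintenance, 5h unplanned downtime
availability = monthly_availability(43_200, unplanned_down_min=300, excluded_min=120)
print(f"availability: {availability:.2f}%")
print(f"credit owed: {credit_percent(availability)}%")
```

Five hours of unplanned downtime lands in the 5% tier here; without the maintenance exclusion the same month would be measured against a slightly larger window.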

Production Trap:

If your SLA and SLO are identical, you have zero margin for error. A single bad deploy month triggers legal liability. Always keep the SLA looser than the SLO; the 99.9% target and 99.5% commitment above leave 0.4 percentage points of padding.

Key Takeaway

Your SLA is the floor you guarantee in court. Your SLO is the ceiling you chase in engineering. Never let them touch.

The Mental Model That Stops You From Wasting Time on the Wrong Metrics

Stop cargo-culting dashboards. SLIs, SLOs, and SLAs are a decision-making hierarchy, not a compliance checklist. Here's the why: You need to know what good looks like before you can promise it, and you need to know what you promised before you get sued.

The mental model is simple: SLIs are raw measurement. They answer "is it working?". SLOs are target guardrails. They answer "are we okay?". SLAs are contractual teeth. They answer "how much do we owe them when we screw up?". You pick SLIs that actually reflect user happiness (latency, error rate), not CPU. You set SLOs that give you room to ship without violating a promise. You let SLAs be driven by business risk, not engineering ego.

Most teams invert this pyramid. They start with an SLA because legal said so, then work backwards to guess at an SLO. That's how you end up measuring p99 latency on a batch job that users don't even hit. Fix your model first.

DecisionHierarchy.ymlYAML

# io.thecodeforge: devops tutorial

# Mental model: measurement -> target -> contract
sli_definition:
  service: "user-checkout"
  metric: "latency_p99"
  measurement: "from ingress to 200 response"

slo_target:
  window: "30d rolling"
  threshold: "< 500ms p99"
  error_budget: "0.1% (99.9% target)"

sla_clause:
  penalty: "10% credit for each 0.1% below"
  max_penalty: "30% monthly"
  excludes: "scheduled maintenance, DDoS"

Output

No direct output; this is a config model.

Illustrative run, assuming a validator CLI named sli-checker (hypothetical):

Run: sli-checker --config DecisionHierarchy.yml

Result: Valid. Thresholds in range.

Production Trap:

Don't let marketing or legal define your SLIs. They will pick something like 'uptime' which hides slow, broken responses. Insist on latency and error rate as your base SLIs.

Key Takeaway

SLI says what, SLO says how good, SLA says how much it costs when you fail. Never skip the first two.

Real-World Example: How an E-Commerce App Killed Its Black Friday SLO in 12 Minutes

Here's what happens when theory meets a production clusterfuck. Your app has three critical services: product search, checkout, and payment. Each gets its own SLI/SLO. The mistake most teams make is rolling everything into a single "app" SLO. That's how you miss the payment service burning at 3x the rate while search is fine.

Set separate SLOs. Product search: p95 latency under 200ms, 99.5% success. Checkout: p99 latency under 1s, 99.9% success (because that's where users bail). Payment: 100% success target, 5s timeout (because banks are slow); in practice set this just under 100%, since a literal 100% target leaves no error budget to burn. Now run Black Friday. Payment starts timing out due to upstream bank latency. Your 5s threshold burns budget fast. You see the burn rate alert at 14x. You have 12 minutes before you violate the SLO.

Because you have separate SLOs, you route traffic away from the failing payment provider in 2 minutes. You don't take down the whole site. You don't email customers about a 'site-wide issue'. You just fail over to another provider. Your SLO survives. Your SLA survives. Your bonus survives. That's why you model per-service, not per-app.

EcommerceSloConfig.ymlYAML

# io.thecodeforge: devops tutorial

services:
  product-search:
    sli:
      latency_p95: "< 200ms"
      success_rate: "> 99.5%"
    slo:
      window: "7d"
      burn_rate_warn: 2.5
      burn_rate_crit: 10

  checkout:
    sli:
      latency_p99: "< 1s"
      success_rate: "> 99.9%"
    slo:
      window: "30d"
      burn_rate_warn: 1.5
      burn_rate_crit: 8

  payment:
    sli:
      latency_p99: "< 5s"
      success_rate: "100%"  # aspirational; burn rate math needs a target below 100%
    slo:
      window: "7d"
      burn_rate_warn: 2
      burn_rate_crit: 14

Output

$ sli-checker --config EcommerceSloConfig.yml --burn-rate

Service: payment

Current burn rate: 14x (critical)

Budget remaining: 1.2%

Time to SLO violation: 12m 34s

Action: FAILOVER IMMEDIATE
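The time-to-violation figure comes from two inputs: how much budget is left and how fast it is burning. A minimal sketch of that calculation follows, with hypothetical per-service numbers that are not taken from the run above.

```python
from datetime import timedelta


def time_to_exhaustion(
    budget_remaining: float, burn_rate: float, window: timedelta
) -> timedelta:
    """Time until the budget hits zero if the current burn rate holds.

    budget_remaining is a fraction (0.0 to 1.0) of the window's error budget.
    At burn rate 1, a full budget lasts exactly one window.
    """
    if burn_rate <= 0:
        return timedelta.max
    return window * (budget_remaining / burn_rate)


# Hypothetical state per service: (budget remaining, burn rate, SLO window)
services = {
    "product-search": (0.90, 1.0, timedelta(days=7)),
    "checkout": (0.50, 8.0, timedelta(days=30)),
    "payment": (0.25, 10.0, timedelta(days=7)),
}
remaining_time = {
    name: time_to_exhaustion(*state) for name, state in services.items()
}
most_urgent = min(remaining_time, key=remaining_time.get)

for name, left in remaining_time.items():
    print(f"{name}: budget gone in {left}")
print(f"most urgent: {most_urgent}")
```

Ranking services by time to exhaustion tells you where to fail over first, which a single app-wide SLO cannot do.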

Senior Shortcut:

Never share a single 'app SLO' across services. If one service burns budget, you lose signal for the whole system. Separate SLOs let you fail fast and local.

Key Takeaway

Per-service SLOs with per-service burn rates. One service burns, fail it over. Don't take the whole site down.

Tools and Technologies for Monitoring and Managing SRE Metrics

SLIs, SLOs, and SLAs are useless without the tools to measure them. Observability platforms like Datadog, New Relic, and Google Cloud Monitoring (formerly Stackdriver) provide ready-made SLI dashboards and error budget tracking. Datadog's SLO widget shows status and remaining error budget against your target. New Relic's service levels feature lets you define a numeric target and track compliance against it. Google Cloud's service monitoring ties directly to GCP services and can generate availability and latency SLIs for supported services automatically. For SLOs as code, dedicated tools like Nobl9 or Sloth codify SLO definitions and generate alerting rules, while chaos-engineering tools such as Gremlin help verify that the targets hold under failure. The WHY: manual calculation leads to stale data and missed breaches. Tooling automates the math and surfaces warnings before your weekend is ruined. Pick a platform that integrates with your existing stack; no one wins by adding yet another dashboard nobody watches.

datadog-slo.ymlYAML

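# io.thecodeforge: devops tutorial

# Illustrative SLO definition in the style of a Datadog metric-based SLO.
# The schema and metric names are hypothetical; check Datadog's API or
# Terraform provider docs for the real field names.
api_version: v2
kind: SLO
metadata:
  name: api-latency-p99
spec:
  name: API Latency P99 < 500ms
  type: metric
  description: 99% of requests complete in under 500ms over 30 days
  thresholds:
    - target: 99.0
      time_window: 30d
  query:
    # good events: requests that finished in under 500ms
    numerator: "sum:api.requests.under_500ms{service:api}.as_count()"
    # valid events: all requests
    denominator: "sum:api.requests.total{service:api}.as_count()"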

Output

Creates SLO monitoring p99 latency with 99% target over 30 days.

Production Trap:

Setting a 99% target on a metric you do not directly control (e.g., third-party API) leads to blame instead of action.

Key Takeaway

Pick one platform that integrates with your stack; avoid dashboard sprawl.

Involving Stakeholders in SLO Negotiation

SLOs built in isolation by SREs get ignored. Stakeholders—product managers, business owners, and engineering leads—define customer expectations. WHY: You need shared ownership of reliability. Start by asking stakeholders: “What makes a request successful from the user’s view?” That answer yields a raw SLI, like “page load under 2 seconds.” Then negotiate: is 99.9% uptime worth the cost of over-provisioning? Use error budgets as the currency of this negotiation. Frame it as trade-offs: raising the SLO to 99.99% means fewer features shipped. Show them a burn rate chart: “If we hit this velocity, we stop deploys for 6 hours.” This makes reliability a business decision, not an engineering ultimatum. The outcome: a contract both sides respect. Without stakeholder buy-in, your SLO becomes a secret metric—and secret metrics never save weekends.

stakeholder-slo.ymlYAML

# io.thecodeforge: devops tutorial

# Example SLO proposal to share with stakeholders
slo_name: Checkout Page Availability
sli_definition: |
  95th percentile page load < 3s, status 200
proposed_targets:
  - tier: gold
    target: 99.95%
    cost: high (4x replicas)
    error_budget: 0.05% (approx 22 minutes/month)
  - tier: silver
    target: 99.5%
    cost: medium (2x replicas)
    error_budget: 0.5% (approx 216 minutes/month)

Output

Shared as a document for quarterly business review with product leads.

Stakeholder Trap:

Avoid asking for a number out of nowhere. Show concrete trade-offs between SLO target and deployment velocity.

Key Takeaway

SLOs are business decisions; negotiate them with product owners, not just engineers.

Balancing Ambition and Realism in SLO Targets

Setting an SLO at 99.999% sounds heroic but destroys your engineering velocity. The WHY: higher SLOs consume error budget faster—every tiny outage burns a larger fraction. Realism means knowing your platform’s baseline: start by measuring current SLIs over 30 days. If your actual p99 latency is 800ms, a 99.9% SLO at 200ms is fantasy. Ambition means setting targets slightly above your baseline to drive improvements, but not so high that the error budget is exhausted by routine deploys. A balanced SLO: 99.5% for internal APIs, 99.9% for customer-facing endpoints. Use the burn rate formula: if you consume >10% of error budget in 1 hour, your target is too aggressive. Review quarterly—as reliability improves, tighten the SLO. The rule: your SLO should be a stretch, not a mirage.

slo-balance-check.ymlYAML

# io.thecodeforge: devops tutorial

# Check if SLO is realistic vs current baseline
current_sli: 0.9980  # 99.8% over 30 days
proposed_slo: 0.9990  # 99.9%
error_budget_window: 30_days  # 43,200 minutes
total_budget_minutes: 43.2  # 0.1% of window
hourly_burn_allowed: 4.32  # 10% of budget per hour is the alarm threshold
current_hourly_error: 0.12  # 0.2% of 60 minutes at the current baseline
burn_rate: 2.0  # 0.2% observed / 0.1% allowed

Output

Alert: Proposed SLO is unrealistic. Current error rate would exhaust the budget in 15 days.
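Before proposing a target, it helps to run the measured baseline against a few candidates and see which ones the service could sustain today. This is a sketch with a hypothetical 99.8% baseline; the function name is illustrative.

```python
def burn_against_target(current_sli: float, proposed_slo: float) -> float:
    """Burn rate you would see today if the proposed SLO were already in force."""
    allowed = 1.0 - proposed_slo
    if allowed <= 0:
        raise ValueError("target must be below 100%")
    return (1.0 - current_sli) / allowed


baseline = 0.998  # measured SLI over the last 30 days (hypothetical)
for target in (0.99, 0.995, 0.999, 0.9999):
    rate = burn_against_target(baseline, target)
    if rate <= 1.0:
        verdict = "sustainable today"
    else:
        verdict = f"budget gone in about {30 / rate:.1f} days"
    print(f"{target:.2%} target: burn rate {rate:.1f} ({verdict})")
```

Any candidate with a burn rate above 1 needs reliability work before it can be adopted, and the size of that number tells you how much.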

Production Trap:

A 99.999% SLO on a single-cloud deployment without redundancy is a recipe for constant pager alerts.

Key Takeaway

Your SLO should sit a modest step above your current baseline (for example, from 99.8% to 99.9%), not chase perfection.

● Production incident · POST-MORTEM · severity: high

The 99.99% Uptime That Lost a Million-Dollar Client

Symptom

Support tickets spike, users report 'service unavailable' but monitoring shows server uptime at 99.99%.

Assumption

If the server is up, the service is healthy. Server uptime equals user-facing availability.

Root cause

SLI defined as server uptime, not user-visible availability. The SLO was met on paper but violated in practice.

Fix

Define the SLI as the proportion of user requests that complete successfully (e.g., HTTP 2xx/3xx responses divided by total requests). Use synthetic probes or real-user monitoring.

Key lesson

Your SLI must reflect what your customer experiences, not what your infrastructure reports.
An SLO that doesn't match the real user journey is a ticking time bomb.
Always map SLIs to customer-facing metrics before committing to an SLA.
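The gap between the two definitions is easy to reproduce. The sketch below compares a host-level uptime figure with a request-based availability figure; all data is hypothetical and only meant to show how the numbers can diverge.

```python
def server_uptime(health_checks: list[bool]) -> float:
    """Share of infrastructure health checks that passed."""
    return sum(health_checks) / len(health_checks)


def request_availability(statuses: list[int]) -> float:
    """Share of user requests answered with a 2xx or 3xx status."""
    good = sum(1 for status in statuses if 200 <= status < 400)
    return good / len(statuses)


# Hypothetical month: hosts almost always answer their health check,
# but a broken dependency makes 5% of user requests fail with 503
health_checks = [True] * 9_999 + [False]
user_requests = [200] * 950 + [503] * 50

print(f"server uptime: {server_uptime(health_checks):.2%}")
print(f"request availability: {request_availability(user_requests):.2%}")
```

Only the second number reflects what the customer experienced, so that is the one an SLA should be measured against.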

Production debug guide: symptom to action. Diagnose why your reliability targets are missed.

Symptom · 01

Error budget burning faster than expected

→

Fix

Check SLI measurement source: Is it based on server logs, client-side data, or synthetic checks? Compare against actual user complaints.

Symptom · 02

Latency SLO breached but p50 looks fine

→

Fix

Look at p99 and p99.9. Latency SLOs are usually defined on high percentiles, so a healthy median can hide a long tail of slow requests, which is the silent killer.
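To see how a healthy median can coexist with a breached tail, here is a small sketch using the nearest-rank percentile method. The `percentile` helper and the latency sample are hypothetical; production systems usually compute percentiles from histograms rather than raw samples.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical latencies in ms: 97% of requests are fast, 3% hit a slow path.
latencies_ms = [120.0] * 970 + [2500.0] * 30

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(f"p50={p50:.0f}ms p99={p99:.0f}ms")
# prints "p50=120ms p99=2500ms"
```

A dashboard showing only the median would report 120ms here, while a p99 target of 500ms is missed by a wide margin.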

Symptom · 03

SLA penalty triggered but no obvious outage

→

Fix

Inspect the SLA definition: does it include planned maintenance windows? Are there grace periods for non-critical failures?

Symptom · 04

SLI shows 100% uptime but users report errors

→

Fix

Your SLI is blind. Implement end-to-end synthetic transactions that mimic user flows. Check for silent failures like corrupted data.

★ Quick Debug Cheat Sheet: SLI/SLO/SLA · Fast commands and checks when reliability targets are at risk

Error budget exhausted: what now?

Immediate action

Pause all non-essential deploys. Start a war room to identify root cause.

Commands

prometheus: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

Check recent changes: git log --oneline --since="24 hours ago"

Fix now

Revert the last deployment if error rate spiked immediately after.
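The error ratio returned by the Prometheus query can be turned into a burn rate to make the revert decision less subjective. The sketch below is an assumption-laden illustration: `burn_rate` and `spiked_after_deploy` are hypothetical helpers, and the 2x comparison factor is an arbitrary example threshold, not a recommendation from the source.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than sustainable the error budget is being spent.

    A value of 1.0 means the budget lasts exactly one SLO window at this error ratio.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if not 0.0 <= error_ratio <= 1.0:
        raise ValueError("error_ratio must be between 0 and 1")
    return error_ratio / (1.0 - slo)


def spiked_after_deploy(before: float, after: float, slo: float, factor: float = 2.0) -> bool:
    """True when the post-deploy burn rate is unsustainable (> 1x) and at least
    `factor` times the pre-deploy burn rate. `factor` is an illustrative threshold."""
    rate_before = burn_rate(before, slo)
    rate_after = burn_rate(after, slo)
    return rate_after > 1.0 and rate_after >= factor * rate_before


# Hypothetical readings of the 5xx ratio before and after the last deploy.
print(f"burn rate after deploy: {burn_rate(0.0144, 0.999):.1f}x")
print(f"revert candidate: {spiked_after_deploy(0.0005, 0.0144, 0.999)}")
```

Feed it the query result from a window ending just before the deploy and one starting just after; a large jump points at the deployment rather than at background noise.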

Latency p99 SLO violation

SLA breach notice received

SLI vs SLO vs SLA

Aspect	SLI (Service Level Indicator)	SLO (Service Level Objective)	SLA (Service Level Agreement)
Definition	A raw measurement of service performance (e.g., latency, error rate)	A target value for an SLI over a time window	A contractual promise to a customer with penalties
Owner	Engineering / DevOps team	Engineering team (internal)	Legal / Business / Customer team
Example	p99 response time = 300ms	99% of requests under 500ms over 30 days	Uptime >= 99.9% per month, credits if breached
Consequences if missed	You see red on dashboard	Stop feature releases, fix reliability	Pay financial penalties or lose customer trust
Time window	Real-time or short rolling window	Typically 28-30 days rolling	Monthly or quarterly, often fixed calendar
Relation	The raw data	The goal based on SLI	The promise based on SLO

Key takeaways

SLI: measure what users experience, not what servers report.

SLO: set a data-backed target, leave headroom above your SLA.

SLA: only promise what you can measure and enforce.

Error budget: the permission to deploy. Track it daily.

Start simple: one SLI, one SLO, one SLA. Iterate as you learn.

Common mistakes to avoid

4 patterns

Setting SLO before measuring SLI

Symptom

Your SLO is unachievable or too lenient because it's based on intuition, not data. You'll either miss it constantly or never challenge the team.

Fix

Instrument your service to collect SLIs for at least two weeks. Compute p99 latency, error rate, and availability baselines. Then set an SLO that's slightly stricter than the baseline.

Using server-side uptime as the only SLI

Symptom

Your dashboards show 99.99% uptime, but users complain of errors. You miss the real problem because your SLI doesn't reflect user experience.

Fix

Include client-side metrics (RUM) or synthetic probes. Define the SLI as the proportion of successful user-facing requests, not just server health.

Making SLA identical to SLO

Symptom

No headroom: any incident that pushes you below the SLO also breaches the SLA and triggers penalties. The buffer between your internal target and your contractual commitment is zero.

Fix

Set the SLO 0.05 to 0.1 percentage points tighter than the SLA. For example, internal SLO at 99.95%, external SLA at 99.9%. That gives you a 0.05-point buffer (about 22 minutes over a 30-day month).
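The buffer arithmetic can be checked with a short sketch. The function names are hypothetical, and the calculation assumes a 30-day window and a time-based availability target.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(target: float, window_days: int = 30) -> float:
    """Minutes of full downtime a target such as 0.999 permits in the window."""
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    return (1.0 - target) * window_days * MINUTES_PER_DAY


def slo_sla_buffer_minutes(slo: float, sla: float, window_days: int = 30) -> float:
    """Downtime between breaching the internal SLO and breaching the SLA."""
    if slo < sla:
        raise ValueError("the SLO should be at least as strict as the SLA")
    return allowed_downtime_minutes(sla, window_days) - allowed_downtime_minutes(slo, window_days)


print(f"SLA 99.9%  allows {allowed_downtime_minutes(0.999):.1f} min")
print(f"SLO 99.95% allows {allowed_downtime_minutes(0.9995):.1f} min")
print(f"buffer: {slo_sla_buffer_minutes(0.9995, 0.999):.1f} min")
```

For these targets the SLA permits 43.2 minutes, the SLO permits 21.6 minutes, and the difference of 21.6 minutes is the warning margin before penalties apply.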

Not segmenting SLIs by criticality

Symptom

A broken checkout endpoint is hidden in a global 'all endpoints' average. You miss that your revenue-critical flow is failing.

Fix

Define separate SLIs and SLOs for critical user journeys (login, checkout, search) rather than a single service-level metric.
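The masking effect of a global average is easy to demonstrate. The sketch below uses invented traffic numbers and a hypothetical `availability_by_journey` helper; the journey names mirror the examples above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def availability_by_journey(requests: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Map each journey name to its share of successful requests."""
    good = defaultdict(int)
    total = defaultdict(int)
    for journey, ok in requests:
        total[journey] += 1
        good[journey] += int(ok)
    return {journey: good[journey] / total[journey] for journey in total}


# Hypothetical traffic: search dominates volume, checkout is failing badly.
traffic = (
    [("search", True)] * 9_000
    + [("login", True)] * 900
    + [("checkout", True)] * 60
    + [("checkout", False)] * 40
)

overall = sum(ok for _, ok in traffic) / len(traffic)
print(f"global: {overall:.3f}")
for journey, value in availability_by_journey(traffic).items():
    print(f"{journey}: {value:.3f}")
```

The global figure comes out at 0.996 and looks healthy, while checkout alone sits at 0.600. A per-journey SLO would catch this; the blended one would not.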

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · JUNIOR

What is the difference between SLI, SLO, and SLA?

Q02 · SENIOR

How do you decide the right SLO for a new service?

Q03 · SENIOR

Your error budget is depleted. What do you do?

Q04 · SENIOR

Why is it risky to use a single global SLI?

Q01 of 04 · JUNIOR

What is the difference between SLI, SLO, and SLA?

ANSWER

SLI is a specific metric like latency or error rate that's measured. SLO is a target value for an SLI over a time window, e.g., 99% of requests under 500ms. SLA is a contractual commitment to a customer, often with financial penalties. The key distinction: SLI is the data, SLO is the goal, SLA is the promise.

