Senior 5 min · March 05, 2026

Availability vs Reliability — Why 5 Nines Broke Checkout

TCP health checks passed while 10% of users hit checkout errors.

Naren · Founder
Quick Answer
  • Availability: Is the system responding? Measured by uptime and nines.
  • Reliability: Is the system returning correct results? Measured by error rate and correctness.
  • Performance insight: 99.99% availability allows 52.6 minutes of downtime per year; reliability failures often hide within that window.
  • Production insight: A system can be 99.999% available but still corrupt data silently — availability doesn't guarantee correctness.
  • Biggest mistake: Treating availability and reliability as interchangeable — they optimize for different failure modes.
Plain-English First

Imagine a vending machine in your school hallway. Availability is whether the machine is ON and ready when you walk up to it — does it respond? Reliability is whether it actually gives you the right snack every time without jamming or eating your money. A machine can be 'on' (available) but still mess up your order (unreliable). You want both — a machine that's always on AND always gets it right. That's exactly what engineers mean when they talk about these two ideas in software systems.

Every time you open Netflix, tap 'Pay' on your phone, or check your bank balance, you're silently depending on someone's backend system to be up and working correctly. When those systems fail — even for seconds — real money is lost, real users churn, and real engineers get paged at 3am. Availability and reliability are the two most fundamental promises a system makes to its users, and understanding the difference between them is what separates junior engineers from architects who design systems that actually survive the real world.

The problem is that most teams treat availability and reliability as the same thing and optimize for only one. They slap a load balancer in front of their app, call it 'highly available', and ship it — only to discover their distributed system now silently returns wrong data under load, or drops 0.3% of transactions without anyone noticing for weeks. High availability without reliability is a liar's guarantee. Your system is 'up', but it's quietly betraying your users.

By the end of this article you'll be able to calculate availability from nines, explain the difference between availability and reliability in a system design interview without freezing, identify the architectural patterns that serve each goal, and spot the common trade-offs that teams get wrong when they chase one metric at the expense of the other. Let's build the mental model from the ground up.

What Is Availability?

Availability measures whether a system is up and reachable. It's a binary property: the system either responds to requests or it doesn't. Engineers track availability as a percentage of uptime over a given period — typically a month or a year.

Availability = (Total Time – Downtime) / Total Time × 100%

Nines are a shorthand: 99% (two nines) means ~3.65 days of downtime per year. 99.999% (five nines) means ~5.26 minutes. Each extra nine cuts the downtime budget by a factor of ten, and typically costs far more than the previous one in infrastructure and operational complexity.

Many production systems aim for four nines (99.99%, or 52.6 minutes/year) as a baseline. Five nines is the gold standard for critical financial or healthcare systems. Claims beyond that are usually marketing fluff: achieving six nines (99.9999%, or 31.5 seconds/year) requires fully redundant, geographically distributed infrastructure and near-instant failover.

io/thecodeforge/AvailabilityCalculator.java (Java)
package io.thecodeforge;

import java.time.Duration;

public class AvailabilityCalculator {

    /**
     * Computes availability percentage for a given period and downtime.
     * @param totalPeriodMs total observation period in milliseconds (e.g., 365 days)
     * @param downtimeMs total downtime in milliseconds within that period
     * @return availability as a percentage between 0 and 100
     */
    public static double availability(long totalPeriodMs, long downtimeMs) {
        if (totalPeriodMs <= 0) throw new IllegalArgumentException("totalPeriodMs must be > 0");
        // Reject inputs that would produce a percentage outside 0..100.
        if (downtimeMs < 0 || downtimeMs > totalPeriodMs) {
            throw new IllegalArgumentException("downtimeMs must be between 0 and totalPeriodMs");
        }
        long uptimeMs = totalPeriodMs - downtimeMs;
        return (double) uptimeMs / totalPeriodMs * 100.0;
    }

    public static void main(String[] args) {
        long yearMs = Duration.ofDays(365).toMillis();
        long fiveMinMs = Duration.ofMinutes(5).toMillis();
        double avail = availability(yearMs, fiveMinMs);
        System.out.printf("5 minutes downtime per year gives %.5f%% availability%n", avail);
    }
}
Output
5 minutes downtime per year gives 99.99905% availability
Mental Model: The Light Switch
Think of availability like a light switch. When you flip it, the light either turns on or it doesn't. That's availability. It doesn't care if the light is dim or flickering — just whether it's on. In production, a server can be 'on' but returning wrong data — that's where reliability comes in.
Production Insight
A TCP health check will pass even if the application is deadlocked.
To truly measure availability, use application-layer health endpoints.
Rule: every health check must exercise at least one downstream dependency.
Key Takeaway
Availability = uptime percentage.
Nines are a shorthand for downtime budgets.
Remember: availability tells you nothing about correctness.
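The difference between a TCP check and an application-layer check can be sketched in a few lines. This is a minimal illustration, not a framework API: `DeepHealthCheck`, its `register` method, and the probe names are all hypothetical, and the probes here are stand-ins for real calls such as a database query or a cache read.

```java
import java.time.Duration;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical helper for illustration; not part of any framework.
final class DeepHealthCheck {
    private final Map<String, BooleanSupplier> probes = new LinkedHashMap<>();
    private final Duration timeout;

    DeepHealthCheck(Duration timeout) {
        this.timeout = timeout;
    }

    /** Registers a named probe, e.g. "run a trivial query" or "read a known cache key". */
    DeepHealthCheck register(String name, BooleanSupplier probe) {
        probes.put(name, probe);
        return this;
    }

    /** Runs every probe. A probe that throws or exceeds the timeout counts as failed. */
    Map<String, Boolean> run() {
        Map<String, Boolean> results = new LinkedHashMap<>();
        for (Map.Entry<String, BooleanSupplier> entry : probes.entrySet()) {
            BooleanSupplier probe = entry.getValue();
            boolean ok;
            try {
                // Run off-thread so a hung dependency cannot hang the health check itself.
                ok = CompletableFuture.supplyAsync(() -> probe.getAsBoolean())
                        .get(timeout.toMillis(), TimeUnit.MILLISECONDS);
            } catch (Exception e) {
                ok = false; // exception, timeout, or interruption: report unhealthy
            }
            results.put(entry.getKey(), ok);
        }
        return results;
    }

    boolean isHealthy() {
        return !run().containsValue(false);
    }

    public static void main(String[] args) {
        DeepHealthCheck check = new DeepHealthCheck(Duration.ofMillis(500))
                .register("database", () -> true)
                .register("cache", () -> { throw new IllegalStateException("deserialisation failed"); });
        System.out.println(check.run()); // prints {database=true, cache=false}
    }
}
```

A process in this state would still accept TCP connections, so a port-level check would report it healthy. The per-probe timeout matters as much as the probe itself: a deadlocked dependency never throws, it just never returns.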

What Is Reliability?

Reliability measures whether the system produces the correct output. A system can be 100% available — responding to every request — yet be 0% reliable if every response is wrong. Reliability is probabilistic: we usually talk about the probability that the system returns the correct result for a given request.

Common reliability metrics
  • Error rate: ratio of failed requests to total requests.
  • Latency distribution: tail latencies (p99, p999) matter more than averages.
  • Data integrity: checksum mismatches, corruption rates.
  • Correctness under failure: tolerance of Byzantine faults.

Reliability is harder to guarantee than availability because it requires ensuring every component in the request path behaves correctly under all conditions — including partial failures, network partitions, and concurrent modifications.

Mental Model: The Vending Machine
  • Available: machine is on, touchscreen responsive.
  • Reliable: pressing A3 gives you a Snickers, not a bag of chips or nothing.
  • An available but unreliable machine eats your money — users hate it.
  • A reliable but unavailable machine sits dark — users can't use it.
  • Production systems must be both: available to accept traffic, reliable to serve correct data.
Production Insight
Error rate SLIs can mask silent data corruption — a request may return HTTP 200 with wrong data.
Instrument every data path with checksums and validation to catch silent failures.
Rule: never trust an optimistic error rate; always measure correctness with synthetic tests.
Key Takeaway
Reliability = correctness under load.
It's probabilistic, not binary.
Measure error rates AND data integrity — they're not the same thing.
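One way to catch the "HTTP 200 with wrong data" case is to store a checksum next to each payload and verify it on every read. The sketch below assumes a hypothetical `ChecksummedRecord` wrapper and uses the JDK's `java.util.zip.CRC32`; CRC32 detects accidental corruption, not deliberate tampering, so treat it as an illustration of the idea and not a security control.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical wrapper for illustration: a payload plus the checksum taken at write time.
final class ChecksummedRecord {
    final byte[] payload;
    final long checksum;

    private ChecksummedRecord(byte[] payload, long checksum) {
        this.payload = payload;
        this.checksum = checksum;
    }

    static long crc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    /** Called on the write path: compute the checksum once, store it with the data. */
    static ChecksummedRecord of(String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        return new ChecksummedRecord(bytes, crc32(bytes));
    }

    /** Called on the read path: fail loudly instead of returning corrupt data. */
    String verifiedValue() {
        if (crc32(payload) != checksum) {
            throw new IllegalStateException("checksum mismatch: payload corrupted");
        }
        return new String(payload, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ChecksummedRecord record = ChecksummedRecord.of("order-42:total=1999");
        System.out.println(record.verifiedValue());
        record.payload[0] ^= 0x01; // simulate a single flipped bit in storage
        try {
            record.verifiedValue();
        } catch (IllegalStateException e) {
            System.out.println("detected: " + e.getMessage());
        }
    }
}
```

Counting these mismatches gives you a data-integrity SLI that is independent of the request error rate.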

How Availability and Reliability Relate — But Differ

Availability and reliability are two orthogonal axes. A system can be available and reliable (the happy path), unavailable and unreliable (off, or broken when it does respond), available but unreliable (silently corrupt), or unavailable but reliable (correct data but inaccessible).

In distributed systems, they interact
  • CAP theorem: Network partitions force a trade-off between availability and consistency (a form of reliability).
  • Circuit breakers: When a dependency is unreliable, you stop calling it and fail fast or return a cached or degraded response. You give up full functionality for that feature to protect the correctness and capacity of the rest of the system.
  • Retries: They improve reliability by recovering from transient failures, but too many retries can degrade availability by amplifying load on a struggling dependency (a retry storm).
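The retry trade-off above is usually managed by bounding the number of attempts and spreading them out. Here is a minimal sketch of capped exponential backoff with full jitter; `BoundedRetry` and its parameters are hypothetical names chosen for this example, and it assumes the operation is safe to repeat.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper for illustration. Only use with operations that are safe to repeat.
final class BoundedRetry {

    static <T> T call(Callable<T> operation, int maxAttempts, long baseDelayMs, long maxDelayMs)
            throws Exception {
        if (maxAttempts < 1 || baseDelayMs < 0 || maxDelayMs < 0) {
            throw new IllegalArgumentException("maxAttempts must be >= 1 and delays must be >= 0");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.call();
            } catch (Exception e) {
                last = e;
                if (attempt == maxAttempts) {
                    break; // budget exhausted: surface the failure instead of retrying forever
                }
                // Exponential growth (base, 2x, 4x, ...) capped at maxDelayMs.
                long exponential = baseDelayMs << Math.min(attempt - 1, 20);
                long cap = Math.min(maxDelayMs, exponential);
                // Full jitter: a random delay in [0, cap] so clients do not retry in lockstep.
                Thread.sleep(ThreadLocalRandom.current().nextLong(cap + 1));
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        String result = call(() -> {
            if (++calls[0] < 3) throw new IllegalStateException("transient failure");
            return "ok";
        }, 5, 10, 100);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

The attempt cap protects availability; the jitter keeps many clients from hitting a recovering dependency at the same instant.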

Senior engineers design systems with explicit availability targets and reliability targets — and they know which one to sacrifice when something breaks.

The most expensive production incidents often occur when teams optimised for availability at the expense of reliability: the system stayed up, but served corrupted data to thousands of users before anyone noticed.

Common Trap
Don't confuse 'the system is up' with 'the system is working.' An available system that returns wrong data is worse than an unavailable system — because you don't know you should be fixing it until users complain.
Production Insight
During a partial network partition, one side may still be 'available' but miss updates — causing stale reads.
If your reliability SLI only checks 200 status, you'll miss the stale-data failure.
Rule: every reliability SLI must validate the response payload, not just the HTTP status.
Key Takeaway
Availability and reliability are independent axes.
CAP forces a trade-off during partitions.
Know which one to sacrifice when things break — your design should make that choice explicit.
When to Prioritise Availability vs Reliability
  • If your system serves live transactions (payments, orders): prioritise reliability. A wrong charge or shipment is worse than a brief outage.
  • If your system serves cached content (CDN, news feed): prioritise availability. Stale content is acceptable; a blank page is not.
  • If you're designing a control-plane API (Kubernetes, deployment): availability first. Operators can retry if a command fails, but they can't work if the API is down.
  • If you're building a real-time collaboration tool: both matter equally. Partial failures cause subtle conflicts (reliability) and downtime causes user frustration (availability).

Measuring Availability: Nines and Budgets

Availability = (AGREED_UPTIME – DOWNTIME) / AGREED_UPTIME

'Agreed uptime' is typically the period your SLA covers — often 30 or 365 days. Downtime is any period where the service was not reachable by users.

| Nines | Availability % | Downtime per year |
|-------|----------------|--------------------|
| 1 | 90% | 36.5 days |
| 2 | 99% | 3.65 days |
| 3 | 99.9% | 8.76 hours |
| 4 | 99.99% | 52.6 minutes |
| 5 | 99.999% | 5.26 minutes |
| 6 | 99.9999% | 31.5 seconds |
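The downtime figures for each number of nines follow directly from the formula, and can be reproduced with a short calculation. This is a sketch assuming a 365-day year; `DowntimeBudget` is a hypothetical class name used only for this example.

```java
import java.time.Duration;

// Hypothetical helper for illustration.
final class DowntimeBudget {

    /** Downtime allowed over the period for N nines, e.g. 3 nines = 99.9% availability. */
    static Duration allowedDowntime(int nines, Duration period) {
        if (nines < 1) throw new IllegalArgumentException("nines must be >= 1");
        double unavailability = Math.pow(10, -nines); // 3 nines -> 0.001
        return Duration.ofMillis(Math.round(period.toMillis() * unavailability));
    }

    public static void main(String[] args) {
        Duration year = Duration.ofDays(365);
        for (int nines = 1; nines <= 6; nines++) {
            Duration budget = allowedDowntime(nines, year);
            System.out.printf("%d nines: %.2f minutes of downtime per year%n",
                    nines, budget.toMillis() / 60_000.0);
        }
    }
}
```

Passing `Duration.ofDays(30)` instead gives the monthly budget, which is often the window an SLA actually uses.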

Measuring correctly requires defining what counts as 'down.' Do you start the clock when the first user reports an issue, when monitoring alerts, or when the load balancer marks the instance unhealthy? Each choice changes the number.

Senior teams define availability measurement in their incident response playbook: clear start and stop conditions for the downtime clock, and a rule for how partial degradation is counted.

Gotcha: Counting Partial Degradation
If your service runs on 10 instances and 1 fails, is that 10% downtime? Most SLAs consider it 'degraded' but not 'down' unless the impact crosses a threshold (e.g., >5% error rate). Define this in your SLO to avoid disputes.
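A request-based view makes the degraded-versus-down decision mechanical. The sketch below classifies a measurement window by its error rate; `RequestAvailability`, the `State` names, and the 5% threshold are assumptions for illustration and should come from your own SLO.

```java
// Hypothetical helper for illustration; the threshold is an example, not a standard.
final class RequestAvailability {

    enum State { HEALTHY, DEGRADED, DOWN }

    /** Fraction of requests that failed in the window. */
    static double errorRate(long succeeded, long total) {
        if (total <= 0 || succeeded < 0 || succeeded > total) {
            throw new IllegalArgumentException("require 0 <= succeeded <= total and total > 0");
        }
        return (double) (total - succeeded) / total;
    }

    /** A window counts as DOWN once the error rate exceeds the agreed threshold. */
    static State classify(long succeeded, long total, double downErrorRateThreshold) {
        double rate = errorRate(succeeded, total);
        if (rate == 0.0) return State.HEALTHY;
        return rate > downErrorRateThreshold ? State.DOWN : State.DEGRADED;
    }

    public static void main(String[] args) {
        // 1 of 10 instances failing every request it receives: roughly 10% of traffic fails.
        System.out.println(classify(900, 1000, 0.05)); // prints DOWN
        System.out.println(classify(980, 1000, 0.05)); // prints DEGRADED
    }
}
```

An instance-count view would report both windows as "up" because at least one instance was alive.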
Production Insight
Teams often overcount availability by excluding scheduled maintenance from downtime calculations.
That's fine for internal SLOs, but users don't care if your downtime was 'planned.'
Rule: for external SLAs, include all downtime — planned or unplanned.
Key Takeaway
Uptime formula is simple — but defining 'down' is the hard part.
Decide measurement criteria before an incident.
Remember: availability is a binary measure of reachability, not health.

Measuring Reliability: SLIs, SLOs, and Error Budgets

Reliability measurement starts with Service Level Indicators (SLIs) — concrete metrics like request error rate, latency percentiles, or data freshness. Each SLI has a target Service Level Objective (SLO), e.g., '99.9% of requests return a correct response within 200ms.'

An error budget is the amount of unreliability you're allowed. For a 99.9% SLO (0.1% error budget) over 30 days, that works out to about 43 minutes of total failure if you measure by time, or 0.1% of all requests if you measure per request. Once the budget is spent, you stop shipping features and focus on reliability.

Common reliability SLIs
  • Request success rate: successful responses / total requests, where 'successful' means correct, not merely HTTP 200
  • Latency: % of requests under a threshold (e.g., p99 < 500ms)
  • Data integrity: checksum mismatch rate
  • Freshness: time since last update for a data source

Reliability is harder to measure because you need sample payload validation, not just HTTP status. Many teams fake reliability by counting 200s as 'success' — but a 200 with wrong data is a failure. Real reliability measurement requires end-to-end synthetic transactions that validate response correctness.
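A synthetic transaction is easiest to reason about when the expected answer is known in advance. The sketch below sends a fixed cart through a checkout function and compares the returned total with one computed independently. `SyntheticCheckoutProbe`, `Response`, and the verdict names are hypothetical, and the `Function` stands in for whatever HTTP client call your probe would really make.

```java
import java.util.function.Function;

// Hypothetical probe for illustration; the Function stands in for a real HTTP call.
final class SyntheticCheckoutProbe {

    /** Minimal response shape assumed for this sketch. */
    static final class Response {
        final int status;
        final long totalCents;

        Response(int status, long totalCents) {
            this.status = status;
            this.totalCents = totalCents;
        }
    }

    enum Verdict { CORRECT, ERROR, WRONG_DATA }

    static Verdict probe(Function<long[], Response> checkout, long[] itemPricesCents) {
        long expectedTotal = 0;
        for (long price : itemPricesCents) {
            expectedTotal += price;
        }
        Response response;
        try {
            response = checkout.apply(itemPricesCents);
        } catch (RuntimeException e) {
            return Verdict.ERROR;
        }
        if (response.status != 200) {
            return Verdict.ERROR;
        }
        // The important branch: a 200 with the wrong payload is still a failure.
        return response.totalCents == expectedTotal ? Verdict.CORRECT : Verdict.WRONG_DATA;
    }

    public static void main(String[] args) {
        long[] cart = {1999, 501};
        System.out.println(probe(items -> new Response(200, 2500), cart)); // prints CORRECT
        System.out.println(probe(items -> new Response(200, 0), cart));    // prints WRONG_DATA
    }
}
```

A status-only SLI would count both calls in `main` as successes; only the payload comparison separates them.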

Mental Model: The Error Budget as a Battery
  • Error budget = 1 - SLO target (e.g., 0.1% if SLO is 99.9%)
  • Total allowed errors per month: request count × error budget
  • When the budget is spent: roll back risky changes, invest in testing, add circuit breakers.
  • If you never spend your budget, you're probably over-engineering (too expensive).
  • If you frequently burn through it, your system quality needs a structural fix.
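The budget arithmetic above is simple enough to check by hand, but putting it in code makes it easy to wire into a dashboard or a deploy gate. This is a sketch with a hypothetical `ErrorBudget` class; the request counts in `main` are made-up inputs.

```java
// Hypothetical helper for illustration.
final class ErrorBudget {

    /** Budget as a fraction: 1 - SLO target (0.999 -> 0.001). */
    static double budgetFraction(double sloTarget) {
        if (sloTarget <= 0.0 || sloTarget >= 1.0) {
            throw new IllegalArgumentException("sloTarget must be strictly between 0 and 1");
        }
        return 1.0 - sloTarget;
    }

    /** Request-based budget: how many requests may fail in the window. */
    static long allowedFailures(long totalRequests, double sloTarget) {
        return Math.round(totalRequests * budgetFraction(sloTarget));
    }

    /** Time-based budget: how many minutes of full failure fit in the window. */
    static double allowedBadMinutes(int windowDays, double sloTarget) {
        return windowDays * 24 * 60 * budgetFraction(sloTarget);
    }

    /** Fraction of the budget still unspent; a negative value means it is overspent. */
    static double remaining(long failedSoFar, long totalRequests, double sloTarget) {
        long allowed = allowedFailures(totalRequests, sloTarget);
        if (allowed == 0) {
            throw new IllegalArgumentException("window too small to carry a budget");
        }
        return 1.0 - (double) failedSoFar / allowed;
    }

    public static void main(String[] args) {
        System.out.printf("Allowed bad minutes in 30 days: %.1f%n", allowedBadMinutes(30, 0.999));
        System.out.printf("Allowed failures in 10M requests: %d%n", allowedFailures(10_000_000L, 0.999));
        System.out.printf("Budget remaining after 2,500 failures: %.0f%%%n",
                remaining(2_500, 10_000_000L, 0.999) * 100);
    }
}
```

A deploy gate could, for example, block risky releases whenever `remaining` drops below an agreed fraction.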
Production Insight
Many teams only measure error rate on the critical path, ignoring background jobs or data pipelines.
A cron job that silently corrupts a database is a reliability failure that won't show in request error rates.
Rule: instrument every service boundary — not just customer-facing endpoints.
Key Takeaway
Reliability SLIs must validate correctness, not just HTTP status.
Error budgets decide when to stop shipping and start fixing.
A good SLO balances reliability cost against innovation velocity.

Architectural Patterns for Both Availability and Reliability

Senior architects blend patterns that serve both goals. Here's how each pattern contributes:

For Availability:
  • Redundancy (active-passive or active-active): eliminates single points of failure.
  • Load balancing with health checks: routes traffic only to healthy instances.
  • Multi-region deployment: survives the loss of an entire region.
  • Graceful degradation: when a dependency fails, serve a fallback response rather than a 500.

For Reliability:
  • Idempotent APIs: safe retries without double-booking.
  • Circuit breakers: stop calling a flaky dependency before failures cascade or corrupt state.
  • Data validation layers: reject malformed data at every boundary.
  • Transactional outbox pattern: write the state change and the outgoing event in one database transaction, so the two cannot diverge.
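Idempotency is the pattern that makes retries safe, so it is worth seeing in miniature. The sketch below keys each operation on a client-supplied idempotency key and returns the stored result on a repeat. `IdempotentExecutor` is a hypothetical name, and the in-memory map is an assumption for brevity: a real service would need a durable store shared by all instances.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

// Hypothetical helper for illustration; in-memory only, so results are lost on restart.
final class IdempotentExecutor<R> {
    private final ConcurrentMap<String, R> completed = new ConcurrentHashMap<>();

    /**
     * Runs the action at most once per key. A retry with the same key gets the
     * original result back instead of repeating the side effect.
     */
    R execute(String idempotencyKey, Supplier<R> action) {
        // computeIfAbsent runs the mapping function atomically, at most once per key.
        // If the action throws, nothing is stored and the caller may retry.
        return completed.computeIfAbsent(idempotencyKey, key -> action.get());
    }

    public static void main(String[] args) {
        IdempotentExecutor<String> payments = new IdempotentExecutor<>();
        int[] charges = {0};
        String first = payments.execute("order-42", () -> "receipt-" + (++charges[0]));
        String retry = payments.execute("order-42", () -> "receipt-" + (++charges[0]));
        System.out.println(first + " / " + retry + " / charges=" + charges[0]);
        // prints receipt-1 / receipt-1 / charges=1
    }
}
```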

The intersection is where most incidents hide. For example, a multi-region failover (availability pattern) can cause temporary data inconsistency (reliability failure) if the secondary region hasn't caught up on replication. That's why chaos engineering drills exercise both availability and reliability scenarios.

Senior Tip
Never implement an availability pattern (like failover) without also verifying the reliability implications. Test what happens to data consistency during failover — and measure the error rate during the transition.
Production Insight
Active-active load balancing improves availability but introduces split-brain risk for stateful services.
If both copies accept writes without coordination, they might diverge — a reliability failure.
Rule: if you're running active-active, you must implement conflict resolution or use a shared data store.
Key Takeaway
Availability patterns improve uptime but can harm reliability.
Reliability patterns protect correctness but add latency.
Always test the interaction between the two sets of patterns.
Choosing Patterns for Your Service Type
  • Stateless service (e.g., API gateway): availability patterns dominate. Add replicas, a load balancer, and health checks. Reliability is mainly about correct request routing.
  • Stateful service with external DB (e.g., order service): both matter. Use idempotency, circuit breakers, and database retry logic. Availability requires DB redundancy.
  • Stateful service with embedded data (e.g., caching node): mostly reliability. Data corruption is the biggest risk. Use replication, consistency checks, and validation.
  • Critical data pipeline (e.g., batch processing): reliability first. Use checkpointing, idempotent processing, and dead letter queues. Availability is secondary because the job can be retried.
● Production incident · POST-MORTEM · severity: high

The 3AM Pager: 5 Nines Available, 10% Error Rate

Symptom
Checkout failures for 10% of users. Monitoring showed all servers responding, pings succeeding, and load balancer reporting healthy backends. No alerts fired because health checks only verified TCP connectivity, not application logic.
Assumption
If the system is up and responding quickly, it must be working correctly. The team assumed that 99.999% uptime automatically meant reliability.
Root cause
A memcached node returned stale, corrupted serialised objects. The deserialisation logic threw exceptions on half the reads. The server processes themselves were alive — the JVM didn't crash — so TCP health checks passed. Only a synthetic transaction test would have caught it.
Fix
1. Add application-layer health checks that exercise the full checkout flow against a shadow database.
2. Implement a circuit breaker on cache reads: after 3 deserialisation failures, fall through to the database.
3. Add an SLI for checkout success rate and alert when it drops below 99.5%.
Together these caught the issue within 30 seconds on the next occurrence.
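The cache circuit breaker from the fix can be sketched as follows. This is not the team's actual implementation: `CacheReadBreaker`, its constructor parameters, and the cool-down behaviour are assumptions made for illustration, with plain `Function` objects standing in for the cache and database clients.

```java
import java.util.function.Function;

// Hypothetical sketch; the Functions stand in for real cache and database clients.
final class CacheReadBreaker<K, V> {
    private final Function<K, V> cacheRead;    // may throw on corrupt entries
    private final Function<K, V> databaseRead; // slower but authoritative
    private final int failureThreshold;
    private final long openNanos;
    private int consecutiveFailures = 0;
    private long openedAtNanos = 0;
    private boolean open = false;

    CacheReadBreaker(Function<K, V> cacheRead, Function<K, V> databaseRead,
                     int failureThreshold, long openMillis) {
        this.cacheRead = cacheRead;
        this.databaseRead = databaseRead;
        this.failureThreshold = failureThreshold;
        this.openNanos = openMillis * 1_000_000L;
    }

    /** Synchronized for simplicity; a production breaker would avoid holding a lock during I/O. */
    synchronized V get(K key) {
        if (isOpen()) {
            return databaseRead.apply(key); // breaker open: skip the cache entirely
        }
        try {
            V value = cacheRead.apply(key);
            consecutiveFailures = 0;
            return value;
        } catch (RuntimeException e) {
            if (++consecutiveFailures >= failureThreshold) {
                open = true;
                openedAtNanos = System.nanoTime();
            }
            return databaseRead.apply(key); // fall through on every cache failure
        }
    }

    /** After the cool-down the breaker closes again and the cache gets another chance. */
    synchronized boolean isOpen() {
        if (open && System.nanoTime() - openedAtNanos >= openNanos) {
            open = false;
            consecutiveFailures = 0;
        }
        return open;
    }

    public static void main(String[] args) {
        int[] cacheCalls = {0};
        CacheReadBreaker<String, String> breaker = new CacheReadBreaker<>(
                key -> { cacheCalls[0]++; throw new IllegalStateException("corrupt entry"); },
                key -> "db:" + key, 3, 60_000);
        for (int i = 0; i < 5; i++) {
            breaker.get("cart-1");
        }
        System.out.println("cache calls: " + cacheCalls[0]); // prints cache calls: 3
    }
}
```

Falling through to the database keeps checkout correct while the cache is broken, at the cost of extra database load, so the threshold and cool-down need to be sized against what the database can absorb.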
Key lesson
  • Availability and reliability are not the same metric.
  • Health checks at the TCP level can hide application-level failures.
  • Always measure what matters: an SLI for correctness beats any uptime dashboard.
  • If you only monitor for availability, you'll miss reliability failures until the users complain.
Production debug guide
When you get paged at 2AM, use this symptom-action guide to quickly classify whether you're facing an availability problem or a reliability problem.
Symptom · 01
All servers respond to ping but some requests fail with 500s or timeouts
Fix
Check application-level health endpoints. Run synthetic transaction probes. If probes fail but ping succeeds, you have a reliability issue — not an availability issue.
Symptom · 02
Server doesn't respond at all or load balancer marks it unhealthy
Fix
Availability problem. Check OS resources, process existence, and network connectivity. Restart service or failover to a redundant instance.
Symptom · 03
Error rate spikes on one host while others are fine
Fix
Isolate the host — likely reliability failure (e.g., corrupted cache, disk error). Remove from rotation and investigate root cause. If all hosts spike simultaneously, check dependency health.
Symptom · 04
Uptime dashboard shows 99.999% but customer complaints about wrong data
Fix
Review SLI definitions. You're measuring availability, not reliability. Instrument the data path with checksum or validation middleware. Alert on data integrity violations.
★ The Availability vs Reliability Quick Debug Cheat Sheet
Use this when the on-call phone buzzes. It'll save you 20 minutes of guessing.
Server not reachable
Immediate action
Check if process is running and port is open.
Commands
curl -I http://localhost:8080/health
systemctl status my-service
Fix now
Restart the service or trigger failover to replica.
Server reachable but errors returned
Immediate action
Check application logs for stack traces, especially serialization or timeouts.
Commands
tail -n 100 /var/log/app/error.log | grep -i exception
curl -X POST -H 'Content-Type: application/json' -d '{"test":"true"}' http://localhost:8080/checkout
Fix now
Identify the specific error pattern and either rollback the last deploy or toggle the faulty feature flag.
No errors in logs but users complain about stale data
Immediate action
Verify cache layer integrity. Flush the cache if necessary.
Commands
redis-cli flushall   # destructive: removes every key in every database on this Redis instance
curl -v http://localhost:8080/api/v1/product/123 -H 'Cache-Control: no-cache'
Fix now
Force a full cache refresh and add data versioning checks.
Availability vs Reliability at a Glance
| Dimension | Availability | Reliability |
|-----------|--------------|-------------|
| Definition | System is reachable and responds to requests | System returns correct and consistent results |
| Primary Metric | Uptime percentage (nines) | Error rate, latency percentiles, data integrity |
| Measurement Method | Health checks, uptime monitors | Synthetic transactions, log analysis, checksums |
| Worst Failure Mode | System unreachable (outage) | Silent data corruption (users trust wrong data) |
| Typical SLO | 99.9% to 99.999% uptime | < 0.1% error rate, p99 < 500ms |
| Improvement Patterns | Redundancy, failover, multi-region | Idempotency, circuit breakers, validation layers |
| Cost Driver | Infrastructure redundancy (more servers, regions) | Development rigor (testing, observability, retries) |

Key takeaways

1. Availability is about reachability; reliability is about correctness. Never confuse them.
2. Measure availability with uptime nines; measure reliability with error rates and data integrity SLIs.
3. High availability without reliability is a lie: your system is up but serving wrong data.
4. Error budgets tell you when to stop shipping features and start fixing reliability.
5. Architectural patterns for availability (redundancy, failover) can harm reliability if not designed carefully.
6. Always include application-layer health checks and synthetic transaction tests: they catch reliability failures that TCP checks miss.

Common mistakes to avoid

Treating availability and reliability as the same thing

Symptom
Teams invest in load balancers and failover but skip idempotency and data validation. Then they wonder why users see wrong orders or duplicate charges.
Fix
Separately define SLIs for availability (uptime) and reliability (error rate, correctness). Assign distinct ownership and budgets for each.

Using only TCP-level health checks

Symptom
Servers pass health checks but application is deadlocked or returning corrupt data. Load balancer keeps routing traffic to a broken instance.
Fix
Implement application-layer health checks (e.g., /health with dependency testing) that verify the service can actually serve a request.

Confusing SLA with SLO

Symptom
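Teams promise 99.99% availability to customers but measure it differently from their internal SLO, so they breach the contract and trigger penalty clauses without their own dashboards ever going red.
Fix
Your SLA is the contractual minimum, so it should be looser than (or at most equal to) your internal SLO. Set the internal SLO tighter than the SLA, and measure both the same way, so that you miss your own target and get alerted before you owe a customer a penalty.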

Not counting partial degradation as downtime

Symptom
During an incident, 1 out of 10 instances returns errors but monitoring shows '100% uptime' because at least one instance was alive. Users see 10% error rate.
Fix
Define availability as the portion of requests that succeed, not the portion of instances that are alive. Use request-level uptime SLI.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR: Explain the difference between availability and reliability with a concr...
Q02 · SENIOR: How do you measure availability for a distributed system with multiple m...
Q03 · SENIOR: What's the relationship between the CAP theorem and the availability vs ...
Q04 · SENIOR: You have an SLO of 99.9% reliability (success rate). Your team ships a n...

Q01 of 04 · SENIOR

Explain the difference between availability and reliability with a concrete production example.

ANSWER
Availability is whether the system is reachable; reliability is whether it returns the correct result. For example, a payment gateway that returns HTTP 200 but never actually charges the card is 100% available but 0% reliable. In a real incident, we had a cache node that corrupted serialised objects — the server was up (available) but on half the reads the deserialisation threw an exception (unreliable). Only application-layer synthetic tests caught it.
FAQ · 5 QUESTIONS

Frequently Asked Questions

1. What is the difference between availability and reliability in simple terms?
2. How many nines of availability do most production systems target?
3. Can a system be reliable but not available?
4. What is an error budget and how does it relate to reliability?
5. How do you measure reliability with synthetic transactions?