Availability vs Reliability — Why 5 Nines Broke Checkout
TCP health checks passed while 10% of users hit checkout errors.
- Availability: Is the system responding? Measured by uptime and nines.
- Reliability: Is the system returning correct results? Measured by error rate and correctness.
- Performance insight: 99.99% availability allows 52.6 minutes of downtime per year; reliability failures often happen outside that window, while the system still counts as up.
- Production insight: A system can be 99.999% available but still corrupt data silently — availability doesn't guarantee correctness.
- Biggest mistake: Treating availability and reliability as interchangeable — they optimize for different failure modes.
Imagine a vending machine in your school hallway. Availability is whether the machine is ON and ready when you walk up to it — does it respond? Reliability is whether it actually gives you the right snack every time without jamming or eating your money. A machine can be 'on' (available) but still mess up your order (unreliable). You want both — a machine that's always on AND always gets it right. That's exactly what engineers mean when they talk about these two ideas in software systems.
Every time you open Netflix, tap 'Pay' on your phone, or check your bank balance, you're silently depending on someone's backend system to be up and working correctly. When those systems fail — even for seconds — real money is lost, real users churn, and real engineers get paged at 3am. Availability and reliability are the two most fundamental promises a system makes to its users, and understanding the difference between them is what separates junior engineers from architects who design systems that actually survive the real world.
The problem is that most teams treat availability and reliability as the same thing and optimize for only one. They slap a load balancer in front of their app, call it 'highly available', and ship it — only to discover their distributed system now silently returns wrong data under load, or drops 0.3% of transactions without anyone noticing for weeks. High availability without reliability is a liar's guarantee. Your system is 'up', but it's quietly betraying your users.
By the end of this article you'll be able to calculate availability from nines, explain the difference between availability and reliability in a system design interview without freezing, identify the architectural patterns that serve each goal, and spot the common trade-offs that teams get wrong when they chase one metric at the expense of the other. Let's build the mental model from the ground up.
What Is Availability?
Availability measures whether a system is up and reachable. It's a binary property: the system either responds to requests or it doesn't. Engineers track availability as a percentage of uptime over a given period — typically a month or a year.
Uptime is calculated using this formula:
Availability = (Total Time – Downtime) / Total Time × 100%
Nines are a shorthand: 99% (two nines) means ~3.65 days of downtime per year. 99.999% (five nines) means ~5.26 minutes. Each extra nine costs exponentially more in infrastructure and operational complexity.
Many production systems target four nines (99.99%, or 52.6 minutes of downtime per year) as a baseline. Five nines is the gold standard for critical financial or healthcare systems. Anything above that is usually marketing fluff: achieving six nines (99.9999%, or 31.5 seconds per year) requires fully redundant, geographically distributed infrastructure and near-instant failover.
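The nines arithmetic above can be checked with a few lines of Python. This is a minimal sketch that assumes a 365-day year (leap years ignored); the function names are illustrative and not from any library.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds; leap years ignored


def availability_from_nines(nines: int) -> float:
    """Return availability as a fraction, e.g. 3 nines -> 0.999."""
    return 1 - 10 ** (-nines)


def downtime_seconds_per_year(nines: int) -> float:
    """Return the downtime allowed per year, in seconds, for a number of nines."""
    return SECONDS_PER_YEAR * 10 ** (-nines)


if __name__ == "__main__":
    for n in range(1, 7):
        pct = availability_from_nines(n) * 100
        minutes = downtime_seconds_per_year(n) / 60
        print(f"{n} nines: {pct:.4f}% available, {minutes:,.2f} minutes of downtime per year")
```

Each extra nine divides the allowed downtime by ten, which is why four nines leaves about 52.6 minutes a year and five nines leaves about 5.26.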
What Is Reliability?
Reliability measures whether the system produces the correct output. A system can be 100% available — responding to every request — yet be 0% reliable if every response is wrong. Reliability is probabilistic: we usually talk about the probability that the system returns the correct result for a given request.
- Error rate: ratio of failed requests to total requests.
- Latency distribution: tail latencies (p99, p999) matter more than averages.
- Data integrity: checksum mismatches, corruption rates.
- Correctness under failure: tolerance of Byzantine faults.
Reliability is harder to guarantee than availability because it requires ensuring every component in the request path behaves correctly under all conditions — including partial failures, network partitions, and concurrent modifications.
- Available: machine is on, touchscreen responsive.
- Reliable: pressing A3 gives you a Snickers, not a bag of chips or nothing.
- An available but unreliable machine eats your money — users hate it.
- A reliable but unavailable machine sits dark — users can't use it.
- Production systems must be both: available to accept traffic, reliable to serve correct data.
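To make the distinction concrete, here is a small sketch that computes both numbers from the same request log. The `RequestRecord` type and the definition of reliability used here (the share of answered requests that were correct) are assumptions for illustration; real systems define these through their own SLIs.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    responded: bool  # did the system answer at all?
    correct: bool    # was the answer what the user should have received?


def availability(records):
    """Fraction of requests that received any response."""
    if not records:
        raise ValueError("no requests recorded")
    return sum(r.responded for r in records) / len(records)


def reliability(records):
    """Fraction of answered requests whose response was correct."""
    answered = [r for r in records if r.responded]
    if not answered:
        return 0.0
    return sum(r.correct for r in answered) / len(answered)


if __name__ == "__main__":
    # 1,000 checkout requests: every one gets a response, 100 of them are wrong.
    log = [RequestRecord(True, True)] * 900 + [RequestRecord(True, False)] * 100
    print(f"availability: {availability(log):.1%}")  # availability: 100.0%
    print(f"reliability:  {reliability(log):.1%}")   # reliability:  90.0%
```

An uptime dashboard built on the first function would show a perfect score for this log, while one in ten users gets the wrong result.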
How Availability and Reliability Relate — But Differ
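Availability and reliability are two orthogonal axes, which gives four possible states. A system can be available and reliable (the happy path), available but unreliable (responding, but silently corrupt), unavailable but reliable (the data is correct but nobody can reach it), or neither available nor reliable. The two properties also interact in familiar ways: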
- CAP theorem: Network partitions force a trade-off between availability and consistency (a form of reliability).
- Circuit breakers: When a dependency is unreliable, you can sacrifice availability (returning a cached or degraded response) to preserve overall system reliability.
- Retries: They improve reliability by recovering from transient failures, but too many retries can degrade availability (thundering herd).
Senior engineers design systems with explicit availability targets and reliability targets — and they know which one to sacrifice when something breaks.
The most expensive production incidents often occur when teams optimised for availability at the expense of reliability: the system stayed up, but served corrupted data to thousands of users before anyone noticed.
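The retry trade-off described above is usually managed by capping the number of attempts and spreading them out. The following is a minimal sketch of capped exponential backoff with full jitter; `call_with_retries` and its parameters are hypothetical names, and the choice of which exceptions count as transient is an assumption you would adapt to your client library.

```python
import random
import time


def call_with_retries(operation, max_attempts=3, base_delay=0.1, max_delay=2.0,
                      retryable=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call operation(), retrying transient failures with backoff and jitter.

    Only exceptions listed in `retryable` are retried; anything else
    propagates immediately. `sleep` is injectable so tests do not wait.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retryable:
            if attempt == max_attempts:
                raise  # retry budget exhausted: surface the failure
            # Full jitter: wait a random time up to the capped exponential ceiling,
            # so many clients failing together do not retry in lockstep.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
```

The attempt cap protects the dependency from unbounded load, and the jitter keeps clients from retrying at the same instant. Retrying is only safe when the operation itself is idempotent.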
Measuring Availability: Nines and Budgets
Availability is calculated from uptime. The classic formula:
Availability = (AGREED_UPTIME – DOWNTIME) / AGREED_UPTIME
'Agreed uptime' is typically the period your SLA covers — often 30 or 365 days. Downtime is any period where the service was not reachable by users.
Downtime allowed per year for each number of nines:

| Nines | Availability % | Downtime per year |
|-------|----------------|-------------------|
| 1 | 90% | 36.5 days |
| 2 | 99% | 3.65 days |
| 3 | 99.9% | 8.76 hours |
| 4 | 99.99% | 52.6 minutes |
| 5 | 99.999% | 5.26 minutes |
| 6 | 99.9999% | 31.5 seconds |
Measuring correctly requires defining what counts as 'down.' Do you start the clock when the first user reports an issue, when monitoring alerts, or when the load balancer marks the instance unhealthy? Each choice changes the number.
Senior teams define availability measurement in their incident response playbook: clear start and stop conditions for downtime clock, and how partial degradation is counted.
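One way to make those start and stop conditions explicit is to compute availability from recorded incident intervals. The sketch below is an assumption about how a team might do this, not a standard: it weights each incident by the fraction of users affected, clips incidents to the measurement window, and assumes incidents do not overlap.

```python
from datetime import datetime, timedelta


def measured_availability(window_start, window_end, incidents):
    """Availability over a window, as a fraction.

    incidents: iterable of (start, end, impact) tuples, where impact is the
    fraction of users affected (1.0 = full outage, 0.1 = 10% degraded).
    """
    total = (window_end - window_start).total_seconds()
    if total <= 0:
        raise ValueError("window_end must be after window_start")
    downtime = 0.0
    for start, end, impact in incidents:
        # Count only the part of the incident that falls inside the window.
        clipped_start = max(start, window_start)
        clipped_end = min(end, window_end)
        if clipped_end > clipped_start:
            downtime += (clipped_end - clipped_start).total_seconds() * impact
    return (total - downtime) / total


if __name__ == "__main__":
    start = datetime(2024, 1, 1)
    end = start + timedelta(days=30)
    incidents = [
        # 60 minutes during which 10% of users were affected
        (start + timedelta(days=5), start + timedelta(days=5, minutes=60), 0.1),
    ]
    print(f"{measured_availability(start, end, incidents):.5%}")
```

Whether a 10% degradation counts as 10% of downtime, as full downtime, or as none is a policy decision. What matters is that the rule is written down before the incident.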
Measuring Reliability: SLIs, SLOs, and Error Budgets
Reliability measurement starts with Service Level Indicators (SLIs) — concrete metrics like request error rate, latency percentiles, or data freshness. Each SLI has a target Service Level Objective (SLO), e.g., '99.9% of requests return a correct response within 200ms.'
An error budget is the amount of unreliability you're allowed. For a 99.9% SLO the budget is 0.1%; over 30 days (43,200 minutes) that is about 43 minutes of complete failure, or a proportionally longer period of partial failure. Once the budget is spent, you stop shipping features and focus on reliability.
- Request success rate: successful responses / total requests (a 2xx status alone does not prove the response was correct)
- Latency SLO: % of requests under threshold (e.g., p99 < 500ms)
- Data integrity: checksum mismatch rate
- Freshness: time since last update for a data source
Reliability is harder to measure because you need sample payload validation, not just HTTP status. Many teams fake reliability by counting 200s as 'success' — but a 200 with wrong data is a failure. Real reliability measurement requires end-to-end synthetic transactions that validate response correctness.
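A synthetic transaction check along these lines might look like the sketch below. The response shape (`order_id`, `total_cents`) is hypothetical and stands in for whatever your checkout API returns. The check treats a 200 as necessary but not sufficient.

```python
import json


def is_correct_checkout_response(status_code, body, expected_total_cents):
    """Return True only if the response is a 200 AND its payload is correct.

    `body` is the raw response text. The field names checked here are
    illustrative assumptions, not a real API contract.
    """
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except (TypeError, ValueError):
        return False  # empty, missing, or non-JSON body
    if not isinstance(payload, dict):
        return False
    order_id = payload.get("order_id")
    return (
        isinstance(order_id, str)
        and order_id != ""
        and payload.get("total_cents") == expected_total_cents
    )
```

A probe that places a known test order and feeds the response through a check like this reports a failure when the total is wrong, even though a status-code monitor would count the same response as a success.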
- Error budget = 1 - SLO target (e.g., 0.1% if SLO is 99.9%)
- Total allowed errors per month: request count × error budget
- When budget is spent: rollback risky changes, invest in testing, add circuit breakers.
- If you never spend your budget, you're probably over-engineering (too expensive).
- If you frequently burn through it, your system quality needs a structural fix.
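The budget arithmetic in the list above is simple enough to sketch directly. The function names are illustrative, and the 30-day window is the same assumption used in the example earlier in this section.

```python
def error_budget(slo: float) -> float:
    """Fraction of requests (or time) allowed to fail, e.g. 0.999 -> 0.001."""
    if not 0 < slo < 1:
        raise ValueError("SLO must be strictly between 0 and 1")
    return 1 - slo


def allowed_bad_minutes(slo: float, days: int = 30) -> float:
    """Minutes of complete failure the budget allows over the window."""
    return days * 24 * 60 * error_budget(slo)


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the request-based budget still unspent (negative if overspent)."""
    allowed_failures = total_requests * error_budget(slo)
    return 1 - failed_requests / allowed_failures


if __name__ == "__main__":
    print(f"{allowed_bad_minutes(0.999):.1f} minutes per 30 days")
    # 1,000,000 requests at a 99.9% SLO allow about 1,000 failures; 250 have occurred.
    print(f"{budget_remaining(0.999, 1_000_000, 250):.0%} of the budget left")
```

A negative result from `budget_remaining` means the budget is overspent, which is the signal to pause risky changes.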
Architectural Patterns for Both Availability and Reliability
Senior architects blend patterns that serve both goals. Here's how each pattern contributes:
For Availability:
- Redundancy (active-passive or active-active): eliminates single points of failure.
- Load balancing with health checks: routes traffic only to healthy instances.
- Multi-region deployment: survives the loss of an entire region.
- Graceful degradation: when a dependency fails, serve a fallback response rather than a 500.

For Reliability:
- Idempotent APIs: safe retries without double-booking.
- Circuit breakers: stop calling a flaky dependency before it corrupts state.
- Data validation layers: reject malformed data at every boundary.
- Transactional outbox pattern: write the state change and its outgoing event in the same database transaction, so the event is never lost and never published for a change that did not commit.
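As an illustration of the idempotent API pattern, the sketch below deduplicates requests by a client-supplied idempotency key. `PaymentService` is a hypothetical in-memory example: it is single-process and not thread-safe, and a production version would keep the keys in durable storage with an expiry.

```python
class PaymentService:
    """Toy payment service that makes `charge` safe to retry."""

    def __init__(self):
        self._results = {}  # idempotency key -> result of the first call
        self.charges = []   # side effects actually performed

    def charge(self, idempotency_key, account, amount_cents):
        # A retry with the same key returns the stored result
        # instead of charging the account a second time.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges.append((account, amount_cents))
        result = {"charge_id": len(self.charges), "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result


if __name__ == "__main__":
    service = PaymentService()
    first = service.charge("order-42", "acct-1", 1999)
    retry = service.charge("order-42", "acct-1", 1999)  # e.g. after a client timeout
    print(first == retry, len(service.charges))  # True 1
```

With this in place, a client that times out and retries cannot double-charge, which is what makes retries a reliability tool rather than a new source of errors.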
The intersection is where most incidents hide. For example, a multi-region failover (availability pattern) can cause temporary data inconsistency (reliability failure) if the secondary region hasn't caught up on replication. That's why chaos engineering drills exercise both availability and reliability scenarios.
The 3AM Pager: 5 Nines Available, 10% Error Rate
- Availability and reliability are not the same metric.
- Health checks at the TCP level can hide application-level failures.
- Always measure what matters: an SLI for correctness beats any uptime dashboard.
- If you only monitor for availability, you'll miss reliability failures until the users complain.
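The gap between the two kinds of check can be shown in a few lines. In this sketch, `tcp_check` only proves that something is accepting connections, while `deep_check` runs a caller-supplied probe that exercises the real request path. Both function names are illustrative, and the probe itself is an assumption you would replace with a synthetic checkout or a similar end-to-end request.

```python
import socket


def tcp_check(host, port, timeout=1.0):
    """Shallow check: True if anything accepts a TCP connection on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def deep_check(probe):
    """Application-level check: True only if the probe runs and reports success.

    `probe` is any zero-argument callable that performs a real request and
    returns a truthy value when the result is correct.
    """
    try:
        return bool(probe())
    except Exception:
        return False
```

A process that still accepts connections while every request fails will pass the first check and fail the second, which is the failure mode in this incident.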
Common mistakes to avoid
- Treating availability and reliability as the same thing
- Using only TCP-level health checks
- Confusing SLA with SLO
- Not counting partial degradation as downtime
Interview Questions on This Topic
Explain the difference between availability and reliability with a concrete production example.