Latency and Throughput
- Latency = time for one request; throughput = requests per unit time.
- Latency is measured in milliseconds. Always measure percentiles (p50, p95, p99, p99.9), not averages — averages hide the slow outliers.
- Throughput is measured in requests per second (RPS) or transactions per second (TPS).
- Trade-off: as throughput approaches capacity, latency rises sharply (the hockey-stick curve) because queues form when arrival rate approaches capacity.
- Little's Law: L = lambda x W. Concurrency = throughput x latency.
- A p99 of 200ms means 1 in 100 requests takes 200ms or longer. At 1M requests/day, that is 10,000 bad experiences daily.
- Common mistake: optimizing for throughput alone. High throughput with bad p99 latency means most users are fine, but your highest-value users (complex queries, large payloads) suffer.
Production Debug Guide
Symptom-first investigation path for performance degradation.

High p99 latency with normal average:

```shell
# p99 from Prometheus histogram buckets
curl -s 'http://localhost:9090/api/v1/query?query=histogram_quantile(0.99,sum(rate(http_request_duration_seconds_bucket[5m]))by(le))'
# GC activity: 10 samples at 1s intervals
jstat -gcutil <pid> 1000 10
```

Throughput plateau despite low CPU:

```shell
# Count blocked/waiting threads (lock contention)
jstack <pid> | grep -c 'BLOCKED\|WAITING'
# Per-thread CPU usage
pidstat -t -p <pid> 1 5
```

Periodic latency spikes:

```shell
# GC utilization: 5 samples at 1s intervals
jstat -gcutil <pid> 1000 5
# Recent GC pauses in the application log
grep -i 'pause\|gc' /var/log/app/gc.log | tail -20
```

Latency increases with traffic (hockey stick):

```shell
# Current request rate
curl -s 'http://localhost:9090/api/v1/query?query=rate(http_requests_total[1m])'
# Resident memory in MiB
curl -s 'http://localhost:9090/api/v1/query?query=process_resident_memory_bytes/1024/1024'
```

Sudden throughput drop:

```shell
# Downstream 5xx error rate
curl -s 'http://localhost:9090/api/v1/query?query=rate(http_client_requests_total{status=~"5.."}[1m])'
# Connection churn: sockets stuck in TIME_WAIT
netstat -an | grep -c TIME_WAIT
```
Every production system is ultimately measured by two numbers: how fast it responds (latency) and how much it can handle (throughput). SLAs are written in percentiles — p99 latency under 200ms, throughput above 10,000 RPS. Getting either wrong means either angry users or an over-provisioned bill.
The counterintuitive part: optimizing purely for throughput often destroys latency. A system processing 1,000 RPS might have 5ms average latency, but as queues fill under load, that average hides the 10% of users seeing 500ms. Understanding the latency-throughput tradeoff curve and Little's Law is the foundation for capacity planning, SLO design, and performance debugging.
This article goes beyond textbook definitions. It covers how to measure latency correctly (percentiles, not averages), how the latency-throughput curve behaves near capacity, how Little's Law connects concurrency to resource sizing, and the production patterns that separate systems that scale gracefully from those that collapse under load.
Percentiles — Why Averages Lie
Average (mean) latency is the most misleading metric in production systems. It hides tail latency — the slow requests that affect your worst-served users. A system with 10ms average latency might have 1% of requests taking 2,000ms; the average tells you nothing about that 1%.
Percentiles solve this. The p50 (median) tells you what the typical user sees. The p99 tells you what the worst 1-in-100 requests look like. The p99.9 covers the worst 1-in-1,000. At scale, even small percentages translate to large absolute numbers of affected users.
```python
import numpy as np

# Synthetic latency sample: 900 fast requests, 99 slow requests, one outlier
response_times = np.concatenate([
    np.random.normal(10, 2, 900),
    np.random.normal(200, 50, 99),
    [2000]
])

print(f'Mean (average): {response_times.mean():.1f}ms')
print(f'p50 (median): {np.percentile(response_times, 50):.1f}ms')
print(f'p90: {np.percentile(response_times, 90):.1f}ms')
print(f'p95: {np.percentile(response_times, 95):.1f}ms')
print(f'p99: {np.percentile(response_times, 99):.1f}ms')
print(f'p99.9: {np.percentile(response_times, 99.9):.1f}ms')
print(f'Max: {response_times.max():.1f}ms')
```
Mean (average): 30.8ms
p50 (median): 10.1ms
p90: 12.4ms
p95: 180.2ms
p99: 220.4ms
p99.9: 1890.3ms
Max: 2000.0ms
- p50: typical user. Good for capacity planning.
- p95: 1 in 20 users. Good for SLO targets on non-critical paths.
- p99: 1 in 100 users. Standard for user-facing API SLOs.
- p99.9: 1 in 1,000 users. Critical for payment, authentication, and checkout flows.
- Average: useless for SLOs. Only useful for capacity cost estimation.
Little's Law
Little's Law: L = lambda x W. Average number in system = arrival rate x average time in system. This fundamental relationship connects latency, throughput, and concurrency. It applies to any stable system — web servers, connection pools, thread pools, message queues.
The practical power of Little's Law is in capacity planning. If you know your target throughput and your average latency, you can calculate the minimum concurrency (threads, connections, workers) you need. If latency doubles, your concurrency doubles — same throughput requires double the resources.
```python
# Little's Law: L = lambda * W
throughput_lambda = 100   # requests per second
latency_W = 0.050         # 50ms average latency, in seconds
concurrency_L = throughput_lambda * latency_W
print(f'Average concurrent requests: {concurrency_L}')

# If latency doubles to 100ms at the same throughput, concurrency doubles
new_concurrency = 100 * 0.100
print(f'Concurrency at 100ms latency: {new_concurrency}')

# Sizing a thread pool: 500 RPS at 20ms average latency
required_threads = 500 * 0.020
print(f'Threads needed: {required_threads}')

# Sizing a DB connection pool: 200 queries/s at 15ms average query time
db_concurrency = 200 * 0.015
print(f'DB connections needed: {db_concurrency}')
```
Average concurrent requests: 5.0
Concurrency at 100ms latency: 10.0
Threads needed: 10.0
DB connections needed: 3.0
- L = lambda x W: concurrency = throughput x latency.
- Sizing: threads needed = target RPS x average latency in seconds.
- Headroom: 3x the Little's Law minimum for burst tolerance.
- If latency doubles, concurrency doubles — same throughput, double resources.
- Applies to: thread pools, connection pools, message queues, worker pools.
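The sizing rule in these bullets can be wrapped in a small helper. This is a sketch — `pool_size` is a hypothetical name, and the 3x headroom factor is the guideline above, not a universal constant:

```python
import math

def pool_size(target_rps: float, avg_latency_s: float, headroom: float = 3.0) -> int:
    """Little's Law minimum (L = lambda * W) scaled by a headroom factor."""
    littles_law_minimum = target_rps * avg_latency_s
    return math.ceil(littles_law_minimum * headroom)

# 500 RPS at 20ms average query time: minimum 10, provisioned 30
print(pool_size(500, 0.020))
# 200 RPS at 15ms: minimum 3, provisioned 9
print(pool_size(200, 0.015))
```

Rounding up matters: a fractional Little's Law result still requires a whole extra thread or connection.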
The Latency-Throughput Curve: Hockey Stick Behavior
Every system has a latency-throughput curve. At low load, latency is flat — requests rarely wait. As throughput approaches capacity, latency rises sharply. This is the 'hockey stick' curve, and it is governed by queuing theory.
The M/M/1 queuing model predicts: average response time = service_time / (1 - utilization). At 50% utilization, response time is 2x the service time. At 90%, it is 10x. At 99%, it is 100x. This is why production systems target 50-70% utilization — the steep part of the curve is unpredictable and dangerous.
```python
# M/M/1 queuing model: response time = service_time / (1 - utilization)
service_time_ms = 10
for utilization_pct in [10, 30, 50, 70, 80, 90, 95, 99]:
    rho = utilization_pct / 100.0
    response_time = service_time_ms / (1 - rho)
    queue_time = response_time - service_time_ms
    print(f'Util: {utilization_pct:3d}% | Response: {response_time:7.1f}ms | Queue: {queue_time:7.1f}ms')
```
Util: 10% | Response: 11.1ms | Queue: 1.1ms
Util: 30% | Response: 14.3ms | Queue: 4.3ms
Util: 50% | Response: 20.0ms | Queue: 10.0ms
Util: 70% | Response: 33.3ms | Queue: 23.3ms
Util: 80% | Response: 50.0ms | Queue: 40.0ms
Util: 90% | Response: 100.0ms | Queue: 90.0ms
Util: 95% | Response: 200.0ms | Queue: 190.0ms
Util: 99% | Response: 1000.0ms | Queue: 990.0ms
- 50% utilization: response time is 2x service time. Comfortable.
- 70% utilization: response time is 3.3x service time. Acceptable.
- 90% utilization: response time is 10x service time. Dangerous.
- 99% utilization: response time is 100x service time. Catastrophic.
- Target: 50-70% under normal load. Auto-scale at 70%. Page at 85%.
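To see what these targets imply for capacity, here is a back-of-the-envelope sketch. It assumes a fixed per-request service time and fully parallel workers — real systems are messier, so treat the result as an upper bound, and the function name is made up:

```python
def max_safe_rps(service_time_s: float, workers: int, target_utilization: float = 0.7) -> float:
    """Throughput ceiling at a chosen utilization target.

    Each worker can serve 1/service_time requests per second at 100%
    utilization; staying at 70% keeps the system on the flat part of
    the latency-throughput curve.
    """
    capacity = workers / service_time_s
    return capacity * target_utilization

# 8 workers, 10ms service time: 800 RPS raw capacity, ~560 RPS at 70%
print(max_safe_rps(0.010, 8))
```

Running the same calculation at 90% utilization buys only 29% more throughput while (per the M/M/1 table above) tripling the response time — which is why the auto-scaling trigger sits at 70%.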
Measuring Latency Correctly: Histograms, Bucket Boundaries, and Clock Sources
Measuring latency seems simple — record the start time, record the end time, subtract. In production, it is more complex. Clock source accuracy, histogram bucket boundaries, and measurement scope (wall clock vs CPU time) all affect the correctness of your latency data.
Histograms are the standard for latency measurement in Prometheus. They pre-aggregate observations into buckets at instrumentation time, then histogram_quantile computes percentiles at query time. The bucket boundaries you choose are permanent — you cannot change them without losing time series continuity.
```java
package io.thecodeforge.performance;

import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.AtomicLong;

public class LatencyMeasurement {

    // Bucket upper bounds in seconds, aligned to sub-second SLO thresholds
    private static final double[] BUCKET_BOUNDARIES = {
        0.005, 0.01, 0.025, 0.05, 0.1, 0.2, 0.5, 1.0, 2.5, 5.0
    };

    private final ConcurrentSkipListMap<Double, AtomicLong> buckets = new ConcurrentSkipListMap<>();
    private final AtomicLong sum = new AtomicLong(0);   // running sum in nanoseconds (like Prometheus _sum)
    private final AtomicLong count = new AtomicLong(0); // total observations (like Prometheus _count)

    public LatencyMeasurement() {
        for (double boundary : BUCKET_BOUNDARIES) {
            buckets.put(boundary, new AtomicLong(0));
        }
        // +Inf bucket catches everything above the largest boundary
        buckets.put(Double.POSITIVE_INFINITY, new AtomicLong(0));
    }

    public void observe(double latencySeconds) {
        count.incrementAndGet();
        sum.addAndGet((long) (latencySeconds * 1_000_000_000));
        // Cumulative buckets: increment every bucket whose boundary >= the observation
        for (var entry : buckets.tailMap(latencySeconds, true).entrySet()) {
            entry.getValue().incrementAndGet();
        }
    }

    public double percentile(double p) {
        long totalCount = count.get();
        if (totalCount == 0) return 0;
        double rank = p * totalCount;
        double prevBoundary = 0;
        long prevCount = 0;
        for (var entry : buckets.entrySet()) {
            long currentCount = entry.getValue().get();
            if (currentCount >= rank) {
                double currentBoundary = entry.getKey();
                if (currentBoundary == Double.POSITIVE_INFINITY) return prevBoundary;
                double bucketWidth = currentBoundary - prevBoundary;
                long bucketCount = currentCount - prevCount;
                if (bucketCount == 0) return currentBoundary;
                // Linear interpolation within the bucket, as histogram_quantile does
                double fraction = (rank - prevCount) / bucketCount;
                return prevBoundary + fraction * bucketWidth;
            }
            prevBoundary = entry.getKey();
            prevCount = currentCount;
        }
        return prevBoundary;
    }
}
```
- Default Prometheus buckets go up to 10s — too coarse for APIs with sub-200ms SLOs.
- Custom buckets aligned to SLO thresholds give accurate percentile computation.
- Bucket boundaries are immutable after deployment. Plan carefully.
- histogram_quantile interpolates between buckets — imprecise if boundaries miss SLO thresholds.
- Each bucket adds one time series per label combination. Too many buckets increase cardinality.
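The interpolation point above can be demonstrated numerically. This sketch (the function name is made up; the linear interpolation mirrors, in simplified form, what histogram_quantile does) shows how coarse buckets distort a p99 estimate when the SLO threshold falls between boundaries:

```python
def bucketed_percentile(boundaries, cumulative_counts, p):
    """Estimate a percentile from cumulative bucket counts by linear
    interpolation inside the bucket containing the target rank."""
    total = cumulative_counts[-1]
    rank = p * total
    prev_boundary, prev_count = 0.0, 0
    for boundary, count in zip(boundaries, cumulative_counts):
        if count >= rank:
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_boundary + fraction * (boundary - prev_boundary)
        prev_boundary, prev_count = boundary, count
    return boundaries[-1]

# Same 1000 observations seen through two bucket layouts; the true p99
# is just under 200ms (0.2s).
coarse = bucketed_percentile([0.1, 0.5, 1.0], [900, 995, 1000], 0.99)
fine = bucketed_percentile([0.1, 0.2, 0.25, 0.5, 1.0], [900, 992, 995, 999, 1000], 0.99)
print(f'coarse buckets: p99 ~= {coarse * 1000:.0f}ms')  # ~479ms: wildly overestimated
print(f'fine buckets:   p99 ~= {fine * 1000:.0f}ms')    # ~198ms: close to truth
```

With a 0.1-0.5s gap, the 99th-percentile rank lands in a 400ms-wide bucket and interpolation can only guess; adding a 0.2s boundary at the SLO threshold pins the estimate down.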
System.currentTimeMillis() has millisecond granularity and can jump backward during NTP corrections. System.nanoTime() is monotonic and nanosecond-precise, but is only meaningful for elapsed time, not absolute time. Always use nanoTime for latency measurement. In distributed systems, clock skew between nodes means cross-node latency comparisons are approximate — use distributed tracing (OpenTelemetry) for end-to-end latency measurement.

| Aspect | Average (Mean) | Percentile (p50/p99) | Histogram |
|---|---|---|---|
| What it measures | Arithmetic mean of all observations | Value below which N% of observations fall | Distribution of observations across predefined buckets |
| Hides tail latency | Yes — heavily influenced by outliers in both directions | No — p99 directly measures tail | No — bucket counts show distribution shape |
| Aggregatable across instances | Yes (if counts are known) | No (cannot average percentiles) | Yes (sum bucket counts, then compute quantile) |
| Storage cost | One time series per label combination | One time series per percentile per label | N+2 time series per label combination (N buckets + _sum + _count) |
| SLO suitability | Poor — misleading at scale | Good for single-instance services | Excellent — supports aggregation and accurate SLO tracking |
| Best for | Cost estimation, capacity planning | Quick debugging, single-instance monitoring | Production SLO dashboards, multi-replica aggregation |
| Prometheus function | rate(metric_sum[5m]) / rate(metric_count[5m]) | Direct query on gauge/summary | histogram_quantile(0.99, rate(metric_bucket[5m])) |
| Accuracy | Exact (but misleading) | Exact for summaries, approximate for histogram-derived | Approximate — depends on bucket granularity |
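The wall-clock vs monotonic distinction described above has a direct Python analogue — time.time() is wall-clock and can jump during NTP corrections, while time.perf_counter() is monotonic and high-resolution. A minimal sketch (the `timed` helper is a made-up name):

```python
import time

def timed(fn):
    """Measure elapsed time with a monotonic clock, which is immune to
    the backward jumps wall-clock time can take during NTP corrections."""
    start = time.perf_counter()
    result = fn()
    elapsed_s = time.perf_counter() - start
    return result, elapsed_s

_, elapsed = timed(lambda: time.sleep(0.05))
print(f'elapsed: {elapsed * 1000:.1f}ms')  # roughly 50ms
```

Using time.time() here would usually give the same answer — until the one NTP correction that produces a negative latency and corrupts your histogram.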
🎯 Key Takeaways
- Latency = time for one request. Throughput = requests per unit time.
- Always measure p99 or p99.9, not average — averages hide the slow outliers.
- At high utilization, latency rises sharply — queues form when arrival rate approaches capacity.
- Little's Law: L = lambda x W — concurrency = throughput x latency. Latency spikes increase resource usage.
- p50 tells you the typical user experience; p99 tells you what the worst 1-in-100 requests experience.
- The latency-throughput curve is a hockey stick. Target 50-70% utilization. Auto-scale at 70%.
- Histogram bucket boundaries are immutable. Align them with SLO thresholds before deployment.
- Size resources to 3x the Little's Law minimum for burst tolerance. If latency doubles, concurrency doubles.
Interview Questions on This Topic
- Why is p99 latency more useful than average latency for SLOs?
- What is Little's Law and how do you use it for capacity planning?
- Why does latency increase dramatically as throughput approaches system capacity?
- Explain the latency-throughput curve. What utilization target do you recommend for production and why?
- How do histogram bucket boundaries affect percentile accuracy? What happens if your SLO threshold falls between two buckets?
- A system has 10ms average latency but 2,000ms p99. What are the three most likely causes?
- How do you size a database connection pool using Little's Law? What headroom do you provision?
- Why can't you average percentiles across instances? What is the correct approach for multi-replica SLO tracking?
- Explain the difference between wall clock time and monotonic time for latency measurement. When does it matter?
- How would you design a load test to find the latency-throughput curve steepening point?
Frequently Asked Questions
Why does latency increase as throughput approaches capacity?
Queuing theory explains this: as utilization approaches 100%, queue length grows without bound. At 50% utilization, requests rarely wait. At 90% utilization, the average time spent queuing is 9x the service time; at 99%, it is 99x. This is why systems are designed to operate at 50-70% utilization — the steep latency curve near capacity is unpredictable.
What is a good p99 latency target?
It depends on the use case. For interactive user-facing APIs: under 200ms is generally good, under 100ms is excellent. For database queries: under 10ms for indexed reads. For batch processing: throughput matters more than latency. Define SLOs based on user experience requirements, not arbitrary targets.
How do I size a connection pool using Little's Law?
Minimum connections = target throughput (RPS) x average query latency (seconds). For 500 RPS with 20ms average query time: 500 x 0.020 = 10 connections minimum. Provision 3x headroom (30 connections) for burst tolerance and latency variance. Monitor active vs idle connections — if active approaches the pool size, you are near saturation.
Why can't I average p99 latencies across instances?
Percentiles are not additive. If instance A has p99 of 100ms and instance B has p99 of 500ms, the global p99 is not 300ms. It depends on the distribution of all requests across both instances. The correct approach: use histograms (which are aggregatable) and compute histogram_quantile on the summed bucket counts.
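A quick demonstration of this point on synthetic data (the latency distributions and traffic split are made up; any skewed two-instance split shows the same effect):

```python
import numpy as np

rng = np.random.default_rng(42)
instance_a = rng.normal(100, 10, 10_000)  # healthy instance, ~100ms
instance_b = rng.normal(500, 50, 1_000)   # degraded instance, ~500ms, 10x less traffic

# Wrong: average the per-instance p99s
naive = (np.percentile(instance_a, 99) + np.percentile(instance_b, 99)) / 2
# Right: pool all observations (what summing histogram buckets achieves)
pooled = np.percentile(np.concatenate([instance_a, instance_b]), 99)

print(f'average of p99s: {naive:.0f}ms')   # misleadingly low
print(f'true pooled p99: {pooled:.0f}ms')
```

Histograms avoid this trap because bucket counts from all replicas can be summed before the quantile is computed.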
What is the difference between latency and response time?
Latency is the time the system takes to process a request (server-side). Response time includes latency plus network transit time (round-trip). In practice, the terms are often used interchangeably, but when debugging, distinguish between server-side latency (your code) and client-perceived response time (includes network, DNS, TLS handshake).
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.