Latency vs Throughput — p99 Spike from Pool Exhaustion
p99 latency hit 3,200ms during 2,000 RPS due to 1% slow queries holding connections 50x longer.
20+ years shipping large-scale distributed systems. Written from production experience, not tutorials.
- Latency: measured in milliseconds. Always use percentiles (p50, p95, p99), not averages.
- Throughput: measured in requests per second (RPS) or transactions per second (TPS).
- Trade-off: as throughput approaches capacity, latency rises sharply (hockey stick curve).
- Little's Law: L = lambda x W. Concurrency = throughput x latency.
- p99 of 200ms means 1 in 100 requests takes 200ms or longer.
- At 1M requests/day, that is 10,000 bad experiences daily.
- Optimizing for throughput alone. High throughput with bad p99 latency means most users are fine but your highest-value users (complex queries, large payloads) suffer.
Imagine a highway toll booth. Latency is how long one car takes to pass through the booth. Throughput is how many cars pass through per hour. You can add more lanes (parallelism) to increase throughput, but if a truck with an oversized load blocks one lane, that truck's latency is terrible — even though the average car passes quickly. Percentiles tell you about the trucks, not just the average car.
Every production system is ultimately measured by two numbers: how fast it responds (latency) and how much it can handle (throughput). SLAs are written in percentiles — p99 latency under 200ms, throughput above 10,000 RPS. Getting either wrong means either angry users or an over-provisioned bill.
The counterintuitive part: optimizing purely for throughput often destroys latency. A system processing 1,000 RPS might have 5ms average latency, but as queues fill under load, that average hides the 10% of users seeing 500ms. Understanding the latency-throughput tradeoff curve and Little's Law is the foundation for capacity planning, SLO design, and performance debugging.
This is not a textbook definition. It covers how to measure latency correctly (percentiles, not averages), how the latency-throughput curve behaves near capacity, how Little's Law connects concurrency to resource sizing, and the production patterns that separate systems that scale gracefully from those that collapse under load.
Latency vs Throughput — The p99 Spike from Pool Exhaustion
Latency measures the time a single request takes from submission to completion; throughput counts how many requests a system processes per second. The core mechanic: they are not independent. When you push throughput beyond a system's capacity, latency spikes non-linearly — often by orders of magnitude. For a thread-pool-backed service, once all threads are busy, new requests queue. Queue wait time adds directly to latency, and at high concurrency, the p99 can jump from 10ms to 10s.
In practice, the relationship follows Little's Law: L = λ × W (concurrency = throughput × latency). If throughput exceeds the system's sustainable rate, latency grows proportionally to queue depth. A 100-thread pool handling 100 req/s with 100ms average latency is at the edge. One more request per second forces queuing, and latency becomes (queue size × service time). The p99 spike is the first symptom of exhaustion — not a transient blip, but a structural overload signal.
Use this understanding when capacity planning or tuning thread pools, connection pools, or database connection limits. The key insight: monitoring only average latency hides the p99 spike until it's too late. Track p99 latency as a function of throughput to find the inflection point where latency breaks away. In production, that inflection defines your true max throughput — not the theoretical peak from a load test.
Percentiles — Why Averages Lie
Average (mean) latency is the most misleading metric in production systems. It hides tail latency — the slow requests that affect your worst users. A system with 10ms average latency might have 1% of requests taking 2,000ms. The average tells you nothing about those 1%.
Percentiles solve this. The p50 (median) tells you what the typical user sees. The p99 tells you what the worst 1-in-100 users sees. The p99.9 tells you about 1-in-1,000. At scale, even small percentages translate to large absolute numbers of affected users.
- p50: typical user. Good for capacity planning.
- p95: 1 in 20 users. Good for SLO targets on non-critical paths.
- p99: 1 in 100 users. Standard for user-facing API SLOs.
- p99.9: 1 in 1,000 users. Critical for payment, authentication, and checkout flows.
- Average: useless for SLOs. Only useful for capacity cost estimation.
Little's Law
Little's Law: L = lambda x W. Average number in system = arrival rate x average time in system. This fundamental relationship connects latency, throughput, and concurrency. It applies to any stable system — web servers, connection pools, thread pools, message queues.
The practical power of Little's Law is in capacity planning. If you know your target throughput and your average latency, you can calculate the minimum concurrency (threads, connections, workers) you need. If latency doubles, your concurrency doubles — same throughput requires double the resources.
- L = lambda x W: concurrency = throughput x latency.
- Sizing: threads needed = target RPS x average latency in seconds.
- Headroom: 3x the Little's Law minimum for burst tolerance.
- If latency doubles, concurrency doubles — same throughput, double resources.
- Applies to: thread pools, connection pools, message queues, worker pools.
The Latency-Throughput Curve: Hockey Stick Behavior
Every system has a latency-throughput curve. At low load, latency is flat — requests rarely wait. As throughput approaches capacity, latency rises sharply. This is the 'hockey stick' curve, and it is governed by queuing theory.
The M/M/1 queuing model predicts: average response time = service_time / (1 - utilization). At 50% utilization, response time is 2x the service time. At 90%, it is 10x. At 99%, it is 100x. This is why production systems target 50-70% utilization — the steep part of the curve is unpredictable and dangerous.
- 50% utilization: response time is 2x service time. Comfortable.
- 70% utilization: response time is 3.3x service time. Acceptable.
- 90% utilization: response time is 10x service time. Dangerous.
- 99% utilization: response time is 100x service time. Catastrophic.
- Target: 50-70% under normal load. Auto-scale at 70%. Page at 85%.
Measuring Latency Correctly: Histograms, Bucket Boundaries, and Clock Sources
Measuring latency seems simple — record the start time, record the end time, subtract. In production, it is more complex. Clock source accuracy, histogram bucket boundaries, and measurement scope (wall clock vs CPU time) all affect the correctness of your latency data.
Histograms are the standard for latency measurement in Prometheus. They pre-aggregate observations into buckets at instrumentation time, then histogram_quantile computes percentiles at query time. The bucket boundaries you choose are permanent — you cannot change them without losing time series continuity.
- Default Prometheus buckets go up to 10s — too coarse for APIs with sub-200ms SLOs.
- Custom buckets aligned to SLO thresholds give accurate percentile computation.
- Bucket boundaries are immutable after deployment. Plan carefully.
- histogram_quantile interpolates between buckets — imprecise if boundaries miss SLO thresholds.
- Each bucket adds one time series per label combination. Too many buckets increase cardinality.
System.currentTimeMillis() has millisecond granularity and can jump backward during NTP corrections. System.nanoTime() is monotonic and nanosecond-precise but is only meaningful for elapsed time, not absolute time. Always use nanoTime for latency measurement. In distributed systems, clock skew between nodes means cross-node latency comparisons are approximate — use distributed tracing (OpenTelemetry) for end-to-end latency measurement.Tail Latency: The Silent Amplifier
You’ve tuned your median latency to 5ms. Feels good. Then you notice your p99.4 is 400ms. That’s tail latency, and it’s a system killer. Why does it happen? Because every request hits a different path. In a distributed system, the slowest component dictates the tail. Think about it: if a single request needs responses from 100 servers, and each server has a 1% chance of being 100ms slower, the chance your request finishes in time collapses. That’s the long-tail problem. It doesn’t matter if your average is clean. That one slow call — a garbage collection pause, a network retransmission, a lock contention — cascades. Users experience the aggregate. They don’t care about your average. They remember the spinner that took two seconds. You must measure tail latency proactively. Use coordinated omission to avoid hiding it in your instrumentation. If you ignore the tail, it will eat your SLA. And the worst part? Tail latency doesn’t scale — it amplifies. One slow node in a replica group infects every request that touches it. You think you’re safe at 99th percentile? Wait until the 99.9th percentile shows you the truth.
Amdahl's Law: Your Bottleneck Conversation
Throughput isn’t just about adding more servers. That’s the naïve mistake. Amdahl’s Law tells you the hard truth: the speedup of a system is limited by the part you cannot parallelize. You have serial work — database writes, cache updates, lock acquisitions. Even if you scale the parallel portion to infinity, your throughput ceiling is determined by that serial bottleneck. Here’s the math: if 10% of your workload is serial (can’t parallelize), the absolute maximum speedup is 10x, no matter how many nodes you throw at it. This is why you see that classic hockey-stick curve. At low concurrency, adding threads or machines improves throughput linearly. Then you hit the knee. Suddenly, adding more resources yields nothing, or even degrades performance because of contention. Your p99 latency spikes, not because the hardware is slow, but because your serial path is choked. You fix this by identifying the serial region — maybe it’s a transactional log write, a synchronous RPC, or a global mutex. Then you optimize it. Maybe you batch the writes. Maybe you use an async journal. Maybe you redesign the protocol to reduce coordination. Amdahl’s Law isn’t a theory. It’s the reason your 64-core machine runs your request handler slower than a 4-core one.
p99 Latency Spike from Database Connection Pool Exhaustion Under Load
- Average latency hides tail latency. Always monitor p99 and p99.9, not just average.
- Little's Law predicts resource needs. If slow queries hold connections 50x longer, they consume 50x more pool capacity per request.
- Connection pools are queues. When the pool is full, requests wait. Queue wait time dominates p99 latency.
- Fail fast is better than queue forever. Set connection acquire timeouts to shed load rather than accumulate queue depth.
- Fix the slow queries first. No amount of pool sizing fixes a full table scan.
curl -s http://localhost:9090/api/v1/query?query=histogram_quantile(0.99,sum(rate(http_request_duration_seconds_bucket[5m]))by(le))jstat -gcutil <pid> 1000 10Key takeaways
Interview Questions on This Topic
Frequently Asked Questions
20+ years shipping large-scale distributed systems. Written from production experience, not tutorials.
That's Fundamentals. Mark it forged?
5 min read · try the examples if you haven't