LLM Latency Optimization — How We Cut P99 from 12s to 1.8s Without Changing the Model
Stop throwing GPUs at slow LLMs.
- Token Batching: Grouping requests reduces per-request overhead, but watch out for stragglers that delay the entire batch. We saw a 40% throughput gain with dynamic batching.
- Speculative Decoding: Use a cheap draft model to guess tokens, then verify with the big model. Cuts latency by 2-3x when the draft is accurate, but adds overhead if not.
- Prompt Compression: Truncating or summarizing input context reduces processing time. A 50% context cut saved us 800ms on a 4k-token prompt, but we lost accuracy on nuanced queries.
- KV Cache Optimization: Reuse cached key-value states across requests in a session. Cuts time-to-first-token by 60%, but cache memory grows linearly with sequence length and with the number of concurrent sessions.
- Quantization: Lower precision weights (FP16 to INT8) speed up matrix multiplies. We saw a 1.5x throughput improvement on a 70B model, but accuracy dropped 2% on complex reasoning tasks.
- Streaming: Return tokens as they're generated, not all at once. Users perceive lower latency even if total generation time is the same. Critical for chat applications.
Imagine you're a chef making custom pizzas. Instead of making one pizza at a time (slow), you prep all the toppings and bake multiple pizzas together (batching). You also guess what toppings the customer wants before they finish ordering (speculative decoding) and skip reading the entire recipe book if it's a repeat order (KV cache). This way, the customer gets their pizza faster without you buying a bigger oven.
Three months ago, our recommendation engine started timing out. P99 latency hit 12 seconds. Users were abandoning the search bar. The knee-jerk reaction was to scale up GPUs — more A100s, more money. But the bottleneck wasn't compute; it was how we were talking to the model. We were making one request per user, sending full conversation histories, and waiting for the entire response before showing anything. Classic rookie moves.
Most latency optimization guides hand you a list of techniques without telling you when they break. Quantization sounds great until your accuracy drops on a multi-hop reasoning task. Streaming is easy until you need to handle mid-response cancellation. And everyone recommends batching, but nobody warns you about the straggler problem — one slow request holding up the whole batch. We learned these lessons at 3am with a pager going off.
This article covers seven production-tested techniques for LLM latency optimization. Each section includes the internal mechanics, a runnable code example, and a real incident where the technique either saved us or burned us. You'll walk away with a debugging checklist, a cheat sheet for 2am triage, and the confidence to tune latency without breaking accuracy. We'll also cover when to ignore the textbook and just add more GPUs.
How Token Batching Actually Works Under the Hood
Token batching is the single most impactful latency optimization, and the most dangerous if you don't understand the internals. The idea is simple: instead of sending one request at a time, you group multiple requests into a single batch. The LLM processes them in parallel, so the cost of reading the model weights from GPU memory is paid once per batch instead of once per request. But here's what the docs don't tell you: batching only works well if all requests in the batch have similar sequence lengths. If one request has a 10k-token context and the others have 100 tokens, the entire batch waits for the longest one. This is called the 'straggler problem.'
Under the hood, batching works by stacking the input tensors along the batch dimension, with shorter sequences padded to the length of the longest one. The model computes attention for all sequences simultaneously, but the memory and compute scale with the maximum sequence length in the batch. So a batch of 8 requests with lengths [100, 100, 100, 100, 100, 100, 100, 10000] effectively processes 8 requests of length 10000. Seven of those requests are now paying for 100x more tokens than they contain.
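To put a number on the padding cost, here is a small sketch that compares the tokens a padded batch actually contains with the tokens it is billed for. The function name `padding_cost` is ours, and the model assumed here is plain padded batching where every row is extended to the longest sequence.

```python
def padding_cost(lengths):
    """Return (useful_tokens, padded_tokens) for one padded batch."""
    longest = max(lengths)
    useful = sum(lengths)
    # Every row is padded out to the longest sequence in the batch.
    padded = longest * len(lengths)
    return useful, padded


lengths = [100] * 7 + [10_000]
useful, padded = padding_cost(lengths)
print(f"useful={useful} padded={padded} efficiency={useful / padded:.1%}")
```

For the example batch above, most of the padded slots are wasted on the seven short requests, which is why grouping by length matters.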
The solution is dynamic batching with length-aware grouping. Sort requests by token count, then batch similar-length requests together. Set a max batch size and a max context length per request. And always set a timeout per batch — if a batch takes longer than 2 seconds, drop it and process the requests individually.
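The grouping step can be sketched in a few lines. This is an illustration, not our production scheduler: the function name, the `(request_id, token_count)` tuple shape, and the `max_ratio` knob (how much longer the longest request in a batch may be than the shortest) are all assumptions made for the example.

```python
def length_aware_batches(requests, max_batch_size=8, max_context=4096, max_ratio=2.0):
    """Group (request_id, token_count) pairs into batches of similar length.

    Returns (batches, rejected). Requests over max_context are rejected so the
    caller can truncate or summarize them before retrying.
    """
    accepted = sorted((r for r in requests if r[1] <= max_context), key=lambda r: r[1])
    rejected = [r for r in requests if r[1] > max_context]

    batches, current = [], []
    for req in accepted:
        too_full = len(current) >= max_batch_size
        # current[0] is the shortest request in the batch because input is sorted.
        too_uneven = bool(current) and req[1] > max_ratio * max(current[0][1], 1)
        if too_full or too_uneven:
            batches.append(current)
            current = []
        current.append(req)
    if current:
        batches.append(current)
    return batches, rejected


reqs = [("a", 100), ("b", 120), ("c", 9000), ("d", 110), ("e", 5000)]
batches, rejected = length_aware_batches(reqs, max_context=8000)
print(batches)
print(rejected)
```

The per-batch timeout is deliberately left out; it belongs in whatever loop executes the batches, where a slow batch can be abandoned and its requests retried individually.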
Speculative Decoding: When to Guess and When to Wait
Speculative decoding is a technique where you use a small, fast 'draft' model to generate candidate tokens, and then the large 'target' model verifies them in parallel. If the draft model is correct, you get multiple tokens for the cost of one verification step. In theory, you can cut latency by 2-3x. In practice, it's more like 1.5x — and only if the draft model is accurate enough.
The key metric is the 'acceptance rate' — the fraction of draft tokens that the target model accepts. If the acceptance rate is below 50%, the overhead of running both models outweighs the benefit. We saw this happen when we used a 7B draft model with a 70B target model on a code generation task. The draft model was too small to understand the code context, so it guessed wrong most of the time. The acceptance rate was 30%, and latency actually increased by 20%.
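A back-of-the-envelope model helps explain why a low acceptance rate hurts. The sketch below uses the standard simplification that each draft token is accepted independently with probability `alpha`, with `k` draft tokens proposed per step. The `draft_cost_ratio` parameter (cost of one draft pass relative to one target pass) is an assumption you would have to measure on your own hardware, and the values in the example are illustrative rather than our measurements.

```python
def expected_tokens_per_step(alpha, k):
    """Expected tokens emitted per target-model verification pass.

    alpha: probability that a draft token is accepted (assumed independent)
    k:     number of draft tokens proposed per step
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def estimated_speedup(alpha, k, draft_cost_ratio):
    """Rough speedup over plain decoding; below 1.0 means a slowdown."""
    cost_per_step = k * draft_cost_ratio + 1  # k draft passes + 1 target pass
    return expected_tokens_per_step(alpha, k) / cost_per_step


for alpha in (0.3, 0.5, 0.7):
    print(alpha, round(estimated_speedup(alpha, k=4, draft_cost_ratio=0.2), 2))
```

Under these assumptions a 30% acceptance rate comes out slower than not speculating at all, while 70% gives a clear win. The exact crossover depends on `k` and on how cheap the draft model really is.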
The fix was to use a larger draft model (13B) and fine-tune it on the same data distribution as the target model. Acceptance rate jumped to 70%, and we saw a 2x latency improvement. But there's a catch: speculative decoding adds complexity to your serving stack. You need to manage two models, handle the draft-verify loop, and deal with the case where a draft token is rejected (the target model's own token replaces the first rejected one, and the rest of the draft is thrown away).
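The draft-verify loop is easier to reason about with a toy version. The sketch below implements greedy speculative decoding, where the two models are plain Python callables that map a token list to the next token. These callables, and the function name, are stand-ins for illustration. A real server scores all draft positions in one batched forward pass of the target model, which is what `target_passes` counts here.

```python
def speculative_decode(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Greedy speculative decoding; returns (tokens, target_passes)."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The draft model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))

        # 2. The target model checks each position (one batched pass in practice).
        target_passes += 1
        accepted = []
        for i in range(k):
            expected = target_next(tokens + draft[:i])
            if draft[i] == expected:
                accepted.append(draft[i])
            else:
                accepted.append(expected)  # correction replaces the bad guess
                break
        else:
            # All k guesses matched, so the same pass yields one bonus token.
            accepted.append(target_next(tokens + draft))
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens], target_passes


def target(toks):
    return (toks[-1] + 1) % 10          # toy target: count upward mod 10


def draft(toks):
    return 0 if toks[-1] == 5 else (toks[-1] + 1) % 10  # wrong after a 5


out, passes = speculative_decode(target, draft, prompt=[0], max_new_tokens=12)
print(out, passes)
```

With greedy verification the output is identical to what the target model would have produced alone. Only the number of target passes changes.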
Prompt Compression: Cutting Context Without Cutting Accuracy
Every token in your prompt costs compute. A 4k-token prompt takes at least 4x longer to process than a 1k-token prompt. The obvious fix is to send less context. But how do you decide what to cut? The naive approach is to truncate and keep only the most recent tokens, but that throws away instructions and context that sit at the beginning of the prompt.
We learned this the hard way. We were building a customer support chatbot that included the full conversation history in every request. The history was growing to 10k tokens over a session. We truncated to the last 2k tokens, but the model started forgetting the customer's original issue. Accuracy dropped by 23%.
The fix was prompt compression: we used a smaller LLM to summarize the conversation history into a 500-token summary, then appended that to the prompt. The summarization model was cheap (a 7B model) and ran asynchronously. Total latency dropped by 40% because the main model had less context to process. But we had to be careful: the summarization model sometimes hallucinated details, leading to incorrect responses. We added a validation step that checked the summary against the original history for factual consistency.
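A minimal sketch of the assembly and validation steps, under some loud assumptions: `count_tokens` is a crude whitespace counter standing in for a real tokenizer, the summary is passed in as a string (the summarization call itself is not shown), and the validation is only a cheap guard that flags numbers in the summary that never appear in the original history. It is not the full consistency check described above.

```python
import re


def count_tokens(text):
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def build_prompt(system, summary, turns, budget):
    """Keep the system prompt and summary, then the newest turns that fit."""
    summary_line = f"Summary of earlier conversation: {summary}"
    used = count_tokens(system) + count_tokens(summary_line)
    if used > budget:
        raise ValueError("system prompt and summary alone exceed the budget")
    kept = []
    for turn in reversed(turns):          # walk from newest to oldest
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    kept.reverse()                        # restore chronological order
    return "\n".join([system, summary_line, *kept])


def unsupported_numbers(summary, history):
    """Numbers (order IDs, amounts) in the summary that the history never mentions."""
    known = set(re.findall(r"\d+", history))
    return [n for n in re.findall(r"\d+", summary) if n not in known]


turns = ["user: hello there", "agent: hi how can I help", "user: where is my refund"]
prompt = build_prompt(
    "You are a support agent.",
    "Customer reports order 4417 arrived damaged.",
    turns,
    budget=26,
)
print(prompt)
print(unsupported_numbers("order 4417 and ticket 9999", "my order 4417 is broken"))
```

A summary that fails the check can be regenerated or dropped in favor of plain truncation for that request.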
KV Cache Optimization: The Memory Hog You Didn't Notice
The KV cache is a hidden memory sink in LLM inference. Every time the model generates a token, it stores the key-value pairs from the attention computation so it doesn't have to recompute them. This cache grows linearly with sequence length: a 4k-token sequence uses 4x more cache than a 1k-token sequence. For a 70B model with FP16 precision, a 4k-token sequence can consume around 2GB of cache, depending on how many key-value heads the model uses. Now multiply that by the number of concurrent users.
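The size of the cache for one sequence follows from the model's shape, and writing the formula down makes the scaling with sequence length explicit. The configuration values in the example (80 layers, head dimension 128, and either 8 or 64 key-value heads) are illustrative assumptions for a 70B-class model, not figures taken from our deployment.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2, batch=1):
    """Bytes of KV cache for `batch` sequences of `seq_len` tokens.

    The leading 2 accounts for one key tensor and one value tensor per layer.
    bytes_per_elem=2 corresponds to FP16.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch


GIB = 1024 ** 3
for kv_heads in (8, 64):  # grouped-query vs full multi-head attention (assumed)
    size = kv_cache_bytes(4096, n_layers=80, n_kv_heads=kv_heads, head_dim=128)
    print(f"kv_heads={kv_heads}: {size / GIB:.2f} GiB per 4k-token sequence")
```

The number of key-value heads changes the result by almost an order of magnitude, so it is worth plugging in your own model's configuration before sizing a per-session cap.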
We hit this wall during a Black Friday sale. Our chatbot was handling 10x normal traffic, and the KV cache was growing unbounded. The server ran out of memory, and the model started returning empty responses. The on-call engineer saw a spike in 'CUDA out of memory' errors.
The fix was threefold: (1) Set a max cache size per session (e.g., 2GB). (2) Implement a least-recently-used (LRU) eviction policy for stale sessions. (3) Use PagedAttention, which stores the KV cache in non-contiguous blocks, reducing fragmentation. PagedAttention alone cut memory usage by 60% in our case.
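The first two parts of that fix amount to bookkeeping, which can be sketched with an `OrderedDict`. This only tracks sizes per session. The class name and byte-based accounting are assumptions for the example, and actually freeing GPU memory is left to the serving framework.

```python
from collections import OrderedDict


class SessionKVCache:
    """Tracks KV cache usage per session with a per-session cap and LRU eviction."""

    def __init__(self, max_total_bytes, max_session_bytes):
        self.max_total_bytes = max_total_bytes
        self.max_session_bytes = max_session_bytes
        self._sessions = OrderedDict()  # session_id -> bytes, least recent first
        self.total = 0

    def touch(self, session_id, size_bytes):
        """Record a session's current cache size; return the ids evicted to make room."""
        if size_bytes > self.max_session_bytes:
            raise ValueError(f"session {session_id} exceeds the per-session cap")
        # Remove and re-insert so the session moves to the most-recent end.
        self.total -= self._sessions.pop(session_id, 0)
        self._sessions[session_id] = size_bytes
        self.total += size_bytes

        evicted = []
        while self.total > self.max_total_bytes and len(self._sessions) > 1:
            old_id, old_size = self._sessions.popitem(last=False)
            self.total -= old_size
            evicted.append(old_id)
        return evicted


cache = SessionKVCache(max_total_bytes=100, max_session_bytes=60)
cache.touch("a", 40)
cache.touch("b", 40)
cache.touch("a", 40)         # "a" becomes the most recently used
print(cache.touch("c", 40))  # evicts the least recently used session
```

The caller is expected to release the evicted sessions' cache blocks and to treat the `ValueError` as a signal to truncate or summarize that session's context.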
When NOT to Optimize: The Case for Throwing Hardware at the Problem
Sometimes, the smartest latency optimization is to buy more GPUs. I know this sounds like heresy for an optimization article, but hear me out. There are scenarios where software optimizations add complexity, risk, and maintenance burden that outweigh the latency gains.
Example: You're running a 70B model for a low-traffic internal tool (100 req/day). The P99 is 5s, which is acceptable for the use case. You could spend two weeks implementing speculative decoding, prompt compression, and KV cache tuning. Or you could just upgrade from an A100 to an H100 and cut latency by 40% in one afternoon. The H100 costs more, but your engineering time is not free.
Another example: You're building a prototype that needs to ship in a week. Don't waste time on batching logic and cache eviction policies. Use a smaller model (e.g., GPT-4o-mini instead of GPT-4) and enable streaming. That's a 10x latency improvement with almost no code changes.
The rule of thumb: if your traffic is below 1000 req/day, hardware upgrades are almost always cheaper than software optimizations. Above 10k req/day, software optimizations become essential because the GPU cost scales linearly with traffic.
Common Mistakes with Specific Examples
Let's talk about the mistakes we've made so you don't have to. These are the patterns that look good on paper but fail in production.
Mistake 1: Batching without length awareness. We covered this earlier. A single long request can ruin the batch. The fix is simple: sort by length before batching, and set a max context length.
Mistake 2: Enabling streaming but not handling cancellation. Streaming is great for perceived latency, but if the user cancels a request mid-stream, you need to stop the generation. Otherwise, the model keeps generating tokens that nobody reads, wasting compute. We saw this when a user clicked 'cancel' on a search, but the model continued generating for another 3 seconds. The fix was to use asyncio task cancellation: cancel the task that owns the stream and let the resulting CancelledError propagate down to the LLM call so it stops generating.
Mistake 3: Using a draft model that's too small for speculative decoding. A 7B draft model on a 70B target model rarely works. The acceptance rate is too low. Use a 13B or 30B draft model, and fine-tune it on your data.
Mistake 4: Not monitoring cache hit rate. The KV cache is useless if you're evicting sessions too aggressively. We had a 20% cache hit rate because our eviction policy was time-based (evict after 5 minutes). Users were starting new sessions every 3 minutes. Switched to LRU with a size limit, and hit rate jumped to 80%.
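For Mistake 2, the cancellation path looks roughly like the sketch below. `fake_token_stream` is a stand-in for a streaming LLM client, and in a real handler the cleanup branch would close the upstream connection so the provider stops generating.

```python
import asyncio


async def fake_token_stream(n_tokens, delay=0.01):
    """Stand-in for a streaming LLM call: yields one token every `delay` seconds."""
    for i in range(n_tokens):
        await asyncio.sleep(delay)
        yield f"tok{i}"


async def stream_to_client(received, n_tokens=1000):
    try:
        async for tok in fake_token_stream(n_tokens):
            received.append(tok)  # in a real handler: write to the response
    except asyncio.CancelledError:
        # Cleanup goes here (close the upstream request), then re-raise so
        # the task is actually marked as cancelled.
        raise


async def demo():
    received = []
    task = asyncio.create_task(stream_to_client(received))
    await asyncio.sleep(0.05)     # the user clicks 'cancel' after 50 ms
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return received, task.cancelled()


received, was_cancelled = asyncio.run(demo())
print(len(received), was_cancelled)
```

The important detail is re-raising `CancelledError` after cleanup. Swallowing it leaves the task looking like it finished normally, and the generation may keep running.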
Comparison vs Alternatives: Batching, Streaming, or Both?
You have two main tools for managing latency: batching and streaming. Batching raises throughput by processing several requests in one forward pass, but increases the latency of individual requests (because they wait for the batch to fill). Streaming reduces perceived latency by showing tokens as they're generated, but doesn't reduce total generation time.
Which one should you use? It depends on your use case. For chatbots, streaming is non-negotiable — users expect to see tokens appear as they're generated. For batch processing (e.g., summarizing a batch of documents), batching is better because you don't need real-time output.
But you can combine both: batch multiple streaming requests together. This is called 'dynamic batching with streaming'. It's complex to implement but gives you the best of both worlds. We use this pattern in production: we batch up to 8 streaming requests, process them together, and stream the results back to each user. Latency dropped by 50% compared to non-batched streaming.
Debugging and Monitoring LLM Latency in Production
You can't optimize what you can't measure. We track five key metrics for LLM latency: time-to-first-token (TTFT), tokens per second (TPS), batch completion time, cache hit rate, and speculative acceptance rate. Each tells a different story.
TTFT measures how long it takes the model to start generating. High TTFT usually means the prompt is too long or the KV cache is cold. TPS measures generation speed. Low TPS could mean the model is too large, quantization is too aggressive, or you're hitting rate limits.
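Both metrics can be measured by wrapping whatever iterator yields your tokens. The sketch below is library-agnostic. The function name and the injectable `clock` parameter are our own choices, and TPS is computed over the tokens that follow the first one so that prompt processing time does not drag it down.

```python
import time


def measure_stream(token_iter, clock=time.perf_counter):
    """Consume a token iterator and return TTFT and tokens-per-second."""
    start = clock()
    first = None
    count = 0
    for _ in token_iter:
        now = clock()
        if first is None:
            first = now          # first token arrived
        count += 1
    end = clock()

    if first is None:
        return {"ttft_s": None, "tokens": 0, "tps": 0.0}
    gen_time = end - first
    tps = (count - 1) / gen_time if count > 1 and gen_time > 0 else 0.0
    return {"ttft_s": first - start, "tokens": count, "tps": tps}


def slow_tokens():
    time.sleep(0.05)             # simulated prompt processing
    for tok in ("a", "b", "c"):
        yield tok
        time.sleep(0.01)         # simulated per-token generation


print(measure_stream(slow_tokens()))
```

Passing a fake `clock` makes the function easy to unit test without sleeping.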
We use OpenTelemetry to instrument every LLM call. Each span includes the model name, prompt length, response length, latency breakdown (TTFT vs generation), and any errors. We alert on P99 latency exceeding 5s and cache hit rate dropping below 60%.
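The two alert conditions reduce to a percentile and a ratio. This sketch uses only the standard library and a nearest-rank percentile. It does not show the OpenTelemetry wiring, and the function names are illustrative. The thresholds are the ones quoted above.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile (pct in 0-100) of a non-empty list."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def check_alerts(latencies_s, cache_hits, cache_lookups,
                 p99_limit_s=5.0, min_hit_rate=0.60):
    """Return a list of human-readable alert strings (empty if healthy)."""
    alerts = []
    p99 = percentile(latencies_s, 99)
    if p99 > p99_limit_s:
        alerts.append(f"P99 latency {p99:.1f}s exceeds {p99_limit_s:.1f}s")
    hit_rate = cache_hits / cache_lookups if cache_lookups else 0.0
    if hit_rate < min_hit_rate:
        alerts.append(f"cache hit rate {hit_rate:.0%} below {min_hit_rate:.0%}")
    return alerts


# 98 fast requests and 2 stragglers: the mean looks fine, the P99 does not.
latencies = [1.0] * 98 + [12.0] * 2
print(sum(latencies) / len(latencies))
print(check_alerts(latencies, cache_hits=50, cache_lookups=100))
```

The example shows why averages hide the problem: two slow requests out of a hundred barely move the mean but put the P99 well over the limit.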
One thing we learned: don't rely on the LLM provider's metrics. They aggregate across all customers and don't show you the tail latencies. Instrument your own calls and log every request.
The Straggler That Killed Our Batch: A 12-Second P99 Lesson
- Always set a max context length per request. Truncate or summarize long histories before sending.
- Monitor batch completion time variance, not just average. A single straggler ruins the whole batch.
- Use separate queues for short and long requests. Don't let one slow user degrade everyone else's experience.
- curl -X GET http://your-service:8080/metrics | grep kv_cache_hit_rate
  If below 60%, your cache eviction policy is too aggressive or context lengths vary too much.
- nvidia-smi --query-gpu=memory.used --format=csv,noheader
  If memory keeps growing, you may have a memory leak in the KV cache. Look for sessions not being properly cleaned up.
- Check your client logs for HTTP 429 (rate limit) responses.
  If you're hitting limits, implement exponential backoff with jitter. Example: time.sleep(min(2 ** retry_count + random.uniform(0, 1), 60))
- curl -X GET http://your-service:8080/metrics | grep speculative_draft_acceptance_rate
  If below 50%, the draft model is too different from the main model. Consider a larger draft model or disabling speculation.
- curl -X GET http://your-service:8080/metrics | grep batch_size
- curl -X GET http://your-service:8080/metrics | grep batch_completion_time_avg
- export BATCH_SIZE=4 && systemctl restart llm-service
Key takeaways
Common mistakes to avoid
- Static batch sizing
- Speculative decoding on short prompts
- Full KV cache retention for all requests
- Prompt compression without validation
Interview Questions on This Topic
Explain how token batching works under the hood in transformer inference.