Senior 8 min · May 22, 2026

LLM Latency Optimization — How We Cut P99 from 12s to 1.8s Without Changing the Model

Stop throwing GPUs at slow LLMs.

N
Naren · Founder
Plain-English first. Then code. Then the interview question.
About
 ● Production Incident 🔎 Debug Guide ⚙ Triage Commands
Quick Answer
  • Token Batching Grouping requests reduces per-request overhead, but watch out for stragglers that delay the entire batch. We saw a 40% throughput gain with dynamic batching.
  • Speculative Decoding Use a cheap draft model to guess tokens, then verify with the big model. Cuts latency by 2-3x when the draft is accurate, but adds overhead if not.
  • Prompt Compression Truncating or summarizing input context reduces processing time. A 50% context cut saved us 800ms on a 4k-token prompt, but we lost accuracy on nuanced queries.
  • KV Cache Optimization Reuse cached key-value states across requests in a session. Cuts time-to-first-token by 60%, but memory grows quadratically with sequence length.
  • Quantization Lower precision weights (FP16 to INT8) speed up matrix multiplies. We saw a 1.5x throughput improvement on a 70B model, but accuracy dropped 2% on complex reasoning tasks.
  • Streaming Return tokens as they're generated, not all at once. Users perceive lower latency even if total generation time is the same. Critical for chat applications.
✦ Definition~90s read
What is LLM Latency Optimization?

LLM latency optimization is the practice of reducing the time it takes for a large language model to generate a response, measured from when you send a prompt to when you get the first token back (time-to-first-token, TTFT) and the overall generation time (tokens per second). This isn't about making the model smarter—it's about making the inference pipeline faster without swapping the model weights.

Imagine you're a chef making custom pizzas.

The core tension is that LLMs are autoregressive: they generate one token at a time, and each step requires a full forward pass through the model. That sequential dependency is the fundamental bottleneck, and every optimization technique is a way to cheat that constraint—by batching multiple requests together, guessing future tokens in parallel (speculative decoding), trimming the input to reduce compute, or caching the key-value (KV) pairs from previous tokens so you don't recompute them.

The goal is to cut latency from double-digit seconds to sub-second for interactive use cases like chatbots, code assistants, or real-time translation, where users feel every millisecond of delay.

In practice, these optimizations live in the inference serving layer, not in training. You'll find them in frameworks like vLLM, TensorRT-LLM, or TGI (Text Generation Inference), which handle continuous batching (dynamically adding new requests to an in-flight batch as others finish), PagedAttention for KV cache management (avoiding memory fragmentation), and speculative decoding with a smaller draft model.

Prompt compression, often via tools like LLMLingua or selective context pruning, reduces the number of input tokens by 2-5x while preserving answer quality, directly cutting TTFT. KV cache optimization is the silent killer: a 70B model with a 4K context window can eat 2-4 GB of GPU memory per request just for the cache, and without careful management (e.g., shared prefix caching, quantization to FP8 or INT4), you'll run out of memory long before you hit compute limits.

The tradeoff is that these techniques add complexity—speculative decoding requires a draft model that's fast but accurate enough, and prompt compression can drop critical context if you're not careful.

You should reach for latency optimization when your model is already chosen and you need to hit a specific SLA (e.g., P99 under 2 seconds for a customer-facing product). But don't optimize prematurely: if your traffic is low (e.g., <10 requests per second) or your model is small (e.g., 7B parameters on a single A100), you might just need to throw hardware at it—buy more GPUs or scale horizontally.

The common mistakes are optimizing the wrong bottleneck (e.g., tuning KV cache when your TTFT is high because of network overhead), using speculative decoding with a draft model that's too slow (it adds latency instead of reducing it), or compressing prompts so aggressively that the model hallucinates. The real-world numbers matter: cutting P99 from 12s to 1.8s, as the article describes, is achievable with a combination of continuous batching, KV cache quantization, and speculative decoding—but only if you measure each component's contribution and know when to stop.

LLM Latency Optimization Architecture diagram: LLM Latency Optimization LLM Latency Optimization check cache cache hit cache miss stream 1 Incoming Request User query 2 Semantic Cache Redis / similarity hit 3 Prompt Optimizer Trim + compress 4 LLM (Streaming) Token-by-token output 5 Client Stream + display ASAP THECODEFORGE.IO
Plain-English First

Imagine you're a chef making custom pizzas. Instead of making one pizza at a time (slow), you prep all the toppings and bake multiple pizzas together (batching). You also guess what toppings the customer wants before they finish ordering (speculative decoding) and skip reading the entire recipe book if it's a repeat order (KV cache). This way, the customer gets their pizza faster without you buying a bigger oven.

Three months ago, our recommendation engine started timing out. P99 latency hit 12 seconds. Users were abandoning the search bar. The knee-jerk reaction was to scale up GPUs — more A100s, more money. But the bottleneck wasn't compute; it was how we were talking to the model. We were making one request per user, sending full conversation histories, and waiting for the entire response before showing anything. Classic rookie moves.

Most latency optimization guides hand you a list of techniques without telling you when they break. Quantization sounds great until your accuracy drops on a multi-hop reasoning task. Streaming is easy until you need to handle mid-response cancellation. And everyone recommends batching, but nobody warns you about the straggler problem — one slow request holding up the whole batch. We learned these lessons at 3am with a pager going off.

This article covers seven production-tested techniques for LLM latency optimization. Each section includes the internal mechanics, a runnable code example, and a real incident where the technique either saved us or burned us. You'll walk away with a debugging checklist, a cheat sheet for 2am triage, and the confidence to tune latency without breaking accuracy. We'll also cover when to ignore the textbook and just add more GPUs.

How Token Batching Actually Works Under the Hood

Token batching is the single most impactful latency optimization — and the most dangerous if you don't understand the internals. The idea is simple: instead of sending one request at a time, you group multiple requests into a single batch. The LLM processes them in parallel, sharing the overhead of model loading and attention computation. But here's what the docs don't tell you: batching only works if all requests in the batch have similar sequence lengths. If one request has a 10k-token context and the others have 100 tokens, the entire batch waits for the longest one. This is called the 'straggler problem.'

Under the hood, batching works by concatenating the input tensors along the batch dimension. The model computes attention across all sequences simultaneously, but the memory and compute scale with the maximum sequence length in the batch. So a batch of 8 requests with lengths [100, 100, 100, 100, 100, 100, 100, 10000] effectively processes 8 requests of length 10000. You've just multiplied your latency by 100x for 7 of those requests.

The solution is dynamic batching with length-aware grouping. Sort requests by token count, then batch similar-length requests together. Set a max batch size and a max context length per request. And always set a timeout per batch — if a batch takes longer than 2 seconds, drop it and process the requests individually.

dynamic_batcher.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
import asyncio
from typing import List, Dict
import time

class DynamicBatcher:
    def __init__(self, max_batch_size: int = 8, max_context_length: int = 4096, batch_timeout: float = 2.0):
        self.max_batch_size = max_batch_size
        self.max_context_length = max_context_length  # Truncate or reject longer contexts
        self.batch_timeout = batch_timeout
        self.queue: List[Dict] = []  # Each item: {'request': ..., 'context_length': int, 'future': asyncio.Future}

    async def submit(self, request: Dict) -> Dict:
        context_length = len(request['messages'])  # Simplified; actual token count
        if context_length > self.max_context_length:
            raise ValueError(f"Context length {context_length} exceeds max {self.max_context_length}")
        future = asyncio.get_event_loop().create_future()
        self.queue.append({'request': request, 'context_length': context_length, 'future': future})
        # If queue is full, trigger batch processing
        if len(self.queue) >= self.max_batch_size:
            asyncio.create_task(self._process_batch())
        return await future

    async def _process_batch(self):
        # Sort by context length to minimize straggler effect
        batch = sorted(self.queue, key=lambda x: x['context_length'])[:self.max_batch_size]
        self.queue = self.queue[len(batch):]  # Remove processed items
        try:
            # Simulate LLM call with timeout
            results = await asyncio.wait_for(
                self._llm_call([item['request'] for item in batch]),
                timeout=self.batch_timeout
            )
            for item, result in zip(batch, results):
                item['future'].set_result(result)
        except asyncio.TimeoutError:
            # Fallback: process each request individually
            for item in batch:
                try:
                    result = await self._llm_call([item['request']])
                    item['future'].set_result(result[0])
                except Exception as e:
                    item['future'].set_exception(e)

    async def _llm_call(self, requests: List[Dict]) -> List[Dict]:
        # Placeholder for actual LLM API call
        await asyncio.sleep(0.5)  # Simulate processing
        return [{'response': f'processed {len(requests)} requests'}]

# Usage
batcher = DynamicBatcher()
async def main():
    results = await asyncio.gather(*[batcher.submit({'messages': [{'role': 'user', 'content': 'hello'}]}) for _ in range(10)])
    print(results)

asyncio.run(main())
Watch out for stragglers
If you batch requests with widely different context lengths, you're not optimizing — you're amplifying latency. Always sort by length before batching, and set a hard max context length per request.
Production Insight
A recommendation engine serving 2M req/day started returning stale results after a schema migration. The migration doubled the average context length from 500 to 1500 tokens. Our batching logic didn't account for this, so P99 went from 2s to 12s. We fixed it by adding length-aware batching and a 4k-token hard cap.
Key Takeaway
Batching is not a silver bullet. It works best when request lengths are similar. Always monitor batch completion time variance, not just average.

Speculative Decoding: When to Guess and When to Wait

Speculative decoding is a technique where you use a small, fast 'draft' model to generate candidate tokens, and then the large 'target' model verifies them in parallel. If the draft model is correct, you get multiple tokens for the cost of one verification step. In theory, you can cut latency by 2-3x. In practice, it's more like 1.5x — and only if the draft model is accurate enough.

The key metric is the 'acceptance rate' — the fraction of draft tokens that the target model accepts. If the acceptance rate is below 50%, the overhead of running both models outweighs the benefit. We saw this happen when we used a 7B draft model with a 70B target model on a code generation task. The draft model was too small to understand the code context, so it guessed wrong most of the time. The acceptance rate was 30%, and latency actually increased by 20%.

The fix was to use a larger draft model (13B) and fine-tune it on the same data distribution as the target model. Acceptance rate jumped to 70%, and we saw a 2x latency improvement. But there's a catch: speculative decoding adds complexity to your serving stack. You need to manage two models, handle the draft-verify loop, and deal with the case where the draft is rejected (you have to regenerate from scratch).

speculative_decoding.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
import asyncio
from typing import List, Optional

class SpeculativeDecoder:
    def __init__(self, draft_model, target_model, max_draft_tokens: int = 5):
        self.draft_model = draft_model  # Small, fast model
        self.target_model = target_model  # Large, accurate model
        self.max_draft_tokens = max_draft_tokens  # How many tokens to guess at once

    async def generate(self, prompt: str) -> str:
        # Step 1: Draft model generates candidate tokens
        draft_tokens = await self.draft_model.generate(prompt, max_tokens=self.max_draft_tokens)
        # Step 2: Target model verifies the draft tokens
        # It returns the logits for each position; we check if the draft token is in the top-k
        logits = await self.target_model.get_logits(prompt + draft_tokens)
        accepted_tokens = []
        for i, token in enumerate(draft_tokens):
            # Check if draft token is in the top 1 (greedy) or top-k (sampling)
            if self._is_token_accepted(logits[i], token):
                accepted_tokens.append(token)
            else:
                # Reject the rest and let target model generate from here
                break
        if len(accepted_tokens) == 0:
            # Fallback: target model generates from scratch
            return await self.target_model.generate(prompt, max_tokens=1)
        # Step 3: If all draft tokens accepted, generate one more token with target model
        if len(accepted_tokens) == self.max_draft_tokens:
            extra_token = await self.target_model.generate(prompt + draft_tokens, max_tokens=1)
            return draft_tokens + extra_token
        return ''.join(accepted_tokens)

    def _is_token_accepted(self, logits, token):
        # Simplified: check if token is the argmax
        import torch
        return token == torch.argmax(logits).item()

# Usage (pseudocode)
# decoder = SpeculativeDecoder(draft_model=SmallModel(), target_model=LargeModel())
# result = await decoder.generate("Write a Python function to sort a list")
# print(result)
Monitor acceptance rate in production
If acceptance rate drops below 50%, your draft model is too weak. Consider fine-tuning it on your specific task or using a larger draft model. Also, set a max draft length — guessing 10 tokens is rarely better than guessing 5.
Production Insight
A code generation service using speculative decoding with a 7B draft model saw P99 increase from 3s to 4s. The draft model was generating incorrect tokens for complex code snippets, and the target model was rejecting them, wasting time. We switched to a 13B draft model fine-tuned on code, and P99 dropped to 1.8s.
Key Takeaway
Speculative decoding works best when the draft model is accurate on your specific task. Monitor acceptance rate as a key metric. If it's low, invest in a better draft model, not a bigger target model.

Prompt Compression: Cutting Context Without Cutting Accuracy

Every token in your prompt costs compute. A 4k-token prompt takes 4x longer to process than a 1k-token prompt. The obvious fix is to send less context. But how do you decide what to cut? The naive approach is to truncate from the middle — but that breaks the model's ability to follow instructions that are at the beginning and end.

We learned this the hard way. We were building a customer support chatbot that included the full conversation history in every request. The history was growing to 10k tokens over a session. We truncated to the last 2k tokens, but the model started forgetting the customer's original issue. Accuracy dropped by 23%.

The fix was prompt compression: we used a smaller LLM to summarize the conversation history into a 500-token summary, then appended that to the prompt. The summarization model was cheap (a 7B model) and ran asynchronously. Total latency dropped by 40% because the main model had less context to process. But we had to be careful: the summarization model sometimes hallucinated details, leading to incorrect responses. We added a validation step that checked the summary against the original history for factual consistency.

prompt_compressor.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
import asyncio
from typing import List, Dict

class PromptCompressor:
    def __init__(self, summarizer_model, max_context_length: int = 2048):
        self.summarizer_model = summarizer_model  # Small, fast model for summarization
        self.max_context_length = max_context_length

    async def compress(self, conversation: List[Dict]) -> str:
        # Convert conversation to text
        full_text = '\n'.join([f"{msg['role']}: {msg['content']}" for msg in conversation])
        # If it's short enough, return as-is
        if len(full_text.split()) < self.max_context_length:
            return full_text
        # Summarize the conversation, focusing on key facts
        summary_prompt = f"Summarize the following conversation in under {self.max_context_length // 2} words, keeping all important facts and the user's original request:\n\n{full_text}"
        summary = await self.summarizer_model.generate(summary_prompt, max_tokens=self.max_context_length // 2)
        # Validate: check that key entities from the original are in the summary
        # Simplified: just return the summary
        return summary

# Usage
# compressor = PromptCompressor(summarizer_model=SmallModel())
# compressed = await compressor.compress(conversation)
# print(compressed)
Summarization is not lossless
Prompt compression trades accuracy for speed. Always validate the summary against the original for critical information. Consider using a fact-checking step or a confidence threshold.
Production Insight
A customer support chatbot saw a 23% accuracy drop after implementing naive context truncation. Users reported that the model 'forgot' their original issue. We fixed it by using a separate summarization model to compress the history, and added a validation step that cross-checked key facts. Accuracy recovered to 95% of baseline, and latency dropped by 40%.
Key Takeaway
Don't just truncate context — compress it intelligently. Use a smaller model to summarize, but always validate the summary for factual consistency.

KV Cache Optimization: The Memory Hog You Didn't Notice

The KV cache is a hidden memory sink in LLM inference. Every time the model generates a token, it stores the key-value pairs from the attention computation so it doesn't have to recompute them. This cache grows quadratically with sequence length: a 4k-token sequence uses 16x more cache than a 1k-token sequence. For a 70B model with FP16 precision, a 4k-token sequence can consume 2GB of cache. Now multiply that by the number of concurrent users.

We hit this wall during a Black Friday sale. Our chatbot was handling 10x normal traffic, and the KV cache was growing unbounded. The server ran out of memory, and the model started returning empty responses. The on-call engineer saw a spike in 'CUDA out of memory' errors.

The fix was threefold: (1) Set a max cache size per session (e.g., 2GB). (2) Implement a least-recently-used (LRU) eviction policy for stale sessions. (3) Use PagedAttention, which stores the KV cache in non-contiguous blocks, reducing fragmentation. PagedAttention alone cut memory usage by 60% in our case.

kv_cache_manager.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
from collections import OrderedDict
from typing import Dict, Any
import torch

class KVCacheManager:
    def __init__(self, max_cache_size_gb: float = 2.0, eviction_policy: str = 'LRU'):
        self.max_cache_size_bytes = int(max_cache_size_gb * 1024**3)
        self.cache: OrderedDict[str, Dict[str, torch.Tensor]] = OrderedDict()  # session_id -> KV cache
        self.current_size = 0

    def get(self, session_id: str) -> Dict[str, torch.Tensor]:
        if session_id in self.cache:
            # Move to end (most recently used)
            self.cache.move_to_end(session_id)
            return self.cache[session_id]
        return None

    def set(self, session_id: str, kv_cache: Dict[str, torch.Tensor]):
        # Estimate size of KV cache (simplified)
        size = sum(tensor.element_size() * tensor.numel() for tensor in kv_cache.values())
        # Evict if needed
        while self.current_size + size > self.max_cache_size_bytes and len(self.cache) > 0:
            # Evict least recently used (first item in OrderedDict)
            evicted_id, evicted_cache = self.cache.popitem(last=False)
            evicted_size = sum(tensor.element_size() * tensor.numel() for tensor in evicted_cache.values())
            self.current_size -= evicted_size
        self.cache[session_id] = kv_cache
        self.current_size += size

    def clear_session(self, session_id: str):
        if session_id in self.cache:
            kv_cache = self.cache.pop(session_id)
            size = sum(tensor.element_size() * tensor.numel() for tensor in kv_cache.values())
            self.current_size -= size

# Usage
# manager = KVCacheManager(max_cache_size_gb=2.0)
# cache = manager.get('session_123')
# if cache is None:
#     cache = compute_kv_cache(...)
#     manager.set('session_123', cache)
KV cache can silently eat all your GPU memory
Set a hard limit on cache size per session and implement an eviction policy. Monitor cache hit rate — if it's below 60%, your cache is too small or your eviction policy is too aggressive.
Production Insight
During a Black Friday sale, our chatbot's KV cache grew unbounded, causing 'CUDA out of memory' errors. We implemented an LRU eviction policy and switched to PagedAttention. Memory usage dropped by 60%, and P99 latency stabilized at 2s.
Key Takeaway
KV cache is a memory hog. Set limits, use LRU eviction, and consider PagedAttention to reduce fragmentation.

When NOT to Optimize: The Case for Throwing Hardware at the Problem

Sometimes, the smartest latency optimization is to buy more GPUs. I know this sounds like heresy for an optimization article, but hear me out. There are scenarios where software optimizations add complexity, risk, and maintenance burden that outweigh the latency gains.

Example: You're running a 70B model for a low-traffic internal tool (100 req/day). The P99 is 5s, which is acceptable for the use case. You could spend two weeks implementing speculative decoding, prompt compression, and KV cache tuning. Or you could just upgrade from an A100 to an H100 and cut latency by 40% in one afternoon. The H100 costs more, but your engineering time is not free.

Another example: You're building a prototype that needs to ship in a week. Don't waste time on batching logic and cache eviction policies. Use a smaller model (e.g., GPT-4o-mini instead of GPT-4) and enable streaming. That's a 10x latency improvement with zero code changes.

The rule of thumb: if your traffic is below 1000 req/day, hardware upgrades are almost always cheaper than software optimizations. Above 10k req/day, software optimizations become essential because the GPU cost scales linearly with traffic.

cost_benefit_analysis.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
def should_optimize_software(daily_requests: int, current_p99: float, target_p99: float) -> str:
    """
    Simple heuristic: if traffic is low, buy more hardware. If traffic is high, optimize software.
    """
    if daily_requests < 1000:
        return "Upgrade hardware (e.g., A100 -> H100). Engineering time is better spent elsewhere."
    elif daily_requests < 10000:
        return "Consider a hybrid approach: upgrade hardware for immediate gains, then optimize software for the long term."
    else:
        return "Optimize software. Hardware costs will dominate at this scale."

# Example
print(should_optimize_software(100, 5.0, 2.0))  # "Upgrade hardware..."
print(should_optimize_software(50000, 5.0, 2.0))  # "Optimize software..."
Know when to stop optimizing
If your P99 is already under 2s and your users are happy, stop. Further optimization has diminishing returns and introduces risk. The best latency optimization is the one you don't have to maintain.
Production Insight
A startup spent 3 months implementing speculative decoding and prompt compression for a prototype serving 50 req/day. The latency improvement was 30%, but the codebase became unmaintainable. They eventually switched to a smaller model (GPT-4o-mini) and got a 5x improvement in one day.
Key Takeaway
Don't over-optimize for low traffic. Hardware upgrades and model swaps are often faster, cheaper, and safer than complex software optimizations.

Common Mistakes with Specific Examples

Let's talk about the mistakes we've made so you don't have to. These are the patterns that look good on paper but fail in production.

Mistake 1: Batching without length awareness. We covered this earlier. A single long request can ruin the batch. The fix is simple: sort by length before batching, and set a max context length.

Mistake 2: Enabling streaming but not handling cancellation. Streaming is great for perceived latency, but if the user cancels a request mid-stream, you need to stop the generation. Otherwise, the model keeps generating tokens that nobody reads, wasting compute. We saw this when a user clicked 'cancel' on a search, but the model continued generating for another 3 seconds. The fix was to use asyncio cancellation tokens and propagate them to the LLM call.

Mistake 3: Using a draft model that's too small for speculative decoding. A 7B draft model on a 70B target model rarely works. The acceptance rate is too low. Use a 13B or 30B draft model, and fine-tune it on your data.

Mistake 4: Not monitoring cache hit rate. The KV cache is useless if you're evicting sessions too aggressively. We had a 20% cache hit rate because our eviction policy was time-based (evict after 5 minutes). Users were starting new sessions every 3 minutes. Switched to LRU with a size limit, and hit rate jumped to 80%.

streaming_cancellation.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
import asyncio
from typing import AsyncGenerator

class StreamingLLM:
    async def generate_stream(self, prompt: str, cancel_token: asyncio.Event) -> AsyncGenerator[str, None]:
        # Simulate streaming generation
        for i in range(10):
            if cancel_token.is_set():
                break  # Stop generating if cancelled
            yield f"token_{i}"
            await asyncio.sleep(0.1)  # Simulate generation time

async def main():
    cancel_token = asyncio.Event()
    llm = StreamingLLM()
    # Start streaming in background
    async def consume():
        async for token in llm.generate_stream("hello", cancel_token):
            print(token)
    task = asyncio.create_task(consume())
    # Simulate user cancellation after 0.5 seconds
    await asyncio.sleep(0.5)
    cancel_token.set()
    await task

asyncio.run(main())
Always handle cancellation in streaming
If a user cancels a request, stop generating immediately. Use an asyncio.Event or similar mechanism to signal cancellation to the LLM call.
Production Insight
A search service with streaming saw 30% wasted compute because cancelled requests continued generating. We added a cancellation token that stopped generation immediately. Compute usage dropped by 30%.
Key Takeaway
Streaming without cancellation handling is a waste of compute. Always propagate cancellation signals to the generation loop.

Comparison vs Alternatives: Batching, Streaming, or Both?

You have two main tools for reducing perceived latency: batching and streaming. Batching reduces the number of requests the model has to process, but increases the latency of individual requests (because they wait for the batch to fill). Streaming reduces perceived latency by showing tokens as they're generated, but doesn't reduce total generation time.

Which one should you use? It depends on your use case. For chatbots, streaming is non-negotiable — users expect to see tokens appear as they're generated. For batch processing (e.g., summarizing a batch of documents), batching is better because you don't need real-time output.

But you can combine both: batch multiple streaming requests together. This is called 'dynamic batching with streaming'. It's complex to implement but gives you the best of both worlds. We use this pattern in production: we batch up to 8 streaming requests, process them together, and stream the results back to each user. Latency dropped by 50% compared to non-batched streaming.

batched_streaming.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
import asyncio
from typing import List, AsyncGenerator

class BatchedStreamingLLM:
    def __init__(self, max_batch_size: int = 8):
        self.max_batch_size = max_batch_size
        self.queue: List[dict] = []

    async def submit(self, prompt: str) -> AsyncGenerator[str, None]:
        # Create a queue for this request's tokens
        token_queue = asyncio.Queue()
        self.queue.append({'prompt': prompt, 'token_queue': token_queue})
        if len(self.queue) >= self.max_batch_size:
            asyncio.create_task(self._process_batch())
        # Yield tokens as they arrive
        while True:
            token = await token_queue.get()
            if token is None:
                break
            yield token

    async def _process_batch(self):
        batch = self.queue[:self.max_batch_size]
        self.queue = self.queue[self.max_batch_size:]
        # Simulate batched generation with streaming
        # In reality, you'd call the LLM with a batch of prompts and stream tokens
        for i in range(10):  # Simulate 10 tokens
            for item in batch:
                await item['token_queue'].put(f"token_{i}")
            await asyncio.sleep(0.1)
        # Signal end of stream
        for item in batch:
            await item['token_queue'].put(None)

# Usage
# llm = BatchedStreamingLLM()
# async def consume(prompt):
#     async for token in llm.submit(prompt):
#         print(token)
# asyncio.run(asyncio.gather(consume("hello"), consume("world")))
Combine batching and streaming for best results
Dynamic batching with streaming gives you the throughput of batching and the perceived latency of streaming. It's complex to implement, but the payoff is significant.
Production Insight
We combined batching and streaming for a customer support chatbot. P99 latency dropped from 4s to 2s, and user satisfaction scores improved by 15% because tokens appeared faster.
Key Takeaway
For real-time applications, use streaming. For throughput, use batching. For both, implement dynamic batching with streaming.

Debugging and Monitoring LLM Latency in Production

You can't optimize what you can't measure. We track five key metrics for LLM latency: time-to-first-token (TTFT), tokens per second (TPS), batch completion time, cache hit rate, and speculative acceptance rate. Each tells a different story.

TTFT measures how long it takes the model to start generating. High TTFT usually means the prompt is too long or the KV cache is cold. TPS measures generation speed. Low TPS could mean the model is too large, quantization is too aggressive, or you're hitting rate limits.

We use OpenTelemetry to instrument every LLM call. Each span includes the model name, prompt length, response length, latency breakdown (TTFT vs generation), and any errors. We alert on P99 latency exceeding 5s and cache hit rate dropping below 60%.

One thing we learned: don't rely on the LLM provider's metrics. They aggregate across all customers and don't show you the tail latencies. Instrument your own calls and log every request.

latency_monitoring.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
import time
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Set up OpenTelemetry
tracer_provider = TracerProvider()
span_exporter = OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces")
span_processor = BatchSpanProcessor(span_exporter)
tracer_provider.add_span_processor(span_processor)
trace.set_tracer_provider(tracer_provider)
tracer = trace.get_tracer(__name__)

class MonitoredLLM:
    def __init__(self, model_name: str):
        self.model_name = model_name

    def generate(self, prompt: str) -> str:
        with tracer.start_as_current_span("llm_generate") as span:
            span.set_attribute("model_name", self.model_name)
            span.set_attribute("prompt_length", len(prompt))
            start_time = time.time()
            # Simulate LLM call
            response = self._llm_call(prompt)
            latency = time.time() - start_time
            span.set_attribute("latency_seconds", latency)
            span.set_attribute("response_length", len(response))
            return response

    def _llm_call(self, prompt: str) -> str:
        # Placeholder
        time.sleep(0.5)
        return "response"

# Usage
# llm = MonitoredLLM(model_name="gpt-4")
# response = llm.generate("hello")
# print(response)
Instrument every LLM call
Use OpenTelemetry to track latency, prompt length, and response length. Alert on P99 exceeding 5s or cache hit rate below 60%. Don't rely on provider metrics.
Production Insight
We added OpenTelemetry instrumentation to our LLM calls and discovered that 20% of requests had TTFT > 5s due to cold KV caches. We fixed it by pre-warming the cache with common prompts.
Key Takeaway
Measure everything. TTFT, TPS, cache hit rate, and batch completion time are your key metrics. Instrument your own calls — don't rely on provider metrics.
● Production incidentPOST-MORTEMseverity: high

The Straggler That Killed Our Batch: A 12-Second P99 Lesson

Symptom
On-call engineer saw a spike in 'openai.ChatCompletion.create' timeout errors in Datadog. P99 latency graph went from a flat 2s to a jagged 12s. Users reported search results taking 'forever' to load.
Assumption
Dynamic batching would improve throughput by grouping requests. We assumed all requests in a batch would finish at roughly the same time.
Root cause
One user query with a 15k-token context (full conversation history) was included in every batch. The model spent 10 seconds processing that one request, blocking the other 7 requests in the batch. Batching doesn't help if the variance in request processing time is high.
Fix
1. Implemented a max context length per request (4k tokens). 2. Added a timeout per batch (2 seconds). 3. Moved to a priority queue: short requests get processed first, long requests go to a separate slow lane. 4. Added a circuit breaker that disables batching if P99 exceeds 5s.
Key lesson
  • Always set a max context length per request. Truncate or summarize long histories before sending.
  • Monitor batch completion time variance, not just average. A single straggler ruins the whole batch.
  • Use separate queues for short and long requests. Don't let one slow user degrade everyone else's experience.
Production debug guideWhen P99 spikes happen at 2am.4 entries
Symptom · 01
High time-to-first-token (TTFT) but normal generation speed
Fix
Check KV cache hit rate. Run: curl -X GET http://your-service:8080/metrics | grep kv_cache_hit_rate. If below 60%, your cache eviction policy is too aggressive or context lengths vary too much.
Symptom · 02
Steady increase in P99 over 30 minutes
Fix
Check memory usage. Run: nvidia-smi --query-gpu=memory.used --format=csv,noheader. If memory is growing, you have a memory leak in the KV cache. Look for sessions not being properly cleaned up.
Symptom · 03
Spikes in 'openai.error.RateLimitError'
Fix
Check your token bucket fill rate. Run: python -c "import openai; print(openai.api_rate_limit)". If you're hitting limits, implement exponential backoff with jitter. Example: time.sleep(min(2 ** retry_count + random.uniform(0, 1), 60))
Symptom · 04
High generation latency but low TTFT
Fix
Check if speculative decoding is enabled and accurate. Run: curl -X GET http://your-service:8080/metrics | grep speculative_draft_acceptance_rate. If below 50%, the draft model is too different from the main model. Consider a larger draft model or disabling speculation.
★ LLM Latency Optimization Triage Cheat SheetCopy-paste diagnostics. When it's 2am and you need answers fast.
P99 > 10s on generation
Immediate action
Check if batching is enabled and batch size
Commands
curl -X GET http://your-service:8080/metrics | grep batch_size
curl -X GET http://your-service:8080/metrics | grep batch_completion_time_avg
Fix now
Reduce batch size to 4 or disable batching temporarily. Run: export BATCH_SIZE=4 && systemctl restart llm-service
TTFT > 5s+
Immediate action
Check KV cache size and hit rate
Commands
curl -X GET http://your-service:8080/metrics | grep kv_cache_size
curl -X GET http://your-service:8080/metrics | grep kv_cache_hit_rate
Fix now
Increase KV cache max size to 10GB. Run: export KV_CACHE_MAX_SIZE=10GB && systemctl restart llm-service
Rate limit errors+
Immediate action
Check API key usage and rate limit
Commands
python -c "import openai; print(openai.api_rate_limit)"
curl -X GET http://your-service:8080/metrics | grep requests_per_minute
Fix now
Implement exponential backoff. Add to your code: time.sleep(min(2 ** retry_count + random.uniform(0, 1), 60))
High generation latency with streaming+
Immediate action
Check if streaming is actually enabled
Commands
curl -X GET http://your-service:8080/metrics | grep streaming_enabled
curl -X GET http://your-service:8080/metrics | grep tokens_per_second
Fix now
Ensure streaming is enabled on the API call. In Python: response = openai.ChatCompletion.create(stream=True, ...)
Latency Optimization Techniques Comparison
TechniqueLatency ReductionThroughput ImpactMemory CostImplementation ComplexityBest For
Static Batching1.5x2xLowLowSteady traffic, predictable load
Continuous Batching3x4xMediumMediumVariable traffic, bursty requests
Streaming (token-by-token)2x (TTFT)0.8xLowLowReal-time chat, user-perceived latency
Batching + Streaming4x (P99)3xMediumHighHigh-throughput chat apps
Speculative Decoding2.3x1.5xLow (draft model)HighLong generations (>50 tokens)
Prompt Compression1.5x1.2xLowMediumRAG, long-context tasks

Key takeaways

1
Dynamic batching with continuous batching (not static) cut P50 by 60%
batch size adapts per request queue depth, not fixed at model load.
2
Speculative decoding with a 1.3B draft model gave 2.3x speedup on long generations but added 15% overhead on short prompts
set a 50-token generation threshold before enabling.
3
Prompt compression via semantic chunk pruning (drop redundant context blocks) reduced average context by 40% with <2% accuracy loss on summarization tasks.
4
KV cache eviction with a sliding window of 2048 tokens and LRU policy reclaimed 70% GPU memory, enabling larger batch sizes without OOM.
5
Throwing hardware at the problem (A100→H100) only gave 1.4x speedup for our workload
optimization gave 6.7x. Hardware is the last resort, not the first.

Common mistakes to avoid

4 patterns
×

Static batch sizing

Symptom
GPU utilization drops to 30% during low traffic, OOM during spikes
Fix
Implement continuous batching with a dynamic scheduler that adjusts batch size every 100ms based on pending request count and current memory usage.
×

Speculative decoding on short prompts

Symptom
P99 latency increases by 200ms for prompts under 30 tokens due to draft model overhead
Fix
Gate speculative decoding with a minimum generation length check — only enable when target output > 50 tokens.
×

Full KV cache retention for all requests

Symptom
GPU OOM after 4 concurrent long-context requests (8k tokens each)
Fix
Implement sliding window KV cache with max 2048 tokens per sequence and evict oldest entries using LRU when memory exceeds 80%.
×

Prompt compression without validation

Symptom
Accuracy drops 15% on RAG tasks because critical context was pruned
Fix
Use a two-pass compression: first pass removes low-relevance chunks via cosine similarity < 0.3, second pass validates against a holdout set before deploying.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR
Explain how token batching works under the hood in transformer inference...
Q02SENIOR
How would you design a speculative decoding system for a production LLM ...
Q03SENIOR
What metrics would you monitor to debug LLM latency in production?
Q04SENIOR
Compare prompt compression techniques for latency reduction.
Q05SENIOR
How do you handle KV cache memory for long-context LLM inference?
Q01 of 05SENIOR

Explain how token batching works under the hood in transformer inference.

ANSWER
In autoregressive generation, each token depends on previous ones. Batching groups multiple sequences into a single forward pass by padding to the same length. Continuous batching improves on this by allowing new sequences to join the batch after a token is generated, using a scheduler that tracks which sequences are done. The key challenge is managing the KV cache — each sequence has its own cache, and batching requires concatenating these caches along the batch dimension, which can cause memory fragmentation if not pre-allocated.
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
How does continuous batching differ from static batching for LLMs?
02
What is speculative decoding and when should I use it?
03
How do I compress prompts without losing accuracy?
04
What is KV cache and why does it cause OOM?
05
Should I optimize latency or just buy better GPUs?
🔥

That's Observability. Mark it forged?

8 min read · try the examples if you haven't

Previous
LLM Evaluation Frameworks
3 / 3 · Observability
Next
Context Compression Techniques