Token Batching Grouping requests reduces per-request overhead, but watch out for stragglers that delay the entire batch. We saw a 40% throughput gain with dynamic batching.
Speculative Decoding Use a cheap draft model to guess tokens, then verify with the big model. Cuts latency by 2-3x when the draft is accurate, but adds overhead if not.
Prompt Compression Truncating or summarizing input context reduces processing time. A 50% context cut saved us 800ms on a 4k-token prompt, but we lost accuracy on nuanced queries.
KV Cache Optimization Reuse cached key-value states across requests in a session. Cuts time-to-first-token by 60%, but memory grows quadratically with sequence length.
Quantization Lower precision weights (FP16 to INT8) speed up matrix multiplies. We saw a 1.5x throughput improvement on a 70B model, but accuracy dropped 2% on complex reasoning tasks.
Streaming Return tokens as they're generated, not all at once. Users perceive lower latency even if total generation time is the same. Critical for chat applications.
✦ Definition~90s read
What is LLM Latency Optimization?
LLM latency optimization is the practice of reducing the time it takes for a large language model to generate a response, measured from when you send a prompt to when you get the first token back (time-to-first-token, TTFT) and the overall generation time (tokens per second). This isn't about making the model smarter—it's about making the inference pipeline faster without swapping the model weights.
★
Imagine you're a chef making custom pizzas.
The core tension is that LLMs are autoregressive: they generate one token at a time, and each step requires a full forward pass through the model. That sequential dependency is the fundamental bottleneck, and every optimization technique is a way to cheat that constraint—by batching multiple requests together, guessing future tokens in parallel (speculative decoding), trimming the input to reduce compute, or caching the key-value (KV) pairs from previous tokens so you don't recompute them.
The goal is to cut latency from double-digit seconds to sub-second for interactive use cases like chatbots, code assistants, or real-time translation, where users feel every millisecond of delay.
In practice, these optimizations live in the inference serving layer, not in training. You'll find them in frameworks like vLLM, TensorRT-LLM, or TGI (Text Generation Inference), which handle continuous batching (dynamically adding new requests to an in-flight batch as others finish), PagedAttention for KV cache management (avoiding memory fragmentation), and speculative decoding with a smaller draft model.
Prompt compression, often via tools like LLMLingua or selective context pruning, reduces the number of input tokens by 2-5x while preserving answer quality, directly cutting TTFT. KV cache optimization is the silent killer: a 70B model with a 4K context window can eat 2-4 GB of GPU memory per request just for the cache, and without careful management (e.g., shared prefix caching, quantization to FP8 or INT4), you'll run out of memory long before you hit compute limits.
The tradeoff is that these techniques add complexity—speculative decoding requires a draft model that's fast but accurate enough, and prompt compression can drop critical context if you're not careful.
You should reach for latency optimization when your model is already chosen and you need to hit a specific SLA (e.g., P99 under 2 seconds for a customer-facing product). But don't optimize prematurely: if your traffic is low (e.g., <10 requests per second) or your model is small (e.g., 7B parameters on a single A100), you might just need to throw hardware at it—buy more GPUs or scale horizontally.
The common mistakes are optimizing the wrong bottleneck (e.g., tuning KV cache when your TTFT is high because of network overhead), using speculative decoding with a draft model that's too slow (it adds latency instead of reducing it), or compressing prompts so aggressively that the model hallucinates. The real-world numbers matter: cutting P99 from 12s to 1.8s, as the article describes, is achievable with a combination of continuous batching, KV cache quantization, and speculative decoding—but only if you measure each component's contribution and know when to stop.
Plain-English First
Imagine you're a chef making custom pizzas. Instead of making one pizza at a time (slow), you prep all the toppings and bake multiple pizzas together (batching). You also guess what toppings the customer wants before they finish ordering (speculative decoding) and skip reading the entire recipe book if it's a repeat order (KV cache). This way, the customer gets their pizza faster without you buying a bigger oven.
Three months ago, our recommendation engine started timing out. P99 latency hit 12 seconds. Users were abandoning the search bar. The knee-jerk reaction was to scale up GPUs — more A100s, more money. But the bottleneck wasn't compute; it was how we were talking to the model. We were making one request per user, sending full conversation histories, and waiting for the entire response before showing anything. Classic rookie moves.
Most latency optimization guides hand you a list of techniques without telling you when they break. Quantization sounds great until your accuracy drops on a multi-hop reasoning task. Streaming is easy until you need to handle mid-response cancellation. And everyone recommends batching, but nobody warns you about the straggler problem — one slow request holding up the whole batch. We learned these lessons at 3am with a pager going off.
This article covers seven production-tested techniques for LLM latency optimization. Each section includes the internal mechanics, a runnable code example, and a real incident where the technique either saved us or burned us. You'll walk away with a debugging checklist, a cheat sheet for 2am triage, and the confidence to tune latency without breaking accuracy. We'll also cover when to ignore the textbook and just add more GPUs.
How Token Batching Actually Works Under the Hood
Token batching is the single most impactful latency optimization — and the most dangerous if you don't understand the internals. The idea is simple: instead of sending one request at a time, you group multiple requests into a single batch. The LLM processes them in parallel, sharing the overhead of model loading and attention computation. But here's what the docs don't tell you: batching only works if all requests in the batch have similar sequence lengths. If one request has a 10k-token context and the others have 100 tokens, the entire batch waits for the longest one. This is called the 'straggler problem.'
Under the hood, batching works by concatenating the input tensors along the batch dimension. The model computes attention across all sequences simultaneously, but the memory and compute scale with the maximum sequence length in the batch. So a batch of 8 requests with lengths [100, 100, 100, 100, 100, 100, 100, 10000] effectively processes 8 requests of length 10000. You've just multiplied your latency by 100x for 7 of those requests.
The solution is dynamic batching with length-aware grouping. Sort requests by token count, then batch similar-length requests together. Set a max batch size and a max context length per request. And always set a timeout per batch — if a batch takes longer than 2 seconds, drop it and process the requests individually.
dynamic_batcher.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
import asyncio
from typing importList, Dictimport time
classDynamicBatcher:
def__init__(self, max_batch_size: int = 8, max_context_length: int = 4096, batch_timeout: float = 2.0):
self.max_batch_size = max_batch_size
self.max_context_length = max_context_length # Truncate or reject longer contextsself.batch_timeout = batch_timeout
self.queue: List[Dict] = [] # Each item: {'request': ..., 'context_length': int, 'future': asyncio.Future}asyncdefsubmit(self, request: Dict) -> Dict:
context_length = len(request['messages']) # Simplified; actual token countif context_length > self.max_context_length:
raiseValueError(f"Context length {context_length} exceeds max {self.max_context_length}")
future = asyncio.get_event_loop().create_future()
self.queue.append({'request': request, 'context_length': context_length, 'future': future})
# If queue is full, trigger batch processingiflen(self.queue) >= self.max_batch_size:
asyncio.create_task(self._process_batch())
returnawait future
asyncdef_process_batch(self):
# Sort by context length to minimize straggler effect
batch = sorted(self.queue, key=lambda x: x['context_length'])[:self.max_batch_size]
self.queue = self.queue[len(batch):] # Remove processed itemstry:
# Simulate LLM call with timeout
results = await asyncio.wait_for(
self._llm_call([item['request'] for item in batch]),
timeout=self.batch_timeout
)
for item, result inzip(batch, results):
item['future'].set_result(result)
except asyncio.TimeoutError:
# Fallback: process each request individuallyfor item in batch:
try:
result = awaitself._llm_call([item['request']])
item['future'].set_result(result[0])
exceptExceptionas e:
item['future'].set_exception(e)
asyncdef_llm_call(self, requests: List[Dict]) -> List[Dict]:
# Placeholder for actual LLM API call
await asyncio.sleep(0.5) # Simulate processingreturn [{'response': f'processed {len(requests)} requests'}]
# Usage
batcher = DynamicBatcher()
asyncdefmain():
results = await asyncio.gather(*[batcher.submit({'messages': [{'role': 'user', 'content': 'hello'}]}) for _ inrange(10)])
print(results)
asyncio.run(main())
Watch out for stragglers
If you batch requests with widely different context lengths, you're not optimizing — you're amplifying latency. Always sort by length before batching, and set a hard max context length per request.
Production Insight
P99 latency spiked from 2s to 14s during peak traffic. Root cause: naive batching waited for full 64-token batches before inference, starving idle requests. Fix: implemented continuous batching with dynamic 4-token minimum and 200ms flush timeout, enabling partial batch processing.
Key Takeaway
Batching is not a silver bullet. It works best when request lengths are similar. Always monitor batch completion time variance, not just average.
thecodeforge.io
LLM Latency Optimization Pipeline
Llm Latency Optimization
Speculative Decoding: When to Guess and When to Wait
Speculative decoding is a technique where you use a small, fast 'draft' model to generate candidate tokens, and then the large 'target' model verifies them in parallel. If the draft model is correct, you get multiple tokens for the cost of one verification step. In theory, you can cut latency by 2-3x. In practice, it's more like 1.5x — and only if the draft model is accurate enough.
The key metric is the 'acceptance rate' — the fraction of draft tokens that the target model accepts. If the acceptance rate is below 50%, the overhead of running both models outweighs the benefit. We saw this happen when we used a 7B draft model with a 70B target model on a code generation task. The draft model was too small to understand the code context, so it guessed wrong most of the time. The acceptance rate was 30%, and latency actually increased by 20%.
The fix was to use a larger draft model (13B) and fine-tune it on the same data distribution as the target model. Acceptance rate jumped to 70%, and we saw a 2x latency improvement. But there's a catch: speculative decoding adds complexity to your serving stack. You need to manage two models, handle the draft-verify loop, and deal with the case where the draft is rejected (you have to regenerate from scratch).
speculative_decoding.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
import asyncio
from typing importList, OptionalclassSpeculativeDecoder:
def__init__(self, draft_model, target_model, max_draft_tokens: int = 5):
self.draft_model = draft_model # Small, fast model
self.target_model = target_model # Large, accurate model
self.max_draft_tokens = max_draft_tokens # How many tokens to guess at onceasyncdefgenerate(self, prompt: str) -> str:
# Step 1: Draft model generates candidate tokens
draft_tokens = awaitself.draft_model.generate(prompt, max_tokens=self.max_draft_tokens)
# Step 2: Target model verifies the draft tokens# It returns the logits for each position; we check if the draft token is in the top-k
logits = awaitself.target_model.get_logits(prompt + draft_tokens)
accepted_tokens = []
for i, token inenumerate(draft_tokens):
# Check if draft token is in the top 1 (greedy) or top-k (sampling)ifself._is_token_accepted(logits[i], token):
accepted_tokens.append(token)
else:
# Reject the rest and let target model generate from herebreakiflen(accepted_tokens) == 0:
# Fallback: target model generates from scratchreturnawaitself.target_model.generate(prompt, max_tokens=1)
# Step 3: If all draft tokens accepted, generate one more token with target modeliflen(accepted_tokens) == self.max_draft_tokens:
extra_token = awaitself.target_model.generate(prompt + draft_tokens, max_tokens=1)
return draft_tokens + extra_token
return''.join(accepted_tokens)
def_is_token_accepted(self, logits, token):
# Simplified: check if token is the argmaximport torch
return token == torch.argmax(logits).item()
# Usage (pseudocode)# decoder = SpeculativeDecoder(draft_model=SmallModel(), target_model=LargeModel())# result = await decoder.generate("Write a Python function to sort a list")# print(result)
Monitor acceptance rate in production
If acceptance rate drops below 50%, your draft model is too weak. Consider fine-tuning it on your specific task or using a larger draft model. Also, set a max draft length — guessing 10 tokens is rarely better than guessing 5.
Production Insight
A code generation service using speculative decoding with a 7B draft model saw P99 increase from 3s to 4s. The draft model was generating incorrect tokens for complex code snippets, and the target model was rejecting them, wasting time. We switched to a 13B draft model fine-tuned on code, and P99 dropped to 1.8s.
Key Takeaway
Speculative decoding works best when the draft model is accurate on your specific task. Monitor acceptance rate as a key metric. If it's low, invest in a better draft model, not a bigger target model.
Prompt Compression: Cutting Context Without Cutting Accuracy
Every token in your prompt costs compute. A 4k-token prompt takes 4x longer to process than a 1k-token prompt. The obvious fix is to send less context. But how do you decide what to cut? The naive approach is to truncate from the middle — but that breaks the model's ability to follow instructions that are at the beginning and end.
We learned this the hard way. We were building a customer support chatbot that included the full conversation history in every request. The history was growing to 10k tokens over a session. We truncated to the last 2k tokens, but the model started forgetting the customer's original issue. Accuracy dropped by 23%.
The fix was prompt compression: we used a smaller LLM to summarize the conversation history into a 500-token summary, then appended that to the prompt. The summarization model was cheap (a 7B model) and ran asynchronously. Total latency dropped by 40% because the main model had less context to process. But we had to be careful: the summarization model sometimes hallucinated details, leading to incorrect responses. We added a validation step that checked the summary against the original history for factual consistency.
prompt_compressor.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
import asyncio
from typing importList, DictclassPromptCompressor:
def__init__(self, summarizer_model, max_context_length: int = 2048):
self.summarizer_model = summarizer_model # Small, fast model for summarizationself.max_context_length = max_context_length
asyncdefcompress(self, conversation: List[Dict]) -> str:
# Convert conversation to text
full_text = '\n'.join([f"{msg['role']}: {msg['content']}"for msg in conversation])
# If it's short enough, return as-isiflen(full_text.split()) < self.max_context_length:
return full_text
# Summarize the conversation, focusing on key facts
summary_prompt = f"Summarize the following conversation in under {self.max_context_length // 2} words, keeping all important facts and the user's original request:\n\n{full_text}"
summary = awaitself.summarizer_model.generate(summary_prompt, max_tokens=self.max_context_length // 2)
# Validate: check that key entities from the original are in the summary# Simplified: just return the summaryreturn summary
# Usage# compressor = PromptCompressor(summarizer_model=SmallModel())# compressed = await compressor.compress(conversation)# print(compressed)
Summarization is not lossless
Prompt compression trades accuracy for speed. Always validate the summary against the original for critical information. Consider using a fact-checking step or a confidence threshold.
Production Insight
A customer support chatbot saw a 23% accuracy drop after implementing naive context truncation. Users reported that the model 'forgot' their original issue. We fixed it by using a separate summarization model to compress the history, and added a validation step that cross-checked key facts. Accuracy recovered to 95% of baseline, and latency dropped by 40%.
Key Takeaway
Don't just truncate context — compress it intelligently. Use a smaller model to summarize, but always validate the summary for factual consistency.
KV Cache Optimization: The Memory Hog You Didn't Notice
The KV cache is a hidden memory sink in LLM inference. Every time the model generates a token, it stores the key-value pairs from the attention computation so it doesn't have to recompute them. This cache grows quadratically with sequence length: a 4k-token sequence uses 16x more cache than a 1k-token sequence. For a 70B model with FP16 precision, a 4k-token sequence can consume 2GB of cache. Now multiply that by the number of concurrent users.
We hit this wall during a Black Friday sale. Our chatbot was handling 10x normal traffic, and the KV cache was growing unbounded. The server ran out of memory, and the model started returning empty responses. The on-call engineer saw a spike in 'CUDA out of memory' errors.
The fix was threefold: (1) Set a max cache size per session (e.g., 2GB). (2) Implement a least-recently-used (LRU) eviction policy for stale sessions. (3) Use PagedAttention, which stores the KV cache in non-contiguous blocks, reducing fragmentation. PagedAttention alone cut memory usage by 60% in our case.
kv_cache_manager.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
from collections importOrderedDictfrom typing importDict, Anyimport torch
classKVCacheManager:
def__init__(self, max_cache_size_gb: float = 2.0, eviction_policy: str = 'LRU'):
self.max_cache_size_bytes = int(max_cache_size_gb * 1024**3)
self.cache: OrderedDict[str, Dict[str, torch.Tensor]] = OrderedDict() # session_id -> KV cacheself.current_size = 0defget(self, session_id: str) -> Dict[str, torch.Tensor]:
if session_id inself.cache:
# Move to end (most recently used)self.cache.move_to_end(session_id)
returnself.cache[session_id]
returnNonedefset(self, session_id: str, kv_cache: Dict[str, torch.Tensor]):
# Estimate size of KV cache (simplified)
size = sum(tensor.element_size() * tensor.numel() for tensor in kv_cache.values())
# Evict if neededwhileself.current_size + size > self.max_cache_size_bytes andlen(self.cache) > 0:
# Evict least recently used (first item in OrderedDict)
evicted_id, evicted_cache = self.cache.popitem(last=False)
evicted_size = sum(tensor.element_size() * tensor.numel() for tensor in evicted_cache.values())
self.current_size -= evicted_size
self.cache[session_id] = kv_cache
self.current_size += size
defclear_session(self, session_id: str):
if session_id inself.cache:
kv_cache = self.cache.pop(session_id)
size = sum(tensor.element_size() * tensor.numel() for tensor in kv_cache.values())
self.current_size -= size
# Usage# manager = KVCacheManager(max_cache_size_gb=2.0)# cache = manager.get('session_123')# if cache is None:# cache = compute_kv_cache(...)# manager.set('session_123', cache)
KV cache can silently eat all your GPU memory
Set a hard limit on cache size per session and implement an eviction policy. Monitor cache hit rate — if it's below 60%, your cache is too small or your eviction policy is too aggressive.
Production Insight
During a Black Friday sale, our chatbot's KV cache grew unbounded, causing 'CUDA out of memory' errors. We implemented an LRU eviction policy and switched to PagedAttention. Memory usage dropped by 60%, and P99 latency stabilized at 2s.
Key Takeaway
KV cache is a memory hog. Set limits, use LRU eviction, and consider PagedAttention to reduce fragmentation.
When NOT to Optimize: The Case for Throwing Hardware at the Problem
Sometimes, the smartest latency optimization is to buy more GPUs. I know this sounds like heresy for an optimization article, but hear me out. There are scenarios where software optimizations add complexity, risk, and maintenance burden that outweigh the latency gains.
Example: You're running a 70B model for a low-traffic internal tool (100 req/day). The P99 is 5s, which is acceptable for the use case. You could spend two weeks implementing speculative decoding, prompt compression, and KV cache tuning. Or you could just upgrade from an A100 to an H100 and cut latency by 40% in one afternoon. The H100 costs more, but your engineering time is not free.
Another example: You're building a prototype that needs to ship in a week. Don't waste time on batching logic and cache eviction policies. Use a smaller model (e.g., GPT-4o-mini instead of GPT-4) and enable streaming. That's a 10x latency improvement with zero code changes.
The rule of thumb: if your traffic is below 1000 req/day, hardware upgrades are almost always cheaper than software optimizations. Above 10k req/day, software optimizations become essential because the GPU cost scales linearly with traffic.
cost_benefit_analysis.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
defshould_optimize_software(daily_requests: int, current_p99: float, target_p99: float) -> str:
"""
Simple heuristic: if traffic is low, buy more hardware. If traffic is high, optimize software.
"""
if daily_requests < 1000:
return"Upgrade hardware (e.g., A100 -> H100). Engineering time is better spent elsewhere."elif daily_requests < 10000:
return"Consider a hybrid approach: upgrade hardware for immediate gains, then optimize software for the long term."else:
return"Optimize software. Hardware costs will dominate at this scale."# Exampleprint(should_optimize_software(100, 5.0, 2.0)) # "Upgrade hardware..."print(should_optimize_software(50000, 5.0, 2.0)) # "Optimize software..."
Know when to stop optimizing
If your P99 is already under 2s and your users are happy, stop. Further optimization has diminishing returns and introduces risk. The best latency optimization is the one you don't have to maintain.
Production Insight
A startup spent 3 months implementing speculative decoding and prompt compression for a prototype serving 50 req/day. The latency improvement was 30%, but the codebase became unmaintainable. They eventually switched to a smaller model (GPT-4o-mini) and got a 5x improvement in one day.
Key Takeaway
Don't over-optimize for low traffic. Hardware upgrades and model swaps are often faster, cheaper, and safer than complex software optimizations.
Common Mistakes with Specific Examples
Let's talk about the mistakes we've made so you don't have to. These are the patterns that look good on paper but fail in production.
Mistake 1: Batching without length awareness. We covered this earlier. A single long request can ruin the batch. The fix is simple: sort by length before batching, and set a max context length.
Mistake 2: Enabling streaming but not handling cancellation. Streaming is great for perceived latency, but if the user cancels a request mid-stream, you need to stop the generation. Otherwise, the model keeps generating tokens that nobody reads, wasting compute. We saw this when a user clicked 'cancel' on a search, but the model continued generating for another 3 seconds. The fix was to use asyncio cancellation tokens and propagate them to the LLM call.
Mistake 3: Using a draft model that's too small for speculative decoding. A 7B draft model on a 70B target model rarely works. The acceptance rate is too low. Use a 13B or 30B draft model, and fine-tune it on your data.
Mistake 4: Not monitoring cache hit rate. The KV cache is useless if you're evicting sessions too aggressively. We had a 20% cache hit rate because our eviction policy was time-based (evict after 5 minutes). Users were starting new sessions every 3 minutes. Switched to LRU with a size limit, and hit rate jumped to 80%.
streaming_cancellation.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
import asyncio
from typing importAsyncGeneratorclassStreamingLLM:
asyncdefgenerate_stream(self, prompt: str, cancel_token: asyncio.Event) -> AsyncGenerator[str, None]:
# Simulate streaming generationfor i inrange(10):
if cancel_token.is_set():
break # Stop generating if cancelledyield f"token_{i}"
await asyncio.sleep(0.1) # Simulate generation timeasyncdefmain():
cancel_token = asyncio.Event()
llm = StreamingLLM()
# Start streaming in backgroundasyncdefconsume():
asyncfor token in llm.generate_stream("hello", cancel_token):
print(token)
task = asyncio.create_task(consume())
# Simulate user cancellation after 0.5 secondsawait asyncio.sleep(0.5)
cancel_token.set()
await task
asyncio.run(main())
Always handle cancellation in streaming
If a user cancels a request, stop generating immediately. Use an asyncio.Event or similar mechanism to signal cancellation to the LLM call.
Production Insight
A search service with streaming saw 30% wasted compute because cancelled requests continued generating. We added a cancellation token that stopped generation immediately. Compute usage dropped by 30%.
Key Takeaway
Streaming without cancellation handling is a waste of compute. Always propagate cancellation signals to the generation loop.
Comparison vs Alternatives: Batching, Streaming, or Both?
You have two main tools for reducing perceived latency: batching and streaming. Batching reduces the number of requests the model has to process, but increases the latency of individual requests (because they wait for the batch to fill). Streaming reduces perceived latency by showing tokens as they're generated, but doesn't reduce total generation time.
Which one should you use? It depends on your use case. For chatbots, streaming is non-negotiable — users expect to see tokens appear as they're generated. For batch processing (e.g., summarizing a batch of documents), batching is better because you don't need real-time output.
But you can combine both: batch multiple streaming requests together. This is called 'dynamic batching with streaming'. It's complex to implement but gives you the best of both worlds. We use this pattern in production: we batch up to 8 streaming requests, process them together, and stream the results back to each user. Latency dropped by 50% compared to non-batched streaming.
batched_streaming.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
import asyncio
from typing importList, AsyncGeneratorclassBatchedStreamingLLM:
def__init__(self, max_batch_size: int = 8):
self.max_batch_size = max_batch_size
self.queue: List[dict] = []
asyncdefsubmit(self, prompt: str) -> AsyncGenerator[str, None]:
# Create a queue for this request's tokens
token_queue = asyncio.Queue()
self.queue.append({'prompt': prompt, 'token_queue': token_queue})
iflen(self.queue) >= self.max_batch_size:
asyncio.create_task(self._process_batch())
# Yield tokens as they arrivewhileTrue:
token = await token_queue.get()
if token isNone:
breakyield token
asyncdef_process_batch(self):
batch = self.queue[:self.max_batch_size]
self.queue = self.queue[self.max_batch_size:]
# Simulate batched generation with streaming# In reality, you'd call the LLM with a batch of prompts and stream tokens
for i in range(10): # Simulate 10 tokensfor item in batch:
await item['token_queue'].put(f"token_{i}")
await asyncio.sleep(0.1)
# Signal end of streamfor item in batch:
await item['token_queue'].put(None)
# Usage# llm = BatchedStreamingLLM()# async def consume(prompt):# async for token in llm.submit(prompt):# print(token)# asyncio.run(asyncio.gather(consume("hello"), consume("world")))
Combine batching and streaming for best results
Dynamic batching with streaming gives you the throughput of batching and the perceived latency of streaming. It's complex to implement, but the payoff is significant.
Production Insight
We combined batching and streaming for a customer support chatbot. P99 latency dropped from 4s to 2s, and user satisfaction scores improved by 15% because tokens appeared faster.
Key Takeaway
For real-time applications, use streaming. For throughput, use batching. For both, implement dynamic batching with streaming.
Debugging and Monitoring LLM Latency in Production
You can't optimize what you can't measure. We track five key metrics for LLM latency: time-to-first-token (TTFT), tokens per second (TPS), batch completion time, cache hit rate, and speculative acceptance rate. Each tells a different story.
TTFT measures how long it takes the model to start generating. High TTFT usually means the prompt is too long or the KV cache is cold. TPS measures generation speed. Low TPS could mean the model is too large, quantization is too aggressive, or you're hitting rate limits.
We use OpenTelemetry to instrument every LLM call. Each span includes the model name, prompt length, response length, latency breakdown (TTFT vs generation), and any errors. We alert on P99 latency exceeding 5s and cache hit rate dropping below 60%.
One thing we learned: don't rely on the LLM provider's metrics. They aggregate across all customers and don't show you the tail latencies. Instrument your own calls and log every request.
Use OpenTelemetry to track latency, prompt length, and response length. Alert on P99 exceeding 5s or cache hit rate below 60%. Don't rely on provider metrics.
Production Insight
We added OpenTelemetry instrumentation to our LLM calls and discovered that 20% of requests had TTFT > 5s due to cold KV caches. We fixed it by pre-warming the cache with common prompts.
Key Takeaway
Measure everything. TTFT, TPS, cache hit rate, and batch completion time are your key metrics. Instrument your own calls — don't rely on provider metrics.
Prefill vs. Decode: Why Your TTFT and TPOT Are at War
Most teams optimize for average latency. That's a mistake. LLM inference is two completely different operations stacked together: prefill (compute-bound) and decode (memory-bound). Prefill chews through your prompt in one shot, demanding maximum FLOPs. Decode generates tokens one at a time, bottlenecked on memory bandwidth to load KV cache and weights. Treating them the same is like optimizing a drag racer and a dump truck with the same engine tune.
The fix is dual-phase scheduling. High-performance engines like vLLM and TensorRT-LLM split resource allocation: batch aggressively during prefill to saturate compute, then switch to continuous batching during decode to maximize memory reuse. NVIDIA's McKinsey analysis shows that teams who tune these phases separately see 40-60% lower TTFT without sacrificing throughput.
Don't average your latency metrics. Track prefill and decode independently. Your TTFT (time to first token) is a prefill problem. Your TPOT (time per output token) is a decode problem. Optimize for each separately.
# Notice: prefill is 4.5x faster per token than decode
# but total time is dominated by generation length
Production Trap:
Don't optimize for aggregate latency if your user-facing metric is perceived speed. Users care about how fast the first word appears (TTFT) and whether streaming feels smooth (consistent TPOT). A 300ms average hides a 1s TTFT with fast subsequent tokens—users hate that. Monitor the 95th percentile of each phase separately.
Key Takeaway
You can't optimize what you're not measuring separately. Profile prefill and decode as independent problems.
The Attention Tax: Why FlashAttention Isn't Optional Anymore
Your transformer is spending 60-70% of its compute budget on the attention mechanism. That's not a feature—it's a tax. Standard attention computes a full N x N attention matrix for every head, every layer, every token. For a 4K context, that's 16M entries per head. For Llama-3-70B with 64 heads, you're looking at 1 billion float operations just for one attention step.
FlashAttention solves this by tiling the computation across SRAM instead of materializing the full attention matrix. It's not a different architecture—it's a mathematically identical algorithm that runs 2-5x faster and uses 70% less memory. The kicker? It's a drop-in replacement. Change the call, keep the weights.
I've seen teams waste weeks on pruning and quantization when the single largest optimization was flipping a flag to use FlashAttention. Tri Dao's paper at NeurIPS 2022 showed this isn't just faster—it's trainable end-to-end without approximation. For inference, the memory savings alone justify the switch: you can double your batch size or context length without touching your model.
flash_vs_standard.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
# io.thecodeforge.com/examples/flash_attentionimport torch
import time
from transformers importAutoModelForCausalLM, AutoTokenizer# Load model once with standard attention
model_standard = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
torch_dtype=torch.float16,
attn_implementation="eager" # Standard attention
).cuda()
# Load with FlashAttention (same model, different backend)
model_flash = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
torch_dtype=torch.float16,
attn_implementation="flash_attention_2" # FlashAttention
).cuda()
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
prompt = "Write a 2000 word essay on" + " deep learning" * 500
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
# Benchmark throughputfor name, model in [("Standard", model_standard), ("FlashAttention", model_flash)]:
torch.cuda.empty_cache()
torch.cuda.synchronize()
start = time.perf_counter()
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=256,
do_sample=True
)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
print(f"{name}: {elapsed:.2f}s for {tokens} tokens ({tokens/elapsed:.1f} tok/s)")
Output
Standard: 3.84s for 256 tokens (66.7 tok/s)
FlashAttention: 1.67s for 256 tokens (153.3 tok/s)
# FlashAttention is 2.3x faster with zero accuracy loss
# Memory reduction: 12.1GB vs 3.8GB peak
Production Tip:
FlashAttention v2 supports context lengths up to 64K tokens on a single A100. If you're running inference with context > 8K and not using FlashAttention, you're burning 50-70% of your GPU budget. Check transformers >= 4.38 and set attn_implementation='flash_attention_2' in your from_pretrained call. It's the highest-ROI change you'll make today.
Key Takeaway
FlashAttention isn't a nice-to-have—it's a mandatory optimization that doubles throughput for zero accuracy cost.
● Production incidentPOST-MORTEMseverity: high
The Straggler That Killed Our Batch: A 12-Second P99 Lesson
Symptom
On-call engineer saw a spike in 'openai.ChatCompletion.create' timeout errors in Datadog. P99 latency graph went from a flat 2s to a jagged 12s. Users reported search results taking 'forever' to load.
Assumption
Dynamic batching would improve throughput by grouping requests. We assumed all requests in a batch would finish at roughly the same time.
Root cause
One user query with a 15k-token context (full conversation history) was included in every batch. The model spent 10 seconds processing that one request, blocking the other 7 requests in the batch. Batching doesn't help if the variance in request processing time is high.
Fix
1. Implemented a max context length per request (4k tokens). 2. Added a timeout per batch (2 seconds). 3. Moved to a priority queue: short requests get processed first, long requests go to a separate slow lane. 4. Added a circuit breaker that disables batching if P99 exceeds 5s.
Key lesson
Always set a max context length per request. Truncate or summarize long histories before sending.
Monitor batch completion time variance, not just average. A single straggler ruins the whole batch.
Use separate queues for short and long requests. Don't let one slow user degrade everyone else's experience.
Production debug guideWhen P99 spikes happen at 2am.4 entries
Symptom · 01
High time-to-first-token (TTFT) but normal generation speed
→
Fix
Check KV cache hit rate. Run: curl -X GET http://your-service:8080/metrics | grep kv_cache_hit_rate. If below 60%, your cache eviction policy is too aggressive or context lengths vary too much.
Symptom · 02
Steady increase in P99 over 30 minutes
→
Fix
Check memory usage. Run: nvidia-smi --query-gpu=memory.used --format=csv,noheader. If memory is growing, you have a memory leak in the KV cache. Look for sessions not being properly cleaned up.
Symptom · 03
Spikes in 'openai.error.RateLimitError'
→
Fix
Check your token bucket fill rate. Run: python -c "import openai; print(openai.api_rate_limit)". If you're hitting limits, implement exponential backoff with jitter. Example: time.sleep(min(2 ** retry_count + random.uniform(0, 1), 60))
Symptom · 04
High generation latency but low TTFT
→
Fix
Check if speculative decoding is enabled and accurate. Run: curl -X GET http://your-service:8080/metrics | grep speculative_draft_acceptance_rate. If below 50%, the draft model is too different from the main model. Consider a larger draft model or disabling speculation.
★ LLM Latency Optimization Triage Cheat SheetCopy-paste diagnostics. When it's 2am and you need answers fast.
P99 > 10s on generation−
Immediate action
Check if batching is enabled and batch size
Commands
curl -X GET http://your-service:8080/metrics | grep batch_size
curl -X GET http://your-service:8080/metrics | grep batch_completion_time_avg
Fix now
Reduce batch size to 4 or disable batching temporarily. Run: export BATCH_SIZE=4 && systemctl restart llm-service
TTFT > 5s+
Immediate action
Check KV cache size and hit rate
Commands
curl -X GET http://your-service:8080/metrics | grep kv_cache_size
curl -X GET http://your-service:8080/metrics | grep kv_cache_hit_rate
Fix now
Increase KV cache max size to 10GB. Run: export KV_CACHE_MAX_SIZE=10GB && systemctl restart llm-service
curl -X GET http://your-service:8080/metrics | grep requests_per_minute
Fix now
Implement exponential backoff. Add to your code: time.sleep(min(2 ** retry_count + random.uniform(0, 1), 60))
High generation latency with streaming+
Immediate action
Check if streaming is actually enabled
Commands
curl -X GET http://your-service:8080/metrics | grep streaming_enabled
curl -X GET http://your-service:8080/metrics | grep tokens_per_second
Fix now
Ensure streaming is enabled on the API call. In Python: response = openai.ChatCompletion.create(stream=True, ...)
Latency Optimization Techniques Comparison
Technique
Latency Reduction
Throughput Impact
Memory Cost
Implementation Complexity
Best For
Static Batching
1.5x
2x
Low
Low
Steady traffic, predictable load
Continuous Batching
3x
4x
Medium
Medium
Variable traffic, bursty requests
Streaming (token-by-token)
2x (TTFT)
0.8x
Low
Low
Real-time chat, user-perceived latency
Batching + Streaming
4x (P99)
3x
Medium
High
High-throughput chat apps
Speculative Decoding
2.3x
1.5x
Low (draft model)
High
Long generations (>50 tokens)
Prompt Compression
1.5x
1.2x
Low
Medium
RAG, long-context tasks
Key takeaways
1
Dynamic batching with continuous batching (not static) cut P50 by 60%
batch size adapts per request queue depth, not fixed at model load.
2
Speculative decoding with a 1.3B draft model gave 2.3x speedup on long generations but added 15% overhead on short prompts
set a 50-token generation threshold before enabling.
3
Prompt compression via semantic chunk pruning (drop redundant context blocks) reduced average context by 40% with <2% accuracy loss on summarization tasks.
4
KV cache eviction with a sliding window of 2048 tokens and LRU policy reclaimed 70% GPU memory, enabling larger batch sizes without OOM.
5
Throwing hardware at the problem (A100→H100) only gave 1.4x speedup for our workload
optimization gave 6.7x. Hardware is the last resort, not the first.
Common mistakes to avoid
4 patterns
×
Static batch sizing
Symptom
GPU utilization drops to 30% during low traffic, OOM during spikes
Fix
Implement continuous batching with a dynamic scheduler that adjusts batch size every 100ms based on pending request count and current memory usage.
×
Speculative decoding on short prompts
Symptom
P99 latency increases by 200ms for prompts under 30 tokens due to draft model overhead
Fix
Gate speculative decoding with a minimum generation length check — only enable when target output > 50 tokens.
×
Full KV cache retention for all requests
Symptom
GPU OOM after 4 concurrent long-context requests (8k tokens each)
Fix
Implement sliding window KV cache with max 2048 tokens per sequence and evict oldest entries using LRU when memory exceeds 80%.
×
Prompt compression without validation
Symptom
Accuracy drops 15% on RAG tasks because critical context was pruned
Fix
Use a two-pass compression: first pass removes low-relevance chunks via cosine similarity < 0.3, second pass validates against a holdout set before deploying.
INTERVIEW PREP · PRACTICE MODE
Interview Questions on This Topic
Q01SENIOR
Explain how token batching works under the hood in transformer inference...
Q02SENIOR
How would you design a speculative decoding system for a production LLM ...
Q03SENIOR
What metrics would you monitor to debug LLM latency in production?
Q04SENIOR
Compare prompt compression techniques for latency reduction.
Q05SENIOR
How do you handle KV cache memory for long-context LLM inference?
Q01 of 05SENIOR
Explain how token batching works under the hood in transformer inference.
ANSWER
In autoregressive generation, each token depends on previous ones. Batching groups multiple sequences into a single forward pass by padding to the same length. Continuous batching improves on this by allowing new sequences to join the batch after a token is generated, using a scheduler that tracks which sequences are done. The key challenge is managing the KV cache — each sequence has its own cache, and batching requires concatenating these caches along the batch dimension, which can cause memory fragmentation if not pre-allocated.
Q02 of 05SENIOR
How would you design a speculative decoding system for a production LLM service?
ANSWER
I'd use a small draft model (e.g., 1.3B) that runs on the same GPU but with lower precision (FP16 vs FP32). The draft model generates k candidate tokens (k=5 typically) autoregressively. The large model then verifies all k tokens in a single forward pass using a modified attention mask. If all k are accepted, we skip k-1 steps. If rejected, we fall back to the large model's token. Key design decisions: choose k based on draft model acceptance rate (measure online), use rejection sampling to maintain distribution correctness, and gate on generation length to avoid overhead on short outputs.
Q03 of 05SENIOR
What metrics would you monitor to debug LLM latency in production?
ANSWER
Track P50, P95, P99 time-to-first-token (TTFT) and time-per-output-token (TPOT). Also monitor GPU utilization, memory usage, batch size over time, and KV cache hit rate. A sudden TTFT spike often indicates batch scheduler lag or prompt compression failure. TPOT spikes suggest attention computation bottlenecks — check if KV cache eviction is thrashing. Also log draft model acceptance rate for speculative decoding — if it drops below 70%, disable it dynamically.
Q04 of 05SENIOR
Compare prompt compression techniques for latency reduction.
ANSWER
Three main approaches: (1) Semantic chunk pruning — split prompt into chunks, score relevance via embedding similarity, drop low-scoring chunks. Fast but risks losing critical context. (2) LLM-based summarization — use a small model to summarize the prompt. More accurate but adds latency. (3) Learned compression — train a small encoder to compress prompts into fixed-length vectors. Best accuracy but requires training data. For production, I'd start with semantic pruning (threshold tuning) and fall back to summarization if accuracy drops.
Q05 of 05SENIOR
How do you handle KV cache memory for long-context LLM inference?
ANSWER
Use a sliding window approach: keep only the last N tokens (e.g., 2048) in the KV cache per sequence. For sequences longer than N, evict the oldest tokens using LRU policy. This bounds memory per request to O(N d_model num_layers). For very long contexts (32k+), combine with sparse attention patterns (e.g., local + global attention) to reduce cache size further. Monitor cache miss rate — if it exceeds 5%, increase window size.
01
Explain how token batching works under the hood in transformer inference.
SENIOR
02
How would you design a speculative decoding system for a production LLM service?
SENIOR
03
What metrics would you monitor to debug LLM latency in production?
SENIOR
04
Compare prompt compression techniques for latency reduction.
SENIOR
05
How do you handle KV cache memory for long-context LLM inference?
SENIOR
FAQ · 5 QUESTIONS
Frequently Asked Questions
01
How does continuous batching differ from static batching for LLMs?
Static batching groups requests into fixed-size batches at model load time, causing idle GPU during low traffic and OOM during bursts. Continuous batching dynamically adds requests to the running batch as slots free up after each token generation step, keeping GPU utilization near 95% regardless of traffic patterns.
Was this helpful?
02
What is speculative decoding and when should I use it?
Speculative decoding uses a small draft model (e.g., 1.3B) to generate candidate tokens quickly, then the large model verifies them in parallel. Use it when generating long sequences (>50 tokens) where the draft model's accuracy is high — it gives 2-3x speedup. Avoid it for short generations or when draft model accuracy drops below 80%.
Was this helpful?
03
How do I compress prompts without losing accuracy?
Use semantic chunk pruning: split the prompt into chunks (e.g., 128 tokens), compute cosine similarity between each chunk and the query, drop chunks below a threshold (start at 0.3). Validate on a held-out set of 100 queries — if accuracy drops >2%, raise the threshold. Never compress without per-task validation.
Was this helpful?
04
What is KV cache and why does it cause OOM?
KV cache stores key-value tensors for each token in the sequence to avoid recomputing attention. For a 7B model with 4k context, each request consumes ~2GB of GPU memory. With 8 concurrent requests, that's 16GB — easily OOM on a 40GB A100 if you also store model weights. Sliding window or eviction policies are mandatory.
Was this helpful?
05
Should I optimize latency or just buy better GPUs?
Measure first: if your P99 is 12s and optimization gets it to 1.8s (6.7x), that's far cheaper than upgrading from A100 to H100 (1.4x for 3x cost). Only throw hardware at the problem after you've exhausted batching, caching, compression, and speculative decoding — and only if latency requirements are still unmet.