LLM Context Window — How We Lost $4k/Month to Token Truncation in Our RAG Pipeline
Stop guessing context window limits.
- Context Window: The maximum tokens an LLM can process in one inference. Exceed it and you get silent truncation or a 400 error; we saw both.
- Tokenization: Not character-based. A single word can be 1-5 tokens depending on subword splits. Mistaking tokens for characters caused our chunking to be 60% off.
- Attention Masking: The model can only attend to tokens within the window. If your prompt is truncated, the model literally cannot see the last N tokens. No error, just wrong answers.
- Sliding Window vs Fixed Window: Sliding windows reuse KV cache across steps but double memory on the first token. Fixed windows are simpler but waste context on overlap.
- Chunking Strategy: Overlap percentage matters. 10% overlap in a 4096-token window loses 400 tokens of context per chunk; we measured a 10% accuracy drop.
- Production Monitoring: Track prompt_token_count and completion_token_count per request. A sudden drop in prompt tokens often means truncation kicked in.
Think of the context window as a whiteboard that an AI can write on. The whiteboard has a fixed size — once it's full, the AI has to erase older notes to make room for new ones. If your instructions are at the end of the board and the AI erases them, it forgets what you asked. We accidentally kept writing instructions that got erased, and the AI started answering the wrong questions.
We deployed a RAG pipeline for a legal document review system. The system was supposed to answer queries from 50-page contracts. After two weeks, accuracy dropped from 89% to 66%. Users reported that the AI 'forgot' key clauses from the middle of the document. Our first assumption: the embedding model was bad. We spent a week retraining embeddings. Nothing changed. Then we checked the actual token counts. The documents were being truncated at 4096 tokens — the default context window of our GPT-3.5 model. The model never saw the last 30% of each document. We were paying for full document ingestion but only processing 70% of the content.
Most tutorials explain the context window as 'the maximum number of tokens the model can handle.' They don't tell you that truncation is silent, that the same text tokenizes to different counts on different models, or that your chunking strategy directly determines how much of the document the model actually reads. They also skip the cost angle: we were billed for the full 4096 tokens even when only 2800 were used because of padding and overhead.
This article covers the exact token counting pipeline we built, the chunking strategy that fixed our accuracy, the monitoring we added to catch truncation in real-time, and the debug commands you can run at 2am. We'll include the code that calculates effective context usage, the attention mask debugger, and the cost attribution per request. By the end, you'll know exactly why your RAG pipeline is losing context and how to fix it without guessing.
How the Context Window Actually Works Under the Hood
The context window is not a simple buffer. It's a fixed-size tensor that holds token embeddings plus positional encodings. When you send a prompt, the model tokenizes it into a sequence of token IDs, then looks up the embedding for each token and adds the positional encoding. The resulting tensor has shape (1, sequence_length, embedding_dim). If sequence_length exceeds the model's max position, the model either truncates (drops the last tokens) or throws an error depending on the API.
The key detail most tutorials skip: the attention mechanism uses a causal mask that prevents tokens from attending to future tokens. But the mask is also bounded by the context window. If your prompt is 5000 tokens and the window is 4096, the last 904 tokens are never seen by the model — they are not masked, they simply don't exist in the input tensor. The model's output is conditioned only on the first 4096 tokens.
In production, this means whatever sits at the end of the message list is the first thing to get dropped. In an OpenAI chat request the system prompt comes first and the user message last, so with a long conversation history the system prompt survives but the user's latest query may be truncated. If you put instructions at the end of the prompt instead, those are what you lose. We learned this the hard way when a user's 2000-word question was silently cut to 1200 tokens and the model answered a different question.
The abstraction hides the tokenization step. Our LangChain load_summarize_chain pipeline split on characters, not tokens. We saw chunks that were 4000 characters but 2800 tokens, well within the 4096 window. But the next chunk was also 2800 tokens, and the model could only fit one chunk plus the query. The second chunk was silently dropped. No error, no warning, just a 23-point accuracy drop.
Practical Implementation: Token-Aware Chunking for RAG
Most RAG tutorials use character-based splitting because it's simple. In production, this is a trap. Characters and tokens have no fixed ratio. A 1000-character paragraph can be 200 tokens (English) or 800 tokens (code with many special characters). If your chunk_size is in characters, you have no idea how many tokens each chunk will be.
We switched to token-based splitting using tiktoken and the exact model encoding. The algorithm: split the document into paragraphs, then merge paragraphs for as long as the merged token count stays within chunk_size. If a single paragraph exceeds chunk_size, split it into sentences. If a sentence exceeds chunk_size, split on words. This ensures each chunk is at most chunk_size tokens (the last chunk is usually shorter).
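The paragraph, sentence, word fallback described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: `chunk_by_tokens`, `split_oversized`, and the `count_tokens` callable are hypothetical names. The counter is passed in so the same logic works with any tokenizer.

```python
import re
from typing import Callable, List


def split_oversized(text: str, count_tokens: Callable[[str], int], chunk_size: int) -> List[str]:
    """Break an oversized paragraph into sentences, and oversized sentences into words."""
    pieces: List[str] = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if count_tokens(sentence) <= chunk_size:
            pieces.append(sentence)
        else:
            pieces.extend(sentence.split())  # last resort: one unit per word
    return pieces


def chunk_by_tokens(text: str, count_tokens: Callable[[str], int], chunk_size: int) -> List[str]:
    """Greedily merge paragraphs into chunks of at most chunk_size tokens."""
    units: List[str] = []
    for para in re.split(r"\n\s*\n", text):
        para = para.strip()
        if not para:
            continue
        if count_tokens(para) <= chunk_size:
            units.append(para)
        else:
            units.extend(split_oversized(para, count_tokens, chunk_size))

    chunks: List[str] = []
    current = ""
    for unit in units:
        candidate = f"{current} {unit}" if current else unit
        # Count the merged text itself: token counts are not additive across joins.
        if current and count_tokens(candidate) > chunk_size:
            chunks.append(current)
            current = unit
        else:
            current = candidate
    if current:
        chunks.append(current)
    # Caveat: a single word longer than chunk_size tokens still yields an oversized chunk.
    return chunks


# Assumed wiring for a real model (requires `pip install tiktoken`):
#   import tiktoken
#   enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
#   chunks = chunk_by_tokens(document, lambda s: len(enc.encode(s)), 2048)
```

Units are joined with a single space, so paragraph breaks are not preserved inside a chunk. If your prompts depend on them, join with "\n\n" and count the joined text the same way.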
We set chunk_size to 2048 tokens — half the model's context window. This leaves room for the system prompt (500 tokens), the user query (500 tokens), and the completion (500 tokens). The remaining 548 tokens are buffer. This prevents the model from ever seeing a truncated chunk.
Overlap is critical. We use 256 tokens of overlap between chunks. This ensures that no information is lost at chunk boundaries. The overlap is stored in the vector database as separate chunks with a source field indicating the original chunk ID. During retrieval, we deduplicate overlapping chunks by source ID.
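One possible reading of the overlap scheme above, as a sketch. The record layout, the `kind` field, and the choice to centre each overlap record on the chunk boundary are assumptions made for illustration. The text only says that overlap is stored as separate chunks carrying the original chunk ID and deduplicated at retrieval.

```python
from typing import Dict, List, Tuple


def build_records(token_ids: List[int], chunk_size: int, overlap: int) -> List[Dict]:
    """Emit one record per chunk plus one overlap record per chunk boundary."""
    records: List[Dict] = []
    for chunk_id, start in enumerate(range(0, len(token_ids), chunk_size)):
        end = min(start + chunk_size, len(token_ids))
        records.append({"source_id": chunk_id, "kind": "chunk", "tokens": token_ids[start:end]})
        if end < len(token_ids):
            # Assumed layout: the overlap record straddles the boundary evenly.
            half = overlap // 2
            lo, hi = max(end - half, 0), min(end + half, len(token_ids))
            records.append({"source_id": chunk_id, "kind": "overlap", "tokens": token_ids[lo:hi]})
    return records


def dedupe_hits(hits: List[Tuple[float, Dict]]) -> List[Tuple[float, Dict]]:
    """Keep only the best-scoring hit per source_id, ordered by descending score."""
    best: Dict[int, Tuple[float, Dict]] = {}
    for score, record in sorted(hits, key=lambda h: h[0], reverse=True):
        best.setdefault(record["source_id"], (score, record))
    return list(best.values())
```

With chunk_size=2048 and overlap=256, each boundary record would hold the 128 tokens on either side of the split.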
When NOT to Use the Full Context Window
Bigger is not always better. A larger context window means more tokens to attend to, which increases memory and latency quadratically. GPT-4-32k has 8x the context of GPT-3.5 but 64x the memory cost. The attention matrix is O(n^2) in memory. For a 32k token window, the attention matrix is 32,000 x 32,000 = 1 billion entries. At 2 bytes per entry (fp16), that's 2GB of memory for a single attention head. With 20 heads, that's 40GB. You will OOM your GPU.
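The arithmetic above is easy to reproduce. The sketch below assumes the full attention matrix is materialized in fp16, as the paragraph does. `attention_matrix_bytes` is a hypothetical helper name.

```python
def attention_matrix_bytes(seq_len: int, num_heads: int = 1, bytes_per_entry: int = 2) -> int:
    """Memory for a dense seq_len x seq_len attention matrix per head (fp16 = 2 bytes)."""
    return seq_len * seq_len * num_heads * bytes_per_entry


one_head = attention_matrix_bytes(32_000)
twenty_heads = attention_matrix_bytes(32_000, num_heads=20)
print(one_head / 1e9)      # 2.048 (GB for one head)
print(twenty_heads / 1e9)  # 40.96 (GB for 20 heads)

# 8x the context length costs 8^2 = 64x the attention memory.
ratio = attention_matrix_bytes(32_768) / attention_matrix_bytes(4_096)
print(ratio)  # 64.0
```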
We tried using GPT-4-32k for our legal pipeline. The latency went from 2s to 18s. The cost per request went from $0.01 to $0.80. And the accuracy didn't improve — the model was drowning in irrelevant context. The 'lost in the middle' problem is real: models perform worse on information in the middle of long prompts. A 32k window doesn't help if the relevant clause is buried in 30k tokens of boilerplate.
Use a small context window (4k-8k) with a RAG pipeline for most use cases. Only use large windows (32k-128k) when the entire input is relevant and you cannot split it. For example, analyzing a single legal contract where every clause matters. But even then, we found that splitting the contract into sections and running separate queries was cheaper and faster than a single 32k prompt.
Another anti-pattern: using the full window for conversation history. We saw a chatbot that stored the entire conversation history in the prompt. After 20 turns, the prompt was 10k tokens. The model started forgetting the user's original request. We fixed this by summarizing the conversation history every 5 turns and only keeping the last 2 turns in the prompt.
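A minimal sketch of the summarize-every-5, keep-last-2 policy described above. `ConversationMemory` and the `summarize` callable are hypothetical. In practice `summarize` would be a call to a model, which is stubbed out here.

```python
from typing import Callable, List


class ConversationMemory:
    """Rolling summary plus the most recent turns."""

    def __init__(self, summarize: Callable[[str, List[str]], str],
                 summarize_every: int = 5, keep_last: int = 2) -> None:
        self.summarize = summarize          # (previous_summary, older_turns) -> new summary
        self.summarize_every = summarize_every
        self.keep_last = keep_last
        self.summary = ""
        self.turns: List[str] = []          # turns not yet folded into the summary
        self._since_summary = 0

    def add_turn(self, turn: str) -> None:
        self.turns.append(turn)
        self._since_summary += 1
        if self._since_summary >= self.summarize_every:
            cut = max(len(self.turns) - self.keep_last, 0)
            older, self.turns = self.turns[:cut], self.turns[cut:]
            if older:
                self.summary = self.summarize(self.summary, older)
            self._since_summary = 0

    def prompt_context(self) -> List[str]:
        """Summary (if any) followed by the last keep_last turns."""
        recent = self.turns[len(self.turns) - self.keep_last:] if self.keep_last else []
        return ([self.summary] if self.summary else []) + recent
```

Note the trade-off in this literal reading: turns that are older than the last two but not yet summarized are absent from the prompt until the next summarization pass. Summarizing more often closes that gap at the cost of extra model calls.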
Production Patterns and Scale: Context Window Monitoring
You cannot fix what you do not measure. We now track three metrics per request: prompt_token_count, completion_token_count, and truncation_ratio (prompt_tokens / max_window). If truncation_ratio > 0.8, we log a warning. If > 1.0, we page the on-call engineer.
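The thresholds above can be expressed in a few lines. This sketch uses the standard library logger in place of Datadog, and `classify_request` is a hypothetical name. The important assumption is that prompt_tokens is counted before anything has a chance to truncate the prompt. Otherwise the ratio can never exceed 1.0.

```python
import logging

logger = logging.getLogger("context_window")


def truncation_ratio(prompt_tokens: int, max_window: int) -> float:
    """Fraction of the context window consumed by the prompt."""
    return prompt_tokens / max_window


def classify_request(prompt_tokens: int, max_window: int,
                     warn_at: float = 0.8, page_at: float = 1.0) -> str:
    """Return 'ok', 'warn', or 'page' using the thresholds from the text."""
    ratio = truncation_ratio(prompt_tokens, max_window)
    if ratio > page_at:
        logger.error("prompt exceeds window: ratio=%.2f", ratio)
        return "page"
    if ratio > warn_at:
        logger.warning("prompt near window limit: ratio=%.2f", ratio)
        return "warn"
    return "ok"
```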
We use Datadog custom metrics. The instrumentation is in the API gateway, not in the LLM client. This catches truncation from any client, not just our Python code. We also log the first 100 tokens of the prompt when truncation is detected, so we can see what was cut.
At scale (50k requests/day), we saw that 12% of requests had truncation_ratio > 0.8. Most of these were from a single client that was sending full documents as queries. We added a per-client token budget that limited prompt tokens to 3000. The client had to implement chunking themselves. This reduced our truncation rate to 0.5%.
We also monitor cost per request. If cost per request spikes without a volume increase, it's usually because the model is processing more tokens than expected. We saw a 30% cost spike when a client accidentally sent the same document 5 times in a single request. The model processed 5x the tokens, and we were billed for all of them.
Common Mistakes with Specific Examples
Mistake 1: Using the wrong tokenizer. We had a team using tiktoken.get_encoding("gpt2") for a GPT-3.5 model. GPT-3.5 uses cl100k_base. The gpt2 tokenizer produces different token counts. A 1000-character paragraph might be 200 tokens with gpt2 but 250 with cl100k_base. Their chunk_size of 2000 tokens was actually 2500 tokens — exceeding the context window.
Mistake 2: Assuming the model will error on truncation. OpenAI's API silently truncates. We had a dashboard showing 'prompt_tokens: 4096' for every request — the API was reporting the truncated count, not the original count. We thought everything was fine. It wasn't.
Mistake 3: Not accounting for the completion tokens in the context window. The context window includes both prompt and completion. If you set max_tokens=2048 and the prompt is 3000 tokens, the total is 5048 — exceeding the 4096 window. The model will truncate the prompt to fit both. We saw this in a chatbot where the response was cut off because the model had to allocate tokens for the completion.
Mistake 4: Using a fixed chunk_size without considering the model's window. We used chunk_size=4000 for a 4096-window model. That left only 96 tokens for the system prompt and query. Any query longer than 96 tokens would cause truncation. We fixed it by setting chunk_size to 2048 (half the window).
Mistake 5: Not testing with the actual model. We tested chunking with a GPT-3.5 model but deployed with GPT-4. The tokenizers did not match: GPT-4 uses cl100k_base, while older models such as text-davinci-003 use p50k_base (gpt-3.5-turbo shares cl100k_base with GPT-4, so check the exact model name). The token counts were different, and our chunking was off by 10%.
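The arithmetic behind Mistake 3 is worth automating. The helpers below are hypothetical names, a sketch of checking that prompt plus reserved completion fits the window before the request goes out.

```python
def completion_headroom(prompt_tokens: int, max_tokens: int, context_window: int) -> int:
    """Tokens left after reserving the completion; a negative value means overflow."""
    return context_window - (prompt_tokens + max_tokens)


def clamp_max_tokens(prompt_tokens: int, max_tokens: int, context_window: int) -> int:
    """Shrink max_tokens so prompt + completion fits; fail fast if the prompt alone does not."""
    available = context_window - prompt_tokens
    if available <= 0:
        raise ValueError(
            f"prompt uses {prompt_tokens} tokens; window is {context_window}"
        )
    return min(max_tokens, available)


# The example from Mistake 3: 3000 + 2048 = 5048, which is 952 over a 4096 window.
print(completion_headroom(3000, 2048, 4096))  # -952
print(clamp_max_tokens(3000, 2048, 4096))     # 1096
```

Clamping the completion is one design choice. Rejecting the request outright is the stricter alternative when a short answer would be worse than no answer.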
Use tiktoken.encoding_for_model(), not tiktoken.get_encoding(). The latter requires you to know the encoding name, which changes across models. We caught a production bug where a team used p50k_base for a gpt-4 model; the token counts were off by 15%.
Comparison vs Alternatives: Sliding Window vs Fixed Window vs RAG
There are three main strategies for handling context windows: fixed window, sliding window, and RAG. Each has trade-offs.
Fixed window: You split the input into fixed-size chunks and process each chunk independently. Simple, fast, but loses cross-chunk context. We used this for our legal pipeline initially. Accuracy was 66% because the model couldn't see clauses that spanned chunk boundaries.
Sliding window: You process the input with a sliding window that overlaps. The KV cache is reused across steps, so the model can attend to tokens from previous windows. This improves accuracy but doubles memory on the first token because you need to store the KV cache for the entire sliding window. We tried this with a 2048-token sliding window and 512-token overlap. Accuracy improved to 82%, but latency went up 40% due to the KV cache overhead.
RAG (Retrieval-Augmented Generation): You index the input into a vector database, retrieve the most relevant chunks, and send only those chunks to the model. This is the most efficient for large documents because you only pay for the relevant context. We achieved 89% accuracy with RAG, with latency under 2 seconds. The trade-off: you need a good retrieval system. If the retriever misses a relevant chunk, the model can't answer.
We benchmarked all three on 1000 legal queries. Fixed window: 66% accuracy, 1.2s latency. Sliding window: 82% accuracy, 1.7s latency. RAG: 89% accuracy, 1.9s latency. RAG won on accuracy and was close on latency. We chose RAG with a hybrid retriever (BM25 + dense embeddings) to catch edge cases.
For conversation history, we use a sliding window of the last 2 turns plus a summary of earlier turns. This is a hybrid approach: RAG for long-term memory, sliding window for short-term context.
Debugging and Monitoring: The Complete Toolkit
You need three layers of debugging: pre-request validation, in-request monitoring, and post-request analysis.
Pre-request validation: Before sending a request, calculate the total token count of the prompt (system + user + history). If it exceeds 80% of the context window, reject with a clear error message. We use a TokenBudget class that tracks the budget across multiple messages. This catches truncation before it happens.
In-request monitoring: After receiving the response, check the usage field. If prompt_tokens is suspiciously close to the max window, log a warning. We also check if completion_tokens is suspiciously short — that often means the model ran out of context window for the completion.
Post-request analysis: Log every request with prompt_tokens, completion_tokens, and the first 100 tokens of the prompt. This allows you to replay requests and debug truncation after the fact. We store this in a separate table in our data warehouse.
We also have a debug endpoint that returns the token breakdown: system prompt tokens, user query tokens, history tokens, and the total. This helps developers understand why their request was rejected.
The most common debugging scenario: a developer says 'the model is ignoring my instructions.' We check the token breakdown. 90% of the time, the instructions are at the end of the prompt and were truncated.
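The pre-request check and the token breakdown described above might look like this. It is a sketch only: the article names a TokenBudget class but does not show it, so every method and field here is an assumption, as is the injected `count_tokens` callable.

```python
from typing import Callable, Dict


class TokenBudgetExceeded(ValueError):
    """Raised when the assembled prompt exceeds the allowed share of the window."""


class TokenBudget:
    def __init__(self, count_tokens: Callable[[str], int],
                 context_window: int = 4096, limit_fraction: float = 0.8) -> None:
        self.count_tokens = count_tokens
        self.limit = int(context_window * limit_fraction)
        self.used: Dict[str, int] = {}  # part name -> tokens, e.g. "system", "user", "history"

    def add(self, part: str, text: str) -> None:
        """Charge the text's tokens to the named part of the prompt."""
        self.used[part] = self.used.get(part, 0) + self.count_tokens(text)

    @property
    def total(self) -> int:
        return sum(self.used.values())

    def breakdown(self) -> Dict[str, int]:
        """Per-part token counts plus the total and the limit, for debugging."""
        return {**self.used, "total": self.total, "limit": self.limit}

    def validate(self) -> None:
        """Reject before sending rather than letting truncation happen silently."""
        if self.total > self.limit:
            raise TokenBudgetExceeded(
                f"prompt uses {self.total} tokens, limit is {self.limit}: {self.used}"
            )
```

This counts message text only. Any per-message framing the API adds is not included, so leave some slack in limit_fraction.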
Expose a /debug/token-breakdown endpoint that accepts a prompt and returns the token breakdown. Developers can test their prompts before sending them to the model. This reduced truncation-related tickets by 70%.
The $4k/Month Silent Truncation: How We Missed 30% of Our Legal Documents
The load_summarize_chain pipeline used RecursiveCharacterTextSplitter with chunk_size=4000 and chunk_overlap=200, but the model's max context window was 4096 tokens. The chunk_size was in characters, not tokens. Each chunk averaged 2800 tokens against a 4096-token window, so the model could only fit the first chunk plus the query. The second chunk was silently dropped. The model never saw 60% of the document.
1. Switched to token-based splitting using tiktoken with the exact model encoding (cl100k_base for GPT-3.5).
2. Set chunk_size to 2048 tokens (half of context window) and overlap to 256 tokens to ensure at least some context from adjacent chunks.
3. Added a pre-processing step that counted total document tokens and logged a warning if any chunk exceeded 80% of the context window.
4. Implemented a token budget calculator that allocated tokens for system prompt, user query, and document chunks, and rejected requests exceeding the budget.
5. Added a metric prompt_truncation_ratio to Datadog — if >0, page the on-call engineer.
6. Code fix: replaced LangChain's load_summarize_chain with a custom chain that explicitly managed token counts.
- Always count tokens using the same tokenizer as the model. Character counts are meaningless for LLM context windows.
- Monitor actual token usage per request, not just the model's max window. Truncation is silent and costs you money for unused context.
- Implement a token budget before sending the request. If the prompt exceeds the window, fail fast with a clear error instead of silently dropping content.
- Count the prompt's tokens with tiktoken and compare to the model's max context window. Run: python -c "import tiktoken; enc = tiktoken.encoding_for_model('gpt-3.5-turbo'); prompt = open('last_prompt.txt').read(); print(len(enc.encode(prompt)))"
- Log how many chunks were sent versus expected: print(f'Chunks sent: {len(chunks)}, Expected: {total_chunks}')
- Check usage.prompt_tokens and compare it to the model's max context window (not the max_tokens request parameter, which only caps the completion). If prompt_tokens is at or above the window, truncation occurred.
- Check for padding: curl -X POST https://api.openai.com/v1/chat/completions -H "Authorization: Bearer $KEY" -H "Content-Type: application/json" -d '{"model":"gpt-3.5-turbo","messages":[{"role":"user","content":"hi"}],"max_tokens":1}' | jq '.usage'. If prompt_tokens is greater than the actual token count of 'hi', the difference is padding from message formatting, and it counts against your window.
- Quick local count: python -c "import tiktoken; enc = tiktoken.encoding_for_model('gpt-3.5-turbo'); print(len(enc.encode(open('prompt.txt').read())))"
- Quick API-side count (jq builds the JSON body so quotes and newlines in prompt.txt are escaped): curl -X POST https://api.openai.com/v1/chat/completions -H "Authorization: Bearer $KEY" -H "Content-Type: application/json" -d "$(jq -Rs '{model:"gpt-3.5-turbo",messages:[{role:"user",content:.}],max_tokens:1}' prompt.txt)" | jq '.usage.prompt_tokens'
Key takeaways
Common mistakes to avoid
- Character-based chunking
- Ignoring system prompt tokens
- No overlap in chunking
- Assuming max context is usable
Interview Questions on This Topic
Explain how the transformer context window works under the hood. What is the computational complexity of attention with respect to context length?