Context Window Architecture Treat the context window as a fixed-size buffer; every token beyond 4K reduces attention resolution exponentially, causing hallucinations.
Token Budgeting Allocate tokens by priority (system prompt < tools < conversation history < RAG docs); a 128K window doesn't mean you can fill it all.
Context Rot Accumulated irrelevant history degrades performance linearly; use sliding windows or summarization to cap history at 20% of the window.
Retrieval Precision 5 high-relevance docs beat 25 noisy ones; measure retrieval recall@5 in production or watch your P95 latency spike 3x.
Agent Loop Context Each tool call re-injects the full context; this compounds token cost exponentially — cache tool results and trim tool outputs to 200 tokens max.
Debugging Context Log the token count per turn and the last N input tokens; a sudden drop in output coherence often means you hit the context limit silently.
✦ Definition~90s read
What is Context Engineering for LLMs?
Context engineering is the discipline of explicitly managing and optimizing the input context window sent to an LLM, rather than treating it as a passive text blob. It’s the practice of curating, compressing, and structuring the tokens you feed the model to maximize output quality and minimize cost.
★
Think of the LLM's context window like a whiteboard.
The core problem it solves is that LLMs charge per token — both input and output — and their attention mechanisms degrade with irrelevant or redundant context. A missing max_tokens parameter, for example, can silently cause the model to generate thousands of unnecessary tokens per request, turning a $0.01 API call into a $0.50 one at scale.
Context engineering forces you to treat every token as a paid resource, not free real estate.
In production, context engineering sits between your application logic and the LLM API. It’s not about prompt engineering tricks like 'think step by step' — it’s about hard constraints: truncating conversation history to the last N turns, summarizing retrieved documents before injection, or using sliding windows to keep context under a budget.
Tools like LangChain’s load_summarization_chain or custom tokenizers (e.g., tiktoken for OpenAI) are common, but the real work is in building deterministic rules for what stays and what goes. When you’re handling 10K requests/minute, a 10% reduction in average input tokens saves thousands per week — the $12k incident in the title is a real-world example of failing to set max_tokens on output, causing the model to ramble.
Context engineering is not a silver bullet. It’s overkill for simple classification tasks (use a fine-tuned BERT instead) or when your entire context fits in 2K tokens and you don’t care about cost. It’s also the wrong tool when you need the model to have full access to a large knowledge base — that’s a job for RAG with vector search, not manual context trimming.
The key insight: context engineering is about trade-offs between completeness and efficiency. You use it when token costs dominate your bill, latency matters, or the model starts hallucinating from context overload. Otherwise, just pass the whole damn text and move on.
Plain-English First
Think of the LLM's context window like a whiteboard. Prompt engineering is writing neatly on it; context engineering is deciding what to erase and what to keep as the meeting goes on. If you never erase, the whiteboard fills with irrelevant scribbles and the model can't find the important notes — that's context rot. Good context engineering is the janitor who keeps the whiteboard clean and organized.
We've all been there. You deploy a shiny new LLM agent that works perfectly in the demo. Three days later, the P99 latency has doubled, the output is gibberish, and your cloud bill has a suspicious $12k spike. The root cause? Not a model failure — a context engineering failure. Your agent's context window was silently filling with garbage, and the model couldn't find the instructions anymore.
Most tutorials treat context engineering as 'advanced prompt engineering.' They show you how to structure a system prompt and call it a day. But in production, context engineering is about token economics, attention decay, and the brutal reality of the agent loop. The Anthropic post gets the theory right, the LangChain docs show you the abstractions, and the zero-to-hero tutorial gives you code. But none of them tell you what happens when your RAG pipeline injects 50,000 tokens of irrelevant docs, or when your chat history hits 128K and the model starts ignoring the system prompt.
This article covers exactly that. We'll walk through the internals of how context actually affects model behavior, with real production incidents from a recommendation engine, a customer support agent, and a code generation pipeline. You'll get runnable code for token budgeting, context trimming, and debugging. By the end, you'll know how to build agents that don't rot, don't hallucinate, and don't bankrupt you.
How Context Engineering Actually Works Under the Hood
The LLM's attention mechanism is quadratic in the number of tokens. That means a 128K context window doesn't give you 128K of effective memory — it gives you a rapidly decaying attention budget. Tokens at the beginning of the context (like your system prompt) get diluted as more tokens are added. In practice, the model's 'working memory' is about 4-8K tokens. Everything beyond that is background noise that the model can still attend to, but with exponentially less precision.
This is why context engineering matters more than prompt engineering. You can craft the perfect system prompt, but if you bury it under 100K tokens of chat history, the model will treat it as background noise. The engineering problem is: how do you keep the critical information in the model's effective working memory while still providing enough context for the task?
The answer is token budgeting. You allocate a fixed number of tokens for each part of the context: system prompt (fixed, say 1K), tools (variable but capped at 2K), conversation history (sliding window of 4K), and RAG documents (top-k with max 3K). You enforce these limits in code, not in the prompt. If the history exceeds its budget, you either truncate or summarize. If the RAG docs exceed their budget, you retrieve fewer or shorter chunks.
token_budget_manager.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
import tiktoken
from typing importList, DictclassTokenBudgetManager:
def__init__(self, model: str = "gpt-4", max_context: int = 128000):
self.encoding = tiktoken.encoding_for_model(model)
self.max_context = max_context
# Allocate budgets: system (1K), tools (2K), history (4K), RAG (3K)self.budgets = {
"system": 1024,
"tools": 2048,
"history": 4096,
"rag": 3072,
}
# Reserve 10% for the model's responseself.reserved = int(max_context * 0.1)
self.available = max_context - self.reserved
defcount_tokens(self, text: str) -> int:
returnlen(self.encoding.encode(text))
deftrim_to_budget(self, text: str, budget: int) -> str:
tokens = self.encoding.encode(text)
iflen(tokens) <= budget:
return text
# Truncate from the middle to preserve beginning and end
half_budget = budget // 2
trimmed = tokens[:half_budget] + tokens[-half_budget:]
returnself.encoding.decode(trimmed)
defbuild_context(self, system_prompt: str, tools: List[Dict], history: List[str], rag_docs: List[str]) -> str:
# Enforce budgets
system_prompt = self.trim_to_budget(system_prompt, self.budgets["system"])
tools_str = "\n".join([str(t) for t in tools])
tools_str = self.trim_to_budget(tools_str, self.budgets["tools"])
history_str = "\n".join(history[-10:]) # Last 10 messages
history_str = self.trim_to_budget(history_str, self.budgets["history"])
rag_str = "\n".join(rag_docs[:5]) # Top 5 docs
rag_str = self.trim_to_budget(rag_str, self.budgets["rag"])
context = f"{system_prompt}\n\nTools:\n{tools_str}\n\nHistory:\n{history_str}\n\nContext:\n{rag_str}"
total_tokens = self.count_tokens(context)
if total_tokens > self.available:
# Emergency trim: reduce RAG and history
rag_str = self.trim_to_budget(rag_str, self.budgets["rag"] // 2)
history_str = self.trim_to_budget(history_str, self.budgets["history"] // 2)
context = f"{system_prompt}\n\nTools:\n{tools_str}\n\nHistory:\n{history_str}\n\nContext:\n{rag_str}"return context
# Usage example
manager = TokenBudgetManager()
context = manager.build_context(
system_prompt="You are a helpful assistant.",
tools=[{"name": "search", "description": "Search the web"}],
history=["User: What's the weather?", "Assistant: It's sunny."],
rag_docs=["The weather today is sunny with a high of 25°C."]
)
print(f"Context token count: {manager.count_tokens(context)}")
Don't trust the model's context limit
OpenAI's 128K limit is a hard cap, not a recommendation. In our testing, GPT-4's effective recall drops by 40% once context exceeds 16K tokens. Always enforce a soft limit well below the hard cap.
Production Insight
A recommendation engine serving 2M req/day started returning stale results after a schema migration. The migration added a 50K token 'user profile' to the context. The model stopped attending to the 'last_purchase_date' field because it was buried. The fix: we moved the user profile to a separate RAG lookup and only injected it when the model explicitly requested it.
Key Takeaway
Token budgeting is not optional. Allocate tokens by priority, enforce limits in code, and always leave headroom. The model's effective working memory is 4-8K tokens, not 128K.
Practical Implementation: Building a Context-Aware Agent
Let's build a production-grade agent that manages its own context. We'll use LangChain 0.2+ and OpenAI 1.0+. The key difference from tutorials: we'll implement a context manager that tracks token usage per turn, trims history, and caches tool results. This is the 'janitor' pattern — the agent doesn't manage its own context; the context manager does.
The agent loop looks like this: user input -> context manager builds context (with budgets) -> LLM call -> tool execution -> context manager updates history (with trimming) -> repeat. The context manager is the single source of truth for what goes into the context window.
context_aware_agent.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
import os
from typing importList, Dict, Optionalfrom langchain_openai importChatOpenAIfrom langchain.agents importAgentExecutor, create_openai_tools_agent
from langchain.tools import tool
from langchain_core.prompts importChatPromptTemplate, MessagesPlaceholderfrom langchain_core.messages importHumanMessage, AIMessage, SystemMessagefrom langchain.memory importConversationSummaryBufferMemory# Set your API key
os.environ["OPENAI_API_KEY"] = "sk-your-key-here"# Define a simple tool
@tool
defget_weather(city: str) -> str:
"""Get the current weather for a city."""# Simulate API callreturn f"The weather in {city} is sunny, 25°C."# Context manager with token budgetingclassProductionContextManager:
def__init__(self, max_token_limit: int = 8000):
self.max_token_limit = max_token_limit
self.memory = ConversationSummaryBufferMemory(
llm=ChatOpenAI(model="gpt-3.5-turbo", temperature=0),
max_token_limit=max_token_limit,
return_messages=True,
)
defadd_message(self, message):
self.memory.chat_memory.add_message(message)
# Check if we need to summarizeifself.memory.chat_memory.messages[-1].token_count > self.max_token_limit:
self.memory.prune()
defget_context(self) -> List:
returnself.memory.chat_memory.messages
# Build the agent
llm = ChatOpenAI(model="gpt-4", temperature=0)
tools = [get_weather]
prompt = ChatPromptTemplate.from_messages([
("system", "You are a helpful assistant. Use the tools provided to answer questions."),
MessagesPlaceholder(variable_name="chat_history"),
("human", "{input}"),
MessagesPlaceholder(variable_name="agent_scratchpad"),
])
agent = create_openai_tools_agent(llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True, handle_parsing_errors=True)
# Run the agent with context management
context_manager = ProductionContextManager(max_token_limit=4000)
user_inputs = ["What's the weather in Paris?", "And in London?", "What was the first city I asked about?"]
for user_input in user_inputs:
# Add user input to context
context_manager.add_message(HumanMessage(content=user_input))
# Get context (history)
chat_history = context_manager.get_context()
# Run agent
result = agent_executor.invoke({"input": user_input, "chat_history": chat_history})
# Add agent response to context
context_manager.add_message(AIMessage(content=result["output"]))
print(f"User: {user_input}")
print(f"Agent: {result['output']}")
print("---")
Use ConversationSummaryBufferMemory for production
LangChain's built-in memory class handles token counting and summarization automatically. But don't rely on it blindly — set a max_token_limit that's 20% of your model's context window and monitor the token count in production.
Production Insight
A customer support agent using a similar pattern had a bug: the memory was set to 8000 tokens, but the system prompt was 2000 tokens, and the tool outputs were 1000 tokens each. After 3 turns, the context hit 8000 tokens and the memory started summarizing aggressively, losing the user's name and order ID. The fix: set the memory limit to 4000 tokens and reserve the rest for system prompt and tools.
Key Takeaway
Use a dedicated context manager that tracks token budgets. Don't let the agent manage its own context — it will always choose to add more, not trim.
When NOT to Use Context Engineering (and What to Do Instead)
Context engineering is not a silver bullet. There are cases where no amount of token budgeting will save you. If your task requires recalling a specific fact from a 100K token document, context engineering won't help — the model's attention mechanism will still miss it. In those cases, you need retrieval augmentation (RAG) or a different architecture altogether.
Another anti-pattern: trying to engineer context for a model that's fundamentally not capable of the task. If you're asking GPT-3.5 to do complex multi-step reasoning, no amount of context engineering will make it reliable. Upgrade the model or break the task into smaller steps.
Finally, context engineering can't fix a bad system prompt. If your instructions are ambiguous, the model will still fail. Always validate your system prompt in isolation before adding context management.
when_not_to_use.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
# Anti-pattern: trying to force a 100K document into context# Instead, use RAG to retrieve only the relevant chunksfrom langchain_community.vectorstores importChromafrom langchain_openai importOpenAIEmbeddingsfrom langchain.text_splitter importRecursiveCharacterTextSplitter# Bad: injecting entire document# context = open("large_document.txt").read() # 100K tokens# Good: split and retrieve only relevant chunks
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=500,
chunk_overlap=50,
)
chunks = text_splitter.split_text(open("large_document.txt").read())
vectorstore = Chroma.from_texts(
texts=chunks,
embedding=OpenAIEmbeddings(),
persist_directory="./chroma_db"
)
# Retrieve only the top 3 most relevant chunks
query = "What is the capital of France?"
retrieved_docs = vectorstore.similarity_search(query, k=3)
context = "\n".join([doc.page_content for doc in retrieved_docs])
print(f"Context token count: {len(context.split())}") # ~1500 tokens, not 100K
RAG is not a replacement for context engineering
RAG solves the 'needle in a haystack' problem. Context engineering solves the 'how do I keep the needle visible' problem. You need both in production.
Production Insight
A legal document analysis pipeline tried to inject entire contracts (50K tokens) into the context. The model kept missing key clauses. The fix: we used RAG to retrieve only the clauses relevant to the query, then used context engineering to ensure those clauses were in the first 4K tokens of the context. Accuracy went from 60% to 95%.
Key Takeaway
Context engineering works within the model's effective working memory. If your task requires recalling from a massive document, use RAG first, then context engineering to keep the retrieved info visible.
Production Patterns & Scale: Context Engineering at 10K Requests/Minute
At scale, context engineering becomes a cost and latency optimization problem. Every token you inject costs money and time. The pattern we use at 10K requests/min: pre-compute as much context as possible. Cache the system prompt and tool definitions (they rarely change). Pre-process user history into summaries. Use a tiered retrieval system: first try a fast keyword search, then fall back to semantic search.
Another key pattern: batch context updates. Instead of updating the context on every turn, batch updates every N turns or every M seconds. This reduces the number of LLM calls and allows you to deduplicate context changes.
production_context_pipeline.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
import asyncio
from collections import deque
import time
classBatchedContextManager:
def__init__(self, batch_size: int = 5, batch_interval: float = 2.0):
self.batch_size = batch_size
self.batch_interval = batch_interval
self.pending_updates = deque()
self.last_flush = time.time()
self.context_cache = {}
asyncdefadd_update(self, session_id: str, update: dict):
self.pending_updates.append((session_id, update))
iflen(self.pending_updates) >= self.batch_size or \
(time.time() - self.last_flush) >= self.batch_interval:
awaitself.flush()
asyncdefflush(self):
# Batch process all pending updates
updates_by_session = {}
whileself.pending_updates:
session_id, update = self.pending_updates.popleft()
if session_id notin updates_by_session:
updates_by_session[session_id] = []
updates_by_session[session_id].append(update)
# Apply updates (simulated)for session_id, updates in updates_by_session.items():
# In production, you'd update a database or in-memory store
self.context_cache[session_id] = updates[-1] # Keep latestself.last_flush = time.time()
defget_context(self, session_id: str) -> dict:
returnself.context_cache.get(session_id, {})
# Usage in an async agentasyncdefrun_agent():
manager = BatchedContextManager()
# Simulate concurrent requests
tasks = []
for i inrange(10):
tasks.append(manager.add_update(f"session_{i % 3}", {"user_input": f"query_{i}"}))
await asyncio.gather(*tasks)
print(manager.context_cache)
asyncio.run(run_agent())
Cache everything that doesn't change per request
System prompts, tool definitions, and user profile summaries are prime candidates for caching. We reduced our token consumption by 40% by caching the system prompt and only updating it when the user's context changes.
Production Insight
A real-time recommendation system was re-computing the user profile on every request. The profile was 5K tokens and changed once a day. We moved the profile to a cache with a 1-hour TTL and only re-computed it on explicit user actions. Latency dropped from 500ms to 100ms.
Key Takeaway
At scale, context engineering is about caching and batching. Don't re-compute what you already know. Batch updates to reduce LLM calls.
Common Mistakes with Specific Examples (and How to Fix Them)
Mistake #1: Injecting the entire conversation history. We saw a team that appended every message to the context, including system messages and tool outputs. After 10 turns, the context was 80K tokens of noise. The model started ignoring the user's latest query. Fix: use a sliding window of the last 5-10 messages, and summarize older ones.
Mistake #2: Not trimming tool outputs. Tool outputs can be huge. A database query tool returned a 10MB JSON blob. The model couldn't find the relevant data. Fix: always truncate tool outputs to 200 tokens, and add a note if truncated.
Mistake #3: Over-relying on the model to manage context. Some teams ask the model to 'remember' important information. Models are terrible at this. Fix: store important information in an external memory (like a database) and inject it when needed.
common_mistakes_fixes.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
# Mistake 1: Full history injection# Bad:
history = all_messages # 80K tokens# Good:
history = all_messages[-10:] # Last 10 messages, ~2K tokens# Mistake 2: Untrimmed tool outputs# Bad:
tool_output = db.query("SELECT * FROM large_table") # 10MB JSON# Good:
tool_output = db.query("SELECT * FROM large_table LIMIT10") # Truncate at source# Or truncate after:
MAX_TOOL_OUTPUT_TOKENS = 200import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")
tokens = enc.encode(str(tool_output))
iflen(tokens) > MAX_TOOL_OUTPUT_TOKENS:
tool_output = enc.decode(tokens[:MAX_TOOL_OUTPUT_TOKENS]) + "... (truncated)"# Mistake 3: Asking the model to remember# Bad:
system_prompt = "Remember the user's name is Alice."# Good:# Store in external memory
user_memory = {"name": "Alice"}
# Inject when relevantif"name"in user_memory:
context += f"\nThe user's name is {user_memory['name']}."
The 'remember' anti-pattern
Never ask the model to remember something. Models have no persistent memory between calls. Always store important information externally and inject it explicitly into the context.
Production Insight
A chatbot team asked the model to 'remember the user's order ID.' The model forgot after 3 turns. The fix: we stored the order ID in a session store and injected it into the system prompt on every turn. Accuracy went from 50% to 100%.
Key Takeaway
The three most common mistakes are: injecting too much history, not trimming tool outputs, and trusting the model to remember. Fix all three with explicit code.
Context Engineering vs. Alternatives: When to Use What
Context engineering is not the only tool in the box. Here's how it compares to alternatives:
Prompt engineering: Good for one-shot tasks. Bad for multi-turn or complex agents. Context engineering subsumes prompt engineering for production systems.
RAG: Good for injecting external knowledge. Bad for managing conversation state. Use both: RAG for knowledge, context engineering for state.
Fine-tuning: Good for teaching the model a new skill. Bad for dynamic context. Fine-tune for behavior, use context engineering for per-request information.
Memory (external): Good for long-term recall. Bad for short-term working memory. Use external memory for facts, context engineering for the current conversation.
The key insight: context engineering is the glue that holds everything together. It decides what goes into the context window, in what order, and with what priority.
comparison_decision_tree.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
# Decision tree for choosing the right techniquedefchoose_technique(task_type: str, context_size: int, need_long_term_memory: bool):
if task_type == "one-shot classification":
return"prompt engineering"elif task_type == "multi-turn agent":
if context_size > 10000:
return"context engineering + RAG"else:
return"context engineering"elif task_type == "skill acquisition":
return"fine-tuning"elif need_long_term_memory:
return"context engineering + external memory"else:
return"context engineering"# Examplesprint(choose_technique("one-shot classification", 500, False)) # prompt engineeringprint(choose_technique("customer support agent", 50000, True)) # context engineering + RAG + external memory
Context engineering is the default for production agents
If you're building a multi-turn agent, start with context engineering. Add RAG if you need external knowledge. Add fine-tuning if you need a new skill. Context engineering is the foundation.
Production Insight
A team building a code generation agent tried fine-tuning the model on their codebase. It worked for the first week, then failed on new code patterns. The fix: they switched to context engineering, injecting the relevant code snippets via RAG. Accuracy improved from 70% to 95% and didn't degrade over time.
Key Takeaway
Context engineering is the most flexible and maintainable approach for production systems. Fine-tune for behavior, not for context.
Debugging and Monitoring Context Engineering in Production
You can't fix what you can't see. In production, you need to monitor: - Token count per turn (log it) - Context composition (how many tokens from system, history, tools, RAG) - Model response quality (track hallucinations, repetitions, refusals) - Cost per session (token count * model price)
We use a simple logging pattern: log the context token count and a hash of the context before every LLM call. This lets us trace back a bad response to a specific context configuration.
Logging the full context is expensive and a security risk (PII). Log a hash and the token count. You can reconstruct the context from other logs if needed.
Production Insight
A team was debugging a hallucination issue. They logged the full context and found a PII leak. They switched to hashing and never had that problem again. The lesson: log metadata, not content.
Key Takeaway
Monitor token count and context composition in production. Log hashes, not full context. Alert on thresholds.
● Production incidentPOST-MORTEMseverity: high
The $12k/Week Token Waste Incident
Symptom
Cloud cost dashboard showed a 4x spike in OpenAI API spend over two weeks. P99 latency jumped from 2s to 8s. Users reported 'the agent forgot my name mid-conversation.'
Assumption
The team assumed the 128K context window was 'unlimited' and could just append every message forever. They thought token cost scaled linearly with conversation length.
Root cause
The agent loop re-injected the full conversation history (including all tool outputs) on every turn. After 5 turns, the context was 60K tokens — 80% of which was irrelevant tool call results. The model spent 80% of its attention budget on noise, causing it to miss the user's name in the system prompt.
Fix
1. Set a hard token budget of 8K for conversation history. Anything beyond that gets summarized by a separate LLM call.
2. Trim tool outputs to a maximum of 200 tokens each. If the tool returns more, truncate with a '... (truncated)' note.
3. Cache tool results per session so repeated calls don't re-inject the same data.
4. Add a token counter to every turn and log a warning if context exceeds 80% of the model's limit.
Key lesson
Budget tokens aggressively: allocate by priority (system > tools > history > RAG) and enforce limits with code.
Never trust the context window size; the model's effective working memory is much smaller — treat 128K as 8K for critical info.
Monitor token usage per turn and alert on spikes; a flat cost curve is a sign of a leaky context pipeline.
Production debug guideWhen the agent starts hallucinating at 2am.4 entries
Symptom · 01
Model ignores system prompt after a few turns
→
Fix
Log the token count of the context window on each turn. Check if history exceeds 20% of the model's limit. Use tiktoken to count tokens: len(encoding.encode(context)). If history > 20% of context, implement a sliding window or summarization.
Symptom · 02
Sudden spike in latency or cost
→
Fix
Inspect the last N tool outputs. Are they huge? Check if a tool returned a 10MB JSON blob. Add a max token limit per tool output and truncate. Also check if the RAG pipeline is injecting too many docs — set a hard limit on retrieved chunks.
Symptom · 03
Output becomes repetitive or loops
→
Fix
Check for context rot: the model might be seeing its own previous outputs in the history. Ensure you're not appending the model's response to history before the next turn — deduplicate. Also check if the system prompt has drifted (e.g., a tool injected extra instructions).
Symptom · 04
Model returns irrelevant or hallucinated information
→
Fix
Log the last 500 tokens of the context window. Is the relevant info still there? If the context is full of noise, the model can't find the signal. Use a retrieval precision metric: if recall@5 < 0.8, your RAG pipeline is injecting noise.
★ Context Engineering for LLMs Triage Cheat SheetCopy-paste diagnostics. When it's 2am and you need answers fast.
Set a maximum history token budget: if history_tokens > 8000: history = summarize(history)
Cost spike+
Immediate action
Check tool output sizes and RAG doc count
Commands
python -c "import json; data=json.load(open('tool_outputs.json')); print(max(len(d['output']) for d in data))"
python -c "print('Avg tool output tokens:', sum(len(d['output']) for d in data)/len(data))"
Fix now
Truncate all tool outputs to 200 tokens: output = output[:200]
Repetitive output+
Immediate action
Check for duplicate history entries
Commands
python -c "messages = json.load(open('history.json')); print(len(messages), len(set(m['content'] for m in messages)))"
python -c "from collections import Counter; c = Counter(m['content'] for m in messages); print(c.most_common(3))"
Fix now
Deduplicate history: history = list({m['content']: m for m in history}.values())
Hallucination+
Immediate action
Check RAG retrieval precision
Commands
python -c "from your_rag import retrieve; docs = retrieve('query', k=5); print([d['relevance_score'] for d in docs])"
python -c "print('Recall@5:', sum(1 for d in docs if d['relevance_score'] > 0.8)/5)"
Fix now
Increase retrieval threshold: docs = [d for d in docs if d['relevance_score'] > 0.8]
Context Engineering vs. Prompt Engineering vs. Fine-Tuning
Concern
Context Engineering
Prompt Engineering
Fine-Tuning
Recommendation
Cost control
Direct: token budgets, max_tokens, truncation
Indirect: shorter prompts may reduce tokens
High upfront cost, lower per-token cost
Context engineering for immediate cost control
Response quality
Ensures context fits, avoids truncation loss
Improves instruction following
Customizes model behavior
Combine context + prompt engineering first
Implementation effort
Low: add token counting and truncation
Low: iterate on prompts
High: data collection, training, evaluation
Start with context engineering
Scalability
Essential for 10K+ RPM
Doesn't address token waste
Good for specialized tasks at scale
Context engineering is prerequisite
Debugging
Metrics-driven: token counts, truncation logs
Qualitative: A/B test prompts
Requires eval set
Context engineering gives actionable metrics
Key takeaways
1
Always set max_tokens explicitly—defaults can be 4096+ tokens, and a runaway completion on a 10K RPM workload costs $12k/week at GPT-4 prices.
2
Context engineering is about token budgeting
pre-compute the exact input context size, reserve tokens for output, and truncate or chunk before the call, not after.
3
Use a sliding window with token counters (e.g., tiktoken) to keep context under model limits—don't rely on the model to tell you it's full.
4
Monitor prompt_tokens and completion_tokens per request in your observability stack; alert when completion_tokens exceeds 90% of your budget.
5
For high-throughput (10K RPM), batch context pre-processing off the critical path with a sidecar process—don't do token counting inline in the request handler.
Common mistakes to avoid
4 patterns
×
Missing max_tokens parameter
Symptom
LLM returns 4000+ token completions on simple queries, costing $0.06+ per call instead of $0.01. At 10K RPM, that's $12k/week waste.
Fix
Set max_tokens to the minimum viable output length (e.g., 150 for classification, 500 for summarization). Use tiktoken to estimate before the call.
×
Not truncating input context to model limit
Symptom
Model silently drops tokens beyond its context window, losing critical instructions or data. Output becomes hallucinated or incomplete.
Fix
Before every call, count input tokens with tiktoken, truncate to model_max_tokens - max_tokens - safety_margin (e.g., 100 tokens). Use a sliding window over the most recent/relevant content.
×
Assuming context is stateless across calls
Symptom
Conversation history grows unbounded, causing context overflow and erratic behavior. Cost per conversation skyrockets.
Fix
Implement a fixed-size context buffer (e.g., last 10 turns). Evict oldest turns when token budget is exceeded. Log context size per turn for debugging.
×
No monitoring on token usage per request
Symptom
You don't know which users or endpoints are driving token costs. A single rogue integration can silently burn $5k/month.
Fix
Emit prompt_tokens, completion_tokens, and total_tokens as metrics to Datadog/Prometheus. Set alerts on p99 completion_tokens > 500.
INTERVIEW PREP · PRACTICE MODE
Interview Questions on This Topic
Q01SENIOR
How would you design a system to handle 10K LLM requests per minute whil...
Q02SENIOR
What happens when the input context exceeds the model's context window? ...
Q03JUNIOR
Explain the difference between prompt engineering and context engineerin...
Q04SENIOR
How would you debug a sudden spike in LLM costs?
Q05SENIOR
Design a context-aware agent that maintains conversation history across ...
Q01 of 05SENIOR
How would you design a system to handle 10K LLM requests per minute while keeping costs predictable?
ANSWER
Start with context engineering: pre-compute token budgets per request type using tiktoken. Use a fixed-size sliding window for conversation history. Set max_tokens per endpoint. Offload token counting to a sidecar process. Use request batching where possible. Monitor token usage per endpoint and set hard caps. Implement a circuit breaker if cost per minute exceeds threshold.
Q02 of 05SENIOR
What happens when the input context exceeds the model's context window? How do you handle it?
ANSWER
The model silently truncates from the beginning, losing early context. To handle it: count tokens with tiktoken before the call, truncate to model_limit - max_tokens - 100 safety margin. Use a sliding window that keeps the most recent N tokens. For long documents, chunk and summarize each chunk, then concatenate summaries. Log truncation events for debugging.
Q03 of 05JUNIOR
Explain the difference between prompt engineering and context engineering.
ANSWER
Prompt engineering focuses on the wording and structure of the instruction to get better responses. Context engineering focuses on the token budget: how many tokens go in, how many come out, and how to manage the window. Prompt engineering is qualitative; context engineering is quantitative and operational. Both are needed, but context engineering directly impacts cost and reliability at scale.
Q04 of 05SENIOR
How would you debug a sudden spike in LLM costs?
ANSWER
First, check if max_tokens was accidentally removed or set too high. Then, look at per-endpoint metrics: which endpoint has the highest average completion_tokens? Check if a new user or integration is sending very long prompts. Examine logs for truncation events—if context is overflowing, the model may be generating more to compensate. Finally, check if the model version changed (e.g., GPT-4-32k vs GPT-4).
Q05 of 05SENIOR
Design a context-aware agent that maintains conversation history across multiple turns without exceeding the token limit.
ANSWER
Use a fixed-size circular buffer of turns. Each turn stores the user message and assistant response. Before adding a new turn, count total tokens with tiktoken. If adding the new turn would exceed the budget, evict the oldest turn(s) until the budget fits. Optionally, summarize older turns into a single 'history summary' token. Always set max_tokens per turn. Log the number of evictions per session for monitoring.
01
How would you design a system to handle 10K LLM requests per minute while keeping costs predictable?
SENIOR
02
What happens when the input context exceeds the model's context window? How do you handle it?
SENIOR
03
Explain the difference between prompt engineering and context engineering.
JUNIOR
04
How would you debug a sudden spike in LLM costs?
SENIOR
05
Design a context-aware agent that maintains conversation history across multiple turns without exceeding the token limit.
SENIOR
FAQ · 5 QUESTIONS
Frequently Asked Questions
01
What is context engineering for LLMs?
Context engineering is the systematic control of the input and output token budget for LLM calls. It includes pre-processing context (truncation, chunking, sliding windows), setting max_tokens explicitly, and monitoring token usage in production. It's distinct from prompt engineering, which focuses on phrasing.
Was this helpful?
02
How do I calculate the right max_tokens value?
Estimate the maximum output length your use case needs. For a classification task, 50-100 tokens. For a summarization of a 500-word document, 200-300 tokens. Use tiktoken to count the prompt, then set max_tokens = min(desired_output, model_limit - prompt_tokens - safety_margin). Never leave it unset.
Was this helpful?
03
What happens if I don't set max_tokens?
The API defaults to the model's maximum output limit (e.g., 4096 for GPT-4). The model will generate until it hits that limit or an end-of-text token, wasting tokens on verbose, repetitive, or hallucinated content. At scale, this multiplies costs dramatically.
Was this helpful?
04
How do I monitor token waste in production?
Log prompt_tokens and completion_tokens from every API response. Aggregate by endpoint, user, and model. Set a dashboard with cost per request (tokens * price per token). Alert when average completion_tokens exceeds 2x your expected value.
Was this helpful?
05
Should I use context engineering for all LLM calls?
Yes, for any production system with cost or latency constraints. For one-off experiments, it's optional. For any call at scale, failing to engineer context is a financial and reliability risk.