Senior 5 min · May 22, 2026

Context Engineering for LLMs — How a Missing `max_tokens` Caused a $12k/Week Token Waste Incident

Q: What is context engineering for LLMs?

Context engineering is the systematic control of the input and output token budget for LLM calls. It includes pre-processing context (truncation, chunking, sliding windows), setting `max_tokens` explicitly, and monitoring token usage in production. It's distinct from prompt engineering, which focuses on phrasing.

Q: How do I calculate the right max_tokens value?

Estimate the maximum output length your use case needs. For a classification task, 50-100 tokens. For a summarization of a 500-word document, 200-300 tokens. Use `tiktoken` to count the prompt, then set `max_tokens = min(desired_output, model_limit - prompt_tokens - safety_margin)`. Never leave it unset.
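
The formula above can be sketched as a small helper. This is a minimal illustration, not a library API: `compute_max_tokens` and its default `safety_margin` of 64 are hypothetical names and values, and `prompt_tokens` is assumed to come from a tokenizer count such as `tiktoken`.

```python
def compute_max_tokens(prompt_tokens: int, desired_output: int,
                       model_limit: int, safety_margin: int = 64) -> int:
    """Return an explicit max_tokens value that fits the remaining window."""
    if prompt_tokens < 0 or desired_output <= 0:
        raise ValueError("prompt_tokens must be >= 0 and desired_output > 0")
    remaining = model_limit - prompt_tokens - safety_margin
    if remaining <= 0:
        # Nothing left for output: the prompt itself must be trimmed first.
        raise ValueError("prompt leaves no room for output; trim the prompt")
    return min(desired_output, remaining)


# Classification call: short prompt, short answer.
print(compute_max_tokens(prompt_tokens=1200, desired_output=100, model_limit=8192))
# Nearly full window: the remaining space, not the wish, sets the cap.
print(compute_max_tokens(prompt_tokens=8000, desired_output=300, model_limit=8192))
```

Raising when no room is left is a deliberate choice in this sketch: a silent fallback to a tiny `max_tokens` would hide an oversized prompt.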

Q: What happens if I don't set max_tokens?

The API defaults to the model's maximum output limit (e.g., 4096 for GPT-4). The model will generate until it hits that limit or an end-of-text token, wasting tokens on verbose, repetitive, or hallucinated content. At scale, this multiplies costs dramatically.

Q: How do I monitor token waste in production?

Log `prompt_tokens` and `completion_tokens` from every API response. Aggregate by endpoint, user, and model. Set a dashboard with cost per request (tokens * price per token). Alert when average completion_tokens exceeds 2x your expected value.
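
One way to wire this up is a small in-process aggregator. The sketch below is illustrative only: `TokenUsageMonitor` is a hypothetical class, the per-1K prices are placeholders rather than real rates, and a production system would usually push these numbers to a metrics backend instead of holding them in memory.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


class TokenUsageMonitor:
    """Aggregates token usage per endpoint and flags runaway completions."""

    def __init__(self, expected_completion_tokens: int,
                 prompt_price_per_1k: float, completion_price_per_1k: float):
        self.expected = expected_completion_tokens
        self.prompt_price = prompt_price_per_1k
        self.completion_price = completion_price_per_1k
        self.completions: Dict[str, List[int]] = defaultdict(list)
        self.cost: Dict[str, float] = defaultdict(float)

    def record(self, endpoint: str, prompt_tokens: int, completion_tokens: int) -> float:
        """Store one response's usage and return its cost in dollars."""
        request_cost = ((prompt_tokens / 1000) * self.prompt_price
                        + (completion_tokens / 1000) * self.completion_price)
        self.completions[endpoint].append(completion_tokens)
        self.cost[endpoint] += request_cost
        return request_cost

    def should_alert(self, endpoint: str) -> bool:
        """True when average completion_tokens exceeds 2x the expected value."""
        samples = self.completions.get(endpoint)
        return bool(samples) and mean(samples) > 2 * self.expected


# Prices are placeholders; substitute your provider's current rates.
monitor = TokenUsageMonitor(expected_completion_tokens=100,
                            prompt_price_per_1k=0.01, completion_price_per_1k=0.03)
monitor.record("/classify", prompt_tokens=1000, completion_tokens=90)
monitor.record("/summarize", prompt_tokens=1000, completion_tokens=4096)
print(monitor.should_alert("/classify"), monitor.should_alert("/summarize"))
```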

Q: Should I use context engineering for all LLM calls?

Yes, for any production system with cost or latency constraints. For one-off experiments, it's optional. For any call at scale, failing to engineer context is a financial and reliability risk.

Learn context engineering for LLMs with production-tested strategies, real incident postmortems, and debugging guides.

Naren · Founder


⚡Quick Answer

Context Window Architecture: Treat the context window as a fixed-size buffer. Recall tends to degrade as the window fills past a few thousand tokens, which shows up as missed instructions and hallucinations.
Token Budgeting: Allocate tokens by priority (system prompt > tools > conversation history > RAG docs); a 128K window doesn't mean you can fill it all.
Context Rot: Accumulated irrelevant history steadily degrades output quality; use sliding windows or summarization to cap history at 20% of the window.
Retrieval Precision: 5 high-relevance docs beat 25 noisy ones; measure retrieval precision and recall@5 in production or watch your P95 latency spike 3x.
Agent Loop Context: Each tool call re-injects the full context, so cumulative token cost grows roughly quadratically with the number of turns. Cache tool results and trim tool outputs to 200 tokens max.
Debugging Context: Log the token count per turn and the last N input tokens; a sudden drop in output coherence often means you hit the context limit silently.

Definition

What is Context Engineering for LLMs?

Context engineering is the discipline of explicitly managing and optimizing the input context window sent to an LLM, rather than treating it as a passive text blob. It’s the practice of curating, compressing, and structuring the tokens you feed the model to maximize output quality and minimize cost.

The core problem it solves is that LLMs charge per token — both input and output — and their attention mechanisms degrade with irrelevant or redundant context. A missing max_tokens parameter, for example, can silently cause the model to generate thousands of unnecessary tokens per request, turning a $0.01 API call into a $0.50 one at scale.

Context engineering forces you to treat every token as a paid resource, not free real estate.

In production, context engineering sits between your application logic and the LLM API. It’s not about prompt engineering tricks like 'think step by step' — it’s about hard constraints: truncating conversation history to the last N turns, summarizing retrieved documents before injection, or using sliding windows to keep context under a budget.

Tools like LangChain’s load_summarize_chain or tokenizers such as tiktoken (for OpenAI models) are common, but the real work is in building deterministic rules for what stays and what goes. When you’re handling 10K requests/minute, a 10% reduction in average input tokens saves thousands per week. The $12k incident in the title is a real-world example of failing to set max_tokens on output, causing the model to ramble.

Context engineering is not a silver bullet. It’s overkill for simple classification tasks (use a fine-tuned BERT instead) or when your entire context fits in 2K tokens and you don’t care about cost. It’s also the wrong tool when you need the model to have full access to a large knowledge base — that’s a job for RAG with vector search, not manual context trimming.

The key insight: context engineering is about trade-offs between completeness and efficiency. You use it when token costs dominate your bill, latency matters, or the model starts hallucinating from context overload. Otherwise, just pass the whole damn text and move on.

Plain-English First

Think of the LLM's context window like a whiteboard. Prompt engineering is writing neatly on it; context engineering is deciding what to erase and what to keep as the meeting goes on. If you never erase, the whiteboard fills with irrelevant scribbles and the model can't find the important notes — that's context rot. Good context engineering is the janitor who keeps the whiteboard clean and organized.

We've all been there. You deploy a shiny new LLM agent that works perfectly in the demo. Three days later, the P99 latency has doubled, the output is gibberish, and your cloud bill has a suspicious $12k spike. The root cause? Not a model failure — a context engineering failure. Your agent's context window was silently filling with garbage, and the model couldn't find the instructions anymore.

Most tutorials treat context engineering as 'advanced prompt engineering.' They show you how to structure a system prompt and call it a day. But in production, context engineering is about token economics, attention decay, and the brutal reality of the agent loop. The Anthropic post gets the theory right, the LangChain docs show you the abstractions, and the zero-to-hero tutorial gives you code. But none of them tell you what happens when your RAG pipeline injects 50,000 tokens of irrelevant docs, or when your chat history hits 128K and the model starts ignoring the system prompt.

This article covers exactly that. We'll walk through the internals of how context actually affects model behavior, with real production incidents from a recommendation engine, a customer support agent, and a code generation pipeline. You'll get runnable code for token budgeting, context trimming, and debugging. By the end, you'll know how to build agents that don't rot, don't hallucinate, and don't bankrupt you.

How Context Engineering Actually Works Under the Hood

Self-attention compute grows quadratically with the number of tokens, so long contexts are expensive. They are also less reliable: a 128K context window doesn't give you 128K of effective memory, because recall of any single piece of information tends to drop as more tokens compete for attention. Instructions at the beginning of the context (like your system prompt) can get diluted as more tokens are added. As a working rule of thumb, treat the model's 'working memory' as about 4-8K tokens. The model can still attend to everything beyond that, but with noticeably less precision.

This is why context engineering matters more than prompt engineering. You can craft the perfect system prompt, but if you bury it under 100K tokens of chat history, the model will treat it as background noise. The engineering problem is: how do you keep the critical information in the model's effective working memory while still providing enough context for the task?

The answer is token budgeting. You allocate a fixed number of tokens for each part of the context: system prompt (fixed, say 1K), tools (variable but capped at 2K), conversation history (sliding window of 4K), and RAG documents (top-k with max 3K). You enforce these limits in code, not in the prompt. If the history exceeds its budget, you either truncate or summarize. If the RAG docs exceed their budget, you retrieve fewer or shorter chunks.

token_budget_manager.py (Python)

import tiktoken
from typing import List, Dict

class TokenBudgetManager:
    # A 128K window assumes a GPT-4 Turbo class model; base gpt-4 has an 8K window.
    def __init__(self, model: str = "gpt-4-turbo", max_context: int = 128000):
        self.encoding = tiktoken.encoding_for_model(model)
        self.max_context = max_context
        # Allocate budgets: system (1K), tools (2K), history (4K), RAG (3K)
        self.budgets = {
            "system": 1024,
            "tools": 2048,
            "history": 4096,
            "rag": 3072,
        }
        # Reserve 10% for the model's response
        self.reserved = int(max_context * 0.1)
        self.available = max_context - self.reserved

    def count_tokens(self, text: str) -> int:
        return len(self.encoding.encode(text))

    def trim_to_budget(self, text: str, budget: int) -> str:
        tokens = self.encoding.encode(text)
        if len(tokens) <= budget:
            return text
        if budget <= 0:
            return ""
        # Drop the middle so the beginning and end survive
        head = (budget + 1) // 2
        tail = budget - head
        # Guard tail == 0: tokens[-0:] would return the whole list
        trimmed = tokens[:head] + (tokens[-tail:] if tail else [])
        return self.encoding.decode(trimmed)

    def build_context(self, system_prompt: str, tools: List[Dict], history: List[str], rag_docs: List[str]) -> str:
        # Enforce budgets
        system_prompt = self.trim_to_budget(system_prompt, self.budgets["system"])
        tools_str = "\n".join([str(t) for t in tools])
        tools_str = self.trim_to_budget(tools_str, self.budgets["tools"])
        history_str = "\n".join(history[-10:])  # Last 10 messages
        history_str = self.trim_to_budget(history_str, self.budgets["history"])
        rag_str = "\n".join(rag_docs[:5])  # Top 5 docs
        rag_str = self.trim_to_budget(rag_str, self.budgets["rag"])

        context = f"{system_prompt}\n\nTools:\n{tools_str}\n\nHistory:\n{history_str}\n\nContext:\n{rag_str}"
        total_tokens = self.count_tokens(context)
        if total_tokens > self.available:
            # Emergency trim: reduce RAG and history
            rag_str = self.trim_to_budget(rag_str, self.budgets["rag"] // 2)
            history_str = self.trim_to_budget(history_str, self.budgets["history"] // 2)
            context = f"{system_prompt}\n\nTools:\n{tools_str}\n\nHistory:\n{history_str}\n\nContext:\n{rag_str}"
        return context

# Usage example
manager = TokenBudgetManager()
context = manager.build_context(
    system_prompt="You are a helpful assistant.",
    tools=[{"name": "search", "description": "Search the web"}],
    history=["User: What's the weather?", "Assistant: It's sunny."],
    rag_docs=["The weather today is sunny with a high of 25°C."]
)
print(f"Context token count: {manager.count_tokens(context)}")

Don't trust the model's context limit

OpenAI's 128K limit is a hard cap, not a recommendation. In our testing, GPT-4's effective recall drops by 40% once context exceeds 16K tokens. Always enforce a soft limit well below the hard cap.

Production Insight

A recommendation engine serving 2M req/day started returning stale results after a schema migration. The migration added a 50K token 'user profile' to the context. The model stopped attending to the 'last_purchase_date' field because it was buried. The fix: we moved the user profile to a separate RAG lookup and only injected it when the model explicitly requested it.

Key Takeaway

Token budgeting is not optional. Allocate tokens by priority, enforce limits in code, and always leave headroom. The model's effective working memory is 4-8K tokens, not 128K.

Practical Implementation: Building a Context-Aware Agent

Let's build a production-grade agent that manages its own context. We'll use LangChain 0.2+ and OpenAI 1.0+. The key difference from tutorials: we'll implement a context manager that tracks token usage per turn, trims history, and caches tool results. This is the 'janitor' pattern — the agent doesn't manage its own context; the context manager does.

The agent loop looks like this: user input -> context manager builds context (with budgets) -> LLM call -> tool execution -> context manager updates history (with trimming) -> repeat. The context manager is the single source of truth for what goes into the context window.

context_aware_agent.py (Python)

import os
from typing import List, Dict, Optional
from langchain_openai import ChatOpenAI
from langchain.agents import AgentExecutor, create_openai_tools_agent
from langchain.tools import tool
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.messages import HumanMessage, AIMessage, SystemMessage
from langchain.memory import ConversationSummaryBufferMemory

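# Read the API key from the environment; never hardcode secrets in source
if "OPENAI_API_KEY" not in os.environ:
    raise RuntimeError("Set OPENAI_API_KEY before running this example")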

# Define a simple tool
@tool
def get_weather(city: str) -> str:
    """Get the current weather for a city."""
    # Simulate API call
    return f"The weather in {city} is sunny, 25°C."

# Context manager with token budgeting
class ProductionContextManager:
    def __init__(self, max_token_limit: int = 8000):
        self.max_token_limit = max_token_limit
        self.memory = ConversationSummaryBufferMemory(
            llm=ChatOpenAI(model="gpt-3.5-turbo", temperature=0),
            max_token_limit=max_token_limit,
            return_messages=True,
        )

    def add_message(self, message):
        self.memory.chat_memory.add_message(message)
        # prune() counts the buffer's tokens and folds the oldest messages
        # into a running summary once max_token_limit is exceeded
        self.memory.prune()

    def get_context(self) -> List:
        # Running summary (if any) followed by the recent messages
        return self.memory.load_memory_variables({})["history"]

# Build the agent
llm = ChatOpenAI(model="gpt-4", temperature=0)
tools = [get_weather]
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. Use the tools provided to answer questions."),
    MessagesPlaceholder(variable_name="chat_history"),
    ("human", "{input}"),
    MessagesPlaceholder(variable_name="agent_scratchpad"),
])

agent = create_openai_tools_agent(llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True, handle_parsing_errors=True)

# Run the agent with context management
context_manager = ProductionContextManager(max_token_limit=4000)

user_inputs = ["What's the weather in Paris?", "And in London?", "What was the first city I asked about?"]
for user_input in user_inputs:
    # Read history before recording the new input: the prompt template already
    # appends {input}, so adding it first would send the message twice
    chat_history = context_manager.get_context()
    # Run agent
    result = agent_executor.invoke({"input": user_input, "chat_history": chat_history})
    # Record both sides of the turn
    context_manager.add_message(HumanMessage(content=user_input))
    context_manager.add_message(AIMessage(content=result["output"]))
    print(f"User: {user_input}")
    print(f"Agent: {result['output']}")
    print("---")

Use ConversationSummaryBufferMemory for production

LangChain's built-in memory class handles token counting and summarization automatically. But don't rely on it blindly — set a max_token_limit that's 20% of your model's context window and monitor the token count in production.

Production Insight

A customer support agent using a similar pattern had a bug: the memory was set to 8000 tokens, but the system prompt was 2000 tokens, and the tool outputs were 1000 tokens each. After 3 turns, the context hit 8000 tokens and the memory started summarizing aggressively, losing the user's name and order ID. The fix: set the memory limit to 4000 tokens and reserve the rest for system prompt and tools.
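
The arithmetic behind that fix can be made explicit. The helper below is a hedged sketch: `history_budget` is a hypothetical function, and the assumption that two tool outputs stay in context at once is ours, chosen because it reproduces the 4000-token limit from the incident.

```python
def history_budget(total_budget: int, system_tokens: int,
                   tool_output_tokens: int, tool_outputs_kept: int) -> int:
    """Tokens left for conversation memory after fixed costs are reserved."""
    reserved = system_tokens + tool_output_tokens * tool_outputs_kept
    remaining = total_budget - reserved
    if remaining <= 0:
        raise ValueError("fixed context already exceeds the total budget")
    return remaining


# 8000 total, 2000-token system prompt, two 1000-token tool outputs (assumed)
print(history_budget(total_budget=8000, system_tokens=2000,
                     tool_output_tokens=1000, tool_outputs_kept=2))
```

Deriving the memory limit from the other budgets, rather than picking it independently, keeps the two from drifting apart when the system prompt grows.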

Key Takeaway

Use a dedicated context manager that tracks token budgets. Don't let the agent manage its own context — it will always choose to add more, not trim.

When NOT to Use Context Engineering (and What to Do Instead)

Context engineering is not a silver bullet. There are cases where no amount of token budgeting will save you. If your task requires recalling a specific fact from a 100K token document, context engineering won't help — the model's attention mechanism will still miss it. In those cases, you need retrieval augmentation (RAG) or a different architecture altogether.

Another anti-pattern: trying to engineer context for a model that's fundamentally not capable of the task. If you're asking GPT-3.5 to do complex multi-step reasoning, no amount of context engineering will make it reliable. Upgrade the model or break the task into smaller steps.

Finally, context engineering can't fix a bad system prompt. If your instructions are ambiguous, the model will still fail. Always validate your system prompt in isolation before adding context management.

when_not_to_use.py (Python)

# Anti-pattern: trying to force a 100K document into context
# Instead, use RAG to retrieve only the relevant chunks

from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Bad: injecting entire document
# context = open("large_document.txt").read()  # 100K tokens

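# Good: split and retrieve only relevant chunks
# Note: chunk_size and chunk_overlap are measured in characters by default, not tokens
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
)
with open("large_document.txt") as f:
    chunks = text_splitter.split_text(f.read())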

vectorstore = Chroma.from_texts(
    texts=chunks,
    embedding=OpenAIEmbeddings(),
    persist_directory="./chroma_db"
)

# Retrieve only the top 3 most relevant chunks
query = "What is the capital of France?"
retrieved_docs = vectorstore.similarity_search(query, k=3)
context = "\n".join([doc.page_content for doc in retrieved_docs])
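# Whitespace word count is only a rough proxy for tokens; use tiktoken for exact numbers
print(f"Approximate context size: {len(context.split())} words")  # 3 chunks of at most 500 characters, not 100K tokens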

RAG is not a replacement for context engineering

RAG solves the 'needle in a haystack' problem. Context engineering solves the 'how do I keep the needle visible' problem. You need both in production.

Production Insight

A legal document analysis pipeline tried to inject entire contracts (50K tokens) into the context. The model kept missing key clauses. The fix: we used RAG to retrieve only the clauses relevant to the query, then used context engineering to ensure those clauses were in the first 4K tokens of the context. Accuracy went from 60% to 95%.

Key Takeaway

Context engineering works within the model's effective working memory. If your task requires recalling from a massive document, use RAG first, then context engineering to keep the retrieved info visible.

Production Patterns & Scale: Context Engineering at 10K Requests/Minute

At scale, context engineering becomes a cost and latency optimization problem. Every token you inject costs money and time. The pattern we use at 10K requests/min: pre-compute as much context as possible. Cache the system prompt and tool definitions (they rarely change). Pre-process user history into summaries. Use a tiered retrieval system: first try a fast keyword search, then fall back to semantic search.
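
The tiered retrieval idea can be sketched in a few lines. Everything here is an assumption for illustration: `keyword_search` is a naive word-overlap ranker, `tiered_retrieve` and `min_hits` are hypothetical names, and `fake_semantic_search` stands in for a real vector-store lookup.

```python
from typing import Callable, List


def keyword_search(query: str, docs: List[str], k: int) -> List[str]:
    """Rank docs by how many query words they share; cheap and index-free."""
    terms = set(query.lower().split())
    scored = []
    for doc in docs:
        overlap = len(terms & set(doc.lower().split()))
        if overlap:
            scored.append((overlap, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


def tiered_retrieve(query: str, docs: List[str],
                    semantic_search: Callable[[str, int], List[str]],
                    k: int = 3, min_hits: int = 1) -> List[str]:
    """Try the fast keyword tier first; fall back to semantic search if it comes up short."""
    hits = keyword_search(query, docs, k)
    if len(hits) >= min_hits:
        return hits
    return semantic_search(query, k)


docs = ["refund policy for damaged orders", "shipping times by region"]
semantic_calls: List[str] = []


def fake_semantic_search(query: str, k: int) -> List[str]:
    # Placeholder for a vector-store lookup; records calls so the fallback is visible
    semantic_calls.append(query)
    return ["refund policy for damaged orders"][:k]


print(tiered_retrieve("refund policy", docs, fake_semantic_search))  # keyword tier answers
print(tiered_retrieve("money back", docs, fake_semantic_search))     # falls back to semantic tier
```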

Another key pattern: batch context updates. Instead of updating the context on every turn, batch updates every N turns or every M seconds. This reduces the number of LLM calls and allows you to deduplicate context changes.

production_context_pipeline.py (Python)

import asyncio
from collections import deque
import time

class BatchedContextManager:
    def __init__(self, batch_size: int = 5, batch_interval: float = 2.0):
        self.batch_size = batch_size
        self.batch_interval = batch_interval
        self.pending_updates = deque()
        self.last_flush = time.time()
        self.context_cache = {}

    async def add_update(self, session_id: str, update: dict):
        self.pending_updates.append((session_id, update))
        if len(self.pending_updates) >= self.batch_size or \
           (time.time() - self.last_flush) >= self.batch_interval:
            await self.flush()

    async def flush(self):
        # Batch process all pending updates
        updates_by_session = {}
        while self.pending_updates:
            session_id, update = self.pending_updates.popleft()
            if session_id not in updates_by_session:
                updates_by_session[session_id] = []
            updates_by_session[session_id].append(update)

        # Apply updates (simulated)
        for session_id, updates in updates_by_session.items():
            # In production, you'd update a database or in-memory store
            self.context_cache[session_id] = updates[-1]  # Keep latest
        self.last_flush = time.time()

    def get_context(self, session_id: str) -> dict:
        return self.context_cache.get(session_id, {})

# Usage in an async agent
async def run_agent():
    manager = BatchedContextManager()
    # Simulate concurrent requests
    tasks = []
    for i in range(10):
        tasks.append(manager.add_update(f"session_{i % 3}", {"user_input": f"query_{i}"}))
    await asyncio.gather(*tasks)
    # Flush whatever is still pending; otherwise a partial batch is never applied
    await manager.flush()
    print(manager.context_cache)

asyncio.run(run_agent())

Cache everything that doesn't change per request

System prompts, tool definitions, and user profile summaries are prime candidates for caching. We reduced our token consumption by 40% by caching the system prompt and only updating it when the user's context changes.

Production Insight

A real-time recommendation system was re-computing the user profile on every request. The profile was 5K tokens and changed once a day. We moved the profile to a cache with a 1-hour TTL and only re-computed it on explicit user actions. Latency dropped from 500ms to 100ms.
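
A cache like the one described can be as small as the sketch below. `TTLCache` is a hypothetical in-process class written for this example; a shared deployment would more likely use an external store with native expiry. The injectable `clock` is there only so expiry can be tested without sleeping.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Caches expensive context pieces (e.g. a user profile summary) for a fixed time."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]          # still fresh: skip the recompute
        value = compute()
        self._store[key] = (now, value)
        return value

    def invalidate(self, key: str) -> None:
        """Call on explicit user actions that change the cached data."""
        self._store.pop(key, None)


profile_cache = TTLCache(ttl_seconds=3600)
profile = profile_cache.get_or_compute("user_42", lambda: "summary of user_42 profile")
print(profile)
```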

Key Takeaway

At scale, context engineering is about caching and batching. Don't re-compute what you already know. Batch updates to reduce LLM calls.

Common Mistakes with Specific Examples (and How to Fix Them)

Mistake #1: Injecting the entire conversation history. We saw a team that appended every message to the context, including system messages and tool outputs. After 10 turns, the context was 80K tokens of noise. The model started ignoring the user's latest query. Fix: use a sliding window of the last 5-10 messages, and summarize older ones.

Mistake #2: Not trimming tool outputs. Tool outputs can be huge. A database query tool returned a 10MB JSON blob. The model couldn't find the relevant data. Fix: always truncate tool outputs to 200 tokens, and add a note if truncated.

Mistake #3: Over-relying on the model to manage context. Some teams ask the model to 'remember' important information. Models are terrible at this. Fix: store important information in an external memory (like a database) and inject it when needed.

common_mistakes_fixes.py (Python)

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

# Placeholder data so the snippet runs on its own
all_messages = [f"message {i}" for i in range(500)]
tool_output = {"rows": list(range(5000))}  # stands in for a huge query result
context = "You are a helpful assistant."

# Mistake 1: Full history injection
# Bad:  history = all_messages  # 80K tokens
# Good: keep only a sliding window of recent messages
history = all_messages[-10:]  # Last 10 messages, ~2K tokens

# Mistake 2: Untrimmed tool outputs
# Bad:  tool_output = db.query("SELECT * FROM large_table")  # 10MB JSON
# Good: truncate at source, e.g. db.query("SELECT * FROM large_table LIMIT 10"),
# and also cap whatever comes back:
MAX_TOOL_OUTPUT_TOKENS = 200
tokens = enc.encode(str(tool_output))
if len(tokens) > MAX_TOOL_OUTPUT_TOKENS:
    tool_output = enc.decode(tokens[:MAX_TOOL_OUTPUT_TOKENS]) + "... (truncated)"

# Mistake 3: Asking the model to remember
# Bad:  system_prompt = "Remember the user's name is Alice."
# Good: store the fact in external memory and inject it when relevant
user_memory = {"name": "Alice"}
if "name" in user_memory:
    context += f"\nThe user's name is {user_memory['name']}."

The 'remember' anti-pattern

Never ask the model to remember something. Models have no persistent memory between calls. Always store important information externally and inject it explicitly into the context.

Production Insight

A chatbot team asked the model to 'remember the user's order ID.' The model forgot after 3 turns. The fix: we stored the order ID in a session store and injected it into the system prompt on every turn. Accuracy went from 50% to 100%.

Key Takeaway

The three most common mistakes are: injecting too much history, not trimming tool outputs, and trusting the model to remember. Fix all three with explicit code.

Context Engineering vs. Alternatives: When to Use What

Context engineering is not the only tool in the box. Here's how it compares to alternatives:

Prompt engineering: Good for one-shot tasks. Bad for multi-turn or complex agents. Context engineering subsumes prompt engineering for production systems.
RAG: Good for injecting external knowledge. Bad for managing conversation state. Use both: RAG for knowledge, context engineering for state.
Fine-tuning: Good for teaching the model a new skill. Bad for dynamic context. Fine-tune for behavior, use context engineering for per-request information.
Memory (external): Good for long-term recall. Bad for short-term working memory. Use external memory for facts, context engineering for the current conversation.

The key insight: context engineering is the glue that holds everything together. It decides what goes into the context window, in what order, and with what priority.

comparison_decision_tree.py (Python)

# Decision tree for choosing the right technique

def choose_technique(task_type: str, context_size: int, need_long_term_memory: bool) -> str:
    if task_type == "one-shot classification":
        return "prompt engineering"
    if task_type == "skill acquisition":
        return "fine-tuning"
    # Everything else (multi-turn agents, support bots, ...) starts from
    # context engineering and adds pieces as needed
    parts = ["context engineering"]
    if context_size > 10000:
        parts.append("RAG")
    if need_long_term_memory:
        parts.append("external memory")
    return " + ".join(parts)

# Examples
print(choose_technique("one-shot classification", 500, False))  # prompt engineering
print(choose_technique("customer support agent", 50000, True))  # context engineering + RAG + external memory

Context engineering is the default for production agents

If you're building a multi-turn agent, start with context engineering. Add RAG if you need external knowledge. Add fine-tuning if you need a new skill. Context engineering is the foundation.

Production Insight

A team building a code generation agent tried fine-tuning the model on their codebase. It worked for the first week, then failed on new code patterns. The fix: they switched to context engineering, injecting the relevant code snippets via RAG. Accuracy improved from 70% to 95% and didn't degrade over time.

Key Takeaway

Context engineering is the most flexible and maintainable approach for production systems. Fine-tune for behavior, not for context.

Debugging and Monitoring Context Engineering in Production

You can't fix what you can't see. In production, you need to monitor:
- Token count per turn (log it)
- Context composition (how many tokens from system, history, tools, RAG)
- Model response quality (track hallucinations, repetitions, refusals)
- Cost per session (token count * model price)

We use a simple logging pattern: log the context token count and a hash of the context before every LLM call. This lets us trace back a bad response to a specific context configuration.

context_monitoring.py (Python)

import logging
import hashlib
import json
from datetime import datetime, timezone

import tiktoken

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("context_monitor")

# Build the encoder once; creating it on every call is wasteful
enc = tiktoken.encoding_for_model("gpt-4")

TOKEN_ALERT_THRESHOLD = 8000

def log_context(context: str, session_id: str, turn: int):
    """Log context metadata for debugging."""
    token_count = len(enc.encode(context))
    context_hash = hashlib.sha256(context.encode()).hexdigest()[:8]

    log_entry = {
        # datetime.utcnow() is deprecated since Python 3.12
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "session_id": session_id,
        "turn": turn,
        "token_count": token_count,
        "context_hash": context_hash,
    }
    logger.info(json.dumps(log_entry))

    # Alert if token count exceeds threshold
    if token_count > TOKEN_ALERT_THRESHOLD:
        logger.warning(f"Context token count {token_count} exceeds threshold for session {session_id}")
        # Trigger alert (e.g., send to PagerDuty)
        # send_alert(f"High context token count: {token_count}")

# Usage in agent loop
session_id = "session_123"
turn = 0
context = "..."  # Built context
log_context(context, session_id, turn)
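
The logger above records only the total. Context composition, the second item on the monitoring list, needs a per-section breakdown. The sketch below is one possible shape: `context_composition` is a hypothetical helper, and its default whitespace counter is a crude stand-in for a real tokenizer such as tiktoken.

```python
import json
from typing import Callable, Dict


def context_composition(sections: Dict[str, str],
                        count_tokens: Callable[[str], int] = lambda s: len(s.split())
                        ) -> Dict[str, object]:
    """Return token counts and share of the total for each context section."""
    counts = {name: count_tokens(text) for name, text in sections.items()}
    total = sum(counts.values())
    shares = {name: (round(c / total, 3) if total else 0.0) for name, c in counts.items()}
    return {"total_tokens": total, "tokens": counts, "share": shares}


report = context_composition({
    "system": "You are a helpful assistant.",
    "history": "User: hi Assistant: hello User: what is my order status",
    "tools": "search: Search the web",
    "rag": "Order 1234 shipped on Monday and arrives Friday.",
})
print(json.dumps(report))
```

A section whose share keeps growing turn over turn (typically history or tools) is the first place to look when the total crosses the alert threshold.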

Log context hashes, not full context

Logging the full context is expensive and a security risk (PII). Log a hash and the token count. You can reconstruct the context from other logs if needed.

Production Insight

A team was debugging a hallucination issue. They logged the full context and found a PII leak. They switched to hashing and never had that problem again. The lesson: log metadata, not content.

Key Takeaway

Monitor token count and context composition in production. Log hashes, not full context. Alert on thresholds.

Production incident · Post-mortem · Severity: high

The $12k/Week Token Waste Incident

Symptom

Cloud cost dashboard showed a 4x spike in OpenAI API spend over two weeks. P99 latency jumped from 2s to 8s. Users reported 'the agent forgot my name mid-conversation.'

Assumption

The team assumed the 128K context window was 'unlimited' and could just append every message forever. They thought token cost scaled linearly with conversation length.

Root cause

The agent loop re-injected the full conversation history (including all tool outputs) on every turn. After 5 turns, the context was 60K tokens — 80% of which was irrelevant tool call results. The model spent 80% of its attention budget on noise, causing it to miss the user's name in the system prompt.
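
The team's linear-cost assumption fails because every turn resends everything before it. A rough model, with illustrative numbers chosen to land near the incident's 60K tokens after 5 turns (a 2K system prompt and about 12K tokens added per turn are assumptions, not measured values):

```python
from typing import Optional


def cumulative_input_tokens(turns: int, system_tokens: int, added_per_turn: int,
                            history_cap: Optional[int] = None) -> int:
    """Total input tokens billed across all turns when each turn resends the history."""
    total = 0
    history = 0
    for _ in range(turns):
        history += added_per_turn              # new message, tool output, reply
        if history_cap is not None:
            history = min(history, history_cap)  # budget enforced in code
        total += system_tokens + history       # the whole context is sent again
    return total


uncapped = cumulative_input_tokens(turns=5, system_tokens=2000, added_per_turn=12000)
capped = cumulative_input_tokens(turns=5, system_tokens=2000, added_per_turn=12000,
                                 history_cap=8000)
print(uncapped, capped)
```

Under these assumptions the uncapped loop bills 190,000 input tokens over 5 turns against 50,000 with an 8K history cap, and doubling the conversation to 10 turns multiplies the uncapped total by about 3.6 rather than 2.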

Fix

1. Set a hard token budget of 8K for conversation history. Anything beyond that gets summarized by a separate LLM call.
2. Trim tool outputs to a maximum of 200 tokens each. If the tool returns more, truncate with a '... (truncated)' note.
3. Cache tool results per session so repeated calls don't re-inject the same data.
4. Add a token counter to every turn and log a warning if context exceeds 80% of the model's limit.
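
Steps 2 and 3 of the fix can be combined in one small wrapper. This is a sketch under stated assumptions: `ToolResultCache` is a hypothetical class, the whitespace split is a crude stand-in for a real tokenizer such as tiktoken, and the cache key assumes tool arguments can be represented as a single string.

```python
from typing import Callable, Dict, Tuple


class ToolResultCache:
    """Per-session cache: an identical tool call runs once and is stored already trimmed."""

    def __init__(self, max_output_tokens: int = 200):
        self.max_output_tokens = max_output_tokens
        self._cache: Dict[Tuple[str, str, str], str] = {}

    def _truncate(self, text: str) -> str:
        # Whitespace words approximate tokens here; swap in a tokenizer for exact limits
        words = text.split()
        if len(words) <= self.max_output_tokens:
            return text
        return " ".join(words[: self.max_output_tokens]) + " ... (truncated)"

    def call(self, session_id: str, tool_name: str,
             tool_fn: Callable[[str], str], arg: str) -> str:
        key = (session_id, tool_name, arg)
        if key not in self._cache:
            self._cache[key] = self._truncate(tool_fn(arg))
        return self._cache[key]


def get_weather(city: str) -> str:
    return f"The weather in {city} is sunny, 25°C."


tool_cache = ToolResultCache()
print(tool_cache.call("session_123", "get_weather", get_weather, "Paris"))
print(tool_cache.call("session_123", "get_weather", get_weather, "Paris"))  # served from cache
```

The cache only avoids re-running and re-trimming the tool; the agent loop still has to decide whether a cached result is already present in the context before injecting it again.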

Key lesson

Budget tokens aggressively: allocate by priority (system > tools > history > RAG) and enforce limits with code.
Never trust the context window size; the model's effective working memory is much smaller — treat 128K as 8K for critical info.
Monitor token usage per turn and alert on spikes; a per-turn cost curve that keeps climbing is a sign of a leaky context pipeline.

Production debug guide: when the agent starts hallucinating at 2am

Symptom · 01

Model ignores system prompt after a few turns

→

Fix

Log the token count of the context window on each turn. Check if history exceeds 20% of the model's limit. Use tiktoken to count tokens: len(encoding.encode(context)). If history > 20% of context, implement a sliding window or summarization.
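
A token-budgeted sliding window, as opposed to a fixed message count, might look like the sketch below. `sliding_window` is a hypothetical helper, and the default whitespace counter is an assumption standing in for `len(encoding.encode(message))`.

```python
from typing import Callable, List


def sliding_window(messages: List[str], budget: int,
                   count_tokens: Callable[[str], int] = lambda m: len(m.split())
                   ) -> List[str]:
    """Keep the most recent messages whose combined token count fits the budget."""
    kept: List[str] = []
    used = 0
    # Walk from newest to oldest and stop at the first message that does not fit
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = ["User: What's the weather in Paris?",
           "Assistant: It's sunny in Paris.",
           "User: And in London?"]
print(sliding_window(history, budget=9))
```

Stopping at the first message that does not fit, rather than skipping it and continuing, keeps the window contiguous so the model never sees a reply without the message that prompted it.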

Symptom · 02

Sudden spike in latency or cost

→

Fix

Inspect the last N tool outputs. Are they huge? Check if a tool returned a 10MB JSON blob. Add a max token limit per tool output and truncate. Also check if the RAG pipeline is injecting too many docs — set a hard limit on retrieved chunks.

Symptom · 03

Output becomes repetitive or loops

→

Fix

Check for context rot: the model might be seeing duplicated copies of its own previous outputs in the history. Ensure each response is appended to history exactly once per turn, and deduplicate if it is not. Also check if the system prompt has drifted (e.g., a tool injected extra instructions).

Symptom · 04

Model returns irrelevant or hallucinated information

→

Fix

Log the last 500 tokens of the context window. Is the relevant info still there? If the context is full of noise, the model can't find the signal. Track retrieval metrics: low precision@5 means your RAG pipeline is injecting noise, and recall@5 < 0.8 means it is missing relevant documents.
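
Retrieval quality at a cutoff of 5 can be computed from a labeled sample of queries. A minimal sketch, assuming you have document IDs for what was retrieved and a hand-labeled set of relevant IDs per query (`precision_recall_at_k` is a hypothetical helper):

```python
from typing import List, Set, Tuple


def precision_recall_at_k(retrieved: List[str], relevant: Set[str],
                          k: int = 5) -> Tuple[float, float]:
    """precision@k: share of the top-k that is relevant; recall@k: share of relevant docs found."""
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


precision, recall = precision_recall_at_k(
    retrieved=["doc_a", "doc_b", "doc_c", "doc_d", "doc_e"],
    relevant={"doc_a", "doc_c", "doc_z"},
)
print(f"precision@5={precision:.2f} recall@5={recall:.2f}")
```

Low precision points at noise being injected into the context; low recall points at relevant documents never reaching it.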

★ Context Engineering for LLMs Triage Cheat Sheet

Copy-paste diagnostics. When it's 2am and you need answers fast.

Model ignores system prompt

Immediate action

Check token count of context vs model limit

Commands

python -c "import tiktoken; enc = tiktoken.encoding_for_model('gpt-4'); print(len(enc.encode(open('context.txt').read())))"

python -c "import tiktoken; enc = tiktoken.encoding_for_model('gpt-4'); print('Token ratio:', len(enc.encode(open('context.txt').read())) / 128000)"

Fix now

Set a maximum history token budget: if history_tokens > 8000: history = summarize(history)

Context Engineering vs. Prompt Engineering vs. Fine-Tuning

Concern	Context Engineering	Prompt Engineering	Fine-Tuning	Recommendation
Cost control	Direct: token budgets, max_tokens, truncation	Indirect: shorter prompts may reduce tokens	High upfront cost, lower per-token cost	Context engineering for immediate cost control
Response quality	Ensures context fits, avoids truncation loss	Improves instruction following	Customizes model behavior	Combine context + prompt engineering first
Implementation effort	Low: add token counting and truncation	Low: iterate on prompts	High: data collection, training, evaluation	Start with context engineering
Scalability	Essential for 10K+ RPM	Doesn't address token waste	Good for specialized tasks at scale	Context engineering is prerequisite
Debugging	Metrics-driven: token counts, truncation logs	Qualitative: A/B test prompts	Requires eval set	Context engineering gives actionable metrics

Key takeaways

Always set max_tokens explicitly—defaults can be 4096+ tokens, and a runaway completion on a 10K RPM workload costs $12k/week at GPT-4 prices.

Context engineering is about token budgeting: pre-compute the exact input context size, reserve tokens for output, and truncate or chunk before the call, not after.

Use a sliding window with token counters (e.g., tiktoken) to keep context under model limits—don't rely on the model to tell you it's full.

Monitor prompt_tokens and completion_tokens per request in your observability stack; alert when completion_tokens exceeds 90% of your budget.

For high-throughput (10K RPM), batch context pre-processing off the critical path with a sidecar process—don't do token counting inline in the request handler.

Common mistakes to avoid

Missing max_tokens parameter

Symptom

LLM returns 4000+ token completions on simple queries, costing $0.06+ per call instead of $0.01. At 10K RPM, that's $12k/week waste.

Fix

Set max_tokens to the minimum viable output length (e.g., 150 for classification, 500 for summarization). Use tiktoken to estimate before the call.
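The budgeting rule `max_tokens = min(desired_output, model_limit - prompt_tokens - safety_margin)` translates directly into a guard you can call before every request. A minimal sketch: `pick_max_tokens` is a hypothetical helper, and `prompt_tokens` is assumed to come from your tokenizer.

```python
def pick_max_tokens(
    desired_output: int,
    prompt_tokens: int,
    model_limit: int,
    safety_margin: int = 100,
) -> int:
    """Clamp the output budget to what the context window can still hold."""
    headroom = model_limit - prompt_tokens - safety_margin
    if headroom <= 0:
        # The prompt already fills the window; truncate input before calling the model.
        raise ValueError("prompt leaves no room for output; truncate the input first")
    return min(desired_output, headroom)
```

For a 1,200-token prompt against an 8,192-token limit, a classification call asking for 150 tokens gets exactly 150. For a 7,800-token prompt, a 500-token request is clamped to 292.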

Not truncating input context to model limit

Symptom

Depending on the API or framework, the request is either rejected or tokens beyond the context window are silently dropped, losing critical instructions or data. Output becomes hallucinated or incomplete.

Fix

Before every call, count input tokens with tiktoken and truncate to model_max_tokens - max_tokens - safety_margin, where the safety margin is a small reserve (e.g., 100 tokens). Use a sliding window over the most recent/relevant content.

Assuming context is stateless across calls

Symptom

Conversation history grows unbounded, causing context overflow and erratic behavior. Cost per conversation skyrockets.

Fix

Implement a fixed-size context buffer (e.g., last 10 turns). Evict oldest turns when token budget is exceeded. Log context size per turn for debugging.
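A fixed-size buffer with both a turn cap and a token cap might look like the sketch below. `TurnBuffer` is a hypothetical class, and the whitespace counter is a stand-in for a real tokenizer.

```python
from collections import deque
from typing import Deque


def count_tokens(text: str) -> int:
    """Stand-in counter (whitespace split); assumption: swap in a real tokenizer."""
    return len(text.split())


class TurnBuffer:
    """Keeps the most recent turns within both a turn cap and a token cap."""

    def __init__(self, max_turns: int = 10, max_tokens: int = 8000) -> None:
        self.max_turns = max_turns
        self.max_tokens = max_tokens
        self.turns: Deque[str] = deque()

    def total_tokens(self) -> int:
        return sum(count_tokens(turn) for turn in self.turns)

    def append(self, turn: str) -> None:
        self.turns.append(turn)
        # Evict oldest turns first; always keep at least the newest turn.
        while len(self.turns) > self.max_turns or (
            self.total_tokens() > self.max_tokens and len(self.turns) > 1
        ):
            self.turns.popleft()
```

Eviction runs on append, so the buffer is always within budget by the time the prompt is assembled. Logging `total_tokens()` after each append gives the per-turn context size the fix calls for.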

No monitoring on token usage per request

Symptom

You don't know which users or endpoints are driving token costs. A single rogue integration can silently burn $5k/month.

Fix

Emit prompt_tokens, completion_tokens, and total_tokens as metrics to Datadog/Prometheus. Set alerts on p99 completion_tokens > 500.
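Before wiring this into Datadog or Prometheus, the alert logic itself can be prototyped in memory. This is a sketch under assumptions: `Usage` and `endpoints_over_threshold` are illustrative names, and the percentile uses the simple nearest-rank method rather than whatever your metrics backend computes.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Usage:
    endpoint: str
    prompt_tokens: int
    completion_tokens: int


def p99(values: List[int]) -> int:
    """Nearest-rank 99th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, (99 * len(ordered) + 99) // 100)  # integer ceil(0.99 * n)
    return ordered[rank - 1]


def endpoints_over_threshold(records: List[Usage], threshold: int = 500) -> Dict[str, int]:
    """Return each endpoint whose p99 completion_tokens exceeds the threshold."""
    by_endpoint: Dict[str, List[int]] = {}
    for record in records:
        by_endpoint.setdefault(record.endpoint, []).append(record.completion_tokens)
    flagged: Dict[str, int] = {}
    for endpoint, values in by_endpoint.items():
        tail = p99(values)
        if tail > threshold:
            flagged[endpoint] = tail
    return flagged
```

Grouping by endpoint is what exposes a single rogue integration: a global average would hide a classification endpoint where 2% of calls run to thousands of tokens.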

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR

How would you design a system to handle 10K LLM requests per minute whil...

Q02 · SENIOR

What happens when the input context exceeds the model's context window? ...

Q03 · JUNIOR

Explain the difference between prompt engineering and context engineerin...

Q04 · SENIOR

How would you debug a sudden spike in LLM costs?

Q05 · SENIOR

Design a context-aware agent that maintains conversation history across ...

Q01 of 05 · SENIOR

How would you design a system to handle 10K LLM requests per minute while keeping costs predictable?

ANSWER

Start with context engineering: pre-compute token budgets per request type using tiktoken. Use a fixed-size sliding window for conversation history. Set max_tokens per endpoint. Offload token counting to a sidecar process. Use request batching where possible. Monitor token usage per endpoint and set hard caps. Implement a circuit breaker if cost per minute exceeds threshold.
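The circuit breaker at the end of that answer can be sketched as a rolling one-minute spend tracker. `CostCircuitBreaker` is a hypothetical class; the injectable `clock` parameter is an assumption added so the behavior can be tested without waiting.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class CostCircuitBreaker:
    """Rejects new requests while spend over the last 60 seconds is at or above a cap."""

    def __init__(
        self,
        max_cost_per_minute: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_cost_per_minute = max_cost_per_minute
        self._clock = clock
        self._events: Deque[Tuple[float, float]] = deque()  # (timestamp, cost)

    def record(self, cost: float) -> None:
        """Call after each LLM response with that request's computed cost."""
        self._events.append((self._clock(), cost))

    def _spend_last_minute(self) -> float:
        cutoff = self._clock() - 60.0
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()
        return sum(cost for _, cost in self._events)

    def allow_request(self) -> bool:
        return self._spend_last_minute() < self.max_cost_per_minute
```

In a request handler, check `allow_request()` before calling the model and call `record()` with the cost derived from the response's token usage. When the breaker is open, return a cached or degraded answer instead of queueing more spend.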
