Token Budgeting for LLMs — How a Missing Token Count Cost $4,000 in Overnight API Bills
Learn to estimate, track, and enforce token budgets in production LLM apps.
- Token Budget The maximum number of tokens a model can process in a single request. Exceed it and the API silently truncates your prompt or returns an error.
- Context Window The total token capacity shared between input (system prompt, user message, history) and output (model response). Overfilling it causes dropped context or rejected calls.
- Tokenizer A model-specific function that converts text to tokens. Different models use different tokenizers, so counting with the wrong one yields incorrect budgets.
- Budget Planning Allocate token budgets per component: system prompt (fixed), few-shot examples (fixed), conversation history (growing), user input (variable), and output (reserved).
- Cost Estimation Multiply input tokens by input price per token, output tokens by output price per token. A 10K-token prompt on GPT-4o costs $0.035 input + $0.10 output.
- Monitoring Log token counts per request, set alerts on budget overruns, and use streaming to abort early if the model exceeds your budget.
Token budgeting is the practice of explicitly planning and capping the number of tokens your application sends to an LLM per request, session, or billing cycle. It's not just about counting characters—tokens are the atomic unit of LLM pricing (OpenAI charges ~$0.01–$0.03 per 1K input tokens for GPT-4, Anthropic similar), and a runaway loop or unbounded prompt can burn thousands of dollars overnight.
The core problem is that most developers treat token costs as an afterthought, only to discover that a single misconfigured retry loop or a prompt that grows with every user message can silently multiply your API bill by 10x or more. Token budgeting forces you to define hard limits before you deploy: max tokens per request, max context window utilization, and max spend per user per day.
Without it, you're flying blind—and the $4,000 overnight bill in the title is a real example of what happens when a chatbot's conversation history accumulates without a cap, hitting a 128K-token context window every 30 seconds for 8 hours straight.
Think of an LLM's context window like a desk. You can only spread so many papers on it before things fall off. Token budgeting is deciding exactly which papers go on the desk and in what order, so the most important ones don't hit the floor. If you pile on too much, the model 'forgets' the middle of your instructions — but still charges you for the full pile.
We got paged at 2am because our chatbot started returning gibberish. The on-call engineer saw a 400 Bad Request on every third call. After 45 minutes of digging through logs, we found the culprit: a system prompt that had grown to 132,000 tokens overnight because a developer appended a full knowledge base dump without counting tokens. The GPT-4o API silently truncated the prompt at 128,000, dropping the user's question entirely. The model then responded to the truncated system prompt alone — and we were billed for every token of the oversized prompt.
Most tutorials on token budgeting stop at 'count your tokens with tiktoken' and call it a day. They don't tell you that different models use different tokenizers, that the context window is shared between input and output, or that a single runaway loop in a conversation can double your costs in minutes. They also skip the real production problem: how to enforce budgets programmatically before the API rejects your request.
This article covers exactly what you need to stop that 2am page. We'll walk through how tokenization actually works under the hood, how to build a reusable budget planner that splits your context window across system prompts, examples, history, and user input, and how to monitor and alert on token usage in production. We'll also show you the exact code that caused our $4,000 overnight bill — and the one-liner fix that prevents it.
How Tokenization Actually Works Under the Hood
Tokenization is not character counting. It's a model-specific encoding that maps text to integers via a learned vocabulary. GPT-4o uses the cl100k_base tokenizer, which has about 100,000 tokens in its vocabulary. Each token represents a common substring: 'hello' might be one token, but 'hello world' is two. This means a single word can be multiple tokens, and punctuation counts.
The critical production implication: different models use different tokenizers. GPT-4o and GPT-4o-mini both use cl100k_base, but Claude 3.5 Sonnet uses its own tokenizer. Counting tokens with the wrong encoder gives you incorrect budgets. We learned this when a developer used a character-count heuristic and ended up with prompts that were 30% larger than expected.
Another hidden detail: the tokenizer is bidirectional in practice. The same text tokenizes identically whether it's at the start or end of the prompt. But the model's attention mechanism is not — it can lose information in the middle. So token count alone isn't enough; you also need to structure your prompt to keep critical content at the edges.
Building a Reusable Token Budget Planner
A token budget planner splits the context window into fixed and variable components. Fixed components include the system prompt and few-shot examples. Variable components include conversation history and user input. The output budget must be reserved from the total.
Here's the formula: budget_input = context_window - reserved_output. Then you allocate budget_input across system prompt, examples, history, and user input. If the total exceeds budget_input, you must truncate history or reject the request.
A common mistake is to forget that the output counts toward the total token limit. GPT-4o has a 128,000 token context window, but max output is 16,384. If you set max_tokens=16384, the input can only be 111,616 tokens. But if you don't set max_tokens, the model might use all 128,000 for output and leave no room for input.
We enforce budgets with a simple class that tracks usage per component and raises an exception when the budget is exceeded. This prevents the silent truncation that cost us $4,000.
When NOT to Use Token Budgeting
Token budgeting is essential for cost control, but it's not always the right tool. If you're doing batch inference with short prompts, the overhead of counting tokens per request can outweigh the savings. We saw a team add token counting to a batch job that processed 10,000 short prompts per minute. The tiktoken calls added 30% latency. They were better off estimating a fixed budget per prompt and only counting on a sample.
Another case: if your model has a very large context window (like Gemini 2.0 Flash at 1M tokens), budgeting becomes less about overflow and more about cost. At $0.10 per 1M input tokens, the cost of a full-window request is negligible. But the latency of processing 1M tokens is not — it can take 30+ seconds. In that case, budget for latency, not cost.
Finally, don't use token budgeting for tasks where the model needs the full context to be accurate. Legal document analysis or medical record summarization requires the entire input. Truncating history could introduce errors. In those cases, pay for the larger model or use a model with a bigger context window.
Production Patterns for Token Budgeting at Scale
At scale, token budgeting becomes a systems design problem. You need to track usage per request, per user, and per model. We use a middleware pattern that intercepts every API call, counts tokens, and enforces budgets before the request hits the model.
Three patterns we use in production:
- Sliding Window: Keep only the last N messages in conversation history. N depends on the model's context window and your output budget. For GPT-4o with 128K window and 16K output, we keep the last 10 messages plus the system prompt.
- Token Budget Header: Add a custom header to every API call that includes the current token usage. Log this in your monitoring system. This lets you trace cost spikes back to specific users or features.
- Budget Exceeded Webhook: When a request exceeds the budget, instead of failing silently, send a webhook to the developer with the exact token counts. This turns a silent failure into an actionable alert.
Common Token Budgeting Mistakes (With Specific Examples)
We've seen the same mistakes across multiple teams. Here are the top three:
Mistake 1: Using character count instead of token count. A developer wrote len(prompt) > 100000 thinking 100K characters equals 100K tokens. In reality, 100K characters is about 25K tokens for English text. They were undercounting by 4x. The fix: always use tiktoken.
Mistake 2: Forgetting to reserve output tokens. Another team set max_tokens=16384 but didn't subtract it from the input budget. They sent 120K tokens of input, which with 16K output totaled 136K — exceeding the 128K limit. The API returned a 400 error. The fix: subtract max_tokens from the context window before allocating input.
Mistake 3: Not accounting for tokenizer differences between models. A team switched from GPT-4o to Claude 3.5 Sonnet but kept using the cl100k_base tokenizer. Claude's tokenizer produces different counts. They ended up with prompts that were 20% larger than expected, causing context overflows. The fix: use the model-specific tokenizer.
Token Budgeting vs. Other Cost Control Strategies
Token budgeting is one of several cost control strategies. Here's how it compares:
Token Budgeting: Proactive. You count tokens before the call and reject or truncate if over budget. Best for real-time applications where you control the input.
Cost Alerts: Reactive. You monitor API costs and alert when they exceed a threshold. Useful as a safety net, but by the time you get the alert, the money is already spent.
Model Selection: Proactive. Use a cheaper model for simple tasks and a more expensive one for complex tasks. GPT-4o-mini costs 1/17th of GPT-4o per input token. But it also has lower accuracy.
Caching: Proactive. Cache responses for identical prompts. This works well for system prompts and few-shot examples, but not for user-specific queries.
We use all four in combination. Token budgeting is the first line of defense. Cost alerts catch anything that slips through. Model selection reduces baseline costs. Caching eliminates redundant calls.
Debugging and Monitoring Token Budgets in Production
Monitoring token budgets requires logging token counts per request and setting alerts on anomalies. We use structured logging with a 'token_usage' field that includes prompt_tokens, completion_tokens, and total_tokens. This lets us query for outliers: 'find all requests where total_tokens > 100000'.
We also track token usage per user and per feature. If a single user's token usage spikes, it's likely a runaway loop in their conversation. If a feature's token usage spikes, it's likely a bug in the prompt construction.
- Total tokens per request > 90% of context window
- Daily token usage > 2x the 7-day rolling average
- Any request that triggers a budget exceeded error
The $4,000 Overnight Token Overrun
- Always count total tokens before calling the API — use the model-specific tokenizer, not a generic character count.
- Set a hard budget for input tokens and enforce it in code. Reserve at least 20% of the context window for output.
- Monitor token usage per request and alert on anomalies. Cost is a leading indicator of budget overruns.
tiktoken.get_encoding('cl100k_base').encode(prompt) and check len(encoded) against the model's context window. Log the token count per request.max_tokens parameter to cap output. Log usage.completion_tokens from the API response.usage.prompt_tokens across requests. Look for a single request with an abnormally high count — likely a runaway conversation history or a system prompt that grew unbounded.python -c "import tiktoken; enc = tiktoken.get_encoding('cl100k_base'); print(len(enc.encode(open('prompt.txt').read())))"python -c "import openai; client = openai.OpenAI(); models = client.models.list(); print([m.id for m in models if 'gpt-4' in m.id])"max_tokens=4096 in the API call.Key takeaways
Common mistakes to avoid
4 patternsCounting tokens only after API call
Ignoring token overhead from system prompts and chat templates
Using a flat per-request budget without a windowed total
Not handling tokenization edge cases (e.g., Unicode, code blocks)
Interview Questions on This Topic
How would you design a token budgeting system for a multi-tenant LLM API?
Frequently Asked Questions
That's Context Engineering. Mark it forged?
6 min read · try the examples if you haven't