Senior 6 min · May 22, 2026

Token Budgeting for LLMs — How a Missing Token Count Cost $4,000 in Overnight API Bills

Learn to estimate, track, and enforce token budgets in production LLM apps.

Naren · Founder
Plain-English first. Then code. Then the interview question.
Quick Answer
  • Token Budget: The maximum number of tokens you allow your application to spend per request, session, or billing cycle. It sits below the model's hard limit; exceed the hard limit and the API rejects the call with an error.
  • Context Window: The total token capacity shared between input (system prompt, user message, history) and output (model response). Overfilling it causes rejected calls, or dropped context if your own code truncates to fit.
  • Tokenizer: A model-specific function that converts text to tokens. Different models use different tokenizers, so counting with the wrong one yields incorrect budgets.
  • Budget Planning: Allocate token budgets per component: system prompt (fixed), few-shot examples (fixed), conversation history (growing), user input (variable), and output (reserved).
  • Cost Estimation: Multiply input tokens by the input price per token and output tokens by the output price per token. At $2.50 per 1M input tokens and $10 per 1M output tokens on GPT-4o, a 10K-token prompt costs $0.025 and a 10K-token response costs $0.10.
  • Monitoring: Log token counts per request, set alerts on budget overruns, and use streaming to abort early if the model exceeds your budget.
What is Token Budgeting for LLMs?

Token budgeting is the practice of explicitly planning and capping the number of tokens your application sends to an LLM per request, session, or billing cycle. It's not just about counting characters—tokens are the atomic unit of LLM pricing (OpenAI charges ~$0.01–$0.03 per 1K input tokens for GPT-4, Anthropic similar), and a runaway loop or unbounded prompt can burn thousands of dollars overnight.

The core problem is that most developers treat token costs as an afterthought, only to discover that a single misconfigured retry loop or a prompt that grows with every user message can silently multiply your API bill by 10x or more. Token budgeting forces you to define hard limits before you deploy: max tokens per request, max context window utilization, and max spend per user per day.

Without it, you're flying blind—and the $4,000 overnight bill in the title is a real example of what happens when a chatbot's conversation history accumulates without a cap, hitting a 128K-token context window every 30 seconds for 8 hours straight.
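
The arithmetic behind that scenario can be sketched in a few lines. This is a rough estimate, not a reconstruction of the actual bill: it assumes the upper end of the per-1K input price range quoted above ($0.03), counts input tokens only, and uses a hypothetical helper named `estimate_cost`.

```python
def estimate_cost(requests: int, tokens_per_request: int, price_per_1k_tokens: float) -> float:
    """Return the input-token cost in dollars for a batch of identical requests."""
    total_tokens = requests * tokens_per_request
    return total_tokens / 1000 * price_per_1k_tokens


# One request every 30 seconds for 8 hours.
requests_sent = 8 * 60 * 60 // 30  # 960 requests

# Assumption: every request fills a 128K window at $0.03 per 1K input tokens.
overnight = estimate_cost(requests_sent, 128_000, 0.03)
print(f"{requests_sent} requests -> ${overnight:,.2f}")
```

Under those assumptions the total lands in the same range as the bill in the title. Swap in your own model's price to see how quickly an uncapped history adds up.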

Figure: LLM token budget management. (1) Total budget, e.g. 128k tokens. (2) System prompt, fixed, ~500 tokens. (3) Retrieved chunks, dynamic, top-k RAG. (4) Chat history, trimmed with a sliding window. (5) Response reserve, minimum 1k output tokens. (6) Context builder, fits everything within the budget.
Plain-English First

Think of an LLM's context window like a desk. You can only spread so many papers on it before things fall off. Token budgeting is deciding exactly which papers go on the desk and in what order, so the most important ones don't hit the floor. If you pile on too much, the model 'forgets' the middle of your instructions — but still charges you for the full pile.

We got paged at 2am because our chatbot started returning gibberish. The on-call engineer saw a 400 Bad Request on every third call. After 45 minutes of digging through logs, we found the culprit: a system prompt that had grown to 132,000 tokens overnight because a developer appended a full knowledge base dump without counting tokens. The GPT-4o API silently truncated the prompt at 128,000, dropping the user's question entirely. The model then responded to the truncated system prompt alone — and we were billed for every token of the oversized prompt.

Most tutorials on token budgeting stop at 'count your tokens with tiktoken' and call it a day. They don't tell you that different models use different tokenizers, that the context window is shared between input and output, or that a single runaway loop in a conversation can double your costs in minutes. They also skip the real production problem: how to enforce budgets programmatically before the API rejects your request.

This article covers exactly what you need to stop that 2am page. We'll walk through how tokenization actually works under the hood, how to build a reusable budget planner that splits your context window across system prompts, examples, history, and user input, and how to monitor and alert on token usage in production. We'll also show you the exact code that caused our $4,000 overnight bill — and the one-liner fix that prevents it.

How Tokenization Actually Works Under the Hood

Tokenization is not character counting. It's a model-specific encoding that maps text to integers via a learned vocabulary. GPT-4o uses the o200k_base tokenizer, which has about 200,000 tokens in its vocabulary; GPT-4 and GPT-3.5-turbo use the older cl100k_base, with about 100,000. Each token represents a common substring: 'hello' might be one token, and 'hello world' two. Rarer or longer words split into several tokens, and punctuation and whitespace count too.

The critical production implication: different models use different tokenizers. GPT-4o and GPT-4o-mini both use o200k_base, GPT-4 uses cl100k_base, and Claude 3.5 Sonnet uses its own tokenizer. Counting tokens with the wrong encoder gives you incorrect budgets. We learned this when a developer used a character-count heuristic and ended up with prompts that were 30% larger than expected.

Another hidden detail: tokenization is essentially position-independent. The same text produces the same tokens whether it sits at the start or the end of the prompt. The model's attention is not position-independent: it can lose information in the middle of a long prompt. So token count alone isn't enough; you also need to structure your prompt to keep critical content at the edges.

tokenize_and_count.py (Python)
import tiktoken

# GPT-4o and GPT-4o-mini use o200k_base; GPT-4 and GPT-3.5-turbo use cl100k_base.
# encoding_for_model picks the right encoder from the model name.
enc = tiktoken.encoding_for_model("gpt-4o")

text = "Hello, world! This is a test prompt."
tokens = enc.encode(text)

print(f"Text length: {len(text)} characters")
print(f"Token count: {len(tokens)}")
print(f"Tokens: {tokens}")

# Decode back to verify no information loss
decoded = enc.decode(tokens)
print(f"Decoded: {decoded}")

# Common pitfall: newlines and whitespace count
# A single newline is one token!
multi_line = "Line 1\nLine 2"
print(f"Multi-line tokens: {len(enc.encode(multi_line))}")
Tokenizers are model-specific
Always use the exact tokenizer for the model you're calling. GPT-4o and GPT-4o-mini both use o200k_base, while GPT-4 and GPT-3.5-turbo use cl100k_base. tiktoken.encoding_for_model resolves the mapping from a model name; check the OpenAI docs when in doubt.
Production Insight
A recommendation engine serving 2M requests/day started returning stale results after a schema migration. The team had been using a generic tokenizer from an older library. After the migration, the token counts were off by 15%, causing the context window to overflow silently. The fix: switch to tiktoken and validate token counts against the model's limit before every call.
Key Takeaway
Tokenization is not character counting. Use the model-specific tokenizer (tiktoken for OpenAI, transformers for HuggingFace models) and always validate total tokens before sending the request.

Building a Reusable Token Budget Planner

A token budget planner splits the context window into fixed and variable components. Fixed components include the system prompt and few-shot examples. Variable components include conversation history and user input. The output budget must be reserved from the total.

Here's the formula: budget_input = context_window - reserved_output. Then you allocate budget_input across system prompt, examples, history, and user input. If the total exceeds budget_input, you must truncate history or reject the request.

A common mistake is to forget that the output counts toward the total token limit. GPT-4o has a 128,000-token context window and a maximum output of 16,384 tokens. If you set max_tokens=16384, the input can be at most 111,616 tokens. If you don't set max_tokens, the response can still run up to the model's output cap, so you have no guaranteed split between input and output and no lower ceiling on per-call output cost.
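
As a worked example of the formula, here is a minimal sketch that computes the input budget and hands whatever is left after the fixed components to conversation history. The names `input_budget`, `split_budget`, and `BudgetSplit` are hypothetical, and the component sizes (500, 2,000, 4,000) are illustrative numbers, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetSplit:
    system: int
    examples: int
    history: int
    user_input: int


def input_budget(context_window: int, reserved_output: int) -> int:
    """budget_input = context_window - reserved_output"""
    if reserved_output >= context_window:
        raise ValueError("reserved_output must be smaller than the context window")
    return context_window - reserved_output


def split_budget(budget: int, system: int, examples: int, user_input: int) -> BudgetSplit:
    """Fixed parts are taken first; history gets whatever is left."""
    history = budget - system - examples - user_input
    if history < 0:
        raise ValueError(f"fixed components exceed the input budget by {-history} tokens")
    return BudgetSplit(system, examples, history, user_input)


budget = input_budget(128_000, 16_384)
split = split_budget(budget, system=500, examples=2_000, user_input=4_000)
print(f"Input budget: {budget}, history allowance: {split.history}")
```

Treating history as the remainder is one design choice among several; it makes history the first thing to shrink when any other component grows.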

We enforce budgets with a simple class that tracks usage per component and raises an exception when the budget is exceeded. This prevents the silent truncation that cost us $4,000.

token_budget_planner.py (Python)
import tiktoken


class TokenBudgetPlanner:
    def __init__(self, model: str = "gpt-4o", context_window: int = 128000,
                 reserved_output: int = 4096):
        # encoding_for_model resolves the model-specific encoder (o200k_base for GPT-4o)
        self.enc = tiktoken.encoding_for_model(model)
        # GPT-4o context window is 128,000 tokens; pass a different value for other models
        self.context_window = context_window
        self.reserved_output = reserved_output
        self.input_budget = self.context_window - self.reserved_output
        self.current_usage = 0

    def count_tokens(self, text: str) -> int:
        return len(self.enc.encode(text))

    def add_message(self, role: str, content: str) -> None:
        tokens = self.count_tokens(content)
        # Account for message format overhead (~4 tokens per message)
        tokens += 4
        if self.current_usage + tokens > self.input_budget:
            raise BudgetExceededError(
                f"Adding {role} message would exceed budget. "
                f"Current: {self.current_usage}, Adding: {tokens}, Budget: {self.input_budget}"
            )
        self.current_usage += tokens

    def get_available_input(self) -> int:
        return self.input_budget - self.current_usage

class BudgetExceededError(Exception):
    pass

# Usage
planner = TokenBudgetPlanner(reserved_output=4096)
planner.add_message("system", "You are a helpful assistant.")
planner.add_message("user", "What is the capital of France?")
print(f"Available input tokens: {planner.get_available_input()}")
Reserve output budget upfront
Always set max_tokens to a specific value. If you don't, the response can run up to the model's output cap (16,384 tokens on GPT-4o), and you cannot plan how much of the window is left for input. A good rule of thumb: reserve 4,096 tokens for output for simple Q&A, 8,192 for complex reasoning.
Production Insight
A chatbot for a SaaS platform kept hitting context limits during peak hours. The team had set max_tokens=16384 but didn't account for conversation history. After 20 messages, the history alone was 30,000 tokens. The fix: implement a sliding window that keeps only the last 5 messages and truncates the oldest ones first.
Key Takeaway
Build a budget planner that tracks each component separately. Reserve output tokens from the total. Truncate history before it overflows the budget.
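
The history truncation mentioned above could look something like the following sketch. It assumes system messages come first in the list, uses a whitespace word count as a stand-in for a real tokenizer (pass your model's counter as `count_tokens`), and treats the 4-token per-message overhead as an approximation. `trim_history` and `approx_tokens` are hypothetical names.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def approx_tokens(text: str) -> int:
    """Stand-in counter (whitespace words). Swap in a real tokenizer in production."""
    return len(text.split())


def trim_history(
    messages: List[Message],
    budget: int,
    count_tokens: Callable[[str], int] = approx_tokens,
    per_message_overhead: int = 4,
) -> List[Message]:
    """Drop the oldest non-system messages until the list fits the budget."""

    def cost(m: Message) -> int:
        return count_tokens(m["content"]) + per_message_overhead

    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    used = sum(cost(m) for m in system)
    if used > budget:
        raise ValueError("system messages alone exceed the budget")

    kept: List[Message] = []
    for m in reversed(rest):  # walk from newest to oldest
        if used + cost(m) > budget:
            break
        kept.append(m)
        used += cost(m)
    return system + list(reversed(kept))


history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "first question about billing"},
    {"role": "assistant", "content": "first answer about billing"},
    {"role": "user", "content": "second question"},
]
trimmed = trim_history(history, budget=20)
print([m["content"] for m in trimmed])
```

Trimming by tokens rather than by a fixed message count means a few very long messages cannot blow the budget, at the price of keeping fewer turns when messages are long.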

When NOT to Use Token Budgeting

Token budgeting is essential for cost control, but it's not always the right tool. If you're doing batch inference with short prompts, the overhead of counting tokens per request can outweigh the savings. We saw a team add token counting to a batch job that processed 10,000 short prompts per minute. The tiktoken calls added 30% latency. They were better off estimating a fixed budget per prompt and only counting on a sample.

Another case: if your model has a very large context window (like Gemini 2.0 Flash at 1M tokens), budgeting becomes less about overflow and more about cost. At $0.10 per 1M input tokens, the cost of a full-window request is negligible. But the latency of processing 1M tokens is not — it can take 30+ seconds. In that case, budget for latency, not cost.

Finally, don't use token budgeting for tasks where the model needs the full context to be accurate. Legal document analysis or medical record summarization requires the entire input. Truncating history could introduce errors. In those cases, pay for the larger model or use a model with a bigger context window.

when_not_to_budget.py (Python)
import time
import tiktoken

# Batch inference: 10,000 short prompts
prompts = ["What is 2+2?"] * 10000
enc = tiktoken.get_encoding("cl100k_base")

# Bad: count tokens for every prompt
start = time.time()
for p in prompts:
    _ = len(enc.encode(p))
print(f"Counting all: {time.time() - start:.2f}s")

# Good: count once and reuse
start = time.time()
first_count = len(enc.encode(prompts[0]))
print(f"Counting once: {time.time() - start:.2f}s")
# For uniform prompts, the variance is negligible
Budget for latency, not just cost
On models with 1M+ context windows, the bottleneck is processing time, not cost. A full-window request on Gemini 2.0 Flash costs $0.10 but takes 30+ seconds. Budget for response time by limiting input size.
Production Insight
A legal document analysis service used token budgeting to truncate contracts to 100,000 tokens. The model missed critical clauses in the truncated sections. The fix: switch to a model with a 200,000 token context window (Claude Opus 4) and remove the truncation entirely.
Key Takeaway
Token budgeting is for cost and overflow control. Don't use it when full context is required for accuracy. In those cases, pay for a larger model or accept the latency.
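
For the batch case described earlier in this section, counting on a sample instead of on every prompt might be sketched as follows. The function name, the sample size of 100, and the 1.2 safety margin are all assumptions for illustration; the counter is injected so you can pass a real tokenizer.

```python
import math
import random
from typing import Callable, Sequence


def estimate_tokens_per_prompt(
    prompts: Sequence[str],
    count_tokens: Callable[[str], int],
    sample_size: int = 100,
    safety_margin: float = 1.2,
    seed: int = 0,
) -> int:
    """Count tokens on a random sample and return a padded per-prompt budget."""
    if not prompts:
        raise ValueError("prompts must not be empty")
    rng = random.Random(seed)
    sample = rng.sample(list(prompts), min(sample_size, len(prompts)))
    worst_case = max(count_tokens(p) for p in sample)
    return math.ceil(worst_case * safety_margin)


def word_count(text: str) -> int:
    """Stand-in counter for the demo; not a real tokenizer."""
    return len(text.split())


prompts = ["What is 2+2?"] * 10_000
per_prompt = estimate_tokens_per_prompt(prompts, word_count)
print(f"Per-prompt budget: {per_prompt}, batch budget: {per_prompt * len(prompts)}")
```

This only holds up when prompts are roughly uniform. If prompt lengths vary widely, a sample can miss the outliers, and per-request counting is the safer choice.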

Production Patterns for Token Budgeting at Scale

At scale, token budgeting becomes a systems design problem. You need to track usage per request, per user, and per model. We use a middleware pattern that intercepts every API call, counts tokens, and enforces budgets before the request hits the model.

  1. Sliding Window: Keep only the last N messages in conversation history. N depends on the model's context window and your output budget. For GPT-4o with 128K window and 16K output, we keep the last 10 messages plus the system prompt.
  2. Token Budget Header: Add a custom header to every API call that includes the current token usage. Log this in your monitoring system. This lets you trace cost spikes back to specific users or features.
  3. Budget Exceeded Webhook: When a request exceeds the budget, instead of failing silently, send a webhook to the developer with the exact token counts. This turns a silent failure into an actionable alert.
token_budget_middleware.py (Python)
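import openai
import tiktoken
from functools import wraps


class BudgetExceededError(Exception):
    """Raised when a request would exceed the input token budget."""


class TokenBudgetMiddleware:
    def __init__(self, max_input_tokens: int = 100000, model: str = "gpt-4o"):
        self.max_input_tokens = max_input_tokens
        # Use the encoder that matches the target model (o200k_base for GPT-4o)
        self.enc = tiktoken.encoding_for_model(model)

    def _count(self, messages) -> int:
        # Counts plain-text messages only; ~4 tokens of overhead each (approximate)
        return sum(
            len(self.enc.encode(m["content"])) + 4
            for m in messages
            if isinstance(m.get("content"), str)
        )

    def __call__(self, func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # The OpenAI client takes keyword arguments only, so read messages from kwargs
            total_tokens = self._count(kwargs.get("messages", []))
            if total_tokens > self.max_input_tokens:
                raise BudgetExceededError(
                    f"Input tokens {total_tokens} exceed budget {self.max_input_tokens}"
                )
            # Attach the count as a request header for tracing, keeping caller headers
            headers = dict(kwargs.get("extra_headers") or {})
            headers["X-Token-Count"] = str(total_tokens)
            kwargs["extra_headers"] = headers
            return func(*args, **kwargs)
        return wrapper

# Apply to OpenAI client
client = openai.OpenAI()
client.chat.completions.create = TokenBudgetMiddleware()(client.chat.completions.create)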
Don't forget message overhead
Each message in the OpenAI chat format carries a few tokens of overhead for the role and formatting (roughly 3 to 4, depending on the model). When counting tokens, add that overhead for each message in the list. This adds up quickly with long conversation histories.
Production Insight
A customer support system handling 50K requests/day used a sliding window of 20 messages. During a holiday sale, conversation lengths doubled, and the window wasn't enough. The fix: make the window size dynamic based on the average token count per message. If messages are short, keep more; if long, keep fewer.
Key Takeaway
At scale, use middleware to enforce budgets, sliding windows to manage history, and custom headers to trace costs. Make budget enforcement a hard error, not a silent truncation.
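
The dynamic window from the Production Insight above can be sketched as a small function: divide the history budget by the recent average message size and clamp the result. The function name and the clamp bounds (2 and 50) are assumptions, not values from the incident.

```python
from typing import Sequence


def dynamic_window_size(
    recent_token_counts: Sequence[int],
    history_budget: int,
    min_messages: int = 2,
    max_messages: int = 50,
) -> int:
    """Pick how many messages to keep so their expected size fits history_budget."""
    if not recent_token_counts:
        return min_messages
    avg = sum(recent_token_counts) / len(recent_token_counts)
    fits = int(history_budget // max(avg, 1))
    return max(min_messages, min(max_messages, fits))


short_window = dynamic_window_size([150, 250, 200], history_budget=4_000)
long_window = dynamic_window_size([1_500, 2_500, 2_000], history_budget=4_000)
print(f"Short messages: keep {short_window}; long messages: keep {long_window}")
```

An average hides outliers, so this works best alongside a hard per-request token check rather than instead of one.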

Common Token Budgeting Mistakes (With Specific Examples)

We've seen the same mistakes across multiple teams. Here are the top three:

Mistake 1: Using character count instead of token count. A developer wrote len(prompt) > 100000 thinking 100K characters equals 100K tokens. In reality, 100K characters is about 25K tokens for English text. They were overestimating token usage by roughly 4x and rejecting prompts that would have fit. The fix: always use tiktoken.

Mistake 2: Forgetting to reserve output tokens. Another team set max_tokens=16384 but didn't subtract it from the input budget. They sent 120K tokens of input, which with 16K output totaled 136K — exceeding the 128K limit. The API returned a 400 error. The fix: subtract max_tokens from the context window before allocating input.

Mistake 3: Not accounting for tokenizer differences between models. A team switched from GPT-4o to Claude 3.5 Sonnet but kept using the cl100k_base tokenizer. Claude's tokenizer produces different counts. They ended up with prompts that were 20% larger than expected, causing context overflows. The fix: use the model-specific tokenizer.

common_mistakes.py (Python)
import tiktoken

# Mistake 1: Character count
prompt = "Hello, world!" * 1000
char_count = len(prompt)
token_count = len(tiktoken.get_encoding("cl100k_base").encode(prompt))
print(f"Characters: {char_count}, Tokens: {token_count}")
# Characters: 13000; the token count is several times smaller and depends on the encoding

# Mistake 2: Not reserving output
context_window = 128000
max_tokens = 16384
input_tokens = 120000
total = input_tokens + max_tokens  # 136384 > 128000
print(f"Total exceeds limit: {total} > {context_window}")

# Mistake 3: Wrong tokenizer
# Claude uses its own tokenizer, not cl100k_base
# Always check the model's documentation
Tokenizers are not interchangeable
A token count from cl100k_base is meaningless for Claude models. Use the model's official tokenizer. For OpenAI, use tiktoken. For Anthropic, use the token counting endpoint exposed by the anthropic SDK (client.messages.count_tokens).
Production Insight
A team building a code generation tool used character count to estimate tokens. Their system prompt was 50K characters, which they thought was 50K tokens. In reality, it was 12.5K tokens. They had 115K tokens of headroom they weren't using, leading to unnecessarily truncated prompts. The fix: switch to tiktoken and increase the prompt size.
Key Takeaway
Always use the model-specific tokenizer. Always reserve output tokens from the context window. Never use character count as a proxy for token count.

Token Budgeting vs. Other Cost Control Strategies

Token budgeting is one of several cost control strategies. Here's how it compares:

Token Budgeting: Proactive. You count tokens before the call and reject or truncate if over budget. Best for real-time applications where you control the input.

Cost Alerts: Reactive. You monitor API costs and alert when they exceed a threshold. Useful as a safety net, but by the time you get the alert, the money is already spent.

Model Selection: Proactive. Use a cheaper model for simple tasks and a more expensive one for complex tasks. GPT-4o-mini costs 1/17th of GPT-4o per input token. But it also has lower accuracy.

Caching: Proactive. Cache responses for identical prompts. This works well for system prompts and few-shot examples, but not for user-specific queries.

We use all four in combination. Token budgeting is the first line of defense. Cost alerts catch anything that slips through. Model selection reduces baseline costs. Caching eliminates redundant calls.

cost_control_strategies.py (Python)
import openai
from functools import lru_cache

# Model selection: route based on task complexity
def route_to_model(task: str) -> str:
    if task == "simple_qa":
        return "gpt-4o-mini"  # $0.15/M input
    elif task == "complex_reasoning":
        return "gpt-4o"  # $2.50/M input
    else:
        return "gpt-4o-mini"

# Caching: cache system prompt responses
@lru_cache(maxsize=1000)
def get_system_response(system_prompt: str, user_prompt: str) -> str:
    client = openai.OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        max_tokens=100
    )
    return response.choices[0].message.content
Combine strategies for maximum savings
Use token budgeting as the first line of defense. Add cost alerts as a safety net. Route simple queries to cheaper models. Cache identical prompts. Together, these can reduce costs by 70-90%.
Production Insight
A startup using GPT-4o for all queries was spending $10K/month. They implemented token budgeting (saved 20%), switched to GPT-4o-mini for simple Q&A (saved 60%), and added caching for common system prompts (saved another 10%). Total savings: 90% — down to $1K/month.
Key Takeaway
Token budgeting is the first line of defense, but combine it with model selection, caching, and cost alerts for maximum savings. No single strategy is enough.

Debugging and Monitoring Token Budgets in Production

Monitoring token budgets requires logging token counts per request and setting alerts on anomalies. We use structured logging with a 'token_usage' field that includes prompt_tokens, completion_tokens, and total_tokens. This lets us query for outliers: 'find all requests where total_tokens > 100000'.

We also track token usage per user and per feature. If a single user's token usage spikes, it's likely a runaway loop in their conversation. If a feature's token usage spikes, it's likely a bug in the prompt construction.

Alerts should trigger on:
  • Total tokens per request > 90% of context window
  • Daily token usage > 2x the 7-day rolling average
  • Any request that triggers a budget exceeded error
monitor_token_usage.py (Python)
import json
import logging
from datetime import datetime, timezone

# Structured logging for token usage
logger = logging.getLogger("token_budget")
handler = logging.FileHandler("token_usage.log")
formatter = logging.Formatter('%(message)s')
handler.setFormatter(formatter)
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def log_token_usage(user_id: str, feature: str, prompt_tokens: int, completion_tokens: int):
    log_entry = {
        # datetime.utcnow() is deprecated since Python 3.12; use an aware UTC timestamp
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "feature": feature,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens
    }
    logger.info(json.dumps(log_entry))

# Query for outliers (shell command)
# grep 'total_tokens' token_usage.log | python -c "import sys,json; logs=[json.loads(l) for l in sys.stdin]; print(max(l['total_tokens'] for l in logs))"
Log token counts in every response
The OpenAI API response includes a 'usage' field with prompt_tokens and completion_tokens. Always log these. They are the single most useful metric for debugging cost and budget issues.
Production Insight
A team noticed their daily token usage was 3x the normal level. They queried the logs and found a single user who had sent 500 requests in 10 minutes, each with a 50K-token prompt. The user had a bug in their client that was resending the full conversation history. The fix: add rate limiting per user and enforce a maximum prompt size.
Key Takeaway
Log token counts per request, per user, and per feature. Set alerts on outliers. Use structured logging so you can query for anomalies quickly.
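
The three alert conditions listed earlier in this section can be sketched as a pure function over numbers you already log. The rule names and the `check_alerts` signature are hypothetical; wiring the result to a pager or webhook is left out.

```python
from typing import List, Sequence


def check_alerts(
    request_total_tokens: int,
    context_window: int,
    today_tokens: int,
    previous_7_days: Sequence[int],
    budget_error_raised: bool = False,
) -> List[str]:
    """Return the names of the alert rules that fire for the given numbers."""
    alerts = []
    # Rule 1: a single request uses more than 90% of the context window
    if request_total_tokens > 0.9 * context_window:
        alerts.append("request_near_context_limit")
    # Rule 2: today's usage is more than 2x the 7-day rolling average
    if previous_7_days:
        rolling_avg = sum(previous_7_days) / len(previous_7_days)
        if today_tokens > 2 * rolling_avg:
            alerts.append("daily_usage_spike")
    # Rule 3: any budget-exceeded error
    if budget_error_raised:
        alerts.append("budget_exceeded")
    return alerts


week = [1_000_000] * 7
spike = check_alerts(120_000, 128_000, 3_000_000, week)
quiet = check_alerts(50_000, 128_000, 1_100_000, week)
print(spike, quiet)
```

Keeping the rules free of I/O makes them easy to unit test and to replay against historical logs when tuning thresholds.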
Production incident · Post-mortem · Severity: high

The $4,000 Overnight Token Overrun

Symptom
On-call engineer saw a spike in API 400 errors and a corresponding spike in cost dashboard showing $4,000 in 12 hours. Logs showed requests with token counts exceeding 128,000.
Assumption
The team assumed that since the system prompt was under 4,000 tokens, the total request would always be under the 128,000 limit. They didn't account for conversation history growing unbounded.
Root cause
A developer added a feature to include the full conversation history in every request. The history was stored as a list of messages and appended without any token counting. After 47 messages, the total exceeded 128,000 tokens. The API returned a 400 error, but the retry logic resubmitted the same oversized request, incurring charges for the failed call.
Fix
1. Add a token counting step using tiktoken before every API call. 2. Implement a sliding window that truncates conversation history to the last 10 messages. 3. Set a hard budget of 100,000 tokens for input, reserving 28,000 for output. 4. Add a cost alert that triggers if daily spend exceeds $500.
Key lesson
  • Always count total tokens before calling the API — use the model-specific tokenizer, not a generic character count.
  • Set a hard budget for input tokens and enforce it in code. Reserve at least 20% of the context window for output.
  • Monitor token usage per request and alert on anomalies. Token counts are a leading indicator of budget overruns; the cost dashboard only confirms them afterward.
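
The root cause above involved retry logic that resubmitted a request the API had already rejected. A minimal sketch of a retry wrapper that fails fast on client errors follows. `ApiError` is a hypothetical stand-in for your client library's exception type (in the OpenAI Python SDK a 400 surfaces as `openai.BadRequestError`), and the set of retryable status codes is an assumption you should adjust.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ApiError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status_code: int, message: str = ""):
        super().__init__(message or f"HTTP {status_code}")
        self.status_code = status_code


# Assumption: only rate limits and server-side failures are worth retrying.
RETRYABLE_STATUS = {429, 500, 502, 503, 504}


def call_with_retry(fn: Callable[[], T], max_attempts: int = 3, base_delay: float = 0.5) -> T:
    """Retry transient failures with exponential backoff; fail fast on everything else."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except ApiError as err:
            if err.status_code not in RETRYABLE_STATUS or attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("max_attempts must be at least 1")


attempts = []


def oversized_request() -> str:
    attempts.append(1)
    raise ApiError(400, "maximum context length exceeded")


try:
    call_with_retry(oversized_request, base_delay=0.0)
except ApiError as err:
    print(f"gave up after {len(attempts)} call(s): {err}")
```

A 400 for an oversized prompt will fail the same way every time, so retrying it only adds load and noise. The request has to be made smaller before it is sent again.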
Production debug guide: when the API starts returning 400 errors or your cost dashboard spikes at 2am (4 entries).
Symptom · 01
API returns 400 Bad Request with 'maximum context length exceeded'
Fix
Run tiktoken.encoding_for_model('gpt-4o').encode(prompt), substituting the model you actually call, and check len(encoded) against the model's context window. Log the token count per request.
Symptom · 02
Model responses are truncated or nonsensical
Fix
Check whether the output cap was hit: a finish_reason of "length" on the response choice means generation stopped at max_tokens. GPT-4o has a max output of 16,384 tokens. Use the max_tokens parameter to cap output, and log usage.completion_tokens from the API response.
Symptom · 03
Costs are higher than expected for the number of requests
Fix
Compare usage.prompt_tokens across requests. Look for a single request with an abnormally high count — likely a runaway conversation history or a system prompt that grew unbounded.
Symptom · 04
Model 'forgets' instructions mid-prompt
Fix
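Tokenize the full prompt and check how close it is to the context window. A prompt over the limit is rejected outright; a prompt near the limit fits, but the model attends less reliably to content in the middle. Keep critical instructions at the start and end, and use a sliding window to keep the most recent and most important parts.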
Token Budgeting for LLMs Triage Cheat Sheet: copy-paste diagnostics for when it's 2am and you need answers fast.
400 Bad Request — context length exceeded
Immediate action
Count tokens in the prompt
Commands
python -c "import tiktoken; enc = tiktoken.encoding_for_model('gpt-4o'); print(len(enc.encode(open('prompt.txt').read())))"
python -c "import openai; client = openai.OpenAI(); models = client.models.list(); print([m.id for m in models if 'gpt-4' in m.id])"
Fix now
Truncate conversation history to last 10 messages. Set max_tokens=4096 in the API call.
Cost spike: daily spend > $500
Immediate action
Check per-request token usage
Commands
grep 'usage' logs/app.log | awk '{print $NF}' | sort -n | tail -5
python -c "import json; logs = [json.loads(l) for l in open('logs/app.log') if 'usage' in l]; print(max(l['usage']['prompt_tokens'] for l in logs))"
Fix now
Add a token budget check before the API call: if total_tokens > 100000: raise BudgetExceededError.
Model drops instructions mid-prompt
Immediate action
Check if prompt exceeds context window
Commands
python -c "import tiktoken; enc = tiktoken.encoding_for_model('gpt-4o'); prompt = open('prompt.txt').read(); print(f'Tokens: {len(enc.encode(prompt))}')"
python -c "print('Model context window: 128000 tokens for GPT-4o')"
Fix now
Move critical instructions to the beginning and end of the prompt. Use a sliding window to keep only the last 10 messages.
Token Budgeting vs. Other Cost Control Strategies
| Concern | Token Budgeting | Rate Limiting | Model Fallback | Recommendation |
| --- | --- | --- | --- | --- |
| Granularity | Per-request + windowed total | Per-second requests | Per-model cost tier | Token budgeting for precision |
| Prevents bill spikes | Yes, by capping total tokens | No, only request count | Partially, cheaper model | Token budgeting + rate limiting |
| Implementation complexity | Medium (tokenizer + state) | Low (simple counter) | Low (switch model) | Start with rate limiting, add token budgeting |
| User experience impact | Graceful degradation | Hard throttling | Quality drop | Token budgeting with fallback |
| Real-time control | Yes, pre-flight check | Yes, per request | No, per session | Token budgeting for real-time |

Key takeaways

1. Always count tokens client-side before sending a request; never trust the API response alone for budgeting.
2. Set a hard per-request token budget and enforce it with a pre-flight check to avoid silent truncation or rejection.
3. Use a sliding window budget for streaming and batch jobs to cap total spend per time window, not just per call.
4. Monitor token usage per user, per model, and per endpoint separately; aggregate metrics hide spikes.
5. Build a fallback strategy (e.g., switch to cheaper model, queue requests) when budget is exhausted, not after the bill arrives.

Common mistakes to avoid


Counting tokens only after API call

Symptom
You see token usage in logs but can't stop runaway spend because the request already went out.
Fix
Use a local tokenizer (e.g., tiktoken for OpenAI) to count tokens before sending. Reject or queue if over budget.

Ignoring token overhead from system prompts and chat templates

Symptom
Actual token count is 20-30% higher than expected because system prompt and formatting tokens aren't accounted for.
Fix
Include system prompt, role tags, and stop sequences in your pre-flight token count. Cache the system prompt token count separately.

Using a flat per-request budget without a windowed total

Symptom
Each request is under budget, but a burst of 1000 requests drains the account in minutes.
Fix
Implement a sliding window budget (e.g., 1M tokens per hour) using a token bucket or leaky bucket algorithm.

Not handling tokenization edge cases (e.g., Unicode, code blocks)

Symptom
Token count mismatch between client and API for non-ASCII text, causing unexpected overages.
Fix
Use the exact tokenizer version the API uses (e.g., cl100k_base for GPT-4). Test with edge cases like emoji, long strings, and code.
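
The windowed budget from the third mistake above can be sketched with a token bucket, one of the two algorithms named in that fix. This is an in-process illustration only: the class and method names are hypothetical, and a multi-instance deployment would need shared state (for example in Redis) instead of instance attributes. The clock is injectable so the demo is deterministic.

```python
import time
from typing import Callable


class TokenBucket:
    """Windowed token budget: refills continuously up to `capacity`."""

    def __init__(self, capacity: int, refill_per_second: float,
                 clock: Callable[[], float] = time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._clock = clock
        self._available = float(capacity)
        self._last = clock()

    def _refill(self) -> None:
        now = self._clock()
        elapsed = now - self._last
        self._available = min(self.capacity, self._available + elapsed * self.refill_per_second)
        self._last = now

    def try_consume(self, tokens: int) -> bool:
        """Deduct `tokens` if the budget allows it; otherwise leave the bucket untouched."""
        self._refill()
        if tokens > self._available:
            return False
        self._available -= tokens
        return True


# 1M tokens per hour; a fake clock stands in for real time.
now = [0.0]
bucket = TokenBucket(1_000_000, 1_000_000 / 3600, clock=lambda: now[0])
first = bucket.try_consume(600_000)    # fits in the full bucket
second = bucket.try_consume(600_000)   # rejected: only 400,000 left
now[0] += 1800                         # half an hour later, about 500,000 refilled
third = bucket.try_consume(600_000)    # fits again
print(first, second, third)
```

Call `try_consume` with the pre-flight token count before each request; a `False` result is the point to reject, queue, or fall back to a cheaper model.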
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR: How would you design a token budgeting system for a multi-tenant LLM API...
Q02 · SENIOR: Explain the difference between tokenization in GPT-3 vs GPT-4 and why it...
Q03 · SENIOR: How do you handle token budget exhaustion in a real-time chat applicatio...
Q04 · SENIOR: What edge cases break naive token counting, and how do you fix them?
Q05 · SENIOR: Describe a production incident where token budgeting failed and how you'...

Q01 of 05 · SENIOR

How would you design a token budgeting system for a multi-tenant LLM API?

ANSWER
Start with per-tenant token buckets (e.g., 100K tokens/hour) using a token bucket algorithm. Use a local tokenizer to count input tokens before sending. For output, set max_tokens and track streaming tokens. Store budget state in Redis with TTL for atomic decrements. Fall back to a cheaper model or queue when budget exhausted. Log all usage to a time-series DB for auditing.
FAQ · 5 QUESTIONS

Frequently Asked Questions

01. How do I count tokens before sending a request to OpenAI?
02. What happens if I exceed my token budget in production?
03. Should I budget by input tokens, output tokens, or both?
04. How do I handle token budgeting for streaming responses?
05. What's the best way to log token usage for billing?