LLM Observability Tools — The $4k/month Token Leak We Caught at 3am
Stop guessing why your LLM costs are exploding.
- Token Tracking: Without per-request token accounting, you're flying blind on costs. We found a 40% token waste in a single misconfigured retry loop.
- Latency Breakdown: LLM calls aren't just model inference. Prompt serialization, embedding lookups, and context window management can add 800ms of hidden latency.
- Cost Attribution: Tag every span with user_id, model, and prompt_template to trace cost spikes to specific features or tenants.
- Error Budgets: Rate limits and context length errors are the new 503s. Track them with custom metrics and alert on error budget depletion.
- Span Linkage: A single user request can spawn 10+ LLM calls, 3 vector DB queries, and 2 re-ranking steps. Distributed tracing is non-negotiable.
- Prompt Drift: The same prompt template can produce wildly different token counts after a model update. Monitor token usage per template version.
LLM observability tools are the monitoring and debugging infrastructure specifically designed for the unique failure modes of large language model pipelines — token leaks, hallucination drift, prompt injection, and cost attribution at scale. Unlike traditional APM (Application Performance Monitoring) which tracks request latency and error rates, LLM observability instruments the entire chain: prompt templates, token counts per model call, embedding vector sizes, retrieval-augmented generation (RAG) context windows, and the exact cost-per-request across providers like OpenAI, Anthropic, or self-hosted models.
The core problem these tools solve is that LLMs are non-deterministic, stateful, and expensive — a single misconfigured system prompt can silently burn $4,000/month in excess tokens without triggering any standard 5xx error or latency spike.
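Under the hood, LLM observability relies on distributed tracing with semantic conventions for AI-specific spans. OpenTelemetry (OTel) provides the foundation, and instrumentation layers like OpenLIT or Traceloop add attributes such as gen_ai.request.model and token usage counts (gen_ai.usage.input_tokens and gen_ai.usage.output_tokens in the OTel GenAI semantic conventions; exact names vary by library and version). The prompt template name is typically a custom attribute you add yourself.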
These tools capture token-level telemetry, cost calculations (via provider pricing APIs), and latency breakdowns per model call. In production, you'd typically deploy an OTel collector as a sidecar or DaemonSet, exporting traces to a backend like SigNoz, Grafana Tempo, or Datadog's LLM Observability product.
The key difference from standard APM: you're not just measuring p99 latency — you're measuring token efficiency, prompt compression ratios, and whether your RAG pipeline is actually using retrieved context or hallucinating.
Where this fits in the ecosystem: you should use dedicated LLM observability when you have more than one model endpoint, multiple prompt templates, or any cost-sensitive production deployment. Alternatives include Datadog's LLM Observability (proprietary, deep integration with their APM but expensive at scale), LangSmith (great for prototyping but not designed for multi-tenant production), and Weights & Biases Prompts (more experiment tracking than runtime monitoring).
Do NOT use OpenTelemetry-based LLM observability if you're running a single model with fixed prompts and no cost constraints — the overhead of instrumenting every token call isn't worth it. For real-world scale, expect to pay $0.10–0.50 per million traced tokens for managed backends, or run your own OTel collector stack for ~$200/month in infrastructure costs if you're processing 10M+ tokens daily.
Think of LLM observability like having a fuel gauge and a mechanic for your chatbot. Without it, you're driving blind — you don't know how much each conversation costs, which parts are slow, or why the car suddenly stalls when too many people ask the same question. This article gives you the dashboard and the diagnostic tools.
Three weeks ago, our recommendation engine started burning through $4,000/month in OpenAI API costs with no change in traffic. The p99 latency jumped from 2s to 8s. Users complained of timeouts. Our Grafana dashboard showed a flat line — no spikes, no errors. The system was 'working', just slower and more expensive. That's the lie LLM observability is supposed to catch.
Most tutorials hand you a tracing library and call it a day. They show you a pretty waterfall chart of one LLM call and tell you to instrument your app. They don't tell you that the real problems live in the gaps: retry storms from rate limits, token waste from prompt caching, cost attribution to a specific user who's gaming your system. They don't tell you that OpenTelemetry spans are only useful if you tag them with the right metadata.
This guide covers what I wish I'd known before that 3am page. We'll walk through production-grade LLM observability using OpenTelemetry and OpenLIT, with real code for tracing, metrics, and cost tracking. I'll show you the exact config that caught our token leak, the debug steps when your dashboard shows nothing useful, and the one metric you must alert on before your next bill arrives.
How LLM Observability Actually Works Under the Hood
LLM observability is not just API monitoring. A single user request can trigger a chain: prompt serialization, embedding lookup in a vector DB, context window management, multiple LLM calls (for reasoning, tool use, or re-ranking), and post-processing. Each step has its own latency, cost, and failure modes.
The core abstraction is the span. OpenTelemetry defines a span as a single operation with a start and end time, plus attributes. For LLM apps, you need spans at multiple levels: the user request, each LLM call, each vector DB query, and each tool invocation. The tricky part is linking them — you need a trace ID that propagates across service boundaries.
OpenLIT auto-instruments popular LLM libraries (OpenAI, LangChain, Anthropic) by wrapping the client methods that send requests. It creates a span for each LLM request, adds attributes like model, prompt_tokens, completion_tokens, and total_tokens, and exports them via OTLP. But auto-instrumentation only gets you so far. You still need to manually instrument your business logic: the prompt assembly, the retry logic, the caching layer.
What the docs don't tell you: span attributes are the difference between a useless waterfall chart and a cost-attribution dashboard. Tag every span with user_id, prompt_template, feature_name, and tenant_id. Without those, you can't answer 'which user is costing me $500/day?'
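One way to keep that tagging consistent is to build the attribute set in a single place and refuse to emit a span with gaps. The sketch below is an assumption about how you might structure it, not OpenLIT's API: `llm_span_attributes` is a hypothetical helper, and the attribute names simply mirror the ones used in this article.

```python
from typing import Dict, Union

AttrValue = Union[str, int]


def llm_span_attributes(
    *,
    user_id: str,
    tenant_id: str,
    feature_name: str,
    prompt_template: str,
    model: str,
    prompt_tokens: int,
    completion_tokens: int,
) -> Dict[str, AttrValue]:
    """Build the attribute set for one LLM span, failing loudly on gaps."""
    attrs: Dict[str, AttrValue] = {
        "user_id": user_id,
        "tenant_id": tenant_id,
        "feature_name": feature_name,
        "prompt_template": prompt_template,
        "model": model,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
    }
    # An empty tag is as useless as a missing one for cost attribution.
    missing = [key for key, value in attrs.items() if value in ("", None)]
    if missing:
        raise ValueError(f"span attributes missing: {missing}")
    return attrs


# With the OpenTelemetry Python API, this dict could be passed to
# span.set_attributes(attrs) on the span that wraps the LLM call.
```

Raising on a missing tag is a deliberate choice here: a span without tenant_id still renders in the waterfall, so the gap tends to go unnoticed until you try to attribute a bill.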
For retrieval spans, also record index_name and collection_id.
Practical Implementation: Setting Up OpenTelemetry + OpenLIT for Production
Let's walk through a production-ready setup. We'll use OpenLIT for auto-instrumentation of OpenAI and LangChain, then add manual instrumentation for business logic. We'll export traces and metrics to Grafana Cloud via OTLP.
First, install dependencies. Use pinned versions to avoid breakage. We learned this the hard way when OpenLIT 0.4.0 broke our LangChain integration.
Next, configure the OpenTelemetry SDK. The key decision is the exporter endpoint. For Grafana Cloud, you need the OTLP endpoint and a token. Never hardcode credentials — use environment variables.
Then, initialize OpenLIT. It will automatically patch openai.ChatCompletion.create and LangChain's LLMChain.run. But you still need to wrap your main request handler in a trace to link all spans together.
Finally, add custom metrics. The auto-instrumentation gives you latency histograms and token counts, but you need business metrics: requests per user, cost per template, error rate by model.
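The configuration step above can be guarded with a small fail-fast loader. This is a sketch under assumptions: `load_otlp_settings` is a hypothetical helper, and it only reads and validates the two standard OTLP environment variables (the headers variable uses comma-separated key=value pairs); it does not create an exporter.

```python
import os
from typing import Dict, Mapping


def load_otlp_settings(env: Mapping[str, str] = os.environ) -> Dict[str, object]:
    """Read OTLP exporter settings from the environment, failing fast on gaps."""
    endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "").strip()
    if not endpoint:
        # Without this check the exporter would fall back to a local default
        # and traces would be dropped silently if no collector listens there.
        raise RuntimeError("OTLP endpoint not set")

    headers: Dict[str, str] = {}
    raw = env.get("OTEL_EXPORTER_OTLP_HEADERS", "")
    entries = [part.strip() for part in raw.split(",") if part.strip()]
    for position, entry in enumerate(entries, start=1):
        # Split on the first "=" only, so values such as base64 padding survive.
        key, sep, value = entry.partition("=")
        if not sep or not key.strip():
            # Report the position, not the entry, so credentials stay out of logs.
            raise RuntimeError(f"malformed OTLP header entry #{position}")
        headers[key.strip()] = value.strip()

    return {"endpoint": endpoint, "headers": headers}


# Typical use at process start, before any tracer is created:
#   settings = load_otlp_settings()
```

Passing the environment in as a mapping keeps the check easy to unit test and makes the dev, staging, and production difference a matter of which variables are set.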
Configure the exporter through the OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_EXPORTER_OTLP_HEADERS environment variables. This makes it easy to switch between dev, staging, and production.
We once deployed without the OTEL_EXPORTER_OTLP_ENDPOINT environment variable. The SDK silently defaulted to localhost:4317, so all traces went to a non-existent collector. We lost 3 days of data. Add a startup check: if not os.getenv('OTEL_EXPORTER_OTLP_ENDPOINT'): raise RuntimeError('OTLP endpoint not set').
When NOT to Use OpenTelemetry for LLM Observability
OpenTelemetry is the standard, but it's not always the right choice. Here are three scenarios where you should consider alternatives.
1. You need real-time cost tracking at the sub-second level. OpenTelemetry's batch span processor exports every 5 seconds by default. If you need to enforce a per-user token budget in real time, you need a streaming approach. Consider using a middleware that sends token counts to a Redis counter or a streaming platform like Kafka.
2. You're running LLMs on a massive scale (10k+ requests/second). The OpenTelemetry collector can become a bottleneck. We saw the collector's CPU spike to 80% at 5k req/s. Consider sampling: use the tail-based sampler to keep only traces with errors or high latency.
3. You need deep prompt-level debugging. OpenTelemetry spans are not designed to store full prompts and responses. If you need to replay a specific conversation for debugging, store the prompts and responses in a separate store (e.g., S3 or a database) and link them to the trace via a trace ID.
For most teams, OpenTelemetry is the right choice. But know its limits.
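For the first scenario, real-time budget enforcement, the counter lives outside the tracing pipeline. Below is a minimal in-memory sketch of a fixed-window per-user token budget. It assumes a single process; in production the dictionary would be replaced by a shared store such as the Redis counter mentioned above. The `TokenBudget` class and its parameters are hypothetical.

```python
import time
from collections import defaultdict
from typing import Callable, DefaultDict, Tuple


class TokenBudget:
    """Fixed-window per-user token budget (in-memory stand-in for a shared counter)."""

    def __init__(
        self,
        limit_per_window: int,
        window_seconds: int = 60,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self.limit = limit_per_window
        self.window = window_seconds
        self.clock = clock
        # Keyed by (user, window index). Old windows are never evicted here;
        # a shared store would rely on key expiry instead.
        self._used: DefaultDict[Tuple[str, int], int] = defaultdict(int)

    def _key(self, user_id: str) -> Tuple[str, int]:
        return (user_id, int(self.clock() // self.window))

    def remaining(self, user_id: str) -> int:
        """Tokens the user may still spend in the current window."""
        return max(0, self.limit - self._used[self._key(user_id)])

    def record(self, user_id: str, total_tokens: int) -> bool:
        """Add tokens from a finished call; return False once over budget."""
        key = self._key(user_id)
        self._used[key] += total_tokens
        return self._used[key] <= self.limit
```

A request handler would check `remaining()` before calling the model and call `record()` with the token count from the response, which gives a decision in the request path rather than seconds later when a span batch is exported.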
Production Patterns & Scale: Cost Attribution and Tenant Isolation
At scale, the biggest challenge is cost attribution. You need to know which team, feature, or user is driving costs. This requires a consistent tagging strategy across all spans.
Pattern 1: Tag everything with tenant_id. If you have multiple customers or internal teams, tag every span with their ID. This lets you answer 'how much did tenant X cost us this month?'
Pattern 2: Use a cost calculation pipeline. Token counts are not enough. You need to multiply by the model's per-token cost. Create a batch job that reads spans from your observability backend, calculates cost per span, and writes it to a cost dashboard.
Pattern 3: Set per-tenant budgets. Use the cost data to enforce budgets. If tenant X exceeds their budget, you can throttle their requests or switch them to a cheaper model.
Pattern 4: Monitor prompt template drift. The same prompt template can produce wildly different token counts after a model update. Track token usage per template version and alert on significant changes.
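Patterns 1 to 3 can be sketched as a small batch computation: price each span, sum by tenant, and compare against budgets. The prices, model names, and span field names below are hypothetical placeholders, so substitute your provider's current price sheet and your own span schema.

```python
from collections import defaultdict
from decimal import Decimal
from typing import Dict, Iterable, List, Mapping

# HYPOTHETICAL prices in USD per 1M tokens: (input, output).
PRICE_PER_MILLION = {
    "model-small": (Decimal("0.50"), Decimal("1.50")),
    "model-large": (Decimal("10.00"), Decimal("30.00")),
}
MILLION = Decimal(1_000_000)


def span_cost(span: Mapping[str, object]) -> Decimal:
    """Cost of one LLM span. Raises KeyError for a model with no known price."""
    input_price, output_price = PRICE_PER_MILLION[span["model"]]
    return (
        span["prompt_tokens"] * input_price
        + span["completion_tokens"] * output_price
    ) / MILLION


def cost_by_tenant(spans: Iterable[Mapping[str, object]]) -> Dict[str, Decimal]:
    """Sum span costs per tenant_id."""
    totals: Dict[str, Decimal] = defaultdict(Decimal)
    for span in spans:
        totals[span["tenant_id"]] += span_cost(span)
    return dict(totals)


def over_budget(
    totals: Mapping[str, Decimal], budgets: Mapping[str, Decimal]
) -> List[str]:
    """Tenants whose cost exceeds their budget; tenants without one are skipped."""
    unlimited = Decimal("Infinity")
    return sorted(
        tenant for tenant, cost in totals.items()
        if cost > budgets.get(tenant, unlimited)
    )
```

Decimal is used instead of float so that per-span fractions of a cent do not accumulate rounding error across millions of spans. Failing on an unknown model is also intentional: a silently unpriced model is exactly the kind of gap that hides a cost spike.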
Common Mistakes with Specific Examples
Here are the three most common mistakes we've seen (and made) in production.
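Mistake 1: Not setting max_retries explicitly on the OpenAI client. The default depends on the SDK version: recent 1.x releases of the Python SDK retry twice with a short backoff, while older code paths and hand-rolled HTTP clients may not retry at all. If a 429 surfaces as an exception and you catch it and retry in a loop without exponential backoff, you create a retry storm. We saw this cause a 60% cost increase.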
Mistake 2: Only monitoring error rates. A system can be 'working' (no errors) while burning money. Monitor token rate, cost rate, and latency distribution. Alert on deviations from baseline.
Mistake 3: Not tagging spans with prompt template version. When you update a prompt template, you need to know if the new version is more expensive or slower. Without template version tags, you can't attribute a cost spike to a prompt change.
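To illustrate what the template version tag from Mistake 3 buys you, here is a sketch that compares mean token usage between two versions of each template. The `template_version` field name and the 1.2 threshold are assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Mapping, Tuple


def mean_tokens_by_version(
    spans: Iterable[Mapping[str, object]],
) -> Dict[Tuple[str, str], float]:
    """Mean total_tokens per (prompt_template, template_version)."""
    buckets: Dict[Tuple[str, str], List[int]] = defaultdict(list)
    for span in spans:
        key = (span["prompt_template"], span["template_version"])
        buckets[key].append(span["total_tokens"])
    return {key: mean(values) for key, values in buckets.items()}


def drifted_templates(
    spans: Iterable[Mapping[str, object]],
    old_version: str,
    new_version: str,
    threshold: float = 1.2,
) -> Dict[str, float]:
    """Templates whose mean tokens grew by more than `threshold` times."""
    means = mean_tokens_by_version(spans)
    flagged: Dict[str, float] = {}
    for template in sorted({name for name, _version in means}):
        old = means.get((template, old_version))
        new = means.get((template, new_version))
        # Skip templates that lack data for either version.
        if old and new and new / old > threshold:
            flagged[template] = new / old
    return flagged
```

Without the version tag the two populations collapse into one average, and a 60% jump in one template shows up only as a vague rise in overall token rate.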
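The retry default depends on your client and SDK version, so don't rely on it. If a 429 is never retried, a single rate-limit error will fail the request. Set max_retries=3 on the client and use exponential backoff (factor 2) in any retry loop you write yourself.
Comparison vs Alternatives: OpenTelemetry vs Datadog LLM Observability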
The main alternatives are OpenTelemetry (with OpenLIT) and Datadog's LLM Observability SDK. Here's a production comparison.
OpenTelemetry + OpenLIT:
- Open source, vendor-agnostic. You can use any backend (Grafana, Jaeger, Prometheus).
- Auto-instruments OpenAI, LangChain, Anthropic, and more.
- Requires manual setup of the collector and exporters.
- No built-in cost calculation or alerting. You build those yourself.
Datadog LLM Observability:
- Proprietary, vendor-locked to Datadog.
- Provides a higher-level SDK that includes cost tracking, evaluations, and a pre-built dashboard.
- Easier to set up: just install the SDK and set environment variables.
- More expensive at scale: Datadog's pricing is per-host + per-span.
When to choose OpenTelemetry: You want vendor independence, you already have a Grafana/Prometheus stack, or you need to customize your observability pipeline.
When to choose Datadog: You're already on Datadog, you need the higher-level features (evaluations, cost tracking), and you don't mind the vendor lock-in.
Debugging and Monitoring: The 3am Playbook
When you get paged at 3am because costs are spiking or latency is high, you need a systematic approach.
Step 1: Check the token rate. If costs are spiking but traffic is normal, look at token count per request. A sudden increase in token count per request indicates a prompt change or a retry storm.
Step 2: Check the retry rate. If p99 latency is high but p50 is normal, you likely have a retry storm. Check the retry_count span attribute.
Step 3: Check the model distribution. Did a recent deployment change the default model from gpt-3.5-turbo to gpt-4? That would explain a cost spike.
Step 4: Check the error rate by model. Some models have higher error rates. If you switched to a new model, it might be returning more errors, causing retries.
Step 5: Check the prompt template distribution. Did a new prompt template get deployed? It might be generating more tokens than expected.
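The playbook can be partly automated by summarizing a baseline window and the current window, then diffing them. This is a rough sketch: the span field names, the 1.5 factor, and the `summarize_window` and `triage` helpers are all assumptions, and it covers steps 1, 2, 3, and 5 only.

```python
from collections import Counter
from statistics import median
from typing import Dict, List, Mapping, Sequence


def summarize_window(spans: Sequence[Mapping[str, object]]) -> Dict[str, object]:
    """Collapse one time window of LLM spans into the numbers the playbook checks."""
    if not spans:
        raise ValueError("no spans in window")
    retried = sum(1 for span in spans if span.get("retry_count", 0) > 0)
    return {
        "median_tokens": median(span["total_tokens"] for span in spans),
        "retry_rate": retried / len(spans),
        "model_mix": Counter(span["model"] for span in spans),
        "template_mix": Counter(span["prompt_template"] for span in spans),
    }


def triage(
    baseline: Mapping[str, object],
    current: Mapping[str, object],
    factor: float = 1.5,
) -> List[str]:
    """Return playbook findings, in step order, for the current window."""
    findings: List[str] = []
    if current["median_tokens"] > factor * baseline["median_tokens"]:
        findings.append("step 1: tokens per request jumped")
    if current["retry_rate"] > factor * baseline["retry_rate"]:
        findings.append("step 2: retry rate is up")
    new_models = sorted(set(current["model_mix"]) - set(baseline["model_mix"]))
    if new_models:
        findings.append(f"step 3: new model in traffic: {new_models}")
    new_templates = sorted(set(current["template_mix"]) - set(baseline["template_mix"]))
    if new_templates:
        findings.append(f"step 5: new prompt template in traffic: {new_templates}")
    return findings
```

The median is used for tokens per request so that one pathological conversation does not mask or fake a shift in the typical request.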
The Silent $4k/month Token Leak
The backoff variable in the retry logic was resetting to 0 on every retry instead of incrementing, causing the retry loop to fire 10+ times per request when OpenAI returned a 429 rate limit error. Each retry re-sent the full prompt, including the conversation history, which grew unboundedly.
1. Added openai.InternalServerError and openai.RateLimitError to a retryable status list with exponential backoff.
2. Set max_retries=3 on the OpenAI client.
3. Added a token_count metric to every span.
4. Created a Grafana alert on sum(rate(token_count_total[5m])) > 1e6.
5. Reviewed all retry loops in the codebase. Found and fixed 2 more with the same bug.
- Tag every LLM span with token_count, model, and prompt_template. Cost attribution is impossible without it.
- Set a hard limit on retries. Exponential backoff is not optional: your retry loop can become a DoS attack on your own wallet.
- Alert on token rate, not just error rate. A system can be 'working' while burning money.
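For reference, a retry loop with a hard cap and a delay that actually grows can be sketched as follows. This is not the code from the incident: `RetryableError` is a hypothetical stand-in for errors such as a rate limit, and the sleep function is injectable so the loop can be tested without waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RetryableError(Exception):
    """Hypothetical stand-in for a retryable failure such as a 429."""


def call_with_backoff(
    call: Callable[[], T],
    *,
    max_retries: int = 3,
    base_delay: float = 1.0,
    factor: float = 2.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying on RetryableError with exponential backoff."""
    delay = base_delay  # initialized once, outside the loop
    for attempt in range(max_retries + 1):
        try:
            return call()
        except RetryableError:
            if attempt == max_retries:
                raise  # hard limit reached: surface the error, stop spending
            sleep(min(delay, max_delay))
            # Grow the delay. Resetting it here is the class of bug that
            # turns a rate limit into a retry storm.
            delay *= factor
    raise AssertionError("unreachable")
```

With the defaults, a call that keeps failing is attempted four times in total, with waits of 1, 2, and 4 seconds. Adding random jitter to each wait is a common refinement when many workers hit the same rate limit at once.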
- Run kubectl logs -l app=llm-service --tail=100 | grep 'token_count' | awk '{print $NF}' | sort | uniq -c | sort -rn to see per-request token distribution. Look for outliers > 10x median.
- Add retry_count to your span attributes. Query: sum by (retry_count) (rate(span_count_total[5m])). If retry_count > 3 is non-zero, you have a retry storm.
- Set export OPENAI_LOG=debug. Check for RateLimitError in logs.
- Break latency down by user_id and prompt_template. Query: histogram_quantile(0.99, sum by (le, user_id) (rate(llm_latency_seconds_bucket[5m]))). Find the user with the highest p99.
- python -c "import pandas as pd; df = pd.read_parquet('spans.parquet'); print(df['token_count'].describe())"
- kubectl logs -l app=llm-service --tail=500 | grep 'token_count' | awk '{print $NF}' | sort -n | tail -20
- Add max_retries=3 to your OpenAI client and exponential backoff (factor 2) to your own retry loops.
Key takeaways
- Use the usage field from the API response to avoid double-counting on retries or streaming.
- Tag spans with tenant_id and user_id at creation time, not as an afterthought.
Common mistakes to avoid
Double-counting tokens on retries
Add a request_id to each LLM call and deduplicate spans in the exporter. In OpenTelemetry, set span.set_attribute('llm.request_id', uuid) and filter duplicates in the batch processor.
Missing tenant isolation in spans
Symptom: spans arrive with tenant_id=unknown. Fix: context propagation. In your middleware, call span.set_attribute('tenant_id', ctx.tenant_id) before any LLM call, never after.
Using client-side token estimation
Use response.usage from the API response object (e.g., response.usage.total_tokens for OpenAI). Never use tiktoken or other estimators for billing.
Not setting up cost alerts on deviation
Alert on avg(rate(llm_cost_per_request[5m])) > 1.2 * avg(rate(llm_cost_per_request[1h])), which triggers when per-request cost jumps 20%.
Interview Questions on This Topic
How would you design an observability system to detect token leaks in an LLM-powered application?