Senior 6 min · May 22, 2026

LLM Observability Tools — The $4k/month Token Leak We Caught at 3am

Stop guessing why your LLM costs are exploding.

Naren · Founder
Plain-English first. Then code. Then the interview question.
Quick Answer
  • Token Tracking: Without per-request token accounting, you're flying blind on costs. We found a 40% token waste in a single misconfigured retry loop.
  • Latency Breakdown: LLM calls aren't just model inference. Prompt serialization, embedding lookups, and context window management can add 800ms of hidden latency.
  • Cost Attribution: Tag every span with user_id, model, and prompt_template to trace cost spikes to specific features or tenants.
  • Error Budgets: Rate limits and context length errors are the new 503s. Track them with custom metrics and alert on error budget depletion.
  • Span Linkage: A single user request can spawn 10+ LLM calls, 3 vector DB queries, and 2 re-ranking steps. Distributed tracing is non-negotiable.
  • Prompt Drift: The same prompt template can produce wildly different token counts after a model update. Monitor token usage per template version.
What Are LLM Observability Tools?

LLM observability tools are the monitoring and debugging infrastructure specifically designed for the unique failure modes of large language model pipelines — token leaks, hallucination drift, prompt injection, and cost attribution at scale. Unlike traditional APM (Application Performance Monitoring) which tracks request latency and error rates, LLM observability instruments the entire chain: prompt templates, token counts per model call, embedding vector sizes, retrieval-augmented generation (RAG) context windows, and the exact cost-per-request across providers like OpenAI, Anthropic, or self-hosted models.

The core problem these tools solve is that LLM calls are non-deterministic, stateless (every call re-sends its full context), and expensive. A single misconfigured retry loop or system prompt can silently burn $4,000/month in excess tokens without triggering any standard 5xx error or latency spike.

Under the hood, LLM observability relies on distributed tracing with semantic conventions for AI-specific spans. OpenTelemetry (OTel) provides the foundation, but you need instrumentation libraries like OpenLIT or Traceloop that populate GenAI attributes such as gen_ai.request.model, gen_ai.usage.input_tokens, and gen_ai.usage.output_tokens. Business context such as the prompt template name is not part of those conventions, so you add it yourself as a custom attribute.

These tools capture token-level telemetry, cost calculations (via provider pricing APIs), and latency breakdowns per model call. In production, you'd typically deploy an OTel collector as a sidecar or DaemonSet, exporting traces to a backend like SigNoz, Grafana Tempo, or Datadog's LLM Observability product.

The key difference from standard APM: you're not just measuring p99 latency — you're measuring token efficiency, prompt compression ratios, and whether your RAG pipeline is actually using retrieved context or hallucinating.

Where this fits in the ecosystem: you should use dedicated LLM observability when you have more than one model endpoint, multiple prompt templates, or any cost-sensitive production deployment. Alternatives include Datadog's LLM Observability (proprietary, deep integration with their APM but expensive at scale), LangSmith (great for prototyping but not designed for multi-tenant production), and Weights & Biases Prompts (more experiment tracking than runtime monitoring).

Do NOT use OpenTelemetry-based LLM observability if you're running a single model with fixed prompts and no cost constraints — the overhead of instrumenting every token call isn't worth it. For real-world scale, expect to pay $0.10–0.50 per million traced tokens for managed backends, or run your own OTel collector stack for ~$200/month in infrastructure costs if you're processing 10M+ tokens daily.

[Diagram: LLM Observability Stack. (1) LLM Call (request + response) → instrument → (2) Tracer (LangSmith / Langfuse) → emit spans → (3) Metrics Store (latency / tokens / cost) → threshold → (4) Alert Engine (p99 latency / error rate) → alert → (5) Dashboard (Grafana / Datadog)]
Plain-English First

Think of LLM observability like having a fuel gauge and a mechanic for your chatbot. Without it, you're driving blind — you don't know how much each conversation costs, which parts are slow, or why the car suddenly stalls when too many people ask the same question. This article gives you the dashboard and the diagnostic tools.

Three weeks ago, our recommendation engine started burning through $4,000/month in OpenAI API costs with no change in traffic. The p99 latency jumped from 2s to 8s. Users complained of timeouts. Our Grafana dashboard showed a flat line — no spikes, no errors. The system was 'working', just slower and more expensive. That's the lie LLM observability is supposed to catch.

Most tutorials hand you a tracing library and call it a day. They show you a pretty waterfall chart of one LLM call and tell you to instrument your app. They don't tell you that the real problems live in the gaps: retry storms from rate limits, token waste from prompt caching, cost attribution to a specific user who's gaming your system. They don't tell you that OpenTelemetry spans are only useful if you tag them with the right metadata.

This guide covers what I wish I'd known before that 3am page. We'll walk through production-grade LLM observability using OpenTelemetry and OpenLIT, with real code for tracing, metrics, and cost tracking. I'll show you the exact config that caught our token leak, the debug steps when your dashboard shows nothing useful, and the one metric you must alert on before your next bill arrives.

How LLM Observability Actually Works Under the Hood

LLM observability is not just API monitoring. A single user request can trigger a chain: prompt serialization, embedding lookup in a vector DB, context window management, multiple LLM calls (for reasoning, tool use, or re-ranking), and post-processing. Each step has its own latency, cost, and failure modes.

The core abstraction is the span. OpenTelemetry defines a span as a single operation with a start and end time, plus attributes. For LLM apps, you need spans at multiple levels: the user request, each LLM call, each vector DB query, and each tool invocation. The tricky part is linking them — you need a trace ID that propagates across service boundaries.
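
To make the propagation step concrete, here is a minimal sketch of the W3C Trace Context `traceparent` header, the format OpenTelemetry's default propagator carries across HTTP boundaries. The helper names (`new_traceparent`, `parse_traceparent`, `child_traceparent`) are illustrative and not part of any SDK; in a real service the OpenTelemetry propagator builds and reads this header for you.

```python
import re
import secrets

# W3C Trace Context layout: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace: random 16-byte trace ID and 8-byte span ID."""
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(header: str) -> dict:
    """Split a traceparent header into its fields, or raise ValueError."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, span_id, flags = match.groups()
    return {"version": version, "trace_id": trace_id, "span_id": span_id, "flags": flags}


def child_traceparent(parent_header: str) -> str:
    """Header a downstream call should carry: same trace ID, fresh span ID."""
    parent = parse_traceparent(parent_header)
    return f"00-{parent['trace_id']}-{secrets.token_hex(8)}-{parent['flags']}"


if __name__ == "__main__":
    incoming = new_traceparent()
    outgoing = child_traceparent(incoming)
    print(incoming)
    print(outgoing)
```

The trace ID stays constant while each hop gets its own span ID. That shared trace ID is what lets the backend stitch the request handler, the vector DB query, and every LLM call into one waterfall.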

OpenLIT auto-instruments popular LLM libraries (OpenAI, LangChain, Anthropic) by wrapping the client's request methods, such as the chat completions create call. It creates a span for each LLM request, adds attributes like model, prompt_tokens, completion_tokens, and total_tokens, and exports them via OTLP. But auto-instrumentation only gets you so far. You still need to manually instrument your business logic: the prompt assembly, the retry logic, the caching layer.

What the docs don't tell you: span attributes are the difference between a useless waterfall chart and a cost-attribution dashboard. Tag every span with user_id, prompt_template, feature_name, and tenant_id. Without those, you can't answer 'which user is costing me $500/day?'

instrument_llm_call.py (Python)
import openai
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Set up OpenTelemetry tracer
provider = TracerProvider()
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

client = openai.OpenAI()

def get_llm_response(prompt: str, user_id: str, template_name: str) -> str:
    with tracer.start_as_current_span("llm_call") as span:
        # Manual instrumentation: add business-logic attributes
        span.set_attribute("user_id", user_id)
        span.set_attribute("prompt_template", template_name)
        span.set_attribute("prompt_length", len(prompt))

        try:
            response = client.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "user", "content": prompt}],
                max_tokens=500
            )
            # Auto-instrumentation by OpenLIT would add token counts here
            # But we add them manually for safety
            span.set_attribute("token_count", response.usage.total_tokens)
            span.set_attribute("model", response.model)
            return response.choices[0].message.content
        except openai.RateLimitError as e:
            span.set_attribute("error", True)
            span.set_attribute("error_type", "rate_limit")
            span.record_exception(e)
            raise
Auto-instrumentation is not enough
OpenLIT adds token counts and model names, but it can't add your business context. If you don't tag spans with user_id and prompt_template, you'll never trace a cost spike to a specific feature or user.
Production Insight
A recommendation engine serving 2M req/day started returning stale results after a schema migration. The team had auto-instrumentation but no span attributes for the vector DB query. It took 3 days to realize the embedding lookup was querying the wrong index. Tag your vector DB spans with index_name and collection_id.
Key Takeaway
Span attributes are your primary debugging dimension. Invest in tagging every span with business context. Without it, you're looking at a pretty but useless waterfall.

Practical Implementation: Setting Up OpenTelemetry + OpenLIT for Production

Let's walk through a production-ready setup. We'll use OpenLIT for auto-instrumentation of OpenAI and LangChain, then add manual instrumentation for business logic. We'll export traces and metrics to Grafana Cloud via OTLP.

First, install dependencies. Use pinned versions to avoid breakage. We learned this the hard way when OpenLIT 0.4.0 broke our LangChain integration.

Next, configure the OpenTelemetry SDK. The key decision is the exporter endpoint. For Grafana Cloud, you need the OTLP endpoint and a token. Never hardcode credentials — use environment variables.

Then, initialize OpenLIT. It will automatically patch the OpenAI client's chat.completions.create call and LangChain's chain invocations. But you still need to wrap your main request handler in a trace to link all spans together.

Finally, add custom metrics. The auto-instrumentation gives you latency histograms and token counts, but you need business metrics: requests per user, cost per template, error rate by model.

production_setup.py (Python)
import os

import openlit
from opentelemetry import trace
# Grafana Cloud's OTLP gateway speaks OTLP over HTTP, so use the HTTP exporters.
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.metrics import set_meter_provider
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Hypothetical module holding get_llm_response from instrument_llm_call.py,
# without its tracer setup (the global TracerProvider can only be set once).
from llm_calls import get_llm_response

# GRAFANA_CLOUD_TOKEN is assumed to hold the base64-encoded "instance_id:api_token"
# pair that HTTP Basic auth expects.
AUTH_HEADERS = {"Authorization": f"Basic {os.getenv('GRAFANA_CLOUD_TOKEN')}"}

# 1. Configure OpenTelemetry SDK with resource attributes
resource = Resource.create({
    "service.name": "llm-recommendation-engine",
    "service.version": "1.2.3",
    "deployment.environment": "production"
})

# 2. Set up trace exporter
# Grafana Cloud OTLP endpoint: https://otlp-gateway-prod-us-east-0.grafana.net/otlp
# With no endpoint argument, the HTTP exporter reads OTEL_EXPORTER_OTLP_ENDPOINT
# and appends /v1/traces itself.
trace_exporter = OTLPSpanExporter(headers=AUTH_HEADERS)
trace_provider = TracerProvider(resource=resource)
trace_provider.add_span_processor(BatchSpanProcessor(trace_exporter))
trace.set_tracer_provider(trace_provider)

# 3. Set up metric exporter (same env var, /v1/metrics appended automatically)
metric_exporter = OTLPMetricExporter(headers=AUTH_HEADERS)
metric_reader = PeriodicExportingMetricReader(metric_exporter, export_interval_millis=10000)
meter_provider = MeterProvider(resource=resource, metric_readers=[metric_reader])
set_meter_provider(meter_provider)

# 4. Initialize OpenLIT for auto-instrumentation
# The keyword for exporter headers is otlp_headers; verify the argument names
# against the OpenLIT version you have pinned.
openlit.init(
    application_name="llm-recommendation-engine",
    otlp_endpoint=os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT"),
    otlp_headers=AUTH_HEADERS
)

# 5. Now all OpenAI and LangChain calls are automatically traced
# But we still need to wrap the request handler for trace linkage
tracer = trace.get_tracer(__name__)

def handle_request(user_id: str, prompt: str):
    with tracer.start_as_current_span("handle_request") as span:
        span.set_attribute("user_id", user_id)
        # The LLM call inside will be a child span
        response = get_llm_response(prompt, user_id, "default_template")
        return response
Use environment variables for all config
Never hardcode OTLP endpoints or tokens. Use OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_EXPORTER_OTLP_HEADERS environment variables. This makes it easy to switch between dev, staging, and production.
Production Insight
We once deployed a new version of the LLM service and forgot to set the OTEL_EXPORTER_OTLP_ENDPOINT environment variable. The SDK silently fell back to its default localhost endpoint, so all traces went to a non-existent collector. We lost 3 days of data. Add a startup check: if not os.getenv('OTEL_EXPORTER_OTLP_ENDPOINT'): raise RuntimeError('OTLP endpoint not set').
Key Takeaway
Auto-instrumentation is a starting point, not a finish line. You must manually instrument your request handler to create a root span that links all child spans together.
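
The startup check from the Production Insight can be generalized so one function guards every telemetry variable. This is a minimal sketch: the variable list matches the ones used in production_setup.py, and the function names are my own, not part of the OpenTelemetry SDK.

```python
import os

# Variables production_setup.py depends on; extend this tuple as your config grows.
REQUIRED_ENV_VARS = ("OTEL_EXPORTER_OTLP_ENDPOINT", "GRAFANA_CLOUD_TOKEN")


def missing_telemetry_config(env=None) -> list:
    """Return the names of required variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV_VARS if not str(env.get(name, "")).strip()]


def require_telemetry_config(env=None) -> None:
    """Fail at startup instead of silently exporting to a default endpoint."""
    missing = missing_telemetry_config(env)
    if missing:
        raise RuntimeError("Missing telemetry config: " + ", ".join(missing))
```

Calling `require_telemetry_config()` before any exporter is constructed turns three days of lost traces into a failed deploy, which is the cheaper failure.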

When NOT to Use OpenTelemetry for LLM Observability

OpenTelemetry is the standard, but it's not always the right choice. Here are three scenarios where you should consider alternatives.

1. You need real-time cost tracking at the sub-second level. OpenTelemetry's batch span processor exports every 5 seconds by default. If you need to enforce a per-user token budget in real time, you need a streaming approach. Consider using a middleware that sends token counts to a Redis counter or a streaming platform like Kafka.

2. You're running LLMs at massive scale (10k+ requests/second). The OpenTelemetry collector can become a bottleneck. We saw the collector's CPU spike to 80% at 5k req/s. Consider sampling: tail-based sampling runs in the Collector (the tail_sampling processor), not in the SDK, and lets you keep only traces with errors or high latency.

3. You need deep prompt-level debugging. OpenTelemetry spans are not designed to store full prompts and responses. If you need to replay a specific conversation for debugging, store the prompts and responses in a separate store (e.g., S3 or a database) and link them to the trace via a trace ID.

For most teams, OpenTelemetry is the right choice. But know its limits.
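
In production the tail-sampling decision from scenario 2 is configured in the Collector, not written in application code. The Python sketch below only illustrates the decision logic: wait until the whole trace is finished, then keep it if it contains an error or ran slowly. `SpanRecord`, `keep_trace`, and the 5-second threshold are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SpanRecord:
    name: str
    duration_ms: float
    error: bool = False


def keep_trace(spans: List[SpanRecord], latency_threshold_ms: float = 5000.0) -> bool:
    """Decide, once every span of a trace has arrived, whether to store the trace."""
    if any(span.error for span in spans):
        return True
    # For synchronous traces the root span encloses its children,
    # so the longest span is the end-to-end latency.
    slowest = max((span.duration_ms for span in spans), default=0.0)
    return slowest > latency_threshold_ms
```

The key property is that the decision sees the complete trace. Head-based sampling decides before the first LLM call returns, so it cannot know the request will end up slow or failed.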

Sampling is your friend at scale
At 5k req/s, storing every span is expensive and slow. Use the OpenTelemetry Collector's tail sampling processor to keep only traces with errors, high latency (>p99), or specific user IDs. You'll save 90% on storage costs.
Production Insight
A fintech startup using LLMs for fraud detection needed to enforce per-user token budgets in real time. They used OpenTelemetry for monitoring but built a separate Redis-based counter for enforcement. The OpenTelemetry traces were used for post-mortem analysis, not real-time decisions.
Key Takeaway
OpenTelemetry is for observability, not enforcement. Use it for debugging and cost analysis, not for real-time budget enforcement.
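
Here is a rough sketch of what the enforcement side can look like, kept separate from tracing. It uses an in-memory dictionary as a stand-in for the Redis counter mentioned above; the class name, the fixed-window scheme, and the limits are all assumptions, not the fintech team's actual implementation.

```python
import time
from collections import defaultdict
from typing import Callable, Dict, Tuple


class TokenBudget:
    """Fixed-window per-user token budget (in-memory stand-in for Redis)."""

    def __init__(self, limit: int, window_seconds: int = 60,
                 clock: Callable[[], float] = time.time) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock
        self._used: Dict[Tuple[str, int], int] = defaultdict(int)

    def _key(self, user_id: str) -> Tuple[str, int]:
        # Bucket by window number, like a Redis key "budget:{user}:{window}" with a TTL.
        return (user_id, int(self.clock() // self.window_seconds))

    def try_consume(self, user_id: str, tokens: int) -> bool:
        """Record usage and return True, or return False if it would exceed the limit."""
        key = self._key(user_id)
        if self._used[key] + tokens > self.limit:
            return False
        self._used[key] += tokens
        return True

    def remaining(self, user_id: str) -> int:
        """Tokens left for this user in the current window."""
        return self.limit - self._used[self._key(user_id)]
```

This version never evicts old windows and is not safe across processes. A shared store with atomic increments and key expiry covers both, which is why the production version lives in Redis rather than in the app.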

Production Patterns & Scale: Cost Attribution and Tenant Isolation

At scale, the biggest challenge is cost attribution. You need to know which team, feature, or user is driving costs. This requires a consistent tagging strategy across all spans.

Pattern 1: Tag everything with tenant_id. If you have multiple customers or internal teams, tag every span with their ID. This lets you answer 'how much did tenant X cost us this month?'

Pattern 2: Use a cost calculation pipeline. Token counts are not enough. You need to multiply by the model's per-token cost. Create a batch job that reads spans from your observability backend, calculates cost per span, and writes it to a cost dashboard.

Pattern 3: Set per-tenant budgets. Use the cost data to enforce budgets. If tenant X exceeds their budget, you can throttle their requests or switch them to a cheaper model.

Pattern 4: Monitor prompt template drift. The same prompt template can produce wildly different token counts after a model update. Track token usage per template version and alert on significant changes.

cost_calculation_job.py (Python)
import pandas as pd

# This is a simplified example. In production, you'd query your observability backend.
# We'll simulate reading spans from a parquet file with the columns
# span_name, model, token_count, and tenant_id.

# Blended cost per token in USD, keyed by model. Illustrative values only:
# real pricing differs for input and output tokens and changes over time.
MODEL_COSTS = {
    'gpt-4': 0.03 / 1000,  # $0.03 per 1K tokens
    'gpt-3.5-turbo': 0.002 / 1000,
    'text-embedding-ada-002': 0.0001 / 1000,
}

def calculate_cost_per_tenant(span_file: str) -> pd.Series:
    df = pd.read_parquet(span_file)
    # Filter to LLM call spans; copy so adding a column doesn't modify a view of df
    llm_spans = df[df['span_name'] == 'llm_call'].copy()
    # Models missing from the table map to NaN so they surface instead of costing $0.
    # Note: the API may report a versioned model name that needs normalizing first.
    per_token = llm_spans['model'].map(MODEL_COSTS)
    unknown = llm_spans.loc[per_token.isna(), 'model'].unique()
    if len(unknown):
        print(f"Warning: no pricing configured for models {list(unknown)}")
    llm_spans['cost'] = llm_spans['token_count'] * per_token
    # Group by tenant_id
    cost_by_tenant = llm_spans.groupby('tenant_id')['cost'].sum()
    print(cost_by_tenant)
    # In production, write this to a database or dashboard
    return cost_by_tenant

if __name__ == '__main__':
    calculate_cost_per_tenant('spans_2025-05-22.parquet')
Model costs change. Keep your cost table up to date.
OpenAI updates pricing periodically. If you hardcode model costs, your cost attribution will be wrong. Fetch pricing from an API or a config file that's updated regularly.
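
One way to follow this advice is to keep prices in a config file and price input and output tokens separately. The sketch below assumes a JSON layout of my own invention, and the model names and prices are placeholders, not real quotes from any provider.

```python
import json
from decimal import Decimal

# Hypothetical pricing file contents (USD per 1M tokens). Placeholder numbers only.
SAMPLE_PRICING_JSON = """
{
  "example-large": {"input_per_million": "30.00", "output_per_million": "60.00"},
  "example-small": {"input_per_million": "0.50", "output_per_million": "1.50"}
}
"""


def load_pricing(raw_json: str) -> dict:
    """Parse the pricing table, keeping prices as Decimal to avoid float drift."""
    table = json.loads(raw_json)
    return {
        model: {key: Decimal(value) for key, value in prices.items()}
        for model, prices in table.items()
    }


def request_cost(pricing: dict, model: str, prompt_tokens: int, completion_tokens: int) -> Decimal:
    """Cost of one call in USD. Unknown models raise instead of silently costing $0."""
    if model not in pricing:
        raise KeyError(f"no pricing configured for model {model!r}")
    prices = pricing[model]
    million = Decimal(1_000_000)
    return (prompt_tokens * prices["input_per_million"]
            + completion_tokens * prices["output_per_million"]) / million
```

Raising on an unknown model is deliberate. A new model that defaults to a price of zero makes your cost dashboard look healthier on exactly the day it should look worse.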
Production Insight
A SaaS company with 50 tenants was surprised to find that one tenant accounted for 60% of their LLM costs. The tenant was using a script that sent the same prompt 1000 times per day. They switched the tenant to a cheaper model and saved $3k/month.
Key Takeaway
Cost attribution requires tagging every span with a tenant ID. Without it, you can't answer the most basic question: who's costing me money?

Common Mistakes with Specific Examples

Here are the three most common mistakes we've seen (and made) in production.

Mistake 1: Stacking your own retry loop on top of the OpenAI client's built-in retries. In the v1 Python SDK, max_retries defaults to 2 and the client already retries 429s with exponential backoff. If you then catch the final RateLimitError and retry in a loop without backoff or a cap, every outer attempt multiplies the inner ones and you create a retry storm. We saw this cause a 60% cost increase.

Mistake 2: Only monitoring error rates. A system can be 'working' (no errors) while burning money. Monitor token rate, cost rate, and latency distribution. Alert on deviations from baseline.

Mistake 3: Not tagging spans with prompt template version. When you update a prompt template, you need to know if the new version is more expensive or slower. Without template version tags, you can't attribute a cost spike to a prompt change.

Know your client's retry defaults
OpenAI's v1 Python client retries failed requests twice by default (max_retries=2) with exponential backoff. Set max_retries explicitly in production and pick one retry layer: either the SDK's or your own wrapper, never both unbounded.
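
If you do own the retry loop, this is roughly the shape it should have: a hard cap on attempts and a delay that grows with each attempt. The sketch is generic, with `call_with_backoff` and its parameters being my own names; `TimeoutError` in the usage example merely stands in for a rate limit error.

```python
import random
import time
from typing import Callable, Tuple, Type


def call_with_backoff(
    fn: Callable[[], object],
    retryable: Tuple[Type[BaseException], ...],
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
):
    """Call fn, retrying retryable errors with capped exponential backoff and jitter.

    Returns (result, retry_count) so the caller can record retry_count on the span.
    """
    attempt = 0
    while True:
        try:
            return fn(), attempt
        except retryable:
            if attempt >= max_retries:
                raise  # Hard stop: never retry unboundedly.
            # The delay ceiling doubles with each attempt and is never reset in the loop.
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, delay))  # Full jitter spreads out retry bursts.
            attempt += 1


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky() -> str:
        state["calls"] += 1
        if state["calls"] < 3:
            raise TimeoutError("simulated rate limit")
        return "ok"

    print(call_with_backoff(flaky, (TimeoutError,), base_delay=0.01))
```

If you wrap SDK calls with something like this, consider constructing the client with max_retries=0 so the two layers do not multiply each other.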
Production Insight
A team deployed a new prompt template that doubled the token count per request. They didn't tag spans with template version, so they couldn't trace the cost spike to the new template. It took 2 weeks to find the cause.
Key Takeaway
Tag every span with prompt template version. Monitor token count per template. Alert on significant changes.
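
A drift check on template versions can be as small as the sketch below. It assumes each exported span is a dict carrying `prompt_template`, `template_version`, and `token_count`; `template_version` is a hypothetical attribute name, and the 20% threshold is an arbitrary starting point.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def mean_tokens_by_version(spans: Iterable[dict]) -> Dict[Tuple[str, str], float]:
    """Average token_count per (prompt_template, template_version)."""
    grouped: Dict[Tuple[str, str], List[int]] = defaultdict(list)
    for span in spans:
        grouped[(span["prompt_template"], span["template_version"])].append(span["token_count"])
    return {key: mean(values) for key, values in grouped.items()}


def drifted_templates(spans: Iterable[dict], baseline_version: str, candidate_version: str,
                      threshold: float = 0.20) -> Dict[str, float]:
    """Templates whose mean token count changed by more than threshold between versions."""
    means = mean_tokens_by_version(spans)
    drift = {}
    for (template, version), candidate_mean in means.items():
        if version != candidate_version:
            continue
        baseline_mean = means.get((template, baseline_version))
        if not baseline_mean:
            continue  # No baseline to compare against.
        change = (candidate_mean - baseline_mean) / baseline_mean
        if abs(change) > threshold:
            drift[template] = change
    return drift
```

A result of 1.0 for a template means the new version uses twice the tokens of the old one, which is the exact situation that took the team above two weeks to find.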

Comparison vs Alternatives: OpenTelemetry vs Datadog LLM Observability

The main alternatives are OpenTelemetry (with OpenLIT) and Datadog's LLM Observability SDK. Here's a production comparison.

OpenTelemetry + OpenLIT:
  • Open source, vendor-agnostic. You can use any backend (Grafana, Jaeger, Prometheus).
  • Auto-instruments OpenAI, LangChain, Anthropic, and more.
  • Requires manual setup of the collector and exporters.
  • No built-in cost calculation or alerting. You build those yourself.

Datadog LLM Observability:
  • Proprietary, vendor-locked to Datadog.
  • Provides a higher-level SDK that includes cost tracking, evaluations, and a pre-built dashboard.
  • Easier to set up: just install the SDK and set environment variables.
  • More expensive at scale: Datadog's pricing is per-host + per-span.

When to choose OpenTelemetry: You want vendor independence, you already have a Grafana/Prometheus stack, or you need to customize your observability pipeline.

When to choose Datadog: You're already on Datadog, you need the higher-level features (evaluations, cost tracking), and you don't mind the vendor lock-in.

Vendor lock-in is real
Datadog's LLM Observability SDK exports data in a proprietary format. If you decide to switch vendors later, you'll need to re-instrument your entire application. OpenTelemetry gives you portability.
Production Insight
A startup chose Datadog for its ease of setup. Six months later, they wanted to switch to Grafana Cloud to reduce costs. They had to re-instrument their entire LLM pipeline, which took 2 weeks. OpenTelemetry would have made the switch a config change.
Key Takeaway
Choose OpenTelemetry if you value vendor independence. Choose Datadog if you need the higher-level features and are already on the platform.

Debugging and Monitoring: The 3am Playbook

When you get paged at 3am because costs are spiking or latency is high, you need a systematic approach.

Step 1: Check the token rate. If costs are spiking but traffic is normal, look at token count per request. A sudden increase in token count per request indicates a prompt change or a retry storm.

Step 2: Check the retry rate. If p99 latency is high but p50 is normal, you likely have a retry storm. Check the retry_count span attribute.

Step 3: Check the model distribution. Did a recent deployment change the default model from gpt-3.5-turbo to gpt-4? That would explain a cost spike.

Step 4: Check the error rate by model. Some models have higher error rates. If you switched to a new model, it might be returning more errors, causing retries.

Step 5: Check the prompt template distribution. Did a new prompt template get deployed? It might be generating more tokens than expected.

debug_playbook.py (Python)
import pandas as pd
from datetime import datetime, timedelta, timezone

# Simulate querying your observability backend
# In production, you'd use the Grafana API or Prometheus

def debug_cost_spike(start_time: datetime, end_time: datetime):
    # Load exported spans, then filter to the time range
    df = pd.read_parquet('spans_latest.parquet')
    df = df[(df['timestamp'] >= start_time) & (df['timestamp'] <= end_time)]
    if df.empty:
        print("No spans in the selected window")
        return

    print("=== Step 1: Token rate ===")
    token_rate = df['token_count'].sum() / (end_time - start_time).total_seconds()
    print(f"Token rate: {token_rate:.2f} tokens/second")
    # Tokens per request is the number that separates a leak from a traffic increase
    print(f"Mean tokens per span: {df['token_count'].mean():.1f}")

    print("\n=== Step 2: Retry rate ===")
    retry_spans = df[df['retry_count'] > 0]
    print(f"Spans with retries: {len(retry_spans)} of {len(df)}")

    print("\n=== Step 3: Model distribution ===")
    print(df['model'].value_counts())

    print("\n=== Step 4: Error rate by model ===")
    # The error attribute is only set on failures, so missing values count as no error
    is_error = df['error'].eq(True)
    print(is_error.groupby(df['model']).mean())

    print("\n=== Step 5: Prompt template distribution ===")
    print(df['prompt_template'].value_counts())

if __name__ == '__main__':
    # Naive UTC timestamp, assuming the parquet timestamps are stored as tz-naive UTC
    now = datetime.now(timezone.utc).replace(tzinfo=None)
    debug_cost_spike(now - timedelta(hours=1), now)
Automate this playbook as a runbook
Write a script that runs these queries and outputs the results to a Slack channel. When you get paged, you'll have the answers before you even open your laptop.
Production Insight
We automated this playbook as a Python script that runs every 5 minutes and posts to a Slack channel. When the cost spike incident happened, the script caught it 2 minutes after the first retry storm started. The on-call engineer had the root cause (retry loop) before the page even went out.
Key Takeaway
Automate your debugging playbook. The faster you can answer 'what changed?', the faster you can fix it.
● Production incident · POST-MORTEM · severity: high

The Silent $4k/month Token Leak

Symptom
OpenAI API cost dashboard showed a 60% cost increase week-over-week. No change in request count or user base. P99 latency jumped from 2s to 8s. No error rate increase.
Assumption
The team assumed the cost spike was from a new feature rollout that increased prompt complexity. They had not instrumented per-request token counts.
Root cause
A backoff variable in the retry logic was resetting to 0 on every retry instead of incrementing, causing the retry loop to fire 10+ times per request when OpenAI returned a 429 rate limit error. Each retry re-sent the full prompt, including the conversation history, which grew unboundedly.
Fix
1. Added openai.InternalServerError and openai.RateLimitError to a retryable status list with exponential backoff. 2. Set max_retries=3 on the OpenAI client. 3. Added a token_count metric to every span. 4. Created a Grafana alert on sum(rate(token_count_total[5m])) > 1e6. 5. Reviewed all retry loops in the codebase. Found and fixed 2 more with the same bug.
Key lesson
  • Tag every LLM span with token_count, model, and prompt_template. Cost attribution is impossible without it.
  • Set a hard limit on retries. Exponential backoff is not optional — your retry loop can become a DoS attack on your own wallet.
  • Alert on token rate, not just error rate. A system can be 'working' while burning money.
Production debug guide: When your dashboard shows no errors but costs are exploding. (4 entries)
Symptom · 01
Cost spike with no traffic change
Fix
Run kubectl logs -l app=llm-service --tail=100 | grep 'token_count' | awk '{print $NF}' | sort | uniq -c | sort -rn to see per-request token distribution. Look for outliers > 10x median.
Symptom · 02
High p99 latency, normal p50
Fix
Check if retries are the cause. Add retry_count to your span attributes. Query: sum by (retry_count) (rate(span_count_total[5m])). If retry_count > 3 is non-zero, you have a retry storm.
Symptom · 03
OpenAI returns 429 but you see no errors in your app
Fix
The OpenAI SDK retries 429s automatically (twice by default), and a wrapper may retry again on top, so rate limiting can stay invisible to your app. Enable debug logging: export OPENAI_LOG=debug. Check for RateLimitError and retry messages in logs.
Symptom · 04
Latency spikes only for certain users or prompt templates
Fix
Tag spans with user_id and prompt_template. Query: histogram_quantile(0.99, sum by (le, user_id) (rate(llm_latency_seconds_bucket[5m]))). Find the user with the highest p99.
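
When you only have exported span rows and no Prometheus histogram, the same per-user p99 can be computed offline. This is a small sketch assuming each span is a dict with `user_id` and `duration_ms`; the function names are illustrative.

```python
from collections import defaultdict
from statistics import quantiles
from typing import Dict, Iterable, Optional


def p99_by_key(spans: Iterable[dict], key: str = "user_id", min_samples: int = 2) -> Dict[str, float]:
    """p99 of duration_ms per distinct value of key (for example user_id)."""
    grouped = defaultdict(list)
    for span in spans:
        grouped[span[key]].append(span["duration_ms"])
    result = {}
    for value, durations in grouped.items():
        if len(durations) >= min_samples:
            # quantiles(n=100) returns 99 cut points; the last one is p99.
            result[value] = quantiles(durations, n=100, method="inclusive")[-1]
    return result


def worst_offender(spans: Iterable[dict], key: str = "user_id") -> Optional[str]:
    """The key value with the highest p99 latency, or None without enough data."""
    ranked = p99_by_key(spans, key)
    return max(ranked, key=ranked.get) if ranked else None
```

Passing `key="prompt_template"` answers the other half of the symptom: whether the spike follows a template rather than a user.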
★ LLM Observability Tools Triage Cheat Sheet: Copy-paste diagnostics. When it's 2am and you need answers fast.
Costs rising, no traffic change
Immediate action
Check per-request token count distribution
Commands
python -c "import pandas as pd; df = pd.read_parquet('spans.parquet'); print(df['token_count'].describe())"
kubectl logs -l app=llm-service --tail=500 | grep 'token_count' | awk '{print $NF}' | sort -n | tail -20
Fix now
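Cap retries at one layer: set max_retries explicitly on your OpenAI client (for example max_retries=3, which uses the SDK's built-in exponential backoff) and remove any hand-rolled retry loop stacked on top of it.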
High p99 latency, normal p50
Immediate action
Check retry count distribution
Commands
python -c "import pandas as pd; df = pd.read_parquet('spans.parquet'); print(df.groupby('retry_count')['duration_ms'].mean())"
curl -s http://localhost:9464/metrics | grep 'llm_retry_count'
Fix now
Set max_retries=3 and add retry_count span attribute. Alert on rate(llm_retry_count_total[5m]) > 0.1.
No errors but slow responses
Immediate action
Check if prompt serialization is the bottleneck
Commands
python -c "import pandas as pd; df = pd.read_parquet('spans.parquet'); print(df[df['span_type']=='serialization']['duration_ms'].describe())"
kubectl top pods -l app=llm-service | sort -k3 -n
Fix now
Profile your format_prompt function. If it's doing JSON serialization of a large context, cache the serialized prompt.
OpenAI 429 errors, app seems fine
Immediate action
Check if retries are being swallowed
Commands
kubectl logs -l app=llm-service --tail=200 | grep -E '(429|RateLimitError)' | wc -l
curl -s http://localhost:9464/metrics | grep 'openai_requests_total'
Fix now
Enable debug logging: export OPENAI_LOG=debug. Add a custom metric for 429 responses.
OpenTelemetry vs Datadog LLM Observability
| Concern | OpenTelemetry + OpenLIT | Datadog LLM Observability | Recommendation |
| --- | --- | --- | --- |
| Cost | Free (self-hosted OpenLIT) + infrastructure costs | Per host + per million spans ingested (~$0.10/span) | Start with OpenTelemetry for cost control |
| Setup time | 2-3 days to instrument and build dashboards | 1 day with auto-instrumentation agent | Datadog for speed, OTel for flexibility |
| LLM-specific features | Basic token counting, cost attribution via custom dashboards | Pre-built cost analytics, guardrails, prompt monitoring | Datadog if you need out-of-box LLM insights |
| Vendor lock-in | None: data can be exported to any backend | High: data format is proprietary | OpenTelemetry for long-term portability |
| Scalability | Requires tuning batch exporters and sampling | Handles millions of spans/minute automatically | Datadog for high-volume production |

Key takeaways

1. Instrument every LLM call with OpenTelemetry spans that capture prompt, completion, token counts, model, and tenant ID. Missing any one field makes cost attribution impossible.
2. Never rely on client-side token counting; always use the model's returned usage field from the API response to avoid double-counting on retries or streaming.
3. Tune the batch span processor for your load. It exports every 5 seconds by default and drops spans once its queue fills, so size the queue and batch settings for peak traffic.
4. Tag every span with tenant_id and user_id at creation time, not as an afterthought. Retroactive tagging requires replaying logs and is a 3am nightmare.
5. Set up a cost-per-request alert that triggers when average cost per request deviates by >20% from baseline. That's how we caught the leak at 3am.

Common mistakes to avoid


Double-counting tokens on retries

Symptom
Monthly LLM bill is 2x expected; per-request cost spikes correlate with error rates.
Fix
Add a unique request_id to each LLM call and deduplicate spans in the exporter. In OpenTelemetry, set span.set_attribute('llm.request_id', uuid) and filter duplicates in the batch processor.

Missing tenant isolation in spans

Symptom
Cannot attribute costs to specific customers; all traces show tenant_id=unknown.
Fix
Inject tenant context via OpenTelemetry's Context propagation. In your middleware, call span.set_attribute('tenant_id', ctx.tenant_id) before any LLM call — never after.
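
A lightweight way to guarantee the tenant is set before any LLM call is to hold it in a request-scoped context variable and build span attributes from it. The sketch uses Python's standard `contextvars` as a stand-in; `tenant_context` and `span_attributes` are hypothetical helpers, and in a real service you would pass the resulting dict to the span's `set_attributes` call (OpenTelemetry's Baggage API plays the same role across service boundaries).

```python
import contextvars
from contextlib import contextmanager

# Request-scoped tenant ID; "unknown" makes missing context visible in dashboards.
_tenant_id = contextvars.ContextVar("tenant_id", default="unknown")


@contextmanager
def tenant_context(tenant_id: str):
    """Set the tenant for the duration of one request, then restore the previous value."""
    token = _tenant_id.set(tenant_id)
    try:
        yield
    finally:
        _tenant_id.reset(token)


def span_attributes(**extra) -> dict:
    """Attributes every LLM span should carry, with tenant_id filled in automatically."""
    return {"tenant_id": _tenant_id.get(), **extra}
```

Because the middleware enters `tenant_context` once per request, no individual call site can forget the tag. Context variables are also isolated per asyncio task, so concurrent requests do not leak tenants into each other.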

Using client-side token estimation

Symptom
Token counts in traces don't match the model's actual billing; off by 10-30%.
Fix
Always extract response.usage from the API response object (e.g., response.usage.total_tokens for OpenAI). Never use tiktoken or other estimators for billing.

Not setting up cost alerts on deviation

Symptom
Token leak runs for weeks before someone notices the bill spike.
Fix
In your monitoring tool (e.g., Grafana), create an alert: avg(rate(llm_cost_per_request[5m])) > 1.2 * avg(rate(llm_cost_per_request[1h])) — triggers when per-request cost jumps 20%.
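
The same deviation rule can be prototyped in a few lines before you commit it to an alerting backend. This sketch compares a short recent window against a longer baseline; the class name, window sizes, and the 1.2 ratio are assumptions chosen to mirror the 20% rule above.

```python
from collections import deque
from statistics import mean


class CostDeviationAlert:
    """Flag when the recent mean cost per request exceeds the baseline mean by a ratio."""

    def __init__(self, baseline_size: int = 1000, recent_size: int = 50, ratio: float = 1.2) -> None:
        self.baseline = deque(maxlen=baseline_size)
        self.recent = deque(maxlen=recent_size)
        self.ratio = ratio

    def observe(self, cost_usd: float) -> bool:
        """Record one request's cost; return True if the alert condition holds."""
        # A value about to leave the recent window graduates into the baseline.
        if len(self.recent) == self.recent.maxlen:
            self.baseline.append(self.recent[0])
        self.recent.append(cost_usd)
        if len(self.baseline) < self.recent.maxlen:
            return False  # Not enough history to judge yet.
        return mean(self.recent) > self.ratio * mean(self.baseline)
```

One caveat of any rolling baseline: a leak that persists eventually becomes the baseline and the alert clears itself, so pair it with an absolute ceiling on cost per request.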
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR
How would you design an observability system to detect token leaks in an...
Q02SENIOR
Explain how OpenTelemetry's span context propagation works in a microser...
Q03SENIOR
What are the trade-offs between using OpenTelemetry vs a vendor-specific...
Q04SENIOR
How would you handle token counting for streaming LLM responses in obser...
Q05SENIOR
Describe a scenario where LLM observability could cause a production inc...
Q01 of 05SENIOR

How would you design an observability system to detect token leaks in an LLM-powered application?

ANSWER
Start by instrumenting every LLM call with OpenTelemetry spans that capture: model name, prompt tokens, completion tokens, total tokens (from API response), latency, and tenant ID. Export spans to a time-series database. Build a dashboard showing cost per request over time. Set an alert on the ratio of total tokens to successful requests — if it spikes, you have a leak (e.g., retries, infinite loops, or prompt caching issues).
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
How do I trace LLM calls with OpenTelemetry without modifying every code path?
02
What's the difference between OpenTelemetry and Datadog LLM Observability?
03
How do I attribute LLM costs to specific tenants in production?
04
Why are my token counts wrong in traces?
05
Can I use OpenTelemetry for LLM observability without a backend?