Jaeger Missing Spans — Async Context Propagation Fix
- A trace tracks one request across multiple services; a span is a single operation within a service.
- Jaeger is an open-source CNCF-graduated tracing backend that stores, indexes, and visualises traces.
- Instrument code with OpenTelemetry, export via OTLP (gRPC on port 4317, HTTP on port 4318), and view traces in the Jaeger UI.
- Context propagation via W3C traceparent header is critical for cross-service visibility.
- Sampling (head-based or tail-based) trades accuracy for storage cost — always sample errors 100%.
Quick Trace Debug Cheat Sheet
No traces in Jaeger UI
kubectl logs -l app=order-service --tail=20 | grep -i otlp
docker logs jaeger 2>&1 | grep -i error
Spans missing from a trace
curl -H 'traceparent: 00-<trace_id>-<span_id>-01' http://target-service/health
Inspect application logs around the request time for exporter errors
Distorted span timings (negative or huge values)
ntpq -p | grep -E '^(##|*)'
timedatectl show --property=NTP --value
Sampling rate too aggressive (traces missing)
curl http://jaeger-collector:5778/sampling?service=order-service
Look for 'probabilistic_sampling: { sampling_rate: 0.01 }'
Context not propagated to downstream service
curl -v http://order-service/orders/1 2>&1 | grep -i trace
Check if the downstream service receives traceparent header in its logs
Production Incident
aiokafka library without an integration layer, so no traceparent header was passed.
opentelemetry.propagate.inject() when producing and extract() when consuming.
Production Debug Guide
Systematic checks to find why your distributed trace is broken
ntpq -p on all nodes to check clock synchronisation. Spans with timestamps from different hosts can be misordered if clocks drift more than 100 ms.
Instrumenting a FastAPI App with OpenTelemetry
# pip install opentelemetry-distro opentelemetry-exporter-otlp
# pip install opentelemetry-instrumentation-fastapi
from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

# Configure tracer. service.name is the name Jaeger lists in its service
# dropdown; without it the SDK reports an "unknown_service" name.
provider = TracerProvider(resource=Resource.create({'service.name': 'order-service'}))
exporter = OTLPSpanExporter(endpoint='http://jaeger:4317')  # Jaeger OTLP gRPC endpoint
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

app = FastAPI()
FastAPIInstrumentor.instrument_app(app)  # auto-instruments all routes

tracer = trace.get_tracer(__name__)

# db and user_service are application objects assumed to be defined elsewhere.
@app.get('/orders/{order_id}')
async def get_order(order_id: int):
    with tracer.start_as_current_span('fetch-order') as span:
        span.set_attribute('order.id', order_id)

        # Manual span for a specific operation
        with tracer.start_as_current_span('db-query'):
            order = await db.get_order(order_id)

        with tracer.start_as_current_span('enrich-order'):
            user = await user_service.get_user(order.user_id)  # cross-service call

        return {'order': order}
Running Jaeger with Docker
# Run Jaeger all-in-one (development setup)
docker run -d \
  --name jaeger \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  jaegertracing/all-in-one:latest

# Ports:
# 16686: Jaeger UI
# 4317: OTLP gRPC receiver
# 4318: OTLP HTTP receiver

# Open Jaeger UI: http://localhost:16686
# Search by service name → see all traces
# Click a trace → see full span timeline
# Click a span → see attributes, events, errors
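Before blaming the exporter, it can help to confirm that the published ports actually accept connections. The following is a minimal sketch using only the standard library; the host name localhost is an assumption (inside a container network it would more likely be jaeger), and port_open is a hypothetical helper, not part of any Jaeger or OpenTelemetry tooling.

```python
import socket


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and DNS failures.
        return False


if __name__ == "__main__":
    # Ports taken from the docker run command; the host is an assumption.
    for name, port in [("Jaeger UI", 16686), ("OTLP gRPC", 4317), ("OTLP HTTP", 4318)]:
        status = "reachable" if port_open("localhost", port) else "NOT reachable"
        print(f"{name:10} :{port} {status}")
```

A reachable port only proves that something is listening; it does not prove that spans are accepted, so exporter logs remain the next place to look.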
Understanding Spans, Traces, and Context Propagation
A span is a named, timed operation that carries a span ID, trace ID, parent span ID, and attributes. The entire set of spans linked by a common trace ID forms a trace. Context propagation is what connects spans across service boundaries — it passes the trace ID and parent span ID via HTTP headers (W3C Trace Context: traceparent and tracestate). Without propagation, each service creates a separate trace, and you lose the end-to-end view. Propagation is automatic when using OpenTelemetry instrumentation libraries (they inject headers on outgoing requests). If you use raw HTTP clients or message queues, you must manually inject and extract the context.
import requests
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer(__name__)

# Sending service
with tracer.start_as_current_span('outgoing-call') as span:
    headers = {}
    inject(headers)  # injects traceparent from current span
    response = requests.get('http://payment-service/process', headers=headers)

# Receiving service
# In middleware or at route entry, extract context.
# `request` is the web framework's incoming request object.
ctx = extract(request.headers)
with tracer.start_as_current_span('process-payment', context=ctx) as span:
    # this span is now child of the sending service's span
    pass
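To make the wire format concrete, here is a standard-library sketch of what ends up in the traceparent header and how it could ride along in message-queue headers. It is an illustration only, not a replacement for opentelemetry.propagate: it handles version 00 of the header and ignores tracestate. The function names are hypothetical, and the list of (str, bytes) tuples is an assumption about how Kafka-style clients represent headers.

```python
import re

# W3C Trace Context, version 00:
#   00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>
_TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def format_traceparent(trace_id: int, span_id: int, sampled: bool = True) -> str:
    """Render a version-00 traceparent value from integer IDs."""
    flags = 1 if sampled else 0
    return f"00-{trace_id:032x}-{span_id:016x}-{flags:02x}"


def parse_traceparent(value: str):
    """Return (trace_id, span_id, sampled), or None if the value is invalid."""
    match = _TRACEPARENT_RE.match(value.strip())
    if match is None:
        return None
    trace_hex, span_hex, flags_hex = match.groups()
    trace_id, span_id = int(trace_hex, 16), int(span_hex, 16)
    if trace_id == 0 or span_id == 0:
        return None  # all-zero IDs are invalid
    return trace_id, span_id, bool(int(flags_hex, 16) & 0x01)


def to_message_headers(traceparent: str):
    """Producer side: wrap the value as Kafka-style (key, bytes) headers."""
    return [("traceparent", traceparent.encode("ascii"))]


def from_message_headers(headers):
    """Consumer side: find and parse traceparent, or return None."""
    for key, value in headers or []:
        if key == "traceparent":
            return parse_traceparent(value.decode("ascii"))
    return None


tp = format_traceparent(0x4BF92F3577B34DA6A3CE929D0E0E4736, 0x00F067AA0BA902B7)
print(tp)
print(from_message_headers(to_message_headers(tp)))
```

In real code, inject() and extract() do this work against whatever carrier you hand them; the sketch only shows why a consumer that never reads the header has no way to join the producer's trace.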
requests or httpx without integration), you must inject and extract headers manually. The W3C traceparent header is the standard; don't invent your own.
Sampling Strategies in Production
Recording every trace at production scale is expensive, both in storage and network bandwidth. Sampling decides which traces to keep. Head-based sampling makes the decision at the start of a request (e.g., keep 1% of all traces). It's simple but can miss rare high-latency events. Tail-based sampling buffers traces and decides after they complete, keeping those that exceed a latency threshold or contain errors. Jaeger's SDK-side and remote sampling strategies are head-based; tail-based sampling is usually done in an OpenTelemetry Collector (its tail sampling processor) placed in front of Jaeger. A common hybrid approach: sample 1–5% of normal requests and always sample requests with HTTP 5xx or custom error attributes.
Setting the right sampling rate is a trade-off. Too low (0.1%) and you'll miss most issues; too high (100%) for high-throughput services will overwhelm storage. Start at 1% and adjust based on storage budget and trace usefulness.
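The two decision points can be sketched in a few lines. This is a simplified illustration, not Jaeger's or OpenTelemetry's implementation: head_sample derives its decision from the trace ID alone (similar in spirit to a trace-ID-ratio sampler, so every service reaches the same verdict), and tail_keep is a hypothetical policy with an assumed 500 ms latency threshold.

```python
def head_sample(trace_id: int, ratio: float) -> bool:
    """Head-based: decide from the trace ID alone, before any work is done."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    bound = int(ratio * (1 << 64))
    # Compare the low 64 bits of the ID against the bound. The same ID always
    # gives the same answer, so all services agree on the decision.
    return (trace_id & ((1 << 64) - 1)) < bound


def tail_keep(trace_id: int, duration_ms: float, has_error: bool,
              ratio: float = 0.01, slow_ms: float = 500.0) -> bool:
    """Tail-based: decide after the trace completes.

    Always keep errors and slow traces; sample the rest at `ratio`.
    The 500 ms threshold and 1% ratio are illustrative assumptions.
    """
    if has_error or duration_ms >= slow_ms:
        return True
    return head_sample(trace_id, ratio)


print(head_sample(5, 0.01))                    # low ID falls under the bound
print(tail_keep((1 << 64) - 1, 10.0, False))   # fast, no error, not sampled
print(tail_keep((1 << 64) - 1, 10.0, True))    # error, always kept
```

The tail-based function needs duration and error status, which is exactly why it requires buffering complete traces somewhere before the keep/drop decision.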
# Remote sampling configuration for Jaeger
service_config:
  - service: "order-service"
    operation: "/orders/{id}"
    probabilistic_sampling:
      sampling_rate: 0.01   # 1% sample
  - service: "order-service"
    operation: "*"
    probabilistic_sampling:
      sampling_rate: 0.005  # 0.5% for other ops
  - service: "payment-service"
    operation: "*"
    probabilistic_sampling:
      sampling_rate: 0.05   # 5% for payment (higher risk)
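The configuration above reads best as a conceptual layout. To my knowledge, Jaeger's file-based sampling strategies are written as JSON with service_strategies, operation_strategies, and default_strategy keys, so the same rates could be generated roughly as follows. Treat the exact schema as an assumption to verify against the Jaeger documentation for your version; build_strategies is a hypothetical helper and the 1% default rate is taken from the "start at 1%" guidance.

```python
import json


def build_strategies(default_rate, services):
    """Build a strategies document.

    services maps a service name to {"rate": float, "operations": {op: float}},
    where "operations" is optional.
    """
    service_strategies = []
    for name, cfg in services.items():
        entry = {"service": name, "type": "probabilistic", "param": cfg["rate"]}
        operations = cfg.get("operations", {})
        if operations:
            entry["operation_strategies"] = [
                {"operation": op, "type": "probabilistic", "param": rate}
                for op, rate in operations.items()
            ]
        service_strategies.append(entry)
    return {
        "service_strategies": service_strategies,
        "default_strategy": {"type": "probabilistic", "param": default_rate},
    }


strategies = build_strategies(
    default_rate=0.01,
    services={
        "order-service": {"rate": 0.005, "operations": {"/orders/{id}": 0.01}},
        "payment-service": {"rate": 0.05},
    },
)
print(json.dumps(strategies, indent=2))
```

Generating the file from one Python structure keeps per-service rates reviewable in code review instead of being hand-edited JSON.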
Troubleshooting Missing Spans and Broken Traces
When a trace doesn't appear in Jaeger UI, or appears incomplete, the root cause is almost always one of: (1) context not propagated, (2) spans not exported, (3) sampling dropped the trace, (4) clock skew between service hosts. Use the following systematic checks. First, confirm you're hitting the Jaeger endpoint by looking at application logs for OTLP export errors. Second, check the trace ID uniformity — if each service generates its own trace ID, propagation is missing. Third, verify that the "Trace" view shows all expected spans — missing spans may indicate a failing exporter or network issue. Fourth, if spans from different services appear with wrong timing, check NTP synchronisation: Jaeger relies on span timestamps for ordering.
- Each service drops its breadcrumb (span) and passes the trace ID onward.
- If the breadcrumb is missing (span not created), the chain breaks.
- If the trace ID is not passed (propagation failure), the chain splits into separate chains.
- Your debugging goal: find the first service where the breadcrumb pattern changes.
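The breadcrumb checks can be automated once spans for a single request are exported as plain records. The sketch below assumes a hypothetical record shape (service, trace_id, span_id, parent_id, start_us) rather than any Jaeger API response format, and it assumes all spans passed in belong to one logical request.

```python
def diagnose(spans):
    """Return a list of (finding, service, span_id) tuples.

    spans: iterable of dicts with keys service, trace_id, span_id,
    parent_id (None for a root span) and start_us (start time, microseconds).
    """
    spans = list(spans)
    by_id = {(s["trace_id"], s["span_id"]): s for s in spans}
    findings = []
    roots = []
    for s in spans:
        if s["parent_id"] is None:
            roots.append(s)
            continue
        parent = by_id.get((s["trace_id"], s["parent_id"]))
        if parent is None:
            # The parent span was never created or never exported.
            findings.append(("missing-parent", s["service"], s["span_id"]))
        elif s["start_us"] < parent["start_us"]:
            # A child cannot start before its parent: suspect clock skew.
            findings.append(("starts-before-parent", s["service"], s["span_id"]))
    # One request should have one root. Any later root means a service
    # started a fresh trace instead of continuing the incoming one.
    roots.sort(key=lambda s: s["start_us"])
    for extra in roots[1:]:
        findings.append(("new-trace-started", extra["service"], extra["span_id"]))
    return findings


sample = [
    {"service": "gateway", "trace_id": "t1", "span_id": "a", "parent_id": None, "start_us": 0},
    {"service": "order", "trace_id": "t1", "span_id": "b", "parent_id": "a", "start_us": 100},
    {"service": "order", "trace_id": "t1", "span_id": "c", "parent_id": "x", "start_us": 150},
    {"service": "payment", "trace_id": "t2", "span_id": "d", "parent_id": None, "start_us": 200},
    {"service": "db", "trace_id": "t1", "span_id": "e", "parent_id": "b", "start_us": 50},
]
findings = diagnose(sample)
for finding in findings:
    print(finding)
```

Each finding maps to one of the root causes listed earlier: a missing parent points at export or instrumentation gaps, a new trace points at propagation, and a child starting before its parent points at clock skew.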
Integrating Traces with Logs and Metrics
Distributed tracing alone doesn't replace logs or metrics; it complements them. The true power emerges when you correlate trace IDs with log entries and metric events. OpenTelemetry enables this by injecting trace_id and span_id into log records (MDC in Java, structlog in Python). On the metrics side, avoid putting trace IDs in Prometheus labels, since every unique ID creates a new time series; Prometheus exemplars are the mechanism for attaching a sample trace ID to a metric so that an alert can link to a representative trace. Jaeger's UI allows you to drill from a trace to related logs if you configure the log integration.
A common production pattern: when a latency alert fires, grab the trace ID from the affected request, open Jaeger to see the breakdown, then jump to the logs from that span ID to inspect the exact error message.
import structlog
from opentelemetry import trace

span = trace.get_current_span()
trace_id = format(span.get_span_context().trace_id, '032x')
span_id = format(span.get_span_context().span_id, '016x')

# Inject trace context into log
logger = structlog.get_logger()
logger.info("payment processed", trace_id=trace_id, span_id=span_id, order_id=123)
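The "jump to the logs" step can be as simple as filtering structured log lines by the IDs copied from Jaeger. The sketch below assumes logs are rendered as one JSON object per line with trace_id and span_id fields, matching the field names used in the logging call; logs_for_trace is a hypothetical helper.

```python
import json


def logs_for_trace(lines, trace_id, span_id=None):
    """Yield parsed log records matching trace_id (and span_id, if given)."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(record, dict):
            continue
        if record.get("trace_id") != trace_id:
            continue
        if span_id is not None and record.get("span_id") != span_id:
            continue
        yield record


sample_lines = [
    '{"event": "payment processed", "trace_id": "abc", "span_id": "01", "order_id": 123}',
    '{"event": "cache miss", "trace_id": "def", "span_id": "02"}',
    'plain text line without structure',
    '{"event": "card declined", "trace_id": "abc", "span_id": "03"}',
]
matches = list(logs_for_trace(sample_lines, "abc"))
for record in matches:
    print(record["event"])
```

In production the same filter is usually a query in the log backend; the point is that it only works if every log line carries the trace ID in a consistent field.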
🎯 Key Takeaways
- A trace = complete request journey across services. A span = one operation within a service.
- OpenTelemetry is the vendor-neutral instrumentation API — use it to avoid lock-in.
- FastAPIInstrumentor auto-instruments all routes — you only need manual spans for important sub-operations.
- Trace context (trace ID, span ID) propagates via HTTP headers (traceparent) between services.
- Use span attributes to add business context: order.id, user.id — makes filtering useful.
- Sampling is a storage vs accuracy trade-off: always sample errors 100%, tune per-operation rates.
- Clock skew breaks trace timelines — NTP synchronisation is mandatory in distributed systems.
- Correlate trace IDs with logs and metrics for full observability — or you're still flying blind.
⚠ Common Mistakes to Avoid
Interview Questions on This Topic
- Q: Explain how distributed tracing works at the protocol level. How does a span get linked to its parent across service boundaries? (Senior)
- Q: Compare head-based and tail-based sampling. When would you use each? (Senior)
- Q: What is the role of the OpenTelemetry Collector? How does it differ from sending spans directly to Jaeger? (Mid-level)
- Q: How would you debug a distributed trace that appears incomplete in Jaeger UI? (Senior)
Frequently Asked Questions
What is the difference between distributed tracing, logging, and metrics?
Logs are time-stamped text events from a single service. Metrics are aggregated numerical measurements (request rate, error rate, latency percentiles). Distributed traces show the causal chain of events across services for a single request. Observability requires all three: metrics to know something is wrong, logs to see what happened, traces to find where.
What is sampling in distributed tracing?
Recording every trace at high traffic volumes is expensive, so sampling records only a fraction of traces. Head-based sampling decides at the start of a request; it is simple but can miss tail latency. Tail-based sampling decides after the trace completes, keeping slow or error traces; it is more accurate but requires buffering. Jaeger's own sampling strategies are head-based, and tail-based sampling is usually done in an OpenTelemetry Collector placed in front of Jaeger. Common approach: sample 1-5% of normal traces, always sample errors.
Can I use Jaeger without OpenTelemetry?
Yes, Jaeger historically shipped its own SDKs (the Jaeger client libraries), but they have been deprecated in favour of OpenTelemetry. OpenTelemetry is the industry standard and recommended because it allows switching backends (e.g., to Zipkin or Datadog) without changing instrumentation. With Jaeger clients you're locked in.
How do I persist Jaeger traces?
Jaeger supports multiple storage backends: Elasticsearch, Cassandra, and Kafka (as intermediate). In production, set SPAN_STORAGE_TYPE=elasticsearch and configure ES connection. The all-in-one image uses in-memory storage — data is lost on restart.
What is the overhead of enabling distributed tracing?
Depends on sampling rate and instrumentation depth. Auto-instrumentation adds <1ms per HTTP request. Manual spans add a few microseconds each (span creation, attribute setting). The larger overhead is network: exporting spans requires a TCP connection to the collector. Use batching (BatchSpanProcessor) to amortise cost. At 1% sampling, overhead is negligible.
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.