Jaeger Missing Spans — Async Context Propagation Fix
Kafka consumers showing separate trace IDs? Raw client libraries skip traceparent headers.
20+ years shipping production infrastructure and CI/CD at scale. Lessons pulled from things that broke in production.
- A trace tracks one request across multiple services; a span is a single operation within a service.
- Jaeger is an open-source CNCF-graduated tracing backend that stores, indexes, and visualises traces.
- Instrument code with OpenTelemetry, export via OTLP (port 4317/4318), and view traces at Jaeger UI.
- Context propagation via W3C traceparent header is critical for cross-service visibility.
- Sampling (head-based or tail-based) trades accuracy for storage cost — always sample errors 100%.
Imagine tracking a package through multiple delivery trucks, but each driver starts a new tracking number instead of passing along the original one. The customer sees several disconnected packages instead of one continuous journey. Fixing context propagation means making sure every driver writes down the original tracking number before handing off the box.
Missing spans in Jaeger are almost always caused by broken context propagation across async boundaries, not sampling or network failures. When trace IDs fail to cross thread pools, message queues, or background workers, traces fragment into orphaned spans that hide critical latency bottlenecks. This article shows how to diagnose and fix async context propagation using OpenTelemetry's propagator API.
How Distributed Tracing with Jaeger Works — and Why Spans Go Missing
Distributed tracing with Jaeger tracks a single request as it propagates through microservices by assigning a unique trace ID and attaching spans — each span representing a unit of work with start time, duration, and metadata. The core mechanic is context propagation: the trace ID must be passed across service boundaries via HTTP headers (or message queue metadata) so that spans from different services can be stitched into one trace. Without correct propagation, spans become orphaned and the trace is incomplete.
Jaeger stores traces in a storage backend (Cassandra or Elasticsearch, optionally with Kafka as a buffer in front of storage) and exposes them via a UI. In practice, each service must extract the incoming trace context, create child spans, and inject the context into outgoing requests. This is typically done with OpenTelemetry SDKs, which handle serialization and deserialization of trace context. The key property that matters: if any service in the chain fails to propagate context, whether due to async boundaries, thread pool switches, or manual HTTP clients, the trace breaks at that point.
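The thread pool case can be reproduced with the standard library alone. The sketch below assumes the trace context lives in a contextvars.ContextVar, which is the mechanism OpenTelemetry's Python runtime context builds on; current_trace_id, read_trace_id, and handle_request are hypothetical names used only for illustration.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the tracing library's "current trace" slot.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


def read_trace_id():
    """Runs in a worker thread and reports which trace ID it can see."""
    return current_trace_id.get()


def handle_request(trace_id):
    """Simulates a request handler that hands work off to a thread pool."""
    current_trace_id.set(trace_id)
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Plain submit: on standard CPython builds the worker thread starts
        # with an empty context, so the trace ID is not visible there.
        lost = pool.submit(read_trace_id).result()
        # Copy the caller's context and run the task inside that copy.
        ctx = contextvars.copy_context()
        kept = pool.submit(ctx.run, read_trace_id).result()
    return lost, kept


if __name__ == "__main__":
    print(handle_request("4bf92f3577b34da6a3ce929d0e0e4736"))
```

The first value is what an uninstrumented hand-off sees, the second is what the worker sees once the caller's context travels with the task. Any span started in the first case would have no parent to attach to.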
Use Jaeger when you need to debug latency spikes, identify service dependencies, or trace errors across more than three services. In production systems handling thousands of requests per second, a single missing span can hide a 500ms bottleneck in a downstream service. Without tracing, you're debugging blind — logs give you local state, but only traces show the full causal chain.
Instrumenting a FastAPI App with OpenTelemetry
Running Jaeger with Docker
Understanding Spans, Traces, and Context Propagation
A span is a named, timed operation that carries a span ID, trace ID, parent span ID, and attributes. The entire set of spans linked by a common trace ID forms a trace. Context propagation is what connects spans across service boundaries — it passes the trace ID and parent span ID via HTTP headers (W3C Trace Context: traceparent and tracestate). Without propagation, each service creates a separate trace, and you lose the end-to-end view. Propagation is automatic when using OpenTelemetry instrumentation libraries (they inject headers on outgoing requests). If you use raw HTTP clients or message queues, you must manually inject and extract the context.
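To make the header mechanics concrete, here is a minimal sketch of what extract and inject do with a traceparent value, using only the standard library. It handles the four-field version 00 layout only, and the names SpanContext, parse_traceparent, and child_traceparent are hypothetical; a real service would rely on the OpenTelemetry propagator rather than hand-rolled parsing.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version - trace_id (32 hex) - parent span_id (16 hex) - flags (2 hex)
_TRACEPARENT_RE = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class SpanContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Extract side: returns the caller's context, or None if the header is unusable."""
    match = _TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per W3C Trace Context.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_traceparent(parent: SpanContext) -> str:
    """Inject side: same trace ID, fresh span ID for the outgoing call."""
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{new_span_id}-{flags}"
```

When parse_traceparent returns None, the receiving service has nothing to attach to and starts a new trace, which is exactly the fragmentation described above.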
If you use a raw HTTP client (requests or httpx without integration), you must inject and extract headers manually. The W3C traceparent header is the standard; don't invent your own.
Sampling Strategies in Production
Recording every trace at production scale is expensive in both storage and network bandwidth. Sampling decides which traces to keep. Head-based sampling makes the decision at the start of a request (e.g., keep 1% of all traces). It's simple but can miss rare high-latency events. Tail-based sampling buffers traces and decides after they complete, keeping those that exceed a latency threshold or contain errors. A Jaeger deployment can use either: head-based sampling is configured in the SDK, while tail-based sampling usually runs in a collector placed in front of Jaeger (for example the OpenTelemetry Collector's tail sampling processor). A common hybrid approach: sample 1–5% of normal requests and always keep requests with HTTP 5xx or custom error attributes.
Setting the right sampling rate is a trade-off. Too low (0.1%) and you'll miss most issues; too high (100%) for high-throughput services will overwhelm storage. Start at 1% and adjust based on storage budget and trace usefulness.
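The hybrid policy can be sketched as two small functions. This is an illustration under stated assumptions, not Jaeger's or OpenTelemetry's implementation: keep_by_ratio and keep_trace are hypothetical names, the 500ms slow threshold is an arbitrary default, and the ratio check compares the low 64 bits of the trace ID against a bound so that every service reaches the same decision for the same trace.

```python
def keep_by_ratio(trace_id_hex: str, ratio: float) -> bool:
    """Deterministic ratio decision: the same trace ID always gets the same answer."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    low_64_bits = int(trace_id_hex, 16) & 0xFFFFFFFFFFFFFFFF
    return low_64_bits < round(ratio * 2**64)


def keep_trace(trace_id_hex: str, ratio: float, status_code: int,
               duration_ms: float, slow_ms: float = 500.0) -> bool:
    """Tail-style decision, made once the request has finished."""
    # Errors and slow requests are always kept, regardless of the ratio.
    if status_code >= 500 or duration_ms >= slow_ms:
        return True
    # Everything else falls back to the probabilistic rule.
    return keep_by_ratio(trace_id_hex, ratio)
```

Note that keep_trace needs the status code and duration, so it can only run after the request completes. A pure head-based sampler has access to the trace ID alone, which is why it cannot guarantee that errors are kept.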
Troubleshooting Missing Spans and Broken Traces
When a trace doesn't appear in Jaeger UI, or appears incomplete, the root cause is almost always one of: (1) context not propagated, (2) spans not exported, (3) sampling dropped the trace, (4) clock skew between service hosts. Use the following systematic checks. First, confirm you're hitting the Jaeger endpoint by looking at application logs for OTLP export errors. Second, check the trace ID uniformity — if each service generates its own trace ID, propagation is missing. Third, verify that the "Trace" view shows all expected spans — missing spans may indicate a failing exporter or network issue. Fourth, if spans from different services appear with wrong timing, check NTP synchronisation: Jaeger relies on span timestamps for ordering.
- Each service drops its breadcrumb (span) and passes the trace ID onward.
- If the breadcrumb is missing (span not created), the chain breaks.
- If the trace ID is not passed (propagation failure), the chain splits into separate chains.
- Your debugging goal: find the first service where the breadcrumb pattern changes.
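The two failure modes in the list above can be told apart mechanically once the spans for a request are available, for example from an export or from logs. The sketch below assumes each span is a plain dict with trace_id, span_id, parent_id (None for a root), and service keys, and that the list is ordered by start time with the true entry point first. The diagnose function and the dict layout are hypothetical, not a Jaeger API.

```python
from collections import defaultdict


def diagnose(spans):
    """Summarise where a request's spans stop forming a single chain."""
    known_ids = {span["span_id"] for span in spans}
    services_by_trace = defaultdict(list)
    for span in spans:
        services_by_trace[span["trace_id"]].append(span["service"])

    # Parent ID points at a span we never received: the upstream span was
    # not created or not exported.
    missing_parent = [
        span["service"] for span in spans
        if span["parent_id"] is not None and span["parent_id"] not in known_ids
    ]
    # A second root means some service started a new trace instead of
    # continuing the incoming one: a propagation failure.
    roots = [span["service"] for span in spans if span["parent_id"] is None]

    return {
        "trace_count": len(services_by_trace),
        "services_by_trace": dict(services_by_trace),
        "missing_parent": missing_parent,
        "unexpected_roots": roots[1:],
    }
```

A trace_count above one together with an entry in unexpected_roots points at the first service that failed to extract the incoming context.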
Integrating Traces with Logs and Metrics
Distributed tracing alone doesn't replace logs or metrics; it complements them. The true power emerges when you correlate trace IDs with log entries and metric events. OpenTelemetry enables this via trace_id injection into log records (MDC in Java, structlog in Python). On the metrics side, avoid putting trace IDs into Prometheus labels, because every request would create a new time series; exemplars are the mechanism for attaching a trace ID to a metric sample. Jaeger's UI allows you to drill from a trace to related logs if you configure the log integration.
A common production pattern: when a latency alert fires, grab the trace ID from the affected request, open Jaeger to see the breakdown, then jump to the logs from that span ID to inspect the exact error message.
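For that jump from trace to logs to work, every log line has to carry the trace ID. A minimal standard-library sketch of the idea follows. It assumes the active trace ID is held in a ContextVar named active_trace_id, which is a hypothetical stand-in; in an instrumented service the value would come from the current OpenTelemetry span context instead. TraceIdFilter, build_logger, and render_line are likewise illustrative names.

```python
import contextvars
import io
import logging

# Hypothetical slot holding the active trace ID; "-" means "no active trace".
active_trace_id = contextvars.ContextVar("active_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copies the active trace ID onto every log record."""

    def filter(self, record):
        record.trace_id = active_trace_id.get()
        return True  # never drop the record, only annotate it


def build_logger(name, stream):
    """Creates a logger whose lines include trace_id=<id>."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
    )
    handler.addFilter(TraceIdFilter())
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False  # keep the demo output out of the root logger
    return logger


def render_line(trace_id, message):
    """Logs one message under the given trace ID and returns the formatted line."""
    buffer = io.StringIO()
    logger = build_logger(f"demo.{trace_id}", buffer)
    token = active_trace_id.set(trace_id)
    try:
        logger.info(message)
    finally:
        active_trace_id.reset(token)
    return buffer.getvalue()
```

With this in place, searching the log store for the trace ID copied from Jaeger returns the lines written while that trace was active.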
Why Your Traces Are Silent: The gRPC vs HTTP Exporter Trap
Most beginners copy-paste a Jaeger exporter configuration and wonder why their traces never show up. The culprit is almost always an endpoint mismatch. Jaeger all-in-one listens on several collector ports: 14250 for gRPC, 14268 for HTTP Thrift, and 9411 for Zipkin, plus 4317/4318 for OTLP when the OTLP receiver is enabled. If you use JaegerExporter (gRPC) but your Jaeger container isn't listening on 14250, your spans vanish into the void. I've debugged this in three separate microservice migrations. The fix is brutally simple: match your exporter to a port Jaeger actually has open. For HTTP, use the Thrift-over-HTTP exporter. For gRPC, ensure COLLECTOR_GRPC_PORT is set. Don't assume both endpoints are active; check your docker logs. This mismatch wastes hours for teams that could be shipping features.
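The same reachability check can be scripted, for instance as a startup probe in the service itself. This is a small sketch using only the socket module; port_is_open is a hypothetical helper, and the jaeger hostname in the usage lines is the one assumed by the article's examples.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Returns True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures alike.
        return False


if __name__ == "__main__":
    for port in (4317, 4318, 14250, 14268):
        print(port, port_is_open("jaeger", port))
```

A successful TCP connection only proves that something is listening. It does not confirm that the listener speaks the protocol your exporter uses, so a port that is open but still drops spans points back at an exporter and endpoint mismatch.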
Check the port with curl -v telnet://jaeger:14250 before blaming your code.
Sampling in Production: Don't Bankrupt Your Storage on Every Request
In development, trace every request. In production, that costs real money: storage, network, and CPU. Smart teams use head-based sampling to keep the firehose manageable. Jaeger's probabilistic sampler with a 5-10% rate catches most anomalies without exploding your budget. But here's the trick: combine it with rate-limiting per endpoint. Your health-check endpoint doesn't need tracing at all. Your payment service deserves higher sampling. I've seen a startup burn $2,000/month on Jaeger storage because they sampled 100% on a high-traffic API. In Jaeger's own sampler settings a 10% rate is sampler.type=probabilistic with sampler.param=0.1; with the OpenTelemetry SDK the equivalent is OTEL_TRACES_SAMPLER=parentbased_traceidratio with OTEL_TRACES_SAMPLER_ARG=0.1. For critical flows, add a custom sampler that always keeps traces with errors. Your SRE team will thank you.
jaeger-sampling extension. For ultra-low latency apps, use head-based sampling and store only sampled spans. You can always re-analyze hot paths with manual instrumentation.
Missing Trace Context on Async Event Bus Caused False 'Healthy' Signals
aiokafka library without an integration layer, so no traceparent header was passed.
opentelemetry.propagate.inject() when producing and extract() when consuming.
- Auto-instrumentation is not magic: always verify context propagation at each boundary.
- For any async or message-based communication, explicitly inject trace context into messages.
- When testing, generate traces from end to end and check in Jaeger UI that a single trace spans all services.
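For the Kafka case, the propagator works on a plain dict carrier, while message headers travel as key and bytes pairs. The sketch below covers only that conversion. It assumes headers are a list of (str, bytes) tuples, which is how aiokafka is understood to represent them here, and the two helper names are hypothetical. In a real producer the carrier would be filled by opentelemetry.propagate.inject(carrier), and in the consumer the rebuilt dict would be passed to extract(carrier).

```python
# W3C Trace Context header names that need to survive the hop.
TRACE_HEADERS = ("traceparent", "tracestate")


def carrier_to_kafka_headers(carrier):
    """Producer side: dict[str, str] -> list of (str, bytes) message headers."""
    return [(key, value.encode("utf-8")) for key, value in carrier.items()]


def kafka_headers_to_carrier(headers):
    """Consumer side: message headers -> dict[str, str] ready for extract()."""
    carrier = {}
    for key, value in headers or []:
        # Ignore unrelated headers and headers with no value.
        if key.lower() in TRACE_HEADERS and value is not None:
            carrier[key.lower()] = value.decode("utf-8")
    return carrier


if __name__ == "__main__":
    outgoing = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
    headers = carrier_to_kafka_headers(outgoing)
    print(kafka_headers_to_carrier(headers) == outgoing)
```

If the consumer side yields an empty dict for a message, the producer never injected context, and the consumer's spans will start a new trace.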
Run ntpq -p on all nodes to check clock synchronisation. Spans with timestamps from different hosts can be misordered if clocks drift more than 100ms.
kubectl logs -l app=order-service --tail=20 | grep -i otlp
docker logs jaeger 2>&1 | grep -i error
Key takeaways
Common mistakes to avoid
- Using auto-instrumentation only and assuming all spans are captured. Add tracer.start_as_current_span around every external I/O call, lock acquisition, or business logic block that can take more than 10ms.
- Running Jaeger all-in-one in production without persistent storage. Set SPAN_STORAGE_TYPE=elasticsearch and proper connection endpoints.
- Setting a global sampling rate without considering operation criticality.
- Not injecting trace context into asynchronous or batch job spans.
- Forgetting to handle clock skew across hosts.
Interview Questions on This Topic
Explain how distributed tracing works at the protocol level. How does a span get linked to its parent across service boundaries?
When Service A calls Service B, it injects a traceparent HTTP header with the current trace ID and span ID. Service B extracts that header and creates a new span with the same trace ID and the received span ID as parent. This creates a directed acyclic graph of spans. The header format is: 00-{trace_id}-{span_id}-{trace_flags} (W3C Trace Context).
Frequently Asked Questions