Jaeger Missing Spans — Async Context Propagation Fix

Where developers are forged. · Structured learning · Free forever.
📍 Part of: Monitoring → Topic 5 of 9
Kafka consumers showing separate trace IDs? Raw client libraries skip traceparent headers.
🔥 Advanced — solid DevOps foundation required
In this tutorial, you'll learn
  • A trace = complete request journey across services. A span = one operation within a service.
  • OpenTelemetry is the vendor-neutral instrumentation API — use it to avoid lock-in.
  • FastAPIInstrumentor auto-instruments all routes — you only need manual spans for important sub-operations.
Quick Answer
  • A trace tracks one request across multiple services; a span is a single operation within a service.
  • Jaeger is an open-source CNCF-graduated tracing backend that stores, indexes, and visualises traces.
  • Instrument code with OpenTelemetry, export via OTLP (port 4317/4318), and view traces at Jaeger UI.
  • Context propagation via W3C traceparent header is critical for cross-service visibility.
  • Sampling (head-based or tail-based) trades accuracy for storage cost — always sample errors 100%.
🚨 START HERE

Quick Trace Debug Cheat Sheet

Five common trace issues and the exact commands to diagnose them
🟡

No traces in Jaeger UI

Immediate Action: Ping the collector: curl http://jaeger-collector:4318
Commands
kubectl logs -l app=order-service --tail=20 | grep -i otlp
docker logs jaeger 2>&1 | grep -i error
Fix Now: Restart the instrumented service after verifying endpoint env vars
🟡

Spans missing from a trace

Immediate Action: Check the trace details in Jaeger UI for 'span count' vs expected
Commands
curl -H 'traceparent: 00-<trace_id>-<span_id>-01' http://target-service/health
Inspect application logs around the request time for exporter errors
Fix Now: Manually inject traceparent in a test request to isolate the breaking service
🟡

Distorted span timings (negative or huge values)

Immediate Action: Check system time on each host: date +%s
Commands
ntpq -p | grep '^\*'   # the line starting with * is the peer currently selected for sync
timedatectl show --property=NTP --value
Fix Now: Run `sudo timedatectl set-ntp true` and wait for sync
🟡

Sampling rate too aggressive (traces missing)

Immediate Action: Check Jaeger remote sampling config endpoint
Commands
curl 'http://jaeger-collector:5778/sampling?service=order-service'
Look for 'probabilistic_sampling: { sampling_rate: 0.01 }'
Fix Now: Increase rate or switch to tail-based sampling for high-latency operations
🟡

Context not propagated to downstream service

Immediate Action: Send a test request that carries a known traceparent header and check whether the same trace ID reaches the downstream service
Commands
curl -v -H 'traceparent: 00-<trace_id>-<span_id>-01' http://order-service/orders/1 2>&1 | grep -i trace
Check if the downstream service receives traceparent header in its logs
Fix Now: Add manual inject/extract in the communication layer (HTTP client, message producer)
Production Incident

Missing Trace Context on Async Event Bus Caused False 'Healthy' Signals

A team distributed trace ingestion across Kafka consumers and found traces broken into isolated fragments. Each consumer was creating a new trace instead of continuing the parent trace, making it impossible to trace the full path from API to database.
Symptom: Traces for order processing showed only a single span (the HTTP handler) with no downstream spans from the Kafka consumer. The consumer's spans had different trace IDs, so they appeared as separate traces in Jaeger UI.
Assumption: The team assumed OpenTelemetry auto-instrumentation for Kafka would propagate context automatically. It did not: the Kafka integration requires manual setup for header injection.
Root cause: OpenTelemetry's Kafka instrumentation propagates headers only if you use the official producer/consumer API wrappers. The team was using a raw aiokafka library without an integration layer, so no traceparent header was passed.
Fix: Switch to the OpenTelemetry-instrumented Kafka producer/consumer, or manually inject/extract context using opentelemetry.propagate.inject() when producing and extract() when consuming.
Key Lesson
  • Auto-instrumentation is not magic: always verify context propagation at each boundary.
  • For any async or message-based communication, explicitly inject trace context into messages.
  • When testing, generate traces from end to end and check in Jaeger UI that a single trace spans all services.
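The manual fix from this incident can be sketched as two small helpers. This is a minimal illustration, assuming an aiokafka-style client where record headers are a list of `(str, bytes)` tuples; the helper names, `topic`, `payload`, and the `consume-order` span name are hypothetical. The carrier dict is what `opentelemetry.propagate.inject()` fills on the producer side and what `extract()` reads on the consumer side.

```python
from typing import Dict, List, Optional, Tuple

KafkaHeaders = List[Tuple[str, Optional[bytes]]]


def carrier_to_kafka_headers(carrier: Dict[str, str]) -> KafkaHeaders:
    """Encode a propagation carrier (str -> str) as Kafka record headers."""
    return [(key, value.encode("utf-8")) for key, value in carrier.items()]


def kafka_headers_to_carrier(headers: Optional[KafkaHeaders]) -> Dict[str, str]:
    """Decode Kafka record headers into a carrier dict with lowercased keys."""
    carrier: Dict[str, str] = {}
    for key, value in headers or []:
        if value is not None:  # skip headers that carry no value
            carrier[key.lower()] = value.decode("utf-8")
    return carrier


# Producer side (sketch, names are assumptions):
#   carrier = {}
#   inject(carrier)  # opentelemetry.propagate.inject
#   await producer.send_and_wait(topic, payload,
#                                headers=carrier_to_kafka_headers(carrier))
#
# Consumer side (sketch, names are assumptions):
#   ctx = extract(kafka_headers_to_carrier(msg.headers))
#   with tracer.start_as_current_span("consume-order", context=ctx):
#       ...

if __name__ == "__main__":
    carrier = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
    headers = carrier_to_kafka_headers(carrier)
    print(headers)
    print(kafka_headers_to_carrier(headers) == carrier)
```

Keeping the conversion in one place makes it easy to unit-test the boundary where the trace broke in this incident.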
Production Debug Guide

Systematic checks to find why your distributed trace is broken

Trace visible in Jaeger UI but shows only a single span → Check context propagation; verify that the service making downstream calls injects the traceparent header. Test with curl -v and inspect request headers.
No traces for a specific service appear at all → Verify the service can reach the Jaeger Collector endpoint. Check service logs for OTLP export errors. Confirm the port (4317 for gRPC, 4318 for HTTP) matches collector configuration.
Traces appear but with spans out of order or negative duration → Run ntpq -p on all nodes to check clock synchronisation. Spans with timestamps from different hosts can be misordered if clocks drift more than 100ms.
Only a small fraction of traces appear despite high request volume → Check sampling configuration. Confirm you're not using a head-based sampler with a rate too low for the traffic pattern. Also check the Jaeger Collector metrics for dropped spans.
Traces contain spans from service A but not service B, though B is called → B likely has a bug in its instrumentation or exporter. Test B in isolation: send a request that produces a trace and verify it appears. Common cause: missing OpenTelemetry package or wrong exporter endpoint.
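The "can the service reach the collector" check can also be run from inside the service's own container, where curl is often not installed. A minimal sketch using only the standard library; the host name `jaeger-collector` is taken from the cheat sheet above and is an assumption about your environment. It only proves that a TCP connection opens, not that OTLP export succeeds.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, DNS failure, or timeout
        return False


if __name__ == "__main__":
    # 4317 = OTLP gRPC, 4318 = OTLP HTTP
    for port in (4317, 4318):
        print(port, can_connect("jaeger-collector", port))
```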

Instrumenting a FastAPI App with OpenTelemetry

Example · PYTHON
# pip install opentelemetry-distro opentelemetry-exporter-otlp
# pip install opentelemetry-instrumentation-fastapi

from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

# Configure tracer
provider = TracerProvider()
exporter = OTLPSpanExporter(endpoint='http://jaeger:4317')  # Jaeger OTLP endpoint
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

app = FastAPI()
FastAPIInstrumentor.instrument_app(app)  # auto-instruments all routes

tracer = trace.get_tracer(__name__)

@app.get('/orders/{order_id}')
async def get_order(order_id: int):
    with tracer.start_as_current_span('fetch-order') as span:
        span.set_attribute('order.id', order_id)

        # Manual span for a specific operation
        with tracer.start_as_current_span('db-query'):
            order = await db.get_order(order_id)

        with tracer.start_as_current_span('enrich-order'):
            user = await user_service.get_user(order.user_id)  # cross-service call

        # `db` and `user_service` stand for your application's own clients
        return {'order': order, 'user': user}
▶ Output
# Traces exported to Jaeger — visible in Jaeger UI at http://jaeger:16686
📊 Production Insight
Auto-instrumentation only covers framework-level spans.
Manual spans around I/O, locks, and business logic are usually where the latency actually hides.
Rule: if you rely only on auto-instrumentation, expect the root cause to sit inside a span you cannot see into.
🎯 Key Takeaway
FastAPIInstrumentor handles HTTP routes automatically.
Manual spans give you visibility into the operations that matter most.
Always wrap external calls and critical logic in custom spans.

Running Jaeger with Docker

Example · BASH
# Run Jaeger all-in-one (development setup)
docker run -d \
  --name jaeger \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  jaegertracing/all-in-one:latest

# Ports:
# 16686  Jaeger UI
# 4317   OTLP gRPC receiver
# 4318   OTLP HTTP receiver

# Open Jaeger UI: http://localhost:16686
# Search by service name → see all traces
# Click a trace → see full span timeline
# Click a span → see attributes, events, errors
▶ Output
# Jaeger UI at http://localhost:16686
📊 Production Insight
The all-in-one image bundles storage, collector, and query into one process.
It's fine for dev but loses all traces on restart — use Elasticsearch or Cassandra in prod.
Rule: never use all-in-one for production; you won't have trace persistence.
🎯 Key Takeaway
Docker run gets you started in 30 seconds.
Port 16686 = UI, 4317 = OTLP gRPC, 4318 = OTLP HTTP.
Lossy dev mode — plan for persistent storage before you go live.

Understanding Spans, Traces, and Context Propagation

A span is a named, timed operation that carries a span ID, trace ID, parent span ID, and attributes. The entire set of spans linked by a common trace ID forms a trace. Context propagation is what connects spans across service boundaries — it passes the trace ID and parent span ID via HTTP headers (W3C Trace Context: traceparent and tracestate). Without propagation, each service creates a separate trace, and you lose the end-to-end view. Propagation is automatic when using OpenTelemetry instrumentation libraries (they inject headers on outgoing requests). If you use raw HTTP clients or message queues, you must manually inject and extract the context.

propagation_example.py · PYTHON
import requests
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer(__name__)

# Sending service
with tracer.start_as_current_span('outgoing-call'):
    headers = {}
    inject(headers)  # writes traceparent for the current span into headers
    response = requests.get('http://payment-service/process', headers=headers)

# Receiving service
# In middleware or at route entry, extract context from the incoming request.
# `request` is your framework's request object; its headers must behave like a mapping.
ctx = extract(request.headers)
with tracer.start_as_current_span('process-payment', context=ctx) as span:
    # this span is now a child of the sending service's span
    pass
🔥Manual propagation is a common trap
If you're using HTTP clients not covered by auto-instrumentation (e.g., raw requests or httpx without integration), you must inject and extract headers manually. The W3C traceparent header is the standard — don't invent your own.
📊 Production Insight
Missing context propagation is the #1 reason traces break at service boundaries.
Check for the traceparent header in your incoming requests to verify propagation.
Rule: if you see a new trace ID after a cross-service call, propagation is broken.
🎯 Key Takeaway
Spans get connected via trace ID propagated across services.
If your trace is broken into pieces, context propagation is the culprit.
Inject headers on outgoing calls; extract on incoming — always verify.
Propagation method decision
If: Using an auto-instrumented library (FastAPIInstrumentor, requests integration)
Use: Propagation is automatic, no extra code needed.
If: Using a custom HTTP client or non-HTTP transport (Redis, Kafka)
Use: You must manually inject/extract context using the OpenTelemetry API.
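When you verify propagation by hand, it helps to know what a valid traceparent value looks like. The sketch below parses the version-00 layout `00-{trace_id}-{span_id}-{trace_flags}` using only the standard library. The function and class names are illustrative, and in real services the OpenTelemetry propagator does this for you.

```python
import re
from typing import NamedTuple, Optional

# version (2 hex) - trace-id (32 hex) - parent span id (16 hex) - flags (2 hex)
_TRACEPARENT_RE = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Parse a traceparent header value; return None if it is malformed."""
    match = _TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    sampled = bool(int(flags, 16) & 0x01)  # lowest flag bit = sampled
    return TraceParent(version, trace_id, span_id, sampled)


if __name__ == "__main__":
    print(parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"))
```

If the trace_id you log on the receiving side differs from the one the caller sent, propagation is broken at that hop.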

Sampling Strategies in Production

Recording every trace at production scale is expensive — both in storage and network bandwidth. Sampling decides which traces to keep. Head-based sampling makes the decision at the start of a request (e.g., keep 1% of all traces). It's simple but can miss rare high-latency events. Tail-based sampling buffers traces and decides after they complete, keeping those that exceed a latency threshold or contain errors. Jaeger supports both. A common hybrid approach: sample 1–5% of normal requests and always sample requests with HTTP 5xx or custom error attributes.

Setting the right sampling rate is a trade-off. Too low (0.1%) and you'll miss most issues; too high (100%) for high-throughput services will overwhelm storage. Start at 1% and adjust based on storage budget and trace usefulness.
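To make the head-based decision concrete, here is a minimal sketch of ratio sampling keyed on the trace ID. Deriving the decision from the ID, rather than from a fresh random number, means every service that sees the same trace ID reaches the same answer. This is an illustration of the idea under that assumption, not a copy of any SDK's sampler. `head_sample` is a hypothetical name.

```python
import secrets

_LOW_64_BITS = (1 << 64) - 1


def head_sample(trace_id: int, rate: float) -> bool:
    """Keep the trace if the low 64 bits of its ID fall below rate * 2**64."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    bound = int(rate * (1 << 64))
    return (trace_id & _LOW_64_BITS) < bound


if __name__ == "__main__":
    total = 100_000
    kept = sum(head_sample(secrets.randbits(128), 0.01) for _ in range(total))
    print(f"kept {kept} of {total} traces at a 1% rate")
```

Note that the decision is made before anything is known about latency or errors, which is exactly the limitation of head-based sampling described above.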

jaeger-sampling-config.yaml · YAML
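# Remote sampling configuration for Jaeger (illustrative layout: Jaeger's own strategies file is JSON, so check the Jaeger sampling docs for the exact schema)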
service_config:
  - service: "order-service"
    operation: "/orders/{id}"
    probabilistic_sampling:
      sampling_rate: 0.01  # 1% sample
  - service: "order-service"
    operation: "*"
    probabilistic_sampling:
      sampling_rate: 0.005  # 0.5% for other ops
  - service: "payment-service"
    operation: "*"
    probabilistic_sampling:
      sampling_rate: 0.05  # 5% for payment (higher risk)
⚠ Watch out for sampling latency bias
Head-based sampling can introduce bias against slow requests because it makes a decision before the request completes. If your sampling rate is 1%, you miss 99% of slow responses. Tail-based sampling solves this but requires buffering and adds memory overhead.
📊 Production Insight
Setting sampling per-operation is key: payment services need higher rate than health checks.
Use remote sampling configuration (Jaeger Collector) to change rates without redeploying.
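Rule: keep 100% of error traces regardless of the overall rate. Errors are only known after the request has run, so this needs a tail-based decision (for example in the OpenTelemetry Collector), not a head-based sampler.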
🎯 Key Takeaway
Head-based sampling is simple but can miss slow requests.
Tail-based sampling captures the long tail but costs more.
Best practice: 100% for errors, 1–5% for normal traffic per service tier.
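The tail-based rule (keep errors and slow traces, decide after completion) can be sketched as a pure function over the finished spans of one trace. The `FinishedSpan` type, its field names, and the threshold are assumptions for illustration. A real deployment would buffer spans per trace ID and apply a policy like this in the collector tier.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FinishedSpan:
    name: str
    start_ms: float
    end_ms: float
    is_error: bool = False


def keep_trace(spans: List[FinishedSpan], latency_threshold_ms: float) -> bool:
    """Tail-based decision: keep traces that contain an error or run too long."""
    if not spans:
        return False
    if any(span.is_error for span in spans):
        return True
    # Trace duration = earliest start to latest end across all spans.
    duration_ms = max(s.end_ms for s in spans) - min(s.start_ms for s in spans)
    return duration_ms > latency_threshold_ms


if __name__ == "__main__":
    slow = [FinishedSpan("GET /orders/{id}", 0, 900), FinishedSpan("db-query", 10, 880)]
    print(keep_trace(slow, latency_threshold_ms=500))
```

The buffering cost mentioned above comes from having to hold every span of a trace until this function can be called.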

Troubleshooting Missing Spans and Broken Traces

When a trace doesn't appear in Jaeger UI, or appears incomplete, the root cause is almost always one of: (1) context not propagated, (2) spans not exported, (3) sampling dropped the trace, (4) clock skew between service hosts. Use the following systematic checks. First, confirm you're hitting the Jaeger endpoint by looking at application logs for OTLP export errors. Second, check the trace ID uniformity — if each service generates its own trace ID, propagation is missing. Third, verify that the "Trace" view shows all expected spans — missing spans may indicate a failing exporter or network issue. Fourth, if spans from different services appear with wrong timing, check NTP synchronisation: Jaeger relies on span timestamps for ordering.

Mental Model
Think of traces as breadcrumbs
A complete trace is a chain of breadcrumbs from entry to exit. Each broken link is a missing breadcrumb.
  • Each service drops its breadcrumb (span) and passes the trace ID onward.
  • If the breadcrumb is missing (span not created), the chain breaks.
  • If the trace ID is not passed (propagation failure), the chain splits into separate chains.
  • Your debugging goal: find the first service where the breadcrumb pattern changes.
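The breadcrumb model maps onto two mechanical checks you can run over exported span data. A minimal sketch, assuming spans are available as dicts with `trace_id`, `span_id`, `parent_span_id`, and `service` keys (these field names are assumptions, not Jaeger's export schema). Orphans point at a missing span, while several trace IDs for one request point at a propagation failure.

```python
from typing import Dict, List, Optional

Span = Dict[str, Optional[str]]


def find_orphans(spans: List[Span]) -> List[Span]:
    """Spans whose parent ID does not match any span in the same trace."""
    known = {(s["trace_id"], s["span_id"]) for s in spans}
    return [
        s for s in spans
        if s["parent_span_id"] is not None
        and (s["trace_id"], s["parent_span_id"]) not in known
    ]


def services_by_trace(spans: List[Span]) -> Dict[str, List[str]]:
    """Group service names by trace ID; one request should yield one key."""
    grouped: Dict[str, set] = {}
    for s in spans:
        grouped.setdefault(s["trace_id"], set()).add(s["service"])
    return {trace_id: sorted(names) for trace_id, names in grouped.items()}


if __name__ == "__main__":
    spans = [
        {"trace_id": "t1", "span_id": "a", "parent_span_id": None, "service": "api"},
        {"trace_id": "t1", "span_id": "d", "parent_span_id": "c", "service": "payment"},
        {"trace_id": "t2", "span_id": "e", "parent_span_id": None, "service": "consumer"},
    ]
    print(find_orphans(spans))
    print(services_by_trace(spans))
```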
📊 Production Insight
Clock skew of even 100ms can cause spans to appear out of order in the UI.
Run NTP on all nodes and monitor drift — Jaeger has a built-in clock skew adjustment but it's not perfect.
Rule: if a trace's spans jump backwards in time, check NTP first.
🎯 Key Takeaway
Missing traces = propagation fail or sampling drop.
Incomplete traces = missing spans (exporter error or code bug).
Out-of-order spans = clock skew — NTP is not optional.
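A quick way to spot clock skew in trace data is to look for child spans that start before their parent. A child cannot causally begin before the span that created it, so a negative offset suggests the two hosts disagree about the time. This is a sketch with assumed field names (`span_id`, `parent_span_id`, `start_us`). Children that end after their parent are not flagged, since that is normal for async work.

```python
from typing import Dict, List, Tuple


def skewed_children(spans: List[Dict]) -> List[Tuple[str, int]]:
    """Return (span_id, microseconds the child starts before its parent)."""
    by_id = {s["span_id"]: s for s in spans}
    findings = []
    for span in spans:
        parent = by_id.get(span.get("parent_span_id"))
        if parent is not None and span["start_us"] < parent["start_us"]:
            findings.append((span["span_id"], parent["start_us"] - span["start_us"]))
    return findings


if __name__ == "__main__":
    spans = [
        {"span_id": "a", "parent_span_id": None, "start_us": 1_000},
        {"span_id": "b", "parent_span_id": "a", "start_us": 400},
    ]
    print(skewed_children(spans))
```

The reported offset is a lower bound on the skew between the two hosts, which gives you a number to compare against your NTP drift alerts.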

Integrating Traces with Logs and Metrics

Distributed tracing alone doesn't replace logs or metrics; it complements them. The true power emerges when you correlate trace IDs with log entries and metric events. OpenTelemetry enables this via trace_id injection into log records (MDC in Java, structlog in Python). Metric tools like Prometheus can attach trace IDs to individual samples as exemplars; avoid putting trace IDs in metric labels, because every new ID would create a new time series. Jaeger's UI allows you to drill from a trace to related logs if you configure the log integration.

A common production pattern: when a latency alert fires, grab the trace ID from the affected request, open Jaeger to see the breakdown, then jump to the logs from that span ID to inspect the exact error message.

log_correlation.py · PYTHON
import structlog
from opentelemetry import trace

span = trace.get_current_span()
trace_id = format(span.get_span_context().trace_id, '032x')
span_id = format(span.get_span_context().span_id, '016x')

# Inject trace context into log
logger = structlog.get_logger()
logger.info("payment processed", trace_id=trace_id, span_id=span_id, order_id=123)
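If the codebase uses the standard `logging` module rather than structlog, the same correlation can be applied once at the handler level instead of at every call site. A minimal sketch: `TraceContextFilter` and the `current_ids` callable are hypothetical names. With OpenTelemetry, `current_ids` would return the `trace_id` and `span_id` of `trace.get_current_span().get_span_context()`.

```python
import io
import logging
from typing import Callable, Tuple


class TraceContextFilter(logging.Filter):
    """Attach trace_id and span_id (hex strings) to every log record."""

    def __init__(self, current_ids: Callable[[], Tuple[int, int]]):
        super().__init__()
        self._current_ids = current_ids

    def filter(self, record: logging.LogRecord) -> bool:
        trace_id, span_id = self._current_ids()
        record.trace_id = format(trace_id, "032x")
        record.span_id = format(span_id, "016x")
        return True  # never drop the record, only enrich it


if __name__ == "__main__":
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(trace_id)s %(span_id)s %(message)s"))
    # Fixed IDs stand in for the active span context in this demo.
    handler.addFilter(TraceContextFilter(lambda: (0x4BF92F3577B34DA6A3CE929D0E0E4736, 0x00F067AA0BA902B7)))
    logger = logging.getLogger("payments-demo")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.info("payment processed")
    print(stream.getvalue(), end="")
```

Attaching the filter to the handler means third-party library logs that pass through it pick up the IDs too.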
💡Unify your observability data
Use OpenTelemetry Collector to export traces to Jaeger, metrics to Prometheus, and logs to Loki. Set up a Grafana dashboard that links metrics panels to trace exploration — this is the 'observability pyramid' in practice.
📊 Production Insight
Correlation is worthless if trace IDs aren't in logs from the start.
Instrument your logging layer early — retrofitting trace IDs into a million log lines is painful.
Rule: enforce trace_id presence in all structured logs via pipeline linting.
🎯 Key Takeaway
Traces show where; logs show what; metrics show when.
Correlate them via trace IDs in log output and metric exemplars.
Without correlation, you're debugging blind.

🎯 Key Takeaways

  • A trace = complete request journey across services. A span = one operation within a service.
  • OpenTelemetry is the vendor-neutral instrumentation API — use it to avoid lock-in.
  • FastAPIInstrumentor auto-instruments all routes — you only need manual spans for important sub-operations.
  • Trace context (trace ID, span ID) propagates via HTTP headers (traceparent) between services.
  • Use span attributes to add business context: order.id, user.id — makes filtering useful.
  • Sampling is a storage vs accuracy trade-off: always sample errors 100%, tune per-operation rates.
  • Clock skew breaks trace timelines — NTP synchronisation is mandatory in distributed systems.
  • Correlate trace IDs with logs and metrics for full observability — or you're still flying blind.

⚠ Common Mistakes to Avoid

    Using auto-instrumentation only and assuming all spans are captured
    Symptom

    Critical latency inside a database call or cache lookup is invisible because no manual span wraps it. The trace shows the HTTP handler but not the expensive operation inside.

    Fix

    Add manual spans with tracer.start_as_current_span around every external I/O, lock acquisition, or business logic block that can take >10ms.

    Running Jaeger all-in-one in production without persistent storage
    Symptom

    Traces disappear after container restart. Incident post-mortems have no traces because they were lost during the reboot.

    Fix

    Deploy Jaeger with a persistent storage backend (Elasticsearch or Cassandra, optionally with Kafka as an intermediate buffer), configured via environment variables such as SPAN_STORAGE_TYPE=elasticsearch and the matching connection endpoints.

    Setting a global sampling rate without considering operation criticality
    Symptom

    Payment failures or latency spikes are rarely captured because the sampling rate is 1% and the incident happens in the 99% unsampled requests.

    Fix

    Use Jaeger's remote sampling configuration to set higher rates for critical endpoints (payment, auth) and lower rates for health checks and static content.

    Not injecting trace context into asynchronous or batch job spans
    Symptom

    An API request kicks off a background job; the job's spans have a different trace ID, so you can't link the request to the job execution.

    Fix

    Pass the trace context via message headers (Kafka, RabbitMQ) or database column when enqueuing jobs. On the worker side, extract the context before starting the worker span.

    Forgetting to handle clock skew across hosts
    Symptom

    Spans in the Jaeger UI appear with negative duration or overlapping incorrectly. Root cause analysis becomes unreliable.

    Fix

    Run NTP daemon on all servers. Monitor clock offset in your observability dashboards. Alert if offset exceeds 10ms.

Interview Questions on This Topic

  • Q: Explain how distributed tracing works at the protocol level. How does a span get linked to its parent across service boundaries? (Senior)
    Each span carries a trace ID, span ID, and parent span ID. When service A calls service B, OpenTelemetry injects a traceparent HTTP header with the current trace ID and span ID. Service B extracts that header and creates a new span with the same trace ID and the received span ID as parent. This creates a directed acyclic graph of spans. The header format is: 00-{trace_id}-{span_id}-{trace_flags} (W3C Trace Context).
  • Q: Compare head-based and tail-based sampling. When would you use each? (Senior)
    Head-based sampling decides at the start of a request: a deterministic or probabilistic function determines if the trace is kept. It's cheap but can miss slow or error traces because those are determined after the sample decision. Tail-based sampling buffers all traces and decides after completion, keeping those exceeding latency thresholds or containing errors. Use head-based when storage is tight and rare events are acceptable to miss; use tail-based when you need full visibility into the long tail and have buffer capacity.
  • Q: What is the role of the OpenTelemetry Collector? How does it differ from sending spans directly to Jaeger? (Mid-level)
    The OpenTelemetry Collector acts as a middleware to receive, process, and export telemetry data. It provides features like batching, tail-based sampling, filtering, and retries that are not available when sending spans directly to Jaeger. Jaeger supports OTLP natively, but the Collector adds resilience and scaling. In production, you usually send OTLP to the Collector, which then forwards to Jaeger backend.
  • Q: How would you debug a distributed trace that appears incomplete in Jaeger UI? (Senior)
    Checklist: (1) Verify context propagation: does each service receive and forward the traceparent header? Use curl -v to inspect. (2) Check sampling configuration: maybe downstream services have a lower sampling rate and dropped the trace. (3) Look for exporter errors in service logs (connection refused, timeout). (4) Confirm network connectivity from each service to the Jaeger Collector on port 4317/4318. (5) If spans appear but out of order, check clock skew with NTP.

Frequently Asked Questions

What is the difference between distributed tracing, logging, and metrics?

Logs are time-stamped text events from a single service. Metrics are aggregated numerical measurements (request rate, error rate, latency percentiles). Distributed traces show the causal chain of events across services for a single request. Observability requires all three: metrics to know something is wrong, logs to see what happened, traces to find where.

What is sampling in distributed tracing?

Recording every trace at high traffic volumes is expensive. Sampling records only a fraction of traces — head-based sampling decides at the start of a request (simple, misses tail latency). Tail-based sampling decides after the trace completes, keeping slow or error traces — more accurate but requires buffering. Jaeger supports both. Common approach: sample 1-5% of normal traces, always sample errors.

Can I use Jaeger without OpenTelemetry?

Historically, yes: Jaeger shipped its own client libraries. They have since been deprecated in favour of OpenTelemetry, which is the industry standard and lets you switch backends (e.g., to Zipkin or Datadog) without changing instrumentation. For new code, instrument with OpenTelemetry and export via OTLP.

How do I persist Jaeger traces?

Jaeger supports multiple storage backends: Elasticsearch, Cassandra, and Kafka (as intermediate). In production, set SPAN_STORAGE_TYPE=elasticsearch and configure ES connection. The all-in-one image uses in-memory storage — data is lost on restart.

What is the overhead of enabling distributed tracing?

Depends on sampling rate and instrumentation depth. Auto-instrumentation adds <1ms per HTTP request. Manual spans add a few microseconds each (span creation, attribute setting). The larger overhead is network: exporting spans requires a TCP connection to the collector. Use batching (BatchSpanProcessor) to amortise cost. At 1% sampling, overhead is negligible.

Naren Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.
