Senior 4 min · March 17, 2026
Distributed Tracing with Jaeger

Jaeger Missing Spans — Async Context Propagation Fix

Kafka consumers showing separate trace IDs? Raw client libraries skip traceparent headers.

Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Lessons pulled from things that broke in production.

Production
production tested
June 10, 2026
last updated
1,554
articles · all by Naren
Quick Answer
  • A trace tracks one request across multiple services; a span is a single operation within a service.
  • Jaeger is an open-source CNCF-graduated tracing backend that stores, indexes, and visualises traces.
  • Instrument code with OpenTelemetry, export via OTLP (port 4317/4318), and view traces at Jaeger UI.
  • Context propagation via W3C traceparent header is critical for cross-service visibility.
  • Sampling (head-based or tail-based) trades accuracy for storage cost — always sample errors 100%.
✦ Definition~90s read
What is Distributed Tracing with Jaeger?

Distributed tracing with Jaeger lets you follow a single request as it hops across microservices, databases, and queues. Each unit of work is a span; the spans that share a trace ID form a trace. The core mechanism is context propagation: trace metadata (trace ID, span ID) must travel across every service boundary via HTTP headers, message envelopes, or gRPC metadata. Instrumentation libraries do this for you where they exist; everywhere else you have to pass it yourself.

When that handoff fails, spans become orphans: they exist in Jaeger but belong to no parent, or the entire trace breaks into disconnected fragments. This is the single most common cause of missing spans in production, not sampling or network issues. OpenTelemetry is the modern standard for instrumentation, replacing Jaeger's native clients.

It handles context propagation automatically for popular frameworks like FastAPI, but only if you configure the propagator correctly — typically W3C TraceContext or Jaeger's own format. Running Jaeger locally with Docker is trivial (docker run -p 16686:16686 jaegertracing/all-in-one:latest), but production setups require careful sampling strategy decisions: head-based sampling (e.g., probabilistic 1%) is simple but can miss rare errors; tail-based sampling (e.g., Jaeger's own or OpenTelemetry Collector's) preserves complete traces for problematic requests.

When spans go missing, first check that every service uses the same propagator, then verify that work handed to thread pools, background workers, or Celery explicitly carries the trace context. In Python, OpenTelemetry keeps the current span in contextvars, which follow await calls and asyncio tasks but do not cross into ThreadPoolExecutor workers, other processes, or message queues on their own. The fix is almost always a missing with tracer.start_as_current_span() or a forgotten inject() call in a custom middleware.

Plain-English First

Imagine tracking a package through multiple delivery trucks, but each driver starts a new tracking number instead of passing along the original one. The customer sees several disconnected packages instead of one continuous journey. Fixing context propagation means making sure every driver writes down the original tracking number before handing off the box.

Missing spans in Jaeger are almost always caused by broken context propagation across async boundaries, not sampling or network failures. When trace IDs fail to cross thread pools, message queues, or background workers, traces fragment into orphaned spans that hide critical latency bottlenecks. This article shows how to diagnose and fix async context propagation using OpenTelemetry's propagator API.

How Distributed Tracing with Jaeger Works — and Why Spans Go Missing

Distributed tracing with Jaeger tracks a single request as it propagates through microservices by assigning a unique trace ID and attaching spans — each span representing a unit of work with start time, duration, and metadata. The core mechanic is context propagation: the trace ID must be passed across service boundaries via HTTP headers (or message queue metadata) so that spans from different services can be stitched into one trace. Without correct propagation, spans become orphaned and the trace is incomplete.

Jaeger stores traces in a storage backend (Cassandra, Elasticsearch, or OpenSearch, optionally with Kafka as a buffer in front of it) and exposes them via a UI. In practice, each service must extract the incoming trace context, create child spans, and inject the context into outgoing requests. This is typically done with OpenTelemetry SDKs, which handle serialization and deserialization of trace context. The key property that matters: if any service in the chain fails to propagate context (because of async boundaries, thread pool switches, or manual HTTP clients), the trace breaks at that point.
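To make "the trace breaks at that point" concrete, here is a minimal sketch of how a broken chain looks in exported span data. It uses plain dictionaries rather than Jaeger's storage model, and the field names (trace_id, span_id, parent_id, name) and the function summarize_traces are illustrative assumptions, not a Jaeger API.

```python
from collections import defaultdict


def summarize_traces(spans):
    """Group spans by trace ID and list each trace's roots and orphans."""
    by_trace = defaultdict(list)
    for span in spans:
        by_trace[span["trace_id"]].append(span)

    summary = {}
    for trace_id, trace_spans in by_trace.items():
        known_ids = {s["span_id"] for s in trace_spans}
        summary[trace_id] = {
            # A root has no parent at all.
            "roots": [s["name"] for s in trace_spans if s["parent_id"] is None],
            # An orphan names a parent that never arrived in this trace.
            "orphans": [
                s["name"]
                for s in trace_spans
                if s["parent_id"] is not None and s["parent_id"] not in known_ids
            ],
        }
    return summary


# Hypothetical spans for ONE checkout request.
spans = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "name": "POST /checkout"},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "name": "validate-card"},
    # Parent "c" was never exported, so this span is an orphan.
    {"trace_id": "t1", "span_id": "d", "parent_id": "c", "name": "charge-card"},
    # Context was lost before the consumer ran, so it started a new trace.
    {"trace_id": "t2", "span_id": "e", "parent_id": None, "name": "consume-order-event"},
]

summary = summarize_traces(spans)
for trace_id, info in summary.items():
    print(trace_id, info)
```

Two trace IDs for one request means propagation failed somewhere; an orphan inside a trace usually means a span was created but never exported.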

Use Jaeger when you need to debug latency spikes, identify service dependencies, or trace errors across more than three services. In production systems handling thousands of requests per second, a single missing span can hide a 500ms bottleneck in a downstream service. Without tracing, you're debugging blind — logs give you local state, but only traces show the full causal chain.

Async Context Is Not Automatic
Java's CompletableFuture, ExecutorService, and reactive streams do not propagate trace context by default — you must manually pass it or use OpenTelemetry's context propagation wrappers.
Production Insight
A payment service using ExecutorService for parallel validation calls lost trace context on the thread pool boundary, causing all downstream spans to appear as root spans.
Symptom: Jaeger UI shows multiple disconnected traces for a single checkout request, each with only one or two spans.
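Rule: Always wrap thread pools with OpenTelemetry's Context.taskWrapping(executor), or individual tasks with Context.current().wrap(task), so context flows across threads. @WithSpan creates the span on an async method, but the context still has to reach the worker thread.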
Key Takeaway
Distributed tracing is only as reliable as your context propagation — one missed header breaks the entire trace.
Async boundaries (thread pools, reactive streams, message queues) are the most common source of missing spans in Java.
Always validate trace continuity in staging with a known bad path before relying on traces in production debugging.
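The same thread-pool boundary exists in Python. OpenTelemetry's Python SDK keeps the active context in contextvars by default, so a standard-library sketch is enough to show the mechanism. The variable current_trace_id and the helper submit_with_context are hypothetical stand-ins, not OpenTelemetry names.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the tracing SDK's "current trace" slot (hypothetical name).
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


def read_trace_id():
    """Runs inside a worker thread and reports which trace it can see."""
    return current_trace_id.get()


def submit_with_context(executor, fn, *args):
    """Snapshot the caller's context and run fn inside that snapshot."""
    ctx = contextvars.copy_context()
    return executor.submit(ctx.run, fn, *args)


current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")

with ThreadPoolExecutor(max_workers=1) as pool:
    # Plain submit: the worker thread does not see the caller's context.
    lost = pool.submit(read_trace_id).result()
    # Wrapped submit: the worker runs inside a copy of the caller's context.
    kept = submit_with_context(pool, read_trace_id).result()

print("plain submit:", lost)
print("wrapped submit:", kept)
```

With the real SDK, the equivalent move is to capture opentelemetry.context.get_current() before submitting and attach()/detach() it inside the worker; the sketch only demonstrates why that step is needed.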
[Figure: Jaeger Missing Spans: Async Context Propagation Fix. Flow from instrumentation to trace export with context propagation: FastAPI instrumentation (OpenTelemetry auto-instrumentation), then span creation (root span per request, child spans for operations), then context propagation (async context via the OpenTelemetry API), then span export (gRPC or HTTP exporter to Jaeger). Warning: missing async context propagation breaks trace continuity; use the OpenTelemetry context API to pass spans across async boundaries.]

Instrumenting a FastAPI App with OpenTelemetry

ExamplePYTHON
# pip install opentelemetry-distro opentelemetry-exporter-otlp
# pip install opentelemetry-instrumentation-fastapi

from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

# Configure tracer; service.name is the name you search by in the Jaeger UI
provider = TracerProvider(resource=Resource.create({'service.name': 'order-service'}))
exporter = OTLPSpanExporter(endpoint='http://jaeger:4317')  # Jaeger OTLP gRPC endpoint
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

app = FastAPI()
FastAPIInstrumentor.instrument_app(app)  # auto-instruments all routes

tracer = trace.get_tracer(__name__)

# `db` and `user_service` are placeholders for your own data-access and client objects

@app.get('/orders/{order_id}')
async def get_order(order_id: int):
    with tracer.start_as_current_span('fetch-order') as span:
        span.set_attribute('order.id', order_id)

        # Manual span for a specific operation
        with tracer.start_as_current_span('db-query'):
            order = await db.get_order(order_id)

        with tracer.start_as_current_span('enrich-order'):
            user = await user_service.get_user(order.user_id)  # cross-service call

        return {'order': order, 'user': user}
Output
# Traces exported to Jaeger — visible in Jaeger UI at http://jaeger:16686
Production Insight
Auto-instrumentation only covers framework-level spans.
Manual spans for I/O, locks, or business logic are where most latency tends to hide.
Rule: if you rely only on auto-instrumentation, expect to miss root causes that sit inside a handler.
Key Takeaway
FastAPIInstrumentor handles HTTP routes automatically.
Manual spans give you visibility into the operations that matter most.
Always wrap external calls and critical logic in custom spans.

Running Jaeger with Docker

ExampleBASH
# Run Jaeger all-in-one (development setup)
docker run -d \
  --name jaeger \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  jaegertracing/all-in-one:latest

# Ports:
# 16686  Jaeger UI
# 4317   OTLP gRPC receiver
# 4318   OTLP HTTP receiver

# Open Jaeger UI: http://localhost:16686
# Search by service name → see all traces
# Click a trace → see full span timeline
# Click a span → see attributes, events, errors
Output
# Jaeger UI at http://localhost:16686
Production Insight
The all-in-one image bundles storage, collector, and query into one process.
It's fine for dev but loses all traces on restart — use Elasticsearch or Cassandra in prod.
Rule: never use all-in-one for production; you won't have trace persistence.
Key Takeaway
Docker run gets you started in 30 seconds.
Port 16686 = UI, 4317 = OTLP gRPC, 4318 = OTLP HTTP.
Lossy dev mode — plan for persistent storage before you go live.

Understanding Spans, Traces, and Context Propagation

A span is a named, timed operation that carries a span ID, trace ID, parent span ID, and attributes. The entire set of spans linked by a common trace ID forms a trace. Context propagation is what connects spans across service boundaries — it passes the trace ID and parent span ID via HTTP headers (W3C Trace Context: traceparent and tracestate). Without propagation, each service creates a separate trace, and you lose the end-to-end view. Propagation is automatic when using OpenTelemetry instrumentation libraries (they inject headers on outgoing requests). If you use raw HTTP clients or message queues, you must manually inject and extract the context.

propagation_example.pyPYTHON
import requests
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer(__name__)

# Sending service: copy the current trace context into the outgoing headers
def call_payment_service():
    with tracer.start_as_current_span('outgoing-call'):
        headers = {}
        inject(headers)  # injects traceparent from current span
        return requests.get('http://payment-service/process', headers=headers)

# Receiving service: in middleware or at route entry, extract context.
# `request` stands for your framework's incoming request object.
def handle_process_payment(request):
    ctx = extract(request.headers)
    with tracer.start_as_current_span('process-payment', context=ctx):
        # this span is now a child of the sending service's span
        pass
Manual propagation is a common trap
If you're using HTTP clients not covered by auto-instrumentation (e.g., raw requests or httpx without integration), you must inject and extract headers manually. The W3C traceparent header is the standard — don't invent your own.
Production Insight
Missing context propagation is the #1 reason traces break at service boundaries.
Check for the traceparent header in your incoming requests to verify propagation.
Rule: if you see a new trace ID after a cross-service call, propagation is broken.
Key Takeaway
Spans get connected via trace ID propagated across services.
If your trace is broken into pieces, context propagation is the culprit.
Inject headers on outgoing calls; extract on incoming — always verify.
Propagation method decision
If: Using an auto-instrumented library (FastAPIInstrumentor, requests integration)
Use: Propagation is automatic; no extra code needed.
If: Using a custom HTTP client or non-HTTP transport (Redis, Kafka)
Use: You must manually inject/extract context using the OpenTelemetry API.
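When you verify propagation by hand, it helps to know what a valid traceparent looks like. The sketch below parses the version-00 layout only (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags); the function name parse_traceparent is hypothetical, and a real service would rely on the OpenTelemetry propagator rather than this helper.

```python
import re

# version - trace id (32 hex) - parent span id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(value):
    """Return the header's fields as a dict, or None if it is malformed."""
    match = TRACEPARENT_RE.fullmatch(value.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # All-zero IDs and version "ff" are invalid per W3C Trace Context.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


good = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
bad = parse_traceparent("4bf92f3577b34da6a3ce929d0e0e4736")

print(good)
print(bad)
```

If the header a downstream service logs does not parse, or its trace ID differs from the caller's, the break is at that hop.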

Sampling Strategies in Production

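Recording every trace at production scale is expensive in both storage and network bandwidth. Sampling decides which traces to keep. Head-based sampling makes the decision at the start of a request (e.g., keep 1% of all traces). It's simple but can miss rare high-latency events. Tail-based sampling buffers traces and decides after they complete, keeping those that exceed a latency threshold or contain errors. Head-based rates can be served by Jaeger's remote sampling; tail-based sampling runs in the collector tier (the OpenTelemetry Collector's tail sampling processor, which Jaeger v2 builds on). A common hybrid approach: sample 1 to 5% of normal requests and always keep requests with HTTP 5xx or custom error attributes. The "always keep errors" half needs the tail-based stage, because the outcome is unknown when a head decision is made.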

Setting the right sampling rate is a trade-off. Too low (0.1%) and you'll miss most issues; too high (100%) for high-throughput services will overwhelm storage. Start at 1% and adjust based on storage budget and trace usefulness.
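A head-based decision can be made deterministic by deriving it from the trace ID, so every service that sees the same ID reaches the same answer. The sketch below is in the spirit of trace-ID-ratio samplers but is not the SDK's code; keep_trace and LOW_64_BITS are hypothetical names.

```python
import random

LOW_64_BITS = (1 << 64) - 1


def keep_trace(trace_id: int, rate: float) -> bool:
    """Keep the trace if the low 64 bits of its ID fall below rate * 2^64."""
    bound = round(rate * (1 << 64))
    return (trace_id & LOW_64_BITS) < bound


# Simulate 100,000 random 128-bit trace IDs (seeded for repeatability).
rng = random.Random(42)
trace_ids = [rng.getrandbits(128) for _ in range(100_000)]
kept = sum(keep_trace(t, 0.01) for t in trace_ids)

print(f"kept {kept} of {len(trace_ids)} traces at a 1% rate")
```

The kept count lands near 1% of the total. Whether that 1% contains the one slow request you care about is pure chance, which is the limitation the tail-based approach addresses.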

jaeger-sampling-config.jsonJSON
{
  "service_strategies": [
    {
      "service": "order-service",
      "type": "probabilistic",
      "param": 0.005,
      "operation_strategies": [
        {"operation": "/orders/{id}", "type": "probabilistic", "param": 0.01}
      ]
    },
    {
      "service": "payment-service",
      "type": "probabilistic",
      "param": 0.05
    }
  ]
}
Watch out for sampling latency bias
Head-based sampling cannot favor slow requests, because it decides before the request completes. At a 1% rate, 99% of slow responses are dropped along with the fast ones. Tail-based sampling solves this but requires buffering and adds memory overhead.
Production Insight
Setting sampling per-operation is key: payment services need higher rate than health checks.
Use remote sampling configuration (Jaeger Collector) to change rates without redeploying.
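Rule: always keep error traces at 100% regardless of the overall rate. A head-based sampler (including 'const') decides before the error happens, so this needs a tail-based policy such as an error-status rule in the collector.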
Key Takeaway
Head-based sampling is simple but can miss slow requests.
Tail-based sampling captures the long tail but costs more.
Best practice: 100% for errors, 1–5% for normal traffic per service tier.
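The tail-based decision described in this section can be sketched as a single function applied once a trace is complete. The span fields (start_ms, duration_ms, error), the 500 ms threshold, and the 1% baseline are illustrative assumptions; a real deployment would express the same policy in collector configuration, not application code.

```python
import random


def keep_completed_trace(spans, latency_threshold_ms=500, baseline_rate=0.01,
                         rng=random.random):
    """Decide whether to store a finished trace (tail-based policy sketch)."""
    # Policy 1: any span flagged as an error keeps the whole trace.
    if any(span.get("error") for span in spans):
        return True
    # Policy 2: keep traces whose end-to-end duration exceeds the threshold.
    start = min(span["start_ms"] for span in spans)
    end = max(span["start_ms"] + span["duration_ms"] for span in spans)
    if end - start > latency_threshold_ms:
        return True
    # Policy 3: otherwise keep a small random share of normal traffic.
    return rng() < baseline_rate


healthy = [{"start_ms": 0, "duration_ms": 40}, {"start_ms": 5, "duration_ms": 20}]
failed = [{"start_ms": 0, "duration_ms": 40, "error": True}]
slow = [{"start_ms": 0, "duration_ms": 30}, {"start_ms": 20, "duration_ms": 900}]

print("failed kept:", keep_completed_trace(failed))
print("slow kept:", keep_completed_trace(slow))
```

The cost is visible in the signature: the function needs every span of the trace, so something has to buffer spans until the trace is finished.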

Troubleshooting Missing Spans and Broken Traces

When a trace doesn't appear in Jaeger UI, or appears incomplete, the root cause is almost always one of: (1) context not propagated, (2) spans not exported, (3) sampling dropped the trace, (4) clock skew between service hosts. Use the following systematic checks. First, confirm you're hitting the Jaeger endpoint by looking at application logs for OTLP export errors. Second, check the trace ID uniformity — if each service generates its own trace ID, propagation is missing. Third, verify that the "Trace" view shows all expected spans — missing spans may indicate a failing exporter or network issue. Fourth, if spans from different services appear with wrong timing, check NTP synchronisation: Jaeger relies on span timestamps for ordering.

Think of traces as breadcrumbs
  • Each service drops its breadcrumb (span) and passes the trace ID onward.
  • If the breadcrumb is missing (span not created), the chain breaks.
  • If the trace ID is not passed (propagation failure), the chain splits into separate chains.
  • Your debugging goal: find the first service where the breadcrumb pattern changes.
Production Insight
Clock skew of even 100ms can cause spans to appear out of order in the UI.
Run NTP on all nodes and monitor drift — Jaeger has a built-in clock skew adjustment but it's not perfect.
Rule: if a trace's spans jump backwards in time, check NTP first.
Key Takeaway
Missing traces = propagation fail or sampling drop.
Incomplete traces = missing spans (exporter error or code bug).
Out-of-order spans = clock skew — NTP is not optional.
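One quick heuristic for spotting clock skew in exported data is to look for child spans that start before their parent. The sketch below assumes simple span dictionaries with span_id, parent_id, name, and start_ms fields; skew_suspects is a hypothetical helper, not part of Jaeger.

```python
def skew_suspects(spans, tolerance_ms=0):
    """Return (child name, ms the child starts before its parent) pairs."""
    by_id = {span["span_id"]: span for span in spans}
    suspects = []
    for span in spans:
        parent = by_id.get(span["parent_id"])
        if parent is None:
            continue  # root span, or parent not present in this trace
        lead_ms = parent["start_ms"] - span["start_ms"]
        if lead_ms > tolerance_ms:
            suspects.append((span["name"], lead_ms))
    return suspects


# Hypothetical trace: the payment host's clock runs 120 ms behind.
trace_spans = [
    {"span_id": "a", "parent_id": None, "name": "POST /checkout", "start_ms": 1000},
    {"span_id": "b", "parent_id": "a", "name": "reserve-stock", "start_ms": 1010},
    {"span_id": "c", "parent_id": "a", "name": "charge-card", "start_ms": 880},
]

print(skew_suspects(trace_spans))
```

A child cannot really begin before the call that caused it, so any hit points at the host that produced the child span.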

Integrating Traces with Logs and Metrics

Distributed tracing alone doesn't replace logs or metrics; it complements them. The true power emerges when you correlate trace IDs with log entries and metric events. OpenTelemetry enables this via trace_id injection into log records (MDC in Java, structlog in Python). Prometheus can attach trace IDs to individual samples as exemplars, which links an alerting metric to a concrete trace without putting high-cardinality IDs into labels. Jaeger's UI allows you to drill from a trace to related logs if you configure the log integration.

A common production pattern: when a latency alert fires, grab the trace ID from the affected request, open Jaeger to see the breakdown, then jump to the logs from that span ID to inspect the exact error message.

log_correlation.pyPYTHON
import structlog
from opentelemetry import trace

span = trace.get_current_span()
trace_id = format(span.get_span_context().trace_id, '032x')
span_id = format(span.get_span_context().span_id, '016x')

# Inject trace context into log
logger = structlog.get_logger()
logger.info("payment processed", trace_id=trace_id, span_id=span_id, order_id=123)
Unify your observability data
Use OpenTelemetry Collector to export traces to Jaeger, metrics to Prometheus, and logs to Loki. Set up a Grafana dashboard that links metrics panels to trace exploration — this is the 'observability pyramid' in practice.
Production Insight
Correlation is worthless if trace IDs aren't in logs from the start.
Instrument your logging layer early — retrofitting trace IDs into a million log lines is painful.
Rule: enforce trace_id presence in all structured logs via pipeline linting.
Key Takeaway
Traces show where; logs show what; metrics show when.
Correlate them via trace IDs in log output and metric exemplars.
Without correlation, you're debugging blind.
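If you are on the standard logging module rather than structlog, one way to get trace IDs into every line is a logging filter attached to the handler. In this sketch the IDs come from a stand-in function, fake_current_ids; in a real service that callable would read them from the current span, as the structlog example does. TraceContextFilter is a hypothetical name.

```python
import io
import logging


class TraceContextFilter(logging.Filter):
    """Adds trace_id and span_id attributes to every log record."""

    def __init__(self, get_ids):
        super().__init__()
        self._get_ids = get_ids  # callable returning (trace_id, span_id)

    def filter(self, record):
        record.trace_id, record.span_id = self._get_ids()
        return True  # never drop the record, only enrich it


def fake_current_ids():
    # Stand-in for reading the IDs from the active span.
    return ("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(
    "%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"
))
handler.addFilter(TraceContextFilter(fake_current_ids))

logger = logging.getLogger("payments")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.info("payment processed")
log_line = stream.getvalue().strip()
print(log_line)
```

Putting the filter on the handler means application code keeps calling logger.info() as before, which makes it easier to retrofit.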

Why Your Traces Are Silent: The gRPC vs HTTP Exporter Trap

Most beginners copy-paste a Jaeger exporter configuration and wonder why their traces never show up. The culprit is almost always a protocol and port mismatch. Besides the OTLP receivers (4317 for gRPC, 4318 for HTTP), Jaeger all-in-one exposes legacy collector ports: 14250 for Jaeger gRPC, 14268 for Thrift over HTTP, and 9411 for Zipkin when that receiver is enabled. If you use the gRPC JaegerExporter but nothing is reachable on 14250, your spans vanish into the void. I've debugged this in three separate microservice migrations. The fix is brutally simple: match your exporter to a port Jaeger actually publishes. For Thrift over HTTP, point ThriftExporter at the collector on 14268. For gRPC, make sure 14250 is published by the container. Don't assume both endpoints are active: check your docker logs. The dedicated Jaeger exporters shown here are deprecated in OpenTelemetry in favor of OTLP, where the same rule applies: gRPC goes to 4317, HTTP to 4318. This mismatch wastes hours for teams that could be shipping features.

exporter_check.pyPYTHON
# Legacy Jaeger exporters (opentelemetry-exporter-jaeger), deprecated in favor of OTLP
from opentelemetry.exporter.jaeger.thrift import JaegerExporter as ThriftExporter
from opentelemetry.exporter.jaeger.proto.grpc import JaegerExporter as GrpcExporter

# Thrift over HTTP to the collector (port 14268)
http_exporter = ThriftExporter(
    collector_endpoint="http://localhost:14268/api/traces?format=jaeger.thrift",
)

# gRPC to the collector (port 14250); the endpoint is host:port with no URL scheme
grpc_exporter = GrpcExporter(
    collector_endpoint="localhost:14250",
    insecure=True,
)

print("HTTP Thrift exporter configured on port 14268")
print("gRPC exporter configured on port 14250")
Output
HTTP Thrift exporter configured on port 14268
gRPC exporter configured on port 14250
Production Trap:
Kubernetes often exposes only one port via ingress. If you expose 14268 but ship gRPC traces, you'll get 302 redirects or timeouts. Always validate connectivity with curl -v telnet://jaeger:14250 before blaming your code.
Key Takeaway
Match your Jaeger exporter protocol to the open collector port—gRPC and HTTP are not interchangeable.
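The same mismatch is easy to make with OTLP, and it can be caught before deployment with a small configuration check. This is a sketch under stated assumptions: it only knows the default OTLP ports (4317 for gRPC, 4318 for HTTP), the protocol strings mirror the values accepted by OTEL_EXPORTER_OTLP_PROTOCOL, and check_otlp_endpoint is a hypothetical helper.

```python
from urllib.parse import urlparse

# Default OTLP ports and the protocol each one expects.
EXPECTED_PROTOCOL = {4317: "grpc", 4318: "http/protobuf"}


def check_otlp_endpoint(endpoint, protocol):
    """Return a list of problems; an empty list means the pair looks consistent."""
    # Allow bare "host:port" endpoints as well as full URLs.
    url = endpoint if "://" in endpoint else "//" + endpoint
    port = urlparse(url).port
    if port is None:
        return ["endpoint has no explicit port"]
    expected = EXPECTED_PROTOCOL.get(port)
    if expected is None:
        return [f"port {port} is not a default OTLP port"]
    if expected != protocol:
        return [f"port {port} expects {expected}, exporter uses {protocol}"]
    return []


print(check_otlp_endpoint("http://jaeger:4317", "grpc"))
print(check_otlp_endpoint("http://jaeger:4318", "grpc"))
```

A non-default port is not necessarily wrong (an ingress may remap it), so treat that message as a prompt to double-check rather than an error.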

Sampling in Production: Don't Bankrupt Your Storage on Every Request

In development, trace every request. In production, that costs real money: storage, network, and CPU. Smart teams use head-based sampling to keep the firehose manageable. A probabilistic sampler with a 5-10% rate catches most anomalies without exploding your budget. But here's the trick: combine it with per-endpoint rates. Your health-check endpoint doesn't need tracing at all. Your payment service deserves higher sampling. I've seen a startup burn $2,000/month on Jaeger storage because they sampled 100% on a high-traffic API. Set OTEL_TRACES_SAMPLER=parentbased_traceidratio and OTEL_TRACES_SAMPLER_ARG=0.1 in your OpenTelemetry config. For critical flows, add a tail-based rule that always keeps traces with errors, since a head-based sampler decides before any error occurs. Your SRE team will thank you.

sampling_config.pyPYTHON
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import (
    Decision, Sampler, SamplingResult, TraceIdRatioBased,
)

class HealthCheckAwareSampler(Sampler):
    def __init__(self, ratio=0.2):
        self._ratio_sampler = TraceIdRatioBased(ratio)

    def should_sample(self, parent_context, trace_id, name, kind=None,
                      attributes=None, links=None, trace_state=None):
        # Skip tracing for health checks (server span names look like "GET /health")
        if name.endswith("/health"):
            return SamplingResult(Decision.DROP)
        # Sample 20% of everything else, decided from the trace ID
        return self._ratio_sampler.should_sample(
            parent_context, trace_id, name, kind, attributes, links, trace_state
        )

    def get_description(self):
        return "HealthCheckAwareSampler"

trace.set_tracer_provider(
    TracerProvider(sampler=HealthCheckAwareSampler())
)

print("Custom sampler applied: health checks dropped, 20% sampling rate for others")
Output
Custom sampler applied: health checks dropped, 20% sampling rate for others
Pro Tip:
Jaeger v2 supports tail-based sampling through the OpenTelemetry Collector's tail sampling processor. For ultra-low latency apps, use head-based sampling and store only sampled spans. You can always re-analyze hot paths with manual instrumentation.
Key Takeaway
Sample 5-10% in production—save money, keep signal. Health checks get 0%.
● Production incident · POST-MORTEM · severity: high

Missing Trace Context on Async Event Bus Caused False 'Healthy' Signals

Symptom
Traces for order processing showed only a single span (the HTTP handler) with no downstream spans from the Kafka consumer. The consumer's spans had different trace IDs, so they appeared as separate traces in Jaeger UI.
Assumption
Team assumed OpenTelemetry auto-instrumentation for Kafka would propagate context automatically. It did not — the Kafka integration requires manual setup for header injection.
Root cause
OpenTelemetry's Kafka instrumentation propagates headers only when the instrumentation package that matches the client library is installed and enabled. The team was using raw aiokafka without any instrumentation layer, so no traceparent header was written to the messages.
Fix
Enable an OpenTelemetry instrumentation for the Kafka client in use, or manually inject/extract context using opentelemetry.propagate.inject() when producing and extract() when consuming.
Key lesson
  • Auto-instrumentation is not magic — always verify context propagation at each boundary.
  • For any async or message-based communication, explicitly inject trace context into messages.
  • When testing, generate traces from end to end and check in Jaeger UI that a single trace spans all services.
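For the manual route, the only Kafka-specific work is converting between the string dictionary that inject() and extract() operate on and the message headers. The sketch below assumes the client represents headers as a list of (str, bytes) tuples, as aiokafka and kafka-python do; to_kafka_headers and from_kafka_headers are hypothetical helpers, and the OpenTelemetry calls appear only in comments.

```python
def to_kafka_headers(carrier):
    """Convert a str->str carrier dict into Kafka-style (key, bytes) headers."""
    return [(key, value.encode("utf-8")) for key, value in carrier.items()]


def from_kafka_headers(headers):
    """Convert Kafka-style headers back into a str->str carrier dict."""
    return {key: value.decode("utf-8") for key, value in (headers or [])}


# Producer side. In a real producer: carrier = {}; inject(carrier)
carrier = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
headers = to_kafka_headers(carrier)
# ...send the message with headers=headers...

# Consumer side. In a real consumer: ctx = extract(restored), then start the
# consumer span with context=ctx so it joins the producer's trace.
restored = from_kafka_headers(headers)

print(headers)
print(restored)
```

Checking that a consumed message actually carries a traceparent header is the fastest test for this incident's failure mode.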
Production debug guide: systematic checks to find why your distributed trace is broken (5 entries)
Symptom · 01
Trace visible in Jaeger UI but shows only a single span
Fix
Check context propagation; verify that the service making downstream calls injects the traceparent header. Test with curl -v and inspect request headers.
Symptom · 02
No traces for a specific service appear at all
Fix
Verify the service can reach the Jaeger Collector endpoint. Check service logs for OTLP export errors. Confirm the port (4317 for gRPC, 4318 for HTTP) matches collector configuration.
Symptom · 03
Traces appear but with spans out of order or negative duration
Fix
Run ntpq -p on all nodes to check clock synchronisation. Spans with timestamps from different hosts can be misordered if clocks drift more than 100ms.
Symptom · 04
Only a small fraction of traces appear despite high request volume
Fix
Check sampling configuration. Confirm you're not using a head-based sampler with a rate too low for the traffic pattern. Head-based drops happen inside the SDK, so also check the collector's dropped or rejected span metrics to rule out loss after export.
Symptom · 05
Traces contain spans from service A but not service B, though B is called
Fix
B likely has a bug in its instrumentation or exporter. Test B in isolation: send a request that produces a trace and verify it appears. Common cause: missing OpenTelemetry package or wrong exporter endpoint.
★ Quick Trace Debug Cheat Sheet: five common trace issues and the exact commands to diagnose them
No traces in Jaeger UI
Immediate action
Ping the collector: curl http://jaeger-collector:4318
Commands
kubectl logs -l app=order-service --tail=20 | grep -i otlp
docker logs jaeger 2>&1 | grep -i error
Fix now
Restart the instrumented service after verifying endpoint env vars
Spans missing from a trace
Immediate action
Check the trace details in Jaeger UI for 'span count' vs expected
Commands
curl -H 'traceparent: 00-<trace_id>-<span_id>-01' http://target-service/health
Inspect application logs around the request time for exporter errors
Fix now
Manually inject traceparent in a test request to isolate the breaking service
Distorted span timings (negative or huge values)
Immediate action
Check system time on each host: date +%s
Commands
ntpq -p | grep -E '^[*+]'
timedatectl show --property=NTP --value
Fix now
Run sudo timedatectl set-ntp true and wait for sync
Sampling rate too aggressive (traces missing)
Immediate action
Check Jaeger remote sampling config endpoint
Commands
curl http://jaeger-collector:5778/sampling?service=order-service
Look for 'probabilistic_sampling: { sampling_rate: 0.01 }'
Fix now
Increase rate or switch to tail-based sampling for high-latency operations
Context not propagated to downstream service
Immediate action
Run test request with curl -v and inspect response headers
Commands
curl -v http://order-service/orders/1 2>&1 | grep -i trace
Check if the downstream service receives traceparent header in its logs
Fix now
Add manual inject/extract in the communication layer (HTTP client, message producer)

Key takeaways

1. A trace = complete request journey across services. A span = one operation within a service.
2. OpenTelemetry is the vendor-neutral instrumentation API: use it to avoid lock-in.
3. FastAPIInstrumentor auto-instruments all routes: you only need manual spans for important sub-operations.
4. Trace context (trace ID, span ID) propagates via HTTP headers (traceparent) between services.
5. Use span attributes to add business context: order.id, user.id. This makes filtering useful.
6. Sampling is a storage vs accuracy trade-off: always sample errors 100%, tune per-operation rates.
7. Clock skew breaks trace timelines: NTP synchronisation is mandatory in distributed systems.
8. Correlate trace IDs with logs and metrics for full observability, or you're still flying blind.

Common mistakes to avoid

5 patterns

Using auto-instrumentation only and assuming all spans are captured

Symptom
Critical latency inside a database call or cache lookup is invisible because no manual span wraps it. The trace shows the HTTP handler but not the expensive operation inside.
Fix
Add manual spans with tracer.start_as_current_span around every external I/O, lock acquisition, or business logic block that can take >10ms.

Running Jaeger all-in-one in production without persistent storage

Symptom
Traces disappear after container restart. Incident post-mortems have no traces because they were lost during the reboot.
Fix
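Deploy Jaeger with a persistent storage backend (Elasticsearch, OpenSearch, or Cassandra), configured via environment variables such as SPAN_STORAGE_TYPE=elasticsearch plus the matching connection endpoints. Kafka can sit in front of storage as a buffer, but it does not replace it.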

Setting a global sampling rate without considering operation criticality

Symptom
Payment failures or latency spikes are rarely captured because the sampling rate is 1% and the incident happens in the 99% unsampled requests.
Fix
Use Jaeger's remote sampling configuration to set higher rates for critical endpoints (payment, auth) and lower rates for health checks and static content.

Not injecting trace context into asynchronous or batch job spans

Symptom
An API request kicks off a background job; the job's spans have a different trace ID, so you can't link the request to the job execution.
Fix
Pass the trace context via message headers (Kafka, RabbitMQ) or database column when enqueuing jobs. On the worker side, extract the context before starting the worker span.

Forgetting to handle clock skew across hosts

Symptom
Spans in the Jaeger UI appear with negative duration or overlapping incorrectly. Root cause analysis becomes unreliable.
Fix
Run NTP daemon on all servers. Monitor clock offset in your observability dashboards. Alert if offset exceeds 10ms.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR
Explain how distributed tracing works at the protocol level. How does a ...
Q02SENIOR
Compare head-based and tail-based sampling. When would you use each?
Q03SENIOR
What is the role of the OpenTelemetry Collector? How does it differ from...
Q04SENIOR
How would you debug a distributed trace that appears incomplete in Jaege...
Q01 of 04SENIOR

Explain how distributed tracing works at the protocol level. How does a span get linked to its parent across service boundaries?

ANSWER
Each span carries a trace ID, span ID, and parent span ID. When service A calls service B, OpenTelemetry injects a traceparent HTTP header with the current trace ID and span ID. Service B extracts that header and creates a new span with the same trace ID and the received span ID as parent. This creates a directed acyclic graph of spans. The header format is: 00-{trace_id}-{span_id}-{trace_flags} (W3C Trace Context).
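The linking step in this answer can be sketched in a few lines: the receiving service keeps the trace ID, records the incoming span ID as its parent, and generates a fresh span ID for its own work. This illustrates the header arithmetic only; child_traceparent is a hypothetical function, and real services leave this to the OpenTelemetry propagator.

```python
import secrets


def child_traceparent(incoming):
    """Derive a child span's identity from an incoming traceparent header."""
    version, trace_id, parent_span_id, flags = incoming.split("-")
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    return {
        "trace_id": trace_id,              # unchanged across the whole request
        "parent_span_id": parent_span_id,  # the caller's span becomes the parent
        "span_id": new_span_id,
        # Header this service sends on its own outgoing calls.
        "outgoing_header": f"{version}-{trace_id}-{new_span_id}-{flags}",
    }


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
child = child_traceparent(incoming)

print(child["outgoing_header"])
```

Only the span ID segment changes from hop to hop, which is why a changed trace ID segment is a reliable sign that context was dropped rather than forwarded.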
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
What is the difference between distributed tracing, logging, and metrics?
02
What is sampling in distributed tracing?
03
Can I use Jaeger without OpenTelemetry?
04
How do I persist Jaeger traces?
05
What is the overhead of enabling distributed tracing?