N+1 Queries Hide in Low CPU — APM Metrics That Expose Them

App CPU at 30% while p99 latency hit 4 seconds.
⚙️ Intermediate — basic DevOps knowledge assumed
In this tutorial, you'll learn
  • APM gives you three telemetry types: metrics (aggregated numbers), traces (single request journeys), and logs (discrete events). All three are necessary for efficient debugging.
  • The RED method (Rate, Errors, Duration) is the standard for service-level monitoring. Track p99 latency, not average — averages hide outliers, and outliers are what users notice.
  • Distributed tracing shows you where time is spent across service boundaries. Without trace context propagation (W3C Trace-Context), each service starts a new trace and you lose the end-to-end view.
Quick Answer
  • APM gives you telemetry — metrics (numerical measurements), traces (request journeys), and logs (discrete events) — to find performance problems before users complain
  • Core components: RED method (Rate, Errors, Duration) for services; USE method (Utilisation, Saturation, Errors) for resources; distributed tracing for microservices
  • Performance cost: OpenTelemetry adds 2-5% CPU overhead when sampled at 1% (adjust sampling rate based on traffic)
  • Production trap: Alerting on CPU usage alone. A 90% CPU alert can fire while users are perfectly happy (the server is busy pre-computing a cache), and nothing fires when a slow database query keeps users waiting (low CPU, high latency)
  • Biggest mistake: No baseline for normal latency. You can't tell whether today's p99 is bad if you never recorded p50 and p99 while the system was healthy
🚨 START HERE

APM Debug Cheat Sheet

Fast diagnostics for production performance issues. Run these commands at the first sign of slowness.
🟠

Slow API endpoint — can't tell if it's code, database, or network

Immediate Action: Look at the distributed trace to break down latency by component
Commands
curl -s 'http://jaeger-query:16686/api/traces?service=my-api&limit=20' | jq '.data[].spans[] | {operationName, duration}'
kubectl exec jaeger-query -- wget -qO- 'http://localhost:16686/api/traces?service=api&limit=1' | jq '.data[0].spans[].duration'
Fix Now: If the database span dominates, add an index or reduce the query count. If the HTTP client span dominates, check external API latency. If the time is spent in the 'code' span, profile the application.
🟠

App server CPU at 100%, database CPU normal, latency spiking

Immediate Action: Profile the application to find CPU hot spots
Commands
asprof -d 30 -f /tmp/flamegraph.html <pid>   # async-profiler 3.x launcher; on 2.x use ./profiler.sh with the same flags
top -H -p <pid>
Fix Now: Look for busy loops, inefficient regex, JSON parsing in the hot path, or logging at DEBUG level. Offload CPU-heavy work to background threads.
🟠

Database CPU at 100%, app server CPU normal, slow queries in log

Immediate Action: Check for missing indexes, inefficient joins, or N+1 patterns
Commands
SELECT query, calls, total_exec_time FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;  -- PostgreSQL 13+ (the column is total_time on older versions); then run EXPLAIN ANALYZE on the slowest query
SHOW PROCESSLIST;  -- MySQL: lists currently running queries
Fix Now: Add missing indexes on WHERE/JOIN columns. Rewrite the query to avoid a full table scan. Monitor the connection pool to check for leaks.
🟡

Memory usage growing over time — suspected leak

Immediate Action: Capture a heap dump and analyse it for retained objects
Commands
jmap -dump:live,format=b,file=/tmp/heap.hprof <pid>
jcmd <pid> GC.heap_info
Fix Now: Open the heap dump in Eclipse MAT or VisualVM. Look for objects with high 'retained heap'. Check for event listeners not unregistered, caches without eviction, or thread locals not cleared.
🟠

Latency p99 spikes but p50 is fine — 1% of requests are very slow

Immediate Action: Check whether slow requests share a pattern: specific user, specific data, or specific time
Commands
grep 'duration=.*ms' /var/log/api.log | awk '{print $5}' | sort -n | tail -20
curl -sG 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))'
Fix Now: Add more trace sampling for slow requests. Check for large payloads, deep pagination, or data skew. Implement request timeouts to fail fast.
Production Incident

The Silent N+1 Query That Killed Black Friday

An e-commerce product page made 1 query for the product and 1 query per review. A page with 100 reviews made 101 database round trips. Page load time went from 120ms to 3.2 seconds at peak traffic.
Symptom: Product page latency p99 jumped from 200ms to 4 seconds within 10 minutes of peak traffic. CPU on app servers stayed below 30%. Database CPU spiked to 95%. No alerts fired because the CPU alert threshold was 80% and app server CPU never came close to it. Customers started abandoning carts. The team saw the latency spike in dashboards but couldn't find the root cause.
Assumption: The team assumed the product page was efficient because it performed well in load tests with 10 reviews per product. They didn't test with 100 reviews. They also assumed high latency meant slow code in the app server, not the database, because app server CPU was low.
Root cause: The product page code fetched the product object, then looped through each review ID and executed a separate SELECT query. That's 1 query for the product + N queries for N reviews. At 100 reviews per product, that's 101 database round trips. Under peak load, the database connection pool saturated, queries queued, and latency exploded. The ORM's default eager loading was disabled, and no one noticed because staging data had only 2-3 reviews per product.
Fix: Changed the code to use a single JOIN query: SELECT * FROM products LEFT JOIN reviews ON products.id = reviews.product_id WHERE products.id = ?. Added Review as an embedded collection on the Product object using the ORM's eager loading feature. Added an APM custom span around the database query to measure its contribution to total latency. Deployed a migration to add an index on reviews.product_id. After the fix, page latency dropped to 150ms even at peak traffic, and database CPU dropped to 25%.
Key Lesson
  • N+1 queries are invisible in app server CPU metrics: the app server waits for the database, so its CPU stays low. Always monitor database query count and latency per endpoint.
  • Load test with realistic data volumes. A product page with 2 reviews behaves nothing like a page with 200 reviews. Use production data size in staging.
  • APM should trace database queries per request. A sudden increase in 'SELECT * FROM reviews WHERE product_id = ?' call count is a smoking gun for N+1.
  • Set up alerts on p99 latency per endpoint, not just CPU. A 400% latency increase with flat CPU points directly at the database or external dependencies.
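
The round-trip arithmetic from the incident (1 + N queries versus a single JOIN) can be sketched with a counter. This is only an illustration: `NPlusOneDemo` and its methods are hypothetical stand-ins for a database client, not the team's code or any ORM API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for a database; it only counts round trips.
class NPlusOneDemo {

    static int roundTrips = 0;

    // Simulates: SELECT * FROM products WHERE id = ? (returns the review IDs to load)
    static List<Integer> fetchProductReviewIds(int reviewCount) {
        roundTrips++;
        List<Integer> ids = new ArrayList<>();
        for (int id = 1; id <= reviewCount; id++) {
            ids.add(id);
        }
        return ids;
    }

    // Simulates one SELECT per review, issued inside a loop.
    static String fetchReview(int reviewId) {
        roundTrips++;
        return "review-" + reviewId;
    }

    // Simulates the single JOIN query that returns the product and all reviews.
    static List<String> fetchProductWithReviews(int reviewCount) {
        roundTrips++;
        List<String> reviews = new ArrayList<>();
        for (int id = 1; id <= reviewCount; id++) {
            reviews.add("review-" + id);
        }
        return reviews;
    }

    // The N+1 pattern: one product query, then one query per review.
    static int roundTripsOneByOne(int reviewCount) {
        roundTrips = 0;
        for (int id : fetchProductReviewIds(reviewCount)) {
            fetchReview(id);
        }
        return roundTrips;
    }

    // The fixed pattern: everything arrives in one round trip.
    static int roundTripsWithJoin(int reviewCount) {
        roundTrips = 0;
        fetchProductWithReviews(reviewCount);
        return roundTrips;
    }

    public static void main(String[] args) {
        System.out.println("one-by-one: " + roundTripsOneByOne(100)); // prints 101
        System.out.println("join:       " + roundTripsWithJoin(100)); // prints 1
    }
}
```

With 2 or 3 reviews, as in the staging data, the two patterns differ by only a few round trips, which is why the problem stayed hidden until production volumes.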
Production Debug Guide

Symptom → Action mapping for common performance failures

High latency, app server CPU low, database CPU high: Classic N+1 query or inefficient database access pattern. Check APM traces for per-request query count. Look for loops executing SELECT statements inside a request. Use the database slow query log to identify expensive queries. Add missing indexes.
High latency, app server CPU high, database CPU normal: Application code is the bottleneck, not the database. Use a profiler (async-profiler, py-spy) to find CPU-hot methods. Check for inefficient loops, serialisation overhead (JSON parsing), or regex backtracking. Consider caching expensive computations.
Latency spikes every hour like clockwork: Likely a cron job, cache expiry, or batch process. Check scheduled jobs running at that time. Look for a cache stampede (multiple requests recomputing the same cache entry simultaneously). Add jitter to scheduled tasks. Use a lock for cache recomputation.
Latency increases linearly with number of users: Shared resource bottleneck: database connection pool, thread pool, or external API rate limit. Check connection pool size vs active connections. If the pool is maxed out, requests queue. Increase pool size or reduce connection hold time. Check thread pool saturation.
Latency high for first request after deploy, then improves: Cold start or lack of connection warm-up. Database connection pools, caches, and JIT compilation need warm-up after deployment. Send synthetic traffic before opening to users. Use a health check endpoint that exercises critical paths.
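
The "per-request query count" check in the first row can be approximated by grouping the database statements recorded in one trace and flagging any statement that repeats suspiciously often. The sketch below assumes you have already pulled the `db.statement` values of one trace into a list; the class name and the threshold of 10 are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class QueryCountCheck {

    // Counts how often each statement appears among one request's database spans.
    static Map<String, Integer> countStatements(List<String> dbStatements) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String statement : dbStatements) {
            counts.merge(statement, 1, Integer::sum);
        }
        return counts;
    }

    // Returns the first statement repeated more than `threshold` times, or null if none is.
    static String suspectedNPlusOne(List<String> dbStatements, int threshold) {
        for (Map.Entry<String, Integer> entry : countStatements(dbStatements).entrySet()) {
            if (entry.getValue() > threshold) {
                return entry.getKey();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> statements = new ArrayList<>();
        statements.add("SELECT * FROM products WHERE id = ?");
        for (int i = 0; i < 100; i++) {
            statements.add("SELECT * FROM reviews WHERE product_id = ?");
        }
        System.out.println("Suspect: " + suspectedNPlusOne(statements, 10));
    }
}
```

Grouping works because parameterised statements are identical across loop iterations; if your spans record literal values instead of placeholders, you would need to normalise them first.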

Every time a user clicks 'Buy Now' and nothing happens, a customer is lost, possibly forever. Widely cited studies from Google and Akamai have linked page load delays as small as 100ms to measurable drops in conversion rates. At scale, that's not a UX annoyance; it's a revenue crisis. Yet most engineering teams only find out their app is slow after a flood of support tickets or, worse, a trending tweet. APM exists to flip that script.

The core problem APM solves is invisibility. Code runs inside servers you can't touch, across networks you don't control, on databases holding millions of rows. Without instrumentation, you're flying blind. A query that took 50ms in staging suddenly takes 4 seconds in production under real load — and you have no idea why. APM gives you the telemetry — metrics, traces, logs — to pinpoint the exact line of code, database call, or third-party API dragging your app down.

By the end you'll understand the three pillars of observability, know exactly which metrics to instrument first, set up Prometheus-based collection, configure meaningful alert thresholds (not just 'CPU > 90%'), and read a distributed trace to find hidden latency.

The Three Pillars — Metrics, Traces, Logs

APM rests on three types of telemetry data. Each answers a different question, and you need all three to debug effectively.

Metrics are numerical measurements over time — request rate, error rate, latency percentiles, CPU usage. They answer 'what is happening?' and are cheap to store and query. Metrics are aggregated (averages, sums, counts) and lose individual request details.

Traces track a single request's journey across services — every database call, RPC, and cache hit. They answer 'why is this specific request slow?' A trace is a tree of spans, each representing a unit of work. Traces are sampled (1-10% of requests) because storing every trace is expensive.

Logs are discrete timestamped events — 'User 123 logged in', 'Payment failed: insufficient funds'. They answer 'what happened at this exact moment?' Logs are high-cardinality but unstructured; parsing them at scale requires indexing.

The relationship: metrics tell you something is wrong (p99 latency spiked). Traces tell you where (database query slow). Logs tell you why (connection pool exhausted). Without all three, you're missing context.

io/thecodeforge/apm/OpenTelemetryInstrumentation.java · JAVA
package io.thecodeforge.apm;

import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.resources.Resource;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;
import io.opentelemetry.semconv.resource.attributes.ResourceAttributes;

import java.util.concurrent.TimeUnit;

/**
 * Production OpenTelemetry instrumentation for a Java service.
 *
 * This adds distributed tracing so you can see exactly where latency
 * is hiding — database calls, HTTP requests, or your own code.
 */
public class OpenTelemetryInstrumentation {

    private final Tracer tracer;

    public OpenTelemetryInstrumentation(String serviceName, String otlpEndpoint) {
        // Configure OTLP exporter — sends traces to collector (Jaeger, Tempo, etc.)
        OtlpGrpcSpanExporter spanExporter = OtlpGrpcSpanExporter.builder()
                .setEndpoint(otlpEndpoint)  // e.g., "http://otel-collector:4317" (OTLP gRPC port; 14250 is Jaeger's legacy gRPC port, not OTLP)
                .setTimeout(30, TimeUnit.SECONDS)
                .build();

        Resource serviceResource = Resource.getDefault().toBuilder()
                .put(ResourceAttributes.SERVICE_NAME, serviceName)
                .put(ResourceAttributes.SERVICE_VERSION, "1.2.3")
                .build();

        SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
                .addSpanProcessor(BatchSpanProcessor.builder(spanExporter).build())
                .setResource(serviceResource)
                .build();

        OpenTelemetry openTelemetry = OpenTelemetrySdk.builder()
                .setTracerProvider(tracerProvider)
                .buildAndRegisterGlobal();

        this.tracer = openTelemetry.getTracer(serviceName, "1.0.0");
    }

    /**
     * Example: instrument a database query with custom span.
     *
     * This creates a child span under the current request trace.
     * In APM UI, you'll see exactly how long the database call took
     * and can correlate it with other spans in the same trace.
     */
    public void executeDatabaseQuery(String query) {
        Span dbSpan = tracer.spanBuilder("DB Query")
                .setAttribute("db.statement", query)
                .setAttribute("db.system", "postgresql")
                .startSpan();

        try (Scope scope = dbSpan.makeCurrent()) {
            // Execute actual database query here
            // connection.execute(query);
            System.out.println("Executing: " + query);
        } catch (Exception e) {
            dbSpan.recordException(e);
            dbSpan.setAttribute("error", true);
            throw e;
        } finally {
            dbSpan.end();  // Duration recorded here — visible in trace UI
        }
    }

    /**
     * Example: instrument an HTTP call to an external API.
     */
    public void callExternalApi(String url) {
        Span httpSpan = tracer.spanBuilder("HTTP " + url)
                .setAttribute("http.url", url)
                .setAttribute("http.method", "GET")
                .startSpan();

        try (Scope scope = httpSpan.makeCurrent()) {
            // Make the actual HTTP call
            // httpClient.get(url);
            System.out.println("Calling: " + url);
        } catch (Exception e) {
            httpSpan.recordException(e);
            httpSpan.setAttribute("error", true);
            throw e;
        } finally {
            httpSpan.end();
        }
    }
}
Mental Model
Metrics, Traces, Logs — The Observability Trinity
Metrics tell you 'something is wrong', traces tell you 'where', and logs tell you 'why'. You need all three to debug production issues efficiently.
  • Metrics: aggregated numbers (rate, errors, duration). Cheap to store, but lose individual request detail.
  • Traces: single request journey across services. Expensive to store (sampled at 1-10%). Show exact latency breakdown.
  • Logs: discrete events with high cardinality. Unstructured, need indexing for search. Best for debugging 'why' after trace identifies 'where'.
  • OpenTelemetry: vendor-neutral API for generating telemetry; send to any backend (Jaeger, Prometheus, Datadog, New Relic).
  • Rule: Start with RED metrics (Rate, Errors, Duration) for every service, then add traces for slow endpoints, then structured logs for errors.
📊 Production Insight
Metrics alone can't debug a single slow request. They're aggregated, so a 1-second p99 could be 1% of requests taking 10 seconds.
Traces alone can't tell you if a problem is widespread or isolated. Combine metrics (problem exists) with traces (find the cause).
Rule: Sample 100% of traces for error responses (status >= 400), and 1-10% of successful requests. This captures all failures without breaking the bank.
🎯 Key Takeaway
Metrics tell you something is wrong. Traces tell you where. Logs tell you why. You need all three to debug effectively.
Start with RED metrics for every service, then add distributed tracing for slow endpoints, then structured logging for error detail.
Rule: Sample 100% of error traces, 1-10% of success traces. Tail-based sampling catches slow requests without storing every success.
APM Telemetry Sampling Strategy
If: Low traffic service (< 10 requests/second)
Use: Sample 100% of requests. Store traces for 7 days, errors for 30 days. Cost is negligible and debugging is easier.
If: Medium traffic service (10-100 requests/second)
Use: Sample 10% of requests, plus 100% of errors. Use probabilistic sampling with consistent probability per trace ID.
If: High traffic service (> 100 requests/second)
Use: Sample 1% of requests, 100% of errors, and 'tail-based' sampling for slow requests (>500ms). Use the OpenTelemetry collector with the tail-sampling processor.
If: Compliance requirement: must have a trace for every transaction
Use: Use head-based sampling with probability 1 (100%). Accept higher storage costs. Use a cheaper storage tier (S3) for older traces.
If: Transaction spans multiple services (distributed trace)
Use: Use consistent probability sampling based on trace ID. All services must use the same sampling decision to avoid broken traces.
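
Consistent probability sampling works because every service derives the decision from the trace ID itself rather than from a local random number. Below is a minimal sketch of that idea, assuming a 32-character hex trace ID. `TraceIdSampler` is a hypothetical helper written for illustration, not the OpenTelemetry SDK's sampler.

```java
class TraceIdSampler {

    // Keeps a trace when the low 63 bits of its ID fall below ratio * Long.MAX_VALUE.
    // Any service given the same trace ID and ratio reaches the same decision.
    static boolean shouldSample(String traceIdHex, double ratio) {
        if (ratio >= 1.0) {
            return true;
        }
        if (ratio <= 0.0) {
            return false;
        }
        // The last 16 hex characters are the low 64 bits of the 128-bit trace ID.
        String low = traceIdHex.substring(traceIdHex.length() - 16);
        long value = Long.parseUnsignedLong(low, 16);
        long nonNegative = value & Long.MAX_VALUE; // clear the sign bit
        long upperBound = (long) (ratio * Long.MAX_VALUE);
        return nonNegative < upperBound;
    }

    public static void main(String[] args) {
        String traceId = "4bf92f3577b34da6a3ce929d0e0e4736";
        System.out.println("10% sampling: " + shouldSample(traceId, 0.10));
        System.out.println("50% sampling: " + shouldSample(traceId, 0.50));
    }
}
```

The approach only yields an unbiased sample when trace IDs are generated randomly, which is an assumption worth verifying for any custom ID generator.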

The RED Method — Rate, Errors, Duration

The RED method (Rate, Errors, Duration) is the standard for service-level monitoring. For every service, track these three metrics, and you'll know instantly whether users are happy.

Rate is the number of requests per second. A sudden drop in rate (traffic falling off a cliff) often means the service is unavailable or rejecting requests. A sudden spike might indicate a DDoS attack or misconfigured client.

Errors is the proportion of requests that failed: HTTP 5xx, thrown exceptions, timeouts, or any response that doesn't meet your SLO. Track error rate both as a raw count and as a percentage of total requests. A slow rise in error rate often indicates resource exhaustion (database connections, memory).

Duration is how long requests take, measured as latency percentiles — p50 (median), p95, p99. p99 is what matters for user experience: 1% of requests are slower than this. Average latency hides outliers: a service could have 1000 requests at 1ms and 1 request at 1000ms, average 2ms, but 0.1% of users had a terrible experience.
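
To see the gap between an average and a percentile in numbers, here is a small sketch using a made-up sample of 100 requests: 98 at 10ms and 2 at 2000ms. The data set and the `LatencyPercentiles` helper are hypothetical, and the percentile uses the simple nearest-rank definition.

```java
import java.util.Arrays;

class LatencyPercentiles {

    // Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
    static long percentile(long[] sortedMs, double p) {
        int rank = (int) Math.ceil(p * sortedMs.length / 100.0);
        return sortedMs[Math.max(rank, 1) - 1];
    }

    static double mean(long[] values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return (double) sum / values.length;
    }

    // Hypothetical sample: 98 fast requests and 2 very slow ones, sorted ascending.
    static long[] sample() {
        long[] latencies = new long[100];
        Arrays.fill(latencies, 10);
        latencies[98] = 2000;
        latencies[99] = 2000;
        Arrays.sort(latencies);
        return latencies;
    }

    public static void main(String[] args) {
        long[] latencies = sample();
        System.out.println("mean = " + mean(latencies) + " ms");
        System.out.println("p50  = " + percentile(latencies, 50) + " ms");
        System.out.println("p99  = " + percentile(latencies, 99) + " ms");
    }
}
```

In this sample the mean is 49.8ms and the median is 10ms, both of which look healthy, while p99 is 2000ms. Note that a percentile only exposes outliers that make up at least its tail fraction; rarer outliers need p99.9 or the maximum.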

Instrument duration with a histogram: bucket boundaries at 1ms, 5ms, 10ms, 50ms, 100ms, 250ms, 500ms, 1000ms, 2500ms, 5000ms, 10000ms. This gives you percentiles without storing every latency value.
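
A histogram yields percentiles by interpolating inside the bucket that contains the target rank. The sketch below shows that estimate over cumulative bucket counts; it follows the same general idea as linear interpolation in PromQL's `histogram_quantile`, but it is a simplified illustration with hypothetical names and bucket data, not Prometheus's implementation.

```java
class HistogramQuantile {

    // upperBounds: ascending finite bucket upper bounds, in seconds.
    // cumulativeCounts: observations <= each bound, plus one final entry for the +Inf bucket (the total).
    static double estimate(double q, double[] upperBounds, long[] cumulativeCounts) {
        long total = cumulativeCounts[cumulativeCounts.length - 1];
        if (total == 0) {
            return Double.NaN;
        }
        double rank = q * total;
        for (int i = 0; i < upperBounds.length; i++) {
            if (cumulativeCounts[i] >= rank) {
                double lower = (i == 0) ? 0.0 : upperBounds[i - 1];
                long below = (i == 0) ? 0 : cumulativeCounts[i - 1];
                long inBucket = cumulativeCounts[i] - below;
                if (inBucket == 0) {
                    return upperBounds[i];
                }
                // Assume observations are spread evenly inside the bucket.
                return lower + (upperBounds[i] - lower) * (rank - below) / inBucket;
            }
        }
        // The rank falls in the +Inf bucket: the best available answer is the highest finite bound.
        return upperBounds[upperBounds.length - 1];
    }

    public static void main(String[] args) {
        double[] bounds = {0.1, 0.25, 0.5, 1.0};
        long[] cumulative = {900, 950, 980, 995, 1000};
        System.out.println("estimated p99 (s): " + estimate(0.99, bounds, cumulative));
    }
}
```

For these hypothetical counts the p99 rank (990) lands in the 0.5s to 1.0s bucket, so the estimate is about 0.83s. The wide bucket is why the estimate is coarse: accuracy depends entirely on how closely the bucket boundaries bracket the latencies you care about.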

Common RED mistakes: measuring only average latency (hides p99 problems), not tracking errors by type (500 internal server error vs 404 not found are very different), and not breaking down rate by endpoint (a drop in /health is fine; a drop in /checkout is a crisis).

io/thecodeforge/apm/REDMetrics.java · JAVA
package io.thecodeforge.apm;

import io.prometheus.client.CollectorRegistry;
import io.prometheus.client.Counter;
import io.prometheus.client.Histogram;
import io.prometheus.client.exporter.HTTPServer;

import java.io.IOException;

/**
 * Production RED metrics (Rate, Errors, Duration) using Prometheus.
 *
 * These three metrics are enough to know if a service is healthy
 * from the user's perspective — without looking at CPU or memory.
 */
public class REDMetrics {

    // ─── RATE: Total requests per endpoint ───────────────────────────────────
    // Counter represents requests total. Rate = increase over time.
    private static final Counter requestTotal = Counter.build()
            .name("http_requests_total")
            .labelNames("method", "endpoint", "status")
            .help("Total HTTP requests")
            .register();

    // ─── ERRORS: Error counter (subset of requestTotal) ──────────────────────
    // Track errors separately for easier alerting, but also derived from requestTotal
    private static final Counter errorTotal = Counter.build()
            .name("http_errors_total")
            .labelNames("method", "endpoint", "error_type")
            .help("Total HTTP errors (status >= 500 or exception)")
            .register();

    // ─── DURATION: Request latency histogram ─────────────────────────────────
    // Buckets chosen to capture p50 (5-10ms), p95 (50-100ms), p99 (250-500ms)
    // Adjust buckets based on your service's typical latency.
    private static final Histogram requestDuration = Histogram.build()
            .name("http_request_duration_seconds")
            .labelNames("method", "endpoint")
            .help("HTTP request latency in seconds")
            .buckets(0.001, 0.005, 0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10)
            .register();

    /**
     * Record metrics for a completed request.
     * Call this in your API framework's response filter/middleware.
     */
    public static void recordRequest(String method, String endpoint,
                                      int statusCode, long durationMs) {
        String status = String.valueOf(statusCode);
        requestTotal.labels(method, endpoint, status).inc();

        if (statusCode >= 500) {
            errorTotal.labels(method, endpoint, "http_5xx").inc();
        }

        requestDuration.labels(method, endpoint).observe(durationMs / 1000.0);
    }

    /**
     * Record an exception that wasn't caught by normal status code handling.
     */
    public static void recordException(String method, String endpoint, String exceptionType) {
        errorTotal.labels(method, endpoint, exceptionType).inc();
    }

    /**
     * Start Prometheus metrics endpoint on port 8081 (separate from app port).
     * Scraped by Prometheus every 15 seconds.
     */
    public static void startMetricsServer() throws IOException {
        HTTPServer server = new HTTPServer(8081);
        System.out.println("Prometheus metrics available at http://localhost:8081/metrics");
    }
}
🔥Prometheus Histogram Buckets
The histogram buckets in the code above (0.001, 0.005, 0.01, ...) are chosen to capture typical web latency ranges. For a database service, you might shift to higher buckets (0.01, 0.05, 0.1, 0.5...). For an in-memory cache, lower buckets (0.0001, 0.0005...). Summaries compute quantiles inside each client process and cannot be aggregated across instances, so histograms are usually the better choice for fleet-wide percentiles in production. Keep in mind that histogram percentiles are estimates whose accuracy depends on the bucket boundaries.
📊 Production Insight
Average latency hides problems. A service with 1000 requests at 1ms and 1 at 1000ms has an average of 2ms, but 0.1% of users had a 1000ms experience.
p99 latency is what your slowest users actually feel: 1% of requests are slower than p99. Track p50 for trends, p99 for SLOs.
Rule: Set p99 latency alerts at 3x your normal baseline, not an absolute number. A 500ms p99 might be fine for a reporting API but terrible for a checkout endpoint.
🎯 Key Takeaway
RED metrics — Rate, Errors, Duration — tell you if users are happy without looking at CPU or memory.
Track p99 latency, not average. Averages hide outliers, and outliers are what users notice.
Rule: Start with RED for every service before adding more detailed metrics. If RED is green, users are happy; if RED is red, start debugging.
RED Metrics by Service Type
If: Web API or synchronous service (user waiting for response)
Use: Track p99 latency, error rate, request rate. Alert when p99 > 500ms for 5 minutes; error rate > 1% for 2 minutes.
If: Background job processor or async worker
Use: Track rate (jobs processed), error rate, and job age (time from enqueue to completion). Alert when age > 5 minutes.
If: Database or cache (infrastructure service)
Use: Track query latency p99, connection pool saturation, error rate. Alert when p99 > 100ms for database (with index) or > 5ms for cache.
If: Batch job (cron, ETL)
Use: Track duration (time to completion), error flag (0 or 1), data volume processed. Alert when job takes > 2x baseline duration.
If: Third-party API dependency (downstream call)
Use: Track rate (calls per second), error rate (HTTP 5xx, timeouts), latency p99. Alert when error rate > 5% or p99 > 2 seconds.
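
Rules such as "p99 > 500ms for 5 minutes" only fire when the condition holds for the whole window, which filters out single-sample blips. A minimal sketch of that semantics follows, assuming one p99 sample per minute. The class and method names are hypothetical; in practice an alerting system evaluates this for you.

```java
class SustainedThresholdAlert {

    // Fires only when every sample in the trailing window exceeds the threshold.
    static boolean shouldFire(double[] p99SamplesMs, int windowSize, double thresholdMs) {
        if (p99SamplesMs.length < windowSize) {
            return false; // not enough history yet
        }
        for (int i = p99SamplesMs.length - windowSize; i < p99SamplesMs.length; i++) {
            if (p99SamplesMs[i] <= thresholdMs) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        double[] sustained = {300, 620, 640, 700, 810, 900};
        double[] blip = {620, 640, 480, 700, 810};
        System.out.println("sustained breach fires: " + shouldFire(sustained, 5, 500));
        System.out.println("interrupted breach fires: " + shouldFire(blip, 5, 500));
    }
}
```

The trade-off is detection delay: a 5-minute window means users feel at least 5 minutes of slowness before anyone is paged, so shorter windows suit endpoints like checkout.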

Distributed Tracing — Following a Request Across Services

In a monolith, you can find a slow function with a profiler. In microservices, a single request might pass through API gateway → auth service → order service → payment service → inventory service. A 2-second latency could be 100ms in each of 20 services, or 1.9 seconds in a single database query. Distributed tracing tells you which.

A trace is a tree of spans. The root span covers the entire request from client to final response. Child spans cover sub-operations: HTTP calls to downstream services, database queries, cache lookups, even internal function calls.

Key fields: trace ID (same across all spans in a request), span ID (unique per operation), parent span ID (links child to parent), name (operation name: 'GET /products', 'SELECT * FROM orders'), start and end timestamps (duration = end - start), attributes (HTTP method, status code, DB statement), events (logs within a span: 'cache miss', 'retry attempt').
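
The parent span ID is what turns a flat list of spans into a tree, and the tree is what lets you separate a span's own time from time spent in its children. Here is a minimal sketch with a hypothetical three-span trace. `SpanRecord` is an illustrative class (the shared trace ID is omitted for brevity), not an OpenTelemetry type.

```java
import java.util.Arrays;
import java.util.List;

class SpanTree {

    static final class SpanRecord {
        final String spanId;
        final String parentSpanId; // null for the root span
        final String name;
        final long startMs;
        final long endMs;

        SpanRecord(String spanId, String parentSpanId, String name, long startMs, long endMs) {
            this.spanId = spanId;
            this.parentSpanId = parentSpanId;
            this.name = name;
            this.startMs = startMs;
            this.endMs = endMs;
        }

        long durationMs() {
            return endMs - startMs;
        }
    }

    // Self time = own duration minus direct children's durations (assumes children do not overlap).
    static long selfTimeMs(SpanRecord span, List<SpanRecord> trace) {
        long childTime = 0;
        for (SpanRecord candidate : trace) {
            if (span.spanId.equals(candidate.parentSpanId)) {
                childTime += candidate.durationMs();
            }
        }
        return span.durationMs() - childTime;
    }

    static SpanRecord slowestBySelfTime(List<SpanRecord> trace) {
        SpanRecord slowest = trace.get(0);
        for (SpanRecord span : trace) {
            if (selfTimeMs(span, trace) > selfTimeMs(slowest, trace)) {
                slowest = span;
            }
        }
        return slowest;
    }

    static List<SpanRecord> exampleTrace() {
        return Arrays.asList(
                new SpanRecord("a1", null, "GET /products", 0, 2000),
                new SpanRecord("b2", "a1", "auth check", 10, 60),
                new SpanRecord("c3", "a1", "SELECT * FROM orders", 100, 1950));
    }

    public static void main(String[] args) {
        System.out.println("Slowest by self time: " + slowestBySelfTime(exampleTrace()).name);
    }
}
```

The root span lasts 2000ms, but only 100ms of that is its own work; the database span accounts for 1850ms, which is the conclusion a trace waterfall view gives you visually.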

Implementation: instrument your HTTP client and server libraries to propagate trace context automatically via headers (W3C Trace Context standard: traceparent, tracestate). Use OpenTelemetry auto-instrumentation agents where available (Java, Python, Node.js, Go), and add manual instrumentation for business-critical spans.
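
For reference, a version-00 `traceparent` header has four dash-separated lowercase hex fields: version (2 characters), trace ID (32), parent span ID (16), and flags (2, where the lowest bit means "sampled"). The sketch below parses one and builds the header for an outgoing call. `TraceParent` is a hypothetical helper for illustration; instrumented libraries do this for you, and the sketch skips some spec rules such as rejecting all-zero IDs.

```java
class TraceParent {

    final String version;
    final String traceId;
    final String parentId;
    final String flags;

    private TraceParent(String version, String traceId, String parentId, String flags) {
        this.version = version;
        this.traceId = traceId;
        this.parentId = parentId;
        this.flags = flags;
    }

    // Parses a version-00 style header: version-traceid-parentid-flags.
    static TraceParent parse(String header) {
        String[] parts = header.trim().split("-");
        if (parts.length != 4
                || !parts[0].matches("[0-9a-f]{2}")
                || !parts[1].matches("[0-9a-f]{32}")
                || !parts[2].matches("[0-9a-f]{16}")
                || !parts[3].matches("[0-9a-f]{2}")) {
            throw new IllegalArgumentException("Malformed traceparent: " + header);
        }
        return new TraceParent(parts[0], parts[1], parts[2], parts[3]);
    }

    // The lowest flag bit marks the trace as sampled by the caller.
    boolean isSampled() {
        return (Integer.parseInt(flags, 16) & 0x01) == 0x01;
    }

    // Header for a downstream call: same trace ID, this service's span ID as the new parent.
    String forChildCall(String currentSpanId) {
        return version + "-" + traceId + "-" + currentSpanId + "-" + flags;
    }

    public static void main(String[] args) {
        TraceParent incoming = parse("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
        System.out.println("trace ID: " + incoming.traceId);
        System.out.println("sampled:  " + incoming.isSampled());
        System.out.println("outgoing: " + incoming.forChildCall("b7ad6b7169203331"));
    }
}
```

The trace ID stays constant across every hop while the parent ID changes at each one, which is exactly what lets a backend stitch spans from different services into a single tree.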

Common tracing mistakes: not propagating trace context across asynchronous boundaries (message queues, background threads) — resulting in broken traces; sampling too aggressively (1% of 1% leaves 0.01% of requests traced); not storing traces long enough (7 days minimum for debugging weekly patterns); and not linking traces to logs (add trace ID to every log line).

io/thecodeforge/apm/DistributedTracing.java · JAVA
package io.thecodeforge.apm;

import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Context;
import io.opentelemetry.context.propagation.TextMapGetter;
import io.opentelemetry.context.propagation.TextMapSetter;

import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/**
 * Distributed tracing with context propagation across service boundaries.
 *
 * The key challenge in distributed tracing is propagating the trace context
 * from caller to callee. OpenTelemetry's propagator handles this automatically
 * when using instrumented clients. For custom protocols (message queues),
 * inject the context manually.
 */
public class DistributedTracing {

    private final Tracer tracer;
    private final OpenTelemetry openTelemetry;
    private final HttpClient httpClient;

    public DistributedTracing(OpenTelemetry openTelemetry) {
        this.openTelemetry = openTelemetry;
        this.tracer = openTelemetry.getTracer("api-service");
        this.httpClient = HttpClient.newHttpClient();
    }

    /**
     * Example: calling a downstream service with automatic trace propagation.
     *
     * When using OpenTelemetry-instrumented HTTP client, the trace context
     * is automatically injected into the `traceparent` header.
     * The downstream service extracts it and creates a child span.
     */
    public String callOrderService(String orderId) throws Exception {
        // Start a child span for this HTTP call
        Span httpSpan = tracer.spanBuilder("HTTP POST /orders")
                .setAttribute("order.id", orderId)
                .startSpan();

        try (var scope = httpSpan.makeCurrent()) {
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(java.net.URI.create("http://order-service/api/orders"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString("{\"id\":\"" + orderId + "\"}"))
                    .build();

            // If using automatic instrumentation, the `traceparent` header
            // is added automatically. If manual, note that HttpRequest is
            // immutable, so inject into the HttpRequest.Builder before build():
            // openTelemetry.getPropagators().getTextMapPropagator()
            //     .inject(Context.current(), requestBuilder, (b, k, v) -> b.header(k, v));

            HttpResponse<String> response = httpClient.send(request, HttpResponse.BodyHandlers.ofString());

            httpSpan.setAttribute("http.status_code", response.statusCode());
            return response.body();
        } catch (Exception e) {
            httpSpan.recordException(e);
            throw e;
        } finally {
            httpSpan.end();
        }
    }

    /**
     * Extract trace context from incoming request headers.
     * OpenTelemetry's server instrumentation does this automatically.
     *
     * This is how a service knows it's part of an existing trace
     * rather than starting a new one.
     */
    public void handleIncomingRequest(String traceparentHeader) {
        // Extract context from headers (auto-instrumented frameworks do this)
        TextMapGetter<MapHeaders> getter = new TextMapGetter<>() {
            @Override
            public String get(MapHeaders carrier, String key) {
                return carrier.get(key);
            }

            @Override
            public Iterable<String> keys(MapHeaders carrier) {
                return carrier.keys();
            }
        };

        Context extractedContext = openTelemetry.getPropagators().getTextMapPropagator()
                .extract(Context.current(), new MapHeaders(traceparentHeader), getter);

        // Start a span as child of extracted context
        Span span = tracer.spanBuilder("handle request")
                .setParent(extractedContext)
                .startSpan();

        try (var scope = span.makeCurrent()) {
            // Process the request here
            System.out.println("Processing request with trace ID: " + span.getSpanContext().getTraceId());
        } finally {
            span.end();
        }
    }

    // Helper class for header propagation example
    static class MapHeaders {
        private final java.util.Map<String, String> headers = new java.util.HashMap<>();
        MapHeaders(String traceparent) { headers.put("traceparent", traceparent); }
        String get(String key) { return headers.get(key); }
        Iterable<String> keys() { return headers.keySet(); }
    }
}
⚠ Trace Context Propagation is Non-Negotiable
Without trace context propagation, your traces will be broken: each service creates a new trace root. Use W3C Trace Context headers (traceparent, tracestate). OpenTelemetry auto-instrumentation handles this for HTTP, gRPC, and many database clients. For message queues (Kafka, RabbitMQ), check whether your client library is covered by auto-instrumentation; if it is not, inject the context into the message headers manually and extract it on the consumer side.
📊 Production Insight
A trace without context propagation is just a log of each service's independent timings. You can't see the full request journey.
The W3C Trace-Context standard (traceparent header) is supported by all major tracing backends. Use it, not proprietary formats.
Rule: Test trace propagation in staging. Make a request that spans 3 services and verify the same trace ID appears in all service logs.
🎯 Key Takeaway
Distributed tracing shows you where time is spent across service boundaries — database, cache, RPC, external API.
Without trace context propagation, each service starts a new trace, and you lose the end-to-end view.
Rule: Use OpenTelemetry auto-instrumentation for HTTP clients/servers. For message queues, propagate trace context in message headers manually.
Tracing Sampling Decisions
If: You are debugging an intermittent production issue that affects 0.1% of requests
Use: Temporarily raise the sampling rate to 10% through remote configuration (no redeploy), then restore it once the issue is resolved.
If: Compliance requires a full audit trail for every transaction
Use: Sample 100% of traces with head-based sampling at probability 1. Accept the storage cost and archive older traces to cold storage.
If: Storage costs are a concern, but you need to debug slow requests
Use: Tail-based sampling: collect 100% of traces, but store only those with errors or duration > 500ms (configured in the OpenTelemetry collector).
If: You need to debug a specific user or session
Use: Conditional sampling keyed on the request. Extract the user ID from a request header; if the user is on the debug list, sample 100%.
If: Tracing overhead is affecting production latency (rare; overhead above 5%)
Use: Reduce the sampling rate, use a lighter propagator, and use an async span processor. Keep sampling 100% of errors and lower the rate for successful requests.
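The debug-list case above can be sketched as a small head-based sampler. This is a hedged illustration, not an OpenTelemetry `Sampler` implementation: `DebugAwareSampler`, `shouldSample`, and the idea of deriving the decision from the last 32 bits of the trace ID are assumptions chosen for the example.

```java
import java.util.Set;

// Hypothetical sampler: the class and method names are illustrative, not an OpenTelemetry API.
class DebugAwareSampler {
    private static final double TWO_POW_32 = 4294967296.0;
    private final long threshold;          // trace IDs whose low 32 bits fall below this are sampled
    private final Set<String> debugUsers;  // users whose requests are always traced

    DebugAwareSampler(double baseRate, Set<String> debugUsers) {
        if (baseRate < 0.0 || baseRate > 1.0) {
            throw new IllegalArgumentException("baseRate must be in [0, 1]");
        }
        this.threshold = (long) (baseRate * TWO_POW_32);
        this.debugUsers = debugUsers;
    }

    // userId is assumed to come from a request header; traceId is 32 hex characters.
    boolean shouldSample(String userId, String traceId) {
        if (userId != null && debugUsers.contains(userId)) {
            return true; // debug-list users are sampled at 100%
        }
        // Use the last 8 hex chars (32 bits) of the trace ID so that every service
        // seeing the same trace ID reaches the same decision.
        long bucket = Long.parseLong(traceId.substring(traceId.length() - 8), 16);
        return bucket < threshold;
    }
}
```

Deriving the decision from the trace ID rather than from a fresh random number keeps the decision consistent across services, so a trace is either kept everywhere or dropped everywhere.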
🗂 RED Method Metrics by Service Type
Choose the right metrics based on what your service does and who waits for it
| Service Type | Rate (R) | Errors (E) | Duration (D) | Key Alert |
|---|---|---|---|---|
| Web API (user-facing) | Requests/sec per endpoint | HTTP 5xx rate, exception rate | p99 latency per endpoint | p99 > 500ms for 5 minutes |
| Background Worker | Jobs processed/sec | Failed job rate | Job age (time from enqueue to completion) | Job age > 5 minutes |
| Database | Queries/sec | Deadlock rate, connection errors | p99 query latency | p99 > 100ms (if indexed properly) |
| Cache (Redis, Memcached) | Operations/sec (GET, SET) | Error rate, miss rate | p99 operation latency | p99 > 5ms or miss rate > 20% |
| Message Queue (Kafka) | Messages published/sec, consumed/sec | Consumer lag (offset difference) | Produce latency, consume latency | Lag > 10,000 messages for 10 minutes |
| Third-party API | Calls/sec | HTTP 5xx, timeout rate | p99 response time | Error rate > 5% or p99 > 2 seconds |

🎯 Key Takeaways

  • APM gives you three telemetry types: metrics (aggregated numbers), traces (single request journeys), and logs (discrete events). All three are necessary for efficient debugging.
  • The RED method (Rate, Errors, Duration) is the standard for service-level monitoring. Track p99 latency, not average — averages hide outliers, and outliers are what users notice.
  • Distributed tracing shows you where time is spent across service boundaries. Without trace context propagation (W3C Trace-Context), each service starts a new trace and you lose the end-to-end view.
  • Alert on p99 latency and error rate, not CPU usage. A 90% CPU alert fires while users are happy (pre-computed cache) and misses when a slow database query makes users wait (low CPU, high latency).
  • Tail-based sampling captures all slow and failed requests without storing every successful trace. Sample 100% of errors, 100% of requests > 500ms, and 1% of normal requests.

⚠ Common Mistakes to Avoid

    Alerting on CPU usage and ignoring latency
    Symptom

    CPU alert at 90% fires, but users are happy (pre-computed cache warmed up). Latency alert would have been green. Later, database query slows down to 4 seconds, CPU stays at 20% — no alert fires, users complain.

    Fix

    Alert on p99 latency and error rate, not CPU. CPU is a resource metric for capacity planning, not user experience. Use the RED method: if users are happy (low latency, low errors), CPU can be 95% and it's fine. If users are unhappy (high latency), CPU can be 10% and you need to investigate database/external dependencies.

    Monitoring average latency instead of p99
    Symptom

    The average latency dashboard shows 50ms, green. But p99 is 5 seconds, so 1% of users have a terrible experience. The product manager sees 'green' and doesn't understand why support tickets about slowness keep coming.

    Fix

    Always monitor latency percentiles: p50 for trends, p95 for most users, p99 for worst-case experience. Average hides outliers. Use Prometheus histogram with histogram_quantile(0.99, rate(...)) or a dedicated APM tool.
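To see how far the mean and the percentiles can diverge, here is a small sketch over synthetic data. The class `LatencyStats` and the sample values are made up for illustration, and the percentile uses the simple nearest-rank definition rather than the bucket interpolation a Prometheus histogram performs.

```java
import java.util.Arrays;

// Illustrative only: compares the mean with nearest-rank percentiles on synthetic latencies.
class LatencyStats {
    static double mean(long[] latenciesMs) {
        return Arrays.stream(latenciesMs).average().orElse(0.0);
    }

    // Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
    static long percentile(long[] latenciesMs, double p) {
        if (latenciesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        long[] samples = new long[1000];
        Arrays.fill(samples, 0, 980, 20L);       // 980 fast requests at 20 ms
        Arrays.fill(samples, 980, 1000, 5000L);  // 20 slow requests at 5 s
        System.out.println("mean = " + mean(samples) + " ms");
        System.out.println("p50  = " + percentile(samples, 50) + " ms");
        System.out.println("p99  = " + percentile(samples, 99) + " ms");
    }
}
```

With this data the mean lands at roughly 120 ms and p50 at 20 ms, while p99 is 5000 ms. A dashboard showing only the mean would look merely "a bit slow" even though 2% of requests take five seconds.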

    No baseline — alert threshold is absolute, not relative
    Symptom

    Alert set at 'p99 latency > 1 second'. During normal operation, p99 is 20ms, so alert never fires. One day, p99 rises to 500ms — still under 1 second, so no alert, but users are already unhappy because normal was 20ms.

    Fix

    Alert threshold should be relative to baseline: p99 latency > 3x normal for 5 minutes. Use anomaly detection (Prometheus predict_linear or external tool). For absolute thresholds, set them at 2-3x your SLO, not at a number that sounds reasonable.
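The "3x normal for 5 minutes" rule can be sketched as a tiny state machine. This is an assumption-laden illustration: `BaselineAlert` and `observe` are hypothetical names, the baseline is passed in as a fixed number, and one p99 reading per minute is assumed. A real setup would express the same idea in the alerting system's rule language.

```java
// Hypothetical evaluator for "p99 > multiplier x baseline for N consecutive minutes".
class BaselineAlert {
    private final double baselineP99Ms;   // p99 measured while the system was healthy
    private final double multiplier;      // e.g. 3.0 for "3x normal"
    private final int requiredMinutes;    // e.g. 5 for "for 5 minutes"
    private int consecutiveBreaches = 0;

    BaselineAlert(double baselineP99Ms, double multiplier, int requiredMinutes) {
        this.baselineP99Ms = baselineP99Ms;
        this.multiplier = multiplier;
        this.requiredMinutes = requiredMinutes;
    }

    // Feed one p99 reading per minute; returns true while the alert should be firing.
    boolean observe(double currentP99Ms) {
        if (currentP99Ms > baselineP99Ms * multiplier) {
            consecutiveBreaches++;
        } else {
            consecutiveBreaches = 0; // any healthy reading resets the window
        }
        return consecutiveBreaches >= requiredMinutes;
    }
}
```

With a 20ms baseline, the 500ms regression from the symptom above breaches the 60ms relative threshold immediately and fires after five consecutive readings, while a fixed 1-second threshold would stay silent.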

    Not tail-sampling slow requests
    Symptom

    Sampling rate is 1% to save costs. A slow path hit by 0.1% of requests then appears as a sampled trace in only 0.001% of all requests (1% of 0.1%), about 1 in 100,000. At modest traffic you may never capture one, and debugging takes weeks.

    Fix

    Use tail-based sampling: buffer 100% of traces in the collector, but export only those with errors or duration > 500ms. The OpenTelemetry Collector provides a tail_sampling processor for this (in the contrib distribution). This captures all slow and failed requests at near-zero storage cost for fast ones.
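The keep-or-drop rule behind tail-based sampling can be sketched as follows. This shows only the decision made once a trace has completed; the buffering itself happens in the collector. `TailSamplingPolicy` and `keep` are hypothetical names, and the random source is injected so the behaviour can be tested.

```java
import java.util.function.DoubleSupplier;

// Hypothetical decision rule: keep all errors, all slow traces, and a small share of the rest.
class TailSamplingPolicy {
    private final long slowThresholdMs;   // e.g. 500
    private final double normalKeepRate;  // e.g. 0.01 for 1% of normal traces
    private final DoubleSupplier random;  // supplies values in [0, 1); injectable for tests

    TailSamplingPolicy(long slowThresholdMs, double normalKeepRate, DoubleSupplier random) {
        this.slowThresholdMs = slowThresholdMs;
        this.normalKeepRate = normalKeepRate;
        this.random = random;
    }

    // Called after the trace has finished, when its outcome and total duration are known.
    boolean keep(boolean hasError, long durationMs) {
        if (hasError) {
            return true;                  // 100% of failed requests
        }
        if (durationMs > slowThresholdMs) {
            return true;                  // 100% of slow requests
        }
        return random.getAsDouble() < normalKeepRate; // small share of normal requests
    }
}
```

A production instance might be built as `new TailSamplingPolicy(500, 0.01, Math::random)`. Because the decision runs after completion, the slow request that head-based sampling would almost always miss is kept every time.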

    Monitoring only aggregated service-level metrics, not per-endpoint
    Symptom

    Service p99 latency is 50ms (green). But /checkout endpoint p99 is 2 seconds (red). The slow endpoint's traffic is diluted by fast /health checks and /products calls. No one notices until checkout fails during a sale.

    Fix

    Break down metrics by endpoint, especially for user-facing operations. The /health endpoint should be monitored separately from /checkout. Use custom buckets per critical endpoint. Alert on high-latency endpoints even if service average is green.
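The dilution effect is easy to reproduce. The sketch below records synthetic latencies per endpoint and compares the service-wide p99 with the per-endpoint p99. `EndpointLatency` and its methods are hypothetical, and a real system would use labelled histograms rather than raw sample lists.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: shows how fast endpoints can hide a slow one in an aggregated p99.
class EndpointLatency {
    private final Map<String, List<Long>> byEndpoint = new HashMap<>();

    void record(String endpoint, long latencyMs) {
        byEndpoint.computeIfAbsent(endpoint, k -> new ArrayList<>()).add(latencyMs);
    }

    // Nearest-rank p99 for one endpoint.
    long p99(String endpoint) {
        return p99Of(byEndpoint.getOrDefault(endpoint, Collections.<Long>emptyList()));
    }

    // Nearest-rank p99 across every endpoint, as a service-level dashboard would show it.
    long p99Overall() {
        List<Long> all = new ArrayList<>();
        for (List<Long> samples : byEndpoint.values()) {
            all.addAll(samples);
        }
        return p99Of(all);
    }

    private static long p99Of(List<Long> samples) {
        if (samples.isEmpty()) {
            throw new IllegalArgumentException("no samples");
        }
        List<Long> sorted = new ArrayList<>(samples);
        Collections.sort(sorted);
        int rank = (99 * sorted.size() + 99) / 100; // integer ceil(0.99 * n)
        return sorted.get(rank - 1);
    }

    public static void main(String[] args) {
        EndpointLatency stats = new EndpointLatency();
        for (int i = 0; i < 9950; i++) stats.record("/health", 5);
        for (int i = 0; i < 50; i++) stats.record("/checkout", 2000);
        System.out.println("service p99   = " + stats.p99Overall() + " ms");
        System.out.println("/checkout p99 = " + stats.p99("/checkout") + " ms");
    }
}
```

In this synthetic mix the service-level p99 stays at the health-check latency, while the /checkout p99 is two seconds, which is why the per-endpoint breakdown is the number to alert on.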

Interview Questions on This Topic

  • Q: Explain the difference between p99 latency and average latency, and why p99 matters more for user experience. (Mid-level)
    Average latency is the arithmetic mean of all request latencies, and it hides outliers: 980 requests at 1ms plus 20 requests at 1000ms average to about 21ms. p99 latency is the value below which 99% of requests fall; the slowest 1% are at or above it. In the same example, p99 is 1000ms. For user experience, outliers matter because every slow request is a user waiting. Take an e-commerce site with 10,000 orders per hour: if p99 latency is 1 second, roughly 100 orders per hour take 1 second or longer. If you watched only the average (21ms), you'd think everything is fine while 100 customers per hour are waiting. Most SLOs are defined in percentiles (e.g., 99% of requests under 500ms). Monitoring average latency alone effectively ignores the experience of a significant fraction of users.
  • Q: Walk me through how you would debug a sudden increase in p99 latency from 200ms to 3 seconds in a microservices architecture with 10 services. (Senior)
    Step 1: Confirm scope. Check RED metrics per service. If one service's p99 increased and the others are normal, focus there. If all increased, suspect a common dependency (database, cache, network). Step 2: Open a distributed trace for a slow request and find the span with the highest duration. That's the bottleneck. Step 3: If the bottleneck is a database query, check the slow query log, run EXPLAIN, and look for missing indexes, N+1 patterns, or lock contention. Step 4: If the bottleneck is an HTTP call to another service, check that service's metrics recursively (go back to step 1 for that service). Step 5: If the bottleneck is an application-code span, use a profiler (async-profiler) to find hot methods. Check for inefficient loops, JSON parsing, and regex. Step 6: If no single span dominates but many small spans add up, check for excessive context switching, thread pool saturation, or lock contention. Step 7: After finding the root cause, deploy the fix, verify that latency returns to baseline, and add a regression test to catch a recurrence. Also add an SLO alert for p99 > 500ms so the problem is caught earlier next time.
  • Q: What is the difference between metrics, traces, and logs? Give a scenario where you need all three to debug an issue. (Mid-level)
    Metrics are aggregated numerical data — request rate, error rate, latency percentiles. They're cheap to store and query, but lose individual request detail. Traces capture a single request's journey across services — each database call, RPC, cache hit. They're sampled (1-10%) because storage is expensive. Logs are discrete timestamped events — 'User 123 login failed', 'Connection pool exhausted'. They're high-cardinality but unstructured. Scenario: Metrics show p99 latency spiked to 3 seconds at 14:05 (something is wrong). You query traces for that time window and find a trace where the 'database query' span took 2.9 seconds (where the time is spent). You then look at logs for that database query (using the trace ID to filter) and see 'deadlock detected, retrying'. That's the 'why'. Without metrics, you wouldn't know there was a problem. Without traces, you wouldn't know the problem was the database. Without logs, you wouldn't know it was a deadlock.
  • Q: How would you design an alerting strategy for a new microservice? What metrics would you alert on, and what thresholds would you use? (Senior)
    I'd start with the RED method: Rate, Errors, Duration. For a user-facing web API, alert on: (1) p99 latency > 3x baseline (or absolute 500ms) for 5 minutes — user experience degraded. (2) Error rate > 1% for 2 minutes — service is failing. (3) Rate drop > 50% over 5 minutes — service may be unavailable or rejecting requests. Also alert on resource exhaustion: (4) Database connection pool saturation > 90% — capacity issue. (5) Thread pool queue size > 1000 — service can't keep up. Avoid CPU alerts for user-facing services — high CPU is fine if latency is low. Set thresholds using historical data: p99 latency baseline from last 7 days, alert when 3x median. Use for clauses (e.g., 'for: 5m') to avoid flapping on transient spikes. For critical endpoints (/checkout, /login), use service-level indicators (SLIs) and error budget alerts: remaining error budget < 1 hour at current burn rate.

Frequently Asked Questions

What's the difference between APM and Observability?

APM (Application Performance Monitoring) is a product category — tools like Datadog, New Relic, Dynatrace that collect metrics, traces, and logs. Observability is a property of a system: how well you can understand its internal state from its external outputs (telemetry). You achieve observability by instrumenting your code with metrics, traces, and logs. APM tools are one way to achieve observability. OpenTelemetry (vendor-neutral) is the current standard for instrumentation, replacing vendor-specific agents.

How much overhead does APM instrumentation add?

OpenTelemetry adds 2-5% CPU overhead at 1% sampling rate for traces. Metrics histograms add negligible overhead (~0.5% CPU). Logging at INFO level adds ~1% CPU. The biggest overhead is trace export (network, serialisation). Always sample traces (1-10% for high-traffic services). Use async span processors (non-blocking). For extremely latency-sensitive systems (<50us p99), consider eBPF-based monitoring or kernel tracing instead of code instrumentation.

How long should I store metrics, traces, and logs?

Metrics: 30-90 days for aggregates, 7 days for raw data. Use downsampling: keep 1-minute resolution for 7 days, 5-minute for 30 days, 1-hour for 90 days. Traces: 7-14 days for debugging weekly patterns; errors and slow requests for 30 days. Logs: 30 days for general use, and longer where a compliance regime requires it (PCI DSS, for example, requires 12 months of audit log history with the most recent 3 months immediately available). Use tiered storage: hot (SSD) for 7 days, warm (SSD/HDD) for 30 days, cold (S3) for older data. The OpenTelemetry collector supports routing traces to different backends based on attributes (e.g., errors → long-term).

What is tail-based sampling and when should I use it?

Head-based sampling decides at the start of the request (e.g., random 1%). Tail-based sampling makes the decision after the request completes. The OpenTelemetry collector buffers traces for a few seconds, then decides to keep or drop based on criteria: if duration > 500ms, keep; if an error occurred, keep; otherwise, sample 1%. This ensures you have traces for all slow and failed requests (the ones you actually want to debug) without storing every successful 50ms request. Use tail-based sampling for high-traffic services (> 100 req/sec) where storing 100% of traces is expensive but you need to debug rare issues. The trade-off is delayed trace availability (traces are held in the buffer until the decision is made) and higher collector memory usage.

🔥
Naren Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.