Senior 10 min · March 06, 2026
Application Performance Monitoring

N+1 Queries Hide in Low CPU — APM Metrics That Expose Them

App CPU at 30% while p99 latency hit 4 seconds.

Naren, Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Written from production experience, not tutorials.

Last updated: May 23, 2026
Quick Answer
  • APM gives you telemetry — metrics (numerical measurements), traces (request journeys), and logs (discrete events) — to find performance problems before users complain
  • Core components: RED method (Rate, Errors, Duration) for services; USE method (Utilisation, Saturation, Errors) for resources; distributed tracing for microservices
  • Performance cost: OpenTelemetry adds 2-5% CPU overhead when sampled at 1% (adjust sampling rate based on traffic)
  • Production trap: Alerting on CPU usage alone — a 90% CPU alert fires while users are happy (pre-computed cache), and misses when a slow database query makes users wait (low CPU, high latency)
  • Biggest mistake: No baseline for normal latency — you can't know p99 is bad if you never tracked p50 when the system was healthy
Definition
What is Application Performance Monitoring?

Application Performance Monitoring (APM) is the practice of measuring and managing the availability, responsiveness, and resource consumption of software applications in production. It exists because traditional infrastructure monitoring (CPU, memory, disk) tells you a server is alive but not whether your users are actually getting correct responses quickly.

APM fills that gap by instrumenting application code to track request paths, database queries, external API calls, and error rates — giving you visibility into the actual user experience and the internal bottlenecks that degrade it. Without APM, you're flying blind on issues like N+1 queries that silently waste database connections while CPU stays low.

APM sits between infrastructure monitoring (Prometheus, CloudWatch) and user experience monitoring (RUM, synthetic checks). It's the layer that connects infrastructure metrics to business outcomes. You should use APM when you need to understand why a page loads slowly, which database queries are the most expensive, or how a microservice failure propagates.

You should NOT use APM as your sole monitoring tool — it's complementary to logs (for debugging) and infrastructure metrics (for capacity planning). Modern APM tools like Datadog, New Relic, and Honeycomb process billions of data points daily, with distributed tracing being the key differentiator for microservice architectures.

The core value of APM is correlation: it links high-level metrics (request latency) to specific traces (the exact request that was slow) to individual spans (the database query that took 2 seconds). This is how you catch N+1 queries that hide in low CPU — the database might be at 10% utilization, but a single endpoint is making 200 sequential queries because an ORM lazily loaded associations.

APM exposes this through span-level database call counts and duration breakdowns that infrastructure metrics never show.

Plain-English First

Imagine your app is a restaurant kitchen. APM is like having a head chef who watches every cook, every dish, and every order in real time — they know instantly if the fryer is too slow, if a dish keeps getting sent back, or if one cook is overwhelmed while others stand idle. Without that chef, you only find out something went wrong when a customer walks out. APM is that watchful chef for your software — it tells you exactly where the kitchen is breaking down before your diners notice.

Every time a user clicks 'Buy Now' and nothing happens, a customer is lost, possibly forever. Widely cited industry studies, including ones from Google and Akamai, link even a 100ms increase in page load time to a measurable drop in conversion rates, commonly quoted as roughly 1%. At scale, that's not a UX annoyance; it's a revenue crisis. Yet most engineering teams only find out their app is slow after a flood of support tickets or, worse, a trending tweet. APM exists to flip that script.

The core problem APM solves is invisibility. Code runs inside servers you can't touch, across networks you don't control, on databases holding millions of rows. Without instrumentation, you're flying blind. A query that took 50ms in staging suddenly takes 4 seconds in production under real load — and you have no idea why. APM gives you the telemetry — metrics, traces, logs — to pinpoint the exact line of code, database call, or third-party API dragging your app down.

By the end you'll understand the three pillars of observability, know exactly which metrics to instrument first, set up Prometheus-based collection, configure meaningful alert thresholds (not just 'CPU > 90%'), and read a distributed trace to find hidden latency.

What Application Performance Monitoring Actually Tracks

Application performance monitoring (APM) is the practice of measuring and analyzing the end-to-end behavior of a software system in production, focusing on response times, error rates, and resource consumption. The core mechanic is distributed tracing: every request is tagged with a unique trace ID, and each service, database call, or external API hit is recorded as a span. This creates a waterfall view of where time is spent, from the user's click to the final response.

In practice, APM tools instrument your code with low overhead (typically under 5% CPU) by attaching bytecode-instrumentation agents or using OpenTelemetry SDKs. They aggregate metrics like p50/p99 latency, throughput, and error rates, but the real power is in the trace-level detail: you can drill into a single slow request and see that 95% of its time was spent in 200 sequential database queries, each taking 2ms. That's the N+1 pattern: one query to load a list, then one more query per row. It is invisible in average CPU but obvious in the trace's span count.
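
To make the query-count difference concrete, here is a minimal sketch that contrasts lazy per-row loading with a single batched lookup. `FakeDb` and its methods are hypothetical stand-ins for an ORM and a database, not a real API; the only thing being measured is how many round trips each approach makes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simulated data layer that counts round trips, to contrast N+1 with batched loading. */
final class NPlusOneDemo {

    /** Hypothetical in-memory "database" that counts every query it serves. */
    static final class FakeDb {
        private final Map<Integer, List<String>> itemsByOrder = new HashMap<>();
        int queryCount = 0;

        FakeDb(int orders) {
            for (int id = 1; id <= orders; id++) {
                itemsByOrder.put(id, List.of("item-" + id));
            }
        }

        /** One query that returns all order IDs. */
        List<Integer> findOrderIds() {
            queryCount++;
            return new ArrayList<>(itemsByOrder.keySet());
        }

        /** One query per order: what a lazily loaded association does. */
        List<String> findItems(int orderId) {
            queryCount++;
            return itemsByOrder.get(orderId);
        }

        /** One query for many orders: the batched alternative (think WHERE order_id IN (...)). */
        Map<Integer, List<String>> findItemsForOrders(List<Integer> orderIds) {
            queryCount++;
            Map<Integer, List<String>> result = new HashMap<>();
            for (int id : orderIds) {
                result.put(id, itemsByOrder.get(id));
            }
            return result;
        }
    }

    /** N+1: one query for the orders, then one more per order. Returns the total query count. */
    static int lazyLoad(FakeDb db) {
        for (int orderId : db.findOrderIds()) {
            db.findItems(orderId);
        }
        return db.queryCount;
    }

    /** Batched: one query for the orders, one for all their items. Returns the total query count. */
    static int batchLoad(FakeDb db) {
        db.findItemsForOrders(db.findOrderIds());
        return db.queryCount;
    }

    public static void main(String[] args) {
        System.out.println("lazy queries:    " + lazyLoad(new FakeDb(200)));   // 201
        System.out.println("batched queries: " + batchLoad(new FakeDb(200)));  // 2
    }
}
```

With 200 orders the lazy path issues 201 queries and the batched path issues 2. Each individual query stays cheap in both cases, which is why CPU graphs do not move; only the span count in a trace shows the difference.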

Use APM when you need to understand why a system behaves differently under load than in staging. It matters most for microservices, where a single slow downstream call can cascade into a global timeout storm, or for monoliths where a hidden O(n) loop in a hot path turns a 50ms endpoint into a 5s one. Without APM, you're debugging blind.

APM ≠ Infrastructure Monitoring
CPU and memory graphs won't show you that a single endpoint is making 200 database calls — only trace-level APM reveals the N+1 pattern hiding in low CPU.
Production Insight
A team at a payments company saw p99 latency spike from 200ms to 4s every 10 minutes. CPU was flat at 30%. APM traces revealed a batch job was loading an order with 500 line items, each triggering a separate SQL query via Hibernate's lazy loading.
The exact symptom: a periodic latency spike with no CPU or memory pressure, correlated with a scheduled task.
Rule of thumb: if you see latency spikes without CPU or memory increase, suspect N+1 queries — trace the slowest request and count the database spans.
Key Takeaway
APM's core value is trace-level visibility, not aggregate metrics — always drill into the slowest trace.
N+1 queries hide in low CPU because each query is fast; only the count reveals the O(n) cost.
Instrument every external call (DB, cache, API) as a span — otherwise you're blind to the bottleneck.
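
The "count the database spans" rule of thumb can be sketched as a small check over exported span data. This is an illustration under assumptions: `SpanRecord`, the `"db"` kind label, and the threshold are hypothetical, and a real APM backend would expose this as a query rather than as application code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Counts database spans per trace to surface likely N+1 patterns. */
final class DbSpanCounter {

    /** Minimal span shape for this sketch; real spans carry many more fields. */
    static final class SpanRecord {
        final String traceId;
        final String kind;       // "db", "http", ... (hypothetical labels)
        final long durationMs;

        SpanRecord(String traceId, String kind, long durationMs) {
            this.traceId = traceId;
            this.kind = kind;
            this.durationMs = durationMs;
        }
    }

    /** Returns the number of spans of kind "db" for each trace ID. */
    static Map<String, Integer> dbSpansPerTrace(List<SpanRecord> spans) {
        Map<String, Integer> counts = new TreeMap<>();
        for (SpanRecord span : spans) {
            if ("db".equals(span.kind)) {
                counts.merge(span.traceId, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Returns trace IDs whose database span count exceeds the threshold. */
    static List<String> suspectedNPlusOne(List<SpanRecord> spans, int threshold) {
        List<String> suspects = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : dbSpansPerTrace(spans).entrySet()) {
            if (entry.getValue() > threshold) {
                suspects.add(entry.getKey());
            }
        }
        return suspects;
    }

    public static void main(String[] args) {
        List<SpanRecord> spans = new ArrayList<>();
        for (int i = 0; i < 200; i++) {
            spans.add(new SpanRecord("trace-slow", "db", 2));  // 200 queries at 2ms each
        }
        spans.add(new SpanRecord("trace-ok", "db", 15));        // one larger query
        System.out.println(suspectedNPlusOne(spans, 50));       // [trace-slow]
    }
}
```

Note that the check ignores duration entirely. A trace with 200 fast queries is flagged while a trace with one slower query is not, which is exactly the signal that latency-per-query dashboards miss.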
Figure: APM Metrics That Expose N+1 Queries (from RED method to distributed tracing in Kubernetes). Panels: RED Method (rate, errors, duration per service); Distributed Tracing (follow a request across microservices); APM Fire Triangle (metrics, traces, logs correlation); Kubernetes Fire Hose (high-cardinality pod-level metrics); N+1 Query Detection (low CPU but high DB calls); Optimized Query Pattern (batch loading reduces N+1). Callout: low CPU masks N+1 queries, so always check DB call count, not just CPU usage.

The Three Pillars — Metrics, Traces, Logs

APM rests on three types of telemetry data. Each answers a different question, and you need all three to debug effectively.

Metrics are numerical measurements over time — request rate, error rate, latency percentiles, CPU usage. They answer 'what is happening?' and are cheap to store and query. Metrics are aggregated (averages, sums, counts) and lose individual request details.

Traces track a single request's journey across services — every database call, RPC, and cache hit. They answer 'why is this specific request slow?' A trace is a tree of spans, each representing a unit of work. Traces are sampled (1-10% of requests) because storing every trace is expensive.

Logs are discrete timestamped events — 'User 123 logged in', 'Payment failed: insufficient funds'. They answer 'what happened at this exact moment?' Logs are high-cardinality but unstructured; parsing them at scale requires indexing.

The relationship: metrics tell you something is wrong (p99 latency spiked). Traces tell you where (database query slow). Logs tell you why (connection pool exhausted). Without all three, you're missing context.

io/thecodeforge/apm/OpenTelemetryInstrumentation.java (Java)
package io.thecodeforge.apm;

import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.resources.Resource;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;
import io.opentelemetry.semconv.resource.attributes.ResourceAttributes;

import java.util.concurrent.TimeUnit;

/**
 * Production OpenTelemetry instrumentation for a Java service.
 *
 * This adds distributed tracing so you can see exactly where latency
 * is hiding — database calls, HTTP requests, or your own code.
 */
public class OpenTelemetryInstrumentation {

    private final Tracer tracer;

    public OpenTelemetryInstrumentation(String serviceName, String otlpEndpoint) {
        // Configure OTLP exporter: sends traces to an OTLP-compatible collector
        // (OpenTelemetry Collector, Jaeger, Tempo, etc.)
        OtlpGrpcSpanExporter spanExporter = OtlpGrpcSpanExporter.builder()
                .setEndpoint(otlpEndpoint)  // e.g., "http://otel-collector:4317" (4317 is the default OTLP/gRPC port)
                .setTimeout(30, TimeUnit.SECONDS)
                .build();

        Resource serviceResource = Resource.getDefault().toBuilder()
                .put(ResourceAttributes.SERVICE_NAME, serviceName)
                .put(ResourceAttributes.SERVICE_VERSION, "1.2.3")
                .build();

        SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
                .addSpanProcessor(BatchSpanProcessor.builder(spanExporter).build())
                .setResource(serviceResource)
                .build();

        OpenTelemetry openTelemetry = OpenTelemetrySdk.builder()
                .setTracerProvider(tracerProvider)
                .buildAndRegisterGlobal();

        this.tracer = openTelemetry.getTracer(serviceName, "1.0.0");
    }

    /**
     * Example: instrument a database query with custom span.
     *
     * This creates a child span under the current request trace.
     * In APM UI, you'll see exactly how long the database call took
     * and can correlate it with other spans in the same trace.
     */
    public void executeDatabaseQuery(String query) {
        Span dbSpan = tracer.spanBuilder("DB Query")
                .setAttribute("db.statement", query)
                .setAttribute("db.system", "postgresql")
                .startSpan();

        try (Scope scope = dbSpan.makeCurrent()) {
            // Execute actual database query here
            // connection.execute(query);
            System.out.println("Executing: " + query);
        } catch (Exception e) {
            dbSpan.recordException(e);
            dbSpan.setStatus(io.opentelemetry.api.trace.StatusCode.ERROR);  // mark the span as failed
            throw e;
        } finally {
            dbSpan.end();  // Duration recorded here — visible in trace UI
        }
    }

    /**
     * Example: instrument an HTTP call to an external API.
     */
    public void callExternalApi(String url) {
        Span httpSpan = tracer.spanBuilder("HTTP " + url)
                .setAttribute("http.url", url)
                .setAttribute("http.method", "GET")
                .startSpan();

        try (Scope scope = httpSpan.makeCurrent()) {
            // Make the actual HTTP call
            // httpClient.get(url);
            System.out.println("Calling: " + url);
        } catch (Exception e) {
            httpSpan.recordException(e);
            httpSpan.setStatus(io.opentelemetry.api.trace.StatusCode.ERROR);  // mark the span as failed
            throw e;
        } finally {
            httpSpan.end();
        }
    }
}
Metrics, Traces, Logs — The Observability Trinity
  • Metrics: aggregated numbers (rate, errors, duration). Cheap to store, but lose individual request detail.
  • Traces: single request journey across services. Expensive to store (sampled at 1-10%). Show exact latency breakdown.
  • Logs: discrete events with high cardinality. Unstructured, need indexing for search. Best for debugging 'why' after trace identifies 'where'.
  • OpenTelemetry: vendor-neutral API for generating telemetry; send to any backend (Jaeger, Prometheus, Datadog, New Relic).
  • Rule: Start with RED metrics (Rate, Errors, Duration) for every service, then add traces for slow endpoints, then structured logs for errors.
Production Insight
Metrics alone can't debug a single slow request. They're aggregated, so a 1-second p99 says nothing about the slowest 1% of requests, which could be taking 10 seconds each.
Traces alone can't tell you if a problem is widespread or isolated. Combine metrics (problem exists) with traces (find the cause).
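Rule: Keep 100% of traces for error responses (status >= 400), and 1-10% of successful requests. Because the status is only known once the request finishes, this requires tail-based sampling, where the decision is made after the trace completes, rather than a head-based decision at the first span. This captures all failures without breaking the bank.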
Key Takeaway
Metrics tell you something is wrong. Traces tell you where. Logs tell you why. You need all three to debug effectively.
Start with RED metrics for every service, then add distributed tracing for slow endpoints, then structured logging for error detail.
Rule: Sample 100% of error traces, 1-10% of success traces. Tail-based sampling catches slow requests without storing every success.
APM Telemetry Sampling Strategy
If: Low traffic service (< 10 requests/second)
Use: Sample 100% of requests. Store traces for 7 days, errors for 30 days. Cost is negligible and debugging is easier.
If: Medium traffic service (10-100 requests/second)
Use: Sample 10% of requests, plus 100% of errors. Use probabilistic sampling with consistent probability per trace ID.
If: High traffic service (> 100 requests/second)
Use: Sample 1% of requests, 100% of errors, and 'tail-based' sampling for slow requests (>500ms). Use the OpenTelemetry collector with the tail-sampling processor.
If: Compliance requirement: must have a trace for every transaction
Use: Head-based sampling with probability 1 (100%). Accept higher storage costs. Use a cheaper storage tier (S3) for older traces.
If: Transaction spans multiple services (distributed trace)
Use: Consistent probability sampling based on trace ID. All services must use the same sampling decision to avoid broken traces.
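
Consistent probability sampling based on trace ID can be sketched as follows. The idea is that the decision is a pure function of the trace ID, so every service that sees the same ID reaches the same answer without coordination. This is an illustrative implementation under assumptions (a 32-character hex trace ID, the low 64 bits treated as uniformly random); the OpenTelemetry SDK ships its own ratio-based sampler, and this is not its code.

```java
/** Deterministic head-based sampling keyed on the trace ID. */
final class TraceIdSampler {

    /**
     * Decides from the trace ID alone, so every service that sees the same
     * ID reaches the same decision. Expects a 32-character hex trace ID.
     */
    static boolean shouldSample(String traceId, double ratio) {
        if (ratio >= 1.0) return true;
        if (ratio <= 0.0) return false;
        // Treat the low 64 bits of the ID as a pseudo-random value.
        long low = Long.parseUnsignedLong(traceId.substring(16), 16);
        long nonNegative = low >>> 1;  // keep 63 bits so the value is never negative
        long upperBound = (long) (ratio * Long.MAX_VALUE);
        return nonNegative < upperBound;
    }

    public static void main(String[] args) {
        String traceId = "4bf92f3577b34da6a3ce929d0e0e4736";
        // Two "services" evaluating the same ID always agree.
        System.out.println(shouldSample(traceId, 0.10) == shouldSample(traceId, 0.10));  // true
    }
}
```

The agreement only holds when every service uses the same ratio. If one service samples at 10% and another at 1%, some traces will still arrive with missing spans.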

The RED Method — Rate, Errors, Duration

The RED method (Rate, Errors, Duration) is the standard for service-level monitoring. For every service, track these three metrics, and you'll know instantly whether users are happy.

Rate is the number of requests per second. A sudden drop in rate (traffic falling off a cliff) often means the service is unavailable or rejecting requests. A sudden spike might indicate a DDoS attack or a misconfigured client.

Errors is the proportion of requests that failed: HTTP 5xx, thrown exceptions, timeouts, or any response that doesn't meet your SLO. Track errors both as a raw count and as a percentage of total requests. A slow rise in error rate often indicates resource exhaustion (database connections, memory).

Duration is how long requests take, measured as latency percentiles — p50 (median), p95, p99. p99 is what matters for user experience: 1% of requests are slower than this. Average latency hides outliers: a service could have 1000 requests at 1ms and 1 request at 1000ms, average 2ms, but 0.1% of users had a terrible experience.

Instrument duration with a histogram: bucket boundaries at 1ms, 5ms, 10ms, 50ms, 100ms, 250ms, 500ms, 1000ms, 2500ms, 5000ms, 10000ms. This gives you approximate percentiles without storing every latency value; the estimate is only as precise as the width of the bucket the percentile falls into.
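
To see how a percentile comes out of bucket counts, here is a minimal sketch that finds the bucket containing the target rank and interpolates linearly inside it. It follows the same general idea as Prometheus-style quantile estimation, but it is a simplified illustration, not the Prometheus implementation; the method name and array layout are assumptions.

```java
/** Estimates a quantile from cumulative histogram buckets by linear interpolation. */
final class HistogramQuantile {

    /**
     * @param q           quantile in [0, 1], e.g. 0.99 for p99
     * @param upperBounds bucket upper bounds in ascending order (seconds)
     * @param cumulative  cumulative counts: observations <= upperBounds[i]
     */
    static double estimate(double q, double[] upperBounds, long[] cumulative) {
        long total = cumulative[cumulative.length - 1];
        if (total == 0) return Double.NaN;
        double rank = q * total;
        for (int i = 0; i < upperBounds.length; i++) {
            if (cumulative[i] >= rank) {
                double lower = (i == 0) ? 0.0 : upperBounds[i - 1];
                long below = (i == 0) ? 0 : cumulative[i - 1];
                long inBucket = cumulative[i] - below;
                if (inBucket == 0) return upperBounds[i];
                // Assume observations are spread evenly inside the bucket.
                return lower + (upperBounds[i] - lower) * (rank - below) / inBucket;
            }
        }
        return upperBounds[upperBounds.length - 1];
    }

    public static void main(String[] args) {
        double[] bounds = {0.1, 0.5, 1.0};
        long[] counts = {90, 99, 100};  // 90 requests <= 100ms, 99 <= 500ms, 100 <= 1s
        System.out.println("p50 ~ " + estimate(0.50, bounds, counts) + " s");
        System.out.println("p99 ~ " + estimate(0.99, bounds, counts) + " s");
    }
}
```

In this example the p99 lands at the top of the 100ms to 500ms bucket. Whether the real value was 150ms or 490ms is unknowable from the counts alone, which is why bucket boundaries should be dense around the latencies you alert on.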

Common RED mistakes: measuring only average latency (hides p99 problems), not tracking errors by type (500 internal server error vs 404 not found are very different), and not breaking down rate by endpoint (a drop in /health is fine; a drop in /checkout is a crisis).

io/thecodeforge/apm/REDMetrics.java (Java)
package io.thecodeforge.apm;

import io.prometheus.client.CollectorRegistry;
import io.prometheus.client.Counter;
import io.prometheus.client.Histogram;
import io.prometheus.client.exporter.HTTPServer;

import java.io.IOException;

/**
 * Production RED metrics (Rate, Errors, Duration) using Prometheus.
 *
 * These three metrics are enough to know if a service is healthy
 * from the user's perspective — without looking at CPU or memory.
 */
public class REDMetrics {

    // ─── RATE: Total requests per endpoint ───────────────────────────────────
    // Counter represents requests total. Rate = increase over time.
    private static final Counter requestTotal = Counter.build()
            .name("http_requests_total")
            .labelNames("method", "endpoint", "status")
            .help("Total HTTP requests")
            .register();

    // ─── ERRORS: Error counter (subset of requestTotal) ──────────────────────
    // Track errors separately for easier alerting, but also derived from requestTotal
    private static final Counter errorTotal = Counter.build()
            .name("http_errors_total")
            .labelNames("method", "endpoint", "error_type")
            .help("Total HTTP errors (status >= 500 or exception)")
            .register();

    // ─── DURATION: Request latency histogram ─────────────────────────────────
    // Buckets chosen to capture p50 (5-10ms), p95 (50-100ms), p99 (250-500ms)
    // Adjust buckets based on your service's typical latency.
    private static final Histogram requestDuration = Histogram.build()
            .name("http_request_duration_seconds")
            .labelNames("method", "endpoint")
            .help("HTTP request latency in seconds")
            .buckets(0.001, 0.005, 0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10)
            .register();

    /**
     * Record metrics for a completed request.
     * Call this in your API framework's response filter/middleware.
     */
    public static void recordRequest(String method, String endpoint,
                                      int statusCode, long durationMs) {
        String status = String.valueOf(statusCode);
        requestTotal.labels(method, endpoint, status).inc();

        if (statusCode >= 500) {
            errorTotal.labels(method, endpoint, "http_5xx").inc();
        }

        requestDuration.labels(method, endpoint).observe(durationMs / 1000.0);
    }

    /**
     * Record an exception that wasn't caught by normal status code handling.
     */
    public static void recordException(String method, String endpoint, String exceptionType) {
        errorTotal.labels(method, endpoint, exceptionType).inc();
    }

    /**
     * Start Prometheus metrics endpoint on port 8081 (separate from app port).
     * Scraped by Prometheus every 15 seconds.
     */
    public static void startMetricsServer() throws IOException {
        HTTPServer server = new HTTPServer(8081);
        System.out.println("Prometheus metrics available at http://localhost:8081/metrics");
    }
}
Prometheus Histogram Buckets
The histogram buckets in the code above (0.001, 0.005, 0.01, ...) are chosen to capture typical web latency ranges. For a database service, you might shift to higher buckets (0.01, 0.05, 0.1, 0.5...). For an in-memory cache, lower buckets (0.0001, 0.0005...). Summaries compute quantiles inside each process and cannot be aggregated across instances; histograms can be aggregated and are the usual choice in production, at the cost of percentiles that are only as precise as the bucket boundaries.
Production Insight
Average latency hides problems. A service with 1000 requests at 1ms and 1 at 1000ms has an average of 2ms, but 0.1% of users had a 1000ms experience.
p99 latency is what users actually feel: 1% of requests are slower than p99. Track p50 for trends, p99 for SLOs.
Rule: Set p99 latency alerts at 3x your normal baseline, not an absolute number. A 500ms p99 might be fine for a reporting API but terrible for a checkout endpoint.
Key Takeaway
RED metrics — Rate, Errors, Duration — tell you if users are happy without looking at CPU or memory.
Track p99 latency, not average. Averages hide outliers, and outliers are what users notice.
Rule: Start with RED for every service before adding more detailed metrics. If RED is green, users are happy; if RED is red, start debugging.
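
The baseline-relative alert rule from the Production Insight above (alert at 3x the normal p99, not at an absolute number) can be sketched like this. The class, the method names, and the "breach on every recent sample" window are illustrative assumptions; in practice this logic usually lives in your alerting system's rule language rather than in application code.

```java
/** Baseline-relative latency alerting: compare p99 to a multiple of the healthy baseline. */
final class BaselineAlert {

    /** True when the current p99 exceeds multiplier x the healthy baseline. */
    static boolean isBreaching(double baselineP99Ms, double currentP99Ms, double multiplier) {
        return currentP99Ms > baselineP99Ms * multiplier;
    }

    /**
     * Fires only when each of the most recent `window` samples breaches,
     * which avoids paging on a single noisy data point.
     */
    static boolean shouldAlert(double baselineP99Ms, double[] recentP99Ms,
                               double multiplier, int window) {
        if (recentP99Ms.length < window) return false;
        for (int i = recentP99Ms.length - window; i < recentP99Ms.length; i++) {
            if (!isBreaching(baselineP99Ms, recentP99Ms[i], multiplier)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        double baseline = 200;  // healthy p99 in ms, measured when the system was fine
        double[] samples = {250, 700, 650, 800};
        System.out.println(shouldAlert(baseline, samples, 3.0, 3));  // true: last 3 samples > 600ms
    }
}
```

The same 500ms reading would fire for a checkout endpoint with a 100ms baseline and stay quiet for a reporting API with a 400ms baseline, which is the point of anchoring the threshold to the baseline.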
RED Metrics by Service Type
If: Web API or synchronous service (user waiting for response)
Use: Track p99 latency, error rate, request rate. Alert when p99 > 500ms for 5 minutes; error rate > 1% for 2 minutes.
If: Background job processor or async worker
Use: Track rate (jobs processed), error rate, and job age (time from enqueue to completion). Alert when age > 5 minutes.
If: Database or cache (infrastructure service)
Use: Track query latency p99, connection pool saturation, error rate. Alert when p99 > 100ms for database (with index) or > 5ms for cache.
If: Batch job (cron, ETL)
Use: Track duration (time to completion), error flag (0 or 1), data volume processed. Alert when job takes > 2x baseline duration.
If: Third-party API dependency (downstream call)
Use: Track rate (calls per second), error rate (HTTP 5xx, timeouts), latency p99. Alert when error rate > 5% or p99 > 2 seconds.

Distributed Tracing — Following a Request Across Services

In a monolith, you can find a slow function with a profiler. In microservices, a single request might pass through API gateway → auth service → order service → payment service → inventory service. A 2-second latency could be 100ms in each of 20 services, or 1.9 seconds in a single database query. Distributed tracing tells you which.

A trace is a tree of spans. The root span covers the entire request from client to final response. Child spans cover sub-operations: HTTP calls to downstream services, database queries, cache lookups, even internal function calls.

Key fields: trace ID (same across all spans in a request), span ID (unique per operation), parent span ID (links child to parent), name (operation name: 'GET /products', 'SELECT * FROM orders'), start and end timestamps (duration = end - start), attributes (HTTP method, status code, DB statement), events (logs within a span: 'cache miss', 'retry attempt').
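
The span fields listed above are enough to answer "where did the time go?". The sketch below models a span with those fields and computes each span's self time (its own duration minus its direct children). `SpanData` is a hypothetical shape for illustration, and the calculation assumes children run one after another without overlapping.

```java
import java.util.List;

/** Minimal span model plus a self-time calculation over a flat list of spans. */
final class SpanTree {

    static final class SpanData {
        final String traceId;
        final String spanId;
        final String parentSpanId;  // null for the root span
        final String name;
        final long startMs;
        final long endMs;

        SpanData(String traceId, String spanId, String parentSpanId,
                 String name, long startMs, long endMs) {
            this.traceId = traceId;
            this.spanId = spanId;
            this.parentSpanId = parentSpanId;
            this.name = name;
            this.startMs = startMs;
            this.endMs = endMs;
        }

        long durationMs() {
            return endMs - startMs;
        }
    }

    /** Own duration minus direct children's durations (assumes sequential children). */
    static long selfTimeMs(SpanData span, List<SpanData> all) {
        long childTotal = 0;
        for (SpanData other : all) {
            if (span.spanId.equals(other.parentSpanId)) {
                childTotal += other.durationMs();
            }
        }
        return span.durationMs() - childTotal;
    }

    /** The span that spent the most time in its own work rather than waiting on children. */
    static SpanData slowestBySelfTime(List<SpanData> all) {
        SpanData slowest = null;
        for (SpanData span : all) {
            if (slowest == null || selfTimeMs(span, all) > selfTimeMs(slowest, all)) {
                slowest = span;
            }
        }
        return slowest;
    }

    public static void main(String[] args) {
        List<SpanData> trace = List.of(
                new SpanData("t1", "a", null, "GET /products", 0, 2000),
                new SpanData("t1", "b", "a", "auth check", 10, 110),
                new SpanData("t1", "c", "a", "SELECT * FROM orders", 120, 1970));
        System.out.println(slowestBySelfTime(trace).name);  // SELECT * FROM orders
    }
}
```

The root span has the longest duration, but almost all of it is spent waiting on the query. Sorting by self time instead of total duration points at the child that actually needs fixing.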

Implementation: instrument your HTTP client and server libraries to automatically propagate trace context via headers (W3C Trace-Context standard: traceparent, tracestate). Use OpenTelemetry auto-instrumentation where it exists for your language (Java, Python, and Node.js have mature agents; Go support is more limited), and add manual instrumentation for business-critical spans.
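
For reference, the `traceparent` header has four dash-separated hex fields: version, trace ID (32 hex characters), parent span ID (16), and flags (2), where the lowest flag bit means "sampled". The sketch below parses one and builds the header for an outgoing call. It handles version `00` only and is a hand-rolled illustration; in a real service the OpenTelemetry propagator does this for you.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Parses and rebuilds W3C traceparent headers (version 00 only in this sketch). */
final class TraceParent {

    // version "00" - 32 hex trace-id - 16 hex parent-id - 2 hex trace-flags
    private static final Pattern FORMAT =
            Pattern.compile("00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");

    final String traceId;
    final String parentId;
    final boolean sampled;

    TraceParent(String traceId, String parentId, boolean sampled) {
        this.traceId = traceId;
        this.parentId = parentId;
        this.sampled = sampled;
    }

    static TraceParent parse(String header) {
        Matcher m = FORMAT.matcher(header);
        if (!m.matches()) {
            throw new IllegalArgumentException("Malformed traceparent: " + header);
        }
        String traceId = m.group(1);
        String parentId = m.group(2);
        // All-zero IDs are not valid identifiers.
        if (traceId.matches("0+") || parentId.matches("0+")) {
            throw new IllegalArgumentException("All-zero ID in traceparent: " + header);
        }
        int flags = Integer.parseInt(m.group(3), 16);
        return new TraceParent(traceId, parentId, (flags & 0x01) != 0);
    }

    /** Header for an outgoing call: same trace ID, this service's span ID as the new parent. */
    String forChild(String newSpanId) {
        return "00-" + traceId + "-" + newSpanId + "-" + (sampled ? "01" : "00");
    }

    public static void main(String[] args) {
        TraceParent incoming = parse("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
        System.out.println(incoming.forChild("b7ad6b7169203331"));
    }
}
```

The trace ID is carried through unchanged while the parent ID is replaced at every hop. That is what lets a backend stitch spans from different services into one tree.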

Common tracing mistakes: not propagating trace context across asynchronous boundaries (message queues, background threads) — resulting in broken traces; sampling too aggressively (1% of 1% leaves 0.01% of requests traced); not storing traces long enough (7 days minimum for debugging weekly patterns); and not linking traces to logs (add trace ID to every log line).
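
Two of the mistakes above (losing context on background threads, and logs without a trace ID) share a root cause: the context lives in a thread-local, and a worker thread starts with an empty one. The sketch below shows the failure and the fix using a plain `ThreadLocal` as a stand-in for the tracing library's context. All names here are hypothetical; OpenTelemetry's `Context` offers its own wrapping helpers for this purpose.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Shows how trace context is lost on a worker thread and how wrapping the task restores it. */
final class TraceContextPropagation {

    /** Stand-in for the tracing library's thread-bound context. */
    static final ThreadLocal<String> CURRENT_TRACE_ID = new ThreadLocal<>();

    /** Captures the caller's trace ID now and restores it on whichever thread runs the task. */
    static <T> Callable<T> withTraceContext(Callable<T> task) {
        String captured = CURRENT_TRACE_ID.get();
        return () -> {
            String previous = CURRENT_TRACE_ID.get();
            CURRENT_TRACE_ID.set(captured);
            try {
                return task.call();
            } finally {
                CURRENT_TRACE_ID.set(previous);
            }
        };
    }

    /** Log line that always carries the trace ID, so logs can be joined to traces. */
    static String logLine(String message) {
        return "trace_id=" + CURRENT_TRACE_ID.get() + " " + message;
    }

    /** Runs logLine on a worker thread, with or without propagation, and returns the result. */
    static String logOnWorker(String traceId, boolean propagate) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            CURRENT_TRACE_ID.set(traceId);
            Callable<String> task = () -> logLine("charging card");
            Callable<String> toRun = propagate ? withTraceContext(task) : task;
            return pool.submit(toRun).get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            CURRENT_TRACE_ID.remove();
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(logOnWorker("abc123", false));  // trace_id=null charging card
        System.out.println(logOnWorker("abc123", true));   // trace_id=abc123 charging card
    }
}
```

The unwrapped task logs `trace_id=null`, which is what a broken trace looks like from the logging side. Message queues need the same treatment across processes: write the context into the message headers on the producer and read it back on the consumer.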

io/thecodeforge/apm/DistributedTracing.java (Java)
package io.thecodeforge.apm;

import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Context;
import io.opentelemetry.context.propagation.TextMapGetter;
import io.opentelemetry.context.propagation.TextMapSetter;

import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/**
 * Distributed tracing with context propagation across service boundaries.
 *
 * The key challenge in distributed tracing is propagating the trace context
 * from caller to callee. OpenTelemetry's propagator handles this automatically
 * when using instrumented clients. For custom protocols (message queues),
 * inject the context manually.
 */
public class DistributedTracing {

    private final Tracer tracer;
    private final OpenTelemetry openTelemetry;
    private final HttpClient httpClient;

    public DistributedTracing(OpenTelemetry openTelemetry) {
        this.openTelemetry = openTelemetry;
        this.tracer = openTelemetry.getTracer("api-service");
        this.httpClient = HttpClient.newHttpClient();
    }

    /**
     * Example: calling a downstream service with automatic trace propagation.
     *
     * When using OpenTelemetry-instrumented HTTP client, the trace context
     * is automatically injected into the `traceparent` header.
     * The downstream service extracts it and creates a child span.
     */
    public String callOrderService(String orderId) throws Exception {
        // Start a child span for this HTTP call
        Span httpSpan = tracer.spanBuilder("HTTP POST /orders")
                .setAttribute("order.id", orderId)
                .startSpan();

        try (var scope = httpSpan.makeCurrent()) {
            HttpRequest.Builder requestBuilder = HttpRequest.newBuilder()
                    .uri(java.net.URI.create("http://order-service/api/orders"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString("{\"id\":\"" + orderId + "\"}"));

            // With automatic instrumentation, the `traceparent` header is added
            // for you. Without it, inject the context into the builder before
            // build(), because a built HttpRequest is immutable:
            // openTelemetry.getPropagators().getTextMapPropagator()
            //     .inject(Context.current(), requestBuilder, (b, k, v) -> b.setHeader(k, v));

            HttpRequest request = requestBuilder.build();

            HttpResponse<String> response = httpClient.send(request, HttpResponse.BodyHandlers.ofString());

            httpSpan.setAttribute("http.status_code", response.statusCode());
            return response.body();
        } catch (Exception e) {
            httpSpan.recordException(e);
            throw e;
        } finally {
            httpSpan.end();
        }
    }

    /**
     * Extract trace context from incoming request headers.
     * OpenTelemetry's server instrumentation does this automatically.
     *
     * This is how a service knows it's part of an existing trace
     * rather than starting a new one.
     */
    public void handleIncomingRequest(String traceparentHeader) {
        // Extract context from headers (auto-instrumented frameworks do this)
        TextMapGetter<MapHeaders> getter = new TextMapGetter<>() {
            @Override
            public String get(MapHeaders carrier, String key) {
                return carrier.get(key);
            }

            @Override
            public Iterable<String> keys(MapHeaders carrier) {
                return carrier.keys();
            }
        };

        Context extractedContext = openTelemetry.getPropagators().getTextMapPropagator()
                .extract(Context.current(), new MapHeaders(traceparentHeader), getter);

        // Start a span as child of extracted context
        Span span = tracer.spanBuilder("handle request")
                .setParent(extractedContext)
                .startSpan();

        try (var scope = span.makeCurrent()) {
            // Process the request here
            System.out.println("Processing request with trace ID: " + span.getSpanContext().getTraceId());
        } finally {
            span.end();
        }
    }

    // Helper class for header propagation example
    static class MapHeaders {
        private final java.util.Map<String, String> headers = new java.util.HashMap<>();
        MapHeaders(String traceparent) { headers.put("traceparent", traceparent); }
        String get(String key) { return headers.get(key); }
        Iterable<String> keys() { return headers.keySet(); }
    }
}
Trace Context Propagation is Non-Negotiable
Without automatic trace context propagation, your traces will be broken: each service creates a new trace root. Use W3C Trace-Context headers (traceparent, tracestate). OpenTelemetry auto-instrumentation handles this for HTTP, gRPC, and many database clients. For message queues (Kafka, RabbitMQ), check whether your client library is covered by auto-instrumentation; where it is not, inject the context into the message headers manually on the producer side, then extract it on the consumer side.
Production Insight
A trace without context propagation is just a log of each service's independent timings. You can't see the full request journey.
The W3C Trace-Context standard (traceparent header) is supported by all major tracing backends. Use it, not proprietary formats.
Rule: Test trace propagation in staging. Make a request that spans 3 services and verify the same trace ID appears in all service logs.
Key Takeaway
Distributed tracing shows you where time is spent across service boundaries — database, cache, RPC, external API.
Without trace context propagation, each service starts a new trace, and you lose the end-to-end view.
Rule: Use OpenTelemetry auto-instrumentation for HTTP clients/servers. For message queues, propagate trace context in message headers manually.
Tracing Sampling Decisions
If: Debugging an intermittent production issue that happens to 0.1% of requests
Use: Increase sampling rate to 10% temporarily. Change back after the issue is resolved. Use remote configuration (without redeploy).
If: Compliance requires a full audit trail for every transaction
Use: Sample 100% of traces. Use 'head-based' sampling with probability 1. Accept storage costs. Archive older traces to cold storage.
If: Storage costs are a concern, but you need to debug slow requests
Use: 'Tail-based' sampling: collect 100% of traces, but only store those with errors or duration > 500ms (configured in the OpenTelemetry collector).
If: You need to debug a specific user or session
Use: 'Request-id' based conditional sampling. Extract the user ID from the request header; if the user is in the debug list, sample 100%.
If: Tracing overhead is affecting production latency (rare; more than 5% overhead)
Use: Reduce sampling rate. Use a lighter propagator. Use an async span processor. Sample 100% of errors, lower success sampling.
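
The tail-based row above boils down to a keep-or-drop decision made once a trace has finished. Here is a minimal sketch of that policy. The 500ms cutoff comes from the table; the class name, the hash-based baseline share, and the parameter names are illustrative assumptions, and in a real deployment this decision is configured in the collector rather than written by hand.

```java
/** Keep-or-drop decision applied after a trace has completed (tail-based sampling). */
final class TailSamplingPolicy {

    static final long SLOW_THRESHOLD_MS = 500;

    /**
     * Keeps every errored trace and every slow trace, plus a small
     * deterministic share of the rest for baseline comparison.
     */
    static boolean keep(boolean hasError, long durationMs, String traceId, double baselineRatio) {
        if (hasError) return true;
        if (durationMs > SLOW_THRESHOLD_MS) return true;
        // Map the trace ID onto 10,000 buckets and keep the lowest share of them.
        int bucket = Math.floorMod(traceId.hashCode(), 10_000);
        return bucket < baselineRatio * 10_000;
    }

    public static void main(String[] args) {
        System.out.println(keep(true, 12, "trace-1", 0.0));    // true: error
        System.out.println(keep(false, 900, "trace-2", 0.0));  // true: slow
        System.out.println(keep(false, 40, "trace-3", 0.0));   // false: healthy, no baseline share
    }
}
```

The trade-off is that every span has to be buffered until the trace completes, so the collector needs enough memory to hold in-flight traces for the full decision window.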

Why APM Matters in DevOps — The Fire Triangle

You can't fix what you can't see. That's the whole argument for APM in one sentence. DevOps is about closing the loop between code commit and production behavior. Without real-time visibility into how your application actually runs, you're flying blind.

APM isn't a dashboard for the operations team to stare at. It's the feedback mechanism that tells you whether your last deployment actually improved anything — or if you just swapped one bottleneck for another. When a user reports slowness, APM answers the three questions that matter: What's slow? Where is the slowness happening? Why is it happening now?

Most teams don't fail because they lack monitoring. They fail because they monitor the wrong things. CPU usage is a distraction. You need to track the metrics that correlate directly with user experience — response time, error rate, and saturation. Everything else is noise. APM forces you to focus on what actually breaks the user's day.

ApminDevOps.yml · YAML
# io.thecodeforge - devops tutorial

# Minimal APM config for a payment service
# Track the customer-impacting metrics, not the server noise

apm:
  service_name: payment-gateway
  environment: production
  metrics:
    - response_time_p99      # Slowest 1% of requests
    - error_rate_percent     # 4xx/5xx vs total requests
    - apdex_score            # User satisfaction threshold
  alerting:
    # Fire when p99 crosses 2 seconds for 5 minutes
    - rule: p99_latency_high
      condition: "p99 > 2000ms for 5m"
      severity: critical
      oncall: sre-team
    - rule: error_rate_spike
      condition: "error_rate > 5% for 1m"
      severity: warning
      oncall: dev-oncall
Output
Payment gateway metrics firing:
- p99_latency_high: ACTIVE (p99=2400ms, duration=7m)
- error_rate_spike: OK (2.1%, below threshold)
Production Trap:
Don't alert on CPU or memory in isolation. They're symptoms, not causes. Alert on the user-facing metrics first — response time and error rate — then let those alerts guide you to the infrastructure root cause.
Key Takeaway
APM exists to answer 'is my code making users unhappy?' — not 'is my server running?'

Core Components of Modern APM — The Parts That Actually Matter

Modern APM is not one thing. It's four layers that stack together, and if you skip any of them, you're working with incomplete data.

First: End-user Experience Monitoring (EUEM). This is your synthetic transactions and real-user monitoring. It captures how actual humans experience your app — page load times, click-to-response latency, client-side errors. Without it, you might think the backend is healthy while users are staring at a blank screen because a JavaScript bundle broke.

Second: Application Runtime Architecture. This is where you instrument your code — the database calls, external API calls, thread pools, and memory allocation. You're measuring what your code actually does at runtime. Not what you think it does. Not what the code review suggested. What it really does. This is where you find the N+1 queries, the unbounded retry loops, and the object allocation that triggers GC pauses.

Third: Infrastructure Monitoring. You need to know what's happening at the OS and container level — CPU, memory, disk I/O, network. But here's the trick: infrastructure data is only useful when correlated with application data. A CPU spike during normal traffic means something totally different from a CPU spike during a traffic surge. Don't look at infrastructure in isolation.

Fourth: Transaction Tracing and Dependency Mapping. This is the map of every service call your application makes. It shows you the path of a single request across services, databases, queues, and caches. Without this, you can't tell if the payment service is slow because the database is slow, or because the fraud-check service is timing out.

ApmStackExample.yml · YAML
# io.thecodeforge - devops tutorial

# APM stack configuration for a microservices deployment
# Each component runs in its own namespace

components:
  euem:
    tool: rum-sensor
    enabled: true
    capture_requests: true
    capture_errors: true
  runtime:
    instrumentations:
      - python: "0.12.0"  # Auto-instrument all Python services
      - nodejs: "0.9.2"
      - java: "0.15.1"
  infrastructure:
    agent: node-exporter
    enabled: true
    metrics:
      - cpu_usage
      - memory_rss
      - disk_iops
  tracing:
    sampler: probabilistic
    sampling_rate: 0.1  # Trace 10% of requests in prod
    export: otlp-http
Output
Layer status after deployment:
- EUEM: 3000 synthetic checks/min, 0.2% error rate
- Runtime: 12 instrumentations active, 59 ms overhead per request
- Infrastructure: node-exporter collecting 47 metrics
- Tracing: 10% sampling, 150 traces exported per second
Senior Shortcut:
Always start with runtime instrumentation. If you don't know what your code is doing at the method level, you're guessing. Everything else is secondary.
Key Takeaway
APM is four layers working together — omit any one and you're flying with partial instruments.

Essential APM Metrics — The Only Ones That Survive an Incident

Every monitoring tool lets you create 500 dashboards. Most teams end up with 500 dashboards and zero actionable insight. Here's the short list of metrics that matter when a P1 hits.

Latency (p50, p95, p99): p50 tells you what the typical user experiences. p95 tells you about the edge cases. p99 tells you about the outliers that will get you a call at 3 AM. If p99 is 10x p50, you have a long-tail latency problem — probably a bad cache hit ratio or a slow external dependency.

Error Rate: Track as a percentage of all requests, not an absolute count. A spike from 0.1% to 1% is a 10x increase. Your alerting should catch that. But you also need error budgets — a way to say "we can tolerate X% errors for Y time before we page someone." Without error budgets, your on-call will be paged for every single 500 error from a load balancer health check. Don't be that team.

Saturation: This is the hard one. It measures how close your system is to its limit. For a database, it's connection pool usage. For a queue, it's message backlog. For a CPU, it's run queue depth. Saturation is a leading indicator of failure. When saturation hits 80% of capacity, you have minutes to react before performance collapses. If you wait until latency spikes to act, you've already lost.

Throughput: Requests per second. It's the denominator for all your rate calculations. Throughput dropping suddenly usually means something upstream is failing. Throughput spiking might be a DDoS or a misconfigured retry loop. Throughput trending up over weeks means you need to scale. All three are useful, but none tells the whole story alone.

EssentialMetrics.yml · YAML
# io.thecodeforge - devops tutorial

# Essential metrics config for an API gateway
# Only four metrics, but correlated across all services

metrics:
  latency:
    dashboard: "p50: 120ms | p95: 350ms | p99: 800ms"
    alarm: p99 > 1500ms for 3 minutes
  error_rate:
    dashboard: 0.3% of requests (about 4,500/hour at 420 req/s)
    error_budget: 99.9% uptime = 0.1% errors/month
    alarm: rate > 1% for 1 minute
  saturation:
    dashboard: "db_conn_pool: 67% (134/200 connections)"
    alarm: saturation > 80% for 5 minutes
  throughput:
    dashboard: 420 req/s (rolling 1m average)
    alarm: throughput drops > 50% in 1 minute
Output
Live dashboard snapshot:
| Metric | Current | Alarm Threshold | Status |
|--------------|-----------|-----------------|--------|
| p99 Latency | 820ms | > 1500ms for 3m | OK |
| Error Rate | 0.3% | > 1% for 1m | OK |
| Saturation | 67% | > 80% for 5m | OK |
| Throughput | 420 req/s | > 50% drop | OK |
Production Trap:
Don't measure p99 latency on a single node. Measure it per service endpoint. An API that returns 'user not found' in 2ms will hide the endpoint that takes 5 seconds to fetch the report. Always segment latency by endpoint.
Key Takeaway
Four metrics — latency, error rate, saturation, throughput — cover 80% of production incidents. The other 20% require distributed tracing to untangle.
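The error-budget idea from the Error Rate paragraph is simple arithmetic, and seeing it worked through helps when picking paging thresholds. The sketch below is a minimal illustration: the `error_budget` function and its field names are hypothetical, and the request counts are made-up inputs, not measurements.

```python
def error_budget(slo_pct, total_requests, failed_requests):
    """Return budget figures for one measurement window.

    slo_pct is the success target, e.g. 99.9 leaves a 0.1% error budget.
    burn_rate above 1.0 means the budget runs out before the window ends
    if the current error rate continues.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    budget_fraction = 1 - slo_pct / 100
    allowed_failures = budget_fraction * total_requests
    observed_rate = failed_requests / total_requests
    return {
        "allowed_failures": allowed_failures,
        "budget_used": failed_requests / allowed_failures,
        "burn_rate": observed_rate / budget_fraction,
    }


# A month with 1,000,000 requests and 300 failures against a 99.9% SLO.
month = error_budget(99.9, 1_000_000, 300)

# One minute at 420 req/s (25,200 requests) with 1% of them failing.
spike = error_budget(99.9, 25_200, 252)

print(round(month["budget_used"], 3), round(spike["burn_rate"], 1))
```

A slow burn like the first case is a ticket. A burn rate of roughly 10x like the second is the kind of event worth a page.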

Kubernetes: Where APM Becomes a Fire Hose

You don't monitor Kubernetes—you monitor what runs on it. The platform is just a noisy scheduler. Your APM must answer: which pod, which node, which container version caused the p99 spike?

Forget instrumenting every replica by hand. Use eBPF to capture network flows, or pull telemetry from a service mesh such as Istio or Linkerd. Either gives you per-pod latency, error rates, and traffic patterns without touching application code.

The real win: correlate a Kubernetes rollout with an APM anomaly. When your deploy triggers a 5xx storm, the trace should show the new pod's image tag and resource limits. Anything less is guessing. Production demands pod-level granularity, not cluster-level averages.

k8s-apm-instrumentation.yml · YAML
# io.thecodeforge - devops tutorial

apiVersion: v1
kind: Pod
metadata:
  name: payment-service-v2
  labels:
    app: payment
    version: "2.1.0"
spec:
  containers:
  - name: payment
    image: payment:2.1.0
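    env:
    # $(VAR) only expands variables defined earlier in this list,
    # so POD_NAME and NODE_NAME must come before OTEL_RESOURCE_ATTRIBUTES.
    - name: POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
    - name: NODE_NAME
      valueFrom:
        fieldRef:
          fieldPath: spec.nodeName
    - name: OTEL_RESOURCE_ATTRIBUTES
      value: "k8s.pod.name=$(POD_NAME),k8s.node.name=$(NODE_NAME)"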
Output
Pod payment-service-v2 created with OpenTelemetry resource attributes.
APM traces now include pod and node identity for drill-down.
Senior Shortcut:
Skip manual instrumentation of every microservice. Use eBPF-based auto-instrumentation (e.g., Pixie, Grafana Beyla) to capture traffic and traces without code changes. Saves months of instrumenting legacy services.
Key Takeaway
APM without pod-level identity is just noise. Always tag traces with Kubernetes metadata.

Real-Time Alerting: Stop Paging at 3 AM for Nothing

Most alerts are someone else's problem—misconfigured thresholds, static baselines, or plain noise. Real-time alerting means acting on data fresh enough to matter, with context that tells you what to do.

Your pipeline: instrument -> aggregate -> window -> evaluate -> notify. The window matters most. Use sliding windows of 1-5 minutes for error rates, 30-60 seconds for latency spikes. Static thresholds are dead. Use dynamic baselines from the last 7 days, same hour.

When the page comes, it must include: service name, trace ID, pod/node, and a link to the span. If your alert says "Error rate high" with no context, it's worse than no alert. Production teams route to Slack with a runbook attachment or they discard the channel.

alerting-rules.yml · YAML
# io.thecodeforge - devops tutorial

groups:
- name: apm-critical
  rules:
  - alert: HighErrorRate
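    expr: |
      sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
        /
      sum by (service) (rate(http_requests_total[5m])) > 0.02
    for: 3m
    labels:
      severity: critical
      team: platform
    annotations:
      summary: "Service {{ $labels.service }} error rate > 2%"
      runbook: "https://runbooks.thecodeforge.io/apm/error-spike"
      # trace_id is not a metric label and would not survive the sum by (service),
      # so link to a trace search instead. Placeholder URL, not a real endpoint.
      traces: "https://tracing.example.com/search?service={{ $labels.service }}"
Output
Alert fires when service error rate exceeds 2% for 3 minutes.
Runbook link and trace search link included in notification payload.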
Production Trap:
Never alert on raw counts. Always use rates and error ratios. Ten errors per second is a 1% error rate at 1,000 req/s and a 50% error rate at 20 req/s; only the ratio tells you whether the SLO is breached.
Key Takeaway
An alert without a runbook and trace link is just noise. Only page when context fits in the notification.
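The dynamic-baseline rule described in this section (compare now against the same hour over the last 7 days) can be sketched as a small check. This is a hedged illustration, not a feature of any particular alerting tool: `breaches_baseline`, the 3x `factor`, and the `floor_ms` guard are assumptions chosen for the example.

```python
from statistics import median


def breaches_baseline(current_p99_ms, baseline_p99_ms, factor=3.0, floor_ms=50.0):
    """Return True when the current p99 exceeds factor x the baseline median.

    baseline_p99_ms holds one p99 value per day for the same hour over the
    last 7 days. floor_ms stops a very fast service from paging on noise,
    such as a jump from 2ms to 7ms.
    """
    if not baseline_p99_ms:
        return False  # no history yet, so there is nothing to compare against
    threshold = max(factor * median(baseline_p99_ms), floor_ms)
    return current_p99_ms > threshold


same_hour_last_week = [20, 22, 19, 21, 20, 23, 20]  # p99 in ms, one per day

print(breaches_baseline(500, same_hour_last_week))
print(breaches_baseline(45, same_hour_last_week))
```

The median is used instead of the mean so that one bad day in the history does not inflate the baseline and mask the next incident.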

Log Aggregation: The APM Lie Everyone Believes

APM tools sell you on traces as the savior. The lie: traces make logs obsolete. The truth: no trace tells you why a payment failed—just that it did. Logs hold the stack trace, the user ID, the exact SQL query. Aggregating them is how you fix incidents.

Centralize logs in Elasticsearch, Loki, or CloudWatch. Standardize on structured JSON with a schema: timestamp, level, service, trace_id, message. The trace_id is the bridge—connect log lines to APM spans.

Use LogQL or KQL to search across environments in seconds. When the p99 latency spikes, find the spans with duration > 2s and pull the database log lines that carry the same trace_id. Logs are raw evidence. APM is the map. You need both to survive incident response.

log-aggregation-schema.yml · YAML
# io.thecodeforge - devops tutorial

log_schema:
  version: "1.0"
  fields:
  - name: timestamp
    type: rfc3339
    required: true
  - name: level
    type: enum
    values: [DEBUG, INFO, WARN, ERROR, FATAL]
  - name: service
    type: string
  - name: trace_id
    type: string
    required: true
  - name: span_id
    type: string
  - name: message
    type: string
    required: true
  - name: error
    type: object
    properties:
      type: string
      stack: string
  - name: metadata
    type: object
    description: "Arbitrary key-value pairs for context"
  retention: 30 days
Output
Centralized log schema enforced across all services.
Trace_id enables cross-referencing between logs and APM spans.
Senior Shortcut:
Enforce a logging schema at the application framework level (e.g., Winston, Log4j, or slog). Never parse logs—always emit structured JSON. Parsing logs in your aggregation pipeline is tech debt that will kill you during an outage.
Key Takeaway
Traces tell you where. Logs tell you why. Never debug an incident without both.
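To make the schema concrete, here is a minimal sketch of emitting it from Python's standard `logging` module. `JsonFormatter` and `build_logger` are hypothetical helper names, the trace and span IDs are example values, and in a real service they would come from the active span rather than being passed by hand.

```python
import json
import logging
import sys
from datetime import datetime, timezone

# Python names two levels differently from the schema's enum.
LEVEL_NAMES = {"WARNING": "WARN", "CRITICAL": "FATAL"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object matching the log schema fields."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": LEVEL_NAMES.get(record.levelname, record.levelname),
            "service": self.service,
            "trace_id": getattr(record, "trace_id", ""),
            "span_id": getattr(record, "span_id", ""),
            "message": record.getMessage(),
        }
        if record.exc_info and record.exc_info[0] is not None:
            entry["error"] = {
                "type": record.exc_info[0].__name__,
                "stack": self.formatException(record.exc_info),
            }
        return json.dumps(entry)


def build_logger(service, stream):
    """Return a logger that writes one JSON line per event to stream."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.propagate = False
    return logger


log = build_logger("payment-gateway", sys.stdout)
log.error(
    "charge declined",
    extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "span_id": "00f067aa0ba902b7"},
)
```

Because every line carries `trace_id`, a query for that one value returns the log lines for exactly the request you are looking at in the trace view.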
● Production incident · POST-MORTEM · severity: high

The Silent N+1 Query That Killed Black Friday

Symptom
Product page latency p99 jumped from 200ms to 4 seconds within 10 minutes of peak traffic. CPU on app servers stayed below 30%. Database CPU spiked to 95%. No alerts fired because the CPU alert (threshold 80%) watched the app servers, and they looked fine. Customers started abandoning carts. The team saw the latency spike in dashboards but couldn't find the root cause.
Assumption
The team assumed the product page was efficient because it performed well in load tests with 10 reviews per product. They didn't test with 100 reviews. They also assumed high latency meant slow code in the app server — not the database — because app server CPU was low.
Root cause
The product page code fetched the product object, then looped through each review ID and executed a separate SELECT query. That's 1 query for the product + N queries for N reviews. At 100 reviews per product, that's 101 database round trips. Under peak load, database connection pool saturated, queries queued, and latency exploded. The ORM's default eager loading was disabled, and no one noticed because staging data had only 2-3 reviews per product.
Fix
Changed the code to use a single JOIN query: SELECT * FROM products LEFT JOIN reviews ON products.id = reviews.product_id WHERE products.id = ?. Added Review as an embedded collection on the Product object using the ORM's eager loading feature. Added an APM custom span around the database query to measure its contribution to total latency. Deployed a migration to add an index on reviews.product_id. After the fix, page latency dropped to 150ms even at peak traffic, and database CPU dropped to 25%.
Key lesson
  • N+1 queries are invisible in app server CPU metrics — the app server waits for the database, so its CPU stays low. Always monitor database query count and latency per endpoint.
  • Load test with realistic data volumes. A product page with 2 reviews behaves nothing like a page with 200 reviews. Use production data size in staging.
  • APM should trace database queries per request. A sudden increase in 'SELECT * FROM reviews WHERE product_id = ?' call count is a smoking gun for N+1.
  • Set up alerts on p99 latency per endpoint, not just CPU. A 400% latency increase with flat CPU points directly at database or external dependencies.
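The query-count difference behind this incident is easy to reproduce. The sketch below uses an in-memory SQLite database as a stand-in for the real one and counts SELECT statements with `set_trace_callback`. The table layout and function names are assumptions for illustration, and the review IDs are passed in as if they had arrived with the product record.

```python
import sqlite3


def make_db(review_count):
    """Create one product with review_count reviews in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE reviews (id INTEGER PRIMARY KEY, product_id INTEGER, body TEXT);
        CREATE INDEX idx_reviews_product_id ON reviews(product_id);
    """)
    conn.execute("INSERT INTO products (id, name) VALUES (1, 'widget')")
    conn.executemany(
        "INSERT INTO reviews (product_id, body) VALUES (1, ?)",
        [(f"review {i}",) for i in range(review_count)],
    )
    conn.commit()
    return conn, list(range(1, review_count + 1))


def count_selects(conn, fn):
    """Run fn(conn) and return (result, number of SELECT statements executed)."""
    statements = []
    conn.set_trace_callback(statements.append)
    try:
        result = fn(conn)
    finally:
        conn.set_trace_callback(None)
    selects = sum(1 for sql in statements if sql.lstrip().upper().startswith("SELECT"))
    return result, selects


def page_n_plus_one(conn, review_ids):
    """One query for the product, then one query per review."""
    product = conn.execute("SELECT id, name FROM products WHERE id = ?", (1,)).fetchone()
    reviews = [
        conn.execute("SELECT body FROM reviews WHERE id = ?", (rid,)).fetchone()[0]
        for rid in review_ids
    ]
    return product[1], reviews


def page_joined(conn):
    """One round trip: product and reviews fetched with a single LEFT JOIN."""
    rows = conn.execute(
        "SELECT p.name, r.body FROM products p "
        "LEFT JOIN reviews r ON r.product_id = p.id WHERE p.id = ?",
        (1,),
    ).fetchall()
    return rows[0][0], [body for _, body in rows if body is not None]


conn, review_ids = make_db(100)
(_, slow_reviews), n_plus_one_selects = count_selects(
    conn, lambda c: page_n_plus_one(c, review_ids)
)
(_, fast_reviews), joined_selects = count_selects(conn, page_joined)
print(n_plus_one_selects, joined_selects)  # prints "101 1"
```

Per-request query count is the number to put on a dashboard: it grows with data volume in the N+1 version and stays at 1 in the JOIN version, long before latency shows the problem.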
Production debug guide · Symptom → Action mapping for common performance failures · 5 entries
Symptom · 01
High latency, app server CPU low, database CPU high
Fix
Classic N+1 query or inefficient database access pattern. Check APM traces for per-request query count. Look for loops executing SELECT statements inside a request. Use database slow query log to identify expensive queries. Add missing indexes.
Symptom · 02
High latency, app server CPU high, database CPU normal
Fix
Application code is the bottleneck, not the database. Use a profiler (async-profiler, py-spy) to find CPU hot methods. Check for inefficient loops, serialisation overhead (JSON parsing), or regex backtracking. Consider caching expensive computations.
Symptom · 03
Latency spikes every hour like clockwork
Fix
Likely a cron job, cache expiry, or batch process. Check scheduled jobs running at that time. Look for a cache stampede (multiple requests recomputing the same cache entry simultaneously). Add jitter to scheduled tasks. Use a lock so that only one request recomputes an expired cache entry.
Symptom · 04
Latency increases linearly with number of users
Fix
Shared resource bottleneck: database connection pool, thread pool, or external API rate limit. Check connection pool size vs active connections. If maxed out, requests queue. Increase pool size or reduce connection hold time. Check thread pool saturation.
Symptom · 05
Latency high for first request after deploy, then improves
Fix
Cold start or lack of connection warm-up. Database connection pools, caches, and JIT compilation need warm-up after deployment. Send synthetic traffic before opening to users. Use a health check endpoint that exercises critical paths.
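The lock-based fix for the cache stampede in Symptom 03 can be sketched as follows. `SingleFlightCache` is a hypothetical single-process helper built on `threading.Lock`; a multi-instance deployment would need a distributed lock instead, which this sketch does not cover.

```python
import threading
import time


class SingleFlightCache:
    """Cache one value so that only one thread recomputes it after expiry."""

    def __init__(self, ttl_seconds, loader):
        self._ttl = ttl_seconds
        self._loader = loader
        self._lock = threading.Lock()
        self._value = None
        self._expires_at = float("-inf")  # forces a load on first access

    def get(self):
        # Fast path: the cached value is still fresh, no locking needed.
        if time.monotonic() < self._expires_at:
            return self._value
        with self._lock:
            # Re-check under the lock: another thread may have refreshed
            # the value while this one was waiting.
            if time.monotonic() >= self._expires_at:
                self._value = self._loader()
                self._expires_at = time.monotonic() + self._ttl
            return self._value


load_calls = []


def slow_report():
    """Stand-in for an expensive query or computation."""
    load_calls.append(1)
    time.sleep(0.05)
    return "report"


cache = SingleFlightCache(ttl_seconds=60, loader=slow_report)
threads = [threading.Thread(target=cache.get) for _ in range(8)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(len(load_calls))  # prints 1: eight concurrent readers, one recomputation
```

Without the lock, all eight threads would run `slow_report` at once, which is the hourly spike described above.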
★ APM Debug Cheat Sheet · Fast diagnostics for production performance issues. Run these commands at the first sign of slowness.
Slow API endpoint — can't tell if it's code, database, or network
Immediate action
Look at the distributed trace to break down latency by component
Commands
curl -s 'http://jaeger-query:16686/api/traces?service=my-api&limit=20' | jq '.data[].spans[] | {operationName, duration}'
kubectl exec jaeger-query -- wget -qO- 'http://localhost:16686/api/traces?service=api&limit=1' | jq '.data[0].spans[].duration'
Fix now
If database span dominates, add index or reduce query count. If HTTP client span dominates, check external API latency. If duration is in 'code' span, profile the application.
App server CPU at 100%, database CPU normal, latency spiking
Immediate action
Profile the application to find CPU hot spots
Commands
asprof -d 30 -f /tmp/flamegraph.html <pid>
top -H -p <pid>
Fix now
Look for busy loops, inefficient regex, JSON parsing in hot path, or logging at DEBUG level. Offload CPU-heavy work to background threads.
Database CPU at 100%, app server CPU normal, slow queries in log
Immediate action
Check for missing indexes, inefficient joins, or N+1 patterns
Commands
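SELECT query, calls, total_exec_time FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10; (PostgreSQL 13+, requires the pg_stat_statements extension) to find top queries by total time, then EXPLAIN ANALYZE the slow query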
SHOW PROCESSLIST; (MySQL) to see currently running queries
Fix now
Add missing indexes on WHERE/JOIN columns. Rewrite query to avoid full table scan. Use connection pool monitoring to check for leak.
Memory usage growing over time — suspected leak
Immediate action
Capture heap dump and analyse for retained objects
Commands
jmap -dump:live,format=b,file=/tmp/heap.hprof <pid>
jcmd <pid> GC.heap_info
Fix now
Open heap dump in Eclipse MAT / VisualVM. Look for objects with high 'retained heap'. Check for event listeners not unregistered, caches without eviction, or thread locals not cleared.
Latency p99 spikes but p50 is fine — 1% of requests are very slow
Immediate action
Check if slow requests share a pattern: specific user, specific data, or specific time
Commands
sed -n 's/.*duration=\([0-9]*\)ms.*/\1 &/p' /var/log/api.log | sort -n | tail -20
promtool query instant http://prometheus:9090 'histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))'
Fix now
Add more trace sampling for slow requests. Check for large payloads, deep pagination, or data skew. Implement request timeouts to fail fast.
RED Method Metrics by Service Type
| Service Type | Rate (R) | Errors (E) | Duration (D) | Key Alert |
|--------------|----------|------------|--------------|-----------|
| Web API (user-facing) | Requests/sec per endpoint | HTTP 5xx rate, exception rate | p99 latency per endpoint | p99 > 500ms for 5 minutes |
| Background Worker | Jobs processed/sec | Failed job rate | Job age (time from enqueue to completion) | Job age > 5 minutes |
| Database | Queries/sec | Deadlock rate, connection errors | p99 query latency | p99 > 100ms (if indexed properly) |
| Cache (Redis, Memcached) | Operations/sec (GET, SET) | Error rate, miss rate | p99 operation latency | p99 > 5ms or miss rate > 20% |
| Message Queue (Kafka) | Messages published/sec, consumed/sec | Consumer lag (offset difference) | Produce latency, consume latency | Lag > 10,000 messages for 10 minutes |
| Third-party API | Calls/sec | HTTP 5xx, timeout rate | p99 response time | Error rate > 5% or p99 > 2 seconds |

Key takeaways

1. APM gives you three telemetry types: metrics (aggregated numbers), traces (single request journeys), and logs (discrete events). All three are necessary for efficient debugging.
2. The RED method (Rate, Errors, Duration) is the standard for service-level monitoring. Track p99 latency, not average: averages hide outliers, and outliers are what users notice.
3. Distributed tracing shows you where time is spent across service boundaries. Without trace context propagation (W3C Trace Context), each service starts a new trace and you lose the end-to-end view.
4. Alert on p99 latency and error rate, not CPU usage. A 90% CPU alert fires while users are happy (pre-computed cache) and misses when a slow database query makes users wait (low CPU, high latency).
5. Tail-based sampling captures all slow and failed requests without storing every successful trace. Sample 100% of errors, 100% of requests > 500ms, and 1% of normal requests.

Common mistakes to avoid

5 patterns
×

Alerting on CPU usage and ignoring latency

Symptom
CPU alert at 90% fires, but users are happy (pre-computed cache warmed up). Latency alert would have been green. Later, database query slows down to 4 seconds, CPU stays at 20% — no alert fires, users complain.
Fix
Alert on p99 latency and error rate, not CPU. CPU is a resource metric for capacity planning, not user experience. Use the RED method: if users are happy (low latency, low errors), CPU can be 95% and it's fine. If users are unhappy (high latency), CPU can be 10% and you need to investigate database/external dependencies.
×

Monitoring average latency instead of p99

Symptom
Average latency dashboard shows 50ms, green. But p99 is 5 seconds — 1% of users have terrible experience. The product manager sees 'green' and doesn't understand why support tickets about slowness keep coming.
Fix
Always monitor latency percentiles: p50 for trends, p95 for most users, p99 for worst-case experience. Average hides outliers. Use Prometheus histogram with histogram_quantile(0.99, rate(...)) or a dedicated APM tool.
×

No baseline — alert threshold is absolute, not relative

Symptom
Alert set at 'p99 latency > 1 second'. During normal operation, p99 is 20ms, so alert never fires. One day, p99 rises to 500ms — still under 1 second, so no alert, but users are already unhappy because normal was 20ms.
Fix
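Alert threshold should be relative to baseline: p99 latency > 3x normal for 5 minutes. Compare against history (for example, the same window a week earlier using the PromQL offset modifier) or use an external anomaly detection tool. Prometheus predict_linear only extrapolates a trend, so it suits slowly filling resources such as disk rather than latency anomalies. For absolute thresholds, set them at 2-3x your SLO, not at a number that sounds reasonable.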
×

Not tail-sampling slow requests

Symptom
Sampling rate is 1% to save costs. A slow path that hits 0.1% of requests is traced only 1% of the time, so just 0.001% of all requests (1% of 0.1%) yield a slow trace. You almost never see it in traces, and debugging takes weeks.
Fix
Use tail-based sampling: store 100% of traces in the collector, but only export those with errors or duration > 500ms. OpenTelemetry collector supports tail_sampling processor. This captures all slow and failed requests at near-zero storage cost for fast ones.
×

Monitoring only aggregated service-level metrics, not per-endpoint

Symptom
Service p99 latency is 50ms (green). But /checkout endpoint p99 is 2 seconds (red). The slow endpoint's traffic is diluted by fast /health checks and /products calls. No one notices until checkout fails during a sale.
Fix
Break down metrics by endpoint, especially for user-facing operations. The /health endpoint should be monitored separately from /checkout. Use custom buckets per critical endpoint. Alert on high-latency endpoints even if service average is green.
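The dilution effect is easy to see with numbers. The sketch below uses made-up traffic (700 health checks, 295 product calls, 5 checkouts) and a nearest-rank percentile. The function names and the data are illustrative assumptions, not measurements from a real service.

```python
import math
from collections import defaultdict


def p99(values):
    """Nearest-rank 99th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(99 * len(ordered) / 100))
    return ordered[rank - 1]


def p99_by_endpoint(requests):
    """Group (endpoint, duration_ms) pairs and compute p99 per endpoint."""
    grouped = defaultdict(list)
    for endpoint, duration_ms in requests:
        grouped[endpoint].append(duration_ms)
    return {endpoint: p99(durations) for endpoint, durations in grouped.items()}


requests = (
    [("/health", 5)] * 700
    + [("/products", 40)] * 295
    + [("/checkout", 2000)] * 5
)

service_p99 = p99([duration for _, duration in requests])
per_endpoint = p99_by_endpoint(requests)

print(service_p99)   # prints 40: the service-level view looks healthy
print(per_endpoint)  # /checkout shows 2000 once it is split out
```

Checkout is only 0.5% of traffic here, so it never reaches the service-wide p99. Segmenting by endpoint is what makes it visible.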
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
Explain the difference between p99 latency and average latency — and why...
Q02 · SENIOR
Walk me through how you would debug a sudden increase in p99 latency fro...
Q03 · SENIOR
What is the difference between metrics, traces, and logs? Give a scenari...
Q04 · SENIOR
How would you design an alerting strategy for a new microservice? What m...
Q01 of 04 · SENIOR

Explain the difference between p99 latency and average latency — and why p99 matters more for user experience.

ANSWER
Average latency is the arithmetic mean of all request latencies, and it hides outliers: 98 requests at 10ms plus 2 requests at 1000ms average to about 30ms. p99 latency is the value below which 99% of requests fall, so 1% of requests are at least that slow. In the same example, p99 is 1000ms. For user experience, outliers matter because every slow request is a real user waiting. Take an e-commerce site with 10,000 orders per hour: if p99 latency is 1 second, about 100 orders per hour take 1 second or longer. If you looked only at average latency (30ms), you'd think everything is fine while 100 customers per hour are waiting. Most SLOs are defined in percentiles (e.g., 99% of requests under 500ms). Monitoring average latency alone effectively ignores the experience of a significant fraction of users.
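A quick way to check this kind of claim is to compute both numbers on a small sample. The sketch below uses 98 requests at 10ms and 2 at 1000ms with a nearest-rank percentile. The sample and the `percentile` helper are illustrative assumptions, and production systems usually estimate percentiles from histograms instead.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with pct% of values at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


latencies_ms = [10] * 98 + [1000] * 2

average = mean(latencies_ms)
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)

print(average, p50, p99)  # prints 29.8 10 1000
```

The average sits near the fast requests while p99 lands on the slow ones. An outlier rarer than 1 in 100 would not move p99 either, which is why some teams also track p99.9.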
FAQ · 4 QUESTIONS

Frequently Asked Questions

01
What's the difference between APM and Observability?
02
How much overhead does APM instrumentation add?
03
How long should I store metrics, traces, and logs?
04
What is tail-based sampling and when should I use it?