APM gives you telemetry — metrics (numerical measurements), traces (request journeys), and logs (discrete events) — to find performance problems before users complain
Core components: RED method (Rate, Errors, Duration) for services; USE method (Utilisation, Saturation, Errors) for resources; distributed tracing for microservices
Performance cost: OpenTelemetry instrumentation typically adds a few percent of CPU overhead (roughly 2-5% at a 1% sampling rate); measure it under your own traffic and adjust the sampling rate accordingly
Production trap: Alerting on CPU usage alone — a 90% CPU alert fires while users are happy (pre-computed cache), and misses when a slow database query makes users wait (low CPU, high latency)
Biggest mistake: No baseline for normal latency — you can't know p99 is bad if you never tracked p50 when the system was healthy
What is Application Performance Monitoring?
Application Performance Monitoring (APM) is the practice of measuring and managing the availability, responsiveness, and resource consumption of software applications in production. It exists because traditional infrastructure monitoring (CPU, memory, disk) tells you a server is alive but not whether your users are actually getting correct responses quickly.
APM fills that gap by instrumenting application code to track request paths, database queries, external API calls, and error rates — giving you visibility into the actual user experience and the internal bottlenecks that degrade it. Without APM, you're flying blind on issues like N+1 queries that silently waste database connections while CPU stays low.
APM sits between infrastructure monitoring (Prometheus, CloudWatch) and user experience monitoring (RUM, synthetic checks). It's the layer that connects infrastructure metrics to business outcomes. You should use APM when you need to understand why a page loads slowly, which database queries are the most expensive, or how a microservice failure propagates.
You should NOT use APM as your sole monitoring tool — it's complementary to logs (for debugging) and infrastructure metrics (for capacity planning). Modern APM tools like Datadog, New Relic, and Honeycomb process billions of data points daily, with distributed tracing being the key differentiator for microservice architectures.
The core value of APM is correlation: it links high-level metrics (request latency) to specific traces (the exact request that was slow) to individual spans (the database query that took 2 seconds). This is how you catch N+1 queries that hide in low CPU — the database might be at 10% utilization, but a single endpoint is making 200 sequential queries because an ORM lazily loaded associations.
APM exposes this through span-level database call counts and duration breakdowns that infrastructure metrics never show.
Plain-English First
Imagine your app is a restaurant kitchen. APM is like having a head chef who watches every cook, every dish, and every order in real time — they know instantly if the fryer is too slow, if a dish keeps getting sent back, or if one cook is overwhelmed while others stand idle. Without that chef, you only find out something went wrong when a customer walks out. APM is that watchful chef for your software — it tells you exactly where the kitchen is breaking down before your diners notice.
Every time a user clicks 'Buy Now' and nothing happens, a customer is lost — possibly forever. Studies from Google and Akamai consistently show that a 100ms increase in page load time can drop conversion rates by 1%. At scale, that's not a UX annoyance; it's a revenue crisis. Yet most engineering teams only find out their app is slow after a flood of support tickets or, worse, a trending tweet. APM exists to flip that script.
The core problem APM solves is invisibility. Code runs inside servers you can't touch, across networks you don't control, on databases holding millions of rows. Without instrumentation, you're flying blind. A query that took 50ms in staging suddenly takes 4 seconds in production under real load — and you have no idea why. APM gives you the telemetry — metrics, traces, logs — to pinpoint the exact line of code, database call, or third-party API dragging your app down.
By the end you'll understand the three pillars of observability, know exactly which metrics to instrument first, set up Prometheus-based collection, configure meaningful alert thresholds (not just 'CPU > 90%'), and read a distributed trace to find hidden latency.
What Application Performance Monitoring Actually Tracks
Application performance monitoring (APM) is the practice of measuring and analyzing the end-to-end behavior of a software system in production, focusing on response times, error rates, and resource consumption. The core mechanic is distributed tracing: every request is tagged with a unique trace ID, and each service, database call, or external API hit is recorded as a span. This creates a waterfall view of where time is spent, from the user's click to the final response.
In practice, APM tools instrument your code with minimal overhead — typically <5% CPU — by weaving in bytecode agents or using OpenTelemetry SDKs. They aggregate metrics like p50/p99 latency, throughput, and error budgets, but the real power is in the trace-level detail: you can drill into a single slow request and see that 95% of its time was spent in 200 sequential database queries, each taking 2ms. That's the N+1 pattern, invisible in average CPU but screaming in trace depth.
Use APM when you need to understand why a system behaves differently under load than in staging. It matters most for microservices, where a single slow downstream call can cascade into a global timeout storm, or for monoliths where a hidden O(n) loop in a hot path turns a 50ms endpoint into a 5s one. Without APM, you're debugging blind.
APM ≠ Infrastructure Monitoring
CPU and memory graphs won't show you that a single endpoint is making 200 database calls — only trace-level APM reveals the N+1 pattern hiding in low CPU.
Production Insight
A team at a payments company saw p99 latency spike from 200ms to 4s every 10 minutes. CPU was flat at 30%. APM traces revealed a batch job was loading an order with 500 line items, each triggering a separate SQL query via Hibernate's lazy loading.
The exact symptom: a periodic latency spike with no CPU or memory pressure, correlated with a scheduled task.
Rule of thumb: if you see latency spikes without CPU or memory increase, suspect N+1 queries — trace the slowest request and count the database spans.
Key Takeaway
APM's core value is trace-level visibility, not aggregate metrics — always drill into the slowest trace.
N+1 queries hide in low CPU because each query is fast; only the count reveals the O(n) cost.
Instrument every external call (DB, cache, API) as a span — otherwise you're blind to the bottleneck.
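The rule of thumb above (trace the slowest request and count the database spans) can be sketched in plain Java. The `SpanInfo` class, the `kind` values, and the threshold of 50 are hypothetical names and numbers chosen for illustration; a real APM backend exposes the same information through its own trace API.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: flag a probable N+1 pattern by counting repeated database spans in one trace. */
final class NPlusOneDetector {

    /** Minimal stand-in for an exported span (hypothetical shape, not a real APM type). */
    static final class SpanInfo {
        final String kind;        // "db", "http", "internal", ...
        final String statement;   // normalized SQL for db spans, otherwise null
        final long durationMs;

        SpanInfo(String kind, String statement, long durationMs) {
            this.kind = kind;
            this.statement = statement;
            this.durationMs = durationMs;
        }
    }

    /** Number of db spans in the trace that ran exactly this statement. */
    static long countDbSpans(List<SpanInfo> trace, String statement) {
        return trace.stream()
                .filter(s -> "db".equals(s.kind) && statement.equals(s.statement))
                .count();
    }

    /** Total time spent in db spans, in milliseconds. */
    static long totalDbTimeMs(List<SpanInfo> trace) {
        return trace.stream()
                .filter(s -> "db".equals(s.kind))
                .mapToLong(s -> s.durationMs)
                .sum();
    }

    /** True when one statement repeats at least `threshold` times in a single trace. */
    static boolean looksLikeNPlusOne(List<SpanInfo> trace, String statement, int threshold) {
        return countDbSpans(trace, statement) >= threshold;
    }

    public static void main(String[] args) {
        String parentQuery = "SELECT * FROM orders WHERE id = ?";
        String childQuery = "SELECT * FROM line_items WHERE order_id = ?";

        List<SpanInfo> trace = new ArrayList<>();
        trace.add(new SpanInfo("db", parentQuery, 3));
        for (int i = 0; i < 200; i++) {
            trace.add(new SpanInfo("db", childQuery, 2)); // each query is fast on its own
        }

        System.out.println("repeated queries: " + countDbSpans(trace, childQuery));
        System.out.println("total db time ms: " + totalDbTimeMs(trace));
        System.out.println("N+1 suspected:    " + looksLikeNPlusOne(trace, childQuery, 50));
    }
}
```

Each individual query takes 2 ms, so no single span looks slow. Only the count and the summed duration expose the problem, which is why the check works on span counts rather than on per-span latency.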
The Three Pillars — Metrics, Traces, Logs
APM rests on three types of telemetry data. Each answers a different question, and you need all three to debug effectively.
Metrics are numerical measurements over time — request rate, error rate, latency percentiles, CPU usage. They answer 'what is happening?' and are cheap to store and query. Metrics are aggregated (averages, sums, counts) and lose individual request details.
Traces track a single request's journey across services — every database call, RPC, and cache hit. They answer 'why is this specific request slow?' A trace is a tree of spans, each representing a unit of work. Traces are sampled (1-10% of requests) because storing every trace is expensive.
Logs are discrete timestamped events — 'User 123 logged in', 'Payment failed: insufficient funds'. They answer 'what happened at this exact moment?' Logs are high-cardinality but unstructured; parsing them at scale requires indexing.
The relationship: metrics tell you something is wrong (p99 latency spiked). Traces tell you where (database query slow). Logs tell you why (connection pool exhausted). Without all three, you're missing context.
package io.thecodeforge.apm;

import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.resources.Resource;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;
import io.opentelemetry.semconv.resource.attributes.ResourceAttributes;
import java.util.concurrent.TimeUnit;

/**
 * Production OpenTelemetry instrumentation for a Java service.
 *
 * This adds distributed tracing so you can see exactly where latency
 * is hiding: database calls, HTTP requests, or your own code.
 */
public class OpenTelemetryInstrumentation {
    private final Tracer tracer;

    public OpenTelemetryInstrumentation(String serviceName, String otlpEndpoint) {
        // Configure the OTLP exporter: sends traces to a collector or an
        // OTLP-capable backend (Jaeger, Tempo, etc.)
        OtlpGrpcSpanExporter spanExporter = OtlpGrpcSpanExporter.builder()
                .setEndpoint(otlpEndpoint) // e.g., "http://otel-collector:4317" (default OTLP gRPC port)
                .setTimeout(30, TimeUnit.SECONDS)
                .build();

        Resource serviceResource = Resource.getDefault().toBuilder()
                .put(ResourceAttributes.SERVICE_NAME, serviceName)
                .put(ResourceAttributes.SERVICE_VERSION, "1.2.3")
                .build();

        SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
                .addSpanProcessor(BatchSpanProcessor.builder(spanExporter).build())
                .setResource(serviceResource)
                .build();

        OpenTelemetry openTelemetry = OpenTelemetrySdk.builder()
                .setTracerProvider(tracerProvider)
                .buildAndRegisterGlobal();

        this.tracer = openTelemetry.getTracer(serviceName, "1.0.0");
    }

    /**
     * Example: instrument a database query with a custom span.
     *
     * This creates a child span under the current request trace.
     * In the APM UI, you'll see exactly how long the database call took
     * and can correlate it with other spans in the same trace.
     */
    public void executeDatabaseQuery(String query) {
        Span dbSpan = tracer.spanBuilder("DB Query")
                .setAttribute("db.statement", query)
                .setAttribute("db.system", "postgresql")
                .startSpan();
        try (Scope scope = dbSpan.makeCurrent()) {
            // Execute the actual database query here
            // connection.execute(query);
            System.out.println("Executing: " + query);
        } catch (Exception e) {
            dbSpan.recordException(e);
            dbSpan.setStatus(StatusCode.ERROR); // marks the span as failed in the trace UI
            throw e;
        } finally {
            dbSpan.end(); // Duration recorded here, visible in the trace UI
        }
    }

    /**
     * Example: instrument an HTTP call to an external API.
     */
    public void callExternalApi(String url) {
        Span httpSpan = tracer.spanBuilder("HTTP " + url)
                .setAttribute("http.url", url)
                .setAttribute("http.method", "GET")
                .startSpan();
        try (Scope scope = httpSpan.makeCurrent()) {
            // Make the actual HTTP call
            // httpClient.get(url);
            System.out.println("Calling: " + url);
        } catch (Exception e) {
            httpSpan.recordException(e);
            httpSpan.setStatus(StatusCode.ERROR);
            throw e;
        } finally {
            httpSpan.end();
        }
    }
}
Metrics, Traces, Logs — The Observability Trinity
Metrics: aggregated numbers (rate, errors, duration). Cheap to store, but lose individual request detail.
Traces: single request journey across services. Expensive to store (sampled at 1-10%). Show exact latency breakdown.
Logs: discrete events with high cardinality. Unstructured, need indexing for search. Best for debugging 'why' after trace identifies 'where'.
OpenTelemetry: vendor-neutral API for generating telemetry; send to any backend (Jaeger, Prometheus, Datadog, New Relic).
Rule: Start with RED metrics (Rate, Errors, Duration) for every service, then add traces for slow endpoints, then structured logs for errors.
Production Insight
Metrics alone can't debug a single slow request. They're aggregated, so a 1-second p99 says nothing about the slowest 1% of requests, which could each be taking 10 seconds.
Traces alone can't tell you if a problem is widespread or isolated. Combine metrics (problem exists) with traces (find the cause).
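Rule: Keep 100% of traces for error responses (status >= 400) and 1-10% of successful requests. Because the status is only known once the request has finished, keeping every error trace requires tail-based sampling; a head-based sampler decides before the outcome is known. This captures all failures without breaking the bank.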
Key Takeaway
Metrics tell you something is wrong. Traces tell you where. Logs tell you why. You need all three to debug effectively.
Start with RED metrics for every service, then add distributed tracing for slow endpoints, then structured logging for error detail.
Rule: Sample 100% of error traces, 1-10% of success traces. Tail-based sampling catches slow requests without storing every success.
APM Telemetry Sampling Strategy
If: Low-traffic service (< 10 requests/second)
Use: Sample 100% of requests. Store traces for 7 days, errors for 30 days. Cost is negligible and debugging is easier.
If: Medium-traffic service (10-100 requests/second)
Use: Sample 10% of requests, plus 100% of errors. Use probabilistic sampling with consistent probability per trace ID.
If: High-traffic service (> 100 requests/second)
Use: Sample 1% of requests, 100% of errors, and 'tail-based' sampling for slow requests (>500ms). Use the OpenTelemetry collector with the tail-sampling processor.
If: Compliance requirement: must have a trace for every transaction
Use: Head-based sampling with probability 1 (100%). Accept higher storage costs. Use a cheaper storage tier (S3) for older traces.
Also: Use consistent probability sampling based on trace ID. All services must apply the same sampling decision to avoid broken traces.
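The two ideas in the table, a consistent per-trace-ID probability and keep-everything rules for errors and slow requests, can be sketched as follows. This is a simplified illustration, not the OpenTelemetry collector's implementation: the class name, method names, and the way the trace ID is mapped to a number are assumptions made for the example.

```java
/** Sketch: deterministic ratio sampling on the trace ID, plus keep-all rules for errors and slow requests. */
final class SamplingSketch {

    /**
     * Head-style decision. The answer depends only on the trace ID, so every
     * service that sees the same trace ID reaches the same decision.
     *
     * @param traceIdHex 32 lowercase hex characters (a 128-bit trace ID)
     * @param ratio      fraction of traces to keep, between 0.0 and 1.0
     */
    static boolean sampledByRatio(String traceIdHex, double ratio) {
        if (ratio >= 1.0) return true;
        if (ratio <= 0.0) return false;
        // Treat the low 64 bits of the trace ID as a pseudo-random number.
        long low = Long.parseUnsignedLong(traceIdHex.substring(16), 16);
        long nonNegative = low & Long.MAX_VALUE;          // clear the sign bit
        long threshold = (long) (ratio * Long.MAX_VALUE);
        return nonNegative < threshold;
    }

    /**
     * Tail-style decision, made after the request has finished, when the
     * error flag and the duration are known.
     */
    static boolean keepTrace(String traceIdHex, boolean hadError, long durationMs,
                             double successRatio, long slowThresholdMs) {
        if (hadError) return true;                         // keep 100% of errors
        if (durationMs > slowThresholdMs) return true;     // keep slow requests
        return sampledByRatio(traceIdHex, successRatio);   // sample the rest
    }

    public static void main(String[] args) {
        String traceId = "4bf92f3577b34da6a3ce929d0e0e4736";
        System.out.println("kept at 50%: " + sampledByRatio(traceId, 0.5));
        System.out.println("kept at 10%: " + sampledByRatio(traceId, 0.1));
        System.out.println("slow request kept: " + keepTrace(traceId, false, 800, 0.01, 500));
    }
}
```

Deriving the decision from the trace ID rather than from a fresh random number is what keeps traces whole: a service downstream never drops spans for a trace that an upstream service decided to keep.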
The RED Method — Rate, Errors, Duration
The RED method (Rate, Errors, Duration) is the standard for service-level monitoring. For every service, track these three metrics, and you'll know instantly whether users are happy.
Rate is the number of requests per second. A sudden drop in rate (traffic falling off a cliff) often means the service is unavailable or rejecting requests. A sudden spike might indicate a DDoS attack or misconfigured client.
Errors is the proportion of requests that failed: HTTP 5xx responses, thrown exceptions, timeouts, or any response that doesn't meet your SLO. Track errors both as a raw count and as a percentage of total requests. A slow rise in error rate often indicates resource exhaustion (database connections, memory).
Duration is how long requests take, measured as latency percentiles — p50 (median), p95, p99. p99 is what matters for user experience: 1% of requests are slower than this. Average latency hides outliers: a service could have 1000 requests at 1ms and 1 request at 1000ms, average 2ms, but 0.1% of users had a terrible experience.
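To make the gap between the average and the tail concrete, here is a minimal sketch using the nearest-rank percentile definition. The sample data (98 requests at 10 ms, 2 at 1000 ms) and the class name are illustrative assumptions, not measurements from a real service.

```java
import java.util.Arrays;

/** Sketch: compare mean latency with nearest-rank percentiles. */
final class PercentileSketch {

    /** Nearest-rank percentile over an ascending-sorted array; p is in (0, 100]. */
    static long percentile(long[] sortedMs, double p) {
        int rank = (int) Math.ceil(p / 100.0 * sortedMs.length);
        return sortedMs[Math.max(rank, 1) - 1];
    }

    /** Arithmetic mean in milliseconds. */
    static double mean(long[] valuesMs) {
        return Arrays.stream(valuesMs).average().orElse(0.0);
    }

    public static void main(String[] args) {
        long[] latencies = new long[100];
        Arrays.fill(latencies, 10);   // 98 fast requests at 10 ms ...
        latencies[98] = 1000;         // ... and 2 slow requests at 1000 ms
        latencies[99] = 1000;
        Arrays.sort(latencies);

        System.out.println("mean ms: " + mean(latencies));
        System.out.println("p50 ms:  " + percentile(latencies, 50));
        System.out.println("p99 ms:  " + percentile(latencies, 99));
    }
}
```

With these numbers the mean is 29.8 ms, p50 is 10 ms, and p99 is 1000 ms. A dashboard showing only the mean would suggest a healthy service while two requests in every hundred take a full second.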
Instrument duration with a histogram: bucket boundaries at 1ms, 5ms, 10ms, 50ms, 100ms, 250ms, 500ms, 1000ms, 2500ms, 5000ms, 10000ms. This gives you percentiles without storing every latency value.
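How a histogram yields percentiles without the raw values: each bucket stores a cumulative count, and the percentile is estimated by interpolating linearly inside the bucket that contains the target rank. The sketch below follows that general idea; the class name, the three-bucket layout, and the counts are made up for illustration, and a real backend also has to handle an overflow (+Inf) bucket, which is omitted here.

```java
/** Sketch: estimate a percentile from cumulative histogram bucket counts. */
final class HistogramQuantile {

    /**
     * @param q                quantile between 0 and 1, e.g. 0.99
     * @param upperBounds      bucket upper bounds in seconds, ascending
     * @param cumulativeCounts observations at or below each bound (same length)
     * @return the estimated value, or NaN when the histogram is empty
     */
    static double estimate(double q, double[] upperBounds, long[] cumulativeCounts) {
        long total = cumulativeCounts[cumulativeCounts.length - 1];
        if (total == 0) return Double.NaN;
        double rank = q * total;
        for (int i = 0; i < upperBounds.length; i++) {
            if (cumulativeCounts[i] >= rank) {
                double lower = (i == 0) ? 0.0 : upperBounds[i - 1];
                long below = (i == 0) ? 0 : cumulativeCounts[i - 1];
                long inBucket = cumulativeCounts[i] - below;
                if (inBucket == 0) return upperBounds[i];
                // Assume observations are spread evenly inside the bucket.
                return lower + (upperBounds[i] - lower) * (rank - below) / inBucket;
            }
        }
        return upperBounds[upperBounds.length - 1];
    }

    public static void main(String[] args) {
        double[] bounds = {0.1, 0.5, 1.0};   // seconds
        long[] counts = {90, 98, 100};       // cumulative: 90 <= 0.1s, 98 <= 0.5s, 100 <= 1.0s
        System.out.println("p50 estimate: " + estimate(0.50, bounds, counts));
        System.out.println("p99 estimate: " + estimate(0.99, bounds, counts));
    }
}
```

The estimate can never be more precise than the bucket it falls in, which is why bucket boundaries should be dense around the latencies you care about, such as your SLO threshold.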
Common RED mistakes: measuring only average latency (hides p99 problems), not tracking errors by type (500 internal server error vs 404 not found are very different), and not breaking down rate by endpoint (a drop in /health is fine; a drop in /checkout is a crisis).
io/thecodeforge/apm/REDMetrics.java
package io.thecodeforge.apm;

import io.prometheus.client.Counter;
import io.prometheus.client.Histogram;
import io.prometheus.client.exporter.HTTPServer;
import java.io.IOException;

/**
 * Production RED metrics (Rate, Errors, Duration) using Prometheus.
 *
 * These three metrics are enough to know if a service is healthy
 * from the user's perspective, without looking at CPU or memory.
 */
public class REDMetrics {

    // ─── RATE: Total requests per endpoint ───────────────────────────────────
    // Counter represents requests total. Rate = increase over time.
    private static final Counter requestTotal = Counter.build()
            .name("http_requests_total")
            .labelNames("method", "endpoint", "status")
            .help("Total HTTP requests")
            .register();

    // ─── ERRORS: Error counter (subset of requestTotal) ──────────────────────
    // Tracked separately for easier alerting, but also derivable from requestTotal.
    private static final Counter errorTotal = Counter.build()
            .name("http_errors_total")
            .labelNames("method", "endpoint", "error_type")
            .help("Total HTTP errors (status >= 500 or exception)")
            .register();

    // ─── DURATION: Request latency histogram ─────────────────────────────────
    // Buckets chosen to capture p50 (5-10ms), p95 (50-100ms), p99 (250-500ms).
    // Adjust buckets based on your service's typical latency.
    private static final Histogram requestDuration = Histogram.build()
            .name("http_request_duration_seconds")
            .labelNames("method", "endpoint")
            .help("HTTP request latency in seconds")
            .buckets(0.001, 0.005, 0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10)
            .register();

    /**
     * Record metrics for a completed request.
     * Call this in your API framework's response filter/middleware.
     */
    public static void recordRequest(String method, String endpoint,
                                     int statusCode, long durationMs) {
        String status = String.valueOf(statusCode);
        requestTotal.labels(method, endpoint, status).inc();
        if (statusCode >= 500) {
            errorTotal.labels(method, endpoint, "http_5xx").inc();
        }
        requestDuration.labels(method, endpoint).observe(durationMs / 1000.0);
    }

    /**
     * Record an exception that wasn't caught by normal status code handling.
     */
    public static void recordException(String method, String endpoint, String exceptionType) {
        errorTotal.labels(method, endpoint, exceptionType).inc();
    }

    /**
     * Start the Prometheus metrics endpoint on port 8081 (separate from the app port).
     * Scraped by Prometheus every 15 seconds.
     */
    public static void startMetricsServer() throws IOException {
        HTTPServer server = new HTTPServer(8081);
        System.out.println("Prometheus metrics available at http://localhost:8081/metrics");
    }
}
Prometheus Histogram Buckets
The histogram buckets in the code above (0.001, 0.005, 0.01, ...) are chosen to capture typical web latency ranges. For a database service, you might shift to higher buckets (0.01, 0.05, 0.1, 0.5...). For an in-memory cache, lower buckets (0.0001, 0.0005...). Histogram percentiles are estimates whose accuracy depends on bucket placement. A summary computes quantiles inside each instance, but those quantiles cannot be aggregated across instances, so histograms are usually the better choice in production.
Production Insight
Average latency hides problems. A service with 1000 requests at 1ms and 1 at 1000ms has an average of 2ms, but 0.1% of users had a 1000ms experience.
p99 latency is what users actually feel. 1% of requests slower than p99. Track p50 for trends, p99 for SLOs.
Rule: Set p99 latency alerts at 3x your normal baseline, not an absolute number. A 500ms p99 might be fine for a reporting API but terrible for a checkout endpoint.
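A baseline-relative alert like the one in the rule above can be sketched in a few lines. The class name, the choice to require a number of consecutive breaching intervals (standing in for a "for 5 minutes" clause), and the sample values are assumptions for illustration; a real alerting system would express this in its own rule language.

```java
/** Sketch: fire when p99 stays above (multiplier x baseline) for N consecutive intervals. */
final class BaselineAlert {
    private final double baselineP99Ms;
    private final double multiplier;
    private final int requiredConsecutive;
    private int consecutiveBreaches = 0;

    BaselineAlert(double baselineP99Ms, double multiplier, int requiredConsecutive) {
        this.baselineP99Ms = baselineP99Ms;
        this.multiplier = multiplier;
        this.requiredConsecutive = requiredConsecutive;
    }

    /** Feed the p99 of one evaluation interval; returns true while the alert is firing. */
    boolean observe(double currentP99Ms) {
        if (currentP99Ms > baselineP99Ms * multiplier) {
            consecutiveBreaches++;
        } else {
            consecutiveBreaches = 0;   // a single healthy interval resets the streak
        }
        return consecutiveBreaches >= requiredConsecutive;
    }

    public static void main(String[] args) {
        // Baseline p99 of 100 ms, alert at 3x, sustained for 5 one-minute intervals.
        BaselineAlert alert = new BaselineAlert(100.0, 3.0, 5);
        double[] p99PerMinute = {120, 350, 360, 340, 380, 400, 110};
        for (double p99 : p99PerMinute) {
            System.out.println("p99=" + p99 + " ms, firing=" + alert.observe(p99));
        }
    }
}
```

Requiring a sustained breach avoids paging on a single noisy interval, and tying the threshold to the baseline lets the same rule serve a 50 ms checkout endpoint and a 2 s reporting API.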
Key Takeaway
RED metrics — Rate, Errors, Duration — tell you if users are happy without looking at CPU or memory.
Track p99 latency, not average. Averages hide outliers, and outliers are what users notice.
Rule: Start with RED for every service before adding more detailed metrics. If RED is green, users are happy; if RED is red, start debugging.
RED Metrics by Service Type
If: Web API or synchronous service (user waiting for response)
Use: Track p99 latency, error rate, request rate. Alert when p99 > 500ms for 5 minutes or error rate > 1% for 2 minutes.
If: Background job processor or async worker
Use: Track rate (jobs processed), error rate, and job age (time from enqueue to completion). Alert when age > 5 minutes.
If: Database or cache (infrastructure service)
Use: Track query latency p99, connection pool saturation, error rate. Alert when p99 > 100ms for a database (indexed queries) or > 5ms for a cache.
If: Batch job (cron, ETL)
Use: Track duration (time to completion), error flag (0 or 1), data volume processed. Alert when the job takes > 2x baseline duration.
If: Third-party API dependency (downstream call)
Use: Track rate (calls per second), error rate (HTTP 5xx, timeouts), latency p99. Alert when error rate > 5% or p99 > 2 seconds.
Distributed Tracing — Following a Request Across Services
In a monolith, you can find a slow function with a profiler. In microservices, a single request might pass through API gateway → auth service → order service → payment service → inventory service. A 2-second latency could be 100ms in each of 20 services, or 1.9 seconds in a single database query. Distributed tracing tells you which.
A trace is a tree of spans. The root span covers the entire request from client to final response. Child spans cover sub-operations: HTTP calls to downstream services, database queries, cache lookups, even internal function calls.
Key fields: trace ID (same across all spans in a request), span ID (unique per operation), parent span ID (links child to parent), name (operation name: 'GET /products', 'SELECT * FROM orders'), start and end timestamps (duration = end - start), attributes (HTTP method, status code, DB statement), events (logs within a span: 'cache miss', 'retry attempt').
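The fields listed above are enough to rebuild the span tree and find where the time actually went. The sketch below uses a hypothetical `SpanRecord` class (not an OpenTelemetry type) and computes each span's self time, meaning its duration minus the time covered by its direct children, under the simplifying assumption that children run one after another rather than in parallel.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch: find the span that contributes the most time of its own in a trace. */
final class SpanTree {

    /** Hypothetical flat span record carrying the key fields of a span. */
    static final class SpanRecord {
        final String traceId;
        final String spanId;
        final String parentSpanId;   // null for the root span
        final String name;
        final long startMs;
        final long endMs;

        SpanRecord(String traceId, String spanId, String parentSpanId,
                   String name, long startMs, long endMs) {
            this.traceId = traceId;
            this.spanId = spanId;
            this.parentSpanId = parentSpanId;
            this.name = name;
            this.startMs = startMs;
            this.endMs = endMs;
        }

        long durationMs() {
            return endMs - startMs;
        }
    }

    /** Duration of the span minus the durations of its direct children. */
    static long selfTimeMs(SpanRecord span, List<SpanRecord> trace) {
        long childTime = 0;
        for (SpanRecord other : trace) {
            if (span.spanId.equals(other.parentSpanId)) {
                childTime += other.durationMs();
            }
        }
        return span.durationMs() - childTime;
    }

    /** The span with the largest self time, or null for an empty trace. */
    static SpanRecord largestSelfTime(List<SpanRecord> trace) {
        SpanRecord worst = null;
        for (SpanRecord s : trace) {
            if (worst == null || selfTimeMs(s, trace) > selfTimeMs(worst, trace)) {
                worst = s;
            }
        }
        return worst;
    }

    public static void main(String[] args) {
        List<SpanRecord> trace = Arrays.asList(
                new SpanRecord("t1", "a", null, "GET /checkout", 0, 2000),
                new SpanRecord("t1", "b", "a", "HTTP POST /orders", 50, 1950),
                new SpanRecord("t1", "c", "b", "SELECT * FROM orders", 60, 1940));
        SpanRecord worst = largestSelfTime(trace);
        System.out.println(worst.name + " self time ms: " + selfTimeMs(worst, trace));
    }
}
```

The root span always has the longest duration, so sorting by duration alone just points you at the entry point. Sorting by self time points at the span doing the waiting, which here is the database query.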
Implementation: instrument your HTTP client and server libraries to automatically propagate trace context via headers (W3C Trace-Context standard: traceparent, tracestate). Use OpenTelemetry auto-instrumentation agents for Java, Python, Node.js, Go. Manual instrumentation for business-critical spans.
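It helps to see what actually travels in the `traceparent` header. In version 00 of the W3C format it is four dash-separated lowercase hex fields: version, a 32-character trace ID, a 16-character parent span ID, and 2 characters of flags whose lowest bit means "sampled". The parser below is a minimal sketch for that version only; the class and field names are illustrative, and in a real service the OpenTelemetry propagator does this for you.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: parse a version-00 W3C traceparent header. */
final class TraceParent {
    private static final Pattern FORMAT =
            Pattern.compile("^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$");

    final String traceId;    // shared by every span in the request
    final String parentId;   // span ID of the caller
    final boolean sampled;   // lowest bit of the flags field

    private TraceParent(String traceId, String parentId, boolean sampled) {
        this.traceId = traceId;
        this.parentId = parentId;
        this.sampled = sampled;
    }

    /** Parses the header or throws IllegalArgumentException when it is malformed. */
    static TraceParent parse(String header) {
        Matcher m = FORMAT.matcher(header == null ? "" : header.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a version-00 traceparent: " + header);
        }
        String traceId = m.group(1);
        String parentId = m.group(2);
        // All-zero IDs are defined as invalid by the specification.
        if (traceId.matches("0+") || parentId.matches("0+")) {
            throw new IllegalArgumentException("All-zero trace or parent ID: " + header);
        }
        boolean sampled = (Integer.parseInt(m.group(3), 16) & 0x01) == 0x01;
        return new TraceParent(traceId, parentId, sampled);
    }

    public static void main(String[] args) {
        TraceParent tp = parse("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
        System.out.println("trace ID:  " + tp.traceId);
        System.out.println("parent ID: " + tp.parentId);
        System.out.println("sampled:   " + tp.sampled);
    }
}
```

Rejecting malformed headers instead of guessing matters here: a service that cannot parse the header should start a new trace rather than attach its spans to a made-up parent.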
Common tracing mistakes: not propagating trace context across asynchronous boundaries (message queues, background threads) — resulting in broken traces; sampling too aggressively (1% of 1% leaves 0.01% of requests traced); not storing traces long enough (7 days minimum for debugging weekly patterns); and not linking traces to logs (add trace ID to every log line).
io/thecodeforge/apm/DistributedTracing.java
package io.thecodeforge.apm;

import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Context;
import io.opentelemetry.context.propagation.TextMapGetter;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.HashMap;
import java.util.Map;

/**
 * Distributed tracing with context propagation across service boundaries.
 *
 * The key challenge in distributed tracing is propagating the trace context
 * from caller to callee. OpenTelemetry's propagator handles this automatically
 * when using instrumented clients. For custom protocols (message queues),
 * inject the context manually.
 */
public class DistributedTracing {
    private final Tracer tracer;
    private final OpenTelemetry openTelemetry;
    private final HttpClient httpClient;

    public DistributedTracing(OpenTelemetry openTelemetry) {
        this.openTelemetry = openTelemetry;
        this.tracer = openTelemetry.getTracer("api-service");
        this.httpClient = HttpClient.newHttpClient();
    }

    /**
     * Example: calling a downstream service with trace propagation.
     *
     * When using an OpenTelemetry-instrumented HTTP client, the trace context
     * is automatically injected into the `traceparent` header.
     * The downstream service extracts it and creates a child span.
     */
    public String callOrderService(String orderId) throws Exception {
        // Start a child span for this HTTP call
        Span httpSpan = tracer.spanBuilder("HTTP POST /orders")
                .setAttribute("order.id", orderId)
                .startSpan();
        try (var scope = httpSpan.makeCurrent()) {
            HttpRequest.Builder requestBuilder = HttpRequest.newBuilder()
                    .uri(URI.create("http://order-service/api/orders"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString("{\"id\":\"" + orderId + "\"}"));

            // With automatic instrumentation, the `traceparent` header is added
            // for you. Without it, inject the context into the request builder
            // (a built HttpRequest is immutable):
            // openTelemetry.getPropagators().getTextMapPropagator()
            //         .inject(Context.current(), requestBuilder, (b, k, v) -> b.setHeader(k, v));

            HttpResponse<String> response =
                    httpClient.send(requestBuilder.build(), HttpResponse.BodyHandlers.ofString());
            httpSpan.setAttribute("http.status_code", response.statusCode());
            return response.body();
        } catch (Exception e) {
            httpSpan.recordException(e);
            throw e;
        } finally {
            httpSpan.end();
        }
    }

    /**
     * Extract trace context from incoming request headers.
     * OpenTelemetry's server instrumentation does this automatically.
     *
     * This is how a service knows it's part of an existing trace
     * rather than starting a new one.
     */
    public void handleIncomingRequest(String traceparentHeader) {
        // Extract context from headers (auto-instrumented frameworks do this)
        TextMapGetter<MapHeaders> getter = new TextMapGetter<>() {
            @Override
            public String get(MapHeaders carrier, String key) {
                return carrier.get(key);
            }

            @Override
            public Iterable<String> keys(MapHeaders carrier) {
                return carrier.keys();
            }
        };
        Context extractedContext = openTelemetry.getPropagators().getTextMapPropagator()
                .extract(Context.current(), new MapHeaders(traceparentHeader), getter);

        // Start a span as a child of the extracted context
        Span span = tracer.spanBuilder("handle request")
                .setParent(extractedContext)
                .startSpan();
        try (var scope = span.makeCurrent()) {
            // Process the request here
            System.out.println("Processing request with trace ID: "
                    + span.getSpanContext().getTraceId());
        } finally {
            span.end();
        }
    }

    // Helper class for the header propagation example
    static class MapHeaders {
        private final Map<String, String> headers = new HashMap<>();

        MapHeaders(String traceparent) {
            headers.put("traceparent", traceparent);
        }

        String get(String key) {
            return headers.get(key);
        }

        Iterable<String> keys() {
            return headers.keySet();
        }
    }
}
Trace Context Propagation is Non-Negotiable
Without trace context propagation, your traces will be broken: each service creates a new trace root. Use W3C Trace-Context headers (traceparent, tracestate). OpenTelemetry auto-instrumentation handles this for HTTP, gRPC, and many database clients. For message queues (Kafka, RabbitMQ), check whether your client library is covered by auto-instrumentation; where it is not, inject the context into the message headers manually on the producer side and extract it on the consumer side.
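Manual propagation through a message queue comes down to two steps: the producer writes a `traceparent` value into the message headers, and the consumer reads it back and starts its span with the same trace ID and the producer's span as parent. The sketch below models the message headers as a plain `Map` and generates span IDs itself; the class and method names are hypothetical stand-ins for what the OpenTelemetry propagator's inject and extract calls do against a real Kafka or RabbitMQ client.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

/** Sketch: carry trace context across a queue in message headers. */
final class QueuePropagation {

    /** Random non-zero 64-bit span ID as 16 lowercase hex characters. */
    static String newSpanId() {
        long id;
        do {
            id = ThreadLocalRandom.current().nextLong();
        } while (id == 0);
        return String.format("%016x", id);
    }

    /** Producer side: write the current trace context into the message headers. */
    static void inject(Map<String, String> headers, String traceId,
                       String producerSpanId, boolean sampled) {
        headers.put("traceparent",
                "00-" + traceId + "-" + producerSpanId + "-" + (sampled ? "01" : "00"));
    }

    /**
     * Consumer side: returns {traceId, parentSpanId, consumerSpanId}, or null
     * when no usable context is present and a new trace should be started.
     */
    static String[] extract(Map<String, String> headers) {
        String header = headers.get("traceparent");
        if (header == null) return null;
        String[] parts = header.split("-");
        if (parts.length != 4) return null;
        return new String[] {parts[1], parts[2], newSpanId()};
    }

    public static void main(String[] args) {
        Map<String, String> messageHeaders = new HashMap<>();
        inject(messageHeaders, "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", true);

        String[] ctx = extract(messageHeaders);
        System.out.println("trace ID (unchanged): " + ctx[0]);
        System.out.println("parent span ID:       " + ctx[1]);
        System.out.println("consumer span ID:     " + ctx[2]);
    }
}
```

The consumer gets a fresh span ID but keeps the trace ID, so the producer and consumer spans show up in one trace even though minutes may pass between them.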
Production Insight
A trace without context propagation is just a log of each service's independent timings. You can't see the full request journey.
The W3C Trace-Context standard (traceparent header) is supported by all major tracing backends. Use it, not proprietary formats.
Rule: Test trace propagation in staging. Make a request that spans 3 services and verify the same trace ID appears in all service logs.
Key Takeaway
Distributed tracing shows you where time is spent across service boundaries — database, cache, RPC, external API.
Without trace context propagation, each service starts a new trace, and you lose the end-to-end view.
Rule: Use OpenTelemetry auto-instrumentation for HTTP clients/servers. For message queues, propagate trace context in message headers manually.
Tracing Sampling Decisions
If: Debugging an intermittent production issue that happens to 0.1% of requests
Use: Increase the sampling rate to 10% temporarily and change it back once the issue is resolved. Use remote configuration so the change does not require a redeploy.
If: Compliance requires a full audit trail for every transaction
Use: Sample 100% of traces. Use 'head-based' sampling with probability 1. Accept storage costs. Archive older traces to cold storage.
If: Storage costs are a concern, but you need to debug slow requests
Use: 'Tail-based' sampling: collect 100% of traces, but store only those with errors or duration > 500ms (configured in the OpenTelemetry collector).
If: You need to debug a specific user or session
Use: 'Request-id' based conditional sampling. Extract the user ID from a request header; if the user is in the debug list, sample 100%.
If: Tracing overhead is affecting production latency (rare; more than 5% overhead)
Use: Reduce the sampling rate. Use a lighter propagator. Use an async span processor. Sample 100% of errors and lower the success sampling rate.
Why APM Matters in DevOps — The Fire Triangle
You can't fix what you can't see. That's the whole argument for APM in one sentence. DevOps is about closing the loop between code commit and production behavior. Without real-time visibility into how your application actually runs, you're flying blind.
APM isn't a dashboard for the operations team to stare at. It's the feedback mechanism that tells you whether your last deployment actually improved anything — or if you just swapped one bottleneck for another. When a user reports slowness, APM answers the three questions that matter: What's slow? Where is the slowness happening? Why is it happening now?
Most teams don't fail because they lack monitoring. They fail because they monitor the wrong things. CPU usage is a distraction. You need to track the metrics that correlate directly with user experience — response time, error rate, and saturation. Everything else is noise. APM forces you to focus on what actually breaks the user's day.
ApminDevOps.yml
# io.thecodeforge - devops tutorial
# Minimal APM config for a payment service
# Track the customer-impacting metrics, not the server noise
apm:
  service_name: payment-gateway
  environment: production
  metrics:
    - response_time_p99    # Slowest 1% of requests
    - error_rate_percent   # 4xx/5xx vs total requests
    - apdex_score          # User satisfaction threshold
  alerting:
    # Fire when p99 crosses 2 seconds for 5 minutes
    - rule: p99_latency_high
      condition: "p99 > 2000ms for 5m"
      severity: critical
      oncall: sre-team
    - rule: error_rate_spike
      condition: "error_rate > 5% for 1m"
      severity: warning
      oncall: dev-oncall
Output
Payment gateway metrics firing:
- p99_latency_high: ACTIVE (p99=2400ms, duration=7m)
- error_rate_spike: OK (2.1%, below threshold)
Production Trap:
Don't alert on CPU or memory in isolation. They're symptoms, not causes. Alert on the user-facing metrics first — response time and error rate — then let those alerts guide you to the infrastructure root cause.
Key Takeaway
APM exists to answer 'is my code making users unhappy?' — not 'is my server running?'
Core Components of Modern APM — The Parts That Actually Matter
Modern APM is not one thing. It's four layers that stack together, and if you skip any of them, you're working with incomplete data.
First: End-user Experience Monitoring (EUEM). This is your synthetic transactions and real-user monitoring. It captures how actual humans experience your app — page load times, click-to-response latency, client-side errors. Without it, you might think the backend is healthy while users are staring at a blank screen because a JavaScript bundle broke.
Second: Application Runtime Architecture. This is where you instrument your code — the database calls, external API calls, thread pools, and memory allocation. You're measuring what your code actually does at runtime. Not what you think it does. Not what the code review suggested. What it really does. This is where you find the N+1 queries, the unbounded retry loops, and the object allocation that triggers GC pauses.
Third: Infrastructure Monitoring. You need to know what's happening at the OS and container level — CPU, memory, disk I/O, network. But here's the trick: infrastructure data is only useful when correlated with application data. A CPU spike during normal traffic means something totally different from a CPU spike during a traffic surge. Don't look at infrastructure in isolation.
Fourth: Transaction Tracing and Dependency Mapping. This is the map of every service call your application makes. It shows you the path of a single request across services, databases, queues, and caches. Without this, you can't tell if the payment service is slow because the database is slow, or because the fraud-check service is timing out.
ApmStackExample.yml
# io.thecodeforge - devops tutorial
# APM stack configuration for a microservices deployment
# Each component runs in its own namespace
components:
  euem:
    tool: rum-sensor
    enabled: true
    capture_requests: true
    capture_errors: true
  runtime:
    instrumentations:
      - python: "0.12.0"   # Auto-instrument all Python services
      - nodejs: "0.9.2"
      - java: "0.15.1"
  infrastructure:
    agent: node-exporter
    enabled: true
    metrics:
      - cpu_usage
      - memory_rss
      - disk_iops
  tracing:
    sampler: probabilistic
    sampling_rate: 0.1     # Trace 10% of requests in prod
    export: otlp-http
- Tracing: 10% sampling, 150 traces exported per second
Senior Shortcut:
Always start with runtime instrumentation. If you don't know what your code is doing at the method level, you're guessing. Everything else is secondary.
Key Takeaway
APM is four layers working together — omit any one and you're flying with partial instruments.
Essential APM Metrics — The Only Ones That Survive an Incident
Every monitoring tool lets you create 500 dashboards. Most teams end up with 500 dashboards and zero actionable insight. Here's the short list of metrics that matter when a P1 hits.
Latency (p50, p95, p99): p50 tells you what the typical user experiences. p95 tells you about the edge cases. p99 tells you about the outliers that will get you a call at 3 AM. If p99 is 10x p50, you have a long-tail latency problem — probably a bad cache hit ratio or a slow external dependency.
Error Rate: Track as a percentage of all requests, not an absolute count. A spike from 0.1% to 1% is a 10x increase. Your alerting should catch that. But you also need error budgets — a way to say "we can tolerate X% errors for Y time before we page someone." Without error budgets, your on-call will be paged for every single 500 error from a load balancer health check. Don't be that team.
Saturation: This is the hard one. It measures how close your system is to its limit. For a database, it's connection pool usage. For a queue, it's message backlog. For a CPU, it's run queue depth. Saturation is a leading indicator of failure. When saturation hits 80% of capacity, you have minutes to react before performance collapses. If you wait until latency spikes to act, you've already lost.
Throughput: Requests per second. It's the denominator for all your rate calculations. Throughput dropping suddenly usually means something upstream is failing. Throughput spiking might be a DDoS or a misconfigured retry loop. Throughput trending up over weeks means you need to scale. All three are useful, but none tells the whole story alone.
EssentialMetrics.yml (YAML)
# io.thecodeforge - devops tutorial
# Essential metrics config for an API gateway
# Only four metrics, but correlated across all services
metrics:
  latency:
    dashboard: "p50: 120ms | p95: 350ms | p99: 800ms"
    alarm: "p99 > 1500ms for 3 minutes"
  error_rate:
    dashboard: "0.3% of requests (24/hour)"
    error_budget: "99.9% uptime = 0.1% errors/month"
    alarm: "rate > 1% for 1 minute"
  saturation:
    dashboard: "db_conn_pool: 67% (134/200 connections)"
    alarm: "saturation > 80% for 5 minutes"
  throughput:
    dashboard: "420 req/s (rolling 1m average)"
    alarm: "throughput drops > 50% in 1 minute"
Don't measure p99 latency only as a service-wide aggregate. Measure it per endpoint. An API that returns 'user not found' in 2ms will hide the endpoint that takes 5 seconds to fetch the report. Always segment latency by endpoint.
Key Takeaway
Four metrics — latency, error rate, saturation, throughput — cover 80% of production incidents. The other 20% require distributed tracing to untangle.
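To see how a fast endpoint hides a slow one, here is a minimal sketch that computes nearest-rank percentiles for the whole service and then per endpoint. The sample data, the endpoint names, and the `percentile` helper are assumptions made up for illustration; a real setup would read these values from histogram metrics rather than raw lists.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical traffic: (endpoint, latency in ms). The slow endpoint is 0.5% of requests.
samples = [("/health", 2)] * 995 + [("/report", 5000)] * 5

by_endpoint = defaultdict(list)
for endpoint, latency_ms in samples:
    by_endpoint[endpoint].append(latency_ms)

# Service-wide view: the slow endpoint falls outside the top 1% and disappears.
service_p99 = percentile([latency for _, latency in samples], 99)

# Per-endpoint view: the slow endpoint is impossible to miss.
per_endpoint_p99 = {ep: percentile(vals, 99) for ep, vals in by_endpoint.items()}

print("service p99:", service_p99, "ms")
for ep, p99 in per_endpoint_p99.items():
    print(ep, "p99:", p99, "ms")
```

The service-wide p99 comes out at 2ms while `/report` sits at 5000ms, which is exactly the dilution the segmenting advice is meant to prevent.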
Kubernetes: Where APM Becomes a Fire Hose
You don't monitor Kubernetes—you monitor what runs on it. The platform is just a noisy scheduler. Your APM must answer: which pod, which node, which container version caused the p99 spike?
Forget instrumenting every replica by hand. Use eBPF to capture network flows, or pull telemetry from a service mesh such as Istio or Linkerd. Either route gives you per-pod latency, error rates, and traffic patterns without touching application code.
The real win: correlate a Kubernetes rollout with an APM anomaly. When your deploy triggers a 5xx storm, the trace should show the new pod's image tag and resource limits. Anything less is guessing. Production demands pod-level granularity, not cluster-level averages.
Pod payment-service-v2 created with OpenTelemetry resource attributes.
APM traces now include pod and node identity for drill-down.
Senior Shortcut:
Skip manual instrumentation of every microservice. Use eBPF-based auto-instrumentation (e.g., Pixie, Cilium Tetragon) to capture traffic and traces without code changes. Saves months of instrumenting legacy services.
Key Takeaway
APM without pod-level identity is just noise. Always tag traces with Kubernetes metadata.
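One way to attach Kubernetes identity to every trace is through resource attributes. The sketch below builds the value of the `OTEL_RESOURCE_ATTRIBUTES` environment variable, which OpenTelemetry SDKs read as comma-separated `key=value` pairs. The input variable names (`POD_NAME`, `NODE_NAME`, `POD_NAMESPACE`, `IMAGE_TAG`) are hypothetical: it assumes your pod spec or deploy pipeline injects them, for example through the Downward API. Mapping the image tag to `service.version` is also an assumption, and the helper does not escape special characters in values.

```python
import os

# Hypothetical env var names on the left, OpenTelemetry attribute keys on the right.
ENV_TO_ATTRIBUTE = {
    "POD_NAME": "k8s.pod.name",
    "NODE_NAME": "k8s.node.name",
    "POD_NAMESPACE": "k8s.namespace.name",
    "IMAGE_TAG": "service.version",  # assumption: image tag doubles as service version
}


def k8s_resource_attributes(environ=os.environ):
    """Collect the Kubernetes identity attributes that are actually set."""
    return {
        attribute: environ[var]
        for var, attribute in ENV_TO_ATTRIBUTE.items()
        if environ.get(var)
    }


def as_otel_env_value(attributes):
    """Render attributes as comma-separated key=value pairs (no escaping applied)."""
    return ",".join(f"{key}={value}" for key, value in sorted(attributes.items()))


if __name__ == "__main__":
    example_env = {"POD_NAME": "payment-service-v2-7c9d", "NODE_NAME": "node-3"}
    print(as_otel_env_value(k8s_resource_attributes(example_env)))
```

With these attributes on the resource, a 5xx storm after a rollout can be filtered by pod and version instead of by cluster-wide averages.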
Real-Time Alerting: Stop Paging at 3 AM for Nothing
Most alerts are someone else's problem—misconfigured thresholds, static baselines, or plain noise. Real-time alerting means acting on data fresh enough to matter, with context that tells you what to do.
Your pipeline: instrument -> aggregate -> window -> evaluate -> notify. The window matters most. Use sliding windows of 1-5 minutes for error rates, 30-60 seconds for latency spikes. Static thresholds are dead. Use dynamic baselines from the last 7 days, same hour.
When the page comes, it must include: service name, trace ID, pod/node, and a link to the span. If your alert says "Error rate high" with no context, it's worse than no alert. Production teams route to Slack with a runbook attachment or they discard the channel.
Alert fires when service error rate exceeds 2% for 3 minutes.
Runbook link and trace ID included in notification payload.
Production Trap:
Never alert on raw counts. Always use rates and error ratios. A spike to 10 errors per second tells you nothing on its own: at 1,000 req/s it is a 1% error rate, at 20 req/s it is 50%. The ratio is what catches real SLO breaches.
Key Takeaway
An alert without a runbook and trace link is just noise. Only page when context fits in the notification.
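The window-then-evaluate steps of the pipeline can be sketched as a small in-memory evaluator that alerts on the error ratio, not the error count. This is a simplified illustration: the class name, the 3-minute window, the 2% threshold, and the minimum-traffic guard are assumptions, and it measures the ratio over the window rather than a breach sustained for the whole window. A production system would normally do this in the metrics backend.

```python
from collections import deque


class ErrorRatioAlert:
    """Fires when the error ratio over a sliding time window exceeds a threshold."""

    def __init__(self, window_seconds=180, threshold=0.02, min_requests=50):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.min_requests = min_requests  # avoid paging on a handful of requests
        self.events = deque()  # (timestamp, is_error), oldest first
        self.errors = 0

    def record(self, timestamp, is_error):
        self.events.append((timestamp, is_error))
        self.errors += int(is_error)
        self._evict(timestamp)

    def _evict(self, now):
        # Drop events that have slid out of the window.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, was_error = self.events.popleft()
            self.errors -= int(was_error)

    def should_fire(self, now):
        self._evict(now)
        total = len(self.events)
        if total < self.min_requests:
            return False
        return self.errors / total > self.threshold


if __name__ == "__main__":
    alert = ErrorRatioAlert()
    for second in range(200):
        alert.record(second, is_error=(second >= 100 and second % 10 == 0))
    print("page on-call:", alert.should_fire(199))
```

The minimum-request guard is a design choice: without it, one failed request out of three at 4 AM reads as a 33% error rate.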
Log Aggregation: The APM Lie Everyone Believes
APM tools sell you on traces as the savior. The lie: traces make logs obsolete. The truth: no trace tells you why a payment failed—just that it did. Logs hold the stack trace, the user ID, the exact SQL query. Aggregating them is how you fix incidents.
Centralize logs in Elasticsearch, Loki, or CloudWatch. Standardize on structured JSON with a schema: timestamp, level, service, trace_id, message. The trace_id is the bridge—connect log lines to APM spans.
Use LogQL or KQL to search across environments in seconds. When the p99 latency spikes, find the spans with duration > 2s and pull the database log lines that share their trace_id. Logs are raw evidence. APM is the map. You need both to survive incident response.
Centralized log schema enforced across all services.
Trace_id enables cross-referencing between logs and APM spans.
Senior Shortcut:
Enforce a logging schema at the application framework level (e.g., Winston, Log4j, or slog). Never parse logs—always emit structured JSON. Parsing logs in your aggregation pipeline is tech debt that will kill you during an outage.
Key Takeaway
Traces tell you where. Logs tell you why. Never debug an incident without both.
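The schema described above (timestamp, level, service, trace_id, message) can be enforced at the logger level. Here is a minimal sketch using only Python's standard `logging` module; the `JsonFormatter` class, the `build_logger` helper, and the service name are illustrative assumptions, and in a real service the trace ID would come from your tracing library's current span instead of being passed by hand.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Emit every record as one JSON object with a fixed schema."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": self.service,
            # trace_id arrives via the `extra` argument; None if the caller omitted it.
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def build_logger(service, stream=None):
    """Return a logger whose only handler writes schema-conformant JSON lines."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)  # defaults to stderr when stream is None
    handler.setFormatter(JsonFormatter(service))
    logger.handlers = [handler]
    return logger


if __name__ == "__main__":
    log = build_logger("payment-service")
    log.error("card declined", extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"})
```

Because every line carries `trace_id`, a slow span found in the APM tool becomes a one-field filter in the log backend.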
● Production incident · POST-MORTEM · severity: high
The Silent N+1 Query That Killed Black Friday
Symptom
Product page latency p99 jumped from 200ms to 4 seconds within 10 minutes of peak traffic. CPU on app servers stayed below 30%. Database CPU spiked to 95%. No alerts fired because CPU alert threshold was 80% and app servers looked fine. Customers started abandoning carts. The team saw the latency spike in dashboards but couldn't find the root cause.
Assumption
The team assumed the product page was efficient because it performed well in load tests with 10 reviews per product. They didn't test with 100 reviews. They also assumed high latency meant slow code in the app server — not the database — because app server CPU was low.
Root cause
The product page code fetched the product object, then looped through each review ID and executed a separate SELECT query. That's 1 query for the product + N queries for N reviews. At 100 reviews per product, that's 101 database round trips. Under peak load, database connection pool saturated, queries queued, and latency exploded. The ORM's default eager loading was disabled, and no one noticed because staging data had only 2-3 reviews per product.
Fix
Changed the code to use a single JOIN query: SELECT * FROM products LEFT JOIN reviews ON products.id = reviews.product_id WHERE products.id = ?. Added Review as an embedded collection on the Product object using the ORM's eager loading feature. Added an APM custom span around the database query to measure its contribution to total latency. Deployed a migration to add an index on reviews.product_id. After the fix, page latency dropped to 150ms even at peak traffic, and database CPU dropped to 25%.
Key lesson
N+1 queries are invisible in app server CPU metrics — the app server waits for the database, so its CPU stays low. Always monitor database query count and latency per endpoint.
Load test with realistic data volumes. A product page with 2 reviews behaves nothing like a page with 200 reviews. Use production data size in staging.
APM should trace database queries per request. A sudden increase in 'SELECT * FROM reviews WHERE product_id = ?' call count is a smoking gun for N+1.
Set up alerts on p99 latency per endpoint, not just CPU. A 400% latency increase with flat CPU points directly at database or external dependencies.
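The query-count lesson can be reproduced in miniature. This sketch uses an in-memory SQLite database and `set_trace_callback` to count the SELECT statements each page load issues, once with the per-review loop and once with a single JOIN. The schema and function names are assumptions for illustration, not the incident's actual code, and this loop issues one extra query to list review IDs, so it shows N + 2 statements rather than N + 1.

```python
import sqlite3


def make_db(review_count):
    """Build an in-memory database with one product and `review_count` reviews."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE reviews (id INTEGER PRIMARY KEY, product_id INTEGER, body TEXT)")
    conn.execute("CREATE INDEX idx_reviews_product_id ON reviews (product_id)")
    conn.execute("INSERT INTO products VALUES (1, 'widget')")
    conn.executemany(
        "INSERT INTO reviews (product_id, body) VALUES (1, ?)",
        [(f"review {i}",) for i in range(review_count)],
    )
    return conn


def load_page_n_plus_one(conn, product_id=1):
    """Anti-pattern: one SELECT per review inside a loop."""
    name = conn.execute("SELECT name FROM products WHERE id = ?", (product_id,)).fetchone()[0]
    review_ids = [row[0] for row in conn.execute(
        "SELECT id FROM reviews WHERE product_id = ?", (product_id,)).fetchall()]
    reviews = [
        conn.execute("SELECT body FROM reviews WHERE id = ?", (rid,)).fetchone()[0]
        for rid in review_ids
    ]
    return name, reviews


def load_page_join(conn, product_id=1):
    """Fix: fetch the product and all its reviews in a single round trip."""
    rows = conn.execute(
        "SELECT p.name, r.body FROM products p "
        "LEFT JOIN reviews r ON p.id = r.product_id WHERE p.id = ?",
        (product_id,),
    ).fetchall()
    return rows[0][0], [body for _, body in rows]


def count_selects(conn, load_page):
    """Run a page loader and count the SELECT statements it executed."""
    statements = []
    conn.set_trace_callback(statements.append)
    try:
        result = load_page(conn)
    finally:
        conn.set_trace_callback(None)
    selects = [s for s in statements if s.lstrip().upper().startswith("SELECT")]
    return result, len(selects)


if __name__ == "__main__":
    db = make_db(100)
    print("loop:", count_selects(db, load_page_n_plus_one)[1], "queries")
    print("join:", count_selects(db, load_page_join)[1], "queries")
```

A per-request query counter like this is the kind of signal an APM database span gives you for free, and it stays loud even when app server CPU is quiet.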
Production debug guide
Symptom → Action mapping for common performance failures (5 entries)
Symptom · 01
High latency, app server CPU low, database CPU high
→
Fix
Classic N+1 query or inefficient database access pattern. Check APM traces for per-request query count. Look for loops executing SELECT statements inside a request. Use database slow query log to identify expensive queries. Add missing indexes.
Symptom · 02
High latency, app server CPU high, database CPU normal
→
Fix
Application code is the bottleneck, not the database. Use a profiler (async-profiler, py-spy) to find CPU hot methods. Check for inefficient loops, serialisation overhead (JSON parsing), or regex backtracking. Consider caching expensive computations.
Symptom · 03
Latency spikes every hour like clockwork
→
Fix
Likely a cron job, cache expiry, or batch process. Check scheduled jobs running at that time. Look for a cache stampede (multiple requests recomputing the same cache entry simultaneously). Add jitter to scheduled tasks. Use a lock so that only one request recomputes an expired entry.
Symptom · 04
Latency increases linearly with number of users
→
Fix
Shared resource bottleneck: database connection pool, thread pool, or external API rate limit. Check connection pool size vs active connections. If maxed out, requests queue. Increase pool size or reduce connection hold time. Check thread pool saturation.
Symptom · 05
Latency high for first request after deploy, then improves
→
Fix
Cold start or lack of connection warm-up. Database connection pools, caches, and JIT compilation need a warm-up after deployment. Send synthetic traffic before opening to users. Use a health check endpoint that exercises critical paths.
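The lock-and-jitter advice for the hourly latency spike (Symptom 03) can be sketched as a single-entry cache where only one thread recomputes an expired value while the others wait for it. The `SingleFlightCache` class, its TTL, and the 10% jitter are assumptions for illustration; a multi-key or distributed cache would need per-key locks or a shared lock service.

```python
import random
import threading
import time


class SingleFlightCache:
    """Single-entry cache: one thread recomputes an expired value, the rest reuse it."""

    def __init__(self, ttl_seconds=60.0, jitter_fraction=0.1):
        self.ttl_seconds = ttl_seconds
        self.jitter_fraction = jitter_fraction
        self._lock = threading.Lock()
        self._value = None
        self._expires_at = float("-inf")  # forces a compute on first access

    def get(self, compute):
        if time.monotonic() < self._expires_at:
            return self._value  # fast path: entry is fresh, no locking
        with self._lock:
            # Re-check after acquiring the lock: another thread may have refreshed it.
            if time.monotonic() < self._expires_at:
                return self._value
            self._value = compute()
            # Jitter spreads expiry times so entries don't all expire on the same tick.
            jitter = random.uniform(-self.jitter_fraction, self.jitter_fraction)
            self._expires_at = time.monotonic() + self.ttl_seconds * (1 + jitter)
            return self._value


if __name__ == "__main__":
    compute_calls = []

    def expensive_report():
        compute_calls.append(1)
        time.sleep(0.05)  # stand-in for a slow query
        return "report"

    cache = SingleFlightCache()
    workers = [threading.Thread(target=cache.get, args=(expensive_report,)) for _ in range(8)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print("recomputations:", len(compute_calls))
```

Eight concurrent requests trigger one recomputation instead of eight, which is the difference between a blip and a stampede.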
★ APM Debug Cheat Sheet
Fast diagnostics for production performance issues. Run these checks at the first sign of slowness.
Slow API endpoint: can't tell if it's code, database, or network
Immediate action
Look at the distributed trace to break down latency by component
If database span dominates, add index or reduce query count. If HTTP client span dominates, check external API latency. If duration is in 'code' span, profile the application.
App server CPU at 100%, database CPU normal, latency spiking
Open heap dump in Eclipse MAT / VisualVM. Look for objects with high 'retained heap'. Check for event listeners not unregistered, caches without eviction, or thread locals not cleared.
Latency p99 spikes but p50 is fine: 1% of requests are very slow
Immediate action
Check if slow requests share a pattern: specific user, specific data, or specific time
Add more trace sampling for slow requests. Check for large payloads, deep pagination, or data skew. Implement request timeouts to fail fast.
RED Method Metrics by Service Type
| Service Type | Rate (R) | Errors (E) | Duration (D) | Key Alert |
| --- | --- | --- | --- | --- |
| Web API (user-facing) | Requests/sec per endpoint | HTTP 5xx rate, exception rate | p99 latency per endpoint | p99 > 500ms for 5 minutes |
| Background Worker | Jobs processed/sec | Failed job rate | Job age (time from enqueue to completion) | Job age > 5 minutes |
| Database | Queries/sec | Deadlock rate, connection errors | p99 query latency | p99 > 100ms (if indexed properly) |
| Cache (Redis, Memcached) | Operations/sec (GET, SET) | Error rate, miss rate | p99 operation latency | p99 > 5ms or miss rate > 20% |
| Message Queue (Kafka) | Messages published/sec, consumed/sec | Consumer lag (offset difference) | Produce latency, consume latency | Lag > 10,000 messages for 10 minutes |
| Third-party API | Calls/sec | HTTP 5xx, timeout rate | p99 response time | Error rate > 5% or p99 > 2 seconds |
Key takeaways
1. APM gives you three telemetry types: metrics (aggregated numbers), traces (single request journeys), and logs (discrete events). All three are necessary for efficient debugging.
2. The RED method (Rate, Errors, Duration) is the standard for service-level monitoring. Track p99 latency, not average: averages hide outliers, and outliers are what users notice.
3. Distributed tracing shows you where time is spent across service boundaries. Without trace context propagation (W3C Trace Context), each service starts a new trace and you lose the end-to-end view.
4. Alert on p99 latency and error rate, not CPU usage. A 90% CPU alert fires while users are happy (pre-computed cache) and misses when a slow database query makes users wait (low CPU, high latency).
5. Tail-based sampling captures all slow and failed requests without storing every successful trace. Sample 100% of errors, 100% of requests > 500ms, and 1% of normal requests.
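Trace context propagation is easier to reason about once you see the header itself. The W3C Trace Context `traceparent` header has four dash-separated fields: version, a 32-hex-character trace ID, a 16-hex-character parent span ID, and 2-hex-character flags. The sketch below is a hand-rolled illustration of what a tracing library does on each hop; the function names are assumptions, and in practice you would let your OpenTelemetry SDK handle this rather than writing it yourself.

```python
import re
import secrets

# Version 00 of the header: 00-<trace-id>-<parent-id>-<flags>, all lowercase hex.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Start a brand-new trace (what happens when no valid context arrives)."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(incoming):
    """Keep the caller's trace ID and flags, mint a new span ID for this hop."""
    match = TRACEPARENT_RE.match(incoming or "")
    # All-zero trace or span IDs are invalid and are treated like a missing header.
    if not match or set(match.group(1)) == {"0"} or set(match.group(2)) == {"0"}:
        return new_traceparent()
    trace_id, _caller_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


if __name__ == "__main__":
    incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    print("propagated:", child_traceparent(incoming))
    print("lost context:", child_traceparent(None))
```

When the header is dropped at a service boundary, the second branch runs and a fresh trace ID is minted, which is why the end-to-end view falls apart.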
Common mistakes to avoid
5 patterns
×
Alerting on CPU usage and ignoring latency
Symptom
CPU alert at 90% fires, but users are happy (pre-computed cache warmed up). Latency alert would have been green. Later, database query slows down to 4 seconds, CPU stays at 20% — no alert fires, users complain.
Fix
Alert on p99 latency and error rate, not CPU. CPU is a resource metric for capacity planning, not user experience. Use the RED method: if users are happy (low latency, low errors), CPU can be 95% and it's fine. If users are unhappy (high latency), CPU can be 10% and you need to investigate database/external dependencies.
×
Monitoring average latency instead of p99
Symptom
Average latency dashboard shows 50ms, green. But p99 is 5 seconds — 1% of users have terrible experience. The product manager sees 'green' and doesn't understand why support tickets about slowness keep coming.
Fix
Always monitor latency percentiles: p50 for trends, p95 for most users, p99 for worst-case experience. Average hides outliers. Use Prometheus histogram with histogram_quantile(0.99, rate(...)) or a dedicated APM tool.
×
No baseline — alert threshold is absolute, not relative
Symptom
Alert set at 'p99 latency > 1 second'. During normal operation, p99 is 20ms, so alert never fires. One day, p99 rises to 500ms — still under 1 second, so no alert, but users are already unhappy because normal was 20ms.
Fix
Alert threshold should be relative to baseline: p99 latency > 3x normal for 5 minutes. Use trend or anomaly detection (Prometheus predict_linear for trend extrapolation, or an external tool). For absolute thresholds, set them at 2-3x your SLO, not at a number that sounds reasonable.
×
Not tail-sampling slow requests
Symptom
Sampling rate is 1% to save costs. A slow request pattern that affects 0.1% of traffic is traced only 1% of the time, so slow traced requests make up just 0.001% of all requests (1% of 0.1%). You almost never see one in your traces, and debugging takes weeks.
Fix
Use tail-based sampling: buffer 100% of traces in the collector, then export every trace with an error or duration > 500ms plus a small sample of the rest. The OpenTelemetry Collector supports this through its tail_sampling processor. This captures all slow and failed requests while storing only a fraction of the fast ones.
×
Monitoring only aggregated service-level metrics, not per-endpoint
Symptom
Service p99 latency is 50ms (green). But /checkout endpoint p99 is 2 seconds (red). The slow endpoint's traffic is diluted by fast /health checks and /products calls. No one notices until checkout fails during a sale.
Fix
Break down metrics by endpoint, especially for user-facing operations. The /health endpoint should be monitored separately from /checkout. Use custom buckets per critical endpoint. Alert on high-latency endpoints even if service average is green.
INTERVIEW PREP · PRACTICE MODE
Interview Questions on This Topic
Q01 of 04SENIOR
Explain the difference between p99 latency and average latency — and why p99 matters more for user experience.
ANSWER
Average latency is the arithmetic mean of all request latencies. It hides outliers: 980 requests at 1ms and 20 requests at 1000ms average to about 21ms. p99 latency is the value below which 99% of requests fall; 1% of requests are slower than this. In the same example, p99 is 1000ms, because 2% of requests are slow. For user experience, outliers matter because slow requests affect a percentage of users. An e-commerce site with 10,000 orders per hour: if p99 latency is 1 second, 100 orders per hour take >1 second. If you looked only at an average like the 21ms above, you'd think everything is fine while 100 customers per hour are waiting. Most SLOs are defined in percentiles (e.g., 99% of requests under 500ms). Monitoring average latency alone is effectively ignoring user experience for a significant fraction of users.
Q02 of 04SENIOR
Walk me through how you would debug a sudden increase in p99 latency from 200ms to 3 seconds in a microservices architecture with 10 services.
ANSWER
Step 1. Confirm scope: check RED metrics per service. If one service's p99 increased and others are normal, focus there. If all increased, suspect common dependency (database, cache, network).
Step 2. Look at the distributed trace for a slow request. Find the span with the highest duration. That's the bottleneck.
Step 3. If the bottleneck is a database query: check slow query log, run EXPLAIN, look for missing indexes, N+1 patterns, or lock contention.
Step 4. If bottleneck is an HTTP call to another service: check that service's metrics recursively (go to step 1 for that service).
Step 5. If bottleneck is 'code' span: use profiler (async-profiler) to find hot methods. Check for inefficient loops, JSON parsing, regex.
Step 6. If no single span dominates but many small spans add up: check for context switching overload, thread pool saturation, or lock contention.
Step 7. After finding root cause, deploy fix, verify latency returns to baseline, and add regression test to catch reoccurrence. Also add SLO alert for p99 > 500ms to catch earlier next time.
Q03 of 04SENIOR
What is the difference between metrics, traces, and logs? Give a scenario where you need all three to debug an issue.
ANSWER
Metrics are aggregated numerical data: request rate, error rate, latency percentiles. They're cheap to store and query, but lose individual request detail. Traces capture a single request's journey across services: each database call, RPC, cache hit. They're sampled (1-10%) because storage is expensive. Logs are discrete timestamped events: 'User 123 login failed', 'Connection pool exhausted'. They're high-cardinality and often unstructured unless you enforce a schema. Scenario: Metrics show p99 latency spiked to 3 seconds at 14:05 (something is wrong). You query traces for that time window and find a trace where the 'database query' span took 2.9 seconds (where the time is spent). You then look at logs for that database query (using the trace ID to filter) and see 'deadlock detected, retrying'. That's the 'why'. Without metrics, you wouldn't know there was a problem. Without traces, you wouldn't know the problem was the database. Without logs, you wouldn't know it was a deadlock.
Q04 of 04SENIOR
How would you design an alerting strategy for a new microservice? What metrics would you alert on, and what thresholds would you use?
ANSWER
I'd start with the RED method: Rate, Errors, Duration. For a user-facing web API, alert on: (1) p99 latency > 3x baseline (or absolute 500ms) for 5 minutes — user experience degraded. (2) Error rate > 1% for 2 minutes — service is failing. (3) Rate drop > 50% over 5 minutes — service may be unavailable or rejecting requests. Also alert on resource exhaustion: (4) Database connection pool saturation > 90% — capacity issue. (5) Thread pool queue size > 1000 — service can't keep up. Avoid CPU alerts for user-facing services — high CPU is fine if latency is low. Set thresholds using historical data: p99 latency baseline from last 7 days, alert when 3x median. Use for clauses (e.g., 'for: 5m') to avoid flapping on transient spikes. For critical endpoints (/checkout, /login), use service-level indicators (SLIs) and error budget alerts: remaining error budget < 1 hour at current burn rate.
FAQ · 4 QUESTIONS
Frequently Asked Questions
01
What's the difference between APM and Observability?
APM (Application Performance Monitoring) is a product category — tools like Datadog, New Relic, Dynatrace that collect metrics, traces, and logs. Observability is a property of a system: how well you can understand its internal state from its external outputs (telemetry). You achieve observability by instrumenting your code with metrics, traces, and logs. APM tools are one way to achieve observability. OpenTelemetry (vendor-neutral) is the current standard for instrumentation, replacing vendor-specific agents.
02
How much overhead does APM instrumentation add?
OpenTelemetry adds 2-5% CPU overhead at 1% sampling rate for traces. Metrics histograms add negligible overhead (~0.5% CPU). Logging at INFO level adds ~1% CPU. The biggest overhead is trace export (network, serialisation). Always sample traces (1-10% for high-traffic services). Use async span processors (non-blocking). For extremely latency-sensitive systems (<50us p99), consider eBPF-based monitoring or kernel tracing instead of code instrumentation.
03
How long should I store metrics, traces, and logs?
Metrics: 30-90 days for aggregates, 7 days for raw data. Use downsampling: keep 1-minute resolution for 7 days, 5-minute for 30 days, 1-hour for 90 days. Traces: 7-14 days for debugging weekly patterns; errors and slow requests for 30 days. Logs: 30 days for general use, 90 days or longer where compliance requires it (PCI DSS, for example, calls for 12 months of audit log history with the most recent 3 months immediately available, while GDPR pushes you to keep personal data no longer than necessary; confirm the periods that apply to you). Use tiered storage: hot (SSD) for 7 days, warm (SSD/HDD) for 30 days, cold (S3) for older data. OpenTelemetry collector supports routing traces to different backends based on attributes (e.g., errors → long-term).
04
What is tail-based sampling and when should I use it?
Head-based sampling decides at the start of the request (e.g., random 1%). Tail-based sampling makes the decision after the request completes. The OpenTelemetry collector buffers traces for a few seconds, then decides to keep or drop based on criteria: if duration > 500ms, keep; if error occurred, keep; otherwise, sample 1%. This ensures you have traces for all slow and failed requests (the ones you actually want to debug) without storing every successful 50ms request. Use tail-based sampling for high-traffic services (> 100 req/sec) where storing 100% of traces is expensive but you need to debug rare issues. The trade-off is added latency (traces held in buffer) and collector memory usage.
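The keep-or-drop decision described above can be sketched in a few lines. This is an illustration of the policy only, not the OpenTelemetry Collector's implementation: the span dictionaries, the function names, and the thresholds (500ms, 1%) are assumptions taken from the example criteria, and the random source is injectable so the behaviour can be tested.

```python
import random


def summarize(spans):
    """Reduce a completed trace's buffered spans to (total duration in ms, any error)."""
    start = min(span["start_ms"] for span in spans)
    end = max(span["end_ms"] for span in spans)
    return end - start, any(span["error"] for span in spans)


def keep_trace(spans, rng=random.random, slow_threshold_ms=500, baseline_rate=0.01):
    """Decide after the trace completes: keep errors and slow traces, sample the rest."""
    duration_ms, has_error = summarize(spans)
    if has_error:
        return True
    if duration_ms > slow_threshold_ms:
        return True
    return rng() < baseline_rate


if __name__ == "__main__":
    fast_ok = [{"start_ms": 0, "end_ms": 50, "error": False}]
    slow_ok = [
        {"start_ms": 0, "end_ms": 900, "error": False},
        {"start_ms": 10, "end_ms": 880, "error": False},  # slow database span
    ]
    fast_failed = [{"start_ms": 0, "end_ms": 40, "error": True}]
    for name, trace in [("fast", fast_ok), ("slow", slow_ok), ("failed", fast_failed)]:
        print(name, "kept:", keep_trace(trace))
```

The decision needs the whole trace, which is why the collector has to buffer spans first and why memory usage is the price of this approach.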