Senior 5 min · March 06, 2026

Monitoring vs Observability — P99 Blind Spot Disaster

1% of users saw 30-second hangs while the average-latency dashboard stayed green at 120ms, because no P99 metric existed.

Naren · Founder
Plain-English first. Then code. Then the interview question.
Quick Answer
  • Monitoring watches predefined metrics and alerts when thresholds are breached — it answers 'WHAT is broken'
  • Observability lets you ask arbitrary questions about system state from external telemetry — it answers 'WHY it broke'
  • The 3 pillars are Metrics (aggregated numeric signals), Logs (per-event detail), and Traces (request flow across services)
  • Without a shared Correlation ID across all three pillars, debugging cross-service failures is manual timestamp archaeology
  • Prometheus uses pull-based scraping — your app exposes /metrics, Prometheus fetches on a fixed interval
  • High-cardinality labels (User IDs, request IDs) as metric tags will crash your monitoring storage
  • Monitoring is a subset of observability — you need both, and they serve different cognitive roles during an incident
Plain-English First

Imagine your car has a dashboard with a fuel gauge, temperature dial, and oil light. That's monitoring — you set up specific gauges in advance and watch them. Observability is like having a mechanic who can plug a laptop into your car's diagnostic port and ask ANY question about what's happening inside the engine, even questions you never thought to ask before the trip started. Monitoring tells you THAT something is wrong. Observability helps you figure out WHY — and more importantly, it lets you answer questions you didn't know you needed to ask until the car was already on fire.

Your production system crashes at 2 AM on Black Friday. Orders are failing, users are screaming on Twitter, and your on-call engineer is staring at a wall of dashboards wondering where to even begin. This scenario plays out every day at companies around the world — and the difference between a 10-minute fix and a 4-hour outage almost always comes down to one thing: how well-instrumented your system was before the incident.

Monitoring and observability are not luxury features you bolt on after launch. They are the engineering discipline that separates teams who fix incidents in minutes from teams who spend hours in Slack threads and frantic Zoom calls pointing fingers at each other's services. While monitoring is the act of watching known-knowns through dashboards and alerts, observability is the property of a system that allows you to understand its internal state from the external data it emits — logs, metrics, and traces. The operative word is property. Observability is not a tool you buy. It is a quality you engineer into your system from the start.

The critical misconception I see constantly at staff level: treating monitoring and observability as synonyms. They are not. Monitoring is a subset of observability. You can monitor without observability — you get dashboards with no context, alerts that fire with no actionable detail, and on-call engineers who solve incidents through gut instinct and luck. You cannot have meaningful observability without some form of monitoring — you still need alerts to know when to look at anything. The two are complementary, not competing. Get both right and your MTTR drops from hours to minutes. Get only one right and you are still flying partially blind.

The Three Pillars: Metrics, Logs, and Traces

True observability requires all three types of telemetry working together, not independently. Metrics provide a high-level, aggregated view of system health — request rate, error rate, CPU saturation. Logs provide the per-event forensic record of exactly what happened at a specific moment in time. Distributed traces connect everything by showing the complete path of a single request across every service it touched, with timing for each hop.

Each pillar serves a distinct cognitive role in an incident, and understanding those roles prevents the common mistake of trying to use one pillar for a job another pillar does better. Metrics are your early warning system — they tell you THAT something changed, and they tell you fast. A metric alert can fire within 30 seconds of a threshold being crossed. Logs are your forensic evidence — they tell you exactly WHAT happened at a specific request or event level, but querying them at scale is expensive and slow compared to a metric query. Traces are your request GPS — they tell you WHERE a request traveled and HOW LONG each individual service spent processing it. The power of the three-pillar model is not in any single pillar; it is in the correlation between them.

The correlation ID is the connective tissue. When a P99 latency alert fires (metric), you need to jump directly to the traces for the slow requests and then to the logs for the specific error messages those traces contain. If the trace ID is not present in your logs, that jump requires manual timestamp correlation — which is slow, error-prone, and genuinely miserable to do at 2 AM during an active incident. Instrument for correlation from day one, not as a retrofit after the first major outage.

One more thing that gets underweighted: the 'unknown unknowns' problem. Metrics cover what you thought to instrument. Logs cover what you thought to log. Traces cover what you thought to trace. Observability is the property that lets you ask questions you did not anticipate when you wrote the code — because the raw telemetry is rich enough to answer novel queries. A system where you can only ask questions you pre-configured dashboards for is a monitored system. A system where you can compose a new query during an incident and get a meaningful answer is an observable system.

io/thecodeforge/monitoring/OrderMonitor.java · JAVA
package io.thecodeforge.monitoring;

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.opentelemetry.api.trace.Span;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

/**
 * io.thecodeforge — Three-Pillar Instrumentation Pattern
 *
 * Demonstrates all three observability pillars in a single operation:
 *   - Metric: forge.orders.processed (counter, tagged by status) and forge.payment.latency (timer)
 *   - Log:    Structured log line with traceId injected from active OpenTelemetry span
 *   - Trace:  Active span context is propagated by the OTel agent — no manual span creation needed
 *
 * Key design decisions:
 *   1. traceId comes from the active OTel Span — it is the same ID visible in Jaeger/Tempo
 *   2. MDC injection makes traceId appear in EVERY log line within this thread scope
 *   3. The Timer wraps the entire business operation — it measures wall clock time including I/O
 *   4. Counter uses .tag("status", "success") so failures can be counted separately:
 *      forge.orders.processed{status="failure"} without creating a separate metric
 */
public class OrderMonitor {

    private static final Logger log = LoggerFactory.getLogger(OrderMonitor.class);

    private final Counter orderSuccessCounter;
    private final Counter orderFailureCounter;
    private final Timer paymentTimer;

    public OrderMonitor(MeterRegistry registry) {
        // Use tagged counters so success and failure are queryable separately in PromQL:
        // rate(forge_orders_processed_total{status="failure"}[5m]) gives the failure rate
        this.orderSuccessCounter = registry.counter("forge.orders.processed", "status", "success");
        this.orderFailureCounter = registry.counter("forge.orders.processed", "status", "failure");

        // A plain Timer exports only _count, _sum, and _max. publishPercentileHistogram()
        // adds the _bucket series that this query needs:
        // histogram_quantile(0.99, sum by (le) (rate(forge_payment_latency_seconds_bucket[5m])))
        this.paymentTimer = Timer.builder("forge.payment.latency")
                .publishPercentileHistogram()
                .register(registry);
    }

    public void processOrder(String orderId) {
        // Inject the active OpenTelemetry trace ID into MDC before any log statements
        // This ensures EVERY log line within this method includes the traceId automatically
        String traceId = Span.current().getSpanContext().getTraceId();
        MDC.put("traceId", traceId);
        MDC.put("orderId", orderId);

        try {
            paymentTimer.record(() -> {
                // log.info output will include traceId and orderId from MDC automatically
                // Logback/Log4j2 pattern: %X{traceId} %X{orderId} %msg
                log.info("Order processing started");

                // ... business logic: validate, reserve inventory, charge payment ...

                orderSuccessCounter.increment();
                log.info("Order processing completed successfully");
            });
        } catch (Exception e) {
            orderFailureCounter.increment();
            // ERROR logs include traceId from MDC, so you can jump directly to the trace in Jaeger.
            // Passing the exception as the last argument makes SLF4J log the full stack trace.
            log.error("Order processing failed ({}); check trace for upstream dependency state",
                e.getClass().getSimpleName(), e);
            throw e;
        } finally {
            // Always clear MDC — thread pool reuse will leak context if you skip this
            MDC.clear();
        }
    }
}

/*
 * Example log output (JSON format via logstash-logback-encoder):
 * {
 *   "level":   "INFO",
 *   "logger":  "io.thecodeforge.monitoring.OrderMonitor",
 *   "message": "Order processing completed successfully",
 *   "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",   <-- same ID in Jaeger/Tempo
 *   "orderId": "ORD-101"
 * }
 *
 * Metrics emitted (visible in Prometheus):
 *   forge_orders_processed_total{status="success"} 1.0
 *   forge_payment_latency_seconds_count 1.0
 *   forge_payment_latency_seconds_sum 0.043
 *   forge_payment_latency_seconds_bucket{le="0.05"} 1.0
 */
Output
{
"level": "INFO",
"message": "Order processing completed successfully",
"traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
"orderId": "ORD-101"
}
forge_orders_processed_total{status="success"} 1.0
forge_payment_latency_seconds_bucket{le="0.05"} 1.0
The Three Pillars as a Debugging Funnel
  • Metrics are cheap to collect and store — start here for every new service. They are your smoke alarm: they tell you something changed, fast, without requiring you to read individual events.
  • Logs are expensive at scale but irreplaceable for root cause analysis. Use structured JSON logs from day one — if your logs are not queryable by trace ID in under one second, you do not have observability, you have a text file archive.
  • Traces bridge the gap — they connect a metric anomaly (P99 spiked) to the specific service and span (payment-service.charge() took 4.8s) without manual timestamp correlation across log files.
  • The trace ID in your logs is the most important field you can add. It is the hyperlink between a log line and the full request journey. Without it, cross-service debugging during an incident is archaeological work.
  • The unknown unknowns problem is only solvable when all three pillars are present and correlated. Predefined dashboards cover what you anticipated. Rich correlated telemetry lets you ask questions you did not anticipate — which is exactly what novel failure modes require.
Production Insight
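Structured JSON logs cost 2–3x more storage than plain text but enable fast field-level queries in Elasticsearch, or in Loki once labels and a time range have narrowed the search.
Without structured fields, finding all log lines for a specific trace ID in 10TB of log data requires a full-text scan, which takes minutes under load and often times out entirely. With JSON and a proper field index, the same query becomes an index lookup that typically completes in milliseconds. Loki indexes only labels, not log content, so keep trace IDs out of labels there and filter on the parsed field within a narrow time window.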
Rule: if your on-call team cannot jump from a Grafana alert to the relevant log lines in under 30 seconds, your logging is not serving its purpose during incidents. That 30-second jump requires structured logs, a trace ID field, and an index on that field. It is not a nice-to-have; it is the baseline for operational effectiveness.
Key Takeaway
Metrics tell you WHEN something broke. Traces tell you WHERE. Logs tell you WHY.
The Correlation ID — specifically the trace ID injected into every log line from the active OpenTelemetry span — is the glue that makes the three pillars an observability system rather than three separate dashboards. Instrument for it from the start. Retrofitting it into an existing codebase is painful and incomplete by nature.
Choosing Which Pillar to Invest In First
If: No alerting exists at all and the team is flying blind in production
Use: Start with metrics. Implement the 4 Golden Signals (latency, traffic, errors, saturation) for every service. Use Micrometer for JVM services, OpenTelemetry SDKs for everything else.
If: Alerts fire regularly but root cause investigation takes hours not minutes
Use: Add distributed tracing with OpenTelemetry auto-instrumentation. The goal is to connect an alert directly to the specific service, endpoint, and downstream dependency that is causing the issue.
If: Traces show the failing service and the slow span, but not the reason for the failure
Use: Add structured JSON logging with trace ID injected from the active OTel span context via MDC. Ensure ERROR and WARN logs capture enough context (request parameters, upstream service state) to diagnose without re-running the request.
If: All three pillars exist but live in separate tools with no cross-linking
Use: Unify on a single observability platform that supports exemplars and trace-to-log correlation. Grafana Stack (Prometheus + Loki + Tempo) covers this at low cost. Datadog and Honeycomb cover it with less operational overhead at higher license cost.

Observability in Practice: Prometheus and Grafana

Modern observability stacks typically center on a pull-based collection model, and Prometheus is the reference implementation of that model in the open-source world. Instead of your application pushing metrics to a central store on a schedule it controls, Prometheus scrapes a /metrics endpoint on your service at an interval Prometheus controls. Your application is passive — it just maintains counters and histograms in memory and serves them when asked.

The pull model has a meaningful architectural advantage: a misbehaving application cannot overwhelm the monitoring system with a flood of metric writes. If your application starts emitting a metric push for every debug event, a push-based system has to absorb the explosion and can fall over. Prometheus, by contrast, simply scrapes at its configured interval regardless of what the application is doing. The trade-off is the opposite failure mode: if your application crashes between scrapes, you lose the data from that window. For short-lived jobs and batch processes that complete in under one scrape interval, the Pushgateway solves this by holding the pushed metrics so that Prometheus can scrape them on its normal schedule. It keeps them until they are explicitly deleted, so stale entries need cleanup.
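To make the pull model concrete, here is a minimal sketch using only the JDK. It is not how Micrometer or the official Prometheus client is implemented; the class name, the metric name forge_demo_requests_total, and the self-scrape in main are all assumptions for illustration. The application only increments an in-memory counter, and nothing leaves the process until something fetches /metrics.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical pull-model sketch: the app counts, the scraper decides when to read. */
final class PullModelSketch {

    // In-memory state only; incrementing it sends nothing over the network.
    private final LongAdder requests = new LongAdder();

    void recordRequest() {
        requests.increment();
    }

    /** Renders the current counter value in the Prometheus text exposition format. */
    String render() {
        return "# HELP forge_demo_requests_total Requests handled (illustrative metric).\n"
             + "# TYPE forge_demo_requests_total counter\n"
             + "forge_demo_requests_total " + requests.sum() + "\n";
    }

    /** Serves /metrics on the given port; port 0 lets the OS pick a free one. */
    HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/metrics", exchange -> {
            byte[] body = render().getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders()
                    .set("Content-Type", "text/plain; version=0.0.4; charset=utf-8");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws Exception {
        PullModelSketch app = new PullModelSketch();
        app.recordRequest();
        app.recordRequest();
        HttpServer server = app.start(0);
        try {
            // Stand-in for a single Prometheus scrape of this process.
            URI uri = URI.create("http://localhost:" + server.getAddress().getPort() + "/metrics");
            HttpResponse<String> scrape = HttpClient.newHttpClient()
                    .send(HttpRequest.newBuilder(uri).build(), HttpResponse.BodyHandlers.ofString());
            System.out.print(scrape.body());
        } finally {
            server.stop(0);
        }
    }
}
```

However many times recordRequest() runs between two scrapes, the scraper still makes exactly one request per interval, which is the protection described above. In a real service you would let Micrometer or a Prometheus client library produce this output rather than formatting it by hand.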

Prometheus stores metrics as time series: a metric name, a set of key-value labels, and a sequence of timestamped float64 values. The label set is everything — it is how you slice a metric by service, region, endpoint, or status code. But the label set is also where most production Prometheus problems originate. Every unique combination of label values creates a separate time series in the TSDB. A metric with three labels, each with ten possible values, creates one thousand time series. Add a fourth label with a thousand possible values and you have a million time series for a single metric name. This is the cardinality problem, and it is not theoretical — I have watched it take down a Prometheus instance in production within 20 minutes of a bad deployment that added a user_id label to a high-traffic metric.

PromQL is where the real diagnostic power lives. PromQL is not a query language for fetching raw data; it is a functional language for computing derived signals from time series. The most important function in incident response is histogram_quantile(), which estimates a percentile from the cumulative buckets of a histogram metric. histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) gives you the estimated P99 latency over a 5-minute window: the threshold that the slowest 1% of requests exceeded. The estimate is interpolated within a bucket, so its accuracy depends on how closely your bucket boundaries bracket the latencies you care about. No dashboard based on averages can show you this. Building this query and putting it on your primary service dashboard is a one-time 10-minute investment that has caught more production incidents for me than any other single piece of instrumentation.
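The following is a simplified sketch of the bucket interpolation idea behind histogram_quantile(), written in plain Java so the arithmetic is visible. It is not Prometheus's source code and it ignores cases the real function handles (negative bucket bounds, native histograms). The class name and the sample bucket layout are assumptions chosen for illustration.

```java
/** Simplified sketch of quantile estimation from cumulative histogram buckets. */
final class HistogramQuantileSketch {

    /**
     * @param q                quantile in [0, 1], for example 0.99
     * @param upperBounds      ascending bucket upper bounds ("le"); the last must be +Infinity
     * @param cumulativeCounts number of observations at or below each upper bound
     */
    static double quantile(double q, double[] upperBounds, double[] cumulativeCounts) {
        if (upperBounds.length < 2 || upperBounds.length != cumulativeCounts.length) {
            throw new IllegalArgumentException("need at least one finite bucket plus +Inf");
        }
        int last = upperBounds.length - 1;
        double total = cumulativeCounts[last];
        if (total == 0) {
            return Double.NaN; // no observations in the window
        }
        double rank = q * total;

        // Find the first bucket whose cumulative count reaches the rank.
        int b = 0;
        while (b < last && cumulativeCounts[b] < rank) {
            b++;
        }
        if (b == last) {
            // The rank falls in the +Inf bucket: the best available answer is the highest finite bound.
            return upperBounds[last - 1];
        }

        double lower = (b == 0) ? 0.0 : upperBounds[b - 1];
        double below = (b == 0) ? 0.0 : cumulativeCounts[b - 1];
        double inBucket = cumulativeCounts[b] - below;
        if (inBucket == 0) {
            return lower;
        }
        // Assume observations are spread evenly inside the bucket and interpolate linearly.
        return lower + (upperBounds[b] - lower) * ((rank - below) / inBucket);
    }

    public static void main(String[] args) {
        double[] le = {0.1, 0.5, 1.0, 5.0, Double.POSITIVE_INFINITY};
        double[] counts = {900, 950, 980, 995, 1000};
        System.out.printf("estimated P99 = %.3f s%n", quantile(0.99, le, counts));
    }
}
```

With these sample buckets the 990th observation lands in the (1.0, 5.0] bucket, so the estimate is 1.0 + 4.0 × 10/15 ≈ 3.667 seconds. The true value could be anywhere in that bucket, which is why bucket boundaries near your alert threshold matter.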

prometheus-config.yml · YAML
# io.thecodeforge — Production Prometheus Scrape Configuration
#
# Design decisions explained inline:
#   - scrape_interval: 15s is the standard for most services
#     Reduce to 5s for latency-sensitive paths, but watch Prometheus CPU/memory
#   - honor_labels: false (default): on a label conflict Prometheus keeps its own label
#     and renames the scraped one to exported_<label>. Set true only if you trust the target's labels.
#   - relabel_configs allow normalizing instance labels before storage
#     This prevents cardinality drift from inconsistent hostname formats

global:
  scrape_interval: 15s
  evaluation_interval: 15s  # How often alerting rules are evaluated
  # Labels attached to series and alerts when this instance talks to external systems
  # (federation, remote write, Alertmanager)
  external_labels:
    datacenter: 'us-east-1'
    environment: 'production'

rule_files:
  - 'rules/forge_alerts.yml'   # Alert rules — evaluated every evaluation_interval
  - 'rules/forge_records.yml'  # Recording rules — pre-aggregate expensive PromQL queries

scrape_configs:
  - job_name: 'forge-api-service'
    # Spring Boot Actuator exposes Prometheus metrics at /actuator/prometheus
    # Not /metrics — this is a common misconfiguration that produces DOWN targets
    metrics_path: '/actuator/prometheus'
    scrape_interval: 5s      # Override global for latency-sensitive checkout service
    scrape_timeout: 4s       # Must be less than scrape_interval to avoid overlap
    static_configs:
      - targets:
          - 'forge-prod-app-01:8080'
          - 'forge-prod-app-02:8080'
        labels:
          service: 'order-service'  # Added to every time series from these targets
    relabel_configs:
      # Normalize the instance label to a stable name instead of host:port
      # Prevents cardinality drift if the port changes or hostname format varies
      - source_labels: [__address__]
        regex: '(forge-prod-app-\d+):.*'
        target_label: instance
        replacement: '$1'

  - job_name: 'forge-payment-service'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['forge-payment:8080']
        labels:
          service: 'payment-service'

---
# rules/forge_alerts.yml
# Alert on P99 latency — not average. Average latency alerts miss tail latency problems.
groups:
  - name: forge.latency
    rules:
      - alert: ForgeCheckoutP99LatencyHigh
        expr: |
          histogram_quantile(
            0.99,
            sum by (le) (
              rate(http_server_requests_seconds_bucket{
                job="forge-api-service",
                uri="/api/checkout"
              }[5m])
            )
          ) > 2.0
        for: 3m   # Must be elevated for 3 minutes before firing — reduces noise
        labels:
          severity: critical
          team: platform
        annotations:
          summary: 'Checkout P99 latency exceeds 2s'
          description: |
            P99 latency for /api/checkout is {{ $value | humanizeDuration }}.
            Check Jaeger for slow traces on this endpoint.
            Dashboard: https://grafana.thecodeforge.io/d/checkout-slo

---
# rules/forge_records.yml
# Recording rules pre-aggregate expensive queries so dashboards load instantly
# Without this, a 7-day P99 query runs at render time and takes 30+ seconds
groups:
  - name: forge.recordings
    rules:
      - record: job:http_request_duration_p99:rate5m
        expr: |
          histogram_quantile(
            0.99,
            sum by (job, le) (
              rate(http_server_requests_seconds_bucket[5m])
            )
          )
Output
Prometheus targets updated.
Scraping forge-prod-app-01:8080 at /actuator/prometheus every 5s
Scraping forge-prod-app-02:8080 at /actuator/prometheus every 5s
Alert ForgeCheckoutP99LatencyHigh: inactive
Recording rule job:http_request_duration_p99:rate5m: active
The Cardinality Bomb — One Label Can Bring Down Prometheus
  • Never include unique per-request values as metric labels. User IDs, order IDs, session tokens, email addresses — any label whose value changes per request will create one time series per unique value, per metric. On a system with 1 million users, a single user_id label creates 1 million time series for one metric name.
  • Prometheus TSDB stores all active time series in memory. A cardinality explosion is a memory explosion. The symptoms are: Prometheus memory climbs continuously, queries slow, and eventually the process OOMs and restarts. The cycle repeats because the high-cardinality data is already on disk.
  • The fix is not to increase Prometheus memory. The fix is to remove the high-cardinality label from the metric definition. Use exemplars to link a specific metric observation to a trace ID without creating a new time series per request.
  • Rule of thumb: if a label value is unique per request or per user, it does not belong on a metric. Put it in a log field or a trace span attribute instead.
  • Before deploying a new metric, count the expected unique label combinations. Multiply across all label dimensions. If that number exceeds 10,000 for a single metric, redesign the label set.
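The pre-deployment count in the last bullet is simple multiplication, and a small helper makes it easy to put in a review checklist. This is a sketch under assumptions: the class name, the label names, and the distinct-value counts are hypothetical, and the 10,000 limit is the rule of thumb from the bullet above rather than a Prometheus setting.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch of a worst-case series count for one metric name. */
final class CardinalityBudget {

    // Rule-of-thumb ceiling from the text; not a value Prometheus enforces.
    static final long MAX_SERIES_PER_METRIC = 10_000;

    /** Multiplies the number of distinct values across every label dimension. */
    static long worstCaseSeries(Map<String, Integer> distinctValuesPerLabel) {
        long series = 1;
        for (int distinct : distinctValuesPerLabel.values()) {
            series = Math.multiplyExact(series, (long) distinct);
        }
        return series;
    }

    static boolean withinBudget(Map<String, Integer> distinctValuesPerLabel) {
        return worstCaseSeries(distinctValuesPerLabel) <= MAX_SERIES_PER_METRIC;
    }

    public static void main(String[] args) {
        Map<String, Integer> labels = new LinkedHashMap<>();
        labels.put("service", 10);
        labels.put("region", 10);
        labels.put("status", 10);
        System.out.println(worstCaseSeries(labels) + " series, ok=" + withinBudget(labels));

        // One more label with a thousand possible values changes the picture.
        labels.put("user_id", 1_000);
        System.out.println(worstCaseSeries(labels) + " series, ok=" + withinBudget(labels));
    }
}
```

Three labels with ten values each give 1,000 series, and adding the thousand-value label gives 1,000,000, matching the arithmetic earlier in this section. For a histogram, multiply again by the number of buckets, since each bucket is its own series.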
Production Insight
A Prometheus scrape_interval of 15s means your effective metric resolution is 15 seconds. A gauge that spikes and recovers within a single 15-second window between scrapes is invisible; Prometheus never sees the intermediate value.
For most services, 15-second resolution is perfectly adequate. For checkout latency, payment processing, and other revenue-critical paths, consider reducing to 5s. Before you do, run Prometheus with the current configuration for a week and baseline its CPU and memory consumption. Tripling scrape frequency triples the number of samples ingested and written to disk for those targets, while memory grows more slowly because it is driven mainly by the number of active series rather than the sample rate. Either way, you need to know your headroom before making the change in production.
For events shorter than your scrape interval (flash spikes, GC pauses, brief thread pool exhaustion), scrape resolution is not the right tool. Use application-level histograms — they accumulate observations between scrapes, so the statistical signal survives even if the raw spike is not captured in a single scrape.
Key Takeaway
The Prometheus pull model protects your monitoring infrastructure from application misbehavior — a crashing or misbehaving app cannot flood the monitoring system. But scrapes are lossy by design: sub-interval events are invisible unless you use histograms to accumulate observations between scrapes.
Cardinality is the silent killer. One high-cardinality label — added by a developer who thought it would help debugging — can consume gigabytes of Prometheus memory within hours and bring down the entire monitoring stack. Review label sets in code review with the same scrutiny you apply to database index design.
Always use histogram_quantile() in PromQL for latency metrics. Averages are misleading in right-skewed distributions, and your latency distributions are always right-skewed.
Prometheus Architecture Decisions
If: Single cluster, fewer than 1 million active time series, retention under 30 days
Use: Single Prometheus instance with local TSDB is sufficient. Monitor Prometheus's own memory and disk consumption as a first-class concern.
If: Multiple clusters, need cross-cluster querying, or retention beyond 30 days
Use: Add Thanos Sidecar or Grafana Mimir for horizontal scaling, query federation across clusters, and object storage backend (S3/GCS) for long-term retention at low cost.
If: Short-lived batch jobs or processes that complete in under one scrape interval
Use: Use Pushgateway. The job pushes its final metric state before exiting, and Prometheus scrapes Pushgateway on the normal schedule. Clean up stale Pushgateway entries after job completion.
If: Need to correlate specific metric observations with traces for deep debugging
Use: Enable exemplar support in Prometheus (--enable-feature=exemplar-storage) and configure your metrics library to attach the current trace ID as an exemplar. Grafana renders exemplars as clickable links directly to Tempo or Jaeger.
● Production incident · POST-MORTEM · severity: high

Black Friday P99 Latency Explosion — Monitoring Showed Green, Observability Found the Root Cause

Symptom
Customer support was flooded with 'checkout stuck' reports within 20 minutes of peak traffic starting. The average latency dashboard showed 120ms — well within the 500ms SLA. No alerts fired. Engineers looked at standard dashboards for 45 minutes and saw nothing visibly wrong. The gap between what the dashboards said and what users were experiencing was total.
Assumption
The team assumed it was a frontend rendering issue or a CDN cache invalidation problem, since all backend dashboards showed healthy. They spent 30 minutes investigating client-side JavaScript performance and CDN configuration before anyone looked at the access logs and noticed that a specific subset of requests was consistently timing out at exactly 30 seconds — the database query timeout ceiling.
Root cause
A single read replica had developed a slow disk I/O issue after a storage volume resize operation earlier that day. The load balancer's round-robin routing sent approximately 1% of read queries to this replica. Because 99% of requests were fast, the average latency metric never deviated from baseline. Only P99 and P999 metrics revealed the tail latency problem — and those metrics did not exist on any dashboard. The monitoring system covered average latency and error rate. It had never been asked to track percentiles. The slow replica was not returning errors, so the error rate metric also stayed flat. Every signal the team had configured to watch was green while 1% of users experienced a 30-second checkout hang.
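A small numeric sketch shows how a mean can look healthy while the tail is not. The traffic below is hypothetical and chosen only to illustrate the effect; it is not data from this incident, and the class and method names are made up for the example.

```java
import java.util.Arrays;

/** Sketch: the same latency samples summarized by mean and by nearest-rank percentile. */
final class MeanVsP99 {

    static double mean(long[] latenciesMs) {
        return Arrays.stream(latenciesMs).average().orElse(Double.NaN);
    }

    /** Nearest-rank percentile: the smallest sample with at least p percent of samples at or below it. */
    static long percentile(long[] latenciesMs, double p) {
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Hypothetical traffic: 980 requests at 90 ms and 20 requests at 1,500 ms. */
    static long[] sampleTraffic() {
        long[] samples = new long[1000];
        Arrays.fill(samples, 0, 980, 90L);
        Arrays.fill(samples, 980, 1000, 1500L);
        return samples;
    }

    public static void main(String[] args) {
        long[] samples = sampleTraffic();
        System.out.printf("mean = %.1f ms%n", mean(samples));          // 118.2 ms
        System.out.println("p50  = " + percentile(samples, 50) + " ms"); // 90 ms
        System.out.println("p99  = " + percentile(samples, 99) + " ms"); // 1500 ms
    }
}
```

A dashboard showing only the 118.2 ms mean sits comfortably inside a 500 ms SLA, while the P99 of 1,500 ms shows that a slice of users is having a very different experience.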
Fix
Added P95 and P99 latency histogram panels to the primary checkout service dashboard as first-class, always-visible metrics rather than buried secondary panels. Reconfigured alerting to fire on P99 > 2s sustained for 3 minutes, completely independent of average latency. Added a distributed trace sampler at 1% of request volume flowing into Jaeger, which immediately revealed the replica routing pattern when replayed against the incident window. Implemented load balancer health checks that included actual query response time probes against each replica, not just TCP connectivity checks — a TCP connection can succeed while the replica is processing queries at 10x normal latency.
Key lesson
  • Averages are a lie in skewed distributions. A single slow replica behind a load balancer is invisible in average latency unless you are actively tracking percentiles. Always monitor P95 and P99 as first-class metrics, not afterthoughts.
  • If your alerting only covers averages and error rates, you are systematically blind to the failures that affect your highest-value users — the ones completing large orders, the enterprise accounts, the users on slow connections who already have the least tolerance for performance degradation.
  • Distributed tracing is the only reliable mechanism to attribute tail latency to a specific downstream component. Without it, you are correlating timestamps across log files by hand during the worst moments of an incident.
  • Load balancer health checks that only test TCP connectivity are not health checks. They are port checks. If a replica is accepting connections but taking 25 seconds to respond, a TCP health check will mark it as healthy every single time. Health checks must test actual application response time.
Production debug guide: Symptom → Action mapping for common observability failures (5 entries)
Symptom · 01
Alert fires but all dashboards show everything green
Fix
Do not dismiss the alert because dashboards look fine — that discrepancy is itself diagnostic information. Check whether the alert is based on a different metric name or time aggregation than the dashboard panel. Verify the alert evaluation window: a 1-minute alert window may have already auto-resolved by the time you open the dashboard with its default 15-minute view. Check for clock skew between the alerting system and the metric source — a 30-second clock difference between Prometheus and an application instance can cause apparent mismatches. Pull the raw metric value from the Prometheus API directly for the timestamp when the alert fired and compare it to the dashboard panel rendering.
Symptom · 02
Cannot trace a request across multiple services — trace appears incomplete or broken
Fix
The most common cause is a single service in the chain failing to propagate trace context headers. Verify that W3C Trace-Context headers (traceparent and tracestate) or B3 headers are present on every outbound HTTP call, including calls made by async workers and message queue consumers. Check that the OpenTelemetry SDK is initialized before any HTTP client beans are constructed — a common Spring Boot mistake is initializing the HTTP client in a @PostConstruct before the tracing agent attaches. For message queue propagation, confirm that trace headers are being written to and read from message metadata, not just HTTP headers.
Symptom · 03
Prometheus targets showing as DOWN in the UI despite the application running
Fix
Start by testing reachability from the Prometheus server itself, not from your workstation — network policies and Kubernetes service mesh rules frequently allow traffic from some sources and block it from others. Use curl from inside the Prometheus pod to hit the /metrics endpoint directly. Check Kubernetes NetworkPolicy objects for rules that might block Prometheus's egress to application pods. Verify the metrics_path in the scrape config matches the actual endpoint path — Spring Boot Actuator exposes /actuator/prometheus, not /metrics. Check that the application is binding to 0.0.0.0 and not 127.0.0.1, which is a common misconfiguration in containerized environments.
Symptom · 04
Log volume exploded overnight and Elasticsearch or Loki is rejecting writes
Fix
Identify the top logging service by volume before you do anything else — killing the wrong log stream wastes time. Query your log aggregation platform for volume by service and log level over the last hour. Look for debug-level logging left enabled in production, which is the most frequent cause of sudden log volume spikes after a deployment. Check for a logging loop: a service logging every retry attempt in an exponential backoff loop will generate thousands of log lines per second. Implement log sampling immediately as a mitigation: keep 100% of ERROR and WARN, sample 10% of INFO, and drop DEBUG entirely until the root cause is resolved. Add per-service log rate limits at the collection agent layer (Fluentd or Logstash) as a permanent backstop.
Symptom · 05
Grafana dashboards load slowly, timeout, or show query errors
Fix
High-cardinality PromQL queries are the most common cause and should be checked first. Open the Grafana query inspector for the slow panel and look at the raw PromQL — any query that includes a label with high uniqueness (user_id, order_id, session_id) will cause Prometheus to scan enormous amounts of data. Reduce the dashboard time range as an immediate workaround. For expensive queries that legitimately need to run, create recording rules that pre-aggregate the result on a schedule — dashboards then query the pre-aggregated series instead of computing it at render time. If dashboards are slow across the board on long time ranges, evaluate Thanos or Grafana Mimir for long-term metric storage with query federation.
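The log sampling mitigation from Symptom 04 (keep every ERROR and WARN, keep 10% of INFO, drop DEBUG) can be expressed as a small decision function. This is a sketch under assumptions: in practice the rule usually lives in the collection agent's configuration rather than in application code, and the class name, the Level enum, and the sequence-based 1-in-10 rule are illustrative choices.

```java
/** Sketch of a level-based log sampling decision. */
final class LogSamplingSketch {

    enum Level { DEBUG, INFO, WARN, ERROR }

    /**
     * @param level    severity of the log event
     * @param sequence running count of events for the service, used for deterministic 1-in-10 sampling
     * @return true if the event should be forwarded to the log store
     */
    static boolean shouldKeep(Level level, long sequence) {
        switch (level) {
            case ERROR:
            case WARN:
                return true;                 // never drop the events you debug with
            case INFO:
                return sequence % 10 == 0;   // keep roughly 10%
            default:
                return false;                // DEBUG is dropped during the mitigation
        }
    }

    public static void main(String[] args) {
        int keptInfo = 0;
        for (long i = 0; i < 100; i++) {
            if (shouldKeep(Level.INFO, i)) {
                keptInfo++;
            }
        }
        System.out.println("INFO kept out of 100: " + keptInfo);
    }
}
```

A counter-based rule is predictable and easy to test. Random sampling works as well, but if you sample per request, decide once per trace ID so that the lines you keep for a request stay together.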
★ Observability Debug Cheat Sheet
When your monitoring or observability pipeline is failing, run these checks in order. Each step is designed to narrow the failure surface before you touch any configuration.
Prometheus target is DOWN
Immediate action
Verify the /metrics endpoint is reachable from inside the Prometheus pod, not from your local machine
Commands
curl -s http://forge-prod-app:8080/actuator/prometheus | head -20
kubectl exec -it prometheus-pod -- wget -qO- http://forge-prod-app:8080/actuator/prometheus
Fix now
Check NetworkPolicy objects, service endpoints, and that the application container is binding to 0.0.0.0 not 127.0.0.1. Verify the metrics_path in prometheus scrape config matches the actual endpoint.
Missing traces in Jaeger or Tempo: spans appear incomplete or absent
Immediate action
Verify trace context propagation headers are present on outbound requests and that the OTel exporter endpoint is reachable
Commands
curl -v http://forge-api:8080/api/orders 2>&1 | grep -i 'traceparent\|trace\|b3'
kubectl logs deployment/forge-api --tail=50 | grep -i 'trace\|span\|otel\|exporter'
Fix now
Ensure the OpenTelemetry SDK agent is initialized before HTTP client construction. Confirm OTEL_EXPORTER_OTLP_ENDPOINT is set correctly in the deployment environment variables. For async consumers, verify trace headers are read from message metadata.
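When checking propagation by hand, it helps to know what a valid header looks like. The sketch below validates a W3C `traceparent` value and derives the header a service would send on its own outbound call (same trace ID, new span ID). The function names are hypothetical, and the regex covers only the version `00` layout; in practice the OpenTelemetry SDK does this for you.

```python
import re
import secrets

# traceparent layout: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)


def parse_traceparent(header):
    """Return the header fields as a dict, or None if the value is invalid."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # The spec forbids version ff and all-zero trace or parent IDs.
    if fields["version"] == "ff":
        return None
    if set(fields["trace_id"]) == {"0"} or set(fields["parent_id"]) == {"0"}:
        return None
    return fields


def child_traceparent(incoming):
    """Build the header for an outbound call: same trace ID, new span ID."""
    fields = parse_traceparent(incoming)
    if fields is None:
        return None
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    return f"00-{fields['trace_id']}-{new_span_id}-{fields['flags']}"
```

If a downstream service logs a trace ID that differs from the one its caller sent, the context was dropped somewhere between the two. For queue consumers the same check applies to the message metadata rather than the HTTP headers.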
Alert firing but no corresponding data visible in the Grafana dashboard
Immediate action
Query the raw metric directly from the Prometheus API to confirm the metric name and label set match what the dashboard is using
Commands
curl -s 'http://prometheus:9090/api/v1/query?query=forge_orders_success_total' | python -m json.tool
curl -s 'http://prometheus:9090/api/v1/targets' | python -c "import sys,json; [print(t['labels']['job'], t['health'], t['lastError']) for t in json.load(sys.stdin)['data']['activeTargets']]"
Fix now
Align the metric name and label selectors in both the alert rule and the dashboard panel. A common mismatch is the _total suffix that Prometheus client libraries append to counter names, which the dashboard query omits.
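The second one-liner above is easier to reuse as a small function. This sketch parses a `/api/v1/targets` response body and returns only the targets that are not healthy. The function name `unhealthy_targets` is hypothetical, and `SAMPLE_RESPONSE` is made-up data shaped like the real response, not output from an actual server.

```python
import json


def unhealthy_targets(targets_response):
    """List (job, instance, last_error) for every active target that is not up."""
    payload = json.loads(targets_response)
    problems = []
    for target in payload["data"]["activeTargets"]:
        if target.get("health") != "up":
            labels = target.get("labels", {})
            problems.append(
                (labels.get("job"), labels.get("instance"), target.get("lastError", ""))
            )
    return problems


# Hypothetical sample shaped like the /api/v1/targets response body.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "forge-api", "instance": "forge-api:8080"},
             "health": "down", "lastError": "connection refused"},
            {"labels": {"job": "forge-web", "instance": "forge-web:8080"},
             "health": "up", "lastError": ""},
        ]
    },
})

for job, instance, error in unhealthy_targets(SAMPLE_RESPONSE):
    print(job, instance, error)
```

An empty result with a still-empty dashboard suggests the scrape is fine and the mismatch is in the metric name or label selectors.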
Cardinality explosion in Prometheus: memory spiking, queries slowing, OOM restarts
Immediate action
Identify which metric has the most time series and which label is causing the explosion
Commands
curl -s 'http://prometheus:9090/api/v1/status/tsdb' | python -m json.tool | head -40
curl -s 'http://prometheus:9090/api/v1/label/__name__/values' | python -c "import sys,json; d=json.load(sys.stdin); print('Total metric names:', len(d['data']))"
Fix now
Remove high-cardinality labels (user_id, request_id, session_id) from the metric definition immediately. Use exemplars to link specific metric observations to a trace ID without creating a new time series per request. Restart Prometheus after removing the label to free memory — TSDB will compact on the next cycle.
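The arithmetic behind a cardinality explosion is worth seeing once. The worst-case number of time series for a metric is the product of the distinct values of each label, and a histogram multiplies that again by its bucket count. The label names and counts below are hypothetical.

```python
from math import prod


def series_count(label_cardinalities):
    """Upper bound on series for one metric: product of distinct label values."""
    return prod(label_cardinalities.values())


def histogram_series_count(label_cardinalities, bucket_count):
    """A histogram exposes one series per bucket (including +Inf) plus _sum and _count."""
    return series_count(label_cardinalities) * (bucket_count + 2)


# Hypothetical label sets for an HTTP request counter.
safe_labels = {"method": 5, "status": 6, "endpoint": 20}
risky_labels = {**safe_labels, "user_id": 100_000}

print(series_count(safe_labels))    # 600
print(series_count(risky_labels))   # 60000000
print(histogram_series_count(safe_labels, bucket_count=10))  # 7200
```

The product is an upper bound, since not every label combination occurs in practice, but a single unbounded label is enough to turn hundreds of series into millions.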
Monitoring vs Observability
| Feature | Monitoring (Traditional) | Observability (Modern) |
| --- | --- | --- |
| Primary Goal | Track health of predefined, known metrics; you decide in advance what to measure | Understand internal system state from external telemetry, including failure modes you did not anticipate |
| Approach | Symptom-based: Is it up? Is error rate below threshold? Did this counter cross a line? | Context-based: Why is this specific request slow? Which downstream service introduced the latency? What changed between yesterday and today? |
| Tooling | Nagios, basic dashboards, threshold alerts on raw metrics | Prometheus, OpenTelemetry, Jaeger or Tempo, Grafana with correlated trace/log/metric views |
| User Experience | Alerts when a predefined threshold is crossed; tells you something is wrong | Visualizes the complete lifecycle of a request; tells you what went wrong, where, and why |
| Data Style | Aggregated snapshots; you lose per-request detail in favor of efficient storage | High-fidelity traces and structured event logs; per-request context is preserved and queryable |
| Unknown Unknowns | Cannot detect failure modes you did not build a dashboard for | Rich telemetry allows ad-hoc queries during incidents, answering questions you did not anticipate |
| Incident Response | Tells you an SLA threshold was crossed; investigation starts from scratch | Provides a navigable path from alert → trace → log → root cause |

Key takeaways

1. Monitoring tells you when a system is failing; observability provides the diagnostic tools to understand why. You need both: monitoring without observability leaves you with an alarm but no investigation path; observability without monitoring leaves you with data but no trigger to look at it.
2. The 3 Pillars (Metrics, Logs, and Traces) must be unified with a common Correlation ID (trace ID) to function as an observability system rather than three disconnected dashboards. The trace ID in your log lines is the most important field you can instrument.
3. Adopt OpenTelemetry as your instrumentation standard to avoid vendor lock-in. Auto-instrumentation agents cover 90% of JVM instrumentation with zero code changes: the investment is in configuration and rollout, not in rewriting your application.
4. Monitor the 4 Golden Signals (Latency, Traffic, Errors, and Saturation) as your primary alert targets for every service. System resource metrics (CPU, memory) are diagnostic signals, not primary alert targets.
5. Averages hide tail latency in every right-skewed distribution, and your latency distributions are always right-skewed. Always instrument P95 and P99 as first-class metrics. Alert on P99 thresholds, not averages. Use histogram metrics and histogram_quantile() in PromQL, not summaries or gauges.
6. Cardinality is the silent killer of Prometheus. One high-cardinality label (user_id, order_id, session_id) on a high-traffic metric can consume gigabytes of Prometheus memory within hours and cause an OOM restart loop. Review label sets in code review with the same rigor you apply to database index design.
7. Log retention without a tiered strategy is an operational tax that compounds over time. Implement retention tiers by log level from day one: DEBUG disabled in production, INFO sampled at 10–20% for high-volume services, ERROR retained at full fidelity for 90 days. Archive cold logs to object storage rather than paying hot storage prices for data you query once a quarter.
8. Synthetic monitoring verifies SLA uptime from backbone infrastructure. Real User Monitoring captures actual experience from real ISP connections and real devices. Use both. Only RUM detects regional ISP outages: synthetic probes bypass consumer ISP last-mile networks entirely.

Common mistakes to avoid

5 patterns

Dashboard overload — 50 graphs with no hierarchy, all treated as equally important

Symptom
Alert fatigue sets in within weeks of launch. Engineers stop looking at dashboards because nothing on them is clearly actionable. Critical alerts drown in a sea of non-critical panels showing system metrics that are within normal ranges. During an actual incident, nobody knows which panel to look at first.
Fix
Build a 3-tier dashboard hierarchy: Executive tier shows SLA and SLO status — green or red, nothing else. Service tier shows the 4 Golden Signals per service — latency, traffic, errors, saturation. Debug tier shows detailed per-request traces, slow query logs, and dependency health. Alert only on tier-1 SLO violations. CPU and memory panels belong on the debug tier, not on the primary service dashboard that engineers check first during an incident.

Missing the trace — metrics and logs implemented, distributed tracing skipped as 'too complex'

Symptom
During an outage, engineers spend hours manually correlating log timestamps across three or four services to reconstruct what happened to a failing request. Incident timelines are guesses. Post-mortems are based on partial information. The same failure mode recurs because the root cause was never fully understood.
Fix
Implement OpenTelemetry with automatic instrumentation as early as possible, ideally before the first production deployment rather than as post-incident remediation. Propagate W3C Trace Context headers (traceparent) across all HTTP calls and message queue operations. Include the trace ID in all log statements via MDC. The auto-instrumentation agent for most JVM frameworks handles 90% of this with zero code changes, so the investment is in configuration, not implementation.
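The "trace ID in every log line" idea is the same in any runtime. On the JVM it is MDC; the sketch below shows a rough Python analogue using `contextvars` and a logging filter. It is an illustration of the pattern under that assumption, not how the OpenTelemetry agent does it, and the names `current_trace_id`, `TraceIdFilter`, and `build_logger` are hypothetical.

```python
import contextvars
import io
import logging

# Holds the trace ID of the request currently being handled ("-" when none).
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copy the current trace ID onto every record so the formatter can print it."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True


def build_logger(stream):
    """Create a demo logger whose lines always carry a trace_id field."""
    logger = logging.getLogger("forge.demo")
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
    )
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = build_logger(buffer)

# Request middleware would set this from the incoming traceparent header.
token = current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
log.info("order accepted")
current_trace_id.reset(token)

print(buffer.getvalue(), end="")
```

With the field in place, finding every log line for one failing request becomes a single query on `trace_id` instead of a timestamp comparison across services.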

Measuring system resources instead of business outcomes — optimizing for what is easy to measure, not what matters

Symptom
CPU sits at 30%, memory is comfortable, all infrastructure dashboards are green. Meanwhile, checkout success rate has dropped to 60% and the business is losing thousands of dollars per minute. Engineers see a healthy system. The product team sees a broken product. The disconnect erodes trust in the engineering team's ability to monitor what actually matters.
Fix
Define SLOs based on user-facing outcomes first: checkout success rate, P99 login latency, payment processing completion rate. These are your primary alert targets. System resource metrics (CPU, memory, disk I/O) are secondary signals that help explain why a user-facing SLO degraded — they are diagnostic tools, not alert targets.

Using averages as the primary latency metric in dashboards and alerts

Symptom
Average latency is 120ms and alerts are tuned to fire at 500ms average. A single slow replica serves 1% of traffic at 30-second response times. No alerts fire. The 30-second requests belong to real users who do not retry gracefully — they abandon the flow. The business sees elevated cart abandonment. Engineering sees a healthy average.
Fix
Replace average latency as a primary dashboard metric with P95 and P99 histograms. Set alerts on P99 thresholds — a reasonable starting point is alert when P99 exceeds 3x the baseline P99 for your endpoint, sustained for 3 minutes. Use histogram metric types in your instrumentation library (not gauges or summaries) so that PromQL histogram_quantile() can compute true percentiles from the raw bucket data.
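A quick calculation shows how far apart the mean and the tail can sit. The traffic mix below is hypothetical and deliberately different from the incident numbers: 98% of requests at 80 ms and 2% at 5 s. The `percentile` helper uses the simple nearest-rank definition, which is an assumption of this sketch rather than what any particular monitoring backend computes.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical traffic: 10,000 requests, 2% of them served by a slow replica.
fast = [0.080] * 9_800   # seconds
slow = [5.0] * 200       # seconds
latencies = fast + slow

mean = sum(latencies) / len(latencies)
p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)

print(f"mean={mean:.3f}s p50={p50:.3f}s p99={p99:.3f}s")
```

The mean lands near 0.178 s, comfortably under a 500 ms average-based alert, while P99 is 5 s. An alert on the mean stays silent; an alert on P99 fires.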

No retention or sampling strategy for logs — treating all logs as equally valuable

Symptom
Elasticsearch or Loki runs out of disk every week on a predictable schedule. Log ingestion and storage costs exceed the application hosting costs by a factor of two. Engineers cannot query logs older than 3 days because the cluster degrades under the write load. Adding more disk is a recurring operational tax that solves nothing structurally.
Fix
Implement tiered retention based on log level and operational value: DEBUG logs retained 24 hours (better yet, disabled entirely in production unless explicitly enabled for a specific service during active debugging), INFO logs retained 7 days with 10–20% sampling on high-volume services, WARN logs retained 30 days, ERROR logs retained 90 days at full fidelity. Archive cold logs to object storage (S3 or GCS with lifecycle policies) for long-term compliance requirements. Implement the tiered strategy at the collection agent (Fluentd or Vector) so the decision is made before data hits your storage backend.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR: You have a service where average latency is 100ms but P99 is 5 seconds. ...
Q02 · SENIOR: Explain the 'Unknown-Unknowns' concept in the context of observability. ...
Q03 · SENIOR: Describe how a distributed trace propagates through a microservices arch...
Q04 · SENIOR: What is the trade-off between structured JSON logs and plain text logs i...
Q05 · SENIOR: How do you implement Synthetic Monitoring versus Real User Monitoring (R...
Q06 · SENIOR: Scenario: A database query is intermittently slow. How would you instrum...

Q01 of 06 · SENIOR

You have a service where average latency is 100ms but P99 is 5 seconds. What does this tell you about the system, and how would you use observability to find the bottleneck?

ANSWER
A P99 of 5s with a 100ms average is a heavily right-skewed distribution: the vast majority of requests are fast, but a small fraction are catastrophically slow. That pattern points to a specific class of failure modes: a resource that is occasionally unavailable or slow, not a general degradation affecting all requests.

My first step is to check whether the slow requests share a common dimension. Using distributed tracing in Jaeger or Tempo, I would filter for traces in the 95th–100th percentile duration and look for patterns: Do they all hit the same downstream service? The same database instance? The same cache shard? Are they larger payloads? Do they cluster at specific times?

Common root causes for this pattern:

1. A single slow replica behind a round-robin load balancer: check per-instance latency metrics broken out by the instance label. One instance will show dramatically higher P99 than the others.
2. JVM garbage collection pauses: correlate the P99 spike timestamps with GC log events. GC pauses show up as sharp latency spikes with full recovery between them.
3. Cold cache misses on a small fraction of requests: check cache hit rate during slow periods. First-access requests pay the database cost; subsequent requests are served from cache.
4. Database connection pool exhaustion: slow requests may be waiting for a connection to become available, not actually executing a slow query. Check connection pool wait time metrics separately from query execution time metrics.

For instrumentation, I would ensure the histogram metric for each downstream call is present and properly labeled so that histogram_quantile(0.99, rate(call_duration_seconds_bucket{service="payment"}[5m])) gives me per-dependency P99 values. That query alone usually points at the culprit within a few minutes of looking.
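It helps to know what `histogram_quantile()` is doing with those buckets. The sketch below is a simplified Python version of the linear interpolation it applies to classic histograms: find the bucket that contains the target rank, then assume observations are spread evenly inside it. It ignores native histograms and the edge-case handling Prometheus performs, and the bucket counts are hypothetical.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Requires 0 < q <= 1, at least one finite bucket, and a final +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for i, (bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[i - 1][0]
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical cumulative bucket counts for call_duration_seconds (le, count).
BUCKETS = [
    (0.1, 9_000),
    (0.5, 9_800),
    (1.0, 9_850),
    (5.0, 9_990),
    (math.inf, 10_000),
]

print(round(histogram_quantile(0.50, BUCKETS), 4))
print(round(histogram_quantile(0.99, BUCKETS), 4))
```

Because the result is interpolated, its accuracy depends on the bucket boundaries. A P99 that falls inside a wide 1 s to 5 s bucket is only an estimate, so choose boundaries around the latencies your SLOs care about.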
FAQ · 3 QUESTIONS

Frequently Asked Questions

01
Can I have observability without monitoring?
02
What are the four Golden Signals of SRE?
03
Why is high cardinality a problem in observability?