
Monitoring vs Observability Explained — The Complete DevOps Guide

Monitoring and observability demystified for DevOps engineers.
⚙️ Intermediate — basic DevOps knowledge assumed
In this tutorial, you'll learn
  • Monitoring tells you when a system is failing; observability provides the diagnostic tools to understand why. You need both — monitoring without observability leaves you with an alarm but no investigation path; observability without monitoring leaves you with data but no trigger to look at it.
  • The 3 Pillars — Metrics, Logs, and Traces — must be unified with a common Correlation ID (trace ID) to function as an observability system rather than three disconnected dashboards. The trace ID in your log lines is the most important field you can instrument.
  • Adopt OpenTelemetry as your instrumentation standard to avoid vendor lock-in. Auto-instrumentation agents cover 90% of JVM instrumentation with zero code changes — the investment is in configuration and rollout, not in rewriting your application.
Quick Answer
  • Monitoring watches predefined metrics and alerts when thresholds are breached — it answers 'WHAT is broken'
  • Observability lets you ask arbitrary questions about system state from external telemetry — it answers 'WHY it broke'
  • The 3 pillars are Metrics (aggregated numeric signals), Logs (per-event detail), and Traces (request flow across services)
  • Without a shared Correlation ID across all three pillars, debugging cross-service failures is manual timestamp archaeology
  • Prometheus uses pull-based scraping — your app exposes /metrics, Prometheus fetches on a fixed interval
  • High-cardinality labels (User IDs, request IDs) as metric tags will crash your monitoring storage
  • Monitoring is a subset of observability — you need both, and they serve different cognitive roles during an incident
🚨 START HERE
Observability Debug Cheat Sheet
When your monitoring or observability pipeline is failing, run these checks in order. Each step is designed to narrow the failure surface before you touch any configuration.
🟡 Prometheus target is DOWN
Immediate Action: Verify the /metrics endpoint is reachable from inside the Prometheus pod, not from your local machine
Commands
curl -s http://forge-prod-app:8080/actuator/prometheus | head -20
kubectl exec -it prometheus-pod -- wget -qO- http://forge-prod-app:8080/actuator/prometheus
Fix Now: Check NetworkPolicy objects, service endpoints, and that the application container is binding to 0.0.0.0, not 127.0.0.1. Verify the metrics_path in the Prometheus scrape config matches the actual endpoint.
🟡 Missing traces in Jaeger or Tempo — spans appear incomplete or absent
Immediate Action: Verify trace context propagation headers are present on outbound requests and that the OTel exporter endpoint is reachable
Commands
curl -v http://forge-api:8080/api/orders 2>&1 | grep -i 'traceparent\|trace\|b3'
kubectl logs deployment/forge-api --tail=50 | grep -i 'trace\|span\|otel\|exporter'
Fix Now: Ensure the OpenTelemetry agent or SDK is initialized before HTTP client construction. Confirm OTEL_EXPORTER_OTLP_ENDPOINT is set correctly in the deployment environment variables. For async consumers, verify trace headers are read from message metadata.
🟡 Alert firing but no corresponding data visible in the Grafana dashboard
Immediate Action: Query the raw metric directly from the Prometheus API to confirm the metric name and label set match what the dashboard is using
Commands
curl -s 'http://prometheus:9090/api/v1/query?query=forge_orders_success_total' | python -m json.tool
curl -s 'http://prometheus:9090/api/v1/targets' | python -c "import sys,json; [print(t['labels']['job'], t['health'], t['lastError']) for t in json.load(sys.stdin)['data']['activeTargets']]"
Fix Now: Align the metric name and label selectors in both the alert rule and the dashboard panel. A common mismatch is the _total suffix that Prometheus client libraries append to counters but the dashboard query omits.
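To see the suffix mismatch concretely, here is a minimal sketch using the Python prometheus_client (the metric name is illustrative): the counter is registered without a suffix, but the exposition format appends _total, so a query written against the bare name silently matches nothing.

```python
from prometheus_client import Counter, CollectorRegistry, generate_latest

registry = CollectorRegistry()
# Registered WITHOUT a suffix — the client library appends _total on export
orders = Counter("forge_orders_success", "Successful orders", registry=registry)
orders.inc()

exposition = generate_latest(registry).decode()
print(exposition)
# The sample is exposed as forge_orders_success_total, not forge_orders_success,
# so an alert rule or panel querying the bare name returns no data
```

The same behavior applies to Micrometer counters on the JVM, which is why the dashboard and the alert rule must both use the suffixed name.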
🔴 Cardinality explosion in Prometheus — memory spiking, queries slowing, OOM restarts
Immediate Action: Identify which metric has the most time series and which label is causing the explosion
Commands
curl -s 'http://prometheus:9090/api/v1/status/tsdb' | python -m json.tool | head -40
curl -s 'http://prometheus:9090/api/v1/label/__name__/values' | python -c "import sys,json; d=json.load(sys.stdin); print('Total metric names:', len(d['data']))"
Fix Now: Remove high-cardinality labels (user_id, request_id, session_id) from the metric definition immediately. Use exemplars to link specific metric observations to a trace ID without creating a new time series per request. Restart Prometheus after removing the label to free memory — the TSDB will compact on the next cycle.
Production Incident: Black Friday P99 Latency Explosion — Monitoring Showed Green, Observability Found the Root Cause
Average latency was 120ms (within SLA), dashboards were green, but 1% of checkout requests took 30+ seconds. Monitoring missed it entirely because it only tracked averages. Distributed tracing revealed a single slow database replica serving all P99 traffic — a failure mode the team had never thought to build a dashboard for.
Symptom: Customer support was flooded with 'checkout stuck' reports within 20 minutes of peak traffic starting. The average latency dashboard showed 120ms — well within the 500ms SLA. No alerts fired. Engineers looked at standard dashboards for 45 minutes and saw nothing visibly wrong. The gap between what the dashboards said and what users were experiencing was total.
Assumption: The team assumed it was a frontend rendering issue or a CDN cache invalidation problem, since all backend dashboards showed healthy. They spent 30 minutes investigating client-side JavaScript performance and CDN configuration before anyone looked at the access logs and noticed that a specific subset of requests was consistently timing out at exactly 30 seconds — the database query timeout ceiling.
Root cause: A single read replica had developed a slow disk I/O issue after a storage volume resize operation earlier that day. The load balancer's round-robin routing sent approximately 1% of read queries to this replica. Because 99% of requests were fast, the average latency metric never deviated from baseline. Only P99 and P999 metrics revealed the tail latency problem — and those metrics did not exist on any dashboard. The monitoring system covered average latency and error rate. It had never been asked to track percentiles. The slow replica was not returning errors, so the error rate metric also stayed flat. Every signal the team had configured to watch was green while 1% of users experienced a 30-second checkout hang.
Fix: Added P95 and P99 latency histogram panels to the primary checkout service dashboard as first-class, always-visible metrics rather than buried secondary panels. Reconfigured alerting to fire on P99 > 2s sustained for 3 minutes, completely independent of average latency. Added a distributed trace sampler at 1% of request volume flowing into Jaeger, which immediately revealed the replica routing pattern when replayed against the incident window. Implemented load balancer health checks that included actual query response time probes against each replica, not just TCP connectivity checks — a TCP connection can succeed while the replica is processing queries at 10x normal latency.
Key Lesson
  • Averages are a lie in skewed distributions. A single slow replica behind a load balancer is invisible in average latency unless you are actively tracking percentiles. Always monitor P95 and P99 as first-class metrics, not afterthoughts.
  • If your alerting only covers averages and error rates, you are systematically blind to the failures that affect your highest-value users — the ones completing large orders, the enterprise accounts, the users on slow connections who already have the least tolerance for performance degradation.
  • Distributed tracing is the only reliable mechanism to attribute tail latency to a specific downstream component. Without it, you are correlating timestamps across log files by hand during the worst moments of an incident.
  • Load balancer health checks that only test TCP connectivity are not health checks. They are port checks. If a replica is accepting connections but taking 25 seconds to respond, a TCP health check will mark it as healthy every single time. Health checks must test actual application response time.
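The arithmetic behind "averages are a lie" is worth making concrete. A hypothetical sketch of the incident's request distribution, with the numbers from this postmortem:

```python
# 1,000 checkout requests: 99% take 120ms, 1% hang until the 30s query timeout
latencies = [0.120] * 990 + [30.0] * 10

mean = sum(latencies) / len(latencies)
ranked = sorted(latencies)
p99 = ranked[990]  # slower than 99% of the 1,000 samples

print(f"mean = {mean:.3f}s")  # 0.419s — comfortably inside a 500ms SLA
print(f"p99  = {p99:.1f}s")   # 30.0s — the hang that flooded support
```

A 250x gap between the mean and the tail, and the only dashboard metric in place during the incident was the mean.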
Production Debug Guide: Symptom → Action mapping for common observability failures
Alert fires but all dashboards show everything green
Do not dismiss the alert because dashboards look fine — that discrepancy is itself diagnostic information. Check whether the alert is based on a different metric name or time aggregation than the dashboard panel. Verify the alert evaluation window: a 1-minute alert window may have already auto-resolved by the time you open the dashboard with its default 15-minute view. Check for clock skew between the alerting system and the metric source — a 30-second clock difference between Prometheus and an application instance can cause apparent mismatches. Pull the raw metric value from the Prometheus API directly for the timestamp when the alert fired and compare it to the dashboard panel rendering.
Cannot trace a request across multiple services — trace appears incomplete or broken
The most common cause is a single service in the chain failing to propagate trace context headers. Verify that W3C Trace Context headers (traceparent and tracestate) or B3 headers are present on every outbound HTTP call, including calls made by async workers and message queue consumers. Check that the OpenTelemetry SDK is initialized before any HTTP client beans are constructed — a common Spring Boot mistake is initializing the HTTP client in a @PostConstruct method before the tracing agent attaches. For message queue propagation, confirm that trace headers are being written to and read from message metadata, not just HTTP headers.
Prometheus targets showing as DOWN in the UI despite the application running
Start by testing reachability from the Prometheus server itself, not from your workstation — network policies and Kubernetes service mesh rules frequently allow traffic from some sources and block it from others. Use curl from inside the Prometheus pod to hit the /metrics endpoint directly. Check Kubernetes NetworkPolicy objects for rules that might block Prometheus's egress to application pods. Verify the metrics_path in the scrape config matches the actual endpoint path — Spring Boot Actuator exposes /actuator/prometheus, not /metrics. Check that the application is binding to 0.0.0.0 and not 127.0.0.1, which is a common misconfiguration in containerized environments.
Log volume exploded overnight and Elasticsearch or Loki is rejecting writes
Identify the top logging service by volume before you do anything else — killing the wrong log stream wastes time. Query your log aggregation platform for volume by service and log level over the last hour. Look for debug-level logging left enabled in production, which is the most frequent cause of sudden log volume spikes after a deployment. Check for a logging loop: a service logging every retry attempt in an exponential backoff loop will generate thousands of log lines per second. Implement log sampling immediately as a mitigation: keep 100% of ERROR and WARN, sample 10% of INFO, and drop DEBUG entirely until the root cause is resolved. Add per-service log rate limits at the collection agent layer (Fluentd or Logstash) as a permanent backstop.
Grafana dashboards load slowly, timeout, or show query errors
High-cardinality PromQL queries are the most common cause and should be checked first. Open the Grafana query inspector for the slow panel and look at the raw PromQL — any query that includes a label with high uniqueness (user_id, order_id, session_id) will cause Prometheus to scan enormous amounts of data. Reduce the dashboard time range as an immediate workaround. For expensive queries that legitimately need to run, create recording rules that pre-aggregate the result on a schedule — dashboards then query the pre-aggregated series instead of computing it at render time. If dashboards are slow across the board on long time ranges, evaluate Thanos or Grafana Mimir for long-term metric storage with query federation.
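For the broken-trace symptom above, the W3C traceparent header that every hop must forward has a fixed shape. A sketch of parsing and rebuilding it (the trace ID is the same example value used elsewhere in this guide):

```python
import secrets

# W3C Trace Context: traceparent = version-trace_id-parent_span_id-flags
traceparent = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
version, trace_id, parent_span_id, flags = traceparent.split("-")

assert len(trace_id) == 32 and len(parent_span_id) == 16  # hex-encoded

# A propagating service keeps the trace_id, mints a fresh span id for its
# own unit of work, and sends the rebuilt header on every outbound call —
# HTTP headers AND message-queue metadata alike
outbound = f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"
print(outbound)
```

Any service in the chain that drops this header (or forwards it unchanged without minting a new span ID) is where the trace breaks.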

Your production system crashes at 2 AM on Black Friday. Orders are failing, users are screaming on Twitter, and your on-call engineer is staring at a wall of dashboards wondering where to even begin. This scenario plays out every day at companies around the world — and the difference between a 10-minute fix and a 4-hour outage almost always comes down to one thing: how well-instrumented your system was before the incident.

Monitoring and observability are not luxury features you bolt on after launch. They are the engineering discipline that separates teams who fix incidents in minutes from teams who spend hours in Slack threads and frantic Zoom calls pointing fingers at each other's services. While monitoring is the act of watching known-knowns through dashboards and alerts, observability is the property of a system that allows you to understand its internal state from the external data it emits — logs, metrics, and traces. The operative word is property. Observability is not a tool you buy. It is a quality you engineer into your system from the start.

The critical misconception I see constantly at staff level: treating monitoring and observability as synonyms. They are not. Monitoring is a subset of observability. You can monitor without observability — you get dashboards with no context, alerts that fire with no actionable detail, and on-call engineers who solve incidents through gut instinct and luck. You cannot have meaningful observability without some form of monitoring — you still need alerts to know when to look at anything. The two are complementary, not competing. Get both right and your MTTR drops from hours to minutes. Get only one right and you are still flying partially blind.

The Three Pillars: Metrics, Logs, and Traces

True observability requires all three types of telemetry working together, not independently. Metrics provide a high-level, aggregated view of system health — request rate, error rate, CPU saturation. Logs provide the per-event forensic record of exactly what happened at a specific moment in time. Distributed traces connect everything by showing the complete path of a single request across every service it touched, with timing for each hop.

Each pillar serves a distinct cognitive role in an incident, and understanding those roles prevents the common mistake of trying to use one pillar for a job another pillar does better. Metrics are your early warning system — they tell you THAT something changed, and they tell you fast. A metric alert can fire within 30 seconds of a threshold being crossed. Logs are your forensic evidence — they tell you exactly WHAT happened at a specific request or event level, but querying them at scale is expensive and slow compared to a metric query. Traces are your request GPS — they tell you WHERE a request traveled and HOW LONG each individual service spent processing it. The power of the three-pillar model is not in any single pillar; it is in the correlation between them.

The correlation ID is the connective tissue. When a P99 latency alert fires (metric), you need to jump directly to the traces for the slow requests and then to the logs for the specific error messages those traces contain. If the trace ID is not present in your logs, that jump requires manual timestamp correlation — which is slow, error-prone, and genuinely miserable to do at 2 AM during an active incident. Instrument for correlation from day one, not as a retrofit after the first major outage.
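The MDC pattern is JVM-specific, but the idea ports anywhere. A minimal sketch of the same trace-ID injection using Python's stdlib logging (names are illustrative), where a context variable plays the role of MDC:

```python
import contextvars
import io
import json
import logging

# Python analogue of the JVM's MDC: a context variable carrying the trace id
trace_id_var = contextvars.ContextVar("trace_id", default="missing")

class TraceIdFilter(logging.Filter):
    def filter(self, record):
        record.trace_id = trace_id_var.get()  # stamp every record
        return True

buf = io.StringIO()
handler = logging.StreamHandler(buf)
handler.addFilter(TraceIdFilter())
handler.setFormatter(logging.Formatter(
    '{"level":"%(levelname)s","traceId":"%(trace_id)s","message":"%(message)s"}'))

log = logging.getLogger("forge.orders")
log.addHandler(handler)
log.setLevel(logging.INFO)

# in real code this would come from the active span, not a literal
trace_id_var.set("4bf92f3577b34da6a3ce929d0e0e4736")
log.info("Order processing started")

entry = json.loads(buf.getvalue())
print(entry["traceId"])  # the same id a tracing backend shows for this request
```

Because the filter runs on every record, application code never has to remember to include the trace ID — exactly the property you want under incident pressure.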

One more thing that gets underweighted: the 'unknown unknowns' problem. Metrics cover what you thought to instrument. Logs cover what you thought to log. Traces cover what you thought to trace. Observability is the property that lets you ask questions you did not anticipate when you wrote the code — because the raw telemetry is rich enough to answer novel queries. A system where you can only ask questions you pre-configured dashboards for is a monitored system. A system where you can compose a new query during an incident and get a meaningful answer is an observable system.

io/thecodeforge/monitoring/OrderMonitor.java · JAVA
package io.thecodeforge.monitoring;

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.opentelemetry.api.trace.Span;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

/**
 * io.thecodeforge — Three-Pillar Instrumentation Pattern
 *
 * Demonstrates all three observability pillars in a single operation:
 *   - Metric: forge.orders.success (counter) and forge.payment.latency (histogram timer)
 *   - Log:    Structured log line with traceId injected from active OpenTelemetry span
 *   - Trace:  Active span context is propagated by the OTel agent — no manual span creation needed
 *
 * Key design decisions:
 *   1. traceId comes from the active OTel Span — it is the same ID visible in Jaeger/Tempo
 *   2. MDC injection makes traceId appear in EVERY log line within this thread scope
 *   3. The Timer wraps the entire business operation — it measures wall clock time including I/O
 *   4. Counter uses .tag("status", "success") so failures can be counted separately:
 *      forge.orders.processed{status="failure"} without creating a separate metric
 */
public class OrderMonitor {

    private static final Logger log = LoggerFactory.getLogger(OrderMonitor.class);

    private final Counter orderSuccessCounter;
    private final Counter orderFailureCounter;
    private final Timer paymentTimer;

    public OrderMonitor(MeterRegistry registry) {
        // Use tagged counters so success and failure are queryable separately in PromQL
        // rate(forge.orders.processed_total{status="failure"}[5m]) gives failure rate
        this.orderSuccessCounter = registry.counter("forge.orders.processed", "status", "success");
        this.orderFailureCounter = registry.counter("forge.orders.processed", "status", "failure");

        // Timer automatically produces _count, _sum, and _bucket metrics
        // Enables histogram_quantile(0.99, rate(forge_payment_latency_seconds_bucket[5m]))
        this.paymentTimer = registry.timer("forge.payment.latency");
    }

    public void processOrder(String orderId) {
        // Inject the active OpenTelemetry trace ID into MDC before any log statements
        // This ensures EVERY log line within this method includes the traceId automatically
        String traceId = Span.current().getSpanContext().getTraceId();
        MDC.put("traceId", traceId);
        MDC.put("orderId", orderId);

        try {
            paymentTimer.record(() -> {
                // log.info output will include traceId and orderId from MDC automatically
                // Logback/Log4j2 pattern: %X{traceId} %X{orderId} %msg
                log.info("Order processing started");

                // ... business logic: validate, reserve inventory, charge payment ...

                orderSuccessCounter.increment();
                log.info("Order processing completed successfully");
            });
        } catch (Exception e) {
            orderFailureCounter.increment();
            // ERROR logs include traceId from MDC — jump directly to the trace in Jaeger
            log.error("Order processing failed — check trace for upstream dependency state [errorType={}]",
                e.getClass().getSimpleName(), e);
            throw e;
        } finally {
            // Always clear MDC — thread pool reuse will leak context if you skip this
            MDC.clear();
        }
    }
}

/*
 * Example log output (JSON format via logstash-logback-encoder):
 * {
 *   "level":   "INFO",
 *   "logger":  "io.thecodeforge.monitoring.OrderMonitor",
 *   "message": "Order processing completed successfully",
 *   "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",   <-- same ID in Jaeger/Tempo
 *   "orderId": "ORD-101"
 * }
 *
 * Metrics emitted (visible in Prometheus):
 *   forge_orders_processed_total{status="success"} 1.0
 *   forge_payment_latency_seconds_count 1.0
 *   forge_payment_latency_seconds_sum 0.043
 *   forge_payment_latency_seconds_bucket{le="0.05"} 1.0
 */
▶ Output
{
"level": "INFO",
"message": "Order processing completed successfully",
"traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
"orderId": "ORD-101"
}
forge_orders_processed_total{status="success"} 1.0
forge_payment_latency_seconds_bucket{le="0.05"} 1.0
Mental Model
The Three Pillars as a Debugging Funnel
Think of the three pillars as a progressive narrowing of the search space during an incident. You enter at metrics, narrow to a service with traces, and land on the exact event with logs. Each pillar answers a different question in that sequence — skipping any one of them forces you to do the narrowing manually.
  • Metrics are cheap to collect and store — start here for every new service. They are your smoke alarm: they tell you something changed, fast, without requiring you to read individual events.
  • Logs are expensive at scale but irreplaceable for root cause analysis. Use structured JSON logs from day one — if your logs are not queryable by trace ID in under one second, you do not have observability, you have a text file archive.
  • Traces bridge the gap — they connect a metric anomaly (P99 spiked) to the specific service and span (payment-service.charge() took 4.8s) without manual timestamp correlation across log files.
  • The trace ID in your logs is the most important field you can add. It is the hyperlink between a log line and the full request journey. Without it, cross-service debugging during an incident is archaeological work.
  • The unknown unknowns problem is only solvable when all three pillars are present and correlated. Predefined dashboards cover what you anticipated. Rich correlated telemetry lets you ask questions you did not anticipate — which is exactly what novel failure modes require.
📊 Production Insight
Structured JSON logs cost 2–3x more storage than plain text but enable sub-second field-level queries in Elasticsearch or Loki.
Without structured fields, finding all log lines for a specific trace ID in 10TB of log data requires a full-text scan — that takes minutes under load and often times out entirely. With JSON and a proper field index, the same query completes in milliseconds regardless of log volume.
Rule: if your on-call team cannot jump from a Grafana alert to the relevant log lines in under 30 seconds, your logging is not serving its purpose during incidents. That 30-second jump requires structured logs, a trace ID field, and an index on that field. It is not a nice-to-have; it is the baseline for operational effectiveness.
🎯 Key Takeaway
Metrics tell you WHEN something broke. Traces tell you WHERE. Logs tell you WHY.
The Correlation ID — specifically the trace ID injected into every log line from the active OpenTelemetry span — is the glue that makes the three pillars an observability system rather than three separate dashboards. Instrument for it from the start. Retrofitting it into an existing codebase is painful and incomplete by nature.
Choosing Which Pillar to Invest In First
If: No alerting exists at all and the team is flying blind in production
Use: Start with metrics — implement the 4 Golden Signals (latency, traffic, errors, saturation) for every service. Use Micrometer for JVM services, OpenTelemetry SDKs for everything else.
If: Alerts fire regularly but root cause investigation takes hours not minutes
Use: Add distributed tracing with OpenTelemetry auto-instrumentation. The goal is to connect an alert directly to the specific service, endpoint, and downstream dependency that is causing the issue.
If: Traces show the failing service and the slow span, but not the reason for the failure
Use: Add structured JSON logging with the trace ID injected from the active OTel span context via MDC. Ensure ERROR and WARN logs capture enough context (request parameters, upstream service state) to diagnose without re-running the request.
If: All three pillars exist but live in separate tools with no cross-linking
Use: Unify on a single observability platform that supports exemplars and trace-to-log correlation. The Grafana stack (Prometheus + Loki + Tempo) covers this at low cost. Datadog and Honeycomb cover it with less operational overhead at higher license cost.

Observability in Practice: Prometheus and Grafana

Modern observability stacks typically center on a pull-based collection model, and Prometheus is the reference implementation of that model in the open-source world. Instead of your application pushing metrics to a central store on a schedule it controls, Prometheus scrapes a /metrics endpoint on your service at an interval Prometheus controls. Your application is passive — it just maintains counters and histograms in memory and serves them when asked.

The pull model has a meaningful architectural advantage: a misbehaving application cannot overwhelm the monitoring system with a flood of metric writes. If your application starts logging every debug event as a metric push, a push-based system absorbs the explosion and potentially falls over. Prometheus, by contrast, simply scrapes at its configured interval regardless of what the application is doing. The trade-off is the opposite failure mode: if your application crashes between scrapes, you lose the data from that window. For short-lived jobs and batch processes that complete in under one scrape interval, the Pushgateway solves this by holding pushed metrics until the next Prometheus scrape cycle.

Prometheus stores metrics as time series: a metric name, a set of key-value labels, and a sequence of timestamped float64 values. The label set is everything — it is how you slice a metric by service, region, endpoint, or status code. But the label set is also where most production Prometheus problems originate. Every unique combination of label values creates a separate time series in the TSDB. A metric with three labels, each with ten possible values, creates one thousand time series. Add a fourth label with a thousand possible values and you have a million time series for a single metric name. This is the cardinality problem, and it is not theoretical — I have watched it take down a Prometheus instance in production within 20 minutes of a bad deployment that added a user_id label to a high-traffic metric.
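The multiplication is worth spelling out, because cardinality grows multiplicatively rather than additively. A sketch with the label counts from the paragraph above:

```python
# Every unique combination of label values is a separate time series
label_value_counts = {"service": 10, "region": 10, "status": 10}

series = 1
for n in label_value_counts.values():
    series *= n
print(f"3 labels x 10 values each: {series:,} series")

# A well-meaning user_id label with 1,000 observed values multiplies that
series_with_user_id = series * 1_000
print(f"with a user_id label added: {series_with_user_id:,} series")
```

One code-review comment ("does this label need to be here?") is the cheapest cardinality control you will ever apply.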

PromQL is where the real diagnostic power lives. PromQL is not a query language for fetching raw data — it is a functional language for computing derived signals from time series. The most important function in incident response is histogram_quantile(), which computes a true percentile from a histogram metric. histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) gives you the true P99 latency over a 5-minute window — the slowest experience that 1% of your users received. No dashboard based on averages can show you this. Building this query and putting it on your primary service dashboard is a one-time 10-minute investment that has caught more production incidents for me than any other single piece of instrumentation.
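Under the hood, histogram_quantile() interpolates linearly inside the first cumulative bucket whose count reaches the target rank. A simplified sketch of that computation (ignoring rate() windows and edge cases such as the +Inf bucket):

```python
def histogram_quantile(q, buckets):
    """buckets: list of (le, cumulative_count) pairs, sorted by le ascending."""
    target = q * buckets[-1][1]  # rank of the requested quantile
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= target:
            # linear interpolation within this bucket, as Prometheus does
            return prev_le + (le - prev_le) * (target - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return buckets[-1][0]

# 1,000 observations: 900 under 50ms, 90 between 50ms and 500ms, 10 slower
buckets = [(0.05, 900), (0.5, 990), (5.0, 1000)]
print(histogram_quantile(0.99, buckets))  # ~0.5s — the p99 sits at a bucket edge
```

This is also why bucket boundaries matter: the quantile estimate can never be more precise than the bucket it lands in, so put bucket edges near your SLO thresholds.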

prometheus-config.yml · YAML
# io.thecodeforge — Production Prometheus Scrape Configuration
#
# Design decisions explained inline:
#   - scrape_interval: 15s is the standard for most services
#     Reduce to 5s for latency-sensitive paths, but watch Prometheus CPU/memory
#   - honor_labels: false (default) — Prometheus overwrites conflicting labels
#     from the target with its own. Set true only if you trust the target's labels.
#   - relabel_configs allow normalizing instance labels before storage
#     This prevents cardinality drift from inconsistent hostname formats

global:
  scrape_interval: 15s
  evaluation_interval: 15s  # How often alerting rules are evaluated
  # Global labels added to every time series scraped by this instance
  external_labels:
    datacenter: 'us-east-1'
    environment: 'production'

rule_files:
  - 'rules/forge_alerts.yml'   # Alert rules — evaluated every evaluation_interval
  - 'rules/forge_records.yml'  # Recording rules — pre-aggregate expensive PromQL queries

scrape_configs:
  - job_name: 'forge-api-service'
    # Spring Boot Actuator exposes Prometheus metrics at /actuator/prometheus
    # Not /metrics — this is a common misconfiguration that produces DOWN targets
    metrics_path: '/actuator/prometheus'
    scrape_interval: 5s      # Override global for latency-sensitive checkout service
    scrape_timeout: 4s       # Must be less than scrape_interval to avoid overlap
    static_configs:
      - targets:
          - 'forge-prod-app-01:8080'
          - 'forge-prod-app-02:8080'
        labels:
          service: 'order-service'  # Added to every time series from these targets
    relabel_configs:
      # Normalize the instance label to a stable name instead of host:port
      # Prevents cardinality drift if the port changes or hostname format varies
      - source_labels: [__address__]
        regex: '(forge-prod-app-\d+):.*'
        target_label: instance
        replacement: '$1'

  - job_name: 'forge-payment-service'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['forge-payment:8080']
        labels:
          service: 'payment-service'

---
# rules/forge_alerts.yml
# Alert on P99 latency — not average. Average latency alerts miss tail latency problems.
groups:
  - name: forge.latency
    rules:
      - alert: ForgeCheckoutP99LatencyHigh
        expr: |
          histogram_quantile(
            0.99,
            rate(http_server_requests_seconds_bucket{
              job="forge-api-service",
              uri="/api/checkout"
            }[5m])
          ) > 2.0
        for: 3m   # Must be elevated for 3 minutes before firing — reduces noise
        labels:
          severity: critical
          team: platform
        annotations:
          summary: 'Checkout P99 latency exceeds 2s'
          description: |
            P99 latency for /api/checkout is {{ $value | humanizeDuration }}.
            Check Jaeger for slow traces on this endpoint.
            Dashboard: https://grafana.thecodeforge.io/d/checkout-slo

---
# rules/forge_records.yml
# Recording rules pre-aggregate expensive queries so dashboards load instantly
# Without this, a 7-day P99 query runs at render time and takes 30+ seconds
groups:
  - name: forge.recordings
    rules:
      - record: job:http_request_duration_p99:rate5m
        expr: |
          histogram_quantile(
            0.99,
            sum by (job, le) (
              rate(http_server_requests_seconds_bucket[5m])
            )
          )
▶ Output
Prometheus targets updated.
Scraping forge-prod-app-01:8080 at /actuator/prometheus every 5s
Scraping forge-prod-app-02:8080 at /actuator/prometheus every 5s
Alert ForgeCheckoutP99LatencyHigh: inactive
Recording rule job:http_request_duration_p99:rate5m: active
⚠ The Cardinality Bomb — One Label Can Bring Down Prometheus
📊 Production Insight
A Prometheus scrape_interval of 15s means your effective metric resolution is 15 seconds. Any gauge value that spikes and recovers within a single 15-second window between scrapes is invisible — you lose it entirely. (Counters and histograms are cumulative, so their totals survive between scrapes.)
For most services, 15-second resolution is perfectly adequate. For checkout latency, payment processing, and other revenue-critical paths, consider reducing to 5s. Before you do, run Prometheus with the current configuration for a week and baseline its CPU, memory, and disk consumption. A 3x increase in scrape frequency roughly triples the sample ingestion rate and on-disk sample volume, but memory grows far less, because Prometheus memory is dominated by the number of active time series, which scrape frequency does not change. Either way, you need to know your headroom before making the change in production.
For events shorter than your scrape interval (flash spikes, GC pauses, brief thread pool exhaustion), scrape resolution is not the right tool. Use application-level histograms — they accumulate observations between scrapes, so the statistical signal survives even if the raw spike is not captured in a single scrape.
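To make that concrete, here is a minimal, dependency-free sketch of how a histogram preserves a sub-interval spike. The bucket bounds and class names are illustrative, not any particular metrics library's API:

```java
import java.util.concurrent.atomic.LongAdder;

// Minimal cumulative histogram in the Prometheus style: each bucket counts
// observations less than or equal to its upper bound. Counts only ever grow,
// so a scrape sees every observation recorded since startup - nothing is
// lost in the gaps between scrapes.
class LatencyHistogram {
    // Illustrative bucket bounds, in seconds
    static final double[] BOUNDS = {0.1, 0.5, 1.0, 5.0, Double.POSITIVE_INFINITY};
    final LongAdder[] buckets = new LongAdder[BOUNDS.length];

    LatencyHistogram() {
        for (int i = 0; i < buckets.length; i++) buckets[i] = new LongAdder();
    }

    void observe(double seconds) {
        // Cumulative buckets: increment every bucket whose bound covers the value
        for (int i = 0; i < BOUNDS.length; i++) {
            if (seconds <= BOUNDS[i]) buckets[i].increment();
        }
    }

    long countAtMost(double bound) {
        for (int i = 0; i < BOUNDS.length; i++) {
            if (BOUNDS[i] == bound) return buckets[i].sum();
        }
        throw new IllegalArgumentException("no such bucket: " + bound);
    }
}

public class HistogramDemo {
    public static void main(String[] args) {
        LatencyHistogram h = new LatencyHistogram();
        // Steady traffic: 100 fast requests between two scrapes...
        for (int i = 0; i < 100; i++) h.observe(0.05);
        // ...plus a brief 3-second spike (e.g. a GC pause) that a gauge
        // scraped every 15s would likely miss entirely
        for (int i = 0; i < 5; i++) h.observe(3.0);

        // At the next scrape the spike is still visible in the bucket counts
        System.out.println("le=1.0 -> " + h.countAtMost(1.0));  // 100
        System.out.println("le=5.0 -> " + h.countAtMost(5.0));  // 105
    }
}
```

The 3-second observations never disappear: they sit in the `le=5.0` bucket until the next scrape, which is exactly why `histogram_quantile()` can still see tail latency that no individual scrape witnessed live.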
🎯 Key Takeaway
The Prometheus pull model protects your monitoring infrastructure from application misbehavior — a crashing or misbehaving app cannot flood the monitoring system. But scrapes are lossy by design: sub-interval events are invisible unless you use histograms to accumulate observations between scrapes.
Cardinality is the silent killer. One high-cardinality label — added by a developer who thought it would help debugging — can consume gigabytes of Prometheus memory within hours and bring down the entire monitoring stack. Review label sets in code review with the same scrutiny you apply to database index design.
Always use histogram_quantile() in PromQL for latency metrics. Averages are misleading in right-skewed distributions, and your latency distributions are always right-skewed.
Prometheus Architecture Decisions
If: Single cluster, fewer than 1 million active time series, retention under 30 days
Use: A single Prometheus instance with local TSDB is sufficient. Monitor Prometheus's own memory and disk consumption as a first-class concern.
If: Multiple clusters, need cross-cluster querying, or retention beyond 30 days
Use: Add Thanos Sidecar or Grafana Mimir for horizontal scaling, query federation across clusters, and an object storage backend (S3/GCS) for long-term retention at low cost.
If: Short-lived batch jobs or processes that complete in under one scrape interval
Use: Pushgateway — the job pushes its final metric state before exiting, and Prometheus scrapes Pushgateway on the normal schedule. Clean up stale Pushgateway entries after job completion.
If: Need to correlate specific metric observations with traces for deep debugging
Use: Enable exemplar support in Prometheus (--enable-feature=exemplar-storage) and configure your metrics library to attach the current trace ID as an exemplar. Grafana renders exemplars as clickable links directly to Tempo or Jaeger.
🗂 Monitoring vs Observability
A feature-by-feature comparison for engineering teams evaluating their telemetry strategy
Feature | Monitoring (Traditional) | Observability (Modern)
Primary Goal | Track health of predefined, known metrics — you decide in advance what to measure | Understand internal system state from external telemetry — including failure modes you did not anticipate
Approach | Symptom-based: Is it up? Is error rate below threshold? Did this counter cross a line? | Context-based: Why is this specific request slow? Which downstream service introduced the latency? What changed between yesterday and today?
Tooling | Nagios, basic dashboards, threshold alerts on raw metrics | Prometheus, OpenTelemetry, Jaeger or Tempo, Grafana with correlated trace/log/metric views
User Experience | Alerts when a predefined threshold is crossed — tells you something is wrong | Visualizes the complete lifecycle of a request — tells you what went wrong, where, and why
Data Style | Aggregated snapshots — you lose per-request detail in favor of efficient storage | High-fidelity traces and structured event logs — per-request context is preserved and queryable
Unknown Unknowns | Cannot detect failure modes you did not build a dashboard for | Rich telemetry allows ad-hoc queries during incidents — answering questions you did not anticipate
Incident Response | Tells you an SLA threshold was crossed — investigation starts from scratch | Provides a navigable path from alert → trace → log → root cause

🎯 Key Takeaways

  • Monitoring tells you when a system is failing; observability provides the diagnostic tools to understand why. You need both — monitoring without observability leaves you with an alarm but no investigation path; observability without monitoring leaves you with data but no trigger to look at it.
  • The 3 Pillars — Metrics, Logs, and Traces — must be unified with a common Correlation ID (trace ID) to function as an observability system rather than three disconnected dashboards. The trace ID in your log lines is the most important field you can instrument.
  • Adopt OpenTelemetry as your instrumentation standard to avoid vendor lock-in. Auto-instrumentation agents cover 90% of JVM instrumentation with zero code changes — the investment is in configuration and rollout, not in rewriting your application.
  • Monitor the 4 Golden Signals — Latency, Traffic, Errors, and Saturation — as your primary alert targets for every service. System resource metrics (CPU, memory) are diagnostic signals, not primary alert targets.
  • Averages hide tail latency in every right-skewed distribution, and your latency distributions are always right-skewed. Always instrument P95 and P99 as first-class metrics. Alert on P99 thresholds, not averages. Use histogram metrics and histogram_quantile() in PromQL — not summaries or gauges.
  • Cardinality is the silent killer of Prometheus. One high-cardinality label (user_id, order_id, session_id) on a high-traffic metric can consume gigabytes of Prometheus memory within hours and cause an OOM restart loop. Review label sets in code review with the same rigor you apply to database index design.
  • Log retention without a tiered strategy is an operational tax that compounds over time. Implement retention tiers by log level from day one: DEBUG disabled in production, INFO sampled at 10–20% for high-volume services, ERROR retained at full fidelity for 90 days. Archive cold logs to object storage rather than paying hot storage prices for data you query once a quarter.
  • Synthetic monitoring verifies SLA uptime from backbone infrastructure. Real User Monitoring captures actual experience from real ISP connections and real devices. Use both. Only RUM detects regional ISP outages — synthetic probes bypass consumer ISP last-mile networks entirely.

⚠ Common Mistakes to Avoid

    Dashboard overload — 50 graphs with no hierarchy, all treated as equally important
    Symptom

    Alert fatigue sets in within weeks of launch. Engineers stop looking at dashboards because nothing on them is clearly actionable. Critical alerts drown in a sea of non-critical panels showing system metrics that are within normal ranges. During an actual incident, nobody knows which panel to look at first.

    Fix

    Build a 3-tier dashboard hierarchy: Executive tier shows SLA and SLO status — green or red, nothing else. Service tier shows the 4 Golden Signals per service — latency, traffic, errors, saturation. Debug tier shows detailed per-request traces, slow query logs, and dependency health. Alert only on tier-1 SLO violations. CPU and memory panels belong on the debug tier, not on the primary service dashboard that engineers check first during an incident.

    Missing the trace — metrics and logs implemented, distributed tracing skipped as 'too complex'
    Symptom

    During an outage, engineers spend hours manually correlating log timestamps across three or four services to reconstruct what happened to a failing request. Incident timelines are guesses. Post-mortems are based on partial information. The same failure mode recurs because the root cause was never fully understood.

    Fix

    Implement OpenTelemetry with automatic instrumentation as early as possible — ideally before the first production deployment, not as a post-incident remediation. Propagate W3C Trace-Context headers across all HTTP calls and message queue operations. Include the trace ID in all log statements via MDC. The auto-instrumentation agent for most JVM frameworks handles 90% of this with zero code changes — the investment is in configuration, not implementation.
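To make the header mechanics concrete, here is a dependency-free sketch of W3C `traceparent` propagation. In a real service the OpenTelemetry propagators do this for you; the class and method names below are illustrative:

```java
import java.security.SecureRandom;

// Sketch of W3C Trace Context propagation: the trace-id survives across
// every hop, while each service contributes its own span-id as the
// parent-id that the next service sees.
public class TraceContextDemo {
    static final SecureRandom RNG = new SecureRandom();

    static String randomHex(int bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes; i++) sb.append(String.format("%02x", RNG.nextInt(256)));
        return sb.toString();
    }

    // Build the outbound traceparent for the next hop: same trace-id,
    // but this service's span-id becomes the parent-id the callee sees.
    static String propagate(String inboundTraceparent, String mySpanId) {
        String[] parts = inboundTraceparent.split("-");
        // parts: [version, trace-id, parent-id, trace-flags]
        return String.join("-", parts[0], parts[1], mySpanId, parts[3]);
    }

    public static void main(String[] args) {
        // Service A receives no context, so it starts a new trace
        String traceId = randomHex(16);          // 128-bit trace-id
        String spanA = randomHex(8);             // Service A's 64-bit span-id
        String aToB = "00-" + traceId + "-" + spanA + "-01";

        // Service B creates its own span and forwards the context to C
        String spanB = randomHex(8);
        String bToC = propagate(aToB, spanB);

        System.out.println("A -> B: " + aToB);
        System.out.println("B -> C: " + bToC);
        // Both headers share the same trace-id; only the parent-id changes.
    }
}
```

Every hop that fails to call the equivalent of `propagate()` on its outbound requests breaks the chain at exactly that point.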

    Measuring system resources instead of business outcomes — optimizing for what is easy to measure, not what matters
    Symptom

    CPU sits at 30%, memory is comfortable, all infrastructure dashboards are green. Meanwhile, checkout success rate has dropped to 60% and the business is losing thousands of dollars per minute. Engineers see a healthy system. The product team sees a broken product. The disconnect erodes trust in the engineering team's ability to monitor what actually matters.

    Fix

    Define SLOs based on user-facing outcomes first: checkout success rate, P99 login latency, payment processing completion rate. These are your primary alert targets. System resource metrics (CPU, memory, disk I/O) are secondary signals that help explain why a user-facing SLO degraded — they are diagnostic tools, not alert targets.

    Using averages as the primary latency metric in dashboards and alerts
    Symptom

    Average latency is 120ms and alerts are tuned to fire at 500ms average. A single slow replica serves 1% of traffic at 30-second response times. No alerts fire. The 30-second requests belong to real users who do not retry gracefully — they abandon the flow. The business sees elevated cart abandonment. Engineering sees a healthy average.

    Fix

    Replace average latency as a primary dashboard metric with P95 and P99 histograms. Set alerts on P99 thresholds — a reasonable starting point is alert when P99 exceeds 3x the baseline P99 for your endpoint, sustained for 3 minutes. Use histogram metric types in your instrumentation library (not gauges or summaries) so that PromQL histogram_quantile() can compute true percentiles from the raw bucket data.
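One way to encode that starting point as a Prometheus alert rule, reusing the recording rule defined earlier in this guide. Using yesterday's P99 as the baseline via `offset 1d` is an illustrative choice, not the only one:

```yaml
# Sketch: P99 more than 3x its own value 24h ago, sustained for 3 minutes
- alert: ForgeP99AboveBaseline
  expr: |
    job:http_request_duration_p99:rate5m
      > 3 * (job:http_request_duration_p99:rate5m offset 1d)
  for: 3m
  labels:
    severity: warning
```

A baseline computed over a longer window (or a static SLO bound) works just as well; the important part is that the comparison is against P99, never against the average.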

    No retention or sampling strategy for logs — treating all logs as equally valuable
    Symptom

    Elasticsearch or Loki runs out of disk every week on a predictable schedule. Log ingestion and storage costs exceed the application hosting costs by a factor of two. Engineers cannot query logs older than 3 days because the cluster degrades under the write load. Adding more disk is a recurring operational tax that solves nothing structurally.

    Fix

    Implement tiered retention based on log level and operational value: DEBUG logs retained 24 hours (better yet, disabled entirely in production unless explicitly enabled for a specific service during active debugging), INFO logs retained 7 days with 10–20% sampling on high-volume services, WARN logs retained 30 days, ERROR logs retained 90 days at full fidelity. Archive cold logs to object storage (S3 or GCS with lifecycle policies) for long-term compliance requirements. Implement the tiered strategy at the collection agent (Fluentd or Vector) so the decision is made before data hits your storage backend.
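As one way to implement that at the collection agent, here is a sketch of a Vector pipeline that routes by log level and samples INFO before anything reaches the storage backend. Treat the option names and endpoints as illustrative and check them against your Vector version:

```toml
[sources.app_logs]
type = "file"
include = ["/var/log/forge/*.json"]

[transforms.by_level]
type = "route"
inputs = ["app_logs"]
route.errors = '.level == "ERROR" || .level == "WARN"'
route.info   = '.level == "INFO"'
# DEBUG matches no route and is dropped before it costs anything

[transforms.sample_info]
type = "sample"
inputs = ["by_level.info"]
rate = 10                          # keep roughly 1 in 10 INFO events

[sinks.loki_hot]
type = "loki"
inputs = ["by_level.errors", "sample_info"]
endpoint = "http://loki:3100"
encoding.codec = "json"
labels.level = "{{ level }}"
```

The same shape works in Fluentd with `rewrite_tag_filter` plus a sampling filter; the point is that the drop/sample decision happens in the agent, not in the storage cluster.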

Interview Questions on This Topic

  • Q (Senior): You have a service where average latency is 100ms but P99 is 5 seconds. What does this tell you about the system, and how would you use observability to find the bottleneck?
    A P99 of 5s with a 100ms average is a heavily right-skewed distribution — the vast majority of requests are fast, but a small fraction are catastrophically slow. That pattern points to a specific class of failure modes: a resource that is occasionally unavailable or slow, not a general degradation affecting all requests. My first step is to check whether the slow requests share a common dimension. Using distributed tracing in Jaeger or Tempo, I would filter for traces in the 95th–100th percentile duration and look for patterns: Do they all hit the same downstream service? The same database instance? The same cache shard? Are they larger payloads? Do they cluster at specific times?
    Common root causes for this pattern:
      1. A single slow replica behind a round-robin load balancer — check per-instance latency metrics broken out by the instance label. One instance will show dramatically higher P99 than the others.
      2. JVM garbage collection pauses — correlate the P99 spike timestamps with GC log events. GC pauses show up as sharp latency spikes with full recovery between them.
      3. Cold cache misses on a small fraction of requests — check cache hit rate during slow periods. First-access requests pay the database cost; subsequent requests are served from cache.
      4. Database connection pool exhaustion — slow requests may be waiting for a connection to become available, not actually executing a slow query. Check connection pool wait time metrics separately from query execution time metrics.
    For instrumentation, I would ensure the histogram metric for each downstream call is present and properly labeled so that histogram_quantile(0.99, rate(call_duration_seconds_bucket{service="payment"}[5m])) gives me per-dependency P99 values. That query alone usually points at the culprit within a few minutes of looking.
  • Q (Senior): Explain the 'Unknown-Unknowns' concept in the context of observability. Give a concrete example of an issue monitoring would miss but observability would catch.
    The known-known / unknown-unknown framework maps to observability this way: known-knowns are the things you actively monitor — error rate, CPU, request count — because you anticipated they would matter. Known-unknowns are things you know exist but have not instrumented yet: 'We should track P99 latency per region someday.' Unknown-unknowns are failure modes that never appeared in your design documents and that you would not think to query for — until they are happening in production and your existing dashboards cannot see them.
    A concrete example: a payment service returns HTTP 200 on every request. Error rate is 0%. Average and P99 latency are within SLA. All monitoring dashboards are green. But the response body contains a currency conversion rate field populated from a Redis cache. The cache key was accidentally made non-expiring during a configuration change, and the rate is now 48 hours stale. Users are being quoted and charged incorrect amounts.
    Monitoring cannot catch this because there is no metric for 'response body content correctness.' The failure mode was never anticipated. Observability catches it because you can compose a novel query against your structured logs: 'Show me all checkout requests in the last hour where the response currency_rate differs from the live rate fetched from the exchange API by more than 2%.' That query requires structured logs with the relevant fields present, a trace ID for correlation, and a log query tool that supports arithmetic comparisons. None of that requires anticipating this specific failure — it requires having rich enough telemetry that you can ask questions you did not anticipate.
    That is the difference. Monitoring answers the questions you pre-programmed. Observability lets you ask new questions during the incident itself.
  • Q (Mid-level): Describe how a distributed trace propagates through a microservices architecture using HTTP headers.
    When a request arrives at the first service in the chain, the tracing SDK checks incoming headers for an existing trace context. If none exists, it generates a new globally unique trace ID and a span ID for this service's unit of work. If a trace context header is already present (because the request came from another instrumented service), it extracts the existing trace ID and uses it — creating continuity across the chain.
    The W3C Trace Context standard defines two headers. The traceparent header carries four fields: version (always '00' currently), trace-id (128-bit hex), parent-id (64-bit hex identifying the sending span), and trace-flags (bitfield, currently just a sampled bit). Example: traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01. The tracestate header carries vendor-specific key-value pairs and is optional.
    When Service A makes an HTTP call to Service B, it injects the traceparent header into the outbound request, setting the parent-id to Service A's own span ID. Service B extracts the header, creates a new span with the same trace ID and a new span ID, and sets its parent to Service A's span ID. This creates a parent-child relationship that the tracing backend (Jaeger, Tempo) renders as a hierarchical flame chart.
    The B3 format — used by Zipkin and still common in older service meshes — uses discrete headers: X-B3-TraceId, X-B3-SpanId, X-B3-ParentSpanId, and X-B3-Sampled. Both formats achieve the same structural result.
    The non-negotiable requirement: every HTTP client, message queue producer, and async worker in every service must propagate these headers on every outbound call. One service in the chain that drops the headers breaks the trace at that point — all downstream services start new, unconnected traces, and you cannot reconstruct the full request journey from them.
  • Q (Mid-level): What is the trade-off between structured JSON logs and plain text logs in a high-scale production environment?
    The trade-off is storage cost and write throughput versus query performance and incident response speed.
    Structured JSON logs are machine-parseable at ingest time. Fields like level, traceId, service, and userId are first-class indexed fields in Elasticsearch or Loki. A query like level:ERROR AND traceId:4bf92f35 runs against an inverted index and returns results in milliseconds regardless of total log volume. The downside: JSON serialization adds overhead at write time (typically 20–40% CPU increase for high-frequency loggers), and the JSON envelope increases log size by 2–3x compared to equivalent plain text, which translates directly to storage costs and network transfer costs at scale.
    Plain text logs are compact and human-readable in a terminal. They are cheap to produce. The cost appears at query time: finding a specific trace ID in 10TB of unstructured text requires a full-text scan or regex parsing, which at scale can take minutes and frequently times out. During an active incident, waiting minutes for a log query is not acceptable.
    The pragmatic approach at scale: apply the format choice based on log level and expected query patterns. ERROR and WARN logs should always be structured JSON — these are the logs you query most urgently during incidents, and the query performance difference is the difference between a 30-second investigation and a 10-minute one. High-volume INFO logs can use plain text or be sampled down to 10% before ingest, which recovers most of the storage cost. DEBUG logs should not reach your log aggregation system at all in production — sample them at the agent level or gate them behind a feature flag that is off by default.
  • Q (Mid-level): How do you implement Synthetic Monitoring versus Real User Monitoring (RUM)? Which is better for identifying regional ISP outages?
    Synthetic monitoring runs automated scripts from fixed geographic locations on a scheduled interval — every 60 seconds from AWS us-east-1, eu-west-1, and ap-southeast-1, for example. The scripts simulate user flows: load the homepage, authenticate, add an item to the cart, reach the checkout page. They measure availability and performance proactively and continuously, even when zero real users are active. This makes synthetic monitoring ideal for SLA verification during off-peak hours and for detecting regressions introduced by deployments before users encounter them. Tools: Datadog Synthetics, Grafana Synthetic Monitoring, Pingdom.
    Real User Monitoring instruments actual user browsers and mobile apps to capture performance data from real network conditions and real devices. A JavaScript agent reports page load time, resource timing, JavaScript errors, and user interaction events back to the analytics platform in real time. Tools: Datadog RUM, New Relic Browser, Elastic RUM, Grafana Faro.
    For regional ISP outages specifically, RUM is the right tool. Synthetic probes run from AWS or GCP backbone infrastructure — they use transit paths that bypass consumer ISPs entirely. A consumer ISP outage in a specific metropolitan area will not affect synthetic probes running from that AWS region, because those probes use backbone peering, not the ISP's last-mile network. RUM, by contrast, runs inside actual user browsers on actual ISP connections. A sudden spike in RUM-reported latency or page load failures from users in a specific geographic region, with no corresponding degradation in synthetic probes, is a strong and specific signal of an ISP or last-mile network issue.
    Use synthetic monitoring for SLA uptime guarantees, deployment regression detection, and always-on availability checks. Use RUM for real-world performance baselines, regional issue detection, and understanding the actual user experience across device types and network conditions.
  • Q (Senior): Scenario: A database query is intermittently slow. How would you instrument your Java code to capture the exact SQL, bind parameters, and context only when latency exceeds 500ms?
    The goal is surgical: capture full diagnostic context for slow queries without paying the overhead of logging every query. Here is the approach I would use:
    First, wrap the database call in a timed block. Record start time with System.nanoTime() before execution and compute elapsed time after. Nanosecond precision matters here — millisecond granularity on a fast query produces zero, which is not useful.
    If elapsed time exceeds the threshold (500ms in this case), emit a WARN-level structured log with the full SQL statement, bind parameter values (sanitized to remove PII if necessary), elapsed time, connection pool wait time (separate from query execution time — waiting for a connection is a different problem than a slow query), and the current trace ID from MDC. This log line is the full incident report for a single slow query.
    Always emit the histogram metric for query duration regardless of the threshold — every query execution contributes to the histogram bucket, so your PromQL P99 calculation remains accurate even if you only log the slow ones.
    For production-grade implementation, I would use OpenTelemetry's @WithSpan annotation on the repository method combined with a custom SpanProcessor. The processor checks span duration on end and adds the SQL statement and bind parameters as span attributes only when duration exceeds the threshold. Fast query spans remain lightweight; slow query spans carry full diagnostic context visible in Jaeger. This keeps the trace clean for 99% of requests and detailed for the 1% that need investigation — without any conditional logic in the application code itself.
    One sharp edge to be aware of: logging bind parameters can expose sensitive data (passwords, PII, financial amounts). Always pass bind parameters through a sanitizer that replaces sensitive field values with type placeholders (e.g., '? [varchar]') before logging, and document which fields are redacted in your instrumentation standards.
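The timed-block approach described in that answer can be sketched without any framework. The logger and histogram here are stand-ins (plain Consumers) for your real logging and metrics libraries; names and the threshold are illustrative:

```java
import java.util.concurrent.Callable;
import java.util.function.Consumer;

// Every execution feeds the histogram, but the full diagnostic report is
// emitted only for executions past the latency threshold.
public class SlowQueryGuard {
    static final long THRESHOLD_NANOS = 500_000_000L; // 500ms

    static <T> T timed(String sql, Callable<T> query,
                       Consumer<Double> histogram,   // fed on EVERY call
                       Consumer<String> slowLog)     // fed only past threshold
            throws Exception {
        long start = System.nanoTime();               // nanosecond precision
        try {
            return query.call();
        } finally {
            long elapsed = System.nanoTime() - start;
            histogram.accept(elapsed / 1e9);          // keeps P99 accurate
            if (elapsed > THRESHOLD_NANOS) {
                // Only slow queries pay for the full structured report
                slowLog.accept(String.format(
                    "slow_query sql=\"%s\" elapsed_ms=%d",
                    sql, elapsed / 1_000_000));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Fast call: contributes to the histogram, emits no slow-query line
        timed("SELECT 1", () -> "fast", d -> {}, System.out::println);
        // Slow call (simulated 600ms): emits the full report
        timed("SELECT pg_sleep(1)",
              () -> { Thread.sleep(600); return "slow"; },
              d -> {}, System.out::println);
    }
}
```

In a real codebase the `slowLog` consumer would be a structured WARN log carrying the trace ID from MDC and the sanitized bind parameters, exactly as the answer describes.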

Frequently Asked Questions

Can I have observability without monitoring?

Not in any practically useful sense. Monitoring is a subset of observability — you need the gauges and alerts that monitoring provides to know when to investigate, and you need the rich telemetry that observability provides to complete the investigation effectively.

Monitoring without observability tells you something broke but not why. You get alerts and green/red dashboards with no ability to drill into the cause. Observability without monitoring means you have logs, traces, and metrics available — but no automated trigger to tell you when something is wrong. You would have to manually query your observability platform to discover problems, which is not sustainable.

The practical baseline: implement monitoring first (alerts on the 4 Golden Signals) so you have a trigger. Then invest in observability (distributed tracing, structured logs, metric correlation) so you have a path from trigger to root cause. Neither replaces the other.

What are the four Golden Signals of SRE?

The four Golden Signals were defined by Google's SRE team as the minimum set of metrics to monitor for any production service. They are:

  1. Latency — the time it takes to serve a request. Measure P95 and P99, not average. Distinguish between successful request latency and failed request latency — a request that fails in 1ms looks fast in your average but indicates a different problem than a request that fails after 30s.
  2. Traffic — the demand on your system. Requests per second, messages per second, queries per second — whatever the natural unit of throughput is for your service. Traffic context makes latency and error rate changes meaningful: a 10% error rate at 100 RPS is very different from a 10% error rate at 10,000 RPS.
  3. Errors — the rate of failed requests. Distinguish explicit failures (HTTP 5xx, exception thrown) from implicit failures (HTTP 200 with an error payload, business logic failures that return success codes). Both matter; only explicit errors are automatically visible in HTTP-level metrics.
  4. Saturation — how 'full' your service is relative to its capacity. Thread pool utilization, connection pool utilization, queue depth, memory pressure. Saturation metrics predict approaching failure before it materializes as latency or error spikes — they are your leading indicator.

If you can only instrument four things for a new service, instrument these four.
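As a starting point, the four signals can be expressed in PromQL against the Micrometer-style metric names used elsewhere in this guide. The saturation metric names in particular vary by runtime and thread pool and are illustrative:

```promql
# Latency: P99 per job, never the average
histogram_quantile(0.99,
  sum by (job, le) (rate(http_server_requests_seconds_bucket[5m])))

# Traffic: requests per second per job
sum by (job) (rate(http_server_requests_seconds_count[5m]))

# Errors: share of 5xx responses
sum by (job) (rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
  / sum by (job) (rate(http_server_requests_seconds_count[5m]))

# Saturation: pool utilization (metric names depend on your runtime)
executor_active_threads / executor_pool_max_threads
```

Note that the errors query only captures explicit HTTP-level failures; implicit failures (200 with an error payload) need their own business-level counter.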

Why is high cardinality a problem in observability?

Cardinality in the context of Prometheus metrics refers to the number of unique time series created by a metric's label combinations. Prometheus stores one time series per unique combination of metric name and label values — in memory, in TSDB, and on disk.

If you add a user_id label to a metric on a system with 1 million users, Prometheus must track 1 million individual time series for that single metric name. Add two more labels with 10 values each, and you have 100 million time series. That is not a theoretical problem — it is a gigabyte-scale memory allocation that occurs within the first few hours of a production deployment and typically causes Prometheus to OOM-restart before anyone notices the root cause.

The structural fix: high-cardinality per-request data (user ID, order ID, session token) belongs in traces and logs, not in metric labels. Use exemplars to link a specific metric observation to a trace ID — Prometheus stores exemplars separately from the main TSDB, so they do not contribute to cardinality. In Grafana, exemplar markers render as clickable links directly to the corresponding trace in Tempo or Jaeger.

As a rule: before adding a new label to a metric in production, estimate the number of unique values that label can take. If the answer is 'unbounded' or 'one per user' or 'one per request,' the label does not belong on a metric.
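That pre-merge estimate is simple multiplication: the worst-case series count for one metric name is the product of each label's unique-value count. A tiny sketch, using the numbers from the example above:

```java
// Worst-case time series count = product of unique values per label.
public class CardinalityCheck {
    static long worstCaseSeries(long... uniqueValuesPerLabel) {
        long total = 1;
        for (long v : uniqueValuesPerLabel) {
            total = Math.multiplyExact(total, v); // fail loudly on overflow
        }
        return total;
    }

    public static void main(String[] args) {
        // user_id (1M users) x region (10 values) x plan (10 values)
        long series = worstCaseSeries(1_000_000, 10, 10);
        System.out.println(series + " potential series for ONE metric name");
        // 100,000,000 series: this data belongs in traces and logs,
        // linked via exemplars - not in metric labels.
    }
}
```

If the product is unbounded or scales with users or requests, the label fails the review.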

Naren · Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.

Next → Prometheus and Grafana Setup
Forged with 🔥 at TheCodeForge.io — Where Developers Are Forged