Monitoring vs Observability Explained — The Complete DevOps Guide
- Monitoring watches predefined metrics and alerts when thresholds are breached — it answers 'WHAT is broken'
- Observability lets you ask arbitrary questions about system state from external telemetry — it answers 'WHY it broke'
- The 3 pillars are Metrics (aggregated numeric signals), Logs (per-event detail), and Traces (request flow across services)
- Without a shared Correlation ID across all three pillars, debugging cross-service failures is manual timestamp archaeology
- Prometheus uses pull-based scraping — your app exposes /metrics, Prometheus fetches on a fixed interval
- High-cardinality labels (User IDs, request IDs) as metric tags will crash your monitoring storage
- Monitoring is a subset of observability — you need both, and they serve different cognitive roles during an incident
Production Debug Guide — Symptom → Action mapping for common observability failures

Symptom: Prometheus target is DOWN

```shell
curl -s http://forge-prod-app:8080/actuator/prometheus | head -20
kubectl exec -it prometheus-pod -- wget -qO- http://forge-prod-app:8080/actuator/prometheus
```

Symptom: Missing traces in Jaeger or Tempo — spans appear incomplete or absent

```shell
curl -v http://forge-api:8080/api/orders 2>&1 | grep -i 'traceparent\|trace\|b3'
kubectl logs deployment/forge-api --tail=50 | grep -i 'trace\|span\|otel\|exporter'
```

Symptom: Alert firing but no corresponding data visible in the Grafana dashboard

```shell
curl -s 'http://prometheus:9090/api/v1/query?query=forge_orders_success_total' | python -m json.tool
curl -s 'http://prometheus:9090/api/v1/targets' | python -c "import sys,json; [print(t['labels']['job'], t['health'], t['lastError']) for t in json.load(sys.stdin)['data']['activeTargets']]"
```

Symptom: Cardinality explosion in Prometheus — memory spiking, queries slowing, OOM restarts

```shell
curl -s 'http://prometheus:9090/api/v1/status/tsdb' | python -m json.tool | head -40
curl -s 'http://prometheus:9090/api/v1/label/__name__/values' | python -c "import sys,json; d=json.load(sys.stdin); print('Total metric names:', len(d['data']))"
```
Your production system crashes at 2 AM on Black Friday. Orders are failing, users are screaming on Twitter, and your on-call engineer is staring at a wall of dashboards wondering where to even begin. This scenario plays out every day at companies around the world — and the difference between a 10-minute fix and a 4-hour outage almost always comes down to one thing: how well-instrumented your system was before the incident.
Monitoring and observability are not luxury features you bolt on after launch. They are the engineering discipline that separates teams who fix incidents in minutes from teams who spend hours in Slack threads and frantic Zoom calls pointing fingers at each other's services. While monitoring is the act of watching known-knowns through dashboards and alerts, observability is the property of a system that allows you to understand its internal state from the external data it emits — logs, metrics, and traces. The operative word is property. Observability is not a tool you buy. It is a quality you engineer into your system from the start.
The critical misconception I see constantly at staff level: treating monitoring and observability as synonyms. They are not. Monitoring is a subset of observability. You can monitor without observability — you get dashboards with no context, alerts that fire with no actionable detail, and on-call engineers who solve incidents through gut instinct and luck. You cannot have meaningful observability without some form of monitoring — you still need alerts to know when to look at anything. The two are complementary, not competing. Get both right and your MTTR drops from hours to minutes. Get only one right and you are still flying partially blind.
The Three Pillars: Metrics, Logs, and Traces
True observability requires all three types of telemetry working together, not independently. Metrics provide a high-level, aggregated view of system health — request rate, error rate, CPU saturation. Logs provide the per-event forensic record of exactly what happened at a specific moment in time. Distributed traces connect everything by showing the complete path of a single request across every service it touched, with timing for each hop.
Each pillar serves a distinct cognitive role in an incident, and understanding those roles prevents the common mistake of trying to use one pillar for a job another pillar does better. Metrics are your early warning system — they tell you THAT something changed, and they tell you fast. A metric alert can fire within 30 seconds of a threshold being crossed. Logs are your forensic evidence — they tell you exactly WHAT happened at a specific request or event level, but querying them at scale is expensive and slow compared to a metric query. Traces are your request GPS — they tell you WHERE a request traveled and HOW LONG each individual service spent processing it. The power of the three-pillar model is not in any single pillar; it is in the correlation between them.
The correlation ID is the connective tissue. When a P99 latency alert fires (metric), you need to jump directly to the traces for the slow requests and then to the logs for the specific error messages those traces contain. If the trace ID is not present in your logs, that jump requires manual timestamp correlation — which is slow, error-prone, and genuinely miserable to do at 2 AM during an active incident. Instrument for correlation from day one, not as a retrofit after the first major outage.
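Getting the trace ID into every log line is mostly a logging-configuration concern, not an application-code concern. A minimal Logback sketch using logstash-logback-encoder, which emits one JSON object per event and includes all MDC entries (such as traceId and orderId) as top-level fields by default — the file name and appender name here are illustrative:

```xml
<!-- logback-spring.xml — illustrative sketch, not a complete production config -->
<configuration>
  <appender name="JSON" class="ch.qos.logback.core.ConsoleAppender">
    <!-- LogstashEncoder serializes each event as JSON and copies MDC
         entries (traceId, orderId, ...) into the output automatically -->
    <encoder class="net.logstash.logback.encoder.LogstashEncoder"/>
  </appender>
  <root level="INFO">
    <appender-ref ref="JSON"/>
  </root>
</configuration>
```

With this in place, any code that puts the trace ID into MDC gets trace-correlated JSON logs with no per-statement effort.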
One more thing that gets underweighted: the 'unknown unknowns' problem. Metrics cover what you thought to instrument. Logs cover what you thought to log. Traces cover what you thought to trace. Observability is the property that lets you ask questions you did not anticipate when you wrote the code — because the raw telemetry is rich enough to answer novel queries. A system where you can only ask questions you pre-configured dashboards for is a monitored system. A system where you can compose a new query during an incident and get a meaningful answer is an observable system.
```java
package io.thecodeforge.monitoring;

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.opentelemetry.api.trace.Span;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

/**
 * io.thecodeforge — Three-Pillar Instrumentation Pattern
 *
 * Demonstrates all three observability pillars in a single operation:
 * - Metric: forge.orders.processed (counter) and forge.payment.latency (histogram timer)
 * - Log:    Structured log line with traceId injected from the active OpenTelemetry span
 * - Trace:  Active span context is propagated by the OTel agent — no manual span creation needed
 *
 * Key design decisions:
 * 1. traceId comes from the active OTel Span — it is the same ID visible in Jaeger/Tempo
 * 2. MDC injection makes traceId appear in EVERY log line within this thread scope
 * 3. The Timer wraps the entire business operation — it measures wall-clock time including I/O
 * 4. The Counter uses .tag("status", "success") so failures can be counted separately:
 *    forge.orders.processed{status="failure"} without creating a separate metric
 */
public class OrderMonitor {

    private static final Logger log = LoggerFactory.getLogger(OrderMonitor.class);

    private final Counter orderSuccessCounter;
    private final Counter orderFailureCounter;
    private final Timer paymentTimer;

    public OrderMonitor(MeterRegistry registry) {
        // Use tagged counters so success and failure are queryable separately in PromQL:
        // rate(forge_orders_processed_total{status="failure"}[5m]) gives the failure rate
        this.orderSuccessCounter = registry.counter("forge.orders.processed", "status", "success");
        this.orderFailureCounter = registry.counter("forge.orders.processed", "status", "failure");

        // Timer automatically produces _count, _sum, and _bucket metrics.
        // Enables histogram_quantile(0.99, rate(forge_payment_latency_seconds_bucket[5m]))
        this.paymentTimer = registry.timer("forge.payment.latency");
    }

    public void processOrder(String orderId) {
        // Inject the active OpenTelemetry trace ID into MDC before any log statements.
        // This ensures EVERY log line within this method includes the traceId automatically.
        String traceId = Span.current().getSpanContext().getTraceId();
        MDC.put("traceId", traceId);
        MDC.put("orderId", orderId);
        try {
            paymentTimer.record(() -> {
                // log.info output will include traceId and orderId from MDC automatically.
                // Logback/Log4j2 pattern: %X{traceId} %X{orderId} %msg
                log.info("Order processing started");
                // ... business logic: validate, reserve inventory, charge payment ...
                orderSuccessCounter.increment();
                log.info("Order processing completed successfully");
            });
        } catch (Exception e) {
            orderFailureCounter.increment();
            // ERROR logs include traceId from MDC — jump directly to the trace in Jaeger
            log.error("Order processing failed — error={} type={}; check trace for upstream dependency state",
                    e.getMessage(), e.getClass().getSimpleName(), e);
            throw e;
        } finally {
            // Always clear MDC — thread-pool reuse will leak context if you skip this
            MDC.clear();
        }
    }
}

/*
 * Example log output (JSON format via logstash-logback-encoder):
 * {
 *   "level": "INFO",
 *   "logger": "io.thecodeforge.monitoring.OrderMonitor",
 *   "message": "Order processing completed successfully",
 *   "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",   <-- same ID in Jaeger/Tempo
 *   "orderId": "ORD-101"
 * }
 *
 * Metrics emitted (visible in Prometheus):
 *   forge_orders_processed_total{status="success"} 1.0
 *   forge_payment_latency_seconds_count 1.0
 *   forge_payment_latency_seconds_sum 0.043
 *   forge_payment_latency_seconds_bucket{le="0.05"} 1.0
 */
```
- Metrics are cheap to collect and store — start here for every new service. They are your smoke alarm: they tell you something changed, fast, without requiring you to read individual events.
- Logs are expensive at scale but irreplaceable for root cause analysis. Use structured JSON logs from day one — if your logs are not queryable by trace ID in under one second, you do not have observability, you have a text file archive.
- Traces bridge the gap — they connect a metric anomaly (P99 spiked) to the specific service and span (payment-service.charge() took 4.8s) without manual timestamp correlation across log files.
- The trace ID in your logs is the most important field you can add. It is the hyperlink between a log line and the full request journey. Without it, cross-service debugging during an incident is archaeological work.
- The unknown unknowns problem is only solvable when all three pillars are present and correlated. Predefined dashboards cover what you anticipated. Rich correlated telemetry lets you ask questions you did not anticipate — which is exactly what novel failure modes require.
Observability in Practice: Prometheus and Grafana
Modern observability stacks typically center on a pull-based collection model, and Prometheus is the reference implementation of that model in the open-source world. Instead of your application pushing metrics to a central store on a schedule it controls, Prometheus scrapes a /metrics endpoint on your service at an interval Prometheus controls. Your application is passive — it just maintains counters and histograms in memory and serves them when asked.
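What that passive endpoint actually serves is the Prometheus text exposition format: plain text, one sample per line, with optional HELP and TYPE metadata. A representative (hand-written, with illustrative values) response from the /actuator/prometheus endpoint used in this guide might look like:

```text
# HELP forge_orders_processed_total Total orders processed
# TYPE forge_orders_processed_total counter
forge_orders_processed_total{status="success"} 412.0
forge_orders_processed_total{status="failure"} 3.0
# TYPE forge_payment_latency_seconds histogram
forge_payment_latency_seconds_bucket{le="0.05"} 398.0
forge_payment_latency_seconds_bucket{le="+Inf"} 415.0
forge_payment_latency_seconds_sum 12.7
forge_payment_latency_seconds_count 415.0
```

Each scrape reads the current value of every series; Prometheus attaches the scrape timestamp itself, which is why the application can stay entirely stateless about collection schedules.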
The pull model has a meaningful architectural advantage: a misbehaving application cannot overwhelm the monitoring system with a flood of metric writes. If your application starts logging every debug event as a metric push, a push-based system absorbs the explosion and potentially falls over. Prometheus, by contrast, simply scrapes at its configured interval regardless of what the application is doing. The trade-off is the opposite failure mode: if your application crashes between scrapes, you lose the data from that window. For short-lived jobs and batch processes that complete in under one scrape interval, the Pushgateway solves this by holding pushed metrics until the next Prometheus scrape cycle.
Prometheus stores metrics as time series: a metric name, a set of key-value labels, and a sequence of timestamped float64 values. The label set is everything — it is how you slice a metric by service, region, endpoint, or status code. But the label set is also where most production Prometheus problems originate. Every unique combination of label values creates a separate time series in the TSDB. A metric with three labels, each with ten possible values, creates one thousand time series. Add a fourth label with a thousand possible values and you have a million time series for a single metric name. This is the cardinality problem, and it is not theoretical — I have watched it take down a Prometheus instance in production within 20 minutes of a bad deployment that added a user_id label to a high-traffic metric.
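The arithmetic above is worth making mechanical: the series count is the product of each label's distinct-value count. A tiny back-of-the-envelope helper (hypothetical class name, not part of any library) for estimating it before a label ships:

```java
public class CardinalityEstimator {

    /**
     * Estimates how many time series a single metric name will create:
     * the product of the number of distinct values each label can take.
     */
    public static long estimateSeries(long... labelValueCounts) {
        long series = 1;
        for (long count : labelValueCounts) {
            series *= count;
        }
        return series;
    }

    public static void main(String[] args) {
        // Three labels with ten values each — manageable
        System.out.println(estimateSeries(10, 10, 10));        // prints 1000
        // Add a fourth label with a thousand values — a million series for one metric
        System.out.println(estimateSeries(10, 10, 10, 1000));  // prints 1000000
    }
}
```

Running this estimate in code review — "what is the worst-case distinct-value count of this label?" — is the cheapest cardinality guardrail available.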
PromQL is where the real diagnostic power lives. PromQL is not a query language for fetching raw data — it is a functional language for computing derived signals from time series. The most important function in incident response is histogram_quantile(), which computes a true percentile from a histogram metric. histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) gives you the true P99 latency over a 5-minute window — the slowest experience that 1% of your users received. No dashboard based on averages can show you this. Building this query and putting it on your primary service dashboard is a one-time 10-minute investment that has caught more production incidents for me than any other single piece of instrumentation.
```yaml
# io.thecodeforge — Production Prometheus Scrape Configuration
#
# Design decisions explained inline:
# - scrape_interval: 15s is the standard for most services.
#   Reduce to 5s for latency-sensitive paths, but watch Prometheus CPU/memory.
# - honor_labels: false (default) — Prometheus overwrites conflicting labels
#   from the target with its own. Set true only if you trust the target's labels.
# - relabel_configs allow normalizing instance labels before storage.
#   This prevents cardinality drift from inconsistent hostname formats.

global:
  scrape_interval: 15s
  evaluation_interval: 15s   # How often alerting rules are evaluated

  # Global labels added to every time series scraped by this instance
  external_labels:
    datacenter: 'us-east-1'
    environment: 'production'

rule_files:
  - 'rules/forge_alerts.yml'    # Alert rules — evaluated every evaluation_interval
  - 'rules/forge_records.yml'   # Recording rules — pre-aggregate expensive PromQL queries

scrape_configs:
  - job_name: 'forge-api-service'
    # Spring Boot Actuator exposes Prometheus metrics at /actuator/prometheus,
    # not /metrics — this is a common misconfiguration that produces DOWN targets
    metrics_path: '/actuator/prometheus'
    scrape_interval: 5s   # Override global for latency-sensitive checkout service
    scrape_timeout: 4s    # Must be less than scrape_interval to avoid overlap
    static_configs:
      - targets:
          - 'forge-prod-app-01:8080'
          - 'forge-prod-app-02:8080'
        labels:
          service: 'order-service'   # Added to every time series from these targets
    relabel_configs:
      # Normalize the instance label to a stable name instead of host:port.
      # Prevents cardinality drift if the port changes or hostname format varies
      - source_labels: [__address__]
        regex: '(forge-prod-app-\d+):.*'
        target_label: instance
        replacement: '$1'

  - job_name: 'forge-payment-service'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['forge-payment:8080']
        labels:
          service: 'payment-service'

---
# rules/forge_alerts.yml
# Alert on P99 latency — not average. Average latency alerts miss tail latency problems.
groups:
  - name: forge.latency
    rules:
      - alert: ForgeCheckoutP99LatencyHigh
        expr: |
          histogram_quantile(
            0.99,
            rate(http_server_requests_seconds_bucket{
              job="forge-api-service",
              uri="/api/checkout"
            }[5m])
          ) > 2.0
        for: 3m   # Must be elevated for 3 minutes before firing — reduces noise
        labels:
          severity: critical
          team: platform
        annotations:
          summary: 'Checkout P99 latency exceeds 2s'
          description: |
            P99 latency for /api/checkout is {{ $value | humanizeDuration }}.
            Check Jaeger for slow traces on this endpoint.
            Dashboard: https://grafana.thecodeforge.io/d/checkout-slo

---
# rules/forge_records.yml
# Recording rules pre-aggregate expensive queries so dashboards load instantly.
# Without this, a 7-day P99 query runs at render time and takes 30+ seconds
groups:
  - name: forge.recordings
    rules:
      - record: job:http_request_duration_p99:rate5m
        expr: |
          histogram_quantile(
            0.99,
            sum by (job, le) (
              rate(http_server_requests_seconds_bucket[5m])
            )
          )
```
Expected state once Prometheus loads this configuration:

```text
Scraping forge-prod-app-01:8080 at /actuator/prometheus every 5s
Scraping forge-prod-app-02:8080 at /actuator/prometheus every 5s
Alert ForgeCheckoutP99LatencyHigh: inactive
Recording rule job:http_request_duration_p99:rate5m: active
```
Use histogram_quantile() in PromQL for latency metrics. Averages are misleading in right-skewed distributions, and your latency distributions are always right-skewed.

| Feature | Monitoring (Traditional) | Observability (Modern) |
|---|---|---|
| Primary Goal | Track health of predefined, known metrics — you decide in advance what to measure | Understand internal system state from external telemetry — including failure modes you did not anticipate |
| Approach | Symptom-based: Is it up? Is error rate below threshold? Did this counter cross a line? | Context-based: Why is this specific request slow? Which downstream service introduced the latency? What changed between yesterday and today? |
| Tooling | Nagios, basic dashboards, threshold alerts on raw metrics | Prometheus, OpenTelemetry, Jaeger or Tempo, Grafana with correlated trace/log/metric views |
| User Experience | Alerts when a predefined threshold is crossed — tells you something is wrong | Visualizes the complete lifecycle of a request — tells you what went wrong, where, and why |
| Data Style | Aggregated snapshots — you lose per-request detail in favor of efficient storage | High-fidelity traces and structured event logs — per-request context is preserved and queryable |
| Unknown Unknowns | Cannot detect failure modes you did not build a dashboard for | Rich telemetry allows ad-hoc queries during incidents — answering questions you did not anticipate |
| Incident Response | Tells you an SLA threshold was crossed — investigation starts from scratch | Provides a navigable path from alert → trace → log → root cause |
🎯 Key Takeaways
- Monitoring tells you when a system is failing; observability provides the diagnostic tools to understand why. You need both — monitoring without observability leaves you with an alarm but no investigation path; observability without monitoring leaves you with data but no trigger to look at it.
- The 3 Pillars — Metrics, Logs, and Traces — must be unified with a common Correlation ID (trace ID) to function as an observability system rather than three disconnected dashboards. The trace ID in your log lines is the most important field you can instrument.
- Adopt OpenTelemetry as your instrumentation standard to avoid vendor lock-in. Auto-instrumentation agents cover 90% of JVM instrumentation with zero code changes — the investment is in configuration and rollout, not in rewriting your application.
- Monitor the 4 Golden Signals — Latency, Traffic, Errors, and Saturation — as your primary alert targets for every service. System resource metrics (CPU, memory) are diagnostic signals, not primary alert targets.
- Averages hide tail latency in every right-skewed distribution, and your latency distributions are always right-skewed. Always instrument P95 and P99 as first-class metrics. Alert on P99 thresholds, not averages. Use histogram metrics and histogram_quantile() in PromQL — not summaries or gauges.
- Cardinality is the silent killer of Prometheus. One high-cardinality label (user_id, order_id, session_id) on a high-traffic metric can consume gigabytes of Prometheus memory within hours and cause an OOM restart loop. Review label sets in code review with the same rigor you apply to database index design.
- Log retention without a tiered strategy is an operational tax that compounds over time. Implement retention tiers by log level from day one: DEBUG disabled in production, INFO sampled at 10–20% for high-volume services, ERROR retained at full fidelity for 90 days. Archive cold logs to object storage rather than paying hot storage prices for data you query once a quarter.
- Synthetic monitoring verifies SLA uptime from backbone infrastructure. Real User Monitoring captures actual experience from real ISP connections and real devices. Use both. Only RUM detects regional ISP outages — synthetic probes bypass consumer ISP last-mile networks entirely.
Interview Questions on This Topic
- Q: You have a service where average latency is 100ms but P99 is 5 seconds. What does this tell you about the system, and how would you use observability to find the bottleneck? (Senior)
- Q: Explain the 'Unknown-Unknowns' concept in the context of observability. Give a concrete example of an issue monitoring would miss but observability would catch. (Senior)
- Q: Describe how a distributed trace propagates through a microservices architecture using HTTP headers. (Mid-level)
- Q: What is the trade-off between structured JSON logs and plain text logs in a high-scale production environment? (Mid-level)
- Q: How do you implement Synthetic Monitoring versus Real User Monitoring (RUM)? Which is better for identifying regional ISP outages? (Mid-level)
- Q: Scenario: A database query is intermittently slow. How would you instrument your Java code to capture the exact SQL, bind parameters, and context only when latency exceeds 500ms? (Senior)
Frequently Asked Questions
Can I have observability without monitoring?
Not in any practically useful sense. Monitoring is a subset of observability — you need the gauges and alerts that monitoring provides to know when to investigate, and you need the rich telemetry that observability provides to complete the investigation effectively.
Monitoring without observability tells you something broke but not why. You get alerts and green/red dashboards with no ability to drill into the cause. Observability without monitoring means you have logs, traces, and metrics available — but no automated trigger to tell you when something is wrong. You would have to manually query your observability platform to discover problems, which is not sustainable.
The practical baseline: implement monitoring first (alerts on the 4 Golden Signals) so you have a trigger. Then invest in observability (distributed tracing, structured logs, metric correlation) so you have a path from trigger to root cause. Neither replaces the other.
What are the four Golden Signals of SRE?
The four Golden Signals were defined by Google's SRE team as the minimum set of metrics to monitor for any production service. They are:
- Latency — the time it takes to serve a request. Measure P95 and P99, not average. Distinguish between successful request latency and failed request latency — a request that fails in 1ms looks fast in your average but indicates a different problem than a request that fails after 30s.
- Traffic — the demand on your system. Requests per second, messages per second, queries per second — whatever the natural unit of throughput is for your service. Traffic context makes latency and error rate changes meaningful: a 10% error rate at 100 RPS is very different from a 10% error rate at 10,000 RPS.
- Errors — the rate of failed requests. Distinguish explicit failures (HTTP 5xx, exception thrown) from implicit failures (HTTP 200 with an error payload, business logic failures that return success codes). Both matter; only explicit errors are automatically visible in HTTP-level metrics.
- Saturation — how 'full' your service is relative to its capacity. Thread pool utilization, connection pool utilization, queue depth, memory pressure. Saturation metrics predict approaching failure before it materializes as latency or error spikes — they are your leading indicator.
If you can only instrument four things for a new service, instrument these four.
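The four signals translate directly into PromQL. The sketches below assume Spring Boot with Micrometer default metric names (http_server_requests_seconds and the HikariCP connection-pool gauges) — adapt the names to your own instrumentation:

```promql
# Latency — true P99 over a 5-minute window (histogram metric required)
histogram_quantile(0.99, sum by (le) (rate(http_server_requests_seconds_bucket[5m])))

# Traffic — requests per second
sum(rate(http_server_requests_seconds_count[5m]))

# Errors — fraction of requests returning HTTP 5xx
sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
  / sum(rate(http_server_requests_seconds_count[5m]))

# Saturation — e.g. connection pool utilization (leading indicator of failure)
hikaricp_connections_active / hikaricp_connections_max
```

Note that the errors query above only captures explicit HTTP-level failures; business-logic failures returning 200 need their own counter, as discussed earlier.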
Why is high cardinality a problem in observability?
Cardinality in the context of Prometheus metrics refers to the number of unique time series created by a metric's label combinations. Prometheus stores one time series per unique combination of metric name and label values — in memory, in TSDB, and on disk.
If you add a user_id label to a metric on a system with 1 million users, Prometheus must track 1 million individual time series for that single metric name. Add two more labels with 10 values each, and you have 100 million time series. That is not a theoretical problem — it is a gigabyte-scale memory allocation that occurs within the first few hours of a production deployment and typically causes Prometheus to OOM-restart before anyone notices the root cause.
The structural fix: high-cardinality per-request data (user ID, order ID, session token) belongs in traces and logs, not in metric labels. Use exemplars to link a specific metric observation to a trace ID — Prometheus stores exemplars separately from the main TSDB, so they do not contribute to cardinality. In Grafana, exemplar markers render as clickable links directly to the corresponding trace in Tempo or Jaeger.
As a rule: before adding a new label to a metric in production, estimate the number of unique values that label can take. If the answer is 'unbounded' or 'one per user' or 'one per request,' the label does not belong on a metric.
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.