Monitoring vs Observability — P99 Blind Spot Disaster
1% of users saw hangs of up to 30 seconds while average latency stayed green at 120 ms, because no P99 metric was being tracked.
- Monitoring watches predefined metrics and alerts when thresholds are breached — it answers 'WHAT is broken'
- Observability lets you ask arbitrary questions about system state from external telemetry — it answers 'WHY it broke'
- The 3 pillars are Metrics (aggregated numeric signals), Logs (per-event detail), and Traces (request flow across services)
- Without a shared Correlation ID across all three pillars, debugging cross-service failures is manual timestamp archaeology
- Prometheus uses pull-based scraping — your app exposes /metrics, Prometheus fetches on a fixed interval
- High-cardinality labels (user IDs, request IDs) used as metric tags create one time series per unique value and can crash your monitoring storage
- Monitoring is a subset of observability — you need both, and they serve different cognitive roles during an incident
Imagine your car has a dashboard with a fuel gauge, temperature dial, and oil light. That's monitoring — you set up specific gauges in advance and watch them. Observability is like having a mechanic who can plug a laptop into your car's diagnostic port and ask ANY question about what's happening inside the engine, even questions you never thought to ask before the trip started. Monitoring tells you THAT something is wrong. Observability helps you figure out WHY — and more importantly, it lets you answer questions you didn't know you needed to ask until the car was already on fire.
Your production system crashes at 2 AM on Black Friday. Orders are failing, users are screaming on Twitter, and your on-call engineer is staring at a wall of dashboards wondering where to even begin. This scenario plays out every day at companies around the world — and the difference between a 10-minute fix and a 4-hour outage almost always comes down to one thing: how well-instrumented your system was before the incident.
Monitoring and observability are not luxury features you bolt on after launch. They are the engineering discipline that separates teams who fix incidents in minutes from teams who spend hours in Slack threads and frantic Zoom calls pointing fingers at each other's services. While monitoring is the act of watching known-knowns through dashboards and alerts, observability is the property of a system that allows you to understand its internal state from the external data it emits — logs, metrics, and traces. The operative word is property. Observability is not a tool you buy. It is a quality you engineer into your system from the start.
The critical misconception I see constantly at staff level: treating monitoring and observability as synonyms. They are not. Monitoring is a subset of observability. You can monitor without observability — you get dashboards with no context, alerts that fire with no actionable detail, and on-call engineers who solve incidents through gut instinct and luck. You cannot have meaningful observability without some form of monitoring — you still need alerts to know when to look at anything. The two are complementary, not competing. Get both right and your MTTR drops from hours to minutes. Get only one right and you are still flying partially blind.
The Three Pillars: Metrics, Logs, and Traces
True observability requires all three types of telemetry working together, not independently. Metrics provide a high-level, aggregated view of system health — request rate, error rate, CPU saturation. Logs provide the per-event forensic record of exactly what happened at a specific moment in time. Distributed traces connect everything by showing the complete path of a single request across every service it touched, with timing for each hop.
Each pillar serves a distinct cognitive role in an incident, and understanding those roles prevents the common mistake of using one pillar for a job another does better. Metrics are your early warning system: they tell you THAT something changed, and they tell you fast. With a short scrape interval and evaluation window, a metric alert can fire within about 30 seconds of a threshold being crossed. Logs are your forensic evidence: they tell you exactly WHAT happened at the level of a specific request or event, but querying them at scale is expensive and slow compared to a metric query. Traces are your request GPS: they tell you WHERE a request traveled and HOW LONG each service spent processing it. The power of the three-pillar model is not in any single pillar; it is in the correlation between them.
The correlation ID is the connective tissue. When a P99 latency alert fires (metric), you need to jump directly to the traces for the slow requests and then to the logs for the specific error messages those traces contain. If the trace ID is not present in your logs, that jump requires manual timestamp correlation — which is slow, error-prone, and genuinely miserable to do at 2 AM during an active incident. Instrument for correlation from day one, not as a retrofit after the first major outage.
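As a minimal sketch of instrumenting for correlation, the snippet below stamps every log line with the current request's trace ID using only the Python standard library. The names `trace_id_var`, `JsonFormatter`, `build_logger`, and `handle_request`, and the `checkout` logger, are illustrative assumptions rather than part of any tracing library; a real service would normally take the ID from its tracing SDK or an incoming request header.

```python
import contextvars
import json
import logging
import sys
import uuid

# Holds the trace ID of the request currently being handled.
# "trace_id_var" is a hypothetical name used only for this sketch.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object that always carries trace_id."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "trace_id": trace_id_var.get(),
        })


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes structured JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("checkout")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger: logging.Logger, incoming_trace_id=None) -> str:
    """Reuse the caller's trace ID if one was passed, otherwise start a new one."""
    trace_id = incoming_trace_id or uuid.uuid4().hex
    token = trace_id_var.set(trace_id)
    try:
        logger.info("charging card")
        logger.warning("payment provider slow")
    finally:
        # Restore the previous value so IDs never leak between requests.
        trace_id_var.reset(token)
    return trace_id


# Both log lines printed here carry the same trace_id field.
handle_request(build_logger(sys.stdout), incoming_trace_id="abc123")
```

Because the ID lives in a context variable, code deeper in the call stack logs it without having to pass it as an argument, which is what makes the jump from a trace to its log lines a single query.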
One more thing that gets underweighted: the 'unknown unknowns' problem. Metrics cover what you thought to instrument. Logs cover what you thought to log. Traces cover what you thought to trace. Observability is the property that lets you ask questions you did not anticipate when you wrote the code — because the raw telemetry is rich enough to answer novel queries. A system where you can only ask questions you pre-configured dashboards for is a monitored system. A system where you can compose a new query during an incident and get a meaningful answer is an observable system.
- Metrics are cheap to collect and store — start here for every new service. They are your smoke alarm: they tell you something changed, fast, without requiring you to read individual events.
- Logs are expensive at scale but irreplaceable for root cause analysis. Use structured JSON logs from day one — if your logs are not queryable by trace ID in under one second, you do not have observability, you have a text file archive.
- Traces bridge the gap — they connect a metric anomaly (P99 spiked) to the specific service and span (payment-service.charge() took 4.8s) without manual timestamp correlation across log files.
- The trace ID in your logs is the most important field you can add. It is the hyperlink between a log line and the full request journey. Without it, cross-service debugging during an incident is archaeological work.
- The unknown unknowns problem is only solvable when all three pillars are present and correlated. Predefined dashboards cover what you anticipated. Rich correlated telemetry lets you ask questions you did not anticipate — which is exactly what novel failure modes require.
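To illustrate how a trace attributes latency to one hop, here is a small sketch that finds the span with the most "self time" (its own duration minus the time spent in its direct children). The `Span` type, the function name, and the sample spans are hypothetical; the sketch also assumes the children of a span do not overlap in time, which real traces with parallel calls may violate.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def slowest_by_self_time(spans):
    """Return (name, self_time_ms) for the span that spent the most time itself.

    Self time = span duration minus the durations of its direct children.
    Assumes sibling spans do not overlap (no parallel child calls).
    """
    child_time = defaultdict(float)
    for span in spans:
        if span.parent_id is not None:
            child_time[span.parent_id] += span.duration_ms
    slowest = max(spans, key=lambda s: s.duration_ms - child_time[s.span_id])
    return slowest.name, slowest.duration_ms - child_time[slowest.span_id]


# Made-up trace for one slow checkout request.
trace = [
    Span("a", None, "api-gateway.POST /checkout", 0.0, 5000.0),
    Span("b", "a", "order-service.create", 10.0, 4990.0),
    Span("d", "b", "inventory-service.reserve", 20.0, 90.0),
    Span("c", "b", "payment-service.charge", 100.0, 4900.0),
]

name, self_ms = slowest_by_self_time(trace)
print(f"{name} spent {self_ms / 1000:.1f}s of its own time")
```

Ranking by self time rather than raw duration matters because the root span always has the longest duration; subtracting child time is what points at the hop that was actually slow.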
Observability in Practice: Prometheus and Grafana
Modern observability stacks typically center on a pull-based collection model, and Prometheus is the reference implementation of that model in the open-source world. Instead of your application pushing metrics to a central store on a schedule it controls, Prometheus scrapes a /metrics endpoint on your service at an interval Prometheus controls. Your application is passive — it just maintains counters and histograms in memory and serves them when asked.
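A rough sketch of the application side of the pull model, using only the Python standard library: the process keeps a counter in memory and renders it in the Prometheus text exposition format when /metrics is requested. The `Counter` class, `MetricsHandler`, `serve`, and the port are assumptions made for illustration; in practice you would normally use an official Prometheus client library instead of hand-rolling this.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline, as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


class Counter:
    """An in-memory, thread-safe counter keyed by label values (illustrative)."""

    def __init__(self, name: str, help_text: str):
        self.name = name
        self.help_text = help_text
        self._values = {}  # sorted label tuple -> count
        self._lock = threading.Lock()

    def inc(self, **labels: str) -> None:
        key = tuple(sorted(labels.items()))
        with self._lock:
            self._values[key] = self._values.get(key, 0) + 1

    def render(self) -> str:
        """Render the counter in the Prometheus text exposition format."""
        lines = [f"# HELP {self.name} {self.help_text}",
                 f"# TYPE {self.name} counter"]
        with self._lock:
            for key, value in sorted(self._values.items()):
                labels = ",".join(f'{k}="{escape_label_value(v)}"' for k, v in key)
                series = f"{self.name}{{{labels}}}" if labels else self.name
                lines.append(f"{series} {value}")
        return "\n".join(lines) + "\n"


REQUESTS = Counter("http_requests_total", "Total HTTP requests handled.")


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = REQUESTS.render().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block forever, answering scrapes. Port 8000 is an arbitrary choice."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint; the application never decides when data leaves the process, it only answers when the scraper asks.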
The pull model has a meaningful architectural advantage: a misbehaving application cannot overwhelm the monitoring system with a flood of metric writes. If your application starts emitting a metric push for every debug event, a push-based system has to absorb the explosion and may fall over. Prometheus, by contrast, simply scrapes at its configured interval regardless of what the application is doing. The trade-off is the opposite failure mode: if your application crashes between scrapes, you lose the data from that window. Short-lived jobs and batch processes that finish in under one scrape interval can push their final metrics to the Pushgateway, which holds them so that Prometheus can scrape them on its normal cycle.
Prometheus stores metrics as time series: a metric name, a set of key-value labels, and a sequence of timestamped float64 values. The label set is everything: it is how you slice a metric by service, region, endpoint, or status code. But the label set is also where most production Prometheus problems originate. Every unique combination of label values creates a separate time series in the TSDB. A metric with three labels, each with ten possible values, can create up to 10 × 10 × 10 = 1,000 time series. Add a fourth label with a thousand possible values and the ceiling becomes a million time series for a single metric name. This is the cardinality problem, and it is not theoretical: I have watched it take down a Prometheus instance in production within 20 minutes of a bad deployment that added a user_id label to a high-traffic metric.
PromQL is where the real diagnostic power lives. PromQL is not just a query language for fetching raw data; it is a functional language for computing derived signals from time series. The most important function in incident response is histogram_quantile(), which estimates a percentile from the bucket counts of a histogram metric. histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) gives you the estimated P99 latency over a 5-minute window: the latency that the slowest 1% of requests exceeded. The estimate is interpolated within a bucket, so its accuracy depends on how well the bucket boundaries match your latency range. No dashboard based on averages can show you this. Building this query and putting it on your primary service dashboard is a one-time 10-minute investment that has caught more production incidents for me than any other single piece of instrumentation.
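To make the percentile calculation less of a black box, here is a simplified Python sketch of the linear interpolation that Prometheus documents for classic histograms. It is not the Prometheus source code; `estimate_quantile` and the bucket counts are made up for illustration, and edge cases such as non-monotonic buckets are ignored.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets is a list of (upper_bound, cumulative_count) pairs sorted by
    upper_bound and ending with the +Inf bucket, like the `le` series of a
    classic Prometheus histogram.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if len(buckets) < 2 or buckets[-1][0] != math.inf:
        raise ValueError("need at least one finite bucket plus a +Inf bucket")

    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations, nothing to estimate

    rank = q * total  # number of observations at or below the quantile
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]

    if upper == math.inf:
        # The quantile falls in the unbounded bucket, so the best available
        # answer is the highest finite bucket boundary.
        return buckets[-2][0]

    # The lowest bucket is assumed to start at 0.
    lower, below = (0.0, 0) if index == 0 else buckets[index - 1]
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


# Hypothetical cumulative counts for bucket bounds of 0.1s, 0.5s, 1s, 5s, +Inf.
buckets = [(0.1, 9500), (0.5, 9800), (1.0, 9850), (5.0, 9950), (math.inf, 10000)]
print(f"estimated P99: {estimate_quantile(0.99, buckets):.2f}s")
```

With these counts the P99 rank lands halfway through the 1s to 5s bucket, so the estimate comes out at about 3 seconds even if every slow request really took 4.8 seconds. That is why bucket boundaries should be chosen around the latencies you care about.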
- Never include unique per-request values as metric labels. User IDs, order IDs, session tokens, email addresses — any label whose value changes per request will create one time series per unique value, per metric. On a system with 1 million users, a single user_id label creates 1 million time series for one metric name.
- Prometheus TSDB stores all active time series in memory. A cardinality explosion is a memory explosion. The symptoms are: Prometheus memory climbs continuously, queries slow, and eventually the process OOMs and restarts. The cycle repeats because the high-cardinality data is already on disk.
- The fix is not to increase Prometheus memory. The fix is to remove the high-cardinality label from the metric definition. Use exemplars to link a specific metric observation to a trace ID without creating a new time series per request.
- Rule of thumb: if a label value is unique per request or per user, it does not belong on a metric. Put it in a log field or a trace span attribute instead.
- Before deploying a new metric, count the expected unique label combinations. Multiply across all label dimensions. If that number exceeds 10,000 for a single metric, redesign the label set.
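The pre-deployment check in the last bullet is simple multiplication, sketched below. The function names, the label names, and the cardinalities are hypothetical; the 10,000 budget is the rule of thumb from the list above, and the result is an upper bound because not every label combination necessarily occurs in practice.

```python
import math


def max_series(label_cardinalities: dict) -> int:
    """Upper bound on time series for one metric: product of label cardinalities."""
    return math.prod(label_cardinalities.values())


def review_metric(name: str, label_cardinalities: dict, budget: int = 10_000) -> str:
    """Return a one-line verdict, naming the widest label if over budget."""
    total = max_series(label_cardinalities)
    if total <= budget:
        return f"{name}: up to {total:,} series, within budget"
    worst = max(label_cardinalities, key=label_cardinalities.get)
    return f"{name}: up to {total:,} series, over budget; widest label is '{worst}'"


# Hypothetical label sets for the same metric name.
safe = {"service": 10, "region": 10, "status_code": 10}
risky = {**safe, "endpoint": 1000}

print(review_metric("http_requests_total", safe))
print(review_metric("http_requests_total", risky))
```

Running a check like this in code review is cheap, and it names the label to remove before the metric ever reaches production.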
Use histogram_quantile() in PromQL for latency metrics. Averages are misleading in right-skewed distributions, and your latency distributions are always right-skewed.
Black Friday P99 Latency Explosion: Monitoring Showed Green, Observability Found the Root Cause
- Averages are a lie in skewed distributions. A single slow replica behind a load balancer is invisible in average latency unless you are actively tracking percentiles. Always monitor P95 and P99 as first-class metrics, not afterthoughts.
- If your alerting only covers averages and error rates, you are systematically blind to the failures that affect your highest-value users — the ones completing large orders, the enterprise accounts, the users on slow connections who already have the least tolerance for performance degradation.
- Distributed tracing is the only reliable mechanism to attribute tail latency to a specific downstream component. Without it, you are correlating timestamps across log files by hand during the worst moments of an incident.
- Load balancer health checks that only test TCP connectivity are not health checks. They are port checks. If a replica is accepting connections but taking 25 seconds to respond, a TCP health check will mark it as healthy every single time. Health checks must test actual application response time.
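To make the first lesson in this list concrete, the sketch below compares the mean with the P99 for a synthetic, right-skewed latency sample. The numbers are invented (98% of requests at 100 ms, 2% at 5 s), and the 0.5 s alert threshold is a hypothetical value chosen only for the comparison.

```python
import math
import statistics


def percentile(values, p: int):
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Synthetic latencies in seconds: 980 fast requests and 20 very slow ones.
latencies = [0.1] * 980 + [5.0] * 20

ALERT_THRESHOLD_S = 0.5  # hypothetical threshold on the dashboard

mean = statistics.fmean(latencies)
p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)

print(f"mean={mean:.3f}s p50={p50}s p99={p99}s")
print("average-based alert fires:", mean > ALERT_THRESHOLD_S)
print("P99-based alert fires:", p99 > ALERT_THRESHOLD_S)
```

The mean lands just under 0.2 s, so an average-based alert stays quiet, while the P99 sits at the full 5 seconds that the slowest users actually experienced.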
Key takeaways
- Use histogram_quantile() in PromQL for latency metrics.
Common mistakes to avoid
- Dashboard overload: 50 graphs with no hierarchy, all treated as equally important
- Missing the trace: metrics and logs implemented, distributed tracing skipped as 'too complex'
- Measuring system resources instead of business outcomes: optimizing for what is easy to measure, not what matters
- Using averages as the primary latency metric in dashboards and alerts. histogram_quantile() can estimate percentiles from the raw bucket data.
- No retention or sampling strategy for logs: treating all logs as equally valuable
Interview Questions on This Topic
You have a service where average latency is 100ms but P99 is 5 seconds. What does this tell you about the system, and how would you use observability to find the bottleneck?
histogram_quantile(0.99, rate(call_duration_seconds_bucket{service="payment"}[5m])) gives me per-dependency P99 values. That query alone usually points at the culprit within a few minutes of looking.
Frequently Asked Questions