CloudWatch TreatMissingData — Why Silence Won't Page
With the default TreatMissingData setting, an alarm never reaches ALARM just because its metric stopped arriving.
20+ years shipping production infrastructure and CI/CD at scale. Notes here come from systems that actually shipped.
- Metrics: time-stamped numbers from every AWS service. CPU, errors, latency. Free for basic. Custom metrics: $0.30/month each.
- Alarms: connect metrics to actions (page SNS, scale ASG). Use 3 evaluation periods + 2 datapoints to alarm — kills false alerts.
- Logs Insights: query all logs in seconds. Fields, stats, percentiles, filter. Pay per GB scanned (~$0.005).
- Metric Filters: run continuously, turn log patterns into metrics you can alarm on. Costs $0.30 per metric.
- Production rule: CPU at 80% for 2 of 3 periods = page. 1 spike = nothing. TreatMissingData = breaching for heartbeat metrics.
- Cost trap: publishing metrics with userId dimension. Each userId = separate billable metric. Use low-cardinality only.
Imagine your house has a smart thermostat, a smoke detector, and a security camera — all sending alerts to your phone when something goes wrong. AWS CloudWatch is exactly that, but for your cloud infrastructure. It watches your servers, databases, and apps 24/7, records everything they do, and pages you the moment something looks off. You don't have to sit staring at a screen — CloudWatch does the watching so you can focus on building.
Every production system breaks. The difference between a 5-minute outage and a 5-hour disaster? Almost always the same thing: how fast you knew something was wrong.
AWS CloudWatch is the native observability service at the heart of every serious AWS deployment. It's not glamorous, but skipping it is like flying a plane with no instruments. Fine until it isn't.
This article covers metrics, alarms, logs, and dashboards. How they connect. How to set them up with CLI and CloudFormation. And the common mistakes engineers make with CloudWatch, including the ones that page them at 3 AM for no reason.
Why CloudWatch TreatMissingData Is the Difference Between a Pager and a Silent Night
CloudWatch TreatMissingData is a per-alarm setting that defines how an alarm behaves when a metric stops reporting. The default, 'missing', does not count an absent datapoint as either breaching or healthy: the alarm is evaluated on whatever datapoints did arrive, and if every datapoint in the evaluation range is missing it moves to INSUFFICIENT_DATA rather than ALARM. A metric that goes silent therefore never triggers the alarm's ALARM actions, which is almost never what you want. The core mechanic is simple: you choose one of four policies (missing, notBreaching, breaching, or ignore) and CloudWatch applies that policy to missing datapoints during alarm evaluation.
In practice, TreatMissingData matters because it decides what metric absence does to alarm state. If you set it to 'breaching', every missing datapoint counts as a breach, so the alarm fires once enough periods have gone silent to satisfy DatapointsToAlarm; that is what you want for heartbeat metrics. If you set it to 'notBreaching', missing datapoints count as healthy and the alarm stays OK even if data stops, which is dangerous for critical metrics like CPU utilization or request latency. The 'ignore' option keeps the alarm in whatever state it was already in, so a silent metric freezes the alarm instead of changing it. The key property: TreatMissingData only applies when datapoints are absent, not when a reported value is out of range.
Set TreatMissingData explicitly when you need to guarantee that a silent metric pages. The canonical example is a dead EC2 instance: if the instance stops reporting CPU metrics, you want the alarm to fire, not sit quietly. With the default, the alarm drifts to INSUFFICIENT_DATA, and unless you have wired an action to that state nobody is notified. In production, set TreatMissingData to 'breaching' for any alarm where silence means failure; otherwise, silence is treated as health.
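The four policies can be sketched as a small simulation. This is a simplified model under stated assumptions, not CloudWatch's actual evaluator: the function name evaluate_alarm is hypothetical, the comparison is fixed to "greater than threshold", and only the last few periods are considered (the real service looks at a wider evaluation range). It is most faithful for the case this article cares about, a window in which every datapoint is missing.

```python
from typing import Optional, Sequence

POLICIES = {"missing", "notBreaching", "breaching", "ignore"}


def evaluate_alarm(
    datapoints: Sequence[Optional[float]],
    threshold: float,
    datapoints_to_alarm: int,
    treat_missing_data: str = "missing",
    current_state: str = "OK",
) -> str:
    """Return the alarm state for one evaluation window.

    None marks a period in which no datapoint arrived.
    Simplified model: partially missing windows are evaluated on the
    datapoints that did arrive, which glosses over details of the real service.
    """
    if treat_missing_data not in POLICIES:
        raise ValueError(f"unknown policy: {treat_missing_data}")

    present = [v for v in datapoints if v is not None]
    missing_count = len(datapoints) - len(present)

    if not present:
        # The whole window is silent.
        if treat_missing_data == "missing":
            return "INSUFFICIENT_DATA"
        if treat_missing_data == "ignore":
            return current_state  # state is frozen

    breaching = sum(1 for v in present if v > threshold)
    if treat_missing_data == "breaching":
        breaching += missing_count  # silence counts against you
    # notBreaching: missing datapoints count as healthy, so nothing is added.
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


silent = [None, None, None]
for policy in ("missing", "notBreaching", "breaching", "ignore"):
    print(policy, "->", evaluate_alarm(silent, 80.0, 2, policy))
```

Only one of the four policies turns a fully silent window into ALARM, which is the whole argument for setting it deliberately.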
CloudWatch Metrics: The Heartbeat of Your Infrastructure
A metric is just a time-stamped number with a name and a namespace. That's it. EC2 sends CloudWatch a CPUUtilization number every five minutes by default, or every minute with detailed monitoring enabled. RDS sends DatabaseConnections. Lambda sends Duration and Errors. These numbers stream in automatically; you don't write a single line of code to get them.
What makes metrics powerful is the dimension system. A dimension is a key-value pair that narrows down which resource a metric belongs to. For example, CPUUtilization by itself tells you nothing. CPUUtilization where InstanceId=i-0abc123 tells you exactly which server is melting. You can also publish your own custom metrics — think order count per minute, active WebSocket connections, or queue depth in your own application.
Metrics are stored in CloudWatch for up to 15 months, but the resolution degrades over time: sub-minute datapoints (which only exist for high-resolution custom metrics) are kept for 3 hours, 1-minute datapoints for 15 days, 5-minute datapoints for 63 days, and 1-hour datapoints for 15 months. This matters when you're debugging an incident from three weeks ago: you'll only have 5-minute aggregates, not minute-by-minute data.
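The retention schedule can be written down as a lookup, which makes it easy to check what granularity is still available for a given incident. A minimal sketch: the function name finest_period_seconds is hypothetical, and 15 months is taken as 455 days.

```python
from datetime import timedelta
from typing import Optional

# (maximum age of the data, finest period in seconds still retained)
RETENTION_TIERS = [
    (timedelta(hours=3), 1),      # sub-minute data, high-resolution metrics only
    (timedelta(days=15), 60),
    (timedelta(days=63), 300),
    (timedelta(days=455), 3600),  # roughly 15 months
]


def finest_period_seconds(age: timedelta, high_resolution: bool = False) -> Optional[int]:
    """Return the smallest period (in seconds) retained for data of this age."""
    for max_age, period in RETENTION_TIERS:
        if period < 60 and not high_resolution:
            continue  # standard metrics never had sub-minute data
        if age <= max_age:
            return period
    return None  # older than the retention window: the data is gone


print(finest_period_seconds(timedelta(days=21)))  # incident from three weeks ago
```

For the three-week-old incident this returns 300, so any alarm or graph you reconstruct after the fact is working from 5-minute aggregates.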
Knowing the retention and resolution schedule helps you set the right alarm evaluation periods and avoid false conclusions from aggregated data.
Custom metric cost trap: AWS charges $0.30 per custom metric per month. If you publish with put-metric-data --dimensions userId=u-12345, each userId creates a separate billable metric. With 10,000 active users, that's $3,000 per month. Always use low-cardinality dimensions: Environment (prod/staging), Region, ServiceName, InstanceType. Never put unique identifiers in dimensions.
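The arithmetic is worth running before you add a dimension. A rough sketch, assuming the $0.30 figure quoted above (check current pricing, which is tiered) and assuming the worst case where every combination of dimension values actually gets published; the function name is hypothetical.

```python
from typing import Mapping

PRICE_PER_METRIC_CENTS = 30  # $0.30 per custom metric per month, as quoted above


def monthly_custom_metric_cost(distinct_values: Mapping[str, int]) -> float:
    """Worst-case monthly cost in USD for one metric name.

    distinct_values maps each dimension name to how many distinct values it takes.
    Every unique combination of dimension values is billed as its own metric.
    """
    combinations = 1
    for count in distinct_values.values():
        combinations *= count
    return combinations * PRICE_PER_METRIC_CENTS / 100


print(monthly_custom_metric_cost({"userId": 10_000}))
print(monthly_custom_metric_cost({"Environment": 2, "Region": 3, "ServiceName": 5}))
```

The first call reproduces the $3,000 figure; the second shows three low-cardinality dimensions topping out at 30 metrics, or $9 a month.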
CloudWatch Alarms: Turning Numbers Into Actions
A metric sitting in CloudWatch does nothing on its own. An alarm is what connects a metric to a response. You define a threshold, a comparison operator, and an evaluation period — and CloudWatch will flip the alarm state from OK to ALARM the moment that condition is met.
An alarm has exactly three states: OK (everything is fine), ALARM (threshold breached), and INSUFFICIENT_DATA (not enough data points have arrived yet, which happens right after you create an alarm or if the metric stops publishing). Understanding INSUFFICIENT_DATA is important — it's not the same as OK, and treating it that way is a common mistake.
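One way to see why INSUFFICIENT_DATA is not a page: each state has its own action list, and most alarms only fill in the one for ALARM. The sketch below uses the real parameter names AlarmActions, OKActions, and InsufficientDataActions; the helper function and the SNS topic ARN are hypothetical.

```python
from typing import Dict, List

# Which action list CloudWatch consults when an alarm enters each state.
ACTION_KEY_FOR_STATE = {
    "ALARM": "AlarmActions",
    "OK": "OKActions",
    "INSUFFICIENT_DATA": "InsufficientDataActions",
}


def actions_for_state(alarm: Dict[str, List[str]], new_state: str) -> List[str]:
    """Return the action ARNs that run when the alarm transitions to new_state."""
    return list(alarm.get(ACTION_KEY_FOR_STATE[new_state], []))


# A typical alarm: only the ALARM transition is wired to the pager topic.
typical_alarm = {
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:oncall-pager"],  # hypothetical ARN
}

print(actions_for_state(typical_alarm, "ALARM"))
print(actions_for_state(typical_alarm, "INSUFFICIENT_DATA"))
```

The second call returns an empty list: the alarm changed state, and nobody was told.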
Alarms can trigger three types of actions: SNS notifications (email, SMS, PagerDuty webhook), EC2 actions (stop, terminate, reboot, or recover an instance), and Auto Scaling actions (scale in or scale out). This is where CloudWatch becomes genuinely powerful — you can build self-healing infrastructure where an alarm automatically replaces a failing instance without any human involvement.
For composite alarms, you can combine multiple alarms with AND/OR logic. This lets you avoid alert fatigue by only paging someone when CPU is high AND error rate is also high — not when just one or the other spikes briefly.
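A composite alarm is defined by a rule expression over the states of other alarms. As an illustration, the helper below builds an AND rule; the helper name, alarm names, and topic ARN are hypothetical, while the ALARM("name") syntax and the AlarmName, AlarmRule, and AlarmActions parameters are the ones the composite alarm API expects.

```python
def all_in_alarm(*alarm_names: str) -> str:
    """Build a composite rule that is true only when every named alarm is in ALARM."""
    if not alarm_names:
        raise ValueError("at least one alarm name is required")
    return " AND ".join(f'ALARM("{name}")' for name in alarm_names)


composite_alarm_params = {
    "AlarmName": "checkout-degraded",  # hypothetical
    "AlarmRule": all_in_alarm("checkout-cpu-high", "checkout-errors-high"),
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:oncall-pager"],  # hypothetical
}

print(composite_alarm_params["AlarmRule"])
# With boto3 installed and credentials configured, this dict could be passed as:
#   boto3.client("cloudwatch").put_composite_alarm(**composite_alarm_params)
```

Only the composite alarm carries the paging action; the two child alarms can stay action-free so a brief spike in either one alone stays quiet.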
The DatapointsToAlarm setting is your best defense against alert fatigue. If you set EvaluationPeriods=3 and DatapointsToAlarm=2, the alarm only fires if 2 out of 3 consecutive evaluation windows breach the threshold. A single 1-minute spike won't wake you at 3 AM. A sustained 3-minute problem will.
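To see the 2-of-3 rule in action, the sketch below slides a window over a series of one-minute datapoints. It is a toy model that assumes a "greater than" comparison and no missing data; the function name is hypothetical.

```python
from typing import List, Sequence


def alarm_states(
    values: Sequence[float],
    threshold: float,
    evaluation_periods: int = 3,
    datapoints_to_alarm: int = 2,
) -> List[str]:
    """Return the alarm state after each datapoint using M-of-N evaluation."""
    states = []
    for i in range(len(values)):
        # The most recent `evaluation_periods` datapoints, including this one.
        window = values[max(0, i - evaluation_periods + 1): i + 1]
        breaches = sum(1 for v in window if v > threshold)
        states.append("ALARM" if breaches >= datapoints_to_alarm else "OK")
    return states


single_spike = [50, 95, 50, 50]
sustained = [50, 95, 95, 95, 50, 50]

print(alarm_states(single_spike, 80))
print(alarm_states(sustained, 80))
```

The single spike never leaves OK. The sustained run enters ALARM on its second breaching minute and clears once only one breach remains in the window.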
TreatMissingData defaults to missing, which means the alarm never moves to ALARM just because no data arrives. For error-count metrics this is fine, but for availability metrics (like Lambda Invocations, HealthyHostCount, heartbeat metrics), missing data should be treated as breaching, because if the metric stops publishing, something is very wrong. Pick the right value for each alarm, not the default.
CloudWatch Logs: Centralising and Querying Application Output
CloudWatch Logs is where your application output lives. The hierarchy works like this: a Log Group is the top-level container (one per application or service), Log Streams are individual sources within that group (one per Lambda execution environment, one per EC2 instance, one per container), and Log Events are the individual timestamped lines inside each stream.
Logs arrive in CloudWatch either automatically (Lambda, ECS, and CloudTrail do this natively) or via the CloudWatch Logs Agent or the newer CloudWatch Agent installed on EC2 instances.
The real power is CloudWatch Logs Insights, a query language that lets you search and aggregate across gigabytes of logs in seconds. It's not SQL, but it's close enough that you'll feel at home immediately. You can filter for errors, extract fields from structured JSON logs, calculate percentiles, and visualise results as time-series charts.
Metric filters are another killer feature: they scan incoming log lines for a pattern and convert matches into CloudWatch metrics. This means you can turn a log line like ERROR: payment gateway timeout into an incrementing metric — and then alarm on that metric. No log aggregation pipeline, no third-party tool, no extra cost beyond the metric itself.
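Conceptually, a metric filter does something like the following to every incoming log event. This is a local approximation only: real filter patterns have their own syntax, and the plain substring match and function name here are assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, Iterable, Tuple


def count_matches_per_minute(
    events: Iterable[Tuple[datetime, str]], term: str
) -> Dict[datetime, int]:
    """Turn matching log lines into a per-minute count, like a metric filter would."""
    counts: Counter = Counter()
    for timestamp, message in events:
        if term in message:
            minute = timestamp.replace(second=0, microsecond=0)
            counts[minute] += 1
    return dict(counts)


events = [
    (datetime(2024, 1, 1, 3, 0, 5), "ERROR: payment gateway timeout"),
    (datetime(2024, 1, 1, 3, 0, 40), "ERROR: payment gateway timeout"),
    (datetime(2024, 1, 1, 3, 1, 2), "INFO: order placed"),
]

print(count_matches_per_minute(events, "ERROR"))
```

Note what happens in the minute with no matches: there is no entry at all, not a zero. A metric fed this way goes missing when the pattern stops appearing, which is exactly the situation TreatMissingData governs.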
Logs Insights cost warning: You're charged per GB of data scanned ($0.005 per GB). A poorly written query that scans 100 GB costs $0.50. That's cheap for incident debugging. But run that same query every minute as a dashboard widget and it costs $720 per day (1,440 runs at $0.50), or about $21,600 over a 30-day month. Use Logs Insights for ad-hoc debugging only. For continuous metrics, use Metric Filters.
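Working the quoted price through makes the trap concrete. A small sketch, assuming the $0.005 per GB figure above and a query that scans the same volume on every run; the function name is hypothetical, and Decimal is used to keep the cents exact.

```python
from decimal import Decimal

PRICE_PER_GB_USD = Decimal("0.005")  # figure quoted above; check current pricing


def query_cost_usd(gb_scanned: int, runs: int = 1) -> Decimal:
    """Cost of running a query `runs` times, scanning gb_scanned GB each time."""
    return PRICE_PER_GB_USD * gb_scanned * runs


RUNS_PER_DAY = 24 * 60  # a widget refreshing once a minute

print("one ad-hoc run:", query_cost_usd(100))
print("per day:", query_cost_usd(100, RUNS_PER_DAY))
print("per 30-day month:", query_cost_usd(100, RUNS_PER_DAY * 30))
```

A single run of a 100 GB query costs fifty cents. The same query on a one-minute refresh runs 1,440 times a day, which is $720 daily and $21,600 over 30 days.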
Putting It Together: Dashboards and a Real Monitoring Architecture
A CloudWatch Dashboard is a customisable canvas where you pin metrics graphs, alarm states, and log query results side by side. The point isn't just pretty charts — it's reducing the time-to-understanding during an incident. When your on-call engineer gets paged at 2am, the first thing they open should be a dashboard that answers: is this an app problem, a database problem, or a network problem?
Good dashboard design follows the RED method: Rate (requests per second), Errors (error rate), and Duration (latency percentiles). Put those three graphs at the top. Below them, add the saturation metrics — CPU, memory, DB connections. At the bottom, link to Logs Insights queries for the most common failure modes.
Here's the architecture pattern that works in production: CloudWatch receives metrics and logs from all your services automatically. Alarms on the most critical thresholds fire to an SNS topic. That SNS topic routes to PagerDuty (or OpsGenie, or just email if you're early-stage). The on-call engineer opens the service dashboard, runs a Logs Insights query to get the stack trace, fixes the issue, and the alarm's OKActions auto-resolve the incident. Everything is connected, traceable, and auditable.
This closed loop — metric to alarm to notification to dashboard to logs — is the entire CloudWatch mental model. Once you've internalised it, everything else is just configuration.
CloudWatch Agent: The Bridge Between Your Servers and Observability
OS-level metrics and logs don't magically appear. You need the CloudWatch Agent. It's a daemon you install on EC2 or on-prem servers. It collects system-level metrics like memory, disk, and swap: stuff EC2 doesn't give you by default. It also pushes application logs. The SSM Agent? That's for Parameter Store and commands. The CloudWatch Agent is for data. Install it. Configure it. Stop guessing why your CPU is pegged but memory looks fine.
The WHY: Without the agent, you're blind to OS-level metrics. Your alarms fire late or never. Your logs sit on disks you can't query.
The HOW: Drop the agent configuration JSON in Parameter Store and bootstrap it via user data. The agent picks up permissions from the instance's IAM role. Use the unified CloudWatch agent; it replaced the older CloudWatch Logs agent. Check that it is running with sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a status. If it's not running, neither are your alerts.
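For orientation, here is roughly what a minimal agent configuration looks like, generated from Python so it can be dropped into Parameter Store as a string. Treat it as a starting sketch: the builder function, log file path, and log group name are hypothetical, and a real deployment would likely add more measurements and dimensions.

```python
import json


def build_agent_config(log_file: str, log_group: str, interval_seconds: int = 60) -> str:
    """Return a minimal CloudWatch agent configuration as a JSON string."""
    config = {
        "agent": {"metrics_collection_interval": interval_seconds},
        "metrics": {
            "metrics_collected": {
                # OS-level metrics that EC2 does not publish on its own.
                "mem": {"measurement": ["mem_used_percent"]},
                "disk": {"measurement": ["used_percent"], "resources": ["/"]},
                "swap": {"measurement": ["swap_used_percent"]},
            }
        },
        "logs": {
            "logs_collected": {
                "files": {
                    "collect_list": [
                        {
                            "file_path": log_file,
                            "log_group_name": log_group,
                            "log_stream_name": "{instance_id}",
                        }
                    ]
                }
            }
        },
    }
    return json.dumps(config, indent=2)


# Hypothetical application log path and log group.
config_json = build_agent_config("/var/log/my-app/app.log", "/my/app")
print(config_json)
```

Using {instance_id} as the stream name gives each instance its own log stream, matching the one-stream-per-source hierarchy described earlier.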
Set metrics_collection_interval low enough. The default is 60 seconds. If your app crashes in 10 seconds, you'll see nothing. Set it to 5 seconds for critical paths.
Logs Insights: Stop Grepping, Start Querying Your Way to Root Cause
You've got logs in CloudWatch. Now what? Grepping a thousand streams is for amateurs. Logs Insights is your SQL-for-logs. It queries across every stream in the log groups you select. The syntax looks like SQL but isn't; it's purpose-built for log parsing. You use fields @timestamp to pull timestamps, filter to match error codes, and stats with by bin(5m) to see time-based spikes.
The WHY: When an alarm fires at 3 AM, you need to find the error pattern in seconds, not hours. Logs Insights paginates results, so sort by @timestamp desc to see the newest events first. Common pattern: filter @message like /(?i)(error|exception|timeout)/, then aggregate with stats by bin(5m). You'll spot the surge.
Pro tip: Save your queries in Logs Insights. Name them something searchable like "5xx errors last hour". Share them with your team. Your future self, paged at 2 AM, will thank you. Also: aggregate with stats count() by bin(1m) and add limit 10000 to avoid timeouts on large log groups. Set a time range before running. Never run across all time; that's how you burn money and patience.
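The error-surge pattern can be assembled and bounded to a time range programmatically. A sketch under stated assumptions: the two helper functions are hypothetical, while logGroupName, startTime, endTime, and queryString are the parameter names the Logs start-query API uses. Nothing here calls AWS; it only builds the request.

```python
from typing import Dict, Union


def error_surge_query(bin_size: str = "5m") -> str:
    """Build a Logs Insights query that counts error-like lines per time bin."""
    stages = [
        "fields @timestamp, @message",
        "filter @message like /(?i)(error|exception|timeout)/",
        f"stats count(*) as matches by bin({bin_size})",
    ]
    return " | ".join(stages)


def start_query_kwargs(
    log_group: str, now_epoch: int, lookback_seconds: int = 3600
) -> Dict[str, Union[str, int]]:
    """Always bound the query to an explicit window; never scan all time."""
    return {
        "logGroupName": log_group,
        "startTime": now_epoch - lookback_seconds,
        "endTime": now_epoch,
        "queryString": error_surge_query(),
    }


kwargs = start_query_kwargs("/my/app", now_epoch=1_700_000_000)
print(kwargs["queryString"])
# With boto3 installed and credentials configured, this could be submitted as:
#   boto3.client("logs").start_query(**kwargs)
```

Making the lookback an explicit argument with a one-hour default keeps the scanned volume, and therefore the bill, predictable.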
The 3 AM Alarm That Wasn't: TreatMissingData Default
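1. Set TreatMissingData: breaching.
2. Add a heartbeat metric: the Lambda publishes a custom metric every minute. Alarm on a missing heartbeat so that silence pages.
3. Add a metric filter that counts processed events by matching the term Processed in the function's logs (the ad-hoc Logs Insights equivalent is filter @message like /Processed/ | stats count()). Alarm on zero matches for 10 minutes, and give the filter a default value of 0 so an empty minute reports zero instead of going missing.
4. Add a CloudWatch alarm on the Lambda Invocations metric with TreatMissingData: breaching, so that zero invocations is treated as a dead service.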
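Step 4 might look like the following as parameters for the put-metric-alarm API. The parameter names are the real ones; the alarm name, function name, topic ARN, and the choice of five one-minute periods are assumptions made for illustration.

```python
# Parameters for an alarm that pages when a Lambda function goes quiet.
invocations_alarm_params = {
    "AlarmName": "process-transactions-silent",  # hypothetical
    "Namespace": "AWS/Lambda",
    "MetricName": "Invocations",
    "Dimensions": [{"Name": "FunctionName", "Value": "process-transactions"}],  # hypothetical
    "Statistic": "Sum",
    "Period": 60,              # seconds per datapoint
    "EvaluationPeriods": 5,    # look at the last five minutes
    "DatapointsToAlarm": 5,    # all five must be breaching or missing
    "Threshold": 1,
    "ComparisonOperator": "LessThanThreshold",
    "TreatMissingData": "breaching",  # silence counts as a breach
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:oncall-pager"],  # hypothetical
}

for key in ("MetricName", "ComparisonOperator", "TreatMissingData"):
    print(f"{key}: {invocations_alarm_params[key]}")
# With boto3 installed and credentials configured, this could be applied as:
#   boto3.client("cloudwatch").put_metric_alarm(**invocations_alarm_params)
```

The last two settings work as a pair: LessThanThreshold catches a metric that reports zero, and TreatMissingData set to breaching catches a metric that stops reporting entirely.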
Prevention: For any metric where silence = failure (invocations, heartbeat, queue consumers, active connections), always set TreatMissingData: breaching.
- TreatMissingData default (missing) is dangerous for availability metrics.
- Silence from a service that should be talking is itself a problem worth paging on.
- Lambda Invocations = 0 for 5 minutes means something is broken upstream.
- Add explicit heartbeat metrics for critical scheduled jobs.
aws cloudwatch list-metrics --namespace TheCodeForge/checkout-service
aws cloudwatch get-metric-statistics --namespace TheCodeForge/checkout-service --metric-name OrdersProcessedPerMinute --period 300 --statistics Sum --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ)
Key takeaways
Common mistakes to avoid
- Setting alarms with EvaluationPeriods=1 and DatapointsToAlarm=1
- Publishing custom metrics with high-cardinality dimensions like userId or requestId
- Leaving TreatMissingData at its default (missing) for availability-type metrics
- Putting Logs Insights queries in auto-refreshing dashboards
- Storing logs forever with no expiration. Fix: aws logs put-retention-policy --log-group-name /my/app --retention-in-days 30
Interview Questions on This Topic
CloudWatch Alarms have three states — OK, ALARM, and INSUFFICIENT_DATA. When would an alarm be in INSUFFICIENT_DATA and why is it dangerous to treat that state the same as OK?
For availability metrics (silence means failure), set TreatMissingData: breaching. For error rate metrics (silence means no errors), set TreatMissingData: notBreaching.
Example: A Lambda that should process transactions every minute stops being invoked due to a broken event source mapping. The Invocations metric goes from 5/minute to no datapoints at all. With TreatMissingData: breaching on an Invocations < 1 alarm, CloudWatch fires ALARM and pages on-call. With the default (missing), the alarm never reaches ALARM and no one knows.
Frequently Asked Questions