Senior 8 min · March 06, 2026
AWS CloudWatch Basics

CloudWatch TreatMissingData — Why Silence Won't Page

With the default TreatMissingData setting, an alarm goes to INSUFFICIENT_DATA, not ALARM, when its metric stops arriving.

Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Notes here come from systems that actually shipped.

Last updated: May 24, 2026
Quick Answer
  • Metrics: time-stamped numbers from every AWS service. CPU, errors, latency. Free for basic. Custom metrics: $0.30/month each.
  • Alarms: connect metrics to actions (page SNS, scale ASG). Use 3 evaluation periods + 2 datapoints to alarm — kills false alerts.
  • Logs Insights: query all logs in seconds. Fields, stats, percentiles, filter. Pay per GB scanned (~$0.005).
  • Metric Filters: run continuously, turn log patterns into metrics you can alarm on. Costs $0.30 per metric.
  • Production rule: CPU at 80% for 2 of 3 periods = page. 1 spike = nothing. TreatMissingData = breaching for heartbeat metrics.
  • Cost trap: publishing metrics with userId dimension. Each userId = separate billable metric. Use low-cardinality only.
✦ Definition~90s read
What Is CloudWatch TreatMissingData?

CloudWatch TreatMissingData is a configuration parameter on CloudWatch alarms that determines how the alarm behaves when a metric stops reporting data. Left at its default, 'missing', an alarm whose metric goes silent moves to INSUFFICIENT_DATA rather than ALARM, so unless you have wired an action to that state, your pager stays silent even when your application has gone dark.

Setting it to 'breaching' forces the alarm to treat silence as a failure, turning a dead metric into a page. Setting it to 'notBreaching' does the opposite and treats silence as healthy. This is the difference between knowing your service is down and assuming everything is fine because you stopped hearing from it.

CloudWatch is AWS's native monitoring service, collecting metrics (CPU, latency, request counts) and logs from your infrastructure. Metrics are the heartbeat — numeric time-series data emitted by services like EC2, Lambda, or your own custom application code.

Alarms watch these metrics and trigger actions (SNS notifications, Auto Scaling) when thresholds are crossed. Logs are the autopsy — raw text output from your applications, centralized for querying with Logs Insights. Together, they form the observability backbone for most AWS workloads.

In practice, you'll pair CloudWatch with third-party tools like Datadog or Grafana for richer dashboards and longer retention, but CloudWatch remains the default for AWS-native monitoring. TreatMissingData is critical for alarms on sparse metrics (e.g., a Lambda Errors metric that reports nothing while the function is idle) or when you need to detect a complete service outage.

The default 'missing' behavior is fine for metrics that report continuously — but for anything where silence means death, you must set it to 'breaching' or your on-call rotation will sleep through the outage.

Plain-English First

Imagine your house has a smart thermostat, a smoke detector, and a security camera — all sending alerts to your phone when something goes wrong. AWS CloudWatch is exactly that, but for your cloud infrastructure. It watches your servers, databases, and apps 24/7, records everything they do, and pages you the moment something looks off. You don't have to sit staring at a screen — CloudWatch does the watching so you can focus on building.

Every production system breaks. The difference between a 5-minute outage and a 5-hour disaster? Almost always the same thing: how fast you knew something was wrong.

AWS CloudWatch is the native observability service at the heart of every serious AWS deployment. It's not glamorous, but skipping it is like flying a plane with no instruments. Fine until it isn't.

This article covers metrics, alarms, logs, and dashboards. How they connect. How to set them up with CLI and CloudFormation. And the three mistakes engineers make with CloudWatch that page them at 3 AM for no reason.

Why CloudWatch TreatMissingData Is the Difference Between a Pager and a Silent Night

CloudWatch TreatMissingData is a per-alarm configuration that defines how an alarm behaves when a metric stops reporting. The core mechanic is simple: you choose one of four policies (missing, notBreaching, breaching, or ignore) and CloudWatch applies that policy to each missing datapoint during the alarm evaluation. The default, 'missing', does not count a gap as a breach: if every datapoint in the evaluation range is missing, the alarm moves to INSUFFICIENT_DATA, not ALARM. Actions attached to the ALARM state never run, so a metric that goes silent while the alarm is OK pages no one, which is almost never what you want for a health signal.

In practice, TreatMissingData matters because it lets you decide what metric absence means for alarm state. If you set it to 'breaching', every missing datapoint counts against the threshold, so a silent metric reaches ALARM as soon as DatapointsToAlarm is satisfied, which is what you want for heartbeat metrics. If you set it to 'notBreaching', missing datapoints count as healthy and the alarm stays OK even if data stops, which is dangerous for critical metrics like CPU utilization or request latency. With 'ignore', the alarm simply keeps whatever state it was in before the gap. The key property: TreatMissingData only applies when no datapoint exists for a period. A datapoint that was published with an unexpected value is evaluated against the threshold like any other.

Set TreatMissingData explicitly whenever you need to guarantee that a silent metric pages. The canonical example is a dead EC2 instance: if the instance stops reporting metrics, you want the alarm to fire. Without explicit configuration, the alarm moves to INSUFFICIENT_DATA, its ALARM actions never run, and nobody is paged. In production, always set TreatMissingData to 'breaching' for any alarm that monitors a critical health signal; otherwise silence is never treated as failure.

Default Behavior Traps
The default 'missing' policy does NOT page on silence: when every datapoint is missing, the alarm moves to INSUFFICIENT_DATA and its ALARM actions do not run. Most teams discover this during an outage.
Production Insight
A production incident where an auto-scaled instance was terminated but the CloudWatch alarm for CPU never entered ALARM because TreatMissingData was left as the default 'missing'.
The symptom: no pager call for 15 minutes while the ASG replaced the instance, because missing data never counted as a breach.
Rule of thumb: for any alarm that monitors a single-instance metric, set TreatMissingData to 'breaching' so silence always pages.
Key Takeaway
TreatMissingData is not optional — it's the difference between detecting a dead instance and ignoring it.
Default 'missing' sends a silent metric to INSUFFICIENT_DATA, not ALARM, so silence does not page.
For critical health metrics, always set TreatMissingData to 'breaching' — silence must mean failure.
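
To make the 'breaching' policy concrete, here is a minimal sketch of a heartbeat alarm created from the CLI. It assumes your service publishes a custom metric named Heartbeat with value 1 once a minute; the namespace, alarm name, and SNS topic ARN are hypothetical placeholders, not values from a real account.

```shell
#!/bin/bash
# Sketch: alarm on a heartbeat metric where silence counts as failure.
# Assumption: the service publishes Heartbeat=1 every minute to the
# hypothetical namespace "TheCodeForge/<service>".

create_heartbeat_alarm() {
  local service="$1"         # e.g. checkout-service
  local sns_topic_arn="$2"   # topic that routes to your pager

  # Sum < 1 over a minute means no heartbeat arrived.
  # With breaching, a minute with no datapoint at all counts the same way.
  aws cloudwatch put-metric-alarm \
    --alarm-name "${service}-heartbeat-missing" \
    --namespace "TheCodeForge/${service}" \
    --metric-name "Heartbeat" \
    --statistic Sum \
    --period 60 \
    --evaluation-periods 3 \
    --datapoints-to-alarm 3 \
    --threshold 1 \
    --comparison-operator LessThanThreshold \
    --treat-missing-data breaching \
    --alarm-actions "${sns_topic_arn}"
}

# Usage (placeholder ARN):
# create_heartbeat_alarm checkout-service arn:aws:sns:us-east-1:123456789012:on-call
```

With 3 of 3 periods required, this pages after roughly three silent minutes. Loosen or tighten that window to match how long your service can be quiet before it matters.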
Figure: CloudWatch TreatMissingData Alarm Flow. From metric collection to alarm action with missing data handling: CloudWatch Agent (collects metrics from servers) → CloudWatch Metrics (heartbeat data points stored) → CloudWatch Alarm (evaluates metric vs threshold) → TreatMissingData (controls alarm on missing points) → Alarm Action (SNS, Auto Scaling, or silence). Note on the figure: use 'notBreaching' to avoid false alarms during brief gaps.

CloudWatch Metrics: The Heartbeat of Your Infrastructure

A metric is just a time-stamped number with a name and a namespace. That's it. EC2 sends CloudWatch a CPUUtilization number every five minutes (every minute with detailed monitoring enabled). RDS sends DatabaseConnections. Lambda sends Duration and Errors. These numbers stream in automatically; you don't write a single line of code to get them.

What makes metrics powerful is the dimension system. A dimension is a key-value pair that narrows down which resource a metric belongs to. For example, CPUUtilization by itself tells you nothing. CPUUtilization where InstanceId=i-0abc123 tells you exactly which server is melting. You can also publish your own custom metrics — think order count per minute, active WebSocket connections, or queue depth in your own application.

Metrics are stored in CloudWatch for 15 months, but the resolution degrades over time: high-resolution data points (periods under 60 seconds) are kept for 3 hours, 1-minute data points for 15 days, 5-minute data points for 63 days, and 1-hour data points for 15 months. This matters when you're debugging an incident from three weeks ago: you'll only have 5-minute aggregates, not minute-by-minute data.

Knowing the retention and resolution schedule helps you set the right alarm evaluation periods and avoid false conclusions from aggregated data.
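
The retention schedule is easy to forget mid-incident, so here is a small lookup sketch. The function name is made up for illustration, it treats "15 months" as 455 days, and it skips the 3-hour high-resolution tier because that only applies to metrics published at sub-minute resolution.

```shell
#!/bin/bash
# Sketch: finest period (in seconds) still available for a datapoint of a given age.
# Assumption: 15 months is taken as 455 days; the sub-minute tier is ignored.

finest_period_for_age_days() {
  local age_days="$1"
  if [ "${age_days}" -le 15 ]; then
    echo 60        # 1-minute data points
  elif [ "${age_days}" -le 63 ]; then
    echo 300       # 5-minute data points
  elif [ "${age_days}" -le 455 ]; then
    echo 3600      # 1-hour data points
  else
    echo "expired"
  fi
}

finest_period_for_age_days 2     # yesterday's incident: 60
finest_period_for_age_days 21    # three weeks ago: 300
finest_period_for_age_days 400   # last year: 3600
```

If a postmortem needs finer detail than the schedule will give you, export the data while it is still inside the 15-day window.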

Custom metric cost trap: AWS charges $0.30 per custom metric per month. If you publish put-metric-data --dimensions userId=u-12345, each userId creates a separate billable metric. With 10,000 active users, that's $3,000 per month. Always use low-cardinality dimensions: Environment (prod/staging), Region, ServiceName, InstanceType. Never put unique identifiers in dimensions.
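
The arithmetic is worth running before you add a dimension. This sketch uses the flat $0.30 rate quoted above and ignores any volume tiers, so treat the output as a rough estimate; the function name is illustrative only.

```shell
#!/bin/bash
# Sketch: flat-rate monthly cost when every unique dimension value
# becomes its own billable metric. Rate is the $0.30 quoted in the text.

monthly_custom_metric_cost() {
  local unique_dimension_values="$1"
  awk -v n="${unique_dimension_values}" 'BEGIN { printf "%.2f\n", n * 0.30 }'
}

monthly_custom_metric_cost 3        # Environment = prod, staging, dev: 0.90
monthly_custom_metric_cost 10000    # userId across 10,000 users: 3000.00
```

Same metric name, same code path, a bill more than three thousand times larger. The only difference is the cardinality of the dimension.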

publish_custom_metric.sh (Bash)
#!/bin/bash
# publish_custom_metric.sh
# Publishes a custom business metric to CloudWatch.
# Run this from an EC2 instance, a container, or your CI/CD pipeline.
# Prerequisite: AWS CLI installed and IAM role with cloudwatch:PutMetricData permission.

APP_NAME="checkout-service"
ENVIRONMENT="production"

# Simulate reading the number of orders processed in the last minute.
# In a real app you'd query your database or read from an in-memory counter.
ORDERS_PROCESSED=142

# Publish the custom metric to a namespace we own.
# Namespace acts like a folder — use a consistent naming convention.
aws cloudwatch put-metric-data \
  --namespace "TheCodeForge/${APP_NAME}" \
  --metric-name "OrdersProcessedPerMinute" \
  --value "${ORDERS_PROCESSED}" \
  --unit "Count" \
  --dimensions "Environment=${ENVIRONMENT},AppName=${APP_NAME}" \
  --region us-east-1

# Verify the metric was accepted (exit code 0 means success)
if [ $? -eq 0 ]; then
  echo "[OK] Metric published: ${ORDERS_PROCESSED} orders for ${APP_NAME} in ${ENVIRONMENT}"
else
  echo "[ERROR] Failed to publish metric. Check IAM permissions and region."
  exit 1
fi
Output
[OK] Metric published: 142 orders for checkout-service in production
Custom Metric Costs Add Up Fast — Watch Your Dimensions
AWS charges $0.30 per custom metric per month. If you publish metrics with unique dimension combinations (e.g. per user-id), each unique combination is counted as a separate metric. A single app can accidentally generate thousands of billable metrics. Always use low-cardinality dimensions like Environment, Region, or AppName — never user IDs or request IDs.
Production Insight
A startup published custom metrics with userId dimension to track per-customer API latencies. They had 50,000 active customers. Each customer generated 5 metrics per day. CloudWatch bill: $15,000 per month.
Root cause: The engineer assumed dimensions were free like tags. They're not. Each unique combination of dimension values = one billable metric.
Fix: Removed userId dimension. Switched to percentiles (p50, p90, p99) across all customers. Bill dropped to $15/month.
Rule: If you need per-user metrics, send them to a separate analytics service (Athena, Redshift, third-party). CloudWatch is for aggregated infrastructure monitoring, not per-customer analytics.
Key Takeaway
Metrics = time-stamped numbers with dimensions. Built-in are free. Custom cost $0.30/month.
Dimensions must be low-cardinality. environment, region, service name. NEVER userId.
1-second resolution lasts 3 hours. 1-minute for 15 days. Plan retention accordingly.
Know the difference between standard and high-resolution metrics.
Should this metric be custom or use built-in?
If: EC2 CPU, RDS connections, Lambda duration, S3 bucket size
Use: Built-in metrics. Already published. Free. No code needed.
If: Application business metric: orders/minute, users online, queue length
Use: Custom metric. Publish via CLI or SDK. $0.30/month per metric.
If: Per-user or per-request metrics
Use: Do NOT use CloudWatch. Use Athena, Redshift, or third-party analytics.
If: High-frequency metrics (sub-minute, thousands of data points/second)
Use: High-resolution custom metrics cost more. Use CloudWatch embedded metric format or a third-party tool like Datadog.

CloudWatch Alarms: Turning Numbers Into Actions

A metric sitting in CloudWatch does nothing on its own. An alarm is what connects a metric to a response. You define a threshold, a comparison operator, and an evaluation period — and CloudWatch will flip the alarm state from OK to ALARM the moment that condition is met.

An alarm has exactly three states: OK (everything is fine), ALARM (threshold breached), and INSUFFICIENT_DATA (not enough data points have arrived yet, which happens right after you create an alarm or if the metric stops publishing). Understanding INSUFFICIENT_DATA is important — it's not the same as OK, and treating it that way is a common mistake.
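
Since INSUFFICIENT_DATA is where silent metrics tend to end up, it pays to look for alarms sitting in that state. A minimal sketch, wrapped in a function whose name is just for illustration:

```shell
#!/bin/bash
# Sketch: list alarms currently in INSUFFICIENT_DATA together with their
# TreatMissingData policy, so you can spot health alarms that went quiet.

list_insufficient_data_alarms() {
  aws cloudwatch describe-alarms \
    --state-value INSUFFICIENT_DATA \
    --query 'MetricAlarms[].[AlarmName,TreatMissingData]' \
    --output text
}

# Usage:
# list_insufficient_data_alarms
```

Any heartbeat or availability alarm that shows up in this list is a candidate for switching to breaching.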

Alarms can trigger three types of actions: SNS notifications (email, SMS, PagerDuty webhook), EC2 actions (stop, terminate, reboot, or recover an instance), and Auto Scaling actions (scale in or scale out). This is where CloudWatch becomes genuinely powerful — you can build self-healing infrastructure where an alarm automatically replaces a failing instance without any human involvement.

For composite alarms, you can combine multiple alarms with AND/OR logic. This lets you avoid alert fatigue by only paging someone when CPU is high AND error rate is also high — not when just one or the other spikes briefly.
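
As a rough sketch of the AND case, the function below creates a composite alarm from two existing alarms. The composite alarm name and the two child alarm names in the usage line are hypothetical; both child alarms must already exist.

```shell
#!/bin/bash
# Sketch: page only when BOTH child alarms are in ALARM.
# Alarm names and the SNS topic ARN are placeholders.

create_composite_alarm() {
  local cpu_alarm="$1"
  local error_alarm="$2"
  local sns_topic_arn="$3"

  aws cloudwatch put-composite-alarm \
    --alarm-name "checkout-cpu-and-errors" \
    --alarm-rule "ALARM(\"${cpu_alarm}\") AND ALARM(\"${error_alarm}\")" \
    --alarm-actions "${sns_topic_arn}"
}

# Usage (placeholder names):
# create_composite_alarm checkout-high-cpu checkout-processor-high-error-rate \
#   arn:aws:sns:us-east-1:123456789012:on-call
```

A common arrangement is to leave the child alarms without paging actions and attach the pager only to the composite.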

The DatapointsToAlarm setting is your best defense against alert fatigue. If you set EvaluationPeriods=3 and DatapointsToAlarm=2, the alarm fires only when 2 of the 3 most recent periods breach the threshold. A single 1-minute spike won't wake you at 3 AM. A problem that persists into a second period will.
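
Here is the 2-of-3 rule as a toy model in plain shell. It is a simplification for intuition only: it handles integer datapoints with a greater-than threshold and ignores missing data entirely, which the real evaluator does not.

```shell
#!/bin/bash
# Sketch: "M out of N" evaluation. Arguments are the threshold, the number
# of breaching datapoints required, then the N most recent datapoints.

evaluate_m_of_n() {
  local threshold="$1"
  local datapoints_to_alarm="$2"
  shift 2

  local breaching=0
  local value
  for value in "$@"; do
    if [ "${value}" -gt "${threshold}" ]; then
      breaching=$((breaching + 1))
    fi
  done

  if [ "${breaching}" -ge "${datapoints_to_alarm}" ]; then
    echo "ALARM"
  else
    echo "OK"
  fi
}

evaluate_m_of_n 80 2 95 40 45   # one CPU spike in three periods: OK
evaluate_m_of_n 80 2 95 40 91   # two breaches in three periods: ALARM
```

Note that the two breaches do not have to be adjacent. Any 2 of the last 3 periods is enough.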

cloudwatch_alarm.yaml (YAML)
# cloudwatch_alarm.yaml
# CloudFormation template that creates a CloudWatch alarm on a Lambda function.
# Deploy with: aws cloudformation deploy --template-file cloudwatch_alarm.yaml \
#              --stack-name checkout-lambda-alarms \
#              --parameter-overrides OnCallSnsTopicArn=<your-sns-topic-arn>

AWSTemplateFormatVersion: '2010-09-09'
Description: >-
  Alarm that fires when the checkout Lambda reports more than 5 errors per
  5-minute period in 2 of the last 3 periods.
  Sends an alert to the on-call SNS topic when triggered.

Parameters:
  LambdaFunctionName:
    Type: String
    Default: checkout-processor
    Description: The name of the Lambda function to monitor.

  OnCallSnsTopicArn:
    Type: String
    Description: ARN of the SNS topic that routes to PagerDuty or email.

Resources:

  # Alarm: triggers when Lambda errors exceed 5 per 5-minute period in 2 of the last 3 periods.
  # EvaluationPeriods: how many of the most recent periods are evaluated.
  # DatapointsToAlarm: of those periods, how many must actually breach.
  # Using 2-of-3 prevents a single noisy data point from waking someone at 3am.
  CheckoutLambdaErrorAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub '${LambdaFunctionName}-high-error-rate'
      AlarmDescription: >-
        Fires when checkout Lambda has more than 5 errors in a 5-minute period.
        Check Lambda logs in CloudWatch for stack traces.
      Namespace: AWS/Lambda                 # Built-in Lambda namespace — no setup needed
      MetricName: Errors
      Dimensions:
        - Name: FunctionName
          Value: !Ref LambdaFunctionName    # Scoped to our specific function
      Statistic: Sum                        # Add up all error counts in the period
      Period: 300                           # Each evaluation window is 5 minutes (300 seconds)
      EvaluationPeriods: 3                  # Look at the last 3 windows (15 minutes total)
      DatapointsToAlarm: 2                  # Alarm only if 2 out of 3 windows breach threshold
      Threshold: 5                          # More than 5 errors triggers the alarm
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching        # No data = function not invoked = not an error
      AlarmActions:
        - !Ref OnCallSnsTopicArn            # Page on-call engineer via SNS
      OKActions:
        - !Ref OnCallSnsTopicArn            # Also notify when alarm recovers

Outputs:
  AlarmName:
    Description: Name of the created CloudWatch alarm
    Value: !Ref CheckoutLambdaErrorAlarm
Output
# After deploying:
# aws cloudformation describe-stacks --stack-name checkout-lambda-alarms
#
# StackStatus: CREATE_COMPLETE
#
# Alarm initial state in CloudWatch console: INSUFFICIENT_DATA
# (moves to OK after the first evaluation; with TreatMissingData: notBreaching, no data counts as OK)
#
# When errors exceed threshold:
# SNS delivers message to on-call topic:
# Subject: ALARM: "checkout-processor-high-error-rate" in US East (N. Virginia)
# Body: Threshold Crossed: 2 datapoints [6.0, 8.0] were greater than threshold 5.0
Pro Tip: Always Set TreatMissingData Intentionally
The default value for TreatMissingData is missing, which sends the alarm to INSUFFICIENT_DATA rather than ALARM when no data arrives. For error-count metrics this is fine, but for availability metrics (like Lambda Invocations, HealthyHostCount, heartbeat metrics), missing data should be treated as breaching, because if the metric stops publishing, something is very wrong. Pick the right value for each alarm, not the default.
Production Insight
A company had a CloudWatch alarm on the EC2 StatusCheckFailed metric with EvaluationPeriods=1 and DatapointsToAlarm=1. Every time EC2 performed routine maintenance (instance reboot), the status check would fail for 30 seconds. Alarm fired. On-call got paged. Every day. Engineers started ignoring CloudWatch pages entirely.
When a real outage happened, no one responded for 45 minutes.
Fix: Changed to EvaluationPeriods=3, DatapointsToAlarm=2, Period=60 seconds. Now a 30-second maintenance reboot doesn't fire the alarm (only 1 of 3 periods breaches). A failure that persists into a second period does.
Rule: Alert fatigue kills on-call effectiveness. Your alarms should only page when the problem is sustained (at least 2 of 3 evaluation windows) AND when missing data indicates a real problem (TreatMissingData: breaching for availability metrics).
Key Takeaway
Alarms connect metrics to actions. 3 states: OK, ALARM, INSUFFICIENT_DATA.
EvaluationPeriods=3, DatapointsToAlarm=2 = no paging for 1-minute spikes.
TreatMissingData: breaching for heartbeat/invocations metrics.
Composite alarms = AND/OR logic to reduce false alerts.

CloudWatch Logs: Centralising and Querying Application Output

CloudWatch Logs is where your application output lives. The hierarchy works like this: a Log Group is the top-level container (one per application or service), Log Streams are individual sources within that group (one per Lambda execution environment, one per EC2 instance, one per container), and Log Events are the individual timestamped lines inside each stream.

Logs arrive in CloudWatch either through built-in integrations (Lambda writes there by default; ECS and CloudTrail can be configured to) or via the CloudWatch Logs Agent or the newer CloudWatch Agent installed on EC2 instances.

The real power is CloudWatch Logs Insights, a query language that lets you search and aggregate across gigabytes of logs in seconds. It's not SQL, but it's close enough that you'll feel at home immediately. You can filter for errors, extract fields from structured JSON logs, calculate percentiles, and visualise results as time-series charts.

Metric filters are another killer feature: they scan incoming log lines for a pattern and convert matches into CloudWatch metrics. This means you can turn a log line like ERROR: payment gateway timeout into an incrementing metric — and then alarm on that metric. No log aggregation pipeline, no third-party tool, no extra cost beyond the metric itself.
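
A minimal sketch of that payment-timeout filter from the CLI follows. The filter name, metric name, and namespace are invented for illustration; the quoted filter pattern matches log events containing that exact phrase.

```shell
#!/bin/bash
# Sketch: turn "payment gateway timeout" log lines into a countable metric.
# Filter name, metric name, and namespace are hypothetical.

create_timeout_metric_filter() {
  local log_group="$1"   # e.g. /production/checkout-service

  # metricValue=1 adds one to the metric for every matching log event.
  # defaultValue=0 emits 0 when log events arrive but none of them match.
  aws logs put-metric-filter \
    --log-group-name "${log_group}" \
    --filter-name "payment-gateway-timeout" \
    --filter-pattern '"payment gateway timeout"' \
    --metric-transformations \
      "metricName=PaymentGatewayTimeouts,metricNamespace=TheCodeForge/checkout-service,metricValue=1,defaultValue=0"
}

# Usage:
# create_timeout_metric_filter /production/checkout-service
```

Once the metric exists you can alarm on it like any other. If the service stops logging altogether the metric still goes quiet, so the TreatMissingData choice applies here too.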

Logs Insights cost warning: You're charged per GB of data scanned ($0.005 per GB). A poorly written query that scans 100 GB costs $0.50. That's cheap for incident debugging. But the same query running every minute as a dashboard widget costs $720 per day, or about $21,600 per month. Use Logs Insights for ad-hoc debugging only. For continuous metrics, use Metric Filters.
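
The scan-cost arithmetic is simple enough to script before you pin a query to a dashboard. This sketch assumes the $0.005 per GB rate quoted above; the function name is illustrative.

```shell
#!/bin/bash
# Sketch: Logs Insights cost = GB scanned per run * $0.005 * number of runs.

logs_insights_cost() {
  local gb_per_run="$1"
  local runs="$2"
  awk -v gb="${gb_per_run}" -v runs="${runs}" \
    'BEGIN { printf "%.2f\n", gb * 0.005 * runs }'
}

logs_insights_cost 100 1       # one ad-hoc query over 100 GB: 0.50
logs_insights_cost 100 1440    # same query every minute for a day: 720.00
logs_insights_cost 50 43200    # 50 GB every minute for 30 days: 10800.00
```

The per-run price looks harmless. The run count is what hurts.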

logs_insights_query.sh (Bash)
#!/bin/bash
# logs_insights_query.sh
# Runs a CloudWatch Logs Insights query to find the top 10 slowest
# API endpoints in the last hour, using structured JSON logs.
#
# Assumes your app logs JSON lines like:
# {"level":"INFO","endpoint":"/api/checkout","duration_ms":342,"status":200}

LOG_GROUP_NAME="/production/checkout-service"
QUERY_LOOKBACK_SECONDS=3600  # Last 1 hour
REGION="us-east-1"

# Plain epoch arithmetic behaves the same on Linux and macOS.
END_TIME=$(date -u +%s)
START_TIME=$((END_TIME - QUERY_LOOKBACK_SECONDS))

echo "Starting Logs Insights query on: ${LOG_GROUP_NAME}"
echo "Time range: last ${QUERY_LOOKBACK_SECONDS} seconds"

# Query: keep only lines that carry duration_ms, aggregate per endpoint,
# and return the 10 slowest by average duration.
# Explanations live out here because the query string is sent to the service as-is.
QUERY_STRING='fields @timestamp, endpoint, duration_ms, status
| filter ispresent(duration_ms)
| stats avg(duration_ms) as avg_duration, max(duration_ms) as max_duration, count() as request_count by endpoint
| sort avg_duration desc
| limit 10'

# Start the query. Logs Insights runs asynchronously, so we get a query ID back.
QUERY_ID=$(aws logs start-query \
  --log-group-name "${LOG_GROUP_NAME}" \
  --start-time "${START_TIME}" \
  --end-time "${END_TIME}" \
  --query-string "${QUERY_STRING}" \
  --query 'queryId' \
  --output text \
  --region "${REGION}")

if [ -z "${QUERY_ID}" ]; then
  echo "[ERROR] Could not start query. Check the log group name and IAM permissions."
  exit 1
fi

echo "Query submitted. ID: ${QUERY_ID}"
echo "Waiting for results..."

# Poll until the query reaches a terminal state (usually 2-10 seconds for an hour of logs)
while true; do
  STATUS=$(aws logs get-query-results \
    --query-id "${QUERY_ID}" \
    --query 'status' \
    --output text \
    --region "${REGION}")

  case "${STATUS}" in
    Complete)
      echo "Query complete. Results (one row per endpoint):"
      # Each result row is a list of {field, value} pairs; print only the values.
      aws logs get-query-results \
        --query-id "${QUERY_ID}" \
        --query 'results[*][*].value' \
        --output text \
        --region "${REGION}"
      break
      ;;
    Failed|Cancelled|Timeout)
      echo "[ERROR] Query ended with status ${STATUS}. Check query syntax and log group name."
      exit 1
      ;;
  esac

  sleep 2
done
Output
Starting Logs Insights query on: /production/checkout-service
Time range: last 3600 seconds
Query submitted. ID: a1b2c3d4-e5f6-7890-abcd-ef1234567890
Waiting for results...
Query complete. Results (one row per endpoint):
/api/checkout	847	3201	1243
/api/payment/verify	612	2980	988
/api/inventory	231	890	4521
Interview Gold: Logs Insights vs. Metric Filters
Logs Insights is for ad-hoc investigation — you run it manually when debugging. Metric filters run continuously in the background and turn log patterns into metrics you can alarm on. Use metric filters for things you always want to know (error rate, timeout count). Use Logs Insights for things you want to understand during an incident (which user triggered the error, what was the full request payload).
Production Insight
A team created a CloudWatch dashboard with a Logs Insights widget that ran every 60 seconds for the last 15 minutes of logs. The widget scanned 50 GB per execution. Daily cost: 50 GB × $0.005 × 1,440 executions = $360/day. Monthly: $10,800.
Root cause: They didn't know Logs Insights charges per GB scanned. A dashboard widget refreshes continuously.
Fix: Replaced Logs Insights widget with Metric Filters + standard CloudWatch metric graph. Metric Filters cost $0.30 per metric per month (not per query). Dashboard cost dropped to near zero.
Rule: Logs Insights is for human-driven debugging only. Never put Logs Insights queries in auto-refreshing dashboards or automated systems. Use Metric Filters and standard metrics for continuous monitoring.
Key Takeaway
Log Groups = containers. Log Streams = sources. Log Events = individual lines.
Logs Insights: ad-hoc queries, pay per GB scanned. $0.005/GB.
Metric Filters: continuous log→metric conversion, cost per metric, not per query.
Never put Logs Insights in auto-refreshing dashboards. You'll pay $10k/month.

Putting It Together: Dashboards and a Real Monitoring Architecture

A CloudWatch Dashboard is a customisable canvas where you pin metrics graphs, alarm states, and log query results side by side. The point isn't just pretty charts — it's reducing the time-to-understanding during an incident. When your on-call engineer gets paged at 2am, the first thing they open should be a dashboard that answers: is this an app problem, a database problem, or a network problem?

Good dashboard design follows the RED method: Rate (requests per second), Errors (error rate), and Duration (latency percentiles). Put those three graphs at the top. Below them, add the saturation metrics — CPU, memory, DB connections. At the bottom, link to Logs Insights queries for the most common failure modes.

Here's the architecture pattern that works in production: CloudWatch receives metrics and logs from all your services automatically. Alarms on the most critical thresholds fire to an SNS topic. That SNS topic routes to PagerDuty (or OpsGenie, or just email if you're early-stage). The on-call engineer opens the service dashboard, runs a Logs Insights query to get the stack trace, fixes the issue, and the OKActions entry on the alarm auto-resolves the incident. Everything is connected, traceable, and auditable.

This closed loop — metric to alarm to notification to dashboard to logs — is the entire CloudWatch mental model. Once you've internalised it, everything else is just configuration.

service_dashboard.json (JSON)
{
  "Comment": "CloudWatch Dashboard definition for the checkout service.",
  "Comment2": "Deploy with: aws cloudwatch put-dashboard --dashboard-name checkout-production --dashboard-body file://service_dashboard.json",
  "widgets": [
    {
      "type": "metric",
      "x": 0, "y": 0, "width": 8, "height": 6,
      "properties": {
        "title": "Request Rate — Checkout Lambda (invocations/min)",
        "view": "timeSeries",
        "stat": "Sum",
        "period": 60,
        "metrics": [
          [
            "AWS/Lambda",
            "Invocations",
            "FunctionName", "checkout-processor"
          ]
        ],
        "region": "us-east-1"
      }
    },
    {
      "type": "metric",
      "x": 8, "y": 0, "width": 8, "height": 6,
      "properties": {
        "title": "Error Rate — Checkout Lambda (errors/min)",
        "view": "timeSeries",
        "stat": "Sum",
        "period": 60,
        "metrics": [
          [
            "AWS/Lambda",
            "Errors",
            "FunctionName", "checkout-processor",
            { "color": "#d62728" }
          ]
        ],
        "region": "us-east-1",
        "annotations": {
          "horizontal": [
            {
              "label": "Alarm threshold",
              "value": 5,
              "color": "#ff7f0e"
            }
          ]
        }
      }
    },
    {
      "type": "metric",
      "x": 16, "y": 0, "width": 8, "height": 6,
      "properties": {
        "title": "P99 Latency — Checkout Lambda (ms)",
        "view": "timeSeries",
        "stat": "p99",
        "period": 60,
        "metrics": [
          [
            "AWS/Lambda",
            "Duration",
            "FunctionName", "checkout-processor"
          ]
        ],
        "region": "us-east-1"
      }
    },
    {
      "type": "alarm",
      "x": 0, "y": 6, "width": 24, "height": 2,
      "properties": {
        "title": "Active Alarms — Checkout Service",
        "alarms": [
          "arn:aws:cloudwatch:us-east-1:123456789012:alarm:checkout-processor-high-error-rate"
        ]
      }
    }
  ]
}
Output
# After running: aws cloudwatch put-dashboard \
# --dashboard-name checkout-production \
# --dashboard-body file://service_dashboard.json
#
# Response:
# {
# "DashboardValidationMessages": []
# }
#
# An empty DashboardValidationMessages array means the dashboard was accepted with no errors.
# Open in console: https://console.aws.amazon.com/cloudwatch/home#dashboards:name=checkout-production
#
# Dashboard displays:
# Row 1: [Request Rate graph] [Error Rate graph with threshold line] [P99 Latency graph]
# Row 2: [Alarm status panel — shows OK / ALARM / INSUFFICIENT_DATA in real time]
Pro Tip: Pin Dashboards to Cross-Account Views
If you run multiple AWS accounts (dev, staging, prod), you can add widgets from different accounts to a single dashboard using cross-account cross-region CloudWatch. This means one dashboard can show prod metrics from us-east-1 and eu-west-1 side by side. Set it up once in your monitoring account and your whole team has a single pane of glass — no tab switching during incidents.
Production Insight
A team's dashboard contained 47 graphs. On-call engineers had to scroll through 6 screen pages to find the relevant metrics. Mean time to diagnosis (MTTD) was 12 minutes.
Root cause: Dashboard design by accumulation. Every team added their graphs to the same dashboard. No one removed old ones. The dashboard became unusable.
Fix: Created three dashboards: Overview (RED metrics for all services), Service-Specific (detailed for checkout service), Infrastructure (CPU/memory for EC2). Each dashboard had fewer than 10 widgets. MTTD dropped to 3 minutes.
Rule: A dashboard should fit on one screen without scrolling. If you need more than 10 widgets, split into multiple dashboards. The purpose of a dashboard during an incident is to answer questions in seconds, not to catalogue every metric.
Key Takeaway
Dashboard = canvas for metrics, alarms, logs. One screen only (<10 widgets).
RED method: Rate, Errors, Duration at the top. Saturation below.
Closed loop: metrics → alarm → SNS → dashboard → logs → fix.
Cross-account widgets let you monitor all environments from one dashboard.

CloudWatch Agent: The Bridge Between Your Servers and Observability

Metrics and logs from inside your servers don't magically appear. You need the CloudWatch Agent, a daemon you install on EC2 instances or on-prem servers. It collects system-level metrics like memory, disk, and swap, which EC2 doesn't give you by default. It also pushes application logs. Don't confuse it with the SSM Agent, which handles Systems Manager tasks such as Run Command; the CloudWatch Agent is for data. Install it. Configure it. Stop guessing whether a box is out of memory when all you can see is CPU.

The WHY: Without the agent, you're blind to OS-level metrics. Your alarms fire late or never. Your logs sit on disks you can't query.

The HOW: Store the agent configuration JSON in Parameter Store and bootstrap it via user data. The agent picks up permissions from the instance's IAM role. Use the unified CloudWatch agent, which replaced the older CloudWatch Logs agent. Confirm it is running with sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a status. If it's not running, neither are your alerts.

cloudwatch-agent-config.json (JSON): collects memory, disk, and app logs
{
  "agent": {
    "metrics_collection_interval": 60,
    "logfile": "/var/log/amazon-cloudwatch-agent.log"
  },
  "metrics": {
    "namespace": "Prod/WebServer",
    "metrics_collected": {
      "mem": {
        "measurement": ["mem_used_percent", "mem_available_percent"]
      },
      "disk": {
        "measurement": ["disk_used_percent"],
        "resources": ["/", "/data"]
      }
    }
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/application/error.log",
            "log_group_name": "/prod/web/error",
            "log_stream_name": "{instance_id}"
          }
        ]
      }
    }
  }
}
Output
Agent status: running. Metrics: mem_used_percent=78; disk_used_percent: /=45, /data=62. Logs: pushing /var/log/application/error.log to group /prod/web/error.
Production Trap:
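Leaving metrics_collection_interval too coarse. The default is 60 seconds, so an app that crashes within 10 seconds may never show up in a datapoint. Set it to 5 seconds for critical paths. Intervals below 60 seconds are collected as high-resolution metrics, so apply them selectively.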
Key Takeaway
Install the CloudWatch Agent on every server. It's the only way to see OS-level metrics and ship logs. Without it, you are flying blind.
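A config that fails to parse is a common reason the agent never starts. The sketch below is a minimal pre-flight check, assuming the config layout shown above; `check_agent_config` and its warning messages are hypothetical helpers for illustration, not part of the agent or any AWS SDK.

```python
import json

# Same structure as the config above, kept as strict JSON (no comments).
SAMPLE_CONFIG = """
{
  "agent": {"metrics_collection_interval": 60},
  "metrics": {
    "namespace": "Prod/WebServer",
    "metrics_collected": {
      "mem": {"measurement": ["mem_used_percent"]},
      "disk": {"measurement": ["disk_used_percent"], "resources": ["/", "/data"]}
    }
  },
  "logs": {"logs_collected": {"files": {"collect_list": [
    {"file_path": "/var/log/application/error.log",
     "log_group_name": "/prod/web/error",
     "log_stream_name": "{instance_id}"}
  ]}}}
}
"""

def check_agent_config(raw):
    """Hypothetical helper: return a list of warnings for an agent config string."""
    config = json.loads(raw)  # raises ValueError on comments or trailing commas
    warnings = []

    interval = config.get("agent", {}).get("metrics_collection_interval", 60)
    if interval > 60:
        warnings.append(f"metrics_collection_interval={interval}s is coarser than 60s")

    collected = config.get("metrics", {}).get("metrics_collected", {})
    for section in ("mem", "disk"):
        if section not in collected:
            warnings.append(f"no '{section}' section: those OS metrics will not be published")

    files = (config.get("logs", {}).get("logs_collected", {})
             .get("files", {}).get("collect_list", []))
    if not files:
        warnings.append("no log files configured in collect_list")
    for entry in files:
        if "file_path" not in entry:
            warnings.append(f"collect_list entry without file_path: {entry}")
    return warnings

print(check_agent_config(SAMPLE_CONFIG))  # prints []
```

Running a check like this in CI before pushing the config to Parameter Store catches the cheap mistakes before they reach a server.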

Logs Insights: Stop Grepping, Start Querying Your Way to Root Cause

You've got logs in CloudWatch. Now what? Grepping a thousand streams is for amateurs. Logs Insights is your SQL-for-logs, and one query can span multiple log groups. The syntax looks like SQL but isn't: it's purpose-built for log parsing. Use fields @timestamp to pull timestamps, filter to narrow by error code, and stats with bin(5m) to see time-based spikes.

The WHY: When an alarm fires at 3 AM, you need to find the error pattern in seconds, not hours. Sort by @timestamp desc so the newest events come first. Common pattern: filter @message like /(?i)(error|exception|timeout)/, then stats count() by bin(1m). You'll spot the surge.

Pro tip: Save your queries as Logs Insights saved queries. Name them something searchable like "5xx errors last hour". Share them with your team. Your future self, paged at 2 AM, will thank you. Also: use limit to cap the rows returned on large log groups, and set a time range before running. Never run across all time. That's how you burn money and patience.

error-spike-query.sql (Logs Insights query)
# io.thecodeforge
# Find HTTP 5xx errors, grouped into 1-minute buckets.
# The time range (e.g. last 15 minutes) is set in the console or API call, not in the query.
# Assumes log lines of the form "... HTTP/1.1 502 ...".
fields @timestamp, @message
| filter @message like /HTTP\/[0-9.]+ 5[0-9]{2}/
| stats count() as error_count by bin(1m)
| sort @timestamp desc
| limit 100
Output
@timestamp (minute) | error_count
2024-01-15 02:03:00 | 42
2024-01-15 02:04:00 | 156
2024-01-15 02:05:00 | 203
Pro Tip:
Create a Saved Query for your most common incident types—5xx errors, database timeouts, auth failures. Name them clearly. Your on-call rotation will adopt them overnight.
Key Takeaway
Logs Insights is your first responder tool. Learn the query syntax. Save your top 5 incident queries. You will use them every time something breaks.
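To make the bucketing concrete, here is a rough local equivalent of filtering for 5xx and grouping with bin(1m). It is a sketch only: the event tuples and the `count_5xx_per_minute` helper are made up for illustration, and the regex assumes log lines of the form "HTTP/1.1 502".

```python
import re
from collections import Counter
from datetime import datetime

# Same pattern idea as the query: an HTTP version followed by a 5xx status.
STATUS_5XX = re.compile(r"HTTP/[0-9.]+ 5[0-9]{2}")

def count_5xx_per_minute(events):
    """events: iterable of (iso_timestamp, message). Returns {minute: count}, newest first."""
    buckets = Counter()
    for ts, message in events:
        if STATUS_5XX.search(message):
            # Truncate to the minute, which is what bin(1m) does to @timestamp.
            minute = datetime.fromisoformat(ts).replace(second=0, microsecond=0)
            buckets[minute.isoformat(sep=" ")] += 1
    return dict(sorted(buckets.items(), reverse=True))

# Hypothetical sample events
events = [
    ("2024-01-15T02:03:12", "GET /cart HTTP/1.1 502 12"),
    ("2024-01-15T02:03:48", "GET /cart HTTP/1.1 500 0"),
    ("2024-01-15T02:04:05", "POST /pay HTTP/1.1 503 0"),
    ("2024-01-15T02:04:09", "GET /img HTTP/1.1 404 512"),  # 4xx, ignored
]
print(count_5xx_per_minute(events))
```

Note that the 404 line carries "512" as a byte count. Anchoring the status code to the HTTP version keeps it out of the 5xx tally, which a bare /5[0-9]{2}/ match would not.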
● Production incident · POST-MORTEM · severity: high

The 3 AM Alarm That Wasn't: TreatMissingData Default

Symptom
CloudWatch alarm configured on the Lambda Errors metric shows OK status. No notification sent. The function shows no invocations for hours. The service is down but the monitoring claims everything is fine.
Assumption
The team assumed TreatMissingData default value (missing) would be safe. They thought 'missing data' meant no errors — which would be OK. They didn't realise a dead service also produces no data.
Root cause
The Lambda's event source mapping failed at 2 AM (DynamoDB stream permissions revoked), so the function wasn't invoked at all and CloudWatch stopped receiving data points for it. TreatMissingData was left at the default (missing), which never counts a gap as a breach. AWS documents missing as moving the alarm to INSUFFICIENT_DATA once every data point in the evaluation range is absent (only ignore strictly holds the current state), and neither path reaches ALARM. No ALARM transition, no notification. A service that should always be emitting metrics going silent is itself a critical failure, and the default setting masked it completely.
Fix
1. Changed the alarm's TreatMissingData to breaching.
2. Added a heartbeat metric: the Lambda publishes a custom metric every minute, and an alarm on a missing heartbeat pages immediately.
3. Added a metric filter that counts log lines containing "Processed", with an alarm on zero matches for 10 minutes. (filter @message like /Processed/ | stats count() is the Logs Insights form for checking the same thing by hand; a metric filter takes a plain pattern such as "Processed".)
4. Added a CloudWatch alarm on the Lambda Invocations metric with TreatMissingData: breaching, so zero invocations reads as a dead service.
Prevention: For any metric where silence = failure (invocations, heartbeat, queue consumers, active connections), always set TreatMissingData: breaching.
Key lesson
  • TreatMissingData default (missing) is dangerous for availability metrics.
  • Silence from a service that should be talking is itself a problem worth paging on.
  • Lambda Invocations = 0 for 5 minutes means something is broken upstream.
  • Add explicit heartbeat metrics for critical scheduled jobs.
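A heartbeat alarm along those lines could be described like this. The sketch only builds the parameter dictionary; the keys are the ones boto3's `put_metric_alarm` accepts, while the alarm name, the `Heartbeat` metric name, and the SNS topic ARN are placeholders chosen for illustration.

```python
def heartbeat_alarm_params(alarm_name, namespace, metric_name, sns_topic_arn):
    """Build put_metric_alarm keyword arguments for a 'silence pages' alarm."""
    return {
        "AlarmName": alarm_name,
        "Namespace": namespace,
        "MetricName": metric_name,
        "Statistic": "Sum",
        "Period": 60,                      # one heartbeat expected per minute
        "EvaluationPeriods": 3,
        "DatapointsToAlarm": 2,            # 2 of 3 periods must breach
        "Threshold": 1,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "breaching",   # a gap counts as a breach
        "AlarmActions": [sns_topic_arn],
    }

params = heartbeat_alarm_params(
    alarm_name="checkout-heartbeat-missing",                          # hypothetical
    namespace="TheCodeForge/checkout-service",
    metric_name="Heartbeat",                                          # hypothetical
    sns_topic_arn="arn:aws:sns:us-east-1:123456789012:oncall-page",   # placeholder
)

# With credentials configured, this would create or overwrite the alarm:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
print(params["TreatMissingData"])  # prints breaching
```

Keeping the definition in code rather than in console clicks makes the TreatMissingData choice reviewable in a pull request.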
Production debug guide: the 4 most common CloudWatch failure modes and how to diagnose them (4 entries)
Symptom · 01
Alarm shows INSUFFICIENT_DATA for hours, never OK or ALARM
Fix
Check if metric is actually publishing. Go to CloudWatch Metrics → browse namespace → look for recent data points. No data = service not publishing = fix the publisher.
Symptom · 02
Alarm fires constantly — every few minutes pages on-call
Fix
Check EvaluationPeriods and DatapointsToAlarm. Likely set to 1 and 1 (any single breach fires). Increase to 3 and 2 so only sustained breaches page.
Symptom · 03
Custom metrics cost $300+ per month unexpectedly
Fix
List custom metrics: aws cloudwatch list-metrics --namespace YourApp. Look for high-cardinality dimensions (userId, requestId). Each unique combination = separate billable metric.
Symptom · 04
Logs Insights query returns nothing for expected logs
Fix
Check the log group's retention period with aws logs describe-log-groups. The default is never expire, but someone may have set 7 days, in which case older events are gone. Also verify the time range selected for the query.
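The effect of the EvaluationPeriods and DatapointsToAlarm change from Symptom 02 can be sketched as an M-of-N check. This is a simplified model, assuming one datapoint per period and no missing data; `m_of_n_alarm` is an illustrative function, not CloudWatch's evaluator.

```python
def m_of_n_alarm(datapoints, threshold, evaluation_periods=3, datapoints_to_alarm=2):
    """True if at least M of the last N datapoints exceed the threshold."""
    window = datapoints[-evaluation_periods:]
    breaches = sum(1 for value in window if value > threshold)
    return breaches >= datapoints_to_alarm

cpu_spike = [40, 95, 45]      # one momentary blip
cpu_sustained = [40, 95, 91]  # two breaching periods out of three

print(m_of_n_alarm(cpu_spike, threshold=80))       # prints False
print(m_of_n_alarm(cpu_sustained, threshold=80))   # prints True
print(m_of_n_alarm(cpu_spike, threshold=80,
                   evaluation_periods=1, datapoints_to_alarm=1))  # prints False
```

With 1 and 1, only the most recent datapoint is considered, so the alarm flips on every individual breach as it arrives. That is what produces the paging storm.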
★ CloudWatch 60-Second Diagnosis: run these commands when CloudWatch isn't behaving as expected
Check if custom metric is publishing
Immediate action
List metrics and get recent data points
Commands
aws cloudwatch list-metrics --namespace TheCodeForge/checkout-service
aws cloudwatch get-metric-statistics --namespace TheCodeForge/checkout-service --metric-name OrdersProcessedPerMinute --period 300 --statistics Sum --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ)
Fix now
If no data, check IAM permissions and region in put-metric-data command
Find expensive custom metrics (high cardinality)
Immediate action
List all metrics and count unique values per dimension name
Commands
aws cloudwatch list-metrics --namespace YourApp --query 'Metrics[*].Dimensions' --output json | jq 'flatten | group_by(.Name) | map({Name: .[0].Name, UniqueValues: (map(.Value) | unique | length)})'
aws cloudwatch list-metrics --namespace YourApp --dimensions Name=userId
Fix now
Remove userId dimension from publishing code. Keep Environment, ServiceName only.
Logs Insights query too slow or times out
Immediate action
Reduce time range and add filters early
Commands
QUERY_ID=$(aws logs start-query --log-group-name /production/api --start-time $(date -u -d '1 hour ago' +%s) --end-time $(date -u +%s) --query-string 'fields @timestamp, @message | filter @message like /ERROR/ | limit 100' --query queryId --output text)
aws logs get-query-results --query-id "$QUERY_ID"
Fix now
Break large time ranges into smaller chunks. Add filter before stats.
Alarm treating missing data as OK (silent failure)
Immediate action
Check TreatMissingData attribute on alarm
Commands
aws cloudwatch describe-alarms --alarm-names high-error-rate --query 'MetricAlarms[0].TreatMissingData'
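aws cloudwatch put-metric-alarm --alarm-name high-error-rate <existing metric, threshold, and evaluation flags> --treat-missing-data breaching --region us-east-1
Fix now
Update the alarm with TreatMissingData: breaching for availability metrics. put-metric-alarm overwrites the whole alarm definition, so pass the existing settings from describe-alarms along with the new flag.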
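For the high-cardinality check, the same count can be done in a few lines of Python over the JSON that `aws cloudwatch list-metrics --output json` returns. The sample payload below is invented for illustration, and `dimension_cardinality` is a hypothetical helper.

```python
import json
from collections import defaultdict

def dimension_cardinality(list_metrics_json):
    """Map each dimension name to its number of distinct values, largest first."""
    values = defaultdict(set)
    for metric in json.loads(list_metrics_json).get("Metrics", []):
        for dim in metric.get("Dimensions", []):
            values[dim["Name"]].add(dim["Value"])
    ranked = sorted(values.items(), key=lambda item: len(item[1]), reverse=True)
    return {name: len(vals) for name, vals in ranked}

# Invented sample in the shape of list-metrics output
sample = json.dumps({"Metrics": [
    {"Namespace": "YourApp", "MetricName": "Latency",
     "Dimensions": [{"Name": "Environment", "Value": "production"},
                    {"Name": "userId", "Value": f"user-{i}"}]}
    for i in range(5)
]})

print(dimension_cardinality(sample))  # prints {'userId': 5, 'Environment': 1}
```

Any dimension whose count grows with traffic, rather than with the number of services or environments, is the one driving the bill.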
CloudWatch Logs Insights vs. Metric Filters
Feature | CloudWatch Logs Insights | CloudWatch Metric Filter
Purpose | Ad-hoc log investigation during incidents | Continuous log-to-metric conversion for alerting
When it runs | On demand: you trigger it manually | Continuously: processes every log line as it arrives
Output | Query results (table, chart, JSON) | A CloudWatch metric you can alarm on
Cost | Charged per GB of data scanned ($0.005/GB) | Free to create; charged per custom metric ($0.30/month)
Latency | Results in 2-30 seconds | Metrics appear within ~1 minute of log ingestion
Best for | Debugging: 'What caused this error?' | Alerting: 'Alert me when errors exceed X'
Query complexity | Rich: stats, percentiles, regex, joins | Simple: pattern match only (e.g. contains 'ERROR')
Retention awareness | Can query any retained log data | Only converts future log lines, not retroactive
Dashboard use? | NO. Will cost thousands. Manual debugging only. | YES. Metric graphs are cheap to auto-refresh.

Key takeaways

1. Metrics stream automatically from AWS services. Custom metrics cost $0.30/month each.
2. High-cardinality dimensions (userId, requestId) multiply custom metric costs. Stick to Environment, Region, ServiceName.
3. Alarms: Use EvaluationPeriods=3, DatapointsToAlarm=2 to avoid paging on momentary spikes.
4. TreatMissingData: breaching for availability metrics (heartbeats, invocations, healthy hosts). Default 'missing' hides dead services.
5. Logs Insights = manual debugging only. Never auto-refresh in dashboards. Use Metric Filters for continuous alerting.
6. RED metrics on your dashboard first: Rate, Error rate, Duration. Everything else below.
7. Set log retention to 30 days for app logs, 90 days for compliance. Don't pay to store logs you'll never read.

Common mistakes to avoid

5 patterns
×

Setting alarms with EvaluationPeriods=1 and DatapointsToAlarm=1

Symptom
Alert fatigue from single noisy spikes; engineers start ignoring pages; real outages get missed.
Fix
Use at least EvaluationPeriods=3 with DatapointsToAlarm=2 so the alarm only fires when a problem persists across multiple windows, not from one momentary blip.
×

Publishing custom metrics with high-cardinality dimensions like userId or requestId

Symptom
AWS bill unexpectedly shows thousands of custom metrics and costs $300+ per month instead of $3.
Fix
Never use unique identifiers as dimension values. Stick to low-cardinality values like Environment (production/staging), Region, or ServiceName.
×

Leaving TreatMissingData at its default (missing) for availability-type metrics

Symptom
A service goes completely dark (stops publishing metrics) and no alarm fires because CloudWatch sees 'no data' and never treats it as a breach. The alarm sits in OK or drifts to INSUFFICIENT_DATA, and neither state pages anyone.
Fix
For any metric where silence means failure (HealthyHostCount, heartbeat metrics, invocations count), explicitly set TreatMissingData: breaching so a dead service pages you.
×

Putting Logs Insights queries in auto-refreshing dashboards

Symptom
Monthly CloudWatch bill hits $10,000+. Dashboard has a Logs Insights widget that scans 50 GB every 60 seconds.
Fix
Logs Insights is for manual debugging only. Never auto-refresh. Use Metric Filters to convert log patterns to metrics, then graph the metrics (costs $0.30/month).
×

Storing logs forever with no expiration

Symptom
Log group has 2 TB of data from 3 years ago. CloudWatch Logs storage bill is $60/month for logs you never query.
Fix
Set retention period to 30 days for application logs, 90 days for compliance logs, 365 days for audit logs. Use aws logs put-retention-policy --log-group-name /my/app --retention-in-days 30.
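The dollar figures in the mistakes above follow from simple arithmetic. The sketch below reproduces two of them using the per-unit prices quoted in this article ($0.30 per custom metric per month, $0.005 per GB scanned); actual prices vary by region and tier, and the widget estimate assumes the dashboard stays open and refreshing all month.

```python
CUSTOM_METRIC_USD_PER_MONTH = 0.30   # figure quoted in this article
LOGS_INSIGHTS_USD_PER_GB = 0.005     # figure quoted in this article

def custom_metric_cost(unique_dimension_combinations):
    """Monthly cost when every unique dimension combination is its own metric."""
    return round(unique_dimension_combinations * CUSTOM_METRIC_USD_PER_MONTH, 2)

def insights_widget_cost(gb_per_refresh, refresh_seconds, days=30):
    """Monthly scan cost of a Logs Insights widget that refreshes continuously."""
    refreshes = days * 24 * 3600 // refresh_seconds
    return round(refreshes * gb_per_refresh * LOGS_INSIGHTS_USD_PER_GB, 2)

print(custom_metric_cost(10))          # 10 low-cardinality metrics -> 3.0
print(custom_metric_cost(1000))        # 1,000 userIds -> 300.0
print(insights_widget_cost(50, 60))    # 50 GB every 60 s -> 10800.0
```

The widget case works out to $0.25 per refresh, 43,200 times a month, which is where the "$10,000+" bill comes from.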
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
CloudWatch Alarms have three states: OK, ALARM, and INSUFFICIENT_DATA. ...
Q02 · SENIOR
You're trying to alert on the error rate of a Lambda function, but you w...
Q03 · SENIOR
What's the difference between a CloudWatch Metric Filter and a Logs Insi...
Q01 of 03 · SENIOR

CloudWatch Alarms have three states — OK, ALARM, and INSUFFICIENT_DATA. When would an alarm be in INSUFFICIENT_DATA and why is it dangerous to treat that state the same as OK?

ANSWER
INSUFFICIENT_DATA means CloudWatch doesn't have enough metric data points to evaluate the alarm. This happens right after an alarm is created, before enough periods have been evaluated, or when the metric source stops publishing.

Treating that state the same as OK is dangerous because a dead service (Lambda not running, EC2 crashed, app not sending a heartbeat) also produces no data. If silence reads as healthy, you never learn the service is down.

Correct approach: for availability metrics (heartbeats, invocations, healthy hosts), set TreatMissingData: breaching. For error-rate metrics, where silence means no errors, set TreatMissingData: notBreaching.

Example: a Lambda that should process transactions every minute stops being invoked due to a broken event source mapping. The Invocations metric goes from 5/minute to no data points at all. With TreatMissingData: breaching on an Invocations < 1 alarm, CloudWatch fires ALARM and pages on-call. With the default (missing), the alarm never reaches ALARM, so no one knows.
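The four TreatMissingData settings can be compared with a small state model. This is a deliberately simplified sketch of the documented behaviour once every datapoint in the evaluation range is missing, not CloudWatch's real evaluator. Per the AWS documentation, the default (missing) moves the alarm to INSUFFICIENT_DATA rather than freezing it, and only ignore holds the current state; either way no ALARM action runs.

```python
def state_when_all_missing(current_state, treat_missing_data="missing"):
    """Alarm state after an evaluation range with no datapoints at all (simplified)."""
    if treat_missing_data == "breaching":
        return "ALARM"                 # silence counts as a breach
    if treat_missing_data == "notBreaching":
        return "OK"                    # silence counts as healthy
    if treat_missing_data == "ignore":
        return current_state           # hold whatever state the alarm was in
    if treat_missing_data == "missing":
        return "INSUFFICIENT_DATA"     # default: nothing to evaluate
    raise ValueError(f"unknown TreatMissingData value: {treat_missing_data}")

def pages_on_call(state):
    """Assumes notifications are wired to ALARM actions only."""
    return state == "ALARM"

for setting in ("missing", "ignore", "notBreaching", "breaching"):
    state = state_when_all_missing("OK", setting)
    print(f"{setting:13} -> {state:17} pages: {pages_on_call(state)}")
```

Only breaching turns a silent service into a page, which is why it belongs on heartbeat and invocation alarms and not on error-count alarms.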
FAQ · 5 QUESTIONS

Frequently Asked Questions

01. How much does AWS CloudWatch cost for a basic setup?
02. What is the difference between CloudWatch and CloudTrail?
03. Can CloudWatch automatically fix problems, or does it only alert?
04. What retention period should I set for application logs?
05. How do I set up a heartbeat alarm to detect when a service stops publishing?