AWS CloudWatch Explained: Metrics, Alarms, and Logs That Actually Work
Every production system breaks eventually. The difference between a five-minute outage and a five-hour disaster is almost always the same thing: how fast you knew something was wrong. AWS CloudWatch is the native observability service that sits at the heart of every serious AWS deployment, giving you the eyes and ears you need to catch problems before your users do. It's not glamorous, but skipping it is like flying a plane with no instruments — fine until it isn't.
Before CloudWatch, teams had to stitch together a patchwork of third-party monitoring tools, custom log shippers, and hand-rolled alerting scripts. The result was fragile, expensive, and almost always incomplete. CloudWatch solves this by providing a unified platform where metrics flow in automatically from over 70 AWS services, logs from your applications can be centralised in seconds, and alarms can trigger automated responses — all without leaving the AWS ecosystem.
By the end of this article you'll understand what metrics, log groups, alarms, and dashboards actually are, how they connect to each other in a real production architecture, and how to set them up using both the AWS CLI and CloudFormation. You'll also know the three most common mistakes engineers make with CloudWatch — and exactly how to avoid them.
CloudWatch Metrics: The Heartbeat of Your Infrastructure
A metric is just a time-stamped number with a name and a namespace. That's it. EC2 sends CloudWatch a CPUUtilization number every five minutes by default (every minute if you enable detailed monitoring). RDS sends DatabaseConnections. Lambda sends Duration and Errors. These numbers stream in automatically — you don't write a single line of code to get them.
What makes metrics powerful is the dimension system. A dimension is a key-value pair that narrows down which resource a metric belongs to. For example, CPUUtilization by itself tells you nothing. CPUUtilization where InstanceId=i-0abc123 tells you exactly which server is melting. You can also publish your own custom metrics — think order count per minute, active WebSocket connections, or queue depth in your own application.
Metrics are stored in CloudWatch for 15 months, but the resolution degrades over time: high-resolution data points (sub-minute, down to 1 second, available for custom metrics) are kept for 3 hours, then data is aggregated to 1-minute resolution for 15 days, 5-minute for 63 days, and finally 1-hour for 15 months. This matters when you're debugging an incident from three weeks ago — you'll only have 5-minute averages, not second-by-second data.
Knowing the retention and resolution schedule helps you set the right alarm evaluation periods and avoid false conclusions from aggregated data.
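To make that schedule concrete, here's a hedged sketch (the instance ID is a placeholder, and the call assumes configured AWS credentials and GNU date) that fetches a window of three-week-old CPU data. The period must be at least 300 seconds because 1-minute data has already aged out:

```shell
# Query CPUUtilization from roughly three weeks ago. Instance ID is a placeholder.
# Data older than 15 days only exists at 5-minute (300 s) resolution or coarser,
# so requesting --period 60 here would return an empty Datapoints list.
START=$(date -u -d "-22 days" +%Y-%m-%dT%H:%M:%SZ)   # GNU date (Linux)
END=$(date -u -d "-21 days" +%Y-%m-%dT%H:%M:%SZ)

aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0abc123 \
  --start-time "${START}" \
  --end-time "${END}" \
  --period 300 \
  --statistics Average \
  --region us-east-1
```

The same call with a one-hour-old window could use --period 60, since 1-minute data is still retained there.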
#!/bin/bash
# publish_custom_metric.sh
# Publishes a custom business metric to CloudWatch.
# Run this from an EC2 instance, a container, or your CI/CD pipeline.
# Prerequisite: AWS CLI installed and IAM role with cloudwatch:PutMetricData permission.

APP_NAME="checkout-service"
ENVIRONMENT="production"

# Simulate reading the number of orders processed in the last minute.
# In a real app you'd query your database or read from an in-memory counter.
ORDERS_PROCESSED=142

# Publish the custom metric to a namespace we own.
# Namespace acts like a folder — use a consistent naming convention.
# Note: the CLI shorthand for --dimensions is comma-joined Key=Value pairs.
aws cloudwatch put-metric-data \
  --namespace "TheCodeForge/${APP_NAME}" \
  --metric-name "OrdersProcessedPerMinute" \
  --value "${ORDERS_PROCESSED}" \
  --unit "Count" \
  --dimensions "Environment=${ENVIRONMENT},AppName=${APP_NAME}" \
  --region us-east-1

# Verify the metric was accepted (exit code 0 means success)
if [ $? -eq 0 ]; then
  echo "[OK] Metric published: ${ORDERS_PROCESSED} orders for ${APP_NAME} in ${ENVIRONMENT}"
else
  echo "[ERROR] Failed to publish metric. Check IAM permissions and region."
  exit 1
fi
CloudWatch Alarms: Turning Numbers Into Actions
A metric sitting in CloudWatch does nothing on its own. An alarm is what connects a metric to a response. You define a threshold, a comparison operator, and an evaluation period — and CloudWatch will flip the alarm state from OK to ALARM the moment that condition is met.
An alarm has exactly three states: OK (everything is fine), ALARM (threshold breached), and INSUFFICIENT_DATA (not enough data points have arrived yet, which happens right after you create an alarm or if the metric stops publishing). Understanding INSUFFICIENT_DATA is important — it's not the same as OK, and treating it that way is a common mistake.
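A quick way to audit this is to list every alarm currently sitting in INSUFFICIENT_DATA. The sketch below assumes the AWS CLI is configured with credentials; the region is an example:

```shell
# List alarms in INSUFFICIENT_DATA, with the reason CloudWatch gives for each.
# Newly created alarms and alarms whose metric has stopped publishing both appear here.
aws cloudwatch describe-alarms \
  --state-value INSUFFICIENT_DATA \
  --query 'MetricAlarms[*].[AlarmName,StateReason]' \
  --output table \
  --region us-east-1
```

An alarm that lingers in this state for hours usually means its metric has silently stopped publishing, which is worth investigating in its own right.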
Alarms can trigger three types of actions: SNS notifications (email, SMS, PagerDuty webhook), EC2 actions (stop, terminate, reboot, or recover an instance), and Auto Scaling actions (scale in or scale out). This is where CloudWatch becomes genuinely powerful — you can build self-healing infrastructure where an alarm automatically replaces a failing instance without any human involvement.
For composite alarms, you can combine multiple alarms with AND/OR logic. This lets you avoid alert fatigue by only paging someone when CPU is high AND error rate is also high — not when just one or the other spikes briefly.
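As a hedged sketch (the alarm names, SNS topic ARN, and account ID below are placeholders, not resources defined elsewhere in this article), a composite alarm with AND logic looks like this with the AWS CLI:

```shell
# Create a composite alarm that only pages when BOTH underlying alarms are firing.
# The two metric alarms named in the rule must already exist.
aws cloudwatch put-composite-alarm \
  --alarm-name "checkout-cpu-and-errors" \
  --alarm-rule "ALARM(checkout-cpu-high) AND ALARM(checkout-error-rate-high)" \
  --alarm-actions "arn:aws:sns:us-east-1:123456789012:on-call-topic" \
  --alarm-description "Pages only when CPU and error rate breach together" \
  --region us-east-1
```

The rule language also supports OR and NOT, so you can express conditions like "page unless the deployment-in-progress alarm is active".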
# cloudwatch_alarm.yaml
# CloudFormation template that creates a CloudWatch alarm on a Lambda function.
# Deploy with: aws cloudformation deploy --template-file cloudwatch_alarm.yaml \
#   --stack-name checkout-lambda-alarms --capabilities CAPABILITY_IAM
AWSTemplateFormatVersion: '2010-09-09'
Description: >-
  Alarm that fires when the checkout Lambda records more than 5 errors
  in a 5-minute window. Sends an alert to the on-call SNS topic when triggered.

Parameters:
  LambdaFunctionName:
    Type: String
    Default: checkout-processor
    Description: The name of the Lambda function to monitor.
  OnCallSnsTopicArn:
    Type: String
    Description: ARN of the SNS topic that routes to PagerDuty or email.

Resources:
  # Alarm: triggers if Lambda errors exceed 5 in any 5-minute window.
  # EvaluationPeriods: how many periods must breach before alarm fires.
  # DatapointsToAlarm: of those periods, how many must actually breach.
  # Using 2-of-3 prevents a single noisy data point from waking someone at 3am.
  CheckoutLambdaErrorAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub '${LambdaFunctionName}-high-error-rate'
      AlarmDescription: >-
        Fires when checkout Lambda has more than 5 errors in a 5-minute period.
        Check Lambda logs in CloudWatch for stack traces.
      Namespace: AWS/Lambda           # Built-in Lambda namespace — no setup needed
      MetricName: Errors
      Dimensions:
        - Name: FunctionName
          Value: !Ref LambdaFunctionName  # Scoped to our specific function
      Statistic: Sum                  # Add up all error counts in the period
      Period: 300                     # Each evaluation window is 5 minutes (300 seconds)
      EvaluationPeriods: 3            # Look at the last 3 windows (15 minutes total)
      DatapointsToAlarm: 2            # Alarm only if 2 out of 3 windows breach threshold
      Threshold: 5                    # More than 5 errors triggers the alarm
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching  # No data = function not invoked = not an error
      AlarmActions:
        - !Ref OnCallSnsTopicArn      # Page on-call engineer via SNS
      OKActions:
        - !Ref OnCallSnsTopicArn      # Also notify when alarm recovers

Outputs:
  AlarmName:
    Description: Name of the created CloudWatch alarm
    Value: !Ref CheckoutLambdaErrorAlarm
# aws cloudformation describe-stacks --stack-name checkout-lambda-alarms
#
# StackStatus: CREATE_COMPLETE
#
# Alarm initial state in CloudWatch console: INSUFFICIENT_DATA
# (changes to OK once Lambda emits its first metric data point)
#
# When errors exceed threshold:
# SNS delivers message to on-call topic:
# Subject: ALARM: "checkout-processor-high-error-rate" in US East (N. Virginia)
# Body: Threshold Crossed: 2 datapoints [6.0, 8.0] were greater than threshold 5.0
CloudWatch Logs: Centralising and Querying Application Output
CloudWatch Logs is where your application output lives. The hierarchy works like this: a Log Group is the top-level container (one per application or service), Log Streams are individual sources within that group (one per Lambda invocation, one per EC2 instance, one per container), and Log Events are the individual timestamped lines inside each stream.
Logs arrive in CloudWatch either automatically (Lambda, ECS, and CloudTrail do this natively) or via an agent installed on EC2 instances: either the legacy CloudWatch Logs agent (now deprecated) or its replacement, the unified CloudWatch agent.
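Before any logs can arrive, the log group itself has to exist, and it's worth setting a retention policy immediately, because log groups keep data forever by default and storage costs accumulate silently. A minimal sketch, reusing this article's example log group name:

```shell
# Create the log group, then cap retention at 30 days.
# Without a retention policy, log data is kept (and billed) indefinitely.
aws logs create-log-group \
  --log-group-name "/production/checkout-service" \
  --region us-east-1

aws logs put-retention-policy \
  --log-group-name "/production/checkout-service" \
  --retention-in-days 30 \
  --region us-east-1
```

Retention can only be set to specific allowed values (1, 3, 5, 7, 14, 30, 60, 90 days and so on up to 10 years), so pick the nearest one that satisfies your compliance requirements.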
The real power is CloudWatch Logs Insights, a query language that lets you search and aggregate across gigabytes of logs in seconds. It's not SQL, but it's close enough that you'll feel at home immediately. You can filter for errors, extract fields from structured JSON logs, calculate percentiles, and visualise results as time-series charts.
Metric filters are another killer feature: they scan incoming log lines for a pattern and convert matches into CloudWatch metrics. This means you can turn a log line like ERROR: payment gateway timeout into an incrementing metric — and then alarm on that metric. No log aggregation pipeline, no third-party tool, no extra cost beyond the metric itself.
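A minimal sketch of that exact example, reusing the article's log group and namespace naming (the filter name and metric name are my own illustrative choices):

```shell
# Create a metric filter that counts payment-gateway timeout errors.
# Every matching log line increments the metric by 1; defaultValue=0 keeps the
# metric publishing zeros when nothing matches, so alarms on it behave sanely.
aws logs put-metric-filter \
  --log-group-name "/production/checkout-service" \
  --filter-name "payment-gateway-timeouts" \
  --filter-pattern '"ERROR: payment gateway timeout"' \
  --metric-transformations \
    metricName=PaymentGatewayTimeouts,metricNamespace=TheCodeForge/checkout-service,metricValue=1,defaultValue=0 \
  --region us-east-1
```

From here, an ordinary CloudWatch alarm on PaymentGatewayTimeouts closes the loop from log line to page.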
#!/bin/bash
# logs_insights_query.sh
# Runs a CloudWatch Logs Insights query to find the top 10 slowest
# API endpoints in the last hour, using structured JSON logs.
#
# Prerequisites: AWS CLI and jq installed, logs:StartQuery / logs:GetQueryResults permission.
# Assumes your app logs JSON lines like:
#   {"level":"INFO","endpoint":"/api/checkout","duration_ms":342,"status":200}

LOG_GROUP_NAME="/production/checkout-service"
QUERY_LOOKBACK_SECONDS=3600   # Last 1 hour

START_TIME=$(date -u -d "-${QUERY_LOOKBACK_SECONDS} seconds" +%s)   # Linux
# On macOS use: date -u -v -${QUERY_LOOKBACK_SECONDS}S +%s
END_TIME=$(date -u +%s)

echo "Starting Logs Insights query on: ${LOG_GROUP_NAME}"
echo "Time range: last ${QUERY_LOOKBACK_SECONDS} seconds"

# Start the query — Logs Insights runs asynchronously, so we get a query ID back.
QUERY_ID=$(aws logs start-query \
  --log-group-name "${LOG_GROUP_NAME}" \
  --start-time "${START_TIME}" \
  --end-time "${END_TIME}" \
  --query-string '
    fields @timestamp, endpoint, duration_ms, status
    | filter ispresent(duration_ms)   # Only include log lines that have this field
    | stats avg(duration_ms) as avg_duration,
            max(duration_ms) as max_duration,
            count() as request_count by endpoint
    | sort avg_duration desc          # Slowest endpoints first
    | limit 10
  ' \
  --query 'queryId' \
  --output text \
  --region us-east-1)

echo "Query submitted. ID: ${QUERY_ID}"
echo "Waiting for results..."

# Poll until the query finishes (usually 2-10 seconds for an hour of logs)
while true; do
  STATUS=$(aws logs get-query-results \
    --query-id "${QUERY_ID}" \
    --query 'status' \
    --output text \
    --region us-east-1)

  if [ "${STATUS}" == "Complete" ]; then
    echo "Query complete. Results:"
    # Fetch the results and reshape each row of {field,value} pairs
    # into a single flat JSON object per endpoint.
    aws logs get-query-results \
      --query-id "${QUERY_ID}" \
      --output json \
      --region us-east-1 \
      | jq '[.results[] | map({(.field): .value}) | add]'
    break
  elif [ "${STATUS}" == "Failed" ]; then
    echo "[ERROR] Query failed. Check query syntax and log group name."
    exit 1
  fi
  sleep 2
done
Starting Logs Insights query on: /production/checkout-service
Time range: last 3600 seconds
Query submitted. ID: a1b2c3d4-e5f6-7890-abcd-ef1234567890
Waiting for results...
Query complete. Results:
[
{ "endpoint": "/api/checkout", "avg_duration": "847", "max_duration": "3201", "request_count": "1243" },
{ "endpoint": "/api/payment/verify", "avg_duration": "612", "max_duration": "2980", "request_count": "988" },
{ "endpoint": "/api/inventory", "avg_duration": "231", "max_duration": "890", "request_count": "4521" }
]
Putting It Together: Dashboards and a Real Monitoring Architecture
A CloudWatch Dashboard is a customisable canvas where you pin metrics graphs, alarm states, and log query results side by side. The point isn't just pretty charts — it's reducing the time-to-understanding during an incident. When your on-call engineer gets paged at 2am, the first thing they open should be a dashboard that answers: is this an app problem, a database problem, or a network problem?
Good dashboard design follows the RED method: Rate (requests per second), Errors (error rate), and Duration (latency percentiles). Put those three graphs at the top. Below them, add the saturation metrics — CPU, memory, DB connections. At the bottom, link to Logs Insights queries for the most common failure modes.
Here's the architecture pattern that works in production: CloudWatch receives metrics and logs from all your services automatically. Alarms on the most critical thresholds fire to an SNS topic. That SNS topic routes to PagerDuty (or OpsGenie, or just email if you're early-stage). The on-call engineer opens the service dashboard, runs a Logs Insights query to get the stack trace, fixes the issue, and the OKAction on the alarm auto-resolves the incident. Everything is connected, traceable, and auditable.
This closed loop — metric to alarm to notification to dashboard to logs — is the entire CloudWatch mental model. Once you've internalised it, everything else is just configuration.
{
"Comment": "CloudWatch Dashboard definition for the checkout service.",
"Comment2": "Deploy with: aws cloudwatch put-dashboard --dashboard-name checkout-production --dashboard-body file://service_dashboard.json",
"widgets": [
{
"type": "metric",
"x": 0, "y": 0, "width": 8, "height": 6,
"properties": {
"title": "Request Rate — Checkout Lambda (invocations/min)",
"view": "timeSeries",
"stat": "Sum",
"period": 60,
"metrics": [
[
"AWS/Lambda",
"Invocations",
"FunctionName", "checkout-processor"
]
],
"region": "us-east-1"
}
},
{
"type": "metric",
"x": 8, "y": 0, "width": 8, "height": 6,
"properties": {
"title": "Error Rate — Checkout Lambda (errors/min)",
"view": "timeSeries",
"stat": "Sum",
"period": 60,
"metrics": [
[
"AWS/Lambda",
"Errors",
"FunctionName", "checkout-processor",
{ "color": "#d62728" }
]
],
"region": "us-east-1",
"annotations": {
"horizontal": [
{
"label": "Alarm threshold",
"value": 5,
"color": "#ff7f0e"
}
]
}
}
},
{
"type": "metric",
"x": 16, "y": 0, "width": 8, "height": 6,
"properties": {
"title": "P99 Latency — Checkout Lambda (ms)",
"view": "timeSeries",
"stat": "p99",
"period": 60,
"metrics": [
[
"AWS/Lambda",
"Duration",
"FunctionName", "checkout-processor"
]
],
"region": "us-east-1"
}
},
{
"type": "alarm",
"x": 0, "y": 6, "width": 24, "height": 2,
"properties": {
"title": "Active Alarms — Checkout Service",
"alarms": [
"arn:aws:cloudwatch:us-east-1:123456789012:alarm:checkout-processor-high-error-rate"
]
}
}
]
}
# aws cloudwatch put-dashboard \
#     --dashboard-name checkout-production \
# --dashboard-body file://service_dashboard.json
#
# Response:
# {
# "DashboardValidationMessages": []
# }
#
# An empty DashboardValidationMessages array means the dashboard was accepted with no errors.
# Open in console: https://console.aws.amazon.com/cloudwatch/home#dashboards:name=checkout-production
#
# Dashboard displays:
# Row 1: [Request Rate graph] [Error Rate graph with threshold line] [P99 Latency graph]
# Row 2: [Alarm status panel — shows OK / ALARM / INSUFFICIENT_DATA in real time]
| Feature | CloudWatch Logs Insights | CloudWatch Metric Filters |
|---|---|---|
| Purpose | Ad-hoc log investigation during incidents | Continuous log-to-metric conversion for alerting |
| When it runs | On demand — you trigger it manually | Continuously — processes every log line as it arrives |
| Output | Query results (table, chart, JSON) | A CloudWatch metric you can alarm on |
| Cost | Charged per GB of data scanned | Free to create; charged per custom metric ($0.30/month) |
| Latency | Results in 2-30 seconds | Metrics appear within ~1 minute of log ingestion |
| Best for | Debugging: 'What caused this error?' | Alerting: 'Alert me when errors exceed X' |
| Query complexity | Rich: stats, percentiles, regex, joins | Simple: pattern match only (e.g. contains 'ERROR') |
| Retention awareness | Can query any retained log data | Only converts future log lines — not retroactive |
🎯 Key Takeaways
- CloudWatch's core loop is: metrics and logs feed in → alarms detect threshold breaches → SNS routes notifications → dashboards provide context for investigation — internalise this loop and all else follows
- Alarm DatapointsToAlarm is your best defence against alert fatigue — always use a value less than EvaluationPeriods to require sustained breaches, not momentary spikes
- Custom metric dimensions must be low-cardinality (Environment, AppName, Region) — one unique dimension value combination = one billable metric, and it adds up fast
- TreatMissingData: breaching is the right choice for availability metrics — silence from a service that should always be talking is itself a problem worth paging on
⚠ Common Mistakes to Avoid
- ✕ Mistake 1: Setting alarms with EvaluationPeriods=1 and DatapointsToAlarm=1 — Symptom: Alert fatigue from single noisy spikes; engineers start ignoring pages — Fix: Use at least EvaluationPeriods=3 with DatapointsToAlarm=2 so the alarm only fires when a problem persists across multiple windows, not from one momentary blip
- ✕ Mistake 2: Publishing custom metrics with high-cardinality dimensions like userId or requestId — Symptom: AWS bill unexpectedly shows thousands of custom metrics and costs $300+ per month instead of $3 — Fix: Never use unique identifiers as dimension values; stick to low-cardinality values like Environment (production/staging), Region, or ServiceName
- ✕ Mistake 3: Leaving TreatMissingData at its default (missing) for availability-type metrics — Symptom: A service goes completely dark (stops publishing metrics) and no alarm fires because CloudWatch sees 'no data' as neutral — Fix: For any metric where silence means failure (HealthyHostCount, heartbeat metrics, queue consumers), explicitly set TreatMissingData: breaching so a dead service pages you
Interview Questions on This Topic
- Q: CloudWatch Alarms have three states — OK, ALARM, and INSUFFICIENT_DATA. When would an alarm be in INSUFFICIENT_DATA and why is it dangerous to treat that state the same as OK?
- Q: You're trying to alert on the error rate of a Lambda function, but you want to avoid false alarms from momentary spikes. Walk me through how you'd configure the alarm's EvaluationPeriods and DatapointsToAlarm to achieve this.
- Q: What's the difference between a CloudWatch Metric Filter and a Logs Insights query, and when would you choose one over the other in a production monitoring setup?
Frequently Asked Questions
How much does AWS CloudWatch cost for a basic setup?
Basic EC2, Lambda, and RDS metrics are free — AWS publishes these automatically at no charge. You start paying when you add custom metrics ($0.30 per metric per month), store logs ($0.50 per GB ingested, $0.03 per GB stored per month), run Logs Insights queries ($0.005 per GB scanned), or create more than three dashboards (the first three are free, then $3 per dashboard per month). For a small production service, expect $5-30/month; for a large multi-service platform, budget $100-500/month.
What is the difference between CloudWatch and CloudTrail?
CloudWatch monitors the performance and health of your running resources — CPU, errors, latency, application logs. CloudTrail records who did what to your AWS account — API calls like 'who deleted that S3 bucket?' or 'who changed this security group?'. Think of CloudWatch as your operations monitor and CloudTrail as your audit log. You often use them together: CloudTrail logs go to CloudWatch Logs, where you set metric filters to alarm on sensitive actions like root account logins.
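As a sketch of that pattern (the log group name is an assumption and must match wherever your CloudTrail trail actually delivers events), a metric filter that counts root-user activity looks like:

```shell
# Assumes CloudTrail is already delivering events to this log group.
# The filter pattern matches actions taken by the root user, excluding
# AWS-initiated service events; this is a common security-audit pattern.
aws logs put-metric-filter \
  --log-group-name "/aws/cloudtrail/management-events" \
  --filter-name "root-account-usage" \
  --filter-pattern '{ $.userIdentity.type = "Root" && $.eventType != "AwsServiceEvent" }' \
  --metric-transformations \
    metricName=RootAccountUsage,metricNamespace=Security,metricValue=1,defaultValue=0 \
  --region us-east-1
```

An alarm on RootAccountUsage with a threshold of zero then pages you the moment anyone signs in as root.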
Can CloudWatch automatically fix problems, or does it only alert?
It can do both. CloudWatch Alarms support three types of automatic actions beyond notifications: EC2 actions (automatically stop, reboot, or recover an instance when an alarm fires), Auto Scaling actions (scale out more instances when CPU is high, scale in when it's low), and SNS notifications that trigger Lambda functions — which can then do anything you can code, from restarting a service to rolling back a deployment. The most powerful setups combine all three for genuinely self-healing infrastructure.
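As a hedged sketch of the EC2 recover action (the instance ID and region are placeholders), the alarm below watches the system status check and uses the special arn:aws:automate ARN as its action, which migrates the instance to healthy hardware with no human involved:

```shell
# Self-healing sketch: automatically recover an EC2 instance when the
# underlying host fails its system status check for two consecutive minutes.
aws cloudwatch put-metric-alarm \
  --alarm-name "i-0abc123-auto-recover" \
  --namespace AWS/EC2 \
  --metric-name StatusCheckFailed_System \
  --dimensions Name=InstanceId,Value=i-0abc123 \
  --statistic Maximum \
  --period 60 \
  --evaluation-periods 2 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions "arn:aws:automate:us-east-1:ec2:recover" \
  --region us-east-1
```

The recover action only applies to system status checks (host-level failures); instance status check failures, which usually indicate an OS problem, are better handled with a reboot action.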
Written and reviewed by senior developers with real-world experience across enterprise, startup and open-source projects. Every article on TheCodeForge is written to be clear, accurate and genuinely useful — not just SEO filler.