Intermediate 12 min · March 06, 2026

AWS SQS vs SNS — Silent Loss Under Lambda Throttling

SNS retries throttled Lambda twice then drops messages permanently — no DLQ, no alert.

Naren · Founder
Plain-English first. Then code. Then the interview question.
Quick Answer
  • SQS = queue. One message, one consumer. Durable storage (up to 14 days). Use for async task processing, rate limiting, back-pressure.
  • SNS = pub/sub topic. One message, all subscribers. No storage. Use for event broadcasting, fan-out, decoupling producers from consumers.
  • SNS+SQS fan-out = production standard. SNS broadcasts to multiple SQS queues. Each queue durably stores its copy. Never subscribe Lambda directly to SNS in production — SQS in between absorbs throttling.
  • Long polling: always set WaitTimeSeconds=20 in receive_message. Cuts API calls by 95%, drops costs.
  • Dead-letter queue (DLQ): maxReceiveCount=3. Messages that fail processing go to DLQ, not infinite retry. Monitor DLQ depth → that's your bug signal.
  • Reliability trap: SNS to Lambda subscriptions retry twice then drop messages on throttle. SQS queues hold messages safely for days.
Plain-English First

Imagine a busy pizza restaurant. SNS is the manager who shouts 'Order 42 is ready!' — every station (kitchen, cashier, delivery) that cares about that announcement hears it at the same time. SQS is the ticket rail above the grill — each chef grabs one ticket, works through it at their own pace, and the ticket is gone once it's done. One broadcasts, one queues. That's the whole mental model.

Modern applications rarely do one thing at a time. A user places an order and suddenly you need to charge their card, send a confirmation email, update inventory, notify the warehouse, and log an audit trail — all reliably, even if your email service crashes at 2 a.m. That's the problem AWS SQS and SNS were built to solve. They decouple the parts of your system so that a failure in one place doesn't cascade everywhere.

Without messaging services like these, you'd wire services together with direct HTTP calls. Service A calls Service B, which calls Service C. If B is slow, A waits. If C is down, the whole chain breaks. SQS introduces a buffer — a durable queue that holds messages until a consumer is ready to process them. SNS takes a different angle: it lets one event instantly fan out to dozens of subscribers without the publisher needing to know who they are.

By the end of this article you'll know the architectural difference between a queue and a publish-subscribe topic, how to wire SQS and SNS together for a real fan-out pattern, what dead-letter queues are and why you desperately need them, and exactly when to reach for each service in your next cloud project.

SQS — The Durable Message Queue That Saves Your System at 2 A.M.

SQS (Simple Queue Service) is a fully managed message queue. A producer drops a message into the queue, and one or more consumers poll the queue and process messages at their own pace. The key word is 'one' — by default each message is delivered to exactly one consumer. This is point-to-point messaging.

Why does that matter? Because it gives you back-pressure handling for free. If your order-processing service is overwhelmed, messages just pile up in the queue safely. The queue acts as a shock absorber between the part of your system that generates work and the part that does the work.

There are two flavours. Standard queues give you maximum throughput with at-least-once delivery and best-effort ordering — meaning a message might appear twice (rare, but plan for it). FIFO queues guarantee exactly-once processing and strict order, but cap you at 3,000 messages per second with batching. Choose FIFO when order actually matters — financial transactions, state machines. Choose Standard everywhere else.

Messages live in the queue for up to 14 days (the default retention period is 4 days). The visibility timeout is the other critical setting: after a consumer picks up a message, it becomes invisible to other consumers for that window. If your Lambda or EC2 worker crashes mid-process, the message reappears and gets retried. That's your built-in retry mechanism.

Long polling is the single most impactful cost-saving setting. Always set WaitTimeSeconds=20 in your receive_message calls. Without it, your consumer uses short polling: the call returns immediately even if no messages exist. You pay per API call, so a consumer looping against a quiet queue racks up thousands of empty, billed calls. Long polling holds the connection for up to 20 seconds, waiting for a message. On a queue that gets one message per minute, that cuts API calls by 95% or more compared with a tight short-polling loop.
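As a rough illustration of the long-polling consumer described above, here is a minimal sketch. The function name drain_queue_once and the handle callback are hypothetical; the client is passed in so the logic can be exercised without AWS credentials. In real use you would pass boto3.client('sqs', region_name='us-east-1').

```python
import json
from typing import Callable


def drain_queue_once(sqs_client, queue_url: str, handle: Callable[[dict], None]) -> int:
    """Long-poll the queue once, process each message, delete on success.

    Returns the number of messages processed successfully.
    """
    response = sqs_client.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,  # fetch up to 10 messages per call
        WaitTimeSeconds=20,      # long polling: wait up to 20 s for a message
    )
    processed = 0
    for message in response.get('Messages', []):
        try:
            handle(json.loads(message['Body']))
        except Exception as error:
            # Do not delete: the message reappears after the visibility
            # timeout and is retried (or moved to a DLQ if one is configured).
            print(f"[CONSUMER] Failed {message['MessageId']}: {error}")
            continue
        # Delete only after the handler finished without raising.
        sqs_client.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=message['ReceiptHandle'],
        )
        processed += 1
    return processed
```

Deleting only after the handler succeeds is what makes the visibility timeout act as a retry mechanism: a crash or exception leaves the message in the queue.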

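sqs_producer_consumer.py · PYTHON
import boto3
import json

# --- PRODUCER: Order Service drops a new order into the queue ---

sqs_client = boto3.client('sqs', region_name='us-east-1')

ORDER_QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/order-processing-queue'

# NOTE: the original listing is cut off after "order_id: str". The remaining
# parameters and the body are a reconstruction inferred from the sample output.
def place_order(order_id: str, customer_email: str, item_count: int) -> str:
    """Send one order message to the queue and return its SQS message ID."""
    response = sqs_client.send_message(
        QueueUrl=ORDER_QUEUE_URL,
        MessageBody=json.dumps({'order_id': order_id, 'customer_email': customer_email, 'item_count': item_count}),
    )
    print(f"[PRODUCER] Message sent. SQS Message ID: {response['MessageId']}")
    return response['MessageId']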
Output
[PRODUCER] Message sent. SQS Message ID: 4e2b1a3f-7c9d-4f1e-b2a0-8d3e5f6c7b8a
[CONSUMER] Polling for messages...
[CONSUMER] Processing order ORD-9981 for alex@example.com
-> Charging card and packing 2 item(s)...
[CONSUMER] Order ORD-9981 done. Message deleted.
Short Polling Will Drain Your Wallet
If you omit WaitTimeSeconds or set it to 0, SQS uses short polling. Your consumer hammers the API with empty responses and you pay for every single API call. Always set WaitTimeSeconds=20 (the maximum). It's a one-line change that cuts polling costs by up to 95% on quiet queues.
Production Insight
A team deployed an SQS consumer with WaitTimeSeconds not set (defaults to 0). The queue received 1 message per minute. Their consumer polled every 0.1 seconds (10 times per second). Daily API calls: 864,000, or about 25.9 million per month. At $0.40 per million requests, that is roughly $10 per month spent on polling an almost empty queue.
After fixing: WaitTimeSeconds=20. Receive calls dropped to about 3 per minute (one long poll every 20 seconds). Daily calls: 4,320, or about 130,000 per month, which costs roughly $0.05 at the same rate. A 99.5% reduction in API calls with one line of code.
Rule: Always set WaitTimeSeconds=20 in every receive_message call. For Lambda with SQS event source mapping, this is handled automatically — but for EC2 workers or on-prem consumers, you must explicitly set it.
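The call counts in this incident can be checked with a few lines of arithmetic. This is only a sketch of the calculation; the helper name daily_receive_calls is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_receive_calls(calls_per_second: float) -> int:
    """Number of ReceiveMessage calls a single consumer makes per day."""
    return round(calls_per_second * SECONDS_PER_DAY)


short_polling = daily_receive_calls(10)     # one call every 0.1 s
long_polling = daily_receive_calls(3 / 60)  # about 3 calls per minute
reduction = 1 - long_polling / short_polling

print(short_polling, long_polling, f"{reduction:.1%}")
# prints: 864000 4320 99.5%
```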
Key Takeaway
SQS = durable queue. Message stored 14 days. One consumer per message.
Standard = high throughput (unlimited), at-least-once. FIFO = 3k msg/s, exactly-once.
Visibility timeout = built-in retry. Message reappears if consumer crashes.
Long polling (WaitTimeSeconds=20) = 95% cost reduction. Always use it.
Choose Standard vs FIFO SQS
If: Message order matters (financial transaction, state machine step, audit trail)
Use: FIFO queue. Exactly-once processing, strict ordering. 3,000 msg/sec max with batching.
If: High throughput (>3,000 msg/sec) or order doesn't matter
Use: Standard queue. Unlimited throughput, at-least-once delivery. Plan for duplicates.
If: Duplicate processing would break the system (e.g., double payment)
Use: FIFO queue, and keep the consumer idempotent as a second line of defence, since a message can still be redelivered if it is not deleted before the visibility timeout expires.
If: You need no duplicates and can accept lower throughput
Use: FIFO with a message deduplication ID generated from the order ID. SQS deduplicates within a 5-minute window.
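A minimal sketch of the last option, assuming a FIFO queue already exists. The function name send_order_fifo and the choice of customer ID as the message group are illustrative assumptions, not something the article prescribes; the client is injected so the example runs without AWS access.

```python
import json


def send_order_fifo(sqs_client, queue_url: str, order_id: str, customer_id: str) -> str:
    """Send an order to a FIFO queue, deduplicated on the order ID."""
    response = sqs_client.send_message(
        QueueUrl=queue_url,  # FIFO queue names end in ".fifo"
        MessageBody=json.dumps({'order_id': order_id, 'customer_id': customer_id}),
        # Ordering is preserved within a message group (assumption: one group per customer).
        MessageGroupId=customer_id,
        # A resend with the same ID inside the deduplication window is not delivered again.
        MessageDeduplicationId=order_id,
    )
    return response['MessageId']
```

Deriving the deduplication ID from a business key such as the order ID means a producer retry after a network timeout does not create a second message.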

SNS — The Pub/Sub Megaphone That Notifies Everyone at Once

SNS (Simple Notification Service) works on the publish-subscribe model. You publish one message to a Topic, and SNS fans it out simultaneously to every subscriber — SQS queues, Lambda functions, HTTP endpoints, email addresses, mobile push notifications. The publisher has zero knowledge of who's listening. Adding a new subscriber doesn't touch the publisher at all.

This is the architectural superpower. Imagine your user-signup event needs to trigger a welcome email, a CRM record creation, a Slack notification to your growth team, and an analytics event. With SNS, your Auth service publishes one 'UserRegistered' message and walks away. Four independent services consume it in parallel.

Message filtering is what takes SNS from useful to essential. Instead of every subscriber receiving every message on a topic, you attach a filter policy to a subscription. Your EU payments service can subscribe to the 'transactions' topic but only receive messages where region=EU. This keeps each service focused on what it actually cares about.
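The region=EU example could be wired up roughly as follows. This is a sketch under assumptions: the function names are hypothetical, the subscriber is an SQS queue, and region is carried as a message attribute (the default scope for filter policies). Pass boto3.client('sns') as sns_client in real use.

```python
import json


def subscribe_eu_only(sns_client, topic_arn: str, queue_arn: str) -> str:
    """Subscribe a queue that only receives messages whose region attribute is EU."""
    response = sns_client.subscribe(
        TopicArn=topic_arn,
        Protocol='sqs',
        Endpoint=queue_arn,
        Attributes={'FilterPolicy': json.dumps({'region': ['EU']})},
        ReturnSubscriptionArn=True,
    )
    return response['SubscriptionArn']


def publish_transaction(sns_client, topic_arn: str, payload: dict, region: str) -> str:
    """Publish a message and attach the attribute the filter policy matches on."""
    response = sns_client.publish(
        TopicArn=topic_arn,
        Message=json.dumps(payload),
        MessageAttributes={
            'region': {'DataType': 'String', 'StringValue': region},
        },
    )
    return response['MessageId']
```

The publisher still sends every message to the same topic; SNS evaluates the policy per subscription and skips deliveries that do not match.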

SNS does not store messages. If a subscriber is down when the message arrives, that message is gone unless the subscriber is an SQS queue (which durably stores it). That's the most important SNS limitation to internalise — and it leads directly to the most powerful pattern: SNS + SQS fan-out.

SNS delivery logging is a critical debugging tool that most teams don't enable. It publishes delivery attempts, failures, and throttling events to CloudWatch Logs. When messages go missing, this is the first place to look. Enable it on every production SNS topic.

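sns_fanout_publisher.py · PYTHON
import boto3
import json

sns_client = boto3.client('sns', region_name='us-east-1')

USER_EVENTS_TOPIC_ARN = 'arn:aws:sns:us-east-1:123456789012:user-lifecycle-events'

# NOTE: the original listing is cut off after "user_id: str". The email parameter and the body are a reconstruction.
def publish_user_registered_event(user_id: str, email: str) -> str:
    """Publish one UserRegistered event; SNS fans it out to every subscriber."""
    response = sns_client.publish(
        TopicArn=USER_EVENTS_TOPIC_ARN,
        Message=json.dumps({'event_type': 'UserRegistered', 'user_id': user_id, 'email': email}),
    )
    return response['MessageId']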

Dead-Letter Queues and the SNS+SQS Fan-Out Pattern — Production Essentials

Two patterns separate a toy cloud setup from a production-grade one: dead-letter queues (DLQs) and the SNS+SQS fan-out architecture. You need to understand both.

A DLQ is just another SQS queue. You configure it on your main queue and set maxReceiveCount, say, to 3. If a message fails processing 3 times, SQS automatically moves it to the DLQ instead of retrying it until the retention period runs out. With an alarm on the DLQ, your team gets alerted, investigates the poison message, and the rest of your queue keeps flowing normally. Without a DLQ, one bad message can block your queue or create a long-running retry storm.

The fan-out pattern solves SNS's biggest weakness — no durability. The rule is: never subscribe a Lambda directly to SNS in production if message loss is unacceptable. Instead, subscribe an SQS queue to the SNS topic. The queue durably catches every message. Your Lambda then polls the queue. You get SNS's broadcasting power AND SQS's durability and retry logic together. This is the architectural backbone of most event-driven AWS systems.

The code example below wires both patterns together with IaC-style Boto3 calls, showing exactly how a DLQ connects to a main queue.

Monitoring the DLQ: Set a CloudWatch alarm on the ApproximateNumberOfMessagesVisible metric > 0 for your DLQ. A message in the DLQ means your consumer failed to process it after maxReceiveCount attempts. That's a bug; it shouldn't be ignored or silently deleted. Your on-call should get a page.
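One way to create that alarm from Python is sketched below. The function name and the alarm naming scheme are assumptions; the CloudWatch metric for visible queue depth is ApproximateNumberOfMessagesVisible in the AWS/SQS namespace. Pass boto3.client('cloudwatch') as cloudwatch_client in real use.

```python
def create_dlq_alarm(cloudwatch_client, dlq_name: str, alarm_topic_arn: str) -> str:
    """Alarm as soon as at least one message is visible in the DLQ."""
    alarm_name = f'{dlq_name}-not-empty'  # hypothetical naming convention
    cloudwatch_client.put_metric_alarm(
        AlarmName=alarm_name,
        Namespace='AWS/SQS',
        MetricName='ApproximateNumberOfMessagesVisible',
        Dimensions=[{'Name': 'QueueName', 'Value': dlq_name}],
        Statistic='Maximum',
        Period=300,                # evaluate over 5-minute windows
        EvaluationPeriods=1,
        Threshold=0,
        ComparisonOperator='GreaterThanThreshold',
        AlarmActions=[alarm_topic_arn],  # e.g. an SNS topic that pages on-call
    )
    return alarm_name
```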

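dlq_and_fanout_setup.py · PYTHON
import boto3
import json

sqs_client = boto3.client('sqs', region_name='us-east-1')
sns_client = boto3.client('sns', region_name='us-east-1')


# NOTE: the original listing is cut off mid-signature; dlq_name and the body are reconstructed from the sample output.
def create_queue_with_dlq(main_queue_name: str, dlq_name: str) -> str:
    """Create the DLQ, then a main queue that moves messages to it after 3 failed receives."""
    dlq_url = sqs_client.create_queue(QueueName=dlq_name)['QueueUrl']
    dlq_arn = sqs_client.get_queue_attributes(QueueUrl=dlq_url, AttributeNames=['QueueArn'])['Attributes']['QueueArn']
    redrive_policy = json.dumps({'deadLetterTargetArn': dlq_arn, 'maxReceiveCount': '3'})
    response = sqs_client.create_queue(QueueName=main_queue_name, Attributes={'RedrivePolicy': redrive_policy})
    return response['QueueUrl']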
Output
[SETUP] DLQ created: https://sqs.us-east-1.amazonaws.com/123456789012/order-fulfilment-dlq
[SETUP] Main queue created with DLQ attached: https://sqs.us-east-1.amazonaws.com/123456789012/order-fulfilment-queue
[SETUP] SQS access policy updated to allow SNS.
[SETUP] Fan-out wired. Subscription ARN: arn:aws:sns:us-east-1:123456789012:order-events:b9c8d7e6-f5a4-3b2c-1d0e-9f8a7b6c5d4e
[READY] Production fan-out pattern active:
SNS Topic -> SQS Queue (durable) -> Lambda/Worker
Failed messages -> DLQ after 3 attempts
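The last two setup steps in that output (allowing SNS to write to the queue, then subscribing the queue to the topic) might look like the sketch below. The function name wire_fanout is hypothetical, and the clients are passed in so the logic can be tested without AWS; in real use they would be boto3 SQS and SNS clients.

```python
import json


def wire_fanout(sqs_client, sns_client, topic_arn: str, queue_url: str) -> str:
    """Let one SNS topic send to the queue, then subscribe the queue to it."""
    queue_arn = sqs_client.get_queue_attributes(
        QueueUrl=queue_url, AttributeNames=['QueueArn']
    )['Attributes']['QueueArn']

    # Without this access policy, SNS deliveries to the queue are rejected.
    policy = {
        'Version': '2012-10-17',
        'Statement': [{
            'Sid': 'AllowSNSToSendToSQS',
            'Effect': 'Allow',
            'Principal': {'Service': 'sns.amazonaws.com'},
            'Action': 'sqs:SendMessage',
            'Resource': queue_arn,
            # Restrict senders to this one topic.
            'Condition': {'ArnEquals': {'aws:SourceArn': topic_arn}},
        }],
    }
    sqs_client.set_queue_attributes(
        QueueUrl=queue_url,
        Attributes={'Policy': json.dumps(policy)},
    )

    response = sns_client.subscribe(
        TopicArn=topic_arn,
        Protocol='sqs',
        Endpoint=queue_arn,
        ReturnSubscriptionArn=True,
    )
    return response['SubscriptionArn']
```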
Interview Gold: Why SNS+SQS Instead of SNS Directly to Lambda?
Lambda can subscribe to SNS directly, but if Lambda throttles (hits concurrency limits), SNS retries only twice then drops the message — gone forever. With an SQS queue in between, messages wait safely in the queue until Lambda capacity frees up. The queue absorbs the spike. This is the canonical answer to 'how do you handle Lambda throttling in an event-driven system?'
Production Insight
A team had a critical event-driven workflow with SNS → Lambda. Their Lambda concurrency limit was 100. During a marketing campaign, traffic spiked to 500 concurrent events. SNS retried each throttled Lambda invocation twice, then dropped the 300 messages that were still being throttled. Three hundred high-value customer signups were lost. No alarm, no error log, no trace.
The marketing team reported 300 fewer signups than expected. Engineering spent 2 days tracing through CloudWatch logs to find the SNS delivery failures.
Root cause: SNS attempts delivery to a throttled Lambda only 3 times in total (initial attempt + 2 retries), then gives up. The team didn't know this limit.
Fix: Moved to SNS → SQS → Lambda with SQS event source mapping. The SQS queue now has 14 days of durability. During spikes, messages queue safely. No more drops.
Rule: If you can't afford to lose a single message, never subscribe Lambda directly to SNS. The SQS buffer is not optional — it's mandatory.
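On the Lambda side of SNS → SQS → Lambda, a handler might look like the sketch below. It assumes the event source mapping has ReportBatchItemFailures enabled, so that only failed records return to the queue; process_signup is a hypothetical stand-in for the real business logic.

```python
import json


def process_signup(payload: dict) -> None:
    """Hypothetical business logic: raise to signal a failed record."""
    if 'user_id' not in payload:
        raise ValueError('missing user_id')


def handler(event, context):
    """Process an SQS batch and report which records should be retried."""
    failures = []
    for record in event['Records']:
        try:
            envelope = json.loads(record['body'])
            # Without raw message delivery, SNS wraps the payload in an
            # envelope whose "Message" field holds the original string.
            payload = json.loads(envelope['Message']) if 'Message' in envelope else envelope
            process_signup(payload)
        except Exception:
            # Only this record becomes visible again in the queue.
            failures.append({'itemIdentifier': record['messageId']})
    return {'batchItemFailures': failures}
```

Records that keep failing are retried until maxReceiveCount is reached and then land in the DLQ, while successfully processed records in the same batch are not reprocessed.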
Key Takeaway
DLQ = failed messages after maxReceiveCount attempts. maxReceiveCount=3.
Monitor DLQ depth with CloudWatch alarm. Message in DLQ = bug, not noise.
Fan-out pattern: SNS → SQS (durable) → Lambda. Always for critical messages.
SNS direct to Lambda = at-most-once delivery when throttling hits. Assume message loss.

SQS vs SNS vs EventBridge — A Three-Way Decision Table

When you need more than basic pub/sub or queuing, AWS EventBridge enters the picture. It's not a replacement for SQS or SNS — it sits above them, offering a central event bus with advanced routing, schema registry, and integration with third-party SaaS events. Here's a decision table to clarify when to pick each:

Feature | SQS | SNS | EventBridge
Messaging model | Point-to-point queue | Pub/sub topic | Event bus (pub/sub + routing)
Durability | Yes, up to 14 days | No (unless subscriber is SQS) | Yes, 24-hour default, configurable up to 3 days
Throughput | Unlimited (Standard) / 3,000 msg/s (FIFO) | 300 publishes/s (default, adjustable) | 5,000 events/s per bus (adjustable)
Filtering | Consumer-side | Server-side subscription filters (attribute-based) | Rich content-based filtering (JSONPath, prefix, suffix, anything, exists)
Ordering | Best-effort / FIFO | No ordering | No ordering (use replay or custom)
Payload size | Up to 256 KB | Up to 256 KB | Up to 256 KB
Pricing | Pay per request & data transfer | Pay per publish & delivery attempts | Pay per event ingested & delivered (higher per-event cost)
Third-party integrations | None native | Email, SMS, mobile push | SaaS apps (Zendesk, Datadog, PagerDuty, 200+ built-in sources)
Schema registry | No | No | Yes: schema discovery & code generation
Replay events | No | No | Yes: archive and replay events up to 14 days

When to pick EventBridge over SNS:
  • You need complex content-based filtering (e.g., "order.total > 100 and order.region != 'US'")
  • You want to ingest events from third-party SaaS providers (GitHub, Shopify, etc.)
  • You need event replay for debugging or disaster recovery
  • You want automatic schema discovery to generate strongly typed code
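As a sketch of the content-based filtering case, the example condition (total above 100, region not US) can be expressed as an EventBridge event pattern. The rule name, the function name, and the assumption that order fields sit under detail.order are all illustrative; pass boto3.client('events') as events_client in real use.

```python
import json


def create_high_value_non_us_rule(events_client, bus_name: str) -> str:
    """Create a rule matching orders with total > 100 outside the US."""
    pattern = {
        'detail': {
            'order': {
                'total': [{'numeric': ['>', 100]}],
                'region': [{'anything-but': 'US'}],
            }
        }
    }
    response = events_client.put_rule(
        Name='high-value-non-us-orders',  # hypothetical rule name
        EventBusName=bus_name,
        EventPattern=json.dumps(pattern),
        State='ENABLED',
    )
    return response['RuleArn']
```

The rule only selects events; targets such as an SQS queue or Lambda function are attached separately.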

When to stick with SNS+SQS:
  • You need FIFO ordering or exactly-once processing (EventBridge doesn't support FIFO)
  • Your throughput is very high and you want the lowest per-message cost
  • You need 14-day message retention (EventBridge max is 3 days for custom events)
  • You need the simplicity of a direct queue (SQS) without event bus complexity

The rule of thumb: SNS+SQS covers 80% of event-driven use cases. EventBridge is worth the extra cost and complexity when you need its advanced routing, third-party integration, or replay capabilities.

EventBridge Pricing Surprise
EventBridge costs $1 per million events delivered. SNS costs $0.50 per million publishes plus $0.50 per million deliveries. Because SNS bills every delivery, its cost grows with the number of subscribers; with a single subscriber it comes to the same $1 per million. Always model your expected volume and number of subscribers before choosing.
Production Insight
A team migrated from SNS+SQS to EventBridge for their order processing system. They needed content-based filtering to route orders by region (EU vs US) and by product category. With SNS, they had to create separate topics and duplicate subscriptions. EventBridge let them define one rule with a single filter expression ("region = 'EU' AND category = 'electronics'").
However, their order volume was 10 million events per month. At $1 per million events, EventBridge cost them $10/month for the bus, regardless of subscribers. At the SNS rates quoted above, SNS would have cost $5/month for publishes plus $15/month for 30 million deliveries to 3 queues = $20/month. As they add more subscriber queues, SNS cost increases linearly, while EventBridge stays flat. At 10 subscribers, SNS would cost $55/month, EventBridge still $10/month.
Rule: at these rates, every additional subscriber adds $5 per 10 million events to the SNS bill while the EventBridge charge stays flat, so compare the two at your actual subscriber count, even before considering EventBridge's advanced filtering features.
Key Takeaway
SNS+SQS is simpler and cheaper for basic fan-out. EventBridge adds content filtering, third-party sources, and replay at higher per-event cost. Pick based on filtering complexity and subscriber count.

Pros and Cons of SQS and SNS

Every service has trade-offs. Here's a clear-eyed look at what SQS and SNS do well, and where they fall short.

SQS: Advantages
  • Durable 14-day message storage with automatic retries via visibility timeout
  • At-least-once delivery (Standard) or exactly-once (FIFO)
  • Unlimited throughput with Standard queues
  • Built-in dead-letter queue support
  • Low cost per API request, especially with long polling
  • Supports batch operations (up to 10 messages per receive, 10 per send)

SQS: Disadvantages
  • No built-in one-to-many fan-out (you need SNS for that)
  • Consumer must poll the queue, adding latency and cost if not optimized
  • Max message size 256 KB (need S3 large-payload solution for bigger)
  • Ordering only guaranteed with FIFO (limited throughput)
  • No content-based server-side filtering

SNS: Advantages
  • Instant pub/sub fan-out: one message reaches all subscribers simultaneously
  • Multiple subscriber types (SQS, Lambda, HTTP, email, SMS, mobile push)
  • Server-side filter policies reduce unnecessary deliveries
  • No polling overhead for subscribers (push-based)
  • Simple pay-per-use pricing

SNS: Disadvantages
  • Messages are NOT stored: if a subscriber is down, the message is lost
  • Limited retries (3 attempts only for HTTP/Lambda subscribers)
  • No ordering guarantees
  • Max message size 256 KB
  • No DLQ for failed deliveries by default (a redrive policy has to be attached to each subscription; an SQS subscriber is the simpler way to get retries)
  • Filtering works on message attributes by default (filtering on the body requires setting FilterPolicyScope to MessageBody)

The real insight: SNS's biggest disadvantage (no durability) disappears when it is paired with SQS. The combination covers each service's weakness. Never use SNS alone for critical events; always buffer with SQS.

SNS + HTTP: Watch Out for Delivery Failures
SNS retries HTTP endpoints only 3 times with exponential backoff. If your endpoint is down for more than a few seconds, you lose the message. Always subscribe an SQS queue to SNS and have your HTTP endpoint poll the queue, or use a webhook with built-in retries.
Production Insight
A team used SNS with an HTTP subscriber for a real-time notification service. The HTTP endpoint processed the message and returned 200 immediately. But if the endpoint was slow (e.g., database contention), it might time out after 2 seconds. SNS would retry twice more, then discard. The team lost 5% of notifications due to transient delays.
Fix: They placed an SQS queue between SNS and the HTTP endpoint. The endpoint polled the queue, processed each message with a 30-second timeout, and deleted on success. Even if the endpoint was slow, the message stayed in the queue for retry.
Rule: HTTP endpoints behind SNS are fragile. Use an SQS subscriber to make them robust.
Key Takeaway
SQS excels at durability and retries. SNS excels at fan-out. Use SNS+SQS together to get both. Never rely on SNS alone for critical message delivery.

Pricing Comparison — Standard vs FIFO, Per-Million Request Costs

Understanding cost at scale is critical. Here's the pricing breakdown for SQS, SNS, and the common patterns.

SQS Pricing (as of 2026)

Queue Type | Request Pricing | Data Transfer Pricing | Free Tier
Standard | $0.40 per million requests | $0.09 per GB after first 1 GB/month | 1 million requests free per month
FIFO | $0.50 per million requests | $0.09 per GB after first 1 GB/month | 1 million requests free per month

Notes on SQS requests:
  • A “request” is any API call: SendMessage, ReceiveMessage, DeleteMessage, ChangeMessageVisibility, etc.
  • Long polling (WaitTimeSeconds=20) counts as one request per 20-second call, even if the response is empty.
  • Batch operations count as one request per batch of up to 10 messages.

SNS Pricing (as of 2026)

Topic Type | Publish Pricing | Delivery Pricing | Free Tier
Standard | $0.50 per million publishes | $0.50 per million deliveries across all subscribers | 1 million publishes free per month
FIFO | $1.10 per million publishes | $0.50 per million deliveries across all subscribers | Not included in free tier (pay per use)

Notes on SNS delivery:
  • Each subscriber receives a copy of the message. If you have 5 SQS subscribers and publish 1 million messages, you pay for 1 million publishes + 5 million deliveries = $3.00 ($0.50 + $2.50).
  • For FIFO topics, the higher publish price reflects the ordering guarantee.

Comparison Scenario: Fan-out to 3 SQS queues. Assume 10 million messages per month.

  • SNS+SQS fan-out: 10M publishes = $5.00. 30M deliveries = $15.00. SQS request cost: 10M sends to each queue = 30M send requests = $12.00. Consumer polling: ~3.6M receive calls with long polling (10M messages / 10 batch size * 3.6 polls per message) = $1.44. Total: ~$33.44.
  • Direct Lambda subscriptions: 10M publishes = $5.00. SNS delivery to Lambda adds nothing beyond the publish cost (Lambda's own invocation charges still apply), so this looks cheapest on paper. But you risk message loss (see the production incident in this article). Not recommended for critical data.
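  • EventBridge bus: 10M events ingested = $10.00. At the flat $1 per million rate used elsewhere in this article, delivering to 3 rule targets adds no per-delivery charge from the bus. Total: $10.00, plus SQS request costs if the targets are queues.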
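The SNS+SQS fan-out figure in this scenario can be reproduced with a small cost model. This is a sketch that hard-codes the rates from the tables in this section as assumptions (they may not match current AWS pricing); the function name is made up for illustration, and free-tier allowances and delete requests are ignored, as in the scenario itself.

```python
# Rates per million, taken from the pricing tables in this section (assumptions).
SNS_PUBLISH_PER_M = 0.50
SNS_DELIVERY_PER_M = 0.50
SQS_REQUEST_PER_M = 0.40


def fanout_monthly_cost(messages: int, queues: int, receive_calls: int) -> float:
    """Monthly cost in dollars of SNS fan-out to a number of SQS queues."""
    millions = messages / 1_000_000
    publish = millions * SNS_PUBLISH_PER_M
    delivery = millions * queues * SNS_DELIVERY_PER_M
    sends = millions * queues * SQS_REQUEST_PER_M
    receives = receive_calls / 1_000_000 * SQS_REQUEST_PER_M
    return round(publish + delivery + sends + receives, 2)


# 10M messages, 3 queues, ~3.6M long-polling receive calls.
print(fanout_monthly_cost(10_000_000, 3, 3_600_000))
```

Changing the queues argument shows how the delivery and send components scale linearly with the number of subscribers.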

Cost-saving tips:
1. Always use long polling on SQS consumers to reduce empty receive requests.
2. Batch send messages to SQS (up to 10 per request) to cut send costs by 90%.
3. Use SNS standard for high-volume fan-out; FIFO only when ordering is required.
4. Monitor SQS API usage with CloudWatch metrics. Set budgets and alarms.

Pricing is subject to change. Check the official AWS pricing page for the latest figures.

FIFO Pricing Premium
FIFO queues cost about 25% more per request than Standard. FIFO SNS topics cost 120% more per publish. Only use FIFO when message order and exactly-once processing are truly required. For most use cases, Standard queues plus idempotent consumers are cheaper and simpler.
Production Insight
A startup processed 50 million events per month with SNS+SQS fan-out to 2 queues. They were using FIFO for the main queue out of habit. Monthly cost: ~$170. They realized they didn't need ordering — consumers were idempotent and deduplicated by event ID. Switching to Standard queues cut costs to ~$90 per month, saving 47%. Always validate your ordering requirement before choosing FIFO.
Key Takeaway
SQS Standard and SNS Standard are cheapest. FIFO adds a premium. Use Standard unless you need ordering. Batch requests to reduce API call costs. Long polling is the biggest single cost-saver for SQS.

When to Consider Amazon MQ Instead of SQS or SNS

Amazon MQ is a fully managed message broker service that supports industry-standard protocols: MQTT, AMQP, STOMP, OpenWire, and JMS. It's the cloud version of Apache ActiveMQ and RabbitMQ. When should you reach for it instead of SQS/SNS?

Amazon MQ is the right choice when:
  • You're migrating an existing on-premises application that already uses JMS, AMQP, or MQTT. Rewriting everything to use SQS/SNS would be too risky or time-consuming.
  • You need advanced message routing beyond what SNS offers, like topics with wildcards, virtual topics, or message selectors (JMS).
  • You need transactional messaging across multiple queues (e.g., send to queue A and queue B atomically).
  • You need lower latency for real-time communication. SQS has a polling model that introduces latency; Amazon MQ's push-based delivery can be faster for time-sensitive workloads.
  • Your application requires specific features like scheduled messages, delayed delivery, message groups with flexible ordering, or custom dead-letter strategies at the broker level.

Amazon MQ is NOT the right choice when:
  • You want fully serverless, no infrastructure management. Amazon MQ requires you to manage broker instances (though it automates patching and failover). SQS and SNS are fully serverless.
  • Your throughput needs are extremely high. SQS Standard scales to unlimited. Amazon MQ throughput is bounded by the broker instance size and its connection limits.
  • Your payload size is small (under 256 KB). SQS and SNS handle this natively without needing message chunking.
  • You need exactly-once processing. SQS FIFO provides this; Amazon MQ requires idempotent consumers and broker-level deduplication, which is not as straightforward.

Cost comparison: Amazon MQ instances start at around $30/month for a small broker and increase with size. In contrast, SQS and SNS pay-per-use costs are negligible for low volume but grow linearly. Above about 100 million messages per month, Amazon MQ may become cheaper than SQS's API request costs, but only if you use the broker efficiently.

Decision rule: Use SQS/SNS for cloud-native applications where serverless is a priority. Use Amazon MQ for migrations, when you need JMS/AMQP compatibility, or when you require advanced broker-level features. If you're starting from scratch and don't have a legacy protocol requirement, SQS+SNS is almost always the simpler, more cost-effective choice.

Amazon MQ is Not Serverless
You provision and pay for an Amazon MQ broker instance 24/7, even if you send zero messages. SQS and SNS charge only for what you use. If your workload is bursty, SQS will be far cheaper. For steady-state high throughput, Amazon MQ might save money.
Production Insight
A financial services company migrated from an on-premises RabbitMQ cluster to Amazon MQ to reduce maintenance overhead. They had hundreds of queues and complex routing rules (message selectors, header exchanges). SNS's simple attribute filtering couldn't match the flexibility they needed. Amazon MQ gave them the same broker semantics in the cloud, and they could keep their existing Java/JMS client code with minimal changes. The migration took 3 weeks, compared with an estimated 6 months had they rewritten everything for SQS/SNS.
But they pay about $400/month for a moderate broker instance. Their message volume is 10 million per month. With SQS/SNS, they'd pay ~$30/month. The $370 premium is their cost for protocol compatibility and zero code changes — a worthwhile trade-off for a risk-averse migration.
Key Takeaway
Amazon MQ = migration-friendly, protocol-compatible, but not serverless. SQS/SNS = cloud-native, cheaper for bursty/low volume, simpler architecture. Choose based on your application's existing protocol and need for advanced routing.

Infrastructure as Code — Terraform SNS+SQS Fan-Out Setup

main.tf · HCL
provider "aws" {
  region = "us-east-1"
}

# --- SNS Topic ---
resource "aws_sns_topic" "order_events" {
  name = "order-lifecycle-events"
}

# --- Main SQS Queue (consumer will poll this) ---
resource "aws_sqs_queue" "order_processing_queue" {
  name                       = "order-processing-queue"
  visibility_timeout_seconds = 60
  delay_seconds              = 0

  # Enable long polling on the queue level (overridden by consumer settings)
  receive_wait_time_seconds = 20

  # Attach Dead-Letter Queue
  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.order_dlq.arn
    maxReceiveCount     = 3
  })
}

# --- Dead-Letter Queue (DLQ) ---
resource "aws_sqs_queue" "order_dlq" {
  name                       = "order-processing-dlq"
  message_retention_seconds  = 1209600  # 14 days
}

# --- SQS Queue Policy: allow SNS to send messages ---
resource "aws_sqs_queue_policy" "allow_sns" {
  queue_url = aws_sqs_queue.order_processing_queue.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "AllowSNSToSendToSQS"
        Effect = "Allow"
        Principal = {
          Service = "sns.amazonaws.com"
        }
        Action   = "sqs:SendMessage"
        Resource = aws_sqs_queue.order_processing_queue.arn
        Condition = {
          ArnEquals = {
            "aws:SourceArn" = aws_sns_topic.order_events.arn
          }
        }
      }
    ]
  })
}

# --- Subscribe the SQS queue to the SNS topic ---
resource "aws_sns_topic_subscription" "fanout" {
  topic_arn = aws_sns_topic.order_events.arn
  protocol  = "sqs"
  endpoint  = aws_sqs_queue.order_processing_queue.arn

  # Enable raw message delivery: the SQS message body is your JSON, no SNS wrapper
  raw_message_delivery = true

  # Add a JSON filter if needed, e.g.:
  # filter_policy = jsonencode({ event_type = ["order_placed"] })

  depends_on = [
    aws_sqs_queue_policy.allow_sns
  ]
}

# --- CloudWatch Alarm: monitor DLQ depth ---
resource "aws_cloudwatch_metric_alarm" "dlq_alarm" {
  alarm_name          = "order-dlq-not-empty"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "ApproximateNumberOfMessagesVisible"
  namespace           = "AWS/SQS"
  period              = 300
  statistic           = "Maximum"
  threshold           = 0
  alarm_description   = "This alarm fires when any message lands in the DLQ. Investigate immediately."
  dimensions = {
    QueueName = aws_sqs_queue.order_dlq.name
  }
  alarm_actions = []  # Add SNS topic ARN for notifications here
}

# --- (Optional) FIFO version (requires .fifo suffix) ---
# resource "aws_sqs_queue" "order_fifo" {
#   name                        = "order-processing-queue.fifo"
#   fifo_queue                  = true
#   content_based_deduplication = true
#   visibility_timeout_seconds  = 60
#   receive_wait_time_seconds   = 20
# }

output "sns_topic_arn" {
  value = aws_sns_topic.order_events.arn
}

output "sqs_queue_url" {
  value = aws_sqs_queue.order_processing_queue.id
}

output "dlq_queue_url" {
  value = aws_sqs_queue.order_dlq.id
}
Output
Apply complete! Resources: 6 added.

Outputs:

sns_topic_arn = "arn:aws:sns:us-east-1:123456789012:order-lifecycle-events"
sqs_queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/order-processing-queue"
dlq_queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/order-processing-dlq"

● Production incident · POST-MORTEM · severity: high

Lambda Throttling + SNS = 10,000 Lost Order Events

Symptom
SNS publishes show 'Success' in CloudWatch (message sent). Lambda invocations flat — no errors, just not called. SNS delivery logs show 'Delivery failure — throttled'. CloudWatch SNS metrics show NumberOfNotificationsDelivered < NumberOfMessagesPublished.
Assumption
The team assumed SNS would keep retrying until Lambda had capacity. They didn't know SNS retries only 3 times total (initial + 2 retries) then discards the message. They also didn't know Lambda's reserved concurrency was set too low for a holiday spike.
Root cause
SNS to Lambda subscription: When Lambda throttles (hits concurrency limit), SNS retries twice with exponential backoff, then gives up permanently. Message is lost — no DLQ, no error notification to the publisher. Lambda's concurrency limit was 100. Traffic spike needed 300 concurrent executions. SNS tried to deliver to Lambda 3 times, each time got a throttle response, then dropped the message. No alert because SNS 'successfully' published to the topic — the failure happened at the subscription layer, not the publish layer.
Fix
1. Changed the architecture to SNS → SQS queue → Lambda.
  • The SNS topic fans out to an SQS queue.
  • Lambda now reads from the queue through an event source mapping, which polls on the function's behalf.
  • The queue holds messages during throttling for the full retention period (up to 14 days).
2. Created a dead-letter queue on the SQS queue with maxReceiveCount=3.
  • Messages that fail after 3 receive attempts go to the DLQ.
  • The team monitors DLQ depth as a metric.
  • No more silent drops.
3. Increased Lambda reserved concurrency to 500.
4. Added a CloudWatch alarm on ApproximateAgeOfOldestMessage > 5 minutes on the SQS queue.
Rule: If you can't lose the message, never subscribe Lambda directly to SNS in production. Always use SQS as the durable buffer.
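The first two steps of the fix can be sketched with boto3-style calls. This is a minimal sketch under assumptions, not the team's actual code: the helper names (`build_redrive_policy`, `queue_arn_for`, `wire_queue_with_dlq`), the `-dlq` naming suffix, and the 14-day DLQ retention are illustrative. The `sqs` and `sns` arguments are expected to be boto3 clients, passed in so the function can also be exercised with stubs. The queue additionally needs an access policy that lets the topic call `sqs:SendMessage`; that part is left out to keep the sketch short.

```python
import json


def build_redrive_policy(dlq_arn, max_receive_count=3):
    """Return the RedrivePolicy attribute value, which SQS expects as a JSON string."""
    return json.dumps(
        {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": max_receive_count}
    )


def queue_arn_for(sqs, queue_url):
    """Look up a queue's ARN from its URL."""
    attrs = sqs.get_queue_attributes(QueueUrl=queue_url, AttributeNames=["QueueArn"])
    return attrs["Attributes"]["QueueArn"]


def wire_queue_with_dlq(sqs, sns, topic_arn, queue_name):
    """Create a DLQ and a main queue, then subscribe the main queue to the topic."""
    # Hypothetical naming convention: "<queue_name>-dlq".
    dlq_url = sqs.create_queue(
        QueueName=queue_name + "-dlq",
        Attributes={"MessageRetentionPeriod": "1209600"},  # 14 days, the SQS maximum
    )["QueueUrl"]
    dlq_arn = queue_arn_for(sqs, dlq_url)

    # After 3 failed receives, SQS moves the message to the DLQ.
    queue_url = sqs.create_queue(
        QueueName=queue_name,
        Attributes={"RedrivePolicy": build_redrive_policy(dlq_arn)},
    )["QueueUrl"]
    queue_arn = queue_arn_for(sqs, queue_url)

    subscription_arn = sns.subscribe(
        TopicArn=topic_arn,
        Protocol="sqs",
        Endpoint=queue_arn,
        ReturnSubscriptionArn=True,
    )["SubscriptionArn"]

    return {
        "queue_url": queue_url,
        "dlq_url": dlq_url,
        "subscription_arn": subscription_arn,
    }
```

With real clients this would be called as `wire_queue_with_dlq(boto3.client("sqs"), boto3.client("sns"), topic_arn, "order-processing-queue")`. Lambda is then attached to the main queue rather than to the topic.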
Key lesson
  • SNS to Lambda = at-most-once delivery when throttling hits.
  • SNS to SQS = at-least-once delivery + durable storage.
  • Add DLQ to every SQS queue. maxReceiveCount=3.
  • Monitor DLQ depth. A message in DLQ is a service bug.
  • SQS ApproximateAgeOfOldestMessage alarm = early warning system.
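The two monitoring lessons can be expressed as CloudWatch alarms. The sketch below builds the arguments for `put_metric_alarm`. The function names, the alarm naming scheme, and the 300-second period are assumptions made for illustration; `cloudwatch` is expected to be a boto3 CloudWatch client.

```python
def sqs_alarm_kwargs(queue_name, metric_name, threshold):
    """Build put_metric_alarm arguments for one SQS metric on one queue."""
    return {
        "AlarmName": queue_name + "-" + metric_name,  # hypothetical naming scheme
        "Namespace": "AWS/SQS",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
    }


def create_queue_alarms(cloudwatch, queue_name, dlq_name):
    """Create the early-warning alarm on the main queue and the depth alarm on the DLQ."""
    alarms = [
        # Oldest message older than 5 minutes: consumers are lagging.
        sqs_alarm_kwargs(queue_name, "ApproximateAgeOfOldestMessage", 300),
        # Any visible message in the DLQ: something failed processing.
        sqs_alarm_kwargs(dlq_name, "ApproximateNumberOfMessagesVisible", 0),
    ]
    for kwargs in alarms:
        cloudwatch.put_metric_alarm(**kwargs)
    return [kwargs["AlarmName"] for kwargs in alarms]
```

Neither alarm notifies anyone until an `AlarmActions` list (for example, an SNS topic ARN for on-call) is added to the arguments.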
Production debug guide
The 4 most common failure modes and how to find them
Symptom · 01
SNS publishes 'Success' but Lambda never invoked
Fix
Check whether Lambda is subscribed directly to the SNS topic. SNS retries a throttled Lambda only twice, then drops the message. Subscribe an SQS queue to the topic instead and have Lambda consume from the queue.
Symptom · 02
SQS consumer processes same message repeatedly
Fix
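Check for a missing delete_message() call. Also check VisibilityTimeout: if processing takes longer than the timeout, the message reappears and another consumer picks it up. Increase the timeout, or extend it during processing with change_message_visibility (a heartbeat).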
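A consumer that avoids both causes might look like the sketch below. It is a minimal illustration, assuming `sqs` is a boto3 SQS client and `handler` is your own processing function; the names `process_batch` and `extend_visibility` are hypothetical.

```python
def process_batch(sqs, queue_url, handler, visibility_timeout=60):
    """Receive one batch, run handler on each body, delete only what succeeded.

    Returns the number of messages deleted.
    """
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,  # long polling
        VisibilityTimeout=visibility_timeout,
    )
    deleted = 0
    for message in response.get("Messages", []):
        try:
            handler(message["Body"])
        except Exception:
            # Do not delete: the message becomes visible again after the
            # timeout, and the failed receive counts toward maxReceiveCount.
            continue
        # Without this call the message reappears and is processed again.
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
        deleted += 1
    return deleted


def extend_visibility(sqs, queue_url, receipt_handle, seconds=60):
    """Heartbeat: reset the message's visibility timeout to `seconds` from now."""
    sqs.change_message_visibility(
        QueueUrl=queue_url,
        ReceiptHandle=receipt_handle,
        VisibilityTimeout=seconds,
    )
```

A long-running handler would call `extend_visibility` periodically, before the current timeout expires. Because SQS standard queues deliver at least once, the handler should still be idempotent.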
Symptom · 03
Messages in SQS but consumer stuck
Fix
Check the ApproximateAgeOfOldestMessage metric. If it keeps increasing, consumers are down or falling behind. Check Lambda concurrency, EC2 worker processes, and the consumer's permissions on the queue (IAM role and SQS queue policy).
Symptom · 04
SQS bill unexpectedly high
Fix
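Check whether you're using short polling (WaitTimeSeconds=0). Every receive call is a billed API request, even when it returns nothing. Always set WaitTimeSeconds=20 (long polling). Also check for a high redrive rate: messages that are received repeatedly before moving to the DLQ add a request on every attempt.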
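The request savings from long polling can be estimated with simple arithmetic. The sketch below assumes a single consumer against an empty queue, and assumes the short-polling loop issues one ReceiveMessage call per second (a hypothetical rate; a tight loop without a sleep would issue far more). With WaitTimeSeconds=20, an idle call returns after 20 seconds.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def idle_receive_calls_per_day(seconds_between_calls):
    """ReceiveMessage calls one consumer makes per day against an empty queue."""
    return SECONDS_PER_DAY // seconds_between_calls


short_polling = idle_receive_calls_per_day(1)   # assumed: one call per second
long_polling = idle_receive_calls_per_day(20)   # WaitTimeSeconds=20
reduction = 1 - long_polling / short_polling

print(short_polling, long_polling, f"{reduction:.0%}")  # 86400 4320 95%
```

Under these assumptions the idle request count drops by 95%. The saving shrinks as the queue gets busier, because a long poll returns as soon as a message is available.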
★ SQS/SNS 60-Second Diagnosis
Run these AWS CLI commands when messages are missing or consumers are stuck
Check SQS queue depth and health
Immediate action
Get queue attributes and oldest message age
Commands
aws sqs get-queue-attributes --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-queue --attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible ApproximateNumberOfMessagesDelayed ApproximateAgeOfOldestMessage
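aws cloudwatch get-metric-statistics --namespace AWS/SQS --metric-name ApproximateAgeOfOldestMessage --dimensions Name=QueueName,Value=my-queue --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" --period 300 --statistics Maximum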
Fix now
If ApproximateAgeOfOldestMessage > 300 seconds, consumers are lagging
Check if SNS topic is delivering to subscriber
Immediate action
View the SNS delivery policy and delivery logs
Commands
aws sns get-topic-attributes --topic-arn arn:aws:sns:us-east-1:123456789012:my-topic --query 'Attributes.EffectiveDeliveryPolicy'
aws logs filter-log-events --log-group-name sns/us-east-1/123456789012/my-topic/Failure --max-items 10
Fix now
Enable SNS delivery logging if not already. Check for 'Throttling' or 'EndpointDisabled' errors.
Find DLQ depth (messages that failed processing)
Immediate action
Check DLQ queue attributes
Commands
aws sqs get-queue-attributes --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-queue-dlq --attribute-names ApproximateNumberOfMessages
aws sqs receive-message --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-queue-dlq --max-number-of-messages 1
Fix now
Any message in DLQ = investigate root cause immediately. Don't just delete them.
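To investigate without deleting anything, the DLQ can be read with a zero visibility timeout so the messages stay available to other readers. The sketch below is one way to do that, assuming `sqs` is a boto3 SQS client; `peek_dlq` and the shape of its return value are hypothetical.

```python
def peek_dlq(sqs, dlq_url, max_messages=10):
    """Return a summary of up to `max_messages` DLQ messages without deleting them."""
    response = sqs.receive_message(
        QueueUrl=dlq_url,
        MaxNumberOfMessages=max_messages,
        WaitTimeSeconds=1,
        VisibilityTimeout=0,  # messages stay visible to other readers
        AttributeNames=["ApproximateReceiveCount"],
    )
    summaries = []
    for message in response.get("Messages", []):
        summaries.append(
            {
                "id": message["MessageId"],
                "receive_count": int(message["Attributes"]["ApproximateReceiveCount"]),
                "body": message["Body"],
            }
        )
    return summaries
```

The bodies usually point at the failing code path: a malformed payload, a missing field, or a downstream dependency that was unavailable when the message was processed.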
Simulate sending a test message through the system
Immediate action
Publish to SNS, then check SQS for delivery
Commands
aws sns publish --topic-arn arn:aws:sns:us-east-1:123456789012:my-topic --message '{"test":true,"timestamp":"'$(date -u +%Y-%m-%dT%H:%M:%S)Z'"}'
aws sqs receive-message --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-queue --wait-time-seconds 20
Fix now
If the test message is not received, check that the SQS queue policy allows the SNS topic to send messages and that the subscription is confirmed.
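A missing or wrong queue policy is a common reason for a test message never arriving. The sketch below builds the usual policy statement that lets one topic send to one queue and applies it. The function names and the `Sid` value are hypothetical; `sqs` is expected to be a boto3 SQS client.

```python
import json


def sns_to_sqs_policy(queue_arn, topic_arn):
    """Build a queue access policy that lets one SNS topic send to one queue."""
    return json.dumps(
        {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Sid": "AllowSnsSendMessage",  # hypothetical statement ID
                    "Effect": "Allow",
                    "Principal": {"Service": "sns.amazonaws.com"},
                    "Action": "sqs:SendMessage",
                    "Resource": queue_arn,
                    # Restrict the grant to this topic only.
                    "Condition": {"ArnEquals": {"aws:SourceArn": topic_arn}},
                }
            ],
        }
    )


def apply_queue_policy(sqs, queue_url, queue_arn, topic_arn):
    """Attach the policy to the queue, replacing any existing policy."""
    sqs.set_queue_attributes(
        QueueUrl=queue_url,
        Attributes={"Policy": sns_to_sqs_policy(queue_arn, topic_arn)},
    )
```

Note that `set_queue_attributes` overwrites the existing policy, so a queue with other statements would need them merged in first.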