SQS vs SNS: The Silent Production Failure Engineers Miss
SNS retries throttled Lambda twice then drops messages permanently.
20+ years shipping production infrastructure and CI/CD at scale. Drawn from code that ran under real load.
- SQS = queue. One message, one consumer. Durable storage (up to 14 days). Use for async task processing, rate limiting, back-pressure.
- SNS = pub/sub topic. One message, all subscribers. No storage. Use for event broadcasting, fan-out, decoupling producers from consumers.
- SNS+SQS fan-out = production standard. SNS broadcasts to multiple SQS queues. Each queue durably stores its copy. Never subscribe Lambda directly to SNS in production — SQS in between absorbs throttling.
- Long polling: always set WaitTimeSeconds=20 in receive_message. Cuts API calls by 95%, drops costs.
- Dead-letter queue (DLQ): maxReceiveCount=3. Messages that fail processing go to DLQ, not infinite retry. Monitor DLQ depth → that's your bug signal.
- Cost trap: SNS to Lambda subscriptions retry twice then drop messages on throttle. SQS queues hold messages safely for days.
Imagine a busy pizza restaurant. SNS is the manager who shouts 'Order 42 is ready!' — every station (kitchen, cashier, delivery) that cares about that announcement hears it at the same time. SQS is the ticket rail above the grill — each chef grabs one ticket, works through it at their own pace, and the ticket is gone once it's done. One broadcasts, one queues. That's the whole mental model.
Modern applications rarely do one thing at a time. A user places an order and suddenly you need to charge their card, send a confirmation email, update inventory, notify the warehouse, and log an audit trail — all reliably, even if your email service crashes at 2 a.m. That's the problem AWS SQS and SNS were built to solve. They decouple the parts of your system so that a failure in one place doesn't cascade everywhere.
Without messaging services like these, you'd wire services together with direct HTTP calls. Service A calls Service B, which calls Service C. If B is slow, A waits. If C is down, the whole chain breaks. SQS introduces a buffer — a durable queue that holds messages until a consumer is ready to process them. SNS takes a different angle: it lets one event instantly fan out to dozens of subscribers without the publisher needing to know who they are.
By the end of this article you'll know the architectural difference between a queue and a publish-subscribe topic, how to wire SQS and SNS together for a real fan-out pattern, what dead-letter queues are and why you desperately need them, and exactly when to reach for each service in your next cloud project.
Here's the reality: most teams learn these services after losing messages in production. That's why this guide focuses on failure modes first. You'll walk away knowing exactly where silent drops happen and how to build a fan-out that survives traffic spikes.
SQS and SNS: The Two Async Pillars That Break Differently
Amazon SQS (Simple Queue Service) and SNS (Simple Notification Service) are AWS's managed messaging services. SQS is a pull-based queue: producers send messages, consumers poll and delete them. SNS is a push-based pub/sub bus: publishers send to a topic, which fans out to multiple subscribers (SQS queues, Lambda, HTTP endpoints, etc.). The core mechanic: SQS Standard queues guarantee at-least-once delivery, and FIFO queues add exactly-once processing via deduplication IDs; SNS delivers each message to every subscriber with no built-in retry beyond its delivery policy.
In practice, SQS decouples microservices with a buffer that absorbs traffic spikes: a single queue can handle thousands of messages per second, and the visibility timeout hides an in-flight message from other consumers so that two workers do not process it at the same time. SNS pushes messages immediately, making it ideal for event broadcasting, but its push model means a slow or unavailable subscriber can lose messages once the delivery policy exhausts its retries. SQS supports FIFO ordering and deduplication; SNS preserves order only on FIFO topics, and Standard topics do not guarantee order across subscribers.
Use SQS when you need reliable, asynchronous work processing with consumer-driven pacing, e.g., order fulfillment pipelines. Use SNS when you need to fan out the same event to multiple independent consumers, e.g., account creation triggers email, SMS, and audit logging. The critical nuance: SNS + SQS is the standard pattern for reliable fan-out, but many teams miss that SNS deliveries can fail silently if the SQS queue's access policy does not allow the topic to send to it.
SQS — The Durable Message Queue That Saves Your System at 2 A.M.
SQS (Simple Queue Service) is a fully managed message queue. A producer drops a message into the queue, and one or more consumers poll the queue and process messages at their own pace. The key word is 'one' — by default each message is delivered to exactly one consumer. This is point-to-point messaging.
Why does that matter? Because it gives you back-pressure handling for free. If your order-processing service is overwhelmed, messages just pile up in the queue safely. The queue acts as a shock absorber between the part of your system that generates work and the part that does the work.
There are two flavours. Standard queues give you maximum throughput with at-least-once delivery and best-effort ordering — meaning a message might appear twice (rare, but plan for it). FIFO queues guarantee exactly-once processing and strict order, but cap you at 3,000 messages per second with batching. Choose FIFO when order actually matters — financial transactions, state machines. Choose Standard everywhere else.
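Because a Standard queue can deliver the same message twice, the consumer has to tolerate duplicates. The sketch below shows one way to do that, assuming every message carries a stable ID. The `IdempotentProcessor` class is hypothetical, and its in-memory set stands in for a durable store such as a database table with a unique key.

```python
from typing import Callable, Set


class IdempotentProcessor:
    """Wraps a handler so that a redelivered message is processed only once."""

    def __init__(self, handler: Callable[[str], None]) -> None:
        self._handler = handler
        # Illustration only: a real consumer would keep seen IDs in durable storage.
        self._seen: Set[str] = set()

    def process(self, message_id: str, body: str) -> bool:
        """Return True if the handler ran, False if the message was a duplicate."""
        if message_id in self._seen:
            return False
        self._handler(body)
        # Record the ID only after the handler succeeds, so a failed attempt
        # is still retried when the message reappears in the queue.
        self._seen.add(message_id)
        return True
```

Recording the ID after the handler returns is a deliberate choice: a crash mid-processing leaves the ID unrecorded, so the redelivered copy is handled rather than skipped.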
Messages live in the queue for up to 14 days (the default retention period is 4 days). The visibility timeout is the other critical setting: after a consumer picks up a message, it becomes invisible to other consumers for that window. If your Lambda or EC2 worker crashes mid-process, the message reappears once the timeout expires and gets retried. That's your built-in retry mechanism.
Long polling is the single most impactful cost-saving setting. Always set WaitTimeSeconds=20 in your receive_message calls. Without it, your consumer uses short polling — it returns immediately even if no messages exist. You pay per API call, so a quiet queue will cost you thousands of empty calls. Long polling holds the connection for up to 20 seconds, waiting for a message. On a queue that gets one message per minute, long polling cuts your API calls by 95%.
Batch operations amplify cost savings further. Use send_message_batch to send up to 10 messages in a single API call, and receive up to 10 messages per receive_message call. Batching cuts your per-message API cost by up to 90% compared to sending one message per call.
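A minimal sketch of long polling combined with batching, assuming `sqs` is a Boto3 SQS client (for example `boto3.client("sqs")`) and `queue_url` points at an existing queue. The helper names `chunked`, `send_in_batches`, and `poll_once` are hypothetical.

```python
from typing import Any, Callable, Iterator, List


def chunked(items: List[str], size: int = 10) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items (an SQS batch holds up to 10)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def send_in_batches(sqs: Any, queue_url: str, bodies: List[str]) -> int:
    """Send all bodies with send_message_batch and return the number of API calls."""
    api_calls = 0
    for batch in chunked(bodies):
        entries = [{"Id": str(i), "MessageBody": body} for i, body in enumerate(batch)]
        response = sqs.send_message_batch(QueueUrl=queue_url, Entries=entries)
        api_calls += 1
        # A batch call can partially fail; rejected entries are listed under "Failed".
        if response.get("Failed"):
            raise RuntimeError(f"{len(response['Failed'])} messages were not accepted")
    return api_calls


def poll_once(sqs: Any, queue_url: str, handler: Callable[[str], None]) -> int:
    """Long-poll once, process up to 10 messages, and delete only the successes."""
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,  # up to 10 messages per receive call
        WaitTimeSeconds=20,      # long polling: wait up to 20 s for a message
    )
    processed = []
    for index, message in enumerate(response.get("Messages", [])):
        try:
            handler(message["Body"])
        except Exception:
            # Do not delete: the message reappears after the visibility timeout.
            continue
        processed.append({"Id": str(index), "ReceiptHandle": message["ReceiptHandle"]})
    if processed:
        sqs.delete_message_batch(QueueUrl=queue_url, Entries=processed)
    return len(processed)
```

A worker would typically call `poll_once` in a loop. Sending 25 messages through `send_in_batches` takes three API calls instead of 25.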
SNS — The Pub/Sub Megaphone That Notifies Everyone at Once
SNS (Simple Notification Service) works on the publish-subscribe model. You publish one message to a Topic, and SNS fans it out simultaneously to every subscriber — SQS queues, Lambda functions, HTTP endpoints, email addresses, mobile push notifications. The publisher has zero knowledge of who's listening. Adding a new subscriber doesn't touch the publisher at all.
This is the architectural superpower. Imagine your user-signup event needs to trigger a welcome email, a CRM record creation, a Slack notification to your growth team, and an analytics event. With SNS, your Auth service publishes one 'UserRegistered' message and walks away. Four independent services consume it in parallel.
Message filtering is what takes SNS from useful to essential. Instead of every subscriber receiving every message on a topic, you attach a filter policy to a subscription. Your EU payments service can subscribe to the 'transactions' topic but only receive messages where region=EU. This keeps each service focused on what it actually cares about. Filter policies support string matching, prefix matching, numeric ranges, and existence checks — making them powerful enough for most routing requirements without needing EventBridge.
SNS does not store messages. If a subscriber is still unavailable once the delivery retries run out, that message is gone unless the subscriber is an SQS queue (which durably stores it). That's the most important SNS limitation to internalise, and it leads directly to the most powerful pattern: SNS + SQS fan-out.
SNS delivery logging is a critical debugging tool that most teams don't enable. It publishes delivery attempts, failures, and throttling events to CloudWatch Logs. When messages go missing, this is the first place to look. Enable it on every production SNS topic.
RawMessageDelivery is another setting most teams overlook. By default, SNS wraps your message in a JSON envelope containing metadata like the topic ARN, subject, and signature. When your SQS consumer receives the message, it has to unwrap the SNS envelope before parsing your actual payload. Setting RawMessageDelivery=true on a subscription tells SNS to deliver your original JSON body directly — no wrapping. This simplifies consumer code and avoids subtle bugs from double JSON-encoding.
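To make the difference concrete, here is a small sketch of the consumer-side parsing for both modes. It assumes the publisher sent a JSON payload; `extract_payload` is a hypothetical helper name.

```python
import json
from typing import Any


def extract_payload(sqs_body: str, raw_delivery: bool) -> Any:
    """Return the publisher's JSON payload from an SQS message body."""
    if raw_delivery:
        # RawMessageDelivery=true: the body is exactly what the publisher sent.
        return json.loads(sqs_body)
    # Default: the body is an SNS envelope, and the original payload is the
    # JSON-encoded string stored under the "Message" key.
    envelope = json.loads(sqs_body)
    return json.loads(envelope["Message"])
```

The second `json.loads` in the default branch is the step that is easy to forget; without it the consumer ends up working with a string instead of a parsed object.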
Dead-Letter Queues and the SNS+SQS Fan-Out Pattern — Production Essentials
Two patterns separate a toy cloud setup from a production-grade one: dead-letter queues (DLQs) and the SNS+SQS fan-out architecture. You need to understand both.
A DLQ is just another SQS queue. You configure it on your main queue and set maxReceiveCount — say, 3. If a message fails processing 3 times, SQS automatically moves it to the DLQ instead of retrying forever or silently dropping it. Your team gets alerted, investigates the poisoned message, and the rest of your queue keeps flowing normally. Without a DLQ, one bad message can block your queue or create an infinite retry storm.
The fan-out pattern solves SNS's biggest weakness — no durability. The rule is: never subscribe a Lambda directly to SNS in production if message loss is unacceptable. Instead, subscribe an SQS queue to the SNS topic. The queue durably catches every message. Your Lambda then polls the queue. You get SNS's broadcasting power AND SQS's durability and retry logic together. This is the architectural backbone of most event-driven AWS systems.
The code example below wires both patterns together with IaC-style Boto3 calls, showing exactly how a DLQ connects to a main queue.
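A minimal Boto3 sketch of that wiring, assuming `sqs` and `sns` are Boto3 clients and the topic already exists. The function name `wire_fan_out`, the `-dlq` naming convention, and the returned dictionary are illustrative assumptions, not a prescribed layout.

```python
import json
from typing import Any, Dict


def _queue_arn(sqs: Any, queue_url: str) -> str:
    """Look up the ARN of a queue from its URL."""
    response = sqs.get_queue_attributes(QueueUrl=queue_url, AttributeNames=["QueueArn"])
    return response["Attributes"]["QueueArn"]


def wire_fan_out(sqs: Any, sns: Any, topic_arn: str, queue_name: str,
                 max_receive_count: int = 3) -> Dict[str, str]:
    """Create a queue with a DLQ and subscribe it to an existing SNS topic."""
    # 1. The DLQ is an ordinary queue that must exist before the main queue.
    dlq_url = sqs.create_queue(QueueName=f"{queue_name}-dlq")["QueueUrl"]
    dlq_arn = _queue_arn(sqs, dlq_url)

    # 2. The main queue points at the DLQ through its RedrivePolicy.
    redrive = {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": max_receive_count}
    queue_url = sqs.create_queue(
        QueueName=queue_name,
        Attributes={"RedrivePolicy": json.dumps(redrive)},
    )["QueueUrl"]
    queue_arn = _queue_arn(sqs, queue_url)

    # 3. Allow only this topic to send to the queue.
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "sns.amazonaws.com"},
            "Action": "sqs:SendMessage",
            "Resource": queue_arn,
            "Condition": {"ArnEquals": {"aws:SourceArn": topic_arn}},
        }],
    }
    sqs.set_queue_attributes(QueueUrl=queue_url, Attributes={"Policy": json.dumps(policy)})

    # 4. Subscribe the queue to the topic with raw delivery enabled.
    subscription_arn = sns.subscribe(
        TopicArn=topic_arn,
        Protocol="sqs",
        Endpoint=queue_arn,
        Attributes={"RawMessageDelivery": "true"},
        ReturnSubscriptionArn=True,
    )["SubscriptionArn"]

    return {"queue_url": queue_url, "queue_arn": queue_arn, "dlq_url": dlq_url,
            "dlq_arn": dlq_arn, "subscription_arn": subscription_arn}
```

Calling it once per consumer (for example `wire_fan_out(sqs, sns, topic_arn, "email-worker")`) gives each consumer its own durable copy of every message and its own DLQ.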
Monitoring the DLQ: Set a CloudWatch alarm on ApproximateNumberOfMessagesVisible > 0 on your DLQ. A message in the DLQ means your consumer failed to process it after maxReceiveCount attempts. That's a bug, and it shouldn't be ignored or silently deleted. Your on-call should get a page.
DLQ Redrive: Once you've fixed the root cause, use the SQS Dead-Letter Queue Redrive feature (available in the AWS console and via API) to move messages from the DLQ back to the source queue for reprocessing. Avoid replaying messages by hand: the redrive API preserves original message attributes and handles batching correctly.
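The redrive can also be started from code. The sketch below assumes `sqs` is a Boto3 SQS client and uses its `start_message_move_task` call; the function name `redrive_dlq` and the default rate of 50 messages per second are illustrative assumptions.

```python
from typing import Any, Optional


def redrive_dlq(sqs: Any, dlq_url: str, max_per_second: int = 50) -> Optional[str]:
    """Start moving messages from a DLQ back to their source queue.

    Returns the task handle, or None when the DLQ reports no messages.
    """
    attributes = sqs.get_queue_attributes(
        QueueUrl=dlq_url,
        AttributeNames=["QueueArn", "ApproximateNumberOfMessages"],
    )["Attributes"]
    if int(attributes["ApproximateNumberOfMessages"]) == 0:
        return None
    # With no DestinationArn, SQS returns each message to its original source queue.
    response = sqs.start_message_move_task(
        SourceArn=attributes["QueueArn"],
        MaxNumberOfMessagesPerSecond=max_per_second,
    )
    return response["TaskHandle"]
```

Throttling the move rate keeps a large DLQ from flooding a consumer that has only just been fixed.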
SQS vs SNS vs EventBridge — A Three-Way Decision Table
When you need more than basic pub/sub or queuing, AWS EventBridge enters the picture. It's not a replacement for SQS or SNS — it sits above them, offering a central event bus with advanced routing, schema registry, and integration with third-party SaaS events. Here's a decision table to clarify when to pick each:
| Feature | SQS | SNS | EventBridge |
|---|---|---|---|
| Messaging model | Point-to-point queue | Pub/sub topic | Event bus (pub/sub + routing) |
| Durability | Yes, up to 14 days | No (unless subscriber is SQS) | Retries delivery for up to 24 hours by default; archives retain events for a configurable period |
| Throughput | Nearly unlimited (Standard) / 3,000 msg/s with batching (FIFO) | Region-dependent soft quota (300 publishes/s in the smallest regions, far higher in the largest) | Region-dependent soft quota (10,000 events/s in the largest regions) |
| Filtering | Consumer-side | Server-side subscription filter policies (message attributes or message body) | Rich content-based event patterns (prefix, suffix, anything-but, exists, numeric ranges) |
| Ordering | Best-effort (Standard) / Strict FIFO | Best-effort (Standard) / Strict (FIFO topics) | No ordering |
| Payload size | Up to 256 KB | Up to 256 KB | Up to 256 KB |
| Pricing | Pay per request and data transfer | Pay per publish and delivery attempts | Pay per event ingested and delivered (higher per-event cost than SNS) |
| Third-party integrations | None native | Email, SMS, mobile push, HTTP | SaaS apps (Zendesk, Datadog, PagerDuty, 200+ built-in sources) |
| Schema registry | No | No | Yes — schema discovery and code generation |
| Replay events | No | No | Yes, archive events and replay them within the archive's retention period |
| FIFO support | Yes | Yes (SNS FIFO topics) | No |
When to pick EventBridge over SNS:
- You need complex content-based filtering (e.g., "order.total > 100 and order.region != 'US'") with richer operators than SNS filter policies offer.
- You want to ingest events from third-party SaaS providers (GitHub, Shopify, Salesforce, etc.) without building custom ingest pipelines.
- You need event replay for debugging or disaster recovery. Being able to re-run yesterday's events against a fixed consumer is invaluable.
- You want automatic schema discovery to generate strongly typed code from your event shapes.
When to stick with SNS+SQS:
- You need FIFO ordering or exactly-once processing. EventBridge does not support FIFO.
- Your throughput is very high and you want the lowest per-message cost. SNS+SQS is cheaper at high volume.
- You need up to 14 days of retention for unprocessed messages on the queue itself. An EventBridge archive keeps events for replay, but queued durability for unprocessed messages requires SQS.
- You need the simplicity of a direct queue (SQS) without event bus routing complexity.
The rule of thumb: SNS+SQS covers 80% of event-driven use cases. EventBridge is worth the extra cost and complexity when you need its advanced routing, third-party integration, or replay capabilities.
Pros and Cons of SQS and SNS
Every service has trade-offs. Here's a clear-eyed look at what SQS and SNS do well, and where they fall short.
SQS advantages:
- Durable message storage for up to 14 days, with automatic retries via the visibility timeout
- At-least-once delivery (Standard) or exactly-once processing (FIFO)
- Unlimited throughput with Standard queues; 3,000 msg/s with FIFO batching
- Built-in dead-letter queue support with configurable maxReceiveCount
- Low cost per API request, especially with long polling and batch operations
- Supports batch operations (up to 10 messages per receive, up to 10 per send)
- Decouples producers from consumers: the producer doesn't need to know consumer speed or availability
SQS disadvantages:
- No built-in fan-out (point-to-point only; you need SNS or Lambda triggers for fan-out)
- Consumer must poll the queue, introducing latency and API cost if not optimized
- Max message size 256 KB; larger payloads need the SQS Extended Client Library with S3
- Ordering only guaranteed with FIFO, which has throughput limitations
- No content-based server-side filtering; consumers must filter in application code
- FIFO queues don't scale Lambda event source mappings the same way as Standard queues
SNS advantages:
- Instant pub/sub fan-out: one message reaches all subscribers simultaneously
- Multiple subscriber types (SQS, Lambda, HTTP/S, email, SMS, mobile push)
- Server-side filter policies reduce unnecessary deliveries and compute cost
- No polling overhead for Lambda and HTTP subscribers (push-based delivery)
- Simple pricing per publish, not multiplied by subscriber count
- Supports FIFO topics for ordered fan-out when needed
SNS disadvantages:
- Messages are NOT stored: if a subscriber stays unavailable until the delivery retries run out, the message is lost unless the subscriber is SQS
- Limited retries for push subscribers (HTTP and Lambda), governed by the delivery policy rather than by the consumer
- No ordering guarantees on Standard topics
- Max message size 256 KB
- No retry queue in the SNS layer itself: a subscription can have a dead-letter queue for undeliverable messages, but consumer-side retry and dead-letter semantics require an SQS subscriber with its own DLQ
- Filter policies match message attributes by default; matching on the message body requires setting FilterPolicyScope to MessageBody on the subscription
- HTTP subscribers are vulnerable to transient failures during the narrow retry window
The real insight: SNS's biggest disadvantage (no durability) disappears when it is paired with SQS. Each service covers the other's weakness. Never use SNS alone for critical events; always buffer with SQS.
Pricing Comparison — Standard vs FIFO, Per-Million Request Costs
Understanding cost at scale is critical. Here's the pricing breakdown for SQS, SNS, and the common patterns.
SQS Pricing (as of 2026)
| Queue Type | Request Pricing | Data Transfer Pricing | Free Tier |
|---|---|---|---|
| Standard | $0.40 per million requests | $0.09 per GB after first 1 GB/month | 1 million requests free per month |
| FIFO | $0.50 per million requests | $0.09 per GB after first 1 GB/month | 1 million requests free per month |
Notes on SQS requests:
- A "request" is any API call: SendMessage, ReceiveMessage, DeleteMessage, ChangeMessageVisibility, etc.
- Long polling (WaitTimeSeconds=20) counts as one request per poll call, even if the response is empty. That is far cheaper than short polling, which generates continuous empty responses.
- Batch operations (SendMessageBatch, DeleteMessageBatch) count as one request per batch of up to 10 messages. Always batch when sending or deleting multiple messages.
- ChangeMessageVisibility (heartbeat) also counts as one request. Factor this in for long-running consumers.
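The request arithmetic can be checked with a few lines of Python. This sketch takes the $0.40 per million Standard rate from the table above as an assumption, ignores the free tier, and uses hypothetical helper names.

```python
import math

STANDARD_PRICE_PER_MILLION = 0.40  # USD, Standard queue rate quoted in the table above


def request_cost(requests: int, price_per_million: float = STANDARD_PRICE_PER_MILLION) -> float:
    """Cost in USD of a number of SQS requests (free tier ignored)."""
    return requests / 1_000_000 * price_per_million


def requests_for(messages: int, batch_size: int = 1) -> int:
    """API calls needed to send or delete `messages` with a given batch size."""
    return math.ceil(messages / batch_size)


def idle_polls_per_month(seconds_between_polls: float, days: int = 30) -> int:
    """Empty receive calls one consumer makes against a quiet queue in a month."""
    return math.ceil(days * 24 * 3600 / seconds_between_polls)
```

Under these assumptions, one consumer short polling an idle queue once per second makes `idle_polls_per_month(1)` = 2,592,000 calls, while long polling with a 20-second wait makes `idle_polls_per_month(20)` = 129,600, a 95% reduction. Sending 10 million messages costs $4.00 unbatched and $0.40 in batches of 10.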
SNS Pricing (as of 2026)
| Topic Type | Publish Pricing | Delivery Pricing | Free Tier |
|---|---|---|---|
| Standard | $0.50 per million publishes | $0.50 per million deliveries to SQS/Lambda; SMS and email have separate rates | 1 million publishes free per month |
| FIFO | $0.50 per million publishes | $0.50 per million deliveries | Not included in SNS free tier for FIFO |
Notes on SNS delivery:
- Each subscriber receives a copy of the message. If you have 5 SQS subscribers and publish 1 million messages, you pay for 1 million publishes ($0.50) + 5 million deliveries ($2.50) = $3.00 total.
- SMS delivery is billed separately by destination country, typically $0.00645 per message in the US. It is not covered by the standard delivery rate.
- Email and email-JSON subscribers are free for the first 1,000 emails per month, then $2.00 per 100,000 emails.
Comparison Scenario: Fan-out to 3 SQS queues, 10 million messages/month
- SNS+SQS fan-out: 10M SNS publishes = $5.00. 30M SNS deliveries to 3 queues = $15.00. SQS send cost (SNS writes to SQS): 30M send requests = $12.00. Consumer polling with long polling and batch size 10: ~3M receive calls (30M messages in batches of 10) = $1.20. Consumer deletes, unbatched: 30M delete requests = $12.00. Total: ~$45.20/month.
- Direct SNS → Lambda subscriptions: 10M publishes = $5.00. Lambda invocations and duration are billed separately. Risk: message loss on throttle. Not recommended for critical data regardless of cost.
- EventBridge bus: 10M events ingested = $10.00. 30M deliveries to 3 rules = $30.00. Total: $40.00/month — comparable to SNS+SQS at this scale, with replay and content-based filtering included.
Cost-saving tips:
1. Always use long polling (WaitTimeSeconds=20) on SQS consumers to eliminate empty receive requests.
2. Batch send messages (up to 10 per SendMessageBatch call) to cut send costs by up to 90%.
3. Batch delete messages (up to 10 per DeleteMessageBatch call) after processing.
4. Use SNS Standard for high-volume fan-out; FIFO only when ordering is truly required.
5. Monitor SQS API usage with CloudWatch metrics. Set AWS Budgets alarms to catch unexpected growth early.
Pricing is subject to change. Always verify on the official AWS pricing pages for SQS and SNS before making architectural decisions based on cost.
When to Consider Amazon MQ Instead of SQS or SNS
Amazon MQ is a fully managed message broker service for Apache ActiveMQ and RabbitMQ. It supports industry-standard protocols and APIs: MQTT, AMQP, STOMP, OpenWire, and JMS. When should you reach for it instead of SQS/SNS?
Amazon MQ is the right choice when:
- You're migrating an existing on-premises application that already uses JMS, AMQP, or MQTT. Rewriting everything to use SQS/SNS would be too risky or time-consuming, and Amazon MQ lets you lift and shift with minimal code changes.
- You need advanced message routing beyond what SNS offers, like topic wildcards (e.g., orders.#), virtual topics, header-based exchanges, or complex JMS selectors.
- You need transactional messaging across multiple queues, for example sending to queue A and queue B atomically within a single XA transaction.
- Your application requires specific broker-level features like scheduled messages (send now, deliver at a future time), message groups with flexible ordering, or custom dead-letter handling at the broker level.
- You need push-based delivery with low latency. Amazon MQ's push model avoids the polling delay inherent to SQS.
Amazon MQ is NOT the right choice when:
- You want fully serverless, zero infrastructure management. Amazon MQ requires you to provision and manage broker instances (though patching and failover are automated). SQS and SNS are fully serverless.
- Your throughput needs are extremely high or bursty. SQS Standard scales to unlimited throughput automatically. Amazon MQ is limited by the broker instance type: larger instances cost more and still have connection limits.
- You're building a new cloud-native application from scratch. Without a legacy protocol requirement, SQS+SNS is simpler, cheaper to operate, and scales without instance sizing decisions.
- You need exactly-once processing with a simple API. SQS FIFO provides this natively; Amazon MQ requires idempotent consumers and broker-level deduplication configuration, which is more complex.
Cost comparison: Amazon MQ instances start at approximately $0.027/hour (~$20/month) for a single-instance mq.t3.micro broker, and scale to hundreds of dollars per month for active/standby HA configurations. In contrast, SQS and SNS have no fixed monthly cost — you pay only for requests. For bursty or low-volume workloads, SQS is far cheaper. For very high steady-state volume (hundreds of millions of messages per month), Amazon MQ may become cheaper because you pay per instance-hour rather than per request.
Decision rule: Use SQS/SNS for cloud-native applications where serverless scalability and operational simplicity are priorities. Use Amazon MQ when migrating legacy JMS/AMQP applications or when you need advanced broker-level routing features that SNS cannot provide. If you're starting from scratch without a protocol requirement, SQS+SNS is almost always the better choice.
Infrastructure as Code — Terraform SNS+SQS Fan-Out Setup
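The AWS console and Boto3 scripts work for demos, but production infrastructure should be version-controlled and repeatable. A Terraform configuration for the SNS+SQS fan-out pattern declares the topic (aws_sns_topic), the main queue and its dead-letter queue (aws_sqs_queue), the queue policy (aws_sqs_queue_policy), the subscription (aws_sns_topic_subscription), and the CloudWatch monitoring (aws_cloudwatch_metric_alarm).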
Practice Exercises — Build Your Own SQS/SNS Systems
Theory only sticks when you implement it. Here are five exercises that mirror real-world scenarios. Each exercise builds on the previous one. Set up a free AWS account (or use LocalStack for local testing at zero cost).
Exercise 1: Order Processing Queue
- Create an SQS Standard queue named order-queue.
- Write a Python/Boto3 script that sends 100 order messages with random order IDs and amounts.
- Write a consumer that polls the queue (with long polling), prints each order, and deletes it.
- Expected outcome: All messages processed exactly once (or with minor duplicates on a Standard queue, so plan for it).
- Hint: Set WaitTimeSeconds=20 and MaxNumberOfMessages=10. Use a try/except block: only call delete_message on success, never on failure. Let failed messages reappear for retry.
Exercise 2: Fan-Out Notification
- Create an SNS topic named user-notifications.
- Subscribe two SQS queues to it: email-queue and sms-queue.
- Set RawMessageDelivery=true on both subscriptions.
- Publish a single message to the topic. Verify that both queues receive a copy.
- Expected outcome: Each queue has exactly one copy of the message.
- Hint: Don't forget the SQS queue policy to allow SNS to send messages. Missing this policy causes silent drops. You can verify subscription status with aws sns list-subscriptions-by-topic.
Exercise 3: Dead-Letter Queue Monitoring
- Modify Exercise 1's queue to attach a DLQ with maxReceiveCount=3.
- Send a message with intentionally invalid JSON (e.g., {invalid).
- Write a consumer that fails on JSON parsing. Do not catch the exception; let it propagate without calling delete_message.
- Poll the queue 3 times (each poll makes the message visible once, fails, and the receive count increments). After the 3rd failure, check the DLQ. The message should have moved automatically.
- Set up a CloudWatch alarm on the DLQ's ApproximateNumberOfMessagesVisible > 0 that sends an email via a separate SNS topic.
- Expected outcome: The poison message moves to the DLQ, the alarm triggers, and you get an email.
- Hint: VisibilityTimeout must expire between each poll attempt to increment the receive count. Set it to 5 seconds for testing, then reset to 60 seconds for production.
Exercise 4: FIFO Queue with Deduplication
- Create a FIFO queue named payment-events.fifo with content-based deduplication enabled.
- Send 5 messages with identical bodies (e.g., {"transaction_id": "TXN-001", "amount": 100}) within a 5-minute window.
- Consume the queue and count how many messages you receive.
- Expected outcome: Only 1 message delivered. The other 4 are deduplicated by SQS using SHA-256 of the body. The deduplication window is 5 minutes.
- Hint: FIFO queues require a .fifo suffix in the queue name. Each send_message call must include MessageGroupId. Deduplication uses SHA-256 of the MessageBody when content-based deduplication is enabled.
Exercise 5: Throttling Simulation with SNS → SQS → Lambda
- Create an SNS topic, subscribe an SQS queue, and configure a Lambda function with an SQS event source mapping (batch size 1).
- Set Lambda reserved concurrency to 1 to force serial processing.
- Add a 3-second time.sleep in the Lambda to simulate work.
- Publish 20 messages to the SNS topic in rapid succession.
- Monitor ApproximateNumberOfMessages on the SQS queue during processing.
- Expected outcome: Queue depth rises to ~19, then drains one message at a time as Lambda processes them serially. No messages are lost. Compare this to what would happen with a direct SNS → Lambda subscription under throttle.
- Hint: Use CloudWatch Logs Insights to verify Lambda invocation count matches message count after the queue drains. The key lesson: with SQS buffering, throttle = delay. With direct SNS → Lambda, throttle = loss.
LocalStack tip: set endpoint_url='http://localhost:4566' in your boto3 client and use dummy credentials (aws_access_key_id='test'). It costs nothing, runs offline, and starts in seconds with docker run localstack/localstack.
Operational Runbook: What to Do When Messages Stop Flowing
You're on call. The pager wakes you at 3 AM. The SQS queue depth is climbing and the ApproximateAgeOfOldestMessage alarm is firing. No messages are being processed. Here's the step-by-step runbook.
Phase 1: Detect (under 5 minutes)
- Open CloudWatch. Check ApproximateAgeOfOldestMessage on the main queue. If it is rising and above 5 minutes, consumers are stuck or down.
- Check ApproximateNumberOfMessagesVisible on the DLQ. If it is above 0, messages are actively failing processing: there's a poison message or a bug in the consumer.
- Check the SNS delivery status logs in CloudWatch Logs for Throttling, EndpointDisabled, or AccessDenied entries within the last 30 minutes.
Phase 2: Diagnose (under 10 minutes)
- Consumer crashed? Run: aws sqs get-queue-attributes --queue-url <main-queue-url> --attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible. If ApproximateNumberOfMessagesNotVisible is high and static (not decreasing), consumers are receiving messages but crashing before deleting them.
- Lambda throttled? Check the Lambda Throttles metric in CloudWatch. If elevated, increase ReservedConcurrency or check for concurrency exhaustion at the account level.
- Permission denied? Check the SQS queue policy: does the SNS principal have sqs:SendMessage? Check the Lambda execution role: does it have sqs:ReceiveMessage, sqs:DeleteMessage, and sqs:GetQueueAttributes?
- VisibilityTimeout too short? If messages reappear faster than your consumer processes them, processing time exceeds VisibilityTimeout. Increase the timeout or implement a heartbeat using ChangeMessageVisibility to extend it dynamically.
- Poison message? Receive one message from the DLQ: aws sqs receive-message --queue-url <dlq-url> --max-number-of-messages 1. Examine the body. Look for unexpected null fields, schema changes, or malformed JSON from an upstream service.
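The "high and static" check in the first step can be scripted by comparing two readings taken a minute or so apart. This is a rough sketch, assuming `sqs` is a Boto3 SQS client; the function names and the classification rules are illustrative heuristics, not an AWS-defined diagnosis.

```python
from typing import Any, Dict

ATTRIBUTES = ["ApproximateNumberOfMessages", "ApproximateNumberOfMessagesNotVisible"]


def snapshot(sqs: Any, queue_url: str) -> Dict[str, int]:
    """Read the visible and in-flight message counts for a queue."""
    raw = sqs.get_queue_attributes(QueueUrl=queue_url, AttributeNames=ATTRIBUTES)["Attributes"]
    return {name: int(raw[name]) for name in ATTRIBUTES}


def diagnose(earlier: Dict[str, int], later: Dict[str, int]) -> str:
    """Classify queue behaviour from two snapshots taken some time apart."""
    visible_before = earlier["ApproximateNumberOfMessages"]
    visible_after = later["ApproximateNumberOfMessages"]
    in_flight_before = earlier["ApproximateNumberOfMessagesNotVisible"]
    in_flight_after = later["ApproximateNumberOfMessagesNotVisible"]

    if visible_after == 0 and in_flight_after == 0:
        return "empty: nothing waiting and nothing in flight"
    if in_flight_after == 0 and visible_after >= visible_before:
        return "no consumer is receiving: check that workers or the Lambda trigger are running"
    if in_flight_after >= in_flight_before and visible_after >= visible_before:
        return "received but not deleted: consumers are probably crashing or timing out"
    return "draining: consumers are making progress"
```

The counts are approximate by design, so treat the result as a pointer toward the next check rather than a verdict.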
Phase 3: Mitigate (under 15 minutes)
- Dead consumers: Restart EC2 instances, redeploy Lambda, or trigger an ECS service restart. Verify the consumer is polling by watching NumberOfMessagesSent vs NumberOfMessagesDeleted in CloudWatch.
- Backlogged queue: Temporarily scale up consumer concurrency by increasing Lambda MaximumConcurrency on the event source mapping or adding EC2 worker instances. Monitor ApproximateAgeOfOldestMessage to confirm it starts decreasing.
- Poison messages in DLQ: Do NOT delete them. Move them to a separate investigation queue or leave them in the DLQ. Fix the consumer to handle that message shape. Then use the SQS DLQ Redrive feature to replay them back to the main queue.
- SNS delivery failure (AccessDenied): Fix the SQS queue policy to grant sqs:SendMessage to the SNS principal. Then republish missed events from application logs, a replay mechanism, or the event source system.
Phase 4: Prevent (post-incident)
- Add Terraform modules that enforce DLQ attachment, long polling defaults, and CloudWatch alarms as non-negotiable for every new queue.
- Set an alarm on ApproximateAgeOfOldestMessage > 5 minutes on every production queue.
- Set an alarm on ApproximateNumberOfMessagesVisible > 0 on every DLQ.
- Set an alarm on Lambda Throttles > 0 for every Lambda that processes SQS events.
- Run a quarterly chaos exercise: throttle the Lambda to concurrency 1, send a spike of messages, and verify the queue drains correctly afterward with no message loss.
- Document this runbook and store it in your on-call playbook. Run a tabletop walkthrough with the team so the steps are muscle memory before the next 3 AM page.
Stop Clicking Around: Automate SNS→SQS Subscription With IAM
You're not a cloud janitor. Stop manually subscribing queues and fixing broken permissions at 3 AM. The most common production issue with SNS→SQS fan-out isn't the setup — it's the IAM policy that everyone forgets.
When you subscribe an SQS queue to an SNS topic, AWS does NOT automatically grant SNS permission to send messages to that queue. The subscription will show as "Confirmed" in the console, but your messages will vanish. No errors. No logs. Just silence.
Here's the fix: the SQS queue needs a resource policy that allows sqs:SendMessage from that specific topic ARN. In cross-account setups, the SNS topic policy must additionally allow the queue's account to subscribe. Terraform handles this cleanly with aws_sns_topic_subscription and aws_sqs_queue_policy. Deploy it through your CI/CD pipeline, document it once, and forget it.
If you're still clicking through the console to create subscriptions, you're one misclick away from a silent data loss incident. Don't be that person.
Message Filtering: Stop Wasting Compute on Unwanted Events
Your notification service doesn't care about every inventory update. It only needs "checkout_completed" events. Without message filtering, every SQS subscriber in a fan-out topology gets every message — and your consumers waste CPU cycles filtering them out.
SNS supports subscription filter policies with JSON-based attribute matching. You attach a policy string on the subscription resource. Only messages whose attributes match the filter get delivered. This is not optional at scale.
Example: your order service publishes events with an attribute "event_type" set to "order_placed", "payment_failed", or "refund_initiated". Your analytics pipeline only wants "order_placed". Your fraud detection wants "payment_failed". Each subscribes with a different filter policy. One topic, one queue per consumer, zero wasted messages.
The filter policy supports exact matching, prefix matching, numeric ranges, and exists checks. Use it. Your Lambda cold starts will thank you.
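Here is roughly what those two filter policies look like, as a sketch. The constant and helper names are made up for illustration. `matches_exact` is only a local approximation of exact string matching so you can unit-test your policies; it is not how SNS evaluates them and it ignores prefix, numeric, and exists operators.

```python
import json

# One filter policy per subscription. Keys are message attribute names.
ANALYTICS_FILTER = {"event_type": ["order_placed"]}
FRAUD_FILTER = {"event_type": ["payment_failed"]}


def subscription_attributes(filter_policy: dict) -> dict:
    """Attributes to pass to sns.subscribe(..., Attributes=...)."""
    return {"FilterPolicy": json.dumps(filter_policy)}


def message_attributes(event_type: str) -> dict:
    """MessageAttributes to pass to sns.publish so the filter has something to match."""
    return {"event_type": {"DataType": "String", "StringValue": event_type}}


def matches_exact(filter_policy: dict, attributes: dict) -> bool:
    """Simplified local check: every policy key must match one of its allowed values."""
    return all(
        attributes.get(key) in allowed for key, allowed in filter_policy.items()
    )
```

Note that the attribute travels in MessageAttributes, not in the message body. With the default filter scope, a field buried in the body is invisible to the filter.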
IAM Policies That Actually Allow SQS+SNS to Talk — No More Silent Failures
Your SNS→SQS fan-out works in dev but mysteriously drops messages in prod. Why? IAM policies. The common mistake is creating the subscription but forgetting the SQS queue policy that lets SNS deliver to the queue. Without it, messages vanish silently. The fix: attach an SQS queue policy that explicitly allows the sns.amazonaws.com service principal to call sqs:SendMessage, and for cross-account setups make sure the SNS topic policy allows the queue owner's account to subscribe. Use a condition key like aws:SourceArn to lock down which topic can publish. Never use a wildcard principal without that condition unless you want random accounts dumping data into your queue. Test by publishing a message and reading it back with aws sqs receive-message, and confirm the queue's RedrivePolicy points at the right DLQ. If the queue and its DLQ both stay empty while SNS shows messages published, your IAM chain is broken: check the SNS NumberOfNotificationsFailed metric, turn on SNS delivery status logging for the topic, and fix the queue policy.
Principal: "*" for SQS policy grants any AWS account permission to send. Always lock with aws:SourceArn.
CloudWatch Alarms That Wake You Up Before the Queue Drowns
SQS queues die quietly. The first sign is usually a support ticket from users complaining of slow responses. Don't wait. Set CloudWatch alarms on four metrics: ApproximateNumberOfMessagesVisible (backlog), ApproximateAgeOfOldestMessage (stale messages), NumberOfMessagesSent (traffic spikes), and NumberOfMessagesDeleted (consumer health). For SNS, alarm on NumberOfNotificationsFailed and NumberOfNotificationsFilteredOut to catch permission or filter errors. Set thresholds: backlog > 1000 for 5 minutes = PagerDuty page. Age > 15 minutes = critical. Tie alarms to SNS topics (ironic but effective) that trigger Lambda to scale up consumers or restart stuck workers. Use metric math such as (Visible - InFlight) / ConsumerCount to detect imbalance, where Visible is ApproximateNumberOfMessagesVisible, InFlight is ApproximateNumberOfMessagesNotVisible, and ConsumerCount is a number you supply yourself, because SQS publishes no such metric. Test your alarms by flooding the queue with a script. If you don't get paged, your alerting is broken.
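The thresholds above translate into alarm definitions along these lines. Treat it as a sketch: the function names, alarm names, and the single `alarm_topic_arn` are assumptions, and the 5-minute period is a conservative choice. The dictionaries are shaped as keyword arguments for CloudWatch's put_metric_alarm.

```python
def queue_alarms(queue_name: str, dlq_name: str, alarm_topic_arn: str) -> list:
    """Build put_metric_alarm keyword arguments for one queue and its DLQ."""

    def alarm(name: str, queue: str, metric: str, threshold: int) -> dict:
        return {
            "AlarmName": name,
            "Namespace": "AWS/SQS",
            "MetricName": metric,
            "Dimensions": [{"Name": "QueueName", "Value": queue}],
            "Statistic": "Maximum",
            "Period": 300,  # one 5-minute window
            "EvaluationPeriods": 1,
            "Threshold": threshold,
            "ComparisonOperator": "GreaterThanThreshold",
            "TreatMissingData": "notBreaching",
            "AlarmActions": [alarm_topic_arn],
        }

    return [
        # Backlog above 1000 messages.
        alarm(f"{queue_name}-backlog", queue_name, "ApproximateNumberOfMessagesVisible", 1000),
        # Oldest message older than 15 minutes (the metric is in seconds).
        alarm(f"{queue_name}-age", queue_name, "ApproximateAgeOfOldestMessage", 900),
        # Anything at all sitting in the DLQ.
        alarm(f"{dlq_name}-not-empty", dlq_name, "ApproximateNumberOfMessagesVisible", 0),
    ]


def apply_alarms(cloudwatch, alarms: list) -> None:
    """Create or update each alarm. `cloudwatch` is a boto3 CloudWatch client."""
    for definition in alarms:
        cloudwatch.put_metric_alarm(**definition)
```

Generating the definitions from the queue name keeps every queue on the same thresholds, which is the point: nobody gets to skip the DLQ alarm.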
ApproximateNumberOfMessagesVisible alone misses stale messages. Always pair with ApproximateAgeOfOldestMessage.
FIFO Throughput Limits: Why Your Ordering Breaks at Scale
FIFO queues guarantee exactly-once processing and strict ordering, but by default they cap throughput at 3000 messages per second (with batching) or 300 per second (without). Exceed that and your sends are rejected with throttling errors. Worse, message groups serialize: one slow consumer working on a MessageGroupId blocks every later message in that group. To fix: shard your groups by increasing the number of unique MessageGroupId values (e.g., by user ID). Each group gets its own slot. Use the Amazon SQS Extended Client Library for messages >256KB. If you need >3000/s, switch to standard queues and handle deduplication in your app. For SNS FIFO topics, subscribe SQS FIFO queues so ordering survives end to end; within a message group, only one consumer processes messages at a time. When you batch sends, set a MessageDeduplicationId on each entry or enable content-based deduplication on the queue. Test your throttle limits with a load test before hitting production.
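Sharding by user ID looks something like this in practice. A minimal sketch: the function name, the `user_id` field, and the choice of a SHA-256 hash of the body as the deduplication ID are assumptions for illustration. The output is shaped for the `Entries` argument of boto3's send_message_batch.

```python
import hashlib
import json


def fifo_batch_entries(events: list) -> list:
    """Build send_message_batch entries for a FIFO queue, one group per user."""
    if len(events) > 10:
        raise ValueError("send_message_batch accepts at most 10 entries")
    entries = []
    for index, event in enumerate(events):
        body = json.dumps(event, sort_keys=True)
        entries.append(
            {
                "Id": str(index),  # must be unique within the batch
                "MessageBody": body,
                # Ordering is only guaranteed inside a group, so one group per
                # user keeps a slow user from blocking everyone else.
                "MessageGroupId": f"user-{event['user_id']}",
                # Explicit dedup ID: identical bodies collapse into one message
                # within the deduplication window.
                "MessageDeduplicationId": hashlib.sha256(body.encode("utf-8")).hexdigest(),
            }
        )
    return entries
```

Send the result with sqs.send_message_batch(QueueUrl=..., Entries=entries). If two events for the same user can legitimately have identical bodies, hash in an event ID as well, or the second one will be dropped as a duplicate.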
A single MessageGroupId serializes every message in the queue. One slow consumer then blocks everything behind it, so always shard.
Lambda Throttling + SNS = 10,000 Lost Order Events
ApproximateAgeOfOldestMessage > 5 minutes on the SQS queue
Rule: If you can't lose the message, never subscribe Lambda directly to SNS in production. Always use SQS as the durable buffer.
- SNS to Lambda = at-most-once delivery when throttling hits.
- SNS to SQS = at-least-once delivery + durable storage.
- Add DLQ to every SQS queue. maxReceiveCount=3.
- Monitor DLQ depth. A message in DLQ is a service bug.
- SQS ApproximateAgeOfOldestMessage alarm = early warning system.
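The DLQ rule is one queue attribute. A sketch of attaching it and reading the DLQ depth, assuming a boto3 SQS client is passed in as `sqs`; the function names are hypothetical.

```python
import json


def redrive_attributes(dlq_arn: str, max_receive_count: int = 3) -> dict:
    """Queue attributes that send a message to the DLQ after N failed receives."""
    return {
        "RedrivePolicy": json.dumps(
            {
                "deadLetterTargetArn": dlq_arn,
                "maxReceiveCount": str(max_receive_count),
            }
        )
    }


def attach_dlq(sqs, queue_url: str, dlq_arn: str) -> None:
    """Attach the DLQ to an existing source queue."""
    sqs.set_queue_attributes(QueueUrl=queue_url, Attributes=redrive_attributes(dlq_arn))


def dlq_depth(sqs, dlq_url: str) -> int:
    """Approximate number of messages waiting in the DLQ."""
    response = sqs.get_queue_attributes(
        QueueUrl=dlq_url, AttributeNames=["ApproximateNumberOfMessages"]
    )
    return int(response["Attributes"]["ApproximateNumberOfMessages"])
```

Anything other than zero from `dlq_depth` is the bug signal: investigate before you purge.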
delete_message() call. Also check VisibilityTimeout: if processing takes longer than the timeout, the message reappears. Increase the timeout or send a heartbeat by extending it mid-flight with ChangeMessageVisibility.
aws sqs get-queue-attributes --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-queue --attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible ApproximateNumberOfMessagesDelayed
aws cloudwatch get-metric-statistics --namespace AWS/SQS --metric-name ApproximateAgeOfOldestMessage --dimensions Name=QueueName,Value=my-queue --period 300 --statistics Maximum --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) --end-time $(date -u +%Y-%m-%dT%H:%M:%S)
Key takeaways
Common mistakes to avoid
8 patterns
- Subscribing Lambda directly to SNS in production
- Using short polling (WaitTimeSeconds=0 or omitting it entirely)
- Not attaching a Dead-Letter Queue to every SQS queue
- Deleting messages from DLQ without investigating root cause
- Putting filter criteria inside the SNS message body instead of MessageAttributes
- Forgetting the SQS queue policy that allows SNS to send messages
- Setting VisibilityTimeout shorter than the actual processing time
- Using FIFO queues everywhere by default
Interview Questions on This Topic
What is the difference between SQS and SNS? When would you use each?
Frequently Asked Questions