Junior 25 min · March 06, 2026

SQS vs SNS: The Silent Production Failure Engineers Miss

SNS retries throttled Lambda twice then drops messages permanently.

Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Drawn from code that ran under real load.

May 24, 2026
last updated
Quick Answer
  • SQS = queue. One message, one consumer. Durable storage (up to 14 days). Use for async task processing, rate limiting, back-pressure.
  • SNS = pub/sub topic. One message, all subscribers. No storage. Use for event broadcasting, fan-out, decoupling producers from consumers.
  • SNS+SQS fan-out = production standard. SNS broadcasts to multiple SQS queues. Each queue durably stores its copy. Never subscribe Lambda directly to SNS in production — SQS in between absorbs throttling.
  • Long polling: always set WaitTimeSeconds=20 in receive_message. Cuts API calls by 95%, drops costs.
  • Dead-letter queue (DLQ): maxReceiveCount=3. Messages that fail processing go to DLQ, not infinite retry. Monitor DLQ depth → that's your bug signal.
  • Cost trap: SNS to Lambda subscriptions retry twice then drop messages on throttle. SQS queues hold messages safely for days.
Definition (~90s read)
What Are AWS SQS and SNS?

Amazon SQS (Simple Queue Service) and SNS (Simple Notification Service) are AWS's two core asynchronous messaging services, but they solve fundamentally different problems. SQS is a fully managed message queue designed for reliable, durable point-to-point communication between distributed system components.

It guarantees at-least-once delivery, supports message retention up to 14 days, and provides dead-letter queues (DLQs) for handling failures. You use SQS when you need to decouple microservices, buffer spikes in traffic, or ensure no message is lost even if a consumer crashes mid-processing.

In contrast, SNS is a pub/sub notification service that broadcasts messages to multiple subscribers simultaneously—think of it as a megaphone for your system. It pushes messages to HTTP endpoints, Lambda functions, SQS queues, email, SMS, or mobile push, making it ideal for fan-out patterns, alerting, and event-driven architectures where you need to notify many consumers at once.

The critical distinction that causes silent production failures is that SQS requires consumers to pull messages (polling), while SNS pushes messages to subscribers. This means SQS gives you backpressure control—your consumer processes at its own pace—but introduces latency and polling costs.

SNS delivers instantly but offers no retry logic beyond its built-in exponential backoff (max 3 retries for HTTP endpoints). The most common production mistake is treating them as interchangeable: using SNS where you need durability (e.g., critical order processing) or using SQS where you need real-time fan-out (e.g., triggering 50 Lambda functions).

The SNS+SQS fan-out pattern solves this by subscribing SQS queues to an SNS topic, combining SNS's broadcast speed with SQS's durability—but misconfiguring DLQs or forgetting to set a message retention policy on the queue can silently lose data.

EventBridge is the third player in this space, offering a more sophisticated event bus with schema registry, filtering, and routing rules. Use SQS+SNS when you need simple, high-throughput messaging with minimal overhead—EventBridge adds complexity and cost (per-event pricing) that's overkill for straightforward queue or notification patterns.

For example, a payment processing pipeline should use SQS for order messages (durability, retries) and SNS for sending confirmation emails (broadcast). EventBridge shines when you need complex event filtering, cross-account routing, or integration with AWS SaaS partners.

The decision table in this article will help you choose: SQS for reliable point-to-point, SNS for simple broadcast, EventBridge for event-driven orchestration with filtering.

Plain-English First

Imagine a busy pizza restaurant. SNS is the manager who shouts 'Order 42 is ready!' — every station (kitchen, cashier, delivery) that cares about that announcement hears it at the same time. SQS is the ticket rail above the grill — each chef grabs one ticket, works through it at their own pace, and the ticket is gone once it's done. One broadcasts, one queues. That's the whole mental model.

Modern applications rarely do one thing at a time. A user places an order and suddenly you need to charge their card, send a confirmation email, update inventory, notify the warehouse, and log an audit trail — all reliably, even if your email service crashes at 2 a.m. That's the problem AWS SQS and SNS were built to solve. They decouple the parts of your system so that a failure in one place doesn't cascade everywhere.

Without messaging services like these, you'd wire services together with direct HTTP calls. Service A calls Service B, which calls Service C. If B is slow, A waits. If C is down, the whole chain breaks. SQS introduces a buffer — a durable queue that holds messages until a consumer is ready to process them. SNS takes a different angle: it lets one event instantly fan out to dozens of subscribers without the publisher needing to know who they are.

By the end of this article you'll know the architectural difference between a queue and a publish-subscribe topic, how to wire SQS and SNS together for a real fan-out pattern, what dead-letter queues are and why you desperately need them, and exactly when to reach for each service in your next cloud project.

Here's the reality: most teams learn these services after losing messages in production. That's why this guide focuses on failure modes first. You'll walk away knowing exactly where silent drops happen and how to build a fan-out that survives traffic spikes.

SQS and SNS: The Two Async Pillars That Break Differently

Amazon SQS (Simple Queue Service) and SNS (Simple Notification Service) are AWS's managed messaging services. SQS is a pull-based queue: producers send messages, consumers poll and delete them. SNS is a push-based pub/sub bus: publishers send to a topic, which fans out to multiple subscribers (SQS queues, Lambda, HTTP endpoints, etc.). The core mechanic: SQS Standard queues guarantee at-least-once delivery, and FIFO queues add exactly-once processing via deduplication IDs; SNS delivers each message to every subscriber with no built-in retry beyond its delivery policy.

In practice, SQS decouples microservices with a buffer that absorbs traffic spikes: a single queue can handle thousands of messages per second, and the visibility timeout keeps other consumers from picking up a message while one worker is handling it (this reduces duplicate processing but does not eliminate it). SNS pushes messages immediately, making it ideal for event broadcasting, but its push model has no back-pressure: a slow or throttled subscriber can lose messages once the delivery policy exhausts its retries. SQS supports FIFO ordering and deduplication; standard SNS topics do not guarantee order across subscribers.

Use SQS when you need reliable, asynchronous work processing with consumer-driven pacing, e.g., order fulfillment pipelines. Use SNS when you need to fan out the same event to multiple independent consumers, e.g., account creation triggers email, SMS, and audit logging. The critical nuance: SNS + SQS is the standard pattern for reliable fan-out, but many teams miss that SNS delivery failures can silently drop messages if the SNS subscription has no dead-letter queue (a RedrivePolicy on the subscription, which is separate from the SQS queue's own redrive policy).

Silent Message Loss
SNS delivery to SQS can fail without alerting you. Always set a dead-letter queue on the SNS subscription to catch undeliverable messages.
Production Insight
A team used SNS to fan out order events to a Lambda and a DynamoDB stream. The Lambda hit concurrency limits, causing SNS delivery failures. No DLQ was configured — 12% of events silently disappeared, corrupting inventory counts.
Symptom: inventory discrepancies with no error logs; SNS CloudWatch metrics showed 'NumberOfNotificationsDelivered' matching sent, but 'NumberOfNotificationsFailed' was zero because failures were counted per subscriber, not per message.
Rule of thumb: Always attach a DLQ to every SNS subscription and monitor the DLQ's ApproximateNumberOfMessagesVisible metric — if it rises above zero, you're losing data.
Key Takeaway
SQS is pull-based with consumer-controlled throughput; SNS is push-based with publisher-controlled fan-out.
SNS + SQS is the reliable fan-out pattern, but only if you configure DLQs on both sides.
Never assume SNS delivery succeeds — monitor subscription-level metrics and DLQ depth, not just topic-level counts.
[Figure: SQS vs SNS: Async Messaging Decision Flow (thecodeforge.io). From queue vs pub/sub to fan-out and dead-letter queues. Panels: SQS, durable message queue (pull-based, at-least-once delivery, FIFO option); SNS, pub/sub megaphone (push-based, fan-out to multiple subscribers); SNS+SQS fan-out pattern (SNS pushes to SQS queues for durable processing); dead-letter queue (captures failed messages after retries); EventBridge alternative (schema-aware, filtering, third-party integration); Amazon MQ alternative (managed ActiveMQ/RabbitMQ for existing apps). Warning: no DLQ on the SQS queue or SNS subscription is a silent failure; always configure a DLQ to catch undeliverable messages.]

SQS — The Durable Message Queue That Saves Your System at 2 A.M.

SQS (Simple Queue Service) is a fully managed message queue. A producer drops a message into the queue, and one or more consumers poll the queue and process messages at their own pace. The key word is 'one': each message is meant to be handled by a single consumer rather than broadcast to all of them. This is point-to-point messaging.

Why does that matter? Because it gives you back-pressure handling for free. If your order-processing service is overwhelmed, messages just pile up in the queue safely. The queue acts as a shock absorber between the part of your system that generates work and the part that does the work.

There are two flavours. Standard queues give you maximum throughput with at-least-once delivery and best-effort ordering, meaning a message might appear twice (rare, but plan for it). FIFO queues guarantee exactly-once processing and strict order, but cap you at 3,000 messages per second with batching (300 API calls per second without it) unless you enable high-throughput mode. Choose FIFO when order actually matters: financial transactions, state machines. Choose Standard everywhere else.

Messages live in the queue for 4 days by default, and you can raise the retention period to a maximum of 14 days. The visibility timeout is the other critical setting: after a consumer picks up a message, it becomes invisible to other consumers for that window. If your Lambda or EC2 worker crashes mid-process, the message reappears and gets retried. That's your built-in retry mechanism.

Long polling is the single most impactful cost-saving setting. Always set WaitTimeSeconds=20 in your receive_message calls. Without it, your consumer uses short polling — it returns immediately even if no messages exist. You pay per API call, so a quiet queue will cost you thousands of empty calls. Long polling holds the connection for up to 20 seconds, waiting for a message. On a queue that gets one message per minute, long polling cuts your API calls by 95%.

Batch operations amplify cost savings further. Use send_message_batch to send up to 10 messages in a single API call, and receive up to 10 messages per receive_message call. Batching cuts your per-message API cost by up to 90% compared to sending one message per call.
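A minimal sketch of batched sending, assuming a boto3 SQS client is passed in by the caller. The helper names `chunked` and `send_orders_in_batches` are illustrative and not part of the original example. `send_message_batch` accepts at most 10 entries per call, so the list is split into chunks of 10, and the `Failed` list in each response is checked because a batch call can succeed for some entries and fail for others.

```python
import json
from typing import Iterator


def chunked(items: list, size: int = 10) -> Iterator[list]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def send_orders_in_batches(sqs_client, queue_url: str, orders: list) -> list:
    """Send orders via send_message_batch; return the Ids of rejected entries."""
    failed_ids = []
    for batch_number, batch in enumerate(chunked(orders, 10)):
        entries = [
            # Id only has to be unique within a single batch request.
            {'Id': f'{batch_number}-{i}', 'MessageBody': json.dumps(order)}
            for i, order in enumerate(batch)
        ]
        response = sqs_client.send_message_batch(QueueUrl=queue_url, Entries=entries)
        # Partial failure is possible, so collect anything SQS did not accept.
        failed_ids.extend(failure['Id'] for failure in response.get('Failed', []))
    return failed_ids
```

In real use you would pass `boto3.client('sqs')` as `sqs_client` and retry or log whatever comes back in the returned list.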

sqs_producer_consumer.py (Python)
import boto3
import json
import time

# --- PRODUCER: Order Service drops a new order into the queue ---

sqs_client = boto3.client('sqs', region_name='us-east-1')

ORDER_QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/order-processing-queue'

def place_order(order_id: str, customer_email: str, items: list):
    order_payload = {
        'order_id': order_id,
        'customer_email': customer_email,
        'items': items,
        'timestamp': time.time()
    }

    # MessageGroupId is only needed for FIFO queues.
    # For Standard queues, just send the body.
    response = sqs_client.send_message(
        QueueUrl=ORDER_QUEUE_URL,
        MessageBody=json.dumps(order_payload),
        # DelaySeconds postpones visibility to consumers — useful for
        # scheduling near-future jobs (max 900 seconds = 15 minutes)
        DelaySeconds=0
    )

    print(f"[PRODUCER] Message sent. SQS Message ID: {response['MessageId']}")
    return response


# --- CONSUMER: Fulfilment Service polls and processes orders ---

def process_pending_orders():
    """
    Polls the queue for up to 10 messages at a time.
    Uses long polling (WaitTimeSeconds=20) to reduce empty responses
    and lower your AWS bill — this is the single most impactful SQS setting.
    """
    print("[CONSUMER] Polling for messages...")

    response = sqs_client.receive_message(
        QueueUrl=ORDER_QUEUE_URL,
        MaxNumberOfMessages=10,   # Batch up to 10 messages per poll
        WaitTimeSeconds=20,       # Long polling — ALWAYS use this
        VisibilityTimeout=60      # Give our worker 60s to finish before retry
    )

    messages = response.get('Messages', [])

    if not messages:
        print("[CONSUMER] Queue is empty. Nothing to process.")
        return

    for message in messages:
        receipt_handle = message['ReceiptHandle']  # Needed to delete the message
        order = json.loads(message['Body'])

        print(f"[CONSUMER] Processing order {order['order_id']} "
              f"for {order['customer_email']}")

        try:
            # Simulate doing the actual work (charge card, update DB, etc.)
            fulfil_order(order)

            # CRITICAL: Delete the message after successful processing.
            # If you skip this, SQS re-delivers it after VisibilityTimeout expires.
            sqs_client.delete_message(
                QueueUrl=ORDER_QUEUE_URL,
                ReceiptHandle=receipt_handle
            )
            print(f"[CONSUMER] Order {order['order_id']} done. Message deleted.")

        except Exception as e:
            # Do NOT delete the message on failure.
            # SQS will make it visible again after VisibilityTimeout expires.
            # After maxReceiveCount failures, it moves to the DLQ automatically.
            print(f"[CONSUMER] Failed to process order {order['order_id']}: {e}. "
                  f"Message will reappear for retry.")


def fulfil_order(order: dict):
    """Placeholder for real business logic."""
    print(f"  -> Charging card and packing {len(order['items'])} item(s)...")
    time.sleep(0.1)  # Simulate processing time


# --- Run it ---
place_order(
    order_id='ORD-9981',
    customer_email='alex@example.com',
    items=['Laptop Stand', 'USB-C Hub']
)

process_pending_orders()
Output
[PRODUCER] Message sent. SQS Message ID: 4e2b1a3f-7c9d-4f1e-b2a0-8d3e5f6c7b8a
[CONSUMER] Polling for messages...
[CONSUMER] Processing order ORD-9981 for alex@example.com
-> Charging card and packing 2 item(s)...
[CONSUMER] Order ORD-9981 done. Message deleted.
Short Polling Will Drain Your Wallet
If you omit WaitTimeSeconds or set it to 0, SQS uses short polling. Your consumer hammers the API with empty responses and you pay for every single API call. Always set WaitTimeSeconds=20 (the maximum). It's a one-line change that cuts polling costs by up to 95% on quiet queues.
Production Insight
A team deployed an SQS consumer with WaitTimeSeconds not set (defaults to 0). The queue received 1 message per minute. Their consumer polled every 0.1 seconds (10 times per second). Daily API calls: 864,000, or roughly 26 million per month. At about $0.40 per million standard-queue requests, that is roughly $10 a month spent on empty responses from a single consumer, and the figure scales with every additional worker and queue.
After fixing: WaitTimeSeconds=20. Receive calls dropped to about 3 per minute (one long poll every 20 seconds). Daily calls: 4,320, or about 130,000 per month, which costs around $0.05. A 99.5% reduction with one line of code.
Rule: Always set WaitTimeSeconds=20 in every receive_message call. For Lambda with SQS event source mapping, this is handled automatically — but for EC2 workers or on-prem consumers, you must explicitly set it.
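The request counts in this example can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch; the function name `daily_receive_calls` is made up for illustration, and it deliberately counts requests rather than dollars.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_receive_calls(calls_per_second: float) -> int:
    """Number of ReceiveMessage calls a single consumer makes in one day."""
    return round(calls_per_second * SECONDS_PER_DAY)


short_polling = daily_receive_calls(10)      # one call every 0.1 seconds
long_polling = daily_receive_calls(3 / 60)   # about three long polls per minute
reduction = 1 - long_polling / short_polling

print(short_polling, long_polling, f"{reduction:.1%}")  # prints: 864000 4320 99.5%
```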
Key Takeaway
SQS = durable queue. Message stored 14 days. One consumer per message.
Standard = high throughput (unlimited), at-least-once. FIFO = 3k msg/s, exactly-once.
Visibility timeout = built-in retry. Message reappears if consumer crashes.
Long polling (WaitTimeSeconds=20) = 95% cost reduction. Always use it.
Choose Standard vs FIFO SQS
If: Message order matters (financial transaction, state machine step, audit trail)
Use: FIFO queue. Exactly-once processing, strict ordering. 3,000 msg/sec max with batching.
If: High throughput (>3,000 msg/sec) or order doesn't matter
Use: Standard queue. Unlimited throughput, at-least-once delivery. Make your consumers idempotent to handle rare duplicates safely.
If: Duplicate processing would break the system (e.g., double payment)
Use: FIFO queue with content-based deduplication. Deduplication window is 5 minutes; identical messages within that window are delivered only once.
If: You need high throughput AND deduplication with FIFO semantics
Use: FIFO with a MessageDeduplicationId derived from a stable business key (e.g., order ID + idempotency token). Avoid content-based deduplication for payloads that might legitimately repeat.
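The last row can be sketched as follows, assuming a boto3 SQS client is passed in and the target is a FIFO queue. The function names and the `customer_id` field are hypothetical; the point is only that the deduplication ID is derived from a stable business key, so a retried send of the same order produces the same ID.

```python
import hashlib
import json


def deduplication_id(order_id: str, idempotency_token: str) -> str:
    """Same business key in, same ID out (64 hex chars, under the 128-char limit)."""
    key = f'{order_id}:{idempotency_token}'.encode('utf-8')
    return hashlib.sha256(key).hexdigest()


def send_fifo_order(sqs_client, queue_url: str, order: dict, idempotency_token: str) -> dict:
    """Send one order to a FIFO queue with an explicit deduplication ID."""
    return sqs_client.send_message(
        QueueUrl=queue_url,                      # FIFO queue names end in .fifo
        MessageBody=json.dumps(order),
        MessageGroupId=order['customer_id'],     # ordering is enforced per group
        MessageDeduplicationId=deduplication_id(order['order_id'], idempotency_token),
    )
```

Grouping by customer is just one possible choice: messages in different groups can be processed in parallel, while messages within one group stay in order.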

SNS — The Pub/Sub Megaphone That Notifies Everyone at Once

SNS (Simple Notification Service) works on the publish-subscribe model. You publish one message to a Topic, and SNS fans it out simultaneously to every subscriber — SQS queues, Lambda functions, HTTP endpoints, email addresses, mobile push notifications. The publisher has zero knowledge of who's listening. Adding a new subscriber doesn't touch the publisher at all.

This is the architectural superpower. Imagine your user-signup event needs to trigger a welcome email, a CRM record creation, a Slack notification to your growth team, and an analytics event. With SNS, your Auth service publishes one 'UserRegistered' message and walks away. Four independent services consume it in parallel.

Message filtering is what takes SNS from useful to essential. Instead of every subscriber receiving every message on a topic, you attach a filter policy to a subscription. Your EU payments service can subscribe to the 'transactions' topic but only receive messages where region=EU. This keeps each service focused on what it actually cares about. Filter policies support string matching, prefix matching, numeric ranges, and existence checks — making them powerful enough for most routing requirements without needing EventBridge.

SNS does not store messages. If a subscriber is down when the message arrives and the delivery retries are exhausted, that message is gone unless the subscriber is an SQS queue (which durably stores it). That's the most important SNS limitation to internalise, and it leads directly to the most powerful pattern: SNS + SQS fan-out.

SNS delivery logging is a critical debugging tool that most teams don't enable. It publishes delivery attempts, failures, and throttling events to CloudWatch Logs. When messages go missing, this is the first place to look. Enable it on every production SNS topic.

RawMessageDelivery is another setting most teams overlook. By default, SNS wraps your message in a JSON envelope containing metadata like the topic ARN, subject, and signature. When your SQS consumer receives the message, it has to unwrap the SNS envelope before parsing your actual payload. Setting RawMessageDelivery=true on a subscription tells SNS to deliver your original JSON body directly — no wrapping. This simplifies consumer code and avoids subtle bugs from double JSON-encoding.
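To make the envelope issue concrete, here is a small sketch of a consumer-side helper that copes with both delivery modes. The name `extract_payload` is illustrative, and the check relies on the SNS envelope carrying `Type` set to `Notification` with the original message as a string under `Message`.

```python
import json


def extract_payload(sqs_body: str) -> dict:
    """Return the publisher's JSON payload from an SQS message body."""
    body = json.loads(sqs_body)
    # Without raw delivery, SNS wraps the payload in an envelope and the
    # original message arrives as a JSON string inside the 'Message' field.
    if isinstance(body, dict) and body.get('Type') == 'Notification' and 'Message' in body:
        return json.loads(body['Message'])
    # With raw delivery, the body already is the payload.
    return body
```

A payload that itself contains `Type: Notification` and a `Message` field would be misread by this check, which is one more reason to pick a single delivery mode per queue and stick to it.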

sns_fanout_publisher.py (Python)
import boto3
import json

sns_client = boto3.client('sns', region_name='us-east-1')

USER_EVENTS_TOPIC_ARN = 'arn:aws:sns:us-east-1:123456789012:user-lifecycle-events'

def publish_user_registered_event(user_id: str, email: str, plan: str, region: str):
    event_payload = {
        'user_id': user_id,
        'email': email,
        'plan': plan,
        'region': region
    }

    response = sns_client.publish(
        TopicArn=USER_EVENTS_TOPIC_ARN,
        Message=json.dumps(event_payload),
        Subject='UserRegistered',  # Used by email subscribers as the email subject

        # MessageAttributes enable server-side filtering.
        # A subscription with filter {"plan": ["pro"]} only receives pro-plan signups.
        # This avoids every subscriber having to filter in their own code.
        MessageAttributes={
            'event_type': {
                'DataType': 'String',
                'StringValue': 'UserRegistered'
            },
            'plan': {
                'DataType': 'String',
                'StringValue': plan  # 'free' or 'pro'
            },
            'region': {
                'DataType': 'String',
                'StringValue': region  # 'EU', 'US', 'APAC'
            }
        }
    )

    print(f"[SNS] Event published. Message ID: {response['MessageId']}")
    print(f"[SNS] Delivered to all active subscribers in parallel.")
    return response


# --- Subscribing an SQS queue to this SNS topic (infra setup, done once) ---

def subscribe_sqs_queue_to_topic(
    topic_arn: str,
    queue_arn: str,
    filter_policy: dict = None
) -> str:
    """
    Wires an SQS queue as an SNS subscriber.
    Optionally attach a filter so this queue only gets relevant messages.
    RawMessageDelivery=true ensures the SQS message body IS your JSON,
    not wrapped in an SNS envelope — simpler consumer code, no double-parsing.
    """
    subscribe_kwargs = {
        'TopicArn': topic_arn,
        'Protocol': 'sqs',
        'Endpoint': queue_arn,
        'Attributes': {
            'RawMessageDelivery': 'true'  # Always set this for SQS subscribers
        }
    }

    if filter_policy:
        # FilterPolicy limits which messages reach this subscription.
        # The CRM service only wants to hear about 'pro' plan signups.
        subscribe_kwargs['Attributes']['FilterPolicy'] = json.dumps(filter_policy)

    response = sns_client.subscribe(**subscribe_kwargs)
    print(f"[SNS] Queue subscribed. Subscription ARN: {response['SubscriptionArn']}")
    return response['SubscriptionArn']


# --- Example: set up the CRM subscription with a filter ---
subscribe_sqs_queue_to_topic(
    topic_arn=USER_EVENTS_TOPIC_ARN,
    queue_arn='arn:aws:sqs:us-east-1:123456789012:crm-sync-queue',
    filter_policy={'plan': ['pro']}  # Only pro signups reach the CRM queue
)

# --- Publish a new user signup ---
publish_user_registered_event(
    user_id='usr-4471',
    email='jordan@example.com',
    plan='pro',
    region='EU'
)
Output
[SNS] Queue subscribed. Subscription ARN: arn:aws:sns:us-east-1:123456789012:user-lifecycle-events:a1b2c3d4-e5f6-7890-abcd-ef1234567890
[SNS] Event published. Message ID: 7f3a2b1c-9e8d-4c7b-a6f5-2e1d0c9b8a7f
[SNS] Delivered to all active subscribers in parallel.
Use Message Attributes for Filtering, Not Message Body
By default, SNS filter policies match on MessageAttributes, not on the JSON inside the Message body. A common mistake is embedding filter criteria inside the payload and wondering why all subscribers still get everything. Put your routing metadata in MessageAttributes and keep your payload clean, or explicitly set the subscription's FilterPolicyScope to MessageBody if you need payload-based filtering.
Production Insight
A team set up an SNS topic for payment events. They had a fraud detection Lambda subscriber that should only process transactions > $10,000. They put the threshold logic in the Lambda itself — which meant the Lambda was invoked for every transaction, including $5 coffee purchases, and filtered internally.
Monthly bill: 2 million Lambda invocations at $0.20 per million = $0.40 (cheap). But the Lambda's execution time for filtering and early-return added up to 2,000 GB-seconds at ~$0.00001667 per GB-second = $0.03. Total ~$0.43/month. Not huge, but wasteful.
Bigger issue: The Lambda's concurrency limit was 100. Every $5 transaction consumed a slot. A spike of 100 small transactions could block a $50,000 fraudulent transaction from being processed.
Fix: Added SNS filter policy on the subscription using a numeric amount MessageAttribute: {"amount": [{"numeric": [">", 10000]}]}. Lambda now only invoked for high-value transactions. Concurrency preserved for important work.
Rule: Push filtering to SNS when possible. It saves compute, saves concurrency, and keeps your event-driven architecture clean.
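A sketch of that fix, assuming a boto3 SNS client is passed in. The helper names are illustrative. Two details matter: the filter policy is set as a subscription attribute, and numeric matching only works when the publisher sends the attribute with DataType `Number`.

```python
import json

# Only messages whose 'amount' attribute is greater than 10000 pass the filter.
HIGH_VALUE_FILTER = {'amount': [{'numeric': ['>', 10000]}]}


def amount_attributes(amount: float) -> dict:
    """MessageAttributes for publish(); numeric filters need DataType 'Number'."""
    return {'amount': {'DataType': 'Number', 'StringValue': str(amount)}}


def apply_high_value_filter(sns_client, subscription_arn: str) -> None:
    """Attach the filter policy to an existing subscription."""
    sns_client.set_subscription_attributes(
        SubscriptionArn=subscription_arn,
        AttributeName='FilterPolicy',
        AttributeValue=json.dumps(HIGH_VALUE_FILTER),
    )
```

On the publishing side this would look like `sns_client.publish(TopicArn=..., Message=..., MessageAttributes=amount_attributes(12500))`.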
Key Takeaway
SNS = pub/sub topic. One message, all subscribers. No storage.
Message filters on subscriptions prevent unnecessary invocations.
SNS delivery logging = first debugging step for missing messages.
RawMessageDelivery=true simplifies consumer parsing — always set on SQS subscriptions.
Never subscribe Lambda directly to SNS in prod unless you can afford to lose messages.

Dead-Letter Queues and the SNS+SQS Fan-Out Pattern — Production Essentials

Two patterns separate a toy cloud setup from a production-grade one: dead-letter queues (DLQs) and the SNS+SQS fan-out architecture. You need to understand both.

A DLQ is just another SQS queue. You configure it on your main queue and set maxReceiveCount — say, 3. If a message fails processing 3 times, SQS automatically moves it to the DLQ instead of retrying forever or silently dropping it. Your team gets alerted, investigates the poisoned message, and the rest of your queue keeps flowing normally. Without a DLQ, one bad message can block your queue or create an infinite retry storm.

The fan-out pattern solves SNS's biggest weakness — no durability. The rule is: never subscribe a Lambda directly to SNS in production if message loss is unacceptable. Instead, subscribe an SQS queue to the SNS topic. The queue durably catches every message. Your Lambda then polls the queue. You get SNS's broadcasting power AND SQS's durability and retry logic together. This is the architectural backbone of most event-driven AWS systems.

The code example below wires both patterns together with IaC-style Boto3 calls, showing exactly how a DLQ connects to a main queue.

Monitoring the DLQ: Set a CloudWatch alarm that fires when the DLQ's ApproximateNumberOfMessagesVisible metric is greater than 0. A message in the DLQ means your consumer failed to process it after maxReceiveCount attempts. That's a bug, and it shouldn't be ignored or silently deleted. Your on-call should get a page.
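One way to express that alarm is sketched below, assuming a boto3 CloudWatch client and an existing SNS topic for alerts. The function names, the alarm name suffix, and the 60-second period are illustrative choices; the metric used is `ApproximateNumberOfMessagesVisible` in the `AWS/SQS` namespace.

```python
def dlq_alarm_params(dlq_name: str, alert_topic_arn: str) -> dict:
    """Build put_metric_alarm arguments for a 'DLQ is not empty' alarm."""
    return {
        'AlarmName': f'{dlq_name}-not-empty',
        'Namespace': 'AWS/SQS',
        'MetricName': 'ApproximateNumberOfMessagesVisible',
        'Dimensions': [{'Name': 'QueueName', 'Value': dlq_name}],
        'Statistic': 'Maximum',
        'Period': 60,                          # evaluate one-minute datapoints
        'EvaluationPeriods': 1,
        'Threshold': 0,
        'ComparisonOperator': 'GreaterThanThreshold',
        'TreatMissingData': 'notBreaching',    # an idle queue should not alarm
        'AlarmActions': [alert_topic_arn],     # e.g. a topic that pages on-call
    }


def create_dlq_alarm(cloudwatch_client, dlq_name: str, alert_topic_arn: str) -> None:
    cloudwatch_client.put_metric_alarm(**dlq_alarm_params(dlq_name, alert_topic_arn))
```

Here `cloudwatch_client` would be `boto3.client('cloudwatch')`; keeping the parameters in a plain function makes the alarm definition easy to review and test.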

DLQ Redrive: Once you've fixed the root cause, use the SQS dead-letter queue redrive feature (available in the AWS console and via the StartMessageMoveTask API) to move messages from the DLQ back to the source queue for reprocessing. Avoid replaying messages by hand: the redrive API preserves original message attributes and handles batching correctly.
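As a rough sketch, a redrive can be started from code with the `start_message_move_task` call on a boto3 SQS client. The function name `redrive_dlq` and the default rate of 50 messages per second are assumptions for illustration.

```python
def redrive_dlq(sqs_client, dlq_arn: str, max_per_second: int = 50) -> str:
    """Start moving messages from a DLQ back to their source queue."""
    response = sqs_client.start_message_move_task(
        SourceArn=dlq_arn,
        # DestinationArn is omitted on purpose: SQS then returns each
        # message to the queue it originally came from.
        MaxNumberOfMessagesPerSecond=max_per_second,  # throttle the replay
    )
    return response['TaskHandle']
```

Throttling the replay keeps a large DLQ backlog from overwhelming the consumer that has just been fixed.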

dlq_and_fanout_setup.py (Python)
import boto3
import json

sqs_client = boto3.client('sqs', region_name='us-east-1')
sns_client = boto3.client('sns', region_name='us-east-1')


def create_queue_with_dlq(main_queue_name: str, dlq_name: str) -> tuple[str, str]:
    """
    Creates a main SQS queue and wires up a Dead-Letter Queue.
    If a message fails maxReceiveCount times, it moves to the DLQ
    instead of looping forever or vanishing silently.
    """

    # Step 1: Create the DLQ first — it's just a regular SQS queue
    dlq_response = sqs_client.create_queue(
        QueueName=dlq_name,
        Attributes={
            'MessageRetentionPeriod': '1209600'  # Keep failed messages for 14 days
        }
    )
    dlq_url = dlq_response['QueueUrl']
    print(f"[SETUP] DLQ created: {dlq_url}")

    # Step 2: Get the DLQ's ARN — needed to reference it as a redrive target
    dlq_attributes = sqs_client.get_queue_attributes(
        QueueUrl=dlq_url,
        AttributeNames=['QueueArn']
    )
    dlq_arn = dlq_attributes['Attributes']['QueueArn']

    # Step 3: Create the main queue with a RedrivePolicy pointing at the DLQ.
    # maxReceiveCount=3 means: after 3 failed delivery attempts, move to DLQ.
    # Set VisibilityTimeout >= 6x your expected Lambda/worker processing time
    # to avoid premature redelivery during normal processing.
    redrive_policy = {
        'deadLetterTargetArn': dlq_arn,
        'maxReceiveCount': '3'
    }

    main_queue_response = sqs_client.create_queue(
        QueueName=main_queue_name,
        Attributes={
            'VisibilityTimeout': '60',
            'ReceiveMessageWaitTimeSeconds': '20',  # Enable long polling at queue level
            'RedrivePolicy': json.dumps(redrive_policy)
        }
    )
    main_queue_url = main_queue_response['QueueUrl']
    print(f"[SETUP] Main queue created with DLQ attached: {main_queue_url}")

    return main_queue_url, dlq_url


def wire_sns_to_sqs_fanout(
    topic_arn: str,
    queue_url: str,
    queue_arn: str
):
    """
    Subscribes an SQS queue to an SNS topic — the fan-out pattern.
    Also grants SNS permission to write to the SQS queue.
    Forgetting the SQS access policy is the #1 cause of 'messages not arriving' bugs.
    """

    # Step 1: Grant SNS permission to send messages to this SQS queue.
    # Without this policy, SNS silently drops messages — no error thrown on publish.
    sqs_policy = {
        'Version': '2012-10-17',
        'Statement': [{
            'Sid': 'AllowSNSToSendToSQS',
            'Effect': 'Allow',
            'Principal': {'Service': 'sns.amazonaws.com'},
            'Action': 'sqs:SendMessage',
            'Resource': queue_arn,
            'Condition': {
                'ArnEquals': {'aws:SourceArn': topic_arn}
            }
        }]
    }

    sqs_client.set_queue_attributes(
        QueueUrl=queue_url,
        Attributes={'Policy': json.dumps(sqs_policy)}
    )
    print(f"[SETUP] SQS access policy updated to allow SNS.")

    # Step 2: Subscribe the queue to the topic.
    # RawMessageDelivery=true means the SQS message body IS your JSON payload.
    # Without it, the body is wrapped in an SNS envelope — harder to parse
    # and a common source of double-deserialization bugs.
    subscription = sns_client.subscribe(
        TopicArn=topic_arn,
        Protocol='sqs',
        Endpoint=queue_arn,
        Attributes={
            'RawMessageDelivery': 'true'  # Skip the SNS wrapper — always set this
        }
    )

    print(f"[SETUP] Fan-out wired. Subscription ARN: {subscription['SubscriptionArn']}")
    return subscription['SubscriptionArn']


# --- Run the setup ---
main_url, dlq_url = create_queue_with_dlq(
    main_queue_name='order-fulfilment-queue',
    dlq_name='order-fulfilment-dlq'
)

wire_sns_to_sqs_fanout(
    topic_arn='arn:aws:sns:us-east-1:123456789012:order-events',
    queue_url=main_url,
    queue_arn='arn:aws:sqs:us-east-1:123456789012:order-fulfilment-queue'
)

print("\n[READY] Production fan-out pattern active:")
print("  SNS Topic -> SQS Queue (durable) -> Lambda/Worker")
print("  Failed messages -> DLQ after 3 attempts")
print("  Fix root cause -> Redrive from DLQ back to main queue")
Output
[SETUP] DLQ created: https://sqs.us-east-1.amazonaws.com/123456789012/order-fulfilment-dlq
[SETUP] Main queue created with DLQ attached: https://sqs.us-east-1.amazonaws.com/123456789012/order-fulfilment-queue
[SETUP] SQS access policy updated to allow SNS.
[SETUP] Fan-out wired. Subscription ARN: arn:aws:sns:us-east-1:123456789012:order-events:b9c8d7e6-f5a4-3b2c-1d0e-9f8a7b6c5d4e

[READY] Production fan-out pattern active:
  SNS Topic -> SQS Queue (durable) -> Lambda/Worker
  Failed messages -> DLQ after 3 attempts
  Fix root cause -> Redrive from DLQ back to main queue
Interview Gold: Why SNS+SQS Instead of SNS Directly to Lambda?
Lambda can subscribe to SNS directly, but if Lambda throttles (hits concurrency limits), SNS retries only twice then drops the message — gone forever. With an SQS queue in between, messages wait safely in the queue until Lambda capacity frees up. The queue absorbs the spike. This is the canonical answer to 'how do you handle Lambda throttling in an event-driven system?'
Production Insight
A team had a critical event-driven workflow with SNS → Lambda. Their Lambda concurrency limit was 100. During a marketing campaign, traffic spiked to 500 concurrent events. SNS retried each throttled Lambda invocation twice, then dropped the 300 messages that were still throttled after the last retry. Three hundred high-value customer signups were lost. No alarm, no error log, no trace.
The marketing team reported 300 fewer signups than expected. Engineering spent 2 days tracing through CloudWatch logs to find the SNS delivery failures.
Root cause: SNS attempts delivery to a throttled Lambda only 3 times in total (the initial attempt plus 2 retries), then gives up. The team didn't know this limit existed and had no CloudWatch alarm on SNS's NumberOfNotificationsFailed metric.
Fix: Moved to SNS → SQS → Lambda with SQS event source mapping. The SQS queue now has 14 days of durability. During spikes, messages queue safely and Lambda processes them at its own pace. Added CloudWatch alarm on ApproximateAgeOfOldestMessage > 5 minutes as an early-warning signal. No more drops.
Rule: If you can't afford to lose a single message, never subscribe Lambda directly to SNS. The SQS buffer is not optional — it's mandatory.
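The consumer side of that fix can be sketched as a Lambda handler for an SQS event source mapping. This is a minimal sketch under stated assumptions: raw message delivery is enabled (so each record body is the publisher's JSON), the event source mapping has ReportBatchItemFailures turned on, and `process_signup` is a hypothetical stand-in for the real business logic.

```python
import json


def process_signup(payload: dict) -> None:
    """Hypothetical business logic; raises when the payload is unusable."""
    if "email" not in payload:
        raise ValueError("signup payload has no email")


def handler(event: dict, context: object = None) -> dict:
    """Entry point for a Lambda fed by an SQS event source mapping.

    Returns the IDs of the records that failed, so only those go back to
    the queue. Records not listed are treated as successfully processed.
    """
    failures = []
    for record in event.get("Records", []):
        try:
            process_signup(json.loads(record["body"]))
        except Exception as exc:  # one bad record must not fail the whole batch
            print(f"[FAIL] {record['messageId']}: {exc}")
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Failed records become visible again after the visibility timeout and eventually reach the DLQ once maxReceiveCount is exceeded, which is what turns a throttle or a bug into a delay rather than a loss.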
Key Takeaway
DLQ = where messages land after maxReceiveCount failed receives. Use maxReceiveCount=3 as the default.
Monitor DLQ depth with CloudWatch alarm. Message in DLQ = bug, not noise.
Use DLQ Redrive API to replay messages after fixing root cause — never delete blindly.
Fan-out pattern: SNS → SQS (durable) → Lambda. Always for critical messages.
SNS direct to Lambda = at-most-once delivery when throttling hits. Assume message loss.
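A hedged sketch of the redrive step: the helper below only assembles the arguments for the SQS `start_message_move_task` call, so it can be reviewed or tested without AWS access. The helper name and the example ARN are illustrative assumptions, not part of the setup above.

```python
from typing import Optional


def build_redrive_request(dlq_arn: str,
                          destination_arn: Optional[str] = None,
                          max_per_second: Optional[int] = None) -> dict:
    """Assemble keyword arguments for sqs.start_message_move_task.

    Leaving destination_arn unset asks SQS to send each message back to
    the queue it originally came from.
    """
    request = {"SourceArn": dlq_arn}
    if destination_arn is not None:
        request["DestinationArn"] = destination_arn
    if max_per_second is not None:
        if max_per_second < 1:
            raise ValueError("max_per_second must be positive")
        # Throttle the replay so a recovering consumer is not flooded.
        request["MaxNumberOfMessagesPerSecond"] = max_per_second
    return request


request = build_redrive_request(
    dlq_arn="arn:aws:sqs:us-east-1:123456789012:order-fulfilment-dlq",
    max_per_second=10,
)
print(request)

# With AWS credentials configured, the request would be submitted as:
#   import boto3
#   boto3.client("sqs").start_message_move_task(**request)
```

Capping the replay rate is a design choice worth keeping: the consumer that just recovered is usually the weakest link.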

SQS vs SNS vs EventBridge — A Three-Way Decision Table

When you need more than basic pub/sub or queuing, Amazon EventBridge enters the picture. It's not a replacement for SQS or SNS. It sits above them, offering a central event bus with advanced routing, a schema registry, and integration with third-party SaaS events. Here's a decision table to clarify when to pick each:

| Feature | SQS | SNS | EventBridge |
| --- | --- | --- | --- |
| Messaging model | Point-to-point queue | Pub/sub topic | Event bus (pub/sub + routing) |
| Durability | Yes, up to 14 days | No (unless subscriber is SQS) | Yes, 24-hour default, configurable up to 14 days for archive |
| Throughput | Unlimited (Standard) / 3,000 msg/s (FIFO) | 300 publishes/s (default, adjustable) | 10,000 events/s per bus (default, adjustable) |
| Filtering | Consumer-side | Server-side subscription filters (attribute-based) | Rich content-based filtering (JSONPath, prefix, suffix, anything, exists, numeric ranges) |
| Ordering | Best-effort (Standard) / Strict FIFO | No ordering (Standard) / Ordered (FIFO topics) | No ordering |
| Payload size | Up to 256 KB | Up to 256 KB | Up to 256 KB |
| Pricing | Pay per request and data transfer | Pay per publish and delivery attempts | Pay per event ingested and delivered (higher per-event cost than SNS) |
| Third-party integrations | None native | Email, SMS, mobile push, HTTP | SaaS apps (Zendesk, Datadog, PagerDuty, 200+ built-in sources) |
| Schema registry | No | No | Yes: schema discovery and code generation |
| Replay events | No | No | Yes: archive and replay events up to 14 days |
| FIFO support | Yes | Yes (SNS FIFO topics) | No |

When to pick EventBridge over SNS:
  • You need complex content-based filtering (e.g., "order.total > 100 and order.region != 'US'") directly on the message body, not just attributes.
  • You want to ingest events from third-party SaaS providers (GitHub, Shopify, Salesforce, etc.) without building custom ingest pipelines.
  • You need event replay for debugging or disaster recovery. Being able to re-run yesterday's events against a fixed consumer is invaluable.
  • You want automatic schema discovery to generate strongly typed code from your event shapes.

When to stick with SNS+SQS:
  • You need FIFO ordering or exactly-once processing. EventBridge does not support FIFO.
  • Your throughput is very high and you want the lowest per-message cost. SNS+SQS is cheaper at high volume.
  • You need 14-day message retention on the queue itself. EventBridge archive can hold events up to 14 days, but queued durability for unprocessed messages requires SQS.
  • You need the simplicity of a direct queue (SQS) without event bus routing complexity.

The rule of thumb: SNS+SQS covers 80% of event-driven use cases. EventBridge is worth the extra cost and complexity when you need its advanced routing, third-party integration, or replay capabilities.

EventBridge Pricing Surprise
EventBridge costs $1.00 per million events delivered. SNS costs $0.50 per million publishes plus $0.50 per million deliveries. If you publish to 5 SQS subscribers, SNS costs $0.50 (publish) + $2.50 (5 deliveries) = $3.00 per million source events. EventBridge costs $1.00 per million events ingested plus $1.00 per million delivered to 5 targets = $6.00. EventBridge is more expensive per event at low subscriber counts but includes schema registry and replay at no extra charge. Always model your expected volume and number of subscribers before choosing.
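The arithmetic in this callout is easy to re-run for your own subscriber count. The sketch below simply encodes the per-million rates quoted above as constants; they are this article's figures, not values fetched from AWS, and should be checked against the current pricing pages.

```python
# USD per million, as quoted in the callout above (assumed, not authoritative).
SNS_PUBLISH = 0.50
SNS_DELIVERY = 0.50
EVENTBRIDGE_INGEST = 1.00
EVENTBRIDGE_DELIVERY = 1.00


def sns_cost(million_events: float, subscribers: int) -> float:
    """One publish per event plus one delivery per subscriber."""
    return million_events * (SNS_PUBLISH + SNS_DELIVERY * subscribers)


def eventbridge_cost(million_events: float, targets: int) -> float:
    """One ingested event plus one delivery per target."""
    return million_events * (EVENTBRIDGE_INGEST + EVENTBRIDGE_DELIVERY * targets)


for n in (1, 3, 5):
    print(f"{n} subscribers: SNS ${sns_cost(1, n):.2f}, "
          f"EventBridge ${eventbridge_cost(1, n):.2f} per million events")
# The 5-subscriber row reproduces the $3.00 vs $6.00 comparison above.
```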
Production Insight
A team migrated from SNS+SQS to EventBridge for their order processing system. They needed content-based filtering to route orders by region (EU vs US) and by product category. With SNS, they had to create separate topics and duplicate subscriptions for each routing dimension. EventBridge let them define one rule with a single filter expression matching against the message body directly — no message attribute gymnastics.
However, their order volume was 10 million events per month, routed to 3 targets. At the rates quoted above, EventBridge cost them $10/month for ingestion plus $30/month for 30M deliveries = $40/month. SNS with 3 queues would have cost $5/month for publishes plus $15/month for 30M deliveries = $20/month, so EventBridge roughly doubled the messaging bill. The team accepted that premium because they also got event replay (invaluable during a schema migration) and the schema registry, which generated typed Python bindings from their event shapes automatically.
Rule: At these rates EventBridge costs more per event than SNS at every subscriber count. Choose it when body-level filtering, replay, and the schema registry replace work you would otherwise have to build and maintain yourself.
Key Takeaway
SNS+SQS is simpler and cheaper for basic fan-out with few subscribers. EventBridge adds content-based filtering on message body, third-party sources, and replay. Pick based on filtering complexity, subscriber count, and whether replay or schema registry justify the cost.

Pros and Cons of SQS and SNS

Every service has trade-offs. Here's a clear-eyed look at what SQS and SNS do well, and where they fall short.

SQS: Advantages
  • Durable 14-day message storage with automatic retries via visibility timeout
  • At-least-once delivery (Standard) or exactly-once processing (FIFO)
  • Nearly unlimited throughput with Standard queues; 3,000 msg/s with FIFO batching
  • Built-in dead-letter queue support with configurable maxReceiveCount
  • Low cost per API request, especially with long polling and batch operations
  • Supports batch operations (up to 10 messages per receive, up to 10 per send)
  • Decouples producers from consumers: the producer doesn't need to know consumer speed or availability

SQS: Disadvantages
  • No built-in fan-out (point-to-point only; you need SNS or Lambda triggers for fan-out)
  • Consumer must poll the queue, introducing latency and API cost if not optimized
  • Max message size 256 KB; larger payloads need the SQS Extended Client Library with S3
  • Ordering only guaranteed with FIFO, which has throughput limitations
  • No content-based server-side filtering; consumers must filter in application code
  • With FIFO queues, Lambda event source mappings scale with the number of active message groups rather than with queue depth, so they don't scale out the way Standard queues do

SNS: Advantages
  • Instant pub/sub fan-out: one message reaches all subscribers simultaneously
  • Multiple subscriber types (SQS, Lambda, HTTP/S, email, SMS, mobile push)
  • Server-side filter policies on MessageAttributes reduce unnecessary deliveries and compute cost
  • No polling overhead for Lambda and HTTP subscribers (push-based delivery)
  • Simple pricing: a per-publish charge plus a per-delivery charge, with no fixed monthly cost
  • Supports FIFO topics for ordered fan-out when needed

SNS: Disadvantages
  • Messages are NOT stored. If a subscriber is down, the message is lost unless the subscriber is SQS
  • Limited retries (3 attempts total for HTTP and Lambda subscribers on throttle)
  • No ordering guarantees on Standard topics
  • Max message size 256 KB
  • No topic-level DLQ. Dead-lettering is opt-in per subscription (a RedrivePolicy that points at an SQS queue); an SQS subscriber with its own DLQ remains the simplest way to get retry and dead-letter semantics
  • Filter policies match MessageAttributes by default; matching on the message body requires setting FilterPolicyScope=MessageBody on the subscription
  • HTTP subscribers are vulnerable to transient failures during the narrow retry window

The real insight: the two services cancel out each other's weaknesses. SNS's biggest gap (no durability) disappears when every subscriber is an SQS queue, and SQS's biggest gap (no fan-out) disappears when the queue sits behind an SNS topic. Never use SNS alone for critical events. Always buffer with SQS.

SNS + HTTP: Watch Out for Delivery Failures
SNS retries HTTP endpoints only 3 times with exponential backoff. If your endpoint is slow or returns a non-2xx status for more than a few seconds, you lose the message. Instead, subscribe an SQS queue to SNS and have the service behind the endpoint poll that queue. This gives you visibility timeout, retry, and DLQ protection that HTTP endpoints behind SNS cannot provide.
Production Insight
A team used SNS with an HTTP subscriber for a real-time notification service. Normally the HTTP endpoint processed the message and returned 200 immediately. When the endpoint was slow (e.g., database contention), the request could time out after 2 seconds. SNS would retry twice more with exponential backoff, then discard the message. The team lost 5% of notifications to transient database delays, with no alarm, no log, and no DLQ.
Fix: They placed an SQS queue between SNS and the HTTP endpoint. The endpoint polled the queue using long polling, processed each message with a 30-second VisibilityTimeout, and deleted on success. Even if the endpoint was slow, the message stayed in the queue for retry. A DLQ caught anything that failed 3 or more times, triggering a CloudWatch alarm.
Rule: HTTP endpoints behind SNS are fragile. The 3-retry limit is too narrow for anything but the most reliable endpoints. Use an SQS subscriber to make them robust.
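The polling consumer described in the fix boils down to one rule: delete only after the handler succeeds. Below is a minimal sketch with the SQS calls injected as plain callables so the logic can be exercised without AWS; `drain_once` and its parameter names are illustrative, not an AWS API.

```python
from typing import Callable, Iterable, Tuple


def drain_once(receive: Callable[[], Iterable[dict]],
               handle: Callable[[str], None],
               delete: Callable[[str], None]) -> Tuple[int, int]:
    """Process one batch of messages; return (processed, failed)."""
    processed = failed = 0
    for message in receive():
        try:
            handle(message["Body"])
        except Exception as exc:
            # No delete: the message reappears after the visibility timeout
            # and moves to the DLQ once maxReceiveCount is exceeded.
            print(f"[RETRY LATER] {message['MessageId']}: {exc}")
            failed += 1
            continue
        delete(message["ReceiptHandle"])
        processed += 1
    return processed, failed


# Possible boto3 wiring (assumes a client `sqs` and a `queue_url`):
#   receive = lambda: sqs.receive_message(
#       QueueUrl=queue_url, WaitTimeSeconds=20, MaxNumberOfMessages=10,
#       VisibilityTimeout=30).get("Messages", [])
#   delete = lambda rh: sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=rh)
```

Injecting the three callables keeps the retry semantics testable in isolation; the same function works against LocalStack or a fake in unit tests.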
Key Takeaway
SQS excels at durability, retries, and back-pressure. SNS excels at fan-out and decoupling publishers. Use SNS+SQS together to get both. Never rely on SNS alone for critical message delivery where loss is unacceptable.

Pricing Comparison — Standard vs FIFO, Per-Million Request Costs

Understanding cost at scale is critical. Here's the pricing breakdown for SQS, SNS, and the common patterns.

SQS Pricing (as of 2026)

| Queue Type | Request Pricing | Data Transfer Pricing | Free Tier |
| --- | --- | --- | --- |
| Standard | $0.40 per million requests | $0.09 per GB after first 1 GB/month | 1 million requests free per month |
| FIFO | $0.50 per million requests | $0.09 per GB after first 1 GB/month | 1 million requests free per month |

Notes on SQS requests:
  • A "request" is any API call: SendMessage, ReceiveMessage, DeleteMessage, ChangeMessageVisibility, etc.
  • Long polling (WaitTimeSeconds=20) counts as one request per poll call, even if the response is empty. That is far cheaper than short polling, which generates continuous empty responses.
  • Batch operations (SendMessageBatch, DeleteMessageBatch) count as one request per batch of up to 10 messages. Always batch when sending or deleting multiple messages.
  • ChangeMessageVisibility (heartbeat) also counts as one request. Factor this in for long-running consumers.
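To put a number on the long-polling note, here is a rough sketch for a single consumer sitting on an empty queue for a 30-day month. The one-call-per-second short-polling loop is an assumption chosen for illustration; a tighter loop would make the gap larger.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600   # 2,592,000 seconds in a 30-day month
PRICE_PER_MILLION = 0.40             # Standard queue rate from the table above


def idle_receive_calls(seconds_per_call: int) -> int:
    """ReceiveMessage calls made per month while the queue stays empty."""
    return SECONDS_PER_MONTH // seconds_per_call


short_polling = idle_receive_calls(1)    # assumed: one short poll per second
long_polling = idle_receive_calls(20)    # WaitTimeSeconds=20

for label, calls in (("short", short_polling), ("long", long_polling)):
    cost = calls / 1_000_000 * PRICE_PER_MILLION
    print(f"{label} polling: {calls:,} calls, ~${cost:.2f}/month")

reduction = 1 - long_polling / short_polling
print(f"reduction in idle calls: {reduction:.0%}")
```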

SNS Pricing (as of 2026)

| Topic Type | Publish Pricing | Delivery Pricing | Free Tier |
| --- | --- | --- | --- |
| Standard | $0.50 per million publishes | $0.50 per million deliveries to SQS/Lambda; SMS and email have separate rates | 1 million publishes free per month |
| FIFO | $0.50 per million publishes | $0.50 per million deliveries | Not included in SNS free tier for FIFO |

Notes on SNS delivery:
  • Each subscriber receives a copy of the message. If you have 5 SQS subscribers and publish 1 million messages, you pay for 1 million publishes ($0.50) + 5 million deliveries ($2.50) = $3.00 total.
  • SMS delivery is billed separately by destination country, typically $0.00645 per message in the US. It is not covered by the standard delivery rate.
  • Email and email-JSON subscribers are free for the first 1,000 emails per month, then $2.00 per 100,000 emails.

Comparison Scenario: Fan-out to 3 SQS queues, 10 million messages/month

  • SNS+SQS fan-out: 10M SNS publishes = $5.00. 30M SNS deliveries to 3 queues = $15.00. SQS send cost (SNS writes to SQS): 30M send requests = $12.00. Consumer polling with long polling and batch size 10: 30M messages / 10 per receive = ~3M receive calls = $1.20. Consumer deletes (unbatched): 30M delete requests = $12.00. Total: ~$45.20/month.
  • Direct SNS → Lambda subscriptions: 10M publishes = $5.00. Lambda invocations and duration are billed separately. Risk: message loss on throttle. Not recommended for critical data regardless of cost.
  • EventBridge bus: 10M events ingested = $10.00. 30M deliveries to 3 rules = $30.00. Total: $40.00/month — comparable to SNS+SQS at this scale, with replay and content-based filtering included.

Cost-saving tips:
  1. Always use long polling (WaitTimeSeconds=20) on SQS consumers to eliminate empty receive requests.
  2. Batch send messages (up to 10 per SendMessageBatch call) to cut send costs by up to 90%.
  3. Batch delete messages (up to 10 per DeleteMessageBatch call) after processing.
  4. Use SNS Standard for high-volume fan-out; FIFO only when ordering is truly required.
  5. Monitor SQS API usage with CloudWatch metrics. Set AWS Budgets alarms to catch unexpected growth early.
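The "up to 90%" figure for batching follows directly from the 10-message batch limit. A small sketch of that arithmetic, using the Standard queue rate quoted earlier as an assumed constant:

```python
import math

PRICE_PER_MILLION = 0.40  # Standard queue rate quoted in this article


def requests_needed(messages: int, batch_size: int = 1) -> int:
    """API requests required to send (or delete) `messages` messages."""
    if not 1 <= batch_size <= 10:
        raise ValueError("SQS batch calls carry between 1 and 10 messages")
    return math.ceil(messages / batch_size)


def request_cost(requests: int) -> float:
    """Cost in USD of that many requests at the assumed rate."""
    return requests / 1_000_000 * PRICE_PER_MILLION


messages = 30_000_000
unbatched = requests_needed(messages)           # 30,000,000 requests
batched = requests_needed(messages, 10)         # 3,000,000 requests
print(f"unbatched: ${request_cost(unbatched):.2f}")
print(f"batched:   ${request_cost(batched):.2f}")
print(f"saving:    {1 - batched / unbatched:.0%}")
```

The saving only reaches 90% when every batch is full; a consumer that deletes messages as they finish in ones and twos will land somewhere in between.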

Pricing is subject to change. Always verify on the official AWS pricing pages for SQS and SNS before making architectural decisions based on cost.

FIFO Pricing Premium
SQS FIFO queues cost about 25% more per request than Standard ($0.50 vs $0.40 per million). Only use FIFO when message order and exactly-once processing are truly required by your business logic. For most use cases, Standard queues plus idempotent consumers (deduplicating by a business key like order ID) are cheaper, simpler, and scale without throughput limits.
Production Insight
A startup processed 50 million events per month with SNS+SQS fan-out to 2 queues. They were using FIFO for both queues out of habit — they had seen FIFO mentioned in a tutorial and assumed 'more guarantees = better'. Monthly cost: ~$170.
They reviewed their consumer logic and realized it was already idempotent — every event handler checked whether the event ID had already been processed using a DynamoDB conditional write. Ordering didn't matter because events were independent. Switching to Standard queues cut monthly costs to ~$90 — a 47% saving — with no change in correctness or reliability.
Rule: Always validate your ordering requirement before choosing FIFO. If your consumers are idempotent (they should be), Standard queues are almost always the right choice.
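What "idempotent consumer" means in practice can be sketched in a few lines. The in-memory `ProcessedStore` below is a hypothetical stand-in for the DynamoDB conditional write mentioned above; the class and function names are illustrative only.

```python
class ProcessedStore:
    """In-memory stand-in for a conditional write keyed on event ID."""

    def __init__(self) -> None:
        self._seen = set()

    def claim(self, event_id: str) -> bool:
        """Return True only for the first caller to claim this event ID."""
        if event_id in self._seen:
            return False
        self._seen.add(event_id)
        return True


def handle_event(event: dict, store: ProcessedStore, ledger: list) -> bool:
    """Apply the event once; duplicates from a Standard queue are skipped."""
    if not store.claim(event["event_id"]):
        return False                      # duplicate delivery, nothing to do
    ledger.append(event["event_id"])      # stand-in for the real side effect
    return True


store, ledger = ProcessedStore(), []
deliveries = [{"event_id": "evt-1"}, {"event_id": "evt-2"}, {"event_id": "evt-1"}]
results = [handle_event(e, store, ledger) for e in deliveries]
print(results)   # [True, True, False]
print(ledger)    # ['evt-1', 'evt-2']
```

One caveat for a real implementation: claiming before processing means a crash between the claim and the side effect would skip the event on retry, so the claim should be released on failure or written in the same transaction as the side effect.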
Key Takeaway
SQS Standard and SNS Standard are cheapest. FIFO adds a throughput cap and cost premium — use only when order truly matters. Batch all send and delete operations. Long polling is the single biggest cost-saver for SQS consumers.

When to Consider Amazon MQ Instead of SQS or SNS

Amazon MQ is a fully managed message broker service for Apache ActiveMQ and RabbitMQ. It supports industry-standard protocols (MQTT, AMQP, STOMP, OpenWire) and the JMS API. When should you reach for it instead of SQS/SNS?

Amazon MQ is the right choice when:
  • You're migrating an existing on-premises application that already uses JMS, AMQP, or MQTT. Rewriting everything to use SQS/SNS would be too risky or time-consuming, and Amazon MQ lets you lift and shift with minimal code changes.
  • You need advanced message routing beyond what SNS offers, like topic wildcards (e.g., orders.#), virtual topics, header-based exchanges, or complex JMS selectors.
  • You need transactional messaging across multiple queues, for example sending to queue A and queue B atomically within a single XA transaction.
  • Your application requires specific broker-level features like scheduled messages (send now, deliver at a future time), message groups with flexible ordering, or custom dead-letter handling at the broker level.
  • You need push-based delivery with low latency. Amazon MQ's push model avoids the polling delay inherent to SQS.

Amazon MQ is NOT the right choice when:
  • You want fully serverless, zero infrastructure management. Amazon MQ requires you to provision and manage broker instances (though patching and failover are automated). SQS and SNS are fully serverless.
  • Your throughput needs are extremely high or bursty. SQS Standard scales to unlimited throughput automatically. Amazon MQ is limited by the broker instance type: larger instances cost more and still have connection limits.
  • You're building a new cloud-native application from scratch. Without a legacy protocol requirement, SQS+SNS is simpler, cheaper to operate, and scales without instance sizing decisions.
  • You need exactly-once processing with a simple API. SQS FIFO provides this natively; Amazon MQ requires idempotent consumers and broker-level deduplication configuration, which is more complex.

Cost comparison: Amazon MQ instances start at approximately $0.027/hour (~$20/month) for a single-instance mq.t3.micro broker, and scale to hundreds of dollars per month for active/standby HA configurations. In contrast, SQS and SNS have no fixed monthly cost — you pay only for requests. For bursty or low-volume workloads, SQS is far cheaper. For very high steady-state volume (hundreds of millions of messages per month), Amazon MQ may become cheaper because you pay per instance-hour rather than per request.
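The break-even point in that comparison can be estimated with two small functions. This is a rough sketch: the 730-hour month is a common billing approximation, the $0.40 per million is the SQS Standard rate quoted earlier, and the $400 broker bill is an assumed figure for an active/standby pair, not a quoted AWS price.

```python
HOURS_PER_MONTH = 730  # 8,760 hours per year / 12


def broker_monthly_cost(hourly_usd: float, brokers: int = 1) -> float:
    """Fixed monthly cost of running `brokers` instances around the clock."""
    return hourly_usd * HOURS_PER_MONTH * brokers


def break_even_million_requests(broker_monthly_usd: float,
                                usd_per_million_requests: float = 0.40) -> float:
    """Millions of SQS requests per month that cost as much as the broker."""
    return broker_monthly_usd / usd_per_million_requests


micro = broker_monthly_cost(0.027)
print(f"mq.t3.micro: ~${micro:.2f}/month, "
      f"break-even ~{break_even_million_requests(micro):.0f}M requests")

ha_pair = 400.0  # assumed monthly bill for an active/standby configuration
print(f"HA pair:     ${ha_pair:.2f}/month, "
      f"break-even ~{break_even_million_requests(ha_pair):.0f}M requests")
```

The result counts requests, not messages. A message that is sent, received, and deleted without batching costs three requests, so the break-even message volume is lower than the request figure suggests.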

Decision rule: Use SQS/SNS for cloud-native applications where serverless scalability and operational simplicity are priorities. Use Amazon MQ when migrating legacy JMS/AMQP applications or when you need advanced broker-level routing features that SNS cannot provide. If you're starting from scratch without a protocol requirement, SQS+SNS is almost always the better choice.

Amazon MQ is Not Serverless
You provision and pay for an Amazon MQ broker instance 24/7, even if you send zero messages. SQS and SNS charge only for what you use. For bursty workloads or development environments, SQS will be far cheaper. For steady-state high-throughput workloads with legacy protocol requirements, Amazon MQ's fixed instance cost may be worthwhile.
Production Insight
A financial services company migrated from an on-premises RabbitMQ cluster to Amazon MQ to reduce maintenance overhead. They had hundreds of queues, complex routing rules (message selectors, header exchanges, wildcard topic subscriptions), and hundreds of thousands of lines of Java code using JMS APIs. SNS's simple attribute filtering couldn't match the flexibility they needed, and rewriting the application to SQS would have required touching every service.
Amazon MQ gave them the same broker semantics in the cloud, and they kept their existing Java/JMS client code with minimal changes — mostly just updating the broker endpoint URL. The migration took 3 weeks instead of the 6+ months estimated for a full SQS/SNS rewrite.
They pay approximately $400/month for an active/standby mq.m5.large configuration. With SQS/SNS at their 10 million messages/month volume, they'd pay ~$44/month. The $356 monthly premium is their cost for protocol compatibility, near-zero code changes, and a low-risk migration, which is a worthwhile trade-off in a risk-averse regulated environment.
Key Takeaway
Amazon MQ = migration-friendly, protocol-compatible (JMS/AMQP/MQTT), but not serverless. SQS/SNS = cloud-native, cheaper for bursty or low-volume workloads, zero infrastructure management. Choose based on your application's existing protocol, legacy constraints, and long-term architectural direction.

Infrastructure as Code — Terraform SNS+SQS Fan-Out Setup

The manual AWS console or Boto3 scripts work for demos, but production infrastructure should be version-controlled and repeatable. Here's a complete Terraform configuration for the SNS+SQS fan-out pattern with a dead-letter queue and CloudWatch monitoring.

main.tf (HCL)
provider "aws" {
  region = "us-east-1"
}

# --- SNS Topic ---
resource "aws_sns_topic" "order_events" {
  name = "order-lifecycle-events"
}

# --- Dead-Letter Queue (DLQ) — create first, referenced by main queue ---
resource "aws_sqs_queue" "order_dlq" {
  name                      = "order-processing-dlq"
  message_retention_seconds = 1209600  # 14 days — maximum retention
}

# --- Main SQS Queue (consumer will poll this) ---
resource "aws_sqs_queue" "order_processing_queue" {
  name                       = "order-processing-queue"
  visibility_timeout_seconds = 60
  delay_seconds              = 0

  # Long polling at the queue level — consumers that don't set WaitTimeSeconds
  # explicitly will still benefit from this default.
  receive_wait_time_seconds = 20

  # Attach Dead-Letter Queue.
  # maxReceiveCount=3: after 3 failed processing attempts, move to DLQ.
  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.order_dlq.arn
    maxReceiveCount     = 3
  })
}

# --- SQS Queue Policy: allow SNS to send messages ---
# Without this policy, SNS silently drops messages — no error is raised on publish.
resource "aws_sqs_queue_policy" "allow_sns" {
  queue_url = aws_sqs_queue.order_processing_queue.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "AllowSNSToSendToSQS"
        Effect = "Allow"
        Principal = {
          Service = "sns.amazonaws.com"
        }
        Action   = "sqs:SendMessage"
        Resource = aws_sqs_queue.order_processing_queue.arn
        Condition = {
          ArnEquals = {
            "aws:SourceArn" = aws_sns_topic.order_events.arn
          }
        }
      }
    ]
  })
}

# --- Subscribe the SQS queue to the SNS topic ---
# depends_on ensures the queue policy is applied before the subscription is created.
# Without this, SNS may reject the subscription because it can't yet write to the queue.
resource "aws_sns_topic_subscription" "fanout" {
  topic_arn            = aws_sns_topic.order_events.arn
  protocol             = "sqs"
  endpoint             = aws_sqs_queue.order_processing_queue.arn
  raw_message_delivery = true  # Deliver raw JSON — no SNS envelope wrapping

  # Uncomment and populate to add server-side attribute filtering:
  # filter_policy = jsonencode({
  #   event_type = ["order_placed", "order_updated"]
  # })

  depends_on = [
    aws_sqs_queue_policy.allow_sns
  ]
}

# --- CloudWatch Alarm: DLQ depth — fires when any message lands in DLQ ---
resource "aws_cloudwatch_metric_alarm" "dlq_not_empty" {
  alarm_name          = "order-dlq-not-empty"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "ApproximateNumberOfMessagesVisible"
  namespace           = "AWS/SQS"
  period              = 300
  statistic           = "Maximum"
  threshold           = 0
  alarm_description   = "A message has landed in the DLQ. Consumer failed maxReceiveCount times. Investigate immediately."
  treat_missing_data  = "notBreaching"
  dimensions = {
    QueueName = aws_sqs_queue.order_dlq.name
  }
  alarm_actions = []  # Add SNS notification topic ARN here for on-call paging
}

# --- CloudWatch Alarm: queue age — fires when messages are not being consumed ---
resource "aws_cloudwatch_metric_alarm" "queue_age_high" {
  alarm_name          = "order-queue-oldest-message-age"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "ApproximateAgeOfOldestMessage"
  namespace           = "AWS/SQS"
  period              = 300
  statistic           = "Maximum"
  threshold           = 300  # Alert if oldest message is > 5 minutes old
  alarm_description   = "Messages are aging in the queue. Consumers may be down or overwhelmed."
  treat_missing_data  = "notBreaching"
  dimensions = {
    QueueName = aws_sqs_queue.order_processing_queue.name
  }
  alarm_actions = []  # Add SNS notification topic ARN here for on-call paging
}

# --- (Optional) FIFO version — uncomment when strict ordering is required ---
# resource "aws_sqs_queue" "order_fifo" {
#   name                        = "order-processing-queue.fifo"
#   fifo_queue                  = true
#   content_based_deduplication = true
#   visibility_timeout_seconds  = 60
#   receive_wait_time_seconds   = 20
#   redrive_policy = jsonencode({
#     deadLetterTargetArn = aws_sqs_queue.order_dlq.arn
#     maxReceiveCount     = 3
#   })
# }

output "sns_topic_arn" {
  value       = aws_sns_topic.order_events.arn
  description = "Publish order events to this ARN from your application"
}

output "sqs_queue_url" {
  value       = aws_sqs_queue.order_processing_queue.id
  description = "Consumer polls this queue URL for order processing"
}

output "dlq_queue_url" {
  value       = aws_sqs_queue.order_dlq.id
  description = "Inspect this queue when the DLQ alarm fires"
}
Output
Apply complete! Resources: 7 added.
Outputs:
sns_topic_arn = "arn:aws:sns:us-east-1:123456789012:order-lifecycle-events"
sqs_queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/order-processing-queue"
dlq_queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/order-processing-dlq"
Raw Message Delivery in Terraform
The aws_sns_topic_subscription resource supports raw_message_delivery = true as a first-class attribute. Set it on every SQS subscription so the message body received by your consumer is your original JSON payload — not wrapped in an SNS envelope with metadata fields. Failing to set this is a common source of JSON parsing bugs where consumers unexpectedly receive a nested Message field instead of the expected payload structure.
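The nested-Message bug described above is easiest to see from the consumer's side. The helper below is a defensive sketch, with a hypothetical name, that accepts either shape of SQS body; it assumes publishers always send JSON objects.

```python
import json


def extract_payload(sqs_body: str) -> dict:
    """Return the publisher's JSON payload from an SQS message body.

    With raw delivery the body is the payload itself. Without it, SNS wraps
    the payload in an envelope whose "Message" field is a JSON string.
    """
    body = json.loads(sqs_body)
    is_envelope = (
        isinstance(body, dict)
        and body.get("Type") == "Notification"
        and "Message" in body
    )
    if is_envelope:
        return json.loads(body["Message"])   # second decode: the wrapped payload
    return body


order = {"order_id": "ORD-1", "total": 42}
raw_body = json.dumps(order)
wrapped_body = json.dumps({
    "Type": "Notification",
    "TopicArn": "arn:aws:sns:us-east-1:123456789012:order-lifecycle-events",
    "Message": json.dumps(order),
})
print(extract_payload(raw_body) == extract_payload(wrapped_body))  # True
```

A helper like this is useful during a migration when some subscriptions still deliver envelopes; once every subscription sets raw delivery, the envelope branch can be removed.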
Production Insight
Using Terraform for this pattern ensures that every environment (dev, staging, prod) has identical queue configuration. One team had a production incident where a developer manually created a queue in the console and forgot to attach a DLQ. The next day, a poison message (malformed JSON from a schema change) caused infinite retries. The queue backlogged for 6 hours before anyone noticed — there was no alarm.
After the incident, the team added a Terraform module that enforces DLQ attachment, long polling, and CloudWatch alarms as non-negotiable defaults. Any queue without a DLQ fails the Terraform plan. The module also outputs the DLQ URL and automatically creates the alarm, so engineers don't have to remember to do it manually.
Also note: the depends_on for the queue policy is critical. Without it, Terraform may create the SNS subscription before the SQS policy is applied. SNS will attempt to confirm the subscription, fail with AccessDenied, and mark it as PendingConfirmation — resulting in a broken fan-out that's silent until you check the subscription status.
Key Takeaway
Use Terraform (or CloudFormation/CDK) for all production queue infrastructure. Enforce DLQ, long polling, and CloudWatch alarms as code. The depends_on between queue policy and SNS subscription is not optional. Never manually create queues in the console — it's too easy to miss critical settings.

Practice Exercises — Build Your Own SQS/SNS Systems

Theory only sticks when you implement it. Here are five exercises that mirror real-world scenarios. Each exercise builds on the previous one. Set up a free AWS account (or use LocalStack for local testing at zero cost).

Exercise 1: Order Processing Queue
  • Create an SQS Standard queue named order-queue.
  • Write a Python/Boto3 script that sends 100 order messages with random order IDs and amounts.
  • Write a consumer that polls the queue (with long polling), prints each order, and deletes it.
  • Expected outcome: All messages processed exactly once (or with minor duplicates, which a Standard queue allows; plan for it).
  • Hint: Set WaitTimeSeconds=20 and MaxNumberOfMessages=10. Use a try/except block: only call delete_message on success, never on failure. Let failed messages reappear for retry.

Exercise 2: Fan-Out Notification
  • Create an SNS topic named user-notifications.
  • Subscribe two SQS queues to it: email-queue and sms-queue.
  • Set RawMessageDelivery=true on both subscriptions.
  • Publish a single message to the topic. Verify that both queues receive a copy.
  • Expected outcome: Each queue has exactly one copy of the message.
  • Hint: Don't forget the SQS queue policy to allow SNS to send messages. Missing this policy causes silent drops. You can verify subscription status with aws sns list-subscriptions-by-topic.

Exercise 3: Dead-Letter Queue Monitoring
  • Modify Exercise 1's queue to attach a DLQ with maxReceiveCount=3.
  • Send a message with intentionally invalid JSON (e.g., {invalid).
  • Write a consumer that fails on JSON parsing. Do not catch the exception; let it propagate without calling delete_message.
  • Poll the queue 3 times. On each poll the message is received, processing fails, and the receive count increments. After the 3rd failure, check the DLQ: the message should have moved automatically.
  • Set up a CloudWatch alarm on the DLQ's ApproximateNumberOfMessagesVisible > 0 that sends an email via a separate SNS topic.
  • Expected outcome: The poison message moves to DLQ, alarm triggers, and you get an email.
  • Hint: VisibilityTimeout must expire between each poll attempt to increment the receive count. Set it to 5 seconds for testing, then reset to 60 seconds for production.

Exercise 4: FIFO Queue with Deduplication
  • Create a FIFO queue named payment-events.fifo with content-based deduplication enabled.
  • Send 5 messages with identical bodies (e.g., {"transaction_id": "TXN-001", "amount": 100}) within a 5-minute window.
  • Consume the queue and count how many messages you receive.
  • Expected outcome: Only 1 message delivered. The other 4 are deduplicated by SQS using SHA-256 of the body. The deduplication window is 5 minutes.
  • Hint: FIFO queues require a .fifo suffix in the queue name. Each send_message call must include MessageGroupId. Deduplication uses SHA-256 of the MessageBody when content-based deduplication is enabled.

Exercise 5: Throttling Simulation with SNS → SQS → Lambda
  • Create an SNS topic, subscribe an SQS queue, and configure a Lambda function with an SQS event source mapping (batch size 1).
  • Set Lambda reserved concurrency to 1 to force serial processing.
  • Add a 3-second time.sleep in the Lambda to simulate work.
  • Publish 20 messages to the SNS topic in rapid succession.
  • Monitor ApproximateNumberOfMessages on the SQS queue during processing.
  • Expected outcome: Queue depth rises to ~19, then drains one message at a time as Lambda processes them serially. No messages are lost. Compare this to what would happen with a direct SNS → Lambda subscription under throttle.
  • Hint: Use CloudWatch Logs Insights to verify Lambda invocation count matches message count after the queue drains. The key lesson: with SQS buffering, throttle = delay. With direct SNS → Lambda, throttle = loss.
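Before running Exercise 5 against AWS, the expected result can be previewed with a toy model. This is a deliberately simplified sketch, not a reproduction of AWS internals: time advances in ticks, every pending message is offered once per tick, and `max_attempts=3` mirrors the retry limit this article describes rather than a measured constant.

```python
from collections import deque
from typing import Tuple


def simulate_direct(messages: int, concurrency: int,
                    work_ticks: int, max_attempts: int) -> Tuple[int, int]:
    """Push delivery with a bounded number of attempts: (delivered, dropped)."""
    busy_until = [0] * concurrency          # tick at which each slot frees up
    pending = list(range(messages))
    delivered = 0
    for tick in range(max_attempts):
        still_pending = []
        for msg in pending:
            slot = next((i for i, t in enumerate(busy_until) if t <= tick), None)
            if slot is None:
                still_pending.append(msg)   # throttled on this attempt
            else:
                busy_until[slot] = tick + work_ticks
                delivered += 1
        pending = still_pending
    return delivered, len(pending)


def simulate_buffered(messages: int, concurrency: int,
                      work_ticks: int) -> Tuple[int, int]:
    """Queue in front of the workers: (processed, ticks until drained)."""
    if concurrency < 1 or work_ticks < 1:
        raise ValueError("concurrency and work_ticks must be at least 1")
    queue = deque(range(messages))
    busy_until = [0] * concurrency
    processed = tick = 0
    while queue:
        for i in range(concurrency):
            if queue and busy_until[i] <= tick:
                queue.popleft()             # worker pulls when it has capacity
                busy_until[i] = tick + work_ticks
                processed += 1
        tick += 1
    return processed, max(busy_until)


print(simulate_direct(20, concurrency=1, work_ticks=3, max_attempts=3))  # (1, 19)
print(simulate_buffered(20, concurrency=1, work_ticks=3))                # (20, 60)
```

The buffered run takes 60 ticks but finishes all 20 messages; the direct run finishes in 3 ticks with 19 messages gone. That is the "throttle = delay" versus "throttle = loss" contrast in miniature.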

Run Locally with LocalStack
You don't need an AWS account for early exercises. LocalStack (https://localstack.cloud) emulates SQS and SNS locally. Set endpoint_url='http://localhost:4566' in your boto3 client and use dummy credentials (aws_access_key_id='test', aws_secret_access_key='test'). Costs nothing, runs offline, and starts in seconds with docker run -p 4566:4566 localstack/localstack.
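The client settings above can be kept in one place. This is a minimal sketch, assuming boto3 is installed and LocalStack is listening on its default edge port; the helper names are hypothetical, and the region is an arbitrary choice because boto3 needs one.

```python
def localstack_client_kwargs(region: str = "us-east-1") -> dict:
    """Keyword arguments that point a boto3 client at a local LocalStack."""
    return {
        "endpoint_url": "http://localhost:4566",
        "region_name": region,
        # Dummy credentials: LocalStack does not validate them by default.
        "aws_access_key_id": "test",
        "aws_secret_access_key": "test",
    }


def make_local_client(service: str):
    """Create a boto3 client for 'sqs' or 'sns' against LocalStack."""
    import boto3  # imported lazily so the helper above works without boto3
    return boto3.client(service, **localstack_client_kwargs())
```

Typical use would be `sqs = make_local_client("sqs")` followed by the same calls you would make against AWS.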
Production Insight
A junior engineer completed Exercise 5 and watched the queue backlog grow while Lambda processed one message at a time. They then deleted the SQS queue from the architecture and subscribed Lambda directly to SNS, ran the same 20-message test, and watched CloudWatch show 19 Lambda throttle failures and 1 successful invocation. That controlled experiment — taking 20 minutes — permanently changed how the team architected all future event-driven systems. Nothing teaches like a controlled, observable failure.
Key Takeaway
Practice with real failures. Simulate throttling, poison messages, and fan-out. The muscle memory for debugging SQS/SNS comes from breaking things on purpose in a safe environment.

Operational Runbook: What to Do When Messages Stop Flowing

You're on call. The pager wakes you at 3 AM. The SQS queue depth is climbing and the ApproximateAgeOfOldestMessage alarm is firing. No messages are being processed. Here's the step-by-step runbook.

Phase 1: Detect (under 5 minutes)
- Open CloudWatch. Check ApproximateAgeOfOldestMessage on the main queue. If rising and > 5 minutes, consumers are stuck or down.
- Check DLQ ApproximateNumberOfMessagesVisible. If > 0, messages are actively failing processing: there's a poison message or a bug in the consumer.
- Check the SNS delivery status logs in CloudWatch Logs (log groups named sns/<region>/<account-id>/<topic-name>, with a /Failure suffix for failed deliveries) for Throttling, EndpointDisabled, or AccessDenied entries within the last 30 minutes.

Phase 2: Diagnose (under 10 minutes)
- Consumer crashed? Run: aws sqs get-queue-attributes --queue-url <main-queue-url> --attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible. If ApproximateNumberOfMessagesNotVisible is high and static (not decreasing), consumers are receiving messages but crashing before deleting them.
- Lambda throttled? Check the Lambda Throttles metric in CloudWatch. If elevated, increase ReservedConcurrency or check for concurrency exhaustion at the account level.
- Permission denied? Check the SQS queue policy: does the SNS principal have sqs:SendMessage? Check the Lambda execution role: does it have sqs:ReceiveMessage, sqs:DeleteMessage, and sqs:GetQueueAttributes?
- VisibilityTimeout too short? If messages reappear faster than your consumer processes them, processing time exceeds VisibilityTimeout. Increase the timeout or implement a heartbeat using ChangeMessageVisibility to extend it dynamically.
- Poison message? Receive one message from the DLQ: aws sqs receive-message --queue-url <dlq-url> --max-number-of-messages 1. Examine the body. Look for unexpected null fields, schema changes, or malformed JSON from an upstream service.

Phase 3: Mitigate (under 15 minutes)
- Dead consumers: Restart EC2 instances, redeploy Lambda, or trigger an ECS service restart. Verify the consumer is polling by watching NumberOfMessagesReceived and NumberOfMessagesDeleted in CloudWatch.
- Backlogged queue: Temporarily scale up consumer concurrency: increase Lambda MaximumConcurrency on the event source mapping, or add EC2 worker instances. Monitor ApproximateAgeOfOldestMessage to confirm it starts decreasing.
- Poison messages in DLQ: Do NOT delete them. Move them to a separate investigation queue or leave them in the DLQ. Fix the consumer to handle that message shape. Then use the SQS DLQ Redrive feature to replay them back to the main queue.
- SNS delivery failure (AccessDenied): Fix the SQS queue policy to grant sqs:SendMessage to the SNS principal. Then republish missed events from application logs, a replay mechanism, or the event source system.

Phase 4: Prevent (post-incident)
- Add Terraform modules that enforce DLQ attachment, long polling defaults, and CloudWatch alarms as non-negotiable for every new queue.
- Set an alarm on ApproximateAgeOfOldestMessage > 5 minutes on every production queue.
- Set an alarm on ApproximateNumberOfMessagesVisible > 0 on every DLQ.
- Set an alarm on Lambda Throttles > 0 for every Lambda that processes SQS events.
- Run a quarterly chaos exercise: throttle the Lambda to concurrency 1, send a spike of messages, and verify the queue drains correctly afterward with no message loss.
- Document this runbook and store it in your on-call playbook. Run a tabletop walkthrough with the team so the steps are muscle memory before the next 3 AM page.
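The two queue alarms from Phase 4 can be expressed as plain parameter dictionaries for CloudWatch's `put_metric_alarm` call. The sketch below is one way to do it; the function names, the alarm naming scheme, and the one-minute period are assumptions, not values taken from the runbook.

```python
def queue_alarm(queue_name: str, metric: str, threshold: float,
                alarm_topic_arn: str, evaluation_periods: int = 1) -> dict:
    """Parameters for cloudwatch.put_metric_alarm on one SQS queue metric."""
    return {
        "AlarmName": f"{queue_name}-{metric}",
        "Namespace": "AWS/SQS",
        "MetricName": metric,
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",   # gauge-style metric: use the peak, not the sum
        "Period": 60,
        "EvaluationPeriods": evaluation_periods,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [alarm_topic_arn],
    }


def runbook_alarms(main_queue: str, dlq: str, alarm_topic_arn: str) -> list:
    """The two early-warning alarms: queue age > 5 minutes and DLQ depth > 0."""
    return [
        queue_alarm(main_queue, "ApproximateAgeOfOldestMessage", 300, alarm_topic_arn),
        queue_alarm(dlq, "ApproximateNumberOfMessagesVisible", 0, alarm_topic_arn),
    ]
```

Given a boto3 CloudWatch client (assumed, not created here), each entry can be applied with `cloudwatch.put_metric_alarm(**alarm)`.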

Do Not Delete DLQ Messages Without Investigation
A common mistake: an engineer sees messages in DLQ, assumes they're stale, and empties the queue with a bulk delete. The root cause — a missing database column, a schema change, or a broken JSON parser — stays hidden. The same message shape will be published again next week. The same failure will happen again. Always inspect at least one DLQ message before deciding what to do. Fix the consumer, then redrive.
Production Insight
After a major outage caused by a poison message that looped for 6 hours (no DLQ, no alarm), a team implemented this runbook as a living document in their on-call wiki. They ran a tabletop exercise the following sprint.
Three months later, a schema change in an upstream service produced a poison message with a missing required field. The DLQ caught it on the 3rd attempt. The CloudWatch alarm fired. The on-call engineer opened the runbook, inspected the DLQ message, identified the missing field, deployed a fix to the consumer with a safe default for the missing field, and redrove the 1 stuck message. Total time: 18 minutes. Customer impact: zero. Downtime: 0 minutes.
The difference between a 6-hour outage and an 18-minute fix was a DLQ, a CloudWatch alarm, and a documented runbook.
Key Takeaway
A runbook is not a document — it's a sequence of commands you can execute half-asleep at 3 AM. Write it, test it in a tabletop exercise, and embed it in your on-call playbook. The DLQ alarm and the queue age alarm are your two most important early-warning signals.

Stop Clicking Around: Automate SNS→SQS Subscription With IAM

You're not a cloud janitor. Stop manually subscribing queues and fixing broken permissions at 3 AM. The most common production issue with SNS→SQS fan-out isn't the setup — it's the IAM policy that everyone forgets.

When you subscribe an SQS queue to an SNS topic, AWS does NOT automatically grant SNS permission to send messages to that queue. The subscription will show as "Confirmed" in the console, but your messages will vanish. No errors. No logs. Just silence.

Here's the fix: the SQS queue needs a resource policy that allows sqs:SendMessage from that specific topic ARN. (A topic policy granting sns:Subscribe only enters the picture when the queue lives in a different AWS account.) Terraform handles this cleanly with aws_sns_topic_subscription and aws_sqs_queue_policy. CI/CD pipelines deploy this. You document it once and forget it.

If you're still clicking through the console to create subscriptions, you're one misclick away from a silent data loss incident. Don't be that person.

sns-sqs-fanout-policy.tf (Terraform)
// io.thecodeforge — devops tutorial

resource "aws_sqs_queue" "order_events" {
  name = "order-events-queue"
}

resource "aws_sns_topic" "order_events" {
  name = "order-events-topic"
}

data "aws_iam_policy_document" "sqs_policy" {
  statement {
    effect = "Allow"
    principals {
      type        = "Service"
      identifiers = ["sns.amazonaws.com"]
    }
    actions   = ["sqs:SendMessage"]
    resources = [aws_sqs_queue.order_events.arn]
    condition {
      test     = "ArnEquals"
      variable = "aws:SourceArn"
      values   = [aws_sns_topic.order_events.arn]
    }
  }
}

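resource "aws_sqs_queue_policy" "allow_sns" {
  queue_url = aws_sqs_queue.order_events.id
  policy    = data.aws_iam_policy_document.sqs_policy.json
}

// Subscribe the queue to the topic (the fourth resource in the apply output)
resource "aws_sns_topic_subscription" "order_events" {
  topic_arn = aws_sns_topic.order_events.arn
  protocol  = "sqs"
  endpoint  = aws_sqs_queue.order_events.arn
}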
Output
Terraform apply output:
aws_sns_topic.order_events: Creating...
aws_sqs_queue.order_events: Creating...
aws_sqs_queue_policy.allow_sns: Creating...
aws_sns_topic_subscription.order_events: Creating...
Apply complete! Resources: 4 added, 0 changed, 0 destroyed.
Subscription ID: arn:aws:sns:us-east-2:123456789012:order-events-topic:abc123-...
Queue Policy: attached to order-events-queue
Production Trap:
The SNS subscription will show 'Confirmed' even when the queue policy is missing. Messages go into a black hole. Always add a CloudWatch alarm on the topic's NumberOfNotificationsFailed metric to catch silent delivery failures; ApproximateNumberOfMessagesNotVisible only counts in-flight messages and will not reveal them.
Key Takeaway
SNS→SQS fan-out requires an explicit queue policy that allows sqs:SendMessage from the topic. If it's not in your IaC, it's not production-ready.

Message Filtering: Stop Wasting Compute on Unwanted Events

Your notification service doesn't care about every inventory update. It only needs "checkout_completed" events. Without message filtering, every SQS subscriber in a fan-out topology gets every message — and your consumers waste CPU cycles filtering them out.

SNS supports subscription filter policies with JSON-based attribute matching. You attach a policy string on the subscription resource. Only messages whose attributes match the filter get delivered. This is not optional at scale.

Example: your order service publishes events with an attribute "event_type" set to "order_placed", "payment_failed", or "refund_initiated". Your analytics pipeline only wants "order_placed". Your fraud detection wants "payment_failed". Each subscribes with a different filter policy. One topic, three queues, zero wasted messages.

The filter policy supports exact matching, prefix matching, numeric ranges, and exists checks. Use it. Your Lambda cold starts will thank you.

sns-filter-policy.tf (Terraform)
// io.thecodeforge — devops tutorial

resource "aws_sns_topic_subscription" "analytics_orders" {
  topic_arn = aws_sns_topic.order_events.arn
  protocol  = "sqs"
  endpoint  = aws_sqs_queue.analytics_queue.arn

  filter_policy = jsonencode({
    event_type = ["order_placed"]
  })
}

resource "aws_sns_topic_subscription" "fraud_alerts" {
  topic_arn = aws_sns_topic.order_events.arn
  protocol  = "sqs"
  endpoint  = aws_sqs_queue.fraud_queue.arn

  filter_policy = jsonencode({
    event_type = ["payment_failed"]
  })
}
Output
# Published message with attribute:
# {
# "event_type": "order_placed",
# "order_id": "ORD-2024-001"
# }
#
# Result:
# analytics_queue: receives message
# fraud_queue: message filtered out (no event_type match)
Senior Shortcut:
Use subscription filter policies instead of creating separate SNS topics per event type. It reduces topic count, simplifies IAM, and keeps your architecture diagram from looking like a spider web.
Key Takeaway
Message filtering in SNS subscriptions prevents queue pollution. One topic, N subscribers, only relevant events delivered.
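On the publishing side, the routing value has to travel as a message attribute for the filter policies above to see it. The sketch below builds the arguments for boto3's `sns.publish` and includes a tiny local matcher that models only the exact-match case; both helper names are hypothetical, and the matcher is a teaching aid rather than SNS's evaluation logic.

```python
import json


def publish_kwargs(topic_arn: str, event_type: str, payload: dict) -> dict:
    """Arguments for sns.publish with event_type carried as a message attribute."""
    return {
        "TopicArn": topic_arn,
        "Message": json.dumps(payload),
        "MessageAttributes": {
            "event_type": {"DataType": "String", "StringValue": event_type},
        },
    }


def matches_exact(filter_policy: dict, message_attributes: dict) -> bool:
    """Exact-match subset of filtering: every policy key must match a listed value."""
    for key, allowed in filter_policy.items():
        attr = message_attributes.get(key)
        if attr is None or attr.get("StringValue") not in allowed:
            return False  # a missing attribute means the message is filtered out
    return True


kwargs = publish_kwargs(
    "arn:aws:sns:us-east-1:123456789012:order-events-topic",  # example ARN
    "order_placed",
    {"order_id": "ORD-2024-001"},
)
```

Checking `kwargs["MessageAttributes"]` against `{"event_type": ["order_placed"]}` matches, while the fraud policy `{"event_type": ["payment_failed"]}` does not, which mirrors the result shown in the output above.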

IAM Policies That Actually Allow SQS+SNS to Talk — No More Silent Failures

Your SNS→SQS fan-out works in dev but mysteriously drops messages in prod. Why? IAM policies. The common mistake is creating the subscription but never granting the SNS topic permission to write to the queue. Without that grant, deliveries are rejected and messages vanish silently. The fix: attach an SQS queue policy that explicitly allows the sns.amazonaws.com service principal to call sqs:SendMessage. A topic policy granting sns:Subscribe is only needed when the subscriber belongs to a different AWS account. Use a condition key like aws:SourceArn to lock down which topic can send. Never use a wildcard principal without that condition unless you want random accounts dumping data into your queue. Test with aws sns publish followed by aws sqs receive-message. If the queue and the DLQ both stay empty while SNS shows the message as published, your IAM chain is broken: check the SNS NumberOfNotificationsFailed metric and the topic's delivery status logs for AccessDenied on sqs:SendMessage.

SnsSqsPolicy.yml (YAML)
# io.thecodeforge: devops tutorial

# SQS queue policy: lets the topic deliver into the queue (always required)
PolicyDocument:
  Version: "2012-10-17"
  Statement:
  - Effect: Allow
    Principal:
      Service: sns.amazonaws.com
    Action: sqs:SendMessage
    Resource: arn:aws:sqs:us-east-1:123456789012:MyQueue
    Condition:
      ArnEquals:
        aws:SourceArn: arn:aws:sns:us-east-1:123456789012:MyTopic
---
# SNS topic policy: only needed when the subscriber is in another account.
# 999999999999 is a placeholder for that account's ID.
PolicyDocument:
  Version: "2012-10-17"
  Statement:
  - Effect: Allow
    Principal:
      AWS: arn:aws:iam::999999999999:root
    Action: sns:Subscribe
    Resource: arn:aws:sns:us-east-1:123456789012:MyTopic
Output
SNS can subscribe; SQS accepts messages.
Production Trap:
Using Principal: "*" for SQS policy grants any AWS account permission to send. Always lock with aws:SourceArn.
Key Takeaway
The SQS queue policy must explicitly allow sqs:SendMessage from the topic, and a cross-account subscriber also needs sns:Subscribe in the topic policy. Miss the queue policy and you get silent message drops.
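A quick way to catch a missing grant before it costs you messages is to inspect the queue's policy document in a test or deploy check. The function below is a deliberately simplified sketch: it only recognises Allow statements with exact `aws:SourceArn` values under `ArnEquals` or `ArnLike`, ignores wildcards and Deny statements, and its name is hypothetical.

```python
import json


def policy_allows_topic(policy_json: str, topic_arn: str) -> bool:
    """True if an Allow statement lets sns.amazonaws.com send from topic_arn."""
    policy = json.loads(policy_json) if policy_json else {}
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]
    for st in statements:
        if st.get("Effect") != "Allow":
            continue
        actions = st.get("Action", [])
        actions = [actions] if isinstance(actions, str) else actions
        if not any(a.lower() in ("sqs:sendmessage", "sqs:*") for a in actions):
            continue
        principal = st.get("Principal", {})
        service = principal.get("Service") if isinstance(principal, dict) else None
        services = [service] if isinstance(service, str) else (service or [])
        if "sns.amazonaws.com" not in services:
            continue
        condition = st.get("Condition", {})
        for operator in ("ArnEquals", "ArnLike"):
            source = condition.get(operator, {}).get("aws:SourceArn")
            sources = [source] if isinstance(source, str) else (source or [])
            if topic_arn in sources:
                return True
    return False
```

In practice the policy string would come from something like `sqs.get_queue_attributes(QueueUrl=url, AttributeNames=["Policy"]).get("Attributes", {}).get("Policy", "")`, with `sqs` and `url` supplied by your own code; an empty string means no policy is attached.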

CloudWatch Alarms That Wake You Up Before the Queue Drowns

SQS queues die quietly. The first sign is usually a support ticket from users complaining of slow responses. Don’t wait. Set CloudWatch alarms on four metrics: ApproximateNumberOfMessagesVisible (backlog), ApproximateAgeOfOldestMessage (stale messages), NumberOfMessagesSent (traffic spikes), and NumberOfMessagesDeleted (consumer health). For SNS, alarm on NumberOfNotificationsFailed and NumberOfNotificationsFilteredOut to catch permission or filter errors. Set thresholds: backlog > 1000 for 5 minutes = PagerDuty page. Age > 15 minutes = critical. Tie alarms to SNS topics (ironic but effective) that trigger Lambda to scale up consumers or restart stuck workers. Use math expressions: (Visible - InFlight) / ConsumerCount to detect imbalance. Test your alarms by flooding the queue with a script — if you don't get paged, your alerting is broken.

SqsCloudWatchAlarm.yml (YAML)
# io.thecodeforge: devops tutorial

# CloudWatch alarm for SQS backlog
Type: AWS::CloudWatch::Alarm
Properties:
  AlarmName: SQS-High-Backlog
  Namespace: AWS/SQS
  MetricName: ApproximateNumberOfMessagesVisible
  # Maximum, not Sum: Sum adds up every datapoint in the period and overstates the backlog
  Statistic: Maximum
  Period: 300
  EvaluationPeriods: 1
  Threshold: 1000
  ComparisonOperator: GreaterThanThreshold
  AlarmActions:
    - arn:aws:sns:us-east-1:123456789012:OpsTopic
  Dimensions:
    - Name: QueueName
      Value: MyQueue
Output
Alarm fires when backlog exceeds 1000 messages for 5 minutes.
Production Trap:
Alarming on ApproximateNumberOfMessagesVisible alone misses stale messages. Always pair with ApproximateAgeOfOldestMessage.
Key Takeaway
Monitor both backlog depth and message age — without alarms, your queue is a bomb you won't know is ticking.

FIFO Throughput Limits — Why Your Ordering Breaks at Scale

FIFO queues guarantee exactly-once processing and strict ordering, but by default they cap throughput at 3000 messages per second (with batching) or 300 per second (without). Exceed that and your sends are rejected with a throttling error such as ThrottlingException. Worse, message groups serialize: a slow consumer working on one MessageGroupId holds up every later message in that group. To fix: shard your groups by increasing the number of unique MessageGroupId values (e.g., by user ID). Groups are processed independently, so one slow group no longer stalls the rest. Use the Amazon SQS Extended Client Library for messages >256KB. If you need >3000/s, switch to standard queues and handle deduplication in your app. For SNS FIFO topics, subscribe SQS FIFO queues so ordering survives end to end; within a message group only one consumer makes progress at a time. When batching sends, give each entry an explicit MessageDeduplicationId or enable content-based deduplication. Test your throttle limits with a load test before hitting production.

FifoThrottleTest.sh (shell)
# io.thecodeforge: devops tutorial

# Simulate FIFO batch send (max 10 per batch)
aws sqs send-message-batch \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/MyQueue.fifo \
  --entries file://messages.json

# messages.json: each entry has Id, MessageBody, MessageGroupId, MessageDeduplicationId
# Run at 300+ batches/sec and watch for ThrottlingException
Output
Throttling starts above 300 API calls per second, which is 3,000 messages per second at 10 messages per batch.
Production Trap:
Using a single MessageGroupId serializes all messages. One slow consumer blocks the entire queue — always shard.
Key Takeaway
FIFO ordering comes at a throughput cost — shard message groups and batch aggressively or switch to standard queues at scale.
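Sharding and batching can be combined when preparing entries for `send_message_batch`. The sketch below assumes a hypothetical event shape with `user_id`, `event_id`, and `body` keys; it groups by user so ordering is kept per user, and splits the entries into chunks of 10, the per-call batch limit.

```python
def fifo_batches(events, batch_size: int = 10) -> list:
    """Turn events into send_message_batch entry lists of at most batch_size."""
    entries = [
        {
            "Id": str(index),                              # unique within each batch
            "MessageBody": event["body"],
            "MessageGroupId": f"user-{event['user_id']}",  # one ordered lane per user
            "MessageDeduplicationId": event["event_id"],   # explicit dedup key
        }
        for index, event in enumerate(events)
    ]
    return [entries[i:i + batch_size] for i in range(0, len(entries), batch_size)]


# 25 events spread over 3 users produce batches of 10, 10, and 5 entries.
sample = [{"user_id": i % 3, "event_id": f"evt-{i}", "body": "{}"} for i in range(25)]
batches = fifo_batches(sample)
```

Each batch can then be sent with `sqs.send_message_batch(QueueUrl=queue_url, Entries=batch)`, where `sqs` and `queue_url` are assumed to exist in your code. Sharding by user ID is only one option; any key that defines the ordering you care about works.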
● Production incident · POST-MORTEM · severity: high

Lambda Throttling + SNS = 10,000 Lost Order Events

Symptom
SNS publishes show 'Success' in CloudWatch (message sent). Lambda invocations flat — no errors, just not called. SNS delivery logs show 'Delivery failure — throttled'. CloudWatch SNS metrics show NumberOfNotificationsDelivered < NumberOfMessagesPublished.
Assumption
The team assumed SNS would keep retrying until Lambda had capacity. They didn't know SNS retries only 3 times total (initial + 2 retries) then discards the message. They also didn't know Lambda's reserved concurrency was set too low for a holiday spike.
Root cause
SNS to Lambda subscription: When Lambda throttles (hits concurrency limit), SNS retries twice with exponential backoff, then gives up permanently. Message is lost — no DLQ, no error notification to the publisher. Lambda's concurrency limit was 100. Traffic spike needed 300 concurrent executions. SNS tried to deliver to Lambda 3 times, each time got a throttle response, then dropped the message. No alert because SNS 'successfully' published to the topic — the failure happened at the subscription layer, not the publish layer.
Fix
1. Changed architecture: SNS → SQS queue → Lambda.
   - SNS topic fans out to an SQS queue
   - Lambda now polls the queue (or uses event source mapping)
   - Queue holds messages during throttling for up to the retention period (14 days maximum)
2. Created a dead-letter queue on the SQS queue with maxReceiveCount=3
   - Messages that fail after 3 attempts go to the DLQ
   - Team monitors DLQ depth as a metric
   - No more silent drops
3. Increased Lambda reserved concurrency to 500
4. Added a CloudWatch alarm on ApproximateAgeOfOldestMessage > 5 minutes on the SQS queue
Rule: If you can't lose the message, never subscribe Lambda directly to SNS in production. Always use SQS as the durable buffer.
Key lesson
  • SNS to Lambda = at-most-once delivery when throttling hits.
  • SNS to SQS = at-least-once delivery + durable storage.
  • Add DLQ to every SQS queue. maxReceiveCount=3.
  • Monitor DLQ depth. A message in DLQ is a service bug.
  • SQS ApproximateAgeOfOldestMessage alarm = early warning system.
Production debug guide · The 5 most common failure modes and how to find them · 5 entries
Symptom · 01
SNS publishes 'Success' but Lambda never invoked
Fix
Check SNS subscription to Lambda. SNS retries throttled Lambda only twice then drops. Fix: subscribe SQS queue to SNS instead. Lambda polls queue.
Symptom · 02
SQS consumer processes same message repeatedly
Fix
Check for missing delete_message() call. Also check VisibilityTimeout: if processing takes longer than timeout, message reappears. Increase timeout or send heartbeat.
Symptom · 03
Messages in SQS but consumer stuck
Fix
Check ApproximateAgeOfOldestMessage metric. If increasing, consumers are down. Check Lambda concurrency, EC2 worker processes, and SQS policy permissions.
Symptom · 04
SQS bill unexpectedly high
Fix
Check if you're using short polling (WaitTimeSeconds=0). Each receive call costs API request. Always set WaitTimeSeconds=20 (long polling). Also check for high Redrive count (messages moving to DLQ).
Symptom · 05
Messages disappear without trace, no DLQ activity
Fix
Enable SNS delivery status logging. Check the topic's CloudWatch Logs groups (named sns/<region>/<account-id>/<topic-name>, with a /Failure suffix for failed deliveries) for 'Throttling', 'EndpointDisabled', or 'AccessDenied'. Also verify SQS queue policy allows SNS to send messages.
★ SQS/SNS: 60-Second Diagnosis
Run these AWS CLI commands when messages are missing or consumers are stuck
Check SQS queue depth and health
Immediate action
Get queue attributes and oldest message age
Commands
aws sqs get-queue-attributes --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-queue --attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible ApproximateNumberOfMessagesDelayed ApproximateAgeOfOldestMessage
aws cloudwatch get-metric-statistics --namespace AWS/SQS --metric-name ApproximateAgeOfOldestMessage --dimensions Name=QueueName,Value=my-queue --period 300 --statistics Maximum --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) --end-time $(date -u +%Y-%m-%dT%H:%M:%S)
Fix now
If ApproximateAgeOfOldestMessage > 300 seconds, consumers are lagging. If ApproximateNumberOfMessagesNotVisible is high and static, consumers are crashing mid-processing.
Check if SNS topic is delivering to subscriber
Immediate action
View SNS delivery metrics and log groups
Commands
aws sns get-topic-attributes --topic-arn arn:aws:sns:us-east-1:123456789012:my-topic --query 'Attributes.EffectiveDeliveryPolicy'
aws logs filter-log-events --log-group-name sns/us-east-1/123456789012/my-topic/Failure --max-items 10
Fix now
Enable SNS delivery status logging if not already active. For SQS subscribers, set the feedback role attributes on the topic: aws sns set-topic-attributes --topic-arn arn:aws:sns:us-east-1:123456789012:my-topic --attribute-name SQSFailureFeedbackRoleArn --attribute-value arn:aws:iam::123456789012:role/SNSFailureFeedback (repeat with SQSSuccessFeedbackRoleArn and the SNSSuccessFeedback role to log successes). Check for 'Throttling' or 'EndpointDisabled' errors.
Find DLQ depth (messages that failed processing)
Immediate action
Check DLQ queue attributes
Commands
aws sqs get-queue-attributes --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-queue-dlq --attribute-names ApproximateNumberOfMessages
aws sqs receive-message --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-queue-dlq --max-number-of-messages 1
Fix now
Any message in DLQ = investigate root cause immediately. Do not delete messages before understanding why they failed. Inspect the message body for malformed JSON, missing fields, or unexpected data shapes.
Simulate sending a test message through the system
Immediate action
Publish to SNS, then check SQS for delivery
Commands
aws sns publish --topic-arn arn:aws:sns:us-east-1:123456789012:my-topic --message '{"test":true,"timestamp":"'$(date -u +%Y-%m-%dT%H:%M:%S)Z'"}'
aws sqs receive-message --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-queue --wait-time-seconds 20
Fix now
If test message not received in SQS within 30 seconds, check: (1) SQS queue policy allows sqs:SendMessage from SNS principal, (2) SNS subscription is confirmed (not PendingConfirmation), (3) SNS filter policy does not exclude your test message attributes.
Messages stuck in flight or invisible
Immediate action
Check VisibilityTimeout and dead-letter redrive settings
Commands
aws sqs get-queue-attributes --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-queue --attribute-names VisibilityTimeout RedrivePolicy
aws sqs get-queue-attributes --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-queue-dlq --attribute-names ApproximateNumberOfMessagesNotVisible
Fix now
If ApproximateNumberOfMessagesNotVisible stays high for minutes, consumers are crashing before deleting. Increase VisibilityTimeout to at least 6x your average processing time, or implement a heartbeat using ChangeMessageVisibility to extend the timeout dynamically.
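The heartbeat mentioned above can be sketched with a background thread that keeps extending the visibility timeout while the real work runs. This is one possible shape, assuming a boto3 SQS client is passed in as `sqs`; the function name and the default interval are illustrative choices.

```python
import threading


def process_with_heartbeat(sqs, queue_url, receipt_handle, work,
                           visibility_timeout=60, interval=20.0):
    """Run work() while periodically extending the message's visibility timeout."""
    stop = threading.Event()

    def beat():
        # Event.wait returns False on timeout and True once stop is set.
        while not stop.wait(interval):
            sqs.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=receipt_handle,
                VisibilityTimeout=visibility_timeout,
            )

    thread = threading.Thread(target=beat, daemon=True)
    thread.start()
    try:
        result = work()
    finally:
        stop.set()       # always stop the heartbeat, even if work() raised
        thread.join()
    # Reached only when work() succeeded; on failure the message reappears
    # after the timeout and counts toward maxReceiveCount.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=receipt_handle)
    return result
```

Keep `interval` comfortably below `visibility_timeout` so an extension always lands before the message becomes visible again.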
| Feature | SQS Standard | SQS FIFO | SNS Standard | SNS FIFO |
|---|---|---|---|---|
| Delivery model | Point-to-point (one consumer) | Point-to-point (one consumer) | Fan-out (all subscribers) | Fan-out (all subscribers) |
| Message ordering | Best-effort | Strict FIFO | None | Strict FIFO per MessageGroupId |
| Exactly-once delivery | No (at-least-once) | Yes | No | No (at-least-once to subscribers) |
| Message durability | Yes, up to 14 days | Yes, up to 14 days | No | No |
| Max throughput | Unlimited | 3,000 msg/s (with batching) | 30,000 msg/s (default, adjustable) | 300 publishes/s (default, adjustable) |
| Dead-letter queue | Yes | Yes | No (use SQS subscriber DLQ) | No (use SQS subscriber DLQ) |
| Message filtering | No (consumer-side only) | No (consumer-side only) | Yes (MessageAttributes) | Yes (MessageAttributes) |
| Price per million requests | $0.40 | $0.50 | $0.50 (publish) + $0.50 (delivery) | $0.50 (publish) + $0.50 (delivery) |
| Long polling support | Yes (WaitTimeSeconds=20) | Yes (WaitTimeSeconds=20) | N/A (push-based) | N/A (push-based) |
| Max message size | 256 KB | 256 KB | 256 KB | 256 KB |
| Message retention | 1 minute to 14 days | 1 minute to 14 days | Not stored | Not stored |
| Deduplication | Idempotent consumers | Content-based or MessageDeduplicationId | No | MessageDeduplicationId |
| Use case | Async task queue, back-pressure, rate limiting | Financial transactions, state machines, audit logs | Event broadcasting, fan-out, decoupling | Ordered event broadcasting |
| Best paired with | EC2 workers, Lambda event source mapping | EC2 workers, Lambda event source mapping | SQS queues (fan-out), Lambda (with SQS buffer) | SQS FIFO queues |

Key takeaways

1
SNS to Lambda = at-most-once delivery under throttling. Add SQS buffer for at-least-once guaranteed delivery.
2
Always set WaitTimeSeconds=20 on SQS receive calls. Cuts polling costs by up to 95% on quiet queues.
3
Every SQS queue needs a DLQ with maxReceiveCount=3. Monitor DLQ depth with a CloudWatch alarm. A message in DLQ is a bug, not noise.
4
SNS filter policies match MessageAttributes by default, not the message body. Put routing metadata in attributes, or set FilterPolicyScope to MessageBody if you must filter on the payload.
5
Set VisibilityTimeout to at least 6x your expected processing time. Use ChangeMessageVisibility heartbeats for variable workloads.
6
Use Terraform to enforce DLQ, long polling, and CloudWatch alarms on every queue. Never hand-roll production queues in the console.
7
The canonical production pattern: SNS → SQS → Lambda. SNS broadcasts, SQS buffers durably, Lambda processes at its own pace.
8
When in doubt, SNS+SQS fan-out. It covers 80% of event-driven use cases and has been the production standard for a reason.

Common mistakes to avoid

8 patterns
×

Subscribing Lambda directly to SNS in production

Symptom
SNS retries throttled Lambda only 3 times (initial + 2 retries) then permanently discards the message. No DLQ, no alarm, no trace. Under traffic spikes, you silently lose messages.
Fix
Always put an SQS queue between SNS and Lambda. Lambda polls the SQS queue via event source mapping. The queue durably holds messages through throttling events.
×

Using short polling (WaitTimeSeconds=0 or omitting it entirely)

Symptom
Your consumer sends continuous API requests even when the queue is empty. Each empty response costs the same as a response with messages. On a quiet queue, this generates hundreds of thousands of unnecessary API calls per day.
Fix
Always set WaitTimeSeconds=20 in every receive_message call. For Lambda event source mapping, long polling is configured automatically.
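A long-polling consumer loop body can be as small as the sketch below. It assumes a boto3 SQS client is passed in as `sqs` and a caller-supplied `handle` function; the name `poll_once` is illustrative.

```python
def poll_once(sqs, queue_url, handle) -> int:
    """Receive up to 10 messages with long polling, process, then delete each."""
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,   # long polling: wait up to 20s instead of returning empty
    )
    handled = 0
    # "Messages" is absent from the response when the wait ends with nothing to return.
    for message in response.get("Messages", []):
        handle(message["Body"])   # if this raises, the message is NOT deleted
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
        handled += 1
    return handled
```

A worker would call this in a loop, for example `while True: poll_once(sqs, queue_url, handle)`. Deleting only after `handle` returns is what lets failed messages come back and eventually reach the DLQ.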
×

Not attaching a Dead-Letter Queue to every SQS queue

Symptom
A single malformed message (poison message) causes repeated retry loops, blocks queue processing, and may exhaust consumer resources. Without a DLQ, the message never moves: it keeps looping until it is manually deleted or the retention period expires.
Fix
Attach a DLQ with maxReceiveCount=3 to every SQS queue. Add a CloudWatch alarm on DLQ depth > 0 with on-call notification.
×

Deleting messages from DLQ without investigating root cause

Symptom
The underlying bug — a schema change, a missing field, a null reference — goes unfixed. The next time a similar message is published, it fails again and lands in the DLQ again.
Fix
Always inspect at least one DLQ message before deciding what to do. Fix the consumer, then use DLQ Redrive to replay messages back to the main queue.
×

Putting filter criteria inside the SNS message body instead of MessageAttributes

Symptom
By default, SNS filter policies are evaluated against MessageAttributes. A filter policy keyed on a field that exists only in the message body never matches, so the subscription silently receives nothing, unless the subscription sets FilterPolicyScope to MessageBody.
Fix
Put all routing metadata in MessageAttributes when publishing. Define SNS filter policies against those attributes. Keep your message body clean and focused on the event payload.
×

Forgetting the SQS queue policy that allows SNS to send messages

Symptom
SNS silently drops messages when it cannot write to the SQS queue due to missing permissions. The publish API returns 'Success' because the topic accepted the message — but the subscriber queue rejects the delivery attempt with AccessDenied. No error is visible at the publish layer.
Fix
Always add an SQS queue policy granting sqs:SendMessage to the SNS principal with a Condition on the source topic ARN. Verify with a test publish and confirm the message appears in the queue.
×

Setting VisibilityTimeout shorter than the actual processing time

Symptom
If your consumer takes 90 seconds to process a message but VisibilityTimeout is 60 seconds, the message becomes visible again before processing completes. Another consumer picks it up and you get duplicate processing — or the same consumer processes it twice.
Fix
Set VisibilityTimeout to at least 6x your expected processing time. For variable processing times, implement a heartbeat: periodically call ChangeMessageVisibility to extend the timeout before it expires.
×

Using FIFO queues everywhere by default

Symptom
FIFO queues cost 25% more per request and cap throughput at 3,000 messages per second with batching. Most applications don't need strict ordering and are already idempotent by design.
Fix
Use Standard queues by default. Make consumers idempotent using a business key (e.g., order ID + operation type) stored in DynamoDB or Redis for deduplication. Only switch to FIFO when business logic truly requires strict ordering.
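The idempotency idea in this fix can be sketched in a few lines. The in-memory set below is only a stand-in for DynamoDB or Redis, and the class name and message shape (`order_id`, `operation`) are hypothetical; a real store would need an atomic "insert if absent" operation so two consumers cannot both claim the same key.

```python
class IdempotentProcessor:
    """Skip messages whose business key has already been applied."""

    def __init__(self, apply):
        self._apply = apply
        self._seen = set()   # stand-in for a DynamoDB table or Redis set

    def process(self, message: dict) -> bool:
        """Return True if the message was applied, False if it was a duplicate."""
        key = (message["order_id"], message["operation"])
        if key in self._seen:
            return False
        self._apply(message)
        # Record the key only after success so a failed attempt can be retried.
        self._seen.add(key)
        return True


applied = []
processor = IdempotentProcessor(applied.append)
message = {"order_id": "ORD-1", "operation": "charge"}
first = processor.process(message)    # applied
second = processor.process(message)   # duplicate delivery, skipped
```

With this in place, the at-least-once redeliveries of a Standard queue become harmless, which is usually cheaper than paying the FIFO throughput cost.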
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01JUNIOR
What is the difference between SQS and SNS? When would you use each?
Q02JUNIOR
Why should you never subscribe Lambda directly to SNS in production for ...
Q03JUNIOR
What is a dead-letter queue and why does every production SQS queue need...
Q04JUNIOR
What is the SNS+SQS fan-out pattern and why is it the production standar...
Q05JUNIOR
What is the visibility timeout in SQS and how do you set it correctly?
Q06JUNIOR
How does SQS long polling work and why should you always use it?
Q07JUNIOR
How do SNS message filter policies work, and what is the most common mis...
Q08JUNIOR
When would you choose EventBridge over SNS for event routing?
Q01 of 08 · JUNIOR

What is the difference between SQS and SNS? When would you use each?

ANSWER
SQS is a durable message queue with point-to-point delivery — one message is processed by one consumer. It stores messages for up to 14 days and provides at-least-once delivery (Standard) or exactly-once (FIFO). Use SQS when you need async task processing, back-pressure handling, rate limiting, or guaranteed retry with dead-letter queue support. SNS is a pub/sub topic with fan-out delivery — one message reaches all subscribers simultaneously. It does not store messages. Use SNS when you need to broadcast a single event to multiple independent consumers without the publisher needing to know who they are. In production, these services are almost always used together: SNS fans out to multiple SQS queues, combining SNS's broadcasting power with SQS's durability and retry semantics.
FAQ · 6 QUESTIONS

Frequently Asked Questions

01
Can SQS and SNS guarantee exactly-once delivery?
02
What happens to SQS messages if my consumer is down for several hours?
03
What is the maximum message size for SQS and SNS, and what do I do for larger payloads?
04
How do I replay messages that failed and ended up in a DLQ?
05
Should I use SQS Standard or FIFO for a payment processing system?
06
How do I monitor whether SNS is successfully delivering to all subscribers?