Notification System Design — Silent SMS Soft-Ban Traps
HTTP 200 from Twilio doesn't mean delivery.
- Decoupled architecture: event producer → message broker → notification service → channel providers
- Fan-out strategies: direct per-user, batch per channel, or tiered routing based on importance
- Rate limiting per channel (e.g., SMS caps) prevents provider bans and ensures fair use
- Deduplication via idempotency keys avoids sending the same notification twice
- Monitoring delivery receipts is essential; a silent failure (e.g., email throttled) won't appear in logs
Every time you get a payment confirmation from your bank, a 'your package shipped' email from Amazon, or a red badge on your Instagram icon, a notification system fired behind the scenes. These systems are invisible when they work and catastrophic when they don't — a failed OTP SMS locks a user out of their account, a duplicate push notification at 3 AM turns a loyal customer into a one-star reviewer. Notification systems are deceptively simple on the surface and brutally hard in production.
The core problem a notification system solves is decoupling event producers from delivery channels. Your payment service shouldn't need to know whether a user prefers SMS, email, or push — and it definitely shouldn't block waiting for Twilio to respond. The notification system absorbs that complexity: it stores user preferences, throttles sends, fans out to multiple channels, retries failures, and records delivery receipts — all asynchronously and at scale.
By the end of this article you'll be able to design a notification system that handles 10 million daily active users across email, SMS, and push, including the fan-out architecture, idempotency strategy, rate limiting design, and the production edge cases that trip up even experienced engineers. You'll also have the vocabulary and depth to walk through this confidently in a senior system design interview.
Functional & Non-Functional Requirements
Before designing any system, define what it must do and how well it must do it. For a notification system, functional requirements include the ability to send notifications via email, SMS, and push; support multiple languages and templates; respect user preferences (opt-in/out); and track delivery status. Non-functional requirements are where senior engineers focus: latency (under 1 second for critical OTPs, up to 5 minutes for promotional emails), throughput (10 million notifications/day), availability (99.99% uptime), and durability (never lose a notification once accepted). The biggest mistake? Treating all notifications with the same priority. An OTP failure is a revenue and security incident; a marketing email failure is a minor miss. Your architecture must distinguish between them.
High-Level Architecture: Event Ingestion to Delivery
A notification system can be broken into four layers: ingestion, processing, delivery, and tracking. Ingestion receives an event from any service (e.g., PaymentService sends a 'payment_confirmed' event). The event enters a message broker like Kafka. The processing layer reads events, enriches them with user preferences and templates, decides which channels to use, and applies rate limiting. Then it creates individual notification tasks per channel and pushes them to channel-specific queues. The delivery layer consumes those tasks, calls third-party providers (Twilio for SMS, SendGrid for email, Firebase for push), and stores the result. The tracking layer captures delivery receipts, opens, clicks, and bounces. This architecture decouples producers from delivery, allows independent scaling of each layer, and provides a single place to add monitoring, retries, and compliance.
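The processing step described above can be sketched in a few lines. This is a minimal in-memory illustration, not a production design: `Event`, `Task`, `PREFERENCES`, `TEMPLATES`, `CHANNEL_QUEUES`, and `process_event` are hypothetical names, and plain dictionaries and deques stand in for the preference store, template store, and broker topics.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_id: str
    user_id: str
    event_type: str
    payload: dict


@dataclass(frozen=True)
class Task:
    idempotency_key: str
    channel: str
    user_id: str
    body: str


# Hypothetical stand-ins for the preference store and the template store.
PREFERENCES = {"u1": {"email", "push"}, "u2": {"sms"}}
TEMPLATES = {"payment_confirmed": "Payment of {amount} received."}

# One queue per channel, standing in for channel-specific broker topics.
CHANNEL_QUEUES = defaultdict(deque)


def process_event(event: Event) -> list[Task]:
    """Enrich an event and enqueue one task per channel the user enabled."""
    channels = PREFERENCES.get(event.user_id, set())
    body = TEMPLATES[event.event_type].format(**event.payload)
    tasks = []
    for channel in sorted(channels):  # sorted only to keep the order stable
        task = Task(f"{event.event_id}:{channel}", channel, event.user_id, body)
        CHANNEL_QUEUES[channel].append(task)
        tasks.append(task)
    return tasks
```

Calling `process_event(Event("evt-1", "u1", "payment_confirmed", {"amount": "$20"}))` would enqueue one email task and one push task, each carrying its own idempotency key for the delivery layer to check.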
Fan-Out Strategies: Direct, Batch, and Tiered
Fan-out is how one event becomes many deliveries. The simplest approach is direct fan-out: for each recipient of an event, the processing layer queries the user preference store and creates one notification task per subscribed channel. This works for low volumes but becomes expensive at scale, because an event with N recipients costs N preference lookups. Batch fan-out groups users by channel and preference, reducing lookups. For example, a 'promotion' event might define a target segment (e.g., all users in tier 'premium'), and the system generates email tasks in bulk without individual lookups. Tiered fan-out combines both: critical events use direct (fast, per-user), bulk events use batch. The trade-off is latency vs. throughput. Direct fan-out costs one lookup per recipient, so O(N) per event; batch fan-out costs one lookup per segment, regardless of segment size. Choose based on the notification type's SLA.
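To make the lookup cost concrete, here is a rough sketch that counts store queries for each strategy. `PreferenceStore` is a hypothetical in-memory stand-in for the preference database, and the function names are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class PreferenceStore:
    """In-memory stand-in for a preference database that counts lookups."""
    channels_by_user: dict
    users_by_segment: dict
    lookups: int = 0

    def channels_for(self, user_id):
        self.lookups += 1
        return self.channels_by_user.get(user_id, [])

    def users_in_segment(self, segment, channel):
        # A single query that returns every segment member opted in to `channel`.
        self.lookups += 1
        members = self.users_by_segment.get(segment, [])
        return [u for u in members if channel in self.channels_by_user.get(u, [])]


def direct_fan_out(store, event_id, user_ids):
    """One lookup per recipient: low latency, high database load."""
    tasks = []
    for user_id in user_ids:
        for channel in store.channels_for(user_id):
            tasks.append((event_id, user_id, channel))
    return tasks


def batch_fan_out(store, event_id, segment, channel):
    """One lookup per segment: low database load, suited to bulk sends."""
    return [(event_id, u, channel) for u in store.users_in_segment(segment, channel)]


def tiered_fan_out(store, event_id, priority, *, user_ids=(), segment=None, channel="email"):
    """Route critical events to direct fan-out and everything else to batch."""
    if priority == "critical":
        return direct_fan_out(store, event_id, user_ids)
    return batch_fan_out(store, event_id, segment, channel)
```

With three users, direct fan-out performs three lookups while a batch send to the same three users as one segment performs one. That gap is what grows with audience size.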
Reliability, Retries, and Dead-Letter Queues
Third-party providers fail: they return 5xx, throttle you, or go down. Your notification system must handle these failures gracefully. The standard pattern is exponential backoff with jitter (e.g., 1s, 4s, 16s, capped at 64s), with configurable max retries per channel and per priority. After exhausting retries, move the task to a dead-letter queue (DLQ). The DLQ stores failed notifications for manual inspection or later reprocessing. Crucially, retries must be idempotent: a retried task must not send a second SMS. Use a unique idempotency key (e.g., eventId + channel) stored in Redis with a TTL longer than the retry window. For example, if the full retry schedule can take 5 minutes, set the Redis TTL to 30 minutes. Also, the DLQ must trigger an alert, because a silent DLQ means you're losing notifications.
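A minimal sketch of that retry loop follows. It assumes the "full jitter" variant (a uniform random delay between zero and the exponential ceiling), which is one common choice rather than the only one. `ProviderError`, `backoff_delay`, and `deliver_with_retries` are hypothetical names, and a plain list stands in for the DLQ.

```python
import random
import time


class ProviderError(Exception):
    """Hypothetical error for a retryable provider failure (5xx, throttling)."""


def backoff_delay(attempt, base=1.0, factor=4.0, cap=64.0, rng=random.random):
    """Full-jitter delay: uniform in [0, min(cap, base * factor**attempt))."""
    ceiling = min(cap, base * factor ** attempt)
    return rng() * ceiling


def deliver_with_retries(task, send, dead_letter_queue, max_retries=4,
                         sleep=time.sleep, rng=random.random):
    """Try `send(task)`; back off between failures; dead-letter when exhausted."""
    for attempt in range(max_retries + 1):
        try:
            return send(task)
        except ProviderError:
            if attempt == max_retries:
                break  # out of retries, fall through to the DLQ
            sleep(backoff_delay(attempt, rng=rng))
    dead_letter_queue.append(task)  # a real system would also raise an alert here
    return None
```

The `sleep` and `rng` parameters are injected so the schedule can be tested without waiting. In a queue-based system the delay would more likely be expressed as a delayed re-enqueue than as a blocking sleep in the worker.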
Production Gotchas: Rate Limiting, Deduplication, and Channel Failures
Three non-obvious issues that bring down notification systems:
- Rate limiting at the provider level: SMS providers like Twilio impose per-second and per-day limits. Exceeding them triggers soft bans (HTTP 200 but no delivery). Solution: implement a token bucket rate limiter per provider and per channel. Monitor the token consumption against the limit.
- Deduplication across retries and chains: If a user receives a notification that triggers another notification (e.g., 'payment received' triggers 'transaction alert'), you need cross-event deduplication. Use a configurable suppression window (e.g., don't send same type within 5 minutes). Store suppression keys in Redis with TTL.
- Channel failures: A sending domain with missing or misconfigured SPF/DKIM records causes bounces or spam filtering. Push notification certificates expire. An SMS aggregator may be down in a region. Solution: have a fallback channel strategy (e.g., if email fails, send SMS). Define fallback rules per notification type. Monitor channel health with synthetic probes.
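The token bucket mentioned in the first item above can be sketched as follows. This is a single-process illustration with an injectable clock; a multi-worker deployment would need shared state (for example in Redis), which is not shown. The class name and parameters are assumptions, and the rate used in the usage note is illustrative rather than any provider's actual limit.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start full
        self.updated = clock()

    def try_acquire(self, cost=1.0):
        """Return True and spend `cost` tokens if available, else return False."""
        now = self.clock()
        # Refill for the elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

One way to use it is to keep one bucket per (provider, channel) pair, e.g. `limiters = {("twilio", "sms"): TokenBucket(rate_per_sec=1.0, capacity=5)}`, and to re-enqueue a task with a delay when `try_acquire()` returns False instead of calling the provider.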
| Strategy | Latency | Database Load | Best For |
|---|---|---|---|
| Direct Fan-Out | Low (milliseconds) | High (N lookups per event) | Critical notifications (OTP, alerts) |
| Batch Fan-Out | High (minutes) | Low (1 lookup per segment) | Bulk mailings, promotions |
| Tiered Fan-Out | Low for critical, High for bulk | Medium (mixed) | Mixed workload systems |
Key Takeaways
- A notification system decouples event producers from delivery channels — never let a service block on third-party API calls.
- Isolate critical and bulk notifications into separate queues to prevent starvation.
- Rate limiting must be per channel and per provider — token buckets work well.
- Idempotency keys (stored in Redis) are the only reliable way to prevent duplicates across retries.
- Monitoring delivery receipts and running synthetic probes are both essential: HTTP 200 does not mean delivered.
- Always define SLAs per notification type and design your architecture to meet them.
Common Mistakes to Avoid
- Assuming HTTP 200 from provider means delivery
Symptom: Users report missing notifications even though logs show successful API calls. Provider soft-ban or carrier filtering causes silent drops.
Fix: Always monitor delivery receipts (callbacks, webhooks). Compare 'sent' vs 'delivered' counts. Implement synthetic probes that verify end-to-end.
- No rate limiting per channel
Symptom: Provider throttles your account, returning 429 or soft-banning. Retry storms exacerbate the problem.
Fix: Implement per-channel rate limiting (token bucket or sliding window). Respect each provider's documented limits (illustrative figures: 1 SMS per second, 100 emails per second; check the actual limits for your provider and plan).
- Using a single queue for all notification types
Symptom: A burst of promotional notifications delays critical OTPs by minutes. Hard to diagnose because all logs look the same.
Fix: Use separate queues/topics for different priorities. Configure independent consumer groups and autoscaling policies.
- Missing idempotency on retries
Symptom: Users receive duplicate SMS, email, or push notifications during provider failures and retries.
Fix: Add an idempotency key (eventId + channel) stored in Redis with a TTL greater than the retry window. Check it before sending.
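The idempotency fix for the last mistake can be sketched like this. The `TtlKeyStore` class is a hypothetical in-memory stand-in for Redis; its `set_if_absent` method mimics the semantics of Redis `SET key value NX EX ttl` (set only if the key does not exist, with an expiry). The key format and the `send_once` helper are assumptions for illustration.

```python
import time


class TtlKeyStore:
    """In-memory stand-in for Redis keys with expiry (illustration only)."""

    def __init__(self, clock=time.monotonic):
        self._expiry = {}
        self._clock = clock

    def set_if_absent(self, key, ttl_seconds):
        """Claim `key` for `ttl_seconds`; return False if it is already claimed."""
        now = self._clock()
        expires_at = self._expiry.get(key)
        if expires_at is not None and expires_at > now:
            return False
        self._expiry[key] = now + ttl_seconds
        return True

    def delete(self, key):
        self._expiry.pop(key, None)


def send_once(store, event_id, channel, send, ttl_seconds=1800):
    """Call `send()` at most once per (event, channel) within the TTL window."""
    key = f"notif:{event_id}:{channel}"
    if not store.set_if_absent(key, ttl_seconds):
        return False  # already sent, or another worker is sending right now
    try:
        send()
    except Exception:
        store.delete(key)  # release the claim so a later retry can go through
        raise
    return True
```

This version claims the key before sending and releases it on failure. The trade-off is that a worker crash between the claim and the send suppresses that notification until the TTL expires, so the TTL should be chosen with that in mind.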
Interview Questions on This Topic
- Q: How would you design a notification system that handles 10 million daily active users across SMS, email, and push? Discuss the key architectural components and trade-offs. (Senior)
- Q: How would you handle duplicate notifications if a user receives the same event twice in 5 seconds? (Mid-level)
- Q: What happens when an email provider returns a 500 error? Walk through the retry strategy. (Senior)
Frequently Asked Questions
What is a notification system in simple terms?
Think of it as a smart switchboard. An event comes in (e.g., 'payment received'), and the system decides who should be notified and how (email, SMS, push). It then fans out the message to the right channels, handles failures, and tracks delivery — all asynchronously and at scale.
Why should I use a message broker like Kafka for notifications?
A broker decouples event producers from delivery workers. If Twilio goes down, events keep piling up in Kafka instead of blocking the payment service. It also allows replaying events, handling spikes, and scaling consumers independently.
How do you handle user preferences (opt-in/out) in a notification system?
Store user preferences in a dedicated database (highly available, low latency). When a notification event arrives, the processing layer queries the preference store to determine which channels are enabled for that user and notification type. Cache frequently accessed preferences in Redis to reduce load.
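A cache-aside lookup along those lines might look like the sketch below. A dictionary stands in for Redis, `load_from_db` is a hypothetical callable that queries the preference database, and the explicit `invalidate` hook is an added assumption so that an opt-out does not wait for the TTL to expire.

```python
import time


class CachedPreferences:
    """Cache-aside wrapper around a preference lookup (illustration only)."""

    def __init__(self, load_from_db, ttl_seconds=60.0, clock=time.monotonic):
        self._load = load_from_db
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}  # (user_id, notification_type) -> (channels, expires_at)

    def channels_for(self, user_id, notification_type):
        """Return the enabled channels, hitting the database only on a miss."""
        key = (user_id, notification_type)
        now = self._clock()
        hit = self._cache.get(key)
        if hit is not None and hit[1] > now:
            return hit[0]
        channels = frozenset(self._load(user_id, notification_type))
        self._cache[key] = (channels, now + self._ttl)
        return channels

    def invalidate(self, user_id):
        """Drop every cached entry for a user, e.g. after a preference change."""
        for key in [k for k in self._cache if k[0] == user_id]:
            del self._cache[key]
```

A short TTL keeps stale entries bounded, while calling `invalidate` from the preference-update path makes opt-outs take effect on the next send.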
What's the worst production incident you've seen with notification systems?
The silent SMS block described earlier. A fintech company's OTP delivery stopped for hours because retries hit rate limits. No alerts fired because API calls returned 200. Users couldn't log in, support was flooded, and the issue was only caught when a developer manually verified delivery. That's why you must monitor delivery receipts, not just API calls.
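One way to catch this kind of silent drop is to compare how many messages the provider accepted with how many delivery receipts came back. The sketch below is a simplified, in-memory illustration: `DeliveryTracker`, its thresholds, and the `"delivered"` status string are assumptions, and a real system would compute the ratio over a time window with a grace period for late receipts.

```python
from collections import defaultdict


class DeliveryTracker:
    """Compare provider-accepted sends with delivery receipts per channel."""

    def __init__(self, min_delivery_ratio=0.9, min_sample=20):
        self.min_delivery_ratio = min_delivery_ratio
        self.min_sample = min_sample
        self.accepted = defaultdict(int)
        self.delivered = defaultdict(int)

    def record_accepted(self, channel):
        """Call when the provider API accepts a message (e.g., returns HTTP 200)."""
        self.accepted[channel] += 1

    def record_receipt(self, channel, status):
        """Call from the delivery-receipt webhook handler."""
        if status == "delivered":
            self.delivered[channel] += 1

    def delivery_ratio(self, channel):
        accepted = self.accepted.get(channel, 0)
        return self.delivered.get(channel, 0) / accepted if accepted else 1.0

    def unhealthy_channels(self):
        """Channels with enough traffic whose delivered/accepted ratio is too low."""
        return sorted(
            channel
            for channel, count in self.accepted.items()
            if count >= self.min_sample
            and self.delivery_ratio(channel) < self.min_delivery_ratio
        )
```

Alerting on a non-empty `unhealthy_channels()` result would have flagged the incident above, since accepted calls kept climbing while receipts stopped arriving.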