Design WhatsApp — Exactly-Once Across Server Boundaries
Duplicate receipts hit users when reconnection routed to two servers with local dedup keys.
- WhatsApp is a real-time messaging system handling 100B+ messages/day across 2B users
- WebSocket persistent connections deliver messages under 200ms median latency
- Delivery receipts (sent/delivered/read) require causal ordering across server nodes
- Group chat fan-out uses a hybrid approach: write fan-out for <=256 users, read fan-out otherwise
- E2E encryption adds 10-30ms latency per message but is non-negotiable for privacy
- Biggest mistake: assuming WebSocket reconnection is idempotent — duplicated messages break ordering guarantees
Imagine a giant post office where billions of people send letters every second. The post office has to know who's online, hold letters for people who are asleep, deliver them the moment someone wakes up, and confirm 'delivered' and 'read', all without ever opening the letters. That's WhatsApp. The engineering challenge isn't sending one message; it's doing it 100 billion times a day, reliably, privately, and in under a second.
WhatsApp handles over 100 billion messages per day across 2 billion active users. At that scale, the difference between a good design and a great one isn't features — it's whether your system stays alive on a Tuesday afternoon when half of India gets a holiday and everyone texts at once. This isn't a toy problem. It's one of the most sophisticated real-time distributed systems ever built, and interviewers use it precisely because it exposes every weakness in your distributed systems thinking.
The core problem WhatsApp solves is deceptively simple: two people want to exchange text (and now media) in real time, with guarantees around delivery and ordering, without either party having to stay permanently connected to the same server. Underneath that simplicity lurks a minefield: presence detection, message fan-out in group chats, offline message queuing, idempotent delivery, end-to-end encryption key exchange, and media deduplication across petabytes of storage.
By the end of this article you'll be able to walk into a system design interview and sketch the full WhatsApp architecture — connection layer, message routing, storage schema, delivery receipts, group messaging fan-out, media pipeline, and E2E encryption flow — and, more importantly, defend every single choice with concrete trade-offs. Let's build it.
Core Messaging Architecture: WebSocket Connection Management
WhatsApp's real-time communication relies on persistent WebSocket connections. Each user maintains a long-lived TCP connection to one of thousands of chat server nodes. The connection lifecycle is: client connects → gateway routes to least-loaded chat server → chat server assigns a session ID → server registers the user's presence in an in-memory distributed hash table (DHT) backed by Redis.
The chat server is responsible for receiving messages from the client, validating them, persisting to Cassandra, and forwarding to the recipient's chat server. The recipient's chat server then pushes the message over its WebSocket if the recipient is online. If offline, the message is queued in Cassandra with a TTL of 30 days.
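The push-or-queue decision described above can be sketched in a few lines. This is a minimal in-process model, not WhatsApp's implementation: the `Router` class, its method names, and the server name `chat-7` are all hypothetical, a dict stands in for the Redis-backed presence registry, and a deque stands in for the Cassandra offline queue (the 30-day TTL is omitted).

```python
from collections import defaultdict, deque


class Router:
    """Toy stand-in for the presence registry plus offline queue (hypothetical)."""

    def __init__(self):
        self.sessions = {}                 # user_id -> chat server holding the socket
        self.offline = defaultdict(deque)  # user_id -> messages waiting for delivery
        self.pushed = []                   # (server, user_id, message) deliveries made

    def connect(self, user_id, server):
        """Register presence, then flush any backlog in arrival order."""
        self.sessions[user_id] = server
        backlog = self.offline[user_id]
        while backlog:
            self.pushed.append((server, user_id, backlog.popleft()))

    def disconnect(self, user_id):
        self.sessions.pop(user_id, None)

    def send(self, recipient, message):
        """Push if the recipient has a live session, otherwise queue."""
        server = self.sessions.get(recipient)
        if server is None:
            self.offline[recipient].append(message)
            return "queued"
        self.pushed.append((server, recipient, message))
        return "pushed"


router = Router()
router.send("bob", "hi")         # bob is offline, so the message is queued
router.connect("bob", "chat-7")  # the backlog is flushed on connect
```

The point of the sketch is that the sender never needs to know whether the recipient is online; the lookup in the presence registry decides the path.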
Why not use HTTP long-polling? WebSockets reduce per-message overhead from ~1KB (HTTP headers) to ~150 bytes (WebSocket frame). At 100B messages/day, that saves ~85TB of bandwidth daily. Also, WebSockets enable push-based delivery without polling, keeping median latency under 200ms.
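The bandwidth figure follows directly from the per-message numbers quoted above. The arithmetic below treats "~1KB" as 1,000 bytes, which is an assumption about the rounding.

```python
MESSAGES_PER_DAY = 100_000_000_000  # 100B messages/day, from the text
HTTP_OVERHEAD_BYTES = 1_000         # "~1KB" of HTTP headers, taken as 1,000 bytes
WS_OVERHEAD_BYTES = 150             # WebSocket frame overhead, from the text

saved_bytes = MESSAGES_PER_DAY * (HTTP_OVERHEAD_BYTES - WS_OVERHEAD_BYTES)
saved_tb = saved_bytes / 1e12       # decimal terabytes

print(f"{saved_tb:.0f} TB/day")     # prints "85 TB/day"
```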
- Phone line (WebSocket): always open, instant two-way chatter. The post office can shout the moment your parcel arrives.
- Letter (HTTP): you must send a new envelope each time. The post office can't tell you something arrived unless you ask.
- WhatsApp chose phone lines because shouting 'new message for you' 100 billion times a day via letters would drown the postal system.
Message Storage and Delivery Semantics
WhatsApp stores every message permanently on the sender's device and temporarily on servers (30 days for delivery, then deleted). The server-side storage schema is designed for high write throughput and point lookups by conversation. The primary data store is Apache Cassandra, chosen for its ability to handle massive write loads with no single point of failure.
Each message has a globally unique ID: (sender_id: long, timestamp_nanos: long, node_id: int). This makes it trivial to deduplicate. Messages are stored in a table keyed by conversation_id (composite of sender and recipient, sorted lexicographically). The table is partitioned by conversation_id, clustered by message_time, so retrieving the last 50 messages is a single partition scan.
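A minimal sketch of the ID and partition key described above. The field names come from the text; the `conversation_id` string format and the in-memory `seen` set are illustrative assumptions (a real system would keep dedup state in a shared store).

```python
from typing import NamedTuple


class MessageId(NamedTuple):
    sender_id: int
    timestamp_nanos: int
    node_id: int


def conversation_id(user_a: str, user_b: str) -> str:
    """Sort the two participants so both directions map to the same partition."""
    first, second = sorted((user_a, user_b))
    return f"{first}:{second}"


seen = set()  # stand-in for persisted dedup state


def accept(message_id: MessageId) -> bool:
    """Return True the first time an ID is seen, False for a duplicate."""
    if message_id in seen:
        return False
    seen.add(message_id)
    return True
```

Because the ID is a plain tuple of three integers, it is hashable and comparable, so deduplication reduces to a set membership check.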
Delivery semantics: WhatsApp offers 'sent', 'delivered', and 'read' receipts. The 'delivered' receipt is generated by the recipient's chat server when it pushes the message to the client WebSocket. The 'read' receipt is generated by the client when the user opens the chat. Read receipts require careful ordering: if a message is read before an earlier one is delivered (possible in group chats), the system must preserve causal order.
WhatsApp uses a logical clock per conversation to order messages. Each server increments a local counter and appends it to the conversation's timestamp. Conflicts are resolved by lexicographic ordering of (counter, server_id).
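The ordering rule can be illustrated with a small Lamport-style clock. The text only says each server increments a local counter; the `observe` step, which pulls a server's counter forward when it sees a higher one, is the standard Lamport rule and is an assumption here, as are the class and server names.

```python
class ConversationClock:
    """Per-conversation logical clock held by one server (hypothetical sketch)."""

    def __init__(self, server_id: str):
        self.server_id = server_id
        self.counter = 0

    def tick(self):
        """Stamp a new message with (counter, server_id)."""
        self.counter += 1
        return (self.counter, self.server_id)

    def observe(self, remote_counter: int):
        """Never fall behind a counter seen from another server (assumed rule)."""
        self.counter = max(self.counter, remote_counter)


a = ConversationClock("srv-a")
b = ConversationClock("srv-b")

s1 = a.tick()      # (1, "srv-a")
s2 = b.tick()      # (1, "srv-b"), a tie on the counter
a.observe(s2[0])
s3 = a.tick()      # (2, "srv-a")

# Python compares tuples lexicographically, so ties break on server_id.
ordered = sorted([s3, s2, s1])
```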
Group Chat Fan-out: Write Fan-out vs Read Fan-out
Group chat is the hardest part of WhatsApp's architecture. A group with 100k members can't afford to write each message 100k times immediately. WhatsApp uses a hybrid approach:
- For groups up to 256 members: write fan-out. The sender's server writes the message to Cassandra once, then fans out notifications to all group members' servers. Each member's server fetches the message when the member goes online.
- For groups >256 members: read fan-out. The message is written to a 'group feed' row in Cassandra. Members periodically poll (or get push notified) their group feed and fetch new messages since their last seen timestamp. This reduces write amplification but increases read latency for large groups.
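The hybrid rule can be sketched as a single branching function. The 256 threshold is from the text; the function names, the returned dictionary shape, and the feed representation as `(timestamp, message_id)` pairs are assumptions made for illustration.

```python
WRITE_FANOUT_LIMIT = 256  # threshold from the text


def plan_delivery(group_id, message_id, members):
    """Decide how one group message is distributed."""
    if len(members) <= WRITE_FANOUT_LIMIT:
        # Write fan-out: one notification per member.
        return {"strategy": "write_fanout",
                "notify": sorted(members),
                "feed_append": None}
    # Read fan-out: append once to the group feed, members pull later.
    return {"strategy": "read_fanout",
            "notify": [],
            "feed_append": (group_id, message_id)}


def fetch_since(feed, last_seen):
    """Read side: return message IDs newer than the member's last-seen timestamp."""
    return [message_id for (timestamp, message_id) in feed if timestamp > last_seen]
```

For a small group the cost is paid at write time (one notification per member); for a large group it is deferred to each member's read via `fetch_since`.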
Why 256? It's a sweet spot derived from the average group size on WhatsApp (~50) and the cost-benefit analysis of write amplification vs read latency. At 256, write fan-out generates 256 write notification calls, each taking ~10ms, total ~2.5s server time. Read fan-out would require 256 sequential reads from Cassandra (since each member has a different last_seen), also ~2.5s. The crossover point is around 200-300 members, so 256 is a clean power of 2.
WhatsApp also uses a 'group member list cache' in Redis with a per-group version number. When a member joins or leaves, the version increments, and servers invalidate their local cache. This ensures membership changes propagate within seconds.
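A rough sketch of version-based invalidation, assuming each server compares its cached version against the authoritative one before trusting its local copy. `MembershipStore` is a plain-Python stand-in for Redis, and both class names are hypothetical.

```python
class MembershipStore:
    """Stand-in for the Redis member-list cache with a per-group version."""

    def __init__(self):
        self.version = {}
        self.members = {}

    def update(self, group_id, members):
        """Any join or leave rewrites the list and bumps the version."""
        self.members[group_id] = frozenset(members)
        self.version[group_id] = self.version.get(group_id, 0) + 1


class LocalCache:
    """Per-server cache that refreshes when the group's version has moved."""

    def __init__(self, store):
        self.store = store
        self.cached = {}   # group_id -> (version, members)
        self.refreshes = 0

    def get(self, group_id):
        current = self.store.version.get(group_id, 0)
        entry = self.cached.get(group_id)
        if entry is None or entry[0] != current:
            entry = (current, self.store.members.get(group_id, frozenset()))
            self.cached[group_id] = entry
            self.refreshes += 1
        return entry[1]
```

Checking a single integer is far cheaper than refetching the member list, which is what makes a frequent version check affordable.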
Media Storage and Delivery Pipeline
WhatsApp processes 4.5 billion photos and 1 billion videos daily. Media is too large to route through chat servers — that would saturate the internal network. Instead, WhatsApp uses a dedicated media pipeline:
1. Sender uploads the media to a CDN (Akamai) via a pre-signed URL obtained from the chat server.
2. The chat server returns a hash (SHA-256) of the media content, which is used as the media ID.
3. The sender sends a message containing the media ID (not the media itself) to the recipient.
4. The recipient fetches the media from the CDN using the media ID, with the CDN verifying the hash to prevent content substitution.
This design avoids storing media on the chat servers. The CDN handles blob storage, caching, and bandwidth. WhatsApp also uses content-addressable storage: if two users send the same media (e.g., forwarded meme), the CDN stores only one copy. Deduplication saves ~60% storage.
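Content-addressable storage is easy to demonstrate with the standard library. In this minimal sketch a dict stands in for the CDN's blob store and the `MediaStore` class is hypothetical; only the use of SHA-256 as the media ID comes from the text.

```python
import hashlib


class MediaStore:
    """Toy content-addressed blob store: the SHA-256 digest is the key."""

    def __init__(self):
        self.blobs = {}

    def put(self, data: bytes) -> str:
        media_id = hashlib.sha256(data).hexdigest()
        # Identical content hashes to the same ID, so it is stored only once.
        self.blobs.setdefault(media_id, data)
        return media_id

    def get(self, media_id: str) -> bytes:
        data = self.blobs[media_id]
        # Re-hash on read so substituted content is detected.
        if hashlib.sha256(data).hexdigest() != media_id:
            raise ValueError("content does not match media ID")
        return data


store = MediaStore()
first = store.put(b"the same meme")
second = store.put(b"the same meme")  # a forward of the same bytes
```

Here `first` and `second` are the same ID and the store holds a single blob, which is the deduplication effect described above.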
For videos, the upload pipeline includes a transcoding step (run on a Kubernetes job triggered by the upload success event). Transcoding reduces file size by 40-70% and ensures compatibility. The transcoded versions are stored with the same media ID but different quality suffix (480p, 720p, etc.).
- Sender uploads the meme. The library (CDN) stores one copy and returns a call number (media hash).
- Every time another user forwards that same meme, they just forward the call number. The library doesn't need a second copy.
- When a recipient downloads, they go to the library with the call number, get the same meme. If the disk fails, they can fetch from any edge cache.
End-to-End Encryption (E2E) Implementation
WhatsApp uses the Signal Protocol for E2E encryption, which provides forward secrecy and deniability. The key exchange is based on a pre-key bundle system:
- Each user generates a long-term identity key (IK) and a set of pre-keys (each is a one-time use key pair).
- The user's device uploads its identity key public half and a signed pre-key bundle to the server.
- When Alice wants to message Bob for the first time, she fetches Bob's pre-key bundle from the server, encrypts an initial message with a session key derived from the pre-key, and sends it.
- Bob decrypts with his stored private pre-key, and they establish a symmetric session key for subsequent messages.
This design means the server never has access to the message content. The server only sees the encrypted payload and metadata (sender, recipient, timestamp). The encryption is done client-side.
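The server's side of the pre-key exchange can be sketched without any cryptography, since it only stores and hands out public key material. Everything here is illustrative: the `PreKeyServer` class is hypothetical and keys are opaque strings. Returning a bundle with no one-time pre-key once the supply runs out is an assumption about fallback behaviour, not something the text above states.

```python
class PreKeyServer:
    """Holds public key material only; keys are opaque strings in this sketch."""

    def __init__(self):
        self.bundles = {}

    def upload(self, user, identity_key, signed_pre_key, one_time_pre_keys):
        self.bundles[user] = {
            "identity_key": identity_key,
            "signed_pre_key": signed_pre_key,
            "one_time": list(one_time_pre_keys),
        }

    def fetch_bundle(self, user):
        stored = self.bundles[user]
        # Each one-time pre-key is handed out once and then removed.
        one_time = stored["one_time"].pop(0) if stored["one_time"] else None
        return {
            "identity_key": stored["identity_key"],
            "signed_pre_key": stored["signed_pre_key"],
            "one_time_pre_key": one_time,  # None once the supply is exhausted
        }

    def remaining(self, user):
        """Clients can poll this to know when to upload fresh pre-keys."""
        return len(self.bundles[user]["one_time"])
```

A client that watches `remaining` and replenishes before it reaches zero avoids the pre-key exhaustion problem.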
For group messages, WhatsApp uses a 'sender key' approach: each sender generates a symmetric key and distributes it to all group members using their individual E2E sessions. Every member can then decrypt that sender's messages with the shared key, and the keys are rotated when a member leaves.
Key management is the hardest part — if a user loses a private key (e.g., phone reset), all messages encrypted with that key become undecryptable. WhatsApp stores encrypted backups on the server (with the user's 64-digit backup key) and allows restore only from the user's device.
Scaling Infrastructure: Chat Servers, Presence, and Multi-Region
WhatsApp's chat servers are application servers that hold WebSocket connections and route messages; apart from those live connections they keep no durable state. They sit behind a global load balancer (anycast DNS + L7 proxy). Each user is assigned to a 'home cluster' based on their phone number hash to maintain sticky sessions and reduce cross-region traffic.
Presence (online/offline/last seen) is stored in a separate low-latency key-value store (custom in-memory DHT with Redis persistence). Updates are propagated via a gossip protocol to all clusters within 5 seconds. To reduce update floods (users frequently connect/disconnect on mobile), WhatsApp batches presence updates: if a user toggles online/offline within 60 seconds, only the final state is published.
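The batching rule can be sketched as a small coalescing buffer. The 60-second window is from the text; the `PresenceBatcher` class, its method names, and the explicit `now` argument (used instead of a real clock so the behaviour is deterministic) are assumptions.

```python
class PresenceBatcher:
    """Coalesce presence flaps: publish only the last state seen in each window."""

    def __init__(self, window_seconds: float = 60.0):
        self.window = window_seconds
        self.pending = {}  # user_id -> (time of first unpublished change, latest state)

    def record(self, user_id, state, now):
        """Keep the window start from the first change, overwrite the state."""
        first_change, _ = self.pending.get(user_id, (now, None))
        self.pending[user_id] = (first_change, state)

    def flush(self, now):
        """Return {user_id: final_state} for every window that has elapsed."""
        due = {user: state
               for user, (first_change, state) in self.pending.items()
               if now - first_change >= self.window}
        for user in due:
            del self.pending[user]
        return due


batcher = PresenceBatcher()
batcher.record("alice", "online", now=0)
batcher.record("alice", "offline", now=20)
batcher.record("alice", "online", now=45)
published = batcher.flush(now=60)  # only the final state leaves the buffer
```

Three toggles inside one window produce a single published update, which is what keeps mobile reconnect churn off the gossip layer.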
Multi-region replication: Messages written to Cassandra in the home cluster are replicated asynchronously to a backup region using cross-datacenter replication. If the home cluster fails, traffic is routed to the backup region. The routing change happens within 10 seconds via a global DNS update (weighted records).
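Each region has multiple availability zones. Cassandra's rack-aware replication ensures that each message is stored in at least two AZs within the region. The replication factor is 3, with write consistency set to ONE (for write performance) and read consistency set to LOCAL_QUORUM. This combination narrows the window for stale reads after a crash but does not eliminate it: a read is only guaranteed to see the latest write when the write replicas plus the read replicas exceed the replication factor, and here 1 + 2 = 3.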
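The consistency trade-off can be checked with the usual overlap rule: a read is guaranteed to touch at least one replica that took the latest write only when W + R > RF. The helper names below are hypothetical; the rule itself is standard quorum arithmetic.

```python
def quorum(replication_factor: int) -> int:
    """Majority of replicas, e.g. 2 of 3."""
    return replication_factor // 2 + 1


def read_overlaps_write(replication_factor: int, write_acks: int, read_acks: int) -> bool:
    """True when every read set must intersect every write set."""
    return write_acks + read_acks > replication_factor


RF = 3
one_plus_quorum = read_overlaps_write(RF, write_acks=1, read_acks=quorum(RF))
quorum_plus_quorum = read_overlaps_write(RF, write_acks=quorum(RF), read_acks=quorum(RF))
```

With RF = 3, writing at ONE and reading at quorum gives 1 + 2 = 3, which does not exceed 3, so a read can still miss the one replica that has the write. Writing at quorum as well gives 2 + 2 = 4 and closes that gap, at the cost of slower writes.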
The Billion Lost Messages: WhatsApp's WebSocket Reconnection Bug
- Assume reconnection is not idempotent — always design for exactly-once delivery across server boundaries.
- Use globally unique message IDs from the client; never let the server generate the dedup key.
- Persist deduplication state with a TTL that covers the worst-case message backlog (usually hours, not days).
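The three rules above can be combined into one small sketch: the dedup key is the client-generated message ID, the state lives in a store shared by every chat server rather than in per-server memory, and entries expire after a TTL. The `DedupStore` class is hypothetical and a dict stands in for the shared store; the 6-hour TTL is an illustrative value only.

```python
class DedupStore:
    """Shared dedup state keyed by client-generated message ID, with a TTL."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.expiry = {}  # message_id -> time after which the entry is ignored

    def first_delivery(self, message_id: str, now: float) -> bool:
        """Return True only the first time an ID is seen within its TTL."""
        expires_at = self.expiry.get(message_id)
        if expires_at is not None and expires_at > now:
            return False
        self.expiry[message_id] = now + self.ttl
        return True


shared = DedupStore(ttl_seconds=6 * 3600)

# Server A delivers the message, then the client reconnects through server B
# and the same message is retried. Both servers consult the same store.
delivered_by_a = shared.first_delivery("client-uuid-1", now=0)
delivered_by_b = shared.first_delivery("client-uuid-1", now=5)
```

Because both servers check the same key in the same store, the retry on server B is recognised as a duplicate. With per-server dedup keys, each server would see the message as new.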
push notification with type=RECONNECT. If WebSocket exists but no messages, invalidate server-side session cache: redis-cli DEL session:<user_id>
Common mistakes to avoid
- Assuming WebSocket reconnection is transparent to message ordering
- Building group chat with write fan-out for all group sizes
- Storing media in the message database
- Ignoring pre-key exhaustion in E2E encryption
- Setting Cassandra write consistency to QUORUM for all writes
Interview Questions on This Topic
Design WhatsApp: How would you handle message delivery when the recipient is offline?