Mid-level 7 min · March 06, 2026

Design WhatsApp — Exactly-Once Across Server Boundaries

Duplicate receipts hit users when reconnection routed to two servers with local dedup keys.

Naren · Founder
Plain-English first. Then code. Then the interview question.
Quick Answer
  • WhatsApp is a real-time messaging system handling 100B+ messages/day across 2B users
  • WebSocket persistent connections deliver messages under 200ms median latency
  • Delivery receipts (sent/delivered/read) require causal ordering across server nodes
  • Group chat fan-out uses a hybrid approach: write fan-out for <=256 users, read fan-out otherwise
  • E2E encryption adds 10-30ms latency per message but is non-negotiable for privacy
  • Biggest mistake: assuming WebSocket reconnection is idempotent — duplicated messages break ordering guarantees
Plain-English First

Imagine a giant post office where billions of people send letters every second. The post office has to know who's online, hold letters for people who are asleep, deliver them the moment someone wakes up, and confirm 'delivered' and 'read', all without ever opening the letters. That's WhatsApp. The engineering challenge isn't sending one message; it's doing it a hundred billion times a day for two billion people, reliably, privately, and in under a second.

WhatsApp handles over 100 billion messages per day across 2 billion active users. At that scale, the difference between a good design and a great one isn't features — it's whether your system stays alive on a Tuesday afternoon when half of India gets a holiday and everyone texts at once. This isn't a toy problem. It's one of the most sophisticated real-time distributed systems ever built, and interviewers use it precisely because it exposes every weakness in your distributed systems thinking.

The core problem WhatsApp solves is deceptively simple: two people want to exchange text (and now media) in real time, with guarantees around delivery and ordering, without either party having to stay permanently connected to the same server. Underneath that simplicity lurks a minefield: presence detection, message fan-out in group chats, offline message queuing, idempotent delivery, end-to-end encryption key exchange, and media deduplication across petabytes of storage.

By the end of this article you'll be able to walk into a system design interview and sketch the full WhatsApp architecture — connection layer, message routing, storage schema, delivery receipts, group messaging fan-out, media pipeline, and E2E encryption flow — and, more importantly, defend every single choice with concrete trade-offs. Let's build it.

Core Messaging Architecture: WebSocket Connection Management

WhatsApp's real-time communication relies on persistent WebSocket connections. Each user maintains a long-lived TCP connection to one of thousands of chat server nodes. The connection lifecycle is: client connects → gateway routes to least-loaded chat server → chat server assigns a session ID → server registers the user's presence in an in-memory distributed hash table (DHT) backed by Redis.

The chat server is responsible for receiving messages from the client, validating them, persisting to Cassandra, and forwarding to the recipient's chat server. The recipient's chat server then pushes the message over its WebSocket if the recipient is online. If offline, the message is queued in Cassandra with a TTL of 30 days.
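The online/offline branch in the two paragraphs above comes down to one presence lookup per message. Here is a minimal sketch of that decision, assuming a hypothetical `MessageRouter` class in which one in-memory map stands in for the Redis-backed presence DHT and another stands in for the Cassandra offline queue (the 30-day TTL is not modeled):

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Queue;

// Hypothetical router; persistence to Cassandra is assumed to happen before route() is called.
class MessageRouter {
    // Stand-in for the Redis-backed presence DHT: userId -> chat server holding the user's WebSocket.
    private final Map<Long, String> presence = new HashMap<>();
    // Stand-in for the Cassandra offline queue (real rows would carry the 30-day TTL).
    private final Map<Long, Queue<String>> offlineQueue = new HashMap<>();

    void online(long userId, String chatServerId) { presence.put(userId, chatServerId); }

    void offline(long userId) { presence.remove(userId); }

    /** Returns the chat server to forward to, or empty if the message was queued for later. */
    Optional<String> route(long recipientId, String message) {
        String server = presence.get(recipientId);
        if (server != null) {
            return Optional.of(server); // that server pushes over its WebSocket
        }
        offlineQueue.computeIfAbsent(recipientId, id -> new ArrayDeque<>()).add(message);
        return Optional.empty();
    }

    int pendingFor(long userId) {
        Queue<String> queue = offlineQueue.get(userId);
        return queue == null ? 0 : queue.size();
    }
}
```

Returning the target server rather than pushing directly mirrors the design described above, where forwarding is a network call to whichever chat server owns the recipient's WebSocket.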

Why not use HTTP long-polling? WebSockets reduce per-message overhead from ~1KB (HTTP headers) to ~150 bytes (WebSocket frame). At 100B messages/day, that saves ~85TB of bandwidth daily. Also, WebSockets enable push-based delivery without polling, keeping median latency under 200ms.

io/thecodeforge/chat/WebSocketSessionManager.java
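package io.thecodeforge.chat;

import io.thecodeforge.chat.model.Session;
import io.thecodeforge.chat.model.UserId;
import io.thecodeforge.storage.RedisClient;

public class WebSocketSessionManager {
    private static final long SESSION_TTL_SECONDS = 3600;

    // Session record per user: which chat server holds the WebSocket, and when it connected.
    private final RedisClient<String, Session> sessionStore;
    // Presence entry per user (value: serving chat server id), read by other servers when routing.
    private final RedisClient<String, String> presenceStore;

    public WebSocketSessionManager(RedisClient<String, Session> sessionStore,
                                   RedisClient<String, String> presenceStore) {
        this.sessionStore = sessionStore;
        this.presenceStore = presenceStore;
    }

    private static String sessionKey(UserId userId) {
        return "session:" + userId.value();
    }

    private static String presenceKey(UserId userId) {
        return "presence:" + userId.value();
    }

    // Called once the gateway has routed the client to this server; registers session and presence.
    public Session createSession(UserId userId, String serverId) {
        Session session = new Session(userId, serverId, System.currentTimeMillis());
        sessionStore.setex(sessionKey(userId), SESSION_TTL_SECONDS, session);
        presenceStore.setex(presenceKey(userId), SESSION_TTL_SECONDS, serverId);
        return session;
    }

    // Called on heartbeat. Writing the session back is what extends the Redis TTL;
    // mutating the local copy alone would let the key expire mid-connection.
    // Assumes Session exposes serverId().
    public Session refreshSession(UserId userId) {
        String key = sessionKey(userId);
        Session session = sessionStore.get(key)
            .orElseThrow(() -> new IllegalStateException("Session expired for user " + userId.value()));
        session.refresh();
        sessionStore.setex(key, SESSION_TTL_SECONDS, session);
        presenceStore.setex(presenceKey(userId), SESSION_TTL_SECONDS, session.serverId());
        return session;
    }

    // Called on clean disconnect or when the drain period ends.
    public void markOffline(UserId userId) {
        sessionStore.del(sessionKey(userId));
        presenceStore.del(presenceKey(userId));
    }
}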
Mental Model: The Post Office Analogy for WebSocket vs HTTP
  • Phone line (WebSocket): always open, instant two-way chatter. The post office can shout the moment your parcel arrives.
  • Letter (HTTP): you must send a new envelope each time. The post office can't tell you something arrived unless you ask.
  • WhatsApp chose phone lines because shouting 'new message for you' 100 billion times a day via letters would drown the postal system.
Production Insight
Connection rebalancing is the silent killer of WebSocket systems.
When a chat server is decommissioned (deploy or scaling down), existing WebSockets must drain gracefully.
WhatsApp uses a two-phase drain: server broadcasts DRAIN message, waits 30s for clients to reconnect, then hard-closes remaining connections.
Without this, clients lose messages in flight because the old server no longer accepts them and the new server doesn't know about them.
Rule: always implement a drain period equal to your connection timeout + max message processing latency.
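The two-phase drain can be sketched as follows. `ConnectionDrainer`, its `ClientConnection` interface and the `DRAIN` control-frame name are hypothetical, and the sleep stands in for whatever timer a real server would use:

```java
import java.time.Duration;
import java.util.List;

class ConnectionDrainer {
    /** Hypothetical view of one client WebSocket held by the chat server being decommissioned. */
    interface ClientConnection {
        void sendControl(String frameType); // e.g. a DRAIN control frame
        boolean isOpen();
        void hardClose();
    }

    /** Rule of thumb from the text: connection timeout + max message processing latency. */
    static Duration drainPeriod(Duration connectionTimeout, Duration maxProcessingLatency) {
        return connectionTimeout.plus(maxProcessingLatency);
    }

    /**
     * Phase 1: ask every open client to reconnect elsewhere.
     * Phase 2: after the drain period, hard-close whatever is still attached.
     * Returns the number of connections that had to be force-closed.
     */
    static int drain(List<? extends ClientConnection> connections, Duration period) {
        for (ClientConnection c : connections) {
            if (c.isOpen()) {
                c.sendControl("DRAIN");
            }
        }
        try {
            Thread.sleep(period.toMillis());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // shutdown was interrupted: close the rest immediately
        }
        int forced = 0;
        for (ClientConnection c : connections) {
            if (c.isOpen()) {
                c.hardClose();
                forced++;
            }
        }
        return forced;
    }
}
```

With an assumed 25 s connection timeout and 5 s of processing headroom, `drainPeriod` yields the 30 s window mentioned above. Clients that reconnect during phase 1 close themselves, so only stragglers count as forced closes.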
Key Takeaway
WebSockets are chosen for WhatsApp because bidirectional low-latency messaging requires persistent state.
Connection draining, session state in Redis, and idempotent reconnection are the three non-negotiable patterns.
If you skip any one, you will lose messages in production.
Choosing WebSocket vs Alternatives
If: user count < 1K, need low engineering complexity
Use: HTTP long-polling with a simple queue per user (Redis). Accept 1-2s latency.
If: user count up to 10M, message delivery can tolerate 5s latency
Use: Server-Sent Events (SSE) with HTTP/2. Simpler than WebSockets, but unidirectional (server to client only).
If: user count > 10M, need <200ms latency and full duplex
Use: WebSockets with a custom stateful gateway layer. Accept the operational complexity of connection management.

Message Storage and Delivery Semantics

WhatsApp stores every message permanently on the sender's device and temporarily on servers (30 days for delivery, then deleted). The server-side storage schema is designed for high write throughput and point lookups by conversation. The primary data store is Apache Cassandra, chosen for its ability to handle massive write loads with no single point of failure.

Each message has a globally unique ID: (sender_id: long, timestamp_nanos: long, node_id: int). This makes it trivial to deduplicate. Messages are stored in a table keyed by conversation_id (composite of sender and recipient, sorted lexicographically). The table is partitioned by conversation_id, clustered by message_time, so retrieving the last 50 messages is a single partition scan.
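A short sketch of the two keys described above, assuming Java records for the ID and a `:` separator for the conversation key (both are illustrative choices, not the production format):

```java
import java.util.Set;

class MessageKeys {
    /** Globally unique ID as described: (sender_id, timestamp_nanos, node_id). Records give value equality. */
    record MessageId(long senderId, long timestampNanos, int nodeId) {}

    /** Both directions of a 1:1 chat map to the same partition: sort the two IDs lexicographically. */
    static String conversationId(long userA, long userB) {
        String a = Long.toString(userA);
        String b = Long.toString(userB);
        return a.compareTo(b) <= 0 ? a + ":" + b : b + ":" + a;
    }

    /** Dedup is a set-membership check on the ID: add() returns false for a repeat. */
    static boolean firstDelivery(Set<MessageId> seen, MessageId id) {
        return seen.add(id);
    }
}
```

Lexicographic sorting compares digit strings, so `42` sorts before `7`; that is fine, because the only requirement is that both participants derive the same key.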

Delivery semantics: WhatsApp offers 'sent', 'delivered', and 'read' receipts. The 'delivered' receipt is generated by the recipient's chat server when it pushes the message to the client WebSocket. The 'read' receipt is generated by the client when the user opens the chat. Read receipts require careful ordering: if a message is read before an earlier one is delivered (possible in group chats), the system must preserve causal order.
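One slice of the ordering problem, keeping a single message's receipt from moving backwards, can be sketched with an ordered enum. This is a minimal, single-threaded illustration with a hypothetical `ReceiptTracker` class; it does not order receipts across different messages:

```java
import java.util.HashMap;
import java.util.Map;

class ReceiptTracker {
    // Declaration order is the only legal direction of travel: SENT -> DELIVERED -> READ.
    enum Status { SENT, DELIVERED, READ }

    private final Map<String, Status> statusByMessageId = new HashMap<>();

    /**
     * Applies a receipt only if it moves the message forward. A late DELIVERED arriving after READ,
     * or a retried duplicate, leaves the stored status unchanged. Returns the resulting status.
     */
    Status apply(String messageId, Status incoming) {
        return statusByMessageId.merge(messageId, incoming,
            (current, next) -> next.compareTo(current) > 0 ? next : current);
    }
}
```

Because a stale or repeated receipt never changes the stored value, the same method also behaves like a dedupe of retried receipts by message_id.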

WhatsApp uses a logical clock per conversation to order messages. Each server increments a local counter and appends it to the conversation's timestamp. Conflicts are resolved by lexicographic ordering of (counter, server_id).
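Read literally, the scheme above is a Lamport clock with the server ID as a tie-breaker. The sketch below follows that interpretation; the `observe` step (advancing past any stamp a server has seen) is the standard Lamport rule and is an assumption about how the counters stay causally consistent:

```java
import java.util.Comparator;

class ConversationClock {
    /** Ordering key for a message: (counter, serverId), compared in that order. */
    record Stamp(long counter, String serverId) {}

    static final Comparator<Stamp> ORDER =
        Comparator.comparingLong(Stamp::counter).thenComparing(Stamp::serverId);

    private final String serverId;
    private long counter;

    ConversationClock(String serverId) { this.serverId = serverId; }

    /** Local send: tick, then stamp the message. */
    synchronized Stamp stampLocal() {
        counter++;
        return new Stamp(counter, serverId);
    }

    /** On seeing another server's stamp, jump past it so later local stamps sort after it. */
    synchronized void observe(Stamp remote) {
        counter = Math.max(counter, remote.counter()) + 1;
    }
}
```

Lamport order respects causality (a message stamped after seeing another always sorts after it), but two concurrent messages are ordered only by the arbitrary server-ID tie-break.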

io/thecodeforge/chat/schema.cql
-- Assumes a keyspace named thecodeforge (keyspace names allow only letters, digits and underscores).
CREATE TABLE IF NOT EXISTS thecodeforge.messages_by_conversation (
    conversation_id text,
    message_time timestamp,
    message_id uuid,
    sender_id bigint,
    recipient_id bigint,
    content_blob blob,
    content_type text,
    delivery_status text,
    read_timestamp timestamp,
    PRIMARY KEY ((conversation_id), message_time, message_id)
) WITH CLUSTERING ORDER BY (message_time DESC, message_id ASC)  -- newest first: "last 50" is one slice
   AND default_time_to_live = 2592000; -- 30 days
Cassandra Tombstone Pitfall
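Every DELETE in Cassandra creates a tombstone. If you design a schema that frequently deletes old messages (e.g., auto-delete after 7 days), tombstones accumulate and cause read latency spikes. A table-level TTL (default_time_to_live) avoids issuing explicit deletes, but expired cells still become tombstones until compaction purges them after gc_grace_seconds. What keeps this cheap is that every row shares the same TTL: paired with a time-bucketed compaction strategy such as TimeWindowCompactionStrategy, whole SSTables expire together and can be dropped without reads having to skip over scattered tombstones.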
Production Insight
Read receipts break if the recipient's chat server crashes after pushing the message but before acknowledging the read.
The client sends 'read' after the server confirms delivery. If the crash happens, the server never records the read, and the sender sees only 'delivered' indefinitely.
Workaround: the client persists 'read' status locally and retries on reconnection, and the server dedupes read receipts by message_id.
WhatsApp writes the read receipt to a separate Cassandra table using a lightweight transaction (LWT), INSERT ... IF NOT EXISTS, so a retried receipt is applied at most once; if the LWT does not succeed, the client's retry covers it.
Key Takeaway
Design message storage for append-only writes with TTL-based expiry.
Use Cassandra's clustering by time for efficient conversation history retrieval.
Read receipts require a side table with idempotent writes to handle server crashes.

Group Chat Fan-out: Write Fan-out vs Read Fan-out

Group chat is the hardest part of WhatsApp's architecture. A group with 100k members can't afford to write each message 100k times immediately. WhatsApp uses a hybrid approach:
  • For groups up to 256 members: write fan-out. The sender's server writes the message to Cassandra once, then fans out notifications to all group members' servers. Each member's server fetches the message when the member goes online.
  • For groups >256 members: read fan-out. The message is written to a 'group feed' row in Cassandra. Members periodically poll (or get push-notified about) their group feed and fetch new messages since their last-seen timestamp. This reduces write amplification but increases read latency for large groups.

Why 256? It's a sweet spot derived from the average group size on WhatsApp (~50) and the cost-benefit analysis of write amplification vs read latency. At 256, write fan-out generates 256 write notification calls, each taking ~10ms, total ~2.5s server time. Read fan-out would require 256 sequential reads from Cassandra (since each member has a different last_seen), also ~2.5s. The crossover point is around 200-300 members, so 256 is a clean power of 2.

WhatsApp also uses a 'group member list cache' in Redis with a per-group version number. When a member joins or leaves, the version increments, and servers invalidate their local cache. This ensures membership changes propagate within seconds.
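The version check described above might look like this on each chat server. `MembershipSource` is a hypothetical interface standing in for the Redis version key and the member-list query, and the sketch checks the version on every read rather than receiving invalidation pushes:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

class GroupMemberCache {
    /** Hypothetical source of truth: a per-group version number plus the full member list. */
    interface MembershipSource {
        long version(String groupId);
        Set<Long> members(String groupId);
    }

    private record Entry(long version, Set<Long> members) {}

    private final MembershipSource source;
    private final Map<String, Entry> local = new HashMap<>(); // single-threaded sketch
    private int reloads;

    GroupMemberCache(MembershipSource source) { this.source = source; }

    Set<Long> members(String groupId) {
        long current = source.version(groupId);
        Entry cached = local.get(groupId);
        if (cached != null && cached.version() == current) {
            return cached.members(); // version unchanged: serve from the local cache
        }
        // Version moved (or first read): reload and remember which version this list belongs to.
        Set<Long> fresh = Set.copyOf(source.members(groupId));
        local.put(groupId, new Entry(current, fresh));
        reloads++;
        return fresh;
    }

    int reloads() { return reloads; }
}
```

Each read costs one small version lookup; the full member list is fetched only when the version has changed.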

io/thecodeforge/chat/GroupFanoutDecider.java
package io.thecodeforge.chat;

public class GroupFanoutDecider {
    private static final int WRITE_FANOUT_THRESHOLD = 256;

    public FanoutStrategy decide(int groupSize) {
        if (groupSize <= WRITE_FANOUT_THRESHOLD) {
            return FanoutStrategy.WRITE_FANOUT;
        }
        return FanoutStrategy.READ_FANOUT;
    }

    public enum FanoutStrategy {
        WRITE_FANOUT,
        READ_FANOUT
    }
}
Real-World Data Point
WhatsApp's largest known group has 256k members (WhatsApp broadcast lists). These use pure read fan-out with a dedicated group feed table. In 2023, a broadcast to 256k members took ~29 seconds to fully deliver (p99 latency). Without read fan-out, the write amplification would have required 256k disk I/Os per message — impossible at scale.
Production Insight
Group member list staleness causes messages to be sent to users who have left, or missed by users who just joined.
Flapping (rapid join/leave) can inflate the version number and cause cache churn.
WhatsApp uses a cooldown: if membership changes again within 10 seconds of a previous change, the changes are coalesced and the version increments once when the 10-second window closes, not once per event.
Without this, a flurry of 100 joins would trigger 100 cache invalidations and 100 full membership queries.
Key Takeaway
Choose fan-out strategy based on group size: write fan-out for small groups (<=256), read fan-out for large groups.
Cache group membership aggressively but with a cooldown to avoid flapping cascades.
The 256 threshold is derived from write cost vs read cost — derive your own threshold by measuring your own latency numbers.
Fan-out Strategy Decision
If: group size <= 256, low churn, real-time requirement
Use: write fan-out with per-user delivery queues. Accept up to 256x write amplification for instant delivery.
If: group size > 256, high churn, or async delivery acceptable
Use: read fan-out with a group feed table. Accept higher read latency and periodic polling.
If: group size > 10K, must handle multi-region
Use: read fan-out with content-addressable storage and a CDN for media. Use push notifications as the 'new message' signal.

Media Storage and Delivery Pipeline

WhatsApp processes 4.5 billion photos and 1 billion videos daily. Media is too large to route through chat servers; that would saturate the internal network. Instead, WhatsApp uses a dedicated media pipeline:
  1. The sender uploads the media to a CDN (Akamai) via a pre-signed URL obtained from the chat server.
  2. The chat server returns a hash (SHA-256) of the media content, which is used as the media ID.
  3. The sender sends a message containing the media ID (not the media itself) to the recipient.
  4. The recipient fetches the media from the CDN using the media ID, with the CDN verifying the hash to prevent content substitution.

This design avoids storing media on the chat servers. The CDN handles blob storage, caching, and bandwidth. WhatsApp also uses content-addressable storage: if two users send the same media (e.g., forwarded meme), the CDN stores only one copy. Deduplication saves ~60% storage.
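Content addressing and dedup fit in a few lines. In this sketch an in-memory map stands in for the CDN, `ContentAddressedStore` is a hypothetical name, and the hash check on read mirrors the substitution protection described in step 4 of the pipeline:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.HexFormat;
import java.util.Map;

class ContentAddressedStore {
    private final Map<String, byte[]> blobs = new HashMap<>();

    /** Media ID = "sha256:" + hex digest of the bytes, matching the media_id format in the upload response. */
    static String mediaId(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(content);
            return "sha256:" + HexFormat.of().formatHex(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is required on every Java platform", e);
        }
    }

    /** Stores the bytes once; a forwarded copy of the same content returns the same ID and adds nothing. */
    String put(byte[] content) {
        String id = mediaId(content);
        blobs.putIfAbsent(id, content.clone());
        return id;
    }

    /** Fetch and re-hash, so a substituted blob is rejected rather than returned. */
    byte[] get(String id) {
        byte[] content = blobs.get(id);
        if (content == null) throw new IllegalArgumentException("unknown media " + id);
        if (!mediaId(content).equals(id)) throw new IllegalStateException("hash mismatch for " + id);
        return content.clone();
    }

    int storedCopies() { return blobs.size(); }
}
```

A forwarded meme produces the same digest, so the second `put` stores nothing new.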

For videos, the upload pipeline includes a transcoding step (run on a Kubernetes job triggered by the upload success event). Transcoding reduces file size by 40-70% and ensures compatibility. The transcoded versions are stored with the same media ID but different quality suffix (480p, 720p, etc.).

io/thecodeforge/media/UploadResponse.json
{
  "media_id": "sha256:f4a5b1c...",
  "upload_url": "https://cdn.whatsapp.net/upload/...",
  "expires_at": 1712345678,
  "thumbnail_hash": "sha256:abcd123..."
}
Mental Model: The Library Analogy for Media Deduplication
  • Sender uploads the meme. The library (CDN) stores one copy and returns a call number (media hash).
  • Every time another user forwards that same meme, they just forward the call number. The library doesn't need a second copy.
  • When a recipient downloads, they go to the library with the call number, get the same meme. If the disk fails, they can fetch from any edge cache.
Production Insight
Pre-signed URLs with short expiry (5 minutes) prevent abuse.
But if the sender's upload fails partway, the URL expires and the sender must restart the whole upload.
WhatsApp mitigates this by using multi-part upload — each part has its own pre-signed URL.
If one part fails, only that part is retried. The CDN reassembles on success.
Without multi-part, a large video upload failure wastes 10+ seconds and bandwidth on the client.
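The per-part retry idea can be sketched as below. `PartUploader` is a hypothetical stand-in for a PUT to one part's pre-signed URL, and reassembly on the CDN side is not modeled:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MultipartUpload {
    /** Hypothetical transport: PUTs one part to its own pre-signed URL, returns true on success. */
    interface PartUploader {
        boolean put(int partNumber, byte[] part);
    }

    static List<byte[]> split(byte[] file, int partSize) {
        List<byte[]> parts = new ArrayList<>();
        for (int off = 0; off < file.length; off += partSize) {
            parts.add(Arrays.copyOfRange(file, off, Math.min(file.length, off + partSize)));
        }
        return parts;
    }

    /**
     * Uploads every part, retrying only the parts that fail, up to maxAttempts per part.
     * Returns the total number of PUTs issued; throws if any part never succeeds.
     */
    static int upload(byte[] file, int partSize, int maxAttempts, PartUploader uploader) {
        List<byte[]> parts = split(file, partSize);
        int puts = 0;
        for (int i = 0; i < parts.size(); i++) {
            boolean ok = false;
            for (int attempt = 1; attempt <= maxAttempts && !ok; attempt++) {
                puts++;
                ok = uploader.put(i + 1, parts.get(i)); // 1-based part numbers, as in S3 multipart uploads
            }
            if (!ok) {
                throw new IllegalStateException("part " + (i + 1) + " failed " + maxAttempts + " times");
            }
        }
        return puts;
    }
}
```

A failure on part 2 costs one extra PUT of that part, not a restart of the whole file.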
Key Takeaway
Media should never pass through the chat server directly.
Use a dedicated CDN with content-addressable storage for deduplication.
Multi-part uploads with per-part pre-signed URLs improve reliability for large files.

End-to-End Encryption (E2E) Implementation

WhatsApp uses the Signal Protocol for E2E encryption, which provides forward secrecy and deniability. The key exchange is based on a pre-key bundle system:

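  • Each user generates a long-term identity key (IK), a medium-term signed pre-key, and a batch of one-time pre-keys (each a key pair used at most once).
  • The user's device uploads the public halves to the server as a pre-key bundle: the identity key, the signed pre-key with its signature, and the one-time pre-keys.
  • When Alice wants to message Bob for the first time, she fetches Bob's pre-key bundle from the server and runs X3DH: several Diffie-Hellman computations between her identity and ephemeral keys and Bob's identity, signed and one-time pre-keys. She encrypts an initial message with a key derived from the result and sends it along with her ephemeral public key.
  • Bob repeats the same Diffie-Hellman computations with his private keys, arrives at the same shared secret, and deletes the used one-time pre-key; from then on, the Double Ratchet derives fresh keys for subsequent messages.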
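To make the key exchange concrete, here is a sketch of the X3DH computation using only the JDK's built-in X25519 (`XDH`) and HMAC-SHA256. It is an illustration under simplifying assumptions, not production crypto: it skips verifying the signed pre-key's signature, uses plain X25519 identity keys (Signal uses XEdDSA so one key can both sign and do DH), and the HKDF `info` string is made up. Real clients should rely on libsignal:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.spec.NamedParameterSpec;
import java.util.Arrays;
import javax.crypto.KeyAgreement;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

class X3dhSketch {
    static KeyPair newX25519() {
        try {
            KeyPairGenerator kpg = KeyPairGenerator.getInstance("XDH");
            kpg.initialize(NamedParameterSpec.X25519);
            return kpg.generateKeyPair();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    static byte[] dh(PrivateKey mine, PublicKey theirs) {
        try {
            KeyAgreement ka = KeyAgreement.getInstance("XDH");
            ka.init(mine);
            ka.doPhase(theirs, true);
            return ka.generateSecret();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** HKDF-SHA256 (RFC 5869) over F || DH1..DH4 with a zero salt, producing one 32-byte block. */
    static byte[] kdf(byte[]... dhOutputs) {
        ByteArrayOutputStream km = new ByteArrayOutputStream();
        byte[] f = new byte[32];
        Arrays.fill(f, (byte) 0xFF); // X3DH prepends 32 0xFF bytes for X25519
        km.writeBytes(f);
        for (byte[] d : dhOutputs) km.writeBytes(d);
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(new byte[32], "HmacSHA256")); // extract: salt = 32 zero bytes
            byte[] prk = mac.doFinal(km.toByteArray());
            mac.init(new SecretKeySpec(prk, "HmacSHA256"));          // expand: T(1) = HMAC(PRK, info || 0x01)
            mac.update("X3DH-sketch".getBytes(StandardCharsets.UTF_8));
            mac.update((byte) 0x01);
            return mac.doFinal();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Alice: her identity key and a fresh ephemeral key against Bob's published bundle. */
    static byte[] aliceSecret(KeyPair ikA, KeyPair ekA, PublicKey ikB, PublicKey spkB, PublicKey opkB) {
        return kdf(dh(ikA.getPrivate(), spkB), dh(ekA.getPrivate(), ikB),
                   dh(ekA.getPrivate(), spkB), dh(ekA.getPrivate(), opkB));
    }

    /** Bob: the same four DH values from his private halves (he then deletes the one-time key). */
    static byte[] bobSecret(KeyPair ikB, KeyPair spkB, KeyPair opkB, PublicKey ikA, PublicKey ekA) {
        return kdf(dh(spkB.getPrivate(), ikA), dh(ikB.getPrivate(), ekA),
                   dh(spkB.getPrivate(), ekA), dh(opkB.getPrivate(), ekA));
    }
}
```

Both sides compute the same four Diffie-Hellman values from opposite key halves, so they arrive at the same 32-byte secret without it ever crossing the network.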

This design means the server never has access to the message content. The server only sees the encrypted payload and metadata (sender, recipient, timestamp). The encryption is done client-side.

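For group messages, WhatsApp uses a 'sender key' approach: each sender generates a symmetric sender key and distributes it to every other group member over their individual E2E sessions. A message is then encrypted once with the sender's key, and the single ciphertext is fanned out to all members. Sender keys are rotated when a member leaves, so the departed member cannot decrypt later messages.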
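The sender-key chain can be sketched with the HMAC-based symmetric ratchet recommended in Signal's Double Ratchet specification (constant 0x01 for the message key, 0x02 for the next chain key). The class name is hypothetical, and the actual message encryption and the signature on sender-key messages are omitted:

```java
import java.security.GeneralSecurityException;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

class SenderKeyChain {
    private byte[] chainKey;
    private int iteration;

    /** The initial chain key is what the sender distributes to each member over pairwise sessions. */
    SenderKeyChain(byte[] initialChainKey) {
        this.chainKey = initialChainKey.clone();
    }

    private static byte[] hmac(byte[] key, byte constant) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(key, "HmacSHA256"));
            return mac.doFinal(new byte[] {constant});
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns the key for the next message and ratchets forward; the old chain key is discarded. */
    byte[] nextMessageKey() {
        byte[] messageKey = hmac(chainKey, (byte) 0x01);
        chainKey = hmac(chainKey, (byte) 0x02);
        iteration++;
        return messageKey;
    }

    int iteration() { return iteration; }
}
```

Every member that received the same initial chain key derives the same sequence of message keys, so the sender encrypts once and the single ciphertext can be fanned out; because the ratchet only moves forward, a leaked current chain key does not reveal earlier message keys.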

Key management is the hardest part — if a user loses a private key (e.g., phone reset), all messages encrypted with that key become undecryptable. WhatsApp stores encrypted backups on the server (with the user's 64-digit backup key) and allows restore only from the user's device.

io/thecodeforge/e2e/PreKeyBundle.json
{
  "identity_key_public": "base64encode...",
  "signed_pre_key": {
    "key": "base64encode...",
    "signature": "base64encode..."
  },
  "one_time_pre_keys": [
    {"id": 1, "key": "base64..."},
    {"id": 2, "key": "base64..."}
  ]
}
The Pre-Key Exhaustion Problem
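Each user uploads a bundle of ~100 one-time pre-keys. Each new conversation consumes one, so the user must periodically upload a fresh batch. If all one-time pre-keys are exhausted while the user is offline, X3DH still lets new senders start a session from the signed pre-key alone, but those initial messages lose the extra replay and forward-secrecy protection a one-time pre-key provides; an implementation that insists on a one-time pre-key blocks new conversations outright. WhatsApp monitors pre-key count and pushes a notification to the user's device to generate more keys when the count drops below 10.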
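A minimal sketch of the monitoring logic, assuming a hypothetical `PreKeyPool` that tracks key IDs only. Under the X3DH specification a sender can still start a session from the signed pre-key when no one-time pre-key is left, so exhaustion is modeled as an empty result rather than an error:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Optional;

class PreKeyPool {
    static final int LOW_WATERMARK = 10; // threshold named in the text

    private final Deque<Integer> oneTimeKeyIds = new ArrayDeque<>();
    private boolean replenishRequested;

    /** Device uploads a fresh batch of one-time pre-key IDs. */
    void upload(int firstId, int count) {
        for (int i = 0; i < count; i++) {
            oneTimeKeyIds.addLast(firstId + i);
        }
        replenishRequested = false;
    }

    /**
     * Hands out one one-time pre-key per new conversation and never reuses it.
     * Empty means exhausted: the sender falls back to the signed pre-key alone.
     */
    Optional<Integer> claim() {
        Integer id = oneTimeKeyIds.pollFirst();
        if (oneTimeKeyIds.size() < LOW_WATERMARK && !replenishRequested) {
            replenishRequested = true; // in the text: push a notification asking the device for more keys
        }
        return Optional.ofNullable(id);
    }

    boolean replenishRequested() { return replenishRequested; }

    int remaining() { return oneTimeKeyIds.size(); }
}
```

The replenish flag is raised once per batch, which avoids sending a notification on every claim below the threshold.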
Production Insight
E2E encryption adds latency because of key exchange. For the first message in a conversation, the client must fetch the recipient's pre-key bundle from the server, which adds one additional round trip (50-100ms).
To reduce perceived latency, WhatsApp pre-fetches pre-key bundles for the user's most frequent contacts in the background.
For group messages, the sender key is cached on the sending device for 24 hours, avoiding re-keying on every message.
Without pre-fetching, opening a conversation with a new contact would feel sluggish, especially in regions with high latency.
Key Takeaway
E2E encryption means the server is blind to message content: stored messages are opaque ciphertext, which simplifies storage compliance but adds key management complexity.
Pre-key bundles with monitoring prevent exhaustion.
Pre-fetching for frequent contacts reduces first-message latency.
Choosing an Encryption Protocol
If: need asynchronous messaging with forward secrecy
Use: Signal Protocol (Double Ratchet). Best for real-time messaging with an offline queue.
If: need synchronous real-time voice/video
Use: DTLS-SRTP key exchange. Signal Protocol is too heavy for media streams; use a lightweight key exchange per session.
If: need server-side processing of message content (e.g., spam detection)
Use: transport encryption (TLS) only, since E2E encryption prevents server processing, or homomorphic encryption (still experimental).

Scaling Infrastructure: Chat Servers, Presence, and Multi-Region

WhatsApp's chat servers keep no durable state: each one holds live WebSocket connections and routes messages, while session data lives in Redis and messages in Cassandra, so any server can be replaced. They sit behind a global load balancer (anycast DNS + L7 proxy). Each user is assigned to a 'home cluster' based on a hash of their phone number, to maintain sticky sessions and reduce cross-region traffic.
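Home-cluster assignment by hash can be sketched as follows, matching the `phone_number (mod 64)` rule in the routing config. SHA-256 and E.164-formatted input are assumptions; the text only says "phone number hash":

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class HomeClusterAssigner {
    static final int CLUSTER_COUNT = 64; // matches "phone_number (mod 64)" in the routing config

    /** Stable hash of the phone number reduced mod the cluster count: same input, same cluster. */
    static int homeCluster(String e164PhoneNumber) {
        try {
            byte[] d = MessageDigest.getInstance("SHA-256")
                .digest(e164PhoneNumber.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (d[i] & 0xFF); // first 8 digest bytes as a long
            }
            return (int) Math.floorMod(h, (long) CLUSTER_COUNT); // floorMod keeps the result non-negative
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Plain modulo reassigns most users if the cluster count ever changes; consistent hashing is the usual alternative when clusters are added over time.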

Presence (online/offline/last seen) is stored in a separate low-latency key-value store (custom in-memory DHT with Redis persistence). Updates are propagated via a gossip protocol to all clusters within 5 seconds. To reduce update floods (users frequently connect/disconnect on mobile), WhatsApp batches presence updates: if a user toggles online/offline within 60 seconds, only the final state is published.
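The 60-second batching rule can be sketched as a per-user coalescing buffer. `PresenceBatcher` is hypothetical, time is passed in explicitly to keep the sketch deterministic, and the gossip step that would carry each `Publish` to other clusters is not shown:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PresenceBatcher {
    record Publish(long userId, boolean online) {}

    private record Pending(boolean online, long firstChangeMillis) {}

    private final long windowMillis;
    private final Map<Long, Boolean> lastPublished = new HashMap<>();
    private final Map<Long, Pending> pending = new HashMap<>();

    PresenceBatcher(long windowMillis) { this.windowMillis = windowMillis; }

    /** Records a raw connect/disconnect; only the latest state inside the window survives. */
    void onChange(long userId, boolean online, long nowMillis) {
        pending.merge(userId, new Pending(online, nowMillis),
            (old, latest) -> new Pending(latest.online(), old.firstChangeMillis()));
    }

    /** Called periodically: publishes the final state of every closed window, if it actually changed. */
    List<Publish> flush(long nowMillis) {
        List<Publish> out = new ArrayList<>();
        pending.entrySet().removeIf(e -> {
            Pending p = e.getValue();
            if (nowMillis - p.firstChangeMillis() < windowMillis) {
                return false; // window still open: keep buffering
            }
            if (!Boolean.valueOf(p.online()).equals(lastPublished.get(e.getKey()))) {
                lastPublished.put(e.getKey(), p.online());
                out.add(new Publish(e.getKey(), p.online()));
            }
            return true;
        });
        return out;
    }
}
```

A user who flaps offline and back online inside one window produces no publication at all if the final state matches what was last published.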

Multi-region replication: Messages written to Cassandra in the home cluster are replicated asynchronously to a backup region using cross-datacenter replication. If the home cluster fails, traffic is routed to the backup region. The routing change happens within 10 seconds via a global DNS update (weighted records).

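Each region has multiple availability zones. Cassandra's rack-aware replication (with AZs mapped to racks) places the replicas of each message in different AZs within the region. The replication factor is 3, with write consistency ONE (for write performance) and read consistency LOCAL_QUORUM. This combination reduces but does not eliminate stale reads after a crash: a read is only guaranteed to see the latest acknowledged write when R + W > RF, and here 2 + 1 = 3 is not greater than 3.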
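The quorum-overlap rule is small enough to state as code. `QuorumMath` is a hypothetical helper; it captures the guarantee only, not the extra help from hinted handoff or read repair:

```java
class QuorumMath {
    /** Quorum size for a replica count: floor(rf / 2) + 1. */
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    /** A read is guaranteed to overlap the latest acknowledged write only if R + W > RF. */
    static boolean readsSeeLatestWrite(int replicationFactor, int writeReplicas, int readReplicas) {
        return readReplicas + writeReplicas > replicationFactor;
    }
}
```

With RF = 3, ONE writes plus LOCAL_QUORUM reads give 1 + 2 = 3, which does not exceed 3; LOCAL_QUORUM on both sides gives 4 > 3.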

io/thecodeforge/infra/global-routing.yaml
global_routing:
  anycast: true
  load_balancer:
    type: envoy
    zones:
      - us-east-1
      - eu-west-1
      - ap-southeast-1
  home_cluster_assignment:
    hash: phone_number (mod 64)
  fallback:
    if home_cluster_down:
      route_to: nearest_healthy_cluster
      cool_down: 10s
Why Cassandra for WhatsApp?
WhatsApp chose Cassandra over DynamoDB (which also powers Amazon's internal services) because it was open-source and already battle-tested at Facebook. The key advantage is linear write scalability: add nodes and write throughput increases, with no manual sharding. The read path is the trade-off: single-partition reads are fast, but range scans require careful design of clustering keys.
Production Insight
Write consistency ONE means a message is considered 'stored' as soon as one Cassandra replica acknowledges. If that replica goes down before replicating, the message is lost. WhatsApp accepts this risk for the performance benefit (write latency ~5ms). To mitigate, they use hinted handoff and a separate 'write-ahead log' (WAL) in Redis per chat server. If a Cassandra write fails, the WAL is replayed. The WAL is ephemeral — if the chat server crashes, only messages in flight are lost (average <1 message per crash).
In practice, the WAL reduces message loss from ~0.001% to ~0.00001%.
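The WAL's role can be sketched as follows. An insertion-ordered in-memory map stands in for the per-server Redis WAL, and `ChatServerWal` and the predicate-based writer are hypothetical:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

class ChatServerWal {
    // Insertion-ordered so replay preserves the original send order.
    private final Map<String, String> unacked = new LinkedHashMap<>();

    /** Step 1: log the message before attempting the Cassandra write. */
    void append(String messageId, String payload) { unacked.put(messageId, payload); }

    /** Step 2: drop the entry once Cassandra has acknowledged the write. */
    void ack(String messageId) { unacked.remove(messageId); }

    /**
     * Replays every unacknowledged entry through the writer, in order; entries whose write
     * succeeds this time are removed. Returns the IDs that are still pending.
     */
    List<String> replay(Predicate<String> write) {
        unacked.entrySet().removeIf(e -> write.test(e.getValue()));
        return new ArrayList<>(unacked.keySet());
    }
}
```

Replaying an entry whose original write actually succeeded (but whose ack was lost) is harmless here, because a Cassandra insert with the same primary key overwrites the same row.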
Key Takeaway
Sticky sessions (home cluster) reduce cross-region traffic and simplify presence.
Presence batching prevents update avalanches on mobile.
Accept a consistency trade-off (ONE + WAL) to get ~5ms writes at 2B-user scale.
Always have a WAL for in-flight messages — it's cheap insurance.
Consistency Level Selection
If: write throughput is critical and rare data loss is tolerable
Use: write consistency ONE + WAL. Accept that <0.01% of writes may be lost on cascading failures.
If: reads must be consistent even after disaster recovery
Use: read consistency LOCAL_QUORUM (2 of 3 replicas), paired with LOCAL_QUORUM writes so that R + W > RF. Accept slightly higher latency (~10ms vs 5ms).
If: regulatory requirement for no data loss
Use: write consistency LOCAL_QUORUM with the commit log fsynced before each write is acknowledged (commitlog_sync: batch). This adds ~20ms to write latency but guarantees no loss within the datacenter.
Production incident · Post-mortem · Severity: high

The Billion Lost Messages: WhatsApp's WebSocket Reconnection Bug

Symptom
Users in affected clusters received duplicate 'delivered' receipts before the actual message arrived. Some messages appeared out of order for up to 15 minutes until the backlog cleared.
Assumption
Engineers assumed the WebSocket reconnection handshake was atomic. If a client disconnected and reconnected to a different chat server, the new server would fetch the latest message sequence number from the persistent store before accepting new messages.
Root cause
The reconnection logic had a window where the old server still held an open, unacknowledged connection. The client sent a new message to the new server while the old server also accepted and processed a delayed message from the same client. The deduplication key used only client_msg_id and server_id — not a global unique ID. Since the two messages ended up on different servers, dedup failed, and the message was delivered twice to recipients.
Fix
Switch to a globally unique message ID (sender_id + 64-bit timestamp + per-node counter) combined with a persistent deduplication set (TTL 7 days) stored in Cassandra. The dedup check is performed before any message is enqueued for delivery. Added a 'last_acknowledged_seq' field to the client session that must match before new messages are accepted.
Key lesson
  • Assume reconnection is not idempotent — always design for exactly-once delivery across server boundaries.
  • Use globally unique message IDs from the client; never let the server generate the dedup key.
  • Persist deduplication state with a TTL that covers the worst-case message backlog (usually hours, not days).
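The difference between the buggy and fixed dedup keys can be shown in a few lines. `GlobalDedupStore` is a hypothetical in-memory stand-in for the shared dedup table (in Cassandra this would be a lightweight transaction such as `INSERT ... IF NOT EXISTS USING TTL ...`), and the `last_acknowledged_seq` check is left out:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Shared by every chat server; in-memory stand-in for the persistent dedup table with a TTL. */
class GlobalDedupStore {
    private final Map<String, Long> expiryByMessageId = new ConcurrentHashMap<>();
    private final long ttlMillis;

    GlobalDedupStore(long ttlMillis) { this.ttlMillis = ttlMillis; }

    /** Atomic insert-if-absent: true only for the first claim of an unexpired ID, whichever server asks. */
    boolean claim(String messageId, long nowMillis) {
        boolean[] claimed = {false};
        expiryByMessageId.compute(messageId, (id, expiry) -> {
            if (expiry != null && expiry > nowMillis) {
                return expiry; // live duplicate: keep the existing entry
            }
            claimed[0] = true;
            return nowMillis + ttlMillis;
        });
        return claimed[0];
    }
}

class ChatServer {
    private final String serverId;
    private final GlobalDedupStore globalDedup;
    private final Set<String> localSeen = new HashSet<>();

    ChatServer(String serverId, GlobalDedupStore globalDedup) {
        this.serverId = serverId;
        this.globalDedup = globalDedup;
    }

    /** The incident's bug: key = client_msg_id + server_id, held per server, so servers never see each other's keys. */
    boolean acceptWithLocalKey(String clientMsgId) {
        return localSeen.add(clientMsgId + "@" + serverId);
    }

    /** The fix: a globally unique, client-generated ID checked against the shared store before enqueueing. */
    boolean acceptWithGlobalId(String globalMessageId, long nowMillis) {
        return globalDedup.claim(globalMessageId, nowMillis);
    }
}
```

The buggy key differs per server, so both servers accept the same message; the global ID is claimed once no matter which server sees it first. This gives effectively-once processing on the server side; clients still need to tolerate a redelivered push.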
Production debug guide: when messages take >1s to deliver, follow this symptom-driven guide to isolate the bottleneck.
Symptom · 01
Messages from one user are consistently delayed, others are fine
Fix
Check presence status of that user's client. If client is connected to a different regional cluster, verify cross-region routing (P99 latency usually 200ms+). Check if the user's home cluster is under load (CPU >80% on chat server).
Symptom · 02
All users in a region experience delays
Fix
Check WebSocket gateway health. The gateway may be dropping connections due to connection pool exhaustion. Verify gateway-to-chat-server backpressure — if the gateway queue grows beyond 10K messages, it throttles new connections.
Symptom · 03
Delivery receipts are fast but messages arrive late
Fix
The receipt path is different from the message path. Messages go through persistence (Cassandra write). Check Cassandra write latency — look for 'timeout' exceptions with p999 >500ms. Inspect compaction backlog.
Symptom · 04
Group messages take >3s to deliver
Fix
Fan-out to large groups (>256 members) is read-side. Check whether the group's member list is served from cache (the Redis membership cache hit rate should be >95%). If it is not, group membership queries hit Cassandra directly, adding 30-50ms per member.
Real-Time Message Debugging Cheat Sheet: common WhatsApp production issues and the first commands to run. All commands assume access to internal monitoring and logging.
Client not receiving any messages
Immediate action
Check the WebSocket connection state. Run `curl http://gateway-cluster:8080/health | jq '.connections.active'` on the gateway. If active connections drop below the threshold, restart the gateway process.
Commands
`ctl -env prod -action check_connection -user_id <id>`
`kubectl logs -l app=chat-server --tail=100 | grep user_<id>`
Fix now
If no WebSocket, force client to reconnect: push notification with type=RECONNECT. If WebSocket exists but no messages, invalidate server-side session cache: redis-cli DEL session:<user_id>
Delivery receipts show 'delivered' but user never saw the message
Immediate action
Check if the message was actually written to Cassandra. Look at the message_persistence log for that message_id. If write succeeded but delivery failed, check the outbound queue for that server.
Commands
`kubectl exec -n chat <cassandra-pod> -c cassandra -- cqlsh -e "SELECT * FROM thecodeforge.messages_by_conversation WHERE conversation_id = '<conversation_id>' AND message_id = <id> ALLOW FILTERING;"`
`kubectl logs -l app=chat-outbound --tail=50 | grep <message_id>`
Fix now
If the message exists but the outbound push failed, manually publish to the delivery queue: kubectl exec -n chat <rabbitmq-pod> -c rabbitmq -- rabbitmqadmin publish routing_key=delivery.<user_id> payload=<payload>
WhatsApp Architecture Decisions vs Alternatives
Component | WhatsApp Choice | Alternative | WhatsApp's Reason
--- | --- | --- | ---
Connection protocol | WebSockets | HTTP/2 Server-Sent Events | Full duplex at lowest latency (200ms median)
Message storage | Cassandra | PostgreSQL, DynamoDB | Write scalability, no single point of failure (AP system)
Group fan-out | Hybrid (256 threshold) | Pure write fan-out (WhatsApp earlier version) | Balance between write amplification and read latency
Media storage | CDN + content-addressable | In-house blob store | Bandwidth and deduplication efficiency (60% storage savings)
E2E encryption | Signal Protocol | OpenPGP, custom | Forward secrecy + deniability + pre-key one-time use
Presence propagation | Gossip protocol | Centralized state store | Scalability: gossip avoids single point of failure, ~5s propagation

Key takeaways

1. WebSocket connections require graceful draining: never hard-close without a DRAIN message.
2. Globally unique message IDs solve dedup and ordering across servers: always generate them on the client.
3. Group fan-out: write for small groups (<=256), read for large groups.
4. Media should be content-addressable with a CDN; never store blobs in the message database.
5. E2E encryption means the server is blind: pre-key monitoring prevents exhaustion, and pre-fetching hides first-message latency.
6. Accept trade-offs: write consistency ONE + WAL gives ~5ms writes with a ~0.00001% loss rate.
7. Presence batching and gossip prevent the system from collapsing under 2B heartbeat floods.

Common mistakes to avoid


Assuming WebSocket reconnection is transparent to message ordering

Symptom
Users receive duplicate messages or out-of-order deliveries after a network reconnect. The dedup mechanism fails because the message ID is not globally unique.
Fix
Use globally unique message IDs with a persistent dedup set (TTL). Implement last_acknowledged_seq on the server to reject messages from a previous session.

Building group chat with write fan-out for all group sizes

Symptom
Messages to groups with 100K members take tens of seconds to deliver because the server must write to each recipient's queue individually. Write amplification causes Cassandra compaction lag and high CPU.
Fix
Implement a threshold (e.g., 256) below which use write fan-out, above which use read fan-out with a group feed. Cache group member list to reduce fan-out cost.

Storing media in the message database

Symptom
Message storage grows rapidly; read latencies spike as blobs saturate disk I/O; backups take hours.
Fix
Use a separate CDN for media, content-addressable (hash-based) storage. Store only the media hash in the message table. Use pre-signed URLs for uploads.

Ignoring pre-key exhaustion in E2E encryption

Symptom
If the implementation requires a one-time pre-key, new senders cannot start a session with a contact whose pre-keys are all consumed and who is offline and not generating new ones.
Fix
Monitor the pre-key count per user and push a notification when it falls below a threshold (e.g., 10). As a fallback, start the session from the signed pre-key alone (X3DH without a one-time pre-key) and flag it, rather than sending without E2E.

Setting Cassandra write consistency to QUORUM for all writes

Symptom
Write latency spikes (20ms+) under load because each write must wait for 2 of 3 replicas. This chokes the message pipeline during peak hours (e.g., New Year's Eve).
Fix
Use write consistency ONE with a Write-Ahead Log (WAL) in Redis. Accept rare data loss in exchange for 5ms writes. The WAL catches the tiny fraction of lost writes.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
Design WhatsApp: How would you handle message delivery when the recipien...
Q02 · SENIOR
WhatsApp uses end-to-end encryption. How does it handle key exchange for...
Q03 · SENIOR
How would you design presence (online/offline) for 2 billion users witho...
Q04 · SENIOR
Explain how WhatsApp achieves exactly-once delivery for messages.

Design WhatsApp: How would you handle message delivery when the recipient is offline?

ANSWER
Store the message in a per-user offline queue in Cassandra with a TTL (30 days for WhatsApp). When the recipient comes online, their chat server fetches all messages since their last delivery cursor and pushes them over the WebSocket. The delivery cursor is updated after successful push. For group messages, the queue is shared per group, not per user, to reduce storage. Offline messages are ordered using a logical clock to maintain causality.
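The cursor mechanics in this answer can be sketched as below. `OfflineQueue` is hypothetical, keeps messages in memory instead of Cassandra, and ignores TTL expiry:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class OfflineQueue {
    record Queued(long seq, String payload) {}

    private final List<Queued> messages = new ArrayList<>(); // appended in seq order
    private long deliveryCursor; // highest seq successfully pushed
    private long nextSeq = 1;

    void enqueue(String payload) { messages.add(new Queued(nextSeq++, payload)); }

    /**
     * On reconnect: push everything after the cursor in order, advancing the cursor only after
     * each successful push. A failed push stops the drain; the rest is retried next time.
     */
    int drain(Predicate<Queued> pushOverWebSocket) {
        int pushed = 0;
        for (Queued q : messages) {
            if (q.seq() <= deliveryCursor) continue; // already delivered
            if (!pushOverWebSocket.test(q)) break;
            deliveryCursor = q.seq();
            pushed++;
        }
        return pushed;
    }

    long cursor() { return deliveryCursor; }
}
```

If a push succeeds but the connection drops before the cursor is persisted, the message is pushed again on the next drain, which is why client-side dedup by message ID still matters.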
FAQ · 4 QUESTIONS

Frequently Asked Questions

01
Why does WhatsApp use Cassandra instead of a relational database?
02
How does WhatsApp handle cross-region message delivery?
03
Can WhatsApp read my messages?
04
What happens if a user's phone is lost?