Mid-level 9 min · March 06, 2026

Design WhatsApp — Exactly-Once Across Server Boundaries

Q: Why does WhatsApp use Cassandra instead of a relational database?

Cassandra is designed for high write throughput with linear scalability and no single point of failure. WhatsApp's workload is write-heavy (100B messages/day) and requires fast writes with eventual consistency. A relational database would struggle with the write volume and require complex sharding. Cassandra's write path is optimized for low latency (~5ms) and its lack of joins fits the simple key-value access pattern for messages.

Q: How does WhatsApp handle cross-region message delivery?

Each user is assigned a home cluster. When a message is sent from a user in one region to a user in another, the sender's chat server forwards the message to the recipient's home cluster (via an internal RPC). The recipient's home cluster then delivers the message over the recipient's WebSocket. Cross-region latency is typically 100-300ms. WhatsApp uses multi-region Cassandra replication asynchronously to ensure no data loss if a region fails.

Q: Can WhatsApp read my messages?

No. WhatsApp uses end-to-end encryption (Signal Protocol) by default. Only the sender and recipient(s) can decrypt the message content. The server only sees encrypted blobs and metadata (who sent to whom, when). WhatsApp cannot access the message text, even if compelled by law enforcement — they physically don't have the keys.

Q: What happens if a user's phone is lost?

Messages are stored locally on the device. Without E2E backup (which requires the user's 64-digit backup key), all unsaved message history is lost. WhatsApp provides encrypted iCloud/Google Drive backups, but they are encrypted with the user's backup key, which WhatsApp does not store. If the backup key is lost, the backup is unrecoverable. This is a deliberate trade-off for privacy.

Duplicate receipts reached users when a reconnection was routed to two servers, each holding only local dedup keys.

Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Written from production experience, not tutorials.

Last updated: May 24, 2026

⚡Quick Answer

WhatsApp is a real-time messaging system handling 100B+ messages/day across 2B users
WebSocket persistent connections deliver messages under 200ms median latency
Delivery receipts (sent/delivered/read) require causal ordering across server nodes
Group chat fan-out uses a hybrid approach: write fan-out for <=256 users, read fan-out otherwise
E2E encryption adds 10-30ms latency per message but is non-negotiable for privacy
Biggest mistake: assuming WebSocket reconnection is idempotent — duplicated messages break ordering guarantees

✦ Definition~90s read

What is Design WhatsApp?

This article walks through the architectural decisions behind building a WhatsApp-scale messaging system, focusing on the hard problem of delivering exactly-once semantics across server boundaries. WhatsApp processes over 100 billion messages daily, with 2+ billion users, meaning your design must handle WebSocket connection migration, message deduplication at the application layer, and atomicity guarantees when a message crosses from one chat server to another.

The core tension is that TCP guarantees delivery per connection, but once you fan out to group chats or route through different datacenters, you need idempotency keys, server-side deduplication windows, and two-phase commit patterns in your message store (typically Cassandra or Scylla for WhatsApp's scale).

You'll see why WhatsApp chose write fan-out for groups of up to 256 members (pre-compute delivery lists and write to each recipient's inbox) but switches to read fan-out for larger groups to avoid write amplification. The article dives into the media pipeline: how thumbnails are generated and cached at edge POPs while full-resolution files live in blob storage (Facebook's f4 or similar), with CDN-based delivery to reduce origin load.

E2E encryption is layered on top of this infrastructure using the Signal Protocol, meaning the server never sees plaintext content; your delivery semantics must work on encrypted blobs, with key exchange handled out-of-band via the server as a relay for pre-key bundles.

If you're building a chat system that needs to survive server restarts, network partitions, and client reconnections without dropping or duplicating messages, this design pattern is your blueprint. The alternatives—like using a message queue (Kafka) for every hop or relying on client-side dedup alone—fall apart at WhatsApp's scale because they either introduce unbounded latency or require clients to hold state across network failures.

This article assumes you already know basic distributed systems concepts (CAP theorem, consistent hashing) and focuses on the WhatsApp-specific tradeoffs: how to balance delivery latency against storage cost, and why exactly-once is a myth at the transport layer but achievable at the application layer with the right sequence numbers and acknowledgment protocols.

Plain-English First

Imagine a giant post office where billions of people send letters every second. The post office has to know who's online, hold letters for people who are asleep, deliver them the moment someone wakes up, and confirm 'delivered' and 'read', all without ever opening the letters. That's WhatsApp. The engineering challenge isn't sending one message; it's doing it 100 billion times a day for two billion users, reliably, privately, and in under a second.

WhatsApp handles over 100 billion messages per day across 2 billion active users. At that scale, the difference between a good design and a great one isn't features — it's whether your system stays alive on a Tuesday afternoon when half of India gets a holiday and everyone texts at once. This isn't a toy problem. It's one of the most sophisticated real-time distributed systems ever built, and interviewers use it precisely because it exposes every weakness in your distributed systems thinking.

The core problem WhatsApp solves is deceptively simple: two people want to exchange text (and now media) in real time, with guarantees around delivery and ordering, without either party having to stay permanently connected to the same server. Underneath that simplicity lurks a minefield: presence detection, message fan-out in group chats, offline message queuing, idempotent delivery, end-to-end encryption key exchange, and media deduplication across petabytes of storage.

By the end of this article you'll be able to walk into a system design interview and sketch the full WhatsApp architecture — connection layer, message routing, storage schema, delivery receipts, group messaging fan-out, media pipeline, and E2E encryption flow — and, more importantly, defend every single choice with concrete trade-offs. Let's build it.

What Design WhatsApp Actually Means

Design WhatsApp is the architectural pattern for building a chat system that guarantees exactly-once message delivery across server boundaries, even under network partitions and server failures. The core mechanic is a two-phase acknowledgment protocol combined with idempotent message IDs: the sender retries until it receives a confirmed ACK, and the receiver deduplicates using a persistent message log. Retries rule out lost messages (the failure mode of at-most-once delivery) and deduplication hides the duplicates that retries create (the failure mode of at-least-once delivery), which together give effectively exactly-once processing.

In practice, each message carries a globally unique ID (e.g., UUID or Snowflake). The receiving server writes the ID to a deduplication table before processing; if the same ID arrives again, it is silently dropped. The sender maintains a retry queue with exponential backoff until the ACK arrives. This shifts the burden of reliability from the network to the application layer, trading O(1) storage per message for guaranteed delivery.
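The receiver-side half of this can be sketched in a few lines. This is a minimal illustration, not WhatsApp's implementation: `DedupingReceiver` is a hypothetical name, and an in-memory set stands in for the persistent deduplication table described above.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical receiver that records each message ID before applying side effects. */
final class DedupingReceiver {
    // Stand-in for the persistent deduplication table; a real system needs durable storage.
    private final Set<String> seenIds = new HashSet<>();
    private final List<String> delivered = new ArrayList<>();

    /** Returns true if the message was processed, false if it was a replay and dropped. */
    synchronized boolean receive(String messageId, String payload) {
        // Set.add returns false when the ID is already present, so replays stop here.
        if (!seenIds.add(messageId)) {
            return false;
        }
        delivered.add(payload); // the side effect runs at most once per ID
        return true;
    }

    /** Snapshot of the payloads that were actually processed. */
    synchronized List<String> delivered() {
        return List.copyOf(delivered);
    }
}
```

Because a replayed ID returns `false` without touching `delivered`, the sender can retry as often as it needs to without the recipient ever seeing a duplicate.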

Use this pattern when message loss or duplication is unacceptable—financial transactions, critical alerts, or any system where users expect every message to arrive exactly once. It is the foundation of WhatsApp’s reliability at billions of messages per day, and it applies directly to any distributed system that must preserve ordering and delivery guarantees across unreliable networks.

Idempotency Is the Linchpin

Exactly-once delivery is impossible without idempotent receivers—the network can always duplicate, so the receiver must tolerate replays.

Production Insight

A payment service used at-least-once delivery without deduplication, causing a single charge to hit the user's card 14 times during a 30-second network blip.

The symptom was a spike in duplicate transaction IDs in the database, detected only after customers complained of overcharges.

Rule: always persist the message ID and check it before any side effect—never trust the network to deliver exactly once.

Key Takeaway

Exactly-once delivery is a property of the receiver, not the sender.

Idempotency keys are the only practical mechanism for deduplication across server boundaries.

Retry with exponential backoff is mandatory; without it, a network blip becomes a thundering herd.
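A minimal sketch of the retry schedule, assuming a capped exponential backoff with full jitter. The class name, the base delay, and the cap are illustrative choices, not values from WhatsApp.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper that computes capped exponential backoff delays with jitter. */
final class RetryBackoff {
    private final long baseMillis;
    private final long capMillis;

    RetryBackoff(long baseMillis, long capMillis) {
        if (baseMillis <= 0 || capMillis < baseMillis || capMillis == Long.MAX_VALUE) {
            throw new IllegalArgumentException("need 0 < base <= cap < Long.MAX_VALUE");
        }
        this.baseMillis = baseMillis;
        this.capMillis = capMillis;
    }

    /** Upper bound for a 0-based attempt number: min(cap, base * 2^attempt). */
    long ceilingMillis(int attempt) {
        if (attempt < 0) {
            throw new IllegalArgumentException("attempt must be >= 0");
        }
        long delay = baseMillis;
        for (int i = 0; i < attempt && delay < capMillis; i++) {
            // Jump straight to the cap once doubling would pass it; this also avoids overflow.
            delay = (delay > capMillis / 2) ? capMillis : delay * 2;
        }
        return Math.min(delay, capMillis);
    }

    /** Full jitter: a random delay in [0, ceiling], so retrying clients spread out. */
    long nextDelayMillis(int attempt) {
        return ThreadLocalRandom.current().nextLong(ceilingMillis(attempt) + 1);
    }
}
```

The jitter is what prevents the thundering herd: without it, every client that failed at the same moment would also retry at the same moment.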

Figure: WhatsApp Exactly-Once Delivery Architecture

Core Messaging Architecture: WebSocket Connection Management

WhatsApp's real-time communication relies on persistent WebSocket connections. Each user maintains a long-lived TCP connection to one of thousands of chat server nodes. The connection lifecycle is: client connects → gateway routes to least-loaded chat server → chat server assigns a session ID → server registers the user's presence in an in-memory distributed hash table (DHT) backed by Redis.

The chat server is responsible for receiving messages from the client, validating them, persisting to Cassandra, and forwarding to the recipient's chat server. The recipient's chat server then pushes the message over its WebSocket if the recipient is online. If offline, the message is queued in Cassandra with a TTL of 30 days.

Why not use HTTP long-polling? WebSockets reduce per-message overhead from ~1KB (HTTP headers) to ~150 bytes (WebSocket frame). At 100B messages/day, that saves ~85TB of bandwidth daily. Also, WebSockets enable push-based delivery without polling, keeping median latency under 200ms.
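The bandwidth figure follows directly from the per-message overhead numbers quoted above. A small worked calculation, using decimal terabytes and a hypothetical helper name:

```java
/** Worked arithmetic for the framing-overhead savings quoted in the text. */
final class FramingSavings {
    /** Bytes saved per day when per-message overhead drops from one size to another. */
    static long bytesSavedPerDay(long messagesPerDay, long oldOverheadBytes, long newOverheadBytes) {
        // multiplyExact throws instead of silently overflowing on unrealistic inputs.
        return Math.multiplyExact(messagesPerDay, oldOverheadBytes - newOverheadBytes);
    }

    public static void main(String[] args) {
        long saved = bytesSavedPerDay(100_000_000_000L, 1_000, 150);
        // 850 bytes * 100 billion messages = 85 * 10^12 bytes
        System.out.println(saved / 1_000_000_000_000L + " TB/day"); // prints "85 TB/day"
    }
}
```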

io/thecodeforge/chat/WebSocketSessionManager.javaJAVA

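package io.thecodeforge.chat;

import io.thecodeforge.chat.model.Session;
import io.thecodeforge.chat.model.UserId;
import io.thecodeforge.storage.RedisClient;

public class WebSocketSessionManager {
    private static final long SESSION_TTL_SECONDS = 3600;

    // Both stores are injected by the caller; the presence value type is an assumption.
    private final RedisClient<String, Session> sessionStore;
    private final RedisClient<String, String> presenceStore;

    public WebSocketSessionManager(RedisClient<String, Session> sessionStore,
                                   RedisClient<String, String> presenceStore) {
        this.sessionStore = sessionStore;
        this.presenceStore = presenceStore;
    }

    // Registers a new session for the user on the given chat server, expiring after the TTL.
    public Session createSession(UserId userId, String serverId) {
        Session session = new Session(userId, serverId, System.currentTimeMillis());
        String key = "session:" + userId.value();
        sessionStore.setex(key, SESSION_TTL_SECONDS, session);
        return session;
    }

    // Extends a live session; get() is assumed to return Optional<Session>.
    public Session refreshSession(UserId userId) {
        String key = "session:" + userId.value();
        Session session = sessionStore.get(key)
            .orElseThrow(() -> new IllegalStateException("Session expired"));
        session.refresh();
        // Write the session back so the refreshed state and a fresh TTL reach Redis.
        sessionStore.setex(key, SESSION_TTL_SECONDS, session);
        return session;
    }

    // Removes both the session and the presence entry when the user disconnects.
    public void markOffline(UserId userId) {
        sessionStore.del("session:" + userId.value());
        presenceStore.del("presence:" + userId.value());
    }
}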

Mental Model: The Post Office Analogy for WebSocket vs HTTP

Phone line (WebSocket): always open, instant two-way chatter. The post office can shout the moment your parcel arrives.
Letter (HTTP): you must send a new envelope each time. The post office can't tell you something arrived unless you ask.
WhatsApp chose phone lines because shouting 'new message for you' 100 billion times a day via letters would drown the postal system.

Production Insight

Connection rebalancing is the silent killer of WebSocket systems.

When a chat server is decommissioned (deploy or scaling down), existing WebSockets must drain gracefully.

WhatsApp uses a two-phase drain: server broadcasts DRAIN message, waits 30s for clients to reconnect, then hard-closes remaining connections.

Without this, clients lose messages in flight because the old server no longer accepts them and the new server doesn't know about them.

Rule: always implement a drain period equal to your connection timeout + max message processing latency.

Key Takeaway

WebSockets are chosen for WhatsApp because bidirectional low-latency messaging requires persistent state.

Connection draining, session state in Redis, and idempotent reconnection are the three non-negotiable patterns.

If you skip any one, you will lose messages in production.

Choosing WebSocket vs Alternatives

If: user count < 1K, need low engineering complexity → Use: HTTP long-polling with a simple queue per user (Redis). Accept 1-2s latency.

If: user count up to 10M, but message delivery can tolerate 5s latency → Use: Server-Sent Events (SSE) with HTTP/2. Simpler than WebSockets but unidirectional server->client.

If: user count > 10M, need <200ms latency, full duplex → Use: WebSockets with a custom stateful gateway layer. Accept the operational complexity of connection management.

Message Storage and Delivery Semantics

WhatsApp stores every message permanently on users' devices and only temporarily on servers (up to 30 days while awaiting delivery, then deleted). The server-side storage schema is designed for high write throughput and point lookups by conversation. The primary data store is Apache Cassandra, chosen for its ability to handle massive write loads with no single point of failure.

Each message has a globally unique ID: (sender_id: long, timestamp_nanos: long, node_id: int). This makes it trivial to deduplicate. Messages are stored in a table keyed by conversation_id (composite of sender and recipient, sorted lexicographically). The table is partitioned by conversation_id, clustered by message_time, so retrieving the last 50 messages is a single partition scan.
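A deterministic conversation key for one-on-one chats can be derived as below. This is a sketch: the helper name and the `:` separator are assumptions, and IDs are treated as strings so the lexicographic sort described above applies directly.

```java
/** Hypothetical helper that builds a stable conversation_id for a one-on-one chat. */
final class ConversationIds {
    /** Same result regardless of which participant is passed first. */
    static String forDirectChat(String userA, String userB) {
        // Order the two IDs lexicographically so (a, b) and (b, a) map to one partition.
        return userA.compareTo(userB) <= 0
                ? userA + ":" + userB
                : userB + ":" + userA;
    }
}
```

For example, `ConversationIds.forDirectChat("bob", "alice")` and `ConversationIds.forDirectChat("alice", "bob")` both yield `alice:bob`, so both directions of the conversation land in the same partition.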

Delivery semantics: WhatsApp offers 'sent', 'delivered', and 'read' receipts. The 'delivered' receipt is generated by the recipient's chat server when it pushes the message to the client WebSocket. The 'read' receipt is generated by the client when the user opens the chat. Read receipts require careful ordering: if a message is read before an earlier one is delivered (possible in group chats), the system must preserve causal order.

WhatsApp uses a logical clock per conversation to order messages. Each server increments a local counter and appends it to the conversation's timestamp. Conflicts are resolved by lexicographic ordering of (counter, server_id).
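The ordering rule can be sketched as a comparable stamp, assuming Java 16+ for records. `LogicalStamp` and `ConversationClock` are hypothetical names, and the `observe` step is a Lamport-style merge added for illustration; the text above only describes the local increment and the tie-break.

```java
import java.util.Comparator;

/** A (counter, serverId) pair ordered by counter first, then by server ID. */
record LogicalStamp(long counter, int serverId) implements Comparable<LogicalStamp> {
    private static final Comparator<LogicalStamp> ORDER =
            Comparator.comparingLong(LogicalStamp::counter)
                      .thenComparingInt(LogicalStamp::serverId);

    @Override
    public int compareTo(LogicalStamp other) {
        return ORDER.compare(this, other);
    }
}

/** Hypothetical per-conversation clock owned by one chat server. */
final class ConversationClock {
    private final int serverId;
    private long counter;

    ConversationClock(int serverId) {
        this.serverId = serverId;
    }

    /** Increments the local counter and stamps the next message. */
    synchronized LogicalStamp next() {
        return new LogicalStamp(++counter, serverId);
    }

    /** Assumed merge step: never fall behind a stamp seen from another server. */
    synchronized void observe(LogicalStamp remote) {
        counter = Math.max(counter, remote.counter());
    }
}
```

Two servers that both issue counter 1 still produce a total order, because the server ID breaks the tie.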

io/thecodeforge/chat/schema.cqlCQL

CREATE TABLE IF NOT EXISTS thecodeforge.messages_by_conversation (
    conversation_id text,
    message_time timestamp,
    message_id uuid,
    sender_id bigint,
    recipient_id bigint,
    content_blob blob,
    content_type text,
    delivery_status text,
    read_timestamp timestamp,
    PRIMARY KEY ((conversation_id), message_time, message_id)
) WITH CLUSTERING ORDER BY (message_time DESC, message_id ASC)
   AND default_time_to_live = 2592000; -- 30 days

Cassandra Tombstone Pitfall

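Every DELETE in Cassandra creates a tombstone. If you design a schema that frequently deletes old messages (e.g., auto-delete after 7 days), tombstones accumulate and cause read latency spikes. WhatsApp avoids explicit deletes by using TTL (default_time_to_live) instead. Expired TTL data is still treated as a tombstone until compaction purges it, so TTL alone does not remove the problem; pairing it with a time-window compaction strategy lets Cassandra drop whole SSTables once everything in them has expired, which keeps the read cost low.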

Production Insight

Read receipts break if the recipient's chat server crashes after pushing the message but before acknowledging the read.

The client sends 'read' after the server confirms delivery. If the crash happens, the server never records the read, and the sender sees only 'delivered' indefinitely.

Workaround: the client persists 'read' status locally and retries on reconnection. The server dedupes read receipts by message_id.

WhatsApp avoids this by writing the read receipt to a separate Cassandra table with a lightweight transaction (LWT) — if the message doesn't exist, the LWT fails and the client retries.

Key Takeaway

Design message storage for append-only writes with TTL-based expiry.

Use Cassandra's clustering by time for efficient conversation history retrieval.

Read receipts require a side table with idempotent writes to handle server crashes.

Group Chat Fan-out: Write Fan-out vs Read Fan-out

Group chat is the hardest part of WhatsApp's architecture. A group with 100k members can't afford to write each message 100k times immediately. WhatsApp uses a hybrid approach:

- For groups up to 256 members: write fan-out. The sender's server writes the message to Cassandra once, then fans out notifications to all group members' servers. Each member's server fetches the message when the member goes online.
- For groups >256 members: read fan-out. The message is written to a 'group feed' row in Cassandra. Members periodically poll (or get push notified) their group feed and fetch new messages since their last seen timestamp. This reduces write amplification but increases read latency for large groups.

Why 256? It's a sweet spot derived from the average group size on WhatsApp (~50) and the cost-benefit analysis of write amplification vs read latency. At 256, write fan-out generates 256 write notification calls, each taking ~10ms, total ~2.5s server time. Read fan-out would require 256 sequential reads from Cassandra (since each member has a different last_seen), also ~2.5s. The crossover point is around 200-300 members, so 256 is a clean power of 2.

WhatsApp also uses a 'group member list cache' in Redis with a per-group version number. When a member joins or leaves, the version increments, and servers invalidate their local cache. This ensures membership changes propagate within seconds.
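A version-checked local cache might look like the following sketch (Java 16+ for the nested record). `GroupMemberCache` and its `loader` function are hypothetical; the loader stands in for whatever authoritative lookup (for example, a Redis read) returns the current member list.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical per-server cache that reloads a group's members only when its version changes. */
final class GroupMemberCache {
    private record Entry(long version, Set<String> members) {}

    private final Map<String, Entry> local = new HashMap<>();
    private final Function<String, Set<String>> loader; // assumed authoritative lookup
    private int loads;

    GroupMemberCache(Function<String, Set<String>> loader) {
        this.loader = loader;
    }

    /** Returns the member set, reloading if the cached version differs from currentVersion. */
    synchronized Set<String> members(String groupId, long currentVersion) {
        Entry cached = local.get(groupId);
        if (cached == null || cached.version() != currentVersion) {
            cached = new Entry(currentVersion, Set.copyOf(loader.apply(groupId)));
            local.put(groupId, cached);
            loads++;
        }
        return cached.members();
    }

    /** Number of times the authoritative source was consulted. */
    synchronized int loads() {
        return loads;
    }
}
```

Comparing a single version number is much cheaper than comparing member lists, which is why the version is incremented on every join or leave.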

io/thecodeforge/chat/GroupFanoutDecider.javaJAVA

package io.thecodeforge.chat;

public class GroupFanoutDecider {
    private static final int WRITE_FANOUT_THRESHOLD = 256;

    public FanoutStrategy decide(int groupSize) {
        if (groupSize <= WRITE_FANOUT_THRESHOLD) {
            return FanoutStrategy.WRITE_FANOUT;
        }
        return FanoutStrategy.READ_FANOUT;
    }

    public enum FanoutStrategy {
        WRITE_FANOUT,
        READ_FANOUT
    }
}

Real-World Data Point

WhatsApp's largest known group has 256k members (WhatsApp broadcast lists). These use pure read fan-out with a dedicated group feed table. In 2023, a broadcast to 256k members took ~29 seconds to fully deliver (p99 latency). Without read fan-out, the write amplification would have required 256k disk I/Os per message — impossible at scale.

Production Insight

Group member list staleness causes messages to be sent to users who have left, or missed by users who just joined.

Flapping (rapid join/leave) can inflate the version number and cause cache churn.

WhatsApp uses a cooldown: if a member changes within 10 seconds of a previous change, the version increments once after 10 seconds, not per-event.

Without this, a flurry of 100 joins would trigger 100 cache invalidations and 100 full membership queries.
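One way to express that cooldown is sketched below. It is an interpretation of the rule, not WhatsApp's code: the first change bumps the version immediately and opens a window, and any changes inside the window are folded into a single deferred bump. Timestamps are passed in explicitly to keep the logic testable.

```java
/** Hypothetical version counter that coalesces bursts of membership changes. */
final class CoalescingVersion {
    private final long cooldownMillis;
    private long version;
    private long windowEndsAt = Long.MIN_VALUE;
    private boolean pending;

    CoalescingVersion(long cooldownMillis) {
        this.cooldownMillis = cooldownMillis;
    }

    /** Records one membership change observed at nowMillis. */
    synchronized void onChange(long nowMillis) {
        settle(nowMillis);
        if (nowMillis < windowEndsAt) {
            pending = true; // inside the cooldown: remember it, do not bump yet
        } else {
            version++;
            windowEndsAt = nowMillis + cooldownMillis;
        }
    }

    /** Current version as seen at nowMillis, applying any deferred bump that is due. */
    synchronized long version(long nowMillis) {
        settle(nowMillis);
        return version;
    }

    // Applies the single deferred increment once the cooldown window has closed.
    private void settle(long nowMillis) {
        if (pending && nowMillis >= windowEndsAt) {
            version++;
            pending = false;
        }
    }
}
```

With a 10-second cooldown, one join followed by a burst of 100 more joins yields two version bumps in total rather than 101.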

Key Takeaway

Choose fan-out strategy based on group size: write fan-out for small groups (<=256), read fan-out for large groups.

Cache group membership aggressively but with a cooldown to avoid flapping cascades.

The 256 threshold is derived from write cost vs read cost — derive your own threshold by measuring your own latency numbers.

Fan-out Strategy Decision

If: group size <= 256, low churn, real-time requirement → Use: write fan-out with per-user delivery queues. Accept 256x write amplification for instant delivery.

If: group size > 256, high churn or async delivery acceptable → Use: read fan-out with group feed table. Accept higher read latency and periodic polling.

If: group size > 10K, must handle multi-region → Use: read fan-out with content-addressable storage and CDN for media. Use push notifications as 'new message' signal.

Media Storage and Delivery Pipeline

WhatsApp processes 4.5 billion photos and 1 billion videos daily. Media is too large to route through chat servers; that would saturate the internal network. Instead, WhatsApp uses a dedicated media pipeline:

1. Sender uploads the media to a CDN (Akamai) via a pre-signed URL obtained from the chat server.
2. The chat server returns a hash (SHA-256) of the media content, which is used as the media ID.
3. The sender sends a message containing the media ID (not the media itself) to the recipient.
4. The recipient fetches the media from the CDN using the media ID, with the CDN verifying the hash to prevent content substitution.

This design avoids storing media on the chat servers. The CDN handles blob storage, caching, and bandwidth. WhatsApp also uses content-addressable storage: if two users send the same media (e.g., forwarded meme), the CDN stores only one copy. Deduplication saves ~60% storage.
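Content addressing reduces to hashing the bytes and using the digest as the key. A minimal sketch using the JDK's `MessageDigest`; the `sha256:` prefix mirrors the media ID format shown in this section, and the class name is hypothetical.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper that derives a content-addressed media ID from raw bytes. */
final class MediaIds {
    /** Identical content always yields the same ID, which is what enables deduplication. */
    static String mediaId(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(content);
            StringBuilder id = new StringBuilder("sha256:");
            for (byte b : digest) {
                id.append(String.format("%02x", b)); // two lowercase hex digits per byte
            }
            return id.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is mandatory on every Java platform, so this should not happen.
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }
}
```

Two uploads of the same meme produce the same ID, so the store can skip the second write; any change to the bytes produces a different ID.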

For videos, the upload pipeline includes a transcoding step (run on a Kubernetes job triggered by the upload success event). Transcoding reduces file size by 40-70% and ensures compatibility. The transcoded versions are stored with the same media ID but different quality suffix (480p, 720p, etc.).

io/thecodeforge/media/UploadResponse.jsonJSON

{
  "media_id": "sha256:f4a5b1c...",
  "upload_url": "https://cdn.whatsapp.net/upload/...",
  "expires_at": 1712345678,
  "thumbnail_hash": "sha256:abcd123..."
}

Mental Model: The Library Analogy for Media Deduplication

Sender uploads the meme. The library (CDN) stores one copy and returns a call number (media hash).
Every time another user forwards that same meme, they just forward the call number. The library doesn't need a second copy.
When a recipient downloads, they go to the library with the call number, get the same meme. If the disk fails, they can fetch from any edge cache.

Production Insight

Pre-signed URLs with short expiry (5 minutes) prevent abuse.

But if the sender's upload fails partway, the URL expires and the sender must restart the whole upload.

WhatsApp mitigates this by using multi-part upload — each part has its own pre-signed URL.

If one part fails, only that part is retried. The CDN reassembles on success.

Without multi-part, a large video upload failure wastes 10+ seconds and bandwidth on the client.
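The per-part retry idea can be sketched as a loop that re-sends only the part that failed. Everything here is illustrative: `uploadPart` is a hypothetical callback that returns true on success, and the sketch omits URL signing and reassembly.

```java
import java.util.function.IntPredicate;

/** Hypothetical multi-part uploader that retries individual parts, not the whole file. */
final class MultipartUploader {
    /**
     * Uploads parts 0..partCount-1 in order and returns the total number of attempts made.
     * Throws if any single part still fails after maxAttemptsPerPart tries.
     */
    static int uploadAll(int partCount, int maxAttemptsPerPart, IntPredicate uploadPart) {
        int attempts = 0;
        for (int part = 0; part < partCount; part++) {
            boolean ok = false;
            for (int attempt = 0; attempt < maxAttemptsPerPart && !ok; attempt++) {
                attempts++;
                ok = uploadPart.test(part); // only this part is re-sent on failure
            }
            if (!ok) {
                throw new IllegalStateException(
                        "part " + part + " failed after " + maxAttemptsPerPart + " attempts");
            }
        }
        return attempts;
    }
}
```

If one of four parts fails once, the upload costs five part transfers instead of restarting all four.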

Key Takeaway

Media should never pass through the chat server directly.

Use a dedicated CDN with content-addressable storage for deduplication.

Multi-part uploads with per-part pre-signed URLs improve reliability for large files.

End-to-End Encryption (E2E) Implementation

WhatsApp uses the Signal Protocol for E2E encryption, which provides forward secrecy and deniability. The key exchange is based on a pre-key bundle system:

Each user generates a long-term identity key (IK) and a set of pre-keys (each is a one-time use key pair).
The user's device uploads its identity key public half and a signed pre-key bundle to the server.
When Alice wants to message Bob for the first time, she fetches Bob's pre-key bundle from the server, encrypts an initial message with a session key derived from the pre-key, and sends it.
Bob decrypts with his stored private pre-key, and they establish a symmetric session key for subsequent messages.

This design means the server never has access to the message content. The server only sees the encrypted payload and metadata (sender, recipient, timestamp). The encryption is done client-side.

For group messages, WhatsApp uses a 'sender key' approach: each sender generates a symmetric sender key and distributes it to all group members over their individual pairwise E2E sessions. Members use that key to decrypt the sender's group messages, so a single ciphertext serves the whole group. Sender keys are rotated when a member leaves.

Key management is the hardest part — if a user loses a private key (e.g., phone reset), all messages encrypted with that key become undecryptable. WhatsApp stores encrypted backups on the server (with the user's 64-digit backup key) and allows restore only from the user's device.

io/thecodeforge/e2e/PreKeyBundle.jsonJSON

{
  "identity_key_public": "base64encode...",
  "signed_pre_key": {
    "key": "base64encode...",
    "signature": "base64encode..."
  },
  "one_time_pre_keys": [
    {"id": 1, "key": "base64..."},
    {"id": 2, "key": "base64..."}
  ]
}

The Pre-Key Exhaustion Problem

Each user uploads a bundle of ~100 one-time pre-keys. When those keys are consumed (one per new conversation), the user must upload a fresh batch. If all pre-keys are exhausted before the user is online, new senders cannot start encrypted conversations. WhatsApp monitors pre-key count and pushes a notification to the user to generate more keys when the count drops below 10.
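The server-side bookkeeping for this can be sketched as a simple pool with a low-watermark check. The class is hypothetical and tracks only key IDs; the threshold of 10 comes from the text above.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Optional;

/** Hypothetical server-side pool of one user's one-time pre-key IDs. */
final class OneTimePreKeyPool {
    static final int LOW_WATERMARK = 10; // replenish when fewer than this remain

    private final Deque<Integer> keyIds = new ArrayDeque<>();

    /** Adds a fresh batch of consecutive key IDs uploaded by the user's device. */
    synchronized void upload(int firstId, int count) {
        for (int i = 0; i < count; i++) {
            keyIds.addLast(firstId + i);
        }
    }

    /** Hands out one key for a new conversation; empty when the pool is exhausted. */
    synchronized Optional<Integer> take() {
        return Optional.ofNullable(keyIds.pollFirst());
    }

    /** True when the device should be asked to generate more keys. */
    synchronized boolean needsReplenish() {
        return keyIds.size() < LOW_WATERMARK;
    }
}
```

Each key is removed when handed out, which is what makes it one-time; the watermark check gives the device time to upload a new batch before the pool runs dry.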

Production Insight

E2E encryption adds latency because of key exchange. For the first message in a conversation, the client must fetch the recipient's pre-key bundle from the server, which adds one additional round trip (50-100ms).

To reduce perceived latency, WhatsApp pre-fetches pre-key bundles for the user's most frequent contacts in the background.

For group messages, the sender key is cached on the sending device for 24 hours, avoiding re-keying on every message.

Without pre-fetching, opening a conversation with a new contact would feel sluggish, especially in regions with high latency.

Key Takeaway

E2E encryption means the server is blind to message content — it takes that burden off storage compliance but adds key management complexity.

Pre-key bundles with monitoring prevent exhaustion.

Pre-fetching for frequent contacts reduces first-message latency.

Choosing an Encryption Protocol

If: need asynchronous messaging with forward secrecy → Use: Signal Protocol (double ratchet). Best for real-time messaging with offline queue.

If: need synchronous real-time voice/video → Use: DTLS-SRTP key exchange. Signal Protocol is too heavy for media streams; use a lightweight key exchange per session.

If: need server-side processing of message content (e.g., spam detection) → Use: transport encryption (TLS) only, because E2E encryption prevents server processing. Homomorphic encryption is an alternative but still experimental.

Scaling Infrastructure: Chat Servers, Presence, and Multi-Region

WhatsApp's chat servers hold WebSocket connections and route messages. They keep connection state in memory but no durable state of their own: sessions, presence, and messages live in external stores, so a server can be replaced without data loss. They sit behind a global load balancer (anycast DNS + L7 proxy). Each user is assigned to a 'home cluster' based on their phone number hash to maintain sticky sessions and reduce cross-region traffic.
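Home cluster assignment by hash can be sketched as below. The cluster count of 64 matches the `mod 64` in this section's routing config; using `String.hashCode` is purely an illustrative assumption, and a production system would want a stable, well-distributed hash.

```java
/** Hypothetical assigner that maps a phone number to one of a fixed set of clusters. */
final class HomeClusterAssigner {
    static final int CLUSTER_COUNT = 64;

    /** Deterministic: the same phone number always maps to the same cluster index. */
    static int clusterFor(String phoneNumber) {
        // floorMod keeps the result in [0, CLUSTER_COUNT) even when hashCode is negative.
        return Math.floorMod(phoneNumber.hashCode(), CLUSTER_COUNT);
    }
}
```

Plain modulo reassigns most users if the cluster count changes, which is one reason to consider consistent hashing when clusters are added or removed.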

Presence (online/offline/last seen) is stored in a separate low-latency key-value store (custom in-memory DHT with Redis persistence). Updates are propagated via a gossip protocol to all clusters within 5 seconds. To reduce update floods (users frequently connect/disconnect on mobile), WhatsApp batches presence updates: if a user toggles online/offline within 60 seconds, only the final state is published.
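The batching rule can be sketched as a buffer that keeps only the latest state per user and is flushed once per window. `PresenceBatcher` is a hypothetical name, and the 60-second timer that would call `flush()` is assumed to live outside this class.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical presence buffer that publishes only each user's final state per window. */
final class PresenceBatcher {
    private final Map<String, Boolean> pending = new HashMap<>();   // latest state this window
    private final Map<String, Boolean> published = new HashMap<>(); // last state sent out

    /** Records a toggle; later calls in the same window overwrite earlier ones. */
    synchronized void record(String userId, boolean online) {
        pending.put(userId, online);
    }

    /** Called once per batching window; returns only users whose final state changed. */
    synchronized Map<String, Boolean> flush() {
        Map<String, Boolean> changes = new HashMap<>();
        for (Map.Entry<String, Boolean> e : pending.entrySet()) {
            Boolean previous = published.put(e.getKey(), e.getValue());
            if (!e.getValue().equals(previous)) {
                changes.put(e.getKey(), e.getValue());
            }
        }
        pending.clear();
        return changes;
    }
}
```

A user who flaps online, offline, online within one window produces a single update, and a user who ends the window in the same state as before produces none.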

Multi-region replication: Messages written to Cassandra in the home cluster are replicated asynchronously to a backup region using cross-datacenter replication. If the home cluster fails, traffic is routed to the backup region. The routing change happens within 10 seconds via a global DNS update (weighted records).

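Each region has multiple availability zones. Cassandra's rack-aware replication ensures that each message is stored in at least two AZs within the region. The replication factor is 3, with write consistency set to ONE (for write performance) and read consistency set to LOCAL_QUORUM. This combination narrows the window for stale reads on crash recovery but cannot rule them out: a read is only guaranteed to see the latest write when the write and read replica counts together exceed the replication factor (W + R > RF), and here 1 + 2 = 3.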
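The interaction between replication factor and consistency levels comes down to simple arithmetic. The sketch below encodes the standard overlap rule for quorum systems; the class and method names are hypothetical.

```java
/** Worked arithmetic for replica overlap between writes and reads. */
final class QuorumMath {
    /** Majority of replicas, e.g. 2 when the replication factor is 3. */
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    /** True when every read set must intersect every write set (W + R > RF). */
    static boolean readsSeeLatestWrite(int replicationFactor, int writeReplicas, int readReplicas) {
        return writeReplicas + readReplicas > replicationFactor;
    }
}
```

With RF = 3, writing at ONE and reading at quorum gives 1 + 2 = 3, which is not greater than 3, so a read can miss the newest write. Writing and reading at quorum gives 2 + 2 = 4, which guarantees overlap at the cost of slower writes.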

io/thecodeforge/infra/global-routing.yamlYAML

global_routing:
  anycast: yes
  load_balancer:
    type: envoy
    zones:
      - us-east-1
      - eu-west-1
      - ap-southeast-1
  home_cluster_assignment:
    hash: phone_number (mod 64)
  fallback:
    if home_cluster_down:
      route_to: nearest_healthy_cluster
      cool_down: 10s

Why Cassandra for WhatsApp?

WhatsApp chose Cassandra over DynamoDB (which also powers Amazon's internal services) because it was open-source and already battle-tested at Facebook. The key advantage is linear write scalability: add nodes and write throughput increases without manual sharding. The read path is the trade-off: single-partition scans are fast, but range scans require careful design of clustering keys.

Production Insight

Write consistency ONE means a message is considered 'stored' as soon as one Cassandra replica acknowledges. If that replica goes down before replicating, the message is lost. WhatsApp accepts this risk for the performance benefit (write latency ~5ms). To mitigate, they use hinted handoff and a separate 'write-ahead log' (WAL) in Redis per chat server. If a Cassandra write fails, the WAL is replayed. The WAL is ephemeral — if the chat server crashes, only messages in flight are lost (average <1 message per crash).

In practice, the WAL reduces message loss from ~0.001% to ~0.00001%.

Key Takeaway

Sticky sessions (home cluster) reduce cross-region traffic and simplify presence.

Presence batching prevents update avalanches on mobile.

Accept a trade-off in consistency level (ONE + WAL) to achieve 5ms writes at 2B users.

Always have a WAL for in-flight messages — it's cheap insurance.

Consistency Level Selection

If: write throughput critical, can tolerate rare data loss → Use: write consistency ONE + WAL. Accept that <0.01% of writes may be lost on cascading failures.

If: read must be consistent even after disaster recovery → Use: read consistency LOCAL_QUORUM (2 out of 3 replicas). Accept slightly higher read latency (~10ms vs 5ms).

If: regulatory requirement for no data loss → Use: write consistency LOCAL_QUORUM with a log-sync before commit. This adds 20ms to write latency but guarantees no loss within the datacenter.

API Design: Why Your Endpoints Should Look Different Than Your Database

I spent a night at 3 AM debugging why a 'send message' call triggered a database deadlock. The problem? The junior engineer mapped the API directly to the database table. Never do that. Your API is a contract with the client. Your database is your private storage business. They should not look the same. For WhatsApp, we design four primary endpoints: send message, get messages (with pagination), upload media, and download media. The send message endpoint accepts a payload with sender ID, receiver ID, conversation ID, message type, and an optional media ID. It returns a message ID and a timestamp. The client does not need to know about your internal partition keys or encryption key IDs. That is your problem. Keep the API lean. Version it. And always, always validate input before it hits any internal service. A malformed payload will crash a chat server faster than a traffic spike.

MessageController.javaJAVA

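// io.thecodeforge
// Controller for message API: thin as possible
import jakarta.validation.Valid; // javax.validation.Valid on Spring Boot 2
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
@RequestMapping("/api/v1/messages")
public class MessageController {

    // Collaborators are constructor-injected; their types are assumed from the calls below.
    private final ConversationValidator conversationValidator;
    private final MessageService messageService;

    public MessageController(ConversationValidator conversationValidator, MessageService messageService) {
        this.conversationValidator = conversationValidator;
        this.messageService = messageService;
    }

    @PostMapping
    public ResponseEntity<SendMessageResponse> sendMessage(@Valid @RequestBody SendMessageRequest request) {
        // WHY: Validate before anything touches internal services
        if (!conversationValidator.isParticipant(request.senderId(), request.conversationId())) {
            return ResponseEntity.status(403).build();
        }
        // Delegate to service: database partition logic stays hidden
        Message message = messageService.send(request);
        return ResponseEntity.ok(new SendMessageResponse(message.id(), message.createdAt()));
    }
}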

Output

HTTP 200: { "messageId": "abc123", "createdAt": "2025-03-28T14:30:00Z" }

Production Trap:

Never return internal IDs like partition keys or encryption key versions in the API response. Attackers and clueless clients will use them to probe your backend. Strip everything non-essential before the response leaves your service boundary.

Key Takeaway

API is a contract, not a database dump. Keep the client in the dark about your internals.

Data Model Design: Partition Keys Will Save Your Weekend

Here is a truth I learned the hard way: your data model determines how your system fails. WhatsApp stores billions of messages. If you use a simple auto-increment ID as the primary key, writes either pile onto the node that owns the newest ID range or scatter a single conversation across every node, so reading one chat's history fans out across the cluster. You will burn. Instead, use a composite primary key: conversation_id as the partition key and a server-assigned message_timestamp as the sort key. This spreads conversations across nodes and lets you fetch one conversation's history with a single range query. The conversation_id itself should be deterministic: a sorted concatenation of user IDs for one-on-one chats, or a unique group ID. Never rely on a client-supplied 'last message timestamp' for ordering. Timestamps from clients are notoriously broken due to clock skew. Use a server-generated sequence number. Trust nothing from the client except the message content. We keep user data separate: a users table for profile data, a conversations table for metadata, and a messages table with the composite key. Index only what you query. Everything else is dead weight.

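MessageRepository.java · JAVA

// io.thecodeforge
// Repository for message retrieval: partition key is non-negotiable
import java.util.List;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jdbc.core.RowMapper;
import org.springframework.stereotype.Repository;

@Repository
public class MessageRepository {

    private final JdbcTemplate jdbcTemplate;
    private final RowMapper<Message> messageRowMapper;

    public MessageRepository(JdbcTemplate jdbcTemplate, RowMapper<Message> messageRowMapper) {
        this.jdbcTemplate = jdbcTemplate;
        this.messageRowMapper = messageRowMapper;
    }

    public List<Message> getMessages(String conversationId, long beforeTimestamp, int limit) {
        // WHY: Range query on sort key avoids full table scan
        // Uses the varargs overload; the Object[] overload is deprecated in recent Spring versions
        return jdbcTemplate.query(
            """
            SELECT * FROM messages
            WHERE conversation_id = ? AND created_at < ?
            ORDER BY created_at DESC
            LIMIT ?
            """,
            messageRowMapper,
            conversationId, beforeTimestamp, limit
        );
    }
}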

Output

[ { "senderId": "user1", "content": "Hello", "createdAt": 1678900000 } ]
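The deterministic conversation_id for one-on-one chats could look like the sketch below. The `dm:` prefix and the `:` separator are hypothetical choices, and the sketch assumes user IDs never contain the separator.

```java
final class ConversationIds {
    private ConversationIds() {}

    // Sorting the two IDs makes the result independent of who starts the chat.
    static String forDirectChat(String userA, String userB) {
        if (userA.equals(userB)) {
            throw new IllegalArgumentException("a direct chat needs two distinct users");
        }
        boolean aFirst = userA.compareTo(userB) < 0;
        String low = aFirst ? userA : userB;
        String high = aFirst ? userB : userA;
        // Hypothetical format: "dm:" prefix plus ":" separator.
        return "dm:" + low + ":" + high;
    }
}
```

Both participants derive the same partition key without a lookup, so `forDirectChat("bob", "alice")` and `forDirectChat("alice", "bob")` land in the same partition.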

Production Trap:

Do not trust client timestamps for ordering. Clocks drift. Use a server-assigned monotonic counter per conversation. Otherwise, a user with a wrong clock will see messages out of order and your support team will get paged at 2 AM.
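A server-assigned monotonic counter per conversation can be sketched in a few lines. This is a single-process illustration only: the class name is hypothetical, and a real deployment would need the counter to survive restarts and be owned by exactly one writer per conversation, which this sketch does not attempt.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

final class ConversationSequencer {
    // One counter per conversation; in-memory stand-in for durable state.
    private final ConcurrentHashMap<String, AtomicLong> counters = new ConcurrentHashMap<>();

    // Returns 1, 2, 3, ... for each conversation, regardless of client clocks.
    long next(String conversationId) {
        return counters.computeIfAbsent(conversationId, id -> new AtomicLong()).incrementAndGet();
    }
}
```

Ordering by this number instead of a client timestamp means a phone with a wrong clock cannot reorder the conversation.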

Key Takeaway

Partition key = conversation_id, sort key = server timestamp. Client timestamps are lies.

● Production incident · POST-MORTEM · severity: high

The Billion Lost Messages: WhatsApp's WebSocket Reconnection Bug

Symptom

Users in affected clusters received duplicate 'delivered' receipts before the actual message arrived. Some messages appeared out of order for up to 15 minutes until the backlog cleared.

Assumption

Engineers assumed the WebSocket reconnection handshake was atomic. If a client disconnected and reconnected to a different chat server, the new server would fetch the latest message sequence number from the persistent store before accepting new messages.

Root cause

The reconnection logic had a window in which the old server still held an open, unacknowledged connection. During that window the client sent a message through the new server while the old server accepted and processed a delayed copy from the same client. The deduplication key was built from client_msg_id and server_id, not from a globally unique ID, so the two copies produced different keys on the two servers. Dedup failed, and the message was delivered twice to recipients.

Fix

Switch to a globally unique message ID (sender_id + 64-bit timestamp + per-node counter) combined with a persistent deduplication set (TTL 7 days) stored in Cassandra. The dedup check is performed before any message is enqueued for delivery. Added a 'last_acknowledged_seq' field to the client session that must match before new messages are accepted.
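The dedup check before enqueueing can be sketched as a "mark if first seen" operation with a TTL. This is an in-memory stand-in for the Cassandra-backed set described in the fix, not the production code; the class name, method name, and injectable clock are assumptions for illustration.

```java
import java.time.Duration;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

final class DedupStore {
    // Maps global message ID to the time (millis) at which the entry expires.
    private final ConcurrentHashMap<String, Long> expiryByMessageId = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clockMillis;

    DedupStore(Duration ttl, LongSupplier clockMillis) {
        this.ttlMillis = ttl.toMillis();
        this.clockMillis = clockMillis;
    }

    // Returns true only the first time an ID is seen within the TTL.
    // The key contains no server ID, so every chat server computes the same key.
    boolean markIfFirstSeen(String globalMessageId) {
        long now = clockMillis.getAsLong();
        boolean[] first = {false};
        // compute() is atomic per key, so two concurrent copies cannot both win.
        expiryByMessageId.compute(globalMessageId, (id, expiry) -> {
            if (expiry == null || expiry <= now) {
                first[0] = true;
                return now + ttlMillis;
            }
            return expiry;
        });
        return first[0];
    }
}
```

A chat server would call `markIfFirstSeen` before enqueueing and drop the message when it returns false. For the check to hold across servers, the state has to live in a store that all of them share.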

Key lesson

Assume reconnection is not idempotent — always design for exactly-once delivery across server boundaries.
Use globally unique message IDs from the client; never let the server generate the dedup key.
Persist deduplication state with a TTL that covers the worst-case message backlog (usually hours, not days).
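One way to build the globally unique ID on the client is sketched below, combining the sender ID, a 64-bit timestamp, and a local counter as in the fix. The string layout and class name are hypothetical. The point of the sketch is that the ID is created once per message and reused on every retry, whichever server the retry reaches.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

final class MessageIdGenerator {
    private final String senderId;
    private final LongSupplier clockMillis;
    // Counter separates messages created within the same millisecond.
    private final AtomicLong counter = new AtomicLong();

    MessageIdGenerator(String senderId, LongSupplier clockMillis) {
        this.senderId = senderId;
        this.clockMillis = clockMillis;
    }

    // Hypothetical layout: senderId-<16 hex digit timestamp>-<hex counter>.
    String nextId() {
        return String.format("%s-%016x-%x",
                senderId, clockMillis.getAsLong(), counter.getAndIncrement());
    }
}
```

The client stores the generated ID with the pending message. A resend after reconnect carries the stored ID, never a freshly generated one.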

Production debug guide

When messages take >1s to deliver, follow this symptom-driven guide to isolate the bottleneck.

Symptom · 01

Messages from one user are consistently delayed, others are fine

→

Fix

Check presence status of that user's client. If client is connected to a different regional cluster, verify cross-region routing (P99 latency usually 200ms+). Check if the user's home cluster is under load (CPU >80% on chat server).

Symptom · 02

All users in a region experience delays

→

Fix

Check WebSocket gateway health. The gateway may be dropping connections due to connection pool exhaustion. Verify gateway-to-chat-server backpressure — if the gateway queue grows beyond 10K messages, it throttles new connections.

Symptom · 03

Delivery receipts are fast but messages arrive late

→

Fix

The receipt path is different from the message path. Messages go through persistence (Cassandra write). Check Cassandra write latency — look for 'timeout' exceptions with p999 >500ms. Inspect compaction backlog.

Symptom · 04

Group messages take >3s to deliver

→

Fix

Fan-out to large groups (>256 members) is read-side. Check whether the group's member list is cached for the chat server (Redis cluster hit rate should be >95%). If it is not, the cache layer is being bypassed and group membership queries hit Cassandra directly, adding 30-50ms per member.

★ Real-Time Message Debugging Cheat Sheet

Common WhatsApp production issues and the first commands to run. All commands assume access to internal monitoring and logging.

Client not receiving any messages

Immediate action

Check WebSocket connection state. Run `curl http://gateway-cluster:8080/health | jq '.connections.active'` on the gateway. If active connections drop below threshold, restart gateway process.

Commands

`ctl -env prod -action check_connection -user_id <id>`

`kubectl logs -l app=chat-server --tail=100 | grep user_<id>`

Fix now

If no WebSocket, force client to reconnect: push notification with type=RECONNECT. If WebSocket exists but no messages, invalidate server-side session cache: redis-cli DEL session:<user_id>

Delivery receipts show 'delivered' but user never saw the message

WhatsApp Architecture Decisions vs Alternatives

Component	WhatsApp Choice	Alternative	WhatsApp's Reason
Connection protocol	WebSockets	HTTP/2 Server-Sent Events	Full duplex at lowest latency (200ms median)
Message storage	Cassandra	PostgreSQL, DynamoDB	Write scalability, no single point of failure (AP system)
Group fan-out	Hybrid (256 threshold)	Pure write fan-out (WhatsApp earlier version)	Balance between write amplification and read latency
Media storage	CDN + content-addressable	In-house blob store	Bandwidth and deduplication efficiency (60% storage savings)
E2E encryption	Signal Protocol	OpenPGP, custom	Forward secrecy + deniability + pre-key one-time use
Presence propagation	Gossip protocol	Centralized state store	Scalability: gossip avoids single point of failure, ~5s propagation

Key takeaways

WebSocket connections require graceful draining: never hard-close without a DRAIN message.

Globally unique message IDs solve dedup and ordering across servers: always generate them on the client.

Group fan-out: write for small groups (<256), read for large groups.

Media should be content-addressable with a CDN; never store blobs in the message database.

E2E encryption means the server is blind: pre-key monitoring and pre-fetching mitigate latency.

Accept trade-offs: write consistency ONE + WAL gives 5ms writes with 0.001% loss rate.

Presence batching and gossip prevent the system from collapsing under 2B heartbeat floods.

Common mistakes to avoid

5 patterns

Assuming WebSocket reconnection is transparent to message ordering

Symptom

Users receive duplicate messages or out-of-order deliveries after a network reconnect. The dedup mechanism fails because the message ID is not globally unique.

Fix

Use globally unique message IDs with a persistent dedup set (TTL). Implement last_acknowledged_seq on the server to reject messages from a previous session.
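The last_acknowledged_seq gate can be sketched as a compare-then-advance check on the session. How the real handshake advances the sequence is not described here, so this sketch assumes that accepting a message and acknowledging it happen in one step and that the sequence advances by one each time.

```java
final class ClientSession {
    private long lastAcknowledgedSeq;

    ClientSession(long lastAcknowledgedSeq) {
        this.lastAcknowledgedSeq = lastAcknowledgedSeq;
    }

    // Accepts a message only when the client and server agree on the last acknowledged sequence.
    // A connection left over from a previous session carries a stale value and is rejected.
    synchronized boolean tryAccept(long clientLastAcknowledgedSeq) {
        if (clientLastAcknowledgedSeq != lastAcknowledgedSeq) {
            return false;
        }
        lastAcknowledgedSeq++; // assumption: accept and acknowledge are a single step
        return true;
    }

    synchronized long lastAcknowledgedSeq() {
        return lastAcknowledgedSeq;
    }
}
```

A rejected sender resynchronizes its sequence and resends. The persistent dedup set still catches anything that slips through.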

Building group chat with write fan-out for all group sizes

Symptom

Messages to groups with 100K members take tens of seconds to deliver because the server must write to each recipient's queue individually. Write amplification causes Cassandra compaction lag and high CPU.

Fix

Implement a threshold (e.g., 256) below which use write fan-out, above which use read fan-out with a group feed. Cache group member list to reduce fan-out cost.
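The threshold rule can be sketched as a small planner that decides where a group message gets written. The queue names (`inbox:`, `groupfeed:`) are hypothetical, and a group with exactly 256 members is treated as large, matching the "<256" rule for write fan-out.

```java
import java.util.List;

enum FanOutStrategy { WRITE, READ }

final class FanOutPlanner {
    static final int DEFAULT_THRESHOLD = 256;

    private FanOutPlanner() {}

    // Below the threshold: copy to every member. At or above it: one shared feed.
    static FanOutStrategy choose(int memberCount, int threshold) {
        if (memberCount < 0) {
            throw new IllegalArgumentException("memberCount must not be negative");
        }
        return memberCount < threshold ? FanOutStrategy.WRITE : FanOutStrategy.READ;
    }

    // Returns the queues a single group message is written to.
    static List<String> targetQueues(String groupId, List<String> members, int threshold) {
        if (choose(members.size(), threshold) == FanOutStrategy.WRITE) {
            return members.stream().map(m -> "inbox:" + m).toList();
        }
        return List.of("groupfeed:" + groupId);
    }
}
```

Under this rule a 100K-member group costs one write per message instead of 100K, and readers pull from the shared feed.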

Storing media in the message database

Symptom

Message storage grows far faster than message count; read latencies spike as blobs saturate disk I/O; backups take hours.

Fix

Use a separate CDN for media, content-addressable (hash-based) storage. Store only the media hash in the message table. Use pre-signed URLs for uploads.
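Content-addressable storage means the storage key is derived from the bytes themselves. A minimal sketch follows, assuming SHA-256 as the hash (the text does not name one) and a hypothetical `MediaKeys` helper. It requires Java 17 for `HexFormat`.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

final class MediaKeys {
    private MediaKeys() {}

    // Identical bytes always produce the same key, so a re-upload maps to the existing object.
    static String contentKey(byte[] mediaBytes) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            return HexFormat.of().formatHex(digest.digest(mediaBytes));
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }
}
```

The message row then stores only this key, and the blob lives behind the CDN under the same key.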

Ignoring pre-key exhaustion in E2E encryption

Symptom

New users cannot send messages to a contact whose pre-keys are all consumed and who is offline/not generating new ones.

Fix

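Monitor pre-key count per user. Push a notification asking the client to upload new pre-keys when the count falls below a threshold (e.g., 10). If the one-time pre-keys run out anyway, fall back to a session built from the signed pre-key alone, which the Signal Protocol's X3DH handshake permits, and flag the weaker forward secrecy of that first message. Do not fall back to sending without E2E.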
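The monitoring side can be sketched as a per-user counter that reports when a user's supply drops below the threshold. The class and method names are hypothetical, and the counts live in memory here purely for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;

final class PreKeyMonitor {
    private final ConcurrentHashMap<String, Integer> remainingByUser = new ConcurrentHashMap<>();
    private final int threshold;

    PreKeyMonitor(int threshold) {
        this.threshold = threshold;
    }

    // Called when a device uploads a fresh batch of one-time pre-keys.
    void recordUpload(String userId, int uploaded) {
        remainingByUser.merge(userId, uploaded, Integer::sum);
    }

    // Consumes one pre-key and reports whether the device should be asked to upload more.
    boolean consumeAndCheckLow(String userId) {
        int left = remainingByUser.compute(userId,
                (id, old) -> old == null ? 0 : Math.max(0, old - 1));
        return left < threshold;
    }

    int count(String userId) {
        return remainingByUser.getOrDefault(userId, 0);
    }
}
```

When `consumeAndCheckLow` returns true, the server would send the replenish push described above.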

Setting Cassandra write consistency to QUORUM for all writes

Symptom

Write latency spikes (20ms+) under load because each write must wait for 2 of 3 replicas. This chokes the message pipeline during peak hours (e.g., New Year's Eve).

Fix

Use write consistency ONE with a Write-Ahead Log (WAL) in Redis. Accept rare data loss in exchange for 5ms writes. The WAL catches the tiny fraction of lost writes.
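The text does not spell out how the WAL catches lost writes, so the following is one plausible reading, sketched with in-memory stand-ins: log the write first, attempt the store write, clear the log entry on acknowledgement, and replay whatever is left. The class name and the `BiPredicate` store interface are assumptions.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiPredicate;

final class WalBackedWriter {
    // Stand-in for the Redis WAL: message ID to payload, in arrival order.
    private final Map<String, String> wal = new LinkedHashMap<>();
    // Stand-in for a consistency-ONE store write; returns true when acknowledged.
    private final BiPredicate<String, String> storeWrite;

    WalBackedWriter(BiPredicate<String, String> storeWrite) {
        this.storeWrite = storeWrite;
    }

    synchronized boolean write(String messageId, String payload) {
        wal.put(messageId, payload);               // 1. log before touching the store
        if (storeWrite.test(messageId, payload)) {
            wal.remove(messageId);                 // 2. acknowledged, entry no longer needed
            return true;
        }
        return false;                              // 3. unacknowledged, stays in the WAL
    }

    // Retries every pending entry and returns how many were recovered.
    synchronized int replay() {
        int recovered = 0;
        Iterator<Map.Entry<String, String>> it = wal.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, String> entry = it.next();
            if (storeWrite.test(entry.getKey(), entry.getValue())) {
                it.remove();
                recovered++;
            }
        }
        return recovered;
    }

    synchronized int pending() {
        return wal.size();
    }
}
```

Replay only works safely if the store write is idempotent, which keying on the message ID provides.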

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR

Design WhatsApp: How would you handle message delivery when the recipien...

Q02 · SENIOR

WhatsApp uses end-to-end encryption. How does it handle key exchange for...

Q03 · SENIOR

How would you design presence (online/offline) for 2 billion users witho...

Q04 · SENIOR

Explain how WhatsApp achieves exactly-once delivery for messages.

Q01 of 04 · SENIOR

Design WhatsApp: How would you handle message delivery when the recipient is offline?

ANSWER

Store the message in a per-user offline queue in Cassandra with a TTL (30 days for WhatsApp). When the recipient comes online, their chat server fetches all messages since their last delivery cursor and pushes them over the WebSocket. The delivery cursor is updated only after a successful push, so a failed push is retried from the same position. For large groups served by read fan-out, the queue is shared per group, not per user, to reduce storage. Offline messages are ordered using a logical clock to maintain causality.
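The cursor mechanics in this answer can be sketched with an in-memory queue. This is an illustration under simplifying assumptions: one queue per recipient, a plain sequence number standing in for the logical clock, and no TTL handling. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

final class OfflineQueue {
    record Queued(long seq, String payload) {}

    private final List<Queued> messages = new ArrayList<>();
    private long nextSeq = 1;
    private long deliveryCursor = 0; // highest sequence confirmed as pushed

    synchronized void enqueue(String payload) {
        messages.add(new Queued(nextSeq++, payload));
    }

    // Pushes everything after the cursor, advancing the cursor only after each successful push.
    synchronized int drain(Predicate<Queued> push) {
        int delivered = 0;
        for (Queued message : messages) {
            if (message.seq() <= deliveryCursor) {
                continue; // already delivered in an earlier session
            }
            if (!push.test(message)) {
                break;    // stop here; the next reconnect resumes from the same cursor
            }
            deliveryCursor = message.seq();
            delivered++;
        }
        return delivered;
    }

    synchronized long cursor() {
        return deliveryCursor;
    }
}
```

Because the cursor moves only after a push succeeds, a connection that drops mid-drain causes a resend rather than a gap, which is why the recipient side still needs dedup by message ID.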

Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Written from production experience, not tutorials.

✓ Verified

production tested

May 24, 2026

last updated
