Advanced 20 min · March 06, 2026

Design Slack — Fixing WebSocket Fanout Amplification

One 200-user channel triggers 10x Redis pub/sub amplification.

Naren · Founder
Plain-English first. Then code. Then the interview question.
 ● Production Incident 🔎 Debug Guide
Quick Answer
  • WebSocket persistent connections handle real-time message delivery
  • Redis pub/sub fans out each write to every server with subscribers
  • Cassandra row key shards messages by channel + time bucket; offline catch-up uses sliding window
  • Presence detection uses heartbeat + expiry, not polling
  • Message fanout is the bottleneck — write once, read by N subscribers
  • The biggest mistake: assuming handshake latency is the bottleneck — it's the fanout write amplification
  • Multi-region adds 40ms+ latency; cross-region fanout requires separate pub/sub clusters
Plain-English First

Imagine a giant office building where thousands of people work. When someone speaks in a meeting room, only the people in that room hear it — not the whole building. Slack is like that building, but digital: it routes your message only to the right 'rooms' (channels), instantly, to everyone who's currently listening. The engineering challenge is making that work for millions of rooms, simultaneously, without anyone hearing a delay or missing a word.

Slack processes over 1 billion messages per month across millions of workspaces. At peak, it delivers messages in under 100ms to recipients who could be on a phone in Tokyo or a laptop in Berlin. That kind of reliability doesn't happen by accident — it's the result of deliberate architectural decisions around real-time transport, message persistence, presence broadcasting, and massive horizontal scale. Most system design resources gloss over the hard parts. This article doesn't.

The core problem Slack solves is deceptively simple: send a message from one user to many others, in real time, reliably, even when some recipients are offline. But hiding inside that sentence are half a dozen distributed systems challenges — how do you maintain persistent connections at scale, fan out a single write to thousands of subscribers, handle offline delivery, deduplicate retries, and keep 'online/offline' status accurate without hammering your database? Each of those deserves its own deep-dive, and we'll cover all of them.

By the end of this article you'll be able to walk into a system design interview and coherently explain the WebSocket connection layer, the pub/sub fanout architecture, how Slack shards its message store, why presence is one of the hardest problems in the system, and what trade-offs you'd make at each decision point. You'll also understand the production gotchas that only show up once you're running at real scale.

This isn't theoretical. Every pattern here is battle-tested — some are the direct result of production incidents that cost teams days of debugging. The fanout amplification trap? We've seen it bring down a cluster. The presence debounce logic? We've watched it flap under load. The cursor staleness bug? It's the kind of subtlety that survives code review. You'll walk into your next design review armed with the actual failure modes.

Here's what this deep dive covers: WebSocket connection management at millions of concurrent users, the write path with write-behind buffers and idempotency keys, Redis pub/sub fanout and its hidden O(N) cost, Cassandra time-bucketed storage with dynamic bucket sizes, heartbeat-based presence with adaptive intervals, async search indexing with Kafka and Elasticsearch, file upload pipelines with staged security scanning, and finally multi-region deployment strategies for global workspaces. Each section includes production incidents, debug commands, and decision trees that senior engineers actually use.

What is Design Slack?

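Designing Slack is a classic system design exercise: build a real-time messaging system that delivers each message to many recipients, stores it durably, and keeps it searchable. Rather than starting with a dry definition, let's see it in action and understand why it is hard.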

The real problem isn't sending a message — it's sending that message to thousands of recipients within 100 milliseconds. Every extra 100ms of latency drops user engagement by measurable percentage points. And when you have millions of channels and billions of messages, every architectural decision multiplies.

At its heart, Slack is a stateful system: each user maintains a persistent WebSocket connection, each channel is a pub/sub topic, and each message must be durably stored and indexed for search. The hard part is doing all three at the same time while keeping the architecture horizontally scalable.

What often gets missed is that the real bottleneck shifts depending on the scale tier. At 10 users, any architecture works. At 10,000, the pub/sub fanout starts to creak. At 1 million, the message storage becomes the dominant challenge. That's why you need an architecture that's modular enough to let you swap out components — for instance, moving from Redis pub/sub to Kafka for offline delivery — without a full rewrite.

But there's another layer: the design must also handle misconfiguration at scale. A single workspace with 50,000 users can saturate a Redis pub/sub channel if the fanout isn't filtered per-server. That's why Slack invests heavily in per-server subscription maps and circuit breakers — the Monday morning all-hands is a real threat, not a hypothetical.

Here's a deeper trap: the write path's performance depends on the slowest component. If Cassandra takes 10ms to persist and Redis pub/sub takes 1ms, the persist dominates and the request takes roughly 10ms. That's fine. But if Cassandra times out (2 seconds), the whole request blocks. Slack uses a write-behind queue: persist to a local buffer first, then ack the client, then flush to Cassandra. That cuts perceived latency from 100ms to 5ms. The trade-off is a small window for data loss if the server crashes before the buffer flushes.

One more production reality: the write-behind buffer must be sized correctly. If the buffer fills up because Cassandra is slow, you start dropping acknowledgements. Slack monitors buffer pressure and alerts when it crosses 70% utilization. That's the kind of metric you don't find in any textbook but will absolutely save your on-call rotation.
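The buffer-pressure idea can be sketched in a few lines. This is a minimal illustration, not Slack's implementation: the class name `WriteBehindBuffer`, its methods, and the use of an in-memory `ArrayBlockingQueue` are all assumptions made for the example. Only the 70% alert threshold comes from the text above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;

/** Hypothetical bounded write-behind buffer: accept locally, flush to the store in batches later. */
final class WriteBehindBuffer<T> {
    private final ArrayBlockingQueue<T> queue;
    private final int capacity;
    private final double alertThreshold;

    WriteBehindBuffer(int capacity, double alertThreshold) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.capacity = capacity;
        this.alertThreshold = alertThreshold;
    }

    /** Returns true if the message was buffered and can be acked; false if the buffer is full. */
    boolean offer(T message) {
        return queue.offer(message);
    }

    /** Fraction of the buffer currently in use, between 0.0 and 1.0. */
    double utilization() {
        return (double) queue.size() / capacity;
    }

    /** True once utilization rises above the alert threshold (0.70 in the text). */
    boolean underPressure() {
        return utilization() > alertThreshold;
    }

    /** Removes up to maxBatch messages, oldest first, for the flusher to persist together. */
    List<T> drainBatch(int maxBatch) {
        List<T> batch = new ArrayList<>(maxBatch);
        queue.drainTo(batch, maxBatch);
        return batch;
    }
}
```

A flusher thread would call `drainBatch` on a timer and write the batch to Cassandra, while a metrics job polls `underPressure()`. Because the queue lives in process memory, anything still buffered is lost on a crash, which is the durability trade-off described above.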

Here's the thing most engineers miss: the real bottleneck isn't the message send, it's the state management. Every message creates a chain of state changes — channel unread count, search index, thread replies, file metadata. If you don't design for that complexity upfront, you'll be patching each component individually later. Slack learned this the hard way when they had to retrofit file indexing — it took them two quarters to stabilize the async pipeline.

One more edge you won't see in interviews: the idle workspace problem. For workspaces with <10 active users, the overhead of maintaining all those subscriptions and connection slots is wasteful. Slack uses a 'dormant workspace' strategy: if a workspace has no active connections for 5 minutes, its subscription map is offloaded to a cold store. When a user reconnects, the load balancer routes them to a special 'warm-up server' that rehydrates the subscriptions from S3. This saves ~30% of Redis memory in clusters with many small workspaces.

Beyond idle workspaces, there's the multi-tenant isolation challenge. A noisy workspace with a high message rate shouldn't starve other workspaces on the same servers. Slack uses per-workspace token bucket rate limiters at the application layer, not just at the WebSocket connection level. If one workspace exceeds its burst limit, messages are queued locally and delivered with a slight delay rather than blocking all connections.
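A per-workspace token bucket might look roughly like the sketch below. The class names, the burst and rate parameters, and passing the clock in as `nowNanos` (to keep the example deterministic) are assumptions for illustration. The text does not describe Slack's actual limiter.

```java
import java.util.concurrent.ConcurrentHashMap;

/** Token bucket: holds up to `capacity` tokens, refilled continuously at `refillPerSecond`. */
final class TokenBucket {
    private final double capacity;
    private final double refillPerSecond;
    private double tokens;
    private long lastRefillNanos;

    TokenBucket(double capacity, double refillPerSecond, long nowNanos) {
        this.capacity = capacity;
        this.refillPerSecond = refillPerSecond;
        this.tokens = capacity;
        this.lastRefillNanos = nowNanos;
    }

    /** Takes one token if available; on false the caller queues the message locally. */
    synchronized boolean tryAcquire(long nowNanos) {
        double elapsedSeconds = (nowNanos - lastRefillNanos) / 1_000_000_000.0;
        tokens = Math.min(capacity, tokens + elapsedSeconds * refillPerSecond);
        lastRefillNanos = nowNanos;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}

/** One bucket per workspace, so a noisy tenant only exhausts its own budget. */
final class WorkspaceRateLimiter {
    private final ConcurrentHashMap<String, TokenBucket> buckets = new ConcurrentHashMap<>();
    private final double burst;
    private final double ratePerSecond;

    WorkspaceRateLimiter(double burst, double ratePerSecond) {
        this.burst = burst;
        this.ratePerSecond = ratePerSecond;
    }

    boolean allow(String workspaceId, long nowNanos) {
        return buckets
            .computeIfAbsent(workspaceId, id -> new TokenBucket(burst, ratePerSecond, nowNanos))
            .tryAcquire(nowNanos);
    }
}
```

In production code the caller would pass `System.nanoTime()`. Keying the buckets by workspace rather than by connection is what isolates tenants: a workspace that has spent its burst gets `false` while every other workspace keeps its full allowance.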

SlackMessageFlow.javaJAVA
package io.thecodeforge.slack;

import io.thecodeforge.slack.model.Message;
import io.thecodeforge.slack.service.FanoutService;
import io.thecodeforge.slack.storage.MessageStore;
import java.util.UUID;

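public class SlackMessageFlow {
    private final MessageStore messageStore;
    private final FanoutService fanoutService;

    public SlackMessageFlow(MessageStore messageStore, FanoutService fanoutService) {
        this.messageStore = messageStore;
        this.fanoutService = fanoutService;
    }

    public void sendMessage(Message msg) {
        // Keep a client-supplied idempotency key so retries dedupe; generate one only if absent.
        // getIdempotencyKey() is an assumed accessor on Message.
        if (msg.getIdempotencyKey() == null) {
            msg.setIdempotencyKey(UUID.randomUUID().toString());
        }
        // Persist first: durable before real-time
        messageStore.store(msg);
        // Then fan out to all subscribed servers via pub/sub
        fanoutService.publish(msg.channelId(), msg);
    }
}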
Forge Tip:
Type this code yourself rather than copy-pasting. The muscle memory of writing it will help it stick.
Production Insight
Persist before pub/sub ensures no message loss even if fanout fails.
Cost: ~5-10ms extra latency.
Rule: persist first, fanout second.
Gotcha: slow persistence blocks fanout — use write-behind queue.
Buffer pressure >70% is a silent pager-killer — monitor.
Dormant workspace offloading saves ~30% Redis memory — worth implementing if you have many small tenants.
Per-workspace rate limiters prevent noisy neighbor problems.
Key Takeaway
Slack's core write path is: persist → fanout → deliver.
Each step can scale independently.
Persist first, then broadcast — never the reverse.
Idempotency keys are cheaper than exactly-once delivery — use them.
State management amplifies every write: plan for the cascade from day one.
Write-behind buffers cut latency but trade durability. Monitor that buffer.
Dormant workspaces are a real cost saver at scale.
Rate limit per workspace, not per connection.
Write Path Decision Tree
If: Latency requirement < 50ms?
Use: Use write-behind buffer: ack client immediately, persist async. Accept risk of data loss on crash.
If: Durability critical (financial data)?
Use: Persist synchronously with QUORUM before fanout. Accept 10-20ms extra latency.
If: Throughput > 10k msg/s?
Use: Batch Cassandra writes in the persistence layer. Use a write buffer with flush every 5ms or 1000 messages.
If: Buffer utilization > 70%?
Use: Alert! Pause new writes to prevent data loss. Auto-scale persistence layer or throttle clients.

WebSocket Connection Management

Slack maintains millions of persistent TCP connections using WebSockets. Each connection is pinned to a WebSocket server (a 'connection server') via a load balancer. The challenge is handling connection storms: when thousands of users reconnect at once after a server restart or network blip.

Each WebSocket server runs an event loop that multiplexes reads and writes across all its connections. It subscribes to Redis pub/sub channels for the workspaces it hosts. When a message arrives, it writes to the local TCP socket for each connected user who belongs to the target channel.

The load balancer distributes connections using consistent hashing on user_id to minimise reconnections when a server is added or removed. Each server maintains a local registry of which channels each user is in, so it only forwards messages relevant to its connections.

A key optimization you'll see in production: the connection servers don't just blindly subscribe to all workspace channels. They maintain a local subscription map and only subscribe to a channel when the first user in that channel connects to them. This reduces the number of Redis pub/sub subscriptions per server from O(workspace_channels * users) to O(distinct_channels_with_local_users). For example, if a 50-channel workspace has 200 users spread evenly across 20 servers (10 users each), per-user subscriptions cost up to 50 × 10 = 500 per server, while per-channel subscriptions cost at most 50: at least a 10x reduction in subscription overhead.

But there's a deeper trap: if a server crashes and all its users reconnect to other servers, those servers now subscribe to channels that were previously handled elsewhere. The subscription count can spike 5x in seconds. That's why Slack pre-allocates a subscription budget per server and rejects new channel subscriptions if the budget is exceeded — the user gets a 'try again' signal and connects to a different server via the load balancer.
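The two ideas above, subscribing only when the first local user joins a channel and refusing new subscriptions past a budget, can be sketched with a reference-counted map. `LocalSubscriptionMap`, its `Result` enum, and the fixed integer budget are hypothetical names and simplifications chosen for this example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-server map: one pub/sub subscription per channel that has local users. */
final class LocalSubscriptionMap {
    enum Result { SUBSCRIBED, ALREADY_SUBSCRIBED, REJECTED_OVER_BUDGET }

    private final Map<String, Integer> localUsersPerChannel = new HashMap<>();
    private final int budget;

    LocalSubscriptionMap(int budget) {
        this.budget = budget;
    }

    /** Registers one local user in the channel. SUBSCRIBED means the caller must subscribe in Redis. */
    synchronized Result join(String channelId) {
        Integer count = localUsersPerChannel.get(channelId);
        if (count != null) {
            localUsersPerChannel.put(channelId, count + 1);
            return Result.ALREADY_SUBSCRIBED;
        }
        if (localUsersPerChannel.size() >= budget) {
            // Over budget: the caller sends the client a "try again" signal.
            return Result.REJECTED_OVER_BUDGET;
        }
        localUsersPerChannel.put(channelId, 1);
        return Result.SUBSCRIBED;
    }

    /** Removes one local user. Returns true when the last one left and the caller should unsubscribe. */
    synchronized boolean leave(String channelId) {
        Integer count = localUsersPerChannel.get(channelId);
        if (count == null) {
            return false;
        }
        if (count == 1) {
            localUsersPerChannel.remove(channelId);
            return true;
        }
        localUsersPerChannel.put(channelId, count - 1);
        return false;
    }

    synchronized int activeSubscriptions() {
        return localUsersPerChannel.size();
    }
}
```

The map size is exactly the number of distinct channels with local users, so it is also the number to compare against the budget. A dynamic budget, as described later in this section, would replace the `final int` with a value adjusted from server health.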

Another real-world pattern: Slack uses a two-tier load balancer. The first tier is a layer-4 TCP balancer that distributes raw connections. The second tier is a layer-7 balancer that inspects the WebSocket upgrade request and routes based on workspace ID, so that all users in the same workspace land on the same set of servers. This reduces cross-server traffic for frequently messaged channels.

One more nuance: the subscription budget isn't static. Slack dynamically adjusts it based on server health and CPU. When a server is under load, it lowers its budget, pushing new subscriptions to healthier servers. This is a form of self-healing load balancing.

Here's a real gotcha that will bite you: even with subscription budgets, a sudden burst of new connections (like the Monday morning all-hands) can cause a rapid increase in pending handshakes. If the handshake handler is single-threaded in your event loop, it blocks all other processing — including heartbeats and message delivery. Slack mitigates this by dedicating a separate thread pool for WebSocket handshakes, keeping the event loop free for ongoing connections. You don't want a new connection storm to delay delivery for already-connected users.

One more production pattern: Slack uses connection draining on server shutdown. When a server is about to be decommissioned, it stops accepting new connections, sends a 'go away' frame to each connected client with a 10-second reconnect delay, and waits for all connections to drain before shutting down. Without this, clients get a TCP RST and reconnect immediately, causing a stampede. With draining, the reconnect is spread over 10 seconds, reducing the peak load on the remaining servers.

For global workspaces, WebSocket servers are deployed in multiple regions. Users connect to the nearest regional endpoint via Geo-DNS. A regional load balancer distributes within the region. Cross-region message delivery for users from different regions in the same channel uses a regional pub/sub bridge: each region has its own Redis pub/sub cluster, and a message published in one region is forwarded to other regions through a Kafka-based cross-region bridge. This adds latency but prevents a single region from bottlenecking global fanout.

ConnectionServer.javaJAVA
package io.thecodeforge.slack.websocket;

import io.thecodeforge.slack.model.SubscriptionRegistry;
import io.thecodeforge.slack.pubsub.RedisPubSubClient;
import java.net.http.WebSocket;
import java.util.concurrent.ConcurrentHashMap;

public class ConnectionServer {
    private final ConcurrentHashMap<String, WebSocket> connections = new ConcurrentHashMap<>();
    private final SubscriptionRegistry registry = new SubscriptionRegistry();
    private final RedisPubSubClient pubSubClient = new RedisPubSubClient();

    public void onNewConnection(WebSocket ws, String userId) {
        connections.put(userId, ws);
        registry.userConnected(userId);
        pubSubClient.subscribeToWorkspaceChannels(registry.getWorkspaceIds());
    }

    public void broadcastToChannel(String channelId, String message) {
        registry.getChannelUsers(channelId).stream()
            .map(connections::get)
            // java.net.http.WebSocket has no isOpen(); isOutputClosed() is the available check
            .filter(ws -> ws != null && !ws.isOutputClosed())
            .forEach(ws -> ws.sendText(message, true));
    }
}
Hidden Cost of Connection Storms
Even with per-server subscription filters, a server failure can cause a chain reaction. Always have circuit breakers and connection draining.
Production Insight
Subscription per-channel per-server is key to scaling WebSocket servers.
Load balancer must use consistent hashing to avoid reconnection storms.
Rule: never trust a single-threaded handshake handler under load.
Connection draining prevents reconnection stampede.
Subscription budgets protect against cascading overload.
Monitor pending handshake count vs active connections as early warning.
Geo-DNS + regional pub/sub bridges are essential for global workspaces.
Key Takeaway
WebSocket connections are cheap, but management is expensive.
Filter subscriptions per-server to reduce fanout.
Connection draining is mandatory for graceful scaling.
Handshake storms kill event loops — isolate them.
Load balancer choice dictates reconnection behavior.
The subscription budget is your safety valve.
For global scale, region-local clusters with async cross-region bridges.
WebSocket Server Scaling Decision Tree
If: Server memory > 80%?
Use: Check connection count per server. If unequal, verify consistent hash distribution. Adjust hash ring.
If: Pending handshake count > 2x active connections?
Use: Alert! Likely a connection storm. Activate circuit breaker: drop non-essential subscriptions. Scale up servers.
If: High cross-server traffic?
Use: Use layer-7 routing to pin workspace users to same servers. Reduces inter-server pub/sub messages.
If: Cross-region latency > 200ms?
Use: Deploy regional WebSocket clusters. Use Kafka bridge for cross-region message forwarding.

Message Fanout and Pub/Sub Architecture

When a user sends a message, it must be delivered to all other users in the channel in near real-time. The naive approach would be to have the sending server open a direct connection to every recipient. That doesn't scale. Slack uses Redis pub/sub as an intermediary: the sending server publishes the message to a Redis channel named after the workspace or channel. All WebSocket servers that have subscribers for that channel receive the message and forward it to their local connections.

But this introduces a write amplification problem: each message is written to Redis once by the sender, but read N times (once per server that has subscribers). If 50 servers each have at least one subscriber in the channel, that's 50 pub/sub reads per message. The cost isn't negligible — Redis pub/sub is O(N) per message where N is the number of subscribing servers.
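To make the cost concrete, the number of pub/sub deliveries per message is the number of distinct servers holding at least one connected member of the channel, not the member count. The helper below is a small illustrative sketch. `FanoutCost` and the `serverByUser` map are assumed names, not part of any Slack or Redis API.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class FanoutCost {
    /**
     * Deliveries Redis makes for one published message: one per distinct server
     * that hosts at least one connected member of the channel.
     */
    static int deliveriesPerMessage(Set<String> channelMembers, Map<String, String> serverByUser) {
        Set<String> servers = new HashSet<>();
        for (String userId : channelMembers) {
            String server = serverByUser.get(userId);
            if (server != null) {
                servers.add(server); // users with no entry are offline and cost nothing here
            }
        }
        return servers.size();
    }
}
```

With members spread one per server the count approaches the channel size, and with members packed onto few servers it approaches one, which is why routing and subscription filtering change the fanout bill so much.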

Slack mitigates this in two ways. First

FanoutService.javaJAVA
package io.thecodeforge.slack.service;

import io.thecodeforge.slack.model.Message;
import io.thecodeforge.slack.pubsub.RedisPubSubClient;
import io.thecodeforge.slack.pubsub.KafkaClient;

public class FanoutService {
    private final RedisPubSubClient redisPubSub;
    private final KafkaClient kafkaClient;

    public FanoutService(RedisPubSubClient redisPubSub, KafkaClient kafkaClient) {
        this.redisPubSub = redisPubSub;
        this.kafkaClient = kafkaClient;
    }

    public void publish(String channelId, Message msg) {
        // Publish to Redis for live delivery
        redisPubSub.publish(channelId, msg);
        // Publish to Kafka for async processing (offline, indexing)
        kafkaClient.send("slack-messages", channelId, msg);
    }
}
Fanout Isolation
Always separate the live fanout (Redis pub/sub) from async processing (Kafka). If Kafka lags, it shouldn't slow down message delivery.
Production Insight
Fanout is O(N) where N = servers with subscribers, not total users.
Batching multiple messages into one pub/sub event cuts overhead.
Dedicated Redis instance for pub/sub prevents cache eviction.
Subscription limit per server prevents buffer overflow.
Fallback to direct TCP relay when pub/sub latency spikes.
Rule: measure pub/sub reads per second, not just writes.
Cross-region fanout adds latency but must be async to avoid global bottlenecks.
Key Takeaway
Fanout is the scaling bottleneck: optimize for reads per message.
Batching and per-server filtering reduce Redis load.
Kafka decouples async processing from real-time path.
Monitor pub/sub memory and latency aggressively.
Have a fallback plan for when Redis buckles.
Rule: write once, read smart, not to everyone.
Cross-region fanout requires regional isolation with async bridges.
Fanout Strategy Decision Tree
If: Channel size > 1000 members?
Use: Use broadcast channel with fanout tree instead of flat pub/sub. Optionally send via Kafka for offline.
If: Pub/sub memory > 1GB?
Use: Reduce subscription count per server or add Redis shards. Check for stale subscriptions.
If: Message loss acceptable?
Use: Use fire-and-forget pub/sub for live delivery. Recovery from persistent store on reconnect.
If: Latency > 500ms?
Use: Switch to direct TCP relay between servers. Investigate Redis CPU or network bottleneck.
If: Cross-region delivery required?
Use: Use Kafka bridge per region pair. Accept 50-100ms additional latency for inter-region messages.

Message Storage and Retrieval with Cassandra

Slack stores all messages in Cassandra for durability, search, and offline catch-up. The primary table uses a composite primary key: channel_id as partition key, and a time-based clustering column (e.g., message_timestamp). This allows efficient retrieval of messages for a channel in chronological order.

However, a hot partition problem emerges for busy channels. A channel with 10,000 messages a day will have all writes hitting a single partition. To spread the load, Slack uses a 'time-bucketed' row key: partition key is (channel_id + day_bucket). Thus, each day's messages are in a separate partition. For catch-up, the client requests messages from the last N days, and the backend merges results from multiple partitions.

For offline catch-up, Slack uses a cursor-based approach: the client stores the timestamp of the last seen message. On reconnect, it queries Cassandra for messages with timestamp > cursor, limited to a window (e.g., last 7 days). This avoids scanning the entire channel history.

But there's a subtlety: Cassandra is eventually consistent. If a user just sent a message and immediately goes offline, the message might not be replicated to all replicas before the next read. The catch-up uses LOCAL_QUORUM consistency for both read and write to ensure strong consistency within the datacenter. Cross-datacenter reads use ONE, but the cursor is read at LOCAL_QUORUM to avoid stale cursor.

Slack also uses a write-behind cache (Redis) for the last 1000 messages per channel. When a client requests channel history, the server checks the cache first. If the cache is warm, it returns immediately. If not, it queries Cassandra. This reduces read pressure on Cassandra for frequently accessed channels.

One production issue: tombstone accumulation. When messages are deleted (edit or delete), Cassandra leaves tombstones. Frequent deletes in a busy channel cause tombstone overload, leading to read timeouts. Slack uses a TTL-based approach: messages are automatically deleted after 90 days (or workspace retention policy). For explicit deletes, they use a 'soft delete' that marks the message as deleted without creating a tombstone; a background compaction job removes them later.

Another gotcha: the time-bucketing must align with query patterns. If a user catches up after being offline for 30 days, the query hits 30 partitions. That's acceptable if each partition is small. But if the channel is very busy, each partition can be large. Slack uses a variable bucket size: for high-traffic channels, buckets are smaller (e.g., 6 hours). For low-traffic channels, buckets are larger (e.g., 7 days). The bucket duration is calculated based on the channel's historical message rate.
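The bucket arithmetic can be sketched as follows. This is an illustrative helper under stated assumptions: `TimeBuckets`, the `channelId:bucket` key format, and the 1 msg/s middle threshold are invented for the example. The 6-hour, 1-day, and 7-day sizes and the 100 msg/s cutoff are taken from this article.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

final class TimeBuckets {
    /** Bucket index: number of whole bucket widths since the Unix epoch. */
    static long bucketFor(Instant timestamp, Duration bucketSize) {
        return Math.floorDiv(timestamp.toEpochMilli(), bucketSize.toMillis());
    }

    /** Partition keys covering the range from cursor to now, newest bucket first. */
    static List<String> partitionsForCatchUp(String channelId, Instant cursor, Instant now,
                                             Duration bucketSize) {
        long first = bucketFor(cursor, bucketSize);
        long last = bucketFor(now, bucketSize);
        List<String> keys = new ArrayList<>();
        for (long bucket = last; bucket >= first; bucket--) {
            keys.add(channelId + ":" + bucket); // key format is an assumption
        }
        return keys;
    }

    /** Picks a bucket width from the channel's message rate. The 1 msg/s threshold is hypothetical. */
    static Duration bucketSizeFor(double messagesPerSecond) {
        if (messagesPerSecond > 100) {
            return Duration.ofHours(6);
        }
        if (messagesPerSecond > 1) {
            return Duration.ofDays(1);
        }
        return Duration.ofDays(7);
    }
}
```

One consequence worth noting: if a channel's bucket width changes over time, the reader has to know which width was in effect for each period, so the chosen width would need to be stored alongside the channel rather than recomputed at read time.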

For multi-region storage, Slack uses separate Cassandra clusters in each region. Messages are written to the local cluster first, then replicated asynchronously to other regions via Cassandra's built-in cross-datacenter replication. This keeps write latency low (local quorum) but introduces eventual consistency across regions. When a user switches regions (e.g., travels), the cursor includes the region ID, and the catch-up service queries both local and remote clusters, merging results sorted by timestamp. This is critical to avoid users seeing 'missing messages' when they travel.

messages_by_channel.cqlCQL
-- Primary table for channel messages
CREATE TABLE IF NOT EXISTS io_thecodeforge_slack.messages_by_channel (
    channel_id text,
    day_bucket text,
    message_timestamp timeuuid,
    user_id text,
    content text,
    idempotency_key text,
    region_id text,
    PRIMARY KEY ((channel_id, day_bucket), message_timestamp)
) WITH CLUSTERING ORDER BY (message_timestamp DESC)
    AND default_time_to_live = 7776000; -- 90 days
Time Bucketing
  • Partition key = channel_id + day_bucket to spread writes
  • Clustering column = timeuuid for chronological order
  • Query merges responses from N buckets based on time range
  • Bucket size adjusts dynamically: small for hot channels, large for cold
  • TTL expires messages after the retention period, avoiding explicit deletes (expired cells still become tombstones until compaction removes them)
  • region_id is stored with each message (a regular column in the schema above) for cross-datacenter merging
Production Insight
Time bucketing prevents hot partitions at the cost of multi-partition reads.
Dynamic bucket size optimizes for both hot and cold channels.
Cursor-based catch-up avoids full scans.
LOCAL_QUORUM for cursor reads prevents stale 'last seen' timestamps.
Write-behind cache reduces Cassandra read load.
Soft deletes over real deletes to avoid tombstone issues.
Rule: always plan for tombstone accumulation in Cassandra.
Cross-region: separate clusters with async replication; cursor must include region ID.
Key Takeaway
Time-bucketed row keys spread writes but increase read fan-out.
Cursor-based catch-up is efficient for offline users.
Write-behind cache cuts read pressure by 80% for hot channels.
Tombstones are the silent killer of Cassandra reads.
Dynamic bucket sizing adapts to channel activity.
Rule: partition key design determines scalability.
For global users, merge cross-region clusters with region-aware cursors.
Storage Strategy Decision Tree
If: Channel write rate > 100 msg/s?
Use: Use 6-hour buckets. Prepend channel_id with a hash to spread writes further if needed.
If: User return after 30+ days offline?
Use: Query all buckets from past 30 days. If cold storage tier exists, retrieve from there first.
If: High delete rate?
Use: Use soft delete + background compaction. Increase gc_grace_seconds but monitor tombstone count.
If: Cross-region user travel?
Use: Merge results from both local and remote Cassandra clusters. Cursor includes region_id. Use LOCAL_QUORUM for cursor consistency.

Presence Detection at Scale

Slack shows green dots next to users who are currently active. This seems trivial but at scale it's one of the hardest problems. The naive approach — updating a database row every time a user does something — would hammer the database. Instead, Slack uses a heartbeat-based presence service.

Each client sends a heartbeat every 15 seconds via WebSocket. The presence server receives these heartbeats and updates an in-memory map of user -> last_heartbeat_timestamp. A separate goroutine (or thread) runs every 30 seconds and marks users as offline if their last heartbeat is older than 30 seconds. The presence state is published to Redis for other services to consume.

But there's a problem: if the heartbeat interval is too short, you overload the server. If too long, presence feels stale. Slack uses adaptive heartbeat: if the user is idle but connected, the heartbeat interval extends to 60 seconds. If the user is actively interacting (typing, scrolling), it drops to 5 seconds. This reduces server load by 60% during idle periods.
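A sketch of the interval selection is below. The 5, 15, and 60 second intervals come from the text. Everything else is an assumption for illustration: the `AdaptiveHeartbeat` name, the activity thresholds, and in particular the rule that the offline timeout is twice the current interval. That rule is added here because a fixed 30-second timeout would mark a user on a 60-second idle interval as offline between heartbeats.

```java
import java.time.Duration;

final class AdaptiveHeartbeat {
    enum Activity { ACTIVE, NORMAL, IDLE }

    /** Heartbeat interval per activity level, using the values quoted in the text. */
    static Duration intervalFor(Activity activity) {
        switch (activity) {
            case ACTIVE:
                return Duration.ofSeconds(5);
            case IDLE:
                return Duration.ofSeconds(60);
            default:
                return Duration.ofSeconds(15);
        }
    }

    /** Assumed rule: tolerate one missed heartbeat, so the timeout is twice the interval. */
    static Duration timeoutFor(Activity activity) {
        return intervalFor(activity).multipliedBy(2);
    }

    /** Classifies a client by time since its last interaction. Thresholds are hypothetical. */
    static Activity classify(Duration sinceLastInteraction) {
        if (sinceLastInteraction.compareTo(Duration.ofSeconds(10)) < 0) {
            return Activity.ACTIVE;
        }
        if (sinceLastInteraction.compareTo(Duration.ofMinutes(5)) < 0) {
            return Activity.NORMAL;
        }
        return Activity.IDLE;
    }
}
```

For this to work the server has to know which interval each client is currently using, for example by having the client include it in the heartbeat payload.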

Another challenge: when a user disconnects abruptly (e.g., network loss), the heartbeat stops. With a 30-second timeout checked by a 30-second sweep, the system takes between 30 and 60 seconds to mark them offline. That's fine for most use cases, but for real-time collaboration (e.g., pair programming), that feels slow. Slack investigated using TCP keepalives and WebSocket ping/pong to detect disconnection faster, but those are unreliable at scale because intermediate proxies may not forward them. They settled on a hybrid: WebSocket ping/pong every 10 seconds, plus heartbeat to confirm liveness. If ping/pong fails three times, the server initiates a clean disconnect, and the presence is updated immediately rather than waiting for heartbeat expiry.
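The "three failed pings" rule reduces to a consecutive-miss counter per connection. The sketch below assumes a hypothetical `PongTracker` that the connection server consults from its ping timer. It is not tied to any particular WebSocket library.

```java
/** Hypothetical per-connection tracker: close after N consecutive unanswered pings. */
final class PongTracker {
    private final int maxMisses;
    private int consecutiveMisses;

    PongTracker(int maxMisses) {
        this.maxMisses = maxMisses;
    }

    /** Call when a pong arrives: the link is alive, so the counter resets. */
    synchronized void onPong() {
        consecutiveMisses = 0;
    }

    /** Call when a ping goes unanswered. Returns true when the connection should be closed. */
    synchronized boolean onPingTimeout() {
        consecutiveMisses++;
        return consecutiveMisses >= maxMisses;
    }
}
```

With pings every 10 seconds and `maxMisses` set to 3, a dead link is detected in roughly 30 seconds at worst. The gain over heartbeat expiry alone is that the server can publish the offline state the moment the threshold is hit instead of waiting for the next sweep.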

Presence is also broadcast to other users in the same channel. When a user comes online or goes offline, the presence server publishes an event to the appropriate channel's pub/sub topic. Each WebSocket server forwards it to clients. But flooding every client with presence changes of every user in a large channel is expensive. Slack uses debouncing: if a user's presence changes frequently (e.g., flaky network), the server batches updates and sends them at most once per 5 seconds per channel. Client-side dedup prevents flickering.

One production incident: a bug in debouncing caused all presence changes to be delayed by 30 seconds during high churn. The fix was to use a sliding window of last change time per user, not per channel. That reduced the batch window from 5s to 1s without increasing message volume.
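A per-user window can be sketched like this. It is a minimal illustration, assuming a hypothetical `PresenceDebouncer` that emits the first change immediately, holds later changes inside the window, and releases only the latest held state once the window has passed. Timestamps are passed in to keep the example deterministic.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

final class PresenceDebouncer {
    private final long windowMillis;
    private final Map<String, Long> lastEmitAt = new HashMap<>();
    private final Map<String, Boolean> pending = new HashMap<>();

    PresenceDebouncer(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /** Returns true if this change should be broadcast now; otherwise it is held as pending. */
    synchronized boolean onChange(String userId, boolean online, long nowMillis) {
        Long last = lastEmitAt.get(userId);
        if (last == null || nowMillis - last >= windowMillis) {
            lastEmitAt.put(userId, nowMillis);
            pending.remove(userId);
            return true;
        }
        pending.put(userId, online); // keep only the most recent state for this user
        return false;
    }

    /** Called periodically: returns held states whose per-user window has elapsed. */
    synchronized Map<String, Boolean> flush(long nowMillis) {
        Map<String, Boolean> due = new HashMap<>();
        Iterator<Map.Entry<String, Boolean>> it = pending.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Boolean> entry = it.next();
            // A pending entry always has a prior emit time, set in onChange.
            if (nowMillis - lastEmitAt.get(entry.getKey()) >= windowMillis) {
                due.put(entry.getKey(), entry.getValue());
                lastEmitAt.put(entry.getKey(), nowMillis);
                it.remove();
            }
        }
        return due;
    }
}
```

Because the window is tracked per user, one flapping user delays only their own updates. Under a per-channel window, every change in the channel would queue behind the same timer, which matches the delay described in the incident above.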

Another nuance: presence is aggregated per workspace. If a user is in multiple workspaces, they have separate presence for each. The presence server maintains a user_id -> workspace_id map and updates each independently. This prevents a busy workspace from causing false offline marking in another.

For multi-region presence, each region has its own presence server. When a user's connection moves to a new region (e.g., they travel), the old region's presence server marks them offline after the heartbeat timeout, and the new region's server picks up the heartbeats. The region change can cause a brief 1-2 second 'offline' blip. Slack mitigates this by using a global presence cache (Redis with TTL) that reconciles across regions: if a user has heartbeats in any region within the last 30 seconds, they are considered online. This requires cross-region writes to the global cache, which adds latency but avoids flapping.
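The reconciliation rule ("online if any region saw a heartbeat within the last 30 seconds") is simple enough to state as code. In this sketch the global cache is modeled as a plain map from region to last heartbeat time. `GlobalPresence` and the map shape are assumptions standing in for whatever the shared Redis cache actually stores.

```java
import java.util.Map;

final class GlobalPresence {
    /** A user is online if any region recorded a heartbeat within the TTL. */
    static boolean isOnline(Map<String, Long> lastHeartbeatMillisByRegion, long nowMillis,
                            long ttlMillis) {
        for (long heartbeatMillis : lastHeartbeatMillisByRegion.values()) {
            if (nowMillis - heartbeatMillis <= ttlMillis) {
                return true;
            }
        }
        return false;
    }
}
```

Taking the most recent heartbeat across regions is what hides the handover blip: the old region's entry is still fresh while the new region's first heartbeat arrives.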

PresenceService.javaJAVA
package io.thecodeforge.slack.presence;

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class PresenceService {
    private final ConcurrentHashMap<String, Long> heartbeats = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public void start() {
        scheduler.scheduleAtFixedRate(() -> {
            long cutoff = System.currentTimeMillis() - 30_000;
            heartbeats.entrySet().removeIf(e -> e.getValue() < cutoff);
            // publish changes to Redis
        }, 30, 30, TimeUnit.SECONDS);
    }

    public void handleHeartbeat(String userId) {
        heartbeats.put(userId, System.currentTimeMillis());
    }
}
Failure Pattern
If the presence server's expiry thread is too slow (e.g., GC pause), false offline markings occur. Always run it in a separate thread pool with a low-latency GC.
Production Insight
Adaptive heartbeat reduces server load by 60% during idle.
WebSocket ping/pong + heartbeat hybrid fast-tracks disconnect detection.
Debounce presence updates to avoid flooding clients.
Per-user debounce window fixes cascade delays.
Separate presence per workspace prevents cross-contamination.
Rule: presence is a write-heavy system — optimize for writes, tolerate read staleness.
Global presence cache with TTL across regions handles user travel gracefully.
Key Takeaway
Heartbeat-based presence is simple but has latency trade-offs.
Adaptive intervals reduce unnecessary server load.
Ping/pong helps detect disconnection but isn't foolproof.
Debounce updates to prevent client overload.
Per-workspace presence isolation prevents cross-effect.
Rule: presence is write-scalable, read tolerant.
For global travel, cache presence across regions with TTL.
Presence Strategy Decision Tree
If: Flaky network flapping presence?
Use: Increase debounce window to 10s. Client-side dedup for UI.
If: Heartbeat server CPU > 80%?
Use: Offload heartbeat processing to a separate thread pool. Consider scaling presence servers.
If: Real-time collaboration needed (<1s detection)?
Use: Use TCP keepalive and WebSocket ping/pong. Dedicated presence flag update path bypassing heartbeat expiry.
If: User travels across regions?
Use: Deploy regional presence servers. Use global Redis cache with 30s TTL for blended online status.

Search Indexing Architecture

Slack provides full-text search across all messages. This is powered by Elasticsearch, which indexes messages asynchronously. When a message is written to Cassandra, a background worker publishes the message ID and content to a Kafka topic. A separate indexer consumer reads from Kafka and pushes documents to Elasticsearch.

The separation is deliberate: the real-time path (WebSocket delivery) must not be blocked by search indexing. Kafka acts as a durable buffer, absorbing spikes in message volume without back-propagating pressure to the write path.

But there's a subtlety: Elasticsearch is eventually consistent with Cassandra. If a user searches immediately after sending a message, they might not see it in the index for 1-2 seconds. Slack handles this by returning the most recent messages from Cassandra in the search results, merging them with Elasticsearch results client-side. That way, the user always sees their own message instantly.
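The merge step can be sketched roughly as follows. This is an illustration under stated assumptions: `SearchResultMerger` and `Hit` are hypothetical names, the recent messages are assumed to be already filtered against the query, and letting the primary-store copy win on conflict is a design choice made for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SearchResultMerger {
    // Hypothetical minimal search hit; real results would carry more fields.
    static final class Hit {
        final String messageId;
        final long timestamp;
        final String text;

        Hit(String messageId, long timestamp, String text) {
            this.messageId = messageId;
            this.timestamp = timestamp;
            this.text = text;
        }

        long timestamp() {
            return timestamp;
        }
    }

    // Combines index hits with recent primary-store hits: no duplicates, newest first.
    static List<Hit> merge(List<Hit> fromIndex, List<Hit> recentFromStore) {
        Map<String, Hit> byId = new LinkedHashMap<>();
        for (Hit h : fromIndex) {
            byId.put(h.messageId, h);
        }
        // The primary store is the source of truth, so its copy replaces the indexed one.
        for (Hit h : recentFromStore) {
            byId.put(h.messageId, h);
        }
        List<Hit> merged = new ArrayList<>(byId.values());
        merged.sort(Comparator.comparingLong(Hit::timestamp).reversed());
        return merged;
    }
}
```

Keying on the message ID is what keeps a message from appearing twice once the index catches up.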

Another production concern: indexer lag. If the Kafka consumer falls behind (e.g., due to an Elasticsearch cluster issue), search becomes stale. Slack monitors consumer lag and alerts if it exceeds 30 seconds. When lag is high, they temporarily redirect search queries to Cassandra's secondary index (which is slower but always up-to-date).
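The lag-based switch between read paths reduces to a threshold check, sketched below. The class, enum, and method names are hypothetical, and estimating lag as "newest produced timestamp minus newest indexed timestamp" is an assumption for the example; the text only states the 30-second threshold.

```java
class SearchReadPathSelector {
    enum ReadPath { ELASTICSEARCH, CASSANDRA_FALLBACK }

    private static final long MAX_LAG_MS = 30_000; // 30-second threshold from the text

    // lagMs: how far the indexer trails the newest written message, in milliseconds.
    static ReadPath choose(long lagMs) {
        return lagMs > MAX_LAG_MS ? ReadPath.CASSANDRA_FALLBACK : ReadPath.ELASTICSEARCH;
    }

    // Hypothetical lag estimate; clamped at zero in case the clocks disagree.
    static long estimateLagMs(long newestProducedTsMs, long newestIndexedTsMs) {
        return Math.max(0, newestProducedTsMs - newestIndexedTsMs);
    }
}
```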

The trade-off is clear: you trade indexing cost and latency for query performance. Elasticsearch can handle complex full-text queries in under 100ms, while Cassandra's secondary index would take seconds on the same query. At Slack's scale, the cost of maintaining an Elasticsearch cluster is acceptable for the user experience gain.

One advanced pattern: Slack uses a two-level indexing approach. Frequently searched terms are indexed at higher priority, while rare terms are batched. This ensures that common searches (like 'team announcement' or 'bug fix') are always fast, even during indexing backlogs. The priority is determined by a sliding window of search query logs.

Here's a real failure pattern: when Elasticsearch undergoes a rolling restart, it stops accepting writes for a brief period. The Kafka consumer continues to read but fails to index. The lag grows. If the restart takes too long, the lag can exceed the retention period of the Kafka topic, causing permanent data loss. Slack mitigates this by pausing the consumer before a planned ES restart and resuming after all nodes are healthy. They also set Kafka topic retention to 7 days to give a safety buffer.

Another nuance: the merge of recent messages from Cassandra into search results is not free. For workspaces with heavy write rates, querying Cassandra for recent messages adds load to the primary store. Slack caches the last 1000 messages per workspace in Redis with a 5-second TTL, so most search queries hit the cache instead of Cassandra. This reduces read pressure on the primary database and speeds up search results.

One more detail: the search indexer also handles message edits and deletions. When a message is edited, the indexer receives the updated document and replaces the existing one. But deletions are trickier — if a user deletes a message, the indexer must remove it from Elasticsearch. If the indexer is backlogged, a deleted message might still appear in search results for a while. Slack uses a 'delete marker' approach: instead of actually deleting the document immediately, they mark it as deleted in the index and let a background cleanup job remove it after 24 hours. This prevents race conditions where a delete event arrives before the original message is indexed.
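The delete-marker idea, including the case where the delete arrives before the original message, can be sketched with an in-memory stand-in for the index. Everything here (`SoftDeleteIndex`, its methods, the map-backed storage) is hypothetical and only shows the ordering logic; a real indexer would issue the equivalent writes to Elasticsearch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for the search index.
class SoftDeleteIndex {
    private static final long RETENTION_MS = 24L * 60 * 60 * 1000; // 24-hour cleanup delay from the text

    private static final class Doc {
        String text;            // null when only a delete marker exists
        long deletedAtMs = -1;  // -1 means "not deleted"
    }

    private final Map<String, Doc> docs = new HashMap<>();

    // Index or re-index a message; ignored if a delete marker is already present.
    void index(String messageId, String text) {
        Doc d = docs.computeIfAbsent(messageId, id -> new Doc());
        if (d.deletedAtMs < 0) {
            d.text = text;
        }
    }

    // Record a delete marker, even if the message has not been indexed yet.
    void markDeleted(String messageId, long nowMs) {
        docs.computeIfAbsent(messageId, id -> new Doc()).deletedAtMs = nowMs;
    }

    // Return IDs of live (non-deleted) messages whose text contains the term.
    List<String> search(String term) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Doc> e : docs.entrySet()) {
            Doc d = e.getValue();
            if (d.deletedAtMs < 0 && d.text != null && d.text.contains(term)) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }

    // Background cleanup: drop entries whose marker is older than the retention period.
    void purgeExpired(long nowMs) {
        docs.values().removeIf(d -> d.deletedAtMs >= 0 && nowMs - d.deletedAtMs >= RETENTION_MS);
    }

    int size() {
        return docs.size();
    }
}
```

Keeping the marker around for the retention period is what lets a late-arriving index event be recognized and ignored.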

For multi-region search, Slack maintains an Elasticsearch cluster per region. The indexer in each region consumes from its local Kafka topic. When a user searches, the query is routed to the user's home region. If the user is traveling, the query proxies to their home region, or a cross-region merge is performed for global accounts. This keeps indexing latency low and avoids a single global ES cluster becoming a bottleneck.

SearchIndexer.java (Java)
package io.thecodeforge.slack.search;

import io.thecodeforge.slack.model.Message;
import java.io.IOException;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;

public class SearchIndexer {
    private final RestHighLevelClient esClient;

    public SearchIndexer(RestHighLevelClient esClient) {
        this.esClient = esClient;
    }

    // Called once per Kafka record; indexes the message into Elasticsearch.
    public void onMessage(ConsumerRecord<String, Message> record) throws IOException {
        Message msg = record.value();
        // Using the idempotency key as the document ID means a redelivered record
        // overwrites the existing document instead of creating a duplicate.
        IndexRequest request = new IndexRequest("slack_messages")
            .id(msg.getIdempotencyKey())
            .source("channel_id", msg.channelId(),
                    "user_id", msg.userId(),
                    "text", msg.text(),
                    "timestamp", msg.timestamp());
        esClient.index(request, RequestOptions.DEFAULT);
    }
}
Forge Insight
Always decouple search indexing from the real-time path. Use a queue (Kafka) as a buffer. Monitor lag. Have a fallback plan for when the index is behind.
Production Insight
Indexer lag causes stale search results.
Monitor Kafka consumer lag and alert if >30s.
Rule: always have a fallback read path for search (e.g., Cassandra secondary index) when the primary index is lagging.
Merging recent messages from Cassandra into search results mitigates the eventual consistency gap.
Cache recent messages in Redis to avoid hammering Cassandra on every search.
Pause consumer before planned Elasticsearch restarts to avoid data loss from lag exceeding topic retention.
Use soft-delete markers in index to handle deletions gracefully during backlog.
For global scale, deploy per-region ES clusters with local indexers.
Key Takeaway
Search indexing must be async and buffered.
Kafka absorbs write spikes and decouples systems.
Monitor indexer lag aggressively.
Merge recent writes into search results to hide indexing delay.
Cache recent results to protect primary store from search overload.
Pause consumer during planned Elasticsearch restarts to avoid data loss.
Handle deletes with soft markers to avoid race conditions with backlog.
Per-region ES clusters keep indexing local and fast.
Search Indexing Strategy Decision Tree
If: Indexer lag < 30s?
Use: Serve search from Elasticsearch. Merge recent (cached) messages from Cassandra.
If: Indexer lag > 30s?
Use: Switch to Cassandra secondary index for all queries. Accept slower search (seconds) over stale results.
If: Elasticsearch restart planned?
Use: Pause the Kafka consumer before restart. Resume after all nodes are green. Verify lag is acceptable.
If: Write rate very high (>10k msg/s)?
Use: Use priority indexing: frequent terms first, batch rare terms. Cache recent messages in Redis for merge.
If: User searches from different region?
Use: Route query to user's home region ES cluster. For global accounts, merge results from multiple regions.

File Sharing and Attachment Handling

Slack handles file uploads — images, documents, code snippets — and must generate thumbnails, scan for malware, and make files searchable. The file upload path is separate from the message path to avoid blocking message delivery.

When a user uploads a file, the frontend sends it to a dedicated file upload service via HTTPS. This service stores the file in an S3-compatible object store, generates a unique file ID, and then publishes the file metadata (file ID, workspace ID, channel ID, uploader) to a Kafka topic. A separate worker consumes this topic and performs: thumbnail generation (for images), virus scanning (using ClamAV), and content indexing (for full-text search within documents).

The file itself is stored externally; the message that includes the file contains only the file ID and a URL. When a client displays the message, it fetches the file URL from a CDN, which caches the file for faster delivery.

One production challenge: thumbnail generation is CPU-intensive. Slack uses a pool of worker services dedicated to thumbnail processing. If the pool falls behind, thumbnails appear broken (placeholder) for seconds. Slack monitors the thumbnail generation queue depth and auto-scales workers when it exceeds 1000 pending jobs.

Another gotcha: file size limits. Slack enforces per-file size limits (e.g., 1GB for paid plans). The upload service must reject oversized files early — before they reach the object store. They implement streaming validation: read the Content-Length header and reject if > limit. For chunked uploads, they use a proxy that buffers the first few MB to check MIME type but rejects if total size exceeds limit based on chunk metadata.
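For uploads where Content-Length cannot be trusted or is absent, one way to enforce the limit while streaming is to count bytes as they pass through and abort once the limit is crossed. The sketch below is an assumption about how this could be done, not the article's proxy: `SizeLimitedInputStream` is a hypothetical name, and it only counts bytes actually read (skipped bytes are not counted).

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical wrapper: fails the upload as soon as more than maxBytes have been read.
class SizeLimitedInputStream extends FilterInputStream {
    private final long maxBytes;
    private long bytesRead;

    SizeLimitedInputStream(InputStream in, long maxBytes) {
        super(in);
        this.maxBytes = maxBytes;
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) {
            count(1);
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            count(n);
        }
        return n;
    }

    // Track the running total and abort once the limit is exceeded.
    private void count(long n) throws IOException {
        bytesRead += n;
        if (bytesRead > maxBytes) {
            throw new IOException("Upload exceeds limit of " + maxBytes + " bytes");
        }
    }
}
```

Wrapping the request body in this stream before handing it to the storage client means an oversized upload fails partway through rather than after the whole file has been written.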

Security: malware scanning must happen before the file becomes accessible. Slack uses a 'scan first, serve later' model. The file is stored in a staging bucket with restricted access. After scanning passes, it moves to the public bucket. If scanning fails, the file is quarantined and the user is notified. This means thumbnails are generated only after scanning passes, which adds latency of 1-5 seconds. For images, they pre-generate a small thumbnail from the first chunk to show in the message immediately, but the full image is blocked until scan completes.

One more nuance: deduplication of file uploads. If the same file is uploaded to multiple channels, Slack uses content hashing (SHA-256) to detect duplicates and store only one copy. The file ID maps to the content hash. This saves storage space but complicates deletion: you can't delete a file if it's referenced by other messages. Slack uses reference counting per content hash.
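Content-hash deduplication with reference counting can be sketched as below. The class and method names are hypothetical, the maps stand in for the object store and metadata database, and the sketch assumes each file ID is uploaded exactly once.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

class DedupFileStore {
    private final Map<String, byte[]> blobsByHash = new HashMap<>();   // content hash -> stored bytes
    private final Map<String, Integer> refCounts = new HashMap<>();    // content hash -> live references
    private final Map<String, String> hashByFileId = new HashMap<>();  // file ID -> content hash

    // Registers a file ID; the bytes are stored only if this content is new.
    void upload(String fileId, byte[] content) {
        String hash = sha256Hex(content);
        blobsByHash.putIfAbsent(hash, content.clone());
        refCounts.merge(hash, 1, Integer::sum);
        hashByFileId.put(fileId, hash);
    }

    // Drops one reference; the blob is removed only when no file ID points at it.
    void delete(String fileId) {
        String hash = hashByFileId.remove(fileId);
        if (hash == null) {
            return; // unknown or already deleted
        }
        int remaining = refCounts.merge(hash, -1, Integer::sum);
        if (remaining <= 0) {
            refCounts.remove(hash);
            blobsByHash.remove(hash);
        }
    }

    int storedBlobCount() {
        return blobsByHash.size();
    }

    static String sha256Hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(content);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

Deleting one message's file therefore only decrements the count; the stored bytes go away when the last reference does.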

Production incident: a workspace uploaded a large video file that took 10 minutes to scan. During that time, users couldn't see the file in the channel. The fix was to show a placeholder immediately with a progress indicator, and update the thumbnail when scanning completes. They also implemented a scanning timeout: if a scan takes longer than 5 minutes, the file is moved to an offline scan queue to avoid blocking the upload path.

For multi-region file storage, Slack uses region-specific S3 buckets. When a file is uploaded, it's stored in the nearest region's bucket. For cross-region access, a CDN with multiple origins handles the latency. If a user in Europe views a file stored in the US, the CDN fetches from the US origin on the first request and caches in a European edge location. This avoids replicating every file to every region, saving storage costs.

FileUploadHandler.java (Java)
package io.thecodeforge.slack.file;

import io.thecodeforge.slack.model.FileMetadata;
import io.thecodeforge.slack.upload.UploadService;
import io.thecodeforge.slack.scan.ScanService;
import java.io.InputStream;

public class FileUploadHandler {
    private static final long MAX_FILE_SIZE_BYTES = 1_073_741_824L; // 1GB (2^30 bytes)

    private final UploadService uploadService;
    private final ScanService scanService;

    public FileUploadHandler(UploadService uploadService, ScanService scanService) {
        this.uploadService = uploadService;
        this.scanService = scanService;
    }

    public void handleUpload(InputStream fileStream, String fileName, long size) {
        // Reject oversized files early, before any bytes reach the object store.
        // FileTooLargeException is assumed to be an unchecked exception in this package.
        if (size > MAX_FILE_SIZE_BYTES) {
            throw new FileTooLargeException("File exceeds maximum size");
        }
        // Store in staging bucket
        String fileId = uploadService.storeToStaging(fileStream);
        // Queue for scanning and thumbnail generation
        scanService.queueScan(fileId);
    }
}
Scanning Bottleneck
Malware scanning adds 1-5s latency. Use pre-generated preview for images to avoid blocking the user experience. Monitor scan queue depth to avoid backlogs.
Production Insight
File upload must be async and decoupled from message path.
Thumbnail generation is CPU-bound: auto-scale workers based on queue depth.
Enforce file size limits before upload to avoid waste.
Staging bucket with restricted access prevents premature file exposure.
Content hashing deduplicates files but complicates deletion.
Placeholder + progress indicator improves UX during scanning.
Rule: file upload and message delivery should never block each other.
CDN with multi-region origins avoids replicating files globally.
Key Takeaway
File uploads are async and isolated from message path.
Scan before serve for security, but manage latency with previews.
Content dedup saves space but adds complexity.
Auto-scale workers based on queue depth to avoid bottlenecks.
Always show progress to users; silence kills engagement.
Rule: separate upload concerns from real-time delivery.
CDN with regional origins handles global file access without replication.
File Handling Decision Tree
If: File size > 1GB?
Use: Reject immediately. Check Content-Length header or chunk metadata.
If: Scan queue depth > 1000?
Use: Scale up scanning workers. If scaling not possible, prioritize smaller files first.
If: Same file uploaded to multiple channels?
Use: Use content hash dedup. Store reference counts per hash. Handle deletion carefully.
If: Scan takes > 5 minutes?
Use: Move to offline scan queue. Show file as 'processing' with placeholder.
If: Global file access required?
Use: Store in region-specific S3 bucket. Use CDN with multi-origin support. Cache on edge for fast retrieval.
● Production incident · POST-MORTEM · severity: high

The Monday Morning WebSocket Storm

Symptom
Users across the workspace see 'Connecting...' indefinitely. The WebSocket handshake latency spikes from 50ms to 30s. The channel server's memory grows until OOM killer intervenes.
Assumption
The team assumed that increasing the number of WebSocket servers would linearly increase connection capacity. They didn't account for the fanout amplification when one message is broadcast to every active connection.
Root cause
Each WebSocket server subscribes to a global Redis pub/sub channel for every workspace it hosts, so it receives every message published in that workspace, whether or not the message concerns its local connections. When a single message is published to a large channel (200 users), all 10 servers receive and process it before forwarding it to their local connections (20 each, 200 deliveries in total). One publish therefore costs 10 pub/sub events on the message processing path, a 10x amplification that applies to every message in the workspace. The CPU on each server saturates, causing WebSocket handshake timeouts for new connections.
Fix
1. Implement fan-out filtering: only forward messages to servers that actually have subscribers in that channel. 2. Add a per-server connection count metric and auto-scale WebSocket servers based on pending handshakes. 3. Introduce a 'connection storm' circuit breaker that drops non-essential subscriptions when handshake latency exceeds 1s.
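Fix 1 (fan-out filtering) amounts to keeping a routing table of which servers hold at least one live connection per channel, then publishing only to those servers. The sketch below is a hedged illustration: `ChannelServerRegistry` and its methods are hypothetical names, and a real deployment would keep this table in shared storage rather than in one process.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical routing table: which WebSocket servers have members of each channel connected.
class ChannelServerRegistry {
    // channelId -> (serverId -> number of local connections in that channel)
    private final Map<String, Map<String, Integer>> serversByChannel = new ConcurrentHashMap<>();

    void onConnect(String channelId, String serverId) {
        serversByChannel.computeIfAbsent(channelId, c -> new ConcurrentHashMap<>())
                        .merge(serverId, 1, Integer::sum);
    }

    void onDisconnect(String channelId, String serverId) {
        Map<String, Integer> servers = serversByChannel.get(channelId);
        if (servers == null) {
            return;
        }
        // Returning null from the remapping function removes the server entry.
        servers.computeIfPresent(serverId, (id, count) -> count > 1 ? count - 1 : null);
    }

    // Only these servers need the message; every other server is skipped.
    Set<String> targetsFor(String channelId) {
        Map<String, Integer> servers = serversByChannel.get(channelId);
        if (servers == null) {
            return Collections.emptySet();
        }
        return new HashSet<>(servers.keySet());
    }
}
```

With this in place, a publish costs one pub/sub event per server that actually has channel members connected, instead of one per server in the fleet.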
Key lesson
  • Message fanout amplification is the hidden cost of broadcast — measure read vs write pressure separately.
  • Don't treat WebSocket connection count as the only scaling metric; consider messages-per-second per server.
  • Always include a circuit breaker for connection storms — they happen when you least expect them.
  • Pre-allocate a subscription budget per server to prevent cascading overload when one server fails.
  • Monitor pending handshake count vs active connections — a ratio >2:1 means you're about to hit capacity.
  • For multi-region deployments, add a cross-region circuit breaker that prevents inter-region pub/sub from flooding during regional failover.
Production debug guide: Symptom → Action Guide for Production Incidents (8 entries)
Symptom · 01
Messages not delivered to a subset of users
Fix
Check WebSocket connection state on the affected user's server. Run io.thecodeforge.slack.health.websocket --check <user-id>
Symptom · 02
High message latency (> 500ms) across the workspace
Fix
Monitor Redis pub/sub aggregation delay. Run io.thecodeforge.slack.pubsub.latency --workspace-id <id>
Symptom · 03
Presence status shows users offline when they are active
Fix
Verify presence heartbeat interval (default 15s). Check if the presence server is overloaded due to a rapid connection storm.
Symptom · 04
Cassandra read latency spikes for channel history
Fix
Check if the channel's row key is hitting a hot partition. Use io.thecodeforge.slack.storage.partition-stats --channel-id <id>
Symptom · 05
Offline catch-up delivers duplicate messages
Fix
Check cursor LAST_READ_TS consistency. Run io.thecodeforge.slack.cursor.verify --user-id <id>
Symptom · 06
Search index not reflecting recent messages
Fix
Check Kafka consumer lag for the indexing pipeline. Run io.thecodeforge.slack.search.indexer-lag --workspace-id <id>
Symptom · 07
Messages arriving out of order on reconnect
Fix
Verify the sub-bucket merge logic in the catch-up service. Ensure messages are sorted by (channel_bucket, message_ts). Run io.thecodeforge.slack.delivery.order-check --user-id <id>
Symptom · 08
Cross-region message delivery > 2s
Fix
Check inter-region pub/sub latency. Run io.thecodeforge.slack.multiregion.latency --source-region us-east-1 --destination-region eu-west-1
★ Quick Debug Cheat Sheet for Slack System Design: Common production issues and the exact commands to diagnose them, no theory.
WebSocket connection refused
Immediate action
Check server capacity on the connection pool
Commands
netstat -an | grep :443 | wc -l
docker compose exec websocket-server io.thecodeforge.slack.connections.active
Fix now
Increase WebSocket server count by 2 via the orchestrator, then restart the connection pool.
Message fanout delay > 1s
Immediate action
Inspect Redis pub/sub latency
Commands
redis-cli --latency -h <redis-endpoint> -p 6379
io.thecodeforge.slack.pubsub.queue-depth --workspace-id $ws
Fix now
Add two more Redis pub/sub nodes and rebalance channel subscriptions.
Presence status flapping
Immediate action
Check presence heartbeat processing rate
Commands
io.thecodeforge.slack.presence.heartbeat-check --threshold 30s
io.thecodeforge.slack.presence.backlog | head -100
Fix now
Increase presence server count and reduce heartbeat interval to 10s.
Cassandra read timeout on channel history
Immediate action
Check partition heat map
Commands
io.thecodeforge.slack.storage.hot-partitions --keyspace slack_messages
nodetool tablehistograms slack_messages messages_by_channel
Fix now
Enable bucketed row keys: prepend a hash prefix to the channel ID to spread writes.
Duplicate messages on reconnect
Immediate action
Check cursor consistency and client-side dedup window
Commands
io.thecodeforge.slack.cursor.staleness --user-id <id>
io.thecodeforge.slack.dedup.window-size --server-id <id>
Fix now
Increase dedup window to 30s and ensure cursor is read with LOCAL_QUORUM.
Search index stale by more than 30 seconds
Immediate action
Check Kafka consumer lag
Commands
kafka-consumer-groups --bootstrap-server <broker> --group search-indexer --describe
curl <elasticsearch>/_cat/indices/slack_messages
Fix now
Restart the search indexer consumer and increase partition count if lag persists.
Cross-region message delivery > 2s
Immediate action
Check inter-region pub/sub link health
Commands
io.thecodeforge.slack.multiregion.link-check --region-pair us-east-1:eu-west-1
io.thecodeforge.slack.multiregion.pubsub-lag --dest-region eu-west-1
Fix now
Activate the regional circuit breaker to prevent cross-region flooding. Re-route messages through the fallback Kafka bridge if latency exceeds 3s.