Design Slack — Fixing WebSocket Fanout Amplification
One 200-user channel triggers 10x Redis pub/sub amplification.
- WebSocket persistent connections handle real-time message delivery
- Redis pub/sub fans out each write to every server with subscribers
- Cassandra row key shards messages by channel + time bucket; offline catch-up uses sliding window
- Presence detection uses heartbeat + expiry, not polling
- Message fanout is the bottleneck — write once, read by N subscribers
- The biggest mistake: assuming handshake latency is the bottleneck — it's the fanout write amplification
- Multi-region adds 40ms+ latency; cross-region fanout requires separate pub/sub clusters
Imagine a giant office building where thousands of people work. When someone speaks in a meeting room, only the people in that room hear it — not the whole building. Slack is like that building, but digital: it routes your message only to the right 'rooms' (channels), instantly, to everyone who's currently listening. The engineering challenge is making that work for millions of rooms, simultaneously, without anyone hearing a delay or missing a word.
Slack processes over 1 billion messages per month across millions of workspaces. At peak, it delivers messages in under 100ms to recipients who could be on a phone in Tokyo or a laptop in Berlin. That kind of reliability doesn't happen by accident — it's the result of deliberate architectural decisions around real-time transport, message persistence, presence broadcasting, and massive horizontal scale. Most system design resources gloss over the hard parts. This article doesn't.
The core problem Slack solves is deceptively simple: send a message from one user to many others, in real time, reliably, even when some recipients are offline. But hiding inside that sentence are half a dozen distributed systems challenges — how do you maintain persistent connections at scale, fan out a single write to thousands of subscribers, handle offline delivery, deduplicate retries, and keep 'online/offline' status accurate without hammering your database? Each of those deserves its own deep-dive, and we'll cover all of them.
By the end of this article you'll be able to walk into a system design interview and coherently explain the WebSocket connection layer, the pub/sub fanout architecture, how Slack shards its message store, why presence is one of the hardest problems in the system, and what trade-offs you'd make at each decision point. You'll also understand the production gotchas that only show up once you're running at real scale.
This isn't theoretical. Every pattern here is battle-tested — some are the direct result of production incidents that cost teams days of debugging. The fanout amplification trap? We've seen it bring down a cluster. The presence debounce logic? We've watched it flap under load. The cursor staleness bug? It's the kind of subtlety that survives code review. You'll walk into your next design review armed with the actual failure modes.
Here's what this deep dive covers: WebSocket connection management at millions of concurrent users, the write path with write-behind buffers and idempotency keys, Redis pub/sub fanout and its hidden O(N) cost, Cassandra time-bucketed storage with dynamic bucket sizes, heartbeat-based presence with adaptive intervals, async search indexing with Kafka and Elasticsearch, file upload pipelines with staged security scanning, and finally multi-region deployment strategies for global workspaces. Each section includes production incidents, debug commands, and decision trees that senior engineers actually use.
What is Design Slack?
Designing Slack is a staple system design exercise: build a real-time messaging platform that delivers, stores, and indexes every message. Rather than starting with a dry definition, let's look at the problem it has to solve.
The real problem isn't sending a message — it's sending that message to thousands of recipients within 100 milliseconds. Every extra 100ms of latency drops user engagement by measurable percentage points. And when you have millions of channels and billions of messages, every architectural decision multiplies.
At its heart, Slack is a stateful system: each user maintains a persistent WebSocket connection, each channel is a pub/sub topic, and each message must be durably stored and indexed for search. The hard part is doing all three at the same time while keeping the architecture horizontally scalable.
What often gets missed is that the real bottleneck shifts depending on the scale tier. At 10 users, any architecture works. At 10,000, the pub/sub fanout starts to creak. At 1 million, the message storage becomes the dominant challenge. That's why you need an architecture that's modular enough to let you swap out components — for instance, moving from Redis pub/sub to Kafka for offline delivery — without a full rewrite.
But there's another layer: the design must also handle misconfiguration at scale. A single workspace with 50,000 users can saturate a Redis pub/sub channel if the fanout isn't filtered per-server. That's why Slack invests heavily in per-server subscription maps and circuit breakers — the Monday morning all-hands is a real threat, not a hypothetical.
Here's a deeper trap: the write path's performance depends on the slowest component. If Cassandra takes 10ms to persist but Redis pub/sub takes 1ms, the overall latency is still 10ms. That's fine. But if Cassandra times out (2 seconds), the whole request blocks. Slack uses a write-behind queue: persist to a local buffer first, then ack the client, then flush to Cassandra. That cuts perceived latency from 100ms to 5ms. The trade-off is a small window for data loss if the server crashes before the buffer flushes.
One more production reality: the write-behind buffer must be sized correctly. If the buffer fills up because Cassandra is slow, you start dropping acknowledgements. Slack monitors buffer pressure and alerts when it crosses 70% utilization. That's the kind of metric you don't find in any textbook but will absolutely save your on-call rotation.
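To make the write-behind idea concrete, here is a minimal sketch, not Slack's actual implementation. The class name `WriteBehindBuffer`, its methods, and the `persist` callback are hypothetical; only the ack-before-flush order and the 70% pressure threshold come from the text above.

```python
from collections import deque


class WriteBehindBuffer:
    """Bounded in-memory buffer: ack after enqueue, flush to the store later."""

    def __init__(self, capacity, alert_ratio=0.70):
        self.capacity = capacity
        self.alert_ratio = alert_ratio  # 70% alert threshold, as in the text
        self._pending = deque()

    def utilization(self):
        return len(self._pending) / self.capacity

    def under_pressure(self):
        return self.utilization() >= self.alert_ratio

    def enqueue(self, message):
        """Return True if the client can be acked, False if the buffer is full."""
        if len(self._pending) >= self.capacity:
            return False
        self._pending.append(message)
        return True

    def flush(self, persist, max_batch=100):
        """Persist up to max_batch messages in order; stop at the first failure."""
        flushed = 0
        while self._pending and flushed < max_batch:
            try:
                persist(self._pending[0])
            except Exception:
                break  # keep the message queued and retry on the next flush
            self._pending.popleft()
            flushed += 1
        return flushed
```

Anything still sitting in `_pending` when the process dies is lost, which is exactly the data-loss window described above.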
Here's the thing most engineers miss: the real bottleneck isn't the message send, it's the state management. Every message creates a chain of state changes — channel unread count, search index, thread replies, file metadata. If you don't design for that complexity upfront, you'll be patching each component individually later. Slack learned this the hard way when they had to retrofit file indexing — it took them two quarters to stabilize the async pipeline.
One more edge you won't see in interviews: the idle workspace problem. For workspaces with <10 active users, the overhead of maintaining all those subscriptions and connection slots is wasteful. Slack uses a 'dormant workspace' strategy: if a workspace has no active connections for 5 minutes, its subscription map is offloaded to a cold store. When a user reconnects, the load balancer routes them to a special 'warm-up server' that rehydrates the subscriptions from S3. This saves ~30% of Redis memory in clusters with many small workspaces.
Beyond idle workspaces, there's the multi-tenant isolation challenge. A noisy workspace with a high message rate shouldn't starve other workspaces on the same servers. Slack uses per-workspace token bucket rate limiters at the application layer, not just at the WebSocket connection level. If one workspace exceeds its burst limit, messages are queued locally and delivered with a slight delay rather than blocking all connections.
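A per-workspace token bucket could look roughly like the sketch below. The names `TokenBucket` and `WorkspaceLimiter`, the rate and burst values, and the in-memory `delayed` queue are all assumptions for illustration; the clock is passed in explicitly so the behaviour is easy to test.

```python
class TokenBucket:
    def __init__(self, rate_per_sec, burst, now=0.0):
        self.rate = rate_per_sec
        self.burst = burst
        self.tokens = float(burst)
        self.last = now

    def allow(self, now):
        # Refill according to elapsed time, capped at the burst size.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class WorkspaceLimiter:
    """One bucket per workspace, so a noisy tenant only delays itself."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.burst = burst
        self.buckets = {}
        self.delayed = {}  # workspace_id -> messages waiting for tokens

    def submit(self, workspace_id, message, now):
        if workspace_id not in self.buckets:
            self.buckets[workspace_id] = TokenBucket(self.rate, self.burst, now)
        if self.buckets[workspace_id].allow(now):
            return "deliver"
        self.delayed.setdefault(workspace_id, []).append(message)
        return "queued"
```

The point of keying buckets by workspace is that exhausting one bucket never touches another workspace's tokens.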
WebSocket Connection Management
Slack maintains millions of persistent TCP connections using WebSockets. Each connection is pinned to a WebSocket server (a 'connection server') via a load balancer. The challenge is handling connection storms: when thousands of users reconnect at once after a server restart or network blip.
Each WebSocket server runs an event loop that multiplexes reads and writes across all its connections. It subscribes to Redis pub/sub channels for the workspaces it hosts. When a message arrives, it writes to the local TCP socket for each connected user who belongs to the target channel.
The load balancer distributes connections using consistent hashing on user_id to minimise reconnections when a server is added or removed. Each server maintains a local registry of which channels each user is in, so it only forwards messages relevant to its connections.
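Consistent hashing on user_id can be sketched as a sorted ring of virtual nodes. This is a generic illustration, not the load balancer's real code: `HashRing`, the SHA-256 hash, and the choice of 100 virtual nodes per server are assumptions.

```python
import bisect
import hashlib


class HashRing:
    def __init__(self, servers, vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (hash, server)
        for server in servers:
            self.add(server)

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def add(self, server):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{server}#{i}"), server))

    def remove(self, server):
        self._ring = [entry for entry in self._ring if entry[1] != server]

    def lookup(self, user_id):
        # First virtual node clockwise from the user's hash (ring must be non-empty).
        idx = bisect.bisect_right(self._ring, (self._hash(user_id), ""))
        return self._ring[idx % len(self._ring)][1]
```

When a server is removed, only the users that were pinned to it get a new assignment; everyone else keeps their connection.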
A key optimization you'll see in production: the connection servers don't just blindly subscribe to all workspace channels. They maintain a local subscription map and only subscribe to a channel when the first user in that channel connects to them. This reduces the number of Redis pub/sub subscriptions per server from O(workspace_channels * users) to O(distinct_channels_with_local_users). As a rough illustration, take a 50-channel workspace with 200 users spread evenly across 20 servers, every user in every channel: each server hosts 10 users, so per-user subscriptions would cost 10 * 50 = 500 per server, while the deduplicated map needs at most 50. That's a 10x reduction, and the saving grows as more users of the same channel share a server.
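The subscribe-on-first-user rule can be sketched as a small reference-tracking map. `SubscriptionMap` and its `subscribe`/`unsubscribe` callbacks are hypothetical stand-ins for whatever actually talks to Redis.

```python
from collections import defaultdict


class SubscriptionMap:
    """Tracks local users per channel; calls upstream only on first/last user."""

    def __init__(self, subscribe, unsubscribe):
        self._subscribe = subscribe        # called once per newly needed channel
        self._unsubscribe = unsubscribe    # called when the last local user leaves
        self._members = defaultdict(set)   # channel_id -> local user_ids

    def user_connected(self, user_id, channel_ids):
        for channel_id in channel_ids:
            if not self._members[channel_id]:
                self._subscribe(channel_id)
            self._members[channel_id].add(user_id)

    def user_disconnected(self, user_id, channel_ids):
        for channel_id in channel_ids:
            members = self._members.get(channel_id)
            if not members or user_id not in members:
                continue
            members.discard(user_id)
            if not members:
                del self._members[channel_id]
                self._unsubscribe(channel_id)

    def local_recipients(self, channel_id):
        return set(self._members.get(channel_id, ()))

    def subscription_count(self):
        return len(self._members)
```

`local_recipients` is also what the server would consult when a message arrives, so it only writes to sockets that belong to the target channel.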
But there's a deeper trap: if a server crashes and all its users reconnect to other servers, those servers now subscribe to channels that were previously handled elsewhere. The subscription count can spike 5x in seconds. That's why Slack pre-allocates a subscription budget per server and rejects new channel subscriptions if the budget is exceeded — the user gets a 'try again' signal and connects to a different server via the load balancer.
Another real-world pattern: Slack uses a two-tier load balancer. The first tier is a layer-4 TCP balancer that distributes raw connections. The second tier is a layer-7 balancer that inspects the WebSocket upgrade request and routes based on workspace ID, so that all users in the same workspace land on the same set of servers. This reduces cross-server traffic for frequently messaged channels.
One more nuance: the subscription budget isn't static. Slack dynamically adjusts it based on server health and CPU. When a server is under load, it lowers its budget, pushing new subscriptions to healthier servers. This is a form of self-healing load balancing.
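A dynamic subscription budget might be modelled as below. The linear scaling by CPU percentage is purely an assumption chosen for illustration, as are the names `SubscriptionBudget`, `effective_budget`, and `try_acquire`; the text does not say how the budget is actually computed.

```python
class SubscriptionBudget:
    def __init__(self, base_budget, min_budget=1):
        self.base_budget = base_budget
        self.min_budget = min_budget
        self.used = 0

    def effective_budget(self, cpu_percent):
        """Shrink the budget linearly as CPU utilization (0-100) rises."""
        cpu_percent = min(100, max(0, cpu_percent))
        scaled = self.base_budget * (100 - cpu_percent) // 100
        return max(self.min_budget, scaled)

    def try_acquire(self, cpu_percent):
        """Return False when the caller should send a 'try again' signal."""
        if self.used >= self.effective_budget(cpu_percent):
            return False
        self.used += 1
        return True

    def release(self):
        self.used = max(0, self.used - 1)
```

A loaded server starts refusing new channel subscriptions sooner, which nudges reconnecting users toward healthier servers.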
Here's a real gotcha that will bite you: even with subscription budgets, a sudden burst of new connections (like the Monday morning all-hands) can cause a rapid increase in pending handshakes. If the handshake handler is single-threaded in your event loop, it blocks all other processing — including heartbeats and message delivery. Slack mitigates this by dedicating a separate thread pool for WebSocket handshakes, keeping the event loop free for ongoing connections. You don't want a new connection storm to delay delivery for already-connected users.
One more production pattern: Slack uses connection draining on server shutdown. When a server is about to be decommissioned, it stops accepting new connections, sends a 'go away' frame to each connected client with a 10-second reconnect delay, and waits for all connections to drain before shutting down. Without this, clients get a TCP RST and reconnect immediately, causing a stampede. With draining, the reconnect is spread over 10 seconds, reducing the peak load on the remaining servers.
For global workspaces, WebSocket servers are deployed in multiple regions. Users connect to the nearest regional endpoint via Geo-DNS. A regional load balancer distributes within the region. Cross-region message delivery for users from different regions in the same channel uses a regional pub/sub bridge: each region has its own Redis pub/sub cluster, and a message published in one region is forwarded to other regions through a Kafka-based cross-region bridge. This adds latency but prevents a single region from bottlenecking global fanout.
Message Fanout and Pub/Sub Architecture
When a user sends a message, it must be delivered to all other users in the channel in near real-time. The naive approach would be to have the sending server open a direct connection to every recipient. That doesn't scale. Slack uses Redis pub/sub as an intermediary: the sending server publishes the message to a Redis channel named after the workspace or channel. All WebSocket servers that have subscribers for that channel receive the message and forward it to their local connections.
But this introduces the fanout write amplification problem: each message is published to Redis once by the sender, but Redis has to deliver it N times (once per server that has subscribers). If 50 servers each have at least one subscriber in the channel, that's 50 pub/sub deliveries per message. The cost isn't negligible: a Redis publish does work proportional to the number of subscribed connections, so it is O(N) per message where N is the number of subscribing servers.
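The amplification is easy to estimate once you know where a channel's members are connected. The helper below is a hypothetical back-of-the-envelope tool; `fanout_cost` and the `user_to_server` mapping are assumptions, not part of any real API.

```python
def fanout_cost(channel_members, user_to_server):
    """Return (pubsub_deliveries, socket_writes) for one published message."""
    online = [u for u in channel_members if u in user_to_server]
    servers = {user_to_server[u] for u in online}
    # One pub/sub delivery per subscribing server, one socket write per user.
    return len(servers), len(online)
```

For a 200-user channel spread over 20 servers this reports 20 pub/sub deliveries and 200 socket writes; pack the same users onto 2 servers and the pub/sub side drops to 2.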
Slack mitigates this in two ways. First
Message Storage and Retrieval with Cassandra
Slack stores all messages in Cassandra for durability, search, and offline catch-up. The primary table uses a composite primary key: channel_id as partition key, and a time-based clustering column (e.g., message_timestamp). This allows efficient retrieval of messages for a channel in chronological order.
However, a hot partition problem emerges for busy channels. A channel with 10,000 messages a day will have all writes hitting a single partition. To spread the load, Slack uses a 'time-bucketed' row key: partition key is (channel_id + day_bucket). Thus, each day's messages are in a separate partition. For catch-up, the client requests messages from the last N days, and the backend merges results from multiple partitions.
For offline catch-up, Slack uses a cursor-based approach: the client stores the timestamp of the last seen message. On reconnect, it queries Cassandra for messages with timestamp > cursor, limited to a window (e.g., last 7 days). This avoids scanning the entire channel history.
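Here is a small sketch of how day buckets and the catch-up window fit together. The key format `channel_id:YYYYMMDD` and the function names are assumptions; the 7-day window is the example value from the text.

```python
from datetime import datetime, timedelta, timezone


def partition_key(channel_id, ts):
    """Partition key = channel_id plus the UTC day bucket of the message."""
    return f"{channel_id}:{ts.astimezone(timezone.utc).strftime('%Y%m%d')}"


def catchup_partitions(channel_id, cursor, now, max_days=7):
    """List the partitions to read for messages newer than the cursor."""
    start = max(cursor, now - timedelta(days=max_days))  # clamp to the window
    day = start.astimezone(timezone.utc).date()
    last = now.astimezone(timezone.utc).date()
    keys = []
    while day <= last:
        keys.append(f"{channel_id}:{day.strftime('%Y%m%d')}")
        day += timedelta(days=1)
    return keys
```

The backend would then issue one query per returned key (filtering on timestamp > cursor in the first bucket) and concatenate the results in order.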
But there's a subtlety: Cassandra's consistency is tunable, and with weak settings a read can miss a recent write. If a user sends a message and immediately goes offline, the message might not have reached all replicas before the next read. The catch-up path uses LOCAL_QUORUM for both reads and writes, so the read and write quorums overlap and a read sees every acknowledged write within the datacenter. Cross-datacenter reads use ONE, but the cursor is read at LOCAL_QUORUM to avoid a stale cursor.
Slack also uses a look-aside cache (Redis) holding the last 1000 messages per channel. When a client requests channel history, the server checks the cache first. If the cache is warm, it returns immediately. If not, it queries Cassandra. This reduces read pressure on Cassandra for frequently accessed channels.
One production issue: tombstone accumulation. When messages are deleted, Cassandra writes tombstones instead of removing data in place. Frequent deletes in a busy channel cause tombstone overload, leading to read timeouts. Slack uses a TTL-based approach: messages expire automatically after 90 days (or the workspace retention policy). Expired cells are still treated as tombstones until compaction, but they sit in old time buckets that are rarely read. For explicit deletes, they use a 'soft delete' that marks the message as deleted without creating a tombstone; a background compaction job removes them later.
Another gotcha: the time-bucketing must align with query patterns. If a user catches up after being offline for 30 days, the query hits 30 partitions. That's acceptable if each partition is small. But if the channel is very busy, each partition can be large. Slack uses a variable bucket size: for high-traffic channels, buckets are smaller (e.g., 6 hours). For low-traffic channels, buckets are larger (e.g., 7 days). The bucket duration is calculated based on the channel's historical message rate.
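One way to pick a bucket duration from the message rate is sketched below. The three candidate sizes include the 6-hour and 7-day examples from the text; the 1-day middle tier, the `target_rows` threshold, and the function names are assumptions.

```python
BUCKET_HOURS = (168, 24, 6)  # 7 days, 1 day, 6 hours (largest first)


def choose_bucket_hours(messages_per_day, target_rows=100_000):
    """Pick the largest bucket whose expected row count stays under target."""
    for hours in BUCKET_HOURS:
        if messages_per_day * hours / 24 <= target_rows:
            return hours
    return BUCKET_HOURS[-1]  # hottest channels get the smallest bucket


def bucket_id(epoch_seconds, hours):
    """Bucket number for a timestamp, given the channel's bucket duration."""
    return epoch_seconds // (hours * 3600)
```

If a channel's bucket size changes over time, the reader needs to know which duration applied to which time range, so in practice the chosen size would have to be stored alongside the channel metadata.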
For multi-region storage, Slack uses separate Cassandra clusters in each region. Messages are written to the local cluster first, then replicated asynchronously to other regions via Cassandra's built-in cross-datacenter replication. This keeps write latency low (local quorum) but introduces eventual consistency across regions. When a user switches regions (e.g., travels), the cursor includes the region ID, and the catch-up service queries both local and remote clusters, merging results sorted by timestamp. This is critical to avoid users seeing 'missing messages' when they travel.
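The cross-region merge boils down to a k-way merge with de-duplication, since an asynchronously replicated message can show up in both clusters. A minimal sketch, assuming each region returns dicts with hypothetical `id` and `ts` fields already sorted by `(ts, id)`:

```python
import heapq


def merge_regions(*region_results):
    """Merge per-region result lists by timestamp, dropping replicated copies."""
    merged, seen = [], set()
    ordered = heapq.merge(*region_results, key=lambda m: (m["ts"], m["id"]))
    for msg in ordered:
        if msg["id"] not in seen:
            seen.add(msg["id"])
            merged.append(msg)
    return merged
```

`heapq.merge` streams the inputs lazily, so the merge itself does not need to hold every region's full result set in memory.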
- Partition key = channel_id + day_bucket to spread writes
- Clustering column = timeuuid for chronological order
- Query merges responses from N buckets based on time range
- Bucket size adjusts dynamically: small for hot channels, large for cold
- TTL expires messages after the retention period, keeping tombstones out of recently written partitions
- Include region_id in the cursor so catch-up can merge results across datacenters
Presence Detection at Scale
Slack shows green dots next to users who are currently active. This seems trivial but at scale it's one of the hardest problems. The naive approach — updating a database row every time a user does something — would hammer the database. Instead, Slack uses a heartbeat-based presence service.
Each client sends a heartbeat every 15 seconds via WebSocket. The presence server receives these heartbeats and updates an in-memory map of user -> last_heartbeat_timestamp. A separate goroutine (or thread) runs every 30 seconds and marks users as offline if their last heartbeat is older than 30 seconds. The presence state is published to Redis for other services to consume.
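The heartbeat-plus-expiry loop can be sketched in a few lines. `PresenceTracker` is a hypothetical name, and the clock is passed in so the sweep is deterministic; in a real service `sweep` would be driven by a timer and the returned events published to Redis.

```python
class PresenceTracker:
    def __init__(self, expiry_seconds=30):
        self.expiry_seconds = expiry_seconds
        self._last_seen = {}   # user_id -> last heartbeat timestamp
        self._online = set()

    def heartbeat(self, user_id, now):
        """Record a heartbeat; return an event only on offline -> online."""
        self._last_seen[user_id] = now
        if user_id not in self._online:
            self._online.add(user_id)
            return [(user_id, "online")]
        return []

    def sweep(self, now):
        """Mark users offline whose last heartbeat is older than the expiry."""
        expired = [u for u in self._online
                   if now - self._last_seen[u] > self.expiry_seconds]
        for user_id in expired:
            self._online.discard(user_id)
        return [(u, "offline") for u in sorted(expired)]

    def is_online(self, user_id):
        return user_id in self._online
```

Note that only state transitions produce events, so a steady stream of heartbeats costs one dictionary write each and nothing downstream.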
But there's a problem: if the heartbeat interval is too short, you overload the server. If too long, presence feels stale. Slack uses adaptive heartbeat: if the user is idle but connected, the heartbeat interval extends to 60 seconds. If the user is actively interacting (typing, scrolling), it drops to 5 seconds. This reduces server load by 60% during idle periods. The offline expiry has to scale with the interval, otherwise an idle client on a 60-second heartbeat would be expired by the 30-second threshold.
Another challenge: when a user disconnects abruptly (e.g., network loss), the heartbeat stops. With a 30-second expiry checked by a sweep every 30 seconds, the system can take between 30 and 60 seconds to mark them offline. That's fine for most use cases, but for real-time collaboration (e.g., pair programming), that feels slow. Slack investigated using TCP keepalives and WebSocket ping/pong to detect disconnection faster, but those are unreliable at scale because intermediate proxies may not forward them. They settled on a hybrid: WebSocket ping/pong every 10 seconds, plus heartbeat to confirm liveness. If ping/pong fails three times, the server initiates a clean disconnect, and the presence is updated immediately rather than waiting for heartbeat expiry.
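The adaptive interval can be expressed as a simple function of how recently the user did something. The 5, 15, and 60 second intervals come from the text; the `active_within` and `idle_after` thresholds, the function names, and the "two missed beats" expiry rule are assumptions.

```python
ACTIVE_INTERVAL = 5     # seconds, user is typing or scrolling
DEFAULT_INTERVAL = 15   # seconds, normal connected state
IDLE_INTERVAL = 60      # seconds, connected but idle


def heartbeat_interval(seconds_since_input, active_within=10, idle_after=300):
    """Choose the next heartbeat interval from the time since last user input."""
    if seconds_since_input <= active_within:
        return ACTIVE_INTERVAL
    if seconds_since_input >= idle_after:
        return IDLE_INTERVAL
    return DEFAULT_INTERVAL


def expiry_for(interval, missed_beats=2):
    """Offline threshold that tolerates a fixed number of missed heartbeats."""
    return interval * missed_beats
```

Deriving the expiry from the interval keeps the 15-second/30-second relationship of the default case and avoids marking slow-heartbeat clients offline by accident.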
Presence is also broadcast to other users in the same channel. When a user comes online or goes offline, the presence server publishes an event to the appropriate channel's pub/sub topic. Each WebSocket server forwards it to clients. But flooding every client with presence changes of every user in a large channel is expensive. Slack uses debouncing: if a user's presence changes frequently (e.g., flaky network), the server batches updates and sends them at most once per 5 seconds per channel. Client-side dedup prevents flickering.
One production incident: a bug in debouncing caused all presence changes to be delayed by 30 seconds during high churn. The fix was to use a sliding window of last change time per user, not per channel. That reduced the batch window from 5s to 1s without increasing message volume.
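A per-user debounce, as opposed to a per-channel one, might look like the sketch below. `PresenceDebouncer` and its methods are hypothetical; the 1-second window is the value from the incident above.

```python
class PresenceDebouncer:
    """Emit at most one presence update per user per window."""

    def __init__(self, window_seconds=1.0):
        self.window = window_seconds
        self._last_sent = {}  # user_id -> time of last emitted update
        self._pending = {}    # user_id -> latest suppressed state

    def update(self, user_id, state, now):
        last = self._last_sent.get(user_id)
        if last is None or now - last >= self.window:
            self._last_sent[user_id] = now
            self._pending.pop(user_id, None)
            return [(user_id, state)]
        self._pending[user_id] = state  # only the newest state is kept
        return []

    def flush(self, now):
        """Emit suppressed states whose per-user window has elapsed."""
        ready = [u for u in self._pending
                 if now - self._last_sent[u] >= self.window]
        out = []
        for user_id in sorted(ready):
            out.append((user_id, self._pending.pop(user_id)))
            self._last_sent[user_id] = now
        return out
```

Because the window is tracked per user, one flapping connection never delays updates for anyone else in the channel.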
Another nuance: presence is aggregated per workspace. If a user is in multiple workspaces, they have separate presence for each. The presence server maintains a user_id -> workspace_id map and updates each independently. This prevents a busy workspace from causing false offline marking in another.
For multi-region presence, each region has its own presence server. When a user's connection moves to a new region (e.g., they travel), the old region's presence server marks them offline after the heartbeat timeout, and the new region's server picks up the heartbeats. The region change can cause a brief 1-2 second 'offline' blip. Slack mitigates this by using a global presence cache (Redis with TTL) that reconciles across regions: if a user has heartbeats in any region within the last 30 seconds, they are considered online. This requires cross-region writes to the global cache, which adds latency but avoids flapping.
Search Indexing Architecture
Slack provides full-text search across all messages. This is powered by Elasticsearch, which indexes messages asynchronously. When a message is written to Cassandra, a background worker publishes the message ID and content to a Kafka topic. A separate indexer consumer reads from Kafka and pushes documents to Elasticsearch.
The separation is deliberate: the real-time path (WebSocket delivery) must not be blocked by search indexing. Kafka acts as a durable buffer, absorbing spikes in message volume without back-propagating pressure to the write path.
But there's a subtlety: Elasticsearch is eventually consistent with Cassandra. If a user searches immediately after sending a message, they might not see it in the index for 1-2 seconds. Slack handles this by returning the most recent messages from Cassandra in the search results, merging them with Elasticsearch results client-side. That way, the user always sees their own message instantly.
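The merge of index hits with very recent messages can be sketched as follows. The field names `id`, `ts`, and `text` are assumptions, and the substring match is only a naive stand-in for real relevance scoring.

```python
def merge_search_results(indexed_hits, recent_messages, query, limit=20):
    """Combine index hits with recent messages that match every query term."""
    terms = query.lower().split()
    by_id = {hit["id"]: hit for hit in indexed_hits}
    for msg in recent_messages:
        text = msg["text"].lower()
        if all(term in text for term in terms):
            by_id[msg["id"]] = msg  # the fresher copy wins over the index
    newest_first = sorted(by_id.values(), key=lambda m: m["ts"], reverse=True)
    return newest_first[:limit]
```

Keying by message ID is what keeps a message from appearing twice once the indexer catches up.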
Another production concern: indexer lag. If the Kafka consumer falls behind (e.g., due to an Elasticsearch cluster issue), search becomes stale. Slack monitors consumer lag and alerts if it exceeds 30 seconds. When lag is high, they temporarily redirect search queries to Cassandra's secondary index (which is slower but always up-to-date).
The trade-off is clear: you trade indexing cost and latency for query performance. Elasticsearch can handle complex full-text queries in under 100ms, while Cassandra's secondary index would take seconds on the same query. At Slack's scale, the cost of maintaining an Elasticsearch cluster is acceptable for the user experience gain.
One advanced pattern: Slack uses a two-level indexing approach. Frequently searched terms are indexed at higher priority, while rare terms are batched. This ensures that common searches (like 'team announcement' or 'bug fix') are always fast, even during indexing backlogs. The priority is determined by a sliding window of search query logs.
Here's a real failure pattern: when Elasticsearch undergoes a rolling restart, it stops accepting writes for a brief period. The Kafka consumer continues to read but fails to index. The lag grows. If the restart takes too long, the lag can exceed the retention period of the Kafka topic, causing permanent data loss. Slack mitigates this by pausing the consumer before a planned ES restart and resuming after all nodes are healthy. They also set Kafka topic retention to 7 days to give a safety buffer.
Another nuance: the merge of recent messages from Cassandra into search results is not free. For workspaces with heavy write rates, querying Cassandra for recent messages adds load to the primary store. Slack caches the last 1000 messages per workspace in Redis with a 5-second TTL, so most search queries hit the cache instead of Cassandra. This reduces read pressure on the primary database and speeds up search results.
One more detail: the search indexer also handles message edits and deletions. When a message is edited, the indexer receives the updated document and replaces the existing one. But deletions are trickier — if a user deletes a message, the indexer must remove it from Elasticsearch. If the indexer is backlogged, a deleted message might still appear in search results for a while. Slack uses a 'delete marker' approach: instead of actually deleting the document immediately, they mark it as deleted in the index and let a background cleanup job remove it after 24 hours. This prevents race conditions where a delete event arrives before the original message is indexed.
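The delete-marker idea is easiest to see with an in-memory stand-in for the index. `MarkedIndex` is hypothetical and has nothing to do with the Elasticsearch API; it only shows why a marker survives an out-of-order index event.

```python
CLEANUP_AFTER_SECONDS = 24 * 3600  # 24-hour cleanup delay from the text


class MarkedIndex:
    def __init__(self):
        self._docs = {}  # message_id -> {"text": ..., "deleted_at": ...}

    def _doc(self, message_id):
        return self._docs.setdefault(message_id, {"text": None, "deleted_at": None})

    def index(self, message_id, text):
        self._doc(message_id)["text"] = text  # an edit simply replaces the text

    def delete(self, message_id, now):
        self._doc(message_id)["deleted_at"] = now  # mark, do not remove

    def search(self, term):
        return sorted(
            mid for mid, doc in self._docs.items()
            if doc["deleted_at"] is None and doc["text"] and term in doc["text"]
        )

    def cleanup(self, now):
        """Physically remove documents marked deleted at least 24 hours ago."""
        stale = [mid for mid, doc in self._docs.items()
                 if doc["deleted_at"] is not None
                 and now - doc["deleted_at"] >= CLEANUP_AFTER_SECONDS]
        for mid in stale:
            del self._docs[mid]
        return len(stale)
```

If the delete arrives first, the later index event fills in the text but the marker keeps the document hidden. The sketch does not protect against an index event that arrives after cleanup has already run.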
For multi-region search, Slack maintains an Elasticsearch cluster per region. The indexer in each region consumes from its local Kafka topic. When a user searches, the query is routed to the user's home region. If the user is traveling, the query proxies to their home region, or a cross-region merge is performed for global accounts. This keeps indexing latency low and avoids a single global ES cluster becoming a bottleneck.
File Sharing and Attachment Handling
Slack handles file uploads — images, documents, code snippets — and must generate thumbnails, scan for malware, and make files searchable. The file upload path is separate from the message path to avoid blocking message delivery.
When a user uploads a file, the frontend sends it to a dedicated file upload service via HTTPS. This service stores the file in an S3-compatible object store, generates a unique file ID, and then publishes the file metadata (file ID, workspace ID, channel ID, uploader) to a Kafka topic. A separate worker consumes this topic and performs: thumbnail generation (for images), virus scanning (using ClamAV), and content indexing (for full-text search within documents).
The file itself is stored externally; the message that includes the file contains only the file ID and a URL. When a client displays the message, it fetches the file URL from a CDN, which caches the file for faster delivery.
One production challenge: thumbnail generation is CPU-intensive. Slack uses a pool of worker services dedicated to thumbnail processing. If the pool falls behind, thumbnails appear broken (placeholder) for seconds. Slack monitors the thumbnail generation queue depth and auto-scales workers when it exceeds 1000 pending jobs.
Another gotcha: file size limits. Slack enforces per-file size limits (e.g., 1GB for paid plans). The upload service must reject oversized files early — before they reach the object store. They implement streaming validation: read the Content-Length header and reject if > limit. For chunked uploads, they use a proxy that buffers the first few MB to check MIME type but rejects if total size exceeds limit based on chunk metadata.
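Early rejection has two parts: trust the declared size when there is one, and count bytes as they stream when there isn't. A minimal sketch with hypothetical names (`UploadTooLarge`, `check_content_length`, `read_limited`):

```python
class UploadTooLarge(Exception):
    pass


def check_content_length(headers, limit_bytes):
    """Reject before reading the body if the declared size exceeds the limit."""
    declared = headers.get("Content-Length")
    # A malformed header value raises ValueError, which the caller must handle.
    if declared is not None and int(declared) > limit_bytes:
        raise UploadTooLarge(f"declared {declared} bytes, limit {limit_bytes}")


def read_limited(chunks, limit_bytes):
    """Yield chunks until the running total would exceed the limit."""
    total = 0
    for chunk in chunks:
        total += len(chunk)
        if total > limit_bytes:
            raise UploadTooLarge(f"received {total} bytes, limit {limit_bytes}")
        yield chunk
```

The streaming check matters because a chunked upload carries no Content-Length, and a declared size can be wrong anyway.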
Security: malware scanning must happen before the file becomes accessible. Slack uses a 'scan first, serve later' model. The file is stored in a staging bucket with restricted access. After scanning passes, it moves to the public bucket. If scanning fails, the file is quarantined and the user is notified. This means full-size thumbnails are generated only after scanning passes, which adds latency of 1-5 seconds. For images, they pre-generate a small preview thumbnail from the first chunk to show in the message immediately, but the full image is blocked until the scan completes.
One more nuance: deduplication of file uploads. If the same file is uploaded to multiple channels, Slack uses content hashing (SHA-256) to detect duplicates and store only one copy. The file ID maps to the content hash. This saves storage space but complicates deletion: you can't delete a file if it's referenced by other messages. Slack uses reference counting per content hash.
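Content-hash deduplication with reference counting can be sketched with three dictionaries. `DedupFileStore` is a hypothetical in-memory stand-in for the object store, and it assumes every upload gets a unique file ID.

```python
import hashlib


class DedupFileStore:
    def __init__(self):
        self._blobs = {}  # sha256 hex digest -> content
        self._refs = {}   # sha256 hex digest -> number of file IDs using it
        self._files = {}  # file_id -> sha256 hex digest

    def upload(self, file_id, content):
        digest = hashlib.sha256(content).hexdigest()
        if digest not in self._blobs:
            self._blobs[digest] = content  # first copy: actually store it
        self._refs[digest] = self._refs.get(digest, 0) + 1
        self._files[file_id] = digest
        return digest

    def delete(self, file_id):
        """Drop one reference; return True if the blob itself was removed."""
        digest = self._files.pop(file_id)
        self._refs[digest] -= 1
        if self._refs[digest] == 0:
            del self._refs[digest]
            del self._blobs[digest]
            return True
        return False

    def blob_count(self):
        return len(self._blobs)
```

Deleting a file ID only removes the underlying bytes when the last reference goes away, which is the deletion complication described above.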
Production incident: a workspace uploaded a large video file that took 10 minutes to scan. During that time, users couldn't see the file in the channel. The fix was to show a placeholder immediately with a progress indicator, and update the thumbnail when scanning completes. They also implemented a scanning timeout: if a scan takes > 5 minutes, it's moved to an offline scan queue to avoid blocking the upload path.
For multi-region file storage, Slack uses region-specific S3 buckets. When a file is uploaded, it's stored in the nearest region's bucket. For cross-region access, a CDN with multiple origins handles the latency. If a user in Europe views a file stored in the US, the CDN fetches from the US origin on the first request and caches in a European edge location. This avoids replicating every file to every region, saving storage costs.
The Monday Morning WebSocket Storm
- Message fanout amplification is the hidden cost of broadcast — measure read vs write pressure separately.
- Don't treat WebSocket connection count as the only scaling metric; consider messages-per-second per server.
- Always include a circuit breaker for connection storms — they happen when you least expect them.
- Pre-allocate a subscription budget per server to prevent cascading overload when one server fails.
- Monitor pending handshake count vs active connections — a ratio >2:1 means you're about to hit capacity.
- For multi-region deployments, add a cross-region circuit breaker that prevents inter-region pub/sub from flooding during regional failover.