Senior 6 min · June 25, 2026

Design Zoom: Building Real-Time Video at Scale Without Losing Your Mind

Design Zoom system design: architecture, WebRTC internals, SFU vs MCU, scaling to 1000+ participants, and production gotchas from real incidents.

Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Everything here is grounded in real deployments.

June 25, 2026
last updated
Quick Answer

Zoom uses a Selective Forwarding Unit (SFU) architecture where the server forwards selected video streams to each participant, reducing client processing. Key components: signaling server (WebSocket), media server (SFU), TURN server for NAT traversal, and a distributed backend for rooms and users.

What is Design Zoom?

Design Zoom is the system design of a real-time video conferencing platform like Zoom. It covers client-server architecture, media routing (SFU/MCU), signaling, WebRTC, scaling strategies, and handling network degradation.

Plain-English First

Imagine a conference room where everyone talks at once. In a small room, you can hear everyone. But with 100 people, it's chaos. Zoom's SFU is like a smart switchboard operator: they listen to everyone, but only forward the voice of the person currently speaking to each listener. If you're not speaking, the operator stops sending your voice to others, saving everyone's ears (and bandwidth).

You've seen it happen: a 50-person all-hands call turns into a slideshow of frozen faces, audio stuttering like a scratched CD. Everyone blames the Wi-Fi. But the real culprit is almost always the server architecture. Most video calling systems choke because they try to send every participant's video to every other participant — an O(n²) problem that kills bandwidth and CPU. Zoom doesn't do that. And that's why it works when everything else falls apart.

This article breaks down the system design of a Zoom-like platform. You'll learn the exact architecture — signaling, media routing, scaling, and the production traps that take down naive implementations. By the end, you'll be able to design a real-time video system that handles 1000+ participants without melting your servers or your users' laptops.

Why SFU Beats MCU: The Bandwidth Math That Decides Your Architecture

Before you write a single line of code, you need to pick your media routing strategy. The two main options are MCU (Multipoint Control Unit) and SFU (Selective Forwarding Unit). An MCU mixes all incoming streams into a single composite stream on the server. Each client sends one stream and receives one stream. Sounds simple. But the server has to decode, mix, and re-encode every stream, which is CPU-intensive and adds latency. An SFU, on the other hand, forwards streams without decoding them. The server is just a smart switch: it selects which streams to send to each client based on who's speaking. The client decodes multiple streams and renders them. This shifts the processing burden to clients, which is fine for desktops but tough for mobile. The payoff is on the server: SFU work is packet forwarding, which grows roughly linearly with participants (O(n) when each client receives a fixed number of streams), while an MCU that builds a separate composite for each receiver from every input approaches O(n²). Under that model, a 100-person call costs the MCU on the order of 100× the server work of an SFU, and each unit of that work is a video decode or encode rather than a packet copy. That's why Zoom uses SFU. The trade-off: SFU requires more client downlink bandwidth (each client receives multiple streams), but you can mitigate this with simulcast (each publisher sends multiple resolutions) and bandwidth estimation.

SFUvsMCU.systemdesign
// io.thecodeforge — System Design tutorial

// SFU bandwidth calculation for N participants
// Each client sends 1 stream (uplink) and receives M streams (downlink)
// M = number of active speakers (e.g., 4)
// Total server bandwidth = N * (uplink + M * downlink)
// Example: N=100, uplink=2Mbps, downlink=2Mbps, M=4
// Server bandwidth = 100 * (2 + 4*2) = 100 * 10 = 1000 Mbps = 1 Gbps

// MCU bandwidth calculation
// Each client sends 1 stream, receives 1 mixed stream
// Server must decode N streams, mix, encode 1 stream per client
// Server bandwidth = N * (uplink + downlink) = 100 * (2+2) = 400 Mbps
// But server CPU dominates: the MCU decodes N streams and encodes up to N composites
// If each composite is built from every input, compositing work is O(N^2): for N=100 that is 100*100 = 10,000 input-to-output pairs
// SFU server CPU is O(N) — just forwarding packets

// Decision: Use SFU for >10 participants. MCU only for small groups (<10) where client CPU is limited (e.g., embedded devices).
Output
SFU server bandwidth: 1000 Mbps
MCU server bandwidth: 400 Mbps
SFU server CPU: O(N)
MCU server CPU: O(N^2)
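
The arithmetic in the comments above can be turned into a small calculator. This is a minimal sketch under the same simplifying assumptions as the listing (every stream has a fixed bitrate, and each client receives exactly M forwarded streams); the interface and function names are illustrative and not taken from any library.

```typescript
// Sketch: server-side bandwidth for SFU vs MCU, in Mbps.
// Field and function names are hypothetical, chosen for this example.
interface CallShape {
  participants: number;     // N
  uplinkMbps: number;       // bitrate each client sends
  downlinkMbps: number;     // bitrate of each stream a client receives
  forwardedStreams: number; // M, streams forwarded to each client (SFU only)
}

function sfuServerMbps(c: CallShape): number {
  // Ingress: N uplinks. Egress: N clients, each receiving M forwarded streams.
  return c.participants * (c.uplinkMbps + c.forwardedStreams * c.downlinkMbps);
}

function mcuServerMbps(c: CallShape): number {
  // Ingress: N uplinks. Egress: one mixed stream per client.
  return c.participants * (c.uplinkMbps + c.downlinkMbps);
}

const exampleCall: CallShape = {
  participants: 100,
  uplinkMbps: 2,
  downlinkMbps: 2,
  forwardedStreams: 4,
};

console.log(`SFU server bandwidth: ${sfuServerMbps(exampleCall)} Mbps`); // 1000
console.log(`MCU server bandwidth: ${mcuServerMbps(exampleCall)} Mbps`); // 400
```

Varying forwardedStreams is the quickest way to see why capping the number of forwarded videos matters: SFU egress scales with N × M, so M is the lever you control.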
Production Trap: MCU at Scale
I've seen a startup try MCU for a 500-person town hall. Their server farm melted in 3 minutes. The CPU hit 100% on all cores, and the audio delay hit 10 seconds. They switched to SFU the next day. Don't be that startup.
Media Routing Decision Tree
If: Participants <= 10, clients are low-power (mobile/embedded)
Use: MCU (the server does the heavy lifting)
If: Participants > 10, clients are desktops or modern phones
Use: SFU with simulcast and active speaker detection
If: Participants > 100, need to support legacy clients
Use: SFU with transcoding fallback for incompatible codecs
[Figure: Zoom-Scale Video Architecture: SFU, Signaling, and Scaling. Components shown: SFU vs MCU bandwidth math, WebSocket signaling, media server (SFU) architecture, distributed SFU for 1000+ participants, bandwidth estimation and degradation, TURN server NAT traversal.]
[Figure: MCU vs SFU Bandwidth Tradeoffs. MCU: server mixes all streams into one; each client sends 1 stream and receives 1; high server CPU for transcoding; single bitrate for all clients. SFU: server forwards streams as-is; each client sends 1 and receives up to N-1; low server CPU with no transcoding; per-client bitrate adaptation.]

Signaling: The WebSocket Dance That Sets Up Every Call

Before any video flows, clients need to exchange session descriptions and ICE candidates. This is signaling. You need a reliable, low-latency channel, and WebSocket is the standard choice. Each client connects to a signaling server (typically a separate service from the media servers). The signaling server handles room management and user presence, and relays SDP offers/answers and ICE candidates between clients. For a 1:1 call, signaling is simple: client A sends an offer to the server, the server forwards it to client B, and B sends an answer back. For group calls, the signaling server maintains room state and broadcasts new participant info to all existing members. Key gotcha: signaling must be authenticated and rate-limited. If a malicious client floods the signaling server with SDP offers, it can exhaust server memory. Always validate SDP size (max 64KB) and limit offers per second per user (e.g., 5/s). Also, keep signaling and media on separate channels. Signaling goes over the WebSocket; media goes over UDP as SRTP (data channels use SCTP over DTLS), never through the signaling socket.

SignalingFlow.systemdesign
// io.thecodeforge — System Design tutorial

// Signaling flow for a group call
// 1. Client A connects to signaling server via WebSocket
// 2. A sends 'join-room' message with room ID and auth token
// 3. Server validates token, adds A to room, broadcasts 'peer-joined' to others
// 4. Server sends A the list of existing peers (IDs only)
// 5. A creates a PeerConnection for each peer, generates SDP offer
// 6. A sends 'sdp-offer' to server for each peer
// 7. Server forwards each offer to the respective peer
// 8. Each peer generates SDP answer and sends back via server
// 9. ICE candidates are exchanged similarly
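// 10. Once all PeerConnections are established, media flows directly between clients (or via TURN)
// Note: steps 5-10 describe peer-to-peer (mesh) negotiation. With an SFU, each client negotiates
// one PeerConnection with the media server instead, and the same offer/answer/ICE messages apply.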

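// Pseudocode for signaling server message handling
// authenticate, room, broadcastToRoom, and sendToPeer are assumed to be defined elsewhere
const MAX_SDP_CHARS = 65536; // 64KB SDP size limit

function reply(ws, payload) {
  ws.send(JSON.stringify(payload)); // WebSocket frames carry strings, not objects
}

function onMessage(ws, raw) {
  let msg;
  try { msg = JSON.parse(raw); } catch { return reply(ws, {error: 'invalid json'}); }
  switch (msg.type) {
    case 'join-room':
      if (!authenticate(msg.token)) return reply(ws, {error: 'unauthorized'});
      // Send the peer list and notify the room before adding the joiner,
      // so the joiner neither appears in its own list nor receives its own 'peer-joined'
      reply(ws, {type: 'room-state', peers: room.getPeerIds(msg.roomId)});
      broadcastToRoom(msg.roomId, {type: 'peer-joined', peerId: ws.id});
      room.addPeer(ws, msg.roomId);
      break;
    case 'sdp-offer':
    case 'sdp-answer':
      // Validate SDP type and size before relaying
      if (typeof msg.sdp !== 'string' || msg.sdp.length > MAX_SDP_CHARS) {
        return reply(ws, {error: 'invalid sdp'});
      }
      sendToPeer(msg.targetPeerId, {type: msg.type, sdp: msg.sdp, from: ws.id});
      break;
    case 'ice-candidate':
      sendToPeer(msg.targetPeerId, {type: 'ice-candidate', candidate: msg.candidate, from: ws.id});
      break;
    default:
      reply(ws, {error: 'unknown message type'});
  }
}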
Output
Client A connects -> Server validates -> A joins room -> Server broadcasts peer-joined -> A receives peer list -> A sends SDP offers -> Server forwards -> Peers answer -> ICE exchange -> Media flows
Never Do This: Signaling Over HTTP
Some tutorials show signaling over HTTP long-polling. Don't. WebRTC itself does not mandate a signaling transport, but every ICE candidate that waits on a poll cycle delays connectivity checks, and call setup can time out. Use WebSocket. I've seen a team lose 30% of call setups because ICE candidates expired before the HTTP response came back.
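
The rate limit and SDP size check described in this section can be sketched as a per-user token bucket. This is one possible approach, not the only one: the class name, the verdict strings, and the choice of a token bucket over a sliding window are assumptions made for illustration. Timestamps are passed in explicitly so the logic is easy to test.

```typescript
// Sketch: per-user token bucket for SDP offers plus an SDP size check.
// All names here are hypothetical and not part of any WebRTC library.
const SDP_CHAR_LIMIT = 65536; // 64KB limit on SDP payloads

interface OfferBucket {
  tokens: number;
  updatedAtMs: number;
}

class OfferRateLimiter {
  private buckets = new Map<string, OfferBucket>();
  private ratePerSec: number;
  private burst: number;

  constructor(ratePerSec: number = 5, burst: number = 5) {
    this.ratePerSec = ratePerSec;
    this.burst = burst;
  }

  // Returns true if this user may send another offer at time nowMs.
  allow(userId: string, nowMs: number): boolean {
    const bucket = this.buckets.get(userId) ?? { tokens: this.burst, updatedAtMs: nowMs };
    // Refill in proportion to elapsed time, capped at the burst size.
    const elapsedSec = Math.max(0, nowMs - bucket.updatedAtMs) / 1000;
    bucket.tokens = Math.min(this.burst, bucket.tokens + elapsedSec * this.ratePerSec);
    bucket.updatedAtMs = nowMs;
    const allowed = bucket.tokens >= 1;
    if (allowed) bucket.tokens -= 1;
    this.buckets.set(userId, bucket);
    return allowed;
  }
}

type OfferVerdict = "ok" | "sdp-too-large" | "rate-limited";

function checkOffer(
  limiter: OfferRateLimiter,
  userId: string,
  sdp: string,
  nowMs: number
): OfferVerdict {
  // Reject oversized payloads first so they do not consume a token.
  if (sdp.length > SDP_CHAR_LIMIT) return "sdp-too-large";
  return limiter.allow(userId, nowMs) ? "ok" : "rate-limited";
}
```

In a real server you would also evict buckets for disconnected users, otherwise the map grows with every user who has ever connected.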

Media Server Architecture: The SFU That Doesn't Drop Packets

The media server is the heart of your Zoom clone. It runs an SFU that receives RTP packets from publishers and forwards them to subscribers. Each media server handles a subset of participants (e.g., 100 per server). You need to assign participants to servers based on room size. For small rooms (<10), a single server is fine. For large rooms, you split participants across multiple servers and use a 'media bridge' to connect them. The bridge forwards streams between servers, effectively creating a distributed SFU. Each media server runs a WebRTC stack (e.g., mediasoup, Janus, or custom). Key components: a transport for each peer (WebRTC or plain RTP), a router that maps incoming streams to outgoing streams, and a bandwidth estimator that adjusts quality based on network conditions. The SFU must support simulcast: each publisher sends multiple resolutions (e.g., 720p, 360p, 180p). The SFU selects which layer to forward to each subscriber based on their bandwidth and screen size. This is critical for mobile clients on 3G. Without simulcast, you'd have to transcode, which kills latency.

SFUInternal.systemdesign
// io.thecodeforge — System Design tutorial

// Simplified SFU internal architecture
// Each participant has a Producer (sends media) and a Consumer (receives media)
// The SFU maintains a map: roomId -> { producers: Map<peerId, Producer>, consumers: Map<peerId, Consumer> }

class SFU {
  rooms: Map<string, Room>;

  onProducer(roomId, producer) {
    const room = this.rooms.get(roomId);
    room.producers.set(producer.peerId, producer);
    // Notify all consumers in room about new producer
    for (const consumer of room.consumers.values()) {
      consumer.addProducer(producer);
    }
  }

  onConsumer(roomId, consumer) {
    const room = this.rooms.get(roomId);
    room.consumers.set(consumer.peerId, consumer);
    // Add existing producers to this consumer
    for (const producer of room.producers.values()) {
      consumer.addProducer(producer);
    }
  }

  // Forwarding logic: for each consumer, decide which producers to forward
  // Use audio levels to select top N speakers (e.g., 3)
  // For video, forward only active speakers' high-resolution streams
  // For others, forward low-resolution or no video
  forward(roomId) {
    const room = this.rooms.get(roomId);
    if (!room) return; // unknown or already-closed room
    const activeSpeakers = this.getTopSpeakers(room, 3); // assumed to return a Set of peerIds
    for (const consumer of room.consumers.values()) {
      const streamsToForward = [];
      for (const producer of room.producers.values()) {
        if (producer.peerId === consumer.peerId) continue; // don't send own stream
        if (activeSpeakers.has(producer.peerId)) {
          streamsToForward.push({ producer, layer: 'high' });
        } else {
          streamsToForward.push({ producer, layer: 'low' }); // or skip video entirely
        }
      }
      consumer.setStreams(streamsToForward);
    }
  }
}
Output
SFU maintains room state -> Producers send RTP -> SFU forwards selected streams to consumers -> Active speaker detection selects top 3 -> Others get low-res or no video
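
The listing above leaves speaker selection undefined. One way to pick the top N speakers is to smooth each participant's reported audio level so a single cough does not steal a video slot. The sketch below is an assumption about how that could look; the class name, the smoothing factor of 0.3, and the level scale of 0 to 1 are all illustrative choices.

```typescript
// Sketch: top-N active speaker selection with exponential smoothing.
// Names and the 0..1 level scale are hypothetical.
class ActiveSpeakerTracker {
  private smoothed = new Map<string, number>();
  private alpha: number;

  constructor(alpha: number = 0.3) {
    this.alpha = alpha; // weight given to the newest sample
  }

  // Record a new audio level sample for a peer.
  report(peerId: string, level: number): void {
    const previous = this.smoothed.get(peerId);
    const next =
      previous === undefined ? level : this.alpha * level + (1 - this.alpha) * previous;
    this.smoothed.set(peerId, next);
  }

  // Forget a peer that has left the room.
  remove(peerId: string): void {
    this.smoothed.delete(peerId);
  }

  // Return the ids of the n peers with the highest smoothed level.
  top(n: number): Set<string> {
    const ranked = Array.from(this.smoothed.entries())
      .sort((a, b) => b[1] - a[1])
      .slice(0, n)
      .map(([peerId]) => peerId);
    return new Set(ranked);
  }
}

const speakerTracker = new ActiveSpeakerTracker();
speakerTracker.report("alice", 0.9);
speakerTracker.report("bob", 0.1);
speakerTracker.report("carol", 0.7);
speakerTracker.report("bob", 1.0); // one loud spike: smoothed to 0.3*1.0 + 0.7*0.1 = 0.37
console.log(Array.from(speakerTracker.top(2))); // alice and carol keep their slots
```

Returning a Set matches how the forwarding loop uses activeSpeakers.has(peerId). A lower alpha makes switching more stable but slower to react when someone new starts talking.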
Senior Shortcut: Use mediasoup
Building an SFU from scratch is a year-long project. Use mediasoup (C++ media worker with a Node.js API). It handles WebRTC, simulcast, SVC, and bandwidth estimation out of the box. We've used it in production for 10,000+ concurrent users. Just don't forget to cap incoming bitrate (mediasoup exposes setMaxIncomingBitrate() on each WebRtcTransport) so a single user can't flood the server.
[Figure: SFU Media Routing Flow, from publisher to subscriber via selective forwarding. The publisher sends an RTP stream to the SFU; the SFU receiver ingests RTP and extracts the SSRC; forwarding logic selects subscribers per stream; each subscriber receives only the streams it needs. The SFU never decodes or encodes; it forwards RTP packets.]

Scaling to 1000+ Participants: Distributed SFU and Cascading

A single SFU can handle ~100-200 participants before CPU or bandwidth becomes a bottleneck. Beyond that, you need to distribute the load. Two approaches: 1) Room-based sharding: assign each room to a specific SFU. Works if rooms are small (<100). 2) Distributed SFU: split a single large room across multiple SFUs, each handling a subset of participants. The SFUs are connected via a media bridge (e.g., using RTP over UDP between servers). Each SFU forwards streams from its participants to other SFUs as needed. This is complex because you need to avoid forwarding the same stream multiple times. A common pattern is to designate one SFU as the 'bridge' for each stream, or use a full mesh between SFUs. For 1000 participants, you might have 10 SFUs, each handling 100 participants. Each SFU forwards the streams of its own active speakers to all other SFUs. With at most 6 global active speakers (the usual range is 3-6), each sent to the 9 other SFUs, that's at most 6 × 9 = 54 cross-SFU streams. Manageable. But you also need global active speaker detection: the SFUs must agree on who's speaking. Use a centralized audio level aggregator that collects levels from all SFUs and broadcasts the top speakers.

DistributedSFU.systemdesign
// io.thecodeforge — System Design tutorial

// Distributed SFU architecture for 1000 participants
// Assume 10 SFU nodes, each handling 100 participants
// Each SFU has a bridge module that connects to other SFUs

// Bridge module: forwards selected streams to other SFUs
class Bridge {
  connections: Map<SFUId, RtpConnection>;

  forwardStream(stream, targetSFUId) {
    const conn = this.connections.get(targetSFUId);
    conn.send(stream);
  }

  // On receiving a stream from another SFU
  onRemoteStream(stream) {
    // Add this stream to local consumers that need it
    // e.g., if stream is from an active speaker, forward to all local consumers
    this.localSFU.addRemoteProducer(stream);
  }
}

// Global active speaker detection
// Each SFU periodically sends audio levels of its participants to a central service
// Central service aggregates and returns top 6 speakers globally
// SFUs then forward those speakers' streams to all other SFUs

// Pseudocode for central audio level service
function getGlobalActiveSpeakers(levelsFromAllSFUs) {
  // levelsFromAllSFUs: Map<SFUId, Map<PeerId, level>>
  const allLevels = [];
  for (const [sfuId, peerLevels] of levelsFromAllSFUs) {
    for (const [peerId, level] of peerLevels) {
      allLevels.push({ peerId, level, sfuId });
    }
  }
  allLevels.sort((a,b) => b.level - a.level);
  return allLevels.slice(0, 6);
}
Output
10 SFUs each handle 100 participants -> Central audio level aggregator determines the global top 6 speakers -> Each speaker's home SFU forwards that stream to the other SFUs via bridge -> Total cross-SFU streams: at most 6 speakers * 9 remote SFUs = 54
Production Trap: Bridge Bandwidth
If you forward all streams between SFUs, you'll saturate your inter-SFU link. In a 1000-person call with 10 SFUs, forwarding all 1000 streams would require 1000 * 2Mbps = 2Gbps per SFU. Only forward active speakers. Use a hard limit of 6 streams per SFU. We learned this the hard way when our 10Gbps link became the bottleneck.
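
To make the bridge bandwidth trap concrete, here is a rough per-SFU ingress estimate for the two forwarding policies. It is a sketch under simple assumptions: participants are spread evenly across SFUs, every stream has the same bitrate, and none of the active speakers is local to the SFU being measured (the worst case). The names are illustrative.

```typescript
// Sketch: inter-SFU ingress per node, in Mbps. Names are hypothetical.
interface ClusterShape {
  totalParticipants: number;
  sfuCount: number;
  streamMbps: number;
}

// Ingress per SFU if every remote participant's stream is bridged to it.
function forwardAllIngressMbps(c: ClusterShape): number {
  const localParticipants = c.totalParticipants / c.sfuCount;
  return (c.totalParticipants - localParticipants) * c.streamMbps;
}

// Worst-case ingress per SFU if only the global active speakers are bridged.
function activeSpeakerIngressMbps(c: ClusterShape, activeSpeakers: number): number {
  return activeSpeakers * c.streamMbps;
}

const exampleCluster: ClusterShape = { totalParticipants: 1000, sfuCount: 10, streamMbps: 2 };
console.log(forwardAllIngressMbps(exampleCluster));       // 1800 Mbps, roughly the 2Gbps quoted above
console.log(activeSpeakerIngressMbps(exampleCluster, 6)); // 12 Mbps
```

The forward-all figure comes out slightly under 2Gbps because an SFU does not need to receive its own 100 local streams over the bridge. Either way, the gap between the two policies is about two orders of magnitude.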

Handling Network Degradation: Bandwidth Estimation and Adaptation

Real-time video is unforgiving of packet loss. WebRTC has built-in bandwidth estimation (GCC, Google Congestion Control) that adjusts bitrate based on delay and loss. But you need to configure it properly. The SFU should also participate: it can send REMB (Receiver Estimated Maximum Bitrate) messages to publishers to reduce their bitrate. For clients with poor connectivity, the SFU can switch to a lower simulcast layer or drop video entirely (audio-only). Key: never let the client decide alone, because the server sees conditions across all participants. Implement a server-side bandwidth manager that aggregates feedback from all consumers and sends a unified REMB to each producer. Also, support FEC (Forward Error Correction) for audio: it's small and worth the overhead. For video, FEC is too expensive; use NACKs and retransmissions instead. And always enable packet loss concealment (PLC) in audio codecs (Opus does this automatically).

BandwidthAdaptation.systemdesign
// io.thecodeforge — System Design tutorial

// Server-side bandwidth manager
// Collects receiver reports from all consumers of a producer
// Computes a combined REMB and sends it to the producer

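class BandwidthManager {
  // Latest report per consumer, grouped by producer
  reports: Map<ProducerId, Map<ConsumerId, ConsumerReport>> = new Map();

  onReceiverReport(producerId, consumerId, report) {
    let byConsumer = this.reports.get(producerId);
    if (!byConsumer) {
      byConsumer = new Map();
      this.reports.set(producerId, byConsumer);
    }
    // Overwrite the previous report so stale estimates don't pin the minimum forever
    byConsumer.set(consumerId, report);
    // Aggregate: take the minimum available bandwidth across all consumers
    const minBw = Math.min(...[...byConsumer.values()].map(r => r.availableBitrate));
    // Send REMB to producer
    sendREMB(producerId, minBw);
  }

  // Drop a consumer's report when it leaves; otherwise it keeps throttling the producer
  onConsumerClosed(producerId, consumerId) {
    this.reports.get(producerId)?.delete(consumerId);
  }

  // On packet loss > 5%, force switch to lower simulcast layer
  onPacketLoss(producerId, consumerId, lossFraction) {
    if (lossFraction > 0.05) {
      // Ask SFU to forward lower layer for this consumer
      sfu.setConsumerLayer(consumerId, 'low');
    }
  }
}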

// Simulcast layer switching
// Producer sends 3 layers: high (720p, 2Mbps), medium (360p, 500Kbps), low (180p, 150Kbps)
// SFU selects layer per consumer based on bandwidth
function selectLayer(availableBw) {
  if (availableBw > 1000000) return 'high';
  if (availableBw > 300000) return 'medium';
  return 'low';
}
Output
Bandwidth manager aggregates consumer reports -> Sends REMB with min available bandwidth -> Producer adjusts bitrate -> If packet loss >5%, SFU switches consumer to lower layer
Interview Gold: GCC vs. Sender-Side Estimation
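The original GCC design estimated bandwidth on the receiver side and reported it to the sender via REMB. Current WebRTC stacks run the estimator on the sender, fed by transport-wide congestion control (transport-cc) feedback. For a server-side SFU, you often want sender-side estimation because the server sees all consumers. Some implementations use a hybrid: receiver-side for client-to-server, sender-side for server-to-client. Know the difference.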
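
A plain threshold function for layer selection tends to flap when the bandwidth estimate hovers near a boundary. A common remedy is hysteresis: step down as soon as the current layer no longer fits, but step up only with some headroom. The sketch below reuses the 300 Kbps and 1 Mbps thresholds from the listing; the 20% upgrade margin and all names are assumptions for illustration.

```typescript
// Sketch: simulcast layer selection with hysteresis. Names and the
// 1.2x upgrade margin are hypothetical tuning choices.
type SimulcastLayer = "low" | "medium" | "high";

const LAYER_ORDER: SimulcastLayer[] = ["low", "medium", "high"];

// Minimum estimated bandwidth (bps) required to stay on each layer.
const LAYER_MIN_BPS: Record<SimulcastLayer, number> = {
  low: 0,
  medium: 300000,
  high: 1000000,
};

const UPGRADE_MARGIN = 1.2;

function nextLayer(current: SimulcastLayer, availableBps: number): SimulcastLayer {
  let idx = LAYER_ORDER.indexOf(current);
  // Step down immediately while the current layer no longer fits.
  while (idx > 0 && availableBps <= LAYER_MIN_BPS[LAYER_ORDER[idx]]) {
    idx--;
  }
  // Step up only when the estimate clears the next threshold with headroom.
  while (
    idx < LAYER_ORDER.length - 1 &&
    availableBps > LAYER_MIN_BPS[LAYER_ORDER[idx + 1]] * UPGRADE_MARGIN
  ) {
    idx++;
  }
  return LAYER_ORDER[idx];
}

console.log(nextLayer("medium", 1100000)); // "medium": above 1 Mbps but below the 1.2 Mbps upgrade bar
console.log(nextLayer("high", 1100000));   // "high": already there, and 1.1 Mbps still fits
console.log(nextLayer("high", 900000));    // "medium": downgrade happens right away
```

The asymmetry is deliberate: a late downgrade causes visible freezes, while a late upgrade only delays a sharper picture.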

TURN Servers: The NAT Traversal Safety Net

Not all clients can establish peer-to-peer connections due to symmetric NATs or firewalls. That's where TURN (Traversal Using Relays around NAT) comes in. TURN servers relay media traffic. They're bandwidth hogs: each stream consumes relay bandwidth. You need to deploy TURN servers in multiple regions close to users. Use ICE (Interactive Connectivity Establishment) to try direct P2P first (via STUN), then fall back to TURN. Configure TURN with authentication (time-limited credentials) to prevent abuse. Key metric: TURN usage ratio. If >20% of calls use TURN, your STUN infrastructure might be misconfigured or your users are behind restrictive NATs (e.g., corporate VPNs). Also, TURN servers must support UDP, TCP, and TLS. UDP is preferred for low latency. TCP adds overhead but works through firewalls that block UDP.

TURNDeployment.systemdesign
// io.thecodeforge — System Design tutorial

// TURN server configuration (using coturn as example)
// /etc/turnserver.conf

listening-port=3478
tls-listening-port=5349
fingerprint
lt-cred-mech
# Static user for testing only. Use time-limited credentials in production.
user=zoomuser:secretpassword
realm=thecodeforge.io
# Cap bandwidth per session at 2Mbps. coturn's max-bps is in bytes per second: 2,000,000 bits / 8 = 250,000
max-bps=250000
# Use multiple relay IPs for load balancing
relay-ip=203.0.113.1
relay-ip=203.0.113.2
# STUN binding requests are answered by default, so no extra option is needed

// Client-side ICE configuration
// In WebRTC, set iceTransportPolicy to 'relay' only if you want to force TURN
// For normal operation, use 'all' (default)
const pcConfig = {
  iceServers: [
    { urls: 'stun:stun.thecodeforge.io:3478' },
    // Placeholder values: fetch a short-lived username/credential from your signaling server
    { urls: 'turn:turn.thecodeforge.io:3478', username: 'user', credential: 'pass' }
  ]
};

// Monitor TURN usage
// Metric: relayed bytes vs total bytes
// If relayed > 20% of total, investigate NAT issues
Output
TURN server listens on 3478 (UDP/TCP) and 5349 (TLS) -> Clients use ICE to try STUN first -> If STUN fails, fallback to TURN -> TURN relays media -> Monitor relay ratio
Never Do This: Hardcoding TURN Credentials
I've seen production configs with hardcoded TURN credentials in the client code. Attackers can use your TURN server for free relay, costing you bandwidth. Always use time-limited credentials generated by your signaling server. coturn supports this through its TURN REST API mode (the use-auth-secret and static-auth-secret options).
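
Time-limited credentials can be generated on the signaling server with a shared secret. The sketch below follows the convention used by coturn's use-auth-secret mode, where the username is an expiry timestamp plus a user id and the password is a base64 HMAC-SHA1 of that username. It assumes a Node.js runtime; the function name, the 'alice' user id, and the secret value are hypothetical, and the secret must match the static-auth-secret configured on the TURN server.

```typescript
import { createHmac } from "node:crypto";

// Sketch: short-lived TURN credentials for coturn's shared-secret mode.
// Function and field names are hypothetical.
interface TurnCredentials {
  username: string;
  credential: string;
  ttlSeconds: number;
}

function makeTurnCredentials(
  userId: string,
  sharedSecret: string,
  ttlSeconds: number,
  nowMs: number = Date.now()
): TurnCredentials {
  // Username encodes the Unix expiry time, so the TURN server can reject it afterwards.
  const expiresAt = Math.floor(nowMs / 1000) + ttlSeconds;
  const username = `${expiresAt}:${userId}`;
  // Password is base64(HMAC-SHA1(secret, username)).
  const credential = createHmac("sha1", sharedSecret).update(username).digest("base64");
  return { username, credential, ttlSeconds };
}

const exampleCreds = makeTurnCredentials("alice", "replace-with-real-secret", 3600, 1700000000000);
console.log(exampleCreds.username); // 1700003600:alice
```

The signaling server would return these values to the client when it joins a room, and the client would place them in the username and credential fields of its iceServers entry. Keep the TTL short enough that a leaked credential is of little use.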

Recording and Playback: Archiving the Chaos

Recording a Zoom call is harder than it looks. You can't just record the mixed audio/video because you lose individual speaker tracks. For compliance (e.g., legal depositions), you need per-participant recordings. Solution: have the SFU send each participant's stream to a recording service. The recording service can either store individual tracks (for later compositing) or mix them in real-time. For real-time mixing, use a dedicated MCU-like component that decodes and mixes streams, then encodes the final video. This is expensive. Better: store individual tracks as fragmented MP4 (fMP4) with timestamps, and composite offline. For playback, you need a video player that can handle multiple synchronized streams. Use HLS or DASH with multiple audio tracks. Or build a custom player using WebRTC to re-render the call. Gotcha: recording must handle network glitches — buffer at least 5 seconds of data to recover from packet loss.

RecordingService.systemdesign
// io.thecodeforge — System Design tutorial

// Recording service architecture
// SFU forwards each participant's stream to a recording service via RTP
// Recording service writes each stream to a separate file (e.g., WebM or fMP4)

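class RecordingService {
  roomId: string;
  streams: Map<PeerId, FileWriter> = new Map();

  constructor(roomId) {
    this.roomId = roomId; // needed to build the output path below
  }

  onRtpPacket(peerId, packet) {
    let writer = this.streams.get(peerId);
    if (!writer) {
      writer = new FileWriter(`/recordings/${this.roomId}/${peerId}.webm`);
      this.streams.set(peerId, writer);
    }
    // FileWriter is assumed to depacketize RTP and mux frames into the container;
    // raw RTP packets written straight to disk are not a playable WebM file
    writer.write(packet);
  }

  // For composite recording, use a mixer that decodes all streams
  // and encodes a single output
  // Mixer uses FFmpeg or custom GStreamer pipeline
}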

// Offline compositing command (using FFmpeg)
// ffmpeg -i peer1.webm -i peer2.webm -filter_complex "[0:v]scale=640:360[0v];[1:v]scale=640:360[1v];[0v][1v]hstack=inputs=2" output.mp4
// This creates a side-by-side video. For grid layout, use more complex filter.
Output
SFU forwards streams to recording service -> Each stream saved as individual file -> Offline compositing produces final video
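
The two-input FFmpeg command in the listing can be generalized to N tracks by generating the filter graph. The helper below is a sketch that only builds the argument list; it does not run FFmpeg, and it handles video only (audio mixing would need an additional filter). The function name and default tile size are assumptions.

```typescript
// Sketch: build FFmpeg arguments for a side-by-side composite of N recordings.
// The function name and defaults are hypothetical; only video is composited.
function buildHstackArgs(
  inputs: string[],
  output: string,
  width: number = 640,
  height: number = 360
): string[] {
  if (inputs.length < 2) {
    throw new Error("hstack needs at least two inputs");
  }
  const args: string[] = [];
  for (const file of inputs) {
    args.push("-i", file);
  }
  // Scale every input to the same size, then stack them horizontally.
  const scaled = inputs.map((_, i) => `[${i}:v]scale=${width}:${height}[v${i}]`);
  const labels = inputs.map((_, i) => `[v${i}]`).join("");
  const filter = `${scaled.join(";")};${labels}hstack=inputs=${inputs.length}[out]`;
  args.push("-filter_complex", filter, "-map", "[out]", output);
  return args;
}

const compositeArgs = buildHstackArgs(["peer1.webm", "peer2.webm"], "output.mp4");
console.log(["ffmpeg", ...compositeArgs].join(" "));
```

Returning an array rather than a single string lets you pass the arguments to a process spawner without shell quoting issues. For a grid layout you would swap hstack for a filter such as xstack and compute tile positions.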
Senior Shortcut: Use GStreamer for Recording
Don't build a recording pipeline from scratch. Use GStreamer with webrtcbin to receive streams and encode to file. It handles jitter buffer, retransmission, and encoding. We use it in production with a Python wrapper.

When Not to Build Your Own Zoom: The Build vs. Buy Decision

Building a Zoom clone is a massive undertaking. You need expertise in WebRTC, networking, distributed systems, and media codecs. If your core business isn't video conferencing, don't build it. Use a third-party API like Twilio Video, Agora, or Daily.co. They handle SFU, TURN, and scaling. You pay per participant-minute, but you save months of engineering. Only build if you have specific requirements: custom UI, offline recording, proprietary codecs, or air-gapped deployments. Even then, consider using open-source SFUs like mediasoup or Janus and customize. The build vs. buy decision is simple: if you need to support >1000 participants with <200ms latency, and you have a team of 5+ engineers dedicated to this, build. Otherwise, buy.

The Classic Bug: Over-Engineering
I've seen startups spend 6 months building a video platform when they could have used Twilio and launched in 2 weeks. They ended up with a buggy product and missed their market window. Don't be a hero. Use existing infrastructure.
● Production incident · POST-MORTEM · severity: high

The 4GB Container That Kept Dying

Symptom
Media server pods in Kubernetes kept OOMKilling every 15 minutes during a 200-person webinar. Memory usage spiked to 4GB then crashed.
Assumption
Thought it was a memory leak in the WebRTC library. Spent days profiling heap dumps.
Root cause
SFU was configured to forward all video streams (not just active speakers). Each 720p stream at 30fps consumed ~2Mbps bandwidth and ~50MB memory for buffering. 200 participants × 50MB = 10GB, but container limit was 4GB. OOM killer did its job.
Fix
Enabled simulcast and forward-only-active-speaker mode. Set max video streams per client to 4 (the 3 loudest speakers + self). Memory dropped to 800MB. No more OOMs.
Key lesson
  • Never forward all streams.
  • Always limit active video streams to a small number (3-6).
  • Use audio levels to select which streams to forward.
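
The root-cause arithmetic is worth running before you pick a container limit. This back-of-the-envelope sketch uses the ~50MB per buffered stream figure from the post-mortem, treats 4GB as 4000MB, and ignores all other memory the process needs, so treat the result as an upper bound. The function names are illustrative.

```typescript
// Sketch: buffer memory budgeting for a media server pod. Names are hypothetical.
function bufferMemoryMb(bufferedStreams: number, mbPerStream: number): number {
  return bufferedStreams * mbPerStream;
}

function maxBufferedStreams(containerLimitMb: number, mbPerStream: number): number {
  return Math.floor(containerLimitMb / mbPerStream);
}

console.log(bufferMemoryMb(200, 50));      // 10000 MB, the 10GB from the root cause
console.log(maxBufferedStreams(4000, 50)); // 80 streams at most under a 4GB limit
```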
Production debug guide
Systematic recovery paths for the failure modes engineers actually hit. 3 entries
Symptom · 01
Users report frozen video or audio stuttering
Fix
1. Check packet loss on media server (use netstat -su for UDP receive errors, or WebRTC stats). 2. Check bandwidth estimation logs. 3. If loss >5%, enable FEC for audio or switch to lower simulcast layer. 4. Verify TURN server bandwidth isn't saturated.
Symptom · 02
Call setup fails for some users (ICE connection timeout)
Fix
1. Check signaling server logs for ICE candidate exchange. 2. Verify STUN/TURN servers are reachable from client IP. 3. Check firewall rules (UDP 3478, TCP 443). 4. Increase ICE timeout from default 5s to 10s.
Symptom · 03
High CPU on media server
Fix
1. Check number of active streams per server. 2. Reduce max participants per server (e.g., from 200 to 100). 3. Enable simulcast and limit high-resolution streams. 4. Profile with perf to find hot spots (e.g., encryption).
★ Design Zoom Triage Cheat Sheet
First-response commands for when things go wrong, copy-paste ready.
`OOMKilled` on media server pod
Immediate action
Check memory usage per stream
Commands
kubectl top pod <pod-name> --containers
kubectl describe pod <pod-name> | grep -A 5 'Last State'
Fix now
Cap incoming bitrate at 2Mbps per publisher (in mediasoup, transport.setMaxIncomingBitrate(2000000)) and limit active video streams to 4.
`ICE failed, no candidate pair` in client logs
Immediate action
Verify TURN server is reachable
Commands
nc -vuz <turn-server-ip> 3478
tcpdump -i any port 3478
Fix now
Add TURN server with correct credentials. Ensure firewall allows UDP 3478.
High packet loss (>10%) on server
Immediate action
Check network interface saturation
Commands
sar -n DEV 1 5
ethtool -S eth0 | grep drop
Fix now
Enable FEC for audio (Opus inband FEC). For video, reduce max bitrate per stream to 1Mbps.
Audio out of sync with video
Immediate action
Check RTP timestamps and NTP sync
Commands
date && ntpq -p
tshark -r capture.pcap -Y 'rtp' -T fields -e rtp.timestamp -e rtp.ssrc
Fix now
Ensure all media servers use same NTP server. Enable RTCP sender reports for synchronization.
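
RTCP sender reports enable synchronization because each one pairs an NTP wallclock time with the RTP timestamp of the same instant. With that pair and the stream's clock rate, any RTP timestamp can be mapped to wallclock time, and audio and video can be compared on one timeline. The sketch below assumes the NTP time has already been converted to milliseconds; the names are illustrative.

```typescript
// Sketch: map RTP timestamps to wallclock time using an RTCP sender report.
// Interface and function names are hypothetical.
interface SenderReport {
  ntpMs: number;        // wallclock time from the sender report, in milliseconds
  rtpTimestamp: number; // RTP timestamp corresponding to ntpMs
}

function rtpToWallclockMs(rtpTimestamp: number, sr: SenderReport, clockRate: number): number {
  // RTP timestamps are 32-bit and wrap around; "| 0" yields the signed 32-bit difference.
  const deltaTicks = (rtpTimestamp - sr.rtpTimestamp) | 0;
  return sr.ntpMs + (deltaTicks / clockRate) * 1000;
}

// Video typically uses a 90 kHz clock, Opus audio a 48 kHz clock.
const videoSr: SenderReport = { ntpMs: 1000000, rtpTimestamp: 900000 };
const audioSr: SenderReport = { ntpMs: 1000000, rtpTimestamp: 480000 };

const videoMs = rtpToWallclockMs(990000, videoSr, 90000); // 1001000
const audioMs = rtpToWallclockMs(504000, audioSr, 48000); // 1000500
console.log(`video is ${videoMs - audioMs} ms ahead of audio`); // 500
```

A receiver would delay whichever stream is ahead by that difference before rendering. If the media servers' clocks disagree, the ntpMs values are not comparable, which is why a shared NTP source matters.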
| Feature / Aspect | SFU | MCU |
| --- | --- | --- |
| Server CPU load | O(N): forwarding only | O(N²): decode/mix/encode |
| Client CPU load | High: decode multiple streams | Low: decode one mixed stream |
| Bandwidth per client | High: receive multiple streams | Low: receive one stream |
| Latency | Low: no transcoding | Higher: transcoding adds delay |
| Scalability | Good up to 1000+ with distributed SFU | Poor beyond 20 participants |
| Complexity | Moderate: need simulcast and active speaker detection | High: need mixer and transcoder |

Key takeaways

1
SFU scales better than MCU for large calls because server load is O(N) vs O(N²). Always use SFU for >10 participants.
2
Never forward all video streams. Limit active video to 3-6 streams per client based on audio levels. This cuts bandwidth and CPU by 90%.
3
Signaling should run over a persistent channel such as WebSocket, not HTTP polling. ICE connectivity checks time out within seconds; WebSocket keeps candidate exchange low-latency.
4
Deploy TURN servers in multiple regions. Monitor relay ratio; if >20%, investigate NAT issues. Use time-limited credentials for security.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR
How does Zoom's SFU handle a participant with poor network connectivity?...
Q02SENIOR
In a distributed SFU architecture with 1000 participants across 10 serve...
Q03SENIOR
What happens when a TURN server runs out of bandwidth during a large cal...
Q04JUNIOR
Explain the role of STUN and TURN in WebRTC. When would a call use TURN ...
Q05SENIOR
You notice that 40% of your calls are using TURN relay, costing high ban...
Q06SENIOR
Design a system to record a 500-person Zoom call with per-participant au...
Q01 of 06 · SENIOR

How does Zoom's SFU handle a participant with poor network connectivity? Describe the adaptation mechanism.

ANSWER
Zoom uses simulcast: each client sends multiple resolution layers. The SFU monitors packet loss and round-trip time per consumer. If a consumer's loss exceeds 5%, the SFU switches that consumer to a lower simulcast layer. Additionally, the SFU sends REMB messages to the producer to reduce overall bitrate. For extreme cases, the SFU may drop video entirely and forward only audio.
FAQ · 4 QUESTIONS

Frequently Asked Questions

01
How does Zoom handle 1000 participants in a single call?
02
What's the difference between SFU and MCU in video conferencing?
03
How do I set up a TURN server for WebRTC?
04
What happens when WebRTC packet loss is high? How does Zoom adapt?