Senior 6 min · June 25, 2026

Design Zoom: Building Real-Time Video at Scale Without Losing Your Mind

Q: How does Zoom handle 1000 participants in a single call?

Zoom uses a distributed SFU architecture. Participants are split across multiple media servers (SFUs), each handling ~100 participants. The SFUs are interconnected via a media bridge that forwards active speaker streams between servers. A global audio level aggregator determines the top speakers, and only those streams are forwarded across servers. This keeps bandwidth and CPU manageable.

Q: What's the difference between SFU and MCU in video conferencing?

SFU (Selective Forwarding Unit) forwards individual streams without processing them — the server is a smart switch. MCU (Multipoint Control Unit) mixes all streams into one on the server. SFU scales better (O(N) server load) but requires more client bandwidth. MCU is simpler for small groups but doesn't scale. Zoom uses SFU.

Q: How do I set up a TURN server for WebRTC?

Use coturn. Install it, configure listening ports (3478 for UDP/TCP, 5349 for TLS), set authentication with time-limited credentials, and specify relay IPs. In your WebRTC client, add the TURN server to `iceServers` with username and credential. Ensure firewall allows the ports. Monitor usage to avoid abuse.

Q: What happens when WebRTC packet loss is high? How does Zoom adapt?

Zoom's SFU monitors packet loss per consumer. If loss exceeds 5%, the SFU switches the consumer to a lower simulcast layer (e.g., from 720p to 360p). It also sends REMB messages to the producer to reduce bitrate. For audio, Opus has built-in packet loss concealment. If loss is extreme, the SFU may drop video entirely and keep audio only.

Design Zoom system design: architecture, WebRTC internals, SFU vs MCU, scaling to 1000+ participants, and production gotchas from real incidents.

Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Everything here is grounded in real deployments.

Production tested · Last updated June 25, 2026 · 1,663 articles, all by Naren


⚡Quick Answer

Zoom uses a Selective Forwarding Unit (SFU) architecture where the server forwards selected video streams to each participant, reducing client processing. Key components: signaling server (WebSocket), media server (SFU), TURN server for NAT traversal, and a distributed backend for rooms and users.

Definition · ~90s read

What is Design Zoom?

Design Zoom is the system design of a real-time video conferencing platform like Zoom. It covers client-server architecture, media routing (SFU/MCU), signaling, WebRTC, scaling strategies, and handling network degradation.

Plain-English First

Imagine a conference room where everyone talks at once. In a small room, you can hear everyone. But with 100 people, it's chaos. Zoom's SFU is like a smart switchboard operator: they listen to everyone, but only forward the voice of the person currently speaking to each listener. If you're not speaking, the operator stops sending your voice to others, saving everyone's ears (and bandwidth).

You've seen it happen: a 50-person all-hands call turns into a slideshow of frozen faces, audio stuttering like a scratched CD. Everyone blames the Wi-Fi. But the real culprit is almost always the server architecture. Most video calling systems choke because they try to send every participant's video to every other participant — an O(n²) problem that kills bandwidth and CPU. Zoom doesn't do that. And that's why it works when everything else falls apart.

This article breaks down the system design of a Zoom-like platform. You'll learn the exact architecture — signaling, media routing, scaling, and the production traps that take down naive implementations. By the end, you'll be able to design a real-time video system that handles 1000+ participants without melting your servers or your users' laptops.

Why SFU Beats MCU: The Bandwidth Math That Decides Your Architecture

Before you write a single line of code, you need to pick your media routing strategy. The two main options are MCU (Multipoint Control Unit) and SFU (Selective Forwarding Unit).

An MCU mixes all incoming streams into a composite stream on the server. Each client sends one stream and receives one stream. Sounds simple. But the server has to decode, mix, and re-encode every stream, which is CPU-intensive and adds latency.

An SFU forwards streams without decoding them. The server is just a smart switch: it selects which streams to send to each client based on who's speaking. The client decodes multiple streams and renders them. This shifts the processing burden to clients, which is fine for desktops but tough for mobile.

The scaling math decides it. SFU server load grows linearly with participants (O(n)) because the server only forwards packets. MCU load approaches O(n²) in the worst case: the server decodes n streams and, when each client gets its own mix, produces n composites of up to n−1 inputs each. For a 100-person call, that is on the order of 100× the per-stream work of an SFU. That's why Zoom uses SFU. The trade-off: SFU requires more client bandwidth (each client receives multiple streams), but you can mitigate that with simulcast (each sender publishes multiple resolutions) and bandwidth estimation.

SFUvsMCU.systemdesign

// io.thecodeforge — System Design tutorial

// SFU bandwidth calculation for N participants
// Each client sends 1 stream (uplink) and receives M streams (downlink)
// M = number of active speakers (e.g., 4)
// Total server bandwidth = N * (uplink + M * downlink)
// Example: N=100, uplink=2Mbps, downlink=2Mbps, M=4
// Server bandwidth = 100 * (2 + 4*2) = 100 * 10 = 1000 Mbps = 1 Gbps

// MCU bandwidth calculation
// Each client sends 1 stream, receives 1 mixed stream
// Server must decode N streams, mix, encode 1 stream per client
// Server bandwidth = N * (uplink + downlink) = 100 * (2+2) = 400 Mbps
// But server CPU approaches O(N^2) when each client gets its own mix:
// N encodes, each compositing up to N-1 decoded streams
// For N=100, that is up to 100*99 = 9,900 stream inputs to composite
// SFU server CPU is O(N) — just forwarding packets

// Decision: Use SFU for >10 participants. MCU only for small groups (<10) where client CPU is limited (e.g., embedded devices).

Output

SFU server bandwidth: 1000 Mbps

MCU server bandwidth: 400 Mbps

SFU server CPU: O(N)

MCU server CPU: O(N^2)
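
The arithmetic above can be turned into a small calculator for trying other call sizes. This is a minimal sketch that only encodes the two formulas from the comments; the `CallProfile` type and the function names are illustrative, not part of any library.

```typescript
// Sketch: server-side bandwidth for SFU vs MCU, using the formulas above.
// All rates are in Mbps. Type and function names are hypothetical.
interface CallProfile {
  participants: number;      // N
  uplinkMbps: number;        // bitrate each client sends
  perStreamDownMbps: number; // bitrate of each stream sent to a client
  forwardedStreams: number;  // M, streams each client receives from the SFU
}

function sfuServerMbps(p: CallProfile): number {
  // N * (uplink + M * downlink): ingest one stream per client, fan out M per client
  return p.participants * (p.uplinkMbps + p.forwardedStreams * p.perStreamDownMbps);
}

function mcuServerMbps(p: CallProfile): number {
  // N * (uplink + downlink): ingest one stream, send one mixed stream per client
  return p.participants * (p.uplinkMbps + p.perStreamDownMbps);
}

const allHands: CallProfile = {
  participants: 100,
  uplinkMbps: 2,
  perStreamDownMbps: 2,
  forwardedStreams: 4,
};

console.log(sfuServerMbps(allHands)); // 1000
console.log(mcuServerMbps(allHands)); // 400
```

Note that the MCU wins on raw bandwidth; the calculator makes it obvious that the case for SFU rests on CPU and latency, not on bytes moved.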

Production Trap: MCU at Scale

I've seen a startup try MCU for a 500-person town hall. Their server farm melted in 3 minutes. The CPU hit 100% on all cores, and the audio delay hit 10 seconds. They switched to SFU the next day. Don't be that startup.

Media Routing Decision Tree

If participants <= 10 and clients are low-power (mobile/embedded) → use MCU: the server does the heavy lifting

If participants > 10 and clients are desktops or modern phones → use SFU with simulcast and active speaker detection

If participants > 100 and you need to support legacy clients → use SFU with transcoding fallback for incompatible codecs
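
The decision tree can be written as a single function. This is a sketch under one stated assumption: cases the tree does not cover (for example, a small room with capable clients) default to SFU with simulcast. The `RoomProfile` type and the return labels are made up for illustration.

```typescript
// Sketch of the media routing decision tree. Names are hypothetical.
type RoutingChoice = "mcu" | "sfu-simulcast" | "sfu-transcoding-fallback";

interface RoomProfile {
  participants: number;
  lowPowerClients: boolean; // mobile/embedded clients that struggle to decode many streams
  legacyClients: boolean;   // clients with incompatible codecs
}

function chooseRouting(room: RoomProfile): RoutingChoice {
  // Small room with weak clients: let the server do the heavy lifting
  if (room.participants <= 10 && room.lowPowerClients) return "mcu";
  // Very large room that must serve legacy clients: SFU plus transcoding for the stragglers
  if (room.participants > 100 && room.legacyClients) return "sfu-transcoding-fallback";
  // Assumption: everything else, including cases the tree leaves open, uses SFU with simulcast
  return "sfu-simulcast";
}

console.log(chooseRouting({ participants: 50, lowPowerClients: false, legacyClients: false }));
```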

[Figure: Zoom-Scale Video Architecture: SFU, Signaling, and Scaling]

[Figure: MCU vs SFU Bandwidth Tradeoffs]

Signaling: The WebSocket Dance That Sets Up Every Call

Before any video flows, clients need to exchange session descriptions and ICE candidates. This is signaling. You need a reliable, low-latency channel, and WebSocket is the standard choice. Each client connects to a signaling server (typically a separate service from the media servers). The signaling server handles room management and user presence, and relays SDP offers/answers and ICE candidates between endpoints.

For a 1:1 call, signaling is simple: client A sends an offer to the server, the server forwards it to client B, and B sends an answer back. For group calls, the signaling server maintains room state and broadcasts new-participant info to all existing members.

Key gotcha: signaling must be authenticated and rate-limited. If a malicious client floods the signaling server with SDP offers, it can exhaust server memory. Always validate SDP size (max 64KB) and limit offers per second per user (e.g., 5/s). Also, keep signaling and media on separate channels: signaling goes over WebSocket, while media goes over UDP (SRTP for audio/video, SCTP over DTLS for data channels), never over the signaling socket.

SignalingFlow.systemdesign

// io.thecodeforge — System Design tutorial

// Signaling flow for a group call
// Shown in its peer-to-peer (mesh) form. With an SFU, steps 5-10 collapse:
// each client negotiates one PeerConnection with the media server instead of one per peer.
// 1. Client A connects to signaling server via WebSocket
// 2. A sends 'join-room' message with room ID and auth token
// 3. Server validates token, adds A to room, broadcasts 'peer-joined' to others
// 4. Server sends A the list of existing peers (IDs only)
// 5. A creates a PeerConnection for each peer and generates an SDP offer
// 6. A sends 'sdp-offer' to server for each peer
// 7. Server forwards each offer to the respective peer
// 8. Each peer generates an SDP answer and sends it back via server
// 9. ICE candidates are exchanged the same way
// 10. Once all PeerConnections are established, media flows directly between clients (or via TURN)

// Pseudocode for signaling server message handling
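// (authenticate, room, broadcastToRoom, and sendToPeer are assumed helpers)
function onMessage(ws, msg) {
  // ws.send takes a string, so serialize every outgoing message
  const reply = (payload) => ws.send(JSON.stringify(payload));
  switch (msg.type) {
    case 'join-room': {
      if (!authenticate(msg.token)) { reply({ error: 'unauthorized' }); return; }
      // Snapshot existing peers first so A is not listed as its own peer
      const existingPeers = room.getPeerIds(msg.roomId);
      // Notify existing members before adding A, so A does not receive its own 'peer-joined'
      broadcastToRoom(msg.roomId, { type: 'peer-joined', peerId: ws.id });
      room.addPeer(ws, msg.roomId);
      reply({ type: 'room-state', peers: existingPeers });
      break;
    }
    case 'sdp-offer':
      // Validate SDP type and size (64KB cap)
      if (typeof msg.sdp !== 'string') { reply({ error: 'invalid sdp' }); return; }
      if (msg.sdp.length > 65536) { reply({ error: 'sdp too large' }); return; }
      // Forward to target peer
      sendToPeer(msg.targetPeerId, { type: 'sdp-offer', sdp: msg.sdp, from: ws.id });
      break;
    // ... similar for answer, ICE
  }
}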

Output

Client A connects -> Server validates -> A joins room -> Server broadcasts peer-joined -> A receives peer list -> A sends SDP offers -> Server forwards -> Peers answer -> ICE exchange -> Media flows
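
The size cap and the per-user offer limit mentioned above can be sketched as follows. This is one possible approach, a fixed one-second window per user, not a description of how any particular signaling server does it. `OfferRateLimiter` and `checkOffer` are hypothetical names, and the clock is passed in as `nowMs` so the behavior is easy to test.

```typescript
// Sketch: validate SDP offers before relaying them. Names are hypothetical.
const MAX_SDP_CHARS = 64 * 1024;  // 64KB cap from the text (SDP is effectively ASCII)
const MAX_OFFERS_PER_SECOND = 5;  // example limit from the text

class OfferRateLimiter {
  private windows = new Map<string, { windowStart: number; count: number }>();

  // Returns true if this user may send another offer at time nowMs.
  allow(userId: string, nowMs: number): boolean {
    const current = this.windows.get(userId);
    if (!current || nowMs - current.windowStart >= 1000) {
      // Start a fresh one-second window for this user
      this.windows.set(userId, { windowStart: nowMs, count: 1 });
      return true;
    }
    if (current.count >= MAX_OFFERS_PER_SECOND) return false;
    current.count += 1;
    return true;
  }
}

type OfferCheck = "ok" | "rate limited" | "sdp too large";

function checkOffer(limiter: OfferRateLimiter, userId: string, sdp: string, nowMs: number): OfferCheck {
  // Count the attempt first so oversized offers also consume the user's budget
  if (!limiter.allow(userId, nowMs)) return "rate limited";
  if (sdp.length > MAX_SDP_CHARS) return "sdp too large";
  return "ok";
}

const offerLimiter = new OfferRateLimiter();
console.log(checkOffer(offerLimiter, "user-1", "v=0", Date.now()));
```

A fixed window allows short bursts at window boundaries; a token bucket would smooth that out at the cost of a little more state.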

Never Do This: Signaling Over HTTP

Some tutorials show signaling over HTTP long-polling. Don't. The latency kills the ICE negotiation window. WebSocket is mandatory. I've seen a team lose 30% of call setups because ICE candidates expired before the HTTP response came back.

Media Server Architecture: The SFU That Doesn't Drop Packets

The media server is the heart of your Zoom clone. It runs an SFU that receives RTP packets from publishers and forwards them to subscribers. Each media server handles a subset of participants (e.g., 100 per server). You need to assign participants to servers based on room size. For small rooms (<10), a single server is fine. For large rooms, you split participants across multiple servers and use a 'media bridge' to connect them. The bridge forwards streams between servers, effectively creating a distributed SFU. Each media server runs a WebRTC stack (e.g., mediasoup, Janus, or custom). Key components: a transport for each peer (WebRTC or plain RTP), a router that maps incoming streams to outgoing streams, and a bandwidth estimator that adjusts quality based on network conditions. The SFU must support simulcast: each publisher sends multiple resolutions (e.g., 720p, 360p, 180p). The SFU selects which layer to forward to each subscriber based on their bandwidth and screen size. This is critical for mobile clients on 3G. Without simulcast, you'd have to transcode, which kills latency.

SFUInternal.systemdesign

// io.thecodeforge — System Design tutorial

// Simplified SFU internal architecture
// Each participant has a Producer (sends media) and a Consumer (receives media)
// The SFU maintains a map: roomId -> { producers: Map<peerId, Producer>, consumers: Map<peerId, Consumer> }

class SFU {
  rooms: Map<string, Room> = new Map(); // initialized here; room creation on first join is not shown

  onProducer(roomId, producer) {
    const room = this.rooms.get(roomId);
    room.producers.set(producer.peerId, producer);
    // Notify all consumers in room about new producer
    for (const consumer of room.consumers.values()) {
      consumer.addProducer(producer);
    }
  }

  onConsumer(roomId, consumer) {
    const room = this.rooms.get(roomId);
    room.consumers.set(consumer.peerId, consumer);
    // Add existing producers to this consumer
    for (const producer of room.producers.values()) {
      consumer.addProducer(producer);
    }
  }

  // Forwarding logic: for each consumer, decide which producers to forward
  // Use audio levels to select top N speakers (e.g., 3)
  // For video, forward only active speakers' high-resolution streams
  // For others, forward low-resolution or no video
  forward(roomId) {
    const room = this.rooms.get(roomId);
    const activeSpeakers = this.getTopSpeakers(room, 3);
    for (const consumer of room.consumers.values()) {
      const streamsToForward = [];
      for (const producer of room.producers.values()) {
        if (producer.peerId === consumer.peerId) continue; // don't send own stream
        if (activeSpeakers.has(producer.peerId)) {
          streamsToForward.push({ producer, layer: 'high' });
        } else {
          streamsToForward.push({ producer, layer: 'low' }); // or skip video entirely
        }
      }
      consumer.setStreams(streamsToForward);
    }
  }
}

Output

SFU maintains room state -> Producers send RTP -> SFU forwards selected streams to consumers -> Active speaker detection selects top 3 -> Others get low-res or no video
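
The `getTopSpeakers` helper is left undefined in the class above. A minimal sketch of the selection step, assuming the SFU already has a smoothed audio level per peer, could look like this. The function names and the level scale (0 to 1) are assumptions for illustration.

```typescript
// Sketch: pick the N loudest peers and derive a per-consumer forwarding plan.
type VideoLayer = "high" | "low";

function topSpeakers(audioLevels: Map<string, number>, n: number): Set<string> {
  // Sort peers by level, loudest first, and keep the first n
  const ranked = Array.from(audioLevels.entries()).sort((a, b) => b[1] - a[1]);
  return new Set(ranked.slice(0, n).map((entry) => entry[0]));
}

function forwardingPlan(
  consumerId: string,
  audioLevels: Map<string, number>,
  n: number
): Map<string, VideoLayer> {
  const active = topSpeakers(audioLevels, n);
  const plan = new Map<string, VideoLayer>();
  audioLevels.forEach((_level, peerId) => {
    if (peerId === consumerId) return; // never send a peer its own stream
    plan.set(peerId, active.has(peerId) ? "high" : "low");
  });
  return plan;
}

const roomLevels = new Map<string, number>([
  ["alice", 0.9],
  ["bob", 0.1],
  ["carol", 0.5],
  ["dave", 0.7],
]);
const planForAlice = forwardingPlan("alice", roomLevels, 2);
console.log(Array.from(planForAlice.entries()));
// [["bob","low"],["carol","low"],["dave","high"]]
```

In practice you would smooth levels over a few hundred milliseconds before ranking, otherwise the active set flaps every time someone coughs.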

Senior Shortcut: Use mediasoup

Building an SFU from scratch is a year-long project. Use mediasoup (a C++ media worker with a Node.js API). It handles WebRTC, simulcast, SVC, and bandwidth estimation out of the box. We've used it in production for 10,000+ concurrent users. Just don't forget to cap incoming bitrate on each sender's transport (mediasoup exposes setMaxIncomingBitrate on WebRtcTransport) so a single user can't flood the server.

[Figure: SFU Media Routing Flow]

Scaling to 1000+ Participants: Distributed SFU and Cascading

A single SFU can handle ~100-200 participants before CPU or bandwidth becomes a bottleneck. Beyond that, you need to distribute the load. There are two approaches:

1) Room-based sharding: assign each room to a specific SFU. Works if rooms are small (<100).
2) Distributed SFU: split a single large room across multiple SFUs, each handling a subset of participants. The SFUs are connected via a media bridge (e.g., RTP over UDP between servers), and each SFU forwards streams from its participants to other SFUs as needed.

The distributed approach is complex because you need to avoid forwarding the same stream multiple times. A common pattern is to designate one SFU as the 'bridge' for each stream, or use a full mesh between SFUs. For 1000 participants, you might have 10 SFUs, each handling 100 participants. Only the active speaker streams (3-6) cross between SFUs, so each SFU receives at most 6 remote streams: an upper bound of 10 SFUs × 6 streams = 60 cross-SFU streams (54 once you exclude each speaker's home SFU, which already has the stream). Manageable.

You also need global active speaker detection: the SFUs must agree on who's speaking. Use a centralized audio level aggregator that collects levels from all SFUs and broadcasts the top speakers.

DistributedSFU.systemdesign

// io.thecodeforge — System Design tutorial

// Distributed SFU architecture for 1000 participants
// Assume 10 SFU nodes, each handling 100 participants
// Each SFU has a bridge module that connects to other SFUs

// Bridge module: forwards selected streams to other SFUs
class Bridge {
  connections: Map<SFUId, RtpConnection> = new Map();

  // localSFU is the SFU this bridge belongs to
  constructor(localSFU) {
    this.localSFU = localSFU;
  }

  forwardStream(stream, targetSFUId) {
    const conn = this.connections.get(targetSFUId);
    if (!conn) return; // no link to that SFU yet: drop rather than crash
    conn.send(stream);
  }

  // On receiving a stream from another SFU
  onRemoteStream(stream) {
    // Add this stream to local consumers that need it
    // e.g., if stream is from an active speaker, forward to all local consumers
    this.localSFU.addRemoteProducer(stream);
  }
}

// Global active speaker detection
// Each SFU periodically sends audio levels of its participants to a central service
// Central service aggregates and returns top 6 speakers globally
// SFUs then forward those speakers' streams to all other SFUs

// Pseudocode for central audio level service
function getGlobalActiveSpeakers(levelsFromAllSFUs) {
  // levelsFromAllSFUs: Map<SFUId, Map<PeerId, level>>
  const allLevels = [];
  for (const [sfuId, peerLevels] of levelsFromAllSFUs) {
    for (const [peerId, level] of peerLevels) {
      allLevels.push({ peerId, level, sfuId });
    }
  }
  allLevels.sort((a,b) => b.level - a.level);
  return allLevels.slice(0, 6);
}

Output

10 SFUs each handle 100 participants -> Each SFU forwards top 6 speakers to other SFUs via bridge -> Central audio level aggregator determines global top speakers -> Total cross-SFU streams: 10 * 6 = 60
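
To see which streams actually cross which inter-SFU links, the fan-out can be computed directly from the global top-speaker list. This sketch assumes each active speaker's home SFU sends one copy to every other SFU (the full-mesh variant); `bridgePlan` and the ID formats are hypothetical. Counting this way gives 6 × 9 = 54 copies, since a speaker's home SFU never needs to receive its own stream; 60 is the upper bound if you count 6 inbound streams for all 10 SFUs.

```typescript
// Sketch: which active-speaker streams cross which SFU-to-SFU link.
interface ActiveSpeaker {
  peerId: string;
  homeSfu: string; // the SFU this speaker is connected to
}

function bridgePlan(speakers: ActiveSpeaker[], sfuIds: string[]): Map<string, string[]> {
  // key: "homeSfu->targetSfu", value: peerIds whose streams travel over that link
  const links = new Map<string, string[]>();
  for (const speaker of speakers) {
    for (const target of sfuIds) {
      if (target === speaker.homeSfu) continue; // the home SFU already has the stream
      const link = `${speaker.homeSfu}->${target}`;
      const peers = links.get(link) ?? [];
      peers.push(speaker.peerId);
      links.set(link, peers);
    }
  }
  return links;
}

function totalStreamCopies(links: Map<string, string[]>): number {
  let total = 0;
  links.forEach((peers) => { total += peers.length; });
  return total;
}

const sfuIds = Array.from({ length: 10 }, (_unused, i) => `sfu-${i}`);
const globalTop: ActiveSpeaker[] = Array.from({ length: 6 }, (_unused, i) => ({
  peerId: `peer-${i}`,
  homeSfu: `sfu-${i}`,
}));

console.log(totalStreamCopies(bridgePlan(globalTop, sfuIds))); // 54
```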

Production Trap: Bridge Bandwidth

If you forward all streams between SFUs, you'll saturate your inter-SFU link. In a 1000-person call with 10 SFUs, forwarding all 1000 streams would require 1000 * 2Mbps = 2Gbps per SFU. Only forward active speakers. Use a hard limit of 6 streams per SFU. We learned this the hard way when our 10Gbps link became the bottleneck.

Handling Network Degradation: Bandwidth Estimation and Adaptation

Real-time video is unforgiving of packet loss. WebRTC has built-in bandwidth estimation (GCC, Google Congestion Control) that adjusts bitrate based on delay and loss, but you need to configure it properly. The SFU should also participate: it can send REMB (Receiver Estimated Maximum Bitrate) messages to publishers to reduce their bitrate. For clients with poor connectivity, the SFU can switch to a lower simulcast layer or drop video entirely (audio-only).

Key: never let the client decide alone, because the server sees conditions across all consumers. Implement a server-side bandwidth manager that aggregates feedback from all consumers and sends a unified REMB to each producer. Also, support FEC (Forward Error Correction) for audio: it's small and worth the overhead. For video, FEC is usually too expensive; use NACKs and retransmissions instead. And always enable packet loss concealment (PLC) in audio codecs (Opus does this automatically).

BandwidthAdaptation.systemdesign

// io.thecodeforge — System Design tutorial

// Server-side bandwidth manager
// Collects receiver reports from all consumers of a producer
// Computes a combined REMB and sends it to the producer

class BandwidthManager {
  // Latest report per consumer, keyed by producer, so stale reports are overwritten instead of piling up
  reports: Map<ProducerId, Map<ConsumerId, ConsumerReport>> = new Map();

  onReceiverReport(producerId, consumerId, report) {
    if (!this.reports.has(producerId)) this.reports.set(producerId, new Map());
    const byConsumer = this.reports.get(producerId);
    byConsumer.set(consumerId, report);
    // Aggregate: take the minimum available bandwidth across all consumers
    const minBw = Math.min(...[...byConsumer.values()].map(r => r.availableBitrate));
    // Send REMB to producer
    sendREMB(producerId, minBw);
  }

  // On packet loss > 5%, force switch to lower simulcast layer
  onPacketLoss(producerId, consumerId, lossFraction) {
    if (lossFraction > 0.05) {
      // Ask SFU to forward lower layer for this consumer
      sfu.setConsumerLayer(consumerId, 'low');
    }
  }
}

// Simulcast layer switching
// Producer sends 3 layers: high (720p, 2Mbps), medium (360p, 500Kbps), low (180p, 150Kbps)
// SFU selects layer per consumer based on bandwidth
// Each threshold must be at least the layer's own bitrate, or the chosen layer overloads the link
function selectLayer(availableBw) {
  if (availableBw >= 2000000) return 'high';
  if (availableBw >= 500000) return 'medium';
  return 'low';
}

Output

Bandwidth manager aggregates consumer reports -> Sends REMB with min available bandwidth -> Producer adjusts bitrate -> If packet loss >5%, SFU switches consumer to lower layer
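
The pseudocode above jumps straight to the low layer on loss. A gentler variant, sketched here as an alternative rather than as the article's method, steps down one layer at a time and falls back to audio-only only under extreme loss. The 5% threshold comes from the text; the 20% "extreme" threshold is an assumption, since the article gives no number for it.

```typescript
// Sketch: loss-driven layer downgrade for one consumer. Names are hypothetical.
type ConsumerLayer = "high" | "medium" | "low" | "audio-only";

const VIDEO_LAYERS: ConsumerLayer[] = ["high", "medium", "low"];
const LOSS_THRESHOLD = 0.05; // from the text: downgrade above 5% loss
const EXTREME_LOSS = 0.2;    // assumption: drop video entirely above 20% loss

function adaptLayer(current: ConsumerLayer, lossFraction: number): ConsumerLayer {
  if (lossFraction > EXTREME_LOSS) return "audio-only";
  // Upgrading again once the network recovers is out of scope for this sketch
  if (current === "audio-only") return current;
  if (lossFraction > LOSS_THRESHOLD) {
    // Step down one layer, but never below the lowest video layer
    const next = Math.min(VIDEO_LAYERS.indexOf(current) + 1, VIDEO_LAYERS.length - 1);
    return VIDEO_LAYERS[next] ?? current;
  }
  return current;
}

console.log(adaptLayer("high", 0.06)); // "medium"
console.log(adaptLayer("high", 0.25)); // "audio-only"
```

A production version would also add hysteresis (for example, require several clean reports before upgrading) so consumers don't oscillate between layers.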

Interview Gold: GCC vs. Sender-Side Estimation

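GCC comes in two flavors. The original design ran the delay-based estimator on the receiver and reported the result to the sender via REMB. Current libwebrtc runs the estimator on the sender, using transport-wide congestion control (transport-cc) feedback from the receiver. For an SFU this matters in both directions: on the client-to-server leg the SFU is the receiver and supplies the feedback, while on the server-to-client leg the SFU is the sender and has to run its own estimate per consumer. Know the difference.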

TURN Servers: The NAT Traversal Safety Net

Not all clients can establish peer-to-peer connections due to symmetric NATs or firewalls. That's where TURN (Traversal Using Relays around NAT) comes in. TURN servers relay media traffic. They're bandwidth hogs: each stream consumes relay bandwidth. You need to deploy TURN servers in multiple regions close to users. Use ICE (Interactive Connectivity Establishment) to try direct P2P first (via STUN), then fall back to TURN. Configure TURN with authentication (time-limited credentials) to prevent abuse. Key metric: TURN usage ratio. If >20% of calls use TURN, your STUN infrastructure might be misconfigured or your users are behind restrictive NATs (e.g., corporate VPNs). Also, TURN servers must support UDP, TCP, and TLS. UDP is preferred for low latency. TCP adds overhead but works through firewalls that block UDP.

TURNDeployment.systemdesign

// io.thecodeforge — System Design tutorial

# TURN server configuration (using coturn as example)
# /etc/turnserver.conf

listening-port=3478
tls-listening-port=5349
fingerprint
lt-cred-mech
# Static user for local testing only. In production, replace lt-cred-mech/user with
# use-auth-secret and static-auth-secret so the signaling server can issue time-limited credentials.
user=zoomuser:secretpassword
realm=thecodeforge.io
# Cap bandwidth per session at ~2Mbps. max-bps is in bytes per second: 2,000,000 bits/s / 8 = 250,000
max-bps=250000
# Use multiple relay IPs for load balancing
relay-ip=203.0.113.1
relay-ip=203.0.113.2
# STUN binding requests are answered by default; stun-only stays off so TURN relaying remains available
stun-only=false

// Client-side ICE configuration (JavaScript)
// Set iceTransportPolicy to 'relay' only if you want to force TURN
// For normal operation, use 'all' (default)
const pcConfig = {
  iceServers: [
    { urls: 'stun:stun.thecodeforge.io:3478' },
    // Placeholder credentials: fetch short-lived ones from your signaling server instead
    { urls: 'turn:turn.thecodeforge.io:3478', username: 'user', credential: 'pass' }
  ]
};

// Monitor TURN usage
// Metric: relayed bytes vs total bytes
// If relayed > 20% of total, investigate NAT issues

Output

TURN server listens on 3478 (UDP/TCP) and 5349 (TLS) -> Clients use ICE to try STUN first -> If STUN fails, fallback to TURN -> TURN relays media -> Monitor relay ratio
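
Time-limited credentials are usually issued with the shared-secret scheme that coturn enables via `use-auth-secret` and `static-auth-secret`: the username is an expiry timestamp plus a user ID, and the password is a base64 HMAC-SHA1 of that username. A minimal sketch for the signaling server, assuming Node.js; the function name, the one-hour default TTL, and the example secret are illustrative.

```typescript
import { createHmac } from "crypto";

// Sketch: short-lived TURN credentials using the shared-secret (REST API) scheme.
interface TurnCredentials {
  username: string;   // "<expiryUnixSeconds>:<userId>"
  credential: string; // base64(HMAC-SHA1(sharedSecret, username))
  ttlSeconds: number;
}

function issueTurnCredentials(
  userId: string,
  sharedSecret: string, // must match static-auth-secret on the TURN server
  nowSeconds: number,   // passed in so the result is deterministic and testable
  ttlSeconds: number = 3600
): TurnCredentials {
  const expiry = nowSeconds + ttlSeconds;
  const username = `${expiry}:${userId}`;
  const credential = createHmac("sha1", sharedSecret).update(username).digest("base64");
  return { username, credential, ttlSeconds };
}

// Usage: hand these to the client when it joins a room, then put them in iceServers.
const turnCreds = issueTurnCredentials("alice", "example-shared-secret", 1700000000);
console.log(turnCreds.username); // 1700003600:alice
```

Because the TURN server recomputes the HMAC itself, no credential database is needed, and a leaked credential stops working once the timestamp passes.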

Never Do This: Hardcoding TURN Credentials

I've seen production configs with hardcoded TURN credentials in the client code. Attackers can use your TURN server for free relay, costing you bandwidth. Always use time-limited credentials generated by your signaling server. coturn supports REST API for this.

Recording and Playback: Archiving the Chaos

Recording a Zoom call is harder than it looks. You can't just record the mixed audio/video because you lose individual speaker tracks. For compliance (e.g., legal depositions), you need per-participant recordings. Solution: have the SFU send each participant's stream to a recording service. The recording service can either store individual tracks (for later compositing) or mix them in real-time. For real-time mixing, use a dedicated MCU-like component that decodes and mixes streams, then encodes the final video. This is expensive. Better: store individual tracks as fragmented MP4 (fMP4) with timestamps, and composite offline. For playback, you need a video player that can handle multiple synchronized streams. Use HLS or DASH with multiple audio tracks. Or build a custom player using WebRTC to re-render the call. Gotcha: recording must handle network glitches — buffer at least 5 seconds of data to recover from packet loss.

RecordingService.systemdesign

// io.thecodeforge — System Design tutorial

// Recording service architecture
// SFU forwards each participant's stream to a recording service via RTP
// Recording service writes each stream to a separate file (e.g., WebM or fMP4)

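class RecordingService {
  streams: Map<string, FileWriter> = new Map();

  // roomId is passed in so each room's tracks land in their own directory
  onRtpPacket(roomId, peerId, packet) {
    const key = `${roomId}/${peerId}`;
    let writer = this.streams.get(key);
    if (!writer) {
      // FileWriter is assumed to depacketize RTP and mux frames into WebM;
      // raw RTP packets written straight to disk are not a playable file
      writer = new FileWriter(`/recordings/${key}.webm`);
      this.streams.set(key, writer);
    }
    writer.write(packet);
  }

  // For composite recording, use a mixer that decodes all streams
  // and encodes a single output
  // Mixer uses FFmpeg or custom GStreamer pipeline
}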

// Offline compositing command (using FFmpeg)
// ffmpeg -i peer1.webm -i peer2.webm -filter_complex "[0:v]scale=640:360[0v];[1:v]scale=640:360[1v];[0v][1v]hstack=inputs=2" output.mp4
// This creates a side-by-side video. For grid layout, use more complex filter.

Output

SFU forwards streams to recording service -> Each stream saved as individual file -> Offline compositing produces final video
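
Offline compositing only works if the tracks are lined up in time, since participants join at different moments. A minimal sketch of that alignment step, assuming the recording service stored a wall-clock start time for each track; `RecordedTrack` and `alignmentOffsets` are hypothetical names.

```typescript
// Sketch: per-track delay (in seconds) relative to the earliest track in the room.
interface RecordedTrack {
  peerId: string;
  file: string;
  startMs: number; // wall-clock time of the track's first frame
}

function alignmentOffsets(tracks: RecordedTrack[]): Map<string, number> {
  const offsets = new Map<string, number>();
  if (tracks.length === 0) return offsets;
  const earliest = Math.min(...tracks.map((t) => t.startMs));
  for (const track of tracks) {
    // A track that started later must be delayed by this much in the composite
    offsets.set(track.peerId, (track.startMs - earliest) / 1000);
  }
  return offsets;
}

const roomTracks: RecordedTrack[] = [
  { peerId: "peer1", file: "peer1.webm", startMs: 1000 },
  { peerId: "peer2", file: "peer2.webm", startMs: 3500 },
];
console.log(Array.from(alignmentOffsets(roomTracks).entries()));
// [["peer1",0],["peer2",2.5]]
```

The offsets can then be fed to whatever compositor you use, for example as per-input time offsets in an FFmpeg invocation.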

Senior Shortcut: Use GStreamer for Recording

Don't build a recording pipeline from scratch. Use GStreamer with webrtcbin to receive streams and encode to file. It handles jitter buffer, retransmission, and encoding. We use it in production with a Python wrapper.

When Not to Build Your Own Zoom: The Build vs. Buy Decision

Building a Zoom clone is a massive undertaking. You need expertise in WebRTC, networking, distributed systems, and media codecs. If your core business isn't video conferencing, don't build it. Use a third-party API like Twilio Video, Agora, or Daily.co. They handle SFU, TURN, and scaling. You pay per participant-minute, but you save months of engineering. Only build if you have specific requirements: custom UI, offline recording, proprietary codecs, or air-gapped deployments. Even then, consider using open-source SFUs like mediasoup or Janus and customize. The build vs. buy decision is simple: if you need to support >1000 participants with <200ms latency, and you have a team of 5+ engineers dedicated to this, build. Otherwise, buy.

The Classic Bug: Over-Engineering

I've seen startups spend 6 months building a video platform when they could have used Twilio and launched in 2 weeks. They ended up with a buggy product and missed market window. Don't be a hero. Use existing infrastructure.

Production incident · Post-mortem · Severity: high

The 4GB Container That Kept Dying

Symptom

Media server pods in Kubernetes kept OOMKilling every 15 minutes during a 200-person webinar. Memory usage spiked to 4GB then crashed.

Assumption

Thought it was a memory leak in the WebRTC library. Spent days profiling heap dumps.

Root cause

SFU was configured to forward all video streams (not just active speakers). Each 720p stream at 30fps consumed ~2Mbps bandwidth and ~50MB memory for buffering. 200 participants × 50MB = 10GB, but container limit was 4GB. OOM killer did its job.

Fix

Enabled simulcast and forward-only-active-speaker mode. Set max video streams per client to 4 (the 3 loudest speakers + self). Memory dropped to 800MB. No more OOMs.

Key lesson

Never forward all streams.
Always limit active video streams to a small number (3-6).
Use audio levels to select which streams to forward.
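
The numbers in this post-mortem are easy to sanity-check before deploying. A rough sketch, assuming buffer memory grows linearly with the number of forwarded streams at the ~50MB per stream quoted above; the function names are illustrative, and real memory use also depends on jitter buffers, retransmission caches, and the library in use.

```typescript
// Sketch: back-of-the-envelope memory budget for forwarded video streams.
function bufferMemoryMb(forwardedStreams: number, perStreamMb: number): number {
  return forwardedStreams * perStreamMb;
}

function maxStreamsWithinLimit(limitMb: number, perStreamMb: number): number {
  return Math.floor(limitMb / perStreamMb);
}

console.log(bufferMemoryMb(200, 50));         // 10000 (MB), far above a 4GB limit
console.log(maxStreamsWithinLimit(4096, 50)); // 81
```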

Production debug guide

Systematic recovery paths for the failure modes engineers actually hit.

Symptom · 01

Users report frozen video or audio stuttering

→

Fix

1. Check packet loss on the media server (use WebRTC stats such as packetsLost, or netstat -su for UDP receive-buffer errors; ss -s only summarizes sockets). 2. Check bandwidth estimation logs. 3. If loss >5%, enable FEC for audio or switch to a lower simulcast layer. 4. Verify TURN server bandwidth isn't saturated.

Symptom · 02

Call setup fails for some users (ICE connection timeout)

→

Fix

1. Check signaling server logs for ICE candidate exchange. 2. Verify STUN/TURN servers are reachable from the client's network. 3. Check firewall rules (UDP/TCP 3478 and TLS 5349; TCP 443 if you run TURN over TLS there). 4. If your stack lets you configure the ICE timeout, raise it (e.g., from 5s to 10s).

Symptom · 03

High CPU on media server

→

Fix

1. Check number of active streams per server. 2. Reduce max participants per server (e.g., from 200 to 100). 3. Enable simulcast and limit high-resolution streams. 4. Profile with perf to find hot spots (e.g., encryption).

★ Design Zoom Triage Cheat Sheet

First-response commands for when things go wrong, copy-paste ready.

`OOMKilled` on media server pod

Immediate action

Check memory usage per stream

Commands

kubectl top pod <pod-name> --containers

kubectl describe pod <pod-name> | grep -i -A5 'last state'

Fix now

Cap incoming bitrate at 2000000 bps (2Mbps) per sending transport (in mediasoup, transport.setMaxIncomingBitrate(2000000)) and limit active video streams to 4.

`ICE failed, no candidate pair` in client logs

High packet loss (>10%) on server

Audio out of sync with video

Feature / Aspect	SFU	MCU
Server CPU load	O(N) — forwarding only	O(N²) — decode/mix/encode
Client CPU load	High — decode multiple streams	Low — decode one mixed stream
Bandwidth per client	High — receive multiple streams	Low — receive one stream
Latency	Low — no transcoding	Higher — transcoding adds delay
Scalability	Good up to 1000+ with distributed SFU	Poor beyond 20 participants
Complexity	Moderate — need simulcast and active speaker detection	High — need mixer and transcoder

Key takeaways

SFU scales better than MCU for large calls because server load is O(N) vs O(N²). Always use SFU for >10 participants.

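Never forward all video streams. Limit active video to 3-6 streams per client based on audio levels. In a 100-person call, that means sending each client 6 streams instead of 99, a cut of more than 90% in downlink bandwidth and decode work.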

Signaling must use WebSocket, not HTTP. ICE candidates expire in seconds; WebSocket ensures low-latency exchange.

Deploy TURN servers in multiple regions. Monitor relay ratio; if >20%, investigate NAT issues. Use time-limited credentials for security.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR

How does Zoom's SFU handle a participant with poor network connectivity?...

Q02 · SENIOR

In a distributed SFU architecture with 1000 participants across 10 serve...

Q03 · SENIOR

What happens when a TURN server runs out of bandwidth during a large cal...

Q04 · JUNIOR

Explain the role of STUN and TURN in WebRTC. When would a call use TURN ...

Q05 · SENIOR

You notice that 40% of your calls are using TURN relay, costing high ban...

Q06 · SENIOR

Design a system to record a 500-person Zoom call with per-participant au...

Q01 of 06 · SENIOR

How does Zoom's SFU handle a participant with poor network connectivity? Describe the adaptation mechanism.

ANSWER

Zoom uses simulcast: each client sends multiple resolution layers. The SFU monitors packet loss and round-trip time per consumer. If a consumer's loss exceeds 5%, the SFU switches that consumer to a lower simulcast layer. Additionally, the SFU sends REMB messages to the producer to reduce overall bitrate. For extreme cases, the SFU may drop video entirely and forward only audio.

