Zoom uses a Selective Forwarding Unit (SFU) architecture where the server forwards selected video streams to each participant, reducing client processing. Key components: signaling server (WebSocket), media server (SFU), TURN server for NAT traversal, and a distributed backend for rooms and users.
What is Design Zoom?
Design Zoom is the system design of a real-time video conferencing platform like Zoom. It covers client-server architecture, media routing (SFU/MCU), signaling, WebRTC, scaling strategies, and handling network degradation.
Plain-English First
Imagine a conference room where everyone talks at once. In a small room, you can hear everyone. But with 100 people, it's chaos. Zoom's SFU is like a smart switchboard operator: they listen to everyone, but only forward the voice of the person currently speaking to each listener. If you're not speaking, the operator stops sending your voice to others, saving everyone's ears (and bandwidth).
You've seen it happen: a 50-person all-hands call turns into a slideshow of frozen faces, audio stuttering like a scratched CD. Everyone blames the Wi-Fi. But the real culprit is almost always the server architecture. Most video calling systems choke because they try to send every participant's video to every other participant — an O(n²) problem that kills bandwidth and CPU. Zoom doesn't do that. And that's why it works when everything else falls apart.
This article breaks down the system design of a Zoom-like platform. You'll learn the exact architecture — signaling, media routing, scaling, and the production traps that take down naive implementations. By the end, you'll be able to design a real-time video system that handles 1000+ participants without melting your servers or your users' laptops.
Why SFU Beats MCU: The Bandwidth Math That Decides Your Architecture
Before you write a single line of code, you need to pick your media routing strategy. The two main options are MCU (Multipoint Control Unit) and SFU (Selective Forwarding Unit). An MCU mixes all incoming streams into a single composite stream on the server, so each client sends one stream and receives one stream. Sounds simple. But the server has to decode, mix, and re-encode every stream, which is CPU-intensive and adds latency. An SFU, on the other hand, forwards streams without decoding them. The server is just a smart switch: it selects which streams to send to each client based on who's speaking, and the client decodes and renders them. This shifts the processing burden to clients, which is fine for desktops but tough for mobile. With the number of forwarded streams per client capped, SFU server load grows linearly with participants (O(n) packet forwarding). MCU load grows much faster: the server decodes every input, and when each participant gets a personalized mix (everyone except themselves) it composites and encodes n outputs from up to n-1 inputs each, which approaches O(n²) work. For a 100-person call, that can put an MCU around two orders of magnitude above an SFU in server CPU. That's why Zoom uses SFU. The trade-off: SFU requires more client bandwidth because each client receives multiple streams, but you can mitigate that with simulcast (each publisher sends several resolutions) and bandwidth estimation.
SFUvsMCU.systemdesign
// io.thecodeforge — SystemDesign tutorial
// SFU bandwidth calculation for N participants
// Each client sends 1 stream (uplink) and receives M streams (downlink)
// M = number of active speakers (e.g., 4)
// Total server bandwidth = N * (uplink + M * downlink)
// Example: N=100, uplink=2Mbps, downlink=2Mbps, M=4
// Server bandwidth = 100 * (2 + 4*2) = 100 * 10 = 1000Mbps = 1Gbps
// MCU bandwidth calculation
// Each client sends 1 stream, receives 1 mixed stream
// Server must decode N streams, mix, encode 1 stream per client
// Server bandwidth = N * (uplink + downlink) = 100 * (2+2) = 400Mbps
// But server CPU grows much faster: the MCU decodes all N streams and, with a personalized mix per client,
// composites up to N-1 inputs for each of N outputs, which is roughly O(N^2) work
// For N=100, that is on the order of 100*100 = 10,000 input-to-output combinations
// SFU server CPU is O(N) with M fixed: it only forwards packets
// Decision: Use SFU for >10 participants. MCU only for small groups (<=10) where client CPU is limited (e.g., embedded devices).
Output
SFU server bandwidth: 1000 Mbps
MCU server bandwidth: 400 Mbps
SFU server CPU: O(N)
MCU server CPU: O(N^2)
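The arithmetic in the comments above can be written as a small runnable sketch. The `CallProfile` shape and the function names are illustrative assumptions, not part of any library; the formulas are the ones stated in the comments.

```typescript
// Sketch of the server bandwidth arithmetic from the comments above.
// CallProfile, sfuServerMbps and mcuServerMbps are illustrative names.
interface CallProfile {
  participants: number;     // N
  uplinkMbps: number;       // bitrate each client sends
  downlinkMbps: number;     // bitrate of each stream a client receives
  forwardedStreams: number; // M, streams forwarded to each client (SFU only)
}

// SFU: the server ingests one stream per client and sends M streams back.
function sfuServerMbps(p: CallProfile): number {
  return p.participants * (p.uplinkMbps + p.forwardedStreams * p.downlinkMbps);
}

// MCU: the server ingests one stream per client and sends one mixed stream back.
function mcuServerMbps(p: CallProfile): number {
  return p.participants * (p.uplinkMbps + p.downlinkMbps);
}

const exampleProfile: CallProfile = {
  participants: 100,
  uplinkMbps: 2,
  downlinkMbps: 2,
  forwardedStreams: 4,
};

console.log("SFU server bandwidth: " + sfuServerMbps(exampleProfile) + " Mbps");
console.log("MCU server bandwidth: " + mcuServerMbps(exampleProfile) + " Mbps");
```

The SFU needs more network capacity than the MCU here; the SFU's advantage is in CPU, which this sketch does not model.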
Production Trap: MCU at Scale
I've seen a startup try MCU for a 500-person town hall. Their server farm melted in 3 minutes. The CPU hit 100% on all cores, and the audio delay hit 10 seconds. They switched to SFU the next day. Don't be that startup.
Media Routing Decision Tree
If participants <= 10 and clients are low-power (mobile/embedded) → use MCU: the server does the heavy lifting
If participants > 10 and clients are desktops or modern phones → use SFU with simulcast and active speaker detection
If participants > 100 and you need to support legacy clients → use SFU with a transcoding fallback for incompatible codecs
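The decision tree can be encoded as a function, which makes the uncovered cases explicit. This is a sketch under assumptions: the `RoomProfile` fields and the choice to default to plain SFU for any case the tree does not mention (for example a small room of desktop clients) are mine, not the article's.

```typescript
// Hypothetical encoding of the routing decision tree.
type RoutingChoice = "MCU" | "SFU" | "SFU_WITH_TRANSCODING_FALLBACK";

interface RoomProfile {
  participants: number;
  lowPowerClients: boolean; // mobile or embedded devices that struggle to decode many streams
  legacyClients: boolean;   // clients that cannot negotiate the room's codecs
}

function chooseRouting(room: RoomProfile): RoutingChoice {
  // Small room with weak clients: let the server mix.
  if (room.participants <= 10 && room.lowPowerClients) {
    return "MCU";
  }
  // Very large room that must include legacy clients: SFU plus transcoding for them.
  if (room.participants > 100 && room.legacyClients) {
    return "SFU_WITH_TRANSCODING_FALLBACK";
  }
  // Assumption: everything else defaults to SFU with simulcast and active speaker detection.
  return "SFU";
}

console.log(chooseRouting({ participants: 50, lowPowerClients: false, legacyClients: false }));
```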
[Figure: Zoom-Scale Video Architecture: SFU, Signaling, and Scaling]
[Figure: MCU vs SFU Bandwidth Tradeoffs]
Signaling: The WebSocket Dance That Sets Up Every Call
Before any video flows, clients need to exchange session descriptions (SDP) and ICE candidates. This is signaling. You need a reliable, low-latency channel, and WebSocket is the usual choice (WebRTC itself does not mandate a signaling transport). Each client connects to a signaling server, typically a separate service from the media servers. The signaling server handles room management and user presence, and relays SDP offers/answers and ICE candidates between clients. For a 1:1 call, signaling is simple: client A sends an offer to the server, the server forwards it to client B, and B sends an answer back. For group calls, the signaling server maintains room state and broadcasts new-participant info to all existing members. The flow below shows peers negotiating with each other; when an SFU is in the path, each client negotiates with the media server instead, but the message types are the same. Key gotcha: signaling must be authenticated and rate-limited. If a malicious client floods the signaling server with SDP offers, it can exhaust server memory. Always validate SDP size (max 64KB) and limit offers per second per user (e.g., 5/s). Also, keep signaling and media on separate channels. Media should go over UDP as SRTP (data channels use SCTP over DTLS), not over the signaling WebSocket.
SignalingFlow.systemdesign
// io.thecodeforge — SystemDesign tutorial
// Signaling flow for a group call
// 1. Client A connects to signaling server via WebSocket
// 2. A sends 'join-room' message with room ID and auth token
// 3. Server validates token, adds A to room, broadcasts 'peer-joined' to others
// 4. Server sends A the list of existing peers (IDs only)
// 5. A creates a PeerConnection for each peer, generates SDP offer
// 6. A sends 'sdp-offer' to server for each peer
// 7. Server forwards each offer to the respective peer
// 8. Each peer generates SDP answer and sends back via server
// 9. ICE candidates are exchanged similarly
// 10. Once all PeerConnections are established, media flows directly between clients (or via TURN)
// Pseudocode for signaling server message handling
// authenticate, room, broadcastToRoom and sendToPeer are application helpers not shown here
function onMessage(ws, msg) {
  switch (msg.type) {
    case 'join-room':
      if (!authenticate(msg.token)) { ws.send(JSON.stringify({ error: 'unauthorized' })); return; }
      room.addPeer(ws, msg.roomId);
      broadcastToRoom(msg.roomId, { type: 'peer-joined', peerId: ws.id });
      ws.send(JSON.stringify({ type: 'room-state', peers: room.getPeerIds(msg.roomId) }));
      break;
    case 'sdp-offer':
      // Validate SDP type and size before relaying
      if (typeof msg.sdp !== 'string' || msg.sdp.length > 65536) { ws.send(JSON.stringify({ error: 'invalid sdp' })); return; }
      // Forward to target peer
      sendToPeer(msg.targetPeerId, { type: 'sdp-offer', sdp: msg.sdp, from: ws.id });
      break;
    // ... similar for answer, ICE
  }
}
Output
Client A connects -> Server validates -> A joins room -> Server broadcasts peer-joined -> A receives peer list -> A sends SDP offers -> Server forwards -> Peers answer -> ICE exchange -> Media flows
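The text asks for a 64KB SDP cap and a limit of about 5 offers per second per user, but the handler only checks size. A minimal sketch of both checks follows, using a token bucket. The class and function names are hypothetical, time is passed in explicitly so the logic is easy to test, and `sdp.length` counts UTF-16 code units, which only approximates bytes (SDP is normally ASCII).

```typescript
// Hypothetical per-user offer limiter: a token bucket refilled at 5 tokens per second.
const MAX_SDP_BYTES = 64 * 1024;
const MAX_OFFERS_PER_SECOND = 5;

interface Bucket {
  tokens: number;
  lastRefillMs: number;
}

class OfferRateLimiter {
  private buckets: { [userId: string]: Bucket } = {};

  // Returns true if the user may send another offer at time nowMs.
  allow(userId: string, nowMs: number): boolean {
    let bucket = this.buckets[userId];
    if (!bucket) {
      bucket = { tokens: MAX_OFFERS_PER_SECOND, lastRefillMs: nowMs };
      this.buckets[userId] = bucket;
    }
    // Refill in proportion to elapsed time, capped at the bucket size.
    const elapsedSec = (nowMs - bucket.lastRefillMs) / 1000;
    bucket.tokens = Math.min(
      MAX_OFFERS_PER_SECOND,
      bucket.tokens + elapsedSec * MAX_OFFERS_PER_SECOND
    );
    bucket.lastRefillMs = nowMs;
    if (bucket.tokens < 1) {
      return false;
    }
    bucket.tokens -= 1;
    return true;
  }
}

// Returns an error string, or null if the offer may be relayed.
function validateOffer(
  limiter: OfferRateLimiter,
  userId: string,
  sdp: unknown,
  nowMs: number
): string | null {
  if (typeof sdp !== "string") return "sdp must be a string";
  if (sdp.length > MAX_SDP_BYTES) return "sdp too large";
  if (!limiter.allow(userId, nowMs)) return "rate limit exceeded";
  return null;
}
```

Size is checked before the bucket is touched, so an oversized offer is rejected without consuming a token.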
Never Do This: Signaling Over HTTP
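Some tutorials show signaling over HTTP long-polling. Don't. Every poll round trip delays candidate exchange, and ICE connectivity checks can time out before the other side has the candidates it needs. Use WebSocket or another persistent, bidirectional channel. I've seen a team lose 30% of call setups because ICE checks timed out before the HTTP response came back.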
Media Server Architecture: The SFU That Doesn't Drop Packets
The media server is the heart of your Zoom clone. It runs an SFU that receives RTP packets from publishers and forwards them to subscribers. Each media server handles a subset of participants (e.g., 100 per server). You need to assign participants to servers based on room size. For small rooms (<10), a single server is fine. For large rooms, you split participants across multiple servers and use a 'media bridge' to connect them. The bridge forwards streams between servers, effectively creating a distributed SFU. Each media server runs a WebRTC stack (e.g., mediasoup, Janus, or custom). Key components: a transport for each peer (WebRTC or plain RTP), a router that maps incoming streams to outgoing streams, and a bandwidth estimator that adjusts quality based on network conditions. The SFU must support simulcast: each publisher sends multiple resolutions (e.g., 720p, 360p, 180p). The SFU selects which layer to forward to each subscriber based on their bandwidth and screen size. This is critical for mobile clients on 3G. Without simulcast, you'd have to transcode, which kills latency.
SFUInternal.systemdesign
// io.thecodeforge — SystemDesign tutorial
// Simplified SFU internal architecture
// Each participant has a Producer (sends media) and a Consumer (receives media)
// The SFU maintains a map: roomId -> { producers: Map<peerId, Producer>, consumers: Map<peerId, Consumer> }
// Room, Producer and Consumer are application types not shown here
class SFU {
  rooms: Map<string, Room> = new Map();

  onProducer(roomId: string, producer: Producer) {
    const room = this.rooms.get(roomId);
    if (!room) return; // unknown room: nothing to do
    room.producers.set(producer.peerId, producer);
    // Notify all consumers in room about new producer
    for (const consumer of room.consumers.values()) {
      consumer.addProducer(producer);
    }
  }

  onConsumer(roomId: string, consumer: Consumer) {
    const room = this.rooms.get(roomId);
    if (!room) return;
    room.consumers.set(consumer.peerId, consumer);
    // Add existing producers to this consumer
    for (const producer of room.producers.values()) {
      consumer.addProducer(producer);
    }
  }

  // Forwarding logic: for each consumer, decide which producers to forward
  // Use audio levels to select top N speakers (e.g., 3)
  // For video, forward only active speakers' high-resolution streams
  // For others, forward low-resolution or no video
  forward(roomId: string) {
    const room = this.rooms.get(roomId);
    if (!room) return;
    // getTopSpeakers is not shown; it is assumed to return a Set of peer IDs
    const activeSpeakers = this.getTopSpeakers(room, 3);
    for (const consumer of room.consumers.values()) {
      const streamsToForward = [];
      for (const producer of room.producers.values()) {
        if (producer.peerId === consumer.peerId) continue; // don't send own stream
        if (activeSpeakers.has(producer.peerId)) {
          streamsToForward.push({ producer, layer: 'high' });
        } else {
          streamsToForward.push({ producer, layer: 'low' }); // or skip video entirely
        }
      }
      consumer.setStreams(streamsToForward);
    }
  }
}
Output
SFU maintains room state -> Producers send RTP -> SFU forwards selected streams to consumers -> Active speaker detection selects top 3 -> Others get low-res or no video
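The class above calls a top-speakers helper without defining it. One possible shape is sketched below, assuming each peer reports a normalized audio level between 0 and 1 and that levels are smoothed with an exponential moving average so a single cough does not steal the active-speaker slot. `ActiveSpeakerTracker`, `AudioSample` and the smoothing factor are assumptions for illustration.

```typescript
// Hypothetical active speaker tracker based on smoothed audio levels.
interface AudioSample {
  peerId: string;
  level: number; // assumed normalized: 0 = silence, 1 = loudest
}

class ActiveSpeakerTracker {
  private smoothed: { [peerId: string]: number } = {};
  private alpha: number;

  // alpha is the weight of the newest sample (0 < alpha <= 1).
  constructor(alpha: number) {
    this.alpha = alpha;
  }

  update(sample: AudioSample): void {
    const prev = this.smoothed[sample.peerId];
    this.smoothed[sample.peerId] =
      prev === undefined
        ? sample.level
        : this.alpha * sample.level + (1 - this.alpha) * prev;
  }

  remove(peerId: string): void {
    delete this.smoothed[peerId];
  }

  // Returns up to n peer IDs, loudest first.
  topSpeakers(n: number): string[] {
    const levels = this.smoothed;
    const ids = Object.keys(levels);
    ids.sort((a, b) => levels[b] - levels[a]);
    return ids.slice(0, n);
  }
}

const tracker = new ActiveSpeakerTracker(0.3);
tracker.update({ peerId: "alice", level: 0.8 });
tracker.update({ peerId: "bob", level: 0.2 });
console.log(tracker.topSpeakers(1));
```

A forwarding loop that needs constant-time membership checks could copy the returned array into a set once per tick.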
Senior Shortcut: Use mediasoup
Building an SFU from scratch is a year-long project. Use mediasoup (C++ with Node.js API). It handles WebRTC, simulcast, SVC, and bandwidth estimation out of the box. We've used it in production for 10,000+ concurrent users. Just don't forget to cap the incoming bitrate on each sending transport (mediasoup exposes this as setMaxIncomingBitrate() on a WebRtcTransport) so a single user can't flood the server.
[Figure: SFU Media Routing Flow]
Scaling to 1000+ Participants: Distributed SFU and Cascading
A single SFU can handle ~100-200 participants before CPU or bandwidth becomes a bottleneck. Beyond that, you need to distribute the load. There are two approaches. 1) Room-based sharding: assign each room to a specific SFU. This works if rooms are small (<100). 2) Distributed SFU (cascading): split a single large room across multiple SFUs, each handling a subset of participants. The SFUs are connected via a media bridge (e.g., RTP over UDP between servers), and each SFU forwards streams from its participants to other SFUs as needed. This is complex because you need to avoid forwarding the same stream multiple times. A common pattern is to designate one SFU as the 'bridge' for each stream, or to use a full mesh between SFUs. For 1000 participants, you might have 10 SFUs, each handling 100 participants. The active speaker streams (3-6) are forwarded to all other SFUs. Budgeting 6 streams for each of the 10 SFUs gives at most 10 × 6 = 60 cross-SFU streams (strictly, 6 speakers × 9 receiving SFUs = 54 copies, since a speaker's home SFU already has the stream). Manageable. But you also need global active speaker detection: the SFUs must agree on who's speaking. Use a centralized audio level aggregator that collects levels from all SFUs and broadcasts the top speakers.
DistributedSFU.systemdesign
// io.thecodeforge — SystemDesign tutorial
// Distributed SFU architecture for 1000 participants
// Assume 10 SFU nodes, each handling 100 participants
// Each SFU has a bridge module that connects to other SFUs
// Bridge module: forwards selected streams to other SFUs
// SFUId, PeerId, RtpConnection and the local SFU object are application types not shown here
class Bridge {
  connections: Map<SFUId, RtpConnection> = new Map();
  localSFU; // the SFU instance this bridge belongs to

  forwardStream(stream, targetSFUId) {
    const conn = this.connections.get(targetSFUId);
    if (!conn) return; // no link to that SFU yet
    conn.send(stream);
  }

  // On receiving a stream from another SFU
  onRemoteStream(stream) {
    // Add this stream to local consumers that need it
    // e.g., if stream is from an active speaker, forward to all local consumers
    this.localSFU.addRemoteProducer(stream);
  }
}

// Global active speaker detection
// Each SFU periodically sends audio levels of its participants to a central service
// Central service aggregates and returns top 6 speakers globally
// SFUs then forward those speakers' streams to all other SFUs
// Pseudocode for central audio level service
function getGlobalActiveSpeakers(levelsFromAllSFUs) {
  // levelsFromAllSFUs: Map<SFUId, Map<PeerId, level>>
  const allLevels = [];
  for (const [sfuId, peerLevels] of levelsFromAllSFUs) {
    for (const [peerId, level] of peerLevels) {
      allLevels.push({ peerId, level, sfuId });
    }
  }
  allLevels.sort((a, b) => b.level - a.level);
  return allLevels.slice(0, 6);
}
Output
10 SFUs each handle 100 participants -> Each SFU forwards top 6 speakers to other SFUs via bridge -> Central audio level aggregator determines global top speakers -> Total cross-SFU streams: 10 * 6 = 60
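To check the cross-SFU stream count and to see how duplicate forwarding is avoided, the plan can be computed explicitly: each global speaker's stream leaves its home SFU exactly once per other SFU. This is a sketch with hypothetical names (`Speaker`, `Forward`, `planCrossSfuForwards`), not the author's bridge implementation.

```typescript
// Hypothetical forwarding plan for a cascaded SFU deployment.
interface Speaker {
  peerId: string;
  homeSfu: string; // the SFU the speaker is connected to
}

interface Forward {
  peerId: string;
  from: string;
  to: string;
}

// Each speaker's home SFU sends the stream once to every other SFU.
function planCrossSfuForwards(speakers: Speaker[], sfuIds: string[]): Forward[] {
  const plan: Forward[] = [];
  for (const speaker of speakers) {
    for (const target of sfuIds) {
      if (target !== speaker.homeSfu) {
        plan.push({ peerId: speaker.peerId, from: speaker.homeSfu, to: target });
      }
    }
  }
  return plan;
}

// Example: 10 SFUs and 6 global active speakers spread across them.
const sfuIds: string[] = [];
for (let i = 0; i < 10; i++) {
  sfuIds.push("sfu-" + i);
}
const globalSpeakers: Speaker[] = [];
for (let i = 0; i < 6; i++) {
  globalSpeakers.push({ peerId: "peer-" + i, homeSfu: sfuIds[i] });
}

const forwardPlan = planCrossSfuForwards(globalSpeakers, sfuIds);
console.log("cross-SFU stream copies: " + forwardPlan.length);
```

With 6 speakers and 10 SFUs the plan holds 6 × 9 = 54 entries, inside the 60-stream budget quoted above.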
Production Trap: Bridge Bandwidth
If you forward all streams between SFUs, you'll saturate your inter-SFU link. In a 1000-person call with 10 SFUs, forwarding every stream means each SFU receives about 900 remote streams × 2Mbps ≈ 1.8Gbps (roughly 2Gbps), and the bridge network as a whole carries ten times that. Only forward active speakers. Use a hard limit of 6 streams per SFU. We learned this the hard way when our 10Gbps link became the bottleneck.
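A quick sketch of the bridge bandwidth arithmetic, comparing forward-everything against the 6-stream cap. The 2Mbps per-stream figure and the 10 × 100 layout come from the text; the function and constant names are illustrative.

```typescript
// Inbound bridge bandwidth for one SFU, in Mbps.
function inboundBridgeMbps(remoteStreams: number, streamMbps: number): number {
  return remoteStreams * streamMbps;
}

const TOTAL_PARTICIPANTS = 1000;
const PARTICIPANTS_PER_SFU = 100;
const STREAM_MBPS = 2;
const ACTIVE_STREAM_LIMIT = 6;

// Forward everything: every participant hosted elsewhere is a remote stream.
const forwardAllMbps = inboundBridgeMbps(
  TOTAL_PARTICIPANTS - PARTICIPANTS_PER_SFU,
  STREAM_MBPS
);

// Forward active speakers only: worst case, all of them live on other SFUs.
const activeOnlyMbps = inboundBridgeMbps(ACTIVE_STREAM_LIMIT, STREAM_MBPS);

console.log("forward all: " + forwardAllMbps + " Mbps per SFU");
console.log("active only: " + activeOnlyMbps + " Mbps per SFU");
```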
Handling Network Degradation: Bandwidth Estimation and Adaptation
Real-time video is unforgiving of packet loss. WebRTC has built-in bandwidth estimation (GCC, Google Congestion Control) that adjusts bitrate based on delay and loss, but you need to configure it properly. The SFU should also participate: it can send REMB (Receiver Estimated Maximum Bitrate) messages to publishers to reduce their bitrate. For clients with poor connectivity, the SFU can switch to a lower simulcast layer or drop video entirely (audio-only). Key: never let the client decide alone, because the server sees conditions across all receivers. Implement a server-side bandwidth manager that aggregates feedback from all consumers and sends a unified REMB to each producer. Also, support FEC (Forward Error Correction) for audio: it's small and worth the overhead. For video, FEC is too expensive; use NACKs and retransmissions instead. And always rely on packet loss concealment (PLC) in the audio codec (Opus has it built in).
BandwidthAdaptation.systemdesign
// io.thecodeforge — SystemDesign tutorial
// Server-side bandwidth manager
// Collects receiver reports from all consumers of a producer
// Computes a combined REMB and sends it to the producer
// sendREMB and sfu are application helpers not shown here
class BandwidthManager {
  // Latest report per consumer, grouped by producer
  reports: Map<ProducerId, Map<ConsumerId, ConsumerReport>> = new Map();

  onReceiverReport(producerId, consumerId, report) {
    if (!this.reports.has(producerId)) this.reports.set(producerId, new Map());
    const byConsumer = this.reports.get(producerId);
    byConsumer.set(consumerId, report); // overwrite so stale reports don't accumulate
    // Aggregate: take the minimum available bandwidth across all consumers
    const minBw = Math.min(...Array.from(byConsumer.values()).map(r => r.availableBitrate));
    // Send REMB to producer
    sendREMB(producerId, minBw);
  }

  // On packet loss > 5%, force switch to lower simulcast layer
  onPacketLoss(producerId, consumerId, lossFraction) {
    if (lossFraction > 0.05) {
      // Ask SFU to forward lower layer for this consumer
      sfu.setConsumerLayer(consumerId, 'low');
    }
  }
}

// Simulcast layer switching
// Producer sends 3 layers: high (720p, 2Mbps), medium (360p, 500Kbps), low (180p, 150Kbps)
// SFU selects layer per consumer based on bandwidth; a layer is chosen only if its bitrate fits
function selectLayer(availableBw) {
  if (availableBw >= 2000000) return 'high';
  if (availableBw >= 500000) return 'medium';
  return 'low';
}
Output
Bandwidth manager aggregates consumer reports -> Sends REMB with min available bandwidth -> Producer adjusts bitrate -> If packet loss >5%, SFU switches consumer to lower layer
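Layer selection can also be written against a table of layer bitrates, picking the highest layer that fits the consumer's estimated bandwidth with some headroom. The layer bitrates are the ones listed above; the headroom factor, the `null` result meaning "audio only", and all names are assumptions in this sketch.

```typescript
// Hypothetical simulcast layer table, highest bitrate first.
interface SimulcastLayer {
  name: string;
  bitrateBps: number;
}

const SIMULCAST_LAYERS: SimulcastLayer[] = [
  { name: "high", bitrateBps: 2000000 },  // 720p
  { name: "medium", bitrateBps: 500000 }, // 360p
  { name: "low", bitrateBps: 150000 },    // 180p
];

// Picks the highest layer whose bitrate fits within availableBps * headroom.
// headroom < 1 leaves room for audio, retransmissions and estimate error.
// Returns null when not even the lowest layer fits (fall back to audio only).
function pickLayer(availableBps: number, headroom: number): SimulcastLayer | null {
  const budget = availableBps * headroom;
  for (const layer of SIMULCAST_LAYERS) {
    if (layer.bitrateBps <= budget) {
      return layer;
    }
  }
  return null;
}

const chosen = pickLayer(1000000, 0.8);
console.log(chosen === null ? "audio only" : chosen.name);
```

Production selectors usually add hysteresis so a consumer hovering near a threshold does not flap between layers; that is left out here.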
Interview Gold: GCC vs. Sender-Side Estimation
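The original GCC design ran its delay-based estimator on the receiver and reported the result to the sender with REMB. Current WebRTC stacks run the estimator on the sender using transport-wide congestion control (transport-cc) feedback. For a server-side SFU, you usually want sender-side estimation on the server-to-client leg, because the server is the sender there and sees all consumers. Some implementations use a hybrid: receiver-side (REMB) for client-to-server, sender-side for server-to-client. Know the difference.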
TURN Servers: The NAT Traversal Safety Net
Not all clients can establish peer-to-peer connections due to symmetric NATs or firewalls. That's where TURN (Traversal Using Relays around NAT) comes in. TURN servers relay media traffic. They're bandwidth hogs: each stream consumes relay bandwidth. You need to deploy TURN servers in multiple regions close to users. Use ICE (Interactive Connectivity Establishment) to try direct P2P first (via STUN), then fall back to TURN. Configure TURN with authentication (time-limited credentials) to prevent abuse. Key metric: TURN usage ratio. If >20% of calls use TURN, your STUN infrastructure might be misconfigured or your users are behind restrictive NATs (e.g., corporate VPNs). Also, TURN servers must support UDP, TCP, and TLS. UDP is preferred for low latency. TCP adds overhead but works through firewalls that block UDP.
TURNDeployment.systemdesign
// io.thecodeforge — SystemDesign tutorial
# TURN server configuration (using coturn as example)
# /etc/turnserver.conf
listening-port=3478
tls-listening-port=5349
fingerprint
lt-cred-mech
# Static user for local testing only. Use time-limited credentials in production.
user=zoomuser:secretpassword
realm=thecodeforge.io
# Per-session bandwidth cap. coturn's max-bps is in bytes per second, so 2Mbps is 250000.
max-bps=250000
# Use multiple relay IPs for load balancing
relay-ip=203.0.113.1
relay-ip=203.0.113.2
# coturn also answers STUN binding requests by default, so no extra setting is needed

// Client-side ICE configuration
// In WebRTC, set iceTransportPolicy to 'relay' only if you want to force TURN
// For normal operation, use 'all' (default)
const pcConfig = {
  iceServers: [
    { urls: 'stun:stun.thecodeforge.io:3478' },
    { urls: 'turn:turn.thecodeforge.io:3478', username: 'user', credential: 'pass' }
  ]
};
// Monitor TURN usage
// Metric: relayed bytes vs total bytes
// If relayed > 20% of total, investigate NAT issues
Output
TURN server listens on 3478 (UDP/TCP) and 5349 (TLS) -> Clients use ICE to try STUN first -> If STUN fails, fallback to TURN -> TURN relays media -> Monitor relay ratio
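The relay-ratio metric is easy to compute once you have per-connection byte counts. A minimal sketch, assuming your stats pipeline can split bytes into relayed and direct; the `TransportStats` shape and function names are hypothetical, and the 20% threshold is the one from the text.

```typescript
// Hypothetical per-connection byte counters exported by your stats pipeline.
interface TransportStats {
  relayedBytes: number; // bytes that went through TURN
  directBytes: number;  // bytes sent over a direct (host or STUN-derived) path
}

const RELAY_ALERT_THRESHOLD = 0.2;

// Fraction of all media bytes that were relayed; 0 when there is no traffic.
function relayRatio(stats: TransportStats[]): number {
  let relayed = 0;
  let total = 0;
  for (const s of stats) {
    relayed += s.relayedBytes;
    total += s.relayedBytes + s.directBytes;
  }
  return total === 0 ? 0 : relayed / total;
}

function shouldInvestigateNat(stats: TransportStats[]): boolean {
  return relayRatio(stats) > RELAY_ALERT_THRESHOLD;
}

const sample: TransportStats[] = [
  { relayedBytes: 300, directBytes: 700 },
];
console.log("relay ratio: " + relayRatio(sample));
console.log("investigate: " + shouldInvestigateNat(sample));
```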
Never Do This: Hardcoding TURN Credentials
I've seen production configs with hardcoded TURN credentials in the client code. Attackers can use your TURN server as a free relay, costing you bandwidth. Always use time-limited credentials generated by your signaling server. coturn supports this through its shared-secret mode (the use-auth-secret and static-auth-secret options, often called the TURN REST API).
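In the shared-secret scheme, the username carries an expiry timestamp and the password is an HMAC-SHA1 of that username, base64-encoded. The sketch below shows the shape of the credential only: the HMAC function is injected so the example stays dependency-free, and `makeTurnCredential` and its parameters are hypothetical names. In Node.js the injected function could be built on `crypto.createHmac("sha1", secret).update(message).digest("base64")`.

```typescript
// Signature of an HMAC-SHA1 helper that returns base64 output (assumed to be supplied by the caller).
type HmacSha1Base64 = (secret: string, message: string) => string;

interface TurnCredential {
  username: string;   // "<unix expiry>:<user id>"
  credential: string; // base64(HMAC-SHA1(sharedSecret, username))
  ttlSeconds: number;
}

function makeTurnCredential(
  userId: string,
  sharedSecret: string,
  ttlSeconds: number,
  nowSeconds: number,
  hmac: HmacSha1Base64
): TurnCredential {
  const expiresAt = Math.floor(nowSeconds) + ttlSeconds;
  const username = expiresAt + ":" + userId;
  return {
    username: username,
    credential: hmac(sharedSecret, username),
    ttlSeconds: ttlSeconds,
  };
}

// Demo with a stand-in HMAC so the output shape is visible; never use this stub in production.
const stubHmac: HmacSha1Base64 = (secret, message) => "stub(" + message + ")";
const demoCredential = makeTurnCredential("alice", "not-a-real-secret", 3600, 1700000000, stubHmac);
console.log(demoCredential.username);
```

The signaling server would hand this username and credential to the client for its iceServers entry, so the shared secret never leaves the backend.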
Recording and Playback: Archiving the Chaos
Recording a Zoom call is harder than it looks. You can't just record the mixed audio/video because you lose individual speaker tracks. For compliance (e.g., legal depositions), you need per-participant recordings. Solution: have the SFU send each participant's stream to a recording service. The recording service can either store individual tracks (for later compositing) or mix them in real-time. For real-time mixing, use a dedicated MCU-like component that decodes and mixes streams, then encodes the final video. This is expensive. Better: store individual tracks as fragmented MP4 (fMP4) with timestamps, and composite offline. For playback, you need a video player that can handle multiple synchronized streams. Use HLS or DASH with multiple audio tracks. Or build a custom player using WebRTC to re-render the call. Gotcha: recording must handle network glitches — buffer at least 5 seconds of data to recover from packet loss.
RecordingService.systemdesign
// io.thecodeforge — SystemDesign tutorial
// Recording service architecture
// SFU forwards each participant's stream to a recording service via RTP
// Recording service writes each stream to a separate file (e.g., WebM or fMP4)
// FileWriter is an application type not shown here; a real writer must depacketize RTP
// and mux the frames into the container rather than write raw packets
class RecordingService {
  streams: Map<PeerId, FileWriter> = new Map();
  roomId;

  constructor(roomId) {
    this.roomId = roomId;
  }

  onRtpPacket(peerId, packet) {
    let writer = this.streams.get(peerId);
    if (!writer) {
      writer = new FileWriter(`/recordings/${this.roomId}/${peerId}.webm`);
      this.streams.set(peerId, writer);
    }
    writer.write(packet);
  }
  // For composite recording, use a mixer that decodes all streams
  // and encodes a single output
  // Mixer uses FFmpeg or custom GStreamer pipeline
}
// Offline compositing command (using FFmpeg)
// ffmpeg -i peer1.webm -i peer2.webm -filter_complex "[0:v]scale=640:360[v0];[1:v]scale=640:360[v1];[v0][v1]hstack=inputs=2" output.mp4
// This creates a side-by-side video. For grid layout, use more complex filter.
Output
SFU forwards streams to recording service -> Each stream saved as individual file -> Offline compositing produces final video
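The 5-second buffering advice can be sketched as a hold-and-reorder buffer in front of each file writer: packets wait for a fixed window so late or out-of-order arrivals can be slotted in before anything is written. This is an illustration with hypothetical names, and it ignores RTP's 16-bit sequence number wraparound, which a real recorder has to handle.

```typescript
// Hypothetical packet shape; a real recorder would carry the RTP payload as bytes.
interface BufferedPacket {
  seq: number;       // sequence number (wraparound not handled in this sketch)
  arrivalMs: number; // when the recorder received the packet
  payload: string;
}

class RecordingBuffer {
  private pending: BufferedPacket[] = [];
  private holdMs: number;

  constructor(holdMs: number) {
    this.holdMs = holdMs;
  }

  push(packet: BufferedPacket): void {
    this.pending.push(packet);
  }

  // Returns packets that have waited at least holdMs, sorted by sequence number.
  // Packets still inside the hold window stay buffered.
  drain(nowMs: number): BufferedPacket[] {
    const ready: BufferedPacket[] = [];
    const waiting: BufferedPacket[] = [];
    for (const p of this.pending) {
      if (nowMs - p.arrivalMs >= this.holdMs) {
        ready.push(p);
      } else {
        waiting.push(p);
      }
    }
    this.pending = waiting;
    ready.sort((a, b) => a.seq - b.seq);
    return ready;
  }
}

const recordingBuffer = new RecordingBuffer(5000);
recordingBuffer.push({ seq: 2, arrivalMs: 0, payload: "frame-2" });
recordingBuffer.push({ seq: 1, arrivalMs: 100, payload: "frame-1" });
console.log(recordingBuffer.drain(5200).map((p) => p.seq));
```

A retransmitted packet that arrives after its neighbours were already drained would still land out of order, so the hold window should be longer than your worst expected retransmission delay.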
Senior Shortcut: Use GStreamer for Recording
Don't build a recording pipeline from scratch. Use GStreamer with webrtcbin to receive streams and encode to file. It handles jitter buffer, retransmission, and encoding. We use it in production with a Python wrapper.
When Not to Build Your Own Zoom: The Build vs. Buy Decision
Building a Zoom clone is a massive undertaking. You need expertise in WebRTC, networking, distributed systems, and media codecs. If your core business isn't video conferencing, don't build it. Use a third-party API like Twilio Video, Agora, or Daily.co. They handle SFU, TURN, and scaling. You pay per participant-minute, but you save months of engineering. Only build if you have specific requirements: custom UI, offline recording, proprietary codecs, or air-gapped deployments. Even then, consider using open-source SFUs like mediasoup or Janus and customize. The build vs. buy decision is simple: if you need to support >1000 participants with <200ms latency, and you have a team of 5+ engineers dedicated to this, build. Otherwise, buy.
The Classic Bug: Over-Engineering
I've seen startups spend 6 months building a video platform when they could have used Twilio and launched in 2 weeks. They ended up with a buggy product and missed market window. Don't be a hero. Use existing infrastructure.
Production incident · Post-mortem · Severity: high
The 4GB Container That Kept Dying
Symptom
Media server pods in Kubernetes kept OOMKilling every 15 minutes during a 200-person webinar. Memory usage spiked to 4GB then crashed.
Assumption
Thought it was a memory leak in the WebRTC library. Spent days profiling heap dumps.
Root cause
SFU was configured to forward all video streams (not just active speakers). Each 720p stream at 30fps consumed ~2Mbps bandwidth and ~50MB memory for buffering. 200 participants × 50MB = 10GB, but container limit was 4GB. OOM killer did its job.
Fix
Enabled simulcast and forward-only-active-speaker mode. Set max video streams per client to 4 (the 3 loudest speakers + self). Memory dropped to 800MB. No more OOMs.
Key lesson
Never forward all streams.
Always limit active video streams to a small number (3-6).
Use audio levels to select which streams to forward.
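The root-cause arithmetic is worth keeping as a capacity check you can run before a large event. The 50MB per stream and the 200-participant figure come from the incident; treating the 4GB limit as 4096MB, and the function names, are assumptions of this sketch.

```typescript
// Memory needed to buffer a given number of forwarded video streams.
function forwardedStreamMemoryMb(streams: number, mbPerStream: number): number {
  return streams * mbPerStream;
}

// Largest number of buffered streams that fits under a container memory limit.
function maxStreamsForLimit(limitMb: number, mbPerStream: number): number {
  return Math.floor(limitMb / mbPerStream);
}

const CONTAINER_LIMIT_MB = 4096; // assumption: "4GB" read as 4096 MB
const MB_PER_STREAM = 50;

const forwardAllMb = forwardedStreamMemoryMb(200, MB_PER_STREAM);
console.log("forward-all needs " + forwardAllMb + " MB, limit is " + CONTAINER_LIMIT_MB + " MB");
console.log("streams that fit: " + maxStreamsForLimit(CONTAINER_LIMIT_MB, MB_PER_STREAM));
```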
Production debug guide
Systematic recovery paths for the failure modes engineers actually hit.
Symptom · 01
Users report frozen video or audio stuttering
→
Fix
1. Check packet loss on the media server (netstat -su for UDP receive errors, or WebRTC getStats()). 2. Check bandwidth estimation logs. 3. If loss >5%, enable FEC for audio or switch to a lower simulcast layer. 4. Verify TURN server bandwidth isn't saturated.
Symptom · 02
Call setup fails for some users (ICE connection timeout)
→
Fix
1. Check signaling server logs for ICE candidate exchange. 2. Verify STUN/TURN servers are reachable from the client's network. 3. Check firewall rules (UDP 3478, TCP/TLS 5349, or TCP 443 if your TURN server listens there). 4. If your stack's ICE timeout is short (e.g., 5s), raise it to 10s.
Symptom · 03
High CPU on media server
→
Fix
1. Check number of active streams per server. 2. Reduce max participants per server (e.g., from 200 to 100). 3. Enable simulcast and limit high-resolution streams. 4. Profile with perf to find hot spots (e.g., encryption).
Design Zoom Triage Cheat Sheet
First-response commands for when things go wrong, copy-paste ready.
`OOMKilled` on media server pod
Immediate action
Check memory usage per stream
Commands
kubectl top pod <pod-name> --containers
kubectl describe pod <pod-name> | grep -A 5 'Last State'
Fix now
Cap incoming bitrate at 2Mbps per sending transport (in mediasoup, transport.setMaxIncomingBitrate(2000000)) and limit active video streams to 4.
`ICE failed, no candidate pair` in client logs
Immediate action
Verify TURN server is reachable
Commands
nc -vuz <turn-server-ip> 3478
tcpdump -i any port 3478
Fix now
Add TURN server with correct credentials. Ensure firewall allows UDP 3478.
High packet loss (>10%) on server
Immediate action
Check network interface saturation
Commands
sar -n DEV 1 5
ethtool -S eth0 | grep drop
Fix now
Enable FEC for audio (Opus inband FEC). For video, reduce max bitrate per stream to 1Mbps.
Ensure all media servers use the same NTP server. Enable RTCP sender reports for synchronization.
| Feature / Aspect | SFU | MCU |
|---|---|---|
| Server CPU load | O(N): forwarding only | O(N²): decode/mix/encode |
| Client CPU load | High: decode multiple streams | Low: decode one mixed stream |
| Bandwidth per client | High: receive multiple streams | Low: receive one stream |
| Latency | Low: no transcoding | Higher: transcoding adds delay |
| Scalability | Good up to 1000+ with distributed SFU | Poor beyond 20 participants |
| Complexity | Moderate: need simulcast and active speaker detection | High: need mixer and transcoder |
Key takeaways
1. SFU scales better than MCU for large calls because server load is O(N) vs O(N²). Always use SFU for >10 participants.
2. Never forward all video streams. Limit active video to 3-6 streams per client based on audio levels. In a 100-person call that is 6 streams instead of 99 per client, a cut of over 90% in forwarded video.
3. Signaling should use a persistent channel such as WebSocket, not HTTP polling. ICE connectivity checks time out within seconds, so candidates have to be exchanged with low latency.
4. Deploy TURN servers in multiple regions. Monitor relay ratio; if >20%, investigate NAT issues. Use time-limited credentials for security.
Interview Questions on This Topic
Q01 of 06 · SENIOR
How does Zoom's SFU handle a participant with poor network connectivity? Describe the adaptation mechanism.
ANSWER
Zoom uses simulcast: each client sends multiple resolution layers. The SFU monitors packet loss and round-trip time per consumer. If a consumer's loss exceeds 5%, the SFU switches that consumer to a lower simulcast layer. Additionally, the SFU sends REMB messages to the producer to reduce overall bitrate. For extreme cases, the SFU may drop video entirely and forward only audio.
Q02 of 06 · SENIOR
In a distributed SFU architecture with 1000 participants across 10 servers, how do you ensure global active speaker detection works with low latency?
ANSWER
Each SFU periodically (every 100ms) sends audio levels of its top 10 speakers to a central aggregator service. The aggregator sorts all levels and returns the top 6 global speakers to all SFUs. This adds ~100ms latency, acceptable for speaker detection. To reduce latency further, use a gossip protocol between SFUs to share levels directly.
Q03 of 06 · SENIOR
What happens when a TURN server runs out of bandwidth during a large call? How do you mitigate?
ANSWER
Clients will fail to allocate relay ports, causing ICE failures. Mitigation: deploy multiple TURN servers behind a load balancer, monitor bandwidth usage per server, and scale horizontally. Also, implement client-side fallback: if TURN allocation fails, retry with a different TURN server or reduce stream quality. Set max bandwidth per TURN user to prevent a single user from hogging resources.
Q04 of 06 · JUNIOR
Explain the role of STUN and TURN in WebRTC. When would a call use TURN vs. STUN?
ANSWER
STUN lets a client discover its public IP and port (its NAT binding). TURN relays media when a direct path can't be established. ICE gathers both kinds of candidates and prefers direct (host or STUN-derived) pairs; if none of them pass connectivity checks, it falls back to the TURN relay. That typically happens when a client is behind a symmetric NAT or a firewall that blocks UDP.
Q05 of 06 · SENIOR
You notice that 40% of your calls are using TURN relay, costing high bandwidth. How do you debug and fix?
ANSWER
First, check if STUN servers are reachable and correctly configured. Use ICE candidate pair logs to see why P2P failed. Common causes: misconfigured STUN server IP, firewall blocking UDP, or clients behind VPNs that force relay. Fix: ensure STUN servers are in the same region as clients, open UDP ports, and consider using TCP for STUN if UDP is blocked. Also, enable ICE restart to retry P2P after initial failure.
Q06 of 06 · SENIOR
Design a system to record a 500-person Zoom call with per-participant audio tracks for compliance.
ANSWER
Use a distributed SFU that forwards each participant's audio stream to a recording service. The recording service runs on separate servers, each handling a subset of streams. Streams are written as fragmented MP4 files with timestamps. For playback, a compositing service mixes tracks offline using FFmpeg or GStreamer. To handle failures, buffer 10 seconds of audio in memory before writing to disk. Use cloud storage (S3) for durability.
Frequently Asked Questions
01
How does Zoom handle 1000 participants in a single call?
Zoom uses a distributed SFU architecture. Participants are split across multiple media servers (SFUs), each handling ~100 participants. The SFUs are interconnected via a media bridge that forwards active speaker streams between servers. A global audio level aggregator determines the top speakers, and only those streams are forwarded across servers. This keeps bandwidth and CPU manageable.
02
What's the difference between SFU and MCU in video conferencing?
SFU (Selective Forwarding Unit) forwards individual streams without processing them — the server is a smart switch. MCU (Multipoint Control Unit) mixes all streams into one on the server. SFU scales better (O(N) server load) but requires more client bandwidth. MCU is simpler for small groups but doesn't scale. Zoom uses SFU.
03
How do I set up a TURN server for WebRTC?
Use coturn. Install it, configure listening ports (3478 for UDP/TCP, 5349 for TLS), set authentication with time-limited credentials, and specify relay IPs. In your WebRTC client, add the TURN server to iceServers with username and credential. Ensure firewall allows the ports. Monitor usage to avoid abuse.
04
What happens when WebRTC packet loss is high? How does Zoom adapt?
Zoom's SFU monitors packet loss per consumer. If loss exceeds 5%, the SFU switches the consumer to a lower simulcast layer (e.g., from 720p to 360p). It also sends REMB messages to the producer to reduce bitrate. For audio, Opus has built-in packet loss concealment. If loss is extreme, the SFU may drop video entirely and keep audio only.