The core challenge is handling concurrent edits from multiple users without data loss. Use Operational Transformation (OT) to transform each operation against concurrent ones, ensuring all clients converge to the same document state. Google Docs uses OT with a central server for ordering.
What is Design Google Docs?
"Design Google Docs" is a system-design exercise about the architecture behind real-time collaborative document editing: multiple users edit simultaneously, and conflicts are resolved with Operational Transformation (OT) or Conflict-Free Replicated Data Types (CRDTs).
Plain-English First
Imagine a group of people writing on a whiteboard with markers. If two people try to write at the same spot, you need a rule to decide whose text goes where. Google Docs is like having a referee who catches every marker stroke, reorders them, and tells everyone the final result so no one's work gets erased.
You think building Google Docs is just WebSocket + CRDT? I've seen that assumption crater a startup's demo when two users typed the same word and the document turned into a jumble of 'helloworldhello'. Real-time collaboration is a distributed systems problem in disguise. The naive approach—send diffs and hope—loses data under load. This article walks you through the architecture that powers Google Docs: Operational Transformation, conflict resolution, and the production gotchas that'll kill your latency SLA. By the end, you'll be able to design a collaborative editor that survives concurrent edits, network partitions, and 3 AM pager duty.
Why Operational Transformation? The Problem with Naive Sync
Before OT, collaborative editors used lock-step or diff-merge. Lock-step blocks users, which is unacceptable for real-time editing. Diff-merge loses context: if Alice inserts 'A' at position 0 and Bob inserts 'B' at position 0, a simple merge produces 'AB' or 'BA' depending on arrival order, so two clients that receive the edits in different orders end up with different documents. Positions also go stale: an index computed against the old text points at the wrong character once a concurrent edit has shifted the content. OT solves this by transforming each operation against concurrent ones so that every client reaches the same result regardless of arrival order. The key insight: operations are functions that can be composed and transformed. Without OT, you get data corruption under concurrent edits. I've seen a production system where two users edited the same paragraph and the server applied both operations without transformation. Result: half the paragraph vanished. The fix was implementing OT with a central sequencer.
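To make the stale-position failure concrete, here is a minimal sketch of two replicas applying the same pair of concurrent operations in different orders with no transformation at all. The `applyOp` helper and the operation shape are illustrative assumptions, not code from any real editor.

```javascript
// Hypothetical helper: apply a single insert or delete to a string.
function applyOp(doc, op) {
  if (op.type === 'insert') {
    return doc.slice(0, op.position) + op.text + doc.slice(op.position);
  }
  // delete: remove `length` characters starting at `position`
  return doc.slice(0, op.position) + doc.slice(op.position + op.length);
}

const base = 'hello';
const aliceOp = { type: 'insert', position: 0, text: 'X' }; // Alice prepends "X"
const bobOp = { type: 'delete', position: 4, length: 1 };   // Bob deletes the "o"

// Replica 1 sees Alice's op first, replica 2 sees Bob's op first.
const replica1 = applyOp(applyOp(base, aliceOp), bobOp);
const replica2 = applyOp(applyOp(base, bobOp), aliceOp);

console.log(replica1); // "Xhelo": Bob's delete hit an "l" instead of the "o"
console.log(replica2); // "Xhell": what both users intended
```

Same operations, different arrival order, two different documents. That gap is exactly what the transformation step has to close.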
OperationalTransform.systemdesign
// io.thecodeforge - SystemDesign tutorial
// Simplified OT for insert operations
// Assume document is a string, operations are {type, position, text, userId}
function transform(op1, op2) {
  // op1 and op2 are concurrent operations.
  // Returns op1' such that applying op2 then op1' is equivalent to applying op1 then op2'.
  if (op1.type === 'insert' && op2.type === 'insert') {
    if (op1.position < op2.position) {
      // op1 is left of op2, so op2's text does not move it
      return { ...op1 };
    } else if (op1.position > op2.position) {
      // op1 shifts right by the length of op2.text
      return { ...op1, position: op1.position + op2.text.length };
    } else {
      // Same position: use a tie-breaker (e.g., user ID)
      if (op1.userId < op2.userId) {
        return { ...op1 };
      } else {
        return { ...op1, position: op1.position + op2.text.length };
      }
    }
  }
  // ... handle delete, replace etc. (returned untransformed in this sketch)
  return { ...op1 };
}

// Minimal apply for inserts, added so the example runs
function apply(doc, op) {
  return doc.slice(0, op.position) + op.text + doc.slice(op.position);
}

// Example:
let doc = "hello";
let opA = { type: 'insert', position: 0, text: 'x', userId: 'A' };
let opB = { type: 'insert', position: 0, text: 'y', userId: 'B' };
// Server receives opA then opB, so opB must be transformed against opA:
let opBPrime = transform(opB, opA); // position becomes 1 (opA inserted 'x' at 0 and user 'A' wins the tie)
// Apply opA then opBPrime:
doc = apply(doc, opA); // "xhello"
doc = apply(doc, opBPrime); // "xyhello"
// The other order converges too: transform(opA, opB) keeps position 0,
// so applying opB then opAPrime also yields "xyhello".
// Without the transform, opA then opB gives "yxhello" while opB then opA gives "xyhello".
// OT converges only if the transformation functions satisfy TP1
// (and TP2 as well when no central server fixes the order).
console.log(doc); // "xyhello"; real OT also has to handle deletes, ranges and formatting
Output
xyhello
Production Trap: Non-Convergent Transformations
If your OT functions don't satisfy the transformation properties (TP1, TP2), clients will diverge. Test with random concurrent operations and verify final state is identical. I've seen a team spend weeks debugging 'ghost characters' because their delete transformation didn't handle overlapping ranges.
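A randomized check along these lines is cheap to write. The sketch below tests TP1 for an insert-only transform with a user-ID tie-breaker; the seeded generator, the `checkTP1` name and the operation shape are assumptions made for illustration, and a real harness would also cover deletes and overlapping ranges.

```javascript
// Insert-only transform with a user-ID tie-breaker.
function transform(op1, op2) {
  if (op1.position < op2.position) return { ...op1 };
  if (op1.position > op2.position || op1.userId > op2.userId) {
    return { ...op1, position: op1.position + op2.text.length };
  }
  return { ...op1 };
}

function apply(doc, op) {
  return doc.slice(0, op.position) + op.text + doc.slice(op.position);
}

// Small deterministic PRNG so a failing case can be replayed from its seed.
function makeRng(seed) {
  let state = seed >>> 0;
  return () => {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 2 ** 32;
  };
}

function randomInsert(rng, docLength, userId) {
  return {
    type: 'insert',
    position: Math.floor(rng() * (docLength + 1)),
    text: String.fromCharCode(97 + Math.floor(rng() * 26)),
    userId,
  };
}

// TP1: doc + a + T(b, a) must equal doc + b + T(a, b)
function checkTP1(rounds, seed) {
  const rng = makeRng(seed);
  for (let i = 0; i < rounds; i++) {
    const doc = 'x'.repeat(Math.floor(rng() * 6));
    const a = randomInsert(rng, doc.length, 'A');
    const b = randomInsert(rng, doc.length, 'B');
    const viaA = apply(apply(doc, a), transform(b, a));
    const viaB = apply(apply(doc, b), transform(a, b));
    if (viaA !== viaB) return { ok: false, doc, a, b, viaA, viaB };
  }
  return { ok: true };
}

console.log(checkTP1(10000, 42)); // { ok: true } when the transform satisfies TP1
```

On failure the harness returns the exact document and operation pair, which is usually all you need to reproduce a 'ghost character' bug in a unit test.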
Figure: Real-Time Collaborative Editing at Scale (thecodeforge.io)
Central Server Architecture: The Sequencer Pattern
Google Docs uses a central server that sequences all operations. Each client sends operations to the server, which assigns a monotonically increasing version number (timestamp or counter). The server transforms incoming operations against all previously applied operations and broadcasts the transformed version to all clients. This guarantees total order and simplifies conflict resolution. The downside: single point of failure and latency bottleneck. But for a document editor, the consistency guarantees are worth it. Without a sequencer, you need a distributed consensus protocol (like Raft) which adds complexity. For most use cases, a central server with a standby replica is fine. The classic rookie mistake: not handling server restarts. If the server crashes and loses the operation log, clients will have divergent states. Persist the operation log to a database (e.g., PostgreSQL with WAL) before broadcasting.
SequencerPattern.systemdesign
// io.thecodeforge - SystemDesign tutorial
// Server-side operation handler (pseudocode)
// transform, apply, persistOperation and broadcast are assumed to be defined elsewhere
class DocumentServer {
  constructor() {
    this.version = 0;      // number of operations applied so far
    this.operations = [];  // persisted to DB
    this.document = "";
  }
  handleOperation(clientOp, clientVersion) {
    // Client sends its last known version (how many server ops it has applied)
    // Transform clientOp against all operations after clientVersion
    let transformedOp = clientOp;
    for (let i = clientVersion; i < this.version; i++) {
      transformedOp = transform(transformedOp, this.operations[i]);
    }
    // Apply to server document
    this.document = apply(this.document, transformedOp);
    // Assign new version: the first operation gets version 1, so a client that
    // has applied it reports clientVersion 1 and is not transformed against it again
    this.operations.push(transformedOp);
    this.version += 1;
    const newVersion = this.version;
    // Persist operation to DB before broadcasting (critical for recovery)
    persistOperation(newVersion, transformedOp);
    // Broadcast to all clients
    broadcast({ op: transformedOp, version: newVersion });
  }
}
// Client-side: send operation with last known version
// On receiving broadcast, apply op and update local version
// If broadcast version > expected, request missing ops from server
Output
(No direct output; pattern for server logic)
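The client-side rules in the closing comments can be sketched as follows. This assumes broadcasts carry 1-based, gap-free version numbers and that the caller supplies a `requestMissingOps` callback; both are assumptions for illustration, and local pending edits are left out to keep the gap handling visible.

```javascript
// Hypothetical client: tracks how many server operations it has applied.
class DocumentClient {
  constructor(requestMissingOps) {
    this.document = '';
    this.version = 0;                            // number of server ops applied so far
    this.requestMissingOps = requestMissingOps;  // assumed callback: (fromVersion, toVersion) => void
    this.pending = new Map();                    // out-of-order broadcasts, keyed by version
  }

  // `message` is assumed to look like { op, version }.
  onBroadcast(message) {
    if (message.version <= this.version) return; // duplicate, already applied
    this.pending.set(message.version, message.op);
    if (message.version > this.version + 1) {
      // Gap detected: ask the server for everything in between.
      this.requestMissingOps(this.version + 1, message.version - 1);
    }
    // Apply every operation that is now contiguous with our state.
    while (this.pending.has(this.version + 1)) {
      const op = this.pending.get(this.version + 1);
      this.pending.delete(this.version + 1);
      this.document = applyInsert(this.document, op);
      this.version += 1;
    }
  }
}

// Insert-only apply, enough for the sketch.
function applyInsert(doc, op) {
  return doc.slice(0, op.position) + op.text + doc.slice(op.position);
}
```

Buffering the early message instead of dropping it means the client catches up as soon as the missing operation arrives, whether from the re-request or from a delayed broadcast.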
Senior Shortcut: Batching Operations
To reduce server load, batch operations from the same client every 50-100ms. Send a list of ops with the client's version. The server transforms the batch as a unit. For a fast typist this can cut the number of WebSocket messages by roughly an order of magnitude.
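A client-side batcher could look like the sketch below. The `send` callback, the `baseVersion` field and the message shape are assumptions for illustration; the 50ms default comes from the range above.

```javascript
class OperationBatcher {
  // `send` is an assumed transport callback, e.g. a wrapper around WebSocket.send.
  constructor(send, { intervalMs = 50 } = {}) {
    this.send = send;
    this.intervalMs = intervalMs;
    this.buffer = [];
    this.baseVersion = null; // server version the first buffered op was based on
    this.timer = null;
  }

  add(op, clientVersion) {
    if (this.buffer.length === 0) this.baseVersion = clientVersion;
    this.buffer.push(op);
    // Start one timer per batch instead of sending per keystroke.
    if (this.timer === null) {
      this.timer = setTimeout(() => this.flush(), this.intervalMs);
    }
  }

  flush() {
    if (this.timer !== null) {
      clearTimeout(this.timer);
      this.timer = null;
    }
    if (this.buffer.length === 0) return;
    const batch = { baseVersion: this.baseVersion, ops: this.buffer };
    this.buffer = [];
    this.send(batch);
  }
}
```

Calling `flush()` directly (on blur, or before the socket closes) sends whatever is buffered without waiting for the timer.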
Conflict Resolution: Handling Concurrent Edits
When two users edit the same word simultaneously, OT transforms both operations so they apply without loss. But edge cases abound: what if Alice deletes a range that Bob inserts into? The delete operation must be transformed to account for the insert, typically by splitting it so Bob's new text survives. The standard approach has two phases: first transform the incoming operation against the history the client had not seen, then apply it. For complex edits (e.g., formatting), you need to track character positions with a position index that updates after each operation. One way to sidestep the 'shifting index' problem is to give each character a unique ID so that operations reference characters by ID, not position. Whether Google Docs does this internally is not publicly documented, so treat it as a design option. Production gotcha: if your transformation functions do not satisfy the convergence property TP1, you'll get different results depending on operation order. Always test with a random concurrent workload.
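The delete-versus-insert case is worth seeing in code. Below is a minimal sketch for position-based operations on a string; the function names and the choice to split the delete into two pieces are illustrative assumptions, and other designs resolve the same case differently.

```javascript
// Rewrite a delete so it can run after a concurrent insert has been applied.
// Returns a list because an insert inside the range splits the delete in two.
function transformDeleteAgainstInsert(del, ins) {
  const end = del.position + del.length;
  if (ins.position <= del.position) {
    // Insert landed before the range: shift the whole range right.
    return [{ ...del, position: del.position + ins.text.length }];
  }
  if (ins.position >= end) {
    // Insert landed after the range: nothing to adjust.
    return [{ ...del }];
  }
  // Insert landed inside the range: delete around the new text.
  const leftLength = ins.position - del.position;
  return [
    { ...del, length: leftLength },
    { ...del, position: ins.position + ins.text.length, length: del.length - leftLength },
  ];
}

// Rewrite an insert so it can run after a concurrent delete has been applied.
function transformInsertAgainstDelete(ins, del) {
  const end = del.position + del.length;
  if (ins.position <= del.position) return { ...ins };
  if (ins.position >= end) return { ...ins, position: ins.position - del.length };
  // The insertion point was inside the deleted range: collapse to its start.
  return { ...ins, position: del.position };
}

function apply(doc, op) {
  if (op.type === 'insert') {
    return doc.slice(0, op.position) + op.text + doc.slice(op.position);
  }
  return doc.slice(0, op.position) + doc.slice(op.position + op.length);
}

const base = 'hello world';
const aliceDelete = { type: 'delete', position: 0, length: 5 }; // removes "hello"
const bobInsert = { type: 'insert', position: 2, text: 'XX' };  // types inside "hello"

// Path 1: Bob's insert is applied first, then Alice's transformed delete.
// The pieces are applied right to left so earlier positions stay valid.
let path1 = apply(base, bobInsert);
for (const piece of transformDeleteAgainstInsert(aliceDelete, bobInsert).reverse()) {
  path1 = apply(path1, piece);
}

// Path 2: Alice's delete is applied first, then Bob's transformed insert.
const path2 = apply(apply(base, aliceDelete), transformInsertAgainstDelete(bobInsert, aliceDelete));

console.log(path1, '|', path2); // both are "XX world"
```

Both paths keep Bob's text and remove Alice's range, which is the 'no loss' outcome the paragraph above describes.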
ConflictResolution.systemdesign
// io.thecodeforge - SystemDesign tutorial
// Using character IDs to avoid position shifts
// Each character gets a unique ID (e.g., UUID)
// Operations reference character IDs
class Char {
  constructor(id, value) {
    this.id = id;
    this.value = value;
  }
}
class Document {
  constructor() {
    this.chars = []; // ordered list of Char objects
  }
  applyInsert(afterCharId, newChar) {
    // afterCharId === null means "insert at the start"
    const idx = this.chars.findIndex(c => c.id === afterCharId);
    if (afterCharId !== null && idx === -1) return; // anchor was deleted: treat as a no-op
    this.chars.splice(idx + 1, 0, newChar);
  }
  applyDelete(charId) {
    const idx = this.chars.findIndex(c => c.id === charId);
    if (idx === -1) return; // already deleted; splice(-1, 1) would remove the last character
    this.chars.splice(idx, 1);
  }
}
// Transformation becomes simpler: no position shifting
// But you need to handle the case where the referenced character was deleted
// Solution: if charId not found, the operation is a no-op (or transformed to insert at end)
function transformInsert(op1, op2) {
  // op1 and op2 are concurrent inserts of the form { afterCharId, newChar, userId }
  // If they reference the same afterCharId, break the tie on user ID so that
  // transforming in either direction produces the same final order
  if (op1.afterCharId === op2.afterCharId && op1.userId > op2.userId) {
    // op2 wins the tie: op1 now inserts after op2's char
    return { ...op1, afterCharId: op2.newChar.id };
  }
  // Otherwise, no transformation needed
  return op1;
}
Output
(No direct output; demonstrates character-ID approach)
Interview Gold: Character IDs vs Positions
Character IDs avoid the 'shifting index' problem. This is a common interview question: 'How do you handle concurrent inserts at the same position?' Answer: assign each character a unique ID and reference that, not a numeric index, then break ties deterministically (e.g., by user ID).
Figure: OT Conflict Resolution Flow (thecodeforge.io)
Cursors and Selections: The UX Nightmare
Showing remote cursors in real time is deceptively hard. Each client broadcasts its cursor position (character ID) on every movement. The server broadcasts these to other clients. But if the document changes, the cursor position must be transformed. For example, if Alice's cursor is at character ID 'abc' and Bob deletes that character, Alice's cursor should move to the next valid character. This requires the server to transform cursor positions against operations. The naive approach—send absolute position—breaks when the document changes. Instead, send the character ID of the character before the cursor (anchor). When an operation deletes that anchor, the cursor moves to the next character. Production gotcha: if you don't transform cursors, users will see cursors floating in the wrong place. I've seen a demo where a cursor ended up outside the document because the anchor was deleted and the client didn't handle it.
CursorSync.systemdesign
// io.thecodeforge - SystemDesign tutorial
// Cursor representation: { anchorCharId: string, focusCharId: string }
// anchorCharId is the character before the cursor (or null if at start)
// focusCharId is for selection (same as anchor for a plain cursor)
// doc is the Document (ordered chars array) as it was BEFORE the operation is applied
function transformCursor(cursor, operation, doc) {
  if (operation.type !== 'delete') {
    // Inserts never invalidate an ID-based anchor, so the cursor is unchanged
    return cursor;
  }
  // If the operation deletes a referenced character, move to the next valid one;
  // if there is none, fall back to the previous character (the new end of the document)
  const idx = doc.chars.findIndex(c => c.id === operation.charId);
  const replacement = idx === -1 ? null : doc.chars[idx + 1] || doc.chars[idx - 1] || null;
  const remap = id => (id === operation.charId ? (replacement ? replacement.id : null) : id);
  // Return a new object instead of mutating the caller's cursor
  return {
    anchorCharId: remap(cursor.anchorCharId),
    focusCharId: remap(cursor.focusCharId),
  };
}
// Broadcast cursor updates at most every 50ms to avoid flooding
// Use a separate WebSocket channel for cursor updates (lower priority)
Output
(No direct output; cursor transformation logic)
Never Do This: Broadcast Cursor on Every Mouse Move
You'll saturate the network. Throttle to 20 updates per second max. Use a separate low-priority channel so cursor updates don't block document edits.
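A throttle for this fits in a few lines. The sketch below keeps only the newest unsent position and takes the clock as a parameter so it can be tested without real timers; `createCursorThrottle`, `send` and `now` are names chosen for illustration.

```javascript
// Wraps a send function so it fires at most once per `intervalMs`.
// `now` is injectable so the behaviour can be tested without real timers.
function createCursorThrottle(send, intervalMs = 50, now = Date.now) {
  let lastSentAt = -Infinity;
  let latest = null; // most recent cursor that has not been sent yet

  return {
    update(cursor) {
      const t = now();
      if (t - lastSentAt >= intervalMs) {
        lastSentAt = t;
        latest = null;
        send(cursor);
      } else {
        latest = cursor; // keep only the newest position, drop the rest
      }
    },
    // Call from a timer (or on blur) so the final position is not lost.
    flush() {
      if (latest !== null) {
        lastSentAt = now();
        send(latest);
        latest = null;
      }
    },
  };
}
```

Dropping intermediate positions is safe here because a cursor update replaces the previous one; that is not true of document operations, which is one more reason to keep the two on separate channels.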
Persistence and Recovery: Surviving Crashes
The server must make every operation durable before it treats that operation as committed. If the server crashes, it replays the operation log to reconstruct document state. But what about operations that were broadcast but not yet persisted? Clients will have applied them, but the recovered server won't know about them. One solution: clients acknowledge operations, and the server marks an operation as committed only after receiving acknowledgements from all connected clients. On recovery, the server requests missing operations from clients. This is the 'optimistic replication' pattern. Production gotcha: if you persist each operation with its own synchronous database write, latency spikes. Buffer operations in a write-ahead log (WAL) and flush in groups, every 10ms or every 100 operations, whichever comes first. On crash, replay the WAL. The trade-off is explicit: anything still in the buffer at crash time is gone from the server, so either hold the broadcast until the flush completes or rely on the client-side recovery described above. I've seen a system that persisted every operation synchronously to PostgreSQL, and latency went from 10ms to 200ms. The fix was a WAL with async flush.
PersistenceRecovery.systemdesign
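// io.thecodeforge - SystemDesign tutorial
// Write-Ahead Log (WAL) for operation persistence
// db is an assumed storage client exposing insertOperations(ops) and getOperations(documentId)
class WAL {
  constructor(db) {
    this.db = db;
    this.buffer = [];
    this.flushInterval = setInterval(() => this.flush(), 10); // flush every 10ms
  }
  append(operation) {
    this.buffer.push(operation);
    if (this.buffer.length >= 100) {
      this.flush();
    }
  }
  flush() {
    if (this.buffer.length === 0) return;
    // Take the batch out of the buffer; restore it if the write fails
    const batch = this.buffer;
    this.buffer = [];
    try {
      // Write batch to disk (e.g., append to file or DB)
      this.db.insertOperations(batch);
    } catch (err) {
      // Put the batch back in order so the next flush retries it
      this.buffer = batch.concat(this.buffer);
    }
  }
  recover(documentId) {
    // Load all operations from DB and replay (apply is assumed to be defined elsewhere)
    const ops = this.db.getOperations(documentId);
    let doc = "";
    for (const op of ops) {
      doc = apply(doc, op);
    }
    return doc;
  }
  close() {
    // Stop the timer and write whatever is still buffered
    clearInterval(this.flushInterval);
    this.flush();
  }
}
// On server start:
// 1. Recover document state from WAL
// 2. Connect to clients
// 3. For each client, compare last acknowledged version with server version
// 4. Request missing operations from clients if server is behind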
Output
(No direct output; WAL pattern)
Senior Shortcut: Use a Dedicated WAL Service
Don't embed WAL in your app server. Use a separate service (e.g., Apache BookKeeper) that can handle high throughput and provides durability guarantees. This decouples persistence from business logic.
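The acknowledgement scheme from this section reduces to tracking, per client, the highest version it has confirmed. The sketch below is one way to compute the commit point; `AckTracker` and its method names are hypothetical, and it assumes versions start at 1 with 0 meaning "nothing acknowledged yet".

```javascript
class AckTracker {
  constructor() {
    this.acked = new Map(); // clientId -> highest version that client has acknowledged
  }

  addClient(clientId, version = 0) {
    this.acked.set(clientId, version);
  }

  removeClient(clientId) {
    this.acked.delete(clientId);
  }

  ack(clientId, version) {
    const previous = this.acked.get(clientId);
    if (previous === undefined) return; // unknown or disconnected client
    // Acks can arrive out of order, so never move a client backwards.
    this.acked.set(clientId, Math.max(previous, version));
  }

  // Highest version acknowledged by every connected client.
  committedVersion() {
    if (this.acked.size === 0) return 0;
    return Math.min(...this.acked.values());
  }
}
```

One design consequence: a single slow client holds the commit point back for everyone, so a real system would probably pair this with a timeout that drops unresponsive clients.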
Scaling to Millions of Documents: Sharding and Caching
Google Docs handles billions of documents. The key is sharding by document ID. Each document's operations are stored on a single shard. The shard also handles the OT logic. This keeps the operation history local. For hot documents (e.g., a popular spreadsheet), you can replicate the shard and use a primary-replica pattern: all writes go to primary, reads can go to replicas. But replicas must apply operations in the same order—use the sequencer's version number. Caching: cache the document state (compiled from operations) in memory. Invalidate on new operation. For cold documents, load from persistent storage. Production gotcha: if you cache the compiled state, you must ensure it's consistent with the operation log. Use a version number that increments on each operation. On cache miss, rebuild from log and cache the result.
ShardingCaching.systemdesign
// io.thecodeforge - SystemDesign tutorial
// Shard assignment: hash(documentId) % numShards
// Each shard is a separate process or container
// LRUCache is assumed to be an LRU map with has/get/set/delete (e.g., from the lru-cache package)
class DocumentShard {
  constructor(shardId) {
    this.shardId = shardId;
    this.documents = new Map(); // documentId -> { state, version, operationLog }
    this.cache = new LRUCache({ max: 10000 }); // cache compiled state
  }
  getOperationLog(documentId) {
    const entry = this.documents.get(documentId);
    return entry ? entry.operationLog : [];
  }
  getDocumentState(documentId) {
    if (this.cache.has(documentId)) {
      return this.cache.get(documentId);
    }
    // Rebuild from operation log (apply is assumed to be defined elsewhere)
    const log = this.getOperationLog(documentId);
    let state = "";
    for (const op of log) {
      state = apply(state, op);
    }
    this.cache.set(documentId, state);
    return state;
  }
  handleOperation(documentId, operation) {
    // Apply operation, update log, increment version
    // Invalidate cache for this document
    this.cache.delete(documentId);
    // Persist operation
    // Broadcast to clients
  }
}
// Load balancer routes requests based on document ID hash
Output
(No direct output; sharding pattern)
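The `hash(documentId) % numShards` routing rule is short enough to spell out. This sketch uses an FNV-1a style hash purely because it fits in a few lines; the hash choice and the `shardFor` name are assumptions, not something the design depends on.

```javascript
// FNV-1a style 32-bit hash over UTF-16 code units; chosen only because it is short.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep the value an unsigned 32-bit integer
  }
  return hash;
}

// The same document ID always maps to the same shard.
function shardFor(documentId, numShards) {
  return fnv1a(documentId) % numShards;
}

console.log(shardFor('doc123', 16) === shardFor('doc123', 16)); // true
```

Plain modulo routing remaps most documents whenever `numShards` changes, so if you expect to resize the fleet often, consistent hashing or a lookup table is worth considering.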
Interview Gold: Hot Document Problem
What happens when a document goes viral (e.g., a shared Google Doc with 10k concurrent editors)? The shard becomes a bottleneck. Solution: split the document into sections (e.g., paragraphs) and shard by section. Each section has its own operation log. Google has not documented whether Docs does this internally, so present it as your design rather than as a known fact.
Offline Support and Conflict Resolution After Reconnect
Users expect to edit offline and sync later. This is a hard problem: the client accumulates operations locally. On reconnect, it sends them to the server. The server must transform these operations against any concurrent operations that happened while the client was offline. This is the same OT problem, but with a large batch. The server processes the client's operations in order, transforming each against the server's history since the client's last version; each later client operation has to be transformed against server operations that were themselves adjusted for the earlier client operations. If conflicts are detected (e.g., both client and server edited the same word), the server's version wins (or you can use a merge UI). A common policy is 'last writer wins' for simple conflicts, with complex ones flagged to the user. Production gotcha: if the client was offline for a long time, the transformation may produce unexpected results. Limit offline duration (e.g., 30 days) and force a full sync after that.
OfflineSync.systemdesign
// io.thecodeforge - SystemDesign tutorial
// Client-side offline operation queue
// applyLocal, applyServer, transform, serverOps and serverVersion are assumed to exist elsewhere
class OfflineQueue {
  constructor() {
    this.queue = [];
    this.lastSyncedVersion = 0;
  }
  addOperation(op) {
    this.queue.push(op);
  }
  async sync(server) {
    // Send all queued operations with lastSyncedVersion
    const response = await server.syncOperations({
      operations: this.queue,
      lastVersion: this.lastSyncedVersion
    });
    // The queued ops are already applied locally, so only apply the server ops
    // the client missed; the server returns them transformed against the queue
    for (const op of response.serverOps) {
      applyLocal(op);
    }
    this.queue = [];
    this.lastSyncedVersion = response.newVersion;
  }
}
// Server-side sync handler
function handleSync(clientOps, clientVersion) {
  // Server ops the client has not seen, kept rewritten relative to the client's state
  let missed = serverOps.slice(clientVersion);
  const transformedOps = [];
  for (const op of clientOps) {
    let transformed = op;
    const nextMissed = [];
    for (const serverOp of missed) {
      // Transform both sides so the next client op sees consistent server ops
      nextMissed.push(transform(serverOp, transformed));
      transformed = transform(transformed, serverOp);
    }
    missed = nextMissed;
    // Apply to server (assumed to append to serverOps and bump serverVersion)
    applyServer(transformed);
    transformedOps.push(transformed);
  }
  // transformedOps are broadcast to the other clients; serverOps go back to this client
  return { transformedOps, serverOps: missed, newVersion: serverVersion };
}
Output
(No direct output; offline sync pattern)
The Classic Bug: Offline Queue Overflow
If the client is offline for days, the queue can grow to millions of operations. This causes memory pressure and slow sync. Implement a max queue size (e.g., 10k ops) and force a full document download if exceeded.
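A bounded queue that flags the overflow instead of growing without limit might look like this. The class name, the `planSync` method and the `'snapshot'` / `'incremental'` mode labels are hypothetical; the 10k default mirrors the figure above.

```javascript
class BoundedOfflineQueue {
  constructor(maxOps = 10000) {
    this.maxOps = maxOps;
    this.queue = [];
    this.needsFullSync = false;
  }

  // Returns false once the budget is exhausted so the caller can warn the
  // user or switch the editor to read-only instead of silently losing edits.
  add(op) {
    if (this.queue.length >= this.maxOps) {
      this.needsFullSync = true;
      return false;
    }
    this.queue.push(op);
    return true;
  }

  // Decide what to do on reconnect.
  planSync() {
    return this.needsFullSync
      ? { mode: 'snapshot' }
      : { mode: 'incremental', operations: this.queue };
  }
}
```

What happens to the user's unsynced text in snapshot mode is a product decision (for example, saving it as a local copy); the sketch only makes the overflow visible.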
When Not to Use OT: CRDTs as an Alternative
OT requires a central server for ordering. If you need peer-to-peer collaboration (no central server), CRDTs are a better fit. CRDTs guarantee convergence without a central coordinator by using commutative operations. However, CRDTs have larger metadata overhead (each character carries a unique ID and a tombstone for deletions). For a document editor, OT is simpler and more efficient when you have a server. Use CRDTs only if you need offline-first with no server or if you're building a decentralized app. Production gotcha: CRDTs can cause unbounded metadata growth if you don't implement garbage collection for tombstones. I've seen a CRDT-based editor where deleted characters accumulated and the document size grew 10x. The fix was periodic tombstone compaction.
CRDTvsOT.systemdesign
// io.thecodeforge - SystemDesign tutorial
// CRDT approach: each character has a unique ID and a list of 'causally ready' parents
// Insert operation: { charId, value, afterId: [list of IDs that this char should follow] }
// Delete operation: { charId, tombstone: true }
// Convergence: all clients apply operations in any order, result is the same
// because inserts are commutative (they specify explicit ordering via afterId)
// Example: two clients insert 'x' and 'y' after the same character 'a'
// Client A: insert { charId: 'x', afterId: ['a'] }
// Client B: insert { charId: 'y', afterId: ['a'] }
// Both clients will have 'axy' or 'ayx' depending on tie-breaker (e.g., charId comparison)
// But both will converge to the same order because tie-breaker is deterministic
// OT would require a server to order these operations; CRDT doesn't.
Output
(No direct output; CRDT vs OT comparison)
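To see the convergence claim in running code, here is a deliberately small sequence CRDT. It is a simplified tree-ordered design, not the algorithm used by Automerge or Yjs; the `clock` field (a Lamport-style counter), the single `afterId` instead of a list, and the virtual `'root'` marker are assumptions added for the sketch.

```javascript
class SequenceCRDT {
  constructor() {
    this.nodes = new Map();    // charId -> { id, value, parentId, clock }
    this.deleted = new Set();  // tombstones; kept separately so a delete may arrive first
  }

  // 'root' is a virtual marker for the start of the document.
  insert({ charId, value, afterId = 'root', clock }) {
    if (this.nodes.has(charId)) return; // duplicate delivery is a no-op
    this.nodes.set(charId, { id: charId, value, parentId: afterId, clock });
  }

  remove(charId) {
    this.deleted.add(charId);
  }

  text() {
    // Group characters under the character they were inserted after.
    const children = new Map();
    for (const node of this.nodes.values()) {
      if (!children.has(node.parentId)) children.set(node.parentId, []);
      children.get(node.parentId).push(node);
    }
    // Siblings: newest clock first, then ID as a deterministic tie-breaker.
    const bySibling = (a, b) => b.clock - a.clock || (a.id < b.id ? -1 : a.id > b.id ? 1 : 0);
    let out = '';
    const visit = (parentId) => {
      for (const node of (children.get(parentId) || []).sort(bySibling)) {
        if (!this.deleted.has(node.id)) out += node.value;
        visit(node.id); // characters typed after this one follow it
      }
    };
    visit('root');
    return out;
  }
}

const ops = [
  (doc) => doc.insert({ charId: 'a', value: 'a', clock: 1 }),
  (doc) => doc.insert({ charId: 'x', value: 'x', afterId: 'a', clock: 2 }),
  (doc) => doc.insert({ charId: 'y', value: 'y', afterId: 'a', clock: 2 }),
  (doc) => doc.remove('a'),
];

const replicaA = new SequenceCRDT();
const replicaB = new SequenceCRDT();
ops.forEach((op) => op(replicaA));                    // delivery order: a, x, y, delete a
[...ops].reverse().forEach((op) => op(replicaB));     // delivery order: delete a, y, x, a

console.log(replicaA.text(), replicaB.text()); // "xy" on both replicas
```

Note that the tombstone for 'a' and its node both stay in memory after the delete, which is the metadata growth the section warns about; compaction would have to prove no replica can still reference 'a' before dropping it.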
Interview Gold: OT vs CRDT Trade-offs
OT: simpler metadata, requires server, lower storage overhead. CRDT: no server needed, higher metadata, eventual consistency. Choose OT for server-based apps like Google Docs; choose CRDT for peer-to-peer or offline-first apps like Notion's offline mode.
Figure: OT vs CRDT for Collaborative Editing (thecodeforge.io)
Production incident · POST-MORTEM · severity: high
The 4GB Container That Kept Dying
Symptom
Server OOM-killed every 30 minutes under 100 concurrent editors. No obvious memory leak in heap dumps.
Assumption
Thought it was a memory leak in the OT transformation cache.
Root cause
The operation history buffer stored every operation since session start, never compacted. With 100 users typing at 5 ops/sec, history grew 500 ops/sec. After 30 minutes: 900k operations, each ~4KB JSON → 3.6GB. The OOM killer fired.
Fix
Implemented sliding window compaction: keep last 1000 operations per document, archive older ones to disk with a Bloom filter for conflict checks. Memory dropped to 200MB.
Key lesson
Always bound operation history.
Unbounded history is a memory bomb waiting to explode.
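The sliding-window fix can be sketched as a small class. `BoundedHistory`, the `archive` callback and the 1-based version numbering are assumptions for illustration; the Bloom filter from the real fix is left out.

```javascript
class BoundedHistory {
  // `archive` is an assumed callback that persists evicted operations (e.g. to disk).
  constructor(archive, windowSize = 1000) {
    this.archive = archive;
    this.windowSize = windowSize;
    this.ops = [];         // most recent operations, oldest first
    this.firstVersion = 1; // version number of this.ops[0]
  }

  append(op) {
    this.ops.push(op);
    if (this.ops.length > this.windowSize) {
      const evicted = this.ops.splice(0, this.ops.length - this.windowSize);
      this.archive(this.firstVersion, evicted);
      this.firstVersion += evicted.length;
    }
  }

  // Operations after `clientVersion`, or null if they have left the window
  // and the client needs a snapshot (or a read from the archive) instead.
  since(clientVersion) {
    if (clientVersion + 1 < this.firstVersion) return null;
    return this.ops.slice(clientVersion + 1 - this.firstVersion);
  }
}
```

With the numbers from the incident, 1000 operations at roughly 4KB each is about 4MB of history per document instead of 3.6GB, at the cost of sending a snapshot to any client that falls more than 1000 operations behind.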
Production debug guide
Systematic recovery paths for the failure modes engineers actually hit. 3 entries
Symptom · 01
Users report document state diverges between clients
→
Fix
1. Check server operation log for missing transformations. 2. Verify OT functions satisfy TP1/TP2 with random test harness. 3. Ensure all clients apply operations in the same order (version number). 4. If using character IDs, check for duplicate IDs.
Symptom · 02
High latency on edits (>1 second)
→
Fix
1. Profile WebSocket message size (JSON vs binary). 2. Check server CPU: OT transformation is O(n) per operation. 3. Reduce broadcast frequency: batch operations. 4. Consider sharding hot documents.
Symptom · 03
Server OOM after hours of operation
→
Fix
1. Check operation history size. 2. Implement compaction: archive old ops, keep sliding window. 3. Profile memory usage of cached document states. 4. Set max cache size with LRU eviction.
Design Google Docs Triage Cheat Sheet
First-response commands for when things go wrong, copy-paste ready.
Document state mismatch between clients: `Error: Version mismatch`
Immediate action
Check server operation log for gaps
Commands
SELECT COUNT(*) FROM operations WHERE document_id = 'doc123' AND version > last_known_version;
Check client last acknowledged version in logs
Fix now
Force client to full sync: send current document state as a snapshot.
High latency on edits: `p95 latency > 500ms`
Immediate action
Check WebSocket message size
Commands
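tcpdump -i eth0 -A port 443 | grep 'op:' | head -100 (only useful where traffic is plaintext, e.g., behind the TLS terminator; on the public interface the payload is encrypted)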
Measure average operation size in bytes
Fix now
Switch to Protocol Buffers for serialization. Reduce operation frequency by batching.
Server OOM: `OutOfMemoryError: Java heap space`
Immediate action
Check operation history size per document
Commands
jmap -histo <pid> | head -20
Check number of operations in memory: SELECT document_id, COUNT(*) FROM operations GROUP BY document_id;
Fix now
Set max operations per document to 1000. Archive older ops to disk.
Cursors jumping erratically: `Cursor position out of bounds`
Immediate action
Check cursor transformation logic
Commands
Enable debug logging for cursor updates
Verify anchor character IDs exist in current document
Fix now
If anchor deleted, move cursor to next valid character. If none, set to end.
| Feature / Aspect | Operational Transformation (OT) | CRDT |
| --- | --- | --- |
| Central coordinator required | Yes (sequencer) | No |
| Metadata overhead | Low (operation log) | High (per-character IDs, tombstones) |
| Conflict resolution | Transform functions | Commutative operations |
| Offline support | Complex (batch transform) | Natural (merge on reconnect) |
| Scalability | Server bottleneck at high concurrency | Better for P2P, but metadata grows |
| Production maturity | Google Docs, Microsoft Office | Automerge, Yjs |
Key takeaways
1. OT requires a central sequencer for total order; without it, clients diverge.
2. Always bound operation history to prevent OOM; use sliding window compaction.
3. Transform cursor positions against operations to avoid floating cursors.
4. CRDTs are overkill for server-based editors; OT is simpler and more efficient.
INTERVIEW PREP · PRACTICE MODE
Interview Questions on This Topic
Q01 of 06 · SENIOR
How does Google Docs handle concurrent edits without data loss?
ANSWER
It uses Operational Transformation (OT) with a central sequencer. Each operation is transformed against concurrent operations before application, ensuring all clients converge to the same state. The server assigns a version number to each operation, guaranteeing total order.
Q02 of 06 · SENIOR
When would you choose CRDTs over OT for a collaborative editor?
ANSWER
Choose CRDTs when you need peer-to-peer collaboration without a central server, or when offline-first is critical (e.g., Notion). OT is simpler and more efficient with a server. For Google Docs-scale, OT is proven; CRDTs have higher metadata overhead.
Q03 of 06 · SENIOR
What happens when a client reconnects after being offline for an hour with 5000 queued operations?
ANSWER
The server transforms each queued operation against all concurrent operations that occurred while offline. This can be CPU-intensive. Mitigation: limit offline queue size (e.g., 10k ops) and force a full snapshot sync if exceeded. Also, transform in batches to reduce overhead.
Q04 of 06 · JUNIOR
What is the purpose of a sequencer in OT-based systems?
ANSWER
The sequencer assigns a unique, monotonically increasing version number to each operation, establishing a total order. This simplifies conflict resolution because all clients apply operations in the same sequence, ensuring convergence.
Q05 of 06 · SENIOR
You notice that after a server crash and recovery, some clients have operations that the server never received. How do you handle this?
ANSWER
Implement an acknowledgement protocol: clients send ACK for each operation. The server marks operations as committed only after receiving ACKs from all clients. On recovery, the server requests missing operations from clients by comparing last committed version. Alternatively, use a WAL that persists before broadcasting.
Q06 of 06 · SENIOR
How would you design a collaborative editor that supports 10,000 concurrent users on a single document?
ANSWER
Shard the document into sections (e.g., paragraphs) and assign each section to a separate shard/process. Each section has its own operation log and OT state. Users editing different sections experience no contention. For cross-section edits (e.g., copy-paste), use a two-phase commit or a global sequencer for those operations.
FAQ · 4 QUESTIONS
Frequently Asked Questions
01. How does Google Docs handle multiple users typing at the same time?
Google Docs uses Operational Transformation (OT). Each keystroke is sent to a central server, which transforms it against concurrent keystrokes from other users. The server then broadcasts the transformed operation to all clients, ensuring everyone sees the same result.
02. What's the difference between OT and CRDT for collaborative editing?
OT requires a central server to order operations; CRDTs work without a server by using commutative operations. OT has lower metadata overhead; CRDTs store per-character IDs and tombstones. Use OT for server-based apps like Google Docs; use CRDTs for peer-to-peer or offline-first apps.
03. How do I implement offline editing in a collaborative document editor?
Queue operations locally with a version number. On reconnect, send the queue to the server, which transforms each operation against concurrent server operations. The server returns transformed operations and any missed server operations. Apply them locally to sync.
04. What happens if the server crashes in the middle of processing an operation?
If the operation was not persisted, it's lost. Clients may have applied it optimistically. Use a write-ahead log (WAL) to persist operations before broadcasting. On recovery, replay the WAL and reconcile with clients by comparing last acknowledged versions.