Senior 5 min · June 25, 2026

Design Google Docs: Real-Time Collaborative Editing at Scale

Q: How does Google Docs handle multiple users typing at the same time?

Google Docs uses Operational Transformation (OT). Each keystroke is sent to a central server, which transforms it against concurrent keystrokes from other users. The server then broadcasts the transformed operation to all clients, ensuring everyone sees the same result.

Q: What's the difference between OT and CRDT for collaborative editing?

OT requires a central server to order operations; CRDTs work without a server by using commutative operations. OT has lower metadata overhead; CRDTs store per-character IDs and tombstones. Use OT for server-based apps like Google Docs; use CRDTs for peer-to-peer or offline-first apps.

Q: How do I implement offline editing in a collaborative document editor?

Queue operations locally along with the last server version the client saw. On reconnect, send the queue to the server, which transforms each queued operation against the server operations the client missed. The server returns those missed operations, transformed against the client's queue. The client applies them locally to sync; it does not re-apply its own queued operations.

Q: What happens if the server crashes in the middle of processing an operation?

If the operation was not persisted, it's lost. Clients may have applied it optimistically. Use a write-ahead log (WAL) to persist operations before broadcasting. On recovery, replay the WAL and reconcile with clients by comparing last acknowledged versions.

Learn how to design Google Docs with operational transformation, conflict resolution, and real-time sync.

Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Written from production experience, not tutorials.

✓ Production

production tested

June 25, 2026

last updated

1,663

articles · all by Naren

⚡ Quick Answer

The core challenge is handling concurrent edits from multiple users without data loss. Use Operational Transformation (OT) to transform each operation against concurrent ones, ensuring all clients converge to the same document state. Google Docs uses OT with a central server for ordering.

✦ Definition · ~90s read

What is Design Google Docs?

Design Google Docs refers to the system architecture behind real-time collaborative document editing, enabling multiple users to edit simultaneously with conflict resolution via Operational Transformation (OT) or Conflict-Free Replicated Data Types (CRDTs).

Plain-English First

Imagine a group of people writing on a whiteboard with markers. If two people try to write at the same spot, you need a rule to decide whose text goes where. Google Docs is like having a referee who catches every marker stroke, reorders them, and tells everyone the final result so no one's work gets erased.

You think building Google Docs is just WebSocket + CRDT? I've seen that assumption crater a startup's demo when two users typed the same word and the document turned into a jumble of 'helloworldhello'. Real-time collaboration is a distributed systems problem in disguise. The naive approach—send diffs and hope—loses data under load. This article walks you through the architecture that powers Google Docs: Operational Transformation, conflict resolution, and the production gotchas that'll kill your latency SLA. By the end, you'll be able to design a collaborative editor that survives concurrent edits, network partitions, and 3 AM pager duty.

Why Operational Transformation? The Problem with Naive Sync

Before OT, collaborative editors used lock-step or diff-merge. Lock-step blocks users, which is unacceptable for real-time editing. Diff-merge loses context: if Alice inserts 'A' at position 0 and Bob inserts 'B' at position 0 at the same time, each client applies the other's edit at a stale index. Alice ends up with 'BA', Bob ends up with 'AB', and the replicas have diverged. OT solves this by transforming each operation against concurrent ones so every replica reaches the same result regardless of arrival order. The key insight: operations are functions that can be composed and transformed. Without OT, you get data corruption under concurrent edits. I've seen a production system where two users edited the same paragraph and the server applied both operations without transformation. The result: half the paragraph vanished. The fix was implementing OT with a central sequencer.

OperationalTransform.systemdesignSYSTEMDESIGN

// io.thecodeforge — System Design tutorial

// Simplified OT for insert operations
// Assume document is a string, operations are {type, position, text, userId}

function transform(op1, op2) {
  // op1 and op2 are concurrent operations made against the same document state
  // Return transformed op1' such that applying op2 then op1' is equivalent to op1 then op2'
  if (op1.type === 'insert' && op2.type === 'insert') {
    if (op1.position < op2.position) {
      // op1 is left of op2, so op2's insert does not move it
      return { ...op1 };
    } else if (op1.position > op2.position) {
      // op1 shifts right by length of op2.text
      return { ...op1, position: op1.position + op2.text.length };
    } else {
      // Same position: use tie-breaker (lower user ID goes first)
      if (op1.userId < op2.userId) {
        return { ...op1 };
      } else {
        return { ...op1, position: op1.position + op2.text.length };
      }
    }
  }
  // Delete, replace, etc. need their own cases; fail loudly instead of returning undefined
  throw new Error(`transform not implemented for ${op1.type}/${op2.type}`);
}

// Apply an insert operation to a string document
function apply(doc, op) {
  return doc.slice(0, op.position) + op.text + doc.slice(op.position);
}

// Example:
let doc = "hello";
let opA = { type: 'insert', position: 0, text: 'x', userId: 'A' };
let opB = { type: 'insert', position: 0, text: 'y', userId: 'B' };

// Server receives opA first, so opB must be transformed against opA:
let opBPrime = transform(opB, opA); // position becomes 1 ('A' < 'B', so opA's text goes first)
// Apply opA then opBPrime:
doc = apply(doc, opA); // "xhello"
doc = apply(doc, opBPrime); // "xyhello"
// The other order converges: transform(opA, opB) leaves opA at position 0,
// so applying opB ("yhello") and then opA' also gives "xyhello".
// Without a deterministic tie-breaker, the two orders can diverge ("xyhello" vs "yxhello").
// Convergence requires the transform to satisfy TP1 (and TP2 if there is no central ordering).
console.log(doc); // "xyhello"; real OT must also handle deletes, ranges, and formatting.

Output

xyhello

Production Trap: Non-Convergent Transformations

If your OT functions don't satisfy the transformation properties, clients will diverge. TP1 is always required; TP2 also matters when sites can apply concurrent operations in different orders, which a central sequencer avoids. Test with random concurrent operations and verify the final state is identical on every replica. I've seen a team spend weeks debugging 'ghost characters' because their delete transformation didn't handle overlapping ranges.
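The convergence check described above can be sketched as a small harness. This is a minimal illustration under stated assumptions: it covers insert operations only, uses the same tie-break rule as the example earlier in this section (lower user ID goes first), and enumerates every pair of positions on a short document instead of sampling randomly. The names `transformInsert`, `applyInsert`, and `findTp1Violations` are hypothetical.

```javascript
// Insert-only transform with a deterministic tie-breaker (lower userId goes first).
function transformInsert(op, against) {
  const shifts =
    against.position < op.position ||
    (against.position === op.position && against.userId < op.userId);
  return shifts
    ? { ...op, position: op.position + against.text.length }
    : { ...op };
}

function applyInsert(doc, op) {
  return doc.slice(0, op.position) + op.text + doc.slice(op.position);
}

// TP1: applying a then transform(b, a) must equal applying b then transform(a, b).
function findTp1Violations(doc) {
  const violations = [];
  for (let i = 0; i <= doc.length; i++) {
    for (let j = 0; j <= doc.length; j++) {
      const a = { type: 'insert', position: i, text: 'x', userId: 'A' };
      const b = { type: 'insert', position: j, text: 'yy', userId: 'B' };
      const viaA = applyInsert(applyInsert(doc, a), transformInsert(b, a));
      const viaB = applyInsert(applyInsert(doc, b), transformInsert(a, b));
      if (viaA !== viaB) violations.push({ a, b, viaA, viaB });
    }
  }
  return violations;
}

console.log(findTp1Violations('hello').length); // 0
```

Once delete and replace cases exist, the same loop extends to them; that is where overlapping-range bugs tend to show up.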

Figure: Real-Time Collaborative Editing at Scale

Central Server Architecture: The Sequencer Pattern

Google Docs uses a central server that sequences all operations. Each client sends operations to the server, which assigns each one a monotonically increasing version number (a per-document counter is safer than a wall-clock timestamp, which can collide or go backwards). The server transforms each incoming operation against the operations the sending client had not yet seen, applies it, and broadcasts the transformed version to all clients. This guarantees total order and simplifies conflict resolution. The downside: single point of failure and latency bottleneck. But for a document editor, the consistency guarantees are worth it. Without a sequencer, you need a distributed consensus protocol (like Raft) which adds complexity. For most use cases, a central server with a standby replica is fine. The classic rookie mistake: not handling server restarts. If the server crashes and loses the operation log, clients will have divergent states. Persist the operation log to a database (e.g., PostgreSQL with WAL) before broadcasting.

SequencerPattern.systemdesignSYSTEMDESIGN

// io.thecodeforge — System Design tutorial

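// Server-side operation handler (pseudocode)
// Assumes transform(), apply(), persistOperation(), and broadcast() are defined elsewhere

class DocumentServer {
  constructor() {
    this.version = 0; // number of operations applied so far
    this.operations = []; // operations[i] takes the document from version i to i + 1; persisted to DB
    this.document = "";
  }

  // Calls must be serialized per document (e.g., a per-document queue) to preserve total order
  async handleOperation(clientOp, clientVersion) {
    // clientVersion is the last server version the client had applied when it created clientOp
    if (clientVersion < 0 || clientVersion > this.version) {
      throw new Error(`invalid client version ${clientVersion}`);
    }
    // Transform clientOp against every operation the client has not seen yet
    let transformedOp = clientOp;
    for (let i = clientVersion; i < this.version; i++) {
      transformedOp = transform(transformedOp, this.operations[i]);
    }
    // Apply to server document and record it in the log
    this.document = apply(this.document, transformedOp);
    this.operations.push(transformedOp);
    this.version = this.operations.length;
    // Persist before broadcasting so a crash cannot lose an operation clients already applied
    await persistOperation(this.version, transformedOp);
    // Broadcast to all clients; version is the document version after this operation
    broadcast({ op: transformedOp, version: this.version });
  }
}

// Client-side: send operation with last known version
// On receiving broadcast, apply op and set local version to the broadcast version
// If broadcast version > local version + 1, request missing ops from server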

Output

(No direct output; pattern for server logic)

Senior Shortcut: Batching Operations

To reduce server load, batch operations from the same client every 50-100ms. Send a list of ops with the client's version. The server transforms the batch as a unit. How much this saves depends on typing rate: a client that produces 10 ops per window sends one WebSocket message instead of ten.
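A client-side batcher for the shortcut above might look like the following. It is a sketch, not a drop-in: `OpBatcher`, the `send` callback, and the message shape `{ baseVersion, ops }` are assumptions made for illustration, and the 50ms default simply takes the low end of the range mentioned above.

```javascript
// Collects operations and sends them as one message per flush window.
class OpBatcher {
  constructor(send, { windowMs = 50 } = {}) {
    this.send = send;          // hypothetical transport callback, e.g. wraps socket.send
    this.windowMs = windowMs;
    this.pending = [];
    this.baseVersion = null;   // server version the first op in the batch was based on
    this.timer = null;
  }

  add(op, clientVersion) {
    if (this.pending.length === 0) this.baseVersion = clientVersion;
    this.pending.push(op);
    // Start the window on the first op; later ops ride along in the same batch.
    if (this.timer === null) {
      this.timer = setTimeout(() => this.flush(), this.windowMs);
    }
  }

  flush() {
    if (this.timer !== null) {
      clearTimeout(this.timer);
      this.timer = null;
    }
    if (this.pending.length === 0) return;
    const batch = { baseVersion: this.baseVersion, ops: this.pending };
    this.pending = [];
    this.send(batch);
  }
}
```

Calling `flush()` directly is also useful before the page unloads, so the last partial batch is not dropped.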

Conflict Resolution: Handling Concurrent Edits

When two users edit the same word simultaneously, OT transforms both operations so they apply without loss. But edge cases abound: what if Alice deletes a range that Bob inserts into? The delete operation must be transformed to account for the insert, typically by splitting it around the inserted text. The standard approach has two phases: first transform the incoming operation against the history its sender had not seen, then apply it. For complex edits (e.g., formatting), you need to track character positions with a position index that updates after each operation. An alternative that avoids the 'shifting index' problem is to give each character a unique ID, so operations reference characters by ID rather than by position. Production gotcha: if your transformation functions don't satisfy TP1, you'll get different results depending on operation order. Always test with a random concurrent workload.

ConflictResolution.systemdesignSYSTEMDESIGN

// io.thecodeforge — System Design tutorial

// Using character IDs to avoid position shifts
// Each character gets a unique ID (e.g., UUID)
// Operations reference character IDs

class Char {
  constructor(id, value) {
    this.id = id;
    this.value = value;
  }
}

class Document {
  constructor() {
    this.chars = []; // ordered list of Char objects
  }

  // Returns true if the insert was applied
  applyInsert(afterCharId, newChar) {
    if (afterCharId === null) {
      // null means "insert at the start of the document"
      this.chars.unshift(newChar);
      return true;
    }
    const idx = this.chars.findIndex(c => c.id === afterCharId);
    if (idx === -1) return false; // referenced character was deleted: caller decides (no-op or insert at end)
    this.chars.splice(idx + 1, 0, newChar);
    return true;
  }

  // Returns true if a character was removed
  applyDelete(charId) {
    const idx = this.chars.findIndex(c => c.id === charId);
    if (idx === -1) return false; // already deleted; splice(-1, 1) would remove the last character
    this.chars.splice(idx, 1);
    return true;
  }
}

// Transformation becomes simpler: no position shifting
// But you need to handle the case where the referenced character was deleted
// Solution: if charId not found, the operation is a no-op (or transformed to insert at end)

function transformInsert(op1, op2) {
  // op1 and op2 are concurrent inserts of the form { afterCharId, newChar, userId }
  // If they reference the same afterCharId, use a tie-breaker: lower user ID goes first
  if (op1.afterCharId === op2.afterCharId && op1.userId > op2.userId) {
    return { ...op1, afterCharId: op2.newChar.id }; // op1 now inserts after op2's char
  }
  // Otherwise, no transformation needed
  return op1;
}

Output

(No direct output; demonstrates character-ID approach)

Interview Gold: Character IDs vs Positions

Character IDs avoid the 'shifting index' problem. This is a common interview question: 'How do you handle concurrent inserts at the same position?' Answer: assign each character a unique ID and reference that, not a numeric index, then break ties deterministically (e.g., by user ID).
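For editors that stay with numeric positions, the delete-versus-insert edge case raised in this section (Alice deletes a range that Bob inserts into) can be sketched as follows. The operation shapes `{ type: 'delete', position, length }` and `{ type: 'insert', position, text }` and the helper names are assumptions for illustration. The function returns a list because an insert inside the range splits the delete in two.

```javascript
// Transform a range delete so it can be applied after a concurrent insert.
function transformDeleteAgainstInsert(del, ins) {
  const delEnd = del.position + del.length;
  if (ins.position <= del.position) {
    // Insert landed at or before the range: shift the whole range right.
    return [{ ...del, position: del.position + ins.text.length }];
  }
  if (ins.position >= delEnd) {
    // Insert landed after the range: nothing to adjust.
    return [{ ...del }];
  }
  // Insert landed strictly inside the range: delete the halves around it.
  const leftLength = ins.position - del.position;
  const left = { ...del, length: leftLength };
  const right = {
    ...del,
    position: ins.position + ins.text.length,
    length: del.length - leftLength,
  };
  // Right half first, so the left half's position is still valid when it runs.
  return [right, left];
}

// Minimal string apply for both operation types, used to check the result.
function applyOp(doc, op) {
  if (op.type === 'insert') {
    return doc.slice(0, op.position) + op.text + doc.slice(op.position);
  }
  return doc.slice(0, op.position) + doc.slice(op.position + op.length);
}

const ins = { type: 'insert', position: 3, text: 'XY' };
const del = { type: 'delete', position: 1, length: 3 }; // removes "bcd" from "abcdef"
let doc = applyOp('abcdef', ins);                        // "abcXYdef"
for (const part of transformDeleteAgainstInsert(del, ins)) {
  doc = applyOp(doc, part);
}
console.log(doc); // "aXYef"
```

Bob's inserted text survives and all of Alice's deleted characters are gone, which is the intent-preserving outcome. The mirror case (transforming the insert against the delete) needs its own function.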

Figure: OT Conflict Resolution Flow

Cursors and Selections: The UX Nightmare

Showing remote cursors in real time is deceptively hard. Each client broadcasts its cursor position (a character ID) as it moves, and the server relays these to other clients. But if the document changes, the cursor position must be transformed. For example, if Alice's cursor is anchored to character ID 'abc' and Bob deletes that character, Alice's cursor should re-anchor to the nearest surviving character before it, so it stays at the same visual spot. This requires transforming cursor positions against operations. The naive approach of sending an absolute position breaks when the document changes. Instead, send the character ID of the character before the cursor (the anchor). When an operation deletes that anchor, the cursor re-anchors to the previous character, or to the start of the document if there is none. Production gotcha: if you don't transform cursors, users will see cursors floating in the wrong place. I've seen a demo where a cursor ended up outside the document because the anchor was deleted and the client didn't handle it.

CursorSync.systemdesignSYSTEMDESIGN

// io.thecodeforge — System Design tutorial

// Cursor representation: { anchorCharId: string | null, focusCharId: string | null }
// anchorCharId is the character before the cursor (or null if at start)
// focusCharId is the other end of a selection (same as anchor for a plain cursor)

// charsBeforeOp: the document's ordered Char list as it was before `operation` was applied
function transformCursor(cursor, operation, charsBeforeOp) {
  if (operation.type !== 'delete') {
    // Inserts never invalidate an ID-based anchor, so the cursor is unchanged
    return cursor;
  }
  // If the deleted character was an endpoint, fall back to the character before it
  // (null if it was the first character) so the cursor stays at the same visual spot
  const idx = charsBeforeOp.findIndex(c => c.id === operation.charId);
  const previousId = idx > 0 ? charsBeforeOp[idx - 1].id : null;
  const remap = id => (id === operation.charId ? previousId : id);
  // Return a new object instead of mutating the caller's cursor
  return {
    anchorCharId: remap(cursor.anchorCharId),
    focusCharId: remap(cursor.focusCharId)
  };
}

// Broadcast cursor updates at most every 50ms to avoid flooding
// Use a separate WebSocket channel for cursor updates (lower priority)

Output

(No direct output; cursor transformation logic)

Never Do This: Broadcast Cursor on Every Mouse Move

You'll saturate the network. Throttle to 20 updates per second max. Use a separate low-priority channel so cursor updates don't block document edits.
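A throttle for cursor broadcasts could be sketched like this. `createCursorThrottle` is a hypothetical helper; the `now` and `schedule` options are injected only so the behavior can be exercised without real timers, and the 50ms default matches the 20-updates-per-second ceiling above. The first update goes out immediately, and the latest position seen during the quiet window goes out when the window ends.

```javascript
// Leading + trailing throttle for cursor updates.
function createCursorThrottle(
  send,
  { intervalMs = 50, now = Date.now, schedule = setTimeout } = {}
) {
  let lastSent = -Infinity; // time of the most recent send
  let latest = null;        // most recent cursor not yet sent
  let timerPending = false;

  function flush() {
    timerPending = false;
    if (latest === null) return;
    send(latest);
    latest = null;
    lastSent = now();
  }

  return function update(cursor) {
    latest = cursor; // intermediate positions are overwritten, not queued
    const wait = intervalMs - (now() - lastSent);
    if (wait <= 0) {
      flush();
    } else if (!timerPending) {
      timerPending = true;
      schedule(flush, wait);
    }
  };
}
```

Only the newest position is kept, so a burst of 200 mouse-move events inside one window costs at most two messages.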

Persistence and Recovery: Surviving Crashes

The server should persist every operation before broadcasting it. If the server crashes, it replays the operation log to reconstruct document state. The dangerous case is an operation that was broadcast but never persisted: clients have applied it, but the recovered server doesn't know it exists. To detect this, clients acknowledge operations and the server tracks each client's last acknowledged version. On recovery, the server compares those versions with its own and requests any missing operations from clients. This is the 'optimistic replication' pattern. Production gotcha: if you commit each operation to the database synchronously, latency spikes. Use a write-ahead log (WAL) with batched, asynchronous flushes instead. The WAL is flushed every 10ms or every 100 operations, whichever comes first, so a crash can lose at most one unflushed batch, which the acknowledgement check recovers from clients. On crash, replay the WAL. I've seen a system that persisted every operation synchronously to PostgreSQL, and latency went from 10ms to 200ms. The fix was a WAL with async flush.

PersistenceRecovery.systemdesignSYSTEMDESIGN

// io.thecodeforge — System Design tutorial

// Write-Ahead Log (WAL) for operation persistence

class WAL {
  constructor(db) {
    // db is the storage client, assumed to expose insertOperations() and getOperations()
    this.db = db;
    this.buffer = [];
    this.flushInterval = setInterval(() => this.flush(), 10); // flush every 10ms
  }

  append(operation) {
    this.buffer.push(operation);
    if (this.buffer.length >= 100) {
      return this.flush();
    }
  }

  async flush() {
    if (this.buffer.length === 0) return;
    // Swap the buffer first so operations appended during the write are not lost
    const batch = this.buffer;
    this.buffer = [];
    // Write batch to disk (e.g., append to file or DB)
    await this.db.insertOperations(batch);
  }

  async recover(documentId) {
    // Load all operations from DB (assumed ordered by version) and replay
    const ops = await this.db.getOperations(documentId);
    let doc = "";
    for (const op of ops) {
      doc = apply(doc, op);
    }
    return doc;
  }

  // Stop the timer and write out whatever is still buffered
  close() {
    clearInterval(this.flushInterval);
    return this.flush();
  }
}

// On server start:
// 1. Recover document state from WAL
// 2. Connect to clients
// 3. For each client, compare last acknowledged version with server version
// 4. Request missing operations from clients if server is behind

Output

(No direct output; WAL pattern)

Senior Shortcut: Use a Dedicated WAL Service

Don't embed WAL in your app server. Use a separate service (e.g., Apache BookKeeper) that can handle high throughput and provides durability guarantees. This decouples persistence from business logic.
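The version comparison done on server start can be sketched as a pure function. Assumptions for illustration: versions are operation counts (version N means N operations applied), `clientAcks` maps each client ID to the last version that client applied, and the action names are made up.

```javascript
// Decide, per client, what has to happen after the server recovers from the WAL.
function planRecovery(serverVersion, clientAcks) {
  const plan = [];
  for (const [clientId, acked] of clientAcks) {
    if (acked > serverVersion) {
      // Client applied operations that never reached disk: ask it to resend them.
      plan.push({ clientId, action: 'resend', fromVersion: serverVersion, toVersion: acked });
    } else if (acked < serverVersion) {
      // Client missed operations the server has: send them.
      plan.push({ clientId, action: 'catch-up', fromVersion: acked, toVersion: serverVersion });
    } else {
      plan.push({ clientId, action: 'in-sync' });
    }
  }
  return plan;
}

const plan = planRecovery(120, new Map([['alice', 123], ['bob', 118], ['carol', 120]]));
console.log(plan.map((p) => `${p.clientId}:${p.action}`).join(' '));
// alice:resend bob:catch-up carol:in-sync
```

If several clients are ahead of the server, they should all hold the same lost operations, so the server only needs them from one client and can treat the rest as confirmation.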

Scaling to Millions of Documents: Sharding and Caching

Google Docs handles billions of documents. The key is sharding by document ID. Each document's operations are stored on a single shard. The shard also handles the OT logic. This keeps the operation history local. For hot documents (e.g., a popular spreadsheet), you can replicate the shard and use a primary-replica pattern: all writes go to the primary, reads can go to replicas. But replicas must apply operations in the same order, so use the sequencer's version number. Caching: cache the document state (compiled from operations) in memory, and update or invalidate it on every new operation. Invalidating forces a full rebuild from the log on the next read, which is expensive for hot documents, so updating in place is usually the better default. For cold documents, load from persistent storage. Production gotcha: if you cache the compiled state, you must ensure it's consistent with the operation log. Store a version number with the cached state and increment it on each operation. On cache miss, rebuild from the log and cache the result.

ShardingCaching.systemdesignSYSTEMDESIGN

// io.thecodeforge — System Design tutorial

// Shard assignment: hash(documentId) % numShards
// Each shard is a separate process or container
// LRUCache is assumed to be an LRU map with get/set/delete (e.g., the lru-cache npm package)
// getOperationLog(), appendToOperationLog(), and apply() are assumed to exist elsewhere

class DocumentShard {
  constructor(shardId) {
    this.shardId = shardId;
    this.cache = new LRUCache({ max: 10000 }); // documentId -> { state, version }
  }

  loadDocument(documentId) {
    const cached = this.cache.get(documentId);
    if (cached) return cached;
    // Cache miss: rebuild from operation log
    const log = this.getOperationLog(documentId);
    let state = "";
    for (const op of log) {
      state = apply(state, op);
    }
    const entry = { state, version: log.length };
    this.cache.set(documentId, entry);
    return entry;
  }

  getDocumentState(documentId) {
    return this.loadDocument(documentId).state;
  }

  handleOperation(documentId, operation) {
    const doc = this.loadDocument(documentId);
    // Persist operation before touching the cache
    this.appendToOperationLog(documentId, operation);
    // Update the cached state in place so hot documents are not rebuilt on every edit
    this.cache.set(documentId, {
      state: apply(doc.state, operation),
      version: doc.version + 1
    });
    // Broadcast to clients
  }
}

// Load balancer routes requests based on document ID hash

Output

(No direct output; sharding pattern)

Interview Gold: Hot Document Problem

What happens when a document goes viral (e.g., a shared document with 10k concurrent editors)? The shard becomes a bottleneck. Solution: split the document into sections (e.g., paragraphs) and shard by section. Each section has its own operation log. The trade-off is that edits spanning two sections, such as a selection that crosses a paragraph boundary, now need coordination between logs.
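The routing rule `hash(documentId) % numShards`, extended to the per-section split described above, might be sketched like this. The hash is 32-bit FNV-1a computed over UTF-16 code units, which is one reasonable choice and not something the article prescribes; `shardFor` and `shardForSection` are hypothetical names. Note that plain modulo remaps most keys when `numShards` changes.

```javascript
// 32-bit FNV-1a over the string's UTF-16 code units.
function fnv1a(str) {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by FNV prime, keep unsigned 32-bit
  }
  return hash >>> 0;
}

// Route a whole document to one shard.
function shardFor(documentId, numShards) {
  return fnv1a(documentId) % numShards;
}

// Route a single section of a hot document; sections of one document can land on different shards.
function shardForSection(documentId, sectionId, numShards) {
  return fnv1a(`${documentId}:${sectionId}`) % numShards;
}

console.log(shardFor('doc123', 16) === shardFor('doc123', 16)); // true
```

Routing must be deterministic across every load balancer instance, which is why the hash is computed from the ID alone and not from any per-process state.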

Offline Support and Conflict Resolution After Reconnect

Users expect to edit offline and sync later. This is a hard problem: the client accumulates operations locally. On reconnect, it sends them to the server. The server must transform these operations against any concurrent operations that happened while the client was offline. This is the same OT problem, but with a large batch. The server processes the client's operations in order, transforming each against the server operations the client missed, and transforms those missed operations against the client's batch so the client can apply them. For plain text, OT keeps both sides' edits. For changes that cannot both apply (e.g., two different styles on the same word), you need a policy: a common choice is 'last writer wins' for simple conflicts and flagging complex ones to the user (or a merge UI). Production gotcha: if the client was offline for a long time, the transformation may produce unexpected results. Limit offline duration (e.g., 30 days) and force a full sync after that.

OfflineSync.systemdesignSYSTEMDESIGN

// io.thecodeforge — System Design tutorial

// Client-side offline operation queue
class OfflineQueue {
  constructor() {
    this.queue = [];
    this.lastSyncedVersion = 0;
  }

  addOperation(op) {
    this.queue.push(op);
  }

  async sync(server) {
    // Send all queued operations with lastSyncedVersion
    const response = await server.syncOperations({
      operations: this.queue,
      lastVersion: this.lastSyncedVersion
    });
    // The client already applied its own queued operations locally, so it only applies
    // the server operations it missed, which the server has transformed against the queue
    for (const op of response.missedOps) {
      applyLocal(op);
    }
    this.queue = [];
    this.lastSyncedVersion = response.newVersion;
  }
}

// Server-side sync handler
// Assumes serverOps is the operation log and applyServer(op) applies op and appends it to serverOps
function handleSync(clientOps, clientVersion) {
  // Server operations the client has not seen, captured before any client op is applied
  let missedOps = serverOps.slice(clientVersion);
  for (const op of clientOps) {
    let transformed = op;
    const rebased = [];
    for (const serverOp of missedOps) {
      // Transform both sides: the server op moves past the client op, and vice versa
      rebased.push(transform(serverOp, transformed));
      transformed = transform(transformed, serverOp);
    }
    // The next client op was created after this one, so it sees the rebased server ops
    missedOps = rebased;
    // Apply to server
    applyServer(transformed);
  }
  // missedOps now holds the missed server operations, transformed against the whole client batch
  return { missedOps, newVersion: serverOps.length };
}

Output

(No direct output; offline sync pattern)

The Classic Bug: Offline Queue Overflow

If the client is offline for days, the queue can grow to millions of operations. This causes memory pressure and slow sync. Implement a max queue size (e.g., 10k ops) and force a full document download if exceeded.
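A bounded queue for the overflow case above could be sketched like this. `BoundedOfflineQueue` and its `needsFullSync` flag are hypothetical names, and the 10k default comes from the example limit above. What to do with edits made after the limit is a product decision the sketch does not make; it only refuses them and reports that a full download is required.

```javascript
// Offline queue that gives up on incremental sync once it grows past maxOps.
class BoundedOfflineQueue {
  constructor(maxOps = 10000) {
    this.maxOps = maxOps;
    this.queue = [];
    this.needsFullSync = false;
  }

  // Returns true if the op was queued, false if the limit has been exceeded.
  addOperation(op) {
    if (this.needsFullSync) return false;
    if (this.queue.length >= this.maxOps) {
      // Too large to replay: drop the backlog and require a full document download.
      this.queue = [];
      this.needsFullSync = true;
      return false;
    }
    this.queue.push(op);
    return true;
  }

  // Call after the client has downloaded a fresh copy of the document.
  resetAfterFullSync() {
    this.queue = [];
    this.needsFullSync = false;
  }
}
```

Because `addOperation` returns `false` once the limit is hit, the UI has a hook for warning the user before more unsynced work piles up.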

When Not to Use OT: CRDTs as an Alternative

OT requires a central server for ordering. If you need peer-to-peer collaboration (no central server), CRDTs are a better fit. CRDTs guarantee convergence without a central coordinator by using commutative operations. However, CRDTs have larger metadata overhead (each character carries a unique ID and a tombstone for deletions). For a document editor, OT is simpler and more efficient when you have a server. Use CRDTs only if you need offline-first with no server or if you're building a decentralized app. Production gotcha: CRDTs can cause unbounded metadata growth if you don't implement garbage collection for tombstones. I've seen a CRDT-based editor where deleted characters accumulated and the document size grew 10x. The fix was periodic tombstone compaction.

CRDTvsOT.systemdesignSYSTEMDESIGN

// io.thecodeforge — System Design tutorial

// CRDT approach: each character has a unique ID and records the ID of the character it was inserted after
// Insert operation: { charId, value, afterId }
// Delete operation: { charId, tombstone: true }

// Convergence: all clients apply operations in any causally valid order, result is the same
// because inserts are commutative (they specify explicit ordering via afterId)

// Example: two clients insert 'x' and 'y' after the same character 'a'
// Client A: insert { charId: 'x', afterId: 'a' }
// Client B: insert { charId: 'y', afterId: 'a' }
// Both clients will have 'axy' or 'ayx' depending on tie-breaker (e.g., charId comparison)
// But both will converge to the same order because tie-breaker is deterministic

// OT would require a server to order these operations; CRDT doesn't.

Output

(No direct output; CRDT vs OT comparison)

Interview Gold: OT vs CRDT Trade-offs

OT: simpler metadata, requires server, lower storage overhead. CRDT: no server needed, higher metadata, eventual consistency. Choose OT for server-based apps like Google Docs; choose CRDT for peer-to-peer or offline-first apps like Notion's offline mode.
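To make the CRDT side of the comparison concrete, here is a minimal sequence CRDT in the style of RGA (Replicated Growable Array). It is a sketch under assumptions, not a production design: IDs are `[counter, siteId]` pairs behaving like Lamport timestamps (a new character's counter is higher than any the site has seen), operations are delivered after the operations they depend on, and the class and method names are made up.

```javascript
// Compare [counter, siteId] IDs: by counter first, then by site ID.
function compareIds(a, b) {
  if (a[0] !== b[0]) return a[0] - b[0];
  return a[1] < b[1] ? -1 : a[1] > b[1] ? 1 : 0;
}

class SequenceCrdt {
  constructor() {
    this.elements = []; // { id, value, deleted }, deleted ones stay as tombstones
  }

  insert({ id, value, afterId }) {
    if (this.elements.some((e) => compareIds(e.id, id) === 0)) return; // duplicate delivery
    let index = 0;
    if (afterId !== null) {
      const parent = this.elements.findIndex((e) => compareIds(e.id, afterId) === 0);
      if (parent === -1) throw new Error('parent not delivered yet');
      index = parent + 1;
    }
    // Skip past elements with a greater ID so every replica picks the same slot.
    while (index < this.elements.length && compareIds(this.elements[index].id, id) > 0) {
      index++;
    }
    this.elements.splice(index, 0, { id, value, deleted: false });
  }

  delete(id) {
    const el = this.elements.find((e) => compareIds(e.id, id) === 0);
    if (el) el.deleted = true; // tombstone: kept so later inserts can still anchor to it
  }

  text() {
    return this.elements.filter((e) => !e.deleted).map((e) => e.value).join('');
  }
}

// Two replicas receive the same concurrent inserts in opposite orders.
const base = { id: [1, 'S'], value: 'a', afterId: null };
const insX = { id: [2, 'A'], value: 'x', afterId: [1, 'S'] };
const insY = { id: [2, 'B'], value: 'y', afterId: [1, 'S'] };
const r1 = new SequenceCrdt();
const r2 = new SequenceCrdt();
[base, insX, insY].forEach((op) => r1.insert(op));
[base, insY, insX].forEach((op) => r2.insert(op));
console.log(r1.text(), r2.text()); // ayx ayx
```

The tombstones are the metadata cost named in the trade-off above: `delete` never shrinks `elements`, so without compaction the array only grows.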

Figure: OT vs CRDT for Collaborative Editing

● Production incidentPOST-MORTEMseverity: high

The 4GB Container That Kept Dying

Symptom

Server OOM-killed every 30 minutes under 100 concurrent editors. No obvious memory leak in heap dumps.

Assumption

Thought it was a memory leak in the OT transformation cache.

Root cause

The operation history buffer stored every operation since session start, never compacted. With 100 users typing at 5 ops/sec, history grew 500 ops/sec. After 30 minutes: 900k operations, each ~4KB JSON → 3.6GB. The OOM killer fired.

Fix

Implemented sliding window compaction: keep last 1000 operations per document, archive older ones to disk with a Bloom filter for conflict checks. Memory dropped to 200MB.

Key lesson

Always bound operation history.
Unbounded history is a memory bomb waiting to explode.
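The sliding-window idea from the fix can be sketched as follows. One simplification is swapped in and should be read as such: instead of archiving old operations to disk with a Bloom filter, this version folds them into an in-memory snapshot, which is enough to show how the window stays bounded. `BoundedHistory` and `applyInsertOp` are hypothetical names.

```javascript
// Keeps the last maxOps operations and folds older ones into a snapshot.
class BoundedHistory {
  constructor(applyOp, { maxOps = 1000, initialState = '' } = {}) {
    this.applyOp = applyOp;
    this.maxOps = maxOps;
    this.snapshot = initialState; // state after every compacted operation
    this.snapshotVersion = 0;     // number of operations folded into the snapshot
    this.ops = [];                // most recent operations, oldest first
  }

  get version() {
    return this.snapshotVersion + this.ops.length;
  }

  append(op) {
    this.ops.push(op);
    while (this.ops.length > this.maxOps) {
      this.snapshot = this.applyOp(this.snapshot, this.ops.shift());
      this.snapshotVersion++;
    }
  }

  // Operations a client at clientVersion still needs, or null if it is
  // behind the window and must download the full document instead.
  opsSince(clientVersion) {
    if (clientVersion < this.snapshotVersion) return null;
    return this.ops.slice(clientVersion - this.snapshotVersion);
  }

  currentState() {
    return this.ops.reduce((state, op) => this.applyOp(state, op), this.snapshot);
  }
}

// Example apply function for string inserts.
function applyInsertOp(doc, op) {
  return doc.slice(0, op.position) + op.text + doc.slice(op.position);
}
```

With this shape, memory is proportional to `maxOps` plus one document snapshot, regardless of how long the session runs.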

Production debug guide

Systematic recovery paths for the failure modes engineers actually hit. 3 entries.

Symptom 01: Users report document state diverges between clients

Fix
1. Check server operation log for missing transformations.
2. Verify OT functions satisfy TP1/TP2 with random test harness.
3. Ensure all clients apply operations in the same order (version number).
4. If using character IDs, check for duplicate IDs.

Symptom 02: High latency on edits (>1 second)

Fix
1. Profile WebSocket message size (JSON vs binary).
2. Check server CPU: OT transformation is O(n) per operation.
3. Reduce broadcast frequency: batch operations.
4. Consider sharding hot documents.

Symptom 03: Server OOM after hours of operation

Fix
1. Check operation history size.
2. Implement compaction: archive old ops, keep sliding window.
3. Profile memory usage of cached document states.
4. Set max cache size with LRU eviction.

★ Design Google Docs Triage Cheat Sheet

First-response commands for when things go wrong.

Document state mismatch between clients: `Error: Version mismatch`

Immediate action

Check server operation log for gaps

Commands

SELECT COUNT(*) FROM operations WHERE document_id = 'doc123' AND version > last_known_version;

Check client last acknowledged version in logs

Fix now

Force client to full sync: send current document state as a snapshot.

High latency on edits: `p95 latency > 500ms`

Server OOM: `OutOfMemoryError: Java heap space`

Cursors jumping erratically: `Cursor position out of bounds`

Feature / Aspect	Operational Transformation (OT)	CRDT
Central coordinator required	Yes (sequencer)	No
Metadata overhead	Low (operation log)	High (per-character IDs, tombstones)
Conflict resolution	Transform functions	Commutative operations
Offline support	Complex (batch transform)	Natural (merge on reconnect)
Scalability	Server bottleneck at high concurrency	Better for P2P, but metadata grows
Production maturity	Google Docs, Microsoft Office	Automerge, Yjs

Key takeaways

OT requires a central sequencer for total order; without it, clients diverge.

Always bound operation history to prevent OOM; use sliding window compaction.

Transform cursor positions against operations to avoid floating cursors.

CRDTs are overkill for server-based editors; OT is simpler and more efficient.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 (Senior): How does Google Docs handle concurrent edits without data loss?

Answer: It uses Operational Transformation (OT) with a central sequencer. Each operation is transformed against concurrent operations before application, ensuring all clients converge to the same state. The server assigns a version number to each operation, guaranteeing total order.

Q02 (Senior): When would you choose CRDTs over OT for a collaborative editor?

Q03 (Senior): What happens when a client reconnects after being offline for an hour wi...

Q04 (Junior): What is the purpose of a sequencer in OT-based systems?

Q05 (Senior): You notice that after a server crash and recovery, some clients have ope...

Q06 (Senior): How would you design a collaborative editor that supports 10,000 concurr...
