System Design: Collaborative Document Editor (Google Docs-scale)
Design a Google Docs-scale collaborative document editor supporting real-time multi-user editing with conflict resolution. Deep dive into Operational Transformation (OT) and CRDT approaches, presence awareness, and persistent change history.
Requirements
Functional Requirements:
- Multiple users edit the same document simultaneously with real-time conflict-free merging
- Rich text editing: bold, italic, headers, lists, tables, inline images
- Presence: see other users' cursors and selections in real time
- Complete revision history: view any past version; restore to any checkpoint
- Comments and threaded replies anchored to document ranges
- Offline editing: changes made offline are merged when connectivity is restored
Non-Functional Requirements:
- Sub-100ms latency for local keystrokes to appear in the editor (local-first)
- Convergence: all collaborators reach the same document state within 2 seconds
- Support documents up to 10MB of content with 50 simultaneous editors
- Revision history retained for 30 days on free tier, indefinitely on paid
- 99.9% availability: document unavailability blocks editing and risks losing unsaved work
Scale Estimation
At Google Docs scale: 1B documents, 50M DAU, an average of 5 active editors per document during collaboration. Operation rate: a fast typist generates ~6 operations/second (characters typed + cursor moves). 50M users × 1% actively typing at peak = 500k concurrent typists = 3M operations/second. Each operation is ~100-200 bytes, so ~600MB/second of operation data at peak. Revision storage: 3M ops/second × 200 bytes × 86,400 seconds ≈ 52TB of raw operations/day (heavily compacted to snapshots in practice).
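The back-of-envelope numbers above can be reproduced directly; all inputs are the figures quoted in the text, not measured data:

```python
# Back-of-envelope check of the scale estimates above.
dau = 50_000_000
typing_fraction = 0.01      # 1% actively typing at peak
ops_per_typist = 6          # ops/second for a fast typist
bytes_per_op = 200          # upper bound of the 100-200 byte range

typists = int(dau * typing_fraction)                    # concurrent typists
ops_per_sec = typists * ops_per_typist                  # peak operation rate
bandwidth_mb = ops_per_sec * bytes_per_op / 1e6         # MB/second of op data
daily_tb = ops_per_sec * bytes_per_op * 86_400 / 1e12   # TB/day, uncompacted

print(typists, ops_per_sec, round(bandwidth_mb), round(daily_tb))
```

Note that the daily figure assumes the peak rate is sustained all day, which is why the text treats ~50TB/day as a loose upper bound before compaction.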
High-Level Architecture
The system is built around two complementary concerns: a Collaboration Engine for real-time multi-user editing and a Document Storage Service for persistence and history. The Collaboration Engine runs on a Document Server that maintains in-memory document state for all actively edited documents. Clients connect via WebSocket to their document's server and send operations. The server applies operations using a concurrency control algorithm and broadcasts them to all collaborators.
Concurrency Control — OT vs. CRDT: Two dominant approaches exist:
Operational Transformation (OT) — the approach used by Google Docs. Each operation is transformed against concurrent operations before application to ensure convergence. For text, Insert(pos, char) and Delete(pos) operations are transformed pair-wise when they are concurrent. The server is the central authority: it assigns a global operation order (revision number) and broadcasts transformed operations to clients. A client receiving a remote operation transforms it against its own unacknowledged local operations before applying it. OT is well understood for plain text but complex to implement correctly for rich text (embedded objects, tables).
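A minimal sketch of the pairwise transform for plain text, assuming operations are simple tuples (`("ins", pos, char)` / `("del", pos)`); a production transform would also need a deterministic tie-break (e.g. by site ID) for inserts at the same position, which this sketch sidesteps by letting the already-committed operation win:

```python
def transform(op, against):
    """Rewrite `op` so it applies correctly after concurrent `against`."""
    kind, pos = op[0], op[1]
    a_kind, a_pos = against[0], against[1]
    if a_kind == "ins":
        if a_pos <= pos:   # a concurrent insert at/before us shifts us right
            pos += 1
    else:
        if a_pos < pos:    # a concurrent delete before us shifts us left
            pos -= 1
    return (kind, pos) if kind == "del" else (kind, pos, op[2])
```

For example, if user A inserts 'X' at index 0 of "abc" while user B concurrently deletes 'c' at index 2, B's delete is transformed to index 3 on A's replica, and both replicas converge on "Xab".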
CRDTs (Conflict-free Replicated Data Types) — used by Figma and Notion. Operations are designed to commute, so applying them in any order yields the same result. For text, a common choice is a sequence CRDT (Logoot, LSEQ, or Yjs's Y.Text) in which each character carries a globally unique position identifier, making concurrent inserts and deletes naturally conflict-free. CRDTs enable true peer-to-peer collaboration without a central authority and support offline editing natively. The trade-offs are larger operation sizes (position IDs) and potential document growth (deleted characters linger as tombstones).
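A toy sequence CRDT in the Logoot/LSEQ family illustrates the position-identifier idea: each character carries a `(fraction, site_id)` key that is globally unique and totally ordered, so concurrent inserts commute with no transformation step. This is an illustrative sketch (tombstones for deletes are omitted), not Yjs's actual algorithm:

```python
import bisect

class SeqCRDT:
    def __init__(self):
        # Sorted list of ((fraction, site_id), char); the key order IS the
        # document order, independent of arrival order.
        self.chars = []

    def insert(self, left, right, site_id, char):
        """Insert between the neighbour keys' fractions `left` and `right`."""
        pos = (left + right) / 2   # dense order: always room between neighbours
        key = (pos, site_id)       # site_id breaks ties deterministically
        bisect.insort(self.chars, (key, char))
        return key

    def text(self):
        return "".join(c for _, c in self.chars)
```

Because every replica sorts by the same keys, two sites inserting concurrently at the same spot end up in the same order everywhere, regardless of delivery order.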
Google Docs uses OT with a central server; Figma uses CRDTs. For this design we implement a server-authoritative OT model with CRDT-inspired offline support.
Core Components
Document Collaboration Server
A stateful service maintaining in-memory representations of actively edited documents. Each document has an OT state: the current document snapshot at the server's committed revision, and a buffer of unacknowledged operations from clients. When an operation arrives from client A: (1) transform the operation against all server operations since the client's last acknowledged revision (OT transformation); (2) apply the transformed operation to the server document state; (3) assign it a revision number; (4) broadcast the transformed operation to all other clients; (5) persist the operation to the operation log. Concurrent edits from multiple clients are reconciled by the transformation step, which guarantees convergence.
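The five-step pipeline above can be sketched for plain text as follows; `DocServer` and the minimal transform/apply helpers are illustrative, and the broadcast/persist steps are stubbed out:

```python
def transform(op, against):
    # Minimal pairwise OT: an earlier insert shifts us right, an earlier
    # delete shifts us left. Ops are ("ins", pos, char) or ("del", pos).
    kind, pos = op[0], op[1]
    if against[0] == "ins" and against[1] <= pos:
        pos += 1
    elif against[0] == "del" and against[1] < pos:
        pos -= 1
    return (kind, pos, *op[2:])

def apply_op(doc, op):
    if op[0] == "ins":
        return doc[:op[1]] + op[2] + doc[op[1]:]
    return doc[:op[1]] + doc[op[1] + 1:]

class DocServer:
    def __init__(self, snapshot):
        self.doc = snapshot
        self.revision = 0
        self.log = []  # log[i] is the committed op that produced revision i+1

    def receive(self, op, client_revision):
        # (1) transform against everything committed since the client synced
        for committed in self.log[client_revision:]:
            op = transform(op, committed)
        # (2) apply to the authoritative state
        self.doc = apply_op(self.doc, op)
        # (3) assign the next revision number
        self.revision += 1
        self.log.append(op)
        # (4) broadcast to other clients and (5) persist would happen here
        return op, self.revision
```

A client that was at revision 0 when the server is already at revision 1 sees its operation transformed against everything it missed before it is applied.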
Operation Log & Persistence Service
All operations are durably logged to Kafka (for immediate replication) and PostgreSQL (for queryable history). Operations are stored as: {doc_id, revision_number, client_id, operation_type, operation_data JSONB, timestamp}. Periodic snapshots (every 100 operations) capture the full document state, enabling efficient history reconstruction without replaying thousands of operations. Snapshots are stored in S3; the snapshot + operations since the snapshot are loaded when a document server opens a document.
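The cold-start load path described above (latest snapshot plus the tail of the operation log) can be sketched as follows, assuming the log stores simple text operations as `("ins", pos, char)` / `("del", pos)` tuples; function names are illustrative:

```python
SNAPSHOT_INTERVAL = 100  # snapshot every 100 operations, per the design

def apply_op(doc, op):
    # Minimal text apply for illustrative ("ins", pos, char) / ("del", pos) ops
    if op[0] == "ins":
        return doc[:op[1]] + op[2] + doc[op[1]:]
    return doc[:op[1]] + doc[op[1] + 1:]

def load_document(snapshot, snapshot_revision, ops_since_snapshot):
    """Rebuild the current state from a snapshot plus the op-log tail."""
    doc, revision = snapshot, snapshot_revision
    for op in ops_since_snapshot:
        doc = apply_op(doc, op)
        revision += 1
    return doc, revision
```

With snapshots every 100 operations, a server never replays more than 99 operations when it opens a document, regardless of total history length.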
Presence Service
Tracks cursor positions and selections for all active editors. Unlike document operations, presence data is ephemeral and eventually consistent — a cursor position doesn't need strong consistency. Clients send cursor update events to the Presence Service (via WebSocket), which stores the latest position per user in Redis with a short TTL (5 seconds — refreshed by heartbeat). Presence data is broadcast to all editors via a separate low-priority WebSocket channel, decoupled from the operation channel so a presence update never delays an operation delivery.
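An in-memory stand-in for the Redis presence store sketches the TTL-plus-heartbeat behavior; class and field names are illustrative, and production would use Redis key expiry (SET with EX / EXPIRE) instead of a local dict:

```python
import time

PRESENCE_TTL = 5.0  # seconds, refreshed by client heartbeat, per the design

class PresenceStore:
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._cursors = {}  # (doc_id, user_id) -> (cursor_pos, expires_at)

    def heartbeat(self, doc_id, user_id, cursor_pos):
        """Record the latest cursor position and push the expiry forward."""
        expires = self._clock() + PRESENCE_TTL
        self._cursors[(doc_id, user_id)] = (cursor_pos, expires)

    def active_cursors(self, doc_id):
        """Cursors whose TTL has not lapsed -- stale editors silently vanish."""
        now = self._clock()
        return {u: pos for (d, u), (pos, exp) in self._cursors.items()
                if d == doc_id and exp > now}
```

Because stale entries simply expire, a crashed client's cursor disappears within 5 seconds with no explicit disconnect handling.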
Database Design
Documents: documents (doc_id UUID, owner_id, title, created_at, last_modified_at, current_revision INT, snapshot_revision INT, snapshot_s3_key). Operations: doc_operations (doc_id, revision_number, client_id, user_id, operation JSONB, timestamp) — composite PK (doc_id, revision_number) ensures total order per document. Snapshots: doc_snapshots (doc_id, revision_number, s3_key, snapshot_at). Comments: comments (comment_id, doc_id, user_id, anchor_start, anchor_end, content, resolved), comment_replies (reply_id, comment_id, user_id, content).
Document access control: doc_permissions (doc_id, principal_id, principal_type ENUM(user, group, public), role ENUM(viewer, commenter, editor, owner)). The Document Server checks permissions on WebSocket connection upgrade — unauthorized users are disconnected before seeing any document content.
API Design
WebSocket /ws/v1/documents/{docId}/collab — establishes collaboration session; client receives current snapshot + revision; sends {op_type, op_data, client_revision} frames; receives {op_type, op_data, server_revision, user_id} frames from other collaborators.
GET /api/v1/documents/{docId}/history?from_revision=&to_revision= — returns operation log for history view.
POST /api/v1/documents/{docId}/restore?revision={n} — restores document to a past revision (creates a new operation that replaces the current state).
POST /api/v1/documents/{docId}/comments — creates a comment anchored to a document range.
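The WebSocket frames above can be illustrated as JSON payloads; field names follow the API description, while all values are made up:

```python
import json

# Client -> server: an operation made at the client's last known revision
client_frame = {"op_type": "ins",
                "op_data": {"pos": 5, "char": "a"},
                "client_revision": 41}

# Server -> other collaborators: the transformed, committed operation
server_frame = {"op_type": "ins",
                "op_data": {"pos": 6, "char": "a"},
                "server_revision": 42,
                "user_id": "u-123"}

wire = json.dumps(client_frame)  # frames travel as JSON text over the socket
```

Note the position shift between the two frames: the server may have transformed the operation against concurrent edits before broadcasting it.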
Scaling & Bottlenecks
The document server is stateful — a document must be loaded on exactly one server for OT to work correctly (a central authority is required for operation ordering). This is the fundamental scalability constraint: a single document cannot be served by multiple servers simultaneously without complex distributed consensus. Google's solution: documents are sharded across servers by doc_id; a routing layer directs all WebSocket connections for a document to the same server. When a server becomes overloaded, documents are migrated (with a brief handoff pause) to other servers.
Operation log storage grows indefinitely for long-lived documents. Compaction strategies: (1) periodic snapshots reduce the number of operations that need to be replayed on cold start; (2) operations older than 30 days (free tier) are deleted and only snapshots are retained; (3) semantic compression merges consecutive character insertions from the same user into a single operation, reducing log size by 10-50x.
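The semantic-compression step (strategy 3) can be sketched as a single pass that collapses a left-to-right typing run by one user into one insertion; the operation shape (`("ins", pos, text, user)`) is illustrative:

```python
def compact(ops):
    """Merge consecutive same-user insertions at adjacent positions."""
    out = []
    for op in ops:
        if (out and op[0] == "ins" and out[-1][0] == "ins"
                and op[3] == out[-1][3]                        # same user
                and op[1] == out[-1][1] + len(out[-1][2])):    # adjacent pos
            prev = out.pop()
            out.append(("ins", prev[1], prev[2] + op[2], prev[3]))
        else:
            out.append(op)
    return out
```

A burst of single-character inserts from one typist collapses to one log entry, which is where the 10-50x reduction quoted above comes from.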
Key Trade-offs
- OT vs. CRDT: OT requires a central server for operation ordering, making it simpler to implement correctly for complex document types (rich text, tables) but incompatible with true peer-to-peer or offline-first collaboration; CRDTs are naturally offline-capable and decentralized but have larger operation overhead and tombstone accumulation over time.
- Server-authoritative vs. peer-to-peer: Server authority simplifies conflict resolution and provides a single source of truth for history and permissions, but creates a scaling bottleneck (one active server per document) and a single point of failure; P2P scales trivially but requires full CRDT implementation and complicates permission enforcement.
- Fine-grained vs. coarse-grained operations: Character-level OT operations (Insert(5, 'a')) enable precise conflict resolution but generate 3M operations/second at scale; word-level or paragraph-level operations reduce volume but produce coarser conflict resolution that may overwrite user changes.
- Snapshot frequency: Frequent snapshots (every 10 operations) make cold start fast but increase S3 storage and write costs; infrequent snapshots (every 1,000 operations) minimize storage but cause slow cold starts for actively edited documents with long histories.