Key issues fall into consistency, real‑time propagation, storage, scaling, and security.
- WebSocket broadcast only reaches clients on the same server
Problem: Each server only broadcasts to its own connections. Clients connected to other servers only see updates after the 2‑second polling delay. This creates lag, inconsistent views, and lost intermediate states during rapid edits.
Solution:
- Introduce a pub/sub layer (Redis PubSub, Kafka, NATS, or similar).
- When a server receives an edit, it publishes to a document channel; all servers subscribed to that document broadcast to their clients.
Trade-offs:
- Extra infrastructure and operational complexity.
- Pub/sub throughput must scale with edit volume.
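The fan-out described above can be sketched as follows. This is a minimal illustration, not a production design: `Broker` is an in-memory stand-in for a service such as Redis Pub/Sub, and the `AppServer` class, the `doc:<id>` channel naming, and the list-based client inboxes are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class Broker:
    """In-memory stand-in for a pub/sub service (assumption for this sketch)."""

    def __init__(self) -> None:
        self._subs: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, channel: str, handler: Callable[[str], None]) -> None:
        self._subs[channel].append(handler)

    def publish(self, channel: str, message: str) -> None:
        # Deliver to every subscriber, regardless of which server it lives on.
        for handler in list(self._subs[channel]):
            handler(message)


class AppServer:
    """Hypothetical API server holding WebSocket clients as plain inbox lists."""

    def __init__(self, broker: Broker) -> None:
        self.broker = broker
        self.inboxes: DefaultDict[str, List[List[str]]] = defaultdict(list)

    def connect(self, doc_id: str) -> List[str]:
        """Register a client for a document and return its inbox."""
        if doc_id not in self.inboxes:
            # First local client for this document: subscribe once.
            self.broker.subscribe(
                f"doc:{doc_id}", lambda msg, d=doc_id: self._broadcast(d, msg)
            )
        inbox: List[str] = []
        self.inboxes[doc_id].append(inbox)
        return inbox

    def _broadcast(self, doc_id: str, message: str) -> None:
        for inbox in self.inboxes[doc_id]:
            inbox.append(message)

    def receive_edit(self, doc_id: str, edit: str) -> None:
        # Publish rather than broadcast locally; this server receives the
        # edit back through its own subscription like every other server.
        self.broker.publish(f"doc:{doc_id}", edit)
```

In this sketch the sending client also receives its own edit back; a real system would usually tag edits with an origin ID so the sender can skip them.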
- Polling PostgreSQL every 2 seconds
Problem:
- Adds up to 2 seconds of latency to every cross-server update.
- Query load grows with the number of servers and active documents, even when nothing has changed.
- Updates may arrive out of order relative to WebSocket events.
Solution:
- Replace polling with an event stream (Redis Streams, Kafka), Postgres logical replication, or LISTEN/NOTIFY.
Trade-offs:
- Streaming infrastructure adds operational overhead.
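- LISTEN/NOTIFY can struggle at very large scale: each listening server holds a dedicated connection, and NOTIFY payloads must stay under 8000 bytes in the default configuration.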
- Last-write-wins using client timestamps
Problem:
- Client clocks drift.
- Users can manipulate timestamps.
- Simultaneous edits overwrite each other, causing data loss.
Solution options:
- Operational Transformation (OT), the approach used by Google Docs.
- CRDTs (conflict-free replicated data types).
Trade-offs:
- OT: needs a server to order and transform concurrent operations, but operations stay compact.
- CRDT: merges without central coordination, but per-element metadata raises memory and network cost.
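To make the OT option concrete, here is a minimal sketch of the transform step for two concurrent inserts. It covers insert operations only; the `Insert` type and the `site` tie-breaker field are assumptions for the example, and a real OT engine also handles deletes and longer operation histories.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insert:
    pos: int    # index in the document where text is inserted
    text: str
    site: int   # hypothetical client ID, used only to break position ties


def apply_op(doc: str, op: Insert) -> str:
    return doc[:op.pos] + op.text + doc[op.pos:]


def transform(op: Insert, other: Insert) -> Insert:
    """Rewrite `op` so it can be applied after `other` has been applied."""
    other_goes_first = other.pos < op.pos or (
        other.pos == op.pos and other.site < op.site
    )
    if other_goes_first:
        # `other` shifted everything at or after its position to the right.
        return Insert(op.pos + len(other.text), op.text, op.site)
    return op
```

Both clients converge as long as each applies its own operation first and then the transformed remote one: starting from `"abc"`, concurrent inserts of `"X"` (site 1) and `"Y"` (site 2) at position 1 produce `"aXYbc"` on both sides.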
- Race conditions when writing to PostgreSQL
Problem:
- Multiple servers may write edits concurrently.
- With last-write-wins, a later write can overwrite a change that other servers have not yet seen.
Solution:
- Use version numbers or document revision IDs.
- Reject writes if base revision mismatches and merge via OT/CRDT.
Trade-offs:
- Extra conflict resolution logic.
- More complex client state management.
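The revision check can be expressed as a conditional UPDATE. The sketch below uses SQLite from the standard library as a stand-in for PostgreSQL so that it runs anywhere; the `documents` table, its columns, and the function names are assumptions for the example.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical documents table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents (id TEXT PRIMARY KEY, content TEXT, revision INTEGER)"
    )
    conn.execute("INSERT INTO documents VALUES ('d1', 'hello', 0)")
    conn.commit()
    return conn


def write_if_current(
    conn: sqlite3.Connection, doc_id: str, base_revision: int, content: str
) -> bool:
    """Apply the write only if the document is still at base_revision."""
    cur = conn.execute(
        "UPDATE documents SET content = ?, revision = revision + 1 "
        "WHERE id = ? AND revision = ?",
        (content, doc_id, base_revision),
    )
    conn.commit()
    # Zero rows updated means another writer got there first; the caller
    # should reload, merge (for example via OT/CRDT), and retry.
    return cur.rowcount == 1
```

Two writers that both start from revision 0 cannot both succeed: the second call returns `False` instead of silently overwriting the first.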
- Saving full HTML snapshots every 30 seconds
Problem:
- Large write amplification.
- Huge storage cost for long docs.
- Hard to reconstruct exact edit history.
- Race condition if multiple snapshots occur concurrently.
Solution:
- Store incremental operations (edit ops).
- Periodic checkpoints (snapshot + op log).
Trade-offs:
- Loading a document means replaying every op since the last checkpoint, so load time grows with checkpoint age.
- Requires replay and compaction logic.
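A minimal sketch of the checkpoint-plus-op-log idea, assuming insert-only operations represented as `(position, text)` tuples. The `DocumentLog` class and its field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Op = Tuple[int, str]  # (position, inserted text); insert-only for brevity


@dataclass
class DocumentLog:
    checkpoint: str = ""       # last full snapshot
    checkpoint_seq: int = 0    # number of ops already folded into the snapshot
    ops: List[Op] = field(default_factory=list)  # ops recorded since the snapshot

    def append(self, op: Op) -> None:
        self.ops.append(op)

    def materialize(self) -> str:
        """Rebuild the current document by replaying ops over the checkpoint."""
        doc = self.checkpoint
        for pos, text in self.ops:
            doc = doc[:pos] + text + doc[pos:]
        return doc

    def compact(self) -> None:
        """Fold pending ops into a new checkpoint to bound replay cost."""
        self.checkpoint = self.materialize()
        self.checkpoint_seq += len(self.ops)
        self.ops = []
```

Calling `compact()` periodically plays the role of the 30-second snapshot, but the op log in between preserves every intermediate edit.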
- WebSocket connection imbalance due to round‑robin load balancer
Problem:
- WebSockets are long-lived; round-robin spreads new connections evenly but never rebalances existing ones.
- After a restart or scale-out, older servers keep far more connections than new ones.
Solution:
- Use connection-aware load balancing.
- Consistent hashing by document ID or sticky sessions.
Trade-offs:
- Stickiness can reduce flexibility when scaling.
- Rebalancing active sockets is difficult.
- Document editing split across many servers
Problem:
- Users editing the same document may connect to different servers, increasing coordination overhead.
Solution:
- Route document sessions to the same server shard using consistent hashing.
Trade-offs:
- Hot documents may overload a single node.
- Requires shard migration logic.
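Routing by consistent hashing can be sketched with a hash ring and virtual nodes. The `HashRing` class, the node names, and the choice of 100 virtual nodes per server are assumptions for illustration.

```python
import bisect
import hashlib
from typing import Iterable, List, Tuple


class HashRing:
    """Consistent-hash ring mapping document IDs to server names."""

    def __init__(self, nodes: Iterable[str], vnodes: int = 100) -> None:
        ring: List[Tuple[int, str]] = []
        for node in nodes:
            # Several virtual nodes per server smooth out the distribution.
            for i in range(vnodes):
                ring.append((self._hash(f"{node}#{i}"), node))
        ring.sort()
        self._ring = ring
        self._keys = [h for h, _ in ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def node_for(self, doc_id: str) -> str:
        # First virtual node clockwise from the document's hash, wrapping around.
        idx = bisect.bisect_right(self._keys, self._hash(doc_id)) % len(self._keys)
        return self._ring[idx][1]
```

The property that matters for scaling is that adding a server only moves documents onto the new server; assignments among the existing servers do not shuffle.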
- PostgreSQL write bottleneck
Problem:
- Every keystroke becomes a DB write.
- High contention for popular documents.
Solution:
- Buffer edits in memory and batch commits.
- Use append-only event log (Kafka) and persist asynchronously.
Trade-offs:
- Risk of data loss if server crashes before flush.
- Slight durability delay.
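Batching can be as simple as a size-triggered buffer. In this sketch the `EditBuffer` class and the `max_size` default are hypothetical, and the sink callback stands in for a single multi-row database write; a real implementation would also flush on a timer so that a slow typist's edits do not sit in memory indefinitely.

```python
from typing import Callable, List


class EditBuffer:
    """Collects edits in memory and hands them to a sink in batches."""

    def __init__(self, sink: Callable[[List[str]], None], max_size: int = 50) -> None:
        self._sink = sink
        self._max_size = max_size
        self._pending: List[str] = []

    def add(self, edit: str) -> None:
        self._pending.append(edit)
        if len(self._pending) >= self._max_size:
            self.flush()

    def flush(self) -> None:
        """Write all pending edits as one batch; no-op when empty."""
        if not self._pending:
            return
        batch, self._pending = self._pending, []
        self._sink(batch)
```

Anything still in `_pending` when the process dies is lost, which is the data-loss risk listed in the trade-offs.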
- CDN caching API responses for 5 minutes
Problem:
- Document fetch endpoints could serve stale versions.
- Users might load outdated content.
Solution:
- Disable CDN caching for dynamic API responses.
- Or use cache keys with document version.
Trade-offs:
- Reduced CDN offload.
- More origin traffic.
- Redis session cache not used for collaboration state
Problem:
- Each server stores session state locally.
- Failover causes session loss and reconnect storms.
Solution:
- Move presence/session state to Redis or distributed state store.
Trade-offs:
- Extra latency for state access.
- Server crash with in‑memory edits
Problem:
- Edits may be lost if batching or buffering is used.
Solution:
- Write edits first to durable log (Kafka/Redis Stream) before applying.
Trade-offs:
- Slight write latency increase.
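The log-before-apply ordering can be sketched with a local append-only file standing in for Kafka or a Redis Stream. The `WriteAheadLog` class, the JSON-lines format, and the helper names are assumptions for the example.

```python
import json
import os
from typing import Dict, List


class WriteAheadLog:
    """Append-only file that is synced to disk before an edit is applied."""

    def __init__(self, path: str) -> None:
        self._file = open(path, "a", encoding="utf-8")

    def append(self, edit: Dict[str, str]) -> None:
        self._file.write(json.dumps(edit) + "\n")
        self._file.flush()
        os.fsync(self._file.fileno())  # force the write to stable storage

    def close(self) -> None:
        self._file.close()


def apply_edit(log: WriteAheadLog, state: List[str], edit: Dict[str, str]) -> None:
    log.append(edit)            # 1. make the edit durable
    state.append(edit["text"])  # 2. only then update in-memory state


def recover(path: str) -> List[str]:
    """Rebuild in-memory state from the log after a crash."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line)["text"] for line in f if line.strip()]
```

The `fsync` on every edit is where the extra write latency comes from; batching several edits per sync trades some of that latency back for a small loss window.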
- WebSocket reconnect storms
Problem:
- If a node dies, thousands of clients reconnect simultaneously, overwhelming the system.
Solution:
- Exponential backoff reconnect.
- Multi-endpoint WebSocket gateway.
Trade-offs:
- Slight delay before reconnect.
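A client-side backoff schedule might look like the sketch below. It adds random jitter on top of the exponential backoff named above, since clients that all back off by the same fixed amounts would still reconnect in synchronized waves. The function name and the `base` and `cap` defaults are assumptions.

```python
import random


def reconnect_delay(
    attempt: int, base: float = 0.5, cap: float = 30.0, rng=random
) -> float:
    """Seconds to wait before reconnect attempt number `attempt` (0-based).

    The ceiling doubles with each attempt up to `cap`; the actual delay is
    drawn uniformly from [0, ceiling] so clients spread out over time.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)
```

A client would sleep for `reconnect_delay(n)` before its n-th retry and reset `n` to zero after a successful connection.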
- Hot document problem
Problem:
- Large meetings or classes editing the same document can overload a single shard.
Solution:
- Split document into smaller sections or CRDT segments.
- Partition by document section.
Trade-offs:
- Edits that span sections need cross-partition coordination.
- JWT stored in localStorage
Problem:
- Vulnerable to XSS token theft.
Solution:
- Use HttpOnly secure cookies or short-lived tokens + refresh tokens.
Trade-offs:
- Slightly more auth complexity.
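For the cookie option, the server sets the token in a `Set-Cookie` header that page scripts cannot read. The sketch below builds such a header value with Python's standard library; the cookie name `session`, the 15-minute lifetime, and the `SameSite=Strict` policy are assumptions for the example.

```python
from http.cookies import SimpleCookie


def session_cookie_header(token: str, max_age: int = 900) -> str:
    """Build a Set-Cookie header value for a short-lived session token."""
    cookie = SimpleCookie()
    cookie["session"] = token
    morsel = cookie["session"]
    morsel["httponly"] = True       # not readable from JavaScript
    morsel["secure"] = True         # sent over HTTPS only
    morsel["samesite"] = "Strict"   # not sent on cross-site requests
    morsel["max-age"] = max_age     # short lifetime, in seconds
    morsel["path"] = "/"
    return morsel.OutputString()
```

Because the browser attaches cookies automatically, this approach moves the main risk from XSS token theft to CSRF, which is why the `SameSite` attribute is set here.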
- No ordering guarantee for concurrent edits
Problem:
- Network latency may reorder edits across servers.
Solution:
- Use server-assigned sequence numbers per document.
Trade-offs:
- Requires central ordering authority or distributed consensus.
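Per-document sequencing has two halves: the server stamps each edit, and receivers apply edits strictly in stamp order, holding back anything that arrives early. Both classes below are hypothetical, and the in-process lock stands in for whatever single ordering authority the deployment uses.

```python
import threading
from collections import defaultdict
from typing import DefaultDict, Dict, List, Tuple


class Sequencer:
    """Assigns increasing sequence numbers, independently per document."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._last: DefaultDict[str, int] = defaultdict(int)

    def assign(self, doc_id: str, edit: str) -> Tuple[int, str]:
        with self._lock:
            self._last[doc_id] += 1
            return self._last[doc_id], edit


class InOrderReceiver:
    """Applies edits for one document in sequence order, buffering gaps."""

    def __init__(self) -> None:
        self.expected = 1
        self.held: Dict[int, str] = {}
        self.applied: List[str] = []

    def receive(self, seq: int, edit: str) -> None:
        self.held[seq] = edit
        # Drain every edit that is now contiguous with what was applied.
        while self.expected in self.held:
            self.applied.append(self.held.pop(self.expected))
            self.expected += 1
```

If edit 2 arrives before edit 1, the receiver holds it and applies both once edit 1 shows up, so every client ends with the same order.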
- Database read replicas for reads
Problem:
- Replica lag means clients may read outdated document states.
Solution:
- For active documents, read from primary or cache in Redis.
Trade-offs:
- More read load on the primary; a Redis cache must be updated or invalidated on every edit.
- Organization-based partitioning
Problem:
- One large organization could become a hotspot.
Solution:
- Partition by document ID hash instead.
Trade-offs:
- Cross-org queries become harder.
- No presence/awareness system
Problem:
- Without a dedicated presence channel, cursor positions and presence updates travel the same path as edits and can flood the system.
Solution:
- Send ephemeral presence via Redis PubSub without DB writes.
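Trade-offs:
- Presence becomes best-effort: Pub/Sub does not persist messages, so updates sent while a client is disconnected are lost.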
A typical production architecture instead uses:
- WebSocket gateway tier
- Pub/Sub or streaming bus (Kafka/NATS)
- OT or CRDT engine
- Operation log + periodic snapshots
- Redis for presence/state
- Consistent document sharding
- Durable event pipeline
This avoids polling, reduces DB load, and gives every client the same ordered view of each document in real time.