The described architecture for a real-time collaborative document editor has several critical failure modes, race conditions, and scaling bottlenecks that compromise consistency, availability, and scalability—especially under load or in edge network conditions. Below is a detailed breakdown of each issue, followed by specific solutions and their trade-offs.
🔴 1. Inconsistent Real-Time Sync Across Servers (Major Race Condition)
Issue:
Each server maintains its own WebSocket connections and only broadcasts changes to clients connected to it. Other servers poll PostgreSQL every 2 seconds for changes and then broadcast locally.
- Race Condition: A user on Server A makes a change → written to DB → Server B sees it after up to 2 seconds → broadcasts to its clients.
- Result: Clients on different servers see updates with up to 2 seconds of delay, and simultaneous edits can cause conflicts not resolved until after polling delay.
- Worse: If two users on different servers edit the same paragraph at nearly the same time, both changes may be applied locally before either server sees the other’s change → lost updates.
This violates the promise of “real-time” collaboration.
Solution:
Use a distributed pub/sub system (e.g., Redis Pub/Sub, Kafka, or NATS) to synchronize changes instantly across all API servers.
- When Server A receives a change, it:
  - Writes the change to the DB
  - Publishes the change to a Redis channel (e.g., `doc:123:updates`)
- All other servers subscribe to the relevant channels and immediately broadcast to their connected clients (see the sketch below).
✅ Eliminates polling delay → near-instant cross-server sync.
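A minimal sketch of the fan-out path using ioredis; `persistChange` and `broadcastToLocalClients` are assumed application helpers, not library APIs:

```ts
// Sketch only: cross-server fan-out via Redis Pub/Sub (ioredis).
import Redis from "ioredis";
import { randomUUID } from "node:crypto";

// Assumed application helpers (not part of any library):
declare function persistChange(docId: string, change: object): Promise<void>;
declare function broadcastToLocalClients(docId: string, change: object): void;

const pub = new Redis();
const sub = new Redis();        // a subscribing connection must be dedicated
const SERVER_ID = randomUUID(); // lets us skip our own published messages

async function onClientChange(docId: string, change: object) {
  await persistChange(docId, change); // 1. durable write first
  await pub.publish(
    `doc:${docId}:updates`,           // 2. push to peer servers instantly
    JSON.stringify({ origin: SERVER_ID, change })
  );
  broadcastToLocalClients(docId, change); // 3. fan out to local sockets
}

sub.psubscribe("doc:*:updates");
sub.on("pmessage", (_pattern, channel, raw) => {
  const { origin, change } = JSON.parse(raw);
  if (origin === SERVER_ID) return; // already delivered to our own clients
  broadcastToLocalClients(channel.split(":")[1], change);
});
```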
Trade-offs:
- Adds dependency on Redis (availability, durability if using Redis without persistence).
- Requires coordination of channel subscriptions (e.g., scale to 100 servers → 100 subscribers per document).
- Redis Pub/Sub is fire-and-forget → lost messages if a server restarts. Use Redis Streams or Kafka for durability if message loss is unacceptable.
🔴 2. "Last-Write-Wins" with Client Clocks is Fundamentally Unsafe
Issue:
Using client-generated timestamps for conflict resolution is broken due to clock skew.
- Client A (clock fast) edits at 10:00:10 (actual time: 10:00:05)
- Client B (clock slow) edits at 10:00:08 (actual time: 10:00:12)
- Client A's change appears "later" → overwrites B's change, even though B wrote later.
- Result: Lost updates, inconsistent document state.
Solution:
Use server-assigned timestamps or, better yet, Operational Transformation (OT) or Conflict-Free Replicated Data Types (CRDTs).
Option A: Server Timestamp + Version Vectors
- Server assigns timestamp and monotonically increasing version on write.
- Use vector clocks or Lamport timestamps to detect causality.
- Reject or merge concurrent edits based on causal order, not absolute time.
Option B: OT or CRDTs (Recommended)
- CRDTs are ideal for text collaboration (e.g., Yjs, Automerge, or a custom JSON CRDT).
- Changes are commutative, idempotent, and convergent.
- No need for total ordering; all clients eventually converge.
✅ Enables true real-time collaboration with no lost edits.
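To make the convergence property concrete, here is a minimal Yjs sketch: two in-memory replicas edit concurrently, then exchange updates, and end up identical with no server-side ordering or merge logic:

```ts
// Sketch: concurrent edits on two Yjs replicas converge after exchange.
import * as Y from "yjs";

const docA = new Y.Doc();
const docB = new Y.Doc();

// Concurrent edits on disconnected replicas (same field, same position):
docA.getText("body").insert(0, "Hello ");
docB.getText("body").insert(0, "World ");

// Exchange encoded state; the order of application is irrelevant.
const updateA = Y.encodeStateAsUpdate(docA);
const updateB = Y.encodeStateAsUpdate(docB);
Y.applyUpdate(docB, updateA);
Y.applyUpdate(docA, updateB);

// Both replicas now hold the identical merged text.
console.log(docA.getText("body").toString() ===
            docB.getText("body").toString()); // true
```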
Trade-offs:
- CRDTs add complexity to frontend and backend logic.
- Larger payloads (e.g., metadata per character).
- Learning curve; not as widely understood as LWW.
🔴 3. Full HTML Snapshots Every 30 Seconds → Data Loss & Inefficiency
Issue:
Saving entire HTML snapshots every 30 seconds is dangerous:
- If a user types for 29 seconds and the server crashes → 29 seconds of work lost.
- Large payloads → high I/O, network, and storage cost.
- No version history or diffing → can't support undo/redo.
Solution:
- Persist changes incrementally, not snapshots.
- Use delta-based storage (e.g., OT operations or CRDT deltas).
- Store deltas in DB with strong durability (e.g., write-ahead log or Kafka for replay).
- Periodic snapshots can be derived for backup, not primary storage.
✅ Reduces data loss window, supports versioning, undo, and audit trails.
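A sketch of the write/read path under delta storage, using Yjs deltas and node-postgres; the `doc_deltas`/`doc_snapshots` table layout is illustrative, not prescribed:

```ts
// Sketch: append-only delta log with snapshot-based reconstruction.
import * as Y from "yjs";
import { Pool } from "pg";

const db = new Pool(); // connection details from environment

// Append every incoming delta; the data-loss window shrinks to one write.
async function appendDelta(docId: string, seq: number, delta: Uint8Array) {
  await db.query(
    "INSERT INTO doc_deltas (doc_id, seq, delta) VALUES ($1, $2, $3)",
    [docId, seq, Buffer.from(delta)]
  );
}

// Rebuild: load the latest snapshot, then replay deltas recorded after it.
async function loadDocument(docId: string): Promise<Y.Doc> {
  const doc = new Y.Doc();
  const snap = await db.query(
    "SELECT state, seq FROM doc_snapshots WHERE doc_id = $1", [docId]);
  let fromSeq = 0;
  if (snap.rows.length > 0) {
    Y.applyUpdate(doc, snap.rows[0].state);
    fromSeq = snap.rows[0].seq;
  }
  const deltas = await db.query(
    "SELECT delta FROM doc_deltas WHERE doc_id = $1 AND seq > $2 ORDER BY seq",
    [docId, fromSeq]);
  for (const row of deltas.rows) Y.applyUpdate(doc, row.delta);
  return doc;
}
```

The background compactor mentioned in the trade-offs would periodically write the current state into `doc_snapshots` and prune older deltas.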
Trade-offs:
- Increased complexity in storage/querying (need to reconstruct document from deltas).
- May require background job to compact deltas into snapshots.
🔴 4. WebSocket Isolation per Server Breaks Scalability & HA
Issue:
Each server manages its own WebSocket connections → sticky sessions required.
- User must reconnect to the same server → breaks during server restarts, deploys, or scaling.
- Load balancer must support session affinity (e.g., based on cookie or IP), which reduces flexibility.
- If server crashes → all connected clients lose connection → need to reconnect and potentially lose state.
Solution:
Decouple WebSocket connections from data processing:
- Use a dedicated WebSocket gateway (e.g., using Socket.IO with Redis adapter, or a custom gateway with Redis pub/sub).
- Or: Use a message broker (e.g., Kafka, NATS) to decouple ingestion from broadcasting.
✅ Enables horizontal scaling without sticky sessions.
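For the Socket.IO route, the official Redis adapter makes rooms span all servers; a minimal sketch (the `doc:` room naming is an assumption):

```ts
// Sketch: Socket.IO + Redis adapter → a room spans every server,
// so no sticky sessions are needed.
import { Server } from "socket.io";
import { createClient } from "redis";
import { createAdapter } from "@socket.io/redis-adapter";

async function main() {
  const pubClient = createClient({ url: "redis://localhost:6379" });
  const subClient = pubClient.duplicate();
  await Promise.all([pubClient.connect(), subClient.connect()]);

  const io = new Server(3000);
  io.adapter(createAdapter(pubClient, subClient));

  io.on("connection", (socket) => {
    socket.on("join", (docId: string) => socket.join(`doc:${docId}`));
    socket.on("change", (docId: string, delta: unknown) => {
      // Reaches clients in this room on *every* server via the adapter.
      io.to(`doc:${docId}`).emit("change", delta);
    });
  });
}
main();
```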
Trade-offs:
- Additional infrastructure complexity.
- Message broker becomes a critical dependency.
- Slight increase in latency due to indirection.
🔴 5. Polling PostgreSQL Every 2 Seconds → High Load & Inefficiency
Issue:
Servers polling DB every 2 seconds for changes:
- N servers × M open documents → N×M queries per poll interval, even if nothing changed.
- At scale (e.g., 100 servers × 10k docs, polled every 2 seconds) that is ~500,000 queries/sec → DB overload.
- Wastes I/O and CPU.
Solution:
Replace polling with event-driven push:
- Use PostgreSQL's `LISTEN/NOTIFY` to get real-time change events.
- Or use Change Data Capture (CDC) via Debezium or logical replication.
- Trigger server-side pub/sub on change.
✅ Eliminates polling → zero overhead when idle.
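A sketch of the listening side with node-postgres; the `doc_changes` channel name and the trigger that feeds it are assumptions:

```ts
// Sketch: replace polling with LISTEN/NOTIFY (node-postgres).
// Keep payloads tiny (ids only) to stay well under the 8KB cap.
import { Client } from "pg";

async function listenForChanges() {
  // Use a dedicated connection: pooled clients may be recycled mid-LISTEN.
  const listener = new Client();
  await listener.connect();
  await listener.query("LISTEN doc_changes");

  listener.on("notification", (msg) => {
    // Assumed trigger payload shape: {"doc_id": "...", "seq": 42}
    const { doc_id, seq } = JSON.parse(msg.payload ?? "{}");
    console.log("change event:", doc_id, seq);
    // Fetch the delta for (doc_id, seq), then publish to connected clients.
  });
}
listenForChanges();
```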
Trade-offs:
- `LISTEN/NOTIFY` has limitations (payloads capped at 8KB, delivery is asynchronous and best-effort).
- CDC adds operational complexity (extra services, Kafka, etc.).
🔴 6. No Document Recovery After Server Failure
Issue:
- If a server crashes, clients reconnect and may:
  - Rejoin the document with stale state.
  - Miss recent changes broadcast only to the failed server.
- Server state (e.g., in-memory presence, connection map) is lost.
Solution:
- Store document state metadata in Redis (e.g., current version, connected users).
- On reconnect, client fetches latest version from DB or Redis before syncing.
- Use WebSocket reconnection protocol with sequence numbers to catch up on missed messages.
✅ Enables fault-tolerant recovery.
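A client-side catch-up sketch using per-document sequence numbers; `socket`, `fetchDeltasSince`, and `applyToLocalDoc` are assumed application pieces:

```ts
// Sketch: sequence-numbered catch-up after reconnect.
// Assumed helpers (not library APIs):
declare const socket: {
  on(ev: string, fn: (...args: any[]) => void): void;
  emit(ev: string, data: object): void;
};
declare function fetchDeltasSince(docId: string, seq: number):
  Promise<Array<{ seq: number; payload: Uint8Array }>>;
declare function applyToLocalDoc(payload: Uint8Array): void;

const docId = "doc-123"; // illustrative
let lastSeq = 0;         // highest sequence applied; persist across reloads

function applyDelta(d: { seq: number; payload: Uint8Array }) {
  if (d.seq <= lastSeq) return; // duplicate delivery → safe to ignore
  applyToLocalDoc(d.payload);
  lastSeq = d.seq;
}

socket.on("connect", async () => {
  // Pull everything missed while disconnected, then resume the live feed.
  (await fetchDeltasSince(docId, lastSeq)).forEach(applyDelta);
  socket.emit("subscribe", { docId, fromSeq: lastSeq });
});
```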
Trade-offs:
- Increases Redis usage and latency on reconnect.
- Requires careful versioning and recovery logic.
🔴 7. CDN Caching API Responses Degrades Real-Time UX
Issue:
Caching API responses (e.g., document state) for 5 minutes via CDN:
- Users may see stale content for minutes.
- Contradicts real-time editing goals.
- Especially bad during initial load if CDN serves stale version.
Solution:
- Do not cache document content in CDN.
- Only cache static assets and auth/user metadata (if safe).
- Use private, no-cache headers for document fetch endpoints.
✅ Ensures users always get latest state.
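In Express terms, this is one header on the document endpoints; `loadDocument` is an assumed data-access helper:

```ts
// Sketch: make document reads uncacheable end to end (Express).
import express from "express";

declare function loadDocument(id: string): Promise<object>; // assumed helper

const app = express();

app.get("/api/docs/:id", async (req, res) => {
  // `no-store` stops both the CDN and the browser from caching the body.
  res.set("Cache-Control", "private, no-store, max-age=0");
  res.json(await loadDocument(req.params.id));
});
```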
Trade-offs:
- Increased load on API servers and DB.
- Load can be mitigated with a server-side Redis cache (invalidated on every write) instead of the CDN.
🔴 8. JWT in localStorage → XSS Vulnerability
Issue:
Storing JWT in localStorage makes it accessible via XSS attacks.
- Malicious script can steal token → impersonate user.
- 24-hour expiry increases exposure window.
Solution:
- Store JWT in HttpOnly, Secure, SameSite cookies.
- Use short-lived access tokens (e.g., 15 minutes) + refresh tokens (stored in DB or Redis).
- Implement CSRF protection (e.g., double-submit cookie) if using cookies.
✅ Mitigates XSS-based token theft.
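A sketch of issuing the access token as a cookie in Express; `signAccessToken` is an assumed JWT helper:

```ts
// Sketch: short-lived access token in an HttpOnly cookie (Express).
import express from "express";

declare function signAccessToken(userId: string): string; // assumed helper

const app = express();

app.post("/login", (req, res) => {
  const token = signAccessToken("user-id-from-credentials"); // illustrative
  res.cookie("access_token", token, {
    httpOnly: true,          // invisible to page scripts → XSS can't read it
    secure: true,            // sent over HTTPS only
    sameSite: "strict",      // limits cross-site sends; still add CSRF tokens
    maxAge: 15 * 60 * 1000,  // 15-minute lifetime, renewed via refresh token
  });
  res.sendStatus(204);
});
```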
Trade-offs:
- Slightly more complex auth flow.
- Need CSRF protection.
- Refresh token revocation requires server-side tracking.
🔴 9. Document Partitioning by Organization ID → Hotspot Risk
Issue:
Partitioning by organization ID may cause uneven load:
- A large org (e.g., 10k users editing 100 docs) → one DB shard overwhelmed.
- Small orgs underutilize their shard.
Solution:
- Use consistent hashing or range partitioning by document ID.
- Or use automatic sharding via Citus (PostgreSQL extension) or Vitess (for MySQL).
✅ Better load distribution.
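A sketch of document-ID routing with a plain hash; modulo hashing is shown for brevity, while true consistent hashing (a hash ring or jump hash) would avoid mass remapping when the shard count changes:

```ts
// Sketch: spread documents across shards by hashing the document ID.
import { createHash } from "node:crypto";

const SHARD_COUNT = 16; // illustrative

function shardFor(docId: string): number {
  // Stable hash → uniform spread; one hot org's docs land on many shards.
  const digest = createHash("sha1").update(docId).digest();
  return digest.readUInt32BE(0) % SHARD_COUNT;
}

// A 10k-user org's documents now scatter across all shards:
console.log(shardFor("org-42/doc-aaa"), shardFor("org-42/doc-bbb"));
```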
Trade-offs:
- Cross-shard joins become harder (e.g., global search).
- Requires more sophisticated routing layer.
🔴 10. No Handling of Offline Clients or Reconnection
Issue:
If a client goes offline:
- Changes not sent → lost.
- On reconnect, no mechanism to catch up on missed changes.
Solution:
- Frontend queues changes when offline (IndexedDB).
- On reconnect, send queued ops + request missed updates from server.
- Server tracks per-client last-seen version (like Firebase).
✅ Robust offline support.
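A sketch of the client-side queue using idb-keyval as a thin IndexedDB wrapper; `sendOp`, `requestMissedSince`, and `lastSeenVersion` are assumed application pieces:

```ts
// Sketch: durable offline op queue via IndexedDB (idb-keyval wrapper).
import { get, set } from "idb-keyval";

// Assumed application helpers/state (not library APIs):
declare function sendOp(op: object): Promise<void>;
declare function requestMissedSince(version: number): Promise<void>;
declare const lastSeenVersion: number;

async function submitOp(op: object) {
  if (navigator.onLine) return sendOp(op);
  const queue: object[] = (await get("pendingOps")) ?? [];
  queue.push(op);
  await set("pendingOps", queue); // survives tab close and crashes
}

window.addEventListener("online", async () => {
  const queue: object[] = (await get("pendingOps")) ?? [];
  for (const op of queue) await sendOp(op); // replay in original order
  await set("pendingOps", []);
  await requestMissedSince(lastSeenVersion); // then pull what we missed
});
```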
Trade-offs:
- Increased frontend complexity.
- Need server-side version tracking.
✅ Summary of Key Fixes and Architecture Upgrades
| Issue | Solution | Trade-off |
|---|---|---|
| Cross-server sync delay | Redis Pub/Sub or Kafka for real-time broadcast | Adds broker dependency |
| Client clock skew | Server timestamps + CRDTs/OT | Complexity, learning curve |
| Full snapshots → data loss | Delta-based persistence | Harder to query/backup |
| Sticky sessions required | Shared pub/sub (Redis) or gateway | Indirection, latency |
| DB polling overload | PostgreSQL NOTIFY or CDC | Operational complexity |
| CDN caching docs | Disable caching for doc content | Higher backend load |
| JWT in localStorage | HttpOnly cookies + refresh tokens | CSRF risk, more flow |
| No offline support | Client-side op queue + catch-up | Storage, logic overhead |
| Hotspot partitioning | Document ID sharding | Cross-shard queries hard |
✅ Recommended Final Architecture Additions
- Adopt CRDTs (e.g., Yjs) for conflict-free collaboration.
- Use Redis Streams for durable, ordered change propagation.
- Replace polling with `LISTEN/NOTIFY` or CDC.
- Store JWT in HttpOnly cookies with short expiry.
- Remove CDN caching for document data.
- Add a message broker (e.g., Kafka) for audit log, search indexing, and recovery.
- Implement client-side offline queues and versioned sync.
By addressing these issues, the system evolves from a fragile, inconsistent prototype into a scalable, fault-tolerant, real-time collaborative editor capable of supporting thousands of concurrent users with strong eventual consistency and a minimal data-loss window.