Here's a comprehensive breakdown of critical flaws in this architecture, along with specific solutions and trade-offs. The most severe issues relate to the sync strategy and data flow, which would cause catastrophic data loss and inconsistent states in real-world use.
I. Critical Sync & Data Flow Failures
1. Client Timestamp-Based Last-Write-Wins (LWW)
- Problem:
- Client clocks are unreliable (e.g., user travels across timezones, device clock skew). Two edits happening milliseconds apart could have inverted timestamps, causing newer edits to be overwritten.
- Ignores operation semantics: if User A deletes a word while User B edits the same word, LWW applies whichever change arrives last in full and discards the other entirely (e.g., the word A deleted reappears because B's later edit still contains it). This breaks collaboration fundamentally.
- No conflict resolution for concurrent edits (e.g., two users typing in the same sentence).
- Failure Mode: Frequent data loss, nonsensical document states, user frustration.
- Solution: Replace LWW with Operational Transformation (OT) or CRDTs.
- Implementation:
- Use a library like ShareDB (OT) or Yjs (CRDTs); see the sketch after this item.
- Server validates/transforms operations before applying them (e.g., "insert 'x' at position 5" → adjusted if prior inserts happened).
- Trade-offs:
- ✅ Guarantees convergence (all clients see same state eventually).
- ✅ Handles concurrent edits without data loss.
- ❌ Increased server CPU/memory (transforming operations is non-trivial).
- ❌ Complex implementation (requires strict operation ordering).
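A minimal sketch of the CRDT route using Yjs (one of the two libraries named above; the "content" field name is illustrative): two replicas take concurrent edits, exchange updates, and converge without either edit being lost.

```typescript
import * as Y from "yjs";

// Two replicas of the same document (e.g., two clients, or two API servers).
const docA = new Y.Doc();
const docB = new Y.Doc();

// Concurrent edits made before any sync happens.
docA.getText("content").insert(0, "Hello world");
docB.getText("content").insert(0, "Greetings. ");

// Exchange state updates in both directions; the order of exchange does not matter.
Y.applyUpdate(docB, Y.encodeStateAsUpdate(docA));
Y.applyUpdate(docA, Y.encodeStateAsUpdate(docB));

// Both replicas converge to the same text; neither edit was discarded.
console.log(docA.getText("content").toString() === docB.getText("content").toString()); // true
```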
2. Polling-Based Inter-Server Sync (2s Interval)
- Problem:
- Massive latency: Edits take up to 2 seconds + WebSocket broadcast delay to reach users on other servers. Not real-time (Google Docs achieves <100ms).
- Database overload: if 100 servers each poll PostgreSQL every 2s, every document update is read up to 100 times, and the polling happens whether or not anything changed. With 10k active docs each polled every 2s, that alone is ~5k queries/sec – unsustainable.
- Missed updates: If two edits happen within 2s, polling might only catch the latest, losing intermediate states.
- Failure Mode: Stale document views, users overwriting each other's work, database crashes under load.
- Solution: Replace polling with Redis Pub/Sub for inter-server events.
- Implementation:
- When Server A applies an operation, publish it to Redis:
PUBLISH doc:<id> "<operation>"
- All API servers subscribe to the Redis channels for the docs they host. On message, apply the operation and broadcast via WebSocket (see the sketch after this item).
- Trade-offs:
- ✅ Near-instant inter-server sync (<50ms).
- ✅ Eliminates polling load on PostgreSQL.
- ❌ Adds Redis latency (minimal vs. polling).
- ❌ Requires Redis HA setup (master-replica + Sentinel).
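A sketch of the Pub/Sub fan-out, assuming the ioredis client; the channel naming follows the PUBLISH doc:<id> pattern above, while the operation shape and the broadcast callback are placeholders.

```typescript
import Redis from "ioredis";

// Pub/sub needs two connections: one in subscriber mode, one for normal commands.
const pub = new Redis();
const sub = new Redis();

// Hypothetical operation shape; real ops come from the OT/CRDT layer.
interface Op { docId: string; seq: number; payload: unknown; }

// docId -> broadcast function for the WebSocket clients this server hosts.
const localBroadcasters = new Map<string, (op: Op) => void>();

// One handler for all channels: look up who on this server cares about the doc.
sub.on("message", (channel: string, message: string) => {
  const broadcast = localBroadcasters.get(channel.replace("doc:", ""));
  if (broadcast) broadcast(JSON.parse(message) as Op);
});

// Called when this server starts hosting a document.
export async function hostDocument(docId: string, broadcast: (op: Op) => void): Promise<void> {
  localBroadcasters.set(docId, broadcast);
  await sub.subscribe(`doc:${docId}`);
}

// Called after this server applies an operation locally: fan it out to peers.
export async function publishOp(op: Op): Promise<void> {
  await pub.publish(`doc:${op.docId}`, JSON.stringify(op));
}
```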
3. No Message Ordering Guarantee
- Problem:
- WebSockets deliver messages in order per connection, but no global order across servers. User A (Server 1) sees Edit X then Edit Y, while User B (Server 2) sees Y then X due to network delays. LWW can't fix this.
- PostgreSQL polling order isn't guaranteed (e.g., SELECT * FROM changes WHERE ts > last_poll may return edits out-of-order).
- Failure Mode: Permanent document divergence across clients.
- Solution: Enforce total order with logical clocks (Lamport timestamps) + sequence numbers.
- Implementation:
- Each operation gets a monotonically increasing server_id:counter stamp (e.g., server-3:142).
- Servers apply ops in this global order (using Redis to track the latest counter per server); see the sketch after this item.
- Trade-offs:
- ✅ Guarantees convergence (critical for OT/CRDTs).
- ❌ Slight overhead per operation (storing/propagating counters).
- ❌ Requires coordination on counter initialization (solved by Redis).
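A sketch of the logical-clock scheme, assuming the server_id:counter stamp format above; the merge rule (catch up to any counter you observe) is the standard Lamport update, and the tie-break on server id makes the order total.

```typescript
// One logical clock per server; stamps look like "server-3:142".
export class LamportClock {
  private counter = 0;

  constructor(private readonly serverId: string) {}

  // Stamp a locally generated operation.
  tick(): string {
    this.counter += 1;
    return `${this.serverId}:${this.counter}`;
  }

  // On receiving a remote stamp, catch up to its counter so the next local
  // operation is guaranteed to sort after everything already seen.
  observe(remoteStamp: string): void {
    const remoteCounter = Number(remoteStamp.split(":")[1]);
    this.counter = Math.max(this.counter, remoteCounter);
  }
}

// Total order over stamps: counter first, server id as a deterministic tie-break.
// Every server applies operations sorted by this comparator, so all replicas
// process the same ops in the same order.
export function compareStamps(a: string, b: string): number {
  const [aId, aCounter] = a.split(":");
  const [bId, bCounter] = b.split(":");
  return Number(aCounter) - Number(bCounter) || aId.localeCompare(bId);
}
```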
II. Scaling Bottlenecks
4. PostgreSQL Write Saturation
- Problem:
- Full HTML snapshots every 30s waste I/O (storing redundant data) and block writes during serialization.
- Incremental operations also write to PostgreSQL (Step 2), creating high write contention on document rows.
- Polling (if not fixed) would amplify this 100x.
- Bottleneck: a single document row becomes a write hotspot (e.g., 100 active editors at roughly one operation per second each → ~100 writes/sec to one row).
- Solution: Decouple real-time ops from persistent storage.
- Implementation:
- Write operations to a write-ahead log (e.g., Kafka/Pulsar) instead of PostgreSQL (see the sketch after this item).
- Use a background worker to:
- Apply ops to generate latest state (using OT/CRDTs).
- Save incremental diffs (not full HTML) to PostgreSQL every 5s.
- Compact diffs hourly into a snapshot.
- Trade-offs:
- ✅ Eliminates write contention on hot documents.
- ✅ Reduces DB storage by 10-100x (storing diffs vs. full HTML).
- ❌ Adds complexity (Kafka cluster, background workers).
- ❌ Slight delay in "permanent" storage (seconds, not 30s).
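A sketch of the write-ahead-log path, assuming kafkajs as the client; broker addresses, the doc-operations topic, and the consumer group name are placeholders, and the OT/CRDT apply/flush step is passed in as a callback.

```typescript
import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "doc-api", brokers: ["kafka-1:9092"] });
const producer = kafka.producer();

// Call once at server startup.
export async function initLog(): Promise<void> {
  await producer.connect();
}

// API-server side: append each applied op to the log, keyed by document id so
// all ops for one doc land in the same partition and keep their order.
export async function appendOp(docId: string, op: object): Promise<void> {
  await producer.send({
    topic: "doc-operations",
    messages: [{ key: docId, value: JSON.stringify(op) }],
  });
}

// Background worker: consume ops and hand them to the OT/CRDT apply/flush step,
// which periodically writes compacted diffs to PostgreSQL.
export async function runCompactor(apply: (docId: string, op: unknown) => void): Promise<void> {
  const consumer = kafka.consumer({ groupId: "doc-compactor" });
  await consumer.connect();
  await consumer.subscribe({ topics: ["doc-operations"] });
  await consumer.run({
    eachMessage: async ({ message }) => {
      if (message.key && message.value) {
        apply(message.key.toString(), JSON.parse(message.value.toString()));
      }
    },
  });
}
```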
5. Inefficient Document Partitioning
- Problem:
- Partitioning only by organization_id creates hot partitions (e.g., a large company with 10k concurrent editors on one doc).
- Read replicas won't help – hot partitions saturate the primary DB's write capacity.
- Bottleneck: Single organization can DOS the entire system.
- Solution: Multi-level partitioning + sharding.
- Implementation:
- Partition by (organization_id, shard_id) where shard_id = hash(document_id) % 1024.
- Assign documents to shards dynamically (e.g., if shard >80% load, split).
- Use a shard router service (e.g., Vitess, or a custom Redis-backed router); see the sketch after this item.
- Trade-offs:
- ✅ Distributes load evenly.
- ✅ Scales linearly by adding shards.
- ❌ Cross-shard transactions impossible (mitigated by single-doc operations).
- ❌ Complex rebalancing during shard splits.
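A sketch of the shard mapping; the 1024 shard count comes from the bullet above, while the MD5 hash is an assumption – any stable hash of document_id works.

```typescript
import { createHash } from "crypto";

const SHARD_COUNT = 1024;

// Deterministic document_id -> shard_id mapping. MD5 is used only because it
// is in the standard library; any stable hash is fine.
export function shardFor(documentId: string): number {
  const digest = createHash("md5").update(documentId).digest();
  return digest.readUInt32BE(0) % SHARD_COUNT; // first 4 bytes as an unsigned int
}

// The physical partition key becomes (organization_id, shard_id).
export function partitionKey(orgId: string, documentId: string): [string, number] {
  return [orgId, shardFor(documentId)];
}
```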
III. Security & Reliability Risks
6. JWT in localStorage + XSS Vulnerability
- Problem:
- localStorage is accessible via JavaScript → XSS attacks steal tokens.
- 24-hour tokens enable long-lived session hijacking.
- Failure Mode: Account takeover via malicious script injection.
- Solution: HttpOnly cookies + short-lived tokens.
- Implementation:
- Store JWT in HttpOnly, SameSite=Strict, Secure cookies.
- Use short token expiry (e.g., 15m) + refresh tokens (stored in DB, rotated on use); see the sketch after this item.
- Trade-offs:
- ✅ Mitigates XSS token theft.
- ❌ CSRF risk (solved with SameSite=Strict + anti-CSRF tokens).
- ❌ Requires token refresh mechanism.
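A sketch of the cookie settings in Express; the route, cookie names, and token helpers are placeholders – the point is the HttpOnly/Secure/SameSite flags and the 15-minute expiry.

```typescript
import express from "express";

const app = express();

// Placeholder helpers: real versions would sign a 15-minute JWT and persist a rotated refresh token.
const issueAccessToken = (userId: string): string => `signed-jwt-for-${userId}`;
const rotateRefreshToken = async (userId: string): Promise<string> => `refresh-${userId}-${Date.now()}`;

app.post("/login", async (_req, res) => {
  const userId = "user-123"; // stand-in for real credential verification
  res.cookie("access_token", issueAccessToken(userId), {
    httpOnly: true,         // not readable from JavaScript, so XSS cannot exfiltrate it
    secure: true,           // sent over HTTPS only
    sameSite: "strict",     // not sent on cross-site requests, the main CSRF vector
    maxAge: 15 * 60 * 1000, // 15 minutes, matching the short expiry above
  });
  res.cookie("refresh_token", await rotateRefreshToken(userId), {
    httpOnly: true,
    secure: true,
    sameSite: "strict",
    path: "/auth/refresh",  // only ever sent to the refresh endpoint
  });
  res.sendStatus(204);
});

app.listen(3000);
```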
7. CDN Caching API Responses
- Problem:
- CloudFront caching stale document data (e.g., after an edit, cached response serves old content for 5m).
- Breaks "real-time" promise for document fetches.
- Failure Mode: Users load outdated documents after edits.
- Solution: Disable CDN caching for dynamic API endpoints.
- Implementation:
- Set Cache-Control: no-store, must-revalidate on all document-related API responses.
- Only cache static assets (JS/CSS/images) via the CDN (see the middleware sketch after this item).
- Trade-offs:
- ✅ Ensures clients always get fresh data.
- ❌ Increased load on API servers (mitigated by WebSocket real-time updates).
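A sketch of the no-cache policy as Express middleware (the /api/documents prefix is an assumption); static assets are served separately with long-lived cache headers.

```typescript
import express from "express";

const app = express();

// Document API responses must never be cached by CloudFront or the browser.
app.use("/api/documents", (_req, res, next) => {
  res.set("Cache-Control", "no-store, must-revalidate");
  next();
});
```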
8. WebSocket Connection Loss Handling
- Problem:
- No mechanism to recover after client disconnects (e.g., network drop).
- On reconnect, client reloads full document → loses local uncommitted edits.
- Failure Mode: User loses minutes of work after brief network outage.
- Solution: Client-side operational history + reconnect sync.
- Implementation:
- Client buffers unacknowledged operations locally.
- On reconnect, send buffered ops + last server-acknowledged sequence number.
- Server validates and applies the missed ops (using OT/CRDTs); see the sketch after this item.
- Trade-offs:
- ✅ Recovers uncommitted edits.
- ❌ Complex client logic (handled by libraries like Yjs).
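A sketch of the client-side buffer; the message shapes and sequence-number bookkeeping are assumptions, and libraries like Yjs ship equivalent logic – this only shows the shape of it.

```typescript
// Shape of an operation awaiting server acknowledgement (fields are illustrative).
interface PendingOp { seq: number; payload: unknown; }

export class ReconnectingDocClient {
  private pending: PendingOp[] = []; // sent but not yet acknowledged
  private lastAckedSeq = 0;          // highest sequence number the server confirmed

  constructor(private readonly send: (msg: object) => void) {}

  // Every local edit is buffered before being sent.
  submit(op: PendingOp): void {
    this.pending.push(op);
    this.send({ type: "op", op });
  }

  // Server acknowledged everything up to `seq`: drop it from the buffer.
  ack(seq: number): void {
    this.lastAckedSeq = Math.max(this.lastAckedSeq, seq);
    this.pending = this.pending.filter((p) => p.seq > seq);
  }

  // After the WebSocket reconnects: replay unacknowledged ops and tell the
  // server where we left off so it can send back whatever we missed.
  onReconnect(): void {
    this.send({ type: "resync", lastAckedSeq: this.lastAckedSeq, ops: this.pending });
  }
}
```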
IV. Other Critical Oversights
9. No Document Versioning
- Problem: Accidental deletions or malicious edits are irreversible.
- Solution: Append-only operation log (already provided by the Kafka-based storage in #4). Enables "undo" and history playback; see the replay sketch after this item.
- Trade-off: Increased storage (but diffs minimize impact).
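A sketch of history playback over the op log; the op shape and reducer are simplified stand-ins for the OT/CRDT operations from #4.

```typescript
// Simplified stand-in for the logged OT/CRDT operations from #4.
interface LoggedOp { ts: number; kind: "insert" | "delete"; pos: number; text: string; }

function applyOp(doc: string, op: LoggedOp): string {
  return op.kind === "insert"
    ? doc.slice(0, op.pos) + op.text + doc.slice(op.pos)
    : doc.slice(0, op.pos) + doc.slice(op.pos + op.text.length);
}

// Any past version of the document is just the op log replayed up to a timestamp.
export function replayUntil(log: LoggedOp[], asOf: number): string {
  return log.filter((op) => op.ts <= asOf).reduce(applyOp, "");
}
```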
10. Load Balancer Session Affinity (Sticky Sessions) Missing
- Problem: A round-robin LB may send a client's WebSocket upgrade or reconnect to a server holding none of its in-memory session state → broken sessions and forced full reloads.
- Solution: Enable session affinity (e.g., ip_hash in an Nginx upstream, the sticky directive in NGINX Plus, or ALB target group stickiness).
- Trade-off: Uneven load if clients reconnect frequently (mitigated by session affinity TTL).
11. Redis as Single Point of Failure
- Problem: Redis crash → session cache/auth data lost, WebSocket servers can't sync.
- Solution: Redis Cluster with replicas + persistence (AOF/RDB); see the connection sketch after this item.
- Trade-off: Increased ops complexity; slight latency increase.
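A sketch of the HA client setup using ioredis Cluster; node hostnames, the read-scaling option, and the session key format are assumptions.

```typescript
import Redis from "ioredis";

// Connect to a few seed nodes; ioredis discovers the full topology and
// follows failovers when a replica is promoted.
const redis = new Redis.Cluster(
  [
    { host: "redis-node-1", port: 6379 },
    { host: "redis-node-2", port: 6379 },
    { host: "redis-node-3", port: 6379 },
  ],
  { scaleReads: "slave" } // serve reads from replicas; writes still go to masters
);

// Example: session entries expire after 15 minutes.
export async function cacheSession(sessionId: string, userId: string): Promise<void> {
  await redis.set(`session:${sessionId}`, userId, "EX", 900);
}
```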
Key Takeaways & Prioritized Fixes
| Issue | Severity | Fix | Why Critical |
|---|---|---|---|
| Client LWW | 🔴 CRITICAL | OT/CRDTs + logical clocks | Prevents constant data loss & divergence |
| Polling bottleneck | 🔴 CRITICAL | Redis Pub/Sub | Eliminates 2s latency & DB overload |
| Full HTML snapshots | 🟠 HIGH | Kafka + diff-based storage | Solves write saturation, reduces storage 90%+ |
| JWT in localStorage | 🟠 HIGH | HttpOnly cookies + short tokens | Prevents mass account takeovers |
| No message ordering | 🔵 MEDIUM | Lamport timestamps | Required for OT/CRDTs to work correctly |
Without OT/CRDTs and Pub/Sub, this system is fundamentally broken for collaboration – it will lose data under even light concurrent usage. Start by replacing LWW and polling, then address storage and performance. The proposed solutions align with industry standards (Google Docs uses OT; Figma uses CRDTs). They add complexity, but in collaborative editing correctness trumps simplicity.