This architecture contains several critical flaws that would prevent it from functioning as a real-time collaborative editor (like Google Docs). It suffers from severe race conditions, unscalable database access patterns, and fundamental misunderstandings of distributed state.
Here is a detailed breakdown of the failure modes, race conditions, and scaling bottlenecks, along with specific solutions and trade-offs.
1. Synchronization & Concurrency
Issue: Last-Write-Wins (LWW) with Client Clocks
- Race Condition / Failure Mode: Client clocks are notoriously unreliable and easily manipulated. If User A's system clock is 5 minutes behind User B's, User A's edits will constantly be overwritten by User B's, even when User A actually typed later. Furthermore, LWW at the "paragraph" level means that if two users type in the same paragraph simultaneously, one user's keystrokes are silently discarded wholesale.
- Solution: Replace LWW with Operational Transformation (OT) or Conflict-free Replicated Data Types (CRDTs) (e.g., Yjs or Automerge). Use logical server timestamps (sequence numbers) instead of client system clocks. A minimal CRDT sketch follows this section's trade-offs.
- Trade-offs:
- OT: Extremely complex to implement and debug; requires a central server to sequence operations.
- CRDTs: Easier peer-to-peer merging, but they can consume significant memory (tombstones for deleted text must be kept) and produce larger payload sizes.
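To make the convergence property concrete, here is a minimal sketch using the Yjs CRDT library (one reasonable choice; Automerge would look similar). Two replicas make concurrent edits and merge by exchanging binary updates, with no wall-clock timestamps involved:

```typescript
import * as Y from "yjs";

// Two replicas of the same document, e.g. User A and User B.
const docA = new Y.Doc();
const docB = new Y.Doc();

// Concurrent edits to the same shared text type.
docA.getText("content").insert(0, "Hello ");
docB.getText("content").insert(0, "world");

// Exchange updates in either order; the CRDT merge is commutative and
// idempotent, so both replicas converge without a wall-clock tiebreaker.
const updateA = Y.encodeStateAsUpdate(docA);
const updateB = Y.encodeStateAsUpdate(docB);
Y.applyUpdate(docA, updateB);
Y.applyUpdate(docB, updateA);

console.log(
  docA.getText("content").toString() === docB.getText("content").toString()
); // true
```

Because the merge does not depend on which replica applies an update first, the "my clock is behind, so my edits vanish" failure mode disappears entirely.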
2. Real-Time Communication
Issue: Isolated WebSockets + Round-Robin Load Balancing
- Failure Mode: Because the Load Balancer uses round-robin, User A and User B editing the same document will likely connect to different API servers. Since servers only broadcast to their own connected clients, User B will not see User A's keystrokes in real-time.
- Solution: Implement a Pub/Sub Message Broker (like Redis Pub/Sub). When Server 1 receives an edit, it publishes the edit to a Redis channel dedicated to that Document ID. All servers subscribed to that channel receive the edit and push it to their respective WebSocket clients (see the fan-out sketch below).
- Trade-offs: Introduces a new critical infrastructure dependency (Redis). If Redis goes down, real-time collaboration breaks completely.
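Below is a rough sketch of that fan-out using ioredis and the ws WebSocket library (both illustrative choices); the `docSockets` map stands in for whatever per-server connection registry you already keep:

```typescript
import Redis from "ioredis";
import type WebSocket from "ws";

// Separate connections: a Redis connection in subscriber mode
// cannot issue regular commands like PUBLISH.
const publisher = new Redis();
const subscriber = new Redis();

// Illustrative registry of this server's own WebSocket clients per document.
const docSockets = new Map<string, Set<WebSocket>>();

// When this server receives an edit from one of its own clients,
// publish it to the document's channel instead of broadcasting locally.
export async function onLocalEdit(docId: string, edit: string): Promise<void> {
  await publisher.publish(`doc:${docId}`, edit);
}

// Every server subscribes to the channels of the documents it serves,
// so an edit received by *any* server reaches *all* connected clients.
export async function joinDocument(docId: string, socket: WebSocket): Promise<void> {
  if (!docSockets.has(docId)) {
    docSockets.set(docId, new Set());
    await subscriber.subscribe(`doc:${docId}`);
  }
  docSockets.get(docId)!.add(socket);
}

subscriber.on("message", (channel, message) => {
  const docId = channel.slice("doc:".length);
  for (const socket of docSockets.get(docId) ?? []) {
    socket.send(message);
  }
});
```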
3. Database & Network Load
Issue: Polling PostgreSQL Every 2 Seconds
- Scaling Bottleneck: Having every API server poll the database every 2 seconds for changes creates a multiplicative load problem (servers × active documents). With 10 servers and 10,000 active documents, that is 100,000 queries per 2-second cycle, roughly 50,000 queries per second (QPS), just for polling. Furthermore, a 2-second delay is unacceptable for a "real-time" typing experience (Google Docs propagates edits in tens of milliseconds).
- Solution: Eliminate polling entirely. Rely on the Redis Pub/Sub architecture (mentioned above) to push state changes to servers instantly. The database should only be used for persistent, asynchronous saves, not real-time state synchronization (sketched after the trade-offs below).
- Trade-offs: Real-time state now lives in memory/Redis. You must implement robust error handling to ensure that if a server crashes before saving to the DB, the state can be recovered from the other active clients or the Redis buffer.
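As a sketch of what the asynchronous-save path could look like, assuming ioredis, node-postgres, and a hypothetical document_ops table: edits hit only Redis on the hot path, and a background job batches them into Postgres.

```typescript
import Redis from "ioredis";
import { Pool } from "pg";

const redis = new Redis();
const pool = new Pool();

// Hot path: append the edit to a Redis buffer only; no synchronous DB write,
// and no server ever has to poll the database for changes.
export async function applyEdit(docId: string, edit: string): Promise<void> {
  await redis.rpush(`doc:${docId}:ops`, edit);
}

// Background job (e.g. every few seconds or every N ops): drain the buffered
// operations and persist them in one batch. document_ops is illustrative.
// Note this is at-least-once: a crash between INSERT and LTRIM can duplicate ops.
export async function flushToPostgres(docId: string): Promise<void> {
  const ops = await redis.lrange(`doc:${docId}:ops`, 0, -1);
  if (ops.length === 0) return;
  await pool.query(
    "INSERT INTO document_ops (doc_id, op) SELECT $1::text, unnest($2::text[])",
    [docId, ops]
  );
  // Trim only what we read, so ops appended during the flush are kept.
  await redis.ltrim(`doc:${docId}:ops`, ops.length, -1);
}
```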
4. Storage Strategy
Issue: Saving Full HTML Snapshots Every 30 Seconds
- Failure Mode / Bottleneck: Saving full HTML for large documents (e.g., a 100-page report) every 30 seconds will rapidly exhaust database storage and bandwidth. Furthermore, if a user types for 29 seconds and the server crashes, 29 seconds of work is permanently lost.
- Solution: Implement an Event Sourcing / Append-Only Log model. Save individual document operations (deltas/keystrokes) as they happen. Periodically (e.g., every 1,000 operations, or when the last user disconnects), squash the operations into a compressed snapshot in an object store (like S3), keeping only recent operations in Postgres/Redis. The resulting load path is sketched below.
- Trade-offs: Document loading becomes slightly slower because the backend must fetch the last snapshot and "replay" recent operations on top of it before sending the state to the client.
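Here is what that load path might look like, assuming a Yjs-based document model and hypothetical document_snapshots / document_ops tables: fetch the latest snapshot, then replay only the deltas appended after it.

```typescript
import * as Y from "yjs";
import { Pool } from "pg";

const pool = new Pool();

// Illustrative schema:
//   document_snapshots(doc_id text, version bigint, state bytea)
//   document_ops(doc_id text, version bigint, delta bytea)
export async function loadDocument(docId: string): Promise<Y.Doc> {
  const doc = new Y.Doc();

  // 1. Start from the most recent squashed snapshot, if one exists.
  const snap = await pool.query(
    "SELECT version, state FROM document_snapshots WHERE doc_id = $1 ORDER BY version DESC LIMIT 1",
    [docId]
  );
  let sinceVersion = 0;
  if (snap.rows.length > 0) {
    Y.applyUpdate(doc, snap.rows[0].state);
    sinceVersion = snap.rows[0].version;
  }

  // 2. Replay only the operations appended after that snapshot.
  const ops = await pool.query(
    "SELECT delta FROM document_ops WHERE doc_id = $1 AND version > $2 ORDER BY version",
    [docId, sinceVersion]
  );
  for (const row of ops.rows) {
    Y.applyUpdate(doc, row.delta);
  }
  return doc;
}
```

The replay cost is bounded by how aggressively you squash: the more often snapshots are written, the fewer deltas have to be applied on load.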
5. Caching Strategy
Issue: CDN Caching API Responses for 5 Minutes
- Failure Mode: If the CDN caches API responses (like `/api/documents/123`), users will load documents that are up to 5 minutes out of date. They will then overwrite recent work because their client starts from a stale state.
- Solution: Disable CDN caching for dynamic API responses entirely. The CDN should only cache static assets (JS, CSS, images). Use Redis as an application-level cache for document states, and invalidate the Redis cache immediately upon any write (see the cache-aside sketch below).
- Trade-offs: Increases load on the API/DB for initial document loads, requiring more robust backend scaling to handle "thundering herd" problems if many users open a document simultaneously.
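A minimal cache-aside sketch with ioredis (the key scheme and loader are illustrative): reads hit Redis first, and every write deletes the cached entry so no client can start from a stale document.

```typescript
import Redis from "ioredis";

const redis = new Redis();

// Cache-aside read: try Redis first, fall back to the database on a miss.
// loadFromDatabase is a stand-in for the real persistence layer.
export async function getDocument(
  docId: string,
  loadFromDatabase: (id: string) => Promise<string>
): Promise<string> {
  const cached = await redis.get(`doc:${docId}:state`);
  if (cached !== null) return cached;

  const state = await loadFromDatabase(docId);
  // Short TTL as a safety net; correctness relies on the explicit invalidation below.
  await redis.set(`doc:${docId}:state`, state, "EX", 60);
  return state;
}

// Invalidate immediately on every write so the next reader rebuilds the entry.
export async function onDocumentWrite(docId: string): Promise<void> {
  await redis.del(`doc:${docId}:state`);
}
```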
6. Security & Authentication
Issue: JWTs with 24-hour Expiry in LocalStorage
- Failure Mode: Storing JWTs in `localStorage` makes them highly vulnerable to Cross-Site Scripting (XSS) attacks. If a malicious script runs, it can steal the token. A 24-hour expiry means the attacker has a full day to impersonate the user, and standard JWTs cannot be easily revoked.
- Solution: Store JWTs in HttpOnly, Secure cookies to prevent XSS access. Implement a dual-token system: a short-lived Access Token (e.g., 15 minutes) and a long-lived Refresh Token (stored in an HttpOnly cookie or secure backend session). A token-issuance sketch follows the trade-offs.
- Trade-offs: HttpOnly cookies require you to implement Cross-Site Request Forgery (CSRF) protection. It also makes cross-domain API requests slightly more complex to configure (requires precise CORS setups).
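A sketch of the dual-token issuance with Express and jsonwebtoken (the secrets, routes, and lifetimes are illustrative, and the credential check is elided):

```typescript
import express from "express";
import jwt from "jsonwebtoken";

const app = express();
// Placeholder secrets; in practice these come from a secrets manager.
const ACCESS_SECRET = process.env.ACCESS_SECRET!;
const REFRESH_SECRET = process.env.REFRESH_SECRET!;

app.post("/api/login", (req, res) => {
  const userId = "user-123"; // stand-in for a real credential check

  // Short-lived access token: limits the window for a stolen token.
  const accessToken = jwt.sign({ sub: userId }, ACCESS_SECRET, { expiresIn: "15m" });
  // Long-lived refresh token, never exposed to JavaScript.
  const refreshToken = jwt.sign({ sub: userId }, REFRESH_SECRET, { expiresIn: "30d" });

  // HttpOnly + Secure + SameSite keeps tokens out of reach of XSS payloads,
  // at the cost of needing CSRF protection for state-changing requests.
  res.cookie("access_token", accessToken, {
    httpOnly: true,
    secure: true,
    sameSite: "strict",
    maxAge: 15 * 60 * 1000,
  });
  res.cookie("refresh_token", refreshToken, {
    httpOnly: true,
    secure: true,
    sameSite: "strict",
    path: "/api/auth/refresh",
    maxAge: 30 * 24 * 60 * 60 * 1000,
  });
  res.sendStatus(204);
});
```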
7. Database Partitioning
Issue: Document Partitioning by Organization ID
- Scaling Bottleneck: Partitioning by Org ID creates severe database hotspots. A large enterprise client will overwhelm its dedicated shard, while shards hosting small startups sit idle.
- Solution: Partition the database by Document ID using a consistent hashing algorithm. This ensures an even distribution of data and load across all database shards, regardless of the organization's size (a hash-ring sketch follows the trade-offs).
- Trade-offs: Cross-document queries (e.g., "Search all documents belonging to Organization X") become "scatter-gather" operations. The backend must query all shards and aggregate the results, making search operations significantly slower and more complex. (This is usually mitigated by piping data into a dedicated search index like Elasticsearch).
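For illustration, a minimal consistent-hash ring for routing documents to shards (the shard names and virtual-node count are arbitrary). Virtual nodes keep the key space evenly spread, and adding or removing a shard only remaps a small fraction of documents:

```typescript
import { createHash } from "crypto";

const SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"];
const VIRTUAL_NODES = 128;

// Stable 32-bit hash of any string.
function hashToInt(value: string): number {
  return createHash("sha1").update(value).digest().readUInt32BE(0);
}

// Build the ring: each shard contributes many virtual points.
const ring: Array<{ point: number; shard: string }> = [];
for (const shard of SHARDS) {
  for (let v = 0; v < VIRTUAL_NODES; v++) {
    ring.push({ point: hashToInt(`${shard}#${v}`), shard });
  }
}
ring.sort((a, b) => a.point - b.point);

// Route by Document ID, not Organization ID, so load spreads evenly:
// the document maps to the first ring point at or after its hash.
export function shardForDocument(documentId: string): string {
  const point = hashToInt(documentId);
  const entry = ring.find((e) => e.point >= point) ?? ring[0];
  return entry.shard;
}
```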