This architecture contains several critical flaws that would lead to data loss, poor user experience, and system instability under load. Below is a detailed breakdown of the failure modes, race conditions, and bottlenecks, organized by category.
1. Collaboration & Consistency Failures
Issue: The "Isolated Island" Problem (Inter-Server Latency)
- Failure Mode: The architecture relies on servers polling PostgreSQL every 2 seconds to discover changes made by other servers.
- Impact: If User A is connected to Server 1 and User B is connected to Server 2, User B will not see User A’s changes for up to 2 seconds (plus database latency). In a real-time editor, this breaks the "real-time" contract. Users will frequently overwrite each other because they are looking at stale states.
- Solution: Implement a Pub/Sub layer (e.g., Redis Pub/Sub).
- When Server 1 receives a change, it publishes it to a Redis channel (e.g., `doc:{id}`). All API servers subscribe to this channel. When Server 2 receives the message, it pushes the update to its connected WebSocket clients immediately.
- Trade-off: Adds complexity to the backend infrastructure (managing Redis connection stability), but reduces sync latency from seconds to milliseconds.
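The fan-out pattern can be sketched in-process. In production the broker would be a shared Redis instance (each API server SUBSCRIBEs to `doc:{id}`); the minimal `PubSub` class below is only a stand-in to show how one server's publish reaches another server's clients without touching the database. All names here are illustrative, not a real Redis client API.

```python
from collections import defaultdict
from typing import Callable

# Minimal in-process stand-in for Redis Pub/Sub, showing the fan-out pattern.
class PubSub:
    def __init__(self) -> None:
        self._subscribers = defaultdict(list)  # channel -> list of handlers

    def subscribe(self, channel: str, handler: Callable) -> None:
        self._subscribers[channel].append(handler)

    def publish(self, channel: str, message: dict) -> None:
        for handler in self._subscribers[channel]:
            handler(message)

broker = PubSub()
received: list[dict] = []

# Server 2 subscribes on behalf of its WebSocket clients for document 42.
broker.subscribe("doc:42", received.append)

# Server 1 receives an edit over its own WebSocket and publishes it.
broker.publish("doc:42", {"op": "insert", "pos": 5, "text": "a"})

print(received)  # the edit reached "Server 2" with no polling involved
```

With a real Redis broker the `publish` call crosses servers in milliseconds, which is what replaces the 2-second poll loop.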
Issue: Destructive Conflict Resolution (Last-Write-Wins)
- Failure Mode: Using "Last-Write-Wins" (LWW) with client timestamps on structured content (like paragraphs) causes data loss. If User A deletes a sentence while User B fixes a typo in that same sentence, LWW will discard one change entirely.
- Impact: Users will lose work, leading to frustration and lack of trust in the product.
- Solution: Implement Operational Transformation (OT) or Conflict-free Replicated Data Types (CRDTs).
- Instead of saving state snapshots, treat edits as operations (e.g., `insert('a', position 5)`). These algorithms merge concurrent operations mathematically so both changes are preserved.
- Trade-off: Significantly higher implementation complexity. OT requires a central authority to transform operations; CRDTs allow peer-to-peer merging but have higher computational overhead.
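The core OT idea can be shown with the simplest case, insert-vs-insert: shift one operation's position past a concurrent insert so both edits survive. This is a deliberately reduced sketch (single transform case, simplified tie-breaking via a hypothetical `site` field), not a production OT engine.

```python
# Sketch of the insert-vs-insert transform: shift `op` so it applies
# correctly to a document that has already absorbed `against`.
def transform_insert(op: dict, against: dict) -> dict:
    earlier = against["pos"] < op["pos"]
    tie = against["pos"] == op["pos"] and against["site"] < op["site"]
    if earlier or tie:
        return {**op, "pos": op["pos"] + len(against["text"])}
    return op

def apply_insert(doc: str, op: dict) -> str:
    return doc[:op["pos"]] + op["text"] + doc[op["pos"]:]

doc = "helo world"
a = {"pos": 3, "text": "l", "site": 0}   # User A fixes the typo
b = {"pos": 10, "text": "!", "site": 1}  # User B appends, concurrently

# Server applies A first, then B transformed against A; both edits are kept
# instead of one overwriting the other as LWW would.
merged = apply_insert(apply_insert(doc, a), transform_insert(b, a))
print(merged)
```

A full OT system needs transforms for every operation pair (insert/delete, delete/delete, ...), which is where the implementation complexity mentioned above comes from; CRDT libraries such as Yjs bake equivalent merge rules into the data structure itself.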
Issue: Unreliable Timestamps (Client Clocks)
- Failure Mode: Relying on client clocks for conflict resolution is unsafe. Users can intentionally or accidentally change their system time, manipulating the "Last-Write-Wins" logic to overwrite newer data.
- Solution: Use Server-Side Timestamps or Lamport Timestamps/Vector Clocks.
- The server should assign the timestamp upon receipt of the operation.
- Trade-off: Server timestamps still don't solve the ordering issue across distributed servers without clock synchronization (NTP). Vector clocks solve this but add metadata overhead to every operation.
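A Lamport clock is small enough to show in full: each server increments a counter on local events and fast-forwards past any counter it sees in a received message, giving a causally consistent ordering without trusting any wall clock. This is a minimal sketch of the standard algorithm.

```python
# Minimal Lamport clock: logical time that respects causality, so an edit
# that was seen before another always gets a smaller stamp.
class LamportClock:
    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        # A local event (e.g. an edit arrives from a connected client).
        self.time += 1
        return self.time

    def receive(self, remote_time: int) -> int:
        # A message from another server: jump past its clock, then tick.
        self.time = max(self.time, remote_time) + 1
        return self.time

s1, s2 = LamportClock(), LamportClock()
t_a = s1.tick()        # Server 1 stamps User A's edit
t_b = s2.receive(t_a)  # Server 2 learns of it; its clock moves past t_a
t_c = s2.tick()        # User B's later edit is guaranteed to order after A's
print(t_a, t_b, t_c)
```

Note that Lamport clocks give a total order consistent with causality but cannot detect that two edits were truly concurrent; that is the extra information vector clocks buy with their per-node metadata.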
2. Database & Performance Bottlenecks
Issue: Write Amplification (Keystroke-to-DB)
- Bottleneck: Writing every single change event directly to PostgreSQL.
- Impact: In a document with 10 active users typing at 200ms intervals (average typing speed), that is 50 writes per second per document. PostgreSQL will quickly become I/O bound, causing high latency for all users.
- Solution: Implement a Write-Behind Cache / Buffer.
- Hold changes in Redis (or an in-memory buffer) and flush to PostgreSQL in batches (e.g., every 5 seconds or when the doc is idle).
- Trade-off: If a server crashes before flushing, data in the buffer is lost. Mitigation: Use Redis AOF (Append Only File) to persist the buffer to disk quickly.
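The buffering pattern can be sketched as follows. `flush_fn` stands in for the real batched Postgres write, and the thresholds (5 seconds, 100 ops) are illustrative defaults, not tuned values.

```python
import time

# Sketch of a write-behind buffer: ops accumulate in memory and are flushed
# to the database in batches, by size or by elapsed time.
class WriteBehindBuffer:
    def __init__(self, flush_fn, interval: float = 5.0, max_ops: int = 100) -> None:
        self.flush_fn = flush_fn
        self.interval = interval
        self.max_ops = max_ops
        self.ops: list[dict] = []
        self.last_flush = time.monotonic()

    def add(self, op: dict) -> None:
        self.ops.append(op)
        if len(self.ops) >= self.max_ops or \
           time.monotonic() - self.last_flush >= self.interval:
            self.flush()

    def flush(self) -> None:
        if self.ops:
            self.flush_fn(self.ops)  # one batched INSERT instead of N writes
            self.ops = []
        self.last_flush = time.monotonic()

batches: list[list[dict]] = []
buf = WriteBehindBuffer(batches.append, max_ops=3)
for i in range(7):
    buf.add({"op": i})
buf.flush()  # final flush when the document goes idle or is closed
print([len(b) for b in batches])
```

Ten users at 50 ops/second now cost the database a write every few seconds rather than 50 per second, at the crash-window cost described above.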
Issue: Polling Overhead
- Bottleneck: "Other servers poll PostgreSQL every 2 seconds."
- Impact: If you scale to 50 API servers, the database receives 25 queries per second constantly, even if no one is editing. This creates a baseline load that prevents scaling.
- Solution: The Redis Pub/Sub solution mentioned above eliminates this polling entirely. The database is only touched when data actually changes.
Issue: Storage Strategy (Full HTML Snapshots)
- Failure Mode: Saving full HTML snapshots every 30 seconds makes version history expensive and rollback difficult. It also creates massive database bloat.
- Impact: Retrieving a version from 2 hours ago requires replaying or diffing large blobs. Database storage costs balloon.
- Solution: Store Deltas (Operations).
- Store the sequence of operations (e.g., `retain 5, insert 'x'`). Periodically create snapshots (checkpoints) to optimize loading, but rely on the operation log for history.
- Trade-off: Reconstructing a document requires replaying the history log (slower load times). Mitigate this by creating snapshots every N revisions (e.g., every 100 revisions or 10 minutes).
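The snapshot-plus-replay scheme can be sketched with operations reduced to plain appends (real ops would be `retain`/`insert`/`delete`; the checkpoint interval of 3 is only for illustration):

```python
# Sketch of delta storage with periodic checkpoints: reconstruct a revision
# by loading the nearest earlier snapshot and replaying only the tail.
SNAPSHOT_EVERY = 3  # illustrative; the text suggests e.g. every 100 revisions

op_log: list[str] = []
snapshots: dict[int, str] = {0: ""}  # revision -> materialized document

def load_at(rev: int) -> str:
    base = max(r for r in snapshots if r <= rev)
    doc = snapshots[base]
    for op in op_log[base:rev]:  # replay only the ops after the checkpoint
        doc += op                # simplified: every op is an append
    return doc

def append_op(text: str) -> None:
    op_log.append(text)
    rev = len(op_log)
    if rev % SNAPSHOT_EVERY == 0:
        snapshots[rev] = load_at(rev)  # checkpoint bounds future replay cost

for word in ["the ", "quick ", "brown ", "fox"]:
    append_op(word)

print(load_at(4))  # nearest snapshot is rev 3, so only one op is replayed
```

Version history falls out for free: `load_at(2)` reconstructs the document as of any past revision from the same log.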
3. Security & Infrastructure Failures
Issue: CDN Caching API Responses
- Failure Mode: "CloudFront... also caches API responses for 5 minutes."
- Impact: If a user loads a document, they might receive a 5-minute-old version from the CDN edge. If they start editing this stale version, they will create conflicts immediately. Real-time data cannot be cached at the edge blindly.
- Solution: Disable caching for document content endpoints, or set `Cache-Control: no-store`. Use the CDN only for static assets (JS/CSS/Images).
- Trade-off: Increased load on the API servers for document fetches. You can optimize this by using `ETag` headers so clients only download full content if it has changed.
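The revalidation flow can be sketched server-side: hash the body into an `ETag`, and answer `304 Not Modified` with no body when the client's `If-None-Match` still matches. The function names and the truncated hash are illustrative, not a specific framework's API.

```python
import hashlib

# Sketch of ETag revalidation: the server hashes the document body and
# skips sending it when the client's cached copy is still current.
def etag_of(body: str) -> str:
    return hashlib.sha256(body.encode()).hexdigest()[:16]

def fetch_document(body: str, if_none_match):
    tag = etag_of(body)
    if if_none_match == tag:
        return 304, None, tag  # client cache is valid: headers only, no body
    return 200, body, tag      # full content plus the new validator

doc = "<p>hello</p>"
status, payload, tag = fetch_document(doc, None)   # first load: 200 + body
status2, payload2, _ = fetch_document(doc, tag)    # revalidation: 304, empty
print(status, status2, payload2)
```

Unlike the 5-minute CDN cache, every request still reaches the origin, so the client can never be served a stale document; only the bandwidth of re-sending unchanged content is saved.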
Issue: XSS Vulnerability via LocalStorage
- Failure Mode: Storing JWTs in `localStorage` makes them accessible to any JavaScript running on the page.
- Impact: If the application has any Cross-Site Scripting (XSS) vulnerability (e.g., via a third-party library or user input rendered incorrectly), attackers can steal the JWT and impersonate the user indefinitely (within the 24-hour window).
- Solution: Store JWTs in HttpOnly, Secure, SameSite Cookies.
- Trade-off: Makes the frontend slightly more complex (you need CSRF protection, though SameSite cookies largely mitigate this). Note that browsers attach cookies to the WebSocket upgrade handshake automatically on same-origin connections, so cookie auth works for WebSockets too; it is JavaScript-held tokens that must be passed via query params or an initial message, because the browser WebSocket API cannot set custom headers.
Issue: WebSocket Connection State & Load Balancer
- Failure Mode: The architecture mentions a round-robin load balancer. Round-robin only distributes the initial handshakes; because WebSocket connections are persistent, every client pinned to a server that crashes is disconnected at once.
- Impact: The frontend needs robust reconnection logic. If a user is mid-keystroke, any in-flight edits are sent to a dead server and lost.
- Solution: Implement Client-Side Buffering.
- The client should queue outgoing operations while disconnected and resend them upon reconnection. The server must be able to handle "re-sync" requests where the client says, "I have version 50, send me everything after that."
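The queue-and-resync behaviour can be sketched as follows. `server_log` stands in for the server's operation history, and the synchronous method calls are a simplification of what would really be WebSocket messages; all names are illustrative.

```python
# Sketch of client-side buffering: ops queue locally while the socket is
# down, then the client resyncs from its last known version and replays
# the queue on reconnect.
class BufferedClient:
    def __init__(self) -> None:
        self.version = 0               # last server version we have applied
        self.pending: list[dict] = []  # ops typed while disconnected
        self.connected = True

    def edit(self, op: dict, server_log: list[dict]) -> None:
        if self.connected:
            server_log.append(op)
            self.version = len(server_log)
        else:
            self.pending.append(op)    # queue instead of losing the edit

    def reconnect(self, server_log: list[dict]) -> list[dict]:
        self.connected = True
        missed = server_log[self.version:]  # "I have version N, send the rest"
        for op in self.pending:             # resend everything queued offline
            server_log.append(op)
        self.pending = []
        self.version = len(server_log)
        return missed

log: list[dict] = []
client = BufferedClient()
client.edit({"op": "A"}, log)   # sent live
client.connected = False        # the client's server crashes
client.edit({"op": "B"}, log)   # queued locally, not lost
log.append({"op": "C"})         # another user edits in the meantime
missed = client.reconnect(log)
print(missed, log)
```

A real implementation would also transform or merge the queued ops against the missed ones (via the OT/CRDT machinery from section 1) before applying them, rather than blindly appending.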
Summary of Recommended Architecture Changes
- Sync Strategy: Move from Last-Write-Wins to CRDTs (e.g., Yjs).
- Communication: Replace DB polling with Redis Pub/Sub for inter-server messaging.
- Storage: Store operation deltas in Postgres; use Redis as a write buffer.
- Security: Move JWTs to HttpOnly Cookies.
- Caching: Remove API response caching from the CDN.