This architecture contains several critical flaws that would prevent the application from functioning as a true real-time collaborative editor ("Google Docs" style). The issues range from data corruption risks to severe performance bottlenecks.
Here is a detailed breakdown of the failure modes, race conditions, and scaling bottlenecks, organized by severity.
1. Critical Failure: Collaboration Logic & Data Integrity
This is the most severe category. The proposed sync strategy effectively makes true collaboration impossible.
Issue A: "Paragraph-Level" Last-Write-Wins (LWW)
- The Flaw: If User A types "Hello" and User B types "World" in the same paragraph at the same time, the resolution logic compares timestamps.
- Result: Whoever pressed their key last (according to their client clock) overwrites the entire paragraph content of the other user. User A’s "Hello" is deleted and replaced by "World". This results in data loss.
- Solution: Implement Operational Transformation (OT) or Conflict-free Replicated Data Types (CRDTs).
- Instead of sending "Update Paragraph X to 'World'", send "Insert character 'W' at index 0".
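- With OT, the server transforms concurrent operations against each other before applying them; with CRDTs, operations commute by design. Either way, every replica converges to the same state regardless of arrival order.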
- Trade-off: Significantly higher complexity in backend logic. OT requires a central authority to handle transformations; CRDTs can be complex to implement efficiently.
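To make the "send operations, not paragraphs" idea concrete, here is a minimal sketch of transforming two concurrent inserts so both edits survive. The `InsertOp` shape and the tie-break on `clientId` are illustrative assumptions, not part of the reviewed design; a real OT system also has to handle deletes and longer operation histories.

```typescript
// Hypothetical operation shape: insert `text` at `index` in a paragraph.
interface InsertOp {
  index: number;
  text: string;
  clientId: string; // used only to break ties deterministically
}

function applyInsert(doc: string, op: InsertOp): string {
  return doc.slice(0, op.index) + op.text + doc.slice(op.index);
}

// Rewrite `op` so that it can be applied after `other` has already been applied.
function transformInsert(op: InsertOp, other: InsertOp): InsertOp {
  const otherGoesFirst =
    other.index < op.index ||
    (other.index === op.index && other.clientId < op.clientId);
  // If `other` landed at or before our position, shift right by its length.
  return otherGoesFirst ? { ...op, index: op.index + other.text.length } : op;
}

const base = "";
const opA: InsertOp = { index: 0, text: "Hello", clientId: "A" };
const opB: InsertOp = { index: 0, text: "World", clientId: "B" };

// Replica 1 sees A then B; replica 2 sees B then A.
const replica1 = applyInsert(applyInsert(base, opA), transformInsert(opB, opA));
const replica2 = applyInsert(applyInsert(base, opB), transformInsert(opA, opB));

console.log(replica1, replica2); // both replicas hold the same text
```

Both replicas end with "HelloWorld": neither user's text is discarded, which is the property paragraph-level LWW cannot provide.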
Issue B: Client-Side Timestamps
- The Flaw: Using Date.now() from the client browser for conflict resolution.
- Result: If a user's system clock is set to 1970, their edits always lose conflicts; if it runs 5 minutes fast, their edits always win. Malicious users can also manipulate clocks to force overwrites.
- Solution: Use Hybrid Logical Clocks (HLC), or have the server assign the ordering itself (sequence numbers or vector clocks), rather than trusting client wall clocks.
- Trade-off: Requires slightly more metadata per operation (node ID + counter + timestamp).
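A minimal Hybrid Logical Clock sketch shows why the extra metadata helps: timestamps stay monotonic even when the wall clock steps backward. The class and field names are illustrative assumptions, and the injected `now` function exists only so the behaviour can be shown deterministically.

```typescript
interface HlcTimestamp {
  wallMs: number;  // highest physical time observed so far
  counter: number; // logical counter for events within the same wallMs
  nodeId: string;  // final tie-breaker between nodes
}

class HybridLogicalClock {
  private wallMs = 0;
  private counter = 0;
  private readonly nodeId: string;
  private readonly now: () => number;

  constructor(nodeId: string, now: () => number = () => Date.now()) {
    this.nodeId = nodeId;
    this.now = now;
  }

  // Timestamp for a local event.
  tick(): HlcTimestamp {
    const physical = this.now();
    if (physical > this.wallMs) {
      this.wallMs = physical;
      this.counter = 0;
    } else {
      this.counter += 1; // clock stalled or went backward: advance logically
    }
    return { wallMs: this.wallMs, counter: this.counter, nodeId: this.nodeId };
  }

  // Merge a timestamp received from another node.
  receive(remote: HlcTimestamp): HlcTimestamp {
    const maxWall = Math.max(this.wallMs, remote.wallMs, this.now());
    if (maxWall === this.wallMs && maxWall === remote.wallMs) {
      this.counter = Math.max(this.counter, remote.counter) + 1;
    } else if (maxWall === this.wallMs) {
      this.counter += 1;
    } else if (maxWall === remote.wallMs) {
      this.counter = remote.counter + 1;
    } else {
      this.counter = 0;
    }
    this.wallMs = maxWall;
    return { wallMs: this.wallMs, counter: this.counter, nodeId: this.nodeId };
  }
}

// Negative if a is ordered before b, positive if after, 0 if identical.
function compareHlc(a: HlcTimestamp, b: HlcTimestamp): number {
  if (a.wallMs !== b.wallMs) return a.wallMs - b.wallMs;
  if (a.counter !== b.counter) return a.counter - b.counter;
  return a.nodeId < b.nodeId ? -1 : a.nodeId > b.nodeId ? 1 : 0;
}

let fakeNow = 1000;
const clock = new HybridLogicalClock("server-1", () => fakeNow);
const t1 = clock.tick();
fakeNow = 900; // wall clock steps backward
const t2 = clock.tick();
fakeNow = 1200;
const t3 = clock.tick();
const t4 = clock.receive({ wallMs: 1200, counter: 7, nodeId: "server-2" });
```

This sketch accepts any remote `wallMs`; a production clock would typically reject or clamp timestamps that sit too far ahead of local time, so that one bad clock cannot drag every node forward.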
Issue C: Full HTML Snapshots
- The Flaw: Storing documents as full HTML blobs every 30 seconds.
- Result:
- Data Loss: If a server crashes between snapshots, up to 30 seconds of work is lost.
- Version History: You cannot diff two HTML blobs efficiently to show "User A added this sentence".
- Network Overhead: Sending the full document state on every reconnect is bandwidth-heavy.
- Solution: Store the document as a Log of Operations (OpLog) or a binary CRDT state. Persist changes to the DB immediately as they happen (streaming), not in bulk snapshots.
- Trade-off: Reading a document requires "replaying" the history (slower read) or maintaining a snapshot + compaction strategy (higher implementation complexity).
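The snapshot-plus-log read path can be sketched as follows. The `Op` and `Snapshot` shapes and the server-assigned `seq` field are assumptions made for illustration, and the log is assumed to be sorted by `seq`.

```typescript
type Op =
  | { seq: number; kind: "insert"; index: number; text: string }
  | { seq: number; kind: "delete"; index: number; length: number };

interface Snapshot {
  seq: number;     // sequence number of the last op folded into `content`
  content: string;
}

function applyOp(content: string, op: Op): string {
  if (op.kind === "insert") {
    return content.slice(0, op.index) + op.text + content.slice(op.index);
  }
  return content.slice(0, op.index) + content.slice(op.index + op.length);
}

// Rebuild the current document from the latest snapshot plus newer ops only.
function loadDocument(snapshot: Snapshot, log: Op[]): Snapshot {
  let { seq, content } = snapshot;
  for (const op of log) {
    if (op.seq <= seq) continue; // already included in the snapshot
    content = applyOp(content, op);
    seq = op.seq;
  }
  return { seq, content };
}

const opLog: Op[] = [
  { seq: 1, kind: "insert", index: 0, text: "Hel" },
  { seq: 2, kind: "insert", index: 3, text: "lo" },
  { seq: 3, kind: "insert", index: 5, text: " World" },
  { seq: 4, kind: "delete", index: 0, length: 6 },
];

const loaded = loadDocument({ seq: 2, content: "Hello" }, opLog);
console.log(loaded); // { seq: 4, content: "World" }
```

Compaction then amounts to writing the returned value back as the new snapshot and trimming log entries with `seq` at or below it. The same log also supports version history, since each entry records who changed what.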
2. Architecture & Communication Bottlenecks
Issue D: Database Polling for Real-time Updates
- The Flaw: "Other servers poll PostgreSQL every 2 seconds for changes."
- Result:
- Latency: User B sees User A's edits with a 2+ second delay. This feels broken.
- Database Death Spiral: If you have 10 API servers, they are constantly hammering the database with SELECT * FROM changes every 2 seconds, regardless of whether users are typing.
- Solution: Implement a Pub/Sub Layer (e.g., Redis Pub/Sub).
- Server A receives an edit -> Writes to DB -> Publishes "DocID updated" to Redis.
- Server B, C, D subscribe to Redis -> Instantly receive the event -> Push to their WebSocket clients.
- Trade-off: Adds a dependency on Redis (though you already use it for caching, it needs to be configured for Pub/Sub).
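The fan-out flow above can be sketched with a small publish/subscribe interface. `InMemoryBus` is a stand-in so the example runs without infrastructure; in a deployment a Redis client would sit behind the same `MessageBus` contract. All class names, method names, and the channel naming are assumptions.

```typescript
type Handler = (message: string) => void;

// Minimal pub/sub contract; Redis Pub/Sub would implement this in production.
interface MessageBus {
  publish(channel: string, message: string): void;
  subscribe(channel: string, handler: Handler): void;
}

class InMemoryBus implements MessageBus {
  private handlers = new Map<string, Handler[]>();

  publish(channel: string, message: string): void {
    for (const handler of this.handlers.get(channel) ?? []) handler(message);
  }

  subscribe(channel: string, handler: Handler): void {
    const list = this.handlers.get(channel) ?? [];
    list.push(handler);
    this.handlers.set(channel, list);
  }
}

class ApiServer {
  // Stands in for ops pushed to this server's local WebSocket clients.
  readonly delivered: string[] = [];
  private readonly name: string;
  private readonly bus: MessageBus;

  constructor(name: string, bus: MessageBus) {
    this.name = name;
    this.bus = bus;
  }

  openDocument(docId: string): void {
    this.bus.subscribe(`doc:${docId}`, (message) => {
      const { origin, op } = JSON.parse(message) as { origin: string; op: string };
      // The origin server already served its own clients; skip the echo.
      if (origin !== this.name) this.delivered.push(op);
    });
  }

  handleEdit(docId: string, op: string): void {
    // A real server would persist the op before broadcasting it.
    this.bus.publish(`doc:${docId}`, JSON.stringify({ origin: this.name, op }));
  }
}

const bus = new InMemoryBus();
const serverA = new ApiServer("A", bus);
const serverB = new ApiServer("B", bus);
const serverC = new ApiServer("C", bus);
for (const server of [serverA, serverB, serverC]) server.openDocument("42");

serverA.handleEdit("42", "insert W@0");
```

Note that Redis Pub/Sub delivers only to subscribers connected at publish time, so a server that reconnects should catch up from the persisted operations rather than rely on the channel alone.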
Issue E: Synchronous DB Writes on Keystroke
- The Flaw: "Server writes change to PostgreSQL" immediately on the WebSocket event.
- Result: A relational database struggles with one synchronous write per keystroke. If 1,000 users type at once, the DB connection pool is exhausted, writes queue up behind it, and edit latency climbs until requests time out.
- Solution: Write-Behind Caching / Buffering.
- Hold edits in a fast store (Redis) or in-memory buffer.
- Flush to PostgreSQL asynchronously (e.g., every 1-2 seconds) or use a stream-optimized DB (like Cassandra/DynamoDB) for the OpLog.
- Trade-off: Small risk of data loss if the API server crashes immediately after acknowledging the WebSocket frame but before flushing to disk. Can be mitigated with replication.
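A write-behind buffer might look like the sketch below: edits accumulate in memory and are written in batches. The class name, the `maxBatch` limit, and the injected `write` function (which would wrap a bulk INSERT) are assumptions for illustration.

```typescript
type BatchWriter<T> = (batch: T[]) => Promise<void>;

class WriteBehindBuffer<T> {
  private pending: T[] = [];
  private flushing = false;
  private readonly write: BatchWriter<T>;
  private readonly maxBatch: number;

  constructor(write: BatchWriter<T>, maxBatch: number) {
    this.write = write;
    this.maxBatch = maxBatch;
  }

  get size(): number {
    return this.pending.length;
  }

  // Called on every incoming edit; never touches the database directly.
  add(item: T): void {
    this.pending.push(item);
  }

  // Called from a timer; resolves to the number of items written.
  async flush(): Promise<number> {
    if (this.flushing || this.pending.length === 0) return 0;
    this.flushing = true;
    const batch = this.pending.splice(0, this.maxBatch);
    try {
      await this.write(batch);
      return batch.length;
    } catch (err) {
      // Put the batch back at the front so ordering is kept for the retry.
      this.pending.unshift(...batch);
      throw err;
    } finally {
      this.flushing = false;
    }
  }
}

const writtenBatches: string[][] = [];
const buffer = new WriteBehindBuffer<string>(async (batch) => {
  writtenBatches.push(batch); // stand-in for one bulk database write
}, 2);

buffer.add("op-1");
buffer.add("op-2");
buffer.add("op-3");
void buffer.flush(); // first batch of two is handed to the writer
```

In a server, `flush()` would typically be driven by a timer at the 1-2 second interval mentioned above. Anything still in `pending` when the process dies is the data-loss window the trade-off describes.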
3. Infrastructure & Security Failures
Issue F: CDN Caching API Responses
- The Flaw: "CloudFront... also caches API responses for 5 minutes."
- Result: If a user loads the document, CloudFront might serve them a 4-minute-old version of the file. If they start editing based on that old version, they will create massive conflicts when they sync.
- Solution: Disable caching for the /api/docs/* endpoints or use strict Cache-Control: no-store headers. CDNs should only cache static assets (JS/CSS/Images).
- Trade-off: Higher load on the API servers for document fetches.
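One way to express that caching split is a single function that picks the Cache-Control value per request path. The `/api/` prefix check, the extension list, and the one-year lifetime (which assumes fingerprinted asset filenames) are illustrative assumptions.

```typescript
// Choose a Cache-Control header value based on the request path.
function cacheControlFor(path: string): string {
  // Document data must never be served from a shared cache.
  if (path.startsWith("/api/")) return "no-store";
  // Static assets are assumed to have content hashes in their filenames.
  if (/\.(js|css|png|jpg|svg|woff2)$/.test(path)) {
    return "public, max-age=31536000, immutable";
  }
  // Everything else (e.g. the HTML shell) is revalidated on each request.
  return "no-cache";
}

console.log(cacheControlFor("/api/docs/123"));
console.log(cacheControlFor("/static/app.3f9a1c.js"));
```

Setting the header at the origin keeps the policy in one place; the CDN behaviour for the API path should still be checked, since a CDN-level override can ignore origin headers.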
Issue G: JWT in localStorage
- The Flaw: Storing JWTs in localStorage makes them vulnerable to Cross-Site Scripting (XSS). If an attacker injects a script, they can steal the token and impersonate the user.
- Solution: Store JWTs in HttpOnly, Secure, SameSite Cookies.
- Trade-off: You must implement CSRF protection (e.g., double-submit cookies) because the browser will automatically send cookies with requests.
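As a rough sketch, the cookie attributes and a double-submit check could look like this. The cookie name `session`, the function names, and the idea of passing the CSRF token in a request header are assumptions; a production check would also use a constant-time comparison.

```typescript
// Build a Set-Cookie header value that page scripts cannot read.
function buildSessionCookie(token: string, maxAgeSeconds: number): string {
  return [
    `session=${encodeURIComponent(token)}`,
    "HttpOnly",        // not visible to document.cookie, so XSS cannot read it
    "Secure",          // sent over HTTPS only
    "SameSite=Strict", // withheld from cross-site requests
    "Path=/",
    `Max-Age=${maxAgeSeconds}`,
  ].join("; ");
}

// Double-submit check: the CSRF token in a (non-HttpOnly) cookie must match
// the copy the client echoes back in a request header.
function isCsrfTokenValid(cookieToken: string | undefined, headerToken: string | undefined): boolean {
  if (!cookieToken || !headerToken) return false;
  return cookieToken === headerToken;
}

const sessionCookie = buildSessionCookie("a.b.c", 900);
console.log(sessionCookie);
```

The session cookie and the CSRF token are separate values: the first is hidden from scripts, while the second must be readable by the page so it can be copied into the header.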
Issue H: Round-Robin Load Balancing
- The Flaw: "Round-Robin" distribution for WebSockets.
- Result: While WebSockets are persistent, this setup implies a lack of "Sticky Sessions" for the initial handshake or HTTP fallback. If a client disconnects and reconnects quickly, they might hit a different server. If that server hasn't polled the DB yet (due to the 2s lag), the user gets stale data.
- Solution: Ensure the Load Balancer supports Sticky Sessions (Session Affinity) based on IP or cookie, though this is less critical once the Redis Pub/Sub solution (Issue D) is implemented.
Summary of Proposed Refactor
To make this architecture viable, you need to shift from a "Database-centric polling" model to a "Stream-centric push" model.
- Frontend: Send Ops (Insert/Delete + Index), not text blobs.
- Backend: Use Redis Pub/Sub to broadcast ops between API servers instantly.
- Storage: Use an Append-Only Log in the database (NoSQL or Postgres JSONB) to store edits; do not overwrite the document HTML.
- Conflict Resolution: Adopt an existing library such as Yjs (CRDT) or ShareDB (OT) rather than building custom timestamp logic.