This architecture contains several critical flaws that would prevent it from functioning as a "Google Docs" clone. The most significant issues involve data integrity, synchronization latency, and security.
1. Conflict Resolution: "Last-Write-Wins" (LWW) with Client Clocks
- The Problem: Client clocks are never perfectly synchronized, and LWW with client timestamps breaks causality: a user whose clock runs fast can "revert" newer, legitimate changes made by others, while a user with a lagging clock can never make their own edits stick. Furthermore, LWW at the paragraph level means that if two users type in the same paragraph simultaneously, one user's entire contribution simply vanishes.
- The Solution: Use Operational Transformation (OT) or Conflict-free Replicated Data Types (CRDTs) (e.g., Yjs or Automerge).
- Trade-off: Significantly higher implementation complexity. OT requires a central "source of truth" (server), while CRDTs increase the payload size as they store metadata for every character/operation.
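The clock-skew failure above can be reproduced in a few lines. This is a minimal sketch (the Edit shape and timestamps are illustrative, not from the original design): user A's clock runs five minutes fast, so A's earlier edit carries a later timestamp and LWW discards B's newer work.

```typescript
// Illustrative paragraph-level edit; clientTimestamp is the *client's* clock.
interface Edit { paragraphId: string; text: string; clientTimestamp: number; }

// LWW merge: keeps whichever edit claims the later client timestamp.
function lwwMerge(current: Edit, incoming: Edit): Edit {
  return incoming.clientTimestamp > current.clientTimestamp ? incoming : current;
}

const t = 1_700_000_000_000;
// B edits the paragraph 60s after A in real time...
const fromB: Edit = { paragraphId: "p1", text: "B's newer text", clientTimestamp: t + 60_000 };
// ...but A's clock runs 5 minutes fast, so A's older edit is stamped later.
const fromA: Edit = { paragraphId: "p1", text: "A's older text", clientTimestamp: t + 300_000 };

const winner = lwwMerge(fromB, fromA);
// winner is A's stale edit: the skewed clock, not causality, decided the merge.
```

OT and CRDT libraries avoid this by ordering operations with logical clocks (versions, Lamport timestamps) rather than wall-clock time.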
2. Real-time Pub/Sub: Server Silos
- The Problem: The architecture states servers only broadcast to clients connected to that server, and others poll every 2 seconds. This means User A (Server 1) sees their own edits instantly, but User B (Server 2) sees them up to 2 seconds later. This makes collaborative editing feel broken and causes constant merge conflicts.
- The Solution: Implement a Redis Pub/Sub or NATS backbone. When Server 1 receives an update, it publishes to a Redis channel for that Document ID. All other servers subscribe to that channel and push the update to their connected clients instantly.
- Trade-off: Adds a dependency on Redis; if Redis lags, the entire real-time experience lags.
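The fan-out pattern can be sketched as follows. To keep the example self-contained, Node's EventEmitter stands in for the Redis/NATS backbone (a real system would call redis publish/subscribe with the same channel-per-document shape); the server and channel names are illustrative.

```typescript
import { EventEmitter } from "node:events";

// Stand-in for the Redis/NATS backbone shared by all app servers.
const backbone = new EventEmitter();

interface AppServer { id: string; delivered: string[]; }

// Each server subscribes to the channel for every document it hosts clients for.
function subscribeServer(server: AppServer, docId: string): void {
  backbone.on(`doc:${docId}`, (update: string) => {
    server.delivered.push(update); // fan out to this server's WebSocket clients
  });
}

// When any server receives an edit, it publishes to the document's channel.
function publishUpdate(docId: string, update: string): void {
  backbone.emit(`doc:${docId}`, update);
}

const server1: AppServer = { id: "s1", delivered: [] };
const server2: AppServer = { id: "s2", delivered: [] };
subscribeServer(server1, "doc-42");
subscribeServer(server2, "doc-42");

// An edit arrives at server1; every server fans it out immediately. No 2s poll.
publishUpdate("doc-42", "insert 'x' at offset 10");
```

The key property is that propagation latency drops from "up to 2 seconds" to a single network hop through the message bus.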
3. Storage Strategy: HTML Snapshots
- The Problem: Saving full HTML snapshots every 30 seconds is extremely heavy on I/O and makes "undo" history or granular versioning impossible. Furthermore, if a server crashes at second 29, up to 29 seconds of work are lost unless the "real-time" path also writes every individual change to Postgres (which isn't optimized for high-frequency small writes).
- The Solution: Store an initial snapshot and then an append-only log of operations (diffs). Use a background worker to periodically "squash" these operations into a new snapshot.
- Trade-off: Requires a more complex "reconstruction" logic to load a document (Snapshot + Diffs).
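That reconstruction logic is simpler than it sounds. Here is a minimal sketch assuming insert-only operations (real OT/CRDT ops also cover deletes and formatting):

```typescript
// Simplified insert-only operation from the append-only log.
interface InsertOp { pos: number; text: string; }

// Load path: rebuild the live document from the last snapshot plus the op log.
function reconstruct(snapshot: string, log: InsertOp[]): string {
  return log.reduce(
    (doc, op) => doc.slice(0, op.pos) + op.text + doc.slice(op.pos),
    snapshot,
  );
}

// Background worker's "squash": materialize a new snapshot and truncate the log.
function squash(snapshot: string, log: InsertOp[]): { snapshot: string; log: InsertOp[] } {
  return { snapshot: reconstruct(snapshot, log), log: [] };
}
```

Because the log is append-only, every intermediate version remains recoverable until a squash, which is what makes granular history and undo feasible.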
4. API Caching: CloudFront Caching
- The Problem: Caching API responses for 5 minutes at the CDN level is catastrophic for a collaborative editor. A user might refresh the page and see a version of the document from 4 minutes ago, even though they just spent those 4 minutes editing it.
- The Solution: Disable CDN caching for dynamic document data. Use ETags or Cache-Control: no-cache. Rely on Redis for fast document state retrieval.
- Trade-off: Increases the load on your origin servers and database.
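One way to keep revalidation cheap after disabling CDN caching is to pair no-cache with an ETag, so unchanged documents cost a 304 rather than a full body. A sketch (the version string used as the ETag is an assumed per-document revision id):

```typescript
// Headers the origin returns for document API responses.
function docHeaders(version: string): Record<string, string> {
  return {
    "Cache-Control": "no-cache", // CDN and browser must revalidate every time
    "ETag": `"${version}"`,      // enables cheap 304 responses on revalidation
  };
}

// Conditional GET: if the client's If-None-Match matches the current
// revision, answer 304 Not Modified instead of resending the document.
function statusFor(ifNoneMatch: string | undefined, version: string): number {
  return ifNoneMatch === `"${version}"` ? 304 : 200;
}
```

Note that no-cache means "revalidate before use", not "never store", which is exactly the behavior a refresh-after-editing flow needs.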
5. Security: JWT in LocalStorage & 24h Expiry
- The Problem: LocalStorage is vulnerable to XSS attacks. If a malicious script runs, it can steal the JWT. Additionally, a 24-hour expiry without a revocation mechanism (blacklist) means if a user is fired or a token is stolen, they have access for up to a full day.
- The Solution: Store JWTs in HttpOnly, Secure cookies. Implement Short-lived Access Tokens (15 min) and Refresh Tokens stored in the database to allow immediate revocation.
- Trade-off: Slightly more complex frontend/backend handshake; cookies can introduce CSRF risks (must use SameSite attributes).
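The cookie attributes above can be made concrete. This sketch builds the Set-Cookie value for the access token; the cookie name and the 15-minute TTL are assumptions for illustration:

```typescript
// Set-Cookie value for the short-lived access token.
function accessTokenCookie(jwt: string): string {
  return [
    `access_token=${jwt}`,
    "HttpOnly",           // script (and therefore any XSS payload) cannot read it
    "Secure",             // sent over HTTPS only
    "SameSite=Strict",    // mitigates the CSRF risk that cookies introduce
    `Max-Age=${15 * 60}`, // 15-minute lifetime; refresh token rotates it
    "Path=/",
  ].join("; ");
}
```

The refresh token would be stored server-side (per the point above) so that deleting its database row revokes access within one access-token lifetime.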
6. Scaling Bottleneck: Round-Robin Load Balancing
- The Problem: With round-robin, two users collaborating on the same doc will likely end up on different servers. This exacerbates the "Server Silo" issue mentioned in point #2.
- The Solution: Use Sticky Sessions (Session Affinity) based on Document ID (or Organization ID). Alternatively, use a "Socket Worker" pattern where all traffic for a specific Document ID is routed to a specific node.
- Trade-off: Can lead to "hot spots" where one server is overloaded because a specific document is viral/highly active, while other servers are idle.
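Document-based affinity reduces to a deterministic hash of the Document ID, so every connection for a given doc lands on the same node regardless of which load balancer instance handles it. A minimal sketch (modulo hashing; a production system would likely use consistent hashing to survive server churn):

```typescript
import { createHash } from "node:crypto";

// Deterministic document -> server routing: the same docId always maps to
// the same node, so all collaborators on a doc share one server.
function routeDocument(docId: string, servers: string[]): string {
  const digest = createHash("sha256").update(docId).digest();
  return servers[digest.readUInt32BE(0) % servers.length];
}

const pool = ["ws-1", "ws-2", "ws-3"];
const node = routeDocument("doc-42", pool);
```

The hot-spot trade-off is visible here: nothing prevents one viral document from pinning all of its traffic to a single node.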
7. Database Bottleneck: PostgreSQL Writes
- The Problem: Writing every single keystroke (change event) directly to PostgreSQL will quickly exhaust the connection pool and disk I/O under heavy load.
- The Solution: Buffer writes in Redis or a message queue (Kafka). Batch these writes before committing them to PostgreSQL.
- Trade-off: Risk of losing a few seconds of data if the buffer/queue fails before the database write.
8. Race Condition: The "Polling" Gap
- The Problem: If Server A writes to the DB and Server B is polling every 2 seconds, there is a window where Server B overwrites Server A's data because it hasn't "seen" the update yet (especially with LWW).
- The Solution: This is solved by the Redis Pub/Sub solution in point #2 and the OT/CRDT solution in point #1. You must treat the document as a stream of events, not a series of static states.
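A useful safety net alongside those fixes is an optimistic version check on every write, so a server that hasn't yet seen the latest ops cannot silently overwrite them. This sketch shows the check only, not the OT/CRDT machinery that would then rebase the rejected op:

```typescript
// Event-stream view of a document: a monotonically increasing version
// plus the ordered log of accepted operations.
interface DocState { version: number; ops: string[]; }

// A write must declare the version it was based on. If that version is
// stale, the write is rejected and the caller must fetch the newer ops,
// transform its op against them, and retry.
function applyOp(doc: DocState, baseVersion: number, op: string): boolean {
  if (baseVersion !== doc.version) return false; // stale write: reject
  doc.ops.push(op);
  doc.version += 1;
  return true;
}

const doc: DocState = { version: 0, ops: [] };
applyOp(doc, 0, "insert 'a' at 0"); // accepted, version -> 1
applyOp(doc, 0, "insert 'b' at 0"); // rejected: based on a stale version
```

This is the "stream of events, not static states" idea in miniature: state transitions are only valid relative to the exact version they were derived from.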