The proposed architecture has several critical flaws that would prevent it from functioning as a reliable, real-time collaborative editor at scale.
Here is a breakdown of the issues, categorized for clarity.
Category 1: Race Conditions & Data Integrity
Issue 1: "Last-Write-Wins" with Client-Side Timestamps
- Problem: This is the most critical data integrity issue. Relying on client clocks is a recipe for data loss and corruption. Client clocks can be incorrect due to drift, manual user changes, or different time zones.
- Scenario: User A (in New York) edits a paragraph at 10:00:00 EST. User B (in London) edits the same paragraph a second later at 15:00:01 GMT (which is 10:00:01 EST). If User A's computer clock is 2 seconds fast, their change will be timestamped 10:00:02 and will incorrectly overwrite User B's later change. The result is non-deterministic and leads to lost work.
- Solution: Implement Operational Transformation (OT) or CRDTs.
- OT (Operational Transformation): This is the classic approach used by Google Docs. All changes (operations) are sent to a central server with a version number. The server acts as the single source of truth, sequencing all operations. When it receives an operation, it may need to "transform" it against other operations that have happened concurrently before applying it and broadcasting the transformed operation to all clients.
- CRDT (Conflict-free Replicated Data Type): A more modern approach. Changes are structured in a way that they can be merged in any order and will always converge to the same result. This is mathematically provable and removes the need for a central transformation server, though a central server is still useful for sequencing and broadcasting.
- Trade-offs:
- OT: Pro: Strong consistency and a well-understood model for text editing. Con: The transformation logic can become incredibly complex to get right, and it requires a centralized server to sequence operations, which can be a bottleneck if not designed carefully.
- CRDT: Pro: More resilient to offline work and decentralized by nature. Merge logic is often simpler than OT transforms. Con: Can have higher memory/storage overhead, as deletes are often "tombstones" rather than true removals. The resulting merged state might sometimes be mathematically correct but not what a user intuitively expected.
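To make the OT approach concrete, here is a minimal sketch of transforming two concurrent insert operations so that both sites converge. The types and tie-breaking rule are illustrative, not a production transform; real OT libraries (ShareDB, for example) also handle deletes, retains, and rich-text attributes.

```typescript
// Minimal OT sketch: transform one insert against a concurrent insert
// so both clients converge on the same final text.
type Insert = { pos: number; text: string; clientId: string };

// Shift `op`'s position if `concurrent` inserted earlier in the document.
// Ties at the same position are broken by clientId so both sides agree.
function transformInsert(op: Insert, concurrent: Insert): Insert {
  if (
    concurrent.pos < op.pos ||
    (concurrent.pos === op.pos && concurrent.clientId < op.clientId)
  ) {
    return { ...op, pos: op.pos + concurrent.text.length };
  }
  return op;
}

function apply(doc: string, op: Insert): string {
  return doc.slice(0, op.pos) + op.text + doc.slice(op.pos);
}

// Both users start from "ac". A inserts "b" at 1, B inserts "d" at 2.
const a: Insert = { pos: 1, text: "b", clientId: "A" };
const b: Insert = { pos: 2, text: "d", clientId: "B" };
// Site A applies its own op, then B's op transformed against a:
console.log(apply(apply("ac", a), transformInsert(b, a))); // "abcd"
// Site B applies its own op, then A's op transformed against b:
console.log(apply(apply("ac", b), transformInsert(a, b))); // "abcd"
```

Both sites end at "abcd" regardless of the order in which they learned of the operations, which is exactly the convergence property the server's sequencing and transformation must preserve.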
Category 2: Scaling Bottlenecks & Latency
Issue 2: Real-time Communication Silos
- Problem: Each API server only broadcasts changes to its own connected clients. Users on different servers will not see each other's changes in real-time. The 2-second polling interval means a user on Server B will see a change from a user on Server A with a delay of up to 2 seconds, destroying the real-time experience.
- Solution: Use a Pub/Sub Message Bus.
- Introduce a dedicated messaging layer like Redis Pub/Sub, RabbitMQ, or Kafka.
- New Data Flow:
- User's client sends a change to Server A.
- Server A validates the change and publishes it to a specific channel/topic (e.g., doc-changes:document-123).
- All API servers (A, B, C...) are subscribed to these channels.
- Servers A, B, and C all receive the message from the bus instantly and broadcast it down to their respective connected WebSocket clients.
- Trade-offs:
- Pro: Achieves true real-time, low-latency communication across the entire cluster. Decouples the API servers from each other.
- Con: Introduces a new system component (the message bus) that must be managed, scaled, and monitored. This adds to infrastructure complexity.
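A sketch of this fan-out on one API server, assuming Redis Pub/Sub via the ioredis client and the ws WebSocket library; the channel naming and message shape are illustrative.

```typescript
import Redis from "ioredis";
import { WebSocket } from "ws";

// Separate connections: a subscribed Redis connection cannot issue other
// commands, so each server keeps one publisher and one subscriber.
const pub = new Redis();
const sub = new Redis();

// docId -> WebSocket clients connected to *this* server.
const localClients = new Map<string, Set<WebSocket>>();

// Called when a client of this server submits a validated change.
async function onLocalChange(docId: string, change: unknown): Promise<void> {
  await pub.publish(`doc-changes:${docId}`, JSON.stringify(change));
}

// Every server subscribes; a pattern subscription covers all documents.
sub.psubscribe("doc-changes:*");
sub.on("pmessage", (_pattern, channel, message) => {
  const docId = channel.split(":")[1];
  for (const ws of localClients.get(docId) ?? []) {
    ws.send(message); // fan out to this server's sockets for the document
  }
});
```

Note that the publishing server also receives its own message from the bus, which gives every server an identical code path for broadcasting.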
Issue 3: Database Polling as a Sync Mechanism
- Problem: Having every server poll the primary PostgreSQL database every 2 seconds is extremely inefficient and will not scale. With 100 servers, the database would receive 50 queries per second, even if no documents are being edited. This places a huge, unnecessary load on the primary write database, which is often the most critical bottleneck in a system.
- Solution: Event-Driven Architecture (via Pub/Sub).
- This problem is solved entirely by implementing the Pub/Sub message bus described in the previous point. The database is no longer used for real-time notifications. It is only accessed for initial document loads and for persisting changes, completely removing the polling load.
- Trade-offs:
- Pro: Massively reduces database load, allowing the DB to scale more effectively for its primary purpose (storage). Enables true real-time updates.
- Con: Relies on the availability and performance of the chosen message bus.
Issue 4: Inefficient Full Document Snapshots
- Problem: Saving the entire HTML document every 30 seconds is highly inefficient for both I/O and network traffic. A one-character change can force a rewrite of the entire document, potentially megabytes of data. It also makes it impossible to implement granular versioning or a robust undo/redo feature.
- Solution: Store Deltas/Operations.
- Instead of saving snapshots, persist the individual operations (or "deltas") that users generate. This is a natural fit for an OT or CRDT system. For example, store records like (doc_id, user_id, version, operation_payload).
- To prevent infinite replay on document load, a background worker can periodically create a new full snapshot and truncate the operation log up to that point.
- Trade-offs:
- Pro: Hugely efficient storage. Enables powerful features like precise version history ("see what changed between 10:05 and 10:06") and reliable undo/redo.
- Con: Loading a document becomes more complex. You must fetch the latest snapshot and replay all subsequent operations to construct the current state.
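A sketch of that load path, with the storage layer and operation types as hypothetical stand-ins for your database client and OT/CRDT library:

```typescript
// Snapshot-plus-replay document load. DocStore and applyOp are
// hypothetical stand-ins for the DB client and the OT/CRDT apply logic.
type DocState = string; // simplified: plain text
type Operation = { pos: number; text: string };

interface Snapshot { version: number; content: DocState }
interface OpRecord { version: number; payload: Operation }

interface DocStore {
  latestSnapshot(docId: string): Promise<Snapshot>;
  opsSince(docId: string, afterVersion: number): Promise<OpRecord[]>;
}

function applyOp(doc: DocState, op: Operation): DocState {
  return doc.slice(0, op.pos) + op.text + doc.slice(op.pos);
}

async function loadDocument(db: DocStore, docId: string): Promise<DocState> {
  const snap = await db.latestSnapshot(docId);
  // Replay only operations logged after the snapshot, in version order.
  const ops = await db.opsSince(docId, snap.version);
  return ops.reduce((doc, op) => applyOp(doc, op.payload), snap.content);
}
```

The background compaction worker bounds the length of the replay, so load time stays roughly constant regardless of a document's total edit history.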
Category 3: Failure Modes & Reliability
Issue 5: Loss of WebSocket State on Server Failure
- Problem: The architecture states that "each API server maintains its own WebSocket connections." If a server crashes, all its WebSocket connections are terminated. The round-robin load balancer will redirect the client to a new server, but the context (which document the user was in) is lost, forcing a disruptive reconnection flow.
- Solution: Decouple Connection and Session Management.
- The WebSocket connection logic on the client must be robust, with automatic reconnection.
- When a client reconnects, it presents its JWT. The new server uses the session information (which should be in Redis, not server memory) to look up the user's state, including the document_id they were last editing. The server can then seamlessly re-subscribe them to the document's Pub/Sub channel without the user noticing anything more than a brief flicker.
- Trade-offs:
- Pro: Creates a highly fault-tolerant system where individual server failures have minimal impact on users.
- Con: Requires more sophisticated client-side connection management and disciplined state management on the backend (i.e., never store session state in a specific server's memory).
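A minimal sketch of the client-side reconnect loop; the backoff constants and the resubscribe message shape are illustrative.

```typescript
// Client-side auto-reconnect with exponential backoff. On reconnect the
// client re-announces the document it was editing; the server restores
// session state from Redis and re-subscribes it to the Pub/Sub channel.
function connect(url: string, docId: string, attempt = 0): void {
  const ws = new WebSocket(url);

  ws.onopen = () => {
    attempt = 0; // reset backoff after a successful connection
    // Hypothetical message shape; auth travels via cookie or token.
    ws.send(JSON.stringify({ type: "resubscribe", docId }));
  };

  ws.onclose = () => {
    // Exponential backoff, capped at 10s, so a crashed server's clients
    // do not all stampede the replacement at the same instant.
    const delay = Math.min(10_000, 500 * 2 ** attempt);
    setTimeout(() => connect(url, docId, attempt + 1), delay);
  };
}
```

In production you would also add jitter to the delay, since thousands of clients reconnecting on identical schedules can themselves create a thundering-herd problem.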
Issue 6: Stale API Cache with CDN
- Problem: Caching API responses for a highly dynamic, real-time application is dangerous. If a document's title is changed, the real-time view will update instantly, but a browser refresh might hit a CDN-cached API response for /api/documents/{id} and show the old title for up to 5 minutes. This creates a confusing and inconsistent user experience.
- Solution: Do Not Cache Dynamic API Responses.
- The CDN should only be used for its primary purpose: serving static assets (JavaScript, CSS, images, fonts).
- All API requests (/api/*) should be configured to bypass the CDN cache and go directly to the origin (the load balancer).
- Trade-offs:
- Pro: Guarantees data consistency between the API and the real-time system.
- Con: Places higher load on the API servers, as they must serve all requests. This is a necessary trade-off for a system where data freshness is paramount.
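As a defensive measure on the origin side, a sketch using Express middleware to mark every API response as uncacheable, so even a misconfigured CDN edge will not store it:

```typescript
import express from "express";

const app = express();

// Belt and braces: even if the CDN is configured to bypass /api/*,
// an explicit Cache-Control header stops any intermediary from
// caching dynamic responses.
app.use("/api", (_req, res, next) => {
  res.set("Cache-Control", "no-store");
  next();
});

app.listen(3000);
```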
Category 4: Security
Issue 7: JWT Storage in localStorage
- Problem: Storing JWTs in localStorage makes the application vulnerable to Cross-Site Scripting (XSS). If an attacker finds an XSS vulnerability, they can inject script to steal the JWT from localStorage and use it to impersonate the user completely.
- Solution: Use Secure, HttpOnly Cookies.
- The authentication server should set the JWT in a cookie with the HttpOnly, Secure, and SameSite=Strict flags.
- HttpOnly: Prevents JavaScript from accessing the cookie, mitigating XSS-based theft.
- Secure: Ensures the cookie is only sent over HTTPS.
- SameSite=Strict: Prevents the browser from sending the cookie on cross-site requests, mitigating CSRF.
- Trade-offs:
- Pro: Significantly more secure than localStorage. The browser handles token storage and transmission automatically and securely.
- Con: Requires protection against Cross-Site Request Forgery (CSRF) because cookies are sent automatically with requests. WebSocket authentication can also be slightly trickier: the browser's WebSocket API cannot attach custom headers, so the handshake must rely on the cookie being sent automatically (which works for same-site endpoints) or on a separate HTTP-based auth step before connecting.
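A sketch of setting such a cookie with Express; issueJwt is a hypothetical stand-in for the credential check and token signing.

```typescript
import express from "express";

// Hypothetical stand-in: validate credentials and sign a JWT.
declare function issueJwt(req: express.Request): string;

const app = express();

app.post("/login", (req, res) => {
  res.cookie("session", issueJwt(req), {
    httpOnly: true,     // invisible to JavaScript, blunting XSS token theft
    secure: true,       // sent over HTTPS only
    sameSite: "strict", // withheld on cross-site requests, blunting CSRF
    maxAge: 60 * 60 * 1000, // 1 hour; illustrative
  });
  res.sendStatus(204);
});
```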