This is a classic architecture for a real-time collaborative system, but it contains several critical flaws. Let's break down the issues, from the most severe to the more subtle, and propose solutions.
1. Race Conditions (The Most Critical Flaws)
These are issues where the timing of events leads to an incorrect or inconsistent state.
Issue 1: Flawed Conflict Resolution (Last-Write-Wins with Client Clocks)
- Problem: This is the most significant flaw. Client clocks are unreliable. They can be set incorrectly, drift over time, or even be manipulated by a malicious user. A user with a clock set 5 minutes in the future could make an edit, and for the next 5 minutes, all their changes would silently overwrite everyone else's work, even if those others were actively typing. This guarantees non-deterministic and frequent data loss.
- Solution: Abandon client-side clocks for conflict resolution. Instead, use a proper concurrency control algorithm.
- Option A (Good): Server-Generated Timestamps. When a change is received, the server assigns it a monotonic timestamp (e.g., from a database sequence or a high-precision timer). The server then applies the LWW logic using this authoritative timestamp.
- Option B (Better): Operational Transformation (OT). This is the algorithm Google Docs originally used. When a change is received, the server transforms it against any concurrent changes that have already been applied. It then sends this transformed operation back to the client and to other users. This preserves the user's intent.
- Option C (Best for Modern Architectures): Conflict-free Replicated Data Types (CRDTs). The document is represented as a CRDT (e.g., a list of characters with unique IDs). Edits are operations that can be applied in any order and are mathematically guaranteed to eventually converge to the same state on all clients without complex conflict resolution.
- Trade-offs:
- Server Timestamps: Simple to implement but still results in data loss for true concurrent edits (e.g., two users typing in the same spot). "Winner takes all" is rarely the desired user experience.
- OT: Extremely complex to implement correctly. The server must become the single source of truth for the operational history, which can be a bottleneck. Debugging OT issues is notoriously difficult.
- CRDTs: The client-side logic is more complex than with OT. The final state might sometimes look strange to users until it converges (e.g., deleted text might reappear briefly before being removed by another operation). However, CRDTs are far more scalable and robust, especially for architectures without a central coordinating server.
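To make Option A concrete, here is a minimal sketch of last-write-wins resolution keyed off a server-assigned monotonic version rather than a client clock. The `ParagraphEdit` shape and the in-memory store are illustrative assumptions, not part of the original design.

```typescript
// Minimal sketch: LWW per paragraph, ordered by a server-assigned version.
// The types and in-memory store are illustrative assumptions.

interface ParagraphEdit {
  paragraphId: string;
  content: string;
}

interface StoredParagraph {
  content: string;
  version: number; // server-assigned, monotonically increasing
}

let nextVersion = 0; // in production this would be a database sequence

const paragraphs = new Map<string, StoredParagraph>();

// Called when the server receives an edit over the WebSocket.
// The client never supplies a timestamp; the server decides the order.
function applyEdit(edit: ParagraphEdit): StoredParagraph {
  const version = ++nextVersion;
  const stored: StoredParagraph = { content: edit.content, version };
  paragraphs.set(edit.paragraphId, stored);
  return stored; // broadcast { ...edit, version } to all clients
}

// Clients accept a broadcast only if its version is newer than what they hold,
// so every replica converges on the same "last" write.
function applyBroadcast(
  local: StoredParagraph | undefined,
  incoming: StoredParagraph
): StoredParagraph {
  return !local || incoming.version > local.version ? incoming : local;
}
```

Note that this only makes LWW deterministic; it still discards one of two truly concurrent edits, which is exactly the limitation that pushes mature editors toward OT or CRDTs.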
Issue 2: Inconsistent Real-time State Across Servers
- Problem: The data flow creates a "split-brain" scenario. A user connected to Server A will see changes from other users on Server A instantly. However, they won't see changes from users on Server B for up to 2 seconds (the polling interval). This means users on different servers are temporarily editing different versions of the document, leading to a jarring experience and more frequent conflicts for the LWW resolver to handle.
- Solution: Decouple real-time broadcasting from the API servers using a message broker.
- Introduce a message bus like Redis Pub/Sub, Kafka, or RabbitMQ.
- When any API server receives a change via WebSocket, it immediately publishes it to a topic specific to that document (e.g., `doc-updates:12345`); a sketch of this fan-out follows the trade-offs below.
- All API servers subscribe to these topics. When a server receives a message on the topic, it broadcasts the change to all of its own connected WebSocket clients for that document.
- Trade-offs:
- Pro: Solves the consistency and latency problem across server instances. It's the standard pattern for building real-time systems at scale.
- Con: Adds another piece of infrastructure (the message bus) that must be managed, monitored, and made highly available. Redis Pub/Sub is simple but can lose messages if a subscriber disconnects; Kafka is more durable but significantly more complex to operate.
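As a sketch of the fan-out pattern, assuming the `ioredis` and `ws` libraries and the `doc-updates:<id>` channel convention from above:

```typescript
import Redis from "ioredis";
import { WebSocket } from "ws";

// Two connections: a Redis connection in subscriber mode cannot publish.
const pub = new Redis();
const sub = new Redis();

// docId -> sockets connected to *this* server instance
const localClients = new Map<string, Set<WebSocket>>();

// When any server receives an edit over a WebSocket, publish it immediately.
async function onEditReceived(docId: string, change: object): Promise<void> {
  await pub.publish(`doc-updates:${docId}`, JSON.stringify(change));
}

// Every server subscribes to the pattern and re-broadcasts each message
// to its own connected clients for that document.
async function startFanOut(): Promise<void> {
  await sub.psubscribe("doc-updates:*");
  sub.on("pmessage", (_pattern, channel, message) => {
    const docId = channel.split(":")[1];
    for (const socket of localClients.get(docId) ?? []) {
      if (socket.readyState === WebSocket.OPEN) {
        socket.send(message);
      }
    }
  });
}
```

With this in place, which API server a user happens to land on no longer determines which edits they see or how quickly they see them.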
Issue 3: Stale Data from CDN Caching
- Problem: Caching API responses for a collaborative document for 5 minutes is catastrophic. If User A loads a document, the CDN caches the response. User B makes a change. If User C then requests the same document, they will get the 5-minute-old, stale version from the CDN, completely missing User B's edit.
- Solution: Do not cache stateful API endpoints. The CDN should only be used for static assets (JS, CSS, images). API endpoints that retrieve or modify document state must bypass the CDN and hit the live application servers every time.
- Trade-offs:
- Pro: Guarantees data freshness and integrity.
- Con: Increases the load on your backend servers, as they can't offload these requests to the CDN. This is a necessary trade-off for a dynamic, collaborative application.
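One way to enforce this split is with explicit cache headers. A sketch using Express (the route prefixes are assumptions about how the app is laid out):

```typescript
import express from "express";

const app = express();

// Document API responses must never be cached by the CDN or the browser.
app.use("/api", (_req, res, next) => {
  res.set("Cache-Control", "no-store");
  next();
});

// Static assets are safe to cache aggressively; fingerprinted filenames
// (e.g., app.3f2a1c.js) make long max-age values harmless.
app.use(
  "/static",
  express.static("public", { maxAge: "365d", immutable: true })
);
```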
2. Failure Modes
These are points where the system can break down completely.
Issue 1: Data Loss on Server Failure
- Problem: The storage strategy is to save a full HTML snapshot every 30 seconds. If a server crashes between snapshots, every change made since the last snapshot, up to ~30 seconds of work held only in memory, is lost forever. This is an unacceptable level of data loss for a document editor.
- Solution: Adopt an event-sourcing or command-sourcing model.
- Log all operations: Every single change (keystroke, formatting, deletion) is written as an immutable event to a durable log (e.g., a dedicated `document_events` table in PostgreSQL or a system like Kafka); a sketch of this write-and-replay path follows the trade-offs below.
- Snapshot periodically: Continue to take snapshots (e.g., every 100 operations or 5 minutes), but treat them as a performance optimization, not the primary source of data.
- Recovery: To reconstruct a document, you load the latest snapshot and then replay all events that occurred after that snapshot's timestamp.
- Trade-offs:
- Pro: Extremely durable. You can reconstruct the exact state of a document at any point in time. Data loss window is reduced to milliseconds.
- Con: Increased write volume to the database. Event logs can grow very large and require a compaction/truncation strategy. Replaying many events to load a document can be slower than reading a single snapshot (mitigated by frequent snapshots).
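A sketch of the append and recovery paths, assuming node-postgres and hypothetical `document_events` / `document_snapshots` tables with the columns shown:

```typescript
import { Pool } from "pg";

const pool = new Pool();

// Append every operation as an immutable event (the durable write path).
async function appendEvent(docId: string, op: object): Promise<void> {
  await pool.query(
    "INSERT INTO document_events (doc_id, payload, created_at) VALUES ($1, $2, now())",
    [docId, JSON.stringify(op)]
  );
}

// Recovery: load the latest snapshot, then replay every event appended after it.
// Assumes document_events.id is a BIGSERIAL and snapshots record last_event_id.
async function loadDocument(docId: string) {
  const snap = await pool.query(
    `SELECT content, last_event_id FROM document_snapshots
     WHERE doc_id = $1 ORDER BY last_event_id DESC LIMIT 1`,
    [docId]
  );
  const lastEventId = snap.rows[0]?.last_event_id ?? 0;
  const events = await pool.query(
    `SELECT payload FROM document_events
     WHERE doc_id = $1 AND id > $2 ORDER BY id`,
    [docId, lastEventId]
  );
  return {
    snapshot: snap.rows[0]?.content ?? null,
    events: events.rows.map((r) => r.payload),
  };
}
```

Ordering replay by a sequence id rather than a timestamp avoids ambiguity when two events land in the same millisecond.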
Issue 2: Session & Authentication Failures
- Problem 1 (Sticky Sessions): The load balancer is round-robin. A user connects to Server A via WebSocket. If their connection drops and they reconnect, the LB might send them to Server B. Server B has no knowledge of the user's WebSocket connection or the document they were viewing, leading to a broken experience.
- Problem 2 (JWT in localStorage): Storing JWTs in `localStorage` makes them vulnerable to Cross-Site Scripting (XSS) attacks. If any malicious script runs on your page, it can steal the token and impersonate the user.
- Solution 1 (Sticky Sessions or External State):
- Option A (Easier): Configure the load balancer to use "sticky sessions" (or session affinity). This ensures a user is always routed to the same backend server for the duration of their session.
- Option B (Better for Scale): Do not rely on server-local state. Use the message bus solution from above. Any server can serve any user because they all get their state from the central message bus and database.
- Solution 2 (Secure Token Storage): Store the JWT in an `HttpOnly` cookie. This makes it inaccessible to JavaScript and mitigates XSS-based token theft; a sketch follows the trade-offs below.
- Trade-offs:
- Sticky Sessions: Undermines horizontal scaling and high availability. If Server A goes down, all its "stuck" users lose their session.
- HttpOnly Cookies: Slightly more complex to manage (e.g., handling CSRF tokens), but it's the standard security best practice. You'll need a mechanism to refresh the token without page reloads.
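A sketch of issuing the token as a cookie instead (Express; the cookie name, lifetime, and `issueJwtFor` helper are assumptions standing in for the existing auth flow):

```typescript
import express from "express";

const app = express();

app.post("/login", (req, res) => {
  const token = issueJwtFor(req); // however the existing auth flow mints the JWT

  // HttpOnly keeps the token out of reach of page JavaScript (mitigates XSS theft);
  // Secure and SameSite reduce transport and CSRF exposure.
  res.cookie("session", token, {
    httpOnly: true,
    secure: true,
    sameSite: "strict",
    maxAge: 15 * 60 * 1000, // short-lived; pair with a refresh endpoint
  });
  res.sendStatus(204);
});

// Placeholder for the existing JWT issuance logic.
declare function issueJwtFor(req: express.Request): string;
```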
Issue 3: Single Point of Failure in PostgreSQL
- Problem: The architecture relies on a single PostgreSQL instance for writes. If it goes down, the entire editing service stops. Period.
- Solution: Implement high availability for your database.
- Use a managed database service (like Amazon RDS or Google Cloud SQL) that offers automatic failover.
- Or, set up your own streaming replication configuration with a primary and one or more standby replicas, plus a failover manager (such as Patroni) that can automatically promote a standby to primary; a connection proxy or pooler in front (e.g., HAProxy or PgBouncer) keeps clients pointed at the current primary after a failover.
- Trade-offs:
- Pro: Dramatically increases system resilience.
- Con: Increased infrastructure complexity and cost. Managed services simplify this but come at a premium.
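On the application side, the main requirement is that connections recover after a failover. A sketch with node-postgres, assuming the provider exposes a single stable writer endpoint that re-points to the new primary (the env var name and retry policy are assumptions):

```typescript
import { Pool } from "pg";

// The host is the provider's failover endpoint (e.g., an RDS cluster endpoint),
// which resolves to the new primary after an automatic failover.
const pool = new Pool({
  host: process.env.DB_WRITER_ENDPOINT,
  max: 20,
  connectionTimeoutMillis: 5_000,
});

// Retry briefly on errors so in-flight requests survive the few seconds a
// failover typically takes. Only safe for idempotent statements.
async function queryWithRetry<T>(
  text: string,
  values: unknown[],
  attempts = 3
): Promise<T[]> {
  for (let i = 0; i < attempts; i++) {
    try {
      const result = await pool.query(text, values);
      return result.rows as T[];
    } catch (err) {
      if (i === attempts - 1) throw err;
      await new Promise((r) => setTimeout(r, 1_000 * (i + 1))); // back off
    }
  }
  throw new Error("unreachable");
}
```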
3. Scaling Bottlenecks
These are components that will prevent the system from handling increased load.
Issue 1: Database Write Contention
- Problem: Every single keystroke from every user across all servers results in a write to the same PostgreSQL database. With thousands of concurrent users, this will create massive write contention and lock the database, bringing the system to a halt. The read replicas mentioned in the scaling plan do not help with this write bottleneck.
- Solution: Offload real-time writes from the primary database.
- The message bus (e.g., Kafka) solution mentioned earlier helps. The API server's primary job becomes writing the change to Kafka, which is extremely fast. A separate set of "worker" services can then consume messages from Kafka and write them to PostgreSQL at a more manageable rate.
- Combine this with the event sourcing model: writing a small event is much faster for the database than updating a large HTML document snapshot.
- Trade-offs:
- Pro: Massively improves write throughput and responsiveness, decoupling the user-facing API from the database write speed.
- Con: Introduces significant architectural complexity (asynchronous workers, message bus). There is now a small delay between a change being made and it being durably written to the database, though this is an acceptable trade-off for real-time systems.
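A sketch of the decoupled write path using kafkajs (topic, group, and broker names are illustrative):

```typescript
import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "doc-api", brokers: ["kafka:9092"] });
const producer = kafka.producer();

// API server: acknowledge the edit as soon as it is durably in Kafka.
// (Connecting per call is shown for brevity; connect once at startup in practice.)
async function publishEdit(docId: string, op: object): Promise<void> {
  await producer.connect();
  await producer.send({
    topic: "document-edits",
    messages: [{ key: docId, value: JSON.stringify(op) }], // key keeps per-doc ordering
  });
}

// Worker service: drain Kafka into PostgreSQL at a rate the database can absorb.
async function runWorker(): Promise<void> {
  const consumer = kafka.consumer({ groupId: "postgres-writer" });
  await consumer.connect();
  await consumer.subscribe({ topic: "document-edits", fromBeginning: false });
  await consumer.run({
    eachMessage: async ({ message }) => {
      const docId = message.key?.toString();
      const op = JSON.parse(message.value?.toString() ?? "{}");
      await writeEventToPostgres(docId!, op); // e.g., the appendEvent() shown earlier
    },
  });
}

// Placeholder for the database write (see the event-sourcing sketch above).
declare function writeEventToPostgres(docId: string, op: object): Promise<void>;
```

Keying messages by document id keeps each document's edits in a single partition, so they are consumed and written in order.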
Issue 2: Inefficient Polling
- Problem: Having every server poll the database every 2 seconds is wasteful. Even with no changes, it generates a constant stream of database queries. As you add more servers, this background load increases linearly, consuming database resources for no reason.
- Solution: The message bus (Redis Pub/Sub, Kafka) completely eliminates the need for polling. It uses a push-based model, which is far more efficient: changes are pushed to servers instantly, rather than servers having to poll for them.
- Trade-offs:
- Pro: Eliminates a major source of database load and reduces real-time latency from up to 2 seconds to milliseconds.
- Con: See trade-offs for adding a message bus (added infrastructure dependency).
Issue 3: Coarse-Grained Document Locking
- Problem: The conflict resolution strategy is described as triggering "if two users edit the same paragraph." This implies the system detects conflicts at the paragraph level, which is too coarse. Two users editing different sentences in the same paragraph will still conflict, and one user's edit will silently overwrite the other's, losing the benefit of real-time collaboration.
- Solution: Adopt a more granular data model. Instead of storing a full HTML snapshot, store the document as a structured model (like an Abstract Syntax Tree) or a CRDT. Conflict resolution and operational transformation should happen at the character level or at least at the word/node level.
- Trade-offs:
- Pro: Enables true, fine-grained collaborative editing where multiple users can work on the same line simultaneously without losing each other's work.
- Con: The data model and transformation logic become significantly more complex than working with a simple HTML string. This is the price of a high-quality collaborative experience.
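To illustrate what "more granular" means in practice, here is a sketch of a character-level operation model (the type names are assumptions, not any specific library's API):

```typescript
// Each character carries a globally unique id (siteId + counter), so operations
// reference positions by identity rather than by fragile numeric offsets.
interface CharId {
  siteId: string;  // which client created the character
  counter: number; // per-site, monotonically increasing
}

interface CharNode {
  id: CharId;
  value: string;    // a single character
  deleted: boolean; // tombstone: deletions mark, they do not splice
}

// Operations are tiny and position-independent, so two users editing the same
// paragraph (or even the same word) no longer collide at the paragraph level.
type Operation =
  | { kind: "insert"; after: CharId | null; node: CharNode }
  | { kind: "delete"; target: CharId };

// Rendering simply skips tombstones.
function render(doc: CharNode[]): string {
  return doc.filter((n) => !n.deleted).map((n) => n.value).join("");
}
```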