Of course. This is an excellent exercise. The provided architecture has several critical issues that would prevent it from functioning as a reliable, scalable, real-time system.
Here is a breakdown of the potential failure modes, race conditions, and scaling bottlenecks, along with proposed solutions.
1. Race Conditions & Data Integrity Issues
Issue 1: "Last-Write-Wins" with Client Timestamps
This is the most critical data integrity problem in the architecture.
- Problem: Client clocks cannot be trusted. They drift, may fail to sync via NTP, can be set incorrectly by hand, or can be manipulated maliciously. If User A's clock is 2 seconds behind User B's and User A makes a change after User B, User A's change will be incorrectly discarded because its timestamp is earlier. The result is guaranteed data loss and a frustrating user experience, as changes seemingly disappear.
- Solution: Replace the Last-Write-Wins (LWW) strategy with a more robust algorithm designed for collaborative editing. The two industry standards are:
- Operational Transformation (OT): This is the algorithm historically used by Google Docs. When a client sends an operation (e.g., "insert 'X' at position 50"), the server transforms it against any concurrent operations it has already processed before applying and broadcasting it. This requires a central server to serialize and transform all operations.
- Conflict-free Replicated Data Types (CRDTs): A more modern approach. CRDTs are data structures mathematically designed to resolve conflicts automatically and converge to the same state regardless of the order in which operations are received, because operations commute and merges are idempotent. This removes the need for a central transformation server and makes the system more resilient to network partitions (a minimal CRDT sketch follows this list).
- Trade-offs:
- OT: Very complex to implement correctly on the server. The logic for transforming all possible pairs of operations is non-trivial. It also requires a single, authoritative server per document session to order operations, which can be a bottleneck.
- CRDTs: Shifts complexity to the client-side data model. Payloads can sometimes be larger than OT operations. However, the backend logic is much simpler (mostly just relaying messages), making it more scalable and resilient.
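To make the CRDT option concrete, here is a minimal sketch using Yjs, one popular CRDT library for JavaScript/TypeScript (the choice of library is an assumption; the original design doesn't name one). Two replicas make concurrent edits and converge without any server-side conflict resolution:

```typescript
import * as Y from "yjs";

// Two replicas of the same document, e.g. two users' browser tabs.
const docA = new Y.Doc();
const docB = new Y.Doc();
const textA = docA.getText("content");
const textB = docB.getText("content");

// Concurrent edits made while the replicas haven't heard from each other.
textA.insert(0, "Hello world");
textB.insert(0, "Goodbye");

// Exchange encoded updates in either order; both replicas converge.
const updateA = Y.encodeStateAsUpdate(docA);
const updateB = Y.encodeStateAsUpdate(docB);
Y.applyUpdate(docA, updateB);
Y.applyUpdate(docB, updateA);

console.log(textA.toString() === textB.toString()); // true: identical final state
```

In production the encoded updates would travel over the WebSocket layer (and the message bus discussed below) rather than being applied locally.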
Issue 2: Stale Data from CDN Caching
- Problem: The CDN caches API responses (e.g., the initial document load) for 5 minutes. If a user opens a document, they could receive a version that is up to 5 minutes old. They will then start receiving real-time WebSocket updates that are based on the current version of the document, leading to data corruption on the client-side as the updates (deltas) are applied to a stale base document.
- Solution: Do not cache the API endpoints that serve document content. The CDN should only be used for its primary purpose: serving static assets like JavaScript bundles, CSS files, images, and fonts. API requests for dynamic data must always hit the origin servers to ensure freshness (see the header sketch below).
- Trade-offs: This increases the load on the backend for initial document requests. However, this is a necessary trade-off for correctness. The load can be managed effectively with the proposed database read replicas.
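As a rough illustration, assuming an Express-style API server (the original design doesn't specify the framework), the split between cacheable static assets and uncacheable document endpoints comes down to Cache-Control headers:

```typescript
import express from "express";

const app = express();

// Static assets: safe for the CDN and browsers to cache aggressively.
app.use(
  "/static",
  express.static("dist", {
    setHeaders: (res) => res.set("Cache-Control", "public, max-age=31536000, immutable"),
  })
);

// Document content: must always come from origin, never a CDN cache.
app.get("/api/documents/:id", (req, res) => {
  res.set("Cache-Control", "no-store");
  res.json({ id: req.params.id, content: "..." }); // placeholder for the real document loader
});

app.listen(3000);
```

With `no-store` on the dynamic routes, the CDN can stay in front of everything without serving a stale document.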
2. Scaling Bottlenecks
Issue 3: Inter-Server Communication via Database Polling
- Problem: Having each server poll PostgreSQL every 2 seconds is extremely inefficient and will not scale.
- High Latency: Users on different servers will see each other's changes with a delay of up to 2 seconds, plus database latency. This is not "real-time."
- Database Load: As you add more API servers (N), the number of polling queries to the database increases linearly (N queries every 2 seconds). This creates immense, constant, and largely useless load on the database, making it the primary bottleneck for the entire system.
- Solution: Implement a Pub/Sub (Publish/Subscribe) message bus.
- When an API server receives a change for `document-123`, it publishes that change to a `document-123` topic/channel on the message bus (e.g., Redis Pub/Sub, RabbitMQ, or Kafka).
- All API servers handling clients for `document-123` will be subscribed to that topic.
- They receive the message instantly from the bus and broadcast it to their connected WebSocket clients (see the relay sketch below).
- Trade-offs:
- Complexity: Introduces a new component (Redis, RabbitMQ, etc.) into the architecture that must be deployed, managed, and monitored.
- Reliability: The message bus itself can become a point of failure, though systems like Kafka and clustered Redis are designed for high availability. The benefit of near-instant, low-overhead communication far outweighs this complexity for a real-time application.
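A minimal sketch of the relay path, assuming Redis Pub/Sub via the ioredis client and the ws library for WebSockets (both library choices are assumptions, and `watchDocument`/`onClientChange` are hypothetical helper names):

```typescript
import Redis from "ioredis";
import { WebSocket } from "ws";

const pub = new Redis();
const sub = new Redis(); // a connection in subscriber mode can't issue other commands

// WebSocket clients connected to THIS server instance, keyed by document ID.
const clientsByDoc = new Map<string, Set<WebSocket>>();

// Called when a locally connected client opens a document.
export async function watchDocument(docId: string): Promise<void> {
  await sub.subscribe(`doc:${docId}`);
}

// Called when a locally connected client submits a change: publish it to the bus.
export async function onClientChange(docId: string, change: string): Promise<void> {
  await pub.publish(`doc:${docId}`, change);
}

// Fan out changes received from the bus to the local WebSocket clients.
sub.on("message", (channel: string, message: string) => {
  const docId = channel.slice("doc:".length);
  for (const ws of clientsByDoc.get(docId) ?? []) {
    ws.send(message);
  }
});
```

Every API server runs this same code, so a change published by one instance reaches clients on all instances within milliseconds instead of the 2-second polling window.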
Issue 4: Storing Full HTML Snapshots
- Problem: Saving the entire document as an HTML snapshot every 30 seconds is inefficient for both storage and I/O.
- Storage Bloat: If a user fixes a single typo in a 10MB document, you are writing another 10MB to the database. This causes the database to grow incredibly fast.
- High I/O: Frequent large writes put unnecessary strain on the database's write capacity.
- Solution: Store the operations/deltas themselves (the OT or CRDT operations). The document is an ordered log of these operations.
- To load a document, a client fetches an initial snapshot and all subsequent operations, replaying them to construct the current state.
- To optimize this, the server can periodically create new, consolidated snapshots in the background (e.g., every 1,000 operations or every hour) so that clients don't have to replay the document's entire history from the beginning (a load-path sketch follows this list).
- Trade-offs:
- Read Complexity: Reconstructing a document from operations is more computationally expensive than just fetching a single blob. This is why periodic snapshotting is a crucial optimization.
- Data Model: Requires a more complex data model for storing and versioning operations, but the storage and I/O savings are massive.
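A load-path sketch under assumed table shapes (`snapshots(doc_id, seq, state)` and `operations(doc_id, seq, op)` are hypothetical; the replay function is injected because it belongs to whichever OT/CRDT library is chosen):

```typescript
import { Pool } from "pg";

const pool = new Pool(); // connection settings come from the standard PG* environment variables

// The OT/CRDT library plugs in here (e.g. Y.applyUpdate when using Yjs).
type ApplyOp = (state: Uint8Array, op: Uint8Array) => Uint8Array;

export async function loadDocument(docId: string, applyOp: ApplyOp): Promise<Uint8Array> {
  // 1. Fetch the most recent consolidated snapshot, if one exists.
  const snapshot = (
    await pool.query(
      "SELECT seq, state FROM snapshots WHERE doc_id = $1 ORDER BY seq DESC LIMIT 1",
      [docId]
    )
  ).rows[0];

  // 2. Fetch only the operations written after that snapshot.
  const ops = (
    await pool.query(
      "SELECT op FROM operations WHERE doc_id = $1 AND seq > $2 ORDER BY seq ASC",
      [docId, snapshot ? snapshot.seq : 0]
    )
  ).rows;

  // 3. Replay the tail of the log on top of the snapshot.
  let state: Uint8Array = snapshot ? snapshot.state : new Uint8Array();
  for (const row of ops) {
    state = applyOp(state, row.op);
  }
  return state;
}
```

A background job can run the same replay, write the result back as a new snapshot row, and prune operations older than the last few snapshots.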
3. Failure Modes & Security
Issue 5: Siloed WebSocket Connections
- Problem: Each API server maintains its own WebSocket connections. If one of these servers crashes or is taken down for deployment, all users connected to it are instantly disconnected. While they can reconnect (likely to a new server via the load balancer), their session context (e.g., which document they were in, cursor position) is lost, providing a jarring user experience.
- Solution: Decouple connection management from application logic.
- Store session state in a shared, fast data store like Redis. When a user connects to a server, the server looks up their session ID in Redis to see which document they are editing.
- If a server fails and the client reconnects to a new server, the new server can seamlessly resume the session using the data from Redis (see the sketch after this list). The Pub/Sub system ensures they don't miss any messages during the brief reconnection window (if using a durable queue like Kafka).
- Trade-offs:
- Increases dependency on Redis, making its uptime even more critical.
- Adds a Redis lookup on every new connection, but this is a very fast operation and a worthwhile trade-off for fault tolerance.
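A sketch of the shared session store, again assuming Redis via ioredis (the key layout and fields are illustrative):

```typescript
import Redis from "ioredis";

const redis = new Redis();

interface SessionState {
  documentId: string;
  cursorPosition: number; // or a richer serialized selection
}

// Written by whichever server currently holds the connection; the TTL
// lets abandoned sessions expire on their own.
export async function saveSession(sessionId: string, state: SessionState): Promise<void> {
  await redis.set(`session:${sessionId}`, JSON.stringify(state), "EX", 60 * 60);
}

// On (re)connect, whichever server the load balancer picks can resume the session.
export async function resumeSession(sessionId: string): Promise<SessionState | null> {
  const raw = await redis.get(`session:${sessionId}`);
  return raw ? (JSON.parse(raw) as SessionState) : null;
}
```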
Issue 6: JWT in localStorage
- Problem: Storing JWTs in `localStorage` is a well-known security risk: it leaves the application vulnerable to Cross-Site Scripting (XSS) attacks. If an attacker can inject malicious JavaScript into the page (e.g., through a compromised third-party library), they can read the JWT from `localStorage` and send it to their own server, allowing them to impersonate the user and gain full access to their account.
- Solution: Use secure, `httpOnly` cookies to store authentication tokens.
- An `httpOnly` cookie cannot be accessed by JavaScript, which mitigates XSS-based token theft.
- A common pattern is to store a long-lived refresh token in an `httpOnly` cookie and a short-lived access token (the JWT) in memory on the client. When the access token expires, the client uses the refresh token (sent automatically by the browser) to silently request a new one.
- Trade-offs:
- This pattern requires protection against Cross-Site Request Forgery (CSRF), since cookies are sent automatically with every request to the domain. This is a standard, solved problem handled with anti-CSRF tokens (and `SameSite` cookie attributes). The security benefits greatly outweigh this implementation detail; a cookie-handling sketch follows below.
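A sketch of the cookie handling. Express, cookie-parser, and jsonwebtoken are assumptions about the stack (the original design only mentions JWTs), and the in-memory refresh-token map stands in for whatever server-side store would really back it:

```typescript
import express from "express";
import cookieParser from "cookie-parser";
import jwt from "jsonwebtoken";
import { randomUUID } from "node:crypto";

const app = express();
app.use(express.json());
app.use(cookieParser());

const ACCESS_SECRET = process.env.ACCESS_TOKEN_SECRET ?? "dev-only-secret";
const refreshTokens = new Map<string, string>(); // refreshToken -> userId (illustrative only)

const issueAccessToken = (userId: string) =>
  jwt.sign({ sub: userId }, ACCESS_SECRET, { expiresIn: "15m" });

// On login: refresh token goes into an httpOnly cookie, access token goes back in the body.
app.post("/auth/login", (req, res) => {
  const userId = "user-123"; // placeholder for real credential verification
  const refreshToken = randomUUID();
  refreshTokens.set(refreshToken, userId);

  res.cookie("refresh_token", refreshToken, {
    httpOnly: true,     // not readable from JavaScript, so XSS can't exfiltrate it
    secure: true,       // only sent over HTTPS
    sameSite: "strict", // first line of CSRF defense
    path: "/auth/refresh",
    maxAge: 30 * 24 * 60 * 60 * 1000, // 30 days
  });
  res.json({ accessToken: issueAccessToken(userId) }); // kept in memory on the client
});

// Silent refresh: the browser attaches the httpOnly cookie automatically.
app.post("/auth/refresh", (req, res) => {
  const userId = refreshTokens.get(req.cookies.refresh_token);
  if (!userId) return res.status(401).end();
  res.json({ accessToken: issueAccessToken(userId) });
});

app.listen(3000);
```

Pair the refresh endpoint with an anti-CSRF token check (or at minimum the `SameSite` attribute plus an Origin check), as noted in the trade-offs above.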