This architecture has several significant flaws, particularly concerning real-time consistency, data integrity, and horizontal scaling complexity.
Here is a detailed breakdown of the potential failure modes, race conditions, and scaling bottlenecks, along with proposed solutions and their trade-offs.
I. Real-Time Consistency and Data Integrity Issues
These issues directly impact the user experience and correctness of the document state.
1. Failure Mode: Inconsistent State Across Servers (The Polling Mechanism)
The Issue: The core problem lies in how state synchronization happens between the API servers.
- Server A receives an edit from Client A. It writes to PostgreSQL.
- Server B does not know about this change immediately. It polls 2 seconds later and pulls the change.
- If Client B (connected to Server B) makes an edit before Server B polls, Server B processes and broadcasts that edit against stale state, leading to a potential conflict or a lost update when Server B eventually writes to the DB.
Race Condition: A race condition exists between an edit being written by Server A and Server B polling for that edit.
Solution: Eliminate Polling and Use a Central Message Broker.
- Proposal: Introduce a dedicated message broker (like Redis Pub/Sub, Kafka, or RabbitMQ) that sits between the API servers.
- Revised Data Flow:
- User types → Change sent to Server A via WebSocket.
- Server A writes change to PostgreSQL.
- Server A publishes the change (with its definitive DB timestamp) to a topic in the Message Broker (e.g., `document-updates:doc_id_123`).
- All other API servers (B, C, D...) subscribe to this topic. When they receive a message, they immediately broadcast the update to their connected clients via their local WebSockets.
- Trade-offs:
- Pro: Near-instantaneous consistency across all servers. Solves the polling lag.
- Con: Adds a new critical dependency (the Message Broker). Requires careful configuration for message durability and ordering (especially if using Kafka).
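A minimal sketch of this broker fan-out, assuming Redis Pub/Sub via the ioredis client and the ws library for WebSockets (both are implementation assumptions; the same pattern applies to Kafka or RabbitMQ topics):

```typescript
import Redis from "ioredis";
import { WebSocket } from "ws";

// ioredis needs a dedicated connection for subscribing, so use two clients.
const pub = new Redis();
const sub = new Redis();

// Local registry of this server's connected clients, keyed by document ID
// (populated when clients join a document; not shown here).
const clientsByDoc = new Map<string, Set<WebSocket>>();

// Called after this server has durably written the change to PostgreSQL.
export async function publishChange(docId: string, change: object): Promise<void> {
  await pub.publish(`document-updates:${docId}`, JSON.stringify(change));
}

// Every API server subscribes to the channels for the documents it serves.
export async function subscribeToDoc(docId: string): Promise<void> {
  await sub.subscribe(`document-updates:${docId}`);
}

// Fan broker messages out to locally connected clients immediately;
// no polling interval is involved.
sub.on("message", (channel: string, payload: string) => {
  const docId = channel.replace("document-updates:", "");
  for (const socket of clientsByDoc.get(docId) ?? []) {
    if (socket.readyState === WebSocket.OPEN) socket.send(payload);
  }
});
```

Note that Redis Pub/Sub is fire-and-forget: a server that is briefly disconnected misses messages, which is one reason a log-based broker (Kafka, Redis Streams) may be preferable when delivery guarantees matter.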
2. Failure Mode: Clock Skew and Incorrect Conflict Resolution
The Issue: The sync strategy relies on Last-Write-Wins (LWW) using client-provided timestamps. Client clocks are notoriously unreliable, leading to incorrect conflict resolution.
- Example: User A's clock is 1 second fast and User B's clock is 1 second slow. If they edit near-simultaneously, the update carrying the later client timestamp (User A's) always wins, even when User B's edit actually reached the server after User A's and should have superseded it.
Solution: Use Server-Assigned Timestamps or Operational Transformation (OT) / Conflict-Free Replicated Data Types (CRDTs).
- Proposal A (Simpler LWW Fix): The server must ignore the client timestamp for conflict resolution and instead use its own PostgreSQL transaction commit timestamp, or assign a monotonically increasing sequence number, before broadcasting or committing the final state (see the sketch after the trade-offs).
- Proposal B (The "Correct" Solution for Real-Time): Abandon LWW entirely and adopt a formal collaborative editing algorithm like CRDTs (e.g., Yjs or Automerge). This allows merging concurrent edits intelligently rather than just discarding one.
- Trade-offs:
- Proposal A: Easy to implement, but still results in lost edits during true simultaneous conflicts (the "last write wins" outcome is just based on server timing, not user intent).
- Proposal B: Significantly increases complexity on both frontend and backend logic but provides true, mathematically sound convergence.
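A minimal sketch of Proposal A, assuming node-postgres and a per-document version counter; the documents and document_changes tables and their columns are illustrative:

```typescript
import { Pool } from "pg";

const pool = new Pool();

// Applies a change using a server-assigned, monotonically increasing version.
// The client timestamp is ignored for conflict resolution (it can still be
// stored for diagnostics).
export async function applyChange(docId: string, change: object): Promise<number> {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    // Atomically bump the per-document version; the row lock serializes
    // concurrent writers for the same document.
    const { rows } = await client.query(
      "UPDATE documents SET version = version + 1 WHERE id = $1 RETURNING version",
      [docId]
    );
    const version = Number(rows[0].version);
    await client.query(
      "INSERT INTO document_changes (doc_id, version, payload, committed_at) VALUES ($1, $2, $3, now())",
      [docId, version, JSON.stringify(change)]
    );
    await client.query("COMMIT");
    return version; // broadcast this version alongside the change
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}
```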
3. Failure Mode: Data Loss Due to Snapshot Frequency
The Issue: Documents are only saved as full HTML snapshots every 30 seconds. If a server crashes or the network drops between the last save and an edit, that edit is lost (unless the Message Broker is 100% durable and all servers correctly process the backlog).
- Even with a message broker, edits held only in memory (a server's WebSocket session buffer or an unflushed broker backlog) are lost if that server or the broker fails before the next full snapshot.
Solution: Commit Every Change to the Database.
- Proposal: Commit every valid, timestamped change (the atomic operation, not the full HTML) directly to PostgreSQL immediately upon receipt. Use the 30-second interval only for snapshotting the current state for fast initial loads, not for primary persistence.
- Trade-offs:
- Pro: Ensures atomicity and durability for every change.
- Con: Greatly increases database write load, potentially straining PostgreSQL. Requires extremely efficient indexing for versioning/history tables.
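A rough sketch of this persistence split, reusing the illustrative document_changes table from the previous sketch and adding a hypothetical document_snapshots table that exists only to make initial loads cheap:

```typescript
import { Pool } from "pg";

const pool = new Pool();

// Runs on a timer (e.g., every 30 seconds per active document). Snapshots only
// speed up initial loads; durability comes from the per-change inserts above.
// Assumes document_snapshots has a UNIQUE constraint on doc_id.
export async function snapshotDocument(
  docId: string,
  renderCurrentHtml: (docId: string) => Promise<string> // application-specific
): Promise<void> {
  const html = await renderCurrentHtml(docId);
  const { rows } = await pool.query(
    "SELECT COALESCE(MAX(version), 0) AS v FROM document_changes WHERE doc_id = $1",
    [docId]
  );
  await pool.query(
    `INSERT INTO document_snapshots (doc_id, up_to_version, html, created_at)
       VALUES ($1, $2, $3, now())
     ON CONFLICT (doc_id)
       DO UPDATE SET up_to_version = EXCLUDED.up_to_version, html = EXCLUDED.html, created_at = now()`,
    [docId, Number(rows[0].v), html]
  );
}

// Initial load: latest snapshot plus any changes committed after it.
export async function loadDocument(docId: string) {
  const snap = await pool.query(
    "SELECT up_to_version, html FROM document_snapshots WHERE doc_id = $1",
    [docId]
  );
  const sinceVersion = Number(snap.rows[0]?.up_to_version ?? 0);
  const tail = await pool.query(
    "SELECT version, payload FROM document_changes WHERE doc_id = $1 AND version > $2 ORDER BY version",
    [docId, sinceVersion]
  );
  return { html: snap.rows[0]?.html ?? "", pendingChanges: tail.rows };
}
```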
II. Scaling and Infrastructure Bottlenecks
These issues relate to how the system handles increased load.
4. Scaling Bottleneck: Statefulness of WebSockets
The Issue: "Each API server maintains its own WebSocket connections." This makes scaling difficult because user state (which document they are editing, which server they are on) is sticky. If Server A goes down, all its connected users instantly disconnect and must re-establish a session, potentially connecting to a different server (Server B).
Solution: Introduce a Centralized WebSocket Gateway/Proxy.
- Proposal: Place a dedicated connection layer (for example HAProxy, Nginx, or a purpose-built WebSocket gateway such as one based on Socket.IO) in front of the API servers. This layer accepts the WebSocket connection and routes subsequent messages to the correct backend server. The crucial component is a mechanism, typically sticky sessions or a session-state lookup in Redis, that lets a client reconnect to the same logical server where possible, or hands the session off cleanly (see the sketch after the trade-offs below).
- Trade-offs:
- Pro: Decouples the WebSocket connection life cycle from the application logic servers, making application server scaling easier.
- Con: Adds another layer of complexity and potential latency. Requires the gateway to manage connection state or rely on the application servers to quickly re-establish context.
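A minimal sketch of the Redis session-state lookup mentioned in the proposal above, assuming one key per user/document pair (key names and TTL are illustrative):

```typescript
import Redis from "ioredis";

const redis = new Redis();
const SESSION_TTL_SECONDS = 60;

// Each API server records that it owns a user's editing session, with a TTL
// so entries from crashed servers expire on their own.
export async function registerSession(userId: string, docId: string, serverId: string): Promise<void> {
  await redis.set(`session:${userId}:${docId}`, serverId, "EX", SESSION_TTL_SECONDS);
}

// Refreshed periodically (e.g., on every heartbeat) while the socket is open.
export async function touchSession(userId: string, docId: string): Promise<void> {
  await redis.expire(`session:${userId}:${docId}`, SESSION_TTL_SECONDS);
}

// On reconnect, the gateway looks up the owning server; if the entry is gone
// (server crashed, TTL expired), any server can take over the session.
export async function lookupSession(userId: string, docId: string): Promise<string | null> {
  return redis.get(`session:${userId}:${docId}`);
}
```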
5. Scaling Bottleneck: Database Write Contention (PostgreSQL)
The Issue: If every change (even small text insertions) is written directly to PostgreSQL, the write throughput requirement will rapidly exceed what a single primary PostgreSQL instance can handle, even with read replicas handling reads.
Solution: Utilize Redis for High-Frequency Write Buffering/Sequencing.
- Proposal: When a change arrives, instead of writing directly to PostgreSQL, push the change payload into an ordered list or stream within Redis (e.g., using Redis Streams or Lists, keyed by document ID). A dedicated, optimized background worker process reads these streams and performs bulk inserts/updates to PostgreSQL periodically (e.g., every 1 second).
- Trade-offs:
- Pro: Dramatically reduces immediate write strain on PostgreSQL, allowing it to focus on durable storage and consistency checks.
- Con: Introduces a small, bounded window of potential data loss if Redis fails catastrophically before flushing to PostgreSQL (mitigated by using Redis persistence). Requires dedicated worker processes.
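A rough sketch of this buffering path, assuming ioredis, one Redis Stream per document, and a flush worker that runs roughly once per second; it replaces the direct per-change write shown earlier, so version assignment would move into the worker (not shown):

```typescript
import Redis from "ioredis";
import { Pool } from "pg";

const redis = new Redis();
const pool = new Pool();

// Hot path: append the change to the document's stream and return immediately.
export async function bufferChange(docId: string, change: object): Promise<void> {
  await redis.xadd(`changes:${docId}`, "*", "payload", JSON.stringify(change));
}

// Background worker: drain each active stream roughly once per second and
// bulk-insert the buffered changes into PostgreSQL, then delete what was flushed.
export async function flushDocument(docId: string): Promise<void> {
  const entries = await redis.xrange(`changes:${docId}`, "-", "+", "COUNT", 1000);
  if (entries.length === 0) return;

  // Each entry is [id, ["payload", "<json>"]].
  const payloads = entries.map(([, fields]) => fields[1]);

  // Single multi-row INSERT instead of one round trip per change.
  const valuesSql = payloads.map((_, i) => `($1, $${i + 2})`).join(", ");
  await pool.query(
    `INSERT INTO document_changes (doc_id, payload) VALUES ${valuesSql}`,
    [docId, ...payloads]
  );

  // Remove only the entries that were just persisted.
  await redis.xdel(`changes:${docId}`, ...entries.map(([id]) => id));
}
```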
6. Scaling Bottleneck: CDN Caching of API Responses
The Issue: The CDN (CloudFront) is configured to cache API responses for 5 minutes. For a real-time collaborative editor, API responses are generally used for initial document loading or fetching metadata. Caching them for 5 minutes is highly risky.
- If User A edits Document X, and User B opens Document X 30 seconds later, User B’s request to the API endpoint for the document might be served a 5-minute-old cached version from CloudFront, bypassing the load balancer and the application servers entirely, leading to stale data.
Solution: Restrict CDN Caching to Static Assets Only.
- Proposal: Configure CloudFront to cache static assets only (JS, CSS, images). API endpoints, especially those for document fetching or authentication, must send cache-control headers that forbid caching (`Cache-Control: no-cache, no-store, must-revalidate`); see the middleware sketch below.
- Trade-offs:
- Pro: Ensures that all critical application data is fetched from the live backend servers, guaranteeing freshness.
- Con: Increases the load on the API servers as every initial document load request hits the backend cluster instead of the CDN edge cache. (This is generally acceptable, as initial loads are less frequent than ongoing real-time edits.)
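A minimal sketch of the origin-side header policy, assuming an Express API; the corresponding CloudFront behavior change (no caching for API paths, long TTLs for static paths) lives in the distribution config and is not shown:

```typescript
import express from "express";

const app = express();

// Everything under /api must never be served from a CDN or browser cache.
app.use("/api", (_req, res, next) => {
  res.set("Cache-Control", "no-cache, no-store, must-revalidate");
  res.set("Pragma", "no-cache"); // legacy HTTP/1.0 caches
  res.set("Expires", "0");
  next();
});

// Static assets stay long-cacheable and are safe for CloudFront to cache.
app.use("/static", express.static("public", { maxAge: "1y", immutable: true }));

app.get("/api/documents/:id", (req, res) => {
  // ...fetch the latest document state from the backend stores...
  res.json({ id: req.params.id });
});

app.listen(3000);
```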
III. Authentication and Security Issues
7. Failure Mode: JWT Expiry and Client Storage Risk
The Issue: JWTs with 24-hour expiry are stored in localStorage.
- If the JWT leaks (via XSS attack), an attacker has 24 hours of access.
- The 24-hour expiry forces users to re-authenticate at least once a day, which is poor UX, unless tokens are silently refreshed, which requires additional refresh infrastructure.
Solution: Use HttpOnly Cookies for Session Management.
- Proposal: Instead of storing JWTs in `localStorage` for authentication, use traditional session cookies marked `HttpOnly` and `Secure`. The backend issues the cookie upon successful login; `HttpOnly` prevents client-side JavaScript (including injected malicious scripts) from reading the token. A sketch follows after the trade-offs.
- Trade-offs:
- Pro: Vastly superior security against XSS attacks.
- Con: Makes the session state less portable across microservices if JWTs were intended for that purpose (which they aren't in this setup, as auth seems tied to the initial connection). Requires careful configuration of the load balancer/API servers to ensure cookie forwarding.
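A minimal sketch of issuing such a cookie, assuming Express, an opaque session ID stored in Redis, and a hypothetical verifyCredentials helper:

```typescript
import crypto from "node:crypto";
import express from "express";
import Redis from "ioredis";

const app = express();
const redis = new Redis();
app.use(express.json());

const SESSION_TTL_SECONDS = 60 * 60 * 24; // 24 hours, matching the current expiry

// Hypothetical credential check against the user store (placeholder).
async function verifyCredentials(email: string, password: string): Promise<{ id: string } | null> {
  // ...look up the user and verify the password hash...
  return null;
}

app.post("/login", async (req, res) => {
  const user = await verifyCredentials(req.body.email, req.body.password);
  if (!user) return res.status(401).end();

  // Opaque, random session ID: the browser never holds user data or a JWT.
  const sessionId = crypto.randomBytes(32).toString("hex");
  await redis.set(`sess:${sessionId}`, user.id, "EX", SESSION_TTL_SECONDS);

  res.cookie("sid", sessionId, {
    httpOnly: true,  // not readable from JavaScript, so XSS cannot exfiltrate it
    secure: true,    // HTTPS only
    sameSite: "lax", // basic CSRF hardening
    maxAge: SESSION_TTL_SECONDS * 1000,
  });
  res.status(204).end();
});

app.listen(3000);
```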
8. Failure Mode: Authentication for WebSocket Connections
The Issue: The architecture implies JWTs are used for the initial HTTP requests, but it doesn't specify how the WebSocket connection itself is authenticated. If the server upgrades the connection without re-validating the token the client presents (or without requiring one at all), the persistent connection is effectively unauthenticated.
Solution: Validate Token on WebSocket Upgrade.
- Proposal: When the client initiates the WebSocket handshake, it must pass the JWT (typically in a query parameter or cookie, since browser WebSocket clients cannot attach custom headers to the upgrade request). The backend server must validate this token immediately upon connection. If it is valid, the server creates a short-lived session entry in Redis mapping the WebSocket session ID to the user ID, and only then completes the connection (see the sketch below).
- Trade-offs:
- Pro: Ensures every persistent connection is authenticated.
- Con: Requires careful handling of the token during the upgrade process, which can sometimes be tricky depending on the client library used.
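A minimal sketch of the upgrade-time check, assuming the ws library, the jsonwebtoken package, and the Redis session mapping from the proposal (the JWT_SECRET environment variable and key names are illustrative):

```typescript
import http from "node:http";
import crypto from "node:crypto";
import { WebSocketServer } from "ws";
import jwt from "jsonwebtoken";
import Redis from "ioredis";

const redis = new Redis();
const server = http.createServer();
const wss = new WebSocketServer({ noServer: true });

server.on("upgrade", (req, socket, head) => {
  try {
    // Token arrives as a query parameter, e.g. wss://host/ws?token=...
    const url = new URL(req.url ?? "", "http://localhost");
    const token = url.searchParams.get("token") ?? "";
    const claims = jwt.verify(token, process.env.JWT_SECRET as string) as { sub: string };

    wss.handleUpgrade(req, socket, head, (ws) => {
      // Map this WebSocket session to the authenticated user, with a TTL.
      const wsSessionId = crypto.randomUUID();
      redis
        .set(`ws-session:${wsSessionId}`, claims.sub, "EX", 3600)
        .catch(() => ws.close(1011));
      wss.emit("connection", ws, req);
    });
  } catch {
    // Invalid or missing token: refuse the upgrade before any data flows.
    socket.write("HTTP/1.1 401 Unauthorized\r\n\r\n");
    socket.destroy();
  }
});

server.listen(8080);
```

One caveat with this approach: tokens passed in query strings can end up in access logs, so short-lived, connection-specific tokens are preferable if the query-parameter route is used.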