This architecture contains several critical design flaws that would lead to data loss, high latency, security vulnerabilities, and poor scalability. Below is a detailed breakdown of the failure modes, race conditions, and bottlenecks, along with specific solutions and trade-offs.
1. Real-Time Consistency & Synchronization
Issue: Client-Clock-Based Last-Write-Wins (LWW)
- Failure Mode: Client clocks are not synchronized. If User A's clock is 1 minute fast and User B's is 1 minute slow, User A's edits will overwrite User B's edits regardless of actual arrival time.
- Race Condition: Two users edit the same character range simultaneously. LWW resolves this by arbitrarily choosing one, effectively deleting the other user's work. This makes concurrent editing impossible.
- Solution: Implement CRDTs (Conflict-free Replicated Data Types) or Operational Transformation (OT) (e.g., Yjs, Automerge, Google Docs' OT). Assign server-side sequence numbers to operations, not client timestamps.
- Trade-off:
- Pro: Guarantees eventual consistency without data loss during concurrent edits.
- Con: Increased complexity in data modeling and frontend state management. Requires a robust state synchronization library.
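The core of the fix is that the server, not client clocks, orders operations. A minimal sketch of server-side sequencing (class and field names here are illustrative, not from a specific library; in practice use Yjs, Automerge, or ShareDB rather than hand-rolling this):

```javascript
// Each document keeps a monotonically increasing sequence number.
// Concurrent ops are ordered by arrival at the server, and clients
// rebase against any ops they haven't seen yet -- instead of one
// client's edit silently overwriting another's via LWW.
class DocSequencer {
  constructor() {
    this.seq = 0;  // server-assigned ordering, replaces client timestamps
    this.log = []; // ordered operation log
  }

  // Accept an op from a client; baseSeq is the last seq the client saw.
  submit(op, baseSeq) {
    const missed = this.log.slice(baseSeq); // ops the client hasn't seen
    this.seq += 1;
    const stamped = { ...op, seq: this.seq };
    this.log.push(stamped);
    return { stamped, missed }; // client must transform/rebase against `missed`
  }
}

const doc = new DocSequencer();
const a = doc.submit({ type: 'insert', pos: 0, text: 'A' }, 0);
const b = doc.submit({ type: 'insert', pos: 0, text: 'B' }, 0);
// b.missed contains A's op: user B learns about the concurrent edit
// and can transform against it instead of destroying it.
```

The returned `missed` list is what an OT transform (or CRDT merge) consumes; the key property is that two concurrent submissions both survive in the log.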
Issue: Siloed WebSocket Connections (Server Partitioning)
- Failure Mode: The architecture states: "Server broadcasts change to all WebSocket clients connected to THAT server." If User A connects to Server 1 and User B connects to Server 2, User A will never see User B's changes until the polling interval hits.
- Scaling Bottleneck: As you add API servers, the probability of two collaborators connecting to different servers increases, degrading the "real-time" experience to "eventually consistent" (up to 2s delay).
- Solution: Implement a Pub/Sub Layer (e.g., Redis Pub/Sub or NATS) between API servers. When Server 1 receives a change, it publishes to a channel; Server 2 subscribes and pushes to its local clients.
- Trade-off:
- Pro: Enables true real-time collaboration across horizontally scaled servers.
- Con: Introduces a single point of failure (Redis cluster) and adds network latency for cross-server message propagation.
2. Database & Storage Architecture
Issue: Direct PostgreSQL Writes for Every Keystroke
- Scaling Bottleneck: Writing every keystroke directly to PostgreSQL creates massive I/O contention. A single document with 100 users typing fast could generate 500+ writes per second.
- Failure Mode: Database connection pool exhaustion during peak usage, causing write failures and lost edits.
- Solution: Implement Write Buffering. Buffer changes in Redis (sorted set or list) for a short window (e.g., 100ms) or batch them, then flush to PostgreSQL asynchronously. Alternatively, use Event Sourcing: write operations to a log, snapshot state periodically.
- Trade-off:
- Pro: Reduces DB load by orders of magnitude.
- Con: Increases complexity. Requires handling buffer persistence to prevent data loss if the Node process crashes.
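A minimal sketch of the buffering pattern, flushing on either a size threshold or a short timer (`persistBatch` is a placeholder for the real DB call, e.g. one multi-row INSERT; the thresholds are illustrative):

```javascript
// Keystroke ops accumulate in memory and are flushed to PostgreSQL
// as one batched write instead of one write per keystroke.
class WriteBuffer {
  constructor(persistBatch, { maxOps = 100, flushMs = 100 } = {}) {
    this.persistBatch = persistBatch;
    this.maxOps = maxOps;
    this.flushMs = flushMs;
    this.pending = [];
    this.timer = null;
  }

  add(op) {
    this.pending.push(op);
    if (this.pending.length >= this.maxOps) return this.flush();
    // Start the flush window on the first buffered op.
    if (!this.timer) this.timer = setTimeout(() => this.flush(), this.flushMs);
  }

  flush() {
    if (this.timer) { clearTimeout(this.timer); this.timer = null; }
    if (this.pending.length === 0) return;
    const batch = this.pending;
    this.pending = [];
    this.persistBatch(batch); // one DB round-trip instead of batch.length
  }
}

const batches = [];
const buf = new WriteBuffer((b) => batches.push(b), { maxOps: 3 });
buf.add({ ch: 'a' }); buf.add({ ch: 'b' }); buf.add({ ch: 'c' }); // hits maxOps
```

Note the crash-safety caveat from the trade-off above: anything still in `pending` when the process dies is lost, which is why a durable buffer (Redis with AOF, or a write-ahead log) is preferable in production.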
Issue: Polling PostgreSQL Every 2 Seconds
- Scaling Bottleneck: If you have 50 API servers, that is 50 queries every 2 seconds just to check for updates. This is $O(N)$ load on the database that scales linearly with infrastructure cost.
- Failure Mode: Database CPU saturation under load, increasing latency for all operations.
- Solution: Use Database Change Data Capture (CDC) or PostgreSQL LISTEN/NOTIFY. Instead of polling, the DB pushes notifications to the API servers when a document changes.
- Trade-off:
- Pro: Eliminates polling overhead; near-zero latency.
- Con: Tightly couples architecture to PostgreSQL specific features. Requires handling notification backpressure.
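The backpressure concern can be handled by coalescing: a burst of NOTIFY events for the same document is collapsed into a single reload rather than one query per notification. A runnable sketch (`reload` stands in for the real "fetch latest ops and broadcast" step; `drain` would be invoked on a short timer or idle loop):

```javascript
// Coalesce bursts of LISTEN/NOTIFY events per document.
function makeCoalescer(reload) {
  const dirty = new Set(); // documents with unprocessed notifications
  return {
    onNotify(docId) { dirty.add(docId); }, // cheap: just mark dirty
    drain() {                              // called periodically
      for (const id of dirty) reload(id);
      dirty.clear();
    },
  };
}

const reloads = [];
const c = makeCoalescer((id) => reloads.push(id));
// Five NOTIFYs for the same doc arriving within one drain interval...
for (let i = 0; i < 5; i++) c.onNotify('doc-42');
c.drain();
// ...collapse to a single reload of that document.
```

On the database side this pairs with a trigger that runs `NOTIFY` (e.g. `pg_notify('doc_changes', doc_id)`) after each write, while the API servers hold a `LISTEN doc_changes` connection.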
Issue: Full HTML Snapshots Every 30 Seconds
- Failure Mode: A 30-second snapshot interval is a 30-second crash window. If the server dies 29 seconds after the last snapshot, those 29 seconds of edits are lost.
- Data Integrity: Storing full HTML makes computing meaningful diffs impractical. You cannot merge concurrent changes efficiently when the stored representation is raw markup.
- Solution: Store Operation Logs (text insert/delete events) in the DB. Generate snapshots on demand or via a background worker that compiles the log into a state file.
- Trade-off:
- Pro: Full history audit trail; allows "undo" to any point in time.
- Con: Storage costs grow over time; requires log compaction/cleanup strategies.
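The "history/undo" property follows directly from the log model: the document is the fold of its operations, so replaying the log (or a snapshot plus the ops after it) reconstructs the text at any point in time. A minimal sketch with illustrative op shapes:

```javascript
// Apply one insert/delete operation to a text state.
function applyOp(text, op) {
  switch (op.type) {
    case 'insert':
      return text.slice(0, op.pos) + op.text + text.slice(op.pos);
    case 'delete':
      return text.slice(0, op.pos) + text.slice(op.pos + op.len);
    default:
      throw new Error(`unknown op type: ${op.type}`);
  }
}

// Rebuild the document from the log, optionally stopping at a
// given sequence number ("undo to any point in time").
function replay(log, upToSeq = Infinity) {
  return log.filter((op) => op.seq <= upToSeq).reduce(applyOp, '');
}

const log = [
  { seq: 1, type: 'insert', pos: 0, text: 'Hello' },
  { seq: 2, type: 'insert', pos: 5, text: ' world' },
  { seq: 3, type: 'delete', pos: 0, len: 6 },
];
// replay(log)    -> 'world'
// replay(log, 2) -> 'Hello world'
```

A background worker would periodically persist `replay(log)` as a snapshot and truncate the log before it, which is the compaction strategy the Con above calls for.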
3. Infrastructure & Load Balancing
Issue: Round-Robin Load Balancing for WebSockets
- Failure Mode: WebSockets are stateful. A raw WebSocket stays pinned to one server once established, but multi-request handshakes (e.g., Socket.IO's HTTP long-polling upgrade) and client reconnects can land on a different server that has no record of the session, breaking the connection.
- Solution: Enable Sticky Sessions (Session Affinity) on the Load Balancer, or move connection state out of individual Node processes via a shared adapter or dedicated WebSocket gateway (e.g., Socket.IO with its Redis adapter).
- Trade-off:
- Pro: Ensures connection stability.
- Con: Sticky sessions can lead to uneven load distribution (hotspots). A Gateway adds an infrastructure layer.
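As one concrete option, sticky routing can be configured at the load balancer itself. A sketch for nginx (upstream addresses and paths are illustrative):

```nginx
# ip_hash pins each client IP to one upstream, so the handshake,
# any long-polling fallback, and reconnects hit the same Node server.
upstream api_servers {
    ip_hash;                  # session affinity by client IP
    server 10.0.0.11:3000;
    server 10.0.0.12:3000;
}

server {
    listen 443 ssl;
    location /ws/ {
        proxy_pass http://api_servers;
        proxy_http_version 1.1;                  # required for WebSocket
        proxy_set_header Upgrade $http_upgrade;  # forward the upgrade
        proxy_set_header Connection "upgrade";
        proxy_read_timeout 1h;                   # keep idle sockets open
    }
}
```

Note `ip_hash` exhibits exactly the hotspot Con above when many users sit behind one corporate NAT; cookie-based affinity distributes more evenly.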
Issue: CDN Caching API Responses
- Failure Mode: "CloudFront... caches API responses for 5 minutes." This is catastrophic for a collaborative editor. User A edits, User B sees old data for 5 minutes.
- Solution: Disable CDN caching for all API endpoints (/api/*). Only cache static assets (JS, CSS, images). Use Cache-Control: no-store for dynamic document data.
- Trade-off:
- Pro: Ensures users always see the latest data.
- Con: All API traffic now hits the origin servers, increasing their load; only static assets benefit from the CDN.
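The split can be expressed as a small routing rule, e.g. applied as middleware before responses are sent (path prefixes and extensions here are illustrative; match them to your real routes):

```javascript
// Decide the Cache-Control header per request path: dynamic API data
// is never cached, fingerprinted static assets are cached aggressively.
function cacheControlFor(path) {
  if (path.startsWith('/api/')) {
    return 'no-store'; // never cache dynamic document data
  }
  if (/\.(js|css|png|jpg|svg|woff2)$/.test(path)) {
    return 'public, max-age=31536000, immutable'; // fingerprinted assets
  }
  return 'no-cache'; // everything else: revalidate with the origin
}

// e.g. in Express-style middleware:
//   res.setHeader('Cache-Control', cacheControlFor(req.path));
const apiPolicy = cacheControlFor('/api/docs/42');
const assetPolicy = cacheControlFor('/static/app.abc123.js');
```

The `immutable` directive is safe only because the asset filenames are content-fingerprinted; without fingerprinting, use a short `max-age` instead.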
4. Security & Authentication
Issue: LocalStorage JWTs with 24-Hour Expiry
- Failure Mode: XSS Vulnerability. If a script is injected into the page (via a malicious comment or dependency), it can steal the JWT from LocalStorage and impersonate the user for 24 hours.
- Failure Mode: Revocation. If a user is fired, you cannot revoke their access until the token expires (24 hours later).
- Solution: Store Access Tokens in HttpOnly, Secure Cookies. Use a short-lived Access Token (15 mins) + a Refresh Token (stored in HttpOnly Cookie).
- Trade-off:
- Pro: Mitigates XSS token theft; allows immediate revocation.
- Con: Requires CSRF protection (e.g., Double Submit Cookie pattern); slightly more complex auth flow.
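The refresh token's cookie attributes are where the XSS mitigation lives. A sketch of building that Set-Cookie header (cookie name, path, and lifetime are illustrative):

```javascript
// Deliver the refresh token as an HttpOnly cookie so page scripts
// (and therefore XSS payloads) can never read it.
function refreshTokenCookie(token, maxAgeSeconds = 7 * 24 * 3600) {
  return [
    `refresh_token=${token}`,
    `Max-Age=${maxAgeSeconds}`,
    'Path=/auth/refresh',  // only sent to the refresh endpoint
    'HttpOnly',            // invisible to document.cookie / XSS
    'Secure',              // HTTPS only
    'SameSite=Strict',     // basic CSRF mitigation
  ].join('; ');
}

const header = refreshTokenCookie('opaque-random-id');
// Server sends: res.setHeader('Set-Cookie', header)
```

The short-lived access token (15 min) can then live only in client memory; revocation is enforced at the refresh endpoint, so a fired user loses access within one access-token lifetime at worst.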
Issue: Document Partitioning by Organization ID
- Scaling Bottleneck: If one organization has massive traffic (e.g., a large enterprise), it will monopolize the resources of the shard it is assigned to, causing "noisy neighbor" issues.
- Solution: Implement multi-tenancy with per-tenant quotas, or shard by a hash of the org/document ID rather than assigning whole orgs to fixed shards, so load from a single large tenant spreads evenly.
- Trade-off:
- Pro: Better resource isolation and load balancing.
- Con: More complex data migration logic if a shard becomes too hot.
Summary of Recommended Architecture Changes
| Component | Current State | Recommended State | Reason |
|---|---|---|---|
| Sync Logic | LWW + Client Clocks | CRDT / OT + Server Seq IDs | Prevents data loss on concurrent edits. |
| Inter-Server | Polling DB (2s) | Redis Pub/Sub | Reduces DB load; improves latency to <100ms. |
| DB Writes | Immediate PG Write | Buffer / Event Log | Prevents DB I/O saturation. |
| Storage | HTML Snapshots | Operation Logs + Snapshots | Enables history/undo and efficient merging. |
| Auth | LocalStorage JWT | HttpOnly Cookies + Refresh | Prevents XSS token theft; allows revocation. |
| CDN | Caches API | Cache Static Only | Prevents stale document data. |
| LB | Round-Robin | Sticky Sessions / Gateway | Maintains WebSocket connection state. |
Critical "Showstopper" Risks
If you deploy the architecture exactly as described:
- Users will lose text when editing the same paragraph simultaneously (LWW + Client Clocks).
- Collaboration will feel broken because users on different servers will see edits with 2+ second delays (Polling).
- Security will be compromised if a single XSS vulnerability exists (LocalStorage JWT).
- Users will see stale data due to CDN caching API responses.
Recommendation: Prioritize fixing the Sync Strategy (CRDT/OT) and the Inter-Server Communication (Redis Pub/Sub) immediately, as these directly impact the core value proposition of the product.