This architecture contains several critical flaws that would prevent it from functioning as a reliable real-time collaborative editor. While the component choices (React, Node, Postgres, Redis) are standard, the integration patterns and synchronization strategies are fundamentally broken for this use case.
Here is a detailed analysis of the failure modes, race conditions, and bottlenecks, along with specific solutions.
## 1. Critical Concurrency & Data Integrity Issues

### Issue A: "Last-Write-Wins" (LWW) with Client Clocks
- Problem: Relying on client-side timestamps for conflict resolution is fatal.
- Clock Skew: User A's laptop clock is 5 minutes fast; User B's is correct. User A types a character 10 seconds before User B does, but User A's timestamp is 5 minutes ahead, so User A's older change overwrites User B's newer, valid edit.
- Granularity: If two users type within the same millisecond (common in high-frequency typing), the tie-breaking logic is undefined or arbitrary.
- Data Loss: Your design applies LWW at paragraph granularity. If User A edits word 1 and User B edits word 5 of the same paragraph simultaneously, the entire paragraph carrying the later timestamp overwrites the earlier one, silently deleting the other user's work.
- Solution: Implement Operational Transformation (OT) or Conflict-Free Replicated Data Types (CRDTs).
- Approach: Instead of sending full paragraph snapshots, send atomic operations (e.g., `insert char 'a' at index 5`). The server (or a dedicated sync service) transforms these operations against concurrent operations to ensure convergence (see the sketch after this list).
- Trade-off: High implementation complexity. CRDTs require significant memory overhead for metadata; OT requires a central sequencing server. Both are harder to build than simple LWW but are non-negotiable for data integrity.
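To make the operation-based model concrete, here is a minimal TypeScript sketch of an insert-vs-insert transform, the core move in OT. All type and function names are illustrative; a real engine must also handle deletes, cursors, and server-side sequencing:

```typescript
// Transform one insert against a concurrent insert so both replicas
// converge. Ties at the same index are broken by site id, so every
// replica makes the same decision.
type Insert = { index: number; text: string; site: number };

function transform(op: Insert, other: Insert): Insert {
  const otherAppliesFirst =
    other.index < op.index ||
    (other.index === op.index && other.site < op.site); // deterministic tie-break
  return otherAppliesFirst
    ? { ...op, index: op.index + other.text.length }
    : op;
}

// Replica 1 applies `a`, then transform(b, a); replica 2 applies `b`,
// then transform(a, b). Both end with identical document text.
const a: Insert = { index: 5, text: "a", site: 1 };
const b: Insert = { index: 3, text: "xyz", site: 2 };
console.log(transform(a, b)); // { index: 8, text: "a", site: 1 }
```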
### Issue B: The Polling Gap (Split-Brain State)
- Problem: Step 4 states: "Other servers poll PostgreSQL every 2 seconds for changes."
- Latency Window: In a collaborative editor, 2 seconds is an eternity. Users on Server A will not see changes made by users on Server B for up to 2 seconds. This creates a confusing "laggy" experience where text appears/disappears abruptly.
- Race Condition during Poll: If Server A writes at $T=0$, Server B polls at $T=1.9$ (misses it), and Server C polls at $T=2.1$ (gets it), Server B is now out of sync. If a user on Server B edits based on stale data, the subsequent merge will be chaotic.
- Solution: Replace polling with Redis Pub/Sub.
- Approach: When Server A receives a change, it writes to the DB (for persistence) and immediately publishes a message to a Redis channel (e.g., `doc:{id}:updates`). All other API servers subscribe to this channel and instantly broadcast the update to their local WebSocket clients (see the sketch after this list).
- Trade-off: Adds a dependency on Redis availability for real-time consistency (though the DB remains the source of truth). Increases network chatter slightly but reduces latency from seconds to milliseconds.
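A minimal sketch of the pub/sub fan-out, assuming the `ioredis` and `ws` packages; the channel naming and payload shape are illustrative:

```typescript
import Redis from "ioredis";
import type { WebSocket } from "ws";

const pub = new Redis();
const sub = new Redis(); // a subscriber connection can't issue other commands

// docId -> WebSocket clients connected to *this* server
const localClients = new Map<string, Set<WebSocket>>();

// When a local client edits: persist (write-behind, see Issue C),
// then broadcast so every other API server relays it instantly.
async function onLocalEdit(docId: string, op: unknown): Promise<void> {
  await pub.publish(`doc:${docId}:updates`, JSON.stringify(op));
}

// Every API server relays channel messages to its own sockets.
void sub.psubscribe("doc:*:updates");
sub.on("pmessage", (_pattern, channel, message) => {
  const docId = channel.split(":")[1];
  for (const ws of localClients.get(docId) ?? []) ws.send(message);
});
```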
## 2. Scaling Bottlenecks

### Issue C: Database Write Amplification
- Problem: Step 2 states: "Server writes change to PostgreSQL" for every keystroke/change event.
- Throughput Limit: A single active user can generate 5–10 events per second, so 1,000 concurrent users produce 5,000–10,000 writes/sec, concentrated on a single row if they are all editing the same document. PostgreSQL (even with tuning) will choke on row-level locking and Write-Ahead Log (WAL) overhead if every character triggers a disk write.
- Lock Contention: Multiple servers trying to update the same document row simultaneously will cause heavy lock contention, slowing down the entire cluster.
- Solution: Write-Behind (Buffering) Strategy.
- Approach: Changes are applied in memory (via CRDT/OT state) and batched. The server writes to PostgreSQL only every $X$ seconds (e.g., 2s) or after $Y$ operations, whichever comes first. Redis holds the "hot" state (a buffering sketch follows this list).
- Trade-off: Slight risk of data loss if the server crashes between batches (mitigated by Write-Ahead Logs in Redis or periodic snapshots). Drastically reduces DB load.
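A write-behind buffer can be as small as the sketch below (names are illustrative; `flushFn` would issue one multi-row INSERT):

```typescript
// Ops accumulate in memory and flush to PostgreSQL every 2 seconds
// or every 100 ops, whichever comes first.
type Op = { docId: string; payload: string };

class WriteBehindBuffer {
  private ops: Op[] = [];
  private timer: ReturnType<typeof setTimeout> | null = null;

  constructor(
    private flushFn: (batch: Op[]) => Promise<void>,
    private maxOps = 100,
    private maxDelayMs = 2000,
  ) {}

  add(op: Op): void {
    this.ops.push(op);
    if (this.ops.length >= this.maxOps) void this.flush();
    else this.timer ??= setTimeout(() => void this.flush(), this.maxDelayMs);
  }

  private async flush(): Promise<void> {
    if (this.timer) { clearTimeout(this.timer); this.timer = null; }
    const batch = this.ops;
    this.ops = [];
    if (batch.length > 0) await this.flushFn(batch); // one batched INSERT
  }
}
```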
### Issue D: Full HTML Snapshot Storage
- Problem: "Documents saved as full HTML snapshots every 30 seconds."
- Storage Bloat: Storing full versions every 30 seconds creates massive storage costs and makes retrieving specific historical versions inefficient.
- Merge Difficulty: You cannot reconstruct intermediate states between snapshots when a conflict occurs, which forces an all-or-nothing revert model.
- Solution: Event Sourcing / Operational Log.
- Approach: Store the initial document state + an append-only log of every operation (insert/delete) in the database. Snapshots can be generated asynchronously for quick loading, but the source of truth is the operation log (a load-path sketch follows this list).
- Trade-off: Reading the document requires replaying the log (or loading the latest snapshot + replaying recent ops). Query complexity increases, but data fidelity and storage efficiency improve massively.
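Loading then becomes "latest snapshot plus the tail of the log". A sketch using the `pg` client; table and column names are assumptions, not from the original design:

```typescript
import { Pool } from "pg";

const pool = new Pool();

async function loadDocument(docId: string): Promise<string> {
  // Latest async-generated snapshot, if one exists.
  const snap = await pool.query(
    `SELECT content, last_op_seq FROM snapshots
      WHERE doc_id = $1 ORDER BY last_op_seq DESC LIMIT 1`,
    [docId],
  );
  let content: string = snap.rows[0]?.content ?? "";
  const since: number = snap.rows[0]?.last_op_seq ?? 0;

  // Replay only the operations recorded after the snapshot.
  const ops = await pool.query(
    `SELECT op FROM doc_ops WHERE doc_id = $1 AND seq > $2 ORDER BY seq`,
    [docId, since],
  );
  for (const row of ops.rows) content = applyOp(content, row.op);
  return content;
}

declare function applyOp(content: string, op: unknown): string; // CRDT/OT apply
```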
### Issue E: CDN Caching API Responses
- Problem: "CloudFront... caches API responses for 5 minutes."
- Stale Data: If the API returns the current document state, caching it for 5 minutes means users downloading the doc (or refreshing) will see data that is up to 5 minutes old. This contradicts the "real-time" requirement.
- Cache Invalidation: Invalidating CloudFront cache on every edit is expensive and defeats the purpose of caching.
- Solution: Cache Static Assets Only.
- Approach: Configure CloudFront to cache only static JS/CSS/images. Set `Cache-Control: no-store` (or `private`) for all dynamic API endpoints serving document content. Use the CDN only for the initial application shell (see the sketch after this list).
- Trade-off: Higher load on the origin servers for document fetches, but guarantees data freshness.
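In an Express app (assumed here), the split looks like this:

```typescript
import express from "express";

const app = express();

// Application shell and assets: safe to cache aggressively at the edge.
app.use("/static", express.static("dist", { maxAge: "1y", immutable: true }));

// Dynamic document API: tell CloudFront and browsers never to cache.
app.use("/api", (_req, res, next) => {
  res.setHeader("Cache-Control", "no-store");
  next();
});
```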
## 3. Reliability & Security Failure Modes

### Issue F: JWT in LocalStorage
- Problem: "JWT tokens... stored in localStorage."
- XSS Vulnerability: Since the frontend is a React SPA, any third-party script injection (XSS) lets an attacker read the JWT out of localStorage and impersonate the user for the token's entire lifetime (here, up to 24 hours).
- Solution: HttpOnly Cookies.
- Approach: Store the JWT (or a session identifier) in an `HttpOnly`, `Secure`, `SameSite=Strict` cookie. The browser sends it automatically; JavaScript cannot access it (see the sketch after this list).
- Trade-off: Slightly more complex CSRF protection setup (though `SameSite` handles most cases). Requires the API and frontend to share a domain or handle cross-origin cookie policies carefully.
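A sketch of issuing the cookie in Express (route and helper names are illustrative):

```typescript
import express from "express";

const app = express();
app.use(express.json());

app.post("/login", (req, res) => {
  const token = issueJwt(req.body); // your existing JWT issuance
  res.cookie("session", token, {
    httpOnly: true,              // invisible to document.cookie, so XSS can't read it
    secure: true,                // sent over HTTPS only
    sameSite: "strict",          // not sent on cross-site requests (CSRF)
    maxAge: 24 * 60 * 60 * 1000, // match the 24h token lifetime
  });
  res.sendStatus(204);
});

declare function issueJwt(body: unknown): string;
```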
### Issue G: Single Point of Failure in WebSocket Routing
- Problem: "Each API server maintains its own WebSocket connections." + "Round-robin load balancer."
- Connection Stickiness: A load balancer that is not connection-aware can break the WebSocket upgrade, and even when the upgrade succeeds, a client that reconnects through round-robin lands on a different backend that has none of its in-memory session state.
- Server Failure: If Server A crashes, all users connected to it lose their connection and unsaved in-memory state (if not synced to Redis/DB immediately).
- Solution: Sticky Sessions + Graceful Degradation.
- Approach: Configure the Load Balancer for Sticky Sessions (Session Affinity) based on a cookie or IP, ensuring a WS client stays pinned to the same backend server. Implement client-side reconnection logic with exponential backoff that reconnects to any available server, fetching the latest state from the DB/Redis upon reconnect (a reconnect sketch follows this list).
- Trade-off: Sticky sessions can lead to uneven load distribution if some documents are "hotter" than others. Requires robust client-side state reconciliation on reconnect.
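The client-side half can be a few lines (the URL and resync message are illustrative):

```typescript
// Reconnect with exponential backoff plus jitter, then ask the server
// for any state missed while disconnected.
function connect(docId: string, attempt = 0): void {
  const ws = new WebSocket(`wss://api.example.com/docs/${docId}`);

  ws.onopen = () => {
    attempt = 0; // reset backoff after a successful connection
    ws.send(JSON.stringify({ type: "resync", docId }));
  };

  ws.onclose = () => {
    // 1s, 2s, 4s ... capped at 30s; jitter avoids a thundering herd
    const delay = Math.min(1000 * 2 ** attempt, 30_000) * (0.5 + Math.random());
    setTimeout(() => connect(docId, attempt + 1), delay);
  };
}
```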
### Issue H: Organization Partitioning Limits
- Problem: "Document partitioning by organization ID."
- Hot Partition: If one large organization (e.g., a major enterprise client) has 10,000 active users, their database shard/partition is overloaded while others sit idle. This is a classic hot-partition problem, and any smaller tenants sharing that shard become noisy-neighbor victims.
- Solution: Hybrid Sharding or Logical Separation.
- Approach: Do not physically shard solely by Org ID unless orgs are guaranteed to stay small. Instead, shard by hashed `DocumentID`, or use a managed cloud database that scales storage/compute independently of logical tenancy. Use Row-Level Security (RLS) in Postgres for data isolation rather than physical partitioning (see the sketch after this list).
- Trade-off: Hashing by DocumentID spreads load better but makes querying "all docs for Org X" slightly more complex (requires querying all shards or a secondary index).
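Shard selection then becomes a pure function of the document id (the hash choice here is illustrative):

```typescript
const SHARD_COUNT = 16;

// FNV-1a: a simple, stable 32-bit string hash.
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

const shardFor = (docId: string): number => fnv1a(docId) % SHARD_COUNT;

// Note: "all docs for Org X" now needs a secondary index mapping
// orgId -> docIds, because those rows span shards.
```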
## Summary of Recommended Architecture Changes
| Component | Current Flawed Approach | Recommended Robust Approach |
|---|---|---|
| Sync Logic | Last-Write-Wins (Client Time) | CRDTs or Operational Transformation (OT) |
| Inter-Server Sync | Poll DB every 2s | Redis Pub/Sub for instant broadcast |
| DB Writes | Write every keystroke | Write-Behind Batching (Memory/Redis -> DB) |
| Storage Format | Full HTML Snapshots | Operation Log (Event Sourcing) |
| Caching | Cache API (5 min) | No Cache for dynamic data; CDN for static assets only |
| Auth Storage | LocalStorage | HttpOnly Cookies |
| Load Balancing | Round-Robin | Sticky Sessions for WebSockets |
| Scaling Unit | Partition by Org ID | Partition by Doc ID or Managed Cloud DB |
## Final Verdict
The current architecture will result in data loss, visible lag, and security vulnerabilities under any realistic load. The shift from "snapshot-based LWW" to "operation-based CRDT/OT" with "Redis-backed pub/sub" is the most critical pivot required to make this system viable.