Here is an analysis of the failure modes, race conditions, and scaling bottlenecks in the proposed architecture, followed by specific solutions and their trade-offs.
1. Conflict Resolution & Data Integrity
Issue: Unreliable Client-Clock Timestamps (The "Client Time" Problem)
- Problem: The architecture uses Last-Write-Wins (LWW) based on timestamps provided by the client's browser clock.
- Clock Drift: Client system clocks are rarely perfectly synced. If User A’s clock is 5 minutes fast, their edits will overwrite User B’s more recent edits, permanently losing User B's work.
- Tampering: Clients can trivially spoof timestamps to win every conflict or misattribute authorship.
- Simultaneous Editing: If two users edit the exact same paragraph at nearly the same moment (even with millisecond precision), whichever write carries the later timestamp wins outright, and the other user's content is silently discarded. This is silent data loss.
- Solution: Operational Transformation (OT) or CRDTs (Conflict-free Replicated Data Types).
- Instead of comparing timestamps, compare the operations (e.g., "insert character X at index Y"). The system can determine the correct order of operations mathematically.
- Alternatively, use server-side timestamps. The server assigns the timestamp on receipt, which eliminates clock drift and tampering, and enforces the merge logic for concurrent edits (e.g., diffing the changed range rather than blindly overwriting the whole document).
- Trade-offs:
- CRDTs/OT: High implementation complexity. OT is notoriously difficult to implement bug-free. CRDTs are easier to reason about but carry metadata overhead (e.g., tombstones for deleted characters) and more complex state management.
- Server-side Merge: Requires complex text-diffing algorithms to merge HTML content reliably without corrupting the document structure.
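To make the clock-drift failure concrete, here is a minimal sketch of an LWW register (class and variable names are illustrative, not from the architecture) showing how a fast client clock silently discards a later real-time edit:

```python
import time

class LWWRegister:
    """Minimal Last-Write-Wins register: keeps whichever write
    carries the highest timestamp and silently drops the other."""
    def __init__(self):
        self.value = None
        self.timestamp = -1.0

    def write(self, value, timestamp):
        # Later (or equal) timestamps win; earlier ones are discarded.
        if timestamp >= self.timestamp:
            self.value = value
            self.timestamp = timestamp
            return True
        return False  # the write is silently lost

doc = LWWRegister()
now = time.time()

# User A's clock runs 5 minutes (300 s) fast.
doc.write("User A's paragraph", now + 300)

# User B edits *after* A in real time, but with an honest clock --
# the write loses the timestamp comparison and B's work vanishes.
accepted = doc.write("User B's paragraph", now + 1)
assert accepted is False
assert doc.value == "User A's paragraph"
```

This is exactly why comparing operations (OT/CRDT) or letting the server assign timestamps is safer than trusting the browser clock.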
2. Real-Time Performance & Latency
Issue: The "Polling Gap" (2-Second Latency)
- Problem: The architecture relies on "Other servers poll PostgreSQL every 2 seconds."
- This creates a lag of up to 2 seconds between a user typing and another user seeing the change. This is not "real-time" and feels laggy to the user.
- Polling creates "thundering herd" problems on the database (hundreds of servers querying the DB simultaneously every 2 seconds).
- Solution: Publish/Subscribe (Pub/Sub) Pattern using Redis.
- Instead of polling, use a message broker. When a server writes a change to the DB, it publishes that change to a Redis channel (e.g., doc:123:updates).
- All API servers subscribe to this channel. When a message arrives, they push the update to their connected WebSocket clients immediately.
- Trade-offs:
- Complexity: Adds a dependency on Redis for real-time communication, not just caching.
- Reliability: If Redis fails, real-time sync fails. (Mitigation: Use a highly available Redis cluster).
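The pattern above can be sketched with an in-process event bus standing in for Redis (in production, redis-py's pubsub API plays this role across servers; the class and channel names here are illustrative):

```python
from collections import defaultdict

class EventBus:
    """In-process stand-in for Redis Pub/Sub: maps a channel name
    to subscriber callbacks. Across real servers, Redis delivers
    the message to every subscribed process instead."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, channel, callback):
        self.subscribers[channel].append(callback)

    def publish(self, channel, message):
        for callback in self.subscribers[channel]:
            callback(message)

bus = EventBus()
received = []

# Servers B and C subscribe on behalf of their WebSocket clients.
bus.subscribe("doc:123:updates", lambda msg: received.append(("B", msg)))
bus.subscribe("doc:123:updates", lambda msg: received.append(("C", msg)))

# Server A persists the change, then publishes -- no polling, no 2 s lag.
bus.publish("doc:123:updates", {"op": "insert", "index": 10, "text": "Hi"})

assert len(received) == 2
assert [server for server, _ in received] == ["B", "C"]
```

Each subscribing server then fans the message out to its own connected WebSocket clients.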
Issue: CDN Cache Invalidation (The "Stale Data" Problem)
- Problem: The architecture specifies "CloudFront caches API responses for 5 minutes."
- If User A edits a document, User B (served the cached response) will not see the change for up to 5 minutes. This completely negates the "real-time" requirement.
- Solution: Cache Busting / Dynamic Cache Headers.
- Do not cache API responses that contain document data.
- Only cache the HTML snapshots for read-only users (if applicable) or use a short TTL (e.g., 30 seconds) with aggressive invalidation.
- Use a "version" query parameter in the API URL (e.g., GET /doc/123?ver=abc). The CDN treats each version as a distinct cache object, so requests for a new version bypass stale entries automatically.
- Trade-offs:
- Performance: You lose the caching benefit for API calls, increasing backend load.
- Implementation: Requires careful Cache-Control header management so that browsers and intermediaries never serve stale document responses.
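A minimal sketch of the version-parameter approach, deriving the token from a content hash (the helper name and URL shape are assumptions for illustration):

```python
import hashlib

def versioned_url(doc_id: str, content: str) -> str:
    """Build a cache-busting URL: the ?ver= token changes whenever the
    content changes, so the CDN treats each version as a new object."""
    ver = hashlib.sha256(content.encode()).hexdigest()[:8]
    return f"/doc/{doc_id}?ver={ver}"

url_v1 = versioned_url("123", "<p>Hello</p>")
url_v2 = versioned_url("123", "<p>Hello, world</p>")

assert url_v1 != url_v2                                 # an edit yields a new URL
assert url_v1 == versioned_url("123", "<p>Hello</p>")   # stable for same content
```

Clients must first learn the current version (e.g., via the WebSocket channel), which is what makes the long CDN TTL safe: stale URLs are simply never requested again.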
3. Data Storage & Database Load
Issue: Full HTML Snapshots vs. Delta Storage
- Problem: "Documents saved as full HTML snapshots every 30 seconds."
- Storage Bloat: Storing 10MB HTML snapshots every 30 seconds for every active document fills the database extremely quickly.
- Merge Complexity: You cannot merge HTML snapshots easily. If User A adds a <b> tag and User B changes a word, merging the snapshots is error-prone and can corrupt the DOM structure.
- Solution: Store Operations (Deltas) or JSON Text.
- Store the change (e.g., { "action": "insert", "text": "Hello", "index": 10 }) rather than the full document.
- Persist only the latest state in PostgreSQL, but keep an audit log or history table for "snapshots" if needed for rollback.
- Trade-offs:
- Frontend Complexity: The frontend must reconstruct the document from scratch every time or apply incremental patches. This requires a robust text engine (like ProseMirror or Yjs).
- Storage: Still requires storing the current state, but history is much smaller.
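As a sketch of the delta approach, here is how stored operations of the shape above could be replayed over a base snapshot to rebuild the latest state (the delete operation and field names beyond the source's example are assumptions):

```python
def apply_delta(doc: str, op: dict) -> str:
    """Apply a single stored operation (delta) to the document state."""
    if op["action"] == "insert":
        return doc[:op["index"]] + op["text"] + doc[op["index"]:]
    if op["action"] == "delete":
        return doc[:op["index"]] + doc[op["index"] + op["length"]:]
    raise ValueError(f"unknown action: {op['action']}")

# Rebuild the current state by replaying the delta log over a base snapshot.
base = "Hello world"
log = [
    {"action": "insert", "index": 5, "text": ","},
    {"action": "delete", "index": 6, "length": 6},
    {"action": "insert", "index": 6, "text": " there"},
]
state = base
for op in log:
    state = apply_delta(state, op)

assert state == "Hello, there"
```

Each delta is a few dozen bytes versus a multi-megabyte snapshot, which is where the storage savings come from; periodic snapshots keep replay times bounded.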
4. Fault Tolerance & State Management
Issue: Server-Side State Loss (The "Crash" Problem)
- Problem: "Each API server maintains its own WebSocket connections... Server writes change to PostgreSQL... Server broadcasts change."
- If Server A crashes after writing to the DB but before broadcasting to its clients, the clients on Server A will be desynchronized. They will think their edits were saved, but the rest of the cluster didn't receive them.
- Solution: Two-Phase Commit or Idempotency Keys.
- When a client sends a change, it attaches a unique idempotency_key it generated.
- If the client doesn't receive a success response (or receives a retry signal), it resends the change with the same key. The server checks the key and ignores duplicates but re-broadcasts the missed update to its connected clients.
- Trade-offs:
- Complexity: Requires clients to handle retries and state management.
- Performance: Adds a database lookup to ensure the change hasn't already been processed.
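A minimal sketch of the server-side dedupe (class and field names are illustrative; in production the seen-keys map would be a database table rather than a dict):

```python
import uuid

class ChangeProcessor:
    """Server-side dedupe: each change carries a client-generated
    idempotency key; replays are acknowledged but not re-applied."""
    def __init__(self):
        self.seen = {}   # idempotency_key -> original ack (in prod: a DB table)
        self.log = []    # applied changes

    def apply(self, key: str, change: dict) -> dict:
        if key in self.seen:
            # Duplicate: return the original ack (and optionally re-broadcast).
            return self.seen[key]
        self.log.append(change)
        ack = {"status": "ok", "seq": len(self.log)}
        self.seen[key] = ack
        return ack

server = ChangeProcessor()
key = str(uuid.uuid4())
change = {"action": "insert", "index": 0, "text": "Hi"}

first = server.apply(key, change)
# The client never saw the ack (server crashed mid-broadcast) and retries:
retry = server.apply(key, change)

assert first == retry          # same ack both times
assert len(server.log) == 1    # the change was applied exactly once
```

This makes retries safe: the client can resend aggressively after a crash or timeout without risking duplicate edits.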
5. Architecture Scalability
Issue: The "Server Affinity" Bottleneck
- Problem: The architecture implies that a user connects to "their connected server" and broadcasts only to that server's clients.
- If a user is on Server A, and edits a document, Server B and Server C (who have users viewing that doc) do not know about the edit until they poll the DB.
- If the document is critical and traffic spikes, the "polling" interval (2s) might be too slow, and the DB will be hammered by polling requests from all servers.
- Solution: Sharding with a Global Event Bus.
- Partition documents (e.g., by Org ID) so that all servers handling a given org subscribe to the same event channels.
- Implement the Redis Pub/Sub solution mentioned in point #2. This decouples the servers; Server A writes to DB, publishes to Redis, and Server B/C automatically subscribe and update their clients without polling the DB.
- Trade-offs:
- Network Overhead: Redis Pub/Sub adds network chatter between servers.
- Dependency: Redis becomes a single point of failure unless it is deployed for high availability (primary-replica replication with Sentinel, or Cluster mode).
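The partitioning idea can be sketched as deterministic channel derivation: every server hashes the Org ID the same way, so publishers on Server A and subscribers on Servers B/C always agree on the channel name (the helper names, shard count, and channel format below are assumptions for illustration):

```python
import hashlib

def shard_for_org(org_id: str, num_shards: int) -> int:
    """Deterministically map an org to a shard; every server computes
    the same shard for the same org with no coordination."""
    digest = hashlib.sha256(org_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_shards

def channel_for_doc(org_id: str, doc_id: str, num_shards: int = 16) -> str:
    """Channel name a server publishes to / subscribes on for a document."""
    return f"shard:{shard_for_org(org_id, num_shards)}:doc:{doc_id}"

# Server A (publisher) and Server B (subscriber) derive the channel
# independently and always match.
publisher_channel = channel_for_doc("acme", "123")
subscriber_channel = channel_for_doc("acme", "123")

assert publisher_channel == subscriber_channel
assert 0 <= shard_for_org("acme", 16) < 16
```

Because the mapping is pure and stateless, no registry of "which server owns which doc" is needed; the event bus does the fan-out.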