This architecture for a real-time collaborative document editor is a solid starting point, but it contains several potential failure modes, race conditions, and scaling bottlenecks. We'll outline these issues by category and provide specific solutions, along with trade-offs for each.
🔥 Failure Modes
1. WebSocket Session Stickiness
- Problem: API servers maintain their own WebSocket connections, but a load balancer using round-robin may route a user to a different server upon reconnection, losing context/state.
- Impact: Lost session state, missed updates, or document desyncs.
- Solution:
- Use sticky sessions at the load balancer (e.g., AWS ALB with session affinity).
- Better: Use a shared pub/sub layer (e.g., Redis Pub/Sub, Apache Kafka) where all servers broadcast/receive real-time updates.
- Trade-off: Adds operational complexity and latency, but ensures state consistency across servers.
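The pub/sub pattern can be illustrated with a minimal in-process sketch. The `Broker` class here is a stand-in for Redis Pub/Sub (or Kafka topics), and `ApiServer` stands in for an API server relaying messages to its local WebSocket clients; all names are illustrative, not a real library API.

```python
from collections import defaultdict

class Broker:
    """In-process stand-in for a shared pub/sub layer such as Redis Pub/Sub."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, channel, handler):
        self._subscribers[channel].append(handler)

    def publish(self, channel, message):
        # Fan out to every subscribed server, including the publisher,
        # mirroring Redis Pub/Sub semantics.
        for handler in list(self._subscribers[channel]):
            handler(message)

class ApiServer:
    """Each API server relays broker messages to its local WebSocket clients."""
    def __init__(self, name, broker):
        self.name = name
        self.broker = broker
        self.delivered = []  # stands in for pushes to local WebSocket clients

    def join_document(self, doc_id):
        self.broker.subscribe(f"doc:{doc_id}", self.delivered.append)

    def broadcast_edit(self, doc_id, edit):
        self.broker.publish(f"doc:{doc_id}", edit)
```

With this shape, a user can reconnect to any server: the server subscribes to the document's channel and immediately starts receiving the same update stream as every other server, so stickiness becomes an optimization rather than a correctness requirement.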
2. Client Clock Drift
- Problem: Last-write-wins with client-side timestamps assumes synchronized clocks. Clock skew can cause updates to be applied out-of-order.
- Impact: Data loss or incorrect overwrites.
- Solution:
- Use server-generated timestamps.
- Alternatively, implement vector clocks or operational transforms (OT) / conflict-free replicated data types (CRDTs).
- Trade-off: Server timestamps add round-trip latency. OT/CRDTs are complex to implement but provide precise conflict resolution.
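To make the vector-clock option concrete, here is a minimal sketch of the comparison logic. Clocks are plain dicts mapping a client/node id to a counter; the key property is that `vc_compare` can report "concurrent" where wall-clock timestamps would silently pick an arbitrary winner.

```python
def vc_increment(clock, node):
    """Return a copy of the clock with this node's counter advanced by one."""
    out = dict(clock)
    out[node] = out.get(node, 0) + 1
    return out

def vc_compare(a, b):
    """Return 'before', 'after', 'equal', or 'concurrent'.

    a happened-before b iff every component of a is <= the matching
    component of b; if neither dominates, the updates are concurrent
    and need explicit conflict resolution (OT/CRDT merge).
    """
    keys = set(a) | set(b)
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"
```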
3. PostgreSQL Polling Delay
- Problem: Servers poll PostgreSQL every 2 seconds for changes. This introduces latency in update propagation and increases DB load.
- Impact: Delayed updates between users on different servers.
- Solution:
- Replace polling with PostgreSQL LISTEN/NOTIFY or use a real-time change data capture (CDC) system (e.g., Debezium + Kafka).
- Trade-off: Requires infrastructure changes. LISTEN/NOTIFY has limits on payload size and connection count.
4. Single Point of Failure: PostgreSQL
- Problem: PostgreSQL is a single point of failure for writes, even with read replicas.
- Impact: Downtime or data loss on DB failure.
- Solution:
- Use managed PostgreSQL with automated failover (e.g., AWS Aurora).
- Consider sharding or partitioning documents by org ID.
- Trade-off: Sharding adds complexity in query logic and data management.
5. Redis Failure
- Problem: Redis used for session cache is a potential single point of failure.
- Impact: Session loss, auth issues, degraded performance.
- Solution:
- Use Redis in a clustered or replicated setup with failover support (e.g., Redis Sentinel or AWS ElastiCache).
- Trade-off: Slightly more expensive and complex.
6. JWT in localStorage
- Problem: JWTs stored in localStorage are vulnerable to XSS attacks.
- Impact: Token theft, unauthorized access.
- Solution:
- Store JWTs in HttpOnly, Secure cookies.
- Use short-lived access tokens with refresh tokens stored securely.
- Trade-off: Slightly more complex auth flow, but significantly more secure.
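Whatever framework sets the cookie, the attributes that matter are the same. A sketch of building the `Set-Cookie` header value (the cookie name and 15-minute lifetime are illustrative choices, not requirements):

```python
def auth_cookie(token, max_age=900):
    """Build a Set-Cookie header value for a short-lived access token.

    HttpOnly keeps the token out of reach of page scripts (mitigating
    XSS theft), Secure restricts it to HTTPS, and SameSite=Strict
    limits CSRF exposure.
    """
    return (
        f"access_token={token}; Max-Age={max_age}; Path=/; "
        "HttpOnly; Secure; SameSite=Strict"
    )
```

Note that `SameSite=Strict` can interfere with legitimate cross-site navigation flows; `Lax` is a common compromise when that matters.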
⚠️ Race Conditions & Data Consistency Risks
1. Concurrent Edits in Same Paragraph
- Problem: Last-write-wins can cause loss of intermediate edits.
- Impact: Overwrites and inconsistent user experience.
- Solution:
- Use OT or CRDTs for conflict-free merging of edits.
- Or implement paragraph-level locking/versioning.
- Trade-off: OT/CRDTs are complex but scalable. Locking can cause UX issues under high contention.
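The CRDT idea can be shown with a toy Logoot-style sequence: every element carries a globally unique `(position, site)` id, so concurrent inserts from two replicas merge by sorting rather than by overwriting. This sketch omits deletes and uses naive midpoint positions (which exhaust float precision in practice); production systems use libraries like Yjs or Automerge.

```python
import bisect

class SeqCRDT:
    """Toy Logoot-style sequence CRDT for one paragraph's text."""
    def __init__(self, site):
        self.site = site
        self.elems = []  # sorted list of ((position, site), text)

    def insert(self, index, text):
        # Pick a position strictly between the neighbors' positions;
        # the site id breaks ties deterministically across replicas.
        left = self.elems[index - 1][0][0] if index > 0 else 0.0
        right = self.elems[index][0][0] if index < len(self.elems) else 1.0
        op = (((left + right) / 2, self.site), text)
        self.apply(op)
        return op  # broadcast this op to other replicas

    def apply(self, op):
        bisect.insort(self.elems, op)

    def value(self):
        return "".join(text for _, text in self.elems)
```

Because both replicas sort by the same ids, applying the same set of ops in any order yields the same text: no edit is lost, unlike last-write-wins.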
2. Simultaneous Server Writes
- Problem: Two servers may write to the DB for the same document based on stale state.
- Impact: Write conflicts, inconsistent document state.
- Solution:
- Use optimistic concurrency control (e.g., version column with each write).
- Reject or retry conflicting updates.
- Trade-off: Adds complexity to write logic.
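The version-column pattern can be sketched with an in-memory stand-in for the table. In SQL this would be a single statement along the lines of `UPDATE docs SET body = $1, version = version + 1 WHERE id = $2 AND version = $3`, checking the affected row count; the class below just models the same compare-and-swap.

```python
class VersionConflict(Exception):
    """Raised when a write is based on a stale version of the document."""

class DocumentStore:
    """In-memory stand-in for a table with a version column."""
    def __init__(self):
        self.rows = {}  # doc_id -> (version, body)

    def read(self, doc_id):
        return self.rows.get(doc_id, (0, ""))

    def write(self, doc_id, body, expected_version):
        current, _ = self.read(doc_id)
        if current != expected_version:
            # The caller read a stale version: reject so it can
            # re-read, re-merge, and retry.
            raise VersionConflict(f"expected v{expected_version}, found v{current}")
        self.rows[doc_id] = (current + 1, body)
        return current + 1
```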
3. CDN Caching API Responses
- Problem: CDN caches API responses for 5 minutes, which may serve stale data (e.g., document state or user permissions).
- Impact: Users see outdated content.
- Solution:
- Use cache headers appropriately: Cache-Control: no-store or short TTLs for dynamic content.
- Use cache-busting query params or ETags.
- Trade-off: Reduces CDN cache hit rate.
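The ETag approach can keep some of the bandwidth savings even for dynamic endpoints: `Cache-Control: no-cache` forces revalidation on every request, but a matching `If-None-Match` lets the server answer with an empty 304 instead of the full body. A framework-agnostic sketch (the tuple-returning `respond` helper is illustrative):

```python
import hashlib

def make_etag(body):
    """Derive a strong ETag from the response body's content hash."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'

def respond(body, if_none_match=None):
    """Return (status, headers, payload) for a dynamic API response.

    no-cache (unlike no-store) allows caching but requires revalidation,
    so unchanged documents cost a 304 round-trip instead of a full body.
    """
    etag = make_etag(body)
    headers = {"ETag": etag, "Cache-Control": "private, no-cache"}
    if if_none_match == etag:
        return 304, headers, b""
    return 200, headers, body
```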
🚧 Scaling Bottlenecks
1. WebSocket Scalability
- Problem: Each server maintains its own WebSocket connections with no shared fan-out layer, so an edit received on one server cannot reach clients connected to another, and per-server connection counts cap horizontal scaling.
- Impact: Hard to scale horizontally, inconsistent state across servers.
- Solution:
- Use a shared WebSocket backend (e.g., Socket.IO with Redis adapter, or a dedicated message broker like NATS).
- Or offload WebSocket handling to a service like AWS API Gateway + Lambda or Ably/Pusher.
- Trade-off: Increased architectural complexity, but essential for scale.
2. Document Save Strategy
- Problem: Saving full HTML snapshots every 30 seconds stores mostly redundant data and captures none of the edits made between snapshots.
- Impact: Storage grows quickly, and fine-grained undo/history is hard to support.
- Solution:
- Save a diff/patch log (event sourcing) and periodically snapshot for recovery.
- Use versioned documents with granular delta storage.
- Trade-off: More complex, but enables better history, undo, and auditing.
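A minimal sketch of the log-plus-snapshot idea: edits are appended to an op log, a snapshot is materialized every N ops (the interval here is an assumed tuning knob), and reads replay only the ops since the latest snapshot.

```python
class DocumentLog:
    """Append-only patch log with periodic snapshots."""
    SNAPSHOT_EVERY = 100  # ops between snapshots (assumed tuning knob)

    def __init__(self):
        self.ops = []        # append-only list of (op, payload)
        self.snapshots = {}  # op index -> materialized text at that point

    def append(self, op, payload):
        self.ops.append((op, payload))
        if len(self.ops) % self.SNAPSHOT_EVERY == 0:
            self.snapshots[len(self.ops)] = self.materialize(len(self.ops))

    def materialize(self, upto=None):
        """Rebuild the document from the nearest snapshot at or before upto."""
        upto = len(self.ops) if upto is None else upto
        base = max((i for i in self.snapshots if i <= upto), default=0)
        text = self.snapshots.get(base, "")
        for op, payload in self.ops[base:upto]:
            if op == "insert":
                pos, s = payload
                text = text[:pos] + s + text[pos:]
            elif op == "delete":
                pos, n = payload
                text = text[:pos] + text[pos + n:]
        return text
```

The same log doubles as an audit trail and an undo stack: undo is replay-to-an-earlier-index, and history views are free.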
3. Load Balancer Round-Robin
- Problem: Round-robin doesn’t account for server load or sessions.
- Impact: Uneven load, missed sessions after reconnect.
- Solution:
- Use a load balancer with health checks and weighted routing or session affinity.
- Trade-off: Slightly more infrastructure config.
4. Organization-Based Partitioning
- Problem: Partitioning by organization ID is good, but can lead to hotspots for large organizations.
- Impact: Uneven load, potential DB bottlenecks.
- Solution:
- Further partition data by document ID or user ID within organizations.
- Use distributed databases if scale demands (e.g., CockroachDB, YugabyteDB).
- Trade-off: Increases data model complexity.
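A simple way to break up a hot organization is to hash the org and document ids together when picking a shard, so one large org's documents spread across partitions instead of piling onto a single one. A sketch (the shard count of 16 is an assumed value):

```python
import hashlib

NUM_SHARDS = 16  # assumed shard count

def shard_for(org_id, doc_id):
    """Map (org, document) to a shard.

    Hashing the composite key spreads a single large org's documents
    across shards, at the cost of scatter-gather for org-wide queries.
    """
    key = f"{org_id}:{doc_id}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % NUM_SHARDS
```

The trade-off shows up in queries spanning a whole org ("list all documents in org X"), which now fan out to every shard; keeping an org-to-documents index in a separate store is a common mitigation.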
✅ Summary of Recommendations
| Issue | Solution | Trade-off |
|---|---|---|
| Client timestamp conflicts | Use server timestamps or CRDT | Complexity vs correctness |
| Polling DB for changes | Use LISTEN/NOTIFY or CDC | Infra changes |
| WebSocket scaling | Use Redis Pub/Sub or managed service | Increased infra complexity |
| JWT in localStorage | Use Secure HttpOnly cookies | Auth flow complexity |
| Full document snapshots | Store diffs + snapshots | More storage logic |
| CDN caching API | Use no-store / ETags | Lower cache hit rate |
| Redis single point | Use Redis cluster | Higher cost |
| PostgreSQL SPOF | Use managed DB with failover | Cost, setup |
| Load balancer routing | Use sticky sessions | State management |
| Org-based partitioning | Add finer-grained partitioning | Complexity |
By addressing these failure modes and bottlenecks with targeted improvements, this architecture can scale more robustly while maintaining real-time collaboration and data integrity.