Here’s a comprehensive analysis of potential failure modes, race conditions, and scaling bottlenecks in this architecture, along with proposed solutions and their trade-offs:
1. Real-Time Sync Issues
A. WebSocket Connection Failures
- Problem: If a WebSocket connection drops (e.g., due to network issues), the client may miss updates until it reconnects. The server may not detect the disconnection immediately (e.g., due to TCP keepalive timeouts).
- Solution:
- Implement exponential backoff reconnection on the client with a max retry limit.
- Use heartbeat messages (ping/pong) every 30 seconds to detect dead connections.
- Trade-off: Increases client-side complexity and network overhead.
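The reconnect policy above can be sketched as a pure delay function (Python for illustration; the real client would be browser JavaScript, and the base/cap values are assumptions, not part of the original design):

```python
import random

def reconnect_delay(attempt, base=1.0, cap=30.0, max_retries=8):
    """Return seconds to wait before reconnect attempt `attempt`
    (0-based), or None once the retry budget is exhausted."""
    if attempt >= max_retries:
        return None
    # Full jitter: sleep a random amount up to the exponential bound,
    # so reconnecting clients don't stampede the server in lockstep.
    bound = min(cap, base * (2 ** attempt))
    return random.uniform(0, bound)
```

Jitter matters here: without it, every client that lost its connection at the same moment retries at the same moment.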
B. WebSocket Server Failures
- Problem: If an API server crashes, all WebSocket connections on that server are lost. Clients must reconnect to another server, but may miss updates during the failover.
- Solution:
- Use a WebSocket-aware load balancer (e.g., AWS ALB with WebSocket support) to route connections to healthy servers.
- Implement session affinity (sticky sessions) so clients reconnect to the same server if possible.
- Trade-off: Sticky sessions reduce load balancing flexibility and may lead to uneven server loads.
C. Cross-Server Sync Latency
- Problem: Servers poll PostgreSQL every 2 seconds for changes, creating a 2-second sync delay between servers. This can cause conflicts if two users on different servers edit the same paragraph.
- Solution:
- Replace polling with PostgreSQL logical replication or CDC (Change Data Capture) to stream changes to all servers in real-time.
- Use Redis Pub/Sub for cross-server broadcast of changes (each server subscribes to a Redis channel for document updates).
- Trade-off:
- CDC adds complexity to PostgreSQL setup.
- Redis Pub/Sub is fast but not persistent (messages lost if Redis crashes).
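The Pub/Sub fan-out pattern can be sketched in-process (Python, illustrative only; in production the dict below is replaced by Redis `PUBLISH`/`SUBSCRIBE` on a per-document channel, and each subscriber callback rebroadcasts to that server's local WebSocket clients):

```python
from collections import defaultdict

class Broker:
    """In-memory stand-in for Redis Pub/Sub (illustrative only)."""
    def __init__(self):
        self.channels = defaultdict(list)  # channel -> subscriber callbacks

    def subscribe(self, channel, callback):
        self.channels[channel].append(callback)

    def publish(self, channel, message):
        # Fan out to every subscribed server; return the receiver
        # count, matching Redis PUBLISH semantics.
        for cb in self.channels[channel]:
            cb(message)
        return len(self.channels[channel])

# Each API server subscribes to the channels of documents it hosts.
broker = Broker()
received = []
broker.subscribe("doc:42", received.append)   # server A
broker.subscribe("doc:42", lambda m: None)    # server B
n = broker.publish("doc:42", {"para": 7, "op": "insert"})
```

Note the at-most-once delivery: a server that is down during `publish` never sees the message, which is exactly the persistence trade-off listed above.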
D. Clock Skew in Last-Write-Wins (LWW)
- Problem: LWW relies on client timestamps, which can be skewed (e.g., due to incorrect system clocks). A client with a slow clock can submit the newest change and still lose, because its timestamp appears older than an earlier edit's.
- Solution:
- Use server-side timestamps (from a centralized NTP-synchronized clock) instead of client timestamps.
- Alternatively, use operational transformation (OT) or CRDTs (Conflict-Free Replicated Data Types) for conflict resolution.
- Trade-off:
- Server-side timestamps add latency (client must wait for server ack).
- OT/CRDTs are complex to implement and may increase storage overhead.
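The server-timestamp fix can be sketched as a small LWW register (Python, illustrative; the `(timestamp, sequence)` tiebreaker is an assumption, one common way to make the outcome independent of client clocks):

```python
class LWWRegister:
    """Last-write-wins cell keyed by a server-assigned (ts, seq) pair.

    A monotonically increasing server sequence breaks ties between
    writes that land in the same server-clock second, so skewed
    client clocks never decide the winner.
    """
    def __init__(self):
        self.value = None
        self.stamp = (-1, -1)  # (server_ts, server_seq)
        self._seq = 0

    def write(self, value, server_ts):
        self._seq += 1
        stamp = (server_ts, self._seq)
        if stamp > self.stamp:   # lexicographic compare
            self.value, self.stamp = value, stamp

reg = LWWRegister()
reg.write("hello", server_ts=100)
reg.write("hella", server_ts=100)  # same second: seq breaks the tie
```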
2. Database Bottlenecks
A. PostgreSQL Write Contention
- Problem: Every keystroke triggers a write to PostgreSQL, leading to high write load and potential lock contention.
- Solution:
- Batch writes (e.g., coalesce changes for 100ms before writing to DB).
- Use optimistic locking (e.g., `UPDATE ... WHERE version = X`) to avoid lost updates.
- Trade-off:
- Batching increases latency for real-time sync.
- Optimistic locking requires retry logic on conflicts.
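The optimistic-lock pattern can be sketched against an in-memory stand-in (Python, illustrative; in real PostgreSQL this is `UPDATE docs SET body = %s, version = version + 1 WHERE id = %s AND version = %s`, with the affected-row count telling you whether you lost the race):

```python
class Doc:
    def __init__(self, body):
        self.body, self.version = body, 0

def update_with_retry(doc, transform, max_retries=3):
    """Read, transform, and write back only if the version is unchanged."""
    for _ in range(max_retries):
        read_version, body = doc.version, doc.body
        new_body = transform(body)
        # Simulates `UPDATE ... WHERE version = read_version`:
        # the write succeeds only if nobody else bumped the version.
        if doc.version == read_version:
            doc.body, doc.version = new_body, read_version + 1
            return True
    return False

doc = Doc("draft")
ok = update_with_retry(doc, lambda b: b + "!")
```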
B. Full HTML Snapshots Every 30 Seconds
- Problem: Storing full HTML snapshots is inefficient (large storage, slow writes) and doesn’t scale for large documents.
- Solution:
- Store deltas (changes) instead of full snapshots (e.g., using a diff algorithm like `google-diff-match-patch`).
- Use PostgreSQL’s JSONB or a dedicated document store (e.g., MongoDB) for structured deltas.
- Trade-off:
- Deltas require more complex conflict resolution.
- Reconstructing documents from deltas may be slower.
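Delta storage and replay can be sketched with the standard library's `difflib` (Python; `google-diff-match-patch` would produce more compact character-level patches, but the shape is the same: store opcodes, not the full document):

```python
import difflib

def make_delta(old, new):
    """Keep only the non-equal opcodes needed to rebuild `new` from `old`."""
    sm = difflib.SequenceMatcher(a=old, b=new)
    return [(i1, i2, new[j1:j2])
            for tag, i1, i2, j1, j2 in sm.get_opcodes()
            if tag != "equal"]

def apply_delta(old, delta):
    out, cursor = [], 0
    for i1, i2, repl in delta:
        out.append(old[cursor:i1])  # unchanged text since the last edit
        out.append(repl)            # inserted / replacement text
        cursor = i2                 # skip the deleted / replaced span
    out.append(old[cursor:])
    return "".join(out)

v1 = "<p>Hello world</p>"
v2 = "<p>Hello, world!</p>"
delta = make_delta(v1, v2)
```

Reconstructing version N means replaying N deltas from the last snapshot, which is the slowness trade-off noted above; periodic snapshots bound the replay length.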
C. Read Replicas Lag
- Problem: Read replicas may lag behind the primary, causing stale data to be served to clients.
- Solution:
- Use synchronous replication for critical reads (e.g., `synchronous_commit = remote_apply` in PostgreSQL).
- Add an application-level cache (e.g., Redis) for frequently accessed documents.
- Trade-off:
- Synchronous replication reduces write performance.
- Caching adds complexity and staleness risk.
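A minimal `postgresql.conf` sketch for the synchronous option above (both parameter names are real PostgreSQL settings; the standby name is a placeholder):

```
# Acknowledge commits only after the standby has *applied* them,
# so reads routed to the replica are never stale (costs write latency).
synchronous_commit = remote_apply
synchronous_standby_names = 'replica1'   # placeholder standby name
```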
3. Authentication and Security
A. JWT in localStorage
- Problem: JWTs in `localStorage` are vulnerable to XSS attacks. If an attacker injects JavaScript, they can steal the token.
- Solution:
- Store JWTs in HTTP-only, Secure, SameSite cookies instead of `localStorage`.
- Use short-lived JWTs (e.g., 15-minute expiry) with refresh tokens stored in HTTP-only cookies.
- Trade-off:
- Cookies are vulnerable to CSRF (mitigated with `SameSite` and CSRF tokens).
- Refresh tokens add complexity to the auth flow.
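The cookie attributes above can be sketched with Python's standard `http.cookies` module (illustrative; a web framework would normally set these for you, and the cookie name and 15-minute `max-age` are assumptions):

```python
from http.cookies import SimpleCookie

def auth_cookie(token, max_age=900):
    """Build a Set-Cookie header value for a short-lived access token."""
    c = SimpleCookie()
    c["access_token"] = token
    c["access_token"]["httponly"] = True      # invisible to JavaScript (XSS)
    c["access_token"]["secure"] = True        # sent over HTTPS only
    c["access_token"]["samesite"] = "Strict"  # CSRF mitigation
    c["access_token"]["max-age"] = max_age    # 15-minute expiry
    c["access_token"]["path"] = "/"
    return c["access_token"].OutputString()

header = auth_cookie("example-token-value")
```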
B. No Token Revocation
- Problem: JWTs are valid until expiry (24 hours), so compromised tokens cannot be revoked.
- Solution:
- Implement a token denylist (e.g., in Redis) for revoked tokens.
- Use short-lived JWTs (e.g., 15 minutes) with refresh tokens.
- Trade-off:
- Denylist adds latency to token validation.
- Refresh tokens require additional storage and logic.
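The denylist can be sketched as an in-memory TTL map (Python, illustrative; with Redis this would be a `SETEX` keyed by the token's `jti` claim, with the TTL set to the token's remaining lifetime so revocations expire when the token would anyway):

```python
import time

class Denylist:
    """Revoked-token store; entries expire with the token itself."""
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._expiry = {}  # jti -> absolute expiry time

    def revoke(self, jti, ttl):
        self._expiry[jti] = self._clock() + ttl

    def is_revoked(self, jti):
        exp = self._expiry.get(jti)
        if exp is None:
            return False
        if self._clock() >= exp:
            del self._expiry[jti]   # lazy eviction, like Redis EXPIRE
            return False
        return True

dl = Denylist()
dl.revoke("token-123", ttl=900)
```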
4. Scaling Bottlenecks
A. WebSocket Connection Limits
- Problem: Each API server maintains WebSocket connections, which consume memory and file descriptors. A single server may hit OS limits (e.g., `ulimit -n`).
- Solution:
- Raise OS limits (`ulimit -n`, kernel socket settings) and scale WebSocket servers horizontally behind the load balancer.
- Offload WebSocket connections to a dedicated service (e.g., Pusher, Ably, or a custom WebSocket cluster).
- Trade-off:
- Dedicated services add cost and vendor lock-in.
- Custom clusters require operational overhead.
B. PostgreSQL Single Point of Failure
- Problem: If the primary PostgreSQL instance fails, writes are blocked until failover completes.
- Solution:
- Use PostgreSQL streaming replication with automatic failover (e.g., Patroni + etcd).
- Deploy in a multi-AZ setup (e.g., AWS RDS Multi-AZ).
- Trade-off:
- Multi-AZ increases cost and complexity.
- Failover may take seconds to minutes.
C. Redis as a Single Point of Failure
- Problem: Redis is used for session cache and Pub/Sub. If Redis fails, cross-server sync breaks.
- Solution:
- Use Redis Cluster for high availability.
- Fall back to PostgreSQL polling if Redis is unavailable (degraded mode).
- Trade-off:
- Redis Cluster adds complexity.
- Fallback to polling increases latency.
D. CDN Caching API Responses
- Problem: CDN caches API responses for 5 minutes, which can serve stale data (e.g., outdated document versions).
- Solution:
- Disable CDN caching for API responses (only cache static assets).
- Send cache-control headers (e.g., `Cache-Control: no-store` on dynamic endpoints, or `no-cache` if revalidation is acceptable).
- Trade-off:
- Disabling caching reduces CDN benefits for API traffic.
5. Race Conditions
A. Concurrent Edits on the Same Paragraph
- Problem: Two users on different servers edit the same paragraph simultaneously. The last write (by timestamp) wins, but the "losing" edit is silently discarded.
- Solution:
- Use operational transformation (OT) or CRDTs to merge concurrent edits.
- Implement conflict resolution at the paragraph level (e.g., merge changes if they don’t overlap).
- Trade-off:
- OT/CRDTs are complex to implement.
- Paragraph-level merging may not handle all cases (e.g., overlapping deletions).
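Paragraph-level merging can be sketched as an overlap check (Python, illustrative; real editors work on richer operations, but the principle is the same: apply both edits when their ranges don't touch, otherwise escalate):

```python
def try_merge(text, edit_a, edit_b):
    """Apply two concurrent edits if their character ranges don't overlap.

    Each edit is (start, end, replacement) against the same base text.
    Returns the merged text, or None when the ranges overlap and a
    real conflict-resolution step (OT/CRDT) is needed.
    """
    first, second = sorted([edit_a, edit_b])
    if first[1] > second[0]:          # ranges overlap -> conflict
        return None
    # Apply the later edit first so the earlier edit's offsets stay valid.
    s, e, repl = second
    text = text[:s] + repl + text[e:]
    s, e, repl = first
    return text[:s] + repl + text[e:]

base = "The quick brown fox"
merged = try_merge(base, (0, 3, "A"), (16, 19, "cat"))
```

This is exactly where the "overlapping deletions" caveat above bites: any intersecting ranges return `None` and need a stronger strategy.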
B. Lost Updates During Server Failover
- Problem: If a server crashes after receiving a change but before writing to PostgreSQL, the change is lost.
- Solution:
- Acknowledge changes to the client only after the PostgreSQL write commits (not merely after the WebSocket send).
- Keep `synchronous_commit = on` so a commit is durable in the WAL before the server acknowledges it (WAL is always enabled in PostgreSQL; the setting controls when commits are acknowledged).
- Trade-off:
- Acknowledging after the DB write increases perceived latency.
- Synchronous commits cap write throughput.
6. Other Issues
A. No Offline Support
- Problem: If a user’s internet disconnects, they cannot edit the document until reconnecting.
- Solution:
- Implement client-side offline editing with a local copy of the document.
- Sync changes when reconnecting (using a conflict-free merge strategy).
- Trade-off:
- Offline support adds complexity to the client and sync logic.
B. No Document Versioning
- Problem: If a user accidentally deletes content, there’s no way to recover it (only full snapshots every 30 seconds).
- Solution:
- Store every change as a delta in PostgreSQL with timestamps.
- Implement document versioning (e.g., store a new version on every save).
- Trade-off:
- Versioning increases storage costs.
- Reconstructing old versions may be slow.
C. No Rate Limiting
- Problem: A malicious user could spam the server with changes, causing high load.
- Solution:
- Implement rate limiting (e.g., 100 changes/minute per user).
- Use Redis to track rate limits (e.g., `INCR` + `EXPIRE`).
- Trade-off:
- Rate limiting may block legitimate users during bursts.
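The `INCR` + `EXPIRE` pattern can be sketched in-memory (Python, illustrative; in Redis this is `INCR user:<id>:<window>` followed by `EXPIRE ... 60` on the first hit, and the limit/window values here are the example numbers from above):

```python
import time

class RateLimiter:
    """Fixed-window counter mirroring the Redis INCR + EXPIRE pattern."""
    def __init__(self, limit=100, window=60, clock=time.time):
        self.limit, self.window, self.clock = limit, window, clock
        self.counts = {}  # (user, window_number) -> count

    def allow(self, user):
        window_no = int(self.clock() // self.window)
        key = (user, window_no)
        self.counts[key] = self.counts.get(key, 0) + 1  # INCR
        return self.counts[key] <= self.limit

rl = RateLimiter(limit=3, window=60, clock=lambda: 0)
results = [rl.allow("alice") for _ in range(4)]
```

Fixed windows are simple but allow a 2x burst at the window boundary; a sliding-window counter trades a little more bookkeeping for smoother enforcement.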
Summary of Key Solutions
| Issue | Solution | Trade-off |
|---|---|---|
| WebSocket disconnections | Heartbeat + exponential backoff | Network overhead |
| Cross-server sync latency | Redis Pub/Sub or PostgreSQL CDC | Complexity, Redis persistence |
| Clock skew in LWW | Server-side timestamps or OT/CRDTs | Latency or implementation complexity |
| PostgreSQL write contention | Batch writes + optimistic locking | Latency or retry logic |
| Full HTML snapshots | Store deltas instead | Complex conflict resolution |
| JWT in localStorage | HTTP-only cookies + refresh tokens | CSRF risk |
| WebSocket connection limits | Dedicated WebSocket service | Cost, vendor lock-in |
| Concurrent edits | OT/CRDTs or paragraph-level merging | Complexity |
| PostgreSQL SPOF | Multi-AZ setup + automatic failover | Cost, complexity |
| No offline support | Client-side offline editing | Complex sync logic |
Recommendations for Improvement
- Replace LWW with OT/CRDTs for better conflict resolution.
- Use Redis Pub/Sub for cross-server sync (instead of PostgreSQL polling).
- Store deltas instead of full snapshots to reduce storage and improve performance.
- Move JWTs to HTTP-only cookies to mitigate XSS.
- Implement document versioning for recovery.
- Use a dedicated WebSocket service (e.g., Pusher) to offload connection management.
- Add rate limiting to prevent abuse.
This architecture is workable but has several scalability and reliability risks. Addressing these issues will make it more robust for production use.