Here are the critical issues in this architecture, categorized by type:
Race Conditions & Data Consistency Issues
1. Client Clock Synchronization in Last-Write-Wins
- Problem: Client-generated timestamps are unreliable (clock skew, manual adjustment). When two users edit the same paragraph, skewed clocks can order the edits incorrectly, silently overwriting valid changes. A client with a fast clock wins every conflict.
- Solution: Replace client timestamps with server-generated Hybrid Logical Clocks (HLC). Each server assigns a monotonic timestamp when receiving an operation. For conflict resolution, use CRDTs (Conflict-free Replicated Data Types) specifically designed for text (e.g., Yjs, Automerge) that provide strong eventual consistency without relying on timestamps.
- Trade-offs:
- HLCs require server-side timestamping rather than trusting client clocks, but they preserve causality with minimal overhead.
- CRDTs eliminate coordination but increase document size (20-40% overhead) and add significant implementation complexity; migrating away from them later is also difficult.
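To make the HLC half of this concrete, here is a minimal sketch of a server-side hybrid logical clock in TypeScript. The timestamp shape and the omission of a node-ID tiebreaker are simplifications for illustration, not a full HLC implementation.

```typescript
// A timestamp is a (wall-clock millis, logical counter) pair; comparison is
// lexicographic, so ordering stays causal even when client clocks are skewed.
type Hlc = { wall: number; counter: number };

class HlcClock {
  private last: Hlc = { wall: 0, counter: 0 };

  // Stamp an operation received from a local client.
  now(): Hlc {
    const wall = Date.now();
    if (wall > this.last.wall) {
      this.last = { wall, counter: 0 };
    } else {
      // Wall clock did not advance: bump the logical counter instead.
      this.last = { wall: this.last.wall, counter: this.last.counter + 1 };
    }
    return { ...this.last };
  }

  // Merge a timestamp received from another server, preserving causality.
  update(remote: Hlc): Hlc {
    const wall = Math.max(Date.now(), this.last.wall, remote.wall);
    let counter: number;
    if (wall === this.last.wall && wall === remote.wall) {
      counter = Math.max(this.last.counter, remote.counter) + 1;
    } else if (wall === this.last.wall) {
      counter = this.last.counter + 1;
    } else if (wall === remote.wall) {
      counter = remote.counter + 1;
    } else {
      counter = 0;
    }
    this.last = { wall, counter };
    return { ...this.last };
  }
}

// Used for conflict resolution in place of raw client timestamps.
const compareHlc = (a: Hlc, b: Hlc): number =>
  a.wall - b.wall || a.counter - b.counter;
```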
2. Race Between Broadcast and Persistence
- Problem: If a server crashes after broadcasting to local clients but before PostgreSQL commit, clients see changes that never persist. Conversely, if DB commits but broadcast fails, clients are out of sync.
- Solution: Implement the Transactional Outbox Pattern. Write changes to a PostgreSQL "outbox" table within the same transaction as document updates. A separate worker process tails this table and publishes to a message broker. Broadcast only happens after successful outbox processing.
- Trade-offs: Adds 50-100ms latency to broadcasts and requires additional worker infrastructure, but it guarantees at-least-once delivery (consumers deduplicate by operation ID for effectively-once processing) and prevents silent data loss.
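A minimal sketch of the outbox write path, assuming node-postgres (`pg`) and hypothetical `document_ops` and `outbox` tables; the relay worker that tails the outbox and publishes to the broker is only described in the closing comment.

```typescript
import { Pool } from "pg";

const pool = new Pool();

async function applyChange(documentId: string, op: object): Promise<void> {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    // 1. Persist the operation itself.
    await client.query(
      "INSERT INTO document_ops (document_id, op) VALUES ($1, $2)",
      [documentId, JSON.stringify(op)]
    );
    // 2. Record the broadcast intent in the same transaction.
    await client.query(
      "INSERT INTO outbox (topic, payload) VALUES ($1, $2)",
      [`doc:${documentId}`, JSON.stringify(op)]
    );
    await client.query("COMMIT");
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}

// A separate worker polls (or logically replicates) the outbox table, publishes each
// row to the message broker, and marks it processed only after a successful publish.
```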
3. Stale Data from Read Replica Lag
- Problem: Read replicas lag the primary under asynchronous replication, so on top of the 2-second polling delay they may serve stale document versions. Clients connecting to different servers see inconsistent states.
- Solution: Route all real-time document reads/writes through the PostgreSQL primary. Use replicas only for non-real-time queries (search, history, analytics). Implement read-your-writes consistency by caching recent writes in Redis with a 5-second TTL for session stickiness.
- Trade-offs: Increases primary DB load by ~30-40% but ensures consistency. Redis caching adds complexity but offloads hot documents.
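A sketch of the read-your-writes cache, assuming `ioredis`; `persistToPrimary` and `loadFromPrimary` are placeholders standing in for the real primary-database access layer.

```typescript
import Redis from "ioredis";

const redis = new Redis();

// Placeholders for the real primary-database access layer.
async function persistToPrimary(docId: string, version: number, body: string): Promise<void> {}
async function loadFromPrimary(docId: string): Promise<{ version: number; body: string } | null> {
  return null;
}

async function writeDocument(docId: string, version: number, body: string): Promise<void> {
  await persistToPrimary(docId, version, body);
  // Cache the freshly written version for 5 seconds so a follow-up read from any
  // server returns at least this version, even if a replica is lagging.
  await redis.set(`doc:recent:${docId}`, JSON.stringify({ version, body }), "EX", 5);
}

async function readDocument(docId: string): Promise<{ version: number; body: string } | null> {
  const recent = await redis.get(`doc:recent:${docId}`);
  if (recent) return JSON.parse(recent); // read-your-writes path
  return loadFromPrimary(docId);         // real-time reads otherwise stay on the primary
}
```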
Scaling Bottlenecks
4. PostgreSQL Polling Thundering Herd
- Problem: Every API server polling every 2 seconds creates O(n) database load. At 100 servers, this is 50 queries/second of overhead that doesn't scale with document activity.
- Solution: Eliminate polling. Use Redis Streams as a persistent message bus. Each server publishes document changes to a stream keyed by `document_id`. Servers use consumer groups to subscribe only to documents their clients are actively editing.
- Trade-offs: Redis Streams adds memory pressure (plan for 2GB per 10k active documents). Requires implementing consumer group logic but reduces DB load by 90%+ and enables true real-time sync (<10ms latency).
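A sketch of the per-document Redis Streams fan-out described above, assuming `ioredis`; the key naming, the one-consumer-group-per-server layout, and `broadcastToLocalClients` are illustrative assumptions.

```typescript
import Redis from "ioredis";

const pub = new Redis();
const sub = new Redis(); // separate connection: a blocking XREADGROUP ties up its connection

// Placeholder for the WebSocket fan-out to clients connected to this server.
function broadcastToLocalClients(_documentId: string, _fields: string[]): void {}

// Publish a change to the per-document stream.
async function publishOp(documentId: string, op: object): Promise<void> {
  await pub.xadd(`doc-stream:${documentId}`, "*", "op", JSON.stringify(op));
}

// Each server subscribes only to streams for documents its clients are editing.
async function consumeOps(documentId: string, serverId: string): Promise<void> {
  const stream = `doc-stream:${documentId}`;
  // Create the consumer group if it does not exist yet.
  await sub.xgroup("CREATE", stream, serverId, "$", "MKSTREAM").catch(() => undefined);
  for (;;) {
    const entries = (await sub.xreadgroup(
      "GROUP", serverId, `consumer-${process.pid}`,
      "COUNT", 100, "BLOCK", 5000,
      "STREAMS", stream, ">"
    )) as [string, [string, string[]][]][] | null;
    if (!entries) continue;
    for (const [, messages] of entries) {
      for (const [id, fields] of messages) {
        broadcastToLocalClients(documentId, fields);
        await sub.xack(stream, serverId, id); // unacked entries can be claimed by another server
      }
    }
  }
}
```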
5. Per-Change PostgreSQL Writes
- Problem: Writing every keystroke to PostgreSQL creates a write bottleneck. A 5-user editing session can generate 500+ writes/minute per document.
- Solution: Buffer changes in Redis Streams for 500ms or 50 operations, then batch write to PostgreSQL. Use asynchronous persistence with a dedicated writer service that compacts operations before storage.
- Trade-offs: Risk losing ~500ms of work on a crash. Mitigate by enabling Redis AOF persistence (`appendfsync everysec`) and replicating across three nodes. Reduces PostgreSQL write load by 95%.
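A sketch of the 500ms / 50-operation write-behind buffer inside the writer service; `flushToPostgres` is a placeholder for a real multi-row batch insert.

```typescript
type PendingOp = { documentId: string; op: object };

const MAX_OPS = 50;      // flush when this many operations are buffered...
const MAX_WAIT_MS = 500; // ...or when the oldest buffered operation is this old

const buffer: PendingOp[] = [];
let timer: NodeJS.Timeout | null = null;

// Placeholder for a single multi-row INSERT into PostgreSQL.
async function flushToPostgres(_batch: PendingOp[]): Promise<void> {}

function enqueue(op: PendingOp): void {
  buffer.push(op);
  if (buffer.length >= MAX_OPS) {
    void flush();
  } else if (!timer) {
    timer = setTimeout(() => void flush(), MAX_WAIT_MS);
  }
}

async function flush(): Promise<void> {
  if (timer) {
    clearTimeout(timer);
    timer = null;
  }
  if (buffer.length === 0) return;
  const batch = buffer.splice(0, buffer.length); // take everything currently buffered
  await flushToPostgres(batch);
}
```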
6. Full HTML Snapshot Storage
- Problem: Storing full HTML every 30 seconds for a 1MB document generates 2MB/minute of mostly redundant data. Storage grows linearly with document size and editing time, no matter how small each edit is.
- Solution: Store operational transforms or CRDT operations instead. Keep a snapshot every 100 operations or 5 minutes (whichever comes first). Use binary encoding (e.g., MessagePack) for operations.
- Trade-offs: New clients must replay operations (adds 100-500ms load time for large histories). Requires implementing operation compression and snapshotting logic, but reduces storage by 95% and enables proper undo/redo.
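A sketch of the snapshot-cadence check (every 100 operations or 5 minutes, whichever comes first); the metadata shape is an assumption.

```typescript
interface SnapshotMeta {
  lastSnapshotSeq: number; // sequence number of the last operation included in a snapshot
  lastSnapshotAt: number;  // epoch millis when that snapshot was written
}

const SNAPSHOT_EVERY_OPS = 100;
const SNAPSHOT_EVERY_MS = 5 * 60 * 1000;

// Returns true when the writer should cut a new snapshot after persisting an operation.
function shouldSnapshot(meta: SnapshotMeta, currentSeq: number, now: number = Date.now()): boolean {
  return (
    currentSeq - meta.lastSnapshotSeq >= SNAPSHOT_EVERY_OPS ||
    now - meta.lastSnapshotAt >= SNAPSHOT_EVERY_MS
  );
}
```

New clients then load the latest snapshot and replay only the operations recorded after `lastSnapshotSeq`, which is what keeps the replay cost in the 100-500ms range mentioned above.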
7. CDN API Response Caching
- Problem: 5-minute CDN caching of API responses serves stale document content, breaking collaborative editing. Users see different document versions.
- Solution: Set `Cache-Control: private, no-cache, max-age=0` for all document API endpoints. Use the CDN only for static assets (JS, CSS). For performance, implement edge caching with a 1-second TTL and surrogate-key purging on updates.
- Trade-offs: Increases origin server load by 50-100%. Requires implementing cache purge webhooks but ensures data freshness.
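A sketch of the split caching policy, assuming an Express app; the routes and asset directory are illustrative.

```typescript
import express from "express";

const app = express();

// Document API responses must never be cached by the CDN or the browser.
app.use("/api/documents", (_req, res, next) => {
  res.set("Cache-Control", "private, no-cache, max-age=0");
  next();
});

// Static assets can be cached aggressively, assuming content-hashed filenames.
app.use(
  "/assets",
  express.static("dist/assets", { immutable: true, maxAge: "1y" })
);

app.listen(3000);
```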
Failure Modes
8. WebSocket Server Crash
- Problem: When a server crashes, all its connections drop. Clients lose in-flight messages and must reconnect to a different server that has no knowledge of their session state.
- Solution: Store WebSocket session metadata (`client_id`, `document_id`, `last_acknowledged_op`) in Redis with a TTL. On reconnection, clients resume from `last_acknowledged_op`. Use Redis Streams consumer groups to allow other servers to take over disconnected clients' subscriptions.
- Trade-offs: Adds 5-10ms latency per message for Redis lookups. Requires client-side reconnection buffer and operation replay logic. Redis becomes a critical component requiring HA setup (Redis Sentinel).
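A sketch of the session metadata in Redis, assuming `ioredis`; the key layout and the one-hour TTL are assumptions.

```typescript
import Redis from "ioredis";

const redis = new Redis();
const SESSION_TTL_S = 60 * 60; // expire abandoned sessions after an hour

async function saveSession(clientId: string, documentId: string, lastAckedOp: number): Promise<void> {
  await redis.hset(`ws:session:${clientId}`, {
    document_id: documentId,
    last_acknowledged_op: String(lastAckedOp),
  });
  await redis.expire(`ws:session:${clientId}`, SESSION_TTL_S);
}

// On reconnect (possibly to a different server), resume from the last acknowledged op.
async function resumeSession(clientId: string): Promise<{ documentId: string; resumeFrom: number } | null> {
  const session = await redis.hgetall(`ws:session:${clientId}`);
  if (!session.document_id) return null; // expired or unknown: client must do a full reload
  return {
    documentId: session.document_id,
    resumeFrom: Number(session.last_acknowledged_op),
  };
}
```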
9. Message Broker Partition
- Problem: If Redis Streams becomes unavailable, servers cannot sync across instances.
- Solution: Implement graceful degradation: fall back to direct PostgreSQL polling at 2-second intervals with exponential backoff. Cache recent messages in server memory (last 1000 ops) to handle transient Redis failures.
- Trade-offs: User experience degrades to "eventual consistency" during outages. Requires circuit breaker logic but maintains availability.
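A sketch of the degradation switch: consecutive Redis failures trip a circuit breaker, and while it is open the server uses the PostgreSQL polling fallback. The thresholds are illustrative.

```typescript
class BrokerCircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(private threshold = 5, private cooldownMs = 30_000) {}

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = 0;
  }

  recordFailure(): void {
    this.failures += 1;
    if (this.failures >= this.threshold) this.openedAt = Date.now();
  }

  // While open, the server falls back to 2-second PostgreSQL polling with backoff;
  // after the cooldown it retries Redis (a simplified half-open step).
  isOpen(): boolean {
    return this.openedAt > 0 && Date.now() - this.openedAt < this.cooldownMs;
  }
}

// Usage: wrap every Redis Streams read/publish, call recordSuccess()/recordFailure(),
// and check isOpen() to choose between the Redis path and the polling fallback.
```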
10. Database Connection Exhaustion
- Problem: Each WebSocket server maintains persistent PostgreSQL connections. With many servers each handling ~10k clients, PostgreSQL's connection limit is quickly exhausted.
- Solution: Use PgBouncer in transaction pooling mode between servers and PostgreSQL. Limit each Node.js server to 20 DB connections maximum.
- Trade-offs: Adds 1-2ms latency per query. Prepared statements need special handling in transaction pooling mode. Reduces connection overhead by 99%.
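A sketch of the application-side cap, assuming node-postgres pointed at PgBouncer; the host, port, and table/column names are illustrative.

```typescript
import { Pool } from "pg";

// Connections go to PgBouncer (transaction pooling mode), not directly to PostgreSQL.
const pool = new Pool({
  host: "pgbouncer.internal", // assumption: PgBouncer runs in front of the primary
  port: 6432,                 // PgBouncer's conventional port
  max: 20,                    // cap each Node.js server at 20 connections
  idleTimeoutMillis: 10_000,
});

// In transaction pooling mode, avoid session state (SET, advisory locks, named
// prepared statements) and stick to simple parameterized queries.
export async function getDocument(documentId: string) {
  const { rows } = await pool.query(
    "SELECT id, content FROM documents WHERE id = $1",
    [documentId]
  );
  return rows[0] ?? null;
}
```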
Security & Operational Issues
11. JWT in localStorage (XSS Risk)
- Problem: XSS attacks can steal 24-hour tokens, giving attackers persistent access.
- Solution: Store JWT in httpOnly, SameSite=strict, secure cookies. Implement refresh token rotation with a 15-minute access token TTL. Maintain a revocation list in Redis for logout.
- Trade-offs: Requires CSRF protection (double-submit cookie pattern). Increases auth server load by 20% but significantly reduces XSS impact radius.
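A sketch of issuing the short-lived access token as an httpOnly cookie, assuming Express and `jsonwebtoken`; the credential check is a placeholder and the refresh-rotation endpoint is omitted.

```typescript
import express from "express";
import jwt from "jsonwebtoken";

const app = express();
app.use(express.json());

const ACCESS_TOKEN_TTL_S = 15 * 60; // 15-minute access tokens

// Placeholder for the real credential check; returns a user id on success.
function authenticate(_req: express.Request): string {
  return "user-123";
}

app.post("/auth/login", (req, res) => {
  const userId = authenticate(req);
  const accessToken = jwt.sign({ sub: userId }, process.env.JWT_SECRET as string, {
    expiresIn: ACCESS_TOKEN_TTL_S,
  });
  // httpOnly keeps the token out of reach of injected scripts; SameSite=strict plus a
  // CSRF token covers the cross-site request risk that cookie auth introduces.
  res.cookie("access_token", accessToken, {
    httpOnly: true,
    secure: true,
    sameSite: "strict",
    maxAge: ACCESS_TOKEN_TTL_S * 1000,
  });
  res.sendStatus(204);
});
```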
12. No Rate Limiting on WebSocket Messages
- Problem: Malicious clients can flood the system with change events, causing DoS.
- Solution: Implement per-client token bucket rate limiting in Redis (e.g., 100 ops/sec burst, 50 ops/sec sustained). Close connections exceeding limits.
- Trade-offs: May throttle legitimate users in rare cases. Requires careful tuning and client-side debouncing (200ms) to stay under limits.
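A sketch of the per-client token bucket as an atomic Lua script in Redis, assuming `ioredis`; the 100-token capacity and 50 tokens/second refill mirror the numbers above, and the key naming is an assumption.

```typescript
import Redis from "ioredis";

const redis = new Redis();

// Token bucket in Lua so the read-modify-write is atomic inside Redis.
const TOKEN_BUCKET_LUA = `
  local key, capacity, refill, now = KEYS[1], tonumber(ARGV[1]), tonumber(ARGV[2]), tonumber(ARGV[3])
  local state = redis.call('HMGET', key, 'tokens', 'ts')
  local tokens = tonumber(state[1]) or capacity
  local ts = tonumber(state[2]) or now
  tokens = math.min(capacity, tokens + (now - ts) * refill / 1000)
  if tokens < 1 then return 0 end
  redis.call('HMSET', key, 'tokens', tokens - 1, 'ts', now)
  redis.call('PEXPIRE', key, 10000)
  return 1
`;

// Returns false when the client has exhausted its bucket; callers close the
// WebSocket when this starts failing repeatedly.
async function allowMessage(clientId: string): Promise<boolean> {
  const allowed = await redis.eval(
    TOKEN_BUCKET_LUA, 1, `ratelimit:${clientId}`, 100, 50, Date.now()
  );
  return Number(allowed) === 1;
}
```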
13. Load Balancer WebSocket Stickiness
- Problem: With plain round-robin, a client's WebSocket upgrade can land on a different server than its earlier requests, and reconnects land on servers with no session context, forcing unnecessary reconnections and re-syncs.
- Solution: Use least-connections algorithm with IP hash fallback for the initial HTTP upgrade. Don't enforce stickiness post-connection—rely on Redis session state instead.
- Trade-offs: IP hash can create hot spots behind corporate NATs. Use consistent hashing on a `client_id` query param for better distribution.
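A sketch of hashing on `client_id` rather than source IP, using rendezvous (highest-random-weight) hashing so that only about 1/N of clients move when the server set changes; this logic would live in the routing layer, and the hash choice is illustrative.

```typescript
import { createHash } from "crypto";

// Pick a server for a client deterministically: highest hash score wins.
function pickServer(clientId: string, servers: string[]): string {
  let best = servers[0];
  let bestScore = -1;
  for (const server of servers) {
    const digest = createHash("sha1").update(`${clientId}:${server}`).digest();
    const score = digest.readUInt32BE(0); // first 4 bytes as an unsigned score
    if (score > bestScore) {
      bestScore = score;
      best = server;
    }
  }
  return best;
}

// Example: pickServer("client-123", ["ws-1", "ws-2", "ws-3"]) always returns the same
// server until the server set changes, avoiding the NAT hot spots of plain IP hashing.
```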
Recommended Architecture Changes Summary
| Component | Current | Recommended | Impact |
|---|---|---|---|
| Sync Strategy | Client timestamps + LWW | CRDTs (Yjs) + HLC | Fixes data loss, enables offline editing |
| Cross-server comms | PostgreSQL polling (2s) | Redis Streams | Real-time sync, 95% DB load reduction |
| Storage | Full HTML snapshots | Operations log + snapshots | 95% storage savings, enables undo/redo |
| Auth | JWT in localStorage | httpOnly cookies + rotation | Mitigates XSS, enables revocation |
| CDN | API cached 5min | API no-cache, assets cached | Fixes stale data, increases origin load |
| Persistence | Per-change writes | Batch async writes (500ms) | 95% write load reduction |
| Session State | Server memory | Redis with TTL | Enables failover, adds 5ms latency |
The architecture requires significant changes to be production-ready, particularly replacing the synchronization strategy and the message bus. The recommended trade-offs consistently accept extra complexity and latency in exchange for protection against data loss and inconsistency, which is the right priority for a collaborative editor.