1. No Sticky Sessions for WebSockets (Scaling Bottleneck & Failure Mode)
- Issue: With a round-robin load balancer, a client's WebSocket upgrade (and every reconnect) can land on any API server. WebSockets are persistent, stateful connections, so bouncing between servers causes dropped connections, reconnect loops, and failed real-time updates, and clients connected to a different server than the editor only see changes after the cross-server sync delay (up to 2s, more during reconnections).
- Solution: Configure the load balancer (e.g., AWS ALB/ELB) for sticky sessions using a session cookie or connection ID, routing WebSocket upgrades to the same backend server (sketch below).
- Trade-offs:
| Pro | Con |
|---|---|
| Ensures low-latency broadcasts within server groups | Uneven load distribution (hot servers with popular docs get overloaded) |
| Simple to implement | Single server failure affects all its clients (mitigate with health checks/auto-scaling) |
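
A minimal sketch of the sticky-session setup, assuming the ALB is managed with the AWS CDK in TypeScript; the stack and construct names, port 3000, and the `/healthz` path are illustrative assumptions, not part of the original design.

```ts
// Sketch only: duration-based ALB stickiness for the WebSocket target group.
// Construct names, port, and health-check path are assumptions.
import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as elbv2 from 'aws-cdk-lib/aws-elasticloadbalancingv2';

class WsStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string) {
    super(scope, id);

    const vpc = new ec2.Vpc(this, 'Vpc');
    const alb = new elbv2.ApplicationLoadBalancer(this, 'Alb', {
      vpc,
      internetFacing: true,
    });

    // The ALB issues a stickiness cookie so a client's WebSocket upgrade
    // (and its reconnects) keep routing to the same backend instance.
    const wsTargets = new elbv2.ApplicationTargetGroup(this, 'WsTargets', {
      vpc,
      port: 3000,
      protocol: elbv2.ApplicationProtocol.HTTP,
      stickinessCookieDuration: cdk.Duration.hours(12),
      healthCheck: { path: '/healthz' }, // pair with auto-scaling for failover
    });

    alb.addListener('Http', { port: 80 }).addTargetGroups('Ws', {
      targetGroups: [wsTargets],
    });
  }
}

new WsStack(new cdk.App(), 'WsStack');
```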
2. Client-Side Timestamps for Conflict Resolution (Race Condition)
- Issue: Last-write-wins relies on client clocks, which can skew (e.g., unsynced devices, NTP drift). A client with an advanced clock always wins conflicts, leading to lost edits and inconsistent document states across users.
- Solution: Switch to server-assigned timestamps (e.g., PostgreSQL's `now()` or monotonic server clocks) on write, rejecting or queuing client changes with older timestamps (sketch below).
- Trade-offs:
| Pro | Con |
|---|---|
| Reliable, consistent ordering | Increases round-trip latency (client waits for server ACK before UI update) |
| Easy DB enforcement via unique constraints | Doesn't handle true simultaneous edits (pair with OT/CRDTs for better resolution) |
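
A minimal sketch of the server-timestamp approach with node-postgres, assuming a `documents(id, content, version_ts)` table; the compare-and-set shape shown here is one way to reject stale writes, not the only one.

```ts
// Sketch: server-assigned timestamps with a compare-and-set write.
// Table and column names are assumptions for illustration.
import { Pool } from 'pg';

const pool = new Pool();

// Apply a change only if the client's base version is still current; the
// server clock (now()) decides ordering, not the client's clock.
async function applyChange(docId: string, clientBaseTs: string, html: string) {
  const res = await pool.query(
    `UPDATE documents
        SET content = $1, version_ts = now()
      WHERE id = $2 AND version_ts = $3
      RETURNING version_ts`,
    [html, docId, clientBaseTs],
  );
  if (res.rowCount === 0) {
    // Stale base version: reject (or queue for merge) instead of silently
    // overwriting newer server-side state.
    throw new Error('conflict: document changed since client base version');
  }
  return res.rows[0].version_ts; // new server-assigned timestamp for the client
}
```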
3. Polling PostgreSQL for Cross-Server Sync (Scaling Bottleneck & Consistency Delay)
- Issue: Each server polls PG every 2s, creating O(N_servers * docs) query load. Scales poorly (e.g., 100 servers = 50 queries/sec per doc). Delays real-time feel (up to 2s+ lag for clients on different servers).
- Solution: Use PostgreSQL `LISTEN`/`NOTIFY` for pub/sub: on write, the server sends `NOTIFY` on a per-document/org channel; other servers subscribe and broadcast changes to their WebSocket clients (sketch below).
- Trade-offs:
| Pro | Con |
|---|---|
| Near-real-time (<100ms), low overhead | Each server needs a dedicated persistent PG connection for `LISTEN` (keep it to one per server for all subscriptions to avoid pool exhaustion) |
| No external deps | PG notify doesn't scale to millions of channels (shard channels by org ID) |
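
A sketch of the `LISTEN`/`NOTIFY` flow with node-postgres; the per-org channel names and JSON payload shape are assumptions.

```ts
// Sketch: PostgreSQL LISTEN/NOTIFY fan-out between servers.
// Assumes orgId is a simple numeric/slug value usable in a channel name.
import { Client, Pool } from 'pg';

const pool = new Pool();

// One dedicated, long-lived connection per server for all subscriptions,
// so the regular query pool isn't exhausted by LISTEN sessions.
async function subscribe(orgId: string, onChange: (payload: unknown) => void) {
  const listener = new Client();
  await listener.connect();
  await listener.query(`LISTEN doc_changes_${orgId}`);
  listener.on('notification', (msg) => {
    if (msg.payload) onChange(JSON.parse(msg.payload));
  });
  return listener;
}

// On write, notify peers. NOTIFY payloads are small (8000 bytes by default),
// so send identifiers and let peers fetch details if they need more.
async function publish(orgId: string, docId: string, versionTs: string) {
  await pool.query('SELECT pg_notify($1, $2)', [
    `doc_changes_${orgId}`,
    JSON.stringify({ docId, versionTs }),
  ]);
}
```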
4. Last-Write-Wins Conflict Resolution (Race Condition & Data Loss)
- Issue: Simultaneous edits to the same content (e.g., two users typing in the same paragraph) overwrite each other based on timestamps, silently losing one user's changes. No awareness of concurrent edits.
- Solution: Implement Operational Transformation (OT) or Conflict-Free Replicated Data Types (CRDTs), storing ops/deltas instead of full HTML. Libraries like ShareDB (OT) or Yjs (CRDT) integrate with WebSockets/Postgres (sketch below).
- Trade-offs:
| Pro | Con |
|---|---|
| Preserves intent, no data loss | High complexity/debugging (OT requires server-side transformation) |
| Bandwidth-efficient diffs | CRDTs: higher storage (tombstones); OT: causal ordering latency |
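
A small Yjs (CRDT) sketch showing why this removes the lost-update problem: two replicas edit the same text concurrently, exchange updates, and converge with both edits intact.

```ts
// Sketch: representing the document as a Yjs CRDT instead of raw HTML.
// Updates are compact binary deltas that can be relayed over the existing WebSockets.
import * as Y from 'yjs';

const serverDoc = new Y.Doc();
const clientDoc = new Y.Doc();

// Seed both replicas with the same starting state.
serverDoc.getText('content').insert(0, 'Hello world');
Y.applyUpdate(clientDoc, Y.encodeStateAsUpdate(serverDoc));

// Two concurrent edits, one on each replica.
serverDoc.getText('content').insert(5, ', dear'); // replica A
clientDoc.getText('content').insert(11, '!');     // replica B, at the same time

// Exchange only the missing updates in both directions; both replicas converge.
Y.applyUpdate(clientDoc, Y.encodeStateAsUpdate(serverDoc, Y.encodeStateVector(clientDoc)));
Y.applyUpdate(serverDoc, Y.encodeStateAsUpdate(clientDoc, Y.encodeStateVector(serverDoc)));

console.log(serverDoc.getText('content').toString()); // both edits preserved
console.log(clientDoc.getText('content').toString()); // identical on both replicas
```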
5. Full HTML Snapshots Every 30s (Storage & Write Bottleneck)
- Issue: Frequent full-document writes bloat PostgreSQL: if 1M docs are concurrently active, a 10KB snapshot every 30s is roughly 33K writes/sec and ~330MB/s of write throughput. No delta storage leads to redundant data and slow restores.
- Solution: Store sequential ops/deltas in PG (with periodic snapshots every 5-10min) and reconstruct on load using the OT/CRDT library. Use Redis as a short-term op cache (sketch below).
- Trade-offs:
| Pro | Con |
|---|---|
| Reduces writes 90%+, linear storage growth | Load time increases for long sessions (mitigate with CDN-cached snapshots) |
| Enables rewind/undo | Computation overhead on reconstruct (offload to workers) |
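
A sketch of the op-log-plus-snapshot storage layer with node-postgres; the table names (`doc_ops`, `doc_snapshots`) and the single-row snapshot upsert are assumptions.

```ts
// Sketch: append-only op log with periodic snapshots, instead of writing the
// full HTML every 30s. Replay of ops is delegated to the OT/CRDT library.
import { Pool } from 'pg';

const pool = new Pool();

// Append one small delta per change (bytes, not the whole document).
async function appendOp(docId: string, seq: number, op: Uint8Array) {
  await pool.query(
    'INSERT INTO doc_ops (doc_id, seq, op) VALUES ($1, $2, $3)',
    [docId, seq, Buffer.from(op)],
  );
}

// Every few minutes (or every N ops), persist a snapshot so loads only need
// to replay the ops recorded after it.
async function saveSnapshot(docId: string, seq: number, state: Uint8Array) {
  await pool.query(
    `INSERT INTO doc_snapshots (doc_id, seq, state) VALUES ($1, $2, $3)
     ON CONFLICT (doc_id) DO UPDATE SET seq = $2, state = $3`,
    [docId, seq, Buffer.from(state)],
  );
}

// Load = latest snapshot + ops after it.
async function load(docId: string) {
  const snap = await pool.query(
    'SELECT seq, state FROM doc_snapshots WHERE doc_id = $1', [docId]);
  const fromSeq = snap.rowCount ? snap.rows[0].seq : 0;
  const ops = await pool.query(
    'SELECT op FROM doc_ops WHERE doc_id = $1 AND seq > $2 ORDER BY seq',
    [docId, fromSeq]);
  return { snapshot: snap.rows[0]?.state ?? null, ops: ops.rows.map((r) => r.op) };
}
```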
6. JWT in localStorage (Security Failure Mode)
- Issue: Vulnerable to XSS attacks; malicious scripts steal tokens. 24h expiry allows prolonged access if compromised.
- Solution: Store the JWT in HttpOnly, Secure, SameSite=Strict cookies. Refresh tokens via secure endpoints (sketch below).
- Trade-offs:
| Pro | Con |
|---|---|
| Token unreadable by injected scripts (blocks XSS token theft) | CSRF risk (mitigate with CSRF tokens or double-submit cookies) |
| Works seamlessly with SPA | Slightly higher backend load for refreshes |
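
A sketch of issuing the JWT as an HttpOnly cookie with Express, `cookie-parser`, and `jsonwebtoken`; the route, cookie name, and 15-minute expiry are illustrative choices, and authentication itself is elided.

```ts
// Sketch: set the JWT in an HttpOnly cookie instead of returning it for localStorage.
import express from 'express';
import cookieParser from 'cookie-parser';
import jwt from 'jsonwebtoken';

const app = express();
app.use(express.json());
app.use(cookieParser());

app.post('/login', (req, res) => {
  // ...authenticate the user here...
  const token = jwt.sign({ sub: req.body.userId }, process.env.JWT_SECRET!, {
    expiresIn: '15m', // short-lived access token; refresh via a separate endpoint
  });
  res.cookie('access_token', token, {
    httpOnly: true,     // not readable by injected scripts
    secure: true,       // sent over HTTPS only
    sameSite: 'strict', // blocks cross-site sends; add CSRF tokens if this is relaxed
    maxAge: 15 * 60 * 1000,
  });
  res.sendStatus(204);
});
```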
7. CDN Caching API Responses for 5 Minutes (Staleness Failure Mode)
- Issue: Cached reads return stale document versions, conflicting with real-time WebSocket updates. Invalidation isn't mentioned.
- Solution: Exclude mutating/real-time APIs from CDN caching (cache only static assets). For reads, use cache-busting query params (e.g., `?v=timestamp`) or a short TTL (10s) with PG invalidation triggers pushing to the CDN (sketch below).
- Trade-offs:
| Pro | Con |
|---|---|
| Consistent real-time data | Higher backend read load (use PG read replicas) |
| Simple config change | Misses CDN perf for infrequent reads |
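
A sketch of the cache-control split in Express: document APIs opt out of CDN caching while fingerprinted static assets stay cacheable for a long time. The path prefixes are assumptions.

```ts
// Sketch: keep real-time document APIs out of the CDN cache via Cache-Control,
// while still letting the CDN serve static assets.
import express from 'express';

const app = express();

// Real-time/mutable document reads: never cache at the CDN or browser.
app.use('/api/docs', (_req, res, next) => {
  res.set('Cache-Control', 'no-store');
  next();
});

// Static assets are fingerprinted at build time, so they can be cached long-term.
app.use('/assets', express.static('dist/assets', {
  maxAge: '365d',
  immutable: true,
}));
```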
8. No Cross-Server Pub/Sub for High-Scale Broadcasts (Scaling Bottleneck)
- Issue: PG polling/LISTEN works for dozens of servers but bottlenecks at 100+ (connection limits, notify fan-out). Popular docs flood all servers' clients with keystrokes.
- Solution: Introduce Redis Pub/Sub or Kafka: servers publish changes to doc-specific topics; subscribing servers fan out to their WebSocket clients. Add client-side diff throttling (e.g., debounce 100ms, cursor-based patches) (sketch below).
- Trade-offs:
| Pro | Con |
|---|---|
| Horizontal scale to 1000s of servers; decouples servers | Added latency (10-50ms); new infrastructure to run and keep reliable |
| Handles hot docs via partitioning | Eventual consistency window (use at-least-once delivery) |
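
A sketch of the Redis Pub/Sub fan-out with ioredis; the `doc:<id>` channel naming and the single shared message handler are assumptions.

```ts
// Sketch: cross-server fan-out over Redis Pub/Sub.
import Redis from 'ioredis';

const pub = new Redis();
const sub = new Redis(); // a connection in subscriber mode can't run other commands

// One message handler for all channels; per-doc callbacks are looked up in a map.
const handlers = new Map<string, (change: string) => void>();
sub.on('message', (channel, message) => handlers.get(channel)?.(message));

// Each server subscribes only to docs it currently has local WebSocket clients for.
async function subscribeDoc(docId: string, broadcast: (change: string) => void) {
  handlers.set(`doc:${docId}`, broadcast);
  await sub.subscribe(`doc:${docId}`);
}

// On a local edit, publish so every other server can relay to its own clients.
async function publishChange(docId: string, change: object) {
  await pub.publish(`doc:${docId}`, JSON.stringify(change));
}
```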
9. PostgreSQL Write Contention on Primary (Scaling Bottleneck)
- Issue: All changes funnel to single PG primary, even with read replicas and org partitioning. Hot orgs/docs cause lock contention/index bloat.
- Solution: Shard writes by org ID across multiple PG primaries (e.g., the Citus extension or app-level routing). Use async queues (e.g., SQS) for non-critical writes (sketch below).
- Trade-offs:
| Pro | Con |
|---|---|
| True write scalability | Cross-shard queries complex (docs stay intra-shard) |
| Leverages existing partitioning | Migration overhead, eventual consistency on sharded joins |
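
A sketch of org-ID sharding with the Citus extension, issued through node-postgres; it assumes a Citus coordinator/worker cluster already exists and that each table carries an `org_id` column.

```ts
// Sketch: distribute the document tables by org_id so all rows for one org
// land on the same shard, keeping per-doc and per-org queries single-shard.
import { Pool } from 'pg';

const pool = new Pool(); // connects to the Citus coordinator

async function shardByOrg() {
  await pool.query('CREATE EXTENSION IF NOT EXISTS citus');
  await pool.query("SELECT create_distributed_table('documents', 'org_id')");
  await pool.query("SELECT create_distributed_table('doc_ops', 'org_id')");
  // Cross-org joins become distributed queries; keep them rare or pre-aggregated.
}
```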
10. Missing WebSocket Reconnection & State Sync (Failure Mode)
- Issue: Server crash/network partition drops WS; clients desync without retry logic. No snapshot fetch on reconnect leads to lost changes.
- Solution: Client side: exponential-backoff reconnects carrying the last-known version/timestamp. Server side: on connect, query PG for a snapshot plus unapplied ops since the client's version (sketch below).
- Trade-offs:
| Pro | Con |
|---|---|
| Resilient to failures | Brief UI freeze during sync (show "Reconnecting..." overlay) |
| Standard tooling (e.g., Socket.IO handles reconnection automatically) | Bandwidth spike on mass reconnects |
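
A client-side sketch of exponential-backoff reconnection that resumes from the last applied version; the `?since=` parameter, host, and message shape assume a resync contract the server would need to implement.

```ts
// Sketch: browser-side reconnect with exponential backoff plus jitter,
// resuming from the last version the client successfully applied.
function connect(docId: string, lastVersion: number, apply: (msg: any) => void) {
  let attempt = 0;

  const open = () => {
    const ws = new WebSocket(
      `wss://example.invalid/docs/${docId}?since=${lastVersion}`,
    );

    ws.onopen = () => { attempt = 0; };          // reset backoff on success

    ws.onmessage = (ev) => {
      const msg = JSON.parse(ev.data);           // snapshot or op from the server
      lastVersion = msg.version ?? lastVersion;  // track the resume point
      apply(msg);
    };

    ws.onclose = () => {
      // Exponential backoff capped at 30s, with jitter to avoid a thundering
      // herd of simultaneous reconnects after a server crash.
      const delay = Math.min(30_000, 1_000 * 2 ** attempt) * (0.5 + Math.random());
      attempt += 1;
      setTimeout(open, delay);
    };
  };

  open();
}
```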
11. Redis Session Dependency (Failure Mode)
- Issue: If Redis goes down, sessions are lost and users hit auth failures mid-session even though JWTs are issued. It's unclear whether Redis is replicated.
- Solution: Make auth fully JWT-stateless (validate the signature server-side, no Redis lookup). Use Redis only for optional sticky hints; replicate the Redis cluster (sketch below).
- Trade-offs:
| Pro | Con |
|---|---|
| Zero-downtime auth | Slightly higher CPU for signature validation |
| Simplifies scaling | Revocation harder (shorten JWT expiry + blocklist in Redis) |
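
A sketch of stateless JWT validation as Express middleware, with an optional Redis blocklist consulted only for revocation; the cookie name and `jti`-keyed blocklist are assumptions, and the request cookie is assumed to be populated by cookie-parser (see the item 6 sketch).

```ts
// Sketch: validate the JWT signature on every request (no Redis dependency),
// then optionally check a revocation blocklist and fail open if Redis is down.
import { Request, Response, NextFunction } from 'express';
import jwt, { JwtPayload } from 'jsonwebtoken';
import Redis from 'ioredis';

const redis = new Redis();

export async function auth(req: Request, res: Response, next: NextFunction) {
  const token = (req as any).cookies?.access_token; // set by cookie-parser
  if (!token) return res.sendStatus(401);
  try {
    // Signature + expiry verification is pure CPU; auth survives a Redis outage.
    const claims = jwt.verify(token, process.env.JWT_SECRET!) as JwtPayload;

    // Optional revocation check for explicitly blocked tokens; fail open.
    const revoked = await redis.get(`revoked:${claims.jti}`).catch(() => null);
    if (revoked) return res.sendStatus(401);

    (req as any).user = claims.sub;
    next();
  } catch {
    res.sendStatus(401);
  }
}
```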
Summary of Architecture-Wide Risks
| Category | High Impact Issues | Mitigation Priority |
|---|---|---|
| Scaling | Polling, WS stickiness, PG writes | High (blocks >10 servers) |
| Consistency | Timestamps, LWW conflicts | High (core UX breakage) |
| Reliability | No reconnects, Redis single-point | Medium (graceful degradation) |
| Security/Perf | JWT storage, CDN staleness | Medium (exploitable but not critical) |
This covers the major issues; implementing items 1-4 plus reconnection (item 10) yields a production-viable system. The larger refactors (e.g., OT/CRDT plus dedicated pub/sub) add an estimated 20-50% implementation complexity but enable roughly 10x scale.