This architecture has several potential failure modes, race conditions, and scaling bottlenecks. Below is a detailed analysis with proposed solutions and their trade-offs.
1. WebSocket Connection & Server-Side State Issues
Problem: WebSocket Connections Are Server-Specific
- Failure Mode: If a user’s WebSocket connection is tied to a single server, and that server crashes or restarts, the user loses real-time updates until they reconnect.
- Race Condition: If a user reconnects to a different server, they may miss updates that were broadcast only to the original server.
- Scaling Bottleneck: Each server maintains its own WebSocket connections, meaning:
- Memory pressure: Each server must track all connected clients.
- Broadcast inefficiency: Changes must be synchronized between servers via polling (2s delay).
Solutions:
Option 1: Use a Dedicated WebSocket Service (e.g., Socket.io with Redis Adapter)
- How it works:
- All WebSocket connections are managed by a separate service (or a Redis-backed Socket.io cluster).
- Broadcasts are published to Redis, and all servers subscribe to changes.
- Pros:
- No server-specific WebSocket state.
- Real-time broadcasts work across all servers.
- Cons:
- Adds complexity (another service to manage).
- Redis becomes a single point of failure (mitigated with Redis Cluster).
Option 2: Use a Pub/Sub System (e.g., Kafka, NATS, or AWS SNS/SQS)
- How it works:
- Servers publish changes to a central Pub/Sub system.
- All servers subscribe to relevant document change topics.
- Pros:
- Decouples WebSocket handling from API servers.
- Better scalability (Pub/Sub systems handle high throughput).
- Cons:
- Higher latency than in-memory Redis.
- More moving parts (another distributed system to manage).
Option 3: Sticky Sessions (Session Affinity) with Load Balancer
- How it works:
- Configure the load balancer to route WebSocket connections to the same server based on a session cookie.
- Pros:
- Simple to implement.
- No additional infrastructure.
- Cons:
- Still fragile: If a server goes down, its users must reconnect and may miss updates in the meantime.
- Uneven load: Some servers may get more connections than others.
Recommended Solution: Option 1 (Redis-backed WebSocket service) for simplicity and real-time guarantees.
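To make that recommendation concrete, here is a minimal sketch of a Redis-backed Socket.io setup in TypeScript, assuming the socket.io, @socket.io/redis-adapter, and redis packages; the event names and the room-per-document convention are illustrative, not part of the original design.

```typescript
// Minimal sketch: Socket.io with the Redis adapter, so a broadcast handled by
// any API server reaches clients connected to every other server.
import { createServer } from "http";
import { Server } from "socket.io";
import { createAdapter } from "@socket.io/redis-adapter";
import { createClient } from "redis";

async function start(): Promise<void> {
  const httpServer = createServer();
  const io = new Server(httpServer);

  // Two Redis connections: one for publishing, one subscribed to the adapter channels.
  const pubClient = createClient({ url: process.env.REDIS_URL ?? "redis://localhost:6379" });
  const subClient = pubClient.duplicate();
  await Promise.all([pubClient.connect(), subClient.connect()]);

  io.adapter(createAdapter(pubClient, subClient));

  io.on("connection", (socket) => {
    // Clients join a room per document (naming convention assumed here).
    socket.on("join-document", (documentId: string) => socket.join(`doc:${documentId}`));
    // Emitting to the room fans out via Redis to matching sockets on all servers.
    socket.on("edit", (documentId: string, change: unknown) => {
      socket.to(`doc:${documentId}`).emit("remote-edit", change);
    });
  });

  httpServer.listen(3000);
}

start().catch(console.error);
```

With the adapter in place, `socket.to(room).emit(...)` fans out through Redis Pub/Sub, so no server needs to know which machine a given client is connected to.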
2. Last-Write-Wins (LWW) Conflict Resolution Issues
Problem: Clock Skew & Lost Updates
- Failure Mode:
- Client clocks are not synchronized (NTP drift, manual time changes).
- If two users edit the same paragraph, the last timestamp wins, but if clocks are out of sync, the wrong edit may persist.
- Race Condition:
- If two edits arrive at the server within the same millisecond, the order is non-deterministic.
- Scaling Bottleneck:
- LWW does not scale well for high-contention documents (e.g., many users editing the same paragraph).
Solutions:
Option 1: Use Server-Assigned Timestamps (Logical Clocks)
- How it works:
- The server (or database) assigns timestamps (e.g., PostgreSQL’s transaction_id or a sequential ID).
- Conflicts are resolved based on server-assigned order rather than client clocks.
- Pros:
- Eliminates clock skew issues.
- Deterministic conflict resolution.
- Cons:
- Requires an extra database round-trip per edit.
- Still uses LWW, which may not be ideal for collaborative editing.
Option 2: Operational Transformation (OT) or CRDTs
- How it works:
- OT: Transforms edits to account for concurrent changes (used in Google Docs).
- CRDTs: Conflict-free replicated data types (eventually consistent).
- Pros:
- No lost updates: All changes are merged correctly.
- Better for high-contention documents.
- Cons:
- Complex to implement (OT requires careful sequencing).
- CRDTs may not fit HTML snapshots (better for plain text or JSON).
Option 3: Hybrid Approach (LWW + Version Vectors)
- How it works:
- Each edit includes a version vector (client ID + counter).
- The server rejects stale updates (if a client’s version is behind).
- Pros:
- Prevents lost updates due to clock skew.
- Still simple compared to OT/CRDTs.
- Cons:
- Requires client-side version tracking.
- Does not solve all conflict cases (e.g., concurrent edits to the same position).
Recommended Solution: Option 2 (OT or CRDTs) for correctness, but if simplicity is key, Option 1 (server-assigned timestamps) is a good compromise.
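If the simpler compromise is chosen, a minimal sketch of server-assigned ordering with node-postgres might look like the following; the paragraphs table, edit_seq column, and paragraph_edit_seq sequence are assumed names, not part of the existing schema.

```typescript
// Sketch of Option 1: a database sequence, not the client clock, decides which
// concurrent edit is "last".
import { Pool } from "pg";

const pool = new Pool(); // connection settings come from PG* environment variables

interface ParagraphEdit {
  documentId: string;
  paragraphId: string;
  html: string;
}

async function applyEdit(edit: ParagraphEdit): Promise<number> {
  // The row lock serializes concurrent writers; nextval() gives each accepted
  // edit a strictly increasing, server-side order independent of client clocks.
  const result = await pool.query(
    `UPDATE paragraphs
        SET html = $1,
            edit_seq = nextval('paragraph_edit_seq')
      WHERE document_id = $2 AND paragraph_id = $3
      RETURNING edit_seq`,
    [edit.html, edit.documentId, edit.paragraphId]
  );
  if (result.rows.length === 0) throw new Error("paragraph not found");
  return Number(result.rows[0].edit_seq);
}
```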
3. Polling-Based Inter-Server Sync (2s Delay)
Problem: High Latency & Inefficiency
- Failure Mode:
- If a server crashes before broadcasting a change, other servers will not see it until the next poll (up to 2s later); if the change was never persisted, it is lost entirely.
- Race Condition:
- If two servers process the same change at the same time, they may broadcast duplicates.
- Scaling Bottleneck:
- Database load: Every server polls PostgreSQL every 2s, leading to high read load.
- Stale data: Users on different servers see changes with a 2s delay.
Solutions:
Option 1: Replace Polling with Change Data Capture (CDC)
- How it works:
- Use PostgreSQL logical decoding (e.g., Debezium) to stream changes to a message queue (Kafka).
- All servers subscribe to the queue for real-time updates.
- Pros:
- No polling delay (changes propagate instantly).
- Reduces database load.
- Cons:
- Adds complexity (Kafka/Debezium setup).
- Requires handling out-of-order events.
Option 2: Use a Distributed Log (e.g., Apache Kafka)
- How it works:
- All document changes are written to a central log (Kafka).
- Servers consume the log to stay in sync.
- Pros:
- Real-time sync (no polling).
- Fault-tolerant (Kafka replicates logs).
- Cons:
- Operational overhead (managing Kafka).
- Eventual consistency (servers may lag behind).
Option 3: Optimized Polling (Long Polling or Conditional GETs)
- How it works:
- Servers poll using ETags or last-modified timestamps (or hold long-poll requests open) so data is fetched only when something has actually changed.
- Pros:
- Reduces unnecessary database reads.
- Cons:
- Still has latency (though reduced).
Recommended Solution: Option 1 (CDC with Kafka) for real-time sync without polling.
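A minimal sketch of the consuming side, assuming kafkajs and a Debezium-style topic name (pgserver.public.documents is a placeholder). Note that each server uses its own consumer group so every server receives every change event, rather than splitting the stream.

```typescript
// Minimal Debezium change-event consumer (assumed topic name; kafkajs client).
import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "doc-sync", brokers: ["kafka:9092"] });

// A unique group per server: every server consumes the full change stream.
const consumer = kafka.consumer({
  groupId: `doc-sync-${process.env.HOSTNAME ?? "local"}`,
});

function broadcastToLocalClients(row: unknown): void {
  // Placeholder: hand the new row state to the WebSocket layer
  // (e.g., io.to(`doc:${documentId}`).emit(...)).
}

async function run(): Promise<void> {
  await consumer.connect();
  // Debezium publishes one topic per table; this name is illustrative.
  await consumer.subscribe({ topic: "pgserver.public.documents", fromBeginning: false });

  await consumer.run({
    eachMessage: async ({ message }) => {
      if (!message.value) return;
      const event = JSON.parse(message.value.toString());
      // event.payload.after holds the new row state in Debezium's envelope format.
      broadcastToLocalClients(event.payload?.after);
    },
  });
}

run().catch(console.error);
```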
4. Database Bottlenecks (PostgreSQL)
Problem: High Write Load & Contention
- Failure Mode:
- If many users edit the same document, row-level locks in PostgreSQL can cause timeouts.
- Scaling Bottleneck:
- Single writer: The PostgreSQL primary handles all writes (read replicas don’t offload write load).
- Snapshot storage: Saving full HTML every 30s is write-heavy.
Solutions:
Option 1: Shard Documents by Organization/ID
- How it works:
- Partition documents across multiple PostgreSQL instances (e.g., by org_id).
- Pros:
- Reduces contention for hot documents.
- Cons:
- Cross-shard transactions are complex.
- Uneven load if some orgs have more activity.
Option 2: Use a Time-Series Database for Edits
- How it works:
- Store individual edits in TimescaleDB (PostgreSQL extension) or ScyllaDB.
- Reconstruct document state on read.
- Pros:
- Better write throughput for high-frequency edits.
- Cons:
- Read complexity: Must reconstruct HTML from deltas.
Option 3: Optimize Snapshot Storage
- How it works:
- Instead of full HTML snapshots every 30s, use:
- Incremental snapshots (only changed parts).
- Compression (e.g., gzip before storing).
- Pros:
- Reduces storage and write load.
- Cons:
- More complex reconstruction logic.
Recommended Solution: Option 1 (sharding) + Option 3 (optimized snapshots) for a balanced approach.
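For the snapshot half of that recommendation, a minimal sketch of gzip-compressed snapshot storage with node-postgres is shown below; the document_snapshots table and its columns are assumptions, not part of the existing schema.

```typescript
// Sketch of Option 3: compress HTML before persisting it; HTML text usually
// compresses well, cutting both storage and write bandwidth.
import { gzipSync, gunzipSync } from "zlib";
import { Pool } from "pg";

const pool = new Pool();

async function saveSnapshot(documentId: string, html: string): Promise<void> {
  const compressed = gzipSync(Buffer.from(html, "utf8"));
  await pool.query(
    `INSERT INTO document_snapshots (document_id, snapshot_gz, created_at)
     VALUES ($1, $2, now())`,
    [documentId, compressed] // snapshot_gz assumed to be a bytea column
  );
}

async function loadLatestSnapshot(documentId: string): Promise<string | null> {
  const result = await pool.query(
    `SELECT snapshot_gz FROM document_snapshots
      WHERE document_id = $1
      ORDER BY created_at DESC LIMIT 1`,
    [documentId]
  );
  if (result.rows.length === 0) return null;
  return gunzipSync(result.rows[0].snapshot_gz).toString("utf8");
}
```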
5. Authentication & Security Issues
Problem: JWT in localStorage + 24h Expiry
- Failure Mode:
- XSS vulnerability: If an attacker injects JS, they can steal the JWT from localStorage.
- Long-lived tokens: 24h expiry increases risk if a token is leaked.
- Scaling Bottleneck:
- No revocation: If a token is compromised, it’s valid until expiry.
Solutions:
Option 1: Use HttpOnly Cookies + Short-Lived Tokens
- How it works:
- Store JWT in HttpOnly, Secure, SameSite=Strict cookies.
- Reduce expiry to 1h, and use a refresh token (stored in a secure cookie).
- Pros:
- Mitigates XSS (cookies not accessible to JS).
- Short-lived tokens reduce exposure.
- Cons:
- More complex token rotation.
- CORS issues if frontend and backend are on different domains.
Option 2: Fine-Grained Permissions (Beyond JWT)
- How it works:
- Use OAuth2 scopes or attribute-based access control (ABAC).
- Check permissions on every edit (e.g., "Can user X edit document Y?").
- Pros:
- More secure than just JWT validation.
- Cons:
- Higher latency (extra DB checks).
Recommended Solution: Option 1 (HttpOnly cookies + short-lived tokens) for better security.
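A minimal sketch of issuing the short-lived token in an HttpOnly cookie, assuming Express and jsonwebtoken; the cookie names, claim contents, and refresh-token path are placeholders.

```typescript
// Sketch: short-lived access token + longer-lived refresh token, both in
// HttpOnly cookies so injected scripts cannot read them (mitigates XSS theft).
import express from "express";
import jwt from "jsonwebtoken";

const app = express();
const JWT_SECRET = process.env.JWT_SECRET ?? "dev-only-secret";

app.post("/login", (req, res) => {
  // ...authenticate the user first (omitted)...
  const accessToken = jwt.sign({ sub: "user-123" }, JWT_SECRET, { expiresIn: "1h" });
  const refreshToken = jwt.sign({ sub: "user-123", typ: "refresh" }, JWT_SECRET, {
    expiresIn: "30d",
  });

  const cookieOptions = { httpOnly: true, secure: true, sameSite: "strict" as const };
  res.cookie("access_token", accessToken, { ...cookieOptions, maxAge: 60 * 60 * 1000 });
  res.cookie("refresh_token", refreshToken, {
    ...cookieOptions,
    path: "/auth/refresh", // refresh token only sent to the refresh endpoint
    maxAge: 30 * 24 * 60 * 60 * 1000,
  });
  res.sendStatus(204);
});

app.listen(3000);
```

Note that `SameSite=Strict` only works cleanly when the frontend and API share a site; cross-domain deployments need `SameSite=None` plus CSRF protection, which is part of the CORS trade-off listed above.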
6. Caching & CDN Issues
Problem: Stale API Responses (5min Cache)
- Failure Mode:
- If a document is edited, users may see stale data for up to 5min.
- Race Condition:
- If a user fetches a document while another user is editing, they may get an outdated version.
Solutions:
Option 1: Cache Invalidation on Write
- How it works:
- When a document is updated, invalidate the CDN cache (e.g., CloudFront invalidation API).
- Pros:
- Users see fresh content as soon as an edit is saved (no 5min staleness window).
- Cons:
- Cost: CloudFront invalidations are not free.
- Latency: Invalidation takes time (~1-10s).
Option 2: Use ETags + Conditional Requests
- How it works:
- Return ETag headers and let clients send If-None-Match on subsequent requests (see the sketch after this option).
- The CDN revalidates with the origin and reuses its cached copy only while the ETag still matches (304 Not Modified).
- Pros:
- No manual invalidation needed.
- Cons:
- Still some staleness (until next request).
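The conditional-request flow referenced above could look roughly like this in Express; fetchDocument is a hypothetical loader and the hashing scheme is illustrative.

```typescript
// Sketch: hash the current document to build an ETag; a matching If-None-Match
// short-circuits with 304 so no body is transferred.
import { createHash } from "crypto";
import express from "express";

const app = express();

async function fetchDocument(id: string): Promise<{ html: string }> {
  // Placeholder: load the latest document state from PostgreSQL or a cache.
  return { html: `<p>document ${id}</p>` };
}

app.get("/api/documents/:id", async (req, res) => {
  const doc = await fetchDocument(req.params.id);
  const etag = `"${createHash("sha1").update(doc.html).digest("hex")}"`;

  if (req.headers["if-none-match"] === etag) {
    res.status(304).end(); // the client/CDN copy is still current
    return;
  }
  res.set("ETag", etag);
  res.set("Cache-Control", "private, max-age=0, must-revalidate");
  res.json(doc);
});
```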
Option 3: Real-Time Push (WebSocket + No CDN for API)
- How it works:
- Do not cache API responses (rely on WebSocket for real-time updates).
- Pros:
- No staleness: clients receive every update as it happens.
- Cons:
- Higher backend load (no caching).
Recommended Solution: Option 1 (cache invalidation) for a balance between freshness and performance.
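A minimal sketch of invalidating the affected paths on write, assuming the AWS SDK v3 CloudFront client; the distribution ID environment variable and the path pattern are placeholders.

```typescript
// Sketch: invalidate the cached API paths for a document right after it is updated.
import {
  CloudFrontClient,
  CreateInvalidationCommand,
} from "@aws-sdk/client-cloudfront";

const cloudfront = new CloudFrontClient({ region: "us-east-1" });

async function invalidateDocumentCache(documentId: string): Promise<void> {
  await cloudfront.send(
    new CreateInvalidationCommand({
      DistributionId: process.env.CDN_DISTRIBUTION_ID,
      InvalidationBatch: {
        // CallerReference must be unique per request; a timestamp suffices for a sketch.
        CallerReference: `doc-${documentId}-${Date.now()}`,
        Paths: { Quantity: 1, Items: [`/api/documents/${documentId}*`] },
      },
    })
  );
}
```

Batching invalidations (e.g., one per document per few seconds) keeps the per-path invalidation costs mentioned above under control.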
7. Failure Modes in Document Storage (Full HTML Snapshots)
Problem: Data Loss on Crash
- Failure Mode:
- If the server crashes before saving a snapshot, up to 30s of edits could be lost.
- Scaling Bottleneck:
- High storage costs (full HTML every 30s).
Solutions:
Option 1: Write-Ahead Log (WAL) for Edits
- How it works:
- Append every edit to a WAL (e.g., Kafka or a simple log file).
- Replay logs on recovery.
- Pros:
- No data loss (edits are durable).
- Cons:
- Recovery complexity (the log must be replayed in order after a crash).
- Extra infrastructure and storage for the log itself.
Option 2: More Frequent Snapshots (e.g., 5s)
- How it works:
- Save snapshots more often (but with delta compression).
- Pros:
- Reduces data loss window.
- Cons:
- Higher write and storage load, even with delta compression.
Option 3: Client-Side Undo/Redo Buffer
- How it works:
- Clients keep a local buffer of unsaved edits.
- On reconnect, they resend pending changes.
- Pros:
- Reduces reliance on server snapshots.
- Cons:
- Complex conflict resolution.
Recommended Solution: Option 1 (WAL) + Option 2 (frequent snapshots with deltas) for durability.
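A minimal sketch of the WAL half of that recommendation, assuming kafkajs; the topic name and message shape are placeholders. Keying messages by document ID keeps each document's edits ordered within a single partition.

```typescript
// Sketch: append every edit to a Kafka topic before acknowledging it to the
// client, so acknowledged edits survive a server crash and can be replayed.
import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "doc-editor", brokers: ["kafka:9092"] });
const producer = kafka.producer();

interface EditRecord {
  documentId: string;
  paragraphId: string;
  html: string;
  editSeq: number;
}

async function appendToWal(edit: EditRecord): Promise<void> {
  await producer.send({
    topic: "document-edits-wal",
    // Keying by documentId keeps all edits for one document on one partition, in order.
    messages: [{ key: edit.documentId, value: JSON.stringify(edit) }],
  });
}

async function main(): Promise<void> {
  await producer.connect(); // connect once at startup, not per edit
  await appendToWal({ documentId: "doc-1", paragraphId: "p-7", html: "<p>hello</p>", editSeq: 42 });
}

main().catch(console.error);
```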
Summary of Recommended Fixes
| Issue | Solution | Trade-offs |
|---|---|---|
| WebSocket server affinity | Redis-backed WebSocket service | Adds Redis dependency |
| Last-write-wins conflicts | Operational Transformation (OT) | Complex to implement |
| Polling-based inter-server sync | PostgreSQL CDC + Kafka | Adds Kafka complexity |
| PostgreSQL write bottlenecks | Sharding + optimized snapshots | Cross-shard transactions |
| JWT security risks | HttpOnly cookies + short-lived tokens | More token rotation logic |
| Stale CDN caches | Cache invalidation on write | CloudFront invalidation costs |
| Snapshot data loss | Write-ahead log (WAL) | Recovery complexity |
Final Architecture Improvements
- Replace polling with Kafka + CDC for real-time inter-server sync.
- Use OT/CRDTs instead of LWW for conflict resolution.
- Move WebSockets to a Redis-backed service for fault tolerance.
- Shard PostgreSQL by org_id and optimize snapshots.
- Store JWT in HttpOnly cookies with short expiry.
- Invalidate CDN cache on document updates.
- Implement a WAL for edit durability.
This makes the system more scalable, fault-tolerant, and correct while keeping complexity manageable.