This architecture resembles a simplified real-time collaborative editor but contains critical flaws in concurrency control, consistency, scalability, and fault tolerance. Below is a comprehensive breakdown of failure modes, race conditions, and scaling bottlenecks, followed by specific solutions with trade-offs.
🚨 1. Last-Write-Wins (LWW) with Client Clocks → Severe Data Corruption Risk
Issue:
- Client clocks are unsynchronized (NTP drift, manual settings, timezones). A user with a fast clock can overwrite others’ edits permanently.
- No server-side validation of timestamps → malicious or buggy clients can inject arbitrary timestamps.
- LWW is fundamentally unsuitable for collaborative editing — it discards potentially valid edits (e.g., two users typing "a" and "b" at the same position → only one survives).
Failure Mode:
- User A edits paragraph at 12:00:00 (correct time).
- User B edits same paragraph at 12:00:01 (but their clock is 5 minutes fast → actual time 11:55:01).
- System sees B’s timestamp as "newer" → A’s edit is lost.
Solution:
Replace LWW with Operational Transformation (OT) or Conflict-Free Replicated Data Types (CRDTs)
- Use a CRDT-based document model (e.g., Yjs or Automerge) that mathematically guarantees convergence without central coordination.
- Each edit is a structured operation (insert/delete at position with unique ID), not a full snapshot.
- Server validates and applies ops sequentially, assigning logical timestamps (causal order via vector clocks or Lamport timestamps).
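To make the convergence claim concrete, here is a minimal Yjs sketch (the `content` field name is arbitrary): two replicas apply concurrent edits, exchange updates in either order, and converge to the same text.

```ts
import * as Y from "yjs";

// Two independent replicas of the same document.
const docA = new Y.Doc();
const docB = new Y.Doc();

// Concurrent edits at the same position on each replica.
docA.getText("content").insert(0, "a");
docB.getText("content").insert(0, "b");

// Exchange updates in either order; neither edit is discarded.
const updateA = Y.encodeStateAsUpdate(docA);
const updateB = Y.encodeStateAsUpdate(docB);
Y.applyUpdate(docA, updateB);
Y.applyUpdate(docB, updateA);

// Both replicas now hold the identical merged text.
console.log(docA.getText("content").toString() === docB.getText("content").toString()); // true
```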
Trade-offs:
- ✅ Strong consistency, no data loss, real-time convergence.
- ❌ Increased frontend/backend complexity (must replace HTML snapshots with structured JSON ops).
- ❌ Higher bandwidth (small ops vs. full HTML snapshots).
- ❌ Migration cost: existing HTML snapshots must be converted to CRDT state.
💡 Bonus: Store both the CRDT state and periodic HTML snapshots for UI rendering and backup.
🚨 2. Server-Local WebSockets → Inconsistent State Across Nodes
Issue:
- Each API server only broadcasts to its own WebSocket clients.
- Other servers poll PostgreSQL every 2s → up to 2s of added latency, plus missed intermediate updates.
- A user connected to Server A edits a doc → Server B (with other users) won’t see it until next poll → users see stale content.
Failure Mode:
- User A (on Server A) types “Hello”.
- User B (on Server B) sees nothing for up to 2s.
- User B types “World!” → Server B broadcasts “World!” to its clients.
- User A sees “World!” before “Hello” → edit order is broken.
Solution:
Use a pub/sub system (Redis Pub/Sub or Kafka) to propagate changes across servers
- When a server receives a change via WebSocket, it publishes the operation to a global channel (e.g., `doc:{doc_id}:ops`).
- All API servers subscribe to channels for documents they have active clients for.
- Each server applies the op to its local CRDT state and broadcasts to its connected clients.
- Eliminate polling — use event-driven propagation.
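A sketch of that fan-out path, assuming ioredis; `applyToLocalState` and `broadcastToLocalClients` are hypothetical stand-ins for the server's own plumbing:

```ts
import { randomUUID } from "crypto";
import Redis from "ioredis";

const serverId = randomUUID(); // lets each node ignore the echo of its own publishes
const pub = new Redis();       // a subscribed connection can't publish, so keep two
const sub = new Redis();

// Assumed to exist elsewhere in the server.
declare function applyToLocalState(docId: string, op: unknown): void;
declare function broadcastToLocalClients(docId: string, op: unknown): void;

// Called when this node receives an op from one of its own WebSocket clients.
async function onLocalOp(docId: string, op: unknown): Promise<void> {
  applyToLocalState(docId, op);
  broadcastToLocalClients(docId, op);
  await pub.publish(`doc:${docId}:ops`, JSON.stringify({ origin: serverId, op }));
}

// Subscribe when the first local client opens a doc (unsubscribe when the last leaves).
async function watchDoc(docId: string): Promise<void> {
  await sub.subscribe(`doc:${docId}:ops`);
}

sub.on("message", (channel, message) => {
  const { origin, op } = JSON.parse(message);
  if (origin === serverId) return;          // already applied locally
  const docId = channel.split(":")[1];
  applyToLocalState(docId, op);
  broadcastToLocalClients(docId, op);
});
```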
Trade-offs:
- ✅ Near-real-time sync across all servers (<100ms latency).
- ✅ Eliminates race conditions from polling delay.
- ❌ Adds dependency on Redis/Kafka (more infrastructure to manage).
- ❌ Risk of message duplication → must make ops idempotent (CRDTs naturally are).
🚨 3. Full HTML Snapshots Every 30s → Inefficient, Unreliable, Unscalable
Issue:
- Full HTML snapshots are huge (100KB–1MB+ per doc), stored every 30s → 100x more storage than needed.
- Snapshotting overwrites history — you lose the ability to reconstruct edit history, undo, or audit.
- On restart or load, server must rehydrate state from last snapshot → slow startup, potential data loss if last snapshot missed a change.
Failure Mode:
- User edits doc just after a snapshot is taken.
- Server crashes 100ms before the next snapshot fires → the edit is lost.
- User tries to undo → impossible, since no history is kept.
Solution:
Store only CRDT operations + periodic snapshots as backup
- Store every operation (e.g., `insert at 12, "a"`) in PostgreSQL as a row with `doc_id`, `op_id`, `timestamp`, `client_id`, `operation_json`.
- Use batching (e.g., 100 ops per batch) to reduce write load.
- Take snapshots every 5–10 minutes (not 30s) for fast restore.
- Use WAL-style persistence — you can replay ops to reconstruct any state.
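A sketch of the batched op writer using node-postgres (`pg`); the table shape follows the row described above, and the thresholds are illustrative:

```ts
import { Pool } from "pg";

const pool = new Pool(); // connection settings from the standard PG* env vars

// Assumed table:
//   CREATE TABLE doc_ops (
//     doc_id TEXT, op_id BIGINT, client_id TEXT,
//     ts TIMESTAMPTZ DEFAULT now(), operation_json JSONB,
//     PRIMARY KEY (doc_id, op_id));

type PendingOp = { docId: string; opId: number; clientId: string; op: unknown };
const buffer: PendingOp[] = [];

function enqueueOp(docId: string, opId: number, clientId: string, op: unknown): void {
  buffer.push({ docId, opId, clientId, op });
  if (buffer.length >= 100) void flush();   // size-based flush (the batch size from the text)
}

async function flush(): Promise<void> {
  const batch = buffer.splice(0, buffer.length);
  if (batch.length === 0) return;
  // One multi-row INSERT; ON CONFLICT makes redelivered ops idempotent.
  const values: unknown[] = [];
  const rows = batch.map((b, i) => {
    values.push(b.docId, b.opId, b.clientId, JSON.stringify(b.op));
    const o = i * 4;
    return `($${o + 1}, $${o + 2}, $${o + 3}, $${o + 4})`;
  });
  await pool.query(
    `INSERT INTO doc_ops (doc_id, op_id, client_id, operation_json)
     VALUES ${rows.join(", ")}
     ON CONFLICT (doc_id, op_id) DO NOTHING`,
    values
  );
}

setInterval(() => void flush(), 1_000);     // time-based flush for quiet docs
```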
Trade-offs:
- ✅ Full audit trail, undo/redo possible, no data loss.
- ✅ Storage efficiency: 100 ops = ~1KB vs 100KB snapshot.
- ❌ More complex query logic to reconstruct state.
- ❌ Requires migration of existing snapshot-based system.
🚨 4. JWT in localStorage + 24h Expiry → Security & Scalability Risks
Issue:
- localStorage is vulnerable to XSS → token stolen → attacker has full access for 24h.
- No refresh mechanism — if token expires, user must re-login (bad UX).
- No revocation — if user logs out or account compromised, token remains valid until expiry.
Failure Mode:
- XSS attack steals JWT → attacker edits documents as user → no way to revoke.
- User logs in on public computer → token left behind → next user accesses account.
Solution:
Use HTTP-only, SameSite=Strict cookies with short-lived access tokens + refresh tokens
- Access token: 5–15 min expiry, stored in HTTP-only, Secure, SameSite=Strict cookie.
- Refresh token: 7-day expiry, stored in HTTP-only cookie, used to get new access token.
- Maintain token revocation list (Redis set) for logout/invalidate events.
- Use OAuth2-style flow with backend-managed sessions.
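A sketch of the cookie flow with Express, `jsonwebtoken`, and a Redis revocation set; `authenticate` is a hypothetical stand-in for the credential check:

```ts
import express from "express";
import cookieParser from "cookie-parser";
import jwt from "jsonwebtoken";
import Redis from "ioredis";

const app = express();
app.use(express.json());
app.use(cookieParser());
const redis = new Redis();
const SECRET = process.env.JWT_SECRET!;     // assumed to be provisioned

// Hypothetical credential check against the user store.
declare function authenticate(req: express.Request): Promise<string | null>;

const flags = { httpOnly: true, secure: true, sameSite: "strict" as const };

app.post("/api/login", async (req, res) => {
  const userId = await authenticate(req);
  if (!userId) return res.sendStatus(401);
  const access = jwt.sign({ sub: userId }, SECRET, { expiresIn: "15m" });
  const refresh = jwt.sign({ sub: userId, typ: "refresh" }, SECRET, { expiresIn: "7d" });
  res.cookie("access_token", access, { ...flags, maxAge: 15 * 60_000 });
  // Scoping the refresh cookie's path limits where it is ever sent.
  res.cookie("refresh_token", refresh, { ...flags, maxAge: 7 * 24 * 3_600_000, path: "/api/refresh" });
  res.sendStatus(204);
});

app.post("/api/refresh", async (req, res) => {
  const token = req.cookies.refresh_token;
  // Reject tokens added to the revocation set on logout/compromise.
  if (!token || (await redis.sismember("revoked_tokens", token))) return res.sendStatus(401);
  try {
    const { sub } = jwt.verify(token, SECRET) as { sub: string };
    const access = jwt.sign({ sub }, SECRET, { expiresIn: "15m" });
    res.cookie("access_token", access, { ...flags, maxAge: 15 * 60_000 });
    res.sendStatus(204);
  } catch {
    res.sendStatus(401);
  }
});
```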
Trade-offs:
- ✅ Much more secure (XSS can’t steal cookies).
- ✅ Automatic token refresh → better UX.
- ❌ Slightly more complex auth flow.
- ❌ Requires CSRF protection (though SameSite=Strict plus POST-only state-changing endpoints largely mitigate it).
🚨 5. CDN Caching API Responses → Stale Collaborative Data
Issue:
- CloudFront caches API responses (e.g., `/api/doc/123`) for 5 minutes.
- User A edits doc → backend updates PostgreSQL.
- User B requests doc → gets cached stale response from CDN → sees old content.
- Real-time collaboration is broken — users see different versions.
Failure Mode:
- Two users edit same doc → both get cached versions → conflict resolution fails because they’re working on stale state.
Solution:
Disable CDN caching for all dynamic API endpoints (e.g., `/api/doc/*`, `/api/sync`)
- Cache only static assets (JS, CSS, images).
- Use `Cache-Control: no-cache, no-store, private` headers on all document-related endpoints.
- If you must cache, use cache keys based on document version (e.g., `/api/doc/123?v=456`), but this requires client-side version tracking.
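A sketch of that header policy as Express middleware; the paths mirror the examples above, and the static-asset mount is an assumption:

```ts
import express from "express";

const app = express();

// Document endpoints must never be cached by CloudFront or the browser.
app.use(["/api/doc", "/api/sync"], (_req, res, next) => {
  res.set("Cache-Control", "no-cache, no-store, private");
  next();
});

// Static assets keep long-lived caching (content-hashed filenames assumed).
app.use("/assets", express.static("dist/assets", { maxAge: "1y", immutable: true }));
```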
Trade-offs:
- ✅ Ensures all users get up-to-date document state.
- ❌ Higher origin server load (no CDN caching for APIs).
- ✅ Mitigation: use edge computing (e.g., Cloudflare Workers) for lightweight auth/authorization checks at the edge without caching the response body.
🚨 6. Round-Robin Load Balancer → Sticky Sessions Needed, But Not Mentioned
Issue:
- WebSocket connections are long-lived and stateful: if document state lives in server memory, a client must reach the server that holds it.
- If the load balancer doesn't use sticky sessions (session affinity), a reconnecting client lands on a different server with no state → reconnection delays → lost edits.
Failure Mode:
- Client connects to Server A → types "Hi".
- The connection drops (network blip, redeploy) → the load balancer routes the reconnect to Server B → Server B has no in-memory document state → client sees a blank doc.
Solution:
Enable sticky sessions (session affinity) using client IP or JWT cookie hash
- Configure the load balancer for affinity (e.g., ALB target-group stickiness via cookie, or source-IP affinity on an NLB) so each client keeps hitting the same server.
- Alternatively, use Redis-backed shared session store and make servers stateless (clients reconnect to any server, which fetches current state from Redis/PostgreSQL).
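A sketch of the stateless variant with the `ws` library: on connect, any server rehydrates the doc from the shared store (the Redis key follows the `doc:{id}:state` convention used elsewhere; the `?doc=` query parameter is an assumption):

```ts
import { WebSocketServer } from "ws";
import Redis from "ioredis";
import * as Y from "yjs";

const wss = new WebSocketServer({ port: 8080 });
const redis = new Redis();

wss.on("connection", async (socket, req) => {
  // e.g. wss://host/sync?doc=123
  const docId = new URL(req.url ?? "/", "http://placeholder").searchParams.get("doc");
  if (!docId) return socket.close();

  // Rehydrate from shared state, so ANY server can serve ANY client.
  const doc = new Y.Doc();
  const saved = await redis.getBuffer(`doc:${docId}:state`);
  if (saved) Y.applyUpdate(doc, new Uint8Array(saved));

  // Send the current state; the client merges it into its local replica.
  socket.send(Y.encodeStateAsUpdate(doc));
});
```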
Trade-offs:
- ✅ Simple: sticky sessions work well for WebSockets.
- ❌ Reduces load balancing fairness — one server may get overloaded.
- ✅ Better: Use stateless servers + Redis pub/sub → any server can handle any client → scales better long-term.
🚨 7. Document Partitioning by Organization ID → Hot Partitions & Single Points of Failure
Issue:
- Partitioning by `org_id` assumes even distribution.
- Large orgs (e.g., Google, Apple) generate massive document and edit volume, causing:
- Single PostgreSQL partition to become a hotspot (high read/write load).
- Single point of failure for entire org’s editing.
- Read replicas won’t help if writes are concentrated.
Failure Mode:
- Org X has 10,000 users editing one doc → 10k ops/sec → PostgreSQL master throttled → latency spikes → all users in Org X experience lag.
Solution:
Partition documents by doc_id, not org_id — use sharding + document-level isolation
- Each document is its own shard → even if one org has 1000 docs, load is distributed.
- Use consistent hashing to map `doc_id` → shard.
- Use PostgreSQL partitioning or CockroachDB/Amazon Aurora for automatic sharding.
- For massive docs (>100MB), split into chunks (e.g., sections) — each chunk is a separate CRDT.
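A sketch of deterministic doc-to-shard routing; the shard list is hypothetical, and a production setup would use a proper hash ring with virtual nodes so that adding a shard remaps only a small fraction of docs:

```ts
import { createHash } from "crypto";

const SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]; // hypothetical shard names

// Stable hash → deterministic routing: every server maps the same
// doc_id to the same shard without coordination.
function shardFor(docId: string): string {
  const digest = createHash("sha256").update(docId).digest();
  return SHARDS[digest.readUInt32BE(0) % SHARDS.length];
}

console.log(shardFor("doc-123")); // always the same shard for this doc
```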
Trade-offs:
- ✅ Scales horizontally with number of docs, not users/orgs.
- ❌ More complex routing: must know which shard a doc is on before querying.
- ✅ Bonus: Use a caching layer per doc in Redis (e.g., `doc:123:state`) for read-heavy docs.
🚨 8. No Monitoring, Retry, or Backpressure → System Degrades Silently
Issue:
- No mention of:
- Retries for WebSocket disconnections.
- Backpressure on high write loads.
- Monitoring (latency, error rates, queue depth).
- Dead-letter queues for failed ops.
Failure Mode:
- PostgreSQL goes down for 10s → WebSocket clients keep sending ops → server queues fill → OOM crash.
- Client disconnects → edits lost.
- No alerting → outage goes unnoticed for hours.
Solution:
Implement:
- Retry with exponential backoff on WebSocket reconnect.
- Client-side op queue: if disconnected, buffer ops locally, replay on reconnect (see the sketch after this list).
- Server-side op rate limiting per doc (e.g., max 100 ops/sec per doc).
- Kafka or Redis Streams as buffer between WebSocket server and DB writer.
- Metrics + Alerts: Prometheus/Grafana for:
- WebSocket connection count per server
- DB write latency
- Redis pub/sub backlog
- CRDT op queue depth
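A browser-side sketch of the first two items above, with jittered exponential backoff on reconnect and a local queue that replays buffered ops (endpoint and message format are assumptions):

```ts
let socket: WebSocket | null = null;
let attempt = 0;
const pending: unknown[] = [];      // ops buffered while disconnected

function connect(url: string): void {
  socket = new WebSocket(url);
  socket.onopen = () => {
    attempt = 0;                    // reset backoff after a successful connect
    while (pending.length > 0) socket!.send(JSON.stringify(pending.shift()));
  };
  socket.onclose = () => {
    // Jittered exponential backoff, capped at 30s.
    const delay = Math.min(30_000, 1_000 * 2 ** attempt++) * (0.5 + Math.random() / 2);
    setTimeout(() => connect(url), delay);
  };
}

function sendOp(op: unknown): void {
  if (socket && socket.readyState === WebSocket.OPEN) socket.send(JSON.stringify(op));
  else pending.push(op);            // queue locally, replay on reconnect
}

connect("wss://example.com/sync");  // placeholder endpoint
```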
Trade-offs:
- ✅ Resilient to transient failures.
- ✅ Better UX: edits survive network hiccups.
- ❌ Client becomes more complex (local state management).
- ❌ Infrastructure cost (Kafka/Redis Streams).
✅ Summary: Recommended Architecture Upgrades
| Problem Area | Recommended Fix | Key Trade-off |
|---|---|---|
| Conflict Resolution | Replace LWW with CRDTs (Yjs/Automerge) | Higher complexity, migration cost |
| Cross-Server Sync | Use Redis Pub/Sub for ops, eliminate polling | Adds Redis dependency |
| Storage | Store CRDT ops, not HTML snapshots | Need to rebuild UI from ops |
| Auth | HTTP-only cookies, short-lived tokens, revocation list | CSRF protection needed |
| CDN Caching | Disable caching for /api/doc/* | Higher origin load |
| Load Balancing | Sticky sessions OR stateless + Redis state | Simplicity vs. scalability |
| Sharding | Shard by doc_id, not org_id | Routing complexity |
| Resilience | Client op queue + retry + Kafka buffer + monitoring | Client/server complexity |
💡 Final Recommendation: Adopt a Proven Stack
Instead of rolling your own, consider:
- Frontend: Yjs + WebRTC/WebSocket + React
- Backend: Node.js + Express + Redis Pub/Sub + PostgreSQL (with JSONB ops)
- Storage: CRDT ops stored as `JSONB` in PostgreSQL, with batched writes
- Auth: Auth0/Supabase or custom JWT + HTTP-only cookies
- Deployment: Kubernetes with horizontal pod autoscaling, Redis + PostgreSQL on managed services (AWS RDS, ElastiCache)
- Monitoring: Prometheus + Grafana + Loki
Real-world examples: Notion, Coda, and Slate use CRDTs or OT. Google Docs uses OT under the hood. LWW + snapshots is for simple apps like note-taking — not real-time collaboration.
This architecture is not fit for production as-is. With the above fixes, it can become scalable, consistent, and resilient — but requires non-trivial engineering effort. Prioritize CRDTs + Redis pub/sub + HTTP-only auth as your top 3 fixes.