This architecture contains several critical flaws that will lead to data loss, poor user experience, and system instability at scale. The "Last-write-wins" strategy combined with client-side timestamps is particularly dangerous for collaborative editing.
Here is a detailed breakdown of failure modes, race conditions, and bottlenecks, organized by domain, with specific solutions and trade-offs.
1. Data Consistency & Conflict Resolution (Critical)
Issue A: Client-Side Timestamps for Conflict Resolution
- Failure Mode: Client clocks are not synchronized. User A (clock 5s fast) types at real time T and reports T+5. User B (clock 5s slow) types at T+2 and reports T-3. User A's change carries the later timestamp and overwrites User B's change, even though B typed last.
- Race Condition: Two users edit the same character simultaneously. The server accepts the one with the higher timestamp, silently deleting the other user's input.
- Solution: Server-Authoritative Timestamps & CRDTs/OT.
- Stop trusting client clocks. Use server-side logical clocks (e.g., Lamport timestamps) or Vector Clocks.
- Replace "Last-Write-Wins" with CRDTs (Conflict-free Replicated Data Types, e.g., Yjs, Automerge) or Operational Transformation (OT, e.g., ShareDB).
- Trade-off:
- Pros: Guarantees eventual consistency; no data loss; true collaboration.
- Cons: High engineering complexity; requires client-side library integration; higher CPU usage on server to merge operations.
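The server-side logical clock suggested above can be sketched in a few lines. This is a minimal Lamport clock; the class and method names are illustrative, not from any particular library:

```javascript
// Sketch: a server-side Lamport clock for ordering edit operations.
class LamportClock {
  constructor() {
    this.time = 0;
  }
  // Local event (e.g., the server accepts an edit from a client).
  tick() {
    this.time += 1;
    return this.time;
  }
  // An operation arrives carrying a remote logical timestamp:
  // jump past it so causal order is preserved.
  receive(remoteTime) {
    this.time = Math.max(this.time, remoteTime) + 1;
    return this.time;
  }
}

const clock = new LamportClock();
clock.tick();        // 1
clock.receive(10);   // 11: jumps past the remote clock
clock.tick();        // 12
```

Because the server stamps every accepted operation itself, two clients with skewed wall clocks can no longer "win" a conflict by reporting a future time.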
Issue B: Coarse-Grained Conflict Resolution (Paragraph Level)
- Failure Mode: If two users edit the same paragraph, the entire paragraph is overwritten by the "winner."
- Race Condition: User A changes the first sentence; User B changes the last sentence. User A's timestamp wins, and User B's changes to the last sentence are lost.
- Solution: Fine-Grained Data Structure.
- Store documents as a sequence of characters or rich-text blocks (e.g., ProseMirror JSON), not HTML paragraphs.
- Apply CRDT/OT logic at the character or node level.
- Trade-off:
- Pros: Seamless merging of concurrent edits.
- Cons: Requires significant refactoring of the frontend editor and backend storage schema.
Issue C: Database Replication Lag
- Failure Mode: User writes to Primary DB. Load balancer routes next read to Read Replica. Replica hasn't synced yet. User sees "old" version of document immediately after saving.
- Race Condition: "Read-your-writes" consistency violation.
- Solution: Sticky Reads or Transactional Consistency.
- Route "read after write" requests to the Primary DB for a short window (e.g., 5 seconds).
- Or, use a session token to force reads from the Primary until `last_write_time + latency_buffer`.
- Trade-off:
- Pros: Ensures users see their own edits immediately.
- Cons: Increases load on the Primary DB, reducing the effectiveness of Read Replicas.
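The sticky-read window above can be sketched as a small routing helper. The session object and 5-second window are assumptions for illustration:

```javascript
// Sketch: route reads to the primary DB for a short window after a user's
// last write, otherwise to a replica ("read-your-writes" consistency).
const READ_YOUR_WRITES_WINDOW_MS = 5000;

function recordWrite(session, now = Date.now()) {
  session.lastWriteTime = now;
}

function pickDatabase(session, now = Date.now()) {
  const lastWrite = session.lastWriteTime ?? 0;
  return now - lastWrite < READ_YOUR_WRITES_WINDOW_MS ? "primary" : "replica";
}

const session = {};
recordWrite(session, 1000);
pickDatabase(session, 2000); // "primary" (within the 5s window)
pickDatabase(session, 7000); // "replica" (window expired)
```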
2. Real-Time Architecture & Latency
Issue D: Server Polling (The "Thundering Herd" on DB)
- Failure Mode: Every API server polls PostgreSQL every 2 seconds. With 10 servers, that's 5 queries/sec just for polling; at 100 servers, 50 queries/sec. The load grows linearly with fleet size and competes with write traffic for connections.
- Bottleneck: PostgreSQL connection pool exhaustion; high CPU on the DB from constant `SELECT` queries.
- Latency: Up to 2 seconds delay for updates to propagate between servers.
- Solution: Redis Pub/Sub or Change Data Capture (CDC).
- When Server A writes to Postgres, it publishes an event to a Redis channel.
- All other API servers subscribe to this channel and push updates to their connected clients immediately.
- Trade-off:
- Pros: Near real-time propagation; eliminates DB polling load.
- Cons: Adds dependency on Redis; requires handling message ordering/delivery guarantees.
Issue E: WebSocket Connection State & Load Balancing
- Failure Mode: Load Balancer uses Round-Robin. User connects to Server A. User refreshes page. LB sends request to Server B. Server B has no WebSocket state for this user.
- Failure Mode: Server A crashes. Users connected to it lose connection. They reconnect, but if LB sends them to Server B, Server B doesn't know their session.
- Solution: Sticky Sessions or Externalized Session State.
- Option 1: Enable Sticky Sessions on the Load Balancer (Session Affinity).
- Option 2: Use a WebSocket adapter (e.g., Socket.io Redis Adapter) to share connection state across servers.
- Trade-off:
- Option 1 Pros: Simple.
- Option 1 Cons: Uneven load distribution; if Server A dies, all its clients disconnect.
- Option 2 Pros: High availability; clients can reconnect to any server.
- Option 2 Cons: Redis becomes a bottleneck for connection metadata.
3. Security & Authentication
Issue F: JWT in LocalStorage
- Failure Mode: XSS vulnerability on the React SPA allows an attacker to read `localStorage`. They steal the JWT and impersonate the user for 24 hours.
- Bottleneck: None, but high risk.
- Solution: HttpOnly, Secure, SameSite Cookies.
- Move JWT to an HttpOnly cookie. JavaScript cannot access it, mitigating XSS token theft.
- Implement CSRF protection (double submit cookie or token in header) if using cookies.
- Trade-off:
- Pros: Significant security improvement.
- Cons: Requires backend changes to handle cookies; CSRF protection adds slight implementation overhead.
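The cookie attributes involved can be sketched as a plain header builder. The helper name and 15-minute expiry are assumptions; with Express you would pass the equivalent options to `res.cookie()`:

```javascript
// Sketch: assemble the Set-Cookie attributes that keep the JWT out of
// JavaScript's reach. Illustrative helper, not a library API.
function buildAuthCookie(jwt, maxAgeSeconds = 900) {
  return [
    `token=${jwt}`,
    "HttpOnly",        // not readable via document.cookie (blocks XSS theft)
    "Secure",          // only sent over HTTPS
    "SameSite=Strict", // not sent on cross-site requests (CSRF mitigation)
    "Path=/",
    `Max-Age=${maxAgeSeconds}`,
  ].join("; ");
}

buildAuthCookie("eyJhbGciOi...", 900);
// "token=eyJhbGciOi...; HttpOnly; Secure; SameSite=Strict; Path=/; Max-Age=900"
```

`SameSite=Strict` alone is not a complete CSRF defense for every flow, which is why the header-token or double-submit pattern above is still recommended.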
Issue G: CDN Caching API Responses
- Failure Mode: CloudFront caches API responses for 5 minutes. User A edits doc. User B requests doc. CloudFront serves stale cached version.
- Race Condition: Data staleness. In a collaborative editor, 5 minutes of latency is unacceptable.
- Solution: Cache Busting / No-Cache Headers.
- Set `Cache-Control: no-store` for all document content API endpoints.
- Only cache static assets (JS/CSS/Images).
- Trade-off:
- Pros: Ensures data freshness.
- Cons: Increases traffic to Origin (API servers/DB). Mitigate with aggressive DB read caching (Redis).
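The split between "never cache" and "cache forever" can be sketched as a path-based policy. The `/api/` prefix is an assumption about this system's routing:

```javascript
// Sketch: pick a Cache-Control policy by path, assuming /api/* serves live
// document content and everything else is a fingerprinted static asset.
function cacheControlFor(path) {
  if (path.startsWith("/api/")) {
    return "no-store"; // document content must never be cached by the CDN
  }
  return "public, max-age=31536000, immutable"; // hashed JS/CSS/images
}

cacheControlFor("/api/documents/42"); // "no-store"
cacheControlFor("/static/app.9f3a.js"); // "public, max-age=31536000, immutable"
```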
4. Storage & Persistence
Issue H: Full HTML Snapshots Every 30 Seconds
- Failure Mode: High storage costs and slow recovery. A 1 MB document saved every 30s produces 2,880 snapshots (~2.9 GB) per day per continuously edited document.
- Bottleneck: Disk I/O on Postgres (BLOB storage); slow to load version history.
- Solution: Operational Logs + Checkpoints.
- Store a stream of Operations (JSON deltas) in a time-series DB or append-only log.
- Take "snapshots" (state) every 5-10 minutes. To restore, load snapshot + replay ops.
- Trade-off:
- Pros: Efficient storage; easier to implement version history/diffs.
- Cons: Complex logic to reconstruct state from logs; requires version compaction logic.
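The snapshot-plus-replay restore path above can be sketched with a toy op format. Ops are simplified here to whole-text "append" deltas; a real system would store CRDT/OT operations:

```javascript
// Sketch: restore document state from the latest checkpoint plus the ops
// logged after it. Field names (seq, state) are illustrative.
function restore(checkpoint, opsLog) {
  let state = checkpoint.state;
  for (const op of opsLog) {
    if (op.seq <= checkpoint.seq) continue; // already folded into snapshot
    if (op.type === "append") state += op.text;
  }
  return state;
}

const checkpoint = { seq: 2, state: "Hello" };
const opsLog = [
  { seq: 1, type: "append", text: "Hel" },    // before the checkpoint
  { seq: 2, type: "append", text: "lo" },     // before the checkpoint
  { seq: 3, type: "append", text: ", world" } // replayed on restore
];
restore(checkpoint, opsLog); // "Hello, world"
```

The compaction job mentioned in the cons is essentially the inverse: once a checkpoint at `seq` N is durable, ops with `seq <= N` can be archived or deleted.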
Issue I: Document Partitioning by Organization ID
- Failure Mode: "Noisy Neighbor" / Hot Partition. If Organization X is huge, all their data lands on the same DB shard. That shard becomes a bottleneck while others are idle.
- Bottleneck: Single shard write throughput limit.
- Solution: Hash-Based Sharding.
- Shard by `Document_ID` (hashed) rather than `Organization_ID`.
- Or use a managed sharding service (e.g., Citus for Postgres).
- Trade-off:
- Pros: Even data distribution across shards.
- Cons: Organization-scoped queries (e.g., "list all documents in Org X" or a Global Admin View) now fan out across shards, requiring scatter-gather aggregation or a secondary index.
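Hash-based shard routing can be sketched with any deterministic string hash; djb2 is used here purely for illustration (a production system would typically use consistent hashing so shards can be added without remapping everything):

```javascript
// Sketch: pick a shard by hashing the document ID so one large
// organization's documents spread evenly across all shards.
function shardFor(documentId, shardCount) {
  let hash = 5381;
  for (const ch of documentId) {
    hash = (hash * 33 + ch.charCodeAt(0)) >>> 0; // djb2, kept in uint32
  }
  return hash % shardCount;
}

shardFor("doc-123", 8); // deterministic: the same doc always maps to one shard
```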
5. Summary of Critical Fixes
| Priority | Component | Problem | Immediate Fix |
|---|---|---|---|
| P0 (Critical) | Sync Strategy | Client Timestamps + LWW | Switch to Server Timestamps + CRDT/OT |
| P0 (Critical) | Real-Time | Polling Postgres every 2s | Implement Redis Pub/Sub |
| P0 (Critical) | Security | JWT in LocalStorage | Move to HttpOnly Cookies |
| P1 (High) | CDN | Caching API responses | Disable caching for document endpoints |
| P1 (High) | DB | Read Replicas Lag | Route "read-after-write" to Primary |
| P2 (Medium) | Storage | Full HTML Snapshots | Switch to Ops Log + Periodic Checkpoints |
Recommended Architecture Revision
- Frontend: React + Yjs (for CRDT handling) + WebSocket.
- Auth: JWT in HttpOnly Cookie (shorter expiry, refresh tokens).
- Backend: Node.js + Socket.io (with Redis Adapter for cross-server communication).
- DB: Postgres (Primary) + Redis (Pub/Sub + Cache).
- Sync: Client sends Operations → Server validates & timestamps → Server broadcasts via Redis → Server persists Ops to DB.
- Caching: CloudFront for static assets only; Redis for hot document state.
This revision moves the system from a fragile, polling-based architecture to an event-driven, consistency-guaranteed model suitable for real-time collaboration.