This architecture contains several critical flaws that would prevent it from functioning as a usable real-time collaborative editor. While it resembles a standard CRUD application, real-time collaboration imposes requirements on concurrency, state, and latency that this design does not meet.
Here is the breakdown of failure modes, race conditions, and bottlenecks, categorized by domain.
1. Data Consistency & Sync Strategy
Issue: Client-Side Timestamps for Last-Write-Wins (LWW)
- Failure Mode: Clock skew and malicious clients.
- Why it fails: Client clocks are not synchronized. If User A's clock is 5 minutes behind User B's, User A's edits will always be overwritten by User B's, even if User A edited after User B. Additionally, a malicious user can manipulate their system clock to dominate the document.
- Race Condition: Two users edit the same text near-simultaneously. User B edits first, but B's fast clock stamps the change at T=100; User A edits a moment later, but A's slow clock stamps it at T=99. LWW keeps the higher timestamp, so B's earlier change overwrites A's later one.
- Solution: Use Server-Side Timestamps or Logical Clocks (Vector Clocks/Lamport Timestamps). Better yet, abandon LWW for text and implement CRDTs (Conflict-free Replicated Data Types) or OT (Operational Transformation).
- Trade-off: CRDTs/OT add significant implementation complexity and memory overhead compared to simple string overwrites. Server timestamps require tight clock synchronization (NTP) across backend nodes but remove the need to trust client clocks.
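A minimal sketch of the logical-clock idea in TypeScript: a Lamport counter gives operations a total order without trusting client wall clocks. The names (LamportClock, StampedOp) are illustrative, not part of the original design, and a production editor would layer OT/CRDT merge logic on top of this ordering.

```typescript
// Lamport clock sketch: orders operations without trusting client wall clocks.
interface StampedOp {
  counter: number;   // Lamport counter assigned by the sender
  clientId: string;  // tie-breaker so the ordering is total
  payload: string;
}

class LamportClock {
  private counter = 0;

  // Call before sending a local operation.
  tick(): number {
    return ++this.counter;
  }

  // Call when receiving a remote operation: jump past its counter.
  receive(remoteCounter: number): void {
    this.counter = Math.max(this.counter, remoteCounter) + 1;
  }
}

// Deterministic total order: compare counters, break ties by client id.
function compareOps(a: StampedOp, b: StampedOp): number {
  return a.counter - b.counter || a.clientId.localeCompare(b.clientId);
}
```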
Issue: Cross-Server Polling (2-Second Delay)
- Failure Mode: High latency and "Text Jumping."
- Why it fails: If User A is on Server 1 and User B is on Server 2, User B will not see User A's changes for up to 2 seconds. In a typing scenario, this causes confusing UI behavior where text appears to rewind or jump.
- Scaling Bottleneck: If you have 100 API servers, that is 100 servers polling the database every 2 seconds. This creates a "thundering herd" problem on the DB read IOPS, regardless of actual user activity.
- Solution: Implement Redis Pub/Sub. When Server 1 receives a change, it publishes to a Redis channel. Server 2 subscribes to that channel and pushes the update to its connected clients immediately (sub-100ms).
- Trade-off: Adds infrastructure dependency on Redis availability. If Redis goes down, cross-server sync breaks (though single-server sync remains).
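A minimal fan-out sketch, assuming ioredis and Socket.io (neither is mandated by the original design); the doc-ops channel, event names, and payload shape are assumptions for illustration.

```typescript
import Redis from "ioredis";
import { Server } from "socket.io";

const pub = new Redis();
const sub = new Redis(); // a subscribed connection cannot also publish
const io = new Server(3000);

// Every API server subscribes; whichever server receives an edit publishes it.
sub.subscribe("doc-ops");
sub.on("message", (_channel, message) => {
  const op = JSON.parse(message);
  // Relay to clients on this server that have joined the document's room.
  io.to(op.docId).emit("remote-op", op);
});

io.on("connection", (socket) => {
  socket.on("join", (docId: string) => socket.join(docId));
  socket.on("local-op", (op: { docId: string }) => {
    // Cross-server propagation happens here instead of via a 2-second DB poll.
    pub.publish("doc-ops", JSON.stringify(op));
  });
});
```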
Issue: Destructive Conflict Resolution (Paragraph Level)
- Failure Mode: Data Loss.
- Why it fails: LWW on a "paragraph" level is too coarse. If User A adds a sentence to Paragraph 1 and User B deletes Paragraph 1 simultaneously, User B's delete wins, and User A's work is lost entirely.
- Solution: Move to Operation-Based Sync. Store edits as operations (e.g., insert at index 5, delete 3 chars) rather than state snapshots. Apply operations sequentially.
- Trade-off: Requires maintaining an operation log (event sourcing) which grows indefinitely unless compacted. Replaying history for new clients takes more CPU.
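A sketch of what an operation log looks like for plain text; a real editor would use an OT/CRDT library (e.g., Yjs or ShareDB) rather than this hand-rolled form, and the type names are illustrative only.

```typescript
type TextOp =
  | { kind: "insert"; index: number; text: string }
  | { kind: "delete"; index: number; length: number };

function applyOp(doc: string, op: TextOp): string {
  if (op.kind === "insert") {
    return doc.slice(0, op.index) + op.text + doc.slice(op.index);
  }
  return doc.slice(0, op.index) + doc.slice(op.index + op.length);
}

// Operations are applied in sequence, so a concurrent "add a sentence" and
// "delete the paragraph" both remain in the log instead of one silently winning.
function replay(base: string, ops: TextOp[]): string {
  return ops.reduce(applyOp, base);
}
```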
2. Database & Storage Performance
Issue: Synchronous DB Writes on Every Change
- Failure Mode: Database Connection Exhaustion & High Latency.
- Why it fails: Writing to PostgreSQL for every keystroke/change event will saturate the DB connection pool and disk IOPS. Typing speed (e.g., 5 chars/sec per user) × concurrent users quickly exceeds the write throughput of a standard RDS instance.
- Scaling Bottleneck: The DB becomes the hard limit on concurrency. You cannot scale API servers if the DB chokes on writes.
- Solution: Write-Behind Caching. Store operations in Redis (in-memory) first. Acknowledge the client immediately. Batch-write to PostgreSQL asynchronously (e.g., every 1 second or every 50 ops).
- Trade-off: Risk of data loss if the server crashes between the Redis write and the Postgres flush. Requires durable buffering (Redis AOF persistence or a queue like Kafka) to close that window.
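A write-behind sketch, assuming ioredis and node-postgres (pg); the op-buffer key, the doc_ops table, and the batch size and interval are illustrative assumptions, not part of the original design.

```typescript
import Redis from "ioredis";
import { Pool } from "pg";

const redis = new Redis();
const pg = new Pool();

async function flushOnce(): Promise<void> {
  // LPOP with a count requires Redis >= 6.2; otherwise use LRANGE + LTRIM.
  const batch = await redis.lpop("op-buffer", 50);
  if (!batch || batch.length === 0) return;

  const client = await pg.connect();
  try {
    // One transaction per batch instead of one commit per keystroke.
    await client.query("BEGIN");
    for (const raw of batch) {
      const { docId, op } = JSON.parse(raw);
      await client.query("INSERT INTO doc_ops (doc_id, op) VALUES ($1, $2)", [docId, op]);
    }
    await client.query("COMMIT");
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}

// Flush roughly once per second, independent of typing rate.
setInterval(() => flushOnce().catch(console.error), 1000);
```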
Issue: Full HTML Snapshots
- Failure Mode: Storage Bloat & Merge Conflicts.
- Why it fails: Storing full HTML every 30 seconds makes version history massive. It also makes merging difficult because you don't know what changed, only the before/after state.
- Scaling Bottleneck: Retrieving a document requires loading a large HTML blob. Bandwidth costs increase.
- Solution: Store a Delta/Operation Log in the DB. Generate snapshots periodically (e.g., every 5 minutes) for quick loading, but rely on the log for sync.
- Trade-off: Reconstructing the document state from a log requires more CPU on read. Requires migration logic to handle schema changes in the operation format.
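A sketch of the read path under this model, assuming node-postgres and hypothetical doc_snapshots / doc_ops tables: load the latest snapshot, then replay only the operations recorded after it.

```typescript
import { Pool } from "pg";

type TextOp =
  | { kind: "insert"; index: number; text: string }
  | { kind: "delete"; index: number; length: number };

// Same reducer as in the operation-log sketch above.
const applyOp = (doc: string, op: TextOp): string =>
  op.kind === "insert"
    ? doc.slice(0, op.index) + op.text + doc.slice(op.index)
    : doc.slice(0, op.index) + doc.slice(op.index + op.length);

const pg = new Pool();

async function loadDocument(docId: string): Promise<string> {
  const snap = await pg.query(
    "SELECT content, last_op_seq FROM doc_snapshots WHERE doc_id = $1 ORDER BY last_op_seq DESC LIMIT 1",
    [docId]
  );
  const base: string = snap.rows[0]?.content ?? "";
  const since: number = snap.rows[0]?.last_op_seq ?? 0;

  // Only the tail of the log is replayed, keeping load-time CPU bounded.
  const ops = await pg.query(
    "SELECT op FROM doc_ops WHERE doc_id = $1 AND seq > $2 ORDER BY seq",
    [docId, since]
  );
  return ops.rows.reduce((doc: string, row: { op: TextOp }) => applyOp(doc, row.op), base);
}
```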
3. Infrastructure & Networking
Issue: Load Balancer Round-Robin with WebSockets
- Failure Mode: Connection Drops & Session Loss.
- Why it fails: WebSockets are long-lived TCP connections. If a client reconnects (network blip) and the LB sends them to a different server, the new server doesn't have their socket context or room subscription.
- Scaling Bottleneck: Stateful connections make horizontal scaling difficult. You cannot simply kill a server to scale down without disconnecting users.
- Solution: Enable Sticky Sessions (Session Affinity) on the Load Balancer based on a cookie or IP. Alternatively, use a Centralized WebSocket Gateway (e.g., Socket.io with Redis Adapter) where API servers are stateless workers.
- Trade-off: Sticky sessions can lead to uneven load distribution (some servers hot, some cold). Centralized gateway adds a network hop and a single point of failure (mitigated by clustering).
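A sketch of the gateway option using the Socket.io Redis adapter (named in the solution above; connection details are assumptions): any server can emit to a document's room even when its members are connected elsewhere, so clients no longer depend on landing on one specific machine.

```typescript
import Redis from "ioredis";
import { Server } from "socket.io";
import { createAdapter } from "@socket.io/redis-adapter";

const pubClient = new Redis();
const subClient = pubClient.duplicate();

const io = new Server(3000);
// Route room broadcasts through Redis so every server in the cluster sees them.
io.adapter(createAdapter(pubClient, subClient));

io.on("connection", (socket) => {
  socket.on("join", (docId: string) => socket.join(docId));
  // io.to(docId).emit(...) now reaches sockets connected to any server.
});
```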
Issue: CDN Caching API Responses
- Failure Mode: Data Staleness & Security Leak.
- Why it fails: Caching API responses (document content) for 5 minutes means users will see stale data upon initial load. Worse, if the cache key isn't perfectly unique per user/session, User A might receive User B's cached document from CloudFront.
- Security Risk: Sensitive document data stored on edge nodes potentially accessible by the wrong tenant.
- Solution: Disable CDN Caching for Dynamic API Routes. Use the CDN only for static assets (JS, CSS, Images). Set Cache-Control: no-store for document API endpoints.
- Trade-off: Increased load on the origin server for document fetches. Increased latency for the initial document load for users far from the origin region.
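A sketch of the header fix, assuming an Express backend (the original design does not name the framework); the route paths are illustrative.

```typescript
import express from "express";

const app = express();

// Document API responses must never be cached by the CDN or the browser.
app.use("/api/documents", (_req, res, next) => {
  res.set("Cache-Control", "no-store");
  next();
});

// Static assets keep long-lived, immutable caching and stay on the CDN.
app.use(express.static("public", { maxAge: "1y", immutable: true }));

app.listen(3000);
```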
4. Security & Authentication
Issue: JWT in LocalStorage
- Failure Mode: XSS (Cross-Site Scripting) Token Theft.
- Why it fails: Any third-party script injected into the React SPA (via a vulnerable dependency) can read localStorage and steal the JWT. The attacker can then impersonate the user for 24 hours.
- Solution: Store JWT in HttpOnly, Secure, SameSite Cookies. The frontend cannot read this via JS, preventing XSS theft.
- Trade-off: More complex CSRF (Cross-Site Request Forgery) protection is required (though SameSite cookies mitigate most of this). Requires the backend to handle cookie parsing instead of header parsing.
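A sketch of issuing the token as a cookie, assuming Express and jsonwebtoken; the cookie name, route, and expiry are illustrative.

```typescript
import express from "express";
import jwt from "jsonwebtoken";

const app = express();
app.use(express.json());

app.post("/api/login", (req, res) => {
  // Credential verification for req.body is omitted in this sketch.
  const token = jwt.sign({ sub: req.body.userId }, process.env.JWT_SECRET!, {
    expiresIn: "15m", // short-lived; see the refresh-token issue below
  });
  res.cookie("access_token", token, {
    httpOnly: true,  // not readable from document.cookie, so XSS cannot lift it
    secure: true,    // sent over HTTPS only
    sameSite: "lax", // blocks most cross-site request forgery
    maxAge: 15 * 60 * 1000,
  });
  res.sendStatus(204);
});
```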
Issue: 24-Hour Token Expiry
- Failure Mode: Extended Compromise Window.
- Why it fails: If a token is stolen, the attacker has access for a full day. There is no mechanism to revoke access immediately (e.g., if a user is fired or suspicious activity is detected).
- Solution: Implement Short-lived Access Tokens (15 mins) + Long-lived Refresh Tokens. Store a revocation list (or use Redis) for refresh tokens.
- Trade-off: Increased complexity in the auth flow (token rotation). Slight latency hit when refreshing tokens.
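A sketch of refresh-token rotation with a Redis revocation check, assuming Express, cookie-parser, jsonwebtoken, and ioredis; the endpoint, cookie, and key names are illustrative.

```typescript
import express from "express";
import cookieParser from "cookie-parser";
import jwt from "jsonwebtoken";
import Redis from "ioredis";
import { randomUUID } from "crypto";

const app = express();
app.use(cookieParser());
const redis = new Redis();
const cookieOpts = { httpOnly: true, secure: true, sameSite: "lax" as const };

app.post("/api/token/refresh", async (req, res) => {
  const oldToken = req.cookies.refresh_token;
  // Reject revoked tokens (user offboarded, suspicious activity, logout).
  if (!oldToken || (await redis.sismember("revoked_refresh", oldToken))) {
    return res.sendStatus(401);
  }
  let claims: { sub: string };
  try {
    claims = jwt.verify(oldToken, process.env.REFRESH_SECRET!) as { sub: string };
  } catch {
    return res.sendStatus(401);
  }

  // Rotate: revoke the used refresh token, then issue a fresh pair.
  // (A real system would expire entries from this set rather than grow it forever.)
  await redis.sadd("revoked_refresh", oldToken);
  const access = jwt.sign({ sub: claims.sub }, process.env.JWT_SECRET!, { expiresIn: "15m" });
  const refresh = jwt.sign({ sub: claims.sub, jti: randomUUID() }, process.env.REFRESH_SECRET!, { expiresIn: "30d" });
  res.cookie("access_token", access, { ...cookieOpts, maxAge: 15 * 60 * 1000 });
  res.cookie("refresh_token", refresh, { ...cookieOpts, maxAge: 30 * 24 * 60 * 60 * 1000 });
  res.sendStatus(204);
});
```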
Summary of Critical Fixes (Priority Order)
- Sync Architecture: Replace DB Polling with Redis Pub/Sub for cross-server messaging. (Critical for functionality).
- Conflict Resolution: Replace LWW/Client Clocks with Server Timestamps + OT/CRDT. (Critical for data integrity).
- DB Write Path: Implement Redis Buffering + Batch Writes to Postgres. (Critical for survival under load).
- Security: Move JWT to HttpOnly Cookies and disable CDN Caching on APIs. (Critical for security).
- Load Balancing: Enable Sticky Sessions for WebSocket continuity. (Critical for user experience).
Revised Data Flow Recommendation
- User types → Change event sent via WebSocket.
- Server validates Auth (Cookie) → Pushes Operation to Redis (Pub/Sub + Queue).
- Server acknowledges client immediately (Optimistic UI).
- Redis broadcasts operation to all other API servers.
- All servers push operation to their connected clients.
- Background worker batches operations from Redis and flushes to PostgreSQL (Append-only log).
- Snapshot service runs periodically to compress log into a state snapshot for fast loading.
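A client-side sketch of steps 1-3 in the flow above, assuming socket.io-client; the event names and the stub editor functions are illustrative.

```typescript
import { io } from "socket.io-client";

type Op =
  | { kind: "insert"; index: number; text: string }
  | { kind: "delete"; index: number; length: number };

// Stand-ins for the real editor integration.
const applyToEditor = (op: Op) => console.log("apply", op);
const markConfirmed = (op: Op, seq: number) => console.log("confirmed", seq, op);

// Auth rides along as the HttpOnly cookie (withCredentials), not a JS-readable token.
const socket = io("https://app.example.com", { withCredentials: true });
socket.emit("join", "doc-123");

export function onLocalEdit(docId: string, op: Op): void {
  applyToEditor(op); // optimistic: render before the round trip completes
  socket.emit("local-op", { docId, op }, (ack: { seq: number }) => {
    markConfirmed(op, ack.seq); // server acknowledged and assigned an order
  });
}

// Remote operations broadcast from other servers via Redis arrive here.
socket.on("remote-op", (msg: { docId: string; op: Op }) => applyToEditor(msg.op));
```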