This architecture is a common "MVP" design, but it contains several critical flaws that would prevent it from functioning as a true real-time collaborative editor (like Google Docs) under load. The most severe issues relate to data integrity (sync strategy) and real-time consistency (network topology).
Here is a detailed breakdown of the failure modes, race conditions, and bottlenecks, with specific solutions and trade-offs.
1. Critical Failure Modes & Race Conditions
A. Client-Side Timestamps for Conflict Resolution (Fatal)
- Issue: The strategy relies on "client clocks" for Last-Write-Wins (LWW). Client clocks are unreliable due to NTP skew, time zone changes, or manual time adjustments. If User A's clock reads 10:00:01 and User B's reads 10:00:00 when they type simultaneously, LWW keeps User A's write and silently discards User B's, even if User B's edit actually happened later in real time.
- Race Condition: Simultaneous edits to the same index result in data loss.
- Solution: Operational Transformation (OT) or Conflict-free Replicated Data Types (CRDTs).
- Implementation: Send operation indices (e.g., "insert 'x' at index 5") rather than full text. Use a logical clock (vector clock) or monotonically increasing sequence IDs assigned by the server, not the client.
- Trade-off:
- Pros: Guarantees eventual consistency without data loss; handles offline editing.
- Cons: High implementation complexity; requires a dedicated real-time synchronization protocol (e.g., Yjs, Automerge, OT).
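The core OT primitive can be sketched in a few lines: when two inserts are concurrent, each side transforms the remote operation's index against its own local operation before applying it, so both replicas converge. This is illustrative only; production systems (Yjs, Automerge, ShareDB) also handle deletes, sessions, and richer tie-breaking.

```python
# Minimal sketch of one OT primitive: transforming a concurrent
# insert against another insert so both replicas converge.

def transform_insert(op, other, op_site, other_site):
    """Shift op's index if a concurrent insert landed at or before it.

    op / other: (index, text) tuples. Site IDs break the tie when
    both inserts target the same index (lower site ID goes first).
    """
    idx, text = op
    o_idx, o_text = other
    if o_idx < idx or (o_idx == idx and other_site < op_site):
        return (idx + len(o_text), text)
    return op

def apply_insert(doc, op):
    idx, text = op
    return doc[:idx] + text + doc[idx:]

doc = "hello"
a = (5, "!")        # site 1 appends "!"
b = (0, ">> ")      # site 2 prepends ">> " concurrently

# Each replica applies its own op first, then the transformed remote op.
doc_a = apply_insert(apply_insert(doc, a), transform_insert(b, a, 2, 1))
doc_b = apply_insert(apply_insert(doc, b), transform_insert(a, b, 1, 2))
assert doc_a == doc_b == ">> hello!"
```

Note how a naive LWW merge would have kept only one of the two edits; the transform preserves both.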
B. Polling-Based Cross-Server Sync (High Latency)
- Issue: If User A connects to Server 1 and User B connects to Server 2 (Round-Robin), Server 2 will not learn about User A's changes for up to 2 seconds (the polling interval). This creates a "laggy" feel: each user sees their own keystrokes instantly, but the other user's edits arrive up to 2 seconds late.
- Race Condition: If Server 1 crashes between polls, Server 2 may have stale data.
- Solution: Redis Pub/Sub or Message Queue (Kafka/RabbitMQ) for cross-server broadcasting.
- Implementation: When Server 1 receives a change, it publishes the operation to a Redis channel. Server 2 subscribes to that channel and receives the change immediately, bypassing the DB poll.
- Trade-off:
- Pros: Sub-millisecond latency between servers; decouples servers from the database for traffic flow.
- Cons: Adds an infrastructure component (Redis cluster); requires careful handling of message ordering and deduplication.
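The fan-out pattern looks like the following sketch. An in-memory broker stands in for Redis here so the example is self-contained; in production the `subscribe`/`publish` calls would be `redis.pubsub()` subscriptions and `redis.publish()`, with one channel per document.

```python
# In-memory stand-in for Redis Pub/Sub, one channel per document.
# Names (Broker, "doc:42") are illustrative, not a real API.
from collections import defaultdict

class Broker:
    def __init__(self):
        self.subscribers = defaultdict(list)  # channel -> callbacks

    def subscribe(self, channel, callback):
        self.subscribers[channel].append(callback)

    def publish(self, channel, message):
        for cb in self.subscribers[channel]:
            cb(message)

broker = Broker()
received_on_server2 = []

# Server 2 subscribes to the document's channel when a client opens it.
broker.subscribe("doc:42", received_on_server2.append)

# Server 1 publishes the edit the moment it arrives, instead of
# waiting for Server 2's next 2-second database poll.
broker.publish("doc:42", {"op": "insert", "index": 5, "text": "x"})

assert received_on_server2 == [{"op": "insert", "index": 5, "text": "x"}]
```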
C. Round-Robin Load Balancing with Stateful WebSockets
- Issue: Round-robin LBs do not support WebSocket affinity. If a user refreshes or the connection drops, they might reconnect to a different server. That new server does not have the user's session state or the document's active lock.
- Failure Mode: Connection drop leads to reconnection to a server that thinks the user is offline, causing a "lost connection" error.
- Solution: Sticky Sessions or Shared State.
- Implementation (Sticky): Configure LB to route the same user ID to the same server until the session expires.
- Implementation (State): Store WebSocket connections in Redis (mapping UserID -> Server IP). If a user reconnects, the LB looks up the IP in Redis.
- Trade-off:
- Pros: Sticky sessions are easy to configure; Shared state allows zero-downtime server restarts.
- Cons: Sticky sessions reduce load balancing efficiency; Shared state adds Redis overhead and complexity.
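The shared-state option reduces to a small lookup, sketched below with a plain dict standing in for a Redis hash (HSET/HGET in production; all names here are illustrative).

```python
# Sketch of the "Shared State" option: a Redis-like mapping from
# user ID to the server currently holding that user's WebSocket.

registry = {}  # user_id -> server address (an HSET in real Redis)

def register_connection(user_id, server):
    registry[user_id] = server

def route_reconnect(user_id, servers):
    # Prefer the server that already holds the session state;
    # fall back to a deterministic hash for brand-new users.
    if user_id in registry:
        return registry[user_id]
    return servers[hash(user_id) % len(servers)]

register_connection("alice", "ws-server-1")

# On reconnect, Alice lands back on the server that knows her session.
assert route_reconnect("alice", ["ws-server-1", "ws-server-2"]) == "ws-server-1"
```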
D. CDN Caching API Responses (Data Staleness)
- Issue: The CDN caches API responses for 5 minutes. If User A edits a document, User B might pull the cached (old) version from CloudFront for up to 5 minutes, ignoring the real-time WebSocket update.
- Failure Mode: Users see conflicting versions of the document.
- Solution: Cache-Control Headers or Cache Invalidation.
- Implementation: Set Cache-Control: no-cache, must-revalidate for document endpoints. Alternatively, use ETags and validate against the server on every request.
- Trade-off:
- Pros: Ensures data freshness.
- Cons: Increases load on the origin API servers (no static caching benefit for dynamic content).
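The ETag variant keeps revalidation cheap: the client (or CDN) sends If-None-Match, and the origin returns 304 with no body unless the document changed. A minimal sketch, with a truncated SHA-256 as the tag (the hashing scheme is an assumption):

```python
# Sketch of ETag revalidation for a document endpoint.
import hashlib

def make_etag(body):
    return '"%s"' % hashlib.sha256(body.encode()).hexdigest()[:16]

def handle_get(body, if_none_match=None):
    etag = make_etag(body)
    headers = {"ETag": etag, "Cache-Control": "no-cache, must-revalidate"}
    if if_none_match == etag:
        return 304, headers, ""     # cached copy is still fresh, no body
    return 200, headers, body       # send the latest document

status, headers, _ = handle_get("v1 of doc")
assert status == 200
status2, _, _ = handle_get("v1 of doc", headers["ETag"])
assert status2 == 304               # unchanged: cheap revalidation
status3, _, _ = handle_get("v2 of doc", headers["ETag"])
assert status3 == 200               # edited: stale tag forces a full fetch
```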
2. Scaling Bottlenecks
A. Database Write Bottleneck
- Issue: "Server writes change to PostgreSQL" for every keystroke. Postgres is an ACID relational DB, not optimized for high-frequency small writes. With 100 users actively typing at a few keystrokes per second each, that is several hundred writes per second per document, which will saturate the primary DB quickly.
- Bottleneck: Write IOPS (Input/Output Operations Per Second) on the PostgreSQL Primary.
- Solution: Write-Through Buffering (Redis + Batch).
- Implementation: Write keystrokes to Redis (atomic lists) first. A background worker batches these writes to Postgres every 500ms or 1 second.
- Storage: Store the "current state" in a binary format or JSONB column to reduce transactional overhead.
- Trade-off:
- Pros: Reduces DB write load by factor of 10-100; improves latency for the user.
- Cons: Risk of data loss if the server crashes before the batch flushes to Postgres (mitigate by increasing snapshot frequency or using WAL).
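The buffering pattern is simple to sketch: operations append to a fast in-memory buffer (a Redis list in production) and a worker flushes them to Postgres as one batch. The class and thresholds below are illustrative; a real worker would also flush on a timer, not just on size.

```python
# Sketch of write-through buffering: N keystrokes become one DB write.

class WriteBuffer:
    def __init__(self, flush_fn, max_ops=100):
        self.pending = []
        self.flush_fn = flush_fn    # e.g. one multi-row INSERT to Postgres
        self.max_ops = max_ops

    def record(self, op):
        self.pending.append(op)     # O(1), no DB round-trip per keystroke
        if len(self.pending) >= self.max_ops:
            self.flush()            # size-triggered flush

    def flush(self):
        if self.pending:
            self.flush_fn(self.pending)
            self.pending = []

batches = []                        # stands in for the Postgres writer
buf = WriteBuffer(batches.append, max_ops=3)
for ch in "hello":
    buf.record({"insert": ch})
buf.flush()

# 5 keystrokes produced 2 DB writes instead of 5.
assert [len(b) for b in batches] == [3, 2]
```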
B. Snapshot Strategy (30 Seconds)
- Issue: Saving full HTML snapshots every 30 seconds creates a large write payload. If the server crashes at second 29, the user loses 29 seconds of work.
- Bottleneck: Disk I/O and DB storage growth.
- Solution: Incremental Snapshots + Version History.
- Implementation: Persist the document state to Postgres frequently (e.g., every few seconds, batched) using JSONB, so a crash loses only a few seconds of work. Create the full HTML snapshot (for export/viewing) only every 30s.
- Optimization: Store the document as a list of operations in Redis/Postgres, not just a snapshot. Rebuild the view from operations.
- Trade-off:
- Pros: Near-zero data loss; faster recovery from crashes.
- Cons: Requires more complex reconstruction logic to render the document from operations.
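The reconstruction logic is the replay of an operation log on top of the last snapshot. A minimal sketch, assuming a simple insert/delete op format (the field names are illustrative):

```python
# Sketch of rebuilding a document from its operation log rather than
# trusting only the last 30-second snapshot.

def apply(doc, op):
    if op["type"] == "insert":
        i = op["index"]
        return doc[:i] + op["text"] + doc[i:]
    if op["type"] == "delete":
        i = op["index"]
        return doc[:i] + doc[i + op["length"]:]
    raise ValueError("unknown op type: %r" % op["type"])

def rebuild(snapshot, ops_since_snapshot):
    doc = snapshot
    for op in ops_since_snapshot:
        doc = apply(doc, op)
    return doc

snapshot = "helo world"            # last full snapshot before the crash
ops = [                            # ops logged after that snapshot
    {"type": "insert", "index": 2, "text": "l"},
    {"type": "delete", "index": 6, "length": 5},
    {"type": "insert", "index": 6, "text": "there"},
]
assert rebuild(snapshot, ops) == "hello there"
```

Recovery then loses at most the operations that had not yet been flushed, not a full 29 seconds of typing.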
C. Partitioning by Organization ID
- Issue: Partitioning by Organization ID is a reasonable start, but it concentrates load: if one Organization has 10,000 documents and thousands of active users, its partition becomes a hotspot while others sit idle.
- Bottleneck: Uneven data distribution (Hotspots).
- Solution: Sharding Strategy + Consistent Hashing.
- Implementation: Instead of Org ID alone, hash a composite key such as (OrgID + DocumentID), or use a dynamic sharding key, so a single large organization spreads across shards. Implement a "hot shard" detection mechanism to move documents to less loaded shards.
- Trade-off:
- Pros: Even load distribution across DB nodes.
- Cons: Complex migration logic when shards move; cross-shard queries require expensive scatter-gather instead of a single-partition lookup.
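Consistent hashing keeps rebalancing cheap: keys hash onto a ring, each key is owned by the next node clockwise, and adding a node only moves the keys between it and its predecessor. A minimal sketch with virtual nodes to smooth the distribution (node names and vnode count are illustrative):

```python
# Minimal consistent-hash ring for document shard placement.
import bisect
import hashlib

class Ring:
    def __init__(self, nodes, vnodes=64):
        # Each physical node gets `vnodes` points on the ring.
        self.points = sorted(
            (self._h("%s#%d" % (n, i)), n) for n in nodes for i in range(vnodes)
        )
        self.keys = [p for p, _ in self.points]

    @staticmethod
    def _h(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        # Owner is the first ring point clockwise from the key's hash.
        i = bisect.bisect(self.keys, self._h(key)) % len(self.keys)
        return self.points[i][1]

ring = Ring(["db-1", "db-2", "db-3"])

# Shard key combines org and document so one large org spreads out.
owner = ring.node_for("org42:doc7")
assert owner in {"db-1", "db-2", "db-3"}
assert ring.node_for("org42:doc7") == owner   # placement is deterministic
```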
3. Security & Reliability Issues
A. JWT in LocalStorage (XSS Risk)
- Issue: Storing JWTs in LocalStorage is vulnerable to Cross-Site Scripting (XSS). If a malicious script runs in the browser, it can steal the token and take over the account.
- Failure Mode: Account hijacking.
- Solution: HttpOnly, Secure Cookies.
- Implementation: Send tokens via Set-Cookie with HttpOnly, Secure, and SameSite=Strict flags. Do not rely on LocalStorage for auth tokens.
- Trade-off:
- Pros: Mitigates XSS token theft.
- Cons: Requires CSRF protection (Double Submit Cookie or SameSite) on the backend; slightly more complex frontend auth handling.
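Constructing the hardened cookie is straightforward; a sketch using Python's stdlib http.cookies (the token value is a placeholder):

```python
# Sketch of issuing the session token as a hardened cookie instead of
# returning it in a JSON body for LocalStorage.
from http.cookies import SimpleCookie

def auth_cookie(token):
    c = SimpleCookie()
    c["session"] = token
    c["session"]["httponly"] = True      # invisible to document.cookie / XSS
    c["session"]["secure"] = True        # sent over HTTPS only
    c["session"]["samesite"] = "Strict"  # CSRF mitigation
    c["session"]["path"] = "/"
    return c["session"].OutputString()   # value for the Set-Cookie header

header = auth_cookie("example-token-value")
assert "HttpOnly" in header
assert "Secure" in header
assert "SameSite=Strict" in header
```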
B. Read Replicas for Write-Heavy Workloads
- Issue: The architecture suggests using Read Replicas. However, the flow states "Server writes change to PostgreSQL." If the Write Primary fails, the system halts. Read replicas do not help with the write bottleneck.
- Bottleneck: Write Availability.
- Solution: PostgreSQL Streaming Replication with Failover (Patroni).
- Implementation: Use a high-availability setup where a standby node can be promoted automatically if the primary fails.
- Trade-off:
- Pros: High availability for writes.
- Cons: Increased cost (2x DB instances); slight replication lag might cause read-after-write inconsistencies.
Summary of Recommended Architecture Changes
| Component | Current Design | Recommended Design | Primary Benefit |
|---|---|---|---|
| Sync Logic | Client Timestamps (LWW) | CRDTs / OT with Server Clocks | Prevents data loss and race conditions. |
| Real-Time | Polling DB (2s) | Redis Pub/Sub between servers | Sub-millisecond latency across servers. |
| Load Balancer | Round-Robin | Sticky Sessions or Redis State | Maintains WebSocket connection affinity. |
| Persistence | Snapshot every 30s | Batched Writes (Redis -> Postgres) | Reduces DB load; near-zero data loss. |
| CDN | Caches API (5m) | No-Cache for Doc Endpoints | Ensures users see latest edits immediately. |
| Auth | LocalStorage JWT | HttpOnly Cookies | Prevents XSS token theft. |
Final Verdict
The current architecture is suitable for a single-user document editor or a read-only CMS, but it will fail for a collaborative editor. The combination of Client Timestamps and Polling makes true collaboration impossible, and the CDN Caching contradicts the real-time requirement.
To make this viable, you must decouple the real-time protocol from the persistence layer and implement a proven convergence algorithm (OT/CRDT) for conflict resolution.