This architecture contains several critical flaws that would lead to data loss, poor user experience, and security vulnerabilities in a production environment. Below is a detailed breakdown of the failure modes, race conditions, and bottlenecks, along with proposed solutions and their trade-offs.
1. Real-Time Synchronization & Consistency
Issue: Inefficient Cross-Server Communication (Polling)
- Failure Mode: User A connects to Server 1, User B connects to Server 2. User A types. Server 1 writes to DB. Server 2 polls DB every 2 seconds to find the change.
- Impact: Up to 2 seconds of added latency for cross-server collaboration. Users see each other's typing lag significantly, and the constant polling reads put heavy load on the database.
- Solution: Implement a Redis Pub/Sub or Message Queue (Kafka/RabbitMQ) layer. When Server 1 receives a change, it publishes to the channel. Server 2 subscribes and pushes to its connected clients immediately.
- Trade-offs:
- Pros: Low latency (<100ms), decoupled server logic.
- Cons: Adds infrastructure complexity; requires handling message ordering and deduplication.
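The fan-out pattern can be sketched with an in-memory broker standing in for Redis Pub/Sub (in production each server would use a real Redis client; the `Broker` class and channel names here are purely illustrative):

```python
# Minimal in-memory stand-in for Redis Pub/Sub fan-out between servers.
# In production each server would hold a Redis subscription instead of
# registering handlers on this local Broker object.
from collections import defaultdict
from typing import Callable

class Broker:
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, channel: str, handler: Callable[[str], None]) -> None:
        self._subscribers[channel].append(handler)

    def publish(self, channel: str, message: str) -> None:
        # Fan the message out to every subscriber immediately (no polling).
        for handler in self._subscribers[channel]:
            handler(message)

broker = Broker()
received = []

# Server 2 subscribes on behalf of its connected clients.
broker.subscribe("doc:42", received.append)

# Server 1 receives a change from User A and publishes it right away.
broker.publish("doc:42", "insert 'h' at offset 0")
assert received == ["insert 'h' at offset 0"]
```

The key property is that Server 2 learns about the change at publish time rather than at its next poll, which is what removes the 2-second floor on cross-server latency.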
Issue: Last-Write-Wins (LWW) with Client Clocks
- Failure Mode: Client clocks are not synchronized. If User A (clock fast) and User B (clock slow) type simultaneously on the same line, the server might discard User B's text if the timestamp is lower, even if it arrived first.
- Impact: Data Loss. Text gets overwritten silently. Impossible to merge concurrent edits correctly.
- Solution: Use CRDTs (Conflict-free Replicated Data Types) like Yjs or Automerge, or Operational Transformation (OT). Use Vector Clocks or Hybrid Logical Clocks (HLC) instead of wall-clock time.
- Trade-offs:
- Pros: Guarantees eventual consistency; no data loss; handles offline editing.
- Cons: Increased payload size; more complex implementation logic on client and server.
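A Hybrid Logical Clock can be sketched as a `(logical_time, counter)` pair that orders causally related events correctly even when wall clocks disagree. This is a simplified sketch of the HLC idea, not a production implementation:

```python
# Minimal Hybrid Logical Clock (HLC) sketch: timestamps are
# (logical_time, counter) tuples that sort causally even when the
# machines' wall clocks are skewed. Simplified for illustration.
class HLC:
    def __init__(self):
        self.l = 0   # highest physical time seen so far
        self.c = 0   # logical counter to break ties

    def now(self, physical_time: int) -> tuple[int, int]:
        """Timestamp a local event."""
        if physical_time > self.l:
            self.l, self.c = physical_time, 0
        else:
            self.c += 1
        return (self.l, self.c)

    def update(self, physical_time: int, remote: tuple[int, int]) -> tuple[int, int]:
        """Merge a received timestamp so later local events sort after it."""
        rl, rc = remote
        if rl > self.l and rl > physical_time:
            self.l, self.c = rl, rc + 1
        elif physical_time > self.l and physical_time > rl:
            self.l, self.c = physical_time, 0
        else:
            self.c = max(self.c, rc) + 1
        return (self.l, self.c)

# User B's wall clock is 5 "ticks" behind User A's, yet the edit B makes
# AFTER seeing A's edit still gets the larger HLC timestamp.
a, b = HLC(), HLC()
t_a = a.now(physical_time=1000)
t_b = b.update(physical_time=995, remote=t_a)  # B's clock is "slow"
assert t_b > t_a
```

With wall-clock LWW the same scenario would have ordered B's edit first (995 < 1000) and silently discarded it.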
Issue: WebSocket Connection State
- Failure Mode: Load balancer uses Round-Robin. User A is on Server 1. User A refreshes or reconnects. LB sends them to Server 2. Server 2 has no knowledge of the active session or the current document state.
- Impact: Session Discontinuity. Users lose their cursor position and connection state upon reconnect.
- Solution: Enable Sticky Sessions (Session Affinity) on the Load Balancer for WebSocket traffic, or keep the servers stateless and validate each WebSocket handshake against session state held in a shared Redis store.
- Trade-offs:
- Pros: Simplifies state management (keep WS connection on one server).
- Cons: Sticky sessions can cause uneven load distribution if one server gets "heavy" connections.
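The stateless-handshake option can be sketched with a plain dict standing in for the shared Redis store; the function and field names here are illustrative, not a specific API:

```python
# Sketch of the "stateless handshake" alternative: session state lives
# in a shared store (a dict here, standing in for Redis), so whichever
# server the load balancer picks can validate the reconnect and restore
# the user's session.
shared_sessions = {}  # token -> session state; would be Redis in production

def open_session(token: str, user: str, doc_id: str) -> None:
    shared_sessions[token] = {"user": user, "doc_id": doc_id, "cursor": 0}

def handshake(server_name: str, token: str) -> dict:
    """Any server can accept the WebSocket; no stickiness required."""
    state = shared_sessions.get(token)
    if state is None:
        raise PermissionError("unknown or expired session")
    return {"server": server_name, **state}

open_session("tok-abc", user="alice", doc_id="doc-42")  # first connect via Server 1
resumed = handshake("server-2", "tok-abc")              # reconnect lands on Server 2
assert resumed["doc_id"] == "doc-42" and resumed["cursor"] == 0
```

This trades the uneven-load risk of stickiness for an extra store lookup on every handshake.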
2. Database & Persistence
Issue: Database Write Bottleneck (Keystroke-to-DB)
- Failure Mode: Step 2 says "Server writes change to PostgreSQL" for every keystroke.
- Impact: High Latency & DB Overload. Writing to a relational DB for every keystroke (several writes per second per active user) creates massive I/O contention. PostgreSQL becomes the bottleneck for scaling.
- Solution: Implement a Write Buffer. Buffer changes in memory (or Redis) and batch commit to PostgreSQL every 1–5 seconds or on document close.
- Trade-offs:
- Pros: Drastically reduces DB I/O, improves responsiveness.
- Cons: Risk of data loss if the server crashes before the batch commits (mitigated by persistent queues).
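The buffering logic can be sketched as a small class that flushes on either an op-count or an age threshold; `commit_batch` stands in for the real Postgres write, and all names here are illustrative:

```python
# Sketch of a write buffer: keystroke ops accumulate in memory and are
# flushed to the database in one batch, either when enough ops pile up
# or when the oldest buffered op gets too old.
import time

class WriteBuffer:
    def __init__(self, commit_batch, max_ops=100, max_age_s=2.0):
        self.commit_batch = commit_batch  # stand-in for a Postgres batch INSERT
        self.max_ops = max_ops
        self.max_age_s = max_age_s
        self.ops = []
        self.first_op_at = None

    def add(self, op):
        if not self.ops:
            self.first_op_at = time.monotonic()
        self.ops.append(op)
        if (len(self.ops) >= self.max_ops or
                time.monotonic() - self.first_op_at >= self.max_age_s):
            self.flush()

    def flush(self):
        if self.ops:
            self.commit_batch(self.ops)
            self.ops = []

committed = []
buf = WriteBuffer(committed.append, max_ops=3)
for ch in "hey":
    buf.add(("insert", ch))  # three keystrokes -> one DB write
assert committed == [[("insert", "h"), ("insert", "e"), ("insert", "y")]]
```

The crash-loss window mentioned above is exactly the buffer's age threshold, which is why pairing this with a persistent queue (or a Redis AOF-backed list) is the usual mitigation.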
Issue: Full HTML Snapshots (30s Interval)
- Failure Mode: Saving full HTML snapshots every 30 seconds.
- Impact: Storage Bloat & Data Loss. If the system crashes 29 seconds after the last save, up to 29 seconds of work is lost. Full HTML is also too large to store efficiently for version history.
- Solution: Save Operation Logs (OT/CRDT operations) to the DB for versioning. Generate HTML snapshots only for rendering or long-term archiving.
- Trade-offs:
- Pros: Granular undo/redo history; smaller storage footprint for versioning.
- Cons: Reconstructing HTML from operations requires a parser on the client/server; slightly more complex restore logic.
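The operation-log approach can be sketched with toy insert/delete ops; a real system would store OT or CRDT operations, but the rebuild-by-replay idea is the same:

```python
# Sketch of versioning via an operation log: every edit is stored as a
# small operation, and any historical version is rebuilt by replaying
# the log up to that point. Toy ops, not real OT/CRDT operations.
def apply_op(text: str, op) -> str:
    kind, pos, payload = op
    if kind == "insert":
        return text[:pos] + payload + text[pos:]
    if kind == "delete":  # payload is the number of characters to remove
        return text[:pos] + text[pos + payload:]
    raise ValueError(f"unknown op kind: {kind}")

def rebuild(oplog, upto: int) -> str:
    text = ""
    for op in oplog[:upto]:
        text = apply_op(text, op)
    return text

oplog = [
    ("insert", 0, "Hello"),
    ("insert", 5, " world"),
    ("delete", 0, 5),
    ("insert", 0, "Howdy"),
]
assert rebuild(oplog, 2) == "Hello world"  # version after two ops
assert rebuild(oplog, 4) == "Howdy world"  # latest version
```

Periodic snapshots then serve only to cap replay time: restore loads the nearest snapshot and replays the ops recorded after it.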
Issue: Read Replicas Consistency
- Failure Mode: Architecture mentions read replicas. If a user reads a document from a replica immediately after writing, they might see stale data due to replication lag.
- Impact: Inconsistent State. A user sees their own edit as "missing" for the duration of the replication lag, anywhere from milliseconds to seconds under load.
- Solution: Enforce Read-After-Write Consistency by routing user's own reads to the Primary DB, or use Redis to cache the latest "known good" version for the user.
- Trade-offs:
- Pros: Strong consistency for the editor.
- Cons: Increased load on the Primary DB; requires logic to route reads dynamically.
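The routing rule can be sketched with two dicts standing in for the primary and a lagging replica; the sticky window is an assumed bound on replication lag, and all names are illustrative:

```python
# Sketch of read-after-write routing: reads from the author of a recent
# write go to the primary; everyone else may be served from a replica
# that can lag. Dicts stand in for the Postgres nodes.
import time

primary, replica = {}, {}   # replica trails primary (lag is simulated
                            # here by never copying data across)
recent_writers = {}         # (user, doc_id) -> time of last write

STICKY_WINDOW_S = 5.0       # assumption: replication lag stays under 5s

def write(user, doc_id, content):
    primary[doc_id] = content
    recent_writers[(user, doc_id)] = time.monotonic()

def read(user, doc_id):
    last = recent_writers.get((user, doc_id), float("-inf"))
    if time.monotonic() - last < STICKY_WINDOW_S:
        return primary[doc_id]        # author reads their own write
    return replica.get(doc_id)        # everyone else tolerates lag

write("alice", "doc-42", "v2")        # replica has not caught up yet
assert read("alice", "doc-42") == "v2"   # author sees own write
assert read("bob", "doc-42") is None     # replica reader may see stale data
```

The same decision can also be made with a cached "latest known version" per user in Redis, falling back to the replica once the replica's reported LSN has passed the user's write.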
3. Networking & Caching
Issue: CDN Caching API Responses
- Failure Mode: CloudFront caches API responses for 5 minutes.
- Impact: Catastrophic Data Staleness. If User A edits a document and the API response is cached, User B (on a different region) will see the old version cached by the CDN. The "real-time" aspect is completely broken.
- Solution: Configure the CDN to Bypass Cache for all mutable API endpoints (POST, PUT, PATCH, and the GET endpoints for active documents). Only cache static assets (JS/CSS).
- Trade-offs:
- Pros: Data consistency.
- Cons: Increased load on the Origin API servers (no CDN offloading for dynamic traffic).
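One way to enforce this split is at the origin: the server marks API responses uncacheable via `Cache-Control`, and the CDN is configured to honor origin headers. The routing function below is an illustrative sketch, not a specific framework's API:

```python
# Sketch of origin-driven cache policy: API responses carry "no-store"
# so neither the CDN nor the browser caches them, while fingerprinted
# static assets are cached aggressively.
def response_headers(path: str) -> dict:
    if path.startswith("/api/"):
        # Never cache mutable document data.
        return {"Cache-Control": "no-store"}
    # Content-hashed static assets can be cached for a year and marked
    # immutable, since a new deploy changes the filename.
    return {"Cache-Control": "public, max-age=31536000, immutable"}

assert response_headers("/api/documents/42")["Cache-Control"] == "no-store"
assert "max-age=31536000" in response_headers("/static/app.9f3c.js")["Cache-Control"]
```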
Issue: Round-Robin LB with WebSockets
- Failure Mode: Standard HTTP Load Balancers often tear down long-lived WebSocket connections or do not support sticky sessions by default.
- Impact: Connection Drops. Users get disconnected randomly.
- Solution: Use a Layer 7 Load Balancer (like NGINX, HAProxy, or AWS ALB) specifically configured to handle WebSocket upgrades (the Upgrade: websocket header) and enforce stickiness.
- Trade-offs:
- Pros: Stable connections.
- Cons: Requires specific LB configuration; potential uneven load.
4. Security & Authentication
Issue: 24-Hour JWT Expiry
- Failure Mode: JWTs are valid for 24 hours.
- Impact: Session Hijacking Risk. If a token is stolen (e.g., via XSS), the attacker has full access to edit the document for a full day.
- Solution: Reduce the access token TTL to 15 minutes and implement a Refresh Token flow. Refresh tokens should be longer-lived than access tokens, rotated on every use, and stored in HttpOnly, Secure cookies.
- Trade-offs:
- Pros: Minimizes blast radius of token theft.
- Cons: Requires handling refresh logic on the client; increases auth server load slightly.
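The flow can be sketched with plain expiry timestamps; a real system would issue signed JWTs (e.g. via a JWT library) and deliver the refresh token in an HttpOnly, Secure cookie, but the TTL and rotation logic is the same:

```python
# Sketch of short-lived access tokens plus a refresh flow with rotation.
# Plain timestamps stand in for signed JWT "exp" claims; the store dict
# stands in for server-side refresh-token state.
import secrets

ACCESS_TTL_S = 15 * 60            # 15-minute access tokens
REFRESH_TTL_S = 14 * 24 * 3600    # refresh tokens live longer (assumed 14d)

refresh_store = {}                # refresh_token -> (user, expiry)

def issue_tokens(user: str, now: float):
    access = {"user": user, "exp": now + ACCESS_TTL_S}
    refresh = secrets.token_urlsafe(32)
    refresh_store[refresh] = (user, now + REFRESH_TTL_S)
    return access, refresh

def refresh_access(refresh_token: str, now: float):
    user, exp = refresh_store.get(refresh_token, (None, 0.0))
    if now >= exp:
        raise PermissionError("refresh token expired, revoked, or unknown")
    del refresh_store[refresh_token]  # rotate: a stolen token is single-use
    return issue_tokens(user, now)

t0 = 1_700_000_000.0  # example epoch time
access, refresh = issue_tokens("alice", t0)
assert access["exp"] - t0 == ACCESS_TTL_S

# An hour later the access token is dead, but the refresh flow works
# and the old refresh token is invalidated:
access2, refresh2 = refresh_access(refresh, t0 + 3600)
assert access2["user"] == "alice" and refresh2 != refresh
```

Because the refresh token rotates, a replayed (stolen) copy fails immediately, which is the "blast radius" reduction referred to above.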
Issue: LocalStorage for Tokens
- Failure Mode: Storing JWTs in LocalStorage.
- Impact: XSS Vulnerability. Any malicious script injected into the page (via a third-party library or compromised CDN) can steal the token.
- Solution: Use HttpOnly Cookies for auth tokens, which scripts cannot read. If LocalStorage is unavoidable, enforce a strict CSP (Content Security Policy) and keep token lifetimes short to limit the damage of theft.
- Trade-offs:
- Pros: Protects against XSS token theft.
- Cons: Cookies are susceptible to CSRF (mitigated by SameSite attributes and CSRF tokens); requires server-side cookie management.
5. Scaling & Partitioning
Issue: Organization ID Partitioning (Hotspots)
- Failure Mode: Partitioning by Org ID. One large enterprise organization has 10,000 active users editing the same doc.
- Impact: Single Shard Bottleneck. All traffic for that org hits one database partition/shard, causing latency for everyone, while other partitions sit idle.
- Solution: Implement Dynamic Sharding based on document ID hash rather than Org ID. Use Consistent Hashing to distribute load.
- Trade-offs:
- Pros: Even load distribution regardless of org size.
- Cons: Data isolation becomes harder (Org data is spread across shards); requires re-sharding logic when adding nodes.
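Consistent hashing over document IDs can be sketched as a ring of virtual nodes; each document maps to the first node point at or after its hash, so one large org's documents spread across all shards. The class and shard names are illustrative:

```python
# Sketch of consistent hashing: each shard contributes many virtual
# points on a hash ring, and a document is placed on the shard owning
# the first point at or after the document's hash. Adding a shard moves
# only a small fraction of documents.
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=100):
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s: str) -> int:
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

    def node_for(self, doc_id: str) -> str:
        idx = bisect.bisect(self.keys, self._hash(doc_id)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["shard-a", "shard-b", "shard-c"])
# A single large org's documents land on every shard, not one hotspot:
placements = {ring.node_for(f"org-1/doc-{i}") for i in range(1000)}
assert placements == {"shard-a", "shard-b", "shard-c"}
```

Cross-shard queries ("list all documents in org-1") then need a secondary index or scatter-gather, which is the data-isolation cost noted above.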
Issue: Document Locking
- Failure Mode: Multiple users editing the same document without coordination.
- Impact: Race Conditions. Even with CRDTs, heavy write contention on the same document ID can cause DB deadlocks.
- Solution: Implement Optimistic Locking on the DB level (version numbers). If a write fails due to version mismatch, the client must reload state and re-apply changes.
- Trade-offs:
- Pros: Prevents database corruption.
- Cons: Requires client logic to handle conflict retries gracefully.
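The version-number check can be sketched with a dict standing in for a row with a `version` column (in SQL this is the classic `UPDATE ... WHERE id = :id AND version = :expected` pattern); names here are illustrative:

```python
# Sketch of optimistic locking: a write succeeds only if the caller
# read the latest version; otherwise it raises and the client must
# reload and re-apply its change.
class VersionConflict(Exception):
    pass

store = {"doc-42": {"version": 1, "content": "hello"}}

def save(doc_id: str, expected_version: int, content: str) -> int:
    row = store[doc_id]
    if row["version"] != expected_version:
        raise VersionConflict(
            f"have v{row['version']}, you sent v{expected_version}")
    row["version"] += 1
    row["content"] = content
    return row["version"]

v = save("doc-42", expected_version=1, content="hello world")  # ok -> v2
try:
    save("doc-42", expected_version=1, content="stale write")  # lost the race
except VersionConflict:
    # Client reloads at version 2 and re-applies its change.
    v = save("doc-42", expected_version=2, content="merged")
assert v == 3 and store["doc-42"]["content"] == "merged"
```

The retry loop in the `except` branch is the client-side logic the "Cons" bullet refers to: in a real editor it would re-merge the local change, not blindly overwrite.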
Summary of Critical Fixes
| Component | Current Design | Critical Risk | Recommended Fix |
|---|---|---|---|
| Sync | Poll DB every 2s | 2s Latency | Redis Pub/Sub for server-to-server msg |
| Conflict | LWW + Client Clock | Data Loss | CRDTs (Yjs) + Vector Clocks |
| DB Write | On every keystroke | DB Overload | Batch writes / Buffer in Redis |
| CDN | Cache API 5 min | Stale Data | No Cache for API endpoints |
| Auth | 24h JWT + LocalStorage | XSS / Hijack | 15m Access Token + HttpOnly Refresh Cookie |
| LB | Round Robin | WS Disconnection | Sticky Sessions / Layer 7 LB |
| Storage | HTML Snapshots | Bloated / Data Loss | Operation Log + Periodic Snapshot |
Final Architecture Recommendation
To build a system that scales like Google Docs, you must move from "Database-First" synchronization to "Event-First" synchronization. The database should be the source of truth for persistence, not the bus for real-time communication. The real-time bus should be in-memory (e.g. Redis Pub/Sub), with asynchronous, batched persistence to Postgres.