Here's a comprehensive analysis of the architecture, identifying critical failure modes, race conditions, and scaling bottlenecks. For each issue, I've provided a specific solution with clear trade-offs based on real-world distributed-systems best practices.
1. Client Clock Synchronization for Timestamps (Critical Failure Mode)
Issue: Using client-generated timestamps for conflict resolution ("last-write-wins") is fundamentally flawed. Client clocks are unsynchronized (NTP drift can be 100ms+), and users can manually adjust time. A user with a clock set ahead by 5 minutes could overwrite others' changes arbitrarily, causing data corruption. Even with NTP, network latency makes it impossible to reliably order concurrent edits.
Solution:
- Replace client timestamps with server-generated monotonic timestamps (using a single, trusted central time source) or switch to CRDTs (Conflict-Free Replicated Data Types).
- For CRDTs: Use a state-based CRDT (e.g., an LWW-Element-Set for document metadata) or an operation-based sequence CRDT for text (e.g., the Yjs library). Changes are merged automatically without central coordination.
- Alternative: Use vector clocks (requires per-client clock tracking) or Lamport timestamps (server-synchronized sequence numbers).
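As a minimal sketch of the Lamport-timestamp alternative (illustrative Python, not a production implementation): each replica keeps a counter that advances on local events and jumps past any timestamp it receives, yielding a total order over edits without wall clocks.

```python
from dataclasses import dataclass

@dataclass
class LamportClock:
    """Minimal Lamport clock: a counter merged on every message exchange."""
    time: int = 0

    def tick(self) -> int:
        # Local event: advance the counter.
        self.time += 1
        return self.time

    def receive(self, remote_time: int) -> int:
        # On receiving a remote timestamp, jump past it so causality is preserved.
        self.time = max(self.time, remote_time) + 1
        return self.time

# Two replicas ordering edits without wall clocks; ties would be broken by replica id.
a, b = LamportClock(), LamportClock()
t1 = a.tick()        # replica A makes an edit
t2 = b.receive(t1)   # replica B sees A's edit, so its next edit is ordered after
edits = sorted([(t2, "b"), (t1, "a")])
```

Because `t2 > t1` whenever B has seen A's edit, causally related edits always sort correctly, independent of each machine's wall clock.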
Trade-offs:
- CRDTs:
- ✅ Correctly handles concurrent edits without conflicts; no need for centralized timestamping.
- ❌ Increased storage overhead (sends entire state or complex operation metadata).
- ❌ Higher client-side complexity (requires CRDT-specific libraries like Yjs or Automerge).
- Server-generated timestamps:
- ✅ Simple to implement; uses a single trusted time source.
- ❌ Requires a highly available time service (e.g., Google TrueTime, AWS Time Sync) to avoid clock skew issues.
- ❌ Still vulnerable to network latency (e.g., if two edits arrive at the server within 1ms, order is arbitrary).
Recommendation: Use CRDTs for collaborative editing. It’s the industry standard (e.g., Google Docs uses a variant of OT, but CRDTs are simpler for distributed systems). Avoid client timestamps entirely.
2. Broadcast-Only-to-Same-Server + Polling (Race Condition & Scaling Bottleneck)
Issue:
- Changes are only broadcast to clients on the same server (e.g., Server A updates its clients but ignores Server B).
- Other servers poll PostgreSQL every 2 seconds for changes, creating:
- Race conditions: If User X edits on Server A and User Y edits the same document on Server B within 2 seconds, Server B’s polling might miss Server A’s change, causing User Y to overwrite X’s work.
- Scaling bottleneck: Polling load grows with server and document count. One poll every 2 seconds is only 0.5 queries/sec per server, but if each server polls per active document (e.g., 10 servers × 1000 documents × 0.5 polls/sec = 5,000 queries/sec), this could overwhelm PostgreSQL.
- Inconsistent state: Clients on different servers see stale data for up to 2 seconds.
Solution:
- Replace polling with a pub/sub system (e.g., Redis Pub/Sub, Kafka, or NATS):
- When a server processes a change, it publishes it to a channel (e.g., `doc:{doc_id}:changes`).
- All API servers subscribe to this channel and immediately broadcast changes to their connected clients.
- Use sticky sessions for WebSocket connections (via load balancer) to reduce cross-server communication, but do not rely on them for data consistency.
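The channel-per-document fan-out pattern above can be sketched in-process (an illustrative stand-in for Redis Pub/Sub; in production the bus would be a Redis, Kafka, or NATS client, and the handlers would forward over WebSockets):

```python
from collections import defaultdict
from typing import Callable

class LocalPubSub:
    """In-process stand-in for Redis Pub/Sub, showing the fan-out pattern:
    each API server subscribes to doc:{doc_id}:changes and forwards every
    message to its own connected WebSocket clients."""
    def __init__(self) -> None:
        self.subscribers: dict[str, list[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, channel: str, handler: Callable[[str], None]) -> None:
        self.subscribers[channel].append(handler)

    def publish(self, channel: str, message: str) -> None:
        # Every subscribed server gets the change immediately -- no polling.
        for handler in self.subscribers[channel]:
            handler(message)

# Two "servers" subscribed to the same document's channel.
bus = LocalPubSub()
received_a: list[str] = []
received_b: list[str] = []
bus.subscribe("doc:42:changes", received_a.append)
bus.subscribe("doc:42:changes", received_b.append)
bus.publish("doc:42:changes", '{"op": "insert", "pos": 5, "text": "hi"}')
```

Both servers receive the change at publish time, which is what eliminates the 2-second polling window.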
Trade-offs:
- ✅ Real-time propagation: Changes reach all clients in milliseconds (not 2 seconds).
- ✅ Reduced DB load: Eliminates constant polling; pub/sub is lightweight.
- ❌ Added complexity: Requires managing a pub/sub system (e.g., Redis setup, scaling, fault tolerance).
- ❌ Potential message loss: If Redis fails, changes may be lost (mitigate with Redis replication or Kafka persistence).
- Alternative: Use a dedicated real-time sync service (e.g., Ably, Pusher) for pub/sub, but this adds vendor lock-in.
Critical fix: Pub/sub is non-negotiable for real-time collaboration. Polling is unacceptable for low-latency systems.
3. Full HTML Snapshots Every 30 Seconds (Inefficiency & Scalability Bottleneck)
Issue: Saving full HTML snapshots every 30 seconds is wasteful:
- For a 1MB document saved every 30 seconds, this generates ~120MB/hour of storage per actively edited document (vs. deltas on the order of kilobytes per hour).
- Reconstruction latency: loading and parsing a large full-HTML snapshot is slow, and any edits made in the up-to-30-second window since the last snapshot are lost if a server crashes.
- Network strain: Sending full HTML on every sync wastes bandwidth (e.g., 100 users each uploading a 1MB snapshot every 30 seconds is a sustained ~3.3MB/sec).
Solution:
- Replace snapshots with delta-based storage:
- Store changes as CRDT operations (e.g., insert/delete operations) or OT operations.
- For reads: Reconstruct the document by applying deltas to a base state (e.g., initial snapshot + deltas).
- Use mergeable snapshotting: Save a full snapshot every 5 minutes (not 30 seconds) and store deltas between snapshots.
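A minimal sketch of snapshot-plus-delta reconstruction (illustrative operation format; a real system would use the operation encoding of its CRDT/OT library): load the latest checkpoint, then replay only the deltas recorded after it.

```python
def apply_delta(doc: str, op: dict) -> str:
    """Apply one insert/delete delta to a document string."""
    if op["type"] == "insert":
        return doc[:op["pos"]] + op["text"] + doc[op["pos"]:]
    if op["type"] == "delete":
        return doc[:op["pos"]] + doc[op["pos"] + op["len"]:]
    raise ValueError(f"unknown op type: {op['type']}")

def reconstruct(snapshot: str, deltas: list[dict]) -> str:
    # Start from the latest checkpoint and replay only the deltas after it,
    # instead of storing (or loading) a full copy per save interval.
    doc = snapshot
    for op in deltas:
        doc = apply_delta(doc, op)
    return doc

doc = reconstruct("Hello world", [
    {"type": "delete", "pos": 5, "len": 6},        # drop " world"
    {"type": "insert", "pos": 5, "text": ", CRDTs"},
])
```

Periodic checkpoints bound the replay length, which is the mitigation for the reconstruction-delay trade-off noted below.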
Trade-offs:
- ✅ Reduced storage: Deltas use 10-100× less space than full snapshots.
- ✅ Faster initial load: Only the latest snapshot + recent deltas need to be loaded.
- ❌ Complexity: Requires a delta parser and history management (e.g., garbage collection of old deltas).
- ❌ Reconstruction delays: For very large documents with long histories, replaying deltas might add latency (mitigate with periodic "checkpoint" snapshots).
Recommendation: Use a CRDT-based delta storage (e.g., Yjs) for both real-time sync and persistence. This solves sync and storage in one go.
4. CDN Caching Dynamic API Responses (Data Staleness)
Issue: Caching API responses for 5 minutes (e.g., document state endpoints) via CloudFront causes stale data. Users won’t see real-time updates, defeating the purpose of collaboration. For example, if User A edits a document, User B might see the old version for up to 5 minutes.
Solution:
- Disable caching for all dynamic endpoints (e.g., `/document/{id}`, `/changes`). Set `Cache-Control: no-store` or `private, max-age=0`.
- Cache only static assets: JS/CSS/images via CloudFront (with versioned URLs).
- For read-heavy, read-only views (e.g., public docs), cache with a short TTL (e.g., 30s) and invalidate on edits.
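The header policy above can be expressed as a small routing function (the path prefixes here are hypothetical; adapt them to the actual API layout):

```python
def cache_headers(path: str) -> dict[str, str]:
    """Choose a Cache-Control header per endpoint class.
    Path prefixes are illustrative, not from the real API."""
    if path.startswith("/static/"):
        # Versioned static assets: cache aggressively at the CDN forever.
        return {"Cache-Control": "public, max-age=31536000, immutable"}
    if path.startswith("/public-docs/"):
        # Read-only public views: short TTL, invalidated on edit.
        return {"Cache-Control": "public, max-age=30"}
    # Dynamic collaborative endpoints: never cache anywhere.
    return {"Cache-Control": "no-store"}
```

The key invariant: no document-state endpoint ever returns a cacheable response.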
Trade-offs:
- ✅ Always fresh data: Users see real-time changes.
- ❌ Increased backend load: More requests hit the API servers (mitigate with client-side caching or optimized DB indexing).
- ❌ No caching for dynamic content: Requires careful design to avoid overwhelming the DB.
Critical fix: Never cache dynamic collaborative data. Use CDN only for static assets.
5. JWT Tokens in localStorage (Security Vulnerability)
Issue: Storing JWTs in localStorage is vulnerable to XSS attacks. If an attacker injects malicious JS, they steal tokens and impersonate users. This is a critical security flaw.
Solution:
- Store JWTs in HttpOnly, Secure, SameSite=Strict cookies.
- Use CSRF tokens for state-changing requests (e.g., POST/PUT).
- For WebSocket auth: send the JWT during the initial connection handshake and validate it at connection time. Prefer the first WebSocket message or the cookie itself over a query parameter (e.g., wss://host/?token=...), since query strings are commonly logged by proxies and servers.
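As an illustrative sketch (hypothetical cookie name and lifetime), the Set-Cookie value that keeps the JWT out of JavaScript's reach looks like this:

```python
def session_cookie(token: str, max_age: int = 3600) -> str:
    """Build a Set-Cookie value for the session JWT.
    HttpOnly blocks document.cookie access (XSS can't read it);
    Secure restricts it to HTTPS; SameSite=Strict stops the browser
    attaching it to cross-site requests (CSRF mitigation)."""
    return (
        f"session={token}; HttpOnly; Secure; SameSite=Strict; "
        f"Path=/; Max-Age={max_age}"
    )

cookie = session_cookie("eyJhbGciOi...")
```

Note that SameSite=Strict reduces but does not eliminate CSRF risk, so the CSRF tokens above are still needed for state-changing requests.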
Trade-offs:
- ✅ XSS protection: HttpOnly cookies prevent JS access.
- ❌ CSRF risk: Requires anti-CSRF measures (e.g., double-submit cookies).
- ❌ Browser compatibility: Some older browsers have quirks with SameSite cookies (rare today).
Critical fix: Move to HttpOnly cookies immediately. This is non-negotiable for security.
6. No Conflict Resolution for Concurrent Edits (Data Corruption)
Issue: "Last-write-wins" with client timestamps ignores context. If two users edit the same paragraph simultaneously:
- One user’s changes are completely overwritten (e.g., with concurrent edits "Hello" → "Hi" and "Hello" → "Hey", whichever write lands last wins and the other edit is silently lost).
- This violates user expectations (e.g., Google Docs merges edits without data loss).
Solution:
- Implement Operational Transformation (OT) or CRDTs:
- CRDTs: Prefer for simplicity in distributed systems (e.g., Yjs for JSON-like data). Changes are associative and commutative, so delivery order doesn’t matter.
- OT: Used by Google Docs; transforms operations to maintain consistency (e.g., "insert at position 5" becomes "insert at position 6" if a prior edit added text).
- Store operations: Instead of just saving the final document, store all edit operations for auditability.
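The core OT idea mentioned above (shifting "insert at position 5" to "insert at position 6" when a prior edit added text) can be sketched for the insert-vs-insert case (a deliberately minimal transform; real OT also handles deletes, ties, and multi-character interactions):

```python
def transform_insert(pos: int, other_pos: int, other_len: int) -> int:
    """Transform an insert position against a concurrent insert that has
    already been applied: if the other insert landed at or before our
    position, shift ours right by its length."""
    return pos + other_len if other_pos <= pos else pos

# User A inserts "big " at position 4; User B concurrently appends "!" at 11.
doc = "The dog ran"
doc = doc[:4] + "big " + doc[4:]      # A's edit is applied first
b_pos = transform_insert(11, 4, 4)    # B's intended position 11 shifts to 15
doc = doc[:b_pos] + "!" + doc[b_pos:]
```

Both edits survive in the merged document, which is exactly the property last-write-wins destroys.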
Trade-offs:
- ✅ No data loss: Concurrent edits are merged deterministically (e.g., both users’ character insertions survive, with the final ordering decided by the CRDT’s tie-breaking rules).
- ❌ Implementation complexity: CRDTs/OT require deep understanding of concurrency models.
- ❌ Storage overhead: Operations can be larger than raw text (but still better than full snapshots).
Recommendation: Use CRDTs (e.g., Yjs or Automerge). They’re simpler to implement correctly than OT for most use cases.
7. Single-Document Write Scalability (Bottleneck)
Issue:
- If one document has thousands of concurrent editors (e.g., a live webinar), all edits go to the same PostgreSQL instance.
- High write contention causes:
- Slow writes (locks on the document row).
- PostgreSQL can’t scale writes horizontally for a single row.
- Fan-out overhead: even with pub/sub, every server hosting editors of that one document must process each change.
Solution:
- Shard documents by ID: Use a consistent hash to distribute documents across PostgreSQL instances.
- Use a write-optimized database: For high-write workloads, consider CockroachDB (distributed SQL) or Amazon Aurora (auto-scaling PostgreSQL).
- Batch writes: Buffer changes in memory (e.g., 100ms) and write in bulk to DB to reduce transactions.
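The shard-routing step above can be sketched with a stable hash (a simplified modulo scheme rather than a full consistent-hash ring; the point is that every server computes the same mapping):

```python
import hashlib

def shard_for(doc_id: str, num_shards: int = 10) -> int:
    """Route a document to a shard deterministically.
    Uses SHA-256 rather than Python's built-in hash(), which is
    randomized per process and would give servers different answers."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

shard = shard_for("doc-123")  # same answer on every server, every restart
```

A production system would typically use a consistent-hash ring instead of plain modulo so that adding a shard remaps only a fraction of documents.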
Trade-offs:
- ✅ Horizontal scaling: Distributed DB handles high write throughput.
- ❌ Complexity: Sharding requires app-level routing logic (e.g., "doc_id % 10 → shard 0–9").
- ❌ Transaction limitations: Cross-shard transactions may not be supported (e.g., CockroachDB handles them but with latency).
Recommendation: For large-scale deployments, use CockroachDB for distributed SQL capabilities. For smaller apps, Aurora with read replicas suffices.
8. No Server Failover for WebSocket Connections (Data Loss Risk)
Issue: If an API server crashes:
- WebSocket connections drop, and unsaved client changes (in memory) are lost.
- Clients reconnect to a new server but have no way to recover partial edits (unless they send them again).
- PostgreSQL writes might be missed if the server crashed before persistence.
Solution:
- Client-side retry: Clients buffer unsent changes in memory and retry on reconnect.
- Server-side state recovery: When a client reconnects, send the latest document state + any operations missed during disconnect (using a sequence number).
- Dedicated persistent queue: Use Kafka to buffer changes before writing to DB (so crashes don’t lose data).
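The client-side retry mechanism above can be sketched as a sequence-numbered buffer (illustrative structure; a real client would persist this and integrate it with the WebSocket reconnect logic):

```python
class ChangeBuffer:
    """Client-side buffer: each change gets a sequence number and is kept
    until the server acknowledges it, so a dropped WebSocket or a crashed
    server loses nothing -- unacked changes are replayed on reconnect."""
    def __init__(self) -> None:
        self.pending: list[tuple[int, dict]] = []
        self.next_seq = 0

    def record(self, op: dict) -> int:
        seq = self.next_seq
        self.next_seq += 1
        self.pending.append((seq, op))
        return seq

    def ack(self, acked_seq: int) -> None:
        # Drop everything the server has durably persisted.
        self.pending = [(s, op) for s, op in self.pending if s > acked_seq]

    def replay_after_reconnect(self) -> list[dict]:
        # On reconnect, resend every unacknowledged change in order.
        return [op for _, op in self.pending]

buf = ChangeBuffer()
buf.record({"op": "insert", "pos": 0, "text": "a"})
buf.record({"op": "insert", "pos": 1, "text": "b"})
buf.ack(0)  # server persisted seq 0; seq 1 is still pending
```

The server-side mirror of this is tracking the last applied sequence number per client, so replays are idempotent.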
Trade-offs:
- ✅ No data loss: Changes survive server crashes.
- ❌ Client complexity: Requires buffering and retry logic on the frontend.
- ❌ Additional infrastructure: Kafka adds operational overhead.
Recommendation: Implement client-side retry + sequence numbers for safety. For critical systems, add Kafka as a persistent buffer.
Summary of Key Fixes
| Issue | Solution | Criticality |
|---|---|---|
| Client timestamps | CRDTs or server monotonic timestamps | 🔴 Critical |
| Polling for changes | Pub/sub (Redis/Kafka) | 🔴 Critical |
| Full HTML snapshots | Delta storage + CRDTs | 🔴 Critical |
| CDN caching dynamic data | Disable cache for dynamic endpoints | 🔴 Critical |
| JWT in localStorage | HttpOnly cookies | 🔴 Critical |
| Conflict resolution | CRDTs/OT | 🔴 Critical |
| Single-document scaling | Sharded distributed DB (CockroachDB) | 🟠 High |
| Server failover | Client retry + sequence numbers | 🟠 High |
Final Architecture Improvements:
- Frontend: Use Yjs for CRDT-based editing, buffered changes, and client-side retries.
- Backend: Replace polling with Redis Pub/Sub for change propagation; use a distributed database (CockroachDB) for storage.
- Auth: HttpOnly cookies + CSRF tokens.
- CDN: Cache only static assets (JS/CSS/images), never document data.
- Scaling: Shard documents by ID; use a dedicated sync service (e.g., Yjs server) for real-time ops.
Why this works: CRDTs eliminate the need for timestamps and conflict resolution logic. Pub/sub replaces inefficient polling. Distributed databases handle scaling. HttpOnly cookies fix security. This aligns with modern collaborative systems like Google Docs (OT-based) or Figma (CRDTs).