Critical Issues in the Collaborative Document Editor Architecture
1. Last-Write-Wins with Client Clocks (Critical)
Problem: Client clocks are unreliable and can be out of sync by seconds, minutes, or even hours. This causes:
- Data loss when a user with a "fast" clock overwrites legitimate changes from a user with a "slow" clock
- Inconsistent document states across different clients
- Near-impossible debugging when users report lost work
Solution: Implement Operational Transformation (OT) or Conflict-free Replicated Data Types (CRDTs)
- OT: Transform operations based on their sequence and context
- CRDT: Use data structures that guarantee convergence regardless of operation order
Trade-offs:
- Complexity: Both approaches are significantly more complex than LWW
- Performance: Additional computation overhead for transformation/merge logic
- Development time: Months of additional development vs. simple timestamp approach
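To make the ordering fix concrete, here is a minimal sketch of a CRDT-style last-writer-wins register that orders writes by a (Lamport counter, site id) pair instead of wall-clock time. The class and field names are hypothetical, and a production sequence CRDT for text (RGA, Yjs-style structures) is considerably more involved; this only demonstrates the convergence property:

```python
from dataclasses import dataclass

@dataclass
class LWWRegister:
    """CRDT-style last-writer-wins register (illustration only).

    Ordering uses a (Lamport counter, site_id) pair rather than wall-clock
    time, so replicas converge no matter how skewed client clocks are or
    in what order updates arrive.
    """
    site_id: str
    value: str = ""
    stamp: tuple = (0, "")  # (lamport_counter, site_id) of the last write

    def local_write(self, value: str) -> tuple:
        # Advance the logical clock past anything we've seen so far.
        counter = self.stamp[0] + 1
        self.stamp = (counter, self.site_id)
        self.value = value
        return (self.stamp, value)  # broadcast this pair to other replicas

    def merge(self, remote_stamp: tuple, remote_value: str) -> None:
        # Commutative, idempotent merge: higher (counter, site_id) wins.
        if remote_stamp > self.stamp:
            self.stamp = remote_stamp
            self.value = remote_value

# Two replicas write concurrently; merges arrive in opposite orders.
a, b = LWWRegister("a"), LWWRegister("b")
ua = a.local_write("from A")
ub = b.local_write("from B")
a.merge(*ub)
b.merge(*ua)
assert a.value == b.value  # both converge to the same winner
```

The tie-break on site id is arbitrary but deterministic, which is what wall-clock LWW lacks.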
2. Server-Local WebSocket Broadcasting (Critical)
Problem: Changes are only broadcast to clients connected to the same server instance. Clients on other servers:
- Don't receive real-time updates until the next 2-second polling cycle completes
- Experience inconsistent document states during those 2 seconds
- May generate conflicting changes based on stale data
Solution: Implement Redis Pub/Sub for cross-server communication
- When a server receives a change, publish it to a Redis channel
- All servers subscribe to document-specific channels and forward to their connected clients
Trade-offs:
- Latency: Adds Redis network hop (~1-5ms)
- Complexity: Additional failure mode (Redis availability)
- Cost: Increased Redis bandwidth usage
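The fan-out pattern can be sketched with an in-process broker standing in for Redis Pub/Sub; `Broker`, `Server`, and the `doc:42` channel name are all illustrative:

```python
from collections import defaultdict

class Broker:
    """In-process stand-in for Redis Pub/Sub (illustration only)."""
    def __init__(self):
        self.subscribers = defaultdict(list)  # channel -> callbacks

    def subscribe(self, channel, callback):
        self.subscribers[channel].append(callback)

    def publish(self, channel, message):
        for callback in self.subscribers[channel]:
            callback(message)

class Server:
    """One app-server instance with its locally connected clients."""
    def __init__(self, name, broker):
        self.name = name
        self.clients = []  # messages delivered to local WebSocket clients
        # Every server subscribes to the channels of its open documents.
        broker.subscribe("doc:42", self.deliver)
        self.broker = broker

    def deliver(self, message):
        self.clients.append(message)

    def on_client_change(self, change):
        # Instead of broadcasting only to local clients, publish to the
        # broker; the subscription fans it out to every server, this one
        # included, which keeps the delivery path uniform.
        self.broker.publish("doc:42", change)

broker = Broker()
s1, s2 = Server("s1", broker), Server("s2", broker)
s1.on_client_change({"op": "insert", "pos": 0, "text": "hi"})
assert s2.clients == s1.clients  # clients on both servers see the change
```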
3. Polling-Based Cross-Server Sync (High Severity)
Problem: 2-second polling creates:
- Data loss window: If a server crashes, changes made in the last 2 seconds are lost
- Inconsistency: Different servers have different document states for up to 2 seconds
- Scalability bottleneck: every server polls every active document on a fixed interval, so polling load grows with servers × documents regardless of actual activity
Solution: Replace polling with real-time database change streams
- Use PostgreSQL logical replication or triggers to push changes to Redis
- Servers subscribe to Redis streams instead of polling
Trade-offs:
- Database load: Logical replication adds overhead to PostgreSQL
- Complexity: More complex deployment and monitoring
- Eventual consistency: Still not truly real-time, but much better than polling
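The difference can be sketched with a plain queue standing in for a Redis Stream fed by logical replication: the consumer blocks until a change arrives instead of waking every 2 seconds. Names and message shapes are illustrative:

```python
import queue
import threading

# Stand-in for a Redis Stream fed by PostgreSQL logical replication.
change_stream = queue.Queue()
received = []

def consumer():
    # Blocks until a change arrives -- no 2-second staleness window.
    while True:
        change = change_stream.get()
        if change is None:  # shutdown sentinel for this demo
            break
        received.append(change)

t = threading.Thread(target=consumer)
t.start()
change_stream.put({"doc": 42, "op": "insert"})  # pushed, not polled
change_stream.put(None)
t.join()
assert received == [{"doc": 42, "op": "insert"}]
```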
4. Full HTML Snapshots Every 30 Seconds (High Severity)
Problem:
- Storage bloat: HTML snapshots are huge compared to operation logs
- Network overhead: Sending entire documents wastes bandwidth
- Merge impossibility: Can't reconstruct intermediate states for proper conflict resolution
- Performance: Large writes to database every 30 seconds per active document
Solution: Store operation logs (deltas) instead of snapshots
- Record each atomic change as a structured operation
- Reconstruct document state by applying operations in order
- Create periodic snapshots only for performance optimization
Trade-offs:
- Read complexity: Need to apply operation history to get current state
- Storage: Still need occasional snapshots to avoid replaying long histories
- Migration complexity: Existing HTML snapshots need conversion
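A minimal sketch of the delta-log approach follows; the operation format and snapshot interval are illustrative, not a spec:

```python
class OpLog:
    """Operation log with periodic snapshots (sketch).

    Each change is stored as a small delta; the document is rebuilt by
    replaying only the ops recorded after the most recent snapshot.
    """
    SNAPSHOT_EVERY = 100  # hypothetical tuning knob

    def __init__(self):
        self.ops = []         # [(kind, pos, payload), ...]
        self.snapshot = ""    # last materialized state
        self.snapshot_at = 0  # number of ops folded into the snapshot

    @staticmethod
    def apply(text, op):
        kind, pos, payload = op
        if kind == "insert":
            return text[:pos] + payload + text[pos:]
        if kind == "delete":  # payload is a length here
            return text[:pos] + text[pos + payload:]
        raise ValueError(kind)

    def record(self, op):
        self.ops.append(op)
        if len(self.ops) - self.snapshot_at >= self.SNAPSHOT_EVERY:
            # Fold history into a snapshot so reads stay cheap.
            self.snapshot = self.current()
            self.snapshot_at = len(self.ops)

    def current(self):
        text = self.snapshot
        for op in self.ops[self.snapshot_at:]:
            text = self.apply(text, op)
        return text

log = OpLog()
log.record(("insert", 0, "Hello"))
log.record(("insert", 5, " world"))
log.record(("delete", 0, 1))
assert log.current() == "ello world"
```

Because every intermediate state is reconstructible, this is also the representation OT/CRDT conflict resolution needs.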
5. JWT in localStorage with 24-hour Expiry (Medium-High Severity)
Problem:
- XSS vulnerability: localStorage is accessible via JavaScript, making tokens stealable
- No revocation: Compromised tokens remain valid for 24 hours
- Session management: Can't easily log out users or handle password changes
Solution: Use HttpOnly cookies with shorter expiry + refresh tokens
- Store access tokens in HttpOnly cookies (inaccessible to JavaScript)
- Use 15-minute access tokens with refresh tokens stored securely
- Implement token revocation on logout/password change
Trade-offs:
- CSRF protection: Need additional CSRF tokens for state-changing requests
- Complexity: More complex auth flow with refresh token rotation
- Mobile compatibility: Slightly more complex for mobile apps
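The rotation and revocation flow can be sketched as follows. Setting the actual `Set-Cookie: ...; HttpOnly; Secure; SameSite` attributes is the web framework's job; the class and store names here are hypothetical:

```python
import secrets
import time

ACCESS_TTL = 15 * 60  # 15-minute access tokens

class AuthStore:
    """Server-side session store (sketch). Access tokens are short-lived;
    refresh tokens live server-side so they can be revoked immediately."""
    def __init__(self):
        self.refresh_tokens = {}  # refresh_token -> user_id
        self.access_tokens = {}   # access_token -> (user_id, expires_at)

    def login(self, user_id):
        refresh = secrets.token_urlsafe(32)
        self.refresh_tokens[refresh] = user_id
        return refresh, self._issue_access(user_id)

    def _issue_access(self, user_id):
        token = secrets.token_urlsafe(32)
        self.access_tokens[token] = (user_id, time.time() + ACCESS_TTL)
        return token

    def refresh(self, refresh_token):
        # Rotation: the old refresh token is consumed, a new one issued,
        # so a stolen-and-replayed token is detectable.
        user_id = self.refresh_tokens.pop(refresh_token, None)
        if user_id is None:
            raise PermissionError("revoked or unknown refresh token")
        new_refresh = secrets.token_urlsafe(32)
        self.refresh_tokens[new_refresh] = user_id
        return new_refresh, self._issue_access(user_id)

    def revoke_user(self, user_id):
        # Logout / password change: kill every session for the user.
        self.refresh_tokens = {
            t: u for t, u in self.refresh_tokens.items() if u != user_id
        }

store = AuthStore()
refresh1, access1 = store.login("alice")
refresh2, access2 = store.refresh(refresh1)
store.revoke_user("alice")
try:
    store.refresh(refresh2)
    revoked = False
except PermissionError:
    revoked = True
assert revoked  # compromise window is now one access-token TTL, not 24h
```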
6. CDN Caching API Responses (Critical for Real-time)
Problem: Caching API responses for 5 minutes is incompatible with real-time collaboration:
- Users see stale document data when loading the page
- Conflicts between cached state and real-time WebSocket updates
- Inconsistent user experience across page reloads
Solution: Don't cache API responses for document endpoints
- Only cache static assets (JS, CSS, images) via CDN
- Document data should always come fresh from the database
- Use proper cache headers (Cache-Control: no-store) for API endpoints
Trade-offs:
- Database load: More direct database queries
- Latency: Slightly slower initial document load
- Cost: Higher origin server load
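A sketch of the per-route cache policy; the paths and max-age values are illustrative, and the `immutable` directive is only safe when static filenames are content-hashed:

```python
def cache_headers(path: str) -> dict:
    """Pick Cache-Control per route (sketch): aggressive caching for
    fingerprinted static assets, no-store for document API responses."""
    if path.startswith("/static/"):
        # Safe only because filenames are content-hashed (fingerprinted):
        # a new deploy produces a new URL, never a stale hit.
        return {"Cache-Control": "public, max-age=31536000, immutable"}
    if path.startswith("/api/documents/"):
        # Real-time data must never be served from the CDN.
        return {"Cache-Control": "no-store"}
    return {"Cache-Control": "no-cache"}

assert cache_headers("/api/documents/42")["Cache-Control"] == "no-store"
assert "immutable" in cache_headers("/static/app.9f3c.js")["Cache-Control"]
```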
7. Database as Single Source of Truth with High Write Load
Problem: Every keystroke writes to PostgreSQL, creating:
- Write bottleneck: PostgreSQL struggles with high-frequency small writes
- Lock contention: Multiple servers writing to same document rows
- Scaling limits: Vertical scaling of PostgreSQL has hard limits
Solution: Queue-based write architecture
- Use message queue (Redis Streams, Kafka, or RabbitMQ) to buffer writes
- Dedicated workers process operations and update database
- Implement write coalescing to batch rapid successive changes
Trade-offs:
- Complexity: Additional system components to manage
- Eventual consistency: Database may lag behind real-time state
- Failure handling: Need to handle queue failures and message loss
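Write coalescing can be sketched as below; the queue structure and merge rule are illustrative, and a real worker would flush on a timer or batch-size threshold and write the result to PostgreSQL:

```python
class CoalescingQueue:
    """Buffer per-keystroke ops and flush one coalesced write per document
    (sketch). Adjacent inserts are merged so the database sees one row
    update instead of one per keystroke."""
    def __init__(self):
        self.pending = {}  # doc_id -> list of ops
        self.flushed = []  # what a worker would write to PostgreSQL

    def enqueue(self, doc_id, op):
        self.pending.setdefault(doc_id, []).append(op)

    def flush(self):
        # Called by a worker on a timer or when a batch size is reached.
        for doc_id, ops in self.pending.items():
            self.flushed.append((doc_id, self.coalesce(ops)))
        self.pending.clear()

    @staticmethod
    def coalesce(ops):
        # Merge consecutive contiguous inserts into one insert op.
        merged = []
        for op in ops:
            if (merged and op[0] == "insert" and merged[-1][0] == "insert"
                    and op[1] == merged[-1][1] + len(merged[-1][2])):
                prev = merged.pop()
                merged.append(("insert", prev[1], prev[2] + op[2]))
            else:
                merged.append(op)
        return merged

q = CoalescingQueue()
for i, ch in enumerate("hello"):            # five keystrokes arrive...
    q.enqueue("doc-42", ("insert", i, ch))
q.flush()
assert q.flushed == [("doc-42", [("insert", 0, "hello")])]  # ...one write
```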
8. Document Partitioning by Organization ID (Potential Issue)
Problem:
- Hot partitions: Popular organizations create single-server bottlenecks
- Cross-partition queries: Impossible to search across organizations efficiently
- Uneven load: Some servers handle much more traffic than others
Solution: Fine-grained partitioning + consistent hashing
- Partition by document ID using consistent hashing
- Implement dynamic load balancing that can move hot documents between servers
- Use distributed coordination (etcd/ZooKeeper) for partition management
Trade-offs:
- Complexity: Much more complex routing logic
- Cross-document operations: Harder to implement features like document linking
- Operational overhead: Need sophisticated monitoring and rebalancing
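Consistent hashing over document IDs can be sketched as below; the vnode count and server names are illustrative. The key property is that adding a server remaps only roughly 1/N of documents rather than reshuffling everything:

```python
import bisect
import hashlib

class HashRing:
    """Consistent-hash ring over document IDs (sketch). Virtual nodes
    smooth out the load across servers."""
    def __init__(self, servers, vnodes=100):
        self.ring = []  # sorted list of (hash, server)
        for server in servers:
            self.add(server, vnodes)

    @staticmethod
    def _hash(key):
        return int(hashlib.sha256(key.encode()).hexdigest(), 16)

    def add(self, server, vnodes=100):
        for i in range(vnodes):
            bisect.insort(self.ring, (self._hash(f"{server}#{i}"), server))

    def server_for(self, doc_id):
        # First vnode clockwise from the document's hash owns it.
        h = self._hash(doc_id)
        idx = bisect.bisect(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]

docs = [f"doc-{i}" for i in range(1000)]
ring = HashRing(["s1", "s2", "s3"])
before = {d: ring.server_for(d) for d in docs}
ring.add("s4")
after = {d: ring.server_for(d) for d in docs}
moved = sum(before[d] != after[d] for d in docs)
assert 0 < moved < 500  # roughly 1/4 of documents remap, not all of them
```

Hot-document migration and the etcd/ZooKeeper coordination layer sit on top of this routing primitive.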
9. No Graceful WebSocket Connection Handling
Problem:
- Connection drops: Lost changes when users have temporary network issues
- Duplicate operations: Reconnection can cause replay of operations
- State synchronization: Reconnected clients may be out of sync
Solution: Implement operation acknowledgment and replay buffers
- Assign sequence numbers to operations
- Maintain replay buffer on server for recent operations
- On reconnection, client requests missed operations since last acknowledged sequence
Trade-offs:
- Memory usage: Need to store operation history per client
- Complexity: Additional protocol layer on top of WebSockets
- Latency: Slight overhead for acknowledgment protocol
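The acknowledgment/replay protocol can be sketched as follows (buffer size and message shapes are illustrative):

```python
from collections import deque

class DocSession:
    """Server-side replay buffer for one document (sketch)."""
    def __init__(self, buffer_size=1000):
        self.seq = 0
        self.buffer = deque(maxlen=buffer_size)  # recent (seq, op) pairs

    def append(self, op):
        # Every accepted op gets a monotonically increasing sequence
        # number, which the client acknowledges.
        self.seq += 1
        self.buffer.append((self.seq, op))
        return self.seq

    def since(self, last_acked_seq):
        """Ops a reconnecting client missed after its last acked seq.
        Sequence numbers also make redelivery idempotent: the client
        drops anything at or below its last applied seq."""
        return [(s, op) for s, op in self.buffer if s > last_acked_seq]

session = DocSession()
session.append({"op": "insert", "pos": 0, "text": "a"})
session.append({"op": "insert", "pos": 1, "text": "b"})
# Client saw seq 1, then dropped; on reconnect it asks for everything
# after its last acknowledged sequence number.
missed = session.since(1)
assert [s for s, _ in missed] == [2]
```

A client whose last ack predates the oldest buffered op falls back to a full document resync.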
10. Lack of Operational Monitoring and Alerting
Problem: The architecture lacks visibility into:
- WebSocket connection counts per server
- Operation processing latency
- Database write queue depth
- Conflict resolution frequency
Solution: Comprehensive observability stack
- Metrics: Track operations per second, connection counts, error rates
- Tracing: End-to-end tracing of operation flow
- Logging: Structured logs with correlation IDs
- Alerting: Alert on high conflict rates, slow operations, connection drops
Trade-offs:
- Cost: Additional infrastructure for monitoring
- Complexity: More systems to maintain
- Performance: Slight overhead from instrumentation
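A minimal sketch of the metrics-plus-correlated-logging idea; the counter names and log fields are illustrative, and a real deployment would use Prometheus/OpenTelemetry clients rather than in-process dicts:

```python
import json
import time
import uuid
from collections import Counter

metrics = Counter()  # e.g. ops_total, conflicts_total
logs = []

def handle_operation(doc_id, op, correlation_id=None):
    """Process one op with a metric and a structured, correlated log line
    (sketch). The correlation id ties together client, server, and
    queue-worker log entries for the same operation."""
    correlation_id = correlation_id or str(uuid.uuid4())
    start = time.monotonic()
    metrics["ops_total"] += 1
    # ... apply the operation here ...
    logs.append(json.dumps({
        "event": "op_applied",
        "doc_id": doc_id,
        "correlation_id": correlation_id,
        "latency_ms": round((time.monotonic() - start) * 1000, 2),
    }))
    return correlation_id

cid = handle_operation("doc-42", {"op": "insert"})
assert metrics["ops_total"] == 1
assert json.loads(logs[0])["correlation_id"] == cid
```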
Summary Priority Recommendations
Immediate (Critical):
1. Replace client-clock LWW with OT/CRDT
2. Implement Redis Pub/Sub for cross-server broadcasting
3. Remove CDN caching of API responses
4. Switch from HTML snapshots to operation logs
High Priority:
5. Fix authentication security (HttpOnly cookies)
6. Replace polling with real-time change streams
7. Implement operation acknowledgment for WebSockets
Medium Priority:
8. Add queue-based write architecture
9. Improve partitioning strategy
10. Implement comprehensive observability
The current architecture would work for a basic demo but would fail catastrophically under real-world collaborative editing scenarios due to the fundamental flaws in conflict resolution and real-time synchronization.