4 models have responded to the "Debug This Architecture" challenge. Compare their approaches side-by-side on RIVAL. This response is part of RIVAL's open dataset of 5,600+ AI model responses.
Kimi K2 5's response to the 'Debug This Architecture' challenge.
Here's the architecture for a real-time collaborative document editor (like Google Docs):

**System Overview:**
- Frontend: React SPA with WebSocket connections
- Backend: Node.js API servers behind a load balancer (round-robin)
- Database: PostgreSQL for document storage, Redis for session cache
- Real-time: Each API server maintains its own WebSocket connections to clients
- Sync Strategy: Last-write-wins with timestamps from client clocks
- Storage: Documents saved as full HTML snapshots every 30 seconds
- Auth: JWT tokens with 24-hour expiry, stored in localStorage
- CDN: CloudFront for static assets, also caches API responses for 5 minutes

**Data Flow:**
1. User types → change event sent via WebSocket to their connected server
2. Server writes change to PostgreSQL
3. Server broadcasts change to all WebSocket clients connected to THAT server
4. Other servers poll PostgreSQL every 2 seconds for changes
5. Conflict resolution: if two users edit the same paragraph, last timestamp wins

**Scaling Plan:**
- Horizontal scaling by adding more API servers
- Database read replicas for read-heavy operations
- Document partitioning by organization ID

Identify all potential failure modes, race conditions, and scaling bottlenecks in this architecture. For each issue found, propose a specific solution with trade-offs.
This architecture contains several critical flaws that would prevent real-time collaboration, create data consistency issues, and cause catastrophic failures at scale. Here are the specific problems and solutions:
Problem: With round-robin load balancing and server-isolated WebSocket broadcasts, users connected to different API servers won't see each other's changes in real-time. Server A broadcasts only to its clients, while Server B discovers changes by polling PostgreSQL every 2 seconds.
Race Condition: User A (Server 1) and User B (Server 2) edit simultaneously. User A sees their change immediately; User B sees it 2 seconds later. During that window, User B edits stale content, creating a conflict that appears as a "jump" when the merge happens.
Solution: Implement Redis Pub/Sub (or NATS/RabbitMQ) as a message bus between API servers.
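The fan-out idea can be sketched without a running broker. The snippet below is a minimal illustration, not a production design: `InMemoryBus` is a hypothetical in-process stand-in for Redis Pub/Sub, `ApiServer` and its methods are invented for the example, and client WebSocket connections are modelled as plain lists. Only the `doc:{id}:changes` channel naming comes from the response itself.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class InMemoryBus:
    """Hypothetical stand-in for Redis Pub/Sub: fans a message out to all subscribers."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, channel: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[channel].append(handler)

    def publish(self, channel: str, message: dict) -> None:
        for handler in list(self._subscribers[channel]):
            handler(message)


class ApiServer:
    """One API server holding its own clients (each modelled as an inbox list)."""

    def __init__(self, bus: InMemoryBus) -> None:
        self.bus = bus
        self.clients_by_doc: Dict[str, Dict[str, List[dict]]] = {}

    def connect(self, client_id: str, doc_id: str) -> None:
        if doc_id not in self.clients_by_doc:
            self.clients_by_doc[doc_id] = {}
            # Subscribe once per document, not once per client.
            self.bus.subscribe(
                f"doc:{doc_id}:changes",
                lambda message, d=doc_id: self._fan_out(d, message),
            )
        self.clients_by_doc[doc_id][client_id] = []

    def on_client_change(self, doc_id: str, change: dict) -> None:
        # Publish instead of broadcasting locally; every server, including
        # this one, receives the change through its own subscription.
        self.bus.publish(f"doc:{doc_id}:changes", change)

    def _fan_out(self, doc_id: str, message: dict) -> None:
        for inbox in self.clients_by_doc[doc_id].values():
            inbox.append(message)


bus = InMemoryBus()
server_1, server_2 = ApiServer(bus), ApiServer(bus)
server_1.connect("alice", "42")
server_2.connect("bob", "42")
server_1.on_client_change("42", {"op": "insert", "pos": 0, "text": "H"})
```

After the last line, Bob's inbox on `server_2` holds the change even though it was submitted through `server_1`, which is the behavior the polling design cannot provide in real time.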
Channel: `doc:{id}:changes`
Trade-offs:
Problem: Using client timestamps for "last-write-wins" is unreliable. If User A's laptop clock is 5 minutes fast (common with manual time changes or NTP failures), their edits permanently overwrite User B's concurrent edits, even if B typed later.
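Race Condition:
T+0: User A (clock +5min) edits the paragraph; the edit is stamped roughly five minutes in the future
T+1: User B (correct time) edits the same paragraph; the edit is stamped with the true time
T+2: Server compares the two client timestamps (A's is later than B's)
T+3: A's edit "wins" despite being chronologically first, and B's later edit is discarded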
Solution: Implement Hybrid Logical Clocks (HLC) or server-assigned monotonic sequence numbers.
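The server-assigned sequence number option can be illustrated with a small sketch. It deliberately shows the simpler of the two options named above, not a Hybrid Logical Clock. `DocumentSequencer`, the field names, and the timestamps are hypothetical, and a real deployment would need one authoritative sequencer per document (for example a database sequence) rather than an in-process counter.

```python
import itertools
from collections import defaultdict
from typing import Dict, Iterator


class DocumentSequencer:
    """Assigns a per-document, strictly increasing sequence number on arrival."""

    def __init__(self) -> None:
        self._counters: Dict[str, Iterator[int]] = defaultdict(lambda: itertools.count(1))

    def assign(self, doc_id: str, edit: dict) -> dict:
        # The client timestamp is kept for reference but never used for ordering.
        return {**edit, "seq": next(self._counters[doc_id])}


sequencer = DocumentSequencer()
# User A's clock runs five minutes (300 s) fast; A's edit reaches the server first.
edit_a = sequencer.assign("42", {"user": "A", "client_ts": 1_000_300, "text": "from A"})
edit_b = sequencer.assign("42", {"user": "B", "client_ts": 1_000_001, "text": "from B"})

winner_by_client_clock = max([edit_a, edit_b], key=lambda e: e["client_ts"])
winner_by_server_seq = max([edit_a, edit_b], key=lambda e: e["seq"])
```

Ordering by `client_ts` picks A's earlier edit because of the skewed clock, while ordering by `seq` picks B's edit, which actually arrived last.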
`version = max(server_timestamp, client_timestamp) + 1`
Trade-offs:
Problem: Writing every keystroke to PostgreSQL creates a write storm. 1,000 concurrent users typing 3 characters/second generate 3,000 writes/second, and read replicas fall behind under that write load, adding replication delay.
Failure Mode: During traffic spikes, PostgreSQL connection pool exhaustion causes cascading failures. The 2-second polling from N servers creates N/2 queries per second per document.
Solution: Implement Event Sourcing with Kafka + In-Memory CRDT State.
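Independently of which log or broker is chosen, the core of this fix is to stop issuing one database write per keystroke. The sketch below shows only that batching idea and is not a Kafka integration: `WriteBuffer`, `flush_fn`, and `max_batch` are hypothetical names, and the flush target is a plain list standing in for the durable store.

```python
from typing import Callable, List


class WriteBuffer:
    """Collects operations in memory and hands them to the store in batches."""

    def __init__(self, flush_fn: Callable[[List[dict]], None], max_batch: int = 100) -> None:
        self._flush_fn = flush_fn
        self._max_batch = max_batch
        self._pending: List[dict] = []

    def append(self, op: dict) -> None:
        self._pending.append(op)
        if len(self._pending) >= self._max_batch:
            self.flush()

    def flush(self) -> None:
        if self._pending:
            # Swap the list out first so a failing flush_fn cannot double-send.
            batch, self._pending = self._pending, []
            self._flush_fn(batch)


batches: List[List[dict]] = []
write_buffer = WriteBuffer(flush_fn=batches.append, max_batch=100)
for i in range(250):
    write_buffer.append({"seq": i, "char": "x"})
write_buffer.flush()  # a real server would also flush on a timer
batch_sizes = [len(batch) for batch in batches]
```

Here 250 keystrokes become three writes instead of 250. The cost is that operations still sitting in the buffer are lost if the process dies before a flush, which is why the response pairs batching with a durable log.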
Trade-offs:
Problem: CloudFront caches API responses for 5 minutes, so a client that fetches a document through the API can be served content that is up to 5 minutes stale.
Solution: Disable caching for all /api/* and /ws/* routes. Use CDN only for static assets (React bundle, CSS, images). Implement separate domains: static.example.com (CDN) vs api.example.com (no cache).
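One way to express this split is a per-path `Cache-Control` policy. The function below is a sketch under assumptions: the name `cache_control_for` is hypothetical, and the long-lived `immutable` policy for static files assumes those files have content-hashed names, which the response does not state.

```python
def cache_control_for(path: str) -> str:
    """Return a Cache-Control value for a request path."""
    if path.startswith("/api/") or path.startswith("/ws/"):
        # Dynamic routes: neither the CDN nor the browser may store the response.
        return "no-store"
    # Static assets: assumed to be content-hashed, so safe to cache for a year.
    return "public, max-age=31536000, immutable"


api_policy = cache_control_for("/api/documents/42")
static_policy = cache_control_for("/static/app.3f9a.js")
```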
Trade-offs:
Problem: Storing JWT in localStorage makes it vulnerable to XSS attacks. A malicious script can steal the token and impersonate the user for 24 hours.
Solution: Use HttpOnly, Secure, SameSite=Strict cookies for the session ID.
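As an illustration of the cookie attributes named above, the following sketch builds a `Set-Cookie` value with Python's standard `http.cookies` module. The cookie name `session_id` and the helper `build_session_cookie` are assumptions made for the example.

```python
from http.cookies import SimpleCookie


def build_session_cookie(session_id: str) -> str:
    """Build the value of a Set-Cookie header for a session cookie."""
    cookie = SimpleCookie()
    cookie["session_id"] = session_id
    morsel = cookie["session_id"]
    morsel["httponly"] = True       # not readable from JavaScript
    morsel["secure"] = True         # only sent over HTTPS
    morsel["samesite"] = "Strict"   # not sent on cross-site requests
    morsel["path"] = "/"
    return morsel.OutputString()


header_value = build_session_cookie("abc123")
```

Because the cookie is `HttpOnly`, an injected script cannot read it the way it can read `localStorage`, although XSS can still issue requests from within the page.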
Trade-offs:
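`/refresh` endpoint
Problem: With round-robin load balancing and no sticky sessions, a client that reconnects can land on a different API server than the one that previously held its WebSocket connection.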
Solution: Implement IP Hash or Cookie-based sticky sessions on the load balancer.
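IP-hash routing is normally configured in the load balancer itself; the sketch below only shows the underlying idea. `pick_server` and the server names are hypothetical, and the IP address is from a documentation range.

```python
import hashlib
from typing import List


def pick_server(client_ip: str, servers: List[str]) -> str:
    """Map a client IP to a server deterministically."""
    # A stable hash is used rather than Python's built-in hash(), which is
    # randomized per process for strings.
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


servers = ["api-1", "api-2", "api-3"]
first_choice = pick_server("203.0.113.7", servers)
second_choice = pick_server("203.0.113.7", servers)
```

The same IP always maps to the same server while the pool is unchanged. With plain modulo hashing, adding or removing a server remaps most clients; consistent hashing reduces that churn.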
Trade-offs:
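Problem: Full HTML snapshots every 30 seconds rewrite the entire document no matter how small the change, and they carry no per-edit history that could be replayed or merged.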
Solution: Store operation logs (deltas) not snapshots.
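To make the delta idea concrete, here is a minimal sketch of replaying an operation log into document text. The operation shape (`type`, `pos`, `text`, `length`) is an assumption for illustration, not a format the response specifies, and it works on plain text rather than HTML.

```python
from typing import Dict, List


def apply_op(text: str, op: Dict) -> str:
    """Apply a single insert or delete operation to the text."""
    pos = op["pos"]
    if op["type"] == "insert":
        return text[:pos] + op["text"] + text[pos:]
    if op["type"] == "delete":
        return text[:pos] + text[pos + op["length"]:]
    raise ValueError(f"unknown operation type: {op['type']}")


def replay(ops: List[Dict], base: str = "") -> str:
    """Rebuild the document by applying every logged operation in order."""
    text = base
    for op in ops:
        text = apply_op(text, op)
    return text


log = [
    {"type": "insert", "pos": 0, "text": "Hello"},
    {"type": "insert", "pos": 5, "text": " world"},
    {"type": "delete", "pos": 0, "length": 1},
    {"type": "insert", "pos": 0, "text": "J"},
]
document = replay(log)
```

Replaying a long log on every load gets slow, so a design like this usually keeps occasional snapshots as a `base` and replays only the operations recorded after it.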
Trade-offs:
Problem: If Server 1 crashes after writing to PostgreSQL but before broadcasting via Redis Pub/Sub, the edit is persisted but never reaches other users. They continue editing an old version, creating a "fork" in the document history.
Solution: Implement Server-Sent Events (SSE) for critical updates + Vector Clocks for version tracking.
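The recovery half of this fix relies on clients being able to ask for what they missed. The sketch below uses a single per-document version counter instead of vector clocks, which is a simplification that only holds if one authority orders all operations for a document. `OpLog` and its methods are hypothetical; `last_seen_version` follows the name used in the response.

```python
from typing import List


class OpLog:
    """Append-only log where version N is the N-th operation (1-based)."""

    def __init__(self) -> None:
        self._ops: List[dict] = []

    def append(self, op: dict) -> int:
        self._ops.append(op)
        return len(self._ops)  # version assigned to this operation

    def since(self, last_seen_version: int) -> List[dict]:
        # Everything after the client's last acknowledged version.
        return self._ops[last_seen_version:]


op_log = OpLog()
for ch in "abcde":
    op_log.append({"type": "insert", "text": ch})

# A client that last saw version 3 reconnects and asks for the gap.
missed = op_log.since(3)
```

A client whose broadcast was dropped during the crash receives the persisted operations on its next request, so it does not keep editing a stale version.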
Clients send `last_seen_version` to server
Trade-offs:
Problem: As you add more API servers, the 2-second polling interval creates a thundering herd on PostgreSQL. With 100 servers polling 1000 active documents: 100 × 1000 / 2 = 50,000 queries/second just for polling.
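The arithmetic above generalizes to a one-line formula, assuming every server polls every active document once per interval. The helper name `polling_qps` is hypothetical.

```python
def polling_qps(servers: int, active_docs: int, interval_s: float) -> float:
    """Queries per second when each server polls each active document once per interval."""
    return servers * active_docs / interval_s


current_load = polling_qps(servers=100, active_docs=1000, interval_s=2.0)
doubled_fleet = polling_qps(servers=200, active_docs=1000, interval_s=2.0)
```

Polling load grows linearly with the number of servers, so doubling the fleet doubles the query rate without serving a single extra edit.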
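Solution: Reverse the polling direction so that servers are notified of changes instead of repeatedly asking for them. Use Redis Keyspace Notifications or PostgreSQL LISTEN/NOTIFY (a long-standing built-in feature that requires neither Postgres 14+ nor logical replication).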
`doc:{id}:notify`
`SUBSCRIBE` instead of polling
Trade-offs:
Problem: "Last-write-wins at paragraph level" provides terrible UX. If two users type in the same paragraph simultaneously, one user's entire edit disappears.
Solution: Integrate Yjs or Automerge protocol.
Trade-offs:
| Priority | Fix | Impact |
|---|---|---|
| P0 | Replace polling with Redis Pub/Sub | Enables real-time collaboration |
| P0 | Remove CDN caching on API | Prevents stale API responses |
| P0 | Move JWT to HttpOnly cookies | Security compliance |
| P1 | Implement CRDTs (Yjs) | Correct conflict resolution |
| P1 | Buffer writes in Redis, batch to Postgres | Supports >10k concurrent users |
| P2 | Add sticky sessions | Prevents reconnection storms |
Without these changes, the system will fail under load of ~100 concurrent users due to database contention and will provide a broken collaboration experience (2-second delays, lost edits).