Mercury's response to the 'Debug This Architecture' challenge.
Here's the architecture for a real-time collaborative document editor (like Google Docs):

**System Overview:**
- Frontend: React SPA with WebSocket connections
- Backend: Node.js API servers behind a load balancer (round-robin)
- Database: PostgreSQL for document storage, Redis for session cache
- Real-time: Each API server maintains its own WebSocket connections to clients
- Sync Strategy: Last-write-wins with timestamps from client clocks
- Storage: Documents saved as full HTML snapshots every 30 seconds
- Auth: JWT tokens with 24-hour expiry, stored in localStorage
- CDN: CloudFront for static assets, also caches API responses for 5 minutes

**Data Flow:**
1. User types → change event sent via WebSocket to their connected server
2. Server writes change to PostgreSQL
3. Server broadcasts change to all WebSocket clients connected to THAT server
4. Other servers poll PostgreSQL every 2 seconds for changes
5. Conflict resolution: if two users edit the same paragraph, last timestamp wins

**Scaling Plan:**
- Horizontal scaling by adding more API servers
- Database read replicas for read-heavy operations
- Document partitioning by organization ID

Identify all potential failure modes, race conditions, and scaling bottlenecks in this architecture. For each issue found, propose a specific solution with trade-offs.
Below is a systematic review of the described real‑time collaborative editor. For each problem I list the issue, its impact, a proposed fix, and the trade‑offs of that fix.
**Client‑Side Issues**

| Issue | Impact | Fix | Trade‑offs |
|---|---|---|---|
| Clock drift – client timestamps are used for LWW. | Two users editing the same paragraph can produce nondeterministic ordering; the “last” edit may not reflect real intent. | • Use server‑generated logical timestamps (e.g., Lamport clocks or a monotonically increasing counter per document). <br>• Store the server‑generated timestamp in the change payload. | Requires a tiny round‑trip (or embed the timestamp in the WebSocket message) but eliminates reliance on client clocks. |
| LocalStorage token persistence – vulnerable to XSS and can be cleared by the user. | Unexpected logout, loss of session state, possible replay attacks. | • Move JWT to an HttpOnly secure cookie (same‑site). <br>• Refresh token flow (short‑lived access token + long‑lived refresh token). | Adds an extra token and refresh endpoint, but improves security and resilience to token loss. |
| WebSocket reconnection handling – not described. | Users may lose the connection and see stale data or duplicate edits. | • Implement exponential back‑off reconnection with state sync on reconnect (e.g., request the latest document snapshot); see the sketch below this table. | Slightly more client code, but essential for reliability. |
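A minimal sketch of the reconnection handling above, assuming a browser `WebSocket` client; the `/api/documents/:id/snapshot` endpoint used for resync is a hypothetical name, not part of the described architecture:

```typescript
// Minimal reconnecting WebSocket wrapper with exponential back-off and resync.
class ReconnectingDocSocket {
  private ws?: WebSocket;
  private attempt = 0;

  constructor(
    private wsUrl: string,
    private docId: string,
    private onChange: (change: unknown) => void,
  ) {}

  connect(): void {
    this.ws = new WebSocket(`${this.wsUrl}?doc=${this.docId}`);

    this.ws.onopen = async () => {
      this.attempt = 0;        // reset back-off after a successful connection
      await this.resync();     // pull the latest server state before resuming edits
    };

    this.ws.onmessage = (ev) => this.onChange(JSON.parse(ev.data));

    this.ws.onclose = () => {
      // Exponential back-off with jitter, capped at 30 s.
      const delay = Math.min(30_000, 500 * 2 ** this.attempt * (0.5 + Math.random()));
      this.attempt += 1;
      setTimeout(() => this.connect(), delay);
    };
  }

  // Hypothetical snapshot endpoint: returns the latest server-side document state.
  private async resync(): Promise<void> {
    const res = await fetch(`/api/documents/${this.docId}/snapshot`, { credentials: 'include' });
    this.onChange(await res.json());
  }
}
```

The jitter keeps a fleet of clients from reconnecting in lock‑step after a server restart.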
**Backend & Real‑Time Sync Issues**

| Issue | Impact | Fix | Trade‑offs |
|---|---|---|---|
| Single‑point WebSocket “owner” per server – each server only broadcasts to its own connections. | Changes made on one server are not instantly visible to clients on other servers. | • Use a publish‑subscribe system (Redis Pub/Sub, NATS, or Kafka) to fan out changes across all servers. <br>• On change receipt, publish to a channel; all servers subscribe and push to their local sockets (see the first sketch below this table). | Adds a small latency (publish‑subscribe round‑trip) but removes the need for polling. |
| Polling PostgreSQL every 2 s – inefficient and noisy. | High read load on the primary (or a replica) and up to 2 s of extra latency for cross‑server sync; a timestamp‑cursor poll can also skip changes committed out of order. | • Replace polling with a change‑data‑capture (CDC) mechanism (e.g., PostgreSQL logical replication, Debezium). <br>• Or use a lightweight “last‑updated” timestamp column with LISTEN/NOTIFY on change (see the second sketch below this table). | Requires extra infrastructure (CDC pipeline) or a small DB change, but eliminates polling overhead and guarantees immediate propagation. |
| Last‑write‑wins (LWW) with timestamps – naïve conflict resolution. | Simultaneous edits to the same paragraph can overwrite each other, losing user intent. | • Adopt Operational Transformation (OT) or Conflict‑free Replicated Data Types (CRDTs) for the document model. <br>• If full OT/CRDT is too heavy, use a hybrid: LWW for non‑conflicting regions, and a “merge‑by‑paragraph” algorithm that preserves both edits (e.g., split the paragraph). | OT/CRDT adds complexity and overhead per edit; a hybrid approach is easier to implement but still requires careful design. |
| Document snapshot every 30 s – large writes. | Frequent full‑HTML writes can cause write amplification and slow down the DB. | • Store incremental deltas (diffs) instead of full snapshots. <br>• Keep a background job that composes deltas into a full snapshot every N minutes. | Adds delta‑storage logic but reduces write volume. |
| JWT verification on every request – stateless but may be heavy under load. | CPU overhead per request, especially when asymmetric (RSA/ECDSA) signature verification is used. | • Cache the verification key in memory and, where the trust model allows, use a fast symmetric algorithm (HS256). <br>• Or use token introspection backed by Redis to avoid full verification on each request. | Symmetric keys are faster but riskier if the shared secret is compromised; introspection adds a Redis lookup per request. |
| CloudFront caching – API responses are cached for 5 minutes and static assets can be served stale after a deployment. | Clients may read stale API data (a problem for a real‑time editor), and users may see old JS/CSS after a bug fix. | • Exclude dynamic API routes from CDN caching (e.g., Cache-Control: no-store). <br>• Use versioned URLs (e.g., app.v1.2.3.js) or Cache-Control: max-age=0, must-revalidate for critical assets. | Less edge caching for the API; versioned URLs require build tooling but guarantee freshness. |
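A sketch of the cross‑server fan‑out from the first row above, assuming the `ioredis` and `ws` libraries; the channel names, the query‑string document id, and the message shape are illustrative:

```typescript
import Redis from 'ioredis';
import { WebSocketServer, WebSocket } from 'ws';

// ioredis needs a dedicated connection for subscribing, so use two clients.
const pub = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');
const sub = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');

const wss = new WebSocketServer({ port: 8080 });

// Which local sockets are watching which document (illustrative bookkeeping).
const socketsByDoc = new Map<string, Set<WebSocket>>();

wss.on('connection', (socket, req) => {
  // Hypothetical convention: document id passed as ?doc=<id> on the WS URL.
  const docId = new URL(req.url ?? '/', 'http://placeholder').searchParams.get('doc') ?? '';
  if (!socketsByDoc.has(docId)) socketsByDoc.set(docId, new Set());
  socketsByDoc.get(docId)!.add(socket);
  socket.on('close', () => socketsByDoc.get(docId)?.delete(socket));
});

// 1. When a local client submits a change: persist it, then publish to all servers.
export async function broadcastChange(docId: string, change: object): Promise<void> {
  // ...write the change to PostgreSQL first, then fan it out:
  await pub.publish(`doc:${docId}`, JSON.stringify(change));
}

// 2. Every server subscribes and pushes incoming changes to its own clients.
sub.psubscribe('doc:*');
sub.on('pmessage', (_pattern, channel, message) => {
  const docId = channel.slice('doc:'.length);
  for (const socket of socketsByDoc.get(docId) ?? []) {
    if (socket.readyState === WebSocket.OPEN) socket.send(message);
  }
});
```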
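And a sketch of the LISTEN/NOTIFY alternative to 2‑second polling from the second row, assuming the node‑postgres (`pg`) client; the `document_changes` table, trigger, and channel names are assumptions:

```typescript
import { Client } from 'pg';

// One long-lived connection per server, dedicated to LISTEN.
const listener = new Client({ connectionString: process.env.DATABASE_URL });

export async function startChangeListener(
  onChange: (payload: { docId: string }) => void,
): Promise<void> {
  await listener.connect();

  // Assumed trigger on the changes table. NOTIFY payloads are size-limited (~8 KB),
  // so we send only the document id and let servers fetch the change if they need it.
  await listener.query(`
    CREATE OR REPLACE FUNCTION notify_doc_change() RETURNS trigger AS $$
    BEGIN
      PERFORM pg_notify('doc_changes', json_build_object('docId', NEW.document_id)::text);
      RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    DROP TRIGGER IF EXISTS doc_change_notify ON document_changes;
    CREATE TRIGGER doc_change_notify AFTER INSERT ON document_changes
      FOR EACH ROW EXECUTE FUNCTION notify_doc_change();
  `);

  await listener.query('LISTEN doc_changes');
  listener.on('notification', (msg) => {
    if (msg.payload) onChange(JSON.parse(msg.payload));
  });
}
```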
**Database Issues**

| Issue | Impact | Fix | Trade‑offs |
|---|---|---|---|
| Single write node for all changes – no sharding. | Write bottleneck as the number of concurrent edits grows; potential lock contention on the same document row. | • Partition documents by organization ID (as planned) and use a sharded Postgres cluster or a horizontally scalable distributed SQL store (e.g., CockroachDB). <br>• Use row‑level locking only when necessary; otherwise rely on optimistic concurrency with a version column (see the sketch below this table). | Sharding adds operational complexity; CockroachDB introduces different operational and performance characteristics. |
| Read replicas for read‑heavy ops – not used for change propagation. | Polling still hits the primary; read replicas are underutilized. | • Route all read‑only queries (e.g., fetching the latest snapshot) to replicas. <br>• Ensure replicas are up‑to‑date for CDC or LISTEN/NOTIFY. | Replicas may lag slightly; need to monitor replication lag. |
| Storing full HTML snapshots – large blobs. | Increased storage cost, slower backup, harder to diff. | • Store as plain text with compression (e.g., GZIP) or use a binary format (e.g., Protobuf) for deltas. | Compression adds CPU overhead; binary format may require extra tooling. |
| No explicit durability guarantee for edits – writes may be lost if a server crashes before the DB commit. | Users may lose recent changes. | • Acknowledge an edit to the client only after the PostgreSQL commit succeeds, and keep synchronous_commit = on (the default) for document tables. <br>• Or buffer edits in Redis (with AOF persistence) and replay them after a crash. | Waiting for the commit adds write latency; Redis buffering adds complexity. |
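A sketch of the optimistic‑concurrency write mentioned in the first row of this table, assuming node‑postgres; the `documents`/`document_changes` tables and the `version` column are illustrative schema, not part of the original design:

```typescript
import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Optimistic concurrency: the UPDATE succeeds only if the client's base version is
// still current; otherwise the client must fetch the latest state, rebase, and retry.
export async function applyEdit(docId: string, baseVersion: number, delta: object): Promise<boolean> {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');
    const bumped = await client.query(
      'UPDATE documents SET version = version + 1 WHERE id = $1 AND version = $2',
      [docId, baseVersion],
    );
    if (bumped.rowCount === 0) {
      // Another writer committed first (version mismatch): abort without changes.
      await client.query('ROLLBACK');
      return false;
    }
    await client.query(
      'INSERT INTO document_changes (document_id, version, delta) VALUES ($1, $2, $3)',
      [docId, baseVersion + 1, JSON.stringify(delta)],
    );
    await client.query('COMMIT');
    return true;
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  } finally {
    client.release();
  }
}
```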
**Redis / Caching Issues**

| Issue | Impact | Fix | Trade‑offs |
|---|---|---|---|
| Session cache only – no use for change propagation. | No benefit from Redis’s pub/sub capabilities. | • Use Redis Pub/Sub (or Streams) for cross‑server change broadcast, as mentioned above. | Requires Redis to be highly available (clustered). |
| Potential data loss on Redis eviction – an LRU eviction policy can evict session data under memory pressure. | Users get logged out unexpectedly. | • Use volatile-lru (evict only keys with a TTL) or noeviction with enough memory headroom, and enable Redis persistence (RDB/AOF) for critical sessions. | More memory cost; persistence adds write overhead. |
| No rate‑limiting – clients could flood the server with change events. | Server overload, increased latency. | • Implement per‑client rate limiting in the API layer (e.g., a token bucket; see the sketch below this table) or use a CDN edge rule. | Adds a small check per request; may need to tune thresholds. |
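A minimal in‑memory token‑bucket sketch for the per‑client rate limiting in the last row; the capacity and refill rate are illustrative, and a multi‑server deployment would keep the counters in Redis instead:

```typescript
// Per-client token bucket: allows a burst of `capacity` events, refilled at `ratePerSec`.
class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(private capacity = 20, private ratePerSec = 10) {
    this.tokens = capacity;
  }

  tryConsume(): boolean {
    const now = Date.now();
    this.tokens = Math.min(
      this.capacity,
      this.tokens + ((now - this.lastRefill) / 1000) * this.ratePerSec,
    );
    this.lastRefill = now;
    if (this.tokens < 1) return false;   // over the limit: drop or queue the event
    this.tokens -= 1;
    return true;
  }
}

// Usage inside the WebSocket message handler (sketch):
const buckets = new Map<string, TokenBucket>();
export function allowChange(clientId: string): boolean {
  if (!buckets.has(clientId)) buckets.set(clientId, new TokenBucket());
  return buckets.get(clientId)!.tryConsume();
}
```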
**Load Balancing Issues**

| Issue | Impact | Fix | Trade‑offs |
|---|---|---|---|
| Round‑robin routing – does not respect session affinity. | A client may reconnect to a different server after a temporary loss, causing duplicate change handling or missed updates. | • Use sticky sessions (session cookie) or a consistent hashing scheme based on document ID. | Sticky sessions can reduce pool utilization; consistent hashing requires a custom load‑balancer or service mesh. |
| No health‑check for WebSocket connections – dead servers stay in rotation and dead client sockets linger on servers. | Clients may experience dropped connections and need reconnection logic; servers hold memory for peers that are gone. | • Enable WebSocket‑aware health checks at the load balancer, and use application‑level ping/pong to detect and terminate dead connections (see the sketch below this table). | Slightly more configuration but improves reliability. |
| Scaling of WebSocket memory – each server holds all of its connections in memory. | Memory pressure when many users connect to a single server. | • Distribute connections evenly (via sticky routing) and consider a dedicated WebSocket gateway tier (e.g., NGINX or HAProxy in front of lightweight socket servers) that scales independently of the API servers. | Adds an extra service layer. |
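A sketch of the application‑level half of the health checking discussed above: the common ping/pong heartbeat pattern for the `ws` library, which reaps dead client connections (the 30‑second interval is an assumption); removing unhealthy nodes from rotation remains load‑balancer configuration:

```typescript
import { WebSocketServer, WebSocket } from 'ws';

const wss = new WebSocketServer({ port: 8080 });

// Track liveness per socket; terminate sockets that miss a heartbeat round.
const alive = new WeakMap<WebSocket, boolean>();

wss.on('connection', (socket) => {
  alive.set(socket, true);
  socket.on('pong', () => alive.set(socket, true));   // client answered the last ping
});

// Every 30 s (illustrative): kill sockets that never answered, then ping the rest.
const heartbeat = setInterval(() => {
  for (const socket of wss.clients) {
    if (alive.get(socket) === false) {
      socket.terminate();                             // dead peer: free the slot
      continue;
    }
    alive.set(socket, false);
    socket.ping();
  }
}, 30_000);

wss.on('close', () => clearInterval(heartbeat));
```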
**Conflict Resolution Issues**

| Issue | Impact | Fix | Trade‑offs |
|---|---|---|---|
| Last‑write‑wins – no context of edit intent. | Users may lose meaningful changes, especially in collaborative editing. | • Implement OT/CRDT for fine‑grained merge. <br>• If full OT/CRDT is too heavy, adopt a “paragraph‑level merge” that splits the paragraph and preserves both edits. | OT/CRDT requires a more complex client library; paragraph merge is simpler but may still conflict on overlapping edits. |
| Timestamp collisions – two edits with identical timestamps (e.g., same millisecond). | Deterministic ordering not guaranteed. | • Append a server‑generated unique sequence number to the timestamp. | Slightly more data per edit. |
| No versioning – changes can be applied out of order. | Inconsistent document state across clients. | • Include a monotonically increasing document version in each change; servers reject out‑of‑order changes (see the sketch below this table). | Requires the client to fetch the latest version before editing, adding latency. |
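A sketch of the version check from the last row: each change carries the document version the client edited against, and the server accepts only the next expected version. All names are illustrative:

```typescript
interface Change {
  docId: string;
  baseVersion: number;   // version of the document the client edited against
  delta: unknown;
}

// In-memory view of the latest committed version per document (illustrative;
// the authoritative value lives in PostgreSQL).
const currentVersion = new Map<string, number>();

type Verdict =
  | { ok: true; newVersion: number }
  | { ok: false; reason: 'stale' | 'ahead' };

export function checkOrdering(change: Change): Verdict {
  const current = currentVersion.get(change.docId) ?? 0;
  if (change.baseVersion < current) return { ok: false, reason: 'stale' };  // client must rebase
  if (change.baseVersion > current) return { ok: false, reason: 'ahead' };  // client missed changes: resync
  const newVersion = current + 1;
  currentVersion.set(change.docId, newVersion);
  return { ok: true, newVersion };
}
```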
**Security Issues**

| Issue | Impact | Fix | Trade‑offs |
|---|---|---|---|
| JWT stored in localStorage – vulnerable to XSS. | Token theft, session hijacking. | • Move to HttpOnly secure cookie (as above). | Adds CSRF protection considerations (e.g., double‑submit cookie). |
| No per‑document permission checks – any authenticated user can edit any document. | Data leakage and unauthorized edits. | • Store ACLs in the DB and verify them on each change request (see the sketch below this table). | Slightly more DB reads per edit. |
| No rate limiting on API – potential abuse. | Server overload, denial‑of‑service. | • Implement global and per‑document rate limits (e.g., using Redis counters). | Adds a Redis dependency. |
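A sketch of the per‑document permission check from the middle row, written as an Express‑style guard; the `document_acl` table and the role names are assumptions:

```typescript
import { Pool } from 'pg';
import type { Request, Response, NextFunction } from 'express';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Assumed schema: document_acl(document_id, user_id, role) with role in ('viewer','editor','owner').
async function canEdit(userId: string, docId: string): Promise<boolean> {
  const { rows } = await pool.query(
    `SELECT 1 FROM document_acl
      WHERE document_id = $1 AND user_id = $2 AND role IN ('editor', 'owner')
      LIMIT 1`,
    [docId, userId],
  );
  return rows.length > 0;
}

// Express-style guard for HTTP routes; the same check runs before applying
// WebSocket change events. req.user is assumed to be set by the auth middleware.
export async function requireEditAccess(req: Request, res: Response, next: NextFunction) {
  const userId = (req as any).user?.id;
  if (!userId || !(await canEdit(userId, req.params.docId))) {
    return res.status(403).json({ error: 'forbidden' });
  }
  next();
}
```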
**Observability Issues**

| Issue | Impact | Fix | Trade‑offs |
|---|---|---|---|
| No metrics on change propagation latency – cannot detect slow servers. | Hidden performance regressions. | • Instrument WebSocket send/receive timestamps and publish them to a monitoring system such as Prometheus (see the sketch below this table). | Extra code, but essential for meeting an SLA. |
| No tracing of cross‑server broadcasts – hard to debug missed updates. | Users see stale content. | • Add distributed tracing (e.g., OpenTelemetry) around publish/subscribe and DB writes. | Adds overhead and requires a tracing backend. |
| No alerting on replication lag – CDC may fall behind. | Clients may see outdated snapshots. | • Monitor pg_stat_replication lag and set alerts. | Requires monitoring stack. |
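A sketch of the propagation‑latency metric from the first row, assuming the `prom-client` library; the metric name, labels, and bucket boundaries are illustrative:

```typescript
import client from 'prom-client';

// Histogram of end-to-end change propagation time (accepted by origin server -> delivered to a socket).
const propagationLatency = new client.Histogram({
  name: 'doc_change_propagation_seconds',
  help: 'Time from accepting a change to delivering it to a subscribed client',
  labelNames: ['server'],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2, 5],   // seconds; tune to the SLA
});

// Changes are assumed to carry the wall-clock time at which the origin server
// accepted them, so the delivering server can measure the gap.
export function recordDelivery(change: { acceptedAtMs: number }): void {
  const seconds = (Date.now() - change.acceptedAtMs) / 1000;
  propagationLatency.labels(process.env.SERVER_ID ?? 'unknown').observe(seconds);
}

// Expose /metrics for Prometheus scraping (e.g., from an Express route):
//   res.set('Content-Type', client.register.contentType);
//   res.end(await client.register.metrics());
```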
**Summary of Recommended Changes**

| Layer | Primary Change | Key Benefit |
|---|---|---|
| Client | Server‑generated timestamps, HttpOnly cookie, reconnection sync | Removes clock drift, improves security, ensures state consistency |
| WebSocket Broadcast | Use Redis Pub/Sub or NATS instead of polling | Real‑time cross‑server propagation, eliminates 2‑second polling latency |
| Conflict Resolution | Adopt OT/CRDT or hybrid paragraph merge | Guarantees intent preservation, reduces lost edits |
| Data Storage | Store incremental deltas + periodic full snapshots | Reduces write volume, improves backup efficiency |
| Database | Partition by org, use CDC or LISTEN/NOTIFY, enable synchronous_commit for critical edits | Scales writes, ensures durability, removes polling |
| Load Balancing | Sticky sessions or consistent hashing, WebSocket‑aware health checks | Keeps client on same server, prevents dead connections |
| Security | HttpOnly cookie, per‑document ACLs, rate limiting | Mitigates XSS, data leakage, abuse |
| Observability | Metrics, tracing, replication lag alerts | Enables rapid detection of performance or consistency issues |
**Trade‑Offs Introduced by the Proposed Enhancements**

| Enhancement | Cost / Complexity | New Failure Modes | Mitigation |
|---|---|---|---|
| Redis Pub/Sub | Requires a Redis cluster (highly available). | Message loss if a server crashes mid‑broadcast. | Use at‑least‑once delivery and deduplication logic on the client (see the sketch below this table). |
| OT/CRDT | Significant client & server library effort; higher per‑edit CPU. | – | Start with a simple OT implementation for text only; fall back to LWW for non‑conflicting ops. |
| Delta storage | Needs diffing logic; more complex snapshot assembly. | – | Cache recent deltas in memory for quick reconstruction; validate snapshots periodically. |
| Sticky sessions | – | Can lead to uneven load distribution. | Combine with session‑ID hashing that balances across nodes. |
| HttpOnly cookie | Requires adding CSRF protection. | CSRF attacks become relevant. | Verify a CSRF token (e.g., double‑submit cookie) on state‑changing endpoints. |
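A minimal sketch of the client‑side deduplication noted in the Redis Pub/Sub row, assuming each change carries a unique id; the bounded window size is illustrative:

```typescript
// Drops duplicate change events delivered more than once by an at-least-once channel.
// Remembers only the most recent `maxIds` ids to bound memory.
class ChangeDeduplicator {
  private seen = new Set<string>();
  private order: string[] = [];

  constructor(private maxIds = 1000) {}

  /** Returns true if the change is new and should be applied. */
  accept(changeId: string): boolean {
    if (this.seen.has(changeId)) return false;   // duplicate delivery: ignore
    this.seen.add(changeId);
    this.order.push(changeId);
    if (this.order.length > this.maxIds) {
      this.seen.delete(this.order.shift()!);     // evict the oldest remembered id
    }
    return true;
  }
}

// Usage inside the WebSocket onmessage handler (sketch):
const dedup = new ChangeDeduplicator();
export function onChangeMessage(change: { id: string; delta: unknown }): void {
  if (!dedup.accept(change.id)) return;
  // ...apply change.delta to the local document model
}
```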
Implementing the above changes will transform the system from a polling‑heavy, clock‑driven, LWW‑only design into a robust, low‑latency, conflict‑aware collaborative editor that scales horizontally without sacrificing consistency or security.