Aurora Alpha's response to the 'Debug This Architecture' challenge.
Here's the architecture for a real-time collaborative document editor (like Google Docs):

**System Overview:**
- Frontend: React SPA with WebSocket connections
- Backend: Node.js API servers behind a load balancer (round-robin)
- Database: PostgreSQL for document storage, Redis for session cache
- Real-time: Each API server maintains its own WebSocket connections to clients
- Sync Strategy: Last-write-wins with timestamps from client clocks
- Storage: Documents saved as full HTML snapshots every 30 seconds
- Auth: JWT tokens with 24-hour expiry, stored in localStorage
- CDN: CloudFront for static assets, also caches API responses for 5 minutes

**Data Flow:**
1. User types → change event sent via WebSocket to their connected server
2. Server writes change to PostgreSQL
3. Server broadcasts change to all WebSocket clients connected to THAT server
4. Other servers poll PostgreSQL every 2 seconds for changes
5. Conflict resolution: if two users edit the same paragraph, last timestamp wins

**Scaling Plan:**
- Horizontal scaling by adding more API servers
- Database read replicas for read-heavy operations
- Document partitioning by organization ID

Identify all potential failure modes, race conditions, and scaling bottlenecks in this architecture. For each issue found, propose a specific solution with trade-offs.
Below is a systematic “failure‑mode → root cause → mitigation” map for the architecture you described.
I grouped the items by layer — client & network, API/real‑time, and data — and then added a cross‑cutting "operational" section (security, observability, cost, etc.). For each problem I list a concrete remedy and the main trade‑offs you'll have to consider.
**1. Client & Network Layer**

| # | Potential Failure / Race | Why it Happens | Mitigation (with trade‑offs) |
|---|---|---|---|
| 1.1 | Lost or out‑of‑order WebSocket messages | TCP guarantees order within a single connection, but a client can reconnect to a different API instance and miss messages that were broadcast only on the original server. | Introduce a message sequence number + ACK on the client side and a replay buffer on each server (e.g., the last 10 s of changes); see the sketch after this table. When a client reconnects, it sends the last sequence it processed and the server re‑sends any missing deltas. <br>Trade‑off: extra memory on the server and a small latency spike on reconnect; complexity in the client‑side state machine. |
| 1.2 | Clock skew → wrong “last‑write‑wins” | Clients use their own clocks for timestamps; a fast clock can overwrite a slower one even if the edit happened later. | Server‑generated timestamps (e.g., monotonic logical clock or ULID) instead of client clocks. The client can still send a provisional timestamp for UI ordering, but the authoritative order comes from the server. <br>Trade‑off: adds a round‑trip for each edit (or a small server‑side queue) and requires the server to keep a per‑document logical clock. |
| 1.3 | JWT theft / replay | Token stored in localStorage is accessible to any script on the page (XSS) and can be replayed on another device. | Store JWT in an HttpOnly Secure SameSite cookie and rotate it frequently (e.g., short‑lived access token + refresh token). Use refresh‑token rotation and revocation list. <br>Trade‑off: more complex auth flow; need CSRF protection for cookie‑based auth. |
| 1.4 | Network partitions → “split‑brain” edits | A client may be isolated from the primary API server and connect to a secondary that has stale data. | Use a centralised real‑time broker (e.g., Redis Streams, NATS, or a dedicated OT/CRDT service) that all API instances subscribe to, instead of per‑server broadcast. <br>Trade‑off: introduces a new component and network hop, but guarantees total ordering across the cluster. |
| 1.5 | Large payloads in WebSocket frames | Sending full HTML snapshots every 30 s can overflow the socket buffer on low‑bandwidth connections. | Compress deltas (e.g., JSON‑diff, operational‑transform/CRDT delta) and send only the delta, not the full snapshot. Keep periodic full snapshots for recovery only. <br>Trade‑off: requires a diff algorithm and versioning; adds CPU overhead on both client and server. |
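
As a concrete illustration of items 1.1 and 1.2, here is a minimal TypeScript sketch of a per‑document replay buffer with server‑assigned sequence numbers and timestamps. Everything in it (the `ReplayBuffer` and `Delta` names, the count‑based 1000‑entry cap used instead of a 10 s time window) is an illustrative assumption, not part of the architecture above.

```typescript
// Minimal sketch of the replay buffer from item 1.1 (names are illustrative).
// Each server keeps the last N deltas per document; a reconnecting client
// sends the last sequence number it applied and receives anything it missed.

interface Delta {
  seq: number;      // server-assigned, monotonically increasing per document
  payload: unknown; // the actual edit (OT op, CRDT delta, JSON diff, ...)
  ts: number;       // server timestamp, used instead of client clocks (item 1.2)
}

class ReplayBuffer {
  private deltas: Delta[] = [];
  private nextSeq = 1;

  constructor(private readonly maxEntries = 1000) {}

  // Assign the authoritative sequence number and remember the delta.
  append(payload: unknown): Delta {
    const delta: Delta = { seq: this.nextSeq++, payload, ts: Date.now() };
    this.deltas.push(delta);
    if (this.deltas.length > this.maxEntries) this.deltas.shift(); // bounded memory
    return delta;
  }

  // On reconnect the client reports the last seq it processed.
  // Returns the missing deltas, or null if they were already evicted
  // (the client must then fetch a full snapshot).
  replayAfter(lastSeenSeq: number): Delta[] | null {
    const oldest = this.deltas[0];
    if (oldest && lastSeenSeq < oldest.seq - 1) return null;
    return this.deltas.filter((d) => d.seq > lastSeenSeq);
  }
}
```

A client that gets `null` back has fallen behind the retained window and should fetch a full snapshot before resuming delta processing, which matches the trade‑off noted in item 2.6.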
**2. API / Real‑Time Layer**

| # | Potential Failure / Race | Why it Happens | Mitigation (with trade‑offs) |
|---|---|---|---|
| 2.1 | Broadcast limited to “clients on the same server” | Server A never sees changes from Server B unless the DB poll picks them up; a 2‑second poll window creates a visible lag and possible race conditions. | Replace polling with a publish/subscribe bus (Redis Pub/Sub, Kafka, or a dedicated WebSocket message broker). Each server publishes its deltas and subscribes to all others (see the sketch after this table). <br>Trade‑off: extra infrastructure, need to handle message ordering and at‑least‑once delivery. |
| 2.2 | Polling interval too coarse → race conditions | Two users editing the same paragraph on different servers may both write to DB before the poll catches the other’s change, leading to “last‑write‑wins” conflicts. | Use a write‑ahead log / change‑feed (PostgreSQL logical replication with a plugin such as wal2json, or a dedicated event store). Servers consume the feed in real time, eliminating the need for polling. <br>Trade‑off: more complex DB setup; requires idempotent handling of events. |
| 2.3 | Database write contention | Every keystroke (or batch of keystrokes) triggers a write to PostgreSQL; high‑frequency edits can cause row‑level lock contention on the document table. | Batch edits in memory (e.g., 100 ms window) and write a single UPDATE per user per batch. Alternatively, store deltas in a separate “edits” table and apply them asynchronously to the main snapshot. <br>Trade‑off: introduces a small latency for persistence; adds a background compaction job. |
| 2.4 | Single point of failure in WebSocket connection handling | If a single API instance crashes, all its connected clients lose their real‑time channel until they reconnect. | Deploy a dedicated WebSocket gateway (e.g., Envoy, NGINX, or a managed service like AWS API Gateway WebSocket) that sits in front of the API servers and can gracefully detach/attach connections. <br>Trade‑off: extra network hop; need to forward messages to the correct backend (via sticky sessions or a message bus). |
| 2.5 | Load‑balancer sticky‑session misconfiguration | Round‑robin without stickiness forces a client to reconnect to a different server on each request, breaking the per‑server broadcast model. | Enable session affinity (IP‑hash or cookie‑based) for WebSocket upgrades, or better, decouple connection handling from business logic (see 2.4). <br>Trade‑off: can lead to uneven load distribution; affinity may break when a server is drained for maintenance. |
| 2.6 | Memory leak in per‑connection buffers | Keeping a per‑client delta buffer for replay can grow unbounded if a client stays idle for a long time. | Set a TTL on buffers (e.g., 30 s) and drop the oldest entries when the buffer exceeds a size limit. Use a circular buffer implementation. <br>Trade‑off: a very slow client may miss some deltas and need to request a full snapshot. |
| 2.7 | Back‑pressure on WebSocket writes | If a client’s network is slow, the server’s write buffer can fill, causing the Node.js event loop to block or crash. | Implement flow‑control: pause reading from the source when the socket’s bufferedAmount exceeds a threshold, and resume after a drain event. <br>Trade‑off: adds latency for slow clients; may need to drop or compress older deltas. |
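
To make item 2.1 concrete (with the back‑pressure guard from 2.7 folded in), here is a rough sketch of a Redis pub/sub fan‑out using the `ioredis` and `ws` packages. The `doc:<id>` channel naming, the 1 MB `bufferedAmount` threshold, and the message shape are assumptions for illustration; persistence and authentication are deliberately omitted.

```typescript
// Sketch: every API server publishes the deltas it receives and relays
// everything it hears back to its own WebSocket clients, so the 2-second
// PostgreSQL poll is no longer needed for propagation.

import Redis from "ioredis";
import { WebSocketServer, WebSocket } from "ws";

const pub = new Redis();  // publishing connection
const sub = new Redis();  // ioredis needs a dedicated connection for subscriptions
const wss = new WebSocketServer({ port: 8080 });

// docId -> sockets connected to THIS server
const rooms = new Map<string, Set<WebSocket>>();

sub.psubscribe("doc:*");
sub.on("pmessage", (_pattern, channel, message) => {
  const docId = channel.slice("doc:".length);
  for (const client of rooms.get(docId) ?? []) {
    // Item 2.7: skip clients whose send buffer is already backed up instead
    // of letting it grow unbounded; they can catch up via the replay buffer.
    if (client.readyState === WebSocket.OPEN && client.bufferedAmount < 1_000_000) {
      client.send(message);
    }
  }
});

wss.on("connection", (socket, req) => {
  const docId =
    new URL(req.url ?? "/", "http://localhost").searchParams.get("doc") ?? "default";
  if (!rooms.has(docId)) rooms.set(docId, new Set());
  rooms.get(docId)!.add(socket);

  socket.on("message", (raw) => {
    // Persistence is omitted here (see the batching sketch further down);
    // just fan the delta out to every server via Redis.
    pub.publish(`doc:${docId}`, raw.toString()).catch((err) =>
      console.error("publish failed", err)
    );
  });

  socket.on("close", () => rooms.get(docId)?.delete(socket));
});
```

Redis Pub/Sub is fire‑and‑forget; if you need at‑least‑once delivery across restarts, Redis Streams or a log‑based broker such as Kafka is the safer choice, as noted in 2.1.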
**3. Data Layer (PostgreSQL & Redis)**

| # | Potential Failure / Race | Why it Happens | Mitigation (with trade‑offs) |
|---|---|---|---|
| 3.1 | Snapshot every 30 s → storage churn | Writing a full HTML blob every half‑minute for many active documents can saturate I/O and increase storage costs. | Store only incremental deltas and generate a snapshot lazily (e.g., after N edits or when a user requests a version). Keep a periodic “checkpoint” (e.g., every 5 min) for fast recovery. <br>Trade‑off: recovery requires replaying deltas; more complex compaction logic. |
| 3.2 | Read‑replica lag | If the API reads from replicas for “current document state”, lag can cause a client to see stale data after a recent edit. | Read‑your‑writes: after a successful write, read back from the primary (or use a “write‑through cache” in Redis). <br>Trade‑off: extra read load on the primary; may need to tune replica lag thresholds. |
| 3.3 | PostgreSQL row‑level lock contention | Simultaneous UPDATEs on the same document row cause lock waiting, increasing latency and possibly deadlocks. | Use SELECT … FOR UPDATE SKIP LOCKED on a “pending edits” table, or store edits in a separate table keyed by (document_id, edit_seq) and let a background worker merge them into the snapshot (see the sketch after this table). <br>Trade‑off: more tables and background jobs; eventual consistency for the snapshot. |
| 3.4 | Redis cache eviction / stale session data | If the session cache is not sized correctly, eviction can cause a user to lose their edit‑state, forcing a full reload. | Use a TTL per session (e.g., 5 min) and a “fallback” to DB if a cache miss occurs. Monitor cache hit‑rate and size the cluster accordingly. <br>Trade‑off: higher memory cost; occasional extra DB reads. |
| 3.5 | Schema evolution / migration downtime | Adding a new column to the document table (e.g., for metadata) can lock the table for a noticeable period. | Use online schema change tooling (e.g., pg_repack for table rewrites), or add new nullable columns and back‑fill in batches. <br>Trade‑off: longer migration window; need to coordinate with rolling releases. |
| 3.6 | Data loss on sudden crash | If a write is acknowledged to the client before PostgreSQL has flushed it to disk, a crash can lose the edit. | Keep synchronous_commit = on (the PostgreSQL default, settable per transaction) for the transactions that persist edits, and acknowledge the client only after the commit returns; alternatively, stage edits in a Redis write‑ahead log (persisted via AOF) that is flushed before acknowledging. <br>Trade‑off: higher latency for each write; extra complexity in failure recovery. |
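
Below is a minimal sketch of the batching idea from items 2.3 and 3.3, assuming node‑postgres (`pg`) and a hypothetical append‑only `document_edits` table (the table and column names are not from the original design): deltas are buffered in memory and flushed as one multi‑row INSERT every 100 ms, leaving snapshot compaction to a separate background worker.

```typescript
// Sketch: instead of one UPDATE per keystroke, deltas are collected in memory
// and flushed in a single parameterized INSERT per 100 ms window.

import { Pool } from "pg";

const pool = new Pool(); // connection settings come from the usual PG* env vars

interface PendingEdit {
  docId: string;
  seq: number;
  payload: string; // serialized delta
}

let pending: PendingEdit[] = [];

export function enqueueEdit(edit: PendingEdit): void {
  pending.push(edit);
}

// Flush everything accumulated in the last window with one statement.
async function flush(): Promise<void> {
  if (pending.length === 0) return;
  const batch = pending;
  pending = [];

  const values: (string | number)[] = [];
  const rows = batch.map((e, i) => {
    values.push(e.docId, e.seq, e.payload);
    const base = i * 3;
    return `($${base + 1}, $${base + 2}, $${base + 3})`;
  });

  // Hypothetical append-only table: document_edits(doc_id, seq, payload).
  await pool.query(
    `INSERT INTO document_edits (doc_id, seq, payload) VALUES ${rows.join(", ")}`,
    values
  );
}

// 100 ms window: bounded persistence latency in exchange for far fewer
// lock acquisitions on the hot document rows.
setInterval(() => {
  flush().catch((err) => console.error("edit flush failed", err));
}, 100);
```

The window length is the main knob: a larger window means fewer lock acquisitions but a longer period during which an un‑flushed edit can be lost if the process dies, which is exactly the durability trade‑off flagged in 3.6.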
**4. Cross‑Cutting & Operational Concerns**

| # | Issue | Why it Happens | Mitigation (with trade‑offs) |
|---|---|---|---|
| 4.1 | CDN caching of API responses | Caching API JSON for 5 min can serve stale document data after an edit. | Add Cache-Control: no-store on any endpoint that returns mutable document state, and use the CDN only for static assets and truly immutable API calls (e.g., a list of templates); see the sketch after this table. <br>Trade‑off: loses the small latency benefit of the CDN for those endpoints. |
| 4.2 | Horizontal scaling without sharding | Adding more API servers only spreads load; the DB remains a single bottleneck for writes. | Partition documents by organization ID (or hash of doc‑id) and assign each partition to a dedicated DB shard (or use a multi‑tenant PostgreSQL with separate schemas). <br>Trade‑off: operational overhead of managing multiple shards; cross‑shard queries become more complex. |
| 4.3 | Single point of failure in load balancer | If the LB crashes, all traffic is lost. | Deploy a highly‑available LB pair (e.g., AWS ALB with multiple AZs, or HAProxy with VRRP). <br>Trade‑off: cost of extra instances and health‑check configuration. |
| 4.4 | Observability gaps | No metrics on WebSocket latency, queue depth, or DB write latency → hard to detect a bottleneck. | Instrument the stack: Prometheus metrics for socket bufferedAmount, DB query time, Redis hit‑rate; distributed tracing (OpenTelemetry) across the WebSocket → API → DB path. <br>Trade‑off: adds CPU/IO overhead and requires a monitoring stack. |
| 4.5 | Security – CSRF on JWT cookie | If you move JWT to HttpOnly cookie, a malicious site could still trigger a request with the cookie. | SameSite=Strict or Lax plus CSRF token for state‑changing endpoints. <br>Trade‑off: may break legitimate cross‑origin use cases (e.g., embedding the editor in another domain). |
| 4.6 | Versioning / backward compatibility | Clients may be on older JS bundles that expect a different message format. | Add a version field in every WebSocket message and have the server negotiate a compatible protocol (or reject with a clear error). <br>Trade‑off: extra code path for version handling; need to retire old versions. |
| 4.7 | Cost of frequent snapshots | Storing a full HTML snapshot every 30 s for thousands of documents can explode storage costs. | Compress snapshots (gzip/Brotli) and store them in object storage (S3) with lifecycle policies, while keeping only the latest N snapshots in PostgreSQL. <br>Trade‑off: additional latency when retrieving older versions; need a background job to sync between DB and object storage. |
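
For item 4.1, here is a small sketch of how the cache split might look with Express‑style middleware; the route paths and the 5‑minute template cache are placeholders, not part of the original API.

```typescript
// Sketch: forbid intermediary caching for mutable document endpoints so
// CloudFront never serves stale document state, while still allowing
// explicit caching for genuinely immutable responses.

import express from "express";

const app = express();

// Anything under /api/documents returns mutable state: forbid caching.
app.use("/api/documents", (_req, res, next) => {
  res.set("Cache-Control", "no-store");
  next();
});

// Truly immutable responses (e.g., a published template list) can still
// opt in to CDN caching explicitly.
app.get("/api/templates", (_req, res) => {
  res.set("Cache-Control", "public, max-age=300");
  res.json({ templates: [] }); // placeholder payload
});

app.listen(3000);
```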
**Phased Remediation Roadmap**

| Phase | Primary Goal | Key Changes | Approx. Effort |
|---|---|---|---|
| Phase 1 – Real‑time reliability | Remove per‑server broadcast & polling | • Introduce a central pub/sub broker (Redis Streams or NATS). <br>• Switch to server‑generated timestamps. <br>• Add sequence‑number ACK/replay for reconnects. | 2‑3 weeks (broker setup + code changes). |
| Phase 2 – Data‑layer optimisation | Reduce DB contention & storage churn | • Store deltas in an “edits” table, periodic snapshot worker. <br>• Batch DB writes (100 ms window). <br>• Enable logical replication feed for near‑real‑time change propagation. | 3‑4 weeks (schema changes + background workers). |
| Phase 3 – Security & auth hardening | Prevent token theft & stale cache | • Move JWT to HttpOnly SameSite cookie + refresh‑token rotation. <br>• Remove CDN caching for mutable API endpoints. | 1‑2 weeks (auth flow changes). |
| Phase 4 – Scaling & resilience | Prepare for horizontal growth | • Deploy a dedicated WebSocket gateway with sticky‑session fallback. <br>• Add DB sharding/partitioning by org ID. <br>• Set up HA load balancer and health checks. | 4‑6 weeks (infrastructure provisioning). |
| Phase 5 – Observability & ops | Detect and react to failures early | • Export Prometheus metrics & OpenTelemetry traces. <br>• Implement alerting on socket lag, DB write latency, Redis hit‑rate. | 1‑2 weeks (instrumentation). |
**Summary: Quick Fixes vs. Long‑Term Fixes**

| Category | Core Problem | Quick Fix | Long‑Term Fix |
|---|---|---|---|
| Real‑time ordering | Client timestamps + per‑server broadcast | Server‑generated timestamps + ACK replay | Central pub/sub + CRDT/OT engine |
| Latency & lost updates | 2‑s DB poll + reconnection to different server | Reduce poll to < 200 ms or use change‑feed | Event stream (Logical Replication / Kafka) |
| DB write contention | One row per document, many concurrent UPDATEs | Batch edits, store deltas in separate table | Append‑only log + background compaction |
| Snapshot storage | Full HTML every 30 s | Store only deltas, compress snapshots | Object‑store checkpoints + lazy snapshot generation |
| Auth & token safety | JWT in localStorage | HttpOnly SameSite cookie + refresh rotation | OAuth‑style short‑lived access tokens + revocation list |
| Scaling bottleneck | Single PostgreSQL primary | Read‑replicas for reads, logical replication for writes | Horizontal sharding / multi‑tenant schemas |
| Cache staleness | CDN caching API responses | Cache-Control: no-store on mutable endpoints | Edge‑side invalidation via webhook on edit |
| Observability | No metrics on WebSocket health | Add simple latency counters | Full distributed tracing + alerting pipeline |
Implementing the Phase 1 changes (centralised pub/sub + server timestamps) will instantly eliminate the most visible race conditions (out‑of‑order updates, stale broadcasts). From there, you can progressively address storage, security, and scaling concerns while keeping the system observable and resilient.