Solar Pro 3 Free's response to the 'Debug This Architecture' challenge.
Here's the architecture for a real-time collaborative document editor (like Google Docs):

**System Overview:**
- Frontend: React SPA with WebSocket connections
- Backend: Node.js API servers behind a load balancer (round-robin)
- Database: PostgreSQL for document storage, Redis for session cache
- Real-time: Each API server maintains its own WebSocket connections to clients
- Sync Strategy: Last-write-wins with timestamps from client clocks
- Storage: Documents saved as full HTML snapshots every 30 seconds
- Auth: JWT tokens with 24-hour expiry, stored in localStorage
- CDN: CloudFront for static assets, also caches API responses for 5 minutes

**Data Flow:**
1. User types → change event sent via WebSocket to their connected server
2. Server writes change to PostgreSQL
3. Server broadcasts change to all WebSocket clients connected to THAT server
4. Other servers poll PostgreSQL every 2 seconds for changes
5. Conflict resolution: if two users edit the same paragraph, last timestamp wins

**Scaling Plan:**
- Horizontal scaling by adding more API servers
- Database read replicas for read-heavy operations
- Document partitioning by organization ID

Identify all potential failure modes, race conditions, and scaling bottlenecks in this architecture. For each issue found, propose a specific solution with trade-offs.

**Authentication & Session Management**

| Failure / Race Condition | Why it happens in the current design | Concrete mitigation (trade‑offs) |
|---|---|---|
| JWT stored in localStorage – vulnerable to XSS, copy‑and‑paste attacks, and malicious extensions. | No HttpOnly flag, no SameSite/Lax cookie protection. | Move JWT to an HttpOnly, SameSite‑Lax cookie (or use a secure “session‑store” backed by Redis). <br>Pros: eliminates client‑side token theft. <br>Cons: requires cross‑origin cookie handling, possible CORS headaches, and a small added latency for cookie‑based auth. |
| 24‑hour token expiry – stale tokens can be reused after logout or revocation. | Long‑lived access token, no refresh‑token flow. | Introduce short‑lived access tokens (≤ 5 min) + refresh‑token endpoint that issues new JWTs. Store refresh tokens in Redis with revocation list. <br>Pros: immediate revocation possible, tighter security. <br>Cons: extra round‑trip for each user activity, higher auth‑server load. |
| Session cache in Redis – single point of failure, possible data loss on crash. | No persistence configured; if Redis restarts, sessions disappear. | Enable Redis persistence (RDB + AOF) and run a Redis cluster with synchronous replication. Optionally keep a lightweight “session‑store” in PostgreSQL for critical sessions. <br>Pros: durability, fail‑over. <br>Cons: write latency + storage cost; need to manage TTL consistency across nodes. |
| Token revocation not propagated – a server that crashes may still have a revoked token in memory. | Revocation list is in‑memory only; no broadcast mechanism. | Publish revocation events to a durable pub/sub (e.g., Kafka/Redis Streams) that all API nodes consume before each request. <br>Pros: guarantees revocation is respected across all servers. <br>Cons: adds latency to auth checks, extra infra. |
| WebSocket handshake lacks auth verification – a connection is established before the token is checked, so a forged or revoked token is only caught later (if at all). | JWT is only sent after connection, not validated on handshake. | Validate the JWT on the WebSocket upgrade request (e.g., via Sec-WebSocket-Protocol or a query param) and reject if invalid or revoked; see the sketch after this table. <br>Pros: prevents unauthorized connections early. <br>Cons: adds extra CPU on each upgrade; revocation must be checked in the same flow. |
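
To make the handshake check concrete, here is a minimal sketch of validating a JWT (and its revocation status) during the WebSocket upgrade, before any connection is accepted. It assumes the `ws`, `jsonwebtoken`, and `ioredis` packages and a hypothetical `revoked:<jti>` key convention in Redis; adapt the names to your stack.

```typescript
import http from "http";
import { WebSocketServer } from "ws";
import jwt from "jsonwebtoken";
import Redis from "ioredis";

const redis = new Redis();
const wss = new WebSocketServer({ noServer: true }); // upgrades are accepted manually below

const server = http.createServer();

server.on("upgrade", async (req, socket, head) => {
  try {
    // Token passed as a query parameter on the upgrade request (Sec-WebSocket-Protocol works too).
    const url = new URL(req.url ?? "", "http://localhost");
    const token = url.searchParams.get("token") ?? "";

    // Verify signature and expiry before the connection exists.
    const claims = jwt.verify(token, process.env.JWT_SECRET!) as { jti?: string; sub?: string };

    // Reject tokens that have been explicitly revoked (hypothetical Redis key convention).
    if (claims.jti && (await redis.exists(`revoked:${claims.jti}`))) {
      throw new Error("token revoked");
    }

    wss.handleUpgrade(req, socket, head, (ws) => {
      wss.emit("connection", ws, req); // connection is only created for valid tokens
    });
  } catch {
    socket.write("HTTP/1.1 401 Unauthorized\r\n\r\n");
    socket.destroy();
  }
});

server.listen(8080);
```

Checking revocation at upgrade time keeps revoked tokens from ever holding a live connection; the extra Redis round-trip per upgrade is the cost noted in the table.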

**Real‑Time Sync & Message Delivery**

| Failure / Race Condition | Why it happens | Mitigation (trade‑offs) |
|---|---|---|
| Round‑robin load balancer without sticky sessions – a client may reconnect to a different API server after a network hiccup and miss recent changes. | Load balancer distributes WebSocket connections arbitrarily. | Enable sticky sessions (session affinity) at the L4/L7 layer (e.g., source‑IP or cookie‑based). Or use a shared broadcast bus (Redis Pub/Sub, NATS, Kafka) that any server can publish to and all servers can subscribe to, eliminating the need for stickiness. <br>Pros (sticky): simple, no extra infra. <br>Cons (sticky): uneven load, hot‑spot servers. <br>Pros (shared bus): true cross‑server sync, easier scaling. <br>Cons: added latency, need to guarantee delivery (persistent queue). |
| Polling every 2 s from each server – high DB load, poll‑storms, and possible missed updates if a poll interval overlaps a write. | Each server runs a separate poll, leading to N × poll‑frequency DB queries. | Replace polling with PostgreSQL LISTEN/NOTIFY (or logical replication) that pushes changes to a single channel. Or use a dedicated change‑stream service (Kafka, Pulsar) that all servers subscribe to. <br>Pros: eliminates polling overhead, near‑real‑time. <br>Cons: LISTEN/NOTIFY is limited to a single DB node; need a broker for multi‑region. |
| Broadcast only to clients on the same server – other servers never see changes, causing split‑brain. | Broadcast is local to the server that wrote the change. | Centralized pub/sub (Redis, Kafka) that all API servers publish to and subscribe to; see the pub/sub sketch after this table. Include a document‑ID + change‑ID in each message to guarantee ordering. <br>Pros: full visibility across the cluster. <br>Cons: requires a reliable message broker, adds a hop of latency. |
| Duplicate broadcast when a server recovers – after a crash, the same change may be re‑broadcast. | No deduplication on server side. | Assign a monotonically increasing per‑document sequence number (or UUID) on the DB write; broadcast only if the sequence number is newer than the last seen by the server. Use a persistent broadcast log (e.g., a “change‑queue” table) that the server reads on startup to catch missed messages. <br>Pros: eliminates duplicate messages. <br>Cons: extra write on each change, need to purge logs. |
| Last‑write‑wins based on client timestamps – clock skew leads to lost updates and non‑deterministic ordering. | Relying on client clocks for conflict resolution. | Server‑provided version vectors (e.g., doc_version, row_version incremented atomically) or CRDT/OT libraries (Yjs, Automerge). Use a conflict‑resolution service that merges operations deterministically. <br>Pros: robust, no lost edits. <br>Cons: higher CPU per change (OT/CRDT merge), added state to store. |
| Broadcast failure → client never receives change – server may crash after DB commit but before publishing. | Broadcast is done after the DB write, not inside a transaction. | Publish to the message bus as part of the same DB transaction (or use a 2‑phase commit pattern). Alternatively, persist broadcast events in an “outbox” table and have a background worker replay missed messages. <br>Pros: guaranteed delivery. <br>Cons: transaction latency, extra writes, complexity. |
| Client sends duplicate changes – network retransmission or reconnection may cause the same edit to be applied twice. | No change‑ID deduplication on client side. | Client includes a unique change_id (UUID) and a client_seq number; server checks for duplicates before persisting. <br>Pros: prevents double‑apply. <br>Cons: requires extra memory on server to store recent IDs. |
| WebSocket reconnection storm – many clients reconnect simultaneously after a brief outage, overwhelming servers. | No exponential back‑off or rate limiting on reconnection. | Exponential back‑off with jitter on client side; circuit‑breaker on server side (e.g., limit new connections per second). <br>Pros: smooths load spikes. <br>Cons: may delay recovery for some users. |
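
As a sketch of the shared broadcast bus referenced in the "broadcast only to clients on the same server" row: each API server publishes accepted changes to a per-document Redis channel and relays anything it receives to its own local WebSocket clients. The channel naming (`doc:<id>`) and message shape are illustrative assumptions, not a prescribed format.

```typescript
import Redis from "ioredis";
import type WebSocket from "ws";

// Separate connections: a Redis client in subscriber mode cannot issue regular commands.
const pub = new Redis();
const sub = new Redis();

// Local registry of WebSocket clients per document on THIS server.
const localClients = new Map<string, Set<WebSocket>>();

// Pattern-subscribe to all document channels.
sub.psubscribe("doc:*");
sub.on("pmessage", (_pattern, channel, message) => {
  const docId = channel.slice("doc:".length);
  for (const ws of localClients.get(docId) ?? []) {
    ws.send(message); // relay to clients connected to this server
  }
});

// Called after a change has been validated and persisted.
export async function broadcastChange(docId: string, change: { changeId: string; op: unknown }) {
  await pub.publish(`doc:${docId}`, JSON.stringify(change));
}
```

Every server, including the one that originated the change, receives the message, so the publisher should either skip its own copy or rely on the change‑ID deduplication described in the table above.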

**Data Consistency & Storage**

| Failure / Race Condition | Why it happens | Mitigation (trade‑offs) |
|---|---|---|
| Concurrent edits to the same paragraph – last‑write‑wins discards earlier edits. | No per‑paragraph version tracking, just whole‑doc timestamps. | Implement per‑paragraph vector clocks (or use a CRDT for the paragraph; see the CRDT sketch after this table). When a change arrives, compare its vector with the stored version; merge if possible, otherwise apply a deterministic rule (e.g., “first wins”). <br>Pros: preserves edits, no data loss. <br>Cons: increased write latency, extra storage for vector clocks. |
| Write‑ahead log (WAL) replication lag – read replicas may serve stale content, leading to “ghost” edits. | Read replicas are used for read‑heavy operations; they lag behind primary. | Route read‑after‑write operations to the primary (or a “read‑after‑write” pool). Use session‑affinity for reads of a document that was just edited. <br>Pros: strong consistency for the most recent change. <br>Cons: higher load on primary, need to balance read‑only traffic. |
| Snapshot generation (full HTML every 30 s) race with real‑time edits – a snapshot may capture a partially applied change set, causing inconsistency. | Snapshot runs independently of change stream. | Take snapshots after a batch of changes is flushed to the outbox or after a stable document version is published. Use a “snapshot lock” (row‑level lock) to prevent new edits while snapshot is taken, or use incremental diffs (store only changes). <br>Pros: atomic snapshot, less storage. <br>Cons: lock contention or extra CPU for diff generation. |
| Document partitioning only by org ID – a large org may concentrate all traffic on one DB node, causing hot‑spots. | Partitioning key is coarse; all docs for the same org share the same shard. | Add a second shard key (e.g., org_id + doc_id or a hash of doc_id) and use consistent hashing for routing. <br>Pros: distributes load more evenly. <br>Cons: more complex routing logic, need to keep mapping in cache. |
| Read‑replica lag causing “lost edit” on client – client reads from replica, then sees its own edit as missing. | Client reads from replica immediately after sending change. | Prefer the primary for reads of the same document within a short window (e.g., 2 s). Or publish a “read‑after‑write” flag in the change message that tells the client to wait for broadcast before re‑reading. <br>Pros: avoids stale reads. <br>Cons: may increase latency for reads; needs extra coordination. |
| Snapshot storage overflow – full HTML snapshots can become huge, exhausting disk. | No compression, no retention policy. | Compress snapshots (gzip/brotli) and store them in a separate object store (S3, GCS). Keep only the N most recent snapshots (e.g., the last 5 minutes’ worth) and purge older ones. <br>Pros: reduces storage cost, still provides point‑in‑time recovery. <br>Cons: extra CPU for compression, need to manage lifecycle. |
| Version history not persisted – undo/redo impossible; audit logs missing. | Only current snapshot stored. | Create a doc_changes table that records each operation (type, content, user, timestamp, vector clock). Use this for undo/redo, audit, and conflict resolution. <br>Pros: richer history, easier debugging. <br>Cons: extra write load, storage growth. |
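
To illustrate the CRDT option mentioned for concurrent paragraph edits, here is a minimal sketch using the Yjs library: two replicas apply edits independently, exchange updates, and converge without any timestamp-based tie-breaking. This is a toy demonstration of the merge semantics, not a full integration.

```typescript
import * as Y from "yjs";

// Two independent replicas of the same document (e.g., two users, two servers).
const docA = new Y.Doc();
const docB = new Y.Doc();

// Concurrent edits made while the replicas cannot see each other.
docA.getText("body").insert(0, "Hello ");
docB.getText("body").insert(0, "World ");

// Exchange the encoded state; the order of application does not matter.
const updateA = Y.encodeStateAsUpdate(docA);
const updateB = Y.encodeStateAsUpdate(docB);
Y.applyUpdate(docA, updateB);
Y.applyUpdate(docB, updateA);

// Both replicas now hold the identical merged text – neither edit was lost.
console.log(docA.getText("body").toString() === docB.getText("body").toString()); // true
```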

**Scaling Bottlenecks**

| Bottleneck | Why it occurs | Mitigation (trade‑offs) |
|---|---|---|
| WebSocket connections per API server – memory/CPU grows linearly with active users; a single server can saturate. | Each server holds its own ws connections; no shared broker for cross‑server broadcast. | Offload real‑time messaging to a dedicated message broker (Redis Pub/Sub, NATS JetStream, Kafka). API servers only publish/subscribe, reducing per‑server memory. <br>Pros: horizontal scaling of WS servers independent of message broker. <br>Cons: added hop latency, need to guarantee delivery. |
| DB write volume – every keystroke results in a row insert (or update) → high contention on the same document row. | No batching, no debounce. | Debounce client edits on the server (e.g., coalesce changes per document for ≤ 200 ms) before persisting, and use optimistic concurrency (UPDATE … WHERE version = X); see the sketch after this table. <br>Pros: reduces write traffic, less lock contention. <br>Cons: introduces slight client‑side latency, need to handle “late” client messages. |
| Polling load – N × 2 s queries per server. | Polling is naive. | LISTEN/NOTIFY (or change‑stream) pushes changes only when they happen. <br>Pros: near‑zero polling load. <br>Cons: requires a single DB node to push notifications; for multi‑region you need a broker. |
| Redis session cache – single‑node bottleneck, potential OOM under massive concurrent connections. | No clustering or persistence. | Run Redis as a cluster with sharding (or use a managed service with auto‑scaling). Enable AOF for durability. <br>Pros: horizontal scaling, resilience. <br>Cons: added network hops, need to keep keys consistent across shards. |
| CDN caching of API responses – 5‑minute cache can serve stale content (e.g., auth, document metadata). | Cache‑Control headers set to public for dynamic endpoints. | Set Cache‑Control: no‑store or private for all auth and document‑state endpoints. For static assets (e.g., UI bundles) keep the 5‑min cache. <br>Pros: eliminates stale data. <br>Cons: increases CDN load, slightly higher latency for unchanged responses. |
| Load‑balancer health‑checks for WebSocket – L7 health checks may close idle connections, causing unexpected reconnects. | Health‑checks use HTTP GET, not WS ping. | Configure L4 TCP health checks or add a dedicated /health endpoint that sends a WebSocket ping and expects a pong. <br>Pros: keeps connections alive. <br>Cons: more complex LB config. |
| Hot‑spotting on a single organization – if one org has many docs, all traffic lands on the same DB shard. | Partitioning only by org ID. | Hybrid sharding: first by org, then by a hash of doc ID (or use a consistent‑hash ring). <br>Pros: spreads load across shards. <br>Cons: requires a lookup table for routing. |
| Read‑replica lag under heavy writes – read traffic may see stale data. | Reads routed to replicas. | Read‑after‑write routing: for a given document, the first read after a write goes to primary; subsequent reads can use replicas. Use a “read‑after‑write” flag in the change message. <br>Pros: improves consistency without sacrificing read scalability. <br>Cons: primary load spikes, need to track per‑doc read‑after‑write windows. |
| Server‑side broadcast storm – many servers broadcast the same change, causing duplicate messages. | No deduplication across servers. | Publish to a single, durable channel (e.g., Kafka topic per document). Each server consumes the topic; duplicates are filtered by change ID. <br>Pros: eliminates cross‑server duplication. <br>Cons: adds broker dependency and latency. |
| Snapshot generation CPU intensive – compressing large HTML on every 30 s interval. | No throttling, no async processing. | Run snapshot generation in a background worker pool (e.g., separate Node/Go service) that picks up pending snapshots from a queue. <br>Pros: isolates heavy work from request path. <br>Cons: added queue latency, need to guarantee ordering. |
| WebSocket reconnection storms – all clients reconnect at once after a brief outage, overwhelming the server. | No exponential back‑off, no rate limiting. | Exponential back‑off + jitter on client side; circuit‑breaker on server side (e.g., reject new connections above a threshold, queue them). <br>Pros: protects server, smooths load. <br>Cons: may delay reconnection for some users. |
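
The debounce-plus-optimistic-concurrency row above can be sketched as follows: edits for a document are coalesced in memory for ~200 ms, then written with a version-guarded UPDATE so a concurrent writer cannot silently overwrite a newer row. The table and column names (`documents`, `content`, `version`) are assumptions for illustration only.

```typescript
import { Pool } from "pg";

const pool = new Pool();
const pending = new Map<string, { content: string; baseVersion: number; timer?: NodeJS.Timeout }>();

// Buffer an incoming edit; flush at most once per ~200 ms per document.
export function queueEdit(docId: string, content: string, baseVersion: number) {
  const entry = pending.get(docId) ?? { content, baseVersion };
  entry.content = content; // keep only the latest coalesced content
  if (!entry.timer) {
    entry.timer = setTimeout(() => flush(docId), 200);
  }
  pending.set(docId, entry);
}

async function flush(docId: string) {
  const entry = pending.get(docId);
  if (!entry) return;
  pending.delete(docId);

  // Optimistic concurrency: the UPDATE only succeeds if nobody bumped the version meanwhile.
  const res = await pool.query(
    "UPDATE documents SET content = $1, version = version + 1 WHERE id = $2 AND version = $3",
    [entry.content, docId, entry.baseVersion]
  );
  if (res.rowCount === 0) {
    // Version conflict: reload the row and merge the buffered edit instead of overwriting.
    console.warn(`version conflict on document ${docId}, merge required`);
  }
}
```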

**Observability & Operational Resilience**

| Issue | Why it matters | Mitigation (trade‑offs) |
|---|---|---|
| No end‑to‑end latency monitoring – you cannot detect when a change takes > X ms to reach all clients. | Real‑time editing requires tight latency (< 200 ms). | Instrument every change path: DB write → broadcast → client receipt. Export metrics to Prometheus/Grafana (histogram of change latency); see the sketch after this table. <br>Pros: early detection of bottlenecks. <br>Cons: extra instrumentation code, need to aggregate across servers. |
| Missing health‑checks for WebSocket – LB may think a server is healthy while it cannot accept new WS connections. | LB health‑checks use HTTP only. | Add a WS ping endpoint (/ws/health) that returns pong; configure LB to probe it. <br>Pros: accurate health detection. <br>Cons: extra endpoint, potential false positives if ping latency spikes. |
| No dead‑letter queue for broadcast failures – messages that cannot be delivered are dropped silently. | A crashed client or network glitch may cause loss of edits. | Persist broadcast events in an “outbox” table and have a background worker retry. When a client finally connects, it can request missed changes. <br>Pros: guarantees no edit is lost. <br>Cons: extra DB writes, storage for outbox. |
| No automated rollback for snapshot failures – if snapshot generation crashes, the system may lose recent state. | Snapshot is used for point‑in‑time recovery. | Implement a retry loop with exponential back‑off and a fallback to the latest DB row if snapshot fails. Log the failure and alert ops. <br>Pros: ensures continuity. <br>Cons: added retry latency, need to handle partial snapshots. |
| Lack of distributed tracing – you cannot see which component (client → API → WS → DB) is slow. | Complex asynchronous flow makes debugging hard. | Add OpenTelemetry tracing on each hop (client SDK, API middleware, WS server, DB driver). Correlate change_id across services. <br>Pros: pinpoint latency spikes. <br>Cons: overhead on every request, need to propagate trace context across async boundaries. |
| No automated scaling thresholds – you may add servers manually when load spikes. | Manual scaling leads to outages. | Set autoscaling rules based on WebSocket connection count, DB write latency, Redis memory usage, and CPU. Use Kubernetes HPA or cloud‑native autoscaling groups. <br>Pros: reacts quickly to load. <br>Cons: may over‑scale during transient spikes; need cost‑aware policies. |
| No graceful degradation for CDN failures – if CloudFront edge nodes are down, static assets may be unavailable. | CDN is critical for UI bundles. | Serve static assets from a secondary origin (e.g., S3) with fallback; configure CloudFront to use origin‑failover. <br>Pros: ensures asset availability. <br>Cons: added origin load, possible duplicate content. |
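
For the end-to-end latency row, a minimal sketch using the `prom-client` package: the server records the time from receiving a change to the broadcast being published and exposes it as a histogram for Prometheus/Grafana. The metric name, label, and bucket boundaries are illustrative assumptions.

```typescript
import http from "http";
import client from "prom-client";

// Histogram of seconds from "change received" to "change published to the broadcast bus".
const changeLatency = new client.Histogram({
  name: "doc_change_broadcast_latency_seconds",
  help: "Latency from receiving an edit to publishing it for broadcast",
  labelNames: ["doc_shard"],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.2, 0.5, 1, 2],
});

// Wrap the persist-and-broadcast path with a timer.
export async function handleChange(shard: string, persistAndBroadcast: () => Promise<void>) {
  const stopTimer = changeLatency.startTimer({ doc_shard: shard });
  try {
    await persistAndBroadcast();
  } finally {
    stopTimer(); // records elapsed seconds into the histogram
  }
}

// Expose /metrics for Prometheus to scrape.
http
  .createServer(async (_req, res) => {
    res.setHeader("Content-Type", client.register.contentType);
    res.end(await client.register.metrics());
  })
  .listen(9100);
```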

**Security Hardening**

| Failure / Race Condition | Why it occurs | Mitigation (trade‑offs) |
|---|---|---|
| JWT in localStorage – XSS can steal tokens. | No HttpOnly flag, token accessible via JS. | Move JWT to HttpOnly cookie (or use a secure “session‑store” in Redis). Add SameSite‑Lax to mitigate CSRF. <br>Pros: mitigates XSS token theft. <br>Cons: need to handle CORS, cookie size limits. |
| Token revocation not immediate – cached tokens in CDN or client may be used after revocation. | CDN caches auth responses (Cache‑Control: public). | Set Cache‑Control: no‑store or private for all auth‑related endpoints. Invalidate CDN cache on revocation (purge API). <br>Pros: guarantees revocation visibility. <br>Cons: higher CDN load, need purge API latency. |
| No rate limiting on WebSocket – a malicious actor can open millions of connections. | No per‑IP or per‑user limits. | Implement per‑IP connection caps (e.g., 10 connections/IP) and a token bucket for messages per second (see the sketch after this table). Use a WAF rule to block abnormal traffic. <br>Pros: prevents DoS. <br>Cons: may block legitimate high‑traffic users, adds complexity to connection handling. |
| No TLS‑termination hardening – CloudFront may terminate TLS at edge, exposing raw data to CDN. | Edge TLS termination is fine, but you must ensure no HTTP‑only fallback and strict HSTS. | Enable HSTS, TLS‑1.3, OCSP stapling, and strict transport security on the origin. <br>Pros: stronger encryption. <br>Cons: adds CPU overhead on CloudFront, must keep certificates up‑to‑date. |
| No audit trail for document edits – GDPR/Compliance requires ability to prove who edited what. | Only snapshots stored, no per‑edit logs. | Store each edit in doc_changes table with user ID, timestamp, IP, and operation type. Enable immutable logs (append‑only) and periodic export for compliance. <br>Pros: full audit. <br>Cons: larger DB footprint, need to purge after retention period. |
| No token revocation list in Redis – revoked tokens may still be used after a server restart. | Revocation list is in‑memory only. | Persist revocation list to a durable store (e.g., PostgreSQL table) and replicate to Redis on startup. <br>Pros: revocation survives restarts. <br>Cons: extra DB writes, possible race if list is stale. |
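
A minimal sketch of the per-IP connection cap and per-connection message budget described above, kept in process memory for simplicity (a shared store such as Redis would be needed to enforce limits across servers). The 10-connections-per-IP limit comes from the table; the 50-messages-per-second budget is an arbitrary illustrative number.

```typescript
import { WebSocketServer, WebSocket } from "ws";

const MAX_CONNECTIONS_PER_IP = 10;
const MAX_MESSAGES_PER_SECOND = 50;

const connectionsPerIp = new Map<string, number>();
const wss = new WebSocketServer({ port: 8081 });

wss.on("connection", (ws: WebSocket, req) => {
  const ip = req.socket.remoteAddress ?? "unknown";

  // Per-IP connection cap: refuse the connection once the limit is reached.
  const current = connectionsPerIp.get(ip) ?? 0;
  if (current >= MAX_CONNECTIONS_PER_IP) {
    ws.close(1008, "too many connections"); // 1008 = policy violation
    return;
  }
  connectionsPerIp.set(ip, current + 1);
  ws.on("close", () => connectionsPerIp.set(ip, (connectionsPerIp.get(ip) ?? 1) - 1));

  // Simple token bucket: refill once per second, one token per message.
  let tokens = MAX_MESSAGES_PER_SECOND;
  const refill = setInterval(() => (tokens = MAX_MESSAGES_PER_SECOND), 1000);
  ws.on("close", () => clearInterval(refill));

  ws.on("message", () => {
    if (tokens <= 0) {
      ws.close(1008, "message rate exceeded");
      return;
    }
    tokens -= 1;
    // ... normal change handling goes here ...
  });
});
```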

**Recommended Architecture Changes (Summary)**

- Keep the current document state in a single row (doc_current) and an append‑only doc_changes table for each operation.
- Store snapshots as compressed binary blobs in an object store (S3), retained only for a limited period (e.g., 10 min).
- Shard by org_id + doc_id using consistent hashing; optionally add a secondary key (shard_id) to spread load.
- Trace every change → DB → broadcast → client; expose latency histograms; set up autoscaling based on connection count, DB write latency, and Redis memory.

**Trade‑off Summary**

| Change | Benefit | Cost / Drawback |
|---|---|---|
| Sticky sessions | Simple to implement; no cross‑server broadcast needed. | Uneven load, hot‑spot servers, harder to scale horizontally. |
| Shared pub/sub (Redis/Kafka) | True cross‑server broadcast, eliminates duplicate work, easier scaling. | Added latency, need for durable broker, operational overhead. |
| Short‑lived JWT + refresh | Immediate revocation, mitigates token theft. | Extra round‑trip per user, higher auth‑server load. |
| CRDT/OT | No lost edits, deterministic merge. | Higher CPU per change, extra state (vector clocks) stored. |
| Debounce client edits | Reduces DB write volume, less contention. | Small client‑side latency (e.g., 200 ms). |
| Compressed snapshots in object store | Cuts storage cost, faster retrieval. | CPU for compression, need to manage lifecycle. |
| Read‑after‑write routing | Guarantees consistency for latest edits. | Primary DB load spikes; need to track a per‑doc read‑after‑write window. |
| Redis cluster | Horizontal scaling, fault‑tolerance. | Complexity of sharding keys, added network hops. |
| Auto‑scale based on WS connections | Handles traffic spikes automatically. | May over‑scale during brief bursts; need cost‑aware thresholds. |
| OpenTelemetry tracing | Pinpoints latency bottlenecks, aids debugging. | Overhead on every request; requires instrumentation on all services. |
| Audit‑log table | Compliance, forensic analysis. | Increased DB size, need retention policy. |
Implementing the above mitigations will eliminate the most critical failure modes, remove the race conditions that currently cause lost edits, and provide a scalable foundation that can handle thousands of concurrent editors without hitting single‑point bottlenecks. The trade‑offs are intentional: you trade a little extra latency or operational complexity for far greater reliability, consistency, and security. Adjust the specific choices (e.g., stickiness vs. shared bus, CRDT vs. OT) based on your latency SLA, team expertise, and cost constraints.