Mercury 2's response to the 'Debug This Architecture' challenge.
Here's the architecture for a real-time collaborative document editor (like Google Docs):

**System Overview:**
- Frontend: React SPA with WebSocket connections
- Backend: Node.js API servers behind a load balancer (round-robin)
- Database: PostgreSQL for document storage, Redis for session cache
- Real-time: Each API server maintains its own WebSocket connections to clients
- Sync Strategy: Last-write-wins with timestamps from client clocks
- Storage: Documents saved as full HTML snapshots every 30 seconds
- Auth: JWT tokens with 24-hour expiry, stored in localStorage
- CDN: CloudFront for static assets, also caches API responses for 5 minutes

**Data Flow:**
1. User types → change event sent via WebSocket to their connected server
2. Server writes change to PostgreSQL
3. Server broadcasts change to all WebSocket clients connected to THAT server
4. Other servers poll PostgreSQL every 2 seconds for changes
5. Conflict resolution: if two users edit the same paragraph, last timestamp wins

**Scaling Plan:**
- Horizontal scaling by adding more API servers
- Database read replicas for read-heavy operations
- Document partitioning by organization ID

Identify all potential failure modes, race conditions, and scaling bottlenecks in this architecture. For each issue found, propose a specific solution with trade-offs.
Below is a structured “failure‑mode → mitigation” map for the architecture you described.
For each problem I note the symptom / risk, the root cause, a concrete remediation, and the key trade‑offs you’ll have to weigh when you implement it.
**1. Real-time layer: WebSockets & fan-out**

| # | Failure / Race Condition | Why it happens | Concrete solution | Trade‑offs |
|---|---|---|---|---|
| 1.1 | Lost updates when a server crashes – a client’s WebSocket is attached to a single API instance; if that instance dies the change never reaches other clients. | No central fan‑out; each server only knows about its own connections. | • Introduce a message broker (Kafka, NATS, or Redis Pub/Sub) that all API nodes publish changes to. <br>• Each server subscribes to the broker and pushes to all its local sockets. | Broker adds latency (few ms) and operational overhead, but guarantees at‑least‑once delivery and decouples scaling of WS workers from the DB. |
| 1.2 | Sticky‑session requirement – load‑balancer round‑robin can route a client’s subsequent WS frames to a different server, breaking the “only broadcast to own sockets” rule. | WS connections are stateful; a client can have only one active socket. | • Use session affinity (sticky‑sessions) on the LB or <br>• Deploy a WebSocket gateway (e.g., Envoy, NGINX, or a dedicated socket‑server) that terminates WS and forwards events to the broker. | Sticky‑sessions limit true horizontal scaling of WS workers; a gateway adds a hop but lets you scale workers independently. |
| 1.3 | Back‑pressure / overload – a burst of edits (e.g., paste of a large block) floods the broker and downstream sockets, causing queue buildup and eventual OOM. | No flow‑control; WS frames are fire‑and‑forget. | • Rate‑limit at the client (debounce typing, max N ops / sec). <br>• Batch changes on the server (e.g., 10 ms windows) before publishing. <br>• Enable broker back‑pressure (Kafka’s consumer lag metrics) and drop or throttle when lag exceeds a threshold. | Slight increase in latency (few tens of ms) but protects stability. |
| 1.4 | Network partition / intermittent connectivity – a client temporarily loses WS, reconnects to a different server, and misses intermediate ops. | No replay mechanism; server only pushes live updates. | • Store ops in a log (Kafka topic or Redis stream) with a monotonically increasing sequence number. <br>• On reconnect, the client asks for “ops after seq X”. | Requires client‑side sequence tracking and log retention; extra storage cost. |
| 1.5 | Duplicate delivery – if a server publishes to the broker and also re‑broadcasts locally, a client connected to the same server may receive the same op twice. | Lack of idempotency handling. | • Include a unique op‑ID (UUID + server‑id) and have the client dedupe. <br>• Or let the broker be the only broadcast path (remove local broadcast). | Slight client complexity; eliminates double‑send risk. |
**2. Data layer: PostgreSQL, sync & storage**

| # | Failure / Race Condition | Why it happens | Concrete solution | Trade‑offs |
|---|---|---|---|---|
| 2.1 | Write‑write conflict & last‑write‑wins (LWW) is unreliable – client clocks drift, leading to “future” timestamps that overwrite newer edits. | No authoritative time source. | • Use server‑side timestamps (e.g., NOW() in Postgres) instead of client‑provided ones. <br>• Or keep client‑provided timestamps but validate they are within a sane bound (e.g., ±5 s). | Server timestamps guarantee total order, but you lose the ability to resolve ties based on client intent (e.g., “my edit happened earlier”). |
| 2.2 | Polling lag – other servers poll every 2 s, causing up to 2 s of stale view and increasing conflict probability. | Polling is coarse and adds DB load. | • Replace polling with change‑data‑capture (CDC) (Postgres logical replication) that streams changes to the broker. <br>• Or use LISTEN/NOTIFY + a lightweight pub/sub to push updates instantly. | CDC requires extra infrastructure (Debezium, Kafka Connect); LISTEN/NOTIFY has limited payload size and can be overwhelmed at high QPS. |
| 2.3 | Snapshot loss – full HTML snapshots every 30 s means any crash between snapshots loses up to 30 s of work. | No incremental persistence. | • Persist incremental ops (the same stream used for real‑time) to durable storage (Kafka, S3). <br>• Periodically compact into a new snapshot (e.g., every minute). | More storage I/O but near‑zero data loss; compaction adds CPU overhead. |
| 2.4 | Read‑replica lag – heavy read traffic (e.g., document load) can cause replicas to lag behind the primary, showing stale data after a write. | Replication is asynchronous. | • Serve writes (including the latest state) from the primary only; route read‑only heavy ops (history, analytics) to replicas. <br>• Use synchronous replication for critical tables (costly). | Synchronous replication hurts write latency; routing logic adds complexity. |
| 2.5 | Hot‑spot partitions – documents are partitioned by organization ID, but a large org can generate a disproportionate load on a single DB shard. | Uneven distribution of active docs. | • Add sharding on document ID (hash) in addition to org ID, or use Citus (Postgres distributed) to auto‑balance. | Requires schema changes and a distributed query layer. |
| 2.6 | Dead‑locks / transaction contention – many concurrent edits on the same row (document) cause lock contention. | Each edit writes a new row or updates a large JSON column. | • Use append‑only table for ops (no UPDATE). <br>• Or store the document in a document‑store (e.g., MongoDB) that handles concurrent writes better. | Append‑only table grows quickly; need periodic compaction. |
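Rows 1.4, 2.3, and 2.6 all converge on the same primitive: an append-only op log with a server-assigned, monotonically increasing sequence number per document. The sketch below is a deliberately simplified in-memory version (a real deployment would back it with a Kafka topic or Redis stream); the `OpLog` class and its method names are hypothetical.

```javascript
// Minimal sketch of the append-only op log (rows 2.3 / 1.4): the server
// assigns per-document sequence numbers, and a reconnecting client replays
// "ops after seq X". In-memory arrays stand in for the durable log.
class OpLog {
  constructor() {
    this.logs = new Map(); // docId -> array of { seq, change }
  }
  append(docId, change) {
    const log = this.logs.get(docId) ?? [];
    // Server-side ordering: no dependence on client clocks (fixes row 2.1).
    const op = { seq: log.length + 1, change };
    log.push(op);
    this.logs.set(docId, log);
    return op;
  }
  // Replay for a client that reconnected after missing some ops (row 1.4).
  since(docId, lastSeenSeq) {
    return (this.logs.get(docId) ?? []).filter((op) => op.seq > lastSeenSeq);
  }
}
```

Appends never update existing rows, which sidesteps the row-2.6 lock contention; periodic compaction of the log into snapshots covers row 2.3.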
**3. Caching & sessions: Redis**

| # | Failure / Race Condition | Why it happens | Concrete solution | Trade‑offs |
|---|---|---|---|---|
| 3.1 | Cache stampede on document load – many clients request the same doc after a snapshot, all hit the DB simultaneously. | No request coalescing. | • Use single‑flight / request coalescing (e.g., SETNX lock) so only one DB fetch occurs; others wait for the cached result. | Slight latency for waiting clients; extra lock handling. |
| 3.2 | Redis node failure – session cache lost, causing auth look‑ups to fall back to DB and increasing latency. | No redundancy. | • Deploy Redis with replicas and automatic failover via Sentinel, or Redis Cluster if you also need sharding. | Higher memory cost; failover adds operational complexity, and Cluster adds key‑slot migrations. |
| 3.3 | Stale session data – JWT stored in localStorage, but Redis is used for revocation; if Redis is out‑of‑sync, revoked tokens may still be accepted. | No real‑time sync between client and Redis. | • Use short‑lived JWTs (e.g., 15 min) + refresh tokens stored in Redis. <br>• Or keep JWTs stateless and rely on token introspection only when a revocation flag is set. | Shorter JWT lifespan increases refresh traffic; adds complexity to token flow. |
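The single-flight fix from row 3.1 can be shown in a few lines. This is the simplest per-process form, using a `Map` of in-flight promises; the `SETNX` variant in the table extends the same idea across processes. `fetchDoc` is a placeholder for the real DB load, and all names here are illustrative.

```javascript
// Request-coalescing sketch (row 3.1): concurrent loads of the same
// document share one in-flight fetch instead of each hitting the DB.
const inFlight = new Map(); // key -> Promise of the pending fetch

function singleFlight(key, fetchDoc) {
  if (inFlight.has(key)) return inFlight.get(key); // join the existing fetch
  const p = Promise.resolve()
    .then(() => fetchDoc(key))
    .finally(() => inFlight.delete(key)); // allow a fresh fetch next time
  inFlight.set(key, p);
  return p;
}
```

Waiters pay only the latency of the one shared fetch, which is the "slight latency for waiting clients" trade-off noted in the table.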
**4. Authentication: JWT**

| # | Failure / Race Condition | Why it happens | Concrete solution | Trade‑offs |
|---|---|---|---|---|
| 4.1 | JWT replay after logout – token lives 24 h; if a user logs out on one device, the token remains valid elsewhere. | No revocation list. | • Store a token version or session ID in Redis; on logout increment version, forcing clients to fetch a new token. <br>• Or reduce JWT TTL to 1 h and use refresh tokens. | More frequent token refresh; extra Redis reads on each request. |
| 4.2 | XSS stealing of JWT from localStorage – localStorage is accessible to any script on the page. | Insecure storage. | • Move token to httpOnly Secure SameSite cookies. <br>• Or keep in IndexedDB with CSP + Subresource Integrity. | Cookies are sent automatically on every request (including static assets) unless scoped; need careful SameSite handling. |
| 4.3 | Clock skew in token issuance – client clock used for “exp” validation can be wrong, causing premature rejection. | Client‑side time check. | • Validate exp on the server only; client should ignore it for UI decisions. | Slight UX impact (user may see “session expired” after a few minutes). |
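The token-version scheme from row 4.1 works like this sketch. A plain `Map` stands in for Redis, and the token objects stand in for signed JWT claims (`{ sub, ver, exp }`); the function names are hypothetical, not from any library.

```javascript
// Sketch of token-version revocation (row 4.1): each user has a version
// counter; logout bumps it, instantly invalidating every token minted
// with the old version. A Map stands in for Redis here.
const tokenVersions = new Map(); // userId -> current version

function issueToken(userId) {
  const ver = tokenVersions.get(userId) ?? 1;
  tokenVersions.set(userId, ver);
  return { userId, ver }; // in reality: signed JWT with a `ver` claim
}

function logoutEverywhere(userId) {
  tokenVersions.set(userId, (tokenVersions.get(userId) ?? 1) + 1);
}

function isTokenValid(token) {
  // One Redis read per request -- the cost noted in the trade-offs column.
  return tokenVersions.get(token.userId) === token.ver;
}
```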
**5. CDN: CloudFront**

| # | Failure / Race Condition | Why it happens | Concrete solution | Trade‑offs |
|---|---|---|---|---|
| 5.1 | Stale API responses – CloudFront caches API GETs for 5 min, so a client may read an outdated document snapshot after an edit. | Cache TTL too aggressive for mutable data. | • Set Cache‑Control: no‑store or max‑age=0 for document‑fetch endpoints. <br>• Or use Cache‑Tag invalidation on each edit (purge specific doc). | More origin traffic; invalidation adds latency but guarantees freshness. |
| 5.2 | Cache warm‑up latency – after a new document is created, the first read triggers a cache miss and a DB hit, causing a spike. | No pre‑warming. | • Proactively populate CDN (or edge cache) after snapshot creation via a background job. | Extra write‑through cost; minimal impact if done asynchronously. |
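The header split from row 5.1 amounts to a simple routing rule: mutable document endpoints must never be cached by CloudFront, while fingerprinted static assets keep a long TTL. The route prefixes in this sketch are assumptions, not part of the original design.

```javascript
// Cache-Control policy sketch (row 5.1). Route patterns are illustrative.
function cacheHeadersFor(path) {
  if (path.startsWith('/api/docs/')) {
    return { 'Cache-Control': 'no-store' }; // always serve fresh document state
  }
  if (path.startsWith('/static/')) {
    // Fingerprinted assets never change, so they can be cached "forever".
    return { 'Cache-Control': 'public, max-age=31536000, immutable' };
  }
  return { 'Cache-Control': 'no-cache' }; // revalidate everything else
}
```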
**6. Load balancing & deployment**

| # | Failure / Race Condition | Why it happens | Concrete solution | Trade‑offs |
|---|---|---|---|---|
| 6.1 | Unbalanced load – round‑robin without health checks can send traffic to a crashed instance, causing connection failures. | LB not aware of instance health. | • Enable health‑check endpoints (e.g., /healthz) and configure LB to skip unhealthy nodes. | Slightly longer health‑check interval may delay detection. |
| 6.2 | Graceful shutdown – when a server is terminated (e.g., autoscaling), existing WS connections are dropped abruptly. | No draining. | • Implement connection draining: stop accepting new WS, broadcast a “reconnect” message, wait for existing sockets to close, then exit. | Slightly longer termination time; need orchestrator support (K8s pod termination hooks). |
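The draining sequence from row 6.2 can be sketched directly. The `server` and socket objects here are minimal stand-ins for a real WebSocket server, and the message shape and timeout are assumptions.

```javascript
// Connection-draining sketch (row 6.2): on shutdown, stop accepting new
// sockets, tell clients to reconnect elsewhere, then wait (bounded) for
// them to disconnect before force-closing stragglers.
async function drain(server, timeoutMs = 5000) {
  server.accepting = false; // health check now fails, diverting LB traffic
  for (const socket of server.sockets) {
    socket.send({ type: 'reconnect' }); // client reconnects through the LB
  }
  const deadline = Date.now() + timeoutMs;
  while (server.sockets.size > 0 && Date.now() < deadline) {
    await new Promise((r) => setTimeout(r, 50)); // poll until sockets close
  }
  for (const socket of server.sockets) socket.close(); // force any stragglers
}
```

In Kubernetes this would run from a `preStop` hook, with `terminationGracePeriodSeconds` set comfortably above the drain timeout.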
**Scaling bottlenecks**

| Bottleneck | Symptoms | Remedy | Trade‑offs |
|---|---|---|---|
| WebSocket fan‑out | Each server only knows its own sockets → limited to per‑node client count. | Central broker (Kafka/NATS) + pub/sub; or a dedicated WS gateway. | Adds a network hop and operational component, but enables true horizontal scaling of API nodes. |
| DB write throughput | All edits go to a single PostgreSQL primary → CPU/IO saturation. | Append‑only ops table + partitioning (by org + hash) or move to a distributed SQL (Citus, CockroachDB). | More complex schema; need compaction jobs. |
| Polling latency | 2 s poll → stale view, higher conflict rate. | CDC / LISTEN‑NOTIFY to push changes instantly. | CDC adds extra services; LISTEN‑NOTIFY limited payload size. |
| Snapshot frequency | 30 s snapshots → potential loss of up to 30 s of work. | Persist incremental ops to durable log; compact into snapshots periodically. | Extra storage I/O; compaction CPU cost. |
| Redis single point | One Redis node → cache miss + session loss on failure. | Redis replication with Sentinel failover, or Redis Cluster for sharding + HA. | Higher memory cost; Cluster adds key‑slot migrations. |
| JWT long TTL | Revocation impossible, XSS risk. | Shorter JWT + refresh token; store revocation list in Redis. | More token refresh traffic; extra Redis reads. |
| CDN stale API | 5 min cache TTL → stale document reads. | Cache‑Control: no‑store for mutable endpoints; edge invalidation on edit. | More origin load; but guarantees freshness. |
| Load‑balancer routing | Round‑robin without affinity breaks WS continuity. | Sticky sessions or WS gateway that terminates connections. | Sticky sessions limit true stateless scaling; gateway adds a hop. |
If you need to ship a more robust version quickly, focus on the high‑impact, low‑complexity changes first:
1. Use server‑side timestamps (NOW()) and store a monotonic sequence number per op.
2. Publish every change to a Redis Pub/Sub channel doc:{id}. Replace the per‑node broadcast with a subscription to that channel.
3. Add LISTEN/NOTIFY on the ops table and have each API node push the notification to the broker, eliminating the 2 s poll.
4. Set Cache‑Control: no‑store on all document‑fetch endpoints; invalidate CDN on every edit (CloudFront invalidation API).
5. Add /healthz and configure LB draining.

These steps give you real‑time consistency, no stale reads, and basic fault tolerance while keeping the architecture simple.
**Longer‑term enhancements**

| Enhancement | What it solves | Rough effort |
|---|---|---|
| CRDT / Operational Transformation (OT) | Eliminates LWW conflicts, enables true concurrent editing without a central arbiter. | High – requires a new data model, client library, and server‑side merging. |
| Event‑sourced document store (Kafka + compacted topic) | Guarantees lossless edit history, fast replay for new nodes, and easy snapshotting. | Medium – need to build consumer pipelines and compaction logic. |
| Distributed SQL (Citus / CockroachDB) | Scales writes horizontally, removes hot‑spot partitions. | Medium – data migration and query‑rewriting. |
| WebSocket gateway (Envoy/Traefik) | Decouples WS scaling from API logic, removes sticky‑session requirement. | Low‑Medium – configuration only, but requires a new service. |
| Token introspection service | Central revocation, short‑lived JWTs, per‑device logout. | Low – small microservice + Redis. |
| Edge‑computing (Cloudflare Workers) | Move diff‑generation to edge, reduce origin load. | Medium – custom JavaScript/TypeScript on edge. |
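To make the CRDT row concrete: the toy below is a last-writer-wins *register* ordered by a Lamport clock (logical counter plus replica id as tiebreak), which converges identically on every replica with no dependence on wall clocks. It only demonstrates the convergence property; real collaborative editors use sequence CRDTs (Yjs, Automerge) or OT, and the class here is illustrative.

```javascript
// Toy LWW-register with a Lamport clock: shows why logical ordering
// beats the client-clock LWW in the original design. Not a full CRDT
// editor -- just the convergence property.
class LwwRegister {
  constructor(replicaId) {
    this.replicaId = replicaId;
    this.clock = 0;           // Lamport counter
    this.value = undefined;
    this.stamp = [0, ''];     // [lamport, replicaId] of the winning write
  }
  set(value) {
    this.clock += 1;
    const op = { value, stamp: [this.clock, this.replicaId] };
    this.apply(op);
    return op; // ship this op to the other replicas
  }
  // Merging is commutative: applying ops in any order yields the same state.
  apply(op) {
    const [c, id] = op.stamp;
    this.clock = Math.max(this.clock, c); // standard Lamport clock update
    const [mc, mid] = this.stamp;
    if (c > mc || (c === mc && id > mid)) {
      this.value = op.value;
      this.stamp = op.stamp;
    }
  }
}
```

Two replicas that exchange their ops in opposite orders still end up with the same value, which is exactly the guarantee client-timestamp LWW cannot give.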
Implement the low‑effort fixes first, then progressively adopt the higher‑complexity, higher‑value solutions (CRDT/OT, event‑sourcing, distributed SQL) as the product scales.