Gpt Oss 20B's response to the 'Debug This Architecture' challenge.
Here's the architecture for a real-time collaborative document editor (like Google Docs):

**System Overview:**
- Frontend: React SPA with WebSocket connections
- Backend: Node.js API servers behind a load balancer (round-robin)
- Database: PostgreSQL for document storage, Redis for session cache
- Real-time: Each API server maintains its own WebSocket connections to clients
- Sync Strategy: Last-write-wins with timestamps from client clocks
- Storage: Documents saved as full HTML snapshots every 30 seconds
- Auth: JWT tokens with 24-hour expiry, stored in localStorage
- CDN: CloudFront for static assets, also caches API responses for 5 minutes

**Data Flow:**
1. User types → change event sent via WebSocket to their connected server
2. Server writes change to PostgreSQL
3. Server broadcasts change to all WebSocket clients connected to THAT server
4. Other servers poll PostgreSQL every 2 seconds for changes
5. Conflict resolution: if two users edit the same paragraph, last timestamp wins

**Scaling Plan:**
- Horizontal scaling by adding more API servers
- Database read replicas for read-heavy operations
- Document partitioning by organization ID

Identify all potential failure modes, race conditions, and scaling bottlenecks in this architecture. For each issue found, propose a specific solution with trade-offs.
Below is a “code review” of the design you posted. I've broken it down into failure modes, race conditions, and bottlenecks, and for each one I propose a concrete fix (or set of fixes) together with the main trade-offs you'll pay for it.
TL;DR – The biggest problems are:
- No distributed change bus – polling plus per-server broadcast is slow and inconsistent, and it scales poorly.
- Client‑clock based LWW – you’ll lose edits when clocks drift or when two users edit the same paragraph.
- Write‑heavy DB – one write per keystroke + a 30 s snapshot blows up PostgreSQL.
- Load balancer with no session affinity – round-robin can route the same user's connections to two different servers, breaking the real-time flow.
- Security & freshness holes – a JWT in localStorage can be stolen via XSS, and CDN caching of API responses can serve stale document data.
Below are the detailed points.
**Change propagation & real-time fan-out**

| Problem | Why it fails | Fix | Trade-offs |
|---|---|---|---|
| Per-server broadcast + 2 s polling | • Latency (2 s) before other servers see a change. <br>• Inconsistent state when two servers write different changes to the same paragraph. <br>• Polling is wasteful – DB load grows linearly with #servers. | Introduce a distributed event bus (Redis Pub/Sub, NATS, Kafka, or a custom WebSocket “hub” cluster); see the Pub/Sub sketch after this table.<br>• Backend servers publish change events to the bus.<br>• Every server subscribes and pushes the change to its local clients immediately. | • Extra component to maintain (ops, monitoring). <br>• Slightly higher latency than direct WebSocket, but bounded to a few ms. <br>• Requires idempotency handling if you use a queue that can replay messages. |
| Clients reconnect to a different server | The new server won’t have the “in‑flight” changes that were already broadcast by the old server. | Sticky sessions (session affinity) on the load balancer or client‑side reconnection logic that re‑joins the same server (e.g. via a token that encodes the server ID). | • Sticky sessions hurt horizontal scaling of the backend (one server can become a hotspot). <br>• Client reconnection logic is more complex but keeps the backend stateless. |
| Duplicate change delivery | If both polling and Pub/Sub are used, a change may be broadcast twice. | Single source of truth – remove polling entirely. | • All servers must keep a local cache of the last change ID to avoid re‑processing. |
| Network partition | If the bus goes down, changes stop propagating. | Graceful degradation – keep local change log and replay when bus recovers. | • Adds a bit of complexity; you need a durable queue. |
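As a rough sketch of the event-bus fix in the first row above, here is what publish/subscribe could look like on each Node.js API server using ioredis; the channel naming, message shape, and server-ID check are illustrative assumptions, not part of the posted design.

```ts
import Redis from "ioredis";

// Pub/Sub needs two connections: a subscribed connection cannot run normal commands.
const publisher = new Redis();
const subscriber = new Redis();

interface ChangeEvent {
  docId: string;
  opId: number;     // server-assigned, monotonically increasing per document (assumed)
  origin: string;   // ID of the API server that accepted the edit
  payload: unknown; // the operation itself (insert/delete/...)
}

const SERVER_ID = process.env.SERVER_ID ?? "api-1";
const handlers = new Map<string, (e: ChangeEvent) => void>();

// One listener for all channels; each handler fans out to this server's WebSocket clients.
subscriber.on("message", (channel, message) => {
  const handler = handlers.get(channel);
  if (!handler) return;
  const event: ChangeEvent = JSON.parse(message);
  if (event.origin === SERVER_ID) return; // this server already delivered it locally
  handler(event);
});

// Called when the first local client opens a document.
export async function subscribeToDoc(
  docId: string,
  fanOut: (event: ChangeEvent) => void
): Promise<void> {
  handlers.set(`doc:${docId}`, fanOut);
  await subscriber.subscribe(`doc:${docId}`);
}

// Called whenever this server accepts an edit from one of its own clients.
export async function publishChange(event: ChangeEvent): Promise<void> {
  await publisher.publish(`doc:${event.docId}`, JSON.stringify(event));
}
```

Note that plain Redis Pub/Sub is fire-and-forget; if you need replay after an outage (the “network partition” row), a persistent stream or queue is the safer choice.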
**Conflict resolution & consistency**

| Problem | Why it fails | Fix | Trade-offs |
|---|---|---|---|
| Last-write-wins based on client timestamps | • Client clocks can drift by seconds → edits from a “behind” client win.<br>• Two users editing the same paragraph simultaneously causes one edit to be silently dropped. | Operational Transformation (OT) or Conflict-free Replicated Data Types (CRDT); see the Yjs sketch after this table. <br>• Server assigns a monotonically increasing sequence number or uses a Lamport timestamp. <br>• Clients send operations (insert/delete) that can be merged deterministically. | • OT/CRDT libraries are non-trivial to integrate and test. <br>• Larger message size (operation payload) but far more robust. |
| No concurrency control on DB writes | Two writes to the same row can interleave and silently overwrite each other. | Use PostgreSQL's SELECT ... FOR UPDATE or INSERT … ON CONFLICT … DO UPDATE guarded by a version column (optimistic concurrency). | • Adds a small locking overhead but protects against lost updates. |
| Polling + 2 s delay | Users see a lag before another user's edit to the same paragraph appears. | Use the event bus (above) + OT/CRDT so updates are applied instantly. | • None beyond the bus/CRDT costs already listed; the real-time feel improves dramatically. |
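To make the CRDT route concrete, here is a minimal sketch with Yjs (one of the libraries recommended further down) showing two replicas accepting concurrent edits to the same sentence and converging once they exchange updates; it is illustrative only, not the posted system's code.

```ts
import * as Y from "yjs";

// Two independent replicas of the same document (two browsers, or two API servers).
const docA = new Y.Doc();
const docB = new Y.Doc();

docA.getText("content").insert(0, "Hello world");
// Bring B up to date with A's current state.
Y.applyUpdate(docB, Y.encodeStateAsUpdate(docA));

// Concurrent edits to the same sentence -- under last-write-wins one of these would be lost.
docA.getText("content").insert(5, ", brave"); // A: "Hello, brave world"
docB.getText("content").insert(11, "!");      // B: "Hello world!"

// Exchange updates in any order; both replicas converge deterministically.
Y.applyUpdate(docB, Y.encodeStateAsUpdate(docA));
Y.applyUpdate(docA, Y.encodeStateAsUpdate(docB));

console.log(docA.getText("content").toString()); // "Hello, brave world!"
console.log(
  docA.getText("content").toString() === docB.getText("content").toString()
); // true
```

No client clock is involved: ordering comes from the CRDT's internal IDs, which is exactly what removes the clock-drift failure mode.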
**Database write path & storage**

| Problem | Why it fails | Fix | Trade-offs |
|---|---|---|---|
| One write per keystroke | A doc with 10 active editors typing a few characters per second already generates dozens of writes/sec; multiplied across thousands of open documents, PostgreSQL can't keep up without sharding or batching. | Batch changes: buffer changes for 100–200 ms or 10 changes, then persist as a single row (see the batching sketch after this table). <br>• Store a delta log (operation + target position). <br>• Snapshot every 30 s only if the document is actually dirty. | • Slightly more latency for the “last” change. <br>• Need to handle rollback if the batch fails (transaction). |
| Full HTML snapshot every 30 s | The whole document is rewritten even if only one character changed; e.g. a 100 KB HTML doc snapshotted twice a minute is ~200 KB/min, or ~2 GB/min across 10k active docs. | Store diffs instead of full snapshots. <br>• Use a binary diff algorithm (e.g. diff-match-patch). <br>• Keep snapshots only for critical points (e.g. every 5 min, every 1 MB of changes). | • Slightly more CPU to compute diffs. <br>• Recovery becomes a bit more complex (apply diffs to base). |
| Single PostgreSQL instance | All writes go to one node → CPU, I/O, and connection limits. | Write‑throughput sharding: partition by document ID or org ID into multiple Postgres instances (or use a sharded cluster like Citus). <br>• Use a “write‑hot” partition for the active doc. <br>• Keep a global read replica for analytics. | • More operational overhead (multiple DBs). <br>• Must implement routing logic in the API. |
| No connection pooling | Each write opens a new DB connection, exhausting Postgres's connection limit under load. | Use a connection pool (pg-pool or PgBouncer). | • Standard practice; negligible cost. |
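A minimal sketch of the batching row above, assuming node-postgres (pg) and a hypothetical doc_ops delta-log table with a unique key on (doc_id, seq); the flush window and batch size are illustrative.

```ts
import { Pool } from "pg";

const pool = new Pool(); // connection pooling; reads the standard PG* env vars

interface Op {
  docId: string;
  seq: number;     // server-assigned sequence number (assumed)
  payload: object; // the operation (insert/delete + position)
}

const buffer: Op[] = [];
const FLUSH_MS = 150;  // assumed batching window
const FLUSH_SIZE = 10; // assumed max ops per batch

export function enqueueOp(op: Op): void {
  buffer.push(op);
  if (buffer.length >= FLUSH_SIZE) void flush();
}

setInterval(() => void flush(), FLUSH_MS);

async function flush(): Promise<void> {
  if (buffer.length === 0) return;
  const batch = buffer.splice(0, buffer.length);
  try {
    const values: unknown[] = [];
    const placeholders = batch
      .map((op, i) => {
        values.push(op.docId, op.seq, JSON.stringify(op.payload));
        return `($${i * 3 + 1}, $${i * 3 + 2}, $${i * 3 + 3})`;
      })
      .join(", ");
    // One round trip for the whole batch; ON CONFLICT makes retries/replays idempotent.
    await pool.query(
      `INSERT INTO doc_ops (doc_id, seq, payload) VALUES ${placeholders}
       ON CONFLICT (doc_id, seq) DO NOTHING`,
      values
    );
  } catch (err) {
    buffer.unshift(...batch); // keep the ops; the next flush retries them
    console.error("batch persist failed", err);
  }
}
```

The trade-off from the table shows up here directly: the last keystroke can sit in the buffer for up to FLUSH_MS before it is durable.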
**Load balancing, WebSockets & failover**

| Problem | Why it fails | Fix | Trade-offs |
|---|---|---|---|
| Round-robin without session stickiness | User's WebSocket may be routed to Server A, but a subsequent request (e.g. HTTP API) goes to Server B, which doesn't know the user's state. | Sticky sessions on the load balancer (IP hash or session cookie). | • Reduces cross-server state, but load can become uneven and affinity breaks when a server is drained or dies. |
| No graceful failover | If a server dies, its clients lose the socket and all in-flight edits. | Implement reconnection logic that re-joins the same document and replays any missed changes from the event bus (see the reconnection sketch after this table). | • Slightly more client logic. |
| Scaling the event bus | If you use Redis Pub/Sub, Redis single‑node becomes a bottleneck. | Use Redis Cluster or Kafka (with multiple partitions per topic). | • More infrastructure but scales horizontally. |
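For the failover row above, a browser-side sketch: the client remembers the last sequence number it applied and asks the server to replay anything newer after a reconnect. The URL, the `since` query parameter, and the message shape are assumptions for illustration.

```ts
// Runs in the React SPA. Assumes the server replays ops newer than `since`
// from the event bus / delta log when a client (re)joins a document.
let lastSeq = 0;
let retryMs = 500;

function connect(docId: string): void {
  const ws = new WebSocket(`wss://docs.example.com/ws/${docId}?since=${lastSeq}`);

  ws.onopen = () => {
    retryMs = 500; // reset the backoff after a successful connection
  };

  ws.onmessage = (evt) => {
    const op = JSON.parse(evt.data) as { seq: number; payload: unknown }; // assumed shape
    if (op.seq <= lastSeq) return; // drop duplicates delivered during replay
    applyOp(op.payload);
    lastSeq = op.seq;
  };

  ws.onclose = () => {
    // Exponential backoff with a cap, then re-join the same document.
    setTimeout(() => connect(docId), retryMs);
    retryMs = Math.min(retryMs * 2, 10_000);
  };
}

function applyOp(payload: unknown): void {
  console.debug("apply", payload); // placeholder: feed into the local CRDT/OT document
}

connect("doc-123"); // hypothetical document ID
```

Because the replay is driven by sequence numbers rather than by which server the client lands on, this works with or without sticky sessions.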
**Security, auth & caching**

| Problem | Why it fails | Fix | Trade-offs |
|---|---|---|---|
| JWT in localStorage | Vulnerable to XSS; a stolen token can be used to hijack a session. | Store the JWT in an HttpOnly, SameSite=Lax/Strict cookie (see the Express sketch after this table). <br>• Optionally rotate tokens or use short-lived access tokens + a refresh token in a secure cookie. | • Requires CSRF protection (same-site cookie). <br>• Slightly more round-trips for token refresh. |
| 24‑hour expiry | User may be logged out mid‑session. | Use refresh token flow with a 14‑day refresh token + 15‑minute access token. | • Adds refresh logic. |
| CDN caching API responses | End‑points that return document data could be cached for 5 min → stale content. | Mark real‑time API routes with Cache-Control: no-store or a very short TTL. | • Nothing extra; just set headers. |
| Missing rate limiting | Attackers can flood a document with edits. | Apply per‑user / per‑doc rate limits (e.g., 10 ops/sec). | • Adds overhead but protects the system. |
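A short Express sketch for the cookie and cache-control rows above; the route paths, cookie name, and token lifetime are assumptions, and issueAccessToken is a hypothetical helper.

```ts
import express from "express";

const app = express();

// After a successful login, put the JWT in an HttpOnly cookie instead of
// handing it to client-side JavaScript via localStorage.
app.post("/login", (_req, res) => {
  const accessToken = issueAccessToken(); // hypothetical helper, ~15-minute expiry
  res.cookie("access_token", accessToken, {
    httpOnly: true,     // not readable from JS, so XSS cannot exfiltrate it
    secure: true,       // HTTPS only
    sameSite: "strict", // CSRF mitigation
    maxAge: 15 * 60 * 1000,
  });
  res.sendStatus(204);
});

// Document/real-time API responses must never be cached by the CDN.
app.use("/api/docs", (_req, res, next) => {
  res.set("Cache-Control", "no-store");
  next();
});

function issueAccessToken(): string {
  return "signed.jwt.placeholder"; // sign and return a real short-lived JWT in practice
}

app.listen(3000);
```

The refresh-token row above would add a second, longer-lived HttpOnly cookie and a token-refresh endpoint, omitted here for brevity.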
**Observability & resilience**

| Problem | Why it fails | Fix | Trade-offs |
|---|---|---|---|
| No metrics | Hard to spot hot documents or slow DB writes. | Instrument WebSocket ops, DB write latency, queue lag, and Redis latency; expose them with Prometheus + Grafana (see the prom-client sketch after this table). | • Extra instrumentation code. |
| No alerting | You’ll only notice after a user reports. | Alert on high error rate, queue lag, DB connection exhaustion. | • Requires ops involvement. |
| No graceful degradation | If Redis or Pub/Sub goes down, all clients lose updates. | Keep a local in‑memory buffer and replay when the bus comes back. | • Slightly more code. |
| No transaction retries | DB write fails due to transient lock. | Use retry‑on‑deadlock logic in the API. | • Adds complexity but increases reliability. |
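A small prom-client sketch for the metrics row above; the metric names and buckets are assumptions.

```ts
import client from "prom-client";
import express from "express";

client.collectDefaultMetrics(); // CPU, memory, event-loop lag, GC, etc.

// How long it takes to persist a batch of document operations to PostgreSQL.
export const dbWriteSeconds = new client.Histogram({
  name: "doc_db_write_seconds",
  help: "Time to persist a batch of document operations",
  buckets: [0.005, 0.01, 0.05, 0.1, 0.5, 1],
});

// How many WebSocket clients are currently attached to this API server.
export const wsConnections = new client.Gauge({
  name: "doc_ws_connections",
  help: "Open WebSocket connections on this API server",
});

// Scrape endpoint for Prometheus.
const app = express();
app.get("/metrics", async (_req, res) => {
  res.set("Content-Type", client.register.contentType);
  res.send(await client.register.metrics());
});
app.listen(9100);
```

Wrap the batch flush with dbWriteSeconds.startTimer() and call wsConnections.inc()/dec() in the socket handlers; the alerting rules (error rate, queue lag, connection exhaustion) then live in Prometheus/Alertmanager rather than in application code.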
**Scaling-bottleneck summary**

| Bottleneck | Fix | Trade-offs |
|---|---|---|
| Per‑doc snapshot every 30 s | Store incremental diffs; only snapshot on major version or manually. | CPU for diff, complexity for replay. |
| Client‑clock based timestamps | Server‑issued operation IDs + Lamport timestamps. | Minimal extra data per message. |
| 2 s cross-server polling | Switch to LISTEN/NOTIFY in Postgres or logical replication to push changes (see the sketch after this table). | Requires a small change-push service; eliminates polling overhead. |
| High write contention on a single row | Use row‑level locking or CRDT that merges locally; push to DB only after a batch. | Latency for final persistence, but throughput increases. |
| Static assets being served by CDN | Good practice; no change needed. | Ensure API endpoints are excluded from CDN caching. |
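If you keep Postgres itself in the push path instead of adding a separate bus, the LISTEN/NOTIFY row above looks roughly like this with node-postgres; the channel name and payload shape are assumptions.

```ts
import { Client } from "pg";

// Dedicated connection for notifications -- a LISTEN-ing connection should not be
// returned to the general pool.
const listener = new Client();

export async function startChangeListener(
  fanOut: (change: { docId: string; seq: number }) => void
): Promise<void> {
  await listener.connect();
  await listener.query("LISTEN doc_changes");
  listener.on("notification", (msg) => {
    if (!msg.payload) return;
    fanOut(JSON.parse(msg.payload)); // e.g. {"docId":"...","seq":42}
  });
}

// The writer emits the notification in the same transaction as the insert, e.g.:
//   SELECT pg_notify('doc_changes', json_build_object('docId', $1, 'seq', $2)::text);
// NOTIFY payloads are limited to ~8 KB, so send identifiers and fetch the ops separately.
```

This removes the 2 s polling loop, at the cost of every API server holding one extra long-lived Postgres connection and all NOTIFY traffic funnelling through the primary.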
**Recommended stack**

| Layer | Recommendation | Reason |
|---|---|---|
| Event bus | Redis Cluster (Pub/Sub) or NATS | Fast, low‑latency, supports horizontal scaling. |
| Conflict resolution | CRDT library (e.g. Automerge, Yjs) | Already battle‑tested for collaborative editors. |
| DB | Citus (Postgres sharding) or partitioned Postgres | Keeps single‑node simplicity for small orgs, scales for large. |
| Auth | HttpOnly refresh token + short‑lived access token | XSS safe, still supports SPA. |
| WebSocket | Sticky session or client‑side reconnection to same server | Guarantees continuity. |
| Metrics | Prometheus + Grafana | Visibility into latency, error rates. |
| CDN | Cache-Control: no-store on real-time API routes | Avoids serving stale document data. |

The current design is a classic “single-box” real-time editor: it works for a handful of users but will break under real-world load for the reasons above. By replacing polling with a proper event bus, adopting a proven CRDT/OT engine, batching writes, and hardening auth/session handling, you'll get a system that scales to thousands of simultaneous collaborators while maintaining strong consistency and resilience.