Gpt 5.4 Pro's response to the 'Debug This Architecture' challenge.
Here's the architecture for a real-time collaborative document editor (like Google Docs):

**System Overview:**
- Frontend: React SPA with WebSocket connections
- Backend: Node.js API servers behind a load balancer (round-robin)
- Database: PostgreSQL for document storage, Redis for session cache
- Real-time: Each API server maintains its own WebSocket connections to clients
- Sync Strategy: Last-write-wins with timestamps from client clocks
- Storage: Documents saved as full HTML snapshots every 30 seconds
- Auth: JWT tokens with 24-hour expiry, stored in localStorage
- CDN: CloudFront for static assets, also caches API responses for 5 minutes

**Data Flow:**
1. User types → change event sent via WebSocket to their connected server
2. Server writes change to PostgreSQL
3. Server broadcasts change to all WebSocket clients connected to THAT server
4. Other servers poll PostgreSQL every 2 seconds for changes
5. Conflict resolution: if two users edit the same paragraph, last timestamp wins

**Scaling Plan:**
- Horizontal scaling by adding more API servers
- Database read replicas for read-heavy operations
- Document partitioning by organization ID

Identify all potential failure modes, race conditions, and scaling bottlenecks in this architecture. For each issue found, propose a specific solution with trade-offs.
This architecture will work for a demo, but it has major correctness, latency, and security problems for a Google-Docs-like editor.
**Correctness and conflict resolution**

| Issue | Failure mode / race | Specific fix | Trade-offs |
|---|---|---|---|
| Client timestamps for conflict resolution | Clock skew, incorrect device time, malicious clients sending future timestamps; later arrival may wrongly overwrite earlier causal edits | Use server-assigned monotonic document revisions or a single authoritative sequencer per document. For true collaborative editing, use OT or CRDT instead of LWW | More implementation complexity; single-writer-per-doc can create a hot-doc bottleneck |
| Last-write-wins at paragraph granularity | Two users edit different characters in the same paragraph and one loses all work | Use character/block-level operations with OT/CRDT (e.g. Yjs, ProseMirror collab, Automerge) | More metadata, more complex merge logic |
| No total ordering across servers | User on server A sees op1 then op2; user on server B sees op2 then op1 after polling; document diverges | Assign a per-document sequence number at one authority (doc owner/shard) and apply ops in sequence | Requires routing or coordination |
| DB commit order vs timestamp order | Two concurrent writes race in PostgreSQL; the transaction that commits last wins even if it has the “older” client timestamp | Use append-only ops + version check (expected_revision) or a sequencer; avoid blind overwrites of document state | More retry logic or ownership logic |
| Equal timestamps / timestamp collisions | Ties create nondeterministic winners | Don’t use timestamps for ordering; use sequence numbers | None, other than rework |
| Out-of-order delivery after polling | Clients on different servers receive changes late and in batches; applying naively can corrupt state | Use revisioned ops; buffer until missing revisions arrive; or move to pub/sub with ordering per doc | Slightly more state on client/server |
| Fetch/subscribe race | Client loads document snapshot, then opens WebSocket; edits committed between those steps are missed | Return snapshot with a revision number; WebSocket subscribe must say “start from revision N”; server replays N+1…current before live mode | Requires keeping recent op log |
| Duplicate delivery on reconnect/retry | Client resends an op after timeout; server applies it twice | Give every client op a UUID/idempotency key; dedupe per document | Dedupe state in memory/Redis/log |
| Lost local edits on reconnect | User types, network drops, app reconnects to a different server, pending ops vanish or get replayed wrong | Client keeps a pending op queue and resends unacked ops from last known revision | More client complexity |
| Offline edits clobber online edits | Offline user comes back with old base state; LWW overwrites newer edits | Use OT/CRDT or at least “op with base revision + server-side rebase/reject” | Rebase logic is nontrivial |
| Snapshot overwrite race | Background snapshot generated from older state may overwrite newer state if save isn’t versioned | Store snapshots with document revision and only commit them if based on the latest expected revision | More metadata; snapshot retries |
| HTML as the source of truth | HTML is non-canonical; same edit can serialize differently across browsers; formatting changes become hard to merge | Use a structured document model (ProseMirror JSON, Slate nodes, etc.) as source of truth; render HTML on read/export | Requires editor model migration |
| Structural edits break paragraph IDs | Splits/merges/lists make “same paragraph” ambiguous | Give blocks/nodes stable IDs and operate on those | Extra model complexity |
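Several rows above converge on the same remedy: replace client timestamps with server-assigned revisions, reject ops built on stale state, and deduplicate retries by op ID. A minimal sketch of such a per-document sequencer (all names and shapes here are hypothetical, not from the original spec):

```typescript
interface Op {
  opId: string;         // client-generated UUID, used for idempotent retries
  baseRevision: number; // revision the client's edit was based on
  payload: string;      // the actual edit (opaque here)
}

type AppendResult =
  | { status: "accepted"; revision: number }
  | { status: "duplicate"; revision: number }
  | { status: "stale"; currentRevision: number };

class DocSequencer {
  private revision = 0;
  private applied = new Map<string, number>(); // opId -> assigned revision
  private log: { revision: number; op: Op }[] = [];

  append(op: Op): AppendResult {
    const seen = this.applied.get(op.opId);
    if (seen !== undefined) {
      // Retry of an op we already applied: re-ack, apply nothing.
      return { status: "duplicate", revision: seen };
    }
    if (op.baseRevision !== this.revision) {
      // Client edited against old state; it must rebase/transform and retry.
      return { status: "stale", currentRevision: this.revision };
    }
    this.revision += 1;
    this.applied.set(op.opId, this.revision);
    this.log.push({ revision: this.revision, op });
    return { status: "accepted", revision: this.revision };
  }
}
```

Note that the `stale` path is where OT/CRDT machinery would plug in: instead of rejecting, the server would transform the op against the revisions the client has not yet seen.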
**Real-time propagation and the WebSocket layer**

| Issue | Failure mode / bottleneck | Specific fix | Trade-offs |
|---|---|---|---|
| Broadcast only to clients on the same server | Collaborators on other servers see edits up to 2s late; not acceptable for real-time editing | Introduce a cross-server fanout mechanism: Redis Pub/Sub, Redis Streams, NATS, Kafka, or a dedicated collaboration service | New infrastructure |
| Servers poll PostgreSQL every 2 seconds | High DB load, stale UX, bursty updates, poor tail latency | For small scale: Postgres LISTEN/NOTIFY. For production scale: Redis Streams / NATS / Kafka with per-doc topics or partitioning | LISTEN/NOTIFY is simple but limited; Streams/Kafka add ops burden |
| Polling by timestamp | Misses rows with same timestamp; skew breaks cursoring | Poll by monotonic revision/LSN/sequence, not timestamp | Requires schema changes |
| Round-robin LB spreads one document’s users across many servers | Every edit must cross servers; cross-node chatter grows with participants | Route by document ID affinity (consistent hashing or “doc owner” routing) so most collaborators on a doc hit the same collab shard | Harder rebalancing; hot docs still hot |
| No authoritative doc owner | Any server can accept writes for same doc; ordering becomes distributed and messy | Make each document have a single active owner/shard that sequences ops | Must handle owner failover correctly |
| Split-brain risk if using doc ownership | Two servers may think they own same doc during failover, causing duplicate writers | Use leases with fencing tokens via etcd/Consul/ZK; avoid weak ad-hoc locks | More infra complexity |
| Server crash between DB write and broadcast | Write committed, but some clients never hear about it until reconnect/poll | Use a transactional outbox or make the durable op log the source of truth and drive fanout from it | Extra table/consumer or event system |
| Server crash before DB write but after local optimistic UI | User believes edit was saved, but it was not | Client should optimistically render locally, but server must ack only after durable append; client retries unacked ops | More protocol complexity |
| Slow consumer problem | Mobile/slow clients accumulate huge outbound queues; server memory grows | Put bounds on per-connection send queues; if exceeded, drop connection and force snapshot+replay | Slow clients reconnect more often |
| No heartbeat / presence TTL | Dead connections linger; presence indicators wrong | Use WebSocket ping/pong, server-side TTLs, and presence in ephemeral store | Slight extra traffic |
| Rolling deploys / connection draining not handled | Massive reconnect storms, dropped edits during deploy | Support graceful drain, stop accepting new docs, ask clients to reconnect with last revision | More deployment logic |
| Per-keystroke messages | Message volume and per-message overhead explode under high typing rates | Coalesce keystrokes into ops every 20–50ms or use semantic editor ops | Slightly higher local latency, but usually imperceptible |
| Large paste / format-all operations | Huge WebSocket frames, event loop stalls, DB spikes | Chunk large ops, enforce limits, maybe treat as specialized bulk ops | More edge-case handling |
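The keystroke-coalescing row can be made concrete. A sketch of a client-side coalescer (hypothetical API; in production the flush would run on a 20–50ms timer, but it is explicit here so the merging logic is easy to follow):

```typescript
interface InsertOp { pos: number; text: string }

class KeystrokeCoalescer {
  private pending: InsertOp | null = null;
  private flushed: InsertOp[] = [];

  type(pos: number, text: string): void {
    if (this.pending && pos === this.pending.pos + this.pending.text.length) {
      // Contiguous typing: extend the pending op instead of creating a new one.
      this.pending.text += text;
    } else {
      // Cursor moved or first keystroke: ship what we have, start fresh.
      this.flush();
      this.pending = { pos, text };
    }
  }

  flush(): void {
    if (this.pending) {
      this.flushed.push(this.pending); // stand-in for ws.send(JSON.stringify(op))
      this.pending = null;
    }
  }

  sentOps(): InsertOp[] {
    return this.flushed;
  }
}
```

Typing "h", "i", "!" at consecutive positions produces one op rather than three messages; a jump to another position starts a new op.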
**Database and storage**

| Issue | Failure mode / bottleneck | Specific fix | Trade-offs |
|---|---|---|---|
| Write every change to PostgreSQL | Primary becomes the bottleneck; high fsync/WAL/index churn; p99 latency hurts typing UX | Use an append-only operation log, ideally with batching; snapshot current state periodically rather than rewriting full state per keystroke | More moving parts |
| If updates are full-document or full-paragraph writes | Row lock contention, TOAST churn, large WAL, poor vacuum behavior | Store small ops and periodic snapshots; avoid whole-document overwrite per keystroke | Requires new data model |
| Full HTML snapshots every 30s | Large writes, expensive replication, poor diffing, possible 30s recovery gaps depending on exact implementation | Snapshot every N ops or on idle, store with revision, compress; large snapshots can go to object storage with metadata in Postgres | Slightly more complex restore path |
| Ambiguous durability model | The spec says “write change to PostgreSQL” and also “save full HTML every 30s”; if snapshots are the only durable state, up to 30s of edits can vanish | Be explicit: durable op append on each accepted edit, snapshots only for recovery speed | More storage |
| Hot documents create hot rows/partitions | A single active doc overloads one DB row/table partition | Use in-memory doc actor + op log, not direct row mutation. For very large docs, consider block/subtree partitioning | Cross-block edits become more complex |
| Read replicas for active documents | Replica lag serves stale snapshots; reconnecting client may load old state then apply wrong ops | For active docs, use primary or revision-aware fetch+replay; use replicas only for history/search/analytics | Less read offload |
| Large snapshots worsen replica lag | Replication lag grows exactly when collaboration is busiest | Reduce snapshot size/frequency; offload snapshots to object storage | Recovery can be slower |
| Polling DB from every server | Thundering herd against Postgres | Move real-time propagation off the DB | Extra infra |
| Connection pool exhaustion | Many API servers + WS write paths exhaust DB connections | Separate HTTP from collab workers; use small pooled DB writer layer / async persistence | More architecture |
| Org-ID partitioning is skew-prone | One large organization becomes one hot shard; “hot org” or “hot doc in one org” still melts one partition | Shard by document ID (or virtual shards), not just org ID. Keep org as a query dimension, not primary shard key | Cross-org/tenant queries become harder |
| Horizontal API scale doesn’t help the primary DB | More app servers produce more writes against the same bottleneck | Treat collaboration as a stateful, sharded service, not just more stateless API boxes | Bigger redesign |
| Redis as shared session/cache layer | If Redis is single-node or has eviction, auth/presence/fanout can fail unpredictably | Use HA Redis; separate session/auth from ephemeral presence/pubsub; disable eviction for critical keys | Higher cost |
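The "shard by document ID via virtual shards" fix can be sketched as follows (hash function, shard count, and naming are all illustrative assumptions): every node hashes the document ID into one of many virtual shards, then maps virtual shards onto the smaller, changeable set of collaboration servers. Rebalancing moves virtual shards, not individual documents.

```typescript
const VIRTUAL_SHARDS = 1024;

// Small deterministic string hash (FNV-1a); a production system would
// likely use a stronger hash and true consistent hashing.
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

function virtualShard(docId: string): number {
  return fnv1a(docId) % VIRTUAL_SHARDS;
}

function ownerFor(docId: string, servers: string[]): string {
  // Every node computes the same mapping, so any node can route a
  // WebSocket or op to the document's single authoritative owner.
  return servers[virtualShard(docId) % servers.length];
}
```

Because routing is by document rather than by organization, one huge tenant spreads across many shards instead of melting a single partition.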
**Security and caching**

| Issue | Failure mode | Specific fix | Trade-offs |
|---|---|---|---|
| JWT in localStorage | Any XSS steals the token; rich-text editors have large XSS surface | Use short-lived access token in memory + HttpOnly Secure SameSite refresh cookie; strong CSP and Trusted Types | More auth complexity; cookie flows need CSRF consideration |
| 24-hour JWT lifetime | Stolen token remains valid a long time | Shorten access token TTL (e.g. 5–15 min), rotate refresh tokens, support revocation/session versioning | More refresh traffic |
| JWT + Redis “session cache” mixed model | Confusing source of truth; revocations may not apply immediately | Pick a clear model: short-lived JWT + server-side session/refresh is common | Slightly less stateless |
| Permissions can change while WS stays open | User removed from doc/org can keep editing until token expiry | On doc join, check authorization; also push revocation events and disconnect affected sockets | More auth checks / eventing |
| Token expiry during WebSocket session | Long-lived socket stays authenticated forever unless server re-checks | Require periodic reauth or close socket at token expiry and reconnect with fresh token | Some reconnect churn |
| CloudFront caches API responses for 5 minutes | Users see stale docs; worse, private doc responses may leak if cache key is wrong | Cache only static assets at CDN. Mark doc/auth APIs Cache-Control: no-store, private; never cache personalized document GETs unless extremely carefully keyed | Higher origin load |
| Cached auth/permission responses | User still sees access after revoke or gets stale 403 | Don’t CDN-cache auth-sensitive APIs | Same as above |
| Raw HTML in collaborative docs | Stored XSS, reflected XSS, token theft, account compromise | Use a structured doc model, sanitize pasted/imported HTML, sanitize render/export path | Sanitization costs CPU and may strip some content |
| Abuse / flooding | One client can spam edits and DoS server/DB | Rate-limit per user/document/IP; cap message size and frequency | Must avoid harming legitimate bulk paste/editing |
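The rate-limiting row is usually implemented as a per-connection token bucket. A deterministic sketch (capacity and refill rate are hypothetical tuning parameters; time is injected so the logic is testable):

```typescript
class TokenBucket {
  private tokens: number;
  private lastMs: number;

  constructor(
    private capacity: number,      // maximum burst of ops
    private refillPerSec: number,  // sustained ops/second allowed
    nowMs: number,
  ) {
    this.tokens = capacity;
    this.lastMs = nowMs;
  }

  allow(nowMs: number): boolean {
    const elapsed = (nowMs - this.lastMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillPerSec);
    this.lastMs = nowMs;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false; // caller should nack the op or drop the connection
  }
}
```

Keying buckets per user *and* per document catches both a single flooding client and a botnet hammering one document; a generous burst capacity avoids penalizing legitimate large pastes.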
**Node.js and operational concerns**

| Issue | Failure mode | Specific fix | Trade-offs |
|---|---|---|---|
| Node.js single event loop per server | Large snapshots, JSON parsing, or one hot room can stall all sockets on that instance | Isolate collaboration into its own service/processes; use worker threads for heavy tasks | More services / ops |
| WebSocket connection imbalance | Round-robin at connect time doesn’t reflect active room load; one server gets hot docs | Balance by document ownership, not just connection count | Needs routing layer |
| Memory growth from room state + send buffers | Many active docs and slow clients can OOM a node | Bounded room state, bounded send queues, room eviction, snapshot+replay | More complexity |
| Protocol incompatibility during deploys | New servers send op formats old clients can’t apply | Version your protocol and maintain a compatibility window | Slower rollout cleanup |
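The bounded-send-queue fix from the memory-growth row can be sketched like this (hypothetical API): once a slow consumer's backlog exceeds the limit, the connection is marked dead and its buffer released, forcing that client to reconnect via snapshot + replay instead of letting server memory grow without bound.

```typescript
class BoundedSendQueue {
  private queue: string[] = [];
  public dead = false;

  constructor(private maxQueued: number) {}

  enqueue(msg: string): boolean {
    if (this.dead) return false;
    if (this.queue.length >= this.maxQueued) {
      // Slow consumer: drop the connection rather than buffer forever.
      this.dead = true;
      this.queue = [];
      return false;
    }
    this.queue.push(msg);
    return true;
  }

  drain(n: number): string[] {
    return this.queue.splice(0, n); // simulate the socket catching up
  }
}
```

In a real server, marking the queue dead would also close the underlying WebSocket; the reconnecting client then loads a fresh snapshot and replays from its last acked revision rather than receiving the dropped backlog.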
**Scaling plan assessment**

- *Horizontal scaling by adding more API servers.* This does not solve the main problems: op ordering, cross-server fanout, and write pressure on the primary database all remain. Better approach: split into a stateless HTTP/API tier and a stateful, document-sharded collaboration service.
- *Database read replicas.* Helpful for: history, search, analytics, and other lag-tolerant reads. Not helpful for: active documents, where replica lag serves stale state to reconnecting clients.
- *Document partitioning by organization ID.* Good for tenant isolation, bad for load balance if one org is huge. Collaboration hotspots are usually by document, not org.
A practical production design looks like this:

1. **Client fetches document snapshot + revision** — e.g. the snapshot comes back tagged `docRevision = 18427`.
2. **Client opens WebSocket to collaboration service** — and sends `subscribe(docId, fromRevision=18427)` so no ops are lost in the gap between the snapshot fetch and the subscription.
3. **Collaboration owner is authoritative for that doc** — a single owner/shard sequences every op for the document.
4. **Each accepted op is durably appended** — to the per-document op log before the client receives an ack.
5. **Fanout comes from the op stream** — not from servers polling the database.
6. **Snapshots are periodic optimization** — they speed up loads and recovery; the op log remains the source of truth.
7. **Security** — short-lived access tokens held in memory, HttpOnly refresh cookies, and no CDN caching of document or auth APIs.
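The snapshot + subscribe-from-revision handshake described above can be sketched with a revision-keyed op log (names and shapes are illustrative): a client that loaded a snapshot at revision N asks the server for everything after N before going live.

```typescript
interface LoggedOp { revision: number; payload: string }

class CollabDoc {
  private log: LoggedOp[] = [];
  private revision = 0;

  // Server-side: assign the next revision and append durably (in-memory here).
  appendOp(payload: string): number {
    this.revision += 1;
    this.log.push({ revision: this.revision, payload });
    return this.revision;
  }

  // On subscribe(docId, fromRevision=N): replay every op the client missed,
  // then switch the connection to live fanout.
  subscribeFrom(fromRevision: number): LoggedOp[] {
    return this.log.filter((op) => op.revision > fromRevision);
  }
}
```

A real implementation would cap the retained log and fall back to "send a fresh snapshot" when `fromRevision` is older than the log's horizon.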
If you don’t want a full redesign immediately, do these first:

1. Stop CDN-caching API responses; mark document and auth endpoints `Cache-Control: no-store, private`.
2. Replace client-clock timestamps with server-assigned per-document revisions.
3. Replace the 2-second PostgreSQL polling with cross-server pub/sub (Redis Pub/Sub at minimum).
4. Move JWTs out of localStorage: short-lived access token in memory plus an HttpOnly refresh cookie.
5. Give every client op an idempotency key and dedupe on the server.
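The cheapest of these fixes is the cache policy. A sketch of a per-route-class `Cache-Control` helper (the route classification is a hypothetical convention, not from the original spec):

```typescript
type RouteClass = "static-asset" | "document-api" | "auth-api";

function cacheControlFor(route: RouteClass): string {
  switch (route) {
    case "static-asset":
      // Fingerprinted assets are safe to cache aggressively at the CDN.
      return "public, max-age=31536000, immutable";
    case "document-api":
    case "auth-api":
      // Personalized/authorized responses: never cache in any shared store.
      return "no-store, private";
  }
}
```

Applied as middleware, this guarantees a private document response can never be served from CloudFront to the wrong user, at the cost of all document reads hitting the origin.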
If you want, I can also turn this into: