Here’s a structured teardown of the architecture: failure modes, race conditions, correctness issues, security problems, and scaling bottlenecks — plus concrete fixes and trade-offs.
High-risk architectural problems
1. WebSocket state is local to each API server
Problem
Each API server only broadcasts to clients connected to itself. Clients connected to different servers won’t see updates until those servers poll PostgreSQL 2 seconds later.
Failure modes
- Users on different servers see inconsistent document state for up to the polling interval, or longer.
- Edits may appear out of order across servers.
- “Realtime” becomes eventually consistent.
- If polling fails or lags, some users stop seeing updates entirely.
- Reconnect to a different server may show stale state.
Solution
Use a shared realtime fan-out layer:
- Redis Pub/Sub
- NATS
- Kafka
- dedicated collaboration service with document-room ownership
Each server publishes incoming operations to a shared channel keyed by document ID, and all servers subscribed to that document broadcast immediately to their local WebSocket clients.
Trade-offs
- Redis Pub/Sub: simple, low latency, but messages are ephemeral and can be lost during subscriber disconnects.
- Kafka/NATS JetStream: durable and replayable, but more operational complexity.
- Single “document owner” process/shard: easier ordering, but requires routing logic and failover handling.
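A minimal in-process sketch of the fan-out pattern. The `Hub` class stands in for Redis Pub/Sub (or a NATS subject) keyed by document ID; the class and method names are illustrative, not a real API:

```python
from collections import defaultdict

class Hub:
    """Stands in for a shared pub/sub bus: one channel per document ID."""
    def __init__(self):
        self.subscribers = defaultdict(list)  # doc_id -> [callback]

    def subscribe(self, doc_id, callback):
        self.subscribers[doc_id].append(callback)

    def publish(self, doc_id, op):
        for cb in list(self.subscribers[doc_id]):
            cb(op)

class GatewayServer:
    """One API server: relays bus messages to its local WebSocket clients."""
    def __init__(self, hub):
        self.hub = hub
        self.local_clients = {}  # doc_id -> [client inboxes]

    def join(self, doc_id, client_inbox):
        if doc_id not in self.local_clients:
            # First local client for this doc: subscribe once per server.
            self.local_clients[doc_id] = []
            self.hub.subscribe(doc_id, lambda op, d=doc_id: self._deliver(d, op))
        self.local_clients[doc_id].append(client_inbox)

    def _deliver(self, doc_id, op):
        for inbox in self.local_clients[doc_id]:
            inbox.append(op)

    def receive_from_client(self, doc_id, op):
        # Publish to the shared channel; ALL servers (including this one)
        # deliver to their local clients — no 2-second polling gap.
        self.hub.publish(doc_id, op)
```

With two servers subscribed to the same document, an op received by either one reaches clients on both immediately, which is exactly the property local-only broadcast lacks.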
2. Polling PostgreSQL every 2 seconds for changes
Problem
Using the primary database as a synchronization bus is a bad fit.
Failure modes
- High DB load from polling across many servers/documents.
- 2-second latency destroys collaborative editing UX.
- Servers can miss changes depending on polling query design.
- Race conditions if polling reads partial write sets.
- Poll storms at scale.
- Read replicas may lag, causing stale updates.
Solution
Stop polling PostgreSQL for realtime sync. Use:
- event bus for realtime propagation
- PostgreSQL only for persistence
- optional logical append-only operation log for recovery
Trade-offs
- Adds infrastructure.
- Requires thinking in event streams rather than DB polling.
- But greatly improves latency and scalability.
3. Last-write-wins using client timestamps
Problem
This is one of the most dangerous design choices.
Failure modes
- Client clocks are wrong or malicious.
- User changes can overwrite newer edits because of skew.
- Two users edit same area: one loses work arbitrarily.
- Offline clients reconnect with old edits that carry “future” timestamps and overwrite newer work.
- Timezone/system clock bugs create impossible ordering.
- Attackers can set huge future timestamps and win all conflicts.
Solution
Do not use client time for conflict resolution.
Use one of:
- OT (Operational Transformation) — classic Google Docs style
- CRDTs — strong eventual consistency without central transform
- At minimum: server-assigned monotonic sequence numbers per document
For rich text collaborative editing, OT or CRDT is the right answer.
Trade-offs
- OT: efficient and battle-tested, but complex to implement correctly.
- CRDT: easier to reason about distributed/offline editing, but can increase memory/storage and implementation complexity for rich text.
- Server sequencing only: better than client timestamps, but still inadequate for concurrent text edits without transformation/merge semantics.
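The minimum-viable option can be sketched in a few lines. This is a toy illustration of server-assigned per-document sequencing (class name and field names are made up), not a substitute for OT/CRDT merge semantics:

```python
import itertools
from collections import defaultdict

class DocumentSequencer:
    """Assigns a strictly increasing sequence number per document.
    Client timestamps are retained only as untrusted metadata."""
    def __init__(self):
        # defaultdict(itertools.count) gives each doc its own 0-based counter.
        self._counters = defaultdict(itertools.count)

    def accept(self, doc_id, op):
        seq = next(self._counters[doc_id])
        # Server-assigned order is authoritative; the client clock never
        # participates in conflict resolution.
        return {**op, "seq": seq, "client_ts_untrusted": op.get("ts")}
```

Ordering is now decided at a single point the client cannot influence, which closes the clock-skew and forged-timestamp attacks even before a transform/merge layer is added.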
4. Full HTML snapshots every 30 seconds
Problem
Saving full HTML snapshots is expensive and unsafe as the primary source of truth.
Failure modes
- Large write amplification.
- Data loss: up to 30 seconds of edits if a server crashes before snapshot.
- HTML is presentation state, not ideal operational state.
- Hard to merge concurrent edits.
- Serialization inconsistency between clients.
- Rich text HTML can contain non-semantic noise, causing diff churn.
- Snapshots become huge for big docs.
Solution
Store:
- operation log / change log as source of truth
- periodic compacted snapshots/checkpoints for recovery
- canonical document model (e.g. ProseMirror JSON, Slate JSON, Quill Delta, custom AST), not raw HTML
Then derive HTML for rendering/export.
Trade-offs
- More implementation work.
- Need compaction and replay logic.
- But correctness, auditability, and recovery improve dramatically.
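The op-log-plus-checkpoint idea in miniature. The storage here is in-memory and `apply_op` is a toy reducer over a plain string (a real system would persist the log and operate on a structured model like a ProseMirror document):

```python
def apply_op(text, op):
    """Toy reducer: applies one insert/delete op to a plain string."""
    if op["type"] == "insert":
        return text[:op["pos"]] + op["text"] + text[op["pos"]:]
    if op["type"] == "delete":
        return text[:op["pos"]] + text[op["pos"] + op["len"]:]
    return text

class DocStore:
    """Op log as source of truth, with periodic snapshot checkpoints."""
    def __init__(self):
        self.ops = []           # append-only log (durable in production)
        self.snapshot = ""      # last checkpoint
        self.snapshot_seq = 0   # number of ops already folded into it

    def append(self, op):
        self.ops.append(op)

    def materialize(self):
        # Current state = last snapshot + replay of the op tail.
        text = self.snapshot
        for op in self.ops[self.snapshot_seq:]:
            text = apply_op(text, op)
        return text

    def checkpoint(self):
        # Compaction: fold the tail into a snapshot; older ops can be
        # archived rather than replayed on every read.
        self.snapshot = self.materialize()
        self.snapshot_seq = len(self.ops)
```

Snapshots become a read/recovery optimization instead of the source of truth, so a crash between checkpoints loses nothing that was appended to the log.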
5. No global ordering of edits
Problem
If edits arrive at different servers, there is no authoritative ordering before persistence and rebroadcast.
Failure modes
- Different users apply edits in different orders and diverge.
- Overlapping edits produce non-deterministic results.
- Duplicate updates if polling and local broadcasts overlap.
- Reordering due to network jitter.
Solution
Create per-document ordering:
- assign a document to a logical sequencer/room leader/shard
- or use a partitioned log by document ID
- all ops for a given document go through one ordered stream
Trade-offs
- Single-writer per document simplifies correctness.
- But introduces hotspot risk for highly active documents.
- Need shard rebalancing and failover.
Correctness and concurrency issues
6. Simultaneous edits to same paragraph with LWW
Problem
Paragraph-level overwrite loses intent. Two users changing different words in the same paragraph will conflict unnecessarily.
Failure modes
- Silent data loss.
- Cursor jumps and flicker.
- User distrust because edits disappear.
- Non-overlapping changes still collide.
Solution
Move from paragraph-level overwrite to operation-level editing:
- insert/delete/format operations at character/range granularity
- use OT/CRDT
- preserve intent where possible
Trade-offs
- More complex than paragraph blobs.
- Requires editor model integration.
7. Duplicate application of changes
Problem
A change may be:
- applied locally optimistically
- persisted
- rebroadcast locally
- later observed again via DB poll
Without idempotency, clients can apply the same change twice.
Failure modes
- Repeated text insertion/deletion
- Formatting duplicated
- Client state corruption
Solution
Every operation needs:
- globally unique op ID
- document version or parent version/vector
- idempotent apply logic
- dedup cache on client and server
Trade-offs
- More metadata and bookkeeping.
- Essential for correctness.
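The core of the fix fits in a few lines. A sketch of idempotent apply keyed by op ID (in production the `seen` set would be bounded with an LRU or TTL):

```python
class IdempotentApplier:
    """Applies each operation at most once, keyed by a globally unique op ID."""
    def __init__(self):
        self.seen = set()   # dedup cache; bound with TTL/LRU in production
        self.state = []

    def apply(self, op):
        if op["op_id"] in self.seen:
            return False    # duplicate from poll/rebroadcast overlap: ignore
        self.seen.add(op["op_id"])
        self.state.append(op["payload"])
        return True
```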
8. Lost updates during reconnect
Problem
If a client disconnects briefly, it may miss operations sent while offline.
Failure modes
- Reconnected client resumes from stale state.
- Local unsent edits replay against wrong base.
- Divergence between users.
Solution
Use resumable streams:
- client tracks last acknowledged server op/version
- on reconnect, asks for missed ops since version N
- if too far behind, server sends fresh snapshot + subsequent ops
Trade-offs
- Need op retention or durable event log.
- Slightly more state on server/client.
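The server side of the resume flow, sketched with an in-memory log and an arbitrary threshold (the constant and the snapshot representation are illustrative):

```python
SNAPSHOT_THRESHOLD = 100  # beyond this gap, replay costs more than a snapshot

class ResumableStream:
    """Replays ops since the client's last acked version, or falls back
    to a fresh snapshot + tail if the client is too far behind."""
    def __init__(self):
        self.log = []  # op at index i has version i + 1

    def append(self, op):
        self.log.append(op)
        return len(self.log)  # version assigned to this op

    def resume(self, last_acked_version):
        missed = len(self.log) - last_acked_version
        if missed > SNAPSHOT_THRESHOLD:
            # Toy "snapshot": in reality, send materialized state + tail ops.
            return {"type": "snapshot", "ops": list(self.log)}
        return {"type": "replay", "ops": self.log[last_acked_version:]}
```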
9. No acknowledgment protocol
Problem
A WebSocket send does not imply the client processed the message.
Failure modes
- Server thinks update delivered, but client dropped/reloaded.
- Client thinks operation succeeded, but server didn’t persist.
- Ambiguous state after transient network issues.
Solution
Implement explicit protocol:
- client op submission with op ID
- server ack when durably accepted
- downstream ops include sequence/version
- client ack of applied sequence optional for resume/backpressure
Trade-offs
- More protocol complexity.
- Much better recovery semantics.
10. Race between DB write and broadcast
Problem
Sequence described is:
- receive change
- write to PostgreSQL
- broadcast to local clients
What if broadcast succeeds but DB write fails? Or DB succeeds and broadcast fails?
Failure modes
- Clients see edits that are never persisted.
- Persisted edits not visible to some users.
- Servers recover inconsistently.
Solution
Define a transactional ingestion path:
- accept op
- assign sequence number
- durably append to op log
- then broadcast from committed stream
If using event log, broadcast consumers only emit committed events.
Trade-offs
- Slightly higher latency.
- Much stronger consistency.
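The ingestion order can be made explicit in code. A sketch of the commit-then-broadcast path (the `IngestPipeline` name is invented, and a Python list stands in for a durable log):

```python
class IngestPipeline:
    """Accept -> sequence -> durable append -> broadcast, in that order.
    If the append raises, nothing is broadcast, so clients never see
    an edit that was not persisted."""
    def __init__(self, log, broadcast):
        self.log = log              # durable append-only store (list here)
        self.broadcast = broadcast  # callable invoked only after commit

    def ingest(self, op):
        seq = len(self.log) + 1
        committed = {**op, "seq": seq}
        self.log.append(committed)  # durability point (fsync/quorum in prod)
        self.broadcast(committed)   # emit only from the committed stream
        return seq
```

The inverse order (broadcast, then write) is what creates phantom edits; keeping the broadcast strictly downstream of the commit is the whole fix.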
11. Read replicas for collaborative reads
Problem
Read replicas are often asynchronously replicated.
Failure modes
- User loads a document and misses recent edits.
- Metadata/version checks stale.
- Reconnect against a lagging replica causes a rollback effect.
Solution
For collaboration-critical reads:
- use primary or strongly consistent document leader shard
- use replicas only for analytics/search/history/export
- optionally use “read-your-writes” routing based on session/document
Trade-offs
- More load on primary.
- Better correctness.
12. Partitioning by organization ID
Problem
Document collaboration hotspots are by document, not org. Organization-based partitioning can create skew.
Failure modes
- One large enterprise org becomes a hotspot.
- Many active docs in one org overload same partition.
- Cross-org balancing poor.
Solution
Partition by document ID or hashed document ID.
Optionally colocate metadata by org for admin queries, but realtime doc processing should shard by doc.
Trade-offs
- Org-level queries may become more expensive.
- Much better write distribution.
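Shard assignment by hashed document ID is a one-liner. A sketch (function name is illustrative; any stable hash works, though avoid Python's built-in `hash`, which is randomized per process):

```python
import hashlib

def shard_for(doc_id: str, num_shards: int) -> int:
    """Stable shard assignment by hashed document ID, not organization ID,
    so one busy org cannot concentrate load on a single partition."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```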
Availability and failover issues
13. Load balancer round-robin for WebSockets
Problem
Round-robin without session affinity means reconnects can land on any server, which is fine only if backend state is properly shared. In the current design it is not.
Failure modes
- Reconnect causes user to miss local in-memory state.
- Presence/cursors/sessions split across servers.
- Sticky-session dependence makes scaling/failover fragile.
Solution
Either:
- use stateless WebSocket servers backed by shared message bus and resumable state, or
- route by document ID to a collaboration shard/owner
Avoid depending on sticky sessions for correctness.
Trade-offs
- Stateless shared-bus design is simpler operationally.
- Routed ownership gives stronger ordering but requires smart LB/service discovery.
14. Server crash loses in-memory session/realtime state
Problem
Each server holds active WebSocket connections and maybe ephemeral session/presence info.
Failure modes
- Users connected to crashed server disconnect.
- Presence/cursor state disappears.
- Unsaved in-memory edits may be lost if not durably accepted.
- Other servers may not know who is editing.
Solution
- Keep only transient connection state in-process
- Persist presence/ephemeral state in Redis with TTL if needed
- Ensure ops are durably written before ack
- Clients auto-reconnect and resync from last acked version
Trade-offs
- Redis presence introduces extra writes.
- Better crash recovery.
15. No mention of backpressure or slow consumers
Problem
Some clients or servers will be slow.
Failure modes
- WebSocket buffers grow unbounded.
- One huge document floods all clients.
- Server memory bloat and event loop stalls.
- Broadcast loops block timely processing.
Solution
Implement backpressure:
- bounded outbound queues per client
- drop or coalesce non-essential events (e.g. cursor positions)
- disconnect clients that fall too far behind and force resync
- separate critical document ops from ephemeral presence events
Trade-offs
- Slow clients may be kicked more often.
- Protects system health.
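A sketch of a bounded per-client outbox combining all three tactics: coalesce ephemeral cursor events, cap critical ops, and flag the client for forced resync on overflow (the class and the tiny queue limit are illustrative):

```python
from collections import deque

MAX_QUEUE = 4  # deliberately tiny for illustration

class ClientOutbox:
    """Bounded per-client send queue with coalescing for ephemeral events."""
    def __init__(self):
        self.queue = deque()
        self.needs_resync = False

    def push(self, msg):
        if msg["type"] == "cursor":
            # Ephemeral: keep only the latest cursor position, never queue.
            self.queue = deque(m for m in self.queue if m["type"] != "cursor")
            self.queue.append(msg)
        elif len(self.queue) >= MAX_QUEUE:
            # Too far behind on critical ops: disconnect and force resync
            # rather than buffering without bound.
            self.needs_resync = True
        else:
            self.queue.append(msg)
```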
16. Hot documents
Problem
A popular doc with hundreds/thousands of editors creates a concentrated hotspot.
Failure modes
- Single shard/server overload.
- Fan-out becomes dominant cost.
- CPU spent on transformation/serialization.
- Large presence state and cursor spam.
Solution
For hot docs:
- dedicated collaboration shard per hot document
- hierarchical fan-out
- rate-limit presence/cursor updates
- batch operations where possible
- use binary protocol / compression
- separate editors from viewers
Trade-offs
- More specialized logic.
- Needed for extreme scale.
Security issues
17. JWTs in localStorage
Problem
localStorage is vulnerable to token theft via XSS.
Failure modes
- Any XSS gives attackers long-lived account takeover.
- 24-hour token lifetime increases blast radius.
Solution
Use:
- HttpOnly, Secure, SameSite cookies for session/refresh token
- short-lived access tokens
- rotating refresh tokens
- CSP and strong XSS defenses
Trade-offs
- More auth complexity, CSRF considerations if using cookies.
- Major security improvement.
18. JWT 24-hour expiry
Problem
Long-lived bearer tokens are risky, especially for collaborative apps used in browsers.
Failure modes
- Stolen token valid all day.
- Revocation difficult.
- User role changes delayed.
Solution
- short-lived access token (5–15 min)
- refresh token rotation
- token revocation/versioning
- WebSocket auth revalidation on reconnect and periodically
Trade-offs
- More auth flows.
- Better security and revocation.
19. CloudFront caches API responses for 5 minutes
Problem
Caching API responses broadly is dangerous for auth, document freshness, and privacy.
Failure modes
- User sees stale document content or metadata.
- One user’s personalized response could be cached and leaked if cache keys/headers are wrong.
- Auth/permission changes delayed.
- Collaboration state appears inconsistent.
Solution
Do not CDN-cache mutable authenticated document APIs unless very carefully controlled.
- Cache only static assets
- For APIs, use Cache-Control: no-store/private for sensitive dynamic content
- If caching some public metadata, use explicit cache keys and short TTLs
- Consider edge caching only for immutable versioned exports
Trade-offs
- Higher origin load.
- Correctness and privacy preserved.
20. Client timestamps are trust boundary violation
Problem
Clients are untrusted.
Failure modes
- Malicious conflict wins
- replay attacks with manipulated timestamps
- fabricated ordering
Solution
Server-authoritative sequencing and validation.
Trade-offs
- Essentially none: server-side sequencing is cheap and removes an entire class of attacks.
Data integrity and persistence issues
21. Writing every keystroke directly to PostgreSQL
Problem
If every edit event hits PostgreSQL synchronously, write amplification will be severe.
Failure modes
- DB becomes bottleneck quickly.
- transaction overhead dominates.
- lock/contention on hot docs.
- spikes from typing bursts.
Solution
Options:
- append operations to a log store/broker and asynchronously persist checkpoints
- batch/coalesce operations over small windows (e.g. 50–200 ms)
- maintain in-memory doc state on document leader and flush op batches
Trade-offs
- Batching adds slight latency and more complicated failure handling.
- Direct sync writes are simpler but won’t scale.
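A sketch of the batching option. Here the flush window is driven explicitly by `tick()` instead of a real timer, so the coalescing logic stays testable; in production `tick()` would fire every 50–200 ms:

```python
class WriteBatcher:
    """Coalesces per-keystroke ops into one flush per window, turning N
    tiny transactions into a single batched write (e.g. one INSERT of
    an op array)."""
    def __init__(self, flush):
        self.pending = []
        self.flush = flush   # persists one batch

    def add(self, op):
        self.pending.append(op)

    def tick(self):
        # End of window: hand off whatever accumulated, exactly once.
        if self.pending:
            batch, self.pending = self.pending, []
            self.flush(batch)
```

A durability caveat worth noting: ops sitting in `pending` are not yet persisted, so acks to clients must wait for the flush (or the batcher must sit behind a durable log).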
22. PostgreSQL row contention for hot documents
Problem
If a single document row is frequently updated, MVCC churn and row contention become painful.
Failure modes
- vacuum pressure
- bloated rows/TOAST data
- lock waits
- degraded write throughput
Solution
Use append-only operations table/log instead of repeatedly rewriting one giant document row.
Checkpoint periodically into snapshots.
Trade-offs
- Read path requires replay/checkpoints.
- Much better write scalability.
23. HTML as canonical format
Problem
HTML from browser/editor is not a stable canonical model.
Failure modes
- Browser/editor differences
- non-semantic markup noise
- formatting glitches on merge
- XSS risks if unsanitized content stored/rendered
Solution
Canonical structured editor model + strict sanitization for imported/exported HTML.
Trade-offs
- Need schema and conversion logic.
- Essential for robust rich text collaboration.
24. Snapshot interval may lose acknowledged edits
Problem
If edits are acknowledged before durable persistence and only snapshots happen every 30s, crash can lose “saved” work.
Solution
Durable operation append before ack. Snapshot only for compaction, not durability.
Trade-offs
- Slightly more ingestion complexity.
Networking and protocol issues
25. No ordering guarantee over multiple network paths
Problem
Clients may receive:
- optimistic local op
- remote transformed ops
- delayed poll-based updates
in inconsistent order.
Failure modes
- undo stack corruption
- cursor position mismatch
- content flicker
Solution
Version every op and require ordered apply.
Buffer out-of-order messages until missing versions arrive or trigger resync.
Trade-offs
- Client complexity.
- Necessary for deterministic state.
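The buffering logic on the client side, sketched minimally (a real client would also time out on a persistent gap and request a resync, as noted above):

```python
class OrderedApplier:
    """Applies ops strictly in version order; out-of-order arrivals are
    buffered until the missing versions fill the gap."""
    def __init__(self):
        self.next_version = 1
        self.buffer = {}     # version -> op, held until contiguous
        self.applied = []

    def receive(self, version, op):
        self.buffer[version] = op
        # Drain everything that is now contiguous with the applied prefix.
        while self.next_version in self.buffer:
            self.applied.append(self.buffer.pop(self.next_version))
            self.next_version += 1
```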
26. No mention of heartbeats/ping-pong
Problem
WebSockets may appear connected while dead due to proxies/NATs.
Failure modes
- Ghost users/presence
- server keeps stale connections
- clients think they are connected but are not receiving updates
Solution
Heartbeat protocol with timeout-based disconnect and reconnect.
Trade-offs
- Minor bandwidth and timer overhead; eliminates ghost connections and stale presence.
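The server side of liveness tracking, sketched with illustrative constants (intervals are a tuning choice; ~3 missed heartbeats before reaping is a common rule of thumb, not a standard):

```python
HEARTBEAT_INTERVAL = 30.0   # how often clients are pinged (seconds)
LIVENESS_TIMEOUT = 90.0     # ~3 missed heartbeats => treat as dead

class ConnectionMonitor:
    """Tracks the last pong per connection; connections past the timeout
    are reaped even if the TCP socket still looks open."""
    def __init__(self):
        self.last_pong = {}  # conn_id -> timestamp of last pong

    def on_pong(self, conn_id, now):
        self.last_pong[conn_id] = now

    def reap(self, now):
        dead = [c for c, t in self.last_pong.items()
                if now - t > LIVENESS_TIMEOUT]
        for c in dead:
            del self.last_pong[c]
        return dead  # caller closes these sockets and clears presence
```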
27. Presence and cursor updates mixed with document ops
Problem
Ephemeral high-frequency updates can overwhelm critical edit pipeline.
Failure modes
- edit latency rises due to cursor spam
- unnecessary DB writes if presence persisted wrongly
Solution
Separate channels:
- reliable ordered stream for document ops
- lossy throttled channel for presence/cursors
Trade-offs
- More protocol surface.
- Much better performance.
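The lossy channel amounts to a per-user throttle. A sketch (class name invented; timestamps are passed in explicitly here to keep it deterministic, where a real server would use a monotonic clock):

```python
class PresenceThrottle:
    """Lossy channel for cursor/presence: at most one update per user per
    interval; intermediate positions are dropped, never queued."""
    def __init__(self, min_interval):
        self.min_interval = min_interval
        self.last_sent = {}   # user_id -> time of last forwarded update

    def maybe_send(self, user_id, now, send):
        if now - self.last_sent.get(user_id, float("-inf")) >= self.min_interval:
            self.last_sent[user_id] = now
            send(user_id)
            return True
        return False  # dropped: cursor positions are safe to lose
```

Dropping (rather than queueing) is the point: a stale cursor position has no value once a newer one exists, unlike a document op.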
Product/UX consistency issues
28. No undo/redo semantics under collaboration
Problem
With naive LWW and snapshots, collaborative undo is ill-defined.
Failure modes
- undo removes someone else’s changes
- local history diverges from server history
Solution
Use operation-based model with per-user undo semantics integrated with OT/CRDT/editor framework.
Trade-offs
- Complex but expected in docs products.
29. Offline editing unsupported or dangerous
Problem
If users go offline and edit, reconnecting with LWW timestamps is destructive.
Solution
If offline support is needed:
- CRDT is usually a better fit
- or queue local ops against known base version and rebase/transform on reconnect
Trade-offs
- More client complexity and storage.
30. No schema/version migration strategy for document model
Problem
As editor features evolve, old snapshots/ops may become incompatible.
Solution
Version the document schema and operation format; support migration or transcoding.
Trade-offs
- Ongoing maintenance burden.
Observability and operational blind spots
31. Hard to debug causality and divergence
Problem
Current design lacks clear operation lineage.
Failure modes
- impossible to prove why text disappeared
- support nightmare
Solution
Maintain audit trail:
- op ID
- author ID
- server sequence
- parent/base version
- timestamp (server-side, informational only)
- transform metadata if applicable
Trade-offs
- More storage.
- Huge debugging value.
32. No mention of rate limiting / abuse control
Problem
Collaborative endpoints are easy to abuse.
Failure modes
- spam edits
- giant payloads
- connection floods
- expensive hot doc attacks
Solution
- connection limits per user/IP
- payload size limits
- per-doc op rate limiting
- authz checks on each document join/edit
- WAF for HTTP paths
Trade-offs
- Potential false positives for power users/bots.
Better target architecture
A stronger architecture would look like this:
Realtime path
- Clients connect via WebSocket to stateless collaboration gateways.
- Gateways authenticate and subscribe users to document rooms.
- All ops for a document route to a document shard/leader or partitioned stream by document_id.
- The document processor assigns monotonic sequence numbers and applies OT/CRDT logic.
- Committed ops are published to all subscribers across all gateways immediately.
Persistence
- Source of truth = operation log + periodic snapshots/checkpoints.
- Canonical document model = structured rich-text JSON, not HTML.
- PostgreSQL can store snapshots, metadata, permissions, and optionally op history if scale permits.
- For very high scale, use Kafka/NATS/Redis Streams for op transport, then persist asynchronously.
Recovery
- Client tracks last seen sequence.
- On reconnect, server replays missing ops or sends latest snapshot + tail ops.
- Explicit acks ensure “saved” means durably accepted.
Security
- Static assets on CDN only.
- Dynamic document APIs mostly uncached.
- HttpOnly cookie or short-lived token approach.
- CSP, sanitization, and server-authoritative sequencing.
Prioritized list of fixes
If you had to improve this incrementally:
P0 — must fix before production
- Replace client timestamp LWW with OT/CRDT or at least server sequencing.
- Replace DB polling with shared realtime pub/sub or event stream.
- Stop caching authenticated mutable API responses in CDN.
- Remove JWTs from localStorage; use safer token/session handling.
- Add op IDs, versioning, deduplication, and reconnect replay.
- Persist operations durably before acking success.
P1 — next most important
- Move from HTML snapshots to canonical document model + op log + checkpoints.
- Partition/shard by document ID, not organization ID.
- Add backpressure, heartbeats, and presence separation.
- Avoid replica reads for collaboration-critical paths.
P2 — scale and polish
- Hot-document sharding/ownership.
- Audit logs and observability for divergence.
- Collaborative undo/redo semantics.
- Schema versioning and migration.
Summary table
| Issue | Risk | Fix | Trade-off |
|---|---|---|---|
| Local-only WebSocket broadcast | stale/inconsistent updates | shared pub/sub or doc shard | extra infra |
| DB polling for sync | latency, DB load | event bus | complexity |
| Client timestamp LWW | lost edits, abuse | OT/CRDT/server sequencing | hard implementation |
| Full HTML snapshots | write amp, poor merges | op log + snapshots + canonical model | more data model work |
| No global ordering | divergence | per-doc sequencer/leader | hotspot management |
| Paragraph conflict unit | excessive overwrites | character/range ops | editor integration |
| Duplicate change application | corruption | op IDs + idempotency | metadata overhead |
| Reconnect gaps | stale clients | replay by version | op retention needed |
| No ack protocol | ambiguous state | explicit acks | protocol complexity |
| DB write/broadcast race | phantom/lost edits | commit then broadcast from log | slight latency |
| Read replica lag | stale reads | primary/leader reads | more primary load |
| Partition by org | skew/hotspot | shard by doc ID | harder org queries |
| Round-robin WS | inconsistent state | stateless gateways or doc routing | routing complexity |
| Server crash | disconnect/state loss | durable ops + reconnect/resume | more engineering |
| Slow consumers | memory bloat | backpressure/disconnect | slow clients resync |
| Hot docs | shard overload | dedicated doc shard, batching | special handling |
| JWT in localStorage | token theft | HttpOnly cookies/short tokens | auth complexity |
| 24h JWT | long compromise window | short-lived + refresh | more token flow |
| CDN API caching | stale/leaked data | no-store/private | more origin traffic |
| Per-keystroke DB writes | DB bottleneck | batching/log append | complexity |
| HTML canonical storage | inconsistency/XSS | structured model | conversions needed |
If you want, I can also provide:
- a revised production-ready architecture diagram in text,
- a failure scenario walkthrough (server crash, split brain, reconnect, concurrent edits), or
- a reference protocol for client/server collaborative editing messages.