Claude Sonnet 4.5 vs Inception: Mercury 2
Compare Claude Sonnet 4.5 by Anthropic against Inception: Mercury 2 by Inception (context windows of 200K vs. 128K tokens), tested across 42 shared challenges. Updated March 2026.
Tests an AI's ability to make educated estimates based on technical knowledge
Answer (order‑of‑magnitude estimate)
Training the original GPT‑3 (the 175‑billion‑parameter “GPT‑3‑175B” model) required on the order of 10²³ floating‑point operations (FLOPs), roughly 1–2 × 10²³ FLOPs.
Below is a step‑by‑step walk‑through of how that number is derived, using only publicly known architecture details and a few well‑established approximations for transformer training cost.
| Quantity | Approximate value (publicly known) | Why it matters |
|---|---|---|
| Number of model parameters | \(N_{\text{par}} \approx 175\,\text{B} = 1.75\times10^{11}\) | Each parameter is touched during the forward and backward passes. |
| Training token count | \(N_{\text{tok}} \approx 300\,\text{B} = 3.0\times10^{11}\) | GPT‑3 was trained on ~300 billion tokens (Common Crawl + WebText + Books + Wikipedia). |
| Model architecture | 96 transformer layers, hidden size \(d = 12{,}288\), feed‑forward size \(4d\) | Determines the FLOPs per token per layer. |
| Optimizer | Adam (two moment estimates) | Adds a modest constant factor (≈ 2) to the raw forward‑backward count. |
| Training regime | ~1 epoch over the data (the 300 B tokens) | No extra passes; the token count already incorporates the number of epochs. |
A widely‑used back‑of‑the‑envelope rule for transformer training is:
\[ \text{FLOPs}_{\text{train}} \approx 2 \times N_{\text{par}} \times N_{\text{tok}} \]
Plugging in the numbers:
\[ \begin{aligned} \text{FLOPs}_{\text{train}} &\approx 2 \times (1.75\times10^{11}) \times (3.0\times10^{11}) \\ &= 2 \times 5.25\times10^{22} \\ &= 1.05\times10^{23}\ \text{FLOPs}. \end{aligned} \]
So ≈ 1 × 10²³ FLOPs is the baseline estimate.
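The baseline figure is easy to verify numerically; a minimal sketch in Python using the publicly reported values above:

```python
# Back-of-the-envelope check of the 2 * N_par * N_tok rule.
n_par = 1.75e11   # 175 B parameters
n_tok = 3.0e11    # 300 B training tokens

train_flops = 2 * n_par * n_tok
print(f"{train_flops:.2e}")  # → 1.05e+23
```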
If we want to see where the “2 × Npar × Ntok” figure comes from, we can break down a single transformer layer:
| Operation (per token) | FLOPs (approx.) | Explanation |
|---|---|---|
| Q, K, V projections (3 × \(d\times d\)) | \(3d^{2}\) | Each projection is a dense matrix multiply. |
| Attention scores (dot products over sequence length \(L\)) | \(Ld\) | For each token we compute dot products with all \(L\) keys. |
| Attention output projection (1 × \(d\times d\)) | \(d^{2}\) | |
| Feed‑forward first linear (\(d \times 4d\)) | \(4d^{2}\) | |
| Activation (GELU) | ≈ \(4d\) (negligible) | |
| Feed‑forward second linear (\(4d \times d\)) | \(4d^{2}\) | |
| Total per layer (forward) | ≈ \(12d^{2} + Ld\) | Dominated by the dense matrix multiplies. |
With \(d = 12{,}288\) we have \(d^{2} \approx 1.5\times10^{8}\).
Ignoring the sequence‑length term (with GPT‑3’s context length \(L = 2048\), \(Ld \approx 2048 \times 12{,}288 \approx 2.5\times10^{7}\), roughly an order of magnitude smaller than \(d^{2}\)), the forward cost per layer per token is roughly:
\[ 12 d^{2} \approx 12 \times 1.5\times10^{8} \approx 1.8\times10^{9}\ \text{FLOPs}. \]
Multiplying by the 96 layers:
\[ \text{Forward per token} \approx 96 \times 1.8\times10^{9} \approx 1.7\times10^{11}\ \text{FLOPs}. \]
A backward pass roughly doubles this (gradients w.r.t. each weight), giving ≈ \(3.4\times10^{11}\) FLOPs per token for the whole model.
Now multiply by the token count:
\[ 3.4\times10^{11} \times 3.0\times10^{11} \approx 1.0\times10^{23}\ \text{FLOPs}, \]
which coincides with the simpler \(2 \times N_{\text{par}} \times N_{\text{tok}}\) estimate.
Thus the two viewpoints are consistent: the “parameter‑times‑token” rule already captures the dominant cost of the dense matrix multiplications inside each transformer layer.
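The per-layer breakdown can be reproduced in a few lines; a sketch following the table above, where counts are multiply-accumulates per token and \(L = 2048\) is GPT-3’s context length:

```python
# Per-layer FLOP breakdown for GPT-3, following the table above.
d = 12288        # hidden size
n_layers = 96    # transformer layers
L = 2048         # sequence length

per_layer = 12 * d**2 + L * d              # QKV + attn output + FFN (+ scores)
forward_per_token = n_layers * per_layer   # ~1.7e11
fwd_bwd_per_token = 2 * forward_per_token  # backward roughly doubles forward

n_tok = 3.0e11                             # 300 B training tokens
total = fwd_bwd_per_token * n_tok
print(f"{total:.2e}")  # → 1.06e+23
```

The result lands within a few percent of the simple parameter-times-token rule, which is the point of the cross-check.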
Adam keeps two extra state tensors (first‑ and second‑moment estimates) per parameter, and each training step performs a few extra arithmetic operations per weight (e.g., bias‑corrected moments, division, sqrt). A common heuristic is to multiply the raw forward‑backward count by ≈ 2 to include Adam’s extra work.
\[ \text{FLOPs}_{\text{with Adam}} \approx 2 \times 1.05\times10^{23} \approx 2.1\times10^{23}\ \text{FLOPs}. \]
So a realistic upper bound is ≈ 2 × 10²³ FLOPs.
| Step | Value | Comment |
|---|---|---|
| Parameters | \(1.75\times10^{11}\) | 175 B |
| Training tokens | \(3.0\times10^{11}\) | 300 B |
| Base forward‑backward FLOPs | \(1.0\times10^{23}\) | \(2 \times N_{\text{par}} \times N_{\text{tok}}\) |
| Adam optimizer factor | ≈ 2 | Gives ~\(2\times10^{23}\) |
| Final ball‑park | \(1\text{–}2 \times 10^{23}\) FLOPs | Equivalent to ~10⁵ exaflop‑seconds, i.e. roughly 12–23 days on a 100 PFLOP/s machine. |
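Converting a FLOP count into wall-clock time is a one-line division; a sketch assuming a hypothetical machine sustaining 100 PFLOP/s (an illustrative throughput, not a real cluster):

```python
# Convert the FLOP estimate into wall-clock time on a hypothetical
# machine sustaining 100 PFLOP/s.
total_flops = 2e23          # upper end of the 1-2e23 estimate
machine_flops_s = 100e15    # 100 PFLOP/s sustained

seconds = total_flops / machine_flops_s
days = seconds / 86400
print(f"{days:.1f} days")   # → 23.1 days
```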
| Source of uncertainty | Effect on estimate |
|---|---|
| Exact token count (some reports say 300 B, others 500 B) | Linear scaling – 500 B tokens would push the estimate to ~1.7 × 10²³ FLOPs. |
| Model‑specific tricks (e.g., mixed‑precision, gradient checkpointing) | Mixed precision lowers the cost per operation rather than the operation count, while checkpointing adds extra recomputation; the net effect on the FLOP estimate is modest (≈ ± 20 %). |
| Additional regularization (e.g., dropout, layer‑norm) | Adds a few percent overhead. |
| Hardware‑specific optimizations (e.g., fused kernels) | May reduce the effective FLOP count, but the theoretical count stays the same. |
Given these uncertainties, the order‑of‑magnitude figure of 10²³ FLOPs is robust.
This is the scale of compute that modern “large‑scale” language‑model training operates at.
Tests an AI's ability to understand game rules and strategy
Tests an AI's ability to solve a simple but potentially confusing logic puzzle
Tests an AI's randomness and creativity
Here are five jokes, each with a different punchline:
Why don’t scientists trust atoms?
Because they make up everything!
What do you call a fake noodle?
An impasta.
Why did the scarecrow win an award?
He was outstanding in his field.
How does a penguin build its house?
Igloos it together.
What’s orange and sounds like a parrot?
A carrot.
All five jokes are distinct, so the number of unique jokes is 5.
Tests an AI's ability to generate vector graphics
Tests an AI's ability to create detailed SVG illustrations of gaming hardware
Tests an AI's humor and creative writing ability
[Spotlight comes up. The comedian walks to the mic, glances at the audience, and smiles.]
Comedian:
Hey, everyone! Great to be here. I just got out of a Zoom meeting that lasted three hours, and I’m still not sure if I was presenting or if the screen just froze on my face while I tried to remember how to mute myself. You ever notice how the “mute” button has become the most powerful thing on the internet? It’s like a tiny, invisible bouncer that decides whether the world hears you or not. I’ve spent more time hunting for that little square than I have hunting for my keys.
Speaking of keys, does anyone else feel like their house is turning into a high‑security vault? I have a smart lock that talks to my phone, a smart thermostat that knows when I’m home before I do, and a smart fridge that can tell me I’m out of milk—while I’m standing in front of it, holding a half‑filled carton. The fridge is basically judging me now. “You’ve had three sodas today. Maybe try water?” I’m like, “You’re a fridge, not a therapist!” And the worst part is the fridge never apologizes when it’s wrong. I once asked it for “something healthy,” and it gave me a bag of chips because “they’re low in fat.” Yeah, that’s the logic of a machine that thinks “low fat” is a personality trait.
I’ve also been trying to get back into the gym. The other day I walked into a class that promised “high‑intensity interval training.” I thought that meant the instructor would shout at me, “You’re doing great!” Instead, it was a silent room full of people staring at their phones, waiting for the next notification. The only interval I felt was the time between my heart rate spiking and me realizing I’d forgotten to put my headphones on. I’m pretty sure the treadmill was the only thing moving—my mind was stuck on “Did I lock the front door?”
And then there’s the whole “healthy eating” trend. I tried a plant‑based burger the other day. It looked exactly like a regular burger, but the taste was… different. The chef told me it was “crafted to mimic the experience of meat.” I’m like, “I don’t need an experience; I need a meal that doesn’t make me question my life choices.” I ordered a side salad, and the waiter asked if I wanted “dressing on the side.” I said, “No, I want it on my conscience.” The salad was so small I could have used it as a garnish for a garnish. I’m starting to think the only thing getting bigger in my life is the stack of takeout menus on my fridge.
Let’s talk about the news for a second. Every headline now starts with “Scientists discover…” and ends with “and it could change everything.” I’m waiting for the day a scientist discovers a way to make my Wi‑Fi signal actually work in the bathroom. Until then, I’m stuck streaming a documentary about the Great Barrier Reef while I’m in a shower that’s more like a sauna. The only thing I’m learning is how badly my hair reacts to steam.
All right, I’m going to wrap this up before my phone buzzes again. If you’ve learned anything tonight, it’s this: life is full of little absurdities—mute buttons, smart fridges, and plant‑based burgers. The trick is to laugh at them before they start judging you back. Thanks for listening, and remember—if you ever feel invisible, just check your Zoom settings. Good night!
[Comedian waves, steps off stage, lights dim.]
Tests an AI's ability to write in distinct character voices
Recreate an interactive, nostalgic Pokémon battle UI in a single HTML file.
Recreate an interactive, classic Mario level in a single HTML file.
Tests an AI's ability to replicate an existing UI with Tailwind CSS
Tests an AI's ability to create smooth web animations