Claude Sonnet 4.5 vs Inception: Mercury 2
Compare Claude Sonnet 4.5 by Anthropic against Inception: Mercury 2 by Inception (context windows of 200K vs. 128K tokens), tested across 42 shared challenges. Updated March 2026.
Tests an AI's ability to make educated estimates based on technical knowledge
Answer (order‑of‑magnitude estimate)
Training the original GPT‑3 (the 175‑billion‑parameter “GPT‑3‑175B” model) required on the order of 10²³ floating‑point operations (FLOPs), roughly 1–2 × 10²³ FLOPs.
Below is a step‑by‑step walk‑through of how that number is derived, using only publicly known architecture details and a few well‑established approximations for transformer training cost.
| Quantity | Approximate value (publicly known) | Why it matters |
|---|---|---|
| Number of model parameters | \(N_{\text{par}} \approx 175\,\text{B} = 1.75\times10^{11}\) | Each parameter is touched during the forward and backward passes. |
| Training token count | \(N_{\text{tok}} \approx 300\,\text{B} = 3.0\times10^{11}\) | GPT‑3 was trained on ~300 billion tokens (Common Crawl + WebText + Books + Wikipedia). |
| Model architecture | 96 transformer layers, hidden size \(d = 12{,}288\), feed‑forward size \(4d\) | Determines the FLOPs per token per layer. |
| Optimizer | Adam (two moment estimates) | Adds a modest constant factor (≈ 2) to the raw forward‑backward count. |
| Training regime | ~1 epoch over the data (the 300 B tokens) | No extra passes; the token count already incorporates the number of epochs. |
A widely‑used back‑of‑the‑envelope rule for transformer training is:
\[ \text{FLOPs}_{\text{train}} \approx 2 \times N_{\text{par}} \times N_{\text{tok}} \]
Plugging in the numbers:
\[ \begin{aligned} \text{FLOPs}_{\text{train}} &\approx 2 \times (1.75\times10^{11}) \times (3.0\times10^{11}) \\ &= 2 \times 5.25\times10^{22} \\ &= 1.05\times10^{23}\ \text{FLOPs}. \end{aligned} \]
So ≈ 1 × 10²³ FLOPs is the baseline estimate.
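The baseline figure is easy to verify numerically; a minimal sketch in Python using the publicly reported values above:

```python
# Back-of-the-envelope check of the 2 * N_par * N_tok rule.
n_par = 1.75e11   # 175 B parameters
n_tok = 3.0e11    # 300 B training tokens

train_flops = 2 * n_par * n_tok
print(f"{train_flops:.2e}")  # → 1.05e+23
```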
If we want to see where the “2 × Npar × Ntok” figure comes from, we can break down a single transformer layer:
| Operation (per token) | FLOPs (approx.) | Explanation |
|---|---|---|
| Q, K, V projections (3 × \(d\times d\)) | \(3d^{2}\) | Each projection is a dense matrix multiply. |
| Attention scores (dot products over sequence length \(L\)) | \(Ld\) | For each token we compute dot products with all \(L\) keys. |
| Attention output projection (1 × \(d\times d\)) | \(d^{2}\) | |
| Feed‑forward first linear (\(d \times 4d\)) | \(4d^{2}\) | |
| Activation (GELU) | ≈ \(4d\) (negligible) | |
| Feed‑forward second linear (\(4d \times d\)) | \(4d^{2}\) | |
| Total per layer (forward) | ≈ \(12d^{2} + Ld\) | Dominated by the dense matrix multiplies. |
With \(d = 12{,}288\) we have \(d^{2} \approx 1.5\times10^{8}\).
Ignoring the sequence‑length term (with GPT‑3’s context length \(L = 2048\), \(Ld \approx 2048 \times 12{,}288 \approx 2.5\times10^{7}\), roughly an order of magnitude smaller than \(d^{2}\)), the forward cost per layer per token is roughly:
\[ 12 d^{2} \approx 12 \times 1.5\times10^{8} \approx 1.8\times10^{9}\ \text{FLOPs}. \]
Multiplying by the 96 layers:
\[ \text{Forward per token} \approx 96 \times 1.8\times10^{9} \approx 1.7\times10^{11}\ \text{FLOPs}. \]
A backward pass roughly doubles this (gradients w.r.t. each weight), giving ≈ \(3.4\times10^{11}\) FLOPs per token for the whole model.
Now multiply by the token count:
\[ 3.4\times10^{11} \times 3.0\times10^{11} \approx 1.0\times10^{23}\ \text{FLOPs}, \]
which coincides with the simpler \(2 \times N_{\text{par}} \times N_{\text{tok}}\) estimate.
Thus the two viewpoints are consistent: the “parameter‑times‑token” rule already captures the dominant cost of the dense matrix multiplications inside each transformer layer.
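The per-layer breakdown can be reproduced in a few lines; a sketch following the table above, where counts are multiply-accumulates per token and \(L = 2048\) is GPT-3’s context length:

```python
# Per-layer FLOP breakdown for GPT-3, following the table above.
d = 12288        # hidden size
n_layers = 96    # transformer layers
L = 2048         # sequence length

per_layer = 12 * d**2 + L * d              # QKV + attn output + FFN (+ scores)
forward_per_token = n_layers * per_layer   # ~1.7e11
fwd_bwd_per_token = 2 * forward_per_token  # backward roughly doubles forward

n_tok = 3.0e11                             # 300 B training tokens
total = fwd_bwd_per_token * n_tok
print(f"{total:.2e}")  # → 1.06e+23
```

The result lands within a few percent of the simple parameter-times-token rule, which is the point of the cross-check.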
Adam keeps two extra state tensors (first‑ and second‑moment estimates) per parameter, and each training step performs a few extra arithmetic operations per weight (e.g., bias‑corrected moments, division, sqrt). A common heuristic is to multiply the raw forward‑backward count by ≈ 2 to include Adam’s extra work.
\[ \text{FLOPs}_{\text{with Adam}} \approx 2 \times 1.05\times10^{23} \approx 2.1\times10^{23}\ \text{FLOPs}. \]
So a realistic upper bound is ≈ 2 × 10²³ FLOPs.
| Step | Value | Comment |
|---|---|---|
| Parameters | \(1.75\times10^{11}\) | 175 B |
| Training tokens | \(3.0\times10^{11}\) | 300 B |
| Base forward‑backward FLOPs | \(1.0\times10^{23}\) | \(2 \times N_{\text{par}} \times N_{\text{tok}}\) |
| Adam optimizer factor | ≈ 2 | Gives ~\(2\times10^{23}\) |
| Final ball‑park | \(1\text{–}2 \times 10^{23}\) FLOPs | Equivalent to ~10⁵ exaflop‑seconds, i.e. roughly 12–23 days on a 100 PFLOP/s machine. |
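Converting a FLOP count into wall-clock time is a one-line division; a sketch assuming a hypothetical machine sustaining 100 PFLOP/s (an illustrative throughput, not a real cluster):

```python
# Convert the FLOP estimate into wall-clock time on a hypothetical
# machine sustaining 100 PFLOP/s.
total_flops = 2e23          # upper end of the 1-2e23 estimate
machine_flops_s = 100e15    # 100 PFLOP/s sustained

seconds = total_flops / machine_flops_s
days = seconds / 86400
print(f"{days:.1f} days")   # → 23.1 days
```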
| Source of uncertainty | Effect on estimate |
|---|---|
| Exact token count (some reports say 300 B, others 500 B) | Linear scaling – 500 B tokens would push the estimate to ~1.7 × 10²³ FLOPs. |
| Model‑specific tricks (e.g., mixed‑precision, gradient checkpointing) | Mixed precision lowers the cost per operation rather than the operation count, while checkpointing adds extra recomputation; the net effect on the FLOP estimate is modest (≈ ± 20 %). |
| Additional regularization (e.g., dropout, layer‑norm) | Adds a few percent overhead. |
| Hardware‑specific optimizations (e.g., fused kernels) | May reduce the effective FLOP count, but the theoretical count stays the same. |
Given these uncertainties, the order‑of‑magnitude figure of 10²³ FLOPs is robust.
This is the scale of compute that modern “large‑scale” language‑model training operates at.
Tests an AI's ability to understand game rules and strategy
Tests an AI's ability to solve a simple but potentially confusing logic puzzle
Tests an AI's randomness and creativity
Here are five jokes, each with a different punchline:
Why don’t scientists trust atoms?
Because they make up everything!
What do you call a fake noodle?
An impasta.
Why did the scarecrow win an award?
He was outstanding in his field.
How does a penguin build its house?
Igloos it together.
What’s orange and sounds like a parrot?
A carrot.
All five jokes are distinct, so the number of unique jokes is 5.
Tests an AI's ability to generate vector graphics
Tests an AI's ability to create detailed SVG illustrations of gaming hardware
Tests an AI's humor and creative writing ability
[Spotlight comes up. The comedian walks to the mic, glances at the audience, and smiles.]
Comedian:
Hey, everyone! Great to be here. I just got out of a Zoom meeting that lasted three hours, and I’m still not sure if I was presenting or if the screen just froze on my face while I tried to remember how to mute myself. You ever notice how the “mute” button has become the most powerful thing on the internet? It’s like a tiny, invisible bouncer that decides whether the world hears you or not. I’ve spent more time hunting for that little square than I have hunting for my keys.
Speaking of keys, does anyone else feel like their house is turning into a high‑security vault? I have a smart lock that talks to my phone, a smart thermostat that knows when I’m home before I do, and a smart fridge that can tell me I’m out of milk—while I’m standing in front of it, holding a half‑filled carton. The fridge is basically judging me now. “You’ve had three sodas today. Maybe try water?” I’m like, “You’re a fridge, not a therapist!” And the worst part is the fridge never apologizes when it’s wrong. I once asked it for “something healthy,” and it gave me a bag of chips because “they’re low in fat.” Yeah, that’s the logic of a machine that thinks “low fat” is a personality trait.
I’ve also been trying to get back into the gym. The other day I walked into a class that promised “high‑intensity interval training.” I thought that meant the instructor would shout at me, “You’re doing great!” Instead, it was a silent room full of people staring at their phones, waiting for the next notification. The only interval I felt was the time between my heart rate spiking and me realizing I’d forgotten to put my headphones on. I’m pretty sure the treadmill was the only thing moving—my mind was stuck on “Did I lock the front door?”
And then there’s the whole “healthy eating” trend. I tried a plant‑based burger the other day. It looked exactly like a regular burger, but the taste was… different. The chef told me it was “crafted to mimic the experience of meat.” I’m like, “I don’t need an experience; I need a meal that doesn’t make me question my life choices.” I ordered a side salad, and the waiter asked if I wanted “dressing on the side.” I said, “No, I want it on my conscience.” The salad was so small I could have used it as a garnish for a garnish. I’m starting to think the only thing getting bigger in my life is the stack of takeout menus on my fridge.
Let’s talk about the news for a second. Every headline now starts with “Scientists discover…” and ends with “and it could change everything.” I’m waiting for the day a scientist discovers a way to make my Wi‑Fi signal actually work in the bathroom. Until then, I’m stuck streaming a documentary about the Great Barrier Reef while I’m in a shower that’s more like a sauna. The only thing I’m learning is how badly my hair reacts to steam.
All right, I’m going to wrap this up before my phone buzzes again. If you’ve learned anything tonight, it’s this: life is full of little absurdities—mute buttons, smart fridges, and plant‑based burgers. The trick is to laugh at them before they start judging you back. Thanks for listening, and remember—if you ever feel invisible, just check your Zoom settings. Good night!
[Comedian waves, steps off stage, lights dim.]
Tests an AI's ability to write in distinct character voices
Recreate an interactive, nostalgic Pokémon battle UI in a single HTML file.
Recreate an interactive, classic Mario level in a single HTML file.
Tests an AI's ability to replicate an existing UI with Tailwind CSS
Tests an AI's ability to create smooth web animations