Skip to content
Rival
Models
Compare · Best For · Arena
Lab
Sign Up

We compare AI models for a living. On purpose. We chose this.

@rival_tips

Explore

  • Compare Models
  • All Models
  • Prompt Lab
  • Image Generation
  • Audio Comparison
  • Leaderboard
  • Challenges

Discover

  • Insights
  • AI Creators
  • AI Tools
  • The Graveyard

Developers

  • Developer Hub
  • MCP Server
  • .llmignore
  • Badges
  • Rival Datasets

Connect

  • Methodology
  • Sponsor
  • Partnerships
  • Privacy Policy
  • Terms
  • RSS Feed
© 2026 Rival · Built at hours no one should be awake, on hardware we don’t own
Inception: Mercury 2 — Estimate Complexity


See how Inception: Mercury 2 compares on this challenge:

  • Inception: Mercury 2 vs Claude Sonnet 4.6
  • Inception: Mercury 2 vs Google: Gemini 3.1 Flash Lite Preview
  • Inception: Mercury 2 vs Gemini 3.1 Pro Preview

4 models have responded to the "Estimate Complexity" challenge. Compare their approaches side-by-side on Rival. This response is part of Rival's open dataset of 5,600+ AI model responses.


Estimate Complexity

Mercury 2's response to the 'Estimate Complexity' challenge.

Inception: Mercury 2
March 5, 2026
Text

Prompt

Estimate the total number of FLOPs required to train GPT-3 without searching online. Explain your reasoning step by step.

Inception: Mercury 2
Response


Answer (order‑of‑magnitude estimate)
Training the original GPT‑3 (the 175‑billion‑parameter "GPT‑3 175B" model) required on the order of 10²³ floating‑point operations (FLOPs), roughly 1–2 × 10²³ FLOPs.

Below is a step‑by‑step walkthrough of how to arrive at that number, using only publicly known architecture details and a few well‑established approximations for transformer training cost.


1. What we need to know

| Quantity | Approximate value (publicly known) | Why it matters |
|---|---|---|
| Number of model parameters | \(N_{\text{par}} \approx 175\,\text{B} = 1.75\times10^{11}\) | Each parameter is touched during the forward and backward passes. |
| Training token count | \(N_{\text{tok}} \approx 300\,\text{B} = 3.0\times10^{11}\) | GPT‑3 was trained on ~300 billion tokens (Common Crawl + WebText + Books + Wikipedia). |
| Model architecture | 96 transformer layers, hidden size \(d = 12{,}288\), feed‑forward size \(4d\) | Determines the FLOPs per token per layer. |
| Optimizer | Adam (two moment estimates) | Adds a modest constant factor (≈ 2) to the raw forward‑backward count. |
| Training regime | ~1 epoch over the data (the 300 B tokens) | No extra passes; the token count already incorporates the number of epochs. |

2. A simple “parameter‑times‑token” model

A widely‑used back‑of‑the‑envelope rule for transformer training is:

\[ \text{FLOPs}_{\text{train}} \;\approx\; 2 \times N_{\text{par}} \times N_{\text{tok}} \]

  • Why the factor 2?
    • Forward pass: each weight is multiplied by an activation and the result accumulated → 2 FLOPs (multiply + add) per weight per token.
    • Backward pass: the gradient w.r.t. each weight requires another multiply‑add → another 2 FLOPs.
    • The total is therefore ≈ 4 FLOPs per weight per token; counting the multiply‑add pair as a single "FLOP", as this estimate does throughout, gives the factor 2. (The also‑common 6·N·D rule counts multiply and add separately and weights the backward pass at 2× the forward, yielding ≈ 3 × 10²³ FLOPs, the figure usually quoted for GPT‑3.)

Plugging in the numbers:

\[
\begin{aligned}
\text{FLOPs}_{\text{train}} &\approx 2 \times (1.75\times10^{11}) \times (3.0\times10^{11})\\
&= 2 \times 5.25\times10^{22}\\
&= 1.05\times10^{23}\ \text{FLOPs}.
\end{aligned}
\]

So ≈ 1 × 10²³ FLOPs is the baseline estimate.
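This back‑of‑the‑envelope arithmetic is easy to check in a couple of lines (the constants are the publicly reported GPT‑3 figures used throughout this estimate):

```python
# FLOPs ≈ 2 × parameters × tokens, counting a multiply-add pair
# as a single FLOP (the convention adopted in this estimate).
n_params = 1.75e11   # 175 B parameters
n_tokens = 3.0e11    # 300 B training tokens

flops_train = 2 * n_params * n_tokens
print(f"{flops_train:.2e}")  # 1.05e+23
```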


3. Refinement: accounting for the internal structure of a transformer layer

If we want to see where the \(2 \times N_{\text{par}} \times N_{\text{tok}}\) figure comes from, we can break down a single transformer layer:

| Operation (per token) | FLOPs (approx.) | Explanation |
|---|---|---|
| Q, K, V projections (3 × \(d \times d\)) | \(3d^{2}\) | Each projection is a dense matrix multiply. |
| Attention scores (dot products over sequence length \(L\)) | \(Ld\) | Each token computes dot products with all \(L\) keys. |
| Attention output projection (1 × \(d \times d\)) | \(d^{2}\) | |
| Feed‑forward first linear (\(d \times 4d\)) | \(4d^{2}\) | |
| Activation (GELU) | ≈ \(4d\) (tiny) | |
| Feed‑forward second linear (\(4d \times d\)) | \(4d^{2}\) | |
| Total per layer (forward) | ≈ \(12d^{2} + Ld\) | |

With \(d = 12{,}288\) we have \(d^{2} \approx 1.5\times10^{8}\).
The sequence‑length term \(Ld \approx 1024 \times 12{,}288 \approx 1.3\times10^{7}\) (using the \(L = 1024\) assumed here; GPT‑3's actual context window was 2,048, which would roughly double this term) is an order of magnitude smaller than \(d^{2}\), so we ignore it. The forward cost per layer per token is then roughly:

\[ 12 d^{2} \approx 12 \times 1.5\times10^{8} \approx 1.8\times10^{9}\ \text{FLOPs}. \]

Multiplying by the 96 layers:

\[ \text{Forward per token} \approx 96 \times 1.8\times10^{9} \approx 1.7\times10^{11}\ \text{FLOPs}. \]

A backward pass roughly doubles this (gradient w.r.t. each weight), giving ≈ \(3.4\times10^{11}\) FLOPs per token for the whole model.

Now multiply by the token count:

\[ 3.4\times10^{11} \times 3.0\times10^{11} \approx 1.0\times10^{23}\ \text{FLOPs}, \]

which coincides with the simpler \(2 \times N_{\text{par}} \times N_{\text{tok}}\) estimate.

Thus the two viewpoints are consistent: the “parameter‑times‑token” rule already captures the dominant cost of the dense matrix multiplications inside each transformer layer.
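The layer‑by‑layer bookkeeping can also be replayed numerically; the hidden size, layer count, and sequence length are the values assumed in the text, and a multiply‑add again counts as one FLOP:

```python
# Forward FLOPs per token for one layer: 12*d^2 of dense matmul work
# plus an L*d attention-score term, scaled up to the full model and dataset.
d = 12_288          # hidden size
n_layers = 96       # transformer layers
seq_len = 1024      # sequence length assumed in the text
n_tokens = 3.0e11   # 300 B training tokens

fwd_per_layer = 12 * d**2 + seq_len * d    # ≈ 1.8e9
fwd_per_token = n_layers * fwd_per_layer   # ≈ 1.7e11, close to the parameter count
total = 2 * fwd_per_token * n_tokens       # backward pass ~doubles the forward cost
print(f"{total:.2e}")  # 1.05e+23
```

The fact that `fwd_per_token` lands within a few percent of the parameter count is exactly why the parameter‑times‑token rule works.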


4. Adding the optimizer overhead

Adam keeps two extra state tensors (first‑ and second‑moment estimates) per parameter, and each training step performs a few extra arithmetic operations per weight (bias‑corrected moments, a division, a square root). A common, deliberately generous heuristic is to multiply the raw forward‑backward count by ≈ 2 to bound this extra work; note that the optimizer update runs once per step rather than once per token, so its true share of the total is usually far smaller.

\[ \text{FLOPs}_{\text{with Adam}} \approx 2 \times 1.05\times10^{23} \approx 2.1\times10^{23}\ \text{FLOPs}. \]

So a realistic upper bound is ≈ 2 × 10²³ FLOPs.


5. Summary of the estimate

| Step | Value | Comment |
|---|---|---|
| Parameters | \(1.75\times10^{11}\) | 175 B |
| Training tokens | \(3.0\times10^{11}\) | 300 B |
| Base forward‑backward FLOPs | \(1.0\times10^{23}\) | \(2 \times N_{\text{par}} \times N_{\text{tok}}\) |
| Adam optimizer factor | ≈ 2 | Gives ~\(2\times10^{23}\) |
| Final ballpark | \(1\)–\(2 \times 10^{23}\) FLOPs | ≈ \(2\times10^{6}\) s on a 100 PFLOP/s machine, i.e. roughly three weeks. |

What does this number mean in practice?

  • Energy: assuming a modern GPU delivers ~10 GFLOP/s per watt (i.e. 10 GFLOP per joule), 2 × 10²³ FLOPs works out to ≈ 2 × 10¹³ J ≈ 6 GWh of electricity for the arithmetic alone; real‑world figures are higher once utilization and cooling losses are included.
  • Time: a cluster sustaining 100 PFLOP/s (several hundred A100‑class GPUs) would need ≈ 2 × 10⁶ seconds ≈ 23 days of continuous compute, which matches the publicly reported training wall‑clock time of a few weeks.
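Both practical figures follow from simple division, given the (assumed) ~10 GFLOP‑per‑joule efficiency and 100 PFLOP/s cluster used above:

```python
# Convert total training FLOPs into wall-clock time and energy.
flops = 2e23             # upper estimate including optimizer overhead
cluster_rate = 100e15    # 100 PFLOP/s sustained
flops_per_joule = 10e9   # ~10 GFLOP/s per watt

seconds = flops / cluster_rate                   # ≈ 2e6 s
days = seconds / 86_400                          # ≈ 23 days
energy_gwh = flops / flops_per_joule / 3.6e12    # joules → GWh
print(f"{days:.0f} days, {energy_gwh:.1f} GWh")  # 23 days, 5.6 GWh
```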

6. Caveats & sources of uncertainty

| Source of uncertainty | Effect on estimate |
|---|---|
| Exact token count (some reports say 300 B, others 500 B) | Linear scaling: 500 B tokens would push the base estimate to ~1.7 × 10²³ FLOPs. |
| Model‑specific tricks (e.g., mixed precision, gradient checkpointing) | Mixed precision speeds up each multiply‑add without changing the count; checkpointing adds recomputation; the net effect is modest (≈ ±20 %). |
| Additional operations (dropout, layer norm, softmax) | Add a few percent overhead. |
| Hardware‑specific optimizations (e.g., fused kernels) | May reduce wall‑clock time, but the theoretical FLOP count stays the same. |

Given these uncertainties, the order‑of‑magnitude figure of 10²³ FLOPs is robust.
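The first row's linear sensitivity is worth making explicit; under the 2 × params × tokens rule the token count is the only knob involved:

```python
# FLOPs scale linearly in token count: 2 × params × tokens.
n_params = 1.75e11
for n_tokens in (3.0e11, 5.0e11):   # 300 B vs 500 B reported token counts
    print(f"{n_tokens:.0e} tokens -> {2 * n_params * n_tokens:.2e} FLOPs")
```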


Bottom line

  • Training GPT‑3 (175 B) ≈ 1 × 10²³ FLOPs (forward + backward).
  • Including Adam optimizer ≈ 2 × 10²³ FLOPs.

This is the scale of compute that modern “large‑scale” language‑model training operates at.


About Inception: Mercury 2

Capabilities

Conversation · Reasoning · Code Generation · Analysis · Tool Use

Categories

Text · Code

Specifications

Provider
Inception
Released
2026-03-04
Size
MEDIUM
Context
128,000 tokens

