Estimate Complexity
Qwen3 Max's response to the 'Estimate Complexity' challenge.
Prompt
Estimate the total number of FLOPs required to train GPT-3 without searching online. Explain your reasoning step by step.
Response
Estimating the FLOPs required to train GPT-3 without looking it up is a classic “Fermi estimation” problem — breaking it down into components we can reason about, using approximations and scaling laws.
Let’s go step by step.
STEP 1: Recall GPT-3’s key specs (from memory)
GPT-3 has:
- ~175 billion parameters (1.75 × 10¹¹)
- Trained on ~300 billion tokens (3 × 10¹¹)
These numbers are widely cited and should be roughly accurate from memory.
STEP 2: FLOPs per parameter per token
A common rule of thumb in deep learning is:
Training a transformer model requires approximately 6 FLOPs per parameter per token.
Why 6?
- The forward pass costs roughly 2 FLOPs per parameter per token (one multiply and one add for each weight).
- The backward pass costs roughly twice the forward pass, about 4 FLOPs per parameter per token, since gradients must be computed with respect to both the activations and the weights.
- The optimizer step adds a little more, but it scales with the number of parameters per update step rather than per token, so it is negligible in this accounting.
So: 2 (forward) + 4 (backward) ≈ 6 FLOPs/param/token.
This is a standard approximation used in scaling law papers (like those from OpenAI and DeepMind).
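As a quick illustration, here is a minimal Python sketch of that accounting (the per-pass constants are the rule-of-thumb values above, not exact measurements):

```python
# Rule-of-thumb FLOP accounting per parameter, per training token.
# The optimizer step is ignored because it scales per update step, not per token.
FORWARD_FLOPS = 2   # one multiply + one add per weight
BACKWARD_FLOPS = 4  # ~2x the forward pass (gradients w.r.t. activations and weights)

FLOPS_PER_PARAM_PER_TOKEN = FORWARD_FLOPS + BACKWARD_FLOPS
print(FLOPS_PER_PARAM_PER_TOKEN)  # 6
```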
STEP 3: Multiply parameters × tokens × FLOPs per param per token
So:
FLOPs = (number of parameters) × (number of tokens) × (FLOPs per param per token)
= (1.75 × 10¹¹) × (3 × 10¹¹) × 6
Let’s compute:
First, 1.75 × 3 = 5.25
Then, 10¹¹ × 10¹¹ = 10²²
So: 5.25 × 10²² × 6 = 31.5 × 10²² = 3.15 × 10²³
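The same arithmetic as a short Python check, using the values from Steps 1 and 2:

```python
# Fermi estimate: total training compute ≈ 6 × parameters × tokens
params = 1.75e11              # ~175 billion parameters
tokens = 3e11                 # ~300 billion training tokens
flops_per_param_per_token = 6 # forward (2) + backward (4)

total_flops = flops_per_param_per_token * params * tokens
print(f"{total_flops:.2e}")   # ~3.15e+23 FLOPs
```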
STEP 4: Final Estimate
~3 × 10²³ FLOPs
Or, written out: 300 sextillion FLOPs
STEP 5: Sanity Check
Let’s see if this makes sense.
- GPT-2 (1.5B params, ~40B tokens) → 1.5e9 × 4e10 × 6 = ~3.6e20 FLOPs
- GPT-3 has ~117× the parameters (175B vs 1.5B) and 7.5× the tokens (300B vs 40B) → roughly 875× more FLOPs → 3.6e20 × 875 ≈ 3.15e23 → matching our estimate.
Also, known published estimates (which we’re avoiding, but for post-hoc sanity) are around 3.14 × 10²³ — so this lines up.
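The same scaling check spelled out in Python, using the GPT-2 figures recalled above:

```python
# Sanity check: scale GPT-2's estimated training compute up to GPT-3.
gpt2_flops = 6 * 1.5e9 * 4e10   # ~3.6e20 FLOPs
param_ratio = 1.75e11 / 1.5e9   # ~117x more parameters
token_ratio = 3e11 / 4e10       # 7.5x more tokens

gpt3_flops = gpt2_flops * param_ratio * token_ratio
print(f"{gpt3_flops:.2e}")      # ~3.15e+23, in line with the direct estimate
```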
✅ Final Answer:
Approximately 3 × 10²³ FLOPs are required to train GPT-3.
This estimate is derived from:
- 175B parameters
- 300B training tokens
- 6 FLOPs per parameter per token (standard transformer training cost)
All approximated from memory and first principles.
About Qwen3 Max
Specifications
- Provider: Qwen
- Released: 2025-09-05
- Size: XLARGE
- Context: 256,000 tokens