Estimate Complexity
Kimi K2's response to the 'Estimate Complexity' challenge.
Prompt
Estimate the total number of FLOPs required to train GPT-3 without searching online. Explain your reasoning step by step.
Response
Here is one way to “Fermi-estimate” the training compute for GPT-3 without looking up any numbers. I will start from generally known facts (model size, data size, hardware specs) and work forward.
Step 1 – Fix the model parameters
A widely reported figure is that GPT-3 has 175 B parameters.
N ≈ 1.75 × 10¹¹.
Step 2 – Fix the training-token budget
The GPT-3 paper reports training on roughly 300 B tokens.
T ≈ 3 × 10¹¹.
Step 3 – Recall the per-token FLOP rule for dense transformers
Every time one token is processed, the forward pass does roughly
2 FLOPs/parameter (multiply-accumulate counts as 2).
The backward pass does about twice the work of the forward pass, so the
total is ≈ 3× the forward cost, i.e. about 3 · 2N = 6N FLOPs per token.
The “textbook” estimate for the full run is therefore
FLOPs ≈ 6 · N · T.
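As a quick sanity check, the rule can be written as a tiny helper (a minimal sketch; the function name is mine, not from any library):

```python
def approx_training_flops(n_params: float, n_tokens: float) -> float:
    """Textbook dense-transformer training-compute estimate:
    ~6 FLOPs per parameter per token (2 forward + ~4 backward)."""
    return 6 * n_params * n_tokens
```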
Step 4 – Plug the numbers in
FLOPs = 6 · (1.75 × 10¹¹) · (3 × 10¹¹)
= 6 · 5.25 × 10²²
≈ 3.15 × 10²³ FLOPs.
Step 5 – Add a safety factor for real-world overheads
The 6 · N · T figure counts only the mathematically necessary (“useful”) FLOPs.
In practice, activation checkpointing forces some recomputation, and large-scale
jobs sustain only 30–50 % of theoretical peak because of communication and
stragglers, so the hardware compute actually consumed is larger. A factor of 2
covers this nicely:
3.15 × 10²³ × 2 ≈ 6 × 10²³ FLOPs.
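The arithmetic in Steps 4 and 5 can be checked with a few lines of Python (a standalone sketch; the 2× overhead factor is the rough assumption from Step 5, not a measured value):

```python
N = 1.75e11               # parameters (Step 1)
T = 3e11                  # training tokens (Step 2)

base = 6 * N * T          # textbook 6 * N * T estimate
with_overhead = 2 * base  # assumed factor of 2 for recomputation / imperfect utilization

print(f"base estimate: {base:.2e} FLOPs")           # ~3.15e+23
print(f"with overhead: {with_overhead:.2e} FLOPs")  # ~6.30e+23
```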
Final estimate
Training GPT-3 required on the order of 3–6 × 10²³ floating-point operations.
About Kimi K2
Specifications
- Provider: Openrouter
- Released: 2025-07-11
- Size: XLARGE