Estimate Complexity
OpenAI o3's response to the Estimate Complexity challenge.
Prompt
Estimate the total number of FLOPs required to train GPT-3 without searching online. Explain your reasoning step by step.
Response
Below is a simple "back-of-the-envelope" estimate that uses only a few public facts about GPT-3 and very generic rules of thumb for counting operations.
Step 1. Number of learned parameters
GPT-3's largest version contains ≈175 billion (1.75 × 10¹¹) trainable parameters.
Step 2. How much data was seen during training
OpenAI reports that the model was trained on about 300 billion tokens (3 × 10¹¹).
Step 3. How many floating-point operations does one parameter cost per token?
A practical rule of thumb for Transformer language models is
Training FLOPs ≈ 6 × (# parameters) × (# training tokens)
Where does the factor 6 come from?
• Forward pass: every parameter is involved in one multiply-accumulate → 2 FLOPs
• Back-propagation: computing gradients with respect to both activations and weights costs roughly twice the forward pass → ≈4 FLOPs
• The optimizer update touches each parameter once per step, not once per token, so its per-token cost is negligible
Total ≈6 FLOPs per parameter per token.
Step 4. Plug in the numbers
Total FLOPs = 6 × (1.75 × 10¹¹) × (3 × 10¹¹)
= 6 × 5.25 × 10²²
≈ 3.15 × 10²³ FLOPs.
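As a quick check, the same arithmetic can be reproduced in a few lines of Python (a minimal sketch using only the parameter count, token count, and 6-FLOP rule of thumb quoted above):

```python
# Back-of-the-envelope training-cost estimate using the 6 * N * D rule of thumb.
params = 175e9              # N: GPT-3 parameter count
tokens = 300e9              # D: training tokens
flops_per_param_token = 6   # ~2 forward + ~4 backward, per the rule of thumb

total_flops = flops_per_param_token * params * tokens
print(f"Total training FLOPs ≈ {total_flops:.2e}")  # ≈ 3.15e+23
```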
Step 5. Express the answer in more familiar units
3.15 × 10²³ FLOPs ≈ 315 zettaFLOPs (1 zettaFLOP = 10²¹ FLOPs).
Step 6. Sanity check with hardware-time intuition
An NVIDIA A100 delivers ≈ 312 teraFLOP/s (0.312 PFLOP/s) in mixed precision.
Wall-clock time on a single A100:
3.15 × 10²³ / 3.12 × 10¹⁴ ≈ 1.0 × 10⁹ s ≈ 32 years.
Running on 1,000 A100s in parallel would take ≈12 days, which is the right ballpark for a large industrial training run.
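The hardware sanity check is just as easy to reproduce (a sketch that assumes the same ≈312 TFLOP/s mixed-precision peak per A100 and ignores real-world utilization below peak):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600   # ≈ 3.15e7 s
SECONDS_PER_DAY = 86400

total_flops = 3.15e23                # estimate from Step 4
a100_flops_per_s = 312e12            # A100 peak mixed-precision throughput

single_gpu_seconds = total_flops / a100_flops_per_s
print(f"Single A100:  {single_gpu_seconds / SECONDS_PER_YEAR:.0f} years")        # ≈ 32 years
print(f"1,000 A100s:  {single_gpu_seconds / 1000 / SECONDS_PER_DAY:.0f} days")   # ≈ 12 days
```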
Final estimate
Training GPT-3 (175 B parameters, 300 B tokens) requires on the order of 3 × 10²³ floating-point operations, i.e. roughly 300 zettaFLOPs.
About OpenAI o3
Specifications
- Provider: OpenAI
- Released: April 16, 2025
- Size: XLARGE
- Parameters: Not disclosed