Estimate Complexity
Horizon Alpha's response to the 'Estimate Complexity' challenge.
Prompt
Estimate the total number of FLOPs required to train GPT-3 without searching online. Explain your reasoning step by step.
Response
Here’s a back‑of‑the‑envelope estimate using standard scaling heuristics and typical training setups for GPT‑3–class models.
Assumptions
- Model size: GPT‑3 175B parameters (N ≈ 1.75×10^11).
- Training tokens: the commonly cited training regime for GPT‑3 is about 300 billion tokens (T ≈ 3×10^11).
- FLOPs per token for a decoder‑only Transformer: a widely used rule of thumb is that the forward pass costs about 2N FLOPs per token and the combined forward+backward pass about 6N FLOPs per token (this folds in attention and MLP costs, plus constants). Architectural details and implementation efficiency shift the effective constant somewhat, so ~6N per token is a practical mid‑range estimate.
Step‑by‑step
- FLOPs per token: FLOPs_per_token ≈ c × N. Take c ≈ 6 as a practical constant for the combined forward and backward pass.
- Total FLOPs: Total_FLOPs ≈ T × FLOPs_per_token ≈ T × (c × N).
- Plug in the numbers (a minimal arithmetic sketch follows below): with N = 1.75×10^11, T = 3×10^11, and c ≈ 6, Total_FLOPs ≈ 3×10^11 × 6 × 1.75×10^11 = (3 × 6 × 1.75) × 10^(11+11) ≈ 31.5 × 10^22 ≈ 3.15×10^23 FLOPs.
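As a minimal sketch of the arithmetic above (assuming only the C ≈ c × N × T approximation; the variable names are illustrative), the central estimate can be reproduced in a few lines of Python:

```python
# Back-of-the-envelope training compute: C ≈ c * N * T
# Values below are the stated assumptions, not measured quantities.
N = 1.75e11  # parameters (GPT-3 175B)
T = 3e11     # training tokens (~300B)
c = 6        # FLOPs per parameter per token, forward + backward

total_flops = c * N * T
print(f"Central estimate: {total_flops:.2e} FLOPs")  # ~3.15e+23
```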
Sanity checks and range
- If we vary c between about 3 and 10 (to capture implementation differences and attention/MLP constants), we get a low end of 3×10^11 × 3 × 1.75×10^11 ≈ 1.58×10^23 FLOPs and a high end of 3×10^11 × 10 × 1.75×10^11 ≈ 5.25×10^23 FLOPs (a short sweep over c is sketched after this list).
- Many public ballparks for GPT‑3‑scale training land in the 1e23–1e24 FLOPs range. Our central estimate of ~3×10^23 FLOPs sits comfortably in that band.
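The same sketch can be extended to the sensitivity check, sweeping the assumed constant c over the range discussed above:

```python
# Sensitivity of the estimate to the per-token FLOP constant c
N = 1.75e11  # parameters
T = 3e11     # training tokens

for c in (3, 6, 10):
    print(f"c = {c:2d}: {c * N * T:.2e} FLOPs")
# Expected output (approximately):
#   c =  3: ~1.58e+23 FLOPs (low end)
#   c =  6: ~3.15e+23 FLOPs (central)
#   c = 10: ~5.25e+23 FLOPs (high end)
```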
Conclusion
A reasonable estimate for the total compute required to train GPT‑3 (175B) on ~300B tokens is on the order of a few × 10^23 FLOPs, with a central estimate around 3×10^23 FLOPs.
About Horizon Alpha
Specifications
- Provider: OpenRouter
- Released: 2025-07-30
- Size: LARGE
- Context: 256,000 tokens