Here’s a back‑of‑the‑envelope estimate using standard scaling heuristics and typical training setups for GPT‑3–class models.
Assumptions
- Model size: GPT‑3 175B parameters (N ≈ 1.75×10^11).
- Training tokens: the commonly cited training regime for GPT‑3 is about 300 billion tokens (T ≈ 3×10^11).
- FLOPs per token for a decoder‑only Transformer: a widely used rule of thumb is that the forward pass costs about 2N FLOPs per token and the backward pass roughly twice that, giving ~6N FLOPs per token for the combined forward and backward pass (this folds in attention and MLP costs, plus constants).
Step‑by‑step
- FLOPs per token: FLOPs_per_token ≈ c × N. Take c ≈ 6 as a practical constant for forward+backward.
- Total FLOPs: Total_FLOPs ≈ T × FLOPs_per_token ≈ T × (c × N)
- Plug in numbers: N = 1.75×10^11, T = 3×10^11, c ≈ 6
  Total_FLOPs ≈ 3×10^11 × 6 × 1.75×10^11 = 3 × 6 × 1.75 × 10^(11+11) = 31.5 × 10^22 = 3.15×10^23 FLOPs (a quick code sketch of this arithmetic follows the list)
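As a concrete check on the arithmetic, here is a minimal Python sketch; the constants N, T, and c simply restate the assumptions above (175B parameters, ~300B tokens, c ≈ 6 as the forward+backward rule of thumb), not measured values.

```python
# Back-of-the-envelope training-compute estimate for a GPT-3-scale model.
# N, T, and c restate the assumptions above; c = 6 is the forward+backward rule of thumb.

N = 1.75e11  # model parameters (175B)
T = 3.0e11   # training tokens (~300B)
c = 6        # FLOPs per parameter per token, forward + backward combined

flops_per_token = c * N           # ~1.05e12 FLOPs per token
total_flops = T * flops_per_token

print(f"FLOPs per token:      {flops_per_token:.2e}")
print(f"Total training FLOPs: {total_flops:.2e}")  # ~3.15e23
```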
Sanity checks and range
- If we vary c between about 3 and 10 (to capture implementation differences and attention/MLP constants), we get:
  Low end: 3×10^11 × 3 × 1.75×10^11 ≈ 1.58×10^23 FLOPs
  High end: 3×10^11 × 10 × 1.75×10^11 = 5.25×10^23 FLOPs
  (see the sweep sketched after this list)
- Many public ballparks for GPT‑3 scale land in the 1e23–1e24 FLOPs range. Our central estimate of ~3×10^23 FLOPs sits comfortably in that band.
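The same sweep over c can be written in a few lines; the bounds 3 and 10 are just the illustrative values chosen above, not canonical constants.

```python
# Bracket the estimate by sweeping the per-token FLOPs multiplier c.
# The values 3, 6, and 10 are the illustrative constants used in this section.

N = 1.75e11  # model parameters
T = 3.0e11   # training tokens

for c in (3, 6, 10):
    total_flops = c * N * T
    print(f"c = {c:>2}: ~{total_flops:.2e} FLOPs")
# Expected: roughly 1.6e23, 3.2e23, and 5.3e23 FLOPs respectively.
```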
Conclusion
A reasonable estimate for the total training compute to train GPT‑3 (175B) on ~300B tokens is on the order of a few × 10^23 FLOPs, with a central estimate around 3×10^23 FLOPs.
