Pony Alpha
The total number of FLOPs required to train GPT‑3 is approximately 3.15 × 10²³ (or 315 zettaFLOPs). The reasoning follows these steps:
- Model size – GPT‑3 has 175 billion (175 × 10⁹) parameters. This is a well‑known figure from the original paper.
- Training data size – The model was trained on 300 billion (300 × 10⁹) tokens; the paper states that all models in the GPT‑3 family were trained for this same total.
- FLOPs per parameter per token – For a transformer model, the compute can be approximated by:
  - Forward pass: each parameter is used in a multiply‑add operation, contributing about 2 FLOPs per token.
  - Backward pass (gradient computation): roughly 4 FLOPs per parameter per token (two extra passes, for gradients with respect to the weights and the inputs).
  - Hence, training (forward + backward) requires about 6 FLOPs per parameter per token.
- Total FLOPs – Multiply the three factors (a quick numerical check in code follows this list):
  \[
  \text{FLOPs} = 6 \times (175 \times 10^{9}) \times (300 \times 10^{9}) = 6 \times 175 \times 300 \times 10^{18} = 315{,}000 \times 10^{18} = 3.15 \times 10^{23}.
  \]
- Consistency check – This estimate aligns with the total training compute of roughly 3.14 × 10²³ FLOPs reported in the GPT‑3 paper's compute analysis, confirming its plausibility.
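
As a sanity check on the arithmetic, here is a minimal Python sketch of the 6 · N · D rule used above; the constants are simply the parameter and token counts quoted in the list, and the variable names are illustrative only.

```python
# Back-of-the-envelope training compute for GPT-3 via the ~6 FLOPs per
# parameter per token rule (about 2 forward + 4 backward).
N_PARAMS = 175e9   # 175 billion parameters
N_TOKENS = 300e9   # 300 billion training tokens
FLOPS_PER_PARAM_PER_TOKEN = 6

total_flops = FLOPS_PER_PARAM_PER_TOKEN * N_PARAMS * N_TOKENS
print(f"Estimated training compute: {total_flops:.2e} FLOPs")  # -> 3.15e+23
```

Running it prints 3.15e+23, matching the hand calculation above.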
The calculation captures the dominant linear‑algebra operations; small additional costs (e.g., attention quadratic terms, layer norms) are negligible in comparison.
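
To make the "negligible" claim a bit more concrete, the sketch below roughly sizes the attention-specific (non-parameter) compute using the GPT‑3 architecture figures from the paper (96 layers, d_model = 12,288, 2,048‑token context). The per-token attention cost is an approximation, not an exact count: QKᵀ and the attention-weighted sum are each taken as about 2 · n_ctx · d_model FLOPs per token per layer on the forward pass, with training costing roughly 3× the forward pass; exact constants depend on the accounting convention.

```python
# Rough size of the quadratic attention term relative to the 6·N·D estimate.
# Architecture figures are from the GPT-3 paper; the FLOP accounting is an
# approximation.
N_PARAMS = 175e9
N_LAYERS = 96
D_MODEL = 12288
N_CTX = 2048

# Forward pass: QK^T and the attention-weighted sum each cost about
# 2 * n_ctx * d_model FLOPs per token per layer; training ~3x the forward pass.
attn_flops_per_token = 3 * (4 * N_CTX * D_MODEL) * N_LAYERS
dense_flops_per_token = 6 * N_PARAMS

print(f"Attention overhead vs. 6*N*D: {attn_flops_per_token / dense_flops_per_token:.1%}")
```

This comes out to roughly 3%, which is why the 6 · N · D term is treated as dominant.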



