Z.ai: GLM 5
The total number of FLOPs required to train GPT‑3 is approximately 3.15 × 10²³.
Step‑by‑step reasoning:
- Model size: GPT‑3 has 175 billion (1.75 × 10¹¹) trainable parameters.
- Training data: According to the paper abstract, the model was trained on 300 billion (3 × 10¹¹) tokens.
- FLOPs per token for training a transformer:
  - Forward pass: each parameter is used in one multiply‑add operation → ~2 FLOPs per parameter → ~2N FLOPs per token.
  - Backward pass (gradient computation): roughly twice the forward cost → ~4N FLOPs per token.
  - Total: ~6N FLOPs per token.

  This approximation is standard for large transformer models and neglects small overheads such as the quadratic attention term (which adds only ~1–2% at GPT‑3’s sequence length).
- Calculation (verified numerically in the sketch below):
  Total FLOPs ≈ 6 × (1.75 × 10¹¹) × (3 × 10¹¹) = 6 × 5.25 × 10²² = 3.15 × 10²³.
- Consistency check: The GPT‑3 paper reports a training compute of 3,640 PF‑days. Converting:
  1 PF‑day = 10¹⁵ FLOP/s × 86,400 s = 8.64 × 10¹⁹ FLOPs, so
  3,640 PF‑days = 3,640 × 8.64 × 10¹⁹ ≈ 3.14 × 10²³ FLOPs, matching the estimate.
Thus, the total FLOPs required to train GPT‑3 is about 3.15 × 10²³.
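A minimal Python sketch of the arithmetic, assuming only the 6 · N · D rule of thumb and the parameter, token, and PF‑day figures quoted above:

```python
# Back-of-the-envelope check of the GPT-3 training-compute estimate.
# Assumes the standard "6 * parameters * tokens" approximation; the
# inputs are the figures cited above from the GPT-3 paper.

params = 1.75e11      # trainable parameters (175 billion)
tokens = 3e11         # training tokens (300 billion)

estimate = 6 * params * tokens
print(f"6*N*D estimate:   {estimate:.3e} FLOPs")   # ~3.150e+23

# Cross-check against the paper's reported compute of 3,640 PF-days.
flops_per_pf_day = 1e15 * 86_400                   # 8.64e19 FLOPs per PF-day
reported = 3_640 * flops_per_pf_day
print(f"Reported compute: {reported:.3e} FLOPs")   # ~3.145e+23
```

The two printed values agree to within about half a percent, which is the consistency check described above.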