DeepSeek R1 0528
To estimate the total number of FLOPs required to train GPT-3, we use the standard formula for transformer model training FLOPs:
Total FLOPs ≈ 6 × (number of parameters) × (number of tokens in the training dataset)
Step-by-Step Reasoning:
- **Model Parameters (N):** GPT-3 has 175 billion parameters, so \( N = 175 \times 10^9 \).
- **Training Tokens (T):** GPT-3 was trained on approximately 300 billion tokens, so \( T = 300 \times 10^9 \).
- **FLOPs per Parameter-Token:**
  - Forward pass: each token requires ~2 FLOPs per parameter (each weight contributes one multiply and one add in the matrix multiplications, counted as 2 FLOPs).
  - Backward pass: gradient computation requires ~4 FLOPs per parameter (roughly twice the forward cost, since gradients must be computed with respect to both the activations and the weights).
  - Total per parameter-token: \( 2\,\text{(forward)} + 4\,\text{(backward)} = 6\ \text{FLOPs} \).
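The 2/4 split can be made concrete for a single linear layer; a minimal sketch (layer size and function names are illustrative, not from the original):

```python
# FLOP accounting for one linear layer y = W @ x, where W has shape (d_out, d_in).

def forward_flops(d_out, d_in):
    # Each of the d_out * d_in weights participates in one multiply-add = 2 FLOPs.
    return 2 * d_out * d_in

def backward_flops(d_out, d_in):
    # Backward needs two matmuls of the same size as the forward one:
    # grad_W = grad_y @ x.T  and  grad_x = W.T @ grad_y,
    # so the cost is roughly twice the forward pass.
    return 2 * forward_flops(d_out, d_in)

d_out, d_in = 4096, 4096          # a single hypothetical layer
n_params = d_out * d_in
total = forward_flops(d_out, d_in) + backward_flops(d_out, d_in)
assert total == 6 * n_params      # 6 FLOPs per parameter, per token
```

Summing this 6-FLOPs-per-parameter cost over every parameter and every token gives the \(6NT\) heuristic used above.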
- **Calculation:**
  \[ \text{Total FLOPs} = 6 \times (175 \times 10^9) \times (300 \times 10^9) \]
  - First, multiply the parameters and tokens: \( 175 \times 10^9 \times 300 \times 10^9 = 52{,}500 \times 10^{18} = 5.25 \times 10^{22} \)
  - Then multiply by 6: \( 6 \times 5.25 \times 10^{22} = 31.5 \times 10^{22} = 3.15 \times 10^{23} \)
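The arithmetic above can be checked directly in a few lines of Python:

```python
# GPT-3 training-FLOPs estimate via the 6*N*T heuristic.
N = 175e9   # parameters
T = 300e9   # training tokens

total_flops = 6 * N * T
print(f"{total_flops:.3e}")  # → 3.150e+23
```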
Final Estimate:
\[ \boxed{3.15 \times 10^{23}\ \text{FLOPs}} \]
Notes:
- This formula (\( 6 \times N \times T \)) is a widely accepted heuristic for dense transformer models.
- The estimate assumes one full training epoch (no repeated tokens).
- Additional overhead (e.g., attention over the context, optimizer updates, embedding lookups) is small compared to the dominant matrix multiplications, so it is omitted.
- GPT-3 specifics: Parameters = 175B, Tokens = 300B.
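To get a feel for the scale, the estimate can be converted into device-time. The sustained throughput figure below is an illustrative assumption (100 TFLOP/s per accelerator), not a number from the original:

```python
# Convert the FLOP estimate into single-device compute time,
# assuming a sustained throughput of 100 TFLOP/s (illustrative).
total_flops = 3.15e23
sustained_flops_per_sec = 1e14

seconds = total_flops / sustained_flops_per_sec
gpu_years = seconds / (365 * 24 * 3600)
print(round(gpu_years))  # ≈ 100 single-device years
```

In other words, at that assumed rate a single accelerator would need on the order of a century, which is why training runs of this size are spread across thousands of devices.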
