Qwen3 235B A22B
Step-by-Step Estimation of FLOPs Required to Train GPT-3
1. Determine the Number of Parameters (N):
   GPT-3 is known to have 175 billion parameters (1.75 × 10¹¹). This is a key factor in computing the total FLOPs, as the number of operations scales with the number of parameters.
2. Estimate the Number of Training Tokens (T):
   Large language models like GPT-3 are trained on massive datasets. GPT-3 was trained on approximately 300 billion tokens (3 × 10¹¹), the figure reported by OpenAI for its training run.
3. Understand the FLOPs per Token and Parameter:
   Training involves a forward and a backward pass through the network, and each parameter participates in both. A widely accepted approximation in the machine learning literature is that training requires about 6 FLOPs per parameter per token:
   - Forward pass: 2 FLOPs per parameter (one multiply and one add).
   - Backward pass: 4 FLOPs per parameter (gradients with respect to both the weights and the activations).
   Combined, this gives 6 FLOPs per parameter per token.
4. Calculate Total FLOPs:
   Multiply the number of parameters (N), the number of tokens (T), and the FLOPs per parameter per token (6):
   $$ \text{Total FLOPs} = 6 \times N \times T = 6 \times (1.75 \times 10^{11}) \times (3 \times 10^{11}) $$
   Performing the calculation:
   $$ 6 \times 1.75 \times 3 = 31.5 \quad \Rightarrow \quad 31.5 \times 10^{22} = 3.15 \times 10^{23} \, \text{FLOPs} $$
5. Final Estimate:
   Rounding to one significant figure, the total number of FLOPs required to train GPT-3 is approximately 3 × 10²³ FLOPs.
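The arithmetic above can be checked with a few lines of Python:

```python
# Back-of-the-envelope check of the 6 * N * T training-compute estimate.
N = 1.75e11                 # GPT-3 parameters
T = 3e11                    # training tokens
FLOPS_PER_PARAM_TOKEN = 6   # ~2 FLOPs forward + ~4 FLOPs backward

total_flops = FLOPS_PER_PARAM_TOKEN * N * T
print(f"{total_flops:.2e} FLOPs")  # 3.15e+23 FLOPs
```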
Summary of Key Assumptions and Reasoning:
- Parameters (N): 175 billion (1.75 × 10¹¹) based on public knowledge.
- Tokens (T): 300 billion (3 × 10¹¹) as a reasonable estimate for the scale of training data.
- FLOPs per Parameter per Token: 6, derived from the combination of forward and backward pass operations.
- Formula: Total FLOPs ≈ 6 × N × T.
This approach aligns with standard practices in estimating compute requirements for training large language models, such as those outlined in the Chinchilla scaling laws and similar research.
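To put the Chinchilla comparison in perspective: a commonly cited rule of thumb from that work is roughly 20 training tokens per parameter for compute-optimal training. The exact coefficients come from fitted scaling laws, so the sketch below is only an illustrative approximation:

```python
# Hedged sketch: Chinchilla's rough rule of thumb is ~20 tokens per parameter
# for compute-optimal training (the precise ratio comes from fitted scaling laws).
N = 1.75e11                  # GPT-3 parameters
actual_tokens = 3e11         # tokens GPT-3 was actually trained on

compute_optimal_tokens = 20 * N
print(f"tokens per parameter actually used: {actual_tokens / N:.1f}")           # 1.7
print(f"compute-optimal tokens (rule of thumb): {compute_optimal_tokens:.1e}")  # 3.5e+12
```

By this rule of thumb, GPT-3 was trained on far fewer tokens per parameter than later compute-optimal models, which is consistent with the historical shift toward smaller models trained on more data.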
Final Answer:
The total number of FLOPs required to train GPT-3 is approximately 3 × 10²³ FLOPs.







