To estimate the total number of FLOPs (floating-point operations) required to train GPT-3, we combine the model's known architectural hyperparameters with the standard compute approximation established in the deep learning scaling-law literature (specifically "Scaling Laws for Neural Language Models" by Kaplan et al., which preceded and informed the GPT-3 work).
Here is the step-by-step reasoning and calculation.
Step 1: Identify Key Model Hyperparameters
From the "Language Models are Few-Shot Learners" paper (Brown et al., 2020), we can recall the specific configuration of the largest GPT-3 model:
- Number of Parameters ($N$): The GPT-3 model has 175 billion parameters. $$N = 175 \times 10^9$$
- Number of Training Tokens ($D$): The model was trained on a dataset of approximately 300 billion tokens. $$D = 300 \times 10^9$$
Note: While the context length (2048 tokens) affects memory bandwidth and sequence processing, the total training FLOPs are primarily driven by the number of parameters and the total number of tokens processed.
Step 2: Determine the FLOPs Per Token Formula
To estimate the computational cost, we use the standard approximation for Transformer training derived in scaling law research.
- Forward Pass: For a standard dense layer in a neural network, computing the output requires one multiplication and one addition per parameter per input token. This equals approximately 2 FLOPs per parameter per token. $$\text{Forward FLOPs} \approx 2 \times N$$
- Backward Pass: To train the model, we must compute gradients via backpropagation. The computational cost of the backward pass is roughly twice that of the forward pass. $$\text{Backward FLOPs} \approx 2 \times (2 \times N) = 4 \times N$$
- Total FLOPs Per Token: Adding the forward and backward costs gives us the total computation required to process a single token across the entire model. $$\text{Total FLOPs per Token} \approx 2N + 4N = 6N$$
Therefore, the total training compute ($C$) is estimated as: $$C \approx 6 \times N \times D$$
Note: This estimate assumes standard optimization (like Adam) overhead is absorbed into this factor or is negligible compared to the model computation itself, which aligns with the "6ND" rule of thumb widely cited in AI infrastructure analysis.
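The 6ND rule above can be sketched as a small helper. This is an illustrative sketch, not code from either paper; the function name and structure are my own.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Estimate total training compute via the 6ND rule of thumb:
    ~2 FLOPs/param/token (forward) + ~4 FLOPs/param/token (backward)."""
    return 6 * n_params * n_tokens

# GPT-3 values from Step 1
N = 175e9   # parameters
D = 300e9   # training tokens
print(f"{training_flops(N, D):.3e} FLOPs")  # → 3.150e+23 FLOPs
```

Note that this deliberately ignores per-layer details: under the 6ND approximation, only the parameter count and token count matter.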
Step 3: Perform the Calculation
Now, we substitute the values from Step 1 into the formula from Step 2.
$$C = 6 \times (175 \times 10^9) \times (300 \times 10^9)$$
First, multiply the coefficients: $$6 \times 175 \times 300$$ $$6 \times 175 = 1,050$$ $$1,050 \times 300 = 315,000$$
Next, handle the powers of 10: $$10^9 \times 10^9 = 10^{18}$$
Combine them: $$C = 315,000 \times 10^{18}$$
Convert to scientific notation: $$C = 3.15 \times 10^5 \times 10^{18}$$ $$C = 3.15 \times 10^{23} \text{ FLOPs}$$
Step 4: Verification and Nuance
In the original GPT-3 paper, the authors explicitly state in their Appendix that the training compute was approximately $3.14 \times 10^{23}$ FLOPs.
My estimate of $3.15 \times 10^{23}$ FLOPs is extremely close to the reported figure. The small difference (~0.3%) arises from variations in how specific operations (such as embedding lookups or optimizer state updates) are counted, but the $6ND$ heuristic is highly accurate for this purpose.
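As a quick sanity check on that ~0.3% figure, we can redo the arithmetic with exact integers and compare against the reported value:

```python
# 6ND estimate with exact integer arithmetic (values from Steps 1-3)
estimate = 6 * (175 * 10**9) * (300 * 10**9)   # 315,000 x 10^18 = 3.15e23
reported = 3.14e23                             # figure from the GPT-3 appendix

rel_error = abs(estimate - reported) / reported
print(f"estimate = {estimate:.2e}, relative difference = {rel_error:.2%}")
# → estimate = 3.15e+23, relative difference = 0.32%
```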
Final Estimate
The total number of FLOPs required to train GPT-3 is approximately:
$3.15 \times 10^{23}$ FLOPs