Google: Gemma 4 31B
To estimate the total number of FLOPs (Floating Point Operations) required to train GPT-3, we can use a well-established heuristic for the cost of a forward and backward pass through a Transformer.
Here is the step-by-step reasoning.
1. The Fundamental Heuristic
For a Transformer model, the cost of a single forward pass is approximately 2 FLOPs per parameter per token.
- Reasoning: Multiplying a vector (size $h$) by a weight matrix (size $h \times h$) requires $h^2$ multiplications and $h^2$ additions, totaling $2h^2$ operations. Since that layer has $h^2$ parameters, this works out to 2 FLOPs per parameter per token.
During training, we perform a forward pass and a backward pass. The backward pass is generally estimated to be twice as computationally expensive as the forward pass (because it involves calculating gradients for both the weights and the activations).
Therefore, the total cost per token is: $$\text{Cost per token} \approx 6 \times \text{Number of Parameters}$$
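This heuristic can be sketched directly in code. The function name and the example parameter count below are illustrative, not from the original text:

```python
def train_flops_per_token(n_params: int) -> int:
    """Approximate training cost per token for a Transformer:
    2 FLOPs/param (forward) + 4 FLOPs/param (backward) = 6 FLOPs/param.
    """
    return 6 * n_params

# For a hypothetical 1-billion-parameter model:
print(train_flops_per_token(1_000_000_000))  # 6000000000, i.e. 6e9 FLOPs/token
```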
2. Defining the Variables for GPT-3
To get the total FLOPs, we need two primary numbers:
- Model Size ($N$): GPT-3 (the largest version) has 175 billion parameters.
- Dataset Size ($D$): GPT-3 was trained on roughly 300 billion tokens.
3. The Calculation
Now, we plug these values into the formula: $$\text{Total FLOPs} = 6 \times N \times D$$
Step A: Multiply Parameters by 6 $$6 \times 175 \times 10^9 = 1.05 \times 10^{12} \text{ FLOPs per token}$$
Step B: Multiply by the number of tokens $$\text{Total FLOPs} = (1.05 \times 10^{12}) \times (300 \times 10^9)$$ $$\text{Total FLOPs} = 315 \times 10^{21}$$
4. Final Result
Converting this into scientific notation: $$\text{Total FLOPs} \approx 3.15 \times 10^{23}$$
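The full estimate is a one-line computation; a minimal sketch reproducing the numbers above:

```python
# Total training FLOPs for GPT-3 via the 6*N*D heuristic.
N = 175e9   # parameters (largest GPT-3 variant)
D = 300e9   # training tokens
total_flops = 6 * N * D
print(f"{total_flops:.2e}")  # 3.15e+23
```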
Summary of Reasoning:
- 6 FLOPs/param/token (2 for forward, 4 for backward).
- 175 Billion parameters $\times$ 300 Billion tokens.
- $6 \times (1.75 \times 10^{11}) \times (3 \times 10^{11}) = 3.15 \times 10^{23}$.