To estimate the total number of Floating Point Operations (FLOPs) required to train GPT-3, we can use a standard heuristic derived from the scaling laws of transformer models.
Here is the step-by-step derivation.
1. Identify the Known Parameters
Without searching online, I can rely on the published architectural specifications of GPT-3:
- Parameters ($N$): Approximately $175 \times 10^9$ (175 billion).
- Training Tokens ($D$): GPT-3 was trained on roughly $300 \times 10^9$ (300 billion) tokens.
2. The Mathematical Heuristic for Transformer Training
In a standard transformer model, the computational cost is dominated by two main processes:
- The Forward Pass: Calculating activations.
- The Backward Pass: Calculating gradients.
A well-established rule of thumb in the deep learning literature (popularized by OpenAI's scaling-laws work and later used in the Chinchilla paper) is that the total training cost is approximately: $$\text{Total FLOPs} \approx 6 \times N \times D$$
Where does the $6$ come from?
- Forward Pass: For every parameter, we perform roughly $2$ operations (one multiplication and one addition) per token. Thus, the forward pass is $\approx 2ND$.
- Backward Pass: The backward pass is roughly twice as expensive as the forward pass, because it computes two sets of gradients: gradients with respect to the activations (to propagate the error backward through the layers) and gradients with respect to the weights (to update the model). Thus, the backward pass is $\approx 4ND$.
- Total: $2ND \text{ (forward)} + 4ND \text{ (backward)} = 6ND$.
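The forward/backward accounting above can be sketched in a few lines of Python (the constants are the GPT-3 figures used throughout this derivation):

```python
# 6*N*D training-FLOPs heuristic: 2ND for the forward pass,
# 4ND for the backward pass (~2x the forward cost).

N = 175e9  # parameters (GPT-3)
D = 300e9  # training tokens

forward_flops = 2 * N * D   # one multiply + one add per parameter per token
backward_flops = 4 * N * D  # roughly twice the forward pass
total_flops = forward_flops + backward_flops  # = 6 * N * D

print(f"forward:  {forward_flops:.3e} FLOPs")
print(f"backward: {backward_flops:.3e} FLOPs")
print(f"total:    {total_flops:.3e} FLOPs")
```

This is a back-of-envelope sketch, not an exact accounting: it ignores attention FLOPs that don't scale with parameter count, which is why the heuristic is usually quoted only to one or two significant figures.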
3. The Calculation
Now, we plug in the values:
- $N = 1.75 \times 10^{11}$
- $D = 3 \times 10^{11}$
$$\text{Total FLOPs} \approx 6 \times (1.75 \times 10^{11}) \times (3 \times 10^{11})$$
Step-by-step arithmetic:
- Multiply the coefficients: $6 \times 1.75 \times 3$
- $6 \times 1.75 = 10.5$
- $10.5 \times 3 = 31.5$
- Multiply the powers of ten: $10^{11} \times 10^{11} = 10^{22}$
Result: $$\text{Total FLOPs} \approx 31.5 \times 10^{22}$$ Or, in standard scientific notation: $$\mathbf{3.15 \times 10^{23} \text{ FLOPs}}$$
4. Contextualizing the Result
To put $3.15 \times 10^{23}$ FLOPs into perspective:
- If you used a single NVIDIA A100 GPU (roughly $312 \times 10^{12}$ FLOP/s at peak BF16/FP16 performance) and assumed 100% utilization, it would take approximately: $$\frac{3.15 \times 10^{23}}{3.12 \times 10^{14}} \approx 1.0 \times 10^{9} \text{ seconds}$$
- $10^9$ seconds is roughly 31.7 years of computation for a single GPU.
- Since GPT-3 was trained in a matter of weeks/months, this confirms that thousands of GPUs were working in parallel.
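As a quick check, the single-GPU wall-clock estimate works out as follows. The 1,000-GPU figure is an idealized assumption of perfect linear scaling; real clusters achieve well under peak throughput:

```python
# Wall-clock back-of-envelope for 3.15e23 FLOPs on A100-class hardware.

total_flops = 3.15e23
a100_peak_flops = 312e12  # A100 peak dense BF16/FP16, FLOP/s

seconds_one_gpu = total_flops / a100_peak_flops
years_one_gpu = seconds_one_gpu / (365 * 24 * 3600)
print(f"one A100: {seconds_one_gpu:.2e} s, about {years_one_gpu:.1f} years")

# Spread across 1,000 GPUs with perfect (idealized) scaling:
days_1k_gpus = years_one_gpu * 365 / 1000
print(f"1,000 A100s: about {days_1k_gpus:.1f} days")
```

Even under this idealized scaling, a thousand-GPU cluster needs on the order of weeks, which matches the reported training timescale for GPT-3.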
Final Estimate
The estimated training compute for GPT-3 is approximately $3.15 \times 10^{23}$ FLOPs, i.e. on the order of $3 \times 10^{23}$.