Gemini 3.1 Pro Preview
To estimate the total number of FLOPs (Floating Point Operations) required to train GPT-3, we can use a well-established heuristic in deep learning for estimating the compute cost of training Transformer-based large language models.
The estimated total compute is roughly $3.15 \times 10^{23}$ FLOPs.
Here is the step-by-step reasoning used to arrive at this number:
Step 1: Identify the key specifications of GPT-3
To calculate the compute required, we need to know two fundamental numbers about the model:
- $N$ (Number of Parameters): GPT-3 has 175 billion parameters ($1.75 \times 10^{11}$).
- $D$ (Number of Training Tokens): The GPT-3 paper states the model was trained on 300 billion tokens ($3.0 \times 10^{11}$).
Step 2: Establish the FLOPs formula
In language model training, the total compute ($C$) is dominated by the matrix multiplications in the dense layers. OpenAI's scaling-laws paper (Kaplan et al., 2020) gives the standard approximation for this: $C \approx 6 \times N \times D$
Why the multiplier of 6?
- Forward Pass: For every token processed, each parameter takes part in roughly 1 Multiply-Accumulate (MAC) operation. A MAC consists of 2 FLOPs (one multiplication and one addition). Therefore, the forward pass takes roughly $2ND$ FLOPs.
- Backward Pass: Calculating the gradients during backpropagation requires about twice as much compute as the forward pass (you have to calculate the gradient with respect to the weights and the gradient with respect to the activations). Therefore, the backward pass takes roughly $4ND$ FLOPs.
- Adding them together gives $2ND + 4ND =$ $6ND$ FLOPs.
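The forward/backward split above can be sketched in a few lines of Python. The function name `training_flops` and its return keys are illustrative choices, not taken from any library; this is a minimal sketch under the same assumptions (2 FLOPs per parameter per token in the forward pass, twice that in the backward pass).

```python
def training_flops(n_params: float, n_tokens: float) -> dict:
    """Approximate dense-training compute using the 6*N*D heuristic."""
    forward = 2 * n_params * n_tokens   # 1 MAC = 2 FLOPs per parameter per token
    backward = 2 * forward              # gradients w.r.t. weights and activations
    return {"forward": forward, "backward": backward, "total": forward + backward}


# Cost for a single parameter and a single token: the 2 + 4 = 6 multiplier.
print(training_flops(n_params=1, n_tokens=1))
```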
(Note: This formula slightly underestimates the true cost because it ignores the attention term that grows with context length, along with smaller contributions such as optimizer state updates. At the 175B scale the dense weight matrices dominate the compute, so $6ND$ is the standard approximation.)
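To get a rough feel for how small the omitted attention term is, the sketch below compares it with the $2N$ parameter term in the forward pass. It assumes the per-token attention cost $2 \cdot n_{layer} \cdot n_{ctx} \cdot d_{model}$ from OpenAI's scaling-laws paper and the GPT-3 175B configuration reported in the GPT-3 paper (96 layers, $d_{model} = 12288$, context length 2048). Those architecture numbers do not appear in the text above, so treat them as assumptions of this example.

```python
# Assumed GPT-3 175B configuration (from the GPT-3 paper, not from the text above).
N_LAYERS = 96
D_MODEL = 12288
N_CTX = 2048
N_PARAMS = 175e9

# Forward-pass FLOPs per token.
param_flops = 2 * N_PARAMS                        # the 2N term behind 6ND
attention_flops = 2 * N_LAYERS * N_CTX * D_MODEL  # context-dependent attention term

attention_share = attention_flops / param_flops
print(f"attention adds about {attention_share:.1%} on top of the 2N term")
```

Under these assumptions the attention term is on the order of one percent of the parameter term, which is why dropping it barely changes the estimate.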
Step 3: Perform the calculation
Now we simply plug the numbers into our formula:
- $N = 175 \times 10^9$
- $D = 300 \times 10^9$
First, multiply $N$ and $D$:
- $N \times D = (175 \times 10^9) \times (300 \times 10^9)$
- $N \times D = 52{,}500 \times 10^{18}$
- $N \times D = 5.25 \times 10^{22}$
Next, multiply by 6 (for the forward and backward passes):
- Total FLOPs = $6 \times 5.25 \times 10^{22}$
- Total FLOPs = $31.5 \times 10^{22}$ or $3.15 \times 10^{23}$
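The arithmetic above is easy to verify with exact integers. The snippet also converts the result to petaflop/s-days (one petaflop/s-day is $10^{15}$ FLOPs per second sustained for 86,400 seconds), a unit often used when reporting training compute. That conversion is an addition for illustration, not part of the steps above.

```python
N = 175 * 10**9   # parameters
D = 300 * 10**9   # training tokens

nd = N * D              # exact integer product
total_flops = 6 * nd    # forward + backward passes

PFS_DAY = 10**15 * 86_400  # FLOPs in one petaflop/s-day

print(f"N x D     = {nd:.3e}")
print(f"6 x N x D = {total_flops:.3e}")
print(f"          = {total_flops / PFS_DAY:,.0f} petaflop/s-days")
```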
Step 4: Sanity Check
Does $3.15 \times 10^{23}$ FLOPs make sense in the real world? When GPT-3 was trained (around late 2019/early 2020), Nvidia V100 GPUs were the standard.
- A V100 GPU has a peak theoretical output of roughly 120 TeraFLOPs ($1.2 \times 10^{14}$ FLOPs) per second in mixed precision.
- Real-world hardware utilization (MFU) for training large models is usually around 30% to 40%, yielding roughly $4 \times 10^{13}$ effective FLOPs per second per GPU.
- If OpenAI used a cluster of 10,000 V100 GPUs, the cluster would output $4 \times 10^{17}$ FLOPs per second.
- Divide the total FLOPs by the cluster output: $(3.15 \times 10^{23}) / (4 \times 10^{17}) \approx 787,500$ seconds.
- 787,500 seconds is about 9.1 days ($787{,}500 / 86{,}400 \approx 9.1$).
Real training runs of this size typically take a few weeks on a few thousand GPUs once downtime, checkpointing, and smaller clusters are factored in, so this idealized estimate is plausible: it lands within the same order of magnitude as what one would expect in practice.
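The sanity check can be reproduced, and varied, with a small helper. `training_days` is a hypothetical function written for this example. The 10,000-GPU cluster and the $4 \times 10^{13}$ effective FLOPs per second per GPU are the assumptions used above, and the 3,000-GPU case is an extra illustrative scenario rather than a known configuration.

```python
SECONDS_PER_DAY = 86_400


def training_days(total_flops: float, n_gpus: int, effective_flops_per_gpu: float) -> float:
    """Idealized wall-clock days: no downtime, restarts, or checkpoint overhead."""
    cluster_flops_per_sec = n_gpus * effective_flops_per_gpu
    return total_flops / cluster_flops_per_sec / SECONDS_PER_DAY


TOTAL_FLOPS = 3.15e23
EFFECTIVE_PER_GPU = 4e13  # ~120 TFLOP/s peak at roughly one-third utilization

for n_gpus in (10_000, 3_000):
    days = training_days(TOTAL_FLOPS, n_gpus, EFFECTIVE_PER_GPU)
    print(f"{n_gpus:>6,} GPUs: {days:.1f} days")
```

With these inputs the 10,000-GPU case comes out at about 9.1 days, while the 3,000-GPU case takes roughly a month, which is in line with the "few weeks on a few thousand GPUs" intuition.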
Conclusion: The total compute required to train GPT-3 was approximately $3.15 \times 10^{23}$ FLOPs.







