A reasonable back-of-the-envelope estimate for training GPT-3 is about 3 × 10²³ FLOPs total, with maybe a factor-of-2 uncertainty depending on what you count and the exact training setup.
Here’s the reasoning step by step.
1) Use the model size and token count
GPT-3’s largest widely cited version has about:
- 175 billion parameters = 1.75 × 10¹¹ parameters
- Trained on roughly 300 billion tokens = 3 × 10¹¹ tokens
2) Estimate FLOPs per token for training
For a transformer, a rough rule of thumb is:
- Forward pass + backward pass together cost about 6 FLOPs per parameter per token, i.e. roughly 6P FLOPs per token
Why 6?
- Forward pass: about 2 FLOPs per parameter per token (one multiply and one add per weight)
- Backward pass: roughly 2× the forward cost, or about 4 FLOPs per parameter per token
- So total training cost per token is commonly approximated as about 6P
Thus:
\[ \text{FLOPs per token} \approx 6 \times 1.75 \times 10^{11} = 1.05 \times 10^{12} \]
So each token costs about 1 trillion FLOPs.
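The per-token arithmetic above can be sketched in a few lines of Python; the parameter count is the round 175B figure cited earlier, and the 2P/4P split is the common rule of thumb, not a measured value:

```python
# Per-token training FLOPs for a dense transformer, using the "6P" rule of thumb.
P = 1.75e11  # GPT-3 parameter count (175 billion)

forward = 2 * P   # ~2 FLOPs per parameter per token (one multiply, one add)
backward = 4 * P  # backward pass is roughly 2x the forward cost
per_token = forward + backward  # = 6P

print(f"{per_token:.3g} FLOPs per token")  # ~1.05e12
```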
3) Multiply by the number of training tokens
\[ (1.05 \times 10^{12}) \times (3 \times 10^{11}) = 3.15 \times 10^{23} \]
So the total training compute is about:
\[ \boxed{3 \times 10^{23}\ \text{FLOPs}} \]
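Putting steps 1–3 together, the whole estimate is one multiplication; the inputs are the round numbers assumed above:

```python
# Total training compute = 6 * parameters * tokens (the "6P per token" rule).
P = 1.75e11    # parameters (175B)
tokens = 3e11  # training tokens (300B)

total = 6 * P * tokens
print(f"{total:.3g} FLOPs")  # ~3.15e23
```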
4) Sanity check
This scale is plausible: GPT-3 was famously expensive to train, and the GPT-3 paper itself reports about 3.14 × 10²³ FLOPs (roughly 3,640 petaflop/s-days), so the estimate lands almost exactly on the published figure.
5) Caveats
This is an estimate, not an exact number, because it depends on:
- whether you count only dense matrix ops or also overhead
- optimizer cost
- attention cost vs. MLP cost
- sequence length and batching details
- whether the “6P per token” rule is used or a slightly different constant
A more conservative range would be:
\[ \boxed{(2 \text{ to } 4) \times 10^{23}\ \text{FLOPs}} \]
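The width of that range comes almost entirely from the choice of constant in the "cP per token" rule. Sweeping a few plausible constants (4, 6, 8 are illustrative assumptions, not measured values) reproduces the stated bounds:

```python
# Sensitivity of the total to the FLOPs-per-parameter-per-token constant.
P, tokens = 1.75e11, 3e11  # parameters, training tokens

for c in (4, 6, 8):  # plausible constants around the usual 6
    total = c * P * tokens
    print(f"c={c}: {total:.2g} FLOPs")
# c=4 gives ~2.1e23, c=8 gives ~4.2e23: the (2 to 4) x 10^23 range above.
```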
Final answer
Estimated total training cost for GPT-3: about \(3 \times 10^{23}\) FLOPs.