4 models have responded to the "Estimate Complexity" challenge. Compare their approaches side-by-side on RIVAL. This response is part of RIVAL's open dataset of 5,600+ AI model responses.
Gpt 5.1's response to the 'Estimate Complexity' challenge.
Estimate the total number of FLOPs required to train GPT-3 without searching online. Explain your reasoning step by step.
The total training compute for GPT‑3 (175B) is on the order of (10^{23}) floating‑point operations. A reasonable point estimate is (\sim 3\times 10^{23}) FLOPs.
Here’s the step‑by‑step reasoning from basic principles.
From the GPT‑3 paper (architecture details are well known):
We’ll assume:
For a dense neural network, a single weight typically participates in one multiply-add per forward pass per “use”:
For large Transformers, most compute is in matrix multiplications in the attention and MLP blocks; these are roughly proportional to the total number of parameters, so a common back-of-the-envelope is:
Backward pass is more expensive because you’re computing gradients:
On top of that you have:
These are smaller but non-negligible; a common heuristic is to lump them into a constant factor, giving:
[ \text{FLOPs per token} \approx 6N ]
This “6” is a heuristic constant:
Some authors use factors between ~3 and ~6; using 6 is a conservative, widely used estimate in scaling law work.
Total training FLOPs:
[ \text{FLOPs}_{\text{total}} \approx (\text{FLOPs per token}) \times (\text{number of tokens}) ]
Using the heuristic:
[ \text{FLOPs}_{\text{total}} \approx 6N T ]
Plug in GPT‑3’s values:
So:
[ \begin{aligned} \text{FLOPs}_{\text{total}} &\approx 6 \times (1.75\times 10^{11}) \times (3\times 10^{11}) \ &= 6 \times 1.75 \times 3 \times 10^{22} \ &= 6 \times 5.25 \times 10^{22} \ &= 31.5 \times 10^{22} \ &= 3.15 \times 10^{23} \text{ FLOPs} \end{aligned} ]
So a good estimate is:
[ \boxed{\text{GPT‑3 (175B) training compute} ;\approx; 3\times 10^{23}\ \text{FLOPs}} ]
Even if our constant factor is off (say it’s 3 instead of 6, or 8 instead of 6), we’re still in the same ballpark:
So the order of magnitude (10^{23}) FLOPs is robust, and (\sim 3\times 10^{23}) FLOPs is a reasonable central estimate.
Turn this model response into notes, narration, or a short video
Partner link