GPT-5.2 Pro's response to the 'Estimate Complexity' challenge.
Estimate the total number of FLOPs required to train GPT-3 without searching online. Explain your reasoning step by step.
Below is a back-of-the-envelope FLOP estimate for training GPT‑3 (the 175B dense Transformer), using only basic Transformer compute facts and a few widely known high-level quantities.
GPT‑3 “175B” means:

- Parameter count \(P \approx 175 \times 10^9 = 1.75 \times 10^{11}\).

A commonly cited order of magnitude for GPT‑3 training is a few hundred billion tokens. I’ll assume:

- Training tokens \(T \approx 300 \times 10^9\) (300B).
(If you assumed 200B or 500B instead, the final answer scales linearly.)
A Transformer layer is mostly matrix multiplies (linear layers in the attention projections + MLP). For a dense model, the key scaling is:

- Forward pass: roughly \(2P\) FLOPs per token (each parameter enters about one multiply and one add).

Training requires backprop. For each linear layer, you effectively do ~3 matrix multiplies of comparable size:

- the forward matmul,
- the backward matmul for gradients w.r.t. the activations,
- the backward matmul for gradients w.r.t. the weights.

So training compute is roughly 3× the forward compute:
\[
\text{train FLOPs per token} \approx 3 \times (2P) = 6P
\]

This ignores smaller terms (layernorm, softmax, activations) and also ignores attention’s extra \(L^2\) work (addressed briefly below).
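As a sanity check on the \(6P\) figure, here is a minimal Python sketch of the per-token accounting, using the round 175B parameter count assumed above:

```python
# Per-token training FLOPs for a dense Transformer, using the 2P-forward /
# 6P-train rule of thumb (parameter count is the assumption from above).
P = 175e9  # parameters (assumption: GPT-3 "175B")

forward_per_token = 2 * P          # ~1 multiply + 1 add per parameter
backward_per_token = 2 * (2 * P)   # grads w.r.t. activations and w.r.t. weights
train_per_token = forward_per_token + backward_per_token

assert train_per_token == 6 * P    # ~6P FLOPs per training token
print(f"{train_per_token:.2e} FLOPs per token")  # -> ~1.05e+12
```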
Total FLOPs:
\[
\text{FLOPs} \approx (6P)\,T
\]
Plug in numbers:
\[
\text{FLOPs} \approx 6 \cdot (175 \times 10^9) \cdot (300 \times 10^9)
\]
First multiply the plain numbers: \(6 \times 175 \times 300 = 315{,}000\).

Track powers of ten: \(10^9 \times 10^9 = 10^{18}\).

So:
\[
\text{FLOPs} \approx 315{,}000 \times 10^{18} = 3.15 \times 10^{23}
\]
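The same arithmetic in code, purely as a check of the numbers above (P and T are the assumed values from earlier):

```python
# Total training FLOPs under the 6*P*T approximation.
P = 175e9   # assumed parameter count
T = 300e9   # assumed training-token count

total_flops = 6 * P * T
print(f"{total_flops:.2e}")  # -> 3.15e+23
```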
Estimated total:
\[
\boxed{\approx 3 \times 10^{23}\ \text{FLOPs}}
\]
Self-attention has extra compute from forming attention scores and applying them to values, scaling like \(O(L^2 d)\). For GPT‑3-sized models (large \(d\)) with a context length around a couple thousand, the dominant cost is still the dense matmuls tied to parameters (\(O(L d^2)\)), and the \(L^2\) attention part is typically only a few percent of total FLOPs. So \(\sim 6PT\) is usually a decent first estimate.
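A quick way to see why the \(L^2\) term stays small is to compare per-token, per-layer attention-score FLOPs against the parameter-tied matmul FLOPs. The sketch below assumes the commonly cited GPT‑3 175B shape (hidden size 12288, context length 2048):

```python
# Rough share of attention-score compute (the O(L^2 d) part) relative to the
# parameter-tied matmuls, per token per layer, forward pass only.
# Assumed GPT-3 175B shape: hidden size 12288, context length 2048.
d_model = 12_288
ctx_len = 2_048

# QK^T scores and the attention-weighted sum over V each cost ~2*ctx*d FLOPs
# per token per layer (ignoring the causal mask, which roughly halves this).
attn_score_flops = 2 * ctx_len * d_model + 2 * ctx_len * d_model

# ~12*d_model^2 parameters per layer (QKV + output projection + 2 MLP
# matrices), at ~2 FLOPs per parameter in the forward pass.
param_matmul_flops = 2 * 12 * d_model**2

print(f"attention-score share ~ {attn_score_flops / param_matmul_flops:.1%}")
# -> about 3%, consistent with "a few percent"
```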
If you add a modest overhead factor (say +5–15%), you’d still be in the same ballpark:
\[
\boxed{\text{rough range } \sim (3\text{ to }3.6)\times 10^{23}\ \text{FLOPs}}
\]
GPT‑3 (175B) training compute:
\[
\boxed{\approx 3.15 \times 10^{23}\ \text{FLOPs}\;\;(\text{order of }10^{23})}
\]
If you tell me what token count you want to assume (e.g., 200B vs 500B, or multiple epochs over a smaller dataset), I can rescale the estimate instantly.
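For reference, rescaling is a one-liner, since the estimate is linear in the token count (the token counts below are just illustrative assumptions):

```python
# The 6*P*T estimate is linear in T, so rescaling is trivial.
P = 175e9  # assumed parameter count
for T in (200e9, 300e9, 500e9):  # illustrative token-count assumptions
    print(f"T = {T:.0e} tokens -> ~{6 * P * T:.2e} training FLOPs")
# 200B -> 2.10e+23, 300B -> 3.15e+23, 500B -> 5.25e+23
```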