GPT-5.2's response to the 'Estimate Complexity' challenge.
Estimate the total number of FLOPs required to train GPT-3 without searching online. Explain your reasoning step by step.
Below is a back-of-the-envelope estimate for the total training compute (in FLOPs) for GPT‑3, done from first principles and common Transformer training heuristics.
We want total FLOPs over the whole training run:
[ \text{Total FLOPs} \approx (\text{FLOPs per token}) \times (\text{# training tokens}) ]
So we need:
- P: the number of model parameters
- T: the number of tokens seen during training
A commonly used rule for dense Transformers is:
[ \text{Training FLOPs} \approx 6\,P\,T ]
Why “6”?
- The forward pass costs roughly 2 FLOPs per parameter per token (each weight contributes one multiply and one add).
- The backward pass costs roughly twice the forward pass, about 4 FLOPs per parameter per token.
- Adding them gives about 6 FLOPs per parameter per token.

This ignores some extras (attention score computations, softmax, layernorm, embedding lookups), but for very large models the big matrix multiplies dominate, and the 6PT scaling is usually the right order of magnitude.
So we’ll use: [ \text{Total FLOPs} \approx 6PT ]
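As a concrete sketch of this heuristic, here is a minimal standalone Python helper; the function and variable names are illustrative, not taken from any library:

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Rough dense-Transformer training compute using the ~6 FLOPs per
    parameter per token heuristic (forward ~2, backward ~4)."""
    return 6.0 * n_params * n_tokens
```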
We need an assumption. GPT‑3 is widely described as trained on hundreds of billions of tokens. Without looking it up, a reasonable guess range is:
- Low: ~200 billion tokens
- Mid: ~300 billion tokens
- High: ~500 billion tokens
I’ll compute the mid estimate and show the range.
Given:
- Parameters: [ P \approx 1.75\times 10^{11} ] (175B)
- Training tokens: [ T \approx 3\times 10^{11} ] (the ~300B mid estimate)
[ \text{Total FLOPs} \approx 6 \times (1.75\times 10^{11}) \times (3\times 10^{11}) ]
Multiply constants: [ 6 \times 1.75 \times 3 = 31.5 ]
Multiply powers of ten: [ 10^{11} \times 10^{11} = 10^{22} ]
So: [ \text{Total FLOPs} \approx 31.5 \times 10^{22} = 3.15\times 10^{23}\ \text{FLOPs} ]
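As a quick numerical check of that arithmetic, here is a standalone Python snippet with the assumed P and T plugged in:

```python
# Assumed inputs: P = 1.75e11 parameters, T = 3e11 training tokens.
flops = 6.0 * 1.75e11 * 3e11
print(f"{flops:.2e}")  # prints 3.15e+23
```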
Estimated total training compute for GPT‑3 (175B): [ \boxed{\text{On the order of }10^{23}\text{ FLOPs, roughly }3\times 10^{23}\text{ FLOPs (with a plausible range }2\text{–}5\times 10^{23}\text{).}} ]
If you tell me what you want to assume for the training tokens (e.g., 300B vs 500B) or whether you want to include optimizer overhead (Adam adds extra memory ops; FLOP impact is smaller but nonzero), I can refine the estimate accordingly.
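For reference, here is a small standalone Python sketch (assumed values only) showing how the estimate shifts with the token-count assumption:

```python
P = 1.75e11  # assumed GPT-3 parameter count (175B)

# Sweep the training-token assumption from 200B to 500B tokens.
for tokens in (2e11, 3e11, 4e11, 5e11):
    flops = 6.0 * P * tokens  # ~6 FLOPs per parameter per token
    print(f"T = {tokens:.0e} tokens -> ~{flops:.2e} FLOPs")
```

The resulting spread, roughly 2.1×10^23 to 5.3×10^23 FLOPs, matches the range quoted above.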