Kimi K2.5's response to the 'Estimate Complexity' challenge.
Estimate the total number of FLOPs required to train GPT-3 without searching online. Explain your reasoning step by step.
Here is a step-by-step estimate for the total training FLOPs of GPT-3.
To estimate training compute for a dense transformer, we need two numbers:

- N, the number of trainable parameters: ~175 billion for GPT-3 ($1.75 \times 10^{11}$).
- D, the number of training tokens: ~300 billion for GPT-3 ($3 \times 10^{11}$).
For decoder-only transformers trained with backpropagation, the total floating-point operations (FLOPs) can be approximated as:
$$\text{FLOPs} \approx 6 \times N \times D$$
Why 6? The forward pass costs about 2 FLOPs per parameter per token (one multiply and one add for each weight in the matrix multiplications), and the backward pass costs roughly twice the forward pass, giving about 6 FLOPs per parameter per token in total.
(Note: the parameter term dominates the total; the quadratic attention cost is a small correction compared to the linear-in-parameters cost at GPT-3's scale.)
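As a sanity check, here is a minimal Python sketch of the $6ND$ estimate. The constants are the commonly cited GPT-3 figures; the function name is purely illustrative.

```python
# Minimal sketch of the 6*N*D rule of thumb for dense transformer training.
# N and D are the commonly cited GPT-3 figures; everything else is illustrative.

N_PARAMS = 175e9   # ~1.75e11 trainable parameters
N_TOKENS = 300e9   # ~3.0e11 training tokens

def training_flops(n_params: float, n_tokens: float) -> float:
    """~2 FLOPs per parameter per token forward, ~4 backward -> 6 total."""
    return 6 * n_params * n_tokens

print(f"{training_flops(N_PARAMS, N_TOKENS):.2e} FLOPs")  # -> 3.15e+23
```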
$$
\begin{aligned}
\text{FLOPs} &\approx 6 \times (1.75 \times 10^{11}) \times (3 \times 10^{11}) \\
&= 6 \times 1.75 \times 3 \times 10^{22} \\
&= 6 \times 5.25 \times 10^{22} \\
&= 31.5 \times 10^{22} \\
&= \mathbf{3.15 \times 10^{23}}\ \text{FLOPs}
\end{aligned}
$$
The raw calculation gives ~3.15 × 10²³ FLOPs. However, we should adjust for real-world factors:

- The $6ND$ rule counts only the parameter matrix multiplications; self-attention adds a small extra cost (a quick check follows this list).
- The parameter and token counts (~175B and ~300B) are round figures, so the inputs carry some uncertainty.
- Any activation recomputation or other implementation overheads increase the FLOPs actually executed above this theoretical count.
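As a rough check on the first point, a back-of-the-envelope count of the attention matmuls, using GPT-3's published architecture (96 layers, hidden size 12,288, context length 2,048), confirms the overhead is only a few percent. The $12 \cdot L \cdot T \cdot d$ expression below is an approximation, not an exact count.

```python
# Rough estimate of the quadratic attention cost relative to the 6*N*D term.
# Architecture numbers are GPT-3's published values; 12*L*T*d approximates the
# QK^T and attention-weights @ V matmuls, forward + backward.

N_PARAMS = 175e9
N_LAYERS = 96        # transformer blocks
D_MODEL  = 12288     # hidden size
CTX_LEN  = 2048      # training context length

param_flops_per_token = 6 * N_PARAMS                        # ~1.05e12
attn_flops_per_token  = 12 * N_LAYERS * CTX_LEN * D_MODEL   # ~2.9e10

print(f"attention overhead ~ {attn_flops_per_token / param_flops_per_token:.1%}")
# -> ~2.8% (roughly half that if you credit the causal mask)
```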
Taking these factors into account, the estimate is approximately $3 \times 10^{23}$ to $5 \times 10^{23}$ FLOPs (roughly 300 to 500 sextillion FLOPs).
For context, $3$–$5 \times 10^{23}$ FLOPs corresponds to roughly 3,500 to 5,800 petaflop/s-days, which aligns with the few thousand petaflop/s-days commonly cited for GPT-3's training run on V100 GPUs.
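A quick conversion sketch (one petaflop/s-day is $10^{15} \times 86{,}400 \approx 8.64 \times 10^{19}$ FLOPs; the values fed in are the estimate above, not measured numbers):

```python
# Convert the FLOP estimate into petaflop/s-days, the unit used in the GPT-3 paper.

PFLOP_S_DAY = 1e15 * 86_400   # FLOPs in one petaflop/s-day, ~8.64e19

for total_flops in (3e23, 5e23):
    print(f"{total_flops:.0e} FLOPs ~ {total_flops / PFLOP_S_DAY:,.0f} petaflop/s-days")
# -> ~3,472 and ~5,787 petaflop/s-days, bracketing the ~3,640 commonly quoted for GPT-3
```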