GLM 4.7's response to the 'Estimate Complexity' challenge.
Estimate the total number of FLOPs required to train GPT-3 without searching online. Explain your reasoning step by step.
To estimate the total number of FLOPs (Floating Point Operations) required to train GPT-3 (175 billion parameters), we can use the standard scaling laws for Transformer models combined with the specific hyperparameters reported in the GPT-3 paper ("Language Models are Few-Shot Learners").
Here is the step-by-step reasoning:
From the GPT-3 paper, we need two primary numbers for the estimation:
- Model size: $N \approx 175$ billion parameters.
- Training data: $D \approx 300$ billion tokens processed during training.
Training a neural network involves a forward pass (computing the output) and a backward pass (computing gradients and updating weights).
For a standard Transformer architecture (which GPT-3 is), the computational cost is dominated by matrix multiplications in the attention and feed-forward layers.
A widely accepted rule of thumb in deep learning (used in scaling laws like those from Kaplan et al. and the Chinchilla paper) for the FLOPs required to process one token is: $$ \text{FLOPs per token} \approx 6 \times N $$
Why the factor of 6? For each token, the forward pass costs roughly $2N$ FLOPs (about one multiply and one add per parameter), and the backward pass costs roughly twice the forward pass, about $4N$ FLOPs (gradients are computed with respect to both activations and weights). Together that gives approximately $6N$ FLOPs per token.
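As an illustrative check of the per-token cost (a minimal sketch; the 2N/4N split is the rule of thumb described above, not a figure reported in the paper):

```python
# Rough per-token training cost for a dense Transformer, ignoring the
# smaller sequence-length-dependent attention term.
N = 175e9  # GPT-3 parameter count

forward_flops_per_token = 2 * N   # ~one multiply + one add per parameter
backward_flops_per_token = 4 * N  # backward pass is roughly 2x the forward pass
flops_per_token = forward_flops_per_token + backward_flops_per_token

print(f"{flops_per_token:.2e} FLOPs per token")  # ~1.05e+12, i.e. 6N
```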
Now we multiply the FLOPs per token by the total number of training tokens, $D$:
$$ \text{Total FLOPs} = 6 \times N \times D $$
Substituting the values:
$$ \text{Total FLOPs} = 6 \times (175 \times 10^9) \times (300 \times 10^9) $$
Let's do the arithmetic:
$$ \text{Total FLOPs} = (6 \times 175 \times 300) \times 10^{18} = 315,000 \times 10^{18} $$
$$ \text{Total FLOPs} = 3.15 \times 10^{23} $$
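A quick numerical check of this arithmetic (a minimal sketch; `N` and `D` are simply the two figures used above):

```python
N = 175e9   # parameters
D = 300e9   # training tokens

total_flops = 6 * N * D
print(f"{total_flops:.2e}")  # 3.15e+23
```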
The estimated total compute required to train GPT-3 is approximately $3.15 \times 10^{23}$ FLOPs.
To put this in more common hardware terms, we can convert this to PetaFLOP/s-days: the work done by a machine sustaining $10^{15}$ FLOP/s for one day, i.e. $10^{15} \times 86,400 = 8.64 \times 10^{19}$ FLOPs.
$$ \text{PetaFLOP/s-days} = \frac{3.15 \times 10^{23}}{8.64 \times 10^{19}} \approx 3,645 $$
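The same conversion, sketched in code (assuming the definition of one PetaFLOP/s-day as $10^{15}$ FLOP/s sustained for 86,400 seconds):

```python
total_flops = 3.15e23
petaflop_s_day = 1e15 * 86_400        # 8.64e19 FLOPs per PetaFLOP/s-day

print(total_flops / petaflop_s_day)   # ~3645.8
```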
Conclusion: Based on the model size of 175 billion parameters and a training dataset of 300 billion tokens, the total estimated compute is $3.15 \times 10^{23}$ FLOPs (or roughly 3,645 PetaFLOP/s-days).