Qwen3.5 397B A17B's response to the 'Estimate Complexity' challenge.
Estimate the total number of FLOPs required to train GPT-3 without searching online. Explain your reasoning step by step.
To estimate the total number of Floating Point Operations (FLOPs) required to train GPT-3, we can use the standard computational scaling laws for Transformer models. The calculation relies on three key variables: the number of parameters in the model, the number of tokens in the training dataset, and the computational cost per parameter per token.
Here is the step-by-step reasoning and calculation.
The defining characteristic of GPT-3 is its size. Based on the seminal paper "Language Models are Few-Shot Learners" (Brown et al., 2020), the largest version of GPT-3 has 175 billion parameters. $$N = 175 \times 10^9 = 1.75 \times 10^{11}$$
The computational cost scales linearly with the amount of data the model processes. GPT-3 was trained on a diverse corpus of internet text (Common Crawl, WebText, books, Wikipedia, etc.). The authors stated the model was trained on approximately 300 billion tokens. $$D = 300 \times 10^9 = 3.0 \times 10^{11}$$
To calculate the total operations, we need to know how many FLOPs are performed for a single parameter when processing a single token during training. Training involves both a forward pass (to make a prediction) and a backward pass (to calculate gradients and update weights).
The forward pass costs roughly 2 FLOPs per parameter per token (one multiply and one add per weight), and the backward pass costs roughly twice that, about 4 FLOPs per parameter per token. Total FLOPs per parameter per token: $2 + 4 = 6$.
Note: This heuristic ($6ND$) is widely accepted in deep learning literature, specifically referenced in the "Scaling Laws for Neural Language Models" (Kaplan et al., 2020).
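As a quick sanity check on this heuristic, here is a minimal Python sketch; the split into 2 forward and 4 backward FLOPs per parameter per token is the assumption stated above, following Kaplan et al. (2020):

```python
# Per-token training cost under the 6ND heuristic (Kaplan et al., 2020).
# Assumption: ~2 FLOPs per parameter for the forward pass (one multiply-add)
# and ~4 FLOPs per parameter for the backward pass (roughly twice the forward).
N = 175e9  # GPT-3 parameter count

flops_forward_per_token = 2 * N
flops_backward_per_token = 4 * N
flops_per_token = flops_forward_per_token + flops_backward_per_token

print(f"Forward  per token: {flops_forward_per_token:.2e} FLOPs")   # ~3.50e+11
print(f"Backward per token: {flops_backward_per_token:.2e} FLOPs")  # ~7.00e+11
print(f"Total    per token: {flops_per_token:.2e} FLOPs")           # ~1.05e+12
```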
Now we combine the variables into the formula: $$\text{Total FLOPs} \approx 6 \times N \times D$$
Substitute the values: $$\text{Total FLOPs} \approx 6 \times (1.75 \times 10^{11}) \times (3.0 \times 10^{11})$$
First, multiply the coefficients: $$6 \times 1.75 \times 3.0$$ $$6 \times 3.0 = 18$$ $$18 \times 1.75 = 31.5$$
Next, combine the powers of ten (the exponents add): $$10^{11} \times 10^{11} = 10^{22}$$
Combine them: $$\text{Total FLOPs} \approx 31.5 \times 10^{22}$$
Convert to standard scientific notation: $$\text{Total FLOPs} \approx 3.15 \times 10^{23}$$
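The same arithmetic can be reproduced as a one-line Python check, using the values from the steps above:

```python
# Full estimate: Total FLOPs ≈ 6 * N * D
N = 1.75e11   # parameters
D = 3.0e11    # training tokens

total_flops = 6 * N * D
print(f"Total training FLOPs ≈ {total_flops:.2e}")  # ≈ 3.15e+23
```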
This estimate assumes a theoretically efficient implementation. In practice, there are minor overheads (embedding lookups, attention mask operations, optimizer states) that might slightly increase this number, but the $6ND$ rule captures the dominant matrix multiplications that consume the vast majority of compute.
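To illustrate why attention is only a small correction, here is a rough sketch using the per-token forward-cost approximation from Kaplan et al. ($\approx 2N + 2\,n_{\text{layer}}\,n_{\text{ctx}}\,d_{\text{model}}$). The architecture values (96 layers, 2048-token context, $d_{\text{model}} = 12288$) are the GPT-3 175B specifications from Brown et al. (2020); since the backward pass scales both terms by roughly the same factor, the ratio below carries over to training:

```python
# Rough size of the attention term relative to the parameter (6ND) term, per token.
N = 1.75e11        # parameters
n_layer = 96       # GPT-3 175B layer count (Brown et al., 2020)
n_ctx = 2048       # training context length
d_model = 12288    # model width

param_flops_fwd = 2 * N                          # forward matmul cost per token
attn_flops_fwd = 2 * n_layer * n_ctx * d_model   # extra attention cost per token

overhead = attn_flops_fwd / param_flops_fwd
print(f"Attention adds roughly {overhead:.1%} on top of the 6ND estimate")  # ~1.4%
```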
Some analyses round the token count or parameter count slightly differently, leading to estimates ranging between $3.0 \times 10^{23}$ and $3.6 \times 10^{23}$. However, $3.15 \times 10^{23}$ is the value that follows directly from the standard public specifications of the model.
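A small sensitivity sketch makes this concrete; the alternative values for $N$ and $D$ below are purely illustrative (not taken from any specific published analysis) and show how modest changes in the assumed counts move the estimate across roughly that range:

```python
# Sensitivity of the 6ND estimate to the assumed parameter and token counts.
# The alternative values below are illustrative, not from any specific analysis.
for N in (1.70e11, 1.75e11):          # parameters
    for D in (3.0e11, 3.3e11):        # training tokens
        print(f"N={N:.2e}, D={D:.2e} -> {6 * N * D:.2e} FLOPs")
# Output spans roughly 3.1e+23 to 3.5e+23 FLOPs.
```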
The total number of FLOPs required to train GPT-3 is approximately:
$$3.15 \times 10^{23} \text{ FLOPs}$$
(Equivalently, about 315 zettaFLOPs, or roughly 3,640 petaFLOP/s-days, consistent with the compute figure reported in the GPT-3 paper.)