Estimate Complexity
Qwen Plus 2025 07 28's response to the 'Estimate Complexity' challenge.
Prompt
Estimate the total number of FLOPs required to train GPT-3 without searching online. Explain your reasoning step by step.
Response
To estimate the total number of floating-point operations (FLOPs) required to train GPT-3, we can use a widely adopted approximation based on the model size and the number of training tokens. We'll do this without searching online, relying only on figures recalled from public disclosures and reasonable assumptions.
Step 1: Understand the scaling law for training FLOPs
For transformer models, a well-known approximation for the total training FLOPs is:
\[ \text{Total FLOPs} \approx 6 \times N \times D \]
Where:
- \( N \) = number of parameters in the model
- \( D \) = number of training tokens
The factor of 6 comes from counting the multiply-accumulate operations involved in training. Specifically:
- Forward pass: ~2N FLOPs per token (one multiply and one add per parameter)
- Backward pass: ~4N FLOPs per token (roughly twice the forward pass, since gradients are computed with respect to both activations and weights)
- Optimizer step (e.g., Adam): on the order of N FLOPs per update, which is negligible compared with the per-token costs above
So total is roughly 6N FLOPs per token, hence:
\[ \text{Total FLOPs} \approx 6 \times N \times D \]
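As a quick illustration, the rule of thumb can be written as a small helper function. This is a minimal sketch of the approximation itself; the function name and interface are ours for illustration, not from any particular library.

```python
def training_flops(num_params: float, num_tokens: float,
                   flops_per_param_per_token: float = 6.0) -> float:
    """Approximate total training FLOPs for a dense transformer.

    Uses the standard rule of thumb: ~6 FLOPs per parameter per training
    token (~2N for the forward pass, ~4N for the backward pass).
    """
    return flops_per_param_per_token * num_params * num_tokens
```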
Step 2: Estimate GPT-3’s parameter count
GPT-3 was released with multiple sizes, but the largest version had 175 billion parameters.
So: \[ N = 175 \times 10^9 = 1.75 \times 10^{11} \]
Step 3: Estimate the number of training tokens
From public papers (e.g., the GPT-3 paper by Brown et al., 2020), the model was trained on 300 billion tokens.
So: \[ D = 300 \times 10^9 = 3 \times 10^{11} \]
(Note: This is a known figure from the literature; here we assume it is recalled from general AI knowledge rather than looked up.)
Step 4: Apply the formula
\[ \text{Total FLOPs} \approx 6 \times (1.75 \times 10^{11}) \times (3 \times 10^{11}) \]
First, multiply the coefficients: \( 6 \times 1.75 \times 3 = 6 \times 5.25 = 31.5 \)
Next, the powers of ten: \( 10^{11} \times 10^{11} = 10^{22} \)
So: \[ \text{Total FLOPs} \approx 31.5 \times 10^{22} = 3.15 \times 10^{23} \]
Final Estimate:
\[ \boxed{3.15 \times 10^{23}} \text{ FLOPs} \]
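Plugging the GPT-3 numbers into the helper sketched in Step 1 confirms the arithmetic; this is just a sanity check of the multiplication, not a new source of data.

```python
N = 175e9   # parameters (175 billion)
D = 300e9   # training tokens (300 billion)

total = training_flops(N, D)
print(f"{total:.2e}")  # prints 3.15e+23
```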
Summary of Reasoning:
- GPT-3 has 175B parameters.
- It was trained on 300B tokens.
- Each parameter is involved in approximately 6 FLOPs per token during training (forward and backward passes).
- Multiply: \( 6 \times (175 \times 10^9) \times (300 \times 10^9) = 3.15 \times 10^{23} \)
This is a standard estimate used in the field and aligns with known reports (e.g., Kaplan et al. scaling laws), even without searching online.
About Qwen Plus 0728
Specifications
- Provider: Qwen
- Released: 2025-09-08
- Size: LARGE
- Context: 1,000,000 tokens