Estimate Complexity
DeepSeek V3.2 Exp's response to the 'Estimate Complexity' challenge.
Prompt
Estimate the total number of FLOPs required to train GPT-3 without searching online. Explain your reasoning step by step.
Response
Alright — let’s estimate the total FLOPs required to train GPT‑3 without looking anything up.
1. Known basic formula
A common rule of thumb for the training cost of Transformer models is:
\[ \text{Training FLOPs} \approx 6 \times N \times D \]
Where:
- \( N \) = total number of tokens seen during training
- \( D \) = total number of parameters in the model
This factor of 6 comes from:
- Forward pass: ~\( 2 \times D \) FLOPs per token (2 FLOPs per parameter: one multiply and one add in the matrix multiplications)
- Backward pass: ~\( 4 \times D \) FLOPs per token (roughly twice the cost of the forward pass)
So \( 2 + 4 = 6 \) FLOPs per parameter per token during training.
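The rule of thumb above is easy to express as a one-line helper; this is a minimal sketch, and the function name `training_flops` is just illustrative.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Rule-of-thumb training cost: ~6 FLOPs per parameter per token.

    The 6 is ~2 FLOPs/param/token for the forward pass plus ~4 for backward.
    """
    return 6 * n_params * n_tokens
```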
2. GPT‑3 parameters
GPT‑3 comes in multiple sizes; the largest has 175 billion parameters (\( 175 \times 10^{9} \), i.e. \( 1.75 \times 10^{11} \)).
So \( D = 1.75 \times 10^{11} \).
3. Training tokens
GPT‑3 was trained on ~300 billion tokens (per the original paper).
So \( N = 3 \times 10^{11} \) tokens.
4. Apply formula
\[ \text{FLOPs} \approx 6 \times (1.75 \times 10^{11}) \times (3 \times 10^{11}) \]
First multiply the mantissas: \( 1.75 \times 3 = 5.25 \)
Then combine the exponents: \( 10^{11} \times 10^{11} = 10^{22} \)
So:
\[ 6 \times 5.25 \times 10^{22} = 31.5 \times 10^{22} = 3.15 \times 10^{23} \ \text{FLOPs} \]
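As a quick arithmetic check, here is the same calculation as a self-contained Python snippet (variable names are just for illustration):

```python
n_params = 1.75e11   # D: 175B parameters
n_tokens = 3e11      # N: ~300B training tokens

flops = 6 * n_params * n_tokens
print(f"{flops:.3e}")  # 3.150e+23
```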
5. Adjust for reality
This formula is approximate. In practice there are extra operations from the attention computation, the optimizer update, and so on, but for large dense models \( 6ND \) is the standard estimate.
The figure usually cited from the GPT‑3 paper is ~\( 3.14 \times 10^{23} \) FLOPs, which matches our estimate almost exactly.
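To see why the attention term is only a small correction, here is a rough back-of-the-envelope sketch. The architecture numbers (96 layers, a 12,288-dimensional hidden state, 2,048-token context) are assumptions based on GPT-3 175B's published configuration, and the per-layer attention cost used here is only an order-of-magnitude approximation.

```python
# Rough estimate of the attention-score FLOPs that the 6*N*D rule leaves out.
# Architecture numbers are assumed from GPT-3 175B's published configuration.
n_layers = 96        # transformer layers (assumed)
d_model  = 12288     # hidden size (assumed)
n_ctx    = 2048      # training context length (assumed)
n_params = 1.75e11   # D

# Per token, per layer: ~2*n_ctx*d_model FLOPs for the Q·K^T scores and the
# same again for the attention-weighted values (forward pass only); multiply
# by 3 to cover forward + backward, mirroring the 6ND rule's 1:2 split.
attn_per_token  = 3 * (4 * n_layers * n_ctx * d_model)
dense_per_token = 6 * n_params

print(f"attention overhead: {attn_per_token / dense_per_token:.1%}")  # ~2.8%
```

At roughly the percent level, this overhead is well within the slack of the estimate.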
6. Final answer
\[ \boxed{3.15 \times 10^{23}} \] FLOPs for training GPT‑3 (175B parameters on 300B tokens).
About DeepSeek V3.2 Exp
Specifications
- Provider: DeepSeek
- Released: 2025-09-29
- Size: LARGE
- Context: 163,840 tokens