# Estimating GPT-3 Training FLOPs
## Key Parameters
| Component | Value |
|---|---|
| Model parameters (N) | 175 billion (1.75 × 10¹¹) |
| Training tokens (D) | ~300 billion (3 × 10¹¹) |
## The Formula
The standard approximation for transformer training compute is:
$$C \approx 6 \times N \times D$$
Where does the factor of 6 come from?
- Forward pass: ~2N FLOPs per token (matrix multiplications)
- Backward pass: ~4N FLOPs per token (gradients with respect to both activations and weights, each costing roughly as much as the forward pass)
- Total: ~6N FLOPs per token
## The Calculation
$$C \approx 6 \times (1.75 \times 10^{11}) \times (3 \times 10^{11})$$
$$C \approx 6 \times 1.75 \times 3 \times 10^{22}$$
$$C \approx 31.5 \times 10^{22}$$
$$\boxed{C \approx 3.15 \times 10^{23} \text{ FLOPs}}$$
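The arithmetic above can be checked in a few lines of Python (variable names are mine):

```python
# Estimate total training compute via C ≈ 6 * N * D.
N = 1.75e11  # model parameters (175 billion)
D = 3e11     # training tokens (~300 billion)
C = 6 * N * D
print(f"C ≈ {C:.2e} FLOPs")  # → C ≈ 3.15e+23 FLOPs
```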
## Sanity Check
This aligns with the commonly cited figure of ~3.14 × 10²³ FLOPs (sometimes written as ~314 zettaFLOPs). For context, this is roughly equivalent to running 1,000 GPUs at a sustained 100 TFLOPS each for ~36 days continuously — consistent with reports that GPT-3 training used thousands of GPUs over several weeks. (A single GPU at 100 TFLOPS would need on the order of 100 years.)
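To translate total FLOPs into wall-clock time, divide by aggregate throughput. The GPU count and per-GPU throughput below are illustrative assumptions, not reported hardware details:

```python
# Back-of-envelope wall-clock estimate: total FLOPs / aggregate FLOPS.
C = 3.15e23             # total training FLOPs (from the estimate above)
per_gpu_flops = 100e12  # assumed sustained throughput: 100 TFLOPS per GPU
n_gpus = 1_000          # assumed GPU count (illustrative)
days = C / (per_gpu_flops * n_gpus) / 86_400  # 86,400 seconds per day
print(f"≈ {days:.1f} days")  # → ≈ 36.5 days
```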