Claude Opus 4.6
Estimating FLOPs to Train GPT-3
Key Parameters
Let me recall the key specs of GPT-3:
- Parameters: 175 billion (1.75 × 10¹¹)
- Training tokens: ~300 billion (3 × 10¹¹)
- Architecture: Decoder-only Transformer
The Standard Approximation Rule
There's a well-known approximation for Transformer training FLOPs:
$$C \approx 6 \times N \times D$$
where:
- N = number of parameters
- D = number of training tokens
- The factor of 6 comes from ~2 FLOPs per parameter per token for each of three operations (summed just below this list):
  - Forward pass: ~2 FLOPs per parameter per token (one multiply + one add per parameter in the matrix operations)
  - Backward pass (activation gradients): ~2 FLOPs per parameter per token
  - Backward pass (weight gradients): ~2 FLOPs per parameter per token
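Writing the three contributions out explicitly, they simply stack to give the factor of 6:

$$C \;\approx\; \underbrace{2ND}_{\text{forward}} \;+\; \underbrace{2ND}_{\text{backward: activation grads}} \;+\; \underbrace{2ND}_{\text{backward: weight grads}} \;=\; 6ND$$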
Why 2 FLOPs per parameter per token for the forward pass?
A Transformer's compute is dominated by matrix multiplications. Multiplying an (m × k) matrix by a (k × n) matrix costs ~2mkn FLOPs: each of the mn output elements needs k multiplies and k − 1 adds, i.e. ≈ 2k operations. Summing 2mkn across all the weight matrices in the model, the total forward-pass cost works out to approximately 2 × N FLOPs per token, where N is the total parameter count.
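As a concrete sanity check on that 2 × N claim, here is a minimal Python sketch (my own illustration, not OpenAI's code) that tallies 2mkn over the weight matrices of a GPT-3-style decoder layer. It assumes GPT-3 175B's reported dimensions (d_model = 12288, 96 layers) and the standard GPT-style block layout (fused QKV projection, 4× MLP expansion):

```python
# Per-token forward-pass FLOP tally for a GPT-3-style decoder stack.
# d_model = 12288 and n_layers = 96 are GPT-3 175B's reported dimensions;
# the fused-QKV / 4x-MLP layer layout below is the standard GPT-style
# block, assumed here for illustration.
d_model = 12288
d_ff = 4 * d_model
n_layers = 96

# Weight matrices in one decoder layer, as (rows, cols).
layer_matrices = {
    "attn_qkv": (d_model, 3 * d_model),  # fused Q, K, V projections
    "attn_out": (d_model, d_model),      # attention output projection
    "mlp_in":   (d_model, d_ff),         # first MLP projection
    "mlp_out":  (d_ff, d_model),         # second MLP projection
}

# A (1 x k) @ (k x n) per-token matmul costs ~2*k*n FLOPs,
# i.e. ~2 FLOPs per weight element.
params_per_layer = sum(r * c for r, c in layer_matrices.values())
total_matrix_params = n_layers * params_per_layer
total_fwd_flops_per_token = n_layers * sum(2 * r * c for r, c in layer_matrices.values())

print(f"matrix params across {n_layers} layers: {total_matrix_params:.2e}")    # ~1.74e11
print(f"forward FLOPs per token:             {total_fwd_flops_per_token:.2e}")  # ~3.5e11
```

The attention and MLP matrices account for essentially all of the 175B parameters (~1.74 × 10¹¹ here; embeddings add well under 1%), and the forward cost lands at ~3.5 × 10¹¹ FLOPs per token ≈ 2N, which is the 2-FLOPs-per-parameter rule in action.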
Calculation
$$C \approx 6 \times N \times D$$
$$C \approx 6 \times (1.75 \times 10^{11}) \times (3 \times 10^{11})$$
$$C \approx 6 \times 5.25 \times 10^{22}$$
$$C \approx 3.15 \times 10^{23} \text{ FLOPs}$$
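The same arithmetic as a trivially runnable check (pure Python, no dependencies):

```python
N = 1.75e11   # parameters
D = 3.0e11    # training tokens
C = 6 * N * D
print(f"C = 6*N*D = {C:.3e} FLOPs")  # -> 3.150e+23 FLOPs
```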
Cross-check with reported values
The OpenAI paper reports 3.14 × 10²³ FLOPs for the 175B model, so this estimate is essentially exact. That is unsurprising: the 6ND formula is the same approximation the community (and OpenAI themselves) use for this kind of accounting.
Final Answer
$$\boxed{C \approx 3.15 \times 10^{23} \text{ FLOPs}}$$
This is about 315 zettaFLOPs, roughly equivalent to running thousands of V100 GPUs for weeks to months, which is consistent with what is known about the training infrastructure used.
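To make the "thousands of V100s for weeks to months" claim concrete, here is a back-of-the-envelope sketch. The peak throughput, utilization, and cluster size below are assumptions chosen for illustration, not figures reported for GPT-3's training run:

```python
# Back-of-the-envelope training-time estimate.
# Assumed numbers (not from the GPT-3 paper): ~125 TFLOP/s FP16
# tensor-core peak per V100, ~30% sustained utilization, 2000 GPUs.
total_flops = 3.15e23
peak_flops_per_gpu = 125e12   # approximate V100 FP16 tensor-core peak
utilization = 0.30            # assumed sustained efficiency
n_gpus = 2000                 # assumed cluster size

sustained_flops_per_s = peak_flops_per_gpu * utilization * n_gpus
seconds = total_flops / sustained_flops_per_s
print(f"~{seconds / 86400:.0f} days on {n_gpus} V100s at {utilization:.0%} utilization")
# -> roughly 49 days, i.e. weeks to months as stated above
```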





