# Estimating FLOPs to Train GPT-3
## Step 1: Recall GPT-3's Architecture
GPT-3 has approximately:
- 175 billion parameters (this is well-known)
- 96 transformer layers
- 96 attention heads
- Hidden dimension: 12,288
- Context length: 2,048 tokens
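As a rough cross-check on the 175B figure (a back-of-envelope sketch, not the exact breakdown from the paper): each transformer block contributes about 12·d_model² weights (4·d² for the attention projections plus 8·d² for the MLP), and the embedding matrices add a little more. The 50,257-token BPE vocabulary is GPT-2's, which GPT-3 reuses.

```python
# Rough reconstruction of GPT-3's parameter count from the shape above.
# Per transformer block: attention Q/K/V/output projections ~4*d^2, MLP ~8*d^2.
n_layers = 96
d_model = 12_288
vocab_size = 50_257   # GPT-2/GPT-3 BPE vocabulary
n_ctx = 2_048         # learned position embeddings

block_params = 12 * n_layers * d_model**2           # ~174.0B
embedding_params = (vocab_size + n_ctx) * d_model   # token + position embeddings, ~0.6B

print(f"{(block_params + embedding_params) / 1e9:.1f}B parameters")  # ~174.6B ≈ 175B
```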
## Step 2: FLOPs per Token (Forward Pass)
For a transformer, the dominant operations are matrix multiplications. There's a useful rule of thumb:
FLOPs per token ≈ 2 × N (for the forward pass)
where N = number of parameters.
**Why 2×N?**
- Each parameter is involved in roughly one multiply-add operation per token
- A multiply-add = 2 FLOPs
- This ignores the attention-score computation, which is negligible when the context length (2,048) is small relative to the hidden dimension (12,288), as it is here
So forward pass FLOPs per token: $$2 \times 175 \times 10^9 = 3.5 \times 10^{11} \text{ FLOPs/token}$$
## Step 3: Account for Backward Pass
The backward pass costs approximately 2× the forward pass (gradients must be computed with respect to both the activations and the weights).
So total per token: $$3 \times 3.5 \times 10^{11} = 1.05 \times 10^{12} \text{ FLOPs/token}$$
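Spelled out as a few lines of Python (a minimal sketch of Steps 2–3; the variable names are mine):

```python
N = 175e9  # parameters

forward_flops_per_token = 2 * N                          # one multiply-add (2 FLOPs) per parameter
training_flops_per_token = 3 * forward_flops_per_token   # forward + ~2x forward for the backward pass

print(f"forward:  {forward_flops_per_token:.2e} FLOPs/token")   # 3.50e+11
print(f"training: {training_flops_per_token:.2e} FLOPs/token")  # 1.05e+12
```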
## Step 4: Training Tokens
GPT-3 was trained on approximately 300 billion tokens (the paper describes ~300B tokens drawn from filtered Common Crawl, WebText2, two book corpora, and English Wikipedia).
## Step 5: Total FLOPs
$$\text{Total FLOPs} = 1.05 \times 10^{12} \ \text{FLOPs/token} \times 3 \times 10^{11} \ \text{tokens}$$
$$= 3.15 \times 10^{23} \text{ FLOPs}$$
$$\approx \boxed{3 \times 10^{23} \text{ FLOPs}}$$
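The same product as a quick check (per-token training FLOPs from Step 3 times the token count from Step 4, which is exactly 6·N·D):

```python
flops_per_token = 1.05e12   # Step 3: ~6 FLOPs per parameter per token
training_tokens = 300e9     # Step 4

total_flops = flops_per_token * training_tokens
print(f"{total_flops:.2e} FLOPs")                     # 3.15e+23
assert abs(total_flops - 6 * 175e9 * 300e9) < 1e20    # agrees with C = 6*N*D
```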
## Sanity Check
- This is ~3 × 10²³ floating-point operations, i.e. ~300 zettaFLOP
- At 1,000 A100s with a peak of ~300 TFLOP/s each (assuming ~50% utilization → ~150 TFLOP/s effective):
  - Aggregate throughput = 1,000 × 1.5 × 10¹⁴ = 1.5 × 10¹⁷ FLOP/s
  - Time = 3 × 10²³ / 1.5 × 10¹⁷ ≈ 2 × 10⁶ seconds ≈ 23 days
- OpenAI reportedly trained GPT-3 on a cluster of thousands of V100 GPUs over a period of weeks, so this is plausible ✓
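The cluster arithmetic above, spelled out (the 1,000-GPU count and 50% utilization are illustrative assumptions for the estimate, not GPT-3's actual training configuration):

```python
total_flops = 3e23

n_gpus = 1_000               # illustrative cluster size (assumption)
peak_per_gpu = 300e12        # ~300 TFLOP/s per A100 (dense FP16/BF16 tensor cores)
utilization = 0.5            # assumed fraction of peak actually sustained

cluster_throughput = n_gpus * peak_per_gpu * utilization   # 1.5e17 FLOP/s
seconds = total_flops / cluster_throughput
print(f"{seconds:.1e} s ≈ {seconds / 86_400:.0f} days")    # ~2.0e6 s ≈ 23 days
```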
## Summary
| Component | Value |
|---|---|
| Parameters | 175B |
| FLOPs/token (forward) | ~3.5 × 10¹¹ |
| Training multiplier (fwd+bwd) | ×3 |
| Training tokens | ~300B |
| Total FLOPs | ~3 × 10²³ |
The key formula to remember: C ≈ 6ND, where N is the parameter count and D is the number of training tokens.
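Packaged as a tiny helper (the function name is my own, not from any library), the rule reproduces the estimate above, and swapping in other values of N and D gives a first-order compute budget for other models:

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute via the rule of thumb C ≈ 6·N·D."""
    return 6 * n_params * n_tokens

print(f"{training_flops(175e9, 300e9):.2e} FLOPs")  # 3.15e+23 for GPT-3
```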



