Estimated total FLOPs for training GPT-3: ~3.4 × 10²³ FLOPs
Step-by-Step Reasoning
1. Model size (parameters)
- GPT-3 (the 175B model) has 175 billion parameters.
- This is the dense transformer model ("davinci"), the model usually meant when people say "GPT-3".
2. Tokens seen during training
- The GPT-3 paper states the model was trained on 300 billion tokens (from a filtered version of Common Crawl + other datasets).
- This is the standard figure used in the scaling-law literature (e.g., Kaplan et al. 2020) and in follow-on analyses such as Hoffmann et al. 2022.
3. FLOPs per token for a transformer (forward + backward pass)
For a dense transformer, the dominant cost is the matrix multiplications in the attention and feed-forward layers.
A widely used approximation in the community (Kaplan et al., Hoffmann et al., NVIDIA's Megatron work, and others) is:
FLOPs per token ≈ 6 × N
where N is the number of parameters.
Why 6N?
- Forward pass ≈ 2N FLOPs per token (each parameter participates in one multiply-accumulate, which counts as 2 FLOPs).
- Backward pass ≈ 4N FLOPs per token (roughly twice the forward cost, since gradients must be computed w.r.t. both weights and activations).
- Total ≈ 2N + 4N = 6N FLOPs per token.
This approximation is accurate for large dense transformers as long as the context length is short relative to the model width, so the quadratic attention term stays small (GPT-3 uses a 2048-token context).
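The 6N breakdown above can be sketched as a short calculation (a minimal sketch; the 2N/4N split is the approximation from the bullets, not an exact count):

```python
# Sketch of the 6N-per-token approximation: 2N forward + 4N backward.
N = 175e9  # GPT-3 parameter count

forward_flops = 2 * N   # one multiply-accumulate (2 FLOPs) per parameter per token
backward_flops = 4 * N  # backward pass costs roughly 2x the forward pass
per_token = forward_flops + backward_flops

print(f"FLOPs per token ≈ {per_token:.2e}")  # ≈ 1.05e+12
```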
4. Total FLOPs calculation
Total training FLOPs ≈ 6 × N × D
where:
- N = 175 × 10⁹ parameters
- D = 300 × 10⁹ tokens
Plugging in the numbers:
6 × (175 × 10⁹) × (300 × 10⁹) = 6 × 52,500 × 10¹⁸
= 315,000 × 10¹⁸
= 3.15 × 10²³ FLOPs
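The headline arithmetic is easy to verify directly:

```python
# Verify the headline estimate: C ≈ 6 * N * D.
N = 175e9   # parameters
D = 300e9   # training tokens

total_flops = 6 * N * D
print(f"Total training FLOPs ≈ {total_flops:.3e}")  # ≈ 3.150e+23
```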
5. Adjustments / realism check
- Some papers use slightly different constants (e.g., 6.4N when including embedding and other minor terms, or 5.5–6.5N depending on implementation).
- GPT-3 also used some mixture of data sources and possibly a small amount of repeated data, but 300B tokens is the headline number.
- The figure most commonly cited in the literature (including in the Chinchilla paper, PaLM reports, etc.) is ≈ 3.14–3.5 × 10²³ FLOPs.
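A quick sensitivity check shows how the estimate moves with the per-token constant (the 5.5–6.5 range comes from the bullets above; the spread stays within the commonly cited range):

```python
# Sensitivity of the estimate to the per-token FLOP constant.
N = 175e9   # parameters
D = 300e9   # training tokens

for c in (5.5, 6.0, 6.4, 6.5):
    print(f"c = {c}: {c * N * D:.2e} FLOPs")
```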
My estimate of ~3.4 × 10²³ rounds up slightly to account for minor overheads (embeddings, attention, optimizer bookkeeping) and sits within the range cited above.
Final Answer: ~3.4 × 10²³ FLOPs