Estimating the total FLOPs (floating-point operations) for training GPT‑3 involves combining known model parameters, training data size, and standard assumptions about computational cost per token. Here’s a step‑by‑step estimate based on publicly available information.
Step 1: Key known parameters
- GPT‑3 parameters: 175 billion (1.75 × 10¹¹)
- Training tokens: about 300 billion (3 × 10¹¹), per the original GPT‑3 paper. (Later models such as GPT‑3.5/4 were trained on far more data, but ~300B is the accepted figure for the original 175B model.)
Step 2: FLOPs per token per parameter
In transformer training, the standard rule of thumb is:
- Forward pass: ~2 FLOPs per parameter per token (one multiply and one add per weight).
- Backward pass: ~4 FLOPs per parameter per token (about twice the forward pass cost).
- Total: ~6 FLOPs per parameter per token.
This is a widely used approximation in ML scaling literature.
Step 3: Compute total FLOPs
Total FLOPs = 6 × (number of parameters) × (number of training tokens)
= 6 × (1.75 × 10¹¹) × (3 × 10¹¹)
= 31.5 × 10²²
= 3.15 × 10²³ FLOPs
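The arithmetic above can be sketched as a short back-of-the-envelope script (the 6·N·D rule and the parameter/token counts are taken from the steps above):

```python
# Back-of-the-envelope training-compute estimate for GPT-3,
# using the ~6 FLOPs per parameter per token rule of thumb.
N_PARAMS = 175e9            # model parameters (1.75e11)
N_TOKENS = 300e9            # training tokens (3e11)
FLOPS_PER_PARAM_TOKEN = 6   # ~2 forward + ~4 backward

total_flops = FLOPS_PER_PARAM_TOKEN * N_PARAMS * N_TOKENS
print(f"{total_flops:.3e}")  # → 3.150e+23
```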
Step 4: Cross‑check with known estimates
The original GPT‑3 paper reports total training compute of about 3.14 × 10²³ FLOPs (3,640 petaFLOP/s‑days) in its appendix, and later analyses (e.g., Epoch AI's compute estimates) cite the same figure.
This matches our calculation closely.
Step 5: Uncertainty & assumptions
- The “6 FLOPs per parameter per token” rule counts only the core matrix‑multiply work; activation recomputation (gradient checkpointing) or inefficient implementations would add to the raw FLOP count.
- Token count: Some sources say 300B tokens, others say ~500B for later GPT‑3 variants, but 300B is the most common figure for the original 175B model.
- If we used 500B tokens, the estimate would be ~5.25 × 10²³ FLOPs.
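The sensitivity of the estimate to the assumed token count (the two figures discussed above) can be checked directly:

```python
# Sensitivity of the 6*N*D estimate to the assumed training-token count.
def training_flops(n_params: float, n_tokens: float, flops_per: int = 6) -> float:
    """Approximate training compute via the 6*N*D rule of thumb."""
    return flops_per * n_params * n_tokens

# 300B tokens (original GPT-3) vs. 500B (figure cited for later variants)
for tokens in (300e9, 500e9):
    print(f"{tokens / 1e9:.0f}B tokens -> {training_flops(175e9, tokens):.2e} FLOPs")
```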
Final estimate
Total training FLOPs for GPT‑3 ≈ 3.14 × 10²³ FLOPs
(equivalently, about 3,640 petaFLOP/s‑days).
This is in line with public estimates from AI research institutions.
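To convert total FLOPs into petaFLOP/s‑days (one petaFLOP/s sustained for a full day), the unit used in the GPT‑3 paper's compute tables:

```python
# Convert total training FLOPs to petaFLOP/s-days.
TOTAL_FLOPS = 3.14e23
PFLOPS = 1e15            # FLOPs per second at 1 petaFLOP/s
SECONDS_PER_DAY = 86_400

pf_days = TOTAL_FLOPS / (PFLOPS * SECONDS_PER_DAY)
print(f"{pf_days:.0f} petaFLOP/s-days")  # → 3634, close to the paper's 3,640
```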