Here is a step-by-step estimate for the total training FLOPs of GPT-3.
Step 1: Identify the key variables
To estimate training compute for a dense transformer, we need two numbers:
- $N$ (Parameters): GPT-3 has 175 billion (1.75 × 10¹¹) parameters.
- $D$ (Tokens): GPT-3 was trained on roughly 300 billion (3 × 10¹¹) tokens. (This is the figure reported in the original paper for the number of tokens processed during training.)
Step 2: Use the standard training FLOPs formula
For decoder-only transformers trained with backpropagation, the total floating-point operations (FLOPs) can be approximated as:
$$\text{FLOPs} \approx 6 \times N \times D$$
Why 6?
- Forward pass: Processing one token requires roughly 2 FLOPs per parameter (1 multiply + 1 accumulate for the matrix operations).
- Backward pass: Computing gradients requires roughly 4 FLOPs per parameter (about twice the forward cost, since gradients must be computed with respect to both the layer inputs and the weights).
- Total: $2 + 4 = 6$ FLOPs per parameter per token.
(Note: the parameter-dependent matrix multiplications dominate the cost; the attention-score computation, which is quadratic in sequence length, is negligible at GPT-3's 2,048-token context compared to the linear-in-parameters cost for a model of this size.)
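As a sanity check on the rule of thumb, here is a minimal Python sketch of the $6ND$ approximation; the function name and constant names are illustrative, and only the 2 (forward) + 4 (backward) breakdown comes from the reasoning above:

```python
# Minimal sketch of the 6*N*D training-FLOPs rule of thumb for dense transformers.
FORWARD_FLOPS_PER_PARAM = 2   # one multiply + one add per weight per token
BACKWARD_FLOPS_PER_PARAM = 4  # ~2x forward: gradients w.r.t. activations and weights

def approx_training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training FLOPs for a dense decoder-only transformer."""
    per_param_per_token = FORWARD_FLOPS_PER_PARAM + BACKWARD_FLOPS_PER_PARAM  # = 6
    return per_param_per_token * n_params * n_tokens
```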
Step 3: Calculate
$$\text{FLOPs} \approx 6 \times (1.75 \times 10^{11}) \times (3 \times 10^{11})$$
$$= 6 \times 1.75 \times 3 \times 10^{22}$$
$$= 6 \times 5.25 \times 10^{22}$$
$$= 31.5 \times 10^{22}$$
$$= \mathbf{3.15 \times 10^{23} \text{ FLOPs}}$$
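The same arithmetic can be checked in two lines of Python, with the values from Steps 1 and 2 hard-coded:

```python
# Plug the GPT-3 numbers straight into the 6*N*D rule of thumb.
flops = 6 * 1.75e11 * 3e11
print(f"{flops:.2e}")  # -> 3.15e+23
```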
Step 4: Account for uncertainty and overhead
The raw calculation gives ~3.15 × 10²³ FLOPs. However, we should adjust for real-world factors:
- Dataset size variance: If GPT-3 were trained on closer to 400–500B tokens (for example, with repeated epochs or a different data mix), the estimate rises to roughly 4–5 × 10²³ FLOPs.
- Implementation overhead: Data loading, checkpointing, communication in distributed training, and Adam optimizer updates can add on the order of 10–20% on top of the model FLOPs (a small sensitivity sketch follows this list).
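The sketch below shows how the estimate moves with these two factors; the token counts and overhead multipliers are assumed scenarios for illustration, not published figures:

```python
# Vary the assumed token count and a multiplicative overhead factor to see
# how far the 6*N*D estimate can reasonably drift.
N_PARAMS = 1.75e11  # GPT-3 parameter count

for n_tokens in (3e11, 4e11, 5e11):       # 300B, 400B, 500B tokens (assumed scenarios)
    for overhead in (1.0, 1.1, 1.2):      # 0%, 10%, 20% extra work (assumed)
        flops = 6 * N_PARAMS * n_tokens * overhead
        print(f"{n_tokens:.0e} tokens, {overhead:.1f}x overhead -> {flops:.2e} FLOPs")
```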
Final Estimate
Approximately $3 \times 10^{23}$ to $5 \times 10^{23}$ FLOPs (roughly 300 to 500 sextillion FLOPs).
For context, this is equivalent to roughly 3,500 to 5,800 petaflop/s-days, which aligns with the ~3,640 petaflop/s-days figure often cited for GPT-3's training run on V100 GPUs.
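For reference, converting the headline FLOP estimates into petaflop/s-days is a one-line unit conversion; the constants below are just the unit definitions:

```python
# Convert total training FLOPs into petaflop/s-days:
# one petaflop/s-day = 1e15 FLOP/s sustained for 86,400 seconds.
PFLOP_PER_SECOND = 1e15
SECONDS_PER_DAY = 86_400

for total_flops in (3e23, 3.15e23, 5e23):
    pf_days = total_flops / (PFLOP_PER_SECOND * SECONDS_PER_DAY)
    print(f"{total_flops:.2e} FLOPs ≈ {pf_days:,.0f} petaflop/s-days")
```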