Step-by-Step Estimate of GPT-3 Training FLOPs
First, we ground the calculation in well-established properties of GPT-3 and transformer training compute, then derive the total FLOPs systematically:
Step 1: Core GPT-3 Specs (publicly documented)
GPT-3's widely reported baseline properties, the inputs to this calculation:
- Total trainable parameters (N): 175 billion (175e9)
- Sequence (context) length: 2048 tokens
- Total training tokens processed (T): 300 billion (300e9; the original 2020 GPT-3 paper reports that the 175B model was trained on ~300B tokens in total)
Step 2: Derive the Transformer Training FLOP Formula
Nearly all compute for training transformer models comes from linear layer operations, with minor overheads from attention mechanics that are negligible for large models like GPT-3. The standard formula for total training FLOPs is 6 * N * T, which we justify below:
- Forward pass FLOPs per token: each of the model's N weight parameters participates in one multiply-add (1 multiply + 1 add = 2 FLOPs) per token, so a single token's forward pass through the full model costs ~2*N FLOPs.
- Backward pass FLOPs per token: backpropagation requires ~2x the forward compute, because each layer performs two matrix products of forward-pass size (one to accumulate weight gradients, one to propagate error signals to the previous layer), totaling ~4*N FLOPs per token.
- Total per-token FLOPs: 2N (forward) + 4N (backward) = 6N FLOPs per training token. Multiply by all T training tokens to get total compute: 6NT.
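The per-token accounting above can be sketched as a one-line helper (the function name is ours, not from any standard library):

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Total training FLOPs via the 6*N*T approximation:
    ~2*N forward + ~4*N backward per token, times T tokens."""
    return 6 * n_params * n_tokens

# GPT-3 values from Step 1: N = 175e9 parameters, T = 300e9 tokens
print(f"{training_flops(175e9, 300e9):.2e}")  # → 3.15e+23
```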
Step 3: Validate Negligible Overheads
Small sources of compute (self-attention score calculations, layer norm, softmax, embedding lookups) do not meaningfully alter the estimate. For GPT-3 specifically, the O(sequence length²) self-attention matrix compute amounts to only ~1.5% of the linear layer compute, and the remaining overheads add <2% more. The 6NT formula is therefore accurate to within a few percent of the true total.
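A back-of-envelope check of the attention-overhead figure. Two assumptions not stated above: GPT-3's hidden size d_model = 12288, and causal masking means each token attends to ~s/2 earlier positions on average:

```python
# Assumed GPT-3 dimensions (not given in the text above)
d_model, seq_len = 12288, 2048

# Forward FLOPs per layer per token from linear layers:
# QKV + output projections ~ 8*d^2, MLP (4x expansion) ~ 16*d^2
linear_flops = 24 * d_model**2

# Forward FLOPs per layer per token from the O(s^2) attention matmuls
# (QK^T scores plus attention-weighted values), halved by causal masking
attn_matrix_flops = 2 * seq_len * d_model

ratio = attn_matrix_flops / linear_flops  # simplifies to s / (12*d)
print(f"attention-matrix share of linear compute ≈ {ratio:.1%}")  # ≈ 1.4%
```

The ratio s/(12d) also shows why this overhead grows for longer contexts or smaller models, and shrinks as models get wider.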
Step 4: Final Calculation
Plug in GPT-3's values:
Total FLOPs = 6 * 175e9 * 300e9 = 3.15e23 FLOPs, or ~3e23 FLOPs as a rounded, standard estimate.
This is consistent with widely cited runtime figures: at ~1000 NVIDIA V100 GPUs running near their ~120 TFLOPS peak FP16 throughput, ~3e23 FLOPs corresponds to roughly 30 days of training. (Real runs sustain well below peak utilization, so the actual GPU-time spent was several times larger, but the order of magnitude matches.)
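The runtime conversion above, made explicit (note the optimistic assumption that the cluster sustains peak FP16 throughput for the whole run):

```python
total_flops = 6 * 175e9 * 300e9   # 3.15e23 FLOPs, from Step 4
n_gpus = 1000
flops_per_gpu = 120e12            # ~peak V100 FP16 tensor-core throughput

seconds = total_flops / (n_gpus * flops_per_gpu)
days = seconds / 86400
print(f"~{days:.0f} days")  # → ~30 days
```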