Z.ai: GLM 5.1
To estimate the total number of FLOPs (Floating Point Operations) required to train GPT-3, we can use a well-established rule of thumb in deep learning, combined with the known specifications of the GPT-3 model.
Here is the step-by-step reasoning:
Step 1: Establish the Formula
For large Transformer models, the total training compute can be estimated using the formula: Total FLOPs ≈ 6 × N × D
Where:
- N = Total number of parameters in the model.
- D = Total number of tokens in the training dataset.
Why the factor of 6? This comes from the mechanics of the forward and backward passes:
- Forward Pass (~2N): In a matrix multiplication $y = Wx$, for every parameter (weight) $w$, we perform one multiplication and one addition. Since a multiply-add counts as 2 FLOPs, the forward pass requires approximately 2N FLOPs per token.
- Backward Pass (~4N): Calculating gradients requires computing the gradient with respect to the weights (which involves an outer product, ~2N FLOPs) and the gradient with respect to the inputs to pass the error backward (another matrix multiplication, ~2N FLOPs). Thus, the backward pass requires approximately 4N FLOPs per token.
- Total: 2N (forward) + 4N (backward) = 6N FLOPs per token.
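The per-token accounting above can be sketched in a few lines of Python (a minimal illustration; the function name `train_flops` is ours, not from any library):

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Estimate total training compute with the 6*N*D rule of thumb."""
    forward = 2 * n_params               # one multiply + one add per weight, per token
    backward = 4 * n_params              # ~2N for weight grads + ~2N for input grads
    per_token_flops = forward + backward  # = 6N FLOPs per token
    return per_token_flops * n_tokens

# GPT-3 scale: N = 175e9 parameters, D = 300e9 tokens
print(f"{train_flops(175e9, 300e9):.2e}")  # → 3.15e+23
```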
Step 2: Identify the Number of Parameters (N)
From the GPT-3 paper (Language Models are Few-Shot Learners, Brown et al., 2020), the largest model variant (the one commonly called simply "GPT-3") has:
- N = 175 billion parameters ($1.75 \times 10^{11}$)
Step 3: Identify the Training Dataset Size (D)
The GPT-3 paper explicitly details the training dataset. The model was trained on a mixture of filtered Common Crawl, WebText2, Books, and Wikipedia. The total token count across this blended dataset was:
- D = 300 billion tokens ($3 \times 10^{11}$)
(Note: While the raw datasets were much larger, the model was trained on roughly 300 billion tokens in total; some high-quality sources such as Wikipedia were sampled for multiple epochs, but the total number of tokens presented during training was 300 billion.)
Step 4: Calculate the Total FLOPs
Now we plug the numbers into our formula:
Total FLOPs = 6 × N × D
Total FLOPs = 6 × (175 × 10^9) × (300 × 10^9)
Let's break down the math:
- Multiply the coefficients: 6 × 175 × 300 = 315,000
- Multiply the powers of ten (i.e., add the exponents): 10^9 × 10^9 = 10^18
- Combine them: 315,000 × 10^18 = 3.15 × 10^23
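For cross-checking, compute budgets of this era were often quoted in petaFLOP/s-days; the GPT-3 paper reports about 3,640 PF-days for the 175B model, which matches this estimate:

```python
total_flops = 6 * 175e9 * 300e9      # 3.15e23 FLOPs
pfs_day = 1e15 * 86_400              # FLOPs in one petaFLOP/s-day (1e15 FLOP/s for 24h)
print(round(total_flops / pfs_day))  # → 3646
```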
Step 5: Final Estimation and Context
The total number of FLOPs required to train GPT-3 is approximately 3.15 × 10^23 FLOPs (i.e., 315 zettaFLOPs).
Sanity Check / Modern Context: To put this number into perspective, consider the hardware required. An NVIDIA V100 GPU (the standard accelerator at the time GPT-3 was trained) has a theoretical peak of about 125 TFLOPS for mixed-precision tensor operations. Even at 100% efficiency, a single V100 would need about 80 years to perform this much compute. In practice, due to communication overhead and sub-100% Model FLOPs Utilization (MFU), OpenAI used a cluster of roughly 10,000 V100 GPUs running for several weeks to complete the training.
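The single-GPU sanity check works out as follows (assuming the 125 TFLOPS peak figure and ideal 100% utilization):

```python
total_flops = 3.15e23
v100_peak = 125e12                      # ~125 TFLOPS mixed-precision tensor peak
seconds = total_flops / v100_peak       # ≈ 2.5e9 seconds of ideal compute
years = seconds / (365.25 * 24 * 3600)
print(round(years))                     # → 80
```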
Furthermore, by the Chinchilla scaling laws (roughly 20 training tokens per parameter), a 175B-parameter model would ideally be trained on roughly 3.6 trillion tokens, meaning GPT-3 was significantly "under-trained" by today's standards. Trained compute-optimally, it would require roughly 3.78 × 10^24 FLOPs, about 12× more compute.
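A quick check of the Chinchilla comparison, using the round 20-tokens-per-parameter heuristic (an approximation; the exact token figure depends on which fit of the scaling law is used):

```python
n_params = 175e9
actual_tokens = 300e9
chinchilla_tokens = 20 * n_params           # ~20 tokens/parameter ≈ 3.5e12
optimal = 6 * n_params * chinchilla_tokens  # ≈ 3.7e24 FLOPs
actual = 6 * n_params * actual_tokens       # ≈ 3.15e23 FLOPs
print(round(optimal / actual))              # → 12
```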




