Qwen: Qwen3.5 Flash
To estimate the total number of FLOPs (floating-point operations) required to train GPT-3, we can rely on the model configuration published in the original paper ("Language Models are Few-Shot Learners", Brown et al., 2020) and on the general scaling laws for Transformer models (Kaplan et al., 2020).
Here is the step-by-step reasoning and estimation:
1. Identify Key Model Parameters
First, we need to establish the two most critical variables for computing training cost: the number of model parameters ($N$) and the number of training tokens ($T$).
- Number of Parameters ($N$): GPT-3 was a 175-billion parameter model. $$N = 1.75 \times 10^{11}$$
- Number of Training Tokens ($T$): The model was trained on approximately 300 billion tokens. $$T = 3.0 \times 10^{11}$$
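These two quantities are all we need for the estimate. As a minimal sketch (the variable names are just illustrative; the values are the ones quoted above from the paper):

```python
# Key quantities for the GPT-3 training-compute estimate (Brown et al., 2020).
N = 175e9  # trainable parameters: 175 billion
T = 300e9  # training tokens: ~300 billion

print(f"N = {N:.3e} parameters")
print(f"T = {T:.3e} tokens")
```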
2. Determine the FLOPs per Token
The number of floating-point operations required to process a single token depends on the architecture's complexity during training.
- Forward Pass: In a standard Transformer, the forward pass involves matrix multiplications. A common approximation for the FLOPs required for the forward pass is roughly $2 \times N$ per token (1 multiplication and 1 addition per weight connection).
- Backward Pass: During training, the backward pass computes gradients with respect to both the activations and the weights. It is roughly twice as expensive as the forward pass, i.e. about $4 \times N$ FLOPs per token.
- Optimizer Overhead: Training also updates the weights with an optimizer (GPT-3 used Adam), which adds operations for its momentum and variance estimates. These occur once per parameter per optimizer step rather than once per token, so they are small relative to the forward and backward cost.
Theoretical Scaling Law: A widely accepted rule of thumb for the total training compute of a dense Transformer is: $$ \text{Total FLOPs} \approx 6 \times N \times T $$ Note: the factor of 6 combines the $\approx 2N$ forward and $\approx 4N$ backward FLOPs per token. It counts only the multiply-accumulates through the weight matrices, which makes it a slight underestimate, and it is the convention used in scaling-law work (Kaplan et al., 2020).
In practice, implementations may execute additional operations (activation recomputation, optimizer state updates, memory management), but these affect hardware utilization more than the headline compute figure: the compute reported for GPT-3 is essentially this $6NT$ model-FLOPs estimate.
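A minimal sketch of the per-token accounting, assuming the standard 2N-forward / 4N-backward split (the exact constants are heuristics, not values taken from the paper):

```python
# Rough per-token FLOP accounting for a dense Transformer with N parameters.
# Heuristic: forward ~ 2N (one multiply + one add per weight),
#            backward ~ 4N (about twice the forward pass).
N = 175e9  # GPT-3 parameter count

flops_forward_per_token = 2 * N
flops_backward_per_token = 4 * N
flops_per_token = flops_forward_per_token + flops_backward_per_token  # ~ 6N

print(f"forward  : {flops_forward_per_token:.2e} FLOPs/token")
print(f"backward : {flops_backward_per_token:.2e} FLOPs/token")
print(f"total    : {flops_per_token:.2e} FLOPs/token")  # ~1.05e12
```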
3. Perform the Calculation
Using the theoretical scaling law formula ($6 \times N \times T$):
$$ \text{FLOPs} \approx 6 \times (1.75 \times 10^{11}) \times (3.0 \times 10^{11}) $$ $$ \text{FLOPs} \approx 6 \times 5.25 \times 10^{22} $$ $$ \text{FLOPs} \approx 31.5 \times 10^{22} $$ $$ \text{FLOPs} \approx 3.15 \times 10^{23} $$
This yields approximately $3.15 \times 10^{23}$ FLOPs, i.e. about 315 zettaFLOPs (or roughly 3,650 petaFLOP/s-days of compute).
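The same arithmetic in code, including the conversion to petaFLOP/s-days (1 petaFLOP/s-day = $10^{15}$ FLOP/s $\times$ 86,400 s $\approx 8.64 \times 10^{19}$ FLOPs); this is a sketch of the calculation, not code from the paper:

```python
# Total training compute via the 6*N*T rule, with unit conversions.
N = 175e9                 # parameters
T = 300e9                 # training tokens
PFS_DAY = 1e15 * 86_400   # FLOPs in one petaFLOP/s-day (~8.64e19)

total_flops = 6 * N * T
print(f"total FLOPs     : {total_flops:.3e}")           # ~3.15e23
print(f"zettaFLOPs      : {total_flops / 1e21:.0f}")    # ~315
print(f"petaFLOP/s-days : {total_flops / PFS_DAY:.0f}") # ~3,650
```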
4. Compare with the Reported Training Compute
The theoretical calculation above is a baseline, and the $6N$ rule deliberately leaves a few things out:
- LayerNorm and Attention: The attention-score computations (which scale with context length) and the normalization layers add FLOPs beyond the plain weight-matrix multiplications, but for GPT-3's 2,048-token context these contribute only a few percent of the total.
- Optimizer State: The Adam optimizer performs extra floating-point operations per parameter update (momentum and variance tracking), but these happen once per batch rather than once per token and are negligible at this scale.
- Hardware Efficiency: Techniques such as activation recomputation increase the FLOPs actually executed on the hardware, but they do not change the model FLOPs that papers typically report.
The GPT-3 paper reports the total training compute as approximately $3.14 \times 10^{23}$ FLOPs, equivalently about 3,640 petaFLOP/s-days. This agrees with the $6NT$ estimate to within roughly 1%, so no large additional multiplier is needed on top of the factor of 6.
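To cross-check the paper's figure, the reported ~3,640 petaFLOP/s-days can be converted back into raw FLOPs; the 3,640 value comes from the paper, and the rest is just unit arithmetic:

```python
# Convert GPT-3's reported training compute (~3,640 petaFLOP/s-days) to FLOPs.
reported_pfs_days = 3_640
flops_per_pfs_day = 1e15 * 86_400   # 8.64e19 FLOPs per petaFLOP/s-day

reported_flops = reported_pfs_days * flops_per_pfs_day
print(f"reported compute : {reported_flops:.3e} FLOPs")  # ~3.14e23
```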
5. Final Estimate
Combining the parameter count and token count with the standard training-compute rule:
- Parameters: $1.75 \times 10^{11}$
- Tokens: $3.0 \times 10^{11}$
- Multiplier: $\approx 6$ (forward and backward passes through the weights)
$$ \text{Total FLOPs} \approx 6 \times (1.75 \times 10^{11}) \times (3.0 \times 10^{11}) $$ $$ \text{Total FLOPs} \approx 3.15 \times 10^{23} $$
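As a final sanity check, a short sketch comparing the theoretical estimate against the paper's reported figure (values as above; the comparison itself is only illustrative):

```python
# Compare the 6*N*T estimate with the figure reported in the GPT-3 paper.
theoretical = 6 * 175e9 * 300e9    # ~3.15e23 FLOPs
reported = 3_640 * 1e15 * 86_400   # ~3.14e23 FLOPs (3,640 petaFLOP/s-days)

print(f"6NT estimate : {theoretical:.3e}")
print(f"reported     : {reported:.3e}")
print(f"ratio        : {theoretical / reported:.3f}")  # ~1.00
```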
Conclusion
Based on the model size (175B parameters), the dataset size (300B tokens), and the standard $6NT$ scaling rule (forward and backward passes through the weights), the estimated total compute required to train GPT-3 is approximately:
$$ \mathbf{\approx 3.14 \times 10^{23} \text{ FLOPs}} $$ (roughly 314 zettaFLOPs, or about 3,640 petaFLOP/s-days, matching the figure reported in the GPT-3 paper)








