Based on the specifications reported in the original GPT-3 paper ("Language Models are Few-Shot Learners", Brown et al., 2020), here is a step-by-step estimation of the total FLOPs required to train it.
The Final Estimate
The estimated total compute required to train GPT-3 is approximately $3.15 \times 10^{23}$ FLOPs (roughly 315 zettaFLOPs), matching the $3.14 \times 10^{23}$ FLOPs reported by OpenAI.
Step-by-Step Reasoning
To arrive at this number, we need three primary inputs: the size of the model, the size of the dataset, and the number of floating-point operations each parameter contributes per token processed.
1. Identify the Model Size (Parameters, $N$)
GPT-3 has 175 billion parameters. A parameter is essentially a numeric value in the neural network that the model learns during training. The number of parameters dictates how much memory and computation is required for each weight update.
- $N = 175 \times 10^9$
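To make the parameter count concrete, here is a minimal sketch of the memory it implies (the 16-bit storage assumption is illustrative, not a detail reported for GPT-3):

```python
# Rough memory footprint implied by the parameter count alone.
# Storing each weight as a 16-bit float (2 bytes) is an assumption
# for illustration, not a figure from the GPT-3 paper.
N = 175e9  # parameters

weight_bytes = N * 2
print(f"fp16 weights alone: {weight_bytes / 1e9:.0f} GB")  # ~350 GB

# Training additionally needs gradients and optimizer state, which multiply
# this several times over -- one reason the model is sharded across many GPUs.
```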
2. Identify the Dataset Size (Tokens, $T$)
The model was trained on a massive amount of text data. While the raw Common Crawl text amounted to roughly 45 terabytes before filtering, the standard practice in these calculations is to count the number of tokens (sub-word chunks of text) the model actually processes during training; a rough sense of scale is sketched after the figures below.
- GPT-3 was trained on approximately 300 billion tokens.
- $T = 300 \times 10^9$
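For intuition, converting tokens to words with the common rule of thumb of about 0.75 English words per token (an assumption, not a figure from the paper):

```python
# Rough scale of the training set in words, using the ~0.75 words-per-token
# rule of thumb (an assumption, not a number from the GPT-3 paper).
T = 300e9  # training tokens
approx_words = T * 0.75
print(f"~{approx_words:.0e} words")  # on the order of 2e11, i.e. a couple hundred billion words
```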
3. Determine FLOPs per Parameter per Token
This is the most technical part of the estimation. A "FLOP" (Floating Point Operation) is a basic calculation (like multiplication or addition).
A standard decoder-only transformer (like GPT-3) requires roughly 6 FLOPs per parameter for every token processed during training.
- Why 6?
- Forward Pass: To push one token through the network, the data moves through a stack of large matrix multiplications. Each weight contributes one multiply and one add, so the forward pass costs roughly 2 FLOPs per parameter.
- Backward Pass: To compute the error and update the weights, the model must propagate gradients with respect to both the activations and the weights. This costs roughly twice the forward pass, or about 4 FLOPs per parameter.
- Caveats: The resulting $6N$ rule counts only the dense matrix multiplications involving the weights. Attention over the sequence, layer normalization, and similar operations add a small extra cost (and activation recomputation, if used, adds roughly another forward pass), which is why the factor is only "roughly" 6; a short numeric sketch follows this list.
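A minimal sketch of this accounting in Python (the 2-FLOPs-forward / 4-FLOPs-backward split is the standard approximation described above, not an exact per-layer count):

```python
# Back-of-envelope FLOPs per token for a dense transformer, using the
# standard 2-FLOPs-per-parameter forward-pass approximation.
N = 175e9  # GPT-3 parameters

forward_flops_per_token = 2 * N    # one multiply + one add per weight
backward_flops_per_token = 4 * N   # gradients w.r.t. activations and weights
train_flops_per_token = forward_flops_per_token + backward_flops_per_token

print(f"forward : {forward_flops_per_token:.2e} FLOPs/token")   # ~3.50e+11
print(f"backward: {backward_flops_per_token:.2e} FLOPs/token")  # ~7.00e+11
print(f"training: {train_flops_per_token:.2e} FLOPs/token")     # ~1.05e+12
```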
4. The Calculation
Using the standard formula for estimating transformer training cost: $$ \text{Total FLOPs} \approx 6 \times N \times T $$
Plugging in the values: $$ 6 \times (175 \times 10^9) \times (300 \times 10^9) $$
- Multiply the parameters and tokens: $175 \times 300 = 52,500$
- Multiply by the FLOPs-per-parameter-per-token factor: $52,500 \times 6 = 315,000$
- Add the exponents ($10^9 \times 10^9 = 10^{18}$): $$ 315,000 \times 10^{18} \text{ FLOPs} $$
This simplifies to $3.15 \times 10^{23}$ FLOPs, essentially the $3.14 \times 10^{23}$ FLOPs that OpenAI reports for GPT-3's training run.
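The same arithmetic in code, so the orders of magnitude are easy to check (these are just the constants from the steps above plugged into the $6NT$ rule):

```python
# Total training compute via the 6*N*T approximation.
N = 175e9                       # parameters
T = 300e9                       # training tokens
FLOPS_PER_PARAM_PER_TOKEN = 6   # ~2 forward + ~4 backward

total_flops = FLOPS_PER_PARAM_PER_TOKEN * N * T
print(f"{total_flops:.2e} FLOPs")                             # ~3.15e+23
print(f"{total_flops / (1e15 * 86400):.0f} petaFLOP/s-days")  # ~3646
```

Expressed in the unit the GPT-3 paper itself uses, this is about 3,650 petaFLOP/s-days, in line with the roughly 3,640 petaFLOP/s-days reported there.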
5. Verification via Hardware (Sanity Check)
To check that this estimate is plausible, we can compare it against a commonly cited hardware scenario. (GPT-3 was actually trained on NVIDIA V100 GPUs on a Microsoft-provided cluster; the A100 figures below come from throughput estimates published afterwards, but they work fine for a back-of-envelope check.)
- Hardware: 1,024 NVIDIA A100 GPUs.
- Training Time: Approximately 35 days.
- Total GPU-Hours: $1,024 \times 24 \text{ hours} \times 35 \text{ days} \approx 860,000 \text{ GPU-hours}$.
If we divide our estimated FLOPs ($3.15 \times 10^{23}$) by the total GPU-hours, we get the required throughput per GPU:
$$ \frac{3.15 \times 10^{23}}{860,000} \approx 3.7 \times 10^{17} \text{ FLOPs/GPU/hour} $$
An A100's peak FP16/BF16 tensor-core throughput is about 312 teraFLOP/s, or roughly $1.1 \times 10^{18}$ FLOPs per hour. The scenario therefore implies a sustained utilization of roughly a third of peak, which is realistic: GPUs never run at 100% of theoretical peak, and overheads from data loading and communication between GPUs keep large training runs well below it. The estimate is thus self-consistent, and it matches the $3.14 \times 10^{23}$ FLOPs (about 3,640 petaFLOP/s-days) reported for GPT-3 itself.
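The same sanity check in code (the GPU count, training duration, and A100 peak-throughput figure are the assumed values from this section, not officially reported training details):

```python
# Sanity check: implied per-GPU throughput and utilization under the assumed
# 1,024-A100 / ~35-day scenario (GPT-3 was actually trained on V100s; these
# numbers are only a plausibility check, not the real training configuration).
total_flops = 3.15e23

gpus = 1024
days = 35
gpu_hours = gpus * 24 * days                   # ~860,000 GPU-hours

flops_per_gpu_hour = total_flops / gpu_hours   # ~3.7e17
a100_peak_per_hour = 312e12 * 3600             # ~1.1e18 (BF16 tensor cores)

print(f"GPU-hours:          {gpu_hours:,.0f}")
print(f"FLOPs per GPU-hour: {flops_per_gpu_hour:.2e}")
print(f"Utilization:        {flops_per_gpu_hour / a100_peak_per_hour:.0%}")  # ~33%
```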