To estimate the total floating-point operations (FLOPs) required to train GPT-3, we start with the model's key specifications: it has 175 billion parameters and was trained on approximately 300 billion tokens. The calculation involves breaking down the FLOPs for the forward and backward passes during training.
Step 1: Understanding FLOPs per Token
For a transformer model like GPT-3, a common approximation is that the forward pass requires about 2 FLOPs per parameter per token: each parameter participates in one multiplication and one addition. The backward pass, which computes gradients with respect to both activations and weights, costs roughly twice the forward pass. Training therefore costs approximately 2 + 4 = 6 FLOPs per parameter per token.
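This accounting can be written out explicitly (a minimal sketch of the rule of thumb; the constant 6 hides architecture-dependent details):

```python
# Rule-of-thumb training cost per parameter per token.
FORWARD_FLOPS = 2                    # one multiply + one add per parameter
BACKWARD_FLOPS = 2 * FORWARD_FLOPS   # backward pass ~2x the forward pass
TRAIN_FLOPS_PER_PARAM_PER_TOKEN = FORWARD_FLOPS + BACKWARD_FLOPS

print(TRAIN_FLOPS_PER_PARAM_PER_TOKEN)  # → 6
```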
Step 2: Total FLOPs Formula
Given:
- \( N = 175 \times 10^9 \) parameters,
- \( D = 300 \times 10^9 \) tokens,
the total FLOPs can be estimated as:
\[
\text{Total FLOPs} \approx 6 \times N \times D
\]
Step 3: Plugging in the Numbers
\[
6 \times (175 \times 10^9) \times (300 \times 10^9) = 6 \times 175 \times 300 \times 10^{18}
\]
\[
175 \times 300 = 52{,}500
\]
\[
6 \times 52{,}500 = 315{,}000
\]
\[
315{,}000 \times 10^{18} = 3.15 \times 10^{23} \text{ FLOPs}
\]
Step 4: Refining with Attention Overhead
The above approximation assumes that the FLOPs per token scale linearly with parameters, but transformers also have attention mechanisms that add overhead, especially with longer sequences. For GPT-3, the sequence length is 2048 tokens, and the model has 96 layers with a model dimension of 12,288. A more detailed derivation accounting for attention yields:
\[
\text{Total FLOPs} = 6ND + 12\,L\,L_s\,d_{\text{model}}\,D
\]
where \( L = 96 \) is the number of layers, \( L_s = 2048 \) is the sequence length, and \( d_{\text{model}} = 12{,}288 \) is the model dimension. The second term adds about \( 8.7 \times 10^{21} \) FLOPs, increasing the total to approximately \( 3.24 \times 10^{23} \) FLOPs. This correction is minor (about 2.8%), so the simple estimate remains robust.
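Both terms of the refined estimate can be evaluated in plain Python (variable names here are chosen for readability, not taken from the original derivation):

```python
# Refined GPT-3 training-FLOPs estimate: 6*N*D plus the attention term.
N = 175e9        # parameters
D = 300e9        # training tokens
L = 96           # transformer layers
L_s = 2048       # sequence length
d_model = 12288  # model (hidden) dimension

attn_term = 12 * L * L_s * d_model * D
total = 6 * N * D + attn_term

print(f"attention term: {attn_term:.2e}")               # → 8.70e+21
print(f"total:          {total:.2e}")                   # → 3.24e+23
print(f"correction:     {attn_term / (6 * N * D):.1%}")  # → 2.8%
```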
Step 5: Conclusion
The total FLOPs required to train GPT-3 is on the order of \( 3.2 \times 10^{23} \). This aligns with widely cited estimates and reflects the massive computational scale of training large language models.