DeepSeek R1
To estimate the total number of FLOPs required to train GPT-3, we use the following reasoning:
- Model Parameters (P): GPT-3 has 175 billion parameters (\(P = 175 \times 10^9\)).
- Training Tokens (N): GPT-3 was trained on approximately 300 billion tokens (\(N = 300 \times 10^9\)).
- FLOPs per Token:
  - A forward pass requires ~\(2P\) FLOPs per token (each parameter involves a multiply-accumulate operation, which counts as 2 FLOPs).
  - A backward pass (for gradients) is roughly twice as expensive as the forward pass, requiring ~\(4P\) FLOPs per token.
  - Total FLOPs per token: \(2P + 4P = 6P\).
- Total Training FLOPs: Multiply the FLOPs per token by the total number of tokens:
  \[
  \text{FLOPs} = 6 \times P \times N = 6 \times 175 \times 10^9 \times 300 \times 10^9 = 3.15 \times 10^{23}\,\text{FLOPs}.
  \]
Final Answer:
The total number of FLOPs required to train GPT-3 is approximately \(3.15 \times 10^{23}\) FLOPs.
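
As a quick sanity check, the same estimate can be reproduced in a few lines of Python (a minimal sketch using the rounded parameter and token counts from above):

```python
# Back-of-the-envelope FLOPs estimate for training GPT-3,
# using the ~6 * P * N approximation described above.

P = 175e9   # model parameters (175 billion)
N = 300e9   # training tokens (300 billion)

flops_per_token = 6 * P            # ~2P forward + ~4P backward
total_flops = flops_per_token * N

print(f"Total training FLOPs: {total_flops:.2e}")  # -> 3.15e+23
```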






