GLM 5 Turbo
To estimate the total number of FLOPs (Floating Point Operations) required to train GPT-3, we can use the standard theoretical framework established by OpenAI's scaling laws for neural language models.
The final estimated theoretical compute required to train GPT-3 is approximately $3.15 \times 10^{23}$ FLOPs.
Here is the step-by-step reasoning:
Step 1: Identify the Model Size ($N$)
GPT-3 is famous for having 175 billion parameters. Therefore, our parameter count $N = 175 \times 10^9$.
Step 2: Identify the Training Data Size ($D$)
According to the GPT-3 paper, the model was trained on a filtered dataset of roughly 300 billion tokens. Therefore, our data size $D = 300 \times 10^9$.
Step 3: Apply the Standard Compute Formula
In deep learning, specifically for Transformer architectures, the theoretical number of FLOPs required for a full training run is calculated using a well-established approximation: Total FLOPs $\approx 6 \times N \times D$
Step 4: Understand Where the "6" Comes From
The multiplier of 6 accounts for the computational cost of processing a single token during training:
- Forward Pass ($2ND$): Each parameter participates in one multiply-accumulate, i.e. 2 FLOPs (1 multiplication + 1 addition). Passing one token through the network therefore costs roughly $2N$ FLOPs; over all $D$ tokens, $2ND$.
- Backward Pass ($4ND$): Backpropagation computes two gradients at each layer: one with respect to the activations ($2ND$) and one with respect to the weights ($2ND$). It therefore costs roughly twice as much as the forward pass.
- Optimizer Update (negligible): Applying the gradients with an optimizer like Adam (which maintains extra state) costs only a constant number of FLOPs per parameter per step, i.e. $O(N)$ per update, which is vanishingly small next to the matrix multiplications and is dropped from the approximation.
Adding these together: $2ND + 4ND = 6ND$.
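This decomposition can be checked with a few lines of plain arithmetic; nothing beyond the headline $N$ and $D$ is assumed:

```python
# FLOP accounting for one full training run under the 6*N*D approximation.
N = 175e9        # parameters
D = 300e9        # training tokens

forward  = 2 * N * D   # 1 multiply + 1 add per parameter per token
backward = 4 * N * D   # roughly 2x the cost of the forward pass

total = forward + backward
assert total == 6 * N * D
print(f"{total:.2e}")  # -> 3.15e+23
```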
Step 5: Execute the Calculation
Now we plug the GPT-3 numbers into the formula:
- FLOPs = $6 \times (175 \times 10^9) \times (300 \times 10^9)$
- FLOPs = $6 \times 175 \times 300 \times 10^{18}$
- FLOPs = $1,050 \times 300 \times 10^{18}$
- FLOPs = $315,000 \times 10^{18}$
- FLOPs = $3.15 \times 10^{23}$
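As a sanity check, the same total can be expressed in petaFLOP/s-days, the unit OpenAI uses for compute reporting; it lands close to the roughly 3,640 PF-days quoted for GPT-3:

```python
# Express the total in petaFLOP/s-days: the FLOPs delivered by a machine
# sustaining 1e15 FLOP/s for 24 hours.
total_flops = 6 * 175e9 * 300e9   # 3.15e23
pf_day = 1e15 * 86400             # FLOPs in one petaFLOP/s-day
print(f"{total_flops / pf_day:.0f} petaFLOP/s-days")  # -> 3646 petaFLOP/s-days
```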
Step 6: Contextualize and Add Caveats
While $3.15 \times 10^{23}$ FLOPs is the standard theoretical answer, a perfectly accurate real-world estimate would require slight adjustments for the following nuances:
- Embedding Parameters: The $175$ billion figure includes word embeddings. Embedding lookups are table reads rather than dense matrix multiplications, so they contribute fewer FLOPs per parameter than the rest of the network. In practice this difference is small enough that the $6ND$ heuristic is widely used as-is.
- Hardware Utilization (MFU): The $3.15 \times 10^{23}$ figure counts only useful "model FLOPs." Because of communication overhead between GPUs, memory bottlenecks, non-matrix-multiply operations, and techniques like activation recomputation, real clusters achieve a Model FLOPs Utilization (MFU) well below 100%, so the raw compute capacity consumed (peak FLOP/s $\times$ wall-clock time) is typically 2x to 3x the theoretical figure.
- Chinchilla Scaling: By modern "Chinchilla" scaling laws, GPT-3 was significantly under-trained: the compute-optimal token count for a 175B-parameter model would be closer to 3.5 trillion tokens (roughly 20 tokens per parameter). OpenAI instead trained a very large model on comparatively little data, so the compute estimate above correctly uses the actual $D = 300$ billion tokens.
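Two of these caveats can be made concrete with a short sketch. The cluster size, GPU model, and MFU below are illustrative assumptions, not GPT-3's actual training setup (which used V100-era hardware):

```python
# Illustrative follow-ups to the caveats above. Assumed (hypothetical) setup:
# 1024 A100 GPUs at 312 TFLOP/s BF16 peak, running at 35% MFU.
model_flops = 6 * 175e9 * 300e9          # 3.15e23 useful FLOPs

# (a) Wall-clock time implied by a given Model FLOPs Utilization.
peak_per_gpu = 312e12                    # FLOP/s, A100 BF16 peak
n_gpus, mfu = 1024, 0.35
days = model_flops / (peak_per_gpu * n_gpus * mfu) / 86400
print(f"~{days:.0f} days on {n_gpus} GPUs at {mfu:.0%} MFU")  # -> ~33 days

# (b) Chinchilla heuristic: ~20 tokens per parameter is compute-optimal.
N = 175e9
chinchilla_tokens = 20 * N               # 3.5e12 tokens
print(f"Chinchilla-optimal data: {chinchilla_tokens:.1e} tokens")
```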