Estimate Complexity
GPT-5.1-Codex-Mini's response to the 'Estimate Complexity' challenge.
Prompt
Estimate the total number of FLOPs required to train GPT-3 without searching online. Explain your reasoning step by step.
Response
To estimate the total FLOPs required to train GPT-3, we can proceed with the following reasoning steps, using only basic information about the model’s size and the amount of data it trained on:
- Model Size: GPT-3 has approximately 175 billion parameters, i.e. \(1.75 \times 10^{11}\) parameters.
- Training Dataset Size: GPT-3 was trained on roughly 300 billion tokens, which gives a sense of how many token predictions the model processed during training.
- FLOPs per Token: Training a transformer involves both a forward pass and a backward pass for each token. A commonly used rule of thumb is that each parameter accounts for about 6 FLOPs per token when both passes are counted (matrix multiplications, activations, etc.). This is a simplified approximation, but it is widely used for rough compute estimates.
- Multiply Everything Together: Total FLOPs \(\approx\) (number of parameters) \(\times\) (number of tokens) \(\times\) (FLOPs per parameter per token) \(= 1.75 \times 10^{11} \times 3 \times 10^{11} \times 6\).
- Compute the Result: Multiplying parameters by tokens gives \(1.75 \times 3 = 5.25\), so \(5.25 \times 10^{22}\) parameter-token interactions. Multiplying by 6 FLOPs per parameter per token gives \(5.25 \times 10^{22} \times 6 = 31.5 \times 10^{22} = 3.15 \times 10^{23}\) FLOPs.
So, a reasonable estimate is that training GPT-3 required on the order of \(3 \times 10^{23}\) floating-point operations.
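As a quick check, the same arithmetic can be run as a short Python sketch. The figures below are the approximate values quoted above (175 billion parameters, 300 billion tokens), and the factor of 6 is the usual forward-plus-backward rule of thumb rather than an exact operation count.

```python
# Rough training-compute estimate using the 6 * N * D rule of thumb.
# All figures are the approximate GPT-3 numbers quoted above, not exact values.

N_PARAMS = 175e9            # ~1.75e11 parameters
N_TOKENS = 300e9            # ~3e11 training tokens
FLOPS_PER_PARAM_TOKEN = 6   # forward + backward pass approximation

total_flops = FLOPS_PER_PARAM_TOKEN * N_PARAMS * N_TOKENS
print(f"Estimated training compute: {total_flops:.2e} FLOPs")
# -> Estimated training compute: 3.15e+23 FLOPs
```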
About GPT-5.1-Codex-Mini
Specifications
- Provider: OpenAI
- Released: 2025-11-13
- Size: XLARGE
- Context: 400,000 tokens