Estimate Complexity
GPT-5 Codex's response to the 'Estimate Complexity' challenge.
Prompt
Estimate the total number of FLOPs required to train GPT-3 without searching online. Explain your reasoning step by step.
Response
Step 1 – Note the basic training quantities
- GPT‑3’s largest model has about \(P \approx 1.75 \times 10^{11}\) trainable parameters (175 billion).
- Public descriptions indicate it was trained on roughly \(T \approx 3.0 \times 10^{11}\) tokens (≈300 billion).
Step 2 – Relate parameters, tokens, and FLOPs
For dense transformers, almost all computation comes from matrix multiplies.
A practical rule (derived by counting multiply–adds in the attention and MLP blocks, and accounting for the backward pass) is:
\[ \text{Training FLOPs} \approx 6 \times (\text{number of parameters}) \times (\text{number of tokens}) \]
Here’s the intuition:
- Forward pass: Each parameter contributes one multiply and one add → about \(2\) FLOPs per parameter per token.
- Backward pass: Computing gradients with respect to activations and weights requires roughly twice the forward cost (two more matrix multiplies of comparable size).
So total ≈ \(3\) times the forward cost → \(3 \times 2 = 6\) FLOPs per parameter per token.
(The softmax, layer norm, and activation costs are much smaller and get absorbed into this constant.)
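For concreteness, here is a minimal Python sketch of this rule, with the 2 forward + 4 backward FLOPs per parameter per token broken out as in the reasoning above (the function and variable names are illustrative, not from the original response):

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Estimate dense-transformer training FLOPs with the 6 * N * T rule."""
    forward = 2   # one multiply + one add per parameter per token
    backward = 4  # backward pass costs roughly twice the forward pass
    return (forward + backward) * n_params * n_tokens
```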
Step 3 – Plug in the GPT‑3 values
\[ \text{FLOPs} \approx 6 \times (1.75 \times 10^{11}) \times (3.0 \times 10^{11}) \]
\[ = 6 \times 5.25 \times 10^{22} = 31.5 \times 10^{22} \approx 3.15 \times 10^{23} \ \text{FLOPs} \]
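The same arithmetic as a self-contained check (values as assumed in Step 1):

```python
# GPT-3 scale as assumed above: 175B parameters, 300B training tokens.
n_params = 1.75e11
n_tokens = 3.0e11
print(f"{6 * n_params * n_tokens:.2e}")  # prints 3.15e+23
```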
Step 4 – Interpret
So, training GPT‑3 to completion requires on the order of \(3 \times 10^{23}\) floating-point operations.
That estimate agrees with more detailed per-layer FLOP accounting for transformers to within tens of percent, which is sufficient for order-of-magnitude comparisons.
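To give the number some scale, here is a rough back-of-the-envelope conversion to accelerator time, assuming an A100-class GPU with a peak of about 312 TFLOP/s in BF16 and roughly 40% sustained utilization (both hardware figures are assumptions for illustration, not part of the original estimate):

```python
# Illustrative only: the throughput and utilization below are assumptions.
total_flops = 3.15e23
peak_flops_per_sec = 312e12   # assumed A100-class peak BF16 throughput
utilization = 0.40            # assumed sustained fraction of peak
gpu_seconds = total_flops / (peak_flops_per_sec * utilization)
gpu_years = gpu_seconds / (3600 * 24 * 365)
print(f"~{gpu_years:.0f} single-GPU years")  # roughly 80 GPU-years
```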
About GPT-5 Codex
Specifications
- Provider: OpenAI
- Released: 2025-09-23
- Size: XLARGE