Estimate Complexity
GPT-5 Codex's response to the 'Estimate Complexity' challenge.
Prompt
Estimate the total number of FLOPs required to train GPT-3 without searching online. Explain your reasoning step by step.
Response
Step 1 – Note the basic training quantities
- GPT‑3’s largest model has about \(P \approx 1.75 \times 10^{11}\) trainable parameters (175 billion).
- Public descriptions indicate it was trained on roughly \(T \approx 3.0 \times 10^{11}\) tokens (≈300 billion).
Step 2 – Relate parameters, tokens, and FLOPs
For dense transformers, almost all computation comes from matrix multiplies.
A practical rule (derived by counting multiply–adds in the attention and MLP blocks, and accounting for the backward pass) is:
\[ \text{Training FLOPs} \approx 6 \times (\text{number of parameters}) \times (\text{number of tokens}) \]
Here’s the intuition:
- Forward pass: Each parameter contributes one multiply and one add → about \(2\) FLOPs per parameter per token.
- Backward pass: Computing gradients with respect to activations and weights requires roughly twice the forward cost (two more matrix multiplies of comparable size).
So total ≈ \(3\) times the forward cost → \(3 \times 2 = 6\) FLOPs per parameter per token.
(The softmax, layer norm, and activation costs are much smaller and get absorbed into this constant.)
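For concreteness, here is a minimal Python sketch of this rule, with the 2 forward + 4 backward FLOPs per parameter per token broken out as in the reasoning above (the function and variable names are illustrative, not from the original response):

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Estimate dense-transformer training FLOPs with the 6 * N * T rule."""
    forward = 2   # one multiply + one add per parameter per token
    backward = 4  # backward pass costs roughly twice the forward pass
    return (forward + backward) * n_params * n_tokens
```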
Step 3 – Plug in the GPT‑3 values
\[ \text{FLOPs} \approx 6 \times (1.75 \times 10^{11}) \times (3.0 \times 10^{11}) \]
\[ = 6 \times 5.25 \times 10^{22} = 31.5 \times 10^{22} \approx 3.15 \times 10^{23} \ \text{FLOPs} \]
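The same arithmetic as a self-contained check (values as assumed in Step 1):

```python
# GPT-3 scale as assumed above: 175B parameters, 300B training tokens.
n_params = 1.75e11
n_tokens = 3.0e11
print(f"{6 * n_params * n_tokens:.2e}")  # prints 3.15e+23
```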
Step 4 – Interpret
So, training GPT‑3 to completion requires on the order of \(3 \times 10^{23}\) floating-point operations.
That estimate agrees with more detailed per-layer FLOP accounting for transformers to within tens of percent, which is sufficient for order-of-magnitude comparisons.
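To give the number some scale, here is a rough back-of-the-envelope conversion to accelerator time, assuming an A100-class GPU with a peak of about 312 TFLOP/s in BF16 and roughly 40% sustained utilization (both hardware figures are assumptions for illustration, not part of the original estimate):

```python
# Illustrative only: the throughput and utilization below are assumptions.
total_flops = 3.15e23
peak_flops_per_sec = 312e12   # assumed A100-class peak BF16 throughput
utilization = 0.40            # assumed sustained fraction of peak
gpu_seconds = total_flops / (peak_flops_per_sec * utilization)
gpu_years = gpu_seconds / (3600 * 24 * 365)
print(f"~{gpu_years:.0f} single-GPU years")  # roughly 80 GPU-years
```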
About GPT-5 Codex
Specifications
- Provider: OpenAI
- Released: 2025-09-23
- Size: XLARGE