Estimate Complexity
GPT-5.1 Codex Max's response to the 'Estimate Complexity' challenge.
Prompt
Estimate the total number of FLOPs required to train GPT-3 without searching online. Explain your reasoning step by step.
Response
What you need for a back-of-the-envelope estimate is:
- How big the model is. The published “GPT-3” has $N \approx 1.75\times10^{11}$ trainable weights. For a dense transformer, each weight participates in one multiply–add per token in the forward pass.
- How much data it sees. OpenAI reports training on the order of $3\times10^{11}$ tokens.
- The cost per weight per token. A multiply–add is two floating-point operations, and the backward pass costs roughly twice the forward pass, which is where the common rule of thumb for transformers comes from: total training cost is ~6 FLOPs per parameter per token.
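A minimal sketch of where that 6 comes from, assuming 2 FLOPs per multiply–add and a backward pass that costs roughly twice the forward pass:

```python
FLOPS_PER_MULTIPLY_ADD = 2                               # one multiply + one add
forward_flops_per_param = 1 * FLOPS_PER_MULTIPLY_ADD     # each weight used once per token going forward
backward_flops_per_param = 2 * forward_flops_per_param   # backward pass ~2x the forward cost
print(forward_flops_per_param + backward_flops_per_param)  # -> 6 FLOPs per parameter per token
```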
You can justify that 6 more explicitly by summing over a single layer. For GPT-3’s configuration ($d_{\text{model}} = 12{,}288$, $d_{\text{ff}} = 49{,}152$, 96 layers, context length $L = 2{,}048$):
- Q/K/V/output projections: $4d_{\text{model}}^2 \approx 6.0\times10^8$ multiply–adds
- Feed-forward: $2d_{\text{model}}d_{\text{ff}} \approx 1.2\times10^9$ multiply–adds
- Attention scores and weighted sum over values: $2Ld_{\text{model}} \approx 5.0\times10^7$ multiply–adds
Summing and converting multiply–adds to FLOPs gives $\sim3.7\times10^9$ FLOPs per layer per token. Across 96 layers that’s $\sim3.6\times10^{11}$ FLOPs for a forward pass on one token; including the backward pass multiplies that by ≈3, giving $\sim1.1\times10^{12}$ FLOPs per token. Multiply by $3\times10^{11}$ tokens and you get $\sim3\times10^{23}$ FLOPs.
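To make that arithmetic concrete, here is a small Python sketch of the same layer-by-layer estimate, using the GPT-3 configuration quoted above and the ≈3× forward-to-training multiplier; the variable names are just for illustration.

```python
# GPT-3 configuration (published values).
d_model, d_ff = 12_288, 49_152
n_layers, seq_len = 96, 2_048
tokens = 3e11                        # training tokens

# Multiply-adds per layer per token.
attn_proj = 4 * d_model**2           # Q/K/V/output projections
ffn       = 2 * d_model * d_ff       # two feed-forward matmuls
attn_mix  = 2 * seq_len * d_model    # attention scores + weighted sum over values

macs_per_layer = attn_proj + ffn + attn_mix
flops_forward_per_token = 2 * macs_per_layer * n_layers   # 2 FLOPs per multiply-add
flops_train_per_token = 3 * flops_forward_per_token       # backward pass ~2x forward

total_flops = flops_train_per_token * tokens
print(f"forward per token: {flops_forward_per_token:.2e}")  # ~3.6e11
print(f"train per token:   {flops_train_per_token:.2e}")    # ~1.1e12
print(f"total training:    {total_flops:.2e}")              # ~3e23
```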
Using the simpler $6NT$ rule of thumb: $6 \times 1.75\times10^{11} \times 3\times10^{11} \approx 3.15\times10^{23}$ floating-point operations.
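The rule-of-thumb version is a one-line cross-check under the same assumed parameter and token counts:

```python
N = 1.75e11   # trainable parameters
T = 3e11      # training tokens
print(f"{6 * N * T:.2e} FLOPs")  # -> 3.15e+23
```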
About GPT-5.1 Codex Max
Specifications
- Provider: OpenAI
- Released: 2025-12-04
- Size: XLARGE