Estimate Complexity
GPT-5.1 Codex Max's response to the 'Estimate Complexity' challenge.
Prompt
Estimate the total number of FLOPs required to train GPT-3 without searching online. Explain your reasoning step by step.
Response
What you need for a back-of-the-envelope estimate is:
- How big the model is. The published “GPT-3” has $N \approx 1.75\times10^{11}$ trainable weights. For a dense transformer, each weight participates in one multiply–add per token in the forward pass.
- How much data it sees. OpenAI reports training on the order of $3\times10^{11}$ tokens.
- The cost per weight per token. A multiply–add is two floating-point operations, and the backward pass costs roughly twice the forward pass, which is where the common rule of thumb for transformers comes from: total training cost is ~6 FLOPs per parameter per token.
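A minimal sketch of where that 6 comes from, assuming 2 FLOPs per multiply–add and a backward pass that costs roughly twice the forward pass:

```python
FLOPS_PER_MULTIPLY_ADD = 2                               # one multiply + one add
forward_flops_per_param = 1 * FLOPS_PER_MULTIPLY_ADD     # each weight used once per token going forward
backward_flops_per_param = 2 * forward_flops_per_param   # backward pass ~2x the forward cost
print(forward_flops_per_param + backward_flops_per_param)  # -> 6 FLOPs per parameter per token
```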
You can justify that 6 more explicitly by summing over a single layer. For GPT-3’s configuration ($d_{\text{model}} = 12{,}288$, $d_{\text{ff}} = 49{,}152$, 96 layers, context length $L = 2{,}048$):
- Q/K/V/output projections: $4d_{\text{model}}^2 \approx 6.0\times10^8$ multiply–adds
- Feed-forward: $2d_{\text{model}}d_{\text{ff}} \approx 1.2\times10^9$ multiply–adds
- Attention scores and weighted sum over values: $2Ld_{\text{model}} \approx 5.0\times10^7$ multiply–adds
Summing and converting multiply–adds to FLOPs gives $\sim3.7\times10^9$ FLOPs per layer per token. Across 96 layers that’s $\sim3.6\times10^{11}$ FLOPs for a forward pass on one token; including the backward pass multiplies that by ≈3, giving $\sim1.1\times10^{12}$ FLOPs per token. Multiply by $3\times10^{11}$ tokens and you get $\sim3\times10^{23}$ FLOPs.
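To make that arithmetic concrete, here is a small Python sketch of the same layer-by-layer estimate, using the GPT-3 configuration quoted above and the ≈3× forward-to-training multiplier; the variable names are just for illustration.

```python
# GPT-3 configuration (published values).
d_model, d_ff = 12_288, 49_152
n_layers, seq_len = 96, 2_048
tokens = 3e11                        # training tokens

# Multiply-adds per layer per token.
attn_proj = 4 * d_model**2           # Q/K/V/output projections
ffn       = 2 * d_model * d_ff       # two feed-forward matmuls
attn_mix  = 2 * seq_len * d_model    # attention scores + weighted sum over values

macs_per_layer = attn_proj + ffn + attn_mix
flops_forward_per_token = 2 * macs_per_layer * n_layers   # 2 FLOPs per multiply-add
flops_train_per_token = 3 * flops_forward_per_token       # backward pass ~2x forward

total_flops = flops_train_per_token * tokens
print(f"forward per token: {flops_forward_per_token:.2e}")  # ~3.6e11
print(f"train per token:   {flops_train_per_token:.2e}")    # ~1.1e12
print(f"total training:    {total_flops:.2e}")              # ~3e23
```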
Using the simpler $6NT$ rule of thumb: $6 \times 1.75\times10^{11} \times 3\times10^{11} \approx 3.15\times10^{23}$ floating-point operations.
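The rule-of-thumb version is a one-line cross-check under the same assumed parameter and token counts:

```python
N = 1.75e11   # trainable parameters
T = 3e11      # training tokens
print(f"{6 * N * T:.2e} FLOPs")  # -> 3.15e+23
```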
About GPT-5.1 Codex Max
Specifications
- Provider: OpenAI
- Released: 2025-12-04
- Size: XLARGE