Estimate Complexity
GPT-5.1 Chat's response to the 'Estimate Complexity' challenge.
Prompt
Estimate the total number of FLOPs required to train GPT-3 without searching online. Explain your reasoning step by step.
Response
Here is a self‑contained estimate based only on publicly known GPT‑3 facts and first‑principles reasoning.
Goal: Estimate total training FLOPs for GPT‑3 (175B parameters) without looking anything up now.
I will use approximate, widely shared scaling heuristics.
1. Parameters
GPT‑3 largest model has about 175 billion parameters.
2. FLOPs per token for a transformer
A common back‑of‑envelope rule:
FLOPs per forward pass per token ≈ 2 × N, where N is the parameter count (one multiply and one add per parameter).
The backward pass costs about 2× the forward pass, i.e. ≈ 4 × N.
So forward + backward ≈ 2N + 4N = 6 × N.
Thus per token:
6 × 175B = 1.05 × 10¹² FLOPs ≈ 1e12 FLOPs per token.
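As a quick sanity check, here is a minimal Python sketch of this 6N heuristic; the constants and variable names are illustrative only.

```python
# Back-of-envelope FLOPs per training token for a dense transformer:
# forward ≈ 2N (one multiply + one add per parameter),
# backward ≈ 2 × forward, so total ≈ 6N.
N_PARAMS = 175e9  # GPT-3: ~175 billion parameters

forward_flops = 2 * N_PARAMS        # ~3.5e11
backward_flops = 2 * forward_flops  # ~7.0e11
flops_per_token = forward_flops + backward_flops

print(f"FLOPs per token ≈ {flops_per_token:.2e}")  # 1.05e+12
```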
3. Total training tokens
GPT‑3 reportedly trained on roughly 300 billion tokens; without looking it up, I assume:
Total tokens T ≈ 3 × 10¹¹
(That’s the commonly cited scale.)
4. Multiply FLOPs per token by number of tokens
Total training FLOPs ≈ (1e12 FLOPs per token) × (3e11 tokens)
= 3e23 FLOPs
5. Adjust for overhead and exact architecture
Attention layers, embedding lookups, and other non‑matmul operations add overhead beyond the pure parameter multiplies.
A common multiplier is 2–3×.
Take multiplier ≈ 2.5
Total FLOPs ≈ 2.5 × 3e23 = 7.5e23 FLOPs
Rounded to an order of magnitude:
≈ 1e24 FLOPs
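Putting steps 3–5 together, here is a short sketch of the full arithmetic. The 300B token count and the 2.5× overhead multiplier are the assumptions made above, not measured values; the unrounded figures come out slightly above the rounded numbers in the text.

```python
# End-to-end back-of-envelope estimate of GPT-3 training compute.
N_PARAMS = 175e9   # step 1: ~175B parameters
TOKENS = 300e9     # step 3: ~300B training tokens (assumed)
OVERHEAD = 2.5     # step 5: rough 2-3x multiplier for attention, embeddings, etc.

flops_per_token = 6 * N_PARAMS         # step 2: ~1.05e12 (rounded to 1e12 in the text)
base_flops = flops_per_token * TOKENS  # step 4: ~3.15e23
total_flops = base_flops * OVERHEAD    # step 5: ~7.9e23

print(f"base ≈ {base_flops:.3g} FLOPs, with overhead ≈ {total_flops:.3g} FLOPs")
# ~3e23 before overhead, ~8e23 after, i.e. ~1e24 at order of magnitude
```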
Final estimate
GPT‑3 (175B) required roughly 1×10²⁴ floating‑point operations to train.
About GPT-5.1 Chat
Specifications
- Provider: OpenAI
- Released: 2025-11-13
- Size: Large
- Context: 128,000 tokens