Estimate Complexity
GPT-5 Mini's response to the 'Estimate Complexity' challenge.
Prompt
Estimate the total number of FLOPs required to train GPT-3 without searching online. Explain your reasoning step by step.
Response
Goal: estimate the total floating-point operations (FLOPs) required to train GPT‑3 (the 175B‑parameter model). I’ll state assumptions, derive the FLOPs per token, and multiply by the number of training tokens.
Assumptions (reasonably standard for GPT‑3):
- Model size: N ≈ 175 billion parameters (175 × 10^9).
- Transformer shape consistent with that size: L = 96 layers, model width d = 12,288 (these give ≈175B params; I’ll use d and L only to justify a FLOP count check).
- Context / sequence length used in training: S = 2048.
- Number of training tokens: T ≈ 300 billion tokens (3 × 10^11). (This is an approximate dataset scale typical of large‑language‑model training at this size; changing it scales the total compute linearly.)
- Backpropagation / training overhead: backward pass + weight gradient computation ≈ 2× the forward cost, so total training cost ≈ 3× forward cost (forward + backward + a small extra for optimizer).
Step 1 — Rough forward cost per token (per layer)
Consider one transformer layer acting on a sequence of S tokens, width d:
- Q/K/V linear projections: 3 × (S × d) × d matmuls → about 6·S·d^2 FLOPs (counting multiply+add as 2 FLOPs).
- Output projection from attention: (S × d) × d → 2·S·d^2 FLOPs. → Total for projections ≈ 8·S·d^2.
- Attention core (Q·K^T and attention-weighted sum): two big matmuls of shapes (S × d) × (d × S) and (S × S) × (S × d) → about 4·S^2·d FLOPs.
- Feed‑forward (two linear layers d → 4d → d): 2·S·d·(4d) + 2·S·(4d)·d = 16·S·d^2 FLOPs.
So per layer forward FLOPs for a sequence of length S: F_forward_layer ≈ 8·S·d^2 + 4·S^2·d + 16·S·d^2 = 24·S·d^2 + 4·S^2·d.
Divide by S to get forward FLOPs per token per layer: f_layer_token ≈ 24·d^2 + 4·S·d.
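As a small sanity check, the per-token, per-layer count can be written as a short Python sketch (the function name and structure are illustrative, not from any reference implementation):

```python
def forward_flops_per_token_per_layer(d: int, s: int) -> int:
    """Approximate forward FLOPs per token for one transformer layer.

    Counts each multiply+add as 2 FLOPs; ignores layer norms, biases,
    softmax, and the embedding/unembedding matmuls.
    """
    projections = 8 * d * d    # Q/K/V projections (6*d^2) plus attention output projection (2*d^2)
    attention = 4 * s * d      # Q·K^T and the attention-weighted sum, amortized per token
    feed_forward = 16 * d * d  # two linear layers, d -> 4d -> d
    return projections + attention + feed_forward
```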
Step 2 — Plug in numbers (d = 12,288, S = 2048, L = 96)
- d^2 = 12,288^2 = 150,994,944 ≈ 1.50995 × 10^8.
- 24·d^2 ≈ 24 × 1.50995e8 ≈ 3.6239 × 10^9.
- 4·S·d = 4 × 2048 × 12,288 = 100,663,296 ≈ 1.0066 × 10^8.
- So f_layer_token ≈ 3.6239e9 + 1.0066e8 ≈ 3.7246 × 10^9 FLOPs per token per layer.
Multiply by L = 96 layers:
- forward FLOPs per token for whole model ≈ 96 × 3.7246e9 ≈ 3.5756 × 10^11 ≈ 3.58 × 10^11 FLOPs.
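These intermediate values are easy to reproduce in a few lines, using the same assumed d, S, and L:

```python
d, s, layers = 12_288, 2_048, 96
per_layer = 24 * d * d + 4 * s * d       # ≈ 3.7246e9 FLOPs per token, per layer
forward_per_token = layers * per_layer   # ≈ 3.5756e11 FLOPs per token, whole model
print(f"{per_layer:.4e}  {forward_per_token:.4e}")
```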
Step 3 — Include backward pass / training overhead
- Training (forward + backward + grads) ≈ 3 × forward FLOPs (typical rule of thumb).
- So training FLOPs per token ≈ 3 × 3.58 × 10^11 ≈ 1.07 × 10^12 FLOPs per token.
Step 4 — Multiply by total tokens
- T = 3 × 10^11 tokens → Total FLOPs ≈ 1.07 × 10^12 × 3 × 10^11 ≈ 3.2 × 10^23 FLOPs.
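Putting Steps 2–4 together in one self-contained snippet (300B tokens is the assumed dataset size from above):

```python
forward_per_token = 96 * (24 * 12_288**2 + 4 * 2_048 * 12_288)  # ≈ 3.58e11 FLOPs (Step 2)
training_per_token = 3 * forward_per_token                      # ≈ 1.07e12 FLOPs per token (Step 3)
total_flops = training_per_token * 300e9                        # assumed 300B training tokens (Step 4)
print(f"total training FLOPs ~ {total_flops:.2e}")              # prints ~3.22e+23
```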
Shortcut / simple rule
A common simple approximation falls out of the algebra above: training FLOPs ≈ 6 × N_params × N_tokens (the forward pass costs ≈ 2 × N_params FLOPs per token, and training is about 3× the forward cost, giving ≈ 6 × N_params FLOPs per token). Using that:
- 6 × 175×10^9 × 300×10^9 = 6 × 175 × 300 × 10^18 = 315,000 × 10^18 = 3.15 × 10^23 FLOPs, which is the same ballpark.
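The shortcut is a one-liner with the assumed values:

```python
n_params, tokens = 175e9, 300e9
print(f"6 * N * T ~ {6 * n_params * tokens:.2e} FLOPs")  # prints ~3.15e+23
```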
Final estimate
- ≈ 3 × 10^23 FLOPs (more precisely ≈ 3.1–3.3 × 10^23 FLOPs) to train GPT‑3 (175B) on ≈300B tokens with S=2048.
Caveats and sensitivity
- If the dataset contained more tokens (e.g., 400–500B), the total scales linearly (e.g., 500B tokens → ≈5.2 × 10^23 FLOPs); see the sketch after this list.
- If you use a shorter context length during training, the S-dependent attention term (4·S^2·d per layer) shrinks, but at this model width the 24·d^2 term dominates, so S matters relatively little unless S grows much larger than d.
- Optimizer choice, precision (fp16 vs fp32), and implementation efficiencies (fused kernels, reduced-memory backprop techniques) can change actual FLOP counts and wall‑clock compute by tens of percent.
- This is an estimate of raw FLOPs; real training time also depends on parallelization, communication overhead, and hardware efficiency.
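To make the token-count sensitivity concrete, a tiny loop over assumed dataset sizes using the 6·N·T rule:

```python
for t in (300e9, 400e9, 500e9):
    print(f"{t / 1e9:.0f}B tokens -> {6 * 175e9 * t:.2e} FLOPs")
# 300B -> 3.15e+23, 400B -> 4.20e+23, 500B -> 5.25e+23
```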
Summary: roughly 3 × 10^23 floating-point operations to train GPT‑3 (175B) under the stated assumptions.
About GPT-5 Mini
Specifications
- Provider: OpenAI
- Released: 2025-08-07
- Size: MEDIUM