Estimate Complexity
Polaris Alpha's response to the 'Estimate Complexity' challenge.
Prompt
Estimate the total number of FLOPs required to train GPT-3 without searching online. Explain your reasoning step by step.
Response
Great question—and nice constraint.
I’ll estimate the total training FLOPs for GPT-3 from first principles, using only generally known architectural/training patterns (no external lookup). I’ll walk through:
- Model assumptions
- FLOPs per token
- FLOPs per step
- Total training tokens
- Multiply it out
- Sanity checks
We’ll target the 175B-parameter GPT-3-like model.
1. Model assumptions
We need reasonable approximations:
- Parameters: ~175B
- Architecture: decoder-only Transformer
- Layers: ~96
- Hidden size: ~12,288
- Attention heads: e.g., 96 (128-dim per head)
- Context length: 2048 tokens
- Vocabulary size: ~50k
- Data: on the order of 300 billion tokens (common rough figure)
- Training objective: standard autoregressive LM
- Precision: assume training in something like mixed precision (but we’ll count math ops independently of dtype).
We’ll count a multiply-add as 2 FLOPs (1 multiply + 1 add). Some conventions count a fused multiply-add as 1 FLOP; I’ll be explicit where it matters.
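For concreteness, here is the configuration above collected as constants (a minimal sketch; all values are the rough assumptions stated above, not exact published figures):

```python
# Rough GPT-3 (175B) configuration, as assumed above (approximate values).
D_MODEL  = 12_288       # hidden size d_model
N_LAYERS = 96           # Transformer layers
SEQ_LEN  = 2_048        # context length L
VOCAB    = 50_000       # approximate vocabulary size
N_PARAMS = 175e9        # ~175B parameters
N_TOKENS = 300e9        # ~300B training tokens
FLOPS_PER_MULT_ADD = 2  # count a multiply-add as 2 FLOPs
```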
2. FLOPs per token (Transformer forward)
We’ll approximate only the dominant costs:
Per Transformer layer and per token, the main pieces:
- QKV projections:
- Input dimension: d_model
- Three linear maps: Q, K, V
- Cost: 3 * d_model^2 multiplications (ignoring bias)
- FLOPs (with mult+add): ≈ 2 * 3 * d_model^2 = 6 d_model^2
- Attention scores and weighting: For sequence length L:
- Compute QK^T:
- Per head: multiply an (L × d_k) matrix by a (d_k × L) matrix → ~L^2 * d_k multiplications
- Summed over all heads: L^2 * d_model
- FLOPs ≈ 2 * L^2 * d_model
- Apply softmax (cheaper; we’ll ignore or fold into constants).
- Multiply by V:
- Same order: L^2 * d_model
- FLOPs ≈ 2 * L^2 * d_model
Total attention mixing ≈ 4 * L^2 * d_model FLOPs per layer (forward).
- Output projection from concatenated heads:
- One linear: d_model x d_model
- FLOPs ≈ 2 * d_model^2
- MLP (feed-forward): Typical config: d_ff ≈ 4 * d_model
- First linear: d_model → d_ff
- FLOPs ≈ 2 * d_model * d_ff
- Activation (e.g., GeLU): relatively small; we can approximate or fold in.
- Second linear: d_ff → d_model
- FLOPs ≈ 2 * d_ff * d_model
Total MLP FLOPs ≈ 4 * d_model * d_ff ≈ 4 * d_model * (4 d_model) = 16 d_model^2
Now sum the dominant per-layer forward FLOPs. Per-token and per-sequence costs are easy to mix up, so let’s be precise: the projection and MLP costs above are per token, while QK^T and AV are sequence-wide operations whose cost scales with L^2. So let’s compute everything per sequence of length L.
Per layer, per sequence (forward):
- QKV projections: 6 d_model^2 * L
- Output projection: 2 d_model^2 * L
- MLP: 16 d_model^2 * L
- Attention mixes:
- QK^T: 2 L^2 d_model
- AV: 2 L^2 d_model
- Total attention mixing: 4 L^2 d_model
Total per layer per sequence (forward):
F_layer_fwd ≈ (6 + 2 + 16) d_model^2 L + 4 L^2 d_model = 24 d_model^2 L + 4 L^2 d_model
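As a quick sketch, the per-layer forward formula just derived can be written as a small helper (the function and variable names are mine, purely illustrative):

```python
def layer_fwd_flops(d_model: int, seq_len: int) -> tuple[float, float]:
    """Approximate forward FLOPs for one Transformer layer over one sequence,
    split into the two terms of F_layer_fwd ≈ 24·d_model²·L + 4·L²·d_model."""
    dense = 24 * d_model**2 * seq_len     # QKV, output projection, MLP
    attn_mix = 4 * seq_len**2 * d_model   # QK^T and attention·V
    return dense, attn_mix
    # Evaluated with GPT-3-like numbers in the next step.
```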
Now plug GPT-3-like numbers:
- d_model = 12,288
- L = 2048
- Layers N = 96
First term: 24 d_model^2 L
- d_model^2 = 12,288^2 ≈ 1.51 × 10^8
- 24 * d_model^2 ≈ 24 * 1.51e8 ≈ 3.62e9
- Multiply by L = 2048:
- 3.62e9 * 2.048e3 ≈ 7.41e12
So ≈ 7.4 × 10^12 FLOPs from the dense parts per layer per sequence.
Second term: 4 L^2 d_model
- L^2 = 2048^2 = 4,194,304 ≈ 4.19e6
- 4 * L^2 = 1.68e7
- Multiply by d_model (1.2288e4):
- 1.68e7 * 1.2288e4 ≈ 2.06e11
So attention mixing ≈ 2.1 × 10^11 FLOPs per layer per sequence.
Compare:
- Dense (24 d_model^2 L): ~7.4e12
- Attention mixing (4 L^2 d_model): ~2.1e11
Dense terms dominate for this configuration (large d_model, moderate L).
Total per layer per sequence (forward): ≈ 7.4e12 + 0.21e12 ≈ 7.6e12 FLOPs
Now with N = 96 layers:
- F_fwd_per_seq ≈ 96 * 7.6e12 ≈ 7.3e14 FLOPs
Add output softmax / embedding:
- Final projection: d_model × vocab ≈ 12,288 * 50,000 ≈ 6.14e8 weights
- FLOPs per token for logits: ≈ 2 * 6.14e8
- Per sequence (L tokens): ≈ 2 * 6.14e8 * 2048 ≈ 2.5e12 FLOPs
This is small compared to 7.3e14, so we can ignore it at this scale.
So forward pass per 2048-token sequence: ≈ 7 × 10^14 FLOPs (rounding slightly).
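Reproducing the arithmetic above in a few lines (a sketch; the constants are the assumed GPT-3-like values):

```python
d_model, n_layers, seq_len, vocab = 12_288, 96, 2_048, 50_000

per_layer = 24 * d_model**2 * seq_len + 4 * seq_len**2 * d_model
fwd_layers = n_layers * per_layer              # all 96 layers
fwd_logits = 2 * d_model * vocab * seq_len     # final vocabulary projection
fwd_per_seq = fwd_layers + fwd_logits

print(f"per layer       ≈ {per_layer:.2e}")    # ≈ 7.63e+12
print(f"96 layers       ≈ {fwd_layers:.2e}")   # ≈ 7.32e+14
print(f"logits          ≈ {fwd_logits:.2e}")   # ≈ 2.52e+12
print(f"forward per seq ≈ {fwd_per_seq:.2e}")  # ≈ 7.35e+14
```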
3. Include backward pass
Training requires:
- Forward pass
- Backprop through all layers
The backward pass costs roughly 2× the forward pass (gradients are needed with respect to both activations and weights), so a common rule of thumb is that total training compute ≈ 3× the forward cost.
I’ll use 3× forward for “forward + backward.”
Per sequence (train step on one sequence): F_train_per_seq ≈ 3 * 7 × 10^14 ≈ 2.1 × 10^15 FLOPs
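In code (the 3× multiplier is the rough rule of thumb above, not an exact figure):

```python
fwd_per_seq = 7e14               # rounded forward FLOPs per 2048-token sequence (from above)
train_per_seq = 3 * fwd_per_seq  # forward + backward ≈ 3× forward
print(f"training FLOPs per sequence ≈ {train_per_seq:.1e}")  # ≈ 2.1e+15
```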
4. Relate sequences to total tokens
Assume total training tokens T_total ≈ 300 billion = 3 × 10^11 tokens.
Each sequence has L = 2048 tokens:
- Number of sequences: S = T_total / L ≈ (3 × 10^11) / 2048
2048 = 2.048 × 10^3
S ≈ (3 × 10^11) / (2.048 × 10^3) ≈ (3 / 2.048) × 10^8 ≈ 1.46 × 10^8 sequences
5. Total training FLOPs
Total FLOPs:
F_total ≈ F_train_per_seq * S ≈ (2.1 × 10^15) * (1.46 × 10^8)
First multiply mantissas: 2.1 * 1.46 ≈ 3.066
Add exponents: 10^15 * 10^8 = 10^23
So: F_total ≈ 3.1 × 10^23 FLOPs
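The final multiplication as a check (values carried over from the previous steps):

```python
train_per_seq = 2.1e15   # training FLOPs per sequence (step above)
total_tokens  = 3e11     # ~300B training tokens
seq_len       = 2_048    # tokens per sequence

n_sequences = total_tokens / seq_len
total_flops = train_per_seq * n_sequences

print(f"sequences   ≈ {n_sequences:.2e}")   # ≈ 1.46e+08
print(f"total FLOPs ≈ {total_flops:.1e}")   # ≈ 3.1e+23
```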
6. Sanity checks and refinement
We got ≈ 3 × 10^23 FLOPs.
Let’s cross-check with a simpler rule of thumb derived from the same reasoning (still without looking anything up):
- A common heuristic: training FLOPs ≈ 6 * N_params * N_tokens. Explanation: the forward pass costs about 2 FLOPs per parameter per token (one multiply-add per weight), and the backward pass adds roughly twice that again, giving ≈ 6 FLOPs per parameter per token.
Using:
- N_params ≈ 1.75 × 10^11
- N_tokens ≈ 3 × 10^11
Then: F_total ≈ 6 * (1.75 × 10^11) * (3 × 10^11) ≈ 6 * 5.25 × 10^22 ≈ 3.15 × 10^23 FLOPs
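The same heuristic as a one-liner (using the assumed parameter and token counts):

```python
n_params = 175e9  # ~175B parameters
n_tokens = 300e9  # ~300B training tokens
print(f"6 * N * D ≈ {6 * n_params * n_tokens:.2e} FLOPs")  # ≈ 3.15e+23
```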
This aligns extremely well with our more detailed layer-by-layer estimate (3.1 × 10^23).
So both independent approaches converge on:
Answer:
A reasonable estimate for the total compute required to train GPT-3 (175B) is on the order of:
- ~3 × 10^23 floating-point operations.
If you’d like, I can break down what that implies in terms of GPU-days for specific hardware.
About Polaris Alpha
Specifications
- Provider: OpenRouter
- Released: 2025-11-06
- Size: LARGE
- Context: 256,000 tokens