Estimate Complexity
Polaris Alpha's response to the 'Estimate Complexity' challenge.
Prompt
Estimate the total number of FLOPs required to train GPT-3 without searching online. Explain your reasoning step by step.
Response
Great question—and nice constraint.
I’ll estimate the total training FLOPs for GPT-3 from first principles, using only generally known architectural/training patterns (no external lookup). I’ll walk through:
- Model assumptions
- FLOPs per token
- FLOPs per step
- Total training tokens
- Multiply it out
- Sanity checks
We’ll target the 175B-parameter GPT-3-like model.
1. Model assumptions
We need reasonable approximations:
- Parameters: ~175B
- Architecture: decoder-only Transformer
- Layers: ~96
- Hidden size: ~12,288
- Attention heads: e.g., 96 (128-dim per head)
- Context length: 2048 tokens
- Vocabulary size: ~50k
- Data: on the order of 300 billion tokens (common rough figure)
- Training objective: standard autoregressive LM
- Precision: assume training in something like mixed precision (but we’ll count math ops independently of dtype).
We’ll count a multiply-add as 2 FLOPs (1 multiply + 1 add). Some conventions count a fused multiply-add as 1 FLOP; I’ll be explicit where it matters.
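For concreteness, here is the configuration above collected as constants (a minimal sketch; all values are the rough assumptions stated above, not exact published figures):

```python
# Rough GPT-3 (175B) configuration, as assumed above (approximate values).
D_MODEL  = 12_288       # hidden size d_model
N_LAYERS = 96           # Transformer layers
SEQ_LEN  = 2_048        # context length L
VOCAB    = 50_000       # approximate vocabulary size
N_PARAMS = 175e9        # ~175B parameters
N_TOKENS = 300e9        # ~300B training tokens
FLOPS_PER_MULT_ADD = 2  # count a multiply-add as 2 FLOPs
```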
2. FLOPs per token (Transformer forward)
We’ll approximate only the dominant costs:
Per Transformer layer and per token, the main pieces:
- QKV projections:
- Input dimension: d_model
- Three linear maps: Q, K, V
- Cost: 3 * d_model^2 multiplications (ignoring bias)
- FLOPs (with mult+add): ≈ 2 * 3 * d_model^2 = 6 d_model^2
- Attention scores and weighting: For sequence length L:
- Compute QK^T:
- Per head: multiply an (L × d_k) matrix by a (d_k × L) matrix → ~L^2 * d_k multiplications
- Summed over all heads: L^2 * d_model
- FLOPs ≈ 2 * L^2 * d_model
- Apply softmax (cheaper; we’ll ignore or fold into constants).
- Multiply by V:
- Same order: L^2 * d_model
- FLOPs ≈ 2 * L^2 * d_model
Total attention mixing ≈ 4 * L^2 * d_model FLOPs per layer (forward).
- Output projection from concatenated heads:
- One linear: d_model x d_model
- FLOPs ≈ 2 * d_model^2
- MLP (feed-forward): Typical config: d_ff ≈ 4 * d_model
- First linear: d_model → d_ff
- FLOPs ≈ 2 * d_model * d_ff
- Activation (e.g., GeLU): relatively small; we can approximate or fold in.
- Second linear: d_ff → d_model
- FLOPs ≈ 2 * d_ff * d_model
Total MLP FLOPs ≈ 4 * d_model * d_ff ≈ 4 * d_model * (4 d_model) = 16 d_model^2
Now sum the dominant per-layer forward FLOPs. Per-token and per-sequence costs are easy to mix up, so let’s be precise: the projection and MLP costs above are per token, while QK^T and AV are sequence-wide operations whose cost scales with L^2. So let’s compute everything per sequence of length L.
Per layer, per sequence (forward):
- QKV projections: 6 d_model^2 * L
- Output projection: 2 d_model^2 * L
- MLP: 16 d_model^2 * L
- Attention mixes:
- QK^T: 2 L^2 d_model
- AV: 2 L^2 d_model
- Total attention mixing: 4 L^2 d_model
Total per layer per sequence (forward):
F_layer_fwd ≈ (6 + 2 + 16) d_model^2 L + 4 L^2 d_model = 24 d_model^2 L + 4 L^2 d_model
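As a quick sketch, the per-layer forward formula just derived can be written as a small helper (the function and variable names are mine, purely illustrative):

```python
def layer_fwd_flops(d_model: int, seq_len: int) -> tuple[float, float]:
    """Approximate forward FLOPs for one Transformer layer over one sequence,
    split into the two terms of F_layer_fwd ≈ 24·d_model²·L + 4·L²·d_model."""
    dense = 24 * d_model**2 * seq_len     # QKV, output projection, MLP
    attn_mix = 4 * seq_len**2 * d_model   # QK^T and attention·V
    return dense, attn_mix
    # Evaluated with GPT-3-like numbers in the next step.
```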
Now plug GPT-3-like numbers:
- d_model = 12,288
- L = 2048
- Layers N = 96
First term: 24 d_model^2 L
- d_model^2 = 12,288^2 ≈ 1.51 × 10^8
- 24 * d_model^2 ≈ 24 * 1.51e8 ≈ 3.62e9
- Multiply by L = 2048:
- 3.62e9 * 2.048e3 ≈ 7.41e12
So ≈ 7.4 × 10^12 FLOPs from the dense parts per layer per sequence.
Second term: 4 L^2 d_model
- L^2 = 2048^2 = 4,194,304 ≈ 4.19e6
- 4 * L^2 = 1.68e7
- Multiply by d_model (1.2288e4):
- 1.68e7 * 1.2288e4 ≈ 2.06e11
So attention mixing ≈ 2.1 × 10^11 FLOPs per layer per sequence.
Compare:
- Dense (24 d_model^2 L): ~7.4e12
- Attention mixing (4 L^2 d_model): ~2.1e11
Dense terms dominate for this configuration (large d_model, moderate L).
Total per layer per sequence (forward): ≈ 7.4e12 + 0.21e12 ≈ 7.6e12 FLOPs
Now with N = 96 layers:
- F_fwd_per_seq ≈ 96 * 7.6e12 ≈ 7.3e14 FLOPs
Add output softmax / embedding:
- Final projection: d_model × vocab ≈ 12,288 * 50,000 ≈ 6.14e8 weights
- FLOPs per token for logits: ≈ 2 * 6.14e8
- Per sequence (L tokens): ≈ 2 * 6.14e8 * 2048 ≈ 2.5e12 FLOPs
This is small compared to 7.3e14, so we can ignore it at this scale.
So forward pass per 2048-token sequence: ≈ 7 × 10^14 FLOPs (rounding slightly).
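Reproducing the arithmetic above in a few lines (a sketch; the constants are the assumed GPT-3-like values):

```python
d_model, n_layers, seq_len, vocab = 12_288, 96, 2_048, 50_000

per_layer = 24 * d_model**2 * seq_len + 4 * seq_len**2 * d_model
fwd_layers = n_layers * per_layer              # all 96 layers
fwd_logits = 2 * d_model * vocab * seq_len     # final vocabulary projection
fwd_per_seq = fwd_layers + fwd_logits

print(f"per layer       ≈ {per_layer:.2e}")    # ≈ 7.63e+12
print(f"96 layers       ≈ {fwd_layers:.2e}")   # ≈ 7.32e+14
print(f"logits          ≈ {fwd_logits:.2e}")   # ≈ 2.52e+12
print(f"forward per seq ≈ {fwd_per_seq:.2e}")  # ≈ 7.35e+14
```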
3. Include backward pass
Training requires:
- Forward pass
- Backprop through all layers
The backward pass costs roughly 2× the forward pass (gradients are needed with respect to both activations and weights), so a common rule of thumb is that total training compute ≈ 3× the forward cost.
I’ll use 3× forward for “forward + backward.”
Per sequence (train step on one sequence): F_train_per_seq ≈ 3 * 7 × 10^14 ≈ 2.1 × 10^15 FLOPs
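In code (the 3× multiplier is the rough rule of thumb above, not an exact figure):

```python
fwd_per_seq = 7e14               # rounded forward FLOPs per 2048-token sequence (from above)
train_per_seq = 3 * fwd_per_seq  # forward + backward ≈ 3× forward
print(f"training FLOPs per sequence ≈ {train_per_seq:.1e}")  # ≈ 2.1e+15
```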
4. Relate sequences to total tokens
Assume total training tokens T_total ≈ 300 billion = 3 × 10^11 tokens.
Each sequence has L = 2048 tokens:
- Number of sequences: S = T_total / L ≈ (3 × 10^11) / 2048
2048 = 2.048 × 10^3
S ≈ (3 × 10^11) / (2.048 × 10^3) ≈ (3 / 2.048) × 10^8 ≈ 1.46 × 10^8 sequences
5. Total training FLOPs
Total FLOPs:
F_total ≈ F_train_per_seq * S ≈ (2.1 × 10^15) * (1.46 × 10^8)
First multiply mantissas: 2.1 * 1.46 ≈ 3.066
Add exponents: 10^15 * 10^8 = 10^23
So: F_total ≈ 3.1 × 10^23 FLOPs
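The final multiplication as a check (values carried over from the previous steps):

```python
train_per_seq = 2.1e15   # training FLOPs per sequence (step above)
total_tokens  = 3e11     # ~300B training tokens
seq_len       = 2_048    # tokens per sequence

n_sequences = total_tokens / seq_len
total_flops = train_per_seq * n_sequences

print(f"sequences   ≈ {n_sequences:.2e}")   # ≈ 1.46e+08
print(f"total FLOPs ≈ {total_flops:.1e}")   # ≈ 3.1e+23
```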
6. Sanity checks and refinement
We got ≈ 3 × 10^23 FLOPs.
Let’s cross-check with a simpler rule of thumb derived from the same reasoning (still without looking anything up):
- A common heuristic: training FLOPs ≈ 6 * N_params * N_tokens. Explanation: the forward pass costs about 2 FLOPs per parameter per token (one multiply-add per weight), and the backward pass adds roughly twice that again, giving ≈ 6 FLOPs per parameter per token.
Using:
- N_params ≈ 1.75 × 10^11
- N_tokens ≈ 3 × 10^11
Then: F_total ≈ 6 * (1.75 × 10^11) * (3 × 10^11) ≈ 6 * 5.25 × 10^22 ≈ 3.15 × 10^23 FLOPs
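The same heuristic as a one-liner (using the assumed parameter and token counts):

```python
n_params = 175e9  # ~175B parameters
n_tokens = 300e9  # ~300B training tokens
print(f"6 * N * D ≈ {6 * n_params * n_tokens:.2e} FLOPs")  # ≈ 3.15e+23
```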
This aligns extremely well with our more detailed layer-by-layer estimate (3.1 × 10^23).
So both independent approaches converge on:
Answer:
A reasonable estimate for the total compute required to train GPT-3 (175B) is on the order of:
- ~3 × 10^23 floating-point operations.
If you’d like, I can break down what that implies in terms of GPU-days for specific hardware.
About Polaris Alpha
Specifications
- Provider: OpenRouter
- Released: 2025-11-06
- Size: LARGE
- Context: 256,000 tokens