Polaris Alpha's response to the 'Estimate Complexity' challenge.
Estimate the total number of FLOPs required to train GPT-3 without searching online. Explain your reasoning step by step.
Great question—and nice constraint.
I'll estimate the total training FLOPs for GPT-3 from first principles, using only generally known architectural/training patterns (no external lookup). I'll walk through the per-layer forward cost, the training (backward-pass) multiplier, the total token count, and finally a cross-check with a simple rule of thumb.
We’ll target the 175B-parameter GPT-3-like model.
We need a few reasonable approximations.
We’ll estimate multiply-adds as 2 FLOPs (1 mult + 1 add). Some conventions count a fused multiply-add as 1 FLOP; I’ll be explicit where it matters.
We'll approximate only the dominant costs (the large matrix multiplies), ignoring layer norms, biases, activation functions, and softmax overhead.
Per Transformer layer and per token, the main pieces:
- Q/K/V projections: three d_model × d_model matmuls ≈ 6 d_model^2 FLOPs.
- Attention output projection: one d_model × d_model matmul ≈ 2 d_model^2 FLOPs.
- Attention mixing (QK^T and AV): ≈ 4 * L^2 * d_model FLOPs per layer (forward).
- MLP: ≈ 4 * d_model * d_ff ≈ 4 * d_model * (4 d_model) = 16 d_model^2 FLOPs.
Now sum dominant per-layer forward FLOPs (per token vs. sequence-level is easy to mix up, so let’s be precise):
Careful: QK^T and AV are sequence-wide operations; their cost is per sequence, not per token. So let's compute per sequence of length L, converting to per-token figures only if needed.
Per layer, per sequence (forward):
- Q/K/V + output projections: ≈ (6 + 2) d_model^2 L
- MLP: ≈ 16 d_model^2 L
- Attention mixing: ≈ 4 L^2 d_model
Total per layer per sequence (forward):
F_layer_fwd ≈ (6 + 2 + 16) d_model^2 L + 4 L^2 d_model = 24 d_model^2 L + 4 L^2 d_model
Now plug in GPT-3-like numbers: d_model ≈ 12288, L = 2048, N = 96 layers.
First term: 24 d_model^2 L ≈ 24 × (12288)^2 × 2048 ≈ 7.4 × 10^12 FLOPs from the dense parts per layer per sequence.
Second term: 4 L^2 d_model ≈ 4 × (2048)^2 × 12288 ≈ 2.1 × 10^11 FLOPs of attention mixing per layer per sequence.
Compare: 7.4 × 10^12 (dense) vs. 2.1 × 10^11 (attention mixing). The dense terms dominate for this configuration (large d_model, moderate L).
Total per layer per sequence (forward): ≈ 7.4e12 + 0.21e12 ≈ 7.6e12 FLOPs
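As a quick sanity check on the arithmetic so far, here is a minimal Python sketch (variable names are mine; the constants are the GPT-3-like values above) that evaluates the per-layer forward formula:

```python
# Per-layer forward FLOPs for a GPT-3-like layer (assumed values).
d_model = 12288   # hidden size
L = 2048          # sequence length

dense = 24 * d_model**2 * L        # QKV + output projection + MLP matmuls
attn_mix = 4 * L**2 * d_model      # QK^T and AV, summed over heads

print(f"dense terms:     {dense:.2e}")              # ~7.4e+12
print(f"attention mix:   {attn_mix:.2e}")           # ~2.1e+11
print(f"per layer total: {dense + attn_mix:.2e}")   # ~7.6e+12
```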
Now with N = 96 layers: 96 × 7.6 × 10^12 ≈ 7.3 × 10^14 FLOPs per sequence (forward).
Add output softmax / embedding: roughly 2 × d_model × V per token with vocabulary V ≈ 5 × 10^4, i.e. ≈ 1.2 × 10^9 FLOPs per token, or ≈ 2.5 × 10^12 per sequence.
This is small compared to 7.3e14, so we can ignore it at this scale.
So forward pass per 2048-token sequence: ≈ 7 × 10^14 FLOPs (rounding slightly).
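Extending the sketch to the full forward pass per sequence (the ~50k vocabulary size for the logits term is my assumption):

```python
# Forward FLOPs per 2048-token sequence (assumed GPT-3-like values).
d_model, L, n_layers, vocab = 12288, 2048, 96, 50_000  # vocab size is an assumption

per_layer = 24 * d_model**2 * L + 4 * L**2 * d_model   # ~7.6e12
logits = 2 * d_model * vocab * L                       # output embedding/softmax matmul, ~2.5e12

fwd_per_seq = n_layers * per_layer + logits
print(f"forward per sequence: {fwd_per_seq:.2e}")      # ~7.3e+14
```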
Training requires a forward pass and a backward pass through the model for every token. Backprop costs roughly 2× the forward pass (gradients with respect to both activations and weights).
I’ll use 3× forward for “forward + backward” to be conservative.
Per sequence (train step on one sequence): F_train_per_seq ≈ 3 * 7 × 10^14 ≈ 2.1 × 10^15 FLOPs
Assume total training tokens T_total ≈ 300 billion = 3 × 10^11 tokens.
Each sequence has L = 2048 ≈ 2.048 × 10^3 tokens, so:
S ≈ (3 × 10^11) / (2.048 × 10^3) ≈ (3 / 2.048) × 10^8 ≈ 1.46 × 10^8 sequences
Total FLOPs:
F_total ≈ F_train_per_seq * S ≈ (2.1 × 10^15) * (1.46 × 10^8)
First multiply mantissas: 2.1 * 1.46 ≈ 3.066
Add exponents: 10^15 * 10^8 = 10^23
So: F_total ≈ 3.1 × 10^23 FLOPs
We got ≈ 3 × 10^23 FLOPs.
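Putting the whole chain into one short script, under the same assumptions (24 d_model^2 L + 4 L^2 d_model per layer, 96 layers, 3× forward for training, ~300B tokens); this is a sketch of the estimate above, not GPT-3's actual training accounting:

```python
# End-to-end training-FLOP estimate for a GPT-3-like model (assumed values).
d_model, L, n_layers = 12288, 2048, 96
total_tokens = 300e9        # ~300B training tokens (assumed)
train_multiplier = 3        # forward + backward ~ 3x forward

fwd_per_seq = n_layers * (24 * d_model**2 * L + 4 * L**2 * d_model)
train_per_seq = train_multiplier * fwd_per_seq
n_sequences = total_tokens / L

print(f"total training FLOPs: {train_per_seq * n_sequences:.2e}")   # ~3.2e+23
```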
Let's cross-check with a simpler rule of thumb that falls out of the same reasoning (still not "looking up" anything).
Using F_total ≈ 6 × N_params × T_tokens (about 2 FLOPs per parameter per token for the forward pass, times 3 for forward + backward), with N_params ≈ 1.75 × 10^11 and T_tokens ≈ 3 × 10^11:
Then: F_total ≈ 6 * (1.75 × 10^11) * (3 × 10^11) ≈ 6 * 5.25 × 10^22 ≈ 3.15 × 10^23 FLOPs
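For completeness, the same rule of thumb as a one-line check:

```python
# Rule of thumb: ~6 FLOPs per parameter per training token.
print(f"{6 * 175e9 * 300e9:.2e}")   # 3.15e+23
```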
This aligns extremely well with our more detailed layer-by-layer estimate (3.1 × 10^23).
So both independent approaches converge on roughly 3 × 10^23 FLOPs.
Answer:
A reasonable estimate for the total compute required to train GPT-3 (175B) is on the order of 3 × 10^23 FLOPs.
If you’d like, I can break down what that implies in terms of GPU-days for specific hardware.