GPT-5 Mini's response to the 'Estimate Complexity' challenge.
Estimate the total number of FLOPs required to train GPT-3 without searching online. Explain your reasoning step by step.
Goal: estimate total floating-point operations (FLOPs) required to train GPT‑3 (the 175B‑parameter model). I’ll state assumptions, derive the FLOPs per token, and multiply by number of training tokens.
Assumptions (reasonably standard for GPT‑3 175B):

- N_params ≈ 175 × 10^9 parameters
- L = 96 transformer layers
- d = 12,288 hidden width
- S = 2,048 context (sequence) length
- Training tokens ≈ 300 × 10^9
- A multiply + add counts as 2 FLOPs
Step 1 — Rough forward cost per token (per layer)

Consider one transformer layer acting on a sequence of S tokens, width d:
Q/K/V linear projections: 3 × (S × d) × d matmuls → about 6·S·d^2 FLOPs (counting multiply+add as 2 FLOPs).
Output projection from attention: (S × d) × d → 2·S·d^2 FLOPs. → Total for projections ≈ 8·S·d^2.
Attention core (Q·K^T and attention-weighted sum): two big matmuls of shapes (S × d) × (d × S) and (S × S) × (S × d) → about 4·S^2·d FLOPs.
Feed‑forward (two linear layers d → 4d → d): 2·S·d·(4d) + 2·S·(4d)·d = 16·S·d^2 FLOPs.
So per layer forward FLOPs for a sequence of length S: F_forward_layer ≈ 8·S·d^2 + 4·S^2·d + 16·S·d^2 = 24·S·d^2 + 4·S^2·d.
Divide by S to get forward FLOPs per token per layer: f_layer_token ≈ 24·d^2 + 4·S·d.
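The per-layer accounting above can be sketched directly (a minimal check; the function names are illustrative, not from the original):

```python
# Forward FLOPs for one transformer layer on a sequence of S tokens, width d,
# counting a multiply+add as 2 FLOPs (same convention as the derivation above).
def forward_flops_per_layer(S: int, d: int) -> int:
    qkv = 6 * S * d * d          # Q/K/V linear projections
    out_proj = 2 * S * d * d     # attention output projection
    attn_core = 4 * S * S * d    # Q.K^T and attention-weighted sum
    ffn = 16 * S * d * d         # feed-forward d -> 4d -> d
    return qkv + out_proj + attn_core + ffn

def forward_flops_per_token_per_layer(S: int, d: int) -> float:
    # Dividing by S reproduces the closed form 24*d^2 + 4*S*d.
    return forward_flops_per_layer(S, d) / S

# GPT-3 values: S = 2048, d = 12288
print(f"{forward_flops_per_token_per_layer(2048, 12288):.3e}")  # ≈ 3.72e9 FLOPs
```

Note that the 4·S·d attention term (≈ 1.0 × 10^8 here) is only a few percent of the 24·d^2 term at this sequence length.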
Step 2 — Plug in numbers (d = 12,288, S = 2048, L = 96)

Per token per layer: 24·d^2 ≈ 24 × (12,288)^2 ≈ 3.62 × 10^9 and 4·S·d ≈ 4 × 2048 × 12,288 ≈ 1.0 × 10^8, so f_layer_token ≈ 3.72 × 10^9 FLOPs.

Multiply by L = 96 layers: forward FLOPs per token ≈ 96 × 3.72 × 10^9 ≈ 3.6 × 10^11. This is consistent with the rule of thumb forward ≈ 2 × N_params ≈ 3.5 × 10^11.
Step 3 — Include backward pass / training overhead

The backward pass costs roughly 2× the forward pass, so training FLOPs per token ≈ 3 × forward ≈ 3 × 3.6 × 10^11 ≈ 1.07 × 10^12.
Step 4 — Multiply by total tokens

With ≈ 300 × 10^9 training tokens: total ≈ 1.07 × 10^12 × 3 × 10^11 ≈ 3.2 × 10^23 FLOPs.
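The four steps above can be chained into one short calculation (a sketch under the stated assumptions; the 300B token count is the commonly cited GPT-3 figure):

```python
# GPT-3 model shape and training-set size (assumptions from the estimate above).
d, S, L = 12288, 2048, 96
tokens = 300e9  # ~300 billion training tokens

fwd_per_token = L * (24 * d**2 + 4 * S * d)  # forward FLOPs per token, ~3.6e11
train_per_token = 3 * fwd_per_token          # backward pass ~2x forward, so 3x total
total = train_per_token * tokens
print(f"{total:.2e}")                        # ~3.2e23 FLOPs
```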
Shortcut / simple rule

A common simple approximation comes out of the algebra above: training FLOPs ≈ 6 × N_params × N_tokens (forward FLOPs per token are ≈ 2 × N_params, and training is about 3× forward, giving ~6 × N_params per token). Using that: 6 × 175 × 10^9 × 300 × 10^9 ≈ 3.15 × 10^23 FLOPs.
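The shortcut is a one-liner (same assumed parameter and token counts as above):

```python
N_params = 175e9   # GPT-3 parameter count
N_tokens = 300e9   # assumed training tokens

# 6 FLOPs per parameter per token: 2 (forward) x 3 (forward + backward).
print(f"{6 * N_params * N_tokens:.2e}")  # 3.15e+23
```

The fact that this lands within a few percent of the layer-by-layer estimate is expected, since the non-attention matmul terms dominate at S = 2048.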
Final estimate

Total training compute ≈ 3 × 10^23 FLOPs; the detailed and shortcut routes give ≈ 3.1–3.2 × 10^23.
Caveats and sensitivity

- The estimate scales linearly with token count; a different training-set size shifts the answer proportionally.
- The 4·S·d attention term is small at S = 2048 but grows with longer contexts.
- The 3× training multiplier is approximate; optimizer steps and activation recomputation can push effective cost higher.
- This counts ideal arithmetic FLOPs, not wall-clock hardware cost; real utilization is well below peak.
Summary: roughly 3 × 10^23 floating-point operations to train GPT‑3 (175B) under the stated assumptions.