Estimate Complexity
Horizon Beta's response to the 'Estimate Complexity' challenge.
Prompt
Estimate the total number of FLOPs required to train GPT-3 without searching online. Explain your reasoning step by step.
Response
Here’s a back-of-the-envelope estimate for the total training compute (in FLOPs) for GPT-3, using standard scaling heuristics and reasonable assumptions.
Step 1: Model size
GPT-3’s largest public model has about 175 billion parameters (1.75e11).
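For intuition on where a number of that size comes from, here is a minimal Python sketch using commonly cited GPT-3 architecture values (96 layers, hidden size 12288, ~50k BPE vocabulary); these figures are assumptions for illustration, not something stated in the estimate above.

```python
# Rough parameter count for a dense decoder-only Transformer.
# Layer count, hidden size, and vocabulary size are assumed values
# (the commonly cited GPT-3 175B configuration), used only for illustration.
n_layers = 96
d_model = 12288
vocab_size = 50257

# Per layer: ~4 * d_model^2 for attention (Q, K, V, output projections)
# and ~8 * d_model^2 for the MLP (two d_model x 4*d_model matrices).
transformer_params = n_layers * 12 * d_model ** 2
embedding_params = vocab_size * d_model

total_params = transformer_params + embedding_params
print(f"{total_params:.2e}")  # ~1.75e+11
```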
Step 2: Training tokens
OpenAI reported training on roughly 300 billion tokens for GPT-3-scale models. We’ll use 3.0e11 tokens.
Step 3: FLOPs per parameter per token
A common rule of thumb for dense Transformer training is that the total training FLOPs is on the order of:

FLOPs ≈ 6 × N_params × N_tokens

This comes from accounting for forward and backward passes over all layers, including attention and MLPs, with constant factors absorbed into the 6× heuristic. Some estimates range from ~3× to ~12× depending on architecture details and precision; 6× is a reasonable midpoint.
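As a brief sketch of where the 6× constant comes from (the standard accounting, stated here informally): the forward pass costs roughly 2 FLOPs per parameter per token (one multiply and one add per weight), and the backward pass costs about twice the forward pass, since gradients are needed with respect to both activations and weights. In symbols:

$$
C_{\text{train}} \;\approx\; \underbrace{2\,N_{\text{params}}\,N_{\text{tokens}}}_{\text{forward}} \;+\; \underbrace{4\,N_{\text{params}}\,N_{\text{tokens}}}_{\text{backward}} \;=\; 6\,N_{\text{params}}\,N_{\text{tokens}}.
$$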
Step 4: Compute the total

N_params = 1.75e11
N_tokens = 3.0e11

FLOPs ≈ 6 × 1.75e11 × 3.0e11 = 6 × 5.25e22 = 3.15e23 FLOPs
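A minimal Python check of the arithmetic above, using only the values from Steps 1, 2, and 3:

```python
# Point estimate from the 6 * N_params * N_tokens heuristic.
n_params = 1.75e11           # Step 1: GPT-3 parameter count
n_tokens = 3.0e11            # Step 2: training tokens
flops_per_param_token = 6    # Step 3: heuristic constant

total_flops = flops_per_param_token * n_params * n_tokens
print(f"{total_flops:.2e} FLOPs")  # prints 3.15e+23
```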
Step 5: Sensitivity/uncertainty
- If the token count were lower (say 3e11 → 2e11), the estimate would scale down linearly (to about 2.1e23 FLOPs).
- If the constant factor were closer to 3× or 12×, the result would vary by roughly 2× in either direction (a sweep over both knobs is sketched after this list).
- Additional overhead (optimizer states, data/activation precision, mixed-precision efficiency, embedding layers, and short vs. long sequences) could shift the constant modestly but won’t change the order of magnitude.
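A small sketch of that sensitivity sweep, varying the heuristic constant and the token count (the specific grid values are assumptions chosen for illustration):

```python
# Sensitivity sweep over the heuristic constant and the token count.
N_PARAMS = 1.75e11  # held fixed at the GPT-3 parameter count

for constant in (3, 6, 12):            # plausible FLOPs-per-parameter-per-token factors
    for n_tokens in (2.0e11, 3.0e11):  # lower vs. reported-scale token counts
        flops = constant * N_PARAMS * n_tokens
        print(f"constant={constant:>2}x, tokens={n_tokens:.1e} -> {flops:.2e} FLOPs")
```

The extremes of this grid (about 1e23 and 6e23 FLOPs) bracket the range quoted in the conclusion.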
Conclusion
A reasonable estimate for total training compute for GPT-3 (175B) is on the order of 3e23 FLOPs, with a plausible range of roughly 1e23 to 6e23 FLOPs.
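As an optional sanity check, the estimate can be converted into accelerator-time; the throughput and utilization figures below are assumptions (roughly an A100-class GPU at mixed precision), not GPT-3's actual training hardware:

```python
# Convert the FLOP estimate into GPU-time under assumed throughput.
total_flops = 3.15e23          # point estimate from Step 4
peak_flops_per_sec = 3.12e14   # assumed ~312 TFLOP/s peak (A100-class, FP16/BF16)
utilization = 0.3              # assumed ~30% of peak sustained in practice

gpu_seconds = total_flops / (peak_flops_per_sec * utilization)
gpu_years = gpu_seconds / (3600 * 24 * 365)
print(f"~{gpu_years:.0f} GPU-years under these assumptions")  # ~107
```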
About Horizon Beta
Specifications
- Provider: Openrouter
- Released: 2025-08-01
- Size: LARGE
- Context: 256,000 tokens