Estimate Complexity
Sherlock Think Alpha's response to the 'Estimate Complexity' challenge.
Sherlock Think Alpha
November 16, 2025
Prompt
Estimate the total number of FLOPs required to train GPT-3 without searching online. Explain your reasoning step by step.
Response
Step-by-Step Reasoning
- Recall key specs of GPT-3:
  - GPT-3 has 175 billion parameters: \( N = 175 \times 10^9 = 1.75 \times 10^{11} \).
  - It was trained on approximately 300 billion tokens: \( D = 300 \times 10^9 = 3 \times 10^{11} \).
- Understand the FLOPs estimation formula for transformers:
  - For dense transformer models like GPT-3, the total training FLOPs \( C \) can be approximated by \( C \approx 6 \times N \times D \).
  - Breakdown of the factor of 6:
    - Forward pass: ~2N FLOPs per token (dominated by the matrix multiplications in the attention and FFN layers).
    - Backward pass (gradients): ~4N FLOPs per token.
    - Total: ~6N FLOPs per token.
  - This ignores minor overheads such as activations, optimizer steps (e.g., Adam), and data loading, which are small relative to the dominant compute (~95% from matmuls).
- Plug in the numbers (reproduced in the short sketch after this list):
  \[ C \approx 6 \times (1.75 \times 10^{11}) \times (3 \times 10^{11}) \]
  - First, \( 1.75 \times 3 = 5.25 \), so \( N \times D = 5.25 \times 10^{22} \).
  - Then, \( 6 \times 5.25 \times 10^{22} = 31.5 \times 10^{22} = 3.15 \times 10^{23} \).
- Caveats and precision:
  - This is a standard rule of thumb from papers like Kaplan et al. (2020) and Hoffmann et al. (Chinchilla, 2022), and it matches the commonly cited figure for GPT-3 (~\( 3.14 \times 10^{23} \) FLOPs).
  - The actual cost could vary by ±10-20% due to model specifics (e.g., sparse attention, batch size, or numerical precision such as FP16/BF16), but the estimate is robust for dense autoregressive transformers.
Total estimated FLOPs: \( 3.15 \times 10^{23} \) (or roughly \( 3 \times 10^{23} \)).
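As a quick sanity check, here is a minimal Python sketch of the \( C \approx 6ND \) rule using the parameter and token counts assumed above; the 0.8-1.2 band is purely illustrative of the ±10-20% caveat, not a measured figure.

```python
# Minimal sketch of the 6*N*D training-FLOPs rule of thumb.
# Assumptions (from the reasoning above): N = 175e9 parameters, D = 300e9 tokens.

N = 175e9  # model parameters
D = 300e9  # training tokens

flops_per_token_forward = 2 * N   # forward pass: ~2N FLOPs per token
flops_per_token_backward = 4 * N  # backward pass: ~4N FLOPs per token
flops_per_token = flops_per_token_forward + flops_per_token_backward  # ~6N

total_flops = flops_per_token * D  # C ~= 6 * N * D

print(f"Total training FLOPs: {total_flops:.3e}")  # ~3.150e+23
print(f"Illustrative +/-20% band: {0.8 * total_flops:.2e} to {1.2 * total_flops:.2e}")
```

The same two-line calculation applies to any dense autoregressive transformer once N and D are known; only the inputs change, not the 6ND structure.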