Estimate Complexity
MiniMax M2 Free's response to the 'Estimate Complexity' challenge.
MiniMax M2
October 29, 2025
Text
Prompt
Estimate the total number of FLOPs required to train GPT-3 without searching online. Explain your reasoning step by step.
Response
To estimate the total FLOPs required to train GPT-3 (175B parameters) on 300 billion tokens (one epoch), we break down the process step by step:
Step 1: Understand GPT-3's Specifications
- Parameters: 175 billion (175e9).
- Training Tokens: 300 billion (300e9).
- Context Length: 2048 tokens per sequence.
- Training Epochs: 1 (as per the paper).
Step 2: FLOPs for Forward Pass
- For a transformer model, the forward pass cost per token per parameter is approximately 2 FLOPs (one multiplication and one addition in matrix operations).
- Total FLOPs for forward pass:
  \[
  \text{Total FLOPs}_{\text{forward}} = \text{Tokens} \times \text{Parameters} \times 2 = 300 \times 10^{9} \times 175 \times 10^{9} \times 2 = 1.05 \times 10^{23}\ \text{FLOPs}
  \]
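A quick numerical check of this step (a minimal Python sketch; the 2-FLOPs-per-parameter-per-token rule is the approximation stated above, not an exact count):

```python
# Forward-pass FLOPs: ~2 FLOPs per parameter per token (approximation above).
params = 175e9   # GPT-3 parameter count
tokens = 300e9   # training tokens (one epoch)

flops_forward = 2 * params * tokens
print(f"Forward-pass FLOPs: {flops_forward:.2e}")  # ~1.05e+23
```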
Step 3: FLOPs for Backward Pass
- The backward pass (gradient computation) costs approximately 2× the forward pass, because gradients must be computed with respect to both the activations and the weights.
- Total FLOPs for backward pass:
  \[
  \text{Total FLOPs}_{\text{backward}} = 2 \times \text{Total FLOPs}_{\text{forward}} = 2 \times 1.05 \times 10^{23} = 2.10 \times 10^{23}\ \text{FLOPs}
  \]
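Extending the same sketch, the backward pass doubles the forward count, which recovers the familiar 6 × parameters × tokens rule of thumb (an approximation consistent with the numbers above):

```python
params = 175e9
tokens = 300e9

flops_forward = 2 * params * tokens             # ~1.05e+23
flops_backward = 2 * flops_forward              # ~2.10e+23
flops_linear = flops_forward + flops_backward   # ~3.15e+23, i.e. 6 * params * tokens

print(f"Backward-pass FLOPs: {flops_backward:.2e}")
print(f"Forward + backward:  {flops_linear:.2e}")
```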
Step 4: Quadratic Attention Overhead (Minor Correction)
- Attention layers introduce a cost that is quadratic in the context length (\(O(S^2)\) per sequence, where \(S = 2048\)), on top of the parameter-proportional (linear) cost.
- Overhead: roughly \(\sim 4\%\) of the linear (matrix-multiply) cost, estimated empirically for GPT-3's configuration.
- Combined linear (forward + backward) FLOPs:
  \[
  \text{Total FLOPs}_{\text{linear}} = \text{Total FLOPs}_{\text{forward}} + \text{Total FLOPs}_{\text{backward}} = 1.05 \times 10^{23} + 2.10 \times 10^{23} = 3.15 \times 10^{23}\ \text{FLOPs}
  \]
- Attention overhead: \(3.15 \times 10^{23} \times 0.04 = 1.26 \times 10^{22}\) FLOPs.
- Revised total:
  \[
  3.15 \times 10^{23} + 1.26 \times 10^{22} \approx 3.28 \times 10^{23}\ \text{FLOPs}
  \]
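The few-percent attention overhead can be sanity-checked from GPT-3's commonly cited architecture; the layer count and hidden width below are assumptions not stated in this estimate, and the formula counts only the QKᵀ and attention-times-V matrix products, so it is a rough sketch rather than an exact figure:

```python
# Rough check of attention overhead relative to the parameter-proportional cost.
# Architecture constants are the commonly cited GPT-3 175B values (assumed here).
n_layers = 96      # transformer blocks
d_model = 12288    # hidden width
seq_len = 2048     # context length
params = 175e9

# Per token, per forward pass: QK^T and (attention weights) @ V each cost
# roughly 2 * seq_len * d_model FLOPs in every layer.
attn_per_token = n_layers * 2 * (2 * seq_len * d_model)
linear_per_token = 2 * params

print(f"Attention overhead: {attn_per_token / linear_per_token:.1%}")
# Prints roughly 3%: the same few-percent ballpark as the ~4% used above.
```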
Step 5: Final Adjustment for Practical Efficiency
- Training uses mixed precision (FP16) and optimizations (e.g., gradient accumulation), reducing effective FLOPs by ~10–20%.
- Adjusted total:
  \[
  3.28 \times 10^{23} \times 0.85 \approx 2.79 \times 10^{23}\ \text{FLOPs}
  \]
  (accounting for inefficiencies, data loading, etc.)
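Putting the whole estimate into one short script (the 4% attention overhead and the 0.85 efficiency factor are this estimate's own assumptions, carried over unchanged):

```python
params = 175e9
tokens = 300e9

flops_forward = 2 * params * tokens              # Step 2: ~1.05e+23
flops_backward = 2 * flops_forward               # Step 3: ~2.10e+23
flops_linear = flops_forward + flops_backward    # ~3.15e+23

flops_with_attention = flops_linear * 1.04       # Step 4: +4% attention overhead -> ~3.28e+23
flops_total = flops_with_attention * 0.85        # Step 5: ~15% efficiency adjustment

print(f"Estimated training FLOPs: {flops_total:.2e}")
# ~2.8e+23, matching the ~2.79e23 figure above up to rounding.
```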
Conclusion
The estimated total FLOPs to train GPT-3 (175B) on 300B tokens is ~2.79 × 10²³ FLOPs.
This is close to OpenAI's reported value of ≈3.14 × 10²³ FLOPs; the remaining gap mainly reflects the ~15% efficiency adjustment and the rounding above.
Key Assumptions:
- 1 training epoch on 300B tokens.
- Dominance of linear matrix multiplications (attention overhead <5%).
- Mixed precision reduces cost by ~15%.
- No hyperparameter tuning or intermediate checkpoints.
Final Answer:
\[
\boxed{2.79 \times 10^{23}}
\]
FLOPs (rounded to two significant figures, approximately 2.8 × 10²³).
About MiniMax M2
Capabilities
Conversation, Reasoning, Code Generation, Analysis, Planning, Agentic Tool Use, Tool Use
Categories
Text, Code
Specifications
- Provider: MiniMax
- Released: 2025-10-23
- Size: MEDIUM
- Context: 204,800 tokens