To estimate the total FLOPs required to train GPT-3, we can reason step by step from its published architecture and training details. Here's the breakdown:
Step 1: Key Facts About GPT-3
- Parameters: 175 billion (175 × 10⁹)
- Architecture: Decoder-only Transformer with 96 layers, a hidden size of 12,288, and 96 attention heads.
- Training Data: Approximately 300 billion tokens (drawn from Common Crawl, WebText2, books, and Wikipedia).
- Training Objective: Predict the next token (cross-entropy loss).
Step 2: Theoretical FLOPs Per Token
During training, each parameter participates in:
- Forward pass: primarily matrix multiplications (in the attention and feedforward layers).
- Backward pass: gradient computation, which costs roughly twice the forward pass.
For transformer models, a commonly used estimate is:
- 6 FLOPs per parameter per token: roughly 2 for the forward pass and 4 for the backward pass. Element-wise operations (softmax, layer norm, activations) add only a small percentage on top and are usually ignored.
So, for GPT-3: \[ \text{FLOPs per token} = 6 \times 175 \times 10^9 = 1.05 \times 10^{12} \text{ FLOPs/token} \]
Step 3: Total FLOPs for Training
Multiply by the total number of tokens seen during training: \[ \text{Total FLOPs} = 1.05 \times 10^{12} \times 300 \times 10^9 = 3.15 \times 10^{23} \text{ FLOPs} \]
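The two steps above can be checked in a few lines of Python using the standard "6ND" rule of thumb:

```python
# Back-of-the-envelope training compute via C ≈ 6 * N * D,
# where N = parameter count and D = training tokens.
N = 175e9   # GPT-3 parameters
D = 300e9   # training tokens

flops_per_token = 6 * N            # ≈ 1.05e12 FLOPs per token
total_flops = flops_per_token * D  # ≈ 3.15e23 FLOPs total

print(f"{flops_per_token:.3g} FLOPs/token, {total_flops:.3g} FLOPs total")
# prints: 1.05e+12 FLOPs/token, 3.15e+23 FLOPs total
```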
Step 4: Real-World Efficiency Considerations
The above is a theoretical minimum. In practice:
- Hardware efficiency (GPU/TPU utilization) is less than 100% due to memory bandwidth, communication overhead, and non-compute operations.
- Training overhead: optimizer updates (e.g., Adam's moment estimates), activation recomputation from gradient checkpointing, and data loading add compute not counted in the 6-FLOPs-per-parameter rule.
- Training duration: GPT-3 training took several weeks on thousands of GPUs/TPUs.
The commonly cited figure of ~3.14 × 10²³ FLOPs (3,640 petaflop/s-days, from OpenAI's GPT-3 paper) matches this theoretical estimate almost exactly. Real-world inefficiencies mostly show up as longer wall-clock time and more GPU-hours rather than as extra model FLOPs, so a reasonable overall range is about 3–4 × 10²³ FLOPs.
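To make the efficiency point concrete, here is a hedged sketch of the implied GPU-time: the V100 peak throughput and the 30% sustained-utilization figure are illustrative assumptions, not reported training details.

```python
# Illustrative GPU-time estimate. The V100 tensor-core peak (~125 TFLOPS)
# and the 30% sustained utilization are assumptions for this sketch only.
total_flops = 3.15e23
v100_peak = 125e12            # FLOP/s, mixed-precision tensor cores
utilization = 0.30            # assumed sustained fraction of peak

gpu_seconds = total_flops / (v100_peak * utilization)
gpu_years = gpu_seconds / (365 * 24 * 3600)
print(f"~{gpu_years:.0f} V100-years")  # ~266 V100-years under these assumptions
```

Spread across ~1,000 such GPUs, that works out to roughly three months of wall-clock time, in line with "weeks on thousands of GPUs."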
Step 5: Order-of-Magnitude Context
- Supercomputers: Summit (Oak Ridge) peaks at ~200 PFLOPS (2 × 10¹⁷ FLOP/s).
If fully utilized, it would take: \[ \frac{3 \times 10^{23}}{2 \times 10^{17}} = 1.5 \times 10^{6} \text{ s} \approx 17 \text{ days}. \] GPT-3 was reportedly trained on thousands of V100 GPUs over several weeks, which is consistent with this order of magnitude.
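The same sanity check as a sketch, using the idealized figures from the text (200 PFLOPS peak, 100% utilization):

```python
# Idealized time-to-train on Summit at full utilization of its
# ~200 PFLOPS peak; both figures come from the estimate above.
total_flops = 3e23
summit_peak = 2e17            # FLOP/s

seconds = total_flops / summit_peak
days = seconds / 86400
print(f"{days:.1f} days at full utilization")  # prints: 17.4 days at full utilization
```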
Final Estimate
\[ \boxed{3\text{–}4 \times 10^{23} \text{ FLOPs}} \]
This aligns with published analyses (OpenAI's GPT-3 paper and subsequent third-party estimates) that place GPT-3's training compute at roughly 3 × 10²³ FLOPs.