Let me work through this step by step.
Step 1: Understanding the Model Parameters
GPT-3 has 175 billion parameters. This is the foundation for our calculation.
Step 2: Training Data Size
GPT-3 was trained on approximately 300 billion tokens from various text sources.
Step 3: The FLOPs Formula
For transformer models, a widely used approximation is:
FLOPs ≈ 6 × N × D
Where:
- N = number of parameters
- D = number of training tokens
- 6 accounts for:
  - Forward pass: ~2 FLOPs per parameter per token (one multiply and one add)
  - Backward pass: ~4 FLOPs per parameter per token (gradients with respect to activations and weights)
Step 4: Calculation
Let's plug in the numbers:
- N = 175 × 10⁹ parameters
- D = 300 × 10⁹ tokens
FLOPs = 6 × (175 × 10⁹) × (300 × 10⁹)
FLOPs = 6 × 175 × 300 × 10¹⁸
FLOPs = 6 × 52,500 × 10¹⁸
FLOPs = 315,000 × 10¹⁸
FLOPs = 3.15 × 10²³
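As a sanity check, the arithmetic above can be reproduced in a few lines (the constants are taken directly from Steps 1-3):

```python
# Estimate GPT-3 training compute using the FLOPs ~ 6*N*D approximation.
N = 175e9                  # parameters
D = 300e9                  # training tokens
FLOPS_PER_PARAM_TOKEN = 6  # ~2 forward + ~4 backward

flops = FLOPS_PER_PARAM_TOKEN * N * D
print(f"{flops:.3e}")  # ~3.150e+23
```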
Step 5: Context and Refinements
A few important considerations:
- This is a lower-bound estimate. The actual number is somewhat higher due to:
  - Attention mechanism computations (quadratic in sequence length)
  - Layer normalization
  - Embedding lookups
  - Optimizer overhead (Adam maintains additional state per parameter)
- Batch size effects: the formula assumes efficient batching; in practice, smaller batch sizes introduce per-step overhead.
- Hardware utilization: memory bandwidth and communication limits mean real hardware never runs at 100% of peak, so wall-clock compute requirements can be 1.5-2× higher than the theoretical count.
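To make the utilization caveat concrete, the adjusted range can be computed directly (a minimal sketch; the 1.5-2× overhead factors are the assumed values from the list above, not measured figures):

```python
# Adjust the theoretical FLOP count for real-world hardware efficiency.
theoretical_flops = 6 * 175e9 * 300e9   # ~3.15e23, from the 6*N*D formula
overhead_factors = (1.5, 2.0)           # assumed inefficiency range

low, high = (theoretical_flops * f for f in overhead_factors)
print(f"effective compute: {low:.2e} to {high:.2e} FLOPs")
```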
Step 6: Comparison with Known Estimates
This aligns with:
- OpenAI's original estimates (~3.14 × 10²³ FLOPs)
- Other research papers citing similar magnitudes
- The Chinchilla scaling laws, which suggest compute-optimal training requires ~20 tokens per parameter (≈3.5 × 10¹² tokens for 175B parameters, corresponding to ~3.7 × 10²⁴ FLOPs, more than 10× GPT-3's actual training budget)
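The Chinchilla comparison can be checked the same way (a sketch assuming the ~20 tokens-per-parameter rule of thumb):

```python
# Chinchilla-optimal training budget for a 175B-parameter model.
N = 175e9                         # parameters
TOKENS_PER_PARAM = 20             # Chinchilla rule of thumb
D_optimal = TOKENS_PER_PARAM * N  # 3.5e12 tokens

chinchilla_flops = 6 * N * D_optimal     # ~3.7e24
actual_flops = 6 * N * 300e9             # GPT-3's actual budget, ~3.15e23
print(f"{chinchilla_flops:.2e} FLOPs ({chinchilla_flops / actual_flops:.1f}x actual)")
```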
Final Estimate
Approximately 3 × 10²³ FLOPs (or 300 zettaFLOPs).
To put this in perspective:
- At a sustained throughput of roughly 100 TFLOP/s (a high-end GPU like an A100 at realistic utilization), this is on the order of a century of computation on a single device
- Or roughly a year on a cluster of 100 such GPUs running continuously
- This explains why training such models requires massive computational resources and specialized infrastructure
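The GPU-time perspective follows from an assumed sustained throughput; here ~100 TFLOP/s is used as a hypothetical effective rate for an A100 (peak tensor-core throughput is higher, but sustained utilization is well below peak):

```python
# Convert total training FLOPs into single-GPU wall-clock time.
total_flops = 3.15e23
sustained_flops_per_sec = 100e12    # assumed ~100 TFLOP/s effective A100 throughput
seconds_per_year = 365 * 24 * 3600  # ~3.15e7

gpu_years = total_flops / sustained_flops_per_sec / seconds_per_year
print(f"~{gpu_years:.0f} GPU-years on one such GPU")
```

Dividing by the cluster size gives the wall-clock estimate: 100 GPUs bring this to roughly one year.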
The calculation shows why large language model training is both computationally intensive and expensive, requiring careful optimization of both algorithms and hardware.