Estimate Complexity
GLM 4.5's response to the 'Estimate Complexity' challenge.
Prompt
Estimate the total number of FLOPs required to train GPT-3 without searching online. Explain your reasoning step by step.
Response
To estimate the total number of FLOPs (floating-point operations) required to train GPT-3 without searching online, we rely on known architectural details and standard computational models for transformer training. Here's the step-by-step reasoning:
Step 1: Key Parameters of GPT-3
- Number of parameters (P): GPT-3 has 175 billion parameters (175 × 10⁹).
- Training dataset size (D): The model was trained on approximately 300 billion tokens (300 × 10⁹ tokens). This is a standard figure associated with GPT-3.
Step 2: FLOPs per Token Calculation
Training a transformer model involves two main operations per token:
- Forward pass: Computes the model's output.
- Backward pass: Computes gradients for backpropagation.
For transformer architectures, the combined cost of the forward and backward passes is commonly estimated as 6 × P FLOPs per token. This is derived from:
- Forward pass: ~2 × P FLOPs per token (matrix multiplications dominate, and each parameter participates in roughly one multiply-add, i.e., 2 FLOPs).
- Backward pass: ~2× the cost of the forward pass, since gradients must be computed with respect to both activations and weights (~4 × P FLOPs per token).
- Total per token: 2P (forward) + 4P (backward) = 6P FLOPs per token.
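A minimal Python sketch of this per-token heuristic (the 2P/4P split is the standard approximation, not an exact operation count):

```python
# Per-token training FLOPs under the standard 6P transformer heuristic.
P = 175e9  # GPT-3 parameter count

forward_flops = 2 * P   # ~1 multiply-add (2 FLOPs) per parameter per token
backward_flops = 4 * P  # backward pass costs roughly 2x the forward pass
flops_per_token = forward_flops + backward_flops

print(f"{flops_per_token:.2e} FLOPs per token")  # ~1.05e+12, i.e., 6P
```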
Step 3: Total FLOPs for the Entire Dataset
The total FLOP count scales linearly with the dataset size:
- Total FLOPs = FLOPs per token × Number of tokens.
- Substituting the values:
Total FLOPs = (6 × P) × D = 6 × (175 × 10⁹) × (300 × 10⁹).
Step 4: Calculation
- Compute the product:
6 × 175 × 300 × 10⁹ × 10⁹ = 6 × 52,500 × 10¹⁸.
- Simplify:
6 × 52,500 = 315,000.
Total FLOPs = 315,000 × 10¹⁸ = 3.15 × 10²³ FLOPs.
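The same arithmetic in code, using the figures assumed above:

```python
P = 175e9  # parameters
D = 300e9  # training tokens

total_flops = 6 * P * D
print(f"{total_flops:.2e} FLOPs")  # ~3.15e+23
```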
Step 5: Validation with Known Compute
- GPT-3 training reportedly used ~3,640 petaflop/s-days of compute.
- Convert petaflop/s-days to FLOPs:
1 petaflop/s-day = 10¹⁵ FLOP/s × 86,400 seconds/day = 8.64 × 10¹⁹ FLOPs.
3,640 petaflop/s-days = 3,640 × 8.64 × 10¹⁹ ≈ 3.14 × 10²³ FLOPs.
- This closely matches our estimate of 3.15 × 10²³ FLOPs, confirming the calculation.
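A quick check of the unit conversion (assuming the commonly cited 3,640 petaflop/s-day figure):

```python
SECONDS_PER_DAY = 86_400
PFLOP_S_DAY = 1e15 * SECONDS_PER_DAY  # 8.64e19 FLOPs per petaflop/s-day

reported_flops = 3_640 * PFLOP_S_DAY
print(f"{reported_flops:.2e} FLOPs")  # ~3.14e+23, matching the 6PD estimate
```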
Step 6: Batch Size and Steps Consideration
- Batch size: GPT-3 used a batch size of roughly 3.2 million tokens per step (the figure reported for the 175B model).
- Number of training steps:
Steps = Total tokens / Batch size = 300 × 10⁹ / 3.2 × 10⁶ = 93,750 steps.
- FLOPs per step:
Per-step FLOPs = 6 × P × Batch size = 6 × 175 × 10⁹ × 3.2 × 10⁶ = 3.36 × 10¹⁸ FLOPs/step.
- Total FLOPs:
93,750 steps × 3.36 × 10¹⁸ FLOPs/step ≈ 3.15 × 10²³ FLOPs.
This consistency reinforces the estimate.
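The step-level breakdown can be reproduced the same way (the 3.2M-token batch size is the assumed figure from above):

```python
P = 175e9             # parameters
TOKENS = 300e9        # total training tokens
BATCH_TOKENS = 3.2e6  # tokens per optimizer step (assumed)

steps = TOKENS / BATCH_TOKENS          # 93,750 steps
flops_per_step = 6 * P * BATCH_TOKENS  # ~3.36e+18 FLOPs per step
total_flops = steps * flops_per_step   # ~3.15e+23 FLOPs

print(f"{steps:,.0f} steps, {total_flops:.2e} FLOPs total")
```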
Step 7: Negligible Factors
- Parameter updates: Optimizer steps (e.g., Adam) add O(P) FLOPs per step. For 93,750 steps:
Update FLOPs ≈ 93,750 × 175 × 10⁹ ≈ 1.64 × 10¹⁶ FLOPs; even with Adam's constant factor of roughly 10–20 operations per parameter, this stays many orders of magnitude below 3.15 × 10²³.
- Embedding/softmax layers: Included in the 6P per-token estimate.
- Attention mechanisms: Captured by the 6P heuristic (validated for transformers).
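To make the "negligible" claim concrete, a rough sketch of the optimizer overhead (the ~10 FLOPs per parameter per Adam update is an assumption, not a measured figure):

```python
P = 175e9
STEPS = 93_750
FLOPS_PER_PARAM_UPDATE = 10  # rough allowance for Adam moment updates (assumed)

optimizer_flops = STEPS * P * FLOPS_PER_PARAM_UPDATE  # ~1.6e+17 FLOPs
training_flops = 3.15e23
print(f"optimizer share: {optimizer_flops / training_flops:.1e}")  # ~5e-7 of total
```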
Final Estimate
The total compute required to train GPT-3 is approximately 3.15 × 10²³ FLOPs. This is derived from:
- Standard transformer FLOPs per token (6P).
- GPT-3 parameters (175B) and dataset size (300B tokens).
- Cross-validation with reported compute usage (3,640 petaflop-days).
About Z.AI: GLM 4.5
Specifications
- Provider: Openrouter
- Released: 2025-07-25
- Size: LARGE
- Context: 128,000 tokens