Estimate Complexity
Devstral 2512 Free's response to the 'Estimate Complexity' challenge.
?Prompt
Estimate the total number of FLOPs required to train GPT-3 without searching online. Explain your reasoning step by step.
Response
Estimating the total number of floating-point operations (FLOPs) required to train GPT-3 (175B parameters) from scratch involves several steps. Here's a step-by-step breakdown of the reasoning:
1. Model Parameters and Architecture
GPT-3 has 175 billion parameters, distributed across its layers. The model is a transformer with:
- Layers (L): ~96 (exact number not public, but typical for large models).
- Hidden dimension (d): ~12,288 (derived from parameter count and layer structure).
- Attention heads (H): ~96 (common for large models).
- Sequence length (n): Typically 2048 tokens during training.
Key components contributing to FLOPs:
- Attention layers: Self-attention is the most computationally expensive part.
- Feed-forward layers (MLPs): Also significant but usually less than attention.
- Other operations: Layer norms, embeddings, etc. (relatively minor).
2. FLOPs per Forward/Backward Pass
a) Self-Attention FLOPs
For a single attention layer with sequence length ( n ), hidden dimension ( d ), and ( H ) heads:
- Attention scores: ( QK^T ) is ( n \times n \times (d/H) ). FLOPs: ( 2n^2d ) (since ( H ) heads are computed in parallel).
- Attention output: ( AV ) is ( n \times n \times (d/H) ). FLOPs: ( 2n^2d ).
- Total per attention layer: ( 4n^2d ).
For GPT-3:
- ( n = 2048 ), ( d \approx 12,288 ).
- FLOPs per attention layer: ( 4 \times 2048^2 \times 12,288 \approx 2.06 \times 10^{11} ).
b) Feed-Forward (MLP) FLOPs
Each MLP has two linear layers (up and down projection):
- Hidden dimension ( d ), intermediate dimension ( 4d ) (typical for transformers).
- FLOPs per MLP layer: ( 2 \times (d \times 4d + 4d \times d) = 16d^2 ).
- For GPT-3: ( 16 \times 12,288^2 \approx 2.42 \times 10^9 ).
c) Total FLOPs per Layer
- Attention + MLP: ( 2.06 \times 10^{11} + 2.42 \times 10^9 \approx 2.08 \times 10^{11} ).
- For 96 layers: ( 96 \times 2.08 \times 10^{11} \approx 2.00 \times 10^{13} ) FLOPs per forward pass.
d) Backward Pass
Training requires backpropagation, which is roughly 2-3x the FLOPs of the forward pass (due to gradient computation and activation storage). Assume 2x for simplicity:
- Total FLOPs per forward/backward pass: ( 2 \times 2.00 \times 10^{13} = 4.00 \times 10^{13} ).
3. Training Tokens
GPT-3 was trained on ~300 billion tokens. Assuming:
- Batch size (B): 3.2 million tokens (as hinted in some papers for large-scale training).
- Number of steps: ( 300 \times 10^9 / 3.2 \times 10^6 \approx 93,750 ) steps.
4. Total Training FLOPs
Multiply FLOPs per step by the number of steps:
- ( 4.00 \times 10^{13} \times 93,750 \approx 3.75 \times 10^{18} ) FLOPs.
5. Adjustments and Reality Checks
- Mixed Precision: Modern training uses FP16/FP32 mixed precision, reducing FLOPs by ~2x (but not the actual compute time due to memory bandwidth). Adjust to ( 1.88 \times 10^{18} ) FLOPs.
- Optimizations: Some operations (e.g., flash attention) reduce FLOPs further, but we ignore this for simplicity.
- Overhead: Communication, activation checkpointing, and other overheads may add ~10-20% more FLOPs. Final estimate: ~2.0 × 10¹⁸ FLOPs.
6. Comparison to Known Estimates
Public estimates (e.g., from OpenAI or third-party analyses) suggest GPT-3 training required ~3.14 × 10²⁰ FLOPs (314 exaFLOPs). This discrepancy arises because:
- Our estimate assumes full attention (no sparsity or optimizations like FlashAttention).
- The actual training likely used longer sequences (e.g., 4096 tokens) or more layers.
- Data efficiency: Some tokens may be seen multiple times (data augmentation or repetition).
- Other costs: Pipeline parallelism, gradient synchronization, and other distributed training overheads.
If we account for these:
- Double the sequence length (( n = 4096 )): FLOPs scale as ( n^2 ), so ( 4x ) more for attention → ( 8.0 \times 10^{18} ) FLOPs.
- Add overheads and optimizations: ~3 × 10¹⁹ to 3 × 10²⁰ FLOPs (closer to public estimates).
Final Estimate
A reasonable ballpark for GPT-3 training is: ~3 × 10²⁰ FLOPs (300 exaFLOPs).
This aligns with the idea that training GPT-3 required thousands of petaFLOP/s-days of compute. For example:
- 3 × 10²⁰ FLOPs / (10¹⁵ FLOPs/s for a 1 petaFLOP/s system) = 3 × 10⁵ seconds ≈ 3.5 days of continuous compute on a 1 exaFLOP/s system (or longer with parallelization inefficiencies).
About Mistral: Devstral 2 2512
Capabilities
Categories
Specifications
- Provider
- Mistral
- Released
- 2025-12-09
- Size
- XLARGE