Estimate Complexity
GPT-5 Nano's response to the 'Estimate Complexity' challenge.
GPT-5 Nano
August 7, 2025
Prompt
Estimate the total number of FLOPs required to train GPT-3 without searching online. Explain your reasoning step by step.
Response
Here’s a back-of-the-envelope estimate for the total FLOPs to train GPT-3 (the 175B-parameter model) using reasonable public details, without looking anything up.
Assumptions (typical GPT-3 specs)
- Model: 175 billion parameters
- Transformer depth: 96 layers
- Hidden size (d_model): 12,288
- Sequence length used during training: 2,048 tokens
- Training data: about 300 billion tokens
- The backward pass roughly doubles the cost relative to the forward pass alone
- We’ll ignore data loading, optimizer overhead beyond the backward pass, precision differences, and any gradient checkpointing variations
Step 1: FLOPs per forward pass for one 2,048-token sequence
- Self-attention per layer costs on the order of L^2 * d_model operations, where L = 2,048 and d_model = 12,288.
- L^2 * d_model = 2,048^2 * 12,288 ≈ 51.54 × 10^9 operations per layer
- Across 96 layers: 51.54e9 × 96 ≈ 4.95 × 10^12 FLOPs for attention per forward pass
- Feed-forward network (FFN) per layer costs roughly 8 × d_model^2 operations
- d_model^2 = 12,288^2 ≈ 150.995 × 10^6
- 8 × d_model^2 ≈ 1.208 × 10^9 FLOPs per layer
- Across 96 layers: 1.208e9 × 96 ≈ 1.16 × 10^11 FLOPs for FFN per forward pass
- Total forward pass per sequence: attention + FFN ≈ 4.95e12 + 1.16e11 ≈ 5.07e12 FLOPs
- Include backward pass (roughly similar magnitude to forward): ≈ 2 × 5.07e12 ≈ 1.01 × 10^13 FLOPs per forward+backward pass
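As a sanity check on Step 1, here is a minimal Python sketch that reproduces the per-sequence arithmetic above. It uses only the assumed specs (96 layers, d_model = 12,288, 2,048-token sequences) and the same simplified cost model (L^2 × d_model for attention, 8 × d_model^2 per layer for the FFN), so it inherits all of the estimate's approximations.

```python
# Minimal sketch of Step 1 under the stated assumptions; the cost model
# (L^2 * d_model attention, 8 * d_model^2 FFN per layer) is the estimate's
# own simplification, not a measured profile.

n_layers = 96        # transformer depth
d_model  = 12_288    # hidden size
seq_len  = 2_048     # tokens per training sequence

attn_fwd = seq_len ** 2 * d_model * n_layers   # ~4.95e12 FLOPs per forward pass
ffn_fwd  = 8 * d_model ** 2 * n_layers         # ~1.16e11 FLOPs per forward pass

fwd_flops     = attn_fwd + ffn_fwd             # ~5.07e12 FLOPs
fwd_bwd_flops = 2 * fwd_flops                  # ~1.01e13 FLOPs (forward + backward)

print(f"forward+backward per sequence: {fwd_bwd_flops:.2e} FLOPs")
```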
Step 2: How many 2,048-token sequences to cover 300B tokens?
- Number of segments N = 300e9 tokens / 2,048 ≈ 1.465 × 10^8 segments
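Expressed in code (a trivial sketch, using the assumed 300B-token budget and 2,048-token sequences):

```python
# Step 2: number of 2,048-token sequences needed to cover ~300B training tokens.
total_tokens = 300e9
seq_len      = 2_048

num_segments = total_tokens / seq_len
print(f"segments: {num_segments:.3e}")   # ~1.465e8
```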
Step 3: Total training FLOPs
- Total FLOPs ≈ (FLOPs per forward+backward) × N
- ≈ (1.01 × 10^13) × (1.465 × 10^8)
- ≈ 1.48 × 10^21 FLOPs
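Putting Steps 1 and 2 together, a short sketch of the final multiplication (the inputs are the rounded intermediate values from above):

```python
# Step 3: total training FLOPs = per-sequence cost x number of sequences.
flops_per_fwd_bwd = 1.01e13   # Step 1: forward + backward per 2,048-token sequence
num_segments      = 1.465e8   # Step 2: sequences needed for ~300B tokens

total_flops = flops_per_fwd_bwd * num_segments
print(f"total training FLOPs: {total_flops:.2e}")   # ~1.48e21
```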
Result
- Rough estimate: about 1.5 × 10^21 FLOPs to train GPT-3 at the 175B scale, under the stated assumptions.
Notes and caveats
- This is a back-of-the-envelope calculation. Real compute could differ by a factor of a few due to:
  - Different batch sizes or gradient accumulation, which changes the effective number of forward/backward passes
  - Use of gradient checkpointing or optimizer options that alter the forward/backward FLOP balance
  - Exact data token count and how sequences are batched
  - Precision (fp32 vs. fp16/bfloat16) affecting the count of floating-point operations
  - Overheads from data loading, padding, sparsity, or other architectural details
- The order-of-magnitude (around 10^21 FLOPs) is the key takeaway, not a precise figure.
If you want, I can run through a variant with a different token budget, sequence length, or depth to show how sensitive the total FLOP count is to those choices.
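For illustration, here is a small helper that wraps the same cost model so the token budget, sequence length, and depth can be varied; the alternative settings below are hypothetical, chosen only to show the sensitivity, not to describe any particular model.

```python
# Sensitivity sketch: same simplified cost model as above
# (L^2 * d_model attention + 8 * d_model^2 FFN per layer, x2 for the backward pass).

def estimate_total_flops(n_layers: int, d_model: int, seq_len: int, total_tokens: float) -> float:
    attn_fwd = seq_len ** 2 * d_model * n_layers
    ffn_fwd  = 8 * d_model ** 2 * n_layers
    per_fwd_bwd = 2 * (attn_fwd + ffn_fwd)
    return per_fwd_bwd * (total_tokens / seq_len)

base       = estimate_total_flops(96, 12_288, 2_048, 300e9)  # ~1.48e21 (the estimate above)
longer_ctx = estimate_total_flops(96, 12_288, 4_096, 300e9)  # ~2.9e21: longer sequences raise per-token attention cost
more_data  = estimate_total_flops(96, 12_288, 2_048, 600e9)  # ~3.0e21: total scales linearly with the token budget

print(f"base={base:.2e}  longer_ctx={longer_ctx:.2e}  more_data={more_data:.2e}")
```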
About GPT-5 Nano
Capabilities: Conversation, Reasoning, Code Generation, Analysis
Categories: Text, Code, Multimodal
Specifications
- Provider: OpenAI
- Released: 2025-08-07
- Size: SMALL