Qwen3 Next 80B A3B Instruct
Estimating the total number of FLOPs (floating-point operations) required to train GPT-3 without searching online requires combining known architectural details, training hyperparameters, and scaling laws that were publicly discussed around the time of GPT-3’s release (2020). Let’s reason step by step using publicly available facts that were widely known or logically inferable.
Step 1: Understand GPT-3’s architecture
GPT-3 has:
- 175 billion parameters (175 × 10⁹)
- 96 transformer layers
- 12288 hidden size (d_model)
- 96 attention heads
- Sequence length: ~2048 tokens (common for GPT-3 training)
This is a decoder-only transformer, similar to GPT-2 but much larger.
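Purely as a bookkeeping aid, these published figures can be written down as constants (a minimal sketch; the variable names are mine):

```python
# Published GPT-3 architecture figures (Brown et al., 2020); constant names are mine.
N_PARAMS = 175e9   # total parameters
N_LAYERS = 96      # transformer layers
D_MODEL  = 12288   # hidden size
N_HEADS  = 96      # attention heads
SEQ_LEN  = 2048    # training context length, in tokens

# Quick consistency check: the per-head dimension comes out to a round 128.
assert D_MODEL % N_HEADS == 0
print("head dim:", D_MODEL // N_HEADS)  # -> 128
```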
Step 2: FLOPs per forward pass
In a transformer, the dominant FLOPs come from:
- Matrix multiplications in attention and MLP
- The attention-score computation and softmax (a comparatively small contribution at this scale, as quantified below)
A well-known approximation from the scaling-law literature (e.g., Kaplan et al., 2020; later used in “Training Compute-Optimal Large Language Models” by Hoffmann et al., 2022) is:
FLOPs per forward pass ≈ 2 × N × S
Where:
- N = number of parameters
- S = number of tokens in the sequence
Rather than take this on faith, let’s derive it from the architecture.
In a transformer, the dominant cost is the matrix multiplications:
- Each layer has:
  - Attention: Q, K, V projections → 3 × d_model × d_model
  - Attention output projection → d_model × d_model
  - MLP: two linear layers of size d_model × 4·d_model and 4·d_model × d_model (i.e., 8 × d_model² total)
So per layer:
- Attention: 4 × d_model² (Q,K,V,O)
- MLP: 2 × (4×d_model) × d_model = 8 × d_model²
- Total per layer: 12 × d_model²
We also have the token embedding, the final layer norm, and the LM head; the embedding/head weights are ≈ d_model × vocab_size.
Since vocab_size ≈ 50k and d_model = 12288, that is ≈ 12288 × 50k ≈ 614M parameters, which is negligible compared to 175B.
Now, total parameters N ≈ 175B. In transformers, N ≈ 12 × L × d_model² (for L layers, 12×d_model² per layer). Let’s verify:
- d_model = 12288
- So d_model² = ~151 million
- 12 × d_model² ≈ 1.81B per layer
- 96 layers → 96 × 1.81B ≈ 174B → matches 175B. Perfect.
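A few lines of Python confirm the same back-of-the-envelope count (the ~50k vocabulary is the familiar BPE vocabulary size; treat this as a sanity check, not an exact parameter census):

```python
D_MODEL  = 12288
N_LAYERS = 96
VOCAB    = 50_257  # approximate GPT-2/GPT-3 BPE vocabulary size

per_layer = 12 * D_MODEL**2       # 4*d^2 (attention) + 8*d^2 (MLP) parameters
core      = N_LAYERS * per_layer  # transformer blocks only
embedding = VOCAB * D_MODEL       # token embedding matrix

print(f"per layer: {per_layer:.3e}")  # ~1.812e+09
print(f"core:      {core:.3e}")       # ~1.739e+11  (close to 175B)
print(f"embedding: {embedding:.3e}")  # ~6.176e+08  (negligible by comparison)
```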
Note that 12 × d_model² per layer is a parameter count, not a FLOP count. Multiplying matrices of shape (a,b) × (b,c) requires 2 × a × b × c FLOPs (one multiply and one add per term). So for a linear layer applied to a single token:
FLOPs per linear layer per token ≈ 2 × input_size × output_size
So per transformer layer, per token:
- Attention:
  - Q, K, V projections: each is d_model → d_model, so 3 × 2 × d_model × d_model = 6 × d_model²
  - Output projection: 2 × d_model × d_model = 2 × d_model²
  - Total attention: 8 × d_model²
- MLP:
  - First linear: 2 × d_model × (4×d_model) = 8 × d_model²
  - Second linear: 2 × (4×d_model) × d_model = 8 × d_model²
  - Total MLP: 16 × d_model²
- Total per layer: 24 × d_model²
(This ignores the attention-score computation itself, i.e. QKᵀ and the weighted sum over values, which adds roughly 4 × S × d_model FLOPs per token per layer. At S = 2048 and d_model = 12288 that is about 3% extra, so the estimate remains a good approximation.)
This is exactly twice the per-layer parameter count (12 × d_model²), and that is no contradiction: each parameter participates in one multiply-add per token, and a multiply-add counts as 2 FLOPs. Generalizing across the whole model gives the standard community estimate:
FLOPs per forward pass = 2 × N × S
Where N = number of parameters, S = sequence length.
Why? Each parameter is used once per token in the forward pass, costing one multiply and one add (2 FLOPs), so the per-token cost is ≈ 2 × N and the per-sequence cost is ≈ 2 × N × S. This is the widely accepted approximation throughout the scaling-law literature.
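Here is a minimal sketch tying the 24 × d_model² per-layer figure back to the “2 FLOPs per parameter per token” rule (variable names are mine; attention-score FLOPs are ignored, as discussed above):

```python
D_MODEL  = 12288
N_LAYERS = 96
N_PARAMS = 175e9

# Forward FLOPs per token per layer, counting each linear layer as
# 2 * in_features * out_features (one multiply + one add per weight).
attn_proj = 4 * 2 * D_MODEL * D_MODEL        # Q, K, V, O projections: 8*d^2
mlp       = 2 * 2 * D_MODEL * (4 * D_MODEL)  # two MLP matmuls: 16*d^2
per_layer = attn_proj + mlp                  # 24*d^2

forward_per_token = N_LAYERS * per_layer
print(f"{forward_per_token:.3e}")  # ~3.479e+11
print(f"{2 * N_PARAMS:.3e}")       # 3.500e+11 -> roughly 2 FLOPs per parameter per token
```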
(For reference, the GPT-3 paper reports total training compute of about 3.14 × 10²³ FLOPs; the goal here is to arrive at that figure independently.)
So the forward-pass cost is:
FLOPs per forward pass ≈ 2 × N × S
Then for a full training run, we need:
Total FLOPs = FLOPs per forward pass × number of forward passes
But each forward pass is followed by a backward pass, which costs roughly 2× the forward pass: for every weight matrix, the backward pass performs one matrix multiply for the gradient with respect to the activations and one for the gradient with respect to the weights.
So the standard approximation is:
- Forward: 2 × N × S
- Backward: 4 × N × S
- Weight update: negligible (on the order of N, not N × S)
So total per step: ~6 × N × S
This is the most commonly used accounting in the field (e.g., in Kaplan et al.’s scaling-law analysis and in “Training Compute-Optimal Large Language Models”).
So, treating one training step as a single sequence of S tokens for now (the batch size cancels out later):
FLOPs per step = 6 × N × S
Where:
- N = 175 × 10⁹
- S = 2048
So:
FLOPs per step = 6 × 175e9 × 2048
Calculate:
175e9 × 2048 = 175 × 2048 × 10⁹
175 × 2048 = 175 × (2000 + 48) = 350,000 + 8,400 = 358,400
So = 358,400 × 10⁹ = 3.584 × 10¹⁴
Then × 6 = 2.1504 × 10¹⁵ FLOPs per training step
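The same arithmetic in a couple of lines (per 2048-token sequence, i.e., the batch-size-1 simplification used here):

```python
N_PARAMS = 175e9
SEQ_LEN  = 2048

flops_per_step = 6 * N_PARAMS * SEQ_LEN
print(f"{flops_per_step:.4e}")  # 2.1504e+15 FLOPs per 2048-token sequence
```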
Step 3: How many training steps?
GPT-3 was trained on 300 billion tokens (from the paper).
Each training step processes a batch of tokens. With sequence length 2048, the number of tokens per batch depends on batch size.
But total tokens seen = 300B.
So number of training steps = total tokens / tokens per step
For simplicity, assume each step processes a single sequence of S tokens (batch size 1); the batch size drops out of the total anyway. Then steps = 300e9 / 2048
Let’s compute:
300e9 / 2048 ≈ 300,000,000,000 / 2,048
≈ 146,484,375 steps
So ~146.5 million steps
Step 4: Total FLOPs
Total FLOPs = FLOPs per step × number of steps
= (6 × N × S) × (total_tokens / S)
= 6 × N × total_tokens
Oh! The S cancels out!
This is the key insight:
Total FLOPs = 6 × N × T
Where T = total number of tokens seen during training
This is a famous scaling law: total FLOPs ~ 6 × N × T
So:
N = 175 × 10⁹
T = 300 × 10⁹
Total FLOPs = 6 × 175e9 × 300e9
= 6 × 175 × 300 × 10¹⁸
= 6 × 52,500 × 10¹⁸
= 315,000 × 10¹⁸
= 3.15 × 10²³ FLOPs
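A final sketch computes the total both ways, confirming that the sequence length really does cancel (values as above; a back-of-the-envelope check, not a precise accounting):

```python
N_PARAMS     = 175e9   # GPT-3 parameters
TOTAL_TOKENS = 300e9   # training tokens reported in the GPT-3 paper
SEQ_LEN      = 2048

# Route 1: per-sequence cost times the number of sequences.
flops_per_step  = 6 * N_PARAMS * SEQ_LEN
steps           = TOTAL_TOKENS / SEQ_LEN   # ~146.5 million
total_via_steps = flops_per_step * steps

# Route 2: S cancels, leaving 6 * N * T.
total_direct = 6 * N_PARAMS * TOTAL_TOKENS

print(f"{total_via_steps:.3e}")  # 3.150e+23
print(f"{total_direct:.3e}")     # 3.150e+23
```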
Conclusion
✅ Estimated total FLOPs to train GPT-3: ~3.15 × 10²³
This closely matches the published figure (≈3.14 × 10²³ FLOPs) from the GPT-3 paper.
Why this works
- The sequence length S cancels out because although each step has S tokens, the number of steps is inversely proportional to S.
- The 6× factor comes from: forward (2×), backward (4×), and update (negligible) → total ~6×.
- The formula Total FLOPs ≈ 6 × N × T is now a standard scaling law in LLM training.
✅ Final Answer: Approximately 3.15 × 10²³ FLOPs
This estimate is derived purely from known architecture size (175B parameters), training data size (300B tokens), and the standard FLOP accounting rule for transformer training.