Estimate Complexity
GPT OSS 20B's response to the 'Estimate Complexity' challenge.
Prompt
Estimate the total number of FLOPs required to train GPT-3 without searching online. Explain your reasoning step by step.
Response
Below is a step‑by‑step, “hand‑calculation” of how many floating‑point operations (FLOPs) GPT‑3 had to do to learn from its training data.
I'll show the logic, the numbers I use, and how they combine to land near the widely quoted figure of roughly 3 × 10²³ FLOPs. (All the math can be done in a spreadsheet or a simple calculator.)
1. What is a GPT‑3 training step?
A training step consists of a forward pass plus back-propagation. For each token, the forward pass through a single transformer layer involves the following blocks (the backward pass costs roughly twice the forward pass and is folded in at the end):
- Linear projections for queries (Q), keys (K) and values (V) – 3 times.
- Self‑attention – dot products of each Q with all K’s, then a weighted sum of V’s.
- Feed‑forward network (FFN) – two linear layers with a GELU nonlinearity in between.
The FLOPs for each of these parts can be written in closed form.
2. Key hyper‑parameters of GPT‑3 175 B
| Parameter | Value |
|---|---|
| Hidden dimension $d_{\text{model}}$ | 12,288 |
| Number of layers | 96 |
| Attention heads | 96 |
| Head size | 128 |
| Context window (sequence length) $L$ | 2,048 |
| Tokens processed (overall) | ≈ 300 billion |
Why ≈ 300 billion tokens?
The GPT‑3 paper reports training on roughly 300 billion tokens, sampled (with per-source weights) from a filtered mix of Common Crawl, WebText2, books corpora, and Wikipedia. The raw crawl is on the order of 45 TB of text, but after filtering and deduplication the curated dataset contains a few hundred billion tokens, of which about 300 billion are actually consumed during training.
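To make the arithmetic below easy to reproduce, here is a small Python sketch (my own convenience, not part of any official source) that encodes these hyper‑parameters as constants; the later snippets reuse the same names.

```python
# GPT-3 175B hyper-parameters used throughout this estimate
# (values from the table above; TOKENS is the ~300B training
# tokens reported in the GPT-3 paper).
D_MODEL = 12_288      # hidden dimension
N_LAYERS = 96         # transformer layers
N_HEADS = 96          # attention heads
HEAD_DIM = 128        # per-head dimension
SEQ_LEN = 2_048       # context window L
TOKENS = 300e9        # tokens processed during training

assert N_HEADS * HEAD_DIM == D_MODEL  # 96 * 128 = 12,288
```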
3. FLOPs per token per layer
3.1 Projection FLOPs (Q, K, V)
For each token we multiply its vector (size $d_{\text{model}}$) by a $d_{\text{model}} \times d_{\text{model}}$ weight matrix.
- FLOPs per projection = $2 d_{\text{model}}^2$ (one multiply + one add per weight).
- Three projections (Q, K, V) → $6 d_{\text{model}}^2$.

Numeric: $6 \times (12{,}288)^2 \approx 6 \times 1.51 \times 10^8 \approx 9.06 \times 10^8$ FLOPs.
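A quick Python check of this step (a sketch; the numbers match the text above):

```python
D_MODEL = 12_288

# One projection multiplies a d_model-vector by a d_model x d_model matrix,
# counting one multiply + one add per weight -> 2 * d_model**2 FLOPs.
proj_flops = 2 * D_MODEL**2        # ~3.0e8
qkv_flops = 3 * proj_flops         # Q, K and V projections

print(f"{qkv_flops:.3e}")          # -> 9.060e+08
```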
3.2 Self‑attention FLOPs
- Dot products: each Q (size $d_{\text{model}}$) is dotted with each of the $L$ K vectors. FLOPs per token = $2 \times L \times d_{\text{model}}$. Numeric: $2 \times 2{,}048 \times 12{,}288 \approx 5.0 \times 10^7$.
- Weighted sum of V's: same cost as the dot products → another $5.0 \times 10^7$.

Total attention ≈ $1.0 \times 10^8$ FLOPs per token per layer.
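The corresponding check for the attention terms (same convention as above):

```python
D_MODEL = 12_288
SEQ_LEN = 2_048

# Q.K scores: each token's query is dotted with all L key vectors.
qk_flops = 2 * SEQ_LEN * D_MODEL   # ~5.0e7
# Weighted sum over the L value vectors has the same cost.
av_flops = 2 * SEQ_LEN * D_MODEL   # ~5.0e7

attn_flops = qk_flops + av_flops
print(f"{attn_flops:.3e}")         # -> 1.007e+08
```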
3.3 Feed‑forward network (FFN)
The FFN has two linear layers with hidden size $4 d_{\text{model}}$ (the usual "4×" factor).
- FLOPs per linear layer per token = $2 \times 4 d_{\text{model}} \times d_{\text{model}} = 8 d_{\text{model}}^2$.
- Two layers → $16 d_{\text{model}}^2$.

Numeric: $16 \times 1.51 \times 10^8 \approx 2.4 \times 10^9$.
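And a check for the feed‑forward block:

```python
D_MODEL = 12_288
D_FF = 4 * D_MODEL                     # the usual 4x hidden size

# Two linear layers (d_model -> 4*d_model and 4*d_model -> d_model),
# 2 FLOPs per weight each -> 2 * (2 * D_FF * D_MODEL) = 16 * d_model**2.
ffn_flops = 2 * (2 * D_FF * D_MODEL)
print(f"{ffn_flops:.3e}")              # -> 2.416e+09
```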
3.4 Sum per token per layer
$$\text{Proj} + \text{Attn} + \text{FFN} \approx 9.06 \times 10^{8} + 1.0 \times 10^{8} + 2.4 \times 10^{9} \approx 3.4 \times 10^{9}\ \text{FLOPs (forward, per token, per layer)}.$$
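Putting the three terms together in code (a sketch reusing the constants defined earlier):

```python
D_MODEL, SEQ_LEN = 12_288, 2_048

qkv_flops = 6 * D_MODEL**2                  # 3.1 projections
attn_flops = 2 * (2 * SEQ_LEN * D_MODEL)    # 3.2 attention
ffn_flops = 16 * D_MODEL**2                 # 3.3 feed-forward

per_layer_forward = qkv_flops + attn_flops + ffn_flops
print(f"{per_layer_forward:.2e}")           # -> 3.42e+09
```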
4. FLOPs per token for the whole network
Multiply by the number of layers:
$$\text{FLOPs/token (forward)} = 3.4 \times 10^{9}\ \text{FLOPs} \times 96\ \text{layers} \approx 3.3 \times 10^{11}\ \text{FLOPs/token}.$$
5. FLOPs for the entire training corpus
Multiply by the total number of tokens processed (≈ 300 billion):
$$\text{Forward FLOPs} = 3.3 \times 10^{11}\ \text{FLOPs/token} \times 3 \times 10^{11}\ \text{tokens} \approx 1.0 \times 10^{23}.$$

The backward pass costs roughly twice as much as the forward pass, so the full training run (forward + backward) is about three times the forward‑only figure:

$$\text{Total training FLOPs} \approx 3 \times 1.0 \times 10^{23} \approx 3 \times 10^{23}.$$
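A short script covering sections 4 and 5 together (the 3× forward‑plus‑backward factor is the usual rule of thumb, not something measured here):

```python
PER_LAYER_FORWARD = 3.42e9   # per-token, per-layer forward FLOPs (section 3.4)
N_LAYERS = 96
TOKENS = 300e9
FWD_BWD_FACTOR = 3           # backward pass ~2x the forward pass

per_token_forward = PER_LAYER_FORWARD * N_LAYERS    # ~3.3e11
forward_total = per_token_forward * TOKENS          # ~1.0e23
training_total = FWD_BWD_FACTOR * forward_total     # ~3.0e23

print(f"{per_token_forward:.2e}  {forward_total:.2e}  {training_total:.2e}")
# -> 3.28e+11  9.85e+22  2.95e+23
```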
6. Convert to conventional units
- $10^{18}$ = exa
- $10^{21}$ = zetta
- $10^{24}$ = yotta

$$3 \times 10^{23}\ \text{FLOPs} = 300{,}000\ \text{exaFLOPs} = 300\ \text{zettaFLOPs} = 0.3\ \text{yottaFLOPs}.$$
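The unit conversion, for completeness:

```python
TOTAL_FLOPS = 3e23  # estimate from section 5

print(f"{TOTAL_FLOPS / 1e18:,.0f} exaFLOPs")    # -> 300,000 exaFLOPs
print(f"{TOTAL_FLOPS / 1e21:,.0f} zettaFLOPs")  # -> 300 zettaFLOPs
print(f"{TOTAL_FLOPS / 1e24:.1f} yottaFLOPs")   # -> 0.3 yottaFLOPs
```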
The figure usually quoted for GPT‑3 175B is about $3.14 \times 10^{23}$ FLOPs (≈ 3,640 petaflop/s‑days), so this hand calculation lands within about 10 % of it. The small remaining gap comes from terms the derivation deliberately ignores:
- the attention output projection, softmax, biases, layer norms, and the embedding/unembedding matrices;
- the exact token count and other bookkeeping details of the real training run.
Note that mini‑batching and implementation tricks (kernel fusion, reduced precision, etc.) change how fast the FLOPs are executed, not how many are required. The hand calculation above is a first‑order sanity check that shows the order of magnitude is correct.
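As an independent cross‑check (not part of the derivation above), the common rule of thumb "training FLOPs ≈ 6 × parameters × tokens" gives essentially the same answer:

```python
N_PARAMS = 175e9   # GPT-3 parameter count
TOKENS = 300e9     # training tokens

# ~2 FLOPs per parameter per token forward, ~4 backward -> 6 * N * D in total.
rule_of_thumb = 6 * N_PARAMS * TOKENS
print(f"{rule_of_thumb:.2e}")   # -> 3.15e+23
```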
Bottom‑Line
- Per‑token, per‑layer forward FLOPs: ≈ 3.4 × 10⁹
- 96 layers → ≈ 3.3 × 10¹¹ forward FLOPs per token
- Backward pass ≈ 2× the forward pass → ≈ 1 × 10¹² training FLOPs per token
- Tokens processed: ≈ 3 × 10¹¹ → ≈ 3 × 10²³ FLOPs in total
- In conventional units: ≈ 300 zettaFLOPs (≈ 0.3 yottaFLOPs)
So, without looking anything up, we can see that GPT‑3's training required on the order of 3 × 10²³ FLOPs, i.e. hundreds of zettaFLOPs: a truly staggering amount of compute!
About GPT OSS 20B
Specifications
- Provider: OpenAI
- Released: 2025-08-05
- Size: Large
- Parameters: 21B total (3.6B active per forward pass)
- Context: 131,072 tokens