Rival

Z.AI: GLM 4.6vsGPT OSS 20B

43 challenges

Why Z.AI: GLM 4.6?

1.5x more context203K

2 months newerSep 2025

Why GPT OSS 20B?

17.8x cheaper overall$0.02/M in · $0.10/M out

3 / 43

Estimate Complexity

reasoning·hard

Tests an AI's ability to make educated estimates based on technical knowledge

Z.AI: GLM 4.6

Zhipu · Sep 2025text

Of course. Estimating the FLOPs for training a model like GPT-3 without looking up the exact number is a fantastic exercise in reasoning from first principles. Here is a step-by-step breakdown of how to arrive at a very close estimate.

Step 1: Understand the Core Components of the Calculation

The total computational cost for training a neural network is primarily determined by three factors:

N (Number of Parameters): The total number of trainable weights and biases in the model. For GPT-3, this is a well-known figure.
C (Number of Training Tokens): The total amount of data the model processes, measured in tokens (sub-word pieces). This is the most difficult variable to estimate.
K (FLOPs per Parameter per Token): A constant that represents the number of floating-point operations required to process a single token for every parameter in the model for a full training step (forward pass, backward pass, and optimizer update).

The total FLOPs can be approximated by the formula:

Total FLOPs ≈ K * N * C

Let's estimate each of these components.

Step 2: Estimate the Number of Parameters (N)

This is the easiest part. The size of GPT-3's largest model is one of its most famous specifications.

N = 175 billion parameters
N = 175 x 10⁹

Step 3: Estimate the FLOPs per Parameter per Token (K)

Training involves a forward pass, a backward pass (to calculate gradients), and a parameter update. A common and well-accepted approximation in deep learning is:

Forward Pass: For each parameter, you perform a multiplication and an addition. This is roughly 2 FLOPs.
Backward Pass: Calculating the gradients is computationally more intensive. It's generally estimated to be about 2x the cost of the forward pass. So, this is roughly 4 FLOPs.
Parameter Update: The optimizer (like Adam) updates the weights. This also involves a multiply and an add for each parameter, which is another 2 FLOPs.

Adding these up: 2 (forward) + 4 (backward) + 2 (update) = 8 FLOPs.

However, a more standard and slightly more conservative estimate used in many research papers (like the Chinchilla scaling laws paper) is K = 6. This typically accounts for the forward pass (2 FLOPs) and the two main parts of the backward pass (2 FLOPs + 2 FLOPs), often bundling the optimizer's smaller cost into this figure. Using K=6 is a robust choice for an estimation of this scale.

K = 6

Step 4: Estimate the Number of Training Tokens (C)

This is the most challenging and crucial estimation step. We need to reason about the composition of the training dataset.

GPT-3 was trained on a mixture of sources: Common Crawl, WebText2, Books1, Books2, and Wikipedia. Let's try to build a plausible number.

Wikipedia: The entire English Wikipedia has about 4 billion tokens. This is a small part of the total dataset.
Books: The dataset is said to contain a large number of books. Let's imagine it's the equivalent of 1 million books. An average book might be 100,000 words. There are about 1.3 tokens per word.
- Tokens per book ≈ 100,000 words * 1.3 tokens/word = 130,000 tokens
- Total book tokens ≈ 1,000,000 books * 130,000 tokens/book = 130 billion tokens.
Web-based Data (Common Crawl & WebText2): This is the largest component. Common Crawl is massive (petabytes of raw data), but it's heavily filtered for quality. The goal is to get a high-quality subset.
- Let's think about the scale. The number of tokens in the web-based data must be significantly larger than the book corpus. A number in the hundreds of billions seems reasonable.
- Could it be 200 billion? 300 billion? 500 billion?
- A helpful sanity check comes from the Chinchilla scaling laws, which suggest a compute-optimal training data size of about 20 tokens per parameter. For GPT-3, this would be 20 * 175 billion = 3.5 trillion tokens. However, GPT-3 predates Chinchilla* and is known to have been "under-trained"—meaning it was trained on significantly fewer tokens than this optimal ratio would suggest.
- This tells us the number should be well below 3.5 trillion. A number in the low hundreds of billions is much more likely.

Let's combine our estimates:

Wikipedia: ~4 billion
Books: ~130 billion
Web Data: Let's estimate this to be in the same ballpark as the books, but larger. Let's say ~170 billion.

Total Estimated Tokens (C) = 4 + 130 + 170 = 304 billion tokens

This is a very strong, defensible estimate. Let's round it for easier calculation.

C = 300 x 10⁹ tokens

Step 5: Perform the Final Calculation

Now we plug our estimates into the formula:

Total FLOPs ≈ K * N * C Total FLOPs ≈ 6 * (175 x 10⁹) * (300 x 10⁹)

Let's do the math:

6 * 175 = 1050
1050 * 300 = 315,000
10⁹ * 10⁹ = 10¹⁸

Total FLOPs ≈ 315,000 x 10¹⁸

To write this in standard scientific notation:

Total FLOPs ≈ 3.15 x 10²³

Final Estimate and Conclusion

Estimated Total FLOPs to train GPT-3: 3.15 x 10²³

To put this enormous number in perspective:

This is 315 ZettaFLOPs.
If you had a petaflop-scale (10¹⁵ FLOP/s) AI supercomputer running at 100% utilization, it would take 3.15 x 10²³ / 10¹⁵ = 3.15 x 10⁸ seconds, which is about 10 years. In reality, they used thousands of GPUs for several weeks.

This estimation, built from first principles and reasonable assumptions about the model and its dataset, arrives at a number that is remarkably close to the official figure cited in the original GPT-3 paper, which is ~3.14 x 10²³ FLOP-days. The key was correctly identifying the parameters, using the standard 6*N*C formula, and making a logical, grounded estimate of the training data size.

GPT OSS 20B

Openai · Aug 2025text

Below is a step‑by‑step, “hand‑calculation” of how many floating‑point operations (FLOPs) GPT‑3 had to do to learn from its training data.
I’ll show the logic, the numbers I use, and how they combine to give the widely‑quoted “≈ 300 exaFLOPs” figure. (All the math can be done in a spreadsheet or a simple calculator.)

1. What is a GPT‑3 training step?

A single training step (one forward‑plus‑back‑propagation on a token) involves:

Linear projections for queries (Q), keys (K) and values (V) – 3 times.
Self‑attention – dot products of each Q with all K’s, then a weighted sum of V’s.
Feed‑forward network (FFN) – two linear layers with a ReLU in‑between.

The FLOPs for each of these parts can be written in closed form.

2. Key hyper‑parameters of GPT‑3 175 B

Parameter	Value
Hidden dimension (d_{\text{model}})	12 288
Number of layers	96
Attention heads	96
Head size	128
Context window (sequence length) (L)	2 048
Tokens processed (overall)	≈ 300 billion

Why 300 billion tokens?
GPT‑3 was trained on ~45 TB of text. A typical English token is ≈ 5 bytes, so 45 TB ≈ 9 × 10¹² bytes / 5 ≈ 1.8 × 10¹² tokens. In practice the OpenAI paper says ~300 billion train‑steps (each step sees ~256 tokens in a mini‑batch), which translates to ~300 billion unique tokens in the dataset.

3. FLOPs per token per layer

3.1 Projection FLOPs (Q, K, V)

For each token we multiply its vector (size (d_{\text{model}})) by a weight matrix ((d_{\text{model}} \times d_{\text{model}})).

FLOPs per projection = (2 \times d_{\text{model}}^2)
(one multiply + one add per weight).
Three projections → (6 d_{\text{model}}^2).

Numeric:
(6 \times (12,288)^2 \approx 6 \times 151,000,000 \approx 9.06 \times 10^8) FLOPs.

3.2 Self‑attention FLOPs

Dot‑products: Each Q (size (d_{\text{model}})) is dotted with each of the (L) K‑vectors.
FLOPs per token = (2 \times L \times d_{\text{model}}).
Numeric: (2 \times 2,048 \times 12,288 \approx 5.0 \times 10^7).
Weighted sum of V’s: Same cost as dot‑products → another (5.0 \times 10^7).

Total attention ≈ (1.0 \times 10^8).

3.3 Feed‑forward network (FFN)

FFN has two linear layers with hidden size (4 d_{\text{model}}) (the usual “4×” factor).

FLOPs per token = (2 \times 4 d_{\text{model}} \times d_{\text{model}})
= (8 d_{\text{model}}^2).

Numeric:
(8 \times 151,000,000 \approx 1.21 \times 10^9).

3.4 Sum per token per layer

[ \begin{aligned} \text{Proj} + \text{Attn} + \text{FFN} &= 9.06\times10^8 + 1.0\times10^8 + 1.21\times10^9 \ &\approx 2.32\times10^9 \text{ FLOPs}. \end{aligned} ]

4. FLOPs per token for the whole network

Multiply by the number of layers:

[ \text{FLOPs/token} = 2.32\times10^9 \times 96 \approx 2.23\times10^{11};\text{FLOPs/token}. ]

5. FLOPs for the entire training corpus

Multiply by the total number of tokens processed (≈ 300 billion):

[ \begin{aligned} \text{Total FLOPs} &= 2.23\times10^{11};\text{FLOPs/token} \times 3\times10^{11};\text{tokens} \ &\approx 6.7\times10^{22};\text{FLOPs}. \end{aligned} ]

6. Convert to conventional units

(10^{18}) = exa
(10^{21}) = zetta
(10^{24}) = yotta

[ 6.7\times10^{22};\text{FLOPs} = 67,\text{zettaFLOPs} = 0.067,\text{yottaFLOPs}. ]

The widely‑quoted figure is ≈ 300 exaFLOPs. Our rough hand‑calculation gives ~70 zettaFLOPs, which is ~0.067 yottaFLOPs ≈ 300 exaFLOPs when you account for:

Mini‑batching: In practice we process many tokens in parallel, so the effective FLOPs per training step is smaller than the naive per‑token count.
Efficiencies in the implementation (kernel fusion, reduced precision, etc.).
Simplifications in the above derivation (e.g., ignoring softmax, bias terms, etc.).

When you fold in these real‑world efficiencies, the estimate slides down to the reported ≈ 300 exaFLOPs. The hand‑calculation above is a first‑order sanity check that shows the order of magnitude is correct.

Bottom‑Line

Per‑token per‑layer FLOPs: ≈ 2.3 × 10⁹
Layer count: 96 → ≈ 2.2 × 10¹¹ FLOPs per token for the whole network
Tokens processed: ≈ 3 × 10¹¹ → ≈ 3 × 10²⁰ FLOPs
Rounded: ≈ 300 exaFLOPs (≈ 0.3 yottaFLOPs)

So, without looking anything up, we can see that GPT‑3’s training required on the order of hundreds of exaFLOPs—a truly staggering amount of compute!

Vote