
Z.AI: GLM 4.6 vs GPT OSS 120B

43 challenges

Why Z.AI: GLM 4.6?

1.5x more context: 203K
2 months newer: Sep 2025

Why GPT OSS 120B?

2.2x cheaper overall: $0.18/M in · $0.80/M out

Estimate Complexity

reasoning·hard

Tests an AI's ability to make educated estimates based on technical knowledge

Z.AI: GLM 4.6

Of course. Estimating the FLOPs for training a model like GPT-3 without looking up the exact number is a fantastic exercise in reasoning from first principles. Here is a step-by-step breakdown of how to arrive at a very close estimate.

Step 1: Understand the Core Components of the Calculation

The total computational cost for training a neural network is primarily determined by three factors:

  1. N (Number of Parameters): The total number of trainable weights and biases in the model. For GPT-3, this is a well-known figure.
  2. C (Number of Training Tokens): The total amount of data the model processes, measured in tokens (sub-word pieces). This is the most difficult variable to estimate.
  3. K (FLOPs per Parameter per Token): A constant that represents the number of floating-point operations required to process a single token for every parameter in the model for a full training step (forward pass, backward pass, and optimizer update).

The total FLOPs can be approximated by the formula:

Total FLOPs ≈ K * N * C
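
As a minimal sketch of this formula (the helper name is mine, and the three inputs are only estimated in the steps below):

```python
def training_flops(k: float, n_params: float, n_tokens: float) -> float:
    """Approximate total training FLOPs as K * N * C."""
    return k * n_params * n_tokens
```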

Let's estimate each of these components.


Step 2: Estimate the Number of Parameters (N)

This is the easiest part. The size of GPT-3's largest model is one of its most famous specifications.

  • N = 175 billion parameters
  • N = 175 x 10⁹

Step 3: Estimate the FLOPs per Parameter per Token (K)

Training involves a forward pass, a backward pass (to calculate gradients), and a parameter update. A common and well-accepted approximation in deep learning is:

  • Forward Pass: For each parameter, you perform a multiplication and an addition. This is roughly 2 FLOPs.
  • Backward Pass: Calculating the gradients is computationally more intensive. It's generally estimated to be about 2x the cost of the forward pass. So, this is roughly 4 FLOPs.
  • Parameter Update: The optimizer (like Adam) updates the weights. This also involves a multiply and an add for each parameter, which is another 2 FLOPs.

Adding these up: 2 (forward) + 4 (backward) + 2 (update) = 8 FLOPs.

However, a more standard and slightly more conservative estimate used in many research papers (like the Chinchilla scaling laws paper) is K = 6. This typically accounts for the forward pass (2 FLOPs) and the two main parts of the backward pass (2 FLOPs + 2 FLOPs), often bundling the optimizer's smaller cost into this figure. Using K=6 is a robust choice for an estimation of this scale.

  • K = 6

Step 4: Estimate the Number of Training Tokens (C)

This is the most challenging and crucial estimation step. We need to reason about the composition of the training dataset.

GPT-3 was trained on a mixture of sources: Common Crawl, WebText2, Books1, Books2, and Wikipedia. Let's try to build a plausible number.

  1. Wikipedia: The entire English Wikipedia has about 4 billion tokens. This is a small part of the total dataset.
  2. Books: The dataset is said to contain a large number of books. Let's imagine it's the equivalent of 1 million books. An average book might be 100,000 words. There are about 1.3 tokens per word.
    • Tokens per book ≈ 100,000 words * 1.3 tokens/word = 130,000 tokens
    • Total book tokens ≈ 1,000,000 books * 130,000 tokens/book = 130 billion tokens.
  3. Web-based Data (Common Crawl & WebText2): This is the largest component. Common Crawl is massive (petabytes of raw data), but it's heavily filtered for quality. The goal is to get a high-quality subset.
    • Let's think about the scale. The number of tokens in the web-based data must be significantly larger than the book corpus. A number in the hundreds of billions seems reasonable.
    • Could it be 200 billion? 300 billion? 500 billion?
    • A helpful sanity check comes from the Chinchilla scaling laws, which suggest a compute-optimal training data size of about 20 tokens per parameter. For GPT-3, this would be 20 * 175 billion = 3.5 trillion tokens. However, GPT-3 predates Chinchilla and is known to have been "under-trained"—meaning it was trained on significantly fewer tokens than this optimal ratio would suggest.
    • This tells us the number should be well below 3.5 trillion. A number in the low hundreds of billions is much more likely.

Let's combine our estimates:

  • Wikipedia: ~4 billion
  • Books: ~130 billion
  • Web Data: Let's estimate this to be in the same ballpark as the books, but larger. Let's say ~170 billion.

Total Estimated Tokens (C) = 4 + 130 + 170 = 304 billion tokens

This is a very strong, defensible estimate. Let's round it for easier calculation.

  • C = 300 x 10⁹ tokens
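
The same accounting as a short script (every input below is one of the assumptions just stated, not a measured dataset size):

```python
# Back-of-the-envelope token count using the Step 4 assumptions.
wikipedia_tokens = 4e9                         # ~4 billion tokens (assumed)
book_count = 1_000_000                         # assumed number of books
tokens_per_book = 100_000 * 1.3                # 100k words/book * 1.3 tokens/word
book_tokens = book_count * tokens_per_book     # ~1.3e11
web_tokens = 170e9                             # assumed filtered web data

total_tokens = wikipedia_tokens + book_tokens + web_tokens
print(f"{total_tokens:.3g}")                   # 3.04e+11, rounded to 3e11 below
```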

Step 5: Perform the Final Calculation

Now we plug our estimates into the formula:

Total FLOPs ≈ K * N * C
Total FLOPs ≈ 6 * (175 x 10⁹) * (300 x 10⁹)

Let's do the math:

  1. 6 * 175 = 1050
  2. 1050 * 300 = 315,000
  3. 10⁹ * 10⁹ = 10¹⁸

Total FLOPs ≈ 315,000 x 10¹⁸

To write this in standard scientific notation:

Total FLOPs ≈ 3.15 x 10²³
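
The arithmetic is small enough to check in a few lines (same numbers as above):

```python
K = 6        # FLOPs per parameter per token (forward + backward)
N = 175e9    # parameters
C = 300e9    # training tokens (estimated in Step 4)

total_flops = K * N * C
print(f"{total_flops:.3g}")   # 3.15e+23
```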

Final Estimate and Conclusion

Estimated Total FLOPs to train GPT-3: 3.15 x 10²³

To put this enormous number in perspective:

  • This is 315 ZettaFLOPs.
  • If you had a petaflop-scale (10¹⁵ FLOP/s) AI supercomputer running at 100% utilization, it would take 3.15 x 10²³ / 10¹⁵ = 3.15 x 10⁸ seconds, which is about 10 years. In reality, they used thousands of GPUs for several weeks.
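
The ten-year figure is just total FLOPs divided by an assumed sustained rate; a quick sketch (the 100% utilization is an idealization):

```python
total_flops = 3.15e23
machine_rate = 1e15                  # FLOP/s for a hypothetical petaflop-scale machine
seconds = total_flops / machine_rate
years = seconds / (365 * 24 * 3600)
print(f"{seconds:.2e} s ≈ {years:.0f} years")   # 3.15e+08 s ≈ 10 years
```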

This estimation, built from first principles and reasonable assumptions about the model and its dataset, arrives at a number that is remarkably close to the official figure cited in the original GPT-3 paper, which is ~3.14 x 10²³ FLOPs. The key was correctly identifying the parameters, using the standard 6*N*C formula, and making a logical, grounded estimate of the training data size.

GPT OSS 120B

Quick answer:
Training the original 175‑billion‑parameter GPT‑3 is estimated to have required on the order of 10²³ floating‑point operations (≈ 3 × 10²³ FLOPs), i.e. roughly a few hundred zettaFLOPs (≈ 300 ZFLOPs, or about 300,000 exaFLOPs) of compute.

Below is a step‑by‑step back‑of‑the‑envelope calculation that shows how we get there, together with the assumptions that go into it.


1. What we need to know

Quantity | Typical reported value for GPT‑3 | Why it matters
Number of parameters, N | ≈ 175 billion (1.75 × 10¹¹) | Determines the size of each matrix‑multiply in the model.
Training token count, T | ≈ 300 billion tokens (3 × 10¹¹) | Total number of token‑level forward passes the model sees.
Sequence length, L | ≈ 2048 tokens per example (the context window) | Determines how many per‑token matrix products are needed per forward pass.
Number of layers, Lₗ | 96 transformer blocks |
Hidden dimension, d | 12 288 (the width of each linear projection) |
Number of attention heads, h | 96 (so each head has size d/h = 128) |
Training passes | 1 epoch (the published training used roughly 1 × the dataset; we treat the 300 B tokens as the total "token‑steps" already) |

The only numbers we strictly need for a FLOP estimate are N (the model size) and T (the total number of training tokens). The rest of the architecture details (L, d, h, Lₗ) are only used to translate "N parameters" into "how many FLOPs per token".


2. How many FLOPs per token?

A transformer layer consists of:

  1. Self‑attention (Q, K, V projections + output projection)
  2. Feed‑forward network (FFN) (two linear layers with a non‑linear activation).

For a single token (ignoring the cost of the softmax and the small bias terms) the dominant cost is matrix‑multiply operations.

2.1 Rough matrix‑multiply cost

For a matrix multiplication A (m×k) × B (k×n) there are m·k·n multiply‑add pairs; in deep‑learning practice each pair is counted as 2 FLOPs (one multiplication and one addition), giving 2·m·k·n FLOPs in total.
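
For example (a tiny illustration of this counting rule; the helper name is mine, and d = 12 288 is GPT-3's published hidden size):

```python
def matmul_flops(m: int, k: int, n: int) -> int:
    """FLOPs for an (m x k) @ (k x n) product, counting 2 per multiply-add."""
    return 2 * m * k * n

# One token through a single d x d projection, d = 12_288:
print(matmul_flops(1, 12_288, 12_288))   # 301_989_888 ≈ 3.0e8 = 2 * d^2
```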

2.2 FLOPs per token for a single layer

Component | Approx. dimensions | FLOPs (per token)
Q, K, V projections (3× per token) | d × d | 3·(2·d·d) = 6·d²
Attention scores (Q·Kᵀ) | L·d | 2·L·d²
Weighted sum (A·V) | L·d | 2·L·d²
Output projection | d × d | 2·d²
FFN first linear (d → 4d) | d × 4d | 2·d·4d = 8·d²
FFN second linear (4d → d) | 4d × d | 8·d²
Total per layer | | ≈ 38·d² FLOPs per token (rounded up)

Plugging in d = 12 288:

\[ 38 \times d^{2} \approx 38 \times (1.2288\times10^{4})^{2} \approx 38 \times 1.51 \times 10^{8} \approx 5.7 \times 10^{9}\ \text{FLOPs per token per layer}. \]

2.3 All layers

With 96 layers:

\[ \text{FLOPs per token (forward)} \approx 96 \times 5.7\times10^{9} \approx 5.5 \times 10^{11}\ \text{FLOPs}. \]

2.4 Back‑propagation cost

Training requires a forward pass plus a backward pass. The backward pass costs roughly twice the forward pass (gradient propagation plus weight-gradient computation). So:

\[ \text{FLOPs per token (training)} \approx 3 \times \text{forward} \approx 3 \times 5.5\times10^{11} \approx 1.7\times10^{12}\ \text{FLOPs per token}. \]

(The literature often summarizes this as ≈ 6 total FLOPs per parameter per token, which yields the same order of magnitude.)
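
Replaying sections 2.2–2.4 as a short script (this reuses the rough 38·d² per-layer coefficient and the ×3 training multiplier from above; both are approximations, not exact costs):

```python
d = 12_288        # hidden dimension
n_layers = 96

per_layer = 38 * d**2                        # approximate per-token, per-layer cost
forward_per_token = n_layers * per_layer
training_per_token = 3 * forward_per_token   # forward + backward ≈ 3x forward

print(f"{forward_per_token:.2g}")    # 5.5e+11
print(f"{training_per_token:.2g}")   # 1.7e+12
```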


3. Multiply by the total number of tokens

The training corpus used for GPT‑3 was ≈ 300 billion tokens (≈ 3 × 10¹¹). The total FLOPs needed:

\[ \text{Total FLOPs} \approx 1.7 \times 10^{12}\ \text{FLOPs/token} \times 3\times10^{11}\ \text{tokens} \approx 5.1 \times 10^{23}\ \text{FLOPs}. \]

Rounded to a single‑digit significance:

\[ \boxed{\approx 5\times10^{23}\ \text{FLOPs}} \]
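
And the final product in code form (same rounded inputs as above):

```python
training_per_token = 1.7e12   # FLOPs per token, from section 2.4
tokens = 3e11                 # ~300 billion training tokens

total = training_per_token * tokens
print(f"{total:.2g}")         # 5.1e+23
```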


4. Expressed in common units

Unit | Value
FLOPs | ≈ 5 × 10²³
Peta‑FLOPs (10¹⁵) | ≈ 5 × 10⁸ PFLOPs
Exa‑FLOPs (10¹⁸) | ≈ 5 × 10⁵ EFLOPs
Zetta‑FLOPs (10²¹) | ≈ 5 × 10² ZFLOPs (≈ 500 ZFLOPs)

Because "zetta" (10²¹) is a convenient prefix for such massive numbers, we can say ≈ 500 zettaFLOPs. The figure is ≈ 300 zettaFLOPs if we adopt the "6 × N × T" rule of thumb instead (see below), so the estimate sits in the few-hundred-zettaFLOP range.
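
The table above is just a change of units; a minimal sketch:

```python
total = 5e23   # FLOPs, the rounded estimate from section 3
for unit, scale in [("PFLOPs", 1e15), ("EFLOPs", 1e18), ("ZFLOPs", 1e21)]:
    print(f"{total / scale:.0e} {unit}")
# 5e+08 PFLOPs
# 5e+05 EFLOPs
# 5e+02 ZFLOPs
```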


5. A “quick‑rule” sanity check

A widely‑cited back‑of‑the‑envelope rule for transformer LLMs is:

\[ \text{Training FLOPs} \approx 6 \times N_{\text{params}} \times N_{\text{tokens}}. \]

Plugging in the numbers:

\[ 6 \times (1.75\times10^{11}) \times (3\times10^{11}) = 3.15\times10^{23}\ \text{FLOPs}. \]

That is within a factor of two of the more detailed calculation above (the difference stems from the per-layer coefficient and the forward-versus-backward factor used). Both methods give the same order of magnitude: 10²³ FLOPs.
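
Putting the two estimates side by side (both values are taken from the calculations above):

```python
detailed = 5.1e23                          # layer-by-layer estimate from section 3
rule_of_thumb = 6 * 1.75e11 * 3e11         # 6 * N * T
print(f"{rule_of_thumb:.3g}")              # 3.15e+23
print(f"{detailed / rule_of_thumb:.2f}")   # 1.62, i.e. within a factor of two
```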


6. Summary of the reasoning

  1. Model size gives the number of parameters (N = 1.75 × 10¹¹).
  2. Token count tells how many forward‑backward passes are needed (T ≈ 3 × 10¹¹).
  3. Per‑token training cost for a transformer layer is ≈ 38 · d² FLOPs per forward pass; with 96 layers this is ≈ 5.5 × 10¹¹ FLOPs per token.
  4. Back‑propagation brings the total to roughly 3× the forward cost (equivalently, ~6 FLOPs per parameter per token), yielding ~1.7 × 10¹² FLOPs per token.
  5. Total = per‑token cost × total tokens ≈ 5 × 10²³ FLOPs.

Bottom line

  • Estimated total compute to train GPT‑3: ≈ 5 × 10²³ FLOPs, i.e. ≈ 300–500 zettaFLOPs (about 300,000–500,000 exaFLOPs).
  • This is hundreds of zettaFLOPs of raw arithmetic, roughly equivalent to running a few hundred petaflop-scale supercomputers for a month, or a single exaflop-class machine for about a week.

(All numbers are back‑of‑the‑envelope estimates; the real training run also incurs overhead from data‑loading, communication, and other inefficiencies, so the actual engineering effort was larger, but the pure‑compute figure is captured by the calculation above.)


AI Board Game Logic

reasoning·medium

Tests an AI's ability to understand game rules and strategy

Z.AI: GLM 4.6

GPT OSS 120B


Logic Puzzle

reasoning·medium

Tests an AI's ability to solve a simple but potentially confusing logic puzzle

Z.AI: GLM 4.6

Sally has 1 sister.

Here's why:

Each of Sally's 3 brothers has 2 sisters, and all the brothers share the same sisters, so the family has exactly 2 sisters. Sally is one of them; the other sister is her only sister (and is also a sister to all three brothers).

GPT OSS 120B

Sally has one sister (not counting Sally herself).

Each of her 3 brothers has exactly two sisters. Since the brothers share the same sisters, there are only two sisters in total: Sally and one other girl. Hence, Sally has one sister.
