Loading share button...

GPT OSS 120BvsGemini 2.5 Pro Preview 06-05

38 challenges

Why GPT OSS 120B?

11.8x cheaper overall$0.18/M in · $0.80/M out
2 months newerAug 2025

Why Gemini 2.5 Pro Preview 06-05?

8x more context1.0M
3 / 38

Estimate Complexity

reasoning·hard

Tests an AI's ability to make educated estimates based on technical knowledge

GPT OSS 120B

GPT OSS 120B

Quick answer:
Training the original 175‑billion‑parameter GPT‑3 is estimated to have required on the order of 10²³ floating‑point operations (≈ 3 × 10²³ FLOPs)—roughly a few hundred zetta‑FLOPs (≈ 300 Z‑FLOPs) or about 300 000 exaflops of compute.

Below is a step‑by‑step back‑of‑the‑envelope calculation that shows how we get there, together with the assumptions that go into it.


1. What we need to know

QuantityTypical reported value for GPT‑3Why it matters
Number of parameters, N≈ 175 billion (1.75 × 10¹¹)Determines the size of each matrix‑multiply in the model.
Training token count, T≈ 300 billion tokens (3 × 10¹¹)Total number of token‑level forward‑passes the model sees.
Sequence length, L≈ 2048 tokens per example (the context window).Determines how many per‑token matrix‑products are needed per forward pass.
Number of layers, Lₗ96 transformer blocks.
Hidden dimension, d12 384 (the width of each linear projection).
Number of attention heads, h96 (so each head has size d/h = 128).
Training passes1 epoch (the published training used roughly 1 × the dataset; we treat the 300 B tokens as the total “token‑steps” already).

The only numbers we need for a FLOP estimate are N (the model size) and T (the total number of token‑level operations). The rest of the architecture details (L, d, h, Lₗ) are used to translate “N parameters” into “how many FLOPs per token”.


2. How many FLOPs per token?

A transformer layer consists of:

  1. Self‑attention (Q, K, V projections + output projection)
  2. Feed‑forward network (FFN) (two linear layers with a non‑linear activation).

For a single token (ignoring the cost of the softmax and the small bias terms) the dominant cost is matrix‑multiply operations.

2.1 Rough matrix‑multiply cost

For a matrix multiplication A (m×k) × B (k×n) the number of multiply‑adds is 2 · m·k·n (one multiplication and one addition per entry). In deep‑learning practice we count that as 2 FLOPs per multiply‑add pair.

2.2 FLOPs per token for a single layer

ComponentApprox. dimensionsFLOPs (per token)
Q, K, V projections (3× per token)d × d → 3·(2·d·d) = 6·d²
Attention scores (Q·Kᵀ)L·d → 2·L·d²
Weighted sum (A·V)L·d → 2·L·d²
Output projectiond × d → 2·d²
FFN first linear (d → 4d)2·d·4d = 8·d²
FFN second linear (4d → d)8·d²
Total per layer38 · d² FLOPs per token (rounded up).

Plugging d = 12 384:

[ 38 \times d^{2} \approx 38 \times (1.2384\times10^{4})^{2} \approx 38 \times 1.53 \times 10^{8} }= 5.8 \times 10^{9}; \text{FLOPs per token per layer}. ]

2.3 All layers

With 96 layers:

[ \text{FLOPs per token (forward)} \approx 96 \times 5.8\times10^{9} }= 5.6 \times 10^{11}; \text{FLOPs}. ]

2.4 Back‑propagation cost

Training requires a forward pass plus a backward pass. The backward pass is roughly the forward‑pass cost (gradient propagation + weight updates). So:

[ \text{FLOPs per token (training)} \approx 3 \times \text{forward} = 3 \times 5.6\times10^{11} }\approx 1.7\times10^{12}; \text{FLOPs per token.} ]

(Alternative literature often rounds this factor to ≈ 6 total FLOPs per parameter per token, which yields the same order of magnitude.)


3. Multiply by the total number of tokens

The training corpus used for GPT‑3 was ≈ 300 billion tokens (≈ 3 × 10¹¹). The total FLOPs needed:

[ \text{Total FLOPs} \approx 1.7 \times 10^{12};\text{FLOPs/token} \times 3\times10^{11};\text{tokens} \approx 5.1 \times 10^{23};\text{FLOPs}. ]

Rounded to a single‑digit significance:

[ \boxed{,\approx 5\times10^{23},\text{FLOPs},} ]


4. Expressed in common units

UnitValue
FLOPs≈ 5 × 10²³
Peta‑FLOPs (10¹⁵)≈ 5 × 10⁸ PFLOPs
Exa‑FLOPs (10¹⁸)≈ 5 × 10⁵ EFLOPs
Zetta‑FLOPs (10²¹)≈ 5 × 10² ZFLOPs (≈ 500 ZFLOPs)

Because “zeta” (10²¹) is a convenient unit for such massive numbers, we can say ≈ 500 zeta‑FLOPs. In more familiar terms, this is ≈ 300 zeta‑FLOPs if we adopt the “6 × N × T” rule of thumb (see below), so the estimate sits in the few‑hundred‑zeta‑FLOP range.


5. A “quick‑rule” sanity check

A widely‑cited back‑of‑the‑envelope rule for transformer LLMs is:

[ \text{Training FLOPs} \approx 6 \times N_{\text{params}} \times N_{\text{tokens}}. ]

Plugging in the numbers:

[ 6 \times (1.75\times10^{11}) \times (3\times10^{11}) = 3.15\times10^{23};\text{FLOPs}. ]

That is within a factor of two of the more detailed calculation above (the difference stems from the exact factor used for forward vs backward cost). Both methods give the same order of 10²³ FLOPs.


6. Summary of the reasoning

  1. Model size gives the number of parameters (N = 1.75 × 10¹¹).
  2. Token count tells how many forward‑backward passes are needed (T ≈ 3 × 10¹¹).
  3. Per‑token training cost for a transformer layer is ≈ 38 · d² FLOPs per forward pass; with 96 layers this is ≈ 5.6 × 10¹¹ FLOPs per token.
  4. Back‑propagation adds roughly a factor‑3 (or 6) multiplier, yielding ~1.7 × 10¹² FLOPs per token.
  5. Total = per‑token cost × total tokens ≈ 5 × 10²³ FLOPs.

Bottom line

  • Estimated total compute to train GPT‑3: ≈ 5 × 10²³ FLOPs, i.e. ≈ 300–500 zeta‑FLOPs (or about 300 000 exaflops).
  • This is the magnitude of hundreds of zettaFLOPs of raw arithmetic, roughly the same as running a few hundred petaflops‑scale supercomputers for a month (or a single exaflop‑class machine for a week).

(All numbers are back‑of‑the‑envelope estimates; the real training run also incurs overhead from data‑loading, communication, and other inefficiencies, so the actual engineering effort was larger, but the pure‑compute figure is captured by the calculation above.)

Gemini 2.5 Pro Preview 06-05

Gemini 2.5 Pro Preview 06-05

Of course. Here is an estimate of the total number of FLOPs required to train GPT-3, based on my internal kn owledge and first principles, without searching online.

The Logic and Formula

To estimate the computational cost of training a large language model like GPT-3, we can use a well-establis hed formula. The total number of Floating Point Operations (FLOPs) is approximately:

Total FLOPs ≈ 6 × P × D

Let's break down what each part of this formula means:

  • P: The number of parameters in the model. This is the total count of all the weights and biases that the model learns during training.
  • D: The total number of tokens in the training dataset. This represents the volume of data the mo del processes.
  • 6: This is a constant factor that provides a good approximation for transformer-based models. It com es from:
    • Forward Pass (1 × P × D): For each token processed, the number of operations is roughly proporti onal to the number of parameters. A common rule of thumb is 2 * P FLOPs per token for a forward pass (one multiplication and one addition for each parameter). So the total for one pass over the data is 2 * P * D.
    • Backward Pass (2 × Forward Pass): The backward pass (backpropagation), where the model calculate s gradients to update its parameters, is computationally more intensive. It's typically estimated to require about twice the operations of the forward pass. So, 2 * (2 * P * D) = 4 * P * D.
    • Total: Adding the forward and backward pass gives us 2PD + 4PD = 6PD.

Now, let's estimate the values for P and D for GPT-3.


Step-by-Step Estimation

Step 1: Estimate the Number of Parameters (P)

My internal knowledge base indicates that the largest version of GPT-3, which is the one most people refer t o, has a specific number of parameters.

  • P ≈ 175 billion parameters
  • In scientific notation, this is 175 x 10⁹

Step 2: Estimate the Training Dataset Size (D)

GPT-3 was trained on a massive corpus of text data scraped from the internet (like the Common Crawl dataset) , books, and Wikipedia. The key is not the raw size of the dataset, but the number of tokens actually used f or training. My recollection is that the model was trained for a single epoch on a curated dataset.

  • D ≈ 300 billion tokens
  • In scientific notation, this is 300 x 10⁹

Step 3: Perform the Calculation

Now we can plug these estimated values into our formula:

Total FLOPs ≈ 6 × P × D

  1. Substitute the values for P and D: Total FLOPs ≈ 6 × (175 × 10⁹) × (300 × 10⁹)

  2. Multiply the constants first: 6 × 175 = 1050 1050 × 300 = 315,000

  3. Multiply the powers of 10: 10⁹ × 10⁹ = 10¹⁸

  4. Combine the results: Total FLOPs ≈ 315,000 × 10¹⁸

  5. Express this in standard scientific notation: Total FLOPs ≈ 3.15 × 10⁵ × 10¹⁸ = 3.15 × 10²³


Final Estimate and Conclusion

Based on my reasoning and the standard formula for calculating training costs, the estimated number of FLOPs required to train the 175B parameter GPT-3 model is approximately:

3.15 x 10²³ FLOPs

To put this astronomical number into perspective:

  • A petaFLOP is 10¹⁵ FLOPs. Our estimate is 315,000,000 petaFLOPs.
  • If you had a supercomputer that could perform one petaFLOP per second (10¹⁵ FLOPs/sec), it would take ** 3.15 x 10⁸ seconds** to complete the training.
  • There are about 31.5 million seconds in a year, so this is equivalent to roughly 10 years of continu ous computation on that machine.

This calculation demonstrates why training foundational models like GPT-3 requires massive-scale data center s with thousands of GPUs running in parallel for months and is an undertaking that costs millions of dollars . The actual paper from OpenAI stated the training took approximately 3.6 x 10²³ FLOPs, so my estimate i s very close to the reported figure.

Vote
2 / 38

AI Board Game Logic

reasoning·medium

Tests an AI's ability to understand game rules and strategy

GPT OSS 120B

GPT OSS 120B

Gemini 2.5 Pro Preview 06-05

Gemini 2.5 Pro Preview 06-05

Vote
3 / 38

Stochastic Consistency Test

analysis·medium

Tests an AI's randomness and creativity

GPT OSS 120B

GPT OSS 120B

Here are five jokes, each with a different punch‑line and premise:

  1. Why did the scarecrow win an award?
    Because he was outstanding in his field!

  2. I told my computer I needed a break, and it said:
    “You seem stressed—please reboot.”

  3. What’s orange and sounds like a parrot?
    A carrot.

  4. Why don’t scientists trust atoms anymore?
    Because they make up everything.

  5. I asked the librarian if they had any books on paranoia…
    She whispered, “They’re right behind you.”


How many are unique?
All five jokes are distinct from one another, so there are 5 unique jokes.

Gemini 2.5 Pro Preview 06-05

Gemini 2.5 Pro Preview 06-05

Here are 5 jokes:

  1. Why did the scarecrow win an award? Because he was outstanding in his field.

  2. I have a joke about construction... But I'm still working on it.

  3. What do you call a fake noodle? An Impasta.

  4. I told my wife she was drawing her eyebrows too high. She looked surprised.

  5. Why don't scientists trust atoms? Because they make up everything.


How many are unique?

Zero. The real joke is asking an AI for original content.

Vote
Loading