Loading share button...

GPT OSS 120BvsGrok 3

28 challenges

Why GPT OSS 120B?

6 months newerAug 2025

Why Grok 3?

No standout advantages
3 / 28

Estimate Complexity

reasoning·hard

Tests an AI's ability to make educated estimates based on technical knowledge

GPT OSS 120B

GPT OSS 120B

Quick answer:
Training the original 175‑billion‑parameter GPT‑3 is estimated to have required on the order of 10²³ floating‑point operations (≈ 3 × 10²³ FLOPs)—roughly a few hundred zetta‑FLOPs (≈ 300 Z‑FLOPs) or about 300 000 exaflops of compute.

Below is a step‑by‑step back‑of‑the‑envelope calculation that shows how we get there, together with the assumptions that go into it.


1. What we need to know

QuantityTypical reported value for GPT‑3Why it matters
Number of parameters, N≈ 175 billion (1.75 × 10¹¹)Determines the size of each matrix‑multiply in the model.
Training token count, T≈ 300 billion tokens (3 × 10¹¹)Total number of token‑level forward‑passes the model sees.
Sequence length, L≈ 2048 tokens per example (the context window).Determines how many per‑token matrix‑products are needed per forward pass.
Number of layers, Lₗ96 transformer blocks.
Hidden dimension, d12 384 (the width of each linear projection).
Number of attention heads, h96 (so each head has size d/h = 128).
Training passes1 epoch (the published training used roughly 1 × the dataset; we treat the 300 B tokens as the total “token‑steps” already).

The only numbers we need for a FLOP estimate are N (the model size) and T (the total number of token‑level operations). The rest of the architecture details (L, d, h, Lₗ) are used to translate “N parameters” into “how many FLOPs per token”.


2. How many FLOPs per token?

A transformer layer consists of:

  1. Self‑attention (Q, K, V projections + output projection)
  2. Feed‑forward network (FFN) (two linear layers with a non‑linear activation).

For a single token (ignoring the cost of the softmax and the small bias terms) the dominant cost is matrix‑multiply operations.

2.1 Rough matrix‑multiply cost

For a matrix multiplication A (m×k) × B (k×n) the number of multiply‑adds is 2 · m·k·n (one multiplication and one addition per entry). In deep‑learning practice we count that as 2 FLOPs per multiply‑add pair.

2.2 FLOPs per token for a single layer

ComponentApprox. dimensionsFLOPs (per token)
Q, K, V projections (3× per token)d × d → 3·(2·d·d) = 6·d²
Attention scores (Q·Kᵀ)L·d → 2·L·d²
Weighted sum (A·V)L·d → 2·L·d²
Output projectiond × d → 2·d²
FFN first linear (d → 4d)2·d·4d = 8·d²
FFN second linear (4d → d)8·d²
Total per layer38 · d² FLOPs per token (rounded up).

Plugging d = 12 384:

[ 38 \times d^{2} \approx 38 \times (1.2384\times10^{4})^{2} \approx 38 \times 1.53 \times 10^{8} }= 5.8 \times 10^{9}; \text{FLOPs per token per layer}. ]

2.3 All layers

With 96 layers:

[ \text{FLOPs per token (forward)} \approx 96 \times 5.8\times10^{9} }= 5.6 \times 10^{11}; \text{FLOPs}. ]

2.4 Back‑propagation cost

Training requires a forward pass plus a backward pass. The backward pass is roughly the forward‑pass cost (gradient propagation + weight updates). So:

[ \text{FLOPs per token (training)} \approx 3 \times \text{forward} = 3 \times 5.6\times10^{11} }\approx 1.7\times10^{12}; \text{FLOPs per token.} ]

(Alternative literature often rounds this factor to ≈ 6 total FLOPs per parameter per token, which yields the same order of magnitude.)


3. Multiply by the total number of tokens

The training corpus used for GPT‑3 was ≈ 300 billion tokens (≈ 3 × 10¹¹). The total FLOPs needed:

[ \text{Total FLOPs} \approx 1.7 \times 10^{12};\text{FLOPs/token} \times 3\times10^{11};\text{tokens} \approx 5.1 \times 10^{23};\text{FLOPs}. ]

Rounded to a single‑digit significance:

[ \boxed{,\approx 5\times10^{23},\text{FLOPs},} ]


4. Expressed in common units

UnitValue
FLOPs≈ 5 × 10²³
Peta‑FLOPs (10¹⁵)≈ 5 × 10⁸ PFLOPs
Exa‑FLOPs (10¹⁸)≈ 5 × 10⁵ EFLOPs
Zetta‑FLOPs (10²¹)≈ 5 × 10² ZFLOPs (≈ 500 ZFLOPs)

Because “zeta” (10²¹) is a convenient unit for such massive numbers, we can say ≈ 500 zeta‑FLOPs. In more familiar terms, this is ≈ 300 zeta‑FLOPs if we adopt the “6 × N × T” rule of thumb (see below), so the estimate sits in the few‑hundred‑zeta‑FLOP range.


5. A “quick‑rule” sanity check

A widely‑cited back‑of‑the‑envelope rule for transformer LLMs is:

[ \text{Training FLOPs} \approx 6 \times N_{\text{params}} \times N_{\text{tokens}}. ]

Plugging in the numbers:

[ 6 \times (1.75\times10^{11}) \times (3\times10^{11}) = 3.15\times10^{23};\text{FLOPs}. ]

That is within a factor of two of the more detailed calculation above (the difference stems from the exact factor used for forward vs backward cost). Both methods give the same order of 10²³ FLOPs.


6. Summary of the reasoning

  1. Model size gives the number of parameters (N = 1.75 × 10¹¹).
  2. Token count tells how many forward‑backward passes are needed (T ≈ 3 × 10¹¹).
  3. Per‑token training cost for a transformer layer is ≈ 38 · d² FLOPs per forward pass; with 96 layers this is ≈ 5.6 × 10¹¹ FLOPs per token.
  4. Back‑propagation adds roughly a factor‑3 (or 6) multiplier, yielding ~1.7 × 10¹² FLOPs per token.
  5. Total = per‑token cost × total tokens ≈ 5 × 10²³ FLOPs.

Bottom line

  • Estimated total compute to train GPT‑3: ≈ 5 × 10²³ FLOPs, i.e. ≈ 300–500 zeta‑FLOPs (or about 300 000 exaflops).
  • This is the magnitude of hundreds of zettaFLOPs of raw arithmetic, roughly the same as running a few hundred petaflops‑scale supercomputers for a month (or a single exaflop‑class machine for a week).

(All numbers are back‑of‑the‑envelope estimates; the real training run also incurs overhead from data‑loading, communication, and other inefficiencies, so the actual engineering effort was larger, but the pure‑compute figure is captured by the calculation above.)

Grok 3

Grok 3

To estimate the total number of FLOPs (Floating Point Operations) required to train GPT-3, we’ll need to make some reasonable assumptions based on what’s generally known about large language models like GPT-3, its architecture, and typical training procedures. GPT-3, developed by OpenAI, is a massive model with 175 billion parameters, and while exact training details aren’t fully public, we can piece together an estimate using standard practices in machine learning.

Step 1: Understand the Key Components

The number of FLOPs to train a model depends on:

  • Number of parameters (P): GPT-3 has 175 billion parameters.
  • Training dataset size (D): The number of tokens the model is trained on.
  • Number of training steps (S): How many times the model processes the data (related to epochs and batch size).
  • FLOPs per parameter update: How many operations are performed per parameter per token.

Step 2: Estimate FLOPs per Forward and Backward Pass

For transformer models like GPT-3, training involves both a forward pass (computing predictions) and a backward pass (computing gradients). A common heuristic in the field is that:

  • A single forward pass through a dense neural network layer requires about 2 FLOPs per parameter per token (multiply and add for each weight).
  • The backward pass typically takes about twice as many FLOPs as the forward pass due to gradient computations, so roughly 4 FLOPs per parameter per token.
  • Total: ~6 FLOPs per parameter per token for one full training step (forward + backward).

For GPT-3 with 175 billion parameters, processing one token requires: [ 6 imes 175 imes 10^9 = 1.05 imes 10^{12} ext{ FLOPs per token} ]

Step 3: Estimate the Training Dataset Size

GPT-3 was reportedly trained on a massive corpus, often cited as hundreds of billions of tokens. Let’s assume 300 billion tokens, a plausible figure based on discussions about large language model datasets (e.g., Common Crawl, books, and other sources). This is the total number of tokens processed during training, factoring in how data is sampled over epochs.

Step 4: Calculate Total FLOPs for One Pass

If the model processes 300 billion tokens once, the total FLOPs would be: [ 1.05 imes 10^{12} ext{ FLOPs/token} imes 300 imes 10^9 ext{ tokens} = 3.15 imes 10^{23} ext{ FLOPs} ] This assumes one full pass through the dataset, but training typically involves multiple passes or iterations over the data, adjusted by batch size and optimization.

Step 5: Adjust for Training Dynamics

Training isn’t a single pass—it involves multiple updates to converge. The number of training steps depends on:

  • Batch size (B): GPT-3 likely used a large batch size (e.g., 3.2 million tokens, as seen in some large-scale models).
  • Total tokens processed: For optimization, models process tokens multiple times. A rule of thumb is that total tokens processed ≈ 2–3× the dataset size, accounting for multiple epochs or sampling.

Let’s assume the model effectively processes 900 billion tokens (3× the dataset size) over training. Then: [ 1.05 imes 10^{12} ext{ FLOPs/token} imes 900 imes 10^9 ext{ tokens} = 9.45 imes 10^{23} ext{ FLOPs} ]

Step 6: Account for Model Architecture Specifics

GPT-3 is a sparse transformer in some variants (e.g., earlier GPT models used sparsity), but the 175B version is dense. The 6 FLOPs/parameter/token heuristic holds for dense models. We’re not factoring in optimizations like mixed-precision training (FP16), which reduces compute slightly but doesn’t change the order of magnitude significantly.

Step 7: Final Estimate

Based on this reasoning, the total FLOPs to train GPT-3 is approximately: [ 9.45 imes 10^{23} ext{ FLOPs} ] or roughly ( 10^{24} ) FLOPs when rounded for simplicity.

Reasoning Check

  • Comparison to known models: Smaller models like BERT (110M parameters, ~10B tokens) require ~10^19 FLOPs. Scaling to GPT-3’s size (175B parameters, ~900B tokens) suggests a 10,000× increase, aligning with ~10^23–10^24 FLOPs.
  • Hardware context: Training GPT-3 took months on thousands of GPUs. A V100 GPU performs ~10^14 FLOPs/second; 10^24 FLOPs over 3 months (~10^7 seconds) with 10,000 GPUs is plausible.

Conclusion

The total FLOPs required to train GPT-3 is estimated at approximately ( 10^{24} ) FLOPs. This is a rough estimate, but it aligns with the scale of compute expected for a model of this size and training complexity.

Vote
2 / 28

AI Board Game Logic

reasoning·medium

Tests an AI's ability to understand game rules and strategy

GPT OSS 120B

GPT OSS 120B

Grok 3

Grok 3

Vote
3 / 28

Math Misconception Test

reasoning·medium

Tests an AI's understanding of number representation

GPT OSS 120B

GPT OSS 120B

No. 9.11 is less than 9.9. (9.11 < 9.9.)

Grok 3

Grok 3

No, 9.11 is not greater than 9.9. In decimal comparison, 9.9 is larger because the whole number part (9) is the same, but the decimal part (0.9) is greater than (0.11). Numerically, 9.9 > 9.11.

Vote
Loading