Rival

GPT-5vsOpenAI o3

46 challenges

Why GPT-5?

4.3x cheaper overall$1.25/M in · $10.00/M out

Leads 1 of 1 benchmarks

Stronger on SWE-bench Verified74.9% vs 69.1%

4 months newerAug 2025

Why OpenAI o3?

No standout advantages

3 / 46

Estimate Complexity

reasoning·hard

Tests an AI's ability to make educated estimates based on technical knowledge

GPT-5

Openai · Aug 2025text

Here’s a back‑of‑the‑envelope estimate using standard LM training-compute heuristics.

Assumptions

Model size (N): GPT-3 has about 175 billion parameters (N ≈ 1.75 × 10^11).
Training tokens (T): GPT-3 was trained on roughly 300 billion tokens (T ≈ 3.0 × 10^11).
FLOPs per token per parameter: A common rule of thumb for transformer LM training is FLOPs ≈ 6 × N × T The factor ~6 comes from counting forward and backward passes for the parameter-heavy matmuls (roughly 2N for forward + 4N for backward). It ignores small terms from layernorms, softmax, embeddings, and optimizer updates (which we’ll account for with a small overhead factor later).

Core calculation

6 × N × T = 6 × (1.75 × 10^11) × (3.0 × 10^11)
First multiply N × T: 1.75 × 3.0 = 5.25; powers: 10^11 × 10^11 = 10^22 So N × T = 5.25 × 10^22
Now multiply by 6: 6 × 5.25 × 10^22 = 31.5 × 10^22 = 3.15 × 10^23 FLOPs

Overheads and uncertainty

Non-matmul ops (attention softmax, masking, residuals, layernorms), embeddings, and optimizer update steps add some overhead. A 10–20% uplift is a reasonable allowance.
Applying +10–20%: 3.15 × 10^23 × (1.1 to 1.2) ≈ (3.5 ± 0.3) × 10^23 FLOPs.

Answer

Point estimate: ~3.15 × 10^23 FLOPs
With a modest overhead allowance: roughly (3.3–3.8) × 10^23 FLOPs

This is a coarse estimate; the exact number depends on details like sequence length, activation checkpointing, implementation efficiency, and the precise token count.

OpenAI o3

Openai · Apr 2025text

Below is a simple "back-of-the-envelope" estimate that uses only a few public facts about GPT-3 and very generic rules of thumb for counting operations.

Step 1. Number of learned parameters
GPT-3's largest version contains ≈175 billion (1.75 × 10¹¹) trainable parameters.

Step 2. How much data was seen during training
OpenAI reports that the model was trained on about 300 billion tokens (3 × 10¹¹).

Step 3. How many floating-point operations does one parameter cost per token?
A practical rule of thumb for Transformer language models is
Training FLOPs ≈ 6 × (# parameters) × (# training tokens)

Where does the factor 6 come from?

• Forward pass: every parameter is involved in one multiply-accumulate → 2 FLOPs
• Back-prop gradient computation: roughly another 2 FLOPs
• Parameter-update (optimizer) and miscellaneous overhead: ≈2 FLOPs
Total ≈6 FLOPs per parameter per token.

Step 4. Plug in the numbers
Total FLOPs = 6 × (1.75 × 10¹¹) × (3 × 10¹¹)
= 6 × 5.25 × 10²²
≈ 3.15 × 10²³ FLOPs.

Step 5. Express the answer in more familiar units
3.15 × 10²³ ≈ 315 zettaFLOPs (1 ZFLOP = 10²¹ FLOPs).

Step 6. Sanity check with hardware-time intuition
An NVIDIA A100 delivers ≈ 312 teraFLOP/s (0.312 PFLOP/s) in mixed precision.
Wall-clock time on a single A100:
3.15 × 10²³ / 3.12 × 10¹⁴ ≈ 1.0 × 10⁹ s ≈ 32 years.
Running on 1,000 A100s in parallel would take ≈12 days, which is the right ballpark for a large industrial training run.

Final estimate
Training GPT-3 (175 B parameters, 300 B tokens) requires on the order of 3 × 10²³ floating-point operations, i.e. roughly 300 zettaFLOPs.

Vote