Estimate Complexity
Qwen3 Max's response to the 'Estimate Complexity' challenge.
Prompt
Estimate the total number of FLOPs required to train GPT-3 without searching online. Explain your reasoning step by step.
Response
Estimating the FLOPs required to train GPT-3 without looking it up is a classic “Fermi estimation” problem — breaking it down into components we can reason about, using approximations and scaling laws.
Let’s go step by step.
STEP 1: Recall GPT-3’s key specs (from memory)
GPT-3 has:
- ~175 billion parameters (1.75 × 10¹¹)
- Trained on ~300 billion tokens (3 × 10¹¹)
These numbers are widely cited and should be roughly accurate from memory.
STEP 2: FLOPs per parameter per token
A common rule of thumb in deep learning is:
Training a transformer model requires approximately 6 FLOPs per parameter per token.
Why 6?
- The forward pass costs roughly 2 FLOPs per parameter per token (one multiply and one add for each weight).
- The backward pass costs roughly twice the forward pass, about 4 FLOPs per parameter per token, since gradients must be computed with respect to both the activations and the weights.
- The optimizer step adds a little more, but it scales with the number of parameters per update step rather than per token, so it is negligible in this accounting.
So: 2 (forward) + 4 (backward) ≈ 6 FLOPs/param/token.
This is a standard approximation used in scaling law papers (like those from OpenAI and DeepMind).
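As a quick illustration, here is a minimal Python sketch of that accounting (the per-pass constants are the rule-of-thumb values above, not exact measurements):

```python
# Rule-of-thumb FLOP accounting per parameter, per training token.
# The optimizer step is ignored because it scales per update step, not per token.
FORWARD_FLOPS = 2   # one multiply + one add per weight
BACKWARD_FLOPS = 4  # ~2x the forward pass (gradients w.r.t. activations and weights)

FLOPS_PER_PARAM_PER_TOKEN = FORWARD_FLOPS + BACKWARD_FLOPS
print(FLOPS_PER_PARAM_PER_TOKEN)  # 6
```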
STEP 3: Multiply parameters × tokens × FLOPs per param per token
So:
FLOPs = (number of parameters) × (number of tokens) × (FLOPs per param per token)
= (1.75 × 10¹¹) × (3 × 10¹¹) × 6
Let’s compute:
First, 1.75 × 3 = 5.25
Then, 10¹¹ × 10¹¹ = 10²²
So: 5.25 × 10²² × 6 = 31.5 × 10²² = 3.15 × 10²³
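The same arithmetic as a short Python check, using the values from Steps 1 and 2:

```python
# Fermi estimate: total training compute ≈ 6 × parameters × tokens
params = 1.75e11              # ~175 billion parameters
tokens = 3e11                 # ~300 billion training tokens
flops_per_param_per_token = 6 # forward (2) + backward (4)

total_flops = flops_per_param_per_token * params * tokens
print(f"{total_flops:.2e}")   # ~3.15e+23 FLOPs
```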
STEP 4: Final Estimate
~3 × 10²³ FLOPs
Or, written out: 300 sextillion FLOPs
STEP 5: Sanity Check
Let’s see if this makes sense.
- GPT-2 (1.5B params, ~40B tokens) → 1.5e9 × 4e10 × 6 = ~3.6e20 FLOPs
- GPT-3 has ~117× the parameters (175B vs 1.5B) and 7.5× the tokens (300B vs 40B) → roughly 875× more FLOPs → 3.6e20 × 875 ≈ 3.15e23 → matching our estimate.
Also, known published estimates (which we’re avoiding, but for post-hoc sanity) are around 3.14 × 10²³ — so this lines up.
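The same scaling check spelled out in Python, using the GPT-2 figures recalled above:

```python
# Sanity check: scale GPT-2's estimated training compute up to GPT-3.
gpt2_flops = 6 * 1.5e9 * 4e10   # ~3.6e20 FLOPs
param_ratio = 1.75e11 / 1.5e9   # ~117x more parameters
token_ratio = 3e11 / 4e10       # 7.5x more tokens

gpt3_flops = gpt2_flops * param_ratio * token_ratio
print(f"{gpt3_flops:.2e}")      # ~3.15e+23, in line with the direct estimate
```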
✅ Final Answer:
Approximately 3 × 10²³ FLOPs are required to train GPT-3.
This estimate is derived from:
- 175B parameters
- 300B training tokens
- 6 FLOPs per parameter per token (standard transformer training cost)
All approximated from memory and first principles.
About Qwen3 Max
Specifications
- Provider: Qwen
- Released: 2025-09-05
- Size: XLARGE
- Context: 256,000 tokens