Estimate Complexity
Kimi K2's response to the 'Estimate Complexity' challenge.
Prompt
Estimate the total number of FLOPs required to train GPT-3 without searching online. Explain your reasoning step by step.
Response
Here is one way to “Fermi-estimate” the training compute for GPT-3 without looking up any numbers. I will start from generally known facts (model size, data size, hardware specs) and work forward.
Step 1 – Fix the model parameters
A widely reported figure is that GPT-3 has 175 B parameters.
N ≈ 1.75 × 10¹¹.
Step 2 – Fix the training-token budget
The GPT-3 paper reports training on roughly 300 B tokens.
T ≈ 3 × 10¹¹.
Step 3 – Recall the per-token FLOP rule for dense transformers
Every time one token is processed, the forward pass does roughly
2 FLOPs/parameter (multiply-accumulate counts as 2).
The backward pass does about twice the work of the forward pass, so the
total is ≈ 3× the forward cost, i.e. about 3 · 2N = 6N FLOPs per token.
The “textbook” estimate for the full run is therefore
FLOPs ≈ 6 · N · T.
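As a quick sanity check, the rule can be written as a tiny helper (a minimal sketch; the function name is mine, not from any library):

```python
def approx_training_flops(n_params: float, n_tokens: float) -> float:
    """Textbook dense-transformer training-compute estimate:
    ~6 FLOPs per parameter per token (2 forward + ~4 backward)."""
    return 6 * n_params * n_tokens
```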
Step 4 – Plug the numbers in
FLOPs = 6 · (1.75 × 10¹¹) · (3 × 10¹¹)
= 6 · 5.25 × 10²²
≈ 3.15 × 10²³ FLOPs.
Step 5 – Add a safety factor for real-world overheads
The 6 · N · T figure counts only the mathematically necessary (“useful”) FLOPs.
In practice, activation checkpointing forces some recomputation, and large-scale
jobs sustain only 30–50 % of theoretical peak because of communication and
stragglers, so the hardware compute actually consumed is larger. A factor of 2
covers this nicely:
3.15 × 10²³ × 2 ≈ 6 × 10²³ FLOPs.
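The arithmetic in Steps 4 and 5 can be checked with a few lines of Python (a standalone sketch; the 2× overhead factor is the rough assumption from Step 5, not a measured value):

```python
N = 1.75e11               # parameters (Step 1)
T = 3e11                  # training tokens (Step 2)

base = 6 * N * T          # textbook 6 * N * T estimate
with_overhead = 2 * base  # assumed factor of 2 for recomputation / imperfect utilization

print(f"base estimate: {base:.2e} FLOPs")           # ~3.15e+23
print(f"with overhead: {with_overhead:.2e} FLOPs")  # ~6.30e+23
```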
Final estimate
Training GPT-3 required on the order of 3–6 × 10²³ floating-point operations.
About Kimi K2
Specifications
- Provider: Openrouter
- Released: 2025-07-11
- Size: XLARGE