Estimate Complexity
Horizon Alpha's response to the 'Estimate Complexity' challenge.
Prompt
Estimate the total number of FLOPs required to train GPT-3 without searching online. Explain your reasoning step by step.
Response
Here’s a back‑of‑the‑envelope estimate using standard scaling heuristics and typical training setups for GPT‑3–class models.
Assumptions
- Model size: GPT‑3 175B parameters (N ≈ 1.75×10^11).
- Training tokens: the commonly cited training regime for GPT‑3 is about 300 billion tokens (T ≈ 3×10^11).
- FLOPs per token for a decoder‑only Transformer: a widely used rule of thumb is that the forward pass costs about 2N FLOPs per token and the combined forward+backward pass about 6N FLOPs per token (this folds in attention and MLP costs, plus constants). Architectural details and implementation efficiency shift the effective constant somewhat, so ~6N per token is a practical mid‑range estimate.
Step‑by‑step
- FLOPs per token: FLOPs_per_token ≈ c × N. Take c ≈ 6 as a practical constant for the combined forward and backward pass.
- Total FLOPs: Total_FLOPs ≈ T × FLOPs_per_token ≈ T × (c × N).
- Plug in the numbers (a minimal arithmetic sketch follows below): with N = 1.75×10^11, T = 3×10^11, and c ≈ 6, Total_FLOPs ≈ 3×10^11 × 6 × 1.75×10^11 = (3 × 6 × 1.75) × 10^(11+11) ≈ 31.5 × 10^22 ≈ 3.15×10^23 FLOPs.
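As a minimal sketch of the arithmetic above (assuming only the C ≈ c × N × T approximation; the variable names are illustrative), the central estimate can be reproduced in a few lines of Python:

```python
# Back-of-the-envelope training compute: C ≈ c * N * T
# Values below are the stated assumptions, not measured quantities.
N = 1.75e11  # parameters (GPT-3 175B)
T = 3e11     # training tokens (~300B)
c = 6        # FLOPs per parameter per token, forward + backward

total_flops = c * N * T
print(f"Central estimate: {total_flops:.2e} FLOPs")  # ~3.15e+23
```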
Sanity checks and range
- If we vary c between about 3 and 10 (to capture implementation differences and attention/MLP constants), we get a low end of 3×10^11 × 3 × 1.75×10^11 ≈ 1.58×10^23 FLOPs and a high end of 3×10^11 × 10 × 1.75×10^11 ≈ 5.25×10^23 FLOPs (a short sweep over c is sketched after this list).
- Many public ballparks for GPT‑3‑scale training land in the 1e23–1e24 FLOPs range. Our central estimate of ~3×10^23 FLOPs sits comfortably in that band.
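The same sketch can be extended to the sensitivity check, sweeping the assumed constant c over the range discussed above:

```python
# Sensitivity of the estimate to the per-token FLOP constant c
N = 1.75e11  # parameters
T = 3e11     # training tokens

for c in (3, 6, 10):
    print(f"c = {c:2d}: {c * N * T:.2e} FLOPs")
# Expected output (approximately):
#   c =  3: ~1.58e+23 FLOPs (low end)
#   c =  6: ~3.15e+23 FLOPs (central)
#   c = 10: ~5.25e+23 FLOPs (high end)
```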
Conclusion
A reasonable estimate for the total compute required to train GPT‑3 (175B) on ~300B tokens is on the order of a few × 10^23 FLOPs, with a central estimate around 3×10^23 FLOPs.
About Horizon Alpha
Specifications
- Provider: OpenRouter
- Released: 2025-07-30
- Size: LARGE
- Context: 256,000 tokens