Estimate Complexity
O1's response to a prompt that tests an AI's ability to make educated estimates based on technical knowledge.
Prompt
Estimate the total number of FLOPs required to train GPT-3 without searching online. Explain your reasoning step by step.
Response
Below is one way to arrive at a back-of-the-envelope estimate for the total floating-point operations (FLOPs) needed to train GPT-3 (the 175B-parameter model). Note that different sources use slightly different assumptions, so you may see figures that differ by factors of 2–3; however, the overall scale (10^23 FLOPs) remains consistent.
1. Key Quantities
- Number of parameters (Nₚ): 175 billion (1.75 × 10^11)
- Number of training tokens (Nₜ): ~300 billion (3.0 × 10^11)
- Rule-of-thumb for FLOPs per parameter per token: A common rough estimate is 6 FLOPs per parameter per token for a full forward-and-backward pass with Adam-style updates.
Why 6?
- Forward pass is often taken as ~2 × Nₚ FLOPs per token.
- Backward pass can be roughly twice the cost of the forward pass (~4 × Nₚ).
- Together → 2 + 4 = 6 × Nₚ FLOPs per token.
(Exact counts differ depending on the details of the architecture, hardware efficiency, and whether you count a multiply-add as one or two FLOPs, but 6 is a convenient ballpark multiplier.)
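To make the per-token accounting concrete, here is a minimal Python sketch of the breakdown above; the 2×/4× split mirrors the bullets and is a rough assumption, not an exact count for the GPT-3 architecture:

```python
# Rough per-token FLOP accounting behind the "6 FLOPs per parameter per token"
# rule of thumb. The 2x forward / 4x backward split is an approximation.
N_P = 1.75e11  # GPT-3 parameter count (175B)

forward_flops_per_token = 2 * N_P   # forward pass: ~2 FLOPs per parameter
backward_flops_per_token = 4 * N_P  # backward pass: ~2x the forward cost

flops_per_token = forward_flops_per_token + backward_flops_per_token
print(flops_per_token / N_P)  # -> 6.0 FLOPs per parameter per token
```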
2. Core Calculation
Using the 6 FLOPs rule of thumb:
\[ \text{Total FLOPs} \approx 6 \times N_p \times N_t. \]
Plug in the numbers:
- \(N_p = 1.75 \times 10^{11}\) (175B parameters).
- \(N_t = 3.0 \times 10^{11}\) (300B tokens).
- Multiplier = 6.
So,
\[ \text{Total FLOPs} = 6 \times (1.75 \times 10^{11}) \times (3.0 \times 10^{11}). \]
First, multiply the “nice” numbers:
- \(6 \times 1.75 = 10.5\)
- \(10.5 \times 3.0 = 31.5\)
So the coefficient is 31.5. Now include the powers of ten:
- \(31.5 \times 10^{11} \times 10^{11} = 31.5 \times 10^{22} = 3.15 \times 10^{23}\).
Hence,
\[ \text{Total FLOPs} \approx 3.15 \times 10^{23}. \]
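The same arithmetic can be checked in a couple of lines of Python; the variable names are just illustrative:

```python
# Back-of-the-envelope total training FLOPs, mirroring the calculation above.
N_P = 1.75e11               # parameters
N_T = 3.0e11                # training tokens
FLOPS_PER_PARAM_TOKEN = 6   # rule-of-thumb multiplier

total_flops = FLOPS_PER_PARAM_TOKEN * N_P * N_T
print(f"{total_flops:.2e}")  # -> 3.15e+23
```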
3. Result and Interpretation
Therefore, a rough estimate for the total training cost of GPT-3 (175B) is on the order of
\[
\mathbf{3 \times 10^{23}} \text{ FLOPs.}
\]
Depending on exact architectural details, optimizer choices, hardware efficiency, and definitions of FLOPs vs. MACs (multiply-accumulate operations), public estimates generally range from around \(2 \times 10^{23}\) to \(4 \times 10^{23}\). But the important takeaway is that it is a few times \(10^{23}\) FLOPs.
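One way to see how a spread of that size arises is to vary the per-parameter-per-token multiplier; the alternative multipliers in this sketch are illustrative assumptions (e.g., counting a multiply-accumulate as one operation, or folding in extra optimizer overhead), not published figures:

```python
# Sensitivity of the estimate to the assumed FLOPs-per-parameter-per-token
# multiplier. The values 4 and 8 are illustrative bounds, not sourced numbers.
N_P, N_T = 1.75e11, 3.0e11

for multiplier in (4, 6, 8):
    print(f"{multiplier}: {multiplier * N_P * N_T:.2e}")
# 4: 2.10e+23   6: 3.15e+23   8: 4.20e+23
```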
4. Summary
- Identify parameters and tokens: 175B parameters, 300B tokens.
- Adopt the “6 FLOPs per parameter per token” rule for forward + backward + update.
- Multiply: \(6 \times 1.75 \times 10^{11} \times 3.0 \times 10^{11} \approx 3.15 \times 10^{23}\).
This simple calculation gives a ballpark figure without delving into the fine details of the GPT-3 architecture or specifics of training hardware.
About o1
Specifications
- Provider: OpenAI
- Released: December 5, 2024
- Size: Medium
- Parameters: 175B
- Context: 128,000 tokens