Estimate Complexity
GPT-5.1's response to the 'Estimate Complexity' challenge.
Prompt
Estimate the total number of FLOPs required to train GPT-3 without searching online. Explain your reasoning step by step.
Response
The total training compute for GPT‑3 (175B) is on the order of \(10^{23}\) floating‑point operations. A reasonable point estimate is \(\sim 3\times 10^{23}\) FLOPs.
Here’s the step‑by‑step reasoning from basic principles.
1. Key known quantities
From the GPT‑3 paper (architecture details are well known):
- Number of parameters \(N \approx 175 \text{ billion} = 1.75\times 10^{11}\)
- Number of training tokens \(T \approx 300 \text{ billion} = 3\times 10^{11}\)
We’ll assume:
- Standard dense Transformer decoder-only model
- Standard training: forward pass + backward pass + optimizer step
- We want the total training FLOPs over all tokens and all steps.
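For concreteness, the short sketches further below treat these two quantities as plain constants. A minimal Python snippet, with the values assumed from the figures above (the variable names are just illustrative):

```python
# Assumed headline numbers for GPT-3 (175B), as listed above.
N_PARAMS = 1.75e11  # ~175 billion trainable parameters
N_TOKENS = 3.0e11   # ~300 billion training tokens
```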
2. FLOPs per parameter per token
For a dense neural network, each weight participates in one multiply-add each time it is used in the forward pass:
- 1 multiplication + 1 addition ≈ 2 FLOPs per weight usage.
For large Transformers, most compute is in matrix multiplications in the attention and MLP blocks; these are roughly proportional to the total number of parameters, so a common back-of-the-envelope is:
- Forward pass ≈ \(2N\) FLOPs per token
(each parameter used once in a multiply-add on average)
The backward pass is more expensive because you compute gradients with respect to both the activations and the weights:
- Backward pass is roughly 2× the forward pass cost in FLOPs
  → ≈ \(4N\) FLOPs per token
On top of that you have:
- Some extra overhead for:
- Nonlinearities, layer norms, softmax
- Optimizer updates (Adam, etc.)
These are smaller but non-negligible; a common heuristic is to lump them into a constant factor, giving:
\[ \text{FLOPs per token} \approx 6N \]
This “6” is a heuristic constant:
- \(2N\) (forward) \(+\) \(4N\) (backward) \(\approx 6N\), with optimizer updates and other overhead folded into the constant
Some authors use factors between ~3 and ~6; using 6 is a conservative, widely used estimate in scaling-law work.
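As a quick sanity check on this accounting, here is a minimal Python sketch of the per-token heuristic. The function name and factor arguments are illustrative only, assuming the \(2N\)-forward / \(4N\)-backward split described above:

```python
def flops_per_token(n_params: float,
                    forward_factor: float = 2.0,
                    backward_factor: float = 4.0) -> float:
    """Heuristic training FLOPs per token for a dense Transformer.

    forward_factor=2 counts one multiply-add per parameter in the forward
    pass; backward_factor=4 assumes the backward pass costs roughly twice
    the forward pass. Optimizer steps and other overheads are treated as
    absorbed into these constants.
    """
    return (forward_factor + backward_factor) * n_params


N_PARAMS = 1.75e11  # assumed: ~175B parameters (GPT-3)
print(f"{flops_per_token(N_PARAMS):.2e} FLOPs per token")  # -> 1.05e+12
```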
3. Total FLOPs formula
Total training FLOPs:
\[ \text{FLOPs}_{\text{total}} \approx (\text{FLOPs per token}) \times (\text{number of tokens}) \]
Using the heuristic:
\[ \text{FLOPs}_{\text{total}} \approx 6\,N\,T \]
Plug in GPT‑3’s values:
- \(N = 1.75\times 10^{11}\)
- \(T = 3\times 10^{11}\)
So:
\[
\begin{aligned}
\text{FLOPs}_{\text{total}} &\approx 6 \times (1.75\times 10^{11}) \times (3\times 10^{11}) \\
&= 6 \times 1.75 \times 3 \times 10^{22} \\
&= 6 \times 5.25 \times 10^{22} \\
&= 31.5 \times 10^{22} \\
&= 3.15 \times 10^{23} \text{ FLOPs}
\end{aligned}
\]
So a good estimate is:
\[ \boxed{\text{GPT‑3 (175B) training compute} \;\approx\; 3\times 10^{23}\ \text{FLOPs}} \]
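The same arithmetic, written out as a few lines of Python. This is a sketch under the assumptions above, not a measurement of the actual training run:

```python
N_PARAMS = 1.75e11         # assumed: ~175B parameters
N_TOKENS = 3.0e11          # assumed: ~300B training tokens
FLOPS_PER_PARAM_TOKEN = 6  # heuristic: 2 (forward) + 4 (backward)

total_flops = FLOPS_PER_PARAM_TOKEN * N_PARAMS * N_TOKENS
print(f"Estimated GPT-3 training compute: {total_flops:.2e} FLOPs")
# -> Estimated GPT-3 training compute: 3.15e+23 FLOPs
```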
4. Sense check (order of magnitude)
Even if our constant factor is off (say it’s 3 instead of 6, or 8 instead of 6), we’re still in the same ballpark:
- With factor 3: \(\sim 1.6\times 10^{23}\) FLOPs
- With factor 8: \(\sim 4.2\times 10^{23}\) FLOPs
So the order of magnitude, \(10^{23}\) FLOPs, is robust, and \(\sim 3\times 10^{23}\) FLOPs is a reasonable central estimate.
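The same robustness check, as a quick sweep over the constant factor. Again just a sketch; the factor values are the ones considered above:

```python
N_PARAMS = 1.75e11  # assumed: ~175B parameters
N_TOKENS = 3.0e11   # assumed: ~300B training tokens

# Compare per-parameter-per-token FLOP factors from the discussion above.
for factor in (3, 6, 8):
    total = factor * N_PARAMS * N_TOKENS
    print(f"factor {factor}: {total:.2e} FLOPs")
# All three land at ~10^23 FLOPs, i.e. the same order of magnitude.
```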
About GPT-5.1
Specifications
- Provider: OpenAI
- Released: 2025-11-13
- Size: XLARGE
- Context: 400,000 tokens