Estimate Complexity
Horizon Beta's response to the 'Estimate Complexity' challenge.
Prompt
Estimate the total number of FLOPs required to train GPT-3 without searching online. Explain your reasoning step by step.
Response
Here’s a back-of-the-envelope estimate for the total training compute (in FLOPs) for GPT-3, using standard scaling heuristics and reasonable assumptions.
Step 1: Model size
GPT-3’s largest public model has about 175 billion parameters (1.75e11).
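For intuition on where a number of that size comes from, here is a minimal Python sketch using commonly cited GPT-3 architecture values (96 layers, hidden size 12288, ~50k BPE vocabulary); these figures are assumptions for illustration, not something stated in the estimate above.

```python
# Rough parameter count for a dense decoder-only Transformer.
# Layer count, hidden size, and vocabulary size are assumed values
# (the commonly cited GPT-3 175B configuration), used only for illustration.
n_layers = 96
d_model = 12288
vocab_size = 50257

# Per layer: ~4 * d_model^2 for attention (Q, K, V, output projections)
# and ~8 * d_model^2 for the MLP (two d_model x 4*d_model matrices).
transformer_params = n_layers * 12 * d_model ** 2
embedding_params = vocab_size * d_model

total_params = transformer_params + embedding_params
print(f"{total_params:.2e}")  # ~1.75e+11
```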
Step 2: Training tokens
OpenAI reported training on roughly 300 billion tokens for GPT-3-scale models. We’ll use 3.0e11 tokens.
Step 3: FLOPs per parameter per token
A common rule of thumb for dense Transformer training is that the total training FLOPs is on the order of:

FLOPs ≈ 6 × N_params × N_tokens

This comes from accounting for forward and backward passes over all layers, including attention and MLPs, with constant factors absorbed into the 6× heuristic. Some estimates range from ~3× to ~12× depending on architecture details and precision; 6× is a reasonable midpoint.
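As a brief sketch of where the 6× constant comes from (the standard accounting, stated here informally): the forward pass costs roughly 2 FLOPs per parameter per token (one multiply and one add per weight), and the backward pass costs about twice the forward pass, since gradients are needed with respect to both activations and weights. In symbols:

$$
C_{\text{train}} \;\approx\; \underbrace{2\,N_{\text{params}}\,N_{\text{tokens}}}_{\text{forward}} \;+\; \underbrace{4\,N_{\text{params}}\,N_{\text{tokens}}}_{\text{backward}} \;=\; 6\,N_{\text{params}}\,N_{\text{tokens}}.
$$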
Step 4: Compute the total

N_params = 1.75e11
N_tokens = 3.0e11

FLOPs ≈ 6 × 1.75e11 × 3.0e11 = 6 × 5.25e22 = 3.15e23 FLOPs
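A minimal Python check of the arithmetic above, using only the values from Steps 1, 2, and 3:

```python
# Point estimate from the 6 * N_params * N_tokens heuristic.
n_params = 1.75e11           # Step 1: GPT-3 parameter count
n_tokens = 3.0e11            # Step 2: training tokens
flops_per_param_token = 6    # Step 3: heuristic constant

total_flops = flops_per_param_token * n_params * n_tokens
print(f"{total_flops:.2e} FLOPs")  # prints 3.15e+23
```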
Step 5: Sensitivity/uncertainty
- If the token count were lower (say 3e11 → 2e11), the estimate would scale down linearly (to about 2.1e23 FLOPs).
- If the constant factor were closer to 3× or 12×, the result would vary by roughly 2× in either direction (a sweep over both knobs is sketched after this list).
- Additional overhead (optimizer states, data/activation precision, mixed-precision efficiency, embedding layers, and short vs. long sequences) could shift the constant modestly but won’t change the order of magnitude.
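A small sketch of that sensitivity sweep, varying the heuristic constant and the token count (the specific grid values are assumptions chosen for illustration):

```python
# Sensitivity sweep over the heuristic constant and the token count.
N_PARAMS = 1.75e11  # held fixed at the GPT-3 parameter count

for constant in (3, 6, 12):            # plausible FLOPs-per-parameter-per-token factors
    for n_tokens in (2.0e11, 3.0e11):  # lower vs. reported-scale token counts
        flops = constant * N_PARAMS * n_tokens
        print(f"constant={constant:>2}x, tokens={n_tokens:.1e} -> {flops:.2e} FLOPs")
```

The extremes of this grid (about 1e23 and 6e23 FLOPs) bracket the range quoted in the conclusion.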
Conclusion
A reasonable estimate for total training compute for GPT-3 (175B) is on the order of 3e23 FLOPs, with a plausible range of roughly 1e23 to 6e23 FLOPs.
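As an optional sanity check, the estimate can be converted into accelerator-time; the throughput and utilization figures below are assumptions (roughly an A100-class GPU at mixed precision), not GPT-3's actual training hardware:

```python
# Convert the FLOP estimate into GPU-time under assumed throughput.
total_flops = 3.15e23          # point estimate from Step 4
peak_flops_per_sec = 3.12e14   # assumed ~312 TFLOP/s peak (A100-class, FP16/BF16)
utilization = 0.3              # assumed ~30% of peak sustained in practice

gpu_seconds = total_flops / (peak_flops_per_sec * utilization)
gpu_years = gpu_seconds / (3600 * 24 * 365)
print(f"~{gpu_years:.0f} GPU-years under these assumptions")  # ~107
```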
About Horizon Beta
Specifications
- Provider: Openrouter
- Released: 2025-08-01
- Size: LARGE
- Context: 256,000 tokens