Estimate Complexity
Sherlock Think Alpha's response to the 'Estimate Complexity' challenge.
Sherlock Think Alpha
November 16, 2025
Prompt
Estimate the total number of FLOPs required to train GPT-3 without searching online. Explain your reasoning step by step.
Response
Step-by-Step Reasoning
- Recall key specs of GPT-3:
  - GPT-3 has 175 billion parameters: \( N = 175 \times 10^9 = 1.75 \times 10^{11} \).
  - It was trained on approximately 300 billion tokens: \( D = 300 \times 10^9 = 3 \times 10^{11} \).
- Understand the FLOPs estimation formula for transformers:
  - For dense transformer models like GPT-3, the total training FLOPs \( C \) can be approximated by \( C \approx 6 \times N \times D \).
  - Breakdown of the factor of 6:
    - Forward pass: ~2N FLOPs per token (dominated by the matrix multiplications in the attention and FFN layers).
    - Backward pass (gradients): ~4N FLOPs per token.
    - Total: ~6N FLOPs per token.
  - This ignores minor overheads such as activations, optimizer steps (e.g., Adam), and data loading, which are small relative to the dominant compute (~95% from matmuls).
- Plug in the numbers (reproduced in the short sketch after this list):
  \[ C \approx 6 \times (1.75 \times 10^{11}) \times (3 \times 10^{11}) \]
  - First, \( 1.75 \times 3 = 5.25 \), so \( N \times D = 5.25 \times 10^{22} \).
  - Then, \( 6 \times 5.25 \times 10^{22} = 31.5 \times 10^{22} = 3.15 \times 10^{23} \).
- Caveats and precision:
  - This is a standard rule of thumb from papers like Kaplan et al. (2020) and Hoffmann et al. (Chinchilla, 2022), and it matches the commonly cited figure for GPT-3 (~\( 3.14 \times 10^{23} \) FLOPs).
  - The actual cost could vary by ±10-20% due to model specifics (e.g., sparse attention, batch size, or numerical precision such as FP16/BF16), but the estimate is robust for dense autoregressive transformers.
Total estimated FLOPs: \( 3.15 \times 10^{23} \) (or roughly \( 3 \times 10^{23} \)).
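As a quick sanity check, here is a minimal Python sketch of the \( C \approx 6ND \) rule using the parameter and token counts assumed above; the 0.8-1.2 band is purely illustrative of the ±10-20% caveat, not a measured figure.

```python
# Minimal sketch of the 6*N*D training-FLOPs rule of thumb.
# Assumptions (from the reasoning above): N = 175e9 parameters, D = 300e9 tokens.

N = 175e9  # model parameters
D = 300e9  # training tokens

flops_per_token_forward = 2 * N   # forward pass: ~2N FLOPs per token
flops_per_token_backward = 4 * N  # backward pass: ~4N FLOPs per token
flops_per_token = flops_per_token_forward + flops_per_token_backward  # ~6N

total_flops = flops_per_token * D  # C ~= 6 * N * D

print(f"Total training FLOPs: {total_flops:.3e}")  # ~3.150e+23
print(f"Illustrative +/-20% band: {0.8 * total_flops:.2e} to {1.2 * total_flops:.2e}")
```

The same two-line calculation applies to any dense autoregressive transformer once N and D are known; only the inputs change, not the 6ND structure.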