Estimate Complexity
MiniMax M2 Free's response to the 'Estimate Complexity' challenge.
MiniMax M2
October 29, 2025
Text
Prompt
Estimate the total number of FLOPs required to train GPT-3 without searching online. Explain your reasoning step by step.
Response
To estimate the total FLOPs required to train GPT-3 (175B parameters) on 300 billion tokens (one epoch), we break down the process step by step:
Step 1: Understand GPT-3's Specifications
- Parameters: 175 billion (175e9).
- Training Tokens: 300 billion (300e9).
- Context Length: 2048 tokens per sequence.
- Training Epochs: 1 (as per the paper).
Step 2: FLOPs for Forward Pass
- For a transformer model, the forward pass cost per token per parameter is approximately 2 FLOPs (one multiplication and one addition in matrix operations).
- Total FLOPs for forward pass:
  \[
  \text{Total FLOPs}_{\text{forward}} = \text{Tokens} \times \text{Parameters} \times 2 = 300 \times 10^{9} \times 175 \times 10^{9} \times 2 = 1.05 \times 10^{23}\ \text{FLOPs}
  \]
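A quick numerical check of this step (a minimal Python sketch; the 2-FLOPs-per-parameter-per-token rule is the approximation stated above, not an exact count):

```python
# Forward-pass FLOPs: ~2 FLOPs per parameter per token (approximation above).
params = 175e9   # GPT-3 parameter count
tokens = 300e9   # training tokens (one epoch)

flops_forward = 2 * params * tokens
print(f"Forward-pass FLOPs: {flops_forward:.2e}")  # ~1.05e+23
```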
Step 3: FLOPs for Backward Pass
- The backward pass (gradient computation) costs approximately 2× the forward pass, because gradients must be computed with respect to both the activations and the weights.
- Total FLOPs for backward pass:
  \[
  \text{Total FLOPs}_{\text{backward}} = 2 \times \text{Total FLOPs}_{\text{forward}} = 2 \times 1.05 \times 10^{23} = 2.10 \times 10^{23}\ \text{FLOPs}
  \]
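Extending the same sketch, the backward pass doubles the forward count, which recovers the familiar 6 × parameters × tokens rule of thumb (an approximation consistent with the numbers above):

```python
params = 175e9
tokens = 300e9

flops_forward = 2 * params * tokens             # ~1.05e+23
flops_backward = 2 * flops_forward              # ~2.10e+23
flops_linear = flops_forward + flops_backward   # ~3.15e+23, i.e. 6 * params * tokens

print(f"Backward-pass FLOPs: {flops_backward:.2e}")
print(f"Forward + backward:  {flops_linear:.2e}")
```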
Step 4: Quadratic Attention Overhead (Minor Correction)
- Attention layers introduce a cost that is quadratic in the context length (\(O(S^2)\) per sequence, where \(S = 2048\)), on top of the parameter-proportional (linear) cost.
- Overhead: roughly \(\sim 4\%\) of the linear (matrix-multiply) cost, estimated empirically for GPT-3's configuration.
- Combined linear (forward + backward) FLOPs:
  \[
  \text{Total FLOPs}_{\text{linear}} = \text{Total FLOPs}_{\text{forward}} + \text{Total FLOPs}_{\text{backward}} = 1.05 \times 10^{23} + 2.10 \times 10^{23} = 3.15 \times 10^{23}\ \text{FLOPs}
  \]
- Attention overhead: \(3.15 \times 10^{23} \times 0.04 = 1.26 \times 10^{22}\) FLOPs.
- Revised total:
  \[
  3.15 \times 10^{23} + 1.26 \times 10^{22} \approx 3.28 \times 10^{23}\ \text{FLOPs}
  \]
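The few-percent attention overhead can be sanity-checked from GPT-3's commonly cited architecture; the layer count and hidden width below are assumptions not stated in this estimate, and the formula counts only the QKᵀ and attention-times-V matrix products, so it is a rough sketch rather than an exact figure:

```python
# Rough check of attention overhead relative to the parameter-proportional cost.
# Architecture constants are the commonly cited GPT-3 175B values (assumed here).
n_layers = 96      # transformer blocks
d_model = 12288    # hidden width
seq_len = 2048     # context length
params = 175e9

# Per token, per forward pass: QK^T and (attention weights) @ V each cost
# roughly 2 * seq_len * d_model FLOPs in every layer.
attn_per_token = n_layers * 2 * (2 * seq_len * d_model)
linear_per_token = 2 * params

print(f"Attention overhead: {attn_per_token / linear_per_token:.1%}")
# Prints roughly 3%: the same few-percent ballpark as the ~4% used above.
```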
Step 5: Final Adjustment for Practical Efficiency
- Training uses mixed precision (FP16) and optimizations (e.g., gradient accumulation), reducing effective FLOPs by ~10–20%.
- Adjusted total:
  \[
  3.28 \times 10^{23} \times 0.85 \approx 2.79 \times 10^{23}\ \text{FLOPs}
  \]
  (accounting for inefficiencies, data loading, etc.)
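Putting the whole estimate into one short script (the 4% attention overhead and the 0.85 efficiency factor are this estimate's own assumptions, carried over unchanged):

```python
params = 175e9
tokens = 300e9

flops_forward = 2 * params * tokens              # Step 2: ~1.05e+23
flops_backward = 2 * flops_forward               # Step 3: ~2.10e+23
flops_linear = flops_forward + flops_backward    # ~3.15e+23

flops_with_attention = flops_linear * 1.04       # Step 4: +4% attention overhead -> ~3.28e+23
flops_total = flops_with_attention * 0.85        # Step 5: ~15% efficiency adjustment

print(f"Estimated training FLOPs: {flops_total:.2e}")
# ~2.8e+23, matching the ~2.79e23 figure above up to rounding.
```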
Conclusion
The estimated total FLOPs to train GPT-3 (175B) on 300B tokens is ~2.79 × 10²³ FLOPs.
This is close to OpenAI's reported value of ≈3.14 × 10²³ FLOPs; the remaining gap mainly reflects the ~15% efficiency adjustment and the rounding above.
Key Assumptions:
- 1 training epoch on 300B tokens.
- Dominance of linear matrix multiplications (attention overhead <5%).
- Mixed precision reduces cost by ~15%.
- No hyperparameter tuning or intermediate checkpoints.
Final Answer:
\[
\boxed{2.79 \times 10^{23}}
\]
FLOPs (rounded to two significant figures, approximately 2.8 × 10²³).
About MiniMax M2
Capabilities
Conversation, Reasoning, Code Generation, Analysis, Planning, Agentic Tool Use, Tool Use
Categories
Text, Code
Specifications
- Provider: MiniMax
- Released: 2025-10-23
- Size: MEDIUM
- Context: 204,800 tokens