Estimate Complexity
Bert-Nebulon Alpha's response to the 'Estimate Complexity' challenge.
Prompt
Estimate the total number of FLOPs required to train GPT-3 without searching online. Explain your reasoning step by step.
Response
Estimating the total number of floating-point operations (FLOPs) required to train GPT-3 involves breaking down the training process into its core components and making reasonable assumptions based on the model's architecture and training details. Here's a step-by-step breakdown:
1. Key Parameters of GPT-3
GPT-3 has the following key specifications (from the paper or known details):
- Model size: 175 billion parameters (weights).
- Training tokens: ~300 billion tokens (from the paper).
- Batch size: ~3.2 million tokens per batch (1,536 sequences × 2,048 tokens).
- Optimizer: Adam (or variant), which involves additional computations per parameter.
- Training steps: ~300 billion tokens / 3.2 million tokens per batch ≈ 93,750 steps.
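These bookkeeping numbers can be reproduced with a minimal Python sketch; the constants are simply the assumed values from the list above (the batch size is kept at the rounded 3.2M tokens, so the step count matches):

```python
# Assumed GPT-3 training configuration (values from the list above).
N_PARAMS = 175e9            # model parameters
TOTAL_TOKENS = 300e9        # training tokens
SEQ_LEN = 2048              # tokens per sequence
BATCH_SEQUENCES = 1536      # sequences per batch
TOKENS_PER_BATCH = 3.2e6    # ~1,536 x 2,048 tokens, rounded as in the text

train_steps = TOTAL_TOKENS / TOKENS_PER_BATCH
print(f"training steps: ~{train_steps:,.0f}")   # ~93,750
```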
2. FLOPs per Forward Pass
For a transformer model, the bulk of the computation comes from:
- Matrix multiplications in the attention and feed-forward layers (each multiply-accumulate counted as 2 FLOPs: one multiply plus one add).
- Softmax and other non-linear operations (negligible compared to the matmuls).
The figures below are for a forward pass over a single sequence of 2,048 tokens; they are scaled up to the full batch in step 5.
Attention Layer FLOPs
Per layer, the attention block requires:
- Query/Key/Value projections: 3 × (d_model × d_model) multiply-accumulates per token (equivalently, 3 × d_model × d_k per head, summed over all h heads).
- Attention scores (Q·Kᵀ): sequence_length² × d_k multiply-accumulates per head.
- Attention over values: sequence_length² × d_k multiply-accumulates per head.
- Output projection: d_model × d_model multiply-accumulates per token.
For GPT-3:
- d_model = 12,288 (hidden size).
- h = 96 heads.
- d_k = d_model / h = 128.
- Sequence length = 2,048 tokens.
Total FLOPs per attention layer (2 FLOPs per multiply-accumulate):
- Q/K/V projections: 2 × 3 × 2,048 × 12,288 × 12,288 ≈ 1.86 × 10¹².
- Attention scores: 2 × 96 × (2,048² × 128) ≈ 1.03 × 10¹¹.
- Attention over values: 2 × 96 × (2,048² × 128) ≈ 1.03 × 10¹¹.
- Output projection: 2 × 2,048 × 12,288 × 12,288 ≈ 6.18 × 10¹¹.
- Total per attention layer: ~2.7 × 10¹² FLOPs.
GPT-3 has 96 layers, so total attention FLOPs per forward pass: 96 × 2.7 × 10¹² ≈ 2.6 × 10¹⁴ FLOPs.
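As a sanity check on the attention arithmetic, here is a small Python sketch under the same assumptions (2 FLOPs per multiply-accumulate; the variable names are illustrative, not taken from any codebase):

```python
# Per-layer attention FLOPs for one sequence (2 FLOPs per multiply-accumulate).
D_MODEL, N_HEADS, SEQ_LEN, N_LAYERS = 12288, 96, 2048, 96
D_K = D_MODEL // N_HEADS    # 128

qkv_proj   = 2 * 3 * SEQ_LEN * D_MODEL * D_MODEL   # Q, K, V projections (all heads)
attn_score = 2 * N_HEADS * SEQ_LEN**2 * D_K        # Q·K^T per head
attn_value = 2 * N_HEADS * SEQ_LEN**2 * D_K        # scores · V per head
out_proj   = 2 * SEQ_LEN * D_MODEL * D_MODEL       # output projection

attn_per_layer = qkv_proj + attn_score + attn_value + out_proj
print(f"attention FLOPs per layer:  {attn_per_layer:.2e}")             # ~2.7e12
print(f"attention FLOPs, 96 layers: {N_LAYERS * attn_per_layer:.2e}")  # ~2.6e14
```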
Feed-Forward Layer FLOPs
Each feed-forward layer has two linear transformations (d_model → 4 × d_model → d_model):
- First: 2 × 2,048 × 12,288 × 49,152 ≈ 2.47 × 10¹² FLOPs.
- Second: 2 × 2,048 × 49,152 × 12,288 ≈ 2.47 × 10¹² FLOPs.
- Total per feed-forward layer: ~4.9 × 10¹² FLOPs.
For 96 layers: 96 × 4.9 × 10¹² ≈ 4.75 × 10¹⁴ FLOPs.
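The same kind of check for the feed-forward blocks, under the same 2-FLOPs-per-multiply-accumulate assumption:

```python
# Per-layer feed-forward FLOPs for one sequence of 2,048 tokens.
D_MODEL, SEQ_LEN, N_LAYERS = 12288, 2048, 96
D_FF = 4 * D_MODEL          # 49,152

ffn_per_layer = 2 * SEQ_LEN * D_MODEL * D_FF    # first projection (d_model -> 4*d_model)
ffn_per_layer += 2 * SEQ_LEN * D_FF * D_MODEL   # second projection (4*d_model -> d_model)

print(f"FFN FLOPs per layer:  {ffn_per_layer:.2e}")             # ~4.9e12
print(f"FFN FLOPs, 96 layers: {N_LAYERS * ffn_per_layer:.2e}")  # ~4.7e14
```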
Embedding and Output Layers
- Embedding lookup: negligible (2,048 × 12,288 ≈ 2.5 × 10⁷ FLOPs).
- Final projection to the vocabulary: 2 × 2,048 × 12,288 × 50,257 ≈ 2.5 × 10¹² FLOPs.
Total Forward Pass FLOPs
Attention + feed-forward + output projection ≈ 2.6 × 10¹⁴ + 4.75 × 10¹⁴ + 2.5 × 10¹² ≈ 7.4 × 10¹⁴ FLOPs per sequence of 2,048 tokens. This matches the common rule of thumb of ~2 FLOPs per parameter per token: 2 × 175 × 10⁹ × 2,048 ≈ 7.2 × 10¹⁴.
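Summing the pieces and comparing against that rule of thumb looks like this (the inputs are the rounded subtotals from above):

```python
# Forward-pass total for one sequence, plus a rule-of-thumb cross-check.
attention_total = 2.6e14     # from the attention section above
ffn_total       = 4.75e14    # from the feed-forward section above
output_proj     = 2.5e12     # final projection to the 50,257-token vocabulary

forward_per_seq = attention_total + ffn_total + output_proj
print(f"forward pass per sequence:  {forward_per_seq:.2e}")   # ~7.4e14

# Rule of thumb: ~2 FLOPs per parameter per token.
rule_of_thumb = 2 * 175e9 * 2048
print(f"2 * N * tokens cross-check: {rule_of_thumb:.2e}")     # ~7.2e14
```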
3. FLOPs per Backward Pass
The backward pass typically requires about 2× the FLOPs of the forward pass (gradients are computed with respect to both the weights and the activations). Thus: Backward pass ≈ 2 × 7.4 × 10¹⁴ ≈ 1.5 × 10¹⁵ FLOPs per sequence.
4. Optimizer FLOPs (Adam)
Adam involves:
- Momentum and variance updates: ~4 FLOPs per parameter (2 for momentum, 2 for variance).
- Weight update: ~2 FLOPs per parameter.
- Total per parameter: ~6 FLOPs.
For 175 billion parameters: 175 × 10⁹ × 6 ≈ 1.05 × 10¹² FLOPs per step.
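A short sketch of this optimizer bound; the ~6 FLOPs-per-parameter figure is itself a rough assumption rather than an exact count:

```python
# Approximate Adam update cost per training step (rough assumption: ~6 FLOPs/parameter).
N_PARAMS = 175e9
FLOPS_PER_PARAM = 6          # ~4 for the moment/variance updates, ~2 for the weight update

optimizer_flops = N_PARAMS * FLOPS_PER_PARAM
print(f"optimizer FLOPs per step: {optimizer_flops:.2e}")   # ~1.05e12, negligible per step
```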
5. Total FLOPs per Training Step
Forward + backward ≈ 7.4 × 10¹⁴ + 1.5 × 10¹⁵ ≈ 2.2 × 10¹⁵ FLOPs per sequence. Each step processes a batch of 1,536 sequences, so: 1,536 × 2.2 × 10¹⁵ ≈ 3.4 × 10¹⁸ FLOPs, plus ~1.05 × 10¹² FLOPs for the optimizer (negligible), giving ≈ 3.4 × 10¹⁸ FLOPs per step.
6. Total Training FLOPs
Total steps ≈ 93,750. Total FLOPs ≈ 93,750 × 3.4 × 10¹⁸ ≈ 3.2 × 10²³ FLOPs.
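The final multiplication, carried out explicitly with the rounded numbers from the previous steps:

```python
# Total training FLOPs from the per-sequence estimate.
forward_per_seq  = 7.4e14                 # forward pass, one sequence of 2,048 tokens
backward_per_seq = 2 * forward_per_seq    # backward pass ~2x forward
batch_sequences  = 1536
train_steps      = 93_750

per_step = batch_sequences * (forward_per_seq + backward_per_seq)  # optimizer is negligible
total    = train_steps * per_step
print(f"FLOPs per step:       {per_step:.2e}")   # ~3.4e18
print(f"total training FLOPs: {total:.2e}")      # ~3.2e23
```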
7. Adjustments and Simplifications
- Parallelism: GPT-3 was trained on thousands of GPUs, but this doesn't change the total FLOPs (just wall-clock time).
- Mixed precision: Training in FP16/BF16 reduces memory use and improves throughput, but the operation count is the same regardless of precision.
- Gradient checkpointing: Reduces memory but may slightly increase FLOPs (negligible here).
- Overhead: Communication, data loading, etc., are not included (typically <10% of total FLOPs).
8. Final Estimate
The total FLOPs to train GPT-3 is approximately: ~3.2 × 10²³ FLOPs (equivalent to roughly 3,600 petaflop/s-days).
This aligns with independent estimates: the commonly cited figure for GPT-3 175B is ~3.14 × 10²³ FLOPs, which follows directly from the 6 × parameters × tokens approximation with 175B parameters and 300B training tokens.
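As a closing cross-check, the widely used 6 × N × D approximation (2 FLOPs per parameter per token for the forward pass, 4 for the backward) gives essentially the same figure:

```python
# Standard approximation: ~6 FLOPs per parameter per training token.
N_PARAMS, TOTAL_TOKENS = 175e9, 300e9

approx_total = 6 * N_PARAMS * TOTAL_TOKENS
print(f"6 * N * D estimate: {approx_total:.2e}")   # ~3.15e23 FLOPs
```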
Key Takeaways
- The bulk of FLOPs come from the feed-forward layers and attention mechanisms.
- The backward pass costs roughly 2× the forward pass in FLOPs.
- The optimizer adds only a tiny overhead (well under 1% of total FLOPs), since its cost scales with the parameter count per step rather than with the tokens processed.
- The final estimate is driven almost entirely by model size and token count (≈ 6 × parameters × tokens); batch size only determines how that work is divided into steps.
About Bert-Nebulon Alpha
- Provider: Openrouter
- Released: 2025-11-24
- Size: LARGE