Estimate Complexity
GPT-5 Nano's response to the 'Estimate Complexity' challenge.
GPT-5 Nano
August 7, 2025
Prompt
Estimate the total number of FLOPs required to train GPT-3 without searching online. Explain your reasoning step by step.
Response
Here’s a back-of-the-envelope estimate for the total FLOPs to train GPT-3 (the 175B-parameter model) using reasonable public details, without looking anything up.
Assumptions (typical GPT-3 specs)
- Model: 175 billion parameters
- Transformer depth: 96 layers
- Hidden size (d_model): 12,288
- Sequence length used during training: 2,048 tokens
- Training data: about 300 billion tokens
- The backward pass roughly doubles the cost relative to the forward pass alone
- We’ll ignore data loading, optimizer overhead beyond the backward pass, precision differences, and any gradient checkpointing variations
Step 1: FLOPs per forward pass for one 2,048-token sequence
- Self-attention per layer costs on the order of L^2 * d_model operations, where L = 2,048 and d_model = 12,288.
- L^2 * d_model = 2,048^2 * 12,288 ≈ 51.54 × 10^9 operations per layer
- Across 96 layers: 51.54e9 × 96 ≈ 4.95 × 10^12 FLOPs for attention per forward pass
- Feed-forward network (FFN) per layer costs roughly 8 × d_model^2 operations
- d_model^2 = 12,288^2 ≈ 150.995 × 10^6
- 8 × d_model^2 ≈ 1.208 × 10^9 FLOPs per layer
- Across 96 layers: 1.208e9 × 96 ≈ 1.16 × 10^11 FLOPs for FFN per forward pass
- Total forward pass per sequence: attention + FFN ≈ 4.95e12 + 1.16e11 ≈ 5.07e12 FLOPs
- Include backward pass (roughly similar magnitude to forward): ≈ 2 × 5.07e12 ≈ 1.01 × 10^13 FLOPs per forward+backward pass
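As a sanity check on Step 1, here is a minimal Python sketch that reproduces the per-sequence arithmetic above. It uses only the assumed specs (96 layers, d_model = 12,288, 2,048-token sequences) and the same simplified cost model (L^2 × d_model for attention, 8 × d_model^2 per layer for the FFN), so it inherits all of the estimate's approximations.

```python
# Minimal sketch of Step 1 under the stated assumptions; the cost model
# (L^2 * d_model attention, 8 * d_model^2 FFN per layer) is the estimate's
# own simplification, not a measured profile.

n_layers = 96        # transformer depth
d_model  = 12_288    # hidden size
seq_len  = 2_048     # tokens per training sequence

attn_fwd = seq_len ** 2 * d_model * n_layers   # ~4.95e12 FLOPs per forward pass
ffn_fwd  = 8 * d_model ** 2 * n_layers         # ~1.16e11 FLOPs per forward pass

fwd_flops     = attn_fwd + ffn_fwd             # ~5.07e12 FLOPs
fwd_bwd_flops = 2 * fwd_flops                  # ~1.01e13 FLOPs (forward + backward)

print(f"forward+backward per sequence: {fwd_bwd_flops:.2e} FLOPs")
```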
Step 2: How many 2,048-token sequences to cover 300B tokens?
- Number of segments N = 300e9 tokens / 2,048 ≈ 1.465 × 10^8 segments
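Expressed in code (a trivial sketch, using the assumed 300B-token budget and 2,048-token sequences):

```python
# Step 2: number of 2,048-token sequences needed to cover ~300B training tokens.
total_tokens = 300e9
seq_len      = 2_048

num_segments = total_tokens / seq_len
print(f"segments: {num_segments:.3e}")   # ~1.465e8
```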
Step 3: Total training FLOPs
- Total FLOPs ≈ (FLOPs per forward+backward) × N
- ≈ (1.01 × 10^13) × (1.465 × 10^8)
- ≈ 1.48 × 10^21 FLOPs
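Putting Steps 1 and 2 together, a short sketch of the final multiplication (the inputs are the rounded intermediate values from above):

```python
# Step 3: total training FLOPs = per-sequence cost x number of sequences.
flops_per_fwd_bwd = 1.01e13   # Step 1: forward + backward per 2,048-token sequence
num_segments      = 1.465e8   # Step 2: sequences needed for ~300B tokens

total_flops = flops_per_fwd_bwd * num_segments
print(f"total training FLOPs: {total_flops:.2e}")   # ~1.48e21
```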
Result
- Rough estimate: about 1.5 × 10^21 FLOPs to train GPT-3 at the 175B scale, under the stated assumptions.
Notes and caveats
- This is a back-of-the-envelope calculation. Real compute could differ by a factor of a few due to:
  - Different batch sizes or gradient accumulation, which changes the effective number of forward/backward passes
  - Use of gradient checkpointing or optimizer options that alter the forward/backward FLOP balance
  - Exact data token count and how sequences are batched
  - Precision (fp32 vs. fp16/bfloat16) affecting the count of floating-point operations
  - Overheads from data loading, padding, sparsity, or other architectural details
- The order-of-magnitude (around 10^21 FLOPs) is the key takeaway, not a precise figure.
If you want, I can run through a variant with a different token budget, sequence length, or depth to show how sensitive the total FLOP count is to those choices.
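For illustration, here is a small helper that wraps the same cost model so the token budget, sequence length, and depth can be varied; the alternative settings below are hypothetical, chosen only to show the sensitivity, not to describe any particular model.

```python
# Sensitivity sketch: same simplified cost model as above
# (L^2 * d_model attention + 8 * d_model^2 FFN per layer, x2 for the backward pass).

def estimate_total_flops(n_layers: int, d_model: int, seq_len: int, total_tokens: float) -> float:
    attn_fwd = seq_len ** 2 * d_model * n_layers
    ffn_fwd  = 8 * d_model ** 2 * n_layers
    per_fwd_bwd = 2 * (attn_fwd + ffn_fwd)
    return per_fwd_bwd * (total_tokens / seq_len)

base       = estimate_total_flops(96, 12_288, 2_048, 300e9)  # ~1.48e21 (the estimate above)
longer_ctx = estimate_total_flops(96, 12_288, 4_096, 300e9)  # ~2.9e21: longer sequences raise per-token attention cost
more_data  = estimate_total_flops(96, 12_288, 2_048, 600e9)  # ~3.0e21: total scales linearly with the token budget

print(f"base={base:.2e}  longer_ctx={longer_ctx:.2e}  more_data={more_data:.2e}")
```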
About GPT-5 Nano
Capabilities: Conversation, Reasoning, Code Generation, Analysis
Categories: Text, Code, Multimodal
Specifications
- Provider: OpenAI
- Released: 2025-08-07
- Size: SMALL