Estimating the total number of FLOPs (floating-point operations) required to train GPT-3 without external references involves breaking down the problem into key components: model size, training data size, optimization steps, and the computational complexity of the operations involved. Here's a step-by-step reasoning:
1. Model Size (Number of Parameters)
GPT-3 has approximately 175 billion parameters (as reported by OpenAI). Modern training often uses mixed precision (e.g., FP16 or BF16 for activations and gradients, with FP32 master weights), which reduces memory use and raises hardware throughput but does not change the total FLOP count.
2. Training Data Size
GPT-3 was trained on about 300 billion tokens (OpenAI's estimate). A "token" is roughly equivalent to a word or subword unit in the vocabulary.
3. Forward and Backward Passes
For each token in the training data, the model performs:
- A forward pass: compute the output given the input.
- A backward pass: compute gradients with respect to the loss.
Each forward or backward pass involves:
- Attention operations
- Feed-forward network (FFN) operations
- Layer normalization and residual connections
The total compute is roughly proportional to the number of parameters times the number of tokens processed. For autoregressive models like GPT-3, training processes every position of a sequence in parallel (teacher forcing), so the matrix-multiply cost per token is the same regardless of where the token sits in the context.
In practice, training uses batches of sequences of up to 2048 tokens (GPT-3's maximum context length). The quadratic attention term grows with sequence length, but at 2048 tokens it is small relative to the parameter-dependent matmuls, so a per-token approximation based on the parameter count alone is reasonable.
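The batching point can be checked directly: the same number of tokens costs the same FLOPs however it is split into sequences. A minimal sketch (the batch shapes below are hypothetical, chosen only to give equal token counts):

```python
# Rule-of-thumb training FLOPs per token: ~6 * N for an N-parameter model.
N_PARAMS = 175e9
flops_per_token = 6 * N_PARAMS

# Two hypothetical ways to schedule ~4.1M tokens per step:
tokens_a = 2_000 * 2_048   # 2,000 sequences at the 2,048-token max context
tokens_b = 8_000 * 512     # 8,000 shorter sequences

assert tokens_a == tokens_b  # same token count either way
print(f"{tokens_a * flops_per_token:.3g} FLOPs per step, either schedule")
```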
4. FLOPs per Token
A commonly used approximation for the FLOPs required to train a transformer model like GPT-3 is:
FLOPs per token ≈ 6 × number of parameters
This accounts for:
- The forward pass (~2N FLOPs per token)
- The backward pass (~4N FLOPs per token, roughly twice the forward cost)
So, for a 175B parameter model:
FLOPs per token ≈ 6 × 175 × 10⁹ = 1.05 × 10¹² FLOPs per token
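As a quick check, the per-token figure can be computed directly (a minimal sketch; the 6N coefficient is the rule of thumb above, not a measured value):

```python
# Rule-of-thumb training FLOPs per token: ~6 * N (forward plus backward pass).
N_PARAMS = 175e9  # GPT-3 parameter count

flops_per_token = 6 * N_PARAMS
print(f"{flops_per_token:.3g} FLOPs/token")  # ~1.05e+12
```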
5. Total Training Tokens
GPT-3 was trained on 300 billion tokens.
So, total FLOPs ≈ FLOPs per token × Number of tokens
Total FLOPs = 1.05 × 10¹² × 3 × 10¹¹ = 3.15 × 10²³ FLOPs
Check the arithmetic: 300 billion = 3 × 10¹¹, so 1.05 × 10¹² × 3 × 10¹¹ = 3.15 × 10²³ FLOPs.
This is a very large number, so it is worth double-checking the approximation behind it.
Revisiting FLOPs per Token
The "6N" rule of thumb is derived as follows:
- Forward pass: ~2N FLOPs (each parameter participates in roughly one multiply-accumulate per token, i.e., 2 FLOPs)
- Backward pass: ~4N FLOPs (gradients with respect to both activations and weights, each costing about as much as the forward pass)
- Parameter updates add only O(N) FLOPs per optimizer step, which is negligible per token
Total: ~6N FLOPs per token
So yes, 6 × 175B = 1.05 × 10¹² FLOPs per token.
Then:
1.05 × 10¹² FLOPs/token × 3 × 10¹¹ tokens = 3.15 × 10²³ FLOPs
But this is 315 sextillion FLOPs.
Published estimates (which this derivation deliberately avoids leaning on) put GPT-3's training compute at roughly 3.1 × 10²³ FLOPs, so the result aligns.
A final sanity check on the units: 3 × 10¹¹ tokens × (6 × 1.75 × 10¹¹) FLOPs per token = 3.15 × 10²³ FLOPs. The "6N" figure is per token, with N the parameter count, so the math is consistent.
6. Adjust for Mixed Precision and System-Level Overhead
In practice:
- Mixed precision (FP16/BF16) changes hardware throughput, not the FLOP count; FLOPs are conventionally counted independently of numeric precision.
- There is communication overhead (model/data parallelism, gradient synchronization).
- There is embedding layer compute (vocabulary size × embedding dimension × sequence length × batch size).
Let’s estimate embedding FLOPs:
- Vocabulary: ~50,000 tokens
- Embedding dimension: 12,288 (GPT-3)
- Sequence length: up to 2,048 tokens; assume the average is near the maximum
- Batch size: GPT-3 used a very large global batch (millions of tokens per step), but batching only changes how FLOPs are scheduled, not their total
GPT-3 was trained for roughly one pass over its ~300B-token training mix (most of the data was seen at most once), so total tokens ≈ 3 × 10¹¹.
Embedding and output-layer FLOPs per token:
- Token → embedding: a table lookup, which costs memory bandwidth rather than FLOPs.
- Positional embeddings and layer norms add only O(d_model) work per token.
The one vocabulary-sized matmul is the final output projection (computing logits over the vocabulary):
Output-projection FLOPs per token ≈ 2 × vocab_size × d_model
= 2 × 50,000 × 12,288 ≈ 1.23 × 10⁹ FLOPs/token
Compare this to 1.05 × 10¹² from the transformer blocks → ~0.1% of the total.
So these terms are negligible; the dominant cost remains the attention and FFN compute, which scales with model size.
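The relative sizes are easy to verify numerically (a sketch; vocab size and d_model are the approximate GPT-3 values quoted above):

```python
# Compare output-projection FLOPs to the ~6N transformer-block FLOPs per token.
VOCAB_SIZE = 50_000
D_MODEL = 12_288
N_PARAMS = 175e9

proj_flops = 2 * VOCAB_SIZE * D_MODEL  # logits matmul, per token
block_flops = 6 * N_PARAMS             # rule-of-thumb training FLOPs per token

print(f"projection: {proj_flops:.3g} FLOPs/token")        # ~1.23e+09
print(f"fraction of total: {proj_flops / block_flops:.2%}")
```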
7. Final Estimate
Thus, the dominant term is:
Total FLOPs ≈ 6 × number of parameters × number of tokens
= 6 × 175 × 10⁹ × 300 × 10⁹
= 6 × 175 × 300 × 10¹⁸
= 315,000 × 10¹⁸
= 3.15 × 10²³ FLOPs
Rounding, we get:
≈ 3 × 10²³ FLOPs
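Putting the whole estimate together (a minimal sketch; all inputs are the round numbers used above):

```python
# Total training FLOPs ≈ 6 * parameters * tokens.
N_PARAMS = 175e9   # model parameters
N_TOKENS = 300e9   # training tokens

total_flops = 6 * N_PARAMS * N_TOKENS
print(f"total: {total_flops:.3g} FLOPs")  # ~3.15e+23
```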
Summary of Reasoning:
- GPT-3 has 175B parameters.
- It was trained on 300B tokens.
- Each token requires ~6 × parameter count FLOPs due to forward + backward passes.
- Total FLOPs ≈ 6 × 175B × 300B = 3.15 × 10²³ FLOPs.
- Embedding and other overheads are negligible compared to attention/FFN compute.
Final Answer:
The total number of FLOPs required to train GPT-3 is approximately 3 × 10²³ (300 sextillion) FLOPs.