Trinity Large Preview
To estimate the FLOPs for training GPT-3, I'll work through the key components:
Model parameters: GPT-3 has 175 billion parameters.
Architecture: GPT-3 uses the standard decoder-only transformer architecture. Each forward pass requires approximately 2 FLOPs per parameter per token processed (one multiply and one add in the matrix multiplications).
Training process: During training, every token processed requires compute in both the forward and backward passes. Per parameter, per token:
- Forward pass: ~2 FLOPs
- Backward pass: ~4 FLOPs (gradient computation is roughly 2× the forward pass)
- Weight update: applied once per batch rather than per token, so its cost is negligible by comparison
Total: ~6 FLOPs per parameter per token (see the sketch after this list)
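As a sanity check on this accounting, here is a minimal Python sketch (the function and variable names are my own, not from any library) that encodes the ~6 FLOPs per parameter per token rule of thumb:

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Rule-of-thumb training compute: ~2 FLOPs/param/token for the forward
    pass plus ~4 FLOPs/param/token for the backward pass; the once-per-batch
    optimizer update is ignored as negligible."""
    forward = 2 * n_params * n_tokens
    backward = 4 * n_params * n_tokens
    return forward + backward

# GPT-3: 175 billion parameters, ~300 billion training tokens
print(f"{training_flops(175e9, 300e9):.2e} FLOPs")  # ~3.15e+23
```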
Training compute: The total training compute is therefore:
- 175 × 10⁹ parameters × 6 FLOPs/parameter/token × D tokens
- Where D is the total number of training tokens
Batch size: GPT-3 was trained with a context length of 2,048 tokens and a batch size of 3.2 million tokens, on a total of ≈ 300 billion tokens (as reported in the GPT-3 paper). The batch size fixes the number of optimizer steps but not the total compute, which depends only on the token count:
- Optimizer steps = Total tokens / Batch size
Therefore: steps = 300 × 10⁹ / 3.2 × 10⁶ ≈ 93,750 optimizer steps (recovered in the snippet below)
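A two-line sketch of that step count, assuming the round numbers above (constant names are mine):

```python
TOTAL_TOKENS = 300e9   # total training tokens reported for GPT-3
BATCH_TOKENS = 3.2e6   # tokens per batch at the final batch size

steps = TOTAL_TOKENS / BATCH_TOKENS
print(f"~{steps:,.0f} optimizer steps")   # ~93,750
```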
Total FLOPs: 6 × 175 × 10⁹ × 300 × 10⁹ ≈ 3.15 × 10²³ FLOPs
Converting to petaflop/s-days (10¹⁵ FLOP/s × 86,400 s/day ≈ 8.64 × 10¹⁹ FLOPs):
- 3.15 × 10²³ / 8.64 × 10¹⁹ ≈ 3,640 PF-days
This matches the ≈ 3.14 × 10²³ FLOPs (≈ 3,640 petaflop/s-days) reported for GPT-3 175B. Overheads such as activation recomputation and optimizer state increase the hardware compute actually spent, but they are not counted in this model-FLOP estimate.
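For completeness, a sketch of the unit conversion, assuming the 6·N·D estimate above (constant names are mine):

```python
TOTAL_FLOPS = 6 * 175e9 * 300e9     # ~3.15e23 FLOPs from the estimate above
PFLOP_S_DAY = 1e15 * 86_400         # one petaflop/s sustained for a day = 8.64e19 FLOPs

print(f"~{TOTAL_FLOPS / PFLOP_S_DAY:,.0f} petaflop/s-days")   # ~3,646
```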