GPT-5.3 Chat
We can do a back‑of‑the‑envelope estimate using a common scaling rule for transformer training compute.
- Known approximate GPT‑3 specs
  - Parameters N ≈ 175 billion = 1.75 × 10^11
  - Training tokens T ≈ 300 billion = 3 × 10^11
- FLOPs per token for a transformer
For large transformers, a common approximation is:
FLOPs per token ≈ 6 × N
Reasoning (rough):
- The forward pass through all layers costs ~N multiply‑adds per token, and each multiply‑add counts as 2 FLOPs, so the forward pass is ~2N FLOPs per token.
- The backward pass costs roughly twice the forward pass (gradients with respect to both activations and weights).
- Total ≈ 6N FLOPs per token for forward + backward.
- Total training FLOPs
Total compute ≈ (FLOPs per token) × (number of tokens)
= 6 × N × T
Plug in values:
6 × (1.75 × 10^11) × (3 × 10^11)
First multiply constants: 6 × 1.75 × 3 = 31.5
Then powers: 10^11 × 10^11 = 10^22
So:
Total ≈ 31.5 × 10^22
≈ 3.15 × 10^23 FLOPs
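The arithmetic above can be checked with a short script (a sketch using the approximate N and T values from this estimate; the 6 FLOPs per parameter per token factor is the heuristic, not an exact count):

```python
# Back-of-the-envelope GPT-3 training compute via the ~6 * N * T rule.
N = 1.75e11  # parameters (~175 billion, approximate)
T = 3.0e11   # training tokens (~300 billion, approximate)

# 6 FLOPs per parameter per token: ~2N forward + ~4N backward.
total_flops = 6 * N * T
print(f"Total training compute ≈ {total_flops:.2e} FLOPs")
```

This prints a value of about 3.15 × 10^23, matching the hand calculation.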
- Final estimate
Total training compute for GPT‑3 is on the order of:
≈ 3 × 10^23 FLOPs
- Sanity check
This matches the commonly reported figure for GPT‑3 (~3.14 × 10^23 FLOPs), so the estimate is reasonable.
Final estimate: about 3 × 10^23 FLOPs to train GPT‑3.