Grok 4.20 Multi-Agent Beta
3.15 × 10^{23} FLOPs
GPT-3 (the 175B-parameter version) has N ≈ 1.75 × 10^{11} parameters. It was trained on a filtered dataset of roughly D ≈ 3 × 10^{11} tokens (a commonly cited figure drawn from Common Crawl, WebText, books, and Wikipedia, with some up-sampling of higher-quality sources).
For dense transformer models the dominant training cost comes from the forward and backward passes through the layers. A widely used approximation is:
Total training FLOPs ≈ 6 × N × D
- ~2N FLOPs per token for the forward pass (matrix multiplies in attention and FFN layers).
- ~4N FLOPs per token for the backward pass (gradients with respect to both activations and weights, each costing roughly as much as the forward pass).
Multiplying gives:
6 × 1.75 × 10^{11} × 3 × 10^{11} = 6 × 5.25 × 10^{22} = 3.15 × 10^{23}
This figure matches the number most often quoted for GPT-3 in the literature and in scaling-law discussions. Minor variations (±10–20%) can arise from the exact token count, optimizer overhead, or mixed-precision implementation details, but the order-of-magnitude estimate remains ~3 × 10^{23} FLOPs.
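The arithmetic above can be checked with a few lines of Python, using the N and D figures quoted earlier:

```python
# Back-of-envelope check of the 6*N*D training-FLOPs estimate for GPT-3.
# The factor 6 = 2 (forward) + 4 (backward) FLOPs per parameter per token.
N = 1.75e11  # parameters (175B)
D = 3e11     # training tokens (~300B)

flops_per_token = 6 * N          # forward (2N) + backward (4N)
total_flops = flops_per_token * D

print(f"{total_flops:.2e}")      # → 3.15e+23
```

Scaling either N or D by a given factor scales the total linearly, which is why the ±10–20% uncertainty in the token count carries through directly to the final estimate.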