Step‑by‑step back‑of‑the‑envelope estimate of GPT‑3’s training compute
Below is a simple calculation that uses only publicly known high‑level facts about GPT‑3 (the size of the model, the size of the training corpus, and the basic arithmetic cost of a Transformer layer).
All numbers are rounded to make the math easy to follow; the goal is to get an order‑of‑magnitude figure rather than an exact count.
1. What we know about GPT‑3
| Property | Approximate value |
|---|---|
| Number of parameters (the “175 B” model) | 1.75 × 10¹¹ |
| Hidden dimension d (the size of each token representation) | ≈ 12 288 |
| Number of Transformer layers L | 96 |
| Feed‑forward inner dimension d_ff (usually 4 × d) | ≈ 49 152 |
| Size of the training corpus (tokens) | ≈ 3 × 10¹¹ tokens (≈ 300 B) |
Training also uses an Adam‑style optimizer, mixed precision, etc.; we ignore those details for the FLOP count.
These numbers are taken from the model card and the paper; they are the only “hard” data we need.
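For readers who want to follow the arithmetic in code, the snippets below sketch the same calculation in plain Python (the variable names are my own, not from the paper). First, the figures above as constants:

```python
# Publicly reported GPT-3 figures used throughout this estimate.
N_PARAMS = 1.75e11       # total parameters ("175 B" model)
D_MODEL = 12_288         # hidden dimension d
N_LAYERS = 96            # number of Transformer layers L
D_FF = 4 * D_MODEL       # feed-forward inner dimension (= 49,152)
N_TOKENS = 3e11          # training corpus, roughly 300 B tokens
```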
2. FLOPs per token for a single Transformer layer
A standard Transformer layer consists of two main sub‑blocks:
| Sub‑block | Main operations (per token) | Approx. FLOPs |
|---|---|---|
| Self‑attention (Q, K, V projections + attention scores + weighted sum) | 3 d² (Q, K, V projections) + d² (QKᵀ scores) + d² (weighted sum with V) | ≈ 5 × d² |
| Feed‑forward (two linear layers, activation in between) | d × d_ff (first linear) + d_ff × d (second linear), with d_ff = 4 d | ≈ 8 × d² |
Summing them gives roughly
$$\text{FLOPs per token per layer} \;\approx\; (5+8)\,d^{2} = 13\,d^{2}.$$
Why the factor 13?
- The three projection matrices (Q, K, V) each cost d × d → 3 d².
- Computing the attention scores (QKᵀ) costs d × d.
- Multiplying the scores by V costs another d × d.
- The two feed‑forward linear layers each cost 4 d² (because the inner dimension is 4 × d), giving 8 d² total.
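The same counting can be written as a small helper function. This is only a sketch of the simplified accounting above: it counts each multiply‑accumulate as one operation and, like the text, ignores layer norm, biases, softmax and sequence‑length effects.

```python
def flops_per_token_per_layer(d: int) -> int:
    """Approximate per-token cost of one Transformer layer, as counted above."""
    attention = 3 * d * d + d * d + d * d      # Q/K/V projections + QK^T scores + weighted sum over V (≈ 5 d²)
    feed_forward = d * (4 * d) + (4 * d) * d   # two linear layers with inner dimension 4 d (≈ 8 d²)
    return attention + feed_forward            # = 13 d²
```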
3. Plug in the hidden dimension
$$d = 12{,}288 \quad\Rightarrow\quad d^{2} \approx 1.51\times10^{8}.$$
$$13\,d^{2} \approx 13 \times 1.51\times10^{8} \approx 1.96\times10^{9}\ \text{FLOPs per token per layer}.$$
4. Include all layers
$$\text{FLOPs per token (forward pass)} = L \times 13\,d^{2} \approx 96 \times 1.96\times10^{9} \approx 1.88\times10^{11}.$$
So a single forward pass of one token through the full 96‑layer model costs ≈ 2 × 10¹¹ FLOPs.
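Continuing the Python sketch, plugging in d = 12 288 and L = 96 reproduces both numbers:

```python
per_layer = flops_per_token_per_layer(D_MODEL)  # ≈ 1.96e9 FLOPs per token per layer
per_token_forward = N_LAYERS * per_layer        # ≈ 1.88e11 FLOPs per token, full forward pass
print(f"per layer:  {per_layer:.2e}")
print(f"full model: {per_token_forward:.2e}")
```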
5. Account for the whole training corpus
$$\text{Tokens processed} = 3\times10^{11}.$$
$$\text{Total forward-pass FLOPs} = 3\times10^{11} \times 1.88\times10^{11} \approx 5.6\times10^{22}.$$
6. Add the backward pass
During training we must compute gradients. A common rule of thumb is that the backward pass costs about 2 × the forward pass, so the total compute (forward + backward) is roughly 3 × the forward cost.
$$\text{Total training FLOPs} \approx 3 \times 5.6\times10^{22} \approx 1.7\times10^{23}.$$
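In the running Python sketch, steps 5 and 6 are two multiplications; the factor of 3 is the forward‑plus‑backward rule of thumb just mentioned:

```python
forward_total = N_TOKENS * per_token_forward  # forward-pass FLOPs over the ~300 B-token corpus
training_total = 3 * forward_total            # forward + backward ≈ 3x forward (≈ 1.7e23)
print(f"total training compute: {training_total:.2e} FLOPs")
```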
7. Result (order‑of‑magnitude)
$$\boxed{\;\approx 10^{23}\ \text{FLOPs}\quad(\text{a few} \times 10^{23})\;}$$
In words: training the 175‑billion‑parameter GPT‑3 model required on the order of a hundred sextillion floating‑point operations.
This figure is in line with published estimates for GPT‑3‑scale training runs, which are typically quoted at a few × 10²³ FLOPs (equivalently, a few thousand petaFLOP/s‑days).
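As an independent sanity check (not part of the derivation above), the rule of thumb commonly used in the scaling‑law literature, total training FLOPs ≈ 6 · N · D with N the parameter count and D the number of training tokens, lands in the same ballpark; the roughly factor‑of‑two gap versus the estimate above comes mainly from counting each multiply‑accumulate here as one operation rather than two FLOPs:

```python
rule_of_thumb = 6 * N_PARAMS * N_TOKENS  # ≈ 3.15e23 FLOPs, same order of magnitude as the estimate above
print(f"6*N*D cross-check: {rule_of_thumb:.2e} FLOPs")
```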
8. Sources of uncertainty
| Source of error | Effect on the estimate |
|---|---|
| Exact token count (the corpus is reported as ≈ 300 B tokens, but the number of tokens actually seen during training may differ slightly) | ± 10 % |
| The factor 13 for FLOPs per layer is a simplification (real implementations have extra ops for layer‑norm, bias terms, dropout, etc.) | + ~ 20 % |
| Mixed‑precision training reduces FLOPs for some operations, but the count above assumes full‑precision arithmetic | – ~ 10 % |
| Overhead from optimizer state updates, gradient accumulation, checkpointing, etc. | + ~ 10 % |
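Compounding the rough percentages in the table gives a feel for the spread; the factors below are illustrative readings of the table, not authoritative numbers:

```python
point_estimate = 1.7e23
low = point_estimate * 0.9 * 0.9          # fewer tokens, mixed-precision savings
high = point_estimate * 1.1 * 1.2 * 1.1   # more tokens, extra per-layer ops, optimizer overhead
print(f"plausible range: {low:.1e} to {high:.1e} FLOPs")
```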
Even allowing for these variations, the estimate stays within 10²³ – 10²⁴ FLOPs, i.e. roughly a hundred to a thousand zettaFLOPs.
Bottom line: Using only the model’s size, the number of layers, and the token count, a straightforward arithmetic model yields a total training compute on the order of 10²³ floating‑point operations for GPT‑3. This is the magnitude of compute that modern “large‑scale” language models demand.