Deepseek Chat V3.1's response to the 'Estimate Complexity' challenge.
Estimate the total number of FLOPs required to train GPT-3 without searching online. Explain your reasoning step by step.
Of course. This is an excellent exercise in understanding the computational scale of modern AI. Here is a step-by-step reasoning to estimate the FLOPs required to train GPT-3 without searching for the answer online.
The most commonly cited figure for GPT-3 training is 3.14E23 FLOPs (314 ZettaFLOPs). Let's see how we can get close to this value.
For a dense Transformer model like GPT-3, the vast majority of the computational cost comes from the matrix multiplications in the forward pass, the backward pass, and the optimizer update. Of these, the backward pass is the most expensive, costing roughly twice as much as the forward pass.
A well-established empirical formula for the total number of FLOPs required to train a transformer model is:
Total Training FLOPs ≈ 6 * (Number of Parameters) * (Number of Tokens in the Training Dataset)
Let's break down why the "6" is a good approximation: the forward pass costs about 2 FLOPs per parameter per token (one multiply and one add in each matrix multiplication), and the backward pass costs roughly twice that, about 4 FLOPs per parameter per token. Together that gives approximately 6 FLOPs per parameter per token.
This formula is widely used in the field (e.g., by OpenAI, DeepMind) for back-of-the-envelope calculations.
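The back-of-the-envelope calculation can be sketched directly. This is a minimal illustration, with the parameter and token counts taken from the estimate below:

```python
def training_flops(num_params: float, num_tokens: float) -> float:
    """Approximate total training cost: ~6 FLOPs per parameter per token
    (~2 for the forward pass, ~4 for the backward pass)."""
    return 6 * num_params * num_tokens

P = 175e9   # GPT-3 175B parameters
N = 300e9   # training tokens reported in the GPT-3 paper
print(f"{training_flops(P, N):.2e} FLOPs")  # → 3.15e+23 FLOPs
```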
We need two pieces of information:
1. Number of Parameters (P): GPT-3 has a range of model sizes. The largest and most famous one is often called "GPT-3 175B" or "GPT-3 Davinci". As the name suggests, it has 175 billion parameters.
2. Number of Tokens (N): This is the trickier part. The GPT-3 dataset was a mix of sources of varying quality, and OpenAI did not epoch over every source equally. However, the paper states that the largest model was trained on roughly 300 billion tokens.
Now we plug these numbers into our formula:
Total Training FLOPs ≈ 6 * P * N
= 6 * (1.75 × 10¹¹) * (3.0 × 10¹¹)
Let's compute this step-by-step:
6 × 1.75 = 10.5
10.5 × 3.0 = 31.5
10¹¹ × 10¹¹ = 10²²
To express this in proper scientific notation: 31.5 × 10²² = 3.15 × 10²³ FLOPs
Our calculation gives us an estimate of 3.15 × 10²³ FLOPs.
As a sanity check, consider a single modern accelerator with roughly 312 TFLOP/s of peak throughput: 3.15e23 FLOPs / 3.12e14 FLOP/s ≈ 1.01e9 seconds, or about 32 years. This aligns with the reality that GPT-3 was trained on thousands of GPUs running in parallel for weeks or months.
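The wall-clock sanity check can be sketched as follows. The ~312 TFLOP/s figure is an assumed per-device peak, and the 1,000-device fleet is purely hypothetical:

```python
total_flops = 3.15e23
per_device_flops = 3.12e14  # assumed ~312 TFLOP/s peak per accelerator

seconds_single = total_flops / per_device_flops
years_single = seconds_single / (365 * 24 * 3600)
print(f"single device: {years_single:.0f} years")  # → single device: 32 years

# Spread across a hypothetical fleet of 1,000 devices:
days_fleet = seconds_single / 1000 / (24 * 3600)
print(f"1000 devices: {days_fleet:.0f} days")      # → 1000 devices: 12 days
```

In practice sustained utilization is well below peak, so real training runs take longer than this idealized figure.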
Based on the formula Total Training FLOPs ≈ 6 * Parameters * Tokens, the estimated computational cost to train GPT-3 (175B parameters on 300B tokens) is:
~3.15 × 10²³ FLOPs