Glm 4.6's response to the 'Estimate Complexity' challenge.
Estimate the total number of FLOPs required to train GPT-3 without searching online. Explain your reasoning step by step.
Of course. Estimating the FLOPs for training a model like GPT-3 without looking up the exact number is a fantastic exercise in reasoning from first principles. Here is a step-by-step breakdown of how to arrive at a very close estimate.
The total computational cost for training a neural network is primarily determined by three factors: the number of parameters in the model (N), the number of tokens it is trained on (C), and the number of FLOPs required per parameter, per token (K).
The total FLOPs can be approximated by the formula:
Total FLOPs ≈ K * N * C
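As a quick sketch, this formula is just a single multiplication; here it is as a tiny Python helper (the function and argument names are purely illustrative):

```python
def training_flops(k: float, n_params: float, n_tokens: float) -> float:
    """Approximate total training FLOPs as K * N * C."""
    # k        : FLOPs per parameter, per token (forward + backward + update)
    # n_params : number of model parameters (N)
    # n_tokens : number of training tokens (C)
    return k * n_params * n_tokens
```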
Let's estimate each of these components.
This is the easiest part. The size of GPT-3's largest model is one of its most famous specifications: N = 175 billion parameters.
Training involves a forward pass, a backward pass (to calculate gradients), and a parameter update. A common and well-accepted approximation in deep learning is roughly 2 FLOPs per parameter, per token for the forward pass, about 4 for the backward pass, and about 2 for the parameter update.
Adding these up: 2 (forward) + 4 (backward) + 2 (update) = 8 FLOPs per parameter, per token.
However, a more standard and slightly more conservative estimate used in many research papers (like the Chinchilla scaling laws paper) is K = 6. This typically accounts for the forward pass (2 FLOPs) and the two main parts of the backward pass (2 FLOPs + 2 FLOPs), often bundling the optimizer's smaller cost into this figure. Using K=6 is a robust choice for an estimation of this scale.
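Before fixing the other numbers, it is worth seeing how much the choice of K actually matters; a short sketch (values taken directly from the reasoning above) makes the spread explicit:

```python
# FLOPs per parameter, per token under the two conventions discussed above.
k_detailed = 2 + 4 + 2   # forward + backward + optimizer update = 8
k_standard = 6           # the common 6 * N * C approximation

# The choice of K only shifts the final estimate by a constant factor.
print(f"Using K={k_detailed} instead of K={k_standard} inflates the estimate "
      f"by {k_detailed / k_standard - 1:.0%}")
```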
This is the most challenging and crucial estimation step. We need to reason about the composition of the training dataset.
GPT-3 was trained on a mixture of sources: Common Crawl, WebText2, Books1, Books2, and Wikipedia. Let's try to build a plausible number.
A useful reference point is the Chinchilla scaling result that a compute-optimal model should see roughly 20 tokens per parameter, which would imply 20 * 175 billion = 3.5 trillion tokens. However, GPT-3 predates Chinchilla and is known to have been "under-trained", meaning it was trained on significantly fewer tokens than this optimal ratio would suggest.
Let's combine our estimates:
Total Estimated Tokens (C) = 4 + 130 + 170 = 304 billion tokens
This is a very strong, defensible estimate. Let's round it down to C ≈ 300 billion tokens for easier calculation.
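A short sketch (restating the numbers above) shows how far this falls short of the Chinchilla-optimal token count:

```python
n_params = 175e9                           # N: GPT-3 parameter count
estimated_tokens = (4 + 130 + 170) * 1e9   # C: ~304B tokens, rounded to 300B below

# Chinchilla-style compute-optimal reference: ~20 tokens per parameter.
chinchilla_optimal_tokens = 20 * n_params  # 3.5 trillion tokens

ratio = estimated_tokens / chinchilla_optimal_tokens
print(f"Estimated training tokens: {estimated_tokens:.3g}")
print(f"Chinchilla-optimal tokens: {chinchilla_optimal_tokens:.3g}")
print(f"GPT-3 saw roughly {ratio:.0%} of the compute-optimal token count")
```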
Now we plug our estimates into the formula:
Total FLOPs ≈ K * N * C
Total FLOPs ≈ 6 * (175 x 10⁹) * (300 x 10⁹)
Let's do the math:
6 * 175 = 1,050
1,050 * 300 = 315,000
10⁹ * 10⁹ = 10¹⁸
Total FLOPs ≈ 315,000 x 10¹⁸
To write this in standard scientific notation:
Total FLOPs ≈ 3.15 x 10²³
Estimated Total FLOPs to train GPT-3: 3.15 x 10²³
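The same arithmetic in code, as a trivial check (the ~3.14 x 10²³ comparison value is the figure from the GPT-3 paper cited below):

```python
k, n_params, n_tokens = 6, 175e9, 300e9

total_flops = k * n_params * n_tokens
print(f"Estimated training FLOPs: {total_flops:.2e}")   # ~3.15e+23

# Figure reported in the original GPT-3 paper for the 175B model.
reported_flops = 3.14e23
print(f"Estimate / reported: {total_flops / reported_flops:.3f}")
```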
To put this enormous number in perspective:
A machine sustaining 10¹⁵ FLOP/s (1 petaFLOP/s) would need 3.15 x 10²³ / 10¹⁵ = 3.15 x 10⁸ seconds, which is about 10 years. In reality, OpenAI used thousands of GPUs in parallel for several weeks.
This estimation, built from first principles and reasonable assumptions about the model and its dataset, arrives at a number that is remarkably close to the official figure cited in the original GPT-3 paper, ~3.14 x 10²³ FLOPs (equivalently, about 3,640 petaFLOP/s-days). The key was correctly identifying the parameters, using the standard 6*N*C formula, and making a logical, grounded estimate of the training data size.
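For completeness, the perspective calculation above, assuming an idealized machine that sustains exactly 1 petaFLOP/s:

```python
total_flops = 3.15e23
sustained_rate = 1e15                      # 1 petaFLOP/s, sustained

seconds = total_flops / sustained_rate     # ~3.15e8 seconds
years = seconds / (365 * 24 * 3600)

print(f"Single 1 PFLOP/s machine: {seconds:.2e} s ≈ {years:.1f} years")
```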