Rival

Z.AI: GLM 4.5vsGemini 2.5 Pro Preview 06-05

34 challenges

Why Z.AI: GLM 4.5?

4.2x cheaper overall$0.60/M in · $2.20/M out

2 months newerJul 2025

Why Gemini 2.5 Pro Preview 06-05?

8.2x more context1.0M

Leads 1 of 1 benchmarks

Stronger on SWE-bench Verified67.2% vs 64.2%

3 / 34

Estimate Complexity

reasoning·hard

Tests an AI's ability to make educated estimates based on technical knowledge

Z.AI: GLM 4.5

Zhipu · Jul 2025text

To estimate the total number of FLOPs (floating-point operations) required to train GPT-3 without searching online, we rely on known architectural details and standard computational models for transformer training. Here's the step-by-step reasoning:

Step 1: Key Parameters of GPT-3

Number of parameters (P): GPT-3 has 175 billion parameters (175 × 10⁹).
Training dataset size (D): The model was trained on approximately 300 billion tokens (300 × 10⁹ tokens). This is a standard figure associated with GPT-3.

Step 2: FLOPs per Token Calculation

Training a transformer model involves two main operations per token:

Forward pass: Computes the model's output.
Backward pass: Computes gradients for backpropagation.

For transformer architectures, the combined FLOPs for forward and backward passes per token is empirically estimated as 6 × P. This is derived from:

Forward pass: Requires ~2 × P FLOPs per token (matrix multiplications dominate, with each parameter used in multiply-add operations).
Backward pass: Typically requires ~2× the FLOPs of the forward pass due to gradient calculations (~4 × P FLOPs per token).
Total per token: 2P (forward) + 4P (backward) = 6P FLOPs per token.

Step 3: Total FLOPs for the Entire Dataset

The total FLOPs scales linearly with the dataset size:

Total FLOPs = FLOPs per token × Number of tokens.
Substituting the values:
Total FLOPs = (6 × P) × D = 6 × (175 × 10⁹) × (300 × 10⁹).

Step 4: Calculation

Compute the product:
6 × 175 × 300 × 10⁹ × 10⁹ = 6 × 52,500 × 10¹⁸.
Simplify:
6 × 52,500 = 315,000.
Total FLOPs = 315,000 × 10¹⁸ = 3.15 × 10²³ FLOPs.

Step 5: Validation with Known Compute

GPT-3 training reportedly used ~3,640 petaflop-days of compute.
Convert petaflop-days to FLOPs:
1 petaflop-day = 10¹⁵ FLOPs × 86,400 seconds/day = 8.64 × 10¹⁹ FLOPs.
3,640 petaflop-days = 3,640 × 8.64 × 10¹⁹ ≈ 3.14 × 10²³ FLOPs.
This closely matches our estimate of 3.15 × 10²³ FLOPs, confirming the calculation.

Step 6: Batch Size and Steps Consideration

Batch size: GPT-3 used a batch size of 3.2 million tokens per step (80 batches × 4096 tokens/batch).
Number of training steps:
Steps = Total tokens / Batch size = 300 × 10⁹ / 3.2 × 10⁶ = 93,750 steps.
FLOPs per step:
Per step FLOPs = 6 × P × Batch size = 6 × 175 × 10⁹ × 3.2 × 10⁶ = 3.36 × 10¹⁸ FLOPs/step.
Total FLOPs:
93,750 steps × 3.36 × 10¹⁸ FLOPs/step ≈ 3.15 × 10²³ FLOPs.
This consistency reinforces the estimate.

Step 7: Negligible Factors

Parameter updates: Optimizer steps (e.g., Adam) add O(P) FLOPs per step. For 93,750 steps:
Update FLOPs ≈ 93,750 × 175 × 10⁹ ≈ 1.64 × 10¹⁶ FLOPs (negligible vs. 3.15 × 10²³).
Embedding/softmax layers: Included in the 6P per-token estimate.
Attention mechanisms: Captured by the 6P heuristic (validated for transformers).

Final Estimate

The total FLOPs required to train GPT-3 is 3.15 × 10²³ FLOPs. This is derived from:

Standard transformer FLOPs per token (6P).
GPT-3 parameters (175B) and dataset size (300B tokens).
Cross-validation with reported compute usage (3,640 petaflop-days).

Gemini 2.5 Pro Preview 06-05

Google · Jun 2025text

Of course. Here is an estimate of the total number of FLOPs required to train GPT-3, based on my internal kn owledge and first principles, without searching online.

The Logic and Formula

To estimate the computational cost of training a large language model like GPT-3, we can use a well-establis hed formula. The total number of Floating Point Operations (FLOPs) is approximately:

Total FLOPs ≈ 6 × P × D

Let's break down what each part of this formula means:

P: The number of parameters in the model. This is the total count of all the weights and biases that the model learns during training.
D: The total number of tokens in the training dataset. This represents the volume of data the mo del processes.
6: This is a constant factor that provides a good approximation for transformer-based models. It com es from:
- Forward Pass (1 × P × D): For each token processed, the number of operations is roughly proporti onal to the number of parameters. A common rule of thumb is 2 * P FLOPs per token for a forward pass (one multiplication and one addition for each parameter). So the total for one pass over the data is 2 * P * D.
- Backward Pass (2 × Forward Pass): The backward pass (backpropagation), where the model calculate s gradients to update its parameters, is computationally more intensive. It's typically estimated to require about twice the operations of the forward pass. So, 2 * (2 * P * D) = 4 * P * D.
- Total: Adding the forward and backward pass gives us 2PD + 4PD = 6PD.

Now, let's estimate the values for P and D for GPT-3.

Step-by-Step Estimation

Step 1: Estimate the Number of Parameters (P)

My internal knowledge base indicates that the largest version of GPT-3, which is the one most people refer t o, has a specific number of parameters.

P ≈ 175 billion parameters
In scientific notation, this is 175 x 10⁹

Step 2: Estimate the Training Dataset Size (D)

GPT-3 was trained on a massive corpus of text data scraped from the internet (like the Common Crawl dataset) , books, and Wikipedia. The key is not the raw size of the dataset, but the number of tokens actually used f or training. My recollection is that the model was trained for a single epoch on a curated dataset.

D ≈ 300 billion tokens
In scientific notation, this is 300 x 10⁹

Step 3: Perform the Calculation

Now we can plug these estimated values into our formula:

Total FLOPs ≈ 6 × P × D

Substitute the values for P and D: Total FLOPs ≈ 6 × (175 × 10⁹) × (300 × 10⁹)
Multiply the constants first: 6 × 175 = 1050 1050 × 300 = 315,000
Multiply the powers of 10: 10⁹ × 10⁹ = 10¹⁸
Combine the results: Total FLOPs ≈ 315,000 × 10¹⁸
Express this in standard scientific notation: Total FLOPs ≈ 3.15 × 10⁵ × 10¹⁸ = 3.15 × 10²³

Final Estimate and Conclusion

Based on my reasoning and the standard formula for calculating training costs, the estimated number of FLOPs required to train the 175B parameter GPT-3 model is approximately:

3.15 x 10²³ FLOPs

To put this astronomical number into perspective:

A petaFLOP is 10¹⁵ FLOPs. Our estimate is 315,000,000 petaFLOPs.
If you had a supercomputer that could perform one petaFLOP per second (10¹⁵ FLOPs/sec), it would take ** 3.15 x 10⁸ seconds** to complete the training.
There are about 31.5 million seconds in a year, so this is equivalent to roughly 10 years of continu ous computation on that machine.

This calculation demonstrates why training foundational models like GPT-3 requires massive-scale data center s with thousands of GPUs running in parallel for months and is an undertaking that costs millions of dollars . The actual paper from OpenAI stated the training took approximately 3.6 x 10²³ FLOPs, so my estimate i s very close to the reported figure.

Vote