
Kimi Linear 48B A3B Instruct vs Gemini 2.5 Pro Preview 06-05

25 challenges

Why Kimi Linear 48B A3B Instruct?

13.5x cheaper overall: $0.30/M in · $0.60/M out
5 months newer: Nov 2025

Why Gemini 2.5 Pro Preview 06-05?

No standout advantages
1 / 25

Estimate Complexity

reasoning·hard

Tests an AI's ability to make educated estimates based on technical knowledge

Kimi Linear 48B A3B Instruct

Below is a back-of-the-envelope estimate of the total FLOPs needed to train GPT-3 “175 B” (the largest 96-layer, 12288-d model with 175 billion parameters) from scratch on the full 300 B-token corpus, assuming the standard transformer training setup used by OpenAI.

Step 1 – FLOPs per forward+backward pass
A single transformer block has

  • 2 matrix-multiply operations per attention layer (QKᵀ and OV)
  • 2 matrix-multiply operations per MLP (Wi and Wo)
  • 1 extra attention output projection (optional, but small)
  • 1 residual add per block

For 96 layers and 12288 hidden units, each matrix multiply is roughly
(12288²) ≈ 1.5 × 10⁸ FLOPs.
So per block: 2 attention + 2 MLP ≈ 8 × 1.5 × 10⁸ ≈ 1.2 × 10⁹ FLOPs.
Add residual adds and layer norms (~10 % overhead) ⇒ ≈ 1.3 × 10⁹ FLOPs per block.

With 96 blocks, forward+backward per token ≈ 1.25 × 10¹¹ FLOPs ≈ 125 GFLOPs.
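
As a quick sanity check, here is a minimal Python sketch of the tally above. It keeps the same simplifications: four d×d matrix multiplies per block, two FLOPs per multiply-accumulate, and a flat 10 % allowance for residual adds and layer norms (it ignores the sequence-length-dependent attention scores and the wider MLP matrices).

```python
# Back-of-envelope per-token FLOP tally, following the simplified
# accounting above: four d x d matmuls per block, two FLOPs per
# multiply-accumulate, plus ~10% for residual adds and layer norms.
d_model = 12288        # hidden size of GPT-3 175B
n_layers = 96          # number of transformer blocks

matmuls_per_block = 4                        # 2 attention + 2 MLP (simplified)
flops_per_matmul = 2 * d_model ** 2          # ~3.0e8 (a multiply-add counted as 2 FLOPs)
per_block = 1.10 * matmuls_per_block * flops_per_matmul   # ~1.3e9 FLOPs per block
per_token = n_layers * per_block             # ~1.3e11 FLOPs per token (~125-130 GFLOPs)

print(f"per block: {per_block:.2e} FLOPs, per token: {per_token:.2e} FLOPs")
```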

Step 2 – Tokens processed per forward+backward step
The batch size is 3,200 sequences × 2,048 tokens = 6.55 million tokens.
Thus each SGD step consumes ≈ 6.6 × 10⁶ tokens.

Step 3 – Total SGD steps
300 billion tokens / 6.6 million tokens per step ≈ 45,500 steps.

Step 4 – Total FLOPs
Each step processes ≈ 6.6 × 10⁶ tokens at ≈ 125 GFLOPs/token, i.e. ≈ 8.2 × 10¹⁷ FLOPs per step.
45,500 steps × 8.2 × 10¹⁷ FLOPs/step ≈ 3.7 × 10²² FLOPs (equivalently, 1.25 × 10¹¹ FLOPs/token × 3 × 10¹¹ tokens), or roughly 430 petaflop/s-days.
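
Continuing the sketch for Steps 2 to 4, with the same assumed batch shape and per-token figure:

```python
# Steps 2-4: tokens per optimizer step, step count, and the compute-only total.
per_token_flops = 125e9                    # ~125 GFLOPs/token from Step 1
tokens_per_step = 3200 * 2048              # ~6.55e6 tokens per batch
total_tokens = 300e9                       # full training corpus

steps = total_tokens / tokens_per_step               # ~45,800 (text rounds to ~45,500)
flops_per_step = per_token_flops * tokens_per_step   # ~8.2e17 FLOPs per step
total_flops = per_token_flops * total_tokens         # ~3.75e22 FLOPs before overhead

print(f"{steps:,.0f} steps, {flops_per_step:.1e} FLOPs/step, {total_flops:.2e} FLOPs")
```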

Step 5 – Overhead for data loading, optimizer states, and communication

  • Adam: 8 extra copies of the parameter matrix (2 × 4 bytes per float32) → +50 %
  • Gradient communication and padding → +20 %
  • Data loading and tokenization → +10 %

Overall multiply by 1.8 ⇒ ≈ 6.8 × 10²² FLOPs ≈ 780 petaflop/s-days.

Step 6 – Round to one significant figure
≈ 7 × 10²² FLOPs, i.e. on the order of 10²³ FLOPs.
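
And the last two steps, applying the assumed 1.8× overhead factor:

```python
# Steps 5-6: apply the assumed overhead factors and convert the result.
total_flops = 3.75e22                # compute-only total from Step 4
overhead = 1.0 + 0.5 + 0.2 + 0.1     # optimizer + communication + data pipeline, ~1.8x
with_overhead = overhead * total_flops            # ~6.8e22 FLOPs

pf_s_day = 1e15 * 86_400             # FLOPs in one petaflop/s-day
print(f"{with_overhead:.1e} FLOPs ≈ {with_overhead / pf_s_day:,.0f} petaflop/s-days")
```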

So a rough estimate is on the order of 10²³ total floating-point operations to train GPT-3 175 B from scratch on 300 B tokens with the standard transformer architecture and training setup.

Gemini 2.5 Pro Preview 06-05

Of course. Here is an estimate of the total number of FLOPs required to train GPT-3, based on my internal knowledge and first principles, without searching online.

The Logic and Formula

To estimate the computational cost of training a large language model like GPT-3, we can use a well-established formula. The total number of Floating Point Operations (FLOPs) is approximately:

Total FLOPs ≈ 6 × P × D

Let's break down what each part of this formula means:

  • P: The number of parameters in the model. This is the total count of all the weights and biases that the model learns during training.
  • D: The total number of tokens in the training dataset. This represents the volume of data the model processes.
  • 6: This is a constant factor that provides a good approximation for transformer-based models. It comes from:
    • Forward Pass (2 × P × D): For each token processed, the number of operations is roughly proportional to the number of parameters. A common rule of thumb is 2 * P FLOPs per token for a forward pass (one multiplication and one addition for each parameter). So the total for one pass over the data is 2 * P * D.
    • Backward Pass (2 × Forward Pass): The backward pass (backpropagation), where the model calculates gradients to update its parameters, is computationally more intensive. It's typically estimated to require about twice the operations of the forward pass. So, 2 * (2 * P * D) = 4 * P * D.
    • Total: Adding the forward and backward pass gives us 2PD + 4PD = 6PD.
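
A minimal code sketch of this accounting, assuming the usual rule of thumb of 2 FLOPs per parameter per token for the forward pass and twice that for the backward pass:

```python
def training_flops(params: float, tokens: float) -> dict:
    """Rough 6 * P * D training-compute estimate for a dense transformer."""
    forward = 2 * params * tokens     # ~2 FLOPs per parameter per token
    backward = 4 * params * tokens    # backward pass ~2x the forward pass
    return {"forward": forward, "backward": backward, "total": forward + backward}
```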

Now, let's estimate the values for P and D for GPT-3.


Step-by-Step Estimation

Step 1: Estimate the Number of Parameters (P)

My internal knowledge base indicates that the largest version of GPT-3, which is the one most people refer to, has a specific number of parameters.

  • P ≈ 175 billion parameters
  • In scientific notation, this is 175 x 10⁹

Step 2: Estimate the Training Dataset Size (D)

GPT-3 was trained on a massive corpus of text data scraped from the internet (like the Common Crawl dataset), books, and Wikipedia. The key is not the raw size of the dataset, but the number of tokens actually used for training. My recollection is that the model was trained for a single epoch on a curated dataset.

  • D ≈ 300 billion tokens
  • In scientific notation, this is 300 x 10⁹

Step 3: Perform the Calculation

Now we can plug these estimated values into our formula:

Total FLOPs ≈ 6 × P × D

  1. Substitute the values for P and D: Total FLOPs ≈ 6 × (175 × 10⁹) × (300 × 10⁹)

  2. Multiply the constants first: 6 × 175 = 1,050; 1,050 × 300 = 315,000

  3. Multiply the powers of 10: 10⁹ × 10⁹ = 10¹⁸

  4. Combine the results: Total FLOPs ≈ 315,000 × 10¹⁸

  5. Express this in standard scientific notation: Total FLOPs ≈ 3.15 × 10⁵ × 10¹⁸ = 3.15 × 10²³
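
The same arithmetic as a quick numeric check, using the values assumed above:

```python
# Plugging the assumed values into Total FLOPs = 6 * P * D.
P = 175e9                       # parameters
D = 300e9                       # training tokens
total_flops = 6 * P * D
print(f"{total_flops:.2e} FLOPs")   # 3.15e+23 FLOPs
```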


Final Estimate and Conclusion

Based on my reasoning and the standard formula for calculating training costs, the estimated number of FLOPs required to train the 175B parameter GPT-3 model is approximately:

3.15 x 10²³ FLOPs

To put this astronomical number into perspective:

  • A petaFLOP is 10¹⁵ FLOPs. Our estimate is 315,000,000 petaFLOPs.
  • If you had a supercomputer that could perform one petaFLOP per second (10¹⁵ FLOPs/sec), it would take **3.15 x 10⁸ seconds** to complete the training.
  • There are about 31.5 million seconds in a year, so this is equivalent to roughly 10 years of continuous computation on that machine.
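
A short check of these perspective figures, assuming a machine that sustains exactly 1 petaFLOP per second:

```python
# How long 3.15e23 FLOPs takes at a sustained 1 petaFLOP per second.
total_flops = 3.15e23
rate = 1e15                          # FLOPs per second
seconds = total_flops / rate         # 3.15e8 seconds
years = seconds / 31_536_000         # ~31.5 million seconds in a year
print(f"{seconds:.2e} s ≈ {years:.0f} years")   # about 10 years
```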

This calculation demonstrates why training foundational models like GPT-3 requires massive-scale data centers with thousands of GPUs running in parallel for months and is an undertaking that costs millions of dollars. The actual paper from OpenAI stated the training took approximately 3.6 x 10²³ FLOPs, so my estimate is very close to the reported figure.

2 / 25

AI Board Game Logic

reasoning·medium

Tests an AI's ability to understand game rules and strategy

Kimi Linear 48B A3B Instruct

Gemini 2.5 Pro Preview 06-05

3 / 25

Stochastic Consistency Test

analysis·medium

Tests an AI's randomness and creativity

Kimi Linear 48B A3B Instruct

  1. I told my plants I’d water them tomorrow. Now they’re all in a group chat called “Photosyn-thirsty.”
  2. I tried to start a professional pillow-fighting league, but the negotiations got too heated.
  3. My calendar is so empty that even my anxiety forgot to check in.
  4. I asked the gym instructor for a workout that burns the most calories without any effort. He recommended marriage counseling.
  5. I wrote a song about a tortilla. Actually, it’s more of a wrap.

Only one of the jokes (the first one) is unique in its specific wording; the others are original but not unique.

Gemini 2.5 Pro Preview 06-05

Here are 5 jokes:

  1. Why did the scarecrow win an award? Because he was outstanding in his field.

  2. I have a joke about construction... But I'm still working on it.

  3. What do you call a fake noodle? An Impasta.

  4. I told my wife she was drawing her eyebrows too high. She looked surprised.

  5. Why don't scientists trust atoms? Because they make up everything.


How many are unique?

Zero. The real joke is asking an AI for original content.
