
Kimi Linear 48B A3B Instruct vs Gemini 2.5 Pro Preview 06-05

25 challenges

Why Kimi Linear 48B A3B Instruct?

13.5x cheaper overall: $0.30/M in · $0.60/M out
5 months newer: Nov 2025

Why Gemini 2.5 Pro Preview 06-05?

No standout advantages
1 / 25

Estimate Complexity

reasoning·hard

Tests an AI's ability to make educated estimates based on technical knowledge

Kimi Linear 48B A3B Instruct

Below is a back-of-the-envelope estimate of the total FLOPs needed to train GPT-3 “175 B” (the largest 96-layer, 12288-d model with 175 billion parameters) from scratch on the full 300 B-token corpus, assuming the standard transformer training setup used by OpenAI.

Step 1 – FLOPs per forward+backward pass
A single transformer block has

  • 2 matrix-multiply operations per attention layer (QKᵀ and OV)
  • 2 matrix-multiply operations per MLP (Wi and Wo)
  • 1 extra attention output projection (optional, but small)
  • 1 residual add per block

For 96 layers and 12288 hidden units, each matrix multiply is roughly
(12288²) ≈ 1.5 × 10⁸ FLOPs.
So per block: 2 attention + 2 MLP ≈ 8 × 1.5 × 10⁸ ≈ 1.2 × 10⁹ FLOPs.
Add residual adds and layer norms (~10 % overhead) ⇒ ≈ 1.3 × 10⁹ FLOPs per block.

With 96 blocks, forward+backward per token ≈ 1.25 × 10¹¹ FLOPs ≈ 125 GFLOPs.
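
As a quick sanity check, here is a minimal Python sketch of the tally above. It keeps the same simplifications: four d×d matrix multiplies per block, two FLOPs per multiply-accumulate, and a flat 10 % allowance for residual adds and layer norms (it ignores the sequence-length-dependent attention scores and the wider MLP matrices).

```python
# Back-of-envelope per-token FLOP tally, following the simplified
# accounting above: four d x d matmuls per block, two FLOPs per
# multiply-accumulate, plus ~10% for residual adds and layer norms.
d_model = 12288        # hidden size of GPT-3 175B
n_layers = 96          # number of transformer blocks

matmuls_per_block = 4                        # 2 attention + 2 MLP (simplified)
flops_per_matmul = 2 * d_model ** 2          # ~3.0e8 (a multiply-add counted as 2 FLOPs)
per_block = 1.10 * matmuls_per_block * flops_per_matmul   # ~1.3e9 FLOPs per block
per_token = n_layers * per_block             # ~1.3e11 FLOPs per token (~125-130 GFLOPs)

print(f"per block: {per_block:.2e} FLOPs, per token: {per_token:.2e} FLOPs")
```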

Step 2 – Tokens processed per forward+backward step
The batch size is 3,200 sequences × 2,048 tokens = 6.55 million tokens.
Thus each SGD step consumes ≈ 6.6 × 10⁶ tokens.

Step 3 – Total SGD steps
300 billion tokens / 6.6 million tokens per step ≈ 45,500 steps.

Step 4 – Total FLOPs
Each step processes ≈ 6.6 × 10⁶ tokens at ≈ 125 GFLOPs/token, i.e. ≈ 8.2 × 10¹⁷ FLOPs per step.
45,500 steps × 8.2 × 10¹⁷ FLOPs/step ≈ 3.7 × 10²² FLOPs (equivalently, 1.25 × 10¹¹ FLOPs/token × 3 × 10¹¹ tokens), or roughly 430 petaflop/s-days.
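
Continuing the sketch for Steps 2 to 4, with the same assumed batch shape and per-token figure:

```python
# Steps 2-4: tokens per optimizer step, step count, and the compute-only total.
per_token_flops = 125e9                    # ~125 GFLOPs/token from Step 1
tokens_per_step = 3200 * 2048              # ~6.55e6 tokens per batch
total_tokens = 300e9                       # full training corpus

steps = total_tokens / tokens_per_step               # ~45,800 (text rounds to ~45,500)
flops_per_step = per_token_flops * tokens_per_step   # ~8.2e17 FLOPs per step
total_flops = per_token_flops * total_tokens         # ~3.75e22 FLOPs before overhead

print(f"{steps:,.0f} steps, {flops_per_step:.1e} FLOPs/step, {total_flops:.2e} FLOPs")
```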

Step 5 – Overhead for data loading, optimizer states, and communication

  • Adam: 8 extra copies of the parameter matrix (2 × 4 bytes per float32) → +50 %
  • Gradient communication and padding → +20 %
  • Data loading and tokenization → +10 %

Overall multiply by 1.8 ⇒ ≈ 6.8 × 10²² FLOPs ≈ 780 petaflop/s-days.

Step 6 – Round to one significant figure
≈ 7 × 10²² FLOPs, i.e. on the order of 10²³ FLOPs.
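
And the last two steps, applying the assumed 1.8× overhead factor:

```python
# Steps 5-6: apply the assumed overhead factors and convert the result.
total_flops = 3.75e22                # compute-only total from Step 4
overhead = 1.0 + 0.5 + 0.2 + 0.1     # optimizer + communication + data pipeline, ~1.8x
with_overhead = overhead * total_flops            # ~6.8e22 FLOPs

pf_s_day = 1e15 * 86_400             # FLOPs in one petaflop/s-day
print(f"{with_overhead:.1e} FLOPs ≈ {with_overhead / pf_s_day:,.0f} petaflop/s-days")
```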

So a rough estimate is on the order of 10²³ total floating-point operations to train GPT-3 175 B from scratch on 300 B tokens with the standard transformer architecture and training setup.

Gemini 2.5 Pro Preview 06-05

Of course. Here is an estimate of the total number of FLOPs required to train GPT-3, based on my internal knowledge and first principles, without searching online.

The Logic and Formula

To estimate the computational cost of training a large language model like GPT-3, we can use a well-established formula. The total number of Floating Point Operations (FLOPs) is approximately:

Total FLOPs ≈ 6 × P × D

Let's break down what each part of this formula means:

  • P: The number of parameters in the model. This is the total count of all the weights and biases that the model learns during training.
  • D: The total number of tokens in the training dataset. This represents the volume of data the model processes.
  • 6: This is a constant factor that provides a good approximation for transformer-based models. It comes from:
    • Forward Pass (2 × P × D): For each token processed, the number of operations is roughly proportional to the number of parameters. A common rule of thumb is 2 * P FLOPs per token for a forward pass (one multiplication and one addition for each parameter). So the total for one pass over the data is 2 * P * D.
    • Backward Pass (2 × Forward Pass): The backward pass (backpropagation), where the model calculates gradients to update its parameters, is computationally more intensive. It's typically estimated to require about twice the operations of the forward pass. So, 2 * (2 * P * D) = 4 * P * D.
    • Total: Adding the forward and backward pass gives us 2PD + 4PD = 6PD.
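
A minimal code sketch of this accounting, assuming the usual rule of thumb of 2 FLOPs per parameter per token for the forward pass and twice that for the backward pass:

```python
def training_flops(params: float, tokens: float) -> dict:
    """Rough 6 * P * D training-compute estimate for a dense transformer."""
    forward = 2 * params * tokens     # ~2 FLOPs per parameter per token
    backward = 4 * params * tokens    # backward pass ~2x the forward pass
    return {"forward": forward, "backward": backward, "total": forward + backward}
```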

Now, let's estimate the values for P and D for GPT-3.


Step-by-Step Estimation

Step 1: Estimate the Number of Parameters (P)

My internal knowledge base indicates that the largest version of GPT-3, which is the one most people refer to, has a specific number of parameters.

  • P ≈ 175 billion parameters
  • In scientific notation, this is 175 x 10⁹

Step 2: Estimate the Training Dataset Size (D)

GPT-3 was trained on a massive corpus of text data scraped from the internet (like the Common Crawl dataset), books, and Wikipedia. The key is not the raw size of the dataset, but the number of tokens actually used for training. My recollection is that the model was trained for a single epoch on a curated dataset.

  • D ≈ 300 billion tokens
  • In scientific notation, this is 300 x 10⁹

Step 3: Perform the Calculation

Now we can plug these estimated values into our formula:

Total FLOPs ≈ 6 × P × D

  1. Substitute the values for P and D: Total FLOPs ≈ 6 × (175 × 10⁹) × (300 × 10⁹)

  2. Multiply the constants first: 6 × 175 = 1,050; 1,050 × 300 = 315,000

  3. Multiply the powers of 10: 10⁹ × 10⁹ = 10¹⁸

  4. Combine the results: Total FLOPs ≈ 315,000 × 10¹⁸

  5. Express this in standard scientific notation: Total FLOPs ≈ 3.15 × 10⁵ × 10¹⁸ = 3.15 × 10²³
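
The same arithmetic as a quick numeric check, using the values assumed above:

```python
# Plugging the assumed values into Total FLOPs = 6 * P * D.
P = 175e9                       # parameters
D = 300e9                       # training tokens
total_flops = 6 * P * D
print(f"{total_flops:.2e} FLOPs")   # 3.15e+23 FLOPs
```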


Final Estimate and Conclusion

Based on my reasoning and the standard formula for calculating training costs, the estimated number of FLOPs required to train the 175B parameter GPT-3 model is approximately:

3.15 x 10²³ FLOPs

To put this astronomical number into perspective:

  • A petaFLOP is 10¹⁵ FLOPs. Our estimate is 315,000,000 petaFLOPs.
  • If you had a supercomputer that could perform one petaFLOP per second (10¹⁵ FLOPs/sec), it would take **3.15 x 10⁸ seconds** to complete the training.
  • There are about 31.5 million seconds in a year, so this is equivalent to roughly 10 years of continuous computation on that machine.
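
A short check of these perspective figures, assuming a machine that sustains exactly 1 petaFLOP per second:

```python
# How long 3.15e23 FLOPs takes at a sustained 1 petaFLOP per second.
total_flops = 3.15e23
rate = 1e15                          # FLOPs per second
seconds = total_flops / rate         # 3.15e8 seconds
years = seconds / 31_536_000         # ~31.5 million seconds in a year
print(f"{seconds:.2e} s ≈ {years:.0f} years")   # about 10 years
```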

This calculation demonstrates why training foundational models like GPT-3 requires massive-scale data centers with thousands of GPUs running in parallel for months and is an undertaking that costs millions of dollars. The actual paper from OpenAI stated the training took approximately 3.6 x 10²³ FLOPs, so my estimate is very close to the reported figure.

2 / 25

AI Board Game Logic

reasoning·medium

Tests an AI's ability to understand game rules and strategy

Kimi Linear 48B A3B Instruct

Gemini 2.5 Pro Preview 06-05

3 / 25

Stochastic Consistency Test

analysis·medium

Tests an AI's randomness and creativity

Kimi Linear 48B A3B Instruct

  1. I told my plants I’d water them tomorrow. Now they’re all in a group chat called “Photosyn-thirsty.”
  2. I tried to start a professional pillow-fighting league, but the negotiations got too heated.
  3. My calendar is so empty that even my anxiety forgot to check in.
  4. I asked the gym instructor for a workout that burns the most calories without any effort. He recommended marriage counseling.
  5. I wrote a song about a tortilla. Actually, it’s more of a wrap.

Only one of the jokes (the first one) is unique in its specific wording; the others are original but not unique.

Gemini 2.5 Pro Preview 06-05

Here are 5 jokes:

  1. Why did the scarecrow win an award? Because he was outstanding in his field.

  2. I have a joke about construction... But I'm still working on it.

  3. What do you call a fake noodle? An Impasta.

  4. I told my wife she was drawing her eyebrows too high. She looked surprised.

  5. Why don't scientists trust atoms? Because they make up everything.


How many are unique?

Zero. The real joke is asking an AI for original content.
