
Gemini Pro 1.0 vs Kimi Linear 48B A3B Instruct

27 challenges

Why Gemini Pro 1.0?

No standout advantages

Why Kimi Linear 48B A3B Instruct?

  • 2.3x cheaper overall ($0.30/M in · $0.60/M out)
  • 32x more context (1.0M)
  • 23 months newer (Nov 2025)

Estimate Complexity

reasoning·hard

Tests an AI's ability to make educated estimates based on technical knowledge

Gemini Pro 1.0

Estimating GPT-3 Training FLOPs: A Step-by-Step Approach

Without access to online resources, we can estimate the FLOPs (Floating-point Operations) needed for GPT-3 training by considering the following factors:

1. Model Size:

  • GPT-3 has 175 billion parameters.
  • Each parameter update involves two multiplications and one addition for each floating-point number.
  • Therefore, each parameter update requires 3 FLOPs.
  • Total FLOPs per update: 175 billion parameters * 3 FLOPs/parameter = 525 billion FLOPs.

2. Training Data Size:

  • GPT-3 was trained on a massive dataset of text and code.
  • Let's assume the dataset size is 500 billion words.
  • Each word is represented by an embedding vector of size 1,024.
  • Therefore, the total number of floating-point numbers in the dataset is 500 billion words * 1,024 = 512 trillion.

3. Number of Training Steps:

  • The exact number of training steps is unknown, but it likely involves millions of iterations over the entire dataset.
  • Let's assume there are 10 million training steps.

4. Optimization Algorithm:

  • GPT-3 likely uses Adam optimizer, which involves additional computations for updating parameters.
  • The exact FLOPs associated with the optimizer are difficult to estimate without knowing the specific hyperparameters used.
  • For simplicity, we can assume an additional 10 FLOPs per parameter update.

5. Hardware Efficiency:

  • Modern hardware like GPUs and TPUs can perform several FLOPs per clock cycle.
  • Let's assume an average efficiency of 5 FLOPs per clock cycle.

Putting it all together:

  • Total FLOPs per update: 525 billion (parameter updates) + 10 billion (optimizer) = 535 billion FLOPs
  • Total FLOPs for training: 535 billion FLOPs/update * 10 million updates = 5.35 * 10^18 FLOPs
  • Adjusted for hardware efficiency: 5.35 * 10^18 FLOPs / 5 FLOPs/cycle = 1.07 * 10^18 clock cycles

Therefore, the estimated total FLOPs required to train GPT-3 without searching online is approximately 1.07 * 10^18 clock cycles.
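
For reference, here is a minimal Python sketch that simply re-runs the arithmetic of this estimate; every input (3 FLOPs per parameter, the 10-billion-FLOP optimizer term as used in the total above, 10 million steps, 5 FLOPs per clock cycle) is an assumption taken from the response, not a verified figure:

```python
# Reproduces the arithmetic of the estimate above; all inputs are the
# response's own assumptions, not verified figures.

params = 175e9            # GPT-3 parameter count
flops_per_param = 3       # assumed 2 multiplications + 1 addition per parameter
optimizer_flops = 10e9    # optimizer overhead per update, as used in the total above
steps = 10e6              # assumed number of training steps
flops_per_cycle = 5       # assumed hardware FLOPs per clock cycle

flops_per_update = params * flops_per_param + optimizer_flops   # ≈ 5.35e11
total_flops = flops_per_update * steps                          # ≈ 5.35e18
clock_cycles = total_flops / flops_per_cycle                    # ≈ 1.07e18

print(f"FLOPs per update: {flops_per_update:.3e}")
print(f"Total training FLOPs: {total_flops:.3e}")
print(f"Clock cycles at {flops_per_cycle} FLOPs/cycle: {clock_cycles:.3e}")
```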

Important Caveats:

  • This is a rough estimate and may not be accurate due to several unknown factors, including the exact training data size, number of training steps, optimizer settings, and hardware efficiency.
  • The actual FLOPs could be significantly higher or lower depending on the specific training details.

Additionally, it's important to remember that this is just an estimation of the computational cost. Training a model like GPT-3 also requires significant energy consumption and environmental impact.

Kimi Linear 48B A3B Instruct

Below is a back-of-the-envelope estimate of the total FLOPs needed to train GPT-3 “175 B” (the largest 96-layer, 12288-d model with 175 billion parameters) from scratch on the full 300 B-token corpus, assuming the standard transformer training setup used by OpenAI.

Step 1 – FLOPs per forward+backward pass
A single transformer block has

  • 2 matrix-multiply operations per attention layer (QKᵀ and OV)
  • 2 matrix-multiply operations per MLP (Wi and Wo)
  • 1 extra attention output projection (optional, but small)
  • 1 residual add per block

For 96 layers and 12288 hidden units, each matrix multiply is roughly
(12288²) ≈ 1.5 × 10⁸ FLOPs.
So per block: 2 attention + 2 MLP ≈ 8 × 1.5 × 10⁸ ≈ 1.2 × 10⁹ FLOPs.
Add residual adds and layer norms (~10 % overhead) ⇒ ≈ 1.3 × 10⁹ FLOPs per block.

With 96 blocks, forward+backward per token ≈ 1.25 × 10¹¹ FLOPs ≈ 125 GFLOPs.
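
A short sketch of the Step 1 arithmetic, reproducing the response's own figures (12288 hidden size, 8 matrix multiplies per block, ~10 % overhead, 96 blocks); it is not an independent FLOP count:

```python
# Reproduces the per-token forward+backward estimate from Step 1 above.
hidden = 12288                      # model width assumed in the response
matmul_flops = hidden ** 2          # ≈ 1.5e8 FLOPs per matrix multiply, as stated
per_block = 8 * matmul_flops        # "2 attention + 2 MLP" counted as 8 multiplies
per_block *= 1.1                    # ~10% overhead for residual adds and layer norms
per_token = 96 * per_block          # ≈ 1.3e11 FLOPs (the response rounds to ≈ 125 GFLOPs)

print(f"Per-block FLOPs: {per_block:.2e}")
print(f"Forward+backward FLOPs per token: {per_token:.2e}")
```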

Step 2 – Tokens processed per forward+backward step
The batch size is 3,200 sequences × 2,048 tokens = 6.55 million tokens.
Thus each SGD step consumes ≈ 6.6 × 10⁶ tokens.

Step 3 – Total SGD steps
300 billion tokens / 6.6 million tokens per step ≈ 45,500 steps.

Step 4 – Total FLOPs
45,500 steps × 125 GFLOPs/step ≈ 5.7 × 10¹² FLOPs ≈ 5.7 petaflop-days.

Step 5 – Overhead for data loading, optimizer states, and communication

  • Adam: 8 extra copies of the parameter matrix (2 × 4 bytes per float32) → +50 %
  • Gradient communication and padding → +20 %
  • Data loading and tokenization → +10 %

Overall multiply by 1.8 ⇒ ≈ 1.0 × 10¹³ FLOPs ≈ 10 petaflop-days.

Step 6 – Round to one significant figure
≈ 10¹³ FLOPs, i.e. ten petaflop-days.

So a rough estimate is on the order of 10¹³ total floating-point operations to train GPT-3 175 B from scratch on 300 B tokens with the standard transformer architecture and training setup.
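
Steps 2 through 6 chain together as in the minimal sketch below, which simply carries the figures quoted above (the batch shape, the 300 B-token budget, the Step 4 total, and the 1.8× overhead factor) through the same arithmetic:

```python
# Carries the figures from Steps 2-6 of the response through the same arithmetic.

tokens_per_step = 3200 * 2048                 # Step 2: ≈ 6.55e6 tokens per optimizer step
total_tokens = 300e9                          # assumed training corpus (300 B tokens)
num_steps = total_tokens / tokens_per_step    # Step 3: ≈ 4.6e4 steps

base_flops = 5.7e12                           # Step 4: total as stated in the response

# Step 5: overhead factors quoted above (+50% optimizer, +20% communication, +10% data)
overhead = 1 + 0.5 + 0.2 + 0.1                # ⇒ ×1.8
total_flops = base_flops * overhead           # Step 6: ≈ 1.0e13 FLOPs

print(f"Tokens per step: {tokens_per_step:,}")
print(f"Optimizer steps: {num_steps:,.0f}")
print(f"Total FLOPs with overhead: {total_flops:.1e}")
```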


AI Board Game Logic

reasoning·medium

Tests an AI's ability to understand game rules and strategy

Gemini Pro 1.0

Kimi Linear 48B A3B Instruct


Logic Puzzle

reasoning·medium

Tests an AI's ability to solve a simple but potentially confusing logic puzzle

Gemini Pro 1.0

Sally has 1 sister.

Explanation:

  • Sally is a girl.
  • She has 3 brothers.
  • All the siblings share the same sisters. Since Sally is one of the sisters, and each brother has 2 sisters in total, there must be one other girl besides Sally in the family.
  • Therefore, Sally has 1 sister.

Kimi Linear 48B A3B Instruct

Sally has 2 sisters.

Here's the reasoning:
Sally is one of the sisters. Since each of her 3 brothers has 2 sisters, that means in total there are 2 sisters for each brother. Since Sally is one of them, the other sister must be another girl.
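
The two answers disagree, so as a quick sanity check here is a tiny brute-force sketch under the usual reading of the puzzle (one family; Sally is a girl; she has 3 brothers; every brother has exactly 2 sisters):

```python
# Brute force over possible family compositions: Sally, her 3 brothers, and
# some number of other girls. Keep the compositions where each brother has
# exactly 2 sisters, then report how many sisters Sally has.

brothers = 3
for other_girls in range(0, 5):
    girls = 1 + other_girls                  # Sally plus the other girls
    sisters_per_brother = girls              # each brother's sisters = all the girls
    if sisters_per_brother == 2:
        print(f"Family: {brothers} brothers, {girls} girls -> Sally has {girls - 1} sister")
```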
