What is the difference between DeepSeek R1 and Kimi K2.5?

DeepSeek R1 is developed by DeepSeek, while Kimi K2.5 is developed by Moonshot AI. DeepSeek R1 has a 128K-token context window, versus 262K for Kimi K2.5. You can compare their actual outputs across 35 challenges on RIVAL to see how they differ in practice.

Which is better, DeepSeek R1 or Kimi K2.5?

It depends on your use case. DeepSeek R1 and Kimi K2.5 each have strengths in different areas. RIVAL lets you compare their real outputs side-by-side across 35 challenges so you can judge which fits your needs best.

How much does DeepSeek R1 cost compared to Kimi K2.5?

DeepSeek R1 costs $0.55/M input tokens and Kimi K2.5 costs $0.60/M input tokens, so DeepSeek R1 is $0.05/M cheaper on input. Check their side-by-side outputs on RIVAL to see whether the price difference is justified by quality.

How can I compare DeepSeek R1 and Kimi K2.5 on RIVAL?

This page shows a side-by-side comparison of DeepSeek R1 and Kimi K2.5 across shared challenges. You can vote on which model produced the better output, and Pro users can create custom challenges to test both models with their own prompts.

Rival

DeepSeek R1 vs Kimi K2.5

30 challenges

Why DeepSeek R1?

1.3x cheaper overall: $0.55/M in · $2.19/M out

Why Kimi K2.5?

2x more context: 262K

Leads 2 of 2 benchmarks

Stronger on SWE-bench Verified: 76.8% vs 49.2%

12 months newer: Jan 2026

Favorites

Category · DeepSeek R1 · Kimi K2.5

Movie · The Godfather (1972, 8.7) · Blade Runner (1982, 7.9)

Album · The Dark Side of the Moon (Pink Floyd) · —

Book · Pride and Prejudice (Jane Austen) · —

City · Kyoto (Japan) · —

Game · The Legend of Zelda: Ocarina of Time (Adventure, Action; 4.4) · —

Estimate Complexity · reasoning

Tests an AI's ability to make educated estimates based on technical knowledge

DeepSeek R1

DeepSeek · Feb 2025 · text

No content available for this output.

Kimi K2.5

Moonshot AI · Jan 2026 · text

Here is a step-by-step estimate for the total training FLOPs of GPT-3.

Step 1: Identify the key variables

To estimate training compute for a dense transformer, we need two numbers:

$N$ (Parameters): GPT-3 has 175 billion (1.75 × 10¹¹) parameters.
$D$ (Tokens): GPT-3 was trained on roughly 300 billion (3 × 10¹¹) tokens. (This is the widely cited figure from the original paper for the final training corpus).

Step 2: Use the standard training FLOPs formula

For decoder-only transformers trained with backpropagation, the total floating-point operations (FLOPs) can be approximated as:

$$\text{FLOPs} \approx 6 \times N \times D$$

Why 6?

Forward pass: Processing one token requires roughly 2 FLOPs per parameter (1 multiply + 1 accumulate for the matrix operations).
Backward pass: Computing gradients requires roughly 4 FLOPs per parameter (approximately 2× the forward cost to calculate gradients with respect to both inputs and weights).
Total: $2 + 4 = 6$ FLOPs per parameter per token.

(Note: The $6ND$ term dominates the cost; the quadratic attention cost is negligible compared with the parameter-linear cost for a model of this size.)

Step 3: Calculate

$$\text{FLOPs} \approx 6 \times (1.75 \times 10^{11}) \times (3 \times 10^{11})$$

$$= 6 \times 1.75 \times 3 \times 10^{22}$$

$$= 6 \times 5.25 \times 10^{22}$$

$$= 31.5 \times 10^{22}$$

$$= \mathbf{3.15 \times 10^{23} \text{ FLOPs}}$$
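The arithmetic in Step 3 can be checked with a few lines of Python. This is a minimal sketch: the function name `training_flops` and its parameter names are illustrative and do not come from any library, and the inputs are simply the figures quoted in Step 1.

```python
def training_flops(n_params: float, n_tokens: float, flops_per_param_token: float = 6.0) -> float:
    """Approximate training compute as 6 * N * D (forward + backward pass)."""
    return flops_per_param_token * n_params * n_tokens


N = 1.75e11  # parameters (175 billion), as quoted in Step 1
D = 3.0e11   # training tokens (300 billion), as quoted in Step 1

total = training_flops(N, D)
print(f"{total:.2e} FLOPs")  # prints "3.15e+23 FLOPs"
```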

Step 4: Account for uncertainty and overhead

The raw calculation gives ~3.15 × 10²³ FLOPs. However, we should adjust for real-world factors:

Dataset size variance: If GPT-3 was trained on closer to 400–500B tokens (including potential epochs or different data mixes), the number could be 4–5 × 10²³ FLOPs.
Implementation overhead: Data loading, checkpointing, communication in distributed training, and the Adam optimizer updates add 10–20% overhead.
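As a rough sensitivity check on the token-count uncertainty, the sketch below evaluates the same $6ND$ approximation at 300B, 400B, and 500B tokens. The `overhead` argument is a hypothetical knob for applying a fractional multiplier such as the 10–20% mentioned above; it defaults to zero so the raw figures are shown first.

```python
N_PARAMS = 1.75e11  # GPT-3 parameter count quoted in Step 1


def estimate_flops(n_tokens: float, overhead: float = 0.0) -> float:
    """6 * N * D approximation, optionally scaled by a fractional overhead."""
    return 6.0 * N_PARAMS * n_tokens * (1.0 + overhead)


# Sweep the token counts discussed in the text, with no overhead applied.
for tokens in (3.0e11, 4.0e11, 5.0e11):
    print(f"D = {tokens:.1e} tokens -> {estimate_flops(tokens):.2e} FLOPs")
```

With these inputs the three cases come to 3.15 × 10²³, 4.2 × 10²³, and 5.25 × 10²³ FLOPs before any overhead is applied.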

Final Estimate

Approximately $3 \times 10^{23}$ to $5 \times 10^{23}$ FLOPs (roughly 300 to 500 sextillion FLOPs).

For context, this is equivalent to about 3,500 to 5,800 petaflop/s-days (one petaflop/s-day is $10^{15}$ FLOPs per second sustained for one day, or $8.64 \times 10^{19}$ FLOPs), which aligns with the roughly 3,640 petaflop/s-days reported for GPT-3's training run on V100 GPUs.
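Converting a raw FLOP count into petaflop/s-days is a single division by $10^{15} \times 86{,}400$. The sketch below applies that conversion to the raw Step 3 result and to the upper end of the final estimate; the helper name `flops_to_pfs_days` is illustrative only.

```python
PETA = 1e15               # FLOPs per second in one petaflop/s
SECONDS_PER_DAY = 86_400  # 24 * 60 * 60


def flops_to_pfs_days(total_flops: float) -> float:
    """Convert a total FLOP count into petaflop/s-days."""
    return total_flops / (PETA * SECONDS_PER_DAY)


for flops in (3.15e23, 5.0e23):
    print(f"{flops:.2e} FLOPs ~ {flops_to_pfs_days(flops):,.0f} petaflop/s-days")
```

The raw $3.15 \times 10^{23}$ figure works out to roughly 3,646 petaflop/s-days, and $5 \times 10^{23}$ to roughly 5,787.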

Logic Puzzle · reasoning

Tests an AI's ability to solve a simple but potentially confusing logic puzzle

DeepSeek R1

DeepSeek · Feb 2025 · text

No content available for this output.

Kimi K2.5

Moonshot AI · Jan 2026 · text

No content available for this output.

Stochastic Consistency Test · analysis

Tests an AI's randomness and creativity

DeepSeek R1

DeepSeek · Feb 2025 · text

No content available for this output.

Kimi K2.5

Moonshot AI · Jan 2026 · text

Here are 5 jokes:

Why don't skeletons fight each other? They don't have the guts.
I would tell you a joke about time travel, but you didn't like it.
What do you call a fake noodle? An impasta.
Why did the bicycle fall over? It was two-tired.
I used to play piano by ear, but now I use my hands.

5 are unique (all are distinct from one another).

Note: If "unique" means completely original and never told before, the answer is likely 0, as these follow common joke patterns. But within this set of five, none are duplicates.

SVG Layout Challenge · image generation

Tests an AI's ability to generate vector graphics

DeepSeek R1

DeepSeek · Feb 2025 · svg

No content available for this output.

Kimi K2.5

Moonshot AI · Jan 2026 · svg

Xbox Controller SVG Art · image generation

Tests an AI's ability to create detailed SVG illustrations of gaming hardware

DeepSeek R1

DeepSeek · Feb 2025 · svg

Kimi K2.5

Moonshot AI · Jan 2026 · svg

Realistic AI Interview · conversation

Tests an AI's ability to simulate personalities and predict future trends

DeepSeek R1

DeepSeek · Feb 2025 · text

No content available for this output.

Kimi K2.5

Moonshot AI · Jan 2026 · text

No content available for this output.

Satirical Fake News Headline · conversation

Tests an AI's humor and understanding of current events

DeepSeek R1

DeepSeek · Feb 2025 · text

No content available for this output.

Kimi K2.5

Moonshot AI · Jan 2026 · text

No content available for this output.

Character Voice Test · conversation

Tests an AI's ability to write in distinct character voices

DeepSeek R1

DeepSeek · Feb 2025 · text

No content available for this output.

Kimi K2.5

Moonshot AI · Jan 2026 · text

No content available for this output.

Minimalist Landing Page · web design

Tests an AI's ability to generate a complete, working landing page

DeepSeek R1

DeepSeek · Feb 2025 · website

Kimi K2.5

Moonshot AI · Jan 2026 · website

Pokémon Battle UI Recreation · web design

Recreate an interactive, nostalgic Pokémon battle UI in a single HTML file.

DeepSeek R1

DeepSeek · Feb 2025 · website

Kimi K2.5

Moonshot AI · Jan 2026 · website

Mario Level UI Recreation · web design

Recreate an interactive, classic Mario level in a single HTML file.

DeepSeek R1

DeepSeek · Feb 2025 · website

Kimi K2.5

Moonshot AI · Jan 2026 · website

Linear App Clone · web design

Tests an AI's ability to replicate an existing UI with Tailwind CSS

DeepSeek R1

DeepSeek · Feb 2025 · website

No content available for this output.

Kimi K2.5

Moonshot AI · Jan 2026 · website

No content available for this output.

