What is the difference between Claude 3.7 Thinking Sonnet and Gemini 3.1 Pro Preview?

Claude 3.7 Thinking Sonnet is developed by Anthropic while Gemini 3.1 Pro Preview is developed by Google AI. Claude 3.7 Thinking Sonnet has a 200K token context window vs Gemini 3.1 Pro Preview's 1.0M. You can compare their actual outputs across 46 challenges on RIVAL to see how they differ in practice.

Which is better, Claude 3.7 Thinking Sonnet or Gemini 3.1 Pro Preview?

It depends on your use case. Claude 3.7 Thinking Sonnet and Gemini 3.1 Pro Preview each have strengths in different areas. RIVAL lets you compare their real outputs side-by-side across 46 challenges so you can judge which fits your needs best.

How much does Claude 3.7 Thinking Sonnet cost compared to Gemini 3.1 Pro Preview?

Claude 3.7 Thinking Sonnet costs $6/M input tokens and Gemini 3.1 Pro Preview costs $2/M input tokens, so Gemini 3.1 Pro Preview is $4.00/M cheaper on input tokens. Check their side-by-side outputs on RIVAL to see if the price difference is justified by quality.

How can I compare Claude 3.7 Thinking Sonnet and Gemini 3.1 Pro Preview on RIVAL?

This page shows a side-by-side comparison of Claude 3.7 Thinking Sonnet and Gemini 3.1 Pro Preview across shared challenges. You can vote on which model produced the better output, and Pro users can create custom challenges to test both models with their own prompts.

Claude 3.7 Thinking Sonnet vs Gemini 3.1 Pro Preview

41 challenges

Why Claude 3.7 Thinking Sonnet?

No standout advantages

Why Gemini 3.1 Pro Preview?

2.6x cheaper overall: $2.00/M in · $12.00/M out

5.2x more context: 1.0M

Leads 1 of 1 benchmarks

Stronger on GPQA Diamond: 94.3% vs 84.8%

12 months newer: Feb 2026

Favorites

Claude 3.7 Thinking Sonnet vs Gemini 3.1 Pro Preview

Movie: The Shawshank Redemption (1994, 8.7) vs The Matrix (1999, 8.2)

Album: Kind of Blue (Miles Davis) vs OK Computer (Radiohead)

Book: Frankenstein or The Modern Prometheus (Mary Shelley) vs Dune (Frank Herbert)

City: Amsterdam (Netherlands) vs Tokyo (Japan)

Game: Portal 2 (Shooter, Puzzle, 4.6) vs Portal (Action, Puzzle, 4.5)

Estimate Complexity · reasoning

Tests an AI's ability to make educated estimates based on technical knowledge

Claude 3.7 Thinking Sonnet

Anthropic · Feb 2025 · text

No content available for this output.

Gemini 3.1 Pro Preview

Google · Feb 2026 · text

To estimate the total number of FLOPs (Floating Point Operations) required to train GPT-3, we can use a well-established heuristic in deep learning for estimating the compute cost of training Transformer-based large language models.

The estimated total compute is roughly $3.15 \times 10^{23}$ FLOPs.

Here is the step-by-step reasoning used to arrive at this number:

Step 1: Identify the key specifications of GPT-3

To calculate the compute required, we need to know two fundamental numbers about the model:

$N$ (Number of Parameters): GPT-3 has 175 billion parameters ($1.75 \times 10^{11}$).
$D$ (Number of Training Tokens): The GPT-3 paper states the model was trained on 300 billion tokens ($3.0 \times 10^{11}$).

Step 2: Establish the FLOPs formula

In language model training, the total compute ($C$) is heavily dominated by the matrix multiplications in the dense layers. OpenAI's scaling laws provide a standard, highly accurate approximation formula for this: $C \approx 6 \times N \times D$

Why the multiplier of 6?

Forward Pass: For every token processed, multiplying it by a single parameter requires 1 Multiply-Accumulate (MAC) operation. A MAC consists of 2 FLOPs (one multiplication and one addition). Therefore, the forward pass takes roughly $2ND$ FLOPs.
Backward Pass: Calculating the gradients during backpropagation requires about twice as much compute as the forward pass (you have to calculate the gradient with respect to the weights and the gradient with respect to the activations). Therefore, the backward pass takes roughly $4ND$ FLOPs.
Adding them together gives $2ND + 4ND =$ $6ND$ FLOPs.

(Note: This formula slightly underestimates by ignoring the quadratic cost of attention and optimizer state updates, but at the 175B scale, the dense feed-forward matrices dominate the compute so heavily that $6ND$ is considered the gold-standard approximation).
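The claim that attention is a small correction at this scale can be checked with a rough sketch. It assumes the commonly cited per-token forward-pass approximation of $2N + 2 \, n_{layer} \, n_{ctx} \, d_{attn}$ FLOPs, plus GPT-3 175B architecture values (96 layers, 2,048-token context, hidden width 12,288) that are not stated in the output above; treat all of these as assumptions rather than part of the original estimate.

```python
# Assumed GPT-3 175B architecture values (not given in the text above).
N_PARAMS = 175 * 10**9   # N, number of parameters
N_LAYER = 96             # assumed number of Transformer layers
N_CTX = 2048             # assumed context length in tokens
D_ATTN = 12288           # assumed attention (hidden) width


def forward_flops_per_token(n_params, n_layer, n_ctx, d_attn):
    """Return (dense, attention) forward-pass FLOPs per token.

    dense     ~ 2 * N                      (one MAC per parameter)
    attention ~ 2 * n_layer * n_ctx * d_attn (context-dependent term)
    """
    dense = 2 * n_params
    attention = 2 * n_layer * n_ctx * d_attn
    return dense, attention


dense, attention = forward_flops_per_token(N_PARAMS, N_LAYER, N_CTX, D_ATTN)
attention_share = attention / (dense + attention)
print(f"attention share of forward FLOPs: {attention_share:.2%}")
```

Under these assumptions the attention term comes to a little over 1% of the forward-pass FLOPs, which is consistent with treating $6ND$ as a slight underestimate.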

Step 3: Perform the calculation

Now we simply plug the numbers into our formula:

$N = 175 \times 10^9$
$D = 300 \times 10^9$

First, multiply $N$ and $D$:

$N \times D = (175 \times 10^9) \times (300 \times 10^9)$
$N \times D = 52,500 \times 10^{18}$
$N \times D = 5.25 \times 10^{22}$

Next, multiply by 6 (for the forward and backward passes):

Total FLOPs = $6 \times 5.25 \times 10^{22}$
Total FLOPs = $31.5 \times 10^{22}$ or $3.15 \times 10^{23}$
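The arithmetic in Steps 2 and 3 can be reproduced with a minimal sketch. The function name `training_flops` is hypothetical and chosen for illustration; the inputs are the $N$ and $D$ values quoted in Step 1.

```python
def training_flops(n_params: int, n_tokens: int) -> dict:
    """Split the 6*N*D estimate into forward and backward passes."""
    forward = 2 * n_params * n_tokens    # 1 MAC = 2 FLOPs per parameter per token
    backward = 4 * n_params * n_tokens   # roughly twice the forward cost
    return {"forward": forward, "backward": backward, "total": forward + backward}


N = 175 * 10**9   # parameters
D = 300 * 10**9   # training tokens

flops = training_flops(N, D)
print(f"N * D       = {N * D:.3e}")
print(f"forward     = {flops['forward']:.3e}")
print(f"backward    = {flops['backward']:.3e}")
print(f"total (6ND) = {flops['total']:.3e}")
```

Using Python integers keeps the product exact, so the total can be compared directly against $3.15 \times 10^{23}$.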

Step 4: Sanity Check

Does $3.15 \times 10^{23}$ FLOPs make sense in the real world? When GPT-3 was trained (around late 2019/early 2020), Nvidia V100 GPUs were the standard.

A V100 GPU has a peak theoretical output of roughly 120 TeraFLOPs ($1.2 \times 10^{14}$ FLOPs) per second in mixed precision.
Real-world hardware utilization (MFU) for training large models is usually around 30% to 40%, yielding roughly $4 \times 10^{13}$ effective FLOPs per second per GPU.
If OpenAI used a cluster of 10,000 V100 GPUs, the cluster would output $4 \times 10^{17}$ FLOPs per second.
Divide the total FLOPs by the cluster output: $(3.15 \times 10^{23}) / (4 \times 10^{17}) \approx 787,500$ seconds.
787,500 seconds is about 9.1 days.

Given that training runs of this size typically take a few weeks on a few thousand GPUs (factoring in downtime, checkpointing, and slightly smaller clusters), this estimate is in the right order of magnitude.
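The sanity check is sensitive to the assumed utilization and cluster size, so it helps to vary both. The sketch below reuses the figures from Step 4 (120 TFLOPs peak, 30% to 40% utilization, 10,000 GPUs); the 1,000-GPU case and the helper name `training_days` are added here purely for illustration.

```python
SECONDS_PER_DAY = 86_400
TOTAL_FLOPS = 3.15e23   # estimate from Step 3
V100_PEAK = 1.2e14      # peak FLOPs per second per GPU, as quoted in Step 4


def training_days(total_flops, flops_per_gpu, n_gpus):
    """Wall-clock days at a sustained per-GPU throughput, ignoring downtime."""
    return total_flops / (flops_per_gpu * n_gpus) / SECONDS_PER_DAY


# The rounded figure used in Step 4: 4e13 effective FLOPs/s on 10,000 GPUs.
days_rounded = training_days(TOTAL_FLOPS, 4e13, 10_000)
print(f"4e13 FLOPs/s x 10,000 GPUs: {days_rounded:.1f} days")

# Vary utilization and cluster size (the 1,000-GPU case is illustrative).
for mfu in (0.30, 0.40):
    for n_gpus in (1_000, 10_000):
        days = training_days(TOTAL_FLOPS, mfu * V100_PEAK, n_gpus)
        print(f"MFU {mfu:.0%}, {n_gpus:>6,} GPUs: {days:6.1f} days")
```

With these inputs, the 10,000-GPU cases land at roughly one to one and a half weeks, and a cluster one tenth the size scales the duration by ten.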

Conclusion: The total compute required to train GPT-3 was approximately $3.15 \times 10^{23}$ FLOPs.

AI Board Game Logic · reasoning

Tests an AI's ability to understand game rules and strategy

Claude 3.7 Thinking Sonnet

Anthropic · Feb 2025 · svg

No content available for this output.

Gemini 3.1 Pro Preview

Google · Feb 2026 · svg

Character Voice Test · conversation

Tests an AI's ability to write in distinct character voices

Claude 3.7 Thinking Sonnet

Anthropic · Feb 2025 · text

No content available for this output.

Gemini 3.1 Pro Preview

Google · Feb 2026 · text

No content available for this output.

Pokémon Battle UI Recreation · web design

Recreate an interactive, nostalgic Pokémon battle UI in a single HTML file.

Claude 3.7 Thinking Sonnet

Anthropic · Feb 2025 · website

No content available for this output.

Gemini 3.1 Pro Preview

Google · Feb 2026 · website
