What is the difference between GPT-5.1 and Grok 3 Thinking?

GPT-5.1 is developed by OpenAI, while Grok 3 Thinking is developed by xAI. GPT-5.1 has a 400K-token context window, versus 128K for Grok 3 Thinking. You can compare their actual outputs across 14 challenges on Rival to see how they differ in practice.

Which is better, GPT-5.1 or Grok 3 Thinking?

It depends on your use case. GPT-5.1 and Grok 3 Thinking each have strengths in different areas. Rival lets you compare their real outputs side-by-side across 14 challenges so you can judge which fits your needs best.

How can I compare GPT-5.1 and Grok 3 Thinking on Rival?

This page shows a side-by-side comparison of GPT-5.1 and Grok 3 Thinking across shared challenges. You can vote on which model produced the better output, and Pro users can create custom challenges to test both models with their own prompts.

Rival

Updated Nov 13, 2025

GPT-5.1 vs Grok 3 Thinking

14 fights queued

Why GPT-5.1?

3.1x more context: 400K

9 months newer: Nov 2025

Why Grok 3 Thinking?

Dead even. This one's a coin flip.

| | GPT-5.1 | Grok 3 Thinking |
|---|---|---|
| Input price | $1.25/M | — |
| Output price | $10.00/M | — |
| Context | 400K | 128K |
| Released | Nov 2025 | Feb 2025 |
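As a rough illustration of how the listed figures translate into numbers, the sketch below computes the context-window ratio and the cost of a single GPT-5.1 request from the per-million-token prices in the table. The request sizes (100,000 input tokens, 2,000 output tokens) and the function name are hypothetical, chosen only for the example; no Grok 3 Thinking price is listed, so it is left out.

```python
# Prices and context sizes are taken from the comparison table above.
INPUT_PRICE_PER_M = 1.25    # USD per million input tokens (GPT-5.1)
OUTPUT_PRICE_PER_M = 10.00  # USD per million output tokens (GPT-5.1)
GPT51_CONTEXT = 400_000     # tokens
GROK3_CONTEXT = 128_000     # tokens


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost of one request, assuming simple per-token billing (an assumption)."""
    input_cost = input_tokens / 1_000_000 * INPUT_PRICE_PER_M
    output_cost = output_tokens / 1_000_000 * OUTPUT_PRICE_PER_M
    return input_cost + output_cost


# Ratio behind the "3.1x more context" claim.
context_ratio = GPT51_CONTEXT / GROK3_CONTEXT

# Hypothetical request: 100K tokens in, 2K tokens out.
cost = request_cost_usd(100_000, 2_000)

print(f"context ratio: {context_ratio:.3f}x")
print(f"example request cost: ${cost:.3f}")
```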

Favorites

| | GPT-5.1 | Grok 3 Thinking |
|---|---|---|
| Movie | The Matrix (1999, 8.2) | — |
| Album | Kind of Blue (Miles Davis) | — |
| Book | The Brothers Karamazov (Fyodor Mikhailovich Dostoevsky) | — |
| City | Tokyo (Japan) | — |
| Game | The Legend of Zelda: Breath of the Wild (Adventure, Action; 4.5) | — |

12 / 14

Estimate Complexity · reasoning

Tests an AI's ability to make educated estimates based on technical knowledge

GPT-5.1

OpenAI · Nov 2025 · text

The total training compute for GPT‑3 (175B) is on the order of \(10^{23}\) floating‑point operations. A reasonable point estimate is \(\sim 3\times 10^{23}\) FLOPs.

Here’s the step‑by‑step reasoning from basic principles.

1. Key known quantities

From the GPT‑3 paper (architecture details are well known):

- Number of parameters \(N \approx 175 \text{ billion} = 1.75\times 10^{11}\)
- Number of training tokens \(T \approx 300 \text{ billion} = 3\times 10^{11}\)

We’ll assume:

- Standard dense Transformer decoder-only model
- Standard training: forward pass + backward pass + optimizer step
- We want the total training FLOPs over all tokens and all steps.

2. FLOPs per parameter per token

For a dense neural network, a single weight typically participates in one multiply-add per forward pass per “use”:

1 multiplication + 1 addition ≈ 2 FLOPs per weight usage.

For large Transformers, most compute is in matrix multiplications in the attention and MLP blocks; these are roughly proportional to the total number of parameters, so a common back-of-the-envelope is:

Forward pass ≈ \(2N\) FLOPs per token
(each parameter used once in a multiply-add on average)

Backward pass is more expensive because you’re computing gradients:

Backward pass is roughly 2× the forward pass cost in terms of FLOPs
→ ≈ \(4N\) FLOPs per token

On top of that you have:

Some extra overhead for:
- Nonlinearities, layer norms, softmax
- Optimizer updates (Adam, etc.)

These are smaller but non-negligible; a common heuristic is to lump them into a constant factor, giving:

\[ \text{FLOPs per token} \approx 6N \]

This “6” is a heuristic constant:

2N (forward) + 4N (backward & optimizer & overhead) ≈ 6N

Some authors use factors between ~3 and ~6; using 6 is a conservative, widely used estimate in scaling law work.
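The per-token heuristic above can be sketched in a few lines of Python. This is only an illustration of the 2N forward plus 4N backward split described in the text; the function name is hypothetical, and the overheads (nonlinearities, optimizer updates) are assumed to be absorbed into the backward term, as the response does.

```python
def flops_per_token(n_params: float) -> dict:
    """Approximate training FLOPs per token for a dense Transformer.

    Assumes the heuristic from the text: ~2N for the forward pass and
    ~4N for the backward pass with optimizer and other overhead lumped in.
    """
    forward = 2 * n_params
    backward = 4 * n_params
    return {"forward": forward, "backward": backward, "total": forward + backward}


GPT3_PARAMS = 1.75e11  # 175B parameters, as stated in the response

per_token = flops_per_token(GPT3_PARAMS)
print(f"forward:  {per_token['forward']:.2e} FLOPs/token")
print(f"backward: {per_token['backward']:.2e} FLOPs/token")
print(f"total:    {per_token['total']:.2e} FLOPs/token")
```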

3. Total FLOPs formula

Total training FLOPs:

\[ \text{FLOPs}_{\text{total}} \approx (\text{FLOPs per token}) \times (\text{number of tokens}) \]

Using the heuristic:

\[ \text{FLOPs}_{\text{total}} \approx 6 N T \]

Plug in GPT‑3’s values:

- \(N = 1.75\times 10^{11}\)
- \(T = 3\times 10^{11}\)

So:

\[
\begin{aligned}
\text{FLOPs}_{\text{total}} &\approx 6 \times (1.75\times 10^{11}) \times (3\times 10^{11}) \\
&= 6 \times 1.75 \times 3 \times 10^{22} \\
&= 6 \times 5.25 \times 10^{22} \\
&= 31.5 \times 10^{22} \\
&= 3.15 \times 10^{23} \text{ FLOPs}
\end{aligned}
\]

So a good estimate is:

\[ \boxed{\text{GPT-3 (175B) training compute} \;\approx\; 3\times 10^{23}\ \text{FLOPs}} \]
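The arithmetic in this step can be checked with a short script. It is a minimal sketch of the 6NT rule of thumb using the parameter and token counts quoted in the response; the function and argument names are hypothetical.

```python
def training_flops(n_params: float, n_tokens: float,
                   flops_per_param_token: float = 6.0) -> float:
    """Total training FLOPs under the heuristic FLOPs ~ c * N * T (default c = 6)."""
    return flops_per_param_token * n_params * n_tokens


N_PARAMS = 1.75e11  # 175B parameters
N_TOKENS = 3e11     # 300B training tokens

total = training_flops(N_PARAMS, N_TOKENS)
print(f"Estimated GPT-3 training compute: {total:.2e} FLOPs")
```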

4. Sense check (order of magnitude)

Even if our constant factor is off (say it’s 3 instead of 6, or 8 instead of 6), we’re still in the same ballpark:

- With factor 3: \(\sim 1.6\times 10^{23}\) FLOPs
- With factor 8: \(\sim 4.2\times 10^{23}\) FLOPs

So the order of magnitude \(10^{23}\) FLOPs is robust, and \(\sim 3\times 10^{23}\) FLOPs is a reasonable central estimate.
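The sense check can be reproduced by sweeping the constant factor, as in the sketch below. It assumes the same N and T as above and only the three factors mentioned in the text (3, 6, 8); the helper name is hypothetical.

```python
import math

N_PARAMS = 1.75e11  # 175B parameters
N_TOKENS = 3e11     # 300B training tokens


def training_flops(factor: float) -> float:
    """Total training FLOPs for a given per-parameter-per-token constant."""
    return factor * N_PARAMS * N_TOKENS


# Factors considered in the sense check: 3 (low), 6 (central), 8 (high).
estimates = {factor: training_flops(factor) for factor in (3, 6, 8)}

for factor, flops in estimates.items():
    exponent = math.floor(math.log10(flops))
    print(f"factor {factor}: {flops:.3e} FLOPs (order 10^{exponent})")
```

All three estimates land at the same power of ten, which is the robustness claim made in the text.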
