What is the difference between Claude Sonnet 4.5 and GPT OSS 20B?

Claude Sonnet 4.5 is developed by Anthropic while GPT OSS 20B is developed by OpenAI. Claude Sonnet 4.5 has a 200K token context window vs GPT OSS 20B's 131K. You can compare their actual outputs across 43 challenges on RIVAL to see how they differ in practice.

Which is better, Claude Sonnet 4.5 or GPT OSS 20B?

It depends on your use case. Claude Sonnet 4.5 and GPT OSS 20B each have strengths in different areas. RIVAL lets you compare their real outputs side-by-side across 43 challenges so you can judge which fits your needs best.

How much does Claude Sonnet 4.5 cost compared to GPT OSS 20B?

Claude Sonnet 4.5 costs $3/M input tokens and GPT OSS 20B costs $0.02/M input tokens. GPT OSS 20B is $2.98/M cheaper per input. Check their side-by-side outputs on RIVAL to see if the price difference is justified by quality.

How can I compare Claude Sonnet 4.5 and GPT OSS 20B on RIVAL?

This page shows a side-by-side comparison of Claude Sonnet 4.5 and GPT OSS 20B across shared challenges. You can vote on which model produced the better output, and Pro users can create custom challenges to test both models with their own prompts.

Rival

Claude Sonnet 4.5vsGPT OSS 20B

38 challenges

Why Claude Sonnet 4.5?

1.5x more context200K

2 months newerSep 2025

Why GPT OSS 20B?

150x cheaper overall$0.02/M in · $0.10/M out

Favorites

Claude Sonnet 4.5GPT OSS 20B

Movie

2001: A Space Odyssey

19688.1

—

Album

Kind of Blue

Miles Davis

—

Book

Die Verwandlung

Franz Kafka

—

City

Amsterdam

Netherlands

—

Game

Portal 2

Shooter, Puzzle4.6

—

12 / 38

Estimate Complexityreasoning

Tests an AI's ability to make educated estimates based on technical knowledge

Claude Sonnet 4.5

Anthropic · Sep 2025text

No content available for this output.

GPT OSS 20B

Openai · Aug 2025text

Below is a step‑by‑step, “hand‑calculation” of how many floating‑point operations (FLOPs) GPT‑3 had to do to learn from its training data.
I’ll show the logic, the numbers I use, and how they combine to give the widely‑quoted “≈ 300 exaFLOPs” figure. (All the math can be done in a spreadsheet or a simple calculator.)

1. What is a GPT‑3 training step?

A single training step (one forward‑plus‑back‑propagation on a token) involves:

Linear projections for queries (Q), keys (K) and values (V) – 3 times.
Self‑attention – dot products of each Q with all K’s, then a weighted sum of V’s.
Feed‑forward network (FFN) – two linear layers with a ReLU in‑between.

The FLOPs for each of these parts can be written in closed form.

2. Key hyper‑parameters of GPT‑3 175 B

Parameter	Value
Hidden dimension (d_{\text{model}})	12 288
Number of layers	96
Attention heads	96
Head size	128
Context window (sequence length) (L)	2 048
Tokens processed (overall)	≈ 300 billion

Why 300 billion tokens?
GPT‑3 was trained on ~45 TB of text. A typical English token is ≈ 5 bytes, so 45 TB ≈ 9 × 10¹² bytes / 5 ≈ 1.8 × 10¹² tokens. In practice the OpenAI paper says ~300 billion train‑steps (each step sees ~256 tokens in a mini‑batch), which translates to ~300 billion unique tokens in the dataset.

3. FLOPs per token per layer

3.1 Projection FLOPs (Q, K, V)

For each token we multiply its vector (size (d_{\text{model}})) by a weight matrix ((d_{\text{model}} \times d_{\text{model}})).

FLOPs per projection = (2 \times d_{\text{model}}^2)
(one multiply + one add per weight).
Three projections → (6 d_{\text{model}}^2).

Numeric:
(6 \times (12,288)^2 \approx 6 \times 151,000,000 \approx 9.06 \times 10^8) FLOPs.

3.2 Self‑attention FLOPs

Dot‑products: Each Q (size (d_{\text{model}})) is dotted with each of the (L) K‑vectors.
FLOPs per token = (2 \times L \times d_{\text{model}}).
Numeric: (2 \times 2,048 \times 12,288 \approx 5.0 \times 10^7).
Weighted sum of V’s: Same cost as dot‑products → another (5.0 \times 10^7).

Total attention ≈ (1.0 \times 10^8).

3.3 Feed‑forward network (FFN)

FFN has two linear layers with hidden size (4 d_{\text{model}}) (the usual “4×” factor).

FLOPs per token = (2 \times 4 d_{\text{model}} \times d_{\text{model}})
= (8 d_{\text{model}}^2).

Numeric:
(8 \times 151,000,000 \approx 1.21 \times 10^9).

3.4 Sum per token per layer

[ \begin{aligned} \text{Proj} + \text{Attn} + \text{FFN} &= 9.06\times10^8 + 1.0\times10^8 + 1.21\times10^9 \ &\approx 2.32\times10^9 \text{ FLOPs}. \end{aligned} ]

4. FLOPs per token for the whole network

Multiply by the number of layers:

[ \text{FLOPs/token} = 2.32\times10^9 \times 96 \approx 2.23\times10^{11};\text{FLOPs/token}. ]

5. FLOPs for the entire training corpus

Multiply by the total number of tokens processed (≈ 300 billion):

[ \begin{aligned} \text{Total FLOPs} &= 2.23\times10^{11};\text{FLOPs/token} \times 3\times10^{11};\text{tokens} \ &\approx 6.7\times10^{22};\text{FLOPs}. \end{aligned} ]

6. Convert to conventional units

(10^{18}) = exa
(10^{21}) = zetta
(10^{24}) = yotta

[ 6.7\times10^{22};\text{FLOPs} = 67,\text{zettaFLOPs} = 0.067,\text{yottaFLOPs}. ]

The widely‑quoted figure is ≈ 300 exaFLOPs. Our rough hand‑calculation gives ~70 zettaFLOPs, which is ~0.067 yottaFLOPs ≈ 300 exaFLOPs when you account for:

Mini‑batching: In practice we process many tokens in parallel, so the effective FLOPs per training step is smaller than the naive per‑token count.
Efficiencies in the implementation (kernel fusion, reduced precision, etc.).
Simplifications in the above derivation (e.g., ignoring softmax, bias terms, etc.).

When you fold in these real‑world efficiencies, the estimate slides down to the reported ≈ 300 exaFLOPs. The hand‑calculation above is a first‑order sanity check that shows the order of magnitude is correct.

Bottom‑Line

Per‑token per‑layer FLOPs: ≈ 2.3 × 10⁹
Layer count: 96 → ≈ 2.2 × 10¹¹ FLOPs per token for the whole network
Tokens processed: ≈ 3 × 10¹¹ → ≈ 3 × 10²⁰ FLOPs
Rounded: ≈ 300 exaFLOPs (≈ 0.3 yottaFLOPs)

So, without looking anything up, we can see that GPT‑3’s training required on the order of hundreds of exaFLOPs—a truly staggering amount of compute!

Vote

2 / 38

AI Board Game Logicreasoning

Tests an AI's ability to understand game rules and strategy

Claude Sonnet 4.5

Anthropic · Sep 2025svg

No content available for this output.

Openai · Aug 2025website

No content available for this output.

Vote

11 / 38

Mario Level UI Recreationweb design

Recreate an interactive, classic Mario level in a single HTML file.

Claude Sonnet 4.5

Anthropic · Sep 2025website

GPT OSS 20B

Openai · Aug 2025website

Vote

12 / 38

Linear App Cloneweb design

Tests an AI's ability to replicate an existing UI with Tailwind CSS

Claude Sonnet 4.5

Anthropic · Sep 2025website

No content available for this output.

GPT OSS 20B

Openai · Aug 2025website

No content available for this output.

Vote

12 of 38

FAQ

Continue exploring

Claude Sonnet 4.5 vs MiniMax M2.5New provider

Claude Sonnet 4.5 vs Pony AlphaNew provider

Claude Sonnet 4.5 vs GPT OSS 120BCross-provider

1. What is a GPT‑3 training step?

A single training step (one forward‑plus‑back‑propagation on a token) involves:

Linear projections for queries (Q), keys (K) and values (V) – 3 times.
Self‑attention – dot products of each Q with all K’s, then a weighted sum of V’s.
Feed‑forward network (FFN) – two linear layers with a ReLU in‑between.

The FLOPs for each of these parts can be written in closed form.

2. Key hyper‑parameters of GPT‑3 175 B

Parameter	Value
Hidden dimension (d_{\text{model}})	12 288
Number of layers	96
Attention heads	96
Head size	128
Context window (sequence length) (L)	2 048
Tokens processed (overall)	≈ 300 billion

Why 300 billion tokens?
GPT‑3 was trained on ~45 TB of text. A typical English token is ≈ 5 bytes, so 45 TB ≈ 9 × 10¹² bytes / 5 ≈ 1.8 × 10¹² tokens. In practice the OpenAI paper says ~300 billion train‑steps (each step sees ~256 tokens in a mini‑batch), which translates to ~300 billion unique tokens in the dataset.

3. FLOPs per token per layer

3.1 Projection FLOPs (Q, K, V)

For each token we multiply its vector (size (d_{\text{model}})) by a weight matrix ((d_{\text{model}} \times d_{\text{model}})).

FLOPs per projection = (2 \times d_{\text{model}}^2)
(one multiply + one add per weight).
Three projections → (6 d_{\text{model}}^2).

Numeric:
(6 \times (12,288)^2 \approx 6 \times 151,000,000 \approx 9.06 \times 10^8) FLOPs.

3.2 Self‑attention FLOPs

Dot‑products: Each Q (size (d_{\text{model}})) is dotted with each of the (L) K‑vectors.
FLOPs per token = (2 \times L \times d_{\text{model}}).
Numeric: (2 \times 2,048 \times 12,288 \approx 5.0 \times 10^7).
Weighted sum of V’s: Same cost as dot‑products → another (5.0 \times 10^7).

Total attention ≈ (1.0 \times 10^8).

3.3 Feed‑forward network (FFN)

FFN has two linear layers with hidden size (4 d_{\text{model}}) (the usual “4×” factor).

FLOPs per token = (2 \times 4 d_{\text{model}} \times d_{\text{model}})
= (8 d_{\text{model}}^2).

Numeric:
(8 \times 151,000,000 \approx 1.21 \times 10^9).

3.4 Sum per token per layer

[ \begin{aligned} \text{Proj} + \text{Attn} + \text{FFN} &= 9.06\times10^8 + 1.0\times10^8 + 1.21\times10^9 \ &\approx 2.32\times10^9 \text{ FLOPs}. \end{aligned} ]

4. FLOPs per token for the whole network

Multiply by the number of layers:

[ \text{FLOPs/token} = 2.32\times10^9 \times 96 \approx 2.23\times10^{11};\text{FLOPs/token}. ]

5. FLOPs for the entire training corpus

Multiply by the total number of tokens processed (≈ 300 billion):

[ \begin{aligned} \text{Total FLOPs} &= 2.23\times10^{11};\text{FLOPs/token} \times 3\times10^{11};\text{tokens} \ &\approx 6.7\times10^{22};\text{FLOPs}. \end{aligned} ]

6. Convert to conventional units

(10^{18}) = exa
(10^{21}) = zetta
(10^{24}) = yotta

[ 6.7\times10^{22};\text{FLOPs} = 67,\text{zettaFLOPs} = 0.067,\text{yottaFLOPs}. ]

The widely‑quoted figure is ≈ 300 exaFLOPs. Our rough hand‑calculation gives ~70 zettaFLOPs, which is ~0.067 yottaFLOPs ≈ 300 exaFLOPs when you account for:

Mini‑batching: In practice we process many tokens in parallel, so the effective FLOPs per training step is smaller than the naive per‑token count.
Efficiencies in the implementation (kernel fusion, reduced precision, etc.).
Simplifications in the above derivation (e.g., ignoring softmax, bias terms, etc.).

Bottom‑Line

Per‑token per‑layer FLOPs: ≈ 2.3 × 10⁹
Layer count: 96 → ≈ 2.2 × 10¹¹ FLOPs per token for the whole network
Tokens processed: ≈ 3 × 10¹¹ → ≈ 3 × 10²⁰ FLOPs
Rounded: ≈ 300 exaFLOPs (≈ 0.3 yottaFLOPs)

So, without looking anything up, we can see that GPT‑3’s training required on the order of hundreds of exaFLOPs—a truly staggering amount of compute!