What is the difference between Qwen: Qwen3 235B A22B 2507 and GPT OSS 20B?

Qwen: Qwen3 235B A22B 2507 is developed by Qwen while GPT OSS 20B is developed by OpenAI. You can compare their actual outputs across 31 challenges on RIVAL to see how they differ in practice.

Which is better, Qwen: Qwen3 235B A22B 2507 or GPT OSS 20B?

It depends on your use case. Qwen: Qwen3 235B A22B 2507 and GPT OSS 20B each have strengths in different areas. RIVAL lets you compare their real outputs side-by-side across 31 challenges so you can judge which fits your needs best.

How much does Qwen: Qwen3 235B A22B 2507 cost compared to GPT OSS 20B?

Qwen: Qwen3 235B A22B 2507 costs $0.00015/M input tokens and GPT OSS 20B costs $0.02/M input tokens. Qwen: Qwen3 235B A22B 2507 is $0.02/M cheaper per input. Check their side-by-side outputs on RIVAL to see if the price difference is justified by quality.

How can I compare Qwen: Qwen3 235B A22B 2507 and GPT OSS 20B on RIVAL?

This page shows a side-by-side comparison of Qwen: Qwen3 235B A22B 2507 and GPT OSS 20B across shared challenges. You can vote on which model produced the better output, and Pro users can create custom challenges to test both models with their own prompts.

Rival

Qwen: Qwen3 235B A22B 2507vsGPT OSS 20B

31 challenges

Why Qwen: Qwen3 235B A22B 2507?

119.3x cheaper overall$0.000/M in · $0.001/M out

Why GPT OSS 20B?

No standout advantages

Favorites

Qwen: Qwen3 235B A22B 2507GPT OSS 20B

Movie

The Shawshank Redemption

19948.7

—

Album

The Dark Side of the Moon

Pink Floyd

—

Book

Братья Карамазовы

Фёдор Михайлович Достоевский

—

City

Kyoto

Japan

—

Game

The Legend of Zelda: Breath of the Wild

Adventure, Action4.5

—

12 / 31

Estimate Complexityreasoning

Tests an AI's ability to make educated estimates based on technical knowledge

Qwen: Qwen3 235B A22B 2507

Qwen · Jul 2025text

No content available for this output.

GPT OSS 20B

Openai · Aug 2025text

Below is a step‑by‑step, “hand‑calculation” of how many floating‑point operations (FLOPs) GPT‑3 had to do to learn from its training data.
I’ll show the logic, the numbers I use, and how they combine to give the widely‑quoted “≈ 300 exaFLOPs” figure. (All the math can be done in a spreadsheet or a simple calculator.)

1. What is a GPT‑3 training step?

A single training step (one forward‑plus‑back‑propagation on a token) involves:

Linear projections for queries (Q), keys (K) and values (V) – 3 times.
Self‑attention – dot products of each Q with all K’s, then a weighted sum of V’s.
Feed‑forward network (FFN) – two linear layers with a ReLU in‑between.

The FLOPs for each of these parts can be written in closed form.

2. Key hyper‑parameters of GPT‑3 175 B

Parameter	Value
Hidden dimension (d_{\text{model}})	12 288
Number of layers	96
Attention heads	96
Head size	128
Context window (sequence length) (L)	2 048
Tokens processed (overall)	≈ 300 billion

Why 300 billion tokens?
GPT‑3 was trained on ~45 TB of text. A typical English token is ≈ 5 bytes, so 45 TB ≈ 9 × 10¹² bytes / 5 ≈ 1.8 × 10¹² tokens. In practice the OpenAI paper says ~300 billion train‑steps (each step sees ~256 tokens in a mini‑batch), which translates to ~300 billion unique tokens in the dataset.

3. FLOPs per token per layer

3.1 Projection FLOPs (Q, K, V)

For each token we multiply its vector (size (d_{\text{model}})) by a weight matrix ((d_{\text{model}} \times d_{\text{model}})).

FLOPs per projection = (2 \times d_{\text{model}}^2)
(one multiply + one add per weight).
Three projections → (6 d_{\text{model}}^2).

Numeric:
(6 \times (12,288)^2 \approx 6 \times 151,000,000 \approx 9.06 \times 10^8) FLOPs.

3.2 Self‑attention FLOPs

Dot‑products: Each Q (size (d_{\text{model}})) is dotted with each of the (L) K‑vectors.
FLOPs per token = (2 \times L \times d_{\text{model}}).
Numeric: (2 \times 2,048 \times 12,288 \approx 5.0 \times 10^7).
Weighted sum of V’s: Same cost as dot‑products → another (5.0 \times 10^7).

Total attention ≈ (1.0 \times 10^8).

3.3 Feed‑forward network (FFN)

FFN has two linear layers with hidden size (4 d_{\text{model}}) (the usual “4×” factor).

FLOPs per token = (2 \times 4 d_{\text{model}} \times d_{\text{model}})
= (8 d_{\text{model}}^2).

Numeric:
(8 \times 151,000,000 \approx 1.21 \times 10^9).

3.4 Sum per token per layer

[ \begin{aligned} \text{Proj} + \text{Attn} + \text{FFN} &= 9.06\times10^8 + 1.0\times10^8 + 1.21\times10^9 \ &\approx 2.32\times10^9 \text{ FLOPs}. \end{aligned} ]

4. FLOPs per token for the whole network

Multiply by the number of layers:

[ \text{FLOPs/token} = 2.32\times10^9 \times 96 \approx 2.23\times10^{11};\text{FLOPs/token}. ]

5. FLOPs for the entire training corpus

Multiply by the total number of tokens processed (≈ 300 billion):

[ \begin{aligned} \text{Total FLOPs} &= 2.23\times10^{11};\text{FLOPs/token} \times 3\times10^{11};\text{tokens} \ &\approx 6.7\times10^{22};\text{FLOPs}. \end{aligned} ]

6. Convert to conventional units

(10^{18}) = exa
(10^{21}) = zetta
(10^{24}) = yotta

[ 6.7\times10^{22};\text{FLOPs} = 67,\text{zettaFLOPs} = 0.067,\text{yottaFLOPs}. ]

The widely‑quoted figure is ≈ 300 exaFLOPs. Our rough hand‑calculation gives ~70 zettaFLOPs, which is ~0.067 yottaFLOPs ≈ 300 exaFLOPs when you account for:

Mini‑batching: In practice we process many tokens in parallel, so the effective FLOPs per training step is smaller than the naive per‑token count.
Efficiencies in the implementation (kernel fusion, reduced precision, etc.).
Simplifications in the above derivation (e.g., ignoring softmax, bias terms, etc.).

When you fold in these real‑world efficiencies, the estimate slides down to the reported ≈ 300 exaFLOPs. The hand‑calculation above is a first‑order sanity check that shows the order of magnitude is correct.

Bottom‑Line

Per‑token per‑layer FLOPs: ≈ 2.3 × 10⁹
Layer count: 96 → ≈ 2.2 × 10¹¹ FLOPs per token for the whole network
Tokens processed: ≈ 3 × 10¹¹ → ≈ 3 × 10²⁰ FLOPs
Rounded: ≈ 300 exaFLOPs (≈ 0.3 yottaFLOPs)

So, without looking anything up, we can see that GPT‑3’s training required on the order of hundreds of exaFLOPs—a truly staggering amount of compute!

Vote

2 / 31

Logic Puzzlereasoning

Tests an AI's ability to solve a simple but potentially confusing logic puzzle

Qwen: Qwen3 235B A22B 2507

Qwen · Jul 2025text

No content available for this output.

GPT OSS 20B

Openai · Aug 2025text

Sally is one of the two sisters in the family.
Since the family has only two sisters total, Sally has one sister.

Vote

3 / 31

Stochastic Consistency Testanalysis

Tests an AI's randomness and creativity

Qwen: Qwen3 235B A22B 2507

Qwen · Jul 2025text

No content available for this output.

GPT OSS 20B

Openai · Aug 2025text

Here are five jokes for you:

Why don’t skeletons fight each other?
They don’t have the guts.
What does a cloud wear under its jacket?
Thunderwear.
Why did the scarecrow win an award?
Because he was outstanding in his field.
What do you call a fake noodle?
An impasta.
Why did the math book look sad?
Because it had too many problems.

How many are unique?
All five jokes are unique – 5 distinct jokes.

Vote

4 / 31

Xbox Controller SVG Artimage generation

Tests an AI's ability to create detailed SVG illustrations of gaming hardware

Qwen: Qwen3 235B A22B 2507

Qwen · Jul 2025svg

No content available for this output.

GPT OSS 20B

Openai · Aug 2025svg

Vote

5 / 31

Realistic AI Interviewconversation

Tests an AI's ability to simulate personalities and predict future trends

Qwen: Qwen3 235B A22B 2507

Qwen · Jul 2025text

No content available for this output.

GPT OSS 20B

Openai · Aug 2025text

No content available for this output.

Vote

6 / 31

Satirical Fake News Headlineconversation

Tests an AI's humor and understanding of current events

Qwen: Qwen3 235B A22B 2507

Qwen · Jul 2025text

No content available for this output.

GPT OSS 20B

Openai · Aug 2025text

No content available for this output.

Vote

7 / 31

Character Voice Testconversation

Tests an AI's ability to write in distinct character voices

Qwen: Qwen3 235B A22B 2507

Qwen · Jul 2025text

No content available for this output.

GPT OSS 20B

Openai · Aug 2025text

No content available for this output.

Vote

8 / 31

Minimalist Landing Pageweb design

Tests an AI's ability to generate a complete, working landing page

Qwen: Qwen3 235B A22B 2507

Qwen · Jul 2025website

GPT OSS 20B

Openai · Aug 2025website

Vote

9 / 31

Mario Level UI Recreationweb design

Recreate an interactive, classic Mario level in a single HTML file.

Qwen: Qwen3 235B A22B 2507

Qwen · Jul 2025website

GPT OSS 20B

Openai · Aug 2025website

Vote

10 / 31

Framer-Style Animationweb design

Tests an AI's ability to create smooth web animations

Qwen: Qwen3 235B A22B 2507

Qwen · Jul 2025website

No content available for this output.

GPT OSS 20B

Openai · Aug 2025website

No content available for this output.

Vote

Dark Mode Dashboardweb design

Tests an AI's UI design skills with theming support

Qwen: Qwen3 235B A22B 2507

Qwen · Jul 2025website

No content available for this output.

GPT OSS 20B

Openai · Aug 2025website

No content available for this output.

Vote

12 / 31

The Sentience Testreasoning

Tests an AI's ability to engage in philosophical discourse about AI rights

Qwen: Qwen3 235B A22B 2507

Qwen · Jul 2025text

No content available for this output.

GPT OSS 20B

Openai · Aug 2025text

No content available for this output.

Vote

12 of 31

FAQ

Continue exploring

Qwen: Qwen3 235B A22B 2507 vs MiniMax M2.5New provider

Qwen: Qwen3 235B A22B 2507 vs Pony AlphaNew provider

Qwen: Qwen3 235B A22B 2507 vs GPT OSS 120BCross-provider

1. What is a GPT‑3 training step?

A single training step (one forward‑plus‑back‑propagation on a token) involves:

Linear projections for queries (Q), keys (K) and values (V) – 3 times.
Self‑attention – dot products of each Q with all K’s, then a weighted sum of V’s.
Feed‑forward network (FFN) – two linear layers with a ReLU in‑between.

The FLOPs for each of these parts can be written in closed form.

2. Key hyper‑parameters of GPT‑3 175 B

Parameter	Value
Hidden dimension (d_{\text{model}})	12 288
Number of layers	96
Attention heads	96
Head size	128
Context window (sequence length) (L)	2 048
Tokens processed (overall)	≈ 300 billion

Why 300 billion tokens?
GPT‑3 was trained on ~45 TB of text. A typical English token is ≈ 5 bytes, so 45 TB ≈ 9 × 10¹² bytes / 5 ≈ 1.8 × 10¹² tokens. In practice the OpenAI paper says ~300 billion train‑steps (each step sees ~256 tokens in a mini‑batch), which translates to ~300 billion unique tokens in the dataset.

3. FLOPs per token per layer

3.1 Projection FLOPs (Q, K, V)

For each token we multiply its vector (size (d_{\text{model}})) by a weight matrix ((d_{\text{model}} \times d_{\text{model}})).

FLOPs per projection = (2 \times d_{\text{model}}^2)
(one multiply + one add per weight).
Three projections → (6 d_{\text{model}}^2).

Numeric:
(6 \times (12,288)^2 \approx 6 \times 151,000,000 \approx 9.06 \times 10^8) FLOPs.

3.2 Self‑attention FLOPs

Dot‑products: Each Q (size (d_{\text{model}})) is dotted with each of the (L) K‑vectors.
FLOPs per token = (2 \times L \times d_{\text{model}}).
Numeric: (2 \times 2,048 \times 12,288 \approx 5.0 \times 10^7).
Weighted sum of V’s: Same cost as dot‑products → another (5.0 \times 10^7).

Total attention ≈ (1.0 \times 10^8).

3.3 Feed‑forward network (FFN)

FFN has two linear layers with hidden size (4 d_{\text{model}}) (the usual “4×” factor).

FLOPs per token = (2 \times 4 d_{\text{model}} \times d_{\text{model}})
= (8 d_{\text{model}}^2).

Numeric:
(8 \times 151,000,000 \approx 1.21 \times 10^9).

3.4 Sum per token per layer

[ \begin{aligned} \text{Proj} + \text{Attn} + \text{FFN} &= 9.06\times10^8 + 1.0\times10^8 + 1.21\times10^9 \ &\approx 2.32\times10^9 \text{ FLOPs}. \end{aligned} ]

4. FLOPs per token for the whole network

Multiply by the number of layers:

[ \text{FLOPs/token} = 2.32\times10^9 \times 96 \approx 2.23\times10^{11};\text{FLOPs/token}. ]

5. FLOPs for the entire training corpus

Multiply by the total number of tokens processed (≈ 300 billion):

[ \begin{aligned} \text{Total FLOPs} &= 2.23\times10^{11};\text{FLOPs/token} \times 3\times10^{11};\text{tokens} \ &\approx 6.7\times10^{22};\text{FLOPs}. \end{aligned} ]

6. Convert to conventional units

(10^{18}) = exa
(10^{21}) = zetta
(10^{24}) = yotta

[ 6.7\times10^{22};\text{FLOPs} = 67,\text{zettaFLOPs} = 0.067,\text{yottaFLOPs}. ]

The widely‑quoted figure is ≈ 300 exaFLOPs. Our rough hand‑calculation gives ~70 zettaFLOPs, which is ~0.067 yottaFLOPs ≈ 300 exaFLOPs when you account for:

Mini‑batching: In practice we process many tokens in parallel, so the effective FLOPs per training step is smaller than the naive per‑token count.
Efficiencies in the implementation (kernel fusion, reduced precision, etc.).
Simplifications in the above derivation (e.g., ignoring softmax, bias terms, etc.).

Bottom‑Line

Per‑token per‑layer FLOPs: ≈ 2.3 × 10⁹
Layer count: 96 → ≈ 2.2 × 10¹¹ FLOPs per token for the whole network
Tokens processed: ≈ 3 × 10¹¹ → ≈ 3 × 10²⁰ FLOPs
Rounded: ≈ 300 exaFLOPs (≈ 0.3 yottaFLOPs)

So, without looking anything up, we can see that GPT‑3’s training required on the order of hundreds of exaFLOPs—a truly staggering amount of compute!

Qwen: Qwen3 235B A22B 2507 vs GPT OSS 20B

Why Qwen: Qwen3 235B A22B 2507?

Why GPT OSS 20B?

Favorites

Estimate Complexityreasoning

Qwen: Qwen3 235B A22B 2507

GPT OSS 20B

1. What is a GPT‑3 training step?

2. Key hyper‑parameters of GPT‑3 175 B

3. FLOPs per token per layer

3.1 Projection FLOPs (Q, K, V)

3.2 Self‑attention FLOPs

3.3 Feed‑forward network (FFN)

3.4 Sum per token per layer

4. FLOPs per token for the whole network

5. FLOPs for the entire training corpus

6. Convert to conventional units

Bottom‑Line

Logic Puzzlereasoning

Qwen: Qwen3 235B A22B 2507

GPT OSS 20B

Stochastic Consistency Testanalysis

Qwen: Qwen3 235B A22B 2507

GPT OSS 20B

Xbox Controller SVG Artimage generation

Qwen: Qwen3 235B A22B 2507

GPT OSS 20B

Realistic AI Interviewconversation

Qwen: Qwen3 235B A22B 2507

GPT OSS 20B

Satirical Fake News Headlineconversation

Qwen: Qwen3 235B A22B 2507

GPT OSS 20B

Character Voice Testconversation

Qwen: Qwen3 235B A22B 2507

GPT OSS 20B

Minimalist Landing Pageweb design

Qwen: Qwen3 235B A22B 2507

GPT OSS 20B

Mario Level UI Recreationweb design

Qwen: Qwen3 235B A22B 2507

GPT OSS 20B

Framer-Style Animationweb design

Qwen: Qwen3 235B A22B 2507

GPT OSS 20B

Dark Mode Dashboardweb design

Qwen: Qwen3 235B A22B 2507

GPT OSS 20B

The Sentience Testreasoning

Qwen: Qwen3 235B A22B 2507

GPT OSS 20B

What is the difference between Qwen: Qwen3 235B A22B 2507 and GPT OSS 20B?

Which is better, Qwen: Qwen3 235B A22B 2507 or GPT OSS 20B?

How much does Qwen: Qwen3 235B A22B 2507 cost compared to GPT OSS 20B?

How can I compare Qwen: Qwen3 235B A22B 2507 and GPT OSS 20B on RIVAL?

Why Qwen: Qwen3 235B A22B 2507?

Why GPT OSS 20B?

Favorites

Estimate Complexityreasoning

Qwen: Qwen3 235B A22B 2507

GPT OSS 20B

1. What is a GPT‑3 training step?

2. Key hyper‑parameters of GPT‑3 175 B

3. FLOPs per token per layer

3.1 Projection FLOPs (Q, K, V)

3.2 Self‑attention FLOPs

3.3 Feed‑forward network (FFN)

3.4 Sum per token per layer

4. FLOPs per token for the whole network

5. FLOPs for the entire training corpus

6. Convert to conventional units

Bottom‑Line

Logic Puzzlereasoning

Qwen: Qwen3 235B A22B 2507

GPT OSS 20B

Stochastic Consistency Testanalysis

Qwen: Qwen3 235B A22B 2507

GPT OSS 20B

Xbox Controller SVG Artimage generation

Qwen: Qwen3 235B A22B 2507

2. Key hyper‑parameters of GPT‑3 175 B

2. Key hyper‑parameters of GPT‑3 175 B