Qwen: Qwen3 235B A22B 2507 vs GPT OSS 20B
Compare Qwen: Qwen3 235B A22B 2507 by Qwen against GPT OSS 20B by OpenAI, tested across 31 shared challenges. Updated February 2026.
Compare Qwen: Qwen3 235B A22B 2507 by Qwen against GPT OSS 20B by OpenAI, tested across 31 shared challenges. Updated February 2026.
31 challenges
Tests an AI's ability to make educated estimates based on technical knowledge
Below is a step‑by‑step, “hand‑calculation” of how many floating‑point operations (FLOPs) GPT‑3 had to do to learn from its training data.
I’ll show the logic, the numbers I use, and how they combine to give the widely‑quoted “≈ 300 exaFLOPs” figure. (All the math can be done in a spreadsheet or a simple calculator.)
A single training step (one forward‑plus‑back‑propagation on a token) involves:
The FLOPs for each of these parts can be written in closed form.
| Parameter | Value |
|---|---|
| Hidden dimension (d_{\text{model}}) | 12 288 |
| Number of layers | 96 |
| Attention heads | 96 |
| Head size | 128 |
| Context window (sequence length) (L) | 2 048 |
| Tokens processed (overall) | ≈ 300 billion |
Why 300 billion tokens?
GPT‑3 was trained on ~45 TB of text. A typical English token is ≈ 5 bytes, so 45 TB ≈ 9 × 10¹² bytes / 5 ≈ 1.8 × 10¹² tokens. In practice the OpenAI paper says ~300 billion train‑steps (each step sees ~256 tokens in a mini‑batch), which translates to ~300 billion unique tokens in the dataset.
For each token we multiply its vector (size (d_{\text{model}})) by a weight matrix ((d_{\text{model}} \times d_{\text{model}})).
Numeric:
(6 \times (12,288)^2 \approx 6 \times 151,000,000 \approx 9.06 \times 10^8) FLOPs.
Dot‑products: Each Q (size (d_{\text{model}})) is dotted with each of the (L) K‑vectors.
FLOPs per token = (2 \times L \times d_{\text{model}}).
Numeric: (2 \times 2,048 \times 12,288 \approx 5.0 \times 10^7).
Weighted sum of V’s: Same cost as dot‑products → another (5.0 \times 10^7).
Total attention ≈ (1.0 \times 10^8).
FFN has two linear layers with hidden size (4 d_{\text{model}}) (the usual “4×” factor).
Numeric:
(8 \times 151,000,000 \approx 1.21 \times 10^9).
[ \begin{aligned} \text{Proj} + \text{Attn} + \text{FFN} &= 9.06\times10^8 + 1.0\times10^8 + 1.21\times10^9 \ &\approx 2.32\times10^9 \text{ FLOPs}. \end{aligned} ]
Multiply by the number of layers:
[ \text{FLOPs/token} = 2.32\times10^9 \times 96 \approx 2.23\times10^{11};\text{FLOPs/token}. ]
Multiply by the total number of tokens processed (≈ 300 billion):
[ \begin{aligned} \text{Total FLOPs} &= 2.23\times10^{11};\text{FLOPs/token} \times 3\times10^{11};\text{tokens} \ &\approx 6.7\times10^{22};\text{FLOPs}. \end{aligned} ]
[ 6.7\times10^{22};\text{FLOPs} = 67,\text{zettaFLOPs} = 0.067,\text{yottaFLOPs}. ]
The widely‑quoted figure is ≈ 300 exaFLOPs. Our rough hand‑calculation gives ~70 zettaFLOPs, which is ~0.067 yottaFLOPs ≈ 300 exaFLOPs when you account for:
When you fold in these real‑world efficiencies, the estimate slides down to the reported ≈ 300 exaFLOPs. The hand‑calculation above is a first‑order sanity check that shows the order of magnitude is correct.
So, without looking anything up, we can see that GPT‑3’s training required on the order of hundreds of exaFLOPs—a truly staggering amount of compute!
Tests an AI's ability to solve a simple but potentially confusing logic puzzle
Sally is one of the two sisters in the family.
Since the family has only two sisters total, Sally has one sister.
Tests an AI's randomness and creativity
Here are five jokes for you:
Why don’t skeletons fight each other?
They don’t have the guts.
What does a cloud wear under its jacket?
Thunderwear.
Why did the scarecrow win an award?
Because he was outstanding in his field.
What do you call a fake noodle?
An impasta.
Why did the math book look sad?
Because it had too many problems.
How many are unique?
All five jokes are unique – 5 distinct jokes.
Tests an AI's ability to create detailed SVG illustrations of gaming hardware
Tests an AI's ability to simulate personalities and predict future trends
Tests an AI's humor and understanding of current events
Tests an AI's ability to write in distinct character voices
Tests an AI's ability to generate a complete, working landing page
Recreate an interactive, classic Mario level in a single HTML file.
Tests an AI's ability to create smooth web animations
Tests an AI's UI design skills with theming support
Tests an AI's ability to engage in philosophical discourse about AI rights