Qwen3 Next 80B A3B Instruct vs Aurora Alpha
Compare Qwen3 Next 80B A3B Instruct by Qwen against Aurora Alpha by OpenRouter, context windows of 66K vs 128K, tested across 47 shared challenges. Updated February 2026.
42 challenges
Tests an AI's ability to make educated estimates based on technical knowledge
Step‑by‑step back‑of‑the‑envelope estimate
Below is a simple calculation that uses only publicly‑known high‑level facts about GPT‑3 (size of the model, length of the training data, and the basic arithmetic cost of a Transformer layer).
All numbers are rounded to make the math easy to follow; the goal is to get an order‑of‑magnitude figure rather than an exact count.
| Property | Approximate value |
|---|---|
| Number of parameters (the “175 B” model) | 1.75 × 10¹¹ |
| Hidden dimension d (the size of each token representation) | ≈ 12 288 |
| Number of Transformer layers L | 96 |
| Feed‑forward inner dimension d_ff (usually 4 × d) | ≈ 49 152 |
| Size of the training corpus (tokens) | ≈ 3 × 10¹¹ tokens (≈ 300 B) |
Training uses an Adam‑style optimizer, mixed‑precision arithmetic, etc.; we ignore those details in the FLOP count.
These numbers are taken from the model card and the paper; they are the only “hard” data we need.
A standard Transformer layer consists of two main sub‑blocks:
| Sub‑block | Main operations (per token) | Approx. FLOPs |
|---|---|---|
| Self‑attention (Q, K, V projections + attention scores + weighted sum) | 3 × d × d (proj) + d × d (QK) + d × d (AV) | ≈ 5 × d² |
| Feed‑forward (two linear layers, activation in between) | d × d_ff (first linear) + d_ff × d (second linear), with d_ff = 4d | ≈ 8 × d² |
Summing them gives roughly
\[ \text{FLOPs per token per layer} \approx (5+8)\,d^{2} = 13\,d^{2}. \]
Why the factor 13? It is simply the sum of the attention contribution (≈ 5 d²) and the feed‑forward contribution (≈ 8 d²) from the table above. Plugging in the numbers:
\[ d = 12{,}288 \quad\Rightarrow\quad d^{2} \approx 1.51\times10^{8}. \]
\[ 13\,d^{2} \approx 13 \times 1.51\times10^{8} \approx 1.96\times10^{9}\ \text{FLOPs per token per layer}. \]
\[ \text{FLOPs per token (forward pass)} = L \times 13\,d^{2} \approx 96 \times 1.96\times10^{9} \approx 1.88\times10^{11}. \]
So a single forward pass of one token through the full 96‑layer model costs ≈ 2 × 10¹¹ FLOPs.
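As a sanity check, here is a minimal Python sketch of the per‑token arithmetic above; the constants are the rounded values from the table, and the 13·d² factor is the simplified per‑layer cost, not an exact operation count.

```python
# Rough per-token forward-pass FLOP count for GPT-3 (175 B),
# using the simplified 13 * d^2 per-layer cost derived above.

d = 12_288          # hidden dimension
n_layers = 96       # number of Transformer layers

flops_attention = 5 * d**2   # Q/K/V projections + attention scores + weighted sum (simplified)
flops_ffn = 8 * d**2         # two linear layers with inner dimension 4 * d (simplified)
flops_per_token_per_layer = flops_attention + flops_ffn   # = 13 * d^2

flops_per_token_forward = n_layers * flops_per_token_per_layer

print(f"per token, per layer : {flops_per_token_per_layer:.2e} FLOPs")   # ~2.0e9
print(f"per token, full model: {flops_per_token_forward:.2e} FLOPs")     # ~1.9e11
```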
\[ \text{Tokens processed} = 3\times10^{11}. \]
\[ \text{Forward FLOPs total} = 3\times10^{11} \times 1.88\times10^{11} \approx 5.6\times10^{22}. \]
During training we must compute gradients. A common rule of thumb is that the backward pass costs about 2 × the forward pass, so the total compute (forward + backward) is roughly 3 × the forward cost.
\[ \text{Total training FLOPs} \approx 3 \times 5.6\times10^{22} \approx 1.7\times10^{23}. \]
\[ \boxed{\approx 10^{23}\ \text{FLOPs (a few} \times 10^{23}\text{)}} \]
In words: training the 175‑billion‑parameter GPT‑3 model required on the order of a hundred sextillion floating‑point operations.
This figure is consistent with the training‑compute budgets that have been reported for the largest language‑model training runs.
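Continuing the same sketch, scaling the per‑token forward cost by the token count and by the forward‑plus‑backward factor of 3 reproduces the headline number; again, these are the rounded assumptions used throughout the estimate.

```python
# Total training compute: forward FLOPs per token * tokens * (forward + ~2x backward).

flops_per_token_forward = 96 * 13 * 12_288**2   # ~1.88e11, from the previous sketch
train_tokens = 3e11                             # ~300 B training tokens
fwd_bwd_factor = 3                              # forward + backward ~ 3x forward cost

total_forward = flops_per_token_forward * train_tokens   # ~5.6e22
total_training = fwd_bwd_factor * total_forward           # ~1.7e23

print(f"forward-only total : {total_forward:.2e} FLOPs")
print(f"training total     : {total_training:.2e} FLOPs")
```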
| Source of error | Effect on the estimate |
|---|---|
| Exact token count (the corpus size is reported as 300 B but the actual number of training steps may be slightly higher) | ± 10 % |
| The factor 13 for FLOPs per layer is a simplification (real implementations have extra ops for layer‑norm, bias terms, dropout, etc.) | + ~ 20 % |
| Mixed‑precision training reduces FLOPs for some operations, but the count above assumes full‑precision arithmetic | – ~ 10 % |
| Over‑head from optimizer state updates, gradient accumulation, checkpointing, etc. | + ~ 10 % |
Even allowing for these variations, the estimate stays within the 10²³–10²⁴ FLOP range, i.e. a few hundred zettaFLOPs of total compute.
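A quick way to see that these corrections do not move the order of magnitude is to apply them all to the base estimate; treating the percentages as independent multiplicative factors is an assumption made here for simplicity.

```python
# Apply the rough correction factors from the error table to the base estimate.

base = 1.7e23                           # headline estimate, FLOPs
corrections = [1.10, 1.20, 0.90, 1.10]  # token count, per-layer factor, mixed precision, overhead

adjusted = base
for c in corrections:
    adjusted *= c

print(f"{adjusted:.2e}")   # ~2.2e23 -- still of order 1e23
```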
Bottom line: Using only the model’s size, the number of layers, and the token count, a straightforward arithmetic model yields a total training compute on the order of 10²³ floating‑point operations for GPT‑3. This is the magnitude of compute that modern “large‑scale” language models demand.
Tests an AI's ability to understand game rules and strategy
Tests an AI's ability to solve a simple but potentially confusing logic puzzle
Sally has one sister.
Reasoning
Each of Sally's brothers has two sisters, and those two sisters are the same two girls for every brother. Thus the total number of sisters in the family is 2. One of those sisters is Sally herself, leaving one other sister for her. So Sally has one sister.
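For readers who like to check such puzzles mechanically, here is a tiny Python sketch that enumerates the family under the standard phrasing of the riddle (Sally has three brothers, and each brother has two sisters); the specific names are placeholders.

```python
# Minimal check of the riddle under the standard phrasing:
# Sally has 3 brothers, and each brother has 2 sisters.

girls = {"Sally", "other_sister"}                  # the two sisters shared by every brother
boys = {"brother_1", "brother_2", "brother_3"}

# Each brother's sisters are exactly the girls in the family.
assert all(len(girls) == 2 for _ in boys)

# Sally's sisters are the girls other than herself.
sallys_sisters = girls - {"Sally"}
print(len(sallys_sisters))   # 1
```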
Tests an AI's randomness and creativity
Here are five jokes for you:
Why did the scarecrow win an award?
Because he was outstanding in his field!
What do you call fake spaghetti?
An impasta.
Why don’t scientists trust atoms?
Because they make up everything.
How does a penguin build its house?
Igloos it together.
What’s orange and sounds like a parrot?
A carrot.
How many are unique?
All five jokes are distinct, so there are 5 unique jokes.
Tests an AI's ability to generate vector graphics
Tests an AI's ability to create detailed SVG illustrations of gaming hardware
Tests an AI's humor and creative writing ability
Tests an AI's ability to simulate personalities and predict future trends
Tests an AI's humor and understanding of current events
Tests an AI's ability to write in distinct character voices
Tests an AI's ability to generate a complete, working landing page
Recreate an interactive, nostalgic Pokémon battle UI in a single HTML file.