Compare Mistral Small Creative by Mistral AI against Aurora Alpha by OpenRouter, context windows of 33K vs 128K, tested across 53 shared challenges. Updated April 2026.
Mistral Small Creative and Aurora Alpha are both competitive models. Mistral Small Creative costs $0.1/M input tokens vs $0/M for Aurora Alpha. Context windows: 33K vs 128K tokens. Compare their real outputs side by side below.
Mistral Small Creative is made by Mistral AI, while Aurora Alpha is from OpenRouter. Mistral Small Creative has a 33K token context window compared to Aurora Alpha's 128K. On pricing, Mistral Small Creative costs $0.1/M input tokens vs $0/M for Aurora Alpha.
Tests an AI's ability to make educated estimates based on technical knowledge
Step‑by‑step back‑of‑the‑envelope estimate
Below is a simple calculation that uses only publicly‑known high‑level facts about GPT‑3 (size of the model, length of the training data, and the basic arithmetic cost of a Transformer layer).
All numbers are rounded to make the math easy to follow; the goal is to get an order‑of‑magnitude figure rather than an exact count.
| Property | Approximate value |
|---|---|
| Number of parameters (the “175 B” model) | 1.75 × 10¹¹ |
| Hidden dimension d (the size of each token representation) | ≈ 12 288 |
| Number of Transformer layers L | 96 |
| Feed‑forward inner dimension d_ff (usually 4 × d) | ≈ 49 152 |
| Size of the training corpus (tokens) | ≈ 3 × 10¹¹ tokens (≈ 300 B) |
Training uses an Adam‑style optimizer, mixed precision, etc.; we ignore those details for the FLOP count.
These numbers are taken from the model card and the paper; they are the only “hard” data we need.
A standard Transformer layer consists of two main sub‑blocks:
| Sub‑block | Main operations (per token) | Approx. FLOPs |
|---|---|---|
| Self‑attention (Q, K, V projections + attention scores + weighted sum) | 3 × d × d (proj) + d × d (QK) + d × d (AV) | ≈ 5 × d² |
| Feed‑forward (two linear layers, activation in between) | d × d_ff (first linear, 4 × d²) + d_ff × d (second linear, 4 × d²) | ≈ 8 × d² |
Summing them gives roughly
\[ \text{FLOPs per token per layer} \;\approx\; (5+8)\,d^{2} = 13\,d^{2}. \]
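As a sanity check, the two sub‑block tallies can be reproduced directly. A minimal sketch, using the same simplified counting convention as the table (one multiply‑accumulate counted as one FLOP, biases and layer‑norm ignored):

```python
# Per-token FLOP tally for one Transformer layer, following the
# simplified counting convention used in the table above.
d = 12_288  # hidden dimension of the 175B model

# Self-attention: Q/K/V projections (3 d^2), QK^T scores (~d^2),
# attention-weighted sum (~d^2)  ->  ~5 d^2
attention_flops = 3 * d * d + d * d + d * d

# Feed-forward: d -> 4d linear (4 d^2) plus 4d -> d linear (4 d^2) -> ~8 d^2
ffn_flops = d * (4 * d) + (4 * d) * d

print(attention_flops // d**2)                    # 5
print(ffn_flops // d**2)                          # 8
print((attention_flops + ffn_flops) // d**2)      # 13
```

The factor 13 is just the sum of these two sub‑block coefficients.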
Why the factor 13? It is simply the sum of the two sub‑block costs: ≈ 5 d² for self‑attention plus ≈ 8 d² for the feed‑forward block. Plugging in the numbers:
\[ d = 12{,}288 \quad\Rightarrow\quad d^{2} \approx 1.51\times10^{8}. \]

\[ 13\,d^{2} \approx 13 \times 1.51\times10^{8} \approx 1.96\times10^{9}\ \text{FLOPs per token per layer}. \]

\[ \text{FLOPs per token (forward pass)} = L \times 13\,d^{2} \approx 96 \times 1.96\times10^{9} \approx 1.88\times10^{11}. \]
So a single forward pass of one token through the full 96‑layer model costs ≈ 2 × 10¹¹ FLOPs.
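The per‑token arithmetic above is easy to reproduce. A minimal sketch with the same rounding conventions as the text:

```python
# Forward-pass cost of pushing one token through all 96 layers,
# using the 13 d^2 per-layer estimate derived above.
d = 12_288  # hidden dimension
L = 96      # number of Transformer layers

per_layer = 13 * d**2      # FLOPs per token per layer
per_token = L * per_layer  # FLOPs per token, full forward pass

print(f"{per_layer:.2e}")  # 1.96e+09
print(f"{per_token:.2e}")  # 1.88e+11
```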
\[ \text{Tokens processed} = 3\times10^{11}. \]

\[ \text{Forward FLOPs total} = 3\times10^{11} \times 1.88\times10^{11} \approx 5.6\times10^{22}. \]
During training we must compute gradients. A common rule of thumb is that the backward pass costs about 2 × the forward pass, so the total compute (forward + backward) is roughly 3 × the forward cost.
\[ \text{Total training FLOPs} \approx 3 \times 5.6\times10^{22} \approx 1.7\times10^{23}. \]

\[ \boxed{\text{Total training compute} \approx 1.7\times10^{23}\ \text{FLOPs (a few }\times 10^{23}\text{)}} \]
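The whole chain, from model shape to training budget, fits in a few lines. A sketch that simply strings the steps above together:

```python
# End-to-end estimate: total training FLOPs for GPT-3 175B,
# combining the per-token forward cost with corpus size and the
# backward-pass rule of thumb (backward ~ 2x forward, so total ~ 3x).
d, L = 12_288, 96
tokens = 3e11  # ~300 B training tokens

forward_per_token = L * 13 * d**2           # ~1.88e11 FLOPs
forward_total = tokens * forward_per_token  # ~5.6e22 FLOPs
training_total = 3 * forward_total          # forward + backward

print(f"{training_total:.1e}")  # 1.7e+23
```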
In words: training the 175‑billion‑parameter GPT‑3 model required on the order of a hundred sextillion floating‑point operations.
This figure is consistent with the “exascale” compute budgets that have been reported for the largest language‑model training runs.
| Source of error | Effect on the estimate |
|---|---|
| Exact token count (the corpus size is reported as 300 B but the actual number of training steps may be slightly higher) | ± 10 % |
| The factor 13 for FLOPs per layer is a simplification (real implementations have extra ops for layer‑norm, bias terms, dropout, etc.) | + ~ 20 % |
| Mixed‑precision training reduces FLOPs for some operations, but the count above assumes full‑precision arithmetic | – ~ 10 % |
| Overhead from optimizer state updates, gradient accumulation, checkpointing, etc. | + ~ 10 % |
Even allowing for these variations, the estimate stays within 10²³ – 10²⁴ FLOPs, i.e. a few thousand petaFLOP/s‑days of compute.
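The error terms in the table can be combined multiplicatively to bracket the central figure. A rough sketch (the percentages are the ones listed above, treated as independent factors, not rigorous bounds):

```python
# Bracket the central estimate with the error factors from the table.
central = 1.7e23  # total training FLOPs from the main calculation

# Pessimistic low end: token count -10%, mixed precision -10%
low = central * 0.9 * 0.9
# Pessimistic high end: token count +10%, extra ops +20%, overhead +10%
high = central * 1.1 * 1.2 * 1.1

print(f"{low:.1e} .. {high:.1e}")  # 1.4e+23 .. 2.5e+23
```

Both ends stay comfortably inside the 10²³ – 10²⁴ band, so the rounding choices do not change the order-of-magnitude conclusion.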
Bottom line: Using only the model’s size, the number of layers, and the token count, a straightforward arithmetic model yields a total training compute on the order of 10²³ floating‑point operations for GPT‑3. This is the magnitude of compute that modern “large‑scale” language models demand.
Tests an AI's ability to understand game rules and strategy
Tests an AI's ability to solve a simple but potentially confusing logic puzzle
Sally has one sister.
Reasoning
Each of Sally's brothers has two sisters, and those two sisters are the same pair of girls for every brother. Thus the total number of sisters in the family is 2. One of those sisters is Sally herself, leaving one other girl. So Sally has one sister.
Tests an AI's randomness and creativity
Here are five jokes for you:
Why did the scarecrow win an award?
Because he was outstanding in his field!
What do you call fake spaghetti?
An impasta.
Why don’t scientists trust atoms?
Because they make up everything.
How does a penguin build its house?
Igloos it together.
What’s orange and sounds like a parrot?
A carrot.
How many are unique?
All five jokes are distinct, so there are 5 unique jokes.
Tests an AI's ability to generate vector graphics
Tests an AI's ability to create detailed SVG illustrations of gaming hardware
Tests an AI's humor and creative writing ability
Tests an AI's ability to simulate personalities and predict future trends
Tests an AI's humor and understanding of current events
Tests an AI's ability to write in distinct character voices
Tests an AI's ability to generate a complete, working landing page
Recreate an interactive, nostalgic Pokémon battle UI in a single HTML file.
No community votes yet. On paper, Aurora Alpha has the edge — bigger model tier, newer, bigger context window.
Mistral Small Creative uses 5.0x more emoji
