Qwen3 Coder Plus vs Aurora Alpha
Compare Qwen3 Coder Plus by Qwen against Aurora Alpha by OpenRouter, context windows of 128K vs 128K, tested across 50 shared challenges. Updated February 2026.
45 challenges
Tests an AI's ability to make educated estimates based on technical knowledge
Step‑by‑step back‑of‑the‑envelope estimate
Below is a simple calculation that uses only publicly‑known high‑level facts about GPT‑3 (size of the model, length of the training data, and the basic arithmetic cost of a Transformer layer).
All numbers are rounded to make the math easy to follow; the goal is to get an order‑of‑magnitude figure rather than an exact count.
| Property | Approximate value |
|---|---|
| Number of parameters (the “175 B” model) | 1.75 × 10¹¹ |
| Hidden dimension d (the size of each token representation) | ≈ 12 288 |
| Number of Transformer layers L | 96 |
| Feed‑forward inner dimension d_ff (usually 4 × d) | ≈ 49 152 |
| Size of the training corpus (tokens) | ≈ 3 × 10¹¹ tokens (≈ 300 B) |

Training uses an Adam‑style optimizer, mixed precision, etc.; we ignore those details for the FLOP count.
These numbers are taken from the model card and the paper; they are the only “hard” data we need.
A standard Transformer layer consists of two main sub‑blocks:
| Sub‑block | Main operations (per token) | Approx. FLOPs |
|---|---|---|
| Self‑attention (Q, K, V projections + attention scores + weighted sum) | 3 × d × d (proj) + d × d (QK) + d × d (AV) | ≈ 5 × d² |
| Feed‑forward (two linear layers, activation in between) | d × d_ff = 4 × d² (first linear) + d_ff × d = 4 × d² (second linear) | ≈ 8 × d² |
Summing them gives roughly
\[ \text{FLOPs per token per layer} \;\approx\; (5+8)\,d^{2} = 13\,d^{2}. \]
Why the factor 13?
\[ d = 12{,}288 \quad\Rightarrow\quad d^{2} \approx 1.51\times10^{8}. \]
\[ 13\,d^{2} \approx 13 \times 1.51\times10^{8} \approx 1.96\times10^{9}\;\text{FLOPs per token per layer}. \]
\[ \text{FLOPs per token (forward pass)} = L \times 13\,d^{2} \approx 96 \times 1.96\times10^{9} \approx 1.88\times10^{11}. \]
So a single forward pass of one token through the full 96‑layer model costs ≈ 2 × 10¹¹ FLOPs.
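The per-token arithmetic above can be checked with a few lines of Python. This is a minimal sketch that uses the same simplification as the estimate (about 13 d² FLOPs per token per layer); the variable names are illustrative and are not taken from any GPT‑3 codebase.

```python
# Values quoted in the estimate above.
d_model = 12_288   # hidden dimension d
n_layers = 96      # number of Transformer layers L

# Simplified per-layer cost from the estimate:
# about 5*d^2 for self-attention plus 8*d^2 for the feed-forward block.
ATTENTION_FACTOR = 5
FEED_FORWARD_FACTOR = 8

flops_per_token_per_layer = (ATTENTION_FACTOR + FEED_FORWARD_FACTOR) * d_model**2
flops_per_token_forward = n_layers * flops_per_token_per_layer

print(f"per layer:    {flops_per_token_per_layer:.2e}")  # 1.96e+09
print(f"forward pass: {flops_per_token_forward:.2e}")    # 1.88e+11
```

Both printed values match the rounded figures in the text.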
\[ \text{Tokens processed} = 3\times10^{11}. \]
\[ \text{Forward FLOPs total} = 3\times10^{11} \times 1.88\times10^{11} \approx 5.6\times10^{22}. \]
During training we must compute gradients. A common rule of thumb is that the backward pass costs about 2 × the forward pass, so the total compute (forward + backward) is roughly 3 × the forward cost.
\[ \text{Total training FLOPs} \approx 3 \times 5.6\times10^{22} \approx 1.7\times10^{23}. \]
\[ \boxed{\approx 10^{23}\ \text{FLOPs}\ (\text{a few} \times 10^{23})} \]
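The scaling from one token to the full training run can be sketched as a small function. This assumes the rule of thumb stated above (backward pass costs about 2 × the forward pass); `training_flops` and its parameter names are hypothetical helpers introduced only for illustration.

```python
def training_flops(forward_flops_per_token: float,
                   n_tokens: float,
                   backward_to_forward_ratio: float = 2.0) -> float:
    """Total training FLOPs = tokens * forward cost * (1 + backward/forward)."""
    forward_total = forward_flops_per_token * n_tokens
    return forward_total * (1.0 + backward_to_forward_ratio)


# Inputs taken from the estimate: 1.88e11 forward FLOPs per token, 3e11 tokens.
total = training_flops(1.88e11, 3e11)
print(f"total training FLOPs ~ {total:.1e}")
```

With the default ratio of 2, the result lands at roughly 1.7 × 10²³, the same order of magnitude as the boxed figure.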
In words: training the 175‑billion‑parameter GPT‑3 model required on the order of a hundred sextillion floating‑point operations.
This figure is consistent with the “exascale” compute budgets that have been reported for the largest language‑model training runs.
| Source of error | Effect on the estimate |
|---|---|
| Exact token count (the corpus size is reported as 300 B but the actual number of training steps may be slightly higher) | ± 10 % |
| The factor 13 for FLOPs per layer is a simplification (real implementations have extra ops for layer‑norm, bias terms, dropout, etc.) | + ~ 20 % |
| Mixed‑precision training reduces FLOPs for some operations, but the count above assumes full‑precision arithmetic | – ~ 10 % |
| Overhead from optimizer state updates, gradient accumulation, checkpointing, etc. | + ~ 10 % |
Even allowing for these variations, the estimate stays within 10²³ – 10²⁴ FLOPs, i.e. a few hundred zetta‑FLOPs (1 zetta‑FLOP = 10²¹ FLOPs).
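One way to check that the error sources keep the estimate inside that range is to stack them in the worst case. This is a rough sketch: combining the percentages by multiplication is an assumption of this example, not something the estimate itself specifies.

```python
import math

baseline = 1.7e23  # total training FLOPs from the estimate above

# Multipliers read off the error table. Stacking them multiplicatively is
# an assumption of this sketch; the table only lists them individually.
low_factors = [0.90, 0.90]         # token count -10 %, mixed precision -10 %
high_factors = [1.10, 1.20, 1.10]  # token count +10 %, extra ops +20 %, overhead +10 %

low = baseline * math.prod(low_factors)
high = baseline * math.prod(high_factors)

print(f"low:  {low:.2e}")
print(f"high: {high:.2e}")
print("within 1e23..1e24:", 1e23 <= low and high <= 1e24)
```

Even with every adjustment pushed in the same direction, both bounds stay between 10²³ and 10²⁴ FLOPs.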
Bottom line: Using only the model’s size, the number of layers, and the token count, a straightforward arithmetic model yields a total training compute on the order of 10²³ floating‑point operations for GPT‑3. This is the magnitude of compute that modern “large‑scale” language models demand.
Tests an AI's ability to understand game rules and strategy
Tests an AI's ability to solve a simple but potentially confusing logic puzzle
Sally has one sister.
Reasoning
Thus the total number of sisters in the family is 2. One of those sisters is Sally, leaving one other sister for her. So Sally has one sister.
Tests an AI's randomness and creativity
Here are 5 jokes for you:
Why don't scientists trust atoms? Because they make up everything!
I'm reading a book about anti-gravity. It's impossible to put down!
Why did the scarecrow win an award? He was outstanding in his field!
I told my wife she was drawing her eyebrows too high. She looked surprised.
Why don't eggs tell jokes? They'd crack each other up!
All 5 of these jokes are unique - each one has a different setup and punchline, covering various topics from science to farming to relationships to beauty to food. No duplicates here!
Here are five jokes for you:
Why did the scarecrow win an award?
Because he was outstanding in his field!
What do you call fake spaghetti?
An impasta.
Why don’t scientists trust atoms?
Because they make up everything.
How does a penguin build its house?
Igloos it together.
What’s orange and sounds like a parrot?
A carrot.
How many are unique?
All five jokes are distinct, so there are 5 unique jokes.
Tests an AI's ability to generate vector graphics
Tests an AI's ability to create detailed SVG illustrations of gaming hardware
Tests an AI's humor and creative writing ability
"Modern Problems"
walks on stage
So I tried to be healthy this year. I downloaded one of those fitness apps that tracks your steps. Apparently, walking to the fridge 47 times a day doesn't count as "cardio." Who knew? The app gave me a notification: "You walked 235 steps today. Pathetic." Thanks, Karen. I know my life's a disappointment.
takes a sip of water
I love how we've outsourced all our basic human functions to technology. My phone knows I'm sad before I do. It's like, "Hey, you've been staring at the same email for three hours. Want to order sushi?" Yes, phone. Yes I do. How did you know I was emotionally constipated?
mimics phone
My smart home is smarter than me. I say "Alexa, turn on the lights" and she's like, "Did you mean: how to fix your life?" The thermostat judges my shower habits. "Dave, you took a 23-minute shower. The environment is dying." Sorry, thermostat! I was processing trauma!
paces
Dating apps are wild. My profile says I'm "adventurous" because I once ordered Thai food without looking at the spice level. It was a medium. I cried for twenty minutes. But hey, I finished it – that's commitment! My matches are like, "I love hiking and yoga." Cool, Karen. I love naps and not returning phone calls. We're all just making it up as we go.
shakes head
Social media's the best, right? I posted a picture of my breakfast and got more validation than my entire college career. "This avocado toast looks amazing!" Thank you, stranger from Idaho. Finally, someone recognizes my culinary achievements. Three likes. I've never felt more seen.
mimes taking a photo
Kids today are so advanced. My nephew's six years old and he's already better at technology than I am. I asked him to help me connect to WiFi and he fixed my marriage in the process. "Uncle Dave, you and Aunt Sarah haven't talked in two weeks. That's the real connection problem."
looks around
I tried meditation. Sat there for ten minutes thinking about everything I need to do. The app was like, "Find your inner peace." I found my inner panic attack instead. Much more familiar territory.
shrugs
You know what the real problem is? We're all pretending we have it together while scrolling through other people's highlight reels. Everyone's posting about their perfect lives while secretly eating cereal for dinner again. We're all Dave. We're all just Daves, trying to look good in phone cameras and failing.
takes a bow
Thank you, you've been a wonderful audience. Don't forget to like and subscribe to your local comedian!
exits
[Opening – 0:00‑0:30]
Hey, everyone! Great to see you all here tonight. I just got back from a week of “working from home,” which is a fancy way of saying my cat now thinks I’m a permanent fixture on the couch. She’s started charging me rent—she’s got a little bowl of kibble on the coffee table and a sign that says “No humans beyond this point.” I’m not sure if she’s being demanding or just trying to get me to finally stop pretending I’m a professional chef.
[The “Fitness” Saga – 0:30‑1:15]
Speaking of professional chef, I tried a new fitness app that promises “personalized workouts.” The first day it asked me, “What’s your goal?” I said, “I want to be able to lift the grocery bag without groaning.” The app replied, “Great! Let’s start with a 5‑minute warm‑up.” Five minutes later I was on the floor, questioning every life choice that led me to this point. The app then suggested a “quick stretch.” I stretched my patience, my budget, and my belief that I could ever be a “runner.” The only thing I ran was out of time for a proper nap.
[Zoom Meetings – 1:15‑2:00]
Zoom has become the new office. The best part? You can wear a suit on the top and pajamas on the bottom, and no one will ever know. The worst part? The “mute” button. I’ve been muted so many times I’m convinced my boss thinks I’m a mime. One day I tried to speak up about a project, and the whole team heard the sound of my cat knocking over a plant. I think the plant was more articulate than I am. And there’s always that one person who forgets to turn off their background. I’ve seen a beach, a spaceship, and a very convincing replica of a 1970s kitchen—all while they’re discussing quarterly earnings. I’m waiting for the day someone shows up with a live marching band in the background. That would be the ultimate “I’m not paying attention” signal.
[Grocery Store Adventures – 2:00‑2:45]
I went to the grocery store the other day and realized it’s basically a social experiment. The “express lane” is a myth. I stood there for ten minutes while someone in front of me tried to decide if a single avocado was “ripe enough.” Meanwhile, the person behind me was silently judging my choice of cereal—because apparently, buying “Chocolate Cheerios” is a statement about my life choices. And then there’s the “self‑checkout.” I’m convinced the machine is secretly judging me. Every time I scan an item, it beeps like it’s saying, “Really? Another bag of chips? You’ve got a whole pantry, buddy.” I tried to scan a banana, and it refused, insisting I needed a “fruit‑to‑human interaction” license.
[Closing – 2:45‑3:00]
So, here’s the takeaway: life is a series of tiny, absurd moments that we all pretend to have under control. Whether it’s a cat demanding rent, a fitness app that thinks I’m a superhero, or a Zoom call that could double as a theater production, the best we can do is laugh at the chaos. Thanks for letting me share my stories—now go out there and make sure your cat doesn’t start charging you rent. Good night!
Tests an AI's ability to simulate personalities and predict future trends
Tests an AI's humor and understanding of current events
Tests an AI's ability to write in distinct character voices
Tests an AI's ability to generate a complete, working landing page
Recreate an interactive, nostalgic Pokémon battle UI in a single HTML file.