Gemma 3 27B vs Grok 3
Compare Gemma 3 27B by Google AI against Grok 3 by xAI, tested across 26 shared challenges. Updated February 2026.
26 challenges
Tests an AI's ability to make educated estimates based on technical knowledge
To estimate the total number of FLOPs (Floating Point Operations) required to train GPT-3, we’ll need to make some reasonable assumptions based on what’s generally known about large language models like GPT-3, its architecture, and typical training procedures. GPT-3, developed by OpenAI, is a massive model with 175 billion parameters, and while exact training details aren’t fully public, we can piece together an estimate using standard practices in machine learning.
The number of FLOPs to train a model depends on the number of parameters, the total number of tokens processed during training, and the compute cost per parameter per token.
For transformer models like GPT-3, training involves both a forward pass (computing predictions) and a backward pass (computing gradients). A common heuristic in the field is that training costs about 6 FLOPs per parameter per token: roughly 2 for the forward pass and 4 for the backward pass.
For GPT-3 with 175 billion parameters, processing one token requires: \[ 6 \times 175 \times 10^9 = 1.05 \times 10^{12} \text{ FLOPs per token} \]
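As a quick sanity check, here is a minimal Python sketch of that per-token figure. The 175B parameter count and the 6 FLOPs/parameter/token heuristic come straight from the estimate above; the variable names are just illustrative:

```python
# Per-token training cost under the ~6 FLOPs/parameter/token heuristic.
params = 175e9                 # GPT-3 parameter count (from the text)
flops_per_token = 6 * params   # ~2 forward + ~4 backward FLOPs per parameter
print(f"{flops_per_token:.2e} FLOPs per token")  # -> 1.05e+12
```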
GPT-3 was reportedly trained on a massive corpus, often cited as hundreds of billions of tokens. Let’s assume 300 billion tokens, a plausible figure based on discussions about large language model datasets (e.g., Common Crawl, books, and other sources). This is the total number of tokens processed during training, factoring in how data is sampled over epochs.
If the model processes 300 billion tokens once, the total FLOPs would be: \[ 1.05 \times 10^{12} \text{ FLOPs/token} \times 300 \times 10^9 \text{ tokens} = 3.15 \times 10^{23} \text{ FLOPs} \] This assumes one full pass through the dataset, but training typically involves multiple passes or iterations over the data, adjusted by batch size and optimization.
Training isn’t a single pass; it involves multiple gradient updates to converge. The number of training steps depends on the batch size, the number of epochs, and how many tokens the model effectively processes over the full run.
Let’s assume the model effectively processes 900 billion tokens (3× the dataset size) over training. Then: \[ 1.05 \times 10^{12} \text{ FLOPs/token} \times 900 \times 10^9 \text{ tokens} = 9.45 \times 10^{23} \text{ FLOPs} \]
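To make the whole back-of-envelope calculation concrete, here is a small sketch reproducing both the single-pass and the 3×-effective-tokens figures. The 300B-token dataset size and the 3× multiplier are the assumptions stated above, not published training details:

```python
# Total training FLOPs under the assumptions in the text.
flops_per_token = 6 * 175e9               # 1.05e12 FLOPs/token (dense model)
tokens_one_pass = 300e9                   # assumed dataset size in tokens
effective_tokens = 3 * tokens_one_pass    # assumed 3x effective passes

print(f"one pass: {flops_per_token * tokens_one_pass:.2e} FLOPs")   # 3.15e+23
print(f"total:    {flops_per_token * effective_tokens:.2e} FLOPs")  # 9.45e+23
```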
GPT-3 is a sparse transformer in some variants (e.g., earlier GPT models experimented with sparsity), but the 175B version is dense, and the 6 FLOPs/parameter/token heuristic holds for dense models. We’re not factoring in optimizations like mixed-precision training (FP16), which speeds up training on hardware but doesn’t change the FLOP count’s order of magnitude.
Based on this reasoning, the total FLOPs to train GPT-3 is approximately: \[ 9.45 \times 10^{23} \text{ FLOPs} \] or roughly \( 10^{24} \) FLOPs when rounded for simplicity.
The total number of FLOPs required to train GPT-3 is estimated at approximately \( 10^{24} \). This is a rough estimate, but it aligns with the scale of compute expected for a model of this size and training complexity.
Tests an AI's ability to understand game rules and strategy
Tests an AI's understanding of number representation
No, 9.11 is not greater than 9.9. In decimal comparison, 9.9 is larger because the whole number part (9) is the same, but the fractional part \(0.9\) is greater than \(0.11\). Numerically, \(9.9 > 9.11\).
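A one-line check makes the reasoning concrete: 9.9 is the same as 9.90, and 0.90 > 0.11 because the tenths digit 9 beats 1. As a minimal sketch, using Python's decimal module (plain floats would give the same verdict here):

```python
from decimal import Decimal

# 9.9 == 9.90, so comparing tenths first: 9 > 1, hence 9.9 > 9.11.
print(Decimal("9.11") > Decimal("9.9"))  # False
print(Decimal("9.9") > Decimal("9.11"))  # True
```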
Tests an AI's randomness and creativity
Okay, here are 5 jokes:
Why don't scientists trust atoms? Because they make up everything!
Parallel lines have so much in common. It’s a shame they’ll never meet.
Why did the scarecrow win an award? Because he was outstanding in his field!
I used to hate facial hair... but then it grew on me.
Why did the bicycle fall over? Because it was two tired!
Now, let's analyze how many are unique. "Unique" here means based on a fundamentally different joke structure or core idea.
Therefore, there are really only 3 unique joke structures or core concepts here. The others are variations on common joke tropes.
Here are five jokes for you:
Now, let’s check for uniqueness. Joke 1 and Joke 3 both start with "Why don’t skeletons fight each other?" but have different punchlines ("don’t have the guts" vs. "too busy rattling around"). They’re technically distinct due to the different resolutions, though they share a setup. The other three jokes (2, 4, and 5) have completely unique setups and punchlines.
So, all five jokes are unique in terms of their full content, even if 1 and 3 share a premise. Final count: 5 unique jokes.
Tests an AI's ability to generate vector graphics
Tests an AI's ability to create detailed SVG illustrations of gaming hardware
Tests an AI's humor and creative writing ability
(Stands center stage, adjusts mic)
Alright, alright, how we doin’ tonight? Good? Good. You all look… awake. That’s a start. Seriously though, thanks for coming out. It’s a Tuesday. A Tuesday! Like, who chooses Tuesday? It’s the beige of days. Monday’s got drama, Friday’s got vibes, but Tuesday just… exists.
I’ve been thinking a lot about adulting lately. It’s a scam, right? They tell you when you’re a kid, “Oh, you’ll love being an adult! You can eat dessert whenever you want!” Lies. Absolute lies. I tried to eat cake for breakfast this morning. My stomach staged a full-blown revolt. It was a very passive-aggressive revolt. Just a low rumble of discontent. Like, “Really? This is how you’re choosing to live?”
And bills. Oh, the bills. It's like they multiply when you're not looking. I swear, I paid my internet bill last week, and today I got another one. It’s like a hydra, but instead of heads, it’s overdue notices. And the wording! They always phrase it so politely, like you’re the one being unreasonable. “Dear Valued Customer, We noticed a slight… discrepancy… in your account. A discrepancy of $300. Please rectify this at your earliest convenience. With warmest regards.” Warmest regards?! You’re threatening to cut off my connection to cat videos!
(Pauses for laughter)
Speaking of cats, I got a new one. He’s…special. He thinks he's a tiny, furry king. He sits on top of the fridge and judges my life choices. I'm making a sandwich, and he's just staring at me. Like, "Really? Tuna again? You call yourself a provider?" I try to make eye contact, and he just slowly, deliberately, licks his paw. It’s the most condescending thing I’ve ever experienced.
And he's obsessed with boxes. You can buy him a $50 cat tower with all the bells and whistles, but he’d rather sit in the Amazon box it came in. It's a statement, I think. A commentary on consumerism. Or maybe he’s just a cat. Honestly, with him, it's 50/50.
(Looks at watch)
Oh man, is that the time? Alright, well, thanks for listening to me complain about my life. It’s been therapeutic. And remember, if your cat judges you, you’re probably doing something right. Or at least, something entertaining for him.
You’ve been a great crowd! Drive safe, and try to avoid Tuesdays if you can.
(Bows and exits)
Tests an AI's ability to generate a complete, working landing page
Tests an AI's ability to recreate an interactive, classic Mario level in a single HTML file
Tests an AI's ability to replicate an existing UI with Tailwind CSS
Tests an AI's ability to create smooth web animations
Tests an AI's ability to create interactive web elements