GPT-4.5 vs Gemini 3.1 Pro Preview
Compare GPT-4.5 by OpenAI against Gemini 3.1 Pro Preview by Google AI, context windows of 128K vs 1.0M, tested across 24 shared challenges. Updated February 2026.
24 challenges
Tests an AI's ability to make educated estimates based on technical knowledge
To estimate the total number of FLOPs (Floating Point Operations) required to train GPT-3, we can use a well-established heuristic in deep learning for estimating the compute cost of training Transformer-based large language models.
The estimated total compute is roughly $3.15 \times 10^{23}$ FLOPs.
Here is the step-by-step reasoning used to arrive at this number:
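To calculate the compute required, we need to know two fundamental numbers about the model: the number of parameters ($N$, about 175 billion for GPT-3) and the number of training tokens ($D$, about 300 billion).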
In language model training, the total compute ($C$) is heavily dominated by the matrix multiplications in the dense layers. OpenAI's scaling laws provide a standard, highly accurate approximation formula for this: $C \approx 6 \times N \times D$
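Why the multiplier of 6? Each parameter costs roughly 2 FLOPs per token in the forward pass (one multiply and one add) and roughly 4 FLOPs per token in the backward pass, for about 6 FLOPs per parameter per token in total.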
(Note: This formula slightly underestimates by ignoring the quadratic cost of attention and optimizer state updates, but at the 175B scale, the dense feed-forward matrices dominate the compute so heavily that $6ND$ is considered the gold-standard approximation).
Now we simply plug the numbers into our formula:
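First, multiply $N$ and $D$: $(175 \times 10^{9}) \times (300 \times 10^{9}) = 5.25 \times 10^{22}$
Next, multiply by 6 (for the forward and backward passes): $6 \times 5.25 \times 10^{22} = 3.15 \times 10^{23}$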
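The arithmetic above can be reproduced with a short Python sketch. The values of `N_PARAMS` and `D_TOKENS` are assumptions based on commonly cited GPT-3 figures (175 billion parameters, about 300 billion training tokens), and the function name `training_flops` is purely illustrative.

```python
# Back-of-the-envelope training compute using C ≈ 6 * N * D.
# Assumed inputs: commonly cited GPT-3 figures, not measured values.
N_PARAMS = 175e9   # model parameters (N)
D_TOKENS = 300e9   # training tokens (D)

# Roughly 2 FLOPs per parameter per token forward, 4 backward.
FLOPS_PER_PARAM_PER_TOKEN = 6


def training_flops(n_params: float, n_tokens: float) -> float:
    """Estimate total training FLOPs with the 6*N*D heuristic."""
    return FLOPS_PER_PARAM_PER_TOKEN * n_params * n_tokens


total = training_flops(N_PARAMS, D_TOKENS)
print(f"N * D = {N_PARAMS * D_TOKENS:.3e}")
print(f"C     = {total:.3e} FLOPs")
```

Changing either input scales the result linearly, which makes it easy to see how sensitive the estimate is to the assumed token count.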
Does $3.15 \times 10^{23}$ FLOPs make sense in the real world? When GPT-3 was trained (around late 2019/early 2020), Nvidia V100 GPUs were the standard.
Given that training runs of this size typically take a few weeks on a few thousand GPUs (factoring in downtime, checkpointing, and slightly smaller clusters), this math aligns perfectly with historical reality.
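A rough version of this sanity check can be sketched in Python. Every input below is an assumption chosen for illustration: 125 TFLOP/s is NVIDIA's quoted peak mixed-precision tensor throughput for the V100, while the 30% utilization and the 3,000-GPU cluster size are hypothetical values, not figures reported for the actual GPT-3 run.

```python
# Sanity check: how long would ~3.15e23 FLOPs take on a V100 cluster?
# All hardware numbers below are illustrative assumptions.
SECONDS_PER_DAY = 86_400

TOTAL_FLOPS = 3.15e23        # estimate from the 6*N*D heuristic
V100_PEAK_FLOPS = 125e12     # quoted peak tensor throughput per GPU
ASSUMED_UTILIZATION = 0.30   # hypothetical fraction of peak achieved
ASSUMED_NUM_GPUS = 3000      # hypothetical cluster size


def training_days(total_flops: float, peak_flops_per_gpu: float,
                  utilization: float, num_gpus: int) -> float:
    """Wall-clock days needed at the given sustained throughput."""
    effective_flops_per_sec = peak_flops_per_gpu * utilization * num_gpus
    return total_flops / effective_flops_per_sec / SECONDS_PER_DAY


days = training_days(TOTAL_FLOPS, V100_PEAK_FLOPS,
                     ASSUMED_UTILIZATION, ASSUMED_NUM_GPUS)
print(f"Estimated wall-clock time: {days:.1f} days")
```

Under these assumptions the run takes on the order of a month, which is in the same range as "a few weeks on a few thousand GPUs". Different utilization or cluster-size assumptions shift the answer proportionally.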
Conclusion: The total compute required to train GPT-3 was approximately $3.15 \times 10^{23}$ FLOPs.
Tests an AI's ability to understand game rules and strategy
Tests an AI's ability to solve a simple but potentially confusing logic puzzle
Sally has 1 sister.
Explanation:
Sally has 1 sister.
Here is the breakdown:
Tests an AI's ability to generate vector graphics
Tests an AI's ability to create detailed SVG illustrations of gaming hardware
Tests an AI's humor and creative writing ability
Alright, folks, how's everyone doing tonight? Good? Great! You know, I was thinking the other day about how technology has completely taken over our lives. Like, remember when we used to actually memorize phone numbers? Now, if my phone dies, I'm basically stranded. I know exactly two numbers: mine and 911. And honestly, I'm not even sure about mine.
And speaking of phones, have you noticed how aggressive autocorrect has gotten lately? It's like having a tiny, judgmental English teacher living in your pocket. The other day, I tried texting my friend, "Hey, wanna grab lunch?" Autocorrect changed it to, "Hey, wanna grab lungs?" Now my friend thinks I'm either a serial killer or really into organ trafficking. Either way, lunch plans canceled.
And dating apps—oh man, dating apps. They're basically just online shopping for humans, aren't they? Swipe left, swipe right, add to cart, proceed to checkout. But the worst part is the bios. Everyone's "adventurous," everyone "loves to travel," and everyone is "looking for a partner in crime." Really? Partner in crime? I'm just looking for someone who won't judge me for eating cereal at 2 AM in my underwear. Is that too much to ask?
And have you noticed how complicated coffee orders have gotten? I was behind someone at Starbucks the other day who ordered a "venti half-caf soy latte with two pumps of vanilla, one pump of caramel, extra foam, no whip, at exactly 140 degrees." I was like, "Are you ordering coffee or launching a rocket?" Meanwhile, I'm standing there like, "Uh, can I just get a coffee-flavored coffee? Medium? Hot?"
And finally, can we talk about passwords? Every website now wants a password that's at least 12 characters, with uppercase letters, lowercase letters, numbers, symbols, hieroglyphics, and your mother's maiden name. And then they have the nerve to ask, "Would you like us to remember this password?" Yes, please, because I certainly won't.
Alright, that's my time, folks. You've been great. Thanks for laughing—or at least pretending to!
Tests an AI's ability to generate a complete, working landing page
Recreate an interactive, nostalgic Pokémon battle UI in a single HTML file.
Recreate an interactive, classic Mario level in a single HTML file.
Tests an AI's ability to replicate an existing UI with Tailwind CSS
Tests an AI's ability to create smooth web animations
Tests an AI's ability to create interactive web elements