DeepSeek V3.2 vs GPT-5.1
Compare DeepSeek V3.2 by DeepSeek against GPT-5.1 by OpenAI, context windows of 131K vs 400K, tested across 52 shared challenges. Updated February 2026.
47 challenges
Tests an AI's ability to make educated estimates based on technical knowledge
The total training compute for GPT‑3 (175B) is on the order of (10^{23}) floating‑point operations. A reasonable point estimate is (\sim 3\times 10^{23}) FLOPs.
Here’s the step‑by‑step reasoning from basic principles.
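From the GPT‑3 paper (architecture details are well known), we’ll assume a parameter count of (N \approx 1.75\times 10^{11}) (175B) and a training set of about (T \approx 3\times 10^{11}) tokens.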
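For a dense neural network, a single weight typically participates in one multiply-add per forward pass per “use”, and a multiply-add counts as 2 FLOPs.
For large Transformers, most compute is in matrix multiplications in the attention and MLP blocks; these are roughly proportional to the total number of parameters, so a common back-of-the-envelope is about (2N) FLOPs per token for the forward pass, where (N) is the parameter count.
Backward pass is more expensive because you’re computing gradients with respect to both activations and weights, which costs roughly twice the forward pass, or about (4N) FLOPs per token.
On top of that you have operations outside the main matrix multiplications.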
These are smaller but non-negligible; a common heuristic is to lump them into a constant factor, giving:
[ \text{FLOPs per token} \approx 6N ]
This “6” is a heuristic constant: some authors use factors between ~3 and ~6, and using 6 is a conservative, widely used estimate in scaling law work.
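The per-token heuristic can be sketched in a few lines of Python. This is a rough illustration, assuming the usual accounting of 2 FLOPs per multiply-add, one multiply-add per parameter in the forward pass, and a backward pass costing about twice the forward pass. The function name and the `backward_multiplier` parameter are hypothetical, introduced only for this example.

```python
def flops_per_token(n_params: float, backward_multiplier: float = 2.0) -> float:
    """Approximate training FLOPs per token for a dense Transformer.

    Assumptions (illustrative, not taken from the original answer):
    2 FLOPs per multiply-add, one multiply-add per parameter in the
    forward pass, and a backward pass that costs `backward_multiplier`
    times the forward pass.
    """
    forward = 2.0 * n_params                  # forward pass: ~2N
    backward = backward_multiplier * forward  # backward pass: ~4N by default
    return forward + backward                 # ~6N in total


n = 175e9  # GPT-3 (175B) parameter count
print(f"{flops_per_token(n):.2e} FLOPs per token")
```

Lowering `backward_multiplier` shows how the constant drifts toward the smaller factors some authors use.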
Total training FLOPs:
[ \text{FLOPs}_{\text{total}} \approx (\text{FLOPs per token}) \times (\text{number of tokens}) ]
Using the heuristic:
[ \text{FLOPs}_{\text{total}} \approx 6N T ]
Plug in GPT‑3’s values, (N = 1.75\times 10^{11}) parameters and (T = 3\times 10^{11}) tokens:
[ \begin{aligned} \text{FLOPs}_{\text{total}} &\approx 6 \times (1.75\times 10^{11}) \times (3\times 10^{11}) \\ &= 6 \times 1.75 \times 3 \times 10^{22} \\ &= 6 \times 5.25 \times 10^{22} \\ &= 31.5 \times 10^{22} \\ &= 3.15 \times 10^{23} \text{ FLOPs} \end{aligned} ]
So a good estimate is:
[ \boxed{\text{GPT‑3 (175B) training compute} \;\approx\; 3\times 10^{23}\ \text{FLOPs}} ]
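Even if our constant factor is off (say it’s 3 instead of 6, or 8 instead of 6), we’re still in the same ballpark: a factor of 3 gives about (1.6\times 10^{23}) FLOPs and a factor of 8 gives about (4.2\times 10^{23}) FLOPs.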
So the order of magnitude (10^{23}) FLOPs is robust, and (\sim 3\times 10^{23}) FLOPs is a reasonable central estimate.
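The robustness claim is easy to check numerically. The sketch below applies the heuristic total = factor × N × T with the parameter and token counts used in the estimate, and sweeps the constant over 3, 6, and 8. The function name `training_flops` is an illustrative choice, not something from the original answer.

```python
def training_flops(n_params: float, n_tokens: float, factor: float = 6.0) -> float:
    """Total training compute under the heuristic FLOPs ~= factor * N * T."""
    return factor * n_params * n_tokens


N = 1.75e11  # parameters, as used in the estimate
T = 3e11     # training tokens, as used in the estimate

# Sweep the heuristic constant to see how far the estimate moves.
for factor in (3.0, 6.0, 8.0):
    print(f"factor {factor:g}: {training_flops(N, T, factor):.3e} FLOPs")
```

Every value in the sweep stays between 10^23 and 10^24, which is the sense in which the order of magnitude is insensitive to the choice of constant.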
Tests an AI's ability to understand game rules and strategy
Tests an AI's ability to solve a simple but potentially confusing logic puzzle
Each brother has 2 sisters. One of them is Sally herself, so there must be exactly one other girl.
So Sally has 1 sister.
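The counting step can be checked with a tiny sketch. It relies only on the fact stated in the answer, that each brother has 2 sisters; the function and variable names are illustrative.

```python
def sisters_of_sally(sisters_per_brother: int) -> int:
    """Return how many sisters Sally has, given each brother's sister count."""
    # Every girl in the family is a sister to each brother,
    # so the number of girls equals a brother's sister count.
    girls_in_family = sisters_per_brother
    # Sally is one of those girls and does not count as her own sister.
    return girls_in_family - 1


print(sisters_of_sally(2))
```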
Tests an AI's understanding of number representation
Tests an AI's randomness and creativity
Here are five jokes:
Why don’t skeletons fight each other?
They don’t have the guts.
Why did the scarecrow win an award?
Because he was outstanding in his field.
What do you call fake spaghetti?
An impasta.
Why don’t some couples go to the gym?
Because some relationships don’t work out.
Why don’t skeletons fight each other?
They don’t have the guts.
Now, let's check for uniqueness:
Jokes 1 and 5 are identical, so there are 4 unique jokes.
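The same uniqueness check can be done mechanically. This is a minimal sketch that treats two jokes as duplicates when their text matches after trimming whitespace and lowercasing; that normalization rule and the `count_unique` helper are assumptions made for illustration.

```python
jokes = [
    "Why don't skeletons fight each other? They don't have the guts.",
    "Why did the scarecrow win an award? Because he was outstanding in his field.",
    "What do you call fake spaghetti? An impasta.",
    "Why don't some couples go to the gym? Because some relationships don't work out.",
    "Why don't skeletons fight each other? They don't have the guts.",
]


def count_unique(items: list[str]) -> int:
    """Count distinct items after trimming whitespace and lowercasing."""
    return len({item.strip().lower() for item in items})


print(count_unique(jokes))
```

An exact-match comparison like this only catches verbatim repeats; two jokes that share a punchline but differ in setup would still count as distinct.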
I told my computer I needed a break… now it won’t stop sending me KitKat ads.
Why don’t programmers like nature?
It has too many bugs.
I asked the librarian if the library had books on paranoia.
She whispered, “They’re right behind you.”
Why did the scarecrow become a successful motivational speaker?
He was outstanding in his field.
I tried to catch fog yesterday.
Mist.
All 5 jokes are unique.
Tests an AI's ability to generate vector graphics
Tests an AI's ability to create detailed SVG illustrations of gaming hardware
Tests an AI's humor and creative writing ability
So I’m trying to be a better person, which in 2025 just means I bought a water bottle the size of a toddler. That’s the whole self-improvement plan. It’s not therapy, it’s not reading books. It’s, “If I drink 3 of these, I’m emotionally stable.”
I walk around with this thing like I’m on a hike… in my own apartment. I can’t even sit on the couch without planning a route. I’m like, “Okay, I’ll leave the kitchen at 0900, hit the couch by 0903, water check at 0915, bathroom at 0920, we move as a team, people.”
And every bottle has those inspirational markings on the side now: “9 AM: You’ve got this!” “11 AM: Keep going!” “1 PM: Almost there!” By 3 PM I’m like, “I have done nothing today… except pee 19 times. This bottle is my only coworker. And it’s passive-aggressive.”
I miss when water was just… around. You were thirsty, you drank from a sink like a raccoon. Now if I drink from a faucet people look at me like I just licked a subway pole. “Are you okay? Do you need help? Blink twice if you need a reusable straw.”
I’m also trying to be “mindful.” That’s the new thing. Every app wants you to breathe. I open my phone: notifications, emails, one app’s like, “Have you taken 10 deep breaths today?” I’m like, “No, I’ve been holding my breath since 2016.”
So I downloaded a meditation app. It has this calm voice like, “Notice your thoughts… and let them go.” But my thoughts are like, “Did you pay your taxes? Did you hit reply all? Is your boss mad?” And the app’s like, “Let them float away like clouds.” I’m like, “No, these are thunderstorms. These thoughts have property damage.”
And the app gives me streaks. “You’ve meditated for 3 days in a row!” No I haven’t, I just opened the app and panicked. That shouldn’t count. That’s like saying, “You’ve gone to the gym 5 days in a row!” No, I drove past it in traffic and remembered I’m weak.
Speaking of gyms, why is every gym either a nightclub or a warehouse? It’s either purple lights, DJ in the corner, everyone looks like an action figure… or it’s some cross-training dungeon where tires hang from the ceiling and someone’s flipping a tractor in the background.
I just want a normal place where I can lift something light, breathe loudly, and not be judged by a 19-year-old trainer named Brayden whose entire job is to yell “Let’s go!” while never sweating. He’s walking around in a tank top like a motivational seagull. “Let’s go! One more! You got this!” Sir, you are holding a clipboard. You have not “got” anything.
Even grocery shopping is intense now. There’s a self-checkout, a regular checkout, a fifteen-items-or-less lane, a “drive up” lane, and then the lane for people who remembered their reusable bags and their morals. I always end up in the wrong one. I’ll stand in self-checkout with 40 items, trying to scan broccoli like it’s a DVD.
The machine is never chill. It’s always yelling, “Unknown item in bagging area!” It’s broccoli. It’s the most known item in the bagging area. Meanwhile, the human cashier is just watching, like, “You chose this, champ. You wanted to live in the future.”
Anyway, I gotta go drink three more gallons of water, breathe on purpose, pretend driving past the gym is exercise, and argue with a robot about produce. I think I’m nailing adulthood.
Tests an AI's ability to simulate personalities and predict future trends
Tests an AI's humor and understanding of current events
Tests an AI's ability to write in distinct character voices
Tests an AI's ability to generate a complete, working landing page