GPT-5 Pro vs o1
Compare GPT-5 Pro and o1, both from OpenAI, context windows of 400K vs 128K, tested across 36 shared challenges. Updated February 2026.
31 challenges
Tests an AI's ability to make educated estimates based on technical knowledge
Below is one way to arrive at a back-of-the-envelope estimate for the total floating-point operations (FLOPs) needed to train GPT-3 (the 175B-parameter model). Note that different sources use slightly different assumptions, so you may see figures that differ by factors of 2–3; however, the overall scale (10^23 FLOPs) remains consistent.
Why 6?
(Exact counts differ depending on the details of the architecture, hardware efficiency, and whether you count a multiply-add as one or two FLOPs, but 6 is a convenient ballpark multiplier.)
Using the 6 FLOPs rule of thumb:
\[ \text{Total FLOPs} \;\approx\; 6 \;\times\; N_p \;\times\; N_t. \]
Plug in the numbers: \(N_p = 1.75 \times 10^{11}\) parameters (175B) and \(N_t = 3.0 \times 10^{11}\) training tokens.
So,
\[ \text{Total FLOPs} \;=\; 6 \;\times\; (1.75 \times 10^{11}) \;\times\; (3.0 \times 10^{11}). \]
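First, multiply the “nice” numbers: \(6 \times 1.75 \times 3.0 = 31.5\).
So the coefficient is 31.5. Now include the powers of ten: \(10^{11} \times 10^{11} = 10^{22}\), giving \(31.5 \times 10^{22}\).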
Hence,
\[ \text{Total FLOPs} \;\approx\; 3.15 \times 10^{23}. \]
Therefore, a rough estimate for the total training cost of GPT-3 (175B) is on the order of
\[
\mathbf{3 \times 10^{23}} \text{ FLOPs.}
\]
Depending on exact architectural details, optimizer choices, hardware efficiency, and definitions of FLOPs vs. MACs (multiply-accumulate operations), public estimates generally range from around \(2 \times 10^{23}\) to \(4 \times 10^{23}\). But the important takeaway is that it is a few times \(10^{23}\) FLOPs.
This simple calculation gives a ballpark figure without delving into the fine details of the GPT-3 architecture or specifics of training hardware.
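The arithmetic above can be checked with a few lines of Python. This is only a sketch of the rule of thumb used in the response: the function name `training_flops` and the constant names are illustrative, and the inputs (175B parameters, 300B training tokens, a multiplier of 6) are the same assumptions the estimate itself makes.

```python
def training_flops(n_params: float, n_tokens: float,
                   flops_per_param_token: float = 6.0) -> float:
    """Approximate training cost as multiplier * parameters * tokens."""
    return flops_per_param_token * n_params * n_tokens


# Values taken from the estimate above (assumptions, not measured figures).
N_PARAMS = 1.75e11   # 175B parameters
N_TOKENS = 3.0e11    # 300B training tokens

total = training_flops(N_PARAMS, N_TOKENS)
print(f"Estimated training cost: {total:.2e} FLOPs")
```

Changing `flops_per_param_token` shows how sensitive the result is to the counting convention, for example whether a multiply-add counts as one operation or two.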
Tests an AI's ability to understand game rules and strategy
Tests an AI's ability to solve a simple but potentially confusing logic puzzle
First, note that each of Sally’s three brothers has exactly two sisters. One of those sisters is clearly Sally. Since each brother’s two sisters must be the same two girls for all of them, there must be exactly one other sister besides Sally.
Therefore, Sally has 1 sister.
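The same reasoning can be written as a tiny Python sketch. It assumes the puzzle as the answer describes it (three brothers, each with two sisters) and that all siblings share the same family; the function name `sallys_sisters` is made up for illustration.

```python
def sallys_sisters(sisters_per_brother: int) -> int:
    """Number of sisters Sally has, given how many sisters each brother has."""
    # Every brother has the same sisters: all the girls in the family.
    girls_in_family = sisters_per_brother
    # Sally is one of those girls, and she is not her own sister.
    return girls_in_family - 1


answer = sallys_sisters(sisters_per_brother=2)
print(answer)  # 1
```

The number of brothers never enters the calculation, which is what makes the puzzle easy to get wrong.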
Tests an AI's understanding of number representation
Tests an AI's randomness and creativity
Here are five jokes:
Why did the scarecrow get promoted?
Because he was outstanding in his field.
I told my friend she drew her eyebrows too high.
She looked surprised.
What do you call cheese that isn’t yours?
Nacho cheese.
Why don’t skeletons fight each other?
Because they don’t have the guts.
Did you hear about the restaurant on the moon?
Great food, but absolutely no atmosphere.
How many are unique?
All five jokes listed here are distinct and thus count as five unique jokes.
Tests an AI's ability to generate vector graphics
Tests an AI's humor and creative writing ability
[Comic steps on stage, takes the mic]
“Hey everyone, thanks for coming out tonight. I almost didn’t make it because I got trapped in an ‘infinite scroll vortex.’ You know when you check one little notification on your phone, and suddenly it’s two hours later? My left thumb is basically a professional athlete at this point—I'm in the Guinness Book of World Records for ‘fastest social media refresh.’ I feel like every time I tap my screen, I’m signing away another portion of my soul. But hey, at least I’m caught up on all my ex’s vacation photos and some random cat videos I can’t unsee.
So I got myself out of the house and decided to try being more ‘active.’ I went to one of those fancy gyms—where the equipment is shinier than my future. The treadmill is so high-tech it practically demands a DNA sample before you can start running. I’m standing there huffing and puffing while the machine's screen is playing scenic videos of mountains in Switzerland, encouraging me to ‘keep going.’ Meanwhile, I’m like, ‘Dude, I’m just trying not to collapse. Maybe ease up on the pep talk.’
After that, I thought I’d reward myself with a healthy smoothie. Have you seen the sizes of these things? You order a medium and you get a bucket of kale sludge. They hand it to you with a straw so wide you could drive a toy car through it. And suddenly you’re paying nine bucks to sip spinach—like that’s the best life decision you’ve made all day. Who needs retirement savings when you can have a fancy green drink that tastes suspiciously like lawn clippings?
On the way home, I got stuck in traffic. I just love that feeling when you’re gridlocked behind two people who are evidently in no rush to do anything, ever. I’m sitting there, going about two miles an hour, which is basically just a very aggressive crawl. I’m not proud of how often I shout at my windshield. My car is like my personal confessional booth, and the traffic jam is me atoning for all the times I told Netflix, ‘Yes, I’m still watching.’
Anyway, that’s me, folks. If you’ll excuse me, I have 47 unread notifications that I need to pretend don’t exist until tomorrow. Thanks so much for laughing along—and remember, if you want to feel accomplished tonight, just check something off your to-do list… or, you know, at least make one.” [Comic waves and exits the stage]
Tests an AI's ability to simulate personalities and predict future trends
Tests an AI's humor and understanding of current events
Tests an AI's ability to write in distinct character voices
Recreate an interactive, classic Mario level in a single HTML file.
Tests an AI's ability to create smooth web animations