o1 vs OpenAI o4-mini
Compare o1 and OpenAI o4-mini, both from OpenAI, tested across 32 shared challenges. Updated February 2026.
Tests an AI's ability to make educated estimates based on technical knowledge
Here's a back‐of‐the‐envelope derivation that gets us into the right ballpark (∼10²³ FLOPs total):
Model size
• GPT–3 "175 B" has about P ≃1.75×10¹¹ parameters.
Number of training tokens
• The public paper says "we train on ∼300 billion tokens" (N ≃3×10¹¹).
FLOPs per token
• Every forward+backward through the network touches all parameters.
• Rough rule of thumb: about 6 FLOPs per parameter per token
(roughly 2 in the forward pass, since each multiply-add counts as 2 FLOPs, and about twice that in the backward pass; optimizer updates add comparatively little).
• So 6 × P FLOPs/token ≃ 6×1.75×10¹¹ ≃1.05×10¹² FLOPs/token.
Total FLOPs
• Multiply FLOPs/token by number of tokens:
Total ≃ (1.05×10¹² FLOPs/token) × (3×10¹¹ tokens)
≃ 3.15×10²³ FLOPs.
Sanity check
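• 3×10²³ FLOPs is about 3.6×10³ PFLOP/s-days (since 1 PFLOP/s × 86 400 s ≈ 8.64×10¹⁹ FLOPs per day).
• That agrees with the roughly 3.6×10³ PFLOP/s-days of training compute reported in the GPT-3 paper. Since a single V100 peaks at roughly 0.1 PFLOP/s, it corresponds to tens of thousands of GPU-days or more, not a few thousand.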
Bottom‐line estimate: on the order of 10²³–10²⁴ floating‐point operations to train GPT‑3 175 B from scratch.
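The arithmetic above can be reproduced with a short Python sketch. It assumes the same inputs quoted in the derivation (P ≃ 1.75×10¹¹ parameters, N ≃ 3×10¹¹ tokens, about 6 FLOPs per parameter per token). The function and constant names are illustrative only and do not come from the original response.

```python
# Back-of-the-envelope training compute for GPT-3 175B.
# All inputs are the rough figures quoted in the derivation above.
PARAMS = 1.75e11               # P: parameter count
TOKENS = 3e11                  # N: training tokens
FLOPS_PER_PARAM_TOKEN = 6      # rule of thumb, not an exact count
SECONDS_PER_DAY = 86_400


def training_flops(params: float, tokens: float,
                   k: float = FLOPS_PER_PARAM_TOKEN) -> float:
    """Total training FLOPs under the k * P * N approximation."""
    return k * params * tokens


def to_pflops_days(flops: float) -> float:
    """Convert raw FLOPs to PFLOP/s-days (1 PFLOP/s sustained for one day)."""
    return flops / (1e15 * SECONDS_PER_DAY)


total = training_flops(PARAMS, TOKENS)
print(f"Total training compute: {total:.2e} FLOPs")
print(f"Equivalent: {to_pflops_days(total):.0f} PFLOP/s-days")
```

Changing `PARAMS`, `TOKENS`, or the factor `k` shows how sensitive the estimate is to each assumption; the result scales linearly in all three.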
Tests an AI's ability to solve a simple but potentially confusing logic puzzle
Sally has exactly one sister. Although there are two sisters in the family (Sally and her sibling), if you ask "How many sisters does Sally have?" the answer is 1.
Tests an AI's understanding of number representation
No. As decimal numbers, 9.11 is less than 9.9 (since 0.11 < 0.90), so 9.11 is not greater than 9.9.
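A minimal Python sketch of the comparison, using the standard-library `decimal` module so the values are compared exactly as written. The second half is an assumption about where the confusion comes from: if the same strings are read as version numbers (a hypothetical `as_version` helper, not part of the original answer), the ordering flips.

```python
from decimal import Decimal

# Compare the two values as decimal numbers.
a = Decimal("9.11")
b = Decimal("9.9")
is_greater = a > b
print(f"9.11 > 9.9 as decimals: {is_greater}")


def as_version(text: str) -> tuple[int, ...]:
    """Hypothetical helper: read '9.11' as version components (9, 11)."""
    return tuple(int(part) for part in text.split("."))


# Read as version numbers, the fields are compared as integers,
# so 11 > 9 and the ordering is reversed.
version_greater = as_version("9.11") > as_version("9.9")
print(f"9.11 > 9.9 as versions: {version_greater}")
```

The two readings disagree because decimal comparison pads 9.9 to 9.90, while version comparison treats the parts after the dot as separate integers.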
Tests an AI's randomness and creativity
Here are five jokes:
All five jokes are unique.
Tests an AI's ability to generate vector graphics
Tests an AI's ability to create detailed SVG illustrations of gaming hardware
Tests an AI's humor and creative writing ability
[Comic steps on stage, takes the mic]
“Hey everyone, thanks for coming out tonight. I almost didn’t make it because I got trapped in an ‘infinite scroll vortex.’ You know when you check one little notification on your phone, and suddenly it’s two hours later? My left thumb is basically a professional athlete at this point—I'm in the Guinness Book of World Records for ‘fastest social media refresh.’ I feel like every time I tap my screen, I’m signing away another portion of my soul. But hey, at least I’m caught up on all my ex’s vacation photos and some random cat videos I can’t unsee.
So I got myself out of the house and decided to try being more ‘active.’ I went to one of those fancy gyms—where the equipment is shinier than my future. The treadmill is so high-tech it practically demands a DNA sample before you can start running. I’m standing there huffing and puffing while the machine's screen is playing scenic videos of mountains in Switzerland, encouraging me to ‘keep going.’ Meanwhile, I’m like, ‘Dude, I’m just trying not to collapse. Maybe ease up on the pep talk.’
After that, I thought I’d reward myself with a healthy smoothie. Have you seen the sizes of these things? You order a medium and you get a bucket of kale sludge. They hand it to you with a straw so wide you could drive a toy car through it. And suddenly you’re paying nine bucks to sip spinach—like that’s the best life decision you’ve made all day. Who needs retirement savings when you can have a fancy green drink that tastes suspiciously like lawn clippings?
On the way home, I got stuck in traffic. I just love that feeling when you’re gridlocked behind two people who are evidently in no rush to do anything, ever. I’m sitting there, going about two miles an hour, which is basically just a very aggressive crawl. I’m not proud of how often I shout at my windshield. My car is like my personal confessional booth, and the traffic jam is me atoning for all the times I told Netflix, ‘Yes, I’m still watching.’
Anyway, that’s me, folks. If you’ll excuse me, I have 47 unread notifications that I need to pretend don’t exist until tomorrow. Thanks so much for laughing along—and remember, if you want to feel accomplished tonight, just check something off your to-do list… or, you know, at least make one.” [Comic waves and exits the stage]
Tests an AI's ability to simulate personalities and predict future trends
Tests an AI's humor and understanding of current events
Tests an AI's ability to generate a complete, working landing page
Recreate an interactive, classic Mario level in a single HTML file.
Tests an AI's ability to replicate an existing UI with Tailwind CSS