GPT-4o (Omni) vs GPT-5 Codex
Compare GPT-4o (Omni) and GPT-5 Codex, both from OpenAI, tested across 49 shared challenges. Updated February 2026.
44 challenges
Tests an AI's ability to make educated estimates based on technical knowledge
Step 1 – Note the basic training quantities
Step 2 – Relate parameters, tokens, and FLOPs
For dense transformers, almost all computation comes from matrix multiplies.
A practical rule (derived by counting multiply–adds in the attention and MLP blocks, and accounting for the backward pass) is:
\[ \text{Training FLOPs} \approx 6 \times (\text{number of parameters}) \times (\text{number of tokens}) \]
Here’s the intuition:
(The softmax, layernorm, and activation costs are much smaller and are absorbed into this constant.)
Step 3 – Plug in the GPT‑3 values
\[ \text{FLOPs} \approx 6 \times (1.75 \times 10^{11}) \times (3.0 \times 10^{11}) \]
\[ = 6 \times 5.25 \times 10^{22} = 31.5 \times 10^{22} \approx 3.15 \times 10^{23} \text{ FLOPs} \]
Step 4 – Interpret
So, training GPT‑3 to completion requires on the order of \(3 \times 10^{23}\) floating-point operations.
That estimate agrees with more detailed FLOP breakdowns of transformers to within tens of percent, which is sufficient for order-of-magnitude comparisons.
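The arithmetic in Steps 2 and 3 can be checked with a few lines of Python. This is only a sketch: the parameter and token counts are the values used in Step 3, the function name `training_flops` is made up for illustration, and the factor of 6 is the rule-of-thumb constant from Step 2 (commonly explained as roughly 2 FLOPs per parameter per token for the forward pass plus roughly 4 for the backward pass).

```python
# Values taken from Step 3 above; treat them as the estimate's inputs,
# not as independently verified figures.
N_PARAMS = 1.75e11   # GPT-3 parameter count used in the estimate
N_TOKENS = 3.0e11    # training tokens used in the estimate


def training_flops(n_params: float, n_tokens: float,
                   flops_per_param_token: float = 6.0) -> float:
    """Rule-of-thumb training cost for a dense transformer.

    The default constant of 6 is the approximation from Step 2; it ignores
    the smaller softmax, layernorm, and activation terms.
    """
    return flops_per_param_token * n_params * n_tokens


total = training_flops(N_PARAMS, N_TOKENS)
print(f"{total:.2e} FLOPs")  # prints 3.15e+23 FLOPs
```

Changing `flops_per_param_token` makes it easy to see how sensitive the result is to the constant; the order of magnitude stays the same for any value near 6.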
Tests an AI's ability to understand game rules and strategy
Tests an AI's ability to solve a simple but potentially confusing logic puzzle
Sally has 1 sister.
Tests an AI's understanding of number representation
No. Since 9.11 is equal to 9.110 and 9.9 is equal to 9.900, we see that 0.110 < 0.900, so 9.11 is less than 9.9.
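The same comparison can be reproduced in Python as a quick sanity check. This sketch uses the standard-library `decimal` module so the values are compared as exact decimals; the helper name `pad_fraction` is hypothetical and only mirrors the "pad to the same number of decimal places" reasoning in the answer above.

```python
from decimal import Decimal

a = Decimal("9.11")
b = Decimal("9.9")

# Direct numeric comparison of the two decimals.
print(a > b)  # False
print(a < b)  # True


def pad_fraction(number: str, width: int) -> str:
    """Right-pad the fractional part with zeros, e.g. '9.9' -> '9.900'."""
    whole, _, frac = number.partition(".")
    return f"{whole}.{frac.ljust(width, '0')}"


# With equal-length fractional parts the digits can be compared directly.
print(pad_fraction("9.11", 3), pad_fraction("9.9", 3))  # 9.110 9.900
```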
Tests an AI's randomness and creativity
Tests an AI's ability to generate vector graphics
Tests an AI's ability to create detailed SVG illustrations of gaming hardware
Tests an AI's humor and creative writing ability
(Walks on stage, looks around, shakes head.)
You ever wake up in the morning, stretch, take a deep breath... and immediately regret all your life choices?
I woke up today, my body made a noise I’ve never heard before. It wasn’t a crack, it wasn’t a pop… it was more like my bones whispered, “You did this.”
Aging is wild. When you’re a kid, you fall off a bike, scrape your knee, and just run it off. As an adult, I sleep slightly wrong, and now my neck is on strike for a week. Like, it doesn’t even negotiate. Just full-on protest mode.
And it’s not just the body. My patience has disappeared. I was in line at the grocery store, and the guy in front of me had, I kid you not, 19 coupons. That’s not a couponer; that’s a mathematician. He was doing long division at the register, carrying the one, solving for X… Sir, I am here for bread. I did not sign up for an SAT exam.
Speaking of the grocery store… why is it that no matter what I buy, the cashier always asks, "Do you need a bag?" Like, yes. I’m buying 17 things. You think I’m gonna carry them out like some kind of overachieving circus act? Balancing a watermelon on my shoulder, eggs under my chin?
And self-checkout? That’s just unpaid labor. I scanned three things, and the machine had the nerve to say, “Unexpected item in the bagging area.” First of all, you put the bag there. Second, don’t make me feel like I’m stealing when I clearly don’t have the coordination for crime.
Look, I know my limits. I’m not built for a life of crime. I can’t even jaywalk without looking both ways five times. Like, I’ll see an empty road and still be like, “Mmm… let me wait.” Meanwhile, some 80-year-old woman is just strolling past me like she owns the city. I saw a guy jaywalk in New York without even looking up from his phone. That’s next-level confidence. That’s a man who has made peace with whatever happens.
I respect it. I really do.
Anyway, thanks for listening to me complain. I hope all your bones stay silent tonight.
(Walks off stage.)
Tests an AI's ability to simulate personalities and predict future trends
Tests an AI's humor and understanding of current events
Tests an AI's ability to write in distinct character voices
Tests an AI's ability to generate a complete, working landing page