Claude Sonnet 3.6 (2022-10-22) vs GPT-5 Codex
Compare Claude Sonnet 3.6 (2022-10-22) by Anthropic against GPT-5 Codex by OpenAI, tested across 42 shared challenges. Updated February 2026.
37 challenges
Tests an AI's ability to make educated estimates based on technical knowledge
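Step 1 – Note the basic training quantities
GPT‑3 has about 175 billion parameters (1.75 × 10^11) and was trained on about 300 billion tokens (3.0 × 10^11).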
Step 2 – Relate parameters, tokens, and FLOPs
For dense transformers, almost all computation comes from matrix multiplies.
A practical rule (derived by counting multiply–adds in the attention and MLP blocks, and accounting for the backward pass) is:
\[ \text{Training FLOPs} \approx 6 \times (\text{number of parameters}) \times (\text{number of tokens}) \]
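Here’s the intuition: in the forward pass, each parameter takes part in roughly one multiply and one add per token (about 2 FLOPs per parameter per token), and the backward pass costs roughly twice as much as the forward pass (about 4 more), for a total of about 6.
(The softmax, layernorm, and activation costs are much smaller and get absorbed into this constant.)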
Step 3 – Plug in the GPT‑3 values
\[ \text{FLOPs} \approx 6 \times (1.75 \times 10^{11}) \times (3.0 \times 10^{11}) \]
\[ = 6 \times 5.25 \times 10^{22} = 31.5 \times 10^{22} \approx 3.15 \times 10^{23} \text{ FLOPs} \]
Step 4 – Interpret
So, training GPT‑3 to completion requires on the order of \(3 \times 10^{23}\) floating-point operations.
That estimate aligns with detailed breakdowns of transformers and is accurate to within tens of percent, which is sufficient for order-of-magnitude comparisons.
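As a quick check on the arithmetic in Steps 2 to 4, here is a minimal Python sketch of the 6 × parameters × tokens rule. The function name `training_flops` and its `flops_per_param_token` argument are illustrative choices, not part of the original answer; the parameter and token counts are the ones plugged in at Step 3.

```python
def training_flops(n_params: float, n_tokens: float,
                   flops_per_param_token: float = 6.0) -> float:
    """Rough training cost of a dense transformer.

    Uses the rule of thumb FLOPs ~= 6 * parameters * tokens.
    The constant 6 is an approximation, not an exact count.
    """
    return flops_per_param_token * n_params * n_tokens


# Values used in Step 3 of the estimate.
gpt3_params = 1.75e11   # about 175 billion parameters
gpt3_tokens = 3.0e11    # about 300 billion training tokens

estimate = training_flops(gpt3_params, gpt3_tokens)
print(f"{estimate:.2e} FLOPs")  # 3.15e+23 FLOPs
```

Changing `flops_per_param_token` is an easy way to see how sensitive the result is to the constant: any value between 5 and 7 still lands at a few times 10^23.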
Tests an AI's ability to understand game rules and strategy
Tests an AI's ability to solve a simple but potentially confusing logic puzzle
Sally has 1 sister.
Tests an AI's understanding of number representation
No. Since 9.11 is equal to 9.110 and 9.9 is equal to 9.900, we see that 0.110 < 0.900, so 9.11 is less than 9.9.
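The same comparison can be checked mechanically. The sketch below uses Python's standard `decimal` module so the values are compared as exact decimals rather than binary floats; the helper name `compare_decimals` is made up for this illustration.

```python
from decimal import Decimal


def compare_decimals(a: str, b: str) -> str:
    """Compare two decimal strings by numeric value and describe the result."""
    x, y = Decimal(a), Decimal(b)
    if x < y:
        return f"{a} < {b}"
    if x > y:
        return f"{a} > {b}"
    return f"{a} == {b}"


result = compare_decimals("9.11", "9.9")
print(result)  # 9.11 < 9.9
```

Padding both numbers to the same number of decimal places, as the answer above does, gives the same ordering: 9.110 versus 9.900.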
Tests an AI's randomness and creativity
Tests an AI's ability to generate vector graphics
Tests an AI's ability to create detailed SVG illustrations of gaming hardware
Tests an AI's humor and creative writing ability
Tests an AI's ability to simulate personalities and predict future trends
Tests an AI's humor and understanding of current events
Tests an AI's ability to write in distinct character voices
Tests an AI's ability to generate a complete, working landing page