GPT-5.2 Chat vs OpenAI o4-mini
Compare GPT-5.2 Chat and OpenAI o4-mini, both from OpenAI, tested across 35 shared challenges. Updated February 2026.
35 challenges
Tests an AI's ability to make educated estimates based on technical knowledge
Here's a back‐of‐the‐envelope derivation that gets us into the right ballpark (∼10²³ FLOPs total):
Model size
• GPT‑3 "175 B" has about P ≃ 1.75×10¹¹ parameters.
Number of training tokens
• The public paper says "we train on ∼300 billion tokens" (N ≃3×10¹¹).
FLOPs per token
• Every forward+backward through the network touches all parameters.
• Rough rule of thumb: you need about 6 FLOPs per parameter per token
(roughly 2 in the forward pass, one multiply and one add per parameter, and about 4 in the backward pass; optimizer updates are negligible by comparison).
• So 6 × P FLOPs/token ≃ 6×1.75×10¹¹ ≃1.05×10¹² FLOPs/token.
Total FLOPs
• Multiply FLOPs/token by number of tokens:
Total ≃ (1.05×10¹² FLOPs/token) × (3×10¹¹ tokens)
≃ 3.15×10²³ FLOPs.
Sanity check
• 3×10²³ FLOPs is about 3.6×10³ PFLOP/s‐days (since 1 PFLOP/s × 86 400 s ≈ 8.6×10¹⁹ FLOPs/day).
• That's a few thousand days on hardware sustaining 1 PFLOP/s, in line with the roughly 3.6×10³ petaflop/s‐days of training compute reported for GPT‑3 175 B.
Bottom‐line estimate: on the order of 10²³–10²⁴ floating‐point operations to train GPT‑3 175 B from scratch.
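The arithmetic in the estimate above can be checked with a few lines of Python. This is only a sketch under the same assumptions as the derivation (P ≃ 1.75×10¹¹ parameters, N ≃ 3×10¹¹ tokens, about 6 FLOPs per parameter per token); the constant and function names are illustrative, not taken from any library.

```python
# Back-of-the-envelope training compute, using the figures from the estimate above.
PARAMS = 1.75e11            # P: parameter count of GPT-3 175B
TOKENS = 3e11               # N: number of training tokens
FLOPS_PER_PARAM_TOKEN = 6   # rule-of-thumb factor used in the derivation

SECONDS_PER_DAY = 86_400
# FLOPs delivered by a machine sustaining 1 PFLOP/s for one day.
FLOPS_PER_PFLOP_S_DAY = 1e15 * SECONDS_PER_DAY


def training_flops(params, tokens, flops_per_param_token=FLOPS_PER_PARAM_TOKEN):
    """Total training FLOPs under the (factor x P x N) approximation."""
    return flops_per_param_token * params * tokens


total = training_flops(PARAMS, TOKENS)
pf_days = total / FLOPS_PER_PFLOP_S_DAY

print(f"total FLOPs ≈ {total:.2e}")           # 3.15e+23
print(f"compute ≈ {pf_days:.0f} PFLOP/s-days")  # 3646
```

Changing the factor or the token count shows how sensitive the result is: the total scales linearly in each input, so the order of magnitude stays at 10²³ unless one of them moves by several times.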
Tests an AI's ability to solve a simple but potentially confusing logic puzzle
Sally has exactly one sister. Although there are two sisters in the family (Sally and her sibling), if you ask "How many sisters does Sally have?" the answer is 1.
Tests an AI's understanding of number representation
No. As decimal numbers, 9.11 is less than 9.9 (since 0.11 < 0.90), so 9.11 is not greater than 9.9.
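The comparison can be reproduced with Python's standard decimal module, which parses the strings as exact decimal numbers. This is a minimal sketch; the helper name is_greater is hypothetical and exists only for illustration.

```python
from decimal import Decimal


def is_greater(a: str, b: str) -> bool:
    """Return True if decimal string a is numerically greater than decimal string b."""
    return Decimal(a) > Decimal(b)


print(is_greater("9.11", "9.9"))          # False
print(Decimal("9.9") - Decimal("9.11"))   # 0.79
```

Writing both fractional parts to the same number of digits (0.11 versus 0.90) gives the same result by inspection.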
Tests an AI's randomness and creativity
Tests an AI's ability to generate vector graphics
Tests an AI's ability to create detailed SVG illustrations of gaming hardware
Tests an AI's humor and creative writing ability
Tests an AI's ability to simulate personalities and predict future trends
Tests an AI's humor and understanding of current events
Tests an AI's ability to generate a complete, working landing page
Recreate an interactive, nostalgic Pokémon battle UI in a single HTML file.
Recreate an interactive, classic Mario level in a single HTML file.