Gemini 3 Flash Preview vs GPT-5.1
Compare Gemini 3 Flash Preview by Google AI against GPT-5.1 by OpenAI: context windows of 1.0M vs 400K tokens, tested across 52 shared challenges. Updated February 2026.
Tests an AI's ability to make educated estimates based on technical knowledge
The total training compute for GPT‑3 (175B) is on the order of $10^{23}$ floating-point operations. A reasonable point estimate is $\sim 3\times 10^{23}$ FLOPs.
Here’s the step-by-step reasoning from first principles.
From the GPT‑3 paper (architecture details are well known), the largest model has $N = 175\times 10^{9}$ parameters.
We’ll assume the training run processed $T \approx 300\times 10^{9}$ tokens, the figure reported in the paper.
For a dense neural network, a single weight typically participates in one multiply-add (2 FLOPs) per forward pass per “use”.
For large Transformers, most compute is in matrix multiplications in the attention and MLP blocks; these are roughly proportional to the total number of parameters, so a common back-of-the-envelope is a forward cost of $\approx 2N$ FLOPs per token.
The backward pass is more expensive because you’re computing gradients with respect to both activations and weights; it costs roughly twice the forward pass, i.e. $\approx 4N$ FLOPs per token.
On top of that you have attention over the sequence, normalizations, softmaxes, and optimizer updates.
These are smaller but non-negligible; a common heuristic is to lump everything into a constant factor, giving:
$$\text{FLOPs per token} \approx 6N$$
This “6” is a heuristic constant: roughly 2 FLOPs per parameter for the forward pass plus 4 for the backward pass.
Some authors use factors between ~3 and ~6; using 6 is a conservative, widely used estimate in scaling-law work.
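As a quick sanity check, here is a minimal Python sketch of the per-token heuristic; the function name and the explicit 2N/4N split in the comments are illustrative assumptions, not part of the original answer:

```python
def training_flops_per_token(n_params: float) -> float:
    """Heuristic training FLOPs per token for a dense Transformer."""
    forward = 2 * n_params   # one multiply-add (2 FLOPs) per weight
    backward = 4 * n_params  # backward pass: roughly twice the forward cost
    return forward + backward  # the "6N" heuristic

print(f"{training_flops_per_token(175e9):.2e}")  # ~1.05e+12 FLOPs per token
```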
Total training FLOPs:
$$\text{FLOPs}_{\text{total}} \approx (\text{FLOPs per token}) \times (\text{number of tokens})$$
Using the heuristic:
$$\text{FLOPs}_{\text{total}} \approx 6NT$$
Plug in GPT‑3’s values: $N = 1.75\times 10^{11}$ parameters and $T = 3\times 10^{11}$ tokens. So:
$$\begin{aligned} \text{FLOPs}_{\text{total}} &\approx 6 \times (1.75\times 10^{11}) \times (3\times 10^{11}) \\ &= 6 \times 1.75 \times 3 \times 10^{22} \\ &= 6 \times 5.25 \times 10^{22} \\ &= 31.5 \times 10^{22} \\ &= 3.15 \times 10^{23}\ \text{FLOPs} \end{aligned}$$
So a good estimate is:
$$\boxed{\text{GPT-3 (175B) training compute} \;\approx\; 3\times 10^{23}\ \text{FLOPs}}$$
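The final multiplication is easy to reproduce; a minimal sketch, assuming the $6NT$ heuristic and GPT‑3’s reported scale:

```python
N = 175e9  # GPT-3 parameters
T = 300e9  # GPT-3 training tokens, as reported in the paper
total_flops = 6 * N * T  # 6 FLOPs per parameter per token
print(f"{total_flops:.2e}")  # 3.15e+23
```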
Even if our constant factor is off (say it’s 3 instead of 6, or 8 instead of 6), we’re still in the same ballpark: a factor of 3 gives $\sim 1.6\times 10^{23}$ FLOPs and a factor of 8 gives $\sim 4.2\times 10^{23}$ FLOPs.
So the order of magnitude $10^{23}$ FLOPs is robust, and $\sim 3\times 10^{23}$ FLOPs is a reasonable central estimate.
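Scanning the plausible constants makes that robustness claim concrete; the 3 to 8 range comes from the discussion above:

```python
N, T = 175e9, 300e9  # GPT-3 parameters and training tokens
for c in (3, 6, 8):  # low, standard, and high per-parameter constants
    print(f"c = {c}: {c * N * T:.2e} FLOPs")
# Every choice lands on the order of 1e23 FLOPs.
```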
Tests an AI's ability to understand game rules and strategy
Tests an AI's ability to solve a simple but potentially confusing logic puzzle
Each brother has 2 sisters. One of those sisters is Sally herself, so there must be exactly one other girl in the family.
So Sally has 1 sister.
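A brute-force enumeration confirms the logic; a minimal sketch, assuming the classic wording of the puzzle (each brother has exactly 2 sisters, and the brother count doesn’t change the answer):

```python
# A brother's sisters are exactly the girls in the family, so the
# constraint "each brother has 2 sisters" forces the family to have 2 girls.
for girls in range(1, 10):
    sisters_of_each_brother = girls  # every girl counts as his sister
    if sisters_of_each_brother == 2:
        print(f"girls = {girls}, so Sally has {girls - 1} sister")  # -> 1
```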
Tests an AI's understanding of number representation
Tests an AI's randomness and creativity
Tests an AI's ability to generate vector graphics
Tests an AI's ability to create detailed SVG illustrations of gaming hardware
Tests an AI's humor and creative writing ability
Tests an AI's ability to simulate personalities and predict future trends
Tests an AI's humor and understanding of current events
Tests an AI's ability to write in distinct character voices
Tests an AI's ability to generate a complete, working landing page