Google: Gemini 2.5 Flash Preview 09-2025 vs GPT OSS 120B
Compare Google: Gemini 2.5 Flash Preview 09-2025 by Google AI against GPT OSS 120B by OpenAI, context windows of 1.0M vs 131K, tested across 46 shared challenges. Updated February 2026.
46 challenges
Tests an AI's ability to make educated estimates based on technical knowledge
Quick answer:
Training the original 175‑billion‑parameter GPT‑3 is estimated to have required on the order of 10²³ floating‑point operations (≈ 3 × 10²³ FLOPs): roughly a few hundred zetta‑FLOPs (≈ 300 ZFLOPs), or about 300 000 exa‑FLOPs of total compute.
Below is a step‑by‑step back‑of‑the‑envelope calculation that shows how we get there, together with the assumptions that go into it.
| Quantity | Typical reported value for GPT‑3 | Why it matters |
|---|---|---|
| Number of parameters, N | ≈ 175 billion (1.75 × 10¹¹) | Determines the size of each matrix‑multiply in the model. |
| Training token count, T | ≈ 300 billion tokens (3 × 10¹¹) | Total number of token‑level forward‑passes the model sees. |
| Sequence length, L | ≈ 2048 tokens per example (the context window). | Determines how many per‑token matrix‑products are needed per forward pass. |
| Number of layers, Lₗ | 96 transformer blocks. | |
| Hidden dimension, d | 12 288 (the width of each linear projection). | |
| Number of attention heads, h | 96 (so each head has size d/h = 128). | |
| Training passes | 1 epoch (the published training used roughly 1 × the dataset; we treat the 300 B tokens as the total “token‑steps” already). | |
The only numbers we need for a FLOP estimate are N (the model size) and T (the total number of token‑level operations). The rest of the architecture details (L, d, h, Lₗ) are used to translate “N parameters” into “how many FLOPs per token”.
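As a rough illustration of how the architecture numbers relate to N, the sketch below counts only the large weight matrices in each block. It assumes d = 12 288 (96 heads × 128) and ignores embeddings, biases and layer norms; the function name `approx_params` is hypothetical.

```python
# Rough parameter count for a GPT-3-sized decoder stack.
# Assumption: only the big weight matrices are counted
# (no embeddings, biases or layer norms).
N_LAYERS = 96
D_MODEL = 12_288  # assumed hidden size: 96 heads x 128 dims per head


def approx_params(n_layers: int, d_model: int) -> int:
    """Approximate weights in the attention and feed-forward matrices."""
    attention = 4 * d_model * d_model  # Q, K, V and output projections
    ffn = 8 * d_model * d_model        # d -> 4d and 4d -> d linears
    return n_layers * (attention + ffn)


n_params = approx_params(N_LAYERS, D_MODEL)
print(f"approx. parameters: {n_params:.3e}")
```

Under these assumptions the count lands within about one percent of the 175 billion figure, which is why N alone carries most of the information needed for the estimate.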
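A transformer layer consists of a multi‑head self‑attention block (Q, K, V projections, attention scores, weighted sum, output projection) followed by a feed‑forward network (two linear layers, d → 4d → d).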
For a single token (ignoring the cost of the softmax and the small bias terms) the dominant cost is matrix‑multiply operations.
For a matrix multiplication A (m×k) × B (k×n), the number of multiply‑adds is m·k·n. In deep‑learning practice each multiply‑add counts as 2 FLOPs (one multiplication and one addition), giving 2·m·k·n FLOPs in total.
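The counting convention can be written as a one‑line helper. This is a minimal sketch; `matmul_flops` is a hypothetical name, and the value of d is assumed to be 12 288.

```python
def matmul_flops(m: int, k: int, n: int) -> int:
    """FLOPs for an (m x k) @ (k x n) product.

    There are m*k*n multiply-adds, each counted as 2 FLOPs.
    """
    return 2 * m * k * n


# One token (m = 1) passing through a single d x d projection.
d = 12_288  # assumed hidden size
projection_flops = matmul_flops(1, d, d)
print(f"one d x d projection per token: {projection_flops:.3e} FLOPs")
```

With m = 1 the helper reduces to 2·d², which is the per‑token cost of a single d × d projection.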
| Component | Approx. dimensions | FLOPs (per token) |
|---|---|---|
| Q, K, V projections (3× per token) | three d × d matrices | 3·(2·d·d) = 6·d² |
| Attention scores (Q·Kᵀ) | one query of size d against L keys | 2·L·d |
| Weighted sum (A·V) | L weights over L values of size d | 2·L·d |
| Output projection | d × d | 2·d² |
| FFN first linear (d → 4d) | d × 4d | 2·d·4d = 8·d² |
| FFN second linear (4d → d) | 4d × d | 8·d² |
| Total per layer | | ≈ 38·d² FLOPs per token (rounded up generously; the itemized terms sum to 24·d² + 4·L·d ≈ 25·d²). |
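To make the per‑layer bookkeeping concrete, the sketch below adds up the component costs for one token. It assumes d = 12 288 and L = 2048, and it takes the two attention terms as 2·L·d each (one query against a full context, no causal‑masking discount), which is an assumption of this sketch. It is not meant to reproduce the rounded‑up 38·d² figure.

```python
def layer_flops_per_token(d: int, seq_len: int) -> dict[str, int]:
    """Itemized forward FLOPs for one token in one transformer layer."""
    return {
        "qkv_projections": 3 * 2 * d * d,    # three d x d matmuls
        "attention_scores": 2 * seq_len * d,  # query against seq_len keys
        "weighted_sum": 2 * seq_len * d,      # weights over seq_len values
        "output_projection": 2 * d * d,
        "ffn_first": 2 * d * 4 * d,           # d -> 4d
        "ffn_second": 2 * 4 * d * d,          # 4d -> d
    }


D_MODEL, SEQ_LEN = 12_288, 2048  # assumed values
terms = layer_flops_per_token(D_MODEL, SEQ_LEN)
total = sum(terms.values())
ratio_to_d_squared = total / D_MODEL**2

for name, flops in terms.items():
    print(f"{name:18s} {flops:.3e}")
print(f"total = {ratio_to_d_squared:.2f} * d^2")
```

The itemized sum comes out just under 25·d², so the 38·d² constant used in the estimate should be read as a deliberately conservative upper figure.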
Plugging d = 12 288:
\[ 38 \times d^{2} \approx 38 \times (1.2288\times10^{4})^{2} \approx 38 \times 1.51\times10^{8} \approx 5.7\times10^{9}\;\text{FLOPs per token per layer}. \]
With 96 layers:
\[ \text{FLOPs per token (forward)} \approx 96 \times 5.7\times10^{9} \approx 5.5\times10^{11}\;\text{FLOPs}. \]
Training requires a forward pass plus a backward pass. The backward pass costs roughly 2× the forward pass (gradients with respect to both activations and weights). So:
\[ \text{FLOPs per token (training)} \approx 3 \times \text{forward} = 3 \times 5.5\times10^{11} \approx 1.7\times10^{12}\;\text{FLOPs per token}. \]
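The chain from per‑layer cost to training cost per token can be reproduced in a few lines. This sketch reuses the text's 38·d² constant and the 2× backward multiplier, and assumes d = 12 288; all constant names are illustrative.

```python
PER_LAYER_CONSTANT = 38   # rounded-up per-layer factor used in the text
D_MODEL = 12_288          # assumed hidden size (96 heads x 128)
N_LAYERS = 96
BACKWARD_MULTIPLIER = 2   # backward pass taken as ~2x the forward pass

forward_per_layer = PER_LAYER_CONSTANT * D_MODEL**2
forward_per_token = N_LAYERS * forward_per_layer
training_per_token = (1 + BACKWARD_MULTIPLIER) * forward_per_token

print(f"forward, per layer : {forward_per_layer:.2e} FLOPs")
print(f"forward, per token : {forward_per_token:.2e} FLOPs")
print(f"training, per token: {training_per_token:.2e} FLOPs")
```

Under these assumptions the result lands close to the ≈ 1.7 × 10¹² FLOPs per token used in the next step.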
(Alternative literature often rounds this factor to ≈ 6 total FLOPs per parameter per token, which yields the same order of magnitude.)
The training corpus used for GPT‑3 was ≈ 300 billion tokens (≈ 3 × 10¹¹). The total FLOPs needed:
\[ \text{Total FLOPs} \approx 1.7\times10^{12}\;\text{FLOPs/token} \times 3\times10^{11}\;\text{tokens} \approx 5.1\times10^{23}\;\text{FLOPs}. \]
Rounded to one significant figure:
\[ \boxed{\,\approx 5\times10^{23}\,\text{FLOPs}\,} \]
| Unit | Value |
|---|---|
| FLOPs | ≈ 5 × 10²³ |
| Peta‑FLOPs (10¹⁵) | ≈ 5 × 10⁸ PFLOPs |
| Exa‑FLOPs (10¹⁸) | ≈ 5 × 10⁵ EFLOPs |
| Zetta‑FLOPs (10²¹) | ≈ 5 × 10² ZFLOPs (≈ 500 ZFLOPs) |
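The unit conversions in the table are plain divisions by powers of ten. A small sketch, with hypothetical names `PREFIXES` and `in_units`:

```python
# SI prefixes used in the table, as integer powers of ten.
PREFIXES = {"peta": 10**15, "exa": 10**18, "zetta": 10**21}


def in_units(flops: int) -> dict[str, float]:
    """Express a raw FLOP count in peta-, exa- and zetta-FLOPs."""
    return {name: flops / scale for name, scale in PREFIXES.items()}


total_flops = 5 * 10**23  # the rounded estimate from the text
converted = in_units(total_flops)
for name, value in converted.items():
    print(f"{name}-FLOPs: {value:.0e}")
```

These are total operation counts, not rates: dividing by a machine's sustained FLOPs per second would be a separate step.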
Because “zetta” (10²¹) is a convenient prefix for numbers this large, we can say ≈ 500 zetta‑FLOPs. The “6 × N × T” rule of thumb (see below) gives ≈ 300 zetta‑FLOPs instead, so the estimate sits in the few‑hundred‑zetta‑FLOP range.
A widely‑cited back‑of‑the‑envelope rule for transformer LLMs is:
\[ \text{Training FLOPs} \approx 6 \times N_{\text{params}} \times N_{\text{tokens}}. \]
Plugging in the numbers:
\[ 6 \times (1.75\times10^{11}) \times (3\times10^{11}) = 3.15\times10^{23}\;\text{FLOPs}. \]
That is within a factor of two of the more detailed calculation above. The gap comes mainly from the rounded‑up per‑layer constant: 38·d² per token, versus the ≈ 24·d² per layer that 2 FLOPs per parameter per forward pass implies. Both methods give the same order of 10²³ FLOPs.
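The rule of thumb and its comparison with the detailed figure can be checked directly. A minimal sketch; the function name is hypothetical and the 5.1 × 10²³ value is simply copied from the detailed calculation above.

```python
def training_flops_rule_of_thumb(n_params: float, n_tokens: float) -> float:
    """Approximate training compute as 6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens


N_PARAMS = 1.75e11  # 175 billion parameters
N_TOKENS = 3e11     # 300 billion training tokens

rule_estimate = training_flops_rule_of_thumb(N_PARAMS, N_TOKENS)
detailed_estimate = 5.1e23  # figure from the layer-by-layer calculation
ratio = detailed_estimate / rule_estimate

print(f"6*N*T estimate   : {rule_estimate:.2e} FLOPs")
print(f"detailed / 6*N*T : {ratio:.2f}")
```

The ratio stays below two, consistent with both methods agreeing on the order of magnitude.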
(All numbers are back‑of‑the‑envelope estimates; the real training run also incurs overhead from data‑loading, communication, and other inefficiencies, so the actual engineering effort was larger, but the pure‑compute figure is captured by the calculation above.)
Tests an AI's ability to understand game rules and strategy
Tests an AI's ability to solve a simple but potentially confusing logic puzzle
Sally has one sister (apart from herself).
Each of her 3 brothers has exactly two sisters. Since the brothers share the same sisters, there are only two sisters in total—the two sisters are Sally and her one sister. Hence, Sally has one sister.
Tests an AI's understanding of number representation
No. 9.11 is less than 9.9. (9.11 < 9.9.)
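The comparison can be verified with exact decimal arithmetic, which avoids any binary floating‑point rounding. A minimal sketch using Python's standard `decimal` module:

```python
from decimal import Decimal

a = Decimal("9.11")
b = Decimal("9.9")

is_greater = a > b   # is 9.11 greater than 9.9?
difference = b - a   # how far apart the two values are

print(f"9.11 > 9.9: {is_greater}")
print(f"9.9 - 9.11 = {difference}")
```

Writing both numbers to two decimal places (9.11 and 9.90) makes the ordering easier to see at a glance.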
Tests an AI's randomness and creativity
Tests an AI's ability to generate vector graphics
Tests an AI's ability to create detailed SVG illustrations of gaming hardware
Tests an AI's humor and creative writing ability
[Spotlight comes up. The comic walks to the mic, glances at the audience, and smiles.]
Comic: Hey, everyone—great to be here. I just got off a Zoom call that was supposed to be a “quick sync,” which in corporate speak means “we’ll spend three hours arguing about who gets to use the same PowerPoint template.” I’m convinced the word “sync” is short for “synchronizing our collective misery.”
[Pause for laugh.]
Speaking of misery, I moved into a new apartment last month. The landlord told me the place had “character.” Yeah, the kind of character that screams, “I’ve never heard of a plumber in the last decade.” The faucet drips on a rhythm that could be a metronome for a sleep‑deprived toddler. I’m not saying it’s bad, but I’ve started timing my showers to the drip. Six minutes, two seconds—if I go any longer, the building’s water bill looks like a small country’s GDP.
[Audience chuckles.]
And the neighbors! My upstairs neighbor is a yoga instructor. You know the type—every morning at 6 a.m., the floor vibrates like a cheap subwoofer. I’m convinced she’s trying to align her chakras with the building’s structural integrity. I tried to be polite and asked, “Hey, could you maybe do the downward dog a little later?” She replied, “I’m sorry, I’m on a schedule.” I’m not sure if she meant a class schedule or a schedule for how many times I’ll have to pretend to be a human pillow for her dog.
[Pause.]
Now, I’ve been trying to eat healthier. The other day I bought a “kale smoothie.” The label promised “nutrient‑dense, antioxidant‑rich, life‑changing.” I drank it and felt more like I’d just swallowed a lawnmower. I’m not saying it was bad, but the only thing that got a boost was my ability to identify the exact moment my stomach decided to file a formal complaint.
[Audience laughs.]
Technology is another thing. I tried using a voice‑assistant to set a reminder: “Remind me to call Mom at 5 p.m.” It responded, “I’m sorry, I didn’t catch that.” So I repeated, “Remind me to call Mom at 5 p.m.” And it answered, “Here’s a recipe for mushroom risotto.” I’m beginning to suspect my house is turning into a culinary school. At this point, I’m more likely to get a reminder to water my plants than to remember my own birthday.
[Pause.]
Dating apps—anyone here still using them? The profile pictures are all “adventure shots.” You never see a guy holding a pizza, you always see a guy climbing a mountain. I matched with a girl whose bio said, “Looking for someone who can keep up.” I replied, “I can keep up… with the Netflix queue.” She ghosted me. I think the algorithm is broken; it matched us because we both love “outdoor activities.” My idea of outdoor activity is “walking from the couch to the fridge and back.”
[Audience laughs.]
All right, I’m going to leave you with this: life is like a Wi‑Fi signal. You spend a lot of time looking for a stronger connection, but sometimes the best thing you can do is just stay where you are and enjoy the occasional drop—because it reminds you that you’re still, somehow, connected.
[Smile, wave.] Thanks, everybody—have a great night!
Tests an AI's ability to simulate personalities and predict future trends
Tests an AI's humor and understanding of current events
Tests an AI's ability to write in distinct character voices
Tests an AI's ability to generate a complete, working landing page