What is the difference between Gemini 3 Flash Preview and Inception: Mercury 2?

Gemini 3 Flash Preview is developed by Google AI while Inception: Mercury 2 is developed by Inception. Gemini 3 Flash Preview has a 1.0M token context window vs Inception: Mercury 2's 128K. You can compare their actual outputs across 52 challenges on Rival to see how they differ in practice.

Which is better, Gemini 3 Flash Preview or Inception: Mercury 2?

It depends on your use case. Gemini 3 Flash Preview and Inception: Mercury 2 each have strengths in different areas. Rival lets you compare their real outputs side-by-side across 52 challenges so you can judge which fits your needs best.

How much does Gemini 3 Flash Preview cost compared to Inception: Mercury 2?

Gemini 3 Flash Preview costs $0.5/M input tokens and Inception: Mercury 2 costs $0.25/M input tokens. Inception: Mercury 2 is $0.25/M cheaper per input. Check their side-by-side outputs on Rival to see if the price difference is justified by quality.

How can I compare Gemini 3 Flash Preview and Inception: Mercury 2 on Rival?

This page shows a side-by-side comparison of Gemini 3 Flash Preview and Inception: Mercury 2 across shared challenges. You can vote on which model produced the better output, and Pro users can create custom challenges to test both models with their own prompts.

Rival

Lab

Updated Mar 4, 2026

Gemini 3 Flash PreviewvsInception: Mercury 2

47 fights queued

Why Gemini 3 Flash Preview?

8.2x more context1.0M

Why Inception: Mercury 2?

3.6x cheaper overall$0.25/M in · $0.75/M out

3 months newerMar 2026

Favorites

Gemini 3 Flash PreviewInception: Mercury 2

Movie

Interstellar

20148.5

The Shawshank Redemption

19948.7

Album

Kind of Blue

Miles Davis

Thriller

Michael Jackson

Book

La biblioteca de babel

Jorge Luis Borges

Pride and Prejudice

Jane Austen

City

Kyoto

Japan

Tokyo

Japan

Game

Outer Wilds

Indie, Adventure4.4

Minecraft

Action, Arcade4.4

12 / 47

Estimate Complexityreasoning

Tests an AI's ability to make educated estimates based on technical knowledge

Gemini 3 Flash Preview

Google · Dec 2025text

Nothing here. The model returned empty. We stared at it for a while.

Quantity	Approximate value (publicly known)	Why it matters
Number of model parameters	(N_{\text{par}} \approx 175;\text{B} = 1.75\times10^{11})	Each parameter is touched during forward and backward passes.
Training token count	(N_{\text{tok}} \approx 300;\text{B} = 3.0\times10^{11})	GPT‑3 was trained on ~300 billion tokens (Common Crawl + WebText + Books + Wikipedia).
Model architecture	96 transformer layers, hidden size (d = 12,288), feed‑forward size (4d)	Determines how many FLOPs per token per layer.
Optimizer	Adam (2‑momentum)	Adds a modest constant factor (≈2) to the raw forward‑backward count.
Training regime	1 epoch over the data (the 300 B tokens)	No extra passes; the token count already incorporates the number of epochs.

Operation (per token)	FLOPs (approx.)	Explanation
Q, K, V projections (3 × (d\times d))	(3d^{2})	Each projection is a dense matrix multiply.
Attention scores (dot‑product over sequence length (L))	(L d)	For a token we compute dot‑products with all (L) keys.
Attention output projection (1 × (d\times d))	(d^{2})
Feed‑forward first linear ( (d \times 4d) )	(4d^{2})
Activation (GELU)	≈ (4d) (tiny)
Feed‑forward second linear ( (4d \times d) )	(4d^{2})
Total per layer (forward)	≈ (12d^{2} + Ld)

Step	Value	Comment
Parameters	(1.75\times10^{11})	175 B
Training tokens	(3.0\times10^{11})	300 B
Base forward‑backward FLOPs	(1.0\times10^{23})	2 × Npar × Ntok
Adam optimizer factor	≈ 2	Gives ~(2\times10^{23})
Final ball‑park	(1!-!2 \times 10^{23}) FLOPs	Equivalent to ~100 exaflop‑seconds, i.e. ~3 years of a 100 PFLOP/s machine.

Source of uncertainty	Effect on estimate
Exact token count (some reports say 300 B, others 500 B)	Linear scaling – 500 B tokens would push the estimate to ~1.7 × 10²³ FLOPs.
Model‑specific tricks (e.g., mixed‑precision, gradient checkpointing)	Mixed‑precision reduces FLOPs per multiply‑add by ≈ 0.5×, but checkpointing adds extra recomputation; net effect is modest (≈ ± 20 %).
Additional regularization (e.g., dropout, layer‑norm)	Adds a few percent overhead.
Hardware‑specific optimizations (e.g., fused kernels)	May reduce the effective FLOP count, but the theoretical count stays the same.