What is the difference between Gemini 3.1 Pro Preview and GPT OSS 120B?

Gemini 3.1 Pro Preview is developed by Google AI while GPT OSS 120B is developed by OpenAI. Gemini 3.1 Pro Preview has a 1.0M token context window vs GPT OSS 120B's 131K. You can compare their actual outputs across 48 challenges on RIVAL to see how they differ in practice.

Which is better, Gemini 3.1 Pro Preview or GPT OSS 120B?

It depends on your use case. Gemini 3.1 Pro Preview and GPT OSS 120B each have strengths in different areas. RIVAL lets you compare their real outputs side-by-side across 48 challenges so you can judge which fits your needs best.

How much does Gemini 3.1 Pro Preview cost compared to GPT OSS 120B?

Gemini 3.1 Pro Preview costs $2/M input tokens and GPT OSS 120B costs $0.18/M input tokens. GPT OSS 120B is $1.82/M cheaper per input. Check their side-by-side outputs on RIVAL to see if the price difference is justified by quality.

How can I compare Gemini 3.1 Pro Preview and GPT OSS 120B on RIVAL?

This page shows a side-by-side comparison of Gemini 3.1 Pro Preview and GPT OSS 120B across shared challenges. You can vote on which model produced the better output, and Pro users can create custom challenges to test both models with their own prompts.

Rival

Gemini 3.1 Pro PreviewvsGPT OSS 120B

48 challenges

Why Gemini 3.1 Pro Preview?

8x more context1.0M

7 months newerFeb 2026

Why GPT OSS 120B?

14.5x cheaper overall$0.18/M in · $0.80/M out

Favorites

Gemini 3.1 Pro PreviewGPT OSS 120B

Movie

The Matrix

19998.2

The Godfather

19728.7

Album

OK Computer

Radiohead

—

Book

Dune

Frank Herbert

—

City

Tokyo

Japan

Tokyo

Japan

Game

Portal

Action, Puzzle4.5

Minecraft

Action, Arcade4.4

12 / 48

Estimate Complexityreasoning

Tests an AI's ability to make educated estimates based on technical knowledge

Gemini 3.1 Pro Preview

Google · Feb 2026text

No content available for this output.

GPT OSS 120B

Openai · Aug 2025text

Quick answer:
Training the original 175‑billion‑parameter GPT‑3 is estimated to have required on the order of 10²³ floating‑point operations (≈ 3 × 10²³ FLOPs)—roughly a few hundred zetta‑FLOPs (≈ 300 Z‑FLOPs) or about 300 000 exaflops of compute.

Below is a step‑by‑step back‑of‑the‑envelope calculation that shows how we get there, together with the assumptions that go into it.

1. What we need to know

Quantity	Typical reported value for GPT‑3	Why it matters
Number of parameters, N	≈ 175 billion (1.75 × 10¹¹)	Determines the size of each matrix‑multiply in the model.
Training token count, T	≈ 300 billion tokens (3 × 10¹¹)	Total number of token‑level forward‑passes the model sees.
Sequence length, L	≈ 2048 tokens per example (the context window).	Determines how many per‑token matrix‑products are needed per forward pass.
Number of layers, Lₗ	96 transformer blocks.
Hidden dimension, d	12 384 (the width of each linear projection).
Number of attention heads, h	96 (so each head has size d/h = 128).
Training passes	1 epoch (the published training used roughly 1 × the dataset; we treat the 300 B tokens as the total “token‑steps” already).

The only numbers we need for a FLOP estimate are N (the model size) and T (the total number of token‑level operations). The rest of the architecture details (L, d, h, Lₗ) are used to translate “N parameters” into “how many FLOPs per token”.

2. How many FLOPs per token?

A transformer layer consists of:

Self‑attention (Q, K, V projections + output projection)
Feed‑forward network (FFN) (two linear layers with a non‑linear activation).

For a single token (ignoring the cost of the softmax and the small bias terms) the dominant cost is matrix‑multiply operations.

2.1 Rough matrix‑multiply cost

For a matrix multiplication A (m×k) × B (k×n) the number of multiply‑adds is 2 · m·k·n (one multiplication and one addition per entry). In deep‑learning practice we count that as 2 FLOPs per multiply‑add pair.

2.2 FLOPs per token for a single layer

Component	Approx. dimensions	FLOPs (per token)
Q, K, V projections (3× per token)	d × d → 3·(2·d·d) = 6·d²
Attention scores (Q·Kᵀ)	L·d → 2·L·d²
Weighted sum (A·V)	L·d → 2·L·d²
Output projection	d × d → 2·d²
FFN first linear (d → 4d)	2·d·4d = 8·d²
FFN second linear (4d → d)	8·d²
Total per layer	≈ 38 · d² FLOPs per token (rounded up).

Plugging d = 12 384:

[ 38 \times d^{2} \approx 38 \times (1.2384\times10^{4})^{2} \approx 38 \times 1.53 \times 10^{8} }= 5.8 \times 10^{9}; \text{FLOPs per token per layer}. ]

2.3 All layers

With 96 layers:

[ \text{FLOPs per token (forward)} \approx 96 \times 5.8\times10^{9} }= 5.6 \times 10^{11}; \text{FLOPs}. ]

2.4 Back‑propagation cost

Training requires a forward pass plus a backward pass. The backward pass is roughly 2× the forward‑pass cost (gradient propagation + weight updates). So:

[ \text{FLOPs per token (training)} \approx 3 \times \text{forward} = 3 \times 5.6\times10^{11} }\approx 1.7\times10^{12}; \text{FLOPs per token.} ]

(Alternative literature often rounds this factor to ≈ 6 total FLOPs per parameter per token, which yields the same order of magnitude.)

3. Multiply by the total number of tokens

The training corpus used for GPT‑3 was ≈ 300 billion tokens (≈ 3 × 10¹¹). The total FLOPs needed:

[ \text{Total FLOPs} \approx 1.7 \times 10^{12};\text{FLOPs/token} \times 3\times10^{11};\text{tokens} \approx 5.1 \times 10^{23};\text{FLOPs}. ]

Rounded to a single‑digit significance:

[ \boxed{,\approx 5\times10^{23},\text{FLOPs},} ]

4. Expressed in common units

Unit	Value
FLOPs	≈ 5 × 10²³
Peta‑FLOPs (10¹⁵)	≈ 5 × 10⁸ PFLOPs
Exa‑FLOPs (10¹⁸)	≈ 5 × 10⁵ EFLOPs
Zetta‑FLOPs (10²¹)	≈ 5 × 10² ZFLOPs (≈ 500 ZFLOPs)

Because “zeta” (10²¹) is a convenient unit for such massive numbers, we can say ≈ 500 zeta‑FLOPs. In more familiar terms, this is ≈ 300 zeta‑FLOPs if we adopt the “6 × N × T” rule of thumb (see below), so the estimate sits in the few‑hundred‑zeta‑FLOP range.

5. A “quick‑rule” sanity check

A widely‑cited back‑of‑the‑envelope rule for transformer LLMs is:

[ \text{Training FLOPs} \approx 6 \times N_{\text{params}} \times N_{\text{tokens}}. ]

Plugging in the numbers:

[ 6 \times (1.75\times10^{11}) \times (3\times10^{11}) = 3.15\times10^{23};\text{FLOPs}. ]

That is within a factor of two of the more detailed calculation above (the difference stems from the exact factor used for forward vs backward cost). Both methods give the same order of 10²³ FLOPs.

6. Summary of the reasoning

Model size gives the number of parameters (N = 1.75 × 10¹¹).
Token count tells how many forward‑backward passes are needed (T ≈ 3 × 10¹¹).
Per‑token training cost for a transformer layer is ≈ 38 · d² FLOPs per forward pass; with 96 layers this is ≈ 5.6 × 10¹¹ FLOPs per token.
Back‑propagation adds roughly a factor‑3 (or 6) multiplier, yielding ~1.7 × 10¹² FLOPs per token.
Total = per‑token cost × total tokens ≈ 5 × 10²³ FLOPs.

Bottom line

Estimated total compute to train GPT‑3: ≈ 5 × 10²³ FLOPs, i.e. ≈ 300–500 zeta‑FLOPs (or about 300 000 exaflops).
This is the magnitude of hundreds of zettaFLOPs of raw arithmetic, roughly the same as running a few hundred petaflops‑scale supercomputers for a month (or a single exaflop‑class machine for a week).

(All numbers are back‑of‑the‑envelope estimates; the real training run also incurs overhead from data‑loading, communication, and other inefficiencies, so the actual engineering effort was larger, but the pure‑compute figure is captured by the calculation above.)

Vote

2 / 48

AI Board Game Logicreasoning

Tests an AI's ability to understand game rules and strategy

Gemini 3.1 Pro Preview

Google · Feb 2026svg

No content available for this output.

Openai · Aug 2025text

No content available for this output.

Vote

11 / 48

Minimalist Landing Pageweb design

Tests an AI's ability to generate a complete, working landing page

Gemini 3.1 Pro Preview

Google · Feb 2026website

GPT OSS 120B

Openai · Aug 2025website

Vote

12 / 48

Pokémon Battle UI Recreationweb design

Recreate an interactive, nostalgic Pokémon battle UI in a single HTML file.

Gemini 3.1 Pro Preview

Google · Feb 2026website

No content available for this output.

GPT OSS 120B

Openai · Aug 2025website

No content available for this output.

Vote

12 of 48

FAQ

Continue exploring

GPT OSS 120B vs Llama 4 MaverickNew provider

Gemini 3.1 Pro Preview vs Pony AlphaNew provider

GPT OSS 120B vs Claude Opus 4New provider

Below is a step‑by‑step back‑of‑the‑envelope calculation that shows how we get there, together with the assumptions that go into it.

1. What we need to know

Quantity	Typical reported value for GPT‑3	Why it matters
Number of parameters, N	≈ 175 billion (1.75 × 10¹¹)	Determines the size of each matrix‑multiply in the model.
Training token count, T	≈ 300 billion tokens (3 × 10¹¹)	Total number of token‑level forward‑passes the model sees.
Sequence length, L	≈ 2048 tokens per example (the context window).	Determines how many per‑token matrix‑products are needed per forward pass.
Number of layers, Lₗ	96 transformer blocks.
Hidden dimension, d	12 384 (the width of each linear projection).
Number of attention heads, h	96 (so each head has size d/h = 128).
Training passes	1 epoch (the published training used roughly 1 × the dataset; we treat the 300 B tokens as the total “token‑steps” already).

2. How many FLOPs per token?

A transformer layer consists of:

Self‑attention (Q, K, V projections + output projection)
Feed‑forward network (FFN) (two linear layers with a non‑linear activation).

For a single token (ignoring the cost of the softmax and the small bias terms) the dominant cost is matrix‑multiply operations.

2.1 Rough matrix‑multiply cost

2.2 FLOPs per token for a single layer

Component	Approx. dimensions	FLOPs (per token)
Q, K, V projections (3× per token)	d × d → 3·(2·d·d) = 6·d²
Attention scores (Q·Kᵀ)	L·d → 2·L·d²
Weighted sum (A·V)	L·d → 2·L·d²
Output projection	d × d → 2·d²
FFN first linear (d → 4d)	2·d·4d = 8·d²
FFN second linear (4d → d)	8·d²
Total per layer	≈ 38 · d² FLOPs per token (rounded up).

Plugging d = 12 384:

[ 38 \times d^{2} \approx 38 \times (1.2384\times10^{4})^{2} \approx 38 \times 1.53 \times 10^{8} }= 5.8 \times 10^{9}; \text{FLOPs per token per layer}. ]

2.3 All layers

With 96 layers:

[ \text{FLOPs per token (forward)} \approx 96 \times 5.8\times10^{9} }= 5.6 \times 10^{11}; \text{FLOPs}. ]

2.4 Back‑propagation cost

Training requires a forward pass plus a backward pass. The backward pass is roughly 2× the forward‑pass cost (gradient propagation + weight updates). So:

[ \text{FLOPs per token (training)} \approx 3 \times \text{forward} = 3 \times 5.6\times10^{11} }\approx 1.7\times10^{12}; \text{FLOPs per token.} ]

(Alternative literature often rounds this factor to ≈ 6 total FLOPs per parameter per token, which yields the same order of magnitude.)

3. Multiply by the total number of tokens

The training corpus used for GPT‑3 was ≈ 300 billion tokens (≈ 3 × 10¹¹). The total FLOPs needed:

[ \text{Total FLOPs} \approx 1.7 \times 10^{12};\text{FLOPs/token} \times 3\times10^{11};\text{tokens} \approx 5.1 \times 10^{23};\text{FLOPs}. ]

Rounded to a single‑digit significance:

[ \boxed{,\approx 5\times10^{23},\text{FLOPs},} ]

4. Expressed in common units

Unit	Value
FLOPs	≈ 5 × 10²³
Peta‑FLOPs (10¹⁵)	≈ 5 × 10⁸ PFLOPs
Exa‑FLOPs (10¹⁸)	≈ 5 × 10⁵ EFLOPs
Zetta‑FLOPs (10²¹)	≈ 5 × 10² ZFLOPs (≈ 500 ZFLOPs)

5. A “quick‑rule” sanity check

A widely‑cited back‑of‑the‑envelope rule for transformer LLMs is:

[ \text{Training FLOPs} \approx 6 \times N_{\text{params}} \times N_{\text{tokens}}. ]

Plugging in the numbers:

[ 6 \times (1.75\times10^{11}) \times (3\times10^{11}) = 3.15\times10^{23};\text{FLOPs}. ]

6. Summary of the reasoning

Model size gives the number of parameters (N = 1.75 × 10¹¹).
Token count tells how many forward‑backward passes are needed (T ≈ 3 × 10¹¹).
Per‑token training cost for a transformer layer is ≈ 38 · d² FLOPs per forward pass; with 96 layers this is ≈ 5.6 × 10¹¹ FLOPs per token.
Back‑propagation adds roughly a factor‑3 (or 6) multiplier, yielding ~1.7 × 10¹² FLOPs per token.
Total = per‑token cost × total tokens ≈ 5 × 10²³ FLOPs.

Bottom line

Estimated total compute to train GPT‑3: ≈ 5 × 10²³ FLOPs, i.e. ≈ 300–500 zeta‑FLOPs (or about 300 000 exaflops).
This is the magnitude of hundreds of zettaFLOPs of raw arithmetic, roughly the same as running a few hundred petaflops‑scale supercomputers for a month (or a single exaflop‑class machine for a week).