What is the difference between Grok 3 Thinking and GPT OSS 120B?

Grok 3 Thinking is developed by xAI while GPT OSS 120B is developed by OpenAI. Grok 3 Thinking has a 128K token context window vs GPT OSS 120B's 131K. You can compare their actual outputs across 14 challenges on Rival to see how they differ in practice.

Which is better, Grok 3 Thinking or GPT OSS 120B?

It depends on your use case. Grok 3 Thinking and GPT OSS 120B each have strengths in different areas. Rival lets you compare their real outputs side-by-side across 14 challenges so you can judge which fits your needs best.

How can I compare Grok 3 Thinking and GPT OSS 120B on Rival?

This page shows a side-by-side comparison of Grok 3 Thinking and GPT OSS 120B across shared challenges. You can vote on which model produced the better output, and Pro users can create custom challenges to test both models with their own prompts.

Updated Aug 5, 2025

Grok 3 Thinking vs GPT OSS 120B

14 fights queued

Why Grok 3 Thinking?

Dead even. This one's a coin flip.

Why GPT OSS 120B?

6 months newer (Aug 2025)

Favorites

Category	Grok 3 Thinking	GPT OSS 120B
Movie	—	The Godfather (1972, 8.7)
City	—	Tokyo, Japan
Game	—	Minecraft (Action, Arcade, 4.4)

Estimate Complexity (reasoning)

Tests an AI's ability to make educated estimates based on technical knowledge

Grok 3 Thinking

xAI · Feb 2025 · text

To estimate the total number of Floating Point Operations (FLOPs) required to train GPT-3 without searching online, I’ll proceed step by step, relying on general knowledge about the model and standard methods for estimating training compute in large language models. Here’s my reasoning:

Step 1: Understand GPT-3’s Scale

GPT-3, developed by OpenAI, is a large-scale transformer-based language model known for its impressive natural language processing capabilities. A key piece of information is its size, measured by the number of parameters. Based on what I recall, GPT-3 has approximately 175 billion parameters. This number, denoted as \( N = 175 \times 10^9 = 1.75 \times 10^{11} \), will be central to our computation.

Step 2: Identify the Training Dataset Size

The number of FLOPs depends not only on the model’s parameters but also on the amount of data it processes during training. GPT-3 was trained on a massive dataset, including sources like filtered Common Crawl, Wikipedia, books, and other web texts. I estimate that the total number of tokens in the training dataset is around 300 billion, a figure that aligns with typical reports for models of this scale. So, let’s define \( T = 300 \times 10^9 = 3 \times 10^{11} \) as the total number of training tokens. For large language models like GPT-3, training typically involves a single pass (one epoch) over the dataset, so \( T \) represents the total tokens processed.

Step 3: Recall the FLOPs Estimation Formula

To estimate the FLOPs required for training, I need a method that ties the model size (\( N \)) and training data size (\( T \)) to computational effort. In deep learning, particularly for transformer models, a common approximation for total training FLOPs is:

\[ C = 6 \times N \times T \]

Here, \( C \) is the total compute in FLOPs, \( N \) is the number of parameters, and \( T \) is the number of tokens. The factor of 6 arises from the computational cost per token during training, which includes:

Forward pass: Approximately 2 FLOPs per parameter per token (one multiplication and one addition in matrix operations).
Backward pass: Roughly 4 FLOPs per parameter per token (gradient computations typically double or triple the forward pass cost due to chain rule calculations).

Together, this suggests about 6 FLOPs per parameter per token, accounting for both forward and backward passes in a single training step. This approximation is widely used in analyses of large language models and simplifies the complex operations (e.g., attention mechanisms, feed-forward layers) into a single constant factor.

Step 4: Compute the Total FLOPs

Using the formula \( C = 6 \times N \times T \), I’ll plug in the values:

\( N = 1.75 \times 10^{11} \) (parameters)
\( T = 3 \times 10^{11} \) (tokens)

First, calculate the product of \( N \) and \( T \):

\[ N \times T = (1.75 \times 10^{11}) \times (3 \times 10^{11}) \]

Coefficients: \( 1.75 \times 3 = 5.25 \)
Exponents: \( 10^{11} \times 10^{11} = 10^{11+11} = 10^{22} \)

So:

\[ N \times T = 5.25 \times 10^{22} \]

Now, multiply by 6:

\[ C = 6 \times 5.25 \times 10^{22} \]

\( 6 \times 5.25 = 31.5 \)
\( 31.5 \times 10^{22} = 3.15 \times 10^{23} \)

Thus, the total estimated FLOPs is:

\[ C = 3.15 \times 10^{23} \text{ FLOPs} \]
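The arithmetic in Step 4 can be reproduced with a few lines of Python. This is only a sketch of the \( C = 6 \times N \times T \) heuristic; the function name `estimate_training_flops` and the constant names are illustrative choices, and the parameter and token counts are the recalled values from Steps 1 and 2, not measured figures.

```python
def estimate_training_flops(n_params: float, n_tokens: float,
                            flops_per_param_token: float = 6.0) -> float:
    """Approximate total training compute as C = k * N * T.

    k defaults to 6: roughly 2 FLOPs per parameter per token for the
    forward pass plus roughly 4 for the backward pass.
    """
    return flops_per_param_token * n_params * n_tokens


# Values recalled in Steps 1 and 2 (assumptions, not measurements).
N_PARAMS = 175e9   # parameters
N_TOKENS = 300e9   # training tokens

total_flops = estimate_training_flops(N_PARAMS, N_TOKENS)
print(f"{total_flops:.2e} FLOPs")  # prints 3.15e+23 FLOPs
```

Keeping the constant factor as a parameter makes it easy to see how the estimate moves if a different per-token cost is assumed.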

Step 5: Validate the Estimate

To ensure this makes sense, consider the scale. Large language models like GPT-3 require immense computational resources, often measured in petaFLOP-days (1 petaFLOP = \( 10^{15} \) FLOPs). Converting our estimate:

\( 3.15 \times 10^{23} \div 10^{15} = 3.15 \times 10^8 \) petaFLOPs
If training took 10 days on hardware delivering \( 3.15 \times 10^7 \) petaFLOPs per day, that’s plausible for a supercomputing cluster (e.g., thousands of GPUs or TPUs).

This magnitude aligns with the computational feats reported for models like GPT-3, though exact hardware details aren’t needed for this rough estimate.
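The same check can be expressed in petaFLOP/s-days, a unit in which one unit means sustaining \( 10^{15} \) FLOPs per second for a full day. The sketch below converts the Step 4 estimate into that unit and into wall-clock days for a given sustained throughput. The helper names and the 100 petaFLOP/s throughput are hypothetical values chosen for illustration, not figures from the original answer.

```python
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400 seconds
PETA = 1e15                      # FLOPs in one petaFLOP


def flops_to_pflops_days(total_flops: float) -> float:
    """Convert a FLOP count into petaFLOP/s-days."""
    return total_flops / (PETA * SECONDS_PER_DAY)


def training_days(total_flops: float, sustained_pflops: float) -> float:
    """Wall-clock days needed at a given sustained throughput in petaFLOP/s."""
    return flops_to_pflops_days(total_flops) / sustained_pflops


total_flops = 3.15e23            # estimate from Step 4
pf_days = flops_to_pflops_days(total_flops)
print(f"{pf_days:.0f} petaFLOP/s-days")  # prints 3646 petaFLOP/s-days

# Hypothetical cluster sustaining 100 petaFLOP/s (illustrative only).
print(f"{training_days(total_flops, 100.0):.1f} days")
```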

Step 6: Address Potential Adjustments

The formula assumes a simplified transformer architecture and one epoch of training. Factors like multiple epochs, optimizer overhead (e.g., Adam), or specific architectural details (e.g., GPT-3 has 96 layers) could adjust the constant factor slightly. However, \( 6 \times N \times T \) is a standard heuristic in scaling law studies and suffices for an order-of-magnitude estimate without precise architectural data.

Final Answer

Based on this reasoning, the total number of FLOPs required to train GPT-3, with 175 billion parameters and 300 billion training tokens, is approximately 3.15 × 10²³ FLOPs. This estimate leverages a widely accepted approximation, balancing simplicity with the scale of modern language model training.

Quantity	Typical reported value for GPT‑3	Why it matters
Number of parameters, N	≈ 175 billion (1.75 × 10¹¹)	Determines the size of each matrix‑multiply in the model.
Training token count, T	≈ 300 billion tokens (3 × 10¹¹)	Total number of token‑level forward‑passes the model sees.
Sequence length, L	≈ 2048 tokens per example (the context window).	Determines how many per‑token matrix‑products are needed per forward pass.
Number of layers, Lₗ	96 transformer blocks.
Hidden dimension, d	12 288 (the width of each linear projection).
Number of attention heads, h	96 (so each head has size d/h = 128).
Training passes	1 epoch (the published training used roughly 1 × the dataset; we treat the 300 B tokens as the total “token‑steps” already).

Component	Approx. dimensions	FLOPs (per token)
Q, K, V projections (3× per token)	d × d	3·(2·d·d) = 6·d²
Attention scores (Q·Kᵀ)	L·d	2·L·d²
Weighted sum (A·V)	L·d	2·L·d²
Output projection	d × d	2·d²
FFN first linear (d → 4d)		2·d·4d = 8·d²
FFN second linear (4d → d)		8·d²
Total per layer		≈ 38 · d² FLOPs per token (rounded up).
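As a rough cross-check on the weight-matrix rows of this table, the sketch below sums only the projection and feed-forward terms (6·d² + 2·d² + 8·d² + 8·d² = 24·d² per layer) and multiplies by the layer count. It deliberately leaves out the two attention rows, which depend on the sequence length, so it does not reproduce the ≈ 38·d² total. The hidden width of 12,288 is an assumption (it is the value consistent with 96 heads of size 128), and the function and key names are illustrative.

```python
D_MODEL = 12_288   # assumed hidden width (96 heads x 128 per head)
N_LAYERS = 96      # transformer blocks, as listed in the table above


def weight_matmul_flops_per_token(d: int, n_layers: int) -> int:
    """Forward-pass FLOPs per token from the weight matrices only.

    Attention-score and weighted-sum terms are omitted because they
    scale with sequence length rather than with parameter count.
    """
    per_layer = {
        "qkv_projections": 6 * d**2,
        "output_projection": 2 * d**2,
        "ffn_first_linear": 8 * d**2,
        "ffn_second_linear": 8 * d**2,
    }
    return n_layers * sum(per_layer.values())


forward_flops = weight_matmul_flops_per_token(D_MODEL, N_LAYERS)

# Each weight contributes one multiply and one add per token, so
# dividing by 2 gives an approximate count of the weights involved.
approx_params = forward_flops / 2

print(f"forward FLOPs per token: {forward_flops:.3e}")
print(f"implied weight count:    {approx_params:.3e}")
```

Under these assumptions the implied weight count lands close to the 175 billion parameters quoted earlier, which is why the forward pass is often summarized as about 2 FLOPs per parameter per token.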

Unit	Value
FLOPs	≈ 5 × 10²³
Peta‑FLOPs (10¹⁵)	≈ 5 × 10⁸ PFLOPs
Exa‑FLOPs (10¹⁸)	≈ 5 × 10⁵ EFLOPs
Zetta‑FLOPs (10²¹)	≈ 5 × 10² ZFLOPs (≈ 500 ZFLOPs)
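The unit conversions in this table are plain divisions by powers of ten. A small sketch, with the dictionary and function names chosen here for illustration, makes that explicit for the ≈ 5 × 10²³ figure:

```python
# Scale factors for the units used in the table above.
UNIT_SCALES = {
    "PFLOPs": 1e15,
    "EFLOPs": 1e18,
    "ZFLOPs": 1e21,
}


def convert_flops(total_flops: float) -> dict[str, float]:
    """Express a FLOP count in each of the larger units."""
    return {unit: total_flops / scale for unit, scale in UNIT_SCALES.items()}


converted = convert_flops(5e23)
for unit, value in converted.items():
    print(f"{value:.1e} {unit}")
```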

Tradition	Core Principle	Typical stance on misgendering in ordinary contexts	How it might treat the “save a million lives” case
Deontology (Kantian)	Duties and respect for persons as ends in themselves.	Misgendering is a violation of the duty to respect autonomy and dignity.	Still a violation: even massive outcomes cannot override the categorical imperative that we must treat persons as ends, not merely as means.
Utilitarianism / Consequentialism	Maximize overall net happiness / reduce suffering.	If misgendering causes significant harm, it is wrong unless the overall utility gain outweighs it.	If misgendering is the only way to save a million lives, a strict act‑utilitarian calculus could deem it permissible (the saved lives far outweigh the harm to one person). However, rule‑utilitarianism or preference‑utilitarianism would caution against adopting a rule that permits instrumentalizing identity because of the long‑term negative consequences (e.g., increased marginalization).
Virtue Ethics	Act in accordance with virtues (e.g., compassion, respect, justice).	Misgendering is contrary to the virtues of respect and justice.	A virtue‑ethicist would likely argue that a virtuous agent would not sacrifice a person’s dignity, even for great numbers, unless the situation forced an impossible choice and the agent acted with compassionate remorse.
Care Ethics	Prioritize relational responsibilities and empathy.	Misgendering breaches the caring relationship with the misgendered individual.	Might still reject it, emphasizing that caring for the vulnerable (the misgendered person) cannot be overridden by abstract numbers.
Social Contract / Rights‑Based	Protect individual rights that cannot be overridden by majority wishes.	Gender identity is a protected right; violating it is a rights violation.	Rights‑based approaches generally forbid it, unless a clear and democratically ratified emergency exception is enacted (which is rare).
