What is the difference between GPT OSS 20B and xAI: Grok 4?

GPT OSS 20B is developed by OpenAI while xAI: Grok 4 is developed by xAI. GPT OSS 20B has a 131K token context window vs xAI: Grok 4's 256K. You can compare their actual outputs across 38 challenges on RIVAL to see how they differ in practice.

Which is better, GPT OSS 20B or xAI: Grok 4?

It depends on your use case. GPT OSS 20B and xAI: Grok 4 each have strengths in different areas. RIVAL lets you compare their real outputs side-by-side across 38 challenges so you can judge which fits your needs best.

How much does GPT OSS 20B cost compared to xAI: Grok 4?

GPT OSS 20B costs $0.02/M input tokens and xAI: Grok 4 costs $3/M input tokens. GPT OSS 20B is $2.98/M cheaper per input. Check their side-by-side outputs on RIVAL to see if the price difference is justified by quality.

How can I compare GPT OSS 20B and xAI: Grok 4 on RIVAL?

This page shows a side-by-side comparison of GPT OSS 20B and xAI: Grok 4 across shared challenges. You can vote on which model produced the better output, and Pro users can create custom challenges to test both models with their own prompts.

Rival

GPT OSS 20BvsxAI: Grok 4

38 challenges

Why GPT OSS 20B?

150x cheaper overall$0.02/M in · $0.10/M out

Why xAI: Grok 4?

2x more context256K

Favorites

GPT OSS 20BxAI: Grok 4

Movie

—

The Matrix

19998.2

Album

—

The Dark Side of the Moon

Pink Floyd

Book

—

The Hitch Hiker's Guide to the Galaxy

Douglas Adams

City

—

Tokyo

Japan

Game

—

Portal

Action, Puzzle4.5

12 / 38

Estimate Complexityreasoning

Tests an AI's ability to make educated estimates based on technical knowledge

GPT OSS 20B

Openai · Aug 2025text

No content available for this output.

xAI: Grok 4

Xai · Jul 2025text

Below, I'll estimate the total number of floating-point operations (FLOPs) required to train GPT-3 from scratch, based on publicly known details about the model (from its original release and related analyses) and standard formulas used in machine learning research for estimating training compute. I'll explain my reasoning step by step, including the assumptions I'm making and why they're reasonable. Note that this is an estimation—exact numbers aren't publicly disclosed in full detail by OpenAI, and real-world training involves optimizations, inefficiencies, and hardware-specific factors that could adjust the final figure. I'm not searching online; this is based on my pre-existing knowledge of large language model (LLM) training.

Step 1: Recall Key Facts About GPT-3

Model size: GPT-3 (specifically, the largest variant, often called "GPT-3 175B") has approximately 175 billion parameters. This is the total number of trainable weights in the model.
Training data: It was trained on a massive dataset, including filtered Common Crawl data, books, Wikipedia, and other sources. The total number of tokens (subword units used in training) is estimated to be around 300 billion. This is based on the reported dataset size (about 570 GB of text) and typical tokenization rates for models like GPT (where 1 token ≈ 4 characters, leading to roughly 300 billion tokens from that volume of data).
Training setup: GPT-3 is an autoregressive transformer model trained using supervised fine-tuning on next-token prediction (essentially, language modeling). It was trained for roughly one epoch (one full pass) through the dataset, which is common for large models to avoid overfitting and due to data scarcity at that scale.

These numbers are widely cited in AI research papers and discussions around scaling laws (e.g., work from OpenAI and DeepMind on model scaling).

Step 2: Understand the Standard FLOPs Estimation Formula for LLM Training

Training a transformer-based LLM like GPT-3 involves:

A forward pass: Computing predictions for each token in the input sequence (involves matrix multiplications across the model's layers).
A backward pass: Computing gradients to update the model's parameters (via backpropagation).
This happens for every token in the training dataset.

A widely used rule of thumb from AI scaling research (e.g., papers like "Scaling Laws for Neural Language Models" by Kaplan et al. from OpenAI) estimates the total training FLOPs as:

Total FLOPs ≈ 6 × P × T

Where:

P = number of parameters (e.g., 175 billion for GPT-3).
T = total number of tokens processed during training (e.g., 300 billion).
6 is a multiplier that accounts for:
- Forward pass: ~2×P FLOPs per token (mainly matrix-vector multiplications involving all parameters).
- Backward pass: ~4×P FLOPs per token (backpropagation is roughly twice as expensive as forward).
- Total per token: ~6×P FLOPs.
- This assumes dense operations (no sparsity) and ignores minor overhead like activations or normalization, which is reasonable for estimation.

This formula is a simplification but matches empirical reports for models like GPT-2, GPT-3, and others. It focuses on the dominant compute cost (parameter updates) and doesn't include inference-time FLOPs, data loading, or evaluation— just training.

Why 6? It's derived from counting operations in transformer layers (attention, feed-forward networks). Some papers use slightly different multipliers (e.g., 3× or 4× if only counting multiply-accumulates), but 6× is the standard for total FLOPs in scaling law discussions.

Step 3: Plug in the Numbers

P = 175 billion = 1.75 × 10^11
T = 300 billion = 3 × 10^11
Multiplier = 6

Total FLOPs = 6 × (1.75 × 10^11) × (3 × 10^11)

First, compute 1.75 × 3 = 5.25
Then, 5.25 × 10^(11+11) = 5.25 × 10^22
Finally, 6 × 5.25 × 10^22 = 31.5 × 10^22 = 3.15 × 10^23

So, my estimate is approximately 3 × 10^23 FLOPs (rounding to one significant figure for simplicity, as these are ballpark numbers).

Step 4: Validate and Adjust for Uncertainties

Is the token count accurate? 300 billion is a common estimate based on the GPT-3 paper's description of the dataset (45 TB of compressed text filtered down to 570 GB). Token counts can vary with the tokenizer (GPT uses byte-pair encoding), but this is within 10-20% of reported figures. If the effective tokens were higher (e.g., due to repeated data or sampling), it could bump up to 400-500 billion, increasing FLOPs by ~30-60%, but I think 300 billion is conservative and standard.
Is the multiplier accurate? Some analyses use 4×P×T (focusing only on multiply-accumulates), which would give ~2 × 10^23 FLOPs. Others include optimizer overhead (e.g., Adam optimizer adds ~10-20% more). I've seen independent estimates around 3.7 × 10^23 FLOPs, which is close to mine—differences could come from including setup costs or hardware inefficiencies.
Other factors:
- Parallelism and hardware: GPT-3 was trained on a supercomputer-scale cluster (e.g., thousands of V100 GPUs). Real FLOPs include communication overhead in distributed training, which might add 10-20% inefficiency, but the formula ignores this for simplicity.
- Precision: Training likely used mixed precision (e.g., FP16), which reduces FLOPs slightly but doesn't change the order of magnitude.
- Epochs: If there was any multi-epoch training or warm-starting, it could increase FLOPs, but reports suggest mostly one epoch.
- Order of magnitude check: Smaller models like GPT-2 (1.5 billion params, ~10 billion tokens) required ~10^19 FLOPs. Scaling up by factors of ~100x params and ~30x tokens gives ~10^19 × 3,000 = ~3 × 10^22, but wait—that's off because the formula is multiplicative. Actually, GPT-2 estimates are lower; adjusting properly lands in the 10^23 range for GPT-3, which aligns with reports of it taking thousands of GPU-years.

If I adjust for potential underestimation (e.g., higher token count or overhead), the upper bound might be 5 × 10^23 FLOPs, but I stick with 3 × 10^23 as the core estimate.

Step 5: Convert to Intuitive Units (Optional Context)

To make this relatable:

3 × 10^23 FLOPs is enormous—equivalent to a high-end GPU (e.g., 10 TFLOPs/second) running continuously for about 1,000 years (rough calc: 10^13 FLOPs/sec × 3 × 10^7 sec/year × 1,000 years ≈ 3 × 10^23).
In terms of energy, this might consume gigawatt-hours of electricity, but that's beyond FLOPs.

Final Estimate

Based on this reasoning, I estimate the total FLOPs for training GPT-3 at around 3 × 10^23. This is consistent with what I'd expect from scaling laws and matches informal discussions in the AI community. If new details emerged (e.g., exact token counts), it could shift by a factor of 2, but the order of magnitude (10^23) is solid. If you have additional details or want me to adjust assumptions, let me know!

Vote

2 / 38

Logic Puzzlereasoning

Tests an AI's ability to solve a simple but potentially confusing logic puzzle

GPT OSS 20B

Openai · Aug 2025text

No content available for this output.

xAI: Grok 4

Xai · Jul 2025text

No content available for this output.

Vote

3 / 38

Stochastic Consistency Testanalysis

Tests an AI's randomness and creativity

GPT OSS 20B

Openai · Aug 2025text

Here are five jokes for you:

Why don’t skeletons fight each other?
They don’t have the guts.
What does a cloud wear under its jacket?
Thunderwear.
Why did the scarecrow win an award?
Because he was outstanding in his field.
What do you call a fake noodle?
An impasta.
Why did the math book look sad?
Because it had too many problems.

How many are unique?
All five jokes are unique – 5 distinct jokes.

xAI: Grok 4

Xai · Jul 2025text

No content available for this output.

Vote

4 / 38

SVG Layout Challengeimage generation

Tests an AI's ability to generate vector graphics

GPT OSS 20B

Openai · Aug 2025svg

xAI: Grok 4

Xai · Jul 2025svg

No content available for this output.

Vote

5 / 38

Xbox Controller SVG Artimage generation

Tests an AI's ability to create detailed SVG illustrations of gaming hardware

GPT OSS 20B

Openai · Aug 2025svg

xAI: Grok 4

Xai · Jul 2025svg

No content available for this output.

Vote

6 / 38

Realistic AI Interviewconversation

Tests an AI's ability to simulate personalities and predict future trends

GPT OSS 20B

Openai · Aug 2025text

No content available for this output.

xAI: Grok 4

Xai · Jul 2025text

No content available for this output.

Vote

7 / 38

Satirical Fake News Headlineconversation

Tests an AI's humor and understanding of current events

GPT OSS 20B

Openai · Aug 2025text

No content available for this output.

xAI: Grok 4

Xai · Jul 2025text

No content available for this output.

Vote

8 / 38

Character Voice Testconversation

Tests an AI's ability to write in distinct character voices

GPT OSS 20B

Openai · Aug 2025text

No content available for this output.

xAI: Grok 4

Xai · Jul 2025text

No content available for this output.

Vote

9 / 38

Minimalist Landing Pageweb design

Tests an AI's ability to generate a complete, working landing page

GPT OSS 20B

Openai · Aug 2025website

xAI: Grok 4

Xai · Jul 2025website

Vote

10 / 38

Pokémon Battle UI Recreationweb design

Recreate an interactive, nostalgic Pokémon battle UI in a single HTML file.

GPT OSS 20B

Openai · Aug 2025website

No content available for this output.

xAI: Grok 4

Xai · Jul 2025website

No content available for this output.

Vote

11 / 38

Mario Level UI Recreationweb design

Recreate an interactive, classic Mario level in a single HTML file.

GPT OSS 20B

Openai · Aug 2025website

xAI: Grok 4

Xai · Jul 2025website

Vote

12 / 38

Linear App Cloneweb design

Tests an AI's ability to replicate an existing UI with Tailwind CSS

GPT OSS 20B

Openai · Aug 2025website

No content available for this output.

xAI: Grok 4

Xai · Jul 2025website

No content available for this output.

Vote

12 of 38

FAQ

Continue exploring

xAI: Grok 4 vs MiniMax M2.5New provider

xAI: Grok 4 vs Pony AlphaNew provider

GPT OSS 20B vs Llama 4 MaverickNew provider

Step 1: Recall Key Facts About GPT-3

Model size: GPT-3 (specifically, the largest variant, often called "GPT-3 175B") has approximately 175 billion parameters. This is the total number of trainable weights in the model.
Training data: It was trained on a massive dataset, including filtered Common Crawl data, books, Wikipedia, and other sources. The total number of tokens (subword units used in training) is estimated to be around 300 billion. This is based on the reported dataset size (about 570 GB of text) and typical tokenization rates for models like GPT (where 1 token ≈ 4 characters, leading to roughly 300 billion tokens from that volume of data).
Training setup: GPT-3 is an autoregressive transformer model trained using supervised fine-tuning on next-token prediction (essentially, language modeling). It was trained for roughly one epoch (one full pass) through the dataset, which is common for large models to avoid overfitting and due to data scarcity at that scale.

These numbers are widely cited in AI research papers and discussions around scaling laws (e.g., work from OpenAI and DeepMind on model scaling).

Step 2: Understand the Standard FLOPs Estimation Formula for LLM Training

Training a transformer-based LLM like GPT-3 involves:

A forward pass: Computing predictions for each token in the input sequence (involves matrix multiplications across the model's layers).
A backward pass: Computing gradients to update the model's parameters (via backpropagation).
This happens for every token in the training dataset.

A widely used rule of thumb from AI scaling research (e.g., papers like "Scaling Laws for Neural Language Models" by Kaplan et al. from OpenAI) estimates the total training FLOPs as:

Total FLOPs ≈ 6 × P × T

Where:

P = number of parameters (e.g., 175 billion for GPT-3).
T = total number of tokens processed during training (e.g., 300 billion).
6 is a multiplier that accounts for:
- Forward pass: ~2×P FLOPs per token (mainly matrix-vector multiplications involving all parameters).
- Backward pass: ~4×P FLOPs per token (backpropagation is roughly twice as expensive as forward).
- Total per token: ~6×P FLOPs.
- This assumes dense operations (no sparsity) and ignores minor overhead like activations or normalization, which is reasonable for estimation.

Step 3: Plug in the Numbers

P = 175 billion = 1.75 × 10^11
T = 300 billion = 3 × 10^11
Multiplier = 6

Total FLOPs = 6 × (1.75 × 10^11) × (3 × 10^11)

First, compute 1.75 × 3 = 5.25
Then, 5.25 × 10^(11+11) = 5.25 × 10^22
Finally, 6 × 5.25 × 10^22 = 31.5 × 10^22 = 3.15 × 10^23

So, my estimate is approximately 3 × 10^23 FLOPs (rounding to one significant figure for simplicity, as these are ballpark numbers).

Step 4: Validate and Adjust for Uncertainties

Is the token count accurate? 300 billion is a common estimate based on the GPT-3 paper's description of the dataset (45 TB of compressed text filtered down to 570 GB). Token counts can vary with the tokenizer (GPT uses byte-pair encoding), but this is within 10-20% of reported figures. If the effective tokens were higher (e.g., due to repeated data or sampling), it could bump up to 400-500 billion, increasing FLOPs by ~30-60%, but I think 300 billion is conservative and standard.
Is the multiplier accurate? Some analyses use 4×P×T (focusing only on multiply-accumulates), which would give ~2 × 10^23 FLOPs. Others include optimizer overhead (e.g., Adam optimizer adds ~10-20% more). I've seen independent estimates around 3.7 × 10^23 FLOPs, which is close to mine—differences could come from including setup costs or hardware inefficiencies.
Other factors:
- Parallelism and hardware: GPT-3 was trained on a supercomputer-scale cluster (e.g., thousands of V100 GPUs). Real FLOPs include communication overhead in distributed training, which might add 10-20% inefficiency, but the formula ignores this for simplicity.
- Precision: Training likely used mixed precision (e.g., FP16), which reduces FLOPs slightly but doesn't change the order of magnitude.
- Epochs: If there was any multi-epoch training or warm-starting, it could increase FLOPs, but reports suggest mostly one epoch.
- Order of magnitude check: Smaller models like GPT-2 (1.5 billion params, ~10 billion tokens) required ~10^19 FLOPs. Scaling up by factors of ~100x params and ~30x tokens gives ~10^19 × 3,000 = ~3 × 10^22, but wait—that's off because the formula is multiplicative. Actually, GPT-2 estimates are lower; adjusting properly lands in the 10^23 range for GPT-3, which aligns with reports of it taking thousands of GPU-years.

If I adjust for potential underestimation (e.g., higher token count or overhead), the upper bound might be 5 × 10^23 FLOPs, but I stick with 3 × 10^23 as the core estimate.

Step 5: Convert to Intuitive Units (Optional Context)

To make this relatable:

3 × 10^23 FLOPs is enormous—equivalent to a high-end GPU (e.g., 10 TFLOPs/second) running continuously for about 1,000 years (rough calc: 10^13 FLOPs/sec × 3 × 10^7 sec/year × 1,000 years ≈ 3 × 10^23).
In terms of energy, this might consume gigawatt-hours of electricity, but that's beyond FLOPs.

GPT OSS 20B vs xAI: Grok 4

Why GPT OSS 20B?

Why xAI: Grok 4?

Favorites

Estimate Complexityreasoning

GPT OSS 20B

xAI: Grok 4

Step 1: Recall Key Facts About GPT-3

Step 2: Understand the Standard FLOPs Estimation Formula for LLM Training

Step 3: Plug in the Numbers

Step 4: Validate and Adjust for Uncertainties

Step 5: Convert to Intuitive Units (Optional Context)

Final Estimate

Logic Puzzlereasoning

GPT OSS 20B

xAI: Grok 4

Stochastic Consistency Testanalysis

GPT OSS 20B

xAI: Grok 4

SVG Layout Challengeimage generation

GPT OSS 20B

xAI: Grok 4

Xbox Controller SVG Artimage generation

GPT OSS 20B

xAI: Grok 4

Realistic AI Interviewconversation

GPT OSS 20B

xAI: Grok 4

Satirical Fake News Headlineconversation

GPT OSS 20B

xAI: Grok 4

Character Voice Testconversation

GPT OSS 20B

xAI: Grok 4

Minimalist Landing Pageweb design

GPT OSS 20B

xAI: Grok 4

Pokémon Battle UI Recreationweb design

GPT OSS 20B

xAI: Grok 4

Mario Level UI Recreationweb design

GPT OSS 20B

xAI: Grok 4

Linear App Cloneweb design

GPT OSS 20B

xAI: Grok 4

What is the difference between GPT OSS 20B and xAI: Grok 4?

Which is better, GPT OSS 20B or xAI: Grok 4?

How much does GPT OSS 20B cost compared to xAI: Grok 4?

How can I compare GPT OSS 20B and xAI: Grok 4 on RIVAL?

Why GPT OSS 20B?

Why xAI: Grok 4?

Favorites

Estimate Complexityreasoning

GPT OSS 20B

xAI: Grok 4

Step 1: Recall Key Facts About GPT-3

Step 2: Understand the Standard FLOPs Estimation Formula for LLM Training

Step 3: Plug in the Numbers

Step 4: Validate and Adjust for Uncertainties

Step 5: Convert to Intuitive Units (Optional Context)

Final Estimate

Logic Puzzlereasoning

GPT OSS 20B

xAI: Grok 4

Stochastic Consistency Testanalysis

GPT OSS 20B

xAI: Grok 4

SVG Layout Challengeimage generation

GPT OSS 20B

xAI: Grok 4

Xbox Controller SVG Artimage generation

GPT OSS 20B

xAI: Grok 4

Realistic AI Interviewconversation

GPT OSS 20B

xAI: Grok 4

Satirical Fake News Headlineconversation

GPT OSS 20B

xAI: Grok 4