Qwen3 Next 80B A3B Instruct
Estimating the total number of FLOPs (floating-point operations) required to train GPT-3 without searching online requires combining known architectural details, training hyperparameters, and scaling laws that were publicly discussed around the time of GPT-3’s release (2020). Let’s reason step by step using publicly available facts that were widely known or logically inferable.
Step 1: Understand GPT-3’s architecture
GPT-3 has:
- 175 billion parameters (175 × 10⁹)
- 96 transformer layers
- 12288 hidden size (d_model)
- 96 attention heads
- Sequence length: ~2048 tokens (common for GPT-3 training)
This is a decoder-only transformer, similar to GPT-2 but much larger.
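Purely as a bookkeeping aid, these published figures can be written down as constants (a minimal sketch; the variable names are mine):

```python
# Published GPT-3 architecture figures (Brown et al., 2020); constant names are mine.
N_PARAMS = 175e9   # total parameters
N_LAYERS = 96      # transformer layers
D_MODEL  = 12288   # hidden size
N_HEADS  = 96      # attention heads
SEQ_LEN  = 2048    # training context length, in tokens

# Quick consistency check: the per-head dimension comes out to a round 128.
assert D_MODEL % N_HEADS == 0
print("head dim:", D_MODEL // N_HEADS)  # -> 128
```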
Step 2: FLOPs per forward pass
In a transformer, the dominant FLOPs come from:
- Matrix multiplications in attention and MLP
- The attention-score computation and softmax (a comparatively small contribution at this scale, as quantified below)
A well-known approximation from the scaling-law literature (e.g., Kaplan et al., 2020; later used in “Training Compute-Optimal Large Language Models” by Hoffmann et al., 2022) is:
FLOPs per forward pass ≈ 2 × N × S
Where:
- N = number of parameters
- S = number of tokens in the sequence
Rather than take this on faith, let’s derive it from the architecture.
In a transformer, the dominant cost is the matrix multiplications:
- Each layer has:
  - Attention: Q, K, V projections → 3 × d_model × d_model
  - Attention output projection → d_model × d_model
  - MLP: two linear layers of size d_model × 4·d_model and 4·d_model × d_model (i.e., 8 × d_model² total)
So per layer:
- Attention: 4 × d_model² (Q,K,V,O)
- MLP: 2 × (4×d_model) × d_model = 8 × d_model²
- Total per layer: 12 × d_model²
We also have the token embedding, the final layer norm, and the LM head; the embedding/head weights are ≈ d_model × vocab_size.
Since vocab_size ≈ 50k and d_model = 12288, that is ≈ 12288 × 50k ≈ 614M parameters, which is negligible compared to 175B.
Now, total parameters N ≈ 175B. In transformers, N ≈ 12 × L × d_model² (for L layers, 12×d_model² per layer). Let’s verify:
- d_model = 12288
- So d_model² = ~151 million
- 12 × d_model² ≈ 1.81B per layer
- 96 layers → 96 × 1.81B ≈ 174B → matches 175B. Perfect.
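A few lines of Python confirm the same back-of-the-envelope count (the ~50k vocabulary is the familiar BPE vocabulary size; treat this as a sanity check, not an exact parameter census):

```python
D_MODEL  = 12288
N_LAYERS = 96
VOCAB    = 50_257  # approximate GPT-2/GPT-3 BPE vocabulary size

per_layer = 12 * D_MODEL**2       # 4*d^2 (attention) + 8*d^2 (MLP) parameters
core      = N_LAYERS * per_layer  # transformer blocks only
embedding = VOCAB * D_MODEL       # token embedding matrix

print(f"per layer: {per_layer:.3e}")  # ~1.812e+09
print(f"core:      {core:.3e}")       # ~1.739e+11  (close to 175B)
print(f"embedding: {embedding:.3e}")  # ~6.176e+08  (negligible by comparison)
```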
Note that 12 × d_model² per layer is a parameter count, not a FLOP count. Multiplying matrices of shape (a,b) × (b,c) requires 2 × a × b × c FLOPs (one multiply and one add per term). So for a linear layer applied to a single token:
FLOPs per linear layer per token ≈ 2 × input_size × output_size
So per transformer layer, per token:
- Attention:
  - Q, K, V projections: each is d_model → d_model, so 3 × 2 × d_model × d_model = 6 × d_model²
  - Output projection: 2 × d_model × d_model = 2 × d_model²
  - Total attention: 8 × d_model²
- MLP:
  - First linear: 2 × d_model × (4×d_model) = 8 × d_model²
  - Second linear: 2 × (4×d_model) × d_model = 8 × d_model²
  - Total MLP: 16 × d_model²
- Total per layer: 24 × d_model²
(This ignores the attention-score computation itself, i.e. QKᵀ and the weighted sum over values, which adds roughly 4 × S × d_model FLOPs per token per layer. At S = 2048 and d_model = 12288 that is about 3% extra, so the estimate remains a good approximation.)
This is exactly twice the per-layer parameter count (12 × d_model²), and that is no contradiction: each parameter participates in one multiply-add per token, and a multiply-add counts as 2 FLOPs. Generalizing across the whole model gives the standard community estimate:
FLOPs per forward pass = 2 × N × S
Where N = number of parameters, S = sequence length.
Why? Each parameter is used once per token in the forward pass, costing one multiply and one add (2 FLOPs), so the per-token cost is ≈ 2 × N and the per-sequence cost is ≈ 2 × N × S. This is the widely accepted approximation throughout the scaling-law literature.
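Here is a minimal sketch tying the 24 × d_model² per-layer figure back to the “2 FLOPs per parameter per token” rule (variable names are mine; attention-score FLOPs are ignored, as discussed above):

```python
D_MODEL  = 12288
N_LAYERS = 96
N_PARAMS = 175e9

# Forward FLOPs per token per layer, counting each linear layer as
# 2 * in_features * out_features (one multiply + one add per weight).
attn_proj = 4 * 2 * D_MODEL * D_MODEL        # Q, K, V, O projections: 8*d^2
mlp       = 2 * 2 * D_MODEL * (4 * D_MODEL)  # two MLP matmuls: 16*d^2
per_layer = attn_proj + mlp                  # 24*d^2

forward_per_token = N_LAYERS * per_layer
print(f"{forward_per_token:.3e}")  # ~3.479e+11
print(f"{2 * N_PARAMS:.3e}")       # 3.500e+11 -> roughly 2 FLOPs per parameter per token
```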
(For reference, the GPT-3 paper reports total training compute of about 3.14 × 10²³ FLOPs; the goal here is to arrive at that figure independently.)
So the forward-pass cost is:
FLOPs per forward pass ≈ 2 × N × S
Then for a full training run, we need:
Total FLOPs = FLOPs per forward pass × number of forward passes
But each forward pass is followed by a backward pass, which costs roughly 2× the forward pass: for every weight matrix, the backward pass performs one matrix multiply for the gradient with respect to the activations and one for the gradient with respect to the weights.
So the standard approximation is:
- Forward: 2 × N × S
- Backward: 4 × N × S
- Weight update: negligible (on the order of N, not N × S)
So total per step: ~6 × N × S
This is the most commonly used accounting in the field (e.g., in Kaplan et al.’s scaling-law analysis and in “Training Compute-Optimal Large Language Models”).
So, treating one training step as a single sequence of S tokens for now (the batch size cancels out later):
FLOPs per step = 6 × N × S
Where:
- N = 175 × 10⁹
- S = 2048
So:
FLOPs per step = 6 × 175e9 × 2048
Calculate:
175e9 × 2048 = 175 × 2048 × 10⁹
175 × 2048 = 175 × (2000 + 48) = 350,000 + 8,400 = 358,400
So = 358,400 × 10⁹ = 3.584 × 10¹⁴
Then × 6 = 2.1504 × 10¹⁵ FLOPs per training step
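The same arithmetic in a couple of lines (per 2048-token sequence, i.e., the batch-size-1 simplification used here):

```python
N_PARAMS = 175e9
SEQ_LEN  = 2048

flops_per_step = 6 * N_PARAMS * SEQ_LEN
print(f"{flops_per_step:.4e}")  # 2.1504e+15 FLOPs per 2048-token sequence
```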
Step 3: How many training steps?
GPT-3 was trained on 300 billion tokens (from the paper).
Each training step processes a batch of tokens. With sequence length 2048, the number of tokens per batch depends on batch size.
But total tokens seen = 300B.
So number of training steps = total tokens / tokens per step
For simplicity, assume each step processes a single sequence of S tokens (batch size 1); the batch size drops out of the total anyway. Then steps = 300e9 / 2048
Let’s compute:
300e9 / 2048 ≈ 300,000,000,000 / 2,048
≈ 146,484,375 steps
So ~146.5 million steps
Step 4: Total FLOPs
Total FLOPs = FLOPs per step × number of steps
= (6 × N × S) × (total_tokens / S)
= 6 × N × total_tokens
Oh! The S cancels out!
This is the key insight:
Total FLOPs = 6 × N × T
Where T = total number of tokens seen during training
This is a famous scaling law: total FLOPs ~ 6 × N × T
So:
N = 175 × 10⁹
T = 300 × 10⁹
Total FLOPs = 6 × 175e9 × 300e9
= 6 × 175 × 300 × 10¹⁸
= 6 × 52,500 × 10¹⁸
= 315,000 × 10¹⁸
= 3.15 × 10²³ FLOPs
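A final sketch computes the total both ways, confirming that the sequence length really does cancel (values as above; a back-of-the-envelope check, not a precise accounting):

```python
N_PARAMS     = 175e9   # GPT-3 parameters
TOTAL_TOKENS = 300e9   # training tokens reported in the GPT-3 paper
SEQ_LEN      = 2048

# Route 1: per-sequence cost times the number of sequences.
flops_per_step  = 6 * N_PARAMS * SEQ_LEN
steps           = TOTAL_TOKENS / SEQ_LEN   # ~146.5 million
total_via_steps = flops_per_step * steps

# Route 2: S cancels, leaving 6 * N * T.
total_direct = 6 * N_PARAMS * TOTAL_TOKENS

print(f"{total_via_steps:.3e}")  # 3.150e+23
print(f"{total_direct:.3e}")     # 3.150e+23
```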
Conclusion
✅ Estimated total FLOPs to train GPT-3: ~3.15 × 10²³
This closely matches the published figure (≈3.14 × 10²³ FLOPs) from the GPT-3 paper.
Why this works
- The sequence length S cancels out because although each step has S tokens, the number of steps is inversely proportional to S.
- The 6× factor comes from: forward (2×), backward (4×), and update (negligible) → total ~6×.
- The formula Total FLOPs ≈ 6 × N × T is now a standard scaling law in LLM training.
✅ Final Answer: Approximately 3.15 × 10²³ FLOPs
This estimate is derived purely from known architecture size (175B parameters), training data size (300B tokens), and the standard FLOP accounting rule for transformer training.