What is the difference between Grok 3 Thinking and Qwen3 235B A22B?

Grok 3 Thinking is developed by xAI while Qwen3 235B A22B is developed by Qwen. Grok 3 Thinking has a 128K token context window vs Qwen3 235B A22B's 33K. You can compare their actual outputs across 13 challenges on Rival to see how they differ in practice.

Which is better, Grok 3 Thinking or Qwen3 235B A22B?

It depends on your use case. Grok 3 Thinking and Qwen3 235B A22B each have strengths in different areas. Rival lets you compare their real outputs side-by-side across 13 challenges so you can judge which fits your needs best.

How can I compare Grok 3 Thinking and Qwen3 235B A22B on Rival?

This page shows a side-by-side comparison of Grok 3 Thinking and Qwen3 235B A22B across shared challenges. You can vote on which model produced the better output, and Pro users can create custom challenges to test both models with their own prompts.


Updated Apr 28, 2025

Grok 3 Thinking vs Qwen3 235B A22B

Why Grok 3 Thinking?

3.9x more context (128K)

Leads 1 of 1 benchmarks

Stronger on LiveCodeBench (79.4% vs 70.7%)

Why Qwen3 235B A22B?

2 months newer (Apr 2025)

                  Grok 3 Thinking    Qwen3 235B A22B
Context           128K               33K
Released          Feb 2025           Apr 2025

Benchmarks (1 common)

                  Grok 3 Thinking    Qwen3 235B A22B
LiveCodeBench     79.4% (+11%)       70.7%

Favorites

         Grok 3 Thinking    Qwen3 235B A22B
Movie    -                  The Godfather (1972, 8.7)
Album    -                  Sgt Peppers Lonely Hearts Club Band (The Beatles)
Book     -                  Nineteen Eighty-Four (George Orwell)
City     -                  Istanbul (Turkey)
Game     -                  The Legend of Zelda: Ocarina of Time (Adventure, Action, 4.4)

Estimate Complexity · reasoning

Tests an AI's ability to make educated estimates based on technical knowledge

Grok 3 Thinking

xAI · Feb 2025 · text

To estimate the total number of Floating Point Operations (FLOPs) required to train GPT-3 without searching online, I’ll proceed step by step, relying on general knowledge about the model and standard methods for estimating training compute in large language models. Here’s my reasoning:

Step 1: Understand GPT-3’s Scale

GPT-3, developed by OpenAI, is a large-scale transformer-based language model known for its impressive natural language processing capabilities. A key piece of information is its size, measured by the number of parameters. Based on what I recall, GPT-3 has approximately 175 billion parameters. This number, denoted as \( N = 175 \times 10^9 = 1.75 \times 10^{11} \), will be central to our computation.

Step 2: Identify the Training Dataset Size

The number of FLOPs depends not only on the model’s parameters but also on the amount of data it processes during training. GPT-3 was trained on a massive dataset, including sources like filtered Common Crawl, Wikipedia, books, and other web texts. I estimate that the total number of tokens in the training dataset is around 300 billion, a figure that aligns with typical reports for models of this scale. So, let’s define \( T = 300 \times 10^9 = 3 \times 10^{11} \) as the total number of training tokens. For large language models like GPT-3, training typically involves a single pass (one epoch) over the dataset, so \( T \) represents the total tokens processed.

Step 3: Recall the FLOPs Estimation Formula

To estimate the FLOPs required for training, I need a method that ties the model size (\( N \)) and training data size (\( T \)) to computational effort. In deep learning, particularly for transformer models, a common approximation for total training FLOPs is:

\[ C = 6 \times N \times T \]

Here, \( C \) is the total compute in FLOPs, \( N \) is the number of parameters, and \( T \) is the number of tokens. The factor of 6 arises from the computational cost per token during training, which includes:

Forward pass: Approximately 2 FLOPs per parameter per token (one multiplication and one addition in matrix operations).
Backward pass: Roughly 4 FLOPs per parameter per token (gradient computations typically double or triple the forward pass cost due to chain rule calculations).

Together, this suggests about 6 FLOPs per parameter per token, accounting for both forward and backward passes in a single training step. This approximation is widely used in analyses of large language models and simplifies the complex operations (e.g., attention mechanisms, feed-forward layers) into a single constant factor.
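The rule of thumb above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `training_flops` and its parameter names are hypothetical, and the 2 and 4 FLOPs-per-parameter defaults are simply the forward and backward figures quoted in the response.

```python
def training_flops(n_params: float, n_tokens: float,
                   forward_per_param: float = 2.0,
                   backward_per_param: float = 4.0) -> float:
    """Approximate training compute as (forward + backward) * N * T.

    The defaults reproduce the 6 * N * T heuristic described in the text;
    they are approximations, not measured values.
    """
    per_param_per_token = forward_per_param + backward_per_param  # 6 by default
    return per_param_per_token * n_params * n_tokens


if __name__ == "__main__":
    # Figures recalled in the response: 175B parameters, 300B training tokens.
    estimate = training_flops(175e9, 300e9)
    print(f"Estimated training compute: {estimate:.2e} FLOPs")
```

Keeping the forward and backward costs as separate arguments makes it easy to see how much of the total each pass contributes, or to try a different constant.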

Step 4: Compute the Total FLOPs

Using the formula \( C = 6 \times N \times T \), I’ll plug in the values:

\( N = 1.75 \times 10^{11} \) (parameters)
\( T = 3 \times 10^{11} \) (tokens)

First, calculate the product of \( N \) and \( T \):

\[ N \times T = (1.75 \times 10^{11}) \times (3 \times 10^{11}) \]

Coefficients: \( 1.75 \times 3 = 5.25 \)
Exponents: \( 10^{11} \times 10^{11} = 10^{11+11} = 10^{22} \)

So:

\[ N \times T = 5.25 \times 10^{22} \]

Now, multiply by 6:

\[ C = 6 \times 5.25 \times 10^{22} \]

\( 6 \times 5.25 = 31.5 \)
\( 31.5 \times 10^{22} = 3.15 \times 10^{23} \)

Thus, the total estimated FLOPs is:

\[ C = 3.15 \times 10^{23} \text{ FLOPs} \]
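The hand arithmetic in this step can be double-checked with exact integer math, which avoids any floating-point rounding in the intermediate products. The snippet below is only a sanity check under the response's assumed values for N and T; the helper `to_sci` is a hypothetical name introduced here to split a positive integer into a mantissa and a power of ten.

```python
N = 175 * 10**9   # parameters, as assumed in the response
T = 300 * 10**9   # training tokens, as assumed in the response
C = 6 * N * T     # exact integer value of 6 * N * T


def to_sci(value: int):
    """Split a positive integer into (mantissa, exponent) with 1 <= mantissa < 10."""
    exponent = len(str(value)) - 1
    return value / 10**exponent, exponent


if __name__ == "__main__":
    for label, value in (("N x T", N * T), ("C", C)):
        mantissa, exponent = to_sci(value)
        print(f"{label} = {mantissa} x 10^{exponent}")
```

The intermediate product comes out as 5.25 x 10^22 and the final figure as 3.15 x 10^23, matching the steps worked by hand.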

Step 5: Validate the Estimate

To ensure this makes sense, consider the scale. Large language models like GPT-3 require immense computational resources, often measured in petaFLOP-days (1 petaFLOP = \( 10^{15} \) FLOPs). Converting our estimate:

\( 3.15 \times 10^{23} \div 10^{15} = 3.15 \times 10^8 \) petaFLOPs
If training took 10 days on hardware delivering \( 3.15 \times 10^7 \) petaFLOPs per day, that’s plausible for a supercomputing cluster (e.g., thousands of GPUs or TPUs).

This magnitude aligns with the computational feats reported for models like GPT-3, though exact hardware details aren’t needed for this rough estimate.
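The validation step counts operations in units of \( 10^{15} \). Compute budgets are also often quoted in petaFLOP/s-days, meaning one petaFLOP per second sustained for a full day, which adds a factor of 86,400 seconds. The sketch below converts the same \( 3.15 \times 10^{23} \) estimate into that unit and into the sustained throughput a 10-day run would need. The function names are hypothetical, and the 10-day duration is the illustrative figure from the response, not a reported training time.

```python
TOTAL_FLOPS = 3.15e23      # estimate from the response
PETA = 1e15                # operations in one petaFLOP
SECONDS_PER_DAY = 86_400


def petaflop_s_days(total_flops: float) -> float:
    """Express a FLOP count in petaFLOP/s-days (1 PFLOP/s sustained for one day)."""
    return total_flops / (PETA * SECONDS_PER_DAY)


def sustained_petaflops(total_flops: float, days: float) -> float:
    """Sustained throughput in PFLOP/s needed to finish in the given number of days."""
    return total_flops / (days * SECONDS_PER_DAY) / PETA


if __name__ == "__main__":
    print(f"{petaflop_s_days(TOTAL_FLOPS):.0f} petaFLOP/s-days")
    print(f"{sustained_petaflops(TOTAL_FLOPS, 10):.1f} PFLOP/s sustained over 10 days")
```

Under these assumptions the estimate works out to roughly 3,646 petaFLOP/s-days, or about 365 PFLOP/s sustained for 10 days.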

Step 6: Address Potential Adjustments

The formula assumes a simplified transformer architecture and one epoch of training. Factors like multiple epochs, optimizer overhead (e.g., Adam), or specific architectural details (e.g., GPT-3 has 96 layers) could adjust the constant factor slightly. However, \( 6 \times N \times T \) is a standard heuristic in scaling law studies and suffices for an order-of-magnitude estimate without precise architectural data.

Final Answer

Based on this reasoning, the total number of FLOPs required to train GPT-3, with 175 billion parameters and 300 billion training tokens, is approximately 3.15 × 10²³ FLOPs. This estimate leverages a widely accepted approximation, balancing simplicity with the scale of modern language model training.
